The Trouble With HealthCare.gov Starts With The Data

I recently departed the world of software development, and while I have no first-hand knowledge of the HealthCare.gov project, I know a little bit about integrating new software with old to make seamless, Frankenstein-ian new products. One single issue keeps drawing my attention in the debacle: data management. Conceptually, the issue is not difficult to understand. Addressing it, though, could hardly amount to a more monumental task. Let me explain:

As you’ve undoubtedly heard numerous times, the website is a hub, akin to a traffic cop, accessing and managing the flow of data from several different databases. Data from the IRS, Medicare/Medicaid, Social Security, the Veterans Administration, DHS, and other databases all need to be evaluated as part of the site’s functionality. To generate meaningful, coherent output, the site must be able to identify and handle characteristically similar data from each database. In particular, each individual article of information within each agency’s database needs to be tagged for retrieval and evaluation.

The most technically advanced structure for organizing data is called a “schema.” A schema is nothing more than a paradigm that, conceptually, bears a close resemblance to the scientific naming system in biology (Kingdom, Phylum, Class, Order, Family, Genus, Species). In a schema, datapoints are classified according to increasingly specific identification criteria, and these classifications allow very nimble manipulation of data: entire categories of data can be handled en masse in code almost as easily as individual datapoints.

If all the databases accessed by HealthCare.gov employed schemas, the tedious job of mapping the datapoints from the HealthCare.gov database to each and every corresponding datapoint in each of the accessed databases would merely be staggeringly monumental. The sheer number of datapoints would be overwhelming, but there would be a logical means to identify the nature of each datum. I suspect (although I have no first-hand knowledge) that none of the databases accessed by HealthCare.gov utilize a schema structure to manage data. Schemas are a current trend, and the databases HealthCare.gov relies upon are, in some cases, decades old, long predating the rise of schemas. The lack of schemas in those databases will morph a truly onerous job into pure muck-slogging drudgery.

I cannot overstate the difficulty of imposing a schema structure onto an existing database. Each individual datapoint needs to be evaluated and categorized by a live person (or, more likely, a team of people). Many datapoints serve multiple functions, and discerning their primary roles for categorization can be challenging. For example, in a general commercial transaction where a customer purchases a product on the internet, a datapoint identifying a street address may be used for both a billing address and a delivery address. Its (fictional) schema classification may be Profile → Customer Info → Address → ??? The next question in the schema assignment process would be to decide whether the collected datapoint is primarily a billing address, a delivery address, or both. Functionally, many datapoints exhibit characteristics of – and legitimately fit within – multiple categories, but a datapoint may occupy only one place in a schema structure. Many, many of those grey-area datapoints get assigned based solely on the professional experience of the people doing the categorizing.
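To make the grey-area problem concrete, here is a minimal sketch (Python; every path and field name is invented for illustration) of a datapoint that legitimately fits two schema slots but may occupy only one:

```python
# Hypothetical sketch: one datapoint, several legitimate schema homes.
# All paths and field names here are invented for illustration.

raw_datapoint = {"field": "street_address", "value": "123 Main St"}

# Candidate classifications -- the datapoint plausibly fits both.
candidates = [
    ("Profile", "CustomerInfo", "Address", "Billing"),
    ("Profile", "CustomerInfo", "Address", "Delivery"),
]

# A schema permits exactly one slot, so a human categorizer must pick.
chosen = candidates[0]  # judgment call: treat it as primarily a billing address

schema_path = " -> ".join(chosen)
print(schema_path)  # Profile -> CustomerInfo -> Address -> Billing
```

The point of the sketch is that nothing in the data itself forces the choice; a different team, on a different day, could defensibly pick the other slot.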

The immense number of datapoints in a government database would likely require several teams to categorize, and the likelihood of categorical inconsistencies among characteristically similar data within a single database would not be insignificant. Like datapoints may end up being categorized differently within a single database AND among the collective databases. This problem gives rise to a reconciliation period, where questionable categorizations undergo review for consistency. Reconciling questionable categorizations can be even more laborious and tedious than the original assignment, because the process generally involves starting with a problem already resolved once unsatisfactorily. These disagreements can require protracted evaluation to resolve.

To get an idea of the magnitude of the data-structure assignment, think of the HealthCare.gov database as the hub of a wheel, and each of the other databases that it accesses as the spokes on that wheel. Every datapoint in the hub database needs to be mapped to its corresponding counterpart in the database at the end of each spoke. I won’t hazard a guess as to the number of datapoints that need to be mapped, but I have no doubt the scale is spectacularly daunting. I’ve worked projects a fraction of the size of HealthCare.gov, and schema categorization took close to a year. Granted, that was with four teams numbering about 30 souls in total, but the scope of that project did not approach the size of this one, either.

The implications of inconsistent data identifiers would manifest themselves exactly as we’re seeing now: garbled data and non-functioning code. If the code cannot consistently identify, across multiple platforms, the data that it is supposed to manipulate, it has no way to perform any kind of meaningful comparison or calculation. When the code calls “Dependents” from the databases and gets “spouse” returned from one, “spouse and children” returned from another, and “make and model of auto” from another, it may lock up, or it may spew nonsense. In other words: garbage in → garbage out. The ONLY way to begin sorting out HealthCare.gov’s issues is to make sure the site is speaking the same language as each database it accesses.
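A toy illustration of that failure mode (Python; the agency names and the meanings attached to “Dependents” are invented):

```python
# Hypothetical responses to a "Dependents" query from three back-end
# databases that never agreed on what the field means.
responses = {
    "agency_a": "spouse",
    "agency_b": "spouse and children",
    "agency_c": "2012 Honda Civic",   # a different "Dependents" field entirely
}

def count_dependents(raw):
    """Naive consumer that assumes a comma- or 'and'-separated list of people."""
    people = [p.strip() for p in raw.replace(" and ", ",").split(",")]
    return len(people)

for agency, raw in responses.items():
    # Each call "succeeds" -- but the third result is pure garbage,
    # because the code has no way to know the field means something else.
    print(agency, count_dependents(raw))
```

Nothing crashes; the code dutifully reports that a Honda Civic is one dependent. That is what garbage in → garbage out looks like in practice.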

A commenter at DailyKos had this to say a few days ago:

[W]e software developers suck at estimating how long it will take to build a web application (it’s time that we admit that). So, if we suck at it, imagine how poorly our managers who have never written a line of code suck at it when they pull estimates out of their asses to impose on their development teams and report to their bosses.

Here’s a quick tip. If you ever hire a software developer or software team first ask them how long it will take to build an application. Then, when asking this question, know that that developer is a) not sure and b) wants to tell you the shortest time possible for fear that your eyes will pop out when you hear the truth. Finally, take your developer’s estimate and multiply it by four. If your developer estimates that he can finish an application or new feature in three months, then expect it to take a year. Then, and only then, might you get lucky that the project will come in on time and within budget.

To keep perspective, the commenter is referring to coding a stand-alone application from scratch. In the case of HealthCare.gov, not only must a NEW website be coded, but the heart and soul of many other databases must be disassembled and integrated flawlessly to produce a seamless application. Because of the limits of my own knowledge and experience, my next statement should be taken with a grain of salt, but I believe the latter enterprise may in fact be as large an undertaking as the former, or larger.

I am not optimistic about the immediate future of the HealthCare.gov website. Considering the complexity of integrating a number of legacy databases into the functional aspects of the site, the DailyKos commenter’s admonitions regarding reasonable development times seem woefully inadequate. I don’t believe we have any chance of seeing a working website before January 1. In fact, I believe we’d be hard-pressed to see a working website within the next year.

I’ve tried to stay well within the domain of my own expertise in this commentary. If I have erred in any of my assertions, I welcome any corrections or comments from those who know better.



24 Responses to “The Trouble With HealthCare.gov Starts With The Data”

  1. agiledog Says:

    And you are touching on only one aspect of the problem. What are the communication channels (i.e. networks) to these other databases? Do they all have roughly equivalent response time? How about their security, reliability and redundancy? What if one (of the dozen or so) doesn’t respond? Not to mention the consistency of the data between the databases. What if one says your eldest son is Joshua, and another says it is William? What if the names agree, but the SSN for the dependent is a mismatch?

    Another major problem is the “tech surge” that is promised to fix it. Adding people to an already complex project that is in trouble only slows it down even more, as the new folks have to be educated and assimilated in the process.

    • Jazz Says:

      EXACTLY. The White House said the other day that it had identified the problems that need to be corrected and that solutions would be forthcoming. I call serious SHENANIGANS! Understand, naming datapoints according to a schema is an organizational tool. It is NOT coding. The problems I’ve discussed above relate solely to data management, not coding and data manipulation. The need to re-write 5,000,000 lines of code has no bearing on the consistency of data within the databases; the data issue could very well equal the amount of rework required for the code. CODE != DATA, and not one person I’ve seen has broached the specifics of problematic data management.

  2. Teresa in Fort Worth, TX Says:

    No doubt they will be tempted to make everyone in the country go to one “language”, one system, one whatever.

    And they will fail to realize that it is only through extraordinary diversity – as messy as that may be – that the Earth has been able to sustain itself as long as it has. I would imagine the same thing holds true for technology as well.

    But they’ll try……Lord knows, they’ll try……

    • TwoDogs Says:

      No no no no. Diversity in biology = good. Diversity in tech systems most definitely does not = good. Take off your little PC thinking cap. It does not apply here.

      • Teresa in Fort Worth, TX Says:

        I worded that badly – what I meant was that the government would be the one to “decide” which language everyone is required to use, rather than letting the techs who actually work with this stuff on a continuous basis keep improving what is currently used.

        If the government is in charge of determining which language everyone is required to use, we all might as well start stocking up on punch cards again.

        The LAST thing we want is for government tech “experts” developing their version of state-of-the-art software…..

  3. Merovign Says:

    I worked briefly on a large government IT project designed to create a large central database for large numbers of smaller government organizations, both creating a new system and integrating different local systems. It was a nightmare project that should probably never have been started.

    I did have the chance to look at reports of progress and plans and then look at the actual work – nothing lined up, they were basically only working toward status reports and not toward a completed project.

    It was never completed, never went live. Maybe a thousand people and a couple of years in, *bust*. Don’t want to think of how many dollars.

  4. docweasel Says:

    I’m a designer with a working knowledge of development. From what you say, and other columns I’ve read (some of the databases you speak of are decades-old mainframes written in COBOL or some other ancient language), the problem seems completely intractable and will NEVER be solved with a website. The future of Obamacare, if it is to work at all, will be individual counselors, much like tax preparers, who over a period of weeks or months will apply for your information, which will be fulfilled by some other functionary, and in time you will be approved for one or more policies and your subsidies figured.

    Just thinking logically, there is no real way to make this work for 330 million Americans.

  5. Mike ANderson Says:

    You forgot to mention that when a government entity is managing the project, you should multiply the estimate of the worst case scenario by at least a factor of 8.

  6. craig Says:

    Your analysis is completely accurate. The schema cannot be fudged.

    On top of that problem, you have the data integrity problem where transactions are recorded across and among multiple databases. What happens when one of them fails to record the transaction (‘commit’ in database jargon)? In a well-designed system, ALL of them must roll back the half-done transaction in order to stay in sync with each other, and that fact has to be communicated to all parties including the initiator of the transaction.

    This is intuitively obvious in online shopping — can’t have the website charge your card and not confirm the order, or confirm the order and not charge your card — but it’s doubly important when confirming enrollment in important stuff. Can’t very well have the website telling the user ‘Congratulations, here’s your new policy # and monthly cost!’ when (a) database X lost communications and failed the ‘commit’ operation and so never actually created your policy, or (b) database Y failed the ‘commit’ operation to set up your subsidy so you’ll be paying more than you were told.
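    A minimal Python sketch of that all-commit-or-all-roll-back discipline, with invented in-memory stand-ins for the real databases:

```python
# Minimal sketch of the "all commit or all roll back" discipline described
# above. The databases here are invented in-memory stand-ins.

class FakeDB:
    def __init__(self, name, will_fail=False):
        self.name, self.will_fail = name, will_fail
        self.records = []
        self.pending = None

    def prepare(self, record):            # stage the write
        if self.will_fail:
            raise IOError(f"{self.name}: lost communications")
        self.pending = record

    def commit(self):
        self.records.append(self.pending)
        self.pending = None

    def rollback(self):
        self.pending = None

def enroll(databases, record):
    """Two-phase style: only report success if every database committed."""
    prepared = []
    try:
        for db in databases:
            db.prepare(record)
            prepared.append(db)
    except IOError:
        for db in prepared:               # undo the half-done transaction
            db.rollback()
        return "enrollment failed -- nothing was recorded"
    for db in prepared:
        db.commit()
    return "enrollment confirmed"

dbs = [FakeDB("policy"), FakeDB("subsidy", will_fail=True)]
print(enroll(dbs, {"applicant": "Jane Doe"}))
print([db.records for db in dbs])   # both empty: no half-created policy
```

    The user is only told “Congratulations” on the confirmed path; a failed subsidy write leaves no orphaned policy behind.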

    People who think that Top Men can fix this any sooner than 2015 at the *earliest* are fooling themselves.

    • Jazz Says:

      I’ve been reluctant to throw out “years” in my estimation of the time-frame required to “fix” HealthCare.gov, but in the darkest recesses of my soul, I think even “years” may be optimistic. HealthCare.gov is a technical abomination.

  7. steve walsh Says:

    How about searching for data relating to one entity, say a consumer or citizen, across all these independent systems – what will they use for an index? SSN? How does this increase the number of requests and load on the system? It makes error handling, dead-ends, record locks, and security a massive undertaking all by itself.

    What a cluster!

  8. David Mount Says:

    As a former mainframe (Cobol) developer and current mid range and web developer, I agree with everything here. But I want to add that next to the government, medical insurance companies are the most bureaucratic and unwieldy entities. It took the medical billing industry about 2 years to convert from 6 digit UPIN codes to identify doctors to a 10 digit NPI code. That’s 2 years to replace a 6 character field with a 10 character field.

  9. Graywolf Says:

    Totally WRONG approach.
    You build a data request/reply format for each unique interface.
    THEY worry about how their data is organized.
    They satisfy your request, putting the data into your message template.
    You (Obamacare) generate these requests as needed and process the replies through a translator, putting the data (in Obamacare format) either on a disk queue or passing it interprocess (not recommended).
    Also, the WEB frontend MUST be separated from the transaction engine, favoring asynchronous I/O as much as possible.
    Synchronous I/Os (especially disk writes) will quickly bring you to your knees.
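    A rough sketch (Python; the interface identifier and every field name are invented) of one such request/reply template:

```python
import json

# Hypothetical request/reply envelope for one legacy interface. The
# interface identifier and all field names are invented for illustration.
request = {
    "interface": "IRS-INCOME-V1",      # one unique format per interface
    "request_id": "req-0001",
    "need": ["agi", "filing_status"],  # fields in *our* vocabulary
}

# The agency worries about its own internal organization; it answers by
# filling our template, already translated into our format.
reply = {
    "request_id": "req-0001",
    "data": {"agi": 43250, "filing_status": "single"},
}

# The consuming side never touches the agency's schema -- it just
# round-trips the agreed envelope.
wire = json.dumps(reply)
print(json.loads(wire)["data"]["agi"])  # 43250
```

    The design choice is exactly the one described: each spoke owns its internal layout, and only the envelope is negotiated.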

  10. Graywolf Says:

    I forgot:
    The “tech surge.”
    A room full of outsized egos with no knowledge of the application and probably damned little experience in project management.

  11. Zombie John Gotti Says:

    When you layer on top of all this the need for all these databases to be relational, that multiplies the nightmare. Finding common elements among them all to link together can only be a headache of major proportions. Then, giving the job to a contractor with a spotty performance history really kicks it up a notch.

  12. Chairman LMAO Says:

    Graywolf… amen! I was wondering when someone was going to talk about building a data exchange API. Forget about mapping endpoints; just provide a means to pull the data you want out of the existing/legacy system (say, REST endpoints returning JSON) so that in theory, and generally in practice, the new-ish system doesn’t have to know a thing about the old-ish one.

  13. Jay in Ames Says:

    Very nice, Jazz. I agree with your assessment of the complexity of data modeling.

  14. daggilli Says:

    Never mind attempting to shoehorn the data across these heterogeneous databases into schemata; I would not be surprised if some (most?) of them are not even in 1NF, i.e., they lack relational consistency at the most basic level. I’ve seen large institutional datasets that arose in the days before good RDB design was commonplace, and I wouldn’t go into them without a pump-action shotgun and a flashlight. Then there is the issue that in any sufficiently large piece of legislation (and God knows the ACA satisfies that criterion) constraints may have been introduced that cannot, logically, be simultaneously satisfied, i.e., their intersection is the empty set. Courts can sometimes cut Gordian knots of this nature, but computers can’t. This problem may not merely be horrendously hard to solve. It may literally be insoluble.
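    A toy contrast (Python; the values are invented placeholders) between a pre-1NF record with a repeating group and its first-normal-form equivalent:

```python
# Invented example of the kind of pre-1NF record described above:
# a repeating group packed into a single column.
legacy_row = {
    "ssn": "000-00-0000",                  # placeholder value
    "dependents": "JOSHUA;WILLIAM;MARY",   # repeating group in one field
}

# First normal form: one atomic value per field, one row per dependent.
normalized_rows = [
    {"ssn": legacy_row["ssn"], "dependent": name}
    for name in legacy_row["dependents"].split(";")
]
print(len(normalized_rows))  # 3
```

    Until records like the first one are decomposed, there is nothing for a schema – or a join – to get hold of.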

  15. Tushar Says:

    >>How about searching for data relating to one entity, say a consumer or citizen, across all these independent systems, what will they use for an index? SSN?

    This is supposed to work for illegal aliens, excuse me, undocumented Americans as well. They don’t have SSNs.

  16. ElinorRose Says:

    The most frightening aspect of this is that so many of our lawmakers and citizenry are so irredeemably ignorant and absolutely unteachable that they actually believe that a team of whiz kids and some genius old geezer brought out of mothballs can solve this problem with the tap of a few keystrokes. They think that Hollywood science fiction is science fact. And when they are lied to and told that “computer experts” have solved the problems, they will believe the lie.

  17. stevepoling (@stevepoling) Says:

    And just suppose that you’ve managed to create the ObamaCare application you described and you marshal all that data. Is there any expectation of privacy on the part of the “subject”? And aren’t there old laws on the books about not using an SSN as a federal ID number? If J. Edgar had just a 10th of this information available to him, he’d have been named Caesar by all the blackmailed politicians in Washington.

  18. CaptainCaveman Says:

    I’m an old school mainframe programmer and have done a couple of Fed projects (neither ever worked) and this song keeps going through my head as I keep thinking about it.

  19. Keapon Laffin Says:

    After reading this and the comments, and thinking of my own work programming/maintaining/debugging (Fun!) a legacy mainframe system (COBOL, of course), I’m having to re-evaluate my opinion of the type of incompetence involved here.
    Many political pundits are laughing ‘They had 3 1/2 years to get it right!’
    Now I’m thinking: Who in the hell thought that it would take only 3 1/2 years to complete this project?
    I’ve had my experiences playing with ancient undocumented spaghetti code. Playing with ancient undocumented spaghetti code written by government contractors? Yeah, good luck with that.
