The Trouble With the Website Starts With the Data

October 23, 2013

I recently departed the world of software development, and while I have no insight into or specific knowledge of the project, I know a little bit about integrating new software with old to make seamless, Frankenstein-ian new products. One single issue keeps drawing my attention in the debacle: the issue of data management. Conceptually, the issue is not difficult to understand. The reality of addressing the problem, though, could hardly amount to a more monumental task. Let me explain:

As you’ve undoubtedly heard numerous times, the website is a hub, akin to a traffic cop, accessing and managing the flow of data from several different databases. Data from the IRS, Medicare/Medicaid, SS, Veterans Admin, DHS, and other databases all need to be evaluated as part of the site’s functionality. To generate meaningful, coherent output, the site must be able to identify and handle characteristically similar data from each database. In particular, each individual item of information within each agency’s database needs to be tagged for retrieval and evaluation.

The most technically advanced structure for organizing data is called a “schema.” A schema is nothing more than a paradigm that, conceptually, bears close resemblance to the scientific naming system for biology (Kingdom, Phylum, Class, Order, Family, Genus, Species). In a schema, datapoints are classified according to increasingly specific identification criteria, and these classifications allow very nimble manipulation of data; entire categories of data can be handled en masse almost as easily (for the most part) in coding as individual datapoints.
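To make the "handled en masse" point concrete, here is a minimal sketch in Python. It assumes each datapoint is tagged with a path of increasingly specific categories; the category names, the sample values, and the `select` helper are all made up for illustration and do not come from any real system.

```python
# Hypothetical datapoints, each keyed by its classification path
# (most general category first, most specific last).
DATAPOINTS = {
    ("Profile", "Customer Info", "Address", "Street"): "12 Elm St",
    ("Profile", "Customer Info", "Address", "City"): "Springfield",
    ("Profile", "Customer Info", "Name", "Last"): "Doe",
    ("Order", "Item", "SKU"): "A-100",
}


def select(category_path, datapoints=DATAPOINTS):
    """Return every datapoint whose classification starts with category_path."""
    prefix = tuple(category_path)
    return {
        path: value
        for path, value in datapoints.items()
        if path[:len(prefix)] == prefix
    }


# One call grabs the whole "Address" category; a longer path narrows
# the selection down to a single datapoint.
address_fields = select(["Profile", "Customer Info", "Address"])
```

The same function serves both the broad and the narrow case, which is the convenience the classification buys you.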

If all the databases accessed by the site employed schemas, the tedious job of mapping the datapoints from the site's database to each and every corresponding datapoint in each of the accessed databases would merely be staggeringly monumental. The sheer number of datapoints would be overwhelming, but there would be a logical means to identify the nature of each datum. I suspect (although I have no first-hand knowledge) that none of the databases accessed by the site utilize a schema structure to manage data. Schemas are a current trend, and the databases that the site relies upon are decades old in some cases, long predating the rise of schemas. The lack of schemas in those databases will morph a truly onerous job into pure muck-slogging drudgery.

I cannot overstate the difficulty of imposing a schema structure onto an existing database. Each individual datapoint needs to be evaluated and categorized by a live person (or, more likely, a team of people). Many datapoints serve multiple functions, and discerning their primary roles for categorization can be challenging. For example, in a general commercial transaction where a customer purchases a product on the internet, a datapoint identifying a street address may be used for both a billing address and a delivery address. Its (fictional) schema classification may be Profile -> Customer Info -> Address -> ??? The next question in the schema assignment process would decide whether the collected datapoint is primarily a billing address, a delivery address, or both. Functionally, many datapoints exhibit characteristics of – and legitimately fit within – multiple categories, but datapoints may occupy only one place in a schema structure. Many, many of those grey-area datapoints get assigned based solely on the professional experience of the persons categorizing.

The immense number of datapoints in a government database would likely require several teams to categorize, and the likelihood of categorical inconsistencies among characteristically similar data within a single database would not be insignificant. Like datapoints may end up being categorized differently within a single database AND among the collective databases. This problem gives rise to a reconciliation period, where questionable categorizations undergo review and reconciliation for consistency. The reconciliation of questionable categorizations can be even more laborious and tedious than the original assignment, because the process generally involves starting with a problem already resolved once unsatisfactorily. These disagreements can require protracted evaluation to resolve.

To get an idea of the magnitude of the data-structure assignment, think of the site's database as the hub of a wheel, and each of the other databases that it accesses as the spokes on that wheel. Every datapoint in the hub database needs to be mapped to its corresponding counterpart in the database at the end of each spoke. I won’t hazard a guess as to the number of datapoints that need to be mapped, but I have no doubt the scale is spectacularly daunting. I’ve worked projects a fraction of this size, and schema categorization took close to a year. Granted, that was with four teams numbering about 30 souls in total, but the scope of that project did not approach the size of this one, either.
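The hub-and-spoke bookkeeping can be sketched in a few lines of Python. Everything below is hypothetical: the hub field names, the two "agency" spokes, and their column names are placeholders invented for illustration, not the actual databases. The point is only that the number of required mappings is the number of hub datapoints multiplied by the number of spokes, and that every gap has to be found and filled by someone.

```python
# Hypothetical hub datapoints that must exist in every spoke database.
HUB_FIELDS = ["dependents", "household_income", "street_address"]

# Hypothetical spokes: spoke name -> {hub field: that spoke's own field name}.
SPOKE_MAPPINGS = {
    "agency_a": {
        "dependents": "DEP_CT",
        "household_income": "AGI",
        "street_address": "ADDR1",
    },
    "agency_b": {
        "dependents": "NUM_DEPS",
        "street_address": "STREET",
    },
}


def unmapped_fields(hub_fields, spoke_mappings):
    """List the (spoke, hub_field) pairs that still lack a mapping."""
    return [
        (spoke, field)
        for spoke, mapping in sorted(spoke_mappings.items())
        for field in hub_fields
        if field not in mapping
    ]


# Every hub field needs one mapping per spoke.
total_required = len(HUB_FIELDS) * len(SPOKE_MAPPINGS)
missing = unmapped_fields(HUB_FIELDS, SPOKE_MAPPINGS)
```

With three fields and two spokes that is six mappings and one gap. Scale both numbers up to government size and the year my smaller project spent on categorization starts to look optimistic.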

The implications of inconsistent data identifiers would manifest themselves exactly as we’re seeing now: garbled data and non-functioning code. If the code cannot consistently identify across multiple platforms the data that it is supposed to manipulate, it has no way to perform any kind of meaningful comparisons or calculations. When the code calls “Dependents” from the databases and gets “spouse” returned from one, “spouse and children” returned from another, and “make and model of auto” from another, it may lock up, or it may spew nonsense. In other words, garbage in, garbage out. The ONLY way to begin sorting out the site's issues is to make sure the site is speaking the same language as each database it accesses.
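As a rough illustration of the "Dependents" example, the Python sketch below checks what each source actually means by a field before the hub tries to compare values. The source names, the meaning labels, and the `check_sources` helper are assumptions made up for this post; I have no knowledge of how the real code handles this.

```python
# Hypothetical: what each source database actually stores under "Dependents".
SOURCE_MEANING = {
    "source_a": "spouse",
    "source_b": "spouse_and_children",
    "source_c": "vehicle_make_and_model",
}

# Hypothetical: the definition the hub expects every source to share.
EXPECTED_MEANING = "spouse_and_children"


def check_sources(source_meaning, expected):
    """Split sources into those matching the hub's definition and those that do not."""
    matching = sorted(s for s, m in source_meaning.items() if m == expected)
    mismatched = sorted(s for s, m in source_meaning.items() if m != expected)
    return matching, mismatched


matching, mismatched = check_sources(SOURCE_MEANING, EXPECTED_MEANING)
```

Only one of the three sources agrees with the hub here. Until the other two are reconciled, any comparison or calculation built on "Dependents" is working with garbage.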

A commenter at DailyKos had this to say a few days ago:

[W]e software developers suck at estimating how long it will take to build a web application (it’s time that we admit that). So, if we suck at it, imagine how poorly our managers who have never written a line of code suck at it when they pull estimates out of their asses to impose on their development teams and report to their bosses.

Here’s a quick tip. If you ever hire a software developer or software team first ask them how long it will take to build an application. Then, when asking this question, know that that developer is a) not sure and b) wants to tell you the shortest time possible for fear that your eyes will pop out when you hear the truth. Finally, take your developer’s estimate and multiply it by four. If your developer estimates that he can finish an application or new feature in three months, then expect it to take a year. Then, and only then, might you get lucky that the project will come in on time and within budget.

To keep perspective, the commenter is referring to coding a stand-alone application from scratch. In the case of this site, not only must a NEW website be coded, but the heart and soul of many other databases need to be disassembled and integrated flawlessly to produce a seamless application. Because of the limits of my own knowledge and experience, my next statement should be taken with a grain of salt, but I believe the latter enterprise may in fact be as large an undertaking as (or larger than) the former.

I am not optimistic about the immediate future of the website. Considering the complexity of integrating a number of legacy databases into the functional aspects of the site, the DailyKos commenter’s admonitions regarding reasonable development times seem woefully inadequate. I don’t believe we have any chance of seeing a working website before January 1. In fact, I believe we’d be hard-pressed to see a working website within the next year.

I’ve tried to stay well within the domain of my own expertise in this commentary. If I have erred in any of my assertions, I welcome any corrections or comments from those who know better.

