I own enough land that it takes about four hours each week to get all of the grass mowed. As I push or ride the mowers and swipe the trimmer back and forth for hours on end, it gives me time to think about things. Of course, my mind wanders to my weekday worries most of the time. Many people say that they wake from dreaming having solved some terrific, work-related problem. For me the epiphanies often come while shortening millions of blades of grass.
One of the clearest of these lessons came when I was struggling with finding a metaphor for describing how an Extract, Transform & Load (ETL) process should work for one of my clients. One of the vendors (let's call him Butch) with whom the client contracts was arguing that we should send and receive delta files between the systems. These are data files that strictly contain the changes since the last file was sent. Along with the recommendation to use the delta file style of processing, Butch had concocted an elaborate scheme for dealing with the exceptions that naturally occur when exchanging data. For example, what if it was discovered that one of the delta files from the past contained errors? According to Butch's plan, we must delete all of the records that were loaded from that file. Moreover, we would be required to delete all of the records that came after the errant data load, too, including all of the data in the related tables to preserve referential integrity. Then Butch would run a special fix-up file and resend it to us. Finally, we would need to run the fix-ups and then run all of the extracts that came after it in chronological order. How incredibly brittle and labor-intensive a process is that? It turns out that Butch's company had been operating this way for decades. Butch just didn't know any better.
In the ETL world, reconciliation processes to deal with omissions and errors are quite common. Once a quarter perhaps, your data-sharing partners may send a reconciliation file from which you can correct problems that have been introduced along the way. When the data is small, these reconciliation files can be so-called full files, containing a snapshot of all the data in the set. However, when the data sets are too large, full file processing just isn't possible. Thus it was with Butch's data: it was just too big to fix using a full file reconciliation process.
I tried in vain to describe to my client why Butch should add an attribute to all of his tables called [LastUpdated] for tracking his changes. If Butch simply touched this attribute every time a record was changed, he could extract files based on specific date stamps at arbitrary points in the past. Moreover, I demonstrated how most of the labor-intensive, error-prone processes related to the delta file processing model would be eliminated if we could adopt a partial reconciliation file process that included duplicates from the past few days in every file. Butch pushed back hard, convincing my client that it would be wasteful to include duplicate data in every file. My argument wasn't winning and it appeared as though I was being trapped into building a horribly brittle ETL process for my client.
The following Saturday afternoon, as I pushed my way around the lawn in an ever tightening spiral, it occurred to me how I might better explain the idea of partial reconciliation to my client. As the mower moves forward, the pusher knows that getting the best overall performance comes from doing a bit of re-work. The wheels to the outside of the cutting lane are set into previously cut grass as one moves along. Butch would argue that this is wasteful and in some respects, he would be correct. According to Butch, for optimal precision, the wheels should be positioned such that the cutting blade only passes over grass that hasn't yet been cut. We can imagine what Butch's lawn looks like: tufts and spikes of uncut grass everywhere, visual signals of all the places where Butch's precision-dependent model broke down. Or perhaps Butch goes back over the lawn when he's done, fixing up and correcting all of these "deltas" with time-consuming and costly manual interventions.
At the heart of it, all reconciliation processes are date based. In the case of full file reconciliation, the date in question is the beginning of time. In partial reconciliation ETL processes like the one I recommended to my client, the selected date is arbitrary. In the ever-tightening path that one cuts when making trips around the yard with a lawnmower, the line between cut grass and uncut grass represents the point in time where reconciliation occurs. How far into the lane of already cut grass one drifts is subjective. But all seasoned tenders of lawns know that some bit of rework makes the final product better looking while improving the overall experience.
On Monday morning, I explained the allegory of the lawn to my client. Having spent a portion of his weekend doing yard work, the parable immediately rang true. During a conference call with Butch later that day, my client insisted that we use the partial file reconciliation process that I had recommended. He forced Butch to add last updated time stamps to all of his tables and demanded that we pull 72 hours of changes for every extract.
Butch was dumbstruck at first but quickly regained his composure and began complaining about how wasteful it would be to resend some bits up to three times. The client shut him down saying, "It's like mowing the lawn, man! Just do it!" In the months since that project went to production, we've processed hundreds of millions of rows of data from Butch. Yet, we've needed very few manual interventions in the ETL process because problems in the data are typically found and corrected upstream within 72 hours. In the rare cases when that path isn't wide enough, we simply swerve the mower deeper into the pre-cut grass and keep moving forward with the grace and aplomb of master gardeners.