Over the past year I've reviewed what seem like countless plans for enterprise data warehouses. The plans address real problems in the organizations involved: the organization needs better data to recognize trends and react faster to opportunities and challenges; business measures and analyses are unavailable because data in source systems is inconsistent, incomplete, erroneous, or contains current values but no history; and so on.
The plans I've reviewed detail source system data and its integration into a central data hub. However, the ones I'm referring to don't say how the data will be delivered or portray a specific vision of how the data is to drive business value. Instead, their business case rests on what I'll call the "railroad hypothesis": just as no one could have predicted how the railroads would enable development of the West, so the improved data infrastructure will create order-of-magnitude improvements in the ability to access, share, and use data, from which order-of-magnitude business benefits will follow.* All too often these plans just build bridges to nowhere.
Over past decades we've used the data-warehouse-as-infrastructure argument to fund thousands of DW projects, but in my experience the predicted humongous business benefits rarely follow. To me, the reason is this: in order to build the DW, IT goes dark for a year or so introspectively modeling and integrating data. As that year goes by a number of things happen to undermine the infrastructure-based value proposition:
- Warehouse purity and usability fall short due to unforeseen data inconsistencies, overlaps, and missing history among source systems.
- Changes in business conditions and turnover among senior stakeholders alter organizational priorities, leaving the months-old DW plan out of sync.
- Key stakeholders tire of the time commitment and complexity of working with data analysts and data modelers, especially as the value proposition of the DW fades due to the previous two factors.
Of course, from the agile point of view these are classic big-requirements-up-front (BRUF) symptoms. That's why I'm so surprised, in this day and age, to see companies still pursuing the classic DW approach. By now, NoSQL/big data techniques bring an entirely different approach, but in my view the difference between these DW plans and the steady, useful evolution of well-executed Hadoop data lakes originates from values rather than tools. Stated in Agile Manifesto terms, effective data integration projects value delivery of business results over very well integrated data. That is, while there is value in the item on the right, we value the item on the left more.**
A decision to place more value on delivery of business results than on great data integration creates some challenges to the traditional data warehouse development model. The underlying premises of the plans I've recently reviewed are that data integration must follow understanding of all the requirements, and delivering reports and analyses must follow data integration. If we value delivery of business results most highly, then we must integrate data before understanding all of the requirements, and deliver reporting before fully integrating data. These revised priorities imply some key planning and architecture adjustments:
- Before anything else, the team needs to understand key reporting requirements, rank them in terms of business value and difficulty of delivery, and prioritize the high-value, low-difficulty items.
- The team must distinguish between essential, needed, and nice-to-have features in staging, integration framework, data warehouse, and delivery components, and in early phases only build what's necessary to drive delivery of results.
- While enabling early and frequent delivery, the project must evolve the warehouse toward the target architecture. Early production components will be minimal but should be consistent with the target architecture. For example, marking every row in every table with a LoadID that identifies the ETL job that populated it might enable future build-out of load metadata linked by that LoadID.
- If early solutions violate architectural direction -- which sometimes they will -- the team must plan refactoring for later in the project. For example, early reporting might draw data directly from staged data not fully processed to DW standards.
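To make the LoadID idea in the list above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`etl_load`, `stg_sales`), the columns, and the `run_load` helper are all hypothetical, invented for illustration; a real warehouse would use its own schema and ETL tooling. The point is only that stamping rows with a load identifier from day one costs very little, and leaves a hook for richer load metadata later.

```python
import sqlite3

# Hypothetical schema: a minimal load registry plus one staging table.
# Every staged row carries the load_id of the ETL run that wrote it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE etl_load (
    load_id    INTEGER PRIMARY KEY,
    job_name   TEXT NOT NULL,
    started_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE stg_sales (
    order_id INTEGER,
    amount   REAL,
    load_id  INTEGER NOT NULL REFERENCES etl_load(load_id)
);
""")


def run_load(conn, job_name, rows):
    """Register one ETL run, then insert its rows stamped with that run's load_id."""
    cur = conn.execute("INSERT INTO etl_load (job_name) VALUES (?)", (job_name,))
    load_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO stg_sales (order_id, amount, load_id) VALUES (?, ?, ?)",
        [(order_id, amount, load_id) for order_id, amount in rows],
    )
    conn.commit()
    return load_id


first = run_load(conn, "sales_daily", [(1, 10.0), (2, 25.5)])
second = run_load(conn, "sales_daily", [(3, 7.25)])

# Because rows are linked to loads, per-load metadata can be derived later
# without touching the early loads themselves.
rows_per_load = conn.execute("""
    SELECT l.load_id, l.job_name, COUNT(s.order_id)
    FROM etl_load AS l
    LEFT JOIN stg_sales AS s ON s.load_id = l.load_id
    GROUP BY l.load_id
    ORDER BY l.load_id
""").fetchall()
print(rows_per_load)
```

In an early phase, `etl_load` can stay this small. Columns such as row counts, end times, or error status can be added later and joined back through the same `load_id` without reloading anything.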
Data purists will certainly object: How can you report from staging? How can reports be from a single version of truth if all sources haven't loaded? How can you track performance of early loads without full ETL metadata? In fact, it is critical that the team not wait until everything is ticked and tied, because the effect of doing so is to make the project take so long that it becomes a costly irrelevance.
It has been 15 years since the Agile Manifesto and at most 4 years since Big Data took the world by storm. By now it should be clear that agile techniques work on data projects, and even though (to me) Hadoop and NoSQL solutions tend to be easier to deploy and use, there's no reason a relational project can't borrow big data's "git-r-done" mentality.
A data warehouse project should deliver clear, obvious, and tangible business benefit early and often. Nothing else is as important as that.
* Arthur Grimes (2010), "The Economics of Infrastructure Investment: Beyond Simple Cost Benefit Analysis", Motu Economic and Public Policy Research, provides a well-thought-out example of infrastructure cost-benefit thinking: "A conventional cost benefit analysis is inappropriate where an initial project within a sequence of projects creates options for investment in future projects with uncertain returns, and where: (a) information about those returns is forthcoming only after the initial project is completed, or (b) potential returns from the future projects diminish over time."