Oftentimes I've been on projects where teams are aware that data needs to be moved from one place to another. The resources have been allocated, the overall vision is in place and the team is ready to roll. The one thing left: the detailed requirements. As consultants, we are often called onto projects mid-deployment and are expected to jump in and deliver value. Quickly. With ETL requirements, that is no exception.
Eliciting the requirements for a complex ETL (extracting, transforming, and loading of data) project can be difficult. You know that data needs to move from point A, to point B. But where do you start? From my experience, the easiest way to start uncovering the critical conditions for a successful ETL project can be accomplished by breaking apart the ETL acronym and asking some critical questions. The questions below represent some of the key questions I ask, regardless of project, to quickly and efficiently elicit core requirements. Answering these questions on a project should both form a requirements baseline as well as highlight most if not all of the key functionality that needs to be followed for ETL development.
A note: Most questions are basic, and possibly a bit elementary to ask. However, discussing them encourages much-needed (but often overlooked!) conversations around key ETL assumptions and functionality.
E – Extract
When will the source data need to be extracted? What is the data source from which you are extracting? Is it in an existing database; is it a flat file? How is data currently being sent to the data source? Is it being sent by a job scheduler? If so, should your process piggyback onto the existing job? Should it be an entirely new job? Are there reasons why the ETL should not start, e.g. are there any source conditions that should always generate a job failure?
T – Transform
How should the data be manipulated so that it can be inserted into the target? How will the source and target be mapped? Are there legal compliance requirements (for ex: HIPAA for Healthcare and Sarbanes-Oxley for financial industries)? Do the transformations need to be logged? How should transformation errors be handled? If the job fails mid-load, should the partial data be loaded? Should it all be rolled-back? Are dates stored the same way in the source and the target? Which date formats are valid to be converted to the target? How should non-accepted date formats be handled? What Data Quality steps/checks should occur within the transformations?
L – Load
How is the data going to be stored in the target? Are there pre-existing job dependencies? Are there dependencies on the completion of your ETL process? Is there any balancing required between the source and target? Should anyone receive an email stating the status of the ETL process? How are deleted records handled from the source? What fields determine target Inserts versus Updates?
Even the simplest ETL projects can often uncover difficult scenarios. That's why it's important to try to ask the right questions as early as possible. Doing that encourages top notch requirements and a much better foundation for ETL design.