The debate regarding the relative merits of extract transform load (ETL) and extract load transform (ELT) data migration strategies isn't new, but it remains relevant. Making the right choice is critical, because the wrong solution will slow development cycles and execution times, and increase the cost and complexity of code maintenance.
SQL Server offers a leading ETL tool in integration services (SSIS), which is used by many organizations. There are plenty of experienced, trained, and certified developers who can create packages to solve any business's data migration needs. SSIS is a great tool for sourcing and moving data. It also offers a wide range of components and the ability to make an artistic data flow. But should we always use the components within SSIS to transform and load data?
I've found that basic ELT is a faster to implement architecture. Moreover, the architecture is easier to maintain and it performs better. For the past five years, I've designed ELT solutions in several organizations, minimally using SSIS, and had great results.
In this blog, I'll cover some best practices for implementing an ELT solution for a variety of data migration scenarios. Although I reference SQL Server throughout, this strategy could be applied to any database platform.
Implementing an Effective ELT Architecture
The fundamental difference between ELT and ETL involves where to put and execute transformation logic. Traditional ETL sources data and transforms it in flight, which means that the server running SSIS manipulates the data in memory before writing it to the target. SSIS contains all the logic regarding how to source data, what transformations and business rules to apply, and whether the row requires a transaction against the target. All of this typically happens in a single pass, sometimes linking packages together, with multiple packages executing in the work stream.
With ELT, in contrast, data is sourced in its original form and staged on the target server, rarely applying any transformations. In a separate step, the staged data is transformed and loaded to the target. These transformations are done with T-SQL statements.
Figure 1: Implementation Diagram
The ELT stage-first, load-second model works well with a semantic view layer applied to the staged data. I recommend creating views, one per destination table, with an output SELECT that applies all the necessary joins, lookups, and business rules. Next, use a T-SQL MERGE, which takes the view as a source and applies it to the destination table, joining the view to the target on natural key(s). This MERGE can accomplish Type 1 or Type 2 style changes, making it easy to use in destinations of both third normal form or traditional fact/dimension form.
ELT Provides Superior Traceability
One of the major benefits of an ELT strategy is that it provides superior traceability. The semantic views of an ELT solution contain all the code that requires maintenance. These views perform the transformation to the target structure, making the conversions completely transparent. This allows developers and analysts to link the written code to the source to target requirements, providing the organization with clear traceability, and helping reduce future IT costs and headaches.
In SSIS, it isn't as easy to track data lineage because data can be renamed, manipulated in derived column tasks, and rows can be duplicated or lost with lookups and precedence constraints. By using a T-SQL view within the ELT architecture, all the logic is contained in the SELECT.
A best practice is to implement easily readable code and limit the number of times a column can change. In a traditional ETL architecture, a column can transform many times, while An ELT architecture allows you to limit the number of times a column changes in flight. With ELT, the joins are evident and troubleshooting is easy because the query is already built. In addition, the stage is a direct copy (table name, column name, data type) of the source. By not renaming any source metadata in flight, you don't run the risk of misunderstanding the stage. Because the stage is a copy of the source, and the MERGE performs the column-by-column comparison to determine if the row is an insert, a delta, or requires no action, the only place "logic" is applied is the views. At any time, developers and analysts can query the stage or the views to validate data quality and research data as it passes through the environment.
One of the concerns I have about SSIS is that it offers many components and, as such, provides many ways to implement a solution. This means that developers can build packages in any way they like, making repeatable coding standards harder to enforce. Code reviews of SSIS are a lengthy process. In my experience, only the most rigorously structured SSIS development teams follow a template standard and systematically use components in the most optimal way.
An ELT solution doesn't have these limitations. The stage process is easy to replicate and review, and the MERGE components can be built as reusable stock code. This leaves only the semantic views for code review.
ELT Reduces Execution Times
I've never made a formal attempt to quantify the difference between ELT and ELT performance, but I can tell you from experience that ELT delivers far better results; specifically, faster execution times.
One reason is that ELT allows you to limit transaction log usage by truncating the stage before every table load. Implementing a "flush and fill" method is an ELT best practice, offering an immediate performance gain.
SSIS can do a throughput of tens of thousands of rows per second. But by removing all the intermediary steps of an ETL data flow, the ELT stage process uses only a source and destination component. You can execute multiple tables at a time (leveraging SSIS's beneficial parallelism), even with millions of rows in small windows. This also limits the load on the source, as connections aren't kept open, and complex queries aren't executed by an ETL-style package. Drive storage is an inexpensive hardware component; the need for physically duplicating data temporarily is no reason to avoid ELT.
Lastly, the MERGE statement has many performance benefits over the traditional OLEDB command or the slowly changing dimension components in SSIS. The limitations of the OLEDB command and the dimension components in SSIS are frequently discussed in Microsoft Developer Network (MSDN) forums and blogs. However, a fair critique of this solution is that it creates the need for virtual memory to build the view datasets. With limited virtual memory on a virtual machine server, these views can lag. Materializing them is an option, if an environment has this limitation.
The ELT Design Fits with a Variety of Scenarios
The ELT methodology has the advantage of greater extensibility; in other words, it works in a variety of data migration situations. Typical data migration scenarios will load into relational or star-schema structures, and be one-time (such as when sun-setting an application) or ongoing (such as loading a report mart) frequency. I've found that all four scenarios work well with an ELT architecture, making it a repeatable architecture for various use cases.
ELT also makes it easier to follow two very important best practices: code reusability and consistent design patterns. Some of the components are even reusable, such as the process to dynamically create a MERGE statement. The design pattern of the stage process is also easily repeated.
For one-time loads, I recommend staging all rows from all the necessary tables. For ongoing loads, there's more to consider. Small tables can still be pulled entirely, but it's best to filter large tables for rows relating to a transaction modified since the last execution. This process leverages a logging table to track each load, retrieving a date range since the last execution and up to the current date. Next, you'll write a query to pull the transactions modified in that parameterized date range. Finally, apply that query to each source and join to the table in question. Note: pull columns only from the table in question and re-use this "transactions to consider" query for each subsequent table. This methodology works for both target structures.
Figure 2: Ongoing Source Query Example
As with an ETL solution, a best practice is to load destination tables in the correct order, so the foreign key dependencies are met. For both target structures, the views will join to the destination tables loaded previously and perform surrogate key lookups. When loading facts, it's easy to recalculate measures by staging all the data for a changed fact and performing a normal update to the row. If the transactional data is staged, the target rows relating to that transaction can be refreshed. I emphasize that when staging a transaction, it isn't necessary to look for what changed. Stage all the data about a transaction and allow the MERGE process to discover which subsequent tables and attributes will be modified. For one-time loads, this solution is especially appealing because the migration code base is "throw-away." There's no value in developing an extensive SSIS ETL process that needs to execute only a single time in production.
Data migration solutions using ELT are more traceable, performant, and extensible than ETL solutions. ELT is also easier to explain to other IT resources; unlike ETL, it isn't possible for data flow tasks in ELT to be a black box of data manipulations. ELT makes logical sense and is adaptable to a wide range of business cases. Developers with a T-SQL background can pick up ELT quickly and readily maintain the code base. By following the best practices noted above, your IT organization will have a consistent way of migrating data - one that doesn't require extensive knowledge of the ETL tool.