Do you know what happened during the daily load process today? What about yesterday? What about three months ago?
Logging is an important part of the overall ETL system. There are different methodologies for capturing information about past ETL processes; built-in system logging, flat file logs, notification emails, log tables in the database. All of these are valid and serve a purpose. They are useful for capturing specific records or record sets. They can be triggered by an attribute or condition and used later to save a temporary data set, identify data lineage, drive future load parameters or maybe even roll back problematic data loads. Let's run through the various avenues for creating and maintaining logs.
Built-In System Logs
These are great for troubleshooting issues that happened in the most recent run as well as analyzing execution statistics. They have the distinct advantage of working out of the box, and require little or no set up to utilize. These logs have the disadvantage of capturing only run statistics. They can be administered in terms of their verbosity, retention and maybe target storage location, but the logs generally are not configured to capture individual data values and cannot be used by future runs.
Run History Logs
This type of logging is typically implemented in the database with an execution log table. At a minimum, each package that pulls data into the environment should insert a record into the table where start time, end time, and status can all be tracked. The table may also save off the number of source records pulled, target inserts, target updates and target deletes. In conjunction with the log table, record tagging can be implemented. Just put a column in your target table to reference the execution log and suddenly you can identify the process that inserted or last updated each record. This is very useful if you find out that an upstream system was producing bad data at some point in time. You can query the execution log for the processes that ran during that timeframe, and join to the target table based on the log id. Once this is accomplished, remediation becomes potentially much easier.
Logging is also essential for capturing errors. If you have a lookup that must return an id for a given natural key but doesn't return anything or a field that must convert to an integer but has non-numeric data… somebody needs to be notified. In these cases you can set up a log to capture the errors. The log can be implemented as a destination after the transformation, as an "on error" event, or as a rejected row at the target that writes to a file. There are various options, some depending on the specific ETL tool. The log itself can be table or flat file based. If a table is used, it can be customized to allow other processes to utilize the error records. For example, if you have a stage table that stores one day of transactions and loads them to a fact table. If the lookup for a specific customer id fails, you can set up the stage process to not remove transactions for that customer id from stage until the issue is resolved. Regardless of what gets captured in the error log, don't just scrap the records, save the details that will be needed to find the cause and fix it!
Email can also be a useful form of logging. An ETL package can send notifications when it starts, finishes, runs too long, generates errors or pulls record counts outside or a predefined range. However, this form of notification should generally only be implemented where action is required upon receipt of the email. Setting up a distribution list and assigning responsibilities for responding to the content of the email are crucial tasks when implementing this method. If the email does not require action or if roles are not defined, stakeholders will quickly learn to treat the emails as spam. So, to realize the potential of email logging, remember to account for the human element. Try not to fill people's inboxes with more automated messages than necessary. However, when used correctly, email notification is an often required and incredibly helpful tool.
Notice, not all methods of logging are necessarily required, but they can all contribute to the overall success of an ETL solution.