Before embarking on a large scale data migration from a legacy system into a new system, it is important to establish a game plan. Part of the game plan is to profile the data that will be migrated. Unfortunately, too many times I see Data Analyst diving into the source database and begin writing extensive SQL queries that provide various permutations of data profiling reports before determining if the data is even relevant to the new system. I believe that by having a game plan for data profiling and subsequently, establishing a remediation plan, the organization will save a lot of time and money.
I have assumed, for the purposes of this blog, that the technology solution selected to replace a legacy system, has well-documented and approved Business Requirements by the business owners. Further, these requirements have been translated into Data Requirements for the new system. In addition, I have assumed the source to target system has been properly mapped. This would include the retention period and inclusion criteria of the data to migrate. For example, if the oldest record the business wants to bring over is 7 years, then ensure there is a filter to not bring records older than that. This will cut out a lot of records that do not need to be profiled and thus remediated.
There are three major processes for determining the level of effort required to profile the source data from the legacy system. These three are Data Load, Data Loss, and Data Quality. For the purpose of this blog, generally the target system should establish if the data to be migrated is Mandatory (records imperative and required to stand up the new system) or Optional (nice to have but the system will operate without it).
Once the retention period has been communicated by the business, it is important to know, based on the source system data, if the data will load properly or if an error will occur when it does not. Generally, the culprit will be mismatches of the data types and null records from the source to the target system. In my experience, data types are fairly easy to resolve, so I will not delve into this topic.
By viewing the metadata of the target system, when compared to the source system, you can gauge right away what the potential load constraints will be. So one of the first metrics you want to determine is how many null records are in the source system. This can be done by using a simple select statement on the number of null records in a field:
SELECT COUNT(*) FROM TABLE_NAME WHERE column_name IS NULL;
What you want to do next is determine the quantity and percentage of null records against the total record count for that field and post the data. You do this for all the records to determine a matrix that provides all the tables and columns associated with the null record count. This will illustrate the severity of the null records and impact in the migration effort. More importantly, you will see the degree to which the data load will fail during the actual load process, pre-empting and forestalling an inevitable high count of data load failures. In addition, this can be part of the ETL process whereby the business may decide that a null record can default to particular value or be passed as is. Moreover, you can establish the priority for determining which records require the most attention before beginning the remediation effort.
Once you have established the null records causing data load problems have been resolved, then your focus, assuming you resolved the data type issue as well, is to see which records will cause data loss. Data loss can be defined as records that load and, because the ETL process truncates excessively long records, will cause the remaining data within a record, not to load. The record will not be rejected but the load will not be complete. For example, let us assume the target Address field calls for varchar(40) (alphanumeric values for 40 character length) but the source Address field contains 60 characters. In this case, the first 40 characters will be loaded but the additional 20 tailing characters will be truncated and will not be migrated to the new system. This is what would be considered data loss. Again, we have not started introducing an ETL process that would parse and match data records. We are only looking at whether the data will truncate or not, thus causing a loss in the data load process. A simple high level SQL statement to identify potential truncation, assuming the data type varchar(40) for the target system, is below:
SELECT column_name FROM TABLE_NAME WHERE LENGTH(column_name) > 40;
By viewing the output of the SQL statement, you can begin to get a glimpse of the potential data quality problems you will encounter. For example, a record in an Address field, that includes a Name, would load but would be wrong.
Another cause of data loss would be duplicate records. The reason I categorize this as a data loss is because the record represents a redundant capture of data that will be lost once the duplication is identified. The earlier this is identified within the source system and remediated, the less work will be required to load the data and not have to remediate post-conversion.
Once you have established that the data can load 1) because you handle null records, 2) because you solved the truncation problem and 3) because you omit duplicate records (or at least identified duplicate records), you are now ready to determine the quality of the data. This is where the level of granularity expands. By having completed the Data Load and Data Loss assessment, you will have eliminated many unnecessary steps before beginning the arduous task of remediating the data. I will not delve too much into data quality but at a high level it may comprise the following:
Validity – Does the data fall between an acceptable range of values?
Data Definitions – Are all data elements correctly and consistently defined across all areas of the organization?
Existence – Does the organization currently contain the required data?
Consistency – Is data the same across the organization?
It can be overwhelming and daunting to tackle an enterprise level data migration project. By having a game plan that focuses on Data Load, Data Loss and Data Quality in that sequence, I hope you are able to greatly reduce the level of effort before taking corrective measures to remediate the source data. Moreover, by focusing on the mandatory and optional records to be migrated, the transformation rules for the ETL can be less complicated because you will have eliminated many exception records that will not be migrated and thus not have to be remediated. Finally, the data remediation plan can now be executed at a reduced state because your effort will focus on the key data priorities.
Part II of this blog will focus on how to set up the Data Remediation priorities.