Margaret, who was an average sales person, moved from Washington, DC to Richmond, VA, whose market is one fifth the size, during the month of June. When the annual evaluations of sales performance were done in the month of December, she was listed as the top performer in the Richmond market resulting in the company promoting her to Sales Director. The next two highest ranked Richmond salespeople had been the consistent leaders for the last several years and outperformed Margaret since she arrived in Richmond. Her very high sales numbers during the first six months of the year skewed her average, placing her above the rest of the Richmond area. In this example, if the decision makers had correct information handy, and used it appropriately, would they have promoted Margaret over her new Richmond peers?
Here is another example. The forty-third World Health Assembly in 1990 endorsed a new standardization of Diseases, popularly known as ICD-10. Current plans are that the United States will formally use this new standard starting from October 1, 2013. This change would require most health care stakeholder organizations to convert their existing disease classifications into the corresponding new codes. The diseases and medical procedures master data may need substantial reorganization, at a minimum a change in the natural key values in database tables. This will impact a significant amount of the historic and summarized data used in decision support systems. Let's discuss an example of a current disease code that may split into two different ICD-10 disease codes when the new standard is implemented. There may be a lot of historic data against the old code that now needs to be associated correctly with the new codes. Also, there may be lot of summarized data that may now needs to be related with the new codes. If the organizations do not have an effective strategy to address issues such as these, they may be left with many data quality problems.
Both the above examples demonstrate the data quality issues resulting from Slowly Changing Dimensions (SCD.) Slowly Changing Dimensions are the dimensions whose data changes at irregular intervals of time. Typical examples of SCD are similar to the disease codes or the employee's location described above.
The following are some best practices to address the issues around slowly changing dimensions in data warehousing:
1. Look at them with a focus – many big issues resulting from improper handling of slowly changing dimensions are the result of ignorance or lack of focus on them. If every dimension in the Enterprise Data Warehouse model is specifically looked at from the perspective of potential change of its data and the corresponding business impact, many of the existing issues or future risks can be unearthed.
2. Keep the Latest Data Across the Board – Slowly changing dimensions like the ones in the case of the ICD-10 standard, once implemented are not likely to change often in the future. The best way to address this type of a change is, therefore, to make the change across the board by updating all the old codes with new code where possible. Historic and summarized data that can not be easily fit into the new standard will need a onetime remediation or some creative way to map to the new dimensional data. As a general rule there is perhaps no need to maintain the history (other than for an archival purpose) if the change is typically a onetime thing. Nevertheless, it is very important to get this update done "across the board," so once done, historical data references the new terms.
3. Keep the History and the Latest Together – Slowly changing dimensions like the ones in the case of employee location example will perhaps need history with correctly tagged validity period. Typically, this can be accomplished by either a) adding additional columns for old and new values (if the history to just one level would be enough) or b) adding additional columns for validity start and end dates. The current record is either flagged separately or will have an end date as blank. In this case, even though the data change is not likely to be a onetime thing, it is not expected to happen often.
4. Keep the History Separate – There are cases where the dimension data changes are expected to happen often. For example, a customer or a supplier may have certain type of contract that gives some specific benefits during the contract's validity period. If most of the reporting needs are for the latest current valid contracts, then it may be better to store the past history of the contracts in a separate contracts history table. This is a case where history of the dimension is not used that often and the dimension is likely to change more than just once.
Slowly Changing Dimensions in the world of data management can pose significant data quality challenges if a suitable strategy is not adopted to mitigate the impact they may have on the business. It is very important to examine all the dimensions often with an independent perspective to be able to uncover and address the business risks and issues the changing dimensions bring.