You want to call a long-time friend with whom you have not spoken in a while. You whip out your smartphone, unlock the screen, open the phone app and search for her name. And what you see are two phone numbers, both labeled “Mobile.” You faintly remember that her phone number changed a few months ago, but apparently you forgot to delete the old number. And now you don’t know which one is correct. Now what? When you look into it, you discover that one of the phone numbers was stored in the phone’s internal memory while the other number was synced to your phone from your contact list on your favorite cloud provider. In fact, you are looking at two different data sources that contain conflicting information. Welcome to the world of “Master Data Management (MDM).”
Many individuals and organizations – small and large – must address issues similar to the one outlined above: outdated data coexisting with current data, and multiple data sources yielding conflicting information. Fortunately, MDM provides a systematic approach to address those challenges. According to Margaret Rouse, MDM can be defined as “A method used to define and manage the critical data of an organization to provide, with data integration, a single point of reference.”
What is MDM?
Let’s dissect the statement above: The first aspect is the definition of critical data. While this seems pretty obvious, one has to make a dedicated effort to take stock of all of the data elements and decide which are critical to the organization and should be governed under Master Data Management (that’s another key term we see in the definition above). For instance, for an e-commerce business, critical data would include customers’ names, addresses, and contact information, as well as detailed order history. However, feedback left by those customers might or might not be critical to running a successful business. Hence, the business owners must decide to what degree they need to manage feedback data.
Once the critical data has been defined, one must review the source(s) of the data and how to integrate them so that the outcome is a true and complete dataset that serves as a single point of reference. But what is meant by single point of reference? Often, multiple systems deliver data that must be combined to paint a complete picture. For example, a government agency handling Workers’ Compensation might receive medical license information for all authorized health care providers from the department issuing health care provider licenses, while employer data would come from the Department of Labor. Those two data streams must be matched with Workers’ Compensation claims data captured by the government agency’s in-house IT system.
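As a sketch, the matching step described above amounts to joining per-source records on a shared key, such as a provider’s license number. The sources, field names, and key below are invented purely for illustration:

```python
# Hypothetical data streams, each keyed on a shared license number.
licenses = {"L-100": {"provider": "Dr. Ada Lovelace", "license_status": "active"}}
employers = {"L-100": {"employer": "Acme Corp"}}
claims = {"L-100": {"open_claims": 2}}

def assemble(license_id):
    """Merge the per-source records for one provider into a single view."""
    record = {"license_id": license_id}
    for source in (licenses, employers, claims):
        record.update(source.get(license_id, {}))
    return record

print(assemble("L-100"))
```

A real MDM integration layer would perform this join with dedicated tooling and matching rules, but the principle, one key, several sources, one combined record, is the same.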
Approaches to MDM
In the Registry approach, the MDM maintains
indices that specify which data element is stored on which source
system. Thus, it is possible to assemble a complete set of data (often
called the Golden Record) by pulling the appropriate data from
each source. Since the various systems maintain full control of their
own data, conflicts are largely avoided. When data fields containing the
same data are stored on two (or more) systems, MDM would obtain only
the data from the system that has been designated as the
“source-of-truth” for this particular data field. Or there could be
business rules in place that describe how to select the “correct” piece
of information, or how to merge that data. For instance, a customer name
could be stored both on the system handling payments and the system
recording sales. Let’s say on the payment system the name appears as
“John Smith” and in the sales database as “John E. Smith.” A business
rule could compare those two names and yield the longer of those two
strings, in this case “John E. Smith.”
It is important to note that any changes to the master
data are done on the source systems only. The MDM registry only points
to the correct piece of information, providing a read-only view of data
without modifying master data – a useful way to remove duplications and
gain consistent access to the master data.
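The registry idea can be sketched in a few lines; the system names, attributes, and source-of-truth assignments below are invented for illustration. The registry holds only an index of which system owns each attribute, and the golden record is assembled read-only on request:

```python
# Source systems keep full control of their own data.
payments = {"cust-1": {"name": "John Smith", "card_on_file": True}}
sales = {"cust-1": {"name": "John E. Smith", "last_order": "2024-05-01"}}

SYSTEMS = {"payments": payments, "sales": sales}

# Registry index: attribute -> designated source-of-truth system.
REGISTRY = {"name": "sales", "card_on_file": "payments", "last_order": "sales"}

def golden_record(customer_id):
    """Assemble a read-only view by pulling each attribute from its
    designated source system; nothing is ever written back."""
    return {attr: SYSTEMS[system][customer_id][attr]
            for attr, system in REGISTRY.items()}

print(golden_record("cust-1"))
```

Note that designating `sales` as the source-of-truth for the name resolves the "John Smith" vs. "John E. Smith" conflict by policy rather than by comparing the values at read time.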
An advantage of this approach is that it is less
intrusive than other methods; the main work goes into creating and
maintaining the MDM registry, while the source systems are left largely
untouched (provided they can expose their data to the MDM). However, this
architecture might not be acceptable for systems that require (or come close
to) 100% availability and fast response times, because it depends on the
reliability and response times of each of the source systems. Hence,
the overall availability and response time are determined by the least
reliable and slowest systems.
Virtually the exact opposite of the Registry is the Consolidation approach: Here, the master data is consolidated from multiple sources. The MDM combines them in a meaningful way and stores the resulting dataset in a central data hub or data warehouse. In effect, the MDM system migrates the source data to a target system while applying any required business rules. Obviously, this approach requires more changes to the source systems, but the advantage is that the data is always readily available once it has arrived in the data warehouse. Any updates to the master data are made on the central data hub. Consolidation is a good approach for reporting or analytics workloads that reside in a data warehouse.
Sometimes it is not possible to migrate the data into one place, for instance when the datasets on the various systems are too large to be migrated and handled effectively in one system. In the Coexistence model, the golden record is constructed in the same way as in the Consolidation approach, but the master data is stored both in the central MDM system and in its source systems; any changes to the data are synchronized between the MDM system and the source systems. As such, it is important that all attributes of the master data model are consistent and cleansed (more about that later) before they are uploaded into the MDM system. The main benefit of this approach is that data is mastered in the source systems and then synchronized with the hub, so the data can coexist in multiple source systems and still offer a single source of truth. This can speed up access and make reporting and analytics easier, since all the master data attributes can be accessed in a single place. However, the task of managing data on multiple systems remains.
The Transaction/Centralized approach loads the source data into the MDM’s central data hub. The data is cleansed and matched to create a master data set. The enhanced data can then be published back to its respective source system; source systems can subscribe to updates published by the central MDM system to achieve complete consistency. This approach provides a high level of control of all the data; however, often it requires significant changes to the source systems.
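A minimal, hypothetical sketch of the centralized pattern: sources load records into a central hub, the hub applies a (deliberately trivial) cleansing step, and subscribed systems receive the published update. Real implementations would use messaging middleware and far richer matching rules:

```python
class CentralHub:
    """Toy central MDM hub: cleanses incoming records and publishes
    updates to subscribed source systems."""

    def __init__(self):
        self.master = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def load(self, record_id, data):
        # "Cleansing" here is just trimming whitespace, a stand-in
        # for real standardization and matching rules.
        cleaned = {k: v.strip() if isinstance(v, str) else v
                   for k, v in data.items()}
        self.master[record_id] = cleaned
        for notify in self.subscribers:  # publish the update back
            notify(record_id, cleaned)

received = []
hub = CentralHub()
hub.subscribe(lambda rid, data: received.append((rid, data)))
hub.load("cust-1", {"name": "  John E. Smith "})
```

Because every subscriber sees the same cleansed record, the source systems stay consistent with the hub, which is exactly the goal of this approach.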
We described how to combine data from several sources and manage it, but what about the data itself? In the example portrayed at the beginning of this article, one of the phone numbers was apparently wrong; combining the two numbers would not have solved the problem. Generally speaking, if the business processes are fed with outdated, incomplete, or simply incorrect data, how can one expect up-to-date, complete, and correct output?
Here Data Cleansing processes come into play. These processes try to identify incorrect, corrupt, or incomplete data and then attempt to correct it wherever possible. The identification could be based on sets of rules. For instance, a Social Security Number should always be nine digits long and conform to the prescribed format (a three-digit area number, a two-digit group number, and a four-digit serial number). Thus, any entry in the SSN data field that does not conform to those rules could be flagged and routed to a remediation process.
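A rule of this kind could be sketched as follows; this is only a partial check covering the basic format and a few number ranges the Social Security Administration never issues, not a complete validator:

```python
import re

# AAA-GG-SSSS: three-digit area, two-digit group, four-digit serial.
SSN_RE = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")

def flag_invalid_ssn(value):
    """Return None if the value passes the format rules, otherwise a
    short reason suitable for routing to a remediation queue."""
    m = SSN_RE.match(value)
    if not m:
        return "does not match AAA-GG-SSSS format"
    area = int(m.group(1))
    # Area numbers 000, 666, and 900-999 are never issued.
    if area == 0 or area == 666 or area >= 900:
        return "area number is never issued"
    return None
```

Entries that come back with a reason string would be flagged and handed to the remediation process rather than loaded into the master data.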
Sometimes it is difficult to determine if the data is correct and complete; if this is the case, data can be flagged and fed into a manual review process where human experts review the data and perform corrective actions.
Of course, one way of dealing with incomplete or outdated data is to simply ignore it and exclude it from any further processing. Sometimes that might be the best solution available. For instance, why should a computer hardware store care about 10-year-old customer data? Their business and customer base are changing so quickly that such old data could be irrelevant to today’s business decisions. When old data is likely not needed anymore to support the business processes one could simply archive that data “as-is” and move on.
Ensuring that the data is of high quality, and defining the steps needed to maintain that quality, falls under the responsibilities of a Data Steward. This role requires a strong understanding of both the technical aspects and the business side of the organization, acting as the liaison between the IT department and the business.
The implementation of MDM starts with People.
Based on their knowledge of business requirements, these experts will
specify which areas of data (often also called “data entities”) should
be put under MDM control. They also need to define the ownership of the
data; that is, who is responsible for which data. Further, they set up
the rules that determine the flow of information within the organization
– and outside of the organization if applicable. An important task is
to define rules and algorithms that handle situations when multiple
sources could be used to obtain the same piece of data. Above we
outlined the situation that a customer is known both as “John Smith” and
“John E. Smith,” and that the business rule specifies that the name
with the most characters should be taken. During the implementation,
these are the kinds of business rules that must be established and tested.
Once those decisions have been made, we need to define Processes.
For instance, the exact processes to route the data automatically from
their sources to the destination must be worked out. But what happens if
the automatic workflow fails, e.g., because data records violate
business rules? Coming back to our previous example, what if there is a
third source listing the same customer as “John ED Smith”? Here the
“take the longest name”-rule would yield an ambiguous result. In cases
like this, manual workflows might have to be created that allow for
review and decision. Also, it must be determined who or what can author
(create, modify, delete) data. Finally, validation processes that assess
the completeness and correctness of the data have to be established,
e.g., address information could be validated against USPS address data.
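The “take the longest name” rule, together with the ambiguous tie described above, could be sketched like this (function and label names are invented):

```python
def resolve_name(candidates):
    """Apply the 'take the longest name' rule. Return (name, None) when
    the rule is decisive, or (None, 'manual_review') when two or more
    candidates tie on length and a human must decide."""
    longest = max(len(c) for c in candidates)
    winners = [c for c in candidates if len(c) == longest]
    if len(winners) == 1:
        return winners[0], None
    return None, "manual_review"

print(resolve_name(["John Smith", "John E. Smith"]))     # rule is decisive
print(resolve_name(["John E. Smith", "John ED Smith"]))  # tie, route to review
```

Note that “John E. Smith” and “John ED Smith” are the same length, so the rule alone cannot pick a winner; the record falls through to the manual workflow.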
Now that we know what data we want to put under MDM and how to process the data, it is time to look at the Technology
that enables us to implement all of this. The IT department will need
to furnish the appropriate infrastructure (on-premise or in the cloud),
such as servers, databases, and network connections, while the software
developers will implement algorithms and – if required – program
interfaces that provide for sharing the data.
Maintaining High Data Quality
Depending on the MDM strategy, there will be an initial migration of data from the source to the target system(s), during which the data is run through all of the steps described above. Then, as new data arrives, that information must be combined with the already existing (“historic”) data.
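A hypothetical sketch of that incremental step: newly arrived records are merged into the historic master data, with the most recently updated value winning per key, the same idea as replacing an old phone number with a new one. All names and dates are invented:

```python
from datetime import date

historic = {"friend-1": {"mobile": "555-0100", "updated": date(2023, 1, 5)}}
incoming = {"friend-1": {"mobile": "555-0199", "updated": date(2024, 6, 2)}}

def merge(master, updates):
    """Upsert incoming records into the master data; the record with the
    newer 'updated' timestamp wins, and the stale value is dropped."""
    for key, record in updates.items():
        current = master.get(key)
        if current is None or record["updated"] > current["updated"]:
            master[key] = record
    return master

merge(historic, incoming)
print(historic["friend-1"]["mobile"])
```

Real pipelines would also log what was replaced and route suspicious conflicts to review, but the newest-wins upsert is the core of the incremental merge.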
Continuously detecting data issues and quickly remediating them is vital, because data quality will otherwise deteriorate, just as the telephone number scenario illustrates: If you had deleted the outdated number as soon as you received your friend’s new contact information, you would have avoided a lot of trouble.
The Costs of (not) doing MDM
Master Data Management is not easy (or cheap) to implement. However, Harvard Business Review estimated that in the US alone, bad data carries a cost of over $3 trillion per year. Given this very large figure, every organization should seriously consider taking steps toward MDM; correct and complete data that is easy to access is a valuable asset. Through accurate reporting based on good data, an important understanding of what went right and wrong in the past can be gained. Those insights lead to data-driven decisions that can shape the future. As in life, the best advice is to start small: You could begin with one data entity where a clean-up and MDM effort would yield immediate improvements. Over time, you can scale this process, bringing your full (pertinent) data under MDM control.
Stefan is a leader in CapTech’s Data & Analytics Practice Area. He has deep experience in business verticals ranging from FinTech, HealthCareIT, and Digital Marketing to the Energy sector. Stefan holds a PhD in Geophysics from Ruhr University Bochum, Germany.