Data governance has long been a critical issue for businesses, particularly those in regulated industries such as healthcare and financial services, but it has become more critical - and more challenging - with the arrival of big data.
Data governance deals with such questions as the origins, or lineage, of data; who can access data and what they can do with it; and how data is categorized or catalogued. In the traditional data warehouse, establishing a solid data governance strategy requires the right people, a solid process and a good data management tool. The introduction of a big data platform such as Hadoop punches a massive hole in existing data governance strategies, creating serious issues across the business.
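The three questions above - lineage, access, and cataloguing - can be pictured as fields on a single metadata record. The sketch below is purely illustrative; the field names are assumptions for this example and do not come from any particular governance tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """Illustrative metadata record covering the core governance questions."""
    dataset: str
    source_systems: list      # lineage: the systems the data originated in
    derived_from: list        # lineage: upstream datasets it was built from
    owner: str                # accountability: who stewards the data
    allowed_roles: list       # access: who may read it and what they may do
    category: str             # cataloguing: how the data is classified
    sensitivity: str          # e.g. "public", "internal", "PII"
    last_updated: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# Hypothetical entry for a customer dataset
entry = CatalogEntry(
    dataset="customer_transactions",
    source_systems=["core_banking"],
    derived_from=["raw_transactions"],
    owner="data-governance-team",
    allowed_roles=["analyst", "auditor"],
    category="finance",
    sensitivity="PII",
)
print(entry.dataset, entry.sensitivity)
```

In a traditional warehouse, records like this live in one governance tool; the difficulty described below is that in Hadoop the equivalent information is scattered or missing.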
Hadoop is an open-source software framework that allows users to amass a wide range of data, including semi-structured and unstructured data, from a wide range of sources. This provides the business with an unprecedented amount of information about customers and their behavior, which can be leveraged to drive improvements in customer experience. But Hadoop can also make it difficult to understand exactly what data is being stored, where it came from, and who is doing what with it.
This is particularly troublesome because some of the information is likely to be sensitive - for example, customer names, addresses, account numbers and Social Security numbers - and easily available to people who aren't authorized to access it. In financial services and healthcare, customers expect such information to be heavily protected. Regulators share that expectation.
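One common first step is simply scanning raw records for fields that look sensitive. The sketch below shows the idea with two illustrative regular expressions; real scanners, and real account-number formats, are considerably more nuanced than this:

```python
import re

# Illustrative patterns only - real sensitive-data detection uses far
# richer rules (checksums, context, format variants).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b\d{10,12}\b"),
}

def find_sensitive(text):
    """Return the names of sensitive-data patterns that match a raw record."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

record = "Jane Doe, 123 Main St, SSN 123-45-6789"
print(find_sensitive(record))  # ['ssn']
```

Flagging records this way lets a team at least tag sensitive datasets in the catalog before anyone is granted broad access to them.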
Traditional data governance tools from vendors such as Informatica, Ab Initio and IBM address governance issues only after data has been structured and made available to users through a traditional database and governance tool. In a big data environment, businesses need far more than this, because data access controls, audit logs, business metadata, data quality rules and data lineage artifacts reside in multiple locations - and in some cases don't exist at all.
Part of the attraction of Hadoop is that it enables users to engage in discovery before raw data has been transformed, cleansed, prepped and modeled. That allows the business to quickly gain insights into customer behavior and gain a distinct competitive advantage as a result. But Hadoop's inherent lack of structure impedes data governance.
Nonetheless, it is possible to strike a balance between discovery and governance in a big-data environment. To capture the rich data available within the Hadoop ecosystem without creating data governance issues, it's necessary to use a tool that is native to the ecosystem and that is built specifically to solve these problems. The truly native Hadoop governance options are limited to Apache Atlas and Cloudera Navigator.
Atlas, which is being built as part of the Hortonworks Data Governance Initiative, isn't fully ready for commercial use. Cloudera Navigator is a more complete solution, but it is a proprietary part of the Cloudera enterprise data hub. That means that if you aren't running a Cloudera cluster, you won't be able to leverage the tool.
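To give a flavor of what native governance tooling looks like, here is a minimal sketch of registering a dataset with Apache Atlas through its v2 REST API. The endpoint path and the `hdfs_path` entity type are part of Atlas, but the host, port, and attribute values below are assumptions for illustration, and a real deployment would also require authentication:

```python
import json
import urllib.request

# Assumed local Atlas server; adjust host/port/credentials for a real cluster.
ATLAS_URL = "http://localhost:21000/api/atlas/v2/entity"

# Entity payload describing a raw HDFS dataset (illustrative values).
payload = {
    "entity": {
        "typeName": "hdfs_path",
        "attributes": {
            "qualifiedName": "/data/raw/customers@dev",
            "name": "customers",
            "path": "/data/raw/customers",
        },
    }
}

def register_entity(url=ATLAS_URL, body=payload):
    """POST the entity to Atlas so the dataset appears in the catalog."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    # Print the payload rather than calling a live server in this sketch.
    print(json.dumps(payload, indent=2))
```

Once datasets are registered this way, Atlas can attach classifications (such as a PII tag) and lineage to them, which is exactly the metadata that scattered Hadoop deployments otherwise lack.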
At CapTech, we've heard from a large number of businesses that are struggling to strike a workable balance between data discovery and data governance within Hadoop. We have extensive experience with data-governance implementations, particularly in financial services and healthcare, as well as deep experience with the Hadoop ecosystem, where we have built custom solutions to alleviate our clients' data-governance problems. We have also worked with businesses to help them understand how to manage data and take advantage of the new tools available in a big data environment.
We can help you manage the big issues that make big data both promising and challenging.