News | September 24, 2015
Big Data: How to Start Right and Avoid the Pitfalls
Big Data is the future of business. According to CloudTweaks.com, as much as 2.5 quintillion bytes of data are produced each day, and most of this data is captured by Big Data systems. By bringing all data sources together in one centralized place, Big Data offers new opportunities and a clearer view of customer conversations and transactions. However, the dazzling promise of Big Data comes with a potentially huge letdown. If this vast pool of information is not accessible or usable, it becomes useless. This paper examines strategies for building the most value into your Big Data system by enabling process controls to effectively mine, access and secure Big Data.
There is a lot of pressure on companies today to take advantage of Big Data. For many companies, Big Data has become a competitive differentiator as they make substantial investments in solutions and initiatives to collect and analyze the ever-growing volume of data available in the digital world. However, this pressure has led many companies to make decisions and buy tools that have restricted their ability to mine the data they are collecting. In this short paper, we will explain what some of these critical pitfalls are and how to avoid them.
The drive to collect, analyze and profit from large data sets often leads companies to buy new tools without fully understanding how they will impact their business processes. For example, a large telecommunications company began ingesting its operational and transactional data into several Hadoop Data Lakes. It did so for good operational reasons, including scalability, speed and the simplicity of collecting numerous disparate formats. Unfortunately, many of the traditional Data Management and Data Governance practices implemented around relational database management systems (RDBMS) were largely ignored. Things like data stewardship, quality control, normalization with an enterprise data model (EDM) and consistent metadata were overlooked. The consequences were significant:
- Files were duplicated across multiple Hadoop clusters, making it difficult to define a single source of truth for the business community and wasting IT resources.
- Multiple distributions of Hadoop prevented file sharing across the organization.
- Inconsistent field definitions and limited format controls introduced instability in data processing jobs. Extract, Transform, and Load (ETL) jobs frequently failed because the data types or content of specific fields had changed.
- It was nearly impossible to trace a field back to its source, resulting in little to no data provenance or lineage, which is critical in regulated industries such as finance.
- There was limited security and minimal access control auditing.
Without traditional Data Governance and Data Management practices, the Hadoop Data Lake instead became what is colloquially known as the "Data Swamp." The value of the data slowly (or quickly!) degraded because no data controls were enforced prior to ingestion into the Hadoop cluster. As a result, numerous data processing jobs failed, and the business community was limited in its ability to use the information to make knowledge-based decisions or run predictive analytics.
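To make the idea of enforcing data controls prior to ingestion more concrete, the following is a minimal sketch in Python of a schema check that could run before records reach the cluster. It is an illustration only, not the tooling used by the company in the example. The schema `CALL_RECORD_SCHEMA`, its field names, and the functions `validate_record` and `split_batch` are all hypothetical.

```python
from datetime import datetime

# Hypothetical schema: field name -> parser that raises ValueError on bad input.
CALL_RECORD_SCHEMA = {
    "call_id": int,
    "duration_seconds": float,
    "started_at": datetime.fromisoformat,
}


def validate_record(record, schema):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field, parse in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        try:
            parse(record[field])
        except (TypeError, ValueError):
            problems.append(f"bad value for {field}: {record[field]!r}")
    # Fields that are not in the schema signal an unannounced format change.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def split_batch(records, schema):
    """Separate records that pass the schema from those to quarantine."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(record, schema)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejected records are returned with the reasons they failed, so a changed data type or a renamed field is caught at the door instead of surfacing later as a failed ETL job.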
Even though the underlying technologies in the data platform have changed, traditional Data Governance and Data Management best practices still apply. Data management technologies and procedures that were originally developed around RDBMSes have evolved to support many new technologies, including Hadoop and NoSQL. CapTech recommends that organizations:
- Implement Data Management best practices prior to introducing new data management technologies, such as Hadoop and NoSQL, into the enterprise
- Expand the scope of Information Architect role(s) to include ALL data resources
- Define schemas for data to be housed in new data management ecosystems
- Capture metadata about files before they are ingested
- Identify Data Stewards for the new ecosystem and data stores
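As one possible illustration of capturing metadata before ingestion, the sketch below records a checksum, a source system and a steward for each file, then uses the checksums to flag duplicate files. It assumes files are available on a local file system before they are loaded. The function names `capture_metadata` and `find_duplicates` and the metadata field names are hypothetical, not part of any Hadoop distribution or CapTech tool.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def capture_metadata(path, source_system, steward):
    """Build a metadata record for a file before it is ingested."""
    file_path = Path(path)
    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "file_name": file_path.name,
        "size_bytes": file_path.stat().st_size,
        "sha256": digest.hexdigest(),
        "source_system": source_system,  # where the data came from (lineage)
        "steward": steward,              # who is accountable for the data
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }


def find_duplicates(catalog):
    """Group catalog entries by checksum; return groups with more than one file."""
    by_checksum = {}
    for entry in catalog:
        by_checksum.setdefault(entry["sha256"], []).append(entry["file_name"])
    return {checksum: names for checksum, names in by_checksum.items() if len(names) > 1}
```

Storing the source system and steward alongside each file gives a starting point for lineage, and comparing checksums across the catalog is one simple way to spot the same file landing in more than one place.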
What CapTech Can Provide:
CapTech is implementing a data ingestion tool for a Fortune 100 bank: a custom-built ingestion engine that will enforce data standards, metadata tags, security and data promotion through several layers of maturity. CapTech can provide clients with a team to ensure that Data Management processes are followed during the implementation and adoption of Hadoop. We recommend that the team consist of a Data/Information Architect and a Data Analyst who will:
- Define a Data Governance plan for the Hadoop ecosystem
- Work with clients to implement the data governance plan
- Implement a data management tool for the Hadoop ecosystem, including market review, recommendation, and implementation support
Ben Harden | 703.371.466
Mr. Harden has over 16 years of experience delivering enterprise data warehousing solutions for Fortune 500 clients. He is well versed in project management, requirements gathering, functional design, technical design, reporting development, technical training, testing, and system implementation. For the past 6 years, Mr. Harden has specialized in delivering enterprise data warehousing solutions using the Agile Scrum methodology. Most recently, Mr. Harden has focused on delivering Big Data solutions based on the Hadoop platform. Mr. Harden is a Certified Scrum Master, Product Owner, Scaled Agilist and Project Management Professional. He is a graduate of Virginia Tech's Pamplin School of Business.