A top-10 U.S. bank had built a data lake filled with rich customer data but lacked the platforms that would allow data scientists and business users to catalog, collaborate on, and mine metadata. CapTech built a custom metadata repository and a set of services to catalog data landing in the lake.
The architecture of a data lake allows a company to capture and stream massive amounts of data, along with real-world visualizations, in one place, giving the business access to critical assets and analytics in real time. Without proper organization, access, and security, however, the information in a data lake is of little use.
CapTech worked with a Fortune 500 financial services firm that had built a data lake filled with rich customer data (website traffic, audio calls, and credit card and online transactions) but did not have the platforms in place to give data scientists and business users the ability to catalog, collaborate on, and mine metadata.
CapTech built a custom metadata repository, application, and set of services to catalog data landing in the lake. The application catalogs business, technical, and operational metadata, tracks lineage and data quality, and provides support for enterprise data governance processes.
To make sense of the data from the application, our team audited and assigned values to data records. We then gave data consumers the ability to search and use these records. Our team also provided access to the entire metadata lineage to preserve important data and enable governance and security controls within the lake.
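As a rough illustration of the idea described above, the sketch below models a catalog entry that groups business, technical, and operational metadata and supports a simple keyword search. It is a minimal Python sketch, not the application CapTech built (which ran on Spring MVC/Hibernate); the class names, field names, and sample datasets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One cataloged dataset. All field names are hypothetical."""
    dataset: str
    business: dict = field(default_factory=dict)     # e.g. owner, description
    technical: dict = field(default_factory=dict)    # e.g. file format, schema
    operational: dict = field(default_factory=dict)  # e.g. load counts, timestamps


class MetadataCatalog:
    """In-memory stand-in for a metadata repository (illustrative only)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Key entries by dataset name; re-registering overwrites the old entry.
        self._entries[entry.dataset] = entry

    def search(self, term):
        # Case-insensitive substring match over the dataset name and all
        # metadata values; returns matching dataset names in sorted order.
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            values = [entry.dataset]
            for group in (entry.business, entry.technical, entry.operational):
                values.extend(str(v) for v in group.values())
            if any(term in v.lower() for v in values):
                hits.append(entry.dataset)
        return sorted(hits)


# Hypothetical sample data, loosely based on the data types named above.
catalog = MetadataCatalog()
catalog.register(CatalogEntry(
    dataset="card_transactions",
    business={"owner": "payments", "description": "Credit card transactions"},
    technical={"format": "parquet"},
    operational={"files_loaded": 120},
))
catalog.register(CatalogEntry(
    dataset="web_traffic",
    business={"owner": "digital", "description": "Website clickstream"},
    technical={"format": "json"},
    operational={"files_loaded": 45},
))

print(catalog.search("parquet"))
```

A production repository would persist entries in a database and expose search through a service layer; the in-memory dictionary here only shows how the three metadata categories can sit side by side on one record.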
Tools & Methodologies
- Cloudera CDH
- Cloudera Navigator
- AngularJS
- RESTful Web Services
- Spring MVC/Hibernate
The CapTech team followed the Scaled Agile Framework (SAFe) methodology, working with ten other scrum teams to deliver the enterprise data lake.
- Cataloged the global metadata for all data ingested into the data lake—20,000 files per day in the end state—and systemized information in a compatible format that enabled searching, segmentation, and understanding
- Captured the transformation of data through the ecosystem, enabling data analysis and modeling, and provided full attribute-level lineage of data from lake ingress (incoming) to lake egress (outgoing)
- Protected and preserved the enterprise’s system by creating a data governance workflow environment that notified internal owners of unmet or unreviewed high-risk data requirements
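The attribute-level lineage mentioned in the list above can be pictured as a directed graph in which each transformation links a source attribute to a derived one. The following is a minimal Python sketch under that assumption; the `LineageGraph` class and the attribute names are hypothetical and do not reflect how the delivered system or Cloudera Navigator stores lineage.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph of attribute-to-attribute transformations (illustrative)."""

    def __init__(self):
        # Maps each derived attribute to the set of attributes it was built from.
        self._parents = defaultdict(set)

    def add_transformation(self, source, target):
        self._parents[target].add(source)

    def upstream(self, attribute):
        # Depth-first walk back toward lake ingress; returns every ancestor
        # attribute in sorted order, excluding the starting attribute.
        seen = set()
        stack = [attribute]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return sorted(seen)


# Hypothetical attributes: two ingested fields feed one curated field,
# which in turn feeds a field leaving the lake.
lineage = LineageGraph()
lineage.add_transformation("ingest.card_txn.amount", "curated.txn.amount_usd")
lineage.add_transformation("ingest.card_txn.currency", "curated.txn.amount_usd")
lineage.add_transformation("curated.txn.amount_usd", "export.report.total_spend")

print(lineage.upstream("export.report.total_spend"))
```

Tracing upstream from an egress attribute answers the question "where did this value come from?", which is the kind of query that governance reviews of high-risk data tend to rely on.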