
Articles March 13, 2020

The Revival of the Data Lake Makes Everything Possible

Ben Harden

The data lake, a repository that can store multiple forms of both structured and unstructured data, is making a comeback—and it is the foundation for all things related to modern data.

According to Gartner, the data lake's present position in the "Trough of Disillusionment" followed a "Peak of Inflated Expectations," and we all saw it happen: the boundless excitement, the grasp of its potential, and the complex architecture required to make it all work. But things have changed. The Cloud has created new possibilities. Data teams are sinking their teeth in. The stages of enlightenment and productivity are coming, or, arguably, they are officially here.

Organizations capture enormous amounts of data, both structured and unstructured, and in a perfect world that data delivers unprecedented insights. With this data on our side, we can build and create with more confidence and speed, spending our time on more impactful work.

The early adopters were the first generation of data lake users; they explored the possibilities and learned many lessons. Now we are on the brink of data lakes 2.0. Organizations that didn't take action in the first round can jump straight into the second, but without the Machine Learning (ML) and Data Ops practices to surround their data lakes, they may still have a problem. To take advantage of the data in your data lake, you need Data Ops in place that lets you rapidly move data, frame it, analyze it, and deploy models built from its insights into production.
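That move-frame-analyze loop can be sketched in a few lines of Python with pandas. Everything here is an invented stand-in: the local folder plays the role of cloud object storage, and the event schema and column names are illustrative only.

```python
import pandas as pd
from pathlib import Path

# Hypothetical "lake" location -- in practice this would be cloud object
# storage (an S3 or ADLS path, say), not a local folder.
lake = Path("lake/raw")
lake.mkdir(parents=True, exist_ok=True)

# Move: land some raw event data in the lake as-is.
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "event": ["view", "buy", "view", "view", "view", "buy"],
})
raw.to_csv(lake / "events.csv", index=False)

# Frame and analyze: read the raw data back and shape it into an
# analysis-ready feature table (event counts per user).
events = pd.read_csv(lake / "events.csv")
features = (
    events.groupby("user_id")["event"]
    .agg(views=lambda s: (s == "view").sum(),
         buys=lambda s: (s == "buy").sum())
    .reset_index()
)
print(features)
```

A feature table like `features` is the kind of artifact a data science team would then train models against, closing the loop from raw lake data to production insight.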

So much is happening in the world of data. This is how data lakes will likely impact modern data and analytics in the coming year and beyond:

An approach has emerged around effective speed to market with good controls and processes in place, known as ML Ops, also sometimes referred to as Data Ops or Artificial Intelligence (AI) Ops. Whatever you call it, this discipline was born out of the need to develop the operational capabilities that sit on top of data lakes and make everything work. An ML engineer is the person who connects it all together, working with the data engineering and data science teams to make enterprise-level machine learning happen.

Without an ML engineer, it's easy to end up in a situation where data scientists can't run models across all the data they want to, or they have models that can run, but only after many hours of labor-intensive setup and data preparation. If an organization isn't using the Cloud, it most likely won't have the computing power it needs to iterate quickly and move into a production environment.

Fast forward to a process involving a data lake and an ML engineer. Now, organizations can create processes to build a model, train it, test it, and deploy it into production quickly and at scale. Open-source frameworks like Hadoop were essential for data collection and comprehension early on, but they are on the decline, while data lakes 2.0 are on the rise.
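As a rough illustration of that build, train, test, and deploy loop, here is a minimal sketch using scikit-learn with synthetic stand-in data; the quality gate, threshold, and artifact filename are assumptions for the example, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import joblib

# Build: assemble a training set (synthetic here; in practice this
# would be a feature table derived from the data lake).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Test: gate deployment on a quality bar before anything ships.
score = accuracy_score(y_test, model.predict(X_test))
assert score > 0.7, "model failed its quality gate"

# "Deploy": serialize the model artifact for a serving environment
# (or model registry) to pick up and load.
joblib.dump(model, "model.joblib")
reloaded = joblib.load("model.joblib")
```

The point is the shape of the process: each stage is a repeatable, automatable step, which is what lets an ML Ops practice run it quickly and at scale rather than as a one-off manual effort.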

When we talk about ML algorithms, building them from scratch is no longer necessary. There are software platforms and out-of-the-box models to choose from that can accomplish many goals. Specifically, new software lets organizations pick a platform that makes significant chunks of the process easier. ML platforms such as DataRobot, for example, offer not only easy access to a data lake but also the pieces needed to handle ML Ops, deployments, and launches of models into production. Why build all of that from scratch if it's not necessary?

Beyond platforms, there are also models available for countless capabilities. Image recognition, for instance, is a complex undertaking, but there are pre-built models, including Amazon Rekognition, Google Vision, and Microsoft Cognitive Services, that can run on a chosen platform and make it easy to identify and tag objects in a photo. These pre-built options can increase speed to market. There's no need to reinvent the wheel.

Leveraging an ML platform to build, deploy, and test a model very quickly enables an organization to put multiple models into production, get feedback, do A/B testing, and analyze the results. Ways to consume and use data from the data lake are moving fast. Options for use are opening up to everyone.
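One common way to analyze the results of such an A/B test between two deployed model variants is a chi-squared test on the 2x2 table of conversions versus non-conversions; the counts below are invented illustration data, not real results.

```python
from scipy.stats import chi2_contingency

# Made-up A/B test results:
# variant A converted 120 of 1000 users, variant B converted 165 of 1000.
conversions_a, total_a = 120, 1000
conversions_b, total_b = 165, 1000

# 2x2 contingency table: [conversions, non-conversions] per variant.
table = [
    [conversions_a, total_a - conversions_a],
    [conversions_b, total_b - conversions_b],
]

# A significant p-value suggests the difference between the two
# variants is unlikely to be random noise.
chi2, p_value, dof, expected = chi2_contingency(table)
rate_a = conversions_a / total_a
rate_b = conversions_b / total_b
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p = {p_value:.4f}")
```

With feedback loops like this in place, the decision of which model stays in production becomes a measured one rather than a guess.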

Because of the new “out-of-the-box” solutions, you don’t need to be a data scientist to explore the possibilities within data lakes, but you will likely need guidance. We’re on the brink of a democratization of data; however, ethical decisions around data privacy, consumer privacy, and ML systems need to be emphasized, or much can go wrong.

Gartner predicts that by 2023, 60% of organizations that have at least 20 data scientists will require a professional code of conduct, incorporating ethical use of data analytics. But there’s no need to wait until 2023; organizations should emphasize the importance of focusing on ethics within these data conversations now.

Data lakes will have a profound effect on the future of data use, but having an ML engineer, leveraging pre-built platforms and models, and focusing on ethics are all essential to achieving the kind of business impact people are talking about. It all must connect.

The data lake might have been in a trough of disillusionment, but it has matured—and modern data and analytics are now recognizing how to harness its endless possibilities.