The times they are a-changing, and for the world of data warehousing this means two things: new, unprecedented changes in data processing, and rising demand for people who can implement this technology. Distributed data platforms are becoming bigger, stronger, faster and, most importantly, more accessible. In the wake of this new force of nature, the ecosystem of relational databases risks becoming outmoded. Granted, the traditional relational database certainly has a place in the data landscape. But applications have become increasingly demanding, and distributed platforms are replacing relational database systems. Is your data and reporting team ready for this monumental shift?
Now, more than ever, data is driving the business. It makes our applications work, influences decisions and enhances our understanding of trends in the company and even the overall industry. The new challenge is dealing with the increasing demands that make all of this possible. Distributed databases provide a method to organize and work with vast quantities of rapidly evolving data. With distributed platforms like Apache Cassandra, Apache Hadoop, Amazon DynamoDB, MongoDB and many more emerging technologies (too many to count), the data landscape is changing at full tilt. These tools offer powerful, scalable processing with high fault tolerance and lightning-fast performance.
One of your production servers goes down? Don't fret! There are dozens more nodes within a cluster faithfully serving up requests.
You know your application is going to hit a wall at the end of the year? No problem! With a few clicks new machines can be provisioned and added to the cluster within a few minutes.
You want some immediate insight into key performance indicators and analysis across billions of rows of data? We have you covered! Powerful new distributed data processing engines like Apache Spark are up for the task, delivering static or real-time reporting over immense amounts of data.
The advantages are real, but how can they be realized? Without targeted training or intense project experience, traditional database developers don't have the skills needed to transition a relational data warehouse environment into this brave new world. The technical hurdles are daunting and the knowledge base is deep and nuanced. I believe that in only a couple of years the requirements for being a data analyst, data engineer, data architect, or even a report builder will be wildly different from the set of capabilities we see in the current market. Here are some critical skills and talents that your team needs today.
Shift in Mindset
Making the mental transition will be uncomfortable and painful to begin with. This is a stark contrast to the warm and fuzzy world of entities and relationships, where each data element is stored once and transactions are atomic, consistent, isolated, and durable. Moving to NoSQL means that you store what you need for a specific context, typically a specific query, regardless of where else it might be stored. In many NoSQL stores you can't pull together data across tables (joins), and you may not be able to perform the exact filtering that you expect. This means ad-hoc querying is difficult and investigating data issues or discrepancies becomes daunting.
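To make "store what you need in a specific context" a little more concrete, here is a minimal sketch in plain Python rather than in any particular NoSQL product. The order records, field names, and the `write_order` helper are all hypothetical; the point is only that the same record is deliberately written once per query you intend to serve.

```python
from collections import defaultdict

# Hypothetical order records; the field names are illustrative only.
orders = [
    {"order_id": 1, "customer": "alice", "product": "widget", "amount": 30},
    {"order_id": 2, "customer": "bob", "product": "widget", "amount": 45},
    {"order_id": 3, "customer": "alice", "product": "gadget", "amount": 20},
]

# One "table" per query, each keyed by the value that query looks up.
orders_by_customer = defaultdict(list)
orders_by_product = defaultdict(list)


def write_order(order):
    """Write the same order to every query-specific table (duplication is intentional)."""
    orders_by_customer[order["customer"]].append(order)
    orders_by_product[order["product"]].append(order)


for order in orders:
    write_order(order)

# Each read is a single key lookup: no join and no scan.
alice_orders = orders_by_customer["alice"]
widget_orders = orders_by_product["widget"]

print([o["order_id"] for o in alice_orders])
print([o["order_id"] for o in widget_orders])
```

The trade-off is visible even at this scale: reads are cheap, but every write has to update each copy, and a question nobody planned for (say, orders by amount) has no table to answer it.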
Spark and Functional Programming
Moving data around in a NoSQL environment is challenging, and Spark is one of the best open source tools available to bridge this gap. The hurdle here is two-fold: setting up your Spark environment and learning functional programming basics. Setting up a Spark cluster or a local shell is no small task, and troubleshooting connection strings across multiple databases can prove frustrating. Following that, you have a few mainstream programming languages to choose from for developing your ETL: Java, Scala, or Python. These languages can have a steep learning curve for SQL developers who have never had exposure to object-oriented or functional programming. Fortunately there are a lot of resources, tutorials, and examples available online, but don't expect any hand-holding.
You will need to become comfortable with Linux (Debian-based or Red Hat-based distributions are suggested). Almost all of your initial work with a NoSQL offering or ETL engine is going to involve installation and setup on a Linux server from the command line. No fancy installers, no GUI, and no mercy. In the NoSQL world, it is very common that you, the developer, become the admin as well, so you'll need to learn a thing or two. The essentials include moving files and directories, viewing and editing text documents, and inspecting error logs and local volumes.
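For SQL developers wondering what "functional programming basics" looks like in practice, the following is a rough sketch using only Python's standard library, not Spark itself. The sales rows and the `add_to_totals` helper are hypothetical. The filter, map, and reduce steps are similar in spirit to the `filter`, `map`, and `reduceByKey` operations on a Spark RDD, but this code runs on a single machine.

```python
from functools import reduce

# Hypothetical sales rows; in a real job these would come from a data source.
rows = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
    {"region": "west", "amount": 25},
    {"region": "north", "amount": 75},
]

# filter: keep only the rows that meet a condition (like a SQL WHERE clause).
large = filter(lambda row: row["amount"] >= 50, rows)

# map: transform each row into a (key, value) pair (like a SQL SELECT list).
pairs = map(lambda row: (row["region"], row["amount"]), large)


def add_to_totals(totals, pair):
    """Return a new dict with the pair's value added to its key's running total."""
    key, value = pair
    return {**totals, key: totals.get(key, 0) + value}


# reduce: fold the pairs into per-key totals (like SUM ... GROUP BY).
totals_by_region = reduce(add_to_totals, pairs, {})

print(totals_by_region)
```

Each step takes a function as an argument and returns new data instead of modifying the input, which is the habit that tends to feel least familiar when coming from procedural SQL scripts.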
Open Source Offerings
There is a whole world of tools and add-ons that people have made available for each NoSQL platform. The ability to quickly research candidates and analyze the benefits, costs, and additional complexity each one adds to the project is important. Then you need to be able to incorporate it into your solution and make it work, not only in development but in production too! One tool we discovered is Apache Lucene, a search library that can be used to provide full-text search against your data. There is even a smaller project on GitHub called Lussandra that marries Lucene with your Cassandra database to provide search functionality against existing data. There are also many smaller GitHub projects that tackle everyday tasks like promoting database changes across environments. If you think you need to write a script for something, odds are that somebody out there is writing the same script and sharing it with the community. GitHub is an amazing platform with incredible repositories that offer a huge technical lift. Also, familiarize yourself with the Apache projects as well as Maven; they will prove valuable.
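When evaluating a tool like Lucene, it helps to know roughly what "full-text search" involves. The sketch below builds a toy inverted index in plain Python. It is not Lucene's API and leaves out everything that makes a real search library useful (tokenization rules, ranking, persistence); the documents and the `build_index` and `search` functions are hypothetical and only show the core idea of mapping words to the records that contain them.

```python
from collections import defaultdict

# Hypothetical documents keyed by row id.
documents = {
    1: "Cassandra stores data across many nodes",
    2: "Spark processes data in parallel",
    3: "Lucene indexes text for fast search",
}


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(documents)

print(search(index, "data"))
print(search(index, "spark data"))
```

The index is built once at write time so that each search is a handful of lookups, which is the same trade of extra storage for faster reads that shows up throughout NoSQL design.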
Application Development Understanding
NoSQL tools and approaches are tightly coupled with application architecture. In many cases, database developers will be forced to analyze and understand code to know how the application expects the database to handle data. In contrast to data warehousing, which aims to support user analytics and reporting, application database design must adhere to the functionality of the application program. The goal is to serve up the right data without incurring unnecessary complexity, while doing your best to avoid anti-patterns.
Are You Ready?
Transitioning a mature (or even nascent) relational environment to a NoSQL offering can seem daunting. However, far and away the most critical factor in success is team collaboration. No one person can become proficient in all skill sets within a reasonable timeframe, and everyone has their own strengths. Understanding individual talents and sharing what you learn with your team is the only path that leads to victory. Look for architects who have experience with denormalized database design and application development patterns, some functional knowledge of Spark, and hands-on experience in supporting the tool. Partnering with seasoned professionals can produce enormous gains in group knowledge and drastically reduce project timelines. It takes experienced guidance, perseverance, and a team that is willing to learn and grow.