I recently attended the DataStax Partner Certification Bootcamp. My original intent was to take the Cassandra Certified Administrator test the first day of the Cassandra Summit; however, our partner representative suggested I participate in the Bootcamp instead. The difference between the certifications and the prerequisite knowledge was a little vague and I had to muddle my way through the process. Ultimately I ended up registering only for the Administrator track because I did not think I had the developer skills to get through the Spark and Solr work. The purpose of this blog is to help clarify the process for others.
First, I want to clarify the differences between the DataStax Partner Certifications and the Apache Certifications. DataStax has their hands in both programs but they each serve a different purpose.
Apache Cassandra Certifications
The Apache Cassandra Certifications are administered by DataStax in cooperation with O'Reilly. There are three tracks:
- Cassandra Developer
- Cassandra Administrator
- Cassandra Architect
The Cassandra Certified Architect is basically Cassandra Certified Administrator + Cassandra Certified Developer. This is the typical process: (1) take the online courses, (2) take the practice test, (3) take the online certification exam and (4) hope you pass so your company reimburses your test fee. The whole process is documented here. The disadvantage of these programs is that there is a big difference between book knowledge and hands-on production experience. There is no guarantee that the person you hire actually knows what they are doing. This is where DataStax Certified Partners come in.
DataStax Partner Certification Program
The DataStax Partner Certification Program is designed to ensure that certified professionals actually have hands-on experience with Cassandra and the DataStax Enterprise tools. The Apache Cassandra Certifications are just one of the prerequisites for being invited to participate in the DataStax Partner Certification. The basic process is:
- Become a DataStax Partner
- Take the relevant DataStax Academy courses
- Have your partner representative invite you to the program
- Pass the pre-qualification exam
- Attend bootcamp at DataStax HQ and pass the labs
This process is documented in detail here. There are three tracks that line up exactly with the O'Reilly Apache Cassandra certifications.
What do you really need to know?
My background is basically Data Architecture with a lot of hands-on Data Engineering. I have extensive Big Data (NoSQL and Hadoop) experience and I am comfortable at a Unix command line. I am NOT a developer and I am NOT an administrator. I think this is a common conundrum for people who come from the traditional database/ETL/data warehouse paradigm. A lot of Big Data tools have programmatic interfaces that need to be leveraged to process data. This includes Java, Python, Perl and Scala. These programmatic interfaces are slowly being replaced with scripting and SQL interfaces. For the DSE stack, that means CQL 3.0 for Cassandra, Spark SQL for the analytics component and Solr for the search (ad-hoc) capabilities.
I was fairly confident I had enough Unix chops to pass the Administrator bootcamp, but I was concerned that I did not have enough developer (Scala) chops to pass the Developer bootcamp. In retrospect I should have signed up for the Developer bootcamp in addition to the Administrator bootcamp because there was very little programming required. It was mostly cut and paste Scala. The pre-qualification exam was intense and focused a lot on Unix administration. As I mentioned before, I am not an administrator; however, I am a solid Unix scripter. I barely passed this exam and I think I was the only one in my bootcamp class of 12 to pass it. One of the students said "I have been a Unix administrator for over 10 years and I did not recognize half of the commands in the pre-qualification exam." From my experience, here is what you really need:
- Go through all of the material in DataStax Academy several times and take all of the quizzes. The training is excellent (recently revamped) and the quizzes give you an idea of what the exam questions will be like.
- Production experience is a HUGE plus. I suggest you spin up some VMs and simulate cluster failures and issues if you are not working with C* in production.
- Know the DSE command line tools inside and out.
- Be comfortable at the Unix command line and understand how to measure performance and system resource utilization. Scripting is not required.
- Be familiar with JMeter.
- Have a basic understanding of Scala.
What happens at the Bootcamp?
The Bootcamp is intense. It is a 5 day event at the HQ of DataStax in Santa Clara, CA. It started Monday morning around 8:30a. The Administrator Bootcamp ran through Tuesday morning. The Developer Bootcamp ran from Tuesday afternoon through Friday. The Architect Bootcamp is just the combination of the two for a full week. I only attended the Administrator portion because I did not think I had the qualifications for the Developer/Architect Bootcamp. This was a mistake. As I mentioned above, I should have attended the full Bootcamp. The course ran as follows:
Monday was a long day, especially since I flew in from the east coast the day before. I got there around 8a and I don't think we left the building until 7:30p.
- Introductions, overview and AWS cluster/machine assignments. An interesting thing to note is that you do all of the assignments as a team. I just paired up with the person sitting next to me.
- Cluster Confusion - Our team was provided access to an Apache Cassandra cluster (not DSE) with OpsCenter running. A number of things were misconfigured in the cluster. We had to identify the issues and fix them. We then used JMeter to test our tuned cluster and measure the transaction throughput. I think it was ~300 ops/s when misconfigured; 1,000 ops/s was deemed acceptable for passing. We had fixed a number of issues but we were only getting 1,300 ops/s. Then we realized that there was minimal load on the cluster. This could only mean there were network issues or the client was not generating enough load to stress the cluster. We were running JMeter through the GUI client. We got 5,000+ ops/s as soon as we ran the command line JMeter tool.
- We had to document our troubleshooting approach, all the issues, how we identified them and how we fixed them. The results are submitted to Git as a text document for review by the DataStax team.
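For anyone who wants to sanity-check a throughput number outside the JMeter GUI, here is a minimal sketch, not what was used at the Bootcamp. It assumes JMeter was run in non-GUI mode (something like `jmeter -n -t <plan>.jmx -l <results>.jtl`, where the file names are placeholders) and that the results file is CSV with the default `timeStamp`, `elapsed` and `success` columns. The sample data is made up for illustration.

```python
import csv
import io


def throughput_ops_per_sec(jtl_csv_text):
    """Return successful samples per second for a JMeter CSV results file.

    Assumes the columns `timeStamp` (start time, epoch ms), `elapsed` (ms)
    and `success` ("true"/"false") are present in the header row.
    """
    rows = list(csv.DictReader(io.StringIO(jtl_csv_text)))
    if not rows:
        return 0.0
    # Wall-clock window: first request start to last request end.
    start_ms = min(int(r["timeStamp"]) for r in rows)
    end_ms = max(int(r["timeStamp"]) + int(r["elapsed"]) for r in rows)
    ok = sum(1 for r in rows if r["success"].lower() == "true")
    duration_s = max(end_ms - start_ms, 1) / 1000.0
    return ok / duration_s


# Made-up sample: 4 successful requests over 2 seconds of wall time.
SAMPLE = """timeStamp,elapsed,label,success
1000,500,write,true
1500,500,write,true
2000,500,read,true
2500,500,read,true
"""

if __name__ == "__main__":
    print(f"{throughput_ops_per_sec(SAMPLE):.1f} ops/s")
```

If the number computed this way is low while the cluster itself shows little load, the load generator or the network is the more likely bottleneck.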
Tuesday morning started around 8:30. The Administrator portion ended at lunch. I stayed for the rest of the day because it focused on designing a data model for a sample application.
- The instructor walked us through each known issue from the previous day's misconfigured cluster exercise. It was a very informative session. We learned everything we did wrong, and sometimes right.
- Bare Metal Installation - The rest of the morning was for the second and last Administrator exercise. We had to install Apache C* on a fresh CentOS machine. I was a little worried about this exercise because I thought we would have to install the OS. Luckily, we did not. The point of this exercise is to demonstrate how to install C* with recommended guidelines on a fresh machine. I documented my results and committed to Git.
- Intro to Scala - After lunch, the instructor gave a quick overview of Scala as a functional language and how it is used in Spark.
- Data Modeling - The rest of the afternoon was for the Data Modeling exercise. We had to design a data model based on known queries for a sample application. This took most of the afternoon and was an excellent learning experience.
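To illustrate what designing a model from known queries can look like, here is a small sketch. The Bootcamp's sample application is not described here, so the table, columns and query below are hypothetical. The idea is the usual query-first rule of thumb: columns the query filters on with equality become the partition key, and columns it sorts or ranges over become clustering columns.

```python
def create_table_cql(table, columns, partition_key, clustering=()):
    """Build a CQL CREATE TABLE statement for one known query.

    columns: dict of column name -> CQL type
    partition_key: columns the query filters on with equality
    clustering: (column, "ASC" | "DESC") pairs the query sorts or ranges over
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns.items())
    pk = "(" + ", ".join(partition_key) + ")"
    ck = "".join(f", {name}" for name, _ in clustering)
    cql = f"CREATE TABLE {table} (\n  {cols},\n  PRIMARY KEY ({pk}{ck})\n)"
    if clustering:
        order = ", ".join(f"{name} {direction}" for name, direction in clustering)
        cql += f" WITH CLUSTERING ORDER BY ({order})"
    return cql + ";"


# Hypothetical known query: "show the newest comments for one video".
cql = create_table_cql(
    table="comments_by_video",
    columns={
        "video_id": "uuid",
        "posted_at": "timestamp",
        "author": "text",
        "body": "text",
    },
    partition_key=["video_id"],
    clustering=[("posted_at", "DESC")],
)

if __name__ == "__main__":
    print(cql)
```

One table per query is the typical outcome, which is why the same data often ends up stored several times under different keys.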
I was finished on Tuesday and did not attend the sessions on Wednesday, Thursday and Friday. On Wednesday, the teams implemented their data model in C* and used Spark to move data from the old data model to the new data model. Apparently this involved modifying template scripts or using Spark SQL, so limited Scala knowledge was required. The teams built a sample application on Thursday and Friday and presented their final results to the DataStax team on Friday afternoon.
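I did not see the migration scripts, so the following is only a conceptual sketch in plain Python, not Spark and not the Bootcamp's code. It shows the core of moving rows from one model to another: regroup them by the new partition key and order them by the new clustering column. The row layout and column names are hypothetical.

```python
from collections import defaultdict


def rekey(rows, partition_key, clustering_key, newest_first=True):
    """Group rows into partitions of the new model, sorted within each partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r[clustering_key], reverse=newest_first)
    return dict(partitions)


# Hypothetical old model: one row per comment, keyed by comment_id.
OLD_ROWS = [
    {"comment_id": 1, "video_id": "v1", "posted_at": 10, "body": "first"},
    {"comment_id": 2, "video_id": "v2", "posted_at": 11, "body": "hello"},
    {"comment_id": 3, "video_id": "v1", "posted_at": 12, "body": "second"},
]

# New model: comments partitioned by video, newest first.
new_model = rekey(OLD_ROWS, "video_id", "posted_at")

if __name__ == "__main__":
    for video_id, comments in new_model.items():
        print(video_id, [c["comment_id"] for c in comments])
```

In a real cluster the same reshaping would be expressed as a distributed read from the old table and a write to the new one, which is the part the template scripts or Spark SQL would handle.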
- It was very intense, thorough and educational. Expect long days and work at night to document your results.
- You don't need to be a developer or a Unix administrator to go through the Bootcamp.
- You should have production experience with Apache Cassandra and preferably DSE Cassandra.
- The instructors/leaders are very knowledgeable. Take full advantage of the Bootcamp to ask any and all questions.
- Documentation on the program was limited at the time. Hopefully I have resolved that issue with this blog.
- There is a reason why part of the kitchen is labeled Data Snaxs.