This year’s PyData conference was hosted by Microsoft in beautiful Seattle, WA. PyData is much smaller than PyCon, but has a stronger focus on machine learning and data analysis using Python. Throughout the seminars and talks that I attended, I noticed a few themes that I wanted to share.
It’s time for Python 3.
Python 3 came out in December of 2008, causing a rift in the Python community. Because it broke backwards compatibility in ways that were necessary to move the language forward, many Python developers were reluctant to take up the new version. At PyData, it was clear that the community is now embracing Python 3. Nearly every speaker was coding in it, and even though most of the tools they were demoing were supposedly backwards compatible, many didn’t work properly for me until I switched to Python 3. Almost 10 years later, it’s time to embrace this change and use Python 3.
Dask Dask Dask
One of the first demos I went to was for Dask, a parallel processing library for Python. With each description of Dask’s functionality, my eyes grew wider and wider as the possibilities for implementing this tool at my current client flooded my mind.
Like Spark, Dask helps you analyze your data faster via parallelism, and it works either on a single machine or on a cluster. Unlike Spark, everything is made specifically to work with pandas and numpy in Python. With Dask, you are no longer a second-class citizen of Scala-land. Dask is also incredibly easy to implement, and has been designed for simplicity.
For example, if you want to start taking advantage of Dask, but don’t want to rewrite your code, you can decorate your existing functions with dask.delayed. Calls to a decorated function become lazy: Dask records them in a task graph and, when you ask for the result, runs the independent tasks in parallel. This doesn’t require much code tinkering on your part. With Dask, you can also create a dashboard to visualize the parallelism in real time as you code. Better yet, Dask is written in pure Python, so it doesn’t need several dependencies to run; Dask just works. Dask is part of the Anaconda Python distribution, so if you’re using Anaconda, you already have Dask.
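To make the dask.delayed idea concrete, here is a minimal sketch. The functions load, total, and combine are hypothetical stand-ins I made up for illustration, not anything shown at the conference; the only Dask pieces used are the delayed decorator and the compute() method.

```python
from dask import delayed


@delayed
def load(n):
    # Hypothetical stand-in for an expensive loading step.
    return list(range(n))


@delayed
def total(values):
    # Hypothetical stand-in for an expensive aggregation.
    return sum(values)


@delayed
def combine(a, b):
    return a + b


# Nothing runs yet: each call only adds a node to a task graph.
left = total(load(10))
right = total(load(100))
result = combine(left, right)

# compute() executes the graph. The left and right branches do not
# depend on each other, so the scheduler is free to run them in parallel.
answer = result.compute()
print(answer)
```

The function bodies are unchanged plain Python; only the decorator and the final compute() call are Dask-specific, which is why adopting it needs so little rewriting.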
I see Dask as a great tool to use when your data are big, but not big enough to run efficiently on Spark. It’s a great intermediate library, which will no doubt provide business value to those who use it.
Bokeh and other data visualization tools
As a data scientist who primarily codes in Python, I have a bevy of visualization tools at my disposal, all of which have pros and cons. I prefer to use a combination of Seaborn, Matplotlib, and pandas for my visualization. At PyData, Bokeh
There were also some impressive visualization tools for Natural Language Processing (NLP). One talk by Jason Kessler demonstrated a new Python library called “scattertext” which allows you to visualize text data. With scattertext, you can evaluate word frequency as a function of discrete categories, effectively illustrating the differences between groups via text. Imagine the possibilities here: you can perform A/B tests, analyze text data with NLP, and visualize them beautifully to stakeholders.
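To show what "word frequency as a function of discrete categories" means, here is a rough sketch in plain Python. It does not use scattertext or its API; it only computes the kind of per-category frequency difference that such a plot would display. The documents and the category labels "A" and "B" are made-up toy data.

```python
from collections import Counter

# Hypothetical toy data: (category, text) pairs, e.g. feedback from an A/B test.
docs = [
    ("A", "fast checkout and fast shipping"),
    ("A", "checkout was easy"),
    ("B", "slow checkout and confusing layout"),
    ("B", "page felt slow"),
]

# Count word occurrences separately for each category.
counts = {}
for category, text in docs:
    counts.setdefault(category, Counter()).update(text.lower().split())

total_a = sum(counts["A"].values())
total_b = sum(counts["B"].values())
vocab = set(counts["A"]) | set(counts["B"])

# Relative frequency in A minus relative frequency in B:
# positive values lean toward A, negative values lean toward B.
diff = {
    word: counts["A"][word] / total_a - counts["B"][word] / total_b
    for word in vocab
}

most_a = max(diff, key=diff.get)  # word most characteristic of group A
most_b = min(diff, key=diff.get)  # word most characteristic of group B
print(most_a, most_b)
```

Words shared evenly by both groups end up near zero, while words that set the groups apart fall at the extremes, and that contrast is what makes this kind of plot useful to stakeholders.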
All in all, I'd consider going to PyData again. There was a great mix of technical and basic topics, so data scientists at all levels have the opportunity to learn something.