In spite of what some would believe, it is rare that an entirely new discipline enters the IT mainstream. That has happened recently, with the sudden advent of the data scientist. I sat down with one of our data scientists, Yan Li, to discuss how she got into the field and what it is like to be a data scientist. Hopefully Yan's insights will help those wishing to enter, or just better understand, this exciting new field.
Q: How did you become interested in data science?
Yan: I have always loved data, numbers, math, and statistics! I took quite a lot of statistics classes in college, and I became interested in data science when I took a data mining class as my first master's class. To me, data science is data mining rebranded. The ability to sift through large volumes of data to identify previously unknown, interesting, and potentially useful patterns just intrigued me.
Q: What skills are needed in your day-to-day work?
Yan: One important skill is communication, the most difficult aspect of which is translating business questions into practical analytical problems. A lot of the time, business domain experts may not even know what their problems are, and it may take some preliminary analysis, presented back to the business users, to help them articulate the business problem better.
In addition, 70% of my effort is actually spent doing ETL! So SQL skills are a must, and query performance tuning is inevitable. Data also comes from various sources, so being able to learn different tools quickly is very important.
I am using R because that is the only statistical modeling tool provided at my current client site. I also use Tableau as a data understanding and visualization tool. I use Hadoop Hive to retrieve HDFS log files and move them into Oracle to add dimensions. But data scientists should not be constrained to a specific tool. For example, preliminary data understanding may require me to write a Splunk query to look at what is in the machine log, or run a Pig script to get machine data population statistics.
Q: What surprises you most about data science as you apply it in business?
Yan: It is never about how to build the most accurate model, in contrast to many scientific challenges. For example, the winning models in the KDD Cup (the first and most popular data mining challenge) are typically ensemble models, which means they are highly accurate but don't explain very much. In business, the modeling result should provide useful insights while balancing various priorities: be accurate enough, have enough stability, and be cheap to deploy, to name only a few considerations.
Q: What advice do you have for someone entering the field?
Yan: Have a solid understanding of data mining and analytics methodologies and processes. Data science requires a formalized approach. Do not jump right away into statistical modeling. Make sure you have a solid business understanding, and go through all the necessary data analysis and preparation steps. Also remember that data science is an iterative process: you may build the model, see the initial results, and then tear it apart and start from the beginning again.