
A coworker recently asked how people keep up with everything that's going on in Big Data and I ended up writing a lot, so I thought I'd post it here.

For reference, I work as a data scientist. I mostly work in Python these days, but I've used R and Hadoop in the past.

I see Big Data as a subset of data engineering, which is really just a subset of overall development concepts, so while I try to read articles specific to new data technologies like Spark and Cassandra, I also read as many general development sources as I can to better understand systems.

  • Hacker News - I read Hacker News on a daily basis. It's the best source for news and commentary on everything going on in the tech world in the United States today. I click specifically on articles related to data and check out the comments to see what practical experience people have. I also try to read about things I'm not familiar with.

    For example, right now on the front page, there is a post called "An APL Compiler targeting a Typed Array Intermediate Language." I have no idea what most of this means, but I'll click through anyway and check out the source code to see how the code is structured on GitHub. Then I'll google around to find out what's special about typed array intermediate languages. Later, I'll check out the GitHub repo author's homepage. After doing this kind of internet wandering for months upon months, one becomes familiar with the tech scene and the terminology around it.
  • Newsletters - I subscribe to several data science newsletters, pick out the relevant links to read, and cross-reference them with Hacker News. A few favorites:
  • Twitter - I find Twitter really noisy and not as enjoyable as it was even two years ago. But there are still a few people who tweet good links and generate industry discussion:
    • @thepracticaldev is one of the best Twitter accounts to get a feel for the zeitgeist of software development today and has lots of good links.
    • @randal_olson is really good about catching all the hot data science links of the day and at generating a community of commentary around those links.
    • @b0rk is probably one of the friendliest public faces in tech today, very generous with her knowledge, and contagious in her enthusiasm for the topic.
  • Reddit - A couple of subreddits have discussion that meets the level of quality of Hacker News and maybe even transcends it because they focus less on abstract ideas and more on actual examples.
    • r/learnpython - For nuances you missed the first time around
    • r/python - After you're done learning and have general questions
    • r/programming - Sometimes the same posts as Hacker News, but often more technical rather than startup-y, with broader discussion
    • r/statistics - Sometimes good discussions that really get into the nitty gritty
    • r/machinelearning - Getting to be a more active community with lots of good links and pointers
    • r/cscareerquestions - More for junior devs just starting in their careers, but has a lot of good discussion about salary negotiations, different work environments, etc.

    For entertainment as well as education:

    • r/sysadmin and r/talesfromtechsupport - The best groan-worthy stories online. You also learn a lot about how not to do devops.
    • r/programmerhumor - Self-explanatory
  • Slack - I'm part of a data engineering Slack where people talk about issues they've had at their companies and how they're solving them. There are lots of others.
  • I go to local meetups and ask what people are using in their big data/data engineering stacks. Two really good ones I went to recently were Papers We Love Philly and the Philly Area Scala Enthusiasts lecture on DataFrames in Spark. I also love DataPhilly and PhillyPUG and attend as much as family life allows. I also talk to my friends in the industry. What are they using? What are they not doing? What's in? What's out? Which vendors are they evaluating?
  • I subscribe to the mailing lists of development projects I'm interested in. For example, I'm on the Spark mailing list, as well as scikit-learn's. I've learned a lot about people's various use cases and issues from these mailing lists, as well as common architecture patterns. I archive them all in my Gmail so I can reference them later.
  • GitHub - I surf it from time to time to see what's popular, and to see how to write good code and code documentation. It's particularly useful to see how to structure specific blocks of code in real live projects.
  • Stack Overflow - Excellent source for answers, but I also sometimes just browse Cross Validated, the stats sister site, to see what's popular and what people are answering.
  • O'Reilly and Manning emails - I bought books from them at one point and still get the emails. The latest book announcements are a good signal of what tech is in demand. They also sometimes have interesting free webinars.
  • Podcasts - I'll binge-listen to a bunch every few months. There are starting to be some really high-quality data-driven podcasts out there. The trick is finding ones where the creators have a good back-and-forth and get into the tech right away as opposed to chatting about the weekend for 40 minutes.