My colleague Bill Wooten recently wrote about the value of privacy and shared his thoughts on the book "Privacy and Big Data" by Terence Craig and Mary E. Ludloff. I also read the book and was both fascinated and disturbed by how much potentially revealing data is readily available to consume. As a data professional, I get excited about combining data in new ways and delivering insight that was previously not available. As a consumer and user of technology, it is disturbing to know how my information is being used.
Take a site like Spokeo. They are doing some amazing things with publicly available information, combining and presenting it in ways that no one had previously imagined. Within just a few clicks I can see every address I've lived at since college plotted on a Google map. I can see the demographic profile of each of those areas, how much I paid for my houses, my age, phone numbers and various email addresses. Spokeo can even connect me to other people I am related to and present me with a family tree. Not only that, but they know all of the social networks I have joined and can display my user name and public posts. All this information is available just from searching on my name, email address or phone number (and only one is required).
After getting over the initial shock of seeing all of my personal data pulled together in this single, easy-to-digest view, I started to wonder where it all came from and who else might be using it.
Where does it come from?
As it turns out, there are plenty of data marketplaces where large data sets can be purchased or even downloaded for free. Data.gov provides access to a wide range of publicly available government data, from census findings to crime statistics. One of the things Spokeo is doing is a simple exercise of tying address information to census demographic information and real estate tax data. Knowing my age, my address, the demographics of the area where I live and how much I paid for my house becomes a powerful marketing tool that has real value to a variety of companies.
Infochimps is another data marketplace where a variety of data sets can be purchased. One of the most interesting data sets available for purchase is data from the dating site okcupid.com. The answers users gave to all 28 personality questions are available in the data set, along with each individual's gender, age, state and metro area. Marry this data set with the data found on Spokeo and all of a sudden people know more about you than you ever thought possible. If you are wondering what OkCupid is doing with all this data (beyond selling it), check out this article for some insight.
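To show how little work that kind of tie-up takes, here is a minimal sketch in Python. Everything in it is made up for illustration: the names, addresses, census tract codes, field names and the `build_profiles` helper are my own assumptions, not Spokeo's actual data or method.

```python
# Illustrative only: every record and field name below is invented.
people = [
    {"name": "A. Example", "age": 41, "address": "12 Oak St", "tract": "T100"},
    {"name": "B. Sample", "age": 29, "address": "9 Elm Ave", "tract": "T200"},
]

# Hypothetical census demographics, keyed by census tract.
census_by_tract = {
    "T100": {"median_income": 98000},
    "T200": {"median_income": 52000},
}

# Hypothetical real estate tax records, keyed by street address.
tax_by_address = {
    "12 Oak St": {"sale_price": 450000},
}


def build_profiles(people, census_by_tract, tax_by_address):
    """Attach tract demographics and tax data to each person record."""
    profiles = []
    for person in people:
        profile = dict(person)  # copy so the input records stay untouched
        profile.update(census_by_tract.get(person["tract"], {}))
        profile.update(tax_by_address.get(person["address"], {}))
        profiles.append(profile)
    return profiles


profiles = build_profiles(people, census_by_tract, tax_by_address)
```

Two dictionary lookups per person are enough to turn separate public data sets into a single profile, which is why this kind of enrichment is so cheap to do at scale.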
Add in Geographic Data
Now that we know everything about you from easily procured data, let's take the next logical step and learn about your geographic location and patterns. We learned from the Spokeo search that you have a Twitter and/or Flickr account. We can take that information and plug your user IDs into the Creepy application. Creepy will then plot on a map the location of every tweet and geo-tagged photo you have shared via these services. As you can see in the example below, patterns quickly emerge.
Since we already know where you live, now we can see what time of day you are most likely to post in that location vs. other locations and can do things like predict when you are most likely to be out of the house. We also likely know where you work from your LinkedIn profile and will be able to see when you are tweeting from the office. If we don't know where you work, with the time and location of your tweets we can make a very educated guess.
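The prediction described here can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: posts are given as (timestamp, latitude, longitude) tuples, "home" is a simple box of about 0.01 degrees around a known coordinate, and the function names are hypothetical. It is not how Creepy itself works.

```python
from collections import Counter
from datetime import datetime


def posts_by_hour(posts, home, radius_deg=0.01):
    """Count posts per hour of day, split into 'at home' and 'away'."""
    at_home, away = Counter(), Counter()
    for timestamp, lat, lon in posts:
        near_home = (abs(lat - home[0]) <= radius_deg
                     and abs(lon - home[1]) <= radius_deg)
        (at_home if near_home else away)[timestamp.hour] += 1
    return at_home, away


def likely_away_hours(at_home, away):
    """Hours of the day with more posts made away from home than at home."""
    return sorted(h for h in away if away[h] > at_home.get(h, 0))


# Made-up coordinates and timestamps, purely for illustration.
home = (40.0, -75.0)
posts = [
    (datetime(2012, 3, 5, 8, 0), 40.001, -75.002),
    (datetime(2012, 3, 5, 13, 0), 40.2, -75.3),
    (datetime(2012, 3, 6, 13, 30), 40.2, -75.3),
    (datetime(2012, 3, 6, 21, 0), 40.0, -75.0),
    (datetime(2012, 3, 7, 21, 15), 40.3, -75.1),
]

at_home, away = posts_by_hour(posts, home)
away_hours = likely_away_hours(at_home, away)
```

With only five sample posts the result means very little, but with months of real history the same counting would give a fairly reliable picture of a person's daily routine.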
Now that we know this data is available for the taking, just think about how it could be enriched with the personal information that is stored behind corporate firewalls. What happens when a bank ties its banking information to this data set? If they know I live in an expensive neighborhood, will they market to me differently? What about when my health insurer knows I have been traveling abroad to a country with an outbreak of a serious disease? Will my premiums be affected?
Ever wonder which clients your competitors are doing business with? Start with that nice company-managed Twitter list that identifies all of their employees' Twitter handles. Using Creepy you can map the tweet locations of every employee, filter out the non-working hours and you'll see obvious clusters of tweets occurring around certain geographic locations. It isn't a stretch to assume those are the locations where people are working.
Worst of all, what about a military officer posting a picture to share with his or her family from a secret location in unfriendly territory? The geotag information that is posted with the photo has just revealed the officer's exact location to the enemy.
Knowing how our data is being used is the most important thing we can do. Once we understand that, we can make informed decisions about what we share, where we share it from and when we choose to share it. Unfortunately, it isn't always clear what happens to our data after we share it, and we can't assume it will always be used with the best intentions.
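Here is a rough Python sketch of that filter-and-cluster step. The assumptions are mine: posts are (timestamp, latitude, longitude) tuples, working time is 9 to 5 on weekdays, and "clusters" are simply grid cells of about 0.01 degrees that contain at least two posts. The function name and sample data are hypothetical, and a real analysis would probably use a proper clustering algorithm.

```python
from collections import Counter
from datetime import datetime


def work_hour_clusters(posts, cell_deg=0.01, start_hour=9, end_hour=17,
                       min_posts=2):
    """Bucket weekday working-hours posts into grid cells, busiest first."""
    cells = Counter()
    for timestamp, lat, lon in posts:
        is_weekday = timestamp.weekday() < 5  # Monday is 0, Sunday is 6
        if is_weekday and start_hour <= timestamp.hour < end_hour:
            # Snap the coordinate to a coarse grid cell index.
            cell = (round(lat / cell_deg), round(lon / cell_deg))
            cells[cell] += 1
    return [(cell, n) for cell, n in cells.most_common() if n >= min_posts]


# Made-up posts from several employees, pooled into one list.
posts = [
    (datetime(2012, 3, 5, 10, 0), 41.881, -87.631),  # Monday morning
    (datetime(2012, 3, 6, 14, 0), 41.882, -87.632),  # Tuesday afternoon
    (datetime(2012, 3, 6, 20, 0), 41.881, -87.631),  # evening, ignored
    (datetime(2012, 3, 3, 11, 0), 41.881, -87.631),  # Saturday, ignored
    (datetime(2012, 3, 7, 11, 0), 40.500, -88.000),  # lone post, filtered out
]

clusters = work_hour_clusters(posts)
```

Each result is a grid cell index paired with a post count. Multiplying the index by `cell_deg` gives back an approximate coordinate that can be looked up on a map.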
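For context on how precise that geotag is: photo metadata (EXIF) stores GPS latitude and longitude as degrees, minutes and seconds plus a hemisphere letter. Turning that into a map coordinate is simple arithmetic, sketched below in Python. The function name and the sample values are my own, chosen arbitrarily for illustration, and reading the tags out of an actual image file is left out.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus an N/S/E/W letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal form.
    return -value if ref in ("S", "W") else value


# Arbitrary sample values, not taken from any real photo.
latitude = dms_to_decimal(38, 53, 23.0, "N")
longitude = dms_to_decimal(77, 0, 32.0, "W")
```

One second of latitude is roughly 30 meters on the ground, so a geotag recorded to the second is enough to single out a specific building.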