There's a lot of buzz around Hadoop and big data, as we noted in our earlier post on Data Governance Trends in 2011. Here's a quick primer to help you start to understand what Hadoop is and how it differs from a traditional database management system (DBMS). Hadoop does not replace traditional databases; it's a different toolset that excels at specific problems that are either very expensive or impossible to solve with a DBMS.
Hadoop is a set of open source technologies designed to run on low-cost commodity hardware. At its core are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS splits data into chunks and distributes them, with replication, across multiple nodes, each of which has its own local storage and processing. Technologies like RAID are not needed, since an entire node can be swapped out without endangering the integrity of the data or the processing. While HDFS manages data storage and security, MapReduce is the processing engine that accesses data stored in HDFS. Using MapReduce requires a significant amount of coding, and the programmer should be skilled at parallel processing. Rather than shipping data sets to the processing hardware, as in the traditional model, MapReduce sends the 'questions' to the individual nodes where the data lives. This is how Hadoop inexpensively analyzes large amounts of data: the data is broken up across nodes, and each node processes its own piece.
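To make the map/shuffle/reduce flow concrete, here is a minimal sketch of the MapReduce model in plain Python, using the classic word-count example. This simulates in one process what Hadoop distributes across nodes; the function names and sample data are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(chunks):
    """Map step: each 'node' emits (word, 1) pairs for its chunk of data."""
    for chunk in chunks:
        for word in chunk.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework
    does when moving intermediate results between nodes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Two 'chunks' standing in for data blocks stored on separate nodes.
chunks = ["big data big ideas", "big data tools"]
counts = reduce_phase(shuffle(map_phase(chunks)))
# counts == {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

In real Hadoop, the map and reduce functions run in parallel on the nodes holding the data, and the shuffle happens over the network; only the logic of the three phases is shown here.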
There are additional software components that can be layered onto the HDFS/MapReduce model to provide further abstraction. Hive, for example, is an open source tool that supports SQL-like queries (though the syntax differs slightly from standard SQL, and structure is applied to the data when it's queried rather than when it's loaded). Pig is a higher-level language that reduces the amount of coding needed to express complex queries. Many other components exist that make it practical for people across an organization to explore the data stored in Hadoop.
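The point of tools like Hive and Pig is that a short, declarative expression replaces the hand-written map, shuffle, and reduce code. As a rough Python analogy (the table name and HiveQL line in the comment are hypothetical examples, not taken from any real deployment), compare this one-liner to the explicit three-phase word count above:

```python
from collections import Counter
from itertools import chain

# The same two 'chunks' of raw text data.
chunks = ["big data big ideas", "big data tools"]

# One declarative expression; a tool like Hive or Pig similarly compiles
# a short query into the underlying map/shuffle/reduce phases for you.
counts = Counter(chain.from_iterable(chunk.split() for chunk in chunks))

# Roughly what the equivalent HiveQL would look like, assuming a
# hypothetical table `docs` with one word per row:
#   SELECT word, COUNT(*) FROM docs GROUP BY word;
```

The trade-off is the usual one: the abstraction is far less code, but hand-written MapReduce gives the programmer more control over how the work is parallelized.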
Hadoop is open source, so the software can be downloaded and installed at no cost, and it is designed to run on low-cost commodity hardware. This is a significant contrast to a traditional DBMS, which runs expensive software on high-grade, fault-tolerant hardware. Because Hadoop is also extremely scalable and fault tolerant, administration costs are typically lower, and large volumes of high-velocity data can be managed easily.
Manages Certain Data Types Better than a DBMS
The massively parallel, easily scalable design of HDFS is one of the reasons Hadoop is better suited to large volumes of high-velocity data; it also carries less overhead. Hadoop can store and analyze unstructured data, so there's no need to transform or aggregate it with an ETL process before loading it. This lets companies preserve the richness of their data by keeping all of its detail.
Examples of unstructured or high-velocity data:
- The human genome (about 3GB)
- Sensors that provide temperature and moisture readouts 5 times per minute
- User-generated data, such as Facebook status updates
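This "store first, structure later" approach is often called schema-on-read: raw records land in HDFS untransformed, and structure is applied only when a query runs. A minimal sketch, using made-up sensor readings like those in the second example above (the line format and field names are assumptions for illustration):

```python
# Hypothetical raw sensor lines stored as-is in HDFS -- no ETL step
# reshaped or aggregated them on the way in, so every reading survives.
raw_lines = [
    "2011-06-01T10:00:12 sensor-7 temp=21.4 moisture=0.31",
    "2011-06-01T10:00:24 sensor-7 temp=21.5 moisture=0.30",
    "2011-06-01T10:00:36 sensor-9 temp=19.8 moisture=0.44",
]

def parse(line):
    """Schema-on-read: impose structure at query time, not load time."""
    timestamp, sensor, temp, moisture = line.split()
    return {
        "timestamp": timestamp,
        "sensor": sensor,
        "temp": float(temp.split("=")[1]),
        "moisture": float(moisture.split("=")[1]),
    }

# A query over the raw detail: average temperature for sensor-7.
readings = [parse(line) for line in raw_lines if "sensor-7" in line]
avg_temp = sum(r["temp"] for r in readings) / len(readings)
# avg_temp is the mean of 21.4 and 21.5
```

Because the raw detail is never thrown away, a different question next month (say, moisture trends) just means a different parse at read time, with no re-ingestion.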
While Hadoop can manage and process a massive amount of data (Google was already processing more than 20 petabytes per day with MapReduce in 2008), it does not provide real-time results and may be less suited than a DBMS to direct interaction with users.