Hadoop: The Definitive Guide


If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.


RDBMS compared to MapReduce


                 RDBMS                        MapReduce
Data size        Gigabytes                    Petabytes
Access           Interactive and batch        Batch
Updates          Read and write many times    Write once, read many times
Structure        Static schema                Dynamic schema
Integrity        High                         Low
Scaling          Nonlinear                    Linear


MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time.
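To illustrate "interpreting the data at processing time" (schema-on-read), here is a minimal sketch in Ruby: the "schema" lives entirely in the map function, which parses each raw line when it is processed. The space-separated log format (`<ip> <status> <bytes>`) is a hypothetical example, not from the source.

```ruby
# Schema-on-read sketch: no schema is declared up front; the map
# function decides how to interpret each raw line as it reads it.
# Assumes a hypothetical space-separated log line: "<ip> <status> <bytes>".
def map_line(line)
  ip, status, bytes = line.split
  # Emit a (key, value) pair only for lines that parse cleanly;
  # malformed records are skipped rather than rejected at load time.
  return nil unless status && bytes =~ /\A\d+\z/
  [ip, bytes.to_i]
end

map_line("10.0.0.1 200 5120")  # => ["10.0.0.1", 5120]
map_line("garbage line")       # => nil
```

Contrast this with an RDBMS, which enforces its static schema when data is loaded, not when it is queried.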


MapReduce tries to colocate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the heart of MapReduce's performance.


MapReduce in one line: the following Unix pipeline does essentially the same thing as a Hadoop MapReduce job (except that it doesn't scale):

cat sample.txt | ruby map.rb | sort | ruby reduce.rb
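The contents of map.rb and reduce.rb are not shown above, so here is one plausible pair, sketched as a word count: the map phase emits a "word TAB 1" line per word, the shell `sort` groups identical keys together, and the reduce phase sums each group. Both phases are simulated in a single script below so the pipeline's shape is visible end to end.

```ruby
# Map phase: for each input line, emit one "word\t1" record per word.
# In the real pipeline this would be map.rb reading STDIN line by line.
def map_phase(lines)
  lines.flat_map { |line| line.split.map { |word| "#{word}\t1" } }
end

# Reduce phase: consume the sorted records and sum the counts per word.
# The shell `sort` between the phases guarantees that all records for a
# given word arrive together, which is what lets the reducer group them.
def reduce_phase(sorted_lines)
  counts = Hash.new(0)
  sorted_lines.each do |line|
    word, n = line.split("\t")
    counts[word] += n.to_i
  end
  counts
end

# Simulating `cat sample.txt | ruby map.rb | sort | ruby reduce.rb`:
input  = ["the quick fox", "the lazy dog"]
result = reduce_phase(map_phase(input).sort)
# result => {"dog"=>1, "fox"=>1, "lazy"=>1, "quick"=>1, "the"=>2}
```

This is exactly the contract Hadoop Streaming imposes on mapper and reducer scripts: read lines on stdin, write tab-separated key/value lines on stdout, and let the framework do the sorting and shuffling in between.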