Hadoop: The Definitive Guide


If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.


RDBMS compared to MapReduce


                 RDBMS                        MapReduce
Data size        Gigabytes                    Petabytes
Access           Interactive and batch        Batch
Updates          Read and write many times    Write once, read many times
Structure        Static schema                Dynamic schema
Integrity        High                         Low
Scaling          Nonlinear                    Linear


MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time.
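To illustrate "interpreting the data at processing time" (schema-on-read), here is a minimal sketch in Ruby: the "schema" lives entirely in the map function, which parses each raw line when it is processed. The space-separated log format (`<ip> <status> <bytes>`) is a hypothetical example, not from the source.

```ruby
# Schema-on-read sketch: no schema is declared up front; the map
# function decides how to interpret each raw line as it reads it.
# Assumes a hypothetical space-separated log line: "<ip> <status> <bytes>".
def map_line(line)
  ip, status, bytes = line.split
  # Emit a (key, value) pair only for lines that parse cleanly;
  # malformed records are skipped rather than rejected at load time.
  return nil unless status && bytes =~ /\A\d+\z/
  [ip, bytes.to_i]
end

map_line("10.0.0.1 200 5120")  # => ["10.0.0.1", 5120]
map_line("garbage line")       # => nil
```

Contrast this with an RDBMS, which enforces its static schema when data is loaded, not when it is queried.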


MapReduce tries to colocate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the heart of MapReduce's performance.


MapReduce in one line: the following Unix pipeline does essentially the same thing as a Hadoop MapReduce job (except that it doesn't scale):

cat sample.txt | ruby map.rb | sort | ruby reduce.rb
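The contents of map.rb and reduce.rb are not shown above, so here is one plausible pair, sketched as a word count: the map phase emits a "word TAB 1" line per word, the shell `sort` groups identical keys together, and the reduce phase sums each group. Both phases are simulated in a single script below so the pipeline's shape is visible end to end.

```ruby
# Map phase: for each input line, emit one "word\t1" record per word.
# In the real pipeline this would be map.rb reading STDIN line by line.
def map_phase(lines)
  lines.flat_map { |line| line.split.map { |word| "#{word}\t1" } }
end

# Reduce phase: consume the sorted records and sum the counts per word.
# The shell `sort` between the phases guarantees that all records for a
# given word arrive together, which is what lets the reducer group them.
def reduce_phase(sorted_lines)
  counts = Hash.new(0)
  sorted_lines.each do |line|
    word, n = line.split("\t")
    counts[word] += n.to_i
  end
  counts
end

# Simulating `cat sample.txt | ruby map.rb | sort | ruby reduce.rb`:
input  = ["the quick fox", "the lazy dog"]
result = reduce_phase(map_phase(input).sort)
# result => {"dog"=>1, "fox"=>1, "lazy"=>1, "quick"=>1, "the"=>2}
```

This is exactly the contract Hadoop Streaming imposes on mapper and reducer scripts: read lines on stdin, write tab-separated key/value lines on stdout, and let the framework do the sorting and shuffling in between.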