Data Science - Big Data History

Apache Hadoop Distributed File System (HDFS) with MapReduce Makes Big Data Possible – 2006 AD



Hadoop Makes Big Data Possible

Doug Cutting (b. 1963)

“Parallelism is the key to computing with massive data: break a problem into many small pieces and attack them all at the same time, each with a different computer. But until the early 2000s, most large-scale parallel systems were based on the scientific computing model: they were one-of-a-kind, high-performance clusters built with expensive, high-reliability components. Hard to program, these systems mostly ran custom software to solve problems such as simulating nuclear-weapon explosions.

Hadoop takes a different approach. Instead of specialty hardware, Hadoop lets corporations, schools, and even individual users build parallel processing systems from ordinary computers. Multiple copies of the data are distributed across multiple hard drives in different computers; if one drive or system fails, Hadoop replicates one of the other copies. Instead of moving large amounts of data over a network to super-fast CPUs, Hadoop moves a copy of the program to the data.
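The replication behavior described above is governed by HDFS configuration. As a minimal sketch, the real `dfs.replication` property in `hdfs-site.xml` sets how many copies of each block the file system keeps (the default is 3); the values shown here are illustrative:

```xml
<!-- hdfs-site.xml (sketch): dfs.replication controls how many copies
     of each data block HDFS maintains across the cluster. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

When a drive or node holding one copy fails, the NameNode notices the under-replicated blocks and schedules new copies from the surviving replicas until the configured count is restored.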

Hadoop got its start at the Internet Archive, where Doug Cutting was developing an internet search engine. A few years into the project, Cutting came across a pair of academic papers from Google, one describing the distributed file system that Google had created for storing data in its massive clusters, and the other describing Google’s MapReduce system for sending distributed programs to the data. Realizing that Google’s approach was better than his, he rewrote his code to match Google’s design.
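The MapReduce idea Cutting adopted can be illustrated with a toy word count, the canonical MapReduce example. This is a single-process Python sketch of the pattern, not Hadoop's actual Java API: a map phase emits `(key, value)` pairs, a shuffle groups them by key, and a reduce phase combines each group:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input split."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)
```

In a real cluster, the map and reduce calls run in parallel on the machines that already hold the data, and the shuffle moves only the compact intermediate pairs over the network.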

In 2006, Cutting recognized that his implementation of the distributed storage and processing systems could be used for more than running a search engine, so he took 11,000 lines of code out of his system and made them a standalone system. He named it “Hadoop” after one of his son’s toys, a stuffed elephant.

Because the Hadoop code was open source, other companies and individuals could work on it as well. And with the “big data” boom, many needed what Hadoop offered. The code improved, and the systems’ capabilities expanded. By 2015, the open source Hadoop market was valued at $6 billion and estimated to grow to $20 billion by 2020.”

SEE ALSO Connection Machine (1985), GNU Manifesto (1985)

Although the big-data program Hadoop is typically run on high-performance clusters, hobbyists have also run it, as a hack, on tiny underpowered machines like these Cubieboards.

Fair Use Sources: B07C2NQSPV

Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” In Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI ’04): December 6–8, 2004, San Francisco, CA. Berkeley, CA: USENIX Association, 2004.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google File System.” In SOSP ’03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 29–43. Vol. 37, no. 5 of Operating Systems Review. New York: Association for Computing Machinery, October 2003.