While scrolling through Facebook, you have probably come across plenty of ads urging you to learn technologies and tools like Python, Hadoop, and Spark. In this article, we're going to discuss Hadoop and its applications.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use case.
It has since also found use on clusters of higher-end hardware. All the modules are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
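To make the MapReduce programming model concrete, here is a minimal, single-machine word count sketched in Python. It only illustrates the two phases; Hadoop runs the same map and reduce logic in parallel across a cluster and handles the shuffle and sort between them. The input lines are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum every count emitted for the same word.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Hadoop would run mappers in parallel, then shuffle and sort by key;
# here we simulate that with a flat list and a sort.
pairs = sorted((pair for line in lines for pair in mapper(line)), key=itemgetter(0))

# Each group of records sharing a key goes to one reducer call.
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
```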
This Apache software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
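A concrete example of that application-layer fault handling is HDFS block replication: every block of a file is stored on multiple DataNodes, so a failed machine costs a replica, not the data. A minimal sketch of the relevant setting in hdfs-site.xml (3 is the common default; the right value depends on your cluster):

```xml
<!-- hdfs-site.xml: store every HDFS block on 3 DataNodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

When a DataNode stops sending heartbeats, the NameNode re-replicates the affected blocks onto healthy nodes, with no intervention from the application.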
A wide variety of companies and organizations use this software for both research and production.
The term Hadoop is often used to refer not only to the base modules and sub-modules but also to the ecosystem: the collection of additional software packages that can be installed on top of or alongside it, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
This framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user’s program.
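For example, Hadoop Streaming can run a word count with the mapper and reducer written as ordinary scripts that read stdin and write tab-separated lines to stdout. Below is a minimal sketch in Python; the HDFS input/output paths are hypothetical, and the exact location of the streaming jar varies by Hadoop version and distribution.

```python
#!/usr/bin/env python3
# mapper.py — emit one "word<TAB>1" line per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py — sum the counts for each word. Hadoop sorts mapper output
# by key before the reduce phase, so identical words arrive on
# consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

With both scripts marked executable (chmod +x), a job could then be submitted along these lines:

```sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /user/example/input \
  -output /user/example/output
```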
Other projects in the ecosystem expose richer user interfaces.
According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003. This paper spawned another one from Google – “MapReduce: Simplified Data Processing on Large Clusters”.
Development started as part of the Apache Nutch project but was moved out to the new Hadoop subproject in January 2006.
Doug Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of code for HDFS and about 6,000 lines of code for MapReduce.
In March 2006, Owen O’Malley was the first committer added to the Hadoop project; Hadoop 0.1.0 was released in April 2006.
It continues to evolve through contributions to the project. The very first design document for the Hadoop Distributed File System was written by Dhruba Borthakur in 2007.
Applications of Hadoop
A wide range of organizations run Hadoop in production. Typical applications include log and clickstream analysis, building search indexes, data warehousing and ETL, recommendation engines, fraud detection, and large-scale machine learning: in short, any workload where the data is too large for a single machine to store or process economically. Yahoo!, where much of Hadoop's early development happened, and companies such as Facebook have operated some of the largest Hadoop clusters in the world.
Hope you enjoyed reading this article.