Khare (2009) defines Hadoop as an open-source, Java-based framework used to run parallel applications on clusters of variable size. It was created by Doug Cutting, the inventor of Lucene, a search tool, and Nutch, a web crawler. Cutting is believed to have been strongly influenced by GFS, the Google File System. Hadoop was the project used by Yahoo and was similar to MapReduce, the distributed processing system used by Google. MapReduce is now the processing platform used by Hadoop outside Google.
Currently, Hadoop is an open-source software project managed by Apache (White, 2012). Apache runs several other open-source projects, including Pig, Hive, Cassandra, Avro, HBase, Chukwa, Mahout and ZooKeeper; HBase and Cassandra are NoSQL-based projects. Hadoop is made up of two main elements: the Hadoop Distributed File System (HDFS) and MapReduce (Shafer, 2011). The file system is written in Java and is distributed, scalable and portable.
A Hadoop cluster, as shown in the figure above, is made up of a master node, slave nodes and a secondary NameNode.
Each master node has a single NameNode, while multiple slave nodes act as DataNodes. The NameNode and DataNodes serve as storage nodes. The master node runs a JobTracker and each slave node runs a TaskTracker, which serve as compute nodes. The NameNode stores the metadata and keeps track of the files and of where each file is stored.
DataNodes store and retrieve data when instructed by the NameNode. The secondary NameNode works as a checkpoint that helps the NameNode perform better. The TCP/IP model is used for communication within the file system, while clients communicate with each other using RPC (Remote Procedure Call). HDFS is also used for data warehousing. It can hold large data sets, up to a few gigabytes or terabytes, stored across several machines.
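The division of labour described above can be sketched as a minimal simulation: the NameNode holds only metadata about where each block lives, while the DataNodes hold the actual bytes. All class and method names here are hypothetical, chosen for illustration; real HDFS is implemented as Java daemons communicating over RPC.

```python
# Minimal sketch of the NameNode/DataNode interaction described above.
# All names are hypothetical; this is not real HDFS code.

class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}          # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps only metadata: which DataNode holds which block of which file."""
    def __init__(self):
        self.metadata = {}        # filename -> list of (block_id, datanode)

    def write(self, filename, data, datanodes, block_size=4):
        blocks = []
        for i in range(0, len(data), block_size):
            block_id = f"{filename}_blk{i // block_size}"
            node = datanodes[(i // block_size) % len(datanodes)]
            node.store(block_id, data[i:i + block_size])
            blocks.append((block_id, node))
        self.metadata[filename] = blocks

    def read(self, filename):
        # The NameNode only reports where the blocks are; the data
        # itself is fetched from the DataNodes.
        return b"".join(node.retrieve(bid)
                        for bid, node in self.metadata[filename])


nodes = [DataNode(i) for i in range(3)]
nn = NameNode()
nn.write("sample.txt", b"hello hadoop", nodes)
print(nn.read("sample.txt"))  # b'hello hadoop'
```

Note how the file's bytes never pass through the NameNode on a read: it returns block locations, and retrieval happens at the DataNodes, which mirrors why the NameNode can manage very large data sets while storing only metadata.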
HDFS provides reliability by dividing its data into small chunks and duplicating them across multiple hosts. By default, each chunk of data is stored on three different machines: two on the same rack and one on a different rack. Processing the data takes less time because it occurs in parallel. HDFS can scale to effectively unlimited storage, since new nodes can be added to the cluster at any time.
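The default placement rule above (three copies, two on one rack, one on another) can be sketched as follows. The function and rack names are illustrative assumptions, not HDFS internals:

```python
# Sketch of the default replica-placement rule described above:
# three copies per chunk, two on one rack and one on a different rack.
# The function and rack/machine names are hypothetical.

def place_replicas(racks):
    """racks: dict mapping rack name -> list of machine names.
    Returns three (machine, rack) pairs for one chunk."""
    rack_names = sorted(racks)
    first_rack, other_rack = rack_names[0], rack_names[1]
    return [
        (racks[first_rack][0], first_rack),   # replica 1
        (racks[first_rack][1], first_rack),   # replica 2, same rack
        (racks[other_rack][0], other_rack),   # replica 3, different rack
    ]

racks = {"rack1": ["m1", "m2"], "rack2": ["m3", "m4"]}
placement = place_replicas(racks)
print(placement)  # [('m1', 'rack1'), ('m2', 'rack1'), ('m3', 'rack2')]
```

Keeping one replica on a separate rack means the chunk survives even if an entire rack fails, while the two same-rack copies keep most replication traffic within a rack.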
The working of Hadoop can be shown by the diagram below:
Dean (2007) explains MapReduce as a programming model used to process and generate large data sets. As its name suggests, MapReduce has two jobs: a Map job and a Reduce job. The Map job takes a set of data as input and converts it into another form, in which the data is broken into smaller tuples as key/value pairs. The Reduce job takes the output of the Map job as input and combines the tuples. In every case, the Reduce job runs only after the Map job has finished.
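The Map and Reduce steps can be illustrated with the classic word-count example, simulated here in plain Python as a sketch of the model; an actual Hadoop job would implement Mapper and Reducer classes in Java and the framework would handle the intermediate grouping:

```python
# Word-count sketch of the Map and Reduce phases described above.
# This simulates the programming model only; it is not Hadoop code.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: break the input into (key, value) tuples - (word, 1)."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine all values that share the same key."""
    # The framework's shuffle/sort step between Map and Reduce is
    # simulated here by sorting and grouping the tuples by key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["big data big clusters", "big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'clusters': 1, 'data': 2}
```

The sort-and-group step in the middle is what guarantees that Reduce only runs once all Map output for a given key is available, matching the "Reduce runs after Map has finished" ordering described above.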