The Universe of Hadoop & Its Many Modules: #1 A bit of theory

The Series

This is the first in a series of articles on Hadoop and its many modules. We will start with an overview of what Hadoop is for and a bit of Hadoop theory. In the next article we will dive straight into the practical part: we will set up a single node and test many aspects of Hadoop, such as configuration, HDFS management, and running a MapReduce algorithm on a small dataset. Simply to put theory into practice.

In the next few articles in this series we will explore what it takes to set up a multi-node cluster consisting of multiple servers and run a MapReduce job across it. We will also delve into YARN and install and use different YARN modules such as Kafka, Storm and Ambari.

If there is any specific topic you would like me to cover, feel free to leave a comment or contact me 🙂

Series Progress

The Universe of Hadoop & Its Many Modules: #1 A bit of theory – Done
The Universe of Hadoop & Its Many Modules: #2 Installing a single-node cluster – Done
The Universe of Hadoop & Its Many Modules: #3 Managing a multi-node cluster – Not started – On hold until exams are over, end of January
The Universe of Hadoop & Its Many Modules: #4 YARN Modules – Not started

 

Preface

I am making this series because during my 4th semester in Computer Science I chose Big Data as my thesis topic, and I found it difficult to find a proper introduction to Hadoop for newcomers. It is one thing to just install Hadoop and run a MapReduce job, but that makes no sense if you have no clue what is going on. It is another thing to delve into every tiny detail of Hadoop's components, which wasn't going to help me much either, at least that was how I felt. So this is my endeavor to create a series of articles that will not leave the reader thinking "What just happened?" I will strive to give you enough detailed information, then we will test it, and I will do my very best to explain what is going on.

 

What is Hadoop

Hadoop is a software framework that eases the distributed processing of large datasets across clusters of servers. It is designed to scale from a single server up to thousands of servers, each offering computation and storage. Rather than relying on hardware to provide high availability, the software itself is designed to detect and handle failures at the application layer.

Hadoop consists of four modules.

  1. Hadoop Common
  2. Hadoop Distributed File System (HDFS)
  3. Hadoop MapReduce
  4. Hadoop YARN

Let's delve into the four base modules of Hadoop.

Hadoop Common

Hadoop Common is the base library and collection of utilities that support other Hadoop modules.

Hadoop Distributed File System (HDFS)

HDFS is, as its name implies, a distributed file system that is built to support several goals; here are a few of them.

  • Hardware Failure

Suppose you have a very large dataset, in the hundreds of terabytes or even petabytes, and a few hundred servers at your disposal. Imagine for a second that a quarter of your servers suddenly died, for whatever reason, and that the data distributed to those servers was now lost. HDFS is built to limit such incidents by replicating the pieces of your dataset across multiple servers.

  • Large Data sets

HDFS is exceptional at handling large datasets. Instead of moving data from one server to another to perform the needed computation, HDFS provides interfaces for applications to move themselves closer to where the data is. This decreases network congestion and increases overall throughput.
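
To see this block-to-host mapping for yourself, the short Java sketch below asks the NameNode where each block of a file is stored, using Hadoop's FileSystem API (the path /data/comments.txt is just a placeholder for this example). This is exactly the kind of information a scheduler uses to run computation close to the data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; point it at any file you have in HDFS.
        Path file = new Path("/data/comments.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which hosts store each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}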

NameNode & DataNode

HDFS has a master/slave architecture. To manage, track and distribute the parts of your dataset, HDFS makes use of something called a NameNode and DataNodes. Let's take a look at the two and their responsibilities.

NameNode – The Master of the HDFS cluster

  • Tracks which DataNodes are connected to the cluster

Periodically each DataNode sends a heartbeat message to the NameNode. If the NameNode does not receive a heartbeat for a given time, that DataNode is excluded from the cluster until it begins to send heartbeat messages again. This is also why HDFS scales so well: you can spin up more servers on the fly, and all they have to do is start sending heartbeat messages to the NameNode.

  • Knows which DataNode holds which data block, and which data blocks make up a complete file

The NameNode stores the HDFS file system metadata, which essentially means the NameNode knows exactly which data blocks make up a whole file and which data blocks are stored on which DataNodes.

Besides knowing which DataNode contains which data block, it is also necessary to ensure data integrity. HDFS does this by calculating a checksum and associating it with each data block; when a data block is received by a DataNode, the DataNode calculates the checksum again and validates it against the checksum associated with the block.

The NameNode also keeps a complete transaction log.

  • Makes sure that each data block is replicated onto several DataNodes

Imagine we configured HDFS to replicate each data block onto three DataNodes.

The NameNode keeps an in-memory table that maps data blocks to DataNodes, and in our example each data block would map to three DataNodes.
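
As a small illustration, the sketch below uses the FileSystem API to read the replication factor the NameNode currently tracks for a file and then asks for each of its blocks to be replicated onto three DataNodes (the path is again a placeholder; the cluster-wide default comes from the dfs.replication property, normally 3).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactor {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder path used purely for illustration.
        Path file = new Path("/data/comments.txt");

        // The replication factor the NameNode currently tracks for this file.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Ask the NameNode to replicate each block of this file onto three DataNodes.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}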

DataNode – The Slave of the HDFS cluster

  • Keep in touch with the NameNode

During startup each DataNode connects to the NameNode and performs a handshake. At this stage the namespace ID and software version of the DataNode are verified; if they do not match those of the NameNode, the DataNode is excluded.

If the namespace ID and software version match those of the NameNode, the DataNode is registered with the NameNode and begins to send periodic heartbeat messages to it, usually every three seconds.

If the NameNode stops receiving heartbeat messages from a DataNode for a period of time, that DataNode is excluded from the cluster until the NameNode receives heartbeat messages from it again. The data blocks hosted on that DataNode become unavailable, and the NameNode creates new replicas of them on other DataNodes.

  • Ensure integrity

A data block consists of two files: the data itself and a metadata file, which contains several fields such as a calculated checksum, the size of the data and more.

When a DataNode receives a data block, it looks at the stored checksum and calculates a new checksum over the data. If the two checksums match, the integrity is intact and the process can move on.
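
The sketch below illustrates the idea with a plain CRC32 checksum. It is not HDFS's actual implementation, which checksums small chunks of each block rather than the block as a whole, but the compare-stored-against-recomputed principle is the same.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumCheck {
    // Compute a CRC32 checksum over a block of bytes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block of comment data".getBytes(StandardCharsets.UTF_8);

        // The value stored alongside the block in its metadata file.
        long stored = checksum(block);

        // The receiving DataNode recomputes the checksum over the bytes it actually
        // received and compares it with the stored value; a mismatch means corruption.
        long recomputed = checksum(block);
        System.out.println(recomputed == stored ? "Block intact" : "Block corrupt");
    }
}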

Hadoop MapReduce

Hadoop MapReduce is a software framework for writing applications that can process vast amounts of data and scale across hundreds or thousands of servers in a cluster. Let's take a peek at how MapReduce works.

Imagine you had every comment on the whole internet (that is quite a lot, but just follow along) and we would like to process each one and figure out whether it was posted during AM or PM.

The way MapReduce works is quite different from "regular" software design. Usually we would think of looping through each comment and figuring out whether it was posted during AM or PM; we might even thread it and distribute the workload. Instead, let's take a look at how MapReduce could solve the case for us.

MapReduce consists of two separate tasks. First there is the Map job, which takes a set of data (our vast amount of comments) and converts it into a set of key/value pairs. At this point our data could look like this:

Key   Value
AM    1
PM    1
AM    1
and so on, and so on

Each comment has been mapped as either AM or PM. Now it's time for the second task, the Reduce job. The Reduce job takes in key/value pairs and reduces them into a smaller set of key/value pairs, which could look like this:

Key   Value
AM    2
PM    1
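
A minimal sketch of this job with Hadoop's Java MapReduce API could look like the following. It assumes every input line starts with a 24-hour timestamp such as "13:37 some comment text"; that layout, the class names and the paths are made up for this example. Packaged into a jar, it would be submitted with hadoop jar, passing the input and output HDFS paths as arguments.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AmPmCount {

    // Map: one input line -> (AM, 1) or (PM, 1)
    public static class AmPmMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed line layout: "HH:mm <comment text>"
            String line = value.toString().trim();
            if (line.length() < 2) {
                return; // skip lines too short to hold a timestamp
            }
            int hour;
            try {
                hour = Integer.parseInt(line.substring(0, 2));
            } catch (NumberFormatException e) {
                return; // skip lines that do not start with an hour field
            }
            context.write(new Text(hour < 12 ? "AM" : "PM"), ONE);
        }
    }

    // Reduce: (AM, [1, 1, ...]) -> (AM, total)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "am-pm-count");
        job.setJarByClass(AmPmCount.class);
        job.setMapperClass(AmPmMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}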

This example is very basic; a real case could contain terabytes upon terabytes of data, and the data itself could come in different formats. You might be wondering how Hadoop distributed our work in this example and how many Map jobs were created.

We have already talked about DataNodes and data blocks, but we haven't talked about the size of the data blocks. When we configure Hadoop we tell it how big each data block should be. So imagine you put 10 TB of comment data into HDFS and we configured HDFS to create 128 MB data blocks: 10 TB divided by 128 MB gives roughly 82 thousand data blocks, which is the equivalent of roughly 82 thousand Map tasks.
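
Spelled out, the arithmetic looks like this (the 10 TB dataset and the 128 MB block size are just the numbers we assumed above):

public class BlockCount {
    public static void main(String[] args) {
        long datasetBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of comment data
        long blockSizeBytes = 128L * 1024 * 1024;            // 128 MB per block
        long blocks = (datasetBytes + blockSizeBytes - 1) / blockSizeBytes; // round up
        System.out.println(blocks + " blocks"); // prints "81920 blocks", roughly 82 thousand Map tasks
    }
}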

The number of Map and Reduce tasks that run simultaneously depends on how the cluster is configured and on its resources; Hadoop does its best to run as many tasks in parallel as it can while keeping overall performance up.

Hadoop YARN

YARN is short for Yet Another Resource Negotiator. 🙂

We will not delve deep into YARN here, because it is a large update to many of Hadoop's core systems. Instead we will take a short look at why YARN was built and why that is good for us.

So basically YARN reworks the resource management that MapReduce used to handle itself, lifting some of MapReduce's limitations. MapReduce could only work with batches of data, which is why our example from before used a large, fixed dataset instead of an ongoing stream of data. Another limitation was found by Yahoo!, where clusters of 5 thousand nodes with 40 thousand tasks running concurrently would hit a scalability bottleneck.

YARN also opens Hadoop up to new technologies, which in turn has spawned a new set of tools built by other developers to take advantage of Hadoop's strengths.

Here are just a few of the tools that add new functionality to Hadoop.

  • Apache Hive – data warehouse infrastructure providing summarization, querying and analysis through an SQL-like language: HiveQL.
  • Apache Storm – a real-time distributed computation system.
  • Apache Kafka – a fast, scalable publish-subscribe messaging system.
  • and many more

 

The end

Next article: The Universe of Hadoop & Its Many Modules: #2 Installing a single-node cluster.

In the next article we will create a server running Ubuntu Server 14.04.2 LTS. We will install Hadoop, create a NameNode and a DataNode, take a look at the configuration and at HDFS management, explore an example MapReduce algorithm and, last but not least, upload a small test dataset and run the MapReduce job on it.

I truly hope this first article in my series "The Universe of Hadoop" has sparked your interest in playing with big data tools.

Feel free to comment if I made any mistakes or if you have any questions, or just to say hello 🙂

Thanks for reading
Kim