The Series
This is the second article in a series on Hadoop and its many modules. In this article we will set up and configure a single-node Hadoop environment, and we will also test a MapReduce algorithm on a small dataset.
If you haven’t already, I’d recommend reading the first article in this series; it introduces Hadoop, some of its subsystems, and how they work.
In the next few articles in this series we will explore what it takes to set up a multi-node cluster consisting of multiple servers running a MapReduce job. We will also delve into YARN and install and use related modules such as Kafka, Storm and Ambari.
If there is any specific topic you would like me to cover, feel free to leave a comment or contact me 🙂
Series Progress
The Universe of Hadoop & Its Many Modules: #1 A bit of theory – Done
The Universe of Hadoop & Its Many Modules: #2 Installing a single-node cluster – Done
The Universe of Hadoop & Its Many Modules: #3 Managing a multi-node cluster – Not started – On hold until exams are over, end of January
The Universe of Hadoop & Its Many Modules: #4 YARN Modules – Not started
Installing a single-node cluster
In this article we will install and configure a Hadoop cluster consisting of a single node.
We will create a NameNode and a DataNode, take a look at the Hadoop configuration and HDFS management, use a simple MapReduce algorithm, and last but not least upload a small test dataset and run a MapReduce job on it.
Prerequisite
To follow along with this article, I expect that you have:
- Set up a single server running Ubuntu 14.04.3 LTS
- Installed SSH
It does not matter whether it is an old laptop, a virtual machine, a cloud instance or a fully fledged dedicated server.
I have set up a cloud instance at DigitalOcean with the following hardware configuration:
Component | Specification |
---|---|
CPU | 1 Core at 2 GHz |
RAM | 1GB |
Disk | 20GB SSD |
Net | 2TB Transfer |
I can definitely recommend trying DigitalOcean. I have been using them for some time now, mostly their low-end servers which cost $5 to $10 monthly and easily kept up with what I was using them for.
DigitalOcean has a referral program: if I refer new customers, you get $10 which you can spend as you wish. The good thing for me is that if you keep using DigitalOcean and have spent $25, DigitalOcean gives me $25 as a token of appreciation for helping them get a new customer 🙂
Here is the referral link if you feel like giving cloud computing a try: https://www.digitalocean.com/?refcode=510b61d3ce57
First things First
Before we start installing anything we need to make sure our server is up to date.
Execute these two commands:
sudo apt-get update
This will update the server’s package index file. Think of it as a huge list of the packages available for your version of the operating system; before we update or install anything we need to know the newest versions and whether the packages are still available, so the command above fetches the latest package list.
sudo apt-get upgrade
This will update your server and all installed packages to the newest version based on the information from the local package list.
You will probably be prompted whether to install the new updates or not; type y and press Enter, and it will install them.
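If you prefer, the two commands can also be chained into a single line; the -y flag answers the upgrade prompt automatically. This is just an optional shortcut, the two separate commands above do exactly the same thing.
sudo apt-get update && sudo apt-get upgrade -y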
Hadoop Requirements
Hadoop requires Java to run, and the MapReduce algorithm we will be using is also built with Java, so let’s install the Java Runtime Environment.
Execute the following command
sudo apt-get install openjdk-7-jre
This will install the OpenJDK Java Runtime Environment version 7 onto your server; it needs to download around 70MB, which is all right.
Your server should start downloading and installing. When it is done, type the following command to verify that Java is installed correctly.
java -version
It should tell you the name of the program and its version. If it does, everything is fine and we can move on.
Installing Hadoop
First we need to download the latest stable version of Hadoop, which at the time of writing is 2.7.1.
Go to: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Click on the suggested mirror, which is located at the top, right under the Apache Software Foundation logo.
Choose hadoop-2.7.1/
Copy the link from hadoop-2.7.1.tar.gz
On your server execute the following commands.
cd ~
This makes sure you are in the home directory of the user you are logged in as. You might ask why this even matters?
The reason is simplicity: we do not want to download Hadoop into some random system folder, or into another user’s home directory.
wget Insert The Link You Copied Here
This will download Hadoop into the folder you are currently in.
Example in my situation: wget http://mirrors.dotsrc.org/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
ls
This shows the contents of the folder you are currently in; it should display hadoop-2.7.1.tar.gz. If it does, great.
tar -xzf hadoop-2.7.1.tar.gz
This will unpack Hadoop into a folder named hadoop-2.7.1.
mv hadoop-2.7.1 hadoop
This will rename the folder hadoop-2.7.1 to hadoop
Now we just need to tell Hadoop where we put our Java installation.
Execute the following commands.
nano hadoop/etc/hadoop/hadoop-env.sh
This will open a file containing Hadoop environment variables.
In this file you need to find export JAVA_HOME and make sure it says
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
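If you are unsure this path exists on your server (for example on a 32-bit installation the folder may be named differently), you can check where your OpenJDK installation actually lives, either before opening the file or from a second terminal:
ls /usr/lib/jvm/
readlink -f $(which java)
The first command lists the installed JVM folders, and the second prints the full path of the java binary; JAVA_HOME should point to the JVM folder, without the trailing /jre/bin/java part.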
Ninja Tip:
The text editor you are viewing the file in is called Nano. If you wish to quickly search for a particular string, press Ctrl-W, type in your search string and press Enter; this will put your cursor on the first result.
Press Ctrl-W and then Enter without any search string to jump to the next result.
Now we need to save the file. Press Ctrl-X and Nano will ask whether you would like to save the modified buffer; type Y for yes and press Enter.
If for some reason you would not like to save it, you could also type N for no or press Ctrl-C to cancel.
You might ask, why does cancel require pressing the Ctrl key?
Why, I cannot answer, but whenever you see a ^ next to an option, it means that option requires pressing the Ctrl key as well.
Let’s make sure everything is working as it should, execute the following command.
hadoop/bin/hadoop version
It should respond with the Hadoop version and some other information; if it does, it is working.
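Optionally, you can add Hadoop’s bin and sbin folders to your PATH so you do not have to type hadoop/bin/ and hadoop/sbin/ in front of every command. The rest of this article keeps the full paths, so this step is purely a convenience:
echo 'export HADOOP_HOME=$HOME/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc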
Before we move onto the next step, we must first understand the three modes in which Hadoop can run.
- Standalone
- Default configuration; everything runs in a single Java process.
- Pseudo-Distributed
- Distributed simulation: each Hadoop daemon (HDFS, YARN, MapReduce, etc.) runs in a separate Java process.
- Fully Distributed
- Requires multiple servers
We will choose Pseudo-Distributed mode because it is closer to a real environment, before we move on to Fully Distributed in the next article.
Hadoop Configuration
First we will configure the file core-site.xml which contains settings regarding the Hadoop environment.
Execute the following command
nano hadoop/etc/hadoop/core-site.xml
Between the <configuration> and </configuration> tags, paste the following block, then save the file by pressing Ctrl-X, typing Y and pressing Enter.
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
This tells Hadoop where our NameNode is located; since our entire Hadoop environment lives on one server, it is set to localhost.
Now we will configure the file hdfs-site.xml which contains settings regarding the HDFS.
Execute the following commands
mkdir pseudo
Creates a folder named pseudo
nano hadoop/etc/hadoop/hdfs-site.xml
Between the <configuration> and </configuration> tags, paste the following block, then save the file by pressing Ctrl-X, typing Y and pressing Enter.
<property>
  <name>dfs.name.dir</name>
  <value>/home/<YourUsername>/pseudo/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/<YourUsername>/pseudo/dfs/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Remember from my previous article where I wrote about replicating blocks onto several DataNodes? This is the configuration for that feature: the first two properties tell Hadoop where on the local disk the NameNode should keep its metadata and where the DataNode should store its blocks, and dfs.replication specifies that we only want one copy of each data block.
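If you want to double-check that Hadoop actually picks up these settings, the hdfs getconf command prints the effective value of a configuration key. A small optional check:
hadoop/bin/hdfs getconf -confKey fs.defaultFS
hadoop/bin/hdfs getconf -confKey dfs.replication
The first should print hdfs://localhost:9000 and the second should print 1.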
Passwordless SSH
Hadoop uses SSH to establish connections between its nodes, so we are now going to generate a public/private key pair and authorize it.
Execute the following commands
ssh-keygen -t dsa -P '' -f ~/.ssh/id_myKey
This will generate a public/private key pair with an empty passphrase.
cat ~/.ssh/id_myKey.pub >> ~/.ssh/authorized_keys
This appends the content of your public key to the authorized_keys file; this file contains all the public keys whose corresponding private keys are allowed to log in as this user on this server.
In short: Hadoop needs to connect via SSH but there is a lock and we just gave it a key for that lock.
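To test that the lock and key actually work, try SSH’ing to your own server; the first time you will be asked to confirm the host’s authenticity, so type yes, after which you should be logged in without a password prompt.
ssh localhost
exit
If you are still asked for a password, it is most likely because the key file is not named one of SSH’s default identity names (id_rsa, id_dsa, etc.), so SSH does not offer it automatically. In that case you can point SSH at the key by adding the following to ~/.ssh/config (start-dfs.sh may also connect to 0.0.0.0, hence the second host name):
Host localhost 0.0.0.0
    IdentityFile ~/.ssh/id_myKey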
Prepare The File System
Now we need to prepare and format the HDFS. Remember from my previous article that HDFS stands for Hadoop Distributed File System; HDFS is a filesystem within our filesystem.
File system ception?
Before we start running a MapReduce job we need to put our data into the HDFS, then we can run the MapReduce job and afterwards extract the results from HDFS.
To format HDFS execute the following command
hadoop/bin/hdfs namenode -format
It might print quite a lot of text, but don’t be alarmed; the end of the output should look something like this:
15/11/22 07:14:28 INFO util.ExitUtil: Exiting with status 0
15/11/22 07:14:28 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Hadoop01/127.0.1.1
************************************************************/
Yours might not look exactly like this; for example, the SHUTDOWN_MSG might show a different hostname and IP, e.g. localhost/192.168.0.1.
Let’s try to start our Hadoop environment with the following command.
hadoop/sbin/start-dfs.sh
This should tell you it is starting the namenode, datanode, secondarynamenode and where the log resides for each node.
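If you want to make sure the daemons actually came up, you can look for their Java processes. A quick way that works with just the JRE installed (the jps tool mentioned in many guides ships with the JDK package, not the JRE we installed) is:
ps aux | grep -E 'NameNode|DataNode' | grep -v grep
You should see one process for the NameNode, one for the DataNode and one for the SecondaryNameNode.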
You can try to access the web interface of your Hadoop environment by going to http://YourServerIP:50070 in your browser.
I will not go into detail about what you can see in this web interface; it is quite simplistic and just visually shows you your Hadoop environment. Click around and take a look 😉
Prepare Folders & Files
We will now prepare our working folders and files within the HDFS, so follow along and execute the following commands.
hadoop/bin/hdfs dfs -mkdir /user
hadoop/bin/hdfs dfs -mkdir /user/hadoopGuide
Inside our HDFS we first create a folder named user, then inside user we create a folder named hadoopGuide.
You might ask if there is any reason why we called them user and hadoopGuide. There is not; I just thought the names were nice.
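As a side note, the same thing can be done in a single command: the -p flag tells -mkdir to create any missing parent folders, and -ls lets you confirm the folders are there. This is just an alternative to the two commands above:
hadoop/bin/hdfs dfs -mkdir -p /user/hadoopGuide
hadoop/bin/hdfs dfs -ls /user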
hadoop/bin/hdfs dfs -put hadoop/LICENSE.txt /user/hadoopGuide
This one is a bit more complex: in this single command we are working with both the local filesystem of our Ubuntu server and the Hadoop filesystem (HDFS).
The -put command allows us to “put” data into HDFS from our own filesystem.
The first part, “hadoop/LICENSE.txt”, is the data I decided to put into HDFS. You can put in whatever you wish; for this particular MapReduce job, where we will count words, I thought it would be interesting to see how many times each word occurs in the license.
The last part, “/user/hadoopGuide”, is where inside HDFS we want to put our data.
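You can also verify from the command line that the file actually landed in HDFS; -ls works just like the ls you already know, only inside HDFS:
hadoop/bin/hdfs dfs -ls /user/hadoopGuide
It should list LICENSE.txt along with its replication factor, size and modification time.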
Now, go to your Hadoop web interface at http://YourServerIP:50070, click Utilities and then Browse the file system. Here you should be able to see the folders we just created and the data we just put into HDFS.
Run Sample MapReduce Job
Hadoop ships with a few MapReduce examples. You can see the available examples by executing the following command.
hadoop/bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar
Shows available MapReduce examples
The first MapReduce job we want to run is WordCount. It will run through our LICENSE.txt file and count how many times each word occurs. Execute the following command.
hadoop/bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /user/hadoopGuide/LICENSE.txt /user/hadoopGuide/output
It will start running your request, and when it is complete it will print some statistics about the MapReduce job, for example File System Counters, Map-Reduce Framework, Shuffle Errors, File Input Format Counters and File Output Format Counters.
Before we examine the result, let’s talk about how the command is structured.
The first part “hadoop/bin/hadoop jar” tells Hadoop it is time to run a jar file.
The second part “hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount” is where the jar is located and which program the jar should run.
The third part “/user/hadoopGuide/LICENSE.txt” is where inside the HDFS the data for the MapReduce job is located.
The fourth and last part “/user/hadoopGuide/output” is where inside HDFS the result should be stored. We do not need to create the output folder ourselves; Hadoop does that, and it will give you an error if the output folder already exists.
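If you want to run the job again later, you therefore first have to delete the old output folder inside HDFS; -rm -r deletes a folder and its contents recursively:
hadoop/bin/hdfs dfs -rm -r /user/hadoopGuide/output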
Let’s take a look at the results now. As we specified in the command above, the result should be located inside HDFS at “/user/hadoopGuide/output”.
Inside the output directory you should find two files.
_SUCCESS
part-r-00000
We’re interested in part-r-00000; let’s look inside it by executing the following command.
hadoop/bin/hdfs dfs -cat /user/hadoopGuide/output/part-r-00000
This will print all the content to the terminal:
Key (word) | Value (occurrences) |
---|---|
AS | 4 |
Contribution | 1 |
Contributor | 1 |
and so on | and so on |
Remember from my previous article where I wrote about MapReduce and key/value pairs? This is the key/value pair result from the MapReduce job we just executed.
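If you would like to keep the result outside HDFS, for example to open it in a normal text editor, -get copies a file from HDFS to the local filesystem; the local file name wordcount-result.txt below is just my own choice:
hadoop/bin/hdfs dfs -get /user/hadoopGuide/output/part-r-00000 wordcount-result.txt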
Error
There is insufficient memory for the Java Runtime Environment to continue
I originally used a DigitalOcean server which has 512MB of memory and costs $5 monthly.
While writing this article I ran into the error above, which means my server didn’t have enough memory to run the desired MapReduce job.
Native memory allocation (malloc) failed to allocate 104861696 bytes
An error report file with more information is saved as:
/home/<username>/hs_err_pi<number>.log
I opened this file with nano: nano hs_err_piNUMBER.log
Replace NUMBER with the actual number in the filename; it is different for everybody as it is the id of the process.
Search for memory: using the Ninja trick: press Ctrl-W, type memory: and press Enter.
Here you can see how much physical memory the server has available, which in my case is 501792 kilobytes, corresponding to around 500MB. You can also see how much was free, which for me was 6512 kilobytes, around 6.5MB; I assume this is the last amount of free memory before the process crashed.
So Hadoop needed to allocate approximately 104MB more, which maxed out our memory, and the process crashed.
The proper way to address this issue is to add more memory to your server. If by any chance you are using DigitalOcean like I am, you can do the following.
hadoop/sbin/stop-dfs.sh
This will stop Hadoop
sudo shutdown -h now
This will shut down your server, which is necessary in order to add more memory.
Go to your DigitalOcean Droplets interface. To the far left of your Droplet there is a round status circle: green if the Droplet is turned on, grey if it is turned off. It should be grey.
Click on your Droplet and click Resize.
Choose Flexible and select the one for 10$ a month.
This has 512MB more memory than the $5/month Droplet I used before; 1GB of memory in total should be enough.
You might ask, why not Permanent?
Permanent also increases the disk size, and because the filesystem is grown it cannot be downsized again; I do not need the extra disk space for now, so I prefer to upgrade only the CPU and RAM for testing purposes.
When you are satisfied with your choice, click Resize and wait for the progress bar to fill up.
When it is done, click on Power and then Power On, and again wait for the progress bar to fill up.
IMPORTANT
You might need to format your HDFS again; yes, you will lose the folders and files inside HDFS.
Make sure your Hadoop environment is not running, then start again from Prepare The File System.
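For reference, the sequence reuses the same commands from earlier in the article: stop Hadoop if it is still running, clear the old HDFS data on disk (otherwise the DataNode may refuse to start because the freshly formatted NameNode gets a new cluster ID), format the HDFS and start it back up.
hadoop/sbin/stop-dfs.sh
rm -rf ~/pseudo/dfs
hadoop/bin/hdfs namenode -format
hadoop/sbin/start-dfs.sh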
The End
Next article status: The Universe of Hadoop & Its Many Modules: #3 Managing a multi-node cluster is currently NOT STARTED. Due to exams I have put the 3rd article on hold for now; at the end of January I should be able to begin writing it.
In the next article we’ll expand on what we learned here and set up a multi-node Hadoop cluster involving multiple servers. We will also explore the additional configuration needed for the nodes to recognize and communicate with each other.
Feel free to comment if I made any mistakes or if you have any questions, or just to say hello 🙂
Thanks for reading
Kim