I’ve been hooked on Debian lately. I’m still fairly new to the Linux environment, having been on the Windows side for way too long. After spending the past two months getting familiar with Linux, I decided to pick up a book about Hadoop, mainly because I’m interested in processing big data. While this is a great book, it seems to assume that you are already familiar with Linux and Java. Working through it has been a fun learning experience for me, and these notes might be useful for others who are struggling to get Hadoop set up for the first time. If you are a Debian guru, please be gentle: this is my first Debian-related post.

Let’s dive in. I’m assuming that you have a clean install of Debian with nothing but SSH installed. You need to have the following packages installed:

  • sudo (optional). To install it, log in as root (type su and enter the root password), then run apt-get install sudo.
    • Give your username the ability to use sudo by adding a line to /etc/sudoers:
      • Run vi /etc/sudoers and add the following line under "User privilege specification". Hit the i key to insert text and the Escape key to leave insert mode, then type :wq to save the file and quit the editor. If vi complains that the file is read-only, use :wq! instead. The visudo command is a safer way to edit this file because it checks the syntax before saving.
      • yourUserName ALL=(ALL) ALL
  • vim (type sudo apt-get install vim). You also need an SSH server, which I installed during my Debian installation.
  • Generate a private and public key pair for the current user:
    • Type ssh-keygen and accept the default location by hitting Enter.
    • You can choose to protect your private key with a passphrase. Leave it empty if you want to log in without being prompted.
    • After the pair is generated, run cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    • To make sure that this was done correctly, run ssh localhost. You should get a prompt without having to type your password again. The first time you connect, you will see a message like this:
      The authenticity of host ‘localhost (’ can’t be established.
      RSA key fingerprint is xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx.
      Are you sure you want to continue connecting (yes/no)? yes
      Warning: Permanently added ‘localhost’ (RSA) to the list of known hosts.
    • The next time you ssh in, the message above shouldn’t appear again.
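
One thing to watch with the cat ... >> step: running it twice adds the same key twice. Here is a minimal sketch of a safer version that only appends the key when it is missing. The function name authorize_key is my own invention, not part of OpenSSH, and I'm assuming the public key file holds a single line.

```shell
#!/bin/sh
# authorize_key PUBLIC_KEY_FILE AUTHORIZED_KEYS_FILE
# Hypothetical helper: append the public key only if it is not already listed.
authorize_key() {
    pub_key_file=$1
    auth_keys_file=$2

    # Create the file if needed and keep it private to the current user.
    touch "$auth_keys_file"
    chmod 600 "$auth_keys_file"

    # -F: fixed strings, -x: match whole lines, -f: read the pattern from a file.
    if ! grep -qxFf "$pub_key_file" "$auth_keys_file"; then
        cat "$pub_key_file" >> "$auth_keys_file"
    fi
}
```

With the default key location you would call it as authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys.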

All of the above is Debian configuration. Now let’s set up a single-node Hadoop. Some of this is described in Appendix A of the book I mentioned above, but the instructions there seem to oversimplify things. I’ll try to go into more detail on how to install Java and Hadoop for a first-timer. If you just follow the installation instructions in Appendix A and then try to run the commands on page 23:

$ export HADOOP_CLASSPATH=build/classes
$ hadoop MaxTemperature input/ncdc/sample.txt output

You will get the following error (even after installing the JDK):

Exception in thread "main" java.lang.NoClassDefFoundError: MaxTemperature
Caused by: java.lang.ClassNotFoundException: MaxTemperature
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

If you know how Java works, you will recognize the cause: you haven’t yet compiled your .java files into .class files for the JVM to run, so Hadoop cannot find the MaxTemperature class. To compile them, you first need to install and configure the JDK and Hadoop.

  • The first thing you need is the JDK (Java Development Kit). To get it, you need to update your sources.list file, which lists the repositories that apt fetches your Debian packages from. The JDK package lives in Debian’s non-free section.
  • Type the following command at the prompt: sudo vim /etc/apt/sources.list
  • Add the following to your sources.list. I’m using the repository hosted by Indiana University. If you prefer a repository closer to your location, substitute a Debian mirror of your choice:
#JAVA non free
deb http://ftp.uwsg.indiana.edu/linux/debian/ squeeze main non-free
deb-src http://ftp.uwsg.indiana.edu/linux/debian/ squeeze main non-free
  • While you are at it, add the following repository too so that you can install Hadoop later:

deb http://archive.cloudera.com/debian squeeze-cdh3 contrib
deb-src http://archive.cloudera.com/debian squeeze-cdh3 contrib

  • Run sudo apt-get update
  • Install Java by running sudo apt-get install sun-java6-jdk. Confirm the license agreement. This should download and install the JDK from the repository you specified above.
  • Now add Cloudera’s public key to your system, so that apt can verify that the packages really come from Cloudera. From your home directory (the user@host:~$ prompt), run wget http://archive.cloudera.com/debian/cloudera.key && sudo apt-key add ~/cloudera.key. To see whether Cloudera’s key was added, run sudo apt-key list. If the key was added correctly, you should see an entry for it.
  • Now install Hadoop by running sudo apt-get install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker. If you didn’t add Cloudera’s public key, you will be warned that the source of the packages is not authenticated.
  • Now it’s time to set up several environment variables:
  1. After the installation, there should be a directory called hadoop-0.20 within the /usr/lib/ directory. Its full path is the value of the HADOOP_INSTALL environment variable. To check, run ls -l -d /usr/lib/h*/. Make a note of this value!
  2. Get the path to your Java install location by running readlink -f /usr/bin/java | sed 's:/bin/java::'. This follows the symbolic link from /usr/bin/java all the way down to its actual location and strips the trailing /bin/java. Copy the result of this command! This is your JAVA_HOME value.
  3. Run this command and make a note of the result: hadoop version | sed 's:^Hadoop 0:hadoop-0:' | head -n 1. This is the value of your HADOOP_VERSION.
  4. I like things organized, so I created the following directory structure in my home directory (items 5 to 8):
  5. ~/hadoop/build/sources: this is where I store all my *.java files
  6. ~/hadoop/build/classes: this is where I store all my *.class files (the output of compiling the *.java files with the javac command)
  7. ~/hadoop/input: this is where I store all the text files to be processed by MapReduce
  8. ~/hadoop/output: this is where I store all the output after processing by MapReduce
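
If the sed commands in steps 2 and 3 look cryptic, here is a small sketch that tries them on sample strings, followed by a helper that creates the directory layout. The path /usr/lib/jvm/example-jdk is made up for illustration, the version string is the one from my own machine, and make_layout is a hypothetical function name. I leave the output directory out on purpose, since Hadoop wants to create it itself when the job runs.

```shell
#!/bin/sh
# Step 2: strip the trailing /bin/java from a resolved path.
# The sample path is invented; on a real system, pipe in `readlink -f /usr/bin/java`.
java_home=$(echo "/usr/lib/jvm/example-jdk/bin/java" | sed 's:/bin/java::')
echo "$java_home"        # prints /usr/lib/jvm/example-jdk

# Step 3: turn the first line of `hadoop version` into the jar name prefix.
# The sample text stands in for the real command's output.
hadoop_version=$(printf 'Hadoop 0.20.2-cdh3u3\nremaining lines are ignored\n' \
    | sed 's:^Hadoop 0:hadoop-0:' | head -n 1)
echo "$hadoop_version"   # prints hadoop-0.20.2-cdh3u3

# Step 4: create the working directories under a base directory.
# make_layout is a hypothetical helper; the output directory is not created here.
make_layout() {
    base_dir=$1
    mkdir -p "$base_dir/build/sources" "$base_dir/build/classes" "$base_dir/input"
}
```

To get the same layout I use, call make_layout ~/hadoop.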
  • Run vim ~/.bashrc and add the following at the very end (in Vim, hit G to jump to the last line and o to open a new line below it). Adjust the values based on the output of #1, #2 and #3 above and on how you organized the directories in #4. A hash mark (#) indicates a comment.
# Hadoop and Java settings
# based on #6 above
declare -x HADOOP_CLASSPATH=~/hadoop/build/classes
# based on #1 above
declare -x HADOOP_INSTALL="/usr/lib/hadoop-0.20"
# based on #3 above
declare -x HADOOP_VERSION="hadoop-0.20.2-cdh3u3"
# based on #2 above
declare -x JAVA_HOME="/usr/lib/jvm/java-6-sun-"
# tells Java where to look for classes and libraries
declare -x CLASSPATH=$HADOOP_INSTALL/$HADOOP_VERSION-core.jar:~/hadoop/build/classes
export PATH=$PATH:/sbin:$JAVA_HOME/bin
  • Escape from insert mode, then save the file and quit (type :wq in Vim). Run the bash command to start a new shell that picks up the settings you just added (source ~/.bashrc does the same in your current shell).
  • Now you are ready to compile all your *.java files. I’m assuming you have all the *.java files stored in ~/hadoop/build/sources. I will be storing the compiled code in ~/hadoop/build/classes. Run the following command to compile MaxTemperature.java, MaxTemperatureMapper.java and MaxTemperatureReducer.java:
    • javac ~/hadoop/build/sources/*.java -d ~/hadoop/build/classes
    • If all goes well, all your *.java files will be compiled, and the resulting *.class files will be stored in ~/hadoop/build/classes. Check your classes directory by running ls ~/hadoop/build/classes. You should see 3 corresponding class files.
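
If you end up recompiling a lot, the compile step can be wrapped in a small function. This is only a sketch: compile_sources is a name I made up, and it relies on the CLASSPATH variable from ~/.bashrc being set so that javac can find the Hadoop core jar.

```shell
#!/bin/sh
# compile_sources SOURCE_DIR CLASSES_DIR
# Hypothetical helper: compile every .java file in SOURCE_DIR into CLASSES_DIR.
compile_sources() {
    src_dir=$1
    out_dir=$2

    # Stop early with a readable message if there is nothing to compile.
    if ! ls "$src_dir"/*.java >/dev/null 2>&1; then
        echo "no .java files found in $src_dir" >&2
        return 1
    fi

    # Older javac versions expect the -d directory to exist already.
    mkdir -p "$out_dir"

    # javac reads the CLASSPATH environment variable on its own.
    javac -d "$out_dir" "$src_dir"/*.java
}
```

For the layout in this post, the call would be compile_sources ~/hadoop/build/sources ~/hadoop/build/classes.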
  • Now we are ready to run Hadoop. Make sure you have the input file that the compiled application expects. If you don’t, you can download one from http://ftp3.ncdc.noaa.gov/pub/data/noaa/ using wget. For example, to download http://ftp3.ncdc.noaa.gov/pub/data/noaa/1911/029170-99999-1911.gz into your ~/hadoop/input/ncdc/ folder, run wget -P ~/hadoop/input/ncdc/ http://ftp3.ncdc.noaa.gov/pub/data/noaa/1911/029170-99999-1911.gz. To unpack it into a plain text file, run gunzip ~/hadoop/input/ncdc/*.gz. Then rename the file to sample.txt by running mv ~/hadoop/input/ncdc/029170-99999-1911 ~/hadoop/input/ncdc/sample.txt
  • Try running Hadoop again as instructed by the book. We don’t have to set HADOOP_CLASSPATH this time because it is already set as an environment variable. Hadoop refuses to write into an output directory that already exists, so remove or rename ~/hadoop/output first if it is there. Then run hadoop MaxTemperature ~/hadoop/input/ncdc/sample.txt ~/hadoop/output.
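
A small wrapper can check both paths before handing off to hadoop, which saves you from reading a Java stack trace for a simple mistake. This is a sketch under my own assumptions: run_max_temperature is a hypothetical name, and the return codes 1 and 2 are arbitrary choices.

```shell
#!/bin/sh
# run_max_temperature INPUT_FILE OUTPUT_DIR
# Hypothetical helper: validate the paths, then run the MaxTemperature job.
run_max_temperature() {
    input_file=$1
    output_dir=$2

    # The job needs an existing plain text input file.
    if [ ! -f "$input_file" ]; then
        echo "input file not found: $input_file" >&2
        return 1
    fi

    # Hadoop will not write into an output directory that already exists.
    if [ -e "$output_dir" ]; then
        echo "output path already exists: $output_dir" >&2
        return 2
    fi

    hadoop MaxTemperature "$input_file" "$output_dir"
}
```

For this post, the call would be run_max_temperature ~/hadoop/input/ncdc/sample.txt ~/hadoop/output.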

If all is well, you should see output similar to what’s printed in the book. This is the output I got:

12/05/14 12:07:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/05/14 12:07:26 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/05/14 12:07:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/05/14 12:07:26 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/05/14 12:07:27 WARN snappy.LoadSnappy: Snappy native library is available
12/05/14 12:07:27 INFO snappy.LoadSnappy: Snappy native library loaded
12/05/14 12:07:27 INFO mapred.FileInputFormat: Total input paths to process : 1
12/05/14 12:07:27 INFO mapred.JobClient: Running job: job_local_0001
12/05/14 12:07:28 INFO util.ProcessTree: setsid exited with exit code 0
12/05/14 12:07:28 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@10d09ad3
12/05/14 12:07:28 INFO mapred.MapTask: numReduceTasks: 1
12/05/14 12:07:28 INFO mapred.MapTask: io.sort.mb = 100
12/05/14 12:07:28 INFO mapred.MapTask: data buffer = 79691776/99614720
12/05/14 12:07:28 INFO mapred.MapTask: record buffer = 262144/327680
12/05/14 12:07:28 INFO mapred.MapTask: Starting flush of map output
12/05/14 12:07:28 INFO mapred.MapTask: Finished spill 0
12/05/14 12:07:28 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/05/14 12:07:28 INFO mapred.LocalJobRunner: file:/home/bwibowo/hadoop/input/ncdc/sample.txt:0+150522
12/05/14 12:07:28 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/05/14 12:07:28 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3c50507
12/05/14 12:07:28 INFO mapred.LocalJobRunner:
12/05/14 12:07:28 INFO mapred.Merger: Merging 1 sorted segments
12/05/14 12:07:28 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 11937 bytes
12/05/14 12:07:28 INFO mapred.LocalJobRunner:
12/05/14 12:07:28 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/05/14 12:07:28 INFO mapred.LocalJobRunner:
12/05/14 12:07:28 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now