Hadoop


Pig doesn’t support scalar variable assignment. That is, you cannot write a statement like this:

var = 3

The smallest unit a variable can hold is a single tuple containing a single value:

var = {3}

So, say you have a variable X containing two columns:

(word1,1)
(word2,4)
(word3,14)

and you need to do some math on the second column using the value stored in the variable var above.

The following statement won’t work:

result = FOREACH X GENERATE $1*var;

Instead, you need to join the two variables so that every row of X gets an additional column containing the value from var. In other words, you need to produce the following data before proceeding with your calculation:

(word1,1,3)
(word2,4,3)
(word3,14,3)

To accomplish this, you need to do the following:

temp = JOIN X BY 1, var BY 1 USING 'replicated';

Now you can do your math operation, since the value from var is the third column ($2) of every row:

result = FOREACH temp GENERATE $1*$2;
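If you want to see what the two steps do to the data without firing up Pig, here is a rough shell sketch that mimics them with awk. This is not Pig code, just an illustration: attach_scalar and multiply_columns are names I made up, and the comma-separated files are an assumption about how you might store X and var.

```shell
#!/bin/sh
# Illustration only: mimics the Pig JOIN + FOREACH steps with awk.
# attach_scalar and multiply_columns are hypothetical helper names.

# Step 1 (the JOIN): append the single value stored in file $1
# to every comma-separated row of file $2.
attach_scalar() {
    # NR == FNR is true only while awk reads the first file (the scalar).
    awk -F',' 'NR == FNR { v = $1; next } { print $0 "," v }' "$1" "$2"
}

# Step 2 (the FOREACH): multiply the second and third columns of stdin.
multiply_columns() {
    awk -F',' '{ print $2 * $3 }'
}

# Example usage:
#   printf '3\n' > var.txt
#   printf 'word1,1\nword2,4\nword3,14\n' > X.txt
#   attach_scalar var.txt X.txt | multiply_columns
# prints 3, 12 and 42, one per line.
```

The point is the same as in Pig: the single value has to be copied onto every row before the per-row multiplication can see it.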

I’m hooked on Debian as of late. I’m still kinda new to the Linux environment. I’ve been on the Windows side for waaay too long. So, after familiarizing myself with Linux for the past 2 months, I decided to pick up a book about Hadoop, mainly because I’m interested in processing big data. While this is a great book, it seems to assume that you are already familiar with Linux and Java. This has been a fun learning experience for me, and it might be useful for others who are struggling to get Hadoop set up for the first time. If you are a Debian guru, please be gentle. This is my first Debian-related post.

Let’s dive in. I’m assuming that you have a clean install of Debian, with nothing but SSH installed. You need to have the following packages installed:

  • sudo (optional). To install it, log in as root (type su and enter the root password), then run apt-get install sudo.
    • Give your username the ability to sudo by adding a line to /etc/sudoers:
      • Run vi /etc/sudoers and add the following line under "User privilege specification" (hit the i key to insert text and the Escape key to leave insert mode; type :wq to save the file and quit the editor, or :wq! if vi complains that the file is read-only):
      • yourUserName ALL=(ALL) ALL
  • vim (type sudo apt-get install vim). You also need an SSH server, which I installed during my Debian installation.
  • Generate a private and public key pair for the current user:
  • Type ssh-keygen and accept the default location by hitting Enter.
  • You can choose to protect your private key with a passphrase (leave it empty if you want to log in with no prompt at all).
  • After the pair is generated, run cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • To make sure this was done correctly, run ssh localhost. You should get a prompt without having to type in your password again.
  • The first time you connect, you will be asked to confirm the host key:
    The authenticity of host 'localhost (127.0.0.1)' can't be established.
    RSA key fingerprint is xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
  • The next time you ssh in, the message above shouldn’t appear again.
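The key setup steps above can be sketched as a small shell function. Treat it as a rough sketch, not a polished script: setup_local_key is a name I made up, the optional directory argument only exists so you can try it on a scratch directory first, and it assumes an RSA key with an empty passphrase.

```shell
#!/bin/sh
# Sketch of the key setup above. setup_local_key is a hypothetical name;
# the optional argument defaults to ~/.ssh.
setup_local_key() {
    ssh_dir=${1:-"$HOME/.ssh"}
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1

    # Generate an RSA pair with an empty passphrase unless one already exists.
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
    fi

    # Authorize the public key for logins to this account.
    cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys" || return 1

    # Keep the file private; sshd with its default StrictModes setting
    # can refuse keys from files that other users are able to write to.
    chmod 600 "$ssh_dir/authorized_keys"
}

# Example usage (writes to ~/.ssh):
#   setup_local_key
#   ssh localhost
```

Running the function a second time appends the same public key again, which is harmless but untidy.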

Everything above is Debian configuration. Now, let’s try to set up single-node Hadoop. Some of this is described in Appendix A of the book I mentioned above. However, the instructions seem to oversimplify things, so I’ll try to go into more detail on how to install Java and Hadoop for a first-timer. If you just follow the installation instructions in Appendix A and try to run the command on page 23:

$ export HADOOP_CLASSPATH=build/classes
$ hadoop MaxTemperature input/ncdc/sample.txt output

You will get the following error (even after installing the JDK):

Exception in thread "main" java.lang.NoClassDefFoundError: MaxTemperature
Caused by: java.lang.ClassNotFoundException: MaxTemperature
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
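A ClassNotFoundException like this generally means the JVM could not find a compiled MaxTemperature.class anywhere on its classpath. As a quick sanity check (this is my own sketch, not a step from the book), you can test whether the class file is actually sitting in the directory that HADOOP_CLASSPATH points to. class_on_path is a made-up helper name, and it assumes the class has no package, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: report whether a compiled class is present in a
# classpath directory. Assumes a class in the default (unnamed) package.
class_on_path() {
    dir=$1
    class=$2
    if [ -f "$dir/$class.class" ]; then
        echo "found $dir/$class.class"
    else
        echo "missing $dir/$class.class" >&2
        return 1
    fi
}

# Example usage, from the directory where you run hadoop:
#   class_on_path build/classes MaxTemperature
```

If it reports the file as missing, the source has most likely not been compiled into that directory yet.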