Getting Started With the Hadoop Cluster via ssh

You can ssh into any of (data4|data5|data6).cs.rutgers.edu; if you follow these instructions, everything will persist across all three machines.

Setting Up Your Environment

There are three directories you’ll want to be aware of (among others, if you’re familiar with the Rutgers CS computing universe):

  • ~ is your ilab home directory. Its quota is low, so use it for settings and not much else. Anything here will be part of your home directory on ilab.cs.rutgers.edu and related machines.
  • /common/users/<username> is quota-free and backed up, and you can access files stored here from any departmental machine. However, HDFS and related Hadoop services cannot access this directory.
  • /common/clusterdata/<username> is quota-free, not backed up, and can be accessed by HDFS and related services.

Edit your ~/.bashrc so that it contains the following entries:

export JAVA_HOME="/usr/java/jdk1.8.0_181-amd64/jre"

export SPARK_MAJOR_VERSION=2

export PYTHONPATH=/usr/hdp/3.1.0.0-78/spark2/python:/usr/hdp/3.1.0.0-78/spark2/python/lib/py4j-0.10.7-src.zip

To use nltk properly, you’ll need to download some additional data and store it somewhere that doesn’t count against your ~ quota yet can be accessed by your Hadoop applications. Create a directory /common/clusterdata/<username>/nltk_data and then add the following to your ~/.bashrc:

export NLTK_DATA=/common/clusterdata/<username>/nltk_data
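The directory creation and the ~/.bashrc entry can also be done in one step. The following is a minimal sketch, not part of the cluster tooling: setup_nltk_data is a hypothetical helper name, and its defaults assume that $USER matches your <username>.

```shell
# Hypothetical helper (illustration only): create an nltk_data directory under
# a base directory and record NLTK_DATA in a bashrc-style file.
# Assumption: with no arguments, $USER is your <username> on the cluster.
setup_nltk_data() {
  base="${1:-/common/clusterdata/$USER}"
  rcfile="${2:-$HOME/.bashrc}"
  line="export NLTK_DATA=$base/nltk_data"

  # -p makes this a no-op if the directory already exists.
  mkdir -p "$base/nltk_data" || return 1

  # Append the export only if that exact line is not already in the file,
  # so running the helper twice does not duplicate the entry.
  if ! grep -qxF -- "$line" "$rcfile" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rcfile"
  fi
}
```

Calling setup_nltk_data with no arguments would use the paths described above. Once NLTK_DATA is set, something like python -m nltk.downloader -d "$NLTK_DATA" <package> should place the data there; which packages you need depends on your application.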

For these changes to take effect, either log out and log back in, or run source ~/.bashrc.
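To confirm that the new settings are visible in your current shell, a quick check along the following lines may help. This is a sketch only: check_hadoop_env is a hypothetical name, and it checks only that each variable is non-empty, not that the paths it holds are valid.

```shell
# Hypothetical helper (illustration only): report whether each variable from
# the ~/.bashrc entries above is set in the current shell.
# Returns non-zero if any of them is missing or empty.
check_hadoop_env() {
  missing=0
  for name in JAVA_HOME SPARK_MAJOR_VERSION PYTHONPATH NLTK_DATA; do
    # Indirect lookup of the variable called $name. The names are the fixed
    # list above, so eval is not handling untrusted input.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      printf 'ok       %s=%s\n' "$name" "$value"
    else
      printf 'MISSING  %s\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

If any line reads MISSING, re-check the corresponding export in ~/.bashrc and run source ~/.bashrc again.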

Verifying That Things Work

  1. Verify that your Hadoop directory exists with hdfs dfs -ls /user/<username>. Everything you want to store in HDFS must be placed in this directory.
  2. Run hadoop jar /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100. While it runs, it should appear on the YARN job list next to your username as QuasiMonteCarlo, a MAPREDUCE job. You shouldn’t encounter any errors, and at the end you should see output like the following (the elapsed time will vary):

    Job Finished in 46.837 seconds
    Estimated value of Pi is 3.14800000000000000000
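If you would rather check the result from a script than by eye, the final line can be pulled out of the job output. The sketch below assumes the output format shown above; extract_pi is a hypothetical helper name, not something provided by Hadoop.

```shell
# Hypothetical helper (illustration only): read job output on stdin and print
# the value from the "Estimated value of Pi is ..." line.
# Fails (non-zero status) if that line never appears.
extract_pi() {
  value=$(sed -n 's/^[[:space:]]*Estimated value of Pi is //p')
  [ -n "$value" ] && printf '%s\n' "$value"
}

# Example use on the cluster (not run here); 2>&1 folds the job log into the
# same stream so the helper sees the result line wherever it is printed:
#   hadoop jar /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100 2>&1 | extract_pi
```

A non-zero exit status from extract_pi suggests the job did not reach its final line, in which case the full output is the place to look for errors.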