ssh into any of
(data1|data2|data3).cs.rutgers.edu; if you follow these instructions, everything will persist across all three machines.
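For example (assuming your CS username doubles as your login on these hosts):
ssh <username>@data1.cs.rutgers.edu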
Setting Up Your Environment
Three directories you’ll want to be aware of (among others, if you’re familiar with the Rutgers CS computing universe):
- ~ is your ilab home directory and should store settings and not much else, since your quota there is low. Anything here will be part of your home directory on ilab.cs.rutgers.edu and related machines.
- /common/users/<username> is quota-free, backed up, and files stored here can be accessed from any departmental machine. However, HDFS and related Hadoop services cannot access this directory.
- /common/clusterdata/<username> is quota-free, not backed up, and can be accessed by HDFS and related services.
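As a quick sanity check that all three locations exist for your account (a minimal sketch, assuming your shell sets $USER to your CS username):
ls -ld ~ /common/users/$USER /common/clusterdata/$USER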
Edit your ~/.bashrc so that it contains the following entries:
export JAVA_HOME="/usr/java/jdk1.8.0_102/jre/"
export SPARK_MAJOR_VERSION=2
export PYTHONPATH=/usr/hdp/2.6.3.0-235/spark2/python:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip
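Once these entries take effect (see below), the PYTHONPATH setting should let a plain Python interpreter import pyspark without launching anything on the cluster. A quick sanity check (this assumes the python on your PATH is one the bundled PySpark supports):
python -c 'import pyspark; print(pyspark.__version__)'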
In order to properly use nltk, we'll need to download some additional data, and we need somewhere to store it so that it doesn't eat into our ~ quota yet can still be accessed by our Hadoop applications. Create a directory /common/clusterdata/<username>/nltk_data and then add the following to your ~/.bashrc as well:
export NLTK_DATA=/common/clusterdata/<username>/nltk_data
For these changes to take effect, you can either log out and log back in, or run source ~/.bashrc.
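With NLTK_DATA in place, you can fetch whatever corpora your application needs into the cluster-visible directory. A minimal example (punkt is just an illustrative choice, and the explicit download_dir mirrors the path above):
python -c 'import nltk; nltk.download("punkt", download_dir="/common/clusterdata/<username>/nltk_data")'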
Verifying That Things Work
- Verify that your HDFS home directory exists with hdfs dfs -ls /user/<username>. Everything you want to store in HDFS will be placed under this directory (see the round-trip example after this list).
- Run a test MapReduce job with hadoop jar /usr/hdp/2.6.3.0-235/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.3.0-235.jar pi 10 100. You should see it appear on the YARN job list next to your username as a MAPREDUCE job. You shouldn't encounter any errors, and at the end you should see:
Job Finished in 51.157 seconds
Estimated value of Pi is 3.14800000000000000000
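As a further smoke test, you can round-trip a small file through your HDFS directory (the filename is just an example):
echo hello > hello.txt
hdfs dfs -put hello.txt /user/<username>/
hdfs dfs -cat /user/<username>/hello.txt
hdfs dfs -rm /user/<username>/hello.txt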