Running Spark from the Command Line

In this tutorial, we’ll find the n most “interesting” unigrams in a large text file by submitting a Spark job to the cluster from the command line.

Download any sizable text file, e.g. a book from Project Gutenberg, and upload it to HDFS. Spark can read data from the local filesystem, but only if the script also runs locally (i.e. without --master yarn).
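For example, assuming the file is named text_file and your HDFS home directory is /user/<username> (the path used by the spark-submit command below), the upload looks like:

hdfs dfs -mkdir -p /user/<username>
hdfs dfs -put text_file /user/<username>/text_file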

Download this script to /common/clusterdata/<username> and run it as:

spark-submit --master yarn nltk_unigram_count_with_pyspark.py hdfs:///user/<username>/text_file

You may be prompted to download some nltk data; follow the instructions and try the script again. The data should be downloaded to the directory that $NLTK_DATA points to.
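If the prompt doesn't say exactly which packages to fetch, and assuming the script relies on NLTK's tokenizer and English stopword list, the usual candidates are punkt and stopwords. With $NLTK_DATA pointing at a writable directory, they can be fetched with nltk's downloader:

python -m nltk.downloader -d $NLTK_DATA punkt stopwords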

When the job succeeds, stdout shows the 10 most frequent unigrams in decreasing order of count, with stopwords removed, tokens normalized to lower case, and extraneous punctuation stripped. Try requesting a larger list with the --n argument; some words you might consider stopwords may still appear.
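For reference, below is a minimal sketch of the kind of pipeline such a script implements. It is not the tutorial's script; the argument names, the choice of NLTK's word_tokenize with its English stopword list, and the output format are assumptions.

# unigram_count_sketch.py -- a sketch of a top-n unigram count in PySpark;
# names and structure are assumptions, not the tutorial's actual script.
import argparse
import string

from pyspark.sql import SparkSession
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("path", help="HDFS or local path to the text file")
    parser.add_argument("--n", type=int, default=10,
                        help="number of unigrams to print")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("unigram_count").getOrCreate()
    sc = spark.sparkContext

    stops = set(stopwords.words("english"))

    def interesting_tokens(line):
        # lower-case, strip surrounding punctuation, drop stopwords and empties
        for tok in word_tokenize(line.lower()):
            tok = tok.strip(string.punctuation)
            if tok and tok not in stops:
                yield tok

    counts = (
        sc.textFile(args.path)
          .flatMap(interesting_tokens)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b)
    )

    # takeOrdered with a negated count as the key returns the n most frequent
    for word, count in counts.takeOrdered(args.n, key=lambda wc: -wc[1]):
        print(f"{word}\t{count}")

    spark.stop()


if __name__ == "__main__":
    main()

Note that the tokenization runs inside the executors, so the nltk data must be readable (e.g. via $NLTK_DATA) on every node that runs tasks, not only on the machine that submits the job.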