In this tutorial, we’ll find the
n most “interesting” unigrams in a large text file by submitting a Spark job to the cluster from the command line.
Download any sizable file of text, e.g. a book from Project Gutenberg. Upload this to HDFS; Spark can read data from the local filesystem, but only if we also run the script locally (e.g. without --master yarn).
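For example, assuming the downloaded file is named text_file, you can copy it into your HDFS home directory with:
hdfs dfs -put text_file hdfs:///user/<username>/text_file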
Download this script to /common/clusterdata/<username> and run it as:
spark-submit --master yarn nltk_unigram_count_with_pyspark.py hdfs:///user/<username>/text_file
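For reference, the core logic of such a script might look roughly like the sketch below. This is only an illustration, not the actual nltk_unigram_count_with_pyspark.py: the argument names (the input path and --n) are inferred from how the script is invoked in this tutorial, and the real implementation may differ.

import argparse
import string

from pyspark.sql import SparkSession
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input_path", help="HDFS (or local) path to the text file")
    parser.add_argument("--n", type=int, default=10, help="how many unigrams to report")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("nltk_unigram_count").getOrCreate()
    sc = spark.sparkContext

    # NLTK's English stopword list; requires the "stopwords" corpus to be available.
    stop = set(stopwords.words("english"))

    def tokens(line):
        # Lower-case, tokenize with NLTK (needs the "punkt" models on the workers),
        # strip surrounding punctuation, and drop stopwords and empty tokens.
        for tok in word_tokenize(line.lower()):
            tok = tok.strip(string.punctuation)
            if tok and tok not in stop:
                yield tok

    counts = (
        sc.textFile(args.input_path)
          .flatMap(tokens)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b)
    )

    # takeOrdered with a negated count as the key returns the n most frequent unigrams.
    for word, count in counts.takeOrdered(args.n, key=lambda wc: -wc[1]):
        print(word, count)

    spark.stop()


if __name__ == "__main__":
    main()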
You may be prompted to download some nltk data; follow the instructions and try the script again. The data should be downloaded to the location of
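If you prefer to fetch the data ahead of time, NLTK's downloader can also be run from the command line. Which resources this particular script needs is an assumption here; punkt (tokenizer models) and stopwords are the typical ones for tokenizing and filtering stopwords:
python -m nltk.downloader punkt stopwords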
When successful, you'll see on stdout the top 10 most frequently occurring unigrams in decreasing order of count, excluding stopwords, normalized to lower-case, and stripped of extraneous punctuation. Try requesting a larger list with the --n argument; some words you might consider stopwords may still appear.
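For example, assuming --n simply takes an integer count, you could ask for the top 25 with:
spark-submit --master yarn nltk_unigram_count_with_pyspark.py hdfs:///user/<username>/text_file --n 25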