# Building a simple word count application with Spark

This lab builds on the techniques covered in the first Spark workshop. We will develop a simple application that counts the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page).

This lab is mandatory for Workshop 2 and required to validate your registration. 

#### Read me first:
You must execute each cell, filling in the appropriate code where necessary.
At the end of the notebook, a code is generated; copy and paste it into the meetup registration.

Set up imports and functions

In [None]:
# Just execute this cell
import os.path
import re
import hashlib

Load the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page).

In [None]:
# Just execute this cell
baseDir = os.path.join('data')
inputPath = os.path.join('shakespeare.txt')
fileName = os.path.join(baseDir, inputPath)

shakespeareRDD = (sc
 .textFile(fileName, 8))

shakespeareRDD.cache()
print '\n'.join(shakespeareRDD
 .zipWithIndex() # to (line, lineNum)
 .map(lambda (l, num): '{0}: {1}'.format(num, l)) # to 'lineNum: line'
 .take(15))

In [None]:
# Just execute this cell
def toLower(text):
 """
 Changes all text to lower case.
 """
 return text.lower()

print toLower('Hello WORLD') #should be "hello world"

#### The function `removePunctuation` removes any punctuation. We use the Python [re](https://docs.python.org/2/library/re.html) module to remove any character that is not a letter, number, or whitespace.
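The choice of an explicit character class matters for the underscore. The sketch below compares the lab's pattern with a `\w`-based alternative that is shown for comparison only; the names `explicit`, `word_class`, and `sample` are illustrative and do not appear in the lab.

```python
import re

# Same character class as the lab: keep ASCII letters, digits, and whitespace.
explicit = re.compile(r"[^a-zA-Z0-9\s]")
# Alternative for comparison only: \w also matches the underscore, so it survives.
word_class = re.compile(r"[^\w\s]")

sample = "No under_score, it's 98-9800!"
explicit_result = explicit.sub('', sample)
word_class_result = word_class.sub('', sample)

print(explicit_result)    # underscore removed
print(word_class_result)  # underscore kept
```

With the explicit class, `under_score` collapses to `underscore`, which is the behavior the lab's test strings expect.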

In [None]:
# Just execute this cell
# Raw string so that \s reaches the regex engine unchanged
pattern = re.compile(r"[^a-zA-Z0-9\s]")
def removePunctuation(text):
 """Removes punctuation from the given text

 Note:
 Only spaces, letters, and numbers are retained. Other characters are
 eliminated (e.g. it's becomes its). Leading and trailing spaces are not removed
 here; apply `strips` after punctuation is removed.

 Args:
 text (str): A string.

 Returns:
 str: The cleaned up string.
 """
 cleanText = pattern.sub('', text)
 return cleanText
print removePunctuation('Hi, you! My ZIP code is 98-9800') # should be: Hi you My ZIP code is 989800
print removePunctuation('No under_score!') # should be: No underscore

In [None]:
# Just execute this cell
def strips(text):
 """Strips leading and trailing spaces."""
 return text.strip()
print '>%s<' % strips(' This is a text') #should print >This is a text<
print '>%s<' % (strips(removePunctuation('No under_score !'))) #should print >No underscore<

In [None]:
# Just execute this cell
stopfile = os.path.join(baseDir, 'stopwords.txt')
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords

In [None]:
# Just execute this cell
def isNotStopWord(word):
 """Tells whether the given word is not a common English word (stopword).
 Args:
 word (str): input word
 Returns:
 bool: True if word is not a stopword, otherwise False.
 """
 return word not in stopwords

print isNotStopWord('brown') # Should give True
print isNotStopWord('the') # Should give False

#### `wordCount` function
#### First, define a function for word counting. You should reuse the techniques that have been covered during the first workshop. This function should take in an RDD that is a list of words and return a pair RDD that has all of the words and their associated counts.
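The pair-and-sum idea behind a word count can be sketched on a plain Python list, without Spark. `local_word_count` is a hypothetical helper used only for illustration, not the lab's solution; on an RDD, one common approach is a `map()` to `(word, 1)` pairs followed by a per-key reduction.

```python
from collections import defaultdict

def local_word_count(words):
    """Hypothetical local stand-in for a pair-RDD word count."""
    # Step 1 (map-like): turn every word into a (word, 1) pair.
    pairs = [(w, 1) for w in words]
    # Step 2 (per-key reduction): add up the values that share a key.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    # Sort only to make the output deterministic for this sketch.
    return sorted(counts.items())

result = local_word_count(['cat', 'elephant', 'rat', 'rat', 'cat'])
print(result)
```

The output is a list of `(word, count)` tuples, the same shape the `wordCount` docstring asks for.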

In [None]:
# TODO: Replace with appropriate code
def wordCount(wordListRDD):
 """Creates a pair RDD with word counts from an RDD of words.
 Args:
 wordListRDD (RDD of str): An RDD consisting of words.

 Returns:
 RDD of (str, int): An RDD consisting of (word, count) tuples.
 """
 return 

#### Before you can use the `wordCount()` function, you have to address two issues with the format of the RDD:
 + #### The first issue is that we need to split each line by its spaces.
 + #### The second issue is that we need to filter out empty elements, such as those produced by empty lines.
 
#### Apply a transformation that will split each element of the RDD by its spaces. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be.
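The difference between a one-to-one mapping and a flattening one can be seen on a plain Python list. This is a local sketch under the assumption that list comprehensions and `itertools.chain` stand in for the RDD transformations; the variable names are illustrative.

```python
from itertools import chain

lines = ['to be or', '', 'not  to be']

# map-like: one output element per input line, so the result is a list of lists.
mapped = [line.split(' ') for line in lines]
# flatMap-like: the per-line lists are concatenated into one flat list of words.
flattened = list(chain.from_iterable(mapped))
# Empty strings come from blank lines and from repeated spaces; drop them.
words = [w for w in flattened if w != '']

print(mapped)
print(flattened)
print(words)
```

Note that `''.split(' ')` returns `['']` rather than an empty list, which is why empty elements still need to be filtered out after splitting.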

In [None]:
# TODO: Replace with appropriate code
cleanRDD = (shakespeareRDD
 .map(removePunctuation)
 .map(toLower)
 .map(strips)
 .(lambda line: line.split(' '))
 .filter()
 .filter(isNotStopWord))

#### You now have an RDD that is only words. Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.

#### Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts.
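`takeOrdered()` returns elements in ascending order of an optional key function, so sorting by descending count means negating the count in the key. The sketch below mimics that locally with `heapq.nsmallest`; the sample pairs are made up for illustration and are not taken from the Shakespeare data.

```python
import heapq

# Hypothetical (word, count) pairs, not real results from the lab.
pairs = [('thou', 5), ('king', 9), ('love', 7), ('lord', 8)]

# Ascending order on the negated count puts the largest counts first.
top2 = heapq.nsmallest(2, pairs, key=lambda pair: -pair[1])

print(top2)
```

The key function receives the whole `(word, count)` pair, so it has to pick out the count itself.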

In [None]:
# TODO: Replace with appropriate code
# Collect the top 15 words and their counts
top15WordsAndCounts = wordCount(cleanRDD).
print '\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))

#### Generate the MD5 code to validate your registration
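The validation cell feeds only the words, one at a time, into a single MD5 object, so the resulting code depends on which words are in the top 15 and on their order, not on the counts. A minimal sketch with made-up words shows that successive `update()` calls are equivalent to hashing the concatenation; the `encode()` calls are an addition so the sketch also runs on Python 3, where `update()` requires bytes.

```python
import hashlib

words = ['alpha', 'beta', 'gamma']  # hypothetical words, not the lab's answer

incremental = hashlib.md5()
for word in words:
    # Each update() appends to the data already hashed.
    incremental.update(word.encode('utf-8'))

# Hashing the concatenated words in one call gives the same digest.
one_shot = hashlib.md5(''.join(words).encode('utf-8'))
same = incremental.hexdigest() == one_shot.hexdigest()

# Feeding the same words in a different order changes the digest.
reordered = hashlib.md5(''.join(reversed(words)).encode('utf-8'))
differs = incremental.hexdigest() != reordered.hexdigest()

print(same)
print(differs)
```

If the code is rejected, checking the order of the top 15 words is therefore a reasonable first step.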

In [None]:
md5_code = hashlib.md5()
for (word, count) in top15WordsAndCounts:
 md5_code.update(word)

meetup_code = md5_code.hexdigest()
if hashlib.sha224(meetup_code).hexdigest() == '427681d5929a35ab878c291b0de5f4b8a009dc9b71d2e54dbf7c46ba':
 print 'Well done, copy this code: %s' % md5_code.hexdigest()
else:
 print 'This is not the expected code, please try again. \nTip: the code starts with "cc" and finishes with "ad1c"'