{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a simple word count application with Spark\n", "\n", "This lab will build on the techniques covered in the first Spark workshop. We will develop a simple word count application of the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). \n", "\n", "This lab is mandatory for Workshop 2 and required to validate your registration. \n", "\n", "####Read-me before:\n", "You must execute each cell and fill with the appropriate code when necessary.\n", "At the end of the notebook, there is a generated code to be copied and pasted into the meetup registration. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup import and functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "import os.path\n", "import re\n", "import hashlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loads the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "baseDir = os.path.join('data')\n", "inputPath = os.path.join('shakespeare.txt')\n", "fileName = os.path.join(baseDir, inputPath)\n", "\n", "shakespeareRDD = (sc\n", " .textFile(fileName, 8))\n", "\n", "shakespeareRDD.cache()\n", "print '\\n'.join(shakespeareRDD\n", " .zipWithIndex() # to (line, lineNum)\n", " .map(lambda (l, num): '{0}: {1}'.format(num, l)) # to 'lineNum: line'\n", " .take(15))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "def toLower(text):\n", " \"\"\"\n", " Changes all text to lower case.\n", " \"\"\"\n", " return text.lower()\n", "\n", "print toLower('Hello WORLD') #should be \"hello world\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define the function `removePunctuation` removes any punctuation. We use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "pattern=re.compile(\"[^a-zA-Z0-9\\s]\")\n", "def removePunctuation(text):\n", " \"\"\"Removes punctuation from the given text\n", "\n", " Note:\n", " Only spaces, letters, and numbers should be retained. Other characters should should be\n", " eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after\n", " punctuation is removed.\n", "\n", " Args:\n", " text (str): A string.\n", "\n", " Returns:\n", " str: The cleaned up string.\n", " \"\"\"\n", " cleanText = pattern.sub('', text)\n", " return cleanText\n", "print removePunctuation('Hi, you! My ZIP code is 98-9800') #should be Hi you My ZIP code is 989800\n", "print removePunctuation('No under_score!') #No underscore" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "def strips(text):\n", " \"\"\"strips leading and trailing spaces.\n", " \"\"\"\n", " return text.strip()\n", "print '>%s<' % strips(' This is a text') #should print >This is a text<\n", "print '>%s<' % (strips(removePunctuation('No under_score !'))) #should print >No underscore<" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "stopfile = os.path.join(baseDir, 'stopwords.txt')\n", "stopwords = set(sc.textFile(stopfile).collect())\n", "print 'These are the stopwords: %s' % stopwords" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just excecute this cell\n", "def isNotStopWord(word):\n", " \"\"\" Tells if the given word isn't a English common word.\n", " Args:\n", " string (str): input string\n", " Returns:\n", " Boolean: True if word isn't a stopword. Otherwise, False\n", " \"\"\"\n", " return word not in stopwords\n", "\n", "print isNotStopWord('brown') # Should give True\n", "print isNotStopWord('the') # Should give False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### wordCount` function **\n", "#### First, define a function for word counting. You should reuse the techniques that have been covered during the first workshop. This function should take in an RDD that is a list of words and return a pair RDD that has all of the words and their associated counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# TODO: Replace with appropriate code\n", "def wordCount(wordListRDD):\n", " \"\"\"Creates a pair RDD with word counts from an RDD of words.\n", " Args:\n", " wordListRDD (RDD of str): An RDD consisting of words.\n", "\n", " Returns:\n", " RDD of (str, int): An RDD consisting of (word, count) tuples.\n", " \"\"\"\n", " return " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Before you can use the `wordcount()` function, you have to address two issues with the format of the RDD:\n", " + #### The first issue is that that we need to split each line by its spaces.\n", " + #### The second issue is we need to filter out empty lines.\n", " \n", "#### Apply a transformation that will split each element of the RDD by its spaces. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# TODO: Replace with appropriate code\n", "cleanRDD = (shakespeareRDD\n", " .map(removePunctuation)\n", " .map(toLower)\n", " .map(strips)\n", " .(lambda line: line.split(' '))\n", " .filter()\n", " .filter(isNotStopWord))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### You now have an RDD that is only words. Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n", "\n", "#### Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#collect the top 15\n", "top15WordsAndCounts = wordCount(cleanRDD).\n", "print '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Generate the md5 code to validate your registration" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "md5_code = hashlib.md5()\n", "for (word, count) in top15WordsAndCounts:\n", " md5_code.update(word)\n", "\n", "meetup_code = md5_code.hexdigest()\n", "if hashlib.sha224(meetup_code).hexdigest() == '427681d5929a35ab878c291b0de5f4b8a009dc9b71d2e54dbf7c46ba':\n", " print 'Well done, copy this code: %s' % md5_code.hexdigest()\n", "else:\n", " print 'This is not the expected code, please try again. \\nTip: the code starts with \"cc\" and finishes with \"ad1c\"'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 0 }