123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304 |
- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Building a simple word count application with Spark\n",
- "\n",
- "This lab will build on the techniques covered in the first Spark workshop. We will develop a simple word count application of the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). \n",
- "\n",
- "This lab is mandatory for Workshop 2 and required to validate your registration. \n",
- "\n",
- "####Read-me before:\n",
- "You must execute each cell and fill with the appropriate code when necessary.\n",
- "At the end of the notebook, there is a generated code to be copied and pasted into the meetup registration. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Setup import and functions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "import os.path\n",
- "import re\n",
- "import hashlib"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Loads the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "baseDir = os.path.join('data')\n",
- "inputPath = os.path.join('shakespeare.txt')\n",
- "fileName = os.path.join(baseDir, inputPath)\n",
- "\n",
- "shakespeareRDD = (sc\n",
- " .textFile(fileName, 8))\n",
- "\n",
- "shakespeareRDD.cache()\n",
- "print '\\n'.join(shakespeareRDD\n",
- " .zipWithIndex() # to (line, lineNum)\n",
- " .map(lambda (l, num): '{0}: {1}'.format(num, l)) # to 'lineNum: line'\n",
- " .take(15))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "def toLower(text):\n",
- " \"\"\"\n",
- " Changes all text to lower case.\n",
- " \"\"\"\n",
- " return text.lower()\n",
- "\n",
- "print toLower('Hello WORLD') #should be \"hello world\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Define the function `removePunctuation` removes any punctuation. We use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "pattern=re.compile(\"[^a-zA-Z0-9\\s]\")\n",
- "def removePunctuation(text):\n",
- " \"\"\"Removes punctuation from the given text\n",
- "\n",
- " Note:\n",
- " Only spaces, letters, and numbers should be retained. Other characters should should be\n",
- " eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after\n",
- " punctuation is removed.\n",
- "\n",
- " Args:\n",
- " text (str): A string.\n",
- "\n",
- " Returns:\n",
- " str: The cleaned up string.\n",
- " \"\"\"\n",
- " cleanText = pattern.sub('', text)\n",
- " return cleanText\n",
- "print removePunctuation('Hi, you! My ZIP code is 98-9800') #should be Hi you My ZIP code is 989800\n",
- "print removePunctuation('No under_score!') #No underscore"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "def strips(text):\n",
- " \"\"\"strips leading and trailing spaces.\n",
- " \"\"\"\n",
- " return text.strip()\n",
- "print '>%s<' % strips(' This is a text') #should print >This is a text<\n",
- "print '>%s<' % (strips(removePunctuation('No under_score !'))) #should print >No underscore<"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "stopfile = os.path.join(baseDir, 'stopwords.txt')\n",
- "stopwords = set(sc.textFile(stopfile).collect())\n",
- "print 'These are the stopwords: %s' % stopwords"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# Just excecute this cell\n",
- "def isNotStopWord(word):\n",
- " \"\"\" Tells if the given word isn't a English common word.\n",
- " Args:\n",
- " string (str): input string\n",
- " Returns:\n",
- " Boolean: True if word isn't a stopword. Otherwise, False\n",
- " \"\"\"\n",
- " return word not in stopwords\n",
- "\n",
- "print isNotStopWord('brown') # Should give True\n",
- "print isNotStopWord('the') # Should give False"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### wordCount` function **\n",
- "#### First, define a function for word counting. You should reuse the techniques that have been covered during the first workshop. This function should take in an RDD that is a list of words and return a pair RDD that has all of the words and their associated counts."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# TODO: Replace <FILL IN> with appropriate code\n",
- "def wordCount(wordListRDD):\n",
- " \"\"\"Creates a pair RDD with word counts from an RDD of words.\n",
- " Args:\n",
- " wordListRDD (RDD of str): An RDD consisting of words.\n",
- "\n",
- " Returns:\n",
- " RDD of (str, int): An RDD consisting of (word, count) tuples.\n",
- " \"\"\"\n",
- " return <FILL IN>"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Before you can use the `wordcount()` function, you have to address two issues with the format of the RDD:\n",
- " + #### The first issue is that that we need to split each line by its spaces.\n",
- " + #### The second issue is we need to filter out empty lines.\n",
- " \n",
- "#### Apply a transformation that will split each element of the RDD by its spaces. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# TODO: Replace <FILL IN> with appropriate code\n",
- "cleanRDD = (shakespeareRDD\n",
- " .map(removePunctuation)\n",
- " .map(toLower)\n",
- " .map(strips)\n",
- " .<FILL IN>(lambda line: line.split(' '))\n",
- " .filter(<FILL IN>)\n",
- " .filter(isNotStopWord))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### You now have an RDD that is only words. Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n",
- "\n",
- "#### Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "#collect the top 15\n",
- "top15WordsAndCounts = wordCount(cleanRDD).<FILL IN>\n",
- "print '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "####Generate the md5 code to validate your registration"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "md5_code = hashlib.md5()\n",
- "for (word, count) in top15WordsAndCounts:\n",
- " md5_code.update(word)\n",
- "\n",
- "meetup_code = md5_code.hexdigest()\n",
- "if hashlib.sha224(meetup_code).hexdigest() == '427681d5929a35ab878c291b0de5f4b8a009dc9b71d2e54dbf7c46ba':\n",
- " print 'Well done, copy this code: %s' % md5_code.hexdigest()\n",
- "else:\n",
- " print 'This is not the expected code, please try again. \\nTip: the code starts with \"cc\" and finishes with \"ad1c\"'"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 2",
- "language": "python",
- "name": "python2"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 2
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython2",
- "version": "2.7.13"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
- }
|