							{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building a simple word count application with Spark\n",
    "\n",
    "This lab builds on the techniques covered in the first Spark workshop. We will develop a simple word count application that finds the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100), retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page).  \n",
    "\n",
    "This lab is mandatory for Workshop 2 and required to validate your registration. \n",
    "\n",
    "#### Read me first:\n",
    "You must execute each cell, filling in the appropriate code where necessary.\n",
    "At the end of the notebook, a code is generated for you to copy and paste into the meetup registration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up imports and functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "import os.path\n",
    "import re\n",
    "import hashlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "baseDir = os.path.join('data')\n",
    "inputPath = os.path.join('shakespeare.txt')\n",
    "fileName = os.path.join(baseDir, inputPath)\n",
    "\n",
    "shakespeareRDD = (sc\n",
    "                  .textFile(fileName, 8))\n",
    "\n",
    "shakespeareRDD.cache()\n",
    "print '\\n'.join(shakespeareRDD\n",
    "                .zipWithIndex()  # to (line, lineNum)\n",
    "                .map(lambda (l, num): '{0}: {1}'.format(num, l))  # to 'lineNum: line'\n",
    "                .take(15))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "def toLower(text):\n",
    "    \"\"\"\n",
    "    Changes all text to lower case.\n",
    "    \"\"\"\n",
    "    return text.lower()\n",
    "\n",
    "print toLower('Hello WORLD') #should be \"hello world\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Define the function `removePunctuation`, which removes any punctuation. We use the Python [re](https://docs.python.org/2/library/re.html) module to remove any character that is not a letter, number, or whitespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "pattern = re.compile(r\"[^a-zA-Z0-9\\s]\")\n",
    "def removePunctuation(text):\n",
    "    \"\"\"Removes punctuation from the given text\n",
    "\n",
    "    Note:\n",
    "        Only spaces, letters, and numbers are retained.  Other characters are\n",
    "        eliminated (e.g. it's becomes its).  Leading and trailing spaces are not removed\n",
    "        here; apply `strips` afterwards.\n",
    "\n",
    "    Args:\n",
    "        text (str): A string.\n",
    "\n",
    "    Returns:\n",
    "        str: The cleaned up string.\n",
    "    \"\"\"\n",
    "    cleanText = pattern.sub('', text)\n",
    "    return cleanText\n",
    "print removePunctuation('Hi, you! My ZIP code is 98-9800')  # should be Hi you My ZIP code is 989800\n",
    "print removePunctuation('No under_score!')  # should be No underscore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "def strips(text):\n",
    "    \"\"\"strips leading and trailing spaces.\n",
    "    \"\"\"\n",
    "    return text.strip()\n",
    "print '>%s<' % strips(' This is a text') #should print >This is a text<\n",
    "print '>%s<' % (strips(removePunctuation('No under_score !'))) #should print >No underscore<"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "stopfile = os.path.join(baseDir, 'stopwords.txt')\n",
    "stopwords = set(sc.textFile(stopfile).collect())\n",
    "print 'These are the stopwords: %s' % stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Just execute this cell\n",
    "def isNotStopWord(word):\n",
    "    \"\"\" Tells whether the given word is not a common English word (stopword).\n",
    "    Args:\n",
    "        word (str): input word\n",
    "    Returns:\n",
    "        Boolean: True if word isn't a stopword. Otherwise, False\n",
    "    \"\"\"\n",
    "    return word not in stopwords\n",
    "\n",
    "print isNotStopWord('brown') # Should give True\n",
    "print isNotStopWord('the') # Should give False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **`wordCount` function**\n",
    "#### First, define a function for word counting. You should reuse the techniques covered during the first workshop. This function should take an RDD of words and return a pair RDD containing each word and its associated count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# TODO: Replace <FILL IN> with appropriate code\n",
    "def wordCount(wordListRDD):\n",
    "    \"\"\"Creates a pair RDD with word counts from an RDD of words.\n",
    "    Args:\n",
    "        wordListRDD (RDD of str): An RDD consisting of words.\n",
    "\n",
    "    Returns:\n",
    "        RDD of (str, int): An RDD consisting of (word, count) tuples.\n",
    "    \"\"\"\n",
    "    return <FILL IN>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Before you can use the `wordCount()` function, you have to address two issues with the format of the RDD:\n",
    "  + #### The first issue is that we need to split each line by its spaces.\n",
    "  + #### The second issue is that we need to filter out empty elements (empty lines and repeated spaces leave empty strings after splitting).\n",
    " \n",
    "#### Apply a transformation that will split each element of the RDD by its spaces. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# TODO: Replace <FILL IN> with appropriate code\n",
    "cleanRDD = (shakespeareRDD\n",
    "            .map(removePunctuation)\n",
    "            .map(toLower)\n",
    "            .map(strips)\n",
    "            .<FILL IN>(lambda line: line.split(' '))\n",
    "            .filter(<FILL IN>)\n",
    "            .filter(isNotStopWord))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### You now have an RDD that is only words.  Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n",
    "\n",
    "#### Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# TODO: Replace <FILL IN> with appropriate code to collect the top 15\n",
    "top15WordsAndCounts = wordCount(cleanRDD).<FILL IN>\n",
    "print '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate the MD5 code to validate your registration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "md5_code = hashlib.md5()\n",
    "for (word, count) in top15WordsAndCounts:\n",
    "    md5_code.update(word)\n",
    "\n",
    "meetup_code = md5_code.hexdigest()\n",
    "if hashlib.sha224(meetup_code).hexdigest() == '427681d5929a35ab878c291b0de5f4b8a009dc9b71d2e54dbf7c46ba':\n",
    "    print 'Well done, copy this code: %s' % md5_code.hexdigest()\n",
    "else:\n",
    "    print 'This is not the expected code, please try again. \\nTip: the code starts with \"cc\" and finishes with \"ad1c\"'"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}