├── .github └── FUNDING.yml ├── .gitignore ├── ETL.md ├── LICENSE ├── README.md ├── adhoc.py ├── adhoc.sh ├── defaults.cfg ├── deprecated └── TextRank.py ├── etl.py ├── exsto.py ├── filter.py ├── graph1.scala ├── graph2.scala ├── notebooks.zip ├── parse.py ├── regex.py └── scrape.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: ceteri 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | -------------------------------------------------------------------------------- /ETL.md: -------------------------------------------------------------------------------- 1 | ## ETL in PySpark with Spark SQL 2 | 3 | Let's use PySpark and Spark SQL to prepare the data for ML and graph 4 | analysis. 5 | We can perform *data discovery* while reshaping the data for later 6 | work. 7 | These early results can help guide our deeper analysis. 8 | 9 | NB: if this ETL needs to run outside of the `bin/pyspark` shell, first 10 | set up a `SparkContext` variable: 11 | 12 | ```python 13 | from pyspark import SparkContext 14 | sc = SparkContext(appName="Exsto", master="local[*]") 15 | ``` 16 | 17 | Import the JSON data produced by the scraper and register its schema 18 | for ad-hoc SQL queries later. 19 | Each message has the fields: 20 | `date`, `sender`, `id`, `next_thread`, `prev_thread`, `next_url`, `subject`, `text` 21 | 22 | ```python 23 | from pyspark.sql import SQLContext, Row 24 | sqlCtx = SQLContext(sc) 25 | 26 | msg = sqlCtx.jsonFile("data").cache() 27 | msg.registerTempTable("msg") 28 | ``` 29 | 30 | NB: note the persistence used for the JSON message data. 31 | We may need to unpersist at a later stage of this ETL work. 32 | 33 | ### Question: Who are the senders? 34 | 35 | Who are the people in the developer community sending email to the list? 36 | We will use this as a dimension in our analysis and reporting. 37 | Let's create a map, with a unique ID for each email address -- 38 | this will be required for the graph analysis. 39 | It may come in handy later for some 40 | [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition). 41 | 42 | ```python 43 | who = msg.map(lambda x: x.sender).distinct().zipWithUniqueId() 44 | who.take(10) 45 | 46 | whoMap = who.collectAsMap() 47 | 48 | print "\nsenders" 49 | print len(whoMap) 50 | ``` 51 | 52 | ### Question: Who are the top K senders? 

[Apache Spark](http://spark.apache.org/) is one of the most
active open source developer communities on Apache, so it
will tend to have several thousand people engaged.
Let's identify the most active ones.
Then we can show a leaderboard and track changes in it over time.

```python
from operator import add

top_sender = msg.map(lambda x: (x.sender, 1,)).reduceByKey(add) \
    .map(lambda (a, b): (b, a)) \
    .sortByKey(0, 1) \
    .map(lambda (a, b): (b, a))

print "\ntop senders"
print top_sender.take(11)
```

You may notice that code... it comes from *word count*.


### Question: Which are the top K conversations?

Clearly, some people discuss more over the email list than others.
Let's identify *who* those people are.
Later we can leverage our graph analysis to determine *what* they discuss.

NB: this is a use case where the `groupByKey` transformation is warranted;
despite the usual advice to avoid it, sometimes its usage is indicated.

```python
import itertools

def nitems (replier, senders):
    for sender, g in itertools.groupby(senders):
        yield len(list(g)), (replier, sender,)

senders = msg.map(lambda x: (x.id, x.sender,))
replies = msg.map(lambda x: (x.prev_thread, x.sender,))

convo = replies.join(senders).values() \
    .filter(lambda (a, b): a != b)

top_convo = convo.groupByKey() \
    .flatMap(lambda (a, b): list(nitems(a, b))) \
    .sortByKey(0)

print "\ntop convo"
print top_convo.take(10)
```

### Prepare for Sender/Reply Graph Analysis

Given the RDDs that we have created to help answer some of the
questions so far, let's persist those data sets using
[Parquet](http://parquet.io) --
starting with the graph of sender/message/reply.
The field names and types match what `adhoc.py` writes and what
`graph2.scala` expects to read back:

```python
edge = top_convo.map(lambda (a, b): (whoMap.get(b[0]), whoMap.get(b[1]), a,))
edgeSchema = edge.map(lambda p: Row(replier=long(p[0]), sender=long(p[1]), num=int(p[2])))
edgeTable = sqlCtx.inferSchema(edgeSchema)
edgeTable.saveAsParquetFile("reply_edge.parquet")

node = who.map(lambda (a, b): (b, a))
nodeSchema = node.map(lambda p: Row(id=long(p[0]), sender=p[1]))
nodeTable = sqlCtx.inferSchema(nodeSchema)
nodeTable.saveAsParquetFile("reply_node.parquet")
```


### Prepare for TextRank Analysis per paragraph

```python
import json

def map_graf_edges (x):
    j = json.loads(x)

    for pair in j["tile"]:
        n0 = int(pair[0])
        n1 = int(pair[1])

        if n0 > 0 and n1 > 0:
            yield (j["id"], n0, n1,)
            yield (j["id"], n1, n0,)

graf = sc.textFile("parsed").flatMap(map_graf_edges)
n = graf.count()
print "\ngraf edges", n

edgeSchema = graf.map(lambda p: Row(id=p[0], node0=int(p[1]), node1=int(p[2])))

edgeTable = sqlCtx.inferSchema(edgeSchema)
edgeTable.saveAsParquetFile("graf_edge.parquet")
```

```python
def map_graf_nodes (x):
    j = json.loads(x)

    for word in j["graf"]:
        yield [j["id"]] + word

graf = sc.textFile("parsed").flatMap(map_graf_nodes)
n = graf.count()
print "\ngraf nodes", n

nodeSchema = graf.map(lambda p: Row(id=p[0], node_id=int(p[1]), raw=p[2], root=p[3], pos=p[4], keep=p[5], num=int(p[6])))

nodeTable = sqlCtx.inferSchema(nodeSchema)
nodeTable.saveAsParquetFile("graf_node.parquet")
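
# optional sanity check -- not part of the original flow; it simply reads
# back the Parquet file written above and registers it for ad-hoc SQL,
# which is what the later GraphX stage (graph1.scala) will do as well
check = sqlCtx.parquetFile("graf_node.parquet")
check.registerTempTable("graf_node")
print sqlCtx.sql("SELECT COUNT(*) FROM graf_node").collect()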
```
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2015-2021 Derwen, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Exsto: Developer Community Insights

A frequently asked question on the [Apache Spark](http://spark.apache.org/)
user email list concerns where to find data sets for evaluating the code.
Oddly enough, the collection of archived messages for this email list
provides an excellent data set for evaluating Spark capabilities, e.g.,
machine learning, graph algorithms, text analytics, time-series analysis, etc.

Herein, an open source developer community considers itself algorithmically.
This project shows work-in-progress for how to surface data insights from
the developer email forums for an Apache open source project.
It leverages advanced technologies for natural language processing, machine
learning, graph algorithms, time series analysis, etc.
As an example, we use data from the
[email list archives](http://mail-archives.apache.org) to help understand
its community better.

See these talks about `Exsto`:

In particular, we will show production use of NLP tooling in Python,
integrated with
[MLlib](http://spark.apache.org/docs/latest/mllib-guide.html)
(machine learning) and
[GraphX](http://spark.apache.org/docs/latest/graphx-programming-guide.html)
(graph algorithms) in Apache Spark.
Machine learning approaches used include:
[Word2Vec](https://code.google.com/p/word2vec/),
[TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf),
Connected Components, Streaming K-Means, etc.

Keep in mind that "One Size Fits All" is an anti-pattern, especially for
Big Data tools.
This project illustrates how to leverage microservices and containers to
scale-out the code+data components that do not fit well in Spark, Hadoop, etc.
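
As one hypothetical sketch of that approach: the word-ID lookup in `exsto.py`
(`get_word_id`) could be factored out into a tiny Flask service running in its
own container, backed by a shared cache instead of an in-process dict.
The route, port, and in-memory store below are illustrative only, not part of
the current code:

```python
from flask import Flask, jsonify

app = Flask(__name__)
UNIQ_WORDS = { ".": 0 }  # stand-in for a shared cache such as Redis

@app.route("/word/<root>")
def get_word_id (root):
    """lookup/assign a unique identifier for each word root, as a service"""
    if root not in UNIQ_WORDS:
        UNIQ_WORDS[root] = len(UNIQ_WORDS)
    return jsonify(id=UNIQ_WORDS[root])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Each Spark worker could then resolve word IDs over HTTP rather than holding
its own copy of the lookup table.
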
47 | 48 | In addition to Spark, other technologies used include: 49 | [Mesos](http://mesos.apache.org/), 50 | [Docker](https://www.docker.com/), 51 | [Anaconda](http://continuum.io/downloads), 52 | [Flask](http://flask.pocoo.org/), 53 | [NLTK](http://www.nltk.org/), 54 | [TextBlob](https://textblob.readthedocs.org/en/dev/). 55 | 56 | 57 | ## Dependencies 58 | 59 | * https://github.com/opentable/docker-anaconda 60 | 61 | ```bash 62 | conda config --add channels https://conda.binstar.org/sloria 63 | conda install textblob 64 | python -m textblob.download_corpora 65 | python -m nltk.downloader -d ~/nltk_data all 66 | pip install -U textblob textblob-aptagger 67 | pip install lxml 68 | pip install python-dateutil 69 | pip install Flask 70 | ``` 71 | 72 | NLTK and TextBlob require some 73 | [data downloads](https://s3.amazonaws.com/textblob/nltk_data.tar.gz) 74 | which may also require updating the NLTK data path: 75 | 76 | ```python 77 | import nltk 78 | nltk.data.path.append("~/nltk_data/") 79 | ``` 80 | 81 | 82 | ## Running 83 | 84 | To change the project configuration simply edit the `defaults.cfg` 85 | file. 86 | 87 | 88 | ### scrape the email list via scripts 89 | 90 | ```bash 91 | ./scrape.py data/foo.json 92 | ``` 93 | 94 | ### parse the email text via scripts 95 | 96 | ```bash 97 | ./parse.py data/foo.json parsed/foo.json 98 | ``` 99 | 100 | ## Public Data 101 | 102 | Example data from the Apache Spark email list is available as JSON: 103 | 104 | * original: [https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json](https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json) 105 | * parsed: [https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json](https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json) 106 | 107 | 108 | # What's in a name? 109 | 110 | The word [exsto](http://en.wiktionary.org/wiki/exsto) is the Latin 111 | verb meaning "to stand out", in its present active form. 112 | -------------------------------------------------------------------------------- /adhoc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import json 5 | import sys 6 | 7 | 8 | from pyspark import SparkContext 9 | sc = SparkContext(appName="Exsto", master="local[*]") 10 | 11 | from pyspark.sql import SQLContext, Row 12 | sqlCtx = SQLContext(sc) 13 | 14 | msg = sqlCtx.jsonFile("data").cache() 15 | msg.registerTempTable("msg") 16 | 17 | 18 | # Question: Who are the senders? 19 | 20 | who = msg.map(lambda x: x.sender).distinct().zipWithUniqueId() 21 | who.take(10) 22 | 23 | whoMap = who.collectAsMap() 24 | 25 | print "\nsenders" 26 | print len(whoMap) 27 | 28 | 29 | # Question: Who are the top K senders? 30 | 31 | from operator import add 32 | 33 | top_sender = msg.map(lambda x: (x.sender, 1,)).reduceByKey(add) \ 34 | .map(lambda (a, b): (b, a)) \ 35 | .sortByKey(0, 1) \ 36 | .map(lambda (a, b): (b, a)) 37 | 38 | print "\ntop senders" 39 | print top_sender.take(11) 40 | 41 | 42 | # Question: Which are the top K conversations? 
43 | 44 | import itertools 45 | 46 | def nitems (replier, senders): 47 | for sender, g in itertools.groupby(senders): 48 | yield len(list(g)), (replier, sender,) 49 | 50 | senders = msg.map(lambda x: (x.id, x.sender,)) 51 | replies = msg.map(lambda x: (x.prev_thread, x.sender,)) 52 | 53 | convo = replies.join(senders).values() \ 54 | .filter(lambda (a, b): a != b) 55 | 56 | top_convo = convo.groupByKey() \ 57 | .flatMap(lambda (a, b): list(nitems(a, b))) \ 58 | .sortByKey(0) 59 | 60 | print "\ntop convo" 61 | print top_convo.take(10) 62 | 63 | 64 | # Prepare for Sender/Reply Graph Analysis 65 | 66 | edge = top_convo.map(lambda (a, b): (whoMap.get(b[0]), whoMap.get(b[1]), a,)) 67 | edgeSchema = edge.map(lambda p: Row(replier=long(p[0]), sender=long(p[1]), num=int(p[2]))) 68 | edgeTable = sqlCtx.inferSchema(edgeSchema) 69 | edgeTable.saveAsParquetFile("reply_edge.parquet") 70 | 71 | node = who.map(lambda (a, b): (b, a)) 72 | nodeSchema = node.map(lambda p: Row(id=long(p[0]), sender=p[1])) 73 | nodeTable = sqlCtx.inferSchema(nodeSchema) 74 | nodeTable.saveAsParquetFile("reply_node.parquet") 75 | 76 | 77 | # Prepare for TextRank Analysis per paragraph 78 | 79 | def map_graf_edges (x): 80 | j = json.loads(x) 81 | 82 | for pair in j["tile"]: 83 | n0 = int(pair[0]) 84 | n1 = int(pair[1]) 85 | 86 | if n0 > 0 and n1 > 0: 87 | yield (j["id"], n0, n1,) 88 | yield (j["id"], n1, n0,) 89 | 90 | graf = sc.textFile("parsed").flatMap(map_graf_edges) 91 | n = graf.count() 92 | print "\ngraf edges", n 93 | 94 | edgeSchema = graf.map(lambda p: Row(id=p[0], node0=int(p[1]), node1=int(p[2]))) 95 | 96 | edgeTable = sqlCtx.inferSchema(edgeSchema) 97 | edgeTable.saveAsParquetFile("graf_edge.parquet") 98 | 99 | 100 | def map_graf_nodes (x): 101 | j = json.loads(x) 102 | 103 | for word in j["graf"]: 104 | yield [j["id"]] + word 105 | 106 | graf = sc.textFile("parsed").flatMap(map_graf_nodes) 107 | n = graf.count() 108 | print "\ngraf nodes", n 109 | 110 | nodeSchema = graf.map(lambda p: Row(id=p[0], node_id=int(p[1]), raw=p[2], root=p[3], pos=p[4], keep=p[5], num=int(p[6]))) 111 | 112 | nodeTable = sqlCtx.inferSchema(nodeSchema) 113 | nodeTable.saveAsParquetFile("graf_node.parquet") 114 | -------------------------------------------------------------------------------- /adhoc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -x 2 | 3 | export SPARK_HOME=~/opt/spark 4 | 5 | rm -rf reply_edge.parquet reply_node.parquet 6 | rm -rf graf_edge.parquet graf_node.parquet 7 | 8 | SPARK_LOCAL_IP=127.0.0.1 \ 9 | $SPARK_HOME/bin/spark-submit ./adhoc.py -------------------------------------------------------------------------------- /defaults.cfg: -------------------------------------------------------------------------------- 1 | [scraper] 2 | iterations: 2500 3 | nap_time: 2 4 | base_url: http://mail-archives.apache.org 5 | start_url: /mod_mbox/spark-user/201503.mbox/<1427773728882-22311.post%40n3.nabble.com> 6 | 7 | -------------------------------------------------------------------------------- /deprecated/TextRank.py: -------------------------------------------------------------------------------- 1 | # TextRank, based on: 2 | # http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf 3 | 4 | from itertools import tee, izip 5 | from nltk import stem 6 | from text.blob import TextBlob as tb 7 | from textblob_aptagger import PerceptronTagger 8 | import nltk.data 9 | import numpy as np 10 | import sys 11 | 12 | 13 | TOKENIZER = 
nltk.data.load('tokenizers/punkt/english.pickle') 14 | TAGGER = PerceptronTagger() 15 | STEMMER = stem.porter.PorterStemmer() 16 | 17 | 18 | def pos_tag (s): 19 | """high-performance part-of-speech tagger""" 20 | global TAGGER 21 | return TAGGER.tag(s) 22 | 23 | 24 | def wrap_words (pair): 25 | """wrap each (word, tag) pair as an object with fully indexed metadata""" 26 | global STEMMER 27 | index = pair[0] 28 | result = [] 29 | for word, tag in pair[1]: 30 | word = word.lower() 31 | stem = STEMMER.stem(word) 32 | if stem == "": 33 | stem = word 34 | keep = tag in ('JJ', 'NN', 'NNS', 'NNP',) 35 | result.append({ "id": 0, "index": index, "stem": stem, "word": word, "tag": tag, "keep": keep }) 36 | index += 1 37 | return result 38 | 39 | 40 | ###################################################################### 41 | ## build a graph from raw text 42 | 43 | TEXT = """ 44 | Compatibility of systems of linear constraints over the set of natural numbers. 45 | Criteria of compatibility of a system of linear Diophantine equations, strict 46 | inequations, and nonstrict inequations are considered. Upper bounds for 47 | components of a minimal set of solutions and algorithms of construction of 48 | minimal generating sets of solutions for all types of systems are given. 49 | These criteria and the corresponding algorithms for constructing a minimal 50 | supporting set of solutions can be used in solving all the considered types 51 | systems and systems of mixed types. 52 | """ 53 | 54 | from pyspark import SparkContext 55 | sc = SparkContext(appName="TextRank", master="local[*]") 56 | 57 | sent = sc.parallelize(TOKENIZER.tokenize(TEXT)).map(pos_tag).cache() 58 | sent.collect() 59 | 60 | base = list(np.cumsum(np.array(sent.map(len).collect()))) 61 | base.insert(0, 0) 62 | base.pop() 63 | sent_length = sc.parallelize(base) 64 | 65 | tagged_doc = sent_length.zip(sent).map(wrap_words) 66 | 67 | 68 | ###################################################################### 69 | 70 | from pyspark.sql import SQLContext, Row 71 | sqlCtx = SQLContext(sc) 72 | 73 | def enum_words (s): 74 | for word in s: 75 | yield word 76 | 77 | words = tagged_doc.flatMap(enum_words) 78 | pair_words = words.keyBy(lambda w: w["stem"]) 79 | uniq_words = words.map(lambda w: w["stem"]).distinct().zipWithUniqueId() 80 | 81 | uniq = sc.broadcast(dict(uniq_words.collect())) 82 | 83 | 84 | def id_words (pair): 85 | (key, val) = pair 86 | word = val[0] 87 | id = val[1] 88 | word["id"] = id 89 | return word 90 | 91 | id_doc = pair_words.join(uniq_words).map(id_words) 92 | id_words = id_doc.map(lambda w: (w["id"], w["index"], w["word"], w["stem"], w["tag"])) 93 | 94 | wordSchema = id_words.map(lambda p: Row(id=long(p[0]), index=int(p[1]), word=p[2], stem=p[3], tag=p[4])) 95 | wordTable = sqlCtx.inferSchema(wordSchema) 96 | 97 | wordTable.registerTempTable("word") 98 | wordTable.saveAsParquetFile("word.parquet") 99 | 100 | 101 | ###################################################################### 102 | 103 | def sliding_window (iterable, size): 104 | """apply a sliding window to produce 'size' tiles""" 105 | iters = tee(iterable, size) 106 | for i in xrange(1, size): 107 | for each in iters[i:]: 108 | next(each, None) 109 | return list(izip(*iters)) 110 | 111 | 112 | def keep_pair (pair): 113 | """filter the relevant linked word pairs""" 114 | return pair[0]["keep"] and pair[1]["keep"] and (pair[0]["word"] != pair[1]["word"]) 115 | 116 | 117 | def link_words (seq): 118 | """attempt to link words in a sentence""" 119 | return [ (seq[0], 
word) for word in seq[1:] ] 120 | 121 | 122 | tiled = tagged_doc.flatMap(lambda s: sliding_window(s, 3)).flatMap(link_words).filter(keep_pair) 123 | 124 | t0 = tiled.map(lambda l: (uniq.value[l[0]["stem"]], uniq.value[l[1]["stem"]],)) 125 | t1 = tiled.map(lambda l: (uniq.value[l[1]["stem"]], uniq.value[l[0]["stem"]],)) 126 | 127 | neighbors = t0.union(t1) 128 | 129 | edgeSchema = neighbors.map(lambda p: Row(n0=long(p[0]), n1=long(p[1]))) 130 | edgeTable = sqlCtx.inferSchema(edgeSchema) 131 | 132 | edgeTable.registerTempTable("edge") 133 | edgeTable.saveAsParquetFile("edge.parquet") 134 | -------------------------------------------------------------------------------- /etl.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import exsto 5 | import json 6 | import sys 7 | 8 | DEBUG = False # True 9 | 10 | 11 | def main (): 12 | global DEBUG 13 | path = sys.argv[1] 14 | 15 | with open(path, 'r') as f: 16 | for line in f.readlines(): 17 | meta = json.loads(line) 18 | 19 | for graf_text in exsto.filter_quotes(meta["text"]): 20 | try: 21 | for sent in exsto.parse_graf(meta["id"], graf_text): 22 | print exsto.pretty_print(sent) 23 | except (IndexError) as e: 24 | if DEBUG: 25 | print "IndexError: " + str(e) 26 | print graf_text 27 | 28 | if __name__ == "__main__": 29 | main() 30 | -------------------------------------------------------------------------------- /exsto.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import dateutil.parser as dp 5 | import hashlib 6 | import json 7 | import lxml.html 8 | import os 9 | import re 10 | import string 11 | import textblob 12 | import textblob_aptagger as tag 13 | import urllib 14 | 15 | DEBUG = False # True 16 | 17 | 18 | ###################################################################### 19 | ## scrape the Apache mailing list archives 20 | 21 | PAT_EMAIL_ID = re.compile("^.*\%3c(.*)\@.*$") 22 | 23 | 24 | def scrape_url (url): 25 | """get the HTML and parse it as an XML doc""" 26 | text = urllib.urlopen(url).read() 27 | text = filter(lambda x: x in string.printable, text) 28 | root = lxml.html.document_fromstring(text) 29 | 30 | return root 31 | 32 | 33 | def parse_email (root, base_url): 34 | """parse email fields from an lxml root""" 35 | global PAT_EMAIL_ID 36 | meta = {} 37 | 38 | path = "/html/head/title" 39 | meta["subject"] = root.xpath(path)[0].text 40 | 41 | path = "/html/body/table/tbody/tr[@class='from']/td[@class='right']" 42 | meta["sender"] = root.xpath(path)[0].text 43 | 44 | path = "/html/body/table/tbody/tr[@class='date']/td[@class='right']" 45 | meta["date"] = dp.parse(root.xpath(path)[0].text).isoformat() 46 | 47 | path = "/html/body/table/tbody/tr[@class='raw']/td[@class='right']/a" 48 | link = root.xpath(path)[0].get("href") 49 | meta["id"] = PAT_EMAIL_ID.match(link).group(1) 50 | 51 | path = "/html/body/table/tbody/tr[@class='contents']/td/pre" 52 | meta["text"] = root.xpath(path)[0].text 53 | 54 | # parse the optional elements 55 | 56 | path = "/html/body/table/thead/tr/th[@class='nav']/a[@title='Next by date']" 57 | refs = root.xpath(path) 58 | 59 | if len(refs) > 0: 60 | link = refs[0].get("href") 61 | meta["next_url"] = base_url + link 62 | else: 63 | meta["next_url"] = "" 64 | 65 | path = "/html/body/table/thead/tr/th[@class='nav']/a[@title='Previous by thread']" 66 | refs = root.xpath(path) 67 | 68 | if len(refs) > 0: 69 | link = 
refs[0].get("href") 70 | meta["prev_thread"] = PAT_EMAIL_ID.match(link).group(1) 71 | else: 72 | meta["prev_thread"] = "" 73 | 74 | path = "/html/body/table/thead/tr/th[@class='nav']/a[@title='Next by thread']" 75 | refs = root.xpath(path) 76 | 77 | if len(refs) > 0: 78 | link = refs[0].get("href") 79 | meta["next_thread"] = PAT_EMAIL_ID.match(link).group(1) 80 | else: 81 | meta["next_thread"] = "" 82 | 83 | return meta 84 | 85 | 86 | ###################################################################### 87 | ## filter the novel text versus quoted text in an email message 88 | 89 | PAT_FORWARD = re.compile("\n\-+ Forwarded message \-+\n") 90 | PAT_REPLIED = re.compile("\nOn.*\d+.*\n?wrote\:\n+\>") 91 | PAT_UNSUBSC = re.compile("\n\-+\nTo unsubscribe,.*\nFor additional commands,.*") 92 | 93 | 94 | def split_grafs (lines): 95 | """segment the raw text into paragraphs""" 96 | graf = [] 97 | 98 | for line in lines: 99 | line = line.strip() 100 | 101 | if len(line) < 1: 102 | if len(graf) > 0: 103 | yield "\n".join(graf) 104 | graf = [] 105 | else: 106 | graf.append(line) 107 | 108 | if len(graf) > 0: 109 | yield "\n".join(graf) 110 | 111 | 112 | def filter_quotes (text): 113 | """filter the quoted text out of a message""" 114 | global DEBUG 115 | global PAT_FORWARD, PAT_REPLIED, PAT_UNSUBSC 116 | 117 | text = filter(lambda x: x in string.printable, text) 118 | 119 | if DEBUG: 120 | print text 121 | 122 | # strip off quoted text in a forward 123 | m = PAT_FORWARD.split(text, re.M) 124 | 125 | if m and len(m) > 1: 126 | text = m[0] 127 | 128 | # strip off quoted text in a reply 129 | m = PAT_REPLIED.split(text, re.M) 130 | 131 | if m and len(m) > 1: 132 | text = m[0] 133 | 134 | # strip off any trailing unsubscription notice 135 | m = PAT_UNSUBSC.split(text, re.M) 136 | 137 | if m: 138 | text = m[0] 139 | 140 | # replace any remaining quoted text with blank lines 141 | lines = [] 142 | 143 | for line in text.split("\n"): 144 | if line.startswith(">"): 145 | lines.append("") 146 | else: 147 | lines.append(line) 148 | 149 | return list(split_grafs(lines)) 150 | 151 | 152 | def test_filter (path): 153 | """run the unit tests for known quoting styles""" 154 | global DEBUG 155 | DEBUG = True 156 | 157 | for root, dirs, files in os.walk(path): 158 | for file in files: 159 | with open(path + file, 'r') as f: 160 | line = f.readline() 161 | meta = json.loads(line) 162 | grafs = filter_quotes(meta["text"]) 163 | 164 | if not grafs or len(grafs) < 1: 165 | raise Exception("no results") 166 | else: 167 | print grafs 168 | 169 | 170 | ###################################################################### 171 | ## parse and markup text paragraphs for semantic analysis 172 | 173 | PAT_PUNCT = re.compile(r'^\W+$') 174 | PAT_SPACE = re.compile(r'\_+$') 175 | 176 | POS_KEEPS = ['v', 'n', 'j'] 177 | POS_LEMMA = ['v', 'n'] 178 | TAGGER = tag.PerceptronTagger() 179 | UNIQ_WORDS = { ".": 0 } 180 | 181 | 182 | def is_not_word (word): 183 | return PAT_PUNCT.match(word) or PAT_SPACE.match(word) 184 | 185 | 186 | def get_word_id (root): 187 | """lookup/assign a unique identify for each word""" 188 | global UNIQ_WORDS 189 | 190 | # in practice, this should use a microservice via some robust 191 | # distributed cache, e.g., Cassandra, Redis, etc. 
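    # a hypothetical Redis-backed variant (sketch only, not wired in here;
    # the key names are made up) might look like:
    #
    #   r = redis.StrictRedis()          # would need `import redis` at module level
    #   key = "word:" + root
    #   if r.get(key) is None:
    #       r.setnx(key, r.incr("word:id_count"))   # first writer wins
    #   return int(r.get(key))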
192 | 193 | if root not in UNIQ_WORDS: 194 | UNIQ_WORDS[root] = len(UNIQ_WORDS) 195 | 196 | return UNIQ_WORDS[root] 197 | 198 | 199 | def get_tiles (graf, size=3): 200 | """generate word pairs for the TextRank graph""" 201 | graf_len = len(graf) 202 | 203 | for i in xrange(0, graf_len): 204 | w0 = graf[i] 205 | 206 | for j in xrange(i + 1, min(graf_len, i + 1 + size)): 207 | w1 = graf[j] 208 | 209 | if w0[4] == w1[4] == 1: 210 | yield (w0[0], w1[0],) 211 | 212 | 213 | def parse_graf (msg_id, text, base): 214 | """parse and markup each sentence in the given paragraph""" 215 | global DEBUG 216 | global POS_KEEPS, POS_LEMMA, TAGGER 217 | 218 | markup = [] 219 | i = base 220 | 221 | for s in textblob.TextBlob(text).sentences: 222 | graf = [] 223 | m = hashlib.sha1() 224 | 225 | pos = TAGGER.tag(str(s)) 226 | p_idx = 0 227 | w_idx = 0 228 | 229 | while p_idx < len(pos): 230 | p = pos[p_idx] 231 | 232 | if DEBUG: 233 | print "IDX", p_idx, p 234 | print "reg", is_not_word(p[0]) 235 | print " ", w_idx, len(s.words), s.words 236 | print graf 237 | 238 | if is_not_word(p[0]) or (p[1] == "SYM"): 239 | if (w_idx == len(s.words) - 1): 240 | w = p[0] 241 | t = '.' 242 | else: 243 | p_idx += 1 244 | continue 245 | elif w_idx < len(s.words): 246 | w = s.words[w_idx] 247 | t = p[1].lower()[0] 248 | w_idx += 1 249 | 250 | if t in POS_LEMMA: 251 | l = str(w.singularize().lemmatize(t)).lower() 252 | elif t != '.': 253 | l = str(w).lower() 254 | else: 255 | l = w 256 | 257 | keep = 1 if t in POS_KEEPS else 0 258 | m.update(l) 259 | 260 | id = get_word_id(l) if keep == 1 else 0 261 | graf.append((id, w, l, p[1], keep, i,)) 262 | 263 | i += 1 264 | p_idx += 1 265 | 266 | # tile the pairs for TextRank 267 | tile = list(get_tiles(graf)) 268 | 269 | #"lang": s.detect_language(), 270 | markup.append({ 271 | "id": msg_id, 272 | "size": len(graf), 273 | "sha1": m.hexdigest(), 274 | "polr": s.sentiment.polarity, 275 | "subj": s.sentiment.subjectivity, 276 | "graf": graf, 277 | "tile": tile 278 | }) 279 | 280 | return markup, i 281 | 282 | 283 | ###################################################################### 284 | ## common utilities 285 | 286 | def pretty_print (obj, indent=False): 287 | """pretty print a JSON object""" 288 | 289 | if indent: 290 | return json.dumps(obj, sort_keys=True, indent=2, separators=(',', ': ')) 291 | else: 292 | return json.dumps(obj, sort_keys=True) 293 | -------------------------------------------------------------------------------- /filter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import exsto 5 | import json 6 | import os 7 | import sys 8 | 9 | 10 | def main (): 11 | path = sys.argv[1] 12 | 13 | if os.path.isdir(path): 14 | exsto.test_filter(path) 15 | else: 16 | with open(path, 'r') as f: 17 | for line in f.readlines(): 18 | meta = json.loads(line) 19 | print exsto.pretty_print(exsto.filter_quotes(meta["text"])) 20 | 21 | 22 | if __name__ == "__main__": 23 | main() 24 | -------------------------------------------------------------------------------- /graph1.scala: -------------------------------------------------------------------------------- 1 | import org.apache.spark.graphx._ 2 | import org.apache.spark.rdd.RDD 3 | 4 | val sqlCtx = new org.apache.spark.sql.SQLContext(sc) 5 | import sqlCtx._ 6 | 7 | val edge = sqlCtx.parquetFile("graf_edge.parquet") 8 | edge.registerTempTable("edge") 9 | 10 | val node = sqlCtx.parquetFile("graf_node.parquet") 11 | node.registerTempTable("node") 12 
| 13 | // Let's pick one message as an example -- 14 | // at scale we'd parallelize this 15 | 16 | val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw" 17 | 18 | 19 | val sql = """ 20 | SELECT node_id, root 21 | FROM node 22 | WHERE id='%s' AND keep='1' 23 | """.format(msg_id) 24 | 25 | val n = sqlCtx.sql(sql.stripMargin).distinct() 26 | val nodes: RDD[(Long, String)] = n.map{ p => 27 | (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String]) 28 | } 29 | nodes.collect() 30 | 31 | 32 | val sql = """ 33 | SELECT node0, node1 34 | FROM edge 35 | WHERE id='%s' 36 | """.format(msg_id) 37 | 38 | val e = sqlCtx.sql(sql.stripMargin).distinct() 39 | val edges: RDD[Edge[Int]] = e.map{ p => 40 | Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0) 41 | } 42 | edges.collect() 43 | 44 | // run PageRank 45 | 46 | val g: Graph[String, Int] = Graph(nodes, edges) 47 | val r = g.pageRank(0.0001).vertices 48 | 49 | r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println) 50 | 51 | // save the ranks 52 | 53 | case class Rank(id: Int, rank: Float) 54 | val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat)) 55 | 56 | rank.registerTempTable("rank") 57 | 58 | 59 | ////////////////////////////////////////////////////////////////////// 60 | 61 | def median[T](s: Seq[T])(implicit n: Fractional[T]) = { 62 | import n._ 63 | val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2) 64 | if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head 65 | } 66 | 67 | node.schema 68 | edge.schema 69 | rank.schema 70 | 71 | val sql = """ 72 | SELECT n.num, n.raw, r.rank 73 | FROM node n JOIN rank r ON n.node_id = r.id 74 | WHERE n.id='%s' AND n.keep='1' 75 | ORDER BY n.num 76 | """.format(msg_id) 77 | 78 | val s = sqlCtx.sql(sql.stripMargin).collect() 79 | 80 | val min_rank = median(r.map(_._2).collect()) 81 | 82 | var span:List[String] = List() 83 | var last_index = -1 84 | var rank_sum = 0.0 85 | 86 | var phrases:collection.mutable.Map[String, Double] = collection.mutable.Map() 87 | 88 | s.foreach { x => 89 | //println (x) 90 | val index = x.getInt(0) 91 | val word = x.getString(1) 92 | val rank = x.getFloat(2) 93 | 94 | var isStop = false 95 | 96 | // test for break from past 97 | if (span.size > 0 && rank < min_rank) isStop = true 98 | if (span.size > 0 && (index - last_index > 1)) isStop = true 99 | 100 | // clear accumulation 101 | if (isStop) { 102 | val phrase = span.mkString(" ") 103 | phrases += (phrase -> rank_sum) 104 | //println(phrase, rank_sum) 105 | 106 | span = List() 107 | last_index = index 108 | rank_sum = 0.0 109 | } 110 | 111 | // start or append 112 | if (rank >= min_rank) { 113 | span = span :+ word 114 | last_index = index 115 | rank_sum += rank 116 | } 117 | } 118 | 119 | // summarize the text as a list of ranked keyphrases 120 | 121 | var summary = sc.parallelize(phrases.toSeq).distinct().sortBy(_._2, ascending=false).cache() 122 | 123 | // take top 50 percentile 124 | // NOT USED FOR SMALL MESSAGES 125 | 126 | val min_rank = median(summary.map(_._2).collect().toSeq) 127 | summary = summary.filter(_._2 >= min_rank) 128 | 129 | val sum = summary.map(_._2).reduce(_ + _) 130 | summary = summary.map(x => (x._1, x._2 / sum)) 131 | summary.collect() 132 | -------------------------------------------------------------------------------- /graph2.scala: -------------------------------------------------------------------------------- 1 | import org.apache.spark.graphx._ 2 | import org.apache.spark.rdd.RDD 3 | 4 | val sqlCtx = new 
org.apache.spark.sql.SQLContext(sc) 5 | import sqlCtx._ 6 | 7 | val edge = sqlCtx.parquetFile("reply_edge.parquet") 8 | edge.registerTempTable("edge") 9 | 10 | val node = sqlCtx.parquetFile("reply_node.parquet") 11 | node.registerTempTable("node") 12 | 13 | edge.schemaString 14 | node.schemaString 15 | 16 | 17 | val sql = "SELECT id, sender FROM node" 18 | 19 | val n = sqlCtx.sql(sql).distinct() 20 | val nodes: RDD[(Long, String)] = n.map{ p => 21 | (p(0).asInstanceOf[Long], p(1).asInstanceOf[String]) 22 | } 23 | nodes.collect() 24 | 25 | 26 | val sql = "SELECT replier, sender, num FROM edge" 27 | 28 | val e = sqlCtx.sql(sql).distinct() 29 | val edges: RDD[Edge[Int]] = e.map{ p => 30 | Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int]) 31 | } 32 | edges.collect() 33 | 34 | 35 | // run graph analytics 36 | 37 | val g: Graph[String, Int] = Graph(nodes, edges) 38 | val r = g.pageRank(0.0001).vertices 39 | 40 | r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println) 41 | 42 | // define a reduce operation to compute the highest degree vertex 43 | 44 | def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = { 45 | if (a._2 > b._2) a else b 46 | } 47 | 48 | // compute the max degrees 49 | 50 | val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max) 51 | val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max) 52 | val maxDegrees: (VertexId, Int) = g.degrees.reduce(max) 53 | 54 | val node_map: scala.collection.Map[Long, String] = node. 55 | map(p => (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])).collectAsMap() 56 | 57 | // connected components 58 | 59 | val scc = g.stronglyConnectedComponents(10).vertices 60 | node.join(scc).foreach(println) 61 | -------------------------------------------------------------------------------- /notebooks.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DerwenAI/exsto/d632379ab19b36cc0bb9e17623d161e32963ff63/notebooks.zip -------------------------------------------------------------------------------- /parse.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import exsto 5 | import json 6 | import sys 7 | 8 | 9 | DEBUG = False # True 10 | 11 | def main(): 12 | global DEBUG 13 | 14 | path = sys.argv[1] 15 | 16 | with open(path, 'r') as f: 17 | for line in f.readlines(): 18 | meta = json.loads(line) 19 | base = 0 20 | 21 | for graf_text in exsto.filter_quotes(meta["text"]): 22 | if DEBUG: 23 | print graf_text 24 | 25 | grafs, new_base = exsto.parse_graf(meta["id"], graf_text, base) 26 | base = new_base 27 | 28 | for graf in grafs: 29 | print exsto.pretty_print(graf) 30 | 31 | 32 | if __name__ == "__main__": 33 | main() 34 | -------------------------------------------------------------------------------- /regex.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | PAT_PUNCT = re.compile(r'^\W+$') 4 | PAT_SPACE = re.compile(r'\S+$') 5 | 6 | ex = [ 7 | '________________________________', 8 | '.' 
9 | ] 10 | 11 | for s in ex: 12 | print s 13 | print "reg", PAT_PUNCT.match(s) 14 | print "foo", PAT_SPACE.match(s) 15 | -------------------------------------------------------------------------------- /scrape.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import ConfigParser 5 | import exsto 6 | import sys 7 | import time 8 | 9 | 10 | def main (): 11 | config = ConfigParser.ConfigParser() 12 | config.read("defaults.cfg") 13 | 14 | iterations = config.getint("scraper", "iterations") 15 | nap_time = config.getint("scraper", "nap_time") 16 | base_url = config.get("scraper", "base_url") 17 | url = base_url + config.get("scraper", "start_url") 18 | 19 | with open(sys.argv[1], 'w') as f: 20 | for i in xrange(0, iterations): 21 | if len(url) < 1: 22 | break 23 | else: 24 | root = exsto.scrape_url(url) 25 | meta = exsto.parse_email(root, base_url) 26 | 27 | f.write(exsto.pretty_print(meta)) 28 | f.write('\n') 29 | 30 | url = meta["next_url"] 31 | time.sleep(nap_time) 32 | 33 | 34 | if __name__ == "__main__": 35 | main() 36 | --------------------------------------------------------------------------------