├── .github └── FUNDING.yml ├── .gitignore ├── ETL.md ├── LICENSE ├── README.md ├── adhoc.py ├── adhoc.sh ├── defaults.cfg ├── deprecated └── TextRank.py ├── etl.py ├── exsto.py ├── filter.py ├── graph1.scala ├── graph2.scala ├── notebooks.zip ├── parse.py ├── regex.py └── scrape.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: ceteri 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | -------------------------------------------------------------------------------- /ETL.md: -------------------------------------------------------------------------------- 1 | ## ETL in PySpark with Spark SQL 2 | 3 | Let's use PySpark and Spark SQL to prepare the data for ML and graph 4 | analysis. 5 | We can perform *data discovery* while reshaping the data for later 6 | work. 7 | These early results can help guide our deeper analysis. 8 | 9 | NB: if this ETL needs to run outside of the `bin/pyspark` shell, first 10 | set up a `SparkContext` variable: 11 | 12 | ```python 13 | from pyspark import SparkContext 14 | sc = SparkContext(appName="Exsto", master="local[*]") 15 | ``` 16 | 17 | Import the JSON data produced by the scraper and register its schema 18 | for ad-hoc SQL queries later. 19 | Each message has the fields: 20 | `date`, `sender`, `id`, `next_thread`, `prev_thread`, `next_url`, `subject`, `text` 21 | 22 | ```python 23 | from pyspark.sql import SQLContext, Row 24 | sqlCtx = SQLContext(sc) 25 | 26 | msg = sqlCtx.jsonFile("data").cache() 27 | msg.registerTempTable("msg") 28 | ``` 29 | 30 | NB: note the persistence used for the JSON message data. 31 | We may need to unpersist at a later stage of this ETL work. 32 | 33 | ### Question: Who are the senders? 34 | 35 | Who are the people in the developer community sending email to the list? 36 | We will use this as a dimension in our analysis and reporting. 37 | Let's create a map, with a unique ID for each email address -- 38 | this will be required for the graph analysis. 39 | It may come in handy later for some 40 | [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition). 41 | 42 | ```python 43 | who = msg.map(lambda x: x.sender).distinct().zipWithUniqueId() 44 | who.take(10) 45 | 46 | whoMap = who.collectAsMap() 47 | 48 | print "\nsenders" 49 | print len(whoMap) 50 | ``` 51 | 52 | ### Question: Who are the top K senders? 

[Apache Spark](http://spark.apache.org/) is one of the most
active open source developer communities on Apache, so it
will tend to have several thousand people engaged.
Let's identify the most active ones.
Then we can show a leaderboard and track changes in it over time.

```python
from operator import add

top_sender = msg.map(lambda x: (x.sender, 1,)).reduceByKey(add) \
    .map(lambda (a, b): (b, a)) \
    .sortByKey(0, 1) \
    .map(lambda (a, b): (b, a))

print "\ntop senders"
print top_sender.take(11)
```

You may notice that code... it comes from *word count*.


### Question: Which are the top K conversations?

Clearly, some people discuss more over the email list than others.
Let's identify *who* those people are.
Later we can leverage our graph analysis to determine *what* they discuss.

NB: this is a use case where the `groupByKey` transformation is warranted;
despite the usual advice to avoid it, sometimes its usage is indicated.

```python
import itertools

def nitems (replier, senders):
    for sender, g in itertools.groupby(senders):
        yield len(list(g)), (replier, sender,)

senders = msg.map(lambda x: (x.id, x.sender,))
replies = msg.map(lambda x: (x.prev_thread, x.sender,))

convo = replies.join(senders).values() \
    .filter(lambda (a, b): a != b)

top_convo = convo.groupByKey() \
    .flatMap(lambda (a, b): list(nitems(a, b))) \
    .sortByKey(0)

print "\ntop convo"
print top_convo.take(10)
```

### Prepare for Sender/Reply Graph Analysis

Given the RDDs that we have created to help answer some of the
questions so far, let's persist those data sets using
[Parquet](http://parquet.io) --
starting with the graph of sender/message/reply.
The field names and types match what `adhoc.py` writes and what
`graph2.scala` expects to read back:

```python
edge = top_convo.map(lambda (a, b): (whoMap.get(b[0]), whoMap.get(b[1]), a,))
edgeSchema = edge.map(lambda p: Row(replier=long(p[0]), sender=long(p[1]), num=int(p[2])))
edgeTable = sqlCtx.inferSchema(edgeSchema)
edgeTable.saveAsParquetFile("reply_edge.parquet")

node = who.map(lambda (a, b): (b, a))
nodeSchema = node.map(lambda p: Row(id=long(p[0]), sender=p[1]))
nodeTable = sqlCtx.inferSchema(nodeSchema)
nodeTable.saveAsParquetFile("reply_node.parquet")
```


### Prepare for TextRank Analysis per paragraph

```python
import json

def map_graf_edges (x):
    j = json.loads(x)

    for pair in j["tile"]:
        n0 = int(pair[0])
        n1 = int(pair[1])

        if n0 > 0 and n1 > 0:
            yield (j["id"], n0, n1,)
            yield (j["id"], n1, n0,)

graf = sc.textFile("parsed").flatMap(map_graf_edges)
n = graf.count()
print "\ngraf edges", n

edgeSchema = graf.map(lambda p: Row(id=p[0], node0=int(p[1]), node1=int(p[2])))

edgeTable = sqlCtx.inferSchema(edgeSchema)
edgeTable.saveAsParquetFile("graf_edge.parquet")
```

```python
def map_graf_nodes (x):
    j = json.loads(x)

    for word in j["graf"]:
        yield [j["id"]] + word

graf = sc.textFile("parsed").flatMap(map_graf_nodes)
n = graf.count()
print "\ngraf nodes", n

nodeSchema = graf.map(lambda p: Row(id=p[0], node_id=int(p[1]), raw=p[2], root=p[3], pos=p[4], keep=p[5], num=int(p[6])))

nodeTable = sqlCtx.inferSchema(nodeSchema)
nodeTable.saveAsParquetFile("graf_node.parquet")
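
# optional sanity check -- not part of the original flow; it simply reads
# back the Parquet file written above and registers it for ad-hoc SQL,
# which is what the later GraphX stage (graph1.scala) will do as well
check = sqlCtx.parquetFile("graf_node.parquet")
check.registerTempTable("graf_node")
print sqlCtx.sql("SELECT COUNT(*) FROM graf_node").collect()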
```
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2015-2021 Derwen, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Exsto: Developer Community Insights

A frequently asked question on the [Apache Spark](http://spark.apache.org/)
user email list concerns where to find data sets for evaluating the code.
Oddly enough, the collection of archived messages for this email list
provides an excellent data set for evaluating Spark capabilities, e.g.,
machine learning, graph algorithms, text analytics, time-series analysis, etc.

Herein, an open source developer community considers itself algorithmically.
This project shows work-in-progress for how to surface data insights from
the developer email forums for an Apache open source project.
It leverages advanced technologies for natural language processing, machine
learning, graph algorithms, time series analysis, etc.
As an example, we use data from the
[email list archives](http://mail-archives.apache.org) to help understand
its community better.

See these talks about `Exsto`:

In particular, we will show production use of NLP tooling in Python,
integrated with
[MLlib](http://spark.apache.org/docs/latest/mllib-guide.html)
(machine learning) and
[GraphX](http://spark.apache.org/docs/latest/graphx-programming-guide.html)
(graph algorithms) in Apache Spark.
Machine learning approaches used include:
[Word2Vec](https://code.google.com/p/word2vec/),
[TextRank](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf),
Connected Components, Streaming K-Means, etc.

Keep in mind that "One Size Fits All" is an anti-pattern, especially for
Big Data tools.
This project illustrates how to leverage microservices and containers to
scale-out the code+data components that do not fit well in Spark, Hadoop, etc.
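
As one hypothetical sketch of that approach: the word-ID lookup in `exsto.py`
(`get_word_id`) could be factored out into a tiny Flask service running in its
own container, backed by a shared cache instead of an in-process dict.
The route, port, and in-memory store below are illustrative only, not part of
the current code:

```python
from flask import Flask, jsonify

app = Flask(__name__)
UNIQ_WORDS = { ".": 0 }  # stand-in for a shared cache such as Redis

@app.route("/word/<root>")
def get_word_id (root):
    """lookup/assign a unique identifier for each word root, as a service"""
    if root not in UNIQ_WORDS:
        UNIQ_WORDS[root] = len(UNIQ_WORDS)
    return jsonify(id=UNIQ_WORDS[root])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Each Spark worker could then resolve word IDs over HTTP rather than holding
its own copy of the lookup table.
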
47 | 48 | In addition to Spark, other technologies used include: 49 | [Mesos](http://mesos.apache.org/), 50 | [Docker](https://www.docker.com/), 51 | [Anaconda](http://continuum.io/downloads), 52 | [Flask](http://flask.pocoo.org/), 53 | [NLTK](http://www.nltk.org/), 54 | [TextBlob](https://textblob.readthedocs.org/en/dev/). 55 | 56 | 57 | ## Dependencies 58 | 59 | * https://github.com/opentable/docker-anaconda 60 | 61 | ```bash 62 | conda config --add channels https://conda.binstar.org/sloria 63 | conda install textblob 64 | python -m textblob.download_corpora 65 | python -m nltk.downloader -d ~/nltk_data all 66 | pip install -U textblob textblob-aptagger 67 | pip install lxml 68 | pip install python-dateutil 69 | pip install Flask 70 | ``` 71 | 72 | NLTK and TextBlob require some 73 | [data downloads](https://s3.amazonaws.com/textblob/nltk_data.tar.gz) 74 | which may also require updating the NLTK data path: 75 | 76 | ```python 77 | import nltk 78 | nltk.data.path.append("~/nltk_data/") 79 | ``` 80 | 81 | 82 | ## Running 83 | 84 | To change the project configuration simply edit the `defaults.cfg` 85 | file. 86 | 87 | 88 | ### scrape the email list via scripts 89 | 90 | ```bash 91 | ./scrape.py data/foo.json 92 | ``` 93 | 94 | ### parse the email text via scripts 95 | 96 | ```bash 97 | ./parse.py data/foo.json parsed/foo.json 98 | ``` 99 | 100 | ## Public Data 101 | 102 | Example data from the Apache Spark email list is available as JSON: 103 | 104 | * original: [https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json](https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json) 105 | * parsed: [https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json](https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json) 106 | 107 | 108 | # What's in a name? 109 | 110 | The word [exsto](http://en.wiktionary.org/wiki/exsto) is the Latin 111 | verb meaning "to stand out", in its present active form. 112 | -------------------------------------------------------------------------------- /adhoc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import json 5 | import sys 6 | 7 | 8 | from pyspark import SparkContext 9 | sc = SparkContext(appName="Exsto", master="local[*]") 10 | 11 | from pyspark.sql import SQLContext, Row 12 | sqlCtx = SQLContext(sc) 13 | 14 | msg = sqlCtx.jsonFile("data").cache() 15 | msg.registerTempTable("msg") 16 | 17 | 18 | # Question: Who are the senders? 19 | 20 | who = msg.map(lambda x: x.sender).distinct().zipWithUniqueId() 21 | who.take(10) 22 | 23 | whoMap = who.collectAsMap() 24 | 25 | print "\nsenders" 26 | print len(whoMap) 27 | 28 | 29 | # Question: Who are the top K senders? 30 | 31 | from operator import add 32 | 33 | top_sender = msg.map(lambda x: (x.sender, 1,)).reduceByKey(add) \ 34 | .map(lambda (a, b): (b, a)) \ 35 | .sortByKey(0, 1) \ 36 | .map(lambda (a, b): (b, a)) 37 | 38 | print "\ntop senders" 39 | print top_sender.take(11) 40 | 41 | 42 | # Question: Which are the top K conversations? 
43 | 44 | import itertools 45 | 46 | def nitems (replier, senders): 47 | for sender, g in itertools.groupby(senders): 48 | yield len(list(g)), (replier, sender,) 49 | 50 | senders = msg.map(lambda x: (x.id, x.sender,)) 51 | replies = msg.map(lambda x: (x.prev_thread, x.sender,)) 52 | 53 | convo = replies.join(senders).values() \ 54 | .filter(lambda (a, b): a != b) 55 | 56 | top_convo = convo.groupByKey() \ 57 | .flatMap(lambda (a, b): list(nitems(a, b))) \ 58 | .sortByKey(0) 59 | 60 | print "\ntop convo" 61 | print top_convo.take(10) 62 | 63 | 64 | # Prepare for Sender/Reply Graph Analysis 65 | 66 | edge = top_convo.map(lambda (a, b): (whoMap.get(b[0]), whoMap.get(b[1]), a,)) 67 | edgeSchema = edge.map(lambda p: Row(replier=long(p[0]), sender=long(p[1]), num=int(p[2]))) 68 | edgeTable = sqlCtx.inferSchema(edgeSchema) 69 | edgeTable.saveAsParquetFile("reply_edge.parquet") 70 | 71 | node = who.map(lambda (a, b): (b, a)) 72 | nodeSchema = node.map(lambda p: Row(id=long(p[0]), sender=p[1])) 73 | nodeTable = sqlCtx.inferSchema(nodeSchema) 74 | nodeTable.saveAsParquetFile("reply_node.parquet") 75 | 76 | 77 | # Prepare for TextRank Analysis per paragraph 78 | 79 | def map_graf_edges (x): 80 | j = json.loads(x) 81 | 82 | for pair in j["tile"]: 83 | n0 = int(pair[0]) 84 | n1 = int(pair[1]) 85 | 86 | if n0 > 0 and n1 > 0: 87 | yield (j["id"], n0, n1,) 88 | yield (j["id"], n1, n0,) 89 | 90 | graf = sc.textFile("parsed").flatMap(map_graf_edges) 91 | n = graf.count() 92 | print "\ngraf edges", n 93 | 94 | edgeSchema = graf.map(lambda p: Row(id=p[0], node0=int(p[1]), node1=int(p[2]))) 95 | 96 | edgeTable = sqlCtx.inferSchema(edgeSchema) 97 | edgeTable.saveAsParquetFile("graf_edge.parquet") 98 | 99 | 100 | def map_graf_nodes (x): 101 | j = json.loads(x) 102 | 103 | for word in j["graf"]: 104 | yield [j["id"]] + word 105 | 106 | graf = sc.textFile("parsed").flatMap(map_graf_nodes) 107 | n = graf.count() 108 | print "\ngraf nodes", n 109 | 110 | nodeSchema = graf.map(lambda p: Row(id=p[0], node_id=int(p[1]), raw=p[2], root=p[3], pos=p[4], keep=p[5], num=int(p[6]))) 111 | 112 | nodeTable = sqlCtx.inferSchema(nodeSchema) 113 | nodeTable.saveAsParquetFile("graf_node.parquet") 114 | -------------------------------------------------------------------------------- /adhoc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -x 2 | 3 | export SPARK_HOME=~/opt/spark 4 | 5 | rm -rf reply_edge.parquet reply_node.parquet 6 | rm -rf graf_edge.parquet graf_node.parquet 7 | 8 | SPARK_LOCAL_IP=127.0.0.1 \ 9 | $SPARK_HOME/bin/spark-submit ./adhoc.py -------------------------------------------------------------------------------- /defaults.cfg: -------------------------------------------------------------------------------- 1 | [scraper] 2 | iterations: 2500 3 | nap_time: 2 4 | base_url: http://mail-archives.apache.org 5 | start_url: /mod_mbox/spark-user/201503.mbox/<1427773728882-22311.post%40n3.nabble.com> 6 | 7 | -------------------------------------------------------------------------------- /deprecated/TextRank.py: -------------------------------------------------------------------------------- 1 | # TextRank, based on: 2 | # http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf 3 | 4 | from itertools import tee, izip 5 | from nltk import stem 6 | from text.blob import TextBlob as tb 7 | from textblob_aptagger import PerceptronTagger 8 | import nltk.data 9 | import numpy as np 10 | import sys 11 | 12 | 13 | TOKENIZER = 
nltk.data.load('tokenizers/punkt/english.pickle') 14 | TAGGER = PerceptronTagger() 15 | STEMMER = stem.porter.PorterStemmer() 16 | 17 | 18 | def pos_tag (s): 19 | """high-performance part-of-speech tagger""" 20 | global TAGGER 21 | return TAGGER.tag(s) 22 | 23 | 24 | def wrap_words (pair): 25 | """wrap each (word, tag) pair as an object with fully indexed metadata""" 26 | global STEMMER 27 | index = pair[0] 28 | result = [] 29 | for word, tag in pair[1]: 30 | word = word.lower() 31 | stem = STEMMER.stem(word) 32 | if stem == "": 33 | stem = word 34 | keep = tag in ('JJ', 'NN', 'NNS', 'NNP',) 35 | result.append({ "id": 0, "index": index, "stem": stem, "word": word, "tag": tag, "keep": keep }) 36 | index += 1 37 | return result 38 | 39 | 40 | ###################################################################### 41 | ## build a graph from raw text 42 | 43 | TEXT = """ 44 | Compatibility of systems of linear constraints over the set of natural numbers. 45 | Criteria of compatibility of a system of linear Diophantine equations, strict 46 | inequations, and nonstrict inequations are considered. Upper bounds for 47 | components of a minimal set of solutions and algorithms of construction of 48 | minimal generating sets of solutions for all types of systems are given. 49 | These criteria and the corresponding algorithms for constructing a minimal 50 | supporting set of solutions can be used in solving all the considered types 51 | systems and systems of mixed types. 52 | """ 53 | 54 | from pyspark import SparkContext 55 | sc = SparkContext(appName="TextRank", master="local[*]") 56 | 57 | sent = sc.parallelize(TOKENIZER.tokenize(TEXT)).map(pos_tag).cache() 58 | sent.collect() 59 | 60 | base = list(np.cumsum(np.array(sent.map(len).collect()))) 61 | base.insert(0, 0) 62 | base.pop() 63 | sent_length = sc.parallelize(base) 64 | 65 | tagged_doc = sent_length.zip(sent).map(wrap_words) 66 | 67 | 68 | ###################################################################### 69 | 70 | from pyspark.sql import SQLContext, Row 71 | sqlCtx = SQLContext(sc) 72 | 73 | def enum_words (s): 74 | for word in s: 75 | yield word 76 | 77 | words = tagged_doc.flatMap(enum_words) 78 | pair_words = words.keyBy(lambda w: w["stem"]) 79 | uniq_words = words.map(lambda w: w["stem"]).distinct().zipWithUniqueId() 80 | 81 | uniq = sc.broadcast(dict(uniq_words.collect())) 82 | 83 | 84 | def id_words (pair): 85 | (key, val) = pair 86 | word = val[0] 87 | id = val[1] 88 | word["id"] = id 89 | return word 90 | 91 | id_doc = pair_words.join(uniq_words).map(id_words) 92 | id_words = id_doc.map(lambda w: (w["id"], w["index"], w["word"], w["stem"], w["tag"])) 93 | 94 | wordSchema = id_words.map(lambda p: Row(id=long(p[0]), index=int(p[1]), word=p[2], stem=p[3], tag=p[4])) 95 | wordTable = sqlCtx.inferSchema(wordSchema) 96 | 97 | wordTable.registerTempTable("word") 98 | wordTable.saveAsParquetFile("word.parquet") 99 | 100 | 101 | ###################################################################### 102 | 103 | def sliding_window (iterable, size): 104 | """apply a sliding window to produce 'size' tiles""" 105 | iters = tee(iterable, size) 106 | for i in xrange(1, size): 107 | for each in iters[i:]: 108 | next(each, None) 109 | return list(izip(*iters)) 110 | 111 | 112 | def keep_pair (pair): 113 | """filter the relevant linked word pairs""" 114 | return pair[0]["keep"] and pair[1]["keep"] and (pair[0]["word"] != pair[1]["word"]) 115 | 116 | 117 | def link_words (seq): 118 | """attempt to link words in a sentence""" 119 | return [ (seq[0], 
word) for word in seq[1:] ] 120 | 121 | 122 | tiled = tagged_doc.flatMap(lambda s: sliding_window(s, 3)).flatMap(link_words).filter(keep_pair) 123 | 124 | t0 = tiled.map(lambda l: (uniq.value[l[0]["stem"]], uniq.value[l[1]["stem"]],)) 125 | t1 = tiled.map(lambda l: (uniq.value[l[1]["stem"]], uniq.value[l[0]["stem"]],)) 126 | 127 | neighbors = t0.union(t1) 128 | 129 | edgeSchema = neighbors.map(lambda p: Row(n0=long(p[0]), n1=long(p[1]))) 130 | edgeTable = sqlCtx.inferSchema(edgeSchema) 131 | 132 | edgeTable.registerTempTable("edge") 133 | edgeTable.saveAsParquetFile("edge.parquet") 134 | -------------------------------------------------------------------------------- /etl.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import exsto 5 | import json 6 | import sys 7 | 8 | DEBUG = False # True 9 | 10 | 11 | def main (): 12 | global DEBUG 13 | path = sys.argv[1] 14 | 15 | with open(path, 'r') as f: 16 | for line in f.readlines(): 17 | meta = json.loads(line) 18 | 19 | for graf_text in exsto.filter_quotes(meta["text"]): 20 | try: 21 | for sent in exsto.parse_graf(meta["id"], graf_text): 22 | print exsto.pretty_print(sent) 23 | except (IndexError) as e: 24 | if DEBUG: 25 | print "IndexError: " + str(e) 26 | print graf_text 27 | 28 | if __name__ == "__main__": 29 | main() 30 | -------------------------------------------------------------------------------- /exsto.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import dateutil.parser as dp 5 | import hashlib 6 | import json 7 | import lxml.html 8 | import os 9 | import re 10 | import string 11 | import textblob 12 | import textblob_aptagger as tag 13 | import urllib 14 | 15 | DEBUG = False # True 16 | 17 | 18 | ###################################################################### 19 | ## scrape the Apache mailing list archives 20 | 21 | PAT_EMAIL_ID = re.compile("^.*\%3c(.*)\@.*$") 22 | 23 | 24 | def scrape_url (url): 25 | """get the HTML and parse it as an XML doc""" 26 | text = urllib.urlopen(url).read() 27 | text = filter(lambda x: x in string.printable, text) 28 | root = lxml.html.document_fromstring(text) 29 | 30 | return root 31 | 32 | 33 | def parse_email (root, base_url): 34 | """parse email fields from an lxml root""" 35 | global PAT_EMAIL_ID 36 | meta = {} 37 | 38 | path = "/html/head/title" 39 | meta["subject"] = root.xpath(path)[0].text 40 | 41 | path = "/html/body/table/tbody/tr[@class='from']/td[@class='right']" 42 | meta["sender"] = root.xpath(path)[0].text 43 | 44 | path = "/html/body/table/tbody/tr[@class='date']/td[@class='right']" 45 | meta["date"] = dp.parse(root.xpath(path)[0].text).isoformat() 46 | 47 | path = "/html/body/table/tbody/tr[@class='raw']/td[@class='right']/a" 48 | link = root.xpath(path)[0].get("href") 49 | meta["id"] = PAT_EMAIL_ID.match(link).group(1) 50 | 51 | path = "/html/body/table/tbody/tr[@class='contents']/td/pre" 52 | meta["text"] = root.xpath(path)[0].text 53 | 54 | # parse the optional elements 55 | 56 | path = "/html/body/table/thead/tr/th[@class='nav']/a[@title='Next by date']" 57 | refs = root.xpath(path) 58 | 59 | if len(refs) > 0: 60 | link = refs[0].get("href") 61 | meta["next_url"] = base_url + link 62 | else: 63 | meta["next_url"] = "" 64 | 65 | path = "/html/body/table/thead/tr/th[@class='nav']/a[@title='Previous by thread']" 66 | refs = root.xpath(path) 67 | 68 | if len(refs) > 0: 69 | link = 
refs[0].get("href") 70 | meta["prev_thread"] = PAT_EMAIL_ID.match(link).group(1) 71 | else: 72 | meta["prev_thread"] = "" 73 | 74 | path = "/html/body/table/thead/tr/th[@class='nav']/a[@title='Next by thread']" 75 | refs = root.xpath(path) 76 | 77 | if len(refs) > 0: 78 | link = refs[0].get("href") 79 | meta["next_thread"] = PAT_EMAIL_ID.match(link).group(1) 80 | else: 81 | meta["next_thread"] = "" 82 | 83 | return meta 84 | 85 | 86 | ###################################################################### 87 | ## filter the novel text versus quoted text in an email message 88 | 89 | PAT_FORWARD = re.compile("\n\-+ Forwarded message \-+\n") 90 | PAT_REPLIED = re.compile("\nOn.*\d+.*\n?wrote\:\n+\>") 91 | PAT_UNSUBSC = re.compile("\n\-+\nTo unsubscribe,.*\nFor additional commands,.*") 92 | 93 | 94 | def split_grafs (lines): 95 | """segment the raw text into paragraphs""" 96 | graf = [] 97 | 98 | for line in lines: 99 | line = line.strip() 100 | 101 | if len(line) < 1: 102 | if len(graf) > 0: 103 | yield "\n".join(graf) 104 | graf = [] 105 | else: 106 | graf.append(line) 107 | 108 | if len(graf) > 0: 109 | yield "\n".join(graf) 110 | 111 | 112 | def filter_quotes (text): 113 | """filter the quoted text out of a message""" 114 | global DEBUG 115 | global PAT_FORWARD, PAT_REPLIED, PAT_UNSUBSC 116 | 117 | text = filter(lambda x: x in string.printable, text) 118 | 119 | if DEBUG: 120 | print text 121 | 122 | # strip off quoted text in a forward 123 | m = PAT_FORWARD.split(text, re.M) 124 | 125 | if m and len(m) > 1: 126 | text = m[0] 127 | 128 | # strip off quoted text in a reply 129 | m = PAT_REPLIED.split(text, re.M) 130 | 131 | if m and len(m) > 1: 132 | text = m[0] 133 | 134 | # strip off any trailing unsubscription notice 135 | m = PAT_UNSUBSC.split(text, re.M) 136 | 137 | if m: 138 | text = m[0] 139 | 140 | # replace any remaining quoted text with blank lines 141 | lines = [] 142 | 143 | for line in text.split("\n"): 144 | if line.startswith(">"): 145 | lines.append("") 146 | else: 147 | lines.append(line) 148 | 149 | return list(split_grafs(lines)) 150 | 151 | 152 | def test_filter (path): 153 | """run the unit tests for known quoting styles""" 154 | global DEBUG 155 | DEBUG = True 156 | 157 | for root, dirs, files in os.walk(path): 158 | for file in files: 159 | with open(path + file, 'r') as f: 160 | line = f.readline() 161 | meta = json.loads(line) 162 | grafs = filter_quotes(meta["text"]) 163 | 164 | if not grafs or len(grafs) < 1: 165 | raise Exception("no results") 166 | else: 167 | print grafs 168 | 169 | 170 | ###################################################################### 171 | ## parse and markup text paragraphs for semantic analysis 172 | 173 | PAT_PUNCT = re.compile(r'^\W+$') 174 | PAT_SPACE = re.compile(r'\_+$') 175 | 176 | POS_KEEPS = ['v', 'n', 'j'] 177 | POS_LEMMA = ['v', 'n'] 178 | TAGGER = tag.PerceptronTagger() 179 | UNIQ_WORDS = { ".": 0 } 180 | 181 | 182 | def is_not_word (word): 183 | return PAT_PUNCT.match(word) or PAT_SPACE.match(word) 184 | 185 | 186 | def get_word_id (root): 187 | """lookup/assign a unique identify for each word""" 188 | global UNIQ_WORDS 189 | 190 | # in practice, this should use a microservice via some robust 191 | # distributed cache, e.g., Cassandra, Redis, etc. 
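    # a hypothetical Redis-backed variant (sketch only, not wired in here;
    # the key names are made up) might look like:
    #
    #   r = redis.StrictRedis()          # would need `import redis` at module level
    #   key = "word:" + root
    #   if r.get(key) is None:
    #       r.setnx(key, r.incr("word:id_count"))   # first writer wins
    #   return int(r.get(key))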
192 | 193 | if root not in UNIQ_WORDS: 194 | UNIQ_WORDS[root] = len(UNIQ_WORDS) 195 | 196 | return UNIQ_WORDS[root] 197 | 198 | 199 | def get_tiles (graf, size=3): 200 | """generate word pairs for the TextRank graph""" 201 | graf_len = len(graf) 202 | 203 | for i in xrange(0, graf_len): 204 | w0 = graf[i] 205 | 206 | for j in xrange(i + 1, min(graf_len, i + 1 + size)): 207 | w1 = graf[j] 208 | 209 | if w0[4] == w1[4] == 1: 210 | yield (w0[0], w1[0],) 211 | 212 | 213 | def parse_graf (msg_id, text, base): 214 | """parse and markup each sentence in the given paragraph""" 215 | global DEBUG 216 | global POS_KEEPS, POS_LEMMA, TAGGER 217 | 218 | markup = [] 219 | i = base 220 | 221 | for s in textblob.TextBlob(text).sentences: 222 | graf = [] 223 | m = hashlib.sha1() 224 | 225 | pos = TAGGER.tag(str(s)) 226 | p_idx = 0 227 | w_idx = 0 228 | 229 | while p_idx < len(pos): 230 | p = pos[p_idx] 231 | 232 | if DEBUG: 233 | print "IDX", p_idx, p 234 | print "reg", is_not_word(p[0]) 235 | print " ", w_idx, len(s.words), s.words 236 | print graf 237 | 238 | if is_not_word(p[0]) or (p[1] == "SYM"): 239 | if (w_idx == len(s.words) - 1): 240 | w = p[0] 241 | t = '.' 242 | else: 243 | p_idx += 1 244 | continue 245 | elif w_idx < len(s.words): 246 | w = s.words[w_idx] 247 | t = p[1].lower()[0] 248 | w_idx += 1 249 | 250 | if t in POS_LEMMA: 251 | l = str(w.singularize().lemmatize(t)).lower() 252 | elif t != '.': 253 | l = str(w).lower() 254 | else: 255 | l = w 256 | 257 | keep = 1 if t in POS_KEEPS else 0 258 | m.update(l) 259 | 260 | id = get_word_id(l) if keep == 1 else 0 261 | graf.append((id, w, l, p[1], keep, i,)) 262 | 263 | i += 1 264 | p_idx += 1 265 | 266 | # tile the pairs for TextRank 267 | tile = list(get_tiles(graf)) 268 | 269 | #"lang": s.detect_language(), 270 | markup.append({ 271 | "id": msg_id, 272 | "size": len(graf), 273 | "sha1": m.hexdigest(), 274 | "polr": s.sentiment.polarity, 275 | "subj": s.sentiment.subjectivity, 276 | "graf": graf, 277 | "tile": tile 278 | }) 279 | 280 | return markup, i 281 | 282 | 283 | ###################################################################### 284 | ## common utilities 285 | 286 | def pretty_print (obj, indent=False): 287 | """pretty print a JSON object""" 288 | 289 | if indent: 290 | return json.dumps(obj, sort_keys=True, indent=2, separators=(',', ': ')) 291 | else: 292 | return json.dumps(obj, sort_keys=True) 293 | -------------------------------------------------------------------------------- /filter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import exsto 5 | import json 6 | import os 7 | import sys 8 | 9 | 10 | def main (): 11 | path = sys.argv[1] 12 | 13 | if os.path.isdir(path): 14 | exsto.test_filter(path) 15 | else: 16 | with open(path, 'r') as f: 17 | for line in f.readlines(): 18 | meta = json.loads(line) 19 | print exsto.pretty_print(exsto.filter_quotes(meta["text"])) 20 | 21 | 22 | if __name__ == "__main__": 23 | main() 24 | -------------------------------------------------------------------------------- /graph1.scala: -------------------------------------------------------------------------------- 1 | import org.apache.spark.graphx._ 2 | import org.apache.spark.rdd.RDD 3 | 4 | val sqlCtx = new org.apache.spark.sql.SQLContext(sc) 5 | import sqlCtx._ 6 | 7 | val edge = sqlCtx.parquetFile("graf_edge.parquet") 8 | edge.registerTempTable("edge") 9 | 10 | val node = sqlCtx.parquetFile("graf_node.parquet") 11 | node.registerTempTable("node") 12 
| 13 | // Let's pick one message as an example -- 14 | // at scale we'd parallelize this 15 | 16 | val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw" 17 | 18 | 19 | val sql = """ 20 | SELECT node_id, root 21 | FROM node 22 | WHERE id='%s' AND keep='1' 23 | """.format(msg_id) 24 | 25 | val n = sqlCtx.sql(sql.stripMargin).distinct() 26 | val nodes: RDD[(Long, String)] = n.map{ p => 27 | (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String]) 28 | } 29 | nodes.collect() 30 | 31 | 32 | val sql = """ 33 | SELECT node0, node1 34 | FROM edge 35 | WHERE id='%s' 36 | """.format(msg_id) 37 | 38 | val e = sqlCtx.sql(sql.stripMargin).distinct() 39 | val edges: RDD[Edge[Int]] = e.map{ p => 40 | Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0) 41 | } 42 | edges.collect() 43 | 44 | // run PageRank 45 | 46 | val g: Graph[String, Int] = Graph(nodes, edges) 47 | val r = g.pageRank(0.0001).vertices 48 | 49 | r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println) 50 | 51 | // save the ranks 52 | 53 | case class Rank(id: Int, rank: Float) 54 | val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat)) 55 | 56 | rank.registerTempTable("rank") 57 | 58 | 59 | ////////////////////////////////////////////////////////////////////// 60 | 61 | def median[T](s: Seq[T])(implicit n: Fractional[T]) = { 62 | import n._ 63 | val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2) 64 | if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head 65 | } 66 | 67 | node.schema 68 | edge.schema 69 | rank.schema 70 | 71 | val sql = """ 72 | SELECT n.num, n.raw, r.rank 73 | FROM node n JOIN rank r ON n.node_id = r.id 74 | WHERE n.id='%s' AND n.keep='1' 75 | ORDER BY n.num 76 | """.format(msg_id) 77 | 78 | val s = sqlCtx.sql(sql.stripMargin).collect() 79 | 80 | val min_rank = median(r.map(_._2).collect()) 81 | 82 | var span:List[String] = List() 83 | var last_index = -1 84 | var rank_sum = 0.0 85 | 86 | var phrases:collection.mutable.Map[String, Double] = collection.mutable.Map() 87 | 88 | s.foreach { x => 89 | //println (x) 90 | val index = x.getInt(0) 91 | val word = x.getString(1) 92 | val rank = x.getFloat(2) 93 | 94 | var isStop = false 95 | 96 | // test for break from past 97 | if (span.size > 0 && rank < min_rank) isStop = true 98 | if (span.size > 0 && (index - last_index > 1)) isStop = true 99 | 100 | // clear accumulation 101 | if (isStop) { 102 | val phrase = span.mkString(" ") 103 | phrases += (phrase -> rank_sum) 104 | //println(phrase, rank_sum) 105 | 106 | span = List() 107 | last_index = index 108 | rank_sum = 0.0 109 | } 110 | 111 | // start or append 112 | if (rank >= min_rank) { 113 | span = span :+ word 114 | last_index = index 115 | rank_sum += rank 116 | } 117 | } 118 | 119 | // summarize the text as a list of ranked keyphrases 120 | 121 | var summary = sc.parallelize(phrases.toSeq).distinct().sortBy(_._2, ascending=false).cache() 122 | 123 | // take top 50 percentile 124 | // NOT USED FOR SMALL MESSAGES 125 | 126 | val min_rank = median(summary.map(_._2).collect().toSeq) 127 | summary = summary.filter(_._2 >= min_rank) 128 | 129 | val sum = summary.map(_._2).reduce(_ + _) 130 | summary = summary.map(x => (x._1, x._2 / sum)) 131 | summary.collect() 132 | -------------------------------------------------------------------------------- /graph2.scala: -------------------------------------------------------------------------------- 1 | import org.apache.spark.graphx._ 2 | import org.apache.spark.rdd.RDD 3 | 4 | val sqlCtx = new 
org.apache.spark.sql.SQLContext(sc) 5 | import sqlCtx._ 6 | 7 | val edge = sqlCtx.parquetFile("reply_edge.parquet") 8 | edge.registerTempTable("edge") 9 | 10 | val node = sqlCtx.parquetFile("reply_node.parquet") 11 | node.registerTempTable("node") 12 | 13 | edge.schemaString 14 | node.schemaString 15 | 16 | 17 | val sql = "SELECT id, sender FROM node" 18 | 19 | val n = sqlCtx.sql(sql).distinct() 20 | val nodes: RDD[(Long, String)] = n.map{ p => 21 | (p(0).asInstanceOf[Long], p(1).asInstanceOf[String]) 22 | } 23 | nodes.collect() 24 | 25 | 26 | val sql = "SELECT replier, sender, num FROM edge" 27 | 28 | val e = sqlCtx.sql(sql).distinct() 29 | val edges: RDD[Edge[Int]] = e.map{ p => 30 | Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int]) 31 | } 32 | edges.collect() 33 | 34 | 35 | // run graph analytics 36 | 37 | val g: Graph[String, Int] = Graph(nodes, edges) 38 | val r = g.pageRank(0.0001).vertices 39 | 40 | r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println) 41 | 42 | // define a reduce operation to compute the highest degree vertex 43 | 44 | def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = { 45 | if (a._2 > b._2) a else b 46 | } 47 | 48 | // compute the max degrees 49 | 50 | val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max) 51 | val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max) 52 | val maxDegrees: (VertexId, Int) = g.degrees.reduce(max) 53 | 54 | val node_map: scala.collection.Map[Long, String] = node. 55 | map(p => (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])).collectAsMap() 56 | 57 | // connected components 58 | 59 | val scc = g.stronglyConnectedComponents(10).vertices 60 | node.join(scc).foreach(println) 61 | -------------------------------------------------------------------------------- /notebooks.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DerwenAI/exsto/d632379ab19b36cc0bb9e17623d161e32963ff63/notebooks.zip -------------------------------------------------------------------------------- /parse.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import exsto 5 | import json 6 | import sys 7 | 8 | 9 | DEBUG = False # True 10 | 11 | def main(): 12 | global DEBUG 13 | 14 | path = sys.argv[1] 15 | 16 | with open(path, 'r') as f: 17 | for line in f.readlines(): 18 | meta = json.loads(line) 19 | base = 0 20 | 21 | for graf_text in exsto.filter_quotes(meta["text"]): 22 | if DEBUG: 23 | print graf_text 24 | 25 | grafs, new_base = exsto.parse_graf(meta["id"], graf_text, base) 26 | base = new_base 27 | 28 | for graf in grafs: 29 | print exsto.pretty_print(graf) 30 | 31 | 32 | if __name__ == "__main__": 33 | main() 34 | -------------------------------------------------------------------------------- /regex.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | PAT_PUNCT = re.compile(r'^\W+$') 4 | PAT_SPACE = re.compile(r'\S+$') 5 | 6 | ex = [ 7 | '________________________________', 8 | '.' 
9 | ] 10 | 11 | for s in ex: 12 | print s 13 | print "reg", PAT_PUNCT.match(s) 14 | print "foo", PAT_SPACE.match(s) 15 | -------------------------------------------------------------------------------- /scrape.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | import ConfigParser 5 | import exsto 6 | import sys 7 | import time 8 | 9 | 10 | def main (): 11 | config = ConfigParser.ConfigParser() 12 | config.read("defaults.cfg") 13 | 14 | iterations = config.getint("scraper", "iterations") 15 | nap_time = config.getint("scraper", "nap_time") 16 | base_url = config.get("scraper", "base_url") 17 | url = base_url + config.get("scraper", "start_url") 18 | 19 | with open(sys.argv[1], 'w') as f: 20 | for i in xrange(0, iterations): 21 | if len(url) < 1: 22 | break 23 | else: 24 | root = exsto.scrape_url(url) 25 | meta = exsto.parse_email(root, base_url) 26 | 27 | f.write(exsto.pretty_print(meta)) 28 | f.write('\n') 29 | 30 | url = meta["next_url"] 31 | time.sleep(nap_time) 32 | 33 | 34 | if __name__ == "__main__": 35 | main() 36 | --------------------------------------------------------------------------------