├── .gitignore
├── LICENSE
├── README.md
├── __init__.py
├── dnflow.cfg.template
├── json2csv.py
├── luigi.cfg
├── queue_tasks.py
├── requirements.txt
├── schema.sql
├── static
│   ├── css
│   │   └── style.css
│   └── js
│       └── index.js
├── summarize.py
├── templates
│   ├── base.html
│   ├── feed.xml
│   ├── index.html
│   ├── robots.txt
│   ├── summary.html
│   └── summary_compare.html
└── ui.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 |
27 | # PyInstaller
28 | # Usually these files are written by a python script from a template
29 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 |
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 |
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 |
48 | # Translations
49 | *.mo
50 | *.pot
51 |
52 | # Django stuff:
53 | *.log
54 |
55 | # Sphinx documentation
56 | docs/_build/
57 |
58 | # PyBuilder
59 | target/
60 |
61 | # IPython Notebook
62 | .ipynb_checkpoints
63 |
64 |
65 | .python-version
66 | *.swp
67 | ENV
68 | data/
69 | *.sqlite*
70 | luigi-state.pickle
71 | .DS_Store
72 | *.rdb
73 | node_modules/
74 | dnflow.cfg
75 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2016 Washington University in St Louis
4 | Copyright (c) 2016 University of California at Riverside
5 | Copyright (c) 2016 University of Maryland.
6 |
7 | Permission is hereby granted, free of charge, to any person obtaining a copy
8 | of this software and associated documentation files (the "Software"), to deal
9 | in the Software without restriction, including without limitation the rights
10 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
11 | copies of the Software, and to permit persons to whom the Software is
12 | furnished to do so, subject to the following conditions:
13 |
14 | The above copyright notice and this permission notice shall be included in all
15 | copies or substantial portions of the Software.
16 |
17 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
18 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
19 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
20 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
21 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
22 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
23 | SOFTWARE.
24 |
25 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # dnflow
2 |
3 | An early experiment in automating a series of actions with Twitter
4 | data for docnow. If you want to install dnflow and don't want to manually
5 | set things up yourself give our
6 | [Ansible playbook](https://github.com/docnow/dnflow-ansible) a try.
7 |
8 | Uses [Luigi](http://luigi.readthedocs.org/) for workflow automation.
9 |
10 |
11 | ## running it for yourself
12 |
13 | First create your dnflow configuration file, and add your Twitter application
14 | keys to it:
15 |
16 | cp dnflow.cfg.template dnflow.cfg
17 |
18 | If you are running on a non-standard HTTP port, such as the flask default,
19 | `localhost:5000`, be sure to include the port number in the value of
20 | `HOSTNAME`, e.g.:
21 |
22 | HOSTNAME = 'localhost:5000'
23 |
24 | The current `summarize.py` is set up to collect a handful of tweets
25 | based on a search, then execute a series of counts against it. This
26 | will result in one data file (the source tweets) and several count
27 | files (with the same name under `data/` but with extensions like
28 | `-urls` and `-hashtags` added on).
29 |
30 | Assuming you either have an activated virtualenv or similar sandbox,
31 | install the requirements first:
32 | ```
33 | % pip install -r requirements.txt
34 | ```
35 |
36 | Start the `luigid` central scheduler, best done in another terminal:
37 | ```
38 | % luigid
39 | ```
40 |
41 | To test the workflow, run the following to kick it off (substituting a
42 | search term of interest):
43 | ```
44 | % python -m luigi --module summarize RunFlow --term lahoreblast
45 | ```
46 |
47 | It may take a moment to execute the search, which will require repeated
48 | calls to the Twitter API. As soon as it completes, you should have all
49 | the mentioned files in your `data/` directory. The naming scheme isn't
50 | well thought out. This is only a test.
51 |
52 | While you're at it, take a look at the web ui for luigi's scheduler at:
53 |
54 | http://localhost:8082/
55 |
56 | (Assuming you didn't change the port when you started luigid.)
57 |
58 |
59 | ## adding the flask UI
60 |
61 | `ui.py` contains a simple web app that allows a search to be specified
62 | through the web, queueing workflows to execute in the background, and
63 | showing workflow process status and links to completed summaries as well.
64 | Running the web UI takes a few more steps.
65 |
66 | * Install and run [Redis](http://redis.io/)
67 |
68 | Redis can be run without configuration changes, best done in another
69 | terminal:
70 |
71 | ```
72 | % redis-server
73 | ```
74 |
75 | * Start a [Redis Queue](http://python-rq.org/) worker
76 |
77 | RQ requires a running instance of Redis and one or more workers, also
78 | best done in another terminal.
79 |
80 | ```
81 | % rq worker
82 | ```
83 |
84 | * Create the flask UI backend
85 |
86 | A simple SQLite3 database tracks the searches you will create and their
87 | workflow status. Within your dnflow virtual environment:
88 |
89 | ```
90 | % sqlite3 db.sqlite3 < schema.sql
91 | ```
92 |
93 | * Start the [flask](http://flask.pocoo.org/) UI
94 |
95 | The flask UI shows a list of existing searches, lets you add new ones,
96 | and links to completed search summaries. Again, within your dnflow
97 | virtual environment, and probably in yet another terminal window:
98 |
99 | ```
100 | % python ui.py
101 | ```
102 |
103 |
104 | ### The flow, for now
105 |
106 | The luigi workflow is not automated; it needs to be invoked explicitly.
107 | The web UI is the wrong place to invoke the workflow because the
108 | workflow can run for a long time, yet the UI needs to remain
109 | responsive. For these reasons, the process is separated out with
110 | the queue.
111 |
112 | When a search is added, dnflow adds a job to the queue by defining
113 | a Python subprocess to call the luigi workflow from the commandline.
114 | RQ enqueues this task for later processing. If one or more RQ
115 | workers are available, the job is assigned and begins. Because
116 | dnflow's enqueueing of the job is very fast, it can return an updated
117 | view promptly.
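
The enqueue-and-shell-out step above can be sketched as a plain function.
This is a hypothetical `build_luigi_cmd` helper, not part of the codebase;
it mirrors the argument list that `queue_tasks.run_flow` passes to
`subprocess.run`:

```python
def build_luigi_cmd(term, job_id, count):
    # Assemble the same command line that queue_tasks.run_flow
    # hands to subprocess.run for an RQ worker to execute.
    return [
        'python', '-m', 'luigi',
        '--module', 'summarize', 'RunFlow',
        '--term', term,
        '--jobid', str(job_id),
        '--count', str(count),
    ]
```

Keeping the job body this small is what makes enqueueing fast: the web
process only records the command, and the long-running luigi work happens
in the worker.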
118 |
119 | The luigi workflow takes as long as it needs, generating static files
120 | in a distinct directory for each requested search.
121 |
122 | Integration between the web UI and workflows occurs in the UI's
123 | SQLite database, where search terms are stored with a job id. When
124 | the workflow is assigned to an RQ worker, that search record is
125 | updated through an HTTP PUT to the web app at the URL `/job`, with
126 | a reference to the job id and its output directory. Each individual
127 | task within the workflow further updates this same URL with additional
128 | PUTs upon task start, success, or failure. This is handled using
129 | [Luigi's event
130 | model](http://luigi.readthedocs.io/en/stable/api/luigi.event.html);
131 | these HTTP callbacks to the UI keep the integration between the two
132 | pieces of the environment simple. During a workflow, the most recent
133 | task status will be recorded in the database, and is available for
134 | display in the UI.
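
A minimal sketch of such a status callback, using only the standard
library. The `job_id`/`status` field names are assumptions for
illustration, not the app's actual payload schema:

```python
import json
import urllib.request

def status_update(job_id, status, hostname='localhost:5000'):
    # Build an HTTP PUT to the UI's /job endpoint with a JSON body.
    # Hypothetical field names; the real app defines its own schema.
    return urllib.request.Request(
        'http://%s/job' % hostname,
        data=json.dumps({'job_id': job_id, 'status': status}).encode('utf-8'),
        method='PUT',
        headers={'Content-Type': 'application/json'},
    )

# Wiring sketch: a Luigi event handler in the workflow module could
# send one of these on each task transition, e.g.:
# @luigi.Task.event_handler(luigi.Event.SUCCESS)
# def on_success(task):
#     urllib.request.urlopen(status_update(task.jobid, 'SUCCESS'))
```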
135 |
136 | With these pieces in place, several requests for new searches can
137 | be added rapidly within the UI. Each search will be run by the
138 | next available RQ worker process, so if only one process is available,
139 | they will execute in succession, but with more than one worker running,
140 | multiple workflows can run in parallel. The main limitation here
141 | is the rate limit on Twitter's API.
142 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DocNow/dnflow/3e43cb4a0062af2cdaf3d4948725aaddfcbd84f8/__init__.py
--------------------------------------------------------------------------------
/dnflow.cfg.template:
--------------------------------------------------------------------------------
1 | HOSTNAME = 'localhost'
2 | DEBUG = True
3 | DATABASE = 'db.sqlite3'
4 | SECRET_KEY = 'a super secret key'
5 | STATIC_URL_PATH = '/static'
6 | DATA_DIR = 'data'
7 | REDIS_HOST = 'localhost'
8 | REDIS_PORT = 6379
9 | REDIS_DB = 4
10 | TWITTER_CONSUMER_KEY = 'YOUR_TWITTER_CONSUMER_KEY_HERE'
11 | TWITTER_CONSUMER_SECRET = 'YOUR_TWITTER_CONSUMER_SECRET_HERE'
12 | MAX_TIMEOUT = 24 * 60 * 60
13 |
14 | # set the following two variables to non-empty values to add
15 | # basic auth for PUT updates on /job
16 | HTTP_BASICAUTH_USER = ''
17 | HTTP_BASICAUTH_PASS = ''
18 |
--------------------------------------------------------------------------------
/json2csv.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import sys
3 | import json
4 | import fileinput
5 |
6 | def main():
7 | sheet = csv.writer(sys.stdout)
8 | sheet.writerow(get_headings())
9 | for line in fileinput.input():
10 | tweet = json.loads(line)
11 | sheet.writerow(get_row(tweet))
12 |
13 | def get_headings():
14 | return [
15 | 'coordinates',
16 | 'created_at',
17 | 'hashtags',
18 | 'media',
19 | 'urls',
20 | 'favorite_count',
21 | 'id',
22 | 'in_reply_to_screen_name',
23 | 'in_reply_to_status_id',
24 | 'in_reply_to_user_id',
25 | 'lang',
26 | 'place',
27 | 'possibly_sensitive',
28 | 'retweet_count',
29 | 'retweet_id',
30 | 'retweet_screen_name',
31 | 'source',
32 | 'text',
33 | 'tweet_url',
34 | 'user_created_at',
35 | 'user_screen_name',
36 | 'user_default_profile_image',
37 | 'user_description',
38 | 'user_favourites_count',
39 | 'user_followers_count',
40 | 'user_friends_count',
41 | 'user_listed_count',
42 | 'user_location',
43 | 'user_name',
44 | 'user_screen_name',
45 | 'user_statuses_count',
46 | 'user_time_zone',
47 | 'user_urls',
48 | 'user_verified',
49 | ]
50 |
51 | def get_row(t):
52 | get = t.get
53 | user = t.get('user').get
54 | row = [
55 | coordinates(t),
56 | get('created_at'),
57 | hashtags(t),
58 | media(t),
59 | urls(t),
60 | get('favorite_count'),
61 | get('id_str'),
62 | get('in_reply_to_screen_name'),
63 | get('in_reply_to_status_id'),
64 | get('in_reply_to_user_id'),
65 | get('lang'),
66 | place(t),
67 | get('possibly_sensitive'),
68 | get('retweet_count'),
69 | retweet_id(t),
70 | retweet_screen_name(t),
71 | get('source'),
72 | get('text'),
73 | tweet_url(t),
74 | user('created_at'),
75 | user('screen_name'),
76 | user('default_profile_image'),
77 | user('description'),
78 | user('favourites_count'),
79 | user('followers_count'),
80 | user('friends_count'),
81 | user('listed_count'),
82 | user('location'),
83 | user('name'),
84 | user('screen_name'),
85 | user('statuses_count'),
86 | user('time_zone'),
87 | user_urls(t),
88 | user('verified'),
89 | ]
90 | return row
91 |
92 | def coordinates(t):
93 | if 'coordinates' in t and t['coordinates']:
94 | return '%f %f' % tuple(t['coordinates']['coordinates'])
95 | return None
96 |
97 | def hashtags(t):
98 | return ' '.join([h['text'] for h in t['entities']['hashtags']])
99 |
100 | def media(t):
101 | if 'media' in t['entities']:
102 | return ' '.join([h['expanded_url'] for h in t['entities']['media']])
103 | else:
104 | return None
105 |
106 | def urls(t):
107 | return ' '.join([h['expanded_url'] for h in t['entities']['urls']])
108 |
109 | def place(t):
110 | if t['place']:
111 | return t['place']['full_name']
112 |
113 | def retweet_id(t):
114 | if 'retweeted_status' in t and t['retweeted_status']:
115 | return t['retweeted_status']['id_str']
116 |
117 | def retweet_screen_name(t):
118 | if 'retweeted_status' in t and t['retweeted_status']:
119 | return t['retweeted_status']['user']['screen_name']
120 |
121 | def tweet_url(t):
122 | return "https://twitter.com/%s/status/%s" % (t['user']['screen_name'], t['id_str'])
123 |
124 | def user_urls(t):
125 | u = t.get('user')
126 | if not u:
127 | return None
128 | urls = []
129 | if 'entities' in u and 'url' in u['entities'] and 'urls' in u['entities']['url']:
130 | for url in u['entities']['url']['urls']:
131 | if url['expanded_url']:
132 | urls.append(url['expanded_url'])
133 | return ' '.join(urls)
134 |
135 |
136 | if __name__ == "__main__":
137 | main()
138 |
--------------------------------------------------------------------------------
/luigi.cfg:
--------------------------------------------------------------------------------
1 | [core]
2 | parallel-scheduling = True
3 |
4 | [scheduler]
5 | record_task_history = True
6 | state_path = luigi-state.pickle
7 |
8 | [task_history]
9 | db_connection = sqlite:///history.sqlite.db
10 |
--------------------------------------------------------------------------------
/queue_tasks.py:
--------------------------------------------------------------------------------
1 | import subprocess
2 |
3 |
4 | def run_flow(text, job_id, count, token, secret):
5 | subprocess.run([
6 | 'python',
7 | '-m',
8 | 'luigi',
9 | '--module',
10 | 'summarize',
11 | 'RunFlow',
12 | '--term',
13 | text,
14 | '--jobid',
15 | str(job_id),
16 | '--count',
17 | str(count),
18 | '--token',
19 | str(token),
20 | '--secret',
21 | str(secret)
22 | ])
23 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | Flask
2 | flask-oauthlib
3 | imagehash
4 | Jinja2
5 | luigi
6 | networkx
7 | pandas
8 | redis
9 | rq
10 | sqlalchemy
11 | twarc
12 | tweepy
13 |
--------------------------------------------------------------------------------
/schema.sql:
--------------------------------------------------------------------------------
1 | DROP TABLE IF EXISTS searches;
2 | CREATE TABLE searches (
3 | id INTEGER PRIMARY KEY AUTOINCREMENT,
4 | text TEXT NOT NULL,
5 | date_path TEXT NOT NULL,
6 | user TEXT NOT NULL,
7 | status TEXT,
8 | created DATETIME DEFAULT CURRENT_TIMESTAMP,
9 | published DATETIME
10 | );
11 |
--------------------------------------------------------------------------------
/static/css/style.css:
--------------------------------------------------------------------------------
1 | body {
2 | margin: 1em;
3 | }
4 |
5 | .searchList {
6 | }
7 |
8 | table {
9 | text-align: left;
10 | width: 1200px;
11 | }
12 |
13 | th {
14 | padding: 5px;
15 | width: 15%;
16 | }
17 |
18 | th.status {
19 | width: 20%;
20 | }
21 |
22 | th.actions {
23 | width: 20%;
24 | }
25 |
26 | td {
27 | padding-right: 10px;
28 | padding-bottom: 5px;
29 | }
30 |
31 | .disclaimer {
32 | max-width: 600px;
33 | background-color: #eee;
34 | padding: 5px;
35 | border: thin solid #ccc;
36 | }
37 |
38 | .searchForm {
39 | padding: 5px;
40 | }
41 |
42 | #includePublished {
43 | font-size: 12pt;
44 | font-style: italic;
45 | padding: 3px;
46 | margin-bottom: 5px;
47 | }
48 |
49 | button {
50 | width: 90px;
51 | margin-right: 5px;
52 | outline: none;
53 | }
54 |
55 | button.delete {
56 | background-color: pink;
57 | }
58 |
59 | button.publish {
60 | background-color: lightgreen;
61 | }
62 |
63 | button.unpublish {
64 | background-color: yellow;
65 | }
66 |
--------------------------------------------------------------------------------
/static/js/index.js:
--------------------------------------------------------------------------------
1 | var Search = React.createClass({
2 | put: function(change) {
3 | $.ajax({
4 | type: 'PUT',
5 | url: '/api/search/' + this.props.id,
6 | data: JSON.stringify(change),
7 | dataType: 'json',
8 | contentType: 'application/json'
9 | });
10 | },
11 | unpublish: function() {
12 | this.put({id: this.props.id, published: false});
13 | },
14 | publish: function() {
15 | this.put({id: this.props.id, published: true});
16 | },
17 | remove: function() {
18 | $.ajax({
19 | type: 'DELETE',
20 | url: '/api/search/' + this.props.id,
21 | data: JSON.stringify({}),
22 | dataType: 'json',
23 | contentType: 'application/json'
24 | });
25 | },
26 | render: function() {
27 | var link = {this.props.text};
28 | if (! (this.props.status == "FINISHED: RunFlow")) {
29 | link = this.props.text;
30 | }
31 |
32 | if (this.props.canModify) {
33 | if (this.props.published) {
34 | var publishButton =
35 | ;
38 | } else {
39 | var publishButton =
40 | ;
43 | }
44 | var buttons =
45 |
--------------------------------------------------------------------------------
/templates/base.html:
--------------------------------------------------------------------------------
hi there
6 | {% if twitter_user %}
7 | {{ twitter_user }}
8 | {% endif %}
9 |
10 |
11 |
12 | This is a design prototype for the Documenting the Now project. Please understand that data created here can disappear at any time. By using this application you are agreeing to our code of conduct. If you have any questions please let us know what you think in our Slack channel, or by emailing info@docnow.io
13 |