├── .github
│   └── workflows
│       └── main.yml
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── images
│   └── d3.png
├── setup.cfg
├── setup.py
├── test-data
│   └── tweets.jsonl
├── test_twarc_network.py
└── twarc_network
    ├── __init__.py
    └── index.html
/.github/workflows/main.yml:
--------------------------------------------------------------------------------
1 | name: tests
2 | on:
3 | push:
4 | paths-ignore:
5 | - 'README.md'
6 | jobs:
7 | build:
8 | runs-on: ubuntu-latest
9 | strategy:
10 | matrix:
11 | python-version: [3.8]
12 | steps:
13 |
14 | - uses: actions/checkout@v2
15 |
16 | - name: Set up Python ${{ matrix.python-version }}
17 | uses: actions/setup-python@v2
18 | with:
19 | python-version: ${{ matrix.python-version }}
20 |
21 | - name: Install dependencies
22 | run: |
23 | python -m pip install --upgrade pip
24 | python setup.py install
25 |
26 | - name: Test with pytest
27 | run: python setup.py test
28 |
29 | - name: Ensure packages can be built
30 | run: |
31 | python -m pip install wheel
32 | python setup.py sdist bdist_wheel
33 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .eggs
2 | dist
3 | twarc_network.egg-info
4 | Pipfile
5 | __pycache__
6 | build
7 | *.log
8 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) Documenting the Now Project
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include twarc_network/index.html
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # twarc-network
2 |
3 |
4 |
5 | [tests](https://github.com/DocNow/twarc-network/actions/workflows/main.yml)
6 |
7 | *twarc-network* builds a reply, quote, retweet and mention network from a file of tweets
8 | that you've collected using twarc. It will write out the network as a [gexf],
9 | [gml], json, csv or html file. It uses [networkx] for the graph model and [d3] for the html presentation.
10 |
11 | If you know CSS you can hack at the generated HTML file to modify the style to
12 | suit your needs. If you come up with a more pleasing representation please send
13 | a pull request! Exporting as gexf or gml will allow you to import
14 | the data into tools like [Gephi], [Cytoscape] and [GraphViz] for further
15 | analysis and visualization.
16 |
17 | ## Install
18 |
19 | To install you will need to:
20 |
21 | pip3 install twarc-network
22 |
23 | ## Collect Data
24 |
25 | First you will need to collect some data with [twarc]:
26 |
27 | twarc2 search blacklivesmatter > tweets.jsonl
28 |
29 | ## Output Formats
30 |
31 | Once you've got some data you can create the default D3 HTML visualization:
32 |
33 | twarc2 network tweets.jsonl network.html
34 |
35 | or [gexf]:
36 |
37 | twarc2 network tweets.jsonl --format gexf network.gexf
38 |
39 | or [gml]:
40 |
41 | twarc2 network tweets.jsonl --format gml network.gml
42 |
43 | or json:
44 |
45 | twarc2 network tweets.jsonl --format json network.json
46 |
47 | or CSV edge list:
48 |
49 | twarc2 network tweets.jsonl --format csv network.csv
50 |
51 | ## Changing the Nodes
52 |
53 | Tweets can be connected together as replies, quotes and retweets. If you would
54 | like to see the network oriented around nodes that are tweets instead of users
55 | you can:
56 |
57 | twarc2 network tweets.jsonl --nodes tweets network.html
58 |
59 | Hashtags can be connected when they are used together in a tweet. So you
60 | can visualize a network where nodes are hashtags:
61 |
62 | twarc2 network tweets.jsonl --nodes hashtags > network.html
63 |
64 | ## Changing the Edges
65 |
66 | By default, when user and tweet graphs are built,
67 | all types of interactions are used as edges:
68 | Retweet, reply or quote in the case of tweets;
69 | retweet, reply, quote or mention in the case of users.
70 | But you can also limit the types considered.
71 | For example, if you only want retweet edges, you can:
72 |
73 | twarc2 network tweets.jsonl tweets.html --edges retweet
74 |
75 | Or if you only want replies and quotes, you can:
76 |
77 | twarc2 network tweets.jsonl tweets.html --edges reply --edges quote
78 |
79 | ## Component Sizes
80 |
81 | Depending on the data you are analyzing it can be helpful to remove weakly connected components in
82 | the graph that are smaller than some number. For example if you don't want to
83 | visualize networks where two nodes are only connected to each other and not
84 | anyone else you can:
85 |
86 | twarc2 network tweets.jsonl tweets.html --min-component-size 3
87 |
88 | It's less common, but you can also remove nodes that belong to components that
89 | are too large. For example, if you wanted to remove any components that were
90 | larger than 10:
91 |
92 | twarc2 network tweets.jsonl tweets.html --max-component-size 10
93 |
94 | ## Attributes
95 |
96 | The possible node attributes are the following:
97 | - `screen_name`:
98 | When the node is a user, its username;
99 | by default, it is used as the label of the nodes.
100 | When the node is a tweet, the username of its author.
101 | - `user_id`:
102 | When the node is a user, its id;
103 | if you want to use it as the label of the nodes,
104 | you can use the flag `--id-as-label`.
105 | When the node is a tweet, the id of its author.
106 | - `start_date`:
107 | The date of the first interaction that made the node appear in the graph.
108 | For example, if the node is a retweet, it is its date of creation.
109 | Or if the node is an original tweet,
110 | it is the date of the first retweet, reply or quote.
111 | The format is `dd/mm/yyyy hh:mm:ss`.
112 |
113 | The possible edge attributes are the following:
114 | - `type`: When the nodes are tweets, one of the following values:
115 | `retweet`, `reply` or `quote`.
116 | - `retweet`: When the nodes are users,
117 | the number of retweets the source has made to the target.
118 | - `reply`: When the nodes are users,
119 | the number of replies the source has made to the target.
120 | - `quote`: When the nodes are users,
121 | the number of quotes the source has made to the target.
122 | - `mention`: When the nodes are users,
123 | the number of mentions the source has made to the target.
124 | - `weight`:
125 | When the nodes are users, the sum of `retweet`, `reply`, `quote` and `mention`.
126 | When the nodes are hashtags,
127 | the number of tweets that contained both hashtags.
128 |
129 | [gexf]: https://gephi.org/gexf/format/
130 | [d3]: https://d3js.org/
131 | [networkx]: https://networkx.org/
132 | [twarc]: https://github.com/docnow/twarc
133 | [gml]: https://en.wikipedia.org/wiki/Graph_Modelling_Language
134 | [Gephi]: https://gephi.org/
135 | [Cytoscape]: https://cytoscape.org/
136 | [GraphViz]: https://graphviz.org/
137 |
--------------------------------------------------------------------------------
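The README's Attributes section describes what ends up in the exported files. As a minimal sketch of how the JSON export might be consumed, the snippet below reads a small, hypothetical sample shaped like `--format json` output for a user network (a `nodes` list and a `links` list), totals incoming edge weight per user, and parses `start_date` using the documented `dd/mm/yyyy hh:mm:ss` format. The sample data and the helper names `incoming_weight` and `parse_start_date` are assumptions made for illustration; they are not part of twarc-network.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical sample shaped like `twarc2 network ... --format json` output
# for a user network. Real output will contain your own data.
sample = """
{
  "nodes": [
    {"id": "alice", "screen_name": "alice", "user_id": "1", "start_date": "25/05/2021 14:03:09"},
    {"id": "bob", "screen_name": "bob", "user_id": "2", "start_date": "25/05/2021 14:03:09"},
    {"id": "carol", "screen_name": "carol", "user_id": "3", "start_date": "26/05/2021 08:15:00"}
  ],
  "links": [
    {"source": "alice", "target": "bob", "weight": 3, "retweet": 2, "reply": 1, "quote": 0, "mention": 0},
    {"source": "carol", "target": "bob", "weight": 1, "retweet": 0, "reply": 0, "quote": 0, "mention": 1}
  ]
}
"""


def incoming_weight(graph):
    """Sum the `weight` attribute of every link, grouped by its target node."""
    totals = Counter()
    for link in graph["links"]:
        # Tweet networks carry `type` instead of `weight`, so fall back to 1.
        totals[link["target"]] += link.get("weight", 1)
    return totals


def parse_start_date(node):
    """Parse the documented dd/mm/yyyy hh:mm:ss start_date format."""
    return datetime.strptime(node["start_date"], "%d/%m/%Y %H:%M:%S")


graph = json.loads(sample)
totals = incoming_weight(graph)
first_seen = {node["id"]: parse_start_date(node) for node in graph["nodes"]}

print(totals.most_common(1))
print(min(first_seen.values()))
```

To run this against a real export, replace `json.loads(sample)` with `json.load(open("network.json"))`.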
/images/d3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DocNow/twarc-network/5f1118ab14a9b66803988bcb6b6a08af8cd0632c/images/d3.png
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [aliases]
2 | test=pytest
3 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 |
3 | with open("README.md") as f:
4 | long_description = f.read()
5 |
6 | setuptools.setup(
7 | name="twarc-network",
8 | version="0.3.0",
9 | url="https://github.com/docnow/twarc-network",
10 | author="Ed Summers",
11 | author_email="ehs@pobox.com",
12 | packages=["twarc_network"],
13 | description="Generate network visualizations for Twitter data",
14 | long_description=long_description,
15 | long_description_content_type="text/markdown",
16 | license="MIT",
17 | classifiers=[
18 | "License :: OSI Approved :: MIT License",
19 | ],
20 | python_requires=">=3.6",
21 | install_requires=["twarc", "networkx"],
22 | setup_data={"twarc_network": ["twarc_network/index.html"]},
23 | package_data={"twarc_network": ["index.html"]},
24 | setup_requires=["pytest-runner"],
25 | tests_require=["pytest"],
26 | entry_points="""
27 | [twarc.plugins]
28 | network=twarc_network:network
29 | """,
30 | )
31 |
--------------------------------------------------------------------------------
/test_twarc_network.py:
--------------------------------------------------------------------------------
1 | import io
2 | import re
3 | import csv
4 | import json
5 |
6 | from twarc_network import network
7 | from click.testing import CliRunner
8 |
9 | runner = CliRunner()
10 |
11 |
12 | def test_html():
13 | result = runner.invoke(network, ["test-data/tweets.jsonl"])
14 | assert result.exit_code == 0
15 | m = re.search(r"var graph = ({.*});.*var link = ", result.output, re.DOTALL)
16 | assert m
17 | graph = json.loads(m.group(1))
18 | assert len(graph["nodes"]) == 656
19 | assert len(graph["links"]) == 618
20 |
21 |
22 | def test_json():
23 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "json"])
24 | assert result.exit_code == 0
25 | graph = json.loads(result.output)
26 | assert len(graph["nodes"]) == 656
27 | assert len(graph["links"]) == 618
28 |
29 |
30 | def test_gexf():
31 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "gexf"])
32 | assert result.exit_code == 0
33 |
34 |
35 | def test_gml():
36 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "gml"])
37 | assert result.exit_code == 0
38 |
39 |
40 | def test_csv():
41 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "csv"])
42 | assert result.exit_code == 0
43 | data = csv.reader(io.StringIO(result.output))
44 | assert len(list(data)) == 618
45 |
46 |
47 | def test_min_component():
48 | result = runner.invoke(
49 | network,
50 | ["test-data/tweets.jsonl", "--format", "json", "--min-component-size", "4"],
51 | )
52 | graph = json.loads(result.output)
53 | assert len(graph["nodes"]) == 469
54 |
55 |
56 | def test_max_component():
57 | result = runner.invoke(
58 | network,
59 | ["test-data/tweets.jsonl", "--format", "json", "--max-component-size", "15"],
60 | )
61 | graph = json.loads(result.output)
62 | assert len(graph["nodes"]) == 355
63 |
64 |
65 | def test_tweets():
66 | result = runner.invoke(
67 | network, ["test-data/tweets.jsonl", "--format", "json", "--nodes", "tweets"]
68 | )
69 | assert result.exit_code == 0
70 | graph = json.loads(result.output)
71 | assert len(graph["nodes"]) == 613
72 |
73 |
74 | def test_hashtags():
75 | result = runner.invoke(
76 | network, ["test-data/tweets.jsonl", "--format", "json", "--nodes", "hashtags"]
77 | )
78 | assert result.exit_code == 0
79 | graph = json.loads(result.output)
80 | assert len(graph["nodes"]) == 352
81 |
82 |
83 | def test_edges():
84 | result = runner.invoke(
85 | network,
86 | [
87 | "test-data/tweets.jsonl", "--format", "json", "--edges", "retweet",
88 | "--edges", "reply", "--edges", "quote"
89 | ]
90 | )
91 | assert result.exit_code == 0
92 | graph = json.loads(result.output)
93 | assert len(graph["nodes"]) == 484
94 | assert len(graph["links"]) == 391
95 |
--------------------------------------------------------------------------------
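`test_min_component` and `test_max_component` above exercise the component-size filtering done in `network()`. The sketch below illustrates the same idea without networkx: it finds weakly connected components by ignoring edge direction, then keeps only edges whose component size falls within the requested bounds. It is a pure-Python stand-in for `networkx.weakly_connected_components`, not the package's implementation, and the function names `weak_components` and `filter_by_component_size` are hypothetical.

```python
from collections import defaultdict, deque


def weak_components(edges):
    """Yield sets of nodes that are connected when edge direction is ignored."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        # Breadth-first search over the undirected view of the graph.
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        yield component


def filter_by_component_size(edges, min_size=None, max_size=None):
    """Drop edges whose component is smaller than min_size or larger than max_size."""
    keep = set()
    for component in weak_components(edges):
        if min_size and len(component) < min_size:
            continue
        if max_size and len(component) > max_size:
            continue
        keep |= component
    return [(s, t) for s, t in edges if s in keep and t in keep]


# Three components of sizes 2, 3 and 5.
edges = [
    ("a", "b"),
    ("c", "d"), ("d", "e"), ("e", "c"),
    ("f", "g"), ("g", "h"), ("h", "i"), ("i", "j"),
]

at_least_three = filter_by_component_size(edges, min_size=3)
at_most_four = filter_by_component_size(edges, max_size=4)
```

With this sample, `min_size=3` removes the isolated `a`/`b` pair, which mirrors the `--min-component-size 3` example in the README.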
/twarc_network/__init__.py:
--------------------------------------------------------------------------------
1 | import io
2 | import json
3 | import time
4 | import click
5 | import networkx
6 | import itertools
7 |
8 | from pathlib import Path
9 |
10 | from twarc import ensure_flattened
11 |
12 |
13 | @click.command()
14 | @click.option(
15 | "--format",
16 | type=click.Choice(
17 | ["html", "json", "gexf", "csv", "gml"], case_sensitive=False
18 | ),
19 | default="html",
20 | help="Output format for the network",
21 | )
22 | @click.option(
23 | "--nodes",
24 | type=click.Choice(["users", "tweets", "hashtags"]),
25 | default="users",
26 | help="What type of nodes to use in the network",
27 | )
28 | @click.option(
29 | "--edges",
30 | type=click.Choice(["retweet", "reply", "quote", "mention"]),
31 | multiple=True,
32 | default=["retweet", "reply", "quote", "mention"],
33 | help="What type of edges to use in the network",
34 | )
35 | @click.option("--min-component-size", type=int, help="Minimum weakly connected component size to include")
36 | @click.option("--max-component-size", type=int, help="Maximum weakly connected component size to include")
37 | @click.option("--id-as-label", is_flag=True, help="Use user id as node label")
38 | @click.argument("infile", type=click.File("r"), default="-")
39 | @click.argument("outfile", type=click.File("w"), default="-")
40 | def network(format, nodes, edges, infile, outfile, min_component_size,
41 | max_component_size, id_as_label):
42 | """
43 | Generates a network graph of tweets as GEXF, GML, JSON, HTML, CSV.
44 | """
45 |
46 | g = get_graph(infile, nodes, edges, id_as_label)
47 |
48 | # if the user wants to limit component min/max sizes
49 | if min_component_size or max_component_size:
50 | g_copy = g.copy()
51 | for components in networkx.weakly_connected_components(g):
52 | sg = g.subgraph(components)
53 | if min_component_size and len(sg) < min_component_size:
54 | g_copy.remove_nodes_from(sg.nodes())
55 | elif max_component_size and len(sg) > max_component_size:
56 | g_copy.remove_nodes_from(sg.nodes())
57 | g = g_copy
58 |
59 | if format == "gexf":
60 | outfile.write(nxstr(networkx.write_gexf, g))
61 |
62 | elif format == "gml":
63 | outfile.write(nxstr(networkx.write_gml, g))
64 |
65 | elif format == "json":
66 | json.dump(to_json(g), outfile, indent=2)
67 |
68 | elif format == "csv":
69 | outfile.write(nxstr(networkx.write_edgelist, g, delimiter=","))
70 |
71 | elif format == "html":
72 | graph_data = json.dumps(to_json(g), indent=2)
73 | html_file = Path(__file__).parent / "index.html"
74 | html = html_file.read_text()
75 | html = html.replace("__GRAPH_DATA__", graph_data)
76 | outfile.write(html)
77 |
78 |
79 | def get_graph(infile, nodes_type, edge_types, id_as_label):
80 | g = networkx.DiGraph()
81 |
82 | for line in infile:
83 | for t in ensure_flattened(json.loads(line)):
84 |
85 | from_id = t["id"]
86 |
87 | from_user = t["author"]["username"]
88 | from_user_id = t["author"]["id"]
89 |
90 | created_at_date = time.strftime(
91 | "%d/%m/%Y %H:%M:%S",
92 | time.strptime(t["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ"),
93 | )
94 |
95 | # get referenced tweets but ignore ones that have been deleted and
96 | # have no author stanza
97 | refs = filter(lambda r: "author" in r, t.get("referenced_tweets", []))
98 |
99 | if nodes_type == "users":
100 | for ref in refs:
101 | to_user = ref["author"]["username"]
102 | to_user_id = ref["author"]["id"]
103 | edge_type = get_edge_type(ref)
104 | if edge_type in edge_types:
105 | add_user_edge(
106 | g,
107 | from_user,
108 | from_user_id,
109 | to_user,
110 | to_user_id,
111 | edge_type,
112 | created_at_date,
113 | edge_types,
114 | id_as_label,
115 | )
116 | if "mention" in edge_types:
117 | mentions = t.get("entities", dict()).get("mentions", [])
118 | if is_first_mention_a_retweet(t):
119 | mentions = mentions[1:]
120 | for mention in mentions:
121 | to_user = mention["username"]
122 | to_user_id = mention["id"]
123 | add_user_edge(
124 | g,
125 | from_user,
126 | from_user_id,
127 | to_user,
128 | to_user_id,
129 | "mention",
130 | created_at_date,
131 | edge_types,
132 | id_as_label,
133 | )
134 |
135 | elif nodes_type == "tweets":
136 | for ref in refs:
137 | to_id = ref["id"]
138 | to_user = ref["author"]["username"]
139 | to_user_id = ref["author"]["id"]
140 | edge_type = get_edge_type(ref)
141 | if edge_type in edge_types:
142 | add_tweet_edge(
143 | g,
144 | from_user,
145 | from_user_id,
146 | from_id,
147 | to_user,
148 | to_user_id,
149 | to_id,
150 | edge_type,
151 | created_at_date,
152 | )
153 |
154 | elif nodes_type == "hashtags":
155 |
156 | # some tweets apparently lack an entities stanza?
157 | if "entities" not in t:
158 | continue
159 |
160 | hashtags = map(lambda h: h["tag"], t["entities"].get("hashtags", []))
161 | # list of all possible hashtag pairs
162 | hashtag_pairs = itertools.combinations(hashtags, 2)
163 | for ht1, ht2 in hashtag_pairs:
164 | add_hashtag_edge(
165 | g,
166 | "#" + ht1.lower(),
167 | "#" + ht2.lower(),
168 | created_at_date,
169 | )
170 |
171 | else:
172 | raise Exception(f"Unknown node type: {nodes_type}")
173 |
174 | return g
175 |
176 |
177 | def add_user_edge(g, from_user, from_user_id, to_user, to_user_id, edge_type,
178 | created_at, edge_types, id_as_label):
179 |
180 | # storing start_date will allow for timestamps for gephi timeline, where nodes
181 | # will appear on screen at their start date and stay on forever after
182 |
183 | if id_as_label:
184 | from_label = from_user_id
185 | to_label = to_user_id
186 | else:
187 | from_label = from_user
188 | to_label = to_user
189 | g.add_node(
190 | from_label,
191 | screen_name=from_user,
192 | user_id=from_user_id,
193 | start_date=created_at,
194 | )
195 | g.add_node(
196 | to_label,
197 | screen_name=to_user,
198 | user_id=to_user_id,
199 | start_date=created_at,
200 | )
201 |
202 | if g.has_edge(from_label, to_label):
203 | weights = g[from_label][to_label]
204 | else:
205 | g.add_edge(from_label, to_label)
206 | weights = {t: 0 for t in ("weight", ) + edge_types}
207 | weights["weight"] += 1
208 | weights[edge_type] += 1
209 | g[from_label][to_label].update(weights)
210 |
211 |
212 | def add_tweet_edge(g, from_user, from_user_id, from_id, to_user, to_user_id,
213 | to_id, edge_type, created_at):
214 | g.add_node(
215 | from_id,
216 | screen_name=from_user,
217 | user_id=from_user_id,
218 | start_date=created_at,
219 | )
220 | g.add_node(
221 | to_id,
222 | screen_name=to_user,
223 | user_id=to_user_id,
224 | start_date=created_at,
225 | )
226 |
227 | g.add_edge(from_id, to_id, type=edge_type)
228 |
229 |
230 | def add_hashtag_edge(g, from_hashtag, to_hashtag, created_at):
231 | g.add_node(from_hashtag, start_date=created_at)
232 | g.add_node(to_hashtag, start_date=created_at)
233 |
234 | if g.has_edge(from_hashtag, to_hashtag):
235 | weight = g[from_hashtag][to_hashtag]["weight"] + 1
236 | else:
237 | weight = 1
238 | g.add_edge(from_hashtag, to_hashtag, weight=weight)
239 |
240 |
241 | def to_json(g):
242 | j = {"nodes": [], "links": []}
243 | for node_id, attrs in g.nodes(data=True):
244 | node = {"id": node_id}
245 | node.update(attrs)
246 | j["nodes"].append(node)
247 | for source, target, attrs in g.edges(data=True):
248 | link = {"source": source, "target": target}
249 | link.update(attrs)
250 | j["links"].append(link)
251 | return j
252 |
253 |
254 | def get_edge_type(ref):
255 | if ref["type"] == "retweeted":
256 | return "retweet"
257 | elif ref["type"] == "replied_to":
258 | return "reply"
259 | elif ref["type"] == "quoted":
260 | return "quote"
261 | else:
262 | raise Exception(f'unknown reference type: {ref["type"]}')
263 |
264 |
265 | def is_first_mention_a_retweet(tweet):
266 | if "referenced_tweets" not in tweet:
267 | return False
268 | if get_edge_type(tweet["referenced_tweets"][0]) != "retweet":
269 | return False
270 |
271 | if "entities" not in tweet or "mentions" not in tweet["entities"]:
272 | return False
273 | if tweet["entities"]["mentions"][0]["start"] != 3:
274 | return False
275 |
276 | return True
277 |
278 |
279 | def nxstr(f, *args, **kwargs):
280 | # networkx output functions want to write to a file as bytes
281 | # but click.File is expecting a string. This function takes the
282 | # networkx function and parameters and writes the bytes to a
283 | # BytesIO object to return it as a string.
284 | out = io.BytesIO()
285 | args = args + (out,)
286 | f(*args, **kwargs)
287 | return out.getvalue().decode("utf-8")
288 |
--------------------------------------------------------------------------------
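The hashtag branch of `get_graph` pairs up the hashtags of each tweet with `itertools.combinations` and increments an edge weight per pair. The sketch below reproduces that counting with a plain `Counter` instead of a networkx graph, so the behaviour can be checked in isolation. The function name `hashtag_pair_counts` and the sample tweets are hypothetical. Because the pairs are ordered and the graph in the source is directed, the sketch suggests that the same two hashtags appearing in the opposite order are counted as a separate edge.

```python
import itertools
from collections import Counter


def hashtag_pair_counts(tweets_hashtags):
    """Count ordered hashtag pairs per tweet, lowercased and prefixed with '#'."""
    counts = Counter()
    for tags in tweets_hashtags:
        # combinations() keeps the order in which the tags appear in the tweet.
        for ht1, ht2 in itertools.combinations(tags, 2):
            counts[("#" + ht1.lower(), "#" + ht2.lower())] += 1
    return counts


# Each inner list stands for the hashtags of one tweet (hypothetical data).
tweets = [
    ["BLM", "Justice"],
    ["blm", "justice", "news"],
    ["justice", "blm"],
]

counts = hashtag_pair_counts(tweets)
for (source, target), weight in sorted(counts.items()):
    print(source, target, weight)
```

If undirected co-occurrence counts are wanted instead, one option is to sort each pair before counting so that both orderings map to the same key.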
/twarc_network/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |