├── .github └── workflows │ └── main.yml ├── .gitignore ├── LICENSE ├── MANIFEST.in ├── README.md ├── images └── d3.png ├── setup.cfg ├── setup.py ├── test-data └── tweets.jsonl ├── test_twarc_network.py └── twarc_network ├── __init__.py └── index.html /.github/workflows/main.yml: -------------------------------------------------------------------------------- 1 | name: tests 2 | on: 3 | push: 4 | paths-ignore: 5 | - 'README.md' 6 | jobs: 7 | build: 8 | runs-on: ubuntu-latest 9 | strategy: 10 | matrix: 11 | python-version: [3.8] 12 | steps: 13 | 14 | - uses: actions/checkout@v2 15 | 16 | - name: Set up Python ${{ matrix.python-version }} 17 | uses: actions/setup-python@v2 18 | with: 19 | python-version: ${{ matrix.python-version }} 20 | 21 | - name: Install dependencies 22 | run: | 23 | python -m pip install --upgrade pip 24 | python setup.py install 25 | 26 | - name: Test with pytest 27 | run: python setup.py test 28 | 29 | - name: Ensure packages can be built 30 | run: | 31 | python -m pip install wheel 32 | python setup.py sdist bdist_wheel 33 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .eggs 2 | dist 3 | twarc_network.egg-info 4 | Pipfile 5 | __pycache__ 6 | build 7 | *.log 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) Documenting the Now Project 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | 
furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include twarc_network/index.html 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # twarc-network 2 | 3 | 4 | 5 | [![Build Status](https://github.com/docnow/twarc-network/workflows/tests/badge.svg)](https://github.com/DocNow/twarc-network/actions/workflows/main.yml) 6 | 7 | *twarc-network* builds a reply, quote, retweet and mention network from a file of tweets 8 | that you've collected using twarc. It will write out the network as a [gexf], 9 | [gml], json, csv or html file. It uses [networkx] for the graph model and [d3] for the html presentation. 10 | 11 | If you know CSS you can hack at the generated HTML file to modify the style to 12 | suit your needs. If you come up with a more pleasing representation please send 13 | a pull request! Exporting as gexf or gml will allow you to import 14 | the data into tools like [Gephi], [Cytoscape] and [GraphViz] for further 15 | analysis and visualization. 
16 | 17 | ## Install 18 | 19 | To install you will need to: 20 | 21 | pip3 install twarc-network 22 | 23 | ## Collect Data 24 | 25 | First you will need to collect some data with [twarc]: 26 | 27 | twarc2 search blacklivesmatter > tweets.jsonl 28 | 29 | ## Output Formats 30 | 31 | Once you've got some data you can create the default D3 HTML visualization: 32 | 33 | twarc2 network tweets.jsonl network.html 34 | 35 | or [gexf]: 36 | 37 | twarc2 network tweets.jsonl --format gexf network.gexf 38 | 39 | or [gml]: 40 | 41 | twarc2 network tweets.jsonl --format gml network.gml 42 | 43 | or json: 44 | 45 | twarc2 network tweets.jsonl --format json network.json 46 | 47 | or CSV edge list: 48 | 49 | twarc2 network tweets.jsonl --format csv network.csv 50 | 51 | ## Changing the Nodes 52 | 53 | Tweets can be connected together as replies, quotes and retweets. If you would 54 | like to see the network oriented around nodes that are tweets instead of users 55 | you can: 56 | 57 | twarc2 network tweets.jsonl --nodes tweets network.html 58 | 59 | Hashtags can be connected when they are used together in a tweet. So you 60 | can visualize a network where nodes are hashtags: 61 | 62 | twarc2 network tweets.jsonl --nodes hashtags > network.html 63 | 64 | ## Changing the Edges 65 | 66 | By default, when user and tweet graphs are built, 67 | all types of interactions are used as edges: 68 | retweet, reply or quote in the case of tweets; 69 | retweet, reply, quote or mention in the case of users. 70 | But you can also limit the types considered. 
71 | For example, if you only want retweet edges, you can: 72 | 73 | twarc2 network tweets.jsonl tweets.html --edges retweet 74 | 75 | Or if you only want replies and quotes, you can: 76 | 77 | twarc2 network tweets.jsonl tweets.html --edges reply --edges quote 78 | 79 | ## Component Sizes 80 | 81 | Depending on the data you are analyzing, it can be helpful to remove weakly connected components in 82 | the graph that are smaller than some number. For example, if you don't want to 83 | visualize components where two nodes are connected only to each other and to 84 | no one else, you can: 85 | 86 | twarc2 network tweets.jsonl tweets.html --min-component-size 3 87 | 88 | It's less common, but you can also remove nodes that are part of subgraphs 89 | that are too large. For example, if you wanted to remove any components that were 90 | larger than 10: 91 | 92 | twarc2 network tweets.jsonl tweets.html --max-component-size 10 93 | 94 | ## Attributes 95 | 96 | The possible node attributes are the following: 97 | - `screen_name`: 98 | When the node is a user, its username; 99 | by default, it is used as the label of the nodes. 100 | When the node is a tweet, the username of its author. 101 | - `user_id`: 102 | When the node is a user, its id; 103 | if you want to use it as the label of the nodes, 104 | you can use the flag `--id-as-label`. 105 | When the node is a tweet, the id of its author. 106 | - `start_date`: 107 | The date of the first interaction that made the node appear in the graph. 108 | For example, if the node is a retweet, it is its date of creation. 109 | Or if the node is an original tweet, 110 | it is the date of the first retweet, reply or quote. 111 | The format is `dd/mm/yyyy hh:mm:ss`. 112 | 113 | The possible edge attributes are the following: 114 | - `type`: When the nodes are tweets, one of the following values: 115 | `retweet`, `reply` or `quote`. 116 | - `retweet`: When the nodes are users, 117 | the number of retweets the source has made to the target. 
118 | - `reply`: When the nodes are users, 119 | the number of replies the source has made to the target. 120 | - `quote`: When the nodes are users, 121 | the number of quotes the source has made to the target. 122 | - `mention`: When the nodes are users, 123 | the number of mentions the source has made to the target. 124 | - `weight`: 125 | When the nodes are users, the sum of `retweet`, `reply`, `quote` and `mention`. 126 | When the nodes are hashtags, 127 | the number of tweets that contained both hashtags. 128 | 129 | [gexf]: https://gephi.org/gexf/format/ 130 | [d3]: https://d3js.org/ 131 | [networkx]: https://networkx.org/ 132 | [twarc]: https://github.com/docnow/twarc 133 | [gml]: https://en.wikipedia.org/wiki/Graph_Modelling_Language 134 | [Gephi]: https://gephi.org/ 135 | [Cytoscape]: https://cytoscape.org/ 136 | [GraphViz]: https://graphviz.org/ 137 | -------------------------------------------------------------------------------- /images/d3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DocNow/twarc-network/5f1118ab14a9b66803988bcb6b6a08af8cd0632c/images/d3.png -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [aliases] 2 | test=pytest 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md") as f: 4 | long_description = f.read() 5 | 6 | setuptools.setup( 7 | name="twarc-network", 8 | version="0.3.0", 9 | url="https://github.com/docnow/twarc-network", 10 | author="Ed Summers", 11 | author_email="ehs@pobox.com", 12 | packages=["twarc_network"], 13 | description="Generate network visualizations for Twitter data", 14 | long_description=long_description, 15 | 
long_description_content_type="text/markdown", 16 | license="MIT", 17 | classifiers=[ 18 | "License :: OSI Approved :: MIT License", 19 | ], 20 | python_requires=">=3.3", 21 | install_requires=["twarc", "networkx"], 22 | setup_data={"twarc_network": ["twarc_network/index.html"]}, 23 | package_data={"twarc_network": ["index.html"]}, 24 | setup_requires=["pytest-runner"], 25 | tests_require=["pytest"], 26 | entry_points=""" 27 | [twarc.plugins] 28 | network=twarc_network:network 29 | """, 30 | ) 31 | -------------------------------------------------------------------------------- /test_twarc_network.py: -------------------------------------------------------------------------------- 1 | import io 2 | import re 3 | import csv 4 | import json 5 | 6 | from twarc_network import network 7 | from click.testing import CliRunner 8 | 9 | runner = CliRunner() 10 | 11 | 12 | def test_html(): 13 | result = runner.invoke(network, ["test-data/tweets.jsonl"]) 14 | assert result.exit_code == 0 15 | m = re.search(r"var graph = ({.*});.*var link = ", result.output, re.DOTALL) 16 | assert m 17 | graph = json.loads(m.group(1)) 18 | assert len(graph["nodes"]) == 656 19 | assert len(graph["links"]) == 618 20 | 21 | 22 | def test_json(): 23 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "json"]) 24 | assert result.exit_code == 0 25 | graph = json.loads(result.output) 26 | assert len(graph["nodes"]) == 656 27 | assert len(graph["links"]) == 618 28 | 29 | 30 | def test_gexf(): 31 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "gexf"]) 32 | assert result.exit_code == 0 33 | 34 | 35 | def test_gml(): 36 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "gml"]) 37 | assert result.exit_code == 0 38 | 39 | 40 | def test_csv(): 41 | result = runner.invoke(network, ["test-data/tweets.jsonl", "--format", "csv"]) 42 | assert result.exit_code == 0 43 | data = csv.reader(io.StringIO(result.output)) 44 | assert len(list(data)) 
== 618 45 | 46 | 47 | def test_min_component(): 48 | result = runner.invoke( 49 | network, 50 | ["test-data/tweets.jsonl", "--format", "json", "--min-component-size", "4"], 51 | ) 52 | graph = json.loads(result.output) 53 | assert len(graph["nodes"]) == 469 54 | 55 | 56 | def test_max_component(): 57 | result = runner.invoke( 58 | network, 59 | ["test-data/tweets.jsonl", "--format", "json", "--max-component-size", "15"], 60 | ) 61 | graph = json.loads(result.output) 62 | assert len(graph["nodes"]) == 355 63 | 64 | 65 | def test_tweets(): 66 | result = runner.invoke( 67 | network, ["test-data/tweets.jsonl", "--format", "json", "--nodes", "tweets"] 68 | ) 69 | assert result.exit_code == 0 70 | graph = json.loads(result.output) 71 | assert len(graph["nodes"]) == 613 72 | 73 | 74 | def test_hashtags(): 75 | result = runner.invoke( 76 | network, ["test-data/tweets.jsonl", "--format", "json", "--nodes", "hashtags"] 77 | ) 78 | assert result.exit_code == 0 79 | graph = json.loads(result.output) 80 | assert len(graph["nodes"]) == 352 81 | 82 | 83 | def test_edges(): 84 | result = runner.invoke( 85 | network, 86 | [ 87 | "test-data/tweets.jsonl", "--format", "json", "--edges", "retweet", 88 | "--edges", "reply", "--edges", "quote" 89 | ] 90 | ) 91 | assert result.exit_code == 0 92 | graph = json.loads(result.output) 93 | assert len(graph["nodes"]) == 484 94 | assert len(graph["links"]) == 391 95 | -------------------------------------------------------------------------------- /twarc_network/__init__.py: -------------------------------------------------------------------------------- 1 | import io 2 | import json 3 | import time 4 | import click 5 | import networkx 6 | import itertools 7 | 8 | from pathlib import Path 9 | 10 | from twarc import ensure_flattened 11 | 12 | 13 | @click.command() 14 | @click.option( 15 | "--format", 16 | type=click.Choice( 17 | ["html", "json", "gexf", "csv", "gml"], case_sensitive=False 18 | ), 19 | default="html", 20 | help="Output format for 
the network", 21 | ) 22 | @click.option( 23 | "--nodes", 24 | type=click.Choice(["users", "tweets", "hashtags"]), 25 | default="users", 26 | help="What type of nodes to use in the network", 27 | ) 28 | @click.option( 29 | "--edges", 30 | type=click.Choice(["retweet", "reply", "quote", "mention"]), 31 | multiple=True, 32 | default=["retweet", "reply", "quote", "mention"], 33 | help="What type of edges to use in the network", 34 | ) 35 | @click.option("--min-component-size", type=int, help="Minimum weakly connected component size to include") 36 | @click.option("--max-component-size", type=int, help="Maximum weakly connected component size to include") 37 | @click.option("--id-as-label", is_flag=True, help="Use user id as node label") 38 | @click.argument("infile", type=click.File("r"), default="-") 39 | @click.argument("outfile", type=click.File("w"), default="-") 40 | def network(format, nodes, edges, infile, outfile, min_component_size, 41 | max_component_size, id_as_label): 42 | """ 43 | Generates a network graph of tweets as GEXF, GML, JSON, HTML, CSV. 
44 | """ 45 | 46 | g = get_graph(infile, nodes, edges, id_as_label) 47 | 48 | # if the user wants to limit component min/max sizes 49 | if min_component_size or max_component_size: 50 | g_copy = g.copy() 51 | for components in networkx.weakly_connected_components(g): 52 | sg = g.subgraph(components) 53 | if min_component_size and len(sg) < min_component_size: 54 | g_copy.remove_nodes_from(sg.nodes()) 55 | elif max_component_size and len(sg) > max_component_size: 56 | g_copy.remove_nodes_from(sg.nodes()) 57 | g = g_copy 58 | 59 | if format == "gexf": 60 | outfile.write(nxstr(networkx.write_gexf, g)) 61 | 62 | elif format == "gml": 63 | outfile.write(nxstr(networkx.write_gml, g)) 64 | 65 | elif format == "json": 66 | json.dump(to_json(g), outfile, indent=2) 67 | 68 | elif format == "csv": 69 | outfile.write(nxstr(networkx.write_edgelist, g, delimiter=",")) 70 | 71 | elif format == "html": 72 | graph_data = json.dumps(to_json(g), indent=2) 73 | html_file = Path(__file__).parent / "index.html" 74 | html = html_file.open().read() 75 | html = html.replace("__GRAPH_DATA__", graph_data) 76 | outfile.write(html) 77 | 78 | 79 | def get_graph(infile, nodes_type, edge_types, id_as_label): 80 | g = networkx.DiGraph() 81 | 82 | for line in infile: 83 | for t in ensure_flattened(json.loads(line)): 84 | 85 | from_id = t["id"] 86 | 87 | from_user = t["author"]["username"] 88 | from_user_id = t["author"]["id"] 89 | 90 | created_at_date = time.strftime( 91 | "%d/%m/%Y %H:%M:%S", 92 | time.strptime(t["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ"), 93 | ) 94 | 95 | # get referenced tweets but ignore ones that have been deleted and 96 | # have no author stanza 97 | refs = filter(lambda r: "author" in r, t.get("referenced_tweets", [])) 98 | 99 | if nodes_type == "users": 100 | for ref in refs: 101 | to_user = ref["author"]["username"] 102 | to_user_id = ref["author"]["id"] 103 | edge_type = get_edge_type(ref) 104 | if edge_type in edge_types: 105 | add_user_edge( 106 | g, 107 | from_user, 108 | 
from_user_id, 109 | to_user, 110 | to_user_id, 111 | edge_type, 112 | created_at_date, 113 | edge_types, 114 | id_as_label, 115 | ) 116 | if "mention" in edge_types: 117 | mentions = t.get("entities", dict()).get("mentions", []) 118 | if is_first_mention_a_retweet(t): 119 | mentions = mentions[1:] 120 | for mention in mentions: 121 | to_user = mention["username"] 122 | to_user_id = mention["id"] 123 | add_user_edge( 124 | g, 125 | from_user, 126 | from_user_id, 127 | to_user, 128 | to_user_id, 129 | "mention", 130 | created_at_date, 131 | edge_types, 132 | id_as_label, 133 | ) 134 | 135 | elif nodes_type == "tweets": 136 | for ref in refs: 137 | to_id = ref["id"] 138 | to_user = ref["author"]["username"] 139 | to_user_id = ref["author"]["id"] 140 | edge_type = get_edge_type(ref) 141 | if edge_type in edge_types: 142 | add_tweet_edge( 143 | g, 144 | from_user, 145 | from_user_id, 146 | from_id, 147 | to_user, 148 | to_user_id, 149 | to_id, 150 | edge_type, 151 | created_at_date, 152 | ) 153 | 154 | elif nodes_type == "hashtags": 155 | 156 | # some tweets apparently lack an entities stanza? 
157 | if "entities" not in t: 158 | continue 159 | 160 | hashtags = map(lambda h: h["tag"], t["entities"].get("hashtags", [])) 161 | # list of all possible hashtag pairs 162 | hashtag_pairs = itertools.combinations(hashtags, 2) 163 | for ht1, ht2 in hashtag_pairs: 164 | add_hashtag_edge( 165 | g, 166 | "#" + ht1.lower(), 167 | "#" + ht2.lower(), 168 | created_at_date, 169 | ) 170 | 171 | else: 172 | raise Exception(f"Unknown node type: {nodes_type}") 173 | 174 | return g 175 | 176 | 177 | def add_user_edge(g, from_user, from_user_id, to_user, to_user_id, edge_type, 178 | created_at, edge_types, id_as_label): 179 | 180 | # storing start_date will allow for timestamps for gephi timeline, where nodes 181 | # will appear on screen at their start date and stay on forever after 182 | 183 | if id_as_label: 184 | from_label = from_user_id 185 | to_label = to_user_id 186 | else: 187 | from_label = from_user 188 | to_label = to_user 189 | g.add_node( 190 | from_label, 191 | screen_name=from_user, 192 | user_id=from_user_id, 193 | start_date=created_at, 194 | ) 195 | g.add_node( 196 | to_label, 197 | screen_name=to_user, 198 | user_id=to_user_id, 199 | start_date=created_at, 200 | ) 201 | 202 | if g.has_edge(from_label, to_label): 203 | weights = g[from_label][to_label] 204 | else: 205 | g.add_edge(from_label, to_label) 206 | weights = {t: 0 for t in ("weight", ) + edge_types} 207 | weights["weight"] += 1 208 | weights[edge_type] += 1 209 | g[from_label][to_label].update(weights) 210 | 211 | 212 | def add_tweet_edge(g, from_user, from_user_id, from_id, to_user, to_user_id, 213 | to_id, edge_type, created_at): 214 | g.add_node( 215 | from_id, 216 | screen_name=from_user, 217 | user_id=from_user_id, 218 | start_date=created_at, 219 | ) 220 | g.add_node( 221 | to_id, 222 | screen_name=to_user, 223 | user_id=to_user_id, 224 | start_date=created_at, 225 | ) 226 | 227 | g.add_edge(from_id, to_id, type=edge_type) 228 | 229 | 230 | def add_hashtag_edge(g, from_hashtag, to_hashtag, 
created_at): 231 | g.add_node(from_hashtag, start_date=created_at) 232 | g.add_node(to_hashtag, start_date=created_at) 233 | 234 | if g.has_edge(from_hashtag, to_hashtag): 235 | weight = g[from_hashtag][to_hashtag]["weight"] + 1 236 | else: 237 | weight = 1 238 | g.add_edge(from_hashtag, to_hashtag, weight=weight) 239 | 240 | 241 | def to_json(g): 242 | j = {"nodes": [], "links": []} 243 | for node_id, attrs in g.nodes(data=True): 244 | node = {"id": node_id} 245 | node.update(attrs) 246 | j["nodes"].append(node) 247 | for source, target, attrs in g.edges(data=True): 248 | link = {"source": source, "target": target} 249 | link.update(attrs) 250 | j["links"].append(link) 251 | return j 252 | 253 | 254 | def get_edge_type(ref): 255 | if ref["type"] == "retweeted": 256 | return "retweet" 257 | elif ref["type"] == "replied_to": 258 | return "reply" 259 | elif ref["type"] == "quoted": 260 | return "quote" 261 | else: 262 | raise Exception(f'unknown reference type: {ref["type"]}') 263 | 264 | 265 | def is_first_mention_a_retweet(tweet): 266 | if "referenced_tweets" not in tweet: 267 | return False 268 | if get_edge_type(tweet["referenced_tweets"][0]) != "retweet": 269 | return False 270 | 271 | if "entities" not in tweet or "mentions" not in tweet["entities"]: 272 | return False 273 | if tweet["entities"]["mentions"][0]["start"] != 3: 274 | return False 275 | 276 | return True 277 | 278 | 279 | def nxstr(f, *args, **kwargs): 280 | # networkx output functions want to write to a file as bytes 281 | # but click.File is expecting a string. This function takes the 282 | # networkx function and parameters and writes the bytes to a 283 | # BytesIO object to return it as a string. 
284 | out = io.BytesIO() 285 | args = args + (out,) 286 | f(*args, **kwargs) 287 | return out.getvalue().decode("utf-8") 288 | -------------------------------------------------------------------------------- /twarc_network/index.html: -------------------------------------------------------------------------------- [index.html: D3 template markup not recoverable; it contains the `__GRAPH_DATA__` placeholder that `network()` fills with the graph JSON] --------------------------------------------------------------------------------
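As a rough illustration of how edge weights come about in the `--nodes hashtags` mode described in the README, the sketch below counts hashtag co-occurrences using only the standard library. The helper name `hashtag_pair_weights` and the sample tweets are hypothetical and not part of twarc-network; the plugin itself stores these counts as `weight` attributes on a `networkx.DiGraph`, so this is a simplified model under those assumptions rather than the project's implementation.

```python
import itertools
from collections import Counter


def hashtag_pair_weights(tweets):
    """Count how many tweets contain each ordered pair of hashtags.

    Hypothetical helper for illustration only. `tweets` is an iterable of
    dicts shaped like flattened tweets; only entities.hashtags[].tag is read.
    """
    weights = Counter()
    for tweet in tweets:
        # Some tweets carry no "entities" stanza, so fall back to no hashtags.
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        tags = ["#" + h["tag"].lower() for h in hashtags]
        # combinations() preserves the order in which tags appear in the
        # tweet, so ("#a", "#b") and ("#b", "#a") are counted separately,
        # as they would be in a directed graph.
        for pair in itertools.combinations(tags, 2):
            weights[pair] += 1
    return weights


# Made-up sample data, not taken from test-data/tweets.jsonl.
sample = [
    {"entities": {"hashtags": [{"tag": "BLM"}, {"tag": "Justice"}]}},
    {"entities": {"hashtags": [{"tag": "blm"}, {"tag": "justice"}, {"tag": "News"}]}},
    {"text": "a tweet without an entities stanza"},
]
weights = hashtag_pair_weights(sample)
for (source, target), weight in sorted(weights.items()):
    print(source, target, weight)
```

Lower-casing the tags before pairing them means `#BLM` and `#blm` collapse into one node, which is why the first pair in the sample ends up with a weight of 2.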