16 | Hypergraph degree and edge size distributions
17 |
18 | ## Source of original data
19 | Source: [DisGeNET](https://www.disgenet.org/)
20 |
21 | ## References
22 | If you use this dataset, please cite these references:
23 | * [The DisGeNET knowledge platform for disease genomics: 2019 update](https://doi.org/10.1093/nar/gkz1021). Janet Piñero, Juan Manuel Ramírez-Anguita, Josep Saüch-Pitarch, Francesco Ronzano, Emilio Centeno, Ferran Sanz, Laura I Furlong. Nucleic Acids Research, 2019.
--------------------------------------------------------------------------------
/datasheets/plant-pollinator-mpl-049/README_plant-pollinator-mpl-049.md:
--------------------------------------------------------------------------------
1 | # plant-pollinator-mpl-049
2 |
3 | ## Summary
4 |
5 | This is a hypergraph dataset where nodes are plant species and each hyperedge is the set of plants visited by a given pollinator species.
6 | Locality of study: Denmark (latitude: 56.066667, longitude: 10.216667).
7 |
8 | ## Statistics
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 37
11 | * number of hyperedges: 225
12 | * distribution of the connected components:
13 |
24 | Hypergraph degree and edge size distributions
25 |
26 | ## Source of original data
27 | Source: [web-of-life](https://www.web-of-life.es/), dataset ID: M_PL_049.
28 |
29 | ## References
30 | If you use this dataset, please cite these references:
31 | * Bundgaard, M. (2003). Tidslig og rumlig variation i et plante-bestøvernetværk [Temporal and spatial variation in a plant-pollinator network]. MSc thesis. University of Aarhus. Aarhus, Denmark.
32 |
--------------------------------------------------------------------------------
/datasheets/tags-stack-overflow/README_tags-stack-overflow.md:
--------------------------------------------------------------------------------
1 | # tags-stack-overflow
2 |
3 | ## Summary
4 |
5 | This dataset is derived from tags on Stack Overflow posts. The raw data was
6 | downloaded from
7 | https://archive.org/details/stackexchange
8 |
9 | Each simplex corresponds to all of the tags used in a post, and each node in a
10 | simplex corresponds to a tag. Timestamps are the times of the posts in
11 | milliseconds, normalized so that the earliest post starts at 0.
12 |
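A minimal sketch of loading this dataset through the xgi-data index (this assumes the `xgi` package is installed and that network access is available):

```python
import xgi

# downloads the dataset JSON from the xgi-data index and builds a hypergraph
H = xgi.load_xgi_data("tags-stack-overflow")
print(H.num_nodes, H.num_edges)
```
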
13 | ## Statistics
14 |
15 | Some basic statistics of this dataset are:
16 | * number of nodes: 49,998
17 | * number of timestamped simplices: 14,458,875
18 | * number of unique simplices: 5,675,497
19 | * number of edges in projected graph: 4,147,302
20 |
21 | ## Source of original data
22 |
23 | Source: [tags-stack-overflow dataset](https://www.cs.cornell.edu/~arb/data/tags-stack-overflow/)
24 |
25 | ## References
26 |
27 | If you use this data, please cite the following paper:
28 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115). Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
--------------------------------------------------------------------------------
/datasheets/plant-pollinator-mpl-016/README_plant-pollinator-mpl-016.md:
--------------------------------------------------------------------------------
1 | # plant-pollinator-mpl-016
2 |
3 | ## Summary
4 |
5 | This is a hypergraph dataset where nodes are plant species and each hyperedge is the set of plants visited by a given pollinator species.
6 | Locality of study: Doñana Nat. Park, Spain (latitude: 37.016667, longitude: -6.55).
7 |
8 | ## Statistics
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 26
11 | * number of hyperedges: 179
12 | * distribution of the connected components:
13 |
24 | Hypergraph degree and edge size distributions
25 |
26 | ## Source of original data
27 | Source: [web-of-life](https://www.web-of-life.es/), dataset ID: M_PL_016.
28 |
29 | ## References
30 | If you use this dataset, please cite these references:
31 | * Herrera, J. (1988) [Pollination relationships in southern Spanish Mediterranean shrublands](https://www.jstor.org/stable/2260469). Journal of Ecology 76: 274-287.
32 |
--------------------------------------------------------------------------------
/datasheets/plant-pollinator-mpl-014/README_plant-pollinator-mpl-014.md:
--------------------------------------------------------------------------------
1 | # plant-pollinator-mpl-014
2 |
3 | ## Summary
4 |
5 | This is a hypergraph dataset where nodes are plant species and each hyperedge is the set of plants visited by a given pollinator species.
6 | Locality of study: Hazen Camp, Ellesmere Island, Canada (latitude: 81.816667, longitude: -71.3).
7 |
8 | ## Statistics
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 29
11 | * number of hyperedges: 81
12 | * distribution of the connected components:
13 |
24 | Hypergraph degree and edge size distributions
25 |
26 | ## Source of original data
27 | Source: [web-of-life](https://www.web-of-life.es/), dataset ID: M_PL_014.
28 |
29 | ## References
30 | If you use this dataset, please cite these references:
31 | * Hocking, B. (1968). [Insect-flower associations in the high Arctic with special reference to nectar](https://doi.org/10.2307/3565022). *Oikos*, 359-387.
32 |
33 |
--------------------------------------------------------------------------------
/datasheets/plant-pollinator-mpl-062/README_plant-pollinator-mpl-062.md:
--------------------------------------------------------------------------------
1 | # plant-pollinator-mpl-062
2 |
3 | ## Summary
4 |
5 | This is a hypergraph dataset where nodes are plant species and each hyperedge is the set of plants visited by a given pollinator species.
6 |
7 |
8 | Locality of study: Carlinville, Illinois, USA (latitude: 39.278958, longitude: -89.8968771).
9 |
10 | ## Statistics
11 | Some basic statistics of this dataset are:
12 | * number of nodes: 456
13 | * number of hyperedges: 1,044
14 | * distribution of the connected components:
15 |
26 | Hypergraph degree and edge size distributions
27 |
28 | ## Source of original data
29 | Source: [web-of-life](https://www.web-of-life.es/), dataset ID: M_PL_062.
30 |
31 | ## References
32 | If you use this dataset, please cite these references:
33 | * Robertson, C. 1929. "Flowers and insects: lists of visitors to four hundred and fifty-three flowers". Carlinville, IL, USA, C. Robertson.
34 |
35 |
--------------------------------------------------------------------------------
/code/import_hospital-lyon.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime, timedelta
2 |
3 | import networkx as nx
4 | import pandas as pd
5 | import xgi
6 |
7 | data_folder = "data"
8 |
9 | dataset_name = "hospital-lyon"
10 | data = pd.read_csv(
11 | "data/hospital-lyon/detailed_list_of_contacts_Hospital.dat",
12 | sep="\t",
13 | header=0,
14 | names=["time", "node1", "node2", "type1", "type2"],
15 | )
16 |
17 | H = xgi.Hypergraph()
18 | H["name"] = "hospital-lyon"
19 |
20 | nodes1 = dict(zip(data["node1"].values.tolist(), data["type1"].values.tolist()))
21 | nodes2 = dict(zip(data["node2"].values.tolist(), data["type2"].values.tolist()))
22 | nodes = dict()
23 | nodes.update(nodes1)
24 | nodes.update(nodes2)
25 |
26 | for node, nodetype in nodes.items():
27 | H.add_node(node, type=nodetype)
28 |
29 | start_time = datetime(2010, 12, 6, 13, 0, 0)  # data collection began Monday, December 6, 2010 at 1:00 pm
30 |
31 | for t in data["time"].unique():
32 | time = timedelta(seconds=int(t))
33 | d = data[data.time == t]
34 | links = d[["node1", "node2"]].values.tolist()
35 | G = nx.Graph(links)
36 | for e in nx.find_cliques(G):
37 | H.add_edge(e, timestamp=(start_time + time).isoformat())
38 |
39 |
40 | xgi.write_json(H, "data/hospital-lyon/hospital-lyon.json")
41 |
--------------------------------------------------------------------------------
/datasheets/diseasome/README_diseasome.md:
--------------------------------------------------------------------------------
1 | # diseasome
2 |
3 | ## Summary
4 |
5 | This is a dataset of diseases and the genes associated with them. In this dataset, a disease is a node and a gene is a hyperedge. The "label" attribute of the nodes gives the disease description and the "label" attribute of the edges gives the gene name. The disease-disease correlations were filtered out so that only disease-gene relationships remain.
6 |
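A minimal sketch of inspecting these "label" attributes after loading the dataset through the xgi-data index (assumes the `xgi` package and network access):

```python
import xgi

H = xgi.load_xgi_data("diseasome")
node = next(iter(H.nodes))
edge = next(iter(H.edges))
print(H.nodes[node])  # disease description under "label"
print(H.edges[edge])  # gene name under "label"
```
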
7 | ## Statistics
8 | Some basic statistics of this dataset are:
9 | * number of nodes: 516
10 | * number of hyperedges: 903
11 | * The dataset is connected
12 | * degree and edge size distributions:
13 |
14 |
15 |
16 | Hypergraph degree and edge size distributions
17 |
18 | ## Source of original data
19 | Source: [Gephi](https://github.com/gephi/gephi.github.io/blob/master/datasets/diseasome.gexf.zip)
20 |
21 | ## References
22 | If you use this dataset, please cite these references:
23 | * [The human disease network](https://doi.org/10.1073/pnas.0701361104). Kwang-Il Goh, Michael E. Cusick, David Valle, and Albert-László Barabási. Proceedings of the National Academy of Sciences (PNAS), 2007.
--------------------------------------------------------------------------------
/code/import_malawi-village.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime, timedelta
2 |
3 | import networkx as nx
4 | import pandas as pd
5 | import xgi
6 |
7 | dataset_name = "Malawi-village"
8 | data = pd.read_csv(
9 | "data/Malawi21/tnet_malawi_pilot.csv",
10 | sep=",",
11 | )
12 | data["time"] = list(zip(data.day, data.contact_time))
13 |
14 | H = xgi.Hypergraph()
15 | H["name"] = dataset_name
16 |
17 | nodes1 = data["id1"].values.tolist()
18 | nodes2 = data["id2"].values.tolist()
19 | nodes = set()
20 | nodes.update(set(nodes1))
21 | nodes.update(set(nodes2))
22 |
23 | H.add_nodes_from(nodes)
24 |
25 | start_time = datetime(2019, 12, 22, 11, 31, 40)
26 | # this is calculated by finding when the day switches and then subtracting
27 | # the number of seconds elapsed at the switch. The data seems to start on
28 | # Dec. 22 and end on Jan. 4
29 |
30 | for t in data["time"].unique():
31 | days, sec = t
32 | time = timedelta(seconds=int(sec))
33 | d = data[data.time == t]
34 | links = d[["id1", "id2"]].values.tolist()
35 | G = nx.Graph(links)
36 | for e in nx.find_cliques(G):
37 | H.add_edge(e, timestamp=(start_time + time).isoformat())
39 |
40 |
41 | xgi.write_json(H, "data/Malawi21/malawi-village.json")
42 |
--------------------------------------------------------------------------------
/code/import_NDC-substances.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import utilities
4 | import xgi
5 |
6 | data_folder = "data"
7 |
8 | dataset_name = "NDC-substances-full"
9 | new_dataset_name = "NDC-substances"
10 |
11 | dataset_folder = "NDC-substances-full"
12 | size_file = f"{dataset_name}-nverts.txt"
13 | member_file = f"{dataset_name}-simplices.txt"
14 | labels_file = f"{dataset_name}-node-labels.txt"
15 | times_file = f"{dataset_name}-times.txt"
16 |
17 | hyperedge_size_file = os.path.join(data_folder, dataset_folder, size_file)
18 | member_ID_file = os.path.join(data_folder, dataset_folder, member_file)
19 | node_labels_file = os.path.join(data_folder, dataset_folder, labels_file)
20 | edge_times_file = os.path.join(data_folder, dataset_folder, times_file)
21 |
22 | edgelist = utilities.readScHoLPData(hyperedge_size_file, member_ID_file)
23 |
24 | H = xgi.Hypergraph(edgelist)
25 | H["name"] = new_dataset_name
26 |
27 | delimiter = " "
28 |
29 | node_labels = utilities.readScHoLPLabels(node_labels_file, delimiter)
30 |
31 | H.add_nodes_from(list(node_labels.keys()))
32 |
33 | for label, name in node_labels.items():
34 | H.nodes[label].update({"name": name})
35 |
36 |
37 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, f"{new_dataset_name}.json"))
38 |
--------------------------------------------------------------------------------
/datasheets/plant-pollinator-mpl-015/README_plant-pollinator-mpl-015.md:
--------------------------------------------------------------------------------
1 | # plant-pollinator-mpl-015
2 |
3 | ## Summary
4 |
5 | This is a hypergraph dataset where nodes are plant species and each hyperedge is the set of plants visited by a given pollinator species.
6 | Locality of study: Daphní, Athens, Greece (latitude: 38.014466, longitude: 23.635043).
7 |
8 | ## Statistics
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 131
11 | * number of hyperedges: 666
12 | * distribution of the connected components:
13 |
23 | Hypergraph degree and edge size distributions
24 |
25 | ## Source of original data
26 | Source: [web-of-life](https://www.web-of-life.es/), dataset ID: M_PL_015.
27 |
28 | ## References
29 | If you use this dataset, please cite these references:
30 | * Petanidou, T. (1991). Pollination ecology in a phryganic ecosystem. PhD thesis, Aristotelian University, Thessaloniki, Greece.
--------------------------------------------------------------------------------
/datasheets/coauth-mag-geology/README_coauth-MAG-Geology.md:
--------------------------------------------------------------------------------
1 | # coauth-MAG-Geology
2 |
3 | ## Summary
4 |
5 | This is a temporal higher-order network dataset, which here means a sequence of timestamped simplices where each simplex is a set of nodes. In this dataset, nodes are authors and a simplex is a publication marked with the "Geology" tag in the Microsoft Academic Graph. Timestamps are the year of publication.
6 |
7 | ## Statistics
8 |
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 1,256,385
11 | * number of timestamped simplices: 1,590,335
12 | * number of unique simplices: 1,207,390
13 | * number of edges in projected graph: 5,120,762
14 |
15 | ## Changelog
16 |
17 | - v0.2: removed restriction on edge size (was max 25 nodes) with PR #22 https://github.com/xgi-org/xgi-data/pull/22
18 |
19 | ## Source of original data
20 |
21 | Source: [coauth-MAG-Geology dataset](https://www.cs.cornell.edu/~arb/data/coauth-MAG-Geology/)
22 |
23 | ## References
24 |
25 | If you use this data, please cite the following papers:
26 |
27 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115).
28 | Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
29 | Proceedings of the National Academy of Sciences (PNAS), 2018.
30 |
31 | * [An overview of Microsoft Academic Service (MAS) and applications](https://doi.org/10.1145/2740908.2742839).
32 | Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang.
33 | Proceedings of WWW, 2015.
--------------------------------------------------------------------------------
/datasheets/NDC-substances/README-NDC-substances.md:
--------------------------------------------------------------------------------
1 | # NDC substances
2 |
3 | ## Summary
4 |
5 | This is a temporal higher-order network dataset, which here means a sequence of timestamped simplices where each simplex is a set of nodes. Under the Drug Listing Act of 1972, the U.S. Food and Drug Administration releases information on all commercial drugs regulated by the agency, forming the National Drug Code (NDC) Directory. In this dataset, each simplex corresponds to an NDC code for a drug, and the nodes are substances that make up the drug. Timestamps are in days and represent when the drug was first marketed. We restricted to simplices that consist of at most 25 nodes.
6 |
7 |
8 | The file NDC-substances-node-labels.txt maps the node IDs to the substances.
9 |
10 | The nth line in NDC-substances-simplex-labels.txt is the name of the drug
11 | corresponding to the nth simplex.
12 |
13 | ## Statistics
14 |
15 | * number of nodes: 5,311
16 | * number of timestamped simplices: 112,405
17 | * number of unique simplices: 10,025
18 | * number of edges in projected graph: 88,268
19 |
20 | ## Source of original data
21 |
22 | Source: [NDC-substances](https://www.cs.cornell.edu/~arb/data/NDC-substances/).
23 |
24 | ## References
25 |
26 | If you use this data, please cite the following paper:
27 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115).
28 | Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
29 | Proceedings of the National Academy of Sciences (PNAS), 2018.
--------------------------------------------------------------------------------
/code/inspect_json.py:
--------------------------------------------------------------------------------
1 | import json
2 | import sys
3 |
4 | # usage: python inspect_json.py <path to JSON file>
5 | filename = sys.argv[1]
6 |
7 | with open(filename) as file:
8 |     # load JSON file
9 |     data = json.loads(file.read())
10 |
11 | # load hypergraph attributes
12 | hypergraph_attrs = data.get("hypergraph-data")
13 | if hypergraph_attrs is None:
14 |     print("No hypergraph attributes!")
15 | # Is a dataset name specified?
16 | elif "name" not in hypergraph_attrs:
17 |     print("Dataset name not specified!")
18 |
19 | # Are nodes specified?
20 | node_data = data.get("node-data")
21 | if node_data is None:
22 |     print("No nodes specified!")
23 |
24 | # Are hyperedge attributes specified?
25 | edge_data = data.get("edge-data")
26 | if edge_data is None:
27 |     print("No hyperedge attributes specified!")
28 |
29 | # Are hyperedges specified?
30 | edges = data.get("edge-dict")
31 | if edges is None:
32 |     print("No hyperedges specified!")
33 | else:
34 |     for e in edges:
35 |         # does every hyperedge have an entry in the edge attributes?
36 |         if edge_data is not None and e not in edge_data:
37 |             print(f"Edge {e} not in the list of edge attributes.")
38 |         members = edges[e]
39 |         if not members:
40 |             print(f"Edge {e} has no associated members!")
41 |             continue
42 |         # are the nodes in the hyperedges in the list of nodes?
43 |         if node_data is not None:
44 |             for node in members:
45 |                 if node not in node_data:
46 |                     print(f"Edge {e} contains non-existent node {node}!")
47 |
48 | print("Inspection complete.")
54 |
--------------------------------------------------------------------------------
/datasheets/coauth-mag-history/README_coauth-MAG-History.md:
--------------------------------------------------------------------------------
1 | # coauth-MAG-History
2 |
3 | ## Summary
4 |
5 | This is a temporal higher-order network dataset, which here means a sequence of timestamped simplices where each simplex is a set of nodes. In this dataset, nodes are authors and a simplex is a publication marked with the "History" tag in the Microsoft Academic Graph. Timestamps are the year of publication.
6 |
7 | ## Statistics
8 |
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 1,014,734
11 | * number of timestamped simplices: 1,812,511
12 | * number of unique simplices: 895,668
13 | * number of edges in projected graph: 1,156,914
14 |
15 | ## Changelog
16 |
17 | - v1.2: fixed year format with PR #31 https://github.com/xgi-org/xgi-data/pull/31
18 | - v1.1: removed restriction on edge size (was max 25 nodes) with PR #22 https://github.com/xgi-org/xgi-data/pull/22
19 |
20 | ## Source of original data
21 |
22 | Source: [coauth-MAG-History dataset](https://www.cs.cornell.edu/~arb/data/coauth-MAG-History/)
23 |
24 | ## References
25 |
26 | If you use this data, please cite the following papers:
27 |
28 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115).
29 | Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
30 | Proceedings of the National Academy of Sciences (PNAS), 2018.
31 |
32 | * [An overview of Microsoft Academic Service (MAS) and applications](https://doi.org/10.1145/2740908.2742839).
33 | Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang.
34 | Proceedings of WWW, 2015.
--------------------------------------------------------------------------------
/code/import_house-bills.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import utilities
4 | import xgi
5 |
6 | data_folder = "data"
7 |
8 | dataset_name = "house-bills"
9 |
10 | new_dataset_name = "house-bills"
11 |
12 | dataset_folder = "house-bills"
13 | edgelist_file = f"hyperedges-{dataset_name}.txt"
14 | node_names_file = f"node-names-{dataset_name}.txt"
15 | node_affiliations_file = f"node-labels-{dataset_name}.txt"
16 | affiliation_names_file = f"label-names-{dataset_name}.txt"
17 |
18 | edgelist_filepath = os.path.join(data_folder, dataset_folder, edgelist_file)
19 | node_names_filepath = os.path.join(data_folder, dataset_folder, node_names_file)
20 | node_affiliations_filepath = os.path.join(
21 | data_folder, dataset_folder, node_affiliations_file
22 | )
23 | affiliation_names_filepath = os.path.join(
24 | data_folder, dataset_folder, affiliation_names_file
25 | )
26 |
27 | H = xgi.read_edgelist(edgelist_filepath, delimiter=",")
28 | H["name"] = new_dataset_name
29 |
30 | node_labels = utilities.readScHoLPLabels(node_names_filepath, two_column=False)
31 | node_affiliation = utilities.readScHoLPLabels(
32 | node_affiliations_filepath, two_column=False
33 | )
34 |
35 | for id, name in node_labels.items():
36 | H.nodes[str(id)].update({"name": name})
37 |
38 | affiliation_names = []
39 | with open(affiliation_names_filepath) as label_data:
40 | for line in label_data:
41 | affiliation_names.append(line.strip("\n"))
42 |
43 | for id, label in node_affiliation.items():
44 | H.nodes[str(id)].update({"affiliation": affiliation_names[int(label) - 1]})
45 |
46 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, f"{new_dataset_name}.json"))
47 |
--------------------------------------------------------------------------------
/code/import_senate-bills.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import utilities
4 | import xgi
5 |
6 | data_folder = "data"
7 |
8 | dataset_name = "senate-bills"
9 |
10 | new_dataset_name = "senate-bills"
11 |
12 | dataset_folder = "senate-bills"
13 | edgelist_file = f"hyperedges-{dataset_name}.txt"
14 | node_names_file = f"node-names-{dataset_name}.txt"
15 | node_affiliations_file = f"node-labels-{dataset_name}.txt"
16 | affiliation_names_file = f"label-names-{dataset_name}.txt"
17 |
18 | edgelist_filepath = os.path.join(data_folder, dataset_folder, edgelist_file)
19 | node_names_filepath = os.path.join(data_folder, dataset_folder, node_names_file)
20 | node_affiliations_filepath = os.path.join(
21 | data_folder, dataset_folder, node_affiliations_file
22 | )
23 | affiliation_names_filepath = os.path.join(
24 | data_folder, dataset_folder, affiliation_names_file
25 | )
26 |
27 | H = xgi.read_edgelist(edgelist_filepath, delimiter=",")
28 | H["name"] = new_dataset_name
29 |
30 | node_labels = utilities.readScHoLPLabels(node_names_filepath, two_column=False)
31 | node_affiliation = utilities.readScHoLPLabels(
32 | node_affiliations_filepath, two_column=False
33 | )
34 |
35 | for id, name in node_labels.items():
36 | H.nodes[str(id)].update({"name": name})
37 |
38 | affiliation_names = []
39 | with open(affiliation_names_filepath) as label_data:
40 | for line in label_data:
41 | affiliation_names.append(line.strip("\n"))
42 |
43 | for id, label in node_affiliation.items():
44 | H.nodes[str(id)].update({"affiliation": affiliation_names[int(label) - 1]})
45 |
46 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, f"{new_dataset_name}.json"))
47 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | XGI-DATA is distributed with the BSD 3-Clause License
2 |
3 | Copyright (c) 2021-2024, XGI-DATA Developers
4 |
5 | All rights reserved.
6 |
7 | Redistribution and use in source and binary forms, with or without
8 | modification, are permitted provided that the following conditions are met:
9 |
10 | 1. Redistributions of source code must retain the above copyright notice, this
11 | list of conditions and the following disclaimer.
12 |
13 | 2. Redistributions in binary form must reproduce the above copyright notice,
14 | this list of conditions and the following disclaimer in the documentation
15 | and/or other materials provided with the distribution.
16 |
17 | 3. Neither the name of the copyright holder nor the names of its
18 | contributors may be used to endorse or promote products derived from
19 | this software without specific prior written permission.
20 |
21 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
22 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
24 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
25 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
27 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
28 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
29 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
30 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
31 |
--------------------------------------------------------------------------------
/code/import_house-committees.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import utilities
4 | import xgi
5 |
6 | data_folder = "data"
7 |
8 | dataset_name = "house-committees"
9 |
10 | new_dataset_name = "house-committees"
11 |
12 | dataset_folder = "house-committees"
13 | edgelist_file = f"hyperedges-{dataset_name}.txt"
14 | node_names_file = f"node-names-{dataset_name}.txt"
15 | node_affiliations_file = f"node-labels-{dataset_name}.txt"
16 | affiliation_names_file = f"label-names-{dataset_name}.txt"
17 |
18 | edgelist_filepath = os.path.join(data_folder, dataset_folder, edgelist_file)
19 | node_names_filepath = os.path.join(data_folder, dataset_folder, node_names_file)
20 | node_affiliations_filepath = os.path.join(
21 | data_folder, dataset_folder, node_affiliations_file
22 | )
23 | affiliation_names_filepath = os.path.join(
24 | data_folder, dataset_folder, affiliation_names_file
25 | )
26 |
27 | H = xgi.read_edgelist(edgelist_filepath, delimiter=",")
28 | H["name"] = new_dataset_name
29 |
30 | node_labels = utilities.readScHoLPLabels(node_names_filepath, two_column=False)
31 | node_affiliation = utilities.readScHoLPLabels(
32 | node_affiliations_filepath, two_column=False
33 | )
34 |
35 | for id, name in node_labels.items():
36 | H.nodes[str(id)].update({"name": name})
37 |
38 | affiliation_names = []
39 | with open(affiliation_names_filepath) as label_data:
40 | for line in label_data:
41 | affiliation_names.append(line.strip("\n"))
42 |
43 | for id, label in node_affiliation.items():
44 | H.nodes[str(id)].update({"affiliation": affiliation_names[int(label) - 1]})
45 |
46 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, f"{new_dataset_name}.json"))
47 |
--------------------------------------------------------------------------------
/code/import_senate-committees.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import utilities
4 | import xgi
5 |
6 | data_folder = "data"
7 |
8 | dataset_name = "senate-committees"
9 |
10 | new_dataset_name = "senate-committees"
11 |
12 | dataset_folder = "senate-committees"
13 | edgelist_file = f"hyperedges-{dataset_name}.txt"
14 | node_names_file = f"node-names-{dataset_name}.txt"
15 | node_affiliations_file = f"node-labels-{dataset_name}.txt"
16 | affiliation_names_file = f"label-names-{dataset_name}.txt"
17 |
18 | edgelist_filepath = os.path.join(data_folder, dataset_folder, edgelist_file)
19 | node_names_filepath = os.path.join(data_folder, dataset_folder, node_names_file)
20 | node_affiliations_filepath = os.path.join(
21 | data_folder, dataset_folder, node_affiliations_file
22 | )
23 | affiliation_names_filepath = os.path.join(
24 | data_folder, dataset_folder, affiliation_names_file
25 | )
26 |
27 | H = xgi.read_edgelist(edgelist_filepath, delimiter=",")
28 | H["name"] = new_dataset_name
29 |
30 | node_labels = utilities.readScHoLPLabels(node_names_filepath, two_column=False)
31 | node_affiliation = utilities.readScHoLPLabels(
32 | node_affiliations_filepath, two_column=False
33 | )
34 |
35 | for id, name in node_labels.items():
36 | H.nodes[str(id)].update({"name": name})
37 |
38 | affiliation_names = []
39 | with open(affiliation_names_filepath) as label_data:
40 | for line in label_data:
41 | affiliation_names.append(line.strip("\n"))
42 |
43 | for id, label in node_affiliation.items():
44 | H.nodes[str(id)].update({"affiliation": affiliation_names[int(label) - 1]})
45 |
46 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, f"{new_dataset_name}.json"))
47 |
--------------------------------------------------------------------------------
/datasheets/congress-bills/README_congress-bills.md:
--------------------------------------------------------------------------------
1 | # congress-bills
2 |
3 | ## Summary
4 | This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In this dataset, nodes are US Congresspersons and each hyperedge comprises the sponsor and co-sponsors of a legislative bill put forth in the House of Representatives or the Senate. Timestamps are in ISO8601 format. The dataset was derived from James Fowler's data.
5 |
6 | ## Statistics
7 | Some basic statistics of this dataset are:
8 | * number of nodes: 1,718
9 | * number of timestamped hyperedges: 282,049
10 | * there is a single connected component of size 1,718
11 | * degree and edge size distributions:
12 |
13 |
14 |
15 |
16 | Hypergraph degree and edge size distributions
17 |
18 | ## Source of original data
19 | Source: [congress-bills dataset](https://www.cs.cornell.edu/~arb/data/congress-bills/)
20 |
21 | ## References
22 | If you use this dataset, please cite these references:
23 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115). Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
24 | * [Connecting the Congress: A Study of Cosponsorship Networks](https://doi.org/10.1093/pan/mpl002). James H. Fowler. Political Analysis, 2006.
25 | * [Legislative Cosponsorship Networks in the U.S. House and Senate](https://doi.org/10.1016/j.socnet.2005.11.003). James H. Fowler. Social Networks, 2006.
--------------------------------------------------------------------------------
/code/import_NDC-classes.py:
--------------------------------------------------------------------------------
1 | import os
2 | from datetime import datetime
3 |
4 | import utilities
5 | import xgi
6 |
7 | data_folder = "data"
8 |
9 | dataset_folder = "NDC-classes-full"
10 | size_file = "NDC-classes-full-nverts.txt"
11 | member_file = "NDC-classes-full-simplices.txt"
12 | nlabels_file = "NDC-classes-full-node-labels.txt"
13 | elabels_file = "NDC-classes-full-simplex-labels.txt"
14 | times_file = "NDC-classes-full-times.txt"
15 |
16 | hyperedge_size_file = os.path.join(data_folder, dataset_folder, size_file)
17 | member_ID_file = os.path.join(data_folder, dataset_folder, member_file)
18 | node_labels_file = os.path.join(data_folder, dataset_folder, nlabels_file)
19 | edge_labels_file = os.path.join(data_folder, dataset_folder, elabels_file)
20 | edge_times_file = os.path.join(data_folder, dataset_folder, times_file)
21 |
22 | edgelist = utilities.readScHoLPData(hyperedge_size_file, member_ID_file)
23 |
24 | H = xgi.Hypergraph(edgelist)
25 | H["name"] = "NDC-classes"
26 |
27 | delimiter = " "
28 |
29 | node_labels = utilities.readScHoLPLabels(node_labels_file, delimiter)
30 | edge_labels = utilities.readScHoLPLabels(edge_labels_file, delimiter, two_column=False)
31 |
32 | edge_times = utilities.read_SCHOLP_dates(
33 | edge_times_file, reference_time=datetime(1, 1, 1), time_unit="milliseconds"
34 | )
35 |
36 | H.add_nodes_from(list(node_labels.keys()))
37 |
38 | H.set_edge_attributes(edge_labels, name="name")
39 |
40 | for label, name in node_labels.items():
41 | H.nodes[label].update({"name": name})
42 |
43 | for label, date in edge_times.items():
44 | H.edges[label].update({"timestamp": date})
45 |
46 |
47 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, "ndc-classes.json"))
48 |
--------------------------------------------------------------------------------
/datasheets/contact-primary-school/README_contact-primary-school.md:
--------------------------------------------------------------------------------
1 | # contact-primary-school
2 |
3 | ## Summary
4 |
5 | This dataset is constructed from a contact network amongst children and teachers
6 | at a primary school. The contact network was downloaded from
7 | http://www.sociopatterns.org/datasets/primary-school-temporal-network-data/
8 |
9 | We form simplices through cliques of simultaneous contacts. Specifically, for every unique timestamp in the dataset, we construct a simplex for every maximal clique amongst the contact edges that exist for that timestamp. Timestamps were
10 | recorded in 20 second intervals.
11 |
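A minimal, self-contained sketch of this maximal-clique construction (the contact edges here are hypothetical toy data; the import scripts in `code/` follow the same pattern):

```python
import networkx as nx
import xgi

# hypothetical contact edges active at a single timestamp
contacts = [(1, 2), (2, 3), (1, 3), (3, 4)]

H = xgi.Hypergraph()
G = nx.Graph(contacts)
for clique in nx.find_cliques(G):  # maximal cliques of simultaneous contacts
    H.add_edge(clique, timestamp=0)
```
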
12 | ## Statistics
13 | Some basic statistics of this dataset are:
14 | * number of nodes: 242
15 | * number of timestamped simplices: 106,879
16 | * number of unique simplices: 12,799
17 | * number of edges in projected graph: 8,317
18 |
19 | * degree and edge size distributions:
20 |
21 |
22 |
23 | Hypergraph degree and edge size distributions
24 |
25 | ## Source of original data
26 | Source: [contact-primary-school dataset](https://www.cs.cornell.edu/~arb/data/contact-primary-school/)
27 |
28 | ## References
29 | If you use this dataset, please cite these references:
30 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115). Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
31 | * [High-Resolution Measurements of Face-to-Face Contact Patterns in a Primary School](https://doi.org/10.1371/journal.pone.0023176). Juliette Stehlé, Nicolas Voirin, Alain Barrat, Ciro Cattuto, Lorenzo Isella, Jean-François Pinton, Marco Quaggiotto, Wouter Van den Broeck, Corinne Régis, Bruno Lina, and Philippe Vanhems. PLoS ONE, 2011.
--------------------------------------------------------------------------------
/datasheets/NDC-classes/README.md:
--------------------------------------------------------------------------------
1 | # ndc-classes
2 |
3 | ## Overview
4 | This dataset consists of the pharmaceutical classes used to classify drugs in
5 | the National Drug Code Directory maintained by the Food and Drug
6 | Administration between 1946 and 2017. The original data was downloaded from
7 | https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm.
8 |
9 | This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. Timestamps are in ISO8601 format. In the original dataset, the same drug substance can have more than one NDC code. For example, different dosages of the same drug may result in multiple NDC codes.
10 |
11 | ## Statistics
12 | Some basic statistics of this dataset are:
13 | * number of nodes: 1,161
14 | * number of timestamped hyperedges: 49,726
15 | * distribution of the connected components:
16 |
38 | Hypergraph degree and edge size distributions
39 |
40 | ## Source of original data
41 | Sources:
42 | * [NDC-classes dataset](https://www.cs.cornell.edu/~arb/data/NDC-classes/)
43 | * [FDA](https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm)
44 |
45 | ## References
46 | If you use this dataset, please cite these references:
47 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115), Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
--------------------------------------------------------------------------------
/datasheets/contact-high-school/README_contact-high-school.md:
--------------------------------------------------------------------------------
1 | # contact-high-school
2 |
3 | ## Summary
4 |
5 | This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. This dataset is constructed from a contact network amongst high school students
6 | in Marseilles, France, in December 2013. The contact network was downloaded from
7 | http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/
8 |
9 | We form simplices through cliques of simultaneous contacts. Specifically, for
10 | every unique timestamp in the dataset, we construct a simplex for every maximal
11 | clique amongst the contact edges that exist for that timestamp. Timestamps were
12 | recorded in 20 second intervals.
13 |
14 | ## Statistics
15 | Some basic statistics of this dataset are:
16 | * number of nodes: 327
17 | * number of timestamped hyperedges: 172,035
18 | * there is a single connected component of size 327
19 |
20 | * degree and edge size distributions:
21 |
22 |
23 |
24 | Hypergraph degree and edge size distributions
25 |
26 | ## Source of original data
27 | Source: [contact-high-school dataset](https://www.cs.cornell.edu/~arb/data/contact-high-school/)
28 |
29 | ## References
30 | If you use this dataset, please cite these references:
31 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115). Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
32 | * [Contact Patterns in a High School: A Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys](https://doi.org/10.1371/journal.pone.0136497). Rossana Mastrandrea, Julie Fournet, and Alain Barrat. PLoS ONE, 2015.
--------------------------------------------------------------------------------
/datasheets/DAWN/README.md:
--------------------------------------------------------------------------------
1 | # dawn
2 |
3 | The Drug Abuse Warning Network (DAWN) is a national health surveillance system that records drug use contributing to hospital emergency department visits throughout the United States. Hyperedges in this dataset are the drugs used by a patient (as reported by the patient) in an emergency department visit. The drugs include illicit substances, prescription and over-the-counter medication, and dietary supplements. Timestamps of visits (under the `timestamp` attribute of the hyperedges) are recorded at the resolution of quarter-years, spanning a total duration of 8 years (2004 to 2011). The names of the drugs are encoded in the `name` attribute of the nodes.
4 |
5 | This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. Timestamps are in ISO8601 format.
6 |
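The raw timestamps are quarter-year indices; the `"quarters"` branch of `read_SCHOLP_dates` in `code/utilities.py` (presumably the branch used for this dataset) converts such an index to a date as follows:

```python
from datetime import datetime

t = 8020  # a quarter-year index (hypothetical value)
year = int((t - 1) / 4)
quarter = (t - 1) % 4 + 1
time = datetime(year, int(3 * quarter), 1)  # one date per quarter
print(time.isoformat())  # 2004-12-01T00:00:00
```
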
7 | ## Statistics
8 | Some basic statistics of this dataset are:
9 | * number of nodes: 2,558
10 | * number of timestamped hyperedges: 2,272,433
11 | * distribution of the connected components:
12 |
23 | Hypergraph degree and edge size distributions
24 |
25 | ## Source of original data
26 | Sources:
27 | * [DAWN dataset](https://www.cs.cornell.edu/~arb/data/DAWN/)
28 |
29 | ## References
30 | If you use this dataset, please cite these references:
31 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115), Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
--------------------------------------------------------------------------------
/code/import_diseasome.py:
--------------------------------------------------------------------------------
1 | import os
2 | import xml.etree.ElementTree as ET
3 |
4 | import xgi
5 |
6 | data_folder = "data"
7 |
8 | dataset_folder = "diseasome"
9 |
10 | H = xgi.Hypergraph()
11 | tree = ET.parse(os.path.join(data_folder, dataset_folder, "diseasome.gexf"))
12 | root = tree.getroot()
13 |
14 | node_attr = dict()
15 | edge_attr = dict()
16 | for item in root:
17 | for subelement in item:
18 | if "nodes" in subelement.tag:
19 | for node in subelement:
20 | for attrlist in node:
21 | for attr in attrlist:
22 | if (
23 | attr.attrib["id"] == "0"
24 | and attr.attrib["value"] == "disease"
25 | ):
26 | node_attr[node.attrib["id"]] = {
27 | "label": node.attrib["label"]
28 | }
29 | elif (
30 | attr.attrib["id"] == "0" and attr.attrib["value"] == "gene"
31 | ):
32 | edge_attr[node.attrib["id"]] = {
33 | "label": node.attrib["label"]
34 | }
35 |
36 | for item in root:
37 | for subelement in item:
38 | if "edges" in subelement.tag:
39 | for edge in subelement:
40 | source = edge.attrib["source"]
41 | target = edge.attrib["target"]
42 | if source in node_attr and target in edge_attr:
43 | H.add_node_to_edge(edge.attrib["target"], edge.attrib["source"])
44 | elif target in node_attr and source in edge_attr:
45 | H.add_node_to_edge(edge.attrib["source"], edge.attrib["target"])
46 | else:
47 | print(f"Edge ({source}, {target}) Not bipartite!")
48 |
49 | xgi.set_edge_attributes(H, edge_attr)
50 | xgi.set_node_attributes(H, node_attr)
51 | H["name"] = "Diseasome"
52 |
53 |
54 | xgi.write_json(H, os.path.join(data_folder, dataset_folder, "diseasome.json"))
55 |
--------------------------------------------------------------------------------
/datasheets/hospital-lyon/README_hospital-lyon.md:
--------------------------------------------------------------------------------
1 | # hospital-lyon
2 |
3 | ## Summary
4 | This dataset contains the temporal network of contacts among patients, between patients and health-care workers (HCWs), and among HCWs in a hospital ward in Lyon, France, from Monday, December 6, 2010 at 1:00 pm to Friday, December 10, 2010 at 2:00 pm. The study included 46 HCWs and 29 patients.
5 |
6 | The active contacts are resolved to 20-second intervals in the data collection. In the original data, each line has the form "t i j Si Sj", where i and j are the anonymous IDs of the persons in contact, Si and Sj are their statuses (NUR=paramedical staff, i.e. nurses and nurses’ aides; PAT=Patient; MED=Medical doctor; ADM=administrative staff), and the interval during which this contact was active is [ t – 20s, t ]. If a node is connected to more than one other node in a given time interval, we assume that all these nodes participate in a group interaction: for example, if nodes 1 and 2 as well as nodes 2 and 3 are in contact in a given time interval, we assume that 1, 2, and 3 participate in a group interaction together. All timestamps are in standard ISO8601 format.
7 |
8 | ## Statistics
9 | * number of nodes: 75 (46 HCWs and 29 patients)
10 | * number of timestamped hyperedges: 21,398
11 | * there is a single connected component of size 75
12 | * degree and edge size distributions:
13 |
14 |
15 |
16 |
17 | Hypergraph degree and edge size distributions
18 |
19 | ## Source of original data
20 | Source: [SocioPatterns dataset: Hospital ward dynamic contact network](http://www.sociopatterns.org/datasets/hospital-ward-dynamic-contact-network/)
21 |
22 | ## References
23 | If you use this dataset, please cite these references:
24 | * [Estimating Potential Infection Transmission Routes in Hospital Wards Using Wearable Proximity Sensors](http://dx.doi.org/10.1371%2Fjournal.pone.0073970). Philippe Vanhems, Alain Barrat, Ciro Cattuto, Jean-François Pinton, Nagham Khanafer, Corinne Régis, Byeul-a Kim, Brigitte Comte, Nicolas Voirin. PLoS ONE, 2013.
--------------------------------------------------------------------------------
/code/utilities.py:
--------------------------------------------------------------------------------
1 | import json
2 | from datetime import datetime, timedelta
3 |
4 |
5 | def readScHoLPData(edge_size_file, member_ID_file):
6 | edgelist = list()
7 | with open(edge_size_file) as size_file, open(member_ID_file) as id_file:
8 | sizes = size_file.read().splitlines()
9 | members = id_file.read().splitlines()
10 | member_index = 0
11 | for index in range(len(sizes)):
12 | edge = list()
13 | edge_size = int(sizes[index])
14 | for i in range(member_index, member_index + edge_size):
15 | member = members[i]
16 | edge.append(member)
17 | edgelist.append(tuple(edge))
18 | member_index += edge_size
19 | return edgelist
20 |
21 |
22 | def readScHoLPLabels(labels_file, delimiter="\t", two_column=True):
23 | label_dict = dict()
24 | with open(labels_file) as label_data:
25 | for i, line in enumerate(label_data):
26 | if two_column:
27 | s = line.split(delimiter, 1)
28 | idx = s[0]
29 | val = s[1].rstrip("\n")
30 | else:
31 | idx = i + 1
32 | val = line.rstrip("\n")
33 | label_dict[idx] = val
34 | return label_dict
35 |
36 |
37 | def read_SCHOLP_dates(
38 | timestamp_file, reference_time=datetime(1, 1, 1), time_unit="days"
39 | ):
40 | time_dict = dict()
41 | with open(timestamp_file) as time_data:
42 | lines = time_data.read().splitlines()
43 | for i in range(len(lines)):
44 | t = int(lines[i])
45 | if time_unit == "days":
46 | time = reference_time + timedelta(days=t)
47 | elif time_unit == "seconds":
48 | time = reference_time + timedelta(seconds=t)
49 | elif time_unit == "milliseconds":
50 | time = reference_time + timedelta(seconds=t / 1000)
51 | elif time_unit == "quarters":
52 | year = int((t - 1) / 4)
53 | quarter = (t - 1) % 4 + 1
54 | time = datetime(year, int(3 * quarter), 1)
55 | elif time_unit == "years":
56 | year = t
57 | time = datetime(year, 1, 1)
58 | time_dict[i] = time.isoformat()
59 | return time_dict
60 |
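# Example usage of the helpers above (a sketch; the file names are
# hypothetical and follow the ScHoLP naming convention):
#
#     import xgi
#
#     edgelist = readScHoLPData("example-nverts.txt", "example-simplices.txt")
#     H = xgi.Hypergraph(edgelist)
#     node_labels = readScHoLPLabels("example-node-labels.txt", delimiter=" ")
#     H.add_nodes_from(list(node_labels.keys()))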
--------------------------------------------------------------------------------
/CITATION.cff:
--------------------------------------------------------------------------------
1 | # YAML 1.2
2 | cff-version: "1.2.0"
3 | authors:
4 | - email: nicholas.landry@uvm.edu
5 | family-names: Landry
6 | given-names: Nicholas W.
7 | orcid: "https://orcid.org/0000-0003-1270-4980"
8 | - family-names: Lucas
9 | given-names: Maxime
10 | orcid: "https://orcid.org/0000-0001-8087-2981"
11 | - family-names: Iacopini
12 | given-names: Iacopo
13 | orcid: "https://orcid.org/0000-0001-8794-6410"
14 | - family-names: Petri
15 | given-names: Giovanni
16 | orcid: "https://orcid.org/0000-0003-1847-5031"
17 | - family-names: Schwarze
18 | given-names: Alice
19 | orcid: "https://orcid.org/0000-0002-9146-8068"
20 | - family-names: Patania
21 | given-names: Alice
22 | orcid: "https://orcid.org/0000-0002-3047-4376"
23 | - family-names: Torres
24 | given-names: Leo
25 | orcid: "https://orcid.org/0000-0002-2675-2775"
26 | contact:
27 | - email: nicholas.landry@uvm.edu
28 | family-names: Landry
29 | given-names: Nicholas W.
30 | orcid: "https://orcid.org/0000-0003-1270-4980"
31 | doi: 10.5281/zenodo.7939055
32 | message: If you use this software, please cite our article in the
33 | Journal of Open Source Software.
34 | preferred-citation:
35 | authors:
36 | - email: nicholas.landry@uvm.edu
37 | family-names: Landry
38 | given-names: Nicholas W.
39 | orcid: "https://orcid.org/0000-0003-1270-4980"
40 | - family-names: Lucas
41 | given-names: Maxime
42 | orcid: "https://orcid.org/0000-0001-8087-2981"
43 | - family-names: Iacopini
44 | given-names: Iacopo
45 | orcid: "https://orcid.org/0000-0001-8794-6410"
46 | - family-names: Petri
47 | given-names: Giovanni
48 | orcid: "https://orcid.org/0000-0003-1847-5031"
49 | - family-names: Schwarze
50 | given-names: Alice
51 | orcid: "https://orcid.org/0000-0002-9146-8068"
52 | - family-names: Patania
53 | given-names: Alice
54 | orcid: "https://orcid.org/0000-0002-3047-4376"
55 | - family-names: Torres
56 | given-names: Leo
57 | orcid: "https://orcid.org/0000-0002-2675-2775"
58 | date-published: 2023-05-17
59 | doi: 10.21105/joss.05162
60 | issn: 2475-9066
61 | issue: 85
62 | journal: Journal of Open Source Software
63 | publisher:
64 | name: Open Journals
65 | start: 5162
66 | title: "XGI: A Python package for higher-order interaction networks"
67 | type: article
68 | url: "https://joss.theoj.org/papers/10.21105/joss.05162"
69 | volume: 8
70 | title: "XGI: A Python package for higher-order interaction networks"
--------------------------------------------------------------------------------
/datasheets/email-eu/README_email-eu.md:
--------------------------------------------------------------------------------
1 | # email-eu
2 |
3 | ## Overview
4 | This hypergraph dataset was generated using email data from a large European research institution for a period from October 2003 to May 2005 (18 months). Information about all incoming and outgoing email between members of the research institution has been anonymized. The e-mails only represent communication between institution members (the core), and the dataset does not contain incoming messages from or outgoing messages to the rest of the world.
5 |
6 | This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. Timestamps are in ISO8601 format. In email communication, messages can be sent to multiple recipients. In this dataset, nodes are email addresses at a European research institution. The original data source only contains directed temporal edge tuples (sender, receiver, timestamp), where timestamps are recorded at 1-second resolution. The hyperedges are undirected and consist of a sender and all receivers grouped such that the email between the sender and each receiver has the same timestamp.
7 |
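A minimal sketch of the (sender, timestamp) grouping described above (the column names and rows here are hypothetical toy data):

```python
import pandas as pd
import xgi

# toy directed temporal edges: (sender, receiver, timestamp)
data = pd.DataFrame(
    {"sender": [1, 1, 2], "receiver": [2, 3, 3], "timestamp": [10, 10, 11]}
)

H = xgi.Hypergraph()
for (sender, t), d in data.groupby(["sender", "timestamp"]):
    # one hyperedge per sender per timestamp: the sender plus all receivers
    H.add_edge({sender, *d["receiver"]}, timestamp=t)
```
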
8 | ## Statistics
9 | Some basic statistics of this dataset are:
10 | * number of nodes: 1,005
11 | * number of timestamped hyperedges: 235,263
12 | * distribution of the connected components:
13 |
25 | Hypergraph degree and edge size distributions
26 |
27 | ## Source of original data
28 | Source: [email-Eu dataset](https://www.cs.cornell.edu/~arb/data/email-Eu/)
29 |
30 | ## References
31 | If you use this dataset, please cite these references:
32 | * [Simplicial closure and higher-order link prediction](https://doi.org/10.1073/pnas.1800683115), Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Proceedings of the National Academy of Sciences (PNAS), 2018.
33 | * [Local Higher-order Graph Clustering](https://doi.org/10.1145/3097983.3098069), Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. Proceedings of KDD, 2017.
34 | * [Graph Evolution: Densification and Shrinking Diameters](https://doi.org/10.1145/1217299.1217301), Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. ACM Transactions on Knowledge Discovery from Data, 2007.
--------------------------------------------------------------------------------
/HOW_TO_CONTRIBUTE.md:
--------------------------------------------------------------------------------
1 | # Contributing
2 |
3 | ## Adding a dataset to XGI-DATA
4 |
5 | ### Creating a JSON file for a dataset
6 | 1. Create a script titled `import_<dataset_name>.py`. Choose a dataset name that is concise, yet descriptive.
7 | 2. Convert the raw data into an `xgi` hypergraph, using the above script. Examples of importing are in the [code](/code/) folder.
8 | 3. Save the dataset to a JSON file using the `xgi.write_json()` [method](https://xgi.readthedocs.io/en/stable/api/readwrite/xgi.readwrite.json.html#module-xgi.readwrite.json). A minimal example script is sketched below.
9 |
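A minimal sketch of such an import script (the file names and raw-data format are hypothetical; adapt them to your dataset):

```python
import xgi

H = xgi.Hypergraph()
H["name"] = "example-dataset"

# assume each line of the raw file holds one comma-separated hyperedge
with open("data/example-dataset/hyperedges.txt") as f:
    for line in f:
        H.add_edge(line.strip().split(","))

xgi.write_json(H, "data/example-dataset/example-dataset.json")
```
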
10 | ### Adding to Zenodo
11 | 1. Navigate to the [XGI page](https://zenodo.org/communities/xgi) on Zenodo.
12 | 2. Click the "New Upload" button. This should prompt you to log into Zenodo and will bring up the form to upload a new dataset.
13 | 3. Enter the information in the "new upload" form:
14 | 1. In the "Files" section, drag and drop the file or click the "Upload files" button.
15 | 2. When asked "Do you already have a DOI for this upload?", select "No".
16 | 3. Under "Resource type", select "Dataset" from the dropdown list.
17 | 4. Under "Title" enter the dataset name selected above.
18 | 5. Under "Creators", add yourself with name, ORCID, and affiliation along with your role (typically "Data collector" or "Data curator")
19 | 6. Under "Description" write the name of the dataset, where it is from, how it was collected, what nodes and edges are, and some basic statistics about the dataset.
20 | 7. Under "Version", type "v0.0" if this is the first version of the dataset.
21 | 4. Click the "Submit for review" button. This will send it to the XGI-DATA moderators for review.
22 |
23 | Once the dataset has been added to Zenodo, do the following:
24 |
25 | ### Updating Github
26 | 1. Fork XGI-DATA.
27 | 2. Move the import script created prior to the `code` folder.
28 | 3. Add an entry (in alphabetical order) in [`index.json`](https://github.com/xgi-org/xgi-data/blob/add-contribution-guide/index.json) with:
29 | 1. The dataset name as the key (all lowercase!)
30 | 2. The value as a dictionary `{"url": <url>}`
31 | 3. The url can be found by going to the [XGI page](https://zenodo.org/communities/xgi) on Zenodo, and clicking on the record you just made. Then the url is `https://zenodo.org/records/<record-id>/files/<dataset-name>.json` (an example `index.json` entry is shown after this list).
32 | 4. In `README.md`, add the dataset name (alphabetically) as a hyperlink with the Zenodo page url.
33 | 5. Run `get_stats.ipynb` with the dataset name as an argument in `load_xgi_data()`. If every cell in this notebook is run, it will update the `index.json` file and add a plot of the degree/edge size distribution.
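For illustration, an `index.json` entry might look like the following (the record number here is hypothetical; copy the exact url from Zenodo as described above):

```json
{
  "example-dataset": {
    "url": "https://zenodo.org/records/1234567/files/example-dataset.json"
  }
}
```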
--------------------------------------------------------------------------------
/datasheets/email-enron/README_email-enron.md:
--------------------------------------------------------------------------------
1 | # email-enron
2 |
3 | ## Summary
4 |
5 | This is a temporal hypergraph dataset, which here means a sequence of timestamped hyperedges where each hyperedge is a set of nodes. In email communication, messages can be sent to multiple recipients. In this dataset, nodes are email addresses at Enron and a hyperedge consists of the sender and all recipients of the email. Only email addresses from a core set of employees are included. Timestamps are in ISO8601 format.
6 |
7 | This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
8 |
9 | The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., when the recipient was specified in some parseable format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
10 |
11 | ## Statistics
12 | Some basic statistics of this dataset are:
13 | * number of nodes: 148
14 | * number of timestamped hyperedges: 10,885
15 | * distribution of the connected components:
16 |