├── README.md ├── conda_requirements.txt ├── data_extraction ├── README.md ├── complicated_cascade_followers.json ├── complicated_cascade_partial.csv ├── extract_ground_truth_cp2.py ├── keyword_map.json └── twitter_cascade_reconstruction.py ├── december-measurements ├── BaselineMeasurements.py ├── CommunityCentricMeasurements.py ├── ContentCentricMeasurements.py ├── Metrics.py ├── TEMeasurements.py ├── UserCentricMeasurements.py ├── cascade_measurements.py ├── cascade_reconstruction │ ├── example_follower_data │ │ ├── a.txt │ │ ├── b.txt │ │ ├── d.txt │ │ ├── h.txt │ │ ├── i.txt │ │ ├── j.txt │ │ ├── k.txt │ │ └── m.txt │ ├── twitter_cascade_reconstruction.py │ ├── twitter_example_data_reconstructed.json │ └── twitter_reconstruction_example_data.json ├── cascade_validators.py ├── config │ ├── baseline_metrics_config_github.py │ ├── baseline_metrics_config_github_crypto_s1.py │ ├── baseline_metrics_config_github_cve_s1.py │ ├── baseline_metrics_config_github_cyber_s1.py │ ├── baseline_metrics_config_reddit.py │ ├── baseline_metrics_config_reddit_crypto_s1.py │ ├── baseline_metrics_config_reddit_crypto_s2.py │ ├── baseline_metrics_config_reddit_cve_s1.py │ ├── baseline_metrics_config_reddit_cve_s2.py │ ├── baseline_metrics_config_reddit_cyber_s1.py │ ├── baseline_metrics_config_reddit_cyber_s2.py │ ├── baseline_metrics_config_twitter.py │ ├── baseline_metrics_config_twitter_crypto_s1.py │ ├── baseline_metrics_config_twitter_cve_s1.py │ ├── baseline_metrics_config_twitter_cve_s2.py │ ├── baseline_metrics_config_twitter_cyber_s1.py │ ├── cascade_metrics_config.py │ ├── cascade_metrics_config_twitter.py │ └── network_metrics_config.py ├── infodynamics.jar ├── network_measurements.py ├── plotting │ ├── charts.py │ ├── transformer.py │ └── visualization_config.py ├── run_measurements_and_metrics.py └── validators.py ├── github-measurements-old ├── Metrics.py ├── RepoCentricMeasurements.py ├── RepoMeasurementsWithPlot.py ├── TransferEntropy.py ├── UserCentricMeasurements.py ├── UserMeasurementsWithPlot.py ├── load_data.py ├── metrics_config.py └── plots.py ├── github-measurements ├── CommunityCentricMeasurements.py ├── Measurements.py ├── Metrics.py ├── RepoCentricMeasurements.py ├── TEMeasurements.py ├── UserCentricMeasurements.py ├── infodynamics.jar ├── metrics_config.py ├── reference-approaches │ ├── README.md │ ├── generate_reference_approach_data.py │ └── reference_approach_performance_plots.py └── requirements.txt ├── license.txt └── pip_requirements.txt /README.md: -------------------------------------------------------------------------------- 1 | # socialsim 2 | 3 | This repo contains scripts needed to run the measurements and metrics for the SocialSim challenge evaluation. 4 | 5 | ## Change Log 6 | 7 | * **2 November 2018**: 8 | * Added code for reconstructing Twitter cascades using follower data (identifying the parentID for retweets and the rootID for reply tweets) in december-measurements/cascade_reconstruction. 9 | * Improved efficiency of cascade measurements by switching to igraph implementation and making improvements to the time series meausurements 10 | * Fix handling of cascades where root node is not included in the simulation input 11 | 12 | * **31 October 2018**: 13 | * Improved efficiency of the network initialization 14 | * Added cascade measurements to the visualization configuration. Cascade measurements will now generate visualizations if the plot_flag is set to True. 15 | 16 | * **25 October 2018**: 17 | * Changed handling of root-only cascades (i.e. 
posts with no comments or tweets with no replies/retweets/quotes) to no longer return None, allowing metrics to be calculated even if the simulation or the ground truth contains these empty cascades. 18 | * Changed the join between simulation and ground truth data for calculation of one-to-one measurements (e.g. RMSE, R2) to an outer join rather than an inner join, with appropriate filling of missing values (forward fill for cumulative time-series and zero fill for non-cumulative). 19 | * Changed default behavior for community metrics. Previously used the baseline challenge community definitions by default, now calculates each community measurement on the full set of data by default. 20 | 21 | * **24 October 2018**: 22 | * Added checks for valid values of the status and actionSubType fields to avoid errors when calculating measurements that require these fields. 23 | 24 | * **16 October 2018**: 25 | * We added requirements files and instructions for setting up an environment to run the code. 26 | * Fixed a typo in the config 27 | 28 | * **12 October 2018**: 29 | * We updated the network_measurements implementations to use igraph and SNAP rather than networkx for improved memory and time performance. Some of our team members had trouble with the python-igraph and SNAP installations. If you have trouble with the python-igraph installation using pip, try the conda install: "conda install -c conda-forge python-igraph". SNAP should be installed from https://snap.stanford.edu/snappy/ rather than using pip. If you get a "Fatal Python error: PyThreadState_Get: no current thread" error, you should modify the SNAP setup.py file and replace "dynlib_path = getdynpath()" with e.g. "dynlib_path = '/anaconda/lib/libpython2.7.dylib'" (use the path to your libpython2.7.dylib file). Please contact us if you are having trouble with your installation after following these steps. 30 | * Additionally, we moved from the CSV input format to the JSON input format. Example JSON files for each platform can be found on the December Challenge wiki page in the same place as the example CSV files. 31 | 32 | * **9 October 2018**: 33 | * We updated the cascade_measurements so that cascade-level measurements are calculated using the CascadeCollectionMeasurements class rather than the SingleCascadeMeasurements class. This means that all cascade measurements can now be calculated using the CascadeCollectionMeasurements class. The cascade_examples function shows how to run cascade measurements. Additionally, we fixed the implementation of the cascade breadth calculation. 34 | 35 | ## Environment Installation 36 | 37 | Create a conda environment by running 38 | 39 | conda create --name my_env_name --file conda_requirements.txt -c conda-forge python=2.7 40 | 41 | Activate your new conda environment by running 42 | 43 | source activate my_env_name 44 | 45 | Install the remaining pip requirements with 46 | 47 | pip install -r pip_requirements.txt 48 | 49 | and finally install SNAP by following the instructions found here: https://snap.stanford.edu/snappy/ 50 | 51 | ## Scripts 52 | 53 | ### run_measurements_and_metrics.py 54 | 55 | This is the main script that provides functionality to run individual measurements and metrics or the full set of assigned measurements and metrics for the challenge (this replaces 56 | the previous metrics_config.py script). 57 | 58 | #### Measurement Configuration 59 | 60 | The measurement configurations used by run_measurements_and_metrics.py are found in the metric_config files in the config/ directory. 
These 61 | files define a set of dictionaries for different measurement types that specify the measurement and metric parameters. There are five metrics_config files: 62 | 63 | 1. network_metrics_config.py - contains `network_measurement_params` to be used for all network measurements 64 | 2. cascade_metrics_config.py - contains `cascade_measurement_params` to be used for all cascade measurements 65 | 3. baseline_metrics_config_github.py - contains `github_measurement_params` to be used for baseline measurements applied to GitHub 66 | 4. baseline_metrics_config_reddit.py - contains `reddit_measurement_params` to be used for baseline measurements applied to Reddit 67 | 5. baseline_metrics_config_twitter.py - contains `twitter_measurement_params` to be used for baseline measurements applied to Twitter 68 | 69 | 70 | Each dictionary element in one of the measurement_params dictionaries defines the metric assignments for a single measurement, with the key indicating the name of the 71 | measurement and the value specifying the measurement function, the measurement function arguments, the scenarios for which the measurement is included, 72 | and the metric functions for the metric calculation. 73 | For example, here is the specification of a single measurement in this format: 74 | 75 | ```python 76 | measurement_params = { 77 | "user_unique_repos": { 78 | 'question': '17', 79 | "scale": "population", 80 | "node_type":"user", 81 | "scenario1":True, 82 | "scenario2":False, 83 | "scenario3":True, 84 | "measurement": "getUserUniqueRepos", 85 | "measurement_args":{"eventType":contribution_events}, 86 | "metrics": { 87 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 88 | "rmse": Metrics.rmse, 89 | "r2": Metrics.r2} 90 | } 91 | } 92 | ``` 93 | 94 | This measurement is related to the number of unique repos that users contribute to (Question #17), which is a user-centric 95 | measurement at the population level. The measurement will be used in scenario 1 and scenario 3, but not scenario 2. 96 | The "measurement" keyword specifies the measurement function to apply, and the "measurement_args" keyword specifies 97 | the arguments to the measurement function in dictionary format. The "metrics" keyword provides a dictionary of each of 98 | the metrics that should be applied for this measurement. 99 | 100 | #### Measurements Classes 101 | 102 | Measurements are calculated on a data set by employing one of the measurements classes. There are currently 6 measurements classes, which produce different categories of measurements: 103 | 1. BaselineMeasurements implemented in BaselineMeasurements.py - this includes all measurements from the baseline challenge which have been generalized to apply to GitHub, Twitter, or Reddit 104 | 2. GithubNetworkMeasurements implemented in network_measurements.py - this includes network measurements for GitHub. 105 | 3. RedditNetworkMeasurements implemented in network_measurements.py - this includes network measurements for Reddit. 106 | 4. TwitterNetworkMeasurements implemented in network_measurements.py - this includes network measurements for Twitter. 107 | 5. SingleCascadeMeasurements implemented in cascade_measurements.py - this includes node level cascade measurements (i.e. measurements on a single cascade) 108 | 6. CascadeCollectionMeasurements implemented in cascade_measurements.py - this includes population and community level cascade measurements (i.e. 
measurements on a set of cascades) 109 | 110 | To instantiate a measurements object for a particular data set (either simulation or ground truth data), you generally pass the data frame to one of the above classes: 111 | 112 | ```python 113 | #create measurement object from data frame 114 | measurement = BaselineMeasurements(data_frame) 115 | #create measurement object from csv file 116 | measurement = BaselineMeasurements(csv_file_name) 117 | 118 | #create measurement object with specific list of nodes to calculate node-level measurements on 119 | measurement = BaselineMeasurements(data_frame,user_node_ids=['user_id1'],content_node_ids=['repo_id1']) 120 | ``` 121 | 122 | This object contains the methods for calculating all of the measurements of the given type. For example, the user unique repos measurement can be calculated as follows: 123 | 124 | ```python 125 | result = measurement.getUserUniqueRepos(eventType=contribution_events) 126 | ``` 127 | 128 | #### Running a Single Measurement 129 | 130 | The `run_measurement` function can be used to calculate the measurement output for a single measurement on a given data set using the measurement_params configuration, which contains the parameters to be used for evaluation during the challenge event. The arguments for this function include the data, the measurement_params dictionary, and the name of the measurement to apply. 131 | 132 | For example, if we want to run one of the baseline GitHub measurements on the simulation data, we need to provide the `github_measurement_params` dictionary which contains the relevant configuration and provide the name of the specific measurement we are interested in: 133 | 134 | ```python 135 | simulation = BaselineMeasurements(simulation_data_frame) 136 | meas = run_measurement(simulation, github_measurement_params, "user_unique_content") 137 | ``` 138 | 139 | The `run_metrics` function can be used to run all the relevant metrics for a given measurement in addition to the measurement output itself. 140 | This function takes as input two Measurements objects (one for the ground truth and one for the simulation), the relevant measurement_params dictionary, and the name of the measurement as listed in the keywords of measurement_params. It returns the measurement results for the ground truth and the simulation, along with the metric output. 141 | 142 | For example: 143 | 144 | ```python 145 | ground_truth = BaselineMeasurements(ground_truth_data_frame) 146 | simulation = BaselineMeasurements(simulation_data_frame) 147 | gt_measurement, sim_measurement, metric = run_metrics(ground_truth, simulation, "user_unique_content", github_measurement_params) 148 | ``` 149 | 150 | #### Running All Measurements 151 | 152 | To run all of the measurements that are defined in the measurement_params configuration, the `run_all_measurements` and `run_all_metrics` 153 | functions can be used. 
To run all the measurements on a simulation data Measurements object and save the output in pickle files in the output directory: 154 | 155 | ```python 156 | meas_dictionary = run_all_measurements(simulation,github_measurement_params,output_dir='measurement_output/') 157 | ``` 158 | 159 | To run all the metrics for all the measurements on a ground truth Measurements object and simulation data Measurements object: 160 | 161 | ```python 162 | metrics = run_all_metrics(ground_truth,simulation,github_measurement_params) 163 | ``` 164 | 165 | For both `run_all_metrics` and `run_all_measurements`, you can additionally select specific subsets of the measurements by using the filters parameter to filter on any properties in the measurement_params dictionary. For example: 166 | 167 | ```python 168 | metrics = run_all_metrics(ground_truth,simulation,github_measurement_params,filters={"scale":"population","node_type":"user"}) 169 | ``` 170 | 171 | #### Plotting 172 | 173 | In order to generate plots of the measurements, any of the `run_metrics`, `run_measurement`, `run_all_metrics`, and `run_all_measurements` functions can take the following arguments: 174 | 175 | 1. plot_flag - boolean indicator of whether to generate plots 176 | 2. show - boolean indicator of whether to display the plots to screen 177 | 3. plot_dir - A directory in which to save the plots. If plot_dir is an empty string '', the plots will not be saved. 178 | 179 | Currently, plotting is only implemented for the baseline challenge measurements. Plotting functionality for the remaining measurements will be released at a later date. 180 | 181 | ### Metrics.py 182 | 183 | This script contains implementations of each metric for comparison of the output of the ground truth and simulation 184 | measurements. 185 | 186 | ### BaselineMeasurements.py 187 | 188 | This script contains the core BaselineMeasurements class which performs initialization of all input data for measurement calculation 189 | for the measurements from the baseline challenge. 190 | 191 | ### UserCentricMeasurements.py 192 | 193 | This script contains implementations of the user-centric measurements inside the UserCentricMeasurements class. 194 | 195 | ### ContentCentricMeasurements.py 196 | 197 | This script contains implementations of the baseline content-centric measurements inside the ContentCentricMeasurements class. 198 | 199 | ### CommunityCentricMeasurements.py 200 | 201 | This script contains implementations of the community-centric measurements inside the CommunityCentricMeasurements class. 202 | 203 | ### network_measurements.py 204 | 205 | This script contains implementations of the network measurements inside the GithubNetworkMeasurements, RedditNetworkMeasurements, and TwitterNetworkMeasurements classes. 206 | 207 | ### cascade_measurements.py 208 | 209 | This script contains implementations of the cascade measurements inside the SingleCascadeMeasurements and CascadeCollectionMeasurements classes. 
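As an end-to-end reference, the following sketch runs the cascade measurements and metrics using the pieces described above. It is only a sketch: it assumes that `CascadeCollectionMeasurements` is constructed from a pandas data frame in the same way as `BaselineMeasurements`, that the input file names are placeholders, and that the import paths resolve as written (they may need adjusting depending on how the config/ directory is placed on your path). Check cascade_measurements.py (e.g. the `cascade_examples` function) and config/cascade_metrics_config.py for the exact constructor arguments and parameter dictionary before running.

```python
import pandas as pd

from cascade_measurements import CascadeCollectionMeasurements
from config.cascade_metrics_config import cascade_measurement_params
from run_measurements_and_metrics import run_all_measurements, run_all_metrics

# Load ground truth and simulation events (placeholder file names, simulation output schema).
ground_truth_df = pd.read_json('ground_truth_events.json')
simulation_df = pd.read_json('simulation_events.json')

# Build one Measurements object per data set (constructor assumed to mirror BaselineMeasurements).
ground_truth = CascadeCollectionMeasurements(ground_truth_df)
simulation = CascadeCollectionMeasurements(simulation_df)

# Save all simulation measurement outputs as pickle files and generate plots.
meas_dictionary = run_all_measurements(simulation, cascade_measurement_params,
                                       output_dir='measurement_output/',
                                       plot_flag=True, show=False, plot_dir='plot_output/')

# Compare the simulation against the ground truth using the metrics assigned to each cascade measurement.
metrics = run_all_metrics(ground_truth, simulation, cascade_measurement_params)
```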
210 | 211 | -------------------------------------------------------------------------------- /conda_requirements.txt: -------------------------------------------------------------------------------- 1 | # This file may be used to create an environment using: 2 | # $ conda create --name --file 3 | # platform: linux-64 4 | blas=1.0=mkl 5 | ca-certificates=2018.03.07=0 6 | cairo=1.14.12=h276e583_5 7 | certifi=2018.8.24=py27_1 8 | fontconfig=2.13.1=h65d0f4c_0 9 | freetype=2.9.1=h6debe1e_4 10 | gettext=0.19.8.1=h5e8e0c9_1 11 | glib=2.56.2=h464dc38_0 12 | gmp=6.1.2=hfc679d8_0 13 | icu=58.2=hfc679d8_0 14 | igraph=0.7.1=hcc8e21d_5 15 | intel-openmp=2019.0=118 16 | jpype1=0.6.3=py27_0 17 | libedit=3.1.20170329=h6b74fdf_2 18 | libffi=3.2.1=hd88cf55_4 19 | libgcc-ng=8.2.0=hdf63c60_1 20 | libgfortran-ng=7.2.0=hdf63c60_3 21 | libiconv=1.15=h470a237_3 22 | libpng=1.6.35=ha92aebf_2 23 | libstdcxx-ng=8.2.0=hdf63c60_1 24 | libuuid=2.32.1=h470a237_2 25 | libxcb=1.13=h470a237_2 26 | libxml2=2.9.8=h422b904_5 27 | mkl_fft=1.0.6=py27_0 28 | mkl_random=1.0.1=py27_0 29 | ncurses=6.1=hf484d3e_0 30 | numpy=1.15.2=py27h1d66e8a_1 31 | numpy-base=1.15.2=py27h81de0dd_1 32 | openssl=1.0.2p=h14c3975_0 33 | pcre=8.41=hfc679d8_3 34 | pip=10.0.1=py27_0 35 | pixman=0.34.0=h470a237_3 36 | pthread-stubs=0.4=h470a237_1 37 | pycairo=1.17.1=py27h4d1f301_0 38 | python=2.7.15=h1571d57_0 39 | python-igraph=0.7.1.post6=py27h470a237_5 40 | readline=7.0=h7b6447c_5 41 | setuptools=40.4.3=py27_0 42 | sqlite=3.25.2=h7b6447c_0 43 | tk=8.6.8=hbc83047_0 44 | wheel=0.32.0=py27_0 45 | xorg-kbproto=1.0.7=h470a237_2 46 | xorg-libice=1.0.9=h470a237_4 47 | xorg-libsm=1.2.3=h8c8a85c_0 48 | xorg-libx11=1.6.6=h470a237_0 49 | xorg-libxau=1.0.8=h470a237_6 50 | xorg-libxdmcp=1.1.2=h470a237_7 51 | xorg-libxext=1.3.3=h470a237_4 52 | xorg-libxrender=0.9.10=h470a237_2 53 | xorg-renderproto=0.11.1=h470a237_2 54 | xorg-xextproto=7.3.0=h470a237_2 55 | xorg-xproto=7.0.31=h470a237_7 56 | xz=5.2.4=h470a237_1 57 | zlib=1.2.11=ha838bed_2 58 | -------------------------------------------------------------------------------- /data_extraction/README.md: -------------------------------------------------------------------------------- 1 | # Ground Truth Data Extraction 2 | 3 | extract\_ground\_truth\_cp2.py demonstrates the approach for converting the raw JSON format for Reddit, Twitter, and GitHub to the simulation output schema for each platform. The script is designed to query PNNL's mongo database, so you will have to modify the queries to interface with your individual data storage. 4 | 5 | The extraction process for each platform follows the follow steps: 6 | 7 | 1. Query a specific time period 8 | 2. Extract relevant fields from data 9 | 3. (For Twitter only) Assign roots and parents using the cascade reconstruction script 10 | 4. (Reddit and Twitter) Propagate any information IDs on parent posts/comments/tweets to all children of the post/comment/tweet 11 | 5. Duplicate events that are related to multiple information IDs. 
For example: 12 | * userA, tweetA, [CVE-2017-123, CVE-2014-456] will split into: 13 | * userA, tweetA, CVE-2017-123 14 | * userA, tweetA, CVE-2014-456 -------------------------------------------------------------------------------- /data_extraction/complicated_cascade_followers.json: -------------------------------------------------------------------------------- 1 | {"1":["3","4"], 2 | "2":["5","7"], 3 | "3":["8","10"], 4 | "4":["11","12","13"], 5 | "5":["14","16"], 6 | "6":["17","19"], 7 | "7":["22"], 8 | "8":["23","25"], 9 | "9":["26","28"], 10 | "10":["31"], 11 | "13":["32","33","34"]} 12 | -------------------------------------------------------------------------------- /data_extraction/complicated_cascade_partial.csv: -------------------------------------------------------------------------------- 1 | actionType,nodeID,nodeUserID,parentID,rootID,partialParentID 2 | tweet,1,1,1,1, 3 | reply,2,2,1,?,1 4 | quote,3,3,?,1, 5 | retweet,4,4,?,1 6 | quote,5,5,?,?,2 7 | reply,6,6,2,?,2 8 | retweet,7,7,?,?,2 9 | quote,8,8,?,1, 10 | reply,9,9,3,?,3 11 | retweet,10,10,?,1,3 12 | retweet,11,11,?,1, 13 | retweet,12,12,?,1, 14 | retweet,13,13,?,1, 15 | quote,14,14,?,?,2 16 | reply,15,15,5,?,5 17 | retweet,16,16,?,?,5 18 | quote,17,17,?,?,6 19 | reply,18,18,6,?,6 20 | retweet,19,19,?,?,6 21 | retweet,22,22,?,?,2 22 | quote,23,23,?,1, 23 | reply,24,24,8,?,8 24 | retweet,25,25,?,1,8 25 | quote,26,26,?,?,9 26 | reply,27,27,9,?,9 27 | retweet,28,28,?,?,9 28 | retweet,31,31,?,1,3 29 | retweet,32,32,?,1, 30 | retweet,33,33,?,1, 31 | retweet,34,34,?,1, 32 | -------------------------------------------------------------------------------- /data_extraction/keyword_map.json: -------------------------------------------------------------------------------- 1 | {"electroneum": ["#Electroneum", "Electroneum", "#ETN", "ETN", "@electroneum"], 2 | "tether": ["Tether", "#Tether", "#USDT", "USDT", "@Tether_to"], 3 | "genesis vision": ["Genesis vision", "#GVT", "GVT", "#GenesisVision", "@genesis_vision"], 4 | "ubiq": ["UBIQ", "#Ubiq", "#UBQ", "UBQ"], 5 | "vcash": ["VCash", "#XCV", "#VCash", "@Vcashinfo"], 6 | "chill_coin": ["Chill Coin", "#chillcoin", "chillcoin", "@chillcoin"], 7 | "magi_coin": ["Magi Coin", "#magicoin", "#XMG", "XMG"], 8 | "indorse": ["Indorse", "#indorse", "#IND","IND"], 9 | "bitcoin_diamond": ["Bitcoin Diamond", "#BITcoindiamond", "#BCD", "@BitcoinDiamond_","BCD"], 10 | "chaincoin": ["#chaincoin", "chaincoin", "#chc", "@chaincoin","CHC"], 11 | "ecoin": ["E-coin", "#ecoin","ecoin"], 12 | "paycoin": ["paycoin", "#paycoin", "#XPY","XPY"], 13 | "quantum_resistant_ledger": ["Quantum Resistant Ledger", "#QuantumResistantLedger", "#QRL", "@QRLedger","QRL"], 14 | "omni": ["Omni", "#Omni"], 15 | "bean_cash": ["Bean Cash", "#bitb", "#beancash", "@BeanCash_BEAN","bitb"], 16 | "blockmason_credit_protocol": ["Blockmason credit protocol", "#Blockmasoncreditprotocol", "#bcpt","bcpt"], 17 | "bytecent": ["Bytecent", "#Bytecent", "#byc", "@bytecentbyc","byc"], 18 | "agoras_tokens": ["Agoras tokens", "#AgorasTokens", "#agrs","agrs"], 19 | "bancor_network_token": ["Bancor Network Token", "#BancorNetworkToken", "#BNT", "@bancornetwork","BNT"], 20 | "granitecoin": ["granitecoin", "#granitecoin", "#GRN","GRN"], 21 | "pesetacoin": ["pesetacoin", "#pesetacoin", "@PesetacoinOfic"], 22 | "agrello": ["agrello", "#agrello", "#DLT", "@AgrelloOfficial","DLT"], 23 | "peercoin": ["Peercoin", "#Peercoin", "@PeercoinPPC"], 24 | "stealth": ["#Stealth", "#XST", "@stealthsend","XST"], 25 | "version": ["@VersionCrypto"]} 26 | 27 | 
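For reference, here is a minimal sketch of the keyword matching and information-ID splitting described in the data extraction README above (steps 2 and 5), using the keyword_map.json file shown above. The `text` and `informationID` column names and the naive substring matching are placeholders for illustration only; the actual extraction logic lives in extract_ground_truth_cp2.py and may differ.

```python
import json
import pandas as pd

# Load the keyword -> information ID mapping (keyword_map.json, shown above).
with open('keyword_map.json') as f:
    keyword_map = json.load(f)

def match_information_ids(text):
    """Return all information IDs whose keyword list has a case-insensitive hit in the text."""
    text = text.lower()
    return [info_id for info_id, keywords in keyword_map.items()
            if any(kw.lower() in text for kw in keywords)]

# Toy events; the 'text' and 'informationID' columns are placeholders for this illustration.
events = pd.DataFrame({
    'nodeID': ['tweetA', 'tweetB'],
    'nodeUserID': ['userA', 'userB'],
    'text': ['Tether and Peercoin are both up today', 'nothing relevant here'],
})

# Step 5 of the extraction process: emit one row per (event, information ID) pair.
rows = []
for _, event in events.iterrows():
    for info_id in match_information_ids(event['text']):
        row = event.to_dict()
        row['informationID'] = info_id
        rows.append(row)

expanded = pd.DataFrame(rows)
print(expanded[['nodeUserID', 'nodeID', 'informationID']])
```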
-------------------------------------------------------------------------------- /data_extraction/twitter_cascade_reconstruction.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import glob 3 | from collections import defaultdict 4 | import os 5 | import pprint 6 | import json 7 | import numpy as np 8 | 9 | def load_data(json_file, full_submission=True): 10 | """ 11 | Takes in the location of a json file and loads it as a pandas dataframe. 12 | Does some preprocessing to change text from unicode to ascii. 13 | """ 14 | 15 | if full_submission: 16 | with open(json_file) as f: 17 | dataset = json.loads(f.read()) 18 | 19 | dataset = dataset['data'] 20 | dataset = pd.DataFrame(dataset) 21 | else: 22 | dataset = pd.read_json(json_file) 23 | 24 | dataset.sort_index(axis=1, inplace=True) 25 | dataset = dataset.replace('', np.nan) 26 | 27 | # This converts the column names to ascii 28 | mapping = {name:str(name) for name in dataset.columns.tolist()} 29 | dataset = dataset.rename(index=str, columns=mapping) 30 | 31 | # This converts the row names to ascii 32 | dataset = dataset.reset_index(drop=True) 33 | 34 | # This converts the cell values to ascii 35 | json_df = dataset.applymap(str) 36 | 37 | return dataset 38 | 39 | 40 | class ParentIDApproximation: 41 | """ 42 | class to obtain parent tweet id for retweets 43 | """ 44 | 45 | def __init__(self, followers, cascade_collection_df, nodeID_col_name="nodeID", userID_col_name='nodeUserID', 46 | nodeTime_col_name='nodeTime', rootID_col_name='rootID', 47 | root_userID_col_name='rootUserID', 48 | root_nodeTime_col_name='rootTime'): 49 | """ 50 | :param followers: dictionary with key: userID, value: [list of followers of userID] 51 | :param cascade_collection_df: dataframe with nodeID, userID, nodeTime, rootID, root_userID, root_nodeTime as columns 52 | default values for column names correspond to those in the Twitter data schema 53 | (https://wiki.socialsim.info/display/SOC/Twitter+Data+Schema) 54 | """ 55 | self.followers = followers 56 | self.cascade_collection_df = cascade_collection_df.copy() 57 | self.nodeID_col_name = nodeID_col_name 58 | self.userID_col_name = userID_col_name 59 | self.nodeTime_col_name = nodeTime_col_name 60 | self.rootID_col_name = rootID_col_name 61 | self.root_userID_col_name = root_userID_col_name 62 | self.root_nodeTime_col_name = root_nodeTime_col_name 63 | 64 | def get_all_tweets_rtd_later_by_followers(self, tweet_id, cascade_df): 65 | 66 | tweet_details = cascade_df.loc[tweet_id] 67 | 68 | # add self to followers because users will retweet themselves 69 | output = cascade_df[ 70 | (cascade_df[self.userID_col_name]. 71 | isin(self.followers[tweet_details[self.userID_col_name]].union( 72 | {tweet_details[self.userID_col_name]}))) & # in followers 73 | (cascade_df[self.nodeTime_col_name] > tweet_details[self.nodeTime_col_name]) 74 | ]. 
\ 75 | index.values.tolist() 76 | 77 | return output 78 | 79 | def update_parentid(self, cascade_df_main, root_id): 80 | 81 | root_userID = cascade_df_main.loc[cascade_df_main.index.max()][self.root_userID_col_name] 82 | root_nodeTime = cascade_df_main.loc[cascade_df_main.index.max()][self.root_nodeTime_col_name] 83 | 84 | cascade_df = cascade_df_main.sort_values(self.nodeTime_col_name).drop( 85 | [self.root_userID_col_name, self.root_nodeTime_col_name], axis=1).copy() 86 | cascade_df["parentID"] = 0 87 | 88 | # root tweet also added to the cascade since we need the time when the root tweet was tweeted 89 | if root_id not in cascade_df[self.nodeID_col_name].values: 90 | cascade_df.loc[cascade_df.index.max() + 1] = { 91 | self.nodeID_col_name: root_id, 92 | self.userID_col_name: root_userID, 93 | self.nodeTime_col_name: root_nodeTime, 94 | self.rootID_col_name: root_id, 95 | "parentID": None, 96 | "actionType": "NA" 97 | } 98 | cascade_df = cascade_df.set_index(self.nodeID_col_name) 99 | seed_tweets = [root_id] 100 | while seed_tweets: 101 | new_seed_tweets = [] 102 | for seed_tweet_id in seed_tweets: 103 | tweets_to_be_updated = self.get_all_tweets_rtd_later_by_followers(seed_tweet_id, 104 | cascade_df) # assume a user as their follower since a user can retweet themselves 105 | cascade_df.loc[tweets_to_be_updated, "parentID"] = seed_tweet_id 106 | new_seed_tweets.extend(tweets_to_be_updated) 107 | 108 | seed_tweets = cascade_df[ 109 | cascade_df.index.isin(new_seed_tweets)].index.tolist() # keeping the order a.t. tweeted timestamp 110 | 111 | cascade_df = cascade_df[cascade_df['actionType'] != 'NA'] 112 | cascade_df.loc[cascade_df['parentID'] == 0,'parentID'] = cascade_df.loc[cascade_df['parentID'] == 0,'partialParentID'] 113 | #cascade_df.dropna(subset=["parentID"]) 114 | #return cascade_df[cascade_df["parentID"] != 0].reset_index() 115 | 116 | return cascade_df.reset_index() 117 | 118 | def get_approximate_parentids(self, mapping_only=True, csv=False): 119 | """ 120 | :param mapping_only: remove other columns except nodeID and parentID 121 | :param csv: write the parentID mapping to a csv file 122 | """ 123 | # parentID is None for root tweets 124 | parentid_map_dfs = [] 125 | for tweet_id, cascade_df in self.cascade_collection_df.groupby(self.rootID_col_name): 126 | if len(cascade_df[cascade_df['actionType'] != 'reply']) > 0: 127 | updated_cascade_df = self.update_parentid(cascade_df[cascade_df['actionType'] != 'reply'], tweet_id) 128 | parentid_map_dfs.append(updated_cascade_df) 129 | parentid_map_all_cascades_df = pd.concat(parentid_map_dfs).reset_index(drop=True) 130 | parentid_map_all_cascades_df.dropna(inplace=True) 131 | if mapping_only: 132 | parentid_map_all_cascades_df = parentid_map_all_cascades_df[[self.nodeID_col_name, "parentID"]] 133 | if csv: 134 | parentid_map_all_cascades_df.to_csv("retweet_cascades_with_parentID.csv", index=False) 135 | 136 | return parentid_map_all_cascades_df 137 | 138 | def get_reply_cascade_root_tweet(df, parent_node_col="parentID", node_col="nodeID", root_node_col="rootID", timestamp_col="nodeTime", json=False): 139 | """ 140 | :param df: dataframe containing a set of reply cascades 141 | :param json: return in json format or pandas dataframe 142 | :return: df with rootID column added, representing the cascade root node 143 | """ 144 | df = df.sort_values(timestamp_col) 145 | rootid_mapping = pd.Series(df[parent_node_col].values, index=df[node_col]).to_dict() 146 | 147 | def update_reply_cascade(reply_cascade): 148 | for tweet_id, 
reply_to_tweet_id in reply_cascade.items(): 149 | if reply_to_tweet_id in reply_cascade: 150 | reply_cascade[tweet_id] = reply_cascade[reply_to_tweet_id] 151 | return reply_cascade 152 | 153 | prev_rootid_mapping = {} 154 | while rootid_mapping != prev_rootid_mapping: 155 | prev_rootid_mapping = rootid_mapping.copy() 156 | rootid_mapping = update_reply_cascade(rootid_mapping) 157 | 158 | df["rootID_new"] = df[node_col].map(rootid_mapping) 159 | 160 | df.loc[df['rootID'] == '?','rootID'] = df.loc[df['rootID'] == '?','rootID_new'] 161 | df = df.drop('rootID_new',axis=1) 162 | if json: 163 | return df.to_json(orient='records') 164 | else: 165 | return df 166 | 167 | def full_reconstruction(data,followers=defaultdict(lambda: set([]))): 168 | 169 | #store replies for later 170 | replies = data[data['actionType'] == 'reply'] 171 | 172 | #get the user who posted the partial parent tweet for each retweet 173 | parent_users = data[['nodeID','nodeUserID','nodeTime']] 174 | parent_users.columns = ['partialParentID','rootUserID','rootTime'] 175 | data = data.merge(parent_users,on='partialParentID',how='left') 176 | 177 | #store original tweets for later 178 | original_tweets = data[data['actionType'] == 'tweet'] 179 | 180 | cols = ['nodeID','nodeUserID','nodeTime','partialParentID','rootUserID','rootTime','actionType'] 181 | 182 | #get parent IDs for retweets and quotes 183 | pia = ParentIDApproximation(followers, data[cols],rootID_col_name='partialParentID') 184 | parent_ids = pia.get_approximate_parentids() 185 | 186 | data['parentID'] = data['nodeID'].map(dict(zip(parent_ids.nodeID,parent_ids.parentID))) 187 | data = data[~data['actionType'].isin(['reply','tweet'])] 188 | 189 | #rejoin with replies and original tweets 190 | data = pd.concat([data,replies,original_tweets],axis=0).sort_values('nodeTime') 191 | data = data.drop(['rootUserID','rootTime'],axis=1) 192 | 193 | #follow cascade chain to get root node for reply tweets 194 | data = get_reply_cascade_root_tweet(data) 195 | 196 | return(data) 197 | 198 | 199 | if __name__ == '__main__': 200 | 201 | with open('complicated_cascade_followers.json','rb') as f: 202 | followers = json.load(f) 203 | for k in followers: 204 | followers[k] = set(followers[k]) 205 | 206 | followers = defaultdict(lambda: set([]),followers) 207 | 208 | cascade_collection_df = pd.read_csv('complicated_cascade_partial.csv') 209 | 210 | cascade_collection_df['partialParentID'] = cascade_collection_df['partialParentID'].fillna(1) 211 | cascade_collection_df['nodeTime'] = pd.date_range(start='1/1/2018',periods=len(cascade_collection_df)) 212 | 213 | cascade_collection_df['partialParentID'] = cascade_collection_df['partialParentID'].astype(int) 214 | cascade_collection_df[['nodeID','parentID','rootID','partialParentID','nodeUserID']] = cascade_collection_df[['nodeID','parentID','rootID','partialParentID','nodeUserID']].astype(str) 215 | 216 | results = full_reconstruction(cascade_collection_df,followers) 217 | 218 | print(results) 219 | 220 | 221 | 222 | 223 | -------------------------------------------------------------------------------- /december-measurements/BaselineMeasurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | from datetime import datetime 5 | from multiprocessing import Pool 6 | from functools import partial 7 | from pathos import pools as pp 8 | 9 | import pickle as pkl 10 | 11 | from UserCentricMeasurements import * 12 | from ContentCentricMeasurements import * 13 | 
from CommunityCentricMeasurements import * 14 | 15 | from TEMeasurements import * 16 | from collections import defaultdict 17 | 18 | import jpype 19 | import json 20 | import os 21 | 22 | basedir = os.path.dirname(__file__) 23 | 24 | class BaselineMeasurements(UserCentricMeasurements, ContentCentricMeasurements, TEMeasurements, CommunityCentricMeasurements): 25 | def __init__(self, 26 | dfLoc, 27 | content_node_ids=[], 28 | user_node_ids=[], 29 | metaContentData=False, 30 | metaUserData=False, 31 | contentActorsFile=os.path.join(basedir, './baseline_challenge_data/filtUsers-baseline.pkl'), 32 | contentFile=os.path.join(basedir, './baseline_challenge_data/filtRepos-baseline.pkl'), 33 | topNodes=[], 34 | topEdges=[], 35 | previousActionsFile='', 36 | community_dictionary='', 37 | # community_dictionary=os.path.join(basedir, './baseline_challenge_data/baseline_challenge_community_dict.pkl'), 38 | te_config=os.path.join(basedir, './baseline_challenge_data/te_params_baseline.json'), 39 | platform='github', 40 | use_java=True): 41 | super(BaselineMeasurements, self).__init__() 42 | 43 | self.platform = platform 44 | 45 | try: 46 | # check if input is a data frame 47 | dfLoc.columns 48 | df = dfLoc 49 | except: 50 | # if not it should be a csv file path 51 | df = pd.read_csv(dfLoc) 52 | 53 | self.contribution_events = ['PullRequestEvent', 54 | 'PushEvent', 55 | 'IssuesEvent', 56 | 'IssueCommentEvent', 57 | 'PullRequestReviewCommentEvent', 58 | 'CommitCommentEvent', 59 | 'CreateEvent', 60 | 'post', 61 | 'tweet'] 62 | 63 | self.popularity_events = ['WatchEvent', 64 | 'ForkEvent', 65 | 'comment', 66 | 'post', 67 | 'retweet', 68 | 'quote', 69 | 'reply'] 70 | 71 | print('preprocessing...') 72 | 73 | self.main_df = self.preprocess(df) 74 | 75 | print('splitting optional columns...') 76 | 77 | # store action and merged columns in a seperate data frame that is not used for most measurements 78 | if platform == 'github' and len(self.main_df.columns) == 6 and 'action' in self.main_df.columns: 79 | self.main_df_opt = self.main_df.copy()[['action', 'merged']] 80 | self.main_df = self.main_df.drop(['action', 'merged'], axis=1) 81 | else: 82 | self.main_df_opt = None 83 | 84 | # For content centric 85 | print('getting selected content IDs...') 86 | 87 | if content_node_ids != ['all']: 88 | if self.platform == 'reddit': 89 | self.selectedContent = self.main_df[self.main_df.root.isin(content_node_ids)] 90 | elif self.platform == 'twitter': 91 | self.selectedContent = self.main_df[self.main_df.root.isin(content_node_ids)] 92 | else: 93 | self.selectedContent = self.main_df[self.main_df.content.isin(content_node_ids)] 94 | else: 95 | self.selectedContent = self.main_df 96 | 97 | # For userCentric 98 | self.selectedUsers = self.main_df[self.main_df.user.isin(user_node_ids)] 99 | 100 | print('processing repo metatdata...') 101 | 102 | # read in external metadata files 103 | # repoMetaData format - full_name_h,created_at,owner.login_h,language 104 | # userMetaData format - login_h,created_at,location,company 105 | 106 | if metaContentData != False: 107 | self.useContentMetaData = True 108 | meta_content_data = pd.read_csv(metaContentData) 109 | self.contentMetaData = self.preprocessContentMeta(meta_content_data) 110 | else: 111 | self.useContentMetaData = False 112 | print('processing user metatdata...') 113 | if metaUserData != False: 114 | self.useUserMetaData = True 115 | self.userMetaData = self.preprocessUserMeta(pd.read_csv(metaUserData)) 116 | else: 117 | self.useUserMetaData = False 118 | 119 | # For 
Community 120 | self.community_dict_file = community_dictionary 121 | print('getting communities...') 122 | if self.platform == 'github': 123 | self.communityDF = self.getCommmunityDF(community_col='community') 124 | elif self.platform == 'reddit': 125 | self.communityDF = self.getCommmunityDF(community_col='subreddit') 126 | else: 127 | self.communityDF = self.getCommmunityDF(community_col='') 128 | 129 | # read in previous events count external file (used only for one measurement) 130 | try: 131 | print('reading previous counts...') 132 | self.previous_event_counts = pd.read_csv(previousActionsFile) 133 | except: 134 | self.previous_event_counts = None 135 | 136 | # For TE 137 | if use_java: 138 | print('starting jvm...') 139 | if not jpype.isJVMStarted(): 140 | jpype.startJVM(jpype.getDefaultJVMPath(), 141 | '-ea', 142 | '-Djava.class.path=infodynamics.jar') 143 | 144 | # read pkl files which define nodes of interest for TE measurements 145 | self.repo_actors = self.readPickleFile(contentActorsFile) 146 | self.repo_groups = self.readPickleFile(contentFile) 147 | 148 | self.top_users = topNodes 149 | self.top_edges = topEdges 150 | 151 | # read pkl files which define nodes of interest for TE measurements 152 | self.content_actors = self.readPickleFile(contentActorsFile) 153 | self.content_groups = self.readPickleFile(contentFile) 154 | 155 | # set TE parameters 156 | with open(te_config, 'rb') as f: 157 | te_params = json.load(f) 158 | 159 | self.startTime = pd.Timestamp(te_params['startTime']) 160 | self.binSize = te_params['binSize'] 161 | self.teThresh = te_params['teThresh'] 162 | self.delayUnits = np.array(te_params['delayUnits']) 163 | self.starEvent = te_params['starEvent'] 164 | self.otherEvents = te_params['otherEvents'] 165 | self.kE = te_params['kE'] 166 | self.kN = te_params['kN'] 167 | self.nReps = te_params['nReps'] 168 | self.bGetTS = te_params['bGetTS'] 169 | 170 | def preprocess(self, df): 171 | 172 | """ 173 | Edit columns, convert date, sort by date 174 | """ 175 | 176 | if self.platform=='reddit': 177 | mapping = {'actionType' : 'event', 178 | 'communityID': 'subreddit', 179 | 'keywords' : 'keywords', 180 | 'nodeID' : 'content', 181 | 'nodeTime' : 'time', 182 | 'nodeUserID' : 'user', 183 | 'parentID' : 'parent', 184 | 'rootID' : 'root'} 185 | elif self.platform=='twitter': 186 | mapping = {'actionType' : 'event', 187 | 'nodeID' : 'content', 188 | 'nodeTime' : 'time', 189 | 'nodeUserID' : 'user', 190 | 'parentID' : 'parent', 191 | 'rootID' : 'root'} 192 | elif self.platform=='github': 193 | mapping = {'nodeID' : 'content', 194 | 'nodeUserID' : 'user', 195 | 'actionType' : 'event', 196 | 'nodeTime' : 'time', 197 | 'actionSubType': 'action', 198 | 'status':'merged'} 199 | else: 200 | print('Invalid platform.') 201 | 202 | df = df.rename(index=str, columns=mapping) 203 | 204 | df = df[df.event.isin(self.popularity_events + self.contribution_events)] 205 | 206 | try: 207 | df['time'] = pd.to_datetime(df['time'],unit='s') 208 | except: 209 | try: 210 | df['time'] = pd.to_datetime(df['time'],unit='ms') 211 | except: 212 | df['time'] = pd.to_datetime(df['time']) 213 | 214 | 215 | df = df.sort_values(by='time') 216 | df = df.assign(time=df.time.dt.floor('h')) 217 | return df 218 | 219 | def preprocessContentMeta(self, df): 220 | try: 221 | df.columns = ['content', 'created_at', 'owner_id', 'language'] 222 | except: 223 | df.columns = ['created_at', 'owner_id', 'content'] 224 | df['created_at'] = pd.to_datetime(df['created_at']) 225 | df = 
df[df.content.isin(self.main_df.content.values)] 226 | return df 227 | 228 | def preprocessUserMeta(self, df): 229 | try: 230 | df.columns = ['user', 'created_at', 'location', 'company'] 231 | except: 232 | df.columns = ['user', 'created_at', 'city', 'country', 'company'] 233 | df['created_at'] = pd.to_datetime(df['created_at']) 234 | df = df[df.user.isin(self.main_df.user.values)] 235 | return df 236 | 237 | def readPickleFile(self, ipFile): 238 | 239 | with open(ipFile, 'rb') as handle: 240 | obj = pkl.load(handle) 241 | 242 | return obj 243 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/a.txt: -------------------------------------------------------------------------------- 1 | b 2 | c 3 | d 4 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/b.txt: -------------------------------------------------------------------------------- 1 | f 2 | g 3 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/d.txt: -------------------------------------------------------------------------------- 1 | e 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/h.txt: -------------------------------------------------------------------------------- 1 | i 2 | j 3 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/i.txt: -------------------------------------------------------------------------------- 1 | m 2 | n 3 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/j.txt: -------------------------------------------------------------------------------- 1 | k 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/k.txt: -------------------------------------------------------------------------------- 1 | l 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/m.txt: -------------------------------------------------------------------------------- 1 | o 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/twitter_cascade_reconstruction.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import glob 3 | from collections import defaultdict 4 | import os 5 | import pprint 6 | import json 7 | import numpy as np 8 | 9 | def load_data(json_file, full_submission=True): 10 | """ 11 | Takes in the location of a json file and loads it as a pandas dataframe. 12 | Does some preprocessing to change text from unicode to ascii. 
13 | """ 14 | 15 | if full_submission: 16 | with open(json_file) as f: 17 | dataset = json.loads(f.read()) 18 | 19 | dataset = dataset['data'] 20 | dataset = pd.DataFrame(dataset) 21 | else: 22 | dataset = pd.read_json(json_file) 23 | 24 | dataset.sort_index(axis=1, inplace=True) 25 | dataset = dataset.replace('', np.nan) 26 | 27 | # This converts the column names to ascii 28 | mapping = {name:str(name) for name in dataset.columns.tolist()} 29 | dataset = dataset.rename(index=str, columns=mapping) 30 | 31 | # This converts the row names to ascii 32 | dataset = dataset.reset_index(drop=True) 33 | 34 | # This converts the cell values to ascii 35 | json_df = dataset.applymap(str) 36 | 37 | return dataset 38 | 39 | 40 | class ParentIDApproximation: 41 | """ 42 | class to obtain parent tweet id for retweets 43 | """ 44 | 45 | def __init__(self, followers, cascade_collection_df, nodeID_col_name="nodeID", userID_col_name='nodeUserID', 46 | nodeTime_col_name='nodeTime', rootID_col_name='rootID', 47 | root_userID_col_name='rootUserID', 48 | root_nodeTime_col_name='rootTime'): 49 | """ 50 | :param followers: dictionary with key: userID, value: [list of followers of userID] 51 | :param cascade_collection_df: dataframe with nodeID, userID, nodeTime, rootID, root_userID, root_nodeTime as columns 52 | default values for column names correspond to those in the Twitter data schema 53 | (https://wiki.socialsim.info/display/SOC/Twitter+Data+Schema) 54 | """ 55 | self.followers = followers 56 | self.cascade_collection_df = cascade_collection_df 57 | self.nodeID_col_name = nodeID_col_name 58 | self.userID_col_name = userID_col_name 59 | self.nodeTime_col_name = nodeTime_col_name 60 | self.rootID_col_name = rootID_col_name 61 | self.root_userID_col_name = root_userID_col_name 62 | self.root_nodeTime_col_name = root_nodeTime_col_name 63 | 64 | def get_all_tweets_rtd_later_by_followers(self, tweet_id, cascade_df): 65 | tweet_details = cascade_df.loc[tweet_id] 66 | 67 | # add self to followers because users will retweet themselves 68 | return cascade_df[ 69 | (cascade_df[self.userID_col_name]. 70 | isin(self.followers[tweet_details[self.userID_col_name]].union( 71 | {tweet_details[self.userID_col_name]}))) & # in followers 72 | (cascade_df[self.nodeTime_col_name] > tweet_details[self.nodeTime_col_name]) 73 | ]. 
\ 74 | index.values.tolist() 75 | 76 | def update_parentid(self, cascade_df_main, root_id): 77 | root_userID = cascade_df_main.loc[cascade_df_main.index.max()][self.root_userID_col_name] 78 | root_nodeTime = cascade_df_main.loc[cascade_df_main.index.max()][self.root_nodeTime_col_name] 79 | cascade_df = cascade_df_main.sort_values(self.nodeTime_col_name).drop( 80 | [self.root_userID_col_name, self.root_nodeTime_col_name], axis=1).copy() 81 | cascade_df["parentID"] = 0 82 | # root tweet also added to the cascade since we need the time when the root tweet was tweeted 83 | cascade_df.loc[cascade_df.index.max() + 1] = { 84 | self.nodeID_col_name: root_id, 85 | self.userID_col_name: root_userID, 86 | self.nodeTime_col_name: root_nodeTime, 87 | self.rootID_col_name: root_id, 88 | "parentID": None 89 | } 90 | cascade_df = cascade_df.set_index(self.nodeID_col_name) 91 | seed_tweets = [root_id] 92 | while seed_tweets: 93 | new_seed_tweets = [] 94 | for seed_tweet_id in seed_tweets: 95 | tweets_to_be_updated = self.get_all_tweets_rtd_later_by_followers(seed_tweet_id, 96 | cascade_df) # assume a user as their follower since a user can retweet themselves 97 | cascade_df.loc[tweets_to_be_updated, "parentID"] = seed_tweet_id 98 | new_seed_tweets.extend(tweets_to_be_updated) 99 | 100 | seed_tweets = cascade_df[ 101 | cascade_df.index.isin(new_seed_tweets)].index.tolist() # keeping the order a.t. tweeted timestamp 102 | cascade_df.dropna(subset=["parentID"]) 103 | return cascade_df[cascade_df["parentID"] != 0].reset_index() 104 | 105 | def get_approximate_parentids(self, mapping_only=True, csv=False): 106 | """ 107 | :param mapping_only: remove other columns except nodeID and parentID 108 | :param csv: write the parentID mapping to a csv file 109 | """ 110 | # parentID is None for root tweets 111 | parentid_map_dfs = [] 112 | for tweet_id, cascade_df in self.cascade_collection_df.groupby(self.rootID_col_name): 113 | updated_cascade_df = self.update_parentid(cascade_df, tweet_id) 114 | parentid_map_dfs.append(updated_cascade_df) 115 | parentid_map_all_cascades_df = pd.concat(parentid_map_dfs).reset_index(drop=True) 116 | parentid_map_all_cascades_df.dropna(inplace=True) 117 | if mapping_only: 118 | parentid_map_all_cascades_df = parentid_map_all_cascades_df[[self.nodeID_col_name, "parentID"]] 119 | if csv: 120 | parentid_map_all_cascades_df.to_csv("retweet_cascades_with_parentID.csv", index=False) 121 | 122 | return parentid_map_all_cascades_df 123 | 124 | def get_reply_cascade_root_tweet(df, parent_node_col="parentID", node_col="nodeID", root_node_col="rootID", timestamp_col="nodeTime", json=False): 125 | """ 126 | :param df: dataframe containing a set of reply cascades 127 | :param json: return in json format or pandas dataframe 128 | :return: df with rootID column added, representing the cascade root node 129 | """ 130 | df = df.sort_values(timestamp_col) 131 | rootid_mapping = pd.Series(df[parent_node_col].values, index=df[node_col]).to_dict() 132 | 133 | def update_reply_cascade(reply_cascade): 134 | for tweet_id, reply_to_tweet_id in reply_cascade.items(): 135 | if reply_to_tweet_id in reply_cascade: 136 | reply_cascade[tweet_id] = reply_cascade[reply_to_tweet_id] 137 | return reply_cascade 138 | 139 | prev_rootid_mapping = {} 140 | while rootid_mapping != prev_rootid_mapping: 141 | prev_rootid_mapping = rootid_mapping.copy() 142 | rootid_mapping = update_reply_cascade(rootid_mapping) 143 | df["rootID_new"] = df[node_col].map(rootid_mapping) 144 | df.loc[df['actionType'] == 'reply','rootID'] = 
df.loc[df['actionType'] == 'reply','rootID_new'] 145 | df = df.drop('rootID_new',axis=1) 146 | if json: 147 | return df.to_json(orient='records') 148 | else: 149 | return df 150 | 151 | 152 | if __name__ == '__main__': 153 | 154 | #one text file per user listing that user's followers 155 | follower_data = glob.glob('example_follower_data/*.txt') 156 | 157 | #create followers dictionary with user IDs as keys and list of followers as values 158 | followers = defaultdict(lambda: set([])) 159 | for fn in follower_data: 160 | user = os.path.splitext(os.path.split(fn)[-1])[0] 161 | f = set(pd.read_csv(fn,header=None)[0].tolist()) 162 | print('User {}: {} followers'.format(user,len(f))) 163 | followers[user] = f 164 | 165 | #read in ground truth data file in JSON format 166 | #this data should be missing parentIDs for retweets/quotes and rootIDs for replies 167 | #(because they are not available from the Twitter JSON) 168 | cascade_collection_df = load_data('twitter_reconstruction_example_data.json',full_submission=False) 169 | 170 | #store replies for later 171 | replies = cascade_collection_df[cascade_collection_df['actionType'] == 'reply'] 172 | 173 | #limit data to events where the rootID is also contained in the data 174 | cascade_collection_df = cascade_collection_df[cascade_collection_df['rootID'].isin(cascade_collection_df['nodeID'])] 175 | 176 | #get the user who posted the root tweet for each retweet 177 | root_users = cascade_collection_df[['nodeID','nodeUserID','nodeTime']] 178 | root_users.columns = ['rootID','rootUserID','rootTime'] 179 | cascade_collection_df = cascade_collection_df.merge(root_users,on='rootID',how='left') 180 | 181 | #store original tweets for later 182 | original_tweets = cascade_collection_df[cascade_collection_df['actionType'] == 'tweet'] 183 | 184 | #subset on only retweets and quotes 185 | cascade_collection_df = cascade_collection_df[cascade_collection_df['actionType'].isin(['retweet','quote'])] 186 | cascade_collection_df_retweets = cascade_collection_df[['nodeID','nodeUserID','nodeTime','rootID','rootUserID','rootTime']] 187 | 188 | #get parent IDs for retweets and quotes 189 | pia = ParentIDApproximation(followers, cascade_collection_df_retweets) 190 | parent_ids = pia.get_approximate_parentids() 191 | 192 | cascade_collection_df['parentID'] = cascade_collection_df['nodeID'].map(dict(zip(parent_ids.nodeID,parent_ids.parentID))) 193 | 194 | #rejoin with replies and original tweets 195 | cascade_collection_df = pd.concat([cascade_collection_df,replies,original_tweets],axis=0).sort_values('nodeTime') 196 | cascade_collection_df = cascade_collection_df.drop(['rootUserID','rootTime'],axis=1) 197 | 198 | #follow cascade chain to get root node for reply tweets 199 | cascade_collection_df = get_reply_cascade_root_tweet(cascade_collection_df) 200 | 201 | print('Results:') 202 | print(cascade_collection_df) 203 | 204 | output = cascade_collection_df.to_dict(orient='records') 205 | 206 | with open('twitter_example_data_reconstructed.json','w') as f: 207 | json.dump(output, f) 208 | 209 | 210 | 211 | 212 | 213 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/twitter_example_data_reconstructed.json: -------------------------------------------------------------------------------- 1 | [{"rootID": "A", "nodeTime": "2017-08-15T00:00:00Z", "nodeUserID": "a", "nodeID": "A", "actionType": "tweet", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:01Z", "nodeUserID": "b", "nodeID": 
"B", "actionType": "retweet", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:02Z", "nodeUserID": "c", "nodeID": "C", "actionType": "retweet", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:03Z", "nodeUserID": "d", "nodeID": "D", "actionType": "reply", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:04Z", "nodeUserID": "e", "nodeID": "E", "actionType": "reply", "parentID": "D"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:05Z", "nodeUserID": "f", "nodeID": "F", "actionType": "retweet", "parentID": "B"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:06Z", "nodeUserID": "g", "nodeID": "G", "actionType": "reply", "parentID": "B"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:07Z", "nodeUserID": "h", "nodeID": "H", "actionType": "tweet", "parentID": "H"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:08Z", "nodeUserID": "i", "nodeID": "I", "actionType": "retweet", "parentID": "H"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:09Z", "nodeUserID": "j", "nodeID": "J", "actionType": "reply", "parentID": "H"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:12Z", "nodeUserID": "m", "nodeID": "M", "actionType": "retweet", "parentID": "I"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:13Z", "nodeUserID": "n", "nodeID": "N", "actionType": "retweet", "parentID": "I"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:14Z", "nodeUserID": "o", "nodeID": "O", "actionType": "retweet", "parentID": "M"}] -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/twitter_reconstruction_example_data.json: -------------------------------------------------------------------------------- 1 | [{"rootID": "A", "actionType": "tweet", "parentID": "A", "nodeTime": "2017-08-15T00:00:00Z", "nodeUserID": "a", "nodeID": "A"}, {"rootID": "A", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:01Z", "nodeUserID": "b", "nodeID": "B"}, {"rootID": "A", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:02Z", "nodeUserID": "c", "nodeID": "C"}, {"rootID": "?", "actionType": "reply", "parentID": "A", "nodeTime": "2017-08-15T00:00:03Z", "nodeUserID": "d", "nodeID": "D"}, {"rootID": "?", "actionType": "reply", "parentID": "D", "nodeTime": "2017-08-15T00:00:04Z", "nodeUserID": "e", "nodeID": "E"}, {"rootID": "A", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:05Z", "nodeUserID": "f", "nodeID": "F"}, {"rootID": "?", "actionType": "reply", "parentID": "B", "nodeTime": "2017-08-15T00:00:06Z", "nodeUserID": "g", "nodeID": "G"}, {"rootID": "H", "actionType": "tweet", "parentID": "H", "nodeTime": "2017-08-15T00:00:07Z", "nodeUserID": "h", "nodeID": "H"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:08Z", "nodeUserID": "i", "nodeID": "I"}, {"rootID": "?", "actionType": "reply", "parentID": "H", "nodeTime": "2017-08-15T00:00:09Z", "nodeUserID": "j", "nodeID": "J"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:12Z", "nodeUserID": "m", "nodeID": "M"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:13Z", "nodeUserID": "n", "nodeID": "N"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:14Z", "nodeUserID": "o", "nodeID": "O"}] -------------------------------------------------------------------------------- /december-measurements/cascade_validators.py: 
-------------------------------------------------------------------------------- 1 | from functools import wraps 2 | 3 | 4 | def check_root_only(default=None): 5 | """ 6 | check if it is a single node cascade 7 | """ 8 | def wrap(func): 9 | @wraps(func) 10 | def wrapped_f(self, *args, **kwargs): 11 | 12 | if len(self.main_df[self.main_df[self.node_col] != self.main_df[self.root_node_col]])==0: 13 | return default 14 | else: 15 | return func(self, *args, **kwargs) 16 | 17 | return wrapped_f 18 | 19 | return wrap 20 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | "user_activity_timeline": { 40 | "question": '19', 41 | "scale": "node", 42 | "node_type":"user", 43 | 'scenario1':False, 44 | 'scenario2':True, 45 | 'scenario3':False, 46 | "measurement": "getUserActivityTimeline", 47 | "measurement_args":{"eventTypes":twitter_events}, 48 | "metrics": {"rmse": Metrics.rmse, 49 | "nrmse": named_partial(Metrics.rmse,relative=True), 50 | "ks_test": Metrics.ks_test, 51 | "dtw": Metrics.dtw} 52 | 53 | }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":4000,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9987)} 92 | }, 93 | 94 | 
"user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | # "content_diffusion_delay": { 137 | # "question": 1, 138 | # "scale": "node", 139 | # "node_type":"content", 140 | # "scenario1":False, 141 | # "scenario2":True, 142 | # "scenario3":False, 143 | # "measurement": "getContentDiffusionDelay", 144 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | # "metrics": {"ks_test": Metrics.ks_test, 146 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | # }, 148 | 149 | # "content_growth": { 150 | # "question": 2, 151 | # "scale": "node", 152 | # "node_type":"content", 153 | # "scenario1":False, 154 | # "scenario2":True, 155 | # "scenario3":False, 156 | # "measurement": "getContentGrowth", 157 | # "measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 158 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | # "dtw": Metrics.dtw} 160 | # }, 161 | 162 | # "content_contributors": { 163 | # "question": 4, 164 | # "scale": "node", 165 | # "node_type":"content", 166 | # "scenario1":False, 167 | # "scenario2":True, 168 | # "scenario3":False, 169 | # "measurement": "getContributions", 170 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 171 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 172 | # "dtw": Metrics.dtw} 173 | # }, 174 | 175 | # "content_event_distribution_dayofweek": { 176 | # "question": 5, 177 | # "scale": "node", 178 | # "node_type":"content", 179 | # "scenario1":False, 180 | # "scenario2":True, 181 | # "scenario3":False, 182 | # "measurement": "getDistributionOfEvents", 183 | # "measurement_args":{"weekday":True,"content_field":"root"}, 184 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 185 | # }, 186 | 187 | "content_liveliness_distribution": { 188 | "question": 13, 189 | "scale": "population", 190 | "node_type":"content", 191 | "scenario1":True, 192 | "scenario2":True, 193 | "scenario3":True, 194 | "measurement": "getDistributionOfEventsByContent", 195 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 196 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 
197 | }, 198 | 199 | # "content_liveliness_topk": { 200 | # "question": 13, 201 | # "scale": "population", 202 | # "node_type":"content", 203 | # "scenario1":False, 204 | # "scenario2":True, 205 | # "scenario3":False, 206 | # "measurement": "getTopKContent", 207 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 208 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 209 | # }, 210 | 211 | "content_popularity_distribution": { 212 | "question": 13, 213 | "scale": "population", 214 | "node_type":"content", 215 | "scenario1":False, 216 | "scenario2":True, 217 | "scenario3":False, 218 | "measurement": "getDistributionOfEventsByContent", 219 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 220 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 221 | }, 222 | 223 | # "content_popularity_topk": { 224 | # "question": 13, 225 | # "scale": "population", 226 | # "node_type":"content", 227 | # "scenario1":True, 228 | # "scenario2":True, 229 | # "scenario3":True, 230 | # "measurement": "getTopKContent", 231 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 232 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 233 | # }, 234 | 235 | "content_activity_disparity_gini_retweet": { 236 | "question": 14, 237 | "scale": "population", 238 | "node_type":"content", 239 | "scenario1":True, 240 | "scenario2":True, 241 | "scenario3":True, 242 | "measurement": "getGiniCoef", 243 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 244 | "metrics": {"absolute_difference": Metrics.absolute_difference, 245 | "absolute_percentage_error":Metrics.absolute_percentage_error} 246 | }, 247 | 248 | "content_activity_disparity_palma_retweet": { 249 | "question": 14, 250 | "scale": "population", 251 | "node_type":"content", 252 | "scenario1":True, 253 | "scenario2":True, 254 | "scenario3":True, 255 | "measurement": "getPalmaCoef", 256 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 257 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 258 | "absolute_difference":Metrics.absolute_difference} 259 | }, 260 | "content_activity_disparity_gini_quote": { 261 | "question": 14, 262 | "scale": "population", 263 | "node_type":"content", 264 | "scenario1":True, 265 | "scenario2":True, 266 | "scenario3":True, 267 | "measurement": "getGiniCoef", 268 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 269 | "metrics": {"absolute_difference": Metrics.absolute_difference, 270 | "absolute_percentage_error":Metrics.absolute_percentage_error} 271 | }, 272 | 273 | "content_activity_disparity_palma_quote": { 274 | "question": 14, 275 | "scale": "population", 276 | "node_type":"content", 277 | "scenario1":True, 278 | "scenario2":True, 279 | "scenario3":True, 280 | "measurement": "getPalmaCoef", 281 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 282 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 283 | "absolute_difference":Metrics.absolute_difference} 284 | }, 285 | "content_activity_disparity_gini_reply": { 286 | "question": 14, 287 | "scale": "population", 288 | "node_type":"content", 289 | "scenario1":True, 290 | "scenario2":True, 291 | "scenario3":True, 292 | "measurement": "getGiniCoef", 293 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 294 | "metrics": {"absolute_difference": Metrics.absolute_difference, 295 | 
"absolute_percentage_error":Metrics.absolute_percentage_error} 296 | }, 297 | 298 | "content_activity_disparity_palma_reply": { 299 | "question": 14, 300 | "scale": "population", 301 | "node_type":"content", 302 | "scenario1":True, 303 | "scenario2":True, 304 | "scenario3":True, 305 | "measurement": "getPalmaCoef", 306 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 307 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 308 | "absolute_difference":Metrics.absolute_difference} 309 | } 310 | 311 | 312 | } 313 | 314 | 315 | twitter_measurement_params = {} 316 | twitter_measurement_params.update(user_measurement_params) 317 | twitter_measurement_params.update(content_measurement_params) 318 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_crypto_s1.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | "user_activity_timeline": { 40 | "question": '19', 41 | "scale": "node", 42 | "node_type":"user", 43 | 'scenario1':False, 44 | 'scenario2':True, 45 | 'scenario3':False, 46 | "measurement": "getUserActivityTimeline", 47 | "measurement_args":{"eventTypes":twitter_events}, 48 | "metrics": {"rmse": Metrics.rmse, 49 | "nrmse": named_partial(Metrics.rmse,relative=True), 50 | "ks_test": Metrics.ks_test, 51 | "dtw": Metrics.dtw} 52 | 53 | }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 
'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":4000,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9987)} 92 | }, 93 | 94 | "user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | # "content_diffusion_delay": { 137 | # "question": 1, 138 | # "scale": "node", 139 | # "node_type":"content", 140 | # "scenario1":False, 141 | # "scenario2":True, 142 | # "scenario3":False, 143 | # "measurement": "getContentDiffusionDelay", 144 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | # "metrics": {"ks_test": Metrics.ks_test, 146 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | # }, 148 | 149 | # "content_growth": { 150 | # "question": 2, 151 | # "scale": "node", 152 | # "node_type":"content", 153 | # "scenario1":False, 154 | # "scenario2":True, 155 | # "scenario3":False, 156 | # "measurement": "getContentGrowth", 157 | # "measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 158 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | # "dtw": Metrics.dtw} 160 | # }, 161 | 162 | # "content_contributors": { 163 | # "question": 4, 164 | # "scale": "node", 165 | # "node_type":"content", 166 | # "scenario1":False, 167 | # "scenario2":True, 168 | # "scenario3":False, 169 | # "measurement": "getContributions", 170 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 171 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 172 | # "dtw": Metrics.dtw} 173 | # }, 174 | 175 | # "content_event_distribution_dayofweek": { 176 | # "question": 5, 177 | # "scale": "node", 178 | # "node_type":"content", 179 | # "scenario1":False, 180 | # "scenario2":True, 181 | # "scenario3":False, 182 | # "measurement": "getDistributionOfEvents", 183 | # "measurement_args":{"weekday":True,"content_field":"root"}, 184 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 185 | # }, 186 | 187 | "content_liveliness_distribution": { 188 | "question": 13, 189 | "scale": "population", 190 | "node_type":"content", 191 | "scenario1":True, 192 | "scenario2":True, 193 | 
"scenario3":True, 194 | "measurement": "getDistributionOfEventsByContent", 195 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 196 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 197 | }, 198 | 199 | # "content_liveliness_topk": { 200 | # "question": 13, 201 | # "scale": "population", 202 | # "node_type":"content", 203 | # "scenario1":False, 204 | # "scenario2":True, 205 | # "scenario3":False, 206 | # "measurement": "getTopKContent", 207 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 208 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 209 | # }, 210 | 211 | "content_popularity_distribution": { 212 | "question": 13, 213 | "scale": "population", 214 | "node_type":"content", 215 | "scenario1":False, 216 | "scenario2":True, 217 | "scenario3":False, 218 | "measurement": "getDistributionOfEventsByContent", 219 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 220 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 221 | }, 222 | 223 | # "content_popularity_topk": { 224 | # "question": 13, 225 | # "scale": "population", 226 | # "node_type":"content", 227 | # "scenario1":True, 228 | # "scenario2":True, 229 | # "scenario3":True, 230 | # "measurement": "getTopKContent", 231 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 232 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 233 | # }, 234 | 235 | "content_activity_disparity_gini_retweet": { 236 | "question": 14, 237 | "scale": "population", 238 | "node_type":"content", 239 | "scenario1":True, 240 | "scenario2":True, 241 | "scenario3":True, 242 | "measurement": "getGiniCoef", 243 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 244 | "metrics": {"absolute_difference": Metrics.absolute_difference, 245 | "absolute_percentage_error":Metrics.absolute_percentage_error} 246 | }, 247 | 248 | "content_activity_disparity_palma_retweet": { 249 | "question": 14, 250 | "scale": "population", 251 | "node_type":"content", 252 | "scenario1":True, 253 | "scenario2":True, 254 | "scenario3":True, 255 | "measurement": "getPalmaCoef", 256 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 257 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 258 | "absolute_difference":Metrics.absolute_difference} 259 | }, 260 | "content_activity_disparity_gini_quote": { 261 | "question": 14, 262 | "scale": "population", 263 | "node_type":"content", 264 | "scenario1":True, 265 | "scenario2":True, 266 | "scenario3":True, 267 | "measurement": "getGiniCoef", 268 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 269 | "metrics": {"absolute_difference": Metrics.absolute_difference, 270 | "absolute_percentage_error":Metrics.absolute_percentage_error} 271 | }, 272 | 273 | "content_activity_disparity_palma_quote": { 274 | "question": 14, 275 | "scale": "population", 276 | "node_type":"content", 277 | "scenario1":True, 278 | "scenario2":True, 279 | "scenario3":True, 280 | "measurement": "getPalmaCoef", 281 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 282 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 283 | "absolute_difference":Metrics.absolute_difference} 284 | }, 285 | "content_activity_disparity_gini_reply": { 286 | "question": 14, 287 | "scale": "population", 288 | "node_type":"content", 289 | "scenario1":True, 290 | "scenario2":True, 291 | "scenario3":True, 
292 | "measurement": "getGiniCoef", 293 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 294 | "metrics": {"absolute_difference": Metrics.absolute_difference, 295 | "absolute_percentage_error":Metrics.absolute_percentage_error} 296 | }, 297 | 298 | "content_activity_disparity_palma_reply": { 299 | "question": 14, 300 | "scale": "population", 301 | "node_type":"content", 302 | "scenario1":True, 303 | "scenario2":True, 304 | "scenario3":True, 305 | "measurement": "getPalmaCoef", 306 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 307 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 308 | "absolute_difference":Metrics.absolute_difference} 309 | } 310 | 311 | 312 | } 313 | 314 | 315 | twitter_scenario1_measurement_params_crypto = {} 316 | twitter_scenario1_measurement_params_crypto.update(user_measurement_params) 317 | twitter_scenario1_measurement_params_crypto.update(content_measurement_params) 318 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_cve_s1.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | # "user_activity_timeline": { 40 | # "question": '19', 41 | # "scale": "node", 42 | # "node_type":"user", 43 | # 'scenario1':False, 44 | # 'scenario2':True, 45 | # 'scenario3':False, 46 | # "measurement": "getUserActivityTimeline", 47 | # "measurement_args":{"eventTypes":twitter_events}, 48 | # "metrics": {"rmse": Metrics.rmse, 49 | # "nrmse": named_partial(Metrics.rmse,relative=True), 50 | # "ks_test": Metrics.ks_test, 51 | # "dtw": Metrics.dtw} 52 | # }, 53 | 54 | "user_activity_distribution": { 55 | "question": '24a', 56 | "scale": "population", 57 | "node_type":"user", 58 | 'scenario1':True, 59 | 'scenario2':True, 60 | 'scenario3':True, 61 | "measurement": "getUserActivityDistribution", 62 | "measurement_args":{"eventTypes":twitter_events}, 63 | "metrics": {"rmse": Metrics.rmse, 64 | "nrmse": named_partial(Metrics.rmse,relative=True), 65 | "r2": Metrics.r2, 66 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 67 | }, 68 | 69 | "most_active_users": { 70 | "question": '24b', 71 | "scale": "population", 72 | "node_type":"user", 73 | 'scenario1':True, 74 | 'scenario2':True, 75 | 'scenario3':True, 76 | "measurement": "getMostActiveUsers", 77 | 
"measurement_args":{"k":30,"eventTypes":twitter_events}, 78 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.84)} 79 | }, 80 | 81 | "user_popularity": { 82 | "question": '25', 83 | "scale": "population", 84 | "node_type":"user", 85 | 'scenario1':True, 86 | 'scenario2':True, 87 | 'scenario3':True, 88 | "measurement": "getUserPopularity", 89 | "measurement_args":{"k":30,"eventTypes":twitter_events,"content_field":"root"}, 90 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.84)} 91 | }, 92 | 93 | "user_gini_coef": { 94 | "question": '26a', 95 | "scale": "population", 96 | "node_type":"user", 97 | 'scenario1':True, 98 | 'scenario2':True, 99 | 'scenario3':True, 100 | "measurement": "getGiniCoef", 101 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 102 | "metrics": {"absolute_difference": Metrics.absolute_difference, 103 | "absolute_percentage_error":Metrics.absolute_percentage_error} 104 | }, 105 | 106 | "user_palma_coef": { 107 | "question": '26b', 108 | "scale": "population", 109 | "node_type":"user", 110 | 'scenario1':True, 111 | 'scenario2':True, 112 | 'scenario3':True, 113 | "measurement": "getPalmaCoef", 114 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 115 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 116 | "absolute_difference":Metrics.absolute_difference} 117 | }, 118 | 119 | #"user_diffusion_delay": { 120 | # "question": '27', 121 | # "scale": "population", 122 | # "node_type":"user", 123 | # 'scenario1':True, 124 | # 'scenario2':True, 125 | # 'scenario3':True, 126 | # "measurement": "getUserDiffusionDelay", 127 | # "measurement_args":{"eventTypes":twitter_events}, 128 | # "metrics": {"ks_test": Metrics.ks_test} 129 | #} 130 | 131 | } 132 | 133 | content_measurement_params = { 134 | ##Content-centric measurements 135 | # "content_diffusion_delay": { 136 | # "question": 1, 137 | # "scale": "node", 138 | # "node_type":"content", 139 | # "scenario1":False, 140 | # "scenario2":True, 141 | # "scenario3":False, 142 | # "measurement": "getContentDiffusionDelay", 143 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 144 | # "metrics": {"ks_test": Metrics.ks_test, 145 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 146 | # }, 147 | 148 | # "content_growth": { 149 | # "question": 2, 150 | # "scale": "node", 151 | # "node_type":"content", 152 | # "scenario1":False, 153 | # "scenario2":True, 154 | # "scenario3":False, 155 | # "measurement": "getContentGrowth", 156 | # "measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 157 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 158 | # "dtw": Metrics.dtw} 159 | # }, 160 | 161 | # "content_contributors": { 162 | # "question": 4, 163 | # "scale": "node", 164 | # "node_type":"content", 165 | # "scenario1":False, 166 | # "scenario2":True, 167 | # "scenario3":False, 168 | # "measurement": "getContributions", 169 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 170 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 171 | # "dtw": Metrics.dtw} 172 | # }, 173 | 174 | # "content_event_distribution_dayofweek": { 175 | # "question": 5, 176 | # "scale": "node", 177 | # "node_type":"content", 178 | # "scenario1":False, 179 | # "scenario2":True, 180 | # "scenario3":False, 181 | # "measurement": "getDistributionOfEvents", 182 | # "measurement_args":{"weekday":True,"content_field":"root"}, 183 | # 
"metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 184 | # }, 185 | 186 | "content_liveliness_distribution": { 187 | "question": 13, 188 | "scale": "population", 189 | "node_type":"content", 190 | "scenario1":True, 191 | "scenario2":True, 192 | "scenario3":True, 193 | "measurement": "getDistributionOfEventsByContent", 194 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 195 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 196 | }, 197 | 198 | # "content_liveliness_topk": { 199 | # "question": 13, 200 | # "scale": "population", 201 | # "node_type":"content", 202 | # "scenario1":False, 203 | # "scenario2":True, 204 | # "scenario3":False, 205 | # "measurement": "getTopKContent", 206 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 207 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 208 | # }, 209 | 210 | "content_popularity_distribution": { 211 | "question": 13, 212 | "scale": "population", 213 | "node_type":"content", 214 | "scenario1":False, 215 | "scenario2":True, 216 | "scenario3":False, 217 | "measurement": "getDistributionOfEventsByContent", 218 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 219 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 220 | }, 221 | 222 | # "content_popularity_topk": { 223 | # "question": 13, 224 | # "scale": "population", 225 | # "node_type":"content", 226 | # "scenario1":True, 227 | # "scenario2":True, 228 | # "scenario3":True, 229 | # "measurement": "getTopKContent", 230 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 231 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 232 | # }, 233 | 234 | "content_activity_disparity_gini_retweet": { 235 | "question": 14, 236 | "scale": "population", 237 | "node_type":"content", 238 | "scenario1":True, 239 | "scenario2":True, 240 | "scenario3":True, 241 | "measurement": "getGiniCoef", 242 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 243 | "metrics": {"absolute_difference": Metrics.absolute_difference, 244 | "absolute_percentage_error":Metrics.absolute_percentage_error} 245 | }, 246 | 247 | "content_activity_disparity_palma_retweet": { 248 | "question": 14, 249 | "scale": "population", 250 | "node_type":"content", 251 | "scenario1":True, 252 | "scenario2":True, 253 | "scenario3":True, 254 | "measurement": "getPalmaCoef", 255 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 256 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 257 | "absolute_difference":Metrics.absolute_difference} 258 | }, 259 | # "content_activity_disparity_gini_quote": { 260 | # "question": 14, 261 | # "scale": "population", 262 | # "node_type":"content", 263 | # "scenario1":True, 264 | # "scenario2":True, 265 | # "scenario3":True, 266 | # "measurement": "getGiniCoef", 267 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 268 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 269 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 270 | # }, 271 | 272 | # "content_activity_disparity_palma_quote": { 273 | # "question": 14, 274 | # "scale": "population", 275 | # "node_type":"content", 276 | # "scenario1":True, 277 | # "scenario2":True, 278 | # "scenario3":True, 279 | # "measurement": "getPalmaCoef", 280 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 281 | # "metrics": 
{"absolute_percentage_error":Metrics.absolute_percentage_error, 282 | # "absolute_difference":Metrics.absolute_difference} 283 | # }, 284 | # "content_activity_disparity_gini_reply": { 285 | # "question": 14, 286 | # "scale": "population", 287 | # "node_type":"content", 288 | # "scenario1":True, 289 | # "scenario2":True, 290 | # "scenario3":True, 291 | # "measurement": "getGiniCoef", 292 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 293 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 294 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 295 | # }, 296 | 297 | # "content_activity_disparity_palma_reply": { 298 | # "question": 14, 299 | # "scale": "population", 300 | # "node_type":"content", 301 | # "scenario1":True, 302 | # "scenario2":True, 303 | # "scenario3":True, 304 | # "measurement": "getPalmaCoef", 305 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 306 | # "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 307 | # "absolute_difference":Metrics.absolute_difference} 308 | # } 309 | 310 | 311 | } 312 | 313 | 314 | twitter_scenario1_measurement_params_cve = {} 315 | twitter_scenario1_measurement_params_cve.update(user_measurement_params) 316 | twitter_scenario1_measurement_params_cve.update(content_measurement_params) 317 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_cve_s2.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | #"user_unique_content": { 24 | # 'question': '17', 25 | # "scale": "population", 26 | # "node_type":"user", 27 | # 'scenario1':True, 28 | # 'scenario2':True, 29 | # 'scenario3':True, 30 | # "measurement": "getUserUniqueContent", 31 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | # "metrics": { 33 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | # "rmse": Metrics.rmse, 35 | # "nrmse": named_partial(Metrics.rmse,relative=True), 36 | # "r2": Metrics.r2} 37 | #}, 38 | 39 | # "user_activity_timeline": { 40 | # "question": '19', 41 | # "scale": "node", 42 | # "node_type":"user", 43 | # 'scenario1':False, 44 | # 'scenario2':True, 45 | # 'scenario3':False, 46 | # "measurement": "getUserActivityTimeline", 47 | # "measurement_args":{"eventTypes":twitter_events}, 48 | # "metrics": {"rmse": Metrics.rmse, 49 | # "nrmse": named_partial(Metrics.rmse,relative=True), 50 | # "ks_test": Metrics.ks_test, 51 | # "dtw": Metrics.dtw} 52 | # 53 | # }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": 
named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"k":10,"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.58)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":10,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.58)} 92 | }, 93 | 94 | "user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | "content_diffusion_delay": { 137 | "question": 1, 138 | "scale": "node", 139 | "node_type":"content", 140 | "scenario1":False, 141 | "scenario2":True, 142 | "scenario3":False, 143 | "measurement": "getContentDiffusionDelay", 144 | "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | "metrics": {"ks_test": Metrics.ks_test, 146 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | }, 148 | 149 | "content_growth": { 150 | "question": 2, 151 | "scale": "node", 152 | "node_type":"content", 153 | "scenario1":False, 154 | "scenario2":True, 155 | "scenario3":False, 156 | "measurement": "getContentGrowth", 157 | "measurement_args":{"eventTypes":twitter_events,"time_bin":"h","content_field":"root"}, 158 | "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | "nrmse": named_partial(Metrics.rmse,relative=True), 160 | "dtw": Metrics.dtw} 161 | }, 162 | 163 | "content_contributors": { 164 | "question": 4, 165 | "scale": "node", 166 | "node_type":"content", 167 | "scenario1":False, 168 | "scenario2":True, 169 | "scenario3":False, 170 | "measurement": "getContributions", 171 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 172 | "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 173 | "nrmse": 
named_partial(Metrics.rmse,relative=True), 174 | "dtw": Metrics.dtw} 175 | }, 176 | 177 | # "content_event_distribution_dayofweek": { 178 | # "question": 5, 179 | # "scale": "node", 180 | # "node_type":"content", 181 | # "scenario1":False, 182 | # "scenario2":True, 183 | # "scenario3":False, 184 | # "measurement": "getDistributionOfEvents", 185 | # "measurement_args":{"weekday":True,"content_field":"root"}, 186 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 187 | # }, 188 | 189 | # "content_liveliness_distribution": { 190 | # "question": 13, 191 | # "scale": "population", 192 | # "node_type":"content", 193 | # "scenario1":True, 194 | # "scenario2":True, 195 | # "scenario3":True, 196 | # "measurement": "getDistributionOfEventsByContent", 197 | # "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 198 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False), 199 | # "rmse": Metrics.rmse, 200 | # "nrmse": named_partial(Metrics.rmse,relative=True), 201 | # "r2": Metrics.r2} 202 | # }, 203 | 204 | # "content_liveliness_topk": { 205 | # "question": 13, 206 | # "scale": "population", 207 | # "node_type":"content", 208 | # "scenario1":False, 209 | # "scenario2":True, 210 | # "scenario3":False, 211 | # "measurement": "getTopKContent", 212 | # "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 213 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 214 | # }, 215 | 216 | "content_popularity_distribution": { 217 | "question": 13, 218 | "scale": "population", 219 | "node_type":"content", 220 | "scenario1":False, 221 | "scenario2":True, 222 | "scenario3":False, 223 | "measurement": "getDistributionOfEventsByContent", 224 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 225 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False), 226 | "rmse": Metrics.rmse, 227 | "nrmse": named_partial(Metrics.rmse,relative=True), 228 | "r2": Metrics.r2} 229 | }, 230 | 231 | "content_popularity_topk": { 232 | "question": 13, 233 | "scale": "population", 234 | "node_type":"content", 235 | "scenario1":True, 236 | "scenario2":True, 237 | "scenario3":True, 238 | "measurement": "getTopKContent", 239 | "measurement_args":{"k":10,"eventTypes":["retweet"],"content_field":"root"}, 240 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.58)} 241 | }, 242 | 243 | "content_activity_disparity_gini_retweet": { 244 | "question": 14, 245 | "scale": "population", 246 | "node_type":"content", 247 | "scenario1":True, 248 | "scenario2":True, 249 | "scenario3":True, 250 | "measurement": "getGiniCoef", 251 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 252 | "metrics": {"absolute_difference": Metrics.absolute_difference, 253 | "absolute_percentage_error":Metrics.absolute_percentage_error} 254 | }, 255 | 256 | "content_activity_disparity_palma_retweet": { 257 | "question": 14, 258 | "scale": "population", 259 | "node_type":"content", 260 | "scenario1":True, 261 | "scenario2":True, 262 | "scenario3":True, 263 | "measurement": "getPalmaCoef", 264 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 265 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 266 | "absolute_difference":Metrics.absolute_difference} 267 | }, 268 | # "content_activity_disparity_gini_quote": { 269 | # "question": 14, 270 | # "scale": "population", 271 | # "node_type":"content", 272 | # "scenario1":True, 273 | # "scenario2":True, 274 | # 
"scenario3":True, 275 | # "measurement": "getGiniCoef", 276 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 277 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 278 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 279 | # }, 280 | 281 | # "content_activity_disparity_palma_quote": { 282 | # "question": 14, 283 | # "scale": "population", 284 | # "node_type":"content", 285 | # "scenario1":True, 286 | # "scenario2":True, 287 | # "scenario3":True, 288 | # "measurement": "getPalmaCoef", 289 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 290 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 291 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 292 | # }, 293 | # "content_activity_disparity_gini_reply": { 294 | # "question": 14, 295 | # "scale": "population", 296 | # "node_type":"content", 297 | # "scenario1":True, 298 | # "scenario2":True, 299 | # "scenario3":True, 300 | # "measurement": "getGiniCoef", 301 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 302 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 303 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 304 | # }, 305 | 306 | # "content_activity_disparity_palma_reply": { 307 | # "question": 14, 308 | # "scale": "population", 309 | # "node_type":"content", 310 | # "scenario1":True, 311 | # "scenario2":True, 312 | # "scenario3":True, 313 | # "measurement": "getPalmaCoef", 314 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 315 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 316 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 317 | # } 318 | 319 | 320 | } 321 | 322 | 323 | twitter_scenario2_measurement_params_cve = {} 324 | twitter_scenario2_measurement_params_cve.update(user_measurement_params) 325 | twitter_scenario2_measurement_params_cve.update(content_measurement_params) 326 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_cyber_s1.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | "user_activity_timeline": { 40 | "question": '19', 41 | "scale": "node", 42 | "node_type":"user", 43 | 'scenario1':False, 44 | 'scenario2':True, 45 | 'scenario3':False, 46 | "measurement": 
"getUserActivityTimeline", 47 | "measurement_args":{"eventTypes":twitter_events}, 48 | "metrics": {"rmse": Metrics.rmse, 49 | "nrmse": named_partial(Metrics.rmse,relative=True), 50 | "ks_test": Metrics.ks_test, 51 | "dtw": Metrics.dtw} 52 | 53 | }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":4000,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9987)} 92 | }, 93 | 94 | "user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | # "content_diffusion_delay": { 137 | # "question": 1, 138 | # "scale": "node", 139 | # "node_type":"content", 140 | # "scenario1":False, 141 | # "scenario2":True, 142 | # "scenario3":False, 143 | # "measurement": "getContentDiffusionDelay", 144 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | # "metrics": {"ks_test": Metrics.ks_test, 146 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | # }, 148 | 149 | # "content_growth": { 150 | # "question": 2, 151 | # "scale": "node", 152 | # "node_type":"content", 153 | # "scenario1":False, 154 | # "scenario2":True, 155 | # "scenario3":False, 156 | # "measurement": "getContentGrowth", 157 | # 
"measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 158 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | # "dtw": Metrics.dtw} 160 | # }, 161 | 162 | # "content_contributors": { 163 | # "question": 4, 164 | # "scale": "node", 165 | # "node_type":"content", 166 | # "scenario1":False, 167 | # "scenario2":True, 168 | # "scenario3":False, 169 | # "measurement": "getContributions", 170 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 171 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 172 | # "dtw": Metrics.dtw} 173 | # }, 174 | 175 | # "content_event_distribution_dayofweek": { 176 | # "question": 5, 177 | # "scale": "node", 178 | # "node_type":"content", 179 | # "scenario1":False, 180 | # "scenario2":True, 181 | # "scenario3":False, 182 | # "measurement": "getDistributionOfEvents", 183 | # "measurement_args":{"weekday":True,"content_field":"root"}, 184 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 185 | # }, 186 | 187 | "content_liveliness_distribution": { 188 | "question": 13, 189 | "scale": "population", 190 | "node_type":"content", 191 | "scenario1":True, 192 | "scenario2":True, 193 | "scenario3":True, 194 | "measurement": "getDistributionOfEventsByContent", 195 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 196 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 197 | }, 198 | 199 | # "content_liveliness_topk": { 200 | # "question": 13, 201 | # "scale": "population", 202 | # "node_type":"content", 203 | # "scenario1":False, 204 | # "scenario2":True, 205 | # "scenario3":False, 206 | # "measurement": "getTopKContent", 207 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 208 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 209 | # }, 210 | 211 | "content_popularity_distribution": { 212 | "question": 13, 213 | "scale": "population", 214 | "node_type":"content", 215 | "scenario1":False, 216 | "scenario2":True, 217 | "scenario3":False, 218 | "measurement": "getDistributionOfEventsByContent", 219 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 220 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 221 | }, 222 | 223 | # "content_popularity_topk": { 224 | # "question": 13, 225 | # "scale": "population", 226 | # "node_type":"content", 227 | # "scenario1":True, 228 | # "scenario2":True, 229 | # "scenario3":True, 230 | # "measurement": "getTopKContent", 231 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 232 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 233 | # }, 234 | 235 | "content_activity_disparity_gini_retweet": { 236 | "question": 14, 237 | "scale": "population", 238 | "node_type":"content", 239 | "scenario1":True, 240 | "scenario2":True, 241 | "scenario3":True, 242 | "measurement": "getGiniCoef", 243 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 244 | "metrics": {"absolute_difference": Metrics.absolute_difference, 245 | "absolute_percentage_error":Metrics.absolute_percentage_error} 246 | }, 247 | 248 | "content_activity_disparity_palma_retweet": { 249 | "question": 14, 250 | "scale": "population", 251 | "node_type":"content", 252 | "scenario1":True, 253 | "scenario2":True, 254 | "scenario3":True, 255 | "measurement": "getPalmaCoef", 256 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 257 | 
"metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 258 | "absolute_difference":Metrics.absolute_difference} 259 | }, 260 | "content_activity_disparity_gini_quote": { 261 | "question": 14, 262 | "scale": "population", 263 | "node_type":"content", 264 | "scenario1":True, 265 | "scenario2":True, 266 | "scenario3":True, 267 | "measurement": "getGiniCoef", 268 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 269 | "metrics": {"absolute_difference": Metrics.absolute_difference, 270 | "absolute_percentage_error":Metrics.absolute_percentage_error} 271 | }, 272 | 273 | "content_activity_disparity_palma_quote": { 274 | "question": 14, 275 | "scale": "population", 276 | "node_type":"content", 277 | "scenario1":True, 278 | "scenario2":True, 279 | "scenario3":True, 280 | "measurement": "getPalmaCoef", 281 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 282 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 283 | "absolute_difference":Metrics.absolute_difference} 284 | }, 285 | "content_activity_disparity_gini_reply": { 286 | "question": 14, 287 | "scale": "population", 288 | "node_type":"content", 289 | "scenario1":True, 290 | "scenario2":True, 291 | "scenario3":True, 292 | "measurement": "getGiniCoef", 293 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 294 | "metrics": {"absolute_difference": Metrics.absolute_difference, 295 | "absolute_percentage_error":Metrics.absolute_percentage_error} 296 | }, 297 | 298 | "content_activity_disparity_palma_reply": { 299 | "question": 14, 300 | "scale": "population", 301 | "node_type":"content", 302 | "scenario1":True, 303 | "scenario2":True, 304 | "scenario3":True, 305 | "measurement": "getPalmaCoef", 306 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 307 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 308 | "absolute_difference":Metrics.absolute_difference} 309 | } 310 | 311 | 312 | } 313 | 314 | 315 | twitter_scenario1_measurement_params_cyber = {} 316 | twitter_scenario1_measurement_params_cyber.update(user_measurement_params) 317 | twitter_scenario1_measurement_params_cyber.update(content_measurement_params) 318 | -------------------------------------------------------------------------------- /december-measurements/config/network_metrics_config.py: -------------------------------------------------------------------------------- 1 | import Metrics 2 | from run_measurements_and_metrics import named_partial 3 | 4 | network_measurement_params = { 5 | ### Github 6 | "number_of_nodes": { 7 | "question": '', 8 | "scale": "population", 9 | "scenario1":True, 10 | "scenario2":False, 11 | "sceanrio2":True, 12 | "measurement": "number_of_nodes", 13 | "metrics": { 14 | "absolute_difference": Metrics.absolute_difference, 15 | "absolute_percentage_error": Metrics.absolute_percentage_error, 16 | } 17 | }, 18 | 19 | "number_of_edges": { 20 | "question": '', 21 | "scale": "population", 22 | "scenario1":True, 23 | "scenario2":False, 24 | "sceanrio2":True, 25 | "measurement": 'number_of_edges', 26 | "metrics": { 27 | "absolute_difference": Metrics.absolute_difference, 28 | "absolute_percentage_error": Metrics.absolute_percentage_error, 29 | } 30 | }, 31 | 32 | "density": { 33 | "question": '', 34 | "scale": "population", 35 | "scenario1":True, 36 | "scenario2":False, 37 | "sceanrio2":True, 38 | "measurement": 'density', 39 | "metrics": { 40 | "absolute_percentage_error": Metrics.absolute_percentage_error, 41 | "absolute_difference": 
Metrics.absolute_difference, 42 | } 43 | }, 44 | 45 | "mean_shortest_path_length": { 46 | "question": '', 47 | "scale": "population", 48 | "scenario1":True, 49 | "scenario2":False, 50 | "sceanrio2":True, 51 | "measurement": 'mean_shortest_path_length', 52 | "metrics": { 53 | "absolute_difference": Metrics.absolute_difference, 54 | "absolute_percentage_error": Metrics.absolute_percentage_error, 55 | } 56 | }, 57 | 58 | "assortativity_coefficient": { 59 | "question": '', 60 | "scale": "population", 61 | "scenario1":True, 62 | "scenario2":False, 63 | "sceanrio2":True, 64 | "measurement": 'assortativity_coefficient', 65 | "metrics": { 66 | "absolute_percentage_error": Metrics.absolute_percentage_error, 67 | "absolute_difference": Metrics.absolute_difference, 68 | } 69 | }, 70 | 71 | "number_of_connected_components": { 72 | "question": '', 73 | "scale": "population", 74 | "scenario1":True, 75 | "scenario2":False, 76 | "sceanrio2":True, 77 | "measurement": 'number_of_connected_components', 78 | "metrics": { 79 | "absolute_difference": Metrics.absolute_difference, 80 | "absolute_percentage_error": Metrics.absolute_percentage_error, 81 | } 82 | }, 83 | 84 | "average_clustering_coefficient": { 85 | "question": '', 86 | "scale": "population", 87 | "scenario1":True, 88 | "scenario2":False, 89 | "sceanrio2":True, 90 | "measurement": 'average_clustering_coefficient', 91 | "metrics": { 92 | "absolute_percentage_error": Metrics.absolute_percentage_error, 93 | "absolute_difference": Metrics.absolute_difference, 94 | } 95 | }, 96 | 97 | "max_node_degree": { 98 | "question": '', 99 | "scale": "population", 100 | "scenario1":True, 101 | "scenario2":False, 102 | "sceanrio2":True, 103 | "measurement": 'max_node_degree', 104 | "metrics": { 105 | "absolute_difference": Metrics.absolute_difference, 106 | "absolute_percentage_error": Metrics.absolute_percentage_error, 107 | } 108 | }, 109 | 110 | "mean_node_degree": { 111 | "question": '', 112 | "scale": "population", 113 | "scenario1":True, 114 | "scenario2":False, 115 | "sceanrio2":True, 116 | "measurement": 'mean_node_degree', 117 | "metrics": { 118 | "absolute_difference": Metrics.absolute_difference, 119 | "absolute_percentage_error": Metrics.absolute_percentage_error, 120 | } 121 | }, 122 | 123 | "degree_distribution": { 124 | "question": '', 125 | "scale": "population", 126 | "scenario1":True, 127 | "scenario2":False, 128 | "sceanrio2":True, 129 | "measurement": 'degree_distribution', 130 | "metrics": { 131 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True), 132 | } 133 | }, 134 | 135 | "community_modularity": { 136 | "question": '', 137 | "scale": "population", 138 | "scenario1":True, 139 | "scenario2":False, 140 | "sceanrio2":True, 141 | "measurement": 'community_modularity', 142 | "metrics": { 143 | "absolute_percentage_error": Metrics.absolute_percentage_error, 144 | "absolute_difference": Metrics.absolute_difference, 145 | } 146 | }, 147 | } 148 | -------------------------------------------------------------------------------- /december-measurements/infodynamics.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pnnl/socialsim/06f0ce61d10ca08dd50d256fb30ac0ae81ead58d/december-measurements/infodynamics.jar -------------------------------------------------------------------------------- /december-measurements/network_measurements.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 
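# --- Example (illustration only, not part of the original files) ---
# A minimal, self-contained sketch of how an entry in network_measurement_params above can be
# consumed: the "measurement" string names a method on one of the *NetworkMeasurements classes
# defined in this file, it is evaluated on both the ground truth and the simulation, and each
# function in "metrics" is applied to that pair of results. The metric functions and the
# run_network_metrics() driver below are toy stand-ins (the real ones live in Metrics.py and
# run_measurements_and_metrics.py and may differ in detail).

def absolute_difference(ground_truth, simulation):
    return abs(ground_truth - simulation)

def absolute_percentage_error(ground_truth, simulation):
    return 100.0 * abs(ground_truth - simulation) / abs(ground_truth)

class ToyNetworkMeasurements(object):
    # Stand-in for GithubNetworkMeasurements / TwitterNetworkMeasurements /
    # RedditNetworkMeasurements, exposing a single measurement method.
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes

    def number_of_nodes(self):
        return self.n_nodes

example_params = {
    "number_of_nodes": {
        "measurement": "number_of_nodes",
        "metrics": {"absolute_difference": absolute_difference,
                    "absolute_percentage_error": absolute_percentage_error},
    },
}

def run_network_metrics(params, ground_truth, simulation):
    # Hypothetical driver: look up the measurement method by name, then score the pair.
    results = {}
    for name, spec in params.items():
        gt_value = getattr(ground_truth, spec["measurement"])()
        sim_value = getattr(simulation, spec["measurement"])()
        results[name] = dict((metric_name, metric(gt_value, sim_value))
                             for metric_name, metric in spec["metrics"].items())
    return results

print(run_network_metrics(example_params,
                          ToyNetworkMeasurements(n_nodes=120),
                          ToyNetworkMeasurements(n_nodes=100)))
# -> {'number_of_nodes': {'absolute_difference': 20,
#                         'absolute_percentage_error': 16.666...}}
# --- end example ---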
| 3 | import pandas as pd 4 | import igraph as ig 5 | import snap as sn 6 | from time import time 7 | import numpy as np 8 | 9 | import community 10 | import tqdm 11 | 12 | from prettytable import PrettyTable 13 | from prettytable import MSWORD_FRIENDLY 14 | 15 | import os 16 | 17 | __all__ = ['GithubNetworkMeasurements', 18 | 'TwitterNetworkMeasurements', 19 | 'RedditNetworkMeasurements'] 20 | 21 | class NetworkMeasurements(object): 22 | """ 23 | This class implements Network specific measurements. It uses iGraph and SNAP libraries with Python interfaces. 24 | For installation information please visit the websites for the two packages. 25 | 26 | iGraph-Python at http://igraph.org/python/ 27 | SNAP Python at https://snap.stanford.edu/snappy/ 28 | """ 29 | def __init__(self, data, test=False): 30 | self.main_df = data if isinstance(data, pd.DataFrame) else pd.read_csv(data) 31 | 32 | if test: 33 | print('Running test version of network measurements') 34 | self.main_df = self.main_df.head(1000) 35 | 36 | assert self.main_df is not None and len(self.main_df) > 0, 'Problem with the dataframe creation' 37 | 38 | self.preprocess() 39 | 40 | self.build_undirected_graph(self.main_df) 41 | 42 | 43 | def preprocess(self): 44 | return NotImplementedError() 45 | 46 | def build_undirected_graph(self, df): 47 | return NotImplementedError() 48 | 49 | def mean_shortest_path_length(self): 50 | return sn.GetBfsEffDiamAll(self.gUNsn, 500, False)[3] 51 | 52 | def number_of_nodes(self): 53 | return ig.Graph.vcount(self.gUNig) 54 | 55 | def number_of_edges(self): 56 | return ig.Graph.ecount(self.gUNig) 57 | 58 | def density(self): 59 | return ig.Graph.density(self.gUNig) 60 | 61 | def assortativity_coefficient(self): 62 | return ig.Graph.assortativity_degree(self.gUNig) 63 | 64 | def number_of_connected_components(self): 65 | return len(ig.Graph.components(self.gUNig,mode="WEAK")) 66 | 67 | def average_clustering_coefficient(self): 68 | return sn.GetClustCfAll(self.gUNsn, sn.TFltPrV())[0] 69 | #return ig.Graph.transitivity_avglocal_undirected(self.gUNig,mode="zero") 70 | 71 | def max_node_degree(self): 72 | return max(ig.Graph.degree(self.gUNig)) 73 | 74 | def mean_node_degree(self): 75 | return 2.0*ig.Graph.ecount(self.gUNig)/ig.Graph.vcount(self.gUNig) 76 | 77 | def degree_distribution(self): 78 | degVals = ig.Graph.degree(self.gUNig) 79 | return pd.DataFrame([{'node': idx, 'value': degVals[idx]} for idx in range(self.gUNig.vcount())]) 80 | 81 | def community_modularity(self): 82 | return ig.Graph.modularity(self.gUNig,ig.Graph.community_multilevel(self.gUNig)) 83 | 84 | 85 | def get_parent_uids(self,df, parent_node_col="parentID", node_col="nodeID", root_node_col="rootID", user_col="nodeUserID"): 86 | """ 87 | :return: adds parentUserID column with user id of the parent if it exits in df 88 | if it doesn't exist, uses the user id of the root instead 89 | if both doesn't exist: NaN 90 | """ 91 | tweet_uids = pd.Series(df[user_col].values, index=df[node_col]).to_dict() 92 | df['parentUserID'] = df[parent_node_col].map(tweet_uids) 93 | df.loc[(df[root_node_col] != df[node_col]) & (df['parentUserID'].isnull()), 'parentUserID'] = \ 94 | df[(df[root_node_col] != df[node_col]) & (df['parentUserID'].isnull())][root_node_col].map(tweet_uids) 95 | return df 96 | 97 | class GithubNetworkMeasurements(NetworkMeasurements): 98 | 99 | def __init__(self, project_on='nodeID', weighted=False, **kwargs): 100 | self.project_on = project_on 101 | self.weighted = weighted 102 | super(GithubNetworkMeasurements, 
self).__init__(**kwargs) 103 | 104 | def preprocess(self): 105 | pass 106 | 107 | def build_undirected_graph(self, df): 108 | 109 | #self.main_df = pd.read_csv(data) 110 | self.main_df = self.main_df[['nodeUserID','nodeID']] 111 | 112 | left_nodes = np.array(self.main_df['nodeUserID'].unique().tolist()) 113 | right_nodes = np.array(self.main_df['nodeID'].unique().tolist()) 114 | el = self.main_df.apply(tuple, axis=1).tolist() 115 | edgelist = list(set(el)) 116 | 117 | #iGraph Graph object construction 118 | B = ig.Graph.TupleList(edgelist, directed=False) 119 | names = np.array(B.vs["name"]) 120 | types = np.isin(names,right_nodes) 121 | B.vs["type"] = types 122 | p1,p2 = B.bipartite_projection(multiplicity=False) 123 | 124 | self.gUNig = None 125 | if (self.project_on == "user"): 126 | self.gUNig = p1 127 | else: 128 | self.gUNig = p2 129 | 130 | #self.gUNig = B.bipartite_projection(multiplicity=False, which = 0) 131 | 132 | 133 | #SNAP graph object construction 134 | self.gUNsn = sn.TUNGraph.New() 135 | for v in self.gUNig.vs: 136 | self.gUNsn.AddNode(v.index) 137 | for e in self.gUNig.es: 138 | self.gUNsn.AddEdge(e.source,e.target) 139 | 140 | 141 | class TwitterNetworkMeasurements(NetworkMeasurements): 142 | def __init__(self, **kwargs): 143 | super(TwitterNetworkMeasurements, self).__init__(**kwargs) 144 | 145 | def preprocess(self): 146 | pass 147 | 148 | def build_undirected_graph(self, df): 149 | 150 | df = self.get_parent_uids(df).dropna(subset=['parentUserID']) 151 | edgelist = df[['nodeUserID','parentUserID']].apply(tuple,axis=1).tolist() 152 | 153 | #iGraph Graph object construction 154 | self.gUNig = ig.Graph.TupleList(edgelist, directed=False) 155 | 156 | #SNAP graph object construction 157 | self.gUNsn = sn.TUNGraph.New() 158 | for v in self.gUNig.vs: 159 | self.gUNsn.AddNode(v.index) 160 | for e in self.gUNig.es: 161 | self.gUNsn.AddEdge(e.source,e.target) 162 | 163 | 164 | class RedditNetworkMeasurements(NetworkMeasurements): 165 | def __init__(self, **kwargs): 166 | super(RedditNetworkMeasurements, self).__init__(**kwargs) 167 | 168 | def preprocess(self): 169 | pass 170 | 171 | def build_undirected_graph(self,df): 172 | 173 | df = self.get_parent_uids(df).dropna(subset=['parentUserID']) 174 | edgelist = df[['nodeUserID','parentUserID']].apply(tuple,axis=1).tolist() 175 | 176 | #iGraph Graph object construction 177 | self.gUNig = ig.Graph.TupleList(edgelist, directed=False) 178 | 179 | #SNAP graph object construction 180 | self.gUNsn = sn.TUNGraph.New() 181 | for v in self.gUNig.vs: 182 | self.gUNsn.AddNode(v.index) 183 | for e in self.gUNig.es: 184 | self.gUNsn.AddEdge(e.source,e.target) 185 | 186 | 187 | -------------------------------------------------------------------------------- /december-measurements/plotting/charts.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import seaborn as sns 3 | import numpy as np 4 | import pandas as pd 5 | 6 | sns.set(style="whitegrid") 7 | 8 | 9 | def histogram(df, xlabel, ylabel, title, **kwargs): 10 | n_bins = 100 11 | 12 | if 'Simulation' in df.columns and 'Ground Truth' in df.columns: 13 | 14 | gold_data = df.dropna(subset=["Ground Truth"])["Ground Truth"] 15 | test_data = df.dropna(subset=["Simulation"])["Simulation"] 16 | 17 | data = np.concatenate([gold_data, test_data]) 18 | 19 | elif 'Simulation' in df.columns or 'Ground Truth' in df.columns: 20 | 21 | if 'Simulation' in df.columns: 22 | data = df.dropna(subset=["Simulation"])['Simulation'] 23 | 
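            # Only the simulation output is available in this branch; it is reused
            # below both for choosing the histogram bins and as the plotted series.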
test_data = data.copy() 24 | else: 25 | data = df.dropna(subset=["Ground Truth"])['Ground Truth'] 26 | gold_data = data.copy() 27 | else: 28 | return None 29 | 30 | _,bins = np.histogram(data,bins='doane') 31 | #bins = np.linspace(data.min(), data.max(), n_bins) 32 | 33 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 34 | if 'Ground Truth' in df.columns: 35 | ax.hist(gold_data, bins, log=True, label='Ground Truth', alpha=0.7, color='green') 36 | if 'Simulation' in df.columns: 37 | ax.hist(test_data, bins, log=True, label='Simulation', alpha=.7, color='red') 38 | 39 | ax.set(xlabel=xlabel) 40 | ax.set(ylabel=ylabel) 41 | ax.set(title=title) 42 | ax.legend(loc='best') 43 | 44 | plt.tight_layout() 45 | return fig 46 | 47 | 48 | def scatter(df, xlabel, ylabel, title, **kwargs): 49 | 50 | if 'Ground Truth' in df.columns and 'Simulation' in df.columns: 51 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 52 | sns.scatterplot(x="Ground Truth", y="Simulation", data=df, ax=ax, alpha=0.7) 53 | ax.set(xlabel=xlabel) 54 | ax.set(ylabel=ylabel) 55 | ax.set(title=title) 56 | plt.tight_layout() 57 | return fig 58 | else: 59 | return None 60 | 61 | 62 | def bar(df, xlabel, ylabel, title, **kwargs): 63 | 64 | palette = set_palette(df) 65 | 66 | df.fillna(0, inplace=True) 67 | 68 | df = df.melt(df.columns[0], var_name='type', value_name='vals') 69 | 70 | fig, ax = plt.subplots(1, 1, figsize=(15, 7)) 71 | sns.barplot(x=df.columns[0], y='vals', hue='type', data=df, ax=ax, palette=palette, alpha=0.7) 72 | ax.set_xticklabels(ax.get_xticklabels(), rotation=30) 73 | ax.set(xlabel=xlabel) 74 | ax.set(ylabel=ylabel) 75 | ax.legend(loc='best') 76 | ax.set(title=title) 77 | plt.tight_layout() 78 | return fig 79 | 80 | 81 | def set_palette(df): 82 | 83 | if 'Ground Truth' in df.columns and 'Simulation' in df.columns: 84 | palette = ['green','red'] 85 | elif 'Ground Truth' in df.columns: 86 | palette = ['green'] 87 | else: 88 | palette = ['red'] 89 | 90 | return palette 91 | 92 | def time_series(df, xlabel, ylabel, title, **kwargs): 93 | 94 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 95 | 96 | palette = set_palette(df) 97 | 98 | df = df.melt(id_vars = [c for c in df.columns if c not in ['Ground Truth','Simulation']], var_name='type', value_name='vals').sort_values('type') 99 | 100 | df.dropna(inplace=True) 101 | sns.lineplot(x=df.columns[0], y='vals', hue='type', data=df, ax=ax, marker='o', palette=palette, alpha=0.7) 102 | handles, labels = ax.get_legend_handles_labels() 103 | ax.legend(loc='best', handles=handles[1:], labels=labels[1:]) 104 | 105 | ax.set(xlabel=xlabel) 106 | ax.set(ylabel=ylabel) 107 | ax.set(title=title) 108 | plt.tight_layout() 109 | 110 | return fig 111 | 112 | 113 | 114 | def multi_time_series(df, xlabel, ylabel, title, **kwargs): 115 | 116 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 117 | 118 | if 'time' in df.columns: 119 | time_col = 'time' 120 | elif 'date' in df.columns: 121 | time_col = 'date' 122 | elif 'weekday' in df.columns: 123 | day_map = {'Monday':1, 124 | 'Tuesday':2, 125 | 'Wednesday':3, 126 | 'Thursday':4, 127 | 'Friday':5, 128 | 'Saturday':6, 129 | 'Sunday':7} 130 | df['weekday_int'] = df['weekday'].map(day_map) 131 | df = df.sort_values('weekday_int') 132 | time_col = 'weekday_int' 133 | 134 | if 'Ground Truth' in df.columns and 'Simulation' in df.columns: 135 | value_vars = ['Ground Truth', 'Simulation'] 136 | elif 'Ground Truth' in df.columns: 137 | value_vars = ['Ground Truth'] 138 | else: 139 | value_vars = ['Simulation'] 140 | 141 | df = pd.melt(df, id_vars=[c for c 
in df.columns if c not in value_vars], value_vars=value_vars, var_name='type').fillna(0) 142 | 143 | sns.lineplot(x=time_col, y='value', hue=[c for c in df.columns if c not in ['Ground Truth', 'Simulation',time_col]][0], style='type', 144 | data=df, ax=ax, marker='o', alpha=0.7, 145 | palette='bright') 146 | 147 | if time_col == 'weekday_int': 148 | ax.set(xticklabels=df['weekday'].unique()) 149 | 150 | handles, labels = ax.get_legend_handles_labels() 151 | ax.legend(loc='best', handles=handles[1:], labels=labels[1:]) 152 | ax.set(xlabel=xlabel) 153 | ax.set(ylabel=ylabel) 154 | ax.set(title=title) 155 | plt.tight_layout() 156 | return fig 157 | 158 | 159 | def save_charts(fig, loc): 160 | fig.savefig(loc) 161 | plt.close(fig) 162 | 163 | 164 | def show_charts(): 165 | plt.show() 166 | 167 | def chart_factory(chart_name): 168 | charts_mapping = { 169 | 'bar': bar, 170 | 'hist': histogram, 171 | 'time_series': time_series, 172 | 'scatter': scatter, 173 | 'multi_time_series':multi_time_series 174 | } 175 | 176 | return charts_mapping.get(chart_name, None) 177 | -------------------------------------------------------------------------------- /december-measurements/plotting/transformer.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def to_DataFrame(data_type): 4 | data_mapping = { 5 | 'dict': convert_dict, 6 | 'DataFrame': convert_DataFrame, 7 | 'dict_DataFrame': convert_dict_DataFrame, 8 | 'dict_Series': convert_dict_Series, 9 | 'dict_array':convert_dict_array, 10 | 'Series':convert_Series, 11 | 'tuple': None 12 | } 13 | 14 | return data_mapping.get(data_type, None) 15 | 16 | 17 | 18 | 19 | def convert_Series(ground_truth_data=None, sim_data=None, **kwargs): 20 | 21 | if not ground_truth_data is None and not sim_data is None: 22 | result_df = pd.concat([ground_truth_data.reset_index(drop=True),sim_data.reset_index(drop=True)], axis=1) 23 | result_df.columns = ['Ground Truth', 'Simulation'] 24 | elif not ground_truth_data is None: 25 | result_df = pd.DataFrame(ground_truth_data.reset_index(drop=True)) 26 | result_df.columns = ['Ground Truth'] 27 | elif not sim_data is None: 28 | result_df = pd.DataFrame(sim_data.reset_index(drop=True)) 29 | result_df.columns = ['Simulation'] 30 | 31 | return result_df 32 | 33 | 34 | 35 | def convert_dict(ground_truth_data=None, sim_data=None, **kwargs): 36 | 37 | 38 | if not ground_truth_data is None and not sim_data is None: 39 | keys = list(ground_truth_data.keys()) + list(sim_data.keys()) 40 | 41 | keys = set(keys) 42 | 43 | data = [] 44 | for k in keys: 45 | data.append({'Key': k, 'Ground Truth': ground_truth_data.get(k, None), 'Simulation': sim_data.get(k, None)}) 46 | 47 | df= pd.DataFrame(data)[['Key','Ground Truth','Simulation']] 48 | 49 | 50 | elif not ground_truth_data is None: 51 | keys = list(ground_truth_data.keys()) 52 | 53 | keys = set(keys) 54 | 55 | data = [] 56 | for k in keys: 57 | data.append({'Key': k, 'Ground Truth': ground_truth_data.get(k, None)}) 58 | 59 | df= pd.DataFrame(data)[['Key','Ground Truth']] 60 | 61 | elif not sim_data is None: 62 | keys = list(sim_data.keys()) 63 | 64 | keys = set(keys) 65 | 66 | data = [] 67 | for k in keys: 68 | data.append({'Key': k, 'Simulation': sim_data.get(k, None)}) 69 | 70 | df= pd.DataFrame(data)[['Key','Simulation']] 71 | 72 | return df 73 | 74 | 75 | def convert_DataFrame(ground_truth_data=None, sim_data=None, **kwargs): 76 | 77 | if ground_truth_data is None: 78 | result_df = sim_data.copy() 79 | 
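        # Only simulation output was supplied: relabel its generic 'value' column
        # as 'Simulation' so the downstream plotting code can find it by name.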
result_df.rename(index=str,columns={'value':'Simulation'},inplace=True) 80 | elif sim_data is None: 81 | result_df = ground_truth_data.copy() 82 | result_df.rename(index=str,columns={'value':'Ground Truth'},inplace=True) 83 | else: 84 | merge_cols = [c for c in ground_truth_data.columns if c != 'value'] 85 | result_df = pd.merge(ground_truth_data, sim_data, on=merge_cols, how='outer') 86 | result_df.columns = merge_cols + ['Ground Truth', 'Simulation'] 87 | 88 | return result_df 89 | 90 | 91 | def convert_dict_DataFrame(ground_truth_data=None, sim_data=None, **kwargs): 92 | 93 | if kwargs.get('key'): 94 | 95 | if not ground_truth_data is None and not sim_data is None and kwargs.get('key') in ground_truth_data and kwargs.get('key') in sim_data: 96 | merge_columns = [c for c in ground_truth_data[kwargs.get('key')].columns if c != 'value'] 97 | result_df = pd.merge(ground_truth_data[kwargs.get('key')], sim_data[kwargs.get('key')], on=merge_columns, how='outer') 98 | result_df.columns = merge_columns + ['Ground Truth', 'Simulation'] 99 | elif not ground_truth_data is None and kwargs.get('key') in ground_truth_data: 100 | result_df = ground_truth_data[kwargs.get('key')].copy() 101 | result_df.rename(index=str,columns={"value":"Ground Truth"},inplace=True) 102 | elif not sim_data is None and kwargs.get('key') in sim_data: 103 | result_df = sim_data[kwargs.get('key')].copy() 104 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 105 | else: 106 | return None 107 | 108 | return result_df 109 | 110 | 111 | def convert_dict_Series(ground_truth_data=None, sim_data=None, **kwargs): 112 | 113 | if kwargs.get('key'): 114 | 115 | both = True 116 | if not sim_data is None and kwargs.get('key') in sim_data: 117 | sim_data= sim_data[kwargs.get('key')] 118 | result_df = pd.DataFrame(sim_data).copy() 119 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 120 | else: 121 | both = False 122 | 123 | if not ground_truth_data is None and kwargs.get('key') in ground_truth_data: 124 | ground_truth_data=ground_truth_data[kwargs.get('key')] 125 | result_df = pd.DataFrame(ground_truth_data).copy() 126 | result_df.rename(index=str,columns={"value":"Ground Truth"},inplace=True) 127 | else: 128 | both = False 129 | 130 | if both: 131 | result_df = pd.concat([ground_truth_data.reset_index(drop=True),sim_data.reset_index(drop=True)], axis=1) 132 | result_df.columns = [ 'Ground Truth', 'Simulation'] 133 | 134 | return result_df 135 | 136 | def convert_dict_array(ground_truth_data=None, sim_data=None, **kwargs): 137 | 138 | if kwargs.get('key'): 139 | 140 | both = True 141 | 142 | if not sim_data is None: 143 | sim_data = pd.Series(sim_data[kwargs.get('key')]) 144 | result_df = sim_data.copy() 145 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 146 | else: 147 | both = False 148 | 149 | if not ground_truth_data is None: 150 | ground_truth_data = pd.Series(ground_truth_data[kwargs.get('key')]) 151 | result_df = ground_truth_data.copy() 152 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 153 | else: 154 | both = False 155 | 156 | 157 | if both: 158 | result_df = pd.concat([ground_truth_data,sim_data], axis=1) 159 | result_df.columns = [ 'Ground Truth', 'Simulation'] 160 | 161 | return result_df 162 | -------------------------------------------------------------------------------- /december-measurements/plotting/visualization_config.py: -------------------------------------------------------------------------------- 1 | 
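# --- Illustrative sketch (added for clarity; not part of the original config) ---
# Each entry below tells the plotting layer how to coerce a measurement result into a
# DataFrame ("data_type", matching the converters in plotting/transformer.py), which
# chart functions to draw ("plot", matching plotting/charts.py), and how to label the
# axes. The driver presumably consumes an entry along the lines sketched here;
# `plot_measurement` and its arguments are hypothetical, and the real code also
# handles "plot_keys", missing data and figure saving.
def plot_measurement(entry, to_DataFrame, chart_factory, gt_result=None, sim_result=None):
    """Convert one measurement result to a DataFrame and draw the configured charts."""
    converter = to_DataFrame(entry["data_type"])      # e.g. convert_DataFrame
    if converter is None:
        return []
    df = converter(ground_truth_data=gt_result, sim_data=sim_result)
    figures = []
    for chart_name in entry["plot"]:                  # e.g. ['hist'] or ['time_series']
        chart = chart_factory(chart_name)
        if df is not None and chart is not None:
            figures.append(chart(df, entry["x_axis"], entry["y_axis"], chart_name))
    return figures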
measurement_plot_params = { 2 | 3 | ### community 4 | 5 | "community_burstiness": { 6 | "data_type": "dict", 7 | "x_axis": "Community", 8 | "y_axis": "Burstiness", 9 | "plot": ['bar'] 10 | }, 11 | 12 | "community_contributing_users": { 13 | "data_type": "dict", 14 | "x_axis": "Community", 15 | "y_axis": "Proportion of Users Contributing", 16 | "plot": ['bar'] 17 | }, 18 | 19 | "community_event_proportions": { 20 | "data_type": "dict_DataFrame", 21 | "x_axis": "Event Type", 22 | "y_axis": "Event Proportion", 23 | "plot": ['bar'], 24 | "plot_keys": "community" 25 | }, 26 | 27 | "community_geo_locations": { 28 | "data_type": "dict_DataFrame", 29 | "x_axis": "Country", 30 | "y_axis": "Number of Events", 31 | "plot": ['bar'], 32 | "plot_keys": "community" 33 | }, 34 | 35 | "community_issue_types": { # result None type 36 | "data_type": "dict_DataFrame", 37 | "x_axis": "Date", 38 | "y_axis": "Number of Issues", 39 | "plot": ['multi_time_series'], 40 | "plot_keys": "community" 41 | 42 | }, 43 | 44 | "community_num_user_actions": { 45 | "data_type": "dict_DataFrame", 46 | "x_axis": "Date", 47 | "y_axis": "Mean Number of User Actions", 48 | "hue": "Key", 49 | "plot": ['time_series'], 50 | "plot_keys": "community_subsets" 51 | }, 52 | # 53 | 54 | 'community_user_account_ages': { 55 | "data_type": "dict_Series", 56 | "x_axis": "User Account Age", 57 | "y_axis": "Number of Actions", 58 | "plot": ['hist'], 59 | "plot_keys": "community" 60 | }, 61 | 62 | 'community_user_burstiness': { 63 | "data_type": "dict_Series", 64 | "x_axis": "User Burstiness", 65 | "y_axis": "Number of Users", 66 | "plot": ['hist'], 67 | "plot_keys": "community" 68 | }, 69 | 70 | # 71 | "community_gini": { 72 | "data_type": "dict", 73 | "x_axis": "Community", 74 | "y_axis": "Gini Scores", 75 | "plot": ['bar'] 76 | }, 77 | 78 | "community_palma": { 79 | "data_type": "dict", 80 | "x_axis": "Community", 81 | "y_axis": "Palma Scores", 82 | "plot": ['bar'] 83 | }, 84 | 85 | # repo 86 | # 87 | 88 | "content_contributors": { 89 | "data_type": "dict_DataFrame", 90 | "x_axis": "Date", 91 | "y_axis": "Number of Contributors", 92 | "plot": ['time_series'], 93 | "plot_keys": "content" 94 | }, 95 | 96 | "content_diffusion_delay": { 97 | "data_type": "dict_Series", 98 | "x_axis": "Diffusion Delay", 99 | "y_axis": "Number of Events", 100 | "plot": ['hist'], 101 | "plot_keys": "content" 102 | }, 103 | 104 | "repo_event_counts_issue": { 105 | "data_type": "DataFrame", 106 | "y_axis": "Number of Repos", 107 | "x_axis": "Number of Issue Events", 108 | "plot": ['hist'] 109 | }, 110 | 111 | "repo_event_counts_pull_request": { 112 | "data_type": "DataFrame", 113 | "y_axis": "Number of Repos", 114 | "x_axis": "Number of Pull Requests", 115 | "plot": ['hist'] 116 | }, 117 | 118 | "repo_event_counts_push": { 119 | "data_type": "DataFrame", 120 | "y_axis": "Number of Repos", 121 | "x_axis": "Number of Push Events", 122 | "plot": ['hist'] 123 | }, 124 | 125 | "content_event_distribution_daily": { 126 | "data_type": "dict_DataFrame", 127 | "x_axis": "Date", 128 | "y_axis": "# Events", 129 | "plot": ['multi_time_series'], 130 | "plot_keys": "content" 131 | }, 132 | 133 | "content_event_distribution_dayofweek": { 134 | "data_type": "dict_DataFrame", 135 | "x_axis": "Day of Week", 136 | "y_axis": "# Events", 137 | "plot": ['multi_time_series'], 138 | "plot_keys": "content" 139 | }, 140 | 141 | "content_growth": { 142 | "data_type": "dict_DataFrame", 143 | "x_axis": "Date", 144 | "y_axis": "# Events", 145 | "plot": ['time_series'], 146 | "plot_keys": "content" 
147 | }, 148 | # 149 | "repo_issue_to_push": { 150 | "data_type": "dict_DataFrame", 151 | "x_axis": "Number of Previous Events", 152 | "y_axis": "Issue Push Ratio", 153 | "plot": ['time_series'], 154 | "plot_keys": "content" 155 | }, 156 | 157 | "content_liveliness_distribution": { 158 | "data_type": "DataFrame", 159 | "y_axis": "Number of Repos/Posts/Tweets", 160 | "x_axis": "Number of Forks/Comments/Replies", 161 | "plot": ['hist'] 162 | }, 163 | 164 | "repo_trustingness": { 165 | "data_type": "DataFrame", 166 | "x_axis": "Ground Truth", 167 | "y_axis": "Simulation", 168 | "plot": ['scatter'] 169 | }, 170 | 171 | "content_popularity_distribution": { 172 | "data_type": "DataFrame", 173 | "y_axis": "Number of Repos/Tweets", 174 | "x_axis": "Number of Watches/Rewtweets", 175 | "plot": ['hist'] 176 | }, 177 | 178 | "repo_user_continue_prop": { 179 | "data_type": "dict_DataFrame", 180 | "x_axis": "Number of Actions", 181 | "y_axis": "Probability of Continuing", 182 | "plot": ['time_series'], 183 | "plot_keys": "content" 184 | }, 185 | # 186 | # 187 | # ### user 188 | 189 | "user_popularity": { 190 | "data_type": "DataFrame", 191 | "y_axis": "Number of Users", 192 | "x_axis": "Popularity of User's Repos/Tweets/Posts", 193 | "plot": ['hist'] 194 | }, 195 | 196 | "user_activity_distribution": { 197 | "data_type": "DataFrame", 198 | "x_axis": "User Activity", 199 | "y_axis": "Number of Users", 200 | "plot": ['hist'] 201 | }, 202 | 203 | "user_diffusion_delay": { 204 | "data_type": "Series", 205 | "x_axis": "Diffusion Delay (H)", 206 | "y_axis": "Number of Events", 207 | "plot": ['hist'] 208 | }, 209 | "user_activity_timeline": { 210 | "data_type": "dict_DataFrame", 211 | "x_axis": "Date", 212 | "y_axis": "Number of Events", 213 | "plot": ['time_series'], 214 | "plot_keys": "user" 215 | }, 216 | 217 | "user_trustingness": { 218 | "data_type": "DataFrame", 219 | "x_axis": "Ground Truth", 220 | "y_axis": "Simulation", 221 | "plot": ['scatter'] 222 | }, 223 | 224 | "user_unique_content": { 225 | "data_type": "DataFrame", 226 | "x_axis": "Number of Unique Repos/Posts/Tweets", 227 | "y_axis": "Number of Users", 228 | "plot": ['hist'] 229 | } 230 | } 231 | 232 | cascade_measurement_plot_params = { 233 | 'cascade_breadth_by_depth': { 234 | 'data_type': 'dict_DataFrame', 235 | 'plot': ['time_series'], 236 | 'x_axis': 'Depth', 237 | 'y_axis': 'Breadth', 238 | 'plot_keys':'cascade'}, 239 | 240 | 'cascade_breadth_by_time': 241 | {'data_type': 'dict_DataFrame', 242 | 'plot': ['time_series'], 243 | 'x_axis': 'Date', 244 | 'y_axis': 'Breadth', 245 | 'plot_keys':'cascade'}, 246 | 247 | 'cascade_max_depth_over_time': 248 | {'data_type': 'dict_DataFrame', 249 | 'plot': ['time_series'], 250 | 'x_axis': 'Date', 251 | 'y_axis': 'Depth', 252 | 'plot_keys':'cascade'}, 253 | 254 | 'cascade_new_user_ratio_by_depth': 255 | {'data_type': 'dict_DataFrame', 256 | 'plot': ['time_series'], 257 | 'x_axis': 'Depth', 258 | 'y_axis': 'New User Ratio', 259 | 'plot_keys':'cascade'}, 260 | 261 | 'cascade_new_user_ratio_by_time': 262 | {'data_type': 'dict_DataFrame', 263 | 'plot': ['time_series'], 264 | 'x_axis': 'Date', 265 | 'y_axis': 'New User Ratio', 266 | 'plot_keys':'cascade'}, 267 | 268 | 'cascade_size_over_time': 269 | {'data_type': 'dict_DataFrame', 270 | 'plot': ['time_series'], 271 | 'x_axis': 'Date', 272 | 'y_axis': 'Cascade Size', 273 | 'plot_keys':'cascade'}, 274 | 275 | 'cascade_structural_virality_over_time': 276 | {'data_type': 'dict_DataFrame', 277 | 'plot': ['time_series'], 278 | 'x_axis': 'Date', 279 | 'y_axis': 
'Structural Virality', 280 | 'plot_keys':'cascade'}, 281 | 282 | 'cascade_uniq_users_by_depth': 283 | {'data_type': 'dict_DataFrame', 284 | 'plot': ['time_series'], 285 | 'x_axis': 'Depth', 286 | 'y_axis': 'Unique Users', 287 | 'plot_keys':'cascade'}, 288 | 289 | 'cascade_uniq_users_by_time': 290 | {'data_type': 'dict_DataFrame', 291 | 'plot': ['time_series'], 292 | 'x_axis': 'Date', 293 | 'y_axis': 'Unique Users', 294 | 'plot_keys':'cascade'}, 295 | 296 | 'community_cascade_lifetime_distribution': 297 | {'data_type': 'dict_DataFrame', 298 | 'plot': ['hist'], 299 | 'x_axis': 'Lifetime', 300 | 'y_axis': 'Number of Cascades', 301 | 'plot_keys':'community'}, 302 | 303 | 'community_cascade_lifetime_timeseries': 304 | {'data_type': 'dict_DataFrame', 305 | 'plot': ['time_series'], 306 | 'x_axis': 'Date', 307 | 'y_axis': 'Cascade Lifetime', 308 | 'plot_keys':'community'}, 309 | 310 | 'community_cascade_size_distribution': 311 | {'data_type': 'dict_DataFrame', 312 | 'plot': ['hist'], 313 | 'x_axis': 'Size', 314 | 'y_axis': 'Number of Cascades', 315 | 'plot_keys':'community'}, 316 | 317 | 'community_cascade_size_timeseries': 318 | {'data_type': 'dict_DataFrame', 319 | 'plot': ['time_series'], 320 | 'x_axis': 'Time', 321 | 'y_axis': 'Cascade Size', 322 | 'plot_keys':'community'}, 323 | 324 | 'community_max_breadth_distribution': 325 | {'data_type': 'dict_DataFrame', 326 | 'plot': ['hist'], 327 | 'x_axis': 'Max Breadth', 328 | 'y_axis': 'Number of Cascades', 329 | 'plot_keys':'community'}, 330 | 331 | 'community_max_depth_distribution': 332 | {'data_type': 'dict_DataFrame', 333 | 'plot': ['hist'], 334 | 'x_axis': 'Max Depth', 335 | 'y_axis': 'Number of Cascades', 336 | 'plot_keys':'community'}, 337 | 338 | 'community_new_user_ratio_by_time': 339 | {'data_type': 'dict_DataFrame', 340 | 'plot': ['time_series'], 341 | 'x_axis': 'Date', 342 | 'y_axis': 'New User Ratio', 343 | 'plot_keys':'community'}, 344 | 345 | 'community_structural_virality_distribution': 346 | {'data_type': 'dict_DataFrame', 347 | 'plot': ['hist'], 348 | 'x_axis': 'Structural Virality', 349 | 'y_axis': 'Number of Cascade', 350 | 'plot_keys':'community'}, 351 | 352 | 'community_unique_users_by_time': 353 | {'data_type': 'dict_DataFrame', 354 | 'plot': ['time_series'], 355 | 'x_axis': 'Date', 356 | 'y_axis': 'Unique Users', 357 | 'plot_keys':'community'}, 358 | 359 | 'population_cascade_lifetime_distribution': 360 | {'data_type': 'DataFrame', 361 | 'plot': ['hist'], 362 | 'x_axis': 'Cascade Lifetime', 363 | 'y_axis': 'Number of Cascades'}, 364 | 365 | 'population_cascade_lifetime_timeseries': 366 | {'data_type': 'DataFrame', 367 | 'plot': ['time_series'], 368 | 'x_axis': 'Date', 369 | 'y_axis': 'Cascade Lifetime'}, 370 | 371 | 'population_cascade_size_distribution': 372 | {'data_type': 'DataFrame', 373 | 'plot': ['hist'], 374 | 'x_axis': 'Size', 375 | 'y_axis': 'Number of Cascades'}, 376 | 377 | 'population_cascade_size_timeseries': 378 | {'data_type': 'DataFrame', 379 | 'plot': ['time_series'], 380 | 'x_axis': 'Date', 381 | 'y_axis': 'Cascade Size'}, 382 | 383 | 'population_max_breadth_distribution': 384 | {'data_type': 'DataFrame', 385 | 'plot': ['hist'], 386 | 'x_axis': 'Max Breadth', 387 | 'y_axis': 'Number of Cascades'}, 388 | 389 | 'population_max_depth_distribution': 390 | {'data_type': 'DataFrame', 391 | 'plot': ['hist'], 392 | 'x_axis': 'Max Depth', 393 | 'y_axis': 'Number of Cascades'}, 394 | 395 | 'population_structural_virality_distribution': 396 | {'data_type': 'DataFrame', 397 | 'plot': ['hist'], 398 | 'x_axis': 
'Structural Virality', 399 | 'y_axis': 'Number of Cascades'} 400 | } 401 | 402 | measurement_plot_params.update(cascade_measurement_plot_params) 403 | -------------------------------------------------------------------------------- /december-measurements/validators.py: -------------------------------------------------------------------------------- 1 | from functools import wraps 2 | 3 | def check_empty(default=None): 4 | def wrap(func): 5 | @wraps(func) 6 | def wrapped_f(self, *args, **kwargs): 7 | if self.main_df is None or self.main_df.empty or len(self.main_df) <= 0: 8 | return default 9 | else: 10 | return func(self, *args, **kwargs) 11 | return wrapped_f 12 | return wrap 13 | 14 | def check_root_only(default=None): 15 | """ 16 | check if it is a single node cascade 17 | """ 18 | def wrap(func): 19 | @wraps(func) 20 | def wrapped_f(self, *args, **kwargs): 21 | if len(self.main_df[self.main_df[self.node_col]!=self.main_df[self.root_node_col]])==0: 22 | return default 23 | else: 24 | return func(self, *args, **kwargs) 25 | return wrapped_f 26 | return wrap 27 | -------------------------------------------------------------------------------- /github-measurements-old/TransferEntropy.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import jpype 4 | from jpype import * 5 | from datetime import datetime 6 | 7 | ''' 8 | Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract 9 | No. DE-AC05-76RL01830 with the Department of Energy (DOE). All rights in the computer software are reserved by DOE on 10 | behalf of the United States Government and the Contractor as provided in the Contract. You are authorized to use this 11 | computer software for Governmental purposes but it is not to be released or distributed to the public. NEITHER THE 12 | GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS 13 | SOFTWARE. This notice including this sentence must appear on any copies of this computer software. 14 | ''' 15 | 16 | ''' 17 | This module implements measurements to calculate the transfer entropy between users. The main function 18 | for TE calculation requires the jpype package, a Python-to-Java bridge that calls the Java 19 | infodynamics library through a Java Virtual Machine.
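A typical call looks like the following (illustrative only: the file name and user ids
are placeholders, and the input must follow the column order described below):

    df = pd.read_csv('github_events.csv')
    te = getTransferEntropy(df, 'user_a', 'user_b', realSeries=False)

Note that getTransferEntropy starts the JVM with infodynamics.jar and shuts it down
again before returning, so every call pays the JVM start-up cost.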
20 | 21 | These measurements assume that the data is in the order id,created_at,type,actor.id,repo.id 22 | ''' 23 | 24 | ''' 25 | This method takes a list of times and transforms them to a time series 26 | 27 | Input: List of created times 28 | Output: List representing a time series (differences) 29 | ''' 30 | def getTimeSeriesInSecs(ts_list): 31 | base_time = datetime.strptime('2015-01-01T00:00:00Z', '%Y-%m-%dT%H:%M:%SZ') 32 | secSet = set() 33 | for timeVal in ts_list: 34 | time_std = datetime.strptime(timeVal, '%Y-%m-%dT%H:%M:%SZ') 35 | diff = time_std - base_time 36 | secSet.add(int(diff.total_seconds())) 37 | 38 | secList = sorted(list(secSet)) 39 | return secList 40 | 41 | ''' 42 | This method bins the time series into discrete bins 43 | 44 | Input: totalBins - Total Number of bins 45 | binSize - Size of the bins 46 | timeSeries - This is the list representation of the time series 47 | ''' 48 | def getBinnedTimeSeriesSingleBinary(totalBins, binSize, timeSeries): 49 | tsBinned = np.zeros((totalBins), dtype=int) 50 | for timeVal in timeSeries: 51 | idx = (timeVal // binSize) 52 | tsBinned[idx] = 1 53 | 54 | return tsBinned 55 | 56 | ''' 57 | This method bins the time series into real valued bins 58 | 59 | Input: totalBins - Total Number of Bins 60 | binSize - Size of the bins 61 | timeSeries - This is the list representation of the time series 62 | ''' 63 | def getBinnedTimeSeriesSingleRealVal(totalBins, binSize, timeSeries): 64 | tsBinned = np.zeros((totalBins), dtype=float) 65 | for timeVal in timeSeries: 66 | idx = int((timeVal // binSize)) 67 | tsBinned[idx] = tsBinned[idx] + 1.00 68 | 69 | return tsBinned.tolist() 70 | 71 | ''' 72 | This method calculates the transfer entropy (TE) between two binary time series 73 | 74 | Input: src - This is the source time series 75 | dest - This is the destination time series 76 | delayParam - This is the parameter that controls the delay when calculating the TE 77 | 78 | Output: Value of Transfer Entropy between the source and destination time series. 79 | ''' 80 | def getTETimeSeriesPairBinary(src, dest, delayParam): 81 | teCalcClass = jpype.JPackage("infodynamics.measures.discrete").TransferEntropyCalculatorDiscrete 82 | teCalc = teCalcClass(2, 1, 1, 1, 1, delayParam) 83 | 84 | teCalc.initialise() 85 | teCalc.addObservations(src, dest) 86 | te = teCalc.computeAverageLocalOfObservations() 87 | 88 | return te 89 | 90 | ''' 91 | This method calculates the transfer entropy (TE) between two real time series 92 | 93 | Input: src - This is the source time series 94 | dest - This is the destination time series 95 | delayParam - This is the parameter that controls the delay when calculating the TE 96 | 97 | Output: Value of Transfer Entropy between the source and destination time series. 
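Example (illustrative values; the JVM must already be running so that the
infodynamics classes can be loaded):

    src = getBinnedTimeSeriesSingleRealVal(4, 3600, [10, 30, 4000, 7300])
    dest = getBinnedTimeSeriesSingleRealVal(4, 3600, [20, 3700, 7400])
    te = getTETimeSeriesPairRealValued(src, dest, 1)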
98 | ''' 99 | def getTETimeSeriesPairRealValued(src, dest, delay): 100 | teCalcClass = JPackage("infodynamics.measures.continuous.kraskov").TransferEntropyCalculatorKraskov 101 | teCalc = teCalcClass() 102 | teCalc.setProperty("NORMALISE", "true") # Normalise the individual variables 103 | teCalc.setProperty("k", "3") # Use Kraskov parameter K=4 for 4 nearest points 104 | 105 | teCalc.initialise(1, 1, 1, 1, delay) # Use history length 1 (Schreiber k=1) 106 | teCalc.setObservations(JArray(JDouble, 1)(src), JArray(JDouble, 1)(dest)) 107 | te = teCalc.computeAverageLocalOfObservations() 108 | 109 | return te 110 | 111 | 112 | ''' 113 | This method calculates the Transfer entropy for two users and a given dataframe 114 | 115 | Input: df - Data frame to extract user data from. This can be any subset of data 116 | user1 - The id of the first user (source user) 117 | user2 - The id of the second user (destination user) 118 | realSeries - Boolean that indicates whether or not the time series should be binned into real or discrete values 119 | 120 | Output: Transfer Entropy between the two users 121 | ''' 122 | def getTransferEntropy(df,user1,user2,realSeries=False): 123 | 124 | df.columns = ['id', 'time', 'type', 'user', 'repo'] 125 | 126 | user1Series = df[df.user == user1]['time'].tolist() 127 | user2Series = df[df.user == user2]['time'].tolist() 128 | user1Series = getTimeSeriesInSecs(user1Series) 129 | user2Series = getTimeSeriesInSecs(user2Series) 130 | 131 | binSize = 10800 # 3 hours = 10800 secs 132 | maxTime = max(max(user1Series), max(user2Series)) 133 | totalbins = int(np.ceil(maxTime / float(binSize))) 134 | 135 | te = 0.0 136 | 137 | ##Jar location for the infodynamics package 138 | jarLocation = "./infodynamics.jar" 139 | 140 | # Start the JVM (add the "-Xmx" option with say 1024M if you get crashes due to not enough memory space) 141 | jpype.startJVM(jpype.getDefaultJVMPath(), "-ea", "-Djava.class.path=" + jarLocation) 142 | 143 | 144 | if realSeries: 145 | user1Series = getBinnedTimeSeriesSingleRealVal(totalbins,binSize,user1Series) 146 | user2Series = getBinnedTimeSeriesSingleRealVal(totalbins,binSize,user2Series) 147 | te = getTETimeSeriesPairRealValued(user1Series, user2Series, 3) 148 | else: 149 | user1Series = getBinnedTimeSeriesSingleBinary(totalbins, binSize, user1Series) 150 | user2Series = getBinnedTimeSeriesSingleBinary(totalbins, binSize, user2Series) 151 | te = getTETimeSeriesPairBinary(user1Series,user2Series,1) 152 | 153 | jpype.shutdownJVM() 154 | 155 | return te 156 | 157 | 158 | 159 | 160 | 161 | -------------------------------------------------------------------------------- /github-measurements-old/UserCentricMeasurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from multiprocessing import Pool 4 | from functools import partial 5 | 6 | ''' 7 | Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract 8 | No. DE-AC05-76RL01830 with the Department of Energy (DOE). All rights in the computer software are reserved by DOE on 9 | behalf of the United States Government and the Contractor as provided in the Contract. You are authorized to use this 10 | computer software for Governmental purposes but it is not to be released or distributed to the public. NEITHER THE 11 | GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS 12 | SOFTWARE. 
This notice including this sentence must appear on any copies of this computer software. 13 | ''' 14 | 15 | ''' 16 | This class implements user centric method. Each function will describe which metric it is used for according 17 | to the questions number and mapping. 18 | 19 | These metrics assume that the data is in the order id,created_at,type,actor.id,repo.id 20 | ''' 21 | 22 | 23 | ''' 24 | This method returns the number of unique repos that a particular set of users contributed too 25 | 26 | Question #17 27 | 28 | Inputs: DataFrame - Desired dataset 29 | users - A list of users of interest 30 | 31 | Output: A dataframe with the user id and the number of repos contributed to 32 | ''' 33 | def getUserUniqueRepos(df,users=None): 34 | df = df.copy() 35 | df.columns = ['time', 'event','user', 'repo'] 36 | if users: 37 | df = df[df.user.isin(users)] 38 | df =df.groupby('user') 39 | data = df.repo.nunique().reset_index() 40 | data.columns = ['user','value'] 41 | return data 42 | 43 | 44 | ''' 45 | This method returns the cumulative activity of the desire user over time. 46 | 47 | Question #19 48 | 49 | Inputs: DataFrame - Desired dataset 50 | users - A list of users of interest 51 | 52 | Output: A grouped dataframe of the users activity over time 53 | ''' 54 | def getUserActivityTimeline(df, users=None,time_bin='1d',cumSum=False): 55 | df = df.copy() 56 | df.columns = ['time', 'event','user', 'repo'] 57 | df['time'] = pd.to_datetime(df['time']) 58 | if users: 59 | df = df[df.user.isin(users)] 60 | df['value'] = 1 61 | if cumSum: 62 | df['cumsum'] = df.groupby('user').value.transform(pd.Series.cumsum) 63 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).max().reset_index() 64 | df['value'] = df['cumsum'] 65 | df = df.drop('cumsum',axis=1) 66 | else: 67 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).sum().reset_index() 68 | 69 | #timeGrouper 70 | data = df.sort_values(['user', 'time']) 71 | return data 72 | 73 | ''' 74 | This method returns the top k most popular users for the dataset, where popularity is measured 75 | as the total popularity of the repos created by the user. 76 | 77 | Question #25 78 | 79 | Inputs: DataFrame - Desired dataset 80 | k - (Optional) The number of users that you would like returned. 81 | use_metadata - External metadata file containing repo owners. Otherwise use first observed user with a creation event as a proxy for the repo owner. 
82 | 83 | Output: A dataframe with the user ids and number events for that user 84 | ''' 85 | def getUserPopularity(df,k=10,metadata_file = ''): 86 | 87 | if metadata_file != '': 88 | repo_metadata = pd.read_csv(metadata_file) 89 | repo_metadata = repo_metadata[['full_name_h','owner.login_h']] 90 | 91 | df = df.copy() 92 | df.columns = ['time', 'event','user', 'repo'] 93 | df['value'] = 1 94 | 95 | repo_popularity = df[df['event'].isin(['ForkEvent','WatchEvent'])].groupby('repo')['value'].sum().reset_index() 96 | 97 | if metadata_file != '': 98 | merged = repo_popularity.merge(repo_metadata,left_on='repo',right_on='full_name_h',how='left') 99 | else: 100 | user_repos = df[df['event'] == 'CreateEvent'].sort_values('time').drop_duplicates(subset='repo',keep='first') 101 | user_repos = user_repos[['user','repo']] 102 | user_repos.columns = ['owner.login_h','repo'] 103 | merged = user_repos.merge(repo_popularity,on='repo',how='left') 104 | 105 | measurement = merged.groupby('owner.login_h').value.sum().sort_values(ascending=False).head(k) 106 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 107 | return measurement 108 | 109 | ''' 110 | This method returns the average time between events for each user 111 | 112 | Question #29b and c 113 | 114 | Inputs: df - Data frame of all data for repos 115 | users - (Optional) List of specific users to calculate the metric for 116 | nCPu - (Optional) Number of CPU's to run metric in parallel 117 | 118 | Outputs: A list of average times for each user. Length should match number of repos 119 | ''' 120 | def getAvgTimebwEvents(df,users=None, nCPU=1): 121 | df = df.copy() 122 | df.columns = ['time', 'event', 'user', 'repo'] 123 | df['time'] = pd.to_datetime(df['time']) 124 | 125 | if users == None: 126 | users = df['user'].unique() 127 | 128 | p = Pool(nCPU) 129 | args = [(df, users[i]) for i, item_a in enumerate(users)] 130 | deltas = p.map(getMeanTimeHelper, args) 131 | p.join() 132 | p.close() 133 | return deltas 134 | 135 | ''' 136 | Helper function for getting the average time between events 137 | 138 | Inputs: Same as average time between events 139 | Output: Same as average time between events 140 | ''' 141 | def getMeanTime(df, user): 142 | d = df[df.user == user] 143 | d = d.sort_values(by='time') 144 | delta = np.mean(np.diff(d.time)) / np.timedelta64(1, 's') 145 | return delta 146 | 147 | 148 | def getMeanTimeHelper(args): 149 | return getMeanTime(*args) 150 | 151 | ''' 152 | This method returns distribution the diffusion delay for each user 153 | 154 | Question #27 155 | 156 | Inputs: DataFrame - Desired dataset 157 | unit - (Optional) This is the unit that you want the distribution in. Check np.timedelta64 documentation 158 | for the possible options 159 | metadata_file - File containing user account creation times. Otherwise use first observed action of user as proxy for account creation time. 
160 | 161 | Output: A list (array) of deltas in units specified 162 | ''' 163 | def getUserDiffusionDelay(df,unit='s',metadata_file = ''): 164 | 165 | if metadata_file != '': 166 | user_metadata = pd.read_csv(metadata_file) 167 | user_metadata['created_at'] = pd.to_datetime(user_metadata['created_at']) 168 | 169 | 170 | df = df.copy() 171 | df.columns = ['time','event','user','repo'] 172 | df['value'] = df['time'] 173 | df['value'] = pd.to_datetime(df['value']) 174 | 175 | if metadata_file != '': 176 | df = df.merge(user_metadata[['login_h','created_at']],left_on='user',right_on='login_h',how='left') 177 | df = df[['login_h','created_at','value']].dropna() 178 | measurement = df['value'].sub(df['created_at']).apply(lambda x: int(x / np.timedelta64(1, unit))) 179 | else: 180 | grouped = df.groupby('user') 181 | transformed = grouped['value'].transform('min') 182 | measurement = df['value'].sub(transformed).apply(lambda x: int(x / np.timedelta64(1, unit))) 183 | 184 | 185 | 186 | return measurement 187 | 188 | 189 | ''' 190 | This method returns the gini coefficient for user events. (User Disparity) 191 | 192 | Question #26a 193 | 194 | Inputs: DataFrame - Desired dataset 195 | 196 | 197 | Output: The gini coefficient for the dataset 198 | ''' 199 | def getGiniCoef(df): 200 | df = df.copy() 201 | df.columns = ['time', 'event', 'user', 'repo'] 202 | df['value'] = 1 203 | df = df.groupby('user') 204 | event_counts = df.value.sum() 205 | values = np.sort(np.array(event_counts)) 206 | 207 | cdf = np.cumsum(values) / float(np.sum(values)) 208 | percent_nodes = np.arange(len(values)) / float(len(values)) 209 | 210 | g = 1 - 2*np.trapz(x=percent_nodes,y=cdf) 211 | return g 212 | 213 | ''' 214 | This method returns the palma coefficient for user events. (User Disparity) 215 | 216 | Question #26b 217 | 218 | Inputs: DataFrame - Desired dataset 219 | 220 | 221 | Output: p - The palma coefficient for the dataset 222 | data - dataframe showing the CDF and Node percentages. (Mainly used for plotting) 223 | ''' 224 | def getPalmaCoef(df): 225 | df = df.copy() 226 | df.columns = ['time', 'event', 'user', 'repo'] 227 | df['value'] = 1 228 | df = df.groupby('user') 229 | event_counts = df.value.sum() 230 | 231 | 232 | values = np.sort(np.array(event_counts)) 233 | 234 | 235 | cdf = np.cumsum(values) / float(np.sum(values)) 236 | percent_nodes = np.arange(len(values)) / float(len(values)) 237 | 238 | 239 | p10 = np.sum(values[percent_nodes >= 0.9]) 240 | p40 = np.sum(values[percent_nodes <= 0.4]) 241 | 242 | 243 | p = float(p10) / float(p40) 244 | 245 | x = cdf 246 | y = percent_nodes 247 | data = pd.DataFrame({'cum_nodes': y, 'cum_value': x}) 248 | 249 | return p 250 | 251 | ''' 252 | This method returns the top k users with the most events. 253 | 254 | Question #24b 255 | 256 | Inputs: DataFrame - Desired dataset. Used mainly when dealing with subset of events 257 | k - Number of users to be returned 258 | 259 | Output: Dataframe with the user ids and number of events 260 | ''' 261 | def getMostActiveUsers(df,k=10): 262 | df = df.copy() 263 | df.columns = ['time', 'event', 'user', 'repo'] 264 | dft = df 265 | dft['value'] = 1 266 | dft = df.groupby('user') 267 | measurement = dft.value.sum().sort_values(ascending=False).head(k) 268 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 269 | return measurement 270 | 271 | ''' 272 | This method returns the distribution for the users activity (event counts). 
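For example (illustrative counts), a user with 12 PushEvents and 3 IssuesEvents
contributes a value of 15, or 12 if eventType='PushEvent' is passed.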
273 | 274 | Question #24a 275 | 276 | Inputs: DataFrame - Desired dataset 277 | eventType - (Optional) Desired event type to use 278 | 279 | Output: List containing the event counts per user 280 | ''' 281 | def getUserActivityDistribution(df,eventType=None): 282 | df = df.copy() 283 | df.columns = ['time', 'event', 'user', 'repo'] 284 | if eventType != None: 285 | df = df[df.event == eventType] 286 | df['value'] = 1 287 | df = df.groupby('user') 288 | measurement = df.value.sum().reset_index() 289 | return measurement 290 | -------------------------------------------------------------------------------- /github-measurements-old/UserMeasurementsWithPlot.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract 3 | No. DE-AC05-76RL01830 with the Department of Energy (DOE). All rights in the computer software are reserved by DOE on 4 | behalf of the United States Government and the Contractor as provided in the Contract. You are authorized to use this 5 | computer software for Governmental purposes but it is not to be released or distributed to the public. NEITHER THE 6 | GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS 7 | SOFTWARE. This notice including this sentence must appear on any copies of this computer software. 8 | ''' 9 | from plots import * 10 | 11 | ''' 12 | The following is the user measurment functions previously released with plotting added. The plots are currently all printed. 13 | ''' 14 | 15 | ''' 16 | This method returns the number of unique repos that a particular set of users contributed too 17 | 18 | Question #18 19 | 20 | Inputs: DataFrame - Desired dataset 21 | users - A list of users of interest 22 | log - to plot with log values default false 23 | 24 | Output: A dataframe with the user id and the number of repos contributed to 25 | ''' 26 | def getUserUniqueRepos(df,users, log=False): 27 | df.columns = ['id', 'time', 'event','user', 'repo'] 28 | df = df[df.user.isin(users)] 29 | df =df.groupby('user') 30 | data = df.repo.nunique() 31 | td = data 32 | print plot_top_users(data,'User','Unique Repos Contributed To','Quantity of Repos Users Contributed To') 33 | 34 | return td 35 | 36 | ''' 37 | This method returns the cumulative activity of the desire user over time. 
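Counts are binned by day, days without activity are filled with zero, and a
cumulative sum is taken per user, so the resulting timeline is non-decreasing.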
38 | 39 | Question #20 40 | 41 | Inputs: DataFrame - Desired dataset 42 | users - A list of users of interest 43 | 44 | Output: A grouped dataframe of the users activity over time 45 | ''' 46 | def getUserActivityTimeline(df, users, log=False): 47 | df.columns = ['id', 'time', 'event','user', 'repo'] 48 | 49 | df = df[df.user.isin(users)] 50 | df['value'] = 1 51 | 52 | df['time'] = pd.to_datetime(df['time']) 53 | df['time'] = df['time'].dt.strftime('%Y-%m-%d') 54 | df = df.groupby(['user','time']).sum() 55 | 56 | minDate = df.index.min()[1] 57 | maxDate = df.index.max()[1] 58 | 59 | idx = pd.date_range(minDate, maxDate) 60 | ndf = pd.DataFrame() 61 | first = 0 62 | for u in users: 63 | d = df.loc[u] 64 | d.index = pd.DatetimeIndex(d.index) 65 | d = d[['value']].reindex(idx).fillna(0) 66 | d = d.cumsum() 67 | d['user'] = u 68 | d = d.reset_index() 69 | if first == 0: 70 | first = 1 71 | ndf = d 72 | continue 73 | ndf = pd.concat([ndf,d]) 74 | ndf.columns = ['time','value','user'] 75 | ndf['time'] = pd.to_datetime(ndf['time']) 76 | ndf = ndf.sort_values(['time']) 77 | ndf = ndf.set_index(['time']) 78 | 79 | print plot_activity_timeline(ndf,'Time','Total Number of Contributions','Cumulutive Sum of Contributions') 80 | 81 | return ndf 82 | 83 | 84 | ''' 85 | This method returns the top k most popular users for the dataset. 86 | 87 | Question #27 88 | 89 | Inputs: DataFrame - Desired dataset 90 | k - (Optional) The number of users that you would like returned. 91 | 92 | Output: A dataframe with the user ids and number events for that user 93 | ''' 94 | def getUserPopularity(df,k=10, log=False): 95 | df.columns = ['id', 'time', 'event','user', 'repo'] 96 | df['value'] = 1 97 | 98 | repo_popularity = df[df['event'] != 'CreateEvent'].groupby('repo')['value'].sum().reset_index() 99 | user_repos = df[df['event'] == 'CreateEvent'].sort_values('time').drop_duplicates(subset='repo',keep='first') 100 | merged = user_repos[['user','repo']].merge(repo_popularity,on='repo',how='left') 101 | measurement = merged.groupby('user').value.sum().sort_values(ascending=False).head(k) 102 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 103 | 104 | print plot_top_users(measurement,'User Popularity','User','User Popularity') 105 | 106 | return measurement 107 | 108 | 109 | ''' 110 | Helper function for getting the average time between events 111 | 112 | Inputs: Same as average time between events 113 | Output: Same as average time between events 114 | ''' 115 | def getMeanTime(df,r): 116 | d = df[df.repo == r] 117 | d = d.sort_values(by='time') 118 | delta = np.mean(np.diff(d.time)) / np.timedelta64(1, 's') 119 | return delta 120 | 121 | 122 | ''' 123 | This method returns the average time between events for each user 124 | 125 | Question #29b and c 126 | 127 | Inputs: df - Data frame of all data for repos 128 | repos - (Optional) List of specific users to calculate the measurement for 129 | nCPu - (Optional) Number of CPU's to run measurement in parallel 130 | 131 | Outputs: A list of average times for each user. 
Length should match number of repos 132 | ''' 133 | def getAvgTimebwEvents(df,users=None, nCPU=1): 134 | df.columns = ['id','time', 'event', 'user', 'repo'] 135 | df['time'] = pd.to_datetime(df['time']) 136 | 137 | if users == None: 138 | users = df['user'].unique() 139 | 140 | p = Pool(nCPU) 141 | mean_time_partial = partial(getMeanTime,df=df) 142 | deltas = p.map(mean_time_partial,users) 143 | 144 | 145 | _,bins = np.histogram(deltas,bins='auto') 146 | 147 | measurement = pd.DataFrame(deltas) 148 | 149 | measurement.plot(kind='hist',bins=bins,legend=False,cumulative=False,normed=False,figsize=(10,7)) 150 | plt.xlabel('Time Between PullRequestEvents in Seconds',fontsize=20) 151 | plt.ylabel('Number of Repos',fontsize=20) 152 | plt.title('Average Time Between PullRequestEvents',fontsize=20) 153 | plt.xticks(fontsize=15) 154 | plt.yticks(fontsize=15) 155 | plt.tight_layout() 156 | print plt.show() 157 | return deltas 158 | 159 | ''' 160 | This method returns distribution the diffusion delay for each user 161 | 162 | Question #29 163 | 164 | Inputs: DataFrame - Desired dataset 165 | unit - (Optional) This is the unit that you want the distribution in. Check np.timedelta64 documentation 166 | for the possible options 167 | 168 | Output: A list (array) of deltas in units specified 169 | ''' 170 | def getUserDiffusionDelay(df,unit='s', log=False): 171 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 172 | df['value'] = df['time'] 173 | df['value'] = pd.to_datetime(df['value']) 174 | grouped = df.groupby('user') 175 | transformed = grouped['value'].transform('min') 176 | delta = df['value'].sub(transformed).apply(lambda x: int(x / np.timedelta64(1, unit))) 177 | 178 | print plot_histogram(delta,'User Activity Delay','Number of Users','Diffusion Delay') 179 | 180 | return delta 181 | 182 | 183 | ''' 184 | This method returns the gini coefficient for user events. (User Disparity) 185 | 186 | Question #28 187 | 188 | Inputs: DataFrame - Desired dataset 189 | 190 | 191 | Output: The gini coefficient for the dataset 192 | ''' 193 | def getGiniCoef(df): 194 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 195 | df['value'] = 1 196 | df = df.groupby('user') 197 | event_counts = df.value.sum() 198 | values = np.sort(np.array(event_counts)) 199 | 200 | cdf = np.cumsum(values) / float(np.sum(values)) 201 | percent_nodes = np.arange(len(values)) / float(len(values)) 202 | 203 | g = 1 - 2*np.trapz(x=percent_nodes,y=cdf) 204 | return g 205 | 206 | 207 | ''' 208 | This method returns the palma coefficient for user events. (User Disparity) 209 | 210 | Question #28 211 | 212 | Inputs: DataFrame - Desired dataset 213 | 214 | 215 | Output: p - The palma coefficient for the dataset 216 | data - dataframe showing the CDF and Node percentages. 
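Example (illustrative numbers): if the 10% most active users account for 500 events
and the 40% least active account for 100 events, the Palma coefficient is
500 / 100 = 5.0.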
(Mainly used for plotting) 217 | ''' 218 | def getPalmaCoef(df): 219 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 220 | df['value'] = 1 221 | df = df.groupby('user') 222 | event_counts = df.value.sum() 223 | 224 | values = np.sort(np.array(event_counts)) 225 | 226 | cdf = np.cumsum(values) / float(np.sum(values)) 227 | percent_nodes = np.arange(len(values)) / float(len(values)) 228 | 229 | p10 = np.sum(values[percent_nodes >= 0.9]) 230 | p40 = np.sum(values[percent_nodes <= 0.4]) 231 | 232 | p = float(p10) / float(p40) 233 | 234 | x = cdf 235 | y = percent_nodes 236 | data = pd.DataFrame({'cum_nodes': y, 'cum_value': x}) 237 | 238 | print plot_palma(data,'Cumulative share of Repos','Cumulative share of Events','User Event Dispartiy') 239 | 240 | return p,data 241 | 242 | ''' 243 | This method returns the top k users with the most events. 244 | 245 | Question #26b 246 | 247 | Inputs: DataFrame - Desired dataset. Used mainly when dealing with subset of events 248 | k - Number of users to be returned 249 | 250 | Output: Dataframe with the user ids and number of events 251 | ''' 252 | def getMostActiveUsers(df,k=10, log=True): 253 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 254 | df['value'] = 1 255 | df = df.groupby('user') 256 | measurement = df.value.sum().sort_values(ascending=False).head(k) 257 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 258 | 259 | print plot_top_users(measurement,'User','User Activity','Top Users') 260 | 261 | 262 | ''' 263 | This method returns the distribution for the users activity (event counts). 264 | 265 | Question #26a 266 | 267 | Inputs: DataFrame - Desired dataset 268 | eventType - (Optional) Desired event type to use 269 | 270 | Output: List containing the event counts per user 271 | ''' 272 | def getUserActivityDistribution(df,eventType=None): 273 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 274 | if eventType != None: 275 | df = df[df.event == eventType] 276 | df['value'] = 1 277 | df = df.groupby('user') 278 | 279 | print plot_histogram(d.value.values,'Total Activity','Number of Users','User Activity Distribution') 280 | 281 | return np.array(measurement).tolist() -------------------------------------------------------------------------------- /github-measurements-old/load_data.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | 4 | def load_data(): 5 | 6 | path = '/Users/grac833/Documents/Projects/SocialSim/temp/infrastructure/tira/services/GithubMetricServices' 7 | 8 | dfs = [] 9 | for i in range(1,3): 10 | i = str(i) 11 | if len(i) == 1: 12 | i = '0' + i 13 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-07-' + str(i) + ' 00:00:00.csv') 14 | dfs.append(df) 15 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-08-' + str(i) + ' 00:00:00.csv') 16 | dfs.append(df) 17 | gt = pd.concat(dfs) 18 | 19 | dfs = [] 20 | for i in range(1, 3): 21 | i = str(i) 22 | if len(i) == 1: 23 | i = '0' + i 24 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-07-' + str(i) + ' 00:00:00.csv') 25 | dfs.append(df) 26 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-08-' + str(i) + ' 00:00:00.csv') 27 | dfs.append(df) 28 | sim1 = pd.concat(dfs) 29 | 30 | gt = gt.drop("_id", axis=1) 31 | sim1 = sim1.drop("_id", axis=1) 32 | 33 | print(sim1) 34 | 35 | return gt,sim1 36 | 37 | 38 | if __name__ == "__main__": 39 | 40 | load_data() 41 | -------------------------------------------------------------------------------- 
/github-measurements-old/plots.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import datetime 4 | from multiprocessing import Pool 5 | from functools import partial 6 | import matplotlib.pyplot as plt 7 | import matplotlib.mlab as mlab 8 | from datetime import datetime 9 | import seaborn as sns 10 | import matplotlib.dates as dates 11 | import calendar 12 | from itertools import * 13 | from matplotlib import rcParams 14 | rcParams.update({'figure.autolayout': True}) 15 | 16 | def savePlots(loc, plt): 17 | plt.savefig(loc) 18 | 19 | event_colors = {'CommitCommentEvent':'#e59400', 20 | 'CreateEvent':'#B2912F', 21 | 'DeleteEvent':'#B276B2', 22 | 'ForkEvent':'#4D4D4D', 23 | 'IssueCommentEvent':'#DECF3F', 24 | 'IssuesEvent':'#60BD68', 25 | 'PullRequestEvent':'#5DA5DA', 26 | 'PullRequestReviewCommentEvent':'#D3D3D3', 27 | 'PushEvent':'#F17CB0', 28 | 'WatchEvent':'#F15854'} 29 | 30 | def plot_histogram(data,xlabel,ylabel,title, log=False, loc=False): 31 | 32 | sns.set_style('whitegrid') 33 | sns.set_context('talk') 34 | 35 | ##ploting Histogram 36 | _,bins = np.histogram(data,bins='doane') 37 | 38 | measurement = pd.DataFrame(data) 39 | 40 | measurement.plot(kind='hist',bins=bins,legend=False,cumulative=False,normed=False,log=log) 41 | 42 | plt.xlabel(xlabel) 43 | plt.ylabel(ylabel) 44 | plt.title(title) 45 | plt.tight_layout() 46 | 47 | if loc != False: 48 | savePlots(loc,plt) 49 | return 50 | 51 | return plt.show() 52 | 53 | def plot_line_graph(data,xlabel,ylabel,title,labels="",loc=False): 54 | sns.set_style('whitegrid') 55 | sns.set_context('talk') 56 | 57 | ##plotting line graph 58 | _,bins = np.histogram(data,bins='auto') 59 | 60 | Watchmeasurement = pd.DataFrame(data) 61 | 62 | tx = [x for x in range(len(data))] 63 | 64 | plt.figure(figsize=(10,7)) 65 | plt.plot(tx, data, label=labels) 66 | 67 | plt.xlabel(xlabel, fontsize=20) 68 | plt.ylabel(ylabel, fontsize=20) 69 | plt.title(title, fontsize=20) 70 | plt.legend(fontsize=15) 71 | plt.xticks(fontsize=15) 72 | plt.tight_layout() 73 | 74 | if loc != False: 75 | savePlots(loc,plt) 76 | return 77 | return plt.show() 78 | 79 | def plot_time_series(data,xlabel,ylabel,title,loc=False): 80 | 81 | plt.clf() 82 | sns.set_style('whitegrid') 83 | sns.set_context('talk') 84 | p = data 85 | plt.plot(p['date'],p['value']) 86 | 87 | 88 | plt.xticks(fontsize=15) 89 | plt.yticks(fontsize=15) 90 | plt.xlabel(xlabel, fontsize=20) 91 | plt.ylabel(ylabel, fontsize=20) 92 | plt.title(title, fontsize=20) 93 | plt.xticks(rotation=45) 94 | 95 | plt.tight_layout() 96 | 97 | if loc != False: 98 | savePlots(loc,plt) 99 | return 100 | 101 | return plt.show() 102 | 103 | def plot_contributions_oneline(data,xlabel,ylabel,title,loc=False): 104 | 105 | sns.set_style('whitegrid') 106 | sns.set_context('talk') 107 | 108 | p = data 109 | ax = plt.gca() 110 | labels = [str(x) for x in p.date.values] 111 | plt.clf() 112 | plt.plot(p.date.values, p.value.values, label='Unique Users per Day') 113 | 114 | plt.xticks(fontsize=15) 115 | plt.yticks(fontsize=15) 116 | plt.xlabel(xlabel, fontsize=20) 117 | plt.ylabel(ylabel, fontsize=20) 118 | plt.title(title) 119 | plt.legend() 120 | plt.xticks(rotation=45) 121 | plt.tight_layout() 122 | 123 | 124 | if loc != False: 125 | savePlots(loc,plt) 126 | return 127 | 128 | return plt.show() 129 | 130 | def plot_contributions_twolines(containsDup,noDups,xlabel,ylabel,title,loc=False): 131 | 132 | plt.clf() 133 | fig = 
plt.figure(figsize=(18,15)) 134 | ax = fig.add_subplot(221) 135 | labels = [str(x)[:10] for x in containsDup.date.values] 136 | ys = [x for x in range(len(containsDup))] 137 | 138 | plt.plot(ys, containsDup.user.values, label='Unique Users per Day') 139 | plt.plot(ys, noDups.user.values, label='Unique Users Overall') 140 | ax.set_xticklabels(labels=labels, fontsize=20) 141 | 142 | # ax.tick_params(labelsize=15) 143 | plt.tight_layout() 144 | plt.xlabel('Time',fontsize=20) 145 | plt.ylabel('Number of Users',fontsize=20) 146 | plt.title('Cumulative Number of Contributing Users Over Time',fontsize=20) 147 | plt.legend(loc=2, prop={'size': 15}) 148 | plt.xticks(rotation=45) 149 | plt.xticks(fontsize=15) 150 | plt.yticks(fontsize=15) 151 | 152 | if loc != False: 153 | savePlots(loc,plt) 154 | return 155 | 156 | return plt.show() 157 | 158 | def plot_palma_gini(data,xlabel,ylabel,title,loc=False): 159 | data.plot(x = 'cum_nodes',y='cum_value',legend=False) 160 | plt.ylabel(ylabel) 161 | plt.xlabel(xlabel) 162 | plt.plot([0,1],[0,1],linestyle='--',color='k') 163 | plt.tight_layout() 164 | plt.title(title) 165 | if loc != False: 166 | savePlots(loc,plt) 167 | return 168 | return plt.show() 169 | 170 | def plot_distribution_of_events(data,weekday,loc=False): 171 | p = pd.DataFrame(data) 172 | p = p.reset_index() 173 | if weekday == True: 174 | p = p.rename(index=str, columns={'weekday': 'date'}) 175 | p = p.reset_index() 176 | p = p.pivot(index='date', columns='event', values='value').fillna(0) 177 | tp = p.reset_index() 178 | tp.set_index('date') 179 | del tp['date'] 180 | total = tp.sum(axis=1) 181 | for ele in tp.columns: 182 | if ele == 'date': 183 | continue 184 | tp[ele] = tp[ele] 185 | 186 | plt.clf() 187 | sns.set_style('whitegrid') 188 | sns.set_context('talk') 189 | 190 | ax = plt.gca() 191 | 192 | calIndex = list(calendar.day_name) 193 | labels = [str(x)[:10] for x in p.index.values] 194 | 195 | title = 'Days' 196 | if weekday == True: 197 | labels = [calIndex[i] for i in range(len(labels))] 198 | title = 'Weekday' 199 | my_colors = list(islice(cycle([ '#B2912F', '#4D4D4D', '#DECF3F','#60BD68','#5DA5DA','#D3D3D3','#F17CB0','#F15854','#B276B2', '#e59400']), None, len(tp))) 200 | 201 | tp.plot(ax=ax, color=[event_colors.get(x) for x in tp.columns],rot=0) 202 | ax.xaxis.set_ticks(np.arange(0,len(labels))) 203 | ax.set_xticklabels(labels=labels, rotation=45) 204 | plt.legend() 205 | plt.title('Distribution of Events per ' + title) 206 | plt.xlabel(title) 207 | plt.ylabel('Number of Events') 208 | 209 | plt.tight_layout() 210 | if loc != False: 211 | savePlots(loc,plt) 212 | return 213 | return plt.show() 214 | 215 | 216 | 217 | 218 | ############# 219 | #User Centric 220 | ############# 221 | 222 | def plot_top_users(data, xlabel,ylabel,title, log=False,loc=False): 223 | data = pd.DataFrame(data) 224 | 225 | data.plot(kind='bar',legend=False,log=log) 226 | plt.ylabel(ylabel) 227 | plt.xlabel(xlabel) 228 | plt.tight_layout() 229 | plt.title(title) 230 | if loc != False: 231 | savePlots(loc,plt) 232 | return 233 | return plt.show() 234 | 235 | def plot_activity_timeline(data,xlabel,ylabel,title, log=False,loc=False): 236 | p = data 237 | for u in p['user'].unique(): 238 | p[p['user'] == u]['value'].plot(legend=False,logy=False,label=u) 239 | 240 | plt.xticks(fontsize=15) 241 | plt.yticks(fontsize=15) 242 | plt.xlabel(xlabel, fontsize=20) 243 | plt.ylabel(ylabel, fontsize=20) 244 | plt.title(title, fontsize=20) 245 | plt.tight_layout() 246 | plt.xticks(rotation=45) 247 | if loc != False: 248 | 
savePlots(loc,plt) 249 | return 250 | return plt.show() 251 | 252 | ############ 253 | #Community 254 | ############ 255 | 256 | def plot_CommunityProportions(p,xlabel,ylabel,title, loc=False): 257 | data = pd.DataFrame(p) 258 | ax = data.plot(kind='bar',legend=False) 259 | ax.set_xticklabels(data.edgeType.values) 260 | plt.xlabel(xlabel) 261 | plt.ylabel(ylabel) 262 | plt.title(title) 263 | if loc != False: 264 | savePlots(loc,plt) 265 | return 266 | return plt.show() 267 | 268 | 269 | def plot_propIssueEvent(p, xlabel, ylabel,title, loc=False): 270 | 271 | plt.clf() 272 | fig = plt.figure(figsize=(18,15)) 273 | ax = fig.add_subplot(221) 274 | labels = [str(x)[:10] for x in p.index.values] 275 | ys = [x for x in range(len(p[p['issueType'] == 'closed']))] 276 | 277 | plt.plot(ys, p[p['issueType'] == 'closed'].counts.values, label='Closed') 278 | plt.plot(ys, p[p['issueType'] == 'opened'].counts.values, label='Opened') 279 | plt.plot(ys, p[p['issueType'] == 'reopened'].counts.values, label='ReOpened') 280 | ax.set_xticklabels(labels=labels, fontsize=20) 281 | 282 | plt.tight_layout() 283 | plt.xlabel(xlabel,fontsize=20) 284 | plt.ylabel(ylabel,fontsize=20) 285 | plt.title(title,fontsize=20) 286 | plt.legend(bbox_to_anchor=(-.25, .001), loc=2, prop={'size': 15}) 287 | plt.xticks(rotation=45) 288 | plt.xticks(fontsize=15) 289 | plt.yticks(fontsize=15) 290 | 291 | if loc != False: 292 | savePlots(loc,plt) 293 | return 294 | return plt.show() 295 | 296 | -------------------------------------------------------------------------------- /github-measurements/Measurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import datetime 4 | from multiprocessing import Pool 5 | from functools import partial 6 | from pathos import pools as pp 7 | import pickle as pkl 8 | from UserCentricMeasurements import * 9 | from RepoCentricMeasurements import * 10 | from CommunityCentricMeasurements import * 11 | from TEMeasurements import * 12 | from collections import defaultdict 13 | import jpype 14 | import json 15 | 16 | class Measurements(UserCentricMeasurements, RepoCentricMeasurements, TEMeasurements, CommunityCentricMeasurements): 17 | def __init__(self, dfLoc, interested_repos=[], interested_users=[], metaRepoData=False, metaUserData=False, 18 | repoActorsFile='data/filtUsers-test.pkl',reposFile='data/filtRepos-test.pkl',topNodes=[],topEdges=[], 19 | previousActionsFile='',community_dictionary='data/communities.pkl',te_config='te_params_dry_run2.json'): 20 | super(Measurements, self).__init__() 21 | 22 | try: 23 | #check if input is a data frame 24 | dfLoc.columns 25 | df = dfLoc 26 | except: 27 | #if not it should be a csv file path 28 | df = pd.read_csv(dfLoc) 29 | 30 | self.contribution_events = ["PullRequestEvent", "PushEvent", "IssuesEvent","IssueCommentEvent","PullRequestReviewCommentEvent","CommitCommentEvent","CreateEvent"] 31 | self.popularity_events = ['WatchEvent','ForkEvent'] 32 | 33 | print('preprocessing...') 34 | self.main_df = self.preprocess(df) 35 | 36 | print('splitting optional columns...') 37 | #store action and merged columns in a seperate data frame that is not used for most measurements 38 | if len(self.main_df.columns) == 6: 39 | self.main_df_opt = self.main_df.copy()[['action','merged']] 40 | self.main_df_opt['merged'] = self.main_df_opt['merged'].astype(bool) 41 | self.main_df = self.main_df.drop(['action','merged'],axis=1) 42 | else: 43 | self.main_df_opt = None 44 | 45 | 46 | 
#For repoCentric 47 | print('getting selected repos...') 48 | self.selectedRepos = self.getSelectRepos(interested_repos) #Dictionary of selected repos index == repoid 49 | 50 | #For userCentric 51 | self.selectedUsers = self.main_df[self.main_df.user.isin(interested_users)] 52 | 53 | print('processing repo metatdata...') 54 | #read in external metadata files 55 | #repoMetaData format - full_name_h,created_at,owner.login_h,language 56 | #userMetaData format - login_h,created_at,location,company 57 | if metaRepoData != False: 58 | self.useRepoMetaData = True 59 | self.repoMetaData = self.preprocessRepoMeta(pd.read_csv(metaRepoData)) 60 | else: 61 | self.useRepoMetaData = False 62 | print('processing user metatdata...') 63 | if metaUserData != False: 64 | self.useUserMetaData = True 65 | self.userMetaData = self.preprocessUserMeta(pd.read_csv(metaUserData)) 66 | else: 67 | self.useUserMetaData = False 68 | 69 | 70 | #For Community 71 | print('getting communities...') 72 | self.communities = self.getCommunities(path=community_dictionary) 73 | 74 | #read in previous events count external file (used only for one measurement) 75 | try: 76 | print('reading previous counts...') 77 | self.previous_event_counts = pd.read_csv(previousActionsFile) 78 | except: 79 | self.previous_event_counts = None 80 | 81 | 82 | #For TE 83 | print('starting jvm...') 84 | if not jpype.isJVMStarted(): 85 | jpype.startJVM(jpype.getDefaultJVMPath(), "-ea", "-Djava.class.path=" + "infodynamics.jar") 86 | 87 | self.top_users = topNodes 88 | self.top_edges = topEdges 89 | 90 | #read pkl files which define nodes of interest for TE measurements 91 | self.repo_actors = self.readPickleFile(repoActorsFile) 92 | self.repo_groups = self.readPickleFile(reposFile) 93 | 94 | #set TE parameters 95 | with open(te_config,'rb') as f: 96 | te_params = json.load(f) 97 | 98 | self.startTime = pd.Timestamp(te_params['startTime']) 99 | self.binSize= te_params['binSize'] 100 | self.teThresh = te_params['teThresh'] 101 | self.delayUnits = np.array(te_params['delayUnits']) 102 | self.starEvent = te_params['starEvent'] 103 | self.otherEvents = te_params['otherEvents'] 104 | self.kE = te_params['kE'] 105 | self.kN = te_params['kN'] 106 | self.nReps = te_params['nReps'] 107 | self.bGetTS = te_params['bGetTS'] 108 | 109 | 110 | 111 | def preprocess(self,df): 112 | #edit columns, convert date, sort by date 113 | if df.columns[0] == '_id': 114 | del df['_id'] 115 | if len(df.columns) == 4: 116 | df.columns = ['time', 'event', 'user', 'repo'] 117 | else: 118 | df.columns = ['time', 'event', 'user', 'repo','action','merged'] 119 | df = df[df.event.isin(self.popularity_events + self.contribution_events)] 120 | df['time'] = pd.to_datetime(df['time']) 121 | df = df.sort_values(by='time') 122 | df = df.assign(time=df.time.dt.floor('h')) 123 | return df 124 | 125 | def preprocessRepoMeta(self,df): 126 | try: 127 | df.columns = ['repo','created_at','owner_id','language'] 128 | except: 129 | df.columns = ['created_at','owner_id','repo'] 130 | df = df[df.repo.isin(self.main_df.repo.values)] 131 | df['created_at'] = pd.to_datetime(df['created_at']) 132 | #df = df.drop_duplicates('repo') 133 | return df 134 | 135 | def preprocessUserMeta(self,df): 136 | try: 137 | df.columns = ['user','created_at','location','company'] 138 | except: 139 | df.columns = ['user','created_at','city','country','company'] 140 | 141 | df = df[df.user.isin(self.main_df.user.values)] 142 | df['created_at'] = pd.to_datetime(df['created_at']) 143 | return df 144 | 145 | def 
readPickleFile(self,ipFile): 146 | 147 | with open(ipFile, 'rb') as handle: 148 | obj = pkl.load(handle) 149 | 150 | return obj 151 | -------------------------------------------------------------------------------- /github-measurements/UserCentricMeasurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import datetime 4 | from multiprocessing import Pool 5 | from functools import partial 6 | from pathos import pools as pp 7 | import pickle 8 | 9 | ''' 10 | This class implements user centric methods. Each function will describe which metric it is used for according 11 | to the question numbers and mapping. 12 | These metrics assume that the data is in the order id,created_at,type,actor.id,repo.id 13 | ''' 14 | 15 | class UserCentricMeasurements(object): 16 | def __init__(self): 17 | super(UserCentricMeasurements, self).__init__() 18 | 19 | ''' 20 | This function selects a subset of the full data set for a selected set of users and event types. 21 | Inputs: users - A boolean or a list of users. If it is a list of user ids (login_h) the data frame is subset on only this list of users. 22 | If it is True, then the pre-selected node-level subset is used. If False, then all users are included. 23 | eventType - A list of event types to include in the data set 24 | 25 | Output: A data frame with only the selected users and event types. 26 | ''' 27 | def determineDf(self,users,eventType): 28 | 29 | if users == True: 30 | #self.selectedUsers is a data frame containing only the users in interested_users 31 | df = self.selectedUsers 32 | elif users != False: 33 | df = self.main_df[self.main_df.user.isin(users)] 34 | else: 35 | df = self.main_df 36 | 37 | if eventType != None: 38 | df = df[df.event.isin(eventType)] 39 | 40 | return df 41 | 42 | ''' 43 | This method returns the number of unique repos that a particular set of users contributed to 44 | Question #17 45 | Inputs: selectedUsers - A list of users of interest or a boolean indicating whether to subset to the node-level measurement users. 46 | eventType - A list of event types to include in the data 47 | Output: A dataframe with the user id and the number of repos contributed to 48 | ''' 49 | def getUserUniqueRepos(self,selectedUsers=False,eventType=None): 50 | df = self.determineDf(selectedUsers,eventType) 51 | df = df.groupby('user') 52 | data = df.repo.nunique().reset_index() 53 | data.columns = ['user','value'] 54 | return data 55 | 56 | ''' 57 | This method returns the timeline of activity of the desired user over time, either in raw or cumulative counts. 58 | Question #19 59 | Inputs: selectedUsers - A list of users of interest or a boolean indicating whether to subset to node-level measurement users. 
60 | time_bin - Time frequency for calculating event counts 61 | cumSum - Boolean indicating whether to calculate the cumulative activity counts 62 | eventType = List of event types to include in the data 63 | Output: A dictionary with a data frame for each user with two columns: data and event counts 64 | ''' 65 | def getUserActivityTimeline(self, selectedUsers=True,time_bin='1d',cumSum=False,eventType=None): 66 | df = self.determineDf(selectedUsers,eventType) 67 | 68 | df['value'] = 1 69 | if cumSum: 70 | df['cumsum'] = df.groupby('user').value.transform(pd.Series.cumsum) 71 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).max().reset_index() 72 | df['value'] = df['cumsum'] 73 | df = df.drop('cumsum',axis=1) 74 | else: 75 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).sum().reset_index() 76 | 77 | data = df.sort_values(['user', 'time']) 78 | measurements = {} 79 | for user in data['user'].unique(): 80 | measurements[user] = data[data['user'] == user] 81 | 82 | return measurements 83 | 84 | 85 | ''' 86 | This method returns the top k most popular users for the dataset, where popularity is measured 87 | as the total popularity of the repos created by the user. 88 | Question #25 89 | Inputs: k - (Optional) The number of users that you would like returned. 90 | use_metadata - External metadata file containing repo owners. Otherwise use first observed user with 91 | a creation event as a proxy for the repo owner. 92 | eventType - A list of event types to include 93 | Output: A dataframe with the user ids and number events for that user 94 | ''' 95 | def getUserPopularity(self,k=5000,use_metadata=False,eventType=None): 96 | 97 | df = self.determineDf(False,eventType) 98 | 99 | df['value'] = 1 100 | 101 | repo_popularity = df[df.event.isin(['WatchEvent','ForkEvent'])].groupby('repo')['value'].sum().reset_index() 102 | 103 | if use_metadata and self.useRepoMetaData: 104 | #merge repo popularity with the owner information in repo_metadata 105 | #drop data for which no owner information exists in metadata 106 | repo_popularity = repo_popularity.merge(self.repoMetaData,left_on='repo',right_on='repo', 107 | how='left').dropna() 108 | 109 | elif df['repo'].str.match('.{22}/.{22}').all(): 110 | #if all repo IDs have the correct format use the owner info from the repo id 111 | repo_popularity['owner_id'] = repo_popularity['repo'].apply(lambda x: x.split('/')[0]) 112 | else: 113 | #otherwise use creation event as a proxy for ownership 114 | user_repos = df[df['event'] == 'CreateEvent'].sort_values('time').drop_duplicates(subset='repo',keep='first') 115 | user_repos = user_repos[['user','repo']] 116 | user_repos.columns = ['owner_id','repo'] 117 | if len(user_repos.index) >= 0: 118 | repo_popularity = user_repos.merge(repo_popularity,on='repo',how='left') 119 | else: 120 | return None 121 | 122 | 123 | measurement = repo_popularity.groupby('owner_id').value.sum().sort_values(ascending=False).head(k) 124 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 125 | 126 | return measurement 127 | 128 | 129 | ''' 130 | This method returns the average time between events for each user 131 | 132 | Inputs: df - Data frame of all data for repos 133 | users - (Optional) List of specific users to calculate the metric for 134 | nCPu - (Optional) Number of CPU's to run metric in parallel 135 | Outputs: A list of average times for each user. 
Length should match the number of users 136 | ''' 137 | def getAvgTimebwEventsUsers(self,selectedUsers=True, nCPU=1): 138 | df = self.determineDf(selectedUsers,None) 139 | users = df['user'].unique() 140 | args = [(df, users[i]) for i, item_a in enumerate(users)] 141 | pool = pp.ProcessPool(nCPU) 142 | deltas = pool.map(self.getMeanTimeUserHelper, args) 143 | return deltas 144 | 145 | ''' 146 | Helper function for getting the average time between events 147 | 148 | Inputs: Same as average time between events 149 | Output: Same as average time between events 150 | ''' 151 | def getMeanTimeUser(self,df, user): 152 | d = df[df.user == user] 153 | d = d.sort_values(by='time') 154 | delta = np.mean(np.diff(d.time)) / np.timedelta64(1, 's') 155 | return delta 156 | 157 | def getMeanTimeUserHelper(self,args): 158 | return self.getMeanTimeUser(*args) 159 | 160 | ''' 161 | This method returns the distribution of the diffusion delay for each user 162 | Question #27 163 | Inputs: DataFrame - Desired dataset 164 | unit - (Optional) This is the unit that you want the distribution in. Check np.timedelta64 documentation 165 | for the possible options 166 | metadata_file - File containing user account creation times. Otherwise use first observed action of user as proxy for account creation time. 167 | Output: A list (array) of deltas in units specified 168 | ''' 169 | def getUserDiffusionDelay(self,unit='h', selectedUser=True,eventType=None): 170 | 171 | df = self.determineDf(selectedUser,eventType) 172 | 173 | df['value'] = df['time'] 174 | df['value'] = pd.to_datetime(df['value']) 175 | df['value'] = df['value'].dt.round('1H') 176 | 177 | if self.useUserMetaData: 178 | df = df.merge(self.userMetaData[['user','created_at']],left_on='user',right_on='user',how='left') 179 | df = df[['user','created_at','value']].dropna() 180 | measurement = df['value'].sub(df['created_at']).apply(lambda x: int(x / np.timedelta64(1, unit))) 181 | else: 182 | grouped = df.groupby('user') 183 | transformed = grouped['value'].transform('min') 184 | measurement = df['value'].sub(transformed).apply(lambda x: int(x / np.timedelta64(1, unit))) 185 | return measurement 186 | 187 | ''' 188 | This method returns the top k users with the most events. 189 | Question #24b 190 | Inputs: DataFrame - Desired dataset. Used mainly when dealing with subset of events 191 | k - Number of users to be returned 192 | Output: Dataframe with the user ids and number of events 193 | ''' 194 | def getMostActiveUsers(self,k=5000,eventType=None): 195 | 196 | df = self.main_df 197 | 198 | if eventType != None: 199 | df = df[df.event.isin(eventType)] 200 | 201 | df['value'] = 1 202 | df = df.groupby('user') 203 | measurement = df.value.sum().sort_values(ascending=False).head(k) 204 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 205 | return measurement 206 | 207 | ''' 208 | This method returns the distribution for the users activity (event counts). 
209 | Question #24a 210 | Inputs: DataFrame - Desired dataset 211 | eventType - (Optional) Desired event type to use 212 | Output: List containing the event counts per user 213 | ''' 214 | def getUserActivityDistribution(self,eventType=None,selectedUser=False): 215 | 216 | if selectedUser: 217 | df = self.selectedUsers 218 | else: 219 | df = self.main_df 220 | 221 | if eventType != None: 222 | df = df[df.event.isin(eventType)] 223 | 224 | df['value'] = 1 225 | df = df.groupby('user') 226 | measurement = df.value.sum().reset_index() 227 | return measurement 228 | 229 | 230 | ''' 231 | Calculate the proportion of pull requests that are accepted by each user. 232 | Question #15 (Optional Measurement) 233 | Inputs: eventType: List of event types to include in the calculation (Should be PullRequestEvent). 234 | thresh: Minimum number of PullRequests a repo must have to be included in the distribution. 235 | Output: Data frame with the proportion of accepted pull requests for each user 236 | ''' 237 | def getUserPullRequestAcceptance(self,eventType=['PullRequestEvent'], thresh=2): 238 | 239 | df = self.main_df_opt 240 | 241 | if not df is None and 'PullRequestEvent' in self.main_df.event.values: 242 | 243 | df = df[self.main_df.event.isin(eventType)] 244 | users_repos = self.main_df[self.main_df.event.isin(eventType)] 245 | 246 | #subset on only PullRequest close actions (not opens) 247 | idx = df['action'] == 'closed' 248 | closes = df[idx] 249 | users_repos = users_repos[idx] 250 | 251 | #merge pull request columns (action, merged) with main data frame columns 252 | closes = pd.concat([users_repos,closes],axis=1) 253 | closes = closes[['user','repo','merged']] 254 | closes['value'] = 1 255 | 256 | #add up number of accepted (merged) and rejected pullrequests by user and repo 257 | outcomes = closes.pivot_table(index=['user','repo'],values=['value'],columns=['merged'],aggfunc=np.sum).fillna(0) 258 | 259 | outcomes.columns = outcomes.columns.get_level_values(1) 260 | 261 | outcomes = outcomes.rename(index=str, columns={True: "accepted", False: "rejected"}) 262 | 263 | for col in ['accepted','rejected']: 264 | if col not in outcomes.columns: 265 | outcomes[col] = 0 266 | 267 | outcomes['total'] = outcomes['accepted'] + outcomes['rejected'] 268 | outcomes['value'] = outcomes['accepted'] / outcomes['total'] 269 | outcomes = outcomes.reset_index() 270 | outcomes = outcomes[outcomes['total'] >= thresh] 271 | 272 | if len(outcomes.index) > 0: 273 | #calculate the average acceptance rate for each user across their repos 274 | measurement = outcomes[['user','value']].groupby('user').mean() 275 | else: 276 | measurement = None 277 | else: 278 | measurement = None 279 | 280 | return measurement 281 | 282 | -------------------------------------------------------------------------------- /github-measurements/infodynamics.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pnnl/socialsim/06f0ce61d10ca08dd50d256fb30ac0ae81ead58d/github-measurements/infodynamics.jar -------------------------------------------------------------------------------- /github-measurements/reference-approaches/README.md: -------------------------------------------------------------------------------- 1 | # Reference Approach Scripts 2 | 3 | * **generate_reference_approach_data.py**: This script can generate reference approach data for a target test period using a given historical data set. 
4 | * **reference_approach_performance_plots.py**: This script can be used to replicate the visualizations we used to summarize performance relative to the reference approaches. -------------------------------------------------------------------------------- /github-measurements/reference-approaches/generate_reference_approach_data.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import datetime 3 | import numpy as np 4 | import glob 5 | 6 | 7 | def ingest_historical_data(csv_file): 8 | 9 | """ 10 | Read data from csv file 11 | """ 12 | 13 | print('reading data...') 14 | df = pd.read_csv(csv_file) 15 | df.columns = ['created_at','type','actor_login_h','repo_name_h','payload_action','payload_pull_request_merged'] 16 | 17 | print('to datetime..') 18 | df['created_at'] = pd.to_datetime(df['created_at']) 19 | 20 | print('sorting...') 21 | df = df.sort_values('created_at') 22 | 23 | return df 24 | 25 | def subset_data(df,start,end): 26 | 27 | """ 28 | Return temporal data subset based on start and end dates 29 | """ 30 | 31 | print('subsetting...') 32 | df = df[ (df['created_at'] >= start) & (df['created_at'] <= end) ] 33 | 34 | return(df) 35 | 36 | def shift_data(df,shift, end): 37 | 38 | """ 39 | Shift data based on fixed offset (shift) and subset based on upper limit (end) 40 | """ 41 | 42 | print('shifting...') 43 | df['created_at'] += shift 44 | df = df[df['created_at'] <= end] 45 | 46 | return df 47 | 48 | 49 | def sample_data(df,start,end,proportional=True): 50 | 51 | """ 52 | Sample data either uniformly (proportional=False) or proporationally (proportional=True) to fill test period from start to end 53 | """ 54 | 55 | print('inter-event times...') 56 | 57 | df['inter_event_times'] = df['created_at'] - df['created_at'].shift() 58 | inter_event_times = df['inter_event_times'].dropna() 59 | 60 | max_time = df['created_at'].min() 61 | multiplier=( (pd.to_datetime(end) - pd.to_datetime(start)) / df['inter_event_times'].mean() ) / float(len(df.index)) 62 | 63 | #repeat until enough data is sampled to fill the test period 64 | while max_time < pd.to_datetime(end): 65 | 66 | if proportional: 67 | sample = pd.DataFrame(df['inter_event_times'].dropna().sample(int(multiplier*len(df.index)),replace=True)) 68 | sampled_inter_event_times = sample.cumsum() 69 | else: 70 | sample = pd.DataFrame(np.random.uniform(np.min(inter_event_times.dt.total_seconds()),1.0,int(multiplier*len(df.index))))[0].round(0) 71 | sample = pd.to_timedelta(sample,unit='s') 72 | sampled_inter_event_times = pd.DataFrame(sample).cumsum() 73 | 74 | event_times = (pd.to_datetime(start) + sampled_inter_event_times) 75 | max_time = pd.to_datetime(event_times.max().values[0]) 76 | multiplier*=1.5 77 | 78 | event_times = event_times[(event_times < pd.to_datetime(end)).values] 79 | 80 | if proportional: 81 | users = df['actor_login_h'] 82 | repos = df['repo_name_h'] 83 | events = df['type'] 84 | else: 85 | users = pd.Series(df['actor_login_h'].unique()) 86 | repos = pd.Series(df['repo_name_h'].unique()) 87 | events = pd.Series(df['type'].unique()) 88 | 89 | 90 | users = users.sample(len(event_times),replace=True).values 91 | repos = repos.sample(len(event_times),replace=True).values 92 | events = events.sample(len(event_times),replace=True).values 93 | 94 | df_out = pd.DataFrame({'time':event_times.values.flatten(), 95 | 'event':events, 96 | 'user':users, 97 | 'repo':repos}) 98 | 99 | if proportional: 100 | pr_action = df[df['type'] == 
'PullRequestEvent']['payload_action'] 101 | pr_merged = df[df['type'] == 'PullRequestEvent']['payload_pull_request_merged'] 102 | iss_action = df[df['type'] == 'IssuesEvent']['payload_action'] 103 | else: 104 | pr_action = df[df['type'] == 'PullRequestEvent']['payload_action'].unique() 105 | pr_merged = df[df['type'] == 'PullRequestEvent']['payload_pull_request_merged'].unique() 106 | iss_action = df[df['type'] == 'IssuesEvent']['payload_action'].unique() 107 | 108 | pull_requests = df_out[df_out['event'] == 'PullRequestEvent'] 109 | pull_requests['payload_action'] = pd.Series(pr_action).sample(len(pull_requests.index), 110 | replace=True).values 111 | pull_requests['payload_pull_request_merged'] = pd.Series(pr_merged).sample(len(pull_requests.index), 112 | replace=True).values 113 | 114 | 115 | issues = df_out[df_out['event'] == 'IssuesEvent'] 116 | issues['payload_action'] = pd.Series(iss_action).sample(len(issues.index),replace=True).values 117 | 118 | df_out = df_out[~df_out['event'].isin(['IssuesEvent','PullRequestEvent'])] 119 | df_out = pd.concat([df_out,pull_requests,issues]) 120 | df_out = df_out.sort_values('time') 121 | 122 | df_out = df_out[['time','event','user','repo','payload_action','payload_pull_request_merged']] 123 | 124 | return df_out 125 | 126 | 127 | def create_shifted_reference(csv_file, test_start_date='2018-02-01', test_end_date='2018-02-28', 128 | historical_start_date='2017-08-01',historical_end_date='2017-08-31'): 129 | 130 | 131 | """ 132 | Create shifted reference from historical data in csv_file using data ranging from historical_start_date 133 | to historical_end_date to generate new shifted data ranging from test_start_date to test_end_date. 134 | """ 135 | 136 | 137 | df = ingest_historical_data(csv_file) 138 | 139 | 140 | test_delta_t = np.datetime64(test_end_date) - np.datetime64(test_start_date) 141 | historical_delta_t = np.datetime64(historical_end_date) - np.datetime64(historical_start_date) 142 | if historical_delta_t > test_delta_t: 143 | df = subset_data(df,historical_start_date,historical_end_date) 144 | else: 145 | print('Not enough historical data to create shifted reference approach') 146 | return None 147 | 148 | shifted_df = shift_data(df,np.datetime64(test_start_date) - np.datetime64(historical_start_date),np.datetime64(test_end_date)) 149 | shifted_df = subset_data(shifted_df,test_start_date,test_end_date) 150 | 151 | return shifted_df 152 | 153 | 154 | def create_sampled_reference(csv_file, test_start_date='2018-02-01', test_end_date='2018-02-28', 155 | historical_start_date='2017-08-01',historical_end_date='2017-08-31', 156 | proportional=True): 157 | 158 | """ 159 | Create sampled reference from historical data in csv_file using data ranging from historical_start_date 160 | to historical_end_date to generate new sampled data ranging from test_start_date to test_end_date. 161 | If proportional is True, the sampling will be proportional to the observed frequencies in the 162 | historical data. Otherwise, sampling will be uniform. 
163 | """ 164 | 165 | df = ingest_historical_data(csv_file) 166 | 167 | df = subset_data(df,historical_start_date,historical_end_date) 168 | 169 | sampled_df = sample_data(df,test_start_date, test_end_date,proportional) 170 | 171 | return sampled_df 172 | 173 | 174 | def main(): 175 | 176 | fn = 'august_2017.csv' 177 | 178 | shifted_reference = create_shifted_reference(fn,test_end_date='2018-02-05') 179 | print('shifted reference') 180 | print(shifted_reference) 181 | 182 | sampled_reference_uniform = create_sampled_reference(fn,proportional=False,test_end_date='2018-02-05') 183 | print('sampled reference uniform') 184 | print(sampled_reference_uniform) 185 | 186 | sampled_reference_proportional = create_sampled_reference(fn,proportional=True,test_end_date='2018-02-05') 187 | print('sampled reference proportional') 188 | print(sampled_reference_proportional) 189 | 190 | 191 | if __name__ == "__main__": 192 | main() 193 | -------------------------------------------------------------------------------- /github-measurements/requirements.txt: -------------------------------------------------------------------------------- 1 | fastdtw==0.3.2 2 | numpy==1.14.0 3 | statsmodels==0.8.0 4 | pathos==0.2.1 5 | pandas==0.23.1 6 | matplotlib==2.0.2 7 | scipy==0.19.1 8 | JPype1==0.6.3 9 | scikit_learn==0.19.1 10 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | Copyright 2018 PACIFIC NORTHWEST NATIONAL LABORATORY 2 | 3 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 4 | 5 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 6 | 7 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 8 | 9 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
10 | -------------------------------------------------------------------------------- /pip_requirements.txt: -------------------------------------------------------------------------------- 1 | backports.functools-lru-cache==1.5 2 | certifi==2018.8.24 3 | chardet==3.0.4 4 | Click==7.0 5 | community==1.0.0b1 6 | cycler==0.10.0 7 | decorator==4.3.0 8 | dill==0.2.8.2 9 | fastdtw==0.3.2 10 | Flask==1.0.2 11 | idna==2.7 12 | itsdangerous==0.24 13 | Jinja2==2.10 14 | JPype1==0.6.3 15 | kiwisolver==1.0.1 16 | MarkupSafe==1.0 17 | matplotlib==2.2.3 18 | mkl-fft==1.0.6 19 | mkl-random==1.0.1 20 | multiprocess==0.70.6.1 21 | networkx==2.2 22 | numpy==1.15.2 23 | pandas==0.23.4 24 | pathos==0.2.2.1 25 | patsy==0.5.0 26 | pox==0.2.4 27 | ppft==1.6.4.8 28 | prettytable==0.7.2 29 | pycairo==1.17.1 30 | pyparsing==2.2.2 31 | PySAL==1.14.4.post2 32 | python-dateutil==2.7.3 33 | python-igraph==0.7.1.post6 34 | pytz==2018.5 35 | requests==2.19.1 36 | scikit-learn==0.20.0 37 | scipy==1.1.0 38 | seaborn==0.9.0 39 | six==1.11.0 40 | sklearn==0.0 41 | statsmodels==0.9.0 42 | subprocess32==3.5.2 43 | tqdm==4.26.0 44 | urllib3==1.23 45 | Werkzeug==0.14.1 46 | --------------------------------------------------------------------------------
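As a usage sketch for the newer github-measurements code above, the snippet below constructs a Measurements object and runs two of the user-centric measurements. This assumes the supporting files referenced by the constructor defaults (the data/ pickle files, the TE parameter JSON, and infodynamics.jar) are available in the working directory; the input CSV path is hypothetical.

    import pandas as pd
    from Measurements import Measurements

    # Hypothetical event log with columns _id, created_at, type, actor.id, repo.id
    events = pd.read_csv('github_events_sample.csv')

    # The constructor accepts either a DataFrame or a CSV path
    m = Measurements(events)

    # Top 10 users ranked by total event count
    print(m.getMostActiveUsers(k=10))

    # Per-user event count distribution restricted to push events
    print(m.getUserActivityDistribution(eventType=['PushEvent']))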