├── README.md ├── conda_requirements.txt ├── data_extraction ├── README.md ├── complicated_cascade_followers.json ├── complicated_cascade_partial.csv ├── extract_ground_truth_cp2.py ├── keyword_map.json └── twitter_cascade_reconstruction.py ├── december-measurements ├── BaselineMeasurements.py ├── CommunityCentricMeasurements.py ├── ContentCentricMeasurements.py ├── Metrics.py ├── TEMeasurements.py ├── UserCentricMeasurements.py ├── cascade_measurements.py ├── cascade_reconstruction │ ├── example_follower_data │ │ ├── a.txt │ │ ├── b.txt │ │ ├── d.txt │ │ ├── h.txt │ │ ├── i.txt │ │ ├── j.txt │ │ ├── k.txt │ │ └── m.txt │ ├── twitter_cascade_reconstruction.py │ ├── twitter_example_data_reconstructed.json │ └── twitter_reconstruction_example_data.json ├── cascade_validators.py ├── config │ ├── baseline_metrics_config_github.py │ ├── baseline_metrics_config_github_crypto_s1.py │ ├── baseline_metrics_config_github_cve_s1.py │ ├── baseline_metrics_config_github_cyber_s1.py │ ├── baseline_metrics_config_reddit.py │ ├── baseline_metrics_config_reddit_crypto_s1.py │ ├── baseline_metrics_config_reddit_crypto_s2.py │ ├── baseline_metrics_config_reddit_cve_s1.py │ ├── baseline_metrics_config_reddit_cve_s2.py │ ├── baseline_metrics_config_reddit_cyber_s1.py │ ├── baseline_metrics_config_reddit_cyber_s2.py │ ├── baseline_metrics_config_twitter.py │ ├── baseline_metrics_config_twitter_crypto_s1.py │ ├── baseline_metrics_config_twitter_cve_s1.py │ ├── baseline_metrics_config_twitter_cve_s2.py │ ├── baseline_metrics_config_twitter_cyber_s1.py │ ├── cascade_metrics_config.py │ ├── cascade_metrics_config_twitter.py │ └── network_metrics_config.py ├── infodynamics.jar ├── network_measurements.py ├── plotting │ ├── charts.py │ ├── transformer.py │ └── visualization_config.py ├── run_measurements_and_metrics.py └── validators.py ├── github-measurements-old ├── Metrics.py ├── RepoCentricMeasurements.py ├── RepoMeasurementsWithPlot.py ├── TransferEntropy.py ├── UserCentricMeasurements.py ├── UserMeasurementsWithPlot.py ├── load_data.py ├── metrics_config.py └── plots.py ├── github-measurements ├── CommunityCentricMeasurements.py ├── Measurements.py ├── Metrics.py ├── RepoCentricMeasurements.py ├── TEMeasurements.py ├── UserCentricMeasurements.py ├── infodynamics.jar ├── metrics_config.py ├── reference-approaches │ ├── README.md │ ├── generate_reference_approach_data.py │ └── reference_approach_performance_plots.py └── requirements.txt ├── license.txt └── pip_requirements.txt /README.md: -------------------------------------------------------------------------------- 1 | # socialsim 2 | 3 | This repo contains scripts needed to run the measurements and metrics for the SocialSim challenge evaluation. 4 | 5 | ## Change Log 6 | 7 | * **2 November 2018**: 8 | * Added code for reconstructing Twitter cascades using follower data (identifying the parentID for retweets and the rootID for reply tweets) in december-measurements/cascade_reconstruction. 9 | * Improved efficiency of cascade measurements by switching to igraph implementation and making improvements to the time series meausurements 10 | * Fix handling of cascades where root node is not included in the simulation input 11 | 12 | * **31 October 2018**: 13 | * Improved efficiency of the network initialization 14 | * Added cascade measurements to the visualization configuration. Cascade measurements will now generate visualizations if the plot_flag is set to True. 15 | 16 | * **25 October 2018**: 17 | * Changed handling of root-only cascades (i.e. 
posts with no comments or tweets with no replies/retweets/quotes) to no longer return None, allowing metrics to be calculated even if the simulation or the ground truth contains these empty cascades. 18 | * Changed the join between simulation and ground truth data for calculation of one-to-one measurements (e.g. RMSE, R2) to an outer join rather than an inner join, with appropriate filling of missing values (forward fill for cumulative time-series and zero fill for non-cumulative). 19 | * Changed default behavior for community metrics. Previously used the baseline challenge community definitions by default, now calculates each community measurement on the full set of data by default. 20 | 21 | * **24 October 2018**: 22 | * Added checks for valid values of the status and actionSubType fields to avoid errors when calculating measurements that require these fields. 23 | 24 | * **16 October 2018**: 25 | * We added requirements files and instructions for setting up an environment to run the code. 26 | * Fixed a typo in the config 27 | 28 | * **12 October 2018**: 29 | * We updated the network_measurements implementations to use igraph and SNAP rather than networkx for improved memory and time performance. Some of our team members had trouble with the python-igraph and SNAP installations. If you have trouble with the python-igraph installation using pip, try the conda install: "conda install -c conda-forge python-igraph". SNAP should be installed from https://snap.stanford.edu/snappy/ rather than using pip. If you get a "Fatal Python error: PyThreadState_Get: no current thread" error, you should modify the SNAP setup.py file and replace "dynlib_path = getdynpath()" with e.g. "dynlib_path = '/anaconda/lib/libpython2.7.dylib'" (use the path to your libpython2.7.dylib file). Please contact us if you are having trouble with your installation after following these steps. 30 | * Additionally, we moved from the CSV input format to the JSON input format. Example JSON files for each platform can be found on the December Challenge wiki page in the same place as the example CSV files. 31 | 32 | * **9 October 2018**: 33 | * We updated the cascade_measurements so that cascade-level measurements are calculated using the CascadeCollectionMeasurements class rather than the SingleCascadeMeasurements class. This means that all cascade measurements can now be calculated using the CascadeCollectionMeasurements class. The cascade_examples function shows how to run cascade measurements. Additionally, we fixed the implementation of the cascade breadth calculation. 34 | 35 | ## Environment Installation 36 | 37 | Create a conda environment by running 38 | 39 | conda create --name my_env_name --file conda_requirements.txt -c conda-forge python=2.7 40 | 41 | Activate your new conda environment by running 42 | 43 | source activate my_env_name 44 | 45 | Install the remaining pip requirements with 46 | 47 | pip install -r pip_requirements.txt 48 | 49 | and finally install SNAP by following the instructions found here: https://snap.stanford.edu/snappy/ 50 | 51 | ## Scripts 52 | 53 | ### run_measurements_and_metrics.py 54 | 55 | This is the main script that provides functionality to run individual measurements and metrics or the full set of assigned measurements and metrics for the challenge (this replaces 56 | the previous metrics_config.py script). 57 | 58 | #### Measurement Configuration 59 | 60 | The measurement configurations used by run_measurements_and_metrics.py are found in the metric_config files in the config/ directory. 
These 61 | files define a set of dictionaries for different measurement types that specify the measurement and metric parameters. There are five metrics_config files: 62 | 63 | 1. network_metrics_config.py - contains `network_measurement_params` to be used for all network measurements 64 | 2. cascade_metrics_config.py - contains `cascade_measurement_params` to be used for all cascade measurements 65 | 3. baseline_metrics_config_github.py - contains `github_measurement_params` to be used for baseline measurements applied to GitHub 66 | 4. baseline_metrics_config_reddit.py - contains `reddit_measurement_params` to be used for baseline measurements applied to Reddit 67 | 5. baseline_metrics_config_twitter.py - contains `twitter_measurement_params` to be used for baseline measurements applied to Twitter 68 | 69 | 70 | Each dictionary element in one of the measurement_params dictionaries defines the metric assignments for a single measurement, with the key indicating the name of the 71 | measurement and the value specifying the measurement function, the measurement function arguments, the scenarios for which the measurement is included, 72 | and the metric functions for the metric calculation. 73 | For example, here is the specification of a single measurement in this format: 74 | 75 | ```python 76 | measurement_params = { 77 | "user_unique_repos": { 78 | 'question': '17', 79 | "scale": "population", 80 | "node_type":"user", 81 | "scenario1":True, 82 | "scenario2":False, 83 | "scenario3":True, 84 | "measurement": "getUserUniqueRepos", 85 | "measurement_args":{"eventType":contribution_events}, 86 | "metrics": { 87 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 88 | "rmse": Metrics.rmse, 89 | "r2": Metrics.r2} 90 | } 91 | } 92 | ``` 93 | 94 | This measurement is related to the number of unique repos that users contribute to (Question #17), which is a user-centric 95 | measurement at the population level. The measurement will be used in scenario 1 and scenario 3, but not scenario 2. 96 | The "measurement" keyword specifies the measurement function to apply, and the "measurement_args" keyword specifies 97 | the arguments to the measurement function in dictionary format. The "metrics" keyword provides a dictionary of each of 98 | the metrics that should be applied for this measurement. 99 | 100 | #### Measurements Classes 101 | 102 | Measurements are calculated on a data set by employing one of the measurements classes. There are currently 6 measurements classes, which produce different categories of measurements: 103 | 1. BaselineMeasurements implemented in BaselineMeasurements.py - this includes all measurements from the baseline challenge which have been generalized to apply to GitHub, Twitter, or Reddit 104 | 2. GithubNetworkMeasurements implemented in network_measurements.py - this includes network measurements for GitHub. 105 | 3. RedditNetworkMeasurements implemented in network_measurements.py - this includes network measurements for Reddit. 106 | 4. TwitterNetworkMeasurements implemented in network_measurements.py - this includes network measurements for Twitter. 107 | 5. SingleCascadeMeasurements implemented in cascade_measurements.py - this includes node level cascade measurements (i.e. measurements on a single cascade) 108 | 6. CascadeCollectionMeasurements implemented in cascade_measurements.py - this includes population and community level cascade measurements (i.e. 
measurements on a set of cascades) 109 | 110 | To instantiate a measurements object for a particular data set (either simulation or ground truth data), you generally pass the data frame to one of the above classes: 111 | 112 | ```python 113 | #create measurement object from data frame 114 | measurement = BaselineMeasurements(data_frame) 115 | #create measurement object from csv file 116 | measurement = BaselineMeasurements(csv_file_name) 117 | 118 | #create measurement object with specific list of nodes to calculate node-level measurements on 119 | measurement = BaselineMeasurements(data_frame,user_node_ids=['user_id1'],content_node_ids=['repo_id1']) 120 | ``` 121 | 122 | This object contains the methods for calculating all of the measurements of the given type. For example, the user unique repos measurement can be calculated as follows: 123 | 124 | ```python 125 | result = measurement.getUserUniqueRepos(eventType=contribution_events) 126 | ``` 127 | 128 | #### Running a Single Measurement 129 | 130 | The `run_measurement` function can be used to calculate the measurement output for a single measurement on a given data set using the measurement_params configuration, which contains the parameters to be used for evaluation during the challenge event. The arguments for this function include the data, the measurement_params dictionary, and the name of the measurement to apply. 131 | 132 | For example, if we want to run one of the baseline GitHub measurements on the simulation data, we need to provide the `github_measurement_params` dictionary which contains the relevant configuration and provide the name of the specific measurement we are interested in: 133 | 134 | ```python 135 | simulation = BaselineMeasurements(simulation_data_frame) 136 | meas = run_measurement(simulation, github_measurement_params, "user_unique_content") 137 | ``` 138 | 139 | The `run_metrics` function can be used to run all the relevant metrics for a given measurement in addition to the measurement output itself. 140 | This function takes as input two Measurements objects (one for the ground truth and one for the simulation), the relevant measurement_params dictionary, and the name of the measurement as listed in the keywords of measurement_params. It returns the measurement results for the ground truth and the simulation, along with the metric output. 141 | 142 | For example: 143 | 144 | ```python 145 | ground_truth = BaselineMeasurements(ground_truth_data_frame) 146 | simulation = BaselineMeasurements(simulation_data_frame) 147 | gt_measurement, sim_measurement, metric = run_metrics(ground_truth, simulation, "user_unique_content", github_measurement_params) 148 | ``` 149 | 150 | #### Running All Measurements 151 | 152 | To run all of the measurements that are defined in the measurement_params configuration, the `run_all_measurements` and `run_all_metrics` 153 | functions can be used. 
To run all the measurements on a simulation data Measurements object and save the output in pickle files in the output directory: 154 | 155 | ```python 156 | meas_dictionary = run_all_measurements(simulation,github_measurement_params,output_dir='measurement_output/') 157 | ``` 158 | 159 | To run all the metrics for all the measurements on a ground truth Measurements object and simulation data Measurements object: 160 | 161 | ```python 162 | metrics = run_all_metrics(ground_truth,simulation,github_measurement_params) 163 | ``` 164 | 165 | For both `run_all_metrics` and `run_all_measurements`, you can additionally select specific subsets of the measurements by using the filters parameter to filter on any properties in the measurement_params dictionary. For example: 166 | 167 | ```python 168 | metrics = run_all_metrics(ground_truth,simulation,github_measurement_params,filters={"scale":"population","node_type":"user"}) 169 | ``` 170 | 171 | #### Plotting 172 | 173 | In order to generate plots of the measurements, any of the `run_metrics`, `run_measurement`, `run_all_metrics`, and `run_all_measurements` functions can take the following arguments: 174 | 175 | 1. plot_flag - boolean indicator of whether to generate plots 176 | 2. show - boolean indicator of whether to display the plots to screen 177 | 3. plot_dir - A directory in which to save the plots. If plot_dir is an empty string '', the plots will not be saved. 178 | 179 | Currently, plotting is only implemented for the baseline challenge measurements. Plotting functionality for the remaining measurements will be released at a later date. 180 | 181 | ### Metrics.py 182 | 183 | This script contains implementations of each metric for comparison of the output of the ground truth and simulation 184 | measurements. 185 | 186 | ### BaselineMeasurements.py 187 | 188 | This script contains the core BaselineMeasurements class which performs initialization of all input data for measurement calculation 189 | for the measurements from the baseline challenge. 190 | 191 | ### UserCentricMeasurements.py 192 | 193 | This script contains implementations of the user-centric measurements inside the UserCentricMeasurements class. 194 | 195 | ### ContentCentricMeasurements.py 196 | 197 | This script contains implementations of the baseline content-centric measurements inside the ContentCentricMeasurements class. 198 | 199 | ### CommunityCentricMeasurements.py 200 | 201 | This script contains implementations of the community-centric measurements inside the CommunityCentricMeasurements class. 202 | 203 | ### network_measurements.py 204 | 205 | This script contains implementations of the network measurements inside the GithubNetworkMeasurements, RedditNetworkMeasurements, and TwitterNetworkMeasurements classes. 206 | 207 | ### cascade_measurements.py 208 | 209 | This script contains implementations of the cascade measurements inside the SingleCascadeMeasurements and CascadeCollectionMeasurements classes. 
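As an end-to-end reference, the following sketch runs the cascade measurements and metrics using the pieces described above. It is only a sketch: it assumes that `CascadeCollectionMeasurements` is constructed from a pandas data frame in the same way as `BaselineMeasurements`, that the input file names are placeholders, and that the import paths resolve as written (they may need adjusting depending on how the config/ directory is placed on your path). Check cascade_measurements.py (e.g. the `cascade_examples` function) and config/cascade_metrics_config.py for the exact constructor arguments and parameter dictionary before running.

```python
import pandas as pd

from cascade_measurements import CascadeCollectionMeasurements
from config.cascade_metrics_config import cascade_measurement_params
from run_measurements_and_metrics import run_all_measurements, run_all_metrics

# Load ground truth and simulation events (placeholder file names, simulation output schema).
ground_truth_df = pd.read_json('ground_truth_events.json')
simulation_df = pd.read_json('simulation_events.json')

# Build one Measurements object per data set (constructor assumed to mirror BaselineMeasurements).
ground_truth = CascadeCollectionMeasurements(ground_truth_df)
simulation = CascadeCollectionMeasurements(simulation_df)

# Save all simulation measurement outputs as pickle files and generate plots.
meas_dictionary = run_all_measurements(simulation, cascade_measurement_params,
                                       output_dir='measurement_output/',
                                       plot_flag=True, show=False, plot_dir='plot_output/')

# Compare the simulation against the ground truth using the metrics assigned to each cascade measurement.
metrics = run_all_metrics(ground_truth, simulation, cascade_measurement_params)
```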
210 | 211 | -------------------------------------------------------------------------------- /conda_requirements.txt: -------------------------------------------------------------------------------- 1 | # This file may be used to create an environment using: 2 | # $ conda create --name --file 3 | # platform: linux-64 4 | blas=1.0=mkl 5 | ca-certificates=2018.03.07=0 6 | cairo=1.14.12=h276e583_5 7 | certifi=2018.8.24=py27_1 8 | fontconfig=2.13.1=h65d0f4c_0 9 | freetype=2.9.1=h6debe1e_4 10 | gettext=0.19.8.1=h5e8e0c9_1 11 | glib=2.56.2=h464dc38_0 12 | gmp=6.1.2=hfc679d8_0 13 | icu=58.2=hfc679d8_0 14 | igraph=0.7.1=hcc8e21d_5 15 | intel-openmp=2019.0=118 16 | jpype1=0.6.3=py27_0 17 | libedit=3.1.20170329=h6b74fdf_2 18 | libffi=3.2.1=hd88cf55_4 19 | libgcc-ng=8.2.0=hdf63c60_1 20 | libgfortran-ng=7.2.0=hdf63c60_3 21 | libiconv=1.15=h470a237_3 22 | libpng=1.6.35=ha92aebf_2 23 | libstdcxx-ng=8.2.0=hdf63c60_1 24 | libuuid=2.32.1=h470a237_2 25 | libxcb=1.13=h470a237_2 26 | libxml2=2.9.8=h422b904_5 27 | mkl_fft=1.0.6=py27_0 28 | mkl_random=1.0.1=py27_0 29 | ncurses=6.1=hf484d3e_0 30 | numpy=1.15.2=py27h1d66e8a_1 31 | numpy-base=1.15.2=py27h81de0dd_1 32 | openssl=1.0.2p=h14c3975_0 33 | pcre=8.41=hfc679d8_3 34 | pip=10.0.1=py27_0 35 | pixman=0.34.0=h470a237_3 36 | pthread-stubs=0.4=h470a237_1 37 | pycairo=1.17.1=py27h4d1f301_0 38 | python=2.7.15=h1571d57_0 39 | python-igraph=0.7.1.post6=py27h470a237_5 40 | readline=7.0=h7b6447c_5 41 | setuptools=40.4.3=py27_0 42 | sqlite=3.25.2=h7b6447c_0 43 | tk=8.6.8=hbc83047_0 44 | wheel=0.32.0=py27_0 45 | xorg-kbproto=1.0.7=h470a237_2 46 | xorg-libice=1.0.9=h470a237_4 47 | xorg-libsm=1.2.3=h8c8a85c_0 48 | xorg-libx11=1.6.6=h470a237_0 49 | xorg-libxau=1.0.8=h470a237_6 50 | xorg-libxdmcp=1.1.2=h470a237_7 51 | xorg-libxext=1.3.3=h470a237_4 52 | xorg-libxrender=0.9.10=h470a237_2 53 | xorg-renderproto=0.11.1=h470a237_2 54 | xorg-xextproto=7.3.0=h470a237_2 55 | xorg-xproto=7.0.31=h470a237_7 56 | xz=5.2.4=h470a237_1 57 | zlib=1.2.11=ha838bed_2 58 | -------------------------------------------------------------------------------- /data_extraction/README.md: -------------------------------------------------------------------------------- 1 | # Ground Truth Data Extraction 2 | 3 | extract\_ground\_truth\_cp2.py demonstrates the approach for converting the raw JSON format for Reddit, Twitter, and GitHub to the simulation output schema for each platform. The script is designed to query PNNL's mongo database, so you will have to modify the queries to interface with your individual data storage. 4 | 5 | The extraction process for each platform follows the follow steps: 6 | 7 | 1. Query a specific time period 8 | 2. Extract relevant fields from data 9 | 3. (For Twitter only) Assign roots and parents using the cascade reconstruction script 10 | 4. (Reddit and Twitter) Propagate any information IDs on parent posts/comments/tweets to all children of the post/comment/tweet 11 | 5. Duplicate events that are related to multiple information IDs. 
For example: 12 | * userA, tweetA, [CVE-2017-123, CVE-2014-456] will split into: 13 | * userA, tweetA, CVE-2017-123 14 | * userA, tweetA, CVE-2014-456 -------------------------------------------------------------------------------- /data_extraction/complicated_cascade_followers.json: -------------------------------------------------------------------------------- 1 | {"1":["3","4"], 2 | "2":["5","7"], 3 | "3":["8","10"], 4 | "4":["11","12","13"], 5 | "5":["14","16"], 6 | "6":["17","19"], 7 | "7":["22"], 8 | "8":["23","25"], 9 | "9":["26","28"], 10 | "10":["31"], 11 | "13":["32","33","34"]} 12 | -------------------------------------------------------------------------------- /data_extraction/complicated_cascade_partial.csv: -------------------------------------------------------------------------------- 1 | actionType,nodeID,nodeUserID,parentID,rootID,partialParentID 2 | tweet,1,1,1,1, 3 | reply,2,2,1,?,1 4 | quote,3,3,?,1, 5 | retweet,4,4,?,1 6 | quote,5,5,?,?,2 7 | reply,6,6,2,?,2 8 | retweet,7,7,?,?,2 9 | quote,8,8,?,1, 10 | reply,9,9,3,?,3 11 | retweet,10,10,?,1,3 12 | retweet,11,11,?,1, 13 | retweet,12,12,?,1, 14 | retweet,13,13,?,1, 15 | quote,14,14,?,?,2 16 | reply,15,15,5,?,5 17 | retweet,16,16,?,?,5 18 | quote,17,17,?,?,6 19 | reply,18,18,6,?,6 20 | retweet,19,19,?,?,6 21 | retweet,22,22,?,?,2 22 | quote,23,23,?,1, 23 | reply,24,24,8,?,8 24 | retweet,25,25,?,1,8 25 | quote,26,26,?,?,9 26 | reply,27,27,9,?,9 27 | retweet,28,28,?,?,9 28 | retweet,31,31,?,1,3 29 | retweet,32,32,?,1, 30 | retweet,33,33,?,1, 31 | retweet,34,34,?,1, 32 | -------------------------------------------------------------------------------- /data_extraction/keyword_map.json: -------------------------------------------------------------------------------- 1 | {"electroneum": ["#Electroneum", "Electroneum", "#ETN", "ETN", "@electroneum"], 2 | "tether": ["Tether", "#Tether", "#USDT", "USDT", "@Tether_to"], 3 | "genesis vision": ["Genesis vision", "#GVT", "GVT", "#GenesisVision", "@genesis_vision"], 4 | "ubiq": ["UBIQ", "#Ubiq", "#UBQ", "UBQ"], 5 | "vcash": ["VCash", "#XCV", "#VCash", "@Vcashinfo"], 6 | "chill_coin": ["Chill Coin", "#chillcoin", "chillcoin", "@chillcoin"], 7 | "magi_coin": ["Magi Coin", "#magicoin", "#XMG", "XMG"], 8 | "indorse": ["Indorse", "#indorse", "#IND","IND"], 9 | "bitcoin_diamond": ["Bitcoin Diamond", "#BITcoindiamond", "#BCD", "@BitcoinDiamond_","BCD"], 10 | "chaincoin": ["#chaincoin", "chaincoin", "#chc", "@chaincoin","CHC"], 11 | "ecoin": ["E-coin", "#ecoin","ecoin"], 12 | "paycoin": ["paycoin", "#paycoin", "#XPY","XPY"], 13 | "quantum_resistant_ledger": ["Quantum Resistant Ledger", "#QuantumResistantLedger", "#QRL", "@QRLedger","QRL"], 14 | "omni": ["Omni", "#Omni"], 15 | "bean_cash": ["Bean Cash", "#bitb", "#beancash", "@BeanCash_BEAN","bitb"], 16 | "blockmason_credit_protocol": ["Blockmason credit protocol", "#Blockmasoncreditprotocol", "#bcpt","bcpt"], 17 | "bytecent": ["Bytecent", "#Bytecent", "#byc", "@bytecentbyc","byc"], 18 | "agoras_tokens": ["Agoras tokens", "#AgorasTokens", "#agrs","agrs"], 19 | "bancor_network_token": ["Bancor Network Token", "#BancorNetworkToken", "#BNT", "@bancornetwork","BNT"], 20 | "granitecoin": ["granitecoin", "#granitecoin", "#GRN","GRN"], 21 | "pesetacoin": ["pesetacoin", "#pesetacoin", "@PesetacoinOfic"], 22 | "agrello": ["agrello", "#agrello", "#DLT", "@AgrelloOfficial","DLT"], 23 | "peercoin": ["Peercoin", "#Peercoin", "@PeercoinPPC"], 24 | "stealth": ["#Stealth", "#XST", "@stealthsend","XST"], 25 | "version": ["@VersionCrypto"]} 26 | 27 | 
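For reference, here is a minimal sketch of the keyword matching and information-ID splitting described in the data extraction README above (steps 2 and 5), using the keyword_map.json file shown above. The `text` and `informationID` column names and the naive substring matching are placeholders for illustration only; the actual extraction logic lives in extract_ground_truth_cp2.py and may differ.

```python
import json
import pandas as pd

# Load the keyword -> information ID mapping (keyword_map.json, shown above).
with open('keyword_map.json') as f:
    keyword_map = json.load(f)

def match_information_ids(text):
    """Return all information IDs whose keyword list has a case-insensitive hit in the text."""
    text = text.lower()
    return [info_id for info_id, keywords in keyword_map.items()
            if any(kw.lower() in text for kw in keywords)]

# Toy events; the 'text' and 'informationID' columns are placeholders for this illustration.
events = pd.DataFrame({
    'nodeID': ['tweetA', 'tweetB'],
    'nodeUserID': ['userA', 'userB'],
    'text': ['Tether and Peercoin are both up today', 'nothing relevant here'],
})

# Step 5 of the extraction process: emit one row per (event, information ID) pair.
rows = []
for _, event in events.iterrows():
    for info_id in match_information_ids(event['text']):
        row = event.to_dict()
        row['informationID'] = info_id
        rows.append(row)

expanded = pd.DataFrame(rows)
print(expanded[['nodeUserID', 'nodeID', 'informationID']])
```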
-------------------------------------------------------------------------------- /data_extraction/twitter_cascade_reconstruction.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import glob 3 | from collections import defaultdict 4 | import os 5 | import pprint 6 | import json 7 | import numpy as np 8 | 9 | def load_data(json_file, full_submission=True): 10 | """ 11 | Takes in the location of a json file and loads it as a pandas dataframe. 12 | Does some preprocessing to change text from unicode to ascii. 13 | """ 14 | 15 | if full_submission: 16 | with open(json_file) as f: 17 | dataset = json.loads(f.read()) 18 | 19 | dataset = dataset['data'] 20 | dataset = pd.DataFrame(dataset) 21 | else: 22 | dataset = pd.read_json(json_file) 23 | 24 | dataset.sort_index(axis=1, inplace=True) 25 | dataset = dataset.replace('', np.nan) 26 | 27 | # This converts the column names to ascii 28 | mapping = {name:str(name) for name in dataset.columns.tolist()} 29 | dataset = dataset.rename(index=str, columns=mapping) 30 | 31 | # This converts the row names to ascii 32 | dataset = dataset.reset_index(drop=True) 33 | 34 | # This converts the cell values to ascii 35 | json_df = dataset.applymap(str) 36 | 37 | return dataset 38 | 39 | 40 | class ParentIDApproximation: 41 | """ 42 | class to obtain parent tweet id for retweets 43 | """ 44 | 45 | def __init__(self, followers, cascade_collection_df, nodeID_col_name="nodeID", userID_col_name='nodeUserID', 46 | nodeTime_col_name='nodeTime', rootID_col_name='rootID', 47 | root_userID_col_name='rootUserID', 48 | root_nodeTime_col_name='rootTime'): 49 | """ 50 | :param followers: dictionary with key: userID, value: [list of followers of userID] 51 | :param cascade_collection_df: dataframe with nodeID, userID, nodeTime, rootID, root_userID, root_nodeTime as columns 52 | default values for column names correspond to those in the Twitter data schema 53 | (https://wiki.socialsim.info/display/SOC/Twitter+Data+Schema) 54 | """ 55 | self.followers = followers 56 | self.cascade_collection_df = cascade_collection_df.copy() 57 | self.nodeID_col_name = nodeID_col_name 58 | self.userID_col_name = userID_col_name 59 | self.nodeTime_col_name = nodeTime_col_name 60 | self.rootID_col_name = rootID_col_name 61 | self.root_userID_col_name = root_userID_col_name 62 | self.root_nodeTime_col_name = root_nodeTime_col_name 63 | 64 | def get_all_tweets_rtd_later_by_followers(self, tweet_id, cascade_df): 65 | 66 | tweet_details = cascade_df.loc[tweet_id] 67 | 68 | # add self to followers because users will retweet themselves 69 | output = cascade_df[ 70 | (cascade_df[self.userID_col_name]. 71 | isin(self.followers[tweet_details[self.userID_col_name]].union( 72 | {tweet_details[self.userID_col_name]}))) & # in followers 73 | (cascade_df[self.nodeTime_col_name] > tweet_details[self.nodeTime_col_name]) 74 | ]. 
\ 75 | index.values.tolist() 76 | 77 | return output 78 | 79 | def update_parentid(self, cascade_df_main, root_id): 80 | 81 | root_userID = cascade_df_main.loc[cascade_df_main.index.max()][self.root_userID_col_name] 82 | root_nodeTime = cascade_df_main.loc[cascade_df_main.index.max()][self.root_nodeTime_col_name] 83 | 84 | cascade_df = cascade_df_main.sort_values(self.nodeTime_col_name).drop( 85 | [self.root_userID_col_name, self.root_nodeTime_col_name], axis=1).copy() 86 | cascade_df["parentID"] = 0 87 | 88 | # root tweet also added to the cascade since we need the time when the root tweet was tweeted 89 | if root_id not in cascade_df[self.nodeID_col_name].values: 90 | cascade_df.loc[cascade_df.index.max() + 1] = { 91 | self.nodeID_col_name: root_id, 92 | self.userID_col_name: root_userID, 93 | self.nodeTime_col_name: root_nodeTime, 94 | self.rootID_col_name: root_id, 95 | "parentID": None, 96 | "actionType": "NA" 97 | } 98 | cascade_df = cascade_df.set_index(self.nodeID_col_name) 99 | seed_tweets = [root_id] 100 | while seed_tweets: 101 | new_seed_tweets = [] 102 | for seed_tweet_id in seed_tweets: 103 | tweets_to_be_updated = self.get_all_tweets_rtd_later_by_followers(seed_tweet_id, 104 | cascade_df) # assume a user as their follower since a user can retweet themselves 105 | cascade_df.loc[tweets_to_be_updated, "parentID"] = seed_tweet_id 106 | new_seed_tweets.extend(tweets_to_be_updated) 107 | 108 | seed_tweets = cascade_df[ 109 | cascade_df.index.isin(new_seed_tweets)].index.tolist() # keeping the order a.t. tweeted timestamp 110 | 111 | cascade_df = cascade_df[cascade_df['actionType'] != 'NA'] 112 | cascade_df.loc[cascade_df['parentID'] == 0,'parentID'] = cascade_df.loc[cascade_df['parentID'] == 0,'partialParentID'] 113 | #cascade_df.dropna(subset=["parentID"]) 114 | #return cascade_df[cascade_df["parentID"] != 0].reset_index() 115 | 116 | return cascade_df.reset_index() 117 | 118 | def get_approximate_parentids(self, mapping_only=True, csv=False): 119 | """ 120 | :param mapping_only: remove other columns except nodeID and parentID 121 | :param csv: write the parentID mapping to a csv file 122 | """ 123 | # parentID is None for root tweets 124 | parentid_map_dfs = [] 125 | for tweet_id, cascade_df in self.cascade_collection_df.groupby(self.rootID_col_name): 126 | if len(cascade_df[cascade_df['actionType'] != 'reply']) > 0: 127 | updated_cascade_df = self.update_parentid(cascade_df[cascade_df['actionType'] != 'reply'], tweet_id) 128 | parentid_map_dfs.append(updated_cascade_df) 129 | parentid_map_all_cascades_df = pd.concat(parentid_map_dfs).reset_index(drop=True) 130 | parentid_map_all_cascades_df.dropna(inplace=True) 131 | if mapping_only: 132 | parentid_map_all_cascades_df = parentid_map_all_cascades_df[[self.nodeID_col_name, "parentID"]] 133 | if csv: 134 | parentid_map_all_cascades_df.to_csv("retweet_cascades_with_parentID.csv", index=False) 135 | 136 | return parentid_map_all_cascades_df 137 | 138 | def get_reply_cascade_root_tweet(df, parent_node_col="parentID", node_col="nodeID", root_node_col="rootID", timestamp_col="nodeTime", json=False): 139 | """ 140 | :param df: dataframe containing a set of reply cascades 141 | :param json: return in json format or pandas dataframe 142 | :return: df with rootID column added, representing the cascade root node 143 | """ 144 | df = df.sort_values(timestamp_col) 145 | rootid_mapping = pd.Series(df[parent_node_col].values, index=df[node_col]).to_dict() 146 | 147 | def update_reply_cascade(reply_cascade): 148 | for tweet_id, 
reply_to_tweet_id in reply_cascade.items(): 149 | if reply_to_tweet_id in reply_cascade: 150 | reply_cascade[tweet_id] = reply_cascade[reply_to_tweet_id] 151 | return reply_cascade 152 | 153 | prev_rootid_mapping = {} 154 | while rootid_mapping != prev_rootid_mapping: 155 | prev_rootid_mapping = rootid_mapping.copy() 156 | rootid_mapping = update_reply_cascade(rootid_mapping) 157 | 158 | df["rootID_new"] = df[node_col].map(rootid_mapping) 159 | 160 | df.loc[df['rootID'] == '?','rootID'] = df.loc[df['rootID'] == '?','rootID_new'] 161 | df = df.drop('rootID_new',axis=1) 162 | if json: 163 | return df.to_json(orient='records') 164 | else: 165 | return df 166 | 167 | def full_reconstruction(data,followers=defaultdict(lambda: set([]))): 168 | 169 | #store replies for later 170 | replies = data[data['actionType'] == 'reply'] 171 | 172 | #get the user who posted the partial parent tweet for each retweet 173 | parent_users = data[['nodeID','nodeUserID','nodeTime']] 174 | parent_users.columns = ['partialParentID','rootUserID','rootTime'] 175 | data = data.merge(parent_users,on='partialParentID',how='left') 176 | 177 | #store original tweets for later 178 | original_tweets = data[data['actionType'] == 'tweet'] 179 | 180 | cols = ['nodeID','nodeUserID','nodeTime','partialParentID','rootUserID','rootTime','actionType'] 181 | 182 | #get parent IDs for retweets and quotes 183 | pia = ParentIDApproximation(followers, data[cols],rootID_col_name='partialParentID') 184 | parent_ids = pia.get_approximate_parentids() 185 | 186 | data['parentID'] = data['nodeID'].map(dict(zip(parent_ids.nodeID,parent_ids.parentID))) 187 | data = data[~data['actionType'].isin(['reply','tweet'])] 188 | 189 | #rejoin with replies and original tweets 190 | data = pd.concat([data,replies,original_tweets],axis=0).sort_values('nodeTime') 191 | data = data.drop(['rootUserID','rootTime'],axis=1) 192 | 193 | #follow cascade chain to get root node for reply tweets 194 | data = get_reply_cascade_root_tweet(data) 195 | 196 | return(data) 197 | 198 | 199 | if __name__ == '__main__': 200 | 201 | with open('complicated_cascade_followers.json','rb') as f: 202 | followers = json.load(f) 203 | for k in followers: 204 | followers[k] = set(followers[k]) 205 | 206 | followers = defaultdict(lambda: set([]),followers) 207 | 208 | cascade_collection_df = pd.read_csv('complicated_cascade_partial.csv') 209 | 210 | cascade_collection_df['partialParentID'] = cascade_collection_df['partialParentID'].fillna(1) 211 | cascade_collection_df['nodeTime'] = pd.date_range(start='1/1/2018',periods=len(cascade_collection_df)) 212 | 213 | cascade_collection_df['partialParentID'] = cascade_collection_df['partialParentID'].astype(int) 214 | cascade_collection_df[['nodeID','parentID','rootID','partialParentID','nodeUserID']] = cascade_collection_df[['nodeID','parentID','rootID','partialParentID','nodeUserID']].astype(str) 215 | 216 | results = full_reconstruction(cascade_collection_df,followers) 217 | 218 | print(results) 219 | 220 | 221 | 222 | 223 | -------------------------------------------------------------------------------- /december-measurements/BaselineMeasurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | from datetime import datetime 5 | from multiprocessing import Pool 6 | from functools import partial 7 | from pathos import pools as pp 8 | 9 | import pickle as pkl 10 | 11 | from UserCentricMeasurements import * 12 | from ContentCentricMeasurements import * 13 | 
from CommunityCentricMeasurements import * 14 | 15 | from TEMeasurements import * 16 | from collections import defaultdict 17 | 18 | import jpype 19 | import json 20 | import os 21 | 22 | basedir = os.path.dirname(__file__) 23 | 24 | class BaselineMeasurements(UserCentricMeasurements, ContentCentricMeasurements, TEMeasurements, CommunityCentricMeasurements): 25 | def __init__(self, 26 | dfLoc, 27 | content_node_ids=[], 28 | user_node_ids=[], 29 | metaContentData=False, 30 | metaUserData=False, 31 | contentActorsFile=os.path.join(basedir, './baseline_challenge_data/filtUsers-baseline.pkl'), 32 | contentFile=os.path.join(basedir, './baseline_challenge_data/filtRepos-baseline.pkl'), 33 | topNodes=[], 34 | topEdges=[], 35 | previousActionsFile='', 36 | community_dictionary='', 37 | # community_dictionary=os.path.join(basedir, './baseline_challenge_data/baseline_challenge_community_dict.pkl'), 38 | te_config=os.path.join(basedir, './baseline_challenge_data/te_params_baseline.json'), 39 | platform='github', 40 | use_java=True): 41 | super(BaselineMeasurements, self).__init__() 42 | 43 | self.platform = platform 44 | 45 | try: 46 | # check if input is a data frame 47 | dfLoc.columns 48 | df = dfLoc 49 | except: 50 | # if not it should be a csv file path 51 | df = pd.read_csv(dfLoc) 52 | 53 | self.contribution_events = ['PullRequestEvent', 54 | 'PushEvent', 55 | 'IssuesEvent', 56 | 'IssueCommentEvent', 57 | 'PullRequestReviewCommentEvent', 58 | 'CommitCommentEvent', 59 | 'CreateEvent', 60 | 'post', 61 | 'tweet'] 62 | 63 | self.popularity_events = ['WatchEvent', 64 | 'ForkEvent', 65 | 'comment', 66 | 'post', 67 | 'retweet', 68 | 'quote', 69 | 'reply'] 70 | 71 | print('preprocessing...') 72 | 73 | self.main_df = self.preprocess(df) 74 | 75 | print('splitting optional columns...') 76 | 77 | # store action and merged columns in a seperate data frame that is not used for most measurements 78 | if platform == 'github' and len(self.main_df.columns) == 6 and 'action' in self.main_df.columns: 79 | self.main_df_opt = self.main_df.copy()[['action', 'merged']] 80 | self.main_df = self.main_df.drop(['action', 'merged'], axis=1) 81 | else: 82 | self.main_df_opt = None 83 | 84 | # For content centric 85 | print('getting selected content IDs...') 86 | 87 | if content_node_ids != ['all']: 88 | if self.platform == 'reddit': 89 | self.selectedContent = self.main_df[self.main_df.root.isin(content_node_ids)] 90 | elif self.platform == 'twitter': 91 | self.selectedContent = self.main_df[self.main_df.root.isin(content_node_ids)] 92 | else: 93 | self.selectedContent = self.main_df[self.main_df.content.isin(content_node_ids)] 94 | else: 95 | self.selectedContent = self.main_df 96 | 97 | # For userCentric 98 | self.selectedUsers = self.main_df[self.main_df.user.isin(user_node_ids)] 99 | 100 | print('processing repo metatdata...') 101 | 102 | # read in external metadata files 103 | # repoMetaData format - full_name_h,created_at,owner.login_h,language 104 | # userMetaData format - login_h,created_at,location,company 105 | 106 | if metaContentData != False: 107 | self.useContentMetaData = True 108 | meta_content_data = pd.read_csv(metaContentData) 109 | self.contentMetaData = self.preprocessContentMeta(meta_content_data) 110 | else: 111 | self.useContentMetaData = False 112 | print('processing user metatdata...') 113 | if metaUserData != False: 114 | self.useUserMetaData = True 115 | self.userMetaData = self.preprocessUserMeta(pd.read_csv(metaUserData)) 116 | else: 117 | self.useUserMetaData = False 118 | 119 | # For 
Community 120 | self.community_dict_file = community_dictionary 121 | print('getting communities...') 122 | if self.platform == 'github': 123 | self.communityDF = self.getCommmunityDF(community_col='community') 124 | elif self.platform == 'reddit': 125 | self.communityDF = self.getCommmunityDF(community_col='subreddit') 126 | else: 127 | self.communityDF = self.getCommmunityDF(community_col='') 128 | 129 | # read in previous events count external file (used only for one measurement) 130 | try: 131 | print('reading previous counts...') 132 | self.previous_event_counts = pd.read_csv(previousActionsFile) 133 | except: 134 | self.previous_event_counts = None 135 | 136 | # For TE 137 | if use_java: 138 | print('starting jvm...') 139 | if not jpype.isJVMStarted(): 140 | jpype.startJVM(jpype.getDefaultJVMPath(), 141 | '-ea', 142 | '-Djava.class.path=infodynamics.jar') 143 | 144 | # read pkl files which define nodes of interest for TE measurements 145 | self.repo_actors = self.readPickleFile(contentActorsFile) 146 | self.repo_groups = self.readPickleFile(contentFile) 147 | 148 | self.top_users = topNodes 149 | self.top_edges = topEdges 150 | 151 | # read pkl files which define nodes of interest for TE measurements 152 | self.content_actors = self.readPickleFile(contentActorsFile) 153 | self.content_groups = self.readPickleFile(contentFile) 154 | 155 | # set TE parameters 156 | with open(te_config, 'rb') as f: 157 | te_params = json.load(f) 158 | 159 | self.startTime = pd.Timestamp(te_params['startTime']) 160 | self.binSize = te_params['binSize'] 161 | self.teThresh = te_params['teThresh'] 162 | self.delayUnits = np.array(te_params['delayUnits']) 163 | self.starEvent = te_params['starEvent'] 164 | self.otherEvents = te_params['otherEvents'] 165 | self.kE = te_params['kE'] 166 | self.kN = te_params['kN'] 167 | self.nReps = te_params['nReps'] 168 | self.bGetTS = te_params['bGetTS'] 169 | 170 | def preprocess(self, df): 171 | 172 | """ 173 | Edit columns, convert date, sort by date 174 | """ 175 | 176 | if self.platform=='reddit': 177 | mapping = {'actionType' : 'event', 178 | 'communityID': 'subreddit', 179 | 'keywords' : 'keywords', 180 | 'nodeID' : 'content', 181 | 'nodeTime' : 'time', 182 | 'nodeUserID' : 'user', 183 | 'parentID' : 'parent', 184 | 'rootID' : 'root'} 185 | elif self.platform=='twitter': 186 | mapping = {'actionType' : 'event', 187 | 'nodeID' : 'content', 188 | 'nodeTime' : 'time', 189 | 'nodeUserID' : 'user', 190 | 'parentID' : 'parent', 191 | 'rootID' : 'root'} 192 | elif self.platform=='github': 193 | mapping = {'nodeID' : 'content', 194 | 'nodeUserID' : 'user', 195 | 'actionType' : 'event', 196 | 'nodeTime' : 'time', 197 | 'actionSubType': 'action', 198 | 'status':'merged'} 199 | else: 200 | print('Invalid platform.') 201 | 202 | df = df.rename(index=str, columns=mapping) 203 | 204 | df = df[df.event.isin(self.popularity_events + self.contribution_events)] 205 | 206 | try: 207 | df['time'] = pd.to_datetime(df['time'],unit='s') 208 | except: 209 | try: 210 | df['time'] = pd.to_datetime(df['time'],unit='ms') 211 | except: 212 | df['time'] = pd.to_datetime(df['time']) 213 | 214 | 215 | df = df.sort_values(by='time') 216 | df = df.assign(time=df.time.dt.floor('h')) 217 | return df 218 | 219 | def preprocessContentMeta(self, df): 220 | try: 221 | df.columns = ['content', 'created_at', 'owner_id', 'language'] 222 | except: 223 | df.columns = ['created_at', 'owner_id', 'content'] 224 | df['created_at'] = pd.to_datetime(df['created_at']) 225 | df = 
df[df.content.isin(self.main_df.content.values)] 226 | return df 227 | 228 | def preprocessUserMeta(self, df): 229 | try: 230 | df.columns = ['user', 'created_at', 'location', 'company'] 231 | except: 232 | df.columns = ['user', 'created_at', 'city', 'country', 'company'] 233 | df['created_at'] = pd.to_datetime(df['created_at']) 234 | df = df[df.user.isin(self.main_df.user.values)] 235 | return df 236 | 237 | def readPickleFile(self, ipFile): 238 | 239 | with open(ipFile, 'rb') as handle: 240 | obj = pkl.load(handle) 241 | 242 | return obj 243 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/a.txt: -------------------------------------------------------------------------------- 1 | b 2 | c 3 | d 4 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/b.txt: -------------------------------------------------------------------------------- 1 | f 2 | g 3 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/d.txt: -------------------------------------------------------------------------------- 1 | e 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/h.txt: -------------------------------------------------------------------------------- 1 | i 2 | j 3 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/i.txt: -------------------------------------------------------------------------------- 1 | m 2 | n 3 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/j.txt: -------------------------------------------------------------------------------- 1 | k 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/k.txt: -------------------------------------------------------------------------------- 1 | l 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/example_follower_data/m.txt: -------------------------------------------------------------------------------- 1 | o 2 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/twitter_cascade_reconstruction.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import glob 3 | from collections import defaultdict 4 | import os 5 | import pprint 6 | import json 7 | import numpy as np 8 | 9 | def load_data(json_file, full_submission=True): 10 | """ 11 | Takes in the location of a json file and loads it as a pandas dataframe. 12 | Does some preprocessing to change text from unicode to ascii. 
13 | """ 14 | 15 | if full_submission: 16 | with open(json_file) as f: 17 | dataset = json.loads(f.read()) 18 | 19 | dataset = dataset['data'] 20 | dataset = pd.DataFrame(dataset) 21 | else: 22 | dataset = pd.read_json(json_file) 23 | 24 | dataset.sort_index(axis=1, inplace=True) 25 | dataset = dataset.replace('', np.nan) 26 | 27 | # This converts the column names to ascii 28 | mapping = {name:str(name) for name in dataset.columns.tolist()} 29 | dataset = dataset.rename(index=str, columns=mapping) 30 | 31 | # This converts the row names to ascii 32 | dataset = dataset.reset_index(drop=True) 33 | 34 | # This converts the cell values to ascii 35 | json_df = dataset.applymap(str) 36 | 37 | return dataset 38 | 39 | 40 | class ParentIDApproximation: 41 | """ 42 | class to obtain parent tweet id for retweets 43 | """ 44 | 45 | def __init__(self, followers, cascade_collection_df, nodeID_col_name="nodeID", userID_col_name='nodeUserID', 46 | nodeTime_col_name='nodeTime', rootID_col_name='rootID', 47 | root_userID_col_name='rootUserID', 48 | root_nodeTime_col_name='rootTime'): 49 | """ 50 | :param followers: dictionary with key: userID, value: [list of followers of userID] 51 | :param cascade_collection_df: dataframe with nodeID, userID, nodeTime, rootID, root_userID, root_nodeTime as columns 52 | default values for column names correspond to those in the Twitter data schema 53 | (https://wiki.socialsim.info/display/SOC/Twitter+Data+Schema) 54 | """ 55 | self.followers = followers 56 | self.cascade_collection_df = cascade_collection_df 57 | self.nodeID_col_name = nodeID_col_name 58 | self.userID_col_name = userID_col_name 59 | self.nodeTime_col_name = nodeTime_col_name 60 | self.rootID_col_name = rootID_col_name 61 | self.root_userID_col_name = root_userID_col_name 62 | self.root_nodeTime_col_name = root_nodeTime_col_name 63 | 64 | def get_all_tweets_rtd_later_by_followers(self, tweet_id, cascade_df): 65 | tweet_details = cascade_df.loc[tweet_id] 66 | 67 | # add self to followers because users will retweet themselves 68 | return cascade_df[ 69 | (cascade_df[self.userID_col_name]. 70 | isin(self.followers[tweet_details[self.userID_col_name]].union( 71 | {tweet_details[self.userID_col_name]}))) & # in followers 72 | (cascade_df[self.nodeTime_col_name] > tweet_details[self.nodeTime_col_name]) 73 | ]. 
\ 74 | index.values.tolist() 75 | 76 | def update_parentid(self, cascade_df_main, root_id): 77 | root_userID = cascade_df_main.loc[cascade_df_main.index.max()][self.root_userID_col_name] 78 | root_nodeTime = cascade_df_main.loc[cascade_df_main.index.max()][self.root_nodeTime_col_name] 79 | cascade_df = cascade_df_main.sort_values(self.nodeTime_col_name).drop( 80 | [self.root_userID_col_name, self.root_nodeTime_col_name], axis=1).copy() 81 | cascade_df["parentID"] = 0 82 | # root tweet also added to the cascade since we need the time when the root tweet was tweeted 83 | cascade_df.loc[cascade_df.index.max() + 1] = { 84 | self.nodeID_col_name: root_id, 85 | self.userID_col_name: root_userID, 86 | self.nodeTime_col_name: root_nodeTime, 87 | self.rootID_col_name: root_id, 88 | "parentID": None 89 | } 90 | cascade_df = cascade_df.set_index(self.nodeID_col_name) 91 | seed_tweets = [root_id] 92 | while seed_tweets: 93 | new_seed_tweets = [] 94 | for seed_tweet_id in seed_tweets: 95 | tweets_to_be_updated = self.get_all_tweets_rtd_later_by_followers(seed_tweet_id, 96 | cascade_df) # assume a user as their follower since a user can retweet themselves 97 | cascade_df.loc[tweets_to_be_updated, "parentID"] = seed_tweet_id 98 | new_seed_tweets.extend(tweets_to_be_updated) 99 | 100 | seed_tweets = cascade_df[ 101 | cascade_df.index.isin(new_seed_tweets)].index.tolist() # keeping the order a.t. tweeted timestamp 102 | cascade_df.dropna(subset=["parentID"]) 103 | return cascade_df[cascade_df["parentID"] != 0].reset_index() 104 | 105 | def get_approximate_parentids(self, mapping_only=True, csv=False): 106 | """ 107 | :param mapping_only: remove other columns except nodeID and parentID 108 | :param csv: write the parentID mapping to a csv file 109 | """ 110 | # parentID is None for root tweets 111 | parentid_map_dfs = [] 112 | for tweet_id, cascade_df in self.cascade_collection_df.groupby(self.rootID_col_name): 113 | updated_cascade_df = self.update_parentid(cascade_df, tweet_id) 114 | parentid_map_dfs.append(updated_cascade_df) 115 | parentid_map_all_cascades_df = pd.concat(parentid_map_dfs).reset_index(drop=True) 116 | parentid_map_all_cascades_df.dropna(inplace=True) 117 | if mapping_only: 118 | parentid_map_all_cascades_df = parentid_map_all_cascades_df[[self.nodeID_col_name, "parentID"]] 119 | if csv: 120 | parentid_map_all_cascades_df.to_csv("retweet_cascades_with_parentID.csv", index=False) 121 | 122 | return parentid_map_all_cascades_df 123 | 124 | def get_reply_cascade_root_tweet(df, parent_node_col="parentID", node_col="nodeID", root_node_col="rootID", timestamp_col="nodeTime", json=False): 125 | """ 126 | :param df: dataframe containing a set of reply cascades 127 | :param json: return in json format or pandas dataframe 128 | :return: df with rootID column added, representing the cascade root node 129 | """ 130 | df = df.sort_values(timestamp_col) 131 | rootid_mapping = pd.Series(df[parent_node_col].values, index=df[node_col]).to_dict() 132 | 133 | def update_reply_cascade(reply_cascade): 134 | for tweet_id, reply_to_tweet_id in reply_cascade.items(): 135 | if reply_to_tweet_id in reply_cascade: 136 | reply_cascade[tweet_id] = reply_cascade[reply_to_tweet_id] 137 | return reply_cascade 138 | 139 | prev_rootid_mapping = {} 140 | while rootid_mapping != prev_rootid_mapping: 141 | prev_rootid_mapping = rootid_mapping.copy() 142 | rootid_mapping = update_reply_cascade(rootid_mapping) 143 | df["rootID_new"] = df[node_col].map(rootid_mapping) 144 | df.loc[df['actionType'] == 'reply','rootID'] = 
df.loc[df['actionType'] == 'reply','rootID_new'] 145 | df = df.drop('rootID_new',axis=1) 146 | if json: 147 | return df.to_json(orient='records') 148 | else: 149 | return df 150 | 151 | 152 | if __name__ == '__main__': 153 | 154 | #one text file per user listing that user's followers 155 | follower_data = glob.glob('example_follower_data/*.txt') 156 | 157 | #create followers dictionary with user IDs as keys and list of followers as values 158 | followers = defaultdict(lambda: set([])) 159 | for fn in follower_data: 160 | user = os.path.splitext(os.path.split(fn)[-1])[0] 161 | f = set(pd.read_csv(fn,header=None)[0].tolist()) 162 | print('User {}: {} followers'.format(user,len(f))) 163 | followers[user] = f 164 | 165 | #read in ground truth data file in JSON format 166 | #this data should be missing parentIDs for retweets/quotes and rootIDs for replies 167 | #(because they are not available from the Twitter JSON) 168 | cascade_collection_df = load_data('twitter_reconstruction_example_data.json',full_submission=False) 169 | 170 | #store replies for later 171 | replies = cascade_collection_df[cascade_collection_df['actionType'] == 'reply'] 172 | 173 | #limit data to events where the rootID is also contained in the data 174 | cascade_collection_df = cascade_collection_df[cascade_collection_df['rootID'].isin(cascade_collection_df['nodeID'])] 175 | 176 | #get the user who posted the root tweet for each retweet 177 | root_users = cascade_collection_df[['nodeID','nodeUserID','nodeTime']] 178 | root_users.columns = ['rootID','rootUserID','rootTime'] 179 | cascade_collection_df = cascade_collection_df.merge(root_users,on='rootID',how='left') 180 | 181 | #store original tweets for later 182 | original_tweets = cascade_collection_df[cascade_collection_df['actionType'] == 'tweet'] 183 | 184 | #subset on only retweets and quotes 185 | cascade_collection_df = cascade_collection_df[cascade_collection_df['actionType'].isin(['retweet','quote'])] 186 | cascade_collection_df_retweets = cascade_collection_df[['nodeID','nodeUserID','nodeTime','rootID','rootUserID','rootTime']] 187 | 188 | #get parent IDs for retweets and quotes 189 | pia = ParentIDApproximation(followers, cascade_collection_df_retweets) 190 | parent_ids = pia.get_approximate_parentids() 191 | 192 | cascade_collection_df['parentID'] = cascade_collection_df['nodeID'].map(dict(zip(parent_ids.nodeID,parent_ids.parentID))) 193 | 194 | #rejoin with replies and original tweets 195 | cascade_collection_df = pd.concat([cascade_collection_df,replies,original_tweets],axis=0).sort_values('nodeTime') 196 | cascade_collection_df = cascade_collection_df.drop(['rootUserID','rootTime'],axis=1) 197 | 198 | #follow cascade chain to get root node for reply tweets 199 | cascade_collection_df = get_reply_cascade_root_tweet(cascade_collection_df) 200 | 201 | print('Results:') 202 | print(cascade_collection_df) 203 | 204 | output = cascade_collection_df.to_dict(orient='records') 205 | 206 | with open('twitter_example_data_reconstructed.json','w') as f: 207 | json.dump(output, f) 208 | 209 | 210 | 211 | 212 | 213 | -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/twitter_example_data_reconstructed.json: -------------------------------------------------------------------------------- 1 | [{"rootID": "A", "nodeTime": "2017-08-15T00:00:00Z", "nodeUserID": "a", "nodeID": "A", "actionType": "tweet", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:01Z", "nodeUserID": "b", "nodeID": 
"B", "actionType": "retweet", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:02Z", "nodeUserID": "c", "nodeID": "C", "actionType": "retweet", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:03Z", "nodeUserID": "d", "nodeID": "D", "actionType": "reply", "parentID": "A"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:04Z", "nodeUserID": "e", "nodeID": "E", "actionType": "reply", "parentID": "D"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:05Z", "nodeUserID": "f", "nodeID": "F", "actionType": "retweet", "parentID": "B"}, {"rootID": "A", "nodeTime": "2017-08-15T00:00:06Z", "nodeUserID": "g", "nodeID": "G", "actionType": "reply", "parentID": "B"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:07Z", "nodeUserID": "h", "nodeID": "H", "actionType": "tweet", "parentID": "H"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:08Z", "nodeUserID": "i", "nodeID": "I", "actionType": "retweet", "parentID": "H"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:09Z", "nodeUserID": "j", "nodeID": "J", "actionType": "reply", "parentID": "H"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:12Z", "nodeUserID": "m", "nodeID": "M", "actionType": "retweet", "parentID": "I"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:13Z", "nodeUserID": "n", "nodeID": "N", "actionType": "retweet", "parentID": "I"}, {"rootID": "H", "nodeTime": "2017-08-15T00:00:14Z", "nodeUserID": "o", "nodeID": "O", "actionType": "retweet", "parentID": "M"}] -------------------------------------------------------------------------------- /december-measurements/cascade_reconstruction/twitter_reconstruction_example_data.json: -------------------------------------------------------------------------------- 1 | [{"rootID": "A", "actionType": "tweet", "parentID": "A", "nodeTime": "2017-08-15T00:00:00Z", "nodeUserID": "a", "nodeID": "A"}, {"rootID": "A", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:01Z", "nodeUserID": "b", "nodeID": "B"}, {"rootID": "A", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:02Z", "nodeUserID": "c", "nodeID": "C"}, {"rootID": "?", "actionType": "reply", "parentID": "A", "nodeTime": "2017-08-15T00:00:03Z", "nodeUserID": "d", "nodeID": "D"}, {"rootID": "?", "actionType": "reply", "parentID": "D", "nodeTime": "2017-08-15T00:00:04Z", "nodeUserID": "e", "nodeID": "E"}, {"rootID": "A", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:05Z", "nodeUserID": "f", "nodeID": "F"}, {"rootID": "?", "actionType": "reply", "parentID": "B", "nodeTime": "2017-08-15T00:00:06Z", "nodeUserID": "g", "nodeID": "G"}, {"rootID": "H", "actionType": "tweet", "parentID": "H", "nodeTime": "2017-08-15T00:00:07Z", "nodeUserID": "h", "nodeID": "H"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:08Z", "nodeUserID": "i", "nodeID": "I"}, {"rootID": "?", "actionType": "reply", "parentID": "H", "nodeTime": "2017-08-15T00:00:09Z", "nodeUserID": "j", "nodeID": "J"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:12Z", "nodeUserID": "m", "nodeID": "M"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:13Z", "nodeUserID": "n", "nodeID": "N"}, {"rootID": "H", "actionType": "retweet", "parentID": "?", "nodeTime": "2017-08-15T00:00:14Z", "nodeUserID": "o", "nodeID": "O"}] -------------------------------------------------------------------------------- /december-measurements/cascade_validators.py: 
-------------------------------------------------------------------------------- 1 | from functools import wraps 2 | 3 | 4 | def check_root_only(default=None): 5 | """ 6 | check if it is a single node cascade 7 | """ 8 | def wrap(func): 9 | @wraps(func) 10 | def wrapped_f(self, *args, **kwargs): 11 | 12 | if len(self.main_df[self.main_df[self.node_col] != self.main_df[self.root_node_col]])==0: 13 | return default 14 | else: 15 | return func(self, *args, **kwargs) 16 | 17 | return wrapped_f 18 | 19 | return wrap 20 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | "user_activity_timeline": { 40 | "question": '19', 41 | "scale": "node", 42 | "node_type":"user", 43 | 'scenario1':False, 44 | 'scenario2':True, 45 | 'scenario3':False, 46 | "measurement": "getUserActivityTimeline", 47 | "measurement_args":{"eventTypes":twitter_events}, 48 | "metrics": {"rmse": Metrics.rmse, 49 | "nrmse": named_partial(Metrics.rmse,relative=True), 50 | "ks_test": Metrics.ks_test, 51 | "dtw": Metrics.dtw} 52 | 53 | }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":4000,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9987)} 92 | }, 93 | 94 | 
"user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | # "content_diffusion_delay": { 137 | # "question": 1, 138 | # "scale": "node", 139 | # "node_type":"content", 140 | # "scenario1":False, 141 | # "scenario2":True, 142 | # "scenario3":False, 143 | # "measurement": "getContentDiffusionDelay", 144 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | # "metrics": {"ks_test": Metrics.ks_test, 146 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | # }, 148 | 149 | # "content_growth": { 150 | # "question": 2, 151 | # "scale": "node", 152 | # "node_type":"content", 153 | # "scenario1":False, 154 | # "scenario2":True, 155 | # "scenario3":False, 156 | # "measurement": "getContentGrowth", 157 | # "measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 158 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | # "dtw": Metrics.dtw} 160 | # }, 161 | 162 | # "content_contributors": { 163 | # "question": 4, 164 | # "scale": "node", 165 | # "node_type":"content", 166 | # "scenario1":False, 167 | # "scenario2":True, 168 | # "scenario3":False, 169 | # "measurement": "getContributions", 170 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 171 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 172 | # "dtw": Metrics.dtw} 173 | # }, 174 | 175 | # "content_event_distribution_dayofweek": { 176 | # "question": 5, 177 | # "scale": "node", 178 | # "node_type":"content", 179 | # "scenario1":False, 180 | # "scenario2":True, 181 | # "scenario3":False, 182 | # "measurement": "getDistributionOfEvents", 183 | # "measurement_args":{"weekday":True,"content_field":"root"}, 184 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 185 | # }, 186 | 187 | "content_liveliness_distribution": { 188 | "question": 13, 189 | "scale": "population", 190 | "node_type":"content", 191 | "scenario1":True, 192 | "scenario2":True, 193 | "scenario3":True, 194 | "measurement": "getDistributionOfEventsByContent", 195 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 196 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 
197 | }, 198 | 199 | # "content_liveliness_topk": { 200 | # "question": 13, 201 | # "scale": "population", 202 | # "node_type":"content", 203 | # "scenario1":False, 204 | # "scenario2":True, 205 | # "scenario3":False, 206 | # "measurement": "getTopKContent", 207 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 208 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 209 | # }, 210 | 211 | "content_popularity_distribution": { 212 | "question": 13, 213 | "scale": "population", 214 | "node_type":"content", 215 | "scenario1":False, 216 | "scenario2":True, 217 | "scenario3":False, 218 | "measurement": "getDistributionOfEventsByContent", 219 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 220 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 221 | }, 222 | 223 | # "content_popularity_topk": { 224 | # "question": 13, 225 | # "scale": "population", 226 | # "node_type":"content", 227 | # "scenario1":True, 228 | # "scenario2":True, 229 | # "scenario3":True, 230 | # "measurement": "getTopKContent", 231 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 232 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 233 | # }, 234 | 235 | "content_activity_disparity_gini_retweet": { 236 | "question": 14, 237 | "scale": "population", 238 | "node_type":"content", 239 | "scenario1":True, 240 | "scenario2":True, 241 | "scenario3":True, 242 | "measurement": "getGiniCoef", 243 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 244 | "metrics": {"absolute_difference": Metrics.absolute_difference, 245 | "absolute_percentage_error":Metrics.absolute_percentage_error} 246 | }, 247 | 248 | "content_activity_disparity_palma_retweet": { 249 | "question": 14, 250 | "scale": "population", 251 | "node_type":"content", 252 | "scenario1":True, 253 | "scenario2":True, 254 | "scenario3":True, 255 | "measurement": "getPalmaCoef", 256 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 257 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 258 | "absolute_difference":Metrics.absolute_difference} 259 | }, 260 | "content_activity_disparity_gini_quote": { 261 | "question": 14, 262 | "scale": "population", 263 | "node_type":"content", 264 | "scenario1":True, 265 | "scenario2":True, 266 | "scenario3":True, 267 | "measurement": "getGiniCoef", 268 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 269 | "metrics": {"absolute_difference": Metrics.absolute_difference, 270 | "absolute_percentage_error":Metrics.absolute_percentage_error} 271 | }, 272 | 273 | "content_activity_disparity_palma_quote": { 274 | "question": 14, 275 | "scale": "population", 276 | "node_type":"content", 277 | "scenario1":True, 278 | "scenario2":True, 279 | "scenario3":True, 280 | "measurement": "getPalmaCoef", 281 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 282 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 283 | "absolute_difference":Metrics.absolute_difference} 284 | }, 285 | "content_activity_disparity_gini_reply": { 286 | "question": 14, 287 | "scale": "population", 288 | "node_type":"content", 289 | "scenario1":True, 290 | "scenario2":True, 291 | "scenario3":True, 292 | "measurement": "getGiniCoef", 293 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 294 | "metrics": {"absolute_difference": Metrics.absolute_difference, 295 | 
"absolute_percentage_error":Metrics.absolute_percentage_error} 296 | }, 297 | 298 | "content_activity_disparity_palma_reply": { 299 | "question": 14, 300 | "scale": "population", 301 | "node_type":"content", 302 | "scenario1":True, 303 | "scenario2":True, 304 | "scenario3":True, 305 | "measurement": "getPalmaCoef", 306 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 307 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 308 | "absolute_difference":Metrics.absolute_difference} 309 | } 310 | 311 | 312 | } 313 | 314 | 315 | twitter_measurement_params = {} 316 | twitter_measurement_params.update(user_measurement_params) 317 | twitter_measurement_params.update(content_measurement_params) 318 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_crypto_s1.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | "user_activity_timeline": { 40 | "question": '19', 41 | "scale": "node", 42 | "node_type":"user", 43 | 'scenario1':False, 44 | 'scenario2':True, 45 | 'scenario3':False, 46 | "measurement": "getUserActivityTimeline", 47 | "measurement_args":{"eventTypes":twitter_events}, 48 | "metrics": {"rmse": Metrics.rmse, 49 | "nrmse": named_partial(Metrics.rmse,relative=True), 50 | "ks_test": Metrics.ks_test, 51 | "dtw": Metrics.dtw} 52 | 53 | }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 
'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":4000,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9987)} 92 | }, 93 | 94 | "user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | # "content_diffusion_delay": { 137 | # "question": 1, 138 | # "scale": "node", 139 | # "node_type":"content", 140 | # "scenario1":False, 141 | # "scenario2":True, 142 | # "scenario3":False, 143 | # "measurement": "getContentDiffusionDelay", 144 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | # "metrics": {"ks_test": Metrics.ks_test, 146 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | # }, 148 | 149 | # "content_growth": { 150 | # "question": 2, 151 | # "scale": "node", 152 | # "node_type":"content", 153 | # "scenario1":False, 154 | # "scenario2":True, 155 | # "scenario3":False, 156 | # "measurement": "getContentGrowth", 157 | # "measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 158 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | # "dtw": Metrics.dtw} 160 | # }, 161 | 162 | # "content_contributors": { 163 | # "question": 4, 164 | # "scale": "node", 165 | # "node_type":"content", 166 | # "scenario1":False, 167 | # "scenario2":True, 168 | # "scenario3":False, 169 | # "measurement": "getContributions", 170 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 171 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 172 | # "dtw": Metrics.dtw} 173 | # }, 174 | 175 | # "content_event_distribution_dayofweek": { 176 | # "question": 5, 177 | # "scale": "node", 178 | # "node_type":"content", 179 | # "scenario1":False, 180 | # "scenario2":True, 181 | # "scenario3":False, 182 | # "measurement": "getDistributionOfEvents", 183 | # "measurement_args":{"weekday":True,"content_field":"root"}, 184 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 185 | # }, 186 | 187 | "content_liveliness_distribution": { 188 | "question": 13, 189 | "scale": "population", 190 | "node_type":"content", 191 | "scenario1":True, 192 | "scenario2":True, 193 | 
"scenario3":True, 194 | "measurement": "getDistributionOfEventsByContent", 195 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 196 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 197 | }, 198 | 199 | # "content_liveliness_topk": { 200 | # "question": 13, 201 | # "scale": "population", 202 | # "node_type":"content", 203 | # "scenario1":False, 204 | # "scenario2":True, 205 | # "scenario3":False, 206 | # "measurement": "getTopKContent", 207 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 208 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 209 | # }, 210 | 211 | "content_popularity_distribution": { 212 | "question": 13, 213 | "scale": "population", 214 | "node_type":"content", 215 | "scenario1":False, 216 | "scenario2":True, 217 | "scenario3":False, 218 | "measurement": "getDistributionOfEventsByContent", 219 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 220 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 221 | }, 222 | 223 | # "content_popularity_topk": { 224 | # "question": 13, 225 | # "scale": "population", 226 | # "node_type":"content", 227 | # "scenario1":True, 228 | # "scenario2":True, 229 | # "scenario3":True, 230 | # "measurement": "getTopKContent", 231 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 232 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 233 | # }, 234 | 235 | "content_activity_disparity_gini_retweet": { 236 | "question": 14, 237 | "scale": "population", 238 | "node_type":"content", 239 | "scenario1":True, 240 | "scenario2":True, 241 | "scenario3":True, 242 | "measurement": "getGiniCoef", 243 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 244 | "metrics": {"absolute_difference": Metrics.absolute_difference, 245 | "absolute_percentage_error":Metrics.absolute_percentage_error} 246 | }, 247 | 248 | "content_activity_disparity_palma_retweet": { 249 | "question": 14, 250 | "scale": "population", 251 | "node_type":"content", 252 | "scenario1":True, 253 | "scenario2":True, 254 | "scenario3":True, 255 | "measurement": "getPalmaCoef", 256 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 257 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 258 | "absolute_difference":Metrics.absolute_difference} 259 | }, 260 | "content_activity_disparity_gini_quote": { 261 | "question": 14, 262 | "scale": "population", 263 | "node_type":"content", 264 | "scenario1":True, 265 | "scenario2":True, 266 | "scenario3":True, 267 | "measurement": "getGiniCoef", 268 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 269 | "metrics": {"absolute_difference": Metrics.absolute_difference, 270 | "absolute_percentage_error":Metrics.absolute_percentage_error} 271 | }, 272 | 273 | "content_activity_disparity_palma_quote": { 274 | "question": 14, 275 | "scale": "population", 276 | "node_type":"content", 277 | "scenario1":True, 278 | "scenario2":True, 279 | "scenario3":True, 280 | "measurement": "getPalmaCoef", 281 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 282 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 283 | "absolute_difference":Metrics.absolute_difference} 284 | }, 285 | "content_activity_disparity_gini_reply": { 286 | "question": 14, 287 | "scale": "population", 288 | "node_type":"content", 289 | "scenario1":True, 290 | "scenario2":True, 291 | "scenario3":True, 
292 | "measurement": "getGiniCoef", 293 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 294 | "metrics": {"absolute_difference": Metrics.absolute_difference, 295 | "absolute_percentage_error":Metrics.absolute_percentage_error} 296 | }, 297 | 298 | "content_activity_disparity_palma_reply": { 299 | "question": 14, 300 | "scale": "population", 301 | "node_type":"content", 302 | "scenario1":True, 303 | "scenario2":True, 304 | "scenario3":True, 305 | "measurement": "getPalmaCoef", 306 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 307 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 308 | "absolute_difference":Metrics.absolute_difference} 309 | } 310 | 311 | 312 | } 313 | 314 | 315 | twitter_scenario1_measurement_params_crypto = {} 316 | twitter_scenario1_measurement_params_crypto.update(user_measurement_params) 317 | twitter_scenario1_measurement_params_crypto.update(content_measurement_params) 318 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_cve_s1.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | # "user_activity_timeline": { 40 | # "question": '19', 41 | # "scale": "node", 42 | # "node_type":"user", 43 | # 'scenario1':False, 44 | # 'scenario2':True, 45 | # 'scenario3':False, 46 | # "measurement": "getUserActivityTimeline", 47 | # "measurement_args":{"eventTypes":twitter_events}, 48 | # "metrics": {"rmse": Metrics.rmse, 49 | # "nrmse": named_partial(Metrics.rmse,relative=True), 50 | # "ks_test": Metrics.ks_test, 51 | # "dtw": Metrics.dtw} 52 | # }, 53 | 54 | "user_activity_distribution": { 55 | "question": '24a', 56 | "scale": "population", 57 | "node_type":"user", 58 | 'scenario1':True, 59 | 'scenario2':True, 60 | 'scenario3':True, 61 | "measurement": "getUserActivityDistribution", 62 | "measurement_args":{"eventTypes":twitter_events}, 63 | "metrics": {"rmse": Metrics.rmse, 64 | "nrmse": named_partial(Metrics.rmse,relative=True), 65 | "r2": Metrics.r2, 66 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 67 | }, 68 | 69 | "most_active_users": { 70 | "question": '24b', 71 | "scale": "population", 72 | "node_type":"user", 73 | 'scenario1':True, 74 | 'scenario2':True, 75 | 'scenario3':True, 76 | "measurement": "getMostActiveUsers", 77 | 
"measurement_args":{"k":30,"eventTypes":twitter_events}, 78 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.84)} 79 | }, 80 | 81 | "user_popularity": { 82 | "question": '25', 83 | "scale": "population", 84 | "node_type":"user", 85 | 'scenario1':True, 86 | 'scenario2':True, 87 | 'scenario3':True, 88 | "measurement": "getUserPopularity", 89 | "measurement_args":{"k":30,"eventTypes":twitter_events,"content_field":"root"}, 90 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.84)} 91 | }, 92 | 93 | "user_gini_coef": { 94 | "question": '26a', 95 | "scale": "population", 96 | "node_type":"user", 97 | 'scenario1':True, 98 | 'scenario2':True, 99 | 'scenario3':True, 100 | "measurement": "getGiniCoef", 101 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 102 | "metrics": {"absolute_difference": Metrics.absolute_difference, 103 | "absolute_percentage_error":Metrics.absolute_percentage_error} 104 | }, 105 | 106 | "user_palma_coef": { 107 | "question": '26b', 108 | "scale": "population", 109 | "node_type":"user", 110 | 'scenario1':True, 111 | 'scenario2':True, 112 | 'scenario3':True, 113 | "measurement": "getPalmaCoef", 114 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 115 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 116 | "absolute_difference":Metrics.absolute_difference} 117 | }, 118 | 119 | #"user_diffusion_delay": { 120 | # "question": '27', 121 | # "scale": "population", 122 | # "node_type":"user", 123 | # 'scenario1':True, 124 | # 'scenario2':True, 125 | # 'scenario3':True, 126 | # "measurement": "getUserDiffusionDelay", 127 | # "measurement_args":{"eventTypes":twitter_events}, 128 | # "metrics": {"ks_test": Metrics.ks_test} 129 | #} 130 | 131 | } 132 | 133 | content_measurement_params = { 134 | ##Content-centric measurements 135 | # "content_diffusion_delay": { 136 | # "question": 1, 137 | # "scale": "node", 138 | # "node_type":"content", 139 | # "scenario1":False, 140 | # "scenario2":True, 141 | # "scenario3":False, 142 | # "measurement": "getContentDiffusionDelay", 143 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 144 | # "metrics": {"ks_test": Metrics.ks_test, 145 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 146 | # }, 147 | 148 | # "content_growth": { 149 | # "question": 2, 150 | # "scale": "node", 151 | # "node_type":"content", 152 | # "scenario1":False, 153 | # "scenario2":True, 154 | # "scenario3":False, 155 | # "measurement": "getContentGrowth", 156 | # "measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 157 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 158 | # "dtw": Metrics.dtw} 159 | # }, 160 | 161 | # "content_contributors": { 162 | # "question": 4, 163 | # "scale": "node", 164 | # "node_type":"content", 165 | # "scenario1":False, 166 | # "scenario2":True, 167 | # "scenario3":False, 168 | # "measurement": "getContributions", 169 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 170 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 171 | # "dtw": Metrics.dtw} 172 | # }, 173 | 174 | # "content_event_distribution_dayofweek": { 175 | # "question": 5, 176 | # "scale": "node", 177 | # "node_type":"content", 178 | # "scenario1":False, 179 | # "scenario2":True, 180 | # "scenario3":False, 181 | # "measurement": "getDistributionOfEvents", 182 | # "measurement_args":{"weekday":True,"content_field":"root"}, 183 | # 
"metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 184 | # }, 185 | 186 | "content_liveliness_distribution": { 187 | "question": 13, 188 | "scale": "population", 189 | "node_type":"content", 190 | "scenario1":True, 191 | "scenario2":True, 192 | "scenario3":True, 193 | "measurement": "getDistributionOfEventsByContent", 194 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 195 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 196 | }, 197 | 198 | # "content_liveliness_topk": { 199 | # "question": 13, 200 | # "scale": "population", 201 | # "node_type":"content", 202 | # "scenario1":False, 203 | # "scenario2":True, 204 | # "scenario3":False, 205 | # "measurement": "getTopKContent", 206 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 207 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 208 | # }, 209 | 210 | "content_popularity_distribution": { 211 | "question": 13, 212 | "scale": "population", 213 | "node_type":"content", 214 | "scenario1":False, 215 | "scenario2":True, 216 | "scenario3":False, 217 | "measurement": "getDistributionOfEventsByContent", 218 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 219 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 220 | }, 221 | 222 | # "content_popularity_topk": { 223 | # "question": 13, 224 | # "scale": "population", 225 | # "node_type":"content", 226 | # "scenario1":True, 227 | # "scenario2":True, 228 | # "scenario3":True, 229 | # "measurement": "getTopKContent", 230 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 231 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 232 | # }, 233 | 234 | "content_activity_disparity_gini_retweet": { 235 | "question": 14, 236 | "scale": "population", 237 | "node_type":"content", 238 | "scenario1":True, 239 | "scenario2":True, 240 | "scenario3":True, 241 | "measurement": "getGiniCoef", 242 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 243 | "metrics": {"absolute_difference": Metrics.absolute_difference, 244 | "absolute_percentage_error":Metrics.absolute_percentage_error} 245 | }, 246 | 247 | "content_activity_disparity_palma_retweet": { 248 | "question": 14, 249 | "scale": "population", 250 | "node_type":"content", 251 | "scenario1":True, 252 | "scenario2":True, 253 | "scenario3":True, 254 | "measurement": "getPalmaCoef", 255 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 256 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 257 | "absolute_difference":Metrics.absolute_difference} 258 | }, 259 | # "content_activity_disparity_gini_quote": { 260 | # "question": 14, 261 | # "scale": "population", 262 | # "node_type":"content", 263 | # "scenario1":True, 264 | # "scenario2":True, 265 | # "scenario3":True, 266 | # "measurement": "getGiniCoef", 267 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 268 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 269 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 270 | # }, 271 | 272 | # "content_activity_disparity_palma_quote": { 273 | # "question": 14, 274 | # "scale": "population", 275 | # "node_type":"content", 276 | # "scenario1":True, 277 | # "scenario2":True, 278 | # "scenario3":True, 279 | # "measurement": "getPalmaCoef", 280 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 281 | # "metrics": 
{"absolute_percentage_error":Metrics.absolute_percentage_error, 282 | # "absolute_difference":Metrics.absolute_difference} 283 | # }, 284 | # "content_activity_disparity_gini_reply": { 285 | # "question": 14, 286 | # "scale": "population", 287 | # "node_type":"content", 288 | # "scenario1":True, 289 | # "scenario2":True, 290 | # "scenario3":True, 291 | # "measurement": "getGiniCoef", 292 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 293 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 294 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 295 | # }, 296 | 297 | # "content_activity_disparity_palma_reply": { 298 | # "question": 14, 299 | # "scale": "population", 300 | # "node_type":"content", 301 | # "scenario1":True, 302 | # "scenario2":True, 303 | # "scenario3":True, 304 | # "measurement": "getPalmaCoef", 305 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 306 | # "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 307 | # "absolute_difference":Metrics.absolute_difference} 308 | # } 309 | 310 | 311 | } 312 | 313 | 314 | twitter_scenario1_measurement_params_cve = {} 315 | twitter_scenario1_measurement_params_cve.update(user_measurement_params) 316 | twitter_scenario1_measurement_params_cve.update(content_measurement_params) 317 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_cve_s2.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | #"user_unique_content": { 24 | # 'question': '17', 25 | # "scale": "population", 26 | # "node_type":"user", 27 | # 'scenario1':True, 28 | # 'scenario2':True, 29 | # 'scenario3':True, 30 | # "measurement": "getUserUniqueContent", 31 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | # "metrics": { 33 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | # "rmse": Metrics.rmse, 35 | # "nrmse": named_partial(Metrics.rmse,relative=True), 36 | # "r2": Metrics.r2} 37 | #}, 38 | 39 | # "user_activity_timeline": { 40 | # "question": '19', 41 | # "scale": "node", 42 | # "node_type":"user", 43 | # 'scenario1':False, 44 | # 'scenario2':True, 45 | # 'scenario3':False, 46 | # "measurement": "getUserActivityTimeline", 47 | # "measurement_args":{"eventTypes":twitter_events}, 48 | # "metrics": {"rmse": Metrics.rmse, 49 | # "nrmse": named_partial(Metrics.rmse,relative=True), 50 | # "ks_test": Metrics.ks_test, 51 | # "dtw": Metrics.dtw} 52 | # 53 | # }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": 
named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"k":10,"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.58)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":10,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.58)} 92 | }, 93 | 94 | "user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | "content_diffusion_delay": { 137 | "question": 1, 138 | "scale": "node", 139 | "node_type":"content", 140 | "scenario1":False, 141 | "scenario2":True, 142 | "scenario3":False, 143 | "measurement": "getContentDiffusionDelay", 144 | "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | "metrics": {"ks_test": Metrics.ks_test, 146 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | }, 148 | 149 | "content_growth": { 150 | "question": 2, 151 | "scale": "node", 152 | "node_type":"content", 153 | "scenario1":False, 154 | "scenario2":True, 155 | "scenario3":False, 156 | "measurement": "getContentGrowth", 157 | "measurement_args":{"eventTypes":twitter_events,"time_bin":"h","content_field":"root"}, 158 | "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | "nrmse": named_partial(Metrics.rmse,relative=True), 160 | "dtw": Metrics.dtw} 161 | }, 162 | 163 | "content_contributors": { 164 | "question": 4, 165 | "scale": "node", 166 | "node_type":"content", 167 | "scenario1":False, 168 | "scenario2":True, 169 | "scenario3":False, 170 | "measurement": "getContributions", 171 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 172 | "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 173 | "nrmse": 
named_partial(Metrics.rmse,relative=True), 174 | "dtw": Metrics.dtw} 175 | }, 176 | 177 | # "content_event_distribution_dayofweek": { 178 | # "question": 5, 179 | # "scale": "node", 180 | # "node_type":"content", 181 | # "scenario1":False, 182 | # "scenario2":True, 183 | # "scenario3":False, 184 | # "measurement": "getDistributionOfEvents", 185 | # "measurement_args":{"weekday":True,"content_field":"root"}, 186 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 187 | # }, 188 | 189 | # "content_liveliness_distribution": { 190 | # "question": 13, 191 | # "scale": "population", 192 | # "node_type":"content", 193 | # "scenario1":True, 194 | # "scenario2":True, 195 | # "scenario3":True, 196 | # "measurement": "getDistributionOfEventsByContent", 197 | # "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 198 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False), 199 | # "rmse": Metrics.rmse, 200 | # "nrmse": named_partial(Metrics.rmse,relative=True), 201 | # "r2": Metrics.r2} 202 | # }, 203 | 204 | # "content_liveliness_topk": { 205 | # "question": 13, 206 | # "scale": "population", 207 | # "node_type":"content", 208 | # "scenario1":False, 209 | # "scenario2":True, 210 | # "scenario3":False, 211 | # "measurement": "getTopKContent", 212 | # "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 213 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 214 | # }, 215 | 216 | "content_popularity_distribution": { 217 | "question": 13, 218 | "scale": "population", 219 | "node_type":"content", 220 | "scenario1":False, 221 | "scenario2":True, 222 | "scenario3":False, 223 | "measurement": "getDistributionOfEventsByContent", 224 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 225 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False), 226 | "rmse": Metrics.rmse, 227 | "nrmse": named_partial(Metrics.rmse,relative=True), 228 | "r2": Metrics.r2} 229 | }, 230 | 231 | "content_popularity_topk": { 232 | "question": 13, 233 | "scale": "population", 234 | "node_type":"content", 235 | "scenario1":True, 236 | "scenario2":True, 237 | "scenario3":True, 238 | "measurement": "getTopKContent", 239 | "measurement_args":{"k":10,"eventTypes":["retweet"],"content_field":"root"}, 240 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.58)} 241 | }, 242 | 243 | "content_activity_disparity_gini_retweet": { 244 | "question": 14, 245 | "scale": "population", 246 | "node_type":"content", 247 | "scenario1":True, 248 | "scenario2":True, 249 | "scenario3":True, 250 | "measurement": "getGiniCoef", 251 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 252 | "metrics": {"absolute_difference": Metrics.absolute_difference, 253 | "absolute_percentage_error":Metrics.absolute_percentage_error} 254 | }, 255 | 256 | "content_activity_disparity_palma_retweet": { 257 | "question": 14, 258 | "scale": "population", 259 | "node_type":"content", 260 | "scenario1":True, 261 | "scenario2":True, 262 | "scenario3":True, 263 | "measurement": "getPalmaCoef", 264 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 265 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 266 | "absolute_difference":Metrics.absolute_difference} 267 | }, 268 | # "content_activity_disparity_gini_quote": { 269 | # "question": 14, 270 | # "scale": "population", 271 | # "node_type":"content", 272 | # "scenario1":True, 273 | # "scenario2":True, 274 | # 
"scenario3":True, 275 | # "measurement": "getGiniCoef", 276 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 277 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 278 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 279 | # }, 280 | 281 | # "content_activity_disparity_palma_quote": { 282 | # "question": 14, 283 | # "scale": "population", 284 | # "node_type":"content", 285 | # "scenario1":True, 286 | # "scenario2":True, 287 | # "scenario3":True, 288 | # "measurement": "getPalmaCoef", 289 | # "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 290 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 291 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 292 | # }, 293 | # "content_activity_disparity_gini_reply": { 294 | # "question": 14, 295 | # "scale": "population", 296 | # "node_type":"content", 297 | # "scenario1":True, 298 | # "scenario2":True, 299 | # "scenario3":True, 300 | # "measurement": "getGiniCoef", 301 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 302 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 303 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 304 | # }, 305 | 306 | # "content_activity_disparity_palma_reply": { 307 | # "question": 14, 308 | # "scale": "population", 309 | # "node_type":"content", 310 | # "scenario1":True, 311 | # "scenario2":True, 312 | # "scenario3":True, 313 | # "measurement": "getPalmaCoef", 314 | # "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 315 | # "metrics": {"absolute_difference": Metrics.absolute_difference, 316 | # "absolute_percentage_error":Metrics.absolute_percentage_error} 317 | # } 318 | 319 | 320 | } 321 | 322 | 323 | twitter_scenario2_measurement_params_cve = {} 324 | twitter_scenario2_measurement_params_cve.update(user_measurement_params) 325 | twitter_scenario2_measurement_params_cve.update(content_measurement_params) 326 | -------------------------------------------------------------------------------- /december-measurements/config/baseline_metrics_config_twitter_cyber_s1.py: -------------------------------------------------------------------------------- 1 | from functools import partial, update_wrapper 2 | import Metrics 3 | import ContentCentricMeasurements 4 | import UserCentricMeasurements 5 | #from load_data import load_data 6 | from BaselineMeasurements import * 7 | 8 | import pprint 9 | 10 | 11 | def named_partial(func, *args, **kwargs): 12 | partial_func = partial(func, *args, **kwargs) 13 | update_wrapper(partial_func, func) 14 | partial_func.varnames = func.__code__.co_varnames 15 | return partial_func 16 | 17 | 18 | twitter_events = ["tweet","retweet","quote","reply"] 19 | 20 | 21 | user_measurement_params = { 22 | ### User Centric Measurements 23 | "user_unique_content": { 24 | 'question': '17', 25 | "scale": "population", 26 | "node_type":"user", 27 | 'scenario1':True, 28 | 'scenario2':True, 29 | 'scenario3':True, 30 | "measurement": "getUserUniqueContent", 31 | "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 32 | "metrics": { 33 | "js_divergence": named_partial(Metrics.js_divergence, discrete=False), 34 | "rmse": Metrics.rmse, 35 | "nrmse": named_partial(Metrics.rmse,relative=True), 36 | "r2": Metrics.r2} 37 | }, 38 | 39 | "user_activity_timeline": { 40 | "question": '19', 41 | "scale": "node", 42 | "node_type":"user", 43 | 'scenario1':False, 44 | 'scenario2':True, 45 | 'scenario3':False, 46 | "measurement": 
"getUserActivityTimeline", 47 | "measurement_args":{"eventTypes":twitter_events}, 48 | "metrics": {"rmse": Metrics.rmse, 49 | "nrmse": named_partial(Metrics.rmse,relative=True), 50 | "ks_test": Metrics.ks_test, 51 | "dtw": Metrics.dtw} 52 | 53 | }, 54 | 55 | "user_activity_distribution": { 56 | "question": '24a', 57 | "scale": "population", 58 | "node_type":"user", 59 | 'scenario1':True, 60 | 'scenario2':True, 61 | 'scenario3':True, 62 | "measurement": "getUserActivityDistribution", 63 | "measurement_args":{"eventTypes":twitter_events}, 64 | "metrics": {"rmse": Metrics.rmse, 65 | "nrmse": named_partial(Metrics.rmse,relative=True), 66 | "r2": Metrics.r2, 67 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 68 | }, 69 | 70 | "most_active_users": { 71 | "question": '24b', 72 | "scale": "population", 73 | "node_type":"user", 74 | 'scenario1':True, 75 | 'scenario2':True, 76 | 'scenario3':True, 77 | "measurement": "getMostActiveUsers", 78 | "measurement_args":{"eventTypes":twitter_events}, 79 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 80 | }, 81 | 82 | "user_popularity": { 83 | "question": '25', 84 | "scale": "population", 85 | "node_type":"user", 86 | 'scenario1':True, 87 | 'scenario2':True, 88 | 'scenario3':True, 89 | "measurement": "getUserPopularity", 90 | "measurement_args":{"k":4000,"eventTypes":twitter_events,"content_field":"root"}, 91 | "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9987)} 92 | }, 93 | 94 | "user_gini_coef": { 95 | "question": '26a', 96 | "scale": "population", 97 | "node_type":"user", 98 | 'scenario1':True, 99 | 'scenario2':True, 100 | 'scenario3':True, 101 | "measurement": "getGiniCoef", 102 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 103 | "metrics": {"absolute_difference": Metrics.absolute_difference, 104 | "absolute_percentage_error":Metrics.absolute_percentage_error} 105 | }, 106 | 107 | "user_palma_coef": { 108 | "question": '26b', 109 | "scale": "population", 110 | "node_type":"user", 111 | 'scenario1':True, 112 | 'scenario2':True, 113 | 'scenario3':True, 114 | "measurement": "getPalmaCoef", 115 | "measurement_args":{"nodeType":"user","eventTypes":twitter_events}, 116 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 117 | "absolute_difference":Metrics.absolute_difference} 118 | }, 119 | 120 | #"user_diffusion_delay": { 121 | # "question": '27', 122 | # "scale": "population", 123 | # "node_type":"user", 124 | # 'scenario1':True, 125 | # 'scenario2':True, 126 | # 'scenario3':True, 127 | # "measurement": "getUserDiffusionDelay", 128 | # "measurement_args":{"eventTypes":twitter_events}, 129 | # "metrics": {"ks_test": Metrics.ks_test} 130 | #} 131 | 132 | } 133 | 134 | content_measurement_params = { 135 | ##Content-centric measurements 136 | # "content_diffusion_delay": { 137 | # "question": 1, 138 | # "scale": "node", 139 | # "node_type":"content", 140 | # "scenario1":False, 141 | # "scenario2":True, 142 | # "scenario3":False, 143 | # "measurement": "getContentDiffusionDelay", 144 | # "measurement_args":{"eventTypes":["reply",'retweet','quote'],"time_bin":"h","content_field":"root"}, 145 | # "metrics": {"ks_test": Metrics.ks_test, 146 | # "js_divergence": named_partial(Metrics.js_divergence, discrete=False)}, 147 | # }, 148 | 149 | # "content_growth": { 150 | # "question": 2, 151 | # "scale": "node", 152 | # "node_type":"content", 153 | # "scenario1":False, 154 | # "scenario2":True, 155 | # "scenario3":False, 156 | # "measurement": "getContentGrowth", 157 | # 
"measurement_args":{"eventTypes":twitter_events,"time_bin":"d","content_field":"root"}, 158 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 159 | # "dtw": Metrics.dtw} 160 | # }, 161 | 162 | # "content_contributors": { 163 | # "question": 4, 164 | # "scale": "node", 165 | # "node_type":"content", 166 | # "scenario1":False, 167 | # "scenario2":True, 168 | # "scenario3":False, 169 | # "measurement": "getContributions", 170 | # "measurement_args":{"eventTypes":twitter_events,"content_field":"root"}, 171 | # "metrics": {"rmse": named_partial(Metrics.rmse, join="outer"), 172 | # "dtw": Metrics.dtw} 173 | # }, 174 | 175 | # "content_event_distribution_dayofweek": { 176 | # "question": 5, 177 | # "scale": "node", 178 | # "node_type":"content", 179 | # "scenario1":False, 180 | # "scenario2":True, 181 | # "scenario3":False, 182 | # "measurement": "getDistributionOfEvents", 183 | # "measurement_args":{"weekday":True,"content_field":"root"}, 184 | # "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=True)} 185 | # }, 186 | 187 | "content_liveliness_distribution": { 188 | "question": 13, 189 | "scale": "population", 190 | "node_type":"content", 191 | "scenario1":True, 192 | "scenario2":True, 193 | "scenario3":True, 194 | "measurement": "getDistributionOfEventsByContent", 195 | "measurement_args":{"eventTypes":["reply"],"content_field":"root"}, 196 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 197 | }, 198 | 199 | # "content_liveliness_topk": { 200 | # "question": 13, 201 | # "scale": "population", 202 | # "node_type":"content", 203 | # "scenario1":False, 204 | # "scenario2":True, 205 | # "scenario3":False, 206 | # "measurement": "getTopKContent", 207 | ## "measurement_args":{"k":50,"eventTypes":["reply"],"content_field":"root"}, 208 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.9)} 209 | # }, 210 | 211 | "content_popularity_distribution": { 212 | "question": 13, 213 | "scale": "population", 214 | "node_type":"content", 215 | "scenario1":False, 216 | "scenario2":True, 217 | "scenario3":False, 218 | "measurement": "getDistributionOfEventsByContent", 219 | "measurement_args":{"eventTypes":["retweet"],"content_field":"root"}, 220 | "metrics": {"js_divergence": named_partial(Metrics.js_divergence, discrete=False)} 221 | }, 222 | 223 | # "content_popularity_topk": { 224 | # "question": 13, 225 | # "scale": "population", 226 | # "node_type":"content", 227 | # "scenario1":True, 228 | # "scenario2":True, 229 | # "scenario3":True, 230 | # "measurement": "getTopKContent", 231 | # "measurement_args":{"k":5000,"eventTypes":["retweet"],"content_field":"root"}, 232 | # "metrics": {"rbo": named_partial(Metrics.rbo_score, p=0.999)} 233 | # }, 234 | 235 | "content_activity_disparity_gini_retweet": { 236 | "question": 14, 237 | "scale": "population", 238 | "node_type":"content", 239 | "scenario1":True, 240 | "scenario2":True, 241 | "scenario3":True, 242 | "measurement": "getGiniCoef", 243 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 244 | "metrics": {"absolute_difference": Metrics.absolute_difference, 245 | "absolute_percentage_error":Metrics.absolute_percentage_error} 246 | }, 247 | 248 | "content_activity_disparity_palma_retweet": { 249 | "question": 14, 250 | "scale": "population", 251 | "node_type":"content", 252 | "scenario1":True, 253 | "scenario2":True, 254 | "scenario3":True, 255 | "measurement": "getPalmaCoef", 256 | "measurement_args":{"eventTypes":["retweet"],"nodeType":"root"}, 257 | 
"metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 258 | "absolute_difference":Metrics.absolute_difference} 259 | }, 260 | "content_activity_disparity_gini_quote": { 261 | "question": 14, 262 | "scale": "population", 263 | "node_type":"content", 264 | "scenario1":True, 265 | "scenario2":True, 266 | "scenario3":True, 267 | "measurement": "getGiniCoef", 268 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 269 | "metrics": {"absolute_difference": Metrics.absolute_difference, 270 | "absolute_percentage_error":Metrics.absolute_percentage_error} 271 | }, 272 | 273 | "content_activity_disparity_palma_quote": { 274 | "question": 14, 275 | "scale": "population", 276 | "node_type":"content", 277 | "scenario1":True, 278 | "scenario2":True, 279 | "scenario3":True, 280 | "measurement": "getPalmaCoef", 281 | "measurement_args":{"eventTypes":["quote"],"nodeType":"root"}, 282 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 283 | "absolute_difference":Metrics.absolute_difference} 284 | }, 285 | "content_activity_disparity_gini_reply": { 286 | "question": 14, 287 | "scale": "population", 288 | "node_type":"content", 289 | "scenario1":True, 290 | "scenario2":True, 291 | "scenario3":True, 292 | "measurement": "getGiniCoef", 293 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 294 | "metrics": {"absolute_difference": Metrics.absolute_difference, 295 | "absolute_percentage_error":Metrics.absolute_percentage_error} 296 | }, 297 | 298 | "content_activity_disparity_palma_reply": { 299 | "question": 14, 300 | "scale": "population", 301 | "node_type":"content", 302 | "scenario1":True, 303 | "scenario2":True, 304 | "scenario3":True, 305 | "measurement": "getPalmaCoef", 306 | "measurement_args":{"eventTypes":["reply"],"nodeType":"root"}, 307 | "metrics": {"absolute_percentage_error":Metrics.absolute_percentage_error, 308 | "absolute_difference":Metrics.absolute_difference} 309 | } 310 | 311 | 312 | } 313 | 314 | 315 | twitter_scenario1_measurement_params_cyber = {} 316 | twitter_scenario1_measurement_params_cyber.update(user_measurement_params) 317 | twitter_scenario1_measurement_params_cyber.update(content_measurement_params) 318 | -------------------------------------------------------------------------------- /december-measurements/config/network_metrics_config.py: -------------------------------------------------------------------------------- 1 | import Metrics 2 | from run_measurements_and_metrics import named_partial 3 | 4 | network_measurement_params = { 5 | ### Github 6 | "number_of_nodes": { 7 | "question": '', 8 | "scale": "population", 9 | "scenario1":True, 10 | "scenario2":False, 11 | "sceanrio2":True, 12 | "measurement": "number_of_nodes", 13 | "metrics": { 14 | "absolute_difference": Metrics.absolute_difference, 15 | "absolute_percentage_error": Metrics.absolute_percentage_error, 16 | } 17 | }, 18 | 19 | "number_of_edges": { 20 | "question": '', 21 | "scale": "population", 22 | "scenario1":True, 23 | "scenario2":False, 24 | "sceanrio2":True, 25 | "measurement": 'number_of_edges', 26 | "metrics": { 27 | "absolute_difference": Metrics.absolute_difference, 28 | "absolute_percentage_error": Metrics.absolute_percentage_error, 29 | } 30 | }, 31 | 32 | "density": { 33 | "question": '', 34 | "scale": "population", 35 | "scenario1":True, 36 | "scenario2":False, 37 | "sceanrio2":True, 38 | "measurement": 'density', 39 | "metrics": { 40 | "absolute_percentage_error": Metrics.absolute_percentage_error, 41 | "absolute_difference": 
Metrics.absolute_difference, 42 | } 43 | }, 44 | 45 | "mean_shortest_path_length": { 46 | "question": '', 47 | "scale": "population", 48 | "scenario1":True, 49 | "scenario2":False, 50 | "sceanrio2":True, 51 | "measurement": 'mean_shortest_path_length', 52 | "metrics": { 53 | "absolute_difference": Metrics.absolute_difference, 54 | "absolute_percentage_error": Metrics.absolute_percentage_error, 55 | } 56 | }, 57 | 58 | "assortativity_coefficient": { 59 | "question": '', 60 | "scale": "population", 61 | "scenario1":True, 62 | "scenario2":False, 63 | "sceanrio2":True, 64 | "measurement": 'assortativity_coefficient', 65 | "metrics": { 66 | "absolute_percentage_error": Metrics.absolute_percentage_error, 67 | "absolute_difference": Metrics.absolute_difference, 68 | } 69 | }, 70 | 71 | "number_of_connected_components": { 72 | "question": '', 73 | "scale": "population", 74 | "scenario1":True, 75 | "scenario2":False, 76 | "sceanrio2":True, 77 | "measurement": 'number_of_connected_components', 78 | "metrics": { 79 | "absolute_difference": Metrics.absolute_difference, 80 | "absolute_percentage_error": Metrics.absolute_percentage_error, 81 | } 82 | }, 83 | 84 | "average_clustering_coefficient": { 85 | "question": '', 86 | "scale": "population", 87 | "scenario1":True, 88 | "scenario2":False, 89 | "sceanrio2":True, 90 | "measurement": 'average_clustering_coefficient', 91 | "metrics": { 92 | "absolute_percentage_error": Metrics.absolute_percentage_error, 93 | "absolute_difference": Metrics.absolute_difference, 94 | } 95 | }, 96 | 97 | "max_node_degree": { 98 | "question": '', 99 | "scale": "population", 100 | "scenario1":True, 101 | "scenario2":False, 102 | "sceanrio2":True, 103 | "measurement": 'max_node_degree', 104 | "metrics": { 105 | "absolute_difference": Metrics.absolute_difference, 106 | "absolute_percentage_error": Metrics.absolute_percentage_error, 107 | } 108 | }, 109 | 110 | "mean_node_degree": { 111 | "question": '', 112 | "scale": "population", 113 | "scenario1":True, 114 | "scenario2":False, 115 | "sceanrio2":True, 116 | "measurement": 'mean_node_degree', 117 | "metrics": { 118 | "absolute_difference": Metrics.absolute_difference, 119 | "absolute_percentage_error": Metrics.absolute_percentage_error, 120 | } 121 | }, 122 | 123 | "degree_distribution": { 124 | "question": '', 125 | "scale": "population", 126 | "scenario1":True, 127 | "scenario2":False, 128 | "sceanrio2":True, 129 | "measurement": 'degree_distribution', 130 | "metrics": { 131 | "js_divergence": named_partial(Metrics.js_divergence, discrete=True), 132 | } 133 | }, 134 | 135 | "community_modularity": { 136 | "question": '', 137 | "scale": "population", 138 | "scenario1":True, 139 | "scenario2":False, 140 | "sceanrio2":True, 141 | "measurement": 'community_modularity', 142 | "metrics": { 143 | "absolute_percentage_error": Metrics.absolute_percentage_error, 144 | "absolute_difference": Metrics.absolute_difference, 145 | } 146 | }, 147 | } 148 | -------------------------------------------------------------------------------- /december-measurements/infodynamics.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pnnl/socialsim/06f0ce61d10ca08dd50d256fb30ac0ae81ead58d/december-measurements/infodynamics.jar -------------------------------------------------------------------------------- /december-measurements/network_measurements.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 
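# --- Example (illustration only, not part of the original files) ---
# A minimal, self-contained sketch of how an entry in network_measurement_params above can be
# consumed: the "measurement" string names a method on one of the *NetworkMeasurements classes
# defined in this file, it is evaluated on both the ground truth and the simulation, and each
# function in "metrics" is applied to that pair of results. The metric functions and the
# run_network_metrics() driver below are toy stand-ins (the real ones live in Metrics.py and
# run_measurements_and_metrics.py and may differ in detail).

def absolute_difference(ground_truth, simulation):
    return abs(ground_truth - simulation)

def absolute_percentage_error(ground_truth, simulation):
    return 100.0 * abs(ground_truth - simulation) / abs(ground_truth)

class ToyNetworkMeasurements(object):
    # Stand-in for GithubNetworkMeasurements / TwitterNetworkMeasurements /
    # RedditNetworkMeasurements, exposing a single measurement method.
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes

    def number_of_nodes(self):
        return self.n_nodes

example_params = {
    "number_of_nodes": {
        "measurement": "number_of_nodes",
        "metrics": {"absolute_difference": absolute_difference,
                    "absolute_percentage_error": absolute_percentage_error},
    },
}

def run_network_metrics(params, ground_truth, simulation):
    # Hypothetical driver: look up the measurement method by name, then score the pair.
    results = {}
    for name, spec in params.items():
        gt_value = getattr(ground_truth, spec["measurement"])()
        sim_value = getattr(simulation, spec["measurement"])()
        results[name] = dict((metric_name, metric(gt_value, sim_value))
                             for metric_name, metric in spec["metrics"].items())
    return results

print(run_network_metrics(example_params,
                          ToyNetworkMeasurements(n_nodes=120),
                          ToyNetworkMeasurements(n_nodes=100)))
# -> {'number_of_nodes': {'absolute_difference': 20,
#                         'absolute_percentage_error': 16.666...}}
# --- end example ---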
| 3 | import pandas as pd 4 | import igraph as ig 5 | import snap as sn 6 | from time import time 7 | import numpy as np 8 | 9 | import community 10 | import tqdm 11 | 12 | from prettytable import PrettyTable 13 | from prettytable import MSWORD_FRIENDLY 14 | 15 | import os 16 | 17 | __all__ = ['GithubNetworkMeasurements', 18 | 'TwitterNetworkMeasurements', 19 | 'RedditNetworkMeasurements'] 20 | 21 | class NetworkMeasurements(object): 22 | """ 23 | This class implements Network specific measurements. It uses iGraph and SNAP libraries with Python interfaces. 24 | For installation information please visit the websites for the two packages. 25 | 26 | iGraph-Python at http://igraph.org/python/ 27 | SNAP Python at https://snap.stanford.edu/snappy/ 28 | """ 29 | def __init__(self, data, test=False): 30 | self.main_df = data if isinstance(data, pd.DataFrame) else pd.read_csv(data) 31 | 32 | if test: 33 | print('Running test version of network measurements') 34 | self.main_df = self.main_df.head(1000) 35 | 36 | assert self.main_df is not None and len(self.main_df) > 0, 'Problem with the dataframe creation' 37 | 38 | self.preprocess() 39 | 40 | self.build_undirected_graph(self.main_df) 41 | 42 | 43 | def preprocess(self): 44 | return NotImplementedError() 45 | 46 | def build_undirected_graph(self, df): 47 | return NotImplementedError() 48 | 49 | def mean_shortest_path_length(self): 50 | return sn.GetBfsEffDiamAll(self.gUNsn, 500, False)[3] 51 | 52 | def number_of_nodes(self): 53 | return ig.Graph.vcount(self.gUNig) 54 | 55 | def number_of_edges(self): 56 | return ig.Graph.ecount(self.gUNig) 57 | 58 | def density(self): 59 | return ig.Graph.density(self.gUNig) 60 | 61 | def assortativity_coefficient(self): 62 | return ig.Graph.assortativity_degree(self.gUNig) 63 | 64 | def number_of_connected_components(self): 65 | return len(ig.Graph.components(self.gUNig,mode="WEAK")) 66 | 67 | def average_clustering_coefficient(self): 68 | return sn.GetClustCfAll(self.gUNsn, sn.TFltPrV())[0] 69 | #return ig.Graph.transitivity_avglocal_undirected(self.gUNig,mode="zero") 70 | 71 | def max_node_degree(self): 72 | return max(ig.Graph.degree(self.gUNig)) 73 | 74 | def mean_node_degree(self): 75 | return 2.0*ig.Graph.ecount(self.gUNig)/ig.Graph.vcount(self.gUNig) 76 | 77 | def degree_distribution(self): 78 | degVals = ig.Graph.degree(self.gUNig) 79 | return pd.DataFrame([{'node': idx, 'value': degVals[idx]} for idx in range(self.gUNig.vcount())]) 80 | 81 | def community_modularity(self): 82 | return ig.Graph.modularity(self.gUNig,ig.Graph.community_multilevel(self.gUNig)) 83 | 84 | 85 | def get_parent_uids(self,df, parent_node_col="parentID", node_col="nodeID", root_node_col="rootID", user_col="nodeUserID"): 86 | """ 87 | :return: adds parentUserID column with user id of the parent if it exits in df 88 | if it doesn't exist, uses the user id of the root instead 89 | if both doesn't exist: NaN 90 | """ 91 | tweet_uids = pd.Series(df[user_col].values, index=df[node_col]).to_dict() 92 | df['parentUserID'] = df[parent_node_col].map(tweet_uids) 93 | df.loc[(df[root_node_col] != df[node_col]) & (df['parentUserID'].isnull()), 'parentUserID'] = \ 94 | df[(df[root_node_col] != df[node_col]) & (df['parentUserID'].isnull())][root_node_col].map(tweet_uids) 95 | return df 96 | 97 | class GithubNetworkMeasurements(NetworkMeasurements): 98 | 99 | def __init__(self, project_on='nodeID', weighted=False, **kwargs): 100 | self.project_on = project_on 101 | self.weighted = weighted 102 | super(GithubNetworkMeasurements, 
self).__init__(**kwargs) 103 | 104 | def preprocess(self): 105 | pass 106 | 107 | def build_undirected_graph(self, df): 108 | 109 | #self.main_df = pd.read_csv(data) 110 | self.main_df = self.main_df[['nodeUserID','nodeID']] 111 | 112 | left_nodes = np.array(self.main_df['nodeUserID'].unique().tolist()) 113 | right_nodes = np.array(self.main_df['nodeID'].unique().tolist()) 114 | el = self.main_df.apply(tuple, axis=1).tolist() 115 | edgelist = list(set(el)) 116 | 117 | #iGraph Graph object construction 118 | B = ig.Graph.TupleList(edgelist, directed=False) 119 | names = np.array(B.vs["name"]) 120 | types = np.isin(names,right_nodes) 121 | B.vs["type"] = types 122 | p1,p2 = B.bipartite_projection(multiplicity=False) 123 | 124 | self.gUNig = None 125 | if (self.project_on == "user"): 126 | self.gUNig = p1 127 | else: 128 | self.gUNig = p2 129 | 130 | #self.gUNig = B.bipartite_projection(multiplicity=False, which = 0) 131 | 132 | 133 | #SNAP graph object construction 134 | self.gUNsn = sn.TUNGraph.New() 135 | for v in self.gUNig.vs: 136 | self.gUNsn.AddNode(v.index) 137 | for e in self.gUNig.es: 138 | self.gUNsn.AddEdge(e.source,e.target) 139 | 140 | 141 | class TwitterNetworkMeasurements(NetworkMeasurements): 142 | def __init__(self, **kwargs): 143 | super(TwitterNetworkMeasurements, self).__init__(**kwargs) 144 | 145 | def preprocess(self): 146 | pass 147 | 148 | def build_undirected_graph(self, df): 149 | 150 | df = self.get_parent_uids(df).dropna(subset=['parentUserID']) 151 | edgelist = df[['nodeUserID','parentUserID']].apply(tuple,axis=1).tolist() 152 | 153 | #iGraph Graph object construction 154 | self.gUNig = ig.Graph.TupleList(edgelist, directed=False) 155 | 156 | #SNAP graph object construction 157 | self.gUNsn = sn.TUNGraph.New() 158 | for v in self.gUNig.vs: 159 | self.gUNsn.AddNode(v.index) 160 | for e in self.gUNig.es: 161 | self.gUNsn.AddEdge(e.source,e.target) 162 | 163 | 164 | class RedditNetworkMeasurements(NetworkMeasurements): 165 | def __init__(self, **kwargs): 166 | super(RedditNetworkMeasurements, self).__init__(**kwargs) 167 | 168 | def preprocess(self): 169 | pass 170 | 171 | def build_undirected_graph(self,df): 172 | 173 | df = self.get_parent_uids(df).dropna(subset=['parentUserID']) 174 | edgelist = df[['nodeUserID','parentUserID']].apply(tuple,axis=1).tolist() 175 | 176 | #iGraph Graph object construction 177 | self.gUNig = ig.Graph.TupleList(edgelist, directed=False) 178 | 179 | #SNAP graph object construction 180 | self.gUNsn = sn.TUNGraph.New() 181 | for v in self.gUNig.vs: 182 | self.gUNsn.AddNode(v.index) 183 | for e in self.gUNig.es: 184 | self.gUNsn.AddEdge(e.source,e.target) 185 | 186 | 187 | -------------------------------------------------------------------------------- /december-measurements/plotting/charts.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import seaborn as sns 3 | import numpy as np 4 | import pandas as pd 5 | 6 | sns.set(style="whitegrid") 7 | 8 | 9 | def histogram(df, xlabel, ylabel, title, **kwargs): 10 | n_bins = 100 11 | 12 | if 'Simulation' in df.columns and 'Ground Truth' in df.columns: 13 | 14 | gold_data = df.dropna(subset=["Ground Truth"])["Ground Truth"] 15 | test_data = df.dropna(subset=["Simulation"])["Simulation"] 16 | 17 | data = np.concatenate([gold_data, test_data]) 18 | 19 | elif 'Simulation' in df.columns or 'Ground Truth' in df.columns: 20 | 21 | if 'Simulation' in df.columns: 22 | data = df.dropna(subset=["Simulation"])['Simulation'] 23 | 
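            # Only the simulation output is available in this branch; it is reused
            # below both for choosing the histogram bins and as the plotted series.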
test_data = data.copy() 24 | else: 25 | data = df.dropna(subset=["Ground Truth"])['Ground Truth'] 26 | gold_data = data.copy() 27 | else: 28 | return None 29 | 30 | _,bins = np.histogram(data,bins='doane') 31 | #bins = np.linspace(data.min(), data.max(), n_bins) 32 | 33 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 34 | if 'Ground Truth' in df.columns: 35 | ax.hist(gold_data, bins, log=True, label='Ground Truth', alpha=0.7, color='green') 36 | if 'Simulation' in df.columns: 37 | ax.hist(test_data, bins, log=True, label='Simulation', alpha=.7, color='red') 38 | 39 | ax.set(xlabel=xlabel) 40 | ax.set(ylabel=ylabel) 41 | ax.set(title=title) 42 | ax.legend(loc='best') 43 | 44 | plt.tight_layout() 45 | return fig 46 | 47 | 48 | def scatter(df, xlabel, ylabel, title, **kwargs): 49 | 50 | if 'Ground Truth' in df.columns and 'Simulation' in df.columns: 51 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 52 | sns.scatterplot(x="Ground Truth", y="Simulation", data=df, ax=ax, alpha=0.7) 53 | ax.set(xlabel=xlabel) 54 | ax.set(ylabel=ylabel) 55 | ax.set(title=title) 56 | plt.tight_layout() 57 | return fig 58 | else: 59 | return None 60 | 61 | 62 | def bar(df, xlabel, ylabel, title, **kwargs): 63 | 64 | palette = set_palette(df) 65 | 66 | df.fillna(0, inplace=True) 67 | 68 | df = df.melt(df.columns[0], var_name='type', value_name='vals') 69 | 70 | fig, ax = plt.subplots(1, 1, figsize=(15, 7)) 71 | sns.barplot(x=df.columns[0], y='vals', hue='type', data=df, ax=ax, palette=palette, alpha=0.7) 72 | ax.set_xticklabels(ax.get_xticklabels(), rotation=30) 73 | ax.set(xlabel=xlabel) 74 | ax.set(ylabel=ylabel) 75 | ax.legend(loc='best') 76 | ax.set(title=title) 77 | plt.tight_layout() 78 | return fig 79 | 80 | 81 | def set_palette(df): 82 | 83 | if 'Ground Truth' in df.columns and 'Simulation' in df.columns: 84 | palette = ['green','red'] 85 | elif 'Ground Truth' in df.columns: 86 | palette = ['green'] 87 | else: 88 | palette = ['red'] 89 | 90 | return palette 91 | 92 | def time_series(df, xlabel, ylabel, title, **kwargs): 93 | 94 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 95 | 96 | palette = set_palette(df) 97 | 98 | df = df.melt(id_vars = [c for c in df.columns if c not in ['Ground Truth','Simulation']], var_name='type', value_name='vals').sort_values('type') 99 | 100 | df.dropna(inplace=True) 101 | sns.lineplot(x=df.columns[0], y='vals', hue='type', data=df, ax=ax, marker='o', palette=palette, alpha=0.7) 102 | handles, labels = ax.get_legend_handles_labels() 103 | ax.legend(loc='best', handles=handles[1:], labels=labels[1:]) 104 | 105 | ax.set(xlabel=xlabel) 106 | ax.set(ylabel=ylabel) 107 | ax.set(title=title) 108 | plt.tight_layout() 109 | 110 | return fig 111 | 112 | 113 | 114 | def multi_time_series(df, xlabel, ylabel, title, **kwargs): 115 | 116 | fig, ax = plt.subplots(1, 1, figsize=(15, 5)) 117 | 118 | if 'time' in df.columns: 119 | time_col = 'time' 120 | elif 'date' in df.columns: 121 | time_col = 'date' 122 | elif 'weekday' in df.columns: 123 | day_map = {'Monday':1, 124 | 'Tuesday':2, 125 | 'Wednesday':3, 126 | 'Thursday':4, 127 | 'Friday':5, 128 | 'Saturday':6, 129 | 'Sunday':7} 130 | df['weekday_int'] = df['weekday'].map(day_map) 131 | df = df.sort_values('weekday_int') 132 | time_col = 'weekday_int' 133 | 134 | if 'Ground Truth' in df.columns and 'Simulation' in df.columns: 135 | value_vars = ['Ground Truth', 'Simulation'] 136 | elif 'Ground Truth' in df.columns: 137 | value_vars = ['Ground Truth'] 138 | else: 139 | value_vars = ['Simulation'] 140 | 141 | df = pd.melt(df, id_vars=[c for c 
in df.columns if c not in value_vars], value_vars=value_vars, var_name='type').fillna(0) 142 | 143 | sns.lineplot(x=time_col, y='value', hue=[c for c in df.columns if c not in ['Ground Truth', 'Simulation',time_col]][0], style='type', 144 | data=df, ax=ax, marker='o', alpha=0.7, 145 | palette='bright') 146 | 147 | if time_col == 'weekday_int': 148 | ax.set(xticklabels=df['weekday'].unique()) 149 | 150 | handles, labels = ax.get_legend_handles_labels() 151 | ax.legend(loc='best', handles=handles[1:], labels=labels[1:]) 152 | ax.set(xlabel=xlabel) 153 | ax.set(ylabel=ylabel) 154 | ax.set(title=title) 155 | plt.tight_layout() 156 | return fig 157 | 158 | 159 | def save_charts(fig, loc): 160 | fig.savefig(loc) 161 | plt.close(fig) 162 | 163 | 164 | def show_charts(): 165 | plt.show() 166 | 167 | def chart_factory(chart_name): 168 | charts_mapping = { 169 | 'bar': bar, 170 | 'hist': histogram, 171 | 'time_series': time_series, 172 | 'scatter': scatter, 173 | 'multi_time_series':multi_time_series 174 | } 175 | 176 | return charts_mapping.get(chart_name, None) 177 | -------------------------------------------------------------------------------- /december-measurements/plotting/transformer.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def to_DataFrame(data_type): 4 | data_mapping = { 5 | 'dict': convert_dict, 6 | 'DataFrame': convert_DataFrame, 7 | 'dict_DataFrame': convert_dict_DataFrame, 8 | 'dict_Series': convert_dict_Series, 9 | 'dict_array':convert_dict_array, 10 | 'Series':convert_Series, 11 | 'tuple': None 12 | } 13 | 14 | return data_mapping.get(data_type, None) 15 | 16 | 17 | 18 | 19 | def convert_Series(ground_truth_data=None, sim_data=None, **kwargs): 20 | 21 | if not ground_truth_data is None and not sim_data is None: 22 | result_df = pd.concat([ground_truth_data.reset_index(drop=True),sim_data.reset_index(drop=True)], axis=1) 23 | result_df.columns = ['Ground Truth', 'Simulation'] 24 | elif not ground_truth_data is None: 25 | result_df = pd.DataFrame(ground_truth_data.reset_index(drop=True)) 26 | result_df.columns = ['Ground Truth'] 27 | elif not sim_data is None: 28 | result_df = pd.DataFrame(sim_data.reset_index(drop=True)) 29 | result_df.columns = ['Simulation'] 30 | 31 | return result_df 32 | 33 | 34 | 35 | def convert_dict(ground_truth_data=None, sim_data=None, **kwargs): 36 | 37 | 38 | if not ground_truth_data is None and not sim_data is None: 39 | keys = list(ground_truth_data.keys()) + list(sim_data.keys()) 40 | 41 | keys = set(keys) 42 | 43 | data = [] 44 | for k in keys: 45 | data.append({'Key': k, 'Ground Truth': ground_truth_data.get(k, None), 'Simulation': sim_data.get(k, None)}) 46 | 47 | df= pd.DataFrame(data)[['Key','Ground Truth','Simulation']] 48 | 49 | 50 | elif not ground_truth_data is None: 51 | keys = list(ground_truth_data.keys()) 52 | 53 | keys = set(keys) 54 | 55 | data = [] 56 | for k in keys: 57 | data.append({'Key': k, 'Ground Truth': ground_truth_data.get(k, None)}) 58 | 59 | df= pd.DataFrame(data)[['Key','Ground Truth']] 60 | 61 | elif not sim_data is None: 62 | keys = list(sim_data.keys()) 63 | 64 | keys = set(keys) 65 | 66 | data = [] 67 | for k in keys: 68 | data.append({'Key': k, 'Simulation': sim_data.get(k, None)}) 69 | 70 | df= pd.DataFrame(data)[['Key','Simulation']] 71 | 72 | return df 73 | 74 | 75 | def convert_DataFrame(ground_truth_data=None, sim_data=None, **kwargs): 76 | 77 | if ground_truth_data is None: 78 | result_df = sim_data.copy() 79 | 
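        # Only simulation output was supplied: relabel its generic 'value' column
        # as 'Simulation' so the downstream plotting code can find it by name.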
result_df.rename(index=str,columns={'value':'Simulation'},inplace=True) 80 | elif sim_data is None: 81 | result_df = ground_truth_data.copy() 82 | result_df.rename(index=str,columns={'value':'Ground Truth'},inplace=True) 83 | else: 84 | merge_cols = [c for c in ground_truth_data.columns if c != 'value'] 85 | result_df = pd.merge(ground_truth_data, sim_data, on=merge_cols, how='outer') 86 | result_df.columns = merge_cols + ['Ground Truth', 'Simulation'] 87 | 88 | return result_df 89 | 90 | 91 | def convert_dict_DataFrame(ground_truth_data=None, sim_data=None, **kwargs): 92 | 93 | if kwargs.get('key'): 94 | 95 | if not ground_truth_data is None and not sim_data is None and kwargs.get('key') in ground_truth_data and kwargs.get('key') in sim_data: 96 | merge_columns = [c for c in ground_truth_data[kwargs.get('key')].columns if c != 'value'] 97 | result_df = pd.merge(ground_truth_data[kwargs.get('key')], sim_data[kwargs.get('key')], on=merge_columns, how='outer') 98 | result_df.columns = merge_columns + ['Ground Truth', 'Simulation'] 99 | elif not ground_truth_data is None and kwargs.get('key') in ground_truth_data: 100 | result_df = ground_truth_data[kwargs.get('key')].copy() 101 | result_df.rename(index=str,columns={"value":"Ground Truth"},inplace=True) 102 | elif not sim_data is None and kwargs.get('key') in sim_data: 103 | result_df = sim_data[kwargs.get('key')].copy() 104 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 105 | else: 106 | return None 107 | 108 | return result_df 109 | 110 | 111 | def convert_dict_Series(ground_truth_data=None, sim_data=None, **kwargs): 112 | 113 | if kwargs.get('key'): 114 | 115 | both = True 116 | if not sim_data is None and kwargs.get('key') in sim_data: 117 | sim_data= sim_data[kwargs.get('key')] 118 | result_df = pd.DataFrame(sim_data).copy() 119 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 120 | else: 121 | both = False 122 | 123 | if not ground_truth_data is None and kwargs.get('key') in ground_truth_data: 124 | ground_truth_data=ground_truth_data[kwargs.get('key')] 125 | result_df = pd.DataFrame(ground_truth_data).copy() 126 | result_df.rename(index=str,columns={"value":"Ground Truth"},inplace=True) 127 | else: 128 | both = False 129 | 130 | if both: 131 | result_df = pd.concat([ground_truth_data.reset_index(drop=True),sim_data.reset_index(drop=True)], axis=1) 132 | result_df.columns = [ 'Ground Truth', 'Simulation'] 133 | 134 | return result_df 135 | 136 | def convert_dict_array(ground_truth_data=None, sim_data=None, **kwargs): 137 | 138 | if kwargs.get('key'): 139 | 140 | both = True 141 | 142 | if not sim_data is None: 143 | sim_data = pd.Series(sim_data[kwargs.get('key')]) 144 | result_df = sim_data.copy() 145 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 146 | else: 147 | both = False 148 | 149 | if not ground_truth_data is None: 150 | ground_truth_data = pd.Series(ground_truth_data[kwargs.get('key')]) 151 | result_df = ground_truth_data.copy() 152 | result_df.rename(index=str,columns={"value":"Simulation"},inplace=True) 153 | else: 154 | both = False 155 | 156 | 157 | if both: 158 | result_df = pd.concat([ground_truth_data,sim_data], axis=1) 159 | result_df.columns = [ 'Ground Truth', 'Simulation'] 160 | 161 | return result_df 162 | -------------------------------------------------------------------------------- /december-measurements/plotting/visualization_config.py: -------------------------------------------------------------------------------- 1 | 
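# --- Illustrative sketch (added for clarity; not part of the original config) ---
# Each entry below tells the plotting layer how to coerce a measurement result into a
# DataFrame ("data_type", matching the converters in plotting/transformer.py), which
# chart functions to draw ("plot", matching plotting/charts.py), and how to label the
# axes. The driver presumably consumes an entry along the lines sketched here;
# `plot_measurement` and its arguments are hypothetical, and the real code also
# handles "plot_keys", missing data and figure saving.
def plot_measurement(entry, to_DataFrame, chart_factory, gt_result=None, sim_result=None):
    """Convert one measurement result to a DataFrame and draw the configured charts."""
    converter = to_DataFrame(entry["data_type"])      # e.g. convert_DataFrame
    if converter is None:
        return []
    df = converter(ground_truth_data=gt_result, sim_data=sim_result)
    figures = []
    for chart_name in entry["plot"]:                  # e.g. ['hist'] or ['time_series']
        chart = chart_factory(chart_name)
        if df is not None and chart is not None:
            figures.append(chart(df, entry["x_axis"], entry["y_axis"], chart_name))
    return figures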
measurement_plot_params = { 2 | 3 | ### community 4 | 5 | "community_burstiness": { 6 | "data_type": "dict", 7 | "x_axis": "Community", 8 | "y_axis": "Burstiness", 9 | "plot": ['bar'] 10 | }, 11 | 12 | "community_contributing_users": { 13 | "data_type": "dict", 14 | "x_axis": "Community", 15 | "y_axis": "Proportion of Users Contributing", 16 | "plot": ['bar'] 17 | }, 18 | 19 | "community_event_proportions": { 20 | "data_type": "dict_DataFrame", 21 | "x_axis": "Event Type", 22 | "y_axis": "Event Proportion", 23 | "plot": ['bar'], 24 | "plot_keys": "community" 25 | }, 26 | 27 | "community_geo_locations": { 28 | "data_type": "dict_DataFrame", 29 | "x_axis": "Country", 30 | "y_axis": "Number of Events", 31 | "plot": ['bar'], 32 | "plot_keys": "community" 33 | }, 34 | 35 | "community_issue_types": { # result None type 36 | "data_type": "dict_DataFrame", 37 | "x_axis": "Date", 38 | "y_axis": "Number of Issues", 39 | "plot": ['multi_time_series'], 40 | "plot_keys": "community" 41 | 42 | }, 43 | 44 | "community_num_user_actions": { 45 | "data_type": "dict_DataFrame", 46 | "x_axis": "Date", 47 | "y_axis": "Mean Number of User Actions", 48 | "hue": "Key", 49 | "plot": ['time_series'], 50 | "plot_keys": "community_subsets" 51 | }, 52 | # 53 | 54 | 'community_user_account_ages': { 55 | "data_type": "dict_Series", 56 | "x_axis": "User Account Age", 57 | "y_axis": "Number of Actions", 58 | "plot": ['hist'], 59 | "plot_keys": "community" 60 | }, 61 | 62 | 'community_user_burstiness': { 63 | "data_type": "dict_Series", 64 | "x_axis": "User Burstiness", 65 | "y_axis": "Number of Users", 66 | "plot": ['hist'], 67 | "plot_keys": "community" 68 | }, 69 | 70 | # 71 | "community_gini": { 72 | "data_type": "dict", 73 | "x_axis": "Community", 74 | "y_axis": "Gini Scores", 75 | "plot": ['bar'] 76 | }, 77 | 78 | "community_palma": { 79 | "data_type": "dict", 80 | "x_axis": "Community", 81 | "y_axis": "Palma Scores", 82 | "plot": ['bar'] 83 | }, 84 | 85 | # repo 86 | # 87 | 88 | "content_contributors": { 89 | "data_type": "dict_DataFrame", 90 | "x_axis": "Date", 91 | "y_axis": "Number of Contributors", 92 | "plot": ['time_series'], 93 | "plot_keys": "content" 94 | }, 95 | 96 | "content_diffusion_delay": { 97 | "data_type": "dict_Series", 98 | "x_axis": "Diffusion Delay", 99 | "y_axis": "Number of Events", 100 | "plot": ['hist'], 101 | "plot_keys": "content" 102 | }, 103 | 104 | "repo_event_counts_issue": { 105 | "data_type": "DataFrame", 106 | "y_axis": "Number of Repos", 107 | "x_axis": "Number of Issue Events", 108 | "plot": ['hist'] 109 | }, 110 | 111 | "repo_event_counts_pull_request": { 112 | "data_type": "DataFrame", 113 | "y_axis": "Number of Repos", 114 | "x_axis": "Number of Pull Requests", 115 | "plot": ['hist'] 116 | }, 117 | 118 | "repo_event_counts_push": { 119 | "data_type": "DataFrame", 120 | "y_axis": "Number of Repos", 121 | "x_axis": "Number of Push Events", 122 | "plot": ['hist'] 123 | }, 124 | 125 | "content_event_distribution_daily": { 126 | "data_type": "dict_DataFrame", 127 | "x_axis": "Date", 128 | "y_axis": "# Events", 129 | "plot": ['multi_time_series'], 130 | "plot_keys": "content" 131 | }, 132 | 133 | "content_event_distribution_dayofweek": { 134 | "data_type": "dict_DataFrame", 135 | "x_axis": "Day of Week", 136 | "y_axis": "# Events", 137 | "plot": ['multi_time_series'], 138 | "plot_keys": "content" 139 | }, 140 | 141 | "content_growth": { 142 | "data_type": "dict_DataFrame", 143 | "x_axis": "Date", 144 | "y_axis": "# Events", 145 | "plot": ['time_series'], 146 | "plot_keys": "content" 
147 | }, 148 | # 149 | "repo_issue_to_push": { 150 | "data_type": "dict_DataFrame", 151 | "x_axis": "Number of Previous Events", 152 | "y_axis": "Issue Push Ratio", 153 | "plot": ['time_series'], 154 | "plot_keys": "content" 155 | }, 156 | 157 | "content_liveliness_distribution": { 158 | "data_type": "DataFrame", 159 | "y_axis": "Number of Repos/Posts/Tweets", 160 | "x_axis": "Number of Forks/Comments/Replies", 161 | "plot": ['hist'] 162 | }, 163 | 164 | "repo_trustingness": { 165 | "data_type": "DataFrame", 166 | "x_axis": "Ground Truth", 167 | "y_axis": "Simulation", 168 | "plot": ['scatter'] 169 | }, 170 | 171 | "content_popularity_distribution": { 172 | "data_type": "DataFrame", 173 | "y_axis": "Number of Repos/Tweets", 174 | "x_axis": "Number of Watches/Rewtweets", 175 | "plot": ['hist'] 176 | }, 177 | 178 | "repo_user_continue_prop": { 179 | "data_type": "dict_DataFrame", 180 | "x_axis": "Number of Actions", 181 | "y_axis": "Probability of Continuing", 182 | "plot": ['time_series'], 183 | "plot_keys": "content" 184 | }, 185 | # 186 | # 187 | # ### user 188 | 189 | "user_popularity": { 190 | "data_type": "DataFrame", 191 | "y_axis": "Number of Users", 192 | "x_axis": "Popularity of User's Repos/Tweets/Posts", 193 | "plot": ['hist'] 194 | }, 195 | 196 | "user_activity_distribution": { 197 | "data_type": "DataFrame", 198 | "x_axis": "User Activity", 199 | "y_axis": "Number of Users", 200 | "plot": ['hist'] 201 | }, 202 | 203 | "user_diffusion_delay": { 204 | "data_type": "Series", 205 | "x_axis": "Diffusion Delay (H)", 206 | "y_axis": "Number of Events", 207 | "plot": ['hist'] 208 | }, 209 | "user_activity_timeline": { 210 | "data_type": "dict_DataFrame", 211 | "x_axis": "Date", 212 | "y_axis": "Number of Events", 213 | "plot": ['time_series'], 214 | "plot_keys": "user" 215 | }, 216 | 217 | "user_trustingness": { 218 | "data_type": "DataFrame", 219 | "x_axis": "Ground Truth", 220 | "y_axis": "Simulation", 221 | "plot": ['scatter'] 222 | }, 223 | 224 | "user_unique_content": { 225 | "data_type": "DataFrame", 226 | "x_axis": "Number of Unique Repos/Posts/Tweets", 227 | "y_axis": "Number of Users", 228 | "plot": ['hist'] 229 | } 230 | } 231 | 232 | cascade_measurement_plot_params = { 233 | 'cascade_breadth_by_depth': { 234 | 'data_type': 'dict_DataFrame', 235 | 'plot': ['time_series'], 236 | 'x_axis': 'Depth', 237 | 'y_axis': 'Breadth', 238 | 'plot_keys':'cascade'}, 239 | 240 | 'cascade_breadth_by_time': 241 | {'data_type': 'dict_DataFrame', 242 | 'plot': ['time_series'], 243 | 'x_axis': 'Date', 244 | 'y_axis': 'Breadth', 245 | 'plot_keys':'cascade'}, 246 | 247 | 'cascade_max_depth_over_time': 248 | {'data_type': 'dict_DataFrame', 249 | 'plot': ['time_series'], 250 | 'x_axis': 'Date', 251 | 'y_axis': 'Depth', 252 | 'plot_keys':'cascade'}, 253 | 254 | 'cascade_new_user_ratio_by_depth': 255 | {'data_type': 'dict_DataFrame', 256 | 'plot': ['time_series'], 257 | 'x_axis': 'Depth', 258 | 'y_axis': 'New User Ratio', 259 | 'plot_keys':'cascade'}, 260 | 261 | 'cascade_new_user_ratio_by_time': 262 | {'data_type': 'dict_DataFrame', 263 | 'plot': ['time_series'], 264 | 'x_axis': 'Date', 265 | 'y_axis': 'New User Ratio', 266 | 'plot_keys':'cascade'}, 267 | 268 | 'cascade_size_over_time': 269 | {'data_type': 'dict_DataFrame', 270 | 'plot': ['time_series'], 271 | 'x_axis': 'Date', 272 | 'y_axis': 'Cascade Size', 273 | 'plot_keys':'cascade'}, 274 | 275 | 'cascade_structural_virality_over_time': 276 | {'data_type': 'dict_DataFrame', 277 | 'plot': ['time_series'], 278 | 'x_axis': 'Date', 279 | 'y_axis': 
'Structural Virality', 280 | 'plot_keys':'cascade'}, 281 | 282 | 'cascade_uniq_users_by_depth': 283 | {'data_type': 'dict_DataFrame', 284 | 'plot': ['time_series'], 285 | 'x_axis': 'Depth', 286 | 'y_axis': 'Unique Users', 287 | 'plot_keys':'cascade'}, 288 | 289 | 'cascade_uniq_users_by_time': 290 | {'data_type': 'dict_DataFrame', 291 | 'plot': ['time_series'], 292 | 'x_axis': 'Date', 293 | 'y_axis': 'Unique Users', 294 | 'plot_keys':'cascade'}, 295 | 296 | 'community_cascade_lifetime_distribution': 297 | {'data_type': 'dict_DataFrame', 298 | 'plot': ['hist'], 299 | 'x_axis': 'Lifetime', 300 | 'y_axis': 'Number of Cascades', 301 | 'plot_keys':'community'}, 302 | 303 | 'community_cascade_lifetime_timeseries': 304 | {'data_type': 'dict_DataFrame', 305 | 'plot': ['time_series'], 306 | 'x_axis': 'Date', 307 | 'y_axis': 'Cascade Lifetime', 308 | 'plot_keys':'community'}, 309 | 310 | 'community_cascade_size_distribution': 311 | {'data_type': 'dict_DataFrame', 312 | 'plot': ['hist'], 313 | 'x_axis': 'Size', 314 | 'y_axis': 'Number of Cascades', 315 | 'plot_keys':'community'}, 316 | 317 | 'community_cascade_size_timeseries': 318 | {'data_type': 'dict_DataFrame', 319 | 'plot': ['time_series'], 320 | 'x_axis': 'Time', 321 | 'y_axis': 'Cascade Size', 322 | 'plot_keys':'community'}, 323 | 324 | 'community_max_breadth_distribution': 325 | {'data_type': 'dict_DataFrame', 326 | 'plot': ['hist'], 327 | 'x_axis': 'Max Breadth', 328 | 'y_axis': 'Number of Cascades', 329 | 'plot_keys':'community'}, 330 | 331 | 'community_max_depth_distribution': 332 | {'data_type': 'dict_DataFrame', 333 | 'plot': ['hist'], 334 | 'x_axis': 'Max Depth', 335 | 'y_axis': 'Number of Cascades', 336 | 'plot_keys':'community'}, 337 | 338 | 'community_new_user_ratio_by_time': 339 | {'data_type': 'dict_DataFrame', 340 | 'plot': ['time_series'], 341 | 'x_axis': 'Date', 342 | 'y_axis': 'New User Ratio', 343 | 'plot_keys':'community'}, 344 | 345 | 'community_structural_virality_distribution': 346 | {'data_type': 'dict_DataFrame', 347 | 'plot': ['hist'], 348 | 'x_axis': 'Structural Virality', 349 | 'y_axis': 'Number of Cascade', 350 | 'plot_keys':'community'}, 351 | 352 | 'community_unique_users_by_time': 353 | {'data_type': 'dict_DataFrame', 354 | 'plot': ['time_series'], 355 | 'x_axis': 'Date', 356 | 'y_axis': 'Unique Users', 357 | 'plot_keys':'community'}, 358 | 359 | 'population_cascade_lifetime_distribution': 360 | {'data_type': 'DataFrame', 361 | 'plot': ['hist'], 362 | 'x_axis': 'Cascade Lifetime', 363 | 'y_axis': 'Number of Cascades'}, 364 | 365 | 'population_cascade_lifetime_timeseries': 366 | {'data_type': 'DataFrame', 367 | 'plot': ['time_series'], 368 | 'x_axis': 'Date', 369 | 'y_axis': 'Cascade Lifetime'}, 370 | 371 | 'population_cascade_size_distribution': 372 | {'data_type': 'DataFrame', 373 | 'plot': ['hist'], 374 | 'x_axis': 'Size', 375 | 'y_axis': 'Number of Cascades'}, 376 | 377 | 'population_cascade_size_timeseries': 378 | {'data_type': 'DataFrame', 379 | 'plot': ['time_series'], 380 | 'x_axis': 'Date', 381 | 'y_axis': 'Cascade Size'}, 382 | 383 | 'population_max_breadth_distribution': 384 | {'data_type': 'DataFrame', 385 | 'plot': ['hist'], 386 | 'x_axis': 'Max Breadth', 387 | 'y_axis': 'Number of Cascades'}, 388 | 389 | 'population_max_depth_distribution': 390 | {'data_type': 'DataFrame', 391 | 'plot': ['hist'], 392 | 'x_axis': 'Max Depth', 393 | 'y_axis': 'Number of Cascades'}, 394 | 395 | 'population_structural_virality_distribution': 396 | {'data_type': 'DataFrame', 397 | 'plot': ['hist'], 398 | 'x_axis': 
'Structural Virality', 399 | 'y_axis': 'Number of Cascades'} 400 | } 401 | 402 | measurement_plot_params.update(cascade_measurement_plot_params) 403 | -------------------------------------------------------------------------------- /december-measurements/validators.py: -------------------------------------------------------------------------------- 1 | from functools import wraps 2 | 3 | def check_empty(default=None): 4 | def wrap(func): 5 | @wraps(func) 6 | def wrapped_f(self, *args, **kwargs): 7 | if self.main_df is None or self.main_df.empty or len(self.main_df) <= 0: 8 | return default 9 | else: 10 | return func(self, *args, **kwargs) 11 | return wrapped_f 12 | return wrap 13 | 14 | def check_root_only(default=None): 15 | """ 16 | check if it is a single node cascade 17 | """ 18 | def wrap(func): 19 | @wraps(func) 20 | def wrapped_f(self, *args, **kwargs): 21 | if len(self.main_df[self.main_df[self.node_col]!=self.main_df[self.root_node_col]])==0: 22 | return default 23 | else: 24 | return func(self, *args, **kwargs) 25 | return wrapped_f 26 | return wrap 27 | -------------------------------------------------------------------------------- /github-measurements-old/TransferEntropy.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import jpype 4 | from jpype import * 5 | from datetime import datetime 6 | 7 | ''' 8 | Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract 9 | No. DE-AC05-76RL01830 with the Department of Energy (DOE). All rights in the computer software are reserved by DOE on 10 | behalf of the United States Government and the Contractor as provided in the Contract. You are authorized to use this 11 | computer software for Governmental purposes but it is not to be released or distributed to the public. NEITHER THE 12 | GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS 13 | SOFTWARE. This notice including this sentence must appear on any copies of this computer software. 14 | ''' 15 | 16 | ''' 17 | This module implements measurements to calculate the transfer entropy between users. The main function 18 | for TE calculation requires the jpype package, a Python-to-Java bridge that calls the Java 19 | infodynamics library through a Java Virtual Machine.
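A typical call looks like the following (illustrative only: the file name and user ids
are placeholders, and the input must follow the column order described below):

    df = pd.read_csv('github_events.csv')
    te = getTransferEntropy(df, 'user_a', 'user_b', realSeries=False)

Note that getTransferEntropy starts the JVM with infodynamics.jar and shuts it down
again before returning, so every call pays the JVM start-up cost.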
20 | 21 | These measurements assume that the data is in the order id,created_at,type,actor.id,repo.id 22 | ''' 23 | 24 | ''' 25 | This method takes a list of times and transforms them to a time series 26 | 27 | Input: List of created times 28 | Output: List representing a time series (differences) 29 | ''' 30 | def getTimeSeriesInSecs(ts_list): 31 | base_time = datetime.strptime('2015-01-01T00:00:00Z', '%Y-%m-%dT%H:%M:%SZ') 32 | secSet = set() 33 | for timeVal in ts_list: 34 | time_std = datetime.strptime(timeVal, '%Y-%m-%dT%H:%M:%SZ') 35 | diff = time_std - base_time 36 | secSet.add(int(diff.total_seconds())) 37 | 38 | secList = sorted(list(secSet)) 39 | return secList 40 | 41 | ''' 42 | This method bins the time series into discrete bins 43 | 44 | Input: totalBins - Total Number of bins 45 | binSize - Size of the bins 46 | timeSeries - This is the list representation of the time series 47 | ''' 48 | def getBinnedTimeSeriesSingleBinary(totalBins, binSize, timeSeries): 49 | tsBinned = np.zeros((totalBins), dtype=int) 50 | for timeVal in timeSeries: 51 | idx = (timeVal // binSize) 52 | tsBinned[idx] = 1 53 | 54 | return tsBinned 55 | 56 | ''' 57 | This method bins the time series into real valued bins 58 | 59 | Input: totalBins - Total Number of Bins 60 | binSize - Size of the bins 61 | timeSeries - This is the list representation of the time series 62 | ''' 63 | def getBinnedTimeSeriesSingleRealVal(totalBins, binSize, timeSeries): 64 | tsBinned = np.zeros((totalBins), dtype=float) 65 | for timeVal in timeSeries: 66 | idx = int((timeVal // binSize)) 67 | tsBinned[idx] = tsBinned[idx] + 1.00 68 | 69 | return tsBinned.tolist() 70 | 71 | ''' 72 | This method calculates the transfer entropy (TE) between two binary time series 73 | 74 | Input: src - This is the source time series 75 | dest - This is the destination time series 76 | delayParam - This is the parameter that controls the delay when calculating the TE 77 | 78 | Output: Value of Transfer Entropy between the source and destination time series. 79 | ''' 80 | def getTETimeSeriesPairBinary(src, dest, delayParam): 81 | teCalcClass = jpype.JPackage("infodynamics.measures.discrete").TransferEntropyCalculatorDiscrete 82 | teCalc = teCalcClass(2, 1, 1, 1, 1, delayParam) 83 | 84 | teCalc.initialise() 85 | teCalc.addObservations(src, dest) 86 | te = teCalc.computeAverageLocalOfObservations() 87 | 88 | return te 89 | 90 | ''' 91 | This method calculates the transfer entropy (TE) between two real time series 92 | 93 | Input: src - This is the source time series 94 | dest - This is the destination time series 95 | delayParam - This is the parameter that controls the delay when calculating the TE 96 | 97 | Output: Value of Transfer Entropy between the source and destination time series. 
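Example (illustrative values; the JVM must already be running so that the
infodynamics classes can be loaded):

    src = getBinnedTimeSeriesSingleRealVal(4, 3600, [10, 30, 4000, 7300])
    dest = getBinnedTimeSeriesSingleRealVal(4, 3600, [20, 3700, 7400])
    te = getTETimeSeriesPairRealValued(src, dest, 1)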
98 | ''' 99 | def getTETimeSeriesPairRealValued(src, dest, delay): 100 | teCalcClass = JPackage("infodynamics.measures.continuous.kraskov").TransferEntropyCalculatorKraskov 101 | teCalc = teCalcClass() 102 | teCalc.setProperty("NORMALISE", "true") # Normalise the individual variables 103 | teCalc.setProperty("k", "3") # Use Kraskov parameter K=4 for 4 nearest points 104 | 105 | teCalc.initialise(1, 1, 1, 1, delay) # Use history length 1 (Schreiber k=1) 106 | teCalc.setObservations(JArray(JDouble, 1)(src), JArray(JDouble, 1)(dest)) 107 | te = teCalc.computeAverageLocalOfObservations() 108 | 109 | return te 110 | 111 | 112 | ''' 113 | This method calculates the Transfer entropy for two users and a given dataframe 114 | 115 | Input: df - Data frame to extract user data from. This can be any subset of data 116 | user1 - The id of the first user (source user) 117 | user2 - The id of the second user (destination user) 118 | realSeries - Boolean that indicates whether or not the time series should be binned into real or discrete values 119 | 120 | Output: Transfer Entropy between the two users 121 | ''' 122 | def getTransferEntropy(df,user1,user2,realSeries=False): 123 | 124 | df.columns = ['id', 'time', 'type', 'user', 'repo'] 125 | 126 | user1Series = df[df.user == user1]['time'].tolist() 127 | user2Series = df[df.user == user2]['time'].tolist() 128 | user1Series = getTimeSeriesInSecs(user1Series) 129 | user2Series = getTimeSeriesInSecs(user2Series) 130 | 131 | binSize = 10800 # 3 hours = 10800 secs 132 | maxTime = max(max(user1Series), max(user2Series)) 133 | totalbins = int(np.ceil(maxTime / float(binSize))) 134 | 135 | te = 0.0 136 | 137 | ##Jar location for the infodynamics package 138 | jarLocation = "./infodynamics.jar" 139 | 140 | # Start the JVM (add the "-Xmx" option with say 1024M if you get crashes due to not enough memory space) 141 | jpype.startJVM(jpype.getDefaultJVMPath(), "-ea", "-Djava.class.path=" + jarLocation) 142 | 143 | 144 | if realSeries: 145 | user1Series = getBinnedTimeSeriesSingleRealVal(totalbins,binSize,user1Series) 146 | user2Series = getBinnedTimeSeriesSingleRealVal(totalbins,binSize,user2Series) 147 | te = getTETimeSeriesPairRealValued(user1Series, user2Series, 3) 148 | else: 149 | user1Series = getBinnedTimeSeriesSingleBinary(totalbins, binSize, user1Series) 150 | user2Series = getBinnedTimeSeriesSingleBinary(totalbins, binSize, user2Series) 151 | te = getTETimeSeriesPairBinary(user1Series,user2Series,1) 152 | 153 | jpype.shutdownJVM() 154 | 155 | return te 156 | 157 | 158 | 159 | 160 | 161 | -------------------------------------------------------------------------------- /github-measurements-old/UserCentricMeasurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from multiprocessing import Pool 4 | from functools import partial 5 | 6 | ''' 7 | Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract 8 | No. DE-AC05-76RL01830 with the Department of Energy (DOE). All rights in the computer software are reserved by DOE on 9 | behalf of the United States Government and the Contractor as provided in the Contract. You are authorized to use this 10 | computer software for Governmental purposes but it is not to be released or distributed to the public. NEITHER THE 11 | GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS 12 | SOFTWARE. 
This notice including this sentence must appear on any copies of this computer software. 13 | ''' 14 | 15 | ''' 16 | This class implements user centric method. Each function will describe which metric it is used for according 17 | to the questions number and mapping. 18 | 19 | These metrics assume that the data is in the order id,created_at,type,actor.id,repo.id 20 | ''' 21 | 22 | 23 | ''' 24 | This method returns the number of unique repos that a particular set of users contributed too 25 | 26 | Question #17 27 | 28 | Inputs: DataFrame - Desired dataset 29 | users - A list of users of interest 30 | 31 | Output: A dataframe with the user id and the number of repos contributed to 32 | ''' 33 | def getUserUniqueRepos(df,users=None): 34 | df = df.copy() 35 | df.columns = ['time', 'event','user', 'repo'] 36 | if users: 37 | df = df[df.user.isin(users)] 38 | df =df.groupby('user') 39 | data = df.repo.nunique().reset_index() 40 | data.columns = ['user','value'] 41 | return data 42 | 43 | 44 | ''' 45 | This method returns the cumulative activity of the desire user over time. 46 | 47 | Question #19 48 | 49 | Inputs: DataFrame - Desired dataset 50 | users - A list of users of interest 51 | 52 | Output: A grouped dataframe of the users activity over time 53 | ''' 54 | def getUserActivityTimeline(df, users=None,time_bin='1d',cumSum=False): 55 | df = df.copy() 56 | df.columns = ['time', 'event','user', 'repo'] 57 | df['time'] = pd.to_datetime(df['time']) 58 | if users: 59 | df = df[df.user.isin(users)] 60 | df['value'] = 1 61 | if cumSum: 62 | df['cumsum'] = df.groupby('user').value.transform(pd.Series.cumsum) 63 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).max().reset_index() 64 | df['value'] = df['cumsum'] 65 | df = df.drop('cumsum',axis=1) 66 | else: 67 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).sum().reset_index() 68 | 69 | #timeGrouper 70 | data = df.sort_values(['user', 'time']) 71 | return data 72 | 73 | ''' 74 | This method returns the top k most popular users for the dataset, where popularity is measured 75 | as the total popularity of the repos created by the user. 76 | 77 | Question #25 78 | 79 | Inputs: DataFrame - Desired dataset 80 | k - (Optional) The number of users that you would like returned. 81 | use_metadata - External metadata file containing repo owners. Otherwise use first observed user with a creation event as a proxy for the repo owner. 
82 | 83 | Output: A dataframe with the user ids and number events for that user 84 | ''' 85 | def getUserPopularity(df,k=10,metadata_file = ''): 86 | 87 | if metadata_file != '': 88 | repo_metadata = pd.read_csv(metadata_file) 89 | repo_metadata = repo_metadata[['full_name_h','owner.login_h']] 90 | 91 | df = df.copy() 92 | df.columns = ['time', 'event','user', 'repo'] 93 | df['value'] = 1 94 | 95 | repo_popularity = df[df['event'].isin(['ForkEvent','WatchEvent'])].groupby('repo')['value'].sum().reset_index() 96 | 97 | if metadata_file != '': 98 | merged = repo_popularity.merge(repo_metadata,left_on='repo',right_on='full_name_h',how='left') 99 | else: 100 | user_repos = df[df['event'] == 'CreateEvent'].sort_values('time').drop_duplicates(subset='repo',keep='first') 101 | user_repos = user_repos[['user','repo']] 102 | user_repos.columns = ['owner.login_h','repo'] 103 | merged = user_repos.merge(repo_popularity,on='repo',how='left') 104 | 105 | measurement = merged.groupby('owner.login_h').value.sum().sort_values(ascending=False).head(k) 106 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 107 | return measurement 108 | 109 | ''' 110 | This method returns the average time between events for each user 111 | 112 | Question #29b and c 113 | 114 | Inputs: df - Data frame of all data for repos 115 | users - (Optional) List of specific users to calculate the metric for 116 | nCPu - (Optional) Number of CPU's to run metric in parallel 117 | 118 | Outputs: A list of average times for each user. Length should match number of repos 119 | ''' 120 | def getAvgTimebwEvents(df,users=None, nCPU=1): 121 | df = df.copy() 122 | df.columns = ['time', 'event', 'user', 'repo'] 123 | df['time'] = pd.to_datetime(df['time']) 124 | 125 | if users == None: 126 | users = df['user'].unique() 127 | 128 | p = Pool(nCPU) 129 | args = [(df, users[i]) for i, item_a in enumerate(users)] 130 | deltas = p.map(getMeanTimeHelper, args) 131 | p.join() 132 | p.close() 133 | return deltas 134 | 135 | ''' 136 | Helper function for getting the average time between events 137 | 138 | Inputs: Same as average time between events 139 | Output: Same as average time between events 140 | ''' 141 | def getMeanTime(df, user): 142 | d = df[df.user == user] 143 | d = d.sort_values(by='time') 144 | delta = np.mean(np.diff(d.time)) / np.timedelta64(1, 's') 145 | return delta 146 | 147 | 148 | def getMeanTimeHelper(args): 149 | return getMeanTime(*args) 150 | 151 | ''' 152 | This method returns distribution the diffusion delay for each user 153 | 154 | Question #27 155 | 156 | Inputs: DataFrame - Desired dataset 157 | unit - (Optional) This is the unit that you want the distribution in. Check np.timedelta64 documentation 158 | for the possible options 159 | metadata_file - File containing user account creation times. Otherwise use first observed action of user as proxy for account creation time. 
160 | 161 | Output: A list (array) of deltas in units specified 162 | ''' 163 | def getUserDiffusionDelay(df,unit='s',metadata_file = ''): 164 | 165 | if metadata_file != '': 166 | user_metadata = pd.read_csv(metadata_file) 167 | user_metadata['created_at'] = pd.to_datetime(user_metadata['created_at']) 168 | 169 | 170 | df = df.copy() 171 | df.columns = ['time','event','user','repo'] 172 | df['value'] = df['time'] 173 | df['value'] = pd.to_datetime(df['value']) 174 | 175 | if metadata_file != '': 176 | df = df.merge(user_metadata[['login_h','created_at']],left_on='user',right_on='login_h',how='left') 177 | df = df[['login_h','created_at','value']].dropna() 178 | measurement = df['value'].sub(df['created_at']).apply(lambda x: int(x / np.timedelta64(1, unit))) 179 | else: 180 | grouped = df.groupby('user') 181 | transformed = grouped['value'].transform('min') 182 | measurement = df['value'].sub(transformed).apply(lambda x: int(x / np.timedelta64(1, unit))) 183 | 184 | 185 | 186 | return measurement 187 | 188 | 189 | ''' 190 | This method returns the gini coefficient for user events. (User Disparity) 191 | 192 | Question #26a 193 | 194 | Inputs: DataFrame - Desired dataset 195 | 196 | 197 | Output: The gini coefficient for the dataset 198 | ''' 199 | def getGiniCoef(df): 200 | df = df.copy() 201 | df.columns = ['time', 'event', 'user', 'repo'] 202 | df['value'] = 1 203 | df = df.groupby('user') 204 | event_counts = df.value.sum() 205 | values = np.sort(np.array(event_counts)) 206 | 207 | cdf = np.cumsum(values) / float(np.sum(values)) 208 | percent_nodes = np.arange(len(values)) / float(len(values)) 209 | 210 | g = 1 - 2*np.trapz(x=percent_nodes,y=cdf) 211 | return g 212 | 213 | ''' 214 | This method returns the palma coefficient for user events. (User Disparity) 215 | 216 | Question #26b 217 | 218 | Inputs: DataFrame - Desired dataset 219 | 220 | 221 | Output: p - The palma coefficient for the dataset 222 | data - dataframe showing the CDF and Node percentages. (Mainly used for plotting) 223 | ''' 224 | def getPalmaCoef(df): 225 | df = df.copy() 226 | df.columns = ['time', 'event', 'user', 'repo'] 227 | df['value'] = 1 228 | df = df.groupby('user') 229 | event_counts = df.value.sum() 230 | 231 | 232 | values = np.sort(np.array(event_counts)) 233 | 234 | 235 | cdf = np.cumsum(values) / float(np.sum(values)) 236 | percent_nodes = np.arange(len(values)) / float(len(values)) 237 | 238 | 239 | p10 = np.sum(values[percent_nodes >= 0.9]) 240 | p40 = np.sum(values[percent_nodes <= 0.4]) 241 | 242 | 243 | p = float(p10) / float(p40) 244 | 245 | x = cdf 246 | y = percent_nodes 247 | data = pd.DataFrame({'cum_nodes': y, 'cum_value': x}) 248 | 249 | return p 250 | 251 | ''' 252 | This method returns the top k users with the most events. 253 | 254 | Question #24b 255 | 256 | Inputs: DataFrame - Desired dataset. Used mainly when dealing with subset of events 257 | k - Number of users to be returned 258 | 259 | Output: Dataframe with the user ids and number of events 260 | ''' 261 | def getMostActiveUsers(df,k=10): 262 | df = df.copy() 263 | df.columns = ['time', 'event', 'user', 'repo'] 264 | dft = df 265 | dft['value'] = 1 266 | dft = df.groupby('user') 267 | measurement = dft.value.sum().sort_values(ascending=False).head(k) 268 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 269 | return measurement 270 | 271 | ''' 272 | This method returns the distribution for the users activity (event counts). 
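For example (illustrative counts), a user with 12 PushEvents and 3 IssuesEvents
contributes a value of 15, or 12 if eventType='PushEvent' is passed.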
273 | 274 | Question #24a 275 | 276 | Inputs: DataFrame - Desired dataset 277 | eventType - (Optional) Desired event type to use 278 | 279 | Output: List containing the event counts per user 280 | ''' 281 | def getUserActivityDistribution(df,eventType=None): 282 | df = df.copy() 283 | df.columns = ['time', 'event', 'user', 'repo'] 284 | if eventType != None: 285 | df = df[df.event == eventType] 286 | df['value'] = 1 287 | df = df.groupby('user') 288 | measurement = df.value.sum().reset_index() 289 | return measurement 290 | -------------------------------------------------------------------------------- /github-measurements-old/UserMeasurementsWithPlot.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract 3 | No. DE-AC05-76RL01830 with the Department of Energy (DOE). All rights in the computer software are reserved by DOE on 4 | behalf of the United States Government and the Contractor as provided in the Contract. You are authorized to use this 5 | computer software for Governmental purposes but it is not to be released or distributed to the public. NEITHER THE 6 | GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS 7 | SOFTWARE. This notice including this sentence must appear on any copies of this computer software. 8 | ''' 9 | from plots import * 10 | 11 | ''' 12 | The following is the user measurment functions previously released with plotting added. The plots are currently all printed. 13 | ''' 14 | 15 | ''' 16 | This method returns the number of unique repos that a particular set of users contributed too 17 | 18 | Question #18 19 | 20 | Inputs: DataFrame - Desired dataset 21 | users - A list of users of interest 22 | log - to plot with log values default false 23 | 24 | Output: A dataframe with the user id and the number of repos contributed to 25 | ''' 26 | def getUserUniqueRepos(df,users, log=False): 27 | df.columns = ['id', 'time', 'event','user', 'repo'] 28 | df = df[df.user.isin(users)] 29 | df =df.groupby('user') 30 | data = df.repo.nunique() 31 | td = data 32 | print plot_top_users(data,'User','Unique Repos Contributed To','Quantity of Repos Users Contributed To') 33 | 34 | return td 35 | 36 | ''' 37 | This method returns the cumulative activity of the desire user over time. 
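Counts are binned by day, days without activity are filled with zero, and a
cumulative sum is taken per user, so the resulting timeline is non-decreasing.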
38 | 39 | Question #20 40 | 41 | Inputs: DataFrame - Desired dataset 42 | users - A list of users of interest 43 | 44 | Output: A grouped dataframe of the users activity over time 45 | ''' 46 | def getUserActivityTimeline(df, users, log=False): 47 | df.columns = ['id', 'time', 'event','user', 'repo'] 48 | 49 | df = df[df.user.isin(users)] 50 | df['value'] = 1 51 | 52 | df['time'] = pd.to_datetime(df['time']) 53 | df['time'] = df['time'].dt.strftime('%Y-%m-%d') 54 | df = df.groupby(['user','time']).sum() 55 | 56 | minDate = df.index.min()[1] 57 | maxDate = df.index.max()[1] 58 | 59 | idx = pd.date_range(minDate, maxDate) 60 | ndf = pd.DataFrame() 61 | first = 0 62 | for u in users: 63 | d = df.loc[u] 64 | d.index = pd.DatetimeIndex(d.index) 65 | d = d[['value']].reindex(idx).fillna(0) 66 | d = d.cumsum() 67 | d['user'] = u 68 | d = d.reset_index() 69 | if first == 0: 70 | first = 1 71 | ndf = d 72 | continue 73 | ndf = pd.concat([ndf,d]) 74 | ndf.columns = ['time','value','user'] 75 | ndf['time'] = pd.to_datetime(ndf['time']) 76 | ndf = ndf.sort_values(['time']) 77 | ndf = ndf.set_index(['time']) 78 | 79 | print plot_activity_timeline(ndf,'Time','Total Number of Contributions','Cumulutive Sum of Contributions') 80 | 81 | return ndf 82 | 83 | 84 | ''' 85 | This method returns the top k most popular users for the dataset. 86 | 87 | Question #27 88 | 89 | Inputs: DataFrame - Desired dataset 90 | k - (Optional) The number of users that you would like returned. 91 | 92 | Output: A dataframe with the user ids and number events for that user 93 | ''' 94 | def getUserPopularity(df,k=10, log=False): 95 | df.columns = ['id', 'time', 'event','user', 'repo'] 96 | df['value'] = 1 97 | 98 | repo_popularity = df[df['event'] != 'CreateEvent'].groupby('repo')['value'].sum().reset_index() 99 | user_repos = df[df['event'] == 'CreateEvent'].sort_values('time').drop_duplicates(subset='repo',keep='first') 100 | merged = user_repos[['user','repo']].merge(repo_popularity,on='repo',how='left') 101 | measurement = merged.groupby('user').value.sum().sort_values(ascending=False).head(k) 102 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 103 | 104 | print plot_top_users(measurement,'User Popularity','User','User Popularity') 105 | 106 | return measurement 107 | 108 | 109 | ''' 110 | Helper function for getting the average time between events 111 | 112 | Inputs: Same as average time between events 113 | Output: Same as average time between events 114 | ''' 115 | def getMeanTime(df,r): 116 | d = df[df.repo == r] 117 | d = d.sort_values(by='time') 118 | delta = np.mean(np.diff(d.time)) / np.timedelta64(1, 's') 119 | return delta 120 | 121 | 122 | ''' 123 | This method returns the average time between events for each user 124 | 125 | Question #29b and c 126 | 127 | Inputs: df - Data frame of all data for repos 128 | repos - (Optional) List of specific users to calculate the measurement for 129 | nCPu - (Optional) Number of CPU's to run measurement in parallel 130 | 131 | Outputs: A list of average times for each user. 
Length should match number of repos 132 | ''' 133 | def getAvgTimebwEvents(df,users=None, nCPU=1): 134 | df.columns = ['id','time', 'event', 'user', 'repo'] 135 | df['time'] = pd.to_datetime(df['time']) 136 | 137 | if users == None: 138 | users = df['user'].unique() 139 | 140 | p = Pool(nCPU) 141 | mean_time_partial = partial(getMeanTime,df=df) 142 | deltas = p.map(mean_time_partial,users) 143 | 144 | 145 | _,bins = np.histogram(deltas,bins='auto') 146 | 147 | measurement = pd.DataFrame(deltas) 148 | 149 | measurement.plot(kind='hist',bins=bins,legend=False,cumulative=False,normed=False,figsize=(10,7)) 150 | plt.xlabel('Time Between PullRequestEvents in Seconds',fontsize=20) 151 | plt.ylabel('Number of Repos',fontsize=20) 152 | plt.title('Average Time Between PullRequestEvents',fontsize=20) 153 | plt.xticks(fontsize=15) 154 | plt.yticks(fontsize=15) 155 | plt.tight_layout() 156 | print plt.show() 157 | return deltas 158 | 159 | ''' 160 | This method returns distribution the diffusion delay for each user 161 | 162 | Question #29 163 | 164 | Inputs: DataFrame - Desired dataset 165 | unit - (Optional) This is the unit that you want the distribution in. Check np.timedelta64 documentation 166 | for the possible options 167 | 168 | Output: A list (array) of deltas in units specified 169 | ''' 170 | def getUserDiffusionDelay(df,unit='s', log=False): 171 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 172 | df['value'] = df['time'] 173 | df['value'] = pd.to_datetime(df['value']) 174 | grouped = df.groupby('user') 175 | transformed = grouped['value'].transform('min') 176 | delta = df['value'].sub(transformed).apply(lambda x: int(x / np.timedelta64(1, unit))) 177 | 178 | print plot_histogram(delta,'User Activity Delay','Number of Users','Diffusion Delay') 179 | 180 | return delta 181 | 182 | 183 | ''' 184 | This method returns the gini coefficient for user events. (User Disparity) 185 | 186 | Question #28 187 | 188 | Inputs: DataFrame - Desired dataset 189 | 190 | 191 | Output: The gini coefficient for the dataset 192 | ''' 193 | def getGiniCoef(df): 194 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 195 | df['value'] = 1 196 | df = df.groupby('user') 197 | event_counts = df.value.sum() 198 | values = np.sort(np.array(event_counts)) 199 | 200 | cdf = np.cumsum(values) / float(np.sum(values)) 201 | percent_nodes = np.arange(len(values)) / float(len(values)) 202 | 203 | g = 1 - 2*np.trapz(x=percent_nodes,y=cdf) 204 | return g 205 | 206 | 207 | ''' 208 | This method returns the palma coefficient for user events. (User Disparity) 209 | 210 | Question #28 211 | 212 | Inputs: DataFrame - Desired dataset 213 | 214 | 215 | Output: p - The palma coefficient for the dataset 216 | data - dataframe showing the CDF and Node percentages. 
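Example (illustrative numbers): if the 10% most active users account for 500 events
and the 40% least active account for 100 events, the Palma coefficient is
500 / 100 = 5.0.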
(Mainly used for plotting) 217 | ''' 218 | def getPalmaCoef(df): 219 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 220 | df['value'] = 1 221 | df = df.groupby('user') 222 | event_counts = df.value.sum() 223 | 224 | values = np.sort(np.array(event_counts)) 225 | 226 | cdf = np.cumsum(values) / float(np.sum(values)) 227 | percent_nodes = np.arange(len(values)) / float(len(values)) 228 | 229 | p10 = np.sum(values[percent_nodes >= 0.9]) 230 | p40 = np.sum(values[percent_nodes <= 0.4]) 231 | 232 | p = float(p10) / float(p40) 233 | 234 | x = cdf 235 | y = percent_nodes 236 | data = pd.DataFrame({'cum_nodes': y, 'cum_value': x}) 237 | 238 | print plot_palma(data,'Cumulative share of Repos','Cumulative share of Events','User Event Dispartiy') 239 | 240 | return p,data 241 | 242 | ''' 243 | This method returns the top k users with the most events. 244 | 245 | Question #26b 246 | 247 | Inputs: DataFrame - Desired dataset. Used mainly when dealing with subset of events 248 | k - Number of users to be returned 249 | 250 | Output: Dataframe with the user ids and number of events 251 | ''' 252 | def getMostActiveUsers(df,k=10, log=True): 253 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 254 | df['value'] = 1 255 | df = df.groupby('user') 256 | measurement = df.value.sum().sort_values(ascending=False).head(k) 257 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 258 | 259 | print plot_top_users(measurement,'User','User Activity','Top Users') 260 | 261 | 262 | ''' 263 | This method returns the distribution for the users activity (event counts). 264 | 265 | Question #26a 266 | 267 | Inputs: DataFrame - Desired dataset 268 | eventType - (Optional) Desired event type to use 269 | 270 | Output: List containing the event counts per user 271 | ''' 272 | def getUserActivityDistribution(df,eventType=None): 273 | df.columns = ['id', 'time', 'event', 'user', 'repo'] 274 | if eventType != None: 275 | df = df[df.event == eventType] 276 | df['value'] = 1 277 | df = df.groupby('user') 278 | 279 | print plot_histogram(d.value.values,'Total Activity','Number of Users','User Activity Distribution') 280 | 281 | return np.array(measurement).tolist() -------------------------------------------------------------------------------- /github-measurements-old/load_data.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | 4 | def load_data(): 5 | 6 | path = '/Users/grac833/Documents/Projects/SocialSim/temp/infrastructure/tira/services/GithubMetricServices' 7 | 8 | dfs = [] 9 | for i in range(1,3): 10 | i = str(i) 11 | if len(i) == 1: 12 | i = '0' + i 13 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-07-' + str(i) + ' 00:00:00.csv') 14 | dfs.append(df) 15 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-08-' + str(i) + ' 00:00:00.csv') 16 | dfs.append(df) 17 | gt = pd.concat(dfs) 18 | 19 | dfs = [] 20 | for i in range(1, 3): 21 | i = str(i) 22 | if len(i) == 1: 23 | i = '0' + i 24 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-07-' + str(i) + ' 00:00:00.csv') 25 | dfs.append(df) 26 | df = pd.read_csv(path + '/leidosData/weekly_data_2017-08-' + str(i) + ' 00:00:00.csv') 27 | dfs.append(df) 28 | sim1 = pd.concat(dfs) 29 | 30 | gt = gt.drop("_id", axis=1) 31 | sim1 = sim1.drop("_id", axis=1) 32 | 33 | print(sim1) 34 | 35 | return gt,sim1 36 | 37 | 38 | if __name__ == "__main__": 39 | 40 | load_data() 41 | -------------------------------------------------------------------------------- 
/github-measurements-old/plots.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import datetime 4 | from multiprocessing import Pool 5 | from functools import partial 6 | import matplotlib.pyplot as plt 7 | import matplotlib.mlab as mlab 8 | from datetime import datetime 9 | import seaborn as sns 10 | import matplotlib.dates as dates 11 | import calendar 12 | from itertools import * 13 | from matplotlib import rcParams 14 | rcParams.update({'figure.autolayout': True}) 15 | 16 | def savePlots(loc, plt): 17 | plt.savefig(loc) 18 | 19 | event_colors = {'CommitCommentEvent':'#e59400', 20 | 'CreateEvent':'#B2912F', 21 | 'DeleteEvent':'#B276B2', 22 | 'ForkEvent':'#4D4D4D', 23 | 'IssueCommentEvent':'#DECF3F', 24 | 'IssuesEvent':'#60BD68', 25 | 'PullRequestEvent':'#5DA5DA', 26 | 'PullRequestReviewCommentEvent':'#D3D3D3', 27 | 'PushEvent':'#F17CB0', 28 | 'WatchEvent':'#F15854'} 29 | 30 | def plot_histogram(data,xlabel,ylabel,title, log=False, loc=False): 31 | 32 | sns.set_style('whitegrid') 33 | sns.set_context('talk') 34 | 35 | ##ploting Histogram 36 | _,bins = np.histogram(data,bins='doane') 37 | 38 | measurement = pd.DataFrame(data) 39 | 40 | measurement.plot(kind='hist',bins=bins,legend=False,cumulative=False,normed=False,log=log) 41 | 42 | plt.xlabel(xlabel) 43 | plt.ylabel(ylabel) 44 | plt.title(title) 45 | plt.tight_layout() 46 | 47 | if loc != False: 48 | savePlots(loc,plt) 49 | return 50 | 51 | return plt.show() 52 | 53 | def plot_line_graph(data,xlabel,ylabel,title,labels="",loc=False): 54 | sns.set_style('whitegrid') 55 | sns.set_context('talk') 56 | 57 | ##plotting line graph 58 | _,bins = np.histogram(data,bins='auto') 59 | 60 | Watchmeasurement = pd.DataFrame(data) 61 | 62 | tx = [x for x in range(len(data))] 63 | 64 | plt.figure(figsize=(10,7)) 65 | plt.plot(tx, data, label=labels) 66 | 67 | plt.xlabel(xlabel, fontsize=20) 68 | plt.ylabel(ylabel, fontsize=20) 69 | plt.title(title, fontsize=20) 70 | plt.legend(fontsize=15) 71 | plt.xticks(fontsize=15) 72 | plt.tight_layout() 73 | 74 | if loc != False: 75 | savePlots(loc,plt) 76 | return 77 | return plt.show() 78 | 79 | def plot_time_series(data,xlabel,ylabel,title,loc=False): 80 | 81 | plt.clf() 82 | sns.set_style('whitegrid') 83 | sns.set_context('talk') 84 | p = data 85 | plt.plot(p['date'],p['value']) 86 | 87 | 88 | plt.xticks(fontsize=15) 89 | plt.yticks(fontsize=15) 90 | plt.xlabel(xlabel, fontsize=20) 91 | plt.ylabel(ylabel, fontsize=20) 92 | plt.title(title, fontsize=20) 93 | plt.xticks(rotation=45) 94 | 95 | plt.tight_layout() 96 | 97 | if loc != False: 98 | savePlots(loc,plt) 99 | return 100 | 101 | return plt.show() 102 | 103 | def plot_contributions_oneline(data,xlabel,ylabel,title,loc=False): 104 | 105 | sns.set_style('whitegrid') 106 | sns.set_context('talk') 107 | 108 | p = data 109 | ax = plt.gca() 110 | labels = [str(x) for x in p.date.values] 111 | plt.clf() 112 | plt.plot(p.date.values, p.value.values, label='Unique Users per Day') 113 | 114 | plt.xticks(fontsize=15) 115 | plt.yticks(fontsize=15) 116 | plt.xlabel(xlabel, fontsize=20) 117 | plt.ylabel(ylabel, fontsize=20) 118 | plt.title(title) 119 | plt.legend() 120 | plt.xticks(rotation=45) 121 | plt.tight_layout() 122 | 123 | 124 | if loc != False: 125 | savePlots(loc,plt) 126 | return 127 | 128 | return plt.show() 129 | 130 | def plot_contributions_twolines(containsDup,noDups,xlabel,ylabel,title,loc=False): 131 | 132 | plt.clf() 133 | fig = 
plt.figure(figsize=(18,15)) 134 | ax = fig.add_subplot(221) 135 | labels = [str(x)[:10] for x in containsDup.date.values] 136 | ys = [x for x in range(len(containsDup))] 137 | 138 | plt.plot(ys, containsDup.user.values, label='Unique Users per Day') 139 | plt.plot(ys, noDups.user.values, label='Unique Users Overall') 140 | ax.set_xticklabels(labels=labels, fontsize=20) 141 | 142 | # ax.tick_params(labelsize=15) 143 | plt.tight_layout() 144 | plt.xlabel('Time',fontsize=20) 145 | plt.ylabel('Number of Users',fontsize=20) 146 | plt.title('Cumulative Number of Contributing Users Over Time',fontsize=20) 147 | plt.legend(loc=2, prop={'size': 15}) 148 | plt.xticks(rotation=45) 149 | plt.xticks(fontsize=15) 150 | plt.yticks(fontsize=15) 151 | 152 | if loc != False: 153 | savePlots(loc,plt) 154 | return 155 | 156 | return plt.show() 157 | 158 | def plot_palma_gini(data,xlabel,ylabel,title,loc=False): 159 | data.plot(x = 'cum_nodes',y='cum_value',legend=False) 160 | plt.ylabel(ylabel) 161 | plt.xlabel(xlabel) 162 | plt.plot([0,1],[0,1],linestyle='--',color='k') 163 | plt.tight_layout() 164 | plt.title(title) 165 | if loc != False: 166 | savePlots(loc,plt) 167 | return 168 | return plt.show() 169 | 170 | def plot_distribution_of_events(data,weekday,loc=False): 171 | p = pd.DataFrame(data) 172 | p = p.reset_index() 173 | if weekday == True: 174 | p = p.rename(index=str, columns={'weekday': 'date'}) 175 | p = p.reset_index() 176 | p = p.pivot(index='date', columns='event', values='value').fillna(0) 177 | tp = p.reset_index() 178 | tp.set_index('date') 179 | del tp['date'] 180 | total = tp.sum(axis=1) 181 | for ele in tp.columns: 182 | if ele == 'date': 183 | continue 184 | tp[ele] = tp[ele] 185 | 186 | plt.clf() 187 | sns.set_style('whitegrid') 188 | sns.set_context('talk') 189 | 190 | ax = plt.gca() 191 | 192 | calIndex = list(calendar.day_name) 193 | labels = [str(x)[:10] for x in p.index.values] 194 | 195 | title = 'Days' 196 | if weekday == True: 197 | labels = [calIndex[i] for i in range(len(labels))] 198 | title = 'Weekday' 199 | my_colors = list(islice(cycle([ '#B2912F', '#4D4D4D', '#DECF3F','#60BD68','#5DA5DA','#D3D3D3','#F17CB0','#F15854','#B276B2', '#e59400']), None, len(tp))) 200 | 201 | tp.plot(ax=ax, color=[event_colors.get(x) for x in tp.columns],rot=0) 202 | ax.xaxis.set_ticks(np.arange(0,len(labels))) 203 | ax.set_xticklabels(labels=labels, rotation=45) 204 | plt.legend() 205 | plt.title('Distribution of Events per ' + title) 206 | plt.xlabel(title) 207 | plt.ylabel('Number of Events') 208 | 209 | plt.tight_layout() 210 | if loc != False: 211 | savePlots(loc,plt) 212 | return 213 | return plt.show() 214 | 215 | 216 | 217 | 218 | ############# 219 | #User Centric 220 | ############# 221 | 222 | def plot_top_users(data, xlabel,ylabel,title, log=False,loc=False): 223 | data = pd.DataFrame(data) 224 | 225 | data.plot(kind='bar',legend=False,log=log) 226 | plt.ylabel(ylabel) 227 | plt.xlabel(xlabel) 228 | plt.tight_layout() 229 | plt.title(title) 230 | if loc != False: 231 | savePlots(loc,plt) 232 | return 233 | return plt.show() 234 | 235 | def plot_activity_timeline(data,xlabel,ylabel,title, log=False,loc=False): 236 | p = data 237 | for u in p['user'].unique(): 238 | p[p['user'] == u]['value'].plot(legend=False,logy=False,label=u) 239 | 240 | plt.xticks(fontsize=15) 241 | plt.yticks(fontsize=15) 242 | plt.xlabel(xlabel, fontsize=20) 243 | plt.ylabel(ylabel, fontsize=20) 244 | plt.title(title, fontsize=20) 245 | plt.tight_layout() 246 | plt.xticks(rotation=45) 247 | if loc != False: 248 | 
savePlots(loc,plt) 249 | return 250 | return plt.show() 251 | 252 | ############ 253 | #Community 254 | ############ 255 | 256 | def plot_CommunityProportions(p,xlabel,ylabel,title, loc=False): 257 | data = pd.DataFrame(p) 258 | ax = data.plot(kind='bar',legend=False) 259 | ax.set_xticklabels(data.edgeType.values) 260 | plt.xlabel(xlabel) 261 | plt.ylabel(ylabel) 262 | plt.title(title) 263 | if loc != False: 264 | savePlots(loc,plt) 265 | return 266 | return plt.show() 267 | 268 | 269 | def plot_propIssueEvent(p, xlabel, ylabel,title, loc=False): 270 | 271 | plt.clf() 272 | fig = plt.figure(figsize=(18,15)) 273 | ax = fig.add_subplot(221) 274 | labels = [str(x)[:10] for x in p.index.values] 275 | ys = [x for x in range(len(p[p['issueType'] == 'closed']))] 276 | 277 | plt.plot(ys, p[p['issueType'] == 'closed'].counts.values, label='Closed') 278 | plt.plot(ys, p[p['issueType'] == 'opened'].counts.values, label='Opened') 279 | plt.plot(ys, p[p['issueType'] == 'reopened'].counts.values, label='ReOpened') 280 | ax.set_xticklabels(labels=labels, fontsize=20) 281 | 282 | plt.tight_layout() 283 | plt.xlabel(xlabel,fontsize=20) 284 | plt.ylabel(ylabel,fontsize=20) 285 | plt.title(title,fontsize=20) 286 | plt.legend(bbox_to_anchor=(-.25, .001), loc=2, prop={'size': 15}) 287 | plt.xticks(rotation=45) 288 | plt.xticks(fontsize=15) 289 | plt.yticks(fontsize=15) 290 | 291 | if loc != False: 292 | savePlots(loc,plt) 293 | return 294 | return plt.show() 295 | 296 | -------------------------------------------------------------------------------- /github-measurements/Measurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import datetime 4 | from multiprocessing import Pool 5 | from functools import partial 6 | from pathos import pools as pp 7 | import pickle as pkl 8 | from UserCentricMeasurements import * 9 | from RepoCentricMeasurements import * 10 | from CommunityCentricMeasurements import * 11 | from TEMeasurements import * 12 | from collections import defaultdict 13 | import jpype 14 | import json 15 | 16 | class Measurements(UserCentricMeasurements, RepoCentricMeasurements, TEMeasurements, CommunityCentricMeasurements): 17 | def __init__(self, dfLoc, interested_repos=[], interested_users=[], metaRepoData=False, metaUserData=False, 18 | repoActorsFile='data/filtUsers-test.pkl',reposFile='data/filtRepos-test.pkl',topNodes=[],topEdges=[], 19 | previousActionsFile='',community_dictionary='data/communities.pkl',te_config='te_params_dry_run2.json'): 20 | super(Measurements, self).__init__() 21 | 22 | try: 23 | #check if input is a data frame 24 | dfLoc.columns 25 | df = dfLoc 26 | except: 27 | #if not it should be a csv file path 28 | df = pd.read_csv(dfLoc) 29 | 30 | self.contribution_events = ["PullRequestEvent", "PushEvent", "IssuesEvent","IssueCommentEvent","PullRequestReviewCommentEvent","CommitCommentEvent","CreateEvent"] 31 | self.popularity_events = ['WatchEvent','ForkEvent'] 32 | 33 | print('preprocessing...') 34 | self.main_df = self.preprocess(df) 35 | 36 | print('splitting optional columns...') 37 | #store action and merged columns in a seperate data frame that is not used for most measurements 38 | if len(self.main_df.columns) == 6: 39 | self.main_df_opt = self.main_df.copy()[['action','merged']] 40 | self.main_df_opt['merged'] = self.main_df_opt['merged'].astype(bool) 41 | self.main_df = self.main_df.drop(['action','merged'],axis=1) 42 | else: 43 | self.main_df_opt = None 44 | 45 | 46 | 
#For repoCentric 47 | print('getting selected repos...') 48 | self.selectedRepos = self.getSelectRepos(interested_repos) #Dictionary of selected repos index == repoid 49 | 50 | #For userCentric 51 | self.selectedUsers = self.main_df[self.main_df.user.isin(interested_users)] 52 | 53 | print('processing repo metatdata...') 54 | #read in external metadata files 55 | #repoMetaData format - full_name_h,created_at,owner.login_h,language 56 | #userMetaData format - login_h,created_at,location,company 57 | if metaRepoData != False: 58 | self.useRepoMetaData = True 59 | self.repoMetaData = self.preprocessRepoMeta(pd.read_csv(metaRepoData)) 60 | else: 61 | self.useRepoMetaData = False 62 | print('processing user metatdata...') 63 | if metaUserData != False: 64 | self.useUserMetaData = True 65 | self.userMetaData = self.preprocessUserMeta(pd.read_csv(metaUserData)) 66 | else: 67 | self.useUserMetaData = False 68 | 69 | 70 | #For Community 71 | print('getting communities...') 72 | self.communities = self.getCommunities(path=community_dictionary) 73 | 74 | #read in previous events count external file (used only for one measurement) 75 | try: 76 | print('reading previous counts...') 77 | self.previous_event_counts = pd.read_csv(previousActionsFile) 78 | except: 79 | self.previous_event_counts = None 80 | 81 | 82 | #For TE 83 | print('starting jvm...') 84 | if not jpype.isJVMStarted(): 85 | jpype.startJVM(jpype.getDefaultJVMPath(), "-ea", "-Djava.class.path=" + "infodynamics.jar") 86 | 87 | self.top_users = topNodes 88 | self.top_edges = topEdges 89 | 90 | #read pkl files which define nodes of interest for TE measurements 91 | self.repo_actors = self.readPickleFile(repoActorsFile) 92 | self.repo_groups = self.readPickleFile(reposFile) 93 | 94 | #set TE parameters 95 | with open(te_config,'rb') as f: 96 | te_params = json.load(f) 97 | 98 | self.startTime = pd.Timestamp(te_params['startTime']) 99 | self.binSize= te_params['binSize'] 100 | self.teThresh = te_params['teThresh'] 101 | self.delayUnits = np.array(te_params['delayUnits']) 102 | self.starEvent = te_params['starEvent'] 103 | self.otherEvents = te_params['otherEvents'] 104 | self.kE = te_params['kE'] 105 | self.kN = te_params['kN'] 106 | self.nReps = te_params['nReps'] 107 | self.bGetTS = te_params['bGetTS'] 108 | 109 | 110 | 111 | def preprocess(self,df): 112 | #edit columns, convert date, sort by date 113 | if df.columns[0] == '_id': 114 | del df['_id'] 115 | if len(df.columns) == 4: 116 | df.columns = ['time', 'event', 'user', 'repo'] 117 | else: 118 | df.columns = ['time', 'event', 'user', 'repo','action','merged'] 119 | df = df[df.event.isin(self.popularity_events + self.contribution_events)] 120 | df['time'] = pd.to_datetime(df['time']) 121 | df = df.sort_values(by='time') 122 | df = df.assign(time=df.time.dt.floor('h')) 123 | return df 124 | 125 | def preprocessRepoMeta(self,df): 126 | try: 127 | df.columns = ['repo','created_at','owner_id','language'] 128 | except: 129 | df.columns = ['created_at','owner_id','repo'] 130 | df = df[df.repo.isin(self.main_df.repo.values)] 131 | df['created_at'] = pd.to_datetime(df['created_at']) 132 | #df = df.drop_duplicates('repo') 133 | return df 134 | 135 | def preprocessUserMeta(self,df): 136 | try: 137 | df.columns = ['user','created_at','location','company'] 138 | except: 139 | df.columns = ['user','created_at','city','country','company'] 140 | 141 | df = df[df.user.isin(self.main_df.user.values)] 142 | df['created_at'] = pd.to_datetime(df['created_at']) 143 | return df 144 | 145 | def 
readPickleFile(self,ipFile): 146 | 147 | with open(ipFile, 'rb') as handle: 148 | obj = pkl.load(handle) 149 | 150 | return obj 151 | -------------------------------------------------------------------------------- /github-measurements/UserCentricMeasurements.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from datetime import datetime 4 | from multiprocessing import Pool 5 | from functools import partial 6 | from pathos import pools as pp 7 | import pickle 8 | 9 | ''' 10 | This class implements user centric methods. Each function will describe which metric it is used for according 11 | to the question numbers and mapping. 12 | These metrics assume that the data is in the order id,created_at,type,actor.id,repo.id 13 | ''' 14 | 15 | class UserCentricMeasurements(object): 16 | def __init__(self): 17 | super(UserCentricMeasurements, self).__init__() 18 | 19 | ''' 20 | This function selects a subset of the full data set for a selected set of users and event types. 21 | Inputs: users - A boolean or a list of users. If it is a list of user ids (login_h) the data frame is subset on only this list of users. 22 | If it is True, then the pre-selected node-level subset is used. If False, then all users are included. 23 | eventType - A list of event types to include in the data set 24 | 25 | Output: A data frame with only the selected users and event types. 26 | ''' 27 | def determineDf(self,users,eventType): 28 | 29 | if users == True: 30 | #self.selectedUsers is a data frame containing only the users in interested_users 31 | df = self.selectedUsers 32 | elif users != False: 33 | df = self.main_df[self.main_df.user.isin(users)] 34 | else: 35 | df = self.main_df 36 | 37 | if eventType != None: 38 | df = df[df.event.isin(eventType)] 39 | 40 | return df 41 | 42 | ''' 43 | This method returns the number of unique repos that a particular set of users contributed to 44 | Question #17 45 | Inputs: selectedUsers - A list of users of interest or a boolean indicating whether to subset to the node-level measurement users. 46 | eventType - A list of event types to include in the data 47 | Output: A dataframe with the user id and the number of repos contributed to 48 | ''' 49 | def getUserUniqueRepos(self,selectedUsers=False,eventType=None): 50 | df = self.determineDf(selectedUsers,eventType) 51 | df = df.groupby('user') 52 | data = df.repo.nunique().reset_index() 53 | data.columns = ['user','value'] 54 | return data 55 | 56 | ''' 57 | This method returns the timeline of activity of the desired user over time, either in raw or cumulative counts. 58 | Question #19 59 | Inputs: selectedUsers - A list of users of interest or a boolean indicating whether to subset to node-level measurement users. 
60 | time_bin - Time frequency for calculating event counts 61 | cumSum - Boolean indicating whether to calculate the cumulative activity counts 62 | eventType = List of event types to include in the data 63 | Output: A dictionary with a data frame for each user with two columns: data and event counts 64 | ''' 65 | def getUserActivityTimeline(self, selectedUsers=True,time_bin='1d',cumSum=False,eventType=None): 66 | df = self.determineDf(selectedUsers,eventType) 67 | 68 | df['value'] = 1 69 | if cumSum: 70 | df['cumsum'] = df.groupby('user').value.transform(pd.Series.cumsum) 71 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).max().reset_index() 72 | df['value'] = df['cumsum'] 73 | df = df.drop('cumsum',axis=1) 74 | else: 75 | df = df.groupby(['user',pd.Grouper(key='time',freq=time_bin)]).sum().reset_index() 76 | 77 | data = df.sort_values(['user', 'time']) 78 | measurements = {} 79 | for user in data['user'].unique(): 80 | measurements[user] = data[data['user'] == user] 81 | 82 | return measurements 83 | 84 | 85 | ''' 86 | This method returns the top k most popular users for the dataset, where popularity is measured 87 | as the total popularity of the repos created by the user. 88 | Question #25 89 | Inputs: k - (Optional) The number of users that you would like returned. 90 | use_metadata - External metadata file containing repo owners. Otherwise use first observed user with 91 | a creation event as a proxy for the repo owner. 92 | eventType - A list of event types to include 93 | Output: A dataframe with the user ids and number events for that user 94 | ''' 95 | def getUserPopularity(self,k=5000,use_metadata=False,eventType=None): 96 | 97 | df = self.determineDf(False,eventType) 98 | 99 | df['value'] = 1 100 | 101 | repo_popularity = df[df.event.isin(['WatchEvent','ForkEvent'])].groupby('repo')['value'].sum().reset_index() 102 | 103 | if use_metadata and self.useRepoMetaData: 104 | #merge repo popularity with the owner information in repo_metadata 105 | #drop data for which no owner information exists in metadata 106 | repo_popularity = repo_popularity.merge(self.repoMetaData,left_on='repo',right_on='repo', 107 | how='left').dropna() 108 | 109 | elif df['repo'].str.match('.{22}/.{22}').all(): 110 | #if all repo IDs have the correct format use the owner info from the repo id 111 | repo_popularity['owner_id'] = repo_popularity['repo'].apply(lambda x: x.split('/')[0]) 112 | else: 113 | #otherwise use creation event as a proxy for ownership 114 | user_repos = df[df['event'] == 'CreateEvent'].sort_values('time').drop_duplicates(subset='repo',keep='first') 115 | user_repos = user_repos[['user','repo']] 116 | user_repos.columns = ['owner_id','repo'] 117 | if len(user_repos.index) >= 0: 118 | repo_popularity = user_repos.merge(repo_popularity,on='repo',how='left') 119 | else: 120 | return None 121 | 122 | 123 | measurement = repo_popularity.groupby('owner_id').value.sum().sort_values(ascending=False).head(k) 124 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 125 | 126 | return measurement 127 | 128 | 129 | ''' 130 | This method returns the average time between events for each user 131 | 132 | Inputs: df - Data frame of all data for repos 133 | users - (Optional) List of specific users to calculate the metric for 134 | nCPu - (Optional) Number of CPU's to run metric in parallel 135 | Outputs: A list of average times for each user. 
Length should match the number of users 136 | ''' 137 | def getAvgTimebwEventsUsers(self,selectedUsers=True, nCPU=1): 138 | df = self.determineDf(selectedUsers,None) 139 | users = df['user'].unique() 140 | args = [(df, users[i]) for i, item_a in enumerate(users)] 141 | pool = pp.ProcessPool(nCPU) 142 | deltas = pool.map(self.getMeanTimeUserHelper, args) 143 | return deltas 144 | 145 | ''' 146 | Helper function for getting the average time between events 147 | 148 | Inputs: Same as average time between events 149 | Output: Same as average time between events 150 | ''' 151 | def getMeanTimeUser(self,df, user): 152 | d = df[df.user == user] 153 | d = d.sort_values(by='time') 154 | delta = np.mean(np.diff(d.time)) / np.timedelta64(1, 's') 155 | return delta 156 | 157 | def getMeanTimeUserHelper(self,args): 158 | return self.getMeanTimeUser(*args) 159 | 160 | ''' 161 | This method returns the distribution of the diffusion delay for each user 162 | Question #27 163 | Inputs: DataFrame - Desired dataset 164 | unit - (Optional) This is the unit that you want the distribution in. Check np.timedelta64 documentation 165 | for the possible options 166 | metadata_file - File containing user account creation times. Otherwise use first observed action of user as proxy for account creation time. 167 | Output: A list (array) of deltas in units specified 168 | ''' 169 | def getUserDiffusionDelay(self,unit='h', selectedUser=True,eventType=None): 170 | 171 | df = self.determineDf(selectedUser,eventType) 172 | 173 | df['value'] = df['time'] 174 | df['value'] = pd.to_datetime(df['value']) 175 | df['value'] = df['value'].dt.round('1H') 176 | 177 | if self.useUserMetaData: 178 | df = df.merge(self.userMetaData[['user','created_at']],left_on='user',right_on='user',how='left') 179 | df = df[['user','created_at','value']].dropna() 180 | measurement = df['value'].sub(df['created_at']).apply(lambda x: int(x / np.timedelta64(1, unit))) 181 | else: 182 | grouped = df.groupby('user') 183 | transformed = grouped['value'].transform('min') 184 | measurement = df['value'].sub(transformed).apply(lambda x: int(x / np.timedelta64(1, unit))) 185 | return measurement 186 | 187 | ''' 188 | This method returns the top k users with the most events. 189 | Question #24b 190 | Inputs: DataFrame - Desired dataset. Used mainly when dealing with subset of events 191 | k - Number of users to be returned 192 | Output: Dataframe with the user ids and number of events 193 | ''' 194 | def getMostActiveUsers(self,k=5000,eventType=None): 195 | 196 | df = self.main_df 197 | 198 | if eventType != None: 199 | df = df[df.event.isin(eventType)] 200 | 201 | df['value'] = 1 202 | df = df.groupby('user') 203 | measurement = df.value.sum().sort_values(ascending=False).head(k) 204 | measurement = pd.DataFrame(measurement).sort_values('value',ascending=False) 205 | return measurement 206 | 207 | ''' 208 | This method returns the distribution for the users activity (event counts). 
209 | Question #24a 210 | Inputs: DataFrame - Desired dataset 211 | eventType - (Optional) Desired event type to use 212 | Output: List containing the event counts per user 213 | ''' 214 | def getUserActivityDistribution(self,eventType=None,selectedUser=False): 215 | 216 | if selectedUser: 217 | df = self.selectedUsers 218 | else: 219 | df = self.main_df 220 | 221 | if eventType != None: 222 | df = df[df.event.isin(eventType)] 223 | 224 | df['value'] = 1 225 | df = df.groupby('user') 226 | measurement = df.value.sum().reset_index() 227 | return measurement 228 | 229 | 230 | ''' 231 | Calculate the proportion of pull requests that are accepted by each user. 232 | Question #15 (Optional Measurement) 233 | Inputs: eventType: List of event types to include in the calculation (Should be PullRequestEvent). 234 | thresh: Minimum number of PullRequests a repo must have to be included in the distribution. 235 | Output: Data frame with the proportion of accepted pull requests for each user 236 | ''' 237 | def getUserPullRequestAcceptance(self,eventType=['PullRequestEvent'], thresh=2): 238 | 239 | df = self.main_df_opt 240 | 241 | if not df is None and 'PullRequestEvent' in self.main_df.event.values: 242 | 243 | df = df[self.main_df.event.isin(eventType)] 244 | users_repos = self.main_df[self.main_df.event.isin(eventType)] 245 | 246 | #subset on only PullRequest close actions (not opens) 247 | idx = df['action'] == 'closed' 248 | closes = df[idx] 249 | users_repos = users_repos[idx] 250 | 251 | #merge pull request columns (action, merged) with main data frame columns 252 | closes = pd.concat([users_repos,closes],axis=1) 253 | closes = closes[['user','repo','merged']] 254 | closes['value'] = 1 255 | 256 | #add up number of accepted (merged) and rejected pullrequests by user and repo 257 | outcomes = closes.pivot_table(index=['user','repo'],values=['value'],columns=['merged'],aggfunc=np.sum).fillna(0) 258 | 259 | outcomes.columns = outcomes.columns.get_level_values(1) 260 | 261 | outcomes = outcomes.rename(index=str, columns={True: "accepted", False: "rejected"}) 262 | 263 | for col in ['accepted','rejected']: 264 | if col not in outcomes.columns: 265 | outcomes[col] = 0 266 | 267 | outcomes['total'] = outcomes['accepted'] + outcomes['rejected'] 268 | outcomes['value'] = outcomes['accepted'] / outcomes['total'] 269 | outcomes = outcomes.reset_index() 270 | outcomes = outcomes[outcomes['total'] >= thresh] 271 | 272 | if len(outcomes.index) > 0: 273 | #calculate the average acceptance rate for each user across their repos 274 | measurement = outcomes[['user','value']].groupby('user').mean() 275 | else: 276 | measurement = None 277 | else: 278 | measurement = None 279 | 280 | return measurement 281 | 282 | -------------------------------------------------------------------------------- /github-measurements/infodynamics.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pnnl/socialsim/06f0ce61d10ca08dd50d256fb30ac0ae81ead58d/github-measurements/infodynamics.jar -------------------------------------------------------------------------------- /github-measurements/reference-approaches/README.md: -------------------------------------------------------------------------------- 1 | # Reference Approach Scripts 2 | 3 | * **generate_reference_approach_data.py**: This script can generate reference approach data for a target test period using a given historical data set. 
4 | * **reference_approach_performance_plots.py**: This script can be used to replicate the visualizations we used to summarize performance relative to the reference approaches. -------------------------------------------------------------------------------- /github-measurements/reference-approaches/generate_reference_approach_data.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import datetime 3 | import numpy as np 4 | import glob 5 | 6 | 7 | def ingest_historical_data(csv_file): 8 | 9 | """ 10 | Read data from csv file 11 | """ 12 | 13 | print('reading data...') 14 | df = pd.read_csv(csv_file) 15 | df.columns = ['created_at','type','actor_login_h','repo_name_h','payload_action','payload_pull_request_merged'] 16 | 17 | print('to datetime..') 18 | df['created_at'] = pd.to_datetime(df['created_at']) 19 | 20 | print('sorting...') 21 | df = df.sort_values('created_at') 22 | 23 | return df 24 | 25 | def subset_data(df,start,end): 26 | 27 | """ 28 | Return temporal data subset based on start and end dates 29 | """ 30 | 31 | print('subsetting...') 32 | df = df[ (df['created_at'] >= start) & (df['created_at'] <= end) ] 33 | 34 | return(df) 35 | 36 | def shift_data(df,shift, end): 37 | 38 | """ 39 | Shift data based on fixed offset (shift) and subset based on upper limit (end) 40 | """ 41 | 42 | print('shifting...') 43 | df['created_at'] += shift 44 | df = df[df['created_at'] <= end] 45 | 46 | return df 47 | 48 | 49 | def sample_data(df,start,end,proportional=True): 50 | 51 | """ 52 | Sample data either uniformly (proportional=False) or proporationally (proportional=True) to fill test period from start to end 53 | """ 54 | 55 | print('inter-event times...') 56 | 57 | df['inter_event_times'] = df['created_at'] - df['created_at'].shift() 58 | inter_event_times = df['inter_event_times'].dropna() 59 | 60 | max_time = df['created_at'].min() 61 | multiplier=( (pd.to_datetime(end) - pd.to_datetime(start)) / df['inter_event_times'].mean() ) / float(len(df.index)) 62 | 63 | #repeat until enough data is sampled to fill the test period 64 | while max_time < pd.to_datetime(end): 65 | 66 | if proportional: 67 | sample = pd.DataFrame(df['inter_event_times'].dropna().sample(int(multiplier*len(df.index)),replace=True)) 68 | sampled_inter_event_times = sample.cumsum() 69 | else: 70 | sample = pd.DataFrame(np.random.uniform(np.min(inter_event_times.dt.total_seconds()),1.0,int(multiplier*len(df.index))))[0].round(0) 71 | sample = pd.to_timedelta(sample,unit='s') 72 | sampled_inter_event_times = pd.DataFrame(sample).cumsum() 73 | 74 | event_times = (pd.to_datetime(start) + sampled_inter_event_times) 75 | max_time = pd.to_datetime(event_times.max().values[0]) 76 | multiplier*=1.5 77 | 78 | event_times = event_times[(event_times < pd.to_datetime(end)).values] 79 | 80 | if proportional: 81 | users = df['actor_login_h'] 82 | repos = df['repo_name_h'] 83 | events = df['type'] 84 | else: 85 | users = pd.Series(df['actor_login_h'].unique()) 86 | repos = pd.Series(df['repo_name_h'].unique()) 87 | events = pd.Series(df['type'].unique()) 88 | 89 | 90 | users = users.sample(len(event_times),replace=True).values 91 | repos = repos.sample(len(event_times),replace=True).values 92 | events = events.sample(len(event_times),replace=True).values 93 | 94 | df_out = pd.DataFrame({'time':event_times.values.flatten(), 95 | 'event':events, 96 | 'user':users, 97 | 'repo':repos}) 98 | 99 | if proportional: 100 | pr_action = df[df['type'] == 
'PullRequestEvent']['payload_action'] 101 | pr_merged = df[df['type'] == 'PullRequestEvent']['payload_pull_request_merged'] 102 | iss_action = df[df['type'] == 'IssuesEvent']['payload_action'] 103 | else: 104 | pr_action = df[df['type'] == 'PullRequestEvent']['payload_action'].unique() 105 | pr_merged = df[df['type'] == 'PullRequestEvent']['payload_pull_request_merged'].unique() 106 | iss_action = df[df['type'] == 'IssuesEvent']['payload_action'].unique() 107 | 108 | pull_requests = df_out[df_out['event'] == 'PullRequestEvent'] 109 | pull_requests['payload_action'] = pd.Series(pr_action).sample(len(pull_requests.index), 110 | replace=True).values 111 | pull_requests['payload_pull_request_merged'] = pd.Series(pr_merged).sample(len(pull_requests.index), 112 | replace=True).values 113 | 114 | 115 | issues = df_out[df_out['event'] == 'IssuesEvent'] 116 | issues['payload_action'] = pd.Series(iss_action).sample(len(issues.index),replace=True).values 117 | 118 | df_out = df_out[~df_out['event'].isin(['IssuesEvent','PullRequestEvent'])] 119 | df_out = pd.concat([df_out,pull_requests,issues]) 120 | df_out = df_out.sort_values('time') 121 | 122 | df_out = df_out[['time','event','user','repo','payload_action','payload_pull_request_merged']] 123 | 124 | return df_out 125 | 126 | 127 | def create_shifted_reference(csv_file, test_start_date='2018-02-01', test_end_date='2018-02-28', 128 | historical_start_date='2017-08-01',historical_end_date='2017-08-31'): 129 | 130 | 131 | """ 132 | Create shifted reference from historical data in csv_file using data ranging from historical_start_date 133 | to historical_end_date to generate new shifted data ranging from test_start_date to test_end_date. 134 | """ 135 | 136 | 137 | df = ingest_historical_data(csv_file) 138 | 139 | 140 | test_delta_t = np.datetime64(test_end_date) - np.datetime64(test_start_date) 141 | historical_delta_t = np.datetime64(historical_end_date) - np.datetime64(historical_start_date) 142 | if historical_delta_t > test_delta_t: 143 | df = subset_data(df,historical_start_date,historical_end_date) 144 | else: 145 | print('Not enough historical data to create shifted reference approach') 146 | return None 147 | 148 | shifted_df = shift_data(df,np.datetime64(test_start_date) - np.datetime64(historical_start_date),np.datetime64(test_end_date)) 149 | shifted_df = subset_data(shifted_df,test_start_date,test_end_date) 150 | 151 | return shifted_df 152 | 153 | 154 | def create_sampled_reference(csv_file, test_start_date='2018-02-01', test_end_date='2018-02-28', 155 | historical_start_date='2017-08-01',historical_end_date='2017-08-31', 156 | proportional=True): 157 | 158 | """ 159 | Create sampled reference from historical data in csv_file using data ranging from historical_start_date 160 | to historical_end_date to generate new sampled data ranging from test_start_date to test_end_date. 161 | If proportional is True, the sampling will be proportional to the observed frequencies in the 162 | historical data. Otherwise, sampling will be uniform. 
163 | """ 164 | 165 | df = ingest_historical_data(csv_file) 166 | 167 | df = subset_data(df,historical_start_date,historical_end_date) 168 | 169 | sampled_df = sample_data(df,test_start_date, test_end_date,proportional) 170 | 171 | return sampled_df 172 | 173 | 174 | def main(): 175 | 176 | fn = 'august_2017.csv' 177 | 178 | shifted_reference = create_shifted_reference(fn,test_end_date='2018-02-05') 179 | print('shifted reference') 180 | print(shifted_reference) 181 | 182 | sampled_reference_uniform = create_sampled_reference(fn,proportional=False,test_end_date='2018-02-05') 183 | print('sampled reference uniform') 184 | print(sampled_reference_uniform) 185 | 186 | sampled_reference_proportional = create_sampled_reference(fn,proportional=True,test_end_date='2018-02-05') 187 | print('sampled reference proportional') 188 | print(sampled_reference_proportional) 189 | 190 | 191 | if __name__ == "__main__": 192 | main() 193 | -------------------------------------------------------------------------------- /github-measurements/requirements.txt: -------------------------------------------------------------------------------- 1 | fastdtw==0.3.2 2 | numpy==1.14.0 3 | statsmodels==0.8.0 4 | pathos==0.2.1 5 | pandas==0.23.1 6 | matplotlib==2.0.2 7 | scipy==0.19.1 8 | JPype1==0.6.3 9 | scikit_learn==0.19.1 10 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | Copyright 2018 PACIFIC NORTHWEST NATIONAL LABORATORY 2 | 3 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 4 | 5 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 6 | 7 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 8 | 9 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
10 | -------------------------------------------------------------------------------- /pip_requirements.txt: -------------------------------------------------------------------------------- 1 | backports.functools-lru-cache==1.5 2 | certifi==2018.8.24 3 | chardet==3.0.4 4 | Click==7.0 5 | community==1.0.0b1 6 | cycler==0.10.0 7 | decorator==4.3.0 8 | dill==0.2.8.2 9 | fastdtw==0.3.2 10 | Flask==1.0.2 11 | idna==2.7 12 | itsdangerous==0.24 13 | Jinja2==2.10 14 | JPype1==0.6.3 15 | kiwisolver==1.0.1 16 | MarkupSafe==1.0 17 | matplotlib==2.2.3 18 | mkl-fft==1.0.6 19 | mkl-random==1.0.1 20 | multiprocess==0.70.6.1 21 | networkx==2.2 22 | numpy==1.15.2 23 | pandas==0.23.4 24 | pathos==0.2.2.1 25 | patsy==0.5.0 26 | pox==0.2.4 27 | ppft==1.6.4.8 28 | prettytable==0.7.2 29 | pycairo==1.17.1 30 | pyparsing==2.2.2 31 | PySAL==1.14.4.post2 32 | python-dateutil==2.7.3 33 | python-igraph==0.7.1.post6 34 | pytz==2018.5 35 | requests==2.19.1 36 | scikit-learn==0.20.0 37 | scipy==1.1.0 38 | seaborn==0.9.0 39 | six==1.11.0 40 | sklearn==0.0 41 | statsmodels==0.9.0 42 | subprocess32==3.5.2 43 | tqdm==4.26.0 44 | urllib3==1.23 45 | Werkzeug==0.14.1 46 | --------------------------------------------------------------------------------
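As a usage sketch for the newer github-measurements code above, the snippet below constructs a Measurements object and runs two of the user-centric measurements. This assumes the supporting files referenced by the constructor defaults (the data/ pickle files, the TE parameter JSON, and infodynamics.jar) are available in the working directory; the input CSV path is hypothetical.

    import pandas as pd
    from Measurements import Measurements

    # Hypothetical event log with columns _id, created_at, type, actor.id, repo.id
    events = pd.read_csv('github_events_sample.csv')

    # The constructor accepts either a DataFrame or a CSV path
    m = Measurements(events)

    # Top 10 users ranked by total event count
    print(m.getMostActiveUsers(k=10))

    # Per-user event count distribution restricted to push events
    print(m.getUserActivityDistribution(eventType=['PushEvent']))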