├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── identity-resolution ├── README.md ├── data └── DATA.md ├── images ├── architecture.png └── sagemaker-link.png ├── notebooks └── identity-graph │ ├── identity-graph-sample.ipynb │ └── nepytune │ ├── __init__.py │ ├── benchmarks │ ├── __init__.py │ ├── __main__.py │ ├── benchmarks_visualization.py │ ├── connection_pool.py │ ├── drop_graph.py │ ├── ingestion.py │ └── query_runner.py │ ├── cli │ ├── __init__.py │ ├── __main__.py │ ├── add.py │ ├── extend.py │ ├── split.py │ └── transform.py │ ├── drawing.py │ ├── edges │ ├── __init__.py │ ├── identity_groups.py │ ├── ip_loc.py │ ├── persistent_ids.py │ ├── user_website.py │ └── website_groups.py │ ├── nodes │ ├── __init__.py │ ├── identity_groups.py │ ├── ip_loc.py │ ├── users.py │ └── websites.py │ ├── traversal.py │ ├── usecase │ ├── __init__.py │ ├── brand_interaction.py │ ├── purchase_path.py │ ├── similar_audience.py │ ├── undecided_users.py │ ├── user_summary.py │ └── users_from_household.py │ ├── utils.py │ ├── visualizations │ ├── __init__.py │ ├── bar_plots.py │ ├── commons.py │ ├── histogram.py │ ├── network_graph.py │ ├── pie_chart.py │ ├── segments.py │ ├── sunburst_chart.py │ └── venn_diagram.py │ └── write_utils.py └── templates ├── bulk-load-stack.yaml ├── identity-resolution.yml └── neptune-workbench-stack.yaml /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | .*~ 3 | .*.swp 4 | *.pyc 5 | .DS_Store 6 | *.lock -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. 
You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AWS Advertising & Marketing Samples 2 | 3 | Samples and documentation for various advertising and marketing use cases on AWS. 4 | 5 | ## Sample 1: [Customer Identity Graph using Amazon Neptune](./identity-resolution/) 6 | 7 | A customer identity graph enables a single, unified view of customer identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes, for targeted advertising. Included in this repository is a sample solution for building an identity graph using Amazon Neptune, a managed graph database service on AWS. 8 | 9 | ## Additional Reading 10 | 11 | [AWS Advertising & Marketing portal](https://aws.amazon.com/advertising-marketing/) 12 | 13 | ## Contributing 14 | 15 | Please see further instructions on contributing in the CONTRIBUTING file. 16 | 17 | ## License 18 | 19 | This library is licensed under the MIT-0 License. See the LICENSE file. 20 | 21 | -------------------------------------------------------------------------------- /identity-resolution/README.md: -------------------------------------------------------------------------------- 1 | # Identity Graph Using Amazon Neptune 2 | 3 | An identity graph provides a single unified view of customers and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods. Typically, identity graphs are part of a larger identity resolution architecture. Identity resolution is the process of matching human identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes, for targeted advertising. 4 | 5 | The following notebook walks you through a sample identity graph solution, built on an open dataset and the Amazon Neptune graph database, and shows how it can be used within a larger identity resolution architecture. The notebook also includes a number of data visualizations that help you understand the structure of an identity graph and the characteristics of an identity resolution dataset and use case. Later in the notebook, we explore additional use cases that this dataset supports. 6 | 7 | ## Getting Started 8 | 9 | This repo includes the following assets: 10 | - A [Jupyter notebook](notebooks/identity-graph/identity-graph-sample.ipynb) containing a more thorough explanation of the Identity Graph use case, the dataset that is being used, the graph data model, and graph queries that are used in deriving identities, audiences, customer journeys, etc. 11 | - A [sample dataset](data/DATA.md) comprising anonymized cookies, device IDs, and website visits. It also includes additional manufactured data that enriches the original anonymized dataset to make this more realistic. 12 | - A set of [Python scripts](notebooks/identity-graph/nepytune) that are used within the Jupyter notebook for executing each of the different use cases and examples. We're providing the code for these scripts here so that you can extend them for your own use; a minimal connection-and-query sketch is included after this README.
13 | - A [CloudFormation template](templates/identity-resolution.yml) to launch each of these resources along with the necessary infrastructure. This template will create an Amazon Neptune database cluster and load the sample dataset into the cluster. It will also create a SageMaker Jupyter Notebook instance and install the scripts and sample Jupyter notebook to this instance for you to run against the Neptune cluster. 14 | 15 | ### Architecture 16 | 17 | 18 | 19 | ### Quickstart 20 | 21 | To get started quickly, we have included the following quick-launch link for deploying this sample architecture. 22 | 23 | | Region | Stack | 24 | | ---- | ---- | 25 | |US East (Ohio) | [](https://us-east-2.console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-us-east-2/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 26 | |US East (N. Virginia) | [](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-us-east-1/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 27 | |US West (Oregon) | [](https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-us-west-2/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 28 | |EU West (Ireland) | [](https://eu-west-1.console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-eu-west-1/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 29 | 30 | Once you have launched the stack, go to the Outputs tab of the root stack and click on the SageMakerNotebook link. This will bring up the Jupyter notebook console of the SageMaker Jupyter Notebook instance that you created. 31 | 32 | 33 | 34 | Once logged into Jupyter, browse through the Neptune/identity-resolution directories until you see the identity-graph-sample.ipynb file. This is the Jupyter notebook containing all of the sample use cases and queries for using Amazon Neptune for Identity Graph. Click on the ipynb file. Additional instructions for each of the use cases are embedded in the Jupyter notebook (ipynb file). 35 | 36 | ## License Summary 37 | 38 | This library is licensed under the MIT-0 License. See the LICENSE file. 
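If you would rather launch the sample from code than from the quick-launch links above, the following is a minimal sketch of doing so with boto3. The template URL and stack name are taken from the us-east-1 entry in the table; the IAM capability flags and the output handling are assumptions about the template, so adjust them for your own deployment.

```python
# Hypothetical programmatic equivalent of the us-east-1 quick-launch link above.
import boto3

TEMPLATE_URL = (
    "https://s3.amazonaws.com/aws-admartech-samples-us-east-1/"
    "identity-resolution/templates/identity-resolution.yml"
)

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The stack provisions IAM roles (e.g. for Neptune bulk loading), so IAM
# capabilities are assumed to be required here.
stack = cfn.create_stack(
    StackName="Identity-Graph-Sample",
    TemplateURL=TEMPLATE_URL,
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Wait for the nested stacks (Neptune cluster, bulk load, SageMaker notebook
# instance) to finish creating before reading the outputs.
cfn.get_waiter("stack_create_complete").wait(StackName=stack["StackId"])

outputs = cfn.describe_stacks(StackName="Identity-Graph-Sample")["Stacks"][0].get("Outputs", [])
print({o["OutputKey"]: o["OutputValue"] for o in outputs})
```

Whichever way you launch the stack, the SageMakerNotebook link in the stack outputs is the entry point described above.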
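The notebook and the nepytune scripts listed under Getting Started talk to the Neptune cluster with Gremlin. As a rough, self-contained illustration, here is a small sketch using gremlin_python: the NEPTUNE_CLUSTER_ENDPOINT/NEPTUNE_CLUSTER_PORT environment variables and the transientId label, visited edge, and url property follow the conventions used in the sample code, but the specific traversal below is only an example and not one of the notebook's use-case queries.

```python
# A minimal sketch of querying the loaded identity graph, following the
# connection conventions used by the nepytune scripts in this repo.
import os

from gremlin_python.structure.graph import Graph
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

endpoint = os.environ["NEPTUNE_CLUSTER_ENDPOINT"]
port = os.getenv("NEPTUNE_CLUSTER_PORT", "8182")

connection = DriverRemoteConnection(f"ws://{endpoint}:{port}/gremlin", "g")
g = Graph().traversal().withRemote(connection)

# How many vertices of each label (transientId, website, identityGroup, ...) were loaded.
print(g.V().groupCount().by(T.label).next())

# Websites visited by one device/cookie identifier; "some-transient-id" is a
# placeholder, not a real ID from the sample dataset.
urls = g.V("some-transient-id").out("visited").values("url").limit(10).toList()
print(urls)

connection.close()
```

The actual use-case traversals (households, undecided users, brand interactions, early adopters, and so on) live in nepytune/usecase and are exercised from the notebook.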
39 | -------------------------------------------------------------------------------- /identity-resolution/data/DATA.md: -------------------------------------------------------------------------------- 1 | # Sample Dataset for Identity Resolution on Amazon Neptune 2 | 3 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/identity_group_edges.csv 4 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/identity_group_nodes.csv 5 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/ip_edges.csv 6 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/ip_nodes.csv 7 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/persistent_edges.csv 8 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/persistent_nodes.csv 9 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/transient_edges.csv 10 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/transient_nodes.csv 11 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/website_group_edges.csv 12 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/website_group_nodes.csv 13 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/websites.csv -------------------------------------------------------------------------------- /identity-resolution/images/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/images/architecture.png -------------------------------------------------------------------------------- /identity-resolution/images/sagemaker-link.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/images/sagemaker-link.png -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/benchmarks/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/__main__.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import asyncio 3 | import csv 4 | import logging 5 | import os 6 | import random 7 | import time 8 | import statistics 9 | 10 | import numpy as np 11 | 12 | from nepytune.benchmarks.query_runner import get_query_runner 13 | from nepytune.benchmarks.connection_pool import NeptuneConnectionPool 14 | 15 | QUERY_NAMES = [ 16 | 'get_sibling_attrs', 'undecided_user_check', 
'undecided_user_audience', 17 | 'brand_interaction_audience', 'get_all_transient_ids_in_household', 18 | 'early_website_adopters' 19 | ] 20 | 21 | parser = argparse.ArgumentParser(description="Run query benchmarks") 22 | parser.add_argument("--users", type=int, default=10) 23 | parser.add_argument("--samples", type=int, default=1000) 24 | parser.add_argument("--queries", default=['all'], type=str, 25 | nargs='+', choices=QUERY_NAMES + ['all']) 26 | parser.add_argument("--verbose", action='store_true') 27 | parser.add_argument("--csv", action="store_true") 28 | parser.add_argument("--output", type=str, default="results") 29 | args = parser.parse_args() 30 | 31 | if args.queries == ['all']: 32 | args.queries = QUERY_NAMES 33 | 34 | if (args.verbose): 35 | level = logging.DEBUG 36 | else: 37 | level = logging.INFO 38 | 39 | logging.basicConfig(level=level, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') 40 | logger = logging.getLogger(__name__) 41 | 42 | sem = asyncio.Semaphore(args.users) 43 | 44 | 45 | def custom_exception_handler(loop, context): 46 | """Stop event loop if exception occurs.""" 47 | loop.default_exception_handler(context) 48 | 49 | exception = context.get('exception') 50 | if isinstance(exception, Exception): 51 | print(context) 52 | loop.stop() 53 | 54 | 55 | async def run_query(query_runner, sample, semaphore, pool): 56 | """Run query with limit on concurrent connections.""" 57 | async with semaphore: 58 | return await query_runner.run(sample, pool) 59 | 60 | 61 | async def run(query, samples, pool): 62 | """Run query benchmark tasks.""" 63 | query_runner = get_query_runner(query, samples) 64 | 65 | logger.info("Initializing query data.") 66 | await asyncio.gather(query_runner.initialize()) 67 | 68 | queries = [] 69 | logger.info("Running benchmark.") 70 | for i in range(samples): 71 | queries.append(asyncio.create_task(run_query(query_runner, i, sem, pool))) 72 | results = await asyncio.gather(*queries) 73 | 74 | logger.info(f"Successful queries: {query_runner.succeded}") 75 | logger.info(f"Failed queries: {query_runner.failed}") 76 | 77 | benchmark_results = [result for result in results if result] 78 | return benchmark_results, query_runner.succeded, query_runner.failed 79 | 80 | 81 | def stats(results): 82 | """Print statistics for benchmark results.""" 83 | print(f"Samples: {args.samples}") 84 | print(f"Mean: {statistics.mean(results)}s") 85 | print(f"Median: {statistics.median(results)}s") 86 | a = np.array(results) 87 | for percentile in [50, 90, 99, 99.9, 99.99]: 88 | result = np.percentile(a, percentile) 89 | print(f"{percentile} percentile: {result}s") 90 | 91 | 92 | if __name__ == '__main__': 93 | loop = asyncio.get_event_loop() 94 | loop.set_exception_handler(custom_exception_handler) 95 | 96 | pool = NeptuneConnectionPool(args.users) 97 | try: 98 | loop.run_until_complete(pool.create()) 99 | for query in args.queries: 100 | logger.info(f"Benchmarking query: {query}") 101 | logger.info(f"Concurrent users: {args.users}") 102 | results, succeded, failed = loop.run_until_complete(run(query, args.samples, pool)) 103 | stats([measure[2] for measure in results]) 104 | if args.csv: 105 | dst = f"{args.output}/{query}-{args.samples}-{args.users}.csv" 106 | with open(dst, "w") as f: 107 | writer = csv.writer(f) 108 | for measure in results: 109 | writer.writerow(measure) 110 | query_stats = f"{args.output}/{query}-{args.samples}-{args.users}-stats.csv" 111 | with open(query_stats, "w") as f: 112 | writer = csv.writer(f) 113 | writer.writerow([succeded, 
failed]) 114 | finally: 115 | loop.run_until_complete(pool.destroy()) 116 | loop.run_until_complete(loop.shutdown_asyncgens()) 117 | loop.close() 118 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/benchmarks_visualization.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import math 3 | import os 4 | import tqdm 5 | import pandas as pd 6 | import plotly.graph_objects as go 7 | from intervaltree import IntervalTree 8 | from plotly.subplots import make_subplots 9 | 10 | 11 | def get_benchmarks_results_dataframes(results_path, query, instances, 12 | samples_by_users): 13 | """Convert benchmarks results into data frames.""" 14 | dfs_by_users = {} 15 | for users, samples in samples_by_users.items(): 16 | dfs = [] 17 | for instance in instances: 18 | df = pd.read_csv(f"{results_path}/{instance}/{query}-{samples}-{users}.csv", 19 | names=['start', 'end', 'duration']) 20 | df["instance"] = instance 21 | dfs.append(df) 22 | 23 | dfs_by_users[users] = pd.concat(dfs) 24 | return dfs_by_users 25 | 26 | 27 | def show_query_time_graph(benchmarks_dfs, yfunc, title, x_title): 28 | """Show query duration graph.""" 29 | fig = go.Figure() 30 | 31 | for users, df in benchmarks_dfs.items(): 32 | fig.add_trace( 33 | go.Box( 34 | x=df["instance"], 35 | y=yfunc(df["duration"]), 36 | boxpoints=False, 37 | boxmean=True, 38 | name=f"{users} users", 39 | hoverinfo="y", 40 | ) 41 | ) 42 | 43 | fig.update_layout( 44 | yaxis=dict( 45 | title=title, 46 | tickangle=-45, 47 | ), 48 | xaxis_title=x_title, 49 | boxmode='group' 50 | ) 51 | fig.show() 52 | 53 | 54 | def select_concurrent_queries_from_data(query, benchmarks_dfs, cache_path): 55 | """Measure concurrent queries from benchmark results.""" 56 | users_chart_data = {} 57 | cache_suffix = "cache_concurrent" 58 | 59 | if not os.path.isdir(cache_path): 60 | os.makedirs(cache_path) 61 | 62 | for users in benchmarks_dfs.keys(): 63 | cache_filename = f"{cache_path}/{query}-{users}-{cache_suffix}.csv" 64 | if os.path.isfile(cache_filename): 65 | with open(cache_filename) as f: 66 | print(f"Reading from cached file: {cache_filename}.") 67 | queries_df = pd.read_csv(f) 68 | queries_df = queries_df.set_index( 69 | pd.to_datetime(queries_df['timestamp'])) 70 | users_chart_data[users] = queries_df 71 | else: 72 | df = benchmarks_dfs[users].copy() 73 | # convert to milliseconds 74 | df["duration"] = df["duration"].multiply(1000) 75 | 76 | data_frames = [] 77 | for instance in df.instance.unique(): 78 | queries = get_concurrent_queries_by_time(df, users, instance) 79 | queries_df = pd.DataFrame( 80 | queries, columns=['timestamp', 'users', 'instance']) 81 | 82 | resampled = resample_queries_frame(queries_df, '100ms') 83 | 84 | data_frames.append(resampled) 85 | 86 | with open(cache_filename, "w") as f: 87 | pd.concat(data_frames).to_csv(f) 88 | 89 | users_chart_data[users] = pd.concat(data_frames) 90 | 91 | return users_chart_data 92 | 93 | 94 | def show_concurrent_queries_charts(concurrent_queries_dfs, x_title, y_title): 95 | """Show concurrent queries chart.""" 96 | for users, df in concurrent_queries_dfs.items(): 97 | instances = len(df.instance.unique()) 98 | 99 | fig = make_subplots(rows=instances, cols=1) 100 | 101 | for row, instance in enumerate(df.instance.unique(), start=1): 102 | instance_data = df[df.instance == instance] 103 | fig.add_trace( 104 | go.Scatter( 105 | x=[(idx - 
instance_data.index[0]).total_seconds() 106 | for idx in instance_data.index], 107 | y=instance_data["users"], 108 | name=instance 109 | ), 110 | row=row, 111 | col=1 112 | ) 113 | 114 | fig.update_yaxes( 115 | title_text=f"{y_title} for: {users} users", row=2, col=1) 116 | fig.update_xaxes(title_text=x_title, row=3, col=1) 117 | 118 | fig.show() 119 | 120 | 121 | def get_concurrent_queries_by_time(df, users, instance): 122 | """ 123 | Return concurrent running queries by time. 124 | 125 | Build interval tree of running query times. 126 | Calculate time range duration and check overlaping queries. 127 | """ 128 | idf = df.loc[df["instance"] == instance].copy() 129 | 130 | idf['start'] = pd.to_datetime(idf['start'], unit='s') 131 | idf['end'] = pd.to_datetime(idf['end'], unit='s') 132 | 133 | # get nsmallest and nlargest to not leave single running queries 134 | start = idf.nsmallest(int(users), "start")["start"].max() 135 | end = idf.nlargest(int(users), "end")["end"].min() 136 | 137 | step = math.ceil(idf['duration'].min()/10) 138 | 139 | t = IntervalTree() 140 | for index, row in idf.iterrows(): 141 | t[row["start"]:row["end"]] = None 142 | 143 | tr = pd.to_datetime(pd.date_range( 144 | start=start, end=end, freq=f"{step}ms")) 145 | 146 | rows = [] 147 | for i in tqdm.tqdm(range(len(tr)-1)): 148 | r1 = tr[i] 149 | r2 = tr[i+1] 150 | concurrent_queries = len(t[r1:r2]) 151 | rows.append([r1, concurrent_queries, instance]) 152 | 153 | return rows 154 | 155 | 156 | def resample_queries_frame(df, freq): 157 | """Resample queries frame with given frequency.""" 158 | df = df.set_index(pd.to_datetime(df['timestamp'])) 159 | 160 | resampled = pd.DataFrame() 161 | resampled["users"] = df.users.resample(freq).mean().bfill() 162 | resampled["instance"] = df.instance.resample(freq).last().bfill() 163 | 164 | return resampled 165 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/connection_pool.py: -------------------------------------------------------------------------------- 1 | import os 2 | from aiogremlin import DriverRemoteConnection 3 | 4 | CONNECTION_RETRIES = 5 5 | CONNECTION_HEARTBEAT = 0.1 6 | 7 | class NeptuneConnectionPool(): 8 | def __init__(self, users): 9 | self.users = users 10 | self.active = [] 11 | self.available = [] 12 | 13 | async def create(self): 14 | for _ in range(self.users): 15 | conn = await self.init_neptune_connection() 16 | self.available.append(conn) 17 | 18 | async def destroy(self): 19 | for conn in self.active + self.available: 20 | await conn.close() 21 | 22 | def lock(self): 23 | for _ in range(CONNECTION_RETRIES): 24 | if self.available: 25 | conn = self.available.pop() 26 | self.active.append(conn) 27 | return conn 28 | raise ConnectionError("Cannot aquire connection from pool.") 29 | 30 | def unlock(self, conn): 31 | self.active.remove(conn) 32 | self.available.append(conn) 33 | 34 | async def init_neptune_connection(self): 35 | """Init Neptune connection.""" 36 | endpoint = os.environ["NEPTUNE_CLUSTER_ENDPOINT"] 37 | port = os.getenv("NEPTUNE_CLUSTER_PORT", "8182") 38 | return await DriverRemoteConnection.open(f"ws://{endpoint}:{port}/gremlin", "g") 39 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/drop_graph.py: -------------------------------------------------------------------------------- 1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. 
2 | # All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"). 5 | # You may not use this file except in compliance with the License. 6 | # A copy of the License is located at 7 | # 8 | # http://aws.amazon.com/apache2.0/ 9 | # 10 | # or in the "license" file accompanying this file. 11 | # This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 12 | # either express or implied. See the License for the specific language governing permissions 13 | # and limitations under the License. 14 | 15 | ''' 16 | @author: krlawrence 17 | @copyright: Amazon.com, Inc. or its affiliates 18 | @license: Apache2 19 | @contact: @krlawrence 20 | @deffield created: 2019-04-02 21 | 22 | This code uses Gremlin Python to drop an entire graph. 23 | 24 | It is intended as an example of a multi-threaded strategy for dropping vertices and edges. 25 | 26 | The following overall strategy is currently used. 27 | 28 | 1. Fetch all edge IDs 29 | - Edges are fetched using multiple threads in large batches. 30 | - Smaller slices are queued up for worker threads to drop. 31 | 2. Drop all edges using those IDs 32 | - Worker threads read the slices of IDs from the queue and drop the edges. 33 | 3. Fetch all vertex IDs 34 | - Vertices are fetched using multiple threads in large batches. 35 | - Smaller slices are queued up for worker threads to drop. 36 | 4. Drop all vertices using the fetched IDs 37 | - Worker threads read the slices of IDs from the queue and drop the vertices. 38 | 39 | NOTES: 40 | 1: To avoid possible concurrent write exceptions, no fetching and dropping is done in parallel. 41 | 2: Edges are explicitly dropped before vertices, again to avoid any conflicting writes. 42 | 3: This code uses an in-memory, thread-safe queue. The amount of data that can be processed 43 | will depend upon how big of an in-memory queue can be created. It has been tested using a 44 | graph containing 10M vertices and 10M edges. 45 | 4: While the code as written deletes an entire graph, it could be easily adapted to delete part 46 | of a graph instead. 47 | 5: The following environment variables should be defined before this code is run. 48 | NEPTUNE_PORT - The port that the Neptune endpoint is listening on such as 8182. 49 | NEPTUNE_WRITER - The Neptune Cluster endpoint name such as 50 | "mygraph.cluster-abcdefghijkl.us-east-1.neptune.amazonaws.com" 51 | 6: This script assumes that the 'gremlinpython' library has already been installed. 52 | 7: For massive graphs (with hundreds of millions or billions of elements) creating a new 53 | Neptune cluster will be faster than trying to delete everything programmatically. 54 | 55 | STILL TODO: 56 | The code could be further improved by offering an option to only drop the edges and by 57 | removing the need to count all edges and all vertices before starting work. The use of 58 | threads could be further optimized in future to get more reuse out of the fetcher threads. 59 | One further refinement that would enable very large graphs to be dropped would be to 60 | avoid the need to read all element IDs into memory before dropping can start by doing 61 | that process iteratively. This script should probably also be turned into a class.
62 | ''' 63 | 64 | from gremlin_python.structure.graph import Graph 65 | from gremlin_python.process.graph_traversal import __ 66 | from gremlin_python.process.strategies import * 67 | from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection 68 | from gremlin_python.process.traversal import * 69 | from threading import Thread 70 | from queue import Queue 71 | import threading 72 | import time 73 | import math 74 | import os 75 | 76 | # The fetch size and batch sizes should not need to be changed but can be if necessary. 77 | # As a guide, the number of threads should be twice the number of vCPU available of the Neptune write master node. 78 | 79 | MAX_FETCH_SIZE = 50000 # Maximum number of IDs to fetch at a time. A large number limits the number of range() calls 80 | EDGE_BATCH_SIZE = 500 # Number of edges to drop in each call to drop(). This affects the queue entry size. 81 | VERTEX_BATCH_SIZE = 500 # Number of vertices to drop in each call to drop(). This affects the queue entry size. 82 | MAX_FETCHERS = 8 # Maximum number of threads allowed to be created for fetching vertices and edges 83 | NUM_THREADS = 8 # Number of local workers to create to process the drop queue. 84 | POOL_SIZE = 8 # Python driver default is 4. Change to create a bigger pool. 85 | MAX_WORKERS = 8 # Python driver default is 5 * number of CPU on client machine. 86 | 87 | # Ready flag is used to tell workers they can start processing the queue 88 | ready_flag = threading.Event() 89 | 90 | # The wait queues are used to make sure all threads have finished fetching before the 91 | # workers start processing the IDs to be dropped. 92 | edge_fetch_wait_queue = Queue() 93 | vertex_fetch_wait_queue = Queue() 94 | 95 | # Queue that will contain the node and edge IDs that need to be dropped 96 | pending_work = Queue() 97 | 98 | 99 | #################################################################################### 100 | # fetch_edges 101 | # 102 | # Calculate how many threads are needed to fetch the edge IDs and create the threads 103 | #################################################################################### 104 | def fetch_edges(g, q): 105 | print("\nPROCESSING EDGES") 106 | print("Assessing number of edges.") 107 | count = g.E().count().next() 108 | print(count, "edges to drop") 109 | if count > 0: 110 | fetch_size = MAX_FETCH_SIZE 111 | num_threads = min(math.ceil(count/fetch_size),MAX_FETCHERS) 112 | bracket_size = math.ceil(count/num_threads) 113 | print("Will use", num_threads, "threads.") 114 | print("Each thread will queue", bracket_size) 115 | print("Queueing IDs") 116 | 117 | start_offset = 0 118 | 119 | fetchers = [None] * num_threads 120 | 121 | for i in range(num_threads): 122 | edge_fetch_wait_queue.put(i) 123 | fetchers[i] = Thread(target=edge_fetcher, args=(g, pending_work,start_offset,bracket_size,)) 124 | fetchers[i].setDaemon(True) 125 | fetchers[i].start() 126 | start_offset += bracket_size 127 | return count 128 | 129 | #################################################################################### 130 | # fetch_vertices 131 | # 132 | # Calculate how many threads are needed to fetch the node IDs and create the threads 133 | #################################################################################### 134 | def fetch_vertices(g, q): 135 | print("\nPROCESSING VERTICES") 136 | print("Assessing number of vertices.") 137 | count = g.V().count().next() 138 | print(count, "vertices to drop") 139 | if count > 0: 140 | fetch_size = MAX_FETCH_SIZE 141 | num_threads 
= min(math.ceil(count/fetch_size),MAX_FETCHERS) 142 | bracket_size = math.ceil(count/num_threads) 143 | print("Will use", num_threads, "threads.") 144 | print("Each thread will queue", bracket_size) 145 | print("Queueing IDs") 146 | 147 | start_offset = 0 148 | 149 | fetchers = [None] * num_threads 150 | 151 | for i in range(num_threads): 152 | vertex_fetch_wait_queue.put(i) 153 | fetchers[i] = Thread(target=vertex_fetcher, args=(g, pending_work,start_offset,bracket_size,)) 154 | fetchers[i].setDaemon(True) 155 | fetchers[i].start() 156 | start_offset += bracket_size 157 | return count 158 | 159 | #################################################################################### 160 | # edge_fetcher 161 | # 162 | # Fetch edges in large batches and queue them up for deletion in smaller slices 163 | #################################################################################### 164 | def edge_fetcher(g, q,start_offset,bracket_size): 165 | p1 = start_offset 166 | inc = min(bracket_size,MAX_FETCH_SIZE) 167 | p2 = start_offset + inc 168 | org = p1 169 | done = False 170 | nm = threading.currentThread().name 171 | print(nm,"[edges] Fetching from offset", start_offset, "with end at",start_offset+bracket_size) 172 | edge_fetch_wait_queue.get() 173 | 174 | done = False 175 | while not done: 176 | success = False 177 | while not success: 178 | print(nm,"[edges] retrieving range ({},{} batch=size={})".format(p1,p2,p2-p1)) 179 | try: 180 | edges = g.E().range(p1,p2).id().toList() 181 | success = True 182 | except: 183 | print("***",nm,"Exception while fetching. Retrying.") 184 | time.sleep(1) 185 | 186 | slices = math.ceil(len(edges)/EDGE_BATCH_SIZE) 187 | s1 = 0 188 | s2 = 0 189 | for i in range(slices): 190 | s2 += min(len(edges)-s1,EDGE_BATCH_SIZE) 191 | q.put(["edges",edges[s1:s2]]) 192 | s1 = s2 193 | p1 += inc 194 | if p1 >= org + bracket_size: 195 | done = True 196 | else: 197 | p2 += min(inc, org+bracket_size - p2) 198 | size = q.qsize() 199 | print(nm,"[edges] work done. Queue size ==>",size) 200 | edge_fetch_wait_queue.task_done() 201 | return 202 | 203 | #################################################################################### 204 | # vertex_fetcher 205 | # 206 | # Fetch vertices in large batches and queue them up for deletion in smaller slices 207 | #################################################################################### 208 | def vertex_fetcher(g, q,start_offset,bracket_size): 209 | p1 = start_offset 210 | inc = min(bracket_size,MAX_FETCH_SIZE) 211 | p2 = start_offset + inc 212 | org = p1 213 | done = False 214 | nm = threading.currentThread().name 215 | print(nm,"[vertices] Fetching from offset", start_offset, "with end at",start_offset+bracket_size) 216 | vertex_fetch_wait_queue.get() 217 | 218 | done = False 219 | while not done: 220 | success = False 221 | while not success: 222 | print(nm,"[vertices] retrieving range ({},{} batch=size={})".format(p1,p2,p2-p1)) 223 | try: 224 | vertices = g.V().range(p1,p2).id().toList() 225 | success = True 226 | except: 227 | print("***",nm,"Exception while fetching. Retrying.") 228 | time.sleep(1) 229 | 230 | slices = math.ceil(len(vertices)/VERTEX_BATCH_SIZE) 231 | s1 = 0 232 | s2 = 0 233 | for i in range(slices): 234 | s2 += min(len(vertices)-s1,VERTEX_BATCH_SIZE) 235 | q.put(["vertices",vertices[s1:s2]]) 236 | s1 = s2 237 | p1 += inc 238 | if p1 >= org + bracket_size: 239 | done = True 240 | else: 241 | p2 += min(inc, org+bracket_size - p2) 242 | size = q.qsize() 243 | print(nm,"[vertices] work done. 
Queue size ==>",size) 244 | vertex_fetch_wait_queue.task_done() 245 | return 246 | 247 | #################################################################################### 248 | # worker 249 | # 250 | # Worker task that will handle deletion of IDs that are in the queue. Multiple workers 251 | # will be created based on the value specified for NUM_THREADS. 252 | #################################################################################### 253 | def worker(g, q): 254 | c = 0 255 | nm = threading.currentThread().name 256 | print("Worker", nm, "started") 257 | while True: 258 | ready = ready_flag.wait() 259 | if not q.empty(): 260 | work = q.get() 261 | successful = False 262 | while not successful: 263 | try: 264 | if len(work[1]) > 0: 265 | print("[{}] {} deleting {} {}".format(c,nm,len(work[1]), work[0])) 266 | if work[0] == "edges": 267 | g.E(work[1]).drop().iterate() 268 | else: 269 | g.V(work[1]).drop().iterate() 270 | successful = True 271 | except: 272 | # A concurrent modification error can occur if we try to drop an element 273 | # that is already loacked by some other process accessing the graph. 274 | # If that happens sleep briefly and try again. 275 | print("{} Exception dropping some {} will retry".format(nm,work[0])) 276 | print(sys.exc_info()[0]) 277 | print(sys.exc_info()[1]) 278 | time.sleep(1) 279 | c += 1 280 | q.task_done() 281 | 282 | 283 | 284 | def drop(g): 285 | #################################################################################### 286 | # Do the work! 287 | # 288 | #################################################################################### 289 | # Fetch the edges 290 | equeue_start_time = time.time() 291 | ecount = fetch_edges(g, pending_work) 292 | edge_fetch_wait_queue.join() 293 | equeue_end_time = time.time() 294 | 295 | # Create the pool of workers that will drop the edges and vertices 296 | print("Creating drop() workers") 297 | 298 | workers = [None] * NUM_THREADS 299 | ready_flag.set() 300 | 301 | edrop_start_time = time.time() 302 | for i in range(NUM_THREADS): 303 | workers[i] = Thread(target=worker, args=(g, pending_work,)) 304 | workers[i].setDaemon(True) 305 | workers[i].start() 306 | 307 | # Wait until all of the edges in the queue have been dropped 308 | pending_work.join() 309 | edrop_end_time = time.time() 310 | 311 | # Tell the workers to wait until the vertex IDs have all been enqueued 312 | ready_flag.clear() 313 | 314 | # Fetch the vertex IDs 315 | vqueue_start_time = time.time() 316 | vcount = fetch_vertices(g, pending_work) 317 | vertex_fetch_wait_queue.join() 318 | vqueue_end_time = time.time() 319 | 320 | # Tell the workers to start dropping the vertices 321 | vdrop_start_time = time.time() 322 | ready_flag.set() 323 | pending_work.join() 324 | vdrop_end_time = time.time() 325 | 326 | # Calculate how long each phase took 327 | eqtime = equeue_end_time - equeue_start_time 328 | vqtime = vqueue_end_time - vqueue_start_time 329 | etime = edrop_end_time - edrop_start_time 330 | vtime = vdrop_end_time - vdrop_start_time 331 | 332 | print("Summary") 333 | print("-------") 334 | print("Worker threads", NUM_THREADS) 335 | print("Max fetch size", MAX_FETCH_SIZE) 336 | print("Edge batch size", EDGE_BATCH_SIZE) 337 | print("Vertex batch size", VERTEX_BATCH_SIZE) 338 | print("Edges dropped", ecount) 339 | print("Vertices dropped", vcount) 340 | print("Time taken to queue edges", eqtime) 341 | print("Time taken to drop edges", etime) 342 | print("Time taken to queue vertices", vqtime) 343 | print("Time taken to drop 
vertices", vtime) 344 | 345 | print("TOTAL TIME",eqtime + vqtime + etime + vtime) 346 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/ingestion.py: -------------------------------------------------------------------------------- 1 | """Common code for running benchmarks.""" 2 | 3 | import csv 4 | import json 5 | import logging 6 | import time 7 | import os 8 | 9 | import boto3 10 | import botocore 11 | import requests 12 | 13 | from itertools import islice 14 | 15 | from gremlin_python.structure.graph import Graph 16 | from gremlin_python.process.graph_traversal import __ 17 | from gremlin_python.process.strategies import * 18 | from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection 19 | from gremlin_python.process.traversal import * 20 | 21 | import pandas as pd 22 | 23 | import plotly.graph_objects as go 24 | 25 | from nepytune.benchmarks.drop_graph import drop 26 | 27 | AWS_REGION = os.getenv("AWS_REGION") 28 | NEPTUNE_ENDPOINT = os.getenv('NEPTUNE_CLUSTER_ENDPOINT') 29 | NEPTUNE_PORT = os.getenv('NEPTUNE_CLUSTER_PORT') 30 | NEPTUNE_LOADER_ENDPOINT = f"https://{NEPTUNE_ENDPOINT}:{NEPTUNE_PORT}/loader" 31 | NEPTUNE_GREMLIN_ENDPOINT = f"ws://{NEPTUNE_ENDPOINT}:{NEPTUNE_PORT}/gremlin" 32 | NEPTUNE_LOAD_ROLE_ARN = os.getenv("NEPTUNE_LOAD_ROLE_ARN") 33 | BUCKET = os.getenv("S3_PROCESSED_DATASET_BUCKET") 34 | DATASET_DIR = "../../dataset" 35 | 36 | GREMLIN_POOL_SIZE = 8 # Python driver default is 4. Change to create a bigger pool. 37 | GREMLIN_MAX_WORKERS = 8 # Python driver default is 5 * number of CPU on client machine. 38 | 39 | # Initialize Neptune connection 40 | graph=Graph() 41 | connection = DriverRemoteConnection(NEPTUNE_GREMLIN_ENDPOINT,'g', 42 | pool_size=GREMLIN_POOL_SIZE, 43 | max_workers=GREMLIN_MAX_WORKERS) 44 | g = graph.traversal().withRemote(connection) 45 | 46 | 47 | # Initialize logger 48 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') 49 | logger = logging.getLogger() 50 | 51 | 52 | # Make dataset directory 53 | if not os.path.isdir(DATASET_DIR): 54 | os.mkdir(DATASET_DIR) 55 | 56 | 57 | def download_file(bucket, file): 58 | """Download file from S3.""" 59 | try: 60 | logger.info("Start downloading %s.", file) 61 | dst = f"./{DATASET_DIR}/{file}" 62 | if os.path.isfile(dst): 63 | logger.info("File exists, skipping.") 64 | return 65 | 66 | s3 = boto3.resource('s3') 67 | s3.Bucket(bucket).download_file(file, f"./{DATASET_DIR}/{file}") 68 | except botocore.exceptions.ClientError as e: 69 | if e.response['Error']['Code'] == "404": 70 | print("The object does not exist.") 71 | else: 72 | raise 73 | 74 | 75 | def upload_file(file_name, bucket, prefix, key=None): 76 | """Upload file to S3 bucket.""" 77 | if key is None: 78 | key = file_name 79 | object_name = f"{prefix}/{key}" 80 | s3_client = boto3.client('s3') 81 | try: 82 | response = s3_client.upload_file(file_name, bucket, object_name) 83 | except botocore.exceptions.ClientError as e: 84 | raise e 85 | return object_name 86 | 87 | 88 | def wait_for_load_complete(load_id): 89 | """Wait for Neptune load to complete.""" 90 | while not is_load_completed(load_id): 91 | time.sleep(10) 92 | 93 | 94 | def is_load_completed(load_id): 95 | """Check if Neptune load is completed""" 96 | response = requests.get(f"{NEPTUNE_LOADER_ENDPOINT}/{load_id}").json() 97 | status = response["payload"]["overallStatus"]["status"] 98 | if status == "LOAD_IN_PROGRESS": 
99 | return False 100 | return True 101 | 102 | 103 | def copy_n_lines(src, dst, n): 104 | """Copy N lines from src to dst file.""" 105 | if os.path.isfile(dst): 106 | logger.info("File: %s exists, skipping.", dst) 107 | return 108 | 109 | with open(src) as src_file: 110 | lines = islice(src_file, n) 111 | with open(dst, 'w') as dst_file: 112 | dst_file.writelines(lines) 113 | 114 | 115 | def populate_graph(vertices_n): 116 | import tempfile 117 | import uuid 118 | 119 | logger.info("Populating graph with %s vertices.", vertices_n) 120 | 121 | if vertices_n == 0: 122 | return 123 | 124 | labels = '"~id","attr1:String","attr2:String","~label"' 125 | 126 | fd, path = tempfile.mkstemp() 127 | try: 128 | with os.fdopen(fd, 'w') as tmp: 129 | tmp.write(labels + '\n') 130 | for _ in range(vertices_n): 131 | node_id = str(uuid.uuid4()) 132 | attr1 = node_id 133 | attr2 = node_id 134 | label = "generatedVertice" 135 | tmp.write(f"{node_id},{attr1},{attr2},{label}\n") 136 | key = upload_file(path, BUCKET, "generated") 137 | load_into_neptune(BUCKET, key) 138 | s3 = boto3.resource("s3") 139 | s3.Object(BUCKET, key).delete() 140 | 141 | finally: 142 | os.remove(path) 143 | 144 | 145 | 146 | def load_into_neptune(bucket, key): 147 | """Load CSV file into neptune.""" 148 | data = { 149 | "source" : f"s3://{bucket}/{key}", 150 | "format" : "csv", 151 | "iamRoleArn" : NEPTUNE_LOAD_ROLE_ARN, 152 | "region" : AWS_REGION, 153 | "failOnError" : "FALSE", 154 | "parallelism" : "MEDIUM", 155 | "updateSingleCardinalityProperties" : "FALSE" 156 | } 157 | response = requests.post(NEPTUNE_LOADER_ENDPOINT, json=data) 158 | json_response = response.json() 159 | load_id = json_response["payload"]["loadId"] 160 | logger.info("Waiting for load %s to complete.", load_id) 161 | wait_for_load_complete(load_id) 162 | logger.info("Load %s completed", load_id) 163 | 164 | return load_id 165 | 166 | 167 | def get_loading_time(load_id): 168 | response = requests.get(f"{NEPTUNE_LOADER_ENDPOINT}/{load_id}").json() 169 | time_spent = response["payload"]["overallStatus"]["totalTimeSpent"] 170 | return time_spent 171 | 172 | 173 | def benchmark_loading_data(source, entities_to_add, 174 | initial_sizes=[0], dependencies=[], drop=True): 175 | """ 176 | Benchmark loading data into AWS Neptune. 177 | 178 | Graph is dropped before every benchmark run. 179 | Benchmark measures loading time for vertices and edges. 180 | Graph can be populated with initial random data. 
181 | """ 182 | 183 | filename = f"{source}.csv" 184 | download_file(BUCKET, filename) 185 | prefix = "splitted" 186 | 187 | results = {} 188 | 189 | logger.info("Loading dependencies.") 190 | for dependency in dependencies: 191 | filename = f"{DATASET_DIR}/{dependency}" 192 | logger.info("Uploading %s to S3 bucket.", dependency) 193 | key = upload_file(filename, BUCKET, "dependencies", key=dependency) 194 | load_id = load_into_neptune(BUCKET, key) 195 | 196 | for initial_graph_size in initial_sizes: 197 | results[initial_graph_size] = {} 198 | 199 | for entities_n in entities_to_add: 200 | if drop: 201 | drop(g) 202 | populate_graph(initial_graph_size) 203 | 204 | logger.info("Generating file with %s entities.", entities_n) 205 | dst = f"{DATASET_DIR}/{source}_{entities_n}.csv" 206 | copy_n_lines(f"{DATASET_DIR}/{source}.csv", dst, entities_n) 207 | 208 | logger.info("Uploading %s to S3 bucket.", dst) 209 | csv_file = upload_file(dst, BUCKET, prefix, f"{source}_{entities_n}.csv") 210 | load_id = load_into_neptune(BUCKET, csv_file) 211 | 212 | loading_time = get_loading_time(load_id) 213 | logger.info("Loading %d nodes lasts for %d seconds.", entities_n, loading_time) 214 | 215 | results[initial_graph_size][entities_n] = loading_time 216 | 217 | return results 218 | 219 | 220 | def save_result_to_csv(source, results, dst="."): 221 | """Save ingestion results to CSV file.""" 222 | with open(f"{dst}/ingestion-{source}.csv", "w") as f: 223 | writer = csv.writer(f) 224 | for initial_size, result in results.items(): 225 | for entites, time in result.items(): 226 | writer.writerow(initial_size, entites, time) 227 | 228 | 229 | def draw_loading_benchmark_results(results, title, x_title, y_title): 230 | """Draw loading benchmark results.""" 231 | fig_data = [ 232 | { 233 | "type": "bar", 234 | "name": f"Initial graph size: {k}", 235 | "x": list(v.keys()), 236 | "y": list(v.values()) 237 | } for k,v in results.items() 238 | ] 239 | 240 | _draw_group_bar(fig_data, title, x_title, y_title) 241 | 242 | 243 | def draw_from_csv(csv, title, x_title, y_title): 244 | """Draw loading benchmark from csv.""" 245 | df = pd.read_csv(csv, names=['initial', 'entities', 'duration']) 246 | 247 | fig_data = [ 248 | { 249 | "type": "bar", 250 | "name": f"Initial graph size: {initial_graph_size}", 251 | "x": group["entities"], 252 | "y": group["duration"] 253 | } for initial_graph_size, group in df.groupby('initial') 254 | ] 255 | 256 | _draw_group_bar(fig_data, title, x_title, y_title) 257 | 258 | 259 | def _draw_group_bar(fig_data, title, x_title, y_title): 260 | fig = go.Figure({ 261 | "data": fig_data, 262 | "layout": { 263 | "title": {"text": title}, 264 | "xaxis.type": "category", 265 | "barmode": "group", 266 | "xaxis_title": x_title, 267 | "yaxis_title": y_title, 268 | } 269 | }) 270 | 271 | fig.show() 272 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/query_runner.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import math 4 | import random 5 | import time 6 | import asyncio 7 | 8 | from datetime import timedelta 9 | 10 | from gremlin_python.process.graph_traversal import values, outE, inE 11 | from gremlin_python.process.traversal import Column, Order 12 | from aiogremlin import DriverRemoteConnection, Graph 13 | from aiogremlin.exception import GremlinServerError 14 | 15 | from nepytune.usecase import ( 16 | get_sibling_attrs, 
brand_interaction_audience, 17 | get_all_transient_ids_in_household, undecided_user_audience_check, 18 | undecided_users_audience, get_activity_of_early_adopters 19 | ) 20 | 21 | logger = logging.getLogger(__name__) 22 | 23 | ARG_COLLECTION = 1000 24 | COIN = 0.1 25 | 26 | 27 | class QueryRunner: 28 | """Query runner.""" 29 | 30 | def __init__(self, query, samples): 31 | self.args = [] 32 | self.query = query 33 | self.samples = int(samples) 34 | self.succeded = 0 35 | self.failed = 0 36 | 37 | async def run(self, sample, pool): 38 | """Run query and return measure.""" 39 | sample_no = sample + 1 40 | try: 41 | connection = pool.lock() 42 | g = Graph().traversal().withRemote(connection) 43 | args = self.get_args(sample) 44 | try: 45 | start = time.time() 46 | result = await self.query(g, **args).toList() 47 | end = time.time() 48 | _log_query_info(self.samples, sample_no, args, result) 49 | self.succeded += 1 50 | return (start, end, end - start) 51 | except GremlinServerError as e: 52 | logger.debug(f"Sample {sample_no} failed: {e.msg}") 53 | self.failed += 1 54 | return None 55 | finally: 56 | pool.unlock(connection) 57 | except ConnectionError as e: 58 | logger.debug(f"Sample {sample_no} failed: {e}") 59 | self.failed += 1 60 | return None 61 | 62 | 63 | async def initialize(self): 64 | pass 65 | 66 | def get_args(self, sample): 67 | """Get args for query function.""" 68 | return self.args[sample % len(self.args)] 69 | 70 | 71 | class SiblingsAttrsRunner(QueryRunner): 72 | def __init__(self, samples): 73 | super().__init__(query=get_sibling_attrs, samples=samples) 74 | 75 | async def initialize(self): 76 | connection = await init_neptune_connection() 77 | async with connection: 78 | g = Graph().traversal().withRemote(connection) 79 | transient_ids = await get_household_members(g, ARG_COLLECTION) 80 | 81 | self.args = [ 82 | { 83 | "transient_id": transient_id 84 | } for transient_id in transient_ids 85 | ] 86 | 87 | 88 | class BrandInteractionRunner(QueryRunner): 89 | def __init__(self, samples): 90 | super().__init__(query=brand_interaction_audience, samples=samples) 91 | 92 | async def initialize(self): 93 | connection = await init_neptune_connection() 94 | async with connection: 95 | g = Graph().traversal().withRemote(connection) 96 | websites = await ( 97 | g.V().hasLabel("website").coin(COIN).limit(ARG_COLLECTION).toList() 98 | ) 99 | 100 | self.args = [ 101 | { 102 | "website_url": website 103 | } for website in websites 104 | ] 105 | 106 | 107 | class AudienceCheck(QueryRunner): 108 | def __init__(self, samples): 109 | self.args = [] 110 | super().__init__(query=undecided_user_audience_check, samples=samples) 111 | 112 | async def initialize(self): 113 | connection = await init_neptune_connection() 114 | async with connection: 115 | g = Graph().traversal().withRemote(connection) 116 | 117 | data = await ( 118 | g.V().hasLabel("transientId").coin(COIN).limit(ARG_COLLECTION) 119 | .group() 120 | .by() 121 | .by( 122 | outE("visited").coin(COIN).inV().in_( 123 | "links_to").out("links_to").coin(COIN) 124 | .path() 125 | .by(values("uid")) 126 | .by(values("ts")) 127 | .by(values("url")) 128 | .by(values("url")) 129 | .by(values("url")) 130 | ).select(Column.values).unfold() 131 | ).toList() 132 | 133 | self.args = [ 134 | { 135 | "transient_id": result[0], 136 | "website_url": result[2], 137 | "thank_you_page_url": result[4], 138 | "since": result[1] - timedelta(days=random.randint(30, 60)), 139 | "min_visited_count": random.randint(2, 5) 140 | } for result in data if result 141 | ] 
142 | 143 | 144 | class AudienceGeneration(QueryRunner): 145 | def __init__(self, samples): 146 | self.args = [] 147 | super().__init__(query=undecided_users_audience, samples=samples) 148 | 149 | async def initialize(self): 150 | connection = await init_neptune_connection() 151 | async with connection: 152 | g = Graph().traversal().withRemote(connection) 153 | 154 | most_visited_websites = await get_most_active_websites(g) 155 | data = await ( 156 | g.V(most_visited_websites) 157 | .group() 158 | .by() 159 | .by( 160 | inE().hasLabel("visited").coin(COIN).inV() 161 | .in_("links_to").out("links_to").coin(COIN) 162 | .path() 163 | .by(values("url")) # visited website 164 | .by(values("ts")) # timestamp 165 | .by(values("url")) # visited website 166 | .by(values("url")) # root website 167 | .by(values("url").limit(1)) # thank you page 168 | ).select(Column.values).unfold() 169 | ).toList() 170 | 171 | self.args = [ 172 | { 173 | "website_url": result[0], 174 | "thank_you_page_url": result[4], 175 | "since": result[1] - timedelta(days=random.randint(30, 60)), 176 | "min_visited_count": random.randint(2, 5) 177 | } for result in data 178 | ] 179 | 180 | 181 | class EarlyAdopters(QueryRunner): 182 | def __init__(self, samples): 183 | super().__init__( 184 | query=get_activity_of_early_adopters, 185 | samples=samples) 186 | 187 | async def initialize(self): 188 | connection = await init_neptune_connection() 189 | async with connection: 190 | g = Graph().traversal().withRemote(connection) 191 | most_visited_websites = await get_most_active_websites(g) 192 | 193 | self.args = [ 194 | { 195 | "thank_you_page_url": website 196 | } for website in most_visited_websites 197 | ] 198 | 199 | 200 | class HouseholdDevices(QueryRunner): 201 | def __init__(self, samples): 202 | super().__init__(query=get_all_transient_ids_in_household, 203 | samples=samples) 204 | 205 | async def initialize(self): 206 | connection = await init_neptune_connection() 207 | async with connection: 208 | g = Graph().traversal().withRemote(connection) 209 | household_members = await get_household_members(g, ARG_COLLECTION) 210 | 211 | self.args = [ 212 | { 213 | "transient_id": member 214 | } for member in household_members 215 | ] 216 | 217 | 218 | async def get_household_members(g, limit, coin=COIN): 219 | """Return transient IDs which are memebers of identity group.""" 220 | return await ( 221 | g.V().hasLabel("identityGroup").out("member") 222 | .out("has_identity") 223 | .coin(coin).limit(limit).toList() 224 | ) 225 | 226 | 227 | async def init_neptune_connection(): 228 | """Init Neptune connection.""" 229 | endpoint = os.environ["NEPTUNE_CLUSTER_ENDPOINT"] 230 | port = os.getenv("NEPTUNE_CLUSTER_PORT", "8182") 231 | return await DriverRemoteConnection.open(f"ws://{endpoint}:{port}/gremlin", "g") 232 | 233 | 234 | def _log_query_info(samples, sample_no, args, result): 235 | logger.debug(f"Sample {sample_no} args: {args}") 236 | if len(result) > 100: 237 | logger.debug("Truncating query result.") 238 | logger.debug(f"Sample {sample_no} result: {result[:100]}") 239 | else: 240 | logger.debug(f"Sample {sample_no} result: {result}") 241 | 242 | samples_checkpoint = math.ceil(samples*0.1) 243 | if sample_no % samples_checkpoint == 0: 244 | logger.info(f"Finished {sample_no} of {samples} samples.") 245 | 246 | 247 | async def get_most_active_websites(g): 248 | """Return websites with most visits.""" 249 | # Query for most visited websites is quite slow. 250 | # Thus visited websites are hardcoded. 
251 | 252 | # most_visited_websites = await ( 253 | # g.V().hasLabel("website") 254 | # .order().by(inE('visited').count(), Order.decr) 255 | # .limit(1000).toList() 256 | # ) 257 | 258 | most_visited_websites = [ 259 | "8f6b27fe6f0dcdae", 260 | "a997482113271d8f/5758f309e11931ce", 261 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb?aac4d7fceeea7dcb", 262 | "6e89cfa05ae05032", 263 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?1d5bfa3db363b460", 264 | "3cfce7aac081cf80/49d249c29289f7a5/5ea0237ac10c9de3?1911788a62d90dd4", 265 | "12a78ad541e95ae", 266 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?77af8f56d61f1f7", 267 | "ed95a9a5be30e4c8/5162fc6a223f248d", 268 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb", 269 | "2e272bb1ae067296/49ffef01dbcd3442", 270 | "6ea77fc3ea42bd5b", 271 | "4c980617e02858a4", 272 | "b23e286d713f61fd/f9077d4b41c9e32e", 273 | "c3c6e6e856091767", 274 | "12a78ad541e95ae/7de2f069da3a3655", 275 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?b80a3fe036e3d80", 276 | "6ae12ea8ec730ba5/281bb5a0f4846ea7/802fc6a2d4f41295/34702b07a20db/8b84b6e138385d6", 277 | "8f6f3d03e10289c2", 278 | "ed95a9a5be30e4c8", 279 | "ed95a9a5be30e4c8/9c2692a00033d2ca", 280 | "afea1067d86a1c44/768ddae806aa91cc", 281 | "7875af5f916d165/2de17cd3dfa1bafb?28d8c9221be3456e", 282 | "1f8649a74c661bd4", 283 | "ed95a9a5be30e4c8/d400c9e183de73f3", 284 | "0d9afe7c94a6fcb8", 285 | "5f63cba1308ebad/16e720804d7385cb?5a4b1b396bf1130", 286 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb?a2ae02cc94e330f2", 287 | "6cb909d81a2f5b20", 288 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb?799961866adb8a72", 289 | "5f63cba1308ebad/16e720804d7385cb?282d33b7392ed0f3", 290 | "b23e286d713f61fd/16e720804d7385cb", 291 | "dcb69d5b9ce0d93", 292 | "9e82d69ba38ad61", 293 | "1f8649a74c661bd4/b3cf138ac65a87cd", 294 | "427e6f941738985a", 295 | "8f6b27fe6f0dcdae/77cc413057b22ef2", 296 | "7e89190c7bcf1be9", 297 | "7e89190c7bcf1be9/fb5a409aecff2de1/32c6ffef1a8068b2/01fee084a3cb3563?590f324987a908ac", 298 | "277503b36e998a2c", 299 | "5bb77e7558c09124", 300 | "b23e286d713f61fd/16e720804d7385cb?799961866adb8a72", 301 | "6eefbbf46b47c5e", 302 | "dc958d5abcb0c7f4", 303 | "fb3859d88debbc2f/10e22e5ca30919fd/bed4d82bfc7fb316/9fb2db33a1362553/af1bef8666741753", 304 | "54df72c060e95707/01fee084a3cb3563", 305 | "1f73e4b495d6947a", 306 | "fb3859d88debbc2f/10e22e5ca30919fd/bed4d82bfc7fb316/9fb2db33a1362553/af1bef8666741753/b8b68b641a5d7f18", 307 | "6ae12ea8ec730ba5/281bb5a0f4846ea7/253bf3e95bec331a/34702b07a20db/8b84b6e138385d6", 308 | "5f63cba1308ebad/16e720804d7385cb", 309 | "a4e358da594acc69/d5e31c7559f5aae", 310 | "6e89cfa05ae05032?7ded49ef5f6ae4b5", 311 | "307809459d18aac/05ec660c9d33a602/1c4578927f3f3711/2ba906928c030c0f", 312 | "427e6f941738985a/7de2f069da3a3655", 313 | "70fc5e1c206b990d", 314 | "40c40bf5f58729e9", 315 | "2f38166a9f476d14/2e1f4252a64ef39e?ffa3ebbd543f63a", 316 | "8f6b27fe6f0dcdae/7de2f069da3a3655", 317 | "530bd88a2a6056ba/753be5bb22047d7d/ac5dd08add7bd9b3", 318 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?2cb5075b4f4e88dd", 319 | "c415bc2d4909291c/ff90c3dd68949525", 320 | "88784b4873c7551d/a8c79e6cf0f93af?3fe03b55422683a", 321 | "ec9d0d6b37ae8d68/01fee084a3cb3563", 322 | "ec9d0d6b37ae8d68/01fee084a3cb3563/850b51f8595b735c/d1559ef785b761e1", 323 | "999fd0543f2499ba/05ec660c9d33a602/1c4578927f3f3711/2ba906928c030c0f", 324 | "cf17e071ca4a6d63/333314eda494a273/9683443388b62d72", 325 | "afea1067d86a1c44/8968eb8d56ea2005", 
326 | "6865c9a20330e96e", 327 | "afea1067d86a1c44/f13f8d0b2be7d308", 328 | "5f63cba1308ebad/16e720804d7385cb?9b2c7d0cf9c19280", 329 | "a4e358da594acc69", 330 | "043f71e11bce6115", 331 | "2f38166a9f476d14/2e1f4252a64ef39e?23cb33cf67558126", 332 | "2972e09dd52b5c34/e0da2d3e2c6f610/16e720804d7385cb?aac4d7fceeea7dcb", 333 | "ed95a9a5be30e4c8/9c2692a00033d2ca/de6b0a4bdf4056d8", 334 | "ef5e1c317855b110/d22919653063ad0f", 335 | "db7d0a15587e37", 336 | "fe5809a4bf69b53b", 337 | "c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f", 338 | "f9717a397d602927", 339 | "c415bc2d4909291c", 340 | "97c681e48c2bd244", 341 | "ed95a9a5be30e4c8/9c2692a00033d2ca/51faf05ad73be17c", 342 | "38111edd541b4aa0", 343 | "6eefbbf46b47c5e/7de2f069da3a3655", 344 | "6cb909d81a2f5b20/16e720804d7385cb?106cec9ffea2f2df", 345 | "968c8e4fbbb8b0ce", 346 | "8f6f3d03e10289c2/7de2f069da3a3655", 347 | "ed95a9a5be30e4c8/5162fc6a223f248d/4dab901f0f98436", 348 | "a16689098c57e580", 349 | "f745af148dbad70c/8b9644ee902b2351/01fee084a3cb3563/33dcc329910a2ce2", 350 | "cf17e071ca4a6d63", 351 | "ed95a9a5be30e4c8/9c2692a00033d2ca/4dab901f0f98436", 352 | "afea1067d86a1c44", 353 | "2972e09dd52b5c34/e0da2d3e2c6f610/16e720804d7385cb", 354 | "04285bbaac4dba06/01fee084a3cb3563/26db9e0e4002aab4", 355 | "9cafb5406de1df9e", 356 | "9b569b834ef0716c/16e720804d7385cb?c5a19578c7c7204c", 357 | "521fca29d4156a9d", 358 | "f8c1d22d2e8ba7c4", 359 | ] 360 | 361 | return most_visited_websites 362 | 363 | 364 | def get_query_runner(query, samples): 365 | """Query runner factory.""" 366 | if query == 'get_sibling_attrs': 367 | return SiblingsAttrsRunner(samples) 368 | elif query == 'brand_interaction_audience': 369 | return BrandInteractionRunner(samples) 370 | elif query == 'get_all_transient_ids_in_household': 371 | return HouseholdDevices(samples) 372 | elif query == "undecided_user_check": 373 | return AudienceCheck(samples) 374 | elif query == "undecided_user_audience": 375 | return AudienceGeneration(samples) 376 | elif query == "early_website_adopters": 377 | return EarlyAdopters(samples) 378 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/cli/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/__main__.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | 4 | from nepytune.cli.transform import ( 5 | register as transform_register, 6 | main as transform_main, 7 | ) 8 | from nepytune.cli.split import register as split_register, main as split_main 9 | from nepytune.cli.add import register as add_register, main as add_main 10 | from nepytune.cli.extend import register as extend_register, main as extend_main 11 | 12 | 13 | logging.basicConfig(format="%(asctime)-15s %(message)s") 14 | 15 | 16 | def main(): 17 | """Main entry point for all commands.""" 18 | parser = argparse.ArgumentParser(description="Extend/generate dataset csv files") 19 | parser.set_defaults(subparser="none") 20 | 21 | subparsers = parser.add_subparsers() 22 | 23 | transform_register(subparsers) 24 | split_register(subparsers) 25 | add_register(subparsers) 26 | 
extend_register(subparsers) 27 | 28 | args = parser.parse_args() 29 | 30 | if args.subparser == "transform": 31 | transform_main(args) 32 | 33 | if args.subparser == "split": 34 | split_main(args) 35 | 36 | if args.subparser == "add": 37 | add_main(args) 38 | 39 | if args.subparser == "extend": 40 | extend_main(args) 41 | 42 | 43 | if __name__ == "__main__": 44 | main() 45 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/add.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import configparser 3 | import json 4 | import time 5 | import random 6 | import sys 7 | import csv 8 | import logging 9 | import ipaddress 10 | from collections import namedtuple 11 | from urllib.parse import urlparse 12 | 13 | from faker import Faker 14 | from faker.providers.user_agent import Provider as UAProvider 15 | from user_agents import parse 16 | 17 | 18 | from networkx.utils.union_find import UnionFind 19 | 20 | from nepytune.write_utils import json_lines_file 21 | from nepytune.utils import hash_ 22 | 23 | 24 | COMPANY_MIN_SIZE = 6 25 | 26 | logger = logging.getLogger("add") 27 | logger.setLevel(logging.INFO) 28 | 29 | 30 | class UserAgentProvider(UAProvider): 31 | """Custom faker provider that derives user agent based on type.""" 32 | 33 | def user_agent_from_type(self, type_): 34 | """Given type, generate appropriate user agent.""" 35 | while True: 36 | user_agent = self.user_agent() 37 | if type_ == "device": 38 | if "Mobile" in user_agent: 39 | return user_agent 40 | elif type_ == "cookie": 41 | if "Mobile" not in user_agent: 42 | return user_agent 43 | else: 44 | raise ValueError(f"Unsupported {type_}") 45 | 46 | 47 | class PersistentNodes(UnionFind): 48 | """networkx.UnionFind datastructure with custom iterable over node sets.""" 49 | 50 | def node_groups(self): 51 | """Iterate over node groups yield parent hash and node members.""" 52 | for node_set in self.to_sets(): 53 | yield hash_(node_set), node_set 54 | 55 | 56 | def extract_user_groups(user_mapping_path): 57 | """Generate disjoint user groups based on union find datastructure.""" 58 | with open(user_mapping_path) as f_h: 59 | pers_reader = csv.reader(f_h, delimiter=",") 60 | uf_ds = PersistentNodes() 61 | for row in pers_reader: 62 | uf_ds.union(row[0], row[1]) 63 | return uf_ds 64 | 65 | 66 | def generate_persistent_groups(user_groups, dst): 67 | """Write facts about persistent to transient nodes mapping.""" 68 | with open(dst, "w") as f_h: 69 | for persistent_id, node_group in user_groups.node_groups(): 70 | f_h.write( 71 | json.dumps({"pid": persistent_id, "transientIds": list(node_group)}) 72 | + "\n" 73 | ) 74 | 75 | 76 | def generate_identity_groups(persistent_ids_file, distribution, dst, _seed=None): 77 | """Write facts about identity_group mapping.""" 78 | if _seed is not None: 79 | random.seed(time.time()) 80 | 81 | with open(persistent_ids_file) as f_h: 82 | pids = [data["pid"] for data in json_lines_file(f_h)] 83 | 84 | random.shuffle(pids) 85 | 86 | sizes, weights = zip(*[[k, v] for k, v in distribution.items()]) 87 | i = 0 88 | with open(dst, "w") as f_h: 89 | while i < len(pids): 90 | size, *_ = random.choices(sizes, weights=weights) 91 | size = min(size, abs(len(pids) - i)) 92 | persistent_ids = [pids[i + j] for j in range(size)] 93 | type_ = "household" if len(persistent_ids) < COMPANY_MIN_SIZE else "company" 94 | f_h.write( 95 | json.dumps( 96 | { 97 | "igid": 
hash_(persistent_ids), 98 | "type": type_, 99 | "persistentIds": persistent_ids, 100 | } 101 | ) 102 | + "\n" 103 | ) 104 | # advance even if size was 0, meaning that persistent id 105 | # does not belong to any identity_group 106 | i += size or 1 107 | 108 | 109 | def parse_distribution(size, weights): 110 | """Parse and validate distribution params.""" 111 | if len(size) != len(weights): 112 | raise ValueError( 113 | "Identity group parsing issue: weights list and identity group " 114 | "size list are of different length" 115 | ) 116 | 117 | eps = 1e-4 118 | # accept small errors, as floating point arithmetic cannot be done precisely on computers 119 | if not 1 - eps < sum(weights) < 1 + eps: 120 | raise ValueError( 121 | "Identity group parsing issue: weights must sum to 1, " 122 | f"but sum to {sum(weights)} instead" 123 | ) 124 | return dict(zip(size, weights)) 125 | 126 | 127 | def get_ip_addresses(cidr): 128 | """Get list of hosts within given network cidr.""" 129 | network = ipaddress.ip_network(cidr) 130 | hosts = list(network.hosts()) 131 | if not hosts: 132 | return [network.network_address] 133 | return hosts 134 | 135 | 136 | def build_iploc_knowledge( 137 | ip_facts_file, 138 | persistent_ids_facts_file, 139 | identity_group_facts_file, 140 | transient_ids_facts_file, 141 | dst, 142 | ): 143 | """ 144 | Given some fact files, generate random locations and IP addresses in a consistent way. 145 | 146 | It works like a funnel. At the very top you have identity groups, then persistent nodes, 147 | then transient nodes. 148 | 149 | The logic can be simplified to: 150 | * identity groups = select a few (at most 8, with very low probability) IP addresses 151 | * persistent nodes = select a few IP addresses from the group above 152 | * transient nodes = select a few IP addresses from the group above 153 | 154 | This way the context data stays consistent: each transient node's IPs are a subset of its 155 | persistent id's IPs, which are in turn a subset of the identity group's IPs. 156 | 157 | The chosen probabilities make it highly likely that transient nodes stay within the same city 158 | and state; the same goes for persistent nodes.
159 | """ 160 | IPLoc = namedtuple("IPLoc", "state, city, ip_address") 161 | 162 | with open(ip_facts_file) as f_h: 163 | ip_cidrs_by_state_city = list(json_lines_file(f_h)) 164 | 165 | knowledge = {"identity_group": {}, "persistent_id": {}, "transient_ids": {}} 166 | 167 | def random_ip_loc(): 168 | state_count, *_ = random.choices([1, 2], weights=[0.98, 0.02]) 169 | for state_data in random.choices(ip_cidrs_by_state_city, k=state_count): 170 | city_count, *_ = random.choices( 171 | [1, 2, 3, 4], weights=[0.85, 0.1, 0.04, 0.01] 172 | ) 173 | for city_data in random.choices(state_data["cities"], k=city_count): 174 | random_cidr = random.choice(city_data["cidr_blocks"]) 175 | yield IPLoc( 176 | state=state_data["state"], 177 | city=city_data["city"], 178 | ip_address=str(random.choice(get_ip_addresses(random_cidr))), 179 | ) 180 | 181 | def random_ip_loc_from_group(locations): 182 | # compute weights; each next item is half as likely as the previous one 183 | weights = [1] 184 | for _ in locations[:-1]: 185 | weights.append(weights[-1] / 2) 186 | 187 | count = len(locations) 188 | random_count, *_ = random.choices(list(range(1, count + 1)), weights=weights) 189 | return list(set(random.choices(locations, k=random_count))) 190 | 191 | logger.info("Creating Identity group / persistent ids IP facts") 192 | with open(identity_group_facts_file) as f_h: 193 | for data in json_lines_file(f_h): 194 | locations = knowledge["identity_group"][data["igid"]] = list( 195 | set(random_ip_loc()) 196 | ) 197 | 198 | for persistent_id in data["persistentIds"]: 199 | knowledge["persistent_id"][persistent_id] = random_ip_loc_from_group( 200 | locations 201 | ) 202 | 203 | logger.info("Creating persistent / transient ids IP facts") 204 | with open(persistent_ids_facts_file) as f_h: 205 | for data in json_lines_file(f_h): 206 | persistent_id = data["pid"] 207 | # handle case where persistent id does not belong to any identity group 208 | if persistent_id not in knowledge["persistent_id"]: 209 | knowledge["persistent_id"][persistent_id] = random_ip_loc_from_group( 210 | list(set(random_ip_loc())) 211 | ) 212 | for transient_id in data["transientIds"]: 213 | knowledge["transient_ids"][transient_id] = random_ip_loc_from_group( 214 | knowledge["persistent_id"][persistent_id] 215 | ) 216 | # now assign random ip location for transient ids without persistent ids 217 | logger.info("Processing remaining transient ids facts") 218 | with open(transient_ids_facts_file) as t_f_h: 219 | for data in json_lines_file(t_f_h): 220 | if data["uid"] not in knowledge["transient_ids"]: 221 | knowledge["transient_ids"][data["uid"]] = list( 222 | set( 223 | random_ip_loc_from_group( # "transient group" level 224 | random_ip_loc_from_group( # "persistent group" level 225 | list(set(random_ip_loc())) # "identity group" level 226 | ) 227 | ) 228 | ) 229 | ) 230 | 231 | with open(dst, "w") as f_h: 232 | for key, data in knowledge["transient_ids"].items(): 233 | f_h.write( 234 | json.dumps( 235 | {"transient_id": key, "loc": [item._asdict() for item in data]} 236 | ) 237 | + "\n" 238 | ) 239 | 240 | def generate_website_groups(urls_file, iab_categories, dst): 241 | """Generate website groups.""" 242 | website_groups = {} 243 | with open(urls_file) as urls_f: 244 | urls_reader = csv.reader(urls_f, delimiter=",") 245 | for row in urls_reader: 246 | url = row[1] 247 | root_url = urlparse("//" + url).hostname 248 | if root_url not in website_groups: 249 | iab_category = random.choice(iab_categories) 250 | website_groups[root_url] = { 251 |
"websites": [url], 252 | "category": { 253 | "code": iab_category[0], 254 | "name": iab_category[1] 255 | } 256 | } 257 | else: 258 | website_groups[root_url]["websites"].append(url) 259 | 260 | with open(dst, "w") as dst_file: 261 | for url, data in website_groups.items(): 262 | website_group = { 263 | "url": url, 264 | "websites": data["websites"], 265 | "category": data["category"] 266 | } 267 | website_group_id = hash_(website_group.items()) 268 | website_group["id"] = website_group_id 269 | dst_file.write( 270 | json.dumps(website_group) + "\n" 271 | ) 272 | 273 | 274 | def read_iab_categories(iab_filepath): 275 | """Read IAB categories tuples from JSON file.""" 276 | with open(iab_filepath) as iab_file: 277 | categories = json.loads(iab_file.read()) 278 | return [(code, category) for code, category in categories.items()] 279 | 280 | 281 | def build_user_identitity_knowledge( 282 | persistent_ids_facts_file, transient_ids_facts_file, dst 283 | ): 284 | """ 285 | Generate some facts about user identities. 286 | 287 | There are few informations generated here: 288 | * transient ids types: cookie | device 289 | * transient id emails (it's randomly selected from persistent id emails) 290 | * transient id user agent ( 291 | if transient id type is cookie then workstation user agent is generated, 292 | otherwise mobile one 293 | ) 294 | * derivatives of user agent 295 | * device family (if type device) 296 | * OS 297 | * browser 298 | """ 299 | user_emails = {} 300 | fake = Faker() 301 | fake.add_provider(UserAgentProvider) 302 | 303 | logger.info("Creating emails per transient ids") 304 | # create fake emails for devices with persistent ids 305 | with open(persistent_ids_facts_file) as f_h: 306 | for data in json_lines_file(f_h): 307 | nemail = random.randint(1, len(data["transientIds"])) 308 | emails = [fake.email() for _ in range(nemail)] 309 | for transient_id in data["transientIds"]: 310 | user_emails[transient_id] = random.choice(emails) 311 | 312 | # create fake emails for devices without persistent ids 313 | with open(transient_ids_facts_file) as t_f_h: 314 | for data in json_lines_file(t_f_h): 315 | if data["uid"] not in user_emails: 316 | user_emails[data["uid"]] = fake.email() 317 | 318 | logger.info("Writing down user identity facts") 319 | with open(dst, "w") as f_h: 320 | for transient_id, data in user_emails.items(): 321 | type_ = random.choice(["cookie", "device"]) 322 | uset_agent_str = fake.user_agent_from_type(type_) 323 | 324 | user_agent = parse(uset_agent_str) 325 | device = user_agent.device.family 326 | operating_system = user_agent.os.family 327 | browser = user_agent.browser.family 328 | 329 | f_h.write( 330 | json.dumps( 331 | { 332 | "transient_id": transient_id, 333 | "user_agent": uset_agent_str, 334 | "device": device, 335 | "os": operating_system, 336 | "browser": browser, 337 | "email": data, 338 | "type": type_, 339 | } 340 | ) 341 | + "\n" 342 | ) 343 | 344 | 345 | def register(parser): 346 | """Register 'add' parser.""" 347 | add_parser = parser.add_parser("add") 348 | add_parser.add_argument("--config-file", type=argparse.FileType("r"), required=True) 349 | 350 | add_subparser = add_parser.add_subparsers() 351 | 352 | persistent_id_parser = add_subparser.add_parser("persistent_id") 353 | persistent_id_parser.set_defaults(subparser="add", command="persistent_id") 354 | 355 | identity_group_parser = add_subparser.add_parser("identity_group") 356 | identity_group_parser.add_argument("--size", type=int, dest="size", action="append") 357 | 
identity_group_parser.add_argument( 358 | "--weights", type=float, dest="weights", action="append" 359 | ) 360 | identity_group_parser.set_defaults(subparser="add", command="identity_group") 361 | 362 | fact_parser = add_subparser.add_parser("fact") 363 | fact_parser.set_defaults(subparser="add", command="facts") 364 | 365 | website_groups_parser = add_subparser.add_parser("website_groups") 366 | website_groups_parser.set_defaults(subparser="add", command="website_groups") 367 | 368 | 369 | def main(args): 370 | """Generate dataset files with information about the world.""" 371 | config = configparser.ConfigParser() 372 | config.read(args.config_file.name) 373 | 374 | if args.command == "persistent_id": 375 | logger.info("Generate persistent id file to %s", config["dst"]["persistent"]) 376 | uf_ds = extract_user_groups(config["src"]["user_to_user"]) 377 | generate_persistent_groups(uf_ds, config["dst"]["persistent"]) 378 | 379 | if args.command == "identity_group": 380 | logger.info( 381 | "Generate identity group file to %s", config["dst"]["identity_group"] 382 | ) 383 | try: 384 | distribution = parse_distribution(args.size, args.weights) 385 | except ValueError as exc: 386 | print(exc) 387 | sys.exit(2) 388 | 389 | generate_identity_groups( 390 | config["dst"]["persistent"], distribution, config["dst"]["identity_group"] 391 | ) 392 | 393 | if args.command == "facts": 394 | logger.info("Generate IP facts file to %s", config["dst"]["ip_info"]) 395 | build_iploc_knowledge( 396 | ip_facts_file=config["src"]["location_to_cidr"], 397 | persistent_ids_facts_file=config["dst"]["persistent"], 398 | identity_group_facts_file=config["dst"]["identity_group"], 399 | transient_ids_facts_file=config["src"]["facts"], 400 | dst=config["dst"]["ip_info"], 401 | ) 402 | logger.info( 403 | "Generate user identity facts file to %s", 404 | config["dst"]["user_identity_info"], 405 | ) 406 | build_user_identitity_knowledge( 407 | persistent_ids_facts_file=config["dst"]["persistent"], 408 | transient_ids_facts_file=config["src"]["facts"], 409 | dst=config["dst"]["user_identity_info"], 410 | ) 411 | 412 | if args.command == "website_groups": 413 | logger.info("Generate website groups file to %s.", config["dst"]["website_groups"]) 414 | urls_file = config["src"]["urls"] 415 | dst_file = config["dst"]["website_groups"] 416 | iab_categories = read_iab_categories(config["src"]["iab_categories"]) 417 | 418 | generate_website_groups(urls_file, iab_categories, dst_file) 419 | 420 | logger.info("Done!") 421 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/extend.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import configparser 4 | import os 5 | import json 6 | import itertools 7 | import random 8 | 9 | from nepytune.write_utils import json_lines_file 10 | 11 | 12 | logger = logging.getLogger("extend") 13 | logger.setLevel(logging.INFO) 14 | 15 | 16 | def extend_facts_file(fact_file_path, ip_loc_file_path, user_identity_file_path): 17 | """Extend facts file with additional information.""" 18 | ip_loc_cor = extend_with_iploc_information(ip_loc_file_path) 19 | user_identity_cor = extend_with_user_identity_information(user_identity_file_path) 20 | 21 | next(ip_loc_cor) 22 | next(user_identity_cor) 23 | 24 | dst = f"{fact_file_path}.tmp" 25 | with open(fact_file_path) as f_h: 26 | with open(dst, "w") as f_dst: 27 | for data in json_lines_file(f_h): 28 | 
transformed_row = user_identity_cor.send(ip_loc_cor.send(data)) 29 | f_dst.write(json.dumps(transformed_row) + "\n") 30 | 31 | ip_loc_cor.close() 32 | 33 | os.rename(dst, fact_file_path) 34 | 35 | 36 | def extend_with_user_identity_information(user_identity_file_path): 37 | """Coroutine which generates user identity facts based on transient id.""" 38 | with open(user_identity_file_path) as f_h: 39 | user_id_data = {data["transient_id"]: data for data in json_lines_file(f_h)} 40 | 41 | data = yield 42 | 43 | while data is not None: 44 | transformed = {**data.copy(), **user_id_data[data["uid"]]} 45 | del transformed["transient_id"] 46 | data = yield transformed 47 | 48 | 49 | def extend_with_iploc_information(ip_loc_file_path): 50 | """Coroutine which generates ip location facts based on transient id.""" 51 | with open(ip_loc_file_path) as f_h: 52 | loc_data = {data["transient_id"]: data["loc"] for data in json_lines_file(f_h)} 53 | 54 | data = yield 55 | 56 | def get_sane_ip_locaction(uid, facts, max_ts_difference=3600): 57 | """ 58 | Given transient id and its facts add information about ip/location. 59 | 60 | Process is semi-deterministic. 61 | 1. Choose the location at random from the given list of locations 62 | 2. Repeat returning this location as long as the timestamp difference 63 | lies within the `max_ts_difference` 64 | 3. Otherwise, start from 1) 65 | """ 66 | facts = [None] + sorted(facts, key=lambda x: x["ts"]) 67 | ptr1, ptr2 = itertools.tee(facts, 2) 68 | next(ptr2, None) 69 | 70 | loc_fact = random.choice(loc_data[uid]) 71 | 72 | for previous_item, current in zip(ptr1, ptr2): 73 | if ( 74 | previous_item is None 75 | or current["ts"] - previous_item["ts"] > max_ts_difference 76 | ): 77 | loc_fact = random.choice(loc_data[uid]) 78 | yield {**current, **loc_fact} 79 | 80 | while data is not None: 81 | transformed = data.copy() 82 | transformed["facts"] = list( 83 | get_sane_ip_locaction(uid=data["uid"], facts=data["facts"]) 84 | ) 85 | data = yield transformed 86 | 87 | 88 | def register(parser): 89 | """Register 'extend' parser.""" 90 | extend_parser = parser.add_parser("extend") 91 | extend_parser.set_defaults(subparser="extend") 92 | extend_parser.add_argument( 93 | "--config-file", type=argparse.FileType("r"), required=True 94 | ) 95 | 96 | extend_subparser = extend_parser.add_subparsers() 97 | _ = extend_subparser.add_parser("facts") 98 | extend_parser.set_defaults(command="facts") 99 | 100 | 101 | def main(args): 102 | """Extend facts with information about the world.""" 103 | config = configparser.ConfigParser() 104 | config.read(args.config_file.name) 105 | 106 | if args.command == "facts": 107 | logger.info("Extend facts file to %s", config["src"]["facts"]) 108 | extend_facts_file( 109 | fact_file_path=config["src"]["facts"], 110 | ip_loc_file_path=config["dst"]["ip_info"], 111 | user_identity_file_path=config["dst"]["user_identity_info"], 112 | ) 113 | 114 | logger.info("Done!") 115 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/split.py: -------------------------------------------------------------------------------- 1 | import json 2 | import csv 3 | import argparse 4 | 5 | 6 | def batch_facts(src, size): 7 | """Split facts into batches of provided size.""" 8 | with open(src) as f_h: 9 | json_lines = [] 10 | i = 0 11 | 12 | for line in f_h: 13 | if i > size: 14 | yield json_lines 15 | i = 0 16 | json_lines = [] 17 | 18 | json_lines.append(json.loads(line)) 19 | i = i + 1 
20 | 21 | yield json_lines 22 | 23 | 24 | def write_json_facts(json_lines, dst): 25 | """Write down jsonline facts into dst.""" 26 | with open(dst, "w") as f_h: 27 | for data in json_lines: 28 | f_h.write(json.dumps(data) + "\n") 29 | 30 | 31 | def load_urls(src): 32 | """ 33 | Load given url file csv into memory. 34 | 35 | It assumes that only two columns are present. One is key, other is value. 36 | """ 37 | with open(src) as f_h: 38 | data = csv.reader(f_h, delimiter=",") 39 | return dict((int(row[0]), row[1]) for row in data) 40 | 41 | 42 | def write_urls(json_facts, urls, dst): 43 | """Write down urls batch based on batch of json facts.""" 44 | with open(dst, "w") as f_h: 45 | writer = csv.writer(f_h, delimiter=",") 46 | for data in json_facts: 47 | for fact in data["facts"]: 48 | writer.writerow([fact["fid"], urls[fact["fid"]]]) 49 | 50 | 51 | def register(parser): 52 | """Register 'split' command.""" 53 | split_parser = parser.add_parser("split") 54 | split_parser.set_defaults(subparser="split") 55 | 56 | split_parser.add_argument("--size", type=int, required=True) 57 | split_parser.add_argument( 58 | "--facts-file", type=argparse.FileType("r"), required=True 59 | ) 60 | split_parser.add_argument("--urls-file", type=argparse.FileType("r"), required=True) 61 | split_parser.add_argument("--dst-folder", type=str, required=True) 62 | 63 | 64 | def main(args): 65 | """'Split' command logic.""" 66 | location, size = args.dst_folder, args.size 67 | urls = load_urls(args.urls_file.name) 68 | i = 0 69 | file_prefix = f"{i * size}_{(i + 1) * size}" 70 | for json_lines in batch_facts(args.facts_file.name, size): 71 | i = i + 1 72 | write_json_facts(json_lines, dst=f"{location}/{file_prefix}_facts.json") 73 | write_urls(json_lines, urls, dst=f"{location}/{file_prefix}_urls.csv") 74 | file_prefix = f"{i * size}_{(i + 1)* size}" 75 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/transform.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import configparser 4 | import glob 5 | from pathlib import PurePath 6 | import concurrent.futures 7 | from string import Template 8 | 9 | from nepytune.nodes import websites, users, identity_groups, ip_loc 10 | from nepytune.edges import ( 11 | user_website, 12 | website_groups, 13 | identity_groups as identity_group_edges, 14 | persistent_ids, 15 | ip_loc as ip_loc_edges, 16 | ) 17 | 18 | 19 | logger = logging.getLogger("transform") 20 | logger.setLevel(logging.INFO) 21 | 22 | 23 | def build_destination_path(src, dst): 24 | """Given src path, extract batch information and build new destination path.""" 25 | stem = PurePath(src).stem 26 | batch_id = f"{'_'.join(stem.split('_')[:2])}_" 27 | return Template(dst).substitute(batch_id=batch_id) 28 | 29 | 30 | def register(parser): 31 | """Register 'transform' parser.""" 32 | transform_parser = parser.add_parser("transform") 33 | transform_parser.set_defaults(subparser="transform") 34 | 35 | transform_parser.add_argument( 36 | "--config-file", type=argparse.FileType("r"), required=True 37 | ) 38 | transform_parser.add_argument("--websites", action="store_true", default=False) 39 | transform_parser.add_argument("--website_groups", action="store_true", default=False) 40 | transform_parser.add_argument("--transientIds", action="store_true", default=False) 41 | transform_parser.add_argument("--persistentIds", action="store_true", default=False) 42 | 
transform_parser.add_argument( 43 | "--identityGroupIds", action="store_true", default=False 44 | ) 45 | transform_parser.add_argument("--ips", action="store_true", default=False) 46 | # workers param affect only processing transient entities; 47 | # other types of entities are processed fast enough 48 | transform_parser.add_argument("--workers", type=int, default=1) 49 | 50 | 51 | def main(args): 52 | """Transform csv files into ready-to-load neptune format.""" 53 | config = configparser.ConfigParser() 54 | config.read(args.config_file.name) 55 | 56 | files = { 57 | "facts": config["src"]["facts"], 58 | "urls": config["src"]["urls"], 59 | "titles": config["src"]["titles"], 60 | } 61 | 62 | if args.websites: 63 | logger.info("Generating website nodes to %s", config["dst"]["websites"]) 64 | websites.generate_website_nodes( 65 | files["urls"], files["titles"], config["dst"]["websites"] 66 | ) 67 | 68 | if args.website_groups: 69 | groups_json = config["src"]["website_groups"] 70 | 71 | nodes_dst = config["dst"]["website_group_nodes"] 72 | logger.info("Generating website group nodes to %s", nodes_dst) 73 | websites.generate_website_group_nodes(groups_json, nodes_dst) 74 | 75 | edges_dst = config["dst"]["website_group_edges"] 76 | logger.info("Generating website group edges to %s", edges_dst) 77 | website_groups.generate_website_group_edges(groups_json, edges_dst) 78 | 79 | if args.transientIds: 80 | if args.workers > 1: 81 | fact_files = sorted(glob.glob(config["src"]["facts_glob"])) 82 | url_files = sorted(glob.glob(config["src"]["urls_glob"])) 83 | 84 | with concurrent.futures.ProcessPoolExecutor( 85 | max_workers=args.workers 86 | ) as executor: 87 | futures = [] 88 | logger.info("Scheduling...") 89 | for fact_file, url_file in zip(fact_files, url_files): 90 | futures.append( 91 | executor.submit( 92 | users.generate_user_nodes, 93 | fact_file, 94 | build_destination_path( 95 | fact_file, config["dst"]["transient_nodes"] 96 | ), 97 | ) 98 | ) 99 | futures.append( 100 | executor.submit( 101 | user_website.generate_user_website_edges, 102 | { 103 | "titles": files["titles"], 104 | "urls": url_file, 105 | "facts": fact_file, 106 | }, 107 | build_destination_path( 108 | fact_file, config["dst"]["transient_edges"] 109 | ), 110 | ) 111 | ) 112 | logger.info("Processing of transient nodes started.") 113 | 114 | for future in concurrent.futures.as_completed(futures): 115 | logger.info( 116 | "Succesfully written transient entity file into %s", 117 | future.result(), 118 | ) 119 | else: 120 | nodes_dst = Template(config["dst"]["transient_nodes"]).substitute( 121 | batch_id="" 122 | ) 123 | logger.info("Generating transient id nodes to %s", nodes_dst) 124 | users.generate_user_nodes(config["src"]["facts"], nodes_dst) 125 | 126 | edges_dst = Template(config["dst"]["transient_edges"]).substitute( 127 | batch_id="" 128 | ) 129 | logger.info("Generating transient id edges to %s", edges_dst) 130 | user_website.generate_user_website_edges(files, edges_dst) 131 | 132 | if args.persistentIds: 133 | logger.info( 134 | "Generating persistent id nodes to %s", config["dst"]["persistent_nodes"] 135 | ) 136 | users.generate_persistent_nodes( 137 | config["src"]["persistent"], config["dst"]["persistent_nodes"] 138 | ) 139 | logger.info( 140 | "Generating persistent id edges to %s", config["dst"]["persistent_edges"] 141 | ) 142 | persistent_ids.generate_persistent_id_edges( 143 | config["src"]["persistent"], config["dst"]["persistent_edges"] 144 | ) 145 | 146 | if args.identityGroupIds: 147 | logger.info( 148 | 
"Generating identity group id nodes to %s", 149 | config["dst"]["identity_group_nodes"], 150 | ) 151 | identity_groups.generate_identity_group_nodes( 152 | config["src"]["identity_group"], config["dst"]["identity_group_nodes"] 153 | ) 154 | logger.info( 155 | "Generating identity group id edges to %s", 156 | config["dst"]["identity_group_edges"], 157 | ) 158 | identity_group_edges.generate_identity_group_edges( 159 | config["src"]["identity_group"], config["dst"]["identity_group_edges"] 160 | ) 161 | 162 | if args.ips: 163 | logger.info("Generating IP id nodes to %s", config["dst"]["ip_nodes"]) 164 | ip_loc.generate_ip_loc_nodes_from_facts( 165 | config["src"]["facts"], config["dst"]["ip_nodes"] 166 | ) 167 | logger.info("Generating IP edges to %s", config["dst"]["ip_edges"]) 168 | ip_loc_edges.generate_ip_loc_edges_from_facts( 169 | config["src"]["facts"], config["dst"]["ip_edges"] 170 | ) 171 | 172 | logger.info("Done!") 173 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/drawing.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | 3 | import plotly.graph_objects as go 4 | import networkx as nx 5 | 6 | 7 | def layout(graph, layout=nx.spring_layout, **layout_args): 8 | pos = layout(graph, **layout_args) 9 | 10 | nx.set_node_attributes(graph, { 11 | node_id: { 12 | "pos": value 13 | } 14 | for node_id, value in pos.items() 15 | }) 16 | return graph 17 | 18 | 19 | def spring_layout(graph): 20 | return layout(graph, nx.spring_layout, scale=0.5) 21 | 22 | 23 | def group_by_label(graph, type_="nodes"): 24 | if type_ == "nodes": 25 | return group_by_grouper(graph, lambda x: x[1]["label"], type_) 26 | else: 27 | return group_by_grouper(graph, lambda x: x[2]["label"], type_) 28 | 29 | 30 | def group_by_grouper(graph, grouper, type_="nodes"): 31 | if type_ == "nodes": 32 | data = graph.nodes(data=True) 33 | else: 34 | data = graph.edges(data=True) 35 | 36 | return itertools.groupby( 37 | sorted(list(data), key=grouper), 38 | key=grouper 39 | ) 40 | 41 | 42 | def edges_scatter(graph): 43 | edge_x = [] 44 | edge_y = [] 45 | 46 | for edge in graph.edges(): 47 | x0, y0 = graph.nodes[edge[0]]["pos"] 48 | x1, y1 = graph.nodes[edge[1]]["pos"] 49 | edge_x.append(x0) 50 | edge_x.append(x1) 51 | edge_x.append(None) 52 | edge_y.append(y0) 53 | edge_y.append(y1) 54 | edge_y.append(None) 55 | 56 | return go.Scatter( 57 | x=edge_x, y=edge_y, 58 | line=dict(width=0.5, color='#888'), 59 | name="edges", 60 | hoverinfo="none", 61 | mode="lines", 62 | ) 63 | 64 | 65 | def edge_scatters_by_label(graph, widths=None, colors=None, dashes=None, opacity=None): 66 | if not colors: 67 | colors = {} 68 | if not dashes: 69 | dashes = {} 70 | if not widths: 71 | widths = {} 72 | if not opacity: 73 | opacity = {} 74 | 75 | for label, edges in group_by_label(graph, type_="edges"): 76 | edge_x = [] 77 | edge_y = [] 78 | 79 | for edge in edges: 80 | x0, y0 = graph.nodes[edge[0]]["pos"] 81 | x1, y1 = graph.nodes[edge[1]]["pos"] 82 | edge_x.append(x0) 83 | edge_x.append(x1) 84 | edge_x.append(None) 85 | edge_y.append(y0) 86 | edge_y.append(y1) 87 | edge_y.append(None) 88 | 89 | yield go.Scatter( 90 | x=edge_x, y=edge_y, 91 | line=dict( 92 | width=widths.get(label, 0.5), 93 | color=colors.get(label, '#888'), 94 | dash=dashes.get(label, "solid") 95 | ), 96 | opacity=opacity.get(label, 1), 97 | name=label, 98 | hoverinfo="none", 99 | mode="lines", 100 | ) 101 | 102 | 103 | 104 | def 
edge_annotations(graph): 105 | annotations = [] 106 | for from_, to_, attr_map in graph.edges(data=True): 107 | x0, y0 = graph.nodes[from_]["pos"] 108 | x1, y1 = graph.nodes[to_]["pos"] 109 | x_mid, y_mid = (x0 + x1) / 2, (y0 + y1) / 2 110 | annotations.append(dict( 111 | xref="x", 112 | yref="y", 113 | x=x_mid, y=y_mid, 114 | text=attr_map["label"], 115 | font=dict(size=12), 116 | showarrow=False 117 | )) 118 | 119 | return annotations 120 | 121 | 122 | def scatters_by_label(graph, attrs_to_skip, sizes=None, colors=None): 123 | if not colors: 124 | colors = {} 125 | if not sizes: 126 | sizes = {} 127 | 128 | for i, (label, node_group) in enumerate(group_by_label(graph)): 129 | node_group = list(node_group) 130 | node_x = [] 131 | node_y = [] 132 | opacity = [] 133 | size_list = [] 134 | 135 | for node_id, _ in node_group: 136 | x, y = graph.nodes[node_id]["pos"] 137 | opacity.append(graph.nodes[node_id].get("opacity", 1)) 138 | size_list.append( 139 | graph.nodes[node_id].get("size", sizes.get(label, 10)) 140 | ) 141 | node_x.append(x) 142 | node_y.append(y) 143 | 144 | node_trace = go.Scatter( 145 | x=node_x, y=node_y, 146 | name=label, 147 | mode='markers', 148 | hoverinfo='text', 149 | marker=dict( 150 | showscale=False, 151 | colorscale='Hot', 152 | reversescale=True, 153 | color=colors.get(label, i * 5), 154 | opacity=opacity, 155 | size=size_list, 156 | line_width=2 157 | ) 158 | ) 159 | 160 | node_text = [] 161 | 162 | def format_v(attr, value): 163 | if isinstance(value, dict): 164 | return "".join([format_v(k, str(v)) for k, v in value.items()]) 165 | value = str(value) 166 | if len(value) < 80: 167 | return f"
{attr}: {value}" 168 | else: 169 | result = f"
{attr}: " 170 | substr = "" 171 | for word in value.split(" "): 172 | if len(word + substr) < 80: 173 | substr = f"{substr} {word}" 174 | else: 175 | result = f"{result}
{5 * ' '} {substr}" 176 | substr = "" 177 | 178 | return f"{result}
{5 * ' '} {substr}" 179 | 180 | for node_id, attr_dict in node_group: 181 | node_text.append( 182 | "".join([ 183 | format_v(attr, value) for attr, value in attr_dict.items() 184 | if attr not in attrs_to_skip 185 | ]) 186 | ) 187 | 188 | node_trace.text = node_text 189 | 190 | yield node_trace 191 | 192 | 193 | def draw(title, scatters, annotations=None): 194 | fig = go.Figure( 195 | data=scatters, 196 | layout=go.Layout( 197 | title_text=title, 198 | titlefont_size=16, 199 | showlegend=True, 200 | hovermode='closest', 201 | margin=dict(b=20, l=5, r=5, t=40), 202 | xaxis=dict(showgrid=False, zeroline=False, showticklabels=False), 203 | yaxis=dict(showgrid=False, zeroline=False, showticklabels=False) 204 | ) 205 | ) 206 | if annotations: 207 | fig.update_layout( 208 | annotations=annotations 209 | ) 210 | fig.show() 211 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/edges/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/identity_groups.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils import gremlin_writer, GremlinEdgeCSV, json_lines_file 2 | from nepytune.utils import get_id 3 | 4 | 5 | def generate_identity_group_edges(src, dst): 6 | """Generate identity_group edge csv file.""" 7 | with open(src) as f_h: 8 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 9 | for data in json_lines_file(f_h): 10 | persistent_ids = data["persistentIds"] 11 | if persistent_ids: 12 | for persistent_id in persistent_ids: 13 | identity_group_to_persistent = { 14 | "_id": get_id(data["igid"], persistent_id, {}), 15 | "_from": data["igid"], 16 | "to": persistent_id, 17 | "attribute_map": {}, 18 | "label": "member", 19 | } 20 | writer.add(**identity_group_to_persistent) 21 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/ip_loc.py: -------------------------------------------------------------------------------- 1 | from nepytune.nodes.ip_loc import IPLoc, get_id 2 | from nepytune.write_utils import gremlin_writer, GremlinEdgeCSV, json_lines_file 3 | from nepytune.utils import get_id as get_edge_id 4 | 5 | 6 | def generate_ip_loc_edges_from_facts(src, dst): 7 | """Generate ip location csv file with edges.""" 8 | with open(src) as f_h: 9 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 10 | for data in json_lines_file(f_h): 11 | uid_locations = set() 12 | for fact in data["facts"]: 13 | uid_locations.add( 14 | IPLoc(fact["state"], fact["city"], fact["ip_address"]) 15 | ) 16 | 17 | for location in uid_locations: 18 | loc_id = get_id(location) 19 | writer.add( 20 | _id=get_edge_id(data["uid"], loc_id, {}), 21 | _from=data["uid"], 22 | to=loc_id, 23 | label="uses", 24 | attribute_map={}, 25 | ) 26 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/persistent_ids.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils 
import gremlin_writer, GremlinEdgeCSV, json_lines_file 2 | from nepytune.utils import get_id 3 | 4 | 5 | def generate_persistent_id_edges(src, dst): 6 | """Generate persistentID edges based on union-find datastructure.""" 7 | with open(src) as f_h: 8 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 9 | for data in json_lines_file(f_h): 10 | for node in data["transientIds"]: 11 | persistent_to_transient = { 12 | "_id": get_id(data["pid"], node, {}), 13 | "_from": data["pid"], 14 | "to": node, 15 | "label": "has_identity", 16 | "attribute_map": {}, 17 | } 18 | writer.add(**persistent_to_transient) 19 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/user_website.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import logging 4 | 5 | from datetime import datetime 6 | 7 | from nepytune.write_utils import gremlin_writer, json_lines_file, GremlinEdgeCSV 8 | from nepytune.utils import get_id 9 | 10 | 11 | logger = logging.getLogger("user_edges") 12 | logger.setLevel(logging.INFO) 13 | 14 | 15 | def _parse_ts(timestamp): 16 | """Parse timestamp.""" 17 | for div in (1_000, 1_000_000): 18 | try: 19 | return datetime.fromtimestamp(timestamp / div).strftime("%Y-%m-%dT%H:%M:%S") 20 | except: 21 | logger.info("Could not parse timestamp: %d with %d", timestamp, div) 22 | return "" 23 | 24 | 25 | def generate_user_website_edges(src_map, dst): 26 | """Generate edges between user nodes and website nodes.""" 27 | with open(src_map["urls"]) as url_file: 28 | fact_to_website = {} 29 | for row in csv.reader(url_file, delimiter=","): 30 | fact_to_website[int(row[0])] = row[1] 31 | 32 | with open(src_map["facts"]) as facts_file: 33 | attrs = [ 34 | "ts:Date", 35 | "visited_url:String", 36 | "uid:String", 37 | "state:String", 38 | "city:String", 39 | "ip_address:String", 40 | ] 41 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=attrs) as writer: 42 | for data in json_lines_file(facts_file): 43 | for fact in data["facts"]: 44 | timestamp = _parse_ts(fact["ts"]) 45 | website_id = fact_to_website[fact["fid"]] 46 | loc_attrs = { 47 | "state": fact["state"], 48 | "city": fact["city"], 49 | "ip_address": fact["ip_address"], 50 | } 51 | attr_map = { 52 | "ts": timestamp, 53 | "visited_url": website_id, 54 | "uid": data["uid"], 55 | **loc_attrs, 56 | } 57 | user_to_website = { 58 | "_id": get_id(data["uid"], website_id, attr_map), 59 | "_from": data["uid"], 60 | "to": website_id, 61 | "label": "visited", 62 | "attribute_map": attr_map, 63 | } 64 | try: 65 | writer.add(**user_to_website) 66 | except Exception: 67 | logger.exception("Something went wrong while creating an edge") 68 | logger.info(json.dumps({"uid": data["uid"], **fact})) 69 | 70 | return dst 71 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/website_groups.py: -------------------------------------------------------------------------------- 1 | from nepytune.utils import get_id 2 | from nepytune.write_utils import gremlin_writer, GremlinEdgeCSV, json_lines_file 3 | 4 | 5 | WEBISTE_GROUP_EDGE_LABEL = "links_to" 6 | 7 | 8 | def generate_website_group_edges(website_group_json, dst): 9 | """Generate website group edges CSV.""" 10 | with open(website_group_json) as f_h: 11 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 12 | for data in json_lines_file(f_h): 13 
| root_id = data["id"] 14 | websites = data["websites"] 15 | for website in websites: 16 | writer.add( 17 | _id=get_id(root_id, website, {}), 18 | _from=root_id, 19 | to=website, 20 | label=WEBISTE_GROUP_EDGE_LABEL, 21 | attribute_map={} 22 | ) 23 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/nodes/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/identity_groups.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils import gremlin_writer, GremlinNodeCSV, json_lines_file 2 | 3 | 4 | def generate_identity_group_nodes(src, dst): 5 | """Generate identity_group csv file with nodes.""" 6 | attrs = ["igid:String", "type:String"] 7 | with open(src) as f_h: 8 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attrs) as writer: 9 | for data in json_lines_file(f_h): 10 | if data["persistentIds"]: 11 | writer.add( 12 | _id=data["igid"], 13 | attribute_map={"igid": data["igid"], "type": data["type"]}, 14 | label="identityGroup", 15 | ) 16 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/ip_loc.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | 3 | from nepytune.write_utils import gremlin_writer, GremlinNodeCSV, json_lines_file 4 | from nepytune.utils import hash_ 5 | 6 | 7 | IPLoc = namedtuple("IPLoc", "state, city, ip_address") 8 | 9 | 10 | def get_id(ip_loc): 11 | """Generate id from ip loc.""" 12 | return hash_([ip_loc.state, ip_loc.city, ip_loc.ip_address]) 13 | 14 | 15 | def generate_ip_loc_nodes_from_facts(src, dst): 16 | """Generate ip location csv file with nodes.""" 17 | attrs = ["state:String", "city:String", "ip_address:String"] 18 | with open(src) as f_h: 19 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attrs) as writer: 20 | locations = set() 21 | for data in json_lines_file(f_h): 22 | for fact in data["facts"]: 23 | locations.add( 24 | IPLoc(fact["state"], fact["city"], fact["ip_address"]) 25 | ) 26 | 27 | for location in locations: 28 | writer.add( 29 | _id=get_id(location), 30 | attribute_map={ 31 | "state": location.state, 32 | "city": location.city, 33 | "ip_address": location.ip_address, 34 | }, 35 | label="IP", 36 | ) 37 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/users.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils import gremlin_writer, json_lines_file, GremlinNodeCSV 2 | 3 | 4 | def generate_user_nodes(src, dst): 5 | """Generate user node csv file.""" 6 | attributes = [ 7 | "uid:String", 8 | "user_agent:String", 9 | "device:String", 10 | "os:String", 11 | "browser:String", 12 | "email:String", 13 | "type:String", 14 | ] 15 | with open(src) as src_data: 16 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attributes) as writer: 17 | for data in json_lines_file(src_data): 18 | writer.add( 19 | _id=data["uid"], 20 | 
attribute_map={ 21 | "uid": data["uid"], 22 | "user_agent": data["user_agent"], 23 | "device": data["device"], 24 | "os": data["os"], 25 | "browser": data["browser"], 26 | "email": data["email"], 27 | "type": data["type"], 28 | }, 29 | label="transientId", 30 | ) 31 | return dst 32 | 33 | 34 | def generate_persistent_nodes(src, dst): 35 | """Generate persistent node csv file.""" 36 | with open(src) as f_h: 37 | with gremlin_writer(GremlinNodeCSV, dst, attributes=["pid:String"]) as writer: 38 | for data in json_lines_file(f_h): 39 | writer.add( 40 | _id=data["pid"], 41 | attribute_map={"pid": data["pid"]}, 42 | label="persistentId", 43 | ) 44 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/websites.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import collections 3 | 4 | from nepytune.utils import hash_ 5 | from nepytune.write_utils import gremlin_writer, GremlinNodeCSV, json_lines_file 6 | 7 | WEBSITE_LABEL = "website" 8 | WEBSITE_GROUP_LABEL = "websiteGroup" 9 | 10 | Website = collections.namedtuple("Website", ["url", "title"]) 11 | 12 | 13 | def generate_website_nodes(urls, titles, dst): 14 | """ 15 | Generate Website nodes and save it into csv file. 16 | 17 | The CSV is compatible with AWS Neptune Gremlin data format. 18 | 19 | Website nodes are generated from dataset files: 20 | * urls.csv 21 | * titles.csv 22 | 23 | Files contain maps of fact_id and website url/title. 24 | Data is joined by fact_id. 25 | """ 26 | 27 | urls = read_urls_from_csv(urls) 28 | titles = read_titles_from_csv(titles) 29 | generate_website_csv(urls, titles, dst) 30 | 31 | 32 | def generate_website_group_nodes(website_group_json, dst): 33 | """Generate website groups csv.""" 34 | attributes = [ 35 | "url:String", 36 | "category:String", 37 | "categoryCode:String" 38 | ] 39 | with open(website_group_json) as f_h: 40 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attributes) as writer: 41 | for data in json_lines_file(f_h): 42 | writer.add( 43 | _id=data["id"], 44 | attribute_map={ 45 | "url": data["url"], 46 | "category": data["category"]["name"], 47 | "categoryCode": data["category"]["code"] 48 | }, 49 | label=WEBSITE_GROUP_LABEL 50 | ) 51 | 52 | 53 | def read_urls_from_csv(path): 54 | """Return dict with urls and fact ids corresponding to them.""" 55 | urls = collections.defaultdict(list) 56 | with open(path) as csv_file: 57 | csv_reader = csv.reader(csv_file, delimiter=",") 58 | for row in csv_reader: 59 | fid = row[0] 60 | url = row[1] 61 | urls[url].append(fid) 62 | return urls 63 | 64 | 65 | def read_titles_from_csv(path): 66 | """Read titles from csv.""" 67 | titles = {} 68 | with open(path) as csv_file: 69 | csv_reader = csv.reader(csv_file, delimiter=",") 70 | for row in csv_reader: 71 | fid = row[0] 72 | title = row[1] 73 | titles[fid] = title 74 | return titles 75 | 76 | 77 | def generate_websites(urls, titles): 78 | """Yield rows in CSV format.""" 79 | for url, fids in urls.items(): 80 | title = get_website_title(fids, titles) 81 | yield Website(url, title) 82 | 83 | 84 | def get_website_title(fids, titles): 85 | """Get website title.""" 86 | for fid in fids: 87 | title = titles.get(fid) 88 | if title: 89 | return title 90 | return None 91 | 92 | 93 | def generate_website_csv(urls, titles, dst): 94 | """Generate destination CSV file.""" 95 | attributes = ["url:String", "title:String"] 96 | with gremlin_writer(GremlinNodeCSV, dst, 
attributes=attributes) as writer: 97 | for website in generate_websites(urls, titles): 98 | attribute_map = {"url": website.url, "title": website.title} 99 | writer.add( 100 | _id=website.url, attribute_map=attribute_map, label=WEBSITE_LABEL 101 | ) 102 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/traversal.py: -------------------------------------------------------------------------------- 1 | from gremlin_python.process.anonymous_traversal import traversal 2 | from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection 3 | from gremlin_python.driver.aiohttp.transport import AiohttpTransport 4 | 5 | def get_traversal(endpoint): 6 | """Given gremlin endpoint get connected remote traversal.""" 7 | return traversal().withRemote( 8 | DriverRemoteConnection(endpoint, "g", 9 | transport_factory=lambda:AiohttpTransport(call_from_event_loop=True)) 10 | ) -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sample use case query package. 3 | 4 | Each module defines few "public" functions among which: 5 | * one is for creating the visual representation of part of the referenced subgraph 6 | * one or two are for use case queries to run on the graph 7 | """ 8 | 9 | from nepytune.usecase.user_summary import get_sibling_attrs 10 | from nepytune.usecase.undecided_users import ( 11 | undecided_users_audience, undecided_user_audience_check 12 | ) 13 | from nepytune.usecase.brand_interaction import brand_interaction_audience 14 | from nepytune.usecase.users_from_household import get_all_transient_ids_in_household 15 | from nepytune.usecase.purchase_path import get_activity_of_early_adopters 16 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/brand_interaction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: Advertisers want to generate audiences for DSP platform targeting. 3 | Specific audience could be the users who are interested in specific car brands. 4 | """ 5 | 6 | import networkx as nx 7 | from gremlin_python.process.traversal import P 8 | from gremlin_python.process.graph_traversal import select, out 9 | 10 | from nepytune import drawing 11 | 12 | 13 | def get_root_url(g, website_url): 14 | """Given website url, get its root node.""" 15 | return ( 16 | g.V(website_url) 17 | .hasLabel("website") 18 | .in_("links_to") 19 | ) 20 | 21 | 22 | def brand_interaction_audience(g, website_url): 23 | """ 24 | Given website url, get all transitive (through persistent) identities 25 | that interacted with this brand on any of its pages. 
26 | """ 27 | return ( 28 | get_root_url(g, website_url) 29 | .out("links_to") # get all websites from this root url 30 | .in_("visited") 31 | .in_("has_identity").dedup() 32 | .out("has_identity") 33 | .values("uid") 34 | ) 35 | 36 | 37 | def draw_referenced_subgraph(g, root_url): 38 | graph = _build_networkx_graph( 39 | root_url, 40 | query_results=_get_transient_ids( 41 | _get_persistent_ids_which_visited_website(g, root_url), 42 | root_url 43 | ).next() 44 | ) 45 | graph = drawing.layout(graph, nx.kamada_kawai_layout) 46 | drawing.draw( 47 | title="Brand interaction", 48 | scatters=[ 49 | drawing.edges_scatter(graph) 50 | ] + list( 51 | drawing.scatters_by_label( 52 | graph, attrs_to_skip=["pos"], 53 | sizes={"websiteGroup": 30, "transientId": 10, "persistentId": 15, "website": 10} 54 | ) 55 | ), 56 | ) 57 | 58 | 59 | # =========================== 60 | # Everything below was added to introspect the query results via visualisations 61 | 62 | 63 | def _build_networkx_graph(root_url, query_results): 64 | graph = nx.Graph() 65 | graph.add_node( 66 | root_url, label="websiteGroup", url=root_url 67 | ) 68 | 69 | for persistent_id, visited_events in query_results.items(): 70 | graph.add_node(persistent_id, label="persistentId", pid=persistent_id) 71 | 72 | for event in visited_events: 73 | graph.add_node(event["uid"], label="transientId", uid=event["uid"]) 74 | if event["visited_url"] != root_url: 75 | graph.add_node(event["visited_url"], label="website", url=event["visited_url"]) 76 | graph.add_edge(event["uid"], event["visited_url"], label="visited") 77 | graph.add_edge(persistent_id, event["uid"], label="has_identity") 78 | graph.add_edge(root_url, event["visited_url"], label="links_to") 79 | 80 | return graph 81 | 82 | 83 | def _get_persistent_ids_which_visited_website(g, root_url): 84 | return ( 85 | g.V(root_url) 86 | .aggregate("root_url") 87 | .in_("visited") 88 | .in_("has_identity").dedup().limit(50).fold() 89 | .project("root_url", "persistent_ids") 90 | .by(select("root_url").unfold().valueMap(True)) 91 | .by() 92 | ) 93 | 94 | 95 | def _get_transient_ids(query, root_url): 96 | return ( 97 | query 98 | .select("persistent_ids") 99 | .unfold() 100 | .group() 101 | .by("pid") 102 | .by( 103 | out("has_identity") 104 | .outE("visited") 105 | .has( # do not go through links_to, as it causes neptune memory errors 106 | "visited_url", P.between(root_url, root_url + "/zzz") 107 | ) 108 | .valueMap("uid", "visited_url") 109 | .dedup() 110 | .limit(15) 111 | .fold() 112 | ) 113 | ) 114 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/purchase_path.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use-Case. 3 | 4 | Marketing analyst wants to understand path to purchase of a new product by a few early adopters ( say 5) 5 | through interactive queries. This product is high involvement and expensive, and therefore they want to understand the 6 | research undertaken by the customer. 7 | 8 | * Which device was used to initiate the first research. Was that prompted by an ad, email promotion? 
9 | * How many devices were used overall and what was the time taken from initial research to final purchase 10 | * On which devices did the customer spend more time 11 | """ 12 | import itertools 13 | from collections import namedtuple, defaultdict 14 | from datetime import timedelta 15 | 16 | import networkx as nx 17 | import plotly.graph_objects as go 18 | from gremlin_python.process.traversal import P 19 | from gremlin_python.process.graph_traversal import ( 20 | outV, values, project, constant, select, inV, where, identity 21 | ) 22 | 23 | from nepytune import drawing 24 | from nepytune.visualizations import bar_plots 25 | 26 | 27 | Event = namedtuple('Event', 'ts persistentId transientId device_type url') 28 | Session = namedtuple('Session', 'transientId persistentId device_type events') 29 | 30 | 31 | def get_activity_of_early_adopters(g, thank_you_page_url, skip_single_transients=False, limit=5): 32 | """ 33 | Given thank you page url, find first early adopters of the product. 34 | 35 | In other words: 36 | * find first few persistent identities (or transient if they're not matched with any user) 37 | that visited given thank you page 38 | * extract their *whole* activity on the domain of the thank_you_page 39 | """ 40 | return ( 41 | g.V(thank_you_page_url) 42 | .hasLabel("website").as_("thank_you") 43 | .in_("links_to").as_("website_group") 44 | .select("thank_you") 45 | .inE("visited") 46 | .order().by("ts") 47 | .choose( 48 | constant(skip_single_transients).is_(P.eq(True)), 49 | where(outV().in_("has_identity")), 50 | identity() 51 | ) 52 | .choose( 53 | outV().in_("has_identity"), 54 | project( 55 | "type", "id", "purchase_ts" 56 | ) 57 | .by(constant("persistent")) 58 | .by(outV().in_("has_identity")) 59 | .by(values("ts")), 60 | project( 61 | "type", "id", "purchase_ts" 62 | ) 63 | .by(constant("transient")) 64 | .by(outV()) 65 | .by(values("ts")) 66 | ).dedup("id").limit(limit) 67 | .choose( 68 | select("type").is_("persistent"), 69 | project( 70 | "persistent_id", "transient_id", "purchase_ts" 71 | ).by(select("id").values("pid")) 72 | .by(select("id").out("has_identity").fold()) 73 | .by(select("purchase_ts")), 74 | project("persistent_id", "transient_id", "purchase_ts") 75 | .by(constant("")) 76 | .by(select("id").fold()) 77 | .by(select("purchase_ts")) 78 | ).project("persistent_id", "purchase_ts", "devices", "visits") 79 | .by(select("persistent_id")) 80 | .by(select("purchase_ts")) 81 | .by(select("transient_id").unfold().group().by(values("uid")).by(values("type"))) 82 | .by( 83 | select("transient_id").unfold().outE("visited").order().by("ts") 84 | .where( 85 | inV().in_("links_to").where(P.eq("website_group")) 86 | ) 87 | .project( 88 | "transientId", "url", "ts" 89 | ).by("uid").by("visited_url").by("ts").fold()) 90 | ) 91 | 92 | 93 | def transform_activities(result_set): 94 | """Build the flat list of user activities.""" 95 | for per_persistent_events in result_set: 96 | for visit in per_persistent_events["visits"]: 97 | if visit["ts"] <= per_persistent_events["purchase_ts"]: 98 | yield Event(**{ 99 | "persistentId": per_persistent_events["persistent_id"] or None, 100 | "device_type": per_persistent_events["devices"][visit["transientId"]], 101 | **visit 102 | }) 103 | 104 | 105 | def first_device_in_session(user_events): 106 | """Get device id which initialize session.""" 107 | return user_events[0].transientId 108 | 109 | 110 | def time_to_purchase(user_events): 111 | """Get device id which initialize session.""" 112 | return user_events[-1].ts - 
user_events[0].ts 113 | 114 | 115 | def consecutive_pairs(iterable): 116 | f_ptr, s_ptr = itertools.tee(iterable, 2) 117 | next(s_ptr) 118 | return zip(f_ptr, s_ptr) 119 | 120 | 121 | def generate_session_from_event(events, max_ts_delta=300): 122 | """Generate sessions from events.""" 123 | events_by_timestamp = sorted(events, key=lambda event: (event.transientId, event.ts)) 124 | guard_event = Event( 125 | ts=None, persistentId=None, transientId=None, device_type=None, url=None 126 | ) 127 | sessions = [] 128 | 129 | session = Session( 130 | transientId=events_by_timestamp[0].transientId, 131 | persistentId=events_by_timestamp[0].persistentId, 132 | device_type=events_by_timestamp[0].device_type, 133 | events=[] 134 | ) 135 | events_count = 0 136 | 137 | for event, next_event in consecutive_pairs(events_by_timestamp + [guard_event]): 138 | session.events.append(event) 139 | if event.transientId != next_event.transientId or (next_event.ts - event.ts).seconds > max_ts_delta: 140 | sessions.append(session) 141 | events_count += len(session.events) 142 | session = Session( 143 | transientId=next_event.transientId, 144 | persistentId=next_event.persistentId, 145 | device_type=next_event.device_type, 146 | events=[] 147 | ) 148 | 149 | assert len(events_by_timestamp) == events_count 150 | return sessions 151 | 152 | 153 | def get_session_duration(user_session): 154 | """Get session duration.""" 155 | return user_session.events[-1].ts - user_session.events[0].ts 156 | 157 | 158 | def get_time_by_device(user_sessions): 159 | """Get time spent on device.""" 160 | time_by_device = defaultdict(timedelta) 161 | 162 | for session in user_sessions: 163 | time_by_device[session.transientId] += get_session_duration(session) 164 | 165 | return time_by_device 166 | 167 | 168 | def generate_stats(all_activities, **kwargs): 169 | """Generate statistics per user (persistentId) activities.""" 170 | result = dict() 171 | 172 | user_sessions = generate_session_from_event(all_activities, **kwargs) 173 | 174 | def grouper(session): 175 | return session.persistentId or session.transientId 176 | 177 | for persistent_id, session_list in (itertools.groupby(sorted(user_sessions, key=grouper), key=grouper)): 178 | session_list = list(session_list) 179 | session_durations = get_time_by_device(session_list) 180 | user_events_by_timestamp = sorted( 181 | itertools.chain.from_iterable([session.events for session in session_list]), 182 | key=lambda event: event.ts 183 | ) 184 | 185 | if persistent_id not in result: 186 | result[persistent_id] = { 187 | "transient_ids": {}, 188 | "devices_count": 0, 189 | "first_device": first_device_in_session(user_events_by_timestamp), 190 | "time_to_purchase": time_to_purchase(user_events_by_timestamp), 191 | } 192 | 193 | for transient_id, duration in session_durations.items(): 194 | user_sessions = sorted( 195 | [session for session in session_list if session.transientId == transient_id], 196 | key=lambda session: session.events[0].ts 197 | ) 198 | result[persistent_id]["transient_ids"][transient_id] = { 199 | "sessions_duration": duration, 200 | "sessions_count": len(user_sessions), 201 | "purchase_session": user_sessions[-1], 202 | "sessions": user_sessions 203 | } 204 | result[persistent_id]["devices_count"] += 1 205 | return result 206 | 207 | 208 | def draw_referenced_subgraph(persistent_id, graph): 209 | drawing.draw( 210 | title=f"{persistent_id} path to purchase", 211 | scatters=list( 212 | drawing.edge_scatters_by_label( 213 | graph, 214 | opacity={"visited": 0.35, 
"purchase_path": 0.4}, 215 | widths={"links_to": 0.2, "visited": 3, "purchase_path": 3}, 216 | colors={"links_to": "grey", "purchase_path": "red"}, 217 | dashes={"links_to": "dot"} 218 | ) 219 | ) + list( 220 | drawing.scatters_by_label( 221 | graph, attrs_to_skip=["pos", "size"], 222 | sizes={ 223 | "event": 9, 224 | "persistentId": 20, 225 | "thank-you-page": 25, 226 | "website": 25, 227 | "session": 15, 228 | }, 229 | colors={ 230 | "event": 'rgb(153,112,171)', 231 | "session": 'rgb(116,173,209)', 232 | "thank-you-page": 'orange', 233 | "website": 'rgb(90,174,97)', 234 | "transientId": 'rgb(158,1,66)', 235 | "persistentId": 'rgb(213,62,79)' 236 | } 237 | ) 238 | ), 239 | ) 240 | 241 | 242 | def compute_subgraph_pos(query_results, thank_you_page): 243 | """Given query results compute subgraph positions.""" 244 | for persistent_id, raw_graph in _build_networkx_graph_single( 245 | query_results=query_results, 246 | thank_you_page=thank_you_page, 247 | max_ts_delta=300 248 | ): 249 | raw_graph.nodes[thank_you_page]["label"] = "thank-you-page" 250 | 251 | graph_with_pos_computed = drawing.layout(raw_graph, _custom_layout) 252 | 253 | yield persistent_id, graph_with_pos_computed 254 | 255 | 256 | def custom_plots(data): 257 | """Build list of custom plot figures.""" 258 | return [ 259 | bar_plots.make_bars( 260 | { 261 | k[:5]: v["time_to_purchase"].total_seconds() / (3600 * 24) 262 | for k, v in data.items() 263 | }, 264 | title="User's time to purchase", 265 | x_title="Persistent IDs", 266 | y_title="Days to purchase", 267 | lazy=True 268 | ), 269 | _show_session_stats(data, title="Per device session statistics"), 270 | _show_most_common_visited_webpages(data, title="Most common visited subpages before purchase", count=10), 271 | ] 272 | 273 | 274 | # =========================== 275 | # Everything below was added to introspect the query results via visualisations 276 | 277 | 278 | def _show_session_stats(data, title): 279 | def sunburst_data(data): 280 | total_sum = sum( 281 | values["sessions_count"] 282 | for _, v in data.items() 283 | for values in v["transient_ids"].values() 284 | ) 285 | yield "", "Users", 1.5 * total_sum, "white", "" 286 | 287 | for i, (persistentId, v) in enumerate(data.items(), 1): 288 | yield ( 289 | "Users", 290 | persistentId[:5], 291 | sum(values["sessions_count"] for values in v["transient_ids"].values()), 292 | i, 293 | ( 294 | f"
persistentId: {persistentId}
" 295 | f"devices count: {len(v['transient_ids'])}" 296 | ) 297 | ) 298 | for transientId, values in v["transient_ids"].items(): 299 | yield ( 300 | persistentId[:5], 301 | transientId[:5], 302 | values["sessions_count"], 303 | i, 304 | ( 305 | f"
transientId: {transientId}" 306 | f"
session count: {values['sessions_count']}" 307 | f"
total session duration: {values['sessions_duration']}" 308 | ) 309 | ) 310 | for session in values["sessions"]: 311 | yield ( 312 | transientId[:5], 313 | session.events[0].ts, 314 | 1, 315 | i, 316 | ( 317 | f"
session start: {session.events[0].ts}" 318 | f"
session end: {session.events[-1].ts}" 319 | f"
session duration: {session.events[-1].ts - session.events[0].ts}" 320 | ) 321 | ) 322 | # aka legend 323 | yield "Users", "User ids", total_sum / 2, "white", "" 324 | yield "User ids", "User devices", total_sum / 2, "white", "" 325 | yield "User devices", "User sessions", total_sum / 2, "white", "" 326 | 327 | parents, labels, values, colors, hovers = zip(*[r for r in list(sunburst_data(data))]) 328 | 329 | fig = go.Figure( 330 | go.Sunburst( 331 | labels=labels, 332 | parents=parents, 333 | values=values, 334 | branchvalues="total", 335 | marker=dict( 336 | colors=colors, 337 | line=dict(width=0.5, color='DarkSlateGrey') 338 | ), 339 | hovertext=hovers, 340 | hoverinfo="text", 341 | ), 342 | ) 343 | 344 | fig.update_layout(margin=dict(t=50, l=0, r=0, b=0), title=title) 345 | return fig 346 | 347 | 348 | def _show_most_common_visited_webpages(data, title, count): 349 | def drop_qs(url): 350 | pos = url.find("?") 351 | if pos == -1: 352 | return url 353 | return url[0:pos] 354 | 355 | def compute_data(data): 356 | res = defaultdict(list) 357 | for v in data.values(): 358 | for values in v["transient_ids"].values(): 359 | for session in values["sessions"]: 360 | for event in session.events: 361 | res[drop_qs(event.url)].append(session.persistentId) 362 | return res 363 | 364 | def sunburst_data(data): 365 | total_sum = sum(len(v) for v in data.values()) 366 | yield "", "websites", total_sum, "" 367 | for i, (website, persistents) in enumerate(data.items()): 368 | yield ( 369 | "websites", f"Website {i}", 370 | len(persistents), 371 | f"
website: {website}" 372 | f"
users: {len(set(persistents))}" 373 | f"
events: {len(persistents)}" 374 | ) 375 | for persistent, group in itertools.groupby( 376 | sorted(list(persistents)), 377 | ): 378 | group = list(group) 379 | yield ( 380 | f"Website {i}", persistent[:5], 381 | len(group), 382 | f"
persistentId: {persistent}" 383 | f"
events: {len(group)}" 384 | ) 385 | 386 | events_data = compute_data(data) 387 | most_common = dict(sorted(events_data.items(), key=lambda x: -len(x[1]))[:count]) 388 | most_common_counts = {k: len(v) for k, v in most_common.items()} 389 | 390 | pie_chart = go.Pie( 391 | labels=list(most_common_counts.keys()), 392 | values=list(most_common_counts.values()), 393 | marker=dict(line=dict(color='DarkSlateGrey', width=0.5)), 394 | domain=dict(column=0) 395 | ) 396 | 397 | parents, labels, values, hovers = zip(*[r for r in list(sunburst_data(most_common))]) 398 | 399 | sunburst = go.Sunburst( 400 | labels=labels, 401 | parents=parents, 402 | values=values, 403 | branchvalues="total", 404 | marker=dict( 405 | line=dict(width=0.5, color='DarkSlateGrey') 406 | ), 407 | hovertext=hovers, 408 | hoverinfo="text", 409 | domain=dict(column=1) 410 | ) 411 | 412 | layout = go.Layout( 413 | grid=go.layout.Grid(columns=2, rows=1), 414 | margin=go.layout.Margin(t=50, l=0, r=0, b=0), 415 | title=title, 416 | legend_orientation="h" 417 | ) 418 | 419 | return go.Figure([pie_chart, sunburst], layout) 420 | 421 | 422 | def _build_networkx_graph_single(query_results, thank_you_page, **kwargs): 423 | def drop_qs(url): 424 | pos = url.find("?") 425 | if pos == -1: 426 | return url 427 | return url[0:pos] 428 | 429 | def transient_attrs(transient_id, transient_dict): 430 | return { 431 | "uid": transient_id, 432 | "sessions_count": len(transient_dict["sessions"]), 433 | "time_on_device": transient_dict["sessions_duration"] 434 | } 435 | 436 | def session_attrs(session): 437 | return hash((session.transientId, session.events[0])), { 438 | "duration": get_session_duration(session), 439 | "events": len(session.events) 440 | } 441 | 442 | def event_to_website(graph, event, event_label): 443 | website = drop_qs(event.url) 444 | graph.add_node(website, label="website", url=website) 445 | graph.add_node(hash(event), label=event_label, **event._asdict()) 446 | graph.add_edge(website, hash(event), label="links_to") 447 | 448 | for persistent_id, result_dict in generate_stats(query_results, **kwargs).items(): 449 | graph = nx.MultiGraph() 450 | graph.add_node(persistent_id, label="persistentId", pid=persistent_id) 451 | 452 | for transient_id, transient_dict in result_dict["transient_ids"].items(): 453 | graph.add_node(transient_id, label="transientId", **transient_attrs(transient_id, transient_dict)) 454 | graph.add_edge(persistent_id, transient_id, label="has_identity") 455 | 456 | for session in transient_dict["sessions"]: 457 | event_label = "event" 458 | if session == transient_dict["purchase_session"]: 459 | event_edge_label = "purchase_path" 460 | else: 461 | event_edge_label = "visited" 462 | 463 | session_id, session_node_attrs = session_attrs(session) 464 | # transient -> session 465 | graph.add_node(session_id, label="session", **session_node_attrs) 466 | graph.add_edge(session_id, transient_id, label="session") 467 | 468 | fst_event = session.events[0] 469 | # event -> website without query strings 470 | event_to_website(graph, fst_event, event_label) 471 | 472 | # session -> first session event 473 | graph.add_edge(session_id, hash(fst_event), label="session_start") 474 | 475 | for fst_event, snd_event in consecutive_pairs(session.events): 476 | event_to_website(graph, fst_event, event_label) 477 | event_to_website(graph, snd_event, event_label) 478 | graph.add_edge(hash(fst_event), hash(snd_event), label=event_edge_label) 479 | graph.nodes[result_dict["first_device"]]["size"] = 15 480 | 481 | yield persistent_id, 
graph 482 | 483 | 484 | def _custom_layout(graph): 485 | """Custom layout function.""" 486 | def _transform_graph(graph): 487 | """ 488 | Transform one graph into another for the purposes of better visualisation. 489 | 490 | We rebuild the graph in a tricky way to force the position computation algorithm 491 | to allign with the desired shape. 492 | """ 493 | new_graph = nx.MultiGraph() 494 | 495 | for edge in graph.edges(data=True): 496 | fst, snd, params = edge 497 | label = params["label"] 498 | 499 | new_graph.add_node(fst, **graph.nodes[fst]) 500 | new_graph.add_node(snd, **graph.nodes[snd]) 501 | if label == "links_to": 502 | # website -> event 503 | # => event -> user_website -> website 504 | user_website = f"{fst}_{snd}" 505 | new_graph.add_node(user_website, label="user_website") 506 | new_graph.add_edge(snd, user_website, label="session_visit") 507 | new_graph.add_edge(user_website, fst, label="session_link") 508 | else: 509 | new_graph.add_edge(fst, snd, **params) 510 | 511 | return new_graph 512 | 513 | return nx.kamada_kawai_layout(_transform_graph(graph)) 514 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/similar_audience.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: 3 | 4 | Identify look-alike customers for a product. 5 | The goal here is to identify prospects, who show similar behavioral patterns as your existing customers. 6 | While we can easily do this algorithmically and automate this, the goal here is to provide visual query 7 | to improve human understanding to the marketing analysts. 8 | What are the device ids from my customer graph, who are not yet buying my product (say Golf Club), 9 | but are show similar behavior patterns such lifestyle choices of buying golf or other sporting goods. 10 | """ 11 | 12 | from itertools import chain 13 | 14 | import networkx as nx 15 | 16 | from gremlin_python.process.graph_traversal import select, out, choose, constant, or_, group 17 | from gremlin_python.process.traversal import Column, Order, P 18 | 19 | import plotly.graph_objects as go 20 | 21 | from nepytune import drawing 22 | 23 | 24 | def recommend_similar_audience(g, website_url, categories_limit=3, search_time_limit_in_seconds=15): 25 | """Given website url, categories_limit, categories_coin recommend similar audience in n most popular categories. 
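    A minimal invocation sketch (the endpoint and thank-you-page URL are illustrative
    placeholders, not values shipped with this repository); the sense of "similar
    audience" used here is spelled out below:

        from nepytune.traversal import get_traversal
        from nepytune.usecase.similar_audience import recommend_similar_audience

        g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint
        lookalike_uids = recommend_similar_audience(
            g, "http://example-store.com/thank-you", categories_limit=3
        ).toList()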
26 | 27 | Similar audience - audience of users that at least once visited subpage of domain that contains IAB-category codes 28 | that are most popular across users of given website 29 | """ 30 | average_guy = ( 31 | g.V(website_url) 32 | .in_("visited") 33 | .in_("has_identity").dedup() 34 | .hasLabel("persistentId") 35 | .group().by() 36 | .by( 37 | out("has_identity").out("visited").in_("links_to") 38 | .groupCount().by("categoryCode") 39 | ) 40 | .select(Column.values).unfold().unfold() 41 | .group().by(Column.keys) 42 | .by(select(Column.values).mean()).unfold() 43 | .order().by(Column.values, Order.desc) 44 | .limit(categories_limit) 45 | ) 46 | 47 | most_popular_categories = dict(chain(*category.items()) for category in average_guy.toList()) 48 | 49 | guy_stats_subquery = ( 50 | out("has_identity") 51 | .out("visited").in_("links_to") 52 | .groupCount().by("categoryCode") 53 | .project(*most_popular_categories.keys()) 54 | ) 55 | 56 | conditions_subqueries = [] 57 | for i in most_popular_categories: 58 | guy_stats_subquery = guy_stats_subquery.by(choose(select(i), select(i), constant(0))) 59 | conditions_subqueries.append( 60 | select(Column.values).unfold() 61 | .select(i) 62 | .is_(P.gt(int(most_popular_categories[i]))) 63 | ) 64 | 65 | return ( 66 | g.V() 67 | .hasLabel("websiteGroup") 68 | .has("categoryCode", P.within(list(most_popular_categories.keys()))) 69 | .out("links_to").in_("visited").dedup().in_("has_identity").dedup() 70 | .hasLabel("persistentId") 71 | .where( 72 | out("has_identity").out("visited") 73 | .has("url", P.neq(website_url)) 74 | ) 75 | .timeLimit(search_time_limit_in_seconds * 1000) 76 | .local( 77 | group().by().by(guy_stats_subquery) 78 | .where(or_(*conditions_subqueries)) 79 | ) 80 | .select(Column.keys).unfold() 81 | .out("has_identity") 82 | .values("uid") 83 | ) 84 | 85 | 86 | def draw_average_buyer_profile_pie_chart(g, website_url, categories_limit=3,): 87 | average_profile = _get_categories_popular_across_audience_of_website( 88 | g, website_url, categories_limit=categories_limit 89 | ).toList() 90 | average_profile = dict(chain(*category.items()) for category in average_profile) 91 | 92 | labels = list(average_profile.keys()) 93 | values = list(int(i) for i in average_profile.values()) 94 | 95 | fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0)]) 96 | fig.update_traces(textinfo='value+label+percent') 97 | fig.update_layout( 98 | title_text=f"3 Most popular IAB categories of " 99 | f"\"Average Buyer Profile\"" 100 | f"
for thank you page {website_url}") 101 | fig.show() 102 | 103 | 104 | def draw_referenced_subgraph(g, website_url, categories_limit=3, search_time_limit_in_seconds=15): 105 | average_profile = _get_categories_popular_across_audience_of_website( 106 | g, website_url, categories_limit=categories_limit 107 | ).toList() 108 | average_profile = dict( 109 | chain(*category.items()) for category in average_profile 110 | ) 111 | similar_audience = _query_users_activities_stats( 112 | g, website_url, average_profile, search_time_limit_in_seconds=search_time_limit_in_seconds 113 | ) 114 | similar_audience = similar_audience.limit(15).toList() 115 | 116 | graph = _build_graph(average_profile, similar_audience) 117 | 118 | iabs = [n for n, params in graph.nodes(data=True) if params["label"] == "IAB"] 119 | avg_iabs = [n for n in iabs if graph.nodes[n]["category"] in average_profile] 120 | 121 | graph_with_pos_computed = drawing.layout( 122 | graph, 123 | nx.shell_layout, 124 | nlist=[ 125 | ["averageBuyer"], 126 | avg_iabs, 127 | set(iabs) - set(avg_iabs), 128 | [n for n, params in graph.nodes(data=True) if params["label"] == "persistentId"], 129 | [n for n, params in graph.nodes(data=True) if params["label"] == "transientId"], 130 | ] 131 | ) 132 | 133 | # update positions 134 | for name in set(iabs) - set(avg_iabs): 135 | node = graph_with_pos_computed.nodes[name] 136 | node["pos"] = [node["pos"][0], node["pos"][1]-1.75] 137 | 138 | for name in ["averageBuyer"] + avg_iabs: 139 | node = graph_with_pos_computed.nodes[name] 140 | node["pos"] = [node["pos"][0], node["pos"][1]+1.75] 141 | 142 | node = graph_with_pos_computed.nodes["averageBuyer"] 143 | node["pos"] = [node["pos"][0], node["pos"][1]+1] 144 | 145 | drawing.draw( 146 | title="User devices that visited ecommerce websites and optionally converted", 147 | scatters=list( 148 | drawing.edge_scatters_by_label( 149 | graph_with_pos_computed, 150 | dashes={ 151 | "interestedInButNotSufficient": "dash", 152 | "interestedIn": "solid" 153 | } 154 | )) + list( 155 | drawing.scatters_by_label( 156 | graph_with_pos_computed, attrs_to_skip=["pos", "opacity"], 157 | sizes={ 158 | "averageBuyer": 30, 159 | "IAB":10, 160 | "persistentId":20 161 | } 162 | ) 163 | ) 164 | ) 165 | 166 | 167 | # =========================== 168 | # Everything below was added to introspect the query results via visualisations 169 | 170 | def _get_categories_popular_across_audience_of_website(g, website_url, categories_limit=3): 171 | return ( 172 | g.V(website_url) 173 | .in_("visited") 174 | .in_("has_identity").dedup() 175 | .hasLabel("persistentId") 176 | .group().by() 177 | .by( 178 | out("has_identity").out("visited").in_("links_to") 179 | .groupCount().by("categoryCode") 180 | ) 181 | .select(Column.values).unfold().unfold() 182 | .group().by(Column.keys) 183 | .by(select(Column.values).mean()).unfold() 184 | .order().by(Column.values, Order.desc) 185 | .limit(categories_limit) 186 | ) 187 | 188 | 189 | def _query_users_activities_stats(g, website_url, most_popular_categories, 190 | search_time_limit_in_seconds=30): 191 | return ( 192 | g.V() 193 | .hasLabel("websiteGroup") 194 | .has("categoryCode", P.within(list(most_popular_categories.keys()))) 195 | .out("links_to").in_("visited").dedup().in_("has_identity").dedup() 196 | .hasLabel("persistentId") 197 | .where( 198 | out("has_identity").out("visited") 199 | .has("url", P.neq(website_url)) 200 | ) 201 | .timeLimit(search_time_limit_in_seconds * 1000) 202 | .local( 203 | group().by().by( 204 | out("has_identity") 205 | 
.out("visited").in_("links_to") 206 | .groupCount().by("categoryCode") 207 | ) 208 | .project("pid", "iabs", "tids") 209 | .by(select(Column.keys).unfold()) 210 | .by(select(Column.values).unfold()) 211 | .by(select(Column.keys).unfold().out("has_identity").values("uid").fold()) 212 | ) 213 | ) 214 | 215 | 216 | def _build_graph(average_buyer_categories, similar_audience): 217 | avg_buyer = "averageBuyer" 218 | 219 | graph = nx.Graph() 220 | graph.add_node(avg_buyer, label=avg_buyer, **average_buyer_categories) 221 | 222 | for avg_iab in average_buyer_categories.keys(): 223 | graph.add_node(avg_iab, label="IAB", category=avg_iab) 224 | graph.add_edge(avg_buyer, avg_iab, label="interestedIn") 225 | 226 | for user in similar_audience: 227 | pid, cats, tids = user["pid"], user["iabs"], user["tids"] 228 | 229 | user_categories = dict(sorted(cats.items(), key=lambda x: x[1])[:3]) 230 | comparison = {k: cats.get(k, 0) for k in average_buyer_categories.keys()} 231 | user_categories.update(comparison) 232 | 233 | user_comparisons = False 234 | for ucategory, value in user_categories.items(): 235 | graph.add_node(ucategory, label="IAB", category=ucategory) 236 | label = "interestedIn" 237 | if value: 238 | if ucategory in average_buyer_categories: 239 | if user_categories[ucategory] >= average_buyer_categories[ucategory]: 240 | user_comparisons = True 241 | else: 242 | label = "interestedInButNotSufficient" 243 | graph.add_edge(pid, ucategory, label=label) 244 | 245 | opacity = 1 if user_comparisons else 0.5 246 | for tid in tids: 247 | graph.add_edge(pid, tid, label="hasIdentity") 248 | graph.add_node(tid, label="transientId", uid=tid, opacity=opacity) 249 | 250 | graph.add_node( 251 | pid, label="persistentId", pid=pid, 252 | opacity=opacity, **cats 253 | ) 254 | 255 | return graph 256 | 257 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/undecided_users.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: Ecommerce publishers want to convince undecided users to purchase the product by offering them discount codes 3 | as soon as they have met certain criteria. Find all users who have visited product page at least X times in the last 4 | 30 days, but did not buy anything (have not visited thank you page). 5 | """ 6 | from collections import Counter 7 | 8 | from gremlin_python.process.traversal import P, Column 9 | from gremlin_python.process.graph_traversal import ( 10 | has, groupCount, 11 | constant, and_, coalesce, select, count, out, where, values 12 | ) 13 | 14 | import networkx as nx 15 | 16 | from nepytune import drawing 17 | 18 | 19 | def undecided_user_audience_check(g, transient_id, website_url, thank_you_page_url, since, min_visited_count): 20 | """ 21 | Given transient id, check whether it belongs to an audience. 22 | 23 | It's simple yes, no question. 
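    A minimal invocation sketch (every argument value below is an illustrative
    assumption, not data shipped with this repository); the membership criteria
    are listed right after it:

        from datetime import datetime

        from nepytune.traversal import get_traversal
        from nepytune.usecase.undecided_users import undecided_user_audience_check

        g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint
        belongs = undecided_user_audience_check(
            g,
            transient_id="example-transient-uid",
            website_url="http://example-store.com/product",
            thank_you_page_url="http://example-store.com/thank-you",
            since=datetime(2016, 6, 1),
            min_visited_count=5,
        ).next()  # True if this device's user qualifies for the discount audience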
24 | 25 | User belongs to an audience whenever all of the following criteria are met: 26 | * visited some website url at least X times since specific timestamp 27 | * did not visit thank you page url since specific timestamp 28 | """ 29 | return ( 30 | g.V(transient_id) 31 | .hasLabel("transientId") 32 | .in_("has_identity") 33 | .out("has_identity") 34 | .outE("visited") 35 | .has("ts", P.gt(since)) 36 | .choose( 37 | has("visited_url", website_url), 38 | groupCount("visits").by(constant("page_visits")) 39 | ) 40 | .choose( 41 | has("visited_url", thank_you_page_url), 42 | groupCount("visits").by(constant("thank_you_page_vists")) 43 | ) 44 | .cap("visits") 45 | .coalesce( 46 | and_( 47 | coalesce(select("thank_you_page_vists"), constant(0)).is_(0), 48 | select("page_visits").is_(P.gt(min_visited_count)) 49 | ).choose( 50 | count().is_(1), 51 | constant(True) 52 | ), 53 | constant(False) 54 | ) 55 | 56 | ) 57 | 58 | 59 | def undecided_users_audience(g, website_url, thank_you_page_url, since, min_visited_count): 60 | """ 61 | Given website url, get all the users that meet audience conditions. 62 | 63 | It returns list of transient identities uids. 64 | 65 | Audience is build from the users that met following criteria: 66 | * visited some website url at least X times since specific timestamp 67 | * did not visit thank you page url since specific timestamp 68 | """ 69 | return ( 70 | g.V(website_url) 71 | .hasLabel("website") 72 | .inE("visited").has("ts", P.gt(since)).outV() 73 | .in_("has_identity") 74 | .groupCount() 75 | .unfold().dedup() 76 | .where( 77 | select(Column.values).is_(P.gt(min_visited_count)) 78 | ) 79 | .select(Column.keys).as_("pids") 80 | .map( 81 | out("has_identity") 82 | .outE("visited") 83 | .has("visited_url", thank_you_page_url) 84 | .has("ts", P.gt(since)).outV() 85 | .in_("has_identity").dedup() 86 | .values("pid").fold() 87 | ).as_("pids_that_visited") 88 | .select("pids") 89 | .not_( 90 | has("pid", where(P.within("pids_that_visited"))) 91 | ) 92 | .out("has_identity") 93 | .values("uid") 94 | ) 95 | 96 | 97 | def draw_referenced_subgraph(g, website_url, thank_you_page_url, since, min_visited_count): 98 | raw_graph = _build_networkx_graph(g, website_url, thank_you_page_url, since) 99 | 100 | persistent_nodes = [node for node, attr in raw_graph.nodes(data=True) if attr["label"] == "persistentId"] 101 | graph_with_pos_computed = drawing.layout( 102 | raw_graph, 103 | nx.shell_layout, 104 | nlist=[ 105 | [website_url], 106 | [node for node, attr in raw_graph.nodes(data=True) if attr["label"] == "transientId"], 107 | [node for node, attr in raw_graph.nodes(data=True) if attr["label"] == "persistentId"], 108 | [thank_you_page_url] 109 | ] 110 | ) 111 | 112 | # update positions and change node label 113 | raw_graph.nodes[thank_you_page_url]["pos"] += (0, 0.75) 114 | for node in persistent_nodes: 115 | has_visited_thank_you_page = False 116 | visited_at_least_X_times = False 117 | for check_name, value in raw_graph.nodes[node]["visited_events"].items(): 118 | if ">=" in check_name and value > 0: 119 | if "thank" in check_name: 120 | has_visited_thank_you_page = True 121 | elif value > min_visited_count: 122 | visited_at_least_X_times = True 123 | if (has_visited_thank_you_page or not visited_at_least_X_times): 124 | for _, to in raw_graph.edges(node): 125 | raw_graph.nodes[to]["opacity"] = 0.25 126 | raw_graph.nodes[node]["opacity"] = 0.25 127 | 128 | drawing.draw( 129 | title="User devices that visited ecommerce websites and optionally converted", 130 | scatters=[ 131 
| drawing.edges_scatter(graph_with_pos_computed) 132 | ] + list( 133 | drawing.scatters_by_label( 134 | graph_with_pos_computed, attrs_to_skip=["pos", "opacity"], 135 | sizes={ 136 | "transientId": 10, "transientId-audience": 10, 137 | "persistentId": 20, "persistentId-audience": 20, 138 | "website": 30, 139 | "thankYouPage": 30, 140 | } 141 | ) 142 | ) 143 | ) 144 | 145 | 146 | # =========================== 147 | # Everything below was added to introspect the query results via visualisations 148 | 149 | 150 | def _get_subgraph(g, website_url, thank_you_page_url, since): 151 | return ( 152 | g.V() 153 | .hasLabel("website") 154 | .has("url", P.within([website_url, thank_you_page_url])) 155 | .in_("visited") 156 | .in_("has_identity") 157 | .dedup().limit(20) 158 | .project("persistent_id", "transient_ids", "visited_events") 159 | .by(values("pid")) 160 | .by(out("has_identity").values("uid").fold()) 161 | .by( 162 | out("has_identity") 163 | .outE("visited") 164 | .has("visited_url", P.within([website_url, thank_you_page_url])) 165 | .valueMap("visited_url", "ts", "uid").dedup().fold() 166 | ) 167 | ) 168 | 169 | 170 | def _build_networkx_graph(g, website_url, thank_you_page_url, since): 171 | graph = nx.Graph() 172 | graph.add_node(website_url, label="website", url=website_url) 173 | graph.add_node(thank_you_page_url, label="thankYouPage", url=thank_you_page_url) 174 | 175 | for data in _get_subgraph(g, website_url, thank_you_page_url, since).toList(): 176 | graph.add_node(data["persistent_id"], label="persistentId", pid=data["persistent_id"], 177 | visited_events=Counter()) 178 | 179 | for transient_id in data["transient_ids"]: 180 | graph.add_node(transient_id, label="transientId", uid=transient_id, visited_events=Counter()) 181 | graph.add_edge(transient_id, data["persistent_id"], label="has_identity") 182 | 183 | for event in data["visited_events"]: 184 | edge = event["visited_url"], event["uid"] 185 | try: 186 | graph.edges[edge]["ts"].append(event["ts"]) 187 | except: 188 | graph.add_edge(*edge, label="visited", ts=[event["ts"]]) 189 | 190 | 191 | for node_map in graph.nodes[data["persistent_id"]], graph.nodes[event["uid"]]: 192 | if event["visited_url"] == website_url: 193 | node_map["visited_events"][f"visited website < {since}"] += (event["ts"] < since) 194 | node_map["visited_events"][f"visited website >= {since}"] += (event["ts"] >= since) 195 | else: 196 | node_map["visited_events"][f"visited thank you page < {since}"] += (event["ts"] < since) 197 | node_map["visited_events"][f"visited thank you page >= {since}"] += (event["ts"] >= since) 198 | 199 | return graph 200 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/user_summary.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: Advertisers want to find out information about user interests to provide an accurate targeting. 3 | The data should be based on the activity of the user across all devices. 4 | """ 5 | from collections.abc import Iterable 6 | 7 | import networkx as nx 8 | from gremlin_python.process.traversal import Column, T 9 | from gremlin_python.process.graph_traversal import select, out, in_, values, valueMap, project, constant 10 | 11 | from nepytune import drawing 12 | 13 | 14 | def get_sibling_attrs(g, transient_id): 15 | """ 16 | Given transient id, get summary of information we have about it or its sibling nodes. 
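    A minimal call sketch (the endpoint and device id are illustrative placeholders,
    not values shipped with this repository); the pieces of information gathered are
    listed below:

        from nepytune.traversal import get_traversal
        from nepytune.usecase.user_summary import get_sibling_attrs

        g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint
        summary = get_sibling_attrs(g, "example-transient-uid").next()
        print(summary["iab_categories"])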
17 | 18 | We gather: 19 | * node attributes 20 | * IP / location information 21 | * IAB categories of visited websites 22 | """ 23 | return ( 24 | g.V(transient_id) 25 | .choose( 26 | in_("has_identity"), # check if this transient id has persistent id 27 | in_("has_identity"). 28 | project( 29 | "identity_group_id", "persistent_id", "attributes", "ip_location", "iab_categories" 30 | ).by(in_("member").values("igid")) 31 | .by(values("pid")) 32 | .by( 33 | out("has_identity").valueMap().unfold() 34 | .group() 35 | .by(Column.keys) 36 | .by(select(Column.values).unfold().dedup().fold()) 37 | ) 38 | .by( 39 | out("has_identity") 40 | .out("uses").dedup().valueMap().fold() 41 | ) 42 | .by( 43 | out("has_identity") 44 | .out("visited") 45 | .in_("links_to") 46 | .values("categoryCode").dedup().fold() 47 | ) 48 | , project( 49 | "identity_group_id", "persistent_id", "attributes", "ip_location", "iab_categories" 50 | ).by(constant("")) 51 | .by(constant("")) 52 | .by( 53 | valueMap().unfold() 54 | .group() 55 | .by(Column.keys) 56 | .by(select(Column.values).unfold().dedup().fold()) 57 | ) 58 | .by( 59 | out("uses").dedup().valueMap().fold() 60 | ) 61 | .by( 62 | out("visited") 63 | .in_("links_to") 64 | .values("categoryCode").dedup().fold() 65 | ) 66 | ) 67 | ) 68 | 69 | 70 | def draw_refrenced_subgraph(g, transient_id): 71 | raw_graph = _build_networkx_graph( 72 | g, g.V(transient_id).in_("has_identity").in_("member").next() 73 | ) 74 | graph_with_pos_computed = drawing.layout( 75 | raw_graph, 76 | nx.spring_layout, 77 | iterations=2500 78 | ) 79 | 80 | drawing.draw( 81 | title="Part of single household activity on the web", 82 | scatters=[ 83 | drawing.edges_scatter(graph_with_pos_computed) 84 | ] + list( 85 | drawing.scatters_by_label( 86 | graph_with_pos_computed, attrs_to_skip=["pos"], 87 | sizes={"identityGroup": 30, "transientId": 15, "persistentId": 20, "websiteGroup": 15, "website": 10} 88 | ) 89 | ), 90 | ) 91 | 92 | 93 | # =========================== 94 | # Everything below was added to introspect the query results via visualisations 95 | 96 | def _get_subgraph(g, identity_group_id): 97 | return ( 98 | g.V(identity_group_id) 99 | .project("props", "persistent_ids") 100 | .by(valueMap(True)) 101 | .by( 102 | out("member") 103 | .group() 104 | .by() 105 | .by( 106 | project("props", "transient_ids") 107 | .by(valueMap(True)) 108 | .by( 109 | out("has_identity") 110 | .group() 111 | .by() 112 | .by( 113 | project("props", "ip_location", "random_website_paths") 114 | .by(valueMap(True)) 115 | .by( 116 | out("uses").valueMap(True).fold() 117 | ) 118 | .by( 119 | out("visited").as_("start") 120 | .in_("links_to").as_("end") 121 | .limit(100) 122 | .path() 123 | .by(valueMap("url")) 124 | .by(valueMap("url", "categoryCode")) 125 | .from_("start").to("end") 126 | .dedup() 127 | .fold() 128 | ) 129 | ).select( 130 | Column.values 131 | ) 132 | ) 133 | ).select(Column.values) 134 | ) 135 | ) 136 | 137 | 138 | def _build_networkx_graph(g, identity_group_id): 139 | def get_attributes(attribute_list): 140 | attrs = {} 141 | for attr_name, value in attribute_list: 142 | attr_name = str(attr_name) 143 | 144 | if isinstance(value, Iterable) and not isinstance(value, str): 145 | for i, single_val in enumerate(value): 146 | attrs[f"{attr_name}-{i}"] = single_val 147 | else: 148 | if '.' 
in attr_name: 149 | attr_name = attr_name.split('.')[-1] 150 | attrs[attr_name] = value 151 | 152 | return attrs 153 | 154 | graph = nx.Graph() 155 | 156 | for ig_node in _get_subgraph(g, identity_group_id).toList(): 157 | ig_id = ig_node["props"][T.id] 158 | 159 | graph.add_node( 160 | ig_id, 161 | **get_attributes(ig_node["props"].items()) 162 | ) 163 | 164 | for persistent_node in ig_node["persistent_ids"]: 165 | p_id = persistent_node["props"][T.id] 166 | graph.add_node( 167 | p_id, 168 | **get_attributes(persistent_node["props"].items()) 169 | ) 170 | graph.add_edge(ig_id, p_id, label="member") 171 | 172 | for transient_node in persistent_node["transient_ids"]: 173 | transient_node_map = transient_node["props"] 174 | transient_id = transient_node_map[T.id] 175 | graph.add_node( 176 | transient_id, 177 | **get_attributes(transient_node_map.items()) 178 | ) 179 | graph.add_edge(transient_id, p_id, label="has_identity") 180 | 181 | for ip_location_node in transient_node["ip_location"]: 182 | ip_location_id = ip_location_node[T.id] 183 | graph.add_node(ip_location_id, **get_attributes(ip_location_node.items())) 184 | graph.add_edge(ip_location_id, transient_id, label="uses") 185 | 186 | for visited_link, root_url in transient_node["random_website_paths"]: 187 | graph.add_node(visited_link["url"][0], label="website", **get_attributes(visited_link.items())) 188 | graph.add_node(root_url["url"][0], label="websiteGroup", **get_attributes(root_url.items())) 189 | graph.add_edge(transient_id, visited_link["url"][0], label="visits") 190 | graph.add_edge(visited_link["url"][0], root_url["url"][0], label="links_to") 191 | return graph 192 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/users_from_household.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: user has visited a travel agency website recently. 3 | Advertisers want to display ads about travel promotions to all members of his household. 4 | """ 5 | from collections.abc import Iterable 6 | 7 | from gremlin_python.process.traversal import Column, T 8 | from gremlin_python.process.graph_traversal import project, valueMap, out 9 | import networkx as nx 10 | 11 | from nepytune import drawing 12 | 13 | 14 | def get_all_transient_ids_in_household(g, transient_id): 15 | """Given transient id, get all transient ids from its household.""" 16 | return ( 17 | g.V(transient_id) 18 | .hasLabel("transientId") 19 | .in_("has_identity") 20 | .in_("member") 21 | .has("type", "household") 22 | .out("member") 23 | .out("has_identity"). 
24 | values("uid") 25 | ) 26 | 27 | 28 | def draw_referenced_subgraph(g, transient_id): 29 | graph = drawing.spring_layout( 30 | _build_networkx_graph( 31 | g, 32 | g.V(transient_id).in_("has_identity").in_("member").next() 33 | ) 34 | ) 35 | 36 | drawing.draw( 37 | title="Single identity group graph structure", 38 | scatters=[ 39 | drawing.edges_scatter(graph) 40 | ] + list( 41 | drawing.scatters_by_label( 42 | graph, attrs_to_skip=["pos"], 43 | sizes={"identityGroup": 60, "transientId": 20, "persistentId": 40} 44 | ) 45 | ), 46 | annotations=drawing.edge_annotations(graph) 47 | ) 48 | 49 | 50 | # =========================== 51 | # Everything below was added to introspect the query results via visualisations 52 | 53 | 54 | def _get_identity_group_hierarchy(g, identity_group_id): 55 | return ( 56 | g.V(identity_group_id) 57 | .project("props", "persistent_ids") 58 | .by(valueMap(True)) 59 | .by( 60 | out("member") 61 | .group() 62 | .by() 63 | .by( 64 | project("props", "transient_ids") 65 | .by(valueMap(True)) 66 | .by( 67 | out("has_identity").valueMap(True).fold() 68 | ) 69 | ).select(Column.values) 70 | ) 71 | ) 72 | 73 | 74 | def _build_networkx_graph(g, identity_group_id): 75 | def get_attributes(attribute_list): 76 | attrs = {} 77 | for attr_name, value in attribute_list: 78 | attr_name = str(attr_name) 79 | 80 | if isinstance(value, Iterable) and not isinstance(value, str): 81 | for i, single_val in enumerate(value): 82 | attrs[f"{attr_name}-{i}"] = single_val 83 | else: 84 | if '.' in attr_name: 85 | attr_name = attr_name.split('.')[-1] 86 | attrs[attr_name] = value 87 | 88 | return attrs 89 | 90 | graph = nx.Graph() 91 | 92 | for ig_node in _get_identity_group_hierarchy(g, identity_group_id).toList(): 93 | ig_id = ig_node["props"][T.id] 94 | 95 | graph.add_node( 96 | ig_id, 97 | **get_attributes(ig_node["props"].items()) 98 | ) 99 | 100 | for persistent_node in ig_node["persistent_ids"]: 101 | p_id = persistent_node["props"][T.id] 102 | graph.add_node( 103 | p_id, 104 | **get_attributes(persistent_node["props"].items()) 105 | ) 106 | graph.add_edge(ig_id, p_id, label="member") 107 | 108 | for transient_node_map in persistent_node["transient_ids"]: 109 | transient_id = transient_node_map[T.id] 110 | graph.add_node( 111 | transient_id, 112 | **get_attributes(transient_node_map.items()) 113 | ) 114 | graph.add_edge(transient_id, p_id, label="has_identity") 115 | 116 | return graph 117 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/utils.py: -------------------------------------------------------------------------------- 1 | import hashlib 2 | import inspect 3 | import os 4 | 5 | import nepytune.benchmarks.benchmarks_visualization as bench_viz 6 | 7 | 8 | def hash_(list_): 9 | """Generate sha1 hash from the given list.""" 10 | return hashlib.sha1(str(tuple(sorted(list_))).encode("utf-8")).hexdigest() 11 | 12 | 13 | def get_id(_from, to, attributes): 14 | """Get id of a given entity.""" 15 | return hash_([_from, to, str(tuple(attributes.items()))]) 16 | 17 | 18 | def show_query_benchmarks(benchmark_results_path, cache_path, query, 19 | samples_by_users): 20 | instances = os.listdir(benchmark_results_path) 21 | instances = sorted(instances, key=lambda x: int(x.split('.')[-1].split('xlarge')[0])) 22 | 23 | benchmarks_dfs = bench_viz.get_benchmarks_results_dataframes( 24 | query=query, 25 | samples_by_users=samples_by_users, 26 | instances=instances, 27 | results_path=benchmark_results_path 28 | ) 
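    # The calls below are all helpers from nepytune.benchmarks.benchmarks_visualization:
    # select_concurrent_queries_from_data picks the concurrent-query samples out of the
    # per-instance dataframes loaded above (caching them at cache_path), the first chart
    # shows how many queries were running at once over the benchmark timeline, and the
    # two show_query_time_graph calls plot request duration (seconds scaled to
    # milliseconds) and throughput (1 / duration, i.e. queries per second) against the
    # number of concurrent queries.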
29 | concurrent_queries_dfs = bench_viz.select_concurrent_queries_from_data( 30 | query, 31 | benchmarks_dfs, 32 | cache_path=cache_path 33 | ) 34 | bench_viz.show_concurrent_queries_charts( 35 | concurrent_queries_dfs, 36 | x_title="Time from start of benchmark (Miliseconds)", 37 | y_title="Number of concurrent running queries" 38 | ) 39 | 40 | bench_viz.show_query_time_graph( 41 | benchmarks_dfs, 42 | yfunc=lambda df: df.multiply(1000).tolist(), 43 | title="Request duration (Miliseconds)", 44 | x_title="Number of concurrent queries", 45 | ) 46 | bench_viz.show_query_time_graph( 47 | benchmarks_dfs, 48 | yfunc=lambda df: (1 / df).tolist(), 49 | title="Queries per second", 50 | x_title="Number of concurrent queries", 51 | ) 52 | 53 | 54 | def show(func): 55 | lines = inspect.getsource(func) 56 | print(lines) 57 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/visualizations/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/bar_plots.py: -------------------------------------------------------------------------------- 1 | import plotly.graph_objects as go 2 | import colorlover as cl 3 | 4 | def make_bars(data, title, x_title, y_title, lazy=False): 5 | color = cl.scales[str(len(data.keys()))]['div']['RdYlBu'] 6 | fig = go.Figure( 7 | [ 8 | go.Bar( 9 | x=list(data.keys()), 10 | y=list(data.values()), 11 | hoverinfo="y", 12 | marker=dict(color=color), 13 | ) 14 | ] 15 | ) 16 | 17 | fig.update_layout( 18 | title=title, 19 | yaxis_type="log", 20 | xaxis_title=x_title, 21 | yaxis_title=y_title, 22 | ) 23 | if not lazy: 24 | fig.show() 25 | return fig -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/commons.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | 3 | from gremlin_python.process.graph_traversal import select, has, unfold 4 | from gremlin_python.process.traversal import P 5 | 6 | 7 | def get_timerange_condition(g, start_hour=16, end_hour=18, limit=1000): 8 | dates = ( 9 | g.E() 10 | .hasLabel("visited") 11 | .limit(limit) 12 | .values("ts") 13 | .fold() 14 | .as_("timestamps") 15 | .project("start", "end") 16 | .by(select("timestamps").unfold().min_()) 17 | .by(select("timestamps").unfold().max_()) 18 | ).next() 19 | 20 | start = dates["start"].replace(hour=start_hour, minute=0, second=0) 21 | end = dates["end"].replace(hour=start_hour, minute=0, second=0) 22 | 23 | toReturn = [] 24 | 25 | for days in range((end - start).days): 26 | toReturn.append( 27 | has( 28 | 'ts', 29 | P.between( 30 | start + timedelta(days=days), 31 | start + timedelta(days=days) + timedelta(hours=end_hour - start_hour) 32 | ) 33 | ) 34 | 35 | ) 36 | 37 | return toReturn 38 | 39 | # [ 40 | # has( 41 | # 'ts', 42 | # P.between( 43 | # start + timedelta(days=days), 44 | # start + timedelta(days=days) + timedelta(hours=end_hour - start_hour) 45 | # ) 46 | # ) 47 | # for days in range((end - start).days) 48 | # ] 49 | 50 | 51 | def get_user_device_statistics(g, 
dt_conditions, limit=10000): 52 | return ( 53 | g.E().hasLabel("visited").or_(*dt_conditions) 54 | .limit(limit).outV().fold() 55 | .project("type", "device", "browser") 56 | .by( 57 | unfold().unfold().groupCount().by("type") 58 | ) 59 | .by( 60 | unfold().unfold().groupCount().by("device") 61 | ) 62 | .by( 63 | unfold().unfold().groupCount().by("browser") 64 | ) 65 | ) -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/histogram.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import plotly.graph_objects as go 3 | 4 | 5 | def show(activities, website_name): 6 | # convert activities into pandas series 7 | activity_series = pd.to_datetime(pd.Series(list(activities))) 8 | 9 | # trim timestamps to desired granulation (in this case, hours) 10 | hourly_activity_series = activity_series.dt.strftime("%H") 11 | 12 | # prepare values & labels source for histogram's xaxis 13 | day_hours = pd.to_datetime(pd.date_range(start="00:00", end="23:59", freq="H")) 14 | 15 | # create histogram 16 | fig = go.Figure( 17 | data=[ 18 | go.Histogram( 19 | x=hourly_activity_series, 20 | histnorm='percent' 21 | ) 22 | ] 23 | ) 24 | 25 | # provide titles/labels/bar_gaps 26 | fig.update_layout( 27 | title_text=f"Activity of all users that visited website {website_name}", 28 | xaxis_title_text='Day time (Hour)', 29 | yaxis_title_text='Percentage of visits', 30 | 31 | xaxis=dict( 32 | tickangle=45, 33 | tickmode='array', 34 | tickvals=day_hours.strftime("%H").tolist(), 35 | ticktext=day_hours.strftime("%H:%M").tolist() 36 | ), 37 | bargap=0.05, 38 | ) 39 | 40 | # show histogram 41 | fig.show() 42 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/network_graph.py: -------------------------------------------------------------------------------- 1 | import networkx as nx 2 | from gremlin_python.process.graph_traversal import in_, coalesce, constant, select 3 | from gremlin_python.process.traversal import T, P, Column 4 | 5 | from nepytune import drawing 6 | 7 | 8 | def query_website_node(g, website_id): 9 | return g.V(website_id).valueMap(True).toList()[0] 10 | 11 | 12 | def query_transient_nodes_for_website(g, website_id, limit=10000): 13 | return (g.V(website_id) 14 | .in_("visited") 15 | .limit(limit) 16 | .project("uid", "pid") 17 | .by("uid") 18 | .by(in_("has_identity").values("pid").fold()) 19 | .group() 20 | .by(coalesce(select("pid").unfold(), constant("transient-nodes-connected-to-website"))) 21 | .by(select("uid").dedup().limit(100).fold()) 22 | .unfold() 23 | .project("persistent-node-id", "transient-nodes") 24 | .by(select(Column.keys)) 25 | .by(select(Column.values)) 26 | .where(select("transient-nodes").unfold().count().is_(P.gt(1))) 27 | ).toList() 28 | 29 | 30 | def create_graph_for_website_and_transient_nodes(website_node, transient_nodes_for_website): 31 | website_id = website_node[T.id] 32 | 33 | graph = nx.Graph() 34 | graph.add_node( 35 | website_id, 36 | **{ 37 | "id": website_id, 38 | "label": website_node[T.label], 39 | "title": website_node["title"][0], 40 | "url": website_node["url"][0] 41 | } 42 | ) 43 | 44 | transient_nodes = [] 45 | persistent_nodes = [] 46 | 47 | for node in transient_nodes_for_website: 48 | if node["persistent-node-id"] != "transient-nodes-connected-to-website": 49 | pnode = node["persistent-node-id"] 50 | 
persistent_nodes.append(pnode) 51 | graph.add_node( 52 | pnode, 53 | id=pnode, 54 | label="persistentId" 55 | ) 56 | 57 | for tnode in node["transient-nodes"]: 58 | graph.add_edge( 59 | pnode, 60 | tnode, 61 | label="has_identity" 62 | ) 63 | 64 | for tnode in node["transient-nodes"]: 65 | graph.add_node( 66 | tnode, 67 | id=tnode, 68 | label="transientId" 69 | ) 70 | 71 | graph.add_edge( 72 | website_id, 73 | tnode, 74 | label="visited" 75 | ) 76 | 77 | transient_nodes.append(tnode) 78 | return graph 79 | 80 | 81 | def show(g, website_id): 82 | """Show users that visited website on more than one device.""" 83 | 84 | transient_nodes_for_website = query_transient_nodes_for_website(g, website_id) 85 | website_node = query_website_node(g, website_id) 86 | 87 | raw_graph = create_graph_for_website_and_transient_nodes(website_node, transient_nodes_for_website) 88 | graph = drawing.spring_layout(raw_graph) 89 | 90 | drawing.draw( 91 | title="", 92 | scatters=[drawing.edges_scatter(graph)] + list(drawing.scatters_by_label(graph, attrs_to_skip=["pos"])), 93 | ) -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/pie_chart.py: -------------------------------------------------------------------------------- 1 | import colorlover as cl 2 | import plotly.graph_objects as go 3 | from plotly.subplots import make_subplots 4 | 5 | 6 | def show(data): 7 | type_labels, type_values = zip(*data["type"].items()) 8 | device_labels, device_values = zip(*data["device"].items()) 9 | browser_labels, browser_values = zip(*data["browser"].items()) 10 | 11 | fig = make_subplots(rows=3, cols=1, specs=[ 12 | [{"type": "pie"}], 13 | [{"type": "pie"}], 14 | [{"type": "pie"}] 15 | ]) 16 | 17 | fig.add_trace( 18 | go.Pie(labels=list(reversed(type_labels)), values=list(reversed(type_values)), hole=0, name="Type", 19 | marker={'colors': ['#7F7FFF', '#FF7F7F']}, 20 | textinfo='label+percent', hoverinfo="label+percent+value", textfont_size=20 21 | ), 22 | row=2, col=1, 23 | 24 | ) 25 | 26 | fig.add_trace( 27 | go.Pie(labels=["device
type"], values=[data["type"]["device"]], 28 | hole=0, textinfo='label', hoverinfo="label+value", 29 | marker={'colors': ['#7F7FFF']}, textfont_size=20 30 | ), 31 | row=1, col=1, 32 | 33 | ) 34 | 35 | fig.add_trace( 36 | go.Pie(labels=device_labels, values=device_values, hole=.8, opacity=1, 37 | textinfo='label', textposition='outside', hoverinfo="label+percent+value", 38 | marker={'colors': ['rgb(247,251,255)', 39 | 'rgb(222,235,247)', 40 | 'rgb(198,219,239)', 41 | 'rgb(158,202,225)', 42 | 'rgb(107,174,214)', 43 | 'rgb(66,146,198)', 44 | 'rgb(33,113,181)', 45 | 'rgb(8,81,156)', 46 | 'rgb(8,48,107)', 47 | 'rgb(9,32,66)', 48 | ] 49 | }, textfont_size=12), 50 | row=1, col=1, 51 | ) 52 | 53 | fig.add_trace( 54 | go.Pie(labels=["cookie
browser"], values=[data["type"]["cookie"]], 55 | hole=0, textinfo='label', hoverinfo="label+value", 56 | marker={'colors': ['#FF7F7F']}, textfont_size=20), 57 | row=3, col=1, 58 | ) 59 | 60 | fig.add_trace( 61 | go.Pie(labels=browser_labels, values=browser_values, hole=.8, 62 | textinfo='label', textposition='outside', hoverinfo="label+percent+value", 63 | marker={'colors': ['rgb(255,245,240)', 64 | 'rgb(254,224,210)', 65 | 'rgb(252,187,161)', 66 | 'rgb(252,146,114)', 67 | 'rgb(251,106,74)', 68 | 'rgb(239,59,44)', 69 | 'rgb(203,24,29)', 70 | 'rgb(165,15,21)', 71 | 'rgb(103,0,13)', 72 | 'rgb(51, 6,12)' 73 | ] 74 | }, textfont_size=12), 75 | row=3, col=1, 76 | ) 77 | 78 | fig.update_layout( 79 | showlegend=False, 80 | height=1000, 81 | ) 82 | 83 | fig.show() 84 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/segments.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | 3 | 4 | from gremlin_python.process.graph_traversal import in_, inE, select, out, values, has 5 | from gremlin_python.process.traversal import P, Column 6 | 7 | 8 | def get_all_devices_from_website_visitors(g, website_id, limit=100): 9 | """Get all transient ids (including siblings), that visited given page.""" 10 | 11 | return ( 12 | g.V(website_id) 13 | .project("transient_ids_no_persistent", "transient_ids_with_siblings") 14 | .by( 15 | in_("visited").limit(limit).fold() 16 | ) 17 | .by( 18 | in_("visited").in_("has_identity").dedup().out("has_identity").limit(limit).fold() 19 | ) 20 | .select(Column.values).unfold().unfold().dedup() 21 | ) 22 | 23 | 24 | def query_users_intersted_in_content(g, iab_codes, limit=10000): 25 | """Get users (persistent identities) that interacted with websites with given iab codes.""" 26 | 27 | return ( 28 | g.V() 29 | .hasLabel("persistentId") 30 | .coin(0.8) 31 | .limit(limit) 32 | .where(out("has_identity") 33 | .out("visited") 34 | .in_("links_to") 35 | .has("categoryCode", P.within(iab_codes)) 36 | ) 37 | .project("persistent_id", "attributes", "ip_location") 38 | .by(values("pid")) 39 | .by( 40 | out("has_identity").valueMap("browser", "email", "uid").unfold() 41 | .group() 42 | .by(Column.keys) 43 | .by(select(Column.values).unfold().dedup().fold()) 44 | ) 45 | .by(out("has_identity").out("uses").dedup().valueMap().fold()) 46 | ) 47 | 48 | 49 | def query_users_active_in_given_date_intervals(g, dt_conditions, limit=300): 50 | """Get users (persistent identities) that interacted with website in given date interval.""" 51 | 52 | return ( 53 | g.V().hasLabel("persistentId") 54 | .coin(0.5) 55 | .limit(limit) 56 | .where( 57 | out("has_identity").outE("visited").or_( 58 | *dt_conditions 59 | ) 60 | ) 61 | .project("persistent_id", "attributes", "ip_location") 62 | .by(values("pid")) 63 | .by( 64 | out("has_identity").valueMap("browser", "email", "uid").unfold() 65 | .group() 66 | .by(Column.keys) 67 | .by(select(Column.values).unfold().dedup().fold()) 68 | ) 69 | .by(out("has_identity").out("uses").dedup().valueMap().fold()) 70 | ) 71 | 72 | 73 | def query_users_active_in_n_days(g, n=30, today=datetime(2016, 6, 22, 23, 59), limit=1000): 74 | """Get users that were active in last 30 days.""" 75 | 76 | dt_condition = [ 77 | has("ts", P.gt(today - timedelta(days=n))) 78 | ] 79 | return query_users_active_in_given_date_intervals(g, dt_condition, limit) 
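The segment queries above take a list of has("ts", P.between(...)) predicates such as the ones produced by get_timerange_condition in visualizations/commons.py. A minimal sketch of wiring the two modules together (the Neptune endpoint is an illustrative placeholder, not a value shipped with this repository):

    from nepytune.traversal import get_traversal
    from nepytune.visualizations import segments
    from nepytune.visualizations.commons import get_timerange_condition

    g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint

    # users active in a daily 16:00-18:00 window, based on a sample of "visited" edges
    evening_conditions = get_timerange_condition(g, start_hour=16, end_hour=18)
    evening_users = segments.query_users_active_in_given_date_intervals(g, evening_conditions).toList()

    # users active in the 30 days preceding the dataset's reference date
    recent_users = segments.query_users_active_in_n_days(g, n=30).toList()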
-------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/sunburst_chart.py: -------------------------------------------------------------------------------- 1 | import plotly.graph_objects as go 2 | 3 | 4 | def show(data): 5 | type_labels, type_values = zip(*data["type"].items()) 6 | device_labels, device_values = zip(*data["device"].items()) 7 | browser_labels, browser_values = zip(*data["browser"].items()) 8 | 9 | trace = go.Sunburst( 10 | labels=type_labels + device_labels + browser_labels, 11 | parents=["", ""] + ["device"] * len(device_labels) + ["cookie"] * len(browser_labels), 12 | values=type_values + device_values + browser_values, 13 | hoverinfo="label+value", 14 | ) 15 | 16 | layout = go.Layout( 17 | margin=go.layout.Margin(t=0, l=0, r=0, b=0), 18 | ) 19 | 20 | fig = go.Figure([trace], layout) 21 | fig.show() 22 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/venn_diagram.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import random 3 | 4 | import plotly.graph_objects as go 5 | import yaml 6 | 7 | 8 | def get_intersections(s1, s2, s3): 9 | class NodeElement: 10 | def __init__(self, **kwargs): 11 | self.attributes = kwargs 12 | 13 | def __hash__(self): 14 | return hash(self.attributes["persistent_id"]) 15 | 16 | def __eq__(self, other): 17 | return self.attributes["persistent_id"] == other.attributes["persistent_id"] 18 | 19 | def __repr__(self): 20 | pid = self.attributes['persistent_id'] 21 | hash_ = str(hash(self.attributes['persistent_id'])) 22 | return f"{pid}, {hash_}" 23 | 24 | a = {NodeElement(**e) for e in s1} 25 | b = {NodeElement(**e) for e in s2} 26 | c = {NodeElement(**e) for e in s3} 27 | 28 | result = { 29 | "ab": a & b, 30 | "ac": a & c, 31 | "bc": b & c, 32 | "abc": a & b & c 33 | } 34 | 35 | result["a"] = a - (result["ab"] | result["ac"]) 36 | result["b"] = b - (result["ab"] | result["bc"]) 37 | result["c"] = b - (result["ac"] | result["bc"]) 38 | 39 | return result 40 | 41 | 42 | def make_label(node): 43 | return "
" + yaml.dump( 44 | node.attributes, 45 | default_style=None, 46 | default_flow_style=False, 47 | width=50 48 | ).replace("\n", "
") 49 | 50 | 51 | TRIANGLES = { 52 | "abc": [ 53 | [(8, -0.35), (10, -4.5), (12, -0.35)], 54 | [(11.51671, -2.41546), (9.98847, -4.47654), (12, -0.35)], 55 | [(10, 0), (12, -0.35), (8, -0.35)], 56 | [(8, -0.35), (8.58508, -2.5938), (9.98847, -4.47654)] 57 | ], 58 | "ab": [ 59 | [(8, 0), (10, 4.5), (12, 0)], 60 | [(8, 0), (11.49694, 2.39107), (12, 0)], 61 | [(8, 0), (10, 4.5), (8.51494, 2.37233)], 62 | [(12, 0), (10, 4.5), (11.49383, 2.40639)] 63 | ], 64 | "ac": [ 65 | [(4, -5.65), (9.7, -4.75), (7.5, -0.55)], 66 | [(4, -5.65), (5.26182, -2.31944), (7.5, -0.55)], 67 | [(8.26214, -1.9908), (8, -0.35), (7.5, -0.55)], 68 | [(8.92526, -3.30719), (10, -4.5), (9.7, -4.75)], 69 | [(7.01578, -5.9212), (4, -5.65), (9.7, -4.75)] 70 | ], 71 | "bc": [ 72 | [(16.01075, -5.7627), (12.51157, -0.57146), (10.31157, -4.77146)], 73 | [(10, -4.5), (11.08632, -3.32866), (10.31157, -4.77146)], 74 | [(12.00131, -0.28126), (12.51157, -0.57146), (11.74943, -2.01226)], 75 | [(12.51157, -0.57146), (14.74975, -2.3409), (16.01075, -5.7627)], 76 | [(10.31157, -4.77146), (12.99579, -5.94266), (16.01157, -5.67146)] 77 | ], 78 | "a": [ 79 | [(1.59, 4.12), (1.2, -3.54), (8.07, 5.62)], 80 | [(8.01, -0.31), (1.2, -3.54), (8.07, 5.62)], 81 | [(4.76091, -1.85498), (4.76091, -1.85498), (1.20313, -3.56193)], 82 | [(1.20313, -3.56193), (0, 0), (1.58563, 4.1221)], 83 | [(4.93073, -1.82779), (1.20313, -3.53928), (4.00216, -5.85506)], 84 | [(1.58563, 4.1221), (4.56262, 5.82533), (8.06809, 5.62107)], 85 | [(8.06809, 5.62107), (9.96902, 4.49245), (8.03037, 1.87789)], 86 | [(4.93976, -1.79473), (5.3695, -2.15025), (6.48177, -1.03733)], 87 | [(4.93976, -1.79473), (5.3695, -2.15025), (4.55188, -3.3864)], 88 | [(8.31901, 1.92203), (8.31901, 1.92203), (8.63932, 2.8171)], 89 | [(8.31901, 1.92203), (8.31901, 1.92203), (8.02962, 1.02571)] 90 | ], 91 | "b": [ 92 | [(12.06, 5.65), (18.89, -3.51), (18.38, 4.1)], 93 | [(12, -0.28), (12.06, 5.65), (18.89, -3.51)], 94 | [(12.06077, 5.6496), (12.0047, 2.32125), (10.02248, 4.49887)], 95 | [(15.10229, -1.76245), (18.8895, -3.51074), (16.01075, -5.7627)], 96 | [(20, 0), (18.37991, 4.09718), (18.8895, -3.51074)], 97 | [(18.37991, 4.09718), (12.06077, 5.6496), (15.63243, 5.78919)], 98 | ], 99 | "c": [ 100 | [(10, -12), (4.38, -8), (15.60, -8)], 101 | [(10, -4.55), (4.38, -8), (15.60, -8)], 102 | [(4.01794, -5.64598), (4.38003, -8.00561), (7.22591, -6.21212)], 103 | [(15.99694, -5.86975), (12.83447, -6.25924), (15.62772, -8.02776)], 104 | [(4.38003, -8.00561), (6.43193, -10.80379), (10, -12)], 105 | [(15.62772, -8.02776), (13.95762, -10.55233), (10, -12)], 106 | [(5.56245, -6.01624), (7.11534, -5.92534), (7.22591, -6.21212)], 107 | [(8.21897, -5.58699), (7.11534, -5.92534), (7.22591, -6.21212)], 108 | [(11.76526, -5.59305), (13.023, -5.93749), (12.83447, -6.25924)], 109 | [(14.49948, -5.9889), (13.023, -5.93749), (12.83447, -6.25924)], 110 | ], 111 | } 112 | 113 | 114 | def show_venn_diagram(intersections, labels): 115 | def point_on_triangle(pt1, pt2, pt3): 116 | """ 117 | Random point on the triangle with vertices pt1, pt2 and pt3. 
118 | """ 119 | s, t = sorted([random.random(), random.random()]) 120 | return (s * pt1[0] + (t - s) * pt2[0] + (1 - t) * pt3[0], 121 | s * pt1[1] + (t - s) * pt2[1] + (1 - t) * pt3[1]) 122 | 123 | def area(tri): 124 | y_list = [tri[0][1], tri[1][1], tri[2][1]] 125 | x_list = [tri[0][0], tri[1][0], tri[2][0]] 126 | height = max(y_list) - min(y_list) 127 | width = max(x_list) - min(x_list) 128 | return height * width / 2 129 | 130 | empty_sets = [k for k, v in intersections.items() if not len(v)] 131 | 132 | if empty_sets: 133 | raise ValueError(f"Given intersections \"{empty_sets}\" are empty, cannot continue") 134 | 135 | scatters = [] 136 | 137 | for k, v in intersections.items(): 138 | weights = [area(triangle) for triangle in TRIANGLES[k]] 139 | points_pairs = [point_on_triangle(*random.choices(TRIANGLES[k], weights=weights)[0]) for _ in v] 140 | x, y = zip(*points_pairs) 141 | scatter_labels = [make_label(n) for n in v] 142 | 143 | scatters.append( 144 | go.Scatter( 145 | x=x, 146 | y=y, 147 | mode='markers', 148 | showlegend=False, 149 | text=scatter_labels, 150 | marker=dict( 151 | size=10, 152 | line=dict(width=2, 153 | color='DarkSlateGrey'), 154 | opacity=1, 155 | ), 156 | hoverinfo="text", 157 | ) 158 | ) 159 | fig = go.Figure( 160 | data=list(scatters), 161 | layout=go.Layout( 162 | title_text="", 163 | autosize=False, 164 | titlefont_size=16, 165 | showlegend=True, 166 | hovermode='closest', 167 | margin=dict(b=20, l=5, r=5, t=40), 168 | xaxis=dict(showgrid=False, zeroline=False, showticklabels=False), 169 | yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, scaleanchor="x", scaleratio=1) 170 | ), 171 | ) 172 | 173 | fig.update_layout( 174 | shapes=[ 175 | go.layout.Shape( 176 | type="circle", 177 | x0=0, 178 | y0=-6, 179 | x1=12, 180 | y1=6, 181 | fillcolor="Red", 182 | opacity=0.15, 183 | layer='below' 184 | ), 185 | go.layout.Shape( 186 | type="circle", 187 | x0=8, 188 | y0=-6, 189 | x1=20, 190 | y1=6, 191 | fillcolor="Blue", 192 | opacity=0.15, 193 | layer='below' 194 | ), 195 | go.layout.Shape( 196 | type="circle", 197 | x0=4, 198 | y0=-12, 199 | x1=16, 200 | y1=0, 201 | fillcolor="Green", 202 | opacity=0.15, 203 | layer='below' 204 | ), 205 | ] 206 | ) 207 | 208 | fig.update_layout( 209 | annotations=[ 210 | dict( 211 | xref="x", 212 | yref="y", 213 | x=6, y=6, 214 | text=labels[0], 215 | font=dict(size=15), 216 | showarrow=True, 217 | arrowwidth=2, 218 | ax=-50, 219 | ay=-25, 220 | arrowhead=7, 221 | ), 222 | dict( 223 | xref="x", 224 | yref="y", 225 | x=14, y=6, 226 | text=labels[1], 227 | font=dict(size=15), 228 | showarrow=True, 229 | arrowwidth=2, 230 | ax=50, 231 | ay=-25, 232 | arrowhead=7, 233 | ), 234 | dict( 235 | xref="x", 236 | yref="y", 237 | x=10, y=-12, 238 | text=labels[2], 239 | font=dict(size=15), 240 | showarrow=True, 241 | arrowwidth=2, 242 | ax=50, 243 | ay=25, 244 | arrowhead=7, 245 | ), 246 | ] 247 | ) 248 | 249 | fig.show() 250 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/write_utils.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import csv 3 | from contextlib import contextmanager 4 | import json 5 | 6 | 7 | class GremlinCSV: 8 | """Build CSV file in AWS-Neptune ready-to-load data format.""" 9 | 10 | def __init__(self, opened_file, attributes): 11 | """Create CSV writer.""" 12 | self.types = dict(key.split(":") for key in attributes) 13 | self.writer = csv.writer(opened_file, 
quoting=csv.QUOTE_ALL) 14 | self.key_order = list(self.types.keys()) 15 | self.writer.writerow(self.header) 16 | 17 | def attributes(self, attribute_map): 18 | """Build attribute list from attribute_map with default values.""" 19 | return [attribute_map.get(k, "") for k in self.key_order] 20 | 21 | @property 22 | @abc.abstractmethod 23 | def header(self): 24 | """Get header.""" 25 | 26 | 27 | class GremlinNodeCSV(GremlinCSV): 28 | """Build CSV file with graph nodes in AWS-Neptune ready-to-load data format.""" 29 | 30 | @property 31 | def header(self): 32 | """Get header.""" 33 | return ( 34 | ["~id"] 35 | + [f"{key}:{self.types[key]}" for key in self.key_order] 36 | + ["~label"] 37 | ) 38 | 39 | def add(self, _id, attribute_map, label): 40 | """Add row to CSV file.""" 41 | self.writer.writerow([_id] + self.attributes(attribute_map) + [label]) 42 | 43 | 44 | class GremlinEdgeCSV(GremlinCSV): 45 | """Build CSV file with graph edges in AWS-Neptune ready-to-load data format.""" 46 | 47 | @property 48 | def header(self): 49 | """Get header.""" 50 | return ["~id", "~from", "~to", "~label"] + [ 51 | f"{key}:{self.types[key]}" for key in self.key_order 52 | ] 53 | 54 | def add(self, _id, _from, to, label, attribute_map): 55 | """Add row to CSV file.""" 56 | self.writer.writerow([_id, _from, to, label] + self.attributes(attribute_map)) 57 | 58 | 59 | @contextmanager 60 | def gremlin_writer(type_, file_name, attributes): 61 | """Factory of gremlin writer objects.""" 62 | with open(file_name, "w", 1024 * 1024) as f_t: 63 | yield type_(f_t, attributes=attributes) 64 | 65 | 66 | def json_lines_file(opened_file): 67 | """Yield json lines from opened file.""" 68 | for line in opened_file: 69 | yield json.loads(line) 70 | -------------------------------------------------------------------------------- /identity-resolution/templates/bulk-load-stack.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | 3 | Parameters: 4 | bulkloadNeptuneEndpoint: 5 | Type: String 6 | bulkloadNeptuneData: 7 | Type: String 8 | bulkloadNeptuneIAMRole: 9 | Type: String 10 | Description: IAM Role ARN for bulk load role 11 | bulkloadNeptuneSecurityGroup: 12 | Type: AWS::EC2::SecurityGroup::Id 13 | bulkloadSubnet1: 14 | Type: AWS::EC2::Subnet::Id 15 | bulkloadBucket: 16 | Type: String 17 | 18 | Mappings: 19 | Constants: 20 | S3Keys: 21 | NeptuneLoaderCode: identity-resolution/functions/NeptuneLoader.zip 22 | PythonLambdaLayer: identity-resolution/functions/PythonLambdaLayer.zip 23 | 24 | Resources: 25 | 26 | bulkloadNeptuneLoader: 27 | DependsOn: 28 | - bulkloadNeptuneLoaderLambdaRoleCloudWatchStream 29 | - bulkloadNeptuneLoaderLambdaRoleCloudWatchGroup 30 | - bulkloadNeptuneLoaderLambdaRoleEC2 31 | - bulkloadNeptuneLoaderLambdaRole 32 | Type: "Custom::NeptuneLoader" 33 | Properties: 34 | ServiceToken: 35 | Fn::GetAtt: [ bulkloadNeptuneLoaderLambda, Arn] 36 | 37 | bulkloadNeptuneLoaderLambdaRoleCloudWatchStream: 38 | Type: AWS::IAM::Policy 39 | Properties: 40 | PolicyDocument: 41 | Statement: 42 | - Action: 43 | - logs:CreateLogStream 44 | - logs:PutLogEvents 45 | Effect: Allow 46 | Resource: !Join [ "", [ "arn:aws:logs:", !Ref "AWS::Region", ":", !Ref "AWS::AccountId" , ":log-group:/aws/lambda/", !Ref bulkloadNeptuneLoaderLambda, ":*" ]] 47 | Version: '2012-10-17' 48 | PolicyName: bulkloadNeptuneLoaderLambdaRoleCloudWatchStream 49 | Roles: 50 | - Ref: bulkloadNeptuneLoaderLambdaRole 51 | bulkloadNeptuneLoaderLambdaRoleCloudWatchGroup: 52 | Type: 
AWS::IAM::Policy 53 | Properties: 54 | PolicyDocument: 55 | Statement: 56 | - Action: 57 | - logs:CreateLogGroup 58 | Effect: Allow 59 | Resource: !Join [ "", [ "arn:aws:logs:", !Ref "AWS::Region", ":", !Ref "AWS::AccountId" , ":*" ]] 60 | Version: '2012-10-17' 61 | PolicyName: bulkloadNeptuneLoaderLambdaRoleCloudWatchGroup 62 | Roles: 63 | - Ref: bulkloadNeptuneLoaderLambdaRole 64 | bulkloadNeptuneLoaderLambdaRoleEC2: 65 | Type: AWS::IAM::Policy 66 | Properties: 67 | PolicyDocument: 68 | Statement: 69 | - Action: 70 | - ec2:CreateNetworkInterface 71 | - ec2:DescribeNetworkInterfaces 72 | - ec2:DeleteNetworkInterface 73 | - ec2:DetachNetworkInterface 74 | Effect: Allow 75 | Resource: "*" 76 | Version: '2012-10-17' 77 | PolicyName: bulkloadNeptuneLoaderLambdaRoleEC2 78 | Roles: 79 | - Ref: bulkloadNeptuneLoaderLambdaRole 80 | bulkloadNeptuneLoaderLambda: 81 | DependsOn: 82 | - bulkloadNeptuneLoaderLambdaRoleEC2 83 | Type: AWS::Lambda::Function 84 | Properties: 85 | Code: 86 | S3Bucket: 87 | Ref: bulkloadBucket 88 | S3Key: !FindInMap 89 | - Constants 90 | - S3Keys 91 | - NeptuneLoaderCode 92 | Description: 'Lambda function to load data into Neptune instance.' 93 | Environment: 94 | Variables: 95 | neptunedb: 96 | Ref: bulkloadNeptuneEndpoint 97 | neptuneloads3path: 98 | Ref: bulkloadNeptuneData 99 | region: 100 | Ref: "AWS::Region" 101 | s3loadiamrole: 102 | Ref: bulkloadNeptuneIAMRole 103 | Handler: lambda_function.lambda_handler 104 | MemorySize: 128 105 | Layers: 106 | - !Ref PythonLambdaLayer 107 | Role: 108 | Fn::GetAtt: [ bulkloadNeptuneLoaderLambdaRole, Arn ] 109 | Runtime: python3.9 110 | Timeout: 180 111 | VpcConfig: 112 | SecurityGroupIds: 113 | - Ref: bulkloadNeptuneSecurityGroup 114 | SubnetIds: 115 | - Ref: bulkloadSubnet1 116 | bulkloadNeptuneLoaderLambdaRole: 117 | Type: AWS::IAM::Role 118 | Properties: 119 | ManagedPolicyArns: 120 | - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole' 121 | - 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess' 122 | AssumeRolePolicyDocument: 123 | Statement: 124 | - Action: sts:AssumeRole 125 | Effect: Allow 126 | Principal: 127 | Service: 128 | - lambda.amazonaws.com 129 | Version: '2012-10-17' 130 | Path: / 131 | PythonLambdaLayer: 132 | Type: "AWS::Lambda::LayerVersion" 133 | Properties: 134 | CompatibleRuntimes: 135 | - python3.9 136 | - python3.8 137 | Content: 138 | S3Bucket: 139 | Ref: bulkloadBucket 140 | S3Key: !FindInMap 141 | - Constants 142 | - S3Keys 143 | - PythonLambdaLayer -------------------------------------------------------------------------------- /identity-resolution/templates/identity-resolution.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | 3 | Mappings: 4 | S3Buckets: 5 | us-west-2: 6 | bucket: aws-admartech-samples-us-west-2 7 | us-east-1: 8 | bucket: aws-admartech-samples-us-east-1 9 | us-east-2: 10 | bucket: aws-admartech-samples-us-east-2 11 | eu-west-1: 12 | bucket: aws-admartech-samples-eu-west-1 13 | 14 | Constants: 15 | S3Keys: 16 | neptuneNotebooks: /identity-resolution/notebooks/identity-graph 17 | irdata: /identity-resolution/data/ 18 | bulkLoadStack: /identity-resolution/templates/bulk-load-stack.yaml 19 | neptuneNotebookStack: /identity-resolution/templates/neptune-workbench-stack.yaml 20 | 21 | #------------------------------------------------------------------------------# 22 | # RESOURCES 23 | #------------------------------------------------------------------------------# 24 | Resources: 25 | # 
---------- CREATING NEPTUNE CLUSTER FROM SNAPSHOT ---------- 26 | NeptuneBaseStack: 27 | Type: AWS::CloudFormation::Stack 28 | Properties: 29 | TemplateURL: https://s3.amazonaws.com/aws-neptune-customer-samples/v2/cloudformation-templates/neptune-base-stack.json 30 | Parameters: 31 | NeptuneQueryTimeout: '300000' 32 | DbInstanceType: db.r5.12xlarge 33 | TimeoutInMinutes: '360' 34 | 35 | # ---------- SETTING UP SAGEMAKER NOTEBOOK INSTANCES ---------- 36 | ExecutionRole: 37 | Type: AWS::IAM::Role 38 | Properties: 39 | AssumeRolePolicyDocument: 40 | Version: "2012-10-17" 41 | Statement: 42 | - Effect: Allow 43 | Principal: 44 | Service: 45 | - sagemaker.amazonaws.com 46 | Action: 47 | - sts:AssumeRole 48 | Path: "/" 49 | Policies: 50 | - PolicyName: "sagemakerneptunepolicy" 51 | PolicyDocument: 52 | Version: "2012-10-17" 53 | Statement: 54 | - Effect: "Allow" 55 | Action: 56 | - cloudwatch:PutMetricData 57 | Resource: 58 | Fn::Sub: "arn:${AWS::Partition}:cloudwatch:${AWS::Region}:${AWS::AccountId}:*" 59 | - Effect: "Allow" 60 | Action: 61 | - "logs:CreateLogGroup" 62 | - "logs:CreateLogStream" 63 | - "logs:DescribeLogStreams" 64 | - "logs:PutLogEvents" 65 | - "logs:GetLogEvents" 66 | Resource: 67 | Fn::Sub: "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:*" 68 | - Effect: "Allow" 69 | Action: "neptune-db:connect" 70 | Resource: 71 | Fn::Sub: "arn:${AWS::Partition}:neptune-db:${AWS::Region}:${AWS::AccountId}:${NeptuneBaseStack.Outputs.DBClusterId}/*" 72 | - Effect: "Allow" 73 | Action: 74 | - "s3:Get*" 75 | - "s3:List*" 76 | Resource: 77 | Fn::Sub: "arn:${AWS::Partition}:s3:::*" 78 | 79 | SageMakerNeptuneStack: 80 | Type: AWS::CloudFormation::Stack 81 | Properties: 82 | TemplateURL: 83 | Fn::Join: [ "", 84 | [ 85 | https://s3.amazonaws.com/, 86 | !FindInMap [ S3Buckets, Ref: 'AWS::Region', bucket ], 87 | !FindInMap [ Constants, S3Keys, neptuneNotebookStack ] 88 | ] 89 | ] 90 | Parameters: 91 | SageMakerNotebookName: "id-graph-notebook" 92 | NotebookInstanceType: ml.m5.xlarge 93 | NeptuneClusterEndpoint: 94 | Fn::GetAtt: 95 | - NeptuneBaseStack 96 | - Outputs.DBClusterEndpoint 97 | NeptuneClusterPort: 98 | Fn::GetAtt: 99 | - NeptuneBaseStack 100 | - Outputs.DBClusterPort 101 | NeptuneClusterSecurityGroups: 102 | Fn::GetAtt: 103 | - NeptuneBaseStack 104 | - Outputs.NeptuneSG 105 | NeptuneClusterSubnetId: 106 | Fn::GetAtt: 107 | - NeptuneBaseStack 108 | - Outputs.PublicSubnet1 109 | SageMakerNotebookRole: 110 | Fn::GetAtt: 111 | - ExecutionRole 112 | - Arn 113 | AdditionalNotebookS3Locations: !Join 114 | - '' 115 | - - 's3://' 116 | - !FindInMap 117 | - S3Buckets 118 | - !Ref 'AWS::Region' 119 | - bucket 120 | - !FindInMap 121 | - Constants 122 | - S3Keys 123 | - neptuneNotebooks 124 | TimeoutInMinutes: '60' 125 | 126 | # --------- LOAD DATA INTO NEPTUNE --------- 127 | 128 | NeptuneBulkLoadStack: 129 | Type: AWS::CloudFormation::Stack 130 | Properties: 131 | TemplateURL: !Join 132 | - '' 133 | - - 'https://s3.' 
134 | - !Ref 'AWS::Region' 135 | - '.amazonaws.com/' 136 | - !FindInMap 137 | - S3Buckets 138 | - !Ref 'AWS::Region' 139 | - bucket 140 | - !FindInMap 141 | - Constants 142 | - S3Keys 143 | - bulkLoadStack 144 | Parameters: 145 | bulkloadNeptuneEndpoint: 146 | Fn::GetAtt: 147 | - NeptuneBaseStack 148 | - Outputs.DBClusterEndpoint 149 | bulkloadNeptuneData: !Join 150 | - '' 151 | - - 's3://' 152 | - !FindInMap 153 | - S3Buckets 154 | - !Ref 'AWS::Region' 155 | - bucket 156 | - !FindInMap 157 | - Constants 158 | - S3Keys 159 | - irdata 160 | bulkloadNeptuneIAMRole: 161 | Fn::GetAtt: 162 | - NeptuneBaseStack 163 | - Outputs.NeptuneLoadFromS3IAMRoleArn 164 | bulkloadNeptuneSecurityGroup: 165 | Fn::GetAtt: 166 | - NeptuneBaseStack 167 | - Outputs.NeptuneSG 168 | bulkloadSubnet1: 169 | Fn::GetAtt: 170 | - NeptuneBaseStack 171 | - Outputs.PrivateSubnet1 172 | bulkloadBucket: !FindInMap 173 | - S3Buckets 174 | - !Ref 'AWS::Region' 175 | - bucket 176 | 177 | 178 | #------------------------------------------------------------------------------# 179 | # OUTPUTS 180 | #------------------------------------------------------------------------------# 181 | 182 | Outputs: 183 | VPC: 184 | Description: VPC of the Neptune Cluster 185 | Value: 186 | Fn::GetAtt: 187 | - NeptuneBaseStack 188 | - Outputs.VPC 189 | PublicSubnet1: 190 | Value: 191 | Fn::GetAtt: 192 | - NeptuneBaseStack 193 | - Outputs.PublicSubnet1 194 | NeptuneSG: 195 | Description: Neptune Security Group 196 | Value: 197 | Fn::GetAtt: 198 | - NeptuneBaseStack 199 | - Outputs.NeptuneSG 200 | SageMakerNotebook: 201 | Value: 202 | Fn::GetAtt: 203 | - SageMakerNeptuneStack 204 | - Outputs.NeptuneNotebook 205 | DBClusterEndpoint: 206 | Description: Master Endpoint for Neptune Cluster 207 | Value: 208 | Fn::GetAtt: 209 | - NeptuneBaseStack 210 | - Outputs.DBClusterEndpoint 211 | DBInstanceEndpoint: 212 | Description: Master Instance Endpoint 213 | Value: 214 | Fn::GetAtt: 215 | - NeptuneBaseStack 216 | - Outputs.DBInstanceEndpoint 217 | GremlinEndpoint: 218 | Description: Gremlin Endpoint for Neptune 219 | Value: 220 | Fn::GetAtt: 221 | - NeptuneBaseStack 222 | - Outputs.GremlinEndpoint 223 | LoaderEndpoint: 224 | Description: Loader Endpoint for Neptune 225 | Value: 226 | Fn::GetAtt: 227 | - NeptuneBaseStack 228 | - Outputs.LoaderEndpoint 229 | DBClusterReadEndpoint: 230 | Description: DB cluster Read Endpoint 231 | Value: 232 | Fn::GetAtt: 233 | - NeptuneBaseStack 234 | - Outputs.DBClusterReadEndpoint 235 | -------------------------------------------------------------------------------- /identity-resolution/templates/neptune-workbench-stack.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | 3 | Description: A template to deploy Neptune Notebooks using CloudFormation resources. 4 | 5 | Parameters: 6 | NotebookInstanceType: 7 | Description: The notebook instance type. 
8 | Type: String 9 | Default: ml.t2.medium 10 | AllowedValues: 11 | - ml.t2.medium 12 | - ml.t2.large 13 | - ml.t2.xlarge 14 | - ml.t2.2xlarge 15 | - ml.t3.2xlarge 16 | - ml.t3.large 17 | - ml.t3.medium 18 | - ml.t3.xlarge 19 | - ml.m4.xlarge 20 | - ml.m4.2xlarge 21 | - ml.m4.4xlarge 22 | - ml.m4.10xlarge 23 | - ml.m4.16xlarge 24 | - ml.m5.12xlarge 25 | - ml.m5.24xlarge 26 | - ml.m5.2xlarge 27 | - ml.m5.4xlarge 28 | - ml.m5.xlarge 29 | - ml.p2.16xlarge 30 | - ml.p2.8xlarge 31 | - ml.p2.xlarge 32 | - ml.p3.16xlarge 33 | - ml.p3.2xlarge 34 | - ml.p3.8xlarge 35 | - ml.c4.2xlarge 36 | - ml.c4.4xlarge 37 | - ml.c4.8xlarge 38 | - ml.c4.xlarge 39 | - ml.c5.18xlarge 40 | - ml.c5.2xlarge 41 | - ml.c5.4xlarge 42 | - ml.c5.9xlarge 43 | - ml.c5.xlarge 44 | - ml.c5d.18xlarge 45 | - ml.c5d.2xlarge 46 | - ml.c5d.4xlarge 47 | - ml.c5d.9xlarge 48 | - ml.c5d.xlarge 49 | ConstraintDescription: Must be a valid SageMaker instance type. 50 | 51 | NeptuneClusterEndpoint: 52 | Description: The cluster endpoint of an existing Neptune cluster. 53 | Type: String 54 | 55 | NeptuneClusterPort: 56 | Description: 'OPTIONAL: The Port of an existing Neptune cluster (default 8182).' 57 | Type: String 58 | Default: '8182' 59 | 60 | NeptuneClusterSecurityGroups: 61 | Description: The VPC security group IDs. The security groups must be for the same VPC as specified in the subnet. 62 | Type: List 63 | 64 | NeptuneClusterSubnetId: 65 | Description: The ID of the subnet in a VPC to which you would like to have a connectivity from your ML compute instance. 66 | Type: AWS::EC2::Subnet::Id 67 | 68 | SageMakerNotebookRole: 69 | Description: The ARN for the IAM role that the notebook instance will assume. 70 | Type: String 71 | AllowedPattern: ^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$ 72 | 73 | SageMakerNotebookName: 74 | Description: The name of the Neptune notebook. 75 | Type: String 76 | 77 | AdditionalNotebookS3Locations: 78 | Description: Location of additional notebooks to include with the Notebook instance. 
79 | Type: String 80 | 81 | Conditions: 82 | InstallNotebookContent: 83 | Fn::Not: [ 84 | Fn::Equals: [ 85 | Ref: AdditionalNotebookS3Locations, "" 86 | ] 87 | ] 88 | 89 | Resources: 90 | NeptuneNotebookInstance: 91 | Type: AWS::SageMaker::NotebookInstance 92 | Properties: 93 | NotebookInstanceName: !Join 94 | - '' 95 | - - 'aws-neptune-' 96 | - !Ref SageMakerNotebookName 97 | InstanceType: 98 | Ref: NotebookInstanceType 99 | SubnetId: 100 | Ref: NeptuneClusterSubnetId 101 | SecurityGroupIds: 102 | Ref: NeptuneClusterSecurityGroups 103 | RoleArn: 104 | Ref: SageMakerNotebookRole 105 | LifecycleConfigName: 106 | Fn::GetAtt: 107 | - NeptuneNotebookInstanceLifecycleConfig 108 | - NotebookInstanceLifecycleConfigName 109 | 110 | NeptuneNotebookInstanceLifecycleConfig: 111 | Type: AWS::SageMaker::NotebookInstanceLifecycleConfig 112 | Properties: 113 | OnStart: 114 | - Content: 115 | Fn::Base64: 116 | Fn::Join: 117 | - '' 118 | - - "#!/bin/bash\n" 119 | - sudo -u ec2-user -i << 'EOF' 120 | - "\n" 121 | - echo 'export GRAPH_NOTEBOOK_AUTH_MODE= 122 | - "DEFAULT' >> ~/.bashrc\n" 123 | - echo 'export GRAPH_NOTEBOOK_HOST= 124 | - Ref: NeptuneClusterEndpoint 125 | - "' >> ~/.bashrc\n" 126 | - echo 'export GRAPH_NOTEBOOK_PORT= 127 | - Ref: NeptuneClusterPort 128 | - "' >> ~/.bashrc\n" 129 | - echo 'export NEPTUNE_LOAD_FROM_S3_ROLE_ARN= 130 | - "' >> ~/.bashrc\n" 131 | - echo 'export AWS_REGION= 132 | - Ref: AWS::Region 133 | - "' >> ~/.bashrc\n" 134 | - aws s3 cp s3://aws-neptune-notebook/graph_notebook.tar.gz /tmp/graph_notebook.tar.gz 135 | - "\n" 136 | - echo 'export NOTEBOOK_CONTENT_S3_LOCATION=, 137 | - Ref: AdditionalNotebookS3Locations 138 | - "' >> ~/.bashrc\n" 139 | - aws s3 sync s3://aws-neptune-customer-samples/neptune-sagemaker/notebooks /home/ec2-user/SageMaker/Neptune --exclude * --include util/* 140 | - "\n" 141 | - rm -rf /tmp/graph_notebook 142 | - "\n" 143 | - tar -zxvf /tmp/graph_notebook.tar.gz -C /tmp 144 | - "\n" 145 | - /tmp/graph_notebook/install.sh 146 | - "\n" 147 | - mkdir /home/ec2-user/SageMaker/identity-graph 148 | - "\n" 149 | - Fn::If: [ InstallNotebookContent, 150 | Fn::Join: 151 | [ "", [ 152 | "aws s3 cp ", 153 | Ref: AdditionalNotebookS3Locations, 154 | " /home/ec2-user/SageMaker/identity-graph/ --recursive" 155 | ] 156 | ], 157 | "# No notebook content\n" 158 | ] 159 | - "\n" 160 | - EOF 161 | 162 | Outputs: 163 | NeptuneNotebookInstanceId: 164 | Value: 165 | Ref: NeptuneNotebookInstance 166 | NeptuneNotebook: 167 | Value: 168 | Fn::Join: [ "", 169 | [ 170 | "https://", 171 | Fn::Select: [ 1, Fn::Split: [ "/", Ref: "NeptuneNotebookInstance" ] ], 172 | ".notebook.", 173 | Ref: "AWS::Region", 174 | ".sagemaker.aws/" 175 | ] 176 | ] 177 | NeptuneNotebookInstanceLifecycleConfigId: 178 | Value: 179 | Ref: "NeptuneNotebookInstanceLifecycleConfig" --------------------------------------------------------------------------------
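For completeness, a sketch of the bulk-load request that the custom-resource Lambda in bulk-load-stack.yaml ultimately issues against Neptune's HTTP loader API (the NeptuneLoader.zip code itself is not included in this repository dump). The endpoint, S3 path, role ARN and region below are placeholders standing in for the stack parameters bulkloadNeptuneEndpoint, bulkloadNeptuneData and bulkloadNeptuneIAMRole; treat this as an illustration of the loader call, not the packaged Lambda source.

import json
import urllib.request

# Placeholders; in the deployed stack these arrive as the Lambda environment
# variables neptunedb, neptuneloads3path, s3loadiamrole and region.
NEPTUNE_ENDPOINT = "my-neptune-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com"
SOURCE_S3_PATH = "s3://my-bucket/identity-resolution/data/"
LOAD_ROLE_ARN = "arn:aws:iam::123456789012:role/NeptuneLoadFromS3"

payload = {
    "source": SOURCE_S3_PATH,
    "format": "csv",            # Gremlin load format, as produced by nepytune.write_utils
    "iamRoleArn": LOAD_ROLE_ARN,
    "region": "us-east-1",
    "failOnError": "FALSE",
}

request = urllib.request.Request(
    f"https://{NEPTUNE_ENDPOINT}:8182/loader",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Must run from inside the Neptune VPC, which is why the stack attaches the Lambda to it.
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # returns a loadId that can be polled for status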