├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── identity-resolution ├── README.md ├── data └── DATA.md ├── images ├── architecture.png └── sagemaker-link.png ├── notebooks └── identity-graph │ ├── identity-graph-sample.ipynb │ └── nepytune │ ├── __init__.py │ ├── benchmarks │ ├── __init__.py │ ├── __main__.py │ ├── benchmarks_visualization.py │ ├── connection_pool.py │ ├── drop_graph.py │ ├── ingestion.py │ └── query_runner.py │ ├── cli │ ├── __init__.py │ ├── __main__.py │ ├── add.py │ ├── extend.py │ ├── split.py │ └── transform.py │ ├── drawing.py │ ├── edges │ ├── __init__.py │ ├── identity_groups.py │ ├── ip_loc.py │ ├── persistent_ids.py │ ├── user_website.py │ └── website_groups.py │ ├── nodes │ ├── __init__.py │ ├── identity_groups.py │ ├── ip_loc.py │ ├── users.py │ └── websites.py │ ├── traversal.py │ ├── usecase │ ├── __init__.py │ ├── brand_interaction.py │ ├── purchase_path.py │ ├── similar_audience.py │ ├── undecided_users.py │ ├── user_summary.py │ └── users_from_household.py │ ├── utils.py │ ├── visualizations │ ├── __init__.py │ ├── bar_plots.py │ ├── commons.py │ ├── histogram.py │ ├── network_graph.py │ ├── pie_chart.py │ ├── segments.py │ ├── sunburst_chart.py │ └── venn_diagram.py │ └── write_utils.py └── templates ├── bulk-load-stack.yaml ├── identity-resolution.yml └── neptune-workbench-stack.yaml /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | .*~ 3 | .*.swp 4 | *.pyc 5 | .DS_Store 6 | *.lock -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. 
You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AWS Advertising & Marketing Samples 2 | 3 | Samples and documentation for various advertising and marketing use cases on AWS. 4 | 5 | ## Sample 1: [Customer Identity Graph using Amazon Neptune](./identity-resolution/) 6 | 7 | A customer identity graph enables a single, unified view of customer identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes, for targeted advertising. Included in this repository is a sample solution for building an identity graph using Amazon Neptune, a managed graph database service on AWS. 8 | 9 | ## Additional Reading 10 | 11 | [AWS Advertising & Marketing portal](https://aws.amazon.com/advertising-marketing/) 12 | 13 | ## Contributing 14 | 15 | Please see further instructions on contributing in the CONTRIBUTING file. 16 | 17 | ## License 18 | 19 | This library is licensed under the MIT-0 License. See the LICENSE file. 20 | 21 | -------------------------------------------------------------------------------- /identity-resolution/README.md: -------------------------------------------------------------------------------- 1 | # Identity Graph Using Amazon Neptune 2 | 3 | An identity graph provides a single unified view of customers and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods. Typically, identity graphs are part of a larger identity resolution architecture. Identity resolution is the process of matching human identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes, for targeted advertising. 4 | 5 | The following notebook walks you through a sample identity graph solution, built on an open dataset and the Amazon Neptune graph database, and shows how it can be used within a larger identity resolution architecture. The notebook also includes a number of data visualizations that help you understand the structure of an identity graph and the characteristics of an identity resolution dataset and use case. Later in the notebook, we explore additional use cases that this dataset supports. 6 | 7 | ## Getting Started 8 | 9 | This repo includes the following assets: 10 | - A [Jupyter notebook](notebooks/identity-graph/identity-graph-sample.ipynb) containing a more thorough explanation of the Identity Graph use case, the dataset that is being used, the graph data model, and graph queries that are used in deriving identities, audiences, customer journeys, etc. 11 | - A [sample dataset](data/DATA.md) comprising anonymized cookies, device IDs, and website visits. It also includes additional manufactured data that enriches the original anonymized dataset to make this more realistic. 12 | - A set of [Python scripts](notebooks/identity-graph/nepytune) that are used within the Jupyter notebook for executing each of the different use cases and examples. We're providing the code for these scripts here so that you can extend them for your own use; a minimal connection-and-query sketch is included after this README.
13 | - A [CloudFormation template](templates/identity-resolution.yml) to launch each of these resources along with the necessary infrastructure. This template will create an Amazon Neptune database cluster and load the sample dataset into the cluster. It will also create a SageMaker Jupyter Notebook instance and install the scripts and sample Jupyter notebook to this instance for you to run against the Neptune cluster. 14 | 15 | ### Architecture 16 | 17 | 18 | 19 | ### Quickstart 20 | 21 | To get started quickly, we have included the following quick-launch link for deploying this sample architecture. 22 | 23 | | Region | Stack | 24 | | ---- | ---- | 25 | |US East (Ohio) | [](https://us-east-2.console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-us-east-2/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 26 | |US East (N. Virginia) | [](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-us-east-1/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 27 | |US West (Oregon) | [](https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-us-west-2/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 28 | |EU West (Ireland) | [](https://eu-west-1.console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/create/review?templateURL=https://s3.amazonaws.com/aws-admartech-samples-eu-west-1/identity-resolution/templates/identity-resolution.yml&stackName=Identity-Graph-Sample) | 29 | 30 | Once you have launched the stack, go to the Outputs tab of the root stack and click on the SageMakerNotebook link. This will bring up the Jupyter notebook console of the SageMaker Jupyter Notebook instance that you created. 31 | 32 | 33 | 34 | Once logged into Jupyter, browse through the Neptune/identity-resolution directories until you see the identity-graph-sample.ipynb file. This is the Jupyter notebook containing all of the sample use cases and queries for using Amazon Neptune for Identity Graph. Click on the ipynb file. Additional instructions for each of the use cases are embedded in the Jupyter notebook (ipynb file). 35 | 36 | ## License Summary 37 | 38 | This library is licensed under the MIT-0 License. See the LICENSE file. 
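If you would rather launch the sample from code than from the quick-launch links above, the following is a minimal sketch of doing so with boto3. The template URL and stack name are taken from the us-east-1 entry in the table; the IAM capability flags and the output handling are assumptions about the template, so adjust them for your own deployment.

```python
# Hypothetical programmatic equivalent of the us-east-1 quick-launch link above.
import boto3

TEMPLATE_URL = (
    "https://s3.amazonaws.com/aws-admartech-samples-us-east-1/"
    "identity-resolution/templates/identity-resolution.yml"
)

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The stack provisions IAM roles (e.g. for Neptune bulk loading), so IAM
# capabilities are assumed to be required here.
stack = cfn.create_stack(
    StackName="Identity-Graph-Sample",
    TemplateURL=TEMPLATE_URL,
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Wait for the nested stacks (Neptune cluster, bulk load, SageMaker notebook
# instance) to finish creating before reading the outputs.
cfn.get_waiter("stack_create_complete").wait(StackName=stack["StackId"])

outputs = cfn.describe_stacks(StackName="Identity-Graph-Sample")["Stacks"][0].get("Outputs", [])
print({o["OutputKey"]: o["OutputValue"] for o in outputs})
```

Whichever way you launch the stack, the SageMakerNotebook link in the stack outputs is the entry point described above.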
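The notebook and the nepytune scripts listed under Getting Started talk to the Neptune cluster with Gremlin. As a rough, self-contained illustration, here is a small sketch using gremlin_python: the NEPTUNE_CLUSTER_ENDPOINT/NEPTUNE_CLUSTER_PORT environment variables and the transientId label, visited edge, and url property follow the conventions used in the sample code, but the specific traversal below is only an example and not one of the notebook's use-case queries.

```python
# A minimal sketch of querying the loaded identity graph, following the
# connection conventions used by the nepytune scripts in this repo.
import os

from gremlin_python.structure.graph import Graph
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

endpoint = os.environ["NEPTUNE_CLUSTER_ENDPOINT"]
port = os.getenv("NEPTUNE_CLUSTER_PORT", "8182")

connection = DriverRemoteConnection(f"ws://{endpoint}:{port}/gremlin", "g")
g = Graph().traversal().withRemote(connection)

# How many vertices of each label (transientId, website, identityGroup, ...) were loaded.
print(g.V().groupCount().by(T.label).next())

# Websites visited by one device/cookie identifier; "some-transient-id" is a
# placeholder, not a real ID from the sample dataset.
urls = g.V("some-transient-id").out("visited").values("url").limit(10).toList()
print(urls)

connection.close()
```

The actual use-case traversals (households, undecided users, brand interactions, early adopters, and so on) live in nepytune/usecase and are exercised from the notebook.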
39 | -------------------------------------------------------------------------------- /identity-resolution/data/DATA.md: -------------------------------------------------------------------------------- 1 | # Sample Dataset for Identity Resolution on Amazon Neptune 2 | 3 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/identity_group_edges.csv 4 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/identity_group_nodes.csv 5 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/ip_edges.csv 6 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/ip_nodes.csv 7 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/persistent_edges.csv 8 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/persistent_nodes.csv 9 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/transient_edges.csv 10 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/transient_nodes.csv 11 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/website_group_edges.csv 12 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/website_group_nodes.csv 13 | - http://s3.amazonaws.com/aws-admartech-samples/identity-resolution/data/websites.csv -------------------------------------------------------------------------------- /identity-resolution/images/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/images/architecture.png -------------------------------------------------------------------------------- /identity-resolution/images/sagemaker-link.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/images/sagemaker-link.png -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/benchmarks/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/__main__.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import asyncio 3 | import csv 4 | import logging 5 | import os 6 | import random 7 | import time 8 | import statistics 9 | 10 | import numpy as np 11 | 12 | from nepytune.benchmarks.query_runner import get_query_runner 13 | from nepytune.benchmarks.connection_pool import NeptuneConnectionPool 14 | 15 | QUERY_NAMES = [ 16 | 'get_sibling_attrs', 'undecided_user_check', 
'undecided_user_audience', 17 | 'brand_interaction_audience', 'get_all_transient_ids_in_household', 18 | 'early_website_adopters' 19 | ] 20 | 21 | parser = argparse.ArgumentParser(description="Run query benchmarks") 22 | parser.add_argument("--users", type=int, default=10) 23 | parser.add_argument("--samples", type=int, default=1000) 24 | parser.add_argument("--queries", default=['all'], type=str, 25 | nargs='+', choices=QUERY_NAMES + ['all']) 26 | parser.add_argument("--verbose", action='store_true') 27 | parser.add_argument("--csv", action="store_true") 28 | parser.add_argument("--output", type=str, default="results") 29 | args = parser.parse_args() 30 | 31 | if args.queries == ['all']: 32 | args.queries = QUERY_NAMES 33 | 34 | if (args.verbose): 35 | level = logging.DEBUG 36 | else: 37 | level = logging.INFO 38 | 39 | logging.basicConfig(level=level, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') 40 | logger = logging.getLogger(__name__) 41 | 42 | sem = asyncio.Semaphore(args.users) 43 | 44 | 45 | def custom_exception_handler(loop, context): 46 | """Stop event loop if exception occurs.""" 47 | loop.default_exception_handler(context) 48 | 49 | exception = context.get('exception') 50 | if isinstance(exception, Exception): 51 | print(context) 52 | loop.stop() 53 | 54 | 55 | async def run_query(query_runner, sample, semaphore, pool): 56 | """Run query with limit on concurrent connections.""" 57 | async with semaphore: 58 | return await query_runner.run(sample, pool) 59 | 60 | 61 | async def run(query, samples, pool): 62 | """Run query benchmark tasks.""" 63 | query_runner = get_query_runner(query, samples) 64 | 65 | logger.info("Initializing query data.") 66 | await asyncio.gather(query_runner.initialize()) 67 | 68 | queries = [] 69 | logger.info("Running benchmark.") 70 | for i in range(samples): 71 | queries.append(asyncio.create_task(run_query(query_runner, i, sem, pool))) 72 | results = await asyncio.gather(*queries) 73 | 74 | logger.info(f"Successful queries: {query_runner.succeded}") 75 | logger.info(f"Failed queries: {query_runner.failed}") 76 | 77 | benchmark_results = [result for result in results if result] 78 | return benchmark_results, query_runner.succeded, query_runner.failed 79 | 80 | 81 | def stats(results): 82 | """Print statistics for benchmark results.""" 83 | print(f"Samples: {args.samples}") 84 | print(f"Mean: {statistics.mean(results)}s") 85 | print(f"Median: {statistics.median(results)}s") 86 | a = np.array(results) 87 | for percentile in [50, 90, 99, 99.9, 99.99]: 88 | result = np.percentile(a, percentile) 89 | print(f"{percentile} percentile: {result}s") 90 | 91 | 92 | if __name__ == '__main__': 93 | loop = asyncio.get_event_loop() 94 | loop.set_exception_handler(custom_exception_handler) 95 | 96 | pool = NeptuneConnectionPool(args.users) 97 | try: 98 | loop.run_until_complete(pool.create()) 99 | for query in args.queries: 100 | logger.info(f"Benchmarking query: {query}") 101 | logger.info(f"Concurrent users: {args.users}") 102 | results, succeded, failed = loop.run_until_complete(run(query, args.samples, pool)) 103 | stats([measure[2] for measure in results]) 104 | if args.csv: 105 | dst = f"{args.output}/{query}-{args.samples}-{args.users}.csv" 106 | with open(dst, "w") as f: 107 | writer = csv.writer(f) 108 | for measure in results: 109 | writer.writerow(measure) 110 | query_stats = f"{args.output}/{query}-{args.samples}-{args.users}-stats.csv" 111 | with open(query_stats, "w") as f: 112 | writer = csv.writer(f) 113 | writer.writerow([succeded, 
failed]) 114 | finally: 115 | loop.run_until_complete(pool.destroy()) 116 | loop.run_until_complete(loop.shutdown_asyncgens()) 117 | loop.close() 118 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/benchmarks_visualization.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import math 3 | import os 4 | import tqdm 5 | import pandas as pd 6 | import plotly.graph_objects as go 7 | from intervaltree import IntervalTree 8 | from plotly.subplots import make_subplots 9 | 10 | 11 | def get_benchmarks_results_dataframes(results_path, query, instances, 12 | samples_by_users): 13 | """Convert benchmarks results into data frames.""" 14 | dfs_by_users = {} 15 | for users, samples in samples_by_users.items(): 16 | dfs = [] 17 | for instance in instances: 18 | df = pd.read_csv(f"{results_path}/{instance}/{query}-{samples}-{users}.csv", 19 | names=['start', 'end', 'duration']) 20 | df["instance"] = instance 21 | dfs.append(df) 22 | 23 | dfs_by_users[users] = pd.concat(dfs) 24 | return dfs_by_users 25 | 26 | 27 | def show_query_time_graph(benchmarks_dfs, yfunc, title, x_title): 28 | """Show query duration graph.""" 29 | fig = go.Figure() 30 | 31 | for users, df in benchmarks_dfs.items(): 32 | fig.add_trace( 33 | go.Box( 34 | x=df["instance"], 35 | y=yfunc(df["duration"]), 36 | boxpoints=False, 37 | boxmean=True, 38 | name=f"{users} users", 39 | hoverinfo="y", 40 | ) 41 | ) 42 | 43 | fig.update_layout( 44 | yaxis=dict( 45 | title=title, 46 | tickangle=-45, 47 | ), 48 | xaxis_title=x_title, 49 | boxmode='group' 50 | ) 51 | fig.show() 52 | 53 | 54 | def select_concurrent_queries_from_data(query, benchmarks_dfs, cache_path): 55 | """Measure concurrent queries from benchmark results.""" 56 | users_chart_data = {} 57 | cache_suffix = "cache_concurrent" 58 | 59 | if not os.path.isdir(cache_path): 60 | os.makedirs(cache_path) 61 | 62 | for users in benchmarks_dfs.keys(): 63 | cache_filename = f"{cache_path}/{query}-{users}-{cache_suffix}.csv" 64 | if os.path.isfile(cache_filename): 65 | with open(cache_filename) as f: 66 | print(f"Reading from cached file: {cache_filename}.") 67 | queries_df = pd.read_csv(f) 68 | queries_df = queries_df.set_index( 69 | pd.to_datetime(queries_df['timestamp'])) 70 | users_chart_data[users] = queries_df 71 | else: 72 | df = benchmarks_dfs[users].copy() 73 | # convert to milliseconds 74 | df["duration"] = df["duration"].multiply(1000) 75 | 76 | data_frames = [] 77 | for instance in df.instance.unique(): 78 | queries = get_concurrent_queries_by_time(df, users, instance) 79 | queries_df = pd.DataFrame( 80 | queries, columns=['timestamp', 'users', 'instance']) 81 | 82 | resampled = resample_queries_frame(queries_df, '100ms') 83 | 84 | data_frames.append(resampled) 85 | 86 | with open(cache_filename, "w") as f: 87 | pd.concat(data_frames).to_csv(f) 88 | 89 | users_chart_data[users] = pd.concat(data_frames) 90 | 91 | return users_chart_data 92 | 93 | 94 | def show_concurrent_queries_charts(concurrent_queries_dfs, x_title, y_title): 95 | """Show concurrent queries chart.""" 96 | for users, df in concurrent_queries_dfs.items(): 97 | instances = len(df.instance.unique()) 98 | 99 | fig = make_subplots(rows=instances, cols=1) 100 | 101 | for row, instance in enumerate(df.instance.unique(), start=1): 102 | instance_data = df[df.instance == instance] 103 | fig.add_trace( 104 | go.Scatter( 105 | x=[(idx - 
instance_data.index[0]).total_seconds() 106 | for idx in instance_data.index], 107 | y=instance_data["users"], 108 | name=instance 109 | ), 110 | row=row, 111 | col=1 112 | ) 113 | 114 | fig.update_yaxes( 115 | title_text=f"{y_title} for: {users} users", row=2, col=1) 116 | fig.update_xaxes(title_text=x_title, row=3, col=1) 117 | 118 | fig.show() 119 | 120 | 121 | def get_concurrent_queries_by_time(df, users, instance): 122 | """ 123 | Return concurrent running queries by time. 124 | 125 | Build interval tree of running query times. 126 | Calculate time range duration and check overlaping queries. 127 | """ 128 | idf = df.loc[df["instance"] == instance].copy() 129 | 130 | idf['start'] = pd.to_datetime(idf['start'], unit='s') 131 | idf['end'] = pd.to_datetime(idf['end'], unit='s') 132 | 133 | # get nsmallest and nlargest to not leave single running queries 134 | start = idf.nsmallest(int(users), "start")["start"].max() 135 | end = idf.nlargest(int(users), "end")["end"].min() 136 | 137 | step = math.ceil(idf['duration'].min()/10) 138 | 139 | t = IntervalTree() 140 | for index, row in idf.iterrows(): 141 | t[row["start"]:row["end"]] = None 142 | 143 | tr = pd.to_datetime(pd.date_range( 144 | start=start, end=end, freq=f"{step}ms")) 145 | 146 | rows = [] 147 | for i in tqdm.tqdm(range(len(tr)-1)): 148 | r1 = tr[i] 149 | r2 = tr[i+1] 150 | concurrent_queries = len(t[r1:r2]) 151 | rows.append([r1, concurrent_queries, instance]) 152 | 153 | return rows 154 | 155 | 156 | def resample_queries_frame(df, freq): 157 | """Resample queries frame with given frequency.""" 158 | df = df.set_index(pd.to_datetime(df['timestamp'])) 159 | 160 | resampled = pd.DataFrame() 161 | resampled["users"] = df.users.resample(freq).mean().bfill() 162 | resampled["instance"] = df.instance.resample(freq).last().bfill() 163 | 164 | return resampled 165 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/connection_pool.py: -------------------------------------------------------------------------------- 1 | import os 2 | from aiogremlin import DriverRemoteConnection 3 | 4 | CONNECTION_RETRIES = 5 5 | CONNECTION_HEARTBEAT = 0.1 6 | 7 | class NeptuneConnectionPool(): 8 | def __init__(self, users): 9 | self.users = users 10 | self.active = [] 11 | self.available = [] 12 | 13 | async def create(self): 14 | for _ in range(self.users): 15 | conn = await self.init_neptune_connection() 16 | self.available.append(conn) 17 | 18 | async def destroy(self): 19 | for conn in self.active + self.available: 20 | await conn.close() 21 | 22 | def lock(self): 23 | for _ in range(CONNECTION_RETRIES): 24 | if self.available: 25 | conn = self.available.pop() 26 | self.active.append(conn) 27 | return conn 28 | raise ConnectionError("Cannot aquire connection from pool.") 29 | 30 | def unlock(self, conn): 31 | self.active.remove(conn) 32 | self.available.append(conn) 33 | 34 | async def init_neptune_connection(self): 35 | """Init Neptune connection.""" 36 | endpoint = os.environ["NEPTUNE_CLUSTER_ENDPOINT"] 37 | port = os.getenv("NEPTUNE_CLUSTER_PORT", "8182") 38 | return await DriverRemoteConnection.open(f"ws://{endpoint}:{port}/gremlin", "g") 39 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/drop_graph.py: -------------------------------------------------------------------------------- 1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. 
2 | # All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"). 5 | # You may not use this file except in compliance with the License. 6 | # A copy of the License is located at 7 | # 8 | # http://aws.amazon.com/apache2.0/ 9 | # 10 | # or in the "license" file accompanying this file. 11 | # This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 12 | # either express or implied. See the License for the specific language governing permissions 13 | # and limitations under the License. 14 | 15 | ''' 16 | @author: krlawrence 17 | @copyright: Amazon.com, Inc. or its affiliates 18 | @license: Apache2 19 | @contact: @krlawrence 20 | @deffield created: 2019-04-02 21 | 22 | This code uses Gremlin Python to drop an entire graph. 23 | 24 | It is intended as an example of a multi-threaded strategy for dropping vertices and edges. 25 | 26 | The following overall strategy is currently used. 27 | 28 | 1. Fetch all edge IDs 29 | - Edges are fetched using multiple threads in large batches. 30 | - Smaller slices are queued up for worker threads to drop. 31 | 2. Drop all edges using those IDs 32 | - Worker threads read the slices of IDs from the queue and drop the edges. 33 | 3. Fetch all vertex IDs 34 | - Vertices are fetched using multiple threads in large batches. 35 | - Smaller slices are queued up for worker threads to drop. 36 | 4. Drop all vertices using the fetched IDs 37 | - Worker threads read the slices of IDs from the queue and drop the vertices. 38 | 39 | NOTES: 40 | 1: To avoid possible concurrent write exceptions, no fetching and dropping is done in parallel. 41 | 2: Edges are explicitly dropped before vertices, again to avoid any conflicting writes. 42 | 3: This code uses an in-memory, thread-safe queue. The amount of data that can be processed 43 | will depend upon how big of an in-memory queue can be created. It has been tested using a 44 | graph containing 10M vertices and 10M edges. 45 | 4: While the code as written deletes an entire graph, it could be easily adapted to delete part 46 | of a graph instead. 47 | 5: The following environment variables should be defined before this code is run. 48 | NEPTUNE_PORT - The port that the Neptune endpoint is listening on such as 8182. 49 | NEPTUNE_WRITER - The Neptune Cluster endpoint name such as 50 | "mygraph.cluster-abcdefghijkl.us-east-1.neptune.amazonaws.com" 51 | 6: This script assumes that the 'gremlinpython' library has already been installed. 52 | 7: For massive graphs (with hundreds of millions or billions of elements) creating a new 53 | Neptune cluster will be faster than trying to delete everything programmatically. 54 | 55 | STILL TODO: 56 | The code could be further improved by offering an option to only drop the edges and by 57 | removing the need to count all edges and all vertices before starting work. The use of 58 | threads could be further optimized in future to get more reuse out of the fetcher threads. 59 | One further refinement that would enable very large graphs to be dropped would be to 60 | avoid the need to read all element IDs into memory before dropping can start by doing 61 | that process iteratively. This script should probably also be turned into a class.
62 | ''' 63 | 64 | from gremlin_python.structure.graph import Graph 65 | from gremlin_python.process.graph_traversal import __ 66 | from gremlin_python.process.strategies import * 67 | from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection 68 | from gremlin_python.process.traversal import * 69 | from threading import Thread 70 | from queue import Queue 71 | import threading 72 | import time 73 | import math 74 | import os 75 | 76 | # The fetch size and batch sizes should not need to be changed but can be if necessary. 77 | # As a guide, the number of threads should be twice the number of vCPU available of the Neptune write master node. 78 | 79 | MAX_FETCH_SIZE = 50000 # Maximum number of IDs to fetch at a time. A large number limits the number of range() calls 80 | EDGE_BATCH_SIZE = 500 # Number of edges to drop in each call to drop(). This affects the queue entry size. 81 | VERTEX_BATCH_SIZE = 500 # Number of vertices to drop in each call to drop(). This affects the queue entry size. 82 | MAX_FETCHERS = 8 # Maximum number of threads allowed to be created for fetching vertices and edges 83 | NUM_THREADS = 8 # Number of local workers to create to process the drop queue. 84 | POOL_SIZE = 8 # Python driver default is 4. Change to create a bigger pool. 85 | MAX_WORKERS = 8 # Python driver default is 5 * number of CPU on client machine. 86 | 87 | # Ready flag is used to tell workers they can start processing the queue 88 | ready_flag = threading.Event() 89 | 90 | # The wait queues are used to make sure all threads have finished fetching before the 91 | # workers start processing the IDs to be dropped. 92 | edge_fetch_wait_queue = Queue() 93 | vertex_fetch_wait_queue = Queue() 94 | 95 | # Queue that will contain the node and edge IDs that need to be dropped 96 | pending_work = Queue() 97 | 98 | 99 | #################################################################################### 100 | # fetch_edges 101 | # 102 | # Calculate how many threads are needed to fetch the edge IDs and create the threads 103 | #################################################################################### 104 | def fetch_edges(g, q): 105 | print("\nPROCESSING EDGES") 106 | print("Assessing number of edges.") 107 | count = g.E().count().next() 108 | print(count, "edges to drop") 109 | if count > 0: 110 | fetch_size = MAX_FETCH_SIZE 111 | num_threads = min(math.ceil(count/fetch_size),MAX_FETCHERS) 112 | bracket_size = math.ceil(count/num_threads) 113 | print("Will use", num_threads, "threads.") 114 | print("Each thread will queue", bracket_size) 115 | print("Queueing IDs") 116 | 117 | start_offset = 0 118 | 119 | fetchers = [None] * num_threads 120 | 121 | for i in range(num_threads): 122 | edge_fetch_wait_queue.put(i) 123 | fetchers[i] = Thread(target=edge_fetcher, args=(g, pending_work,start_offset,bracket_size,)) 124 | fetchers[i].setDaemon(True) 125 | fetchers[i].start() 126 | start_offset += bracket_size 127 | return count 128 | 129 | #################################################################################### 130 | # fetch_vertices 131 | # 132 | # Calculate how many threads are needed to fetch the node IDs and create the threads 133 | #################################################################################### 134 | def fetch_vertices(g, q): 135 | print("\nPROCESSING VERTICES") 136 | print("Assessing number of vertices.") 137 | count = g.V().count().next() 138 | print(count, "vertices to drop") 139 | if count > 0: 140 | fetch_size = MAX_FETCH_SIZE 141 | num_threads 
= min(math.ceil(count/fetch_size),MAX_FETCHERS) 142 | bracket_size = math.ceil(count/num_threads) 143 | print("Will use", num_threads, "threads.") 144 | print("Each thread will queue", bracket_size) 145 | print("Queueing IDs") 146 | 147 | start_offset = 0 148 | 149 | fetchers = [None] * num_threads 150 | 151 | for i in range(num_threads): 152 | vertex_fetch_wait_queue.put(i) 153 | fetchers[i] = Thread(target=vertex_fetcher, args=(g, pending_work,start_offset,bracket_size,)) 154 | fetchers[i].setDaemon(True) 155 | fetchers[i].start() 156 | start_offset += bracket_size 157 | return count 158 | 159 | #################################################################################### 160 | # edge_fetcher 161 | # 162 | # Fetch edges in large batches and queue them up for deletion in smaller slices 163 | #################################################################################### 164 | def edge_fetcher(g, q,start_offset,bracket_size): 165 | p1 = start_offset 166 | inc = min(bracket_size,MAX_FETCH_SIZE) 167 | p2 = start_offset + inc 168 | org = p1 169 | done = False 170 | nm = threading.currentThread().name 171 | print(nm,"[edges] Fetching from offset", start_offset, "with end at",start_offset+bracket_size) 172 | edge_fetch_wait_queue.get() 173 | 174 | done = False 175 | while not done: 176 | success = False 177 | while not success: 178 | print(nm,"[edges] retrieving range ({},{} batch=size={})".format(p1,p2,p2-p1)) 179 | try: 180 | edges = g.E().range(p1,p2).id().toList() 181 | success = True 182 | except: 183 | print("***",nm,"Exception while fetching. Retrying.") 184 | time.sleep(1) 185 | 186 | slices = math.ceil(len(edges)/EDGE_BATCH_SIZE) 187 | s1 = 0 188 | s2 = 0 189 | for i in range(slices): 190 | s2 += min(len(edges)-s1,EDGE_BATCH_SIZE) 191 | q.put(["edges",edges[s1:s2]]) 192 | s1 = s2 193 | p1 += inc 194 | if p1 >= org + bracket_size: 195 | done = True 196 | else: 197 | p2 += min(inc, org+bracket_size - p2) 198 | size = q.qsize() 199 | print(nm,"[edges] work done. Queue size ==>",size) 200 | edge_fetch_wait_queue.task_done() 201 | return 202 | 203 | #################################################################################### 204 | # vertex_fetcher 205 | # 206 | # Fetch vertices in large batches and queue them up for deletion in smaller slices 207 | #################################################################################### 208 | def vertex_fetcher(g, q,start_offset,bracket_size): 209 | p1 = start_offset 210 | inc = min(bracket_size,MAX_FETCH_SIZE) 211 | p2 = start_offset + inc 212 | org = p1 213 | done = False 214 | nm = threading.currentThread().name 215 | print(nm,"[vertices] Fetching from offset", start_offset, "with end at",start_offset+bracket_size) 216 | vertex_fetch_wait_queue.get() 217 | 218 | done = False 219 | while not done: 220 | success = False 221 | while not success: 222 | print(nm,"[vertices] retrieving range ({},{} batch=size={})".format(p1,p2,p2-p1)) 223 | try: 224 | vertices = g.V().range(p1,p2).id().toList() 225 | success = True 226 | except: 227 | print("***",nm,"Exception while fetching. Retrying.") 228 | time.sleep(1) 229 | 230 | slices = math.ceil(len(vertices)/VERTEX_BATCH_SIZE) 231 | s1 = 0 232 | s2 = 0 233 | for i in range(slices): 234 | s2 += min(len(vertices)-s1,VERTEX_BATCH_SIZE) 235 | q.put(["vertices",vertices[s1:s2]]) 236 | s1 = s2 237 | p1 += inc 238 | if p1 >= org + bracket_size: 239 | done = True 240 | else: 241 | p2 += min(inc, org+bracket_size - p2) 242 | size = q.qsize() 243 | print(nm,"[vertices] work done. 
Queue size ==>",size) 244 | vertex_fetch_wait_queue.task_done() 245 | return 246 | 247 | #################################################################################### 248 | # worker 249 | # 250 | # Worker task that will handle deletion of IDs that are in the queue. Multiple workers 251 | # will be created based on the value specified for NUM_THREADS. 252 | #################################################################################### 253 | def worker(g, q): 254 | c = 0 255 | nm = threading.currentThread().name 256 | print("Worker", nm, "started") 257 | while True: 258 | ready = ready_flag.wait() 259 | if not q.empty(): 260 | work = q.get() 261 | successful = False 262 | while not successful: 263 | try: 264 | if len(work[1]) > 0: 265 | print("[{}] {} deleting {} {}".format(c,nm,len(work[1]), work[0])) 266 | if work[0] == "edges": 267 | g.E(work[1]).drop().iterate() 268 | else: 269 | g.V(work[1]).drop().iterate() 270 | successful = True 271 | except: 272 | # A concurrent modification error can occur if we try to drop an element 273 | # that is already loacked by some other process accessing the graph. 274 | # If that happens sleep briefly and try again. 275 | print("{} Exception dropping some {} will retry".format(nm,work[0])) 276 | print(sys.exc_info()[0]) 277 | print(sys.exc_info()[1]) 278 | time.sleep(1) 279 | c += 1 280 | q.task_done() 281 | 282 | 283 | 284 | def drop(g): 285 | #################################################################################### 286 | # Do the work! 287 | # 288 | #################################################################################### 289 | # Fetch the edges 290 | equeue_start_time = time.time() 291 | ecount = fetch_edges(g, pending_work) 292 | edge_fetch_wait_queue.join() 293 | equeue_end_time = time.time() 294 | 295 | # Create the pool of workers that will drop the edges and vertices 296 | print("Creating drop() workers") 297 | 298 | workers = [None] * NUM_THREADS 299 | ready_flag.set() 300 | 301 | edrop_start_time = time.time() 302 | for i in range(NUM_THREADS): 303 | workers[i] = Thread(target=worker, args=(g, pending_work,)) 304 | workers[i].setDaemon(True) 305 | workers[i].start() 306 | 307 | # Wait until all of the edges in the queue have been dropped 308 | pending_work.join() 309 | edrop_end_time = time.time() 310 | 311 | # Tell the workers to wait until the vertex IDs have all been enqueued 312 | ready_flag.clear() 313 | 314 | # Fetch the vertex IDs 315 | vqueue_start_time = time.time() 316 | vcount = fetch_vertices(g, pending_work) 317 | vertex_fetch_wait_queue.join() 318 | vqueue_end_time = time.time() 319 | 320 | # Tell the workers to start dropping the vertices 321 | vdrop_start_time = time.time() 322 | ready_flag.set() 323 | pending_work.join() 324 | vdrop_end_time = time.time() 325 | 326 | # Calculate how long each phase took 327 | eqtime = equeue_end_time - equeue_start_time 328 | vqtime = vqueue_end_time - vqueue_start_time 329 | etime = edrop_end_time - edrop_start_time 330 | vtime = vdrop_end_time - vdrop_start_time 331 | 332 | print("Summary") 333 | print("-------") 334 | print("Worker threads", NUM_THREADS) 335 | print("Max fetch size", MAX_FETCH_SIZE) 336 | print("Edge batch size", EDGE_BATCH_SIZE) 337 | print("Vertex batch size", VERTEX_BATCH_SIZE) 338 | print("Edges dropped", ecount) 339 | print("Vertices dropped", vcount) 340 | print("Time taken to queue edges", eqtime) 341 | print("Time taken to drop edges", etime) 342 | print("Time taken to queue vertices", vqtime) 343 | print("Time taken to drop 
vertices", vtime) 344 | 345 | print("TOTAL TIME",eqtime + vqtime + etime + vtime) 346 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/ingestion.py: -------------------------------------------------------------------------------- 1 | """Common code for running benchmarks.""" 2 | 3 | import csv 4 | import json 5 | import logging 6 | import time 7 | import os 8 | 9 | import boto3 10 | import botocore 11 | import requests 12 | 13 | from itertools import islice 14 | 15 | from gremlin_python.structure.graph import Graph 16 | from gremlin_python.process.graph_traversal import __ 17 | from gremlin_python.process.strategies import * 18 | from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection 19 | from gremlin_python.process.traversal import * 20 | 21 | import pandas as pd 22 | 23 | import plotly.graph_objects as go 24 | 25 | from nepytune.benchmarks.drop_graph import drop 26 | 27 | AWS_REGION = os.getenv("AWS_REGION") 28 | NEPTUNE_ENDPOINT = os.getenv('NEPTUNE_CLUSTER_ENDPOINT') 29 | NEPTUNE_PORT = os.getenv('NEPTUNE_CLUSTER_PORT') 30 | NEPTUNE_LOADER_ENDPOINT = f"https://{NEPTUNE_ENDPOINT}:{NEPTUNE_PORT}/loader" 31 | NEPTUNE_GREMLIN_ENDPOINT = f"ws://{NEPTUNE_ENDPOINT}:{NEPTUNE_PORT}/gremlin" 32 | NEPTUNE_LOAD_ROLE_ARN = os.getenv("NEPTUNE_LOAD_ROLE_ARN") 33 | BUCKET = os.getenv("S3_PROCESSED_DATASET_BUCKET") 34 | DATASET_DIR = "../../dataset" 35 | 36 | GREMLIN_POOL_SIZE = 8 # Python driver default is 4. Change to create a bigger pool. 37 | GREMLIN_MAX_WORKERS = 8 # Python driver default is 5 * number of CPU on client machine. 38 | 39 | # Initialize Neptune connection 40 | graph=Graph() 41 | connection = DriverRemoteConnection(NEPTUNE_GREMLIN_ENDPOINT,'g', 42 | pool_size=GREMLIN_POOL_SIZE, 43 | max_workers=GREMLIN_MAX_WORKERS) 44 | g = graph.traversal().withRemote(connection) 45 | 46 | 47 | # Initialize logger 48 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') 49 | logger = logging.getLogger() 50 | 51 | 52 | # Make dataset directory 53 | if not os.path.isdir(DATASET_DIR): 54 | os.mkdir(DATASET_DIR) 55 | 56 | 57 | def download_file(bucket, file): 58 | """Download file from S3.""" 59 | try: 60 | logger.info("Start downloading %s.", file) 61 | dst = f"./{DATASET_DIR}/{file}" 62 | if os.path.isfile(dst): 63 | logger.info("File exists, skipping.") 64 | return 65 | 66 | s3 = boto3.resource('s3') 67 | s3.Bucket(bucket).download_file(file, f"./{DATASET_DIR}/{file}") 68 | except botocore.exceptions.ClientError as e: 69 | if e.response['Error']['Code'] == "404": 70 | print("The object does not exist.") 71 | else: 72 | raise 73 | 74 | 75 | def upload_file(file_name, bucket, prefix, key=None): 76 | """Upload file to S3 bucket.""" 77 | if key is None: 78 | key = file_name 79 | object_name = f"{prefix}/{key}" 80 | s3_client = boto3.client('s3') 81 | try: 82 | response = s3_client.upload_file(file_name, bucket, object_name) 83 | except botocore.exceptions.ClientError as e: 84 | raise e 85 | return object_name 86 | 87 | 88 | def wait_for_load_complete(load_id): 89 | """Wait for Neptune load to complete.""" 90 | while not is_load_completed(load_id): 91 | time.sleep(10) 92 | 93 | 94 | def is_load_completed(load_id): 95 | """Check if Neptune load is completed""" 96 | response = requests.get(f"{NEPTUNE_LOADER_ENDPOINT}/{load_id}").json() 97 | status = response["payload"]["overallStatus"]["status"] 98 | if status == "LOAD_IN_PROGRESS": 
99 | return False 100 | return True 101 | 102 | 103 | def copy_n_lines(src, dst, n): 104 | """Copy N lines from src to dst file.""" 105 | if os.path.isfile(dst): 106 | logger.info("File: %s exists, skipping.", dst) 107 | return 108 | 109 | with open(src) as src_file: 110 | lines = islice(src_file, n) 111 | with open(dst, 'w') as dst_file: 112 | dst_file.writelines(lines) 113 | 114 | 115 | def populate_graph(vertices_n): 116 | import tempfile 117 | import uuid 118 | 119 | logger.info("Populating graph with %s vertices.", vertices_n) 120 | 121 | if vertices_n == 0: 122 | return 123 | 124 | labels = '"~id","attr1:String","attr2:String","~label"' 125 | 126 | fd, path = tempfile.mkstemp() 127 | try: 128 | with os.fdopen(fd, 'w') as tmp: 129 | tmp.write(labels + '\n') 130 | for _ in range(vertices_n): 131 | node_id = str(uuid.uuid4()) 132 | attr1 = node_id 133 | attr2 = node_id 134 | label = "generatedVertice" 135 | tmp.write(f"{node_id},{attr1},{attr2},{label}\n") 136 | key = upload_file(path, BUCKET, "generated") 137 | load_into_neptune(BUCKET, key) 138 | s3 = boto3.resource("s3") 139 | s3.Object(BUCKET, key).delete() 140 | 141 | finally: 142 | os.remove(path) 143 | 144 | 145 | 146 | def load_into_neptune(bucket, key): 147 | """Load CSV file into neptune.""" 148 | data = { 149 | "source" : f"s3://{bucket}/{key}", 150 | "format" : "csv", 151 | "iamRoleArn" : NEPTUNE_LOAD_ROLE_ARN, 152 | "region" : AWS_REGION, 153 | "failOnError" : "FALSE", 154 | "parallelism" : "MEDIUM", 155 | "updateSingleCardinalityProperties" : "FALSE" 156 | } 157 | response = requests.post(NEPTUNE_LOADER_ENDPOINT, json=data) 158 | json_response = response.json() 159 | load_id = json_response["payload"]["loadId"] 160 | logger.info("Waiting for load %s to complete.", load_id) 161 | wait_for_load_complete(load_id) 162 | logger.info("Load %s completed", load_id) 163 | 164 | return load_id 165 | 166 | 167 | def get_loading_time(load_id): 168 | response = requests.get(f"{NEPTUNE_LOADER_ENDPOINT}/{load_id}").json() 169 | time_spent = response["payload"]["overallStatus"]["totalTimeSpent"] 170 | return time_spent 171 | 172 | 173 | def benchmark_loading_data(source, entities_to_add, 174 | initial_sizes=[0], dependencies=[], drop=True): 175 | """ 176 | Benchmark loading data into AWS Neptune. 177 | 178 | Graph is dropped before every benchmark run. 179 | Benchmark measures loading time for vertices and edges. 180 | Graph can be populated with initial random data. 
181 | """ 182 | 183 | filename = f"{source}.csv" 184 | download_file(BUCKET, filename) 185 | prefix = "splitted" 186 | 187 | results = {} 188 | 189 | logger.info("Loading dependencies.") 190 | for dependency in dependencies: 191 | filename = f"{DATASET_DIR}/{dependency}" 192 | logger.info("Uploading %s to S3 bucket.", dependency) 193 | key = upload_file(filename, BUCKET, "dependencies", key=dependency) 194 | load_id = load_into_neptune(BUCKET, key) 195 | 196 | for initial_graph_size in initial_sizes: 197 | results[initial_graph_size] = {} 198 | 199 | for entities_n in entities_to_add: 200 | if drop: 201 | drop(g) 202 | populate_graph(initial_graph_size) 203 | 204 | logger.info("Generating file with %s entities.", entities_n) 205 | dst = f"{DATASET_DIR}/{source}_{entities_n}.csv" 206 | copy_n_lines(f"{DATASET_DIR}/{source}.csv", dst, entities_n) 207 | 208 | logger.info("Uploading %s to S3 bucket.", dst) 209 | csv_file = upload_file(dst, BUCKET, prefix, f"{source}_{entities_n}.csv") 210 | load_id = load_into_neptune(BUCKET, csv_file) 211 | 212 | loading_time = get_loading_time(load_id) 213 | logger.info("Loading %d nodes lasts for %d seconds.", entities_n, loading_time) 214 | 215 | results[initial_graph_size][entities_n] = loading_time 216 | 217 | return results 218 | 219 | 220 | def save_result_to_csv(source, results, dst="."): 221 | """Save ingestion results to CSV file.""" 222 | with open(f"{dst}/ingestion-{source}.csv", "w") as f: 223 | writer = csv.writer(f) 224 | for initial_size, result in results.items(): 225 | for entites, time in result.items(): 226 | writer.writerow(initial_size, entites, time) 227 | 228 | 229 | def draw_loading_benchmark_results(results, title, x_title, y_title): 230 | """Draw loading benchmark results.""" 231 | fig_data = [ 232 | { 233 | "type": "bar", 234 | "name": f"Initial graph size: {k}", 235 | "x": list(v.keys()), 236 | "y": list(v.values()) 237 | } for k,v in results.items() 238 | ] 239 | 240 | _draw_group_bar(fig_data, title, x_title, y_title) 241 | 242 | 243 | def draw_from_csv(csv, title, x_title, y_title): 244 | """Draw loading benchmark from csv.""" 245 | df = pd.read_csv(csv, names=['initial', 'entities', 'duration']) 246 | 247 | fig_data = [ 248 | { 249 | "type": "bar", 250 | "name": f"Initial graph size: {initial_graph_size}", 251 | "x": group["entities"], 252 | "y": group["duration"] 253 | } for initial_graph_size, group in df.groupby('initial') 254 | ] 255 | 256 | _draw_group_bar(fig_data, title, x_title, y_title) 257 | 258 | 259 | def _draw_group_bar(fig_data, title, x_title, y_title): 260 | fig = go.Figure({ 261 | "data": fig_data, 262 | "layout": { 263 | "title": {"text": title}, 264 | "xaxis.type": "category", 265 | "barmode": "group", 266 | "xaxis_title": x_title, 267 | "yaxis_title": y_title, 268 | } 269 | }) 270 | 271 | fig.show() 272 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/benchmarks/query_runner.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import math 4 | import random 5 | import time 6 | import asyncio 7 | 8 | from datetime import timedelta 9 | 10 | from gremlin_python.process.graph_traversal import values, outE, inE 11 | from gremlin_python.process.traversal import Column, Order 12 | from aiogremlin import DriverRemoteConnection, Graph 13 | from aiogremlin.exception import GremlinServerError 14 | 15 | from nepytune.usecase import ( 16 | get_sibling_attrs, 
brand_interaction_audience, 17 | get_all_transient_ids_in_household, undecided_user_audience_check, 18 | undecided_users_audience, get_activity_of_early_adopters 19 | ) 20 | 21 | logger = logging.getLogger(__name__) 22 | 23 | ARG_COLLECTION = 1000 24 | COIN = 0.1 25 | 26 | 27 | class QueryRunner: 28 | """Query runner.""" 29 | 30 | def __init__(self, query, samples): 31 | self.args = [] 32 | self.query = query 33 | self.samples = int(samples) 34 | self.succeded = 0 35 | self.failed = 0 36 | 37 | async def run(self, sample, pool): 38 | """Run query and return measure.""" 39 | sample_no = sample + 1 40 | try: 41 | connection = pool.lock() 42 | g = Graph().traversal().withRemote(connection) 43 | args = self.get_args(sample) 44 | try: 45 | start = time.time() 46 | result = await self.query(g, **args).toList() 47 | end = time.time() 48 | _log_query_info(self.samples, sample_no, args, result) 49 | self.succeded += 1 50 | return (start, end, end - start) 51 | except GremlinServerError as e: 52 | logger.debug(f"Sample {sample_no} failed: {e.msg}") 53 | self.failed += 1 54 | return None 55 | finally: 56 | pool.unlock(connection) 57 | except ConnectionError as e: 58 | logger.debug(f"Sample {sample_no} failed: {e}") 59 | self.failed += 1 60 | return None 61 | 62 | 63 | async def initialize(self): 64 | pass 65 | 66 | def get_args(self, sample): 67 | """Get args for query function.""" 68 | return self.args[sample % len(self.args)] 69 | 70 | 71 | class SiblingsAttrsRunner(QueryRunner): 72 | def __init__(self, samples): 73 | super().__init__(query=get_sibling_attrs, samples=samples) 74 | 75 | async def initialize(self): 76 | connection = await init_neptune_connection() 77 | async with connection: 78 | g = Graph().traversal().withRemote(connection) 79 | transient_ids = await get_household_members(g, ARG_COLLECTION) 80 | 81 | self.args = [ 82 | { 83 | "transient_id": transient_id 84 | } for transient_id in transient_ids 85 | ] 86 | 87 | 88 | class BrandInteractionRunner(QueryRunner): 89 | def __init__(self, samples): 90 | super().__init__(query=brand_interaction_audience, samples=samples) 91 | 92 | async def initialize(self): 93 | connection = await init_neptune_connection() 94 | async with connection: 95 | g = Graph().traversal().withRemote(connection) 96 | websites = await ( 97 | g.V().hasLabel("website").coin(COIN).limit(ARG_COLLECTION).toList() 98 | ) 99 | 100 | self.args = [ 101 | { 102 | "website_url": website 103 | } for website in websites 104 | ] 105 | 106 | 107 | class AudienceCheck(QueryRunner): 108 | def __init__(self, samples): 109 | self.args = [] 110 | super().__init__(query=undecided_user_audience_check, samples=samples) 111 | 112 | async def initialize(self): 113 | connection = await init_neptune_connection() 114 | async with connection: 115 | g = Graph().traversal().withRemote(connection) 116 | 117 | data = await ( 118 | g.V().hasLabel("transientId").coin(COIN).limit(ARG_COLLECTION) 119 | .group() 120 | .by() 121 | .by( 122 | outE("visited").coin(COIN).inV().in_( 123 | "links_to").out("links_to").coin(COIN) 124 | .path() 125 | .by(values("uid")) 126 | .by(values("ts")) 127 | .by(values("url")) 128 | .by(values("url")) 129 | .by(values("url")) 130 | ).select(Column.values).unfold() 131 | ).toList() 132 | 133 | self.args = [ 134 | { 135 | "transient_id": result[0], 136 | "website_url": result[2], 137 | "thank_you_page_url": result[4], 138 | "since": result[1] - timedelta(days=random.randint(30, 60)), 139 | "min_visited_count": random.randint(2, 5) 140 | } for result in data if result 141 | ] 
142 | 143 | 144 | class AudienceGeneration(QueryRunner): 145 | def __init__(self, samples): 146 | self.args = [] 147 | super().__init__(query=undecided_users_audience, samples=samples) 148 | 149 | async def initialize(self): 150 | connection = await init_neptune_connection() 151 | async with connection: 152 | g = Graph().traversal().withRemote(connection) 153 | 154 | most_visited_websites = await get_most_active_websites(g) 155 | data = await ( 156 | g.V(most_visited_websites) 157 | .group() 158 | .by() 159 | .by( 160 | inE().hasLabel("visited").coin(COIN).inV() 161 | .in_("links_to").out("links_to").coin(COIN) 162 | .path() 163 | .by(values("url")) # visited website 164 | .by(values("ts")) # timestamp 165 | .by(values("url")) # visited website 166 | .by(values("url")) # root website 167 | .by(values("url").limit(1)) # thank you page 168 | ).select(Column.values).unfold() 169 | ).toList() 170 | 171 | self.args = [ 172 | { 173 | "website_url": result[0], 174 | "thank_you_page_url": result[4], 175 | "since": result[1] - timedelta(days=random.randint(30, 60)), 176 | "min_visited_count": random.randint(2, 5) 177 | } for result in data 178 | ] 179 | 180 | 181 | class EarlyAdopters(QueryRunner): 182 | def __init__(self, samples): 183 | super().__init__( 184 | query=get_activity_of_early_adopters, 185 | samples=samples) 186 | 187 | async def initialize(self): 188 | connection = await init_neptune_connection() 189 | async with connection: 190 | g = Graph().traversal().withRemote(connection) 191 | most_visited_websites = await get_most_active_websites(g) 192 | 193 | self.args = [ 194 | { 195 | "thank_you_page_url": website 196 | } for website in most_visited_websites 197 | ] 198 | 199 | 200 | class HouseholdDevices(QueryRunner): 201 | def __init__(self, samples): 202 | super().__init__(query=get_all_transient_ids_in_household, 203 | samples=samples) 204 | 205 | async def initialize(self): 206 | connection = await init_neptune_connection() 207 | async with connection: 208 | g = Graph().traversal().withRemote(connection) 209 | household_members = await get_household_members(g, ARG_COLLECTION) 210 | 211 | self.args = [ 212 | { 213 | "transient_id": member 214 | } for member in household_members 215 | ] 216 | 217 | 218 | async def get_household_members(g, limit, coin=COIN): 219 | """Return transient IDs which are memebers of identity group.""" 220 | return await ( 221 | g.V().hasLabel("identityGroup").out("member") 222 | .out("has_identity") 223 | .coin(coin).limit(limit).toList() 224 | ) 225 | 226 | 227 | async def init_neptune_connection(): 228 | """Init Neptune connection.""" 229 | endpoint = os.environ["NEPTUNE_CLUSTER_ENDPOINT"] 230 | port = os.getenv("NEPTUNE_CLUSTER_PORT", "8182") 231 | return await DriverRemoteConnection.open(f"ws://{endpoint}:{port}/gremlin", "g") 232 | 233 | 234 | def _log_query_info(samples, sample_no, args, result): 235 | logger.debug(f"Sample {sample_no} args: {args}") 236 | if len(result) > 100: 237 | logger.debug("Truncating query result.") 238 | logger.debug(f"Sample {sample_no} result: {result[:100]}") 239 | else: 240 | logger.debug(f"Sample {sample_no} result: {result}") 241 | 242 | samples_checkpoint = math.ceil(samples*0.1) 243 | if sample_no % samples_checkpoint == 0: 244 | logger.info(f"Finished {sample_no} of {samples} samples.") 245 | 246 | 247 | async def get_most_active_websites(g): 248 | """Return websites with most visits.""" 249 | # Query for most visited websites is quite slow. 250 | # Thus visited websites are hardcoded. 
251 | 252 | # most_visited_websites = await ( 253 | # g.V().hasLabel("website") 254 | # .order().by(inE('visited').count(), Order.decr) 255 | # .limit(1000).toList() 256 | # ) 257 | 258 | most_visited_websites = [ 259 | "8f6b27fe6f0dcdae", 260 | "a997482113271d8f/5758f309e11931ce", 261 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb?aac4d7fceeea7dcb", 262 | "6e89cfa05ae05032", 263 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?1d5bfa3db363b460", 264 | "3cfce7aac081cf80/49d249c29289f7a5/5ea0237ac10c9de3?1911788a62d90dd4", 265 | "12a78ad541e95ae", 266 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?77af8f56d61f1f7", 267 | "ed95a9a5be30e4c8/5162fc6a223f248d", 268 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb", 269 | "2e272bb1ae067296/49ffef01dbcd3442", 270 | "6ea77fc3ea42bd5b", 271 | "4c980617e02858a4", 272 | "b23e286d713f61fd/f9077d4b41c9e32e", 273 | "c3c6e6e856091767", 274 | "12a78ad541e95ae/7de2f069da3a3655", 275 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?b80a3fe036e3d80", 276 | "6ae12ea8ec730ba5/281bb5a0f4846ea7/802fc6a2d4f41295/34702b07a20db/8b84b6e138385d6", 277 | "8f6f3d03e10289c2", 278 | "ed95a9a5be30e4c8", 279 | "ed95a9a5be30e4c8/9c2692a00033d2ca", 280 | "afea1067d86a1c44/768ddae806aa91cc", 281 | "7875af5f916d165/2de17cd3dfa1bafb?28d8c9221be3456e", 282 | "1f8649a74c661bd4", 283 | "ed95a9a5be30e4c8/d400c9e183de73f3", 284 | "0d9afe7c94a6fcb8", 285 | "5f63cba1308ebad/16e720804d7385cb?5a4b1b396bf1130", 286 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb?a2ae02cc94e330f2", 287 | "6cb909d81a2f5b20", 288 | "b23e286d713f61fd/e0da2d3e2c6f610/16e720804d7385cb?799961866adb8a72", 289 | "5f63cba1308ebad/16e720804d7385cb?282d33b7392ed0f3", 290 | "b23e286d713f61fd/16e720804d7385cb", 291 | "dcb69d5b9ce0d93", 292 | "9e82d69ba38ad61", 293 | "1f8649a74c661bd4/b3cf138ac65a87cd", 294 | "427e6f941738985a", 295 | "8f6b27fe6f0dcdae/77cc413057b22ef2", 296 | "7e89190c7bcf1be9", 297 | "7e89190c7bcf1be9/fb5a409aecff2de1/32c6ffef1a8068b2/01fee084a3cb3563?590f324987a908ac", 298 | "277503b36e998a2c", 299 | "5bb77e7558c09124", 300 | "b23e286d713f61fd/16e720804d7385cb?799961866adb8a72", 301 | "6eefbbf46b47c5e", 302 | "dc958d5abcb0c7f4", 303 | "fb3859d88debbc2f/10e22e5ca30919fd/bed4d82bfc7fb316/9fb2db33a1362553/af1bef8666741753", 304 | "54df72c060e95707/01fee084a3cb3563", 305 | "1f73e4b495d6947a", 306 | "fb3859d88debbc2f/10e22e5ca30919fd/bed4d82bfc7fb316/9fb2db33a1362553/af1bef8666741753/b8b68b641a5d7f18", 307 | "6ae12ea8ec730ba5/281bb5a0f4846ea7/253bf3e95bec331a/34702b07a20db/8b84b6e138385d6", 308 | "5f63cba1308ebad/16e720804d7385cb", 309 | "a4e358da594acc69/d5e31c7559f5aae", 310 | "6e89cfa05ae05032?7ded49ef5f6ae4b5", 311 | "307809459d18aac/05ec660c9d33a602/1c4578927f3f3711/2ba906928c030c0f", 312 | "427e6f941738985a/7de2f069da3a3655", 313 | "70fc5e1c206b990d", 314 | "40c40bf5f58729e9", 315 | "2f38166a9f476d14/2e1f4252a64ef39e?ffa3ebbd543f63a", 316 | "8f6b27fe6f0dcdae/7de2f069da3a3655", 317 | "530bd88a2a6056ba/753be5bb22047d7d/ac5dd08add7bd9b3", 318 | "7e89190c7bcf1be9/32345249f712667/26010d49384ca927/01fee084a3cb3563?2cb5075b4f4e88dd", 319 | "c415bc2d4909291c/ff90c3dd68949525", 320 | "88784b4873c7551d/a8c79e6cf0f93af?3fe03b55422683a", 321 | "ec9d0d6b37ae8d68/01fee084a3cb3563", 322 | "ec9d0d6b37ae8d68/01fee084a3cb3563/850b51f8595b735c/d1559ef785b761e1", 323 | "999fd0543f2499ba/05ec660c9d33a602/1c4578927f3f3711/2ba906928c030c0f", 324 | "cf17e071ca4a6d63/333314eda494a273/9683443388b62d72", 325 | "afea1067d86a1c44/8968eb8d56ea2005", 
326 | "6865c9a20330e96e", 327 | "afea1067d86a1c44/f13f8d0b2be7d308", 328 | "5f63cba1308ebad/16e720804d7385cb?9b2c7d0cf9c19280", 329 | "a4e358da594acc69", 330 | "043f71e11bce6115", 331 | "2f38166a9f476d14/2e1f4252a64ef39e?23cb33cf67558126", 332 | "2972e09dd52b5c34/e0da2d3e2c6f610/16e720804d7385cb?aac4d7fceeea7dcb", 333 | "ed95a9a5be30e4c8/9c2692a00033d2ca/de6b0a4bdf4056d8", 334 | "ef5e1c317855b110/d22919653063ad0f", 335 | "db7d0a15587e37", 336 | "fe5809a4bf69b53b", 337 | "c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f", 338 | "f9717a397d602927", 339 | "c415bc2d4909291c", 340 | "97c681e48c2bd244", 341 | "ed95a9a5be30e4c8/9c2692a00033d2ca/51faf05ad73be17c", 342 | "38111edd541b4aa0", 343 | "6eefbbf46b47c5e/7de2f069da3a3655", 344 | "6cb909d81a2f5b20/16e720804d7385cb?106cec9ffea2f2df", 345 | "968c8e4fbbb8b0ce", 346 | "8f6f3d03e10289c2/7de2f069da3a3655", 347 | "ed95a9a5be30e4c8/5162fc6a223f248d/4dab901f0f98436", 348 | "a16689098c57e580", 349 | "f745af148dbad70c/8b9644ee902b2351/01fee084a3cb3563/33dcc329910a2ce2", 350 | "cf17e071ca4a6d63", 351 | "ed95a9a5be30e4c8/9c2692a00033d2ca/4dab901f0f98436", 352 | "afea1067d86a1c44", 353 | "2972e09dd52b5c34/e0da2d3e2c6f610/16e720804d7385cb", 354 | "04285bbaac4dba06/01fee084a3cb3563/26db9e0e4002aab4", 355 | "9cafb5406de1df9e", 356 | "9b569b834ef0716c/16e720804d7385cb?c5a19578c7c7204c", 357 | "521fca29d4156a9d", 358 | "f8c1d22d2e8ba7c4", 359 | ] 360 | 361 | return most_visited_websites 362 | 363 | 364 | def get_query_runner(query, samples): 365 | """Query runner factory.""" 366 | if query == 'get_sibling_attrs': 367 | return SiblingsAttrsRunner(samples) 368 | elif query == 'brand_interaction_audience': 369 | return BrandInteractionRunner(samples) 370 | elif query == 'get_all_transient_ids_in_household': 371 | return HouseholdDevices(samples) 372 | elif query == "undecided_user_check": 373 | return AudienceCheck(samples) 374 | elif query == "undecided_user_audience": 375 | return AudienceGeneration(samples) 376 | elif query == "early_website_adopters": 377 | return EarlyAdopters(samples) 378 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/cli/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/__main__.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | 4 | from nepytune.cli.transform import ( 5 | register as transform_register, 6 | main as transform_main, 7 | ) 8 | from nepytune.cli.split import register as split_register, main as split_main 9 | from nepytune.cli.add import register as add_register, main as add_main 10 | from nepytune.cli.extend import register as extend_register, main as extend_main 11 | 12 | 13 | logging.basicConfig(format="%(asctime)-15s %(message)s") 14 | 15 | 16 | def main(): 17 | """Main entry point for all commands.""" 18 | parser = argparse.ArgumentParser(description="Extend/generate dataset csv files") 19 | parser.set_defaults(subparser="none") 20 | 21 | subparsers = parser.add_subparsers() 22 | 23 | transform_register(subparsers) 24 | split_register(subparsers) 25 | add_register(subparsers) 26 | 
extend_register(subparsers) 27 | 28 | args = parser.parse_args() 29 | 30 | if args.subparser == "transform": 31 | transform_main(args) 32 | 33 | if args.subparser == "split": 34 | split_main(args) 35 | 36 | if args.subparser == "add": 37 | add_main(args) 38 | 39 | if args.subparser == "extend": 40 | extend_main(args) 41 | 42 | 43 | if __name__ == "__main__": 44 | main() 45 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/add.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import configparser 3 | import json 4 | import time 5 | import random 6 | import sys 7 | import csv 8 | import logging 9 | import ipaddress 10 | from collections import namedtuple 11 | from urllib.parse import urlparse 12 | 13 | from faker import Faker 14 | from faker.providers.user_agent import Provider as UAProvider 15 | from user_agents import parse 16 | 17 | 18 | from networkx.utils.union_find import UnionFind 19 | 20 | from nepytune.write_utils import json_lines_file 21 | from nepytune.utils import hash_ 22 | 23 | 24 | COMPANY_MIN_SIZE = 6 25 | 26 | logger = logging.getLogger("add") 27 | logger.setLevel(logging.INFO) 28 | 29 | 30 | class UserAgentProvider(UAProvider): 31 | """Custom faker provider that derives user agent based on type.""" 32 | 33 | def user_agent_from_type(self, type_): 34 | """Given type, generate appropriate user agent.""" 35 | while True: 36 | user_agent = self.user_agent() 37 | if type_ == "device": 38 | if "Mobile" in user_agent: 39 | return user_agent 40 | elif type_ == "cookie": 41 | if "Mobile" not in user_agent: 42 | return user_agent 43 | else: 44 | raise ValueError(f"Unsupported {type_}") 45 | 46 | 47 | class PersistentNodes(UnionFind): 48 | """networkx.UnionFind datastructure with custom iterable over node sets.""" 49 | 50 | def node_groups(self): 51 | """Iterate over node groups yield parent hash and node members.""" 52 | for node_set in self.to_sets(): 53 | yield hash_(node_set), node_set 54 | 55 | 56 | def extract_user_groups(user_mapping_path): 57 | """Generate disjoint user groups based on union find datastructure.""" 58 | with open(user_mapping_path) as f_h: 59 | pers_reader = csv.reader(f_h, delimiter=",") 60 | uf_ds = PersistentNodes() 61 | for row in pers_reader: 62 | uf_ds.union(row[0], row[1]) 63 | return uf_ds 64 | 65 | 66 | def generate_persistent_groups(user_groups, dst): 67 | """Write facts about persistent to transient nodes mapping.""" 68 | with open(dst, "w") as f_h: 69 | for persistent_id, node_group in user_groups.node_groups(): 70 | f_h.write( 71 | json.dumps({"pid": persistent_id, "transientIds": list(node_group)}) 72 | + "\n" 73 | ) 74 | 75 | 76 | def generate_identity_groups(persistent_ids_file, distribution, dst, _seed=None): 77 | """Write facts about identity_group mapping.""" 78 | if _seed is not None: 79 | random.seed(time.time()) 80 | 81 | with open(persistent_ids_file) as f_h: 82 | pids = [data["pid"] for data in json_lines_file(f_h)] 83 | 84 | random.shuffle(pids) 85 | 86 | sizes, weights = zip(*[[k, v] for k, v in distribution.items()]) 87 | i = 0 88 | with open(dst, "w") as f_h: 89 | while i < len(pids): 90 | size, *_ = random.choices(sizes, weights=weights) 91 | size = min(size, abs(len(pids) - i)) 92 | persistent_ids = [pids[i + j] for j in range(size)] 93 | type_ = "household" if len(persistent_ids) < COMPANY_MIN_SIZE else "company" 94 | f_h.write( 95 | json.dumps( 96 | { 97 | "igid": 
hash_(persistent_ids), 98 | "type": type_, 99 | "persistentIds": persistent_ids, 100 | } 101 | ) 102 | + "\n" 103 | ) 104 | # advance even if size was 0, meaning that persistent id 105 | # does not belong to any identity_group 106 | i += size or 1 107 | 108 | 109 | def parse_distribution(size, weights): 110 | """Parse and validate distribution params.""" 111 | if len(size) != len(weights): 112 | raise ValueError( 113 | "Identity group parsing issue: weights list and identity group " 114 | "size list are of different length" 115 | ) 116 | 117 | eps = 1e-4 118 | # accept small errors, as floating point arithmetic cannot be done precisely on computers 119 | if not 1 - eps < sum(weights) < 1 + eps: 120 | raise ValueError( 121 | "Identity group parsing issue: weights must sum to 1, " 122 | f"but sum to {sum(weights)} instead" 123 | ) 124 | return dict(zip(size, weights)) 125 | 126 | 127 | def get_ip_addresses(cidr): 128 | """Get list of hosts within given network cidr.""" 129 | network = ipaddress.ip_network(cidr) 130 | hosts = list(network.hosts()) 131 | if not hosts: 132 | return [network.network_address] 133 | return hosts 134 | 135 | 136 | def build_iploc_knowledge( 137 | ip_facts_file, 138 | persistent_ids_facts_file, 139 | identity_group_facts_file, 140 | transient_ids_facts_file, 141 | dst, 142 | ): 143 | """ 144 | Given some fact files, generate random locations and IP addresses in a consistent way. 145 | 146 | It works like a funnel. At the very top you have identity groups, then persistent nodes, 147 | then transient nodes. 148 | 149 | The logic can be simplified to: 150 | * identity groups = select a few (at most 8, with very low probability) IP addresses 151 | * persistent nodes = select a few IP addresses from the group above 152 | * transient nodes = select a few IP addresses from the group above 153 | 154 | This way the context data stays consistent: each transient node's IPs are a subset of its 155 | persistent id's IPs, which are in turn a subset of the identity group's IPs. 156 | 157 | The chosen probabilities make it highly likely that transient nodes stay within the same city 158 | and state; the same goes for persistent nodes.
159 | """ 160 | IPLoc = namedtuple("IPLoc", "state, city, ip_address") 161 | 162 | with open(ip_facts_file) as f_h: 163 | ip_cidrs_by_state_city = list(json_lines_file(f_h)) 164 | 165 | knowledge = {"identity_group": {}, "persistent_id": {}, "transient_ids": {}} 166 | 167 | def random_ip_loc(): 168 | state_count, *_ = random.choices([1, 2], weights=[0.98, 0.02]) 169 | for state_data in random.choices(ip_cidrs_by_state_city, k=state_count): 170 | city_count, *_ = random.choices( 171 | [1, 2, 3, 4], weights=[0.85, 0.1, 0.04, 0.01] 172 | ) 173 | for city_data in random.choices(state_data["cities"], k=city_count): 174 | random_cidr = random.choice(city_data["cidr_blocks"]) 175 | yield IPLoc( 176 | state=state_data["state"], 177 | city=city_data["city"], 178 | ip_address=str(random.choice(get_ip_addresses(random_cidr))), 179 | ) 180 | 181 | def random_ip_loc_from_group(locations): 182 | # compute weights; each next item is half as likely as the previous one 183 | weights = [1] 184 | for _ in locations[:-1]: 185 | weights.append(weights[-1] / 2) 186 | 187 | count = len(locations) 188 | random_count, *_ = random.choices(list(range(1, count + 1)), weights=weights) 189 | return list(set(random.choices(locations, k=random_count))) 190 | 191 | logger.info("Creating Identity group / persistent ids IP facts") 192 | with open(identity_group_facts_file) as f_h: 193 | for data in json_lines_file(f_h): 194 | locations = knowledge["identity_group"][data["igid"]] = list( 195 | set(random_ip_loc()) 196 | ) 197 | 198 | for persistent_id in data["persistentIds"]: 199 | knowledge["persistent_id"][persistent_id] = random_ip_loc_from_group( 200 | locations 201 | ) 202 | 203 | logger.info("Creating persistent / transient ids IP facts") 204 | with open(persistent_ids_facts_file) as f_h: 205 | for data in json_lines_file(f_h): 206 | persistent_id = data["pid"] 207 | # handle case where persistent id does not belong to any identity group 208 | if persistent_id not in knowledge["persistent_id"]: 209 | knowledge["persistent_id"][persistent_id] = random_ip_loc_from_group( 210 | list(set(random_ip_loc())) 211 | ) 212 | for transient_id in data["transientIds"]: 213 | knowledge["transient_ids"][transient_id] = random_ip_loc_from_group( 214 | knowledge["persistent_id"][persistent_id] 215 | ) 216 | # now assign random ip location for transient ids without persistent ids 217 | logger.info("Processing remaining transient ids facts") 218 | with open(transient_ids_facts_file) as t_f_h: 219 | for data in json_lines_file(t_f_h): 220 | if data["uid"] not in knowledge["transient_ids"]: 221 | knowledge["transient_ids"][data["uid"]] = list( 222 | set( 223 | random_ip_loc_from_group( # "transient group" level 224 | random_ip_loc_from_group( # "persistent group" level 225 | list(set(random_ip_loc())) # "identity group" level 226 | ) 227 | ) 228 | ) 229 | ) 230 | 231 | with open(dst, "w") as f_h: 232 | for key, data in knowledge["transient_ids"].items(): 233 | f_h.write( 234 | json.dumps( 235 | {"transient_id": key, "loc": [item._asdict() for item in data]} 236 | ) 237 | + "\n" 238 | ) 239 | 240 | def generate_website_groups(urls_file, iab_categories, dst): 241 | """Generate website groups.""" 242 | website_groups = {} 243 | with open(urls_file) as urls_f: 244 | urls_reader = csv.reader(urls_f, delimiter=",") 245 | for row in urls_reader: 246 | url = row[1] 247 | root_url = urlparse("//" + url).hostname 248 | if root_url not in website_groups: 249 | iab_category = random.choice(iab_categories) 250 | website_groups[root_url] = { 251 |
"websites": [url], 252 | "category": { 253 | "code": iab_category[0], 254 | "name": iab_category[1] 255 | } 256 | } 257 | else: 258 | website_groups[root_url]["websites"].append(url) 259 | 260 | with open(dst, "w") as dst_file: 261 | for url, data in website_groups.items(): 262 | website_group = { 263 | "url": url, 264 | "websites": data["websites"], 265 | "category": data["category"] 266 | } 267 | website_group_id = hash_(website_group.items()) 268 | website_group["id"] = website_group_id 269 | dst_file.write( 270 | json.dumps(website_group) + "\n" 271 | ) 272 | 273 | 274 | def read_iab_categories(iab_filepath): 275 | """Read IAB categories tuples from JSON file.""" 276 | with open(iab_filepath) as iab_file: 277 | categories = json.loads(iab_file.read()) 278 | return [(code, category) for code, category in categories.items()] 279 | 280 | 281 | def build_user_identitity_knowledge( 282 | persistent_ids_facts_file, transient_ids_facts_file, dst 283 | ): 284 | """ 285 | Generate some facts about user identities. 286 | 287 | There are few informations generated here: 288 | * transient ids types: cookie | device 289 | * transient id emails (it's randomly selected from persistent id emails) 290 | * transient id user agent ( 291 | if transient id type is cookie then workstation user agent is generated, 292 | otherwise mobile one 293 | ) 294 | * derivatives of user agent 295 | * device family (if type device) 296 | * OS 297 | * browser 298 | """ 299 | user_emails = {} 300 | fake = Faker() 301 | fake.add_provider(UserAgentProvider) 302 | 303 | logger.info("Creating emails per transient ids") 304 | # create fake emails for devices with persistent ids 305 | with open(persistent_ids_facts_file) as f_h: 306 | for data in json_lines_file(f_h): 307 | nemail = random.randint(1, len(data["transientIds"])) 308 | emails = [fake.email() for _ in range(nemail)] 309 | for transient_id in data["transientIds"]: 310 | user_emails[transient_id] = random.choice(emails) 311 | 312 | # create fake emails for devices without persistent ids 313 | with open(transient_ids_facts_file) as t_f_h: 314 | for data in json_lines_file(t_f_h): 315 | if data["uid"] not in user_emails: 316 | user_emails[data["uid"]] = fake.email() 317 | 318 | logger.info("Writing down user identity facts") 319 | with open(dst, "w") as f_h: 320 | for transient_id, data in user_emails.items(): 321 | type_ = random.choice(["cookie", "device"]) 322 | uset_agent_str = fake.user_agent_from_type(type_) 323 | 324 | user_agent = parse(uset_agent_str) 325 | device = user_agent.device.family 326 | operating_system = user_agent.os.family 327 | browser = user_agent.browser.family 328 | 329 | f_h.write( 330 | json.dumps( 331 | { 332 | "transient_id": transient_id, 333 | "user_agent": uset_agent_str, 334 | "device": device, 335 | "os": operating_system, 336 | "browser": browser, 337 | "email": data, 338 | "type": type_, 339 | } 340 | ) 341 | + "\n" 342 | ) 343 | 344 | 345 | def register(parser): 346 | """Register 'add' parser.""" 347 | add_parser = parser.add_parser("add") 348 | add_parser.add_argument("--config-file", type=argparse.FileType("r"), required=True) 349 | 350 | add_subparser = add_parser.add_subparsers() 351 | 352 | persistent_id_parser = add_subparser.add_parser("persistent_id") 353 | persistent_id_parser.set_defaults(subparser="add", command="persistent_id") 354 | 355 | identity_group_parser = add_subparser.add_parser("identity_group") 356 | identity_group_parser.add_argument("--size", type=int, dest="size", action="append") 357 | 
identity_group_parser.add_argument( 358 | "--weights", type=float, dest="weights", action="append" 359 | ) 360 | identity_group_parser.set_defaults(subparser="add", command="identity_group") 361 | 362 | fact_parser = add_subparser.add_parser("fact") 363 | fact_parser.set_defaults(subparser="add", command="facts") 364 | 365 | website_groups_parser = add_subparser.add_parser("website_groups") 366 | website_groups_parser.set_defaults(subparser="add", command="website_groups") 367 | 368 | 369 | def main(args): 370 | """Generate dataset files with information about the world.""" 371 | config = configparser.ConfigParser() 372 | config.read(args.config_file.name) 373 | 374 | if args.command == "persistent_id": 375 | logger.info("Generate persistent id file to %s", config["dst"]["persistent"]) 376 | uf_ds = extract_user_groups(config["src"]["user_to_user"]) 377 | generate_persistent_groups(uf_ds, config["dst"]["persistent"]) 378 | 379 | if args.command == "identity_group": 380 | logger.info( 381 | "Generate identity group file to %s", config["dst"]["identity_group"] 382 | ) 383 | try: 384 | distribution = parse_distribution(args.size, args.weights) 385 | except ValueError as exc: 386 | print(exc) 387 | sys.exit(2) 388 | 389 | generate_identity_groups( 390 | config["dst"]["persistent"], distribution, config["dst"]["identity_group"] 391 | ) 392 | 393 | if args.command == "facts": 394 | logger.info("Generate IP facts file to %s", config["dst"]["ip_info"]) 395 | build_iploc_knowledge( 396 | ip_facts_file=config["src"]["location_to_cidr"], 397 | persistent_ids_facts_file=config["dst"]["persistent"], 398 | identity_group_facts_file=config["dst"]["identity_group"], 399 | transient_ids_facts_file=config["src"]["facts"], 400 | dst=config["dst"]["ip_info"], 401 | ) 402 | logger.info( 403 | "Generate user identity facts file to %s", 404 | config["dst"]["user_identity_info"], 405 | ) 406 | build_user_identitity_knowledge( 407 | persistent_ids_facts_file=config["dst"]["persistent"], 408 | transient_ids_facts_file=config["src"]["facts"], 409 | dst=config["dst"]["user_identity_info"], 410 | ) 411 | 412 | if args.command == "website_groups": 413 | logger.info("Generate website groups file to %s.", config["dst"]["website_groups"]) 414 | urls_file = config["src"]["urls"] 415 | dst_file = config["dst"]["website_groups"] 416 | iab_categories = read_iab_categories(config["src"]["iab_categories"]) 417 | 418 | generate_website_groups(urls_file, iab_categories, dst_file) 419 | 420 | logger.info("Done!") 421 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/extend.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import configparser 4 | import os 5 | import json 6 | import itertools 7 | import random 8 | 9 | from nepytune.write_utils import json_lines_file 10 | 11 | 12 | logger = logging.getLogger("extend") 13 | logger.setLevel(logging.INFO) 14 | 15 | 16 | def extend_facts_file(fact_file_path, ip_loc_file_path, user_identity_file_path): 17 | """Extend facts file with additional information.""" 18 | ip_loc_cor = extend_with_iploc_information(ip_loc_file_path) 19 | user_identity_cor = extend_with_user_identity_information(user_identity_file_path) 20 | 21 | next(ip_loc_cor) 22 | next(user_identity_cor) 23 | 24 | dst = f"{fact_file_path}.tmp" 25 | with open(fact_file_path) as f_h: 26 | with open(dst, "w") as f_dst: 27 | for data in json_lines_file(f_h): 28 | 
transformed_row = user_identity_cor.send(ip_loc_cor.send(data)) 29 | f_dst.write(json.dumps(transformed_row) + "\n") 30 | 31 | ip_loc_cor.close() 32 | 33 | os.rename(dst, fact_file_path) 34 | 35 | 36 | def extend_with_user_identity_information(user_identity_file_path): 37 | """Coroutine which generates user identity facts based on transient id.""" 38 | with open(user_identity_file_path) as f_h: 39 | user_id_data = {data["transient_id"]: data for data in json_lines_file(f_h)} 40 | 41 | data = yield 42 | 43 | while data is not None: 44 | transformed = {**data.copy(), **user_id_data[data["uid"]]} 45 | del transformed["transient_id"] 46 | data = yield transformed 47 | 48 | 49 | def extend_with_iploc_information(ip_loc_file_path): 50 | """Coroutine which generates ip location facts based on transient id.""" 51 | with open(ip_loc_file_path) as f_h: 52 | loc_data = {data["transient_id"]: data["loc"] for data in json_lines_file(f_h)} 53 | 54 | data = yield 55 | 56 | def get_sane_ip_locaction(uid, facts, max_ts_difference=3600): 57 | """ 58 | Given transient id and its facts add information about ip/location. 59 | 60 | Process is semi-deterministic. 61 | 1. Choose the location at random from the given list of locations 62 | 2. Repeat returning this location as long as the timestamp difference 63 | lies within the `max_ts_difference` 64 | 3. Otherwise, start from 1) 65 | """ 66 | facts = [None] + sorted(facts, key=lambda x: x["ts"]) 67 | ptr1, ptr2 = itertools.tee(facts, 2) 68 | next(ptr2, None) 69 | 70 | loc_fact = random.choice(loc_data[uid]) 71 | 72 | for previous_item, current in zip(ptr1, ptr2): 73 | if ( 74 | previous_item is None 75 | or current["ts"] - previous_item["ts"] > max_ts_difference 76 | ): 77 | loc_fact = random.choice(loc_data[uid]) 78 | yield {**current, **loc_fact} 79 | 80 | while data is not None: 81 | transformed = data.copy() 82 | transformed["facts"] = list( 83 | get_sane_ip_locaction(uid=data["uid"], facts=data["facts"]) 84 | ) 85 | data = yield transformed 86 | 87 | 88 | def register(parser): 89 | """Register 'extend' parser.""" 90 | extend_parser = parser.add_parser("extend") 91 | extend_parser.set_defaults(subparser="extend") 92 | extend_parser.add_argument( 93 | "--config-file", type=argparse.FileType("r"), required=True 94 | ) 95 | 96 | extend_subparser = extend_parser.add_subparsers() 97 | _ = extend_subparser.add_parser("facts") 98 | extend_parser.set_defaults(command="facts") 99 | 100 | 101 | def main(args): 102 | """Extend facts with information about the world.""" 103 | config = configparser.ConfigParser() 104 | config.read(args.config_file.name) 105 | 106 | if args.command == "facts": 107 | logger.info("Extend facts file to %s", config["src"]["facts"]) 108 | extend_facts_file( 109 | fact_file_path=config["src"]["facts"], 110 | ip_loc_file_path=config["dst"]["ip_info"], 111 | user_identity_file_path=config["dst"]["user_identity_info"], 112 | ) 113 | 114 | logger.info("Done!") 115 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/split.py: -------------------------------------------------------------------------------- 1 | import json 2 | import csv 3 | import argparse 4 | 5 | 6 | def batch_facts(src, size): 7 | """Split facts into batches of provided size.""" 8 | with open(src) as f_h: 9 | json_lines = [] 10 | i = 0 11 | 12 | for line in f_h: 13 | if i > size: 14 | yield json_lines 15 | i = 0 16 | json_lines = [] 17 | 18 | json_lines.append(json.loads(line)) 19 | i = i + 1 
20 | 21 | yield json_lines 22 | 23 | 24 | def write_json_facts(json_lines, dst): 25 | """Write down jsonline facts into dst.""" 26 | with open(dst, "w") as f_h: 27 | for data in json_lines: 28 | f_h.write(json.dumps(data) + "\n") 29 | 30 | 31 | def load_urls(src): 32 | """ 33 | Load given url file csv into memory. 34 | 35 | It assumes that only two columns are present. One is key, other is value. 36 | """ 37 | with open(src) as f_h: 38 | data = csv.reader(f_h, delimiter=",") 39 | return dict((int(row[0]), row[1]) for row in data) 40 | 41 | 42 | def write_urls(json_facts, urls, dst): 43 | """Write down urls batch based on batch of json facts.""" 44 | with open(dst, "w") as f_h: 45 | writer = csv.writer(f_h, delimiter=",") 46 | for data in json_facts: 47 | for fact in data["facts"]: 48 | writer.writerow([fact["fid"], urls[fact["fid"]]]) 49 | 50 | 51 | def register(parser): 52 | """Register 'split' command.""" 53 | split_parser = parser.add_parser("split") 54 | split_parser.set_defaults(subparser="split") 55 | 56 | split_parser.add_argument("--size", type=int, required=True) 57 | split_parser.add_argument( 58 | "--facts-file", type=argparse.FileType("r"), required=True 59 | ) 60 | split_parser.add_argument("--urls-file", type=argparse.FileType("r"), required=True) 61 | split_parser.add_argument("--dst-folder", type=str, required=True) 62 | 63 | 64 | def main(args): 65 | """'Split' command logic.""" 66 | location, size = args.dst_folder, args.size 67 | urls = load_urls(args.urls_file.name) 68 | i = 0 69 | file_prefix = f"{i * size}_{(i + 1) * size}" 70 | for json_lines in batch_facts(args.facts_file.name, size): 71 | i = i + 1 72 | write_json_facts(json_lines, dst=f"{location}/{file_prefix}_facts.json") 73 | write_urls(json_lines, urls, dst=f"{location}/{file_prefix}_urls.csv") 74 | file_prefix = f"{i * size}_{(i + 1)* size}" 75 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/cli/transform.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import configparser 4 | import glob 5 | from pathlib import PurePath 6 | import concurrent.futures 7 | from string import Template 8 | 9 | from nepytune.nodes import websites, users, identity_groups, ip_loc 10 | from nepytune.edges import ( 11 | user_website, 12 | website_groups, 13 | identity_groups as identity_group_edges, 14 | persistent_ids, 15 | ip_loc as ip_loc_edges, 16 | ) 17 | 18 | 19 | logger = logging.getLogger("transform") 20 | logger.setLevel(logging.INFO) 21 | 22 | 23 | def build_destination_path(src, dst): 24 | """Given src path, extract batch information and build new destination path.""" 25 | stem = PurePath(src).stem 26 | batch_id = f"{'_'.join(stem.split('_')[:2])}_" 27 | return Template(dst).substitute(batch_id=batch_id) 28 | 29 | 30 | def register(parser): 31 | """Register 'transform' parser.""" 32 | transform_parser = parser.add_parser("transform") 33 | transform_parser.set_defaults(subparser="transform") 34 | 35 | transform_parser.add_argument( 36 | "--config-file", type=argparse.FileType("r"), required=True 37 | ) 38 | transform_parser.add_argument("--websites", action="store_true", default=False) 39 | transform_parser.add_argument("--website_groups", action="store_true", default=False) 40 | transform_parser.add_argument("--transientIds", action="store_true", default=False) 41 | transform_parser.add_argument("--persistentIds", action="store_true", default=False) 42 | 
transform_parser.add_argument( 43 | "--identityGroupIds", action="store_true", default=False 44 | ) 45 | transform_parser.add_argument("--ips", action="store_true", default=False) 46 | # workers param affect only processing transient entities; 47 | # other types of entities are processed fast enough 48 | transform_parser.add_argument("--workers", type=int, default=1) 49 | 50 | 51 | def main(args): 52 | """Transform csv files into ready-to-load neptune format.""" 53 | config = configparser.ConfigParser() 54 | config.read(args.config_file.name) 55 | 56 | files = { 57 | "facts": config["src"]["facts"], 58 | "urls": config["src"]["urls"], 59 | "titles": config["src"]["titles"], 60 | } 61 | 62 | if args.websites: 63 | logger.info("Generating website nodes to %s", config["dst"]["websites"]) 64 | websites.generate_website_nodes( 65 | files["urls"], files["titles"], config["dst"]["websites"] 66 | ) 67 | 68 | if args.website_groups: 69 | groups_json = config["src"]["website_groups"] 70 | 71 | nodes_dst = config["dst"]["website_group_nodes"] 72 | logger.info("Generating website group nodes to %s", nodes_dst) 73 | websites.generate_website_group_nodes(groups_json, nodes_dst) 74 | 75 | edges_dst = config["dst"]["website_group_edges"] 76 | logger.info("Generating website group edges to %s", edges_dst) 77 | website_groups.generate_website_group_edges(groups_json, edges_dst) 78 | 79 | if args.transientIds: 80 | if args.workers > 1: 81 | fact_files = sorted(glob.glob(config["src"]["facts_glob"])) 82 | url_files = sorted(glob.glob(config["src"]["urls_glob"])) 83 | 84 | with concurrent.futures.ProcessPoolExecutor( 85 | max_workers=args.workers 86 | ) as executor: 87 | futures = [] 88 | logger.info("Scheduling...") 89 | for fact_file, url_file in zip(fact_files, url_files): 90 | futures.append( 91 | executor.submit( 92 | users.generate_user_nodes, 93 | fact_file, 94 | build_destination_path( 95 | fact_file, config["dst"]["transient_nodes"] 96 | ), 97 | ) 98 | ) 99 | futures.append( 100 | executor.submit( 101 | user_website.generate_user_website_edges, 102 | { 103 | "titles": files["titles"], 104 | "urls": url_file, 105 | "facts": fact_file, 106 | }, 107 | build_destination_path( 108 | fact_file, config["dst"]["transient_edges"] 109 | ), 110 | ) 111 | ) 112 | logger.info("Processing of transient nodes started.") 113 | 114 | for future in concurrent.futures.as_completed(futures): 115 | logger.info( 116 | "Succesfully written transient entity file into %s", 117 | future.result(), 118 | ) 119 | else: 120 | nodes_dst = Template(config["dst"]["transient_nodes"]).substitute( 121 | batch_id="" 122 | ) 123 | logger.info("Generating transient id nodes to %s", nodes_dst) 124 | users.generate_user_nodes(config["src"]["facts"], nodes_dst) 125 | 126 | edges_dst = Template(config["dst"]["transient_edges"]).substitute( 127 | batch_id="" 128 | ) 129 | logger.info("Generating transient id edges to %s", edges_dst) 130 | user_website.generate_user_website_edges(files, edges_dst) 131 | 132 | if args.persistentIds: 133 | logger.info( 134 | "Generating persistent id nodes to %s", config["dst"]["persistent_nodes"] 135 | ) 136 | users.generate_persistent_nodes( 137 | config["src"]["persistent"], config["dst"]["persistent_nodes"] 138 | ) 139 | logger.info( 140 | "Generating persistent id edges to %s", config["dst"]["persistent_edges"] 141 | ) 142 | persistent_ids.generate_persistent_id_edges( 143 | config["src"]["persistent"], config["dst"]["persistent_edges"] 144 | ) 145 | 146 | if args.identityGroupIds: 147 | logger.info( 148 | 
"Generating identity group id nodes to %s", 149 | config["dst"]["identity_group_nodes"], 150 | ) 151 | identity_groups.generate_identity_group_nodes( 152 | config["src"]["identity_group"], config["dst"]["identity_group_nodes"] 153 | ) 154 | logger.info( 155 | "Generating identity group id edges to %s", 156 | config["dst"]["identity_group_edges"], 157 | ) 158 | identity_group_edges.generate_identity_group_edges( 159 | config["src"]["identity_group"], config["dst"]["identity_group_edges"] 160 | ) 161 | 162 | if args.ips: 163 | logger.info("Generating IP id nodes to %s", config["dst"]["ip_nodes"]) 164 | ip_loc.generate_ip_loc_nodes_from_facts( 165 | config["src"]["facts"], config["dst"]["ip_nodes"] 166 | ) 167 | logger.info("Generating IP edges to %s", config["dst"]["ip_edges"]) 168 | ip_loc_edges.generate_ip_loc_edges_from_facts( 169 | config["src"]["facts"], config["dst"]["ip_edges"] 170 | ) 171 | 172 | logger.info("Done!") 173 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/drawing.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | 3 | import plotly.graph_objects as go 4 | import networkx as nx 5 | 6 | 7 | def layout(graph, layout=nx.spring_layout, **layout_args): 8 | pos = layout(graph, **layout_args) 9 | 10 | nx.set_node_attributes(graph, { 11 | node_id: { 12 | "pos": value 13 | } 14 | for node_id, value in pos.items() 15 | }) 16 | return graph 17 | 18 | 19 | def spring_layout(graph): 20 | return layout(graph, nx.spring_layout, scale=0.5) 21 | 22 | 23 | def group_by_label(graph, type_="nodes"): 24 | if type_ == "nodes": 25 | return group_by_grouper(graph, lambda x: x[1]["label"], type_) 26 | else: 27 | return group_by_grouper(graph, lambda x: x[2]["label"], type_) 28 | 29 | 30 | def group_by_grouper(graph, grouper, type_="nodes"): 31 | if type_ == "nodes": 32 | data = graph.nodes(data=True) 33 | else: 34 | data = graph.edges(data=True) 35 | 36 | return itertools.groupby( 37 | sorted(list(data), key=grouper), 38 | key=grouper 39 | ) 40 | 41 | 42 | def edges_scatter(graph): 43 | edge_x = [] 44 | edge_y = [] 45 | 46 | for edge in graph.edges(): 47 | x0, y0 = graph.nodes[edge[0]]["pos"] 48 | x1, y1 = graph.nodes[edge[1]]["pos"] 49 | edge_x.append(x0) 50 | edge_x.append(x1) 51 | edge_x.append(None) 52 | edge_y.append(y0) 53 | edge_y.append(y1) 54 | edge_y.append(None) 55 | 56 | return go.Scatter( 57 | x=edge_x, y=edge_y, 58 | line=dict(width=0.5, color='#888'), 59 | name="edges", 60 | hoverinfo="none", 61 | mode="lines", 62 | ) 63 | 64 | 65 | def edge_scatters_by_label(graph, widths=None, colors=None, dashes=None, opacity=None): 66 | if not colors: 67 | colors = {} 68 | if not dashes: 69 | dashes = {} 70 | if not widths: 71 | widths = {} 72 | if not opacity: 73 | opacity = {} 74 | 75 | for label, edges in group_by_label(graph, type_="edges"): 76 | edge_x = [] 77 | edge_y = [] 78 | 79 | for edge in edges: 80 | x0, y0 = graph.nodes[edge[0]]["pos"] 81 | x1, y1 = graph.nodes[edge[1]]["pos"] 82 | edge_x.append(x0) 83 | edge_x.append(x1) 84 | edge_x.append(None) 85 | edge_y.append(y0) 86 | edge_y.append(y1) 87 | edge_y.append(None) 88 | 89 | yield go.Scatter( 90 | x=edge_x, y=edge_y, 91 | line=dict( 92 | width=widths.get(label, 0.5), 93 | color=colors.get(label, '#888'), 94 | dash=dashes.get(label, "solid") 95 | ), 96 | opacity=opacity.get(label, 1), 97 | name=label, 98 | hoverinfo="none", 99 | mode="lines", 100 | ) 101 | 102 | 103 | 104 | def 
edge_annotations(graph): 105 | annotations = [] 106 | for from_, to_, attr_map in graph.edges(data=True): 107 | x0, y0 = graph.nodes[from_]["pos"] 108 | x1, y1 = graph.nodes[to_]["pos"] 109 | x_mid, y_mid = (x0 + x1) / 2, (y0 + y1) / 2 110 | annotations.append(dict( 111 | xref="x", 112 | yref="y", 113 | x=x_mid, y=y_mid, 114 | text=attr_map["label"], 115 | font=dict(size=12), 116 | showarrow=False 117 | )) 118 | 119 | return annotations 120 | 121 | 122 | def scatters_by_label(graph, attrs_to_skip, sizes=None, colors=None): 123 | if not colors: 124 | colors = {} 125 | if not sizes: 126 | sizes = {} 127 | 128 | for i, (label, node_group) in enumerate(group_by_label(graph)): 129 | node_group = list(node_group) 130 | node_x = [] 131 | node_y = [] 132 | opacity = [] 133 | size_list = [] 134 | 135 | for node_id, _ in node_group: 136 | x, y = graph.nodes[node_id]["pos"] 137 | opacity.append(graph.nodes[node_id].get("opacity", 1)) 138 | size_list.append( 139 | graph.nodes[node_id].get("size", sizes.get(label, 10)) 140 | ) 141 | node_x.append(x) 142 | node_y.append(y) 143 | 144 | node_trace = go.Scatter( 145 | x=node_x, y=node_y, 146 | name=label, 147 | mode='markers', 148 | hoverinfo='text', 149 | marker=dict( 150 | showscale=False, 151 | colorscale='Hot', 152 | reversescale=True, 153 | color=colors.get(label, i * 5), 154 | opacity=opacity, 155 | size=size_list, 156 | line_width=2 157 | ) 158 | ) 159 | 160 | node_text = [] 161 | 162 | def format_v(attr, value): 163 | if isinstance(value, dict): 164 | return "".join([format_v(k, str(v)) for k, v in value.items()]) 165 | value = str(value) 166 | if len(value) < 80: 167 | return f"
{attr}: {value}" 168 | else: 169 | result = f"
{attr}: " 170 | substr = "" 171 | for word in value.split(" "): 172 | if len(word + substr) < 80: 173 | substr = f"{substr} {word}" 174 | else: 175 | result = f"{result}
{5 * ' '} {substr}" 176 | substr = "" 177 | 178 | return f"{result}
{5 * ' '} {substr}" 179 | 180 | for node_id, attr_dict in node_group: 181 | node_text.append( 182 | "".join([ 183 | format_v(attr, value) for attr, value in attr_dict.items() 184 | if attr not in attrs_to_skip 185 | ]) 186 | ) 187 | 188 | node_trace.text = node_text 189 | 190 | yield node_trace 191 | 192 | 193 | def draw(title, scatters, annotations=None): 194 | fig = go.Figure( 195 | data=scatters, 196 | layout=go.Layout( 197 | title_text=title, 198 | titlefont_size=16, 199 | showlegend=True, 200 | hovermode='closest', 201 | margin=dict(b=20, l=5, r=5, t=40), 202 | xaxis=dict(showgrid=False, zeroline=False, showticklabels=False), 203 | yaxis=dict(showgrid=False, zeroline=False, showticklabels=False) 204 | ) 205 | ) 206 | if annotations: 207 | fig.update_layout( 208 | annotations=annotations 209 | ) 210 | fig.show() 211 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/edges/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/identity_groups.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils import gremlin_writer, GremlinEdgeCSV, json_lines_file 2 | from nepytune.utils import get_id 3 | 4 | 5 | def generate_identity_group_edges(src, dst): 6 | """Generate identity_group edge csv file.""" 7 | with open(src) as f_h: 8 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 9 | for data in json_lines_file(f_h): 10 | persistent_ids = data["persistentIds"] 11 | if persistent_ids: 12 | for persistent_id in persistent_ids: 13 | identity_group_to_persistent = { 14 | "_id": get_id(data["igid"], persistent_id, {}), 15 | "_from": data["igid"], 16 | "to": persistent_id, 17 | "attribute_map": {}, 18 | "label": "member", 19 | } 20 | writer.add(**identity_group_to_persistent) 21 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/ip_loc.py: -------------------------------------------------------------------------------- 1 | from nepytune.nodes.ip_loc import IPLoc, get_id 2 | from nepytune.write_utils import gremlin_writer, GremlinEdgeCSV, json_lines_file 3 | from nepytune.utils import get_id as get_edge_id 4 | 5 | 6 | def generate_ip_loc_edges_from_facts(src, dst): 7 | """Generate ip location csv file with edges.""" 8 | with open(src) as f_h: 9 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 10 | for data in json_lines_file(f_h): 11 | uid_locations = set() 12 | for fact in data["facts"]: 13 | uid_locations.add( 14 | IPLoc(fact["state"], fact["city"], fact["ip_address"]) 15 | ) 16 | 17 | for location in uid_locations: 18 | loc_id = get_id(location) 19 | writer.add( 20 | _id=get_edge_id(data["uid"], loc_id, {}), 21 | _from=data["uid"], 22 | to=loc_id, 23 | label="uses", 24 | attribute_map={}, 25 | ) 26 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/persistent_ids.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils 
import gremlin_writer, GremlinEdgeCSV, json_lines_file 2 | from nepytune.utils import get_id 3 | 4 | 5 | def generate_persistent_id_edges(src, dst): 6 | """Generate persistentID edges based on union-find datastructure.""" 7 | with open(src) as f_h: 8 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 9 | for data in json_lines_file(f_h): 10 | for node in data["transientIds"]: 11 | persistent_to_transient = { 12 | "_id": get_id(data["pid"], node, {}), 13 | "_from": data["pid"], 14 | "to": node, 15 | "label": "has_identity", 16 | "attribute_map": {}, 17 | } 18 | writer.add(**persistent_to_transient) 19 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/user_website.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import json 3 | import logging 4 | 5 | from datetime import datetime 6 | 7 | from nepytune.write_utils import gremlin_writer, json_lines_file, GremlinEdgeCSV 8 | from nepytune.utils import get_id 9 | 10 | 11 | logger = logging.getLogger("user_edges") 12 | logger.setLevel(logging.INFO) 13 | 14 | 15 | def _parse_ts(timestamp): 16 | """Parse timestamp.""" 17 | for div in (1_000, 1_000_000): 18 | try: 19 | return datetime.fromtimestamp(timestamp / div).strftime("%Y-%m-%dT%H:%M:%S") 20 | except: 21 | logger.info("Could not parse timestamp: %d with %d", timestamp, div) 22 | return "" 23 | 24 | 25 | def generate_user_website_edges(src_map, dst): 26 | """Generate edges between user nodes and website nodes.""" 27 | with open(src_map["urls"]) as url_file: 28 | fact_to_website = {} 29 | for row in csv.reader(url_file, delimiter=","): 30 | fact_to_website[int(row[0])] = row[1] 31 | 32 | with open(src_map["facts"]) as facts_file: 33 | attrs = [ 34 | "ts:Date", 35 | "visited_url:String", 36 | "uid:String", 37 | "state:String", 38 | "city:String", 39 | "ip_address:String", 40 | ] 41 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=attrs) as writer: 42 | for data in json_lines_file(facts_file): 43 | for fact in data["facts"]: 44 | timestamp = _parse_ts(fact["ts"]) 45 | website_id = fact_to_website[fact["fid"]] 46 | loc_attrs = { 47 | "state": fact["state"], 48 | "city": fact["city"], 49 | "ip_address": fact["ip_address"], 50 | } 51 | attr_map = { 52 | "ts": timestamp, 53 | "visited_url": website_id, 54 | "uid": data["uid"], 55 | **loc_attrs, 56 | } 57 | user_to_website = { 58 | "_id": get_id(data["uid"], website_id, attr_map), 59 | "_from": data["uid"], 60 | "to": website_id, 61 | "label": "visited", 62 | "attribute_map": attr_map, 63 | } 64 | try: 65 | writer.add(**user_to_website) 66 | except Exception: 67 | logger.exception("Something went wrong while creating an edge") 68 | logger.info(json.dumps({"uid": data["uid"], **fact})) 69 | 70 | return dst 71 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/edges/website_groups.py: -------------------------------------------------------------------------------- 1 | from nepytune.utils import get_id 2 | from nepytune.write_utils import gremlin_writer, GremlinEdgeCSV, json_lines_file 3 | 4 | 5 | WEBISTE_GROUP_EDGE_LABEL = "links_to" 6 | 7 | 8 | def generate_website_group_edges(website_group_json, dst): 9 | """Generate website group edges CSV.""" 10 | with open(website_group_json) as f_h: 11 | with gremlin_writer(GremlinEdgeCSV, dst, attributes=[]) as writer: 12 | for data in json_lines_file(f_h): 13 
| root_id = data["id"] 14 | websites = data["websites"] 15 | for website in websites: 16 | writer.add( 17 | _id=get_id(root_id, website, {}), 18 | _from=root_id, 19 | to=website, 20 | label=WEBISTE_GROUP_EDGE_LABEL, 21 | attribute_map={} 22 | ) 23 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/nodes/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/identity_groups.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils import gremlin_writer, GremlinNodeCSV, json_lines_file 2 | 3 | 4 | def generate_identity_group_nodes(src, dst): 5 | """Generate identity_group csv file with nodes.""" 6 | attrs = ["igid:String", "type:String"] 7 | with open(src) as f_h: 8 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attrs) as writer: 9 | for data in json_lines_file(f_h): 10 | if data["persistentIds"]: 11 | writer.add( 12 | _id=data["igid"], 13 | attribute_map={"igid": data["igid"], "type": data["type"]}, 14 | label="identityGroup", 15 | ) 16 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/ip_loc.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | 3 | from nepytune.write_utils import gremlin_writer, GremlinNodeCSV, json_lines_file 4 | from nepytune.utils import hash_ 5 | 6 | 7 | IPLoc = namedtuple("IPLoc", "state, city, ip_address") 8 | 9 | 10 | def get_id(ip_loc): 11 | """Generate id from ip loc.""" 12 | return hash_([ip_loc.state, ip_loc.city, ip_loc.ip_address]) 13 | 14 | 15 | def generate_ip_loc_nodes_from_facts(src, dst): 16 | """Generate ip location csv file with nodes.""" 17 | attrs = ["state:String", "city:String", "ip_address:String"] 18 | with open(src) as f_h: 19 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attrs) as writer: 20 | locations = set() 21 | for data in json_lines_file(f_h): 22 | for fact in data["facts"]: 23 | locations.add( 24 | IPLoc(fact["state"], fact["city"], fact["ip_address"]) 25 | ) 26 | 27 | for location in locations: 28 | writer.add( 29 | _id=get_id(location), 30 | attribute_map={ 31 | "state": location.state, 32 | "city": location.city, 33 | "ip_address": location.ip_address, 34 | }, 35 | label="IP", 36 | ) 37 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/users.py: -------------------------------------------------------------------------------- 1 | from nepytune.write_utils import gremlin_writer, json_lines_file, GremlinNodeCSV 2 | 3 | 4 | def generate_user_nodes(src, dst): 5 | """Generate user node csv file.""" 6 | attributes = [ 7 | "uid:String", 8 | "user_agent:String", 9 | "device:String", 10 | "os:String", 11 | "browser:String", 12 | "email:String", 13 | "type:String", 14 | ] 15 | with open(src) as src_data: 16 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attributes) as writer: 17 | for data in json_lines_file(src_data): 18 | writer.add( 19 | _id=data["uid"], 20 | 
attribute_map={ 21 | "uid": data["uid"], 22 | "user_agent": data["user_agent"], 23 | "device": data["device"], 24 | "os": data["os"], 25 | "browser": data["browser"], 26 | "email": data["email"], 27 | "type": data["type"], 28 | }, 29 | label="transientId", 30 | ) 31 | return dst 32 | 33 | 34 | def generate_persistent_nodes(src, dst): 35 | """Generate persistent node csv file.""" 36 | with open(src) as f_h: 37 | with gremlin_writer(GremlinNodeCSV, dst, attributes=["pid:String"]) as writer: 38 | for data in json_lines_file(f_h): 39 | writer.add( 40 | _id=data["pid"], 41 | attribute_map={"pid": data["pid"]}, 42 | label="persistentId", 43 | ) 44 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/nodes/websites.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import collections 3 | 4 | from nepytune.utils import hash_ 5 | from nepytune.write_utils import gremlin_writer, GremlinNodeCSV, json_lines_file 6 | 7 | WEBSITE_LABEL = "website" 8 | WEBSITE_GROUP_LABEL = "websiteGroup" 9 | 10 | Website = collections.namedtuple("Website", ["url", "title"]) 11 | 12 | 13 | def generate_website_nodes(urls, titles, dst): 14 | """ 15 | Generate Website nodes and save it into csv file. 16 | 17 | The CSV is compatible with AWS Neptune Gremlin data format. 18 | 19 | Website nodes are generated from dataset files: 20 | * urls.csv 21 | * titles.csv 22 | 23 | Files contain maps of fact_id and website url/title. 24 | Data is joined by fact_id. 25 | """ 26 | 27 | urls = read_urls_from_csv(urls) 28 | titles = read_titles_from_csv(titles) 29 | generate_website_csv(urls, titles, dst) 30 | 31 | 32 | def generate_website_group_nodes(website_group_json, dst): 33 | """Generate website groups csv.""" 34 | attributes = [ 35 | "url:String", 36 | "category:String", 37 | "categoryCode:String" 38 | ] 39 | with open(website_group_json) as f_h: 40 | with gremlin_writer(GremlinNodeCSV, dst, attributes=attributes) as writer: 41 | for data in json_lines_file(f_h): 42 | writer.add( 43 | _id=data["id"], 44 | attribute_map={ 45 | "url": data["url"], 46 | "category": data["category"]["name"], 47 | "categoryCode": data["category"]["code"] 48 | }, 49 | label=WEBSITE_GROUP_LABEL 50 | ) 51 | 52 | 53 | def read_urls_from_csv(path): 54 | """Return dict with urls and fact ids corresponding to them.""" 55 | urls = collections.defaultdict(list) 56 | with open(path) as csv_file: 57 | csv_reader = csv.reader(csv_file, delimiter=",") 58 | for row in csv_reader: 59 | fid = row[0] 60 | url = row[1] 61 | urls[url].append(fid) 62 | return urls 63 | 64 | 65 | def read_titles_from_csv(path): 66 | """Read titles from csv.""" 67 | titles = {} 68 | with open(path) as csv_file: 69 | csv_reader = csv.reader(csv_file, delimiter=",") 70 | for row in csv_reader: 71 | fid = row[0] 72 | title = row[1] 73 | titles[fid] = title 74 | return titles 75 | 76 | 77 | def generate_websites(urls, titles): 78 | """Yield rows in CSV format.""" 79 | for url, fids in urls.items(): 80 | title = get_website_title(fids, titles) 81 | yield Website(url, title) 82 | 83 | 84 | def get_website_title(fids, titles): 85 | """Get website title.""" 86 | for fid in fids: 87 | title = titles.get(fid) 88 | if title: 89 | return title 90 | return None 91 | 92 | 93 | def generate_website_csv(urls, titles, dst): 94 | """Generate destination CSV file.""" 95 | attributes = ["url:String", "title:String"] 96 | with gremlin_writer(GremlinNodeCSV, dst, 
attributes=attributes) as writer: 97 | for website in generate_websites(urls, titles): 98 | attribute_map = {"url": website.url, "title": website.title} 99 | writer.add( 100 | _id=website.url, attribute_map=attribute_map, label=WEBSITE_LABEL 101 | ) 102 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/traversal.py: -------------------------------------------------------------------------------- 1 | from gremlin_python.process.anonymous_traversal import traversal 2 | from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection 3 | from gremlin_python.driver.aiohttp.transport import AiohttpTransport 4 | 5 | def get_traversal(endpoint): 6 | """Given gremlin endpoint get connected remote traversal.""" 7 | return traversal().withRemote( 8 | DriverRemoteConnection(endpoint, "g", 9 | transport_factory=lambda:AiohttpTransport(call_from_event_loop=True)) 10 | ) -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sample use case query package. 3 | 4 | Each module defines few "public" functions among which: 5 | * one is for creating the visual representation of part of the referenced subgraph 6 | * one or two are for use case queries to run on the graph 7 | """ 8 | 9 | from nepytune.usecase.user_summary import get_sibling_attrs 10 | from nepytune.usecase.undecided_users import ( 11 | undecided_users_audience, undecided_user_audience_check 12 | ) 13 | from nepytune.usecase.brand_interaction import brand_interaction_audience 14 | from nepytune.usecase.users_from_household import get_all_transient_ids_in_household 15 | from nepytune.usecase.purchase_path import get_activity_of_early_adopters 16 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/brand_interaction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: Advertisers want to generate audiences for DSP platform targeting. 3 | Specific audience could be the users who are interested in specific car brands. 4 | """ 5 | 6 | import networkx as nx 7 | from gremlin_python.process.traversal import P 8 | from gremlin_python.process.graph_traversal import select, out 9 | 10 | from nepytune import drawing 11 | 12 | 13 | def get_root_url(g, website_url): 14 | """Given website url, get its root node.""" 15 | return ( 16 | g.V(website_url) 17 | .hasLabel("website") 18 | .in_("links_to") 19 | ) 20 | 21 | 22 | def brand_interaction_audience(g, website_url): 23 | """ 24 | Given website url, get all transitive (through persistent) identities 25 | that interacted with this brand on any of its pages. 
26 | """ 27 | return ( 28 | get_root_url(g, website_url) 29 | .out("links_to") # get all websites from this root url 30 | .in_("visited") 31 | .in_("has_identity").dedup() 32 | .out("has_identity") 33 | .values("uid") 34 | ) 35 | 36 | 37 | def draw_referenced_subgraph(g, root_url): 38 | graph = _build_networkx_graph( 39 | root_url, 40 | query_results=_get_transient_ids( 41 | _get_persistent_ids_which_visited_website(g, root_url), 42 | root_url 43 | ).next() 44 | ) 45 | graph = drawing.layout(graph, nx.kamada_kawai_layout) 46 | drawing.draw( 47 | title="Brand interaction", 48 | scatters=[ 49 | drawing.edges_scatter(graph) 50 | ] + list( 51 | drawing.scatters_by_label( 52 | graph, attrs_to_skip=["pos"], 53 | sizes={"websiteGroup": 30, "transientId": 10, "persistentId": 15, "website": 10} 54 | ) 55 | ), 56 | ) 57 | 58 | 59 | # =========================== 60 | # Everything below was added to introspect the query results via visualisations 61 | 62 | 63 | def _build_networkx_graph(root_url, query_results): 64 | graph = nx.Graph() 65 | graph.add_node( 66 | root_url, label="websiteGroup", url=root_url 67 | ) 68 | 69 | for persistent_id, visited_events in query_results.items(): 70 | graph.add_node(persistent_id, label="persistentId", pid=persistent_id) 71 | 72 | for event in visited_events: 73 | graph.add_node(event["uid"], label="transientId", uid=event["uid"]) 74 | if event["visited_url"] != root_url: 75 | graph.add_node(event["visited_url"], label="website", url=event["visited_url"]) 76 | graph.add_edge(event["uid"], event["visited_url"], label="visited") 77 | graph.add_edge(persistent_id, event["uid"], label="has_identity") 78 | graph.add_edge(root_url, event["visited_url"], label="links_to") 79 | 80 | return graph 81 | 82 | 83 | def _get_persistent_ids_which_visited_website(g, root_url): 84 | return ( 85 | g.V(root_url) 86 | .aggregate("root_url") 87 | .in_("visited") 88 | .in_("has_identity").dedup().limit(50).fold() 89 | .project("root_url", "persistent_ids") 90 | .by(select("root_url").unfold().valueMap(True)) 91 | .by() 92 | ) 93 | 94 | 95 | def _get_transient_ids(query, root_url): 96 | return ( 97 | query 98 | .select("persistent_ids") 99 | .unfold() 100 | .group() 101 | .by("pid") 102 | .by( 103 | out("has_identity") 104 | .outE("visited") 105 | .has( # do not go through links_to, as it causes neptune memory errors 106 | "visited_url", P.between(root_url, root_url + "/zzz") 107 | ) 108 | .valueMap("uid", "visited_url") 109 | .dedup() 110 | .limit(15) 111 | .fold() 112 | ) 113 | ) 114 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/purchase_path.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use-Case. 3 | 4 | Marketing analyst wants to understand path to purchase of a new product by a few early adopters ( say 5) 5 | through interactive queries. This product is high involvement and expensive, and therefore they want to understand the 6 | research undertaken by the customer. 7 | 8 | * Which device was used to initiate the first research. Was that prompted by an ad, email promotion? 
9 | * How many devices were used overall and what was the time taken from initial research to final purchase 10 | * On which devices did the customer spend more time 11 | """ 12 | import itertools 13 | from collections import namedtuple, defaultdict 14 | from datetime import timedelta 15 | 16 | import networkx as nx 17 | import plotly.graph_objects as go 18 | from gremlin_python.process.traversal import P 19 | from gremlin_python.process.graph_traversal import ( 20 | outV, values, project, constant, select, inV, where, identity 21 | ) 22 | 23 | from nepytune import drawing 24 | from nepytune.visualizations import bar_plots 25 | 26 | 27 | Event = namedtuple('Event', 'ts persistentId transientId device_type url') 28 | Session = namedtuple('Session', 'transientId persistentId device_type events') 29 | 30 | 31 | def get_activity_of_early_adopters(g, thank_you_page_url, skip_single_transients=False, limit=5): 32 | """ 33 | Given thank you page url, find first early adopters of the product. 34 | 35 | In other words: 36 | * find first few persistent identities (or transient if they're not matched with any user) 37 | that visited given thank you page 38 | * extract their *whole* activity on the domain of the thank_you_page 39 | """ 40 | return ( 41 | g.V(thank_you_page_url) 42 | .hasLabel("website").as_("thank_you") 43 | .in_("links_to").as_("website_group") 44 | .select("thank_you") 45 | .inE("visited") 46 | .order().by("ts") 47 | .choose( 48 | constant(skip_single_transients).is_(P.eq(True)), 49 | where(outV().in_("has_identity")), 50 | identity() 51 | ) 52 | .choose( 53 | outV().in_("has_identity"), 54 | project( 55 | "type", "id", "purchase_ts" 56 | ) 57 | .by(constant("persistent")) 58 | .by(outV().in_("has_identity")) 59 | .by(values("ts")), 60 | project( 61 | "type", "id", "purchase_ts" 62 | ) 63 | .by(constant("transient")) 64 | .by(outV()) 65 | .by(values("ts")) 66 | ).dedup("id").limit(limit) 67 | .choose( 68 | select("type").is_("persistent"), 69 | project( 70 | "persistent_id", "transient_id", "purchase_ts" 71 | ).by(select("id").values("pid")) 72 | .by(select("id").out("has_identity").fold()) 73 | .by(select("purchase_ts")), 74 | project("persistent_id", "transient_id", "purchase_ts") 75 | .by(constant("")) 76 | .by(select("id").fold()) 77 | .by(select("purchase_ts")) 78 | ).project("persistent_id", "purchase_ts", "devices", "visits") 79 | .by(select("persistent_id")) 80 | .by(select("purchase_ts")) 81 | .by(select("transient_id").unfold().group().by(values("uid")).by(values("type"))) 82 | .by( 83 | select("transient_id").unfold().outE("visited").order().by("ts") 84 | .where( 85 | inV().in_("links_to").where(P.eq("website_group")) 86 | ) 87 | .project( 88 | "transientId", "url", "ts" 89 | ).by("uid").by("visited_url").by("ts").fold()) 90 | ) 91 | 92 | 93 | def transform_activities(result_set): 94 | """Build the flat list of user activities.""" 95 | for per_persistent_events in result_set: 96 | for visit in per_persistent_events["visits"]: 97 | if visit["ts"] <= per_persistent_events["purchase_ts"]: 98 | yield Event(**{ 99 | "persistentId": per_persistent_events["persistent_id"] or None, 100 | "device_type": per_persistent_events["devices"][visit["transientId"]], 101 | **visit 102 | }) 103 | 104 | 105 | def first_device_in_session(user_events): 106 | """Get device id which initialize session.""" 107 | return user_events[0].transientId 108 | 109 | 110 | def time_to_purchase(user_events): 111 | """Get device id which initialize session.""" 112 | return user_events[-1].ts - 
user_events[0].ts 113 | 114 | 115 | def consecutive_pairs(iterable): 116 | f_ptr, s_ptr = itertools.tee(iterable, 2) 117 | next(s_ptr) 118 | return zip(f_ptr, s_ptr) 119 | 120 | 121 | def generate_session_from_event(events, max_ts_delta=300): 122 | """Generate sessions from events.""" 123 | events_by_timestamp = sorted(events, key=lambda event: (event.transientId, event.ts)) 124 | guard_event = Event( 125 | ts=None, persistentId=None, transientId=None, device_type=None, url=None 126 | ) 127 | sessions = [] 128 | 129 | session = Session( 130 | transientId=events_by_timestamp[0].transientId, 131 | persistentId=events_by_timestamp[0].persistentId, 132 | device_type=events_by_timestamp[0].device_type, 133 | events=[] 134 | ) 135 | events_count = 0 136 | 137 | for event, next_event in consecutive_pairs(events_by_timestamp + [guard_event]): 138 | session.events.append(event) 139 | if event.transientId != next_event.transientId or (next_event.ts - event.ts).seconds > max_ts_delta: 140 | sessions.append(session) 141 | events_count += len(session.events) 142 | session = Session( 143 | transientId=next_event.transientId, 144 | persistentId=next_event.persistentId, 145 | device_type=next_event.device_type, 146 | events=[] 147 | ) 148 | 149 | assert len(events_by_timestamp) == events_count 150 | return sessions 151 | 152 | 153 | def get_session_duration(user_session): 154 | """Get session duration.""" 155 | return user_session.events[-1].ts - user_session.events[0].ts 156 | 157 | 158 | def get_time_by_device(user_sessions): 159 | """Get time spent on device.""" 160 | time_by_device = defaultdict(timedelta) 161 | 162 | for session in user_sessions: 163 | time_by_device[session.transientId] += get_session_duration(session) 164 | 165 | return time_by_device 166 | 167 | 168 | def generate_stats(all_activities, **kwargs): 169 | """Generate statistics per user (persistentId) activities.""" 170 | result = dict() 171 | 172 | user_sessions = generate_session_from_event(all_activities, **kwargs) 173 | 174 | def grouper(session): 175 | return session.persistentId or session.transientId 176 | 177 | for persistent_id, session_list in (itertools.groupby(sorted(user_sessions, key=grouper), key=grouper)): 178 | session_list = list(session_list) 179 | session_durations = get_time_by_device(session_list) 180 | user_events_by_timestamp = sorted( 181 | itertools.chain.from_iterable([session.events for session in session_list]), 182 | key=lambda event: event.ts 183 | ) 184 | 185 | if persistent_id not in result: 186 | result[persistent_id] = { 187 | "transient_ids": {}, 188 | "devices_count": 0, 189 | "first_device": first_device_in_session(user_events_by_timestamp), 190 | "time_to_purchase": time_to_purchase(user_events_by_timestamp), 191 | } 192 | 193 | for transient_id, duration in session_durations.items(): 194 | user_sessions = sorted( 195 | [session for session in session_list if session.transientId == transient_id], 196 | key=lambda session: session.events[0].ts 197 | ) 198 | result[persistent_id]["transient_ids"][transient_id] = { 199 | "sessions_duration": duration, 200 | "sessions_count": len(user_sessions), 201 | "purchase_session": user_sessions[-1], 202 | "sessions": user_sessions 203 | } 204 | result[persistent_id]["devices_count"] += 1 205 | return result 206 | 207 | 208 | def draw_referenced_subgraph(persistent_id, graph): 209 | drawing.draw( 210 | title=f"{persistent_id} path to purchase", 211 | scatters=list( 212 | drawing.edge_scatters_by_label( 213 | graph, 214 | opacity={"visited": 0.35, 
"purchase_path": 0.4}, 215 | widths={"links_to": 0.2, "visited": 3, "purchase_path": 3}, 216 | colors={"links_to": "grey", "purchase_path": "red"}, 217 | dashes={"links_to": "dot"} 218 | ) 219 | ) + list( 220 | drawing.scatters_by_label( 221 | graph, attrs_to_skip=["pos", "size"], 222 | sizes={ 223 | "event": 9, 224 | "persistentId": 20, 225 | "thank-you-page": 25, 226 | "website": 25, 227 | "session": 15, 228 | }, 229 | colors={ 230 | "event": 'rgb(153,112,171)', 231 | "session": 'rgb(116,173,209)', 232 | "thank-you-page": 'orange', 233 | "website": 'rgb(90,174,97)', 234 | "transientId": 'rgb(158,1,66)', 235 | "persistentId": 'rgb(213,62,79)' 236 | } 237 | ) 238 | ), 239 | ) 240 | 241 | 242 | def compute_subgraph_pos(query_results, thank_you_page): 243 | """Given query results compute subgraph positions.""" 244 | for persistent_id, raw_graph in _build_networkx_graph_single( 245 | query_results=query_results, 246 | thank_you_page=thank_you_page, 247 | max_ts_delta=300 248 | ): 249 | raw_graph.nodes[thank_you_page]["label"] = "thank-you-page" 250 | 251 | graph_with_pos_computed = drawing.layout(raw_graph, _custom_layout) 252 | 253 | yield persistent_id, graph_with_pos_computed 254 | 255 | 256 | def custom_plots(data): 257 | """Build list of custom plot figures.""" 258 | return [ 259 | bar_plots.make_bars( 260 | { 261 | k[:5]: v["time_to_purchase"].total_seconds() / (3600 * 24) 262 | for k, v in data.items() 263 | }, 264 | title="User's time to purchase", 265 | x_title="Persistent IDs", 266 | y_title="Days to purchase", 267 | lazy=True 268 | ), 269 | _show_session_stats(data, title="Per device session statistics"), 270 | _show_most_common_visited_webpages(data, title="Most common visited subpages before purchase", count=10), 271 | ] 272 | 273 | 274 | # =========================== 275 | # Everything below was added to introspect the query results via visualisations 276 | 277 | 278 | def _show_session_stats(data, title): 279 | def sunburst_data(data): 280 | total_sum = sum( 281 | values["sessions_count"] 282 | for _, v in data.items() 283 | for values in v["transient_ids"].values() 284 | ) 285 | yield "", "Users", 1.5 * total_sum, "white", "" 286 | 287 | for i, (persistentId, v) in enumerate(data.items(), 1): 288 | yield ( 289 | "Users", 290 | persistentId[:5], 291 | sum(values["sessions_count"] for values in v["transient_ids"].values()), 292 | i, 293 | ( 294 | f"
persistentId: {persistentId}
" 295 | f"devices count: {len(v['transient_ids'])}" 296 | ) 297 | ) 298 | for transientId, values in v["transient_ids"].items(): 299 | yield ( 300 | persistentId[:5], 301 | transientId[:5], 302 | values["sessions_count"], 303 | i, 304 | ( 305 | f"
transientId: {transientId}" 306 | f"
session count: {values['sessions_count']}" 307 | f"
total session duration: {values['sessions_duration']}" 308 | ) 309 | ) 310 | for session in values["sessions"]: 311 | yield ( 312 | transientId[:5], 313 | session.events[0].ts, 314 | 1, 315 | i, 316 | ( 317 | f"
session start: {session.events[0].ts}" 318 | f"
session end: {session.events[-1].ts}" 319 | f"
session duration: {session.events[-1].ts - session.events[0].ts}" 320 | ) 321 | ) 322 | # aka legend 323 | yield "Users", "User ids", total_sum / 2, "white", "" 324 | yield "User ids", "User devices", total_sum / 2, "white", "" 325 | yield "User devices", "User sessions", total_sum / 2, "white", "" 326 | 327 | parents, labels, values, colors, hovers = zip(*[r for r in list(sunburst_data(data))]) 328 | 329 | fig = go.Figure( 330 | go.Sunburst( 331 | labels=labels, 332 | parents=parents, 333 | values=values, 334 | branchvalues="total", 335 | marker=dict( 336 | colors=colors, 337 | line=dict(width=0.5, color='DarkSlateGrey') 338 | ), 339 | hovertext=hovers, 340 | hoverinfo="text", 341 | ), 342 | ) 343 | 344 | fig.update_layout(margin=dict(t=50, l=0, r=0, b=0), title=title) 345 | return fig 346 | 347 | 348 | def _show_most_common_visited_webpages(data, title, count): 349 | def drop_qs(url): 350 | pos = url.find("?") 351 | if pos == -1: 352 | return url 353 | return url[0:pos] 354 | 355 | def compute_data(data): 356 | res = defaultdict(list) 357 | for v in data.values(): 358 | for values in v["transient_ids"].values(): 359 | for session in values["sessions"]: 360 | for event in session.events: 361 | res[drop_qs(event.url)].append(session.persistentId) 362 | return res 363 | 364 | def sunburst_data(data): 365 | total_sum = sum(len(v) for v in data.values()) 366 | yield "", "websites", total_sum, "" 367 | for i, (website, persistents) in enumerate(data.items()): 368 | yield ( 369 | "websites", f"Website {i}", 370 | len(persistents), 371 | f"
website: {website}" 372 | f"
users: {len(set(persistents))}" 373 | f"
events: {len(persistents)}" 374 | ) 375 | for persistent, group in itertools.groupby( 376 | sorted(list(persistents)), 377 | ): 378 | group = list(group) 379 | yield ( 380 | f"Website {i}", persistent[:5], 381 | len(group), 382 | f"
persistentId: {persistent}" 383 | f"
events: {len(group)}" 384 | ) 385 | 386 | events_data = compute_data(data) 387 | most_common = dict(sorted(events_data.items(), key=lambda x: -len(x[1]))[:count]) 388 | most_common_counts = {k: len(v) for k, v in most_common.items()} 389 | 390 | pie_chart = go.Pie( 391 | labels=list(most_common_counts.keys()), 392 | values=list(most_common_counts.values()), 393 | marker=dict(line=dict(color='DarkSlateGrey', width=0.5)), 394 | domain=dict(column=0) 395 | ) 396 | 397 | parents, labels, values, hovers = zip(*[r for r in list(sunburst_data(most_common))]) 398 | 399 | sunburst = go.Sunburst( 400 | labels=labels, 401 | parents=parents, 402 | values=values, 403 | branchvalues="total", 404 | marker=dict( 405 | line=dict(width=0.5, color='DarkSlateGrey') 406 | ), 407 | hovertext=hovers, 408 | hoverinfo="text", 409 | domain=dict(column=1) 410 | ) 411 | 412 | layout = go.Layout( 413 | grid=go.layout.Grid(columns=2, rows=1), 414 | margin=go.layout.Margin(t=50, l=0, r=0, b=0), 415 | title=title, 416 | legend_orientation="h" 417 | ) 418 | 419 | return go.Figure([pie_chart, sunburst], layout) 420 | 421 | 422 | def _build_networkx_graph_single(query_results, thank_you_page, **kwargs): 423 | def drop_qs(url): 424 | pos = url.find("?") 425 | if pos == -1: 426 | return url 427 | return url[0:pos] 428 | 429 | def transient_attrs(transient_id, transient_dict): 430 | return { 431 | "uid": transient_id, 432 | "sessions_count": len(transient_dict["sessions"]), 433 | "time_on_device": transient_dict["sessions_duration"] 434 | } 435 | 436 | def session_attrs(session): 437 | return hash((session.transientId, session.events[0])), { 438 | "duration": get_session_duration(session), 439 | "events": len(session.events) 440 | } 441 | 442 | def event_to_website(graph, event, event_label): 443 | website = drop_qs(event.url) 444 | graph.add_node(website, label="website", url=website) 445 | graph.add_node(hash(event), label=event_label, **event._asdict()) 446 | graph.add_edge(website, hash(event), label="links_to") 447 | 448 | for persistent_id, result_dict in generate_stats(query_results, **kwargs).items(): 449 | graph = nx.MultiGraph() 450 | graph.add_node(persistent_id, label="persistentId", pid=persistent_id) 451 | 452 | for transient_id, transient_dict in result_dict["transient_ids"].items(): 453 | graph.add_node(transient_id, label="transientId", **transient_attrs(transient_id, transient_dict)) 454 | graph.add_edge(persistent_id, transient_id, label="has_identity") 455 | 456 | for session in transient_dict["sessions"]: 457 | event_label = "event" 458 | if session == transient_dict["purchase_session"]: 459 | event_edge_label = "purchase_path" 460 | else: 461 | event_edge_label = "visited" 462 | 463 | session_id, session_node_attrs = session_attrs(session) 464 | # transient -> session 465 | graph.add_node(session_id, label="session", **session_node_attrs) 466 | graph.add_edge(session_id, transient_id, label="session") 467 | 468 | fst_event = session.events[0] 469 | # event -> website without query strings 470 | event_to_website(graph, fst_event, event_label) 471 | 472 | # session -> first session event 473 | graph.add_edge(session_id, hash(fst_event), label="session_start") 474 | 475 | for fst_event, snd_event in consecutive_pairs(session.events): 476 | event_to_website(graph, fst_event, event_label) 477 | event_to_website(graph, snd_event, event_label) 478 | graph.add_edge(hash(fst_event), hash(snd_event), label=event_edge_label) 479 | graph.nodes[result_dict["first_device"]]["size"] = 15 480 | 481 | yield persistent_id, 
graph 482 | 483 | 484 | def _custom_layout(graph): 485 | """Custom layout function.""" 486 | def _transform_graph(graph): 487 | """ 488 | Transform one graph into another for the purposes of better visualisation. 489 | 490 | We rebuild the graph in a tricky way to force the position computation algorithm 491 | to allign with the desired shape. 492 | """ 493 | new_graph = nx.MultiGraph() 494 | 495 | for edge in graph.edges(data=True): 496 | fst, snd, params = edge 497 | label = params["label"] 498 | 499 | new_graph.add_node(fst, **graph.nodes[fst]) 500 | new_graph.add_node(snd, **graph.nodes[snd]) 501 | if label == "links_to": 502 | # website -> event 503 | # => event -> user_website -> website 504 | user_website = f"{fst}_{snd}" 505 | new_graph.add_node(user_website, label="user_website") 506 | new_graph.add_edge(snd, user_website, label="session_visit") 507 | new_graph.add_edge(user_website, fst, label="session_link") 508 | else: 509 | new_graph.add_edge(fst, snd, **params) 510 | 511 | return new_graph 512 | 513 | return nx.kamada_kawai_layout(_transform_graph(graph)) 514 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/similar_audience.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: 3 | 4 | Identify look-alike customers for a product. 5 | The goal here is to identify prospects, who show similar behavioral patterns as your existing customers. 6 | While we can easily do this algorithmically and automate this, the goal here is to provide visual query 7 | to improve human understanding to the marketing analysts. 8 | What are the device ids from my customer graph, who are not yet buying my product (say Golf Club), 9 | but are show similar behavior patterns such lifestyle choices of buying golf or other sporting goods. 10 | """ 11 | 12 | from itertools import chain 13 | 14 | import networkx as nx 15 | 16 | from gremlin_python.process.graph_traversal import select, out, choose, constant, or_, group 17 | from gremlin_python.process.traversal import Column, Order, P 18 | 19 | import plotly.graph_objects as go 20 | 21 | from nepytune import drawing 22 | 23 | 24 | def recommend_similar_audience(g, website_url, categories_limit=3, search_time_limit_in_seconds=15): 25 | """Given website url, categories_limit, categories_coin recommend similar audience in n most popular categories. 
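    A minimal invocation sketch (the endpoint and thank-you-page URL are illustrative
    placeholders, not values shipped with this repository); the sense of "similar
    audience" used here is spelled out below:

        from nepytune.traversal import get_traversal
        from nepytune.usecase.similar_audience import recommend_similar_audience

        g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint
        lookalike_uids = recommend_similar_audience(
            g, "http://example-store.com/thank-you", categories_limit=3
        ).toList()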
26 | 27 | Similar audience - audience of users that at least once visited subpage of domain that contains IAB-category codes 28 | that are most popular across users of given website 29 | """ 30 | average_guy = ( 31 | g.V(website_url) 32 | .in_("visited") 33 | .in_("has_identity").dedup() 34 | .hasLabel("persistentId") 35 | .group().by() 36 | .by( 37 | out("has_identity").out("visited").in_("links_to") 38 | .groupCount().by("categoryCode") 39 | ) 40 | .select(Column.values).unfold().unfold() 41 | .group().by(Column.keys) 42 | .by(select(Column.values).mean()).unfold() 43 | .order().by(Column.values, Order.desc) 44 | .limit(categories_limit) 45 | ) 46 | 47 | most_popular_categories = dict(chain(*category.items()) for category in average_guy.toList()) 48 | 49 | guy_stats_subquery = ( 50 | out("has_identity") 51 | .out("visited").in_("links_to") 52 | .groupCount().by("categoryCode") 53 | .project(*most_popular_categories.keys()) 54 | ) 55 | 56 | conditions_subqueries = [] 57 | for i in most_popular_categories: 58 | guy_stats_subquery = guy_stats_subquery.by(choose(select(i), select(i), constant(0))) 59 | conditions_subqueries.append( 60 | select(Column.values).unfold() 61 | .select(i) 62 | .is_(P.gt(int(most_popular_categories[i]))) 63 | ) 64 | 65 | return ( 66 | g.V() 67 | .hasLabel("websiteGroup") 68 | .has("categoryCode", P.within(list(most_popular_categories.keys()))) 69 | .out("links_to").in_("visited").dedup().in_("has_identity").dedup() 70 | .hasLabel("persistentId") 71 | .where( 72 | out("has_identity").out("visited") 73 | .has("url", P.neq(website_url)) 74 | ) 75 | .timeLimit(search_time_limit_in_seconds * 1000) 76 | .local( 77 | group().by().by(guy_stats_subquery) 78 | .where(or_(*conditions_subqueries)) 79 | ) 80 | .select(Column.keys).unfold() 81 | .out("has_identity") 82 | .values("uid") 83 | ) 84 | 85 | 86 | def draw_average_buyer_profile_pie_chart(g, website_url, categories_limit=3,): 87 | average_profile = _get_categories_popular_across_audience_of_website( 88 | g, website_url, categories_limit=categories_limit 89 | ).toList() 90 | average_profile = dict(chain(*category.items()) for category in average_profile) 91 | 92 | labels = list(average_profile.keys()) 93 | values = list(int(i) for i in average_profile.values()) 94 | 95 | fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0)]) 96 | fig.update_traces(textinfo='value+label+percent') 97 | fig.update_layout( 98 | title_text=f"3 Most popular IAB categories of " 99 | f"\"Average Buyer Profile\"" 100 | f"
for thank you page {website_url}") 101 | fig.show() 102 | 103 | 104 | def draw_referenced_subgraph(g, website_url, categories_limit=3, search_time_limit_in_seconds=15): 105 | average_profile = _get_categories_popular_across_audience_of_website( 106 | g, website_url, categories_limit=categories_limit 107 | ).toList() 108 | average_profile = dict( 109 | chain(*category.items()) for category in average_profile 110 | ) 111 | similar_audience = _query_users_activities_stats( 112 | g, website_url, average_profile, search_time_limit_in_seconds=search_time_limit_in_seconds 113 | ) 114 | similar_audience = similar_audience.limit(15).toList() 115 | 116 | graph = _build_graph(average_profile, similar_audience) 117 | 118 | iabs = [n for n, params in graph.nodes(data=True) if params["label"] == "IAB"] 119 | avg_iabs = [n for n in iabs if graph.nodes[n]["category"] in average_profile] 120 | 121 | graph_with_pos_computed = drawing.layout( 122 | graph, 123 | nx.shell_layout, 124 | nlist=[ 125 | ["averageBuyer"], 126 | avg_iabs, 127 | set(iabs) - set(avg_iabs), 128 | [n for n, params in graph.nodes(data=True) if params["label"] == "persistentId"], 129 | [n for n, params in graph.nodes(data=True) if params["label"] == "transientId"], 130 | ] 131 | ) 132 | 133 | # update positions 134 | for name in set(iabs) - set(avg_iabs): 135 | node = graph_with_pos_computed.nodes[name] 136 | node["pos"] = [node["pos"][0], node["pos"][1]-1.75] 137 | 138 | for name in ["averageBuyer"] + avg_iabs: 139 | node = graph_with_pos_computed.nodes[name] 140 | node["pos"] = [node["pos"][0], node["pos"][1]+1.75] 141 | 142 | node = graph_with_pos_computed.nodes["averageBuyer"] 143 | node["pos"] = [node["pos"][0], node["pos"][1]+1] 144 | 145 | drawing.draw( 146 | title="User devices that visited ecommerce websites and optionally converted", 147 | scatters=list( 148 | drawing.edge_scatters_by_label( 149 | graph_with_pos_computed, 150 | dashes={ 151 | "interestedInButNotSufficient": "dash", 152 | "interestedIn": "solid" 153 | } 154 | )) + list( 155 | drawing.scatters_by_label( 156 | graph_with_pos_computed, attrs_to_skip=["pos", "opacity"], 157 | sizes={ 158 | "averageBuyer": 30, 159 | "IAB":10, 160 | "persistentId":20 161 | } 162 | ) 163 | ) 164 | ) 165 | 166 | 167 | # =========================== 168 | # Everything below was added to introspect the query results via visualisations 169 | 170 | def _get_categories_popular_across_audience_of_website(g, website_url, categories_limit=3): 171 | return ( 172 | g.V(website_url) 173 | .in_("visited") 174 | .in_("has_identity").dedup() 175 | .hasLabel("persistentId") 176 | .group().by() 177 | .by( 178 | out("has_identity").out("visited").in_("links_to") 179 | .groupCount().by("categoryCode") 180 | ) 181 | .select(Column.values).unfold().unfold() 182 | .group().by(Column.keys) 183 | .by(select(Column.values).mean()).unfold() 184 | .order().by(Column.values, Order.desc) 185 | .limit(categories_limit) 186 | ) 187 | 188 | 189 | def _query_users_activities_stats(g, website_url, most_popular_categories, 190 | search_time_limit_in_seconds=30): 191 | return ( 192 | g.V() 193 | .hasLabel("websiteGroup") 194 | .has("categoryCode", P.within(list(most_popular_categories.keys()))) 195 | .out("links_to").in_("visited").dedup().in_("has_identity").dedup() 196 | .hasLabel("persistentId") 197 | .where( 198 | out("has_identity").out("visited") 199 | .has("url", P.neq(website_url)) 200 | ) 201 | .timeLimit(search_time_limit_in_seconds * 1000) 202 | .local( 203 | group().by().by( 204 | out("has_identity") 205 | 
.out("visited").in_("links_to") 206 | .groupCount().by("categoryCode") 207 | ) 208 | .project("pid", "iabs", "tids") 209 | .by(select(Column.keys).unfold()) 210 | .by(select(Column.values).unfold()) 211 | .by(select(Column.keys).unfold().out("has_identity").values("uid").fold()) 212 | ) 213 | ) 214 | 215 | 216 | def _build_graph(average_buyer_categories, similar_audience): 217 | avg_buyer = "averageBuyer" 218 | 219 | graph = nx.Graph() 220 | graph.add_node(avg_buyer, label=avg_buyer, **average_buyer_categories) 221 | 222 | for avg_iab in average_buyer_categories.keys(): 223 | graph.add_node(avg_iab, label="IAB", category=avg_iab) 224 | graph.add_edge(avg_buyer, avg_iab, label="interestedIn") 225 | 226 | for user in similar_audience: 227 | pid, cats, tids = user["pid"], user["iabs"], user["tids"] 228 | 229 | user_categories = dict(sorted(cats.items(), key=lambda x: x[1])[:3]) 230 | comparison = {k: cats.get(k, 0) for k in average_buyer_categories.keys()} 231 | user_categories.update(comparison) 232 | 233 | user_comparisons = False 234 | for ucategory, value in user_categories.items(): 235 | graph.add_node(ucategory, label="IAB", category=ucategory) 236 | label = "interestedIn" 237 | if value: 238 | if ucategory in average_buyer_categories: 239 | if user_categories[ucategory] >= average_buyer_categories[ucategory]: 240 | user_comparisons = True 241 | else: 242 | label = "interestedInButNotSufficient" 243 | graph.add_edge(pid, ucategory, label=label) 244 | 245 | opacity = 1 if user_comparisons else 0.5 246 | for tid in tids: 247 | graph.add_edge(pid, tid, label="hasIdentity") 248 | graph.add_node(tid, label="transientId", uid=tid, opacity=opacity) 249 | 250 | graph.add_node( 251 | pid, label="persistentId", pid=pid, 252 | opacity=opacity, **cats 253 | ) 254 | 255 | return graph 256 | 257 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/undecided_users.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: Ecommerce publishers want to convince undecided users to purchase the product by offering them discount codes 3 | as soon as they have met certain criteria. Find all users who have visited product page at least X times in the last 4 | 30 days, but did not buy anything (have not visited thank you page). 5 | """ 6 | from collections import Counter 7 | 8 | from gremlin_python.process.traversal import P, Column 9 | from gremlin_python.process.graph_traversal import ( 10 | has, groupCount, 11 | constant, and_, coalesce, select, count, out, where, values 12 | ) 13 | 14 | import networkx as nx 15 | 16 | from nepytune import drawing 17 | 18 | 19 | def undecided_user_audience_check(g, transient_id, website_url, thank_you_page_url, since, min_visited_count): 20 | """ 21 | Given transient id, check whether it belongs to an audience. 22 | 23 | It's simple yes, no question. 
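    A minimal invocation sketch (every argument value below is an illustrative
    assumption, not data shipped with this repository); the membership criteria
    are listed right after it:

        from datetime import datetime

        from nepytune.traversal import get_traversal
        from nepytune.usecase.undecided_users import undecided_user_audience_check

        g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint
        belongs = undecided_user_audience_check(
            g,
            transient_id="example-transient-uid",
            website_url="http://example-store.com/product",
            thank_you_page_url="http://example-store.com/thank-you",
            since=datetime(2016, 6, 1),
            min_visited_count=5,
        ).next()  # True if this device's user qualifies for the discount audience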
24 | 25 | User belongs to an audience whenever all of the following criteria are met: 26 | * visited some website url at least X times since specific timestamp 27 | * did not visit thank you page url since specific timestamp 28 | """ 29 | return ( 30 | g.V(transient_id) 31 | .hasLabel("transientId") 32 | .in_("has_identity") 33 | .out("has_identity") 34 | .outE("visited") 35 | .has("ts", P.gt(since)) 36 | .choose( 37 | has("visited_url", website_url), 38 | groupCount("visits").by(constant("page_visits")) 39 | ) 40 | .choose( 41 | has("visited_url", thank_you_page_url), 42 | groupCount("visits").by(constant("thank_you_page_vists")) 43 | ) 44 | .cap("visits") 45 | .coalesce( 46 | and_( 47 | coalesce(select("thank_you_page_vists"), constant(0)).is_(0), 48 | select("page_visits").is_(P.gt(min_visited_count)) 49 | ).choose( 50 | count().is_(1), 51 | constant(True) 52 | ), 53 | constant(False) 54 | ) 55 | 56 | ) 57 | 58 | 59 | def undecided_users_audience(g, website_url, thank_you_page_url, since, min_visited_count): 60 | """ 61 | Given website url, get all the users that meet audience conditions. 62 | 63 | It returns list of transient identities uids. 64 | 65 | Audience is build from the users that met following criteria: 66 | * visited some website url at least X times since specific timestamp 67 | * did not visit thank you page url since specific timestamp 68 | """ 69 | return ( 70 | g.V(website_url) 71 | .hasLabel("website") 72 | .inE("visited").has("ts", P.gt(since)).outV() 73 | .in_("has_identity") 74 | .groupCount() 75 | .unfold().dedup() 76 | .where( 77 | select(Column.values).is_(P.gt(min_visited_count)) 78 | ) 79 | .select(Column.keys).as_("pids") 80 | .map( 81 | out("has_identity") 82 | .outE("visited") 83 | .has("visited_url", thank_you_page_url) 84 | .has("ts", P.gt(since)).outV() 85 | .in_("has_identity").dedup() 86 | .values("pid").fold() 87 | ).as_("pids_that_visited") 88 | .select("pids") 89 | .not_( 90 | has("pid", where(P.within("pids_that_visited"))) 91 | ) 92 | .out("has_identity") 93 | .values("uid") 94 | ) 95 | 96 | 97 | def draw_referenced_subgraph(g, website_url, thank_you_page_url, since, min_visited_count): 98 | raw_graph = _build_networkx_graph(g, website_url, thank_you_page_url, since) 99 | 100 | persistent_nodes = [node for node, attr in raw_graph.nodes(data=True) if attr["label"] == "persistentId"] 101 | graph_with_pos_computed = drawing.layout( 102 | raw_graph, 103 | nx.shell_layout, 104 | nlist=[ 105 | [website_url], 106 | [node for node, attr in raw_graph.nodes(data=True) if attr["label"] == "transientId"], 107 | [node for node, attr in raw_graph.nodes(data=True) if attr["label"] == "persistentId"], 108 | [thank_you_page_url] 109 | ] 110 | ) 111 | 112 | # update positions and change node label 113 | raw_graph.nodes[thank_you_page_url]["pos"] += (0, 0.75) 114 | for node in persistent_nodes: 115 | has_visited_thank_you_page = False 116 | visited_at_least_X_times = False 117 | for check_name, value in raw_graph.nodes[node]["visited_events"].items(): 118 | if ">=" in check_name and value > 0: 119 | if "thank" in check_name: 120 | has_visited_thank_you_page = True 121 | elif value > min_visited_count: 122 | visited_at_least_X_times = True 123 | if (has_visited_thank_you_page or not visited_at_least_X_times): 124 | for _, to in raw_graph.edges(node): 125 | raw_graph.nodes[to]["opacity"] = 0.25 126 | raw_graph.nodes[node]["opacity"] = 0.25 127 | 128 | drawing.draw( 129 | title="User devices that visited ecommerce websites and optionally converted", 130 | scatters=[ 131 
| drawing.edges_scatter(graph_with_pos_computed) 132 | ] + list( 133 | drawing.scatters_by_label( 134 | graph_with_pos_computed, attrs_to_skip=["pos", "opacity"], 135 | sizes={ 136 | "transientId": 10, "transientId-audience": 10, 137 | "persistentId": 20, "persistentId-audience": 20, 138 | "website": 30, 139 | "thankYouPage": 30, 140 | } 141 | ) 142 | ) 143 | ) 144 | 145 | 146 | # =========================== 147 | # Everything below was added to introspect the query results via visualisations 148 | 149 | 150 | def _get_subgraph(g, website_url, thank_you_page_url, since): 151 | return ( 152 | g.V() 153 | .hasLabel("website") 154 | .has("url", P.within([website_url, thank_you_page_url])) 155 | .in_("visited") 156 | .in_("has_identity") 157 | .dedup().limit(20) 158 | .project("persistent_id", "transient_ids", "visited_events") 159 | .by(values("pid")) 160 | .by(out("has_identity").values("uid").fold()) 161 | .by( 162 | out("has_identity") 163 | .outE("visited") 164 | .has("visited_url", P.within([website_url, thank_you_page_url])) 165 | .valueMap("visited_url", "ts", "uid").dedup().fold() 166 | ) 167 | ) 168 | 169 | 170 | def _build_networkx_graph(g, website_url, thank_you_page_url, since): 171 | graph = nx.Graph() 172 | graph.add_node(website_url, label="website", url=website_url) 173 | graph.add_node(thank_you_page_url, label="thankYouPage", url=thank_you_page_url) 174 | 175 | for data in _get_subgraph(g, website_url, thank_you_page_url, since).toList(): 176 | graph.add_node(data["persistent_id"], label="persistentId", pid=data["persistent_id"], 177 | visited_events=Counter()) 178 | 179 | for transient_id in data["transient_ids"]: 180 | graph.add_node(transient_id, label="transientId", uid=transient_id, visited_events=Counter()) 181 | graph.add_edge(transient_id, data["persistent_id"], label="has_identity") 182 | 183 | for event in data["visited_events"]: 184 | edge = event["visited_url"], event["uid"] 185 | try: 186 | graph.edges[edge]["ts"].append(event["ts"]) 187 | except: 188 | graph.add_edge(*edge, label="visited", ts=[event["ts"]]) 189 | 190 | 191 | for node_map in graph.nodes[data["persistent_id"]], graph.nodes[event["uid"]]: 192 | if event["visited_url"] == website_url: 193 | node_map["visited_events"][f"visited website < {since}"] += (event["ts"] < since) 194 | node_map["visited_events"][f"visited website >= {since}"] += (event["ts"] >= since) 195 | else: 196 | node_map["visited_events"][f"visited thank you page < {since}"] += (event["ts"] < since) 197 | node_map["visited_events"][f"visited thank you page >= {since}"] += (event["ts"] >= since) 198 | 199 | return graph 200 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/user_summary.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: Advertisers want to find out information about user interests to provide an accurate targeting. 3 | The data should be based on the activity of the user across all devices. 4 | """ 5 | from collections.abc import Iterable 6 | 7 | import networkx as nx 8 | from gremlin_python.process.traversal import Column, T 9 | from gremlin_python.process.graph_traversal import select, out, in_, values, valueMap, project, constant 10 | 11 | from nepytune import drawing 12 | 13 | 14 | def get_sibling_attrs(g, transient_id): 15 | """ 16 | Given transient id, get summary of information we have about it or its sibling nodes. 
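    A minimal call sketch (the endpoint and device id are illustrative placeholders,
    not values shipped with this repository); the pieces of information gathered are
    listed below:

        from nepytune.traversal import get_traversal
        from nepytune.usecase.user_summary import get_sibling_attrs

        g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint
        summary = get_sibling_attrs(g, "example-transient-uid").next()
        print(summary["iab_categories"])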
17 | 18 | We gather: 19 | * node attributes 20 | * IP / location information 21 | * IAB categories of visited websites 22 | """ 23 | return ( 24 | g.V(transient_id) 25 | .choose( 26 | in_("has_identity"), # check if this transient id has persistent id 27 | in_("has_identity"). 28 | project( 29 | "identity_group_id", "persistent_id", "attributes", "ip_location", "iab_categories" 30 | ).by(in_("member").values("igid")) 31 | .by(values("pid")) 32 | .by( 33 | out("has_identity").valueMap().unfold() 34 | .group() 35 | .by(Column.keys) 36 | .by(select(Column.values).unfold().dedup().fold()) 37 | ) 38 | .by( 39 | out("has_identity") 40 | .out("uses").dedup().valueMap().fold() 41 | ) 42 | .by( 43 | out("has_identity") 44 | .out("visited") 45 | .in_("links_to") 46 | .values("categoryCode").dedup().fold() 47 | ) 48 | , project( 49 | "identity_group_id", "persistent_id", "attributes", "ip_location", "iab_categories" 50 | ).by(constant("")) 51 | .by(constant("")) 52 | .by( 53 | valueMap().unfold() 54 | .group() 55 | .by(Column.keys) 56 | .by(select(Column.values).unfold().dedup().fold()) 57 | ) 58 | .by( 59 | out("uses").dedup().valueMap().fold() 60 | ) 61 | .by( 62 | out("visited") 63 | .in_("links_to") 64 | .values("categoryCode").dedup().fold() 65 | ) 66 | ) 67 | ) 68 | 69 | 70 | def draw_refrenced_subgraph(g, transient_id): 71 | raw_graph = _build_networkx_graph( 72 | g, g.V(transient_id).in_("has_identity").in_("member").next() 73 | ) 74 | graph_with_pos_computed = drawing.layout( 75 | raw_graph, 76 | nx.spring_layout, 77 | iterations=2500 78 | ) 79 | 80 | drawing.draw( 81 | title="Part of single household activity on the web", 82 | scatters=[ 83 | drawing.edges_scatter(graph_with_pos_computed) 84 | ] + list( 85 | drawing.scatters_by_label( 86 | graph_with_pos_computed, attrs_to_skip=["pos"], 87 | sizes={"identityGroup": 30, "transientId": 15, "persistentId": 20, "websiteGroup": 15, "website": 10} 88 | ) 89 | ), 90 | ) 91 | 92 | 93 | # =========================== 94 | # Everything below was added to introspect the query results via visualisations 95 | 96 | def _get_subgraph(g, identity_group_id): 97 | return ( 98 | g.V(identity_group_id) 99 | .project("props", "persistent_ids") 100 | .by(valueMap(True)) 101 | .by( 102 | out("member") 103 | .group() 104 | .by() 105 | .by( 106 | project("props", "transient_ids") 107 | .by(valueMap(True)) 108 | .by( 109 | out("has_identity") 110 | .group() 111 | .by() 112 | .by( 113 | project("props", "ip_location", "random_website_paths") 114 | .by(valueMap(True)) 115 | .by( 116 | out("uses").valueMap(True).fold() 117 | ) 118 | .by( 119 | out("visited").as_("start") 120 | .in_("links_to").as_("end") 121 | .limit(100) 122 | .path() 123 | .by(valueMap("url")) 124 | .by(valueMap("url", "categoryCode")) 125 | .from_("start").to("end") 126 | .dedup() 127 | .fold() 128 | ) 129 | ).select( 130 | Column.values 131 | ) 132 | ) 133 | ).select(Column.values) 134 | ) 135 | ) 136 | 137 | 138 | def _build_networkx_graph(g, identity_group_id): 139 | def get_attributes(attribute_list): 140 | attrs = {} 141 | for attr_name, value in attribute_list: 142 | attr_name = str(attr_name) 143 | 144 | if isinstance(value, Iterable) and not isinstance(value, str): 145 | for i, single_val in enumerate(value): 146 | attrs[f"{attr_name}-{i}"] = single_val 147 | else: 148 | if '.' 
in attr_name: 149 | attr_name = attr_name.split('.')[-1] 150 | attrs[attr_name] = value 151 | 152 | return attrs 153 | 154 | graph = nx.Graph() 155 | 156 | for ig_node in _get_subgraph(g, identity_group_id).toList(): 157 | ig_id = ig_node["props"][T.id] 158 | 159 | graph.add_node( 160 | ig_id, 161 | **get_attributes(ig_node["props"].items()) 162 | ) 163 | 164 | for persistent_node in ig_node["persistent_ids"]: 165 | p_id = persistent_node["props"][T.id] 166 | graph.add_node( 167 | p_id, 168 | **get_attributes(persistent_node["props"].items()) 169 | ) 170 | graph.add_edge(ig_id, p_id, label="member") 171 | 172 | for transient_node in persistent_node["transient_ids"]: 173 | transient_node_map = transient_node["props"] 174 | transient_id = transient_node_map[T.id] 175 | graph.add_node( 176 | transient_id, 177 | **get_attributes(transient_node_map.items()) 178 | ) 179 | graph.add_edge(transient_id, p_id, label="has_identity") 180 | 181 | for ip_location_node in transient_node["ip_location"]: 182 | ip_location_id = ip_location_node[T.id] 183 | graph.add_node(ip_location_id, **get_attributes(ip_location_node.items())) 184 | graph.add_edge(ip_location_id, transient_id, label="uses") 185 | 186 | for visited_link, root_url in transient_node["random_website_paths"]: 187 | graph.add_node(visited_link["url"][0], label="website", **get_attributes(visited_link.items())) 188 | graph.add_node(root_url["url"][0], label="websiteGroup", **get_attributes(root_url.items())) 189 | graph.add_edge(transient_id, visited_link["url"][0], label="visits") 190 | graph.add_edge(visited_link["url"][0], root_url["url"][0], label="links_to") 191 | return graph 192 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/usecase/users_from_household.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use case: user has visited a travel agency website recently. 3 | Advertisers want to display ads about travel promotions to all members of his household. 4 | """ 5 | from collections.abc import Iterable 6 | 7 | from gremlin_python.process.traversal import Column, T 8 | from gremlin_python.process.graph_traversal import project, valueMap, out 9 | import networkx as nx 10 | 11 | from nepytune import drawing 12 | 13 | 14 | def get_all_transient_ids_in_household(g, transient_id): 15 | """Given transient id, get all transient ids from its household.""" 16 | return ( 17 | g.V(transient_id) 18 | .hasLabel("transientId") 19 | .in_("has_identity") 20 | .in_("member") 21 | .has("type", "household") 22 | .out("member") 23 | .out("has_identity"). 
24 | values("uid") 25 | ) 26 | 27 | 28 | def draw_referenced_subgraph(g, transient_id): 29 | graph = drawing.spring_layout( 30 | _build_networkx_graph( 31 | g, 32 | g.V(transient_id).in_("has_identity").in_("member").next() 33 | ) 34 | ) 35 | 36 | drawing.draw( 37 | title="Single identity group graph structure", 38 | scatters=[ 39 | drawing.edges_scatter(graph) 40 | ] + list( 41 | drawing.scatters_by_label( 42 | graph, attrs_to_skip=["pos"], 43 | sizes={"identityGroup": 60, "transientId": 20, "persistentId": 40} 44 | ) 45 | ), 46 | annotations=drawing.edge_annotations(graph) 47 | ) 48 | 49 | 50 | # =========================== 51 | # Everything below was added to introspect the query results via visualisations 52 | 53 | 54 | def _get_identity_group_hierarchy(g, identity_group_id): 55 | return ( 56 | g.V(identity_group_id) 57 | .project("props", "persistent_ids") 58 | .by(valueMap(True)) 59 | .by( 60 | out("member") 61 | .group() 62 | .by() 63 | .by( 64 | project("props", "transient_ids") 65 | .by(valueMap(True)) 66 | .by( 67 | out("has_identity").valueMap(True).fold() 68 | ) 69 | ).select(Column.values) 70 | ) 71 | ) 72 | 73 | 74 | def _build_networkx_graph(g, identity_group_id): 75 | def get_attributes(attribute_list): 76 | attrs = {} 77 | for attr_name, value in attribute_list: 78 | attr_name = str(attr_name) 79 | 80 | if isinstance(value, Iterable) and not isinstance(value, str): 81 | for i, single_val in enumerate(value): 82 | attrs[f"{attr_name}-{i}"] = single_val 83 | else: 84 | if '.' in attr_name: 85 | attr_name = attr_name.split('.')[-1] 86 | attrs[attr_name] = value 87 | 88 | return attrs 89 | 90 | graph = nx.Graph() 91 | 92 | for ig_node in _get_identity_group_hierarchy(g, identity_group_id).toList(): 93 | ig_id = ig_node["props"][T.id] 94 | 95 | graph.add_node( 96 | ig_id, 97 | **get_attributes(ig_node["props"].items()) 98 | ) 99 | 100 | for persistent_node in ig_node["persistent_ids"]: 101 | p_id = persistent_node["props"][T.id] 102 | graph.add_node( 103 | p_id, 104 | **get_attributes(persistent_node["props"].items()) 105 | ) 106 | graph.add_edge(ig_id, p_id, label="member") 107 | 108 | for transient_node_map in persistent_node["transient_ids"]: 109 | transient_id = transient_node_map[T.id] 110 | graph.add_node( 111 | transient_id, 112 | **get_attributes(transient_node_map.items()) 113 | ) 114 | graph.add_edge(transient_id, p_id, label="has_identity") 115 | 116 | return graph 117 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/utils.py: -------------------------------------------------------------------------------- 1 | import hashlib 2 | import inspect 3 | import os 4 | 5 | import nepytune.benchmarks.benchmarks_visualization as bench_viz 6 | 7 | 8 | def hash_(list_): 9 | """Generate sha1 hash from the given list.""" 10 | return hashlib.sha1(str(tuple(sorted(list_))).encode("utf-8")).hexdigest() 11 | 12 | 13 | def get_id(_from, to, attributes): 14 | """Get id of a given entity.""" 15 | return hash_([_from, to, str(tuple(attributes.items()))]) 16 | 17 | 18 | def show_query_benchmarks(benchmark_results_path, cache_path, query, 19 | samples_by_users): 20 | instances = os.listdir(benchmark_results_path) 21 | instances = sorted(instances, key=lambda x: int(x.split('.')[-1].split('xlarge')[0])) 22 | 23 | benchmarks_dfs = bench_viz.get_benchmarks_results_dataframes( 24 | query=query, 25 | samples_by_users=samples_by_users, 26 | instances=instances, 27 | results_path=benchmark_results_path 28 | ) 
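    # The calls below are all helpers from nepytune.benchmarks.benchmarks_visualization:
    # select_concurrent_queries_from_data picks the concurrent-query samples out of the
    # per-instance dataframes loaded above (caching them at cache_path), the first chart
    # shows how many queries were running at once over the benchmark timeline, and the
    # two show_query_time_graph calls plot request duration (seconds scaled to
    # milliseconds) and throughput (1 / duration, i.e. queries per second) against the
    # number of concurrent queries.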
29 | concurrent_queries_dfs = bench_viz.select_concurrent_queries_from_data( 30 | query, 31 | benchmarks_dfs, 32 | cache_path=cache_path 33 | ) 34 | bench_viz.show_concurrent_queries_charts( 35 | concurrent_queries_dfs, 36 | x_title="Time from start of benchmark (Miliseconds)", 37 | y_title="Number of concurrent running queries" 38 | ) 39 | 40 | bench_viz.show_query_time_graph( 41 | benchmarks_dfs, 42 | yfunc=lambda df: df.multiply(1000).tolist(), 43 | title="Request duration (Miliseconds)", 44 | x_title="Number of concurrent queries", 45 | ) 46 | bench_viz.show_query_time_graph( 47 | benchmarks_dfs, 48 | yfunc=lambda df: (1 / df).tolist(), 49 | title="Queries per second", 50 | x_title="Number of concurrent queries", 51 | ) 52 | 53 | 54 | def show(func): 55 | lines = inspect.getsource(func) 56 | print(lines) 57 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-admartech-samples/142f5d36e6f2b776eab057242f71353573a35b29/identity-resolution/notebooks/identity-graph/nepytune/visualizations/__init__.py -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/bar_plots.py: -------------------------------------------------------------------------------- 1 | import plotly.graph_objects as go 2 | import colorlover as cl 3 | 4 | def make_bars(data, title, x_title, y_title, lazy=False): 5 | color = cl.scales[str(len(data.keys()))]['div']['RdYlBu'] 6 | fig = go.Figure( 7 | [ 8 | go.Bar( 9 | x=list(data.keys()), 10 | y=list(data.values()), 11 | hoverinfo="y", 12 | marker=dict(color=color), 13 | ) 14 | ] 15 | ) 16 | 17 | fig.update_layout( 18 | title=title, 19 | yaxis_type="log", 20 | xaxis_title=x_title, 21 | yaxis_title=y_title, 22 | ) 23 | if not lazy: 24 | fig.show() 25 | return fig -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/commons.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | 3 | from gremlin_python.process.graph_traversal import select, has, unfold 4 | from gremlin_python.process.traversal import P 5 | 6 | 7 | def get_timerange_condition(g, start_hour=16, end_hour=18, limit=1000): 8 | dates = ( 9 | g.E() 10 | .hasLabel("visited") 11 | .limit(limit) 12 | .values("ts") 13 | .fold() 14 | .as_("timestamps") 15 | .project("start", "end") 16 | .by(select("timestamps").unfold().min_()) 17 | .by(select("timestamps").unfold().max_()) 18 | ).next() 19 | 20 | start = dates["start"].replace(hour=start_hour, minute=0, second=0) 21 | end = dates["end"].replace(hour=start_hour, minute=0, second=0) 22 | 23 | toReturn = [] 24 | 25 | for days in range((end - start).days): 26 | toReturn.append( 27 | has( 28 | 'ts', 29 | P.between( 30 | start + timedelta(days=days), 31 | start + timedelta(days=days) + timedelta(hours=end_hour - start_hour) 32 | ) 33 | ) 34 | 35 | ) 36 | 37 | return toReturn 38 | 39 | # [ 40 | # has( 41 | # 'ts', 42 | # P.between( 43 | # start + timedelta(days=days), 44 | # start + timedelta(days=days) + timedelta(hours=end_hour - start_hour) 45 | # ) 46 | # ) 47 | # for days in range((end - start).days) 48 | # ] 49 | 50 | 51 | def get_user_device_statistics(g, 
dt_conditions, limit=10000): 52 | return ( 53 | g.E().hasLabel("visited").or_(*dt_conditions) 54 | .limit(limit).outV().fold() 55 | .project("type", "device", "browser") 56 | .by( 57 | unfold().unfold().groupCount().by("type") 58 | ) 59 | .by( 60 | unfold().unfold().groupCount().by("device") 61 | ) 62 | .by( 63 | unfold().unfold().groupCount().by("browser") 64 | ) 65 | ) -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/histogram.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import plotly.graph_objects as go 3 | 4 | 5 | def show(activities, website_name): 6 | # convert activities into pandas series 7 | activity_series = pd.to_datetime(pd.Series(list(activities))) 8 | 9 | # trim timestamps to desired granulation (in this case, hours) 10 | hourly_activity_series = activity_series.dt.strftime("%H") 11 | 12 | # prepare values & labels source for histogram's xaxis 13 | day_hours = pd.to_datetime(pd.date_range(start="00:00", end="23:59", freq="H")) 14 | 15 | # create histogram 16 | fig = go.Figure( 17 | data=[ 18 | go.Histogram( 19 | x=hourly_activity_series, 20 | histnorm='percent' 21 | ) 22 | ] 23 | ) 24 | 25 | # provide titles/labels/bar_gaps 26 | fig.update_layout( 27 | title_text=f"Activity of all users that visited website {website_name}", 28 | xaxis_title_text='Day time (Hour)', 29 | yaxis_title_text='Percentage of visits', 30 | 31 | xaxis=dict( 32 | tickangle=45, 33 | tickmode='array', 34 | tickvals=day_hours.strftime("%H").tolist(), 35 | ticktext=day_hours.strftime("%H:%M").tolist() 36 | ), 37 | bargap=0.05, 38 | ) 39 | 40 | # show histogram 41 | fig.show() 42 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/network_graph.py: -------------------------------------------------------------------------------- 1 | import networkx as nx 2 | from gremlin_python.process.graph_traversal import in_, coalesce, constant, select 3 | from gremlin_python.process.traversal import T, P, Column 4 | 5 | from nepytune import drawing 6 | 7 | 8 | def query_website_node(g, website_id): 9 | return g.V(website_id).valueMap(True).toList()[0] 10 | 11 | 12 | def query_transient_nodes_for_website(g, website_id, limit=10000): 13 | return (g.V(website_id) 14 | .in_("visited") 15 | .limit(limit) 16 | .project("uid", "pid") 17 | .by("uid") 18 | .by(in_("has_identity").values("pid").fold()) 19 | .group() 20 | .by(coalesce(select("pid").unfold(), constant("transient-nodes-connected-to-website"))) 21 | .by(select("uid").dedup().limit(100).fold()) 22 | .unfold() 23 | .project("persistent-node-id", "transient-nodes") 24 | .by(select(Column.keys)) 25 | .by(select(Column.values)) 26 | .where(select("transient-nodes").unfold().count().is_(P.gt(1))) 27 | ).toList() 28 | 29 | 30 | def create_graph_for_website_and_transient_nodes(website_node, transient_nodes_for_website): 31 | website_id = website_node[T.id] 32 | 33 | graph = nx.Graph() 34 | graph.add_node( 35 | website_id, 36 | **{ 37 | "id": website_id, 38 | "label": website_node[T.label], 39 | "title": website_node["title"][0], 40 | "url": website_node["url"][0] 41 | } 42 | ) 43 | 44 | transient_nodes = [] 45 | persistent_nodes = [] 46 | 47 | for node in transient_nodes_for_website: 48 | if node["persistent-node-id"] != "transient-nodes-connected-to-website": 49 | pnode = node["persistent-node-id"] 50 | 
persistent_nodes.append(pnode) 51 | graph.add_node( 52 | pnode, 53 | id=pnode, 54 | label="persistentId" 55 | ) 56 | 57 | for tnode in node["transient-nodes"]: 58 | graph.add_edge( 59 | pnode, 60 | tnode, 61 | label="has_identity" 62 | ) 63 | 64 | for tnode in node["transient-nodes"]: 65 | graph.add_node( 66 | tnode, 67 | id=tnode, 68 | label="transientId" 69 | ) 70 | 71 | graph.add_edge( 72 | website_id, 73 | tnode, 74 | label="visited" 75 | ) 76 | 77 | transient_nodes.append(tnode) 78 | return graph 79 | 80 | 81 | def show(g, website_id): 82 | """Show users that visited website on more than one device.""" 83 | 84 | transient_nodes_for_website = query_transient_nodes_for_website(g, website_id) 85 | website_node = query_website_node(g, website_id) 86 | 87 | raw_graph = create_graph_for_website_and_transient_nodes(website_node, transient_nodes_for_website) 88 | graph = drawing.spring_layout(raw_graph) 89 | 90 | drawing.draw( 91 | title="", 92 | scatters=[drawing.edges_scatter(graph)] + list(drawing.scatters_by_label(graph, attrs_to_skip=["pos"])), 93 | ) -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/pie_chart.py: -------------------------------------------------------------------------------- 1 | import colorlover as cl 2 | import plotly.graph_objects as go 3 | from plotly.subplots import make_subplots 4 | 5 | 6 | def show(data): 7 | type_labels, type_values = zip(*data["type"].items()) 8 | device_labels, device_values = zip(*data["device"].items()) 9 | browser_labels, browser_values = zip(*data["browser"].items()) 10 | 11 | fig = make_subplots(rows=3, cols=1, specs=[ 12 | [{"type": "pie"}], 13 | [{"type": "pie"}], 14 | [{"type": "pie"}] 15 | ]) 16 | 17 | fig.add_trace( 18 | go.Pie(labels=list(reversed(type_labels)), values=list(reversed(type_values)), hole=0, name="Type", 19 | marker={'colors': ['#7F7FFF', '#FF7F7F']}, 20 | textinfo='label+percent', hoverinfo="label+percent+value", textfont_size=20 21 | ), 22 | row=2, col=1, 23 | 24 | ) 25 | 26 | fig.add_trace( 27 | go.Pie(labels=["device
type"], values=[data["type"]["device"]], 28 | hole=0, textinfo='label', hoverinfo="label+value", 29 | marker={'colors': ['#7F7FFF']}, textfont_size=20 30 | ), 31 | row=1, col=1, 32 | 33 | ) 34 | 35 | fig.add_trace( 36 | go.Pie(labels=device_labels, values=device_values, hole=.8, opacity=1, 37 | textinfo='label', textposition='outside', hoverinfo="label+percent+value", 38 | marker={'colors': ['rgb(247,251,255)', 39 | 'rgb(222,235,247)', 40 | 'rgb(198,219,239)', 41 | 'rgb(158,202,225)', 42 | 'rgb(107,174,214)', 43 | 'rgb(66,146,198)', 44 | 'rgb(33,113,181)', 45 | 'rgb(8,81,156)', 46 | 'rgb(8,48,107)', 47 | 'rgb(9,32,66)', 48 | ] 49 | }, textfont_size=12), 50 | row=1, col=1, 51 | ) 52 | 53 | fig.add_trace( 54 | go.Pie(labels=["cookie
browser"], values=[data["type"]["cookie"]], 55 | hole=0, textinfo='label', hoverinfo="label+value", 56 | marker={'colors': ['#FF7F7F']}, textfont_size=20), 57 | row=3, col=1, 58 | ) 59 | 60 | fig.add_trace( 61 | go.Pie(labels=browser_labels, values=browser_values, hole=.8, 62 | textinfo='label', textposition='outside', hoverinfo="label+percent+value", 63 | marker={'colors': ['rgb(255,245,240)', 64 | 'rgb(254,224,210)', 65 | 'rgb(252,187,161)', 66 | 'rgb(252,146,114)', 67 | 'rgb(251,106,74)', 68 | 'rgb(239,59,44)', 69 | 'rgb(203,24,29)', 70 | 'rgb(165,15,21)', 71 | 'rgb(103,0,13)', 72 | 'rgb(51, 6,12)' 73 | ] 74 | }, textfont_size=12), 75 | row=3, col=1, 76 | ) 77 | 78 | fig.update_layout( 79 | showlegend=False, 80 | height=1000, 81 | ) 82 | 83 | fig.show() 84 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/segments.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime, timedelta 2 | 3 | 4 | from gremlin_python.process.graph_traversal import in_, inE, select, out, values, has 5 | from gremlin_python.process.traversal import P, Column 6 | 7 | 8 | def get_all_devices_from_website_visitors(g, website_id, limit=100): 9 | """Get all transient ids (including siblings), that visited given page.""" 10 | 11 | return ( 12 | g.V(website_id) 13 | .project("transient_ids_no_persistent", "transient_ids_with_siblings") 14 | .by( 15 | in_("visited").limit(limit).fold() 16 | ) 17 | .by( 18 | in_("visited").in_("has_identity").dedup().out("has_identity").limit(limit).fold() 19 | ) 20 | .select(Column.values).unfold().unfold().dedup() 21 | ) 22 | 23 | 24 | def query_users_intersted_in_content(g, iab_codes, limit=10000): 25 | """Get users (persistent identities) that interacted with websites with given iab codes.""" 26 | 27 | return ( 28 | g.V() 29 | .hasLabel("persistentId") 30 | .coin(0.8) 31 | .limit(limit) 32 | .where(out("has_identity") 33 | .out("visited") 34 | .in_("links_to") 35 | .has("categoryCode", P.within(iab_codes)) 36 | ) 37 | .project("persistent_id", "attributes", "ip_location") 38 | .by(values("pid")) 39 | .by( 40 | out("has_identity").valueMap("browser", "email", "uid").unfold() 41 | .group() 42 | .by(Column.keys) 43 | .by(select(Column.values).unfold().dedup().fold()) 44 | ) 45 | .by(out("has_identity").out("uses").dedup().valueMap().fold()) 46 | ) 47 | 48 | 49 | def query_users_active_in_given_date_intervals(g, dt_conditions, limit=300): 50 | """Get users (persistent identities) that interacted with website in given date interval.""" 51 | 52 | return ( 53 | g.V().hasLabel("persistentId") 54 | .coin(0.5) 55 | .limit(limit) 56 | .where( 57 | out("has_identity").outE("visited").or_( 58 | *dt_conditions 59 | ) 60 | ) 61 | .project("persistent_id", "attributes", "ip_location") 62 | .by(values("pid")) 63 | .by( 64 | out("has_identity").valueMap("browser", "email", "uid").unfold() 65 | .group() 66 | .by(Column.keys) 67 | .by(select(Column.values).unfold().dedup().fold()) 68 | ) 69 | .by(out("has_identity").out("uses").dedup().valueMap().fold()) 70 | ) 71 | 72 | 73 | def query_users_active_in_n_days(g, n=30, today=datetime(2016, 6, 22, 23, 59), limit=1000): 74 | """Get users that were active in last 30 days.""" 75 | 76 | dt_condition = [ 77 | has("ts", P.gt(today - timedelta(days=n))) 78 | ] 79 | return query_users_active_in_given_date_intervals(g, dt_condition, limit) 
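The segment queries above take a list of has("ts", P.between(...)) predicates such as the ones produced by get_timerange_condition in visualizations/commons.py. A minimal sketch of wiring the two modules together (the Neptune endpoint is an illustrative placeholder, not a value shipped with this repository):

    from nepytune.traversal import get_traversal
    from nepytune.visualizations import segments
    from nepytune.visualizations.commons import get_timerange_condition

    g = get_traversal("wss://my-neptune-cluster:8182/gremlin")  # hypothetical endpoint

    # users active in a daily 16:00-18:00 window, based on a sample of "visited" edges
    evening_conditions = get_timerange_condition(g, start_hour=16, end_hour=18)
    evening_users = segments.query_users_active_in_given_date_intervals(g, evening_conditions).toList()

    # users active in the 30 days preceding the dataset's reference date
    recent_users = segments.query_users_active_in_n_days(g, n=30).toList()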
-------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/sunburst_chart.py: -------------------------------------------------------------------------------- 1 | import plotly.graph_objects as go 2 | 3 | 4 | def show(data): 5 | type_labels, type_values = zip(*data["type"].items()) 6 | device_labels, device_values = zip(*data["device"].items()) 7 | browser_labels, browser_values = zip(*data["browser"].items()) 8 | 9 | trace = go.Sunburst( 10 | labels=type_labels + device_labels + browser_labels, 11 | parents=["", ""] + ["device"] * len(device_labels) + ["cookie"] * len(browser_labels), 12 | values=type_values + device_values + browser_values, 13 | hoverinfo="label+value", 14 | ) 15 | 16 | layout = go.Layout( 17 | margin=go.layout.Margin(t=0, l=0, r=0, b=0), 18 | ) 19 | 20 | fig = go.Figure([trace], layout) 21 | fig.show() 22 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/visualizations/venn_diagram.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import random 3 | 4 | import plotly.graph_objects as go 5 | import yaml 6 | 7 | 8 | def get_intersections(s1, s2, s3): 9 | class NodeElement: 10 | def __init__(self, **kwargs): 11 | self.attributes = kwargs 12 | 13 | def __hash__(self): 14 | return hash(self.attributes["persistent_id"]) 15 | 16 | def __eq__(self, other): 17 | return self.attributes["persistent_id"] == other.attributes["persistent_id"] 18 | 19 | def __repr__(self): 20 | pid = self.attributes['persistent_id'] 21 | hash_ = str(hash(self.attributes['persistent_id'])) 22 | return f"{pid}, {hash_}" 23 | 24 | a = {NodeElement(**e) for e in s1} 25 | b = {NodeElement(**e) for e in s2} 26 | c = {NodeElement(**e) for e in s3} 27 | 28 | result = { 29 | "ab": a & b, 30 | "ac": a & c, 31 | "bc": b & c, 32 | "abc": a & b & c 33 | } 34 | 35 | result["a"] = a - (result["ab"] | result["ac"]) 36 | result["b"] = b - (result["ab"] | result["bc"]) 37 | result["c"] = b - (result["ac"] | result["bc"]) 38 | 39 | return result 40 | 41 | 42 | def make_label(node): 43 | return "
" + yaml.dump( 44 | node.attributes, 45 | default_style=None, 46 | default_flow_style=False, 47 | width=50 48 | ).replace("\n", "
") 49 | 50 | 51 | TRIANGLES = { 52 | "abc": [ 53 | [(8, -0.35), (10, -4.5), (12, -0.35)], 54 | [(11.51671, -2.41546), (9.98847, -4.47654), (12, -0.35)], 55 | [(10, 0), (12, -0.35), (8, -0.35)], 56 | [(8, -0.35), (8.58508, -2.5938), (9.98847, -4.47654)] 57 | ], 58 | "ab": [ 59 | [(8, 0), (10, 4.5), (12, 0)], 60 | [(8, 0), (11.49694, 2.39107), (12, 0)], 61 | [(8, 0), (10, 4.5), (8.51494, 2.37233)], 62 | [(12, 0), (10, 4.5), (11.49383, 2.40639)] 63 | ], 64 | "ac": [ 65 | [(4, -5.65), (9.7, -4.75), (7.5, -0.55)], 66 | [(4, -5.65), (5.26182, -2.31944), (7.5, -0.55)], 67 | [(8.26214, -1.9908), (8, -0.35), (7.5, -0.55)], 68 | [(8.92526, -3.30719), (10, -4.5), (9.7, -4.75)], 69 | [(7.01578, -5.9212), (4, -5.65), (9.7, -4.75)] 70 | ], 71 | "bc": [ 72 | [(16.01075, -5.7627), (12.51157, -0.57146), (10.31157, -4.77146)], 73 | [(10, -4.5), (11.08632, -3.32866), (10.31157, -4.77146)], 74 | [(12.00131, -0.28126), (12.51157, -0.57146), (11.74943, -2.01226)], 75 | [(12.51157, -0.57146), (14.74975, -2.3409), (16.01075, -5.7627)], 76 | [(10.31157, -4.77146), (12.99579, -5.94266), (16.01157, -5.67146)] 77 | ], 78 | "a": [ 79 | [(1.59, 4.12), (1.2, -3.54), (8.07, 5.62)], 80 | [(8.01, -0.31), (1.2, -3.54), (8.07, 5.62)], 81 | [(4.76091, -1.85498), (4.76091, -1.85498), (1.20313, -3.56193)], 82 | [(1.20313, -3.56193), (0, 0), (1.58563, 4.1221)], 83 | [(4.93073, -1.82779), (1.20313, -3.53928), (4.00216, -5.85506)], 84 | [(1.58563, 4.1221), (4.56262, 5.82533), (8.06809, 5.62107)], 85 | [(8.06809, 5.62107), (9.96902, 4.49245), (8.03037, 1.87789)], 86 | [(4.93976, -1.79473), (5.3695, -2.15025), (6.48177, -1.03733)], 87 | [(4.93976, -1.79473), (5.3695, -2.15025), (4.55188, -3.3864)], 88 | [(8.31901, 1.92203), (8.31901, 1.92203), (8.63932, 2.8171)], 89 | [(8.31901, 1.92203), (8.31901, 1.92203), (8.02962, 1.02571)] 90 | ], 91 | "b": [ 92 | [(12.06, 5.65), (18.89, -3.51), (18.38, 4.1)], 93 | [(12, -0.28), (12.06, 5.65), (18.89, -3.51)], 94 | [(12.06077, 5.6496), (12.0047, 2.32125), (10.02248, 4.49887)], 95 | [(15.10229, -1.76245), (18.8895, -3.51074), (16.01075, -5.7627)], 96 | [(20, 0), (18.37991, 4.09718), (18.8895, -3.51074)], 97 | [(18.37991, 4.09718), (12.06077, 5.6496), (15.63243, 5.78919)], 98 | ], 99 | "c": [ 100 | [(10, -12), (4.38, -8), (15.60, -8)], 101 | [(10, -4.55), (4.38, -8), (15.60, -8)], 102 | [(4.01794, -5.64598), (4.38003, -8.00561), (7.22591, -6.21212)], 103 | [(15.99694, -5.86975), (12.83447, -6.25924), (15.62772, -8.02776)], 104 | [(4.38003, -8.00561), (6.43193, -10.80379), (10, -12)], 105 | [(15.62772, -8.02776), (13.95762, -10.55233), (10, -12)], 106 | [(5.56245, -6.01624), (7.11534, -5.92534), (7.22591, -6.21212)], 107 | [(8.21897, -5.58699), (7.11534, -5.92534), (7.22591, -6.21212)], 108 | [(11.76526, -5.59305), (13.023, -5.93749), (12.83447, -6.25924)], 109 | [(14.49948, -5.9889), (13.023, -5.93749), (12.83447, -6.25924)], 110 | ], 111 | } 112 | 113 | 114 | def show_venn_diagram(intersections, labels): 115 | def point_on_triangle(pt1, pt2, pt3): 116 | """ 117 | Random point on the triangle with vertices pt1, pt2 and pt3. 
118 | """ 119 | s, t = sorted([random.random(), random.random()]) 120 | return (s * pt1[0] + (t - s) * pt2[0] + (1 - t) * pt3[0], 121 | s * pt1[1] + (t - s) * pt2[1] + (1 - t) * pt3[1]) 122 | 123 | def area(tri): 124 | y_list = [tri[0][1], tri[1][1], tri[2][1]] 125 | x_list = [tri[0][0], tri[1][0], tri[2][0]] 126 | height = max(y_list) - min(y_list) 127 | width = max(x_list) - min(x_list) 128 | return height * width / 2 129 | 130 | empty_sets = [k for k, v in intersections.items() if not len(v)] 131 | 132 | if empty_sets: 133 | raise ValueError(f"Given intersections \"{empty_sets}\" are empty, cannot continue") 134 | 135 | scatters = [] 136 | 137 | for k, v in intersections.items(): 138 | weights = [area(triangle) for triangle in TRIANGLES[k]] 139 | points_pairs = [point_on_triangle(*random.choices(TRIANGLES[k], weights=weights)[0]) for _ in v] 140 | x, y = zip(*points_pairs) 141 | scatter_labels = [make_label(n) for n in v] 142 | 143 | scatters.append( 144 | go.Scatter( 145 | x=x, 146 | y=y, 147 | mode='markers', 148 | showlegend=False, 149 | text=scatter_labels, 150 | marker=dict( 151 | size=10, 152 | line=dict(width=2, 153 | color='DarkSlateGrey'), 154 | opacity=1, 155 | ), 156 | hoverinfo="text", 157 | ) 158 | ) 159 | fig = go.Figure( 160 | data=list(scatters), 161 | layout=go.Layout( 162 | title_text="", 163 | autosize=False, 164 | titlefont_size=16, 165 | showlegend=True, 166 | hovermode='closest', 167 | margin=dict(b=20, l=5, r=5, t=40), 168 | xaxis=dict(showgrid=False, zeroline=False, showticklabels=False), 169 | yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, scaleanchor="x", scaleratio=1) 170 | ), 171 | ) 172 | 173 | fig.update_layout( 174 | shapes=[ 175 | go.layout.Shape( 176 | type="circle", 177 | x0=0, 178 | y0=-6, 179 | x1=12, 180 | y1=6, 181 | fillcolor="Red", 182 | opacity=0.15, 183 | layer='below' 184 | ), 185 | go.layout.Shape( 186 | type="circle", 187 | x0=8, 188 | y0=-6, 189 | x1=20, 190 | y1=6, 191 | fillcolor="Blue", 192 | opacity=0.15, 193 | layer='below' 194 | ), 195 | go.layout.Shape( 196 | type="circle", 197 | x0=4, 198 | y0=-12, 199 | x1=16, 200 | y1=0, 201 | fillcolor="Green", 202 | opacity=0.15, 203 | layer='below' 204 | ), 205 | ] 206 | ) 207 | 208 | fig.update_layout( 209 | annotations=[ 210 | dict( 211 | xref="x", 212 | yref="y", 213 | x=6, y=6, 214 | text=labels[0], 215 | font=dict(size=15), 216 | showarrow=True, 217 | arrowwidth=2, 218 | ax=-50, 219 | ay=-25, 220 | arrowhead=7, 221 | ), 222 | dict( 223 | xref="x", 224 | yref="y", 225 | x=14, y=6, 226 | text=labels[1], 227 | font=dict(size=15), 228 | showarrow=True, 229 | arrowwidth=2, 230 | ax=50, 231 | ay=-25, 232 | arrowhead=7, 233 | ), 234 | dict( 235 | xref="x", 236 | yref="y", 237 | x=10, y=-12, 238 | text=labels[2], 239 | font=dict(size=15), 240 | showarrow=True, 241 | arrowwidth=2, 242 | ax=50, 243 | ay=25, 244 | arrowhead=7, 245 | ), 246 | ] 247 | ) 248 | 249 | fig.show() 250 | -------------------------------------------------------------------------------- /identity-resolution/notebooks/identity-graph/nepytune/write_utils.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import csv 3 | from contextlib import contextmanager 4 | import json 5 | 6 | 7 | class GremlinCSV: 8 | """Build CSV file in AWS-Neptune ready-to-load data format.""" 9 | 10 | def __init__(self, opened_file, attributes): 11 | """Create CSV writer.""" 12 | self.types = dict(key.split(":") for key in attributes) 13 | self.writer = csv.writer(opened_file, 
quoting=csv.QUOTE_ALL) 14 | self.key_order = list(self.types.keys()) 15 | self.writer.writerow(self.header) 16 | 17 | def attributes(self, attribute_map): 18 | """Build attribute list from attribute_map with default values.""" 19 | return [attribute_map.get(k, "") for k in self.key_order] 20 | 21 | @property 22 | @abc.abstractmethod 23 | def header(self): 24 | """Get header.""" 25 | 26 | 27 | class GremlinNodeCSV(GremlinCSV): 28 | """Build CSV file with graph nodes in AWS-Neptune ready-to-load data format.""" 29 | 30 | @property 31 | def header(self): 32 | """Get header.""" 33 | return ( 34 | ["~id"] 35 | + [f"{key}:{self.types[key]}" for key in self.key_order] 36 | + ["~label"] 37 | ) 38 | 39 | def add(self, _id, attribute_map, label): 40 | """Add row to CSV file.""" 41 | self.writer.writerow([_id] + self.attributes(attribute_map) + [label]) 42 | 43 | 44 | class GremlinEdgeCSV(GremlinCSV): 45 | """Build CSV file with graph edges in AWS-Neptune ready-to-load data format.""" 46 | 47 | @property 48 | def header(self): 49 | """Get header.""" 50 | return ["~id", "~from", "~to", "~label"] + [ 51 | f"{key}:{self.types[key]}" for key in self.key_order 52 | ] 53 | 54 | def add(self, _id, _from, to, label, attribute_map): 55 | """Add row to CSV file.""" 56 | self.writer.writerow([_id, _from, to, label] + self.attributes(attribute_map)) 57 | 58 | 59 | @contextmanager 60 | def gremlin_writer(type_, file_name, attributes): 61 | """Factory of gremlin writer objects.""" 62 | with open(file_name, "w", 1024 * 1024) as f_t: 63 | yield type_(f_t, attributes=attributes) 64 | 65 | 66 | def json_lines_file(opened_file): 67 | """Yield json lines from opened file.""" 68 | for line in opened_file: 69 | yield json.loads(line) 70 | -------------------------------------------------------------------------------- /identity-resolution/templates/bulk-load-stack.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | 3 | Parameters: 4 | bulkloadNeptuneEndpoint: 5 | Type: String 6 | bulkloadNeptuneData: 7 | Type: String 8 | bulkloadNeptuneIAMRole: 9 | Type: String 10 | Description: IAM Role ARN for bulk load role 11 | bulkloadNeptuneSecurityGroup: 12 | Type: AWS::EC2::SecurityGroup::Id 13 | bulkloadSubnet1: 14 | Type: AWS::EC2::Subnet::Id 15 | bulkloadBucket: 16 | Type: String 17 | 18 | Mappings: 19 | Constants: 20 | S3Keys: 21 | NeptuneLoaderCode: identity-resolution/functions/NeptuneLoader.zip 22 | PythonLambdaLayer: identity-resolution/functions/PythonLambdaLayer.zip 23 | 24 | Resources: 25 | 26 | bulkloadNeptuneLoader: 27 | DependsOn: 28 | - bulkloadNeptuneLoaderLambdaRoleCloudWatchStream 29 | - bulkloadNeptuneLoaderLambdaRoleCloudWatchGroup 30 | - bulkloadNeptuneLoaderLambdaRoleEC2 31 | - bulkloadNeptuneLoaderLambdaRole 32 | Type: "Custom::NeptuneLoader" 33 | Properties: 34 | ServiceToken: 35 | Fn::GetAtt: [ bulkloadNeptuneLoaderLambda, Arn] 36 | 37 | bulkloadNeptuneLoaderLambdaRoleCloudWatchStream: 38 | Type: AWS::IAM::Policy 39 | Properties: 40 | PolicyDocument: 41 | Statement: 42 | - Action: 43 | - logs:CreateLogStream 44 | - logs:PutLogEvents 45 | Effect: Allow 46 | Resource: !Join [ "", [ "arn:aws:logs:", !Ref "AWS::Region", ":", !Ref "AWS::AccountId" , ":log-group:/aws/lambda/", !Ref bulkloadNeptuneLoaderLambda, ":*" ]] 47 | Version: '2012-10-17' 48 | PolicyName: bulkloadNeptuneLoaderLambdaRoleCloudWatchStream 49 | Roles: 50 | - Ref: bulkloadNeptuneLoaderLambdaRole 51 | bulkloadNeptuneLoaderLambdaRoleCloudWatchGroup: 52 | Type: 
AWS::IAM::Policy 53 | Properties: 54 | PolicyDocument: 55 | Statement: 56 | - Action: 57 | - logs:CreateLogGroup 58 | Effect: Allow 59 | Resource: !Join [ "", [ "arn:aws:logs:", !Ref "AWS::Region", ":", !Ref "AWS::AccountId" , ":*" ]] 60 | Version: '2012-10-17' 61 | PolicyName: bulkloadNeptuneLoaderLambdaRoleCloudWatchGroup 62 | Roles: 63 | - Ref: bulkloadNeptuneLoaderLambdaRole 64 | bulkloadNeptuneLoaderLambdaRoleEC2: 65 | Type: AWS::IAM::Policy 66 | Properties: 67 | PolicyDocument: 68 | Statement: 69 | - Action: 70 | - ec2:CreateNetworkInterface 71 | - ec2:DescribeNetworkInterfaces 72 | - ec2:DeleteNetworkInterface 73 | - ec2:DetachNetworkInterface 74 | Effect: Allow 75 | Resource: "*" 76 | Version: '2012-10-17' 77 | PolicyName: bulkloadNeptuneLoaderLambdaRoleEC2 78 | Roles: 79 | - Ref: bulkloadNeptuneLoaderLambdaRole 80 | bulkloadNeptuneLoaderLambda: 81 | DependsOn: 82 | - bulkloadNeptuneLoaderLambdaRoleEC2 83 | Type: AWS::Lambda::Function 84 | Properties: 85 | Code: 86 | S3Bucket: 87 | Ref: bulkloadBucket 88 | S3Key: !FindInMap 89 | - Constants 90 | - S3Keys 91 | - NeptuneLoaderCode 92 | Description: 'Lambda function to load data into Neptune instance.' 93 | Environment: 94 | Variables: 95 | neptunedb: 96 | Ref: bulkloadNeptuneEndpoint 97 | neptuneloads3path: 98 | Ref: bulkloadNeptuneData 99 | region: 100 | Ref: "AWS::Region" 101 | s3loadiamrole: 102 | Ref: bulkloadNeptuneIAMRole 103 | Handler: lambda_function.lambda_handler 104 | MemorySize: 128 105 | Layers: 106 | - !Ref PythonLambdaLayer 107 | Role: 108 | Fn::GetAtt: [ bulkloadNeptuneLoaderLambdaRole, Arn ] 109 | Runtime: python3.9 110 | Timeout: 180 111 | VpcConfig: 112 | SecurityGroupIds: 113 | - Ref: bulkloadNeptuneSecurityGroup 114 | SubnetIds: 115 | - Ref: bulkloadSubnet1 116 | bulkloadNeptuneLoaderLambdaRole: 117 | Type: AWS::IAM::Role 118 | Properties: 119 | ManagedPolicyArns: 120 | - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole' 121 | - 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess' 122 | AssumeRolePolicyDocument: 123 | Statement: 124 | - Action: sts:AssumeRole 125 | Effect: Allow 126 | Principal: 127 | Service: 128 | - lambda.amazonaws.com 129 | Version: '2012-10-17' 130 | Path: / 131 | PythonLambdaLayer: 132 | Type: "AWS::Lambda::LayerVersion" 133 | Properties: 134 | CompatibleRuntimes: 135 | - python3.9 136 | - python3.8 137 | Content: 138 | S3Bucket: 139 | Ref: bulkloadBucket 140 | S3Key: !FindInMap 141 | - Constants 142 | - S3Keys 143 | - PythonLambdaLayer -------------------------------------------------------------------------------- /identity-resolution/templates/identity-resolution.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | 3 | Mappings: 4 | S3Buckets: 5 | us-west-2: 6 | bucket: aws-admartech-samples-us-west-2 7 | us-east-1: 8 | bucket: aws-admartech-samples-us-east-1 9 | us-east-2: 10 | bucket: aws-admartech-samples-us-east-2 11 | eu-west-1: 12 | bucket: aws-admartech-samples-eu-west-1 13 | 14 | Constants: 15 | S3Keys: 16 | neptuneNotebooks: /identity-resolution/notebooks/identity-graph 17 | irdata: /identity-resolution/data/ 18 | bulkLoadStack: /identity-resolution/templates/bulk-load-stack.yaml 19 | neptuneNotebookStack: /identity-resolution/templates/neptune-workbench-stack.yaml 20 | 21 | #------------------------------------------------------------------------------# 22 | # RESOURCES 23 | #------------------------------------------------------------------------------# 24 | Resources: 25 | # 
---------- CREATING NEPTUNE CLUSTER FROM SNAPSHOT ---------- 26 | NeptuneBaseStack: 27 | Type: AWS::CloudFormation::Stack 28 | Properties: 29 | TemplateURL: https://s3.amazonaws.com/aws-neptune-customer-samples/v2/cloudformation-templates/neptune-base-stack.json 30 | Parameters: 31 | NeptuneQueryTimeout: '300000' 32 | DbInstanceType: db.r5.12xlarge 33 | TimeoutInMinutes: '360' 34 | 35 | # ---------- SETTING UP SAGEMAKER NOTEBOOK INSTANCES ---------- 36 | ExecutionRole: 37 | Type: AWS::IAM::Role 38 | Properties: 39 | AssumeRolePolicyDocument: 40 | Version: "2012-10-17" 41 | Statement: 42 | - Effect: Allow 43 | Principal: 44 | Service: 45 | - sagemaker.amazonaws.com 46 | Action: 47 | - sts:AssumeRole 48 | Path: "/" 49 | Policies: 50 | - PolicyName: "sagemakerneptunepolicy" 51 | PolicyDocument: 52 | Version: "2012-10-17" 53 | Statement: 54 | - Effect: "Allow" 55 | Action: 56 | - cloudwatch:PutMetricData 57 | Resource: 58 | Fn::Sub: "arn:${AWS::Partition}:cloudwatch:${AWS::Region}:${AWS::AccountId}:*" 59 | - Effect: "Allow" 60 | Action: 61 | - "logs:CreateLogGroup" 62 | - "logs:CreateLogStream" 63 | - "logs:DescribeLogStreams" 64 | - "logs:PutLogEvents" 65 | - "logs:GetLogEvents" 66 | Resource: 67 | Fn::Sub: "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:*" 68 | - Effect: "Allow" 69 | Action: "neptune-db:connect" 70 | Resource: 71 | Fn::Sub: "arn:${AWS::Partition}:neptune-db:${AWS::Region}:${AWS::AccountId}:${NeptuneBaseStack.Outputs.DBClusterId}/*" 72 | - Effect: "Allow" 73 | Action: 74 | - "s3:Get*" 75 | - "s3:List*" 76 | Resource: 77 | Fn::Sub: "arn:${AWS::Partition}:s3:::*" 78 | 79 | SageMakerNeptuneStack: 80 | Type: AWS::CloudFormation::Stack 81 | Properties: 82 | TemplateURL: 83 | Fn::Join: [ "", 84 | [ 85 | https://s3.amazonaws.com/, 86 | !FindInMap [ S3Buckets, Ref: 'AWS::Region', bucket ], 87 | !FindInMap [ Constants, S3Keys, neptuneNotebookStack ] 88 | ] 89 | ] 90 | Parameters: 91 | SageMakerNotebookName: "id-graph-notebook" 92 | NotebookInstanceType: ml.m5.xlarge 93 | NeptuneClusterEndpoint: 94 | Fn::GetAtt: 95 | - NeptuneBaseStack 96 | - Outputs.DBClusterEndpoint 97 | NeptuneClusterPort: 98 | Fn::GetAtt: 99 | - NeptuneBaseStack 100 | - Outputs.DBClusterPort 101 | NeptuneClusterSecurityGroups: 102 | Fn::GetAtt: 103 | - NeptuneBaseStack 104 | - Outputs.NeptuneSG 105 | NeptuneClusterSubnetId: 106 | Fn::GetAtt: 107 | - NeptuneBaseStack 108 | - Outputs.PublicSubnet1 109 | SageMakerNotebookRole: 110 | Fn::GetAtt: 111 | - ExecutionRole 112 | - Arn 113 | AdditionalNotebookS3Locations: !Join 114 | - '' 115 | - - 's3://' 116 | - !FindInMap 117 | - S3Buckets 118 | - !Ref 'AWS::Region' 119 | - bucket 120 | - !FindInMap 121 | - Constants 122 | - S3Keys 123 | - neptuneNotebooks 124 | TimeoutInMinutes: '60' 125 | 126 | # --------- LOAD DATA INTO NEPTUNE --------- 127 | 128 | NeptuneBulkLoadStack: 129 | Type: AWS::CloudFormation::Stack 130 | Properties: 131 | TemplateURL: !Join 132 | - '' 133 | - - 'https://s3.' 
134 | - !Ref 'AWS::Region' 135 | - '.amazonaws.com/' 136 | - !FindInMap 137 | - S3Buckets 138 | - !Ref 'AWS::Region' 139 | - bucket 140 | - !FindInMap 141 | - Constants 142 | - S3Keys 143 | - bulkLoadStack 144 | Parameters: 145 | bulkloadNeptuneEndpoint: 146 | Fn::GetAtt: 147 | - NeptuneBaseStack 148 | - Outputs.DBClusterEndpoint 149 | bulkloadNeptuneData: !Join 150 | - '' 151 | - - 's3://' 152 | - !FindInMap 153 | - S3Buckets 154 | - !Ref 'AWS::Region' 155 | - bucket 156 | - !FindInMap 157 | - Constants 158 | - S3Keys 159 | - irdata 160 | bulkloadNeptuneIAMRole: 161 | Fn::GetAtt: 162 | - NeptuneBaseStack 163 | - Outputs.NeptuneLoadFromS3IAMRoleArn 164 | bulkloadNeptuneSecurityGroup: 165 | Fn::GetAtt: 166 | - NeptuneBaseStack 167 | - Outputs.NeptuneSG 168 | bulkloadSubnet1: 169 | Fn::GetAtt: 170 | - NeptuneBaseStack 171 | - Outputs.PrivateSubnet1 172 | bulkloadBucket: !FindInMap 173 | - S3Buckets 174 | - !Ref 'AWS::Region' 175 | - bucket 176 | 177 | 178 | #------------------------------------------------------------------------------# 179 | # OUTPUTS 180 | #------------------------------------------------------------------------------# 181 | 182 | Outputs: 183 | VPC: 184 | Description: VPC of the Neptune Cluster 185 | Value: 186 | Fn::GetAtt: 187 | - NeptuneBaseStack 188 | - Outputs.VPC 189 | PublicSubnet1: 190 | Value: 191 | Fn::GetAtt: 192 | - NeptuneBaseStack 193 | - Outputs.PublicSubnet1 194 | NeptuneSG: 195 | Description: Neptune Security Group 196 | Value: 197 | Fn::GetAtt: 198 | - NeptuneBaseStack 199 | - Outputs.NeptuneSG 200 | SageMakerNotebook: 201 | Value: 202 | Fn::GetAtt: 203 | - SageMakerNeptuneStack 204 | - Outputs.NeptuneNotebook 205 | DBClusterEndpoint: 206 | Description: Master Endpoint for Neptune Cluster 207 | Value: 208 | Fn::GetAtt: 209 | - NeptuneBaseStack 210 | - Outputs.DBClusterEndpoint 211 | DBInstanceEndpoint: 212 | Description: Master Instance Endpoint 213 | Value: 214 | Fn::GetAtt: 215 | - NeptuneBaseStack 216 | - Outputs.DBInstanceEndpoint 217 | GremlinEndpoint: 218 | Description: Gremlin Endpoint for Neptune 219 | Value: 220 | Fn::GetAtt: 221 | - NeptuneBaseStack 222 | - Outputs.GremlinEndpoint 223 | LoaderEndpoint: 224 | Description: Loader Endpoint for Neptune 225 | Value: 226 | Fn::GetAtt: 227 | - NeptuneBaseStack 228 | - Outputs.LoaderEndpoint 229 | DBClusterReadEndpoint: 230 | Description: DB cluster Read Endpoint 231 | Value: 232 | Fn::GetAtt: 233 | - NeptuneBaseStack 234 | - Outputs.DBClusterReadEndpoint 235 | -------------------------------------------------------------------------------- /identity-resolution/templates/neptune-workbench-stack.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | 3 | Description: A template to deploy Neptune Notebooks using CloudFormation resources. 4 | 5 | Parameters: 6 | NotebookInstanceType: 7 | Description: The notebook instance type. 
8 | Type: String 9 | Default: ml.t2.medium 10 | AllowedValues: 11 | - ml.t2.medium 12 | - ml.t2.large 13 | - ml.t2.xlarge 14 | - ml.t2.2xlarge 15 | - ml.t3.2xlarge 16 | - ml.t3.large 17 | - ml.t3.medium 18 | - ml.t3.xlarge 19 | - ml.m4.xlarge 20 | - ml.m4.2xlarge 21 | - ml.m4.4xlarge 22 | - ml.m4.10xlarge 23 | - ml.m4.16xlarge 24 | - ml.m5.12xlarge 25 | - ml.m5.24xlarge 26 | - ml.m5.2xlarge 27 | - ml.m5.4xlarge 28 | - ml.m5.xlarge 29 | - ml.p2.16xlarge 30 | - ml.p2.8xlarge 31 | - ml.p2.xlarge 32 | - ml.p3.16xlarge 33 | - ml.p3.2xlarge 34 | - ml.p3.8xlarge 35 | - ml.c4.2xlarge 36 | - ml.c4.4xlarge 37 | - ml.c4.8xlarge 38 | - ml.c4.xlarge 39 | - ml.c5.18xlarge 40 | - ml.c5.2xlarge 41 | - ml.c5.4xlarge 42 | - ml.c5.9xlarge 43 | - ml.c5.xlarge 44 | - ml.c5d.18xlarge 45 | - ml.c5d.2xlarge 46 | - ml.c5d.4xlarge 47 | - ml.c5d.9xlarge 48 | - ml.c5d.xlarge 49 | ConstraintDescription: Must be a valid SageMaker instance type. 50 | 51 | NeptuneClusterEndpoint: 52 | Description: The cluster endpoint of an existing Neptune cluster. 53 | Type: String 54 | 55 | NeptuneClusterPort: 56 | Description: 'OPTIONAL: The Port of an existing Neptune cluster (default 8182).' 57 | Type: String 58 | Default: '8182' 59 | 60 | NeptuneClusterSecurityGroups: 61 | Description: The VPC security group IDs. The security groups must be for the same VPC as specified in the subnet. 62 | Type: List 63 | 64 | NeptuneClusterSubnetId: 65 | Description: The ID of the subnet in a VPC to which you would like to have a connectivity from your ML compute instance. 66 | Type: AWS::EC2::Subnet::Id 67 | 68 | SageMakerNotebookRole: 69 | Description: The ARN for the IAM role that the notebook instance will assume. 70 | Type: String 71 | AllowedPattern: ^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$ 72 | 73 | SageMakerNotebookName: 74 | Description: The name of the Neptune notebook. 75 | Type: String 76 | 77 | AdditionalNotebookS3Locations: 78 | Description: Location of additional notebooks to include with the Notebook instance. 
79 | Type: String 80 | 81 | Conditions: 82 | InstallNotebookContent: 83 | Fn::Not: [ 84 | Fn::Equals: [ 85 | Ref: AdditionalNotebookS3Locations, "" 86 | ] 87 | ] 88 | 89 | Resources: 90 | NeptuneNotebookInstance: 91 | Type: AWS::SageMaker::NotebookInstance 92 | Properties: 93 | NotebookInstanceName: !Join 94 | - '' 95 | - - 'aws-neptune-' 96 | - !Ref SageMakerNotebookName 97 | InstanceType: 98 | Ref: NotebookInstanceType 99 | SubnetId: 100 | Ref: NeptuneClusterSubnetId 101 | SecurityGroupIds: 102 | Ref: NeptuneClusterSecurityGroups 103 | RoleArn: 104 | Ref: SageMakerNotebookRole 105 | LifecycleConfigName: 106 | Fn::GetAtt: 107 | - NeptuneNotebookInstanceLifecycleConfig 108 | - NotebookInstanceLifecycleConfigName 109 | 110 | NeptuneNotebookInstanceLifecycleConfig: 111 | Type: AWS::SageMaker::NotebookInstanceLifecycleConfig 112 | Properties: 113 | OnStart: 114 | - Content: 115 | Fn::Base64: 116 | Fn::Join: 117 | - '' 118 | - - "#!/bin/bash\n" 119 | - sudo -u ec2-user -i << 'EOF' 120 | - "\n" 121 | - echo 'export GRAPH_NOTEBOOK_AUTH_MODE= 122 | - "DEFAULT' >> ~/.bashrc\n" 123 | - echo 'export GRAPH_NOTEBOOK_HOST= 124 | - Ref: NeptuneClusterEndpoint 125 | - "' >> ~/.bashrc\n" 126 | - echo 'export GRAPH_NOTEBOOK_PORT= 127 | - Ref: NeptuneClusterPort 128 | - "' >> ~/.bashrc\n" 129 | - echo 'export NEPTUNE_LOAD_FROM_S3_ROLE_ARN= 130 | - "' >> ~/.bashrc\n" 131 | - echo 'export AWS_REGION= 132 | - Ref: AWS::Region 133 | - "' >> ~/.bashrc\n" 134 | - aws s3 cp s3://aws-neptune-notebook/graph_notebook.tar.gz /tmp/graph_notebook.tar.gz 135 | - "\n" 136 | - echo 'export NOTEBOOK_CONTENT_S3_LOCATION=, 137 | - Ref: AdditionalNotebookS3Locations 138 | - "' >> ~/.bashrc\n" 139 | - aws s3 sync s3://aws-neptune-customer-samples/neptune-sagemaker/notebooks /home/ec2-user/SageMaker/Neptune --exclude * --include util/* 140 | - "\n" 141 | - rm -rf /tmp/graph_notebook 142 | - "\n" 143 | - tar -zxvf /tmp/graph_notebook.tar.gz -C /tmp 144 | - "\n" 145 | - /tmp/graph_notebook/install.sh 146 | - "\n" 147 | - mkdir /home/ec2-user/SageMaker/identity-graph 148 | - "\n" 149 | - Fn::If: [ InstallNotebookContent, 150 | Fn::Join: 151 | [ "", [ 152 | "aws s3 cp ", 153 | Ref: AdditionalNotebookS3Locations, 154 | " /home/ec2-user/SageMaker/identity-graph/ --recursive" 155 | ] 156 | ], 157 | "# No notebook content\n" 158 | ] 159 | - "\n" 160 | - EOF 161 | 162 | Outputs: 163 | NeptuneNotebookInstanceId: 164 | Value: 165 | Ref: NeptuneNotebookInstance 166 | NeptuneNotebook: 167 | Value: 168 | Fn::Join: [ "", 169 | [ 170 | "https://", 171 | Fn::Select: [ 1, Fn::Split: [ "/", Ref: "NeptuneNotebookInstance" ] ], 172 | ".notebook.", 173 | Ref: "AWS::Region", 174 | ".sagemaker.aws/" 175 | ] 176 | ] 177 | NeptuneNotebookInstanceLifecycleConfigId: 178 | Value: 179 | Ref: "NeptuneNotebookInstanceLifecycleConfig" --------------------------------------------------------------------------------
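For completeness, a sketch of the bulk-load request that the custom-resource Lambda in bulk-load-stack.yaml ultimately issues against Neptune's HTTP loader API (the NeptuneLoader.zip code itself is not included in this repository dump). The endpoint, S3 path, role ARN and region below are placeholders standing in for the stack parameters bulkloadNeptuneEndpoint, bulkloadNeptuneData and bulkloadNeptuneIAMRole; treat this as an illustration of the loader call, not the packaged Lambda source.

import json
import urllib.request

# Placeholders; in the deployed stack these arrive as the Lambda environment
# variables neptunedb, neptuneloads3path, s3loadiamrole and region.
NEPTUNE_ENDPOINT = "my-neptune-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com"
SOURCE_S3_PATH = "s3://my-bucket/identity-resolution/data/"
LOAD_ROLE_ARN = "arn:aws:iam::123456789012:role/NeptuneLoadFromS3"

payload = {
    "source": SOURCE_S3_PATH,
    "format": "csv",            # Gremlin load format, as produced by nepytune.write_utils
    "iamRoleArn": LOAD_ROLE_ARN,
    "region": "us-east-1",
    "failOnError": "FALSE",
}

request = urllib.request.Request(
    f"https://{NEPTUNE_ENDPOINT}:8182/loader",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Must run from inside the Neptune VPC, which is why the stack attaches the Lambda to it.
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # returns a loadId that can be polled for status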