├── .github
│   └── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── LICENSE-SAMPLECODE
├── LICENSE-SUMMARY
├── README.md
├── _config.yml
├── _layouts
│   └── default.html
├── assets
│   └── css
│       └── style.scss
└── src
    ├── AWS_logo_RGB_WHT.png
    ├── accessing-from-aws-lambda
    │   ├── README.md
    │   ├── lambda-neptune.png
    │   └── thumbnail.png
    ├── connecting-using-a-load-balancer
    │   ├── README.md
    │   ├── application-load-balancer.png
    │   ├── network-load-balancer.png
    │   └── thumbnail.png
    ├── converting-to-graph
    │   ├── README.md
    │   ├── document-2-graph.png
    │   ├── key-value-2-graph.png
    │   ├── relational-fk1.png
    │   ├── relational-join-table.png
    │   ├── relational-multi-join.png
    │   ├── relational-table.png
    │   └── thumbnail.png
    ├── data-models-and-query-languages
    │   ├── README.md
    │   ├── property-graph.png
    │   ├── rdf.png
    │   └── thumbnail.png
    ├── graph-data-modelling
    │   ├── README.md
    │   ├── bi-directional-relationships.png
    │   ├── data-modelling-process.png
    │   ├── edge-labels.png
    │   ├── hub-and-spoke-1.png
    │   ├── hub-and-spoke-2.png
    │   ├── hub-and-spoke-3.png
    │   ├── hub-and-spoke-4.png
    │   ├── large-query.png
    │   ├── multiple-relationships.png
    │   ├── rdf
    │   │   ├── rdf-graph-development-lifecycle-1.png
    │   │   ├── rdf-graph-development-lifecycle-2-op-1.png
    │   │   ├── rdf-graph-development-lifecycle-2-op-2.png
    │   │   ├── rdf-graph-development-lifecycle-2-op-3.png
    │   │   ├── rdf-graph-development-lifecycle-3.png
    │   │   ├── rdf-graph-development-lifecycle.png
    │   │   ├── rei-1.png
    │   │   └── rei-2.png
    │   ├── small-query.png
    │   ├── thumbnail.png
    │   └── uni-directional-relationships.png
    └── writing-from-amazon-kinesis-data-streams
        ├── README.md
        ├── kinesis-neptune.png
        └── thumbnail.png
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Issue #, if available:*
2 |
3 | *Description of changes:*
4 |
5 |
6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
7 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | **/.DS_Store
2 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Guidelines for contributing
2 |
3 | Thank you for your interest in contributing to AWS documentation! We greatly value feedback and contributions from our community.
4 |
5 | Please read through this document before you submit any pull requests or issues. It will help us work together more effectively.
6 |
7 | ## What to expect when you contribute
8 |
9 | When you submit a pull request, our team is notified and will respond as quickly as we can. We'll do our best to work with you to ensure that your pull request adheres to our style and standards. If we merge your pull request, we might make additional edits later for style or clarity.
10 |
11 | The AWS documentation source files on GitHub aren't published directly to the official documentation website. If we merge your pull request, we'll publish your changes to the documentation website as soon as we can, but they won't appear immediately or automatically.
12 |
13 | We look forward to receiving your pull requests for:
14 |
15 | * New content you'd like to contribute (such as new code samples or tutorials)
16 | * Inaccuracies in the content
17 | * Information gaps in the content that need more detail to be complete
18 | * Typos or grammatical errors
19 | * Suggested rewrites that improve clarity and reduce confusion
20 |
21 | **Note:** We all write differently, and you might not like how we've written or organized something currently. We want that feedback. But please be sure that your request for a rewrite is supported by the previous criteria. If it isn't, we might decline to merge it.
22 |
23 | ## How to contribute
24 |
25 | To contribute, send us a pull request. For small changes, such as fixing a typo or adding a link, you can use the [GitHub Edit Button](https://blog.github.com/2011-04-26-forking-with-the-edit-button/). For larger changes:
26 |
27 | 1. [Fork the repository](https://help.github.com/articles/fork-a-repo/).
28 | 2. In your fork, make your change in a branch that's based on this repo's **master** branch.
29 | 3. Commit the change to your fork, using a clear and descriptive commit message.
30 | 4. [Create a pull request](https://help.github.com/articles/creating-a-pull-request-from-a-fork/), answering any questions in the pull request form.
31 |
32 | Before you send us a pull request, please be sure that:
33 |
34 | 1. You're working from the latest source on the **master** branch.
35 | 2. You check [existing open](https://github.com/awsdocs/aws-dbs-refarch-graph/pulls), and [recently closed](https://github.com/awsdocs/aws-dbs-refarch-graph/pulls?q=is%3Apr+is%3Aclosed), pull requests to be sure that someone else hasn't already addressed the problem.
36 | 3. You [create an issue](https://github.com/awsdocs/aws-dbs-refarch-graph/issues/new) before working on a contribution that will take a significant amount of your time.
37 |
38 | For contributions that will take a significant amount of time, [open a new issue](https://github.com/awsdocs/aws-dbs-refarch-graph/issues/new) to pitch your idea before you get started. Explain the problem and describe the content you want to see added to the documentation. Let us know if you'll write it yourself or if you'd like us to help. We'll discuss your proposal with you and let you know whether we're likely to accept it. We don't want you to spend a lot of time on a contribution that might be outside the scope of the documentation or that's already in the works.
39 |
40 | ## Finding contributions to work on
41 |
42 | If you'd like to contribute, but don't have a project in mind, look at the [open issues](https://github.com/awsdocs/aws-dbs-refarch-graph/issues) in this repository for some ideas. Any issues with the [help wanted](https://github.com/awsdocs/aws-dbs-refarch-graph/labels/help%20wanted) or [enhancement](https://github.com/awsdocs/aws-dbs-refarch-graph/labels/enhancement) labels are a great place to start.
43 |
44 | In addition to written content, we really appreciate new examples and code samples for our documentation, such as examples for different platforms or environments, and code samples in additional languages.
45 |
46 | ## Code of conduct
47 |
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For more information, see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact [opensource-codeofconduct@amazon.com](mailto:opensource-codeofconduct@amazon.com) with any additional questions or comments.
49 |
50 | ## Security issue notifications
51 |
52 | If you discover a potential security issue, please notify AWS Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public issue on GitHub.
53 |
54 | ## Licensing
55 |
56 | See the [LICENSE](https://github.com/awsdocs/aws-dbs-refarch-graph/blob/master/LICENSE) file for this project's licensing. We will ask you to confirm the licensing of your contribution. We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
57 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Creative Commons Attribution-ShareAlike 4.0 International Public License
2 |
3 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
4 |
5 | Section 1 – Definitions.
6 |
7 | a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
8 |
9 | b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
10 |
11 | c. BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License.
12 |
13 | d. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
14 |
15 | e. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
16 |
17 | f. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
18 |
19 | g. License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike.
20 |
21 | h. Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
22 |
23 | i. Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
24 |
25 | j. Licensor means the individual(s) or entity(ies) granting rights under this Public License.
26 |
27 | k. Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
28 |
29 | l. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
30 |
31 | m. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
32 |
33 | Section 2 – Scope.
34 |
35 | a. License grant.
36 |
37 | 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
38 |
39 | A. reproduce and Share the Licensed Material, in whole or in part; and
40 |
41 | B. produce, reproduce, and Share Adapted Material.
42 |
43 | 2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
44 |
45 | 3. Term. The term of this Public License is specified in Section 6(a).
46 |
47 | 4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
48 |
49 | 5. Downstream recipients.
50 |
51 | A. Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
52 |
53 | B. Additional offer from the Licensor – Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply.
54 |
55 | C. No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
56 |
57 | 6. No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
58 |
59 | b. Other rights.
60 |
61 | 1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
62 |
63 | 2. Patent and trademark rights are not licensed under this Public License.
64 |
65 | 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties.
66 |
67 | Section 3 – License Conditions.
68 |
69 | Your exercise of the Licensed Rights is expressly made subject to the following conditions.
70 |
71 | a. Attribution.
72 |
73 | 1. If You Share the Licensed Material (including in modified form), You must:
74 |
75 | A. retain the following if it is supplied by the Licensor with the Licensed Material:
76 |
77 | i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
78 |
79 | ii. a copyright notice;
80 |
81 | iii. a notice that refers to this Public License;
82 |
83 | iv. a notice that refers to the disclaimer of warranties;
84 |
85 | v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
86 |
87 | B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
88 |
89 | C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
90 |
91 | 2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
92 |
93 | 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
94 |
95 | b. ShareAlike. In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply.
96 |
97 | 1. The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.
98 |
99 | 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material.
100 |
101 | 3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply.
102 |
103 | Section 4 – Sui Generis Database Rights.
104 |
105 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
106 |
107 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
108 |
109 | b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and
110 |
111 | c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
112 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
113 |
114 | Section 5 – Disclaimer of Warranties and Limitation of Liability.
115 |
116 | a. Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.
117 |
118 | b. To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.
119 |
120 | c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
121 |
122 | Section 6 – Term and Termination.
123 |
124 | a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
125 |
126 | b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
127 |
128 | 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
129 |
130 | 2. upon express reinstatement by the Licensor.
131 |
132 | c. For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
133 |
134 | d. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
135 |
136 | e. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
137 |
138 | Section 7 – Other Terms and Conditions.
139 |
140 | a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
141 |
142 | b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
143 |
144 | Section 8 – Interpretation.
145 |
146 | a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
147 |
148 | b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
149 |
150 | c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
151 |
152 | d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
153 |
--------------------------------------------------------------------------------
/LICENSE-SAMPLECODE:
--------------------------------------------------------------------------------
1 | Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | software and associated documentation files (the "Software"), to deal in the Software
5 | without restriction, including without limitation the rights to use, copy, modify,
6 | merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | permit persons to whom the Software is furnished to do so.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
--------------------------------------------------------------------------------
/LICENSE-SUMMARY:
--------------------------------------------------------------------------------
1 | Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
3 | The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.
4 |
5 | The sample code within this documentation is made available under a modified MIT license. See the LICENSE-SAMPLECODE file.
6 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | __Graph Database__ workloads are operational and business intelligence database workloads that store and query highly connected data.
2 |
3 | Example graph database workloads include:
4 |
5 | * __Knowledge graphs__ In a general sense, a knowledge graph is a network or connected representation of things relevant to a specific domain or organization. Examples include a movie-based knowledge graph containing details of movies, the actors who have appeared in these movies, and the production staff members who have worked on them; an art graph, containing details of museums, the works of art displayed in each museum, and the artists who have created these works of art; and an organizational knowledge graph containing details of employees, their roles, the departments to which they belong, and the offices where they work. Read more about [Knowledge Graphs on AWS](https://aws.amazon.com/neptune/knowledge-graphs-on-aws/).
6 |
7 | * __Fraud detection__ By connecting seemingly isolated facts – applications of new lines of credit, identity information, transactions, the place and time of each transaction, the IP address from which a request was submitted – we can find patterns of fraudulent behavior, group the multiple identities used by an individual to overextend credit, and identify the members of a possible fraud ring. Read more about [Fraud Graphs on AWS](https://aws.amazon.com/neptune/fraud-graphs-on-aws/).
8 |
9 | * __Identity graphs__ By connecting devices, browsing data, and identity information we can create a single unified view of customers and prospects based on their interactions with a product or website across a set of devices and identifiers that can then be used for real-time personalization and targeted advertising. Read more about [Identity Graphs on AWS](https://aws.amazon.com/neptune/identity-graphs-on-aws/).
10 |
11 | * __Social networking__ People and the social relations that connect them: friendship, follower, and professional relationships. Social networks can be used to identify the transitive relations that connect individuals and calculate degrees of separation, rank influential or important individuals and trace paths of influence, detect communities and the relationships and attributes that establish membership of a community, and predict the likelihood of new relationships emerging between individuals.
12 |
13 | * __Recommendations__ Often combined with some social networking data, recommendation engines provide a predictive capability based on the existing connections in the network. For example, by capturing details of users, the things or topics they have expressed an interest in, and the things they have purchased, you can offer per-user recommendations: "people who have purchased things you have purchased and/or who share your interests have also purchased X and/or are also interested in Y."
14 |
15 | * __Network and IT operations__ A physical network is an intrinsically connected structure. By populating a graph with details of our network infrastructure, we can do top-down and bottom-up impact analyses, identifying which parts of the network an application or service depends on, determining whether redundancy exists throughout the network on behalf of a customer, application or service, and assessing the likely impact on service provision should a network element fail or have to be replaced or repaired.
16 |
17 | ### Do I Have a Graph Workload?
18 |
19 | You may have a graph workload if you need to:
20 |
21 | * model and navigate sophisticated or complex structures,
22 | * quickly and flexibly link or connect items,
23 | * answer questions based on an understanding of how things are connected – both the semantics of the relationships between entities and the various strengths, weights or qualities of these relationships.
24 |
25 | Key characteristics of a graph database workload include:
26 |
27 | * Data volumes are expected to comprise many millions or billions of items and relationships.
28 | * Data can change frequently, with additions and changes to items and connections made available to clients less than a second after having been made durable.
29 | * Queries begin by finding one or more starting points in the graph, and then explore the neighbouring portions of the graph in order to discover connected items or compute results as they traverse the paths that connect items, with subsecond response times.
30 | * Items may exhibit variable schema, insofar as two items of the same type may not necessarily share the exact same set of attributes.
31 | * Items may be connected to one another in many different ways, with no two pairs of items necessarily connected in the exact same way.
32 |
33 | Examples of connected data queries include:
34 |
35 | * Which friends and colleagues do we have in common?
36 | * Which applications and services in my network will be affected if a particular network element – a router or switch, for example – fails? Do we have redundancy throughout the network for our most important customers?
37 | * What's the quickest route between two stations on the underground?
38 | * What do you recommend this customer should buy, view, or listen to next?
39 | * Which products, services and subscriptions does a user have permission to access and modify?
40 | * What's the cheapest or fastest means of delivering this parcel from A to B?
41 | * Which parties are likely working together to defraud their bank or insurer?
42 | * Which institutions are most at risk of poisoning the financial markets?
43 |
44 | ### Choosing a Data Technology For Your Workload
45 |
46 | Data workloads in which data items are implicitly or explicitly connected to one another can be implemented using a wide range of relational and non-relational technologies, but in situations where the data is not only highly connected but the queries addressed to the data also exploit this connected structure, there are many design, development and performance benefits to using a graph database optimized for graph workloads.
47 |
48 | [Amazon Neptune](https://aws.amazon.com/neptune/) is a fast and reliable graph database optimized for storing and querying connected data. It's ideal when your query workloads require navigating connections and leveraging the strength, weight, or quality of the relationships between items. Combined with other AWS services, you can use Neptune as the database backend for applications and services whose data models and query patterns represent graph workloads, and as a datastore for graph-oriented BI and light analytics.
49 |
50 | When choosing a database for your application you should ensure the operational, performance and data architecture characteristics of your candidate technologies are a good fit for your workload. Sometimes you will have to make tradeoffs between these characteristics. Many relational and non-relational technologies can be used to implement connected data scenarios, but the balance of design and development effort involved, resulting performance, and ease with which you can evolve your solution will vary from technology to technology.
51 |
52 | You can use a relational database, such as one of the managed engines supported by the [Amazon Relational Database Service](https://aws.amazon.com/rds/) (Amazon RDS), to build a connected data application, using foreign keys and join tables to model connectedness, and join-based queries to navigate the graph structure at query time. However, the variations in structure that manifest themselves in many large graph datasets can present problems when designing and maintaining a relational schema. Complex traversal and path-based operations can result in large and difficult to understand SQL queries. Furthermore, the performance of join-intensive SQL queries can deteriorate as the dataset grows.
53 |
54 | A non-relational document or key-value store, such as [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), can similarly be used to model connected data. DynamoDB offers high-throughput, low-latency reads and writes at any scale. However, it is best suited to workloads in which items or collections of items are inserted or retrieved without reference to or joining with other items in the dataset. Applications that need to take advantage of the connections between items will have to implement joins in the application layer and issue multiple requests per query, making the application logic more complex, impacting performance, and undermining the isolation offered by a single query.
55 |
56 | Neptune offers two different graph data models and query languages that simplify graph data modelling and query development, ACID transactions for creating and modifying connected structures, and a storage layer that automatically grows in line with your storage requirements, up to 64 TB. Complex graph queries are easier to express in Neptune than they are in SQL or in your own application logic, and will often perform better. RDS-based relational solutions, however, remain better suited to workloads that filter, count or perform simple joins between sets, or which require the data integrity guarantees offered by strong schemas, while DynamoDB continues to excel at inserting and retrieving discrete items or collections of items with predictably low latencies at any scale.
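
To make the difference concrete, here is a minimal sketch (not part of this repository) of a multi-hop connected-data query expressed with the Gremlin Python client. The endpoint, the `Person` vertex label, the `FRIEND` edge label and the `name` property key are illustrative assumptions; the equivalent SQL would typically require a chain of self-joins on a relationship table.

```
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import P

# Placeholder endpoint and schema - substitute your own cluster endpoint, labels and keys.
conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Friends-of-friends of Alice, excluding the people she already knows directly.
names = (g.V().has('Person', 'name', 'Alice')
          .out('FRIEND').aggregate('direct')
          .out('FRIEND')
          .where(P.without('direct'))   # drop direct friends gathered above
          .dedup()
          .values('name')
          .toList())
print(names)

conn.close()
```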
57 |
58 | ## Data Architectures
59 |
60 | ### [Data Models and Query Languages](src/data-models-and-query-languages)
61 |
62 |
63 |
64 |
65 | Neptune supports two different graph data models: the property graph data model, and the Resource Description Framework (RDF). Each data model has its own query language for creating and querying graph data. For a property graph, you create and query data using Apache TinkerPop Gremlin, an open source query language supported by several other graph databases. For an RDF graph, you create and query data using SPARQL, a graph pattern matching language standardized by the W3C.
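
As a brief illustration (a minimal sketch, not from this repository), both query languages can be exercised over Neptune's HTTP interface on port 8182: SPARQL at the `/sparql` path and string-based Gremlin at the `/gremlin` path. The cluster endpoint name below is a placeholder, and IAM database authentication, if enabled, would additionally require Signature Version 4 signing.

```
import requests

NEPTUNE = 'https://your-neptune-endpoint:8182'  # placeholder cluster endpoint

# RDF model: SPARQL 1.1 protocol over HTTP at the /sparql path.
sparql_query = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'
sparql_response = requests.post(
    NEPTUNE + '/sparql',
    data={'query': sparql_query},
    headers={'Accept': 'application/sparql-results+json'})
print(sparql_response.json())

# Property graph model: a string-based Gremlin query sent to the /gremlin path
# (WebSockets is the alternative for Gremlin Language Variant traversals).
gremlin_response = requests.post(
    NEPTUNE + '/gremlin',
    json={'gremlin': 'g.V().limit(10).valueMap(true)'})
print(gremlin_response.json())
```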
66 |
67 | ### [Graph Data Modelling](src/graph-data-modelling)
73 |
74 | When you build a graph database application you will have to design and implement an application graph data model, together with graph queries that address that model. The application graph data model expresses the application domain; the queries answer the questions you would have to pose to that domain in order to satisfy your application use cases. This section describes how to create an application graph model.
75 |
76 |
77 |
78 | ### [Converting Other Data Models to a Graph Model](src/converting-to-graph)
79 |
80 |
81 |
82 |
83 | Sometimes you need to take data from another data technology and ingest it into a graph database prior to undertaking any explicit application-specific graph data modelling. In these circumstances you can apply a number of 'mechanical' transformations that yield a naive graph model. This section describes how to map relational, document and key-value data models to a graph model.
84 |
85 |
86 |
87 | ## Deployment Architectures
88 |
89 | ### [Connecting to Amazon Neptune from Clients Outside the Neptune VPC](src/connecting-using-a-load-balancer)
90 |
91 |
92 |
93 | Amazon Neptune only allows connections from clients located in the same VPC as the Neptune cluster. If you want to connect from outside the Neptune VPC, you can use a load balancer. This architecture shows how you can use either a Network Load Balancer or an Application Load Balancer to connect to Neptune.
94 |
95 |
100 | ### [Accessing Amazon Neptune from AWS Lambda Functions](src/accessing-from-aws-lambda)
101 | If you are building an application or service on Amazon Neptune, you may choose to expose an API to your clients, rather than offer direct access to the database. AWS Lambda allows you to build and run application logic without provisioning or managing servers. This architecture shows you how to connect AWS Lambda functions to Amazon Neptune.
102 |
103 |
104 |
105 | ### [Writing to Amazon Neptune from an Amazon Kinesis Data Stream](src/writing-from-amazon-kinesis-data-streams)
106 |
107 |
108 |
109 | When using Amazon Neptune in high write throughput scenarios, you can improve the reliability, performance and scalability of your application by sending writes from your client to an Amazon Kinesis Data Stream. An AWS Lambda function polls the stream and issues batches of writes to the underlying Neptune database.
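
The following is a minimal sketch of such a polling function. It assumes, for illustration only, that each Kinesis record carries a JSON document with `id`, `label` and `name` fields, and that the Gremlin Python client is used; the endpoint and record format are assumptions, and a production function would also validate input and handle partial failures.

```
import base64
import json

from gremlin_python.driver import client

# Placeholder endpoint; the client is created outside the handler so the
# connection is reused for the lifetime of the Lambda execution context.
neptune = client.Client('wss://your-neptune-endpoint:8182/gremlin', 'g')

def handler(event, context):
    # Assumed record format: each Kinesis record carries JSON such as
    # {"id": "p-1", "label": "Person", "name": "Alice"}.
    records = [json.loads(base64.b64decode(r['kinesis']['data']))
               for r in event['Records']]

    # Fold the batch into a single Gremlin request instead of one write per record.
    # (String concatenation keeps the sketch short; sanitise or parameterise real input.)
    steps = ["addV('{label}').property(id, '{id}').property('name', '{name}')".format(**r)
             for r in records]
    query = 'g.' + '.'.join(steps)

    neptune.submit(query).all().result()
    return {'recordsWritten': len(records)}
```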
110 |
111 |
112 |
113 |
114 | ## License Summary
115 |
116 | The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.
117 |
118 | The sample code within this documentation is made available under a modified MIT license. See the LICENSE-SAMPLECODE file.
119 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-slate
--------------------------------------------------------------------------------
/_layouts/default.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 | {% seo %}
11 |
12 |
13 |
14 |
15 |
16 |
35 |
36 |
37 |
45 |
46 | {% if site.google_analytics %}
47 |
55 | {% endif %}
56 |
57 |
--------------------------------------------------------------------------------
/assets/css/style.scss:
--------------------------------------------------------------------------------
1 | ---
2 | ---
3 |
4 | @import "{{ site.theme }}";
5 |
6 | table, th, td {
7 | border: 0px;
8 | vertical-align: top;
9 | }
10 | img {
11 | border:0px;
12 | }
13 | .inner {
14 | position: relative;
15 | max-width: 80%;
16 | padding: 20px 10px;
17 | margin: 0 auto;
18 | }
--------------------------------------------------------------------------------
/src/AWS_logo_RGB_WHT.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/AWS_logo_RGB_WHT.png
--------------------------------------------------------------------------------
/src/accessing-from-aws-lambda/README.md:
--------------------------------------------------------------------------------
1 | # Accessing Amazon Neptune from AWS Lambda Functions
2 |
3 | Amazon Neptune runs inside your private VPC and its endpoints can be accessed only by resources inside the VPC. To expose the endpoints outside the VPC you can use a load balancer - either an Application Load Balancer or a Network Load Balancer.
4 |
5 | If you are building an application or service on Amazon Neptune, you may choose to expose an API to your clients, rather than offer direct access to the database. AWS Lambda allows you to build and run application logic without provisioning or managing servers. Amazon API Gateway allows you to publish secure APIs that access code running on AWS Lambda.
6 |
7 | This architecture shows you how to connect AWS Lambda functions to Amazon Neptune.
8 |
9 | 
10 |
11 | ### Walkthrough of the Architecture
12 |
13 | 1. In this architecture your Neptune cluster is run in at least two subnets in two Availability Zones, with each subnet in a different Availability Zone. By distributing your cluster instances across at least two Availability Zones, you help ensure that there are instances available in your DB cluster in the unlikely event of an Availability Zone failure.
14 | 2. Neptune's VPC security group is configured to allow access from the AWS Lambda security group on the Neptune cluster's port.
15 | 3. AWS Lambda is [configured to access resources in your VPC](https://docs.aws.amazon.com/lambda/latest/dg/vpc.html). Doing so allows Lambda to create elastic network interfaces (ENIs) that enable your function to connect securely to Neptune.
16 | 4. The Lambda VPC configuration information includes at least 2 private subnets, allowing Lambda to run in high availability mode.
17 | 5. The VPC security group that Lambda uses is permitted to access Neptune via an inbound rule on the Neptune VPC security group.
18 | 6. Code running in your Lambda function uses a Gremlin or SPARQL client to submit queries to the Neptune cluster's cluster, reader and/or instance endpoints.
19 | 7. API Gateway exposes API operations that accept client requests and execute your backend Lambda functions.
20 |
21 | ### Best Practices
22 |
23 | * If you require external internet access for your function, configure your Lambda security group to allow outbound connections and route outbound internet traffic via a [NAT gateway attached to your VPC](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/).
24 | * Lambda functions that are configured to run inside a VPC incur an additional ENI start-up penalty. This means address resolution may be delayed when trying to connect to network resources. As an alternative to running inside a VPC, you can run your Lambda functions outside your VPC and connect to the Neptune endpoints via a load balancer. If you do this, you should consider enabling [IAM database authentication](https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth.html) on your Neptune cluster, and configuring the Lambda execution role with an IAM policy that grants access to the database.
25 | * Use a single connection and graph traversal source for the entire lifetime of the Lambda execution context. If the Gremlin driver you’re using has a connection pool, configure it to use a single connection. Hold the connection in a member variable so that it can be reused across invocations. Concurrent client requests to the function will be handled by different function instances running in separate execution contexts – each with its own member variable connection. A minimal sketch of this pattern appears after this list.
26 | * Handle connection issues and retry connections in your function code. While the goal is to maintain a single connection for the lifetime of an execution context, unexpected network events can cause this connection to be terminated abruptly. Connection failures will manifest as different errors depending on the driver you’re using. You should code your function to handle these connection issues and attempt a reconnection if necessary.
27 | * If your Lambda function modifies data in Neptune, you should consider adopting a backoff-and-retry strategy to handle `ConcurrentModificationException` and `ReadOnlyViolationException` errors. `ConcurrentModificationException` errors occur when multiple concurrent requests attempt to modify the same elements in the graph – see the documentation on [Neptune transaction semantics](https://docs.aws.amazon.com/neptune/latest/userguide/transactions.html) for more details. `ReadOnlyViolationException` errors can occur if the client attempts to write to a database instance that is no longer the primary.
28 | * **Deprecated December 2020** ~~If your Lambda functions connect to Neptune using WebSockets, ensure they close their connections at the end of each invocation. Do not try to maintain a connection pool across invocations. While this adds some additional latency opening a connection per function invocation, it avoids your functions exceeding the Neptune WebSocket connection limit of 60,000 connections. If you use the [Java Gremlin client](http://tinkerpop.apache.org/docs/current/reference/#gremlin-java) to query Neptune, initialize a `Cluster` object in a static member variable, and then inside your handler method explicitly create a `Client` object, which your code then closes at the end of the method.~~
29 | * **Deprecated December 2020** ~~Because a connection pool will last only for the duration of a single Lambda invocation, and will often service only one request, consider reducing the size of the connection pool. Alternatively, if you are using Gremlin, consider submitting requests to the [Gremlin HTTP REST endpoint](https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-rest.html) rather than the WebSockets endpoint, thereby avoiding the need to create and manage the lifetime of a connection pool. The downside of this approach is that you must write string-based queries, rather than take advantage of the strongly-typed [Gremlin Language Variants](http://tinkerpop.apache.org/docs/current/reference/#gremlin-variants) (GLV) that allow you to write Gremlin directly in your programming language of choice.~~
30 |
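The following sketch illustrates the connection management practices above for a Python Lambda function using the Gremlin Python client. The endpoint, labels and retry policy are illustrative assumptions rather than a prescribed implementation.

```
import time

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NEPTUNE_WSS = 'wss://your-neptune-endpoint:8182/gremlin'  # placeholder endpoint

# One connection and one traversal source per execution context, reused across invocations.
conn = None
g = None

def connect():
    global conn, g
    conn = DriverRemoteConnection(NEPTUNE_WSS, 'g', pool_size=1)  # single connection
    g = traversal().withRemote(conn)

connect()

def handler(event, context):
    # Retry with backoff; a fuller implementation would distinguish dropped
    # connections from ConcurrentModificationException-style transient errors.
    for attempt in range(3):
        try:
            return g.V().has('Person', 'name', event['name']).valueMap().toList()
        except Exception:
            time.sleep(0.1 * (2 ** attempt))
            try:
                conn.close()
            except Exception:
                pass
            connect()
    raise RuntimeError('Query failed after retries')
```
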
31 | ### Learn More
32 |
33 | * [Using AWS Lambda functions in Amazon Neptune](https://docs.aws.amazon.com/neptune/latest/userguide/lambda-functions.html) contains examples of AWS Lambda functions written in Java, JavaScript and Python.
34 | * [Load balance graph queries using the Amazon Neptune Gremlin Client](https://aws.amazon.com/blogs/database/load-balance-graph-queries-using-the-amazon-neptune-gremlin-client/) contains a section on 'Using the Neptune Gremlin client in an AWS Lambda function'.
35 | * Find recommendations for using Amazon Neptune and maximizing performance in [Best Practices: Getting the Most Out of Neptune](https://docs.aws.amazon.com/neptune/latest/userguide/best-practices.html).
36 | * [Building Serverless Calorie tracker application with AWS AppSync and Amazon Neptune](https://github.com/aws-samples/aws-appsync-calorie-tracker-workshop/) workshop.
--------------------------------------------------------------------------------
/src/accessing-from-aws-lambda/lambda-neptune.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/accessing-from-aws-lambda/lambda-neptune.png
--------------------------------------------------------------------------------
/src/accessing-from-aws-lambda/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/accessing-from-aws-lambda/thumbnail.png
--------------------------------------------------------------------------------
/src/connecting-using-a-load-balancer/README.md:
--------------------------------------------------------------------------------
1 | # Connecting to Amazon Neptune from Clients Outside the Neptune VPC
2 |
3 | Amazon Neptune only allows connections from clients located in the same VPC as the Neptune cluster. If you want to connect from outside the Neptune VPC, you can use a load balancer. This architecture shows how you can use either a Network Load Balancer or an Application Load Balancer to connect to Neptune.
4 |
5 | * [Connecting to Amazon Neptune from clients outside the Neptune VPC using AWS Network Load Balancer](#connecting-to-amazon-neptune-from-clients-outside-the-neptune-vpc-using-aws-network-load-balancer)
6 | * [Connecting to Amazon Neptune from clients outside the Neptune VPC using AWS Application Load Balancer](#connecting-to-amazon-neptune-from-clients-outside-the-neptune-vpc-using-aws-application-load-balancer)
7 |
8 | ## Connecting to Amazon Neptune from clients outside the Neptune VPC using AWS Network Load Balancer
9 |
10 | You want to connect to your Neptune cluster from clients located outside the VPC in which you launched your Neptune cluster.
11 |
12 | Amazon Neptune only allows connections from clients located in the same VPC as the Neptune cluster. In this architecture, clients located outside the VPC connect to Neptune via a Network Load Balancer.
13 |
14 | 
15 |
16 | ### Walkthrough of the Architecture
17 |
18 | 1. In this architecture your Neptune cluster is run in at least two subnets in two Availability Zones, with each subnet in a different Availability Zone.
19 | 2. The Neptune DB subnet group spans at least two subnets in two Availability Zones.
20 | 3. Web connections from external clients terminate on a Network Load Balancer in a public subnet.
21 | 4. The load balancer forwards requests to the Neptune cluster endpoint (which then routes to the primary instance in the database cluster).
22 | 5. The target IP addresses of the cluster endpoint are refreshed on a periodic basis by a Lambda function.
23 | 6. This Lambda function is triggered by a CloudWatch event. When it fires, the function queries a DNS server for the IP addresses of the Neptune cluster endpoint. It registers new IP addresses with the load balancer’s target group, and deregisters any stale IP addresses. A minimal sketch of such a function follows this walkthrough.
24 |
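The sketch below shows the shape of that endpoint-refresh function. The target group ARN, environment variable names and port are assumptions, and real deployments should follow the cautious deregistration guidance in the best practices that follow.

```
import os
import socket

import boto3

elbv2 = boto3.client('elbv2')

# Assumed configuration, supplied here as Lambda environment variables.
TARGET_GROUP_ARN = os.environ['TARGET_GROUP_ARN']
CLUSTER_ENDPOINT = os.environ['NEPTUNE_CLUSTER_ENDPOINT']
NEPTUNE_PORT = int(os.environ.get('NEPTUNE_PORT', '8182'))

def handler(event, context):
    # Current IP(s) behind the Neptune cluster endpoint, according to DNS.
    current_ips = {info[4][0] for info in
                   socket.getaddrinfo(CLUSTER_ENDPOINT, NEPTUNE_PORT,
                                      proto=socket.IPPROTO_TCP)}

    # IPs currently registered with the NLB target group.
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    registered_ips = {t['Target']['Id'] for t in health['TargetHealthDescriptions']}

    to_register = current_ips - registered_ips
    to_deregister = registered_ips - current_ips

    if to_register:
        elbv2.register_targets(
            TargetGroupArn=TARGET_GROUP_ARN,
            Targets=[{'Id': ip, 'Port': NEPTUNE_PORT} for ip in to_register])
    if to_deregister:
        # A production version should deregister cautiously (for example, only
        # after several consecutive checks), as described in the best practices below.
        elbv2.deregister_targets(
            TargetGroupArn=TARGET_GROUP_ARN,
            Targets=[{'Id': ip, 'Port': NEPTUNE_PORT} for ip in to_deregister])

    return {'registered': sorted(to_register), 'deregistered': sorted(to_deregister)}
```
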
25 | ### Best Practices
26 |
27 | * Restrict access to your cluster to a range of IP addresses using the security groups attached to the Neptune instances.
28 | * We recommend further restricting access by enabling [IAM database authentication](https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth.html). IAM database authentication requires all HTTP requests be signed using AWS Signature Version 4, which requires changes to the client. See the Neptune documentation for more details. For IAM auth enabled databases, the client must sign the request using Neptune's DNS and include an HTTP Host header whose value is ``. Therefore, the client must know the Neptune cluster's DNS and port in addition to the load balancer's DNS and port. The Host header must reach the Neptune cluster intact. Because it is a Layer 4 load balancer, the Network Load Balancer won't change the HTTP headers: you won't need to configure anything to ensure the Host header arrives intact at the Neptune endpoint. A sketch of signing a request in this way appears after this list.
29 | * You can [connect to Neptune using SSL](https://docs.aws.amazon.com/neptune/latest/userguide/security-ssl.html) with this architecture. If you are using a load balancer you must use SSL termination and have your own SSL certificate on the proxy server. The Network Load Balancer, although a Layer 4 load balancer, now supports [TLS termination](https://aws.amazon.com/blogs/aws/new-tls-termination-for-network-load-balancers/).
30 | * We recommend that new cluster IPs are registered with the NLB as soon as they are identified. IPs that appear no longer to be associated with the Neptune cluster should be cautiously deregistered from the NLB – for example, only after three consecutive DNS checks indicate they are no longer associated with the cluster. You can maintain a candidate list of IP addresses to be deregistered using a file stored in an S3 bucket. See [this blog post](https://aws.amazon.com/blogs/networking-and-content-delivery/using-static-ip-addresses-for-application-load-balancers/) for details on implementing a stateful deregistration process.
31 | * You can increase the availability of the load balancer endpoint by enabling multiple Availability Zones for the load balancer.
32 | * This architecture enables external clients to access the Neptune cluster endpoint, which always points to the primary instance in the cluster. To enable access to the reader endpoint, which load balances connections across all the read replicas in the cluster, you will need to either create a second target group and a listener configured with a different port, or create a second load balancer with a different DNS name. In either case, you will need to maintain the load balancer’s target IP addresses for the reader endpoint using the Lambda mechanism described above.
33 |
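For the IAM authentication point above, the following is a hedged sketch of signing a string-based Gremlin request with Signature Version 4 using `botocore` while sending it to the load balancer. The DNS names, region and port are placeholders; the essential detail is that the request is signed against, and carries a Host header for, the Neptune cluster endpoint.

```
import json

import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.session import Session

# Placeholder names: the load balancer DNS the client actually connects to, and
# the Neptune cluster DNS/port used for signing and for the Host header.
NLB_URL = 'https://your-nlb-dns:8182/gremlin'
NEPTUNE_HOST = 'your-neptune-cluster-endpoint:8182'
REGION = 'us-east-1'

payload = json.dumps({'gremlin': 'g.V().limit(1).count()'})

# Sign the request for the 'neptune-db' service, keeping the Neptune Host header.
request = AWSRequest(method='POST', url=NLB_URL, data=payload,
                     headers={'Host': NEPTUNE_HOST,
                              'Content-Type': 'application/json'})
SigV4Auth(Session().get_credentials(), 'neptune-db', REGION).add_auth(request)

# Send the signed request to the load balancer; the Host header must reach the
# Neptune endpoint unchanged for the signature to validate.
response = requests.post(NLB_URL, data=payload, headers=dict(request.headers))
print(response.json())
```
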
34 | ## Connecting to Amazon Neptune from clients outside the Neptune VPC using AWS Application Load Balancer
35 |
36 | You want to connect to your Neptune cluster from clients located outside the VPC in which you launched your Neptune cluster.
37 |
38 | Amazon Neptune only allows connections from clients located in the same VPC as the Neptune cluster. In this architecture, clients located outside the VPC connect to Neptune via an Application Load Balancer.
39 |
40 |
41 | 
42 |
43 | ### Walkthrough of the Architecture
44 |
45 | 1. In this architecture your Neptune cluster is run in at least two subnets in two Availability Zones, with each subnet in a different Availability Zone.
46 | 2. The Neptune DB subnet group spans at least two subnets in two Availability Zones.
47 | 3. Web connections from external clients terminate on an Application Load Balancer in a public subnet.
48 | 4. The load balancer forwards requests to HAProxy running on an EC2 instance. This EC2 instance is registered in a target group belonging to the ALB.
49 | 5. HAProxy is configured with the Neptune cluster endpoint DNS and port. Requests from the ALB are forwarded to the primary instance in the database cluster.
50 |
51 |
52 | This architecture differs from the previous architecture in that it introduces two hops between the client and the Neptune instance, whereas the previous architecture introduced only one hop. The previous architecture used all AWS managed services; this architecture introduces a piece of third-party open source software (HAProxy).
53 |
54 | ### Best Practices
55 |
56 | * Restrict access to your cluster to a range of IP addresses using the security groups attached to the Application Load Balancer.
57 | * We recommend further restricting access by enabling [IAM database authentication](https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth.html). IAM database authentication requires all HTTP requests be signed using AWS Signature Version 4, which requires changes to the client. See the Neptune documentation for more details. For IAM auth enabled databases, the client must sign the request using Neptune's DNS and include an HTTP Host header whose value is ``. Therefore, the client must know the Neptune cluster's DNS and port in addition to the load balancer's DNS and port. The Host header must reach the Neptune cluster intact. In this architecture, the ALB will convert the DNS name in the Host header to an IP address. Therefore, you must configure HAProxy to replace the original Host header with one containing the Neptune cluster DNS and port.
58 | * You can [connect to Neptune via SSL](https://docs.aws.amazon.com/neptune/latest/userguide/security-ssl.html) with this architecture. Use SSL termination and have your own SSL certificate on the proxy server.
59 | * You can increase the availability of the publicly available endpoint by enabling multiple Availability Zones for the ALB, adding multiple HAProxy instances in each AZ, and load balancing across the HAProxy instances by including all instances in each ALB’s target group.
60 | * This architecture enables external clients to access the Neptune cluster endpoint, which always points to the primary instance in the cluster. To enable access to the reader endpoint, which load balances connections across all the read replicas in the cluster, you can configure path-based routing either in the ALB, which will allow you to route to different HAProxy instances, or in HAProxy itself. With path-based routing, the client would add a path suffix – such as `/reader` or `/writer` – to the request URI. For example, to submit a Gremlin query to the Neptune read replicas, the client would use `http://:80/gremlin/reader`. You will need to rewrite the path to remove this suffix in the ALB or HAProxy before passing the request to the appropriate Neptune endpoint.
61 |
--------------------------------------------------------------------------------
/src/connecting-using-a-load-balancer/application-load-balancer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/connecting-using-a-load-balancer/application-load-balancer.png
--------------------------------------------------------------------------------
/src/connecting-using-a-load-balancer/network-load-balancer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/connecting-using-a-load-balancer/network-load-balancer.png
--------------------------------------------------------------------------------
/src/connecting-using-a-load-balancer/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/connecting-using-a-load-balancer/thumbnail.png
--------------------------------------------------------------------------------
/src/converting-to-graph/README.md:
--------------------------------------------------------------------------------
1 | # Converting Other Data Models to a Graph Model
2 |
3 | As described in [Overview of the Design Process](../graph-data-modelling#overview-of-the-design-process), when building a graph database application it is best to design an application graph data model and graph queries by working backwards from a set of application use cases, and using the model as a target for any subsequent data ingest from other systems.
4 |
5 | However, you may sometimes need to take data from another data technology and ingest it into a graph database prior to undertaking any explicit application-specific graph data modelling. In these circumstances you can apply a number of 'mechanical' transformations that yield a naive graph model. This model will not necessarily be optimised for specific use cases and queries, but it can provide the basis for exploration and the iterative development of an application graph data model.
6 |
7 | * [Converting a Relational Data Model to a Graph Model](#converting-a-relational-data-model-to-a-graph-model)
8 | * [Converting a Document-Oriented Data Model to a Graph Model](#converting-a-document-oriented-data-model-to-a-graph-model)
9 | * [Converting a Key-Value Data Model to a Graph Model](#converting-a-key-value-data-model-to-a-graph-model)
10 |
11 | ## Converting a Relational Data Model to a Graph Model
12 |
13 | ### Tables
14 |
15 | As a rule of thumb, each row in a table can be converted to a vertex in a property graph, or a set of statements with a common subject in an RDF graph.
16 |
17 | ![Relational table](relational-table.png)
18 |
19 | When converting to a property graph:
20 |
21 | * Convert the column names to property keys.
22 | * Use the table name as the label for each vertex.
23 | * Concatenate primary key values to generate each vertex ID.
24 |
25 | ```
26 | g.addV('Person').property(id, '12').property('f_name', 'Alice')
27 | ```
28 |
29 | When converting to an RDF graph:
30 |
31 | * Column names and values become predicates and object literals.
32 | * The table name becomes the object – a class resource – of an rdf:type predicate.
33 | * Concatenate primary key values to generate each resource ID.
34 |
35 | ```
36 | PREFIX s: <http://example.org/>
37 | PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
38 |
39 | INSERT
40 | {
41 | s:12 rdf:type s:Person ;
42 | s:firstName "Alice" .
43 | }
44 | WHERE {}
45 | ```
46 |
47 | ### Foreign Keys
48 |
49 | Foreign key relations can be modelled as edges.
50 |
51 | ![Foreign key](relational-fk1.png)
52 |
53 | When converting to a property graph:
54 |
55 | * Choose an edge direction and label that best express the domain semantics of the relationship.
56 | * Concatenate primary and foreign key values to generate the edge ID.
57 |
58 | In Gremlin, assuming the Person and Address vertices already exist:
59 |
60 | ```
61 | g.V('12').addE('ADDRESS').to(V('512')).property(id, '512-12')
62 | ```
63 |
64 | When converting to an RDF Graph:
65 |
66 | * Represent the relationship using a triple whose subject and object are URIs identifying the resources to be connected.
67 | * Choose a direction and predicate value that best express the domain semantics of the relationship.
68 |
69 | ```
70 | PREFIX s: <http://example.org/>
71 | PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
72 |
73 | INSERT
74 | {
75 | s:12 s:address s:512 .
76 | }
77 | WHERE {}
78 | ```
79 |
80 | ### Join Tables With Two Foreign Keys
81 |
82 | Each row in a join table with two foreign keys can be converted to an edge with edge properties.
83 |
84 | ![Join table](relational-join-table.png)
85 |
86 | When converting to a property graph:
87 |
88 | * Use the join table name for the edge label.
89 | * Concatenate foreign keys to create the edge ID.
90 | * Convert the remaining columns to edge properties.
91 |
92 | In Gremlin, assuming the Person and Company vertices already exist:
93 |
94 | ```
95 | g.V('12').addE('EMPLOYMENT').to(V('512')).
96 | property(id, '12-512').
97 | property('from', 2012).
98 | property('to', 2015)
99 | ```
100 |
101 | When converting to an RDF graph you will have to introduce an intermediate node:
102 |
103 | * Use the join table name to type the node.
104 | * Concatenate foreign keys to create the subject URI.
105 |
106 | ```
107 | PREFIX j: <http://example.org/>
108 | PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
109 |
110 | INSERT
111 | {
112 | j:512-12 rdf:type j:Employment ;
113 | j:from 2012 ;
114 | j:to 2015 ;
115 | j:company j:512 ;
116 | j:person j:12 .
117 | }
118 | WHERE {}
119 | ```
120 |
121 | ### Tables With More Than Two Foreign Keys
122 |
123 | If a table contains more than two foreign keys, convert each row to an intermediate node, and convert the foreign keys into edges connecting the intermediate node to other nodes.
124 |
125 | ![Table with multiple foreign keys](relational-multi-join.png)
126 |
127 | In Gremlin, assuming the User, Location and Product vertices already exist:
128 |
129 | ```
130 | g.addV('Purchase').property(id, '43').property('date', '14-12-2018').
131 | V('43').addE('USER').to(V('678')).property(id, '43-678').
132 | V('43').addE('LOCATION').to(V('144')).property(id, '43-144').
133 | V('43').addE('PRODUCT').to(V('94')).property(id, '43-94')
134 | ```
135 |
136 | In SPARQL:
137 |
138 | ```
139 | PREFIX o: <http://example.org/>
140 | PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
141 |
142 | INSERT
143 | {
144 | o:43 rdf:type o:Purchase ;
145 | o:date "14-12-2018" ;
146 | o:user o:678 ;
147 | o:location o:144 ;
148 | o:product o:94 .
149 | }
150 | WHERE {}
151 | ```
152 |
153 | ## Converting a Document-Oriented Data Model to a Graph Model
154 |
155 | Document-oriented databases store semi- or variably-structured documents, usually encoded as JSON or XML.
156 |
157 | ### Nested Structures
158 |
159 | Documents often comprise a nested structure containing all the information necessary to satisfy a business operation. Such self-contained islands of information are sometimes called [aggregates](https://martinfowler.com/bliki/DDD_Aggregate.html). An order document, for example, may contain multiple line items together with the billing and delivery addresses necessary to satisfy payment and fulfilment processes.
160 |
161 | Document-oriented data models favour redundancy over explicit joins. Continuing the order example, if a customer places several orders, each order document will contain its own billing and delivery addresses so that it can be processed as a single unit without having to execute joins in the application layer.
162 |
163 | To convert a nested structure to a graph model, extract the nested items and create a node for each unique nested item (thereby removing any data redundancy). Create a node for each parent document and connect this node to the node representing the nested item. Derive an edge label (and, if necessary a qualifying property) from the property key used to identify the nested item in the parent document.
164 |
165 | ![Document to graph](document-2-graph.png)
166 |
167 | In Gremlin, create the Order and Address vertices like this:
168 |
169 | ```
170 | g.addV('Order').property(id, 'order-1').
171 | addV('Order').property(id, 'order-2').
172 | addV('Order').property(id, 'order-3').
173 | addV('Address').property(id, 'address-1').
174 | V('order-1').addE('ADDRESS').to(V('address-1')).
175 | property(id, 'order-1-address-1').property('type', 'delivery').
176 | V('order-2').addE('ADDRESS').to(V('address-1')).
177 | property(id, 'order-2-address-1').property('type', 'payment').
178 | V('order-3').addE('ADDRESS').to(V('address-1')).
179 | property(id, 'order-3-address-1').property('type', 'delivery')
180 | ```
181 |
182 | In SPARQL we use intermediate nodes to hold the address type property:
183 |
184 | ```
185 | PREFIX o: <http://example.org/>
186 | PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
187 |
188 | INSERT
189 | {
190 | o:order-1 rdf:type o:Order .
191 | o:order-2 rdf:type o:Order .
192 | o:order-3 rdf:type o:Order .
193 | o:address-1 rdf:type o:Address .
194 | o:order-1-address-1 o:type "delivery" ;
195 | o:order o:order-1 ;
196 | o:address o:address-1 .
197 | o:order-2-address-1 o:type "payment" ;
198 | o:order o:order-2 ;
199 | o:address o:address-1 .
200 | o:order-3-address-1 o:type "delivery" ;
201 | o:order o:order-3 ;
202 | o:address o:address-1 .
203 | }
204 | WHERE {}
205 | ```
206 |
207 | ### Document Joins
208 |
209 | Some document-oriented application models join documents across collections by including the IDs of external documents in a field in the document to which they are to be joined. Treat these scenarios as you would foreign keys in a relational schema.
210 |
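For example, if an order document stores the ID of the customer document to which it belongs, you can convert that reference to an edge in the same way as a relational foreign key. A Gremlin sketch, assuming the order and customer vertices already exist (the IDs and edge label are illustrative):

```
g.V('order-1').addE('CUSTOMER').to(V('customer-5')).property(id, 'order-1-customer-5')
```
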
211 | ## Converting a Key-Value Data Model to a Graph Model
212 |
213 | A key-value database allows you to model and store data as records comprising key-value pairs. The key can be a simple literal, or, in some cases, a composite of several attributes. A record's value comprises one or more fields. Key-value stores are schemaless, meaning that no two records need share the exact same set of fields.
214 |
215 | Much like document-oriented databases, key-value workloads are typically aggregate-oriented:
216 |
217 | * Each record contains all the information necessary to satisfy a business operation, without the application having to join the data in one record to the data in another record.
218 | * Applications retrieve discrete records or collections of records using their keys or predicates applied to their keys. Some key-value databases use indexes to facilitate applying predicates and filters at query time to the fields inside each record's value.
219 |
220 | ### Implicit Structure
221 |
222 | Whilst ostensibly conforming to a very simple data model, many key-value datasets contain implicit structure and connectedness:
223 |
224 | * Both keys and values can be overloaded with structure: a key may comprise a hierarchical prefix, a value a delimited set of tags, for example. Applications that understand a dataset's record semantics can parse keys and values to infer additional structural information.
225 | * Redundancy across records is common. Field values or families of field values that reoccur in multiple records may refer to a single instance of an entity in the application domain.
226 | * Individual field values may comprise nested structures – JSON documents, for example.
227 | * Some field values may act as foreign keys that refer to the IDs of other records, or even other data sources.
228 |
229 | Given these features of key-value data models, there's often a lot of graph-like structure that can be teased out of a key-value dataset. You'll need to review the fields and field datatypes for the records in your dataset, and the application semantics applied to keys and values, in order to determine what kind of connected structure is implicit in the dataset.
230 |
231 | ### Steps to Convert Records to Vertices and Edges in a Graph
232 |
233 | 1. Convert each record to a vertex in the graph model. Use the collection or table name in the key-value store to type or label the vertex in the graph.
234 | 2. Parse out any application structure present in the key, and create additional vertices and edges to represent this structure. Use any remaining key data to generate the vertex ID.
235 | 3. Parse out any application structure present in individual fields, and create additional vertices and edges to represent this structure.
236 | 4. Identify nested field values and treat them as documents, applying the same modelling techniques you use to [convert a document-oriented data model to a graph model](#converting-a-document-oriented-data-model-to-a-graph-model).
237 | 5. Identify fields that act as foreign keys, and add edges to join the vertices at either end of the relation. Use the field name to label the edge.
238 | 6. Identify frequently reoccurring fields, families of fields, and field values across records, and consider creating new vertices to represent the entities implied by these fields.
239 | 7. Map remaining record fields to vertex properties.
240 |
241 | ### Example
242 |
243 | In the following example the `city:dept` field is overloaded with hierarchical information representing geographic and organizational structure. When converting to a graph model, we create additional vertices to represent states, cities and departments, and connect these vertices with edges in line with the connectedness implicit in the `city:dept` family of values.
244 |
245 | ![Key-value to graph](key-value-2-graph.png)
246 |
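The following Gremlin sketch shows one way to express this conversion for a single, hypothetical record – an employee whose `city:dept` value encodes a state, a city and a department. The IDs, labels and property values are illustrative:

```
g.addV('Employee').property(id, 'emp-1').property('name', 'Alice').
  addV('State').property(id, 'state-wa').property('name', 'WA').
  addV('City').property(id, 'city-seattle').property('name', 'Seattle').
  addV('Department').property(id, 'dept-sales').property('name', 'Sales').
  V('city-seattle').addE('STATE').to(V('state-wa')).property(id, 'city-seattle-state-wa').
  V('emp-1').addE('CITY').to(V('city-seattle')).property(id, 'emp-1-city-seattle').
  V('emp-1').addE('DEPARTMENT').to(V('dept-sales')).property(id, 'emp-1-dept-sales')
```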
247 |
248 |
249 |
250 |
251 |
252 |
--------------------------------------------------------------------------------
/src/converting-to-graph/document-2-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/document-2-graph.png
--------------------------------------------------------------------------------
/src/converting-to-graph/key-value-2-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/key-value-2-graph.png
--------------------------------------------------------------------------------
/src/converting-to-graph/relational-fk1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/relational-fk1.png
--------------------------------------------------------------------------------
/src/converting-to-graph/relational-join-table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/relational-join-table.png
--------------------------------------------------------------------------------
/src/converting-to-graph/relational-multi-join.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/relational-multi-join.png
--------------------------------------------------------------------------------
/src/converting-to-graph/relational-table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/relational-table.png
--------------------------------------------------------------------------------
/src/converting-to-graph/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/converting-to-graph/thumbnail.png
--------------------------------------------------------------------------------
/src/data-models-and-query-languages/README.md:
--------------------------------------------------------------------------------
1 | # Data Models and Query Languages
2 |
3 | A graph data model connects items or values using elements variously called edges, links or relationships. Many application domains can be modelled as graphs: social, follower and business relationship networks, IT and physical network infrastructures, organizational structures, entitlements and access control networks, logistics and delivery networks, supply chains, etc.
4 |
5 | Neptune supports two different graph data models: the [property graph](https://en.wikipedia.org/wiki/Graph_database#Labeled-Property_Graph) data model, and the [Resource Description Framework](https://www.w3.org/RDF/). Each data model has its own query language for creating and querying graph data. For a property graph, you create and query data using [Apache TinkerPop Gremlin](http://tinkerpop.apache.org/docs/current/reference/), an open source query language supported by several other graph databases. For an RDF graph you create and query data using [SPARQL](https://www.w3.org/TR/rdf-sparql-query/), a graph pattern matching language standardized by the W3C.
6 |
7 | ## Property Graph and Gremlin
8 |
9 | ### Vertices and Edges
10 |
11 | The property graph data model represents graph data as _vertices_ and _edges_ (sometimes called nodes and relationships). You typically use vertices to represent entities in your domain, edges to represent the relationships between these entities. Every edge must have a name, or label, and a direction – that is, a start vertex and an end vertex. Neptune's property graph model doesn't allow dangling edges.
12 |
13 | ### Properties
14 |
15 | You can attach one or more _properties_ to each of the vertices and edges in your graph. Typically, you use vertex properties to represent the attributes of entities in your domain, and edge properties to represent the strength, weight or quality of a relationship. You can also use properties to represent metadata – timestamps, access control lists, etc.
16 |
17 | ### IDs
18 |
19 | Every vertex and every edge in the graph must have a unique ID. Because every edge has its own identity, you can create multiple edges connecting the same pair of vertices.
20 |
21 | > Some graph databases allow you to assign your own IDs to vertices and edges. Others automatically create IDs for you. Neptune allows you to supply your own IDs when creating vertices and edges: if you don't assign your own ID to an element, Neptune will create a string-based UUID for you. All vertex IDs must be unique, and all edge IDs must be unique. However, Neptune does allow a vertex and an edge to have the same ID.
22 |
23 | ### Labels
24 |
25 | As well as adding properties to the elements in your graph, you can also attach labels to both the vertices and edges. Edge labels are mandatory: you must attach exactly one label to each edge in your graph. An edge's label expresses the semantics of the relationship represented by the edge. Vertex labels are optional: you can attach zero, one or many labels to each vertex in your graph. Vertex labels allow you to tag, type and group vertices.
26 |
27 | ### Example
28 |
29 | In the following diagram we see three vertices. Each vertex is labelled `User`, and has an `id`, and `firstName` and `lastName` properties. The vertices are connected by edges labelled `FOLLOWS`.
30 |
31 | ![Property graph](property-graph.png)
32 |
33 | To query a property graph in Neptune you use the Gremlin query language. The following Gremlin query finds the names of the users whom Bob follows:
34 |
35 | ```
36 | g.V('p-1').out('FOLLOWS').valueMap('firstName', 'lastName')
37 | ```
38 |
39 | ### Learn More
40 |
41 | * [Apache TinkerPop Documentation](http://tinkerpop.apache.org/docs/current/reference/)
42 | * [PRACTICAL GREMLIN: An Apache TinkerPop Tutorial](http://kelvinlawrence.net/book/Gremlin-Graph-Guide.html)
43 |
44 | ## RDF Graph and SPARQL
45 |
46 | RDF encodes resource descriptions in the form of subject-predicate-object triples. In contrast to the property graph model, which 'chunks' data into record-like vertices and edges with attached properties, RDF creates a more fine-grained representation of your domain.
47 |
48 | The following diagram shows the same information as the property graph above, but this time encoded as RDF.
49 |
50 | ![RDF graph](rdf.png)
51 |
52 | Subjects and predicates in RDF are always URIs. Object values can be either URIs or literals. In the example shown above, the triple `contacts:p-2 contacts:firstName "Alice"` comprises a URI subject and predicate, and a string literal object. Relationships between resources use URI-based object values.
53 |
54 | To query an RDF graph you use SPARQL. The following SPARQL query finds the names of the users whom Bob follows:
55 |
56 | ```
57 | PREFIX s: <http://example.org/>
58 |
59 | SELECT ?firstName ?lastName WHERE {
60 | s:p-1 s:follows ?p .
61 | ?p s:firstName ?firstName .
62 | ?p s:lastName ?lastName
63 | }
64 | ```
65 |
66 | ## Choosing a Data Model and Query Language for Your Workload
67 |
68 | Both graph data models and query languages – property graph and Gremlin, RDF and SPARQL – can be used to implement the majority of graph database workloads. Application developers and those coming from a relational database background often find the property graph model easier to work with, whereas those familiar with Semantic Web technologies may prefer RDF, but there are no hard-and-fast rules.
69 |
70 | In choosing a model and query language, bear in mind the following points:
71 |
72 | * The property graph data model has no schema and no predefined vocabularies for property names and labels. You must create your own application-specific data model and enforce constraints around the naming of labels and properties in your application layer. RDF, on the other hand, has predefined [schema vocabularies](https://www.w3.org/TR/rdf-schema/) with well-understood data modelling semantics for specifying class and property schema elements, and predefined domain-specific vocabularies such as [vCard](https://www.w3.org/TR/vcard-rdf/), [FOAF](http://xmlns.com/foaf/spec/), [Dublin Core](http://dublincore.org/) and [SKOS](https://www.w3.org/2004/02/skos/) for describing resources in different domains: contact information, social network relations, document metadata and knowledge networks, for example.
73 | * RDF was designed to make it easy to share and publish data with fixed, well-understood semantics. There exist today many linked and open datasets – for example [DBpedia](https://en.wikipedia.org/wiki/DBpedia) and [GeoNames](https://en.wikipedia.org/wiki/GeoNames) – that you can incorporate into your own application. Insofar as your own data reuses vocabularies shared with any third-party dataset you ingest, data integration occurs as a side-effect of the linking across datasets facilitated by these shared vocabularies.
74 | * Property graphs support edge properties, making it easy to associate edge attributes, such as the strength, weight or quality of a relationship or some edge metadata, with the edge definition. To qualify an edge in RDF with additional data, you must use intermediate nodes or blank nodes – nameless nodes with no permanent identity – to group the edge values. Intermediate nodes can complicate an RDF model and the associated queries. If your workload requires applying computations over edge attributes in the course of a graph traversal, consider using a property graph and Gremlin.
75 | * RDF supports the concept of named graphs, allowing you to group a set of RDF statements and identify them with a URI. Named graphs allow you to distinguish logical or domain-specific subgraphs within your dataset, and attach additional statements that apply to these subgraphs as a whole. The property graph data model allows you to create multiple disconnected subgraphs within the same dataset, but has no equivalent to named graphs that allows you to identify and address individual subgraphs. If you need to differentiate between and manage multiple subgraphs in your dataset – for example, on behalf of multiple tenants – consider using RDF.
76 | * Gremlin, being an imperative traversal language, allows for an algorithmic approach to developing graph queries. It supports iterative looping constructs (e.g. `repeat().until()`) and path extraction, making it easy to traverse variable-length paths and extract and apply computations over property values in the paths traversed (see the example traversal following this list). SPARQL makes it easy to find instances of known graph patterns, even those with optional elements, and to extract values from the pattern instances that have been found.
77 | * While every edge in a property graph or RDF graph must be directed, Gremlin allows you to ignore edge direction in your queries (using `both()` and `bothE()` steps). If you need to model bi-directional relationships, consider using a property graph with Gremlin. Alternatively, if you use RDF, you will have to introduce pairs of relationships between resources.
78 |
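For example, the following Gremlin traversal, using the social graph shown earlier, follows `FOLLOWS` edges up to three hops from a starting vertex and returns the first names along each path (the starting vertex ID and hop limit are illustrative):

```
g.V('p-1').
  repeat(out('FOLLOWS').simplePath()).
  emit().
  times(3).
  path().by('firstName')
```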
79 |
80 |
81 |
--------------------------------------------------------------------------------
/src/data-models-and-query-languages/property-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/data-models-and-query-languages/property-graph.png
--------------------------------------------------------------------------------
/src/data-models-and-query-languages/rdf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/data-models-and-query-languages/rdf.png
--------------------------------------------------------------------------------
/src/data-models-and-query-languages/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/data-models-and-query-languages/thumbnail.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/README.md:
--------------------------------------------------------------------------------
1 | # Graph Data Modelling
2 |
3 | - [Overview of the Design Process](#overview-of-the-design-process)
4 | - [Property Graph Data Modelling](#property-graph-data-modelling)
5 | - [RDF Data Modelling](#rdf-data-modelling)
6 |
7 | When you build a graph database application you will have to design and implement an application graph data model, together with graph queries that address that model. The application graph data model should express the application domain; the queries should answer the questions you would have to pose to that domain in order to satisfy your application use cases.
8 |
9 | You build the application graph data model from graph primitives – vertices, edges, labels and properties in the case of a property graph, subject-predicate-object triples for RDF. You name these primitives to express the semantics of the application domain. You structure them to facilitate the path traversals or graph patterns contained in your queries.
10 |
11 | Think of your application graph model and your queries as being two sides of the same coin. The graph is the superset of paths or graph patterns expressed in your queries. In order to achieve this degree of alignment between model and queries, employ a 'design for queryability' approach, whereby you drive out the model and the queries on a use-case-by-use-case basis.
12 |
13 | ## Overview of the Design Process
14 |
15 | ![Data modelling process](data-modelling-process.png)
16 |
17 | 1. Work backwards from your application or end-user goals. These goals are typically expressed as a backlog of feature requests, use cases or agile user stories. For each use case write down the questions you would have to put to the domain in order to facilitate the outcomes that motivate the use case. What would you need to know, find, or compute?
18 | 2. Review these questions and identify candidate entities, attributes and relationships. These become the basis of your graph model, implemented using the primitives particular to either the property graph or RDF.
19 | 3. Review these questions and your prototype application graph model to determine how you would answer each question by traversing paths through the graph or matching structural patterns in the graph. Adjust the model until you are satisfied it facilitates querying in an efficient and expressive manner.
20 | 4. Continue to iterate over your use cases, refining your candidate model and queries as you introduce new features.
21 | 5. Once you have a candidate application graph data model you can treat this as a target for any necessary data migration and integration scenarios. Identify existing sources of data and implement extract, transform and load (ETL) processes that ingest data into the target model.
22 | 6. If you are building an application or service on top of a graph database, design and implement write operations that insert, modify and if necessary delete data in the target application graph model.
23 |
24 | ### Best Practices
25 |
26 | * Develop your model and queries in a test-driven fashion. Create test fixtures that install a sample dataset in a known state, and write unit tests for your queries (or the parts of your application that encapsulate queries) that assert query results based on a fixed set of inputs.
27 | * As you evolve your model and add new queries, rerun your tests to identify broken queries that need revising in line with the updated model.
28 |
29 | ### Learn More
30 |
31 | * For a worked example of deriving a property graph model from a set of use cases in a test-driven fashion, see the [Property Graph Data Modelling](https://github.com/aws-samples/amazon-neptune-samples/tree/master/gremlin/property-graph-data-modelling) sample
32 | * For guidance on converting a relational, key-value or document data model to a property graph model, see [Converting Other Data Models to a Graph Model](../converting-to-graph)
33 |
34 | # Property Graph Data Modelling
35 |
36 | - [Graph Data Modelling](#graph-data-modelling)
37 | - [Overview of the Design Process](#overview-of-the-design-process)
38 | - [Best Practices](#best-practices)
39 | - [Learn More](#learn-more)
40 | - [Property Graph Data Modelling](#property-graph-data-modelling)
41 | - [Learn More](#learn-more-1)
42 | - [Building an Application Graph Data Model](#building-an-application-graph-data-model)
43 | - [Vertices](#vertices)
44 | - [Vertex IDs](#vertex-ids)
45 | - [Vertex labels](#vertex-labels)
46 | - [Vertex properties](#vertex-properties)
47 | - [When should I model an attribute as a property versus a label?](#when-should-i-model-an-attribute-as-a-property-versus-a-label)
48 | - [When should I model an attribute as a property and when should I pull it out into its own vertex?](#when-should-i-model-an-attribute-as-a-property-and-when-should-i-pull-it-out-into-its-own-vertex)
49 | - [Complex value types](#complex-value-types)
50 | - [Value structures](#value-structures)
51 | - [Relating entities through their attributes at query time](#relating-entities-through-their-attributes-at-query-time)
52 | - [Edges](#edges)
53 | - [Edge IDs](#edge-ids)
54 | - [Edge labels](#edge-labels)
55 | - [Bi-directional relationships](#bi-directional-relationships)
56 | - [Uni-directional relationships](#uni-directional-relationships)
57 | - [Multiple relationships between vertices](#multiple-relationships-between-vertices)
58 | - [Edge properties](#edge-properties)
59 | - [The Hub-and-Spoke Pattern](#the-hub-and-spoke-pattern)
60 | - [Hub-and-spoke example](#hub-and-spoke-example)
61 | - [When to use hub-and-spoke](#when-to-use-hub-and-spoke)
62 |
63 | ### Learn More
64 |
65 | * For a worked example of deriving a property graph model from a set of use cases, see the [Property Graph Data Modelling](https://github.com/aws-samples/amazon-neptune-samples/tree/master/gremlin/property-graph-data-modelling) sample
66 | * For guidance on converting a relational, key-value or document data model to a property graph model, see [Converting Other Data Models to a Graph Model](../converting-to-graph)
67 |
68 |
69 | ## Building an Application Graph Data Model
70 |
71 | An application-specific property graph data model describes how your graph data is structured both to express your domain and to make it easy and efficient to query for your most important use cases:
72 |
73 | - What types of vertices do you have in your graph, as represented by vertex labels?
74 | - What properties are attached to each type of vertex?
75 | - How are different vertices connected?
76 | - What edge labels do you use to represent different types of edges?
77 | - What properties are attached to these edges?
78 |
79 | By answering these questions you describe an application property graph data model that is specialized for your specific application or set of use cases.
80 |
81 | In the relational world we'd express an application-specific relational model using schema and constraints. In the property graph world, however, there are very few commonly-adopted formal constructs for doing the same. Some graph databases advertise themselves as being schema-free, others allow for optional schema or constraints to be layered on top of the data, while a few require an upfront schema to be defined using a product-specific schema language.
82 |
83 | > Neptune is a schema-free graph database. No two vertices, even those with the same labels, need share the exact same set of properties. No two values of a particular property need use the same datatype. No two pairs of vertices need be connected in the exact same way.
84 | > The only constraints that Neptune asserts are:
85 | > - Every edge must have a start vertex and an end vertex. These can be the same vertex: that is, Neptune allows self edges.
86 | > - All vertex IDs must be unique, and all edge IDs must be unique. However, Neptune does allow a vertex and an edge to have the same ID.
87 |
88 | You can use traditional data and application modelling techniques, including entity relationship diagrams (ERD) and the Unified Modelling Language (UML) to model your graph, but many graph application designs begin by illustrating a small, representative example of the graph, with specific vertices, labels, properties and edges showing how _instances_ of things in the application domain are attributed and connected to one another. These _specifications by example_ can then be easily turned into representative datasets against which you can develop and test your queries.
89 |
90 | ## Vertices
91 |
92 | Use vertices to represent instances of a thing (an entity, concept, event, etc). You can think of a vertex as being equivalent to a row in a relational table.
93 |
94 | ### Vertex IDs
95 |
96 | Some graph databases automatically assign IDs to vertices when they are created, others allow you to supply your own. If the database allows it, consider supplying your own IDs when creating vertices. These could be a stable domain attribute that uniquely identifies an entity – an employee number, for example, or a product SKU – or an ID derived from an original source for the data, such as a primary key in a relational table.
97 |
98 | > Neptune allows you to supply your own IDs when you create a vertex. If you don't supply an ID, Neptune will create a string-based UUID for you.
99 |
100 | ### Vertex labels
101 |
102 | Use a vertex label to indicate the entity type or the role that the vertex plays in your dataset. _People, users, customers, products, jobs, policies_ – in the singular, _person, user, customer, product, job_ and _policy_: all good candidate vertex labels.
103 |
104 | Try to limit each vertex to having just one label. Entities can sometimes play multiple roles in your dataset: if that's the case, it's fine to attach multiple labels to a vertex. But avoid using labels as flags or enumerated tags that group entities of a particular type. Better to use a property to perform this partitioning. For example, if you wanted to version vertices in your graph, it would be best to do this by attaching a `version` property containing a numeric property value to each vertex, rather than labelling each vertex `v1`, `v2`, `v3`, etc.
105 |
106 | ### Vertex properties
107 |
108 | Use vertex properties to represent the attributes of an entity: _first name, last name, invoice number, height, width, colour, ISBN, price_, etc. One of the modelling benefits of the property graph is that it allows you to 'chunk up' entity attributes into a discrete, easily understood record-like structure: that is, to represent a 'thing' as a labelled vertex with multiple properties.
109 |
110 | As well as using vertex properties to model entity attributes, you can also use them to store vertex metadata such as a version number, last updated timestamp, or access control list.
111 |
112 | > Gremlin supports single, set and list cardinality for vertex properties. Neptune, however, supports only single and set cardinality, not list, with set cardinality the default. Set cardinality allows you to model multi-value vertex properties containing unique values: an `emailAddress` property, for example, could contain a set such as `[john.smith@example.com,j.smith@example.org,johnsmith@example.net]`.
113 | >
114 | > If you need to model a list with Neptune – either to maintain list order or to store duplicate values – you have a couple of choices:
115 | > - If there's no requirement to filter the list's contents at query time, you can store the list as a delimited string representation. To modify the contents of the list, you'll need to implement some logic in your application to retrieve the current representation of the list, parse it into a list type to which you can apply any necessary modifications, and then update the property with a string representation of the new list value.
116 | > - If you need to filter on the list's contents during a traversal, you'll have to pull the list values out as properties on separate vertices. You can connect these list value vertices either directly to the 'parent' vertex, or to a 'list' vertex that is attached to the parent. You can then use additional properties, either on the edges or on the value vertices themselves, to store metadata, such as item order. While this solution allows for filters or predicates to be applied to list values during a traversal, it introduces more complexity into both the data model and the queries that apply these filters. It may also increase both query latencies and storage costs (every value has the storage overhead of its being a vertex).
117 |
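As a sketch of the set-cardinality behaviour described in the note above, the following Gremlin adds two values to the same `emailAddress` property of an existing vertex and then reads them back (the vertex ID and addresses are illustrative):

```
g.V('p-1').
  property(set, 'emailAddress', 'john.smith@example.com').
  property(set, 'emailAddress', 'j.smith@example.org').
  valueMap('emailAddress')
```
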
118 | ### When should I model an attribute as a property versus a label?
119 |
120 | Use a label to type a vertex or to describe the role a vertex plays in your dataset. Use properties to capture the instance-based attributes of a thing in your domain. Labels help answer the question: What does this vertex represent? Properties help answer the question: What are the attributes of this particular thing?
121 |
122 | ### When should I model an attribute as a property and when should I pull it out into its own vertex?
123 |
124 | Model an attribute as its own vertex when:
125 |
126 | - the attribute value is a complex value type _and/or_
127 | - the attribute value is part of a value structure, such as a hierarchy _and/or_
128 | - the attribute value will be used to relate entities at query time.
129 |
130 | #### Complex value types
131 |
132 | Complex value types – attribute values that contain more than one field – are best represented as their own vertices. _Address_ is a good example: model address as a separate vertex with `line1`, `line2`, `city`, and `zipcode` properties.
133 |
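A minimal Gremlin sketch of this approach, assuming an existing `Person` vertex with ID `p-1` (the other IDs and property values are illustrative):

```
g.addV('Address').property(id, 'addr-1').
    property('line1', '1 Main Street').
    property('city', 'Springfield').
    property('zipcode', '11111').
  V('p-1').addE('ADDRESS').to(V('addr-1')).property(id, 'p-1-addr-1')
```
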
134 | With some applications, you may want to attach metadata such as a timestamp or access control list to a specific attribute (rather than the vertex representing the entity to which the attribute belongs). If your graph database and its data model and query language support [metaproperties](https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#metaprop) then you can take advantage of these features to implement your use case.
135 |
136 | > Neptune doesn't support metaproperties. To attach attribute-specific metadata, you will have to model the attribute as its own vertex, and add additional properties to this vertex (and/or to the edge connecting this attribute vertex to the entity with which it is associated) to represent your metadata.
137 |
138 | #### Value structures
139 |
140 | Some value types are part of a set that has its own internal structure. As an example, take the classification hierarchy in an online product catalogue. _Treasure Island_ might be classified as both 'Classic Children's Literature' and as 'Action and Adventure', both of which are subcategories of 'Fiction' in the 'Books' part of the catalogue. Such hierarchical or multi-hierarchical structures are best represented using the subgraph structures enabled by pulling each classification value out into its own vertex.
141 |
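A Gremlin sketch of this subgraph structure, with illustrative IDs and edge labels:

```
g.addV('Book').property(id, 'book-1').property('title', 'Treasure Island').
  addV('Category').property(id, 'classic-childrens-literature').
  addV('Category').property(id, 'action-and-adventure').
  addV('Category').property(id, 'fiction').
  V('book-1').addE('CLASSIFIED_AS').to(V('classic-childrens-literature')).
  V('book-1').addE('CLASSIFIED_AS').to(V('action-and-adventure')).
  V('classic-childrens-literature').addE('SUBCATEGORY_OF').to(V('fiction')).
  V('action-and-adventure').addE('SUBCATEGORY_OF').to(V('fiction'))
```
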
142 | #### Relating entities through their attributes at query time
143 |
144 | If an attribute value will be used to create paths through the network that relate entities at query time, consider pulling it out into its own vertex.
145 |
146 | _Social security number_ is a good example. Normally, we'd model social security number by attaching a `socialSecurityNumber` property to a `User` vertex. But in a fraud detection graph, where individuals in a fraud ring share bits of identity information, things are more complicated. Here we might have a connected data query of the form:
147 |
148 | _Given individual X, can we find other people in the graph who have opened accounts using the same social security number as person X?_
149 |
150 | In other words, we have a starting point, person X, but need thereafter to find other people who have something in common with person X based on a specific attribute – the social security number used by person X.
151 |
152 | Note that this connected data query is very different from the kind of query that asks:
153 |
154 | _Find everyone who has used social security number '123-45-6789'._
155 |
156 | This latter query could be satisfied simply by filtering `User` vertices based on a `socialSecurityNumber` value that is known to us at the time the query is formulated. It's the equivalent of a simple key-value lookup. In the connected data query, in contrast, we don't necessarily know the social security number at query time. What we do know is how to identify person X. Having found person X, the connected data query then needs to find other people who are connected to X by way of some shared attribute – the social security number.
157 |
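With social security numbers pulled out into their own vertices, the connected data query becomes a short traversal. A Gremlin sketch, assuming illustrative vertex IDs and a hypothetical `USED_SSN` edge label:

```
g.V('person-x').as('x').
  out('USED_SSN').
  in('USED_SSN').
  where(neq('x')).
  values('firstName', 'lastName')
```
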
158 | If you are considering modelling an attribute as its own vertex in order to facilitate connected data queries, apply good judgement based on your understanding of the domain. The new vertex should probably represent a significant concept in the domain. In the fraud detection example, bits of identity information are meaningful domain entities that can exist independent of the users with which they are associated. In other domains, the same might not be true of these same attributes.
159 |
160 | ## Edges
161 |
162 | Use edges to represent the relationships between things in your domain.
163 |
164 | The performance of a graph query depends on how much of the graph the query has to 'touch' in order to generate a set of results. The larger the working set, the longer it will take to fetch it from storage and then traverse it once it has been cached in main memory.
165 |
166 | ![Large query](large-query.png)
167 |
168 | You can ensure your queries touch the minimum amount of data by naming edges in a way that allows the query engine to follow only those relationships relevant to the query being executed.
169 |
170 | ![Small query](small-query.png)
171 |
172 | Edges compose and partition the graph. By connecting vertices, they structure the whole, creating a complex composite from what would otherwise be simple islands of data. At the same time they serve to partition the graph, differentiating connections between elements based on name, direction and property values so that queries can identify specific subgraphs within a larger, more variably connected structure. By focussing your queries on certain edge labels and directions, and the paths they form, you allow the query engine to exclude irrelevant parts of the graph from consideration, effectively materializing a particular view of the graph dedicated to addressing a specific query need.
173 |
174 | ### Edge IDs
175 |
176 | Every edge has an ID. Some graph databases automatically assign IDs to edges when they are created, others allow you to supply your own. Because every edge has its own identity, the property graph allows you to create multiple edges with the same labels (and properties) between any given pair of vertices.
177 |
178 | > Neptune allows you to supply your own IDs when you create an edge. If you don't supply an ID, Neptune will create a string-based UUID for you.
179 |
180 | ### Edge labels
181 |
182 | Derive your edge labels from your use cases. Doing so helps structure and partition your data so that queries ignore vertices and edges that have no bearing on the working set necessary to satisfy the query.
183 |
184 | ![Edge labels](edge-labels.png)
185 |
186 | If your queries need only find relationships with a particular name drawn from a family of names (for example, of all the addresses in the dataset, one query needs only find work addresses, another only home addresses), then consider using fine-grained edge labels or predicates.
187 |
188 | If some or all of your queries need to find all relationships belonging to a particular family (for example, all addresses, irrespective of whether they are work or home addresses), use a more general name qualified with an edge property. The tradeoff here is that queries that require only specific relationship types (for example, work addresses) will touch more of the graph and will have to filter based on the edge property, but the design provides for both finding all edges with a particular name, and finding specific types of edges belonging to that family.
189 |
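The two designs lead to different queries. A sketch of both, assuming a vertex with ID `p-1` and illustrative labels and property values:

```
// Fine-grained edge labels
g.V('p-1').out('HOME_ADDRESS')

// General edge label qualified by an edge property
g.V('p-1').outE('ADDRESS').has('type', 'home').inV()
```
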
190 | #### Bi-directional relationships
191 |
192 | If you need to model bi-directional relationships, in which relationship direction is of no consequence to the model, you can use a single, directed relationship, but ignore its direction in your Gremlin queries using the `both()` or `bothE()` steps.
193 |
194 | ![Bi-directional relationships](bi-directional-relationships.png)
195 |
196 | Here's an example query that ignores edge direction:
197 |
198 | ```
199 | g.V('p-1').both('WORKS_WITH')
200 | ```
201 |
202 | #### Uni-directional relationships
203 |
204 | Edges in a property graph are always directed: ideal for expressing uni-directional relationships.
205 |
206 | ![Uni-directional relationships](uni-directional-relationships.png)
207 |
208 | In Gremlin you then explicitly state the direction you wish to follow in your queries using the `in()`, `out()`, `inE()` and `outE()` steps:
209 |
210 | ```
211 | g.V('p-1').in('FOLLOWS')
212 | ```
213 |
214 | or
215 |
216 | ```
217 | g.V('p-1').out('FOLLOWS')
218 | ```
219 |
220 | > ### Favour outgoing edges
221 | > Neptune is optimized for traversing outgoing edges. Therefore, if possible, design your model so that your performance-critical queries follow mostly outgoing edges.
222 | >
223 | > If a query has to traverse an incoming edge, always specify the edge label as part of the query, even if there is only one type of edge label that the traversal could possibly follow. For example, for a vertex that has only `CREATED` incoming edges, we would recommend using `in('CREATED')` and `inE('CREATED')` rather than `in()` and `inE()` to traverse those edges.
224 |
225 | #### Multiple relationships between vertices
226 |
227 | You can connect any pair of vertices with multiple edges. These edges can all have the same name, or they can have different names. Each edge represents an instance of a connection between the start and end vertices. In many cases, such edges will be attributed with one or more distinguishing properties, such as timestamps.
228 |
229 | ![Multiple relationships](multiple-relationships.png)
230 |
231 | ### Edge properties
232 |
233 | Use edge properties to represent the strength, weight or quality of a relationship. Using edge properties, you can further filter which edges a traversal follows – following only `KNOWS` edges in a social graph whose `strength` property is greater than 5, for example – or compute a cumulative result along a path – calculating the shortest, or cheapest, or quickest route through a logistics network, for example.
234 |
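For example, the following Gremlin traversal follows only strong `KNOWS` relationships (the property name and threshold are illustrative):

```
g.V('p-1').outE('KNOWS').has('strength', gt(5)).inV().values('firstName')
```
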
235 | You can also use edge properties to store metadata such as a version number, last updated timestamp, or access control list.
236 |
237 | > ### A note on predicates
238 | > In RDF terms, both edge labels and edge and vertex property names are considered predicates. Neptune is optimized for datasets containing a relatively small number of unique predicates – in the order of several thousand at most. A dataset containing 100,000 `User` vertices, each with 5 properties, and 1 million `FOLLOWS` edges has 6 unique predicates (5 vertex properties and 1 edge label).
239 | >
240 | > Keep the number of predicates in your data model relatively small. Databases with many tens of thousands or even millions of unique predicates can experience a drop in performance.
241 |
242 | ## The Hub-and-Spoke Pattern
243 |
244 | One of the most common patterns in property graph data modelling is the hub-and-spoke structure, comprising a central vertex connected to several neighbouring vertices. This central vertex often represents a fact or event, while the neighbouring vertices provide contextual information that helps explain or enrich our understanding of this hub vertex. An example would be a `Purchase` hub vertex, representing a purchasing event, connected to the `User` who made the purchase, the several `Product` items in the user's shopping basket, and the `Shop` where the items were bought.
245 |
246 | The hub-and-spoke subgraph structure is similar to the star schema or facts and dimensions model employed in data warehousing. Each hub node represents an instance of a fact or event (or other entity). A hub vertex is connected to one or more spoke or dimension vertices. These dimension vertices in turn are often connected to multiple other fact vertices: a `User` makes many purchases; a `Product` appears in multiple shopping baskets. The subgraph structure may occur thousands or millions of times in a dataset, with the dimension vertices acting as contextual intermediaries through which facts or events can be related.
247 |
248 | Sometimes this pattern will emerge as a straightforward representation of your domain. At other times, you may find yourself moving to this pattern to accommodate several different use cases and the queries associated with them, or to provide for the long-term evolvability of your model. Your overall goal is to design an application graph data model that is expressive of your domain, easy to query on behalf of your most important use cases, and easy to evolve as you discover new use cases and introduce new features into your application. If your data model is too simple, it can become difficult to add new use cases and queries. If it is too complex, it can become difficult to maintain, and may impose a performance penalty on some of your more important queries. Aim to be as simple as you can given the needs of your application, and no simpler.
249 |
250 | ### Hub-and-spoke example
251 |
252 | Consider how we might represent a person's employment details in an application graph data model. If all we need to know, given our current and anticipated new use cases, is the company where a person worked, the following will suffice:
253 |
254 | ![Hub and spoke example 1](hub-and-spoke-1.png)
255 |
256 | We can even add properties to the `WORKED_AT` edge to describe Alice's role, and the period during which she worked for Example Corp (e.g. `from` and `to` properties, with date values):
257 |
258 | ![Hub and spoke example 2](hub-and-spoke-2.png)
259 |
260 | But if our application use cases require us to ask deeper questions of Alice's employment history – In which office was she located? How did her role relate to other roles in the company? – then we'll need to adopt a more complex model based on the hub-and-spoke pattern:
261 |
262 | ![Hub and spoke example 3](hub-and-spoke-3.png)
263 |
264 | Here we've taken the action encoded in the `WORKED_AT` edge ('working' or 'worked', a verb) and turned it into a vertex labelled `Job` (a noun) that acts as a hub connected by way of `HAS_JOB` and `AT_COMPANY` edges to the vertices representing Alice and Example Corp. You'll find that most edges can be decomposed in this way into a more complex vertex-and-two-edges structure. The trick is in identifying when this is _necessary_.
265 |
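A Gremlin sketch of this hub-and-spoke version of the model (IDs and property values are illustrative):

```
g.addV('Person').property(id, 'p-1').property('firstName', 'Alice').
  addV('Company').property(id, 'c-1').property('name', 'Example Corp').
  addV('Job').property(id, 'j-1').property('from', 2012).property('to', 2015).
  V('p-1').addE('HAS_JOB').to(V('j-1')).
  V('j-1').addE('AT_COMPANY').to(V('c-1'))
```
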
266 | The advantage of this model is that it allows for the long-term evolvability of your application. You can always add new types of dimension nodes as you learn more about your domain and introduce new features. If a new use case emerges that requires us to capture details of the department to which Alice belonged (an organisational hierarchy worthy of its own subgraph structure), for example, we can easily add a new `Department` vertex to the model:
267 |
268 | ![Hub and spoke example 4](hub-and-spoke-4.png)
269 |
270 | ### When to use hub-and-spoke
271 |
272 | It's tempting to apply this pattern everywhere, transforming every edge in your 'naive' model into a vertex-and-two-edges structure. But if you don't need the richness and flexibility of this subgraph structure, don't use it: you'll end up increasing storage overheads and query latencies (more data to store, fetch and traverse) for no appreciable benefit.
273 |
274 | Conversely, you may be struggling with an overly simplistic model derived from your first natural language description of your domain. Problems often present themselves when you find yourself wanting to add an edge to an edge – to annotate one relationship with another. This can sometimes be the result of _verbing_ – the language habit whereby a noun is transformed into a verb. For example, instead of saying X `SENT` an `Email` `TO` Y, we might verb the noun, and say X `EMAILED` Y. If this is the basis of our model, problems emerge when we want to indicate who was CC'd on the mail, or describe one email as being a reply to another. By pulling out the domain entity inherent in the verb – `Email` from `EMAILED` – we can introduce a hub node that allows for far more expressive structuring of entities and relationships.
275 |
276 | If you're struggling to come up with a graph structure that captures the complex interdependencies between several things in your domain, look for the nouns, and hence the domain concepts, hidden inside of some of the verb phrases you've used to describe the structuring of your domain.
277 |
278 | While some hub vertices lie hidden in verbs, other hub-and-spoke structures can be found in adverbial phrases – those additional parts of a sentence that describe how, when or where an action was performed. Adverbial phrases result in what entity-relationship modelling calls _n-ary relationships_; that is, complex, multi-dimensional relationships that bind together several things and concepts. The hub-and-spoke pattern is ideal for these kinds of n-ary relationships. While it may sometimes feel as though you're encumbering your model with another vertex just to accommodate the need for multiple relationships, you can invariably find a good, domain-meaningful term for this hub vertex that helps make the model more expressive.
279 |
280 | # RDF Data Modelling
281 |
282 | - [The Graph Development Lifecycle](#the-graph-development-lifecycle)
283 | - [Step 1: Describing new features](#step-1-describing-new-features)
284 | - [Step 2: Designing the Ontology](#step-2-designing-the-ontology)
285 | - [Step 3: Encoding the Ontology as RDF-OWL](#step-3-encoding-the-ontology-as-rdf-owl)
286 | - [Step 4: Create instance data as RDF](#step-4-create-instance-data-as-rdf)
287 | - [Step 5: Test features with SPARQL](#step-5-test-features-with-sparql)
288 | - [Graph Development Lifecycle Tutorial](#graph-development-lifecycle-tutorial)
289 | - [Designing RDF Graph models (Ontologies)](#designing-rdf-graph-models-ontologies)
290 | - [Iteration 1: adding the first feature to the solution](#iteration-1-adding-the-first-feature-to-the-solution)
291 | - [Iteration 2: adding a second feature](#iteration-2-adding-a-second-feature)
292 | - [Ontology design: Replicate an element](#ontology-design-option-1-replicate-an-element)
293 | - [Ontology design: Re-using elements](#ontology-design-option-2-re-using-elements)
294 | - [Ontology design: Multiple diagrams and replicating elements](#ontology-design-option-3-multiple-diagrams-and-replicating-elements)
295 | - [Iteration 3 : adding a breaking change](#iteration-3-adding-a-breaking-change)
296 | - [Reification](#reification)
297 | - [Testing features with multiple SPARQL queries](#testing-features-with-multiple-sparql-queries)
298 | - [Using Edges to Facilitate Efficient Graph Queries](#using-edges-to-facilitate-efficient-graph-queries)
299 | - [Predicate names](#predicate-names)
300 | - [Bi-directional relationships](#bi-directional-relationships)
301 | - [Uni-directional relationships](#uni-directional-relationships)
302 | - [Multiple relationships between nodes](#multiple-relationships-between-nodes)
303 |
304 | ## The Graph Development Lifecycle
305 |
306 | ### Introduction
307 |
308 | Designing, building and testing [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) graph models is an iterative process, which we refer to here as The Graph Development Lifecycle.
309 | The Graph Development Lifecycle starts by describing the features you want your graph to address in natural language, followed by a visual design, encoding the model as RDF using [OWL (Web Ontology Language)](https://www.wikipedia.org/wiki/Web_Ontology_Language), and then testing your ability to answer the features with some sample data and [SPARQL](https://en.wikipedia.org/wiki/SPARQL) queries.
310 |
311 | 
312 |
313 | Here we describe each step in the process, and below we give a full example of [iterating over the Graph Development Lifecycle multiple times](#graph-development-lifecycle-tutorial).
314 |
315 | ### Step 1: Describing new features
316 |
317 | Here we describe the features we want our graph to address, in plain natural language.
318 |
319 | **Feature example:** *"In my social graph, I want to list all the people, by name, that are not my direct friends, but that are friends with my friends"*
320 |
321 | ### Step 2: Designing the Ontology
322 |
323 | The term "Ontology" comes from Metaphysics, where it refers to the philosophical study of being. It investigates what types of entities exist, how they are grouped into categories, and how they are related to one another at the most fundamental level.
324 | When we design a schema or logical model for RDF, we call it an "Ontology".
325 |
326 | Visit Wikipedia to further understand the differences between the [philosophical definition of an Ontology](https://en.wikipedia.org/wiki/Ontology) and the [Information Science definition of an Ontology](https://en.wikipedia.org/wiki/Ontology_%28information_science%29).
327 |
328 | Using the features described in natural language from Step 1, we draw the Ontology described by the features. We take all the concepts and relationships described in natural language and display them in an easy-to-digest diagram. Any diagramming tool is usually suitable; a whiteboard is ideal.
329 |
330 | ### Step 3: Encoding the Ontology as RDF-OWL
331 |
332 | Once you have designed an Ontology in step 2, you record the model as data in the graph. The vocabulary we use to record the schema definition is called [OWL (Web Ontology Language)](https://www.wikipedia.org/wiki/Web_Ontology_Language), and it is itself encoded in RDF.
333 |
334 | This activity does not enforce a schema; it describes the Ontological model as data, and gives you the ability to query the Ontology alongside the actual data. Although it may look daunting at first, there are various tools available to help you with this process, and it becomes easy to do manually with practice. We show a full example below.
335 |
336 | N.B. Designing and encoding an Ontology is not a requirement for an RDF database (like Amazon Neptune) to work. Some practitioners do not design or encode their Ontologies at all. It is, however, highly recommended, for the following reasons:
337 |
338 | * The ability to query the logical model, schema and entity definitions just like any other data (see the example query after this list).
339 | * Compatibility with tooling that understands the OWL standard.
340 | * Software components that use the graph can be engineered to understand the model, for example to dynamically create React components for specific types.
341 |
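As an example of querying the Ontology alongside the data, the following SPARQL query lists every class defined in the Ontology together with its label. This is a sketch only; it assumes the Ontology statements have been loaded into the same graph as the instance data.

```
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# List every class defined in the Ontology, with its human-readable label if one exists.
SELECT ?class ?label WHERE {
  ?class a owl:Class .
  OPTIONAL { ?class rdfs:label ?label }
}
ORDER BY ?label
```
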
342 | ### Step 4: Create instance data as RDF
343 |
344 | Now that you have an Ontology defined, you can create or source some data to fit the Ontology, as you need some sample data to test whether your Feature can be satisfied. We call this the Instance data.
345 | The data is recorded in RDF, and stored in the database alongside the OWL data.
346 |
347 | The distinction between instance RDF data and Ontology RDF/OWL data comes from [Description Logic](https://en.wikipedia.org/wiki/Description_logic): the instance data belongs in the [ABox](https://en.wikipedia.org/wiki/Abox) (assertional box), and the Ontology (OWL) data belongs in the [TBox](https://en.wikipedia.org/wiki/Tbox) (terminological box).

348 | | Box | Example |
349 | |-|-|
350 | | TBox (Ontology) | Every employee is a person |
351 | | ABox (Instance data) | Bob is an employee |
352 |
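Expressed as RDF, the two rows in the table above might look as follows. This is a sketch only: awso:Person and awsr:Bob are hypothetical resources used purely for illustration, written in the example namespaces used in the tutorial below.

```
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix awso: <http://aws.amazon.com/ontology#>
prefix awsr: <http://aws.amazon.com/resource#>

INSERT DATA {
  # TBox (Ontology): every employee is a person
  awso:Employee rdfs:subClassOf awso:Person .

  # ABox (Instance data): Bob is an employee
  awsr:Bob a awso:Employee .
}
```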
353 |
354 | ### Step 5: Test features with SPARQL
355 |
356 | Now that you have some RDF instance data and your OWL Ontology, you can test whether or not you can satisfy your feature requirements by running SPARQL queries.
357 |
358 | Once you have completed the cycle, if you can satisfy your feature, you can consider this journey around the cycle complete. If you cannot satisfy the feature, you can start the cycle again from step 1 or 2, with the new knowledge you have gained, and try again with a new model that satisfies your feature's requirements.
359 |
360 | We show a full example around the lifecycle below.
361 |
362 | # Graph Development Lifecycle Tutorial
363 |
364 | ### Designing RDF Graph models (Ontologies)
365 |
366 | When designing a model for an RDF graph, we first describe in natural language the features that we want the model to support, and then we design and document the model by drawing the concepts it describes, their properties, and the relationships between them; this process is called designing an Ontology.
367 |
368 | You can design an Ontology simply by drawing it. Ontologies can be drawn on paper, on a whiteboard, or in any graphical design environment where you can draw concepts, relationships between concepts, and properties of those concepts. There are third-party tools dedicated to the design of Ontologies, but we will not describe them here, as they are not needed for this guide.
369 |
370 | Once designed, the Ontology can be stored as RDF in a graph database, and so can be queried like any other data.
371 | Experienced Ontologists often write Ontology RDF data (known as OWL) either by hand or using third-party tooling, sometimes skipping some of the steps described here, but in this guide we follow the more complete and well-documented process shown in the diagram above.
372 |
373 | Now we demonstrate iterating around the Graph Development Lifecycle three times, once for each of three new features.
374 |
375 | ## Iteration 1: adding the first feature to the solution
376 |
377 | ### 1. Describe new features
378 |
379 | We describe what we want to find out from the graph, in natural language.
380 |
381 | | Feature | Description |
382 | |-|-|
383 | | 1 | I want to know which employees work for which departments in the ACME CORP |
384 |
385 | ### 2. Design the Ontology
386 |
387 | We draw concepts/objects such as ‘Department’ and ‘Employee’ in a different style to data properties, such as a string for a name, in order to differentiate between literal values of core data types and domain-specific concepts.
388 |
389 | 
390 |
391 | ### 3. Encode the ontology as RDF (OWL)
392 |
393 | Here is a complete example of the Ontology RDF (OWL) to match the diagram from step 2.
394 | The RDF shown here is in the [Turtle](https://www.w3.org/TR/turtle/) format. To load this data into Neptune, see the documentation on [Loading data into Amazon Neptune](https://docs.aws.amazon.com/neptune/latest/userguide/load-data.html).
395 |
396 | ```
397 | @prefix owl: <http://www.w3.org/2002/07/owl#> .
398 | @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
399 | @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
400 | @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
400 | @prefix awso: <http://aws.amazon.com/ontology#> .   # awso namespace IRI assumed for this guide
401 |
402 | awso:withinOrganisation rdf:type owl:ObjectProperty ;
403 | rdfs:domain awso:Department ;
404 | rdfs:range awso:Organisation ;
405 | rdfs:label "within organisation" .
406 |
407 | awso:hasPositionIn rdf:type owl:ObjectProperty ;
408 | rdfs:domain awso:Employee ;
409 | rdfs:range awso:Department ;
410 | rdfs:label "has a position within" .
411 |
412 | awso:hasName rdf:type owl:DatatypeProperty ;
413 | rdfs:domain awso:Organisation ;
414 | rdfs:range xsd:string .
415 |
416 | awso:Department rdf:type owl:Class ;
417 | rdfs:label "Department" .
418 |
419 | awso:Employee rdf:type owl:Class ;
420 | rdfs:label "Employee" .
421 |
422 | awso:Organisation rdf:type owl:Class ;
423 | rdfs:label "Organisation" .
424 | ```
425 |
426 | ### 4. Create instance data as RDF
427 |
428 | Create sample data, to be loaded into Neptune, which fits the Ontology.
429 | We create RDF in the Turtle format that describes:
430 |
431 | * 7 Employees
432 | * 4 Departments
433 | * 2 Organisations
434 | * 7 statements showing which departments the employees work in
435 | * 4 statements showing which organisations the departments are in
436 |
437 | ```
438 | @prefix awso: <http://aws.amazon.com/ontology#> .
439 | @prefix awsr: <http://aws.amazon.com/resource#> .
440 |
441 | awsr:Employee2 a awso:Employee .
442 | awsr:Employee1 a awso:Employee .
443 | awsr:Employee3 a awso:Employee .
444 | awsr:Employee4 a awso:Employee .
445 | awsr:Employee5 a awso:Employee .
446 | awsr:Employee6 a awso:Employee .
447 | awsr:Employee7 a awso:Employee .
448 |
449 | awsr:Department1 a awso:Department .
450 | awsr:Department2 a awso:Department .
451 | awsr:Department3 a awso:Department .
452 | awsr:Department4 a awso:Department .
453 |
454 | awsr:Org1 a awso:Organisation ;
455 | awso:hasName "ACME Corp" .
456 | awsr:Org2 a awso:Organisation ;
457 | awso:hasName "NORMCO LTD" .
458 |
459 | awsr:Employee1 awso:hasPositionIn awsr:Department1 .
460 | awsr:Employee2 awso:hasPositionIn awsr:Department1 .
461 | awsr:Employee3 awso:hasPositionIn awsr:Department2 .
462 | awsr:Employee4 awso:hasPositionIn awsr:Department2 .
463 | awsr:Employee5 awso:hasPositionIn awsr:Department3 .
464 | awsr:Employee6 awso:hasPositionIn awsr:Department3 .
465 | awsr:Employee7 awso:hasPositionIn awsr:Department4 .
466 |
467 | awsr:Department1 awso:withinOrganisation awsr:Org1 .
468 | awsr:Department2 awso:withinOrganisation awsr:Org1 .
469 | awsr:Department3 awso:withinOrganisation awsr:Org2 .
470 | awsr:Department4 awso:withinOrganisation awsr:Org2 .
471 | ```
472 | ### 5. Test features with SPARQL
473 | Write a SPARQL query which tests whether our feature's requirements can be satisfied.
474 |
475 | | Feature | Description |
476 | |-|-|
477 | | 1 | I want to know which employees work for which departments in the ACME CORP |
478 |
479 | ```
480 | prefix awso: <http://aws.amazon.com/ontology#>
481 | prefix awsr: <http://aws.amazon.com/resource#>
482 |
483 | SELECT * WHERE {
484 | ?employee a awso:Employee .
485 | ?employee awso:hasPositionIn ?department .
486 | ?department awso:withinOrganisation awsr:Org1 .
487 | }
488 | ```
489 | The response to the query proves that our feature is satisfied: a table listing employees and the departments they work for, filtered to a single organisation.
490 |
491 | | Employee | Department |
492 | | ---------------------------------------- | ------------------------------------------ |
493 | | http://aws.amazon.com/resource#Employee1 | http://aws.amazon.com/resource#Department1 |
494 | | http://aws.amazon.com/resource#Employee2 | http://aws.amazon.com/resource#Department1 |
495 | | http://aws.amazon.com/resource#Employee3 | http://aws.amazon.com/resource#Department2 |
496 | | http://aws.amazon.com/resource#Employee4 | http://aws.amazon.com/resource#Department2 |
497 |
498 | ## Iteration 2: adding a second feature
499 |
500 | ### 1. Describe new Features
501 |
502 | | Feature | Description |
503 | |-|-|
504 | | 1 | I want to know which employees work for which departments in the ACME CORP |
505 | | 2 | and I want to list all the names of the employees in the organisation ACME CORP |
506 |
507 | ### 2. Design the Ontology
508 |
509 | We can now update our diagram to fulfil the requirement of the new feature, by adding the ‘has name’ property to the Employee.
510 | We re-use the same property “has name” that we first used for ‘Organisation’. This is because they are conceptually the same.
511 | In other words, the following statement is true: “both Employees and Organisations have a name.”
512 | Because it is the same property conceptually and in reality, you can choose to visualise this in multiple ways. How you draw the diagram of the Ontology is up to you; here are three different approaches:
513 |
514 | #### Ontology design option 1: Replicate an element
515 |
516 | Replicating the element 'has name' on the diagram gives a lot of flexibility when drawing the model, as you can position the elements anywhere near their respective domain. It does, however, mean that you have more elements on your diagram, so you can use up more of the canvas available to you.
517 |
518 | 
519 |
520 | #### Ontology design option 2: Re-using elements
521 |
522 | Pointing the relationship 'has name' to the same element means you use less of the canvas, but it may be more difficult to lay out and visualise later on, especially if lots of entities have a name.
523 |
524 | 
525 |
526 | #### Ontology design option 3: Multiple diagrams and replicating elements
527 |
528 | Splitting the diagram into two, and duplicating the elements Employee and Organisation, means you can manage your canvases separately, and allows for much more expansion later on: for example, if every class 'has a name', each diagram will still be easy to understand. However, this does mean maintaining multiple diagrams.
529 |
530 | 
531 |
532 | ### 3. Encode the ontology as RDF (OWL)
533 |
534 | We add the new entities to our OWL file in RDF Turtle format:
535 | *(We predict that we will also want names for Departments, so we can load that too.)*
536 |
537 | You can write the same RDF to Neptune as many times as you want; you will never get duplicate statements. An RDF graph is a set of statements.
538 |
539 | We add the following RDF/OWL into Neptune, which extends the previous definition of the 'has name' datatype property with the additional domains.
540 |
541 | ```
542 | @prefix owl: <http://www.w3.org/2002/07/owl#> .
543 | @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
544 | @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
545 | @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
545 | @prefix awso: <http://aws.amazon.com/ontology#> .
546 |
547 | awso:hasName rdf:type owl:DatatypeProperty ;
548 | rdfs:domain awso:Organisation ,
549 | awso:Employee ,
550 | awso:Department ;
551 | rdfs:range xsd:string ;
552 | rdfs:label "has name" .
553 | ```
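
Because an RDF graph is a set of statements, re-loading this block alongside the earlier definition simply adds the new domain and label triples; nothing is duplicated. You can confirm the extended definition of 'has name' with a query such as the following (a sketch):

```
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix awso: <http://aws.amazon.com/ontology#>

# List the classes declared as domains of 'has name'.
SELECT ?domain WHERE {
  awso:hasName rdfs:domain ?domain .
}
```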
554 |
555 |
556 | ### 4. Create instance data as RDF
557 |
558 | We already have 'has name' sample data for the Organisations, so we just add the missing sample RDF instance data: the 'has name' datatype properties for Employees and Departments.
559 |
560 | ```
561 | @prefix awso: <http://aws.amazon.com/ontology#> .
562 | @prefix awsr: <http://aws.amazon.com/resource#> .
563 |
564 | awsr:Employee1 awso:hasName "John Smith".
565 | awsr:Employee2 awso:hasName "Jane Doe".
566 | awsr:Employee3 awso:hasName "Mike Jones".
567 | awsr:Employee4 awso:hasName "Callum McAllister".
568 | awsr:Employee5 awso:hasName "Allison Hunter".
569 | awsr:Employee6 awso:hasName "Sanjay Singh".
570 | awsr:Employee7 awso:hasName "Lars Anderson".
571 |
572 | awsr:Department1 awso:hasName "Sales & Marketing".
573 | awsr:Department2 awso:hasName "I.T.".
574 | awsr:Department3 awso:hasName "Human Resources".
575 | awsr:Department4 awso:hasName "Information Technology".
576 |
577 | ```
578 |
579 | ### 5. Test features with SPARQL
580 |
581 | We write a new SPARQL query to satisfy both features:
582 |
583 | | Feature | Description |
584 | |-|-|
585 | | 1 | I want to know which employees work for which departments in the ACME CORP |
586 | | 2 | and I want to list all the names of the employees in the organisation ACME CORP |
587 |
588 |
589 | ```
590 | prefix awso: <http://aws.amazon.com/ontology#>
591 | prefix awsr: <http://aws.amazon.com/resource#>
592 |
593 | SELECT ?employeeName ?departmentName ?orgName WHERE {
594 | ?employee a awso:Employee ;
595 | awso:hasName ?employeeName ;
596 | awso:hasPositionIn ?department .
597 | ?department awso:withinOrganisation awsr:Org1 ;
598 | awso:hasName ?departmentName .
599 | awsr:Org1 awso:hasName ?orgName .
600 | }
601 | ```
602 |
603 | The response to the query proves that both features are satisfied: a table listing employee names and the departments they work for, filtered to a single organisation. Now that we have names for everything, we can use them.
604 |
605 | | employeeName | departmentName | orgName |
606 | |-|-|-|
607 | | John Smith | Sales & Marketing | ACME Corp
608 | | Jane Doe | Sales & Marketing | ACME Corp
609 | | Mike Jones | I.T. | ACME Corp
610 | | Callum McAllister | I.T. | ACME Corp
611 |
612 |
613 | ## Iteration 3: adding a breaking change
614 |
615 | ### 1. Describe new Features
616 |
617 | We describe a third feature.
618 |
619 |
620 | | Feature | Description |
621 | |-|-|
622 | | 1 | I want to know which employees work for which departments in the ACME CORP |
623 | | 2 | and I want to list all the names of the employees in the organisation ACME CORP |
624 | | 3 | and I want to list all the workers that work in IT departments across all organisations |
625 |
626 |
627 | ### 2. Design the Ontology
628 |
629 | This new third feature presents us with a problem. We cannot write a SPARQL query that will collect together all the people who work in any IT department, regardless of Organisation. You can see this in our existing Ontology diagram.
630 | In the existing Ontology, we have different IT departments for different organisations, but we have no way of recognising that they are both IT departments belonging to different Organisations:
631 |
632 | 
633 |
634 | ```
635 | ...
636 | awsr:Department2 a awso:Department ;
637 | awso:hasName "I.T.".
638 | awsr:Department4 a awso:Department ;
639 | awso:hasName "Information Technology".
640 | ...
641 | awsr:Department2 awso:withinOrganisation awsr:Org1 .
642 | awsr:Department4 awso:withinOrganisation awsr:Org2 .
643 | ...
644 | ```
645 |
646 | #### Reification
647 |
648 | We need to expand the model so that it recognises that the two departments are the same kind of department, but belong to different organisations.
649 | To do this, we need to expand the relationship ‘has position in’. This process is called [Reification](https://en.wikipedia.org/wiki/Reification_(knowledge_representation)).
650 |
651 | We start with the relationship ...
652 |
653 | 
654 |
655 | ... and reify (expand) it, so that we can recognise when Employees work for the same Department, but in different Positions for different organisations.
656 |
657 | 
658 |
659 | Our new complete Ontology looks like this:
660 |
661 | 
662 |
663 | ### 3. Encode the ontology as RDF (OWL)
664 |
665 | Much of our Ontology has changed after reification, so we first delete everything using a simple SPARQL query that clears all data:
666 |
667 | ```
668 | CLEAR ALL
669 | ```
670 |
671 | ...and we create a new complete RDF/OWL file and load it:
672 |
673 | ```
674 | @prefix owl: <http://www.w3.org/2002/07/owl#> .
675 | @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
676 | @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
677 | @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
677 | @prefix awso: <http://aws.amazon.com/ontology#> .
678 |
679 | awso:hasPosition rdf:type owl:ObjectProperty ;
680 | rdfs:domain awso:Employee ;
681 | rdfs:range awso:Position ;
682 | rdfs:label "has position" .
683 |
684 | awso:withinOrganisation rdf:type owl:ObjectProperty ;
685 |     rdfs:domain awso:Position ;
686 | rdfs:range awso:Organisation ;
687 | rdfs:label "within organisation" .
688 |
689 | awso:inDepartment rdf:type owl:ObjectProperty ;
690 | rdfs:domain awso:Position ;
691 | rdfs:range awso:Department ;
692 | rdfs:label "for department" .
693 |
694 | awso:hasName rdf:type owl:DatatypeProperty ;
695 | rdfs:domain awso:Organisation ,
696 | awso:Department ,
697 | awso:Employee ,
698 | awso:Position ;
699 | rdfs:range xsd:string ;
700 | rdfs:label "has name" .
701 |
702 | awso:Department rdf:type owl:Class ;
703 | rdfs:label "Department" .
704 |
705 | awso:Employee rdf:type owl:Class ;
706 | rdfs:label "Employee" .
707 |
708 | awso:Organisation rdf:type owl:Class ;
709 | rdfs:label "Organisation" .
710 |
711 | awso:Position rdf:type owl:Class ;
712 | rdfs:label "Position" .
713 | ```
714 |
715 | ### 4. Create instance data as RDF
716 |
717 | As we started with a new Ontology in step 3 and deleted everything, we load a complete new sample RDF dataset.
718 |
719 | ```
720 | @prefix awso: <http://aws.amazon.com/ontology#> .
721 | @prefix awsr: <http://aws.amazon.com/resource#> .
722 |
723 | awsr:Employee1 a awso:Employee ;
724 | awso:hasName "John Smith".
725 | awsr:Employee2 a awso:Employee ;
726 | awso:hasName "Jane Doe".
727 | awsr:Employee3 a awso:Employee ;
728 | awso:hasName "Mike Jones".
729 | awsr:Employee4 a awso:Employee ;
730 | awso:hasName "Callum McAllister".
731 | awsr:Employee5 a awso:Employee ;
732 | awso:hasName "Allison Hunter".
733 | awsr:Employee6 a awso:Employee ;
734 | awso:hasName "Sanjay Singh".
735 | awsr:Employee7 a awso:Employee ;
736 | awso:hasName "Lars Anderson".
737 |
738 | awsr:Department1 a awso:Department ;
739 | awso:hasName "Sales & Marketing".
740 | awsr:Department2 a awso:Department ;
741 | awso:hasName "I.T.".
742 | awsr:Department3 a awso:Department ;
743 | awso:hasName "Human Resources".
744 |
745 | awsr:Org1 a awso:Organisation ;
746 | awso:hasName "ACME Corp" .
747 | awsr:Org2 a awso:Organisation ;
748 | awso:hasName "NORMCO LTD" .
749 |
750 | # Employee 1
751 | awsr:Employee1 awso:hasPosition awsr:Pos1 .
752 | awsr:Pos1 a awso:Position .
753 | awsr:Pos1 awso:hasName "John Smith in I.T. for ACME Corp" .
754 | awsr:Pos1 awso:inDepartment awsr:Department2 .
755 | awsr:Pos1 awso:withinOrganisation awsr:Org1 .
756 |
757 | # Employee 2
758 | awsr:Employee2 awso:hasPosition awsr:Pos2 .
759 | awsr:Pos2 a awso:Position .
760 | awsr:Pos2 awso:hasName "Jane Doe in Sales&Marketing for ACME Corp" .
761 | awsr:Pos2 awso:inDepartment awsr:Department1 .
762 | awsr:Pos2 awso:withinOrganisation awsr:Org1 .
763 |
764 | # Employee 3
765 | awsr:Employee3 awso:hasPosition awsr:Pos3 .
766 | awsr:Pos3 a awso:Position .
767 | awsr:Pos3 awso:hasName "Mike Jones in I.T. for ACME Corp" .
768 | awsr:Pos3 awso:inDepartment awsr:Department2 .
769 | awsr:Pos3 awso:withinOrganisation awsr:Org1 .
770 |
771 | # Employee 4
772 | awsr:Employee4 awso:hasPosition awsr:Pos4 .
773 | awsr:Pos4 a awso:Position .
774 | awsr:Pos4 awso:hasName "Callum McAllister in I.T. for ACME Corp" .
775 | awsr:Pos4 awso:inDepartment awsr:Department2 .
776 | awsr:Pos4 awso:withinOrganisation awsr:Org1 .
777 |
778 | # Employee 5
779 | awsr:Employee5 awso:hasPosition awsr:Pos5 .
780 | awsr:Pos5 a awso:Position .
781 | awsr:Pos5 awso:hasName "Allison Hunter in H.R. for ACME Corp" .
782 | awsr:Pos5 awso:inDepartment awsr:Department3 .
783 | awsr:Pos5 awso:withinOrganisation awsr:Org1 .
784 |
785 | # Employee 6
786 | awsr:Employee6 awso:hasPosition awsr:Pos6 .
787 | awsr:Pos6 a awso:Position .
788 | awsr:Pos6 awso:hasName "Sanjay Singh in H.R. for ACME Corp" .
789 | awsr:Pos6 awso:inDepartment awsr:Department3 .
790 | awsr:Pos6 awso:withinOrganisation awsr:Org1 .
791 |
792 | # Employee 7
793 | awsr:Employee7 awso:hasPosition awsr:Pos7 .
794 | awsr:Pos7 a awso:Position .
795 | awsr:Pos7 awso:hasName "Lars Anderson in I.T. for NORMCO LTD" .
796 | awsr:Pos7 awso:inDepartment awsr:Department2 .
797 | awsr:Pos7 awso:withinOrganisation awsr:Org2 .
798 |
799 | ```
800 |
801 | ### 5. Test features with SPARQL
802 |
803 | We write a new SPARQL query to satisfy all three features:
804 |
805 | | Feature | Description |
806 | |-|-|
807 | | 1 | I want to know which employees work for which departments in the ACME CORP |
808 | | 2 | and I want to list all the names of the employees in the organisation ACME CORP |
809 | | 3 | and I want to list all the workers that work in IT departments across all organisations |
810 |
811 | ```
812 | prefix awso: <http://aws.amazon.com/ontology#>
813 | prefix awsr: <http://aws.amazon.com/resource#>
814 |
815 | SELECT ?employeeName ?departmentName ?orgName WHERE {
816 |
817 | ?employee a awso:Employee ;
818 | awso:hasName ?employeeName ;
819 | awso:hasPosition/awso:inDepartment ?department ;
820 | awso:hasPosition/awso:withinOrganisation ?org .
821 |
822 | ?department awso:hasName ?departmentName .
823 | ?org awso:hasName ?orgName .
824 |
825 | }
826 | ```
827 | The result shows that we can now satisfy all three features, with the Department URI being the same for every employee that works in I.T.
828 |
829 | | employeeName | departmentName | orgName |
830 | |-|-|-|
831 | | John Smith | I.T. | ACME Corp |
832 | | Mike Jones | I.T. | ACME Corp |
833 | | Callum McAllister | I.T. | ACME Corp |
834 | | Jane Doe | Sales & Marketing | ACME Corp |
835 | | Allison Hunter | Human Resources | ACME Corp |
836 | | Sanjay Singh | Human Resources | ACME Corp |
837 | | Lars Anderson | I.T. | NORMCO LTD |
838 |
839 | ### Testing features with multiple SPARQL queries
840 |
841 | You could also satisfy your features with separate SPARQL queries.
842 | For example, here is a SPARQL query designed to satisfy only Feature 3:
843 |
844 | | Feature | Description |
845 | |-|-|
846 | | 3 | and I want to list all the workers that work in IT departments across all organisations |
847 |
848 |
849 | ```
850 | prefix awso: <http://aws.amazon.com/ontology#>
851 | prefix awsr: <http://aws.amazon.com/resource#>
852 |
853 | SELECT ?employeeName ?orgName WHERE {
854 |
855 |     BIND(<http://aws.amazon.com/resource#Department2> as ?ITDepartment)
856 |
857 | ?employee a awso:Employee ;
858 | awso:hasName ?employeeName ;
859 | awso:hasPosition/awso:inDepartment ?ITDepartment ;
860 | awso:hasPosition/awso:withinOrganisation ?org .
861 |
862 | ?ITDepartment awso:hasName ?departmentName .
863 | ?org awso:hasName ?orgName .
864 |
865 | }
866 | ```
867 | Take note of the BIND clause in the query, which explicitly binds ?ITDepartment to the I.T. department resource.

868 | | employeeName | orgName |
869 | |-|-|
870 | | John Smith | ACME Corp |
871 | | Mike Jones | ACME Corp |
872 | | Callum McAllister | ACME Corp |
873 | | Lars Anderson | NORMCO LTD |
874 |
875 |
876 | ## Using Edges to Facilitate Efficient Graph Queries
877 |
878 | The performance of a graph query depends on how much of the graph the query has to 'touch' in order to generate a set of results. The larger the working set, the longer it will take to get from storage and then traverse once it has been cached in main memory.
879 |
880 | 
881 |
882 | You can ensure your queries touch the minimum amount of data by naming predicates in a way that allows the query engine to follow only those relationships relevant to the query being executed.
883 |
884 | 
885 |
886 | Predicates compose and partition the graph. By connecting vertices, they structure the whole, creating a complex composite from what would otherwise be simple islands of data. At the same time they serve to partition the graph, differentiating connections between elements based on name, direction and property values so that queries can identify specific subgraphs within a larger, more variably connected structure. By focussing your queries on certain predicates and the paths they form, you allow the query engine to exclude irrelevant parts of the graph from consideration, effectively materializing a particular view of the graph dedicated to addressing a specific query need.
887 |
888 | ### Predicate names
889 |
890 | Derive your predicates from your use cases. Doing so helps structure and partition your data so that queries ignore triples that have no bearing on the working set necessary to satisfy the query.
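
For example, a query that names the predicate it needs only touches the matching triples (a sketch, using a hypothetical ex: namespace):

```
prefix ex: <http://example.org/ex#>

# Touches only the 'worksWith' triples attached to p-1.
SELECT ?colleague WHERE {
  ex:p-1 ex:worksWith ?colleague .
}
```

By contrast, a query that leaves the predicate unbound has to consider every relationship attached to the vertex:

```
prefix ex: <http://example.org/ex#>

# Touches every outgoing triple of p-1, whatever its predicate.
SELECT ?other WHERE {
  ex:p-1 ?anyPredicate ?other .
}
```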
891 |
892 | ### Bi-directional relationships
893 |
894 | If you need to model bi-directional relationships, you will have to add pairs of predicates.
895 |
896 | 
897 |
898 | ```
899 | PREFIX j: <http://example.org/j#>   # placeholder namespace IRI
900 |
901 | INSERT
902 | {
903 | j:p-1 j:worksWith j:p-2 .
904 | j:p-2 j:worksWith j:p-1 .
905 | }
906 | WHERE {}
907 | ```
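
With both statements in place, you can find a person's colleagues by following the predicate from either node (a sketch, using the same placeholder namespace as above):

```
PREFIX j: <http://example.org/j#>

# Who does p-1 work with?
SELECT ?colleague WHERE {
  j:p-1 j:worksWith ?colleague
}
```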
908 |
909 | ### Uni-directional relationships
910 |
911 | The directed nature of edges naturally lends itself to expressing uni-directional relationships.
912 |
913 | 
914 |
915 | In SPARQL:
916 |
917 | ```
918 | PREFIX s: <http://example.org/s#>   # placeholder namespace IRI
919 |
920 | SELECT ?followees WHERE {
921 | s:p-1 s:follows ?followees
922 | }
923 | ```
924 |
925 | and:
926 |
927 | ```
928 | PREFIX s: <http://example.org/s#>   # placeholder namespace IRI
929 |
930 | SELECT ?followers WHERE {
931 | ?followers s:follows s:p-1
932 | }
933 | ```
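
The two directions can also be combined in a single query, for example to find accounts that p-1 follows and that follow p-1 back (a sketch):

```
PREFIX s: <http://example.org/s#>

# Accounts that have a mutual follow relationship with p-1.
SELECT ?mutual WHERE {
  s:p-1 s:follows ?mutual .
  ?mutual s:follows s:p-1
}
```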
934 |
935 | ### Multiple relationships between nodes
936 |
937 | With RDF you can connect any pair of nodes with multiple relationships that have different names. Connecting a pair of nodes with multiple relationships of the same name is slightly more complicated. RDF does not have a concept of relationship identity that would serve to distinguish predicate instances. To connect a pair of nodes with multiple relationships of the same name, you will have to introduce intermediate nodes, one per instance of the relationship. This is another example of [Reification](#reification); a sketch follows the diagram below.
938 |
939 | 
940 |
--------------------------------------------------------------------------------
/src/graph-data-modelling/bi-directional-relationships.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/bi-directional-relationships.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/data-modelling-process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/data-modelling-process.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/edge-labels.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/edge-labels.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/hub-and-spoke-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/hub-and-spoke-1.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/hub-and-spoke-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/hub-and-spoke-2.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/hub-and-spoke-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/hub-and-spoke-3.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/hub-and-spoke-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/hub-and-spoke-4.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/large-query.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/large-query.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/multiple-relationships.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/multiple-relationships.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-1.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-2-op-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-2-op-1.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-2-op-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-2-op-2.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-2-op-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-2-op-3.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle-3.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rdf-graph-development-lifecycle.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rei-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rei-1.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/rdf/rei-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/rdf/rei-2.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/small-query.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/small-query.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/thumbnail.png
--------------------------------------------------------------------------------
/src/graph-data-modelling/uni-directional-relationships.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/graph-data-modelling/uni-directional-relationships.png
--------------------------------------------------------------------------------
/src/writing-from-amazon-kinesis-data-streams/README.md:
--------------------------------------------------------------------------------
1 | # Writing to Amazon Neptune from an Amazon Kinesis Data Stream
2 |
3 | You can improve the reliability, performance and scalability of your application by reducing the coupling between components. If your application is composed of multiple distributed services, you can reduce coupling by introducing queues – message queues, streams, etc. – between components.
4 |
5 | When using Amazon Neptune in high write throughput scenarios, you can improve the reliability, performance and scalability of your application by sending logical writes from your client to an Amazon Kinesis Data Stream. An AWS Lambda function polls the stream and issues batches of writes to the underlying Neptune database.
6 |
7 | 
8 |
9 | ### Walkthrough of the Architecture
10 |
11 | 1. In this architecture your Neptune cluster runs in at least two subnets, each in a different Availability Zone. By distributing your cluster instances across at least two Availability Zones, you help ensure that there are instances available in your DB cluster in the unlikely event of an Availability Zone failure.
12 | 2. A Kinesis Data Stream is provisioned to accept write requests from client applications, which act as record producers.
13 | 3. Clients can use the [Amazon Kinesis Data Streams API](https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html) or [Kinesis Agent](https://docs.aws.amazon.com/streams/latest/dev/writing-with-agents.html) to write individual records to the data stream.
14 | 4. An AWS Lambda function [processes records in the data stream](https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html). Create a Lambda function and an event source mapping. The event source mapping tells Lambda to send records from your data stream to the Lambda function, which uses a Gremlin or SPARQL client to submit write requests to the Neptune cluster endpoint.
15 | 5. The Lambda function issues batch writes to Neptune. Each batch is executed in the context of a single transaction.
16 | 6. To increase the speed at which the function processes records, add shards to the data stream. Lambda processes records in each shard in order, and stops processing additional records in a shard if the function returns an error.
17 |
18 | ### Best Practices
19 |
20 | * See [Accessing Amazon Neptune from AWS Lambda Functions](../../src/accessing-from-aws-lambda) for details on deploying Lambda functions that write to Amazon Neptune.
21 | * This architecture is intended for scenarios in which a large number of clients trigger individual writes to the backend, as is often the case with many Web applications or mobile applications. In some circumstances you may want to aggregate write requests before submitting them to the Kinesis Data Stream: an IoT application in which many devices frequently emit small status updates falls into this category, for example. To implement record aggregation in the client, and deaggregation in your Lambda functions, use the [Kinesis Producer Library Deaggregation Modules for AWS Lambda](https://github.com/awslabs/kinesis-aggregation).
22 | * Consider pulling large batches from the stream (by configuring the batch size property in the [event source mapping](https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html#services-kinesis-eventsourcemapping)), but writing smaller batches to Neptune one after another in a single Lambda invocation. For example, pull 1000 records from the stream, and issue 10 batch writes, each of 100 records, to Neptune during a single Lambda invocation. This allows you to tune batch write size according to factors such as the instance size of the Neptune database leader node and the complexity of your writes, while reusing a connection for the several batch writes you issue to Neptune during a single Lambda invocation. Be aware, however, that if any of the batch writes to Neptune fails and the failure propagates outside the Lambda function, the entire batch that was pulled from the stream will be retried on the next invocation of the function.
23 | * Use idempotent writes to ensure the correct outcome irrespective of the number of times a write is attempted. If you are using Gremlin, you can use the [`coalesce()` step to implement idempotent writes](http://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#coaladdv).
24 | * You can control concurrency by adjusting the number of shards in your Kinesis Data Stream. For example, two shards will result in two concurrent Lambda invocations, one per shard. If you use shards to control concurrency, we recommend setting the number of shards to no more than 2 x the number of vCPUs on the Neptune leader node.
25 | * A shard is a throughput unit of Amazon Kinesis Data Streams, and the service is charged on [Shard Hours and PUT Payload Units](https://aws.amazon.com/kinesis/data-streams/pricing/). Increasing the number of shards in order to increase concurrency and throughput will therefore increase costs.
26 | * Alternatively, at the expense of additional engineering effort, you can increase concurrency using the threading model particular to your Lambda runtime.
27 | * Records in a Kinesis Data Stream are ordered per shard based on insert time. However, there is no total ordering of records within a stream with multiple shards. When using this architecture, either ensure that logical writes are wholly independent of one another such that they can be executed out of insert order, or direct dependent writes to the same shard using partition keys to group data by shard. If you are processing batches in a serial fashion within a Lambda function, you can maintain the insert order imposed by the shard with which the function is associated. If, however, you implement your own concurrency inside a Lambda function, writes to Neptune can end up being ordered differently from the order imposed by the shard.
28 |
29 | ### Learn More
30 |
31 | * Download or run an example of [Writing to Amazon Neptune from an Amazon Kinesis Data Stream](https://github.com/aws-samples/amazon-neptune-samples/tree/master/gremlin/stream-2-neptune)
--------------------------------------------------------------------------------
/src/writing-from-amazon-kinesis-data-streams/kinesis-neptune.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/writing-from-amazon-kinesis-data-streams/kinesis-neptune.png
--------------------------------------------------------------------------------
/src/writing-from-amazon-kinesis-data-streams/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-dbs-refarch-graph/cefcf5b255b7b526b982ed3448734e5f7c67aace/src/writing-from-amazon-kinesis-data-streams/thumbnail.png
--------------------------------------------------------------------------------