├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── TEST
├── img
│   └── architecture.png
└── lab-guides
    ├── activity1.md
    ├── activity10.md
    ├── activity2.md
    ├── activity3.md
    ├── activity4.md
    ├── activity5.md
    ├── activity6.md
    ├── activity7.md
    ├── activity8.md
    ├── activity9.md
    ├── appendixA.md
    ├── appendixB.md
    └── images
        ├── 1-1.png
        ├── 1-2.png
        ├── 1-3.png
        ├── 1-4.png
        ├── 1-5.png
        ├── 1.txt
        ├── 2-1.png
        ├── 2-2.png
        ├── 2-3.png
        ├── 3-1.png
        ├── 3-2.png
        ├── 3-3.png
        ├── 4-1.png
        ├── 4-2.png
        ├── 5-1.png
        ├── 5-10.png
        ├── 5-11.png
        ├── 5-2.png
        ├── 5-3.png
        ├── 5-4.png
        ├── 5-5.png
        ├── 5-6.png
        ├── 5-7.png
        ├── 5-8.png
        ├── 5-9.png
        ├── 6-1.png
        ├── 6-2.png
        ├── 7-1.png
        ├── 7-2.png
        ├── 7-3.png
        ├── 7-4.png
        ├── 7-5.png
        ├── 8-1.png
        ├── 8-10.png
        ├── 8-11.png
        ├── 8-12.png
        ├── 8-13.png
        ├── 8-14.png
        ├── 8-15.png
        ├── 8-16.png
        ├── 8-17.png
        ├── 8-2.png
        ├── 8-3.png
        ├── 8-4.png
        ├── 8-5.png
        ├── 8-6.png
        ├── 8-7.png
        ├── 8-8.png
        ├── 8-9.png
        ├── 9-1.png
        ├── 9-10.png
        ├── 9-11.png
        ├── 9-2.png
        ├── 9-3.png
        ├── 9-4.png
        ├── 9-5.png
        ├── 9-6.png
        ├── 9-7.png
        ├── 9-8.png
        ├── 9-9.png
        ├── a-1.png
        ├── a-2.png
        ├── a-3.png
        ├── a-4.png
        ├── a-5.png
        ├── a-6.png
        ├── a-7.png
        ├── b-1.png
        ├── b-10.png
        ├── b-11.png
        ├── b-12.png
        ├── b-13.png
        ├── b-14.png
        ├── b-15.png
        ├── b-16.png
        ├── b-2.png
        ├── b-3.png
        ├── b-4.png
        ├── b-5.png
        ├── b-6.png
        ├── b-7.png
        ├── b-8.png
        └── b-9.png
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *master* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | the Software, and to permit persons to whom the Software is furnished to do so.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Workshop - Using AWS Lake Formation ML Transforms to cleanse the data in a data lake
2 |
3 | ### Background
4 | Customers ingest data from multiple sources into their data lakes. This data often has the same meaning but uses different labels/names, which can take months to cleanse, slowing down the data processing and analytics cycle.
5 |
6 | ### Introduction
7 | In this lab, participants will go through various steps to provision a data lake, catalog data in AWS Glue Data Catalog and use AWS Lake Formation ML Transforms to cleanse the data in a data lake.
8 |
9 | ### Use Case
10 | Happy HealthCare’s marketing team receives data from multiple EHR providers and uses it to count patients within a specific geography. Because the data comes from different EHR providers and the SSN is optional, it is difficult for Happy HealthCare to identify unique customers.
11 |
12 | Happy HealthCare decided to use AWS Lake Formation ML Transforms to identify potential duplicates in the data. They then use Amazon Athena and Amazon QuickSight to identify patient density in a specific geographic area, which helps them identify potential customers.
13 |
14 | ### Lab Data Set
15 | We use a synthetic dataset generated with the open-source Freely Extensible Biomedical Record Linkage (FEBRL) program. This dataset mimics real-life patient datasets that contain duplicates. You can also generate your own dataset by following the steps at https://github.com/J535D165/FEBRL-fork-v0.4.2/tree/master/dsgen or in [this](https://aws.amazon.com/blogs/big-data/matching-patient-records-with-the-aws-lake-formation-findmatches-transform/) AWS blog. For simplicity, the sample dataset is copied automatically into an Amazon S3 bucket that is created as part of the CloudFormation launch.
16 |
17 | ### Architecture
18 | Below is the high-level architecture that you will implement as part of this lab.
19 |
20 | 
21 |
22 |
23 | ### Activities
24 | The table below summarizes the activities involved in creating a data lake and using AWS Lake Formation ML Transforms to deduplicate its data.
25 |
26 | **In this lab, we start with setting up and registering a data lake using AWS Lake Formation and then go all the way to analyze, deduplicate and query the data in a data lake.**
27 |
28 | | No. | Activity | User |
29 | | --- | ------------- | ------------- |
30 | | 1 | Provision the following:<br>- Data lake on Amazon S3<br>- Data lake administrator<br>- Data lake analyst/developer<br>- Data lake and glue roles | IAM Administrator |
31 | | 2 | Register a data lake | Data Lake Administrator |
32 | | 3 | Assign Lake Formation Permissions to data lake analyst | Data Lake Administrator |
33 | | 4 | Create AWS Glue Database | Data Lake Administrator |
34 | | 5 | Crawl and catalog Patient data in AWS Glue | Data Lake Analyst |
35 | | 6 | Log back in as data lake administrator and assign table permissions to the data lake analyst | Data Lake Administrator |
36 | | 7 | Observe the data pattern and duplicates in data using Amazon Athena | Data Lake Analyst |
37 | | 8 | Create, teach and Tune an AWS Lake Formation ML Transform | Data Lake Analyst |
38 | | 9 | Create an AWS Glue ETL Job to use ML Transform for data deduplication | Data Lake Analyst |
39 | | 10 | Catalog de-duplicated data and query using Amazon Athena | Data Lake Analyst |
40 |
41 |
42 | ## Workshop Activities
43 |
44 | 1. [Provision a data lake, data lake administrator and data lake analyst](lab-guides/activity1.md)
45 | 2. [Register a data lake](lab-guides/activity2.md)
46 | 3. [Assign Lake Formation Permissions to data lake analyst and service role](lab-guides/activity3.md)
47 | 4. [Create AWS Glue Database](lab-guides/activity4.md)
48 | 5. [Crawl and catalog Patient data in AWS Glue](lab-guides/activity5.md)
49 | 6. [Login as Data Lake Administrator and assign Table permissions](lab-guides/activity6.md)
50 | 7. [Observe the data pattern and duplicates in data](lab-guides/activity7.md)
51 | 8. [Create, teach and Tune an AWS Lake Formation ML Transform](lab-guides/activity8.md)
52 | 9. [Create and Run Glue ETL Job to use ML Transform for finding duplicates](lab-guides/activity9.md)
53 | 10. [Catalog de-duplicated data and query using Amazon Athena](lab-guides/activity10.md)
54 |
55 |
56 | ## License
57 |
58 | This library is licensed under the MIT-0 License. See the LICENSE file.
59 |
60 |
--------------------------------------------------------------------------------
/TEST:
--------------------------------------------------------------------------------
1 | TEST
2 |
--------------------------------------------------------------------------------
/img/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/img/architecture.png
--------------------------------------------------------------------------------
/lab-guides/activity1.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity2.md)
2 |
3 | ____
4 |
5 | ## 1. Provision a data lake, data lake administrator and data lake analyst
6 |
7 | ### a. Launch CloudFormation Template
8 |
9 | Launch the CloudFormation stack in one of the AWS regions listed below. Other regions are also supported.
10 |
11 | We recommend launching the CloudFormation template as a user with administrator privileges.
12 |
13 | Region | Launch
14 | -------|-----
15 | US East (N. Virginia) | [Launch Stack](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=aws-lf-ml-transform-wrk&templateURL=https://re-invent2019-lakeformation-ml.s3-us-west-2.amazonaws.com/cloudformation/lf-ml-devendpoint.template)
16 | US West (Oregon) | [Launch Stack](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=aws-lf-ml-transform-wrk&templateURL=https://re-invent2019-lakeformation-ml.s3-us-west-2.amazonaws.com/cloudformation/lf-ml-devendpoint.template)
17 |
18 | Accept all default values and click **Next**. On the last page, select the checkbox **I acknowledge that AWS CloudFormation might create IAM resources with custom names** and click **Create Stack**.
19 | Wait for the CloudFormation stack creation to **Complete**.
20 |
21 | The CloudFormation template creates the following resources:
22 | - Data Lake Administrator user (dladmin)
23 | - Data Lake Analyst (dlanalyst)
24 | - S3 Bucket with Sample Patient Dataset having duplicates. **(Use this S3 bucket throughout the lab. The one shown in the screenshots is for reference only.)**
25 | - Labeling file used in Activity#8
26 | - Glue Development Endpoint
27 | - SageMaker Notebook instance with Spark ETL code
28 | - IAM Role for AWS Glue and Lake Formation
29 |
30 | **NOTE: Password for dladmin and dlanalyst users is set to "welcome".**
31 |
32 | ### b. Setup Data Lake Administrator
33 | Navigate to Lake Formation Dashboard from AWS Management Console.
34 | 
35 | The first time you navigate to the Lake Formation Dashboard page, you are prompted to create a Data Lake Administrator. Click on **“Add administrators”**.
36 |
37 |
38 |
39 | OR
40 |
41 | i) While you are logged in as an IAM Admin user
42 |
43 | ii) Go to Lake Formation Console **→ Admins and database creators → Data lake administrators → Grant**
44 |
45 | iii) Click on **Add Administrators**
46 |
47 |
48 |
49 | iv) Select the Data Lake Administrator as **“dladmin”** user and click on **Save**.
50 |
51 |
52 |
53 | v) Another recommended change is to go to Lake Formation Console **→ Data Catalog → Settings**, **uncheck** both boxes as shown below, and click the **Save** button.
54 |
55 |
56 |
57 |
58 | After this step, you will not use this IAM user again. Instead, you will use the **dladmin** user as the Data Lake Administrator and the **dlanalyst** user as the data lake analyst/developer.
59 |
60 | ___
61 |
62 | [Back to main guide](../README.md) | [Next](activity2.md)
63 |
--------------------------------------------------------------------------------
/lab-guides/activity10.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](appendixA.md)
2 |
3 | ___
4 |
5 | ## 10. Catalog de-duplicated data and query using Amazon Athena
6 |
7 | Your deduplicated data is now available under the transformresult folder in the S3 bucket.
8 |
9 | Follow steps similar to those in **Activity#5, 6 and 7** to
10 |
11 | a) **Crawl and Catalog** data in AWS Glue
12 |
13 | b) Login as Data Lake Administrator and **assign** the dlanalyst **permission** to **Select** the data
14 |
15 | c) Query the data using **Amazon Athena**
16 |
17 | NOTE: The data generated by the ETL job contains quotes, so you need to edit the table properties. Screenshots for the above steps, including how to remove the quotes from the result set, are provided in [Appendix B](appendixB.md).
18 |
19 | You can query data using Amazon Athena and export the results to see the deduplicated results.
20 |
21 | ___
22 |
23 | [Back to main guide](../README.md) | [Next](appendixA.md)
24 |
--------------------------------------------------------------------------------
/lab-guides/activity2.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity3.md)
2 | ___
3 |
4 | ## 2. Register a data lake
5 |
6 | The Data Lake Administrator registers the data lake location so that, when you point Lake Formation at your data sources, Lake Formation can crawl those sources and access the data in your new Amazon Simple Storage Service (Amazon S3) data lake.
7 |
8 | a) Login as a data lake administrator **dladmin**
9 |
10 | b) To register a data lake location, Navigate to Lake Formation Console **→ Under Register and Ingest on the left side → Data Lake locations** → Click on **Register Location.**
11 |
12 | c) Select the **S3 bucket name** created for you and **patientdata** folder under it. **Please use this S3 bucket name throughout the lab.**
13 |
14 |
15 |
16 | d) For IAM Role, select the role **AWSGlueServiceRole-LF-MLLab**
17 |
18 |
19 |
20 | e) Click on **Register Location**. You will see a confirmation that the S3 path was registered successfully.
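
The registration steps above can also be scripted. The following is a minimal sketch, assuming the boto3 Lake Formation client and its `register_resource` call. The helper names `build_register_request` and `register_location` are hypothetical, and the bucket name and role ARN are placeholders that you would replace with the values created by your CloudFormation stack.

```python
def build_register_request(bucket: str, role_arn: str, prefix: str = "patientdata") -> dict:
    """Build keyword arguments for lakeformation.register_resource.

    `bucket` and `role_arn` are placeholders supplied by the caller; the
    `patientdata` prefix mirrors the folder selected in step c).
    """
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}/{prefix}",
        "UseServiceLinkedRole": False,  # register with the lab's IAM role instead
        "RoleArn": role_arn,
    }


def register_location(bucket: str, role_arn: str) -> None:
    """Register the S3 location (requires boto3 and data lake administrator credentials)."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("lakeformation")
    client.register_resource(**build_register_request(bucket, role_arn))
```

Separating the request builder from the API call lets you inspect the request before anything is sent to AWS.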
21 |
22 |
23 |
24 |
25 | ___
26 |
27 | [Back to main guide](../README.md) | [Next](activity3.md)
28 |
--------------------------------------------------------------------------------
/lab-guides/activity3.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity4.md)
2 | ___
3 |
4 | ## 3. Assign Lake Formation Permissions to data lake analyst and service role
5 |
6 | ### a) Grant database creation permission to data analyst
7 |
8 | The data lake administrator must grant privileges to an IAM principal, such as a user or role, before that principal can perform CRUD operations on objects (databases/tables) in the data catalog.
9 |
10 | i) Login as a Data Lake Administrator – dladmin
11 |
12 | ii) In the Lake Formation console, navigate to **Admins and database creators** under the Permissions section **→ under Database Creators section Click on Grant
13 | → Select user dlanalyst** and check the **Create database option** under Catalog permissions.
14 |
15 |
16 |
17 | iii) Click on **Grant** to save the Create database permission for the **dlanalyst**
18 |
19 |
20 |
21 |
22 | ### b) Grant Data Location Permission to data analyst and AWS IAM Service Role
23 |
24 | You also need to grant dlanalyst permission on the data lake storage location.
25 |
26 | i) In Lake Formation console, navigate to **Permissions → Data locations** → Click on **Grant**
27 |
28 | ii) Select the dlanalyst as well as AWS Service Role **‘AWSGlueServiceRole-LF-MLLab’** from **IAM users and roles** dropdown
29 |
30 | iii) In the Storage location field select **‘\<>/patientdata’**
31 |
32 | iv) Click on **Grant** to save the changes
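
The two grants in this activity could be expressed programmatically as follows. This is a sketch under stated assumptions: it uses the boto3 Lake Formation `grant_permissions` call, the helper names are hypothetical, and the principal ARNs and bucket name are placeholders for the values in your account.

```python
def catalog_grant(principal_arn: str) -> dict:
    """Request that lets a principal create databases in the Data Catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Catalog": {}},
        "Permissions": ["CREATE_DATABASE"],
    }


def data_location_grant(principal_arn: str, bucket: str, prefix: str = "patientdata") -> dict:
    """Request that gives a principal access to the registered S3 location."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket}/{prefix}"}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


def apply_grants(grants: list) -> None:
    """Send each grant to Lake Formation (requires boto3 and dladmin credentials)."""
    import boto3  # imported lazily so the builders can be used offline

    client = boto3.client("lakeformation")
    for grant in grants:
        client.grant_permissions(**grant)
```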
33 |
34 |
35 |
36 |
37 | ___
38 |
39 | [Back to main guide](../README.md) | [Next](activity4.md)
40 |
41 |
--------------------------------------------------------------------------------
/lab-guides/activity4.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity5.md)
2 | ___
3 |
4 | ## 4. Create AWS Glue Database
5 | An AWS Glue database is a collection of tables in the AWS Glue Data Catalog. While **logged in as Data Lake Administrator (dladmin)**, let us create the database and add permissions for **dlanalyst** and **Service Role** on this database.
6 |
7 | a) Navigate to Lake Formation Console → **Data Catalog section → Databases → Create database**
8 |
9 | b) Give **Name** as **patientdb**
10 |
11 | c) Click on **Create Database**
12 |
13 |
14 |
15 | d) Navigate to **Permissions → Data Permissions → Grant**
16 |
17 | e) Select User **dlanalyst** as well as service role **‘AWSGlueServiceRole-LF-MLLab’**
18 |
19 | f) Select **database** as **patientdb**
20 |
21 | g) Under **Database permissions** select **Super**
22 |
23 | h) Click on **Grant** to save the changes
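
As a rough scripted equivalent of the steps above, the sketch below creates the database with the boto3 Glue client and grants database permissions through Lake Formation. It assumes that the console's **Super** permission corresponds to `ALL` in the API; the helper names are hypothetical and the principal ARNs are placeholders.

```python
def create_database_request(name: str = "patientdb") -> dict:
    """Keyword arguments for glue.create_database."""
    return {"DatabaseInput": {"Name": name}}


def database_grant_request(principal_arn: str, database: str = "patientdb") -> dict:
    """Keyword arguments for lakeformation.grant_permissions on a database.

    Assumption: "Super" in the console maps to the "ALL" permission in the API.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["ALL"],
    }


def create_and_grant(principal_arns: list) -> None:
    """Create the database and grant each principal (requires boto3 and dladmin credentials)."""
    import boto3  # imported lazily so the builders can be used offline

    boto3.client("glue").create_database(**create_database_request())
    lakeformation = boto3.client("lakeformation")
    for arn in principal_arns:
        lakeformation.grant_permissions(**database_grant_request(arn))
```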
24 |
25 |
26 |
27 | ___
28 |
29 | [Back to main guide](../README.md) | [Next](activity5.md)
30 |
31 |
--------------------------------------------------------------------------------
/lab-guides/activity5.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity6.md)
2 | ___
3 |
4 | ## 5. Crawl and catalog Patient data in AWS Glue
5 | The FindMatches ML Transform used in this lab operates on tables defined in the AWS Glue Data Catalog. Use AWS Glue crawlers to discover and catalog the patient data.
6 |
7 | a) Login as **dlanalyst**
8 |
9 | b) Navigate to Lake Formation Console **→ Data Catalog → Tables → Create Table using a crawler**
10 |
11 |
12 |
13 | Click on **Get Started** in case you are prompted.
14 |
15 |
16 |
17 | c) Select **Add tables using a crawler**
18 |
19 | d) Give Crawler name as **patient-raw-ds-cr** and click **Next**
20 |
21 |
22 |
23 |
24 | e) Select Data stores on the next page
25 |
26 |
27 |
28 | f) Select **S3** in **Choose Data Store** dropdown
29 |
30 | g) Select **\/patientdata/rawdata** as Include path
31 |
32 |
33 |
34 | h) Click **Next**
35 |
36 |
37 |
38 | i) Select **Choose an existing IAM role** and select **AWSGlueServiceRole-LF-MLLab**
39 |
40 |
41 |
42 | j) Click **Next**
43 |
44 |
45 |
46 | k) Select **Database** as **patientdb** and click **Next**
47 |
48 |
49 |
50 |
51 |
52 | l) Click **Finish** button, select the crawler and click **Run Crawler**
53 |
54 |
55 |
56 |
57 | Once the crawler has run successfully, it creates the table in the AWS Glue Data Catalog database. Next, the data lake user will want to query the data in the data lake through the data catalog tables, which requires permission from Lake Formation. In the next activity, we log in as the Data Lake Administrator and give ‘dlanalyst’ permission to Select, Insert, Update and Delete data in the table in the AWS Glue Data Catalog.
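
For reference, the crawler configured in this activity could be defined in code roughly as shown here. This is a minimal sketch, assuming the boto3 Glue client's `create_crawler` and `start_crawler` calls; the helper names are hypothetical, the bucket name is a placeholder, and the include path assumes the `patientdata/rawdata` folder sits directly under your lab bucket.

```python
def crawler_request(bucket: str, role: str = "AWSGlueServiceRole-LF-MLLab") -> dict:
    """Keyword arguments for glue.create_crawler, mirroring the console steps."""
    return {
        "Name": "patient-raw-ds-cr",
        "Role": role,
        "DatabaseName": "patientdb",
        # Assumed include path: the rawdata folder under the lab bucket.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/patientdata/rawdata"}]},
    }


def create_and_run_crawler(bucket: str) -> None:
    """Create the crawler and start it (requires boto3 and dlanalyst credentials)."""
    import boto3  # imported lazily so the request builder can be used offline

    glue = boto3.client("glue")
    request = crawler_request(bucket)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```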
58 |
59 | ___
60 |
61 | [Back to main guide](../README.md) | [Next](activity6.md)
62 |
63 |
64 |
--------------------------------------------------------------------------------
/lab-guides/activity6.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity7.md)
2 | ___
3 |
4 | ## 6. Login as Data Lake Administrator and assign Table permissions
5 |
6 | a) Login as data lake administrator user **dladmin**
7 |
8 | b) From Lake Formation Console, navigate to **Permissions → Data Permissions → Grant**
9 |
10 |
11 |
12 | c) Select user **dlanalyst**
13 |
14 | d) Under **Database** section, select **patientdb**
15 |
16 | e) Under **Table** section, select **rawdata**
17 |
18 | f) Under **Table permissions** section, select **Super**
19 |
20 | g) Click on **Grant** to save the changes
21 |
22 |
23 |
24 |
25 | ___
26 |
27 | [Back to main guide](../README.md) | [Next](activity7.md)
28 |
29 |
--------------------------------------------------------------------------------
/lab-guides/activity7.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity8.md)
2 | ___
3 |
4 | ## 7. Observe the data pattern and duplicates in data
5 |
6 | You can now query data in Data Lake using Amazon Athena. Perform the below steps:
7 |
8 | a) Login as **dlanalyst** if not already
9 |
10 | b) Navigate to **Lake Formation Console → Data Catalog → Tables**
11 |
12 | c) Select the **rawdata** table → **Actions → View data**
13 |
14 | d) In the **Athena console** → select database as **patientdb → tables rawdata → options Preview**
15 |
16 | e) You can download the csv by removing the **limit** clause in the select statement and clicking on the **Download the results** icon in the Results section
17 |
18 |
19 |
20 |
21 | NOTE:
22 | After landing on the Athena console, if you get an error or the query doesn’t run, click on the **Set up a query result location in Amazon S3** link and enter the value as **s3://\<\\>/query/**
23 |
24 | **Note the trailing slash in the above path!**
25 |
26 |
27 |
28 |
29 |
30 |
31 | f) Click on **Save** and run the query. You can download the entire dataset by removing the **“limit 10”** clause from the SQL, running the query again, and clicking on the **download** icon as highlighted below.
32 |
33 |
34 |
35 | g) Open the downloaded file in Excel, sort by patient_id and observe the duplicates
36 |
37 |
38 |
39 | The colors in the above table highlight different groups, each consisting of an original patient record together with its duplicates. The patient_id values are generated in a specific format that helps us identify such groups. The format is “rec-\-org/dup-\”, as produced by the FEBRL data generation tool.
40 | As a next step, we will create, teach and tune AWS Lake Formation FindMatches ML Transform and then use it in the Glue ETL job to find matches and/or remove the duplicates.
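
If you prefer to inspect the duplicate groups in code rather than in a spreadsheet, the sketch below groups patient_id values by their record number. It assumes identifiers of the form `rec-<n>-org` for an original record and `rec-<n>-dup-<m>` for its duplicates, which is how we understand FEBRL-generated identifiers to look; the function name `group_by_original` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed identifier format: "rec-<n>-org" for an original record and
# "rec-<n>-dup-<m>" for a duplicate derived from it.
_ID_PATTERN = re.compile(r"^rec-(\d+)-(org|dup-\d+)$")


def group_by_original(patient_ids):
    """Return {record number: sorted ids} for records that have more than one id."""
    groups = defaultdict(list)
    for patient_id in patient_ids:
        match = _ID_PATTERN.match(patient_id)
        if match is None:
            continue  # skip ids that do not follow the assumed format
        groups[int(match.group(1))].append(patient_id)
    return {number: sorted(ids) for number, ids in groups.items() if len(ids) > 1}
```

Because the record number is shared by an original and its duplicates, each returned group corresponds to one colored block in the sorted spreadsheet.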
41 |
42 |
43 |
44 | ___
45 |
46 | [Back to main guide](../README.md) | [Next](activity8.md)
47 |
--------------------------------------------------------------------------------
/lab-guides/activity8.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity9.md)
2 | ___
3 |
4 | ## 8. Create, teach and Tune an AWS Lake Formation ML Transform
5 |
6 | ### a) Create FindMatches ML Transform
7 |
8 | i) Login as **dlanalyst**
9 |
10 | ii) Navigate to AWS Glue Console → On the left side, under **ETL → Jobs → ML Transforms**
11 |
12 | iii) Click on **Add Transform**
13 |
14 |
15 |
16 | iv) Specify **patient-data-ml-transform** as **Transform name**
17 |
18 | v) IAM Role as **AWSGlueServiceRole-LF-MLLab**
19 |
20 | vi) Expand **Task Run Properties** section
21 |
22 | vii) Select Worker Type as **G.1X (Recommended)**
23 |
24 | viii) Enter **Number of Workers** as **5**
25 |
26 | ix) **Glue Version** as **Spark 2.2 (Glue Version 0.9)**
27 |
28 | x) Keep other values as default and click on **Next**
29 |
30 |
31 |
32 |
33 | xi) Select **rawdata** as a **Data Source** and click **Next**
34 |
35 |
36 |
37 | xii) Select **patient_id** as a **primary key** in the next page
38 |
39 |
40 |
41 | xiii) In the **Tune Transform** step, select **Custom** for **Recall vs Precision** and specify the value of **0.9**
42 |
43 | xiv) Also, for **Lower cost vs Accuracy** select the **Custom** field and specify its value as **1**
44 |
45 | We have specified these values to achieve the best results. If needed, you can later tweak these values by selecting the transform and using the Tune menu.
46 |
47 |
48 |
49 |
50 | xv) Review the values and click **Finish**
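
The console settings above map onto a single API request. The following is a sketch, assuming the boto3 Glue client's `create_ml_transform` call and assuming that the console's **Recall vs Precision** and **Lower cost vs Accuracy** values correspond to `PrecisionRecallTradeoff` and `AccuracyCostTradeoff`. The helper names are hypothetical, and the Glue version is left at the service default rather than set explicitly.

```python
def find_matches_request(role: str = "AWSGlueServiceRole-LF-MLLab") -> dict:
    """Keyword arguments for glue.create_ml_transform, mirroring the console values."""
    return {
        "Name": "patient-data-ml-transform",
        "Role": role,
        "InputRecordTables": [{"DatabaseName": "patientdb", "TableName": "rawdata"}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": "patient_id",
                "PrecisionRecallTradeoff": 0.9,  # assumed mapping of "Recall vs Precision"
                "AccuracyCostTradeoff": 1.0,     # assumed mapping of "Lower cost vs Accuracy"
            },
        },
        "WorkerType": "G.1X",
        "NumberOfWorkers": 5,
    }


def create_transform() -> str:
    """Create the transform and return its id (requires boto3 and dlanalyst credentials)."""
    import boto3  # imported lazily so the request builder can be used offline

    response = boto3.client("glue").create_ml_transform(**find_matches_request())
    return response["TransformId"]
```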
51 |
52 |
53 |
54 | ### b) Teach transform to identify the duplicates
55 | In this step we will teach the transform by providing labeled examples of matching and non-matching records.
56 | You can create the labeling set yourself or allow AWS Glue to generate the labeling set based on heuristics.
57 | AWS Glue extracts records from your source data and suggests potential matching records. The file will contain approximately 100 data samples for you to work with. We highly recommend using the “Generate the Labeling file” feature to create the training set to teach your Transform.
58 |
59 | i) Select **patient-data-ml-transform** Transform
60 |
61 | ii) Select **Action → Teach transform**
62 |
63 |
64 |
65 | iii) Select **I do not have labels** and click on **Generate labeling file**
66 |
67 |
68 |
69 |
70 | iv) Select the S3 location up to the labeldata folder and append **“/download”** to it. This is where the generated labeling file will be stored. Then click **Generate**
71 |
72 |
73 |
74 | v) It takes approximately 10 minutes for AWS Glue to generate the labeling file. Once the button is enabled, click on **Download labeling file.**
75 |
76 | **NOTE: Instead of waiting at this point, you can proceed with step c) below of Uploading the label file from S3. This will save time.**
77 | In case you want to take a look at the similar labelling file that gets generated, navigate to **Amazon S3 Console → \<\\>/patientdata/labeldata/** and **download** the **“labeled-dataset-200.csv”** file.
78 |
79 |
80 |
81 | The labelled data file that is generated has the **label column empty** as shown below:
82 |
83 |
84 |
85 |
86 | Notice that there are 2 additional columns added, **labeling_set_id** and **label**.
87 |
88 | You will need to populate the label column explicitly by marking the records that are a real match with the same value. Each labelling set should contain positive and negative match examples.
89 |
90 | This file is fully ready for consumption. However, let’s go a little deeper into its structure, so that you know how to prepare and label data for your matching projects. The label column is empty in the generated file and you need to fill this in like below:
91 |
92 |
93 |
94 | The entire training dataset is divided into labeling sets. Each labeling set displays a labeling_set_id value. This identification simplifies the labeling process, enabling you to focus on the match relationship of records within the same labeling set, rather than having to scan the entire file. **You would assign labels according to which records should match based on the attribute values.**
95 |
96 | **If you specify the same label value for two or more records within a labeling set, you teach FindMatches to consider these records a match.** On the other hand, when two or more records have different labels within the same labeling set, FindMatches learns that these records aren’t considered a match. FindMatches evaluates record relationships only between records within the same labeling set, not across labeling sets.
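
To make the labeling rule concrete, the sketch below derives the match and non-match pairs that a labeled file implies. It is an illustration of the rule described above, not FindMatches' internal implementation; the function name `labeled_pairs` and the tuple layout of the input rows are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def labeled_pairs(rows):
    """Derive implied pairs from (labeling_set_id, label, record_id) rows.

    Returns (matches, non_matches) as sets of frozenset pairs. Pairs are formed
    only between records that share a labeling_set_id.
    """
    by_set = defaultdict(list)
    for labeling_set_id, label, record_id in rows:
        by_set[labeling_set_id].append((label, record_id))

    matches, non_matches = set(), set()
    for members in by_set.values():
        for (label_a, rec_a), (label_b, rec_b) in combinations(members, 2):
            pair = frozenset((rec_a, rec_b))
            if label_a == label_b:
                matches.add(pair)      # same label within a set: a match
            else:
                non_matches.add(pair)  # different labels within a set: not a match
    return matches, non_matches
```

Note that two records carrying the same label in different labeling sets produce no pair at all, which is why label values can be reused from one set to the next.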
97 |
98 | Plan to label a few hundred records to achieve modest match quality. Plan to label a few thousand records to achieve high match quality.
99 |
100 | For your convenience, we have already created a sample labeling file under the S3 bucket created by the CloudFormation stack.
101 |
102 | ### c) Upload your labels and review match quality
103 |
104 | After you create the labeled dataset, teach FindMatches where to find it.
105 |
106 | i) In the **AWS Glue console**, select the transform that you created earlier.
107 |
108 | ii) Choose **Action → Teach transform**
109 |
110 | iii) On the following page, select **I have labels**, choose **Upload labeling file from S3**, and then choose **Next**.
111 |
112 |
113 |
114 |
115 |
116 | iv) On the **Estimate quality metrics** page, click on **Estimate transform quality** and **Finish**
117 |
118 |
119 |
120 | v) You can go back to the **Glue Console → Select the ML Transform → Under History tab, monitor the status of Estimate Transform Quality task**
121 |
122 |
123 |
124 | vi) The **Match Quality** operation may take some time to complete. Once the status is **Succeeded**, click on the **Estimate Quality** tab. You should see the transform quality results as shown below.
125 |
126 |
127 |
128 |
129 | **NOTE: Instead of waiting on this step, you can read the instructions below and proceed to Step#9. Later, come back and verify the results for Estimate Quality operation.**
130 |
131 | The transform quality estimate trains on 70% of your labels. After training, it tests how well the transform learned to identify matching records against the remaining 30%. Finally, it generates quality metrics by comparing the matches and non-matches predicted by the algorithm with your actual labels. This process may take several minutes.
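
As an illustration of how predicted and labeled matches can be compared, the sketch below computes precision, recall and F1 over sets of record pairs using the standard definitions. This is not necessarily how AWS Glue computes its reported metrics; the function name `pair_metrics` and the pair representation are assumptions made for this example.

```python
def pair_metrics(predicted: set, actual: set) -> dict:
    """Compare predicted match pairs with labeled match pairs."""
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)
    false_negatives = len(actual - predicted)

    predicted_total = true_positives + false_positives
    actual_total = true_positives + false_negatives
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

High precision means few predicted matches are wrong, while high recall means few true matches are missed.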
132 |
133 | Your results should look like those in the above screenshot. Consider these metrics approximate, as the test uses only a small subset of data for estimating quality. If you’re satisfied with the metrics, proceed with creating and running a record matching job. Or, to improve matching quality further, upload more labeled records.
134 |
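To make the 70/30 evaluation described above more concrete, here is a small plain-Python illustration. It is only a sketch of the general idea and not how AWS Glue implements it internally: the helper names `split_labels` and `quality_metrics` are hypothetical, and precision, recall, and F1 are assumed here as typical examples of match-quality metrics.

```python
import random


def split_labels(labeled_pairs, train_fraction=0.7, seed=42):
    """Shuffle labeled pairs and split them into training and test portions."""
    shuffled = list(labeled_pairs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def quality_metrics(predicted, actual):
    """Compare predicted match flags with the labels on the held-out pairs.

    Both arguments are parallel lists of booleans; True means "match".
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    train, test = split_labels(range(10))
    print(len(train), "pairs for training,", len(test), "pairs for testing")
    print(quality_metrics([True, True, False, False], [True, False, True, False]))
```

Because the test portion is small, a handful of mislabeled pairs can move these numbers noticeably, which is why the metrics should be read as approximate.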
135 | ___
136 |
137 | ## Next Step: ETL
138 |
139 | The next step is to run the ETL job to find duplicates in the **rawdata** table from the Glue Data Catalog.
140 | 
141 | In this lab, you have two options for this:
142 |
143 | | | Options | Considerations |
144 | | --- | ------ | ------ |
145 | | 1 | Using the AWS Glue console, run the Glue ETL job as per [Activity#9](activity9.md) and continue through the exercise | * This needs initial start-up time (cold start) for the underlying Spark cluster to spin up<br>* Recommended for production deployments |
146 | | 2 | Use AWS Glue Development Endpoints.<br>Follow **[Appendix A](appendixA.md)** to run PySpark code from a SageMaker Jupyter Notebook and then follow **[Appendix B](appendixB.md)** to catalog and query the matched/deduplicated data | * No start-up time. The Development Endpoint is already provisioned as part of the lab, and ETL job execution can begin immediately.<br>* Iterative ETL development through various kernels available in the SageMaker Notebook instance<br>* Recommended for the Glue ETL development phase |
147 |
148 |
149 |
150 | ___
151 |
152 | [Back to main guide](../README.md) | [Next](activity9.md)
153 |
--------------------------------------------------------------------------------
/lab-guides/activity9.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](activity10.md)
2 |
3 | ___
4 |
5 | ## 9. Create and Run Glue ETL Job to use ML Transform for finding duplicates
6 |
7 | After you create a FindMatches transform and verify that it has learned to identify matching records in your data, you’re ready to identify matches and deduplicate your complete dataset.
8 |
9 | a) Log in as **dlanalyst**
10 |
11 | b) In the **AWS Glue console**, in the left navigation pane, choose **Jobs**, and then choose **Add job**.
12 | 
13 | c) Under **Configure the job properties**, enter the **Name** as **patient-data-dedup-job**
14 |
15 | d) Select IAM Role as **AWSGlueServiceRole-LF-MLLab**
16 |
17 | e) Select **Type** as **Spark**
18 |
19 | f) Select **Glue version** as **Spark 2.2, Python 2 (Glue version 0.9)**
20 |
21 | g) Keep defaults for other values and click on **Next**
22 |
23 |
24 |
25 |
26 | h) Select **rawdata** as a **data source**
27 |
28 |
29 |
30 | i) On the next page, select **Find Matching Records** and **check** the option **Remove Duplicate Records**
31 |
32 | j) Select **Worker Type** as **G.1X**
33 |
34 | k) Specify **Number of workers** as **5**
35 |
36 | l) Click **Next**
37 |
38 |
39 |
40 | m) On the next page, select **patient-data-ml-transform** and click **Next**
41 |
42 |
43 |
44 | n) On the next page, select **Create tables in your data target**
45 |
46 | o) Choose **Amazon S3** as a data store
47 |
48 | p) Select Format as **CSV**
49 |
50 | q) Specify the Target path as **\/patientdata/transformresult**
51 |
52 | r) Click **Save job and edit script**
53 |
54 |
55 |
56 | s) Click **Run** to run the Glue ETL job. The auto-generated script uses the **FindMatches ML Transform** to identify duplicate records and then **removes** them.
57 |
58 |
59 |
60 | Below is the code snippet that uses the FindMatches transform to identify matching records.
61 |
62 |
63 |
64 | The code below removes the duplicate records and keeps the one with the lowest primary key value. You may modify the logic to remove other records or even merge the records.
65 |
66 |
67 |
68 | t) Click on **Run Job**
69 |
70 |
71 |
72 | u) Monitor the progress of the Job from the Glue Console
73 |
74 |
75 |
76 | v) Once the job status is **Succeeded**, you can verify the files under the **transformresult** folder in the S3 bucket
77 |
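The deduplication rule described in step s) (keep the record with the lowest primary key in each group of matches) can be illustrated without Spark. The sketch below is plain Python rather than the auto-generated PySpark script, and it assumes that each record carries a `match_id` identifying its match group; the primary key column name `patient_id` and the sample rows are hypothetical.

```python
def keep_lowest_key_per_match(records, key_field="patient_id", match_field="match_id"):
    """Return one record per match group: the one with the smallest primary key."""
    survivors = {}
    for record in records:
        group = record[match_field]
        best = survivors.get(group)
        if best is None or record[key_field] < best[key_field]:
            survivors[group] = record
    # Sort by primary key so the output order is stable and easy to read.
    return sorted(survivors.values(), key=lambda r: r[key_field])


if __name__ == "__main__":
    # Made-up rows: records 1 and 3 share a match group, as do records 2 and 5.
    rows = [
        {"patient_id": 3, "match_id": 10},
        {"patient_id": 1, "match_id": 10},
        {"patient_id": 2, "match_id": 11},
        {"patient_id": 5, "match_id": 11},
    ]
    for row in keep_lowest_key_per_match(rows):
        print(row)
```

To merge records instead of dropping them, the comparison inside the loop is the place to combine fields from `record` and `best` rather than choosing one of them.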
78 |
79 |
80 |
81 | ___
82 |
83 | [Back to main guide](../README.md) | [Next](activity10.md)
84 |
--------------------------------------------------------------------------------
/lab-guides/appendixA.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md) | [Next](appendixB.md)
2 | ___
3 |
4 | ## Appendix A: Using Amazon SageMaker Jupyter Notebook to Run the Glue Job
5 |
6 | ### Step 1: Open Jupyter Notebook from AWS Glue Console
7 |
8 | a) Navigate to AWS Glue Console, **ETL → Dev Endpoints → Notebooks**
9 |
10 |
11 |
12 | b) Select the notebook named **aws-glue-ml-transform** and click the **Open notebook** button. This takes you to the **SageMaker notebook** page.
13 |
14 |
15 |
16 | ### Step 2: Run the Glue PySpark script to deduplicate data using FindMatches ML Transform
17 |
18 | a) In the **SageMaker Notebook** console, open the file named **“Glue ML Transforms Notebook.ipynb”**
19 |
20 |
21 |
22 | b) The notebook contains code similar to that shown below.
23 |
24 |
25 |
26 | c) In PySpark code, replace the value of **\<\\>** with the **Transform ID** of FindMatches ML Transform
27 |
28 |
29 |
30 | d) Replace **\<\\>** in PySpark code with the name of the S3 Bucket used for your data lake.
31 |
32 | If you followed the names used for S3 bucket path prefixes throughout the guide, you do not need to change anything else. If you have changed the S3 bucket path prefixes, then please update those accordingly in the Notebook script (PySpark code).
33 |
34 | Make sure that the **Estimate Quality** operation on the ML transform has completed successfully before you run the PySpark script from the notebook.
35 | 
36 | e) To run a block of code, click the **Run** icon for the specific cell in the **Notebook**
37 |
38 |
39 |
40 | At this point, the **Spark ETL application** will run.
41 |
42 |
43 |
44 | Once the progress bar disappears and the asterisk [*] before the cell changes to a number, you can continue with the steps for crawling and cataloging the data in [Appendix B](appendixB.md).
45 |
46 |
47 |
48 | ___
49 |
50 | [Back to main guide](../README.md) | [Next](appendixB.md)
51 |
--------------------------------------------------------------------------------
/lab-guides/appendixB.md:
--------------------------------------------------------------------------------
1 | [Back to main guide](../README.md)
2 | ___
3 |
4 | ## Appendix B
5 |
6 | ### Step 1: Crawl and Catalog data in AWS Glue
7 |
8 | a) From **Glue Console → Add tables using a Crawler**
9 |
10 |
11 |
12 | b) Specify the **crawler name** as **patient-dedup-data-cr**
13 |
14 |
15 |
16 | c) Select **Crawler source type** as **Data stores**
17 |
18 |
19 |
20 | d) Choose **S3** as a data store and select the **transformresult** folder under **S3 bucket**
21 |
22 |
23 |
24 | e) Choose **No** for **Add another data store**
25 |
26 |
27 |
28 | f) **Choose Existing IAM Role** from the drop down – **AWSGlueServiceRole-LF-MLLab**
29 |
30 |
31 |
32 | g) Select **Run on Demand**
33 |
34 |
35 |
36 | h) Select **patientdb** as a **Database**
37 |
38 |
39 |
40 | i) Review the information and click **Finish**
41 |
42 |
43 |
44 | j) **Run** the crawler
45 |
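The console steps above correspond roughly to a single Glue `CreateCrawler` request. The following sketch builds that request as a plain dictionary so the settings are visible in one place. The helper name `build_crawler_request` and the bucket name are hypothetical, and the S3 path assumes the `patientdata/transformresult` prefix used as the job target in Activity 9.

```python
def build_crawler_request(bucket_name):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": "patient-dedup-data-cr",
        "Role": "AWSGlueServiceRole-LF-MLLab",
        "DatabaseName": "patientdb",
        "Targets": {
            "S3Targets": [
                # Assumed prefix; adjust it if you changed the target path.
                {"Path": f"s3://{bucket_name}/patientdata/transformresult/"}
            ]
        },
        # No "Schedule" entry, so the crawler runs on demand only.
    }


# Example usage (requires AWS credentials and an existing bucket):
#   import boto3
#   glue = boto3.client("glue")
#   request = build_crawler_request("my-datalake-bucket")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```

Keeping the request in a function makes it simple to review the values against the console screens before anything is created.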
46 |
47 |
48 |
49 | ### Step 2: Log in as Data Lake Administrator and grant dlanalyst permission to the table
50 | 
51 | a) Log in as the data lake administrator – **dladmin**
52 |
53 | b) Navigate to **Lake Formation Console → select Data Permissions → Grant**
54 |
55 | c) Grant the user **dlanalyst** permission on the **transformresult** table
56 |
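For reference, the grant in step c) can also be expressed as a Lake Formation `GrantPermissions` request. This is a sketch under stated assumptions: it treats **dlanalyst** as an IAM user, and it defaults to the `SELECT` permission because this guide does not spell out which permissions to grant. The helper name `build_grant_request` and the account ID are hypothetical.

```python
def build_grant_request(account_id, permissions=("SELECT",)):
    """Build keyword arguments for the Lake Formation GrantPermissions API call."""
    return {
        "Principal": {
            # Assumes dlanalyst is an IAM user in the given account.
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:user/dlanalyst"
        },
        "Resource": {
            "Table": {"DatabaseName": "patientdb", "Name": "transformresult"}
        },
        "Permissions": list(permissions),
    }


# Example usage (run with data lake administrator credentials):
#   import boto3
#   lakeformation = boto3.client("lakeformation")
#   lakeformation.grant_permissions(**build_grant_request("123456789012"))
```

Pass a different `permissions` tuple if the analyst needs more than read access to the table.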
57 |
58 |
59 |
60 | ### Step 3: Modify Table Definition to remove quotes
61 |
62 | a) Log in as **dlanalyst**
63 |
64 | b) Navigate to **AWS Glue Console → Databases → Tables** → Select the **transformresult** table and click on **Edit table details**
65 |
66 |
67 |
68 | c) Change the **Serde Serialization lib** to **org.apache.hadoop.hive.serde2.OpenCSVSerde** and set the **Serde Parameters** as shown below, because the transformed result CSV files contain data values enclosed in double quotes **(“)**. You can set the Serde parameters according to the type of dataset you have. For this dataset, use the values below.
69 |
70 |
71 |
72 | d) Click the table name **transformresult** and select **Edit schema**
73 |
74 |
75 |
76 | e) Edit the schema of the table by setting the data type of every column to **“String”**, except the **match_id** column, which should be set to **“bigint”**. This lab uses String for simplicity; you can instead retain the data types inferred by the Glue crawler if that suits your requirements.
77 |
78 |
79 |
80 | f) Select the **View data** option, which takes you to **Amazon Athena** to query the data from the data lake
81 |
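To see why step c) switches to a quote-aware SerDe, the short Python sketch below parses a quoted CSV line in two ways. The parameter names `separatorChar`, `quoteChar`, and `escapeChar` are the ones OpenCSVSerde accepts; the values shown are assumed (they mirror the SerDe defaults) and may differ from the ones used in this lab. The sample row is made up, and Python's `csv` module only stands in for the SerDe here.

```python
import csv
import io

# Assumed values that mirror the OpenCSVSerde defaults.
SERDE_PARAMETERS = {"separatorChar": ",", "quoteChar": '"', "escapeChar": "\\"}


def parse_line(line, params=SERDE_PARAMETERS):
    """Parse one CSV line, honoring the separator and quote characters.

    escapeChar is listed above for completeness but is not used in this sketch.
    """
    reader = csv.reader(
        io.StringIO(line),
        delimiter=params["separatorChar"],
        quotechar=params["quoteChar"],
    )
    return next(reader)


if __name__ == "__main__":
    sample = '"1001","Doe, Jane","1980-01-01"'
    # A naive split keeps the quotes and breaks the name at its comma.
    print(sample.split(","))
    # A quote-aware parse returns three clean values.
    print(parse_line(sample))
```

Without quote handling, the quotes become part of each value and any comma inside a quoted field shifts the remaining columns, which is what the SerDe change prevents.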
82 |
83 |
84 |
85 |
86 | ## Congratulations!! You have successfully completed the lab.
87 |
88 | ___
89 |
90 | [Back to main guide](../README.md)
91 |
--------------------------------------------------------------------------------
/lab-guides/images/1-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/1-1.png
--------------------------------------------------------------------------------
/lab-guides/images/1-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/1-2.png
--------------------------------------------------------------------------------
/lab-guides/images/1-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/1-3.png
--------------------------------------------------------------------------------
/lab-guides/images/1-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/1-4.png
--------------------------------------------------------------------------------
/lab-guides/images/1-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/1-5.png
--------------------------------------------------------------------------------
/lab-guides/images/1.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/1.txt
--------------------------------------------------------------------------------
/lab-guides/images/2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/2-1.png
--------------------------------------------------------------------------------
/lab-guides/images/2-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/2-2.png
--------------------------------------------------------------------------------
/lab-guides/images/2-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/2-3.png
--------------------------------------------------------------------------------
/lab-guides/images/3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/3-1.png
--------------------------------------------------------------------------------
/lab-guides/images/3-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/3-2.png
--------------------------------------------------------------------------------
/lab-guides/images/3-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/3-3.png
--------------------------------------------------------------------------------
/lab-guides/images/4-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/4-1.png
--------------------------------------------------------------------------------
/lab-guides/images/4-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/4-2.png
--------------------------------------------------------------------------------
/lab-guides/images/5-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-1.png
--------------------------------------------------------------------------------
/lab-guides/images/5-10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-10.png
--------------------------------------------------------------------------------
/lab-guides/images/5-11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-11.png
--------------------------------------------------------------------------------
/lab-guides/images/5-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-2.png
--------------------------------------------------------------------------------
/lab-guides/images/5-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-3.png
--------------------------------------------------------------------------------
/lab-guides/images/5-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-4.png
--------------------------------------------------------------------------------
/lab-guides/images/5-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-5.png
--------------------------------------------------------------------------------
/lab-guides/images/5-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-6.png
--------------------------------------------------------------------------------
/lab-guides/images/5-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-7.png
--------------------------------------------------------------------------------
/lab-guides/images/5-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-8.png
--------------------------------------------------------------------------------
/lab-guides/images/5-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/5-9.png
--------------------------------------------------------------------------------
/lab-guides/images/6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/6-1.png
--------------------------------------------------------------------------------
/lab-guides/images/6-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/6-2.png
--------------------------------------------------------------------------------
/lab-guides/images/7-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/7-1.png
--------------------------------------------------------------------------------
/lab-guides/images/7-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/7-2.png
--------------------------------------------------------------------------------
/lab-guides/images/7-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/7-3.png
--------------------------------------------------------------------------------
/lab-guides/images/7-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/7-4.png
--------------------------------------------------------------------------------
/lab-guides/images/7-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/7-5.png
--------------------------------------------------------------------------------
/lab-guides/images/8-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-1.png
--------------------------------------------------------------------------------
/lab-guides/images/8-10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-10.png
--------------------------------------------------------------------------------
/lab-guides/images/8-11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-11.png
--------------------------------------------------------------------------------
/lab-guides/images/8-12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-12.png
--------------------------------------------------------------------------------
/lab-guides/images/8-13.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-13.png
--------------------------------------------------------------------------------
/lab-guides/images/8-14.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-14.png
--------------------------------------------------------------------------------
/lab-guides/images/8-15.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-15.png
--------------------------------------------------------------------------------
/lab-guides/images/8-16.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-16.png
--------------------------------------------------------------------------------
/lab-guides/images/8-17.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-17.png
--------------------------------------------------------------------------------
/lab-guides/images/8-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-2.png
--------------------------------------------------------------------------------
/lab-guides/images/8-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-3.png
--------------------------------------------------------------------------------
/lab-guides/images/8-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-4.png
--------------------------------------------------------------------------------
/lab-guides/images/8-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-5.png
--------------------------------------------------------------------------------
/lab-guides/images/8-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-6.png
--------------------------------------------------------------------------------
/lab-guides/images/8-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-7.png
--------------------------------------------------------------------------------
/lab-guides/images/8-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-8.png
--------------------------------------------------------------------------------
/lab-guides/images/8-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/8-9.png
--------------------------------------------------------------------------------
/lab-guides/images/9-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-1.png
--------------------------------------------------------------------------------
/lab-guides/images/9-10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-10.png
--------------------------------------------------------------------------------
/lab-guides/images/9-11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-11.png
--------------------------------------------------------------------------------
/lab-guides/images/9-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-2.png
--------------------------------------------------------------------------------
/lab-guides/images/9-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-3.png
--------------------------------------------------------------------------------
/lab-guides/images/9-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-4.png
--------------------------------------------------------------------------------
/lab-guides/images/9-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-5.png
--------------------------------------------------------------------------------
/lab-guides/images/9-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-6.png
--------------------------------------------------------------------------------
/lab-guides/images/9-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-7.png
--------------------------------------------------------------------------------
/lab-guides/images/9-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-8.png
--------------------------------------------------------------------------------
/lab-guides/images/9-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/9-9.png
--------------------------------------------------------------------------------
/lab-guides/images/a-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-1.png
--------------------------------------------------------------------------------
/lab-guides/images/a-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-2.png
--------------------------------------------------------------------------------
/lab-guides/images/a-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-3.png
--------------------------------------------------------------------------------
/lab-guides/images/a-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-4.png
--------------------------------------------------------------------------------
/lab-guides/images/a-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-5.png
--------------------------------------------------------------------------------
/lab-guides/images/a-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-6.png
--------------------------------------------------------------------------------
/lab-guides/images/a-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/a-7.png
--------------------------------------------------------------------------------
/lab-guides/images/b-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-1.png
--------------------------------------------------------------------------------
/lab-guides/images/b-10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-10.png
--------------------------------------------------------------------------------
/lab-guides/images/b-11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-11.png
--------------------------------------------------------------------------------
/lab-guides/images/b-12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-12.png
--------------------------------------------------------------------------------
/lab-guides/images/b-13.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-13.png
--------------------------------------------------------------------------------
/lab-guides/images/b-14.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-14.png
--------------------------------------------------------------------------------
/lab-guides/images/b-15.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-15.png
--------------------------------------------------------------------------------
/lab-guides/images/b-16.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-16.png
--------------------------------------------------------------------------------
/lab-guides/images/b-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-2.png
--------------------------------------------------------------------------------
/lab-guides/images/b-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-3.png
--------------------------------------------------------------------------------
/lab-guides/images/b-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-4.png
--------------------------------------------------------------------------------
/lab-guides/images/b-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-5.png
--------------------------------------------------------------------------------
/lab-guides/images/b-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-6.png
--------------------------------------------------------------------------------
/lab-guides/images/b-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-7.png
--------------------------------------------------------------------------------
/lab-guides/images/b-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-8.png
--------------------------------------------------------------------------------
/lab-guides/images/b-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-lakeformation-ml-transforms/b77eefd00e70b6f80d8e5beb57aadfeb0867f2eb/lab-guides/images/b-9.png
--------------------------------------------------------------------------------