├── .DS_Store ├── .gitignore ├── 01_Register_New_Database.md ├── 02_Create_IAM_Roles_And_Users.md ├── 03_Grant_Permissions_With_Lake_Formation.md ├── 04_Set_Up_SageMaker_Studio.md ├── 05_Test_Lake_Formation_Access_Control_Policies.md ├── 06_Audit_Data_Access_With_Lake_Formation_And_CloudTrail.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── cdktemplate ├── .DS_Store ├── .gitignore ├── LICENSE ├── README.md ├── app.py ├── cdk.json ├── lambda │ ├── .DS_Store │ ├── __init__.py │ ├── sagemaker_studio_domain.py │ └── sagemaker_studio_profile.py ├── requirements.txt ├── sagemaker_studio_audit_control │ ├── .DS_Store │ ├── __init__.py │ ├── amazon_reviews_dataset_stack.py │ ├── data_scientist_users_stack.py │ ├── sagemaker_studio_audit_control_stack.py │ └── sagemaker_studio_stack.py ├── setup.py └── source.bat ├── images ├── 0SageMakerAuditControl.png ├── 1CreateDatabase.png ├── 1RegisterLocation.png ├── 1VerifyTable.png ├── 3LakeFormationPermissions.png ├── 4SageMakerStudioDomain.png ├── 5NotebookUserFull.png ├── 5SageMakerStudioLimited.png ├── 5SageMakerStudioNotebook.png ├── 6CloudTrail.png ├── 6LakeFormation.png ├── 7IAMUsersAndGroups.png └── 8FederatedIdentities.png └── notebook ├── .DS_Store └── sagemaker_studio_audit_control.ipynb /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # Others 132 | *.sh 133 | .vscode -------------------------------------------------------------------------------- /01_Register_New_Database.md: -------------------------------------------------------------------------------- 1 | ## Registering a new Database in Lake Formation 2 | 3 | I use the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) to demonstrate how to provide granular access to the data lake for different data scientists. If you already have a dataset registered with Lake Formation that you want to use, you can skip this section and go to [Creating required IAM roles and IAM users for Data Scientists](./02_Create_IAM_Roles_And_Users.md). 4 | 5 | To register the Amazon Customer Reviews Dataset in Lake Formation, complete the following steps: 6 | 7 | 1. Sign in to the console with the credentials associated to a [Lake Formation administrator](https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html#terminology-admin), based on your authentication method (AWS IAM, AWS SSO, or federation with an external IdP). 8 | 2. On the Lake Formation console, in the navigation pane, under **Data catalog**, choose **Databases**. 9 | 3. Choose **Create Database**. 10 | 4. In **Database details**, select **Database** to create the database in your own account. 11 | 5. For **Name**, enter a name for the database, such as `amazon_reviews_db`. 12 | 6. For Location, enter `s3://amazon-reviews-pds`. 13 | 7. Under **Default permissions for newly created tables**, make sure to clear the option **Use only IAM access control for new tables in this database**. 14 | 15 |

16 | 17 |

18 | 19 | 8. Choose **Create database**. 20 | 21 | The Amazon Customer Reviews Dataset is currently available in TSV and Parquet formats. The Parquet dataset is partitioned on Amazon S3 by `product_category`. To create a table in the data lake for the Parquet dataset, you can use an [AWS Glue](https://aws.amazon.com/glue) crawler or manually create the table using Athena, as described in the Amazon Customer Reviews Dataset [README file](https://s3.amazonaws.com/amazon-reviews-pds/readme.html). 22 | 23 | 24 | 9. On the Athena console, create the table. 25 | 26 | If you haven’t specified a query result location before, follow the instructions in [Specifying a Query Result Location](https://docs.aws.amazon.com/athena/latest/ug/querying.html#query-results-specify-location). 27 | 28 | 10. Choose the data source `AwsDataCatalog`. 29 | 11. Choose the database created in the previous step. 30 | 12. In the Query Editor, enter the following query: 31 | 32 | ```sql 33 | CREATE EXTERNAL TABLE amazon_reviews_parquet( 34 | marketplace string, 35 | customer_id string, 36 | review_id string, 37 | product_id string, 38 | product_parent string, 39 | product_title string, 40 | star_rating int, 41 | helpful_votes int, 42 | total_votes int, 43 | vine string, 44 | verified_purchase string, 45 | review_headline string, 46 | review_body string, 47 | review_date bigint, 48 | year int) 49 | PARTITIONED BY (product_category string) 50 | ROW FORMAT SERDE 51 | 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 52 | STORED AS INPUTFORMAT 53 | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 54 | OUTPUTFORMAT 55 | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 56 | LOCATION 57 | 's3://amazon-reviews-pds/parquet/' 58 | ``` 59 | 60 | 13. Choose **Run query**. 61 | 62 | You should receive a Query successful response when the table is created. 63 | 64 | 14. Enter the following query to load the table partitions: 65 | 66 | `MSCK REPAIR TABLE amazon_reviews_parquet` 67 | 68 | 15. Choose **Run query**. 69 | 16. On the Lake Formation console, in the navigation pane, under **Data catalog**, choose **Tables**. 70 | 17. For **Table name**, enter a table name. 71 | 18. Verify that you can see the table details. 72 | 73 |

74 | 75 |

76 | 77 | 19. Scroll down to see the table schema and partitions. 78 | 79 | Finally, you register the database location with Lake Formation so the service can start enforcing data permissions on the database. 80 | 81 | 20. On the Lake Formation console, in the navigation pane, under **Register and ingest**, choose **Data lake locations**. 82 | 21. On the **Data lake locations** page, choose **Register location**. 83 | 22. For **Amazon S3 path**, enter `s3://amazon-reviews-pds/`. 84 | 23. For **IAM role**, you can keep the default role. 85 | 24. Choose **Register location**. 86 | 87 | ## [Proceed to the next section](./02_Create_IAM_Roles_And_Users.md) to create the required IAM roles and users for data scientists. 88 | 89 | -------------------------------------------------------------------------------- /02_Create_IAM_Roles_And_Users.md: -------------------------------------------------------------------------------- 1 | ## Creating required IAM resources 2 | 3 | To demonstrate how you can provide differentiated access to the dataset registered in the previous step, you first need to create IAM policies, roles, and, if using AWS IAM users for authentication, a group, and users. The implementation leverages [attribute-based access control](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) (ABAC) to define IAM permissions. 4 | 5 | ### IAM resources for authentication using federation 6 | 7 | The following diagram illustrates the resources you configure in this section if using federated identities with AWS SSO (aligned with our [best practice](https://wa.aws.amazon.com/wat.question.SEC_2.en.html) of using temporary credentials to access AWS accounts). 8 | 9 |

10 | 11 |

12 | 13 | In this section, you complete the following high-level steps for users authenticated into AWS SSO, assuming the users are utilizing their Microsoft Active Directory (AD) email credentials: 14 | 15 | 1. Create an SSO permission set named `DataScientist` and assign SSO access into your AWS account to the users: `data-scientist-full@domain` and `data-scientist-limited@domain`, to control their federated access to the console and to Studio. 16 | 2. Add a custom inline policy to SSO permission set. 17 | 18 | The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their AD user name. The AD user name can be sent as an attribute from an external identity provider into AWS SSO, and then [used for access control](https://docs.aws.amazon.com/singlesignon/latest/userguide/attributesforaccesscontrol.html). The policy also denies the use of SageMaker notebook instances, allowing Studio notebooks only. 19 | 20 | 3. For each AD user, create individual IAM roles, which are used as user profile execution roles in Studio later. 21 | 22 | The naming convention for these roles consists of a common prefix followed by the corresponding AD user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual users who performed the activities. For this post, I use the prefix `SageMakerStudioExecutionRole_`. 23 | 24 | 4. Create a managed policy named `SageMakerUserProfileExecutionPolicy` and assign it to each of the IAM roles. 25 | 26 | The policy establishes coarse-grained access permissions to the data lake. 27 | 28 | ### IAM resources for authentication using AWS IAM 29 | 30 | The following diagram illustrates the resources you configure in this section if using AWS IAM users for authentication. 31 | 32 |

33 | 34 |

35 | 36 | In this section, you complete the following high-level steps: 37 | 38 | 1. Create an IAM group named `DataScientists` containing two users: `data-scientist-full` and `data-scientist-limited`, to control their access to the console and to Studio. 39 | 2. Create a managed policy named `DataScientistGroupPolicy` and assign it to the group. 40 | 41 | The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only. 42 | 43 | 3. For each IAM user, create individual IAM roles, which are used as user profile execution roles in Studio later. 44 | 45 | The naming convention for these roles consists of a common prefix followed by the corresponding IAM user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual IAM users who performed the activities. For this post, I use the prefix `SageMakerStudioExecutionRole_`. 46 | 47 | 4. Create a managed policy named `SageMakerUserProfileExecutionPolicy` and assign it to each of the IAM roles. 48 | 49 | The policy establishes coarse-grained access permissions to the data lake. 50 | 51 | Follow the remainder of this section to create the IAM resources described, depending on whether you use federated identities with AWS SSO (aligned with our [best practice](https://wa.aws.amazon.com/wat.question.SEC_2.en.html) of using temporary credentials to access AWS accounts) or AWS IAM users. The permissions configured in this section grant common, coarse-grained access to data lake resources for all the IAM roles. In a later section, you use Lake Formation to establish fine-grained access permissions to Data Catalog resources and Amazon S3 locations for individual roles. 52 | 53 | ### Creating the required SSO permission set (only for authentication using federation) 54 | 55 | To create your SSO permission set and assign it to your data scientists, complete the following steps: 56 | 57 | 1. Sign in to the console using an IAM principal with permissions to create SSO permission sets and assign SSO access to users and groups into your AWS account. 58 | 2. (If using AWS Managed Microsoft AD directory) On the AWS SSO console, verify that the AWS SSO user attribute `email` [is mapped](https://docs.aws.amazon.com/singlesignon/latest/userguide/mapssoattributestocdattributes.html) to the attribute `${dir:windowsUpn}` in Active Directory. 59 | 3. On the SSO console, [enable attributes for access control](https://docs.aws.amazon.com/singlesignon/latest/userguide/configure-abac.html) and select the mapped attribute. 60 | - On the **Attributes for access control** page, in the **Key** field, enter `studiouserid`. 61 | - In the **Value (optional)** field, choose or enter `${user:email}`. 62 | 4. [Create a custom permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) named `DataScientist`, based on custom permissions. 63 | - Under **Create a custom permissions policy**, use the following JSON policy document to provide permissions: 64 | 65 |
 66 | 		{
 67 | 			"Version": "2012-10-17",
 68 | 			"Statement": [
 69 | 				{
 70 | 					"Action": [
 71 | 						"sagemaker:DescribeDomain",
 72 | 						"sagemaker:ListDomains",
 73 | 						"sagemaker:ListUserProfiles",
 74 | 						"sagemaker:ListApps"
 75 | 					],
 76 | 					"Resource": "*",
 77 | 					"Effect": "Allow",
 78 | 					"Sid": "AmazonSageMakerStudioReadOnly"
 79 | 				},
 80 | 				{
 81 | 					"Action": "sagemaker:AddTags",
 82 | 					"Resource": "*",
 83 | 					"Effect": "Allow",
 84 | 					"Sid": "AmazonSageMakerAddTags"
 85 | 				},
 86 | 				{
 87 | 					"Condition": {
 88 | 						"StringEquals": {
 89 | 							"sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/studiouserid}"
 90 | 						}
 91 | 					},
 92 | 					"Action": [
 93 | 						"sagemaker:CreatePresignedDomainUrl",
 94 | 						"sagemaker:DescribeUserProfile"
 95 | 					],
 96 | 					"Resource": "*",
 97 | 					"Effect": "Allow",
 98 | 					"Sid": "AmazonSageMakerAllowedUserProfile"
 99 | 				},
100 | 				{
101 | 					"Condition": {
102 | 						"StringNotEquals": {
103 | 							"sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/studiouserid}"
104 | 						}
105 | 					},
106 | 					"Action": [
107 | 						"sagemaker:CreatePresignedDomainUrl",
108 | 						"sagemaker:DescribeUserProfile"
109 | 					],
110 | 					"Resource": "*",
111 | 					"Effect": "Deny",
112 | 					"Sid": "AmazonSageMakerDeniedUserProfiles"
113 | 				},
114 | 				{
115 | 					"Action": [
116 | 						"sagemaker:CreatePresignedNotebookInstanceUrl",
117 | 						"sagemaker:*NotebookInstance",
118 | 						"sagemaker:*NotebookInstanceLifecycleConfig",
119 | 						"sagemaker:CreateUserProfile",
120 | 						"sagemaker:DeleteDomain",
121 | 						"sagemaker:DeleteUserProfile"
122 | 					],
123 | 					"Resource": "*",
124 | 					"Effect": "Deny",
125 | 					"Sid": "AmazonSageMakerDeniedServices"
126 | 				}
127 | 			]
128 | 		}
129 |         
130 | 131 | The policy allows users to access Studio, but only using a SageMaker user profile with a tag that matches their AD user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only. 132 | 133 | 5. [Assign SSO access](https://docs.aws.amazon.com/singlesignon/latest/userguide/useraccess.html) into your AWS account to a group containing the data scientist users. 134 | - In the **Select users or groups** page, type a group name containing the data scientist users in your connected directory. 135 | - In the **Select permission sets** page, select the `DataScientist` permission set. 136 | 137 | ### Creating the required IAM group and users (only for authentication using AWS IAM): 138 | 139 | To create your group and users, complete the following steps: 140 | 141 | 1. Sign in to the console using an IAM user with permissions to create groups, users, roles, and policies. 142 | 2. On the IAM console, [create policies on the JSON tab](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html#access_policies_create-json-editor) to create a new IAM managed policy named `DataScientistGroupPolicy`. 143 | - Use the following JSON policy document to provide permissions: 144 | 145 |
146 | 		{
147 | 			"Version": "2012-10-17",
148 | 			"Statement": [
149 | 				{
150 | 					"Action": [
151 | 						"sagemaker:DescribeDomain",
152 | 						"sagemaker:ListDomains",
153 | 						"sagemaker:ListUserProfiles",
154 | 						"sagemaker:ListApps"
155 | 					],
156 | 					"Resource": "*",
157 | 					"Effect": "Allow",
158 | 					"Sid": "AmazonSageMakerStudioReadOnly"
159 | 				},
160 | 				{
161 | 					"Action": "sagemaker:AddTags",
162 | 					"Resource": "*",
163 | 					"Effect": "Allow",
164 | 					"Sid": "AmazonSageMakerAddTags"
165 | 				},
166 | 				{
167 | 					"Condition": {
168 | 						"StringEquals": {
169 | 							"sagemaker:ResourceTag/studiouserid": "${aws:username}"
170 | 						}
171 | 					},
172 | 					"Action": [
173 | 						"sagemaker:CreatePresignedDomainUrl",
174 | 						"sagemaker:DescribeUserProfile"
175 | 					],
176 | 					"Resource": "*",
177 | 					"Effect": "Allow",
178 | 					"Sid": "AmazonSageMakerAllowedUserProfile"
179 | 				},
180 | 				{
181 | 					"Condition": {
182 | 						"StringNotEquals": {
183 | 							"sagemaker:ResourceTag/studiouserid": "${aws:username}"
184 | 						}
185 | 					},
186 | 					"Action": [
187 | 						"sagemaker:CreatePresignedDomainUrl",
188 | 						"sagemaker:DescribeUserProfile"
189 | 					],
190 | 					"Resource": "*",
191 | 					"Effect": "Deny",
192 | 					"Sid": "AmazonSageMakerDeniedUserProfiles"
193 | 				},
194 | 				{
195 | 					"Action": [
196 | 						"sagemaker:CreatePresignedNotebookInstanceUrl",
197 | 						"sagemaker:*NotebookInstance",
198 | 						"sagemaker:*NotebookInstanceLifecycleConfig",
199 | 						"sagemaker:CreateUserProfile",
200 | 						"sagemaker:DeleteDomain",
201 | 						"sagemaker:DeleteUserProfile"
202 | 					],
203 | 					"Resource": "*",
204 | 					"Effect": "Deny",
205 | 					"Sid": "AmazonSageMakerDeniedServices"
206 | 				}
207 | 			]
208 | 		}
209 |         
210 | 211 | The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only. 212 | 213 | 1. [Create an IAM group](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_groups_create.html). 214 | - For **Group name**, enter `DataScientists`. 215 | - Search and attach the AWS managed policy named `DataScientist` and the IAM policy created in the previous step. 216 | 217 | 2. [Create two IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_console) named `data-scientist-full` and `data-scientist-limited`. 218 | 219 | Alternatively, you can provide names of your choice, as long as they’re a combination of lowercase letters, numbers, and hyphen (-). Later, you also give these names to their corresponding SageMaker user profiles, which at the time of writing [only support those characters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html#sagemaker-CreateUserProfile-request-UserProfileName). 220 | 221 | ### Creating the required IAM roles: 222 | 223 | To create your roles, complete the following steps: 224 | 225 | 1. On the IAM console, [create a new managed policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html#access_policies_create-json-editor) named `SageMakerUserProfileExecutionPolicy`. 226 | - Use the following JSON policy document to provide permissions, providing your AWS Region and AWS account ID: 227 | 228 |
229 | 		{
230 | 			"Version": "2012-10-17",
231 | 			"Statement": [
232 | 				{
233 | 					"Action": [
234 | 						"sagemaker:DescribeDomain",
235 | 						"sagemaker:ListDomains",
236 | 						"sagemaker:ListUserProfiles",
237 | 						"sagemaker:ListApps"
238 | 					],
239 | 					"Resource": "*",
240 | 					"Effect": "Allow",
241 | 					"Sid": "AmazonSageMakerStudioReadOnly"
242 | 				},
243 | 				{
244 | 					"Action": "sagemaker:AddTags",
245 | 					"Resource": "*",
246 | 					"Effect": "Allow",
247 | 					"Sid": "AmazonSageMakerAddTags"
248 | 				},
249 | 				{
250 | 					"Action": "sagemaker:DescribeUserProfile",
251 | 					"Resource": "arn:aws:sagemaker:<aws region>:<account id>:user-profile/*/${aws:PrincipalTag/userprofilename}",
252 | 					"Effect": "Allow",
253 | 					"Sid": "AmazonSageMakerAllowedUserProfile"
254 | 				},
255 | 				{
256 | 					"Action": "sagemaker:DescribeUserProfile",
257 | 					"Effect": "Deny",
258 | 					"NotResource": "arn:aws:sagemaker:<aws region>:<account id>:user-profile/*/${aws:PrincipalTag/userprofilename}",
259 | 					"Sid": "AmazonSageMakerDeniedUserProfiles"
260 | 				},
261 | 				{
262 | 					"Action": "sagemaker:*App",
263 | 					"Resource": "arn:aws:sagemaker:<aws region>:<account id>:app/*/${aws:PrincipalTag/userprofilename}/*",
264 | 					"Effect": "Allow",
265 | 					"Sid": "AmazonSageMakerAllowedApp"
266 | 				},
267 | 				{
268 | 					"Action": "sagemaker:*App",
269 | 					"Effect": "Deny",
270 | 					"NotResource": "arn:aws:sagemaker:<aws region>:<account id>:app/*/${aws:PrincipalTag/userprofilename}/*",
271 | 					"Sid": "AmazonSageMakerDeniedApps"
272 | 				},
273 | 				{
274 | 					"Action": [
275 | 						"lakeformation:GetDataAccess",
276 | 						"glue:GetTable",
277 | 						"glue:GetTables",
278 | 						"glue:SearchTables",
279 | 						"glue:GetDatabase",
280 | 						"glue:GetDatabases",
281 | 						"glue:GetPartitions"
282 | 					],
283 | 					"Resource": "*",
284 | 					"Effect": "Allow",
285 | 					"Sid": "LakeFormationPermissions"
286 | 				},
287 | 				{
288 | 					"Effect": "Allow",
289 | 					"Action": [
290 | 						"s3:CreateBucket",
291 | 						"s3:GetObject",
292 | 						"s3:PutObject"
293 | 					],
294 | 					"Resource": [
295 | 						"arn:aws:s3:::sagemaker-audit-control-query-results-<aws region>-<account id>",
296 | 						"arn:aws:s3:::sagemaker-audit-control-query-results-<aws region>-<account id>/*"
297 | 					]
298 | 				},
299 | 				{
300 | 					"Action": "iam:PassRole",
301 | 					"Resource": "*",
302 | 					"Effect": "Allow",
303 | 					"Sid": "AmazonSageMakerStudioIAMPassRole"
304 | 				},
305 | 				{
306 | 					"Action": "sts:AssumeRole",
307 | 					"Resource": "*",
308 | 					"Effect": "Deny",
309 | 					"Sid": "DenyAssummingOtherIAMRoles"
310 | 				}
311 | 			]
312 | 		} 
313 |         
314 | 315 | This policy provides limited IAM permissions to Studio. For more information on recommended policies for team groups in Studio, see [Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation](https://aws.amazon.com/blogs/machine-learning/configuring-amazon-sagemaker-studio-for-teams-and-groups-with-complete-resource-isolation/). The policy also provides common coarse-grained IAM permissions to the data lake, leaving Lake Formation permissions to control access to Data Catalog resources and Amazon S3 locations for individual users and roles. This is the recommended method for granting access to data in Lake Formation. For more information, see [Methods for Fine-Grained Access Control](https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-fine-grained.html). 316 | 317 | 2. [Create an IAM role](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html) for the first data scientist (`data-scientist-full`), which is used as the corresponding user profile’s execution role. 318 | - On the **Attach permissions policy** page, the AWS managed policy `AmazonSageMakerFullAccess` is attached by default. You remove this policy later, to maintain minimum privilege. 319 | - For **Tags**, add the key `userprofilename` and the value `data-scientist-full`. 320 | - For **Role name**, use the naming convention introduced at the beginning of this section to name the role `SageMakerStudioExecutionRole_data-scientist-full`. 321 | 3. To add the remaining policies, on the **Roles** page, choose the role name you just created. 322 | 4. Under **Permissions**, remove the policy `AmazonSageMakerFullAccess`. 323 | 5. Choose **Attach policies**. 324 | 6. Search and select the `SageMakerUserProfileExecutionPolicy` and `AmazonAthenaFullAccess` policies 325 | 7. Choose **Attach policy**. 326 | 8. Repeat the previous steps to create an IAM role for the second data scientist (`data-scientist-limited`). 327 | - For **Tags**, add the key `userprofilename` and the value `data-scientist-limited`. 328 | - For **Role name**, use the naming convention, such as `SageMakerStudioExecutionRole_data-scientist-limited`. 329 | 330 | ## [Proceed to the next section](./03_Grant_Permissions_With_Lake_Formation.md) to grant data permissions with Lake Formation. 331 | 332 | -------------------------------------------------------------------------------- /03_Grant_Permissions_With_Lake_Formation.md: -------------------------------------------------------------------------------- 1 | ## Granting data permissions with Lake Formation 2 | 3 | Before data scientists are able to work on a Studio notebook, you grant the individual execution roles created in the previous section access to the Amazon Customer Reviews Dataset (or your own dataset). For this post, we implement different data permission policies for each data scientist to demonstrate how to grant granular access using Lake Formation. 4 | 5 | 1. Sign in to the console with the credentials associated to a [Lake Formation administrator](https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html#terminology-admin), based on your authentication method (AWS IAM, AWS SSO, or federation with an external IdP). 6 | 2. On the Lake Formation console, in the navigation pane, choose **Tables**. 7 | 3. On the **Tables** page, select the table you created earlier, such as `amazon_reviews_parquet`. 8 | 4. On the **Actions** menu, under **Permissions**, choose **Grant**. 9 | 5. 
Provide the following information to grant full access to the Amazon Customer Reviews Dataset table for the first data scientist: 10 | - Select **My account**. 11 | - For **IAM users and roles**, choose the execution role associated to the first data scientist, such as `SageMakerStudioExecutionRole_data-scientist-full`. 12 | - For **Table permissions** and **Grantable permissions**, select **Select**. 13 | - Choose **Grant**. 14 | 6. Repeat the first step to grant limited access to the dataset for the second data scientist, providing the following information: 15 | - Select **My account**. 16 | - For **IAM users and roles**, choose the execution role associated to the second data scientist, such as `SageMakerStudioExecutionRole_data-scientist-limited`. 17 | - For **Columns**, choose **Include columns**. 18 | - Choose a subset of columns, such as: `product_category`, `product_id`, `product_parent`, `product_title`, `star_rating`, `review_headline`, `review_body`, and `review_date`. 19 | - For **Table permissions** and **Grantable permissions**, select **Select**. 20 | - Choose **Grant**. 21 | 7. To verify the data permissions you have granted, on the Lake Formation console, in the navigation pane, choose **Tables**. 22 | 8. On the **Tables** page, select the table you created earlier, such as `amazon_reviews_parquet`. 23 | 9. On the **Actions** menu, under **Permissions**, choose **View permissions** to open the **Data permissions** menu. 24 | 25 | You see a list of permissions granted for the table, including the permissions you just granted and permissions for the Lake Formation Admin. 26 | 27 |

28 | 29 |

30 | 31 | If you see the principal `IAMAllowedPrincipals` listed on the **Data permissions** menu for the table, you must remove it. Select the principal and choose **Revoke**. On the **Revoke permissions page**, choose **Revoke**. 32 | 33 | ## [Proceed to the next section](./04_Set_Up_SageMaker_Studio.md) to set up SageMaker Studio. 34 | 35 | -------------------------------------------------------------------------------- /04_Set_Up_SageMaker_Studio.md: -------------------------------------------------------------------------------- 1 | ## Setting up SageMaker Studio 2 | 3 | You now onboard to Studio and create two user profiles, one for each data scientist. 4 | 5 | When you onboard to Studio using IAM authentication, Studio creates a domain for your account. A domain consists of a list of authorized users, configuration settings, and an Amazon EFS volume, which contains data for the users, including notebooks, resources, and artifacts. 6 | 7 | Each user receives a private home directory within Amazon EFS for notebooks, Git repositories, and data files. All traffic between the domain and the Amazon EFS volume is communicated through specified subnet IDs. By default, all other traffic goes over the internet through a SageMaker system [Amazon Virtual Private Cloud](https://aws.amazon.com/vpc/) (Amazon VPC). 8 | 9 | Alternatively, instead of using the default SageMaker internet access, you could secure how Studio accesses resources by assigning a private VPC to the domain. This is beyond the scope of this post, but you can find additional details in [Securing Amazon SageMaker Studio connectivity using a private VPC](https://aws.amazon.com/blogs/machine-learning/securing-amazon-sagemaker-studio-connectivity-using-a-private-vpc/). 10 | 11 | To create the Studio user profiles with the `studiouserid` tag, I use the AWS CLI. As of this writing, including a tag when creating a user profile is available only through AWS CLI. You can find additional details in [Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation](https://aws.amazon.com/blogs/machine-learning/configuring-amazon-sagemaker-studio-for-teams-and-groups-with-complete-resource-isolation/). 12 | 13 | If you already have a Studio domain running, you can skip the onboarding process and follow the steps to create the SageMaker user profiles. 14 | 15 | ### Onboarding to Studio: 16 | 17 | To onboard to Studio, complete the following steps: 18 | 19 | 1. Sign in to the console with the credentials of a user with service administrator permissions for SageMaker, based on your authentication method (AWS IAM, AWS SSO, or federation with an external IdP). 20 | 2. On the SageMaker console, in the navigation pane, choose **Amazon SageMaker Studio**. 21 | 3. On the **Studio** menu, under **Get started**, choose **Standard setup**. 22 | 4. For **Authentication method**, choose **AWS Identity and Access Management (IAM)**. 23 | 5. Under **Permission**, for **Execution role for all users**, choose an option from the role selector. 24 | 25 | You’re not using this execution role for the SageMaker user profiles that you create later. If you choose **Create a new role**, the **Create an IAM role** dialog opens. 26 | 27 | 6. For **S3 buckets you specify**, choose **None**. 28 | 7. Choose **Create role**. 29 | 30 | SageMaker creates a new IAM role named `AmazonSageMaker-ExecutionPolicy`, with the `AmazonSageMakerFullAccess` policy attached. 31 | 32 | 8. 
Under **Network and storage**, for **VPC**, choose the private VPC that is used for communication with the Amazon EFS volume. 33 | 9. For **Subnet(s)**, choose multiple subnets in the VPC from different Availability Zones. 34 | 10. Choose **Submit**. 35 | 11. On the **Studio Control Panel**, under **Studio Summary**, wait for the status to change to `Ready` and the **Add user button** to be enabled. 36 | 37 | ### Creating the SageMaker user profiles: 38 | 39 | To create your SageMaker user profiles with the `studiouserid` tag, complete the following steps: 40 | 41 | 1. On AWS CLI, create the Studio user profile for the first data scientist. Enter the following command, providing the account ID, Studio domain ID, and the identity (AWS SSO, external IdP, or AWS IAM) of the first data scientist, depending on your authentication method 42 | 43 | 44 | aws sagemaker create-user-profile --domain-id <domain id> --user-profile-name data-scientist-full --tags Key=studiouserid,Value=<user id> --user-settings ExecutionRole= arn:aws:iam::<account id>:role/SageMakerStudioExecutionRole_data-scientist-full 45 | 46 | 47 | 2. After creating the first user profile, repeat the previous step to create a second user profile on AWS CLI, providing the account ID, Studio domain ID, and the identity (AWS SSO, external IdP, or AWS IAM) of the second data scientist, depending on your authentication method 48 | 49 | 50 | aws sagemaker create-user-profile --domain-id <domain id> --user-profile-name data-scientist-limited --tags Key=studiouserid,Value=<user id> --user-settings ExecutionRole=arn:aws:iam::<account id>:role/SageMakerStudioExecutionRole_data-scientist-limited 51 | 52 | 53 |

54 | 55 |

56 | 57 | ## [Proceed to the next section](./05_Test_Lake_Formation_Access_Control_Policies.md) to test Lake Formation access control policies. 58 | 59 | -------------------------------------------------------------------------------- /05_Test_Lake_Formation_Access_Control_Policies.md: -------------------------------------------------------------------------------- 1 | ## Testing Lake Formation access control policies 2 | 3 | You now test the implemented Lake Formation access control policies by opening Studio using both user profiles. For each user profile, you run the same Studio notebook containing Athena queries. You should see different query outputs for each user profile, matching the data permissions implemented earlier. 4 | 5 | 1. Sign in to the console with the credentials associated to the first data scientist (`data-scientist-full`), based on your authentication method (AWS IAM, AWS SSO, or federation with an external IdP). 6 | 2. On the SageMaker console, in the navigation pane, choose **Amazon SageMaker Studio**. 7 | 3. On the **Studio Control Panel**, choose user name `data-scientist-full`. 8 | 4. Choose **Open Studio**. 9 | 5. Wait for SageMaker Studio to load. 10 | 11 | Due to the IAM policies attached to the IAM user, you can only open Studio with a user profile matching the IAM user name. 12 | 13 | 6. In Studio, on the top menu, under **File**, under **New**, choose **Terminal**. 14 | 7. At the command prompt, run the following command to import a sample notebook to test Lake Formation data permissions: 15 | 16 | 17 | git clone https://github.com/aws-samples/amazon-sagemaker-studio-audit.git 18 | 19 | 20 | 8. In the left sidebar, choose the file browser icon. 21 | 9. Navigate to `amazon-sagemaker-studio-audit`. 22 | 10. Open the `notebook` folder. 23 | 11. Choose `sagemaker-studio-audit-control.ipynb` to open the notebook. 24 | 12. In the **Select Kernel** dialog, choose **Python 3 (Data Science)**. 25 | 13. Choose **Select**. 26 | 14. Wait for the kernel to load. 27 | 28 |

29 | 30 |

31 | 32 | 15. Starting from the first code cell in the notebook, press Shift + Enter to run the code cell. 33 | 16. Continue running all the code cells, waiting for the previous cell to finish before running the following cell. 34 | 35 | After running the last `SELECT` query, because the user has full SELECT permissions for the table, the query output includes all the columns in the `amazon_reviews_parquet` table. 36 | 37 |

38 | 39 |

40 | 41 | 17. On the top menu, under **File**, choose **Shut Down**. 42 | 18. Choose **Shutdown All** to shut down all the Studio apps. 43 | 19. Close the Studio browser tab. 44 | 20. 20. Repeat the previous steps in this section, this time signing in with the credentials associated to the second data scientist (`data-scientist-limited`) and opening Studio with this user. 45 | 21. Don’t run the code cell in the section **Create S3 bucket for query output files**. 46 | 47 | For this user, after running the same `SELECT` query in the Studio notebook, the query output only includes a subset of columns for the `amazon_reviews_parquet` table. 48 | 49 |

50 | 51 |

52 | 53 | ## [Proceed to the next section](./06_Audit_Data_Access_With_Lake_Formation_And_CloudTrail.md) to audit data access activity with Lake Formation and CloudTrail. 54 | 55 | -------------------------------------------------------------------------------- /06_Audit_Data_Access_With_Lake_Formation_And_CloudTrail.md: -------------------------------------------------------------------------------- 1 | ## Auditing data access activity with Lake Formation and CloudTrail 2 | 3 | In this section, we explore the events associated to the queries performed in the previous section. The Lake Formation console includes a dashboard where it centralizes all CloudTrail logs specific to the service, such as `GetDataAccess`. These events can be correlated with other CloudTrail events, such as Athena query requests, to get a complete view of the queries users are running on the data lake. 4 | 5 | Alternatively, instead of filtering individual events in Lake Formation and CloudTrail, you could run SQL queries to correlate CloudTrail logs using Athena. Such integration is beyond the scope of this post, but you can find additional details in [Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs](https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-ct) and [Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena](https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/). 6 | 7 | ### Auditing data access activity with Lake Formation 8 | 9 | To review activity in Lake Formation, complete the following steps: 10 | 11 | 1. Sign out of the AWS account. 12 | 2. Sign in to the console with the credentials associated to a [Lake Formation administrator](https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html#terminology-admin), based on your authentication method (AWS IAM, AWS SSO, or federation with an external IdP). 13 | 3. On the Lake Formation console, in the navigation pane, choose **Dashboard**. 14 | 15 | Under **Recent access activity**, you can find the events associated to the data access for both users. 16 | 17 | 4. Choose the most recent event with event name `GetDataAccess`. 18 | 5. Choose **View event**. 19 | 20 | Among other attributes, each event includes the following: 21 | 22 | - Event date and time 23 | - Event source (Lake Formation) 24 | - Athena query ID 25 | - Table being queried 26 | - IAM user embedded in the Lake Formation principal, based on the chosen role name convention 27 | 28 |

29 | 30 |

31 | 32 | ### Auditing data access activity with CloudTrail 33 | 34 | To review activity in CloudTrail, complete the following steps: 35 | 36 | 1. On the CloudTrail console, in the navigation pane, choose **Event history**. 37 | 2. In the **Event history** menu, for **Filter**, choose **Event name**. 38 | 3. Enter `StartQueryExecution`. 39 | 4. Expand the most recent event, then choose **View event**. 40 | 41 | This event includes additional parameters that are useful to complete the audit analysis, such as the following: 42 | 43 | - Event source (Athena). 44 | - Athena query ID, matching the query ID from Lake Formation’s `GetDataAccess` event. 45 | - Query string. 46 | - Output location. The query output is stored in CSV format in this Amazon S3 location. Files for each query are named using the query ID. 47 | 48 |

49 | 50 |

51 | 52 | ## Cleaning up 53 | 54 | To avoid incurring future charges, delete the resources created during this walkthrough. 55 | 56 | If you followed this walkthrough using the CloudFormation template, after shutting down the Studio apps for each user profile, deleting the stack deletes the remaining resources. 57 | 58 | If you encounter any errors, open the Studio Control Panel and verify that all the apps for every user profile are in `Deleted` state before deleting the stack. 59 | 60 | If you didn’t use the CloudFormation template, you can manually delete the resources you created: 61 | 62 | 1. On the **Studio Control Panel**, for each user profile, choose **User Details**. 63 | 2. Choose **Delete user**. 64 | 3. When all users are deleted, choose **Delete Studio**. 65 | 4. On the Amazon EFS console, delete the volume that was automatically created for Studio. 66 | 5. On the Lake Formation console, delete the table and the database created for the Amazon Customer Reviews Dataset. 67 | 6. Remove the data lake location for the dataset. 68 | 7. On the IAM console, delete the IAM users, group, and roles created for this walkthrough. 69 | 8. Delete the policies you created for these principals. 70 | 9. On the Amazon S3 console, empty and delete the bucket created for storing Athena query results (starting with `sagemaker-audit-control-query-results-`), and the bucket created by Studio to share notebooks (starting with `sagemaker-studio-`). 71 | 72 | ## This concludes the walkthrough. [Click here](./README.md) to go back to the main page. 73 | 74 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. 
You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation 2 | 3 | Certain industries are required to audit all access to their data. This includes auditing exploratory activities performed by data scientists, who usually query data from within machine learning (ML) notebooks. 4 | 5 | This post walks you through the steps to implement access control and auditing capabilities on a per-user basis, using [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) notebooks and [AWS Lake Formation](https://aws.amazon.com/lake-formation/) access control policies. This is a how-to guide based on the [Machine Learning Lens for the AWS Well-Architected Framework](https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf), following the design principles described in the Security Pillar: 6 | 7 | - Restrict access to Machine Learning (ML) systems. 8 | - Ensure data governance. 9 | - Enforce data lineage. 10 | - Enforce regulatory compliance. 11 | 12 | Note: This post provides guidance for customers already using [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) users and groups to manage identities, and also for customer using [AWS Single Sign-On](https://aws.amazon.com/single-sign-on/) (AWS SSO). Please note, however, that our [best practice for identity management](https://wa.aws.amazon.com/wat.question.SEC_2.en.html) is to use AWS SSO, or federation with AWS IAM roles, so that people access AWS accounts using temporary credentials. 13 | 14 | ## Overview of solution 15 | This implementation uses [Amazon Athena](http://aws.amazon.com/athena) and the [PyAthena](https://pypi.org/project/PyAthena/) client on an Amazon SageMaker Studio notebook to query data on a data lake registered with AWS Lake Formation. 16 | 17 | SageMaker Studio is the first fully integrated development environment (IDE) for ML. Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all the steps required to build, train, and deploy ML models. [Studio Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html) are collaborative notebooks that you can launch quickly, without setting up compute instances or file storage beforehand. 18 | 19 | Athena is an interactive query service that makes it easy to analyze data directly in [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. 20 | 21 | Lake Formation is a fully managed service that makes it easier for you to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes, including securely making that data available for analytics and ML. 22 | 23 | For an existing data lake registered with AWS Lake Formation, the following diagram illustrates the proposed implementation: 24 | 25 |

26 | 27 |

28 | 29 | 1. Data scientists access the [AWS Management Console](http://aws.amazon.com/console) using their identity, which can be an AWS SSO user name, a federated identity with an AWS IAM role (both options align with our [best practice](https://wa.aws.amazon.com/wat.question.SEC_2.en.html) of using temporary credentials to access AWS accounts), or an IAM user account belonging to an IAM group. In the console, data scientists open Studio using individual user profiles. Each user profile has an associated execution role, which the user assumes while working on a Studio notebook. 30 | The diagram depicts two data scientists that require different permissions over data in the data lake. For example, in a data lake containing personally identifiable information (PII), user Data Scientist 1 has full access to every table in the Data Catalog, whereas Data Scientist 2 has limited access to a subset of tables (or columns) containing non-PII data. 31 | 2. The Studio notebook is associated with a Python kernel. The PyAthena client allows you to run exploratory ANSI SQL queries on the data lake through Athena, using the execution role assumed by the user while working with Studio. 32 | 3. Athena sends a data access request to Lake Formation, with the user profile execution role as principal. Data permissions in Lake Formation offer database-, table-, and column-level access control, restricting access to metadata and the corresponding data stored in Amazon S3. Lake Formation generates short-term credentials to be used for data access, and informs Athena what columns the principal is allowed to access. 33 | 4. Athena uses the short-term credential provided by Lake Formation to access the data lake storage in Amazon S3, and retrieves the data matching the SQL query. Before returning the query result, Athena filters out columns that aren’t included in the data permissions informed by Lake Formation. 34 | 5. Athena returns the SQL query result to the Studio notebook. 35 | Lake Formation records data access requests and other activity history for the registered data lake locations. [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) also records these and other API calls made to AWS during the entire flow, including Athena query execution requests. 36 | 37 | ## Walkthrough 38 | 39 | In this walkthrough, I show you how to implement access control and audit using a Studio notebook and Lake Formation. You perform the following activities: 40 | 41 | - Register a new database in Lake Formation - `01_Register_New_Database.md` 42 | - Create the required IAM resources - `02_Create_IAM_Roles_And_Users.md` 43 | - Grant data permissions with AWS Lake Formation - `03_Grant_Permissions_With_Lake_Formation.md` 44 | - Set up SageMaker Studio - `04_Set_Up_SageMaker_Studio.md` 45 | - Test AWS Lake Formation access control policies using a Studio notebook - `05_Test_Lake_Formation_Access_Control_Policies.md` 46 | - Audit data access activity with AWS Lake Formation and AWS CloudTrail - `06_Audit_Data_Access_With_Lake_Formation_And_CloudTrail.md` 47 | 48 | ### Prerequisites 49 | 50 | For this walkthrough, you should have the following prerequisites: 51 | 52 | - An [AWS account](https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&client_id=signup) 53 | - A data lake set up in Lake Formation with a Lake Formation Admin. 
For general guidance on how to set up Lake Formation, see [Getting started with AWS Lake Formation](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-lake-formation/). 54 | - Basic knowledge on creating IAM policies, roles, users, and groups. 55 | - If using AWS SSO, knowledge on creating SSO permission sets and assigning them to users and groups. 56 | 57 | If you prefer to skip the initial setup activities and jump directly to testing and auditing, you can deploy the following [AWS CloudFormation](http://aws.amazon.com/cloudformation) template in a Region that supports [Studio](https://aws.amazon.com/sagemaker/pricing/#Amazon_SageMaker_Pricing_Calculator) and [Lake Formation](https://docs.aws.amazon.com/general/latest/gr/lake-formation.html#lake-formation_region): 58 | 59 | [![Launch Stack](https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png)](https://console.aws.amazon.com/cloudformation/home#/stacks/create/review?templateURL=https://aws-ml-blog.s3.amazonaws.com/artifacts/sagemaker-studio-audit-control/SageMakerStudioAuditControlStack.yaml&stackName=SageMakerStudioAuditControl) 60 | 61 | You can also deploy the template by [downloading the CloudFormation template](https://aws-ml-blog.s3.amazonaws.com/artifacts/sagemaker-studio-audit-control/SageMakerStudioAuditControlStack.yaml). When deploying the CloudFormation template, you provide the following parameters: 62 | 63 | - Authentication method for Studio, which can be selected between `AWS IAM with IAM users` and `AWS IAM with AWS account federation (external IdP)`. The former option is suitable for customers using AWS IAM user and groups to manage identities. The latter is suitable for customers using AWS SSO to manage access into AWS accounts with temporary credentials, which aligns with our [best practices](https://wa.aws.amazon.com/wat.question.SEC_2.en.html) for managing identities for people. 64 | - Studio profile name for a data scientist with full access to the dataset. The default name is `data-scientist-full`. If you select `AWS IAM with IAM users`, an IAM user with the same name is also created. In that case, the password for the IAM user is created automatically and stored as a secret in [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/). 65 | - Studio profile name for a data scientist with limited access to the dataset. The default user name is `data-scientist-limited`. If you select `AWS IAM with IAM users`, an IAM user with the same name is also created. In that case, the password for the IAM user is created automatically and stored as a secret in Secrets Manager. 66 | - Names for the database and table to be created for the dataset. The default names are `amazon_reviews_db` and `amazon_reviews_parquet`, respectively. 67 | - VPC and subnets that are used by Studio to communicate with the [Amazon Elastic File System](https://aws.amazon.com/efs/) (Amazon EFS) volume associated to Studio. 68 | 69 | If you use AWS SSO, and decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you must follow the sections **IAM resources for authentication using federation** and **Creating the required SSO permission set** in this post. Then you can go directly to the section **Testing Lake Formation access control policies**. 
69 | If you use AWS SSO and decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you must follow the sections **IAM resources for authentication using federation** and **Creating the required SSO permission set** in this post. Then you can go directly to the section **Testing Lake Formation access control policies**. 
70 | 
71 | If you use AWS IAM users and groups to manage identities and decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to the section **Testing Lake Formation access control policies** in this post. 
72 | 
73 | ### Where to go next? 
74 | 
75 | - If you choose to deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to [Testing AWS Lake Formation access control policies](./05_Test_Lake_Formation_Access_Control_Policies.md). 
76 | - If you prefer to perform the initial setup activities without the CloudFormation template, or you want to take a look at the activities automated by the template, you can start this walkthrough by [Registering a new Database in Lake Formation](./01_Register_New_Database.md). 
77 | 
78 | ## Security 
79 | 
80 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 
81 | 
82 | ## License 
83 | 
84 | This library is licensed under the MIT-0 License. See the LICENSE file. 
85 | 
--------------------------------------------------------------------------------
/cdktemplate/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/cdktemplate/.DS_Store
--------------------------------------------------------------------------------
/cdktemplate/.gitignore:
--------------------------------------------------------------------------------
1 | *.swp 
2 | package-lock.json 
3 | __pycache__ 
4 | .pytest_cache 
5 | .env 
6 | *.egg-info 
7 | *.yaml 
8 | *.sh 
9 | 
10 | # CDK asset staging directory 
11 | .cdk.staging 
12 | cdk.out
--------------------------------------------------------------------------------
/cdktemplate/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 
4 | this software and associated documentation files (the "Software"), to deal in 
5 | the Software without restriction, including without limitation the rights to 
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 
7 | the Software, and to permit persons to whom the Software is furnished to do so. 
8 | 
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 
16 | 
--------------------------------------------------------------------------------
/cdktemplate/README.md:
--------------------------------------------------------------------------------
1 | # CDK template for "Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation" 
2 | 
3 | This is the CDK template for "Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation". 
4 | 
5 | Note that the stacks are configured as nested stacks with a static URL based on the environment variable `NESTED_STACK_URL_PREFIX`.
The parent stack is `sagemaker_studio_audit_control/sagemaker_studio_audit_control_stack.py`. 
6 | 
7 | If you are planning to rebuild the CloudFormation template to deploy in your own environment, make sure to edit `app.py` and modify the environment variable `os.environ["NESTED_STACK_URL_PREFIX"]` at the beginning of the file. 
8 | 
9 | The `cdk.json` file tells the CDK Toolkit how to execute your app. 
10 | 
11 | This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the .env directory. To create the virtualenv it assumes that there is a `python3` (or `python` for Windows) executable in your path with access to the `venv` package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually. 
12 | 
13 | To manually create a virtualenv on MacOS and Linux: 
14 | 
15 | ``` 
16 | $ python3 -m venv .env 
17 | ``` 
18 | 
19 | After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv. 
20 | 
21 | ``` 
22 | $ source .env/bin/activate 
23 | ``` 
24 | 
25 | If you are on a Windows platform, you would activate the virtualenv like this: 
26 | 
27 | ``` 
28 | % .env\Scripts\activate.bat 
29 | ``` 
30 | 
31 | Once the virtualenv is activated, you can install the required dependencies. 
32 | 
33 | ``` 
34 | $ pip install -r requirements.txt 
35 | ``` 
36 | 
37 | At this point you can now synthesize the CloudFormation templates for this code and save each stack as a YAML file. 
38 | 
39 | ``` 
40 | $ cdk synth --version-reporting false --path-metadata false 
41 | 
42 | $ cdk synth --version-reporting false --path-metadata false amazon-reviews-dataset-stack > AmazonReviewsDatasetStack.yaml 
43 | 
44 | $ cdk synth --version-reporting false --path-metadata false data-scientist-users-stack > DataScientistUsersStack.yaml 
45 | 
46 | $ cdk synth --version-reporting false --path-metadata false sagemaker-studio-stack > SageMakerStudioStack.yaml 
47 | 
48 | $ cdk synth --version-reporting false --path-metadata false sagemaker-studio-audit-control > SageMakerStudioAuditControlStack.yaml 
49 | ``` 
50 | 
51 | To add additional dependencies, for example other CDK libraries, just add them to your `setup.py` file and rerun the `pip install -r requirements.txt` command. 
52 | 
53 | ## Useful commands 
54 | 
55 |  * `cdk ls` list all stacks in the app 
56 |  * `cdk synth` emits the synthesized CloudFormation template 
57 |  * `cdk deploy` deploy this stack to your default AWS account/region 
58 |  * `cdk diff` compare deployed stack with current state 
59 |  * `cdk docs` open CDK documentation 
60 | 
61 | Enjoy! 
62 | 
63 | ## License 
64 | 
65 | This library is licensed under the MIT-0 License. See the LICENSE file. 
66 | 
--------------------------------------------------------------------------------
/cdktemplate/app.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3 
2 | 
3 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | # SPDX-License-Identifier: MIT-0 5 | 6 | from aws_cdk import core 7 | import os 8 | 9 | os.environ["AMAZON_REVIEWS_BUCKET_ARN"] = "arn:aws:s3:::amazon-reviews-pds" 10 | os.environ["ROLE_NAME_PREFIX"] = "SageMakerStudio_" 11 | os.environ["ATHENA_QUERY_BUCKET_PREFIX"] = "sagemaker-audit-control-query-results-" 12 | 13 | os.environ["NESTED_STACK_URL_PREFIX"] = "https://aws-ml-blog.s3.amazonaws.com/artifacts/sagemaker-studio-audit-control/" 14 | 15 | from sagemaker_studio_audit_control.sagemaker_studio_audit_control_stack import SageMakerStudioAuditControlStack 16 | from sagemaker_studio_audit_control.amazon_reviews_dataset_stack import AmazonReviewsDatasetStack 17 | from sagemaker_studio_audit_control.data_scientist_users_stack import DataScientistUsersStack 18 | from sagemaker_studio_audit_control.sagemaker_studio_stack import SageMakerStudioStack 19 | 20 | app = core.App() 21 | SageMakerStudioAuditControlStack(app, "sagemaker-studio-audit-control") 22 | AmazonReviewsDatasetStack(app, "amazon-reviews-dataset-stack") 23 | DataScientistUsersStack(app, "data-scientist-users-stack") 24 | SageMakerStudioStack(app, "sagemaker-studio-stack") 25 | 26 | app.synth() -------------------------------------------------------------------------------- /cdktemplate/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "context": { 4 | "@aws-cdk/core:enableStackNameDuplicates": "true", 5 | "aws-cdk:enableDiffNoFail": "true" 6 | } 7 | } 8 | -------------------------------------------------------------------------------- /cdktemplate/lambda/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/cdktemplate/lambda/.DS_Store -------------------------------------------------------------------------------- /cdktemplate/lambda/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/cdktemplate/lambda/__init__.py -------------------------------------------------------------------------------- /cdktemplate/lambda/sagemaker_studio_domain.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import json 5 | import cfnresponse 6 | import boto3 7 | import time 8 | from botocore.exceptions import ClientError 9 | 10 | sm = boto3.client("sagemaker") 11 | 12 | SLEEP_INTERVAL = 10 13 | TIMEOUT_OFFSET = SLEEP_INTERVAL + 1 14 | 15 | def handler(event, context): 16 | print("Received event: %s." % event) 17 | request_type = event["RequestType"] 18 | 19 | if request_type == "Create": return create_resource(event, context) 20 | elif request_type == "Update": return update_resource(event, context) 21 | elif request_type == "Delete": return delete_resource(event, context) 22 | else : 23 | print("Unknown RequestType: %s." 
% request_type) 24 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 25 | 26 | def create_resource(event, context): 27 | try: 28 | current_timestamp = int(round(time.time() * 1000)) 29 | sm_domain = sm.create_domain( 30 | DomainName = "default-{}".format(current_timestamp), 31 | AuthMode = "IAM", 32 | SubnetIds = event["ResourceProperties"]["SubnetIds"], 33 | VpcId = event["ResourceProperties"]["VpcId"], 34 | DefaultUserSettings = { 35 | "ExecutionRole": event["ResourceProperties"]["DefaultExecutionRole"] 36 | }, 37 | # AppNetworkAccessType = "VpcOnly", 38 | ) 39 | domain_id = sm_domain["DomainArn"].split("/")[1] 40 | domain_status = sm.describe_domain(DomainId = domain_id)["Status"] 41 | response_data = {"DomainId" : domain_id} 42 | 43 | while domain_status != "InService": 44 | 45 | if context.get_remaining_time_in_millis() < TIMEOUT_OFFSET: 46 | print("Lambda Function about to time out. Aborting.") 47 | cfnresponse.send(event, context, cfnresponse.FAILED, response_data) 48 | 49 | if domain_status == "Pending": 50 | print("Waiting for InService status.") 51 | time.sleep(SLEEP_INTERVAL) 52 | domain_status = sm.describe_domain(DomainId = domain_id)["Status"] 53 | elif domain_status == "Deleting": 54 | print("Create Domain Failed. Domain being deleted by another process.") 55 | cfnresponse.send(event, context, cfnresponse.FAILED, response_data) 56 | else: #domain_status == "Failed" or Unknown 57 | print("Create Domain Failed. Status: %s" % domain_status) 58 | cfnresponse.send(event, context, cfnresponse.FAILED, response_data) 59 | 60 | cfnresponse.send(event, context, cfnresponse.SUCCESS, response_data) 61 | 62 | except ClientError as e: 63 | print("Unexpected error: %s." % e) 64 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 65 | 66 | def update_resource(event, context): 67 | try: 68 | print("Received Update event.") 69 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {}) 70 | 71 | except ClientError as e: 72 | print("Unexpected error: %s." % e) 73 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 74 | 75 | def delete_resource(event, context): 76 | domain_id = "" 77 | try: 78 | print("Received Delete event") 79 | domains = sm.list_domains()["Domains"] 80 | 81 | if len(domains) == 0: 82 | print("Resource not found. Nothing to do.") 83 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {}) 84 | return 85 | else: 86 | domain_id = domains[0]["DomainId"] 87 | sm.delete_domain( DomainId = domain_id, RetentionPolicy={ 'HomeEfsFileSystem': 'Delete'} ) 88 | 89 | print("Checking Delete status.") 90 | domain_status = sm.describe_domain(DomainId = domain_id)["Status"] 91 | 92 | while True: 93 | if context.get_remaining_time_in_millis() < TIMEOUT_OFFSET: 94 | print("Lambda Function about to time out. Aborting.") 95 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 96 | 97 | if domain_status in ["Deleting", "Pending", "InService"] : 98 | print("Waiting for Deletion.") 99 | time.sleep(SLEEP_INTERVAL) 100 | domain_status = sm.describe_domain(DomainId = domain_id)["Status"] 101 | else: # "Failed" or Unknown 102 | print("Delete Domain Failed. Status: %s" % domain_status) 103 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 104 | 105 | except ClientError as e: 106 | 107 | if e.response['Error']['Code'] == 'ResourceNotFound': 108 | print("Domain successfully deleted.") 109 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {}) 110 | else: 111 | print("Unexpected error: %s." 
% e) 112 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) -------------------------------------------------------------------------------- /cdktemplate/lambda/sagemaker_studio_profile.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import json 5 | import cfnresponse 6 | import boto3 7 | import time 8 | from botocore.exceptions import ClientError 9 | 10 | client = boto3.client("sagemaker") 11 | 12 | SLEEP_INTERVAL = 5 13 | TIMEOUT_OFFSET = SLEEP_INTERVAL + 1 14 | 15 | def handler(event, context): 16 | 17 | print("Received event: %s" % event) 18 | request_type = event["RequestType"] 19 | if request_type == "Create": return create_resource(event, context) 20 | elif request_type == "Update": return update_resource(event, context) 21 | elif request_type == "Delete": return delete_resource(event, context) 22 | else : 23 | # Unknown RequestType 24 | print("Invalid request type: %s." % request_type) 25 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 26 | 27 | def create_resource(event, context): 28 | try: 29 | user_profile = client.create_user_profile( 30 | DomainId = event["ResourceProperties"]["DomainId"], 31 | UserProfileName = event["ResourceProperties"]["UserProfileName"], 32 | Tags = [ 33 | { 34 | "Key" : "studiouserid", 35 | "Value" : event["ResourceProperties"]["StudioUserId"] 36 | } 37 | ], 38 | UserSettings={ 39 | "ExecutionRole": event["ResourceProperties"]["ExecutionRole"] 40 | } 41 | ) 42 | response_data = { "UserProfileArn" : user_profile["UserProfileArn"]} 43 | cfnresponse.send(event, context, cfnresponse.SUCCESS, response_data) 44 | 45 | except ClientError as e: 46 | print("Unexpected error: %s" % e) 47 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 48 | 49 | def update_resource(event, context): 50 | try: 51 | user_profile = client.update_user_profile( 52 | DomainId = event["ResourceProperties"]["DomainId"], 53 | UserProfileName = event["ResourceProperties"]["UserProfileName"], 54 | Tags = [ 55 | { 56 | "Key" : "studiouserid", 57 | "Value" : event["ResourceProperties"]["StudioUserId"] 58 | } 59 | ], 60 | UserSettings={ 61 | "ExecutionRole": event["ResourceProperties"]["ExecutionRole"] 62 | } 63 | ) 64 | response_data = { "UserProfileArn" : user_profile["UserProfileArn"]} 65 | cfnresponse.send(event, context, cfnresponse.SUCCESS, response_data) 66 | 67 | except ClientError as e: 68 | if e.response['Error']['Code'] == 'ResourceNotFound': 69 | print("Resource not found. Creating resource.") 70 | return create_resource(event, context) 71 | else: 72 | print("Unexpected error: %s." % e) 73 | cfnresponse.send(event, context, cfnresponse.FAILED, {}) 74 | 75 | def delete_resource(event, context): 76 | 77 | domain_id = event["ResourceProperties"]["DomainId"] 78 | user_profile_name = event["ResourceProperties"]["UserProfileName"] 79 | response_data = { "UserProfileName" : user_profile_name } 80 | 81 | try: 82 | print("Received delete event.") 83 | 84 | client.delete_user_profile( 85 | DomainId = domain_id, 86 | UserProfileName = user_profile_name 87 | ) 88 | 89 | print("Checking Delete status.") 90 | user_profile_status = client.describe_user_profile( DomainId = domain_id, UserProfileName = user_profile_name)["Status"] 91 | 92 | while True: 93 | if context.get_remaining_time_in_millis() < TIMEOUT_OFFSET: 94 | print("Lambda Function about to time out. 
Aborting.") 95 | cfnresponse.send(event, context, cfnresponse.FAILED, response_data) 96 | 97 | if user_profile_status in ["Deleting", "Pending", "InService"] : 98 | print("Waiting for Deletion.") 99 | time.sleep(SLEEP_INTERVAL) 100 | user_profile_status = client.describe_user_profile( DomainId = domain_id, UserProfileName = user_profile_name)["Status"] 101 | else: #domain_status == "Failed" 102 | print("Delete User Profile Failed.") 103 | cfnresponse.send(event, context, cfnresponse.FAILED, response_data) 104 | 105 | except ClientError as e: 106 | 107 | if e.response['Error']['Code'] == 'ResourceNotFound': 108 | print("User Profile successfully deleted.") 109 | cfnresponse.send(event, context, cfnresponse.SUCCESS, response_data) 110 | else: 111 | print("Unexpected error: %s" % e) 112 | cfnresponse.send(event, context, cfnresponse.FAILED, response_data) 113 | -------------------------------------------------------------------------------- /cdktemplate/requirements.txt: -------------------------------------------------------------------------------- 1 | -e . 2 | -------------------------------------------------------------------------------- /cdktemplate/sagemaker_studio_audit_control/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/cdktemplate/sagemaker_studio_audit_control/.DS_Store -------------------------------------------------------------------------------- /cdktemplate/sagemaker_studio_audit_control/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/cdktemplate/sagemaker_studio_audit_control/__init__.py -------------------------------------------------------------------------------- /cdktemplate/sagemaker_studio_audit_control/amazon_reviews_dataset_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from aws_cdk import ( 5 | aws_lakeformation as lf, 6 | aws_glue as glue, 7 | aws_s3 as s3, 8 | aws_iam as iam, 9 | core 10 | ) 11 | import os 12 | 13 | AMAZON_REVIEWS_BUCKET_ARN = os.environ["AMAZON_REVIEWS_BUCKET_ARN"] 14 | 15 | class AmazonReviewsDatasetStack(core.Stack): 16 | 17 | def __init__(self, scope: core.Construct, id: str, **kwargs) -> None: 18 | super().__init__(scope, id, **kwargs) 19 | 20 | # CloudFormation Parameters 21 | 22 | glue_db_name = core.CfnParameter(self, "GlueDatabaseNameAmazonReviews", 23 | type="String", 24 | description="Name of Glue Database to be created for Amazon Reviews.", 25 | allowed_pattern="[\w-]+", 26 | default = "amazon_reviews_db" 27 | ) 28 | 29 | glue_table_name = core.CfnParameter(self, "GlueTableNameAmazonReviews", 30 | type="String", 31 | description="Name of Glue Table to be created for Amazon Reviews (Parquet).", 32 | allowed_pattern="[\w-]+", 33 | default = "amazon_reviews_parquet" 34 | ) 35 | 36 | self.template_options.template_format_version = "2010-09-09" 37 | self.template_options.description = "Amazon Reviews Dataset." 
38 | self.template_options.metadata = { "License": "MIT-0" } 39 | 40 | # Create Database, Table and Partitions for Amazon Reviews 41 | 42 | amazon_reviews_bucket = s3.Bucket.from_bucket_arn(self, "ImportedAmazonReviewsBucket", AMAZON_REVIEWS_BUCKET_ARN) 43 | 44 | lakeformation_resource = lf.CfnResource(self, "LakeFormationResource", 45 | resource_arn = amazon_reviews_bucket.bucket_arn, 46 | use_service_linked_role = True) 47 | 48 | cfn_glue_db = glue.CfnDatabase(self, "GlueDatabase", 49 | catalog_id = core.Aws.ACCOUNT_ID, 50 | database_input = glue.CfnDatabase.DatabaseInputProperty( 51 | name = glue_db_name.value_as_string, 52 | location_uri=amazon_reviews_bucket.s3_url_for_object(), 53 | ) 54 | ) 55 | 56 | amazon_reviews_table = glue.CfnTable(self, "GlueTableAmazonReviews", 57 | catalog_id = cfn_glue_db.catalog_id, 58 | database_name = glue_db_name.value_as_string, 59 | table_input = glue.CfnTable.TableInputProperty( 60 | description = "Amazon Customer Reviews (a.k.a. Product Reviews)", 61 | name = glue_table_name.value_as_string, 62 | parameters = { 63 | "classification": "parquet", 64 | "typeOfData": "file" 65 | }, 66 | partition_keys = [{"name": "product_category","type": "string"}], 67 | storage_descriptor = glue.CfnTable.StorageDescriptorProperty( 68 | columns = [ 69 | {"name": "marketplace", "type": "string"}, 70 | {"name": "customer_id", "type": "string"}, 71 | {"name": "review_id","type": "string"}, 72 | {"name": "product_id","type": "string"}, 73 | {"name": "product_parent","type": "string"}, 74 | {"name": "product_title","type": "string"}, 75 | {"name": "star_rating","type": "int"}, 76 | {"name": "helpful_votes","type": "int"}, 77 | {"name": "total_votes","type": "int"}, 78 | {"name": "vine","type": "string"}, 79 | {"name": "verified_purchase","type": "string"}, 80 | {"name": "review_headline","type": "string"}, 81 | {"name": "review_body","type": "string"}, 82 | {"name": "review_date","type": "bigint"}, 83 | {"name": "year","type": "int"}], 84 | location = amazon_reviews_bucket.s3_url_for_object() + "/parquet/", 85 | input_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat", 86 | output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat", 87 | serde_info = glue.CfnTable.SerdeInfoProperty( 88 | serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", 89 | parameters = { 90 | "classification": "parquet", 91 | "typeOfData": "file" 92 | } 93 | ) 94 | ), 95 | table_type = "EXTERNAL_TABLE" 96 | ) 97 | ) 98 | 99 | # amazon_reviews_table.node.add_dependency(glue_default_permissions) 100 | amazon_reviews_table.node.add_dependency(cfn_glue_db) 101 | 102 | partition_list = ["Apparel", "Automotive", "Baby", "Beauty", "Books", "Camera", "Digital_Ebook_Purchase", 103 | "Digital_Music_Purchase", "Digital_Software", "Digital_Video_Download","Digital_Video_Games", "Electronics", 104 | "Furniture", "Gift_Card", "Grocery", "Health_&_Personal_Care", "Home", "Home_Entertainment", 105 | "Home_Improvement", "Jewelry", "Kitchen", "Lawn_and_Garden", "Luggage", "Major_Appliances", "Mobile_Apps", 106 | "Mobile_Electronics", "Music", "Musical_Instruments", "Office_Products", "Outdoors", "PC", "Personal_Care_Appliances", 107 | "Pet_Products", "Shoes", "Software", "Sports", "Tools", "Toys", "Video", "Video_DVD", "Video_Games", 108 | "Watches", "Wireless"] 109 | 110 | partition_uri_prefix = f"{amazon_reviews_bucket.s3_url_for_object()}/parquet/{amazon_reviews_table.table_input.partition_keys[0].name}" 111 | 112 | for partition in 
partition_list: 113 | 114 | cfn_partition_location = partition_uri_prefix + "=" + partition 115 | 116 | cfn_partition_id = "Partition"+partition 117 | 118 | cfn_partition = glue.CfnPartition(self, cfn_partition_id, 119 | catalog_id = amazon_reviews_table.catalog_id, 120 | database_name = glue_db_name.value_as_string, 121 | partition_input = glue.CfnPartition.PartitionInputProperty( 122 | values = [ partition ], 123 | storage_descriptor = glue.CfnPartition.StorageDescriptorProperty( 124 | location = cfn_partition_location, 125 | input_format = "org.apache.hadoop.mapred.TextInputFormat", 126 | serde_info = glue.CfnPartition.SerdeInfoProperty( 127 | serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", 128 | parameters = { 129 | "serialization.format": "1" 130 | } 131 | ) 132 | ) 133 | ), 134 | table_name = glue_table_name.value_as_string 135 | ) 136 | 137 | cfn_partition.add_depends_on(amazon_reviews_table) 138 | -------------------------------------------------------------------------------- /cdktemplate/sagemaker_studio_audit_control/data_scientist_users_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from aws_cdk import ( 5 | aws_iam as iam, 6 | aws_lakeformation as lf, 7 | aws_secretsmanager as secretsmanager, 8 | aws_sso as sso, 9 | core 10 | ) 11 | import os 12 | import json 13 | 14 | ROLE_NAME_PREFIX = os.environ["ROLE_NAME_PREFIX"] 15 | ATHENA_QUERY_BUCKET_PREFIX = os.environ["ATHENA_QUERY_BUCKET_PREFIX"] 16 | 17 | class DataScientistUsersStack(core.Stack): 18 | 19 | def __init__(self, scope: core.Construct, id: str, **kwargs) -> None: 20 | super().__init__(scope, id, **kwargs) 21 | 22 | # CloudFormation Parameters 23 | 24 | studio_authentication = core.CfnParameter(self, "StudioAuthentication", 25 | type="String", 26 | description="Authentication method for SageMaker Studio.", 27 | allowed_values=[ 28 | "AWS IAM with IAM users", 29 | "AWS IAM with AWS account federation (external IdP)" 30 | ], 31 | default = "AWS IAM with IAM users" 32 | ) 33 | 34 | user_data_scientist_1 = core.CfnParameter(self, "DataScientistFullAccess", 35 | type="String", 36 | description="Username for Data Scientist with full access to Amazon Reviews.", 37 | allowed_pattern="^[a-zA-Z0-9](-*[a-zA-Z0-9])*", 38 | default = "data-scientist-full" 39 | ) 40 | 41 | user_data_scientist_2 = core.CfnParameter(self, "DataScientistLimitedAccess", 42 | type="String", 43 | description="Username for Data Scientist with limited access to Amazon Reviews.", 44 | allowed_pattern="^[a-zA-Z0-9](-*[a-zA-Z0-9])*", 45 | default = "data-scientist-limited" 46 | ) 47 | 48 | federated_user_data_scientist_1 = core.CfnParameter(self, "FederatedDataScientistFullAccess", 49 | type="String", 50 | description="\ 51 | IdP user name for data scientist with full access to Amazon Reviews (e.g., \"username\", or \"username@domain\").", 52 | ) 53 | 54 | federated_user_data_scientist_2 = core.CfnParameter(self, "FederatedDataScientistLimitedAccess", 55 | type="String", 56 | description="\ 57 | IdP user name for data scientist with limited access to Amazon Reviews (e.g., \"username\", or \"username@domain\").", 58 | ) 59 | 60 | glue_db_name = core.CfnParameter(self, "GlueDatabaseNameAmazonReviews", 61 | type="String", 62 | description="Name of Glue DB to be created for Amazon Reviews.", 63 | allowed_pattern="[\w-]+", 64 | default = "amazon_reviews_db" 65 | ) 66 
| 67 | glue_table_name = core.CfnParameter(self, "GlueTableNameAmazonReviews", 68 | type="String", 69 | description="Name of Glue Table to be created for Amazon Reviews (Parquet).", 70 | allowed_pattern="[\w-]+", 71 | default = "amazon_reviews_parquet" 72 | ) 73 | 74 | self.template_options.template_format_version = "2010-09-09" 75 | self.template_options.description = "IAM Users and Roles for Data Scientists." 76 | self.template_options.metadata = { "License": "MIT-0" } 77 | 78 | # Conditions for SageMaker Studio authentication 79 | 80 | aws_iam_users = core.CfnCondition(self, "IsIAMUserAuthentication", 81 | expression = core.Fn.condition_equals("AWS IAM with IAM users", studio_authentication) 82 | ) 83 | 84 | aws_federation = core.CfnCondition(self, "IsFederatedAuthentication", 85 | expression = core.Fn.condition_equals("AWS IAM with AWS account federation (external IdP)", studio_authentication) 86 | ) 87 | 88 | # IAM Policy for Data Scientists Group, Users and Roles for Data Scientists 89 | 90 | data_scientist_group_managed_policy = iam.ManagedPolicy(self, "DataScientistGroupPolicy", 91 | managed_policy_name = "DataScientistGroupPolicy", 92 | statements = [ 93 | iam.PolicyStatement( 94 | sid = "AmazonSageMakerStudioReadOnly", 95 | effect=iam.Effect.ALLOW, 96 | actions=[ 97 | "sagemaker:DescribeDomain", 98 | "sagemaker:ListDomains", 99 | "sagemaker:ListUserProfiles", 100 | "sagemaker:ListApps" 101 | ], 102 | resources=["*"]), 103 | iam.PolicyStatement( 104 | sid = "AmazonSageMakerAddTags", 105 | effect=iam.Effect.ALLOW, 106 | actions=[ 107 | "sagemaker:AddTags" 108 | ], 109 | resources=["*"]), 110 | iam.PolicyStatement( 111 | sid = "AmazonSageMakerAllowedUserProfile", 112 | effect=iam.Effect.ALLOW, 113 | actions=[ 114 | "sagemaker:CreatePresignedDomainUrl", 115 | "sagemaker:DescribeUserProfile" 116 | ], 117 | resources=["*"], 118 | conditions = { 119 | "StringEquals": { 120 | "sagemaker:ResourceTag/studiouserid": "${aws:username}" 121 | } 122 | }), 123 | iam.PolicyStatement( 124 | sid = "AmazonSageMakerDeniedUserProfiles", 125 | effect=iam.Effect.DENY, 126 | actions = [ 127 | "sagemaker:CreatePresignedDomainUrl", 128 | "sagemaker:DescribeUserProfile" 129 | ], 130 | resources=["*"], 131 | conditions = { 132 | "StringNotEquals": { 133 | "sagemaker:ResourceTag/studiouserid": "${aws:username}" 134 | } 135 | }), 136 | iam.PolicyStatement( 137 | sid = "AmazonSageMakerDeniedServices", 138 | effect=iam.Effect.DENY, 139 | actions = [ 140 | "sagemaker:CreatePresignedNotebookInstanceUrl", 141 | "sagemaker:*NotebookInstance", 142 | "sagemaker:*NotebookInstanceLifecycleConfig", 143 | "sagemaker:CreateUserProfile", 144 | "sagemaker:DeleteDomain", 145 | "sagemaker:DeleteUserProfile" 146 | ], 147 | resources=["*"]) 148 | ] 149 | ) 150 | data_scientist_group_managed_policy.node.default_child.cfn_options.condition = aws_iam_users 151 | 152 | # IAM Group and Users for Data Scientists (only for AWS IAM with IAM Users) 153 | 154 | data_scientists_group = iam.Group(self, "DataScientistsIAMGroup", 155 | group_name = "DataScientists", 156 | managed_policies = [ 157 | # iam.ManagedPolicy.from_aws_managed_policy_name("job-function/DataScientist"), 158 | data_scientist_group_managed_policy 159 | ] 160 | ) 161 | 162 | data_scientists_group.node.default_child.cfn_options.condition = aws_iam_users 163 | 164 | pw_data_scientist_1 = secretsmanager.Secret(self, "DataScientistFullAccesspwd", 165 | generate_secret_string = secretsmanager.SecretStringGenerator(), 166 | removal_policy = core.RemovalPolicy.DESTROY 167 | ) 
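# Added note: CloudFormation conditions are a template-level concept and can't be set
# on high-level CDK constructs directly, so this stack reaches into each construct's
# underlying CfnResource (node.default_child) and attaches the condition there. The
# secrets and IAM users that follow are therefore only created when "AWS IAM with
# IAM users" is the selected authentication method.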
168 | pw_data_scientist_1.node.default_child.cfn_options.condition = aws_iam_users 169 | 170 | pw_data_scientist_2 = secretsmanager.Secret(self, "DataScientistLimitedAccesspwd", 171 | generate_secret_string = secretsmanager.SecretStringGenerator(), 172 | removal_policy = core.RemovalPolicy.DESTROY 173 | ) 174 | pw_data_scientist_2.node.default_child.cfn_options.condition = aws_iam_users 175 | 176 | 177 | user_1 = iam.User(self, "DataScientist1IAMUser", 178 | user_name = user_data_scientist_1.value_as_string, 179 | password = core.SecretValue.secrets_manager(pw_data_scientist_1.secret_arn), 180 | ) 181 | user_1.node.default_child.cfn_options.condition = aws_iam_users 182 | 183 | user_2 = iam.User(self, "DataScientist2IAMUser", 184 | user_name = user_data_scientist_2.value_as_string, 185 | password = core.SecretValue.secrets_manager(pw_data_scientist_2.secret_arn), 186 | ) 187 | user_2.node.default_child.cfn_options.condition = aws_iam_users 188 | 189 | user_1.add_to_group(data_scientists_group) 190 | user_2.add_to_group(data_scientists_group) 191 | 192 | # IAM Roles for SageMaker User Profiles 193 | 194 | data_scientist_role_1 = core.Fn.condition_if( 195 | aws_iam_users.logical_id, 196 | user_data_scientist_1, 197 | core.Fn.condition_if(aws_federation.logical_id, federated_user_data_scientist_1, "") 198 | ) 199 | 200 | data_scientist_role_2 = core.Fn.condition_if( 201 | aws_iam_users.logical_id, 202 | user_data_scientist_2, 203 | core.Fn.condition_if(aws_federation.logical_id, federated_user_data_scientist_2, "") 204 | ) 205 | 206 | user_profile_managed_policy = iam.ManagedPolicy(self, "SageMakerUserProfileExecutionPolicy", 207 | managed_policy_name = "SageMakerUserProfileExecutionPolicy", 208 | statements = [ 209 | iam.PolicyStatement( 210 | sid = "AmazonSageMakerStudioReadOnly", 211 | effect=iam.Effect.ALLOW, 212 | actions=[ 213 | "sagemaker:DescribeDomain", 214 | "sagemaker:ListDomains", 215 | "sagemaker:ListUserProfiles", 216 | "sagemaker:ListApps" 217 | ], 218 | resources=["*"]), 219 | iam.PolicyStatement( 220 | sid = "AmazonSageMakerAddTags", 221 | effect=iam.Effect.ALLOW, 222 | actions=[ 223 | "sagemaker:AddTags" 224 | ], 225 | resources=["*"]), 226 | iam.PolicyStatement( 227 | sid = "AmazonSageMakerAllowedUserProfile", 228 | effect=iam.Effect.ALLOW, 229 | actions=[ 230 | "sagemaker:DescribeUserProfile" 231 | ], 232 | resources=[f"arn:aws:sagemaker:{core.Aws.REGION}:{core.Aws.ACCOUNT_ID}:user-profile/*/${{aws:PrincipalTag/userprofilename}}"]), 233 | iam.PolicyStatement( 234 | sid = "AmazonSageMakerDeniedUserProfiles", 235 | effect=iam.Effect.DENY, 236 | actions = [ 237 | "sagemaker:DescribeUserProfile" 238 | ], 239 | not_resources=[f"arn:aws:sagemaker:{core.Aws.REGION}:{core.Aws.ACCOUNT_ID}:user-profile/*/${{aws:PrincipalTag/userprofilename}}"]), 240 | iam.PolicyStatement( 241 | sid = "AmazonSageMakerAllowedApp", 242 | effect=iam.Effect.ALLOW, 243 | actions = [ 244 | "sagemaker:*App" 245 | ], 246 | resources=[f"arn:aws:sagemaker:{core.Aws.REGION}:{core.Aws.ACCOUNT_ID}:app/*/${{aws:PrincipalTag/userprofilename}}/*"]), 247 | iam.PolicyStatement( 248 | sid = "AmazonSageMakerDeniedApps", 249 | effect=iam.Effect.DENY, 250 | actions = [ 251 | "sagemaker:*App" 252 | ], 253 | not_resources=[f"arn:aws:sagemaker:{core.Aws.REGION}:{core.Aws.ACCOUNT_ID}:app/*/${{aws:PrincipalTag/userprofilename}}/*"]), 254 | iam.PolicyStatement( 255 | sid = "LakeFormationPermissions", 256 | effect=iam.Effect.ALLOW, 257 | actions=[ 258 | "lakeformation:GetDataAccess", 259 | "glue:GetTable", 260 | 
"glue:GetTables", 261 | "glue:SearchTables", 262 | "glue:GetDatabase", 263 | "glue:GetDatabases", 264 | "glue:GetPartitions" 265 | ], 266 | resources=["*"]), 267 | iam.PolicyStatement( 268 | sid = "S3Permissions", 269 | effect=iam.Effect.ALLOW, 270 | actions=[ 271 | "s3:CreateBucket", 272 | "s3:GetObject", 273 | "s3:PutObject" 274 | ], 275 | resources=[ 276 | f"arn:aws:s3:::{ATHENA_QUERY_BUCKET_PREFIX}{core.Aws.REGION}-{core.Aws.ACCOUNT_ID}", 277 | f"arn:aws:s3:::{ATHENA_QUERY_BUCKET_PREFIX}{core.Aws.REGION}-{core.Aws.ACCOUNT_ID}/*" 278 | ]), 279 | iam.PolicyStatement( 280 | sid ="AmazonSageMakerStudioIAMPassRole", 281 | effect=iam.Effect.ALLOW, 282 | actions=[ 283 | "iam:PassRole" 284 | ], 285 | resources=["*"]), 286 | iam.PolicyStatement( 287 | sid = "DenyAssummingOtherIAMRoles", 288 | effect=iam.Effect.DENY, 289 | actions = [ 290 | "sts:AssumeRole" 291 | ], 292 | resources=["*"]), 293 | ] 294 | ) 295 | 296 | role_1 = iam.Role(self, "DataScientist1IAMRole", 297 | role_name = f"{ROLE_NAME_PREFIX}{data_scientist_role_1.to_string()}", 298 | assumed_by = iam.ServicePrincipal("sagemaker.amazonaws.com"), 299 | description = f"Custom role for user {data_scientist_role_1.to_string()}.", 300 | managed_policies = [ 301 | iam.ManagedPolicy.from_aws_managed_policy_name("AmazonAthenaFullAccess"), 302 | user_profile_managed_policy 303 | ], 304 | ) 305 | 306 | role_2 = iam.Role(self, "DataScientist2IAMRole", 307 | role_name = f"{ROLE_NAME_PREFIX}{data_scientist_role_2.to_string()}", 308 | assumed_by = iam.ServicePrincipal('sagemaker.amazonaws.com'), 309 | description = f"Custom role for user {data_scientist_role_2.to_string()}.", 310 | managed_policies = [ 311 | iam.ManagedPolicy.from_aws_managed_policy_name("AmazonAthenaFullAccess"), 312 | user_profile_managed_policy 313 | ], 314 | ) 315 | 316 | core.Tags.of(role_1).add("userprofilename", user_data_scientist_1.value_as_string) 317 | core.Tags.of(role_2).add("userprofilename", user_data_scientist_2.value_as_string) 318 | 319 | # Grant Lake Formation Permissinos for Amazon Reviews Table 320 | 321 | lf_permission_1 = lf.CfnPermissions(self, "LFPermissionDataScientist1", 322 | data_lake_principal = lf.CfnPermissions.DataLakePrincipalProperty(data_lake_principal_identifier = role_1.role_arn), 323 | resource = lf.CfnPermissions.ResourceProperty( 324 | table_resource = lf.CfnPermissions.TableResourceProperty( 325 | name = glue_table_name.value_as_string, 326 | database_name = glue_db_name.value_as_string 327 | ) 328 | ), 329 | permissions = ["SELECT"], 330 | permissions_with_grant_option = ["SELECT"]) 331 | 332 | lf_permission_2 = lf.CfnPermissions(self, "LFPermissionDataScientist2", 333 | data_lake_principal = lf.CfnPermissions.DataLakePrincipalProperty(data_lake_principal_identifier = role_2.role_arn), 334 | resource = lf.CfnPermissions.ResourceProperty( 335 | table_with_columns_resource = lf.CfnPermissions.TableWithColumnsResourceProperty( 336 | column_names = ["product_category","product_id","product_parent","product_title","star_rating","review_headline","review_body","review_date"], 337 | name = glue_table_name.value_as_string, 338 | database_name = glue_db_name.value_as_string 339 | ) 340 | ), 341 | permissions = ["SELECT"], 342 | permissions_with_grant_option = ["SELECT"]) 343 | 344 | # Stack Outputs 345 | 346 | core.CfnOutput(self, "IAMUserDSFull", 347 | value=user_1.user_name, 348 | description="IAM User Data Scientist 1", 349 | condition=aws_iam_users 350 | ) 351 | 352 | core.CfnOutput(self, "IAMUserDSLimited", 353 | value=user_2.user_name, 354 | 
description="IAM User Data Scientist 2", 355 | condition=aws_iam_users 356 | ) -------------------------------------------------------------------------------- /cdktemplate/sagemaker_studio_audit_control/sagemaker_studio_audit_control_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from aws_cdk import ( 5 | core 6 | ) 7 | import os 8 | 9 | NESTED_STACK_URL_PREFIX = os.environ["NESTED_STACK_URL_PREFIX"] 10 | 11 | class SageMakerStudioAuditControlStack(core.Stack): 12 | 13 | def __init__(self, scope: core.Construct, id: str, **kwargs) -> None: 14 | super().__init__(scope, id, **kwargs) 15 | 16 | global NESTED_STACK_URL_PREFIX 17 | 18 | # CloudFormation Parameters 19 | 20 | studio_authentication = core.CfnParameter(self, "StudioAuthentication", 21 | type="String", 22 | description="Authentication method for SageMaker Studio.", 23 | allowed_values=[ 24 | "AWS IAM with IAM users", 25 | "AWS IAM with AWS account federation (external IdP)" 26 | ], 27 | default = "AWS IAM with IAM users" 28 | ) 29 | 30 | user_data_scientist_1 = core.CfnParameter(self, "DataScientistFullAccess", 31 | type="String", 32 | description="\ 33 | User profile for data scientist with full access to Amazon Reviews. \ 34 | An IAM user with the same name will be created if the authentication method is \"AWS IAM with IAM users\"", 35 | allowed_pattern="^[a-zA-Z0-9](-*[a-zA-Z0-9])*", 36 | default = "data-scientist-full" 37 | ) 38 | 39 | user_data_scientist_2 = core.CfnParameter(self, "DataScientistLimitedAccess", 40 | type="String", 41 | description= "\ 42 | User profile for data scientist with limited access to Amazon Reviews. \ 43 | An IAM user with the same name will be created if the authentication method is \"AWS IAM with IAM users\"", 44 | allowed_pattern="^[a-zA-Z0-9](-*[a-zA-Z0-9])*", 45 | default = "data-scientist-limited" 46 | ) 47 | 48 | federated_user_data_scientist_1 = core.CfnParameter(self, "FederatedDataScientistFullAccess", 49 | type="String", 50 | description="\ 51 | IdP user name for data scientist with full access to Amazon Reviews (e.g., \"username\", or \"username@domain\").", 52 | ) 53 | 54 | federated_user_data_scientist_2 = core.CfnParameter(self, "FederatedDataScientistLimitedAccess", 55 | type="String", 56 | description="\ 57 | IdP user name for data scientist with limited access to Amazon Reviews (e.g., \"username\", or \"username@domain\").", 58 | ) 59 | 60 | glue_db_name = core.CfnParameter(self, "GlueDatabaseNameAmazonReviews", 61 | type="String", 62 | description="Name of Glue DB to be created for Amazon Reviews.", 63 | allowed_pattern="[\w-]+", 64 | default = "amazon_reviews_db" 65 | ) 66 | 67 | glue_table_name = core.CfnParameter(self, "GlueTableNameAmazonReviews", 68 | type="String", 69 | description="Name of Glue Table to be created for Amazon Reviews (Parquet).", 70 | allowed_pattern="[\w-]+", 71 | default = "amazon_reviews_parquet" 72 | ) 73 | 74 | sagemaker_studio_vpc = core.CfnParameter(self, "SageMakerStudioVpcId", 75 | type="AWS::EC2::VPC::Id", 76 | description="VPC that SageMaker Studio will use for communication with the EFS volume." 77 | ) 78 | 79 | sagemaker_studio_subnets = core.CfnParameter(self, "SageMakerStudiosubnetIds", 80 | type="List", 81 | description="Subnet(s) that SageMaker Studio will use for communication with the EFS volume. Must be in the selected VPC and in different AZs." 
82 | ) 83 | 84 | self.template_options.template_format_version = "2010-09-09" 85 | self.template_options.description = "\ 86 | Control and audit data exploration activities with Amazon SageMaker Studio and AWS Lake Formation." 87 | self.template_options.metadata = { 88 | "License": "MIT-0", 89 | "AWS::CloudFormation::Interface": { 90 | "ParameterGroups": [ 91 | { 92 | "Label": { "default": "SageMaker Studio Authentication" }, 93 | "Parameters": [ studio_authentication.logical_id ] 94 | }, 95 | { 96 | "Label": { "default": "SageMaker Studio User Profiles" }, 97 | "Parameters": [ 98 | user_data_scientist_1.logical_id, 99 | user_data_scientist_2.logical_id 100 | ] 101 | }, 102 | { 103 | "Label": { "default": "Federated Identities - SKIP THIS SECTION IF USING \"AWS IAM with IAM users\" FOR AUTHENTICATION" }, 104 | "Parameters": [ 105 | federated_user_data_scientist_1.logical_id, 106 | federated_user_data_scientist_2.logical_id 107 | ] 108 | }, 109 | { 110 | "Label": { "default": "Amazon Reviews Dataset" }, 111 | "Parameters": [ 112 | glue_db_name.logical_id, 113 | glue_table_name.logical_id 114 | ] 115 | }, 116 | { 117 | "Label": { "default": "SageMaker Studio VPC" }, 118 | "Parameters": [ 119 | sagemaker_studio_vpc.logical_id, 120 | sagemaker_studio_subnets.logical_id 121 | ] 122 | } 123 | ], 124 | "ParameterLabels": { 125 | studio_authentication.logical_id: { 126 | "default": "Authentication method" 127 | }, 128 | user_data_scientist_1.logical_id: { 129 | "default": "Data Scientist 1 User Profile" 130 | }, 131 | user_data_scientist_2.logical_id: { 132 | "default": "Data Scientist 2 User Profile" 133 | }, 134 | federated_user_data_scientist_1.logical_id: { 135 | "default": "Data Scientist 1's IdP user name" 136 | }, 137 | federated_user_data_scientist_2.logical_id: { 138 | "default": "Data Scientist 2's IdP user name" 139 | }, 140 | glue_db_name.logical_id: { 141 | "default": "Glue Database Name" 142 | }, 143 | glue_table_name.logical_id: { 144 | "default": "Glue Table Name" 145 | }, 146 | sagemaker_studio_vpc.logical_id: { 147 | "default": "SageMaker Studio VPC ID" 148 | }, 149 | sagemaker_studio_subnets.logical_id: { 150 | "default": "SageMaker Studio Subnet(s) ID" 151 | } 152 | } 153 | } 154 | } 155 | 156 | # Conditions for SageMaker Studio authentication 157 | 158 | aws_iam_users = core.CfnCondition(self, "IsIAMUserAuthentication", 159 | expression = core.Fn.condition_equals("AWS IAM with IAM users", studio_authentication) 160 | ) 161 | 162 | aws_federation = core.CfnCondition(self, "IsFederatedAuthentication", 163 | expression = core.Fn.condition_equals("AWS IAM with AWS account federation (external IdP)", studio_authentication) 164 | ) 165 | 166 | # Nested Stacks 167 | 168 | amazon_reviews_dataset = core.CfnStack(self, "AmazonReviewsDatasetStack", 169 | template_url = NESTED_STACK_URL_PREFIX + "AmazonReviewsDatasetStack.yaml", 170 | parameters = { 171 | "GlueDatabaseNameAmazonReviews" : glue_db_name.value_as_string, 172 | "GlueTableNameAmazonReviews" : glue_table_name.value_as_string 173 | }) 174 | 175 | data_scientist_users = core.CfnStack(self, "DataScientistUsersStack", 176 | template_url = NESTED_STACK_URL_PREFIX + "DataScientistUsersStack.yaml", 177 | parameters = { 178 | "StudioAuthentication" : studio_authentication.value_as_string, 179 | "DataScientistFullAccess" : user_data_scientist_1.value_as_string, 180 | "DataScientistLimitedAccess" : user_data_scientist_2.value_as_string, 181 | "FederatedDataScientistFullAccess" : federated_user_data_scientist_1.value_as_string, 182 | 
"FederatedDataScientistLimitedAccess" : federated_user_data_scientist_2.value_as_string, 183 | "GlueDatabaseNameAmazonReviews" : glue_db_name.value_as_string, 184 | "GlueTableNameAmazonReviews" : glue_table_name.value_as_string, 185 | }) 186 | 187 | data_scientist_users.add_depends_on(amazon_reviews_dataset) 188 | 189 | sagemaker_studio = core.CfnStack(self, "SageMakerStudioStack", 190 | template_url = NESTED_STACK_URL_PREFIX + "SageMakerStudioStack.yaml", 191 | parameters = { 192 | "StudioAuthentication" : studio_authentication.value_as_string, 193 | "DataScientistFullAccessUsername" : user_data_scientist_1.value_as_string, 194 | "DataScientistLimitedAccessUsername" : user_data_scientist_2.value_as_string, 195 | "FederatedDataScientistFullAccess" : federated_user_data_scientist_1.value_as_string, 196 | "FederatedDataScientistLimitedAccess" : federated_user_data_scientist_2.value_as_string, 197 | "SageMakerStudioVpcId" : sagemaker_studio_vpc.value_as_string, 198 | "SageMakerStudiosubnetIds" : core.Fn.join(",",sagemaker_studio_subnets.value_as_list) 199 | }) 200 | 201 | sagemaker_studio.add_depends_on(data_scientist_users) 202 | 203 | # Stack Outputs 204 | 205 | core.CfnOutput(self, "IAMUserDSFull", 206 | value=data_scientist_users.get_att("Outputs.IAMUserDSFull").to_string(), 207 | description="IAM User Data Scientist 1", 208 | condition=aws_iam_users 209 | ) 210 | 211 | core.CfnOutput(self, "IAMUserDSLimited", 212 | value=data_scientist_users.get_att("Outputs.IAMUserDSLimited").to_string(), 213 | description="IAM User Data Scientist 2", 214 | condition=aws_iam_users 215 | ) 216 | -------------------------------------------------------------------------------- /cdktemplate/sagemaker_studio_audit_control/sagemaker_studio_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from aws_cdk import ( 5 | aws_iam as iam, 6 | # aws_lambda as _lambda, 7 | # aws_sagemaker as sm, 8 | core 9 | ) 10 | import os 11 | 12 | ROLE_NAME_PREFIX = os.environ["ROLE_NAME_PREFIX"] 13 | 14 | class SageMakerStudioStack(core.Stack): 15 | 16 | def __init__(self, scope: core.Construct, id: str, **kwargs) -> None: 17 | super().__init__(scope, id, **kwargs) 18 | 19 | # CloudFormation Parameters 20 | 21 | studio_authentication = core.CfnParameter(self, "StudioAuthentication", 22 | type="String", 23 | description="Authentication method for SageMaker Studio.", 24 | allowed_values=[ 25 | "AWS IAM with IAM users", 26 | "AWS IAM with AWS account federation (external IdP)" 27 | ], 28 | default = "AWS IAM with IAM users" 29 | ) 30 | 31 | user_data_scientist_1 = core.CfnParameter(self, "DataScientistFullAccessUsername", 32 | type="String", 33 | description="Username for Data Scientist with full access to Amazon Reviews.", 34 | allowed_pattern="^[a-zA-Z0-9](-*[a-zA-Z0-9])*", 35 | default = "data-scientist-full" 36 | ) 37 | 38 | user_data_scientist_2 = core.CfnParameter(self, "DataScientistLimitedAccessUsername", 39 | type="String", 40 | description="Username for Data Scientist with limited access to Amazon Reviews.", 41 | allowed_pattern="^[a-zA-Z0-9](-*[a-zA-Z0-9])*", 42 | default = "data-scientist-limited" 43 | ) 44 | 45 | federated_user_data_scientist_1 = core.CfnParameter(self, "FederatedDataScientistFullAccess", 46 | type="String", 47 | description="\ 48 | IdP user name for data scientist with full access to Amazon Reviews (e.g., \"username\", or \"username@domain\").", 49 | ) 50 | 51 | federated_user_data_scientist_2 = core.CfnParameter(self, "FederatedDataScientistLimitedAccess", 52 | type="String", 53 | description="\ 54 | IdP user name for data scientist with limited access to Amazon Reviews (e.g., \"username\", or \"username@domain\").", 55 | ) 56 | 57 | sagemaker_studio_vpc = core.CfnParameter(self, "SageMakerStudioVpcId", 58 | type="String", 59 | description="VPC that SageMaker Studio will use for communication with the EFS volume." 60 | ) 61 | 62 | sagemaker_studio_subnets = core.CfnParameter(self, "SageMakerStudiosubnetIds", 63 | type="CommaDelimitedList", 64 | description="Subnet(s) that SageMaker Studio will use for communication with the EFS volume. Must be in the selected VPC and in different AZs." 65 | ) 66 | 67 | self.template_options.template_format_version = "2010-09-09" 68 | self.template_options.description = "SageMaker Studio and Studio User Profiles." 
69 | self.template_options.metadata = { "License": "MIT-0" } 70 | 71 | # Conditions for SageMaker Studio authentication 72 | 73 | aws_iam_users = core.CfnCondition(self, "IsIAMUserAuthentication", 74 | expression = core.Fn.condition_equals("AWS IAM with IAM users", studio_authentication) 75 | ) 76 | 77 | aws_federation = core.CfnCondition(self, "IsFederatedAuthentication", 78 | expression = core.Fn.condition_equals("AWS IAM with AWS account federation (external IdP)", studio_authentication) 79 | ) 80 | 81 | # IAM Users and Roles for Data Scientists 82 | 83 | data_scientist_role_1 = core.Fn.condition_if( 84 | aws_iam_users.logical_id, 85 | user_data_scientist_1.value_as_string, 86 | core.Fn.condition_if(aws_federation.logical_id, federated_user_data_scientist_1.value_as_string, "") 87 | ) 88 | 89 | data_scientist_role_2 = core.Fn.condition_if( 90 | aws_iam_users.logical_id, 91 | user_data_scientist_2.value_as_string, 92 | core.Fn.condition_if(aws_federation.logical_id, federated_user_data_scientist_2.value_as_string, "") 93 | ) 94 | 95 | role_1 = iam.Role.from_role_arn(self, "DataScientistFullIAMRole", 96 | role_arn = f"arn:aws:iam::{core.Aws.ACCOUNT_ID}:role/{ROLE_NAME_PREFIX}{data_scientist_role_1.to_string()}" 97 | ) 98 | 99 | role_2 = iam.Role.from_role_arn(self, "DataScientistLimitedIAMRole", 100 | role_arn = f"arn:aws:iam::{core.Aws.ACCOUNT_ID}:role/{ROLE_NAME_PREFIX}{data_scientist_role_2.to_string()}" 101 | ) 102 | 103 | # Create SageMaker Studio Domain (as CfnResource) 104 | 105 | sm_default_execution_role = iam.Role(self, "SageMakerStudioDefaultExecutionRole", 106 | role_name = ROLE_NAME_PREFIX + "Default", 107 | assumed_by = iam.ServicePrincipal('sagemaker.amazonaws.com'), 108 | managed_policies = [iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")] 109 | ) 110 | 111 | sm_domain = core.CfnResource(self, "SageMakerDomain", 112 | type = "AWS::SageMaker::Domain", 113 | properties = { 114 | "AuthMode" : "IAM", 115 | "DefaultUserSettings" : { 116 | "ExecutionRole": sm_default_execution_role.role_arn 117 | }, 118 | "DomainName" : "default-domain", 119 | "SubnetIds" : sagemaker_studio_subnets.value_as_list, 120 | "VpcId" : sagemaker_studio_vpc.value_as_string 121 | }) 122 | 123 | sm_domain_id = sm_domain.ref 124 | 125 | # Create SageMaker Studio User Profiles (as CfnResources) 126 | 127 | sm_profile_full = core.CfnResource(self, "SageMakerUserProfileDataScientistFull", 128 | type = "AWS::SageMaker::UserProfile", 129 | properties = { 130 | "DomainId" : sm_domain_id, 131 | "Tags" : [{ 132 | "Key" : "studiouserid", 133 | "Value" : data_scientist_role_1 134 | }], 135 | "UserProfileName" : user_data_scientist_1.value_as_string, 136 | "UserSettings" : { 137 | "ExecutionRole" : role_1.role_arn, 138 | } 139 | }) 140 | 141 | sm_profile_limited = core.CfnResource(self, "SageMakerUserProfileDataScientistLimited", 142 | type = "AWS::SageMaker::UserProfile", 143 | properties = { 144 | "DomainId" : sm_domain_id, 145 | "Tags" : [{ 146 | "Key" : "studiouserid", 147 | "Value" : data_scientist_role_2 148 | }], 149 | "UserProfileName" : user_data_scientist_2.value_as_string, 150 | "UserSettings" : { 151 | "ExecutionRole" : role_2.role_arn, 152 | } 153 | }) 154 | 155 | sm_profile_full.node.add_dependency(sm_domain) 156 | sm_profile_limited.node.add_dependency(sm_domain) -------------------------------------------------------------------------------- /cdktemplate/setup.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, 
Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import setuptools 5 | 6 | 7 | with open("README.md") as fp: 8 | long_description = fp.read() 9 | 10 | 11 | setuptools.setup( 12 | name="sagemaker_studio_audit_control", 13 | version="0.0.1", 14 | 15 | description="SageMaker Studio Audit and Control", 16 | long_description=long_description, 17 | long_description_content_type="text/markdown", 18 | 19 | author="Rodrigo Alarcon", 20 | 21 | package_dir={"": "sagemaker_studio_audit_control"}, 22 | packages=setuptools.find_packages(where="sagemaker_studio_audit_control"), 23 | 24 | install_requires=[ 25 | "aws-cdk.core==1.90.0", 26 | "aws-cdk.aws-iam", 27 | "aws-cdk.aws_lakeformation", 28 | "aws-cdk.aws_glue", 29 | "aws-cdk.aws_s3", 30 | "aws-cdk.aws_secretsmanager", 31 | # "aws-cdk.aws_sagemaker", 32 | "aws-cdk.aws_sso" 33 | ], 34 | 35 | python_requires=">=3.6", 36 | 37 | classifiers=[ 38 | "Development Status :: 4 - Beta", 39 | 40 | "Intended Audience :: Developers", 41 | 42 | "License :: OSI Approved :: MIT License", 43 | 44 | "Programming Language :: JavaScript", 45 | "Programming Language :: Python :: 3 :: Only", 46 | "Programming Language :: Python :: 3.6", 47 | "Programming Language :: Python :: 3.7", 48 | "Programming Language :: Python :: 3.8", 49 | 50 | "Topic :: Software Development :: Code Generators", 51 | "Topic :: Utilities", 52 | 53 | "Typing :: Typed", 54 | ], 55 | ) 56 | -------------------------------------------------------------------------------- /cdktemplate/source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .env/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 
11 | 12 | echo Executing .env\Scripts\activate.bat for you 13 | .env\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /images/0SageMakerAuditControl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/0SageMakerAuditControl.png -------------------------------------------------------------------------------- /images/1CreateDatabase.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/1CreateDatabase.png -------------------------------------------------------------------------------- /images/1RegisterLocation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/1RegisterLocation.png -------------------------------------------------------------------------------- /images/1VerifyTable.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/1VerifyTable.png -------------------------------------------------------------------------------- /images/3LakeFormationPermissions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/3LakeFormationPermissions.png -------------------------------------------------------------------------------- /images/4SageMakerStudioDomain.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/4SageMakerStudioDomain.png -------------------------------------------------------------------------------- /images/5NotebookUserFull.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/5NotebookUserFull.png -------------------------------------------------------------------------------- /images/5SageMakerStudioLimited.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/5SageMakerStudioLimited.png -------------------------------------------------------------------------------- /images/5SageMakerStudioNotebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/5SageMakerStudioNotebook.png -------------------------------------------------------------------------------- /images/6CloudTrail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/6CloudTrail.png 
-------------------------------------------------------------------------------- /images/6LakeFormation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/6LakeFormation.png -------------------------------------------------------------------------------- /images/7IAMUsersAndGroups.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/7IAMUsersAndGroups.png -------------------------------------------------------------------------------- /images/8FederatedIdentities.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/images/8FederatedIdentities.png -------------------------------------------------------------------------------- /notebook/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-studio-audit/b0cb649f1c5cec768b3db4b93c289df7f4433093/notebook/.DS_Store -------------------------------------------------------------------------------- /notebook/sagemaker_studio_audit_control.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Control and audit data exploration activities with Amazon SageMaker Studio and AWS Lake Formation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook accompanies the blog post \"Control and audit data exploration activities with Amazon SageMaker Studio and AWS Lake Formation\". The notebook demonstrates how to use SageMaker Studio along with Lake Formation to provide granular access to a data lake for different data scientists. The queries used in this notebook are based on the [Amazon Customer Reviews Dataset](https://registry.opendata.aws/amazon-reviews/), which should be registered in an existing data lake before running this code." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "To compare data permissions across users, you should run the same notebook using different SageMaker user profiles." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "### Prerequisites" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "This implementation uses Amazon Athena and the [PyAthena](https://pypi.org/project/PyAthena/) client to query data on a data lake registered with AWS Lake Formation. We will also use Pandas to run queries and store the results as Dataframes." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "First we install PyAthena and import the required libraries." 
43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "!pip install pyathena" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from pyathena import connect\n", 61 | "import pandas as pd\n", 62 | "import boto3" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "The AWS Account ID and AWS Region will be used to create an S3 bucket where Athena will save query output files. The AWS Region will also be passed as parameter when connecting to our data lake through Athena using PyAthena." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "sts = boto3.client(\"sts\")\n", 79 | "account_id = sts.get_caller_identity()[\"Account\"]" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "region = boto3.session.Session().region_name" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "query_result_bucket_name = \"sagemaker-audit-control-query-results-{}-{}\".format(region, account_id)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### Create S3 bucket for query output files - SKIP THIS SECTION FOR THE SECOND DATA SCIENTIST USER" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "query_result_bucket = {}\n", 114 | "\n", 115 | "if region == \"us-east-1\":\n", 116 | " s3 = boto3.client(\"s3\")\n", 117 | " query_result_bucket = s3.create_bucket(\n", 118 | " Bucket = query_result_bucket_name,\n", 119 | " )\n", 120 | "else:\n", 121 | " s3 = boto3.client(\"s3\", region_name=region)\n", 122 | " query_result_bucket = s3.create_bucket(\n", 123 | " Bucket = query_result_bucket_name,\n", 124 | " CreateBucketConfiguration = {\n", 125 | " \"LocationConstraint\": region\n", 126 | " }\n", 127 | " )" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "### Run queries using Amazon Athena and PyAthena" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Once the prerequisites are configured, we can start running queries on the data lake through Athena using the PyAthena client. " 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "First we create a connection to Athena using PyAthena's `connect` constructor. We will pass this object as a parameter when we run queries with Pandas `read_sql` method." 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "conn = connect(s3_staging_dir =\"s3://{}/queries/\".format(query_result_bucket_name), region_name=region)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Our first query will list all the databases to which this user has been granted access in the data lake." 
165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "db_name_df = pd.read_sql(\"SHOW DATABASES\", conn)\n", 174 | "db_name = db_name_df.iloc[0][0]\n", 175 | "print(db_name)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "Our second query will list all the tables in the previous database to which this user has been granted access." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "tables_df = pd.read_sql(\"SHOW TABLES IN {}\".format(db_name), conn)\n", 192 | "table_name = tables_df.iloc[0][0]\n", 193 | "print(table_name)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "Finally we run a `SELECT` query to see all columns in the previous table to which this user has been granted access. If you have full permissions for the table, the `SELECT` query output will include the following columns:\n", 201 | "- marketplace \n", 202 | "- customer_id \n", 203 | "- review_id \n", 204 | "- product_id \n", 205 | "- product_parent \n", 206 | "- product_title \n", 207 | "- star_rating \n", 208 | "- helpful_votes \n", 209 | "- total_votes \n", 210 | "- vine \n", 211 | "- verified_purchase \n", 212 | "- review_headline \n", 213 | "- review_body \n", 214 | "- review_date \n", 215 | "- year\n", 216 | "- product_category" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "df = pd.read_sql(\"SELECT * FROM {}.{} LIMIT 10\".format(db_name, table_name), conn)\n", 226 | "df.head(10)" 227 | ] 228 | } 229 | ], 230 | "metadata": { 231 | "instance_type": "ml.t3.medium", 232 | "kernelspec": { 233 | "display_name": "Python 3 (Data Science)", 234 | "language": "python", 235 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 236 | }, 237 | "language_info": { 238 | "codemirror_mode": { 239 | "name": "ipython", 240 | "version": 3 241 | }, 242 | "file_extension": ".py", 243 | "mimetype": "text/x-python", 244 | "name": "python", 245 | "nbconvert_exporter": "python", 246 | "pygments_lexer": "ipython3", 247 | "version": "3.7.6" 248 | } 249 | }, 250 | "nbformat": 4, 251 | "nbformat_minor": 4 252 | } 253 | --------------------------------------------------------------------------------