├── README.md
├── diagram
│   ├── ImmersionDay-ServerlessDataLake-RevPB3.xml
│   └── README.md
├── lab
│   ├── README.md
│   ├── Serverless Data Lake Day Lab-PREP.docx
│   ├── Serverless Data Lake Day Lab1-KDG-KFH-S3.docx
│   ├── Serverless Data Lake Day Lab2-1to2-3-Glue-DataCatalogTransform.docx
│   ├── Serverless Data Lake Day Lab2-4-OpenData-GDELT.docx
│   └── Serverless Data Lake Day Lab3-Athena-QS-and-3PP.docx
├── presentation
│   ├── AWSLoft-BuildingAServerlessDataLake-RevA-2019-03-07-c.pdf
│   └── README.md
└── scripts
    ├── ImmersionDay-ServerlessDataLake-RevPB3.xml
    ├── Lab 2.3 - Advanced Data Preparation with Developer Endpoints and Notebook - PA5-2019-12-02.ipynb
    ├── README.md
    ├── sdlimmersionlab-IAMuser-policy-PA2.json
    └── serverlessDataLakeImmersionIAMcf.json
/README.md:
--------------------------------------------------------------------------------
1 | # aws-serverless-data-lake-workshop
2 |
3 | This workshop gives customers hands-on experience with AWS big data and analytics services. The Serverless Data Lake workshop helps customers build a cloud-native, future-proof serverless data lake architecture. It covers Amazon Kinesis services for streaming data ingestion and analytics, AWS Glue for ETL and Data Catalog management, and Amazon Athena for querying the data lake.
4 |
--------------------------------------------------------------------------------
/diagram/ImmersionDay-ServerlessDataLake-RevPB3.xml:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/diagram/README.md:
--------------------------------------------------------------------------------
1 | # Workshop Architecture Diagram #
2 |
3 |
4 | We've provided the Draw.io diagram depicted in the Lab guides.
5 |
--------------------------------------------------------------------------------
/lab/README.md:
--------------------------------------------------------------------------------
1 | # Lab Guide #
2 |
3 |
4 | This lab guide helps you ingest, store, transform, and create insights from unstructured data using AWS serverless services. Most of the demos use the AWS Console; however, all the labs can be automated via CloudFormation templates, the AWS CLI, or the AWS API.
5 |
--------------------------------------------------------------------------------
/lab/Serverless Data Lake Day Lab-PREP.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWS-Big-Data-Projects/aws-serverless-data-lake-workshop/d385ed760c4ced2469be6d5ecac9210d86ee6cdf/lab/Serverless Data Lake Day Lab-PREP.docx
--------------------------------------------------------------------------------
/lab/Serverless Data Lake Day Lab1-KDG-KFH-S3.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWS-Big-Data-Projects/aws-serverless-data-lake-workshop/d385ed760c4ced2469be6d5ecac9210d86ee6cdf/lab/Serverless Data Lake Day Lab1-KDG-KFH-S3.docx
--------------------------------------------------------------------------------
/lab/Serverless Data Lake Day Lab2-1to2-3-Glue-DataCatalogTransform.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWS-Big-Data-Projects/aws-serverless-data-lake-workshop/d385ed760c4ced2469be6d5ecac9210d86ee6cdf/lab/Serverless Data Lake Day Lab2-1to2-3-Glue-DataCatalogTransform.docx
--------------------------------------------------------------------------------
/lab/Serverless Data Lake Day Lab2-4-OpenData-GDELT.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWS-Big-Data-Projects/aws-serverless-data-lake-workshop/d385ed760c4ced2469be6d5ecac9210d86ee6cdf/lab/Serverless Data Lake Day Lab2-4-OpenData-GDELT.docx
--------------------------------------------------------------------------------
/lab/Serverless Data Lake Day Lab3-Athena-QS-and-3PP.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWS-Big-Data-Projects/aws-serverless-data-lake-workshop/d385ed760c4ced2469be6d5ecac9210d86ee6cdf/lab/Serverless Data Lake Day Lab3-Athena-QS-and-3PP.docx
--------------------------------------------------------------------------------
/presentation/AWSLoft-BuildingAServerlessDataLake-RevA-2019-03-07-c.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWS-Big-Data-Projects/aws-serverless-data-lake-workshop/d385ed760c4ced2469be6d5ecac9210d86ee6cdf/presentation/AWSLoft-BuildingAServerlessDataLake-RevA-2019-03-07-c.pdf
--------------------------------------------------------------------------------
/presentation/README.md:
--------------------------------------------------------------------------------
1 | # Slides from the workshop Building Serverless Data Lakes on AWS #
2 |
3 |
4 | Download link for the higher resolution presentation deck: https://s3.amazonaws.com/hbapub/loft/AWSLoft-BuildingAServerlessDataLake-RevA-2019-03-07.pdf
5 |
--------------------------------------------------------------------------------
/scripts/ImmersionDay-ServerlessDataLake-RevPB3.xml:
--------------------------------------------------------------------------------
1 | 7T1pk5pK178mVfd5q5Ki2fkICIqioLh/SSEgoGyyiPrr38ZlZgQmYyZOZjKJ91YGG2iasy99jl8w3t81Yz1yuqFpeV9QxNx9wRpfUBSQFAL/FCP70wiKgPOIHbvm+arHAc09WOfBy2WZa1rJ1YVpGHqpG10PGmEQWEZ6NabHcZhfX7YMveunRrptVQY0Q/eqoxPXTJ3TKI1Sj+Mty7Wdy5MByZzOLHRjbcdhFpyf9wXFlsfP6bSvX+Y6v2ji6GaYPxnChC8YH4dhejryd7zlFcC9gO10n/jM2Yd1x1aQ3nQDhZ9u2epeZl3WfFxZur9A4/g+VnEH+IJxemycEQaxhXGmnjgP55I0DtcWH3phfLwVQ44feGbpet5lPAgDeDtnx7rpwnWWhosJVT1NrTg4zgqJ52HmCzIgnLjqq57ffmvFqbV7MnR+9aYV+lYa7+El57P4BQ1nOv0KkPNA/gTrlzHnCcYpkjhT25nS7IfJH6END84Afwb4OPMXAx9cwx6lqqBHiRrQPyDtV0CP/sWAx5FryJNMDdHjNZAH9B0gj/3FkAdlcYPeSPMYfgfI3yDpoTqKikPXPypIzkl97wzq4i1dqCFZz7ULQKVh9GRU1heWp4aJm7phcXYRpmnowwu84gT3oBWfoOqsFzHu+DA2iU6KvECefvmydHcFrrnzehpOmhYWAFu8MyoaZoB+c6ENsHQhvcTfDPhEVDT1VId/ivEE/tV9/RAGX/U8+ZqkVmC4XjF6FPRiD57QrHjrGtZ3DT7SXbrGdyOM9t+/sxPtO++FmQmRRH+LAvsejIeX8U/XqZsq+u/Bd+AG9FuByRbW0zVzPHDbE2qoYTySPDLes3CyzCubqwqlJ0Co44HLWGx5eupury21Osicn6CGLlzJAxIgM30jruUfdT1HEmaxYZ1ve2o7lWbCQWUm8lvJLEj12LbSylxHZD28+m1y8wZbYRk+SrczTnj+F3n3yQkDotSKC/48z/Mw8IQywiz13ACSxsUyL0tjuDD++KkjI5KmAYc9L6wvww03hrOflhWEcUE3NSL7Is78nV04Kt8CK83DeJ18SyDTW/H3JA3jk6C7A3fT18yNYWiFtwFC1GhV5A6mJF5Vqx2IhcSFwg4RAn3hQS5GETaC4rNMNbUYe2B85BZKqSewH5FJGTVQPmPf1qclf7dOC/6uR1GVeESCJjD8eRK5i55GSpxNAHAZeYJPEv9Wg1CSKkuBV2GUqGBUg0iGREZ68EHcIoZHdnEU63nxwnFo8npaPW1tISCS6vh/bU3p/a86XqGPF7H/opVwzZmgjjReJsLaK+rIyPyWYFxmrK20Sjw4SpMGUSd5CIpFG3Sd8XEHgnowsfcX1VFV/CSoJSbsG408/eB3IC2yQlrN4ghF/hOG8v8+sICwjwt+B4lAU9cIBHU+E1qHQIa5A8KoW2XBbukfBX31zIN8QJ4VCKoebzIr/ScTfotMoEokhSPMe8qEqnkp9ZqCNqyg/gJe6E0VeIys2IVPKxB2HFIfv3O546aWFulGcX0OkXFNGEdgXnz7O8U2yJKLVcOnaF1s4x4u1sVDecqmQ2Ug/FkgJGnyRRACGn8jEIIKCNWBwgua9mcBkSaodwRiNbo5EFRlMDwpBt0/6t7jv3BkLGkjVpbmfxiZAgRlXmZ1BLwNiMkqiCvg+0E0BXkhmkIJLCkUFxVBE+08ZeHmhnYY6J7wOPrh4y0kWvaLS+C/NdxCoWV9WZromVgLxIG+f3JZVFyQPL/g8nMuC/7ZdT1S0mkFrw38kDdEzH8PqUEKi/fTYs5vCEpcBmbHAQIDl4EncqHxJ0QEKwgEJZP9VgotOwkoWproXhSK1S/4WQrFfo6i6V+8/vLed+OAtw1diyLDYNinELZUOSINkNeK20pwG4By4vVu5FxZNPFjgquurXTHr5NcNfr2T7/X4oIB15YuBkrBz5ulJ1LWo7flUn6W3CoLfoHYyuvC7k1qNyR2qqR2k0D76LRD0kQ5
zo6/knqeTHX/3BtVtfafib8tXQ8aPi+G4E456n9RtuOZN42y0UTZVnrPyDtVDeRWkP55d79gxHVk6UGMvLjlq2zEvAr2dXL2xIJJpAcXJhxYEVTEbrHhoyZYwknwH22fpJb/EEGHD346wQ18/S9P/455egAlUUnrYAhTk919u2w9XY1yfiE4NnWsQP9CNOoo6EWa+jhJOv30Hu+SuAco8426Ri6KYA9DT7BLkXVCn6rs4HkVfqv2AsRvP3ONdXJ8For8x0n/O6L6w6Jx87jcCioRhKPEt0YlRl8zKSCrmrtGWZB30BV0NUPeOJpsiOpGVoGgD4y4wrj8Hl3W+U77Z6iSiMUft9Q8xR9Vx4T0DxyJmzFIv2xp/dJ2R4ZBPokDj5Y3O+FYxZe6OWqEX5B34dpyocT9vLJLvucvdJxpnLmX4/xkqjdAUY2h85egqJKffDWGHmd6AwTVWCq/hiAaKf6rqh3m+PkDEFfepQ+ICoPcjDkUe3GuO+KyLlF49hvKjkThgV5hmdxk4eXE1+QYcWCLYAAd7R5PXmbRIusY4pL1vRU/cYNPs14/CQ4vnnVaoD2R/oiezvRWE86oWEVl48l3TfOof+u2JjzGWH6YoHx9aUfVVAVUDcneo5SNeVV27CYW/tNY9WspRPdqPi1PdEcmrUssvQmTcnpqOP+Y9AdMCmpKTt+OTauhXy7z1nBE8ouIIzxQosIjPNW3nQOPwSKJauM/7ESDd/ByEY9cQpijSJGRqL9KO4fPUKQ5ee6ax2oR1raCmjKC8wMnF6oK7acXl1b7YX1j9wjr79buCPKKkSIgRFFK85a+MZQAZXcLpW6vLcHuQIlV1/iZbNZ/LbcIvboGPFvEP/7tDP89O8OxindHvmfWiqnLnNwvlsLz56zVhzc3SjsBiIv39PNbGUFpIvB2bsFDT4ufKfy+R133MyK7trb7qvz7Uu19zJJj7OkrKh6LsHl3zCmDHOk07ZCFn542coSRDY+EfvE949kZ/MMPlvIXlFPgYXuOeEJ/PAg5C8z0kTvGBnPr4Cm+Mp/7NjqajCOnOZ+MpbXGiuiGdQdjzR8M2h67G4iTwdBr0H6Uay2NTbvd/m7LcrZ/4HvdZmp29r1eu78HvmFO6SJmzPUY6hBsEYvAoLrngIrFqh20DFUvys3V6XY711HElGZ9OzZsieyG0WImDRC243bZfiqJESuydph4oYz0O25f4AdgvJFYh2MGY0pZbina1wgjE/jtFB4aWO+AxpzDopi1bQcIow7WwbA5t5uK77U7mpC1QpJ1W0MLmbYcPOPsWdNTolUHmh1iB0wanLqC8oFrdIcboG+LmvrlFtsZE1nEJyAdTqaENVx32ll3s8ibUm86BoYs7YXVDE03HSecNa2FsOsOhe5enqeQpDhtn8xXfakH7Qxuu6I2BRTooHvoHvwV59jrfNiwcSVQqIict+Sd3+ZtoAXDtsd30UU3VOKNHm6U1WbGFBif9/tItxEMofwRoeQvRrprrTfI4axL1WEPBF3QBfzWcmbjDW01Wr3gwCzjJanu8d2a8aztNJq2QTdoa3N34WOqhZnA7DUMzCBUB5+IOyLJ57PJ3G9nid7cZXKOG6pLt7h83oQIFTW102s4gh+M9Um0mXVRQ5F3uiE4kgbttlgnx5v1HDUP0zFpBaHbbfB9kmy2/P0a9MbTSZ/sbDbr2MhaYmOpHvC1sS1Aso0EEzMwG6TL6fkVxC2UANwhZigGIZE0nnWIdqb7iTaVAbO2ZaGdLXxCWkn7kdtft1W+xx33ZYrLaTymRtABEJltdyPxdk519l2PtINiwuVkC/+d+Z3iqTJjigNkKZsjr62N96jeE6DZnRDqirSg3BObosj1R25nn8jt5jKgaTnDKCZLVvJ+18OodOdAfkKXmb2Ft4uswEmLZp4wudpv9JHeYGRCInM7btbZr6S8C7beEk566OMDddq1Qe7grEVDg5ULQrK7GuPcdA1IsGgTVjpdZBP4at1Nh8xW
ayjTOEOx7SOk4BQNXN0iPbYPzMWEDszcaLO0OkSjGL5oFmfU3pS1tb02LX+1wFJCEZ3ZJOpkhSznesG8OcgVdx25+j6NjYDNBQ4VW/DUND/YrtRB03g6RCkx6EAyXXOD0Ujs8R1R24a5JDREju33G5ACCyeJZxPObXiENd4pg+EWW6StrDPyh/NIaLszHxMXBLfjIYn3iO4AMVvT/cKaOngCZdd8bzpUexSPsTZlYiam+xDHcFaWa7MN/jBbokK6msU9OmqGPJLKwzx2jIm4lfCsJeSI3nGlIEYpzhcd3BiyaoyQEKwtyAK+EHW0wOV6hZqwiIYcaqLuBPwOau7hyJOy3Z4MeZLq7JJ5Ux4VCiNqsHnXaAS6hnpzemj4PZZvu2Ho2kprQyabXaNDzeMNnmXBgehGCGnOIijo9abj9AeC0fJbXLLogEQbqqt+OOqOmk5MNsklboJl2ETh7Gv9sE3p3YYLbYUN8bZj073hZOy5zPLgq6w2THe4PRyD3qAP+YZcmNCyHEhDgVpnbV4T4r7hU9bUG8xlSMUiGu3xttFq7PAcYVljHK5W67YyhHJApNS0GQyVtkOfUDRL8bHL+ftw09mucEZdoCNvPhG9TDnxQ3PDOEMu300aDWTZ2tE8S3gztLvCFgTjU0SBaHq2oAphsx3uGWvb5DJGnQJv6ksEeejsNxMvMg6zrEnsKG0tpjk7U13W7osB2B92rrQYyMM15S9DpWHDV0XzdKojoLdaNnJaYfek0gqIbNGEYganrZ44hDzMqUprlYfihrFnAocdOsGweGvfSAazAEoXditM0aK3yTwJp5Dow40DqXsD4gXXTyZJkwZzIxxtt+1xu5BZfrZOfd6YLcn5ntf2rMZBk1eFJ5zG9pBiVNY3G3bYCbNDspA7INWnZ/JeQjVgbTsF+qSN7BJ+PGJJeTd2ZzyBLykFS/3RbjXzpbCtTcSVhPasubHjrBU/OYJ/KbiUeZyK8zvAyowG1tyQm00Y83mXD1N5Mo3ZBePIE2iEBep+5A0HrCEd5jSx7ilCY8OGRmfCbBckdShkvhkQpI+lwDSyYNCOxdMqp5vdIkUoYWdMd/HGRftgsxdaKWWhpMpyTaqxFnJ+TvMZsxoDbYQtJrtE7HGtdS/QkET2EjGnlENGyP1IDVaH9nIkqZ4tyk0n1GdbzMV9KSuEsrxfRwg17nIDR1qmeNZxgbPcgIazxmRnJ3Lr+X6mEcX7MhbjmvkwghJ6v3YkqBbVAOtIChTYKyh4RZyNBUrrtg8uNMy4FKXMtU+3Nn1iGTSnErIZy9bU6jsJzS/RpZOHY5pbdHnBLHDfStCFOh8gwlwR4r2DchEqgrE9m7Ji3tra09yhW4aNwdfXVweo49v8IaGUBT9JQjVujToxzY4xz5pPlZ4l6Vabpcws7xs9b7AKejhhhOP1nLRkbzsNR0pr3IppD58kk3jcQXurGa3gI3Ew2a1iTdYZPRlNskajtRL3yaYj94c7Qc8lz22II4YfyKxmdbGZKdpDy+7tKRMv1HbWwigzIeQl1Ug2YyXgffhaOSI3zTHYYlssThqDvCMFwygPVw7ObJyFbLUi0nCAsVj7ltoK+H6z5TbFvgYHjQzTe3hOIcpEbTZEcTqmTYHvcL3hCIizSdMB0sC0BmywslHlINPQxFHG09a221i2+hcyh+pB5Ga92dTDMwYHMw2K6gG913tWtmRxecIrs9yyJqntWcBaqr6hWxsWEZZLiRU32zZUp5LL4N1tm5dce9vD/X4Xs2dG3vDxEc2mmaR5/Z2QmRzroKAna6ME8KuCQ8SDj0AloxAUydtSF1/h3T4dje2W53QnYjI3IhJqrUM8jArNPmUFrdcHmiqkbQiogxNudGinzAb7SSJoVEZkYxEaaL6pYJaCZ+oKvp6Zu40hOYMjGbpGhWbeyxQZ729lxT7QcUdbt+nE305lGxfxTHY7UNWslsEKlcWVqlJqcIDP9YrCRA4Tt7a3CZpyoayZPcLMM9Fp
NTb4xOAAyS7JvRG0KTEhlGEArQ9Ooi110oa3k4SFNYRuwilMpjfdEOktp/O9trMY2/Q7o7UMTiKC4ICIWcN2YYx6XckH7sxt56bYF5Gh2G15zLg/DzbIrD2WzN1WR9NVT/dNwC7odorYG39aKDKl0A5MsMiaByPiVNQOxu0mwfF9yZm11qEcyeMMXandHBRvH/vqDqeXPD4kFSVhG3st5KNNUqhhw4M6kCNy29/t/IloW1vTVPxh/9CBFmcLtPWWEjYKrGwBlZpgMoKHm0bX2bA2P3BaU04fd5GhDEahQ0+687F4wEYiPoa6p9WdTgBjHiZdOtGE4VEL+QYn+T194EUp0gwTdgYNTmu3KEIzXDJtrQYJtWp2m4NmkrnGRMsnKitJXpOVtpiN8/ZyRRvQ1oxmQHBaLCf4mJv3Jauj7KiEa4VYMwvzzl50oP3FNVcxsm+MkaL8U5zSG2DzDjqxM0l0oLMRbhKw7iA2b/dHTQ2I0NkgCgtBWbvzvGeT2eLoVnHtwYgQ4nXbhlYZdPOK/+8R6mCu9+KTWE2cg6469JcOFr/UEAv5Uc8cNtC9PXRf/4R2OfrDWj9Cawy0bi/lY43P1W67ezTRBK/aY/A3VoeBSlwRAu+V5WFP0PxQg1VuUningp0fLPv5ArHq6qgXbqHvcgt+3+IggFV3Ir4YvPsEXRuPGU49txIILPgVUEWoVdQTKBjdZP3Qk/FVG4RuF20YXsnVINXM/mV3zlXG8A5dfwB2Q7XIJ8T9B+rYCcpbI39vy06sbhNrabOAEltQEdUn8pGBZbuPJx+zdlnyFfJW+hV9MXX7aZL/lSrjr6Cm/AjUNfu+R/of1PbvLOFSKhR8YP4kMq3shEzw9yCTqHRS/q24xKsZtUfP4Vwo0LQCK9bTYmtGJa0u6/7ChBdxfGgHbhreklf/OG6Hd1r9e/gaJXfxsZXvi4U9zB02TgD811yNl7oCkMfPp3A1CAK7wtSru/7UNFkutWu+k5dBEGjpQdi5V/1zSyMqS/v5O84vczd3oaYP8M9Q6Kfp8k2AsmtGlizym+sjiHK1aHmm+23VuEW+fBw9cAk/Ld3YcsKkbDn8Ho1AlPZ0YvWNmuvqBPHynptX9c1Ef2AKiBfIVE2AhuVB6j6+CeQyS/eTP8sI+BjIL3X7xPDfjPzq1vuKHaidsPvxUflO7HttJ9QGj9+uShTUtML+K/UldulFWt2T/LMKExo5L011x82NNT13PzCrGWEchdArtL4fi7Qf7isxHtWgeIp/U8ZDS61Y8Kr/jNW0p8Xv4UbVNPmthrWCr2ps+YUc5f+uKpWSTnvIdf6WyEZN6+CWboaQ2lG+wMWkdTpQZFaFf46FJ4o8LI4fuuJ8VsSUfuAFv7RM/z2IeWvH7tP0M0Ar+ehXhh4wuhwReKPQA1bun8r8OO9Yvv7eQQTihnqPT5h3+hg5R4CQ5aQTxtA1VvGbpR3JfybV6zCHVfo/1/W8eTOziqyaVe0Gx39ehQyg1VrJz/9OpfxxWrN/dKVMM9f+Bv6YuvlZtcyUXBeAVKa6V6fgssV3/gnVZ1dWuh4gV9f/umImb4hWfNr2oXjZ6mFqcn91WzFAufHO63i9uheDDcLUKXorIKfieNYwIPDTTyxwkRKF0zW/l14vbu+Cgqpd+lzhel1n1wQr7MHj/7ZpeenXMLKCr09bMX+57uT6r9z9ncrdqcuvhL9LuftDC5AfxaeKzTjJ6TXCZREEgbT05SHoX2zTOQmFV0Wu7iEt8RJIEbrKqUwNp96j1TKg/qhEpqv7NQ4DwfDF271tt9pyhIGqaXKKgRq6vwuWqsG+4wbKc4+XD4svo1hkflzku6Ct0tmPrDodeI13TtxDCV5qJv6x1o/FH17Z7lnT+f/tWOv5nbs3d/XC67p6VRTR/9XrmGfafSG9sHi4EW7Pv1LhFppKO7ZZ96zkSd5a1tfW8dv+E9uz19sJ6ZokGErVhGvuETxYsNMVTwxnSPQ9WY18Z68fvn+t
p5sS+P+FXp8PvVZwX0Mhz29GuLSrvcTvfl+txzPU8C8Q/37UACcqV3e9Nz3U/W7EX0AP9y0I+iWieJKKeShZrlLFWxUEPUMVVTO+YiX0vo3d2HYDV683F35UKWTpf1xxyS/hGCcrOK4x8N8sq/AMkm8w+z9y6PcXue7a4fr68FOUT5mu7mfdKfoOPtczGKkLxn+mXUy/hDKAg+sdhWhtl9PfzUV/Z0XtxzCoAE6W5Cr0v99uYwP8GoeF7/yYbSviEt3QtIor/h8=
--------------------------------------------------------------------------------
/scripts/Lab 2.3 - Advanced Data Preparation with Developer Endpoints and Notebook - PA5-2019-12-02.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Serverless Data Lake Immersion\n",
8 | "## Lab 2.3 - Advanced Data Preparation with Developer Endpoints and Notebook\n",
9 | "`(Revision History:\n",
10 | "PA5, 2019-10-19, @akirmak: updated Section 7.2 based on feedback from @rmichaud and @werberm and the solution proposed by @greenste. PA4 excluded.\n",
11 | "PA4, 2019-10-19, @akirmak: Advanced Spark ETL logic added as bonus\n",
12 | "PA3, 2019-05-09, @akirmak: updated based on feedback from @hohenber\n",
13 | "PA2, 2018-12-13, @akirmak \n",
14 | "PA1, 2018-12-07`\n",
15 | "\n",
16 | "This example shows how to do joins and filters with transforms on DynamicFrames.\n",
17 | "\n",
18 | "For the purposes of our Immersion Day, we assume that you have completed the previous lab assignments (created a Firehose delivery stream, ingested simulated product catalogue data to S3, crawled this data, and put the results into a database called `-tame-bda-immersion-gdb` and a table called `raw` in your Data Catalog), as described in the lab guide.\n",
19 | "\n",
20 | "### 2. Getting started\n",
21 | "\n",
22 | "DataFrame APIs support elaborate methods for slicing and dicing data. They include operations such as \"selecting\" rows, columns, and cells by name or by number, filtering out rows, and so on. Statistical data is usually messy and contains many missing or incorrect values and range violations, so a critically important feature of DataFrames is the explicit management of missing data.\n",
23 | "\n",
24 | "We will write a script that:\n",
25 | "\n",
26 | "1. Queries data\n",
27 | "2. Reformats data\n",
28 | "3. Repartitions the data\n",
29 | "\n",
30 | "Begin by running some boilerplate to import the AWS Glue libraries we'll need and set up a single `GlueContext`.\n",
31 | "Then, start a Spark application and create a dynamic frame from our data in S3. \n",
32 | "\n",
33 | "Some concepts:\n",
34 | "\n",
35 | "- Spark provides a unified platform for writing big data applications, ranging from simple data loading and SQL queries to machine learning and streaming computation over the same engine and with a consistent set of APIs.\n",
36 | "- Spark handles loading data from Amazon S3. \n",
37 | "- You control your Spark Application through a driver process called the SparkSession.\n",
38 | "- A Spark DataFrame is the most common Structured API and simply represents a table of data with rows and columns. (Not to be confused with R and Python DataFrames, which, with some exceptions, exist on one machine rather than across multiple machines.)\n",
39 | "- A schema is the list that defines the columns and the types within those columns.\n",
40 | "\n",
41 | "**Important** Before running the next step, update the *initials* variable with your initials (e.g. fs-tame-bda-immersion-gdb for Frank Sinatra)"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 1,
47 | "metadata": {},
48 | "outputs": [
49 | {
50 | "name": "stdout",
51 | "output_type": "stream",
52 | "text": [
53 | "Starting Spark application\n"
54 | ]
55 | },
56 | {
57 | "data": {
58 | "text/html": [
59 | "