├── data
│   ├── netflix_titles_dirty_01.csv.gz
│   ├── netflix_titles_dirty_02.csv.gz
│   ├── netflix_titles_dirty_03.csv.gz
│   ├── netflix_titles_dirty_04.csv.gz
│   ├── netflix_titles_dirty_05.csv.gz
│   ├── netflix_titles_dirty_06.csv.gz
│   └── netflix_titles_dirty_07.csv.gz
├── assets
│   ├── SparkLiveTraining-shellcommands.png
│   ├── Live Training Slidedeck - Cleaning Data with Pyspark.pdf
│   └── datacamp.svg
├── notebooks
│   ├── python_live_session_template_spark.ipynb
│   ├── python_live_session_template.ipynb
│   └── Cleaning_Data_with_PySpark.ipynb
├── README.md
└── Q&A-20200617.md

/data/netflix_titles_dirty_01.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_01.csv.gz
--------------------------------------------------------------------------------
/data/netflix_titles_dirty_02.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_02.csv.gz
--------------------------------------------------------------------------------
/data/netflix_titles_dirty_03.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_03.csv.gz
--------------------------------------------------------------------------------
/data/netflix_titles_dirty_04.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_04.csv.gz
--------------------------------------------------------------------------------
/data/netflix_titles_dirty_05.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_05.csv.gz
--------------------------------------------------------------------------------
/data/netflix_titles_dirty_06.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_06.csv.gz
--------------------------------------------------------------------------------
/data/netflix_titles_dirty_07.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/data/netflix_titles_dirty_07.csv.gz
--------------------------------------------------------------------------------
/assets/SparkLiveTraining-shellcommands.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/assets/SparkLiveTraining-shellcommands.png
--------------------------------------------------------------------------------
/assets/Live Training Slidedeck - Cleaning Data with Pyspark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/data-cleaning-with-pyspark-live-training/master/assets/Live Training Slidedeck - Cleaning Data with Pyspark.pdf
-------------------------------------------------------------------------------- /notebooks/python_live_session_template_spark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "6Ijg5wUCTQYG" 8 | }, 9 | "source": [ 10 | "

\n", 11 | "\"DataCamp\n", 12 | "

\n", 13 | "

\n", 14 | "\n", 15 | "## **Python PySpark Live Training Template**\n", 16 | "\n", 17 | "_Enter a brief description of your session, here's an example below:_\n", 18 | "\n", 19 | "Welcome to this hands-on training where we will immerse yourself in data visualization in Python. Using both `matplotlib` and `seaborn`, we'll learn how to create visualizations that are presentation-ready.\n", 20 | "\n", 21 | "The ability to present and discuss\n", 22 | "\n", 23 | "* Create various types of plots, including bar-plots, distribution plots, box-plots and more using Seaborn and Matplotlib.\n", 24 | "* Format and stylize your visualizations to make them report-ready.\n", 25 | "* Create sub-plots to create clearer visualizations and supercharge your workflow.\n", 26 | "\n", 27 | "## **The Dataset**\n", 28 | "\n", 29 | "_Enter a brief description of your dataset and its columns, here's an example below:_\n", 30 | "\n", 31 | "\n", 32 | "The dataset to be used in this webinar is a CSV file named `airbnb.csv`, which contains data on airbnb listings in the state of New York. It contains the following columns:\n", 33 | "\n", 34 | "- `listing_id`: The unique identifier for a listing\n", 35 | "- `description`: The description used on the listing\n", 36 | "- `host_id`: Unique identifier for a host\n", 37 | "- `host_name`: Name of host\n", 38 | "- `neighbourhood_full`: Name of boroughs and neighbourhoods\n", 39 | "- `coordinates`: Coordinates of listing _(latitude, longitude)_\n", 40 | "- `Listing added`: Date of added listing\n", 41 | "- `room_type`: Type of room \n", 42 | "- `rating`: Rating from 0 to 5.\n", 43 | "- `price`: Price per night for listing\n", 44 | "- `number_of_reviews`: Amount of reviews received \n", 45 | "- `last_review`: Date of last review\n", 46 | "- `reviews_per_month`: Number of reviews per month\n", 47 | "- `availability_365`: Number of days available per year\n", 48 | "- `Number of stays`: Total number of stays thus far\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## **Setting up a PySpark session**\n", 56 | "\n", 57 | "This set of code lets you enable a PySpark session using google colabs, make sure to run the code snippets to enable PySpark." 
58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Just run this code\n", 67 | "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", 68 | "!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", 69 | "!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n", 70 | "!pip install -q findspark" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Just run this code too!\n", 80 | "import os\n", 81 | "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", 82 | "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "# Set up a Spark session\n", 92 | "import findspark\n", 93 | "findspark.init()\n", 94 | "from pyspark.sql import SparkSession\n", 95 | "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "colab_type": "text", 102 | "id": "BMYfcKeDY85K" 103 | }, 104 | "source": [ 105 | "## **Getting started**" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 2, 111 | "metadata": { 112 | "colab": {}, 113 | "colab_type": "code", 114 | "id": "EMQfyC7GUNhT" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "# Import other relevant libraries\n", 119 | "from pyspark.ml.feature import VectorAssembler\n", 120 | "from pyspark.ml.regression import LinearRegression" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 0, 126 | "metadata": { 127 | "colab": {}, 128 | "colab_type": "code", 129 | "id": "IAfz_jiu0NjN" 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "# Get dataset into local environment\n", 134 | "!wget -O /tmp/airbnb.csv 'https://github.com/datacamp/python-live-training-template/blob/master/data/airbnb.csv?raw=True'\n", 135 | "airbnb = spark.read.csv('/tmp/airbnb.csv', inferSchema=True, header =True)" 136 | ] 137 | } 138 | ], 139 | "metadata": { 140 | "colab": { 141 | "name": "Cleaning Data in Python live session.ipynb", 142 | "provenance": [] 143 | }, 144 | "kernelspec": { 145 | "display_name": "Python 3", 146 | "language": "python", 147 | "name": "python3" 148 | }, 149 | "language_info": { 150 | "codemirror_mode": { 151 | "name": "ipython", 152 | "version": 3 153 | }, 154 | "file_extension": ".py", 155 | "mimetype": "text/x-python", 156 | "name": "python", 157 | "nbconvert_exporter": "python", 158 | "pygments_lexer": "ipython3", 159 | "version": "3.7.1" 160 | } 161 | }, 162 | "nbformat": 4, 163 | "nbformat_minor": 1 164 | } 165 | -------------------------------------------------------------------------------- /assets/datacamp.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # **Data Cleaning with PySpark live session**
by **Mike Metzger** 2 | 3 | Live training sessions are designed to mimic the flow of how a real data scientist would address a problem or a task. As such, a session needs to have some “narrative” where learners are achieving stated learning objectives in the form of a real-life data science task or project. For example, a data visualization live session could be around analyzing a dataset and creating a report with a specific business objective in mind _(ex: analyzing and visualizing churn)_, a data cleaning live session could be about preparing a dataset for analysis, etc. 4 | 5 | As part of the 'Live training Spec' process, you will need to complete the following tasks: 6 | 7 | Edit this README by filling in the information for steps 1 - 4. 8 | 9 | ## Step 1: Foundations 10 | 11 | This part of the 'Live training Spec' process is designed to help guide you through session design by having you think through several key questions. Please make sure to delete the examples provided here for you. 12 | 13 | ### A. What problem(s) will students learn how to solve? (minimum of 5 problems) 14 | 15 | - Reminders of lazy loading & Spark transformations vs actions 16 | - Handling scenarios with improper data content (column counts, etc) 17 | - Splitting a column based on content 18 | - Joining multiple dataframes together 19 | - Working with `monotonically_increasing_id` 20 | - Using various functions from `pyspark.sql.functions` for data cleaning 21 | - Using UDFs to clean data entries 22 | 23 | 24 | ### B. What technologies, packages, or functions will students use? Please be exhaustive. 25 | 26 | - Spark 27 | - Python 28 | 29 | ### C. What terms or jargon will you define? 30 | 31 | - Spark Schemas: Much like a relational database schema, Spark dataframes have a schema or description of the columns and the types of data contained within. 32 | - Parquet: A storage mechanism used in "big data", where data is stored in column format, allowing for easy query and access. 33 | 34 | ### D. What mistakes or misconceptions do you expect? 35 | 36 | - Remembering the various options available to load CSV data: There are a significant number of options available. Remembering the difference between options like `sep`, `quote`, etc. can cause problems if the learner is not used to having documentation available (ie, Spark docs). 37 | 38 | - Understanding that almost everything in Spark is done lazily: Nothing in Spark is actually executed until you run an action, such as `df.count()`, `df.write()`, etc. This can be difficult to troubleshoot if you're not certain what is happening. 39 | 40 | ### E. What datasets will you use? 41 | 42 | Netflix Movie dataset, custom "dirtied" versions 43 | 44 | ## Step 2: Who is this session for? 45 | 46 | Terms like "beginner" and "expert" mean different things to different people, so we use personas to help instructors clarify a live training's audience. When designing a specific live training, instructors should explain how it will or won't help these people, and what extra skills or prerequisite knowledge they are assuming their students have above and beyond what's included in the persona. 47 | 48 | - [ ] Please select the roles and industries that align with your live training. 49 | - [ ] Include an explanation describing your reasoning and any other relevant information. 50 | 51 | ### What roles would this live training be suitable for? 
52 | 53 | *Check all that apply.* 54 | 55 | - [ ] Data Consumer 56 | - [ ] Leader 57 | - [x] Data Analyst 58 | - [ ] Citizen Data Scientist 59 | - [x] Data Scientist 60 | - [x] Data Engineer 61 | - [ ] Database Administrator 62 | - [ ] Statistician 63 | - [ ] Machine Learning Scientist 64 | - [ ] Programmer 65 | - [ ] Other (please describe) 66 | 67 | Typically using Spark for data cleaning means you have to a) have a fair amount of data, b) understand that it needs to be cleaned / filtered / etc and what that means, and c) have something you intend to do with it afterwards. 68 | 69 | - Data Engineers often perform these steps. 70 | - Data Scientists may need to clean the data, or provide the list of things that should be cleaned. Understanding what is possible can assist with said list. 71 | - Data Analysts would be on the cusp for this course, but might have input as a consumer of the data regarding what tasks to perform. 72 | 73 | ### What industries would this apply to? 74 | 75 | Generally, any industry in need of processing a lot of data where Spark is appropriate would qualify. Specifically, we're looking at batch style processing to clean data to prep for storage or analysis. 76 | 77 | ### What level of expertise should learners have before beginning the live training? 78 | 79 | *List three or more examples of skills that you expect learners to have before beginning the live training* 80 | 81 | - Can load a Spark dataframe with data via CSV 82 | - Understands general python usage, including imports and creation of functions 83 | - Would have enough data to need a Spark solution and has a general idea what this means 84 | - Has some knowledge of writing SQL / SQL style queries 85 | 86 | ## Step 3: Prerequisites 87 | 88 | - Intro to PySpark 89 | - Cleaning Data with PySpark 90 | 91 | ## Step 4: Session Outline 92 | 93 | A live training session usually begins with an introductory presentation, followed by the live training itself, and an ending presentation. Your live session is expected to be around 2h30m-3h long (including Q&A) with a hard-limit at 3h30m. You can check out our live training content guidelines [here](_LINK_). 
94 | 95 | ### Introduction Slides 96 | - Intro to the webinar and instructor 97 | - Intro to the topics 98 | - Define data cleaning 99 | - Discuss reasons for using Spark and Python for data cleaning 100 | - Illustrate data set and some of the issues loading the data 101 | - Review session outline & describe Q&A process 102 | 103 | ### Live Training 104 | #### Spark initialization and load DataFrame 105 | - Import necessary Spark libraries and Colab helper tools 106 | - Print our initial data source, illustrating the various problems with it (multiple column counts, odd separators, misformatted columns, etc) 107 | - Try loading file(s) into Spark DataFrame using `spark.load.csv()` 108 | - Show current schema using `.printSchema()` 109 | - Reload file(s) into new Spark DataFrame using `spark.load.csv()` but with custom separator 110 | - `.printSchema()` again, illustrating a single column DataFrame 111 | - ** Q&A ** 112 | 113 | #### Initial Data Cleaning steps 114 | - Use `.filter()` to remove any comment rows (multiple types of columns) 115 | - Create a new column using `.withColumn()` with a count of our intended columns 116 | - Use `.filter()` to remove any rows differing from the intended column count 117 | - Determine the difference between current DataFrame and original to save off malformed data rows 118 | - ** Q&A ** 119 | 120 | #### More Data Cleaning & Formatting 121 | - Now split the remaining source data into actual DataFrame columns using combination of `.split()`, `.explode()` 122 | - Use `.withColumnRenamed()` to rename columns to a usable format 123 | - Create a UDF to split a string 124 | - Apply the UDF to a combined name field 125 | - Use `.drop()` to remove any unnecessary columns 126 | - Add an id field using `.withColumn()` and `pyspark.sql.functions.monotonically_increasing_id()` 127 | - Review the schema and save out to a Parquet file 128 | 129 | #### End slides 130 | - Recap 131 | - Course suggestions 132 | 133 | 134 | ## Authoring your session 135 | 136 | To get yourself started with setting up your live session, follow the steps below: 137 | 138 | 1. Download and install the "Open in Colabs" extension from [here](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo?hl=en). This will let you take any jupyter notebook you see in a GitHub repository and open it as a **temporary** Colabs link. 139 | 2. Upload your dataset(s) to the `data` folder. 140 | 3. Upload your images, gifs, or any other assets you want to use in the notebook in the `assets` folder. 141 | 4. Check out the notebooks templates in the `notebooks` folder, and keep the template you want for your session while deleting all remaining ones. 142 | 5. Preview your desired notebook, press on "Open in Colabs" extension - and start developing your content in colabs _(which will act as the solution code to the session)_. :warning: **Important** :warning: Your progress will **not** be saved on Google Colabs since it's a temporary link. To save your progress, make sure to press on `File`, `Save a copy in GitHub` and follow remaining prompts. You can also download the notebook locally and develop the content there as long you test out that the syntax works on Colabs as well. 143 | 6. Once your notebooks is ready to go, give it the name `session_name_solution.ipynb` create an empty version of the Notebook to be filled out by you and learners during the session, end the file name with `session_name_learners.ipynb`. 144 | 7. 
Create Colabs links for both sessions and save them in notebooks :tada: 145 | -------------------------------------------------------------------------------- /notebooks/python_live_session_template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Cleaning Data in Python live session.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | }, 14 | "language_info": { 15 | "codemirror_mode": { 16 | "name": "ipython", 17 | "version": 3 18 | }, 19 | "file_extension": ".py", 20 | "mimetype": "text/x-python", 21 | "name": "python", 22 | "nbconvert_exporter": "python", 23 | "pygments_lexer": "ipython3", 24 | "version": "3.7.1" 25 | } 26 | }, 27 | "cells": [ 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "colab_type": "text", 32 | "id": "6Ijg5wUCTQYG" 33 | }, 34 | "source": [ 35 | "

\n", 36 | "\"DataCamp\n", 37 | "

\n", 38 | "

\n", 39 | "\n", 40 | "## **Python Live Training Template**\n", 41 | "\n", 42 | "_Enter a brief description of your session, here's an example below:_\n", 43 | "\n", 44 | "Welcome to this hands-on training where we will immerse yourself in data visualization in Python. Using both `matplotlib` and `seaborn`, we'll learn how to create visualizations that are presentation-ready.\n", 45 | "\n", 46 | "The ability to present and discuss\n", 47 | "\n", 48 | "* Create various types of plots, including bar-plots, distribution plots, box-plots and more using Seaborn and Matplotlib.\n", 49 | "* Format and stylize your visualizations to make them report-ready.\n", 50 | "* Create sub-plots to create clearer visualizations and supercharge your workflow.\n", 51 | "\n", 52 | "## **The Dataset**\n", 53 | "\n", 54 | "_Enter a brief description of your dataset and its columns, here's an example below:_\n", 55 | "\n", 56 | "\n", 57 | "The dataset to be used in this webinar is a CSV file named `airbnb.csv`, which contains data on airbnb listings in the state of New York. It contains the following columns:\n", 58 | "\n", 59 | "- `listing_id`: The unique identifier for a listing\n", 60 | "- `description`: The description used on the listing\n", 61 | "- `host_id`: Unique identifier for a host\n", 62 | "- `host_name`: Name of host\n", 63 | "- `neighbourhood_full`: Name of boroughs and neighbourhoods\n", 64 | "- `coordinates`: Coordinates of listing _(latitude, longitude)_\n", 65 | "- `Listing added`: Date of added listing\n", 66 | "- `room_type`: Type of room \n", 67 | "- `rating`: Rating from 0 to 5.\n", 68 | "- `price`: Price per night for listing\n", 69 | "- `number_of_reviews`: Amount of reviews received \n", 70 | "- `last_review`: Date of last review\n", 71 | "- `reviews_per_month`: Number of reviews per month\n", 72 | "- `availability_365`: Number of days available per year\n", 73 | "- `Number of stays`: Total number of stays thus far\n" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": { 79 | "colab_type": "text", 80 | "id": "BMYfcKeDY85K" 81 | }, 82 | "source": [ 83 | "## **Getting started**" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "metadata": { 89 | "colab_type": "code", 90 | "id": "EMQfyC7GUNhT", 91 | "colab": {} 92 | }, 93 | "source": [ 94 | "# Import libraries\n", 95 | "import pandas as pd\n", 96 | "import matplotlib.pyplot as plt\n", 97 | "import numpy as np\n", 98 | "import seaborn as sns\n", 99 | "import missingno as msno\n", 100 | "import datetime as dt" 101 | ], 102 | "execution_count": 0, 103 | "outputs": [] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "metadata": { 108 | "colab_type": "code", 109 | "id": "l8t_EwRNZPLB", 110 | "outputId": "36a85c6f-f2ae-44e0-ac01-fc55462bc616", 111 | "colab": { 112 | "base_uri": "https://localhost:8080/", 113 | "height": 479 114 | } 115 | }, 116 | "source": [ 117 | "# Read in the dataset\n", 118 | "airbnb = pd.read_csv('https://github.com/adelnehme/python-for-spreadsheet-users-webinar/blob/master/datasets/airbnb.csv?raw=true', \n", 119 | " index_col = 'Unnamed: 0')\n", 120 | "\n", 121 | "# Print header\n", 122 | "airbnb.head()" 123 | ], 124 | "execution_count": 0, 125 | "outputs": [ 126 | { 127 | "output_type": "execute_result", 128 | "data": { 129 | "text/html": [ 130 | "
\n", 131 | "\n", 144 | "\n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | "
listing_idnamehost_idhost_nameneighbourhood_fullcoordinatesroom_typepricenumber_of_reviewslast_reviewreviews_per_monthavailability_365ratingnumber_of_stays5_starslisting_added
013740704Cozy,budget friendly, cable inc, private entra...20583125MichelBrooklyn, Flatlands(40.63222, -73.93398)Private room45$102018-12-120.70854.10095412.00.6094322018-06-08
122005115Two floor apartment near Central Park82746113CeciliaManhattan, Upper West Side(40.78761, -73.96862)Entire home/apt135$12019-06-301.001453.3676001.20.7461352018-12-25
221667615Beautiful 1BR in Brooklyn Heights78251LeslieBrooklyn, Brooklyn Heights(40.7007, -73.99517)Entire home/apt150$0NaNNaN65NaNNaNNaN2018-08-15
36425850Spacious, charming studio32715865YelenaManhattan, Upper West Side(40.79169, -73.97498)Entire home/apt86$52017-09-230.1304.7632036.00.7699472017-03-20
422986519Bedroom on the lively Lower East Side154262349BrookeManhattan, Lower East Side(40.71884, -73.98354)Private room160$232019-06-122.291023.82259127.60.6493832020-10-23
\n", 264 | "
" 265 | ], 266 | "text/plain": [ 267 | " listing_id ... listing_added\n", 268 | "0 13740704 ... 2018-06-08\n", 269 | "1 22005115 ... 2018-12-25\n", 270 | "2 21667615 ... 2018-08-15\n", 271 | "3 6425850 ... 2017-03-20\n", 272 | "4 22986519 ... 2020-10-23\n", 273 | "\n", 274 | "[5 rows x 16 columns]" 275 | ] 276 | }, 277 | "metadata": { 278 | "tags": [] 279 | }, 280 | "execution_count": 4 281 | } 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "metadata": { 287 | "colab_type": "code", 288 | "id": "IAfz_jiu0NjN", 289 | "colab": {} 290 | }, 291 | "source": [ 292 | "" 293 | ], 294 | "execution_count": 0, 295 | "outputs": [] 296 | } 297 | ] 298 | } -------------------------------------------------------------------------------- /Q&A-20200617.md: -------------------------------------------------------------------------------- 1 | # Q&A questions / answers 2 | ## From session on 2020-06-17 3 | 4 | - So functions from F are applied to each value in the column individually, right? Or is it the case only in comdbination with `withColumn()`? 5 | - Functions from F (ie, `from pyspark.sql import functions as F`). These functions can be used in various places depending on exactly what you're trying to do, including the `df.withColumn()` method. 6 | 7 | - pyspark has a different syntax than normal python code. I'm I right 8 | - PySpark is a Python API and uses Python syntax. You may be thrown off with the multiple chaining though, or the one letter function (F). In a nutshell it looks a bit different but it uses Python syntax. 9 | 10 | - so with pyspark everything is chain statements? can't you have python states such as loops, etc? 11 | - Pyspark is really a Python API on top of the Scala / Java underpinnings of Spark. Dataframes are most performant when you use the built in API calls to avoid serialization performance penalties. That said, you can use loops and such in your code, especially in UDFs (User defined functions), but it is not always recommended. 12 | 13 | - can you explain again why the tab seperator didn't work and why the curly brackt? 14 | - Because { is not used anywhere in the dataset content, so there's no risk to accidentally separate a cell in two. 15 | 16 | - So sorry... is "/tmp/netblex_titles_cleaned.csv" a directory here? (with one piece of data from "partition" under it "part-00000-5f404d8b-...."?) 17 | - Yes, exactly. The Spark dataframewriter class writes data out in this fashion, and typically you will have between a few and several thousand files present depending on the size of the dataframe. 18 | 19 | - is Spark just a python library? or can it be used outside of Python? 20 | - Spark is a unified analytics engine for big data processing built in Scala, working on top of Java. PySpark is the Python API that allows you to interact iwth Spark in Python. You can also use Spark directly with Scala and Java or via the R and SQL libraries. 21 | 22 | - what's the advantage of using pyspark over pandas? 23 | - Primarily that you can handle significantly larger datasets on a Spark cluster than you could on any given machine running Pandas. Beyond that, it is often easier to implement Spark jobs into a production workflow rather than pandas, especially if you can take advantage of the Spark machine learning and other analysis tools. SparkSQL is also a bit more mature than using SQL style syntax with pandas. That said, both are extremely useful tools and have considerable overlap conceptually. 
I'd recommend becoming familiar with both and using whatever makes sense for your use case. 24 | 25 | - can't we use the pandas write csv command to create a single csv file automatically? 26 | - You could, using the `dataframe.toPandas()` method. You must make sure that the driver, or primary, node of the Spark cluster has enough memory to handle the combination of all the data at a single location. 27 | 28 | - Sorry, i know that u explained this, but whats the main different or benefits of using DF versus RDD? 29 | - An RDD is a Resilient Distributed Dataset, and is the underlying data structure for most of Spark. Dataframes are technically built on top of RDDs, but provide added functionality and capability. Especially when using Pyspark, dataframes are recommended for performance reasons, but also for functionality such as SparkSQL, and so forth. If you ever came across a specific need for an RDD you could always use the `dataframe.rdd` call. For example, it's possible to implement a similar cleaning methodology as this course using RDDs - there are performance implications but depending on your background it might reduce the cognitive load when writing the code. 30 | 31 | - Does Spark use index values for each record like in Pandas? 32 | - Not as such. Spark treats rows or records in a dataframe as a defined group of columns. The columns are then stored / accessed according to their underlying data structure at the time (ie, CSV source, Parquet source files, etc). You can use something like `.filter/.where` to specifically limit the rows you're accessing. You could also use the `pyspark.sql.functions.rank` function to assign a value and filter from there. Beyond that, you can use the `pyspark.sql.functions.monotonically_increasing_id()` method to generate an ID column and access rows via said ID (again, with the `.filter/.where` methods on the dataframe. Note that there are some gotchas to using this function - please refer to the documentation or the full Cleaning Data with Pyspark course for more information. 33 | 34 | - Beside loops, what else is not advisable to use in spark? 35 | - Really anything that acts like a Singleton object, or anything that you couldn't split in some fashion. Spark works best when each partition of data can be processed independently. There are some ways around things like this (ie, using the `broadcast()` method) but usually you'd want to have a splittable job to get the most out of Spark. *NOTE* You may still want to test the functionality when using things like loops / etc. It might not be the fastest method *in* Spark, but it still might be worth the time to implement it overall. 36 | 37 | - When you are creating new dataframes from existing ones, is this occupying more space in memory? 38 | - It is, but it is a very small amount of memory, especially before the dataframe is instantiated with a Spark action (ie, `.count()`, `.write()`, etc). Remember that transformations in Spark are *lazy*, meaning they don't take effect until data is required to provide an answer. A dataframe itself is more like a recipe of steps rather than the actual data. We use actions to instantiate the dataframe. If Spark runs out of working memory, it will try to flush out unused information and reprocess the partition. 39 | 40 | - Could we get a statistics about rows with different columns number in DataFrame without using every time `count()` method? 41 | - There are certainly other methods you can use to gain information about a dataframe. 
The `.describe()` method will provide assorted information about the data that may be useful. 42 | 43 | - What does the '`truncate=False`' parameter indicates inside the '`.show()`' method? 44 | - It means the column is shown completely, so if the content of the cell is really long, you will see it all, instead of it being truncated with three dots: https://stackoverflow.com/questions/33742895/how-to-show-full-column-content-in-a-spark-dataframe 45 | 46 | - Can CSV reader read the CSV file that is compressed (gz) but does not end with .gz? 47 | - The CSV parser can handle compressed files, including bzip2, gzip, lz4, snappy and deflate *if* they are labelled as such. Note that gzip files are unsplittable, so having multiple files is recommended vs one large file. If you need one large file, try something like snappy instead. In regard to the specific question, I had not tried it before, but it does not look like the parser understands how to handle a compressed file that is not labelled specifically as such (ie, .bz2, .gz, etc). There unfortunately is not an argument available (as of Spark 2.4.6 / 3.0.0b2) that allows you to specify the type of compression used in files. 48 | 49 | - You used single quote' and double quote " in the codes. Is there any difference between the two types? 50 | - No specific difference beyond if you need to mix the two (ie, an inline filter clause `'name == "John"'`) or if you need an interpreted character (use `"` in that case). 51 | 52 | - What are the pros and cons of PySpark, PyArrow, and FastParquet? 53 | - These aren't specifically analogous to one another, but PySpark is capable of reading / writing parquet data and is a recommended format for high performance. PyArrow is a library to parse parquet style data in any application that implements the library requirements. Note that it is not necessarily distributed, so you may need to adjust the sizing capabilities of your system vs the amount of data. I have not specifically used FastParquet but it is a native Python implementation to parse parquet data. Both PyArrow and FastParquet appear to be under active development so I'd use whatever works for your use case assuming your data fits on a given system. 54 | 55 | - How does `F.col()` knows which dataframe to use column from? 56 | - It assumes that the column is a portion of the dataframe you've defined it with (ie, `df.withColumn('test', F.col('orig'))` would look for `df.orig` as the column to access. If you're performing a join, F.col will try to determine from the available columns, otherwise it will throw an error regarding the selection being ambigious. In that case, you can specify the column via one of the other available options (`df['colname']`, `df.colname`, etc). 57 | 58 | - what the difference between 'filter' and 'where' functions? 59 | - There is no difference here, they are alias of each other. You can use whichever one makes more sense to you (ie, if you come from a SQL background, `.where` might make more sense while reading it. 60 | 61 | - As long as withColumn creates new DataFrame does this mean that I need a lot more RAM than size of initial DataFrame to effectively wrangling data? 62 | - Refer to question previously, but generally no, you don't need a lot of extra RAM for dataframe storage as Spark intelligently manages it. The area with the greatest RAM utilization would be joining a lot of large dataframes together as they must be at least partially instantiated. 
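  To make the "recipe" idea concrete, here is a minimal sketch (the dataframe and column names are placeholders, not code from the session). The first two lines only describe transformations, so no data is touched until the action on the last line runs:

  ```
  from pyspark.sql import functions as F

  # Lazy transformations - these only extend the plan for a new dataframe
  with_price = df.withColumn('price_int', F.col('price').cast('int'))
  expensive = with_price.filter(F.col('price_int') > 100)

  # An action - only now does Spark actually read and process the data
  expensive.count()
  ```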
63 | 64 | - `.cast(IntegerType())` and `.cast('Integer')` are the same? 65 | - Yes, either style can be used. 66 | 67 | - What happened if you don't run `.coalesce(1)` before writing csv file? It will write several files? 68 | - Correct, it will write as many files as there are internal partitions of the dataframe. 69 | 70 | - What if the data itself contains the curly brace "{"? 71 | - Then you risk separating a cell in two and treating the separated content as two separate columns, so you would need to find another character to separate with. 72 | 73 | - Sorry, maybe basic question, this is the first time for me learning spark. Thanks. 74 | - If you're looking for more information on Spark, please try the Intro to PySpark course available on DataCamp for more information. Once you've taken that, the content here and in the Cleaning Data with PySpark course will make more sense. 75 | 76 | - What is difference between PySpark and Pandas? 77 | - Spark is a framework built to work with Big Data, using parallel processing among other things. It's built in Java. PySpark is the Python tool to interact with Spark. pandas is a Python library used for data manipulation. Both can do similar operations but Spark is better suited for Big Data. 78 | 79 | - Spark is memory based(data is loaded into memory). So how does it work when the data size is much bigger than the memory can handle? during that time, is it better to use a different engine e.g. mapreduce for efficiency? 80 | - Spark partitions data into multiple chunks and then operates on the chunks, typically reducing the amount of memory required to process a dataset. More memory typically means faster performance as more information can be processed at once. There is a lower limit depending on the amount of data you have and the type of operations you're running but typically you wouldn't hit these on normally spec'd clusters. Regarding the use of mapreduce, Spark does keep data in memory vs Hadoop mapreduce but it's not required to fit it all at once. If data is needed later it is either reloaded from source / reprocessed or loaded from cache. Another option is to write intermediary data out before beginning a large processing step as this often helps the optimizer access needed data without reprocessing everything. The final note though, is to test your workloads - in almost all circumstances I've used in production, Spark is faster than classic mapreduce, especially for any SQL style jobs. 81 | 82 | - So for highly scalable data its recommended to use pyspark? 83 | - Definitely. 84 | 85 | - can pyspark be used for unstructured data- i.e documents? csv files already have a structure 86 | - You can, but you may need to preprocess the data, or perform steps similar to what we are here. If you know the formatting of say a binary file or via a set of regex's, you could perform a similar split / clean operation. It depends entirely on what you're trying to accomplish. 87 | 88 | - `titles_single_new_df = titles_single_df.filter(F.col('_c0').startsWith('#'))` gives me an error : 'Column' object is not callable 89 | - The method you want to use is `.startswith('#')` instead of `.startsWith('#')`. Note the lowercase vs uppercase `W`. 90 | 91 | - Do you have any tips on how to install Spark on my local machine in PyCharm for example? I've been finding it very difficult to install it locally. 92 | - You can try following this tutorial: https://medium.com/@gongster/how-to-use-pyspark-in-pycharm-ide-2fd8997b1cdd. 
Also note that installing Spark locally can be troublesome. I'd highly advise using something like Anaconda, a Docker container or one of the services available (Databricks, Amazon EMR, etc) to run Spark for you elsewhere. 93 | 94 | - In this code we've written fieldcount as lower case and camel case - `titles_single_df.select('fieldcount', '_c0').where('fieldCount > 12').show(truncate=False)`. Why is that? Are these different columns or does it not matter to Pyspark? 95 | - Great observation! In this circumstance, it does not matter, but I would recommend keeping it all the same in your code. I will fix the discrepency in the future. 96 | 97 | - printSchema is a Spark command, correct? 98 | - Yes, it is a method on the Spark dataframe class. 99 | 100 | - ~ = not, correct? 101 | - Yes. 102 | 103 | - when filter & select is being used -- is this part of SparkSQL? 104 | - Yes, but note that there is some terminology clarification here. The PySpark dataframe library is part of pyspark.sql. SparkSQL itself is the ability to run SQL commands directly within Spark (ie, entirely without another language). PySpark and Scala can also run SQL commands internally, but also use the extra functionality in Python / Scala as desired. 105 | 106 | - what are positives/negatives of caching? 107 | - Typically, caching is almost always good. The only case where I'd say it's not is if you don't plan to do anything with the dataframe shortly after caching the data, as you'll have spent the time for the data to be stored (often on disk) with little to no gain. If you need to cache a lot of objects, I'd recommend writing at least some of your result sets out to parquet files and reload a dataframe from there. This provides more options for Spark to optimize your later processing steps. 108 | 109 | - When we specify column, do we all need to add `F.col()`? or only `col()` sufficient? 110 | - Assuming you plan to use the `col()` function at all (ie, you can use the `df['name']` or `df.name` option as well) the answer depends on how you load the `pyspark.sql.functions` library. In this case, we've aliased it to `F`, so you'd reference it as `F.col()`. If you instead do something like `from pyspark.sql.functions import *`, you can use it without the `F.`. It depends on your style / requirements. 111 | 112 | - (additional question) Some engineer said to me that production running would be preferred with Scala or Java, but with latest Spark, is there so much difference using PySpark? (could I use it in production system)? 113 | - In certain operations there is a speedup using Scala, especially if you need to use UDFs. That said, Myself and many others run tasks in PySpark in production with no issue. You can also use a mix of the two, putting performance sensitive code in Scala and referencing it from Python as needed. 114 | 115 | - What would be the advantage/disadvantage for PySpark v Scala/Java? 116 | - It depends on your familiarity, but Python is often easier to write (though Scala really isn't bad at all. Java depends on your experience with static languages). You can also use any of the python libraries as needed (with some potential performance implications vs a Scala implementation.) Often, people will use both, or start with Python and convert to Scala as it makes sense. 117 | 118 | - you said there are many ways to find rows starting with '#' one of which we just used. Can you briefly touch upon the other ones? 
119 | - The primary method for dealing with comment rows is using the `comment=` attribute on the `spark.read.csv()` method to specify which comment to handle. Assuming you have one type of comment row, and it starts with a single character, this works well. The method we worked with handles any comments that start with a character or multiple characters / words / etc, assuming they start at the beginning of the line. If I also wanted to remove comments later in the row (ie, a code comment or some such) I could also use the Spark regex functions. You could also do the same behavior in a SQL statement, though it gets a bit hairy. 120 | 121 | - What is an ArrayType column? 122 | - It is an aggregate Spark column type, that is analogous to a Python list, stored in a column. These can work well if you have a variable amount of data that you wish to store within a dataframe under a single column header. 123 | 124 | - whats is the difference between Spark and Dask library python? 125 | - Spark supports Python, R and Java Virtual Machine code. Dask only supports Python. Both are clustered options and have varying levels of flexibility depending on your needs. 126 | 127 | - where are the curly braces as separators? 128 | - They are used to separate upon, to create the columns, so they will not be rendered in the dataset (like columns or tabs won't render) 129 | 130 | - Could we try this in a script in visual code or anywhere else? 131 | - Yes. You could use a notebook to test your pipeline, and then turn it into an executable scripit using any IDE. Note that you will need to define your Spark session object based on your cluster setup (refer to docs / ask your cluster admin for more details) 132 | 133 | - have I to use spark before tensorflow to process the big data? 134 | - Spark or any other tool. Tensorflow is not a data manipulation tool, it's used for machine learning operations. Just like you would use pandas to clean data and scikit learn for predictions. You can use Tensorflow and Keras within Spark if desired. 135 | 136 | - What is the best process, use a df 10*10 or a df 1*100? in memory sense 137 | I have to process 1 billions rows but I could transpose some part and get 10millions rows by 100 columns, so I wonder what is the best method, keep 1billion rows and 1 column or 10millions rows and 100 columns? 138 | - Best answer is to test a portion of both and see how it behaves. Typically more rows mean that you can partition your data a bit better, but this varies a bit based on use case. That said, a 1 column dataframe is not always that useful (beyond working through the example here) so store the data in the method that makes the most sense overall. 139 | 140 | - with this code 141 | ``` 142 | # Run this code as is to install Spark in Colab 143 | !apt-get install openjdk-8-jdk-headless -qq > /dev/null 144 | !wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz 145 | !tar xf spark-2.4.5-bin-hadoop2.7.tgz 146 | !pip install -q findspark 147 | ``` 148 | - These are specific commands to install dependencies in Google collab. You need to install these dependencies on your system using the appropriate commands for it (WIndows, MacOS or Linux): https://openjdk.java.net/install/ to start. Otherwise, you can use an online Spark provider to test with (Databricks, AWS, etc) 149 | 150 | - Where is the best location to store our csv file, in C:/ or in another hard drive? 151 | - There is no better location. 
The best location is where you will easily be able to retrieve it. Then, if you plan to do additional analysis, you can move it to the folder in which you will create your scripts. If you plan to use Spark on a cluster to access any data, it needs to be accessible to the Spark nodes in something like NFS, HDFS, S3, or other network accessible storage. 152 | 153 | - In a local environment, Do we need to work in a tempory folder (I see '/temp')? I am sorry but I don't understand if the csv file is stored in our hard drive ou in our ram when we open it trough a sparksession 154 | - The CSV file is an actual file. What gets written in memory is the different manipulations that you do on dataframes as you manipulate the data. When you downlowd the CSV, you will have an actual file, with lines of text separated by tabs. /temp is a folder on the Google Collab environment to store stuffs that you odn't want to see persist over sessions. 155 | -------------------------------------------------------------------------------- /notebooks/Cleaning_Data_with_PySpark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Cleaning Data with PySpark.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | }, 14 | "language_info": { 15 | "codemirror_mode": { 16 | "name": "ipython", 17 | "version": 3 18 | }, 19 | "file_extension": ".py", 20 | "mimetype": "text/x-python", 21 | "name": "python", 22 | "nbconvert_exporter": "python", 23 | "pygments_lexer": "ipython3", 24 | "version": "3.7.1" 25 | } 26 | }, 27 | "cells": [ 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "colab_type": "text", 32 | "id": "6Ijg5wUCTQYG" 33 | }, 34 | "source": [ 35 | "

\n", 36 | "\"DataCamp\n", 37 | "

\n", 38 | "

\n", 39 | "\n", 40 | "## **Cleaning Data with Pyspark**\n", 41 | "\n", 42 | "Welcome to this hands-on training where we will investigate cleaning a dataset using Python and Apache Spark! During this training, we will cover:\n", 43 | "\n", 44 | "* Efficiently loading data into a Spark DataFrame\n", 45 | "* Handling errant rows / columns from the dataset, including comments, missing data, combined or misinterpreted columns, etc.\n", 46 | "* Using Python UDFs to run advanced transformations on data\n", 47 | "\n", 48 | "\n", 49 | "## **The Dataset**\n", 50 | "\n", 51 | "The dataset used in this webinar is a set of CSV files named `netflix_titles_raw*.csv`. These contain information related to the movies and television shows available on Netflix. These are the *dirty* versions of the dataset - we will cover the individual problems as we work through the notebook.\n", 52 | "\n", 53 | "Given that this is a data cleaning webinar, let's look at our intended result. The dataset will contain the follwing information:\n", 54 | "\n", 55 | "- `show_id`: A unique integer identifier for the show\n", 56 | "- `type`: The type of content, `Movie` or `TV Show`\n", 57 | "- `title`: The title of the content\n", 58 | "- `director`: The director (or directors)\n", 59 | "- `cast`: The cast\n", 60 | "- `country`: Country (or countries) where the content is available\n", 61 | "- `date_added`: Date added to Netflix\n", 62 | "- `release_year`: Year of content release\n", 63 | "- `rating`: Content rating\n", 64 | "- `duration`: The duration\n", 65 | "- `listed_in`: The genres the content is listed in\n", 66 | "- `description`: A description of the content\n", 67 | "\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": { 73 | "id": "KLi6HZeKbiG4", 74 | "colab_type": "text" 75 | }, 76 | "source": [ 77 | "## **Setting up a PySpark session**\n", 78 | "\n", 79 | "Before we can start processing our data, we need to configure a Pyspark session for Google Colab. Note that this is specific for using Spark and Python in Colab and likely is not required for other environments. 
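If you are running this notebook outside of Colab (for example, with a local PySpark installation), creating the Spark session is usually the only step you need. A minimal sketch, assuming `pyspark` is already installed in that environment:

```
# Not needed on Colab - shown only for other environments with pyspark installed
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
```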
" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "metadata": { 85 | "id": "O86xE8hXbiG5", 86 | "colab_type": "code", 87 | "colab": {}, 88 | "cellView": "both" 89 | }, 90 | "source": [ 91 | "# Run this code as is to install Spark in Colab\n", 92 | "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", 93 | "!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", 94 | "!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n", 95 | "!pip install -q findspark" 96 | ], 97 | "execution_count": null, 98 | "outputs": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "metadata": { 103 | "id": "ZMaqiRkBbiG7", 104 | "colab_type": "code", 105 | "colab": {} 106 | }, 107 | "source": [ 108 | "# Run this code to setup the environment\n", 109 | "import os\n", 110 | "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", 111 | "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"" 112 | ], 113 | "execution_count": null, 114 | "outputs": [] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "metadata": { 119 | "id": "qKvFmTnQbiG9", 120 | "colab_type": "code", 121 | "colab": {} 122 | }, 123 | "source": [ 124 | "# Finally, setup our Spark session\n", 125 | "import findspark\n", 126 | "findspark.init()\n", 127 | "from pyspark.sql import SparkSession\n", 128 | "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" 129 | ], 130 | "execution_count": null, 131 | "outputs": [] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": { 136 | "colab_type": "text", 137 | "id": "BMYfcKeDY85K" 138 | }, 139 | "source": [ 140 | "## **Getting started**\n", 141 | "\n", 142 | "Before doing anything else, lets copy our data files locally. We'll be using `wget`, `ls`, `gunzip`, and `head`, which are normally shell commands. In the notebook environment, we can run any given shell command using the precursor `!`. The purpose of our shell commands are as follows:\n", 143 | "\n", 144 | "- `wget` - Is an HTTP client that will download our data files and save them locally in our notebook environment.\n", 145 | "- `ls` - Used to list the files in a directory.\n", 146 | "- `gunzip` - Our data files are compressed using the `gzip` compression format. `gunzip` allows us to decompress those files.\n", 147 | "- `head` - Much like the `.head()` command in Pandas, the shell command `head` defaults to printing out the first 10 lines of a file. You can specify more or fewer lines if desired. \n", 148 | "\n", 149 | "This is an example of some of these commands being executed in a traditional shell environment:\n", 150 | "\n", 151 | "![Shell Commands](https://github.com/datacamp/data-cleaning-with-pyspark-live-training/raw/master/assets/SparkLiveTraining-shellcommands.png)\n", 152 | "\n", 153 | "Let's run the follwing cell to pull the *dirty* files locally. We'll be writing the files to the `/tmp` directory in the notebook environment." 
154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "metadata": { 159 | "colab_type": "code", 160 | "id": "IAfz_jiu0NjN", 161 | "colab": {} 162 | }, 163 | "source": [ 164 | "# Copy our dataset locally\n", 165 | "\n", 166 | "!wget -O /tmp/netflix_titles_dirty_01.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_01.csv.gz?raw=True'\n", 167 | "!wget -O /tmp/netflix_titles_dirty_02.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_02.csv.gz?raw=True'\n", 168 | "!wget -O /tmp/netflix_titles_dirty_03.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_03.csv.gz?raw=True'\n", 169 | "!wget -O /tmp/netflix_titles_dirty_04.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_04.csv.gz?raw=True'\n", 170 | "!wget -O /tmp/netflix_titles_dirty_05.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_05.csv.gz?raw=True'\n", 171 | "!wget -O /tmp/netflix_titles_dirty_06.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_06.csv.gz?raw=True'\n", 172 | "!wget -O /tmp/netflix_titles_dirty_07.csv.gz 'https://github.com/datacamp/data-cleaning-with-pyspark-live-training/blob/master/data/netflix_titles_dirty_07.csv.gz?raw=True'\n", 173 | "\n" 174 | ], 175 | "execution_count": null, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": { 181 | "id": "Wo5BjsTS3nmy", 182 | "colab_type": "text" 183 | }, 184 | "source": [ 185 | "**Now, let's verify that we have all 7 files we expect**\n", 186 | "\n", 187 | "Let's use the `ls` command to list files in the `/tmp` directory. Note that the `*` here is a wildcard, meaning match anything. Specifically, we're looking for `netflix_titles*`, which is match any file or directory that starts with `netflix_titles`." 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "metadata": { 193 | "id": "8QjJ77UpeS1N", 194 | "colab_type": "code", 195 | "colab": {} 196 | }, 197 | "source": [ 198 | "!ls /tmp/netflix_titles*" 199 | ], 200 | "execution_count": null, 201 | "outputs": [] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": { 206 | "id": "5NLnNIcR3z3G", 207 | "colab_type": "text" 208 | }, 209 | "source": [ 210 | "**And then, we'll take a look at the first 20 rows of one of the files**\n", 211 | "\n", 212 | "As mentioned earlier, `gunzip` is a decompression tool. The `-c` means to print out the result rather than write it to a file. The `|`, or pipe symbol, is used to pass the output of one command as the input to another command. In this case, we're using `head -20` to print the first 20 lines of the decompressed data.\n", 213 | "\n", 214 | "Make sure to take a look at the contents of the data and notice that the separator used in this file is a tab character rather than a more traditional comma. 
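The same pipe pattern works with other shell tools too. For example (optional, and not needed for the rest of the notebook), you could count how many lines one of the compressed files contains by piping the decompressed output into `wc -l`:

```
# Optional check: count the lines in one of the compressed files
!gunzip -c /tmp/netflix_titles_dirty_03.csv.gz | wc -l
```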
" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "metadata": { 220 | "id": "45574aapb4eX", 221 | "colab_type": "code", 222 | "colab": {} 223 | }, 224 | "source": [ 225 | "!gunzip -c /tmp/netflix_titles_dirty_03.csv.gz | head -20" 226 | ], 227 | "execution_count": null, 228 | "outputs": [] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": { 233 | "id": "tlIBIjOM39AS", 234 | "colab_type": "text" 235 | }, 236 | "source": [ 237 | "## **Loading our initial DataFrame**\n", 238 | "\n", 239 | "Let's take a look at what Spark does with our data and see if it can properly parse the output. To do this, we'll first load the content into a DataFrame using the `spark.read.csv()` method. \n", 240 | "\n", 241 | "We'll pass in three arguments:\n", 242 | "\n", 243 | " - The path to the file(s)\n", 244 | " - An entry for `header=False`. Our files do not have a header row, so we must specify this or risk a data row being interpreted as a header.\n", 245 | " - The last argument we add is the `sep` option, which specifies the field separator. Often in CSV files this is a comma (`,`), but in our files it's a `\\t` or tab character.\n", 246 | "\n", 247 | "In our command, we'll be using the wildcard character, `*`, again to access all the files matching `/tmp/netflix_titles_dirty*.csv.gz` (ie, all the files we've downloaded thus far). Spark will handle associating all the files to the same dataframe. \n", 248 | "\n", 249 | "Depending on how familiar you are with Spark, you should note that this line does not actually read the contents of the files yet. This command is *lazy*, which means it's not executed until we request Spark to execute some type of analysis. This is a crucial point to understand how data processing behaves within Spark. " 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "metadata": { 255 | "id": "uQ_z1ZFYb9AX", 256 | "colab_type": "code", 257 | "colab": {} 258 | }, 259 | "source": [ 260 | "# Read csv file wiith spark.read_csv()\n" 261 | ], 262 | "execution_count": null, 263 | "outputs": [] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": { 268 | "id": "lUFy-nit4fmP", 269 | "colab_type": "text" 270 | }, 271 | "source": [ 272 | "## **Initial analysis**\n", 273 | "\n", 274 | "Let's look at the first 150 rows using the `.show()` method on the DataFrame. We'll pass in:\n", 275 | "\n", 276 | "- The number of rows to display (`150`)\n", 277 | "- Set the `truncate` option to `False` so we can see all the DataFrame columns and content.\n", 278 | "\n", 279 | "*Note: `.show()` is a Spark action, meaning that any previous lazy commands will now be executed prior to the actual action. Spark can also optimize these commands for best performance as needed.*\n", 280 | "\n" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "metadata": { 286 | "id": "0LjdckdCcCQR", 287 | "colab_type": "code", 288 | "colab": {} 289 | }, 290 | "source": [ 291 | "# Show first 150 rows\n" 292 | ], 293 | "execution_count": null, 294 | "outputs": [] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": { 299 | "id": "LGzPDicW4zUo", 300 | "colab_type": "text" 301 | }, 302 | "source": [ 303 | "\n", 304 | "**Problem 1**: First column contains a mix of numeric and string data.\n", 305 | "\n", 306 | "\n", 307 | "We can also use the `.printSchema()` method to print the inferred schema associated with the data. 
Notice that we have 12 columns (which is expected based on our format information) but there are no column names, the datatypes are incorrect, and each field is nullable. *Note: `.printSchema()` is also a Spark action.*\n", 308 | "\n", 309 | "**Problem 2**: Column names are generic.\n", 310 | "\n", 311 | "**Problem 3**: All columns are typed as strings, but appear to contain various datatypes (also reference **problem 1**)" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "metadata": { 317 | "id": "THf1tctucd9h", 318 | "colab_type": "code", 319 | "colab": {} 320 | }, 321 | "source": [ 322 | "# Print schema\n" 323 | ], 324 | "execution_count": null, 325 | "outputs": [] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": { 330 | "id": "RXGTHdV75wUi", 331 | "colab_type": "text" 332 | }, 333 | "source": [ 334 | "## **Bypassing the CSV interpreter**\n", 335 | "\n", 336 | "Our first few data rows look OK, but we can see that we have a few random rows even in our small sample. We know the first column should be an integer value but it looks like there are some values that do not meet this requirement. \n", 337 | "\n", 338 | "Let's run a `.count()` method on our dataframe to determine how many rows are present in the dataset, regardless of whether they're correct.\n" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "metadata": { 344 | "id": "eABNrdsfOzRR", 345 | "colab_type": "code", 346 | "colab": {} 347 | }, 348 | "source": [ 349 | "# Count the number of rows\n", 350 | "\n" 351 | ], 352 | "execution_count": null, 353 | "outputs": [] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": { 358 | "id": "zYuq7DGUPtSe", 359 | "colab_type": "text" 360 | }, 361 | "source": [ 362 | " Let's run a quick select statement on the DataFrame to determine the makeup of the content. We'll use some methods from the `pyspark.sql.functions` module to help us determine this. We'll alias this module as `F` for simplicity.\n", 363 | "\n", 364 | "We're going to use the `F.col` method to find the column named `_c0` in our dataframe. We'll then chain the function `.cast(\"int\")` to attempt to change each entry from a string value to an integer one. We then use the `.isNotNull()` function to find only entries that are not null. This is passed to the `.filter()` method on the dataframe to return only rows that meet this requirement. Finally, we run the `.count()` method on the resulting dataframe to get the count of rows where the first column is an integer value.\n", 365 | "\n", 366 | "An example to filter a Spark dataframe named `df_1` with a column named `_c0` would be:\n", 367 | "\n", 368 | "```\n", 369 | "from pyspark.sql import functions as F\n", 370 | "\n", 371 | "df_1.filter(F.col(\"_c0\").cast(\"int\").isNotNull()).show()\n", 372 | "```\n" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "metadata": { 378 | "id": "uznFuRdDcwTM", 379 | "colab_type": "code", 380 | "colab": {} 381 | }, 382 | "source": [ 383 | "# Import the Pyspark SQL helper functions\n", 384 | "\n", 385 | "\n", 386 | "# Determine how many rows have a column that converts properly to an integer value\n", 387 | "\n" 388 | ], 389 | "execution_count": null, 390 | "outputs": [] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": { 395 | "id": "m7_8YXWoS59q", 396 | "colab_type": "text" 397 | }, 398 | "source": [ 399 | "### **Spark is different**\n", 400 | "\n", 401 | "Depending on your background, you may be very confused as to what's going on right now. 
If you're used to pandas, you should know that Spark behaves similarly, but there are some significant differences.\n", 402 | "\n", 403 | "#### **Spark dataframes are immutable**\n", 404 | "- This means that once a dataframe is created, you cannot change the contents of the dataframe; you can only create a new one.\n", 405 | "- We won't cover why during this class, beyond mentioning that it makes a distributed system (such as Spark) much more manageable.\n", 406 | "- While you cannot change an existing dataframe, you can always derive a new one from it.\n", 407 | "- Dataframes in Spark are defined with various types of *transformations*. These are the *lazy* commands we mentioned earlier. You can think of them as a recipe, or set of commands that will be run at a given time. \n", 408 | "- Spark *actions* (such as `.count()`, `.show()`, `.write()`, among others) instantiate the dataframe, meaning that the data is processed and available within the dataframe.\n", 409 | "\n", 410 | "The important thing to note is that these `.filter()` commands are not changing our underlying dataframe - they're creating a new dataframe with the results of the `.filter()` operation and giving us the `.count()` of that. This is all done behind the scenes for you, but it's important to understand what Spark is doing underneath.\n", 411 | "\n", 412 | "### **Back to analysis**\n", 413 | "\n", 414 | "Now that we've determined there is a difference between the number of correct entries vs all entries (ie, take the full count and subtract the filtered dataframe), let's look at the rows that aren't converting properly.\n", 415 | "\n", 416 | "We'll do the same as before, but this time we're going to use the `.isNull()` method to obtain only the rows that can't be cast to integers. We'll use the `.show()` method to see the content.\n", 417 | "\n", 418 | "*Note, this is only possible because Spark maintains an immutable dataframe. In our previous step, we created a new dataframe from our `.filter()` command, ran an action, and then Spark threw away the dataframe as we didn't assign it to a variable. We're going to now do the same type of operation on the original `titles_df` dataframe.*" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "metadata": { 424 | "id": "FM8wY72J7qnC", 425 | "colab_type": "code", 426 | "colab": {} 427 | }, 428 | "source": [ 429 | "# Look at rows that don't convert properly\n" 430 | ], 431 | "execution_count": null, 432 | "outputs": [] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": { 437 | "id": "jw0aLgc19Qok", 438 | "colab_type": "text" 439 | }, 440 | "source": [ 441 | "*Data problems*:\n", 442 | "- **Problem 4**: Comment rows - These begin with a `#` character in the first column, and all other columns are null\n", 443 | "- **Problem 5**: Missing first column - We have a few rows that reference `TV Show` or `Movie`, which should be the 2nd column.\n", 444 | "- **Problem 6**: Odd columns - There are a few rows included where the columns seem out of sync (ie, a content type in the ID field, dates in the wrong column, etc).\n", 445 | "\n", 446 | "We could fairly easily remove rows that match this pattern, but we're not entirely sure what to expect here. This is a common issue when trying to parse a large amount of data, be it in native Python, in Spark, or even with command-line tools. \n", 447 | "\n", 448 | "What we need to do is bypass most of the CSV parser's intelligence, but still load the content into a DataFrame. 
One way to do this is to modify an option on the CSV loader.\n", 449 | "\n", 450 | "# **CSV loading**\n", 451 | "\n", 452 | "Our initial import relies on the defaults for the CSV import mechanism. This typically assumes an actual comma-separated value file using `,` between fields and a normal row level terminator (ie, `\\r\\n`, `\\r`, `\\n`). While this often works well, it doesn't always handle every data cleaning process you'd like, especially if you want to save the errant data for later examination.\n", 453 | "\n", 454 | "One way we can trick our CSV load is to specify a custom separator that we know does not exist within our dataset. As we used above, the option to do this is called `sep` and takes a single character to be used as the column separator. The separator cannot be an empty string, so depending on your data, you may need to determine a character that is not used. For our purposes, let's use a curly brace, `{`, which is most likely not present in our data." 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "metadata": { 460 | "id": "Vme6VYgc8UP7", 461 | "colab_type": "code", 462 | "colab": {} 463 | }, 464 | "source": [ 465 | "# Load the files into a DataFrame with a single column\n" 466 | ], 467 | "execution_count": null, 468 | "outputs": [] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "metadata": { 473 | "id": "CyTxTDhXART2", 474 | "colab_type": "code", 475 | "colab": {} 476 | }, 477 | "source": [ 478 | "# Count rows\n" 479 | ], 480 | "execution_count": null, 481 | "outputs": [] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "metadata": { 486 | "id": "JcsIM-2KBMzP", 487 | "colab_type": "code", 488 | "colab": {} 489 | }, 490 | "source": [ 491 | "# Show header\n" 492 | ], 493 | "execution_count": null, 494 | "outputs": [] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "metadata": { 499 | "id": "P0HJfU--BRm1", 500 | "colab_type": "code", 501 | "colab": {} 502 | }, 503 | "source": [ 504 | "# Print schema\n", 505 | "\n" 506 | ], 507 | "execution_count": null, 508 | "outputs": [] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": { 513 | "id": "Uyelt4SeaiHe", 514 | "colab_type": "text" 515 | }, 516 | "source": [ 517 | "# **Q&A?**" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": { 523 | "id": "hSTcSJgEB-YK", 524 | "colab_type": "text" 525 | }, 526 | "source": [ 527 | "# **Cleaning up our data**\n", 528 | "\n", 529 | "We know from some earlier analysis that we have comment rows in place (ie, rows that begin with a `#`). While Spark provides a DataFrame option to handle this automatically, let's consider what it would take to remove comment rows.\n", 530 | "\n", 531 | "We need to:\n", 532 | "\n", 533 | "- Determine if the column / line starts with a `#`\n", 534 | "- If so, filter these out to a new DataFrame\n", 535 | "\n", 536 | "There are many ways to accomplish this in Spark, but let's use a conceptually straightforward option, `.startswith()`." 
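To make the scaffold cells in this part of the notebook more concrete, here is a rough sketch of what they might contain. It assumes the `SparkSession` from the setup cells is available as `spark` and that `pyspark.sql.functions` is aliased as `F` as described earlier; the `raw_df` and `no_comments_df` names are placeholders chosen for this sketch only.

```
from pyspark.sql import functions as F

# Re-read all of the files into a single _c0 column by choosing a separator
# character ('{') that never appears in the data
raw_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', sep='{', header=False)

# Row count, a peek at the content, and the (single-column) schema
raw_df.count()
raw_df.show(10, truncate=False)
raw_df.printSchema()

# Rows whose entire line starts with the comment character
raw_df.filter(F.col('_c0').startswith('#')).show(truncate=False)
raw_df.filter(F.col('_c0').startswith('#')).count()

# Keep only the non-comment rows in a new dataframe
no_comments_df = raw_df.filter(~ F.col('_c0').startswith('#'))
```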
537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "metadata": { 542 | "id": "Pny8lhMnBVsv", 543 | "colab_type": "code", 544 | "colab": {} 545 | }, 546 | "source": [ 547 | "# Filter DataFrame and show rows that start with #\n" 548 | ], 549 | "execution_count": null, 550 | "outputs": [] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "metadata": { 555 | "id": "8GOHWXvmDhtu", 556 | "colab_type": "code", 557 | "colab": { 558 | "base_uri": "https://localhost:8080/", 559 | "height": 34 560 | }, 561 | "outputId": "ef392087-5e1c-452c-e6f0-53001e967392" 562 | }, 563 | "source": [ 564 | "# Count number of rows\n", 565 | "\n" 566 | ], 567 | "execution_count": null, 568 | "outputs": [ 569 | { 570 | "output_type": "execute_result", 571 | "data": { 572 | "text/plain": [ 573 | "47" 574 | ] 575 | }, 576 | "metadata": { 577 | "tags": [] 578 | }, 579 | "execution_count": 17 580 | } 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": { 586 | "id": "MurSx_92YsGv", 587 | "colab_type": "text" 588 | }, 589 | "source": [ 590 | "We've determined that we have 47 rows that begin with a comment character. We can now easily filter these from our DataFrame as we do below. This resolves **Problem 4**.\n", 591 | "\n", 592 | "*Note: We're doing things in a more difficult fashion than is absolutely necessary to illustrate options. The Spark CSV reader has an option for a `comment` property, which can be set to skip all rows starting with a `#` character. That said, it only supports a single character - consider if you were looking for multi-character options (ie, a // from C-style syntax). This feature is also only available in newer versions of Spark, whereas our method works in any of the 2.x Spark releases.*" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "metadata": { 598 | "id": "FnKLL42OXtkx", 599 | "colab_type": "code", 600 | "colab": {} 601 | }, 602 | "source": [ 603 | "# Filter out comments\n", 604 | "\n" 605 | ], 606 | "execution_count": null, 607 | "outputs": [] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "metadata": { 612 | "id": "MHbyFQt4aUag", 613 | "colab_type": "code", 614 | "colab": {} 615 | }, 616 | "source": [ 617 | "" 618 | ], 619 | "execution_count": null, 620 | "outputs": [] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": { 625 | "id": "2NRGmdeqa2L3", 626 | "colab_type": "text" 627 | }, 628 | "source": [ 629 | "# **Checking column counts**\n", 630 | "\n", 631 | "Our next step for cleaning this dataset in Pyspark involves determining how many columns we should have in our dataset and taking care of any outliers. We know from our earlier examination that the dataset should have 12 usable columns per row.\n", 632 | "\n", 633 | "First, let's determine how many columns are present in the data and add that as a column. We'll do this with a combination of:\n", 634 | "\n", 635 | "- `F.split()`: This acts similarly to the Python `split()` method, splitting the contents of a dataframe column on a specified character into a Spark ArrayType column (ie, Spark's version of a list variable). \n", 636 | "- `F.size()`: Returns the size (length) of a Spark ArrayType column.\n", 637 | "- `.withColumn()`: Creates a new dataframe with a given column." 
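Continuing the sketch from above (still using the placeholder `no_comments_df` name), the field-count step might look roughly like this:

```
from pyspark.sql import functions as F

# Split each line on the tab character and store how many fields the split produced
no_comments_df = no_comments_df.withColumn(
    'fieldcount', F.size(F.split(F.col('_c0'), '\t'))
)

# Rows with more fields than expected (the select just reorders columns for easier viewing)
no_comments_df.select('fieldcount', '_c0').filter(F.col('fieldcount') > 12).show(truncate=False)

# Rows with fewer fields than expected
no_comments_df.filter(F.col('fieldcount') < 12).show(truncate=False)
```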
638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "metadata": { 643 | "id": "0mNCitULaWL7", 644 | "colab_type": "code", 645 | "colab": {} 646 | }, 647 | "source": [ 648 | "# Add a column representing the total number of fields / columns\n", 649 | "\n" 650 | ], 651 | "execution_count": null, 652 | "outputs": [] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "metadata": { 657 | "id": "Lv5MOxL_tuyw", 658 | "colab_type": "code", 659 | "colab": {} 660 | }, 661 | "source": [ 662 | "# Show rows with a fieldcount > 12 (Note, select statement here isn't necessarily required - used to reorder the columns for easier viewing)\n" 663 | ], 664 | "execution_count": null, 665 | "outputs": [] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "metadata": { 670 | "id": "Pl3Tz5rht5OC", 671 | "colab_type": "code", 672 | "colab": {} 673 | }, 674 | "source": [ 675 | "# Check for any rows with fewer than 12 columns\n", 676 | "\n" 677 | ], 678 | "execution_count": null, 679 | "outputs": [] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": { 684 | "id": "S71Avac6k1h3", 685 | "colab_type": "text" 686 | }, 687 | "source": [ 688 | "**Problem 7**: Column counts do not match our expected schema." 689 | ] 690 | }, 691 | { 692 | "cell_type": "code", 693 | "metadata": { 694 | "id": "YmqGCyxi7peV", 695 | "colab_type": "code", 696 | "colab": {} 697 | }, 698 | "source": [ 699 | "# Save these to a separate dataframe for later analysis\n", 700 | "\n" 701 | ], 702 | "execution_count": null, 703 | "outputs": [] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "metadata": { 708 | "id": "Ag8tZcQ68gzr", 709 | "colab_type": "code", 710 | "colab": {} 711 | }, 712 | "source": [ 713 | "# Determine total number of \"bad\" rows\n" 714 | ], 715 | "execution_count": null, 716 | "outputs": [] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": { 721 | "id": "FCDT2lN4laxn", 722 | "colab_type": "text" 723 | }, 724 | "source": [ 725 | "We can now resolve **problems 5, 6, and 7** by accessing only the rows that have 12 columns present." 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "metadata": { 731 | "id": "jR_BbSWA8ila", 732 | "colab_type": "code", 733 | "colab": {} 734 | }, 735 | "source": [ 736 | "# Set the dataframe without the bad rows\n" 737 | ], 738 | "execution_count": null, 739 | "outputs": [] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "metadata": { 744 | "id": "VkyXBTyv8phE", 745 | "colab_type": "code", 746 | "colab": {} 747 | }, 748 | "source": [ 749 | "# How many current rows in dataframe?\n", 750 | "\n" 751 | ], 752 | "execution_count": null, 753 | "outputs": [] 754 | }, 755 | { 756 | "cell_type": "markdown", 757 | "metadata": { 758 | "id": "L_QN8__l81xh", 759 | "colab_type": "text" 760 | }, 761 | "source": [ 762 | "# **Q&A**" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": { 768 | "id": "eYMSHgfk9ukp", 769 | "colab_type": "text" 770 | }, 771 | "source": [ 772 | "# **More cleaning / prep**\n", 773 | "\n", 774 | "Now that we've removed rows that don't fit our basic formatting, let's continue on with making our dataframe more useful.\n", 775 | "\n", 776 | "First, let's create a new column that is a list (actually a Spark ArrayType column) containing all \"columns\" using the `pyspark.sql.functions.split` method. We'll call this `splitcolumn`." 
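A minimal sketch of this step, assuming the 12-field rows were kept in a dataframe named `titles_cleaned_df` (the name used by later cells in this notebook; adjust to whatever name you used above):

```
from pyspark.sql import functions as F

# Keep only the rows with exactly 12 fields (the malformed rows were set aside above),
# then store the tab-split fields as an ArrayType column named splitcolumn
titles_cleaned_df = no_comments_df.filter(F.col('fieldcount') == 12)
titles_cleaned_df = titles_cleaned_df.withColumn(
    'splitcolumn', F.split(F.col('_c0'), '\t')
)

# View the contents
titles_cleaned_df.show(5, truncate=False)
```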
777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "metadata": { 782 | "id": "FdV-yNUpAkr3", 783 | "colab_type": "code", 784 | "colab": {} 785 | }, 786 | "source": [ 787 | "# Create a list of split strings as a new column named splitcolumn\n" 788 | ], 789 | "execution_count": null, 790 | "outputs": [] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "metadata": { 795 | "id": "5ku9EWddB8Su", 796 | "colab_type": "code", 797 | "colab": {} 798 | }, 799 | "source": [ 800 | "# View the contents\n" 801 | ], 802 | "execution_count": null, 803 | "outputs": [] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": { 808 | "id": "BwUcLor4IggD", 809 | "colab_type": "text" 810 | }, 811 | "source": [ 812 | "# **Creating typed columns**\n", 813 | "\n", 814 | "There are several ways to do this operation depending on your needs, but for this dataset we'll explicitly convert the strings in the ArrayType column (ie, a Spark list column) to actual dataframe columns. The `.getItem()` method returns the value at the specified index of the list column (ie, of `splitcolumn`). \n", 815 | "\n", 816 | "Let's consider if we wanted to create a full dataframe from the following example dataframe (`df_1`):\n", 817 | "\n", 818 | "splitcolumn |\n", 819 | "---|\n", 820 | "[1,USA,North America] |\n", 821 | "[2,France,Europe] |\n", 822 | "[3,China,Asia] |\n", 823 | "\n", 824 | "```\n", 825 | "df_1 = df_1.withColumn('country_id', df_1.splitcolumn.getItem(0).cast(IntegerType()))\n", 826 | "df_1 = df_1.withColumn('country_name', df_1.splitcolumn.getItem(1))\n", 827 | "df_1 = df_1.withColumn('continent', df_1.splitcolumn.getItem(2))\n", 828 | "```\n", 829 | "\n", 830 | "This would give you a resulting dataframe of:\n", 831 | "\n", 832 | "splitcolumn | country_id | country_name | continent\n", 833 | "---|---|---|---\n", 834 | "[1,USA,North America]|1|USA|North America\n", 835 | "[2,France,Europe]|2|France|Europe\n", 836 | "[3,China,Asia]|3|China|Asia\n", 837 | "\n", 838 | "The `splitcolumn` is currently still present - we'll take care of that later on.\n", 839 | "\n", 840 | "Note that for `show_id` and `release_year`, we'll also use `.cast()` to specify them as integers rather than just strings. We also need to import the `IntegerType` from the `pyspark.sql.types` module to properly cast our data to an integer column in Spark.\n", 841 | "\n" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "metadata": { 847 | "id": "Roo0kxtlB-hz", 848 | "colab_type": "code", 849 | "colab": {} 850 | }, 851 | "source": [ 852 | "from pyspark.sql.types import IntegerType\n", 853 | "\n", 854 | "# Create columns with specific data types using .getItem()\n" 855 | ], 856 | "execution_count": null, 857 | "outputs": [] 858 | }, 859 | { 860 | "cell_type": "markdown", 861 | "metadata": { 862 | "id": "LytioBfiJLvr", 863 | "colab_type": "text" 864 | }, 865 | "source": [ 866 | "Let's now drop our columns that aren't needed anymore. These are `_c0` (the original single line string), `fieldcount`, and `splitcolumn`. You can drop these as a single column per entry, or a comma-separated set of column names." 
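For reference, here is a sketch of what the typed-column and drop steps might look like. The column order below is an assumption based on the usual Netflix titles layout and is not spelled out in this notebook, so verify it against the `.show()` output above before relying on it:

```
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Assumed column order -- check it against the data before relying on it
column_names = ['show_id', 'type', 'title', 'director', 'cast', 'country',
                'date_added', 'release_year', 'rating', 'duration',
                'listed_in', 'description']

# Turn each element of splitcolumn into a named dataframe column
for index, name in enumerate(column_names):
    titles_cleaned_df = titles_cleaned_df.withColumn(
        name, titles_cleaned_df.splitcolumn.getItem(index)
    )

# Cast the numeric columns to integers
titles_cleaned_df = titles_cleaned_df.withColumn(
    'show_id', F.col('show_id').cast(IntegerType())
).withColumn(
    'release_year', F.col('release_year').cast(IntegerType())
)

# Drop the helper columns that are no longer needed
titles_cleaned_df = titles_cleaned_df.drop('_c0', 'fieldcount', 'splitcolumn')
```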
867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "metadata": { 872 | "id": "feZwxQ8rC1MJ", 873 | "colab_type": "code", 874 | "colab": {} 875 | }, 876 | "source": [ 877 | "# Drop original _c0 column\n", 878 | "\n" 879 | ], 880 | "execution_count": null, 881 | "outputs": [] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": { 886 | "id": "7G_Zdg7EJq-i", 887 | "colab_type": "text" 888 | }, 889 | "source": [ 890 | "Let's verify our content, check the row count, and then look at our schema." 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "metadata": { 896 | "id": "eL_ewudgDKKt", 897 | "colab_type": "code", 898 | "colab": {} 899 | }, 900 | "source": [ 901 | "# Showcase new DataFrame\n", 902 | "\n" 903 | ], 904 | "execution_count": null, 905 | "outputs": [] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "metadata": { 910 | "id": "l-C2jZOZEnnr", 911 | "colab_type": "code", 912 | "colab": {} 913 | }, 914 | "source": [ 915 | "# Count rows\n", 916 | "\n" 917 | ], 918 | "execution_count": null, 919 | "outputs": [] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "metadata": { 924 | "id": "fti0um-RDRVp", 925 | "colab_type": "code", 926 | "colab": {} 927 | }, 928 | "source": [ 929 | "# Print the schema\n", 930 | "\n" 931 | ], 932 | "execution_count": null, 933 | "outputs": [] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "metadata": { 938 | "id": "d2W5ImNkJ3f8", 939 | "colab_type": "text" 940 | }, 941 | "source": [ 942 | "# **Even more cleanup**\n", 943 | "\n", 944 | "With our last set of steps, we've successfully fixed our remaining **problems (1, 2, and 3)**. Now that we have a generally clean dataset, let's look for further issues.\n", 945 | "\n", 946 | "If we look at the distinct values available in the show `type` column, we see an issue. This is a categorical data column - collapsing the values in this column is a typical data cleaning issue." 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "metadata": { 952 | "id": "YP9IzxLPFwBS", 953 | "colab_type": "code", 954 | "colab": {} 955 | }, 956 | "source": [ 957 | "# Check out unique items in type column\n", 958 | "\n" 959 | ], 960 | "execution_count": null, 961 | "outputs": [] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "metadata": { 966 | "id": "Nx7weVd1G33o", 967 | "colab_type": "code", 968 | "colab": {} 969 | }, 970 | "source": [ 971 | "# Isolate rows where type is \"\"\n", 972 | "\n" 973 | ], 974 | "execution_count": null, 975 | "outputs": [] 976 | }, 977 | { 978 | "cell_type": "markdown", 979 | "metadata": { 980 | "id": "TW4sUhBMKt9v", 981 | "colab_type": "text" 982 | }, 983 | "source": [ 984 | "You'll notice that we have 5 rows where the type is not specified when one should be. 
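A sketch of the two inspection cells above, using the same placeholder dataframe name as the earlier sketches:

```
from pyspark.sql import functions as F

# Distinct values in the type column -- ideally just 'Movie' and 'TV Show'
titles_cleaned_df.select('type').distinct().show()

# Rows where type is an empty string
titles_cleaned_df.filter(F.col('type') == '').show(truncate=False)
titles_cleaned_df.filter(F.col('type') == '').count()
```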
\n", 985 | "\n", 986 | "**Problem 8**: Invalid entry in a column - The type column should contain only `TV Show` or `Movie`, but it also contains an empty string value.\n", 987 | "\n", 988 | "\n", 989 | " We have a couple of options:\n", 990 | "\n", 991 | "- Drop the rows\n", 992 | "- Infer what the `type` is based on other content available in the dataset\n", 993 | "\n", 994 | "You could remove the problem rows using something like:\n", 995 | "\n", 996 | "```\n", 997 | "titles_cleaned_df = titles_cleaned_df.where('type == \"TV Show\" or type == \"Movie\"')\n", 998 | "```\n", 999 | "\n", 1000 | "That feels a bit like cheating though - let's consider how else we could determine this largely automatically.\n", 1001 | "\n", 1002 | "If you look at the `duration` column, you'll notice that there are different meanings behind the entries. We have durations that contain the word *min* (minutes) or the word `Season` for seasons of the show. We can try to use these to properly set the `type` value for these rows.\n", 1003 | "\n", 1004 | "This problem is a bit tricky though as Spark does not have the concept of updating data within a column without jumping through several hoops. We can however work through this issue using a User Defined Function, or UDF.\n", 1005 | "\n", 1006 | "## **UDF**\n", 1007 | "\n", 1008 | "If you haven't worked with them before, a UDF is a Python function that gets applied to every row of a DataFrame. They are extremely flexible and can help us work through issues such as our current one.\n", 1009 | "\n", 1010 | "A UDF in Pyspark requires three things:\n", 1011 | "\n", 1012 | "1. **A Python function or callable**: This is the function that you want called when the UDF is run by Spark.\n", 1013 | "2. **A UDF variable**: Defined by the `udf()` function, with the Python callable defined in #1, and the Spark return type.\n", 1014 | "3. **A Spark transformation**: The UDF must be used within a transformation (ie, `.withColumn()`) to be applied to the dataframe.\n", 1015 | "\n", 1016 | "Consider the following dataframe `df_1`, containing two fields, `a` and `b`:\n", 1017 | "\n", 1018 | "a|b\n", 1019 | "---|---\n", 1020 | "1|2\n", 1021 | "2|3\n", 1022 | "3|4\n", 1023 | "\n", 1024 | "For illustration, let's say we want to use a UDF to define a new column, which is simply the value `a*b`, unless the value of `a` is `3`. If it is, then we want the value to be `0`.\n", 1025 | "\n", 1026 | "Let's define our function - taking two arguments, `a` and `b`. \n", 1027 | "\n", 1028 | "```\n", 1029 | "def multiply(a, b):\n", 1030 | "    if a == 3:\n", 1031 | "        return 0\n", 1032 | "    else:\n", 1033 | "        return a*b\n", 1034 | "```\n", 1035 | "\n", 1036 | "Now, we need to define our UDF variable. We need to import `udf` from `pyspark.sql.functions`, and as we're returning an integer value, we need to import `IntegerType` from `pyspark.sql.types`. \n", 1037 | "\n", 1038 | "```\n", 1039 | "from pyspark.sql.functions import udf\n", 1040 | "from pyspark.sql.types import IntegerType\n", 1041 | "\n", 1042 | "udfMultiply = udf(multiply, IntegerType())\n", 1043 | "```\n", 1044 | "\n", 1045 | "Note that the `udf` function takes the callable itself (and its return type), not any of the arguments. Those are supplied in our last step - defining a new column. \n", 1046 | "\n", 1047 | "```\n", 1048 | "df_1 = df_1.withColumn('output', udfMultiply(F.col('a'), F.col('b')))\n", 1049 | "```\n", 1050 | "\n", 1051 | "Note that in this instance we're using the `F.col` method to refer to the column. 
You could use any of the other valid Spark methods to specify a dataframe column (ie, `df_1.a`, `df_1['b']`, etc).\n", 1052 | "\n", 1053 | "Assuming we run a Spark action, such as `.show()`, we would get a dataframe with the following contents:\n", 1054 | "\n", 1055 | "a|b|output\n", 1056 | "---|---|---\n", 1057 | "1|2|2\n", 1058 | "2|3|6\n", 1059 | "3|4|0\n", 1060 | "\n", 1061 | "*Note: this is a trivial example for illustration purposes - the same behavior would be much better implemented using assorted functions in the Spark libraries.*\n", 1062 | "\n", 1063 | "\n", 1064 | "First, let's define our Python function that takes two arguments, a showtype (ie, *Movie*, *TV Show*, or other) and the showduration. Note that these are strings in this case. We'll check if the showtype is already a Movie or TV Show - if so, just return that value. Otherwise, we'll check if the showduration ends with *min*, indicating a Movie. If not, we'll specify it as a TV Show.\n" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "metadata": { 1070 | "id": "m97hs-7qPCus", 1071 | "colab_type": "code", 1072 | "colab": {} 1073 | }, 1074 | "source": [ 1075 | "# Define the UDF callable\n" 1076 | ], 1077 | "execution_count": null, 1078 | "outputs": [] 1079 | }, 1080 | { 1081 | "cell_type": "markdown", 1082 | "metadata": { 1083 | "id": "PW5L773xQCfy", 1084 | "colab_type": "text" 1085 | }, 1086 | "source": [ 1087 | "# **Define the UDF for Spark**\n", 1088 | "\n", 1089 | "Now we need to configure the UDF for Spark to access it accordingly." 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "metadata": { 1095 | "id": "3y7TpzkAQb8x", 1096 | "colab_type": "code", 1097 | "colab": {} 1098 | }, 1099 | "source": [ 1100 | "from pyspark.sql.functions import udf\n", 1101 | "from pyspark.sql.types import StringType\n" 1102 | ], 1103 | "execution_count": null, 1104 | "outputs": [] 1105 | }, 1106 | { 1107 | "cell_type": "code", 1108 | "metadata": { 1109 | "id": "0sE7GaRVR7Tv", 1110 | "colab_type": "code", 1111 | "colab": {} 1112 | }, 1113 | "source": [ 1114 | "# Create a new derived column, passing in the appropriate values\n", 1115 | "\n" 1116 | ], 1117 | "execution_count": null, 1118 | "outputs": [] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "metadata": { 1123 | "id": "-eGdUm5ZSojy", 1124 | "colab_type": "code", 1125 | "colab": {} 1126 | }, 1127 | "source": [ 1128 | "# Show the rows where type is an empty string again, examining the derivedType\n" 1129 | ], 1130 | "execution_count": null, 1131 | "outputs": [] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "metadata": { 1136 | "id": "w1Z923yGSt2q", 1137 | "colab_type": "code", 1138 | "colab": {} 1139 | }, 1140 | "source": [ 1141 | "# Drop the original type column and rename derivedType to type\n" 1142 | ], 1143 | "execution_count": null, 1144 | "outputs": [] 1145 | }, 1146 | { 1147 | "cell_type": "code", 1148 | "metadata": { 1149 | "id": "JW7Dtb-wHfs5", 1150 | "colab_type": "code", 1151 | "colab": {} 1152 | }, 1153 | "source": [ 1154 | "# Verify we only have two types available\n" 1155 | ], 1156 | "execution_count": null, 1157 | "outputs": [] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "metadata": { 1162 | "id": "rXr5D8-4HkSR", 1163 | "colab_type": "code", 1164 | "colab": {} 1165 | }, 1166 | "source": [ 1167 | "# Verify our row count is the same\n" 1168 | ], 1169 | "execution_count": null, 1170 | "outputs": [] 1171 | }, 1172 | { 1173 | "cell_type": "markdown", 1174 | "metadata": { 1175 | "id": "Ks9eP6uMnUPe", 1176 | "colab_type": 
"text" 1177 | }, 1178 | "source": [ 1179 | "We've now successfully resolved **problem 8**." 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "markdown", 1184 | "metadata": { 1185 | "id": "0KUVyBzLTZgN", 1186 | "colab_type": "text" 1187 | }, 1188 | "source": [ 1189 | "# **Saving data for analysis / further processing**\n", 1190 | "\n", 1191 | "The last step of our data cleaning is to save the cleaned dataframe out to a file type. If you plan to do any further analysis or processing using Spark, it's highly recommended you use Parquet. Other options are available per your needs, but Spark is optimized to take advantage of Parquet.\n", 1192 | "\n", 1193 | "There are two options we use for the `.write.parquet()` method:\n", 1194 | "\n", 1195 | "- The path of where to write the file\n", 1196 | "- An optional `mode` parameter, which we've set to `overwrite`. This allows Spark to write data to an existing location, solving some potential issues in a notebook environment." 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "code", 1201 | "metadata": { 1202 | "id": "inKlmeIfHnw9", 1203 | "colab_type": "code", 1204 | "colab": {} 1205 | }, 1206 | "source": [ 1207 | "# Save the data\n" 1208 | ], 1209 | "execution_count": null, 1210 | "outputs": [] 1211 | }, 1212 | { 1213 | "cell_type": "markdown", 1214 | "metadata": { 1215 | "id": "ay7w3C4iqRXp", 1216 | "colab_type": "text" 1217 | }, 1218 | "source": [ 1219 | "Let's now take a look at the contents using the `ls` shell command. You'll notice that the `/tmp/netflix_titles_cleaned.parquet` location is actually a directory, not just a file. This is due to the way Spark handles its data allocation and formatting. More on this in a minute." 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "metadata": { 1225 | "id": "v5JnlKUzssau", 1226 | "colab_type": "code", 1227 | "colab": {} 1228 | }, 1229 | "source": [ 1230 | "# Is file in directory?\n" 1231 | ], 1232 | "execution_count": null, 1233 | "outputs": [] 1234 | }, 1235 | { 1236 | "cell_type": "markdown", 1237 | "metadata": { 1238 | "id": "KydUNbnYn-BQ", 1239 | "colab_type": "text" 1240 | }, 1241 | "source": [ 1242 | "Note that typically when processing data in Spark, you'll want to use the Parquet format as described above. This is great for any further processing or analysis you plan to do in Spark. However, it can be difficult to read Parquet files outside of Spark without extra work. As such, let's create a version in CSV format that you can download if you desire.\n", 1243 | "\n", 1244 | "We'll need to do four steps for this operation:\n", 1245 | "\n", 1246 | "- Combine the data into a single file using the `.coalesce(1)` transformation. Spark normally keeps data in separate files to improve performance and bypass RAM issues. Our dataset is small and we can bypass those concerns.\n", 1247 | "- Use the `.write.csv()` method (instead of `.write.parquet()`). We'll also add an extra option of `sep='\\t'` to bypass the issue of commas being present in our data. We also have to define a `header=True` component so our columns are named correctly.\n", 1248 | "- Rename the file to something usable with a shell command `mv`. Spark stores files named via their partition id. We need to rename that to something more recognizable.\n", 1249 | "- Finally, we'll use a special command specific to the notebook environment to download the file.\n", 1250 | "\n", 1251 | "As you've seen within Spark, you can chain commands together. As such, we'll combine the first two components." 
1252 | ] 1253 | }, 1254 | { 1255 | "cell_type": "code", 1256 | "metadata": { 1257 | "id": "Km2dxHXGp85h", 1258 | "colab_type": "code", 1259 | "colab": {} 1260 | }, 1261 | "source": [ 1262 | "# Coalesce and save the data in CSV format\n", 1263 | "\n", 1264 | "titles_cleaned_df.coalesce(1).write.csv('/tmp/netflix_titles_cleaned.csv', mode='overwrite', sep='\\t', header=True)" 1265 | ], 1266 | "execution_count": null, 1267 | "outputs": [] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "metadata": { 1272 | "id": "AvadihO-qnNF", 1273 | "colab_type": "code", 1274 | "colab": { 1275 | "base_uri": "https://localhost:8080/", 1276 | "height": 34 1277 | }, 1278 | "outputId": "9f48b025-f373-449f-9558-617b87a0d6cf" 1279 | }, 1280 | "source": [ 1281 | "# Look at the output of the command using the shell command `ls`\n", 1282 | "\n", 1283 | "!ls /tmp/netflix_titles_cleaned.csv" 1284 | ], 1285 | "execution_count": null, 1286 | "outputs": [ 1287 | { 1288 | "output_type": "stream", 1289 | "text": [ 1290 | "part-00000-5f404d8b-c020-478e-856d-76999184e063-c000.csv _SUCCESS\n" 1291 | ], 1292 | "name": "stdout" 1293 | } 1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "code", 1298 | "metadata": { 1299 | "id": "TCZMYcOKq2VX", 1300 | "colab_type": "code", 1301 | "colab": {} 1302 | }, 1303 | "source": [ 1304 | "# Rename the data file\n", 1305 | "\n", 1306 | "!mv /tmp/netflix_titles_cleaned.csv/part-00000*.csv /tmp/netflix_titles_cleaned_final.csv" 1307 | ], 1308 | "execution_count": null, 1309 | "outputs": [] 1310 | }, 1311 | { 1312 | "cell_type": "code", 1313 | "metadata": { 1314 | "id": "qMaWUVGIrIWT", 1315 | "colab_type": "code", 1316 | "colab": {} 1317 | }, 1318 | "source": [ 1319 | "# Download the file via notebook tools\n", 1320 | "\n", 1321 | "from google.colab import files\n", 1322 | "files.download('/tmp/netflix_titles_cleaned_final.csv')" 1323 | ], 1324 | "execution_count": null, 1325 | "outputs": [] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": { 1330 | "id": "LMJO9w_xUDAh", 1331 | "colab_type": "text" 1332 | }, 1333 | "source": [ 1334 | "# **Challenges**\n", 1335 | "\n", 1336 | "We've looked at several data cleaning operations using Spark. Here are some other challenges to consider within the dataset:\n", 1337 | "\n", 1338 | "1) *Splitting names* - \n", 1339 | " You may have noticed that the names are combined for the cast and directors into a list. Consider how you would turn that data into a list / array column to easily access more detailed information (which shows have the largest cast, etc?)\n", 1340 | "\n", 1341 | "2) *Splitting names further* - Consider taking any of the name fields and splitting it into first name, last name, etc. Take special consideration about how you would handle initials, names with more than 3 components, etc.\n", 1342 | "\n", 1343 | "3) *Parsing dates* - Look at the `date_added` field and determine if and how you could reliably convert this to an actual datetime field.\n", 1344 | "\n", 1345 | "# **Last Q&A**" 1346 | ] 1347 | } 1348 | ] 1349 | } 1350 | --------------------------------------------------------------------------------