├── .gitignore ├── subscriber-pipeline ├── dev │ ├── changelog.md │ ├── cademycode.db │ ├── cademycode_updated.db │ └── cleanse_data.py ├── prod │ └── changelog.md ├── script.sh ├── readme.md └── writeup │ ├── article.md │ └── data_eng_cp_writeup.ipynb ├── bike-rental-data-management ├── ERD.pdf ├── data-dictionaries │ ├── citibike.pdf │ └── weather.pdf ├── readme.md ├── tables.sql ├── views.sql ├── writeup.md └── data │ └── newark_airport_2016.csv ├── README.md └── nosql-cloud-deployment └── readme.md /.gitignore: -------------------------------------------------------------------------------- 1 | **/.DS_Store 2 | **/.ipynb_checkpoints 3 | -------------------------------------------------------------------------------- /subscriber-pipeline/dev/changelog.md: -------------------------------------------------------------------------------- 1 | ## 0.0.0 2 | ### Added 3 | - data cleansing script 4 | -------------------------------------------------------------------------------- /subscriber-pipeline/prod/changelog.md: -------------------------------------------------------------------------------- 1 | ## 0.0.0 2 | ### Added 3 | - data cleansing script 4 | -------------------------------------------------------------------------------- /bike-rental-data-management/ERD.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Codecademy-Curriculum/Data-Engineering-Career-Path-Portfolio-Projects/HEAD/bike-rental-data-management/ERD.pdf -------------------------------------------------------------------------------- /subscriber-pipeline/dev/cademycode.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Codecademy-Curriculum/Data-Engineering-Career-Path-Portfolio-Projects/HEAD/subscriber-pipeline/dev/cademycode.db -------------------------------------------------------------------------------- /subscriber-pipeline/dev/cademycode_updated.db: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Codecademy-Curriculum/Data-Engineering-Career-Path-Portfolio-Projects/HEAD/subscriber-pipeline/dev/cademycode_updated.db -------------------------------------------------------------------------------- /bike-rental-data-management/data-dictionaries/citibike.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Codecademy-Curriculum/Data-Engineering-Career-Path-Portfolio-Projects/HEAD/bike-rental-data-management/data-dictionaries/citibike.pdf -------------------------------------------------------------------------------- /bike-rental-data-management/data-dictionaries/weather.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Codecademy-Curriculum/Data-Engineering-Career-Path-Portfolio-Projects/HEAD/bike-rental-data-management/data-dictionaries/weather.pdf -------------------------------------------------------------------------------- /bike-rental-data-management/readme.md: -------------------------------------------------------------------------------- 1 | # bike-rental-data-management 2 | 3 | This repository contains code to complete Codecademy's `Bike Rental Data Management` project. 4 | 5 | ## Project Overview 6 | 7 | Project to create a relational database with analytics-ready views connecting Citi Bike and weather datasets. 
The project involved: 8 | - inspecting and cleaning both datasets 9 | - developing a relational database structure 10 | - implementing the database in PostgreSQL and inserting the dataset 11 | - developing flexible analytics-ready views on top of the relational database 12 | 13 | # Folder Structure 14 | 15 | - `bike-rental.ipynb`: prepare the data for inserting into the database 16 | - `tables.sql`: SQL queries to create the database tables 17 | - `views.sql`: SQL queries to create the database views 18 | - `ERD.pdf`: entity relationship diagram for the database 19 | 20 | ### `data` 21 | - `newark_airport_2016.csv`: weather data from Newark airport 22 | - `JC-2016xx-citibike-tripdata.csv`: twelve files each containing one month of Citi Bike data from Jersey City 23 | - replace `xx` with the number of each month 24 | 25 | ### `data-dictionaries` 26 | - `citibike.pdf`: details on the Citi Bike data files from Citi Bike's website 27 | - `weather.pdf`: details on the weather data from NOAA's website 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Engineering Portfolio 2 | 3 | Welcome to our sample Data Engineering portfolio! 4 | 5 | # Projects 6 | 7 | ### bike-rental-data-management ([writeup](./bike-rental-data-management/writeup.md)) 8 | 9 | A relational database with analytics-ready views connecting Citi Bike and weather datasets. 
The project involved: 10 | - inspecting and cleaning both datasets 11 | - developing a relational database structure 12 | - implementing the database in PostgreSQL and inserting the dataset 13 | - developing flexible analytics-ready views on top of the relational database 14 | 15 | ### subscriber-pipeline ([writeup](./subscriber-pipeline/writeup/article.md)) 16 | 17 | A semi-automated bash+python pipeline to regularly transform a messy SQLite database into a clean source of truth for an analytics team. The pipeline 18 | - performs unit tests to confirm data validity 19 | - writes human-readable errors to an error log 20 | - automatically checks and updates changelogs 21 | - updates a production database with new clean data 22 | 23 | ### nosql-cloud-deployment ([writeup](./nosql-cloud-deployment)) 24 | 25 | A project to deploy the output of `subscriber-pipeline` to a MongoDB database on DigitalOcean Cloud. The project included 26 | - creating a new MongoDB instance on DigitalOcean 27 | - connecting to the cloud MongoDB using MongoDB Compass 28 | - uploading the clean dataset as a NoSQL Collection 29 | - validating the final dataset 30 | -------------------------------------------------------------------------------- /bike-rental-data-management/tables.sql: -------------------------------------------------------------------------------- 1 | 2 | create table date_dim ( 3 | date_key integer PRIMARY KEY, 4 | full_date date, 5 | month integer, 6 | day integer, 7 | month_name varchar(20), 8 | day_name varchar(20), 9 | financial_qtr integer, 10 | weekend boolean 11 | ); 12 | 13 | 14 | create table stations ( 15 | id integer PRIMARY KEY, 16 | station_name varchar(50), 17 | latitude real, 18 | longitude real 19 | ); 20 | 21 | create table trip_demo ( 22 | id integer PRIMARY KEY, 23 | user_type varchar(50), 24 | gender integer, 25 | birth_year integer, 26 | age integer 27 | ); 28 | 29 | 30 | create table weather ( 31 | id serial PRIMARY KEY, 32 | rec_date date, 33 | avg_wind 
real, 34 | prcp real, 35 | snow_amt real, 36 | snow_depth real, 37 | tavg integer, 38 | tmax integer, 39 | tmin integer, 40 | date_key integer, 41 | rain boolean, 42 | snow boolean, 43 | FOREIGN KEY(date_key) REFERENCES date_dim(date_key) 44 | ); 45 | 46 | 47 | create table rides ( 48 | id integer PRIMARY KEY, 49 | date_key integer, 50 | trip_duration integer, 51 | trip_minutes real, 52 | trip_hours real, 53 | start_time timestamp, 54 | stop_time timestamp, 55 | start_station_id integer, 56 | end_station_id integer, 57 | bike_id integer, 58 | valid_duration boolean, 59 | trip_demo integer, 60 | FOREIGN KEY(date_key) REFERENCES date_dim(date_key), 61 | FOREIGN KEY(start_station_id) REFERENCES stations(id), 62 | FOREIGN KEY(end_station_id) REFERENCES stations(id), 63 | FOREIGN KEY(trip_demo) REFERENCES trip_demo(id) 64 | ); 65 | -------------------------------------------------------------------------------- /nosql-cloud-deployment/readme.md: -------------------------------------------------------------------------------- 1 | # NoSQL Cloud Deployment 2 | 3 | ## Project writeup 4 | 5 | In this project, I deployed the output from my `subscriber-pipeline` project to a MongoDB database on DigitalOcean. The project included 6 | 7 | - creating a new MongoDB instance on DigitalOcean 8 | - connecting to the cloud MongoDB using MongoDB Compass 9 | - uploading the clean dataset as a NoSQL Collection 10 | - validating the final dataset 11 | 12 | ### Database Setup 13 | 14 | 1. Create a MongoDB instance on DigitalOcean 15 | 2. Connect to the MongoDB instance from MongoDB Compass 16 | 3. Create a new database on the server 17 | 4. Collect the output CSV from `subscriber-pipeline` 18 | 5. Import the CSV as a NoSQL collection with the correct datatypes 19 | 20 | ### Database Validation 21 | 22 | After importing the data, I performed the following validation checks 23 | 24 | 1. To confirm no data was lost, I compared row counts between the CSV and the MongoDB Database 25 | 2. 
I inspected a few records visually, and noticed some whitespace had been introduced. I can use a `$trim` aggregation after import to remove this whitespace in the MongoDB, though it would be better to investigate exactly where these artifacts were introduced in the full pipeline. 26 | 3. I created the following analytics-oriented filters to inspect the results: 27 | 1. `{state: {$eq: " Colorado"}}` to see a specific state (notice the whitespace issue appearing here) 28 | 2. `{avg_salary: {$gt: 100000}}` to see customers from high-earning industries 29 | 3. `{time_spent_hrs: {$eq: 0}}` to see customers who cancelled before spending any time on the platform 30 | -------------------------------------------------------------------------------- /subscriber-pipeline/script.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | # prompts the user to confirm before cleansing the data 5 | echo "Ready to clean the data? [1/0]" 6 | read cleancontinue 7 | 8 | # if the user selects 1 then run the cleanse_data script 9 | if [ $cleancontinue -eq 1 ] 10 | then 11 | echo "Cleaning Data" 12 | python dev/cleanse_data.py 13 | echo "Done Cleaning Data" 14 | 15 | # grab the first line of dev and prod changelogs 16 | dev_version=$(head -n 1 dev/changelog.md) 17 | prod_version=$(head -n 1 prod/changelog.md) 18 | 19 | # delimit the 1st line of the changelog by space 20 | read -a splitversion_dev <<< $dev_version 21 | read -a splitversion_prod <<< $prod_version 22 | 23 | # delimiting will result in a list of two elements 24 | # the second (index 1) will contain the version numbers 25 | dev_version=${splitversion_dev[1]} 26 | prod_version=${splitversion_prod[1]} 27 | 28 | if [ $prod_version != $dev_version ] 29 | then 30 | # Tells the user that the dev and prod changes are different 31 | # ask before proceeding to move dev files to prod 32 | echo "New changes detected. Move dev files to prod? 
[1/0]"
33 |     read scriptcontinue
34 |   else
35 |     scriptcontinue=0
36 |   fi
37 | # otherwise, don't run anything
38 | else
39 |   echo "Please come back when you are ready"
40 | fi
41 | 
42 | # if user selects 1 then copy certain files from dev to prod (treated as 0 when the first prompt was declined and the variable was never set)
43 | if [ "${scriptcontinue:-0}" -eq 1 ]
44 | then
45 |   for filename in dev/*
46 |   do
47 |     if [ "$filename" == "dev/cademycode_cleansed.db" ] || [ "$filename" == "dev/cademycode_cleansed.csv" ] || [ "$filename" == "dev/changelog.md" ]
48 |     then
49 |       cp "$filename" prod
50 |       echo "Copying " "$filename"
51 |     else
52 |       echo "Not Copying " "$filename"
53 |     fi
54 |   done
55 | # otherwise, don't run anything
56 | else
57 |   echo "Please come back when you are ready"
58 | fi
59 | 
--------------------------------------------------------------------------------
/subscriber-pipeline/readme.md:
--------------------------------------------------------------------------------
1 | # Subscriber Cancellations Data Pipeline project
2 | 
3 | This repository contains example code to complete the Codecademy `Subscriber Cancellations Data Pipeline` Project.
4 | 
5 | ## Project Description
6 | 
7 | A semi-automated bash+python pipeline to regularly transform a messy SQLite database into a clean source of truth for an analytics team.
8 | 
9 | The pipeline
10 | 
11 | - performs unit tests to confirm data validity
12 | - writes human-readable errors to an error log
13 | - automatically checks and updates changelogs
14 | - updates a production database with new clean data
15 | 
16 | Please see [writeup/article.md](./writeup/article.md) for an overview of the development process and [writeup/data_eng_cp_writeup.ipynb](./writeup/data_eng_cp_writeup.ipynb) for an exploratory Jupyter notebook.
17 | 
18 | ## Instructions
19 | 
20 | This repository is set up as if the scripts have never been run. To run,
21 | 
22 | 1. Run `script.sh` and follow the prompts
23 | 2. 
If prompted, `script.sh` will run `dev/cleanse_data.py`, which runs unit tests and data cleaning functions on `dev/cademycode.db`
24 | 3. If `cleanse_data.py` runs into any errors during unit testing, it will raise an exception, log the issue, and terminate
25 | 4. Otherwise, `cleanse_data.py` will update the clean database and CSV with any new records
26 | 5. After a successful update, the number of new records and other update data will be written to `dev/changelog.md`
27 | 6. `script.sh` will check the changelog to see if there are updates
28 | 7. If so, `script.sh` will request permission to overwrite the production database
29 | 8. If the user grants permission, `script.sh` will copy the updated database to `prod`
30 | 
31 | If you follow these instructions, the script will run on the initial dataset in `dev/cademycode.db`. To test running the script on the updated database, rename `dev/cademycode_updated.db` to `dev/cademycode.db`.
32 | 
33 | ## Folder Structure
34 | - **script.sh**: bash script to handle running the data cleanser and moving files to `/prod`
35 | ### dev
36 | - **changelog.md**: automatically updated with each run, logs new records and tracks missing data
37 | - **cleanse_data.py**: runs unit tests and data cleansing on `cademycode.db`.
38 | - **cleanse_db.log**: log of errors encountered during the Python script execution.
39 | - **cademycode_cleansed.db**: output from `cleanse_data.py` that contains 2 tables
40 |   - `cademycode_aggregated`: table containing the joined data from the cleaned versions of the 3 `cademycode.db` tables.
41 |   - `missing_data`: table containing incomplete data that could not be imputed or assumed.
42 | - **cademycode.db**: database containing the raw data from 3 tables:
43 |   - `cademycode_students`: demographic and course data of mock cademycode students.
44 |   - `cademycode_student_jobs`: a lookup table of student job industries.
45 | - `cademycode_courses`: a lookup table of career paths and the estimated hours of completion. 46 | - **cademycode_updated.db**: updated version of `cademycode.db` for testing the update process 47 | ### prod 48 | - **changelog.md**: `script.sh` will copy from `/dev` when updates are approved 49 | - **cademycode_cleansed.db**: `script.sh` will copy from `/dev` when updates are approved 50 | - **cademycode_cleansed.csv**: CSV version of the clean table, overwritten upon update 51 | ### writeup 52 | - **article.md**: high-level overview of the process of developing these scripts 53 | - **data_eng_cp_writeup.ipynb**: Jupyter Notebook containing the discovery phase of this project: loading, inspecting, transforming. 54 | -------------------------------------------------------------------------------- /subscriber-pipeline/writeup/article.md: -------------------------------------------------------------------------------- 1 | # Project Writeup 2 | 3 | A semi-automated bash+python pipeline to regularly transform a messy SQLite database into a clean source of truth for an analytics team. The pipeline 4 | - performs unit tests to confirm data validity 5 | - writes human-readable errors to an error log 6 | - automatically checks and updates changelogs 7 | - updates a production database with new clean data 8 | 9 | ## Scenario 10 | 11 | A fictional online education company `Cademycode` has a database of cancelled subscriber information that gets periodically updated from a variety of sources (to be clear: this data is entirely fictional). They want to automate the process of transforming this messy database into a clean table for their analytics team. My goal was to create a semi-automated data ingestion pipeline that checks for new data, automatically performs cleaning and transformation operations, runs unit tests to check data validity, and logs errors for human review. 
12 | 
13 | ## The Process
14 | 
15 | ### Inspecting and Transforming the Dataset ([exploration notebook](./data_eng_cp_writeup.ipynb))
16 | 
17 | Several of the raw data tables had records with some missing data. Visualization indicated that most of the missing data was missing at random (MAR). Because this data was MAR and formed only a small percentage of the overall data, deletion seemed appropriate. Instead of fully deleting the incomplete rows, I separated them into their own table. This way, an analytics team can inspect the missing data themselves if they want, and we can keep track of how much data is missing as the cleaned database is updated.
18 | 
19 | Some of the missing data was structurally missing, corresponding to students who had never started any courses before cancelling their subscription. I handled these rows by introducing a new category for those students, so an analytics team can easily study subscribers who cancel before making any use of the product.
20 | 
21 | Beyond the missing data, each student's contact information was stored as a dictionary, which needed to be flattened into columns with the correct data types.
22 | 
23 | Lastly, I added new columns to assist an analytics team in filtering the subscriber population. For example, I used the subscriber birth date to create age and decade categories, so an analytics team can more easily see whether those demographic attributes affect subscriber behavior.
24 | 
25 | ### Automation
26 | 
27 | The raw database is updated regularly with new long-term cancellations, so the transformations above need to be automated as much as possible. However, an automated data process also needs automated checks to prevent unwanted data updates. Therefore, the cleaning and wrangling code needed to be wrapped in a Python script to perform unit tests, log errors, and track any features of potential concern (such as a sudden increase in records with missing entries).
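
As a rough illustration of the "test, log, then stop" pattern described above, here is a minimal sketch. It is not the code from `cleanse_data.py`: the function names `check_schema` and `count_incomplete_rows`, the logger name, and the expected-schema dictionary are all hypothetical and chosen only for this example.

```python
import logging

import pandas as pd

# Hypothetical logger name; the real script configures its own log file.
logger = logging.getLogger("cleanse_sketch")


def check_schema(df: pd.DataFrame, expected: dict) -> None:
    """Log and raise if any expected column is missing or has the wrong dtype."""
    mismatched = [
        col for col, dtype in expected.items()
        if col not in df.columns or str(df[col].dtype) != dtype
    ]
    if mismatched:
        msg = f"{len(mismatched)} column(s) differ from the expected schema: {mismatched}"
        logger.error(msg)
        raise AssertionError(msg)


def count_incomplete_rows(df: pd.DataFrame) -> int:
    """Count rows with at least one missing value, e.g. to record in a changelog."""
    return int(df.isnull().any(axis=1).sum())
```

A failing check writes a human-readable message to the log before raising, so the run stops and the reason is still available for review. The incomplete-row count is the kind of figure that could be tracked between runs to spot a sudden increase.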
28 | 29 | My unit tests check 30 | 31 | - If there are any updates to the database 32 | - If there are any new rows with missing values in the updates 33 | - If there are any fully null rows in the updates 34 | - If the table schemas are the same as expected 35 | - If all join keys are present between the tables before joining 36 | 37 | If the process passes all the tests, I can be reasonably confident that the automated cleaning script will operate as I expect from my Jupyter notebook, and the update can proceed. If not, then logged errors help me identify any unexpected changes to the database (for example, a new column) that my script wasn't designed to handle. Even if all unit tests are passed, the script still records data regarding missing records to the changelog for human review. 38 | 39 | Lastly, I created a bash script to automate the process of 40 | 41 | 1. running the unit test and update script 42 | 2. checking if the script found and made any updates 43 | 3. if so, moving the newly updated clean database to a production folder 44 | -------------------------------------------------------------------------------- /bike-rental-data-management/views.sql: -------------------------------------------------------------------------------- 1 | --- create a table with various daily ride counts 2 | CREATE VIEW daily_counts AS 3 | SELECT date_dim.date_key, 4 | date_dim.full_date, 5 | date_dim.month_name, 6 | date_dim.day, 7 | date_dim.day_name, 8 | date_dim.weekend, 9 | count(rides.id) AS ride_totals, 10 | count(trip_demo.user_type) filter (where trip_demo.user_type = 'Subscriber') as subscriber_rides, 11 | count(trip_demo.user_type) filter (where trip_demo.user_type = 'Customer') AS customer_rides, 12 | count(trip_demo.user_type) filter (where trip_demo.user_type = 'Unknown') unknown_rides, 13 | count(rides.valid_duration) filter (where not rides.valid_duration) AS late_return 14 | FROM rides 15 | RIGHT JOIN date_dim ON rides.date_key = date_dim.date_key 16 | 
LEFT JOIN trip_demo ON rides.trip_demo = trip_demo.id 17 | GROUP BY date_dim.date_key 18 | ORDER BY date_dim.date_key; 19 | 20 | --- create a table with daily daily_data 21 | CREATE VIEW daily_data AS 22 | SELECT daily_counts.date_key, 23 | daily_counts.full_date, 24 | daily_counts.month_name, 25 | daily_counts.day, 26 | daily_counts.day_name, 27 | daily_counts.ride_totals, 28 | sum(daily_counts.ride_totals) OVER (PARTITION BY daily_counts.month_name ORDER BY daily_counts.date_key) AS month_running_total, 29 | daily_counts.subscriber_rides, 30 | daily_counts.customer_rides, 31 | daily_counts.unknown_rides, 32 | daily_counts.late_return, 33 | daily_counts.weekend, 34 | weather.tmin, 35 | weather.tavg, 36 | weather.tmax, 37 | weather.avg_wind, 38 | weather.prcp, 39 | weather.snow_amt, 40 | weather.rain, 41 | weather.snow 42 | FROM daily_counts 43 | JOIN weather ON daily_counts.date_key = weather.date_key 44 | ORDER BY daily_counts.date_key; 45 | 46 | --- create a monthly data view 47 | CREATE VIEW monthly_data AS 48 | SELECT date_dim.month, 49 | date_dim.month_name, 50 | round(avg(daily.ride_totals)) AS avg_daily_rides, 51 | sum(daily.ride_totals) AS total_rides, 52 | round(avg(daily.customer_rides)) AS avg_customer_rides, 53 | sum(daily.customer_rides) AS total_customer_rides, 54 | round(avg(daily.subscriber_rides)) AS avg_subscriber_rides, 55 | sum(daily.subscriber_rides) AS total_subscriber_rides, 56 | sum(daily.unknown_rides) AS total_unknown_rides, 57 | sum(daily.late_return) AS total_late_returns, 58 | round(avg(daily.tavg)) AS avg_tavg, 59 | count(daily.snow) FILTER (WHERE daily.snow) AS days_with_snow, 60 | count(daily.rain) FILTER (WHERE daily.rain) AS days_with_rain, 61 | max(daily.snow_amt) AS max_snow_amt, 62 | max(daily.prcp) AS max_prcp 63 | FROM date_dim 64 | JOIN daily_data daily ON date_dim.date_key = daily.date_key 65 | GROUP BY date_dim.month, date_dim.month_name 66 | ORDER BY date_dim.month; 67 | 68 | --- create a view to help analyze rides 
returned after the limit 69 | CREATE VIEW late_return AS 70 | SELECT date_dim.full_date, 71 | rides.id, 72 | rides.trip_hours, 73 | rides.bike_id, 74 | ( SELECT stations.station_name 75 | FROM stations 76 | WHERE rides.start_station_id = stations.id) AS start_location, 77 | ( SELECT stations.station_name 78 | FROM stations 79 | WHERE rides.end_station_id = stations.id) AS end_location, 80 | trip_demo.user_type 81 | FROM rides 82 | JOIN date_dim ON rides.date_key = date_dim.date_key 83 | JOIN trip_demo ON rides.trip_demo = trip_demo.id 84 | WHERE rides.valid_duration = False; 85 | 86 | 87 | --- examples of queries analysts could run on this database 88 | 89 | --- create a week summary over the year 90 | CREATE VIEW week_summary AS 91 | SELECT daily.day_name, 92 | round(avg(daily.ride_totals)) AS avg_rides, 93 | round(avg(daily.customer_rides)) AS avg_customer_rides, 94 | round(avg(daily.subscriber_rides)) AS avg_subscriber_rides 95 | FROM daily_counts daily 96 | GROUP BY daily.day_name 97 | ORDER BY (round(avg(daily.ride_totals))) DESC; 98 | 99 | --- create an hourly summary 100 | CREATE VIEW hourly_summary AS 101 | SELECT EXTRACT(hour FROM rides.start_time) AS start_hour, 102 | count(*) AS ride_totals 103 | FROM rides 104 | GROUP BY (EXTRACT(hour FROM rides.start_time)) 105 | ORDER BY (count(*)) DESC; 106 | 107 | --- create a demographics summary 108 | CREATE VIEW trip_demographics AS 109 | SELECT count(rides.id) AS total_rides, 110 | trip_demo.age, 111 | trip_demo.gender, 112 | trip_demo.user_type 113 | FROM rides 114 | JOIN trip_demo ON rides.trip_demo = trip_demo.id 115 | GROUP BY trip_demo.age, trip_demo.gender, trip_demo.user_type 116 | ORDER BY trip_demo.user_type, trip_demo.gender, trip_demo.age; 117 | -------------------------------------------------------------------------------- /bike-rental-data-management/writeup.md: -------------------------------------------------------------------------------- 1 | # Bike Rental Data Management Writeup 2 | 3 | ## 
Project Overview
4 | 
5 | The last decade has seen enormous growth in the personal transportation startup industry, including bike and e-bike rental programs in many major cities. The goal of this project is to create a flexible and efficient relational database to store bike rental data, using datasets from Citi Bike and NOAA. The project involved
6 | 
7 | - inspecting and cleaning both datasets
8 | - developing a relational database structure
9 | - implementing the database in PostgreSQL and inserting the data
10 | - developing flexible analytics-ready views on top of the relational database
11 | 
12 | ## Inspecting and Cleaning the Datasets
13 | 
14 | #### [Jupyter Notebook](./bike-rental.ipynb)
15 | 
16 | The datasets for this project were monthly Citi Bike trip records from Jersey City and daily NOAA weather data from Newark Airport. Both datasets were for the full year of 2016.
17 | 
18 | The Citi Bike dataset had some data integrity issues. Four days were missing from the data altogether, and a few fields had missing or unknown values. Visualization and statistical analysis indicated that the missing values were likely not missing at random. For example, demographic information was almost entirely missing for short-term customers. To help the analytics team investigate these issues, I preserved the missing data as `unknown` and created summary columns in the final database so that the missing data can be easily used or removed.
19 | 
20 | There were also outliers in some of the numeric fields. For example, the trip duration column had trips lasting thousands of hours, even though Citi Bike's maximum rental is 24 hours. Because these could indicate system malfunctions or users keeping rentals past the limit, I kept these values in the dataset and created a flag to easily access them or remove them from queries. I also created fields to view trip duration by minutes and hours (the original dataset had only seconds).
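
The derived duration fields and the outlier flag described above could be produced along these lines. This is a sketch rather than the notebook's actual code: the column names follow `tables.sql`, while the function name `add_duration_fields` and the choice to count a rental of exactly 24 hours as valid are assumptions.

```python
import pandas as pd

# 24-hour rental limit mentioned in the writeup, expressed in seconds.
MAX_RENTAL_SECONDS = 24 * 60 * 60


def add_duration_fields(rides: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with minute/hour durations and a validity flag added."""
    out = rides.copy()
    out["trip_minutes"] = out["trip_duration"] / 60
    out["trip_hours"] = out["trip_duration"] / 3600
    # Assumption: a rental of exactly 24 hours still counts as valid.
    out["valid_duration"] = out["trip_duration"] <= MAX_RENTAL_SECONDS
    return out
```

Keeping the outliers and flagging them lets a query filter on `valid_duration` instead of deleting rows up front.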
21 | 
22 | The weather dataset had some all-null columns that needed to be dropped, leaving average wind speed, precipitation, temperature, snow depth and snow amount. These are all fields one might expect to impact bike ridership in measurable ways. I also added binary rain and snow indicators to assist in filtering views.
23 | 
24 | ## Developing and Implementing a Relational Database
25 | 
26 | #### [Entity Relationship Diagram](./ERD.pdf), [SQL create table queries](./tables.sql)
27 | 
28 | The Citi Bike table required some adjustment for normalization. For example, the `station` and `demographic` data needed to be split into their own tables. The station data already had IDs in the database to reuse as a key, while the demographic data required creating a new primary key column and adding the appropriate pointers in the main Citi Bike table.
29 | 
30 | The weather data required less restructuring. However, joining on date-time fields is risky. To facilitate joins, I created a `date_key` field for both the weather and Citi Bike data that stores each day as an integer in `yyyymmdd` format.
31 | 
32 | Finally, I created a central date dimension table to easily and efficiently provide different date information. This table stores each day as both a date and an integer `date_key` along with:
33 | 
34 | - month as both number (e.g. `1`) and name (`January`)
35 | - day as both number (e.g. `21`) and name (`Wednesday`)
36 | - whether the day is a weekend day or not
37 | - the financial quarter
38 | 
39 | Based on these tables, I created a database schema specifying data types and primary/foreign key relationships. After creating a PostgreSQL database matching this schema, I used SQLAlchemy and pandas to insert the data into the database.
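
To make the date dimension concrete, the sketch below builds a table with the columns listed in `tables.sql`. It is an illustration under assumptions, not the project's code: the function name `build_date_dim` is hypothetical, and the financial quarter is assumed to coincide with the calendar quarter, which the writeup does not confirm.

```python
import pandas as pd


def build_date_dim(start: str, end: str) -> pd.DataFrame:
    """Build one row per calendar day between start and end (inclusive)."""
    days = pd.date_range(start, end, freq="D")
    return pd.DataFrame({
        # Integer key in yyyymmdd format, e.g. 20160121.
        "date_key": days.strftime("%Y%m%d").astype(int),
        "full_date": days.date,
        "month": days.month,
        "day": days.day,
        "month_name": days.month_name(),
        "day_name": days.day_name(),
        # Assumption: financial quarter equals calendar quarter.
        "financial_qtr": days.quarter,
        # Monday is 0, so 5 and 6 are Saturday and Sunday.
        "weekend": days.dayofweek >= 5,
    })
```

With a SQLAlchemy engine for the target database, a frame like this could then be loaded with pandas' `DataFrame.to_sql`, for example `to_sql("date_dim", engine, if_exists="append", index=False)`, where `engine` stands in for whatever connection the project uses.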
40 | 41 | ## Developing Views 42 | 43 | #### [SQL create view queries](./views.sql) 44 | 45 | To assist an analytics team, I developed the following views: 46 | 47 | ### daily_data 48 | 49 | This view contains data for every day of the year (using outer joins to make sure days missing from the Citi Bike dataset can be investigated). In addition to the number of rides on a given day, it contains 50 | 51 | - date breakdowns (e.g. month and day of the week) 52 | - ride breakdowns (e.g. using `filter` statements to see rides by subscriber, customer, and duration) 53 | - running monthly totals (using `window` functions) 54 | - and weather data 55 | 56 | In particular, this view contains data on the number of records with unknown `user_type` and suspiciously large `trip_duration`. This should help an analytics team investigate these further. 57 | 58 | ### monthly_data 59 | 60 | This view summarizes `daily_data` by month. In addition to the number of rides per month, it contains 61 | - total rides broken down by user type 62 | - daily ride averages broken down by user type 63 | - average temperature and maximum precipitation amounts (since averages were all close to `0`) 64 | - counts of rainy and snowy days 65 | 66 | ### late_return 67 | 68 | While the prior views already contain counts of late returns (rentals with a duration over 24 hours), this table contains the data needed to investigate these further: 69 | - full date 70 | - bike id 71 | - start and end station 72 | - user type 73 | 74 | ### hourly_summary, trip_demographics, week_summary 75 | 76 | In addition to the main views, I created some smaller views. These are examples of the kinds of analytics that can be built on top of this database. They include summaries of trip demographics as well as the most popular hours of the day and days of the week. 
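
For readers more familiar with pandas than SQL window functions, the running monthly total in `daily_data` can be mirrored outside the database. The sketch below is only an analogy to the `sum(...) OVER (PARTITION BY month_name ORDER BY date_key)` expression in `views.sql`; the function name `add_month_running_total` is hypothetical.

```python
import pandas as pd


def add_month_running_total(daily: pd.DataFrame) -> pd.DataFrame:
    """Add a cumulative ride count that restarts at the start of each month."""
    # Sort first so the cumulative sum follows date order within each month.
    out = daily.sort_values("date_key").copy()
    out["month_running_total"] = out.groupby("month_name")["ride_totals"].cumsum()
    return out
```

Grouping plays the role of `PARTITION BY`, and sorting before the cumulative sum plays the role of `ORDER BY`.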
77 | -------------------------------------------------------------------------------- /subscriber-pipeline/dev/cleanse_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import sqlite3 5 | import pandas as pd 6 | import ast 7 | import numpy as np 8 | from sqlalchemy import create_engine 9 | import logging 10 | 11 | pd.options.mode.chained_assignment = None 12 | 13 | logging.basicConfig(filename="./dev/cleanse_db.log", 14 | format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 15 | filemode='w', 16 | level=logging.DEBUG, 17 | force=True) 18 | logger = logging.getLogger(__name__) 19 | 20 | 21 | def cleanse_student_table(df): 22 | """ 23 | Cleanse the `cademycode_students` table according to the discoveries made in the writeup 24 | 25 | Parameters: 26 | df (DataFrame): `student` table from `cademycode.db` 27 | 28 | Returns: 29 | df (DataFrame): cleaned version of the input table 30 | missing_data (DataFrame): incomplete data that was removed for later inspection 31 | 32 | """ 33 | now = pd.to_datetime('now') 34 | df['age'] = (now - pd.to_datetime(df['dob'])).astype(' 0: 208 | assert_err_msg = str(errors) + " column(s) dtypes aren't the same" 209 | logger.exception(assert_err_msg) 210 | assert errors == 0, assert_err_msg 211 | 212 | 213 | def main(): 214 | 215 | # initialize log 216 | logger.info("Start Log") 217 | 218 | # check for current version and calculate next version for changelog 219 | with open('./dev/changelog.md') as f: 220 | lines = f.readlines() 221 | next_ver = int(lines[0].split('.')[2][0])+1 222 | 223 | # connect to the dev database and read in the three tables 224 | con = sqlite3.connect('./dev/cademycode.db') 225 | students = pd.read_sql_query("SELECT * FROM cademycode_students", con) 226 | career_paths = pd.read_sql_query("SELECT * FROM cademycode_courses", con) 227 | student_jobs = pd.read_sql_query("SELECT * FROM cademycode_student_jobs", 
                                     con)
    con.close()

    # get the current production tables, if they exist
    try:
        con = sqlite3.connect('./prod/cademycode_cleansed.db')
        clean_db = pd.read_sql_query("SELECT * FROM cademycode_aggregated", con)
        missing_db = pd.read_sql_query("SELECT * FROM incomplete_data", con)
        con.close()

        # filter for students that don't exist in the cleansed database
        # (the mask is built from the full uuid column so it always matches the row count)
        new_students = students[~np.isin(students.uuid, clean_db.uuid.unique())]
    except Exception:
        # no production database (or tables) yet: treat every student as new
        new_students = students
        clean_db = []

    # run the cleanse_student_table() function on the new students only
    clean_new_students, missing_data = cleanse_student_table(new_students)

    try:
        # filter for incomplete rows that don't exist in the missing data table
        new_missing_data = missing_data[~np.isin(missing_data.uuid, missing_db.uuid.unique())]
    except Exception:
        # missing_db is undefined when the production tables could not be read
        new_missing_data = missing_data

    # upsert new incomplete data if there are any
    if len(new_missing_data) > 0:
        sqlite_connection = sqlite3.connect('./dev/cademycode_cleansed.db')
        # append only the rows that are not already stored, to avoid duplicates
        new_missing_data.to_sql('incomplete_data', sqlite_connection, if_exists='append', index=False)
        sqlite_connection.close()

    # proceed only if there is new student data
    if len(clean_new_students) > 0:
        # clean the rest of the tables
        clean_career_paths = cleanse_career_path(career_paths)
        clean_student_jobs = cleanse_student_jobs(student_jobs)

        ##### UNIT TESTING BEFORE JOINING #####
        # Ensure that all required join keys are present
        test_for_job_id(clean_new_students, clean_student_jobs)
        test_for_path_id(clean_new_students, clean_career_paths)
        #######################################

        clean_new_students['job_id'] = clean_new_students['job_id'].astype(int)
        clean_new_students['current_career_path_id'] = clean_new_students['current_career_path_id'].astype(int)

        df_clean = clean_new_students.merge(clean_career_paths, left_on='current_career_path_id', right_on='career_path_id', how='left')
        df_clean = df_clean.merge(clean_student_jobs, on='job_id', how='left')

        ##### UNIT TESTING #####
        # Ensure correct schema and complete data before upserting to database
        if len(clean_db) > 0:
            test_num_cols(df_clean, clean_db)
            test_schema(df_clean, clean_db)
        test_nulls(df_clean)
        ########################

        # Upsert new cleaned data to cademycode_cleansed.db
        con = create_engine('sqlite:///./dev/cademycode_cleansed.db', echo=True)
        sqlite_connection = con.connect()
        df_clean.to_sql('cademycode_aggregated', sqlite_connection, if_exists='append', index=False)
        clean_db = pd.read_sql_query("SELECT * FROM cademycode_aggregated", con)
        sqlite_connection.close()

        # Write new cleaned data to a csv file
        clean_db.to_csv('./dev/cademycode_cleansed.csv')

        # create new automatic changelog entry
        new_lines = [
            '## 0.0.'
            + str(next_ver) + '\n' +
            '### Added\n' +
            '- ' + str(len(df_clean)) + ' more data to database of raw data\n' +
            '- ' + str(len(new_missing_data)) + ' new missing data to incomplete_data table\n' +
            '\n'
        ]
        w_lines = ''.join(new_lines + lines)

        # update the changelog
        with open('./dev/changelog.md', 'w') as f:
            for line in w_lines:
                f.write(line)
    else:
        print("No new data")
        logger.info("No new data")
    logger.info("End Log")


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/bike-rental-data-management/data/newark_airport_2016.csv:
--------------------------------------------------------------------------------
1 | STATION,NAME,DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TSUN,WDF2,WDF5,WSF2,WSF5 2 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-01,12.75,,0.00,0.0,0.0,41,43,34,,270,280,25.9,35.1 3 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-02,9.40,,0.00,0.0,0.0,36,42,30,,260,260,21.0,25.1 4 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-03,10.29,,0.00,0.0,0.0,37,47,28,,270,250,23.9,30.0 5 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-04,17.22,,0.00,0.0,0.0,32,35,14,,330,330,25.9,33.1 6 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-05,9.84,,0.00,0.0,0.0,19,31,10,,360,350,25.1,31.1 7 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-06,5.37,,0.00,0.0,0.0,28,42,15,,230,250,12.1,16.1 8 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-07,3.36,,0.00,0.0,0.0,35,46,24,,20,360,8.9,10.1 9 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-08,8.05,,0.00,0.0,0.0,38,45,31,,20,30,14.1,16.1 10 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-09,6.71,,0.01,0.0,0.0,44,48,38,,60,70,13.0,17.0 11 | USW00014734,"NEWARK LIBERTY 
INTERNATIONAL AIRPORT, NJ US",2016-01-10,15.43,,1.77,0.0,0.0,53,65,39,,260,270,36.0,42.9 12 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-11,16.33,,0.00,0.0,0.0,36,39,24,,270,290,31.1,45.0 13 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-12,8.72,,0.01,0.0,0.0,31,44,22,,270,240,35.1,42.9 14 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-13,18.12,,0.00,0.0,0.0,27,31,22,,270,260,31.1,44.1 15 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-14,8.05,,0.00,0.0,0.0,28,39,23,,250,250,21.0,25.9 16 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-15,3.80,,0.00,0.0,0.0,38,49,29,,160,160,10.1,15.0 17 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-16,11.63,,0.26,0.0,0.0,47,55,42,,280,310,21.9,28.0 18 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-17,9.40,,0.07,0.7,0.0,38,42,29,,320,340,18.1,23.0 19 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-18,17.22,,0.03,0.5,1.2,27,30,18,,290,280,29.1,38.0 20 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-19,19.01,,0.00,0.0,0.0,21,31,16,,290,280,29.1,40.0 21 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-20,9.40,,0.00,0.0,0.0,31,39,27,,300,290,21.9,31.1 22 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-21,13.42,,0.00,0.0,0.0,32,38,26,,330,290,23.0,31.1 23 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-22,10.29,,0.03,0.3,0.0,26,30,20,,350,330,21.9,29.1 24 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-23,22.82,,1.81,24.0,7.1,26,27,23,,10,20,31.1,38.9 25 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-24,9.40,,0.01,0.2,20.1,26,36,17,,10,360,25.9,31.1 26 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-25,2.46,,0.00,0.0,18.9,29,37,18,,180,300,6.9,13.0 27 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-01-26,5.82,,0.00,0.0,16.9,35,46,25,,230,270,15.0,23.9 28 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-27,9.62,,0.01,0.0,14.2,42,45,32,,280,270,21.9,29.1 29 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-28,4.03,,0.00,0.0,9.8,34,42,24,,240,240,14.1,18.1 30 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-29,10.29,,0.00,0.0,9.1,36,41,30,,300,290,23.9,35.1 31 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-30,8.28,,0.00,0.0,7.9,32,39,25,,310,300,23.0,30.0 32 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-31,4.92,,0.00,0.0,7.1,39,54,30,,200,210,12.1,15.0 33 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-01,8.05,,0.02,0.0,3.9,45,58,36,,360,310,18.1,23.0 34 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-02,3.80,,0.00,0.0,1.2,43,51,32,,350,,11.0, 35 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-03,6.49,,0.61,0.0,0.0,48,58,39,,170,160,19.9,30.0 36 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-04,9.40,,0.01,0.0,0.0,54,56,42,,330,320,23.0,31.1 37 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-05,12.53,,0.48,2.8,1.2,38,43,31,,290,310,25.9,31.1 38 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-06,7.16,,0.00,0.0,1.2,33,41,25,,200,200,15.0,18.1 39 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-07,4.70,,0.00,0.0,0.0,37,47,28,,30,30,19.9,25.1 40 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-08,12.75,,0.02,0.2,0.0,37,40,29,,30,30,21.9,25.9 41 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-09,6.04,,0.02,0.2,0.0,31,36,27,,10,20,15.0,21.0 42 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-10,11.63,,0.02,0.3,0.0,35,42,31,,240,250,23.0,29.1 43 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-11,19.91,,0.00,0.2,0.0,27,31,19,,270,290,32.0,38.9 44 | 
USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-12,9.17,,0.00,0.0,0.0,21,27,15,,280,290,21.9,31.1 45 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-13,19.69,,0.00,0.0,0.0,21,24,7,,280,280,36.9,46.1 46 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-14,11.63,,0.00,0.0,0.0,8,18,0,,300,320,29.1,35.1 47 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-15,6.93,,0.42,1.5,0.0,18,33,10,,20,360,14.1,17.0 48 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-16,10.51,,0.99,0.0,1.2,43,55,33,,260,230,28.0,33.1 49 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-17,9.62,,0.00,0.0,0.0,39,41,32,,280,290,18.1,25.1 50 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-18,13.42,,0.00,0.0,0.0,33,38,27,,350,310,21.9,28.0 51 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-19,6.93,,0.00,0.0,0.0,29,38,19,,140,140,15.0,21.9 52 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-20,11.86,,0.00,0.0,0.0,47,64,38,,230,230,23.0,28.0 53 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-21,5.59,,0.08,0.0,0.0,51,56,38,,360,350,15.0,18.1 54 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-22,6.93,,0.00,0.0,0.0,45,54,35,,30,30,15.0,18.1 55 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-23,12.75,,0.27,0.0,0.0,39,42,34,,40,50,21.0,29.1 56 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-24,11.41,,1.02,0.0,0.0,42,63,36,,160,160,29.1,51.0 57 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-25,18.57,,0.05,0.0,0.0,54,64,35,,250,240,33.1,40.0 58 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-26,17.90,,0.00,0.0,0.0,36,39,28,,310,310,30.0,38.0 59 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-27,9.62,,0.00,0.0,0.0,33,43,24,,230,250,19.9,25.1 60 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-02-28,12.30,,0.00,0.0,0.0,45,62,35,,210,200,23.0,30.0 61 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-02-29,13.65,,0.03,0.0,0.0,51,62,42,,250,280,28.0,35.1 62 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-01,9.40,,0.00,0.0,0.0,45,53,39,,130,140,17.0,21.9 63 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-02,19.01,,0.12,0.0,0.0,47,57,29,,270,270,36.9,45.0 64 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-03,9.40,,0.00,0.0,0.0,32,38,26,,310,300,19.9,25.9 65 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-04,11.18,,0.12,1.4,1.2,33,40,29,,350,350,21.9,28.0 66 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-05,5.82,,0.00,0.0,0.0,35,43,28,,350,350,17.0,21.9 67 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-06,4.92,,0.00,0.0,0.0,36,47,27,,30,20,15.0,18.1 68 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-07,10.51,,0.00,0.0,0.0,43,63,31,,240,230,23.9,30.0 69 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-08,3.80,,0.00,0.0,0.0,54,65,45,,140,110,13.0,18.1 70 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-09,6.93,,0.00,0.0,0.0,58,82,42,,230,250,19.9,25.1 71 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-10,8.50,,0.00,0.0,0.0,70,81,59,,240,240,21.9,25.9 72 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-11,11.41,,0.03,0.0,0.0,64,69,48,,320,340,23.0,29.1 73 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-12,8.05,,0.00,0.0,0.0,50,60,38,,220,200,19.9,25.9 74 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-13,6.49,,0.00,0.0,0.0,56,62,49,,20,50,15.0,18.1 75 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-14,14.32,,0.52,0.0,0.0,48,52,41,,40,40,23.9,32.0 76 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-15,9.40,,0.00,0.0,0.0,48,57,45,,40,40,23.9,31.1 77 
| USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-16,3.58,,0.03,0.0,0.0,54,64,46,,140,130,13.0,17.0 78 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-17,7.83,,0.00,0.0,0.0,53,66,44,,320,310,35.1,44.1 79 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-18,14.76,,0.00,0.0,0.0,52,62,41,,300,340,29.1,36.9 80 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-19,6.71,,0.00,0.0,0.0,43,47,35,,30,20,21.0,25.1 81 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-20,10.96,,0.07,0.2,0.0,39,44,33,,30,40,19.9,21.9 82 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-21,14.32,,0.00,0.0,0.0,40,53,33,,320,310,30.0,38.0 83 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-22,12.08,,0.00,0.0,0.0,44,58,34,,240,260,21.9,31.1 84 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-23,10.29,,0.00,0.0,0.0,57,74,45,,250,260,21.0,25.9 85 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-24,9.62,,0.00,0.0,0.0,52,56,44,,30,120,18.1,23.0 86 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-25,11.41,,0.11,0.0,0.0,57,74,46,,280,330,23.9,29.1 87 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-26,9.17,,0.00,0.0,0.0,47,54,38,,340,340,19.9,23.0 88 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-27,6.93,,0.00,0.0,0.0,46,51,44,,160,130,14.1,18.1 89 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-28,15.21,,0.38,0.0,0.0,48,65,43,,270,270,40.0,57.9 90 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-29,21.03,,0.00,0.0,0.0,51,56,40,,310,340,32.0,46.1 91 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-30,8.50,,0.00,0.0,0.0,45,55,33,,180,180,19.9,25.9 92 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-03-31,13.87,,0.00,0.0,0.0,58,74,46,,210,200,30.0,38.0 93 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-04-01,13.42,,0.00,0.0,0.0,70,80,62,,280,290,29.1,38.9 94 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-02,9.84,,0.16,0.0,0.0,58,64,42,,270,280,33.1,40.0 95 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-03,19.91,,0.03,0.0,0.0,43,45,35,,310,310,40.9,55.0 96 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-04,10.51,,0.44,0.0,0.0,41,44,30,,10,10,23.0,29.1 97 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-05,16.33,,0.00,0.0,0.0,34,44,26,,360,360,28.0,33.1 98 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-06,8.50,,0.00,0.0,0.0,38,49,27,,160,160,21.9,31.1 99 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-07,12.53,,0.09,0.0,0.0,53,61,46,,260,260,30.0,38.0 100 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-08,14.09,,0.00,0.0,0.0,48,50,38,,250,240,29.1,36.0 101 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-09,10.74,,0.09,0.0,0.0,40,44,32,,320,320,23.9,29.1 102 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-10,11.18,,0.00,0.0,0.0,40,52,31,,330,310,23.9,30.0 103 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-11,10.29,,0.01,0.0,0.0,51,68,42,,200,190,25.9,33.1 104 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-12,12.30,,0.13,0.0,0.0,56,61,44,,320,330,23.9,35.1 105 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-13,7.38,,0.00,0.0,0.0,48,58,39,,360,360,17.0,21.9 106 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-14,7.16,,0.00,0.0,0.0,50,62,41,,30,360,14.1,18.1 107 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-15,4.70,,0.00,0.0,0.0,51,65,39,,160,90,12.1,14.1 108 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-16,6.93,,0.00,0.0,0.0,54,68,41,,20,360,16.1,23.9 109 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-04-17,3.80,,0.00,0.0,0.0,56,73,42,,150,150,13.0,15.0 110 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-18,6.49,,0.00,0.0,0.0,64,81,47,,140,40,15.0,19.9 111 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-19,14.32,,0.00,0.0,0.0,67,74,54,,350,350,28.0,36.0 112 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-20,9.84,,0.00,0.0,0.0,60,69,50,,340,350,18.1,28.0 113 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-21,11.86,,0.00,0.0,0.0,60,77,44,,230,200,21.0,29.1 114 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-22,8.50,,0.00,0.0,0.0,71,83,61,,240,260,18.1,23.0 115 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-23,12.75,,0.04,0.0,0.0,67,73,54,,340,300,23.0,31.1 116 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-24,8.28,,0.00,0.0,0.0,58,72,45,,10,360,19.9,23.9 117 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-25,5.37,,0.00,0.0,0.0,58,68,48,,90,110,13.0,16.1 118 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-26,8.05,,0.10,0.0,0.0,58,68,46,,290,310,25.9,32.0 119 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-27,8.05,,0.00,0.0,0.0,51,67,46,,230,230,16.1,21.0 120 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-28,6.71,,0.00,0.0,0.0,52,58,47,,30,30,15.0,18.1 121 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-29,7.83,,0.03,0.0,0.0,51,57,47,,110,120,15.0,21.0 122 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-04-30,7.16,,0.00,0.0,0.0,52,63,44,,160,140,13.0,21.0 123 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-01,6.93,,0.21,0.0,0.0,49,51,43,,130,160,14.1,18.1 124 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-02,4.47,,0.14,0.0,0.0,50,61,45,,350,360,12.1,13.0 125 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-05-03,6.49,,0.59,0.0,0.0,54,55,51,,40,40,14.1,16.1 126 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-04,9.62,,0.01,0.0,0.0,52,58,49,,40,40,18.1,23.9 127 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-05,12.97,,0.00,0.0,0.0,51,59,46,,30,30,18.1,23.9 128 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-06,10.51,,0.62,0.0,0.0,52,54,49,,30,90,19.9,28.0 129 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-07,5.14,,0.10,0.0,0.0,52,59,48,,50,100,12.1,16.1 130 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-08,10.29,,0.11,0.0,0.0,56,68,47,,310,300,30.0,40.0 131 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-09,12.75,,0.00,0.0,0.0,61,75,45,,260,250,29.1,38.0 132 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-10,8.72,,0.00,0.0,0.0,57,62,46,,240,250,18.1,23.0 133 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-11,6.04,,0.00,0.0,0.0,61,74,49,,240,240,16.1,19.9 134 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-12,5.37,,0.00,0.0,0.0,67,77,54,,110,100,14.1,18.1 135 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-13,4.25,,0.29,0.0,0.0,62,66,58,,260,250,12.1,15.0 136 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-14,9.62,,0.00,0.0,0.0,66,76,56,,260,260,32.0,44.1 137 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-15,16.11,,0.00,0.0,0.0,54,59,44,,270,270,38.0,48.1 138 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-16,15.43,,0.00,0.0,0.0,52,69,40,,260,270,30.0,40.9 139 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-17,9.40,,0.00,0.0,0.0,59,66,52,,220,260,18.1,25.1 140 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-18,7.16,,0.00,0.0,0.0,57,67,50,,150,30,15.0,23.0 141 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-05-19,3.80,,0.00,0.0,0.0,61,72,54,,120,130,14.1,19.9 142 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-20,5.59,,0.00,0.0,0.0,64,76,52,,330,330,13.0,18.1 143 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-21,4.70,,0.08,0.0,0.0,62,64,54,,140,110,14.1,18.1 144 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-22,6.49,,0.04,0.0,0.0,60,70,52,,150,360,12.1,15.0 145 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-23,5.82,,0.01,0.0,0.0,65,76,53,,140,130,14.1,18.1 146 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-24,4.70,,0.05,0.0,0.0,65,77,58,,40,260,13.0,19.9 147 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-25,8.05,,0.00,0.0,0.0,74,91,57,,240,260,19.9,23.9 148 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-26,6.93,,0.00,0.0,0.0,80,93,64,,160,260,17.0,23.9 149 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-27,7.38,,0.00,0.0,0.0,79,89,70,,140,150,16.1,21.9 150 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-28,10.51,,0.00,0.0,0.0,82,96,71,,240,240,19.9,25.9 151 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-29,8.28,,0.03,0.0,0.0,80,88,69,,140,110,16.1,21.0 152 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-30,6.49,,1.57,0.0,0.0,75,83,69,,220,230,16.1,21.0 153 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-05-31,6.04,,0.00,0.0,0.0,77,85,71,,160,150,10.1,16.1 154 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-01,7.16,,0.00,0.0,0.0,76,84,67,,140,140,15.0,19.9 155 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-02,6.71,,0.00,0.0,0.0,68,72,63,,150,160,13.0,17.0 156 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-03,4.25,,0.03,0.0,0.0,65,70,63,,160,150,8.1,10.1 157 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-06-04,5.59,,0.01,0.0,0.0,72,85,65,,130,160,15.0,18.1 158 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-05,6.93,,0.94,0.0,0.0,69,74,66,,300,300,32.0,45.0 159 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-06,11.63,,0.00,0.0,0.0,74,86,66,,250,280,23.0,32.0 160 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-07,11.41,,0.14,0.0,0.0,77,87,64,,280,310,28.0,33.1 161 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-08,12.53,,0.51,0.0,0.0,63,69,54,,300,290,28.0,35.1 162 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-09,15.21,,0.00,0.0,0.0,63,75,53,,280,290,29.1,36.9 163 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-10,11.63,,0.00,0.0,0.0,67,80,56,,340,310,23.0,31.1 164 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-11,10.07,,0.00,0.0,0.0,72,91,55,,250,230,23.0,31.1 165 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-12,19.91,,0.00,0.0,0.0,82,88,63,,300,300,32.0,40.9 166 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-13,15.43,,0.00,0.0,0.0,67,76,57,,330,330,28.0,35.1 167 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-14,10.51,,0.00,0.0,0.0,69,82,59,,300,300,17.0,25.9 168 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-15,6.93,,0.00,0.0,0.0,75,87,59,,230,240,16.1,21.0 169 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-16,4.47,,0.00,0.0,0.0,73,77,67,,230,220,12.1,13.0 170 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-17,7.83,,0.00,0.0,0.0,71,79,63,,130,130,17.0,23.9 171 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-18,5.59,,0.00,0.0,0.0,72,86,58,,150,160,14.1,18.1 172 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-19,6.71,,0.00,0.0,0.0,75,88,63,,110,110,16.1,21.9 173 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-06-20,6.49,,0.00,0.0,0.0,74,85,65,,230,230,17.0,23.0 174 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-21,10.74,,0.00,0.0,0.0,82,91,73,,310,310,25.1,32.0 175 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-22,8.28,,0.00,0.0,0.0,79,90,67,,310,310,29.1,35.1 176 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-23,4.47,,0.00,0.0,0.0,76,83,69,,150,280,10.1,14.1 177 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-24,8.28,,0.00,0.0,0.0,75,83,68,,110,140,18.1,23.9 178 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-25,7.38,,0.00,0.0,0.0,73,82,64,,130,150,15.0,19.9 179 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-26,7.38,,0.00,0.0,0.0,75,86,65,,150,140,16.1,21.9 180 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-27,9.17,,0.37,0.0,0.0,75,87,64,,190,170,16.1,21.9 181 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-28,5.82,,0.39,0.0,0.0,73,78,69,,310,280,19.9,25.9 182 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-29,8.28,,0.01,0.0,0.0,75,87,68,,270,270,18.1,21.9 183 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-06-30,7.16,,0.00,0.0,0.0,77,88,67,,260,270,17.0,21.9 184 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-01,7.38,,0.20,0.0,0.0,75,80,70,,180,180,23.0,29.1 185 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-02,13.42,,0.00,0.0,0.0,73,80,66,,310,300,21.9,31.1 186 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-03,7.38,,0.00,0.0,0.0,74,83,63,,230,230,18.1,23.0 187 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-04,8.28,,0.46,0.0,0.0,74,84,62,,200,200,21.9,25.9 188 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-05,7.16,,0.17,0.0,0.0,77,89,70,,270,280,17.0,23.0 189 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-07-06,5.14,,0.00,0.0,0.0,84,95,74,,250,240,16.1,19.9 190 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-07,7.61,,0.01,0.0,0.0,85,95,77,,240,240,23.9,33.1 191 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-08,9.62,,0.05,0.0,0.0,81,90,69,,130,140,17.0,21.9 192 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-09,7.61,,0.49,0.0,0.0,69,71,66,,30,90,14.1,18.1 193 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-10,9.17,,0.00,0.0,0.0,74,84,67,,300,290,19.9,25.9 194 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-11,6.71,,0.00,0.0,0.0,74,84,65,,20,300,14.1,21.9 195 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-12,7.38,,0.00,0.0,0.0,75,84,67,,160,160,18.1,23.9 196 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-13,6.71,,0.00,0.0,0.0,78,88,69,,190,210,8.9,16.1 197 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-14,6.49,,0.12,0.0,0.0,81,91,76,,280,290,36.0,46.1 198 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-15,9.40,,0.00,0.0,0.0,86,94,78,,230,260,18.1,23.0 199 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-16,6.93,,0.57,0.0,0.0,82,91,75,,180,140,18.1,25.9 200 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-17,6.04,,0.00,0.0,0.0,81,93,73,,260,260,18.1,21.0 201 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-18,10.96,,0.23,0.0,0.0,83,95,72,,240,240,48.1,66.0 202 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-19,11.86,,0.00,0.0,0.0,79,85,72,,330,290,21.0,28.0 203 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-20,7.61,,0.00,0.0,0.0,76,86,66,,250,240,13.0,17.0 204 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-21,9.62,,0.00,0.0,0.0,80,92,69,,210,230,23.0,27.1 205 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-07-22,14.76,,0.00,0.0,0.0,84,97,73,,230,220,23.0,29.1 206 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-23,11.63,,0.00,0.0,0.0,89,98,80,,270,260,23.0,29.1 207 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-24,8.05,,0.00,0.0,0.0,87,96,73,,250,260,18.1,25.1 208 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-25,9.17,,1.84,0.0,0.0,84,99,73,,90,90,29.1,47.0 209 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-26,10.51,,0.00,0.0,0.0,81,93,72,,300,300,21.0,28.0 210 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-27,6.26,,0.00,0.0,0.0,84,93,73,,190,280,14.1,21.9 211 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-28,6.49,,0.00,0.0,0.0,83,93,75,,220,220,21.0,29.1 212 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-29,8.95,,0.87,0.0,0.0,77,86,70,,30,50,17.0,21.9 213 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-30,5.82,,0.36,0.0,0.0,78,86,73,,160,160,16.1,19.9 214 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-07-31,7.83,,0.71,0.0,0.0,76,80,72,,100,90,14.1,19.9 215 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-01,6.71,,0.00,0.0,0.0,74,79,71,,60,70,13.0,17.0 216 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-02,6.93,,0.00,0.0,0.0,74,80,67,,40,40,14.1,17.0 217 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-03,6.71,,0.00,0.0,0.0,73,80,65,,160,150,15.0,19.9 218 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-04,6.49,,0.00,0.0,0.0,74,81,64,,160,150,13.0,18.1 219 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-05,8.50,,0.00,0.0,0.0,75,84,67,,200,210,16.1,21.9 220 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-06,9.40,,0.09,0.0,0.0,79,91,72,,300,300,25.1,32.0 221 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-08-07,8.50,,0.00,0.0,0.0,81,89,71,,270,320,18.1,23.0 222 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-08,6.26,,0.00,0.0,0.0,80,87,69,,330,320,13.0,16.1 223 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-09,7.38,,0.00,0.0,0.0,79,87,68,,150,140,16.1,21.9 224 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-10,10.07,,0.11,0.0,0.0,81,90,75,,220,230,21.0,25.1 225 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-11,10.07,,0.11,0.0,0.0,86,96,75,,300,310,21.0,28.0 226 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-12,10.74,,0.17,0.0,0.0,84,97,74,,250,250,32.0,46.1 227 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-13,8.50,,0.00,0.0,0.0,87,98,79,,230,230,18.1,25.1 228 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-14,10.74,,0.00,0.0,0.0,89,97,80,,270,280,31.1,38.0 229 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-15,7.16,,0.00,0.0,0.0,85,94,77,,290,300,14.1,19.0 230 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-16,10.51,,0.01,0.0,0.0,85,93,77,,260,250,28.0,36.9 231 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-17,10.74,,0.20,0.0,0.0,82,88,76,,250,250,36.0,45.0 232 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-18,5.59,,0.10,0.0,0.0,80,87,73,,260,260,15.0,19.9 233 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-19,6.93,,0.00,0.0,0.0,81,90,74,,20,360,17.0,21.0 234 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-20,7.61,,0.00,0.0,0.0,80,87,72,,140,130,17.0,23.0 235 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-21,9.17,,0.13,0.0,0.0,80,87,74,,130,140,18.1,25.1 236 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-22,13.87,,0.00,0.0,0.0,76,81,67,,320,320,23.9,36.0 237 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-08-23,6.26,,0.00,0.0,0.0,73,84,59,,260,270,15.0,19.9 238 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-24,7.38,,0.00,0.0,0.0,76,90,63,,210,260,15.0,19.9 239 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-25,10.07,,0.01,0.0,0.0,78,88,67,,260,180,18.1,23.9 240 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-26,11.41,,0.00,0.0,0.0,83,95,75,,350,310,21.9,29.1 241 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-27,7.16,,0.00,0.0,0.0,80,88,71,,140,140,15.0,19.9 242 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-28,6.71,,0.00,0.0,0.0,79,87,72,,200,160,15.0,19.9 243 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-29,10.96,,0.00,0.0,0.0,81,93,72,,330,310,19.9,28.0 244 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-30,7.83,,0.00,0.0,0.0,77,85,68,,10,150,14.1,19.0 245 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-08-31,7.61,,0.00,0.0,0.0,79,92,74,,290,280,15.0,18.1 246 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-01,6.71,,0.21,0.0,0.0,75,81,66,,340,310,17.0,23.9 247 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-02,10.74,,0.00,0.0,0.0,73,82,65,,360,350,17.0,25.1 248 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-03,10.96,,0.00,0.0,0.0,71,77,65,,30,30,21.0,30.0 249 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-04,10.51,,0.00,0.0,0.0,71,82,62,,40,40,18.1,23.9 250 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-05,14.32,,0.00,0.0,0.0,73,86,58,,340,340,25.9,33.1 251 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-06,19.69,,0.00,0.0,0.0,78,82,69,,10,360,29.1,36.0 252 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-07,11.86,,0.00,0.0,0.0,76,87,69,,20,20,21.9,27.1 253 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-09-08,6.04,,0.00,0.0,0.0,80,92,71,,240,240,14.1,19.0 254 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-09,7.38,,0.00,0.0,0.0,87,93,78,,350,360,14.1,17.0 255 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-10,7.61,,0.01,0.0,0.0,83,92,76,,200,230,18.1,25.9 256 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-11,13.42,,0.00,0.0,0.0,81,86,66,,330,300,25.9,33.1 257 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-12,6.71,,0.00,0.0,0.0,69,78,57,,150,150,13.0,17.0 258 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-13,7.61,,0.00,0.0,0.0,71,84,60,,200,200,19.9,23.0 259 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-14,10.96,,0.00,0.0,0.0,77,94,66,,320,320,36.0,50.1 260 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-15,9.62,,0.00,0.0,0.0,69,75,60,,30,30,23.0,29.1 261 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-16,6.04,,0.00,0.0,0.0,67,73,57,,170,170,14.1,18.1 262 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-17,7.61,,0.00,0.0,0.0,68,77,59,,140,110,17.0,21.9 263 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-18,8.95,,0.00,0.0,0.0,75,87,68,,230,200,17.0,23.0 264 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-19,4.92,,1.03,0.0,0.0,74,76,70,,280,190,13.0,17.0 265 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-20,5.14,,0.00,0.0,0.0,75,84,69,,350,340,13.0,17.0 266 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-21,7.61,,0.00,0.0,0.0,76,86,67,,20,10,17.0,19.9 267 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-22,5.59,,0.00,0.0,0.0,73,84,61,,170,170,12.1,17.0 268 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-23,5.37,,0.01,0.0,0.0,75,90,60,,10,10,23.0,25.9 269 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-09-24,11.41,,0.44,0.0,0.0,68,72,58,,20,30,21.9,25.9 270 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-25,6.49,,0.00,0.0,0.0,62,71,52,,320,310,15.0,21.9 271 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-26,8.95,,0.00,0.0,0.0,63,73,51,,140,140,18.1,25.1 272 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-27,5.82,,0.12,0.0,0.0,69,76,62,,230,230,14.1,21.0 273 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-28,12.30,,0.00,0.0,0.0,65,73,56,,30,80,19.9,25.9 274 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-29,14.32,,0.04,0.0,0.0,61,65,58,,50,50,23.0,30.0 275 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-09-30,16.33,,0.31,0.0,0.0,59,60,56,,30,30,23.9,31.1 276 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-01,11.41,,0.00,0.0,0.0,59,63,56,,30,40,16.1,23.0 277 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-02,5.59,,0.00,0.0,0.0,60,65,57,,40,40,13.0,16.1 278 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-03,4.92,,0.00,0.0,0.0,65,74,60,,290,300,12.1,19.0 279 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-04,10.29,,0.00,0.0,0.0,65,71,61,,40,40,18.1,23.0 280 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-05,7.16,,0.00,0.0,0.0,61,68,52,,20,40,15.0,18.1 281 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-06,5.14,,0.00,0.0,0.0,62,75,51,,160,160,8.9,12.1 282 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-07,4.47,,0.00,0.0,0.0,63,74,53,,160,140,13.0,17.0 283 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-08,5.37,,0.17,0.0,0.0,63,69,55,,320,320,14.1,17.0 284 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-09,17.00,,0.48,0.0,0.0,60,66,52,,360,340,29.1,35.1 285 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-10-10,12.97,,0.00,0.0,0.0,56,66,47,,350,350,23.0,30.0 286 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-11,5.37,,0.00,0.0,0.0,53,64,43,,140,140,13.0,15.0 287 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-12,4.47,,0.00,0.0,0.0,58,67,47,,140,150,10.1,15.0 288 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-13,7.83,,0.01,0.0,0.0,60,68,52,,330,340,21.9,27.1 289 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-14,8.05,,0.00,0.0,0.0,56,64,46,,330,340,16.1,21.0 290 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-15,6.71,,0.00,0.0,0.0,53,64,42,,140,150,14.1,17.0 291 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-16,8.50,,0.00,0.0,0.0,58,73,45,,240,240,17.0,21.9 292 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-17,6.49,,0.00,0.0,0.0,70,83,60,,260,260,15.0,19.9 293 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-18,9.62,,0.00,0.0,0.0,72,85,62,,210,200,21.9,28.0 294 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-19,9.84,,0.00,0.0,0.0,76,87,64,,350,310,18.1,23.9 295 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-20,9.40,,0.00,0.0,0.0,67,71,60,,100,100,16.1,23.0 296 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-21,4.47,,0.22,0.0,0.0,68,71,58,,310,300,21.0,25.9 297 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-22,16.11,,0.55,0.0,0.0,54,58,47,,280,290,35.1,42.9 298 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-23,14.32,,0.00,0.0,0.0,53,65,44,,280,280,30.0,40.0 299 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-24,12.75,,0.00,0.0,0.0,57,66,47,,300,300,28.0,34.0 300 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-25,11.63,,0.00,0.0,0.0,49,54,42,,320,300,23.9,31.1 301 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-10-26,10.96,,0.00,0.0,0.0,45,52,37,,310,330,21.0,25.9 302 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-27,8.05,,0.65,0.0,0.0,45,56,42,,90,80,15.0,21.0 303 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-28,13.87,,0.00,0.0,0.0,48,51,42,,300,320,32.0,40.9 304 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-29,8.95,,0.00,0.0,0.0,49,66,37,,230,230,19.9,27.1 305 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-30,9.40,,0.92,0.0,0.0,65,78,54,,330,340,38.0,53.0 306 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-10-31,10.96,,0.00,0.0,0.0,52,57,43,,320,330,23.0,31.1 307 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-01,7.38,,0.00,0.0,0.0,48,60,37,,140,250,13.0,18.1 308 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-02,8.05,,0.00,0.0,0.0,59,73,51,,250,240,16.1,19.9 309 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-03,11.63,,0.00,0.0,0.0,63,74,57,,340,340,23.9,32.0 310 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-04,11.63,,0.00,0.0,0.0,58,62,47,,320,320,21.9,27.1 311 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-05,8.50,,0.00,0.0,0.0,52,65,40,,320,320,21.0,25.1 312 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-06,13.20,,0.00,0.0,0.0,54,60,43,,330,320,23.9,30.0 313 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-07,7.61,,0.00,0.0,0.0,47,56,36,,10,10,17.0,21.9 314 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-08,4.03,,0.00,0.0,0.0,51,69,38,,260,290,12.1,17.0 315 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-09,7.83,,0.20,0.0,0.0,55,58,48,,340,330,23.9,29.1 316 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-10,10.96,,0.00,0.0,0.0,49,58,41,,330,330,25.1,31.1 317 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-11-11,17.22,,0.00,0.0,0.0,53,64,42,,330,320,35.1,40.9 318 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-12,9.84,,0.00,0.0,0.0,44,53,37,,320,340,21.9,27.1 319 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-13,9.17,,0.00,0.0,0.0,46,63,34,,230,250,18.1,23.0 320 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-14,5.82,,0.00,0.0,0.0,49,63,37,,250,250,15.0,19.9 321 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-15,10.74,,2.79,0.0,0.0,52,54,46,,360,360,29.1,35.1 322 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-16,9.17,,0.00,0.0,0.0,51,64,41,,270,260,18.1,25.1 323 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-17,9.84,,0.00,0.0,0.0,55,63,48,,320,330,23.0,28.0 324 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-18,4.25,,0.00,0.0,0.0,54,67,44,,350,330,14.1,16.1 325 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-19,9.17,,0.20,0.0,0.0,53,61,36,,290,290,28.0,36.0 326 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-20,21.47,,0.23,0.0,0.0,42,43,34,,260,260,36.9,45.0 327 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-21,21.03,,0.00,0.0,0.0,39,42,37,,300,280,33.1,46.1 328 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-22,18.12,,0.00,0.0,0.0,39,44,37,,300,300,32.0,42.1 329 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-23,9.17,,0.00,0.0,0.0,40,48,33,,320,320,18.1,25.9 330 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-24,6.93,,0.00,0.0,0.0,43,50,35,,360,360,13.0,16.1 331 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-25,4.70,,0.01,0.0,0.0,48,55,45,,220,,8.1, 332 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-26,8.95,,0.00,0.0,0.0,48,53,41,,330,320,21.0,28.0 333 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-11-27,8.72,,0.00,0.0,0.0,43,52,35,,290,300,21.0,27.1 334 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-28,4.92,,0.00,0.0,0.0,44,54,32,,230,190,13.0,15.0 335 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-29,8.05,,2.02,0.0,0.0,53,60,49,,150,140,16.1,23.0 336 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-11-30,6.93,,1.07,0.0,0.0,56,58,52,,20,30,15.0,19.0 337 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-01,13.87,,0.11,0.0,0.0,52,55,41,,280,260,28.0,38.0 338 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-02,14.76,,0.00,0.0,0.0,45,53,38,,270,270,25.1,31.1 339 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-03,17.00,,0.00,0.0,0.0,45,50,41,,310,300,31.1,36.9 340 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-04,9.62,,0.00,0.0,0.0,43,48,39,,320,300,19.9,25.9 341 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-05,8.28,,0.16,0.0,0.0,43,50,37,,290,280,21.9,29.1 342 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-06,7.61,,0.44,0.0,0.0,42,46,34,,50,50,18.1,23.0 343 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-07,6.93,,0.05,0.0,0.0,42,48,37,,40,40,15.0,18.1 344 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-08,10.74,,0.00,0.0,0.0,40,44,33,,270,260,23.9,31.1 345 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-09,15.66,,0.00,0.0,0.0,35,41,29,,290,310,28.0,36.9 346 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-10,9.40,,0.00,0.0,0.0,31,37,23,,290,280,18.1,23.9 347 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-11,5.14,,0.01,0.3,0.0,30,34,25,,200,200,12.1,15.0 348 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-12,8.72,,0.50,0.0,0.0,38,46,32,,260,260,18.1,23.0 349 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-12-13,5.59,,0.00,0.0,0.0,38,44,33,,230,280,13.0,19.9 350 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-14,9.62,,0.00,0.0,0.0,39,44,32,,290,300,21.9,28.0 351 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-15,20.58,,0.00,0.1,0.0,28,32,18,,280,290,36.0,46.1 352 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-16,11.41,,0.00,0.0,0.0,20,28,16,,260,270,19.9,25.1 353 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-17,4.92,,0.71,3.0,2.0,29,36,24,,210,210,16.1,21.9 354 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-18,10.51,,0.07,0.0,1.2,42,59,31,,300,310,36.9,45.0 355 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-19,11.86,,0.00,0.0,0.0,29,31,22,,340,330,21.9,28.0 356 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-20,6.49,,0.00,0.0,0.0,25,34,18,,240,240,13.0,21.0 357 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-21,6.26,,0.00,0.0,0.0,32,42,24,,240,260,13.0,17.0 358 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-22,9.84,,0.00,0.0,0.0,40,51,32,,310,290,25.1,33.1 359 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-23,9.17,,0.00,0.0,0.0,42,48,34,,310,310,17.0,21.9 360 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-24,7.61,,0.49,0.0,0.0,41,46,37,,260,240,16.1,21.0 361 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-25,8.50,,0.00,0.0,0.0,43,52,34,,340,300,15.0,21.9 362 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-26,7.38,,0.00,0.0,0.0,35,47,29,,230,230,16.1,21.0 363 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-27,13.65,,0.01,0.0,0.0,53,62,40,,270,270,29.1,38.0 364 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-28,8.28,,0.00,0.0,0.0,41,43,31,,330,330,19.9,25.1 365 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ 
US",2016-12-29,8.05,,0.36,0.0,0.0,38,45,31,,170,150,18.1,25.1 366 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-30,14.99,,0.00,0.0,0.0,37,42,32,,270,270,25.9,33.1 367 | USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-12-31,12.30,,0.00,0.0,0.0,35,44,29,,200,220,21.9,28.0 -------------------------------------------------------------------------------- /subscriber-pipeline/writeup/data_eng_cp_writeup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "1e6ba182", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import sqlite3\n", 11 | "import pandas as pd\n", 12 | "import ast\n", 13 | "import numpy as np" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "eee9e8fa", 19 | "metadata": {}, 20 | "source": [ 21 | "This notebook demonstrates the discovery process used before creating the `cleanse_data.py` script, which automates the cleaning and update processes." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "3a3130a4", 27 | "metadata": {}, 28 | "source": [ 29 | "# Connecting SQLite3 Database\n", 30 | "\n", 31 | "First I'll connect to the database to determine which tables it contains." 
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "id": "b538ddde", 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "name": "stdout", 42 | "output_type": "stream", 43 | "text": [ 44 | "[('cademycode_students',), ('cademycode_courses',), ('cademycode_student_jobs',)]\n" 45 | ] 46 | } 47 | ], 48 | "source": [ 49 | "con = sqlite3.connect('cademycode.db')\n", 50 | "cur = con.cursor()\n", 51 | "\n", 52 | "# read all table names\n", 53 | "table_list = [a for a in cur.execute(\"SELECT name FROM sqlite_master WHERE type = 'table'\")]\n", 54 | "print(table_list)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "7e07af25", 60 | "metadata": {}, 61 | "source": [ 62 | "## Extract tables from db file\n", 63 | "\n", 64 | "I'll extract the tables using simple query statements and save them to Python variables." 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 3, 70 | "id": "ab55b933", 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "students = pd.read_sql_query(\"SELECT * FROM cademycode_students\", con)\n", 75 | "career_paths = pd.read_sql_query(\"SELECT * FROM cademycode_courses\", con)\n", 76 | "student_jobs = pd.read_sql_query(\"SELECT * FROM cademycode_student_jobs\", con)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "id": "af833d27", 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "students: 5000\n", 90 | "career_paths: 10\n", 91 | "student_jobs: 13\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "print('students:', len(students))\n", 97 | "print('career_paths:', len(career_paths))\n", 98 | "print('student_jobs:', len(student_jobs))" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "0efe7abb", 104 | "metadata": {}, 105 | "source": [ 106 | "# Working with the `Students` table\n", 107 | "\n", 108 | "Let's work with the `students` table first since it is the largest. 
I'll start by taking a look at the first few rows of data to get a general sense of what the structure looks like." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "id": "4780a9b2", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "data": { 119 | "text/html": [ 120 | "
" 213 | ], 214 | "text/plain": [ 215 | " uuid name dob sex \\\n", 216 | "0 1 Annabelle Avery 1943-07-03 F \n", 217 | "1 2 Micah Rubio 1991-02-07 M \n", 218 | "2 3 Hosea Dale 1989-12-07 M \n", 219 | "3 4 Mariann Kirk 1988-07-31 F \n", 220 | "4 5 Lucio Alexander 1963-08-31 M \n", 221 | "\n", 222 | " contact_info job_id num_course_taken \\\n", 223 | "0 {\"mailing_address\": \"303 N Timber Key, Irondal... 7.0 6.0 \n", 224 | "1 {\"mailing_address\": \"767 Crescent Fair, Shoals... 7.0 5.0 \n", 225 | "2 {\"mailing_address\": \"P.O. Box 41269, St. Bonav... 7.0 8.0 \n", 226 | "3 {\"mailing_address\": \"517 SE Wintergreen Isle, ... 6.0 7.0 \n", 227 | "4 {\"mailing_address\": \"18 Cinder Cliff, Doyles b... 7.0 14.0 \n", 228 | "\n", 229 | " current_career_path_id time_spent_hrs \n", 230 | "0 1.0 4.99 \n", 231 | "1 8.0 4.4 \n", 232 | "2 8.0 6.74 \n", 233 | "3 9.0 12.31 \n", 234 | "4 3.0 5.64 " 235 | ] 236 | }, 237 | "execution_count": 5, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "students.head()" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "id": "7d92482c", 249 | "metadata": {}, 250 | "source": [ 251 | "Next I'll examine the datatypes of each column and check for any null values using the `.info()` function." 
252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 6, 257 | "id": "3fdb91f3", 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "name": "stdout", 262 | "output_type": "stream", 263 | "text": [ 264 | "<class 'pandas.core.frame.DataFrame'>\n", 265 | "RangeIndex: 5000 entries, 0 to 4999\n", 266 | "Data columns (total 9 columns):\n", 267 | " # Column Non-Null Count Dtype \n", 268 | "--- ------ -------------- ----- \n", 269 | " 0 uuid 5000 non-null int64 \n", 270 | " 1 name 5000 non-null object\n", 271 | " 2 dob 5000 non-null object\n", 272 | " 3 sex 5000 non-null object\n", 273 | " 4 contact_info 5000 non-null object\n", 274 | " 5 job_id 4995 non-null object\n", 275 | " 6 num_course_taken 4749 non-null object\n", 276 | " 7 current_career_path_id 4529 non-null object\n", 277 | " 8 time_spent_hrs 4529 non-null object\n", 278 | "dtypes: int64(1), object(8)\n", 279 | "memory usage: 351.7+ KB\n" 280 | ] 281 | } 282 | ], 283 | "source": [ 284 | "students.info()" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "id": "c205dbad", 290 | "metadata": {}, 291 | "source": [ 292 | "A few things I notice when examining the DataFrame so far:\n", 293 | "- `contact_info` appears to be a dictionary, which will require additional work to explode it into separate columns\n", 294 | "- each row has a `uuid`, which means each student should occupy exactly one row\n", 295 | "- `dob` seems to be date of birth; I'll want to make sure it is a datetime object\n", 296 | " - calculating age from `dob` might be useful as a grouping column\n", 297 | "- apart from `uuid`, none of the numerical columns are coming in as floats or integers; they are all `object`\n", 298 | " - `job_id` and `current_career_path_id` should be treated as categorical data since they are IDs\n", 299 | "- there is missing data in `job_id`, `num_course_taken`, `current_career_path_id`, and `time_spent_hrs`" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "id": "db618d8f", 305 | "metadata": {}, 306 | "source": [ 307 | "## Calculating Approximate Age\n", 308 | "\n", 
309 | "I'll convert `dob` into approximate age. This provides a couple benefits over a datetime field:\n", 310 | "- To an analytics team who's trying to tell a story with numbers, age makes more aesthetic sense. \n", 311 | "- This creates a new ordinal or even categorical field for analysts to dive into." 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 8, 317 | "id": "f5790e39", 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "now = pd.to_datetime('now')\n", 322 | "students['age'] = (now - pd.to_datetime(students['dob'])).astype('\n", 336 | "\n", 349 | "\n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | "
\n", 439 | "" 440 | ], 441 | "text/plain": [ 442 | " uuid name dob sex \\\n", 443 | "0 1 Annabelle Avery 1943-07-03 F \n", 444 | "1 2 Micah Rubio 1991-02-07 M \n", 445 | "2 3 Hosea Dale 1989-12-07 M \n", 446 | "3 4 Mariann Kirk 1988-07-31 F \n", 447 | "4 5 Lucio Alexander 1963-08-31 M \n", 448 | "\n", 449 | " contact_info job_id num_course_taken \\\n", 450 | "0 {\"mailing_address\": \"303 N Timber Key, Irondal... 7.0 6.0 \n", 451 | "1 {\"mailing_address\": \"767 Crescent Fair, Shoals... 7.0 5.0 \n", 452 | "2 {\"mailing_address\": \"P.O. Box 41269, St. Bonav... 7.0 8.0 \n", 453 | "3 {\"mailing_address\": \"517 SE Wintergreen Isle, ... 6.0 7.0 \n", 454 | "4 {\"mailing_address\": \"18 Cinder Cliff, Doyles b... 7.0 14.0 \n", 455 | "\n", 456 | " current_career_path_id time_spent_hrs age age_group \n", 457 | "0 1.0 4.99 79.0 70 \n", 458 | "1 8.0 4.4 31.0 30 \n", 459 | "2 8.0 6.74 32.0 30 \n", 460 | "3 9.0 12.31 33.0 30 \n", 461 | "4 3.0 5.64 58.0 50 " 462 | ] 463 | }, 464 | "execution_count": 9, 465 | "metadata": {}, 466 | "output_type": "execute_result" 467 | } 468 | ], 469 | "source": [ 470 | "students.head()" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "id": "0039298b", 476 | "metadata": {}, 477 | "source": [ 478 | "## Explode the dictionary" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 10, 484 | "id": "aa6e32f2", 485 | "metadata": {}, 486 | "outputs": [], 487 | "source": [ 488 | "#explode_contact = pd.json_normalize(students['contact_info'])" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "id": "d1d7fb50", 494 | "metadata": {}, 495 | "source": [ 496 | "Interestingly enough, Pandas imported the `contact_info` column as a string type rather than as JSON. 
This adds an extra step: I first have to convert the strings to dictionaries, and only then can I explode them:" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 11, 502 | "id": "b084239b", 503 | "metadata": {}, 504 | "outputs": [], 505 | "source": [ 506 | "students['contact_info'] = students['contact_info'].apply(ast.literal_eval)\n", 507 | "explode_contact = pd.json_normalize(students['contact_info'])\n", 508 | "students = pd.concat([students.drop('contact_info', axis=1), explode_contact], axis=1)" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 12, 514 | "id": "52eccd9f", 515 | "metadata": {}, 516 | "outputs": [ 517 | { 518 | "data": { 519 | "text/html": [ 520 | "
" 631 | ], 632 | "text/plain": [ 633 | " uuid name dob sex job_id num_course_taken \\\n", 634 | "0 1 Annabelle Avery 1943-07-03 F 7.0 6.0 \n", 635 | "1 2 Micah Rubio 1991-02-07 M 7.0 5.0 \n", 636 | "2 3 Hosea Dale 1989-12-07 M 7.0 8.0 \n", 637 | "3 4 Mariann Kirk 1988-07-31 F 6.0 7.0 \n", 638 | "4 5 Lucio Alexander 1963-08-31 M 7.0 14.0 \n", 639 | "\n", 640 | " current_career_path_id time_spent_hrs age age_group \\\n", 641 | "0 1.0 4.99 79.0 70 \n", 642 | "1 8.0 4.4 31.0 30 \n", 643 | "2 8.0 6.74 32.0 30 \n", 644 | "3 9.0 12.31 33.0 30 \n", 645 | "4 3.0 5.64 58.0 50 \n", 646 | "\n", 647 | " mailing_address \\\n", 648 | "0 303 N Timber Key, Irondale, Wisconsin, 84736 \n", 649 | "1 767 Crescent Fair, Shoals, Indiana, 37439 \n", 650 | "2 P.O. Box 41269, St. Bonaventure, Virginia, 83637 \n", 651 | "3 517 SE Wintergreen Isle, Lane, Arkansas, 82242 \n", 652 | "4 18 Cinder Cliff, Doyles borough, Rhode Island,... \n", 653 | "\n", 654 | " email \n", 655 | "0 annabelle_avery9376@woohoo.com \n", 656 | "1 rubio6772@hmail.com \n", 657 | "2 hosea_dale8084@coldmail.com \n", 658 | "3 kirk4005@hmail.com \n", 659 | "4 alexander9810@hmail.com " 660 | ] 661 | }, 662 | "execution_count": 12, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "students.head()" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "id": "f0015208", 674 | "metadata": {}, 675 | "source": [ 676 | "After exploding the dictionary column, it is worth splitting the `mailing_address` column into `street`, `city`, `state`, and `zip_code` columns." 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": 13, 682 | "id": "9d86c66f", 683 | "metadata": {}, 684 | "outputs": [ 685 | { 686 | "data": { 687 | "text/html": [ 688 | "
\n", 689 | "\n", 702 | "\n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | "
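| | uuid | name | dob | sex | job_id | num_course_taken | current_career_path_id | time_spent_hrs | age | age_group | email | street | city | state | zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Annabelle Avery | 1943-07-03 | F | 7.0 | 6.0 | 1.0 | 4.99 | 79.0 | 70 | annabelle_avery9376@woohoo.com | 303 N Timber Key | Irondale | Wisconsin | 84736 |
| 1 | 2 | Micah Rubio | 1991-02-07 | M | 7.0 | 5.0 | 8.0 | 4.4 | 31.0 | 30 | rubio6772@hmail.com | 767 Crescent Fair | Shoals | Indiana | 37439 |
| 2 | 3 | Hosea Dale | 1989-12-07 | M | 7.0 | 8.0 | 8.0 | 6.74 | 32.0 | 30 | hosea_dale8084@coldmail.com | P.O. Box 41269 | St. Bonaventure | Virginia | 83637 |
| 3 | 4 | Mariann Kirk | 1988-07-31 | F | 6.0 | 7.0 | 9.0 | 12.31 | 33.0 | 30 | kirk4005@hmail.com | 517 SE Wintergreen Isle | Lane | Arkansas | 82242 |
| 4 | 5 | Lucio Alexander | 1963-08-31 | M | 7.0 | 14.0 | 3.0 | 5.64 | 58.0 | 50 | alexander9810@hmail.com | 18 Cinder Cliff | Doyles borough | Rhode Island | 73737 |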
\n", 816 | "
" 817 | ], 818 | "text/plain": [ 819 | " uuid name dob sex job_id num_course_taken \\\n", 820 | "0 1 Annabelle Avery 1943-07-03 F 7.0 6.0 \n", 821 | "1 2 Micah Rubio 1991-02-07 M 7.0 5.0 \n", 822 | "2 3 Hosea Dale 1989-12-07 M 7.0 8.0 \n", 823 | "3 4 Mariann Kirk 1988-07-31 F 6.0 7.0 \n", 824 | "4 5 Lucio Alexander 1963-08-31 M 7.0 14.0 \n", 825 | "\n", 826 | " current_career_path_id time_spent_hrs age age_group \\\n", 827 | "0 1.0 4.99 79.0 70 \n", 828 | "1 8.0 4.4 31.0 30 \n", 829 | "2 8.0 6.74 32.0 30 \n", 830 | "3 9.0 12.31 33.0 30 \n", 831 | "4 3.0 5.64 58.0 50 \n", 832 | "\n", 833 | " email street city \\\n", 834 | "0 annabelle_avery9376@woohoo.com 303 N Timber Key Irondale \n", 835 | "1 rubio6772@hmail.com 767 Crescent Fair Shoals \n", 836 | "2 hosea_dale8084@coldmail.com P.O. Box 41269 St. Bonaventure \n", 837 | "3 kirk4005@hmail.com 517 SE Wintergreen Isle Lane \n", 838 | "4 alexander9810@hmail.com 18 Cinder Cliff Doyles borough \n", 839 | "\n", 840 | " state zip_code \n", 841 | "0 Wisconsin 84736 \n", 842 | "1 Indiana 37439 \n", 843 | "2 Virginia 83637 \n", 844 | "3 Arkansas 82242 \n", 845 | "4 Rhode Island 73737 " 846 | ] 847 | }, 848 | "execution_count": 13, 849 | "metadata": {}, 850 | "output_type": "execute_result" 851 | } 852 | ], 853 | "source": [ 854 | "split_address = students.mailing_address.str.split(',', expand=True)\n", 855 | "split_address.columns = ['street', 'city', 'state', 'zip_code']\n", 856 | "students = pd.concat([students.drop('mailing_address', axis=1), split_address], axis=1)\n", 857 | "students.head()" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 14, 863 | "id": "29fd11de", 864 | "metadata": {}, 865 | "outputs": [ 866 | { 867 | "name": "stdout", 868 | "output_type": "stream", 869 | "text": [ 870 | "\n", 871 | "RangeIndex: 5000 entries, 0 to 4999\n", 872 | "Data columns (total 15 columns):\n", 873 | " # Column Non-Null Count Dtype \n", 874 | "--- ------ -------------- ----- \n", 875 | " 0 uuid 5000 
non-null int64 \n", 876 | " 1 name 5000 non-null object \n", 877 | " 2 dob 5000 non-null object \n", 878 | " 3 sex 5000 non-null object \n", 879 | " 4 job_id 4995 non-null object \n", 880 | " 5 num_course_taken 4749 non-null object \n", 881 | " 6 current_career_path_id 4529 non-null object \n", 882 | " 7 time_spent_hrs 4529 non-null object \n", 883 | " 8 age 5000 non-null float64\n", 884 | " 9 age_group 5000 non-null int64 \n", 885 | " 10 email 5000 non-null object \n", 886 | " 11 street 5000 non-null object \n", 887 | " 12 city 5000 non-null object \n", 888 | " 13 state 5000 non-null object \n", 889 | " 14 zip_code 5000 non-null object \n", 890 | "dtypes: float64(1), int64(2), object(12)\n", 891 | "memory usage: 586.1+ KB\n" 892 | ] 893 | } 894 | ], 895 | "source": [ 896 | "students.info()" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "id": "6c02205b", 902 | "metadata": {}, 903 | "source": [ 904 | "## Fixing column datatypes\n", 905 | "Let's fix the column datatypes first. I know that `num_course_taken` should be integers and `time_spent_hrs` should be floats. Even though `job_id` and `current_career_path_id` look numerical, they should be treated as categorical data since they are IDs. Because a standard integer column can't hold null values, I can't cast `job_id`, `current_career_path_id`, or `num_course_taken` to integers yet, so I'll cast them to floats for now."
906 | ] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "execution_count": 15, 911 | "id": "2759d73e", 912 | "metadata": {}, 913 | "outputs": [], 914 | "source": [ 915 | "students['job_id'] = students['job_id'].astype(float)\n", 916 | "students['current_career_path_id'] = students['current_career_path_id'].astype(float)\n", 917 | "students['num_course_taken'] = students['num_course_taken'].astype(float)\n", 918 | "students['time_spent_hrs'] = students['time_spent_hrs'].astype(float)" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": 16, 924 | "id": "519e3cdb", 925 | "metadata": {}, 926 | "outputs": [ 927 | { 928 | "name": "stdout", 929 | "output_type": "stream", 930 | "text": [ 931 | "\n", 932 | "RangeIndex: 5000 entries, 0 to 4999\n", 933 | "Data columns (total 15 columns):\n", 934 | " # Column Non-Null Count Dtype \n", 935 | "--- ------ -------------- ----- \n", 936 | " 0 uuid 5000 non-null int64 \n", 937 | " 1 name 5000 non-null object \n", 938 | " 2 dob 5000 non-null object \n", 939 | " 3 sex 5000 non-null object \n", 940 | " 4 job_id 4995 non-null float64\n", 941 | " 5 num_course_taken 4749 non-null float64\n", 942 | " 6 current_career_path_id 4529 non-null float64\n", 943 | " 7 time_spent_hrs 4529 non-null float64\n", 944 | " 8 age 5000 non-null float64\n", 945 | " 9 age_group 5000 non-null int64 \n", 946 | " 10 email 5000 non-null object \n", 947 | " 11 street 5000 non-null object \n", 948 | " 12 city 5000 non-null object \n", 949 | " 13 state 5000 non-null object \n", 950 | " 14 zip_code 5000 non-null object \n", 951 | "dtypes: float64(5), int64(2), object(8)\n", 952 | "memory usage: 586.1+ KB\n" 953 | ] 954 | } 955 | ], 956 | "source": [ 957 | "students.info()" 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "id": "eb0d7042", 963 | "metadata": {}, 964 | "source": [ 965 | "## Handling Missing Data\n", 966 | "\n", 967 | "### `num_course_taken`\n", 968 | "Let's examine the missing data to determine if the missing 
data is missing at random, structurally missing, or missing not at random:" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 17, 974 | "id": "ed8300e4", 975 | "metadata": {}, 976 | "outputs": [ 977 | { 978 | "data": { 979 | "text/html": [ 980 | "
\n", 981 | "\n", 994 | "\n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 
1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | "
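| | uuid | name | dob | sex | job_id | num_course_taken | current_career_path_id | time_spent_hrs | age | age_group | email | street | city | state | zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 26 | Doug Browning | 1970-06-08 | M | 7.0 | NaN | 5.0 | 1.92 | 52.0 | 50 | doug7761@inlook.com | P.O. Box 15845 | Devine | Florida | 23097 |
| 26 | 27 | Damon Schrauwen | 1953-10-31 | M | 4.0 | NaN | 10.0 | 3.73 | 68.0 | 60 | damon9864@woohoo.com | P.O. Box 84659 | Maben | Georgia | 66137 |
| 51 | 52 | Alisa Neil | 1977-05-28 | F | 5.0 | NaN | 8.0 | 22.86 | 45.0 | 40 | alisa9616@inlook.com | 16 View Annex | Mosses | North Dakota | 25748 |
| 70 | 71 | Chauncey Hooper | 1962-04-07 | M | 3.0 | NaN | 3.0 | 3.97 | 60.0 | 60 | chauncey6352@woohoo.com | 955 Dewy Flat | Slaughterville | South Carolina | 22167 |
| 80 | 81 | Ellyn van Heest | 1984-06-28 | F | 3.0 | NaN | 10.0 | 12.39 | 38.0 | 30 | ellyn_vanheest8375@hmail.com | 872 Cider Glade | Chicken | Delaware | 42689 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4889 | 4890 | Tegan Cochran | 1970-11-08 | F | 5.0 | NaN | 8.0 | 22.75 | 51.0 | 50 | tegan130@inlook.com | 106 Sunny Nook | Vernal | Georgia | 10769 |
| 4898 | 4899 | Ruthann Oliver | 1998-05-22 | F | 3.0 | NaN | 7.0 | 21.27 | 24.0 | 20 | ruthann1124@woohoo.com | 644 Merry Island | Green Valley | Wyoming | 91273 |
| 4914 | 4915 | Ernest Holmes | 1995-03-11 | M | 7.0 | NaN | 9.0 | 26.50 | 27.0 | 20 | ernest_holmes505@hmail.com | 872 Wintergreen Harbor | Gallitzin borough | Maine | 50103 |
| 4980 | 4981 | Brice Franklin | 1946-12-01 | M | 4.0 | NaN | 5.0 | 8.66 | 75.0 | 70 | brice9741@coldmail.com | 947 Panda Way | New Bedford village | Vermont | 31232 |
| 4985 | 4986 | Russel Vonck | 1994-09-07 | M | 5.0 | NaN | 5.0 | 16.44 | 27.0 | 20 | russel_vonck2834@woohoo.com | 815 Middle Timber Corner | Clawson | South Dakota | 02949 |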
\n", 1216 | "

251 rows × 15 columns

\n", 1217 | "
" 1218 | ], 1219 | "text/plain": [ 1220 | " uuid name dob sex job_id num_course_taken \\\n", 1221 | "25 26 Doug Browning 1970-06-08 M 7.0 NaN \n", 1222 | "26 27 Damon Schrauwen 1953-10-31 M 4.0 NaN \n", 1223 | "51 52 Alisa Neil 1977-05-28 F 5.0 NaN \n", 1224 | "70 71 Chauncey Hooper 1962-04-07 M 3.0 NaN \n", 1225 | "80 81 Ellyn van Heest 1984-06-28 F 3.0 NaN \n", 1226 | "... ... ... ... .. ... ... \n", 1227 | "4889 4890 Tegan Cochran 1970-11-08 F 5.0 NaN \n", 1228 | "4898 4899 Ruthann Oliver 1998-05-22 F 3.0 NaN \n", 1229 | "4914 4915 Ernest Holmes 1995-03-11 M 7.0 NaN \n", 1230 | "4980 4981 Brice Franklin 1946-12-01 M 4.0 NaN \n", 1231 | "4985 4986 Russel Vonck 1994-09-07 M 5.0 NaN \n", 1232 | "\n", 1233 | " current_career_path_id time_spent_hrs age age_group \\\n", 1234 | "25 5.0 1.92 52.0 50 \n", 1235 | "26 10.0 3.73 68.0 60 \n", 1236 | "51 8.0 22.86 45.0 40 \n", 1237 | "70 3.0 3.97 60.0 60 \n", 1238 | "80 10.0 12.39 38.0 30 \n", 1239 | "... ... ... ... ... \n", 1240 | "4889 8.0 22.75 51.0 50 \n", 1241 | "4898 7.0 21.27 24.0 20 \n", 1242 | "4914 9.0 26.50 27.0 20 \n", 1243 | "4980 5.0 8.66 75.0 70 \n", 1244 | "4985 5.0 16.44 27.0 20 \n", 1245 | "\n", 1246 | " email street \\\n", 1247 | "25 doug7761@inlook.com P.O. Box 15845 \n", 1248 | "26 damon9864@woohoo.com P.O. Box 84659 \n", 1249 | "51 alisa9616@inlook.com 16 View Annex \n", 1250 | "70 chauncey6352@woohoo.com 955 Dewy Flat \n", 1251 | "80 ellyn_vanheest8375@hmail.com 872 Cider Glade \n", 1252 | "... ... ... 
\n", 1253 | "4889 tegan130@inlook.com 106 Sunny Nook \n", 1254 | "4898 ruthann1124@woohoo.com 644 Merry Island \n", 1255 | "4914 ernest_holmes505@hmail.com 872 Wintergreen Harbor \n", 1256 | "4980 brice9741@coldmail.com 947 Panda Way \n", 1257 | "4985 russel_vonck2834@woohoo.com 815 Middle Timber Corner \n", 1258 | "\n", 1259 | " city state zip_code \n", 1260 | "25 Devine Florida 23097 \n", 1261 | "26 Maben Georgia 66137 \n", 1262 | "51 Mosses North Dakota 25748 \n", 1263 | "70 Slaughterville South Carolina 22167 \n", 1264 | "80 Chicken Delaware 42689 \n", 1265 | "... ... ... ... \n", 1266 | "4889 Vernal Georgia 10769 \n", 1267 | "4898 Green Valley Wyoming 91273 \n", 1268 | "4914 Gallitzin borough Maine 50103 \n", 1269 | "4980 New Bedford village Vermont 31232 \n", 1270 | "4985 Clawson South Dakota 02949 \n", 1271 | "\n", 1272 | "[251 rows x 15 columns]" 1273 | ] 1274 | }, 1275 | "metadata": {}, 1276 | "output_type": "display_data" 1277 | } 1278 | ], 1279 | "source": [ 1280 | "missing_course_taken = students[students[['num_course_taken']].isnull().any(axis=1)]\n", 1281 | "display(missing_course_taken)" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "markdown", 1286 | "id": "aaed3ab6", 1287 | "metadata": {}, 1288 | "source": [ 1289 | "Let's take a look at the distribution of missing and complete data grouped by the categorical data. If the distribution between the missing and non-missing data is similar, we can conclude that the data is MAR if no identifiable pattern is found. Generally, if something is MNAR or structurally missing, it should be visible by graphing the data." 
1290 | ] 1291 | }, 1292 | { 1293 | "cell_type": "code", 1294 | "execution_count": 18, 1295 | "id": "b3d0a95a", 1296 | "metadata": {}, 1297 | "outputs": [ 1298 | { 1299 | "data": { 1300 | "text/plain": [ 1301 | "" 1302 | ] 1303 | }, 1304 | "execution_count": 18, 1305 | "metadata": {}, 1306 | "output_type": "execute_result" 1307 | }, 1308 | { 1309 | "data": { 1310 | "image/png": "\n", 1311 | "text/plain": [ 1312 | "
" 1313 | ] 1314 | }, 1315 | "metadata": { 1316 | "needs_background": "light" 1317 | }, 1318 | "output_type": "display_data" 1319 | } 1320 | ], 1321 | "source": [ 1322 | "sg = (students.groupby('sex').count()['uuid']/len(students)).rename('complete')\n", 1323 | "mg = (missing_course_taken.groupby('sex').count()['uuid']/len(missing_course_taken)).rename('incomplete')\n", 1324 | "df = pd.concat([sg, mg], axis=1)\n", 1325 | "df.plot.bar()" 1326 | ] 1327 | }, 1328 | { 1329 | "cell_type": "code", 1330 | "execution_count": 20, 1331 | "id": "acaade8b", 1332 | "metadata": {}, 1333 | "outputs": [ 1334 | { 1335 | "data": { 1336 | "text/plain": [ 1337 | "" 1338 | ] 1339 | }, 1340 | "execution_count": 20, 1341 | "metadata": {}, 1342 | "output_type": "execute_result" 1343 | }, 1344 | { 1345 | "data": { 1346 | "image/png": "\n", 1347 | "text/plain": [ 1348 | "
" 1349 | ] 1350 | }, 1351 | "metadata": { 1352 | "needs_background": "light" 1353 | }, 1354 | "output_type": "display_data" 1355 | } 1356 | ], 1357 | "source": [ 1358 | "sg = (students.groupby('job_id').count()['uuid']/len(students)).rename('complete')\n", 1359 | "mg = (missing_course_taken.groupby('job_id').count()['uuid']/len(missing_course_taken)).rename('incomplete')\n", 1360 | "df = pd.concat([sg, mg], axis=1)\n", 1361 | "df.plot.bar()" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": 21, 1367 | "id": "b3f87be8", 1368 | "metadata": {}, 1369 | "outputs": [ 1370 | { 1371 | "data": { 1372 | "text/plain": [ 1373 | "" 1374 | ] 1375 | }, 1376 | "execution_count": 21, 1377 | "metadata": {}, 1378 | "output_type": "execute_result" 1379 | }, 1380 | { 1381 | "data": { 1382 | "image/png": "\n", 1383 | "text/plain": [ 1384 | "
" 1385 | ] 1386 | }, 1387 | "metadata": { 1388 | "needs_background": "light" 1389 | }, 1390 | "output_type": "display_data" 1391 | } 1392 | ], 1393 | "source": [ 1394 | "sg = (students.groupby('age_group').count()['uuid']/len(students)).rename('complete')\n", 1395 | "mg = (missing_course_taken.groupby('age_group').count()['uuid']/len(missing_course_taken)).rename('incomplete')\n", 1396 | "df = pd.concat([sg, mg], axis=1)\n", 1397 | "df.plot.bar()" 1398 | ] 1399 | }, 1400 | { 1401 | "cell_type": "markdown", 1402 | "id": "d3816192", 1403 | "metadata": {}, 1404 | "source": [ 1405 | "Nothing looks out of the ordinary in the distributions from a simple eye test. Additionally, the column contains zeroes, so the NaNs aren't just students who haven't enrolled in a course yet, which rules out structurally missing data. \n", 1406 | "\n", 1407 | "I think it's safe to call this missing at random (MAR). Since only about 5% of the data is missing (251 of 5,000 rows) and it appears to be MAR, I can remove the missing data with little impact. Instead of deleting it completely, I will store the missing rows in a separate table in case my analytics team wants to try to fill them in."
1408 | ] 1409 | }, 1410 | { 1411 | "cell_type": "code", 1412 | "execution_count": 22, 1413 | "id": "d4122e4e", 1414 | "metadata": {}, 1415 | "outputs": [], 1416 | "source": [ 1417 | "missing_data = pd.DataFrame()\n", 1418 | "missing_data = pd.concat([missing_data, missing_course_taken])\n", 1419 | "students = students.dropna(subset=['num_course_taken'])" 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 23, 1425 | "id": "81b278ad", 1426 | "metadata": {}, 1427 | "outputs": [ 1428 | { 1429 | "name": "stdout", 1430 | "output_type": "stream", 1431 | "text": [ 1432 | "\n", 1433 | "Int64Index: 4749 entries, 0 to 4999\n", 1434 | "Data columns (total 15 columns):\n", 1435 | " # Column Non-Null Count Dtype \n", 1436 | "--- ------ -------------- ----- \n", 1437 | " 0 uuid 4749 non-null int64 \n", 1438 | " 1 name 4749 non-null object \n", 1439 | " 2 dob 4749 non-null object \n", 1440 | " 3 sex 4749 non-null object \n", 1441 | " 4 job_id 4744 non-null float64\n", 1442 | " 5 num_course_taken 4749 non-null float64\n", 1443 | " 6 current_career_path_id 4298 non-null float64\n", 1444 | " 7 time_spent_hrs 4298 non-null float64\n", 1445 | " 8 age 4749 non-null float64\n", 1446 | " 9 age_group 4749 non-null int64 \n", 1447 | " 10 email 4749 non-null object \n", 1448 | " 11 street 4749 non-null object \n", 1449 | " 12 city 4749 non-null object \n", 1450 | " 13 state 4749 non-null object \n", 1451 | " 14 zip_code 4749 non-null object \n", 1452 | "dtypes: float64(5), int64(2), object(8)\n", 1453 | "memory usage: 593.6+ KB\n" 1454 | ] 1455 | } 1456 | ], 1457 | "source": [ 1458 | "students.info()" 1459 | ] 1460 | }, 1461 | { 1462 | "cell_type": "markdown", 1463 | "id": "d8c42329", 1464 | "metadata": {}, 1465 | "source": [ 1466 | "### `job_id`" 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "markdown", 1471 | "id": "00062e1d", 1472 | "metadata": {}, 1473 | "source": [ 1474 | "Since this missing data is categorical, we can't impute the data. 
\n", 1475 | "If this data is MAR, there are only a few other options:\n", 1476 | "- delete the data due to the very low percentage of affected rows\n", 1477 | "- replace with the most frequent category\n", 1478 | "- develop a classification model to predict the missing category" 1479 | ] 1480 | }, 1481 | { 1482 | "cell_type": "code", 1483 | "execution_count": 24, 1484 | "id": "aa0063d3", 1485 | "metadata": {}, 1486 | "outputs": [ 1487 | { 1488 | "data": { 1489 | "text/html": [ 1490 | "
\n", 1491 | "\n", 1504 | "\n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | "
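| | uuid | name | dob | sex | job_id | num_course_taken | current_career_path_id | time_spent_hrs | age | age_group | email | street | city | state | zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 162 | 163 | Glen Riley | 2002-08-22 | M | NaN | 8.0 | 3.0 | 5.70 | 19.0 | 10 | glen_riley4484@hmail.com | P.O. Box 37267 | Cornlea village | Tennessee | 19192 |
| 757 | 758 | Mercedez Vorberg | 2002-03-25 | F | NaN | 15.0 | 4.0 | 4.14 | 20.0 | 20 | mercedez6297@woohoo.com | 284 Cedar Seventh | Virden village | Washington | 60489 |
| 854 | 855 | Kurt Ho | 2002-05-29 | M | NaN | 0.0 | 8.0 | 23.72 | 20.0 | 20 | ho6107@inlook.com | P.O. Box 27254 | Olin | New Hampshire | 60067 |
| 1029 | 1030 | Penny Gaines | 2002-03-01 | N | NaN | 15.0 | 4.0 | 16.25 | 20.0 | 20 | gaines2897@hmail.com | 138 Misty Vale | Stockton borough | West Virginia | 53630 |
| 1542 | 1543 | Frederick Reilly | 2002-11-13 | M | NaN | 7.0 | 9.0 | 21.32 | 19.0 | 10 | frederick_reilly6971@woohoo.com | P.O. Box 40769 | Quakervillage | Maryland | 96218 |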
\n", 1618 | "
" 1619 | ], 1620 | "text/plain": [ 1621 | " uuid name dob sex job_id num_course_taken \\\n", 1622 | "162 163 Glen Riley 2002-08-22 M NaN 8.0 \n", 1623 | "757 758 Mercedez Vorberg 2002-03-25 F NaN 15.0 \n", 1624 | "854 855 Kurt Ho 2002-05-29 M NaN 0.0 \n", 1625 | "1029 1030 Penny Gaines 2002-03-01 N NaN 15.0 \n", 1626 | "1542 1543 Frederick Reilly 2002-11-13 M NaN 7.0 \n", 1627 | "\n", 1628 | " current_career_path_id time_spent_hrs age age_group \\\n", 1629 | "162 3.0 5.70 19.0 10 \n", 1630 | "757 4.0 4.14 20.0 20 \n", 1631 | "854 8.0 23.72 20.0 20 \n", 1632 | "1029 4.0 16.25 20.0 20 \n", 1633 | "1542 9.0 21.32 19.0 10 \n", 1634 | "\n", 1635 | " email street city \\\n", 1636 | "162 glen_riley4484@hmail.com P.O. Box 37267 Cornlea village \n", 1637 | "757 mercedez6297@woohoo.com 284 Cedar Seventh Virden village \n", 1638 | "854 ho6107@inlook.com P.O. Box 27254 Olin \n", 1639 | "1029 gaines2897@hmail.com 138 Misty Vale Stockton borough \n", 1640 | "1542 frederick_reilly6971@woohoo.com P.O. Box 40769 Quakervillage \n", 1641 | "\n", 1642 | " state zip_code \n", 1643 | "162 Tennessee 19192 \n", 1644 | "757 Washington 60489 \n", 1645 | "854 New Hampshire 60067 \n", 1646 | "1029 West Virginia 53630 \n", 1647 | "1542 Maryland 96218 " 1648 | ] 1649 | }, 1650 | "execution_count": 24, 1651 | "metadata": {}, 1652 | "output_type": "execute_result" 1653 | } 1654 | ], 1655 | "source": [ 1656 | "missing_job_id = students[students[['job_id']].isnull().any(axis=1)]\n", 1657 | "missing_job_id.head()" 1658 | ] 1659 | }, 1660 | { 1661 | "cell_type": "markdown", 1662 | "id": "5baf278a", 1663 | "metadata": {}, 1664 | "source": [ 1665 | "With only 5 rows of missing data and no pattern that I can deduce, this data also appears to be MAR. Taking into account both the amount of resources (time) I have and the impact that fixing 5 rows of data will have, the most reasonable action is to delete this data as well." 
1666 | ] 1667 | }, 1668 | { 1669 | "cell_type": "code", 1670 | "execution_count": 25, 1671 | "id": "d4628b96", 1672 | "metadata": {}, 1673 | "outputs": [], 1674 | "source": [ 1675 | "missing_data = pd.concat([missing_data, missing_job_id])\n", 1676 | "students = students.dropna(subset=['job_id'])" 1677 | ] 1678 | }, 1679 | { 1680 | "cell_type": "code", 1681 | "execution_count": 26, 1682 | "id": "c2ae86a9", 1683 | "metadata": {}, 1684 | "outputs": [ 1685 | { 1686 | "name": "stdout", 1687 | "output_type": "stream", 1688 | "text": [ 1689 | "\n", 1690 | "Int64Index: 4744 entries, 0 to 4999\n", 1691 | "Data columns (total 15 columns):\n", 1692 | " # Column Non-Null Count Dtype \n", 1693 | "--- ------ -------------- ----- \n", 1694 | " 0 uuid 4744 non-null int64 \n", 1695 | " 1 name 4744 non-null object \n", 1696 | " 2 dob 4744 non-null object \n", 1697 | " 3 sex 4744 non-null object \n", 1698 | " 4 job_id 4744 non-null float64\n", 1699 | " 5 num_course_taken 4744 non-null float64\n", 1700 | " 6 current_career_path_id 4293 non-null float64\n", 1701 | " 7 time_spent_hrs 4293 non-null float64\n", 1702 | " 8 age 4744 non-null float64\n", 1703 | " 9 age_group 4744 non-null int64 \n", 1704 | " 10 email 4744 non-null object \n", 1705 | " 11 street 4744 non-null object \n", 1706 | " 12 city 4744 non-null object \n", 1707 | " 13 state 4744 non-null object \n", 1708 | " 14 zip_code 4744 non-null object \n", 1709 | "dtypes: float64(5), int64(2), object(8)\n", 1710 | "memory usage: 593.0+ KB\n" 1711 | ] 1712 | } 1713 | ], 1714 | "source": [ 1715 | "students.info()" 1716 | ] 1717 | }, 1718 | { 1719 | "cell_type": "markdown", 1720 | "id": "111cd020", 1721 | "metadata": {}, 1722 | "source": [ 1723 | "### `current_career_path_id`\n", 1724 | "`current_career_path_id` is also categorical, so if its missing values also turn out to be MAR, we can handle them the same way."
1725 | ] 1726 | }, 1727 | { 1728 | "cell_type": "code", 1729 | "execution_count": 27, 1730 | "id": "cbb700c4", 1731 | "metadata": {}, 1732 | "outputs": [], 1733 | "source": [ 1734 | "missing_path_id = students[students[['current_career_path_id']].isnull().any(axis=1)]" 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "code", 1739 | "execution_count": 28, 1740 | "id": "4c4cae67", 1741 | "metadata": {}, 1742 | "outputs": [ 1743 | { 1744 | "data": { 1745 | "text/html": [ 1746 | "
" 1875 | ], 1876 | "text/plain": [ 1877 | " uuid name dob sex job_id num_course_taken \\\n", 1878 | "15 16 Norene Dalton 1976-04-30 F 6.0 0.0 \n", 1879 | "19 20 Sofia van Steenbergen 1990-02-21 N 7.0 13.0 \n", 1880 | "30 31 Christoper Warner 1989-12-28 M 2.0 5.0 \n", 1881 | "49 50 Antony Horne 1996-05-29 M 3.0 2.0 \n", 1882 | "54 55 Omar Bunk 1955-11-08 M 3.0 14.0 \n", 1883 | "\n", 1884 | " current_career_path_id time_spent_hrs age age_group \\\n", 1885 | "15 NaN NaN 46.0 40 \n", 1886 | "19 NaN NaN 32.0 30 \n", 1887 | "30 NaN NaN 32.0 30 \n", 1888 | "49 NaN NaN 26.0 20 \n", 1889 | "54 NaN NaN 66.0 60 \n", 1890 | "\n", 1891 | " email street city \\\n", 1892 | "15 norene_dalton9509@hmail.com 130 Wishing Essex Branch \n", 1893 | "19 vansteenbergen8482@inlook.com 634 Clear Barn Dell Beaman \n", 1894 | "30 warner5906@coldmail.com 556 Stony Highlands Drain \n", 1895 | "49 antony577@coldmail.com P.O. Box 78685 Lenox \n", 1896 | "54 omar1245@coldmail.com 445 Dale Hollow Vermont village \n", 1897 | "\n", 1898 | " state zip_code \n", 1899 | "15 Ohio 13616 \n", 1900 | "19 Georgia 33288 \n", 1901 | "30 Illinois 01973 \n", 1902 | "49 Texas 15516 \n", 1903 | "54 South Carolina 28329 " 1904 | ] 1905 | }, 1906 | "execution_count": 28, 1907 | "metadata": {}, 1908 | "output_type": "execute_result" 1909 | } 1910 | ], 1911 | "source": [ 1912 | "missing_path_id.head()" 1913 | ] 1914 | }, 1915 | { 1916 | "cell_type": "markdown", 1917 | "id": "e0196574", 1918 | "metadata": {}, 1919 | "source": [ 1920 | "The most interesting thing to note is that `time_spent_hrs` is also missing with the first 5 rows of missing `current_career_path_id`. This is worth investigating further because it's possible that `current_career_path_id` and `time_spent_hrs` are structurally missing - the student is not currently taking any classes." 
1921 | ] 1922 | }, 1923 | { 1924 | "cell_type": "code", 1925 | "execution_count": 29, 1926 | "id": "62e7b86c", 1927 | "metadata": { 1928 | "scrolled": true 1929 | }, 1930 | "outputs": [ 1931 | { 1932 | "name": "stdout", 1933 | "output_type": "stream", 1934 | "text": [ 1935 | "\n", 1936 | "Int64Index: 451 entries, 15 to 4974\n", 1937 | "Data columns (total 15 columns):\n", 1938 | " # Column Non-Null Count Dtype \n", 1939 | "--- ------ -------------- ----- \n", 1940 | " 0 uuid 451 non-null int64 \n", 1941 | " 1 name 451 non-null object \n", 1942 | " 2 dob 451 non-null object \n", 1943 | " 3 sex 451 non-null object \n", 1944 | " 4 job_id 451 non-null float64\n", 1945 | " 5 num_course_taken 451 non-null float64\n", 1946 | " 6 current_career_path_id 0 non-null float64\n", 1947 | " 7 time_spent_hrs 0 non-null float64\n", 1948 | " 8 age 451 non-null float64\n", 1949 | " 9 age_group 451 non-null int64 \n", 1950 | " 10 email 451 non-null object \n", 1951 | " 11 street 451 non-null object \n", 1952 | " 12 city 451 non-null object \n", 1953 | " 13 state 451 non-null object \n", 1954 | " 14 zip_code 451 non-null object \n", 1955 | "dtypes: float64(5), int64(2), object(8)\n", 1956 | "memory usage: 56.4+ KB\n" 1957 | ] 1958 | } 1959 | ], 1960 | "source": [ 1961 | "missing_path_id.info()" 1962 | ] 1963 | }, 1964 | { 1965 | "cell_type": "markdown", 1966 | "id": "642e5dfc", 1967 | "metadata": {}, 1968 | "source": [ 1969 | "As hypothesized, `time_spent_hrs` is null wherever `current_career_path_id` is null, which suggests that both are structurally missing. To resolve this, I will set `current_career_path_id` to a new id that indicates no current career path, and I will set `time_spent_hrs` to 0 to indicate no hours spent."
1970 | ] 1971 | }, 1972 | { 1973 | "cell_type": "code", 1974 | "execution_count": 30, 1975 | "id": "14facb4c", 1976 | "metadata": {}, 1977 | "outputs": [ 1978 | { 1979 | "data": { 1980 | "text/plain": [ 1981 | "array([ 1., 8., 9., 3., 6., 10., 5., nan, 4., 7., 2.])" 1982 | ] 1983 | }, 1984 | "execution_count": 30, 1985 | "metadata": {}, 1986 | "output_type": "execute_result" 1987 | } 1988 | ], 1989 | "source": [ 1990 | "students['current_career_path_id'].unique()" 1991 | ] 1992 | }, 1993 | { 1994 | "cell_type": "code", 1995 | "execution_count": 31, 1996 | "id": "ffec07c4", 1997 | "metadata": {}, 1998 | "outputs": [], 1999 | "source": [ 2000 | "students['current_career_path_id'] = students['current_career_path_id'].fillna(0)\n", 2001 | "students['time_spent_hrs'] = students['time_spent_hrs'].fillna(0)" 2002 | ] 2003 | }, 2004 | { 2005 | "cell_type": "code", 2006 | "execution_count": 32, 2007 | "id": "c8cd0bcc", 2008 | "metadata": {}, 2009 | "outputs": [ 2010 | { 2011 | "name": "stdout", 2012 | "output_type": "stream", 2013 | "text": [ 2014 | "\n", 2015 | "Int64Index: 4744 entries, 0 to 4999\n", 2016 | "Data columns (total 15 columns):\n", 2017 | " # Column Non-Null Count Dtype \n", 2018 | "--- ------ -------------- ----- \n", 2019 | " 0 uuid 4744 non-null int64 \n", 2020 | " 1 name 4744 non-null object \n", 2021 | " 2 dob 4744 non-null object \n", 2022 | " 3 sex 4744 non-null object \n", 2023 | " 4 job_id 4744 non-null float64\n", 2024 | " 5 num_course_taken 4744 non-null float64\n", 2025 | " 6 current_career_path_id 4744 non-null float64\n", 2026 | " 7 time_spent_hrs 4744 non-null float64\n", 2027 | " 8 age 4744 non-null float64\n", 2028 | " 9 age_group 4744 non-null int64 \n", 2029 | " 10 email 4744 non-null object \n", 2030 | " 11 street 4744 non-null object \n", 2031 | " 12 city 4744 non-null object \n", 2032 | " 13 state 4744 non-null object \n", 2033 | " 14 zip_code 4744
non-null object \n", 2034 | "dtypes: float64(5), int64(2), object(8)\n", 2035 | "memory usage: 593.0+ KB\n" 2036 | ] 2037 | } 2038 | ], 2039 | "source": [ 2040 | "students.info()" 2041 | ] 2042 | }, 2043 | { 2044 | "cell_type": "markdown", 2045 | "id": "9012b4b5", 2046 | "metadata": {}, 2047 | "source": [ 2048 | "# Working with the `career_paths` table\n" 2049 | ] 2050 | }, 2051 | { 2052 | "cell_type": "code", 2053 | "execution_count": 33, 2054 | "id": "6d72b176", 2055 | "metadata": {}, 2056 | "outputs": [ 2057 | { 2058 | "data": { 2059 | "text/html": [ 2060 | "
" 2147 | ], 2148 | "text/plain": [ 2149 | " career_path_id career_path_name hours_to_complete\n", 2150 | "0 1 data scientist 20\n", 2151 | "1 2 data engineer 20\n", 2152 | "2 3 data analyst 12\n", 2153 | "3 4 software engineering 25\n", 2154 | "4 5 backend engineer 18\n", 2155 | "5 6 frontend engineer 20\n", 2156 | "6 7 iOS developer 27\n", 2157 | "7 8 android developer 27\n", 2158 | "8 9 machine learning engineer 35\n", 2159 | "9 10 ux/ui designer 15" 2160 | ] 2161 | }, 2162 | "metadata": {}, 2163 | "output_type": "display_data" 2164 | } 2165 | ], 2166 | "source": [ 2167 | "display(career_paths)" 2168 | ] 2169 | }, 2170 | { 2171 | "cell_type": "code", 2172 | "execution_count": 34, 2173 | "id": "70e0bccc", 2174 | "metadata": {}, 2175 | "outputs": [ 2176 | { 2177 | "name": "stdout", 2178 | "output_type": "stream", 2179 | "text": [ 2180 | "\n", 2181 | "RangeIndex: 10 entries, 0 to 9\n", 2182 | "Data columns (total 3 columns):\n", 2183 | " # Column Non-Null Count Dtype \n", 2184 | "--- ------ -------------- ----- \n", 2185 | " 0 career_path_id 10 non-null int64 \n", 2186 | " 1 career_path_name 10 non-null object\n", 2187 | " 2 hours_to_complete 10 non-null int64 \n", 2188 | "dtypes: int64(2), object(1)\n", 2189 | "memory usage: 368.0+ bytes\n" 2190 | ] 2191 | } 2192 | ], 2193 | "source": [ 2194 | "career_paths.info()" 2195 | ] 2196 | }, 2197 | { 2198 | "cell_type": "markdown", 2199 | "id": "e58f5c51", 2200 | "metadata": {}, 2201 | "source": [ 2202 | "This table is relatively small compared to the `students` table. Taking a look at the full table, and sanity checking with the `.info()` function shows that this table doesn't require much. However, since we added in a new `career_path_id` to account for students who are not taking a career path, we need to add this to the database. 
" 2203 | ] 2204 | }, 2205 | { 2206 | "cell_type": "code", 2207 | "execution_count": 35, 2208 | "id": "913aeda3", 2209 | "metadata": {}, 2210 | "outputs": [ 2211 | { 2212 | "data": { 2213 | "text/html": [ 2214 | "
" 2307 | ], 2308 | "text/plain": [ 2309 | " career_path_id career_path_name hours_to_complete\n", 2310 | "0 1 data scientist 20\n", 2311 | "1 2 data engineer 20\n", 2312 | "2 3 data analyst 12\n", 2313 | "3 4 software engineering 25\n", 2314 | "4 5 backend engineer 18\n", 2315 | "5 6 frontend engineer 20\n", 2316 | "6 7 iOS developer 27\n", 2317 | "7 8 android developer 27\n", 2318 | "8 9 machine learning engineer 35\n", 2319 | "9 10 ux/ui designer 15\n", 2320 | "10 0 not applicable 0" 2321 | ] 2322 | }, 2323 | "metadata": {}, 2324 | "output_type": "display_data" 2325 | } 2326 | ], 2327 | "source": [ 2328 | "not_applicable = {'career_path_id': 0,\n", 2329 | " 'career_path_name': 'not applicable',\n", 2330 | " 'hours_to_complete': 0}\n", 2331 | " \n", 2332 | "career_paths.loc[len(career_paths)] = not_applicable\n", 2333 | "display(career_paths)" 2334 | ] 2335 | }, 2336 | { 2337 | "cell_type": "markdown", 2338 | "id": "21b7fd5e", 2339 | "metadata": {}, 2340 | "source": [ 2341 | "# Working with `student_jobs` table" 2342 | ] 2343 | }, 2344 | { 2345 | "cell_type": "code", 2346 | "execution_count": 36, 2347 | "id": "fc002fa2", 2348 | "metadata": {}, 2349 | "outputs": [ 2350 | { 2351 | "data": { 2352 | "text/html": [ 2353 | "
" 2458 | ], 2459 | "text/plain": [ 2460 | " job_id job_category avg_salary\n", 2461 | "0 1 analytics 86000\n", 2462 | "1 2 engineer 101000\n", 2463 | "2 3 software developer 110000\n", 2464 | "3 4 creative 66000\n", 2465 | "4 5 financial services 135000\n", 2466 | "5 6 education 61000\n", 2467 | "6 7 HR 80000\n", 2468 | "7 8 student 10000\n", 2469 | "8 9 healthcare 120000\n", 2470 | "9 0 other 80000\n", 2471 | "10 3 software developer 110000\n", 2472 | "11 4 creative 66000\n", 2473 | "12 5 financial services 135000" 2474 | ] 2475 | }, 2476 | "metadata": {}, 2477 | "output_type": "display_data" 2478 | } 2479 | ], 2480 | "source": [ 2481 | "display(student_jobs)" 2482 | ] 2483 | }, 2484 | { 2485 | "cell_type": "code", 2486 | "execution_count": 37, 2487 | "id": "d625b756", 2488 | "metadata": {}, 2489 | "outputs": [ 2490 | { 2491 | "name": "stdout", 2492 | "output_type": "stream", 2493 | "text": [ 2494 | "\n", 2495 | "RangeIndex: 13 entries, 0 to 12\n", 2496 | "Data columns (total 3 columns):\n", 2497 | " # Column Non-Null Count Dtype \n", 2498 | "--- ------ -------------- ----- \n", 2499 | " 0 job_id 13 non-null int64 \n", 2500 | " 1 job_category 13 non-null object\n", 2501 | " 2 avg_salary 13 non-null int64 \n", 2502 | "dtypes: int64(2), object(1)\n", 2503 | "memory usage: 440.0+ bytes\n" 2504 | ] 2505 | } 2506 | ], 2507 | "source": [ 2508 | "student_jobs.info()" 2509 | ] 2510 | }, 2511 | { 2512 | "cell_type": "markdown", 2513 | "id": "f5451107", 2514 | "metadata": {}, 2515 | "source": [ 2516 | "Similar to the `cademycode_courses` table, the `cademycode_student_jobs` table is small compared to the students table. Taking a look at the full table, and sanity checking with the .info() function shows that there are duplicates in this table based off of the `job_id` column. This is the only issue that needs to be addressed based off the inspection." 
2517 | ] 2518 | }, 2519 | { 2520 | "cell_type": "code", 2521 | "execution_count": 38, 2522 | "id": "3acd5607", 2523 | "metadata": {}, 2524 | "outputs": [ 2525 | { 2526 | "data": { 2527 | "text/html": [ 2528 | "
" 2615 | ], 2616 | "text/plain": [ 2617 | " job_id job_category avg_salary\n", 2618 | "0 1 analytics 86000\n", 2619 | "1 2 engineer 101000\n", 2620 | "2 3 software developer 110000\n", 2621 | "3 4 creative 66000\n", 2622 | "4 5 financial services 135000\n", 2623 | "5 6 education 61000\n", 2624 | "6 7 HR 80000\n", 2625 | "7 8 student 10000\n", 2626 | "8 9 healthcare 120000\n", 2627 | "9 0 other 80000" 2628 | ] 2629 | }, 2630 | "metadata": {}, 2631 | "output_type": "display_data" 2632 | } 2633 | ], 2634 | "source": [ 2635 | "student_jobs.drop_duplicates(inplace=True)\n", 2636 | "display(student_jobs)" 2637 | ] 2638 | }, 2639 | { 2640 | "cell_type": "markdown", 2641 | "id": "66b8f70c", 2642 | "metadata": {}, 2643 | "source": [ 2644 | "# Joining tables\n", 2645 | "Before we begin joining - let's convert the ids to integers now that the null values have been resolved." 2646 | ] 2647 | }, 2648 | { 2649 | "cell_type": "code", 2650 | "execution_count": 39, 2651 | "id": "881513fc", 2652 | "metadata": {}, 2653 | "outputs": [], 2654 | "source": [ 2655 | "students['job_id'] = students['job_id'].astype(int)\n", 2656 | "students['current_career_path_id'] = students['current_career_path_id'].astype(int)" 2657 | ] 2658 | }, 2659 | { 2660 | "cell_type": "markdown", 2661 | "id": "24130b5f", 2662 | "metadata": {}, 2663 | "source": [ 2664 | "Now we can join them together:" 2665 | ] 2666 | }, 2667 | { 2668 | "cell_type": "code", 2669 | "execution_count": 40, 2670 | "id": "32b214bd", 2671 | "metadata": {}, 2672 | "outputs": [ 2673 | { 2674 | "data": { 2675 | "text/html": [ 2676 | "
" 2835 | ], 2836 | "text/plain": [ 2837 | " uuid name dob sex job_id num_course_taken \\\n", 2838 | "0 1 Annabelle Avery 1943-07-03 F 7 6.0 \n", 2839 | "1 2 Micah Rubio 1991-02-07 M 7 5.0 \n", 2840 | "2 3 Hosea Dale 1989-12-07 M 7 8.0 \n", 2841 | "3 4 Mariann Kirk 1988-07-31 F 6 7.0 \n", 2842 | "4 5 Lucio Alexander 1963-08-31 M 7 14.0 \n", 2843 | "\n", 2844 | " current_career_path_id time_spent_hrs age age_group \\\n", 2845 | "0 1 4.99 79.0 70 \n", 2846 | "1 8 4.40 31.0 30 \n", 2847 | "2 8 6.74 32.0 30 \n", 2848 | "3 9 12.31 33.0 30 \n", 2849 | "4 3 5.64 58.0 50 \n", 2850 | "\n", 2851 | " email street city \\\n", 2852 | "0 annabelle_avery9376@woohoo.com 303 N Timber Key Irondale \n", 2853 | "1 rubio6772@hmail.com 767 Crescent Fair Shoals \n", 2854 | "2 hosea_dale8084@coldmail.com P.O. Box 41269 St. Bonaventure \n", 2855 | "3 kirk4005@hmail.com 517 SE Wintergreen Isle Lane \n", 2856 | "4 alexander9810@hmail.com 18 Cinder Cliff Doyles borough \n", 2857 | "\n", 2858 | " state zip_code career_path_id career_path_name \\\n", 2859 | "0 Wisconsin 84736 1 data scientist \n", 2860 | "1 Indiana 37439 8 android developer \n", 2861 | "2 Virginia 83637 8 android developer \n", 2862 | "3 Arkansas 82242 9 machine learning engineer \n", 2863 | "4 Rhode Island 73737 3 data analyst \n", 2864 | "\n", 2865 | " hours_to_complete job_category avg_salary \n", 2866 | "0 20 HR 80000 \n", 2867 | "1 27 HR 80000 \n", 2868 | "2 27 HR 80000 \n", 2869 | "3 35 education 61000 \n", 2870 | "4 12 HR 80000 " 2871 | ] 2872 | }, 2873 | "execution_count": 40, 2874 | "metadata": {}, 2875 | "output_type": "execute_result" 2876 | } 2877 | ], 2878 | "source": [ 2879 | "final_df = students.merge(career_paths, left_on='current_career_path_id', right_on='career_path_id', how='left')\n", 2880 | "final_df = final_df.merge(student_jobs, on='job_id', how='left')\n", 2881 | "final_df.head()" 2882 | ] 2883 | }, 2884 | { 2885 | "cell_type": "code", 2886 | "execution_count": 41, 2887 | "id": "ab7a4366", 2888 | 
"metadata": {}, 2889 | "outputs": [ 2890 | { 2891 | "name": "stdout", 2892 | "output_type": "stream", 2893 | "text": [ 2894 | "\n", 2895 | "Int64Index: 4744 entries, 0 to 4743\n", 2896 | "Data columns (total 20 columns):\n", 2897 | " # Column Non-Null Count Dtype \n", 2898 | "--- ------ -------------- ----- \n", 2899 | " 0 uuid 4744 non-null int64 \n", 2900 | " 1 name 4744 non-null object \n", 2901 | " 2 dob 4744 non-null object \n", 2902 | " 3 sex 4744 non-null object \n", 2903 | " 4 job_id 4744 non-null int64 \n", 2904 | " 5 num_course_taken 4744 non-null float64\n", 2905 | " 6 current_career_path_id 4744 non-null int64 \n", 2906 | " 7 time_spent_hrs 4744 non-null float64\n", 2907 | " 8 age 4744 non-null float64\n", 2908 | " 9 age_group 4744 non-null int64 \n", 2909 | " 10 email 4744 non-null object \n", 2910 | " 11 street 4744 non-null object \n", 2911 | " 12 city 4744 non-null object \n", 2912 | " 13 state 4744 non-null object \n", 2913 | " 14 zip_code 4744 non-null object \n", 2914 | " 15 career_path_id 4744 non-null int64 \n", 2915 | " 16 career_path_name 4744 non-null object \n", 2916 | " 17 hours_to_complete 4744 non-null int64 \n", 2917 | " 18 job_category 4744 non-null object \n", 2918 | " 19 avg_salary 4744 non-null int64 \n", 2919 | "dtypes: float64(3), int64(7), object(10)\n", 2920 | "memory usage: 778.3+ KB\n" 2921 | ] 2922 | } 2923 | ], 2924 | "source": [ 2925 | "final_df.info()\n", 2926 | "con.close()" 2927 | ] 2928 | }, 2929 | { 2930 | "cell_type": "markdown", 2931 | "id": "33fb3dbd", 2932 | "metadata": {}, 2933 | "source": [ 2934 | "# Upsert cleansed and missing data tables to a new SQLite DB" 2935 | ] 2936 | }, 2937 | { 2938 | "cell_type": "code", 2939 | "execution_count": 42, 2940 | "id": "013e3edb", 2941 | "metadata": {}, 2942 | "outputs": [], 2943 | "source": [ 2944 | "sqlite_connection = sqlite3.connect('cademycode_cleansed.db')\n", 2945 | "final_df.to_sql('cademycode_aggregated', sqlite_connection, if_exists='replace', index=False)" 2946 | ] 
2947 | }, 2948 | { 2949 | "cell_type": "code", 2950 | "execution_count": 43, 2951 | "id": "a8323e20", 2952 | "metadata": {}, 2953 | "outputs": [], 2954 | "source": [ 2955 | "db_df = pd.read_sql_query(\"SELECT * FROM cademycode_aggregated\", sqlite_connection)" 2956 | ] 2957 | }, 2958 | { 2959 | "cell_type": "code", 2960 | "execution_count": 44, 2961 | "id": "0a15cee4", 2962 | "metadata": {}, 2963 | "outputs": [ 2964 | { 2965 | "name": "stdout", 2966 | "output_type": "stream", 2967 | "text": [ 2968 | "\n", 2969 | "RangeIndex: 4744 entries, 0 to 4743\n", 2970 | "Data columns (total 20 columns):\n", 2971 | " # Column Non-Null Count Dtype \n", 2972 | "--- ------ -------------- ----- \n", 2973 | " 0 uuid 4744 non-null int64 \n", 2974 | " 1 name 4744 non-null object \n", 2975 | " 2 dob 4744 non-null object \n", 2976 | " 3 sex 4744 non-null object \n", 2977 | " 4 job_id 4744 non-null int64 \n", 2978 | " 5 num_course_taken 4744 non-null float64\n", 2979 | " 6 current_career_path_id 4744 non-null int64 \n", 2980 | " 7 time_spent_hrs 4744 non-null float64\n", 2981 | " 8 age 4744 non-null float64\n", 2982 | " 9 age_group 4744 non-null int64 \n", 2983 | " 10 email 4744 non-null object \n", 2984 | " 11 street 4744 non-null object \n", 2985 | " 12 city 4744 non-null object \n", 2986 | " 13 state 4744 non-null object \n", 2987 | " 14 zip_code 4744 non-null object \n", 2988 | " 15 career_path_id 4744 non-null int64 \n", 2989 | " 16 career_path_name 4744 non-null object \n", 2990 | " 17 hours_to_complete 4744 non-null int64 \n", 2991 | " 18 job_category 4744 non-null object \n", 2992 | " 19 avg_salary 4744 non-null int64 \n", 2993 | "dtypes: float64(3), int64(7), object(10)\n", 2994 | "memory usage: 741.4+ KB\n" 2995 | ] 2996 | } 2997 | ], 2998 | "source": [ 2999 | "db_df.info()" 3000 | ] 3001 | }, 3002 | { 3003 | "cell_type": "code", 3004 | "execution_count": 45, 3005 | "id": "6d94e3ce", 3006 | "metadata": {}, 3007 | "outputs": [], 3008 | "source": [ 3009 | 
"missing_data.to_sql('incomplete_data', sqlite_connection, if_exists='replace', index=False)" 3010 | ] 3011 | }, 3012 | { 3013 | "cell_type": "code", 3014 | "execution_count": 46, 3015 | "id": "02114952", 3016 | "metadata": {}, 3017 | "outputs": [], 3018 | "source": [ 3019 | "missing_df = pd.read_sql_query(\"SELECT * FROM incomplete_data\", sqlite_connection)\n", 3020 | "sqlite_connection.close()" 3021 | ] 3022 | }, 3023 | { 3024 | "cell_type": "code", 3025 | "execution_count": 47, 3026 | "id": "443639a3", 3027 | "metadata": {}, 3028 | "outputs": [], 3029 | "source": [ 3030 | "db_df.to_csv('cademycode_cleansed.csv')" 3031 | ] 3032 | } 3033 | ], 3034 | "metadata": { 3035 | "kernelspec": { 3036 | "display_name": "Python 3 (ipykernel)", 3037 | "language": "python", 3038 | "name": "python3" 3039 | }, 3040 | "language_info": { 3041 | "codemirror_mode": { 3042 | "name": "ipython", 3043 | "version": 3 3044 | }, 3045 | "file_extension": ".py", 3046 | "mimetype": "text/x-python", 3047 | "name": "python", 3048 | "nbconvert_exporter": "python", 3049 | "pygments_lexer": "ipython3", 3050 | "version": "3.9.12" 3051 | } 3052 | }, 3053 | "nbformat": 4, 3054 | "nbformat_minor": 5 3055 | } 3056 | --------------------------------------------------------------------------------