├── Capstone Project ├── README.md ├── Spark - Full Dataset.ipynb └── Spark - Subset Analytics.ipynb ├── Data Scientist Nanodegree certificate.jpg ├── Project 1 - Finding Donars ├── README.md ├── __pycache__ │ └── visuals.cpython-36.pyc ├── census.csv ├── finding_donors.ipynb └── visuals.py ├── Project 2 - Image Classifier Application ├── Application │ ├── README.md │ ├── cat_to_name.json │ ├── model_spec.py │ ├── predict.py │ ├── train.py │ └── workspace_utils.py ├── Image Classifier Project.ipynb ├── Readme.md ├── assets │ ├── Flowers.png │ └── inference_example.png └── workspace_utils.py ├── Project 3 - Identify Customer Segementation ├── Identify_Customer_Segments.ipynb └── README.md ├── Project 4 - Data Science Blog ├── README.md ├── Understanding the Career of Data Scientists.ipynb └── data │ ├── 2011 Stack Overflow Survey Responses.csv │ ├── 2012 Stack Overflow Survey Responses.csv │ ├── 2013 Stack Overflow Survey Responses.csv │ └── 2014 Stack Overflow Survey Responses.csv ├── Project 5 - Disaster Response Pipeline ├── ETL Pipeline Preparation.ipynb ├── ML Pipeline Preparation.ipynb ├── README.md ├── app │ ├── app.py │ ├── templates │ │ ├── go.html │ │ └── master.html │ └── webapp screenshot.png ├── data │ ├── DisasterResponse.db │ ├── disaster_categories.csv │ ├── disaster_messages.csv │ └── process_data.py └── models │ ├── adaboost_model.pkl │ ├── tokenizer_function.py │ └── train_classifier.py ├── Project 6 - Reccomendation System ├── Recommendations_with_IBM.ipynb ├── __pycache__ │ └── project_tests.cpython-36.pyc ├── data │ ├── articles_community.csv │ └── user-item-interactions.csv ├── project_tests.py ├── top_10.p ├── top_20.p └── top_5.p └── README.md /Capstone Project/README.md: -------------------------------------------------------------------------------- 1 | # Capstone Project - Sparkify Music Service Analytics 2 | 3 | ### Motivation 4 | 5 | Like many music streaming services, Sparkify operates on event data logged at a resolution of seconds, so the dataset grows exponentially large within a short time and quickly outgrows the memory limit of most personal computers. To keep the analytics running as usual, we need to utilize the power of Spark to perform the analysis as a distributed task and obtain insights from the data. In particular, we would like to understand the factors that contribute to users' churning behavior. 6 | 7 | ### Data Descriptions 8 | 9 | The data provided is the user log of the service, containing demographic info, user activities, timestamps, etc. The logged user activities include 10 | 11 | * Add Friend 12 | * Add to Playlist 13 | * Cancel/Cancel Confirmation 14 | * Submit Upgrade/Upgrade 15 | * Submit Downgrade/Downgrade 16 | * Error 17 | * Help 18 | * Home 19 | * Logout 20 | * Nextsong 21 | * Roll Advert 22 | * Save Settings 23 | * Thumbs Up / Down 24 | 25 | ### Project Objective 26 | 27 | Using the user activity logs, we will attempt to predict the probability of a user churning using machine learning models. We will also attempt to understand the factors that contribute to the churning behavior. 28 | 29 | ### Required Packages 30 | 31 | * Pandas 32 | * pyspark 33 | * matplotlib 34 | * numpy 35 | 36 | ### Model Refinement 37 | 38 | The presented model is the best model I have constructed so far. Originally I only used the activities in *page* as features, which yielded a 0.69 F1 score on the small test set.
After I engineered 6 other features as noted in the project, I was able to obtain an F1 score of 0.80 (0.88 after scale up to the large dataset). 39 | 40 | ### Gallery 41 | 42 | 43 | [Project Notebook: Spark - Subset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Subset%20Analytics.ipynb) 44 | 45 | [Project Notebook: Spark - Full Dataset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Full%20Dataset.ipynb) 46 | 47 | [Blog Post: Understanding Customer Churning with Big Data Analytics](https://medium.com/@bowenchen/understanding-customer-churning-with-big-data-analytics-70ce4eb17669) 48 | 49 | -------------------------------------------------------------------------------- /Capstone Project/Spark - Full Dataset.ipynb: -------------------------------------------------------------------------------- 1 | {"cells": [{"metadata": {}, "cell_type": "markdown", "source": "# Sparkify - Full Analytics Script"}, {"metadata": {}, "cell_type": "markdown", "source": "Import Packages"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import udf, countDistinct, count, when, sum,col\nfrom pyspark.sql.types import IntegerType\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.evaluation import MulticlassClassificationEvaluator\nfrom pyspark.ml.regression import LinearRegression\nfrom pyspark.ml.tuning import CrossValidator, ParamGridBuilder\n\nfrom pyspark.ml.feature import OneHotEncoder, StringIndexer, MinMaxScaler, VectorAssembler\nfrom pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier\n\nimport warnings\n\nwarnings.filterwarnings('ignore')", "execution_count": 1, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "c704ada3c7ba4c2daae0900322485e35"}}, "metadata": {}}, {"output_type": "stream", "text": "Starting Spark application\n", "name": "stdout"}, {"output_type": "display_data", "data": {"text/plain": "", "text/html": "\n
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
6application_1547505874377_0007pysparkidleLinkLink\u2714
"}, "metadata": {}}, {"output_type": "stream", "text": "SparkSession available as 'spark'.\n", "name": "stdout"}]}, {"metadata": {}, "cell_type": "markdown", "source": "Load data from AWS"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Create spark session\nspark = (SparkSession \n .builder \n .appName(\"Sparkify\") \n .getOrCreate())\n\n# Read in full sparkify dataset\nevent_data = \"s3n://dsnd-sparkify/sparkify_event_data.json\"\nevents = spark.read.json(event_data)", "execution_count": 2, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "b9aa3a14f92f4421a372a1fcc802bc48"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "### Define Churn"}, {"metadata": {}, "cell_type": "markdown", "source": "We will define Churn as `Cancellation Confirmation` events. We could also add `Downgrade` events as Churn, but we could use `Downgrade` events as an additional feature to predict `Cancellation Confirmation` events (Churn). "}, {"metadata": {}, "cell_type": "markdown", "source": "Create a column named `Churn` as the label of whether the user has churned"}, {"metadata": {}, "cell_type": "markdown", "source": "# Feature Engineering\n\nBuild 7 features that are needed to construct the model "}, {"metadata": {}, "cell_type": "markdown", "source": "Remove several less useful columns to speed up the opreations\n* First Name\n* Last Name\n* auth\n* status\n* gender\n* ItemInSession\n* location\n* method\n* song\n* artist\n"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "events = events.drop('firstName', 'lastName', 'auth', 'gender', 'song','artist',\n 'status', 'method', 'location', 'registration', 'itemInSession')", "execution_count": 3, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "0e8f270a9cef42e1acdd14993e5281f1"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**1**. 
pivot the page column to obtain different activities for the user, then remove the less significant features"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "events_pivot = events.groupby([\"userId\"]).pivot(\"page\").count().fillna(0)\n\n# drop unecessary columns\nevents_pivot = events_pivot.drop('About', 'Cancel', 'Login', 'Submit Registration', 'Register', 'Save Settings')", "execution_count": 4, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "061d9d1cecb34c1c9c209fb46c463ff2"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**2.** Add average song played length"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# filter events log to contain only next song\nevents_songs = events.filter(events.page == 'NextSong')\n\n# Total songs length played\ntotal_length = events_songs.groupby(events_songs.userId).agg(sum('length'))\n\n# join events pivot\nevents_pivot = (events_pivot.join(total_length, on = 'userId', how = 'left')\n .withColumnRenamed(\"Cancellation Confirmation\", \"Churn\")\n .withColumnRenamed(\"sum(length)\", \"total_length\"))", "execution_count": 5, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "916f27479ae7497cb30f56337108772d"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**3.** Add days active"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "convert = 1000*60*60*24 # conversion factor to days\n\n# Find minimum/maximum time stamp of each user\nmin_timestmp = events.select([\"userId\", \"ts\"]).groupby(\"userId\").min(\"ts\")\nmax_timestmp = events.select([\"userId\", \"ts\"]).groupby(\"userId\").max(\"ts\")\n\n# Find days active of each user\ndaysActive = min_timestmp.join(max_timestmp, on=\"userId\")\ndaysActive = (daysActive.withColumn(\"days_active\", \n (col(\"max(ts)\")-col(\"min(ts)\")) / convert))\ndaysActive = daysActive.select([\"userId\", \"days_active\"])\n\n# join events pivot\nevents_pivot = events_pivot.join(daysActive, on = 'userId', how = 'left')", "execution_count": 6, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "b9e763560e3f4c57bde971ef0b748ed4"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**4.** Add number of sessions"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "numSessions = (events.select([\"userId\", \"sessionId\"])\n .distinct()\n .groupby(\"userId\")\n .count()\n .withColumnRenamed(\"count\", \"num_sessions\"))\n\n# join events pivot\nevents_pivot = events_pivot.join(numSessions, on = 'userId', how = 'left')", "execution_count": 7, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "7cc5ea2b76cc476b932f6a30e0be9fd1"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**5.** Add days as paid user"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Find minimum/maximum time stamp of each user as paid user\npaid_min_ts = events.filter(events.level == 'paid').groupby(\"userId\").min(\"ts\")\npaid_max_ts = events.filter(events.level == 
'paid').groupby(\"userId\").max(\"ts\")\n\n# Find days as paid user of each user\n\ndaysPaid = paid_min_ts.join(paid_max_ts, on=\"userId\")\ndaysPaid = (daysPaid.withColumn(\"days_paid\", \n (col(\"max(ts)\")-col(\"min(ts)\")) / convert))\ndaysPaid = daysPaid.select([\"userId\", \"days_paid\"])\n\n# join events pivot\nevents_pivot = events_pivot.join(daysPaid, on = 'userId', how='left')", "execution_count": 8, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "504d5b7b50d24ca48288b5f2d0c6c3e3"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**6.** Add days as a free user"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Find minimum/maximum time stamp of each user as paid user\nfree_min_ts = events.filter(events.level == 'free').groupby(\"userId\").min(\"ts\")\nfree_max_ts = events.filter(events.level == 'free').groupby(\"userId\").max(\"ts\")\n\n# Find days as paid user of each user\ndaysFree = free_min_ts.join(free_max_ts, on=\"userId\")\ndaysFree = (daysFree.withColumn(\"days_free\", \n (col(\"max(ts)\")-col(\"min(ts)\")) / convert))\ndaysFree = daysFree.select([\"userId\", \"days_free\"])\n\n# join events pivot\nevents_pivot = events_pivot.join(daysFree, on = 'userId', how='left')", "execution_count": 9, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "9ddaec6b994b419e9188a0a272a84dc3"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**7.** Add user access agent"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# find user access agents, and perform one-hot encoding on the user \nuserAgents = events.select(['userId', 'userAgent']).distinct()\nuserAgents = userAgents.fillna('Unknown')\n\n# build string indexer\nstringIndexer = StringIndexer(inputCol=\"userAgent\", outputCol=\"userAgentIndex\")\nmodel = stringIndexer.fit(userAgents)\nuserAgents = model.transform(userAgents)\n\n# one hot encode userAgent column\nencoder = OneHotEncoder(inputCol=\"userAgentIndex\", outputCol=\"userAgentVec\")\nuserAgents = encoder.transform(userAgents).select(['userId', 'userAgentVec'])\n\n# join events pivot\nevents_pivot = events_pivot.join(userAgents, on = 'userId', how ='left')", "execution_count": 10, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "df46d0b8ded943789a285f9c7fd9a9aa"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**8.** Fill all empty values as 0"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "events_pivot = events_pivot.fillna(0)", "execution_count": 11, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "4251a4b61dc7425db99b4c465e64678d"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "# Modeling\n\nSplit the full dataset into train, test, and validation sets. 
Test out three machine learning algorithms\n\n* Logistic Regression\n* Random Forest\n* Gradient Boosting\n\nGradient Boosting has the largest out-of-bag F1-score, we will proceed with this algorithm and build a pipeline around this algorithm."}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Split data into train and test set\nevents_pivot = events_pivot.withColumnRenamed('Churn', 'label')\ntraining, test = events_pivot.randomSplit([0.9, 0.1])", "execution_count": 12, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "01a68c16df37497598c6ff99687e006e"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "Build machine learning pipeline"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Create vector from feature data\nfeature_names = events_pivot.drop('label', 'userId').schema.names\nvec_asembler = VectorAssembler(inputCols = feature_names, outputCol = \"Features\")\n\n# Scale each column\nscalar = MinMaxScaler(inputCol=\"Features\", outputCol=\"ScaledFeatures\")\n\n# build classifier\ngbt = GBTClassifier(featuresCol=\"ScaledFeatures\", labelCol=\"label\")\n\n# Consturct pipeline\npipeline_gbt = Pipeline(stages=[vec_asembler, scalar, gbt])", "execution_count": 13, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "b38e9532f23d42d9a7cf9cd9830232a4"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "Fit gradient boosting model"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "gbt_model = pipeline_gbt.fit(training)", "execution_count": 14, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "beae16983a15424993810a8ef9d29b2e"}}, "metadata": {}}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "def modelEvaluations(model, metric, data):\n \"\"\" Evaluate a machine learning model's performance \n Input: \n model - pipeline object\n metric - the metric of the evaluations\n data - data being evaluated\n Output:\n [score, confusion matrix]\n \"\"\"\n # generate predictions\n evaluator = MulticlassClassificationEvaluator(metricName = metric)\n predictions = model.transform(data)\n \n # calcualte score\n score = evaluator.evaluate(predictions)\n confusion_matrix = (predictions.groupby(\"label\")\n .pivot(\"prediction\")\n .count())\n return [score, confusion_matrix]", "execution_count": 15, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "ea0e7ffbbaf8404da577625b27a59334"}}, "metadata": {}}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "f1_best, conf_mtx_best = modelEvaluations(gbt_model, 'f1', test)", "execution_count": 16, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "0491fc6e55c64f8ea5cd5f622ba48774"}}, "metadata": {}}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "print('The F1 score for the gradient boosting model:', f1_best)\nconf_mtx_best.show()", "execution_count": 17, "outputs": [{"output_type": "display_data", "data": 
{"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "3f98d9f271794849bbe9e3b45314bffc"}}, "metadata": {}}, {"output_type": "stream", "text": "('The F1 score for the gradient boosting model:', 0.8896163691822966)\n+-----+----+---+\n|label| 0.0|1.0|\n+-----+----+---+\n| 0|1612| 70|\n| 1| 163|344|\n+-----+----+---+", "name": "stdout"}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "", "execution_count": null, "outputs": []}], "metadata": {"kernelspec": {"name": "pysparkkernel", "display_name": "PySpark", "language": ""}, "language_info": {"name": "pyspark", "mimetype": "text/x-python", "codemirror_mode": {"name": "python", "version": 2}, "pygments_lexer": "python2"}}, "nbformat": 4, "nbformat_minor": 2} -------------------------------------------------------------------------------- /Data Scientist Nanodegree certificate.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Data Scientist Nanodegree certificate.jpg -------------------------------------------------------------------------------- /Project 1 - Finding Donars/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Supervised Learning 3 | 4 | ### Project: Finding Donors for CharityML 5 | 6 | (Cited From Udacity) 7 | 8 | ### Install 9 | 10 | This project requires **Python 3.x** and the following Python libraries installed: 11 | 12 | - [NumPy](http://www.numpy.org/) 13 | - [Pandas](http://pandas.pydata.org) 14 | - [matplotlib](http://matplotlib.org/) 15 | - [scikit-learn](http://scikit-learn.org/stable/) 16 | 17 | You will also need to have software installed to run and execute an [iPython Notebook](http://ipython.org/notebook.html) 18 | 19 | 20 | ### Code 21 | 22 | Template code is provided in the `finding_donors.ipynb` notebook file. You will also be required to use the included `visuals.py` Python file and the `census.csv` dataset file to complete your work. While some code has already been implemented to get you started, you will need to implement additional functionality when requested to successfully complete the project. Note that the code included in `visuals.py` is meant to be used out-of-the-box and not intended for students to manipulate. If you are interested in how the visualizations are created in the notebook, please feel free to explore this Python file. 23 | 24 | ### Run 25 | 26 | In a terminal or command window, navigate to the top-level project directory `finding_donors/` (that contains this README) and run one of the following commands: 27 | 28 | ```bash 29 | ipython notebook finding_donors.ipynb 30 | ``` 31 | or 32 | ```bash 33 | jupyter notebook finding_donors.ipynb 34 | ``` 35 | 36 | This will open the iPython Notebook software and project file in your browser. 37 | 38 | ### Data 39 | 40 | The modified census dataset consists of approximately 32,000 data points, with each datapoint having 13 features. This dataset is a modified version of the dataset published in the paper *"Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid",* by Ron Kohavi. You may find this paper [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf), with the original dataset hosted on [UCI](https://archive.ics.uci.edu/ml/datasets/Census+Income). 
41 | 42 | **Features** 43 | - `age`: Age 44 | - `workclass`: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked) 45 | - `education_level`: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool) 46 | - `education-num`: Number of educational years completed 47 | - `marital-status`: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse) 48 | - `occupation`: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces) 49 | - `relationship`: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried) 50 | - `race`: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black) 51 | - `sex`: Sex (Female, Male) 52 | - `capital-gain`: Monetary Capital Gains 53 | - `capital-loss`: Monetary Capital Losses 54 | - `hours-per-week`: Average Hours Per Week Worked 55 | - `native-country`: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands) 56 | 57 | **Target Variable** 58 | - `income`: Income Class (<=50K, >50K) 59 | -------------------------------------------------------------------------------- /Project 1 - Finding Donars/__pycache__/visuals.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 1 - Finding Donars/__pycache__/visuals.cpython-36.pyc -------------------------------------------------------------------------------- /Project 1 - Finding Donars/visuals.py: -------------------------------------------------------------------------------- 1 | ########################################### 2 | # Suppress matplotlib user warnings 3 | # Necessary for newer version of matplotlib 4 | import warnings 5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib") 6 | # 7 | # Display inline matplotlib plots with IPython 8 | from IPython import get_ipython 9 | get_ipython().run_line_magic('matplotlib', 'inline') 10 | ########################################### 11 | 12 | import matplotlib.pyplot as pl 13 | import matplotlib.patches as mpatches 14 | import numpy as np 15 | import pandas as pd 16 | from time import time 17 | from sklearn.metrics import f1_score, accuracy_score 18 | 19 | 20 | def distribution(data, transformed = False): 21 | """ 22 | Visualization code for displaying skewed distributions of features 23 | """ 24 | 25 | # Create figure 26 | fig = pl.figure(figsize = (11,5)); 27 | 28 | # Skewed feature plotting 29 | for i, feature in enumerate(['capital-gain','capital-loss']): 30 | ax = fig.add_subplot(1, 2, i+1) 31 | ax.hist(data[feature], bins = 25, color = '#00A0A0') 32 | ax.set_title("'%s' Feature Distribution"%(feature), fontsize = 14) 33 | ax.set_xlabel("Value") 34 | ax.set_ylabel("Number of 
Records") 35 | ax.set_ylim((0, 2000)) 36 | ax.set_yticks([0, 500, 1000, 1500, 2000]) 37 | ax.set_yticklabels([0, 500, 1000, 1500, ">2000"]) 38 | 39 | # Plot aesthetics 40 | if transformed: 41 | fig.suptitle("Log-transformed Distributions of Continuous Census Data Features", \ 42 | fontsize = 16, y = 1.03) 43 | else: 44 | fig.suptitle("Skewed Distributions of Continuous Census Data Features", \ 45 | fontsize = 16, y = 1.03) 46 | 47 | fig.tight_layout() 48 | fig.show() 49 | 50 | 51 | def evaluate(results, accuracy, f1): 52 | """ 53 | Visualization code to display results of various learners. 54 | 55 | inputs: 56 | - learners: a list of supervised learners 57 | - stats: a list of dictionaries of the statistic results from 'train_predict()' 58 | - accuracy: The score for the naive predictor 59 | - f1: The score for the naive predictor 60 | """ 61 | 62 | # Create figure 63 | fig, ax = pl.subplots(2, 3, figsize = (11,7)) 64 | 65 | # Constants 66 | bar_width = 0.3 67 | colors = ['#A00000','#00A0A0','#00A000'] 68 | 69 | # Super loop to plot four panels of data 70 | for k, learner in enumerate(results.keys()): 71 | for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']): 72 | for i in np.arange(3): 73 | 74 | # Creative plot code 75 | ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k]) 76 | ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45]) 77 | ax[j//3, j%3].set_xticklabels(["1%", "10%", "100%"]) 78 | ax[j//3, j%3].set_xlabel("Training Set Size") 79 | ax[j//3, j%3].set_xlim((-0.1, 3.0)) 80 | 81 | # Add unique y-labels 82 | ax[0, 0].set_ylabel("Time (in seconds)") 83 | ax[0, 1].set_ylabel("Accuracy Score") 84 | ax[0, 2].set_ylabel("F-score") 85 | ax[1, 0].set_ylabel("Time (in seconds)") 86 | ax[1, 1].set_ylabel("Accuracy Score") 87 | ax[1, 2].set_ylabel("F-score") 88 | 89 | # Add titles 90 | ax[0, 0].set_title("Model Training") 91 | ax[0, 1].set_title("Accuracy Score on Training Subset") 92 | ax[0, 2].set_title("F-score on Training Subset") 93 | ax[1, 0].set_title("Model Predicting") 94 | ax[1, 1].set_title("Accuracy Score on Testing Set") 95 | ax[1, 2].set_title("F-score on Testing Set") 96 | 97 | # Add horizontal lines for naive predictors 98 | ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 99 | ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 100 | ax[0, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 101 | ax[1, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 102 | 103 | # Set y-limits for score panels 104 | ax[0, 1].set_ylim((0, 1)) 105 | ax[0, 2].set_ylim((0, 1)) 106 | ax[1, 1].set_ylim((0, 1)) 107 | ax[1, 2].set_ylim((0, 1)) 108 | 109 | # Create patches for the legend 110 | patches = [] 111 | for i, learner in enumerate(results.keys()): 112 | patches.append(mpatches.Patch(color = colors[i], label = learner)) 113 | pl.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \ 114 | loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large') 115 | 116 | # Aesthetics 117 | pl.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.10) 118 | pl.tight_layout() 119 | pl.show() 120 | 121 | 122 | def feature_plot(importances, X_train, y_train): 123 | 124 | # Display the five most important features 125 | indices = np.argsort(importances)[::-1] 126 | columns = 
X_train.columns.values[indices[:5]] 127 | values = importances[indices][:5] 128 | 129 | # Create the plot 130 | fig = pl.figure(figsize = (9,5)) 131 | pl.title("Normalized Weights for First Five Most Predictive Features", fontsize = 16) 132 | pl.bar(np.arange(5), values, width = 0.6, align="center", color = '#00A000', \ 133 | label = "Feature Weight") 134 | pl.bar(np.arange(5) - 0.3, np.cumsum(values), width = 0.2, align = "center", color = '#00A0A0', \ 135 | label = "Cumulative Feature Weight") 136 | pl.xticks(np.arange(5), columns) 137 | pl.xlim((-0.5, 4.5)) 138 | pl.ylabel("Weight", fontsize = 12) 139 | pl.xlabel("Feature", fontsize = 12) 140 | 141 | pl.legend(loc = 'upper center') 142 | pl.tight_layout() 143 | pl.show() 144 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/README.md: -------------------------------------------------------------------------------- 1 | # Image Classification Application - Command Line 2 | 3 | ### Structure 4 | 5 | The whole interface contains 5 files and 1 folder: 6 | 7 | 8 | #### **train.py** - training interface 9 | 10 | Basic usage: python train.py data_directory 11 | 12 | Prints out training loss, validation loss, and validation accuracy as the network trains 13 | Options: 14 | 15 | * Set directory to save checkpoints: python train.py data_dir --save_dir save_directory 16 | * Choose architecture: python train.py data_dir --arch "vgg13" 17 | * Set hyperparameters: python train.py data_dir --learning_rate 0.01 --hidden_units 512 --epochs 20 18 | * Use GPU for training: python train.py data_dir --gpu 19 | 20 | 21 | #### **predict.py** - predicting interface 22 | 23 | 24 | Basic usage: python predict.py /path/to/image checkpoint. Options: 25 | 26 | * Return top K most likely classes: python predict.py input checkpoint --top_k 3 27 | * Use a mapping of categories to real names: python predict.py input checkpoint --category_names cat_to_name.json 28 | * Use GPU for inference: python predict.py input checkpoint --gpu 29 | 30 | 31 | #### **model_spec.py** - utility functions 32 | 33 | Provides all helper functions that **train.py** and **predict.py** use. The main train and predict functions are implemented here 34 | 35 | #### **cat_to_name.json** - categories name mapping 36 | 37 | Provides the encoding of 102 flower classes 38 | 39 | #### **workspace_utils.py** - active session 40 | 41 | Keeps the training session from being timed out 42 | 43 | #### **checkpoints** - saves different model checkpoints 44 | 45 | Directory where the trained model will be saved to.
The pretrained weights could not be uploaded due to file size limits 46 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/cat_to_name.json: -------------------------------------------------------------------------------- 1 | {"21": "fire lily", "3": "canterbury bells", "45": "bolero deep blue", "1": "pink primrose", "34": "mexican aster", "27": "prince of wales feathers", "7": "moon orchid", "16": "globe-flower", "25": "grape hyacinth", "26": "corn poppy", "79": "toad lily", "39": "siam tulip", "24": "red ginger", "67": "spring crocus", "35": "alpine sea holly", "32": "garden phlox", "10": "globe thistle", "6": "tiger lily", "93": "ball moss", "33": "love in the mist", "9": "monkshood", "102": "blackberry lily", "14": "spear thistle", "19": "balloon flower", "100": "blanket flower", "13": "king protea", "49": "oxeye daisy", "15": "yellow iris", "61": "cautleya spicata", "31": "carnation", "64": "silverbush", "68": "bearded iris", "63": "black-eyed susan", "69": "windflower", "62": "japanese anemone", "20": "giant white arum lily", "38": "great masterwort", "4": "sweet pea", "86": "tree mallow", "101": "trumpet creeper", "42": "daffodil", "22": "pincushion flower", "2": "hard-leaved pocket orchid", "54": "sunflower", "66": "osteospermum", "70": "tree poppy", "85": "desert-rose", "99": "bromelia", "87": "magnolia", "5": "english marigold", "92": "bee balm", "28": "stemless gentian", "97": "mallow", "57": "gaura", "40": "lenten rose", "47": "marigold", "59": "orange dahlia", "48": "buttercup", "55": "pelargonium", "36": "ruby-lipped cattleya", "91": "hippeastrum", "29": "artichoke", "71": "gazania", "90": "canna lily", "18": "peruvian lily", "98": "mexican petunia", "8": "bird of paradise", "30": "sweet william", "17": "purple coneflower", "52": "wild pansy", "84": "columbine", "12": "colt's foot", "11": "snapdragon", "96": "camellia", "23": "fritillary", "50": "common dandelion", "44": "poinsettia", "53": "primula", "72": "azalea", "65": "californian poppy", "80": "anthurium", "76": "morning glory", "37": "cape flower", "56": "bishop of llandaff", "60": "pink-yellow dahlia", "82": "clematis", "58": "geranium", "75": "thorn apple", "41": "barbeton daisy", "95": "bougainvillea", "43": "sword lily", "83": "hibiscus", "78": "lotus lotus", "88": "cyclamen", "94": "foxglove", "81": "frangipani", "74": "rose", "89": "watercress", "73": "water lily", "46": "wallflower", "77": "passion flower", "51": "petunia"} -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/model_spec.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from torch import nn 4 | from torch import optim 5 | import torch.nn.functional as F 6 | from torchvision import datasets, transforms, models 7 | from workspace_utils import active_session 8 | from PIL import Image 9 | import json 10 | import argparse 11 | from collections import OrderedDict # use dict, but we have to keep the order 12 | import matplotlib.pyplot as plt 13 | 14 | # ================= Train Model Functions ===================== 15 | def train_model(hyperparameters, data_dir, save_dir, device): 16 | """Train the neural network, called in main function utilize the following helper functions, 17 | """ 18 | 19 | model_init, resize_aspect = get_model(hyperparameters['architecture']) 20 | image_dataset = 
loadImageData(resize_aspect, data_dir) 21 | model_spec = buildNeuralNetwork(model_init, hyperparameters, data_dir, device) 22 | 23 | for e in range(hyperparameters['epochs']): 24 | model_spec['model'].train() 25 | running_loss = 0 # the loss for every batch 26 | 27 | for i, train_batch in enumerate(image_dataset['trainloader']): # minibatch training 28 | 29 | # send the inputs labels to the tensors that uses the specified devices 30 | inputs, labels = tuple(map(lambda x: x.to(device), train_batch)) 31 | model_spec['optimizer'].zero_grad() # clear out previous gradients, avoids accumulations 32 | 33 | # Forward and backward passes 34 | try: 35 | predictions,_ = model_spec['model'].forward(inputs) 36 | 37 | except: 38 | predictions = model_spec['model'].forward(inputs) 39 | 40 | loss = model_spec['criterion'](predictions, labels) 41 | loss.backward() 42 | model_spec['optimizer'].step() 43 | # calculate the total loss for 1 epoch of training 44 | running_loss += loss.item() 45 | 46 | # print the loss every .. batches 47 | if i % hyperparameters['print_every'] == 0: 48 | model_spec['model'].eval() # set to evaluation mode 49 | train_accuracy = evaluate_performance(model_spec['model'], 50 | image_dataset['trainloader'], 51 | model_spec['criterion']) # see evaluate function below 52 | 53 | validate_accuracy = evaluate_performance(model_spec['model'], 54 | image_dataset['validloader'], 55 | model_spec['criterion']) 56 | 57 | print("Epoch: {}/{}... :".format(e+1, hyperparameters['epochs']), 58 | "Loss: {:.4f},".format(running_loss/hyperparameters['print_every']), 59 | "Training Accuracy:{: .4f} %,".format(train_accuracy * 100), 60 | "Validation Accuracy:{: .4f} %".format(validate_accuracy * 100) 61 | ) 62 | running_loss = 0 63 | model_spec['model'].train() 64 | 65 | saveModel(image_dataset, model_spec['model'], model_spec['classifier'], save_dir) 66 | return model_spec['model'] 67 | 68 | def get_model(architecture): 69 | # set model architecture 70 | if architecture == 'inception_v3': 71 | model_init = models.inception_v3(pretrained=True) 72 | model_init.arch = 'inception_v3' 73 | resize_aspect = [320, 299] 74 | 75 | elif architecture == 'densenet161': 76 | model_init = models.densenet161(pretrained=True) 77 | model_init.arch = 'densenet161' 78 | resize_aspect = [256, 224] 79 | 80 | elif architecture == 'vgg19': 81 | model_init = models.vgg19(pretrained=True) 82 | model_init.arch = 'vgg19' 83 | resize_aspect = [256, 224] 84 | 85 | return model_init, resize_aspect 86 | 87 | def loadImageData(resize_aspect, data_dir): 88 | """Input: 89 | resize_aspect - depends on the architecture 90 | data_dir - directory of all image data""" 91 | train_dir = data_dir + '/train' 92 | valid_dir = data_dir + '/valid' 93 | test_dir = data_dir + '/test' 94 | # Define transforms for the training, validation, and testing sets, using data augumentations on training set, 95 | # Inception_v3 has input size 299x299 96 | 97 | train_transforms = transforms.Compose([transforms.RandomRotation(30), 98 | transforms.RandomResizedCrop(resize_aspect[1]), 99 | transforms.RandomHorizontalFlip(), 100 | transforms.ToTensor(), 101 | transforms.Normalize([0.485, 0.456, 0.406], 102 | [0.229, 0.224, 0.225])]) 103 | 104 | validation_transforms = transforms.Compose([transforms.Resize(resize_aspect[1]), 105 | transforms.CenterCrop(resize_aspect[1]), 106 | transforms.ToTensor(), 107 | transforms.Normalize([0.485, 0.456, 0.406], 108 | [0.229, 0.224, 0.225])]) 109 | 110 | 111 | test_transforms = 
transforms.Compose([transforms.Resize(resize_aspect[1]), 112 | transforms.CenterCrop(resize_aspect[1]), 113 | transforms.ToTensor(), 114 | transforms.Normalize([0.485, 0.456, 0.406], 115 | [0.229, 0.224, 0.225])]) 116 | 117 | # Load the datasets with ImageFolder 118 | train_data = datasets.ImageFolder(train_dir, transform=train_transforms) 119 | validation_data = datasets.ImageFolder(valid_dir, transform=validation_transforms) 120 | test_data = datasets.ImageFolder(test_dir, transform=test_transforms) 121 | 122 | # Using the image datasets and the trainforms, define the dataloaders 123 | trainloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True) 124 | validloader = torch.utils.data.DataLoader(validation_data, batch_size= 32) 125 | testloader = torch.utils.data.DataLoader(test_data, batch_size= 32) 126 | # label mapping 127 | with open('cat_to_name.json', 'r') as f: 128 | cat_to_name = json.load(f) 129 | 130 | image_dataset = {'train': train_data, 'test': test_data, 'validate': validation_data, 131 | 'trainloader':trainloader, 'validloader':validloader, 'testloader': testloader, 132 | 'mapping': cat_to_name} 133 | return image_dataset 134 | 135 | def buildNeuralNetwork(model, hyperparameters, data_dir, device = 'cuda'): 136 | """Builds the transfer learning network according to the given architecture 137 | """ 138 | # turns off gradient 139 | for param in model.parameters(): 140 | param.requires_grad = False 141 | 142 | # input units mapping: 143 | input_units = {'inception_v3': 2048, 'densenet161': 2208, 'vgg19': 25088} 144 | 145 | # rebuild last layer 146 | classifier = nn.Sequential(OrderedDict([ 147 | ('fc1', nn.Linear(input_units[model.arch], 148 | hyperparameters['hidden_units'])), 149 | ('relu1', nn.ReLU()), 150 | ('dropout1', nn.Dropout(hyperparameters['dropout_prob'])), 151 | ('fc2', nn.Linear(hyperparameters['hidden_units'], 152 | 102)), 153 | ('output', nn.LogSoftmax(dim=1)) 154 | ])) 155 | # Attach the feedforward neural network, adjust for nameing conventions 156 | # Define criteria and loss 157 | criterion = nn.NLLLoss() 158 | if model.arch == 'inception_v3': 159 | model.fc = classifier 160 | optimizer = optim.Adam(model.fc.parameters(), lr = hyperparameters['learning_rate']) 161 | 162 | else: 163 | model.classifier = classifier 164 | optimizer = optim.Adam(model.classifier.parameters(), lr = hyperparameters['learning_rate']) 165 | 166 | # Important: Send model to use gpu cuda 167 | model = model.to(device) 168 | model_spec = {'model': model, 'criterion': criterion, 169 | 'optimizer': optimizer, 'classifier':classifier} 170 | 171 | return model_spec 172 | 173 | def evaluate_performance(model, dataloader,criterion, device = 'cuda'): 174 | # Evaluate performance for all batches in an epoch 175 | performance = [evaluate_performance_batch(model, i, criterion) for i in iter(dataloader)] 176 | correct, total = list(map(sum, zip(*performance))) 177 | return correct/total 178 | 179 | def evaluate_performance_batch(model,batch, criterion, device = 'cuda'): 180 | """Evaluate performance for a single batch""" 181 | with torch.no_grad(): 182 | images, labels = tuple(map(lambda x: x.to(device), batch)) 183 | predictions = model.forward(images) 184 | _, predict = torch.max(predictions, 1) 185 | 186 | correct = (predict == labels).sum().item() 187 | total = len(labels) 188 | 189 | return correct, total 190 | 191 | def saveModel(image_dataset, model, classifier, save_dir): 192 | # Saves the pretrained model 193 | with active_session(): 194 | check_point_file = save_dir 
+ model.arch + '_checkpoint.pth' 195 | model.class_to_idx = image_dataset['train'].class_to_idx 196 | 197 | checkpoint_dict = { 198 | 'architecture': 'inception_v3', 199 | 'class_to_idx': model.class_to_idx, 200 | 'state_dict': model.state_dict(), 201 | 'classifier': classifier 202 | } 203 | torch.save(checkpoint_dict, check_point_file) 204 | print("Model saved") 205 | return None 206 | 207 | # ================= Predict Functions ===================== 208 | 209 | def predict(image_path, checkpoint_path, category_names, device, topk): 210 | ''' Predict the class (or classes) of an image using a trained deep learning model. 211 | ''' 212 | # Implement the code to predict the class from an image file 213 | model, resize_aspect = load_model_checkpoint(checkpoint_path) 214 | model.eval() 215 | image = process_image(image_path, resize_aspect) 216 | with open(category_names, 'r') as f: 217 | cat_to_name = json.load(f) 218 | 219 | # use forward propagation to obtain the class probabilities 220 | image = torch.tensor(image, dtype= torch.float).unsqueeze(0).to(device) 221 | predict_prob_tensor = torch.exp(model.forward(image)) # convert log probabilities to real probabilities 222 | predict_prob = predict_prob_tensor.cpu().detach().numpy()[0] # change into numpy array 223 | 224 | # Find the correspoinding top k classes 225 | top_k_idx = predict_prob.argsort()[-topk:][::-1] 226 | probs = predict_prob[top_k_idx] 227 | classes = np.array(list(range(1, 102)))[top_k_idx] 228 | visualize_pred(image, model, probs, classes, cat_to_name, topk) 229 | 230 | return probs, classes 231 | 232 | def load_model_checkpoint(path): 233 | """Load model checkpoint given path""" 234 | checkpoint = torch.load(path, map_location={'cuda:0': 'cpu'}) 235 | model, resize_aspect = get_model(checkpoint['architecture']) 236 | if model.arch == 'inception_v3': 237 | model.fc = checkpoint['classifier'] 238 | else: 239 | model.classifier = checkpoint['classifier'] 240 | 241 | model.load_state_dict(checkpoint['state_dict']) 242 | model.class_to_idx = checkpoint['class_to_idx'] 243 | return model, resize_aspect 244 | 245 | def process_image(image, resize_aspect): 246 | ''' Scales, crops, and normalizes a PIL image for a PyTorch model, 247 | returns an Numpy array 248 | ''' 249 | # Process a PIL image for use in a PyTorch model 250 | im = Image.open(image) 251 | 252 | # resize image to 320 on the shortest side 253 | size = (resize_aspect[0], resize_aspect[0]) 254 | im.thumbnail(size) 255 | 256 | # crop out 299 portion in the center 257 | width, height = im.size 258 | left = (width - resize_aspect[1])/2 259 | top = (height - resize_aspect[1])/2 260 | right = (width + resize_aspect[1])/2 261 | bottom = (height + resize_aspect[1])/2 262 | im = im.crop((left, top, right, bottom)) 263 | 264 | # normalize image 265 | np_image = np.array(im) 266 | im_mean = np.array([0.485, 0.456, 0.406]) 267 | im_sd = np.array([0.229, 0.224, 0.225]) 268 | np_image = (np_image/255 - im_mean)/im_sd 269 | 270 | # transpose the image 271 | np_image = np_image.T 272 | return np_image 273 | 274 | def imshow2(image, ax=None, title=None): 275 | """Returns the original image after preprocessing""" 276 | if ax is None: 277 | fig, ax = plt.subplots() 278 | 279 | # PyTorch tensors assume the color channel is the first dimension 280 | # but matplotlib assumes is the third dimension 281 | image = image.transpose((1, 2, 0)) 282 | 283 | # Undo preprocessing 284 | mean = np.array([0.485, 0.456, 0.406]) 285 | std = np.array([0.229, 0.224, 0.225]) 286 | image = std * image + mean 
287 | 288 | # Image needs to be clipped between 0 and 1 or it looks like noise when displayed 289 | image = np.clip(image, 0, 1) 290 | 291 | #plt.suptitle(title) 292 | ax.imshow(image) 293 | 294 | return ax 295 | 296 | # Display an image along with the top 5 classes 297 | def visualize_pred(image, model, probs, classes, cat_to_name, topk): 298 | """ Visualize the top k probabilities an image is predicted as""" 299 | im = process_image(image) 300 | flower_names = [cat_to_name[str(x)] for x in classes] 301 | 302 | # Build subplots above 303 | fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,10)) 304 | # set axis settings top 305 | imshow2(im, ax =ax1) 306 | ax1.set_title(cat_to_name[image.split('/')[2]]) 307 | # set axis settings bottom 308 | ax2.barh(np.arange(1, topk + 1), probs) 309 | ax2.set_yticks(np.arange(1, topk + 1)) 310 | ax2.set_yticklabels(flower_names) 311 | ax2.set_aspect(0.187) 312 | ax2.set_xlim(0,1) 313 | return None 314 | 315 | #=================== get input args train / predict ====================== 316 | def get_input_args_train(): 317 | parser = argparse.ArgumentParser() 318 | parser.add_argument('data_directory', type=str, default = None, 319 | help="data directory") 320 | parser.add_argument('--save_dir', type=str, default='checkpoints/', 321 | help='save checkpoints to directory') 322 | parser.add_argument('--arch', type=str, default='inception_v3', 323 | help='model architecture') 324 | parser.add_argument('--learning_rate', type=float, default=0.001, 325 | help='learning rate, default 0.001') 326 | parser.add_argument('--hidden_units', type=int, default=500, 327 | help='hidden units, default 500') 328 | parser.add_argument('--print_every', type=int, default=20, 329 | help='print every iterations') 330 | parser.add_argument('--dropout_prob', type=int, default=0.1, 331 | help='print every iterations') 332 | parser.add_argument('--epochs', type=int, default=15, 333 | help='epochs, default 15') 334 | parser.add_argument('--gpu', action='store_true', 335 | default= 'cuda', help='to cuda gpu') 336 | 337 | return parser.parse_args() 338 | 339 | def get_input_args_predict(): 340 | parser = argparse.ArgumentParser() 341 | 342 | parser.add_argument('path_to_image', type=str, default=None, 343 | help='image file to predict') 344 | 345 | parser.add_argument('checkpoint', type=str, default='checkpoints/inception_v3_checkpoint.pth', 346 | help='path to checkpoint') 347 | 348 | parser.add_argument('--topk', type=int, default=5, 349 | help='return top k most likely classes the image belongs to') 350 | parser.add_argument('--category_names', type=str, default='cat_to_name.json', 351 | help='class names mapping') 352 | parser.add_argument('--gpu', default='cuda', 353 | help='use cuda') 354 | 355 | return parser.parse_args() 356 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/predict.py: -------------------------------------------------------------------------------- 1 | from model_spec import * 2 | import argparse 3 | 4 | def main(): 5 | args = get_input_args_predict() 6 | predict(args.path_to_image, args.checkpoint, args.category_names, args. 
gpu, args.topk) 7 | 8 | if __name__ == '__main__': 9 | main() -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/train.py: -------------------------------------------------------------------------------- 1 | from model_spec import * 2 | import argparse 3 | 4 | def main(): 5 | args = get_input_args_train() 6 | # get hyperparameters 7 | hyperparameters = {'architecture': args.arch, 'epochs': args.epochs, 'print_every': args.print_every, 8 | 'hidden_units' : args.hidden_units, 'learning_rate': args.learning_rate, 9 | 'dropout_prob': args.dropout_prob} 10 | 11 | train_model(hyperparameters, data_dir= args.data_directory, save_dir = args.save_dir, device = args.gpu) 12 | 13 | if __name__ == '__main__': 14 | main() 15 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/workspace_utils.py: -------------------------------------------------------------------------------- 1 | import signal 2 | 3 | from contextlib import contextmanager 4 | 5 | import requests 6 | 7 | 8 | DELAY = INTERVAL = 4 * 60 # interval time in seconds 9 | MIN_DELAY = MIN_INTERVAL = 2 * 60 10 | KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive" 11 | TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token" 12 | TOKEN_HEADERS = {"Metadata-Flavor":"Google"} 13 | 14 | 15 | def _request_handler(headers): 16 | def _handler(signum, frame): 17 | requests.request("POST", KEEPALIVE_URL, headers=headers) 18 | return _handler 19 | 20 | 21 | @contextmanager 22 | def active_session(delay=DELAY, interval=INTERVAL): 23 | """ 24 | Example: 25 | from workspace_utils import active_session 26 | with active_session(): 27 | # do long-running work here 28 | """ 29 | token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text 30 | headers = {'Authorization': "STAR " + token} 31 | delay = max(delay, MIN_DELAY) 32 | interval = max(interval, MIN_INTERVAL) 33 | original_handler = signal.getsignal(signal.SIGALRM) 34 | try: 35 | signal.signal(signal.SIGALRM, _request_handler(headers)) 36 | signal.setitimer(signal.ITIMER_REAL, delay, interval) 37 | yield 38 | finally: 39 | signal.signal(signal.SIGALRM, original_handler) 40 | signal.setitimer(signal.ITIMER_REAL, 0) 41 | 42 | 43 | def keep_awake(iterable, delay=DELAY, interval=INTERVAL): 44 | """ 45 | Example: 46 | from workspace_utils import keep_awake 47 | for i in keep_awake(range(5)): 48 | # do iteration with lots of work here 49 | """ 50 | with active_session(delay, interval): yield from iterable -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Readme.md: -------------------------------------------------------------------------------- 1 | # Deep Learning 2 | 3 | ### Project: Image Classifier Project 4 | 5 | 6 | ### Data 7 | 8 | The data for this project is quite large - in fact, it is so large you cannot upload it onto Github. You will be training using 102 different types of flowers, where there are ~20 images per flower to train on. Then you will use your trained classifier to see if you can predict the type for new images of the flowers (Quoted from Udacity). 9 | 10 | ### Jupyter Notebook 11 | 12 | This notebook implements the Inception model in Jupyter notebook format. Most of the functions are static.
To view the notebook, go to the following link 13 | 14 | [Project Notebook: Image Classifier](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%202%20-%20Image%20Classifier%20Application/Image%20Classifier%20Project.ipynb?flush_cache=true) 15 | 16 | ### Application 17 | 18 | The notebook is then converted into a command line application 19 | 20 | Specifications 21 | 22 | The first file, train.py, will train a new network on a dataset and save the model as a checkpoint. The second file, predict.py, uses a trained network to predict the class for an input image. 23 | 24 | Train a new network on a data set with train.py 25 | 26 | Basic usage: python train.py data_directory 27 | * Prints out training loss, validation loss, and validation accuracy as the network trains 28 | 29 | Options: 30 | * Set directory to save checkpoints: python train.py data_dir --save_dir save_directory 31 | * Choose architecture: python train.py data_dir --arch "vgg13" 32 | * Set hyperparameters: python train.py data_dir --learning_rate 0.01 --hidden_units 512 --epochs 20 33 | * Use GPU for training: python train.py data_dir --gpu 34 | 35 | Predict flower name from an image with predict.py along with the probability of that name. That is, you'll pass in a single image * /path/to/image and return the flower name and class probability. 36 | 37 | Basic usage: python predict.py /path/to/image checkpoint 38 | Options: 39 | * Return top K most likely classes: python predict.py input checkpoint --top_k 3 40 | * Use a mapping of categories to real names: python predict.py input checkpoint --category_names cat_to_name.json 41 | * Use GPU for inference: python predict.py input checkpoint --gpu 42 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/assets/Flowers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 2 - Image Classifier Application/assets/Flowers.png -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/assets/inference_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 2 - Image Classifier Application/assets/inference_example.png -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/workspace_utils.py: -------------------------------------------------------------------------------- 1 | import signal 2 | 3 | from contextlib import contextmanager 4 | 5 | import requests 6 | 7 | 8 | DELAY = INTERVAL = 4 * 60 # interval time in seconds 9 | MIN_DELAY = MIN_INTERVAL = 2 * 60 10 | KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive" 11 | TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token" 12 | TOKEN_HEADERS = {"Metadata-Flavor":"Google"} 13 | 14 | 15 | def _request_handler(headers): 16 | def _handler(signum, frame): 17 | requests.request("POST", KEEPALIVE_URL, headers=headers) 18 | return _handler 19 | 20 | 21 | @contextmanager 22 | def active_session(delay=DELAY, interval=INTERVAL): 23 | """ 24 | Example: 25 | 26 | from workspace_utils import active session 27 | 28 | with active_session(): 29 | 
# do long-running work here 30 | """ 31 | token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text 32 | headers = {'Authorization': "STAR " + token} 33 | delay = max(delay, MIN_DELAY) 34 | interval = max(interval, MIN_INTERVAL) 35 | original_handler = signal.getsignal(signal.SIGALRM) 36 | try: 37 | signal.signal(signal.SIGALRM, _request_handler(headers)) 38 | signal.setitimer(signal.ITIMER_REAL, delay, interval) 39 | yield 40 | finally: 41 | signal.signal(signal.SIGALRM, original_handler) 42 | signal.setitimer(signal.ITIMER_REAL, 0) 43 | 44 | 45 | def keep_awake(iterable, delay=DELAY, interval=INTERVAL): 46 | """ 47 | Example: 48 | 49 | from workspace_utils import keep_awake 50 | 51 | for i in keep_awake(range(5)): 52 | # do iteration with lots of work here 53 | """ 54 | with active_session(delay, interval): yield from iterable -------------------------------------------------------------------------------- /Project 3 - Identify Customer Segementation/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Identifying Customer Segments 3 | 4 | **Description** 5 | 6 | In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data used in this project was provided by Udacity's partners at Bertelsmann Arvato Analytics, and represents a real-life data science task. 7 | -------------------------------------------------------------------------------- /Project 4 - Data Science Blog/README.md: -------------------------------------------------------------------------------- 1 | # Understanding Data Scientist Careers in a Data Science Way 2 | 3 | **Project Motivation** 4 | 5 | For this project, I was interested in using the 2011 - 2018 Stack Overflow developer survey data to better understand the career of data scientists. In particular, the questions I am interested in are 6 | 7 | * Does a data science role give you a happier career, or a healthier lifestyle? 8 | * Do data scientists get paid higher salaries for working hard? 9 | * What skills are required to become a data scientist? 10 | 11 | **Installation** 12 | 13 | No extra libraries besides the built-in libraries from Anaconda are needed to run this project 14 | 15 | * numpy 16 | * pandas 17 | * seaborn 18 | * glob 19 | * os 20 | 21 | **File Descriptions** 22 | 23 | * data: Folder containing the Stack Overflow developer survey data files, following the naming convention "YYYY Stack Overflow Survey Responses.csv" 24 | * Understanding the Career of Data Scientists.ipynb: The Jupyter Notebook used for the main analysis 25 | 26 | **Results** 27 | 28 | The main takeaways from this analysis are 29 | 30 | * The best practice for doing data science is to have a question before you collect the data 31 | * Data scientists work hard, but they are satisfied with their careers 32 | * Versatile programming skills and strong communication skills are needed for data scientists 33 | 34 | Specific findings can be found in the Jupyter Notebook and blog post below.
35 | 36 | * [Project Notebook: Understanding the Career of Data Scientists](http://nbviewer.jupyter.org/github/chenbowen184/Data_Scientist_Nanodegree/blob/master/Project%204%20-%20Data%20Science%20Blog/Understanding%20the%20Career%20of%20Data%20Scientists.ipynb) 37 | * [Blog Post: Understanding the Career of Data Scientists Using the Data Science Way](https://medium.com/@bowenchen/understanding-the-career-of-data-scientists-in-a-data-science-way-9bd63817221e) 38 | 39 | **Licensing Acknowledgements** 40 | 41 | Thanks to @StackOverflow for sharing their developer survey data for multiple years 42 | -------------------------------------------------------------------------------- /Project 4 - Data Science Blog/data/2011 Stack Overflow Survey Responses.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 4 - Data Science Blog/data/2011 Stack Overflow Survey Responses.csv -------------------------------------------------------------------------------- /Project 4 - Data Science Blog/data/2012 Stack Overflow Survey Responses.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 4 - Data Science Blog/data/2012 Stack Overflow Survey Responses.csv -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/ETL Pipeline Preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# ETL Pipeline Preparation\n", 8 | "Follow the instructions below to help you create your ETL pipeline.\n", 9 | "### 1. Import libraries and load datasets.\n", 10 | "- Import Python libraries\n", 11 | "- Load `messages.csv` into a dataframe and inspect the first few lines.\n", 12 | "- Load `categories.csv` into a dataframe and inspect the first few lines." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "# import libraries\n", 24 | "import pandas as pd\n", 25 | "import numpy as np\n", 26 | "from sqlalchemy import create_engine" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "data": { 36 | "text/html": [ 37 | "
\n", 38 | "\n", 51 | "\n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | "
idmessageoriginalgenre
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...direct
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirect
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...direct
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...direct
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...direct
\n", 99 | "
" 100 | ], 101 | "text/plain": [ 102 | " id message \\\n", 103 | "0 2 Weather update - a cold front from Cuba that c... \n", 104 | "1 7 Is the Hurricane over or is it not over \n", 105 | "2 8 Looking for someone but no name \n", 106 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 107 | "4 12 says: west side of Haiti, rest of the country ... \n", 108 | "\n", 109 | " original genre \n", 110 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct \n", 111 | "1 Cyclone nan fini osinon li pa fini direct \n", 112 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct \n", 113 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct \n", 114 | "4 facade ouest d Haiti et le reste du pays aujou... direct " 115 | ] 116 | }, 117 | "execution_count": 2, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "# load messages dataset\n", 124 | "messages = pd.read_csv('data/disaster_messages.csv')\n", 125 | "messages.head()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 3, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/html": [ 136 | "
\n", 137 | "\n", 150 | "\n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | "
idcategories
02related-1;request-0;offer-0;aid_related-0;medi...
17related-1;request-0;offer-0;aid_related-1;medi...
28related-1;request-0;offer-0;aid_related-0;medi...
39related-1;request-1;offer-0;aid_related-1;medi...
412related-1;request-0;offer-0;aid_related-0;medi...
\n", 186 | "
" 187 | ], 188 | "text/plain": [ 189 | " id categories\n", 190 | "0 2 related-1;request-0;offer-0;aid_related-0;medi...\n", 191 | "1 7 related-1;request-0;offer-0;aid_related-1;medi...\n", 192 | "2 8 related-1;request-0;offer-0;aid_related-0;medi...\n", 193 | "3 9 related-1;request-1;offer-0;aid_related-1;medi...\n", 194 | "4 12 related-1;request-0;offer-0;aid_related-0;medi..." 195 | ] 196 | }, 197 | "execution_count": 3, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "# load categories dataset\n", 204 | "categories = pd.read_csv('data/disaster_categories.csv')\n", 205 | "categories.head()" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### 2. Merge datasets.\n", 213 | "- Merge the messages and categories datasets using the common id\n", 214 | "- Assign this combined dataset to `df`, which will be cleaned in the following steps" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 4, 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/html": [ 225 | "
\n", 226 | "\n", 239 | "\n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | "
idmessageoriginalgenrecategories
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...directrelated-1;request-0;offer-0;aid_related-0;medi...
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirectrelated-1;request-0;offer-0;aid_related-1;medi...
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...directrelated-1;request-0;offer-0;aid_related-0;medi...
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...directrelated-1;request-1;offer-0;aid_related-1;medi...
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...directrelated-1;request-0;offer-0;aid_related-0;medi...
\n", 293 | "
" 294 | ], 295 | "text/plain": [ 296 | " id message \\\n", 297 | "0 2 Weather update - a cold front from Cuba that c... \n", 298 | "1 7 Is the Hurricane over or is it not over \n", 299 | "2 8 Looking for someone but no name \n", 300 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 301 | "4 12 says: west side of Haiti, rest of the country ... \n", 302 | "\n", 303 | " original genre \\\n", 304 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct \n", 305 | "1 Cyclone nan fini osinon li pa fini direct \n", 306 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct \n", 307 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct \n", 308 | "4 facade ouest d Haiti et le reste du pays aujou... direct \n", 309 | "\n", 310 | " categories \n", 311 | "0 related-1;request-0;offer-0;aid_related-0;medi... \n", 312 | "1 related-1;request-0;offer-0;aid_related-1;medi... \n", 313 | "2 related-1;request-0;offer-0;aid_related-0;medi... \n", 314 | "3 related-1;request-1;offer-0;aid_related-1;medi... \n", 315 | "4 related-1;request-0;offer-0;aid_related-0;medi... " 316 | ] 317 | }, 318 | "execution_count": 4, 319 | "metadata": {}, 320 | "output_type": "execute_result" 321 | } 322 | ], 323 | "source": [ 324 | "# merge datasets\n", 325 | "df = pd.merge(messages, categories, on = 'id')\n", 326 | "df.head()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "### 3. Split `categories` into separate category columns.\n", 334 | "- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.\n", 335 | "- Use the first row of categories dataframe to create column names for the categories data.\n", 336 | "- Rename columns of `categories` with new column names." 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 5, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/html": [ 347 | "
\n", 348 | "\n", 361 | "\n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | "
0123456789...26272829303132333435
0related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
1related-1request-0offer-0aid_related-1medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-1floods-0storm-1fire-0earthquake-0cold-0other_weather-0direct_report-0
2related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
3related-1request-1offer-0aid_related-1medical_help-0medical_products-1search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
4related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
\n", 511 | "

5 rows × 36 columns

\n", 512 | "
" 513 | ], 514 | "text/plain": [ 515 | " 0 1 2 3 4 \\\n", 516 | "0 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 517 | "1 related-1 request-0 offer-0 aid_related-1 medical_help-0 \n", 518 | "2 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 519 | "3 related-1 request-1 offer-0 aid_related-1 medical_help-0 \n", 520 | "4 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 521 | "\n", 522 | " 5 6 7 8 \\\n", 523 | "0 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 524 | "1 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 525 | "2 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 526 | "3 medical_products-1 search_and_rescue-0 security-0 military-0 \n", 527 | "4 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 528 | "\n", 529 | " 9 ... 26 27 \\\n", 530 | "0 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 531 | "1 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 532 | "2 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 533 | "3 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 534 | "4 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 535 | "\n", 536 | " 28 29 30 31 32 33 \\\n", 537 | "0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 538 | "1 weather_related-1 floods-0 storm-1 fire-0 earthquake-0 cold-0 \n", 539 | "2 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 540 | "3 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 541 | "4 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 542 | "\n", 543 | " 34 35 \n", 544 | "0 other_weather-0 direct_report-0 \n", 545 | "1 other_weather-0 direct_report-0 \n", 546 | "2 other_weather-0 direct_report-0 \n", 547 | "3 other_weather-0 direct_report-0 \n", 548 | "4 other_weather-0 direct_report-0 \n", 549 | "\n", 550 | "[5 rows x 36 columns]" 551 | ] 552 | }, 553 | "execution_count": 5, 554 | "metadata": {}, 555 | "output_type": "execute_result" 556 | } 557 | ], 558 | "source": [ 559 | "# create a dataframe of the 36 individual category columns\n", 560 | "categories = df['categories'].str.split(';', expand=True)\n", 561 | "categories.head()" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 6, 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "name": "stdout", 571 | "output_type": "stream", 572 | "text": [ 573 | "0 related\n", 574 | "1 request\n", 575 | "2 offer\n", 576 | "3 aid_related\n", 577 | "4 medical_help\n", 578 | "5 medical_products\n", 579 | "6 search_and_rescue\n", 580 | "7 security\n", 581 | "8 military\n", 582 | "9 child_alone\n", 583 | "10 water\n", 584 | "11 food\n", 585 | "12 shelter\n", 586 | "13 clothing\n", 587 | "14 money\n", 588 | "15 missing_people\n", 589 | "16 refugees\n", 590 | "17 death\n", 591 | "18 other_aid\n", 592 | "19 infrastructure_related\n", 593 | "20 transport\n", 594 | "21 buildings\n", 595 | "22 electricity\n", 596 | "23 tools\n", 597 | "24 hospitals\n", 598 | "25 shops\n", 599 | "26 aid_centers\n", 600 | "27 other_infrastructure\n", 601 | "28 weather_related\n", 602 | "29 floods\n", 603 | "30 storm\n", 604 | "31 fire\n", 605 | "32 earthquake\n", 606 | "33 cold\n", 607 | "34 other_weather\n", 608 | "35 direct_report\n", 609 | "Name: 1, dtype: object\n" 610 | ] 611 | } 612 | ], 613 | "source": [ 614 | "# select the first row of the categories dataframe\n", 615 | "row = categories.iloc[1]\n", 616 | "\n", 617 | "# use this row to extract a list of new column names 
for categories.\n", 618 | "# one way is to apply a lambda function that takes everything \n", 619 | "# up to the second to last character of each string with slicing\n", 620 | "category_colnames = row.apply(lambda x: x.split('-')[0])\n", 621 | "print(category_colnames)" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 7, 627 | "metadata": {}, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/html": [ 632 | "
\n", 633 | "\n", 646 | "\n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | "
1relatedrequestofferaid_relatedmedical_helpmedical_productssearch_and_rescuesecuritymilitarychild_alone...aid_centersother_infrastructureweather_relatedfloodsstormfireearthquakecoldother_weatherdirect_report
0related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
1related-1request-0offer-0aid_related-1medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-1floods-0storm-1fire-0earthquake-0cold-0other_weather-0direct_report-0
2related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
3related-1request-1offer-0aid_related-1medical_help-0medical_products-1search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
4related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
\n", 796 | "

5 rows × 36 columns

\n", 797 | "
" 798 | ], 799 | "text/plain": [ 800 | "1 related request offer aid_related medical_help \\\n", 801 | "0 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 802 | "1 related-1 request-0 offer-0 aid_related-1 medical_help-0 \n", 803 | "2 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 804 | "3 related-1 request-1 offer-0 aid_related-1 medical_help-0 \n", 805 | "4 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 806 | "\n", 807 | "1 medical_products search_and_rescue security military \\\n", 808 | "0 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 809 | "1 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 810 | "2 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 811 | "3 medical_products-1 search_and_rescue-0 security-0 military-0 \n", 812 | "4 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 813 | "\n", 814 | "1 child_alone ... aid_centers other_infrastructure \\\n", 815 | "0 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 816 | "1 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 817 | "2 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 818 | "3 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 819 | "4 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 820 | "\n", 821 | "1 weather_related floods storm fire earthquake cold \\\n", 822 | "0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 823 | "1 weather_related-1 floods-0 storm-1 fire-0 earthquake-0 cold-0 \n", 824 | "2 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 825 | "3 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 826 | "4 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 827 | "\n", 828 | "1 other_weather direct_report \n", 829 | "0 other_weather-0 direct_report-0 \n", 830 | "1 other_weather-0 direct_report-0 \n", 831 | "2 other_weather-0 direct_report-0 \n", 832 | "3 other_weather-0 direct_report-0 \n", 833 | "4 other_weather-0 direct_report-0 \n", 834 | "\n", 835 | "[5 rows x 36 columns]" 836 | ] 837 | }, 838 | "execution_count": 7, 839 | "metadata": {}, 840 | "output_type": "execute_result" 841 | } 842 | ], 843 | "source": [ 844 | "# rename the columns of `categories`\n", 845 | "categories.columns = category_colnames\n", 846 | "categories.head()" 847 | ] 848 | }, 849 | { 850 | "cell_type": "markdown", 851 | "metadata": {}, 852 | "source": [ 853 | "### 4. Convert category values to just numbers 0 or 1.\n", 854 | "- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.\n", 855 | "- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`." 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": 8, 861 | "metadata": {}, 862 | "outputs": [ 863 | { 864 | "data": { 865 | "text/html": [ 866 | "
\n", 867 | "\n", 880 | "\n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | "
1relatedrequestofferaid_relatedmedical_helpmedical_productssearch_and_rescuesecuritymilitarychild_alone...aid_centersother_infrastructureweather_relatedfloodsstormfireearthquakecoldother_weatherdirect_report
01000000000...0000000000
11001000000...0010100000
21000000000...0000000000
31101010000...0000000000
41000000000...0000000000
\n", 1030 | "

5 rows × 36 columns

\n", 1031 | "
" 1032 | ], 1033 | "text/plain": [ 1034 | "1 related request offer aid_related medical_help medical_products \\\n", 1035 | "0 1 0 0 0 0 0 \n", 1036 | "1 1 0 0 1 0 0 \n", 1037 | "2 1 0 0 0 0 0 \n", 1038 | "3 1 1 0 1 0 1 \n", 1039 | "4 1 0 0 0 0 0 \n", 1040 | "\n", 1041 | "1 search_and_rescue security military child_alone ... \\\n", 1042 | "0 0 0 0 0 ... \n", 1043 | "1 0 0 0 0 ... \n", 1044 | "2 0 0 0 0 ... \n", 1045 | "3 0 0 0 0 ... \n", 1046 | "4 0 0 0 0 ... \n", 1047 | "\n", 1048 | "1 aid_centers other_infrastructure weather_related floods storm fire \\\n", 1049 | "0 0 0 0 0 0 0 \n", 1050 | "1 0 0 1 0 1 0 \n", 1051 | "2 0 0 0 0 0 0 \n", 1052 | "3 0 0 0 0 0 0 \n", 1053 | "4 0 0 0 0 0 0 \n", 1054 | "\n", 1055 | "1 earthquake cold other_weather direct_report \n", 1056 | "0 0 0 0 0 \n", 1057 | "1 0 0 0 0 \n", 1058 | "2 0 0 0 0 \n", 1059 | "3 0 0 0 0 \n", 1060 | "4 0 0 0 0 \n", 1061 | "\n", 1062 | "[5 rows x 36 columns]" 1063 | ] 1064 | }, 1065 | "execution_count": 8, 1066 | "metadata": {}, 1067 | "output_type": "execute_result" 1068 | } 1069 | ], 1070 | "source": [ 1071 | "for column in categories:\n", 1072 | " # set each value to be the last character of the string\n", 1073 | " categories[column] = categories[column].apply(lambda x: int(x.split('-')[1]))\n", 1074 | " \n", 1075 | "categories.head()" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "markdown", 1080 | "metadata": {}, 1081 | "source": [ 1082 | "### 5. Replace `categories` column in `df` with new category columns.\n", 1083 | "- Drop the categories column from the df dataframe since it is no longer needed.\n", 1084 | "- Concatenate df and categories data frames." 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": 9, 1090 | "metadata": {}, 1091 | "outputs": [ 1092 | { 1093 | "data": { 1094 | "text/html": [ 1095 | "
\n", 1096 | "\n", 1109 | "\n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | "
idmessageoriginalgenre
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...direct
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirect
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...direct
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...direct
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...direct
\n", 1157 | "
" 1158 | ], 1159 | "text/plain": [ 1160 | " id message \\\n", 1161 | "0 2 Weather update - a cold front from Cuba that c... \n", 1162 | "1 7 Is the Hurricane over or is it not over \n", 1163 | "2 8 Looking for someone but no name \n", 1164 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 1165 | "4 12 says: west side of Haiti, rest of the country ... \n", 1166 | "\n", 1167 | " original genre \n", 1168 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct \n", 1169 | "1 Cyclone nan fini osinon li pa fini direct \n", 1170 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct \n", 1171 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct \n", 1172 | "4 facade ouest d Haiti et le reste du pays aujou... direct " 1173 | ] 1174 | }, 1175 | "execution_count": 9, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "# drop the original categories column from `df`\n", 1182 | "df.drop('categories', axis = 1, inplace = True)\n", 1183 | "\n", 1184 | "df.head()" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": 10, 1190 | "metadata": {}, 1191 | "outputs": [ 1192 | { 1193 | "data": { 1194 | "text/html": [ 1195 | "
\n", 1196 | "\n", 1209 | "\n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | "
idmessageoriginalgenrerelatedrequestofferaid_relatedmedical_helpmedical_products...aid_centersother_infrastructureweather_relatedfloodsstormfireearthquakecoldother_weatherdirect_report
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...direct100000...0000000000
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirect100100...0010100000
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...direct100000...0000000000
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...direct110101...0000000000
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...direct100000...0000000000
\n", 1359 | "

5 rows × 40 columns

\n", 1360 | "
" 1361 | ], 1362 | "text/plain": [ 1363 | " id message \\\n", 1364 | "0 2 Weather update - a cold front from Cuba that c... \n", 1365 | "1 7 Is the Hurricane over or is it not over \n", 1366 | "2 8 Looking for someone but no name \n", 1367 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 1368 | "4 12 says: west side of Haiti, rest of the country ... \n", 1369 | "\n", 1370 | " original genre related \\\n", 1371 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct 1 \n", 1372 | "1 Cyclone nan fini osinon li pa fini direct 1 \n", 1373 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct 1 \n", 1374 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct 1 \n", 1375 | "4 facade ouest d Haiti et le reste du pays aujou... direct 1 \n", 1376 | "\n", 1377 | " request offer aid_related medical_help medical_products ... \\\n", 1378 | "0 0 0 0 0 0 ... \n", 1379 | "1 0 0 1 0 0 ... \n", 1380 | "2 0 0 0 0 0 ... \n", 1381 | "3 1 0 1 0 1 ... \n", 1382 | "4 0 0 0 0 0 ... \n", 1383 | "\n", 1384 | " aid_centers other_infrastructure weather_related floods storm fire \\\n", 1385 | "0 0 0 0 0 0 0 \n", 1386 | "1 0 0 1 0 1 0 \n", 1387 | "2 0 0 0 0 0 0 \n", 1388 | "3 0 0 0 0 0 0 \n", 1389 | "4 0 0 0 0 0 0 \n", 1390 | "\n", 1391 | " earthquake cold other_weather direct_report \n", 1392 | "0 0 0 0 0 \n", 1393 | "1 0 0 0 0 \n", 1394 | "2 0 0 0 0 \n", 1395 | "3 0 0 0 0 \n", 1396 | "4 0 0 0 0 \n", 1397 | "\n", 1398 | "[5 rows x 40 columns]" 1399 | ] 1400 | }, 1401 | "execution_count": 10, 1402 | "metadata": {}, 1403 | "output_type": "execute_result" 1404 | } 1405 | ], 1406 | "source": [ 1407 | "# concatenate the original dataframe with the new `categories` dataframe\n", 1408 | "df = df.join(categories)\n", 1409 | "df.head()" 1410 | ] 1411 | }, 1412 | { 1413 | "cell_type": "markdown", 1414 | "metadata": {}, 1415 | "source": [ 1416 | "### 6. Remove duplicates.\n", 1417 | "- Check how many duplicates are in this dataset.\n", 1418 | "- Drop the duplicates.\n", 1419 | "- Confirm duplicates were removed." 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 11, 1425 | "metadata": {}, 1426 | "outputs": [ 1427 | { 1428 | "data": { 1429 | "text/plain": [ 1430 | "170" 1431 | ] 1432 | }, 1433 | "execution_count": 11, 1434 | "metadata": {}, 1435 | "output_type": "execute_result" 1436 | } 1437 | ], 1438 | "source": [ 1439 | "# check number of duplicates\n", 1440 | "sum(df.duplicated())" 1441 | ] 1442 | }, 1443 | { 1444 | "cell_type": "code", 1445 | "execution_count": 12, 1446 | "metadata": { 1447 | "collapsed": true 1448 | }, 1449 | "outputs": [], 1450 | "source": [ 1451 | "# drop duplicates\n", 1452 | "df = df.drop_duplicates()" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "code", 1457 | "execution_count": 13, 1458 | "metadata": {}, 1459 | "outputs": [ 1460 | { 1461 | "data": { 1462 | "text/plain": [ 1463 | "0" 1464 | ] 1465 | }, 1466 | "execution_count": 13, 1467 | "metadata": {}, 1468 | "output_type": "execute_result" 1469 | } 1470 | ], 1471 | "source": [ 1472 | "# check number of duplicates\n", 1473 | "sum(df.duplicated())" 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "markdown", 1478 | "metadata": {}, 1479 | "source": [ 1480 | "### 7. Save the clean dataset into an sqlite database.\n", 1481 | "You can do this with pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. 
Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below." 1482 | ] 1483 | }, 1484 | { 1485 | "cell_type": "code", 1486 | "execution_count": 14, 1487 | "metadata": { 1488 | "collapsed": true 1489 | }, 1490 | "outputs": [], 1491 | "source": [ 1492 | "engine = create_engine('sqlite:///DisasterResponse.db')\n", 1493 | "df.to_sql('DisasterResponse', engine, index=False)" 1494 | ] 1495 | }, 1496 | { 1497 | "cell_type": "markdown", 1498 | "metadata": {}, 1499 | "source": [ 1500 | "### 8. Use this notebook to complete `etl_pipeline.py`\n", 1501 | "Use the template file attached in the Resources folder to write a script that runs the steps above to create a database based on new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom on the `Project Workspace IDE` coming later." 1502 | ] 1503 | } 1504 | ], 1505 | "metadata": { 1506 | "kernelspec": { 1507 | "display_name": "Python 3", 1508 | "language": "python", 1509 | "name": "python3" 1510 | }, 1511 | "language_info": { 1512 | "codemirror_mode": { 1513 | "name": "ipython", 1514 | "version": 3 1515 | }, 1516 | "file_extension": ".py", 1517 | "mimetype": "text/x-python", 1518 | "name": "python", 1519 | "nbconvert_exporter": "python", 1520 | "pygments_lexer": "ipython3", 1521 | "version": "3.6.3" 1522 | } 1523 | }, 1524 | "nbformat": 4, 1525 | "nbformat_minor": 2 1526 | } 1527 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/ML Pipeline Preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# ML Pipeline Preparation\n", 8 | "Follow the instructions below to help you create your ML pipeline.\n", 9 | "### 1. 
Import libraries and load data from database.\n", 10 | "- Import Python libraries\n", 11 | "- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)\n", 12 | "- Define feature and target variables X and Y" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "# import libraries\n", 24 | "import pandas as pd\n", 25 | "import numpy as np\n", 26 | "import pickle\n", 27 | "from sqlalchemy import create_engine\n", 28 | "import warnings\n", 29 | "warnings.filterwarnings(\"ignore\")" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "# import NLP libraries\n", 41 | "import re\n", 42 | "import nltk \n", 43 | "from nltk.corpus import stopwords\n", 44 | "from nltk.tokenize import word_tokenize\n", 45 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 46 | "# nltk.download('punkt')\n", 47 | "# nltk.download('stopwords')\n", 48 | "# nltk.download('wordnet') # download for lemmatization" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "# import sklearn\n", 60 | "from sklearn.pipeline import Pipeline\n", 61 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n", 62 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 63 | "from sklearn.multioutput import MultiOutputClassifier\n", 64 | "from sklearn.metrics import precision_score, recall_score, f1_score\n", 65 | "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "# load data from database\n", 77 | "engine = create_engine('sqlite:///data/DisasterResponse.db')\n", 78 | "df = pd.read_sql_table('DisasterResponse', engine)\n", 79 | "X = df['message']\n", 80 | "Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### 2. Write a tokenization function to process your text data" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 5, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "def tokenize(text):\n", 99 | " # Define url pattern\n", 100 | " url_re = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'\n", 101 | " \n", 102 | " # Detect and replace urls\n", 103 | " detected_urls = re.findall(url_re, text)\n", 104 | " for url in detected_urls:\n", 105 | " text = text.replace(url, \"urlplaceholder\")\n", 106 | " \n", 107 | " # tokenize sentences\n", 108 | " tokens = word_tokenize(text)\n", 109 | " lemmatizer = WordNetLemmatizer()\n", 110 | " \n", 111 | " # save cleaned tokens\n", 112 | " clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]\n", 113 | " \n", 114 | " # remove stopwords\n", 115 | " STOPWORDS = list(set(stopwords.words('english')))\n", 116 | " clean_tokens = [token for token in clean_tokens if token not in STOPWORDS]\n", 117 | " \n", 118 | " return clean_tokens" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### 3. 
Build a machine learning pipeline\n", 126 | "- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "def build_pipeline():\n", 138 | " \n", 139 | " # build NLP pipeline - count words, tf-idf, multiple output classifier\n", 140 | " pipeline = Pipeline([\n", 141 | " ('vec', CountVectorizer(tokenizer=tokenize)),\n", 142 | " ('tfidf', TfidfTransformer()),\n", 143 | " ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators = 100, n_jobs = 6)))\n", 144 | " ])\n", 145 | " \n", 146 | " return pipeline" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "### 4. Train pipeline\n", 154 | "- Split data into train and test sets\n", 155 | "- Train pipeline" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 7, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "Pipeline(memory=None,\n", 167 | " steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 168 | " dtype=, encoding='utf-8', input='content',\n", 169 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 170 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 171 | " strip_..._score=False, random_state=None, verbose=0,\n", 172 | " warm_start=False),\n", 173 | " n_jobs=None))])" 174 | ] 175 | }, 176 | "execution_count": 7, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "X_train, X_test, y_train, y_test = train_test_split(X, Y)\n", 183 | "pipeline = build_pipeline()\n", 184 | "pipeline.fit(X_train, y_train)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### 5. Test your model\n", 192 | "Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each." 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 8, 198 | "metadata": { 199 | "collapsed": true 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "def build_report(pipeline, X_test, y_test):\n", 204 | " # predict on the X_test\n", 205 | " y_pred = pipeline.predict(X_test)\n", 206 | " \n", 207 | " # build classification report on every column\n", 208 | " performances = []\n", 209 | " for i in range(len(y_test.columns)):\n", 210 | " performances.append([f1_score(y_test.iloc[:, i].values, y_pred[:, i], average='micro'),\n", 211 | " precision_score(y_test.iloc[:, i].values, y_pred[:, i], average='micro'),\n", 212 | " recall_score(y_test.iloc[:, i].values, y_pred[:, i], average='micro')])\n", 213 | " # build dataframe\n", 214 | " performances = pd.DataFrame(performances, columns=['f1 score', 'precision', 'recall'],\n", 215 | " index = y_test.columns) \n", 216 | " return performances" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 9, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/html": [ 227 | "
\n", 228 | "\n", 241 | "\n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | "
f1 scoreprecisionrecall
related0.8016480.8016480.801648
request0.8948730.8948730.894873
offer0.9958800.9958800.995880
aid_related0.7773880.7773880.777388
medical_help0.9205070.9205070.920507
medical_products0.9560570.9560570.956057
search_and_rescue0.9717730.9717730.971773
security0.9823010.9823010.982301
military0.9681110.9681110.968111
child_alone1.0000001.0000001.000000
water0.9581930.9581930.958193
food0.9401890.9401890.940189
shelter0.9350020.9350020.935002
clothing0.9867260.9867260.986726
money0.9786390.9786390.978639
missing_people0.9899300.9899300.989930
refugees0.9687210.9687210.968721
death0.9621610.9621610.962161
other_aid0.8712240.8712240.871224
infrastructure_related0.9346960.9346960.934696
transport0.9537690.9537690.953769
buildings0.9517850.9517850.951785
electricity0.9812330.9812330.981233
tools0.9948120.9948120.994812
hospitals0.9905400.9905400.990540
shops0.9952700.9952700.995270
aid_centers0.9879460.9879460.987946
other_infrastructure0.9556000.9556000.955600
weather_related0.8787000.8787000.878700
floods0.9499540.9499540.949954
storm0.9372900.9372900.937290
fire0.9914560.9914560.991456
earthquake0.9710100.9710100.971010
cold0.9806230.9806230.980623
other_weather0.9469030.9469030.946903
direct_report0.8675620.8675620.867562
\n", 469 | "
" 470 | ], 471 | "text/plain": [ 472 | " f1 score precision recall\n", 473 | "related 0.801648 0.801648 0.801648\n", 474 | "request 0.894873 0.894873 0.894873\n", 475 | "offer 0.995880 0.995880 0.995880\n", 476 | "aid_related 0.777388 0.777388 0.777388\n", 477 | "medical_help 0.920507 0.920507 0.920507\n", 478 | "medical_products 0.956057 0.956057 0.956057\n", 479 | "search_and_rescue 0.971773 0.971773 0.971773\n", 480 | "security 0.982301 0.982301 0.982301\n", 481 | "military 0.968111 0.968111 0.968111\n", 482 | "child_alone 1.000000 1.000000 1.000000\n", 483 | "water 0.958193 0.958193 0.958193\n", 484 | "food 0.940189 0.940189 0.940189\n", 485 | "shelter 0.935002 0.935002 0.935002\n", 486 | "clothing 0.986726 0.986726 0.986726\n", 487 | "money 0.978639 0.978639 0.978639\n", 488 | "missing_people 0.989930 0.989930 0.989930\n", 489 | "refugees 0.968721 0.968721 0.968721\n", 490 | "death 0.962161 0.962161 0.962161\n", 491 | "other_aid 0.871224 0.871224 0.871224\n", 492 | "infrastructure_related 0.934696 0.934696 0.934696\n", 493 | "transport 0.953769 0.953769 0.953769\n", 494 | "buildings 0.951785 0.951785 0.951785\n", 495 | "electricity 0.981233 0.981233 0.981233\n", 496 | "tools 0.994812 0.994812 0.994812\n", 497 | "hospitals 0.990540 0.990540 0.990540\n", 498 | "shops 0.995270 0.995270 0.995270\n", 499 | "aid_centers 0.987946 0.987946 0.987946\n", 500 | "other_infrastructure 0.955600 0.955600 0.955600\n", 501 | "weather_related 0.878700 0.878700 0.878700\n", 502 | "floods 0.949954 0.949954 0.949954\n", 503 | "storm 0.937290 0.937290 0.937290\n", 504 | "fire 0.991456 0.991456 0.991456\n", 505 | "earthquake 0.971010 0.971010 0.971010\n", 506 | "cold 0.980623 0.980623 0.980623\n", 507 | "other_weather 0.946903 0.946903 0.946903\n", 508 | "direct_report 0.867562 0.867562 0.867562" 509 | ] 510 | }, 511 | "execution_count": 9, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "build_report(pipeline, X_test, y_test)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "### 6. Improve your model\n", 525 | "Use grid search to find better parameters. 
" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 10, 531 | "metadata": {}, 532 | "outputs": [ 533 | { 534 | "data": { 535 | "text/plain": [ 536 | "GridSearchCV(cv=5, error_score='raise-deprecating',\n", 537 | " estimator=Pipeline(memory=None,\n", 538 | " steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 539 | " dtype=, encoding='utf-8', input='content',\n", 540 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 541 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 542 | " strip_..._score=False, random_state=None, verbose=0,\n", 543 | " warm_start=False),\n", 544 | " n_jobs=None))]),\n", 545 | " fit_params=None, iid='warn', n_jobs=6,\n", 546 | " param_grid={'clf__estimator__max_features': ['sqrt', 0.5], 'clf__estimator__n_estimators': [50, 100]},\n", 547 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 548 | " scoring=None, verbose=0)" 549 | ] 550 | }, 551 | "execution_count": 10, 552 | "metadata": {}, 553 | "output_type": "execute_result" 554 | } 555 | ], 556 | "source": [ 557 | "parameters = {'clf__estimator__max_features':['sqrt', 0.5],\n", 558 | " 'clf__estimator__n_estimators':[50, 100]}\n", 559 | "\n", 560 | "cv = GridSearchCV(estimator=pipeline, param_grid = parameters, cv = 5, n_jobs = 6)\n", 561 | "cv.fit(X_train, y_train)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "### 7. Test your model\n", 569 | "Show the accuracy, precision, and recall of the tuned model. \n", 570 | "\n", 571 | "Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 11, 577 | "metadata": {}, 578 | "outputs": [ 579 | { 580 | "data": { 581 | "text/html": [ 582 | "
" 825 | ], 826 | "text/plain": [ 827 | " f1 score precision recall\n", 828 | "related 0.801953 0.801953 0.801953\n", 829 | "request 0.888007 0.888007 0.888007\n", 830 | "offer 0.995270 0.995270 0.995270\n", 831 | "aid_related 0.765334 0.765334 0.765334\n", 832 | "medical_help 0.920659 0.920659 0.920659\n", 833 | "medical_products 0.962313 0.962313 0.962313\n", 834 | "search_and_rescue 0.970552 0.970552 0.970552\n", 835 | "security 0.978944 0.978944 0.978944\n", 836 | "military 0.966890 0.966890 0.966890\n", 837 | "child_alone 1.000000 1.000000 1.000000\n", 838 | "water 0.966280 0.966280 0.966280\n", 839 | "food 0.951480 0.951480 0.951480\n", 840 | "shelter 0.948581 0.948581 0.948581\n", 841 | "clothing 0.989014 0.989014 0.989014\n", 842 | "money 0.978486 0.978486 0.978486\n", 843 | "missing_people 0.990388 0.990388 0.990388\n", 844 | "refugees 0.971620 0.971620 0.971620\n", 845 | "death 0.973604 0.973604 0.973604\n", 846 | "other_aid 0.868630 0.868630 0.868630\n", 847 | "infrastructure_related 0.929661 0.929661 0.929661\n", 848 | "transport 0.954379 0.954379 0.954379\n", 849 | "buildings 0.956363 0.956363 0.956363\n", 850 | "electricity 0.979707 0.979707 0.979707\n", 851 | "tools 0.993592 0.993592 0.993592\n", 852 | "hospitals 0.988404 0.988404 0.988404\n", 853 | "shops 0.994355 0.994355 0.994355\n", 854 | "aid_centers 0.987794 0.987794 0.987794\n", 855 | "other_infrastructure 0.952395 0.952395 0.952395\n", 856 | "weather_related 0.880684 0.880684 0.880684\n", 857 | "floods 0.955752 0.955752 0.955752\n", 858 | "storm 0.945682 0.945682 0.945682\n", 859 | "fire 0.991913 0.991913 0.991913\n", 860 | "earthquake 0.973146 0.973146 0.973146\n", 861 | "cold 0.983064 0.983064 0.983064\n", 862 | "other_weather 0.941105 0.941105 0.941105\n", 863 | "direct_report 0.856424 0.856424 0.856424" 864 | ] 865 | }, 866 | "execution_count": 11, 867 | "metadata": {}, 868 | "output_type": "execute_result" 869 | } 870 | ], 871 | "source": [ 872 | "build_report(cv, X_test, y_test)" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 16, 878 | "metadata": {}, 879 | "outputs": [ 880 | { 881 | "data": { 882 | "text/plain": [ 883 | "{'clf__estimator__max_features': 0.5, 'clf__estimator__n_estimators': 100}" 884 | ] 885 | }, 886 | "execution_count": 16, 887 | "metadata": {}, 888 | "output_type": "execute_result" 889 | } 890 | ], 891 | "source": [ 892 | "cv.best_params_" 893 | ] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": {}, 898 | "source": [ 899 | "### 8. Try improving your model further. Here are a few ideas:\n", 900 | "* try other machine learning algorithms\n", 901 | "* add other features besides the TF-IDF" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": 12, 907 | "metadata": {}, 908 | "outputs": [ 909 | { 910 | "data": { 911 | "text/html": [ 912 | "
" 1155 | ], 1156 | "text/plain": [ 1157 | " f1 score precision recall\n", 1158 | "related 0.762893 0.762893 0.762893\n", 1159 | "request 0.892127 0.892127 0.892127\n", 1160 | "offer 0.994049 0.994049 0.994049\n", 1161 | "aid_related 0.767318 0.767318 0.767318\n", 1162 | "medical_help 0.923711 0.923711 0.923711\n", 1163 | "medical_products 0.961398 0.961398 0.961398\n", 1164 | "search_and_rescue 0.970705 0.970705 0.970705\n", 1165 | "security 0.977876 0.977876 0.977876\n", 1166 | "military 0.971468 0.971468 0.971468\n", 1167 | "child_alone 1.000000 1.000000 1.000000\n", 1168 | "water 0.963381 0.963381 0.963381\n", 1169 | "food 0.946903 0.946903 0.946903\n", 1170 | "shelter 0.941868 0.941868 0.941868\n", 1171 | "clothing 0.987946 0.987946 0.987946\n", 1172 | "money 0.977418 0.977418 0.977418\n", 1173 | "missing_people 0.989319 0.989319 0.989319\n", 1174 | "refugees 0.969484 0.969484 0.969484\n", 1175 | "death 0.968721 0.968721 0.968721\n", 1176 | "other_aid 0.868782 0.868782 0.868782\n", 1177 | "infrastructure_related 0.928746 0.928746 0.928746\n", 1178 | "transport 0.955447 0.955447 0.955447\n", 1179 | "buildings 0.954837 0.954837 0.954837\n", 1180 | "electricity 0.980928 0.980928 0.980928\n", 1181 | "tools 0.993592 0.993592 0.993592\n", 1182 | "hospitals 0.987489 0.987489 0.987489\n", 1183 | "shops 0.994049 0.994049 0.994049\n", 1184 | "aid_centers 0.986726 0.986726 0.986726\n", 1185 | "other_infrastructure 0.951938 0.951938 0.951938\n", 1186 | "weather_related 0.876259 0.876259 0.876259\n", 1187 | "floods 0.953616 0.953616 0.953616\n", 1188 | "storm 0.938969 0.938969 0.938969\n", 1189 | "fire 0.991150 0.991150 0.991150\n", 1190 | "earthquake 0.970705 0.970705 0.970705\n", 1191 | "cold 0.981843 0.981843 0.981843\n", 1192 | "other_weather 0.942630 0.942630 0.942630\n", 1193 | "direct_report 0.858865 0.858865 0.858865" 1194 | ] 1195 | }, 1196 | "execution_count": 12, 1197 | "metadata": {}, 1198 | "output_type": "execute_result" 1199 | } 1200 | ], 1201 | "source": [ 1202 | "pipeline_improved = Pipeline([\n", 1203 | " ('vect', CountVectorizer(tokenizer=tokenize)),\n", 1204 | " ('tfidf', TfidfTransformer()),\n", 1205 | " ('clf', MultiOutputClassifier(AdaBoostClassifier(n_estimators = 100)))\n", 1206 | " ])\n", 1207 | "pipeline_improved.fit(X_train, y_train)\n", 1208 | "y_pred_improved = pipeline_improved.predict(X_test)\n", 1209 | "build_report(pipeline_improved, X_test, y_test)" 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "markdown", 1214 | "metadata": {}, 1215 | "source": [ 1216 | "### 9. Export your model as a pickle file" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 18, 1222 | "metadata": { 1223 | "collapsed": true 1224 | }, 1225 | "outputs": [], 1226 | "source": [ 1227 | "pickle.dump(pipeline, open('rf_model.pkl', 'wb'))" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "code", 1232 | "execution_count": 14, 1233 | "metadata": { 1234 | "collapsed": true 1235 | }, 1236 | "outputs": [], 1237 | "source": [ 1238 | "pickle.dump(pipeline_improved, open('adaboost_model.pkl', 'wb'))" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "metadata": {}, 1244 | "source": [ 1245 | "### 10. Use this notebook to complete `train.py`\n", 1246 | "Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user." 
1247 | ] 1248 | } 1249 | ], 1250 | "metadata": { 1251 | "kernelspec": { 1252 | "display_name": "Python 3", 1253 | "language": "python", 1254 | "name": "python3" 1255 | }, 1256 | "language_info": { 1257 | "codemirror_mode": { 1258 | "name": "ipython", 1259 | "version": 3 1260 | }, 1261 | "file_extension": ".py", 1262 | "mimetype": "text/x-python", 1263 | "name": "python", 1264 | "nbconvert_exporter": "python", 1265 | "pygments_lexer": "ipython3", 1266 | "version": "3.6.3" 1267 | } 1268 | }, 1269 | "nbformat": 4, 1270 | "nbformat_minor": 2 1271 | } 1272 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/README.md: -------------------------------------------------------------------------------- 1 | # Disaster Response Pipeline Project 2 | 3 | ### Project Description: 4 | 5 | In this project, I built a data transformation and machine learning pipeline that classifies disaster messages into their relevant categories. The pipeline is served through a Flask application. The project includes a web app where an emergency worker can input a new message and get classification results in several categories. The landing page of the web app also includes four visualizations of the training dataset built with Plotly. 6 | 7 | ### File Descriptions: 8 | The project contains the following files: 9 | 10 | * ETL Pipeline Preparation.ipynb: Notebook experiments for the ETL pipeline 11 | * ML Pipeline Preparation.ipynb: Notebook experiments for the machine learning pipeline 12 | * data/process_data.py: The ETL pipeline used to process data in preparation for model building. 13 | * models/train_classifier.py: The machine learning pipeline used to fit, tune, evaluate, and export the model to a Python pickle (the pickle is not uploaded to the repo due to GitHub size constraints). 14 | * app/templates/~.html: HTML pages for the web app. 15 | * app/app.py: Starts the Python server for the web app and prepares the visualizations. 16 | 17 | The app is deployed on Heroku at this [link](https://disaster-response-app184.herokuapp.com/) 18 | 19 | Example message to classify: "Help, Fire!" 20 | 21 | ### Local Instructions: 22 | 1. Run the following commands in the project's root directory to set up the database and model. 23 | 24 | - To run the ETL pipeline that cleans data and stores it in the database: 25 | `python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db` 26 | - To run the ML pipeline that trains and saves the classifier: 27 | `python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl` 28 | 29 | 2. Run the following command in the app's directory to run the web app. 30 | `python app.py` 31 | 32 | 3. 
Go to http://127.0.0.1:5000/ 33 | 34 | 35 | ![Webapp Screenshot](https://raw.githubusercontent.com/chenbowen184/Data_Science_Portfolio/master/Project%205%20-%20Disaster%20Response%20Pipeline/app/webapp%20screenshot.png) 36 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/app.py: -------------------------------------------------------------------------------- 1 | import json 2 | import plotly 3 | import pandas as pd 4 | import re 5 | from collections import Counter 6 | 7 | # import NLP libraries 8 | from tokenizer_function import Tokenizer, tokenize 9 | 10 | from flask import Flask 11 | from flask import render_template, request, jsonify 12 | from plotly.graph_objs import Bar 13 | from sklearn.externals import joblib 14 | from sqlalchemy import create_engine 15 | 16 | 17 | app = Flask(__name__) 18 | 19 | 20 | @app.before_first_request 21 | 22 | def load_model_data(): 23 | global df 24 | global model 25 | # load data 26 | 27 | engine = create_engine('sqlite:///data/DisasterResponse.db') 28 | df = pd.read_sql_table('DisasterResponse', engine) 29 | 30 | # load model 31 | model = joblib.load("models/adaboost_model.pkl") 32 | 33 | # index webpage displays cool visuals and receives user input text for model 34 | @app.route('/') 35 | @app.route('/index') 36 | 37 | def index(): 38 | 39 | # extract data needed for visuals 40 | # Message counts of different generes 41 | genre_counts = df.groupby('genre').count()['message'] 42 | genre_names = list(genre_counts.index) 43 | 44 | # Message counts for different categories 45 | cate_counts_df = df.iloc[:, 4:].sum().sort_values(ascending=False) 46 | cate_counts = list(cate_counts_df) 47 | cate_names = list(cate_counts_df.index) 48 | 49 | # Top keywords in Social Media in percentages 50 | social_media_messages = ' '.join(df[df['genre'] == 'social']['message']) 51 | social_media_tokens = tokenize(social_media_messages) 52 | social_media_wrd_counter = Counter(social_media_tokens).most_common() 53 | social_media_wrd_cnt = [i[1] for i in social_media_wrd_counter] 54 | social_media_wrd_pct = [i/sum(social_media_wrd_cnt) *100 for i in social_media_wrd_cnt] 55 | social_media_wrds = [i[0] for i in social_media_wrd_counter] 56 | 57 | # Top keywords in Direct in percentages 58 | direct_messages = ' '.join(df[df['genre'] == 'direct']['message']) 59 | direct_tokens = tokenize(direct_messages) 60 | direct_wrd_counter = Counter(direct_tokens).most_common() 61 | direct_wrd_cnt = [i[1] for i in direct_wrd_counter] 62 | direct_wrd_pct = [i/sum(direct_wrd_cnt) * 100 for i in direct_wrd_cnt] 63 | direct_wrds = [i[0] for i in direct_wrd_counter] 64 | 65 | # create visuals 66 | 67 | graphs = [ 68 | # Histogram of the message genere 69 | { 70 | 'data': [ 71 | Bar( 72 | x=genre_names, 73 | y=genre_counts 74 | ) 75 | ], 76 | 77 | 'layout': { 78 | 'title': 'Distribution of Message Genres', 79 | 'yaxis': { 80 | 'title': "Count" 81 | }, 82 | 'xaxis': { 83 | 'title': "Genre" 84 | } 85 | } 86 | }, 87 | # histogram of social media messages top 30 keywords 88 | { 89 | 'data': [ 90 | Bar( 91 | x=social_media_wrds[:50], 92 | y=social_media_wrd_pct[:50] 93 | ) 94 | ], 95 | 96 | 'layout':{ 97 | 'title': "Top 50 Keywords in Social Media Messages", 98 | 'xaxis': {'tickangle':60 99 | }, 100 | 'yaxis': { 101 | 'title': "% Total Social Media Messages" 102 | } 103 | } 104 | }, 105 | 106 | # histogram of direct messages top 30 keywords 107 | { 108 | 'data': [ 109 | Bar( 110 | x=direct_wrds[:50], 111 | y=direct_wrd_pct[:50] 
112 | ) 113 | ], 114 | 115 | 'layout':{ 116 | 'title': "Top 50 Keywords in Direct Messages", 117 | 'xaxis': {'tickangle':60 118 | }, 119 | 'yaxis': { 120 | 'title': "% Total Direct Messages" 121 | } 122 | } 123 | }, 124 | 125 | 126 | 127 | # histogram of messages categories distributions 128 | { 129 | 'data': [ 130 | Bar( 131 | x=cate_names, 132 | y=cate_counts 133 | ) 134 | ], 135 | 136 | 'layout':{ 137 | 'title': "Distribution of Message Categories", 138 | 'xaxis': {'tickangle':60 139 | }, 140 | 'yaxis': { 141 | 'title': "count" 142 | } 143 | } 144 | }, 145 | 146 | ] 147 | 148 | # encode plotly graphs in JSON 149 | ids = ["graph-{}".format(i) for i, _ in enumerate(graphs)] 150 | graphJSON = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder) 151 | 152 | # render web page with plotly graphs 153 | return render_template('master.html', ids=ids, graphJSON=graphJSON) 154 | 155 | 156 | # web page that handles user query and displays model results 157 | @app.route('/go') 158 | def go(): 159 | # save user input in query 160 | query = request.args.get('query', '') 161 | 162 | # use model to predict classification for query 163 | classification_labels = model.predict([query])[0] 164 | classification_results = dict(zip(df.columns[4:], classification_labels)) 165 | 166 | # This will render the go.html Please see that file. 167 | return render_template( 168 | 'go.html', 169 | query=query, 170 | classification_result=classification_results 171 | ) 172 | 173 | 174 | def main(): 175 | app.run() 176 | 177 | 178 | if __name__ == '__main__': 179 | main() -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/templates/go.html: -------------------------------------------------------------------------------- 1 | {% extends "master.html" %} 2 | {% block title %}Results{% endblock %} 3 | 4 | {% block message %} 5 |
6 |

MESSAGE

7 |

{{query}}

8 | {% endblock %} 9 | 10 | {% block content %} 11 |

Result

12 |
    13 | {% for category, classification in classification_result.items() %} 14 | {% if classification == 1 %} 15 |
  • {{category.replace('_', ' ').title()}}
  • 16 | {% else %} 17 |
  • {{category.replace('_', ' ').title()}}
  • 18 | {% endif %} 19 | {% endfor %} 20 | 21 | 22 | 23 | 24 | {% endblock %} -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/templates/master.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Disasters 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 31 | 32 | 33 |
    34 |
    35 |

    Disaster Response Application

    36 |

    Analyzing message data for disaster response

    37 |
    38 | 39 |
    40 |
    41 |
    42 | 43 |
    44 | 45 |
    46 |
    47 |
    48 |
    49 | 50 | {% block message %} 51 | {% endblock %} 52 |
    53 |
    54 | 55 |
    56 | {% block content %} 57 | 60 | {% endblock %} 61 | 62 | {% for id in ids %} 63 |
    64 | {% endfor %} 65 |
    66 | 67 | 74 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/webapp screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 5 - Disaster Response Pipeline/app/webapp screenshot.png -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/data/DisasterResponse.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 5 - Disaster Response Pipeline/data/DisasterResponse.db -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/data/process_data.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pandas as pd 3 | import numpy as n 4 | from sqlalchemy import create_engine 5 | 6 | def load_data(messages_filepath, categories_filepath): 7 | """ 8 | Load data from the csv. 9 | Args: 10 | messages_filepath: the path of the messages.csv files that needs to be transferred 11 | categories_filepath: the path of the categories.csv files that needs to be transferred 12 | Returns: 13 | merged_df (DataFrame): messages and categories merged dataframe 14 | """ 15 | # load messages and categories 16 | messages = pd.read_csv(messages_filepath) 17 | categories = pd.read_csv(categories_filepath) 18 | # merge two dataframes into one 19 | df = pd.merge(messages, categories, on = 'id') 20 | return df 21 | 22 | def clean_data(df): 23 | """ 24 | Clean the unstructured merged dataframe into structured dataframes. 25 | 1. Rename columns of different categories 26 | 2. 
Remove Duplicates 27 | 28 | Args: 29 | df: The preprocessed dataframe 30 | Returns: 31 | df (DataFrame): messages and categories merged dataframe 32 | """ 33 | 34 | # split the categories columns into multiple columns 35 | categories = df['categories'].str.split(';', expand=True) 36 | 37 | # rename columns 38 | row = categories.iloc[1] 39 | category_colnames = row.apply(lambda x: x.split('-')[0]) 40 | categories.columns = category_colnames 41 | 42 | # replace original values into 1 and 0 43 | for column in categories: 44 | categories[column] = categories[column].apply(lambda x: int(x.split('-')[1])) 45 | 46 | # replace the old categories column 47 | df.drop('categories', axis = 1, inplace = True) 48 | df = df.join(categories) 49 | # drop duplicates 50 | df = df.drop_duplicates() 51 | return df 52 | 53 | 54 | def save_data(df, database_filename): 55 | """ 56 | Save processed dataframe into sqlite database 57 | 58 | Args: 59 | df: The preprocessed dataframe 60 | database_filename: name of the database 61 | Returns: 62 | None 63 | """ 64 | 65 | # save data into a sqlite database 66 | engine = create_engine('sqlite:///Messages.db') 67 | df.to_sql('Messages', engine, index=False, if_exists='replace') 68 | 69 | 70 | def main(): 71 | if len(sys.argv) == 4: 72 | 73 | messages_filepath, categories_filepath, database_filepath = sys.argv[1:] 74 | 75 | print('Loading data...\n MESSAGES: {}\n CATEGORIES: {}' 76 | .format(messages_filepath, categories_filepath)) 77 | df = load_data(messages_filepath, categories_filepath) 78 | 79 | print('Cleaning data...') 80 | df = clean_data(df) 81 | 82 | print('Saving data...\n DATABASE: {}'.format(database_filepath)) 83 | save_data(df, database_filepath) 84 | 85 | print('Cleaned data saved to database!') 86 | 87 | else: 88 | print('Please provide the filepaths of the messages and categories '\ 89 | 'datasets as the first and second argument respectively, as '\ 90 | 'well as the filepath of the database to save the cleaned data '\ 91 | 'to as the third argument. \n\nExample: python process_data.py '\ 92 | 'disaster_messages.csv disaster_categories.csv '\ 93 | 'DisasterResponse.db') 94 | 95 | 96 | if __name__ == '__main__': 97 | main() -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/models/adaboost_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 5 - Disaster Response Pipeline/models/adaboost_model.pkl -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/models/tokenizer_function.py: -------------------------------------------------------------------------------- 1 | import re 2 | import nltk 3 | from nltk.corpus import stopwords 4 | from nltk.tokenize import word_tokenize 5 | from nltk.stem.wordnet import WordNetLemmatizer 6 | import pandas as pd 7 | from sklearn.base import BaseEstimator, TransformerMixin 8 | 9 | class Tokenizer(BaseEstimator, TransformerMixin): 10 | """ Tokenize transformer to be used in the pipeline 11 | """ 12 | def __init__(self): 13 | pass 14 | 15 | def fit(self, X, y=None): 16 | return self 17 | 18 | def transform(self, X): 19 | return pd.Series(X).apply(tokenize).values 20 | 21 | 22 | def tokenize(text): 23 | """ 24 | Tokenize the message into word level features. 25 | 1. replace urls 26 | 2. convert to lower cases 27 | 3. remove stopwords 28 | 4. 
strip white spaces 29 | Args: 30 | text: input text messages 31 | Returns: 32 | cleaned tokens(List) 33 | """ 34 | # Define url pattern 35 | url_re = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' 36 | 37 | # Detect and replace urls 38 | detected_urls = re.findall(url_re, text) 39 | for url in detected_urls: 40 | text = text.replace(url, "urlplaceholder") 41 | 42 | # tokenize sentences 43 | tokens = word_tokenize(text) 44 | lemmatizer = WordNetLemmatizer() 45 | 46 | # save cleaned tokens 47 | clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens] 48 | 49 | # remove stopwords 50 | STOPWORDS = list(set(stopwords.words('english'))) 51 | clean_tokens = [token for token in clean_tokens if token not in STOPWORDS] 52 | 53 | return clean_tokens 54 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/models/train_classifier.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pandas as pd 3 | import numpy as np 4 | import pickle 5 | from sqlalchemy import create_engine 6 | 7 | # import tokenize_function 8 | from models.tokenizer_function import Tokenizer 9 | 10 | # import sklearn 11 | from sklearn.pipeline import Pipeline 12 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer 13 | from sklearn.model_selection import train_test_split, GridSearchCV 14 | from sklearn.multioutput import MultiOutputClassifier 15 | from sklearn.metrics import precision_score, recall_score, f1_score 16 | from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier 17 | from sklearn.externals import joblib 18 | 19 | def load_data(database_filepath): 20 | """ 21 | Load data from the sqlite database. 
22 | Args: 23 | database_filepath: the path of the database file 24 | Returns: 25 | X (DataFrame): messages 26 | Y (DataFrame): One-hot encoded categories 27 | category_names (List) 28 | """ 29 | 30 | # load data from database 31 | engine = create_engine('sqlite:///../data/DisasterResponse.db') 32 | df = pd.read_sql_table('DisasterResponse', engine) 33 | X = df['message'] 34 | Y = df.drop(['id', 'message', 'original', 'genre'], axis=1) 35 | category_names = Y.columns 36 | 37 | return X, Y, category_names 38 | 39 | 40 | def build_model(): 41 | """ 42 | build NLP pipeline - count words, tf-idf, multiple output classifier, 43 | grid search the best parameters 44 | Args: 45 | None 46 | Returns: 47 | cross validated classifier object 48 | """ 49 | # 50 | pipeline = Pipeline([ 51 | ('tokenizer', Tokenizer()), 52 | ('vec', CountVectorizer()), 53 | ('tfidf', TfidfTransformer()), 54 | ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators = 100))) 55 | ]) 56 | 57 | # grid search 58 | parameters = {'clf__estimator__max_features':['sqrt', 0.5], 59 | 'clf__estimator__n_estimators':[50, 100]} 60 | 61 | cv = GridSearchCV(estimator=pipeline, param_grid = parameters, cv = 5, n_jobs = 10) 62 | 63 | return cv 64 | 65 | 66 | def evaluate_model(model, X_test, Y_test, category_names): 67 | """ 68 | Evaluate the model performances, in terms of f1-score, precison and recall 69 | Args: 70 | model: the model to be evaluated 71 | X_test: X_test dataframe 72 | Y_test: Y_test dataframe 73 | category_names: category names list defined in load data 74 | Returns: 75 | perfomances (DataFrame) 76 | """ 77 | # predict on the X_test 78 | y_pred = model.predict(X_test) 79 | 80 | # build classification report on every column 81 | performances = [] 82 | for i in range(len(category_names)): 83 | performances.append([f1_score(Y_test.iloc[:, i].values, y_pred[:, i], average='micro'), 84 | precision_score(Y_test.iloc[:, i].values, y_pred[:, i], average='micro'), 85 | recall_score(Y_test.iloc[:, i].values, y_pred[:, i], average='micro')]) 86 | # build dataframe 87 | performances = pd.DataFrame(performances, columns=['f1 score', 'precision', 'recall'], 88 | index = category_names) 89 | return performances 90 | 91 | 92 | def save_model(model, model_filepath): 93 | """ 94 | Save model to pickle 95 | """ 96 | joblib.dump(model, open(model_filepath, 'wb')) 97 | 98 | 99 | def main(): 100 | if len(sys.argv) == 3: 101 | database_filepath, model_filepath = sys.argv[1:] 102 | print('Loading data...\n DATABASE: {}'.format(database_filepath)) 103 | X, Y, category_names = load_data(database_filepath) 104 | X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2) 105 | 106 | print('Building model...') 107 | model = build_model() 108 | 109 | print('Training model...') 110 | model.fit(X_train, Y_train) 111 | 112 | print('Evaluating model...') 113 | evaluate_model(model, X_test, Y_test, category_names) 114 | 115 | print('Saving model...\n MODEL: {}'.format(model_filepath)) 116 | save_model(model, model_filepath) 117 | 118 | print('Trained model saved!') 119 | 120 | else: 121 | print('Please provide the filepath of the disaster messages database '\ 122 | 'as the first argument and the filepath of the pickle file to '\ 123 | 'save the model to as the second argument. 
\n\nExample: python '\ 124 | 'train_classifier.py ../data/DisasterResponse.db classifier.pkl') 125 | 126 | 127 | if __name__ == '__main__': 128 | main() 129 | -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/__pycache__/project_tests.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/__pycache__/project_tests.cpython-36.pyc -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/project_tests.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import pickle 4 | 5 | df = pd.read_csv('data/user-item-interactions.csv') 6 | df_content = pd.read_csv('data/articles_community.csv') 7 | del df['Unnamed: 0'] 8 | del df_content['Unnamed: 0'] 9 | 10 | 11 | def sol_1_test(sol_1_dict): 12 | sol_1_dict_ = { 13 | '`50% of individuals have _____ or fewer interactions.`': 3, 14 | '`The total number of user-article interactions in the dataset is ______.`': 45993, 15 | '`The maximum number of user-article interactions by any 1 user is ______.`': 364, 16 | '`The most viewed article in the dataset was viewed _____ times.`': 937, 17 | '`The article_id of the most viewed article is ______.`': '1429.0', 18 | '`The number of unique articles that have at least 1 rating ______.`': 714, 19 | '`The number of unique users in the dataset is ______`': 5148, 20 | '`The number of unique articles on the IBM platform`': 1051, 21 | } 22 | 23 | if sol_1_dict_ == sol_1_dict: 24 | print("It looks like you have everything right here! Nice job!") 25 | 26 | else: 27 | for k, v in sol_1_dict.items(): 28 | if sol_1_dict_[k] != sol_1_dict[k]: 29 | print("Oops! It looks like the value associated with: {} wasn't right. Try again. It might just be the datatype. All of the values should be ints except the article_id should be a string. Let each row be considered a separate user-article interaction. If a user interacts with an article 3 times, these are considered 3 separate interactions.".format(k)) 30 | 31 | 32 | def sol_2_test(top_articles): 33 | top_5 = top_articles(5) 34 | top_10 = top_articles(10) 35 | top_20 = top_articles(20) 36 | 37 | checks = ['top_5', 'top_10', 'top_20'] 38 | for idx, file in enumerate(checks): 39 | if set(eval(file)) == set(pickle.load(open( "{}.p".format(file), "rb" ))): 40 | print("Your {} looks like the solution list! Nice job.".format(file)) 41 | else: 42 | print("Oops! The {} list doesn't look how we expected. Try again.".format(file)) 43 | 44 | 45 | 46 | def sol_5_test(sol_5_dict): 47 | sol_5_dict_1 = { 48 | 'The user that is most similar to user 1.': 3933, 49 | 'The user that is the 10th most similar to user 131': 242 50 | } 51 | if sol_5_dict == sol_5_dict_1: 52 | print("This all looks good! Nice job!") 53 | 54 | else: 55 | for k, v in sol_5_dict_1.items(): 56 | if set(sol_5_dict[k]) != set(sol_5_dict_1[k]): 57 | print("Oops! Looks like there is a mistake with the {} key in your dictionary. The answer should be {}. 
Try again.".format(k,v)) 58 | 59 | 60 | def sol_4_test(sol_4_dict): 61 | 62 | a = 662 # len(test_idx) - user_item_test.shape[0] 63 | b = 574 # user_test_shape[1] or len(test_arts) because we can make predictions for all articles 64 | c = 20 # user_item_test.shape[0] 65 | d = 0 # len(test_arts) - user_item_test.shape[1] 66 | 67 | sol_4_dict_1 = { 68 | 'How many users can we make predictions for in the test set?': c, 69 | 'How many users in the test set are we not able to make predictions for because of the cold start problem?': a, 70 | 'How many movies can we make predictions for in the test set?': b, 71 | 'How many movies in the test set are we not able to make predictions for because of the cold start problem?': d 72 | } 73 | 74 | if sol_4_dict == sol_4_dict_1: 75 | print("Awesome job! That's right! All of the test movies are in the training data, but there are only 20 test users that were also in the training set. All of the other users that are in the test set we have no data on. Therefore, we cannot make predictions for these users using SVD.") 76 | else: 77 | for k, v in sol_4_dict_1.items(): 78 | if sol_4_dict_1[k] != sol_4_dict[k]: 79 | print("Sorry it looks like that isn't the right value associated with {}. Try again.".format(k)) 80 | 81 | 82 | -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/top_10.p: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/top_10.p -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/top_20.p: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/top_20.p -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/top_5.p: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/top_5.p -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science Portfolio Projects 2 | 3 | ### Project 1 - Predicting Donors' Income using supervised learning 4 | 5 | **Description:** 6 | 7 | In this project, I employed several supervised learning algorithms to accurately model individuals' income using data collected from the 1994 U.S. Census. I then chose the best candidate algorithm from preliminary results and further optimized it to best model the data. My goal with this implementation was to construct a model that accurately predicts whether an individual makes more than $50,000. 8 | 9 | [Project Notebook: Finding Donors](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%201%20-%20Finding%20Donars/finding_donors.ipynb) 10 | 11 |
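As a rough illustration of the workflow described above (not the notebook's exact code), the sketch below one-hot encodes the census features and tunes a single candidate model with a grid search. The choice of `AdaBoostClassifier`, the hyperparameter grid, the F-beta scorer, and the `'>50K'` label string are assumptions made for this sketch, and the preprocessing in the actual notebook is more involved.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import fbeta_score, make_scorer

# census.csv ships with the project; 'income' is the binary target
data = pd.read_csv('census.csv')
X = pd.get_dummies(data.drop('income', axis=1))   # one-hot encode categorical features
y = (data['income'] == '>50K').astype(int)        # assumed label string

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# tune the best preliminary candidate with a small grid, scoring with
# F-beta (beta = 0.5) so that precision is weighted more heavily than recall
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={'n_estimators': [50, 100, 200], 'learning_rate': [0.5, 1.0, 1.5]},
    scoring=scorer,
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print('Test F-beta score:', grid.score(X_test, y_test))
```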
12 | ### Project 2 - Flower Image Classifier Application 13 | 14 | **Description:** 15 | 16 | Going forward, AI algorithms will be incorporated into more and more everyday applications. For example, you might want to include an image classifier in a smartphone app. To do this, you'd use a deep learning model trained on hundreds of thousands of images as part of the overall application architecture. A large part of software development in the future will be using these types of models as common parts of applications. 17 | 18 | In this project, I trained an image classifier to recognize different species of flowers. You can imagine using something like this in a phone app that tells you the name of the flower your camera is looking at. In practice, one would train this classifier and then export it for use in an application. This project uses a dataset of 102 flower categories. 19 | 20 | [Project Notebook: Image Classifier](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%202%20-%20Image%20Classifier%20Application/Image%20Classifier%20Project.ipynb?flush_cache=true) 21 | 22 | 23 | ### Project 3 - Identifying Customer Segments 24 | 25 | **Description:** 26 | 27 | In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data was provided by Bertelsmann Arvato Analytics and represents a real-life data science task. 28 | 29 | First, the general demographics data are clustered with a KMeans clustering algorithm; the fitted model is then applied to the customer dataset to investigate whether the customers follow the same distribution. 30 | 31 | [Project Notebook: Customer Segmentations](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%203%20-%20Identify%20Customer%20Segementation/Identify_Customer_Segments.ipynb?flush_cache=true) 32 | 33 | 34 |
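To make the "fit on the general population, then apply to the customers" idea concrete, here is a minimal sketch of the clustering step. It is a simplified illustration rather than the notebook's code: the file names, the number of PCA components, and the number of clusters are placeholder assumptions, both tables are assumed to share the same cleaned columns, and the real project does substantial missing-value handling and feature re-encoding before any clustering.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# hypothetical, already-cleaned inputs: general population vs. company customers
azdias = pd.read_csv('azdias_cleaned.csv')        # general demographics data
customers = pd.read_csv('customers_cleaned.csv')  # customer demographics data

# scale -> reduce -> cluster, fitted only on the general population
segmenter = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=30)),
    ('kmeans', KMeans(n_clusters=10, random_state=42)),
])
segmenter.fit(azdias)

# reuse the same fitted transformations and centroids on the customer data
population_clusters = pd.Series(segmenter.predict(azdias), name='cluster')
customer_clusters = pd.Series(segmenter.predict(customers), name='cluster')

# clusters that are over-represented among customers point to the core customer base
comparison = pd.DataFrame({
    'population_pct': population_clusters.value_counts(normalize=True),
    'customer_pct': customer_clusters.value_counts(normalize=True),
}).fillna(0)
comparison['ratio'] = comparison['customer_pct'] / comparison['population_pct']
print(comparison.sort_values('ratio', ascending=False))
```

Comparing the two proportion columns (or their ratio) is what reveals which population segments the mail-order company actually reaches.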
35 | ### Project 4 - Data Science Blog 36 | 37 | **Description:** 38 | 39 | In this project, I analyzed the 2011 - 2018 Stack Overflow developer survey data to create a blog post presenting a comprehensive study of data science careers. The project notebook can be found below. 40 | 41 | [Project Notebook: Understanding the Career of Data Scientists](http://nbviewer.jupyter.org/github/chenbowen184/Data_Scientist_Nanodegree/blob/master/Project%204%20-%20Data%20Science%20Blog/Understanding%20the%20Career%20of%20Data%20Scientists.ipynb) 42 | 43 | [Blog Post: Understanding the Career of Data Scientist Using the Data Science Way](https://medium.com/@bowenchen/understanding-the-career-of-data-scientists-in-a-data-science-way-9bd63817221e) 44 | 45 | ### Project 5 - Disaster Response Pipeline 46 | 47 | **Description:** 48 | 49 | In this project, I built a data transformation and machine learning pipeline that classifies disaster messages into their relevant categories. The pipeline is served through a Flask application. The project includes a web app where an emergency worker can input a new message and get classification results in several categories. The web app also displays visualizations of the data. The project notebooks can be found below. 50 | 51 | [Project Notebook: ETL](http://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Project%205%20-%20Disaster%20Response%20Pipeline/ETL%20Pipeline%20Preparation.ipynb) 52 | 53 | [Project Notebook: Machine Learning Pipeline](http://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Project%205%20-%20Disaster%20Response%20Pipeline/ML%20Pipeline%20Preparation.ipynb) 54 | 55 | [Disaster Response App](https://disaster-response-app184.herokuapp.com/) 56 | 57 | 58 | ### Project 6 - Recommendation System 59 | 60 | **Description:** 61 | 62 | In this project, I developed a recommendation engine using IBM community articles and user-article interactions. The project serves as a prototype of IBM's article recommendation system. The project notebook can be found below. 63 | 64 | [Project Notebook: Recommendations with IBM](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Project%206%20-%20Reccomendation%20System/Recommendations_with_IBM.ipynb) 65 | 66 | ### Capstone Project - Spark Distributed Analytics 67 | 68 | **Description:** 69 | 70 | In the capstone project, I built a distributed machine learning pipeline for a user activity log using Spark, the big data processing framework. The primary objective was to predict the churn probability of every user. Most of the data visualizations were completed on a small subset of the data, while the full-dataset analytics were performed using AWS EMR. 71 | 72 | [Project Notebook: Spark - Subset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Subset%20Analytics.ipynb) 73 | 74 | [Project Notebook: Spark - Full Dataset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Full%20Dataset.ipynb) 75 | 76 | [Blog Post: Understanding Customer Churning with Big Data Analytics](https://medium.com/@bowenchen/understanding-customer-churning-with-big-data-analytics-70ce4eb17669) 77 | 78 | ![Certificate](https://raw.githubusercontent.com/chenbowen184/Data_Science_Portfolio/master/Data%20Scientist%20Nanodegree%20certificate.jpg) 79 | 80 | **Disclaimer:** Remember to provide proper citation if you want to use any part of this code. 81 | --------------------------------------------------------------------------------