├── Capstone Project ├── README.md ├── Spark - Full Dataset.ipynb └── Spark - Subset Analytics.ipynb ├── Data Scientist Nanodegree certificate.jpg ├── Project 1 - Finding Donars ├── README.md ├── __pycache__ │ └── visuals.cpython-36.pyc ├── census.csv ├── finding_donors.ipynb └── visuals.py ├── Project 2 - Image Classifier Application ├── Application │ ├── README.md │ ├── cat_to_name.json │ ├── model_spec.py │ ├── predict.py │ ├── train.py │ └── workspace_utils.py ├── Image Classifier Project.ipynb ├── Readme.md ├── assets │ ├── Flowers.png │ └── inference_example.png └── workspace_utils.py ├── Project 3 - Identify Customer Segementation ├── Identify_Customer_Segments.ipynb └── README.md ├── Project 4 - Data Science Blog ├── README.md ├── Understanding the Career of Data Scientists.ipynb └── data │ ├── 2011 Stack Overflow Survey Responses.csv │ ├── 2012 Stack Overflow Survey Responses.csv │ ├── 2013 Stack Overflow Survey Responses.csv │ └── 2014 Stack Overflow Survey Responses.csv ├── Project 5 - Disaster Response Pipeline ├── ETL Pipeline Preparation.ipynb ├── ML Pipeline Preparation.ipynb ├── README.md ├── app │ ├── app.py │ ├── templates │ │ ├── go.html │ │ └── master.html │ └── webapp screenshot.png ├── data │ ├── DisasterResponse.db │ ├── disaster_categories.csv │ ├── disaster_messages.csv │ └── process_data.py └── models │ ├── adaboost_model.pkl │ ├── tokenizer_function.py │ └── train_classifier.py ├── Project 6 - Reccomendation System ├── Recommendations_with_IBM.ipynb ├── __pycache__ │ └── project_tests.cpython-36.pyc ├── data │ ├── articles_community.csv │ └── user-item-interactions.csv ├── project_tests.py ├── top_10.p ├── top_20.p └── top_5.p └── README.md /Capstone Project/README.md: -------------------------------------------------------------------------------- 1 | # Capstone Project - Sparkify Music Service Analytics 2 | 3 | ### Motivation 4 | 5 | Like many music streaming services, Sparkify operates on event data logged at a resolution of seconds, so the dataset grows exponentially large within a short time and quickly outgrows the memory limit of most personal computers. To keep the analytics running as usual, we need to utilize the power of Spark to perform the analysis as a distributed task and obtain insights from the data. In particular, we would like to understand the factors that contribute to users' churning behavior. 6 | 7 | ### Data Descriptions 8 | 9 | The data provided is the user log of the service, containing demographic info, user activities, timestamps, etc. The logged user activities include 10 | 11 | * Add Friend 12 | * Add to Playlist 13 | * Cancel/Cancel Confirmation 14 | * Submit Upgrade/Upgrade 15 | * Submit Downgrade/Downgrade 16 | * Error 17 | * Help 18 | * Home 19 | * Logout 20 | * Nextsong 21 | * Roll Advert 22 | * Save Settings 23 | * Thumbs Up / Down 24 | 25 | ### Project Objective 26 | 27 | Using the user activity logs, we will attempt to predict the probability of a user churning using machine learning models. We will also attempt to understand the factors that contribute to the churning behavior. 28 | 29 | ### Required Packages 30 | 31 | * Pandas 32 | * pyspark 33 | * matplotlib 34 | * numpy 35 | 36 | ### Model Refinement 37 | 38 | The presented model is the best model I have constructed so far. Originally I only used the activities in *page* as features, which yielded a 0.69 F1 score on the small test set.
After I engineered 6 other features as noted in the project, I was able to obtain an F1 score of 0.80 (0.88 after scale up to the large dataset). 39 | 40 | ### Gallery 41 | 42 | 43 | [Project Notebook: Spark - Subset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Subset%20Analytics.ipynb) 44 | 45 | [Project Notebook: Spark - Full Dataset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Full%20Dataset.ipynb) 46 | 47 | [Blog Post: Understanding Customer Churning with Big Data Analytics](https://medium.com/@bowenchen/understanding-customer-churning-with-big-data-analytics-70ce4eb17669) 48 | 49 | -------------------------------------------------------------------------------- /Capstone Project/Spark - Full Dataset.ipynb: -------------------------------------------------------------------------------- 1 | {"cells": [{"metadata": {}, "cell_type": "markdown", "source": "# Sparkify - Full Analytics Script"}, {"metadata": {}, "cell_type": "markdown", "source": "Import Packages"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import udf, countDistinct, count, when, sum,col\nfrom pyspark.sql.types import IntegerType\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.evaluation import MulticlassClassificationEvaluator\nfrom pyspark.ml.regression import LinearRegression\nfrom pyspark.ml.tuning import CrossValidator, ParamGridBuilder\n\nfrom pyspark.ml.feature import OneHotEncoder, StringIndexer, MinMaxScaler, VectorAssembler\nfrom pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier\n\nimport warnings\n\nwarnings.filterwarnings('ignore')", "execution_count": 1, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "c704ada3c7ba4c2daae0900322485e35"}}, "metadata": {}}, {"output_type": "stream", "text": "Starting Spark application\n", "name": "stdout"}, {"output_type": "display_data", "data": {"text/plain": "", "text/html": "\n
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
6application_1547505874377_0007pysparkidleLinkLink\u2714
"}, "metadata": {}}, {"output_type": "stream", "text": "SparkSession available as 'spark'.\n", "name": "stdout"}]}, {"metadata": {}, "cell_type": "markdown", "source": "Load data from AWS"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Create spark session\nspark = (SparkSession \n .builder \n .appName(\"Sparkify\") \n .getOrCreate())\n\n# Read in full sparkify dataset\nevent_data = \"s3n://dsnd-sparkify/sparkify_event_data.json\"\nevents = spark.read.json(event_data)", "execution_count": 2, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "b9aa3a14f92f4421a372a1fcc802bc48"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "### Define Churn"}, {"metadata": {}, "cell_type": "markdown", "source": "We will define Churn as `Cancellation Confirmation` events. We could also add `Downgrade` events as Churn, but we could use `Downgrade` events as an additional feature to predict `Cancellation Confirmation` events (Churn). "}, {"metadata": {}, "cell_type": "markdown", "source": "Create a column named `Churn` as the label of whether the user has churned"}, {"metadata": {}, "cell_type": "markdown", "source": "# Feature Engineering\n\nBuild 7 features that are needed to construct the model "}, {"metadata": {}, "cell_type": "markdown", "source": "Remove several less useful columns to speed up the opreations\n* First Name\n* Last Name\n* auth\n* status\n* gender\n* ItemInSession\n* location\n* method\n* song\n* artist\n"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "events = events.drop('firstName', 'lastName', 'auth', 'gender', 'song','artist',\n 'status', 'method', 'location', 'registration', 'itemInSession')", "execution_count": 3, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "0e8f270a9cef42e1acdd14993e5281f1"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**1**. 
pivot the page column to obtain different activities for the user, then remove the less significant features"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "events_pivot = events.groupby([\"userId\"]).pivot(\"page\").count().fillna(0)\n\n# drop unecessary columns\nevents_pivot = events_pivot.drop('About', 'Cancel', 'Login', 'Submit Registration', 'Register', 'Save Settings')", "execution_count": 4, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "061d9d1cecb34c1c9c209fb46c463ff2"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**2.** Add average song played length"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# filter events log to contain only next song\nevents_songs = events.filter(events.page == 'NextSong')\n\n# Total songs length played\ntotal_length = events_songs.groupby(events_songs.userId).agg(sum('length'))\n\n# join events pivot\nevents_pivot = (events_pivot.join(total_length, on = 'userId', how = 'left')\n .withColumnRenamed(\"Cancellation Confirmation\", \"Churn\")\n .withColumnRenamed(\"sum(length)\", \"total_length\"))", "execution_count": 5, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "916f27479ae7497cb30f56337108772d"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**3.** Add days active"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "convert = 1000*60*60*24 # conversion factor to days\n\n# Find minimum/maximum time stamp of each user\nmin_timestmp = events.select([\"userId\", \"ts\"]).groupby(\"userId\").min(\"ts\")\nmax_timestmp = events.select([\"userId\", \"ts\"]).groupby(\"userId\").max(\"ts\")\n\n# Find days active of each user\ndaysActive = min_timestmp.join(max_timestmp, on=\"userId\")\ndaysActive = (daysActive.withColumn(\"days_active\", \n (col(\"max(ts)\")-col(\"min(ts)\")) / convert))\ndaysActive = daysActive.select([\"userId\", \"days_active\"])\n\n# join events pivot\nevents_pivot = events_pivot.join(daysActive, on = 'userId', how = 'left')", "execution_count": 6, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "b9e763560e3f4c57bde971ef0b748ed4"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**4.** Add number of sessions"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "numSessions = (events.select([\"userId\", \"sessionId\"])\n .distinct()\n .groupby(\"userId\")\n .count()\n .withColumnRenamed(\"count\", \"num_sessions\"))\n\n# join events pivot\nevents_pivot = events_pivot.join(numSessions, on = 'userId', how = 'left')", "execution_count": 7, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "7cc5ea2b76cc476b932f6a30e0be9fd1"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**5.** Add days as paid user"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Find minimum/maximum time stamp of each user as paid user\npaid_min_ts = events.filter(events.level == 'paid').groupby(\"userId\").min(\"ts\")\npaid_max_ts = events.filter(events.level == 
'paid').groupby(\"userId\").max(\"ts\")\n\n# Find days as paid user of each user\n\ndaysPaid = paid_min_ts.join(paid_max_ts, on=\"userId\")\ndaysPaid = (daysPaid.withColumn(\"days_paid\", \n (col(\"max(ts)\")-col(\"min(ts)\")) / convert))\ndaysPaid = daysPaid.select([\"userId\", \"days_paid\"])\n\n# join events pivot\nevents_pivot = events_pivot.join(daysPaid, on = 'userId', how='left')", "execution_count": 8, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "504d5b7b50d24ca48288b5f2d0c6c3e3"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**6.** Add days as a free user"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Find minimum/maximum time stamp of each user as paid user\nfree_min_ts = events.filter(events.level == 'free').groupby(\"userId\").min(\"ts\")\nfree_max_ts = events.filter(events.level == 'free').groupby(\"userId\").max(\"ts\")\n\n# Find days as paid user of each user\ndaysFree = free_min_ts.join(free_max_ts, on=\"userId\")\ndaysFree = (daysFree.withColumn(\"days_free\", \n (col(\"max(ts)\")-col(\"min(ts)\")) / convert))\ndaysFree = daysFree.select([\"userId\", \"days_free\"])\n\n# join events pivot\nevents_pivot = events_pivot.join(daysFree, on = 'userId', how='left')", "execution_count": 9, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "9ddaec6b994b419e9188a0a272a84dc3"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**7.** Add user access agent"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# find user access agents, and perform one-hot encoding on the user \nuserAgents = events.select(['userId', 'userAgent']).distinct()\nuserAgents = userAgents.fillna('Unknown')\n\n# build string indexer\nstringIndexer = StringIndexer(inputCol=\"userAgent\", outputCol=\"userAgentIndex\")\nmodel = stringIndexer.fit(userAgents)\nuserAgents = model.transform(userAgents)\n\n# one hot encode userAgent column\nencoder = OneHotEncoder(inputCol=\"userAgentIndex\", outputCol=\"userAgentVec\")\nuserAgents = encoder.transform(userAgents).select(['userId', 'userAgentVec'])\n\n# join events pivot\nevents_pivot = events_pivot.join(userAgents, on = 'userId', how ='left')", "execution_count": 10, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "df46d0b8ded943789a285f9c7fd9a9aa"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "**8.** Fill all empty values as 0"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "events_pivot = events_pivot.fillna(0)", "execution_count": 11, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "4251a4b61dc7425db99b4c465e64678d"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "# Modeling\n\nSplit the full dataset into train, test, and validation sets. 
Test out three machine learning algorithms\n\n* Logistic Regression\n* Random Forest\n* Gradient Boosting\n\nGradient Boosting has the largest out-of-bag F1-score, we will proceed with this algorithm and build a pipeline around this algorithm."}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Split data into train and test set\nevents_pivot = events_pivot.withColumnRenamed('Churn', 'label')\ntraining, test = events_pivot.randomSplit([0.9, 0.1])", "execution_count": 12, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "01a68c16df37497598c6ff99687e006e"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "Build machine learning pipeline"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "# Create vector from feature data\nfeature_names = events_pivot.drop('label', 'userId').schema.names\nvec_asembler = VectorAssembler(inputCols = feature_names, outputCol = \"Features\")\n\n# Scale each column\nscalar = MinMaxScaler(inputCol=\"Features\", outputCol=\"ScaledFeatures\")\n\n# build classifier\ngbt = GBTClassifier(featuresCol=\"ScaledFeatures\", labelCol=\"label\")\n\n# Consturct pipeline\npipeline_gbt = Pipeline(stages=[vec_asembler, scalar, gbt])", "execution_count": 13, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "b38e9532f23d42d9a7cf9cd9830232a4"}}, "metadata": {}}]}, {"metadata": {}, "cell_type": "markdown", "source": "Fit gradient boosting model"}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "gbt_model = pipeline_gbt.fit(training)", "execution_count": 14, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "beae16983a15424993810a8ef9d29b2e"}}, "metadata": {}}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "def modelEvaluations(model, metric, data):\n \"\"\" Evaluate a machine learning model's performance \n Input: \n model - pipeline object\n metric - the metric of the evaluations\n data - data being evaluated\n Output:\n [score, confusion matrix]\n \"\"\"\n # generate predictions\n evaluator = MulticlassClassificationEvaluator(metricName = metric)\n predictions = model.transform(data)\n \n # calcualte score\n score = evaluator.evaluate(predictions)\n confusion_matrix = (predictions.groupby(\"label\")\n .pivot(\"prediction\")\n .count())\n return [score, confusion_matrix]", "execution_count": 15, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "ea0e7ffbbaf8404da577625b27a59334"}}, "metadata": {}}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "f1_best, conf_mtx_best = modelEvaluations(gbt_model, 'f1', test)", "execution_count": 16, "outputs": [{"output_type": "display_data", "data": {"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "0491fc6e55c64f8ea5cd5f622ba48774"}}, "metadata": {}}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "print('The F1 score for the gradient boosting model:', f1_best)\nconf_mtx_best.show()", "execution_count": 17, "outputs": [{"output_type": "display_data", "data": 
{"text/plain": "VBox()", "application/vnd.jupyter.widget-view+json": {"version_major": 2, "version_minor": 0, "model_id": "3f98d9f271794849bbe9e3b45314bffc"}}, "metadata": {}}, {"output_type": "stream", "text": "('The F1 score for the gradient boosting model:', 0.8896163691822966)\n+-----+----+---+\n|label| 0.0|1.0|\n+-----+----+---+\n| 0|1612| 70|\n| 1| 163|344|\n+-----+----+---+", "name": "stdout"}]}, {"metadata": {"trusted": true}, "cell_type": "code", "source": "", "execution_count": null, "outputs": []}], "metadata": {"kernelspec": {"name": "pysparkkernel", "display_name": "PySpark", "language": ""}, "language_info": {"name": "pyspark", "mimetype": "text/x-python", "codemirror_mode": {"name": "python", "version": 2}, "pygments_lexer": "python2"}}, "nbformat": 4, "nbformat_minor": 2} -------------------------------------------------------------------------------- /Data Scientist Nanodegree certificate.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Data Scientist Nanodegree certificate.jpg -------------------------------------------------------------------------------- /Project 1 - Finding Donars/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Supervised Learning 3 | 4 | ### Project: Finding Donors for CharityML 5 | 6 | (Cited From Udacity) 7 | 8 | ### Install 9 | 10 | This project requires **Python 3.x** and the following Python libraries installed: 11 | 12 | - [NumPy](http://www.numpy.org/) 13 | - [Pandas](http://pandas.pydata.org) 14 | - [matplotlib](http://matplotlib.org/) 15 | - [scikit-learn](http://scikit-learn.org/stable/) 16 | 17 | You will also need to have software installed to run and execute an [iPython Notebook](http://ipython.org/notebook.html) 18 | 19 | 20 | ### Code 21 | 22 | Template code is provided in the `finding_donors.ipynb` notebook file. You will also be required to use the included `visuals.py` Python file and the `census.csv` dataset file to complete your work. While some code has already been implemented to get you started, you will need to implement additional functionality when requested to successfully complete the project. Note that the code included in `visuals.py` is meant to be used out-of-the-box and not intended for students to manipulate. If you are interested in how the visualizations are created in the notebook, please feel free to explore this Python file. 23 | 24 | ### Run 25 | 26 | In a terminal or command window, navigate to the top-level project directory `finding_donors/` (that contains this README) and run one of the following commands: 27 | 28 | ```bash 29 | ipython notebook finding_donors.ipynb 30 | ``` 31 | or 32 | ```bash 33 | jupyter notebook finding_donors.ipynb 34 | ``` 35 | 36 | This will open the iPython Notebook software and project file in your browser. 37 | 38 | ### Data 39 | 40 | The modified census dataset consists of approximately 32,000 data points, with each datapoint having 13 features. This dataset is a modified version of the dataset published in the paper *"Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid",* by Ron Kohavi. You may find this paper [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf), with the original dataset hosted on [UCI](https://archive.ics.uci.edu/ml/datasets/Census+Income). 
41 | 42 | **Features** 43 | - `age`: Age 44 | - `workclass`: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked) 45 | - `education_level`: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool) 46 | - `education-num`: Number of educational years completed 47 | - `marital-status`: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse) 48 | - `occupation`: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces) 49 | - `relationship`: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried) 50 | - `race`: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black) 51 | - `sex`: Sex (Female, Male) 52 | - `capital-gain`: Monetary Capital Gains 53 | - `capital-loss`: Monetary Capital Losses 54 | - `hours-per-week`: Average Hours Per Week Worked 55 | - `native-country`: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands) 56 | 57 | **Target Variable** 58 | - `income`: Income Class (<=50K, >50K) 59 | -------------------------------------------------------------------------------- /Project 1 - Finding Donars/__pycache__/visuals.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 1 - Finding Donars/__pycache__/visuals.cpython-36.pyc -------------------------------------------------------------------------------- /Project 1 - Finding Donars/visuals.py: -------------------------------------------------------------------------------- 1 | ########################################### 2 | # Suppress matplotlib user warnings 3 | # Necessary for newer version of matplotlib 4 | import warnings 5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib") 6 | # 7 | # Display inline matplotlib plots with IPython 8 | from IPython import get_ipython 9 | get_ipython().run_line_magic('matplotlib', 'inline') 10 | ########################################### 11 | 12 | import matplotlib.pyplot as pl 13 | import matplotlib.patches as mpatches 14 | import numpy as np 15 | import pandas as pd 16 | from time import time 17 | from sklearn.metrics import f1_score, accuracy_score 18 | 19 | 20 | def distribution(data, transformed = False): 21 | """ 22 | Visualization code for displaying skewed distributions of features 23 | """ 24 | 25 | # Create figure 26 | fig = pl.figure(figsize = (11,5)); 27 | 28 | # Skewed feature plotting 29 | for i, feature in enumerate(['capital-gain','capital-loss']): 30 | ax = fig.add_subplot(1, 2, i+1) 31 | ax.hist(data[feature], bins = 25, color = '#00A0A0') 32 | ax.set_title("'%s' Feature Distribution"%(feature), fontsize = 14) 33 | ax.set_xlabel("Value") 34 | ax.set_ylabel("Number of 
Records") 35 | ax.set_ylim((0, 2000)) 36 | ax.set_yticks([0, 500, 1000, 1500, 2000]) 37 | ax.set_yticklabels([0, 500, 1000, 1500, ">2000"]) 38 | 39 | # Plot aesthetics 40 | if transformed: 41 | fig.suptitle("Log-transformed Distributions of Continuous Census Data Features", \ 42 | fontsize = 16, y = 1.03) 43 | else: 44 | fig.suptitle("Skewed Distributions of Continuous Census Data Features", \ 45 | fontsize = 16, y = 1.03) 46 | 47 | fig.tight_layout() 48 | fig.show() 49 | 50 | 51 | def evaluate(results, accuracy, f1): 52 | """ 53 | Visualization code to display results of various learners. 54 | 55 | inputs: 56 | - learners: a list of supervised learners 57 | - stats: a list of dictionaries of the statistic results from 'train_predict()' 58 | - accuracy: The score for the naive predictor 59 | - f1: The score for the naive predictor 60 | """ 61 | 62 | # Create figure 63 | fig, ax = pl.subplots(2, 3, figsize = (11,7)) 64 | 65 | # Constants 66 | bar_width = 0.3 67 | colors = ['#A00000','#00A0A0','#00A000'] 68 | 69 | # Super loop to plot four panels of data 70 | for k, learner in enumerate(results.keys()): 71 | for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']): 72 | for i in np.arange(3): 73 | 74 | # Creative plot code 75 | ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k]) 76 | ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45]) 77 | ax[j//3, j%3].set_xticklabels(["1%", "10%", "100%"]) 78 | ax[j//3, j%3].set_xlabel("Training Set Size") 79 | ax[j//3, j%3].set_xlim((-0.1, 3.0)) 80 | 81 | # Add unique y-labels 82 | ax[0, 0].set_ylabel("Time (in seconds)") 83 | ax[0, 1].set_ylabel("Accuracy Score") 84 | ax[0, 2].set_ylabel("F-score") 85 | ax[1, 0].set_ylabel("Time (in seconds)") 86 | ax[1, 1].set_ylabel("Accuracy Score") 87 | ax[1, 2].set_ylabel("F-score") 88 | 89 | # Add titles 90 | ax[0, 0].set_title("Model Training") 91 | ax[0, 1].set_title("Accuracy Score on Training Subset") 92 | ax[0, 2].set_title("F-score on Training Subset") 93 | ax[1, 0].set_title("Model Predicting") 94 | ax[1, 1].set_title("Accuracy Score on Testing Set") 95 | ax[1, 2].set_title("F-score on Testing Set") 96 | 97 | # Add horizontal lines for naive predictors 98 | ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 99 | ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 100 | ax[0, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 101 | ax[1, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 102 | 103 | # Set y-limits for score panels 104 | ax[0, 1].set_ylim((0, 1)) 105 | ax[0, 2].set_ylim((0, 1)) 106 | ax[1, 1].set_ylim((0, 1)) 107 | ax[1, 2].set_ylim((0, 1)) 108 | 109 | # Create patches for the legend 110 | patches = [] 111 | for i, learner in enumerate(results.keys()): 112 | patches.append(mpatches.Patch(color = colors[i], label = learner)) 113 | pl.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \ 114 | loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large') 115 | 116 | # Aesthetics 117 | pl.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.10) 118 | pl.tight_layout() 119 | pl.show() 120 | 121 | 122 | def feature_plot(importances, X_train, y_train): 123 | 124 | # Display the five most important features 125 | indices = np.argsort(importances)[::-1] 126 | columns = 
X_train.columns.values[indices[:5]] 127 | values = importances[indices][:5] 128 | 129 | # Create the plot 130 | fig = pl.figure(figsize = (9,5)) 131 | pl.title("Normalized Weights for First Five Most Predictive Features", fontsize = 16) 132 | pl.bar(np.arange(5), values, width = 0.6, align="center", color = '#00A000', \ 133 | label = "Feature Weight") 134 | pl.bar(np.arange(5) - 0.3, np.cumsum(values), width = 0.2, align = "center", color = '#00A0A0', \ 135 | label = "Cumulative Feature Weight") 136 | pl.xticks(np.arange(5), columns) 137 | pl.xlim((-0.5, 4.5)) 138 | pl.ylabel("Weight", fontsize = 12) 139 | pl.xlabel("Feature", fontsize = 12) 140 | 141 | pl.legend(loc = 'upper center') 142 | pl.tight_layout() 143 | pl.show() 144 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/README.md: -------------------------------------------------------------------------------- 1 | # Image Classification Application - Command Line 2 | 3 | ### Structure 4 | 5 | The whole interface contains 5 files and 1 folder: 6 | 7 | 8 | #### **train.py** - training interface 9 | 10 | Basic usage: python train.py data_directory 11 | 12 | Prints out training loss, validation loss, and validation accuracy as the network trains 13 | Options: 14 | 15 | * Set directory to save checkpoints: python train.py data_dir --save_dir save_directory 16 | * Choose architecture: python train.py data_dir --arch "vgg13" 17 | * Set hyperparameters: python train.py data_dir --learning_rate 0.01 --hidden_units 512 --epochs 20 18 | * Use GPU for training: python train.py data_dir --gpu 19 | 20 | 21 | #### **predict.py** - predicting interface 22 | 23 | 24 | Basic usage: python predict.py /path/to/image checkpoint. Options: 25 | 26 | * Return top K most likely classes: python predict.py input checkpoint --top_k 3 27 | * Use a mapping of categories to real names: python predict.py input checkpoint --category_names cat_to_name.json 28 | * Use GPU for inference: python predict.py input checkpoint --gpu 29 | 30 | 31 | #### **model_spec.py** - utility functions 32 | 33 | Provides all helper functions that **train.py** and **predict.py** use. The main train and predict functions are implemented here 34 | 35 | #### **cat_to_name.json** - categories name mapping 36 | 37 | Provides the encoding of 102 flower classes 38 | 39 | #### **workspace_utils.py** - active session 40 | 41 | Keeps the training session from being timed out 42 | 43 | #### **checkpoints** - saves different model checkpoints 44 | 45 | Directory where the trained model will be saved to.
The pretrained weights could not be uploaded due to file size limits 46 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/cat_to_name.json: -------------------------------------------------------------------------------- 1 | {"21": "fire lily", "3": "canterbury bells", "45": "bolero deep blue", "1": "pink primrose", "34": "mexican aster", "27": "prince of wales feathers", "7": "moon orchid", "16": "globe-flower", "25": "grape hyacinth", "26": "corn poppy", "79": "toad lily", "39": "siam tulip", "24": "red ginger", "67": "spring crocus", "35": "alpine sea holly", "32": "garden phlox", "10": "globe thistle", "6": "tiger lily", "93": "ball moss", "33": "love in the mist", "9": "monkshood", "102": "blackberry lily", "14": "spear thistle", "19": "balloon flower", "100": "blanket flower", "13": "king protea", "49": "oxeye daisy", "15": "yellow iris", "61": "cautleya spicata", "31": "carnation", "64": "silverbush", "68": "bearded iris", "63": "black-eyed susan", "69": "windflower", "62": "japanese anemone", "20": "giant white arum lily", "38": "great masterwort", "4": "sweet pea", "86": "tree mallow", "101": "trumpet creeper", "42": "daffodil", "22": "pincushion flower", "2": "hard-leaved pocket orchid", "54": "sunflower", "66": "osteospermum", "70": "tree poppy", "85": "desert-rose", "99": "bromelia", "87": "magnolia", "5": "english marigold", "92": "bee balm", "28": "stemless gentian", "97": "mallow", "57": "gaura", "40": "lenten rose", "47": "marigold", "59": "orange dahlia", "48": "buttercup", "55": "pelargonium", "36": "ruby-lipped cattleya", "91": "hippeastrum", "29": "artichoke", "71": "gazania", "90": "canna lily", "18": "peruvian lily", "98": "mexican petunia", "8": "bird of paradise", "30": "sweet william", "17": "purple coneflower", "52": "wild pansy", "84": "columbine", "12": "colt's foot", "11": "snapdragon", "96": "camellia", "23": "fritillary", "50": "common dandelion", "44": "poinsettia", "53": "primula", "72": "azalea", "65": "californian poppy", "80": "anthurium", "76": "morning glory", "37": "cape flower", "56": "bishop of llandaff", "60": "pink-yellow dahlia", "82": "clematis", "58": "geranium", "75": "thorn apple", "41": "barbeton daisy", "95": "bougainvillea", "43": "sword lily", "83": "hibiscus", "78": "lotus lotus", "88": "cyclamen", "94": "foxglove", "81": "frangipani", "74": "rose", "89": "watercress", "73": "water lily", "46": "wallflower", "77": "passion flower", "51": "petunia"} -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/model_spec.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from torch import nn 4 | from torch import optim 5 | import torch.nn.functional as F 6 | from torchvision import datasets, transforms, models 7 | from workspace_utils import active_session 8 | from PIL import Image 9 | import json 10 | import argparse 11 | from collections import OrderedDict # use dict, but we have to keep the order 12 | import matplotlib.pyplot as plt 13 | 14 | # ================= Train Model Functions ===================== 15 | def train_model(hyperparameters, data_dir, save_dir, device): 16 | """Train the neural network, called in main function utilize the following helper functions, 17 | """ 18 | 19 | model_init, resize_aspect = get_model(hyperparameters['architecture']) 20 | image_dataset = 
loadImageData(resize_aspect, data_dir) 21 | model_spec = buildNeuralNetwork(model_init, hyperparameters, data_dir, device) 22 | 23 | for e in range(hyperparameters['epochs']): 24 | model_spec['model'].train() 25 | running_loss = 0 # the loss for every batch 26 | 27 | for i, train_batch in enumerate(image_dataset['trainloader']): # minibatch training 28 | 29 | # send the inputs labels to the tensors that uses the specified devices 30 | inputs, labels = tuple(map(lambda x: x.to(device), train_batch)) 31 | model_spec['optimizer'].zero_grad() # clear out previous gradients, avoids accumulations 32 | 33 | # Forward and backward passes 34 | try: 35 | predictions,_ = model_spec['model'].forward(inputs) 36 | 37 | except: 38 | predictions = model_spec['model'].forward(inputs) 39 | 40 | loss = model_spec['criterion'](predictions, labels) 41 | loss.backward() 42 | model_spec['optimizer'].step() 43 | # calculate the total loss for 1 epoch of training 44 | running_loss += loss.item() 45 | 46 | # print the loss every .. batches 47 | if i % hyperparameters['print_every'] == 0: 48 | model_spec['model'].eval() # set to evaluation mode 49 | train_accuracy = evaluate_performance(model_spec['model'], 50 | image_dataset['trainloader'], 51 | model_spec['criterion']) # see evaluate function below 52 | 53 | validate_accuracy = evaluate_performance(model_spec['model'], 54 | image_dataset['validloader'], 55 | model_spec['criterion']) 56 | 57 | print("Epoch: {}/{}... :".format(e+1, hyperparameters['epochs']), 58 | "Loss: {:.4f},".format(running_loss/hyperparameters['print_every']), 59 | "Training Accuracy:{: .4f} %,".format(train_accuracy * 100), 60 | "Validation Accuracy:{: .4f} %".format(validate_accuracy * 100) 61 | ) 62 | running_loss = 0 63 | model_spec['model'].train() 64 | 65 | saveModel(image_dataset, model_spec['model'], model_spec['classifier'], save_dir) 66 | return model_spec['model'] 67 | 68 | def get_model(architecture): 69 | # set model architecture 70 | if architecture == 'inception_v3': 71 | model_init = models.inception_v3(pretrained=True) 72 | model_init.arch = 'inception_v3' 73 | resize_aspect = [320, 299] 74 | 75 | elif architecture == 'densenet161': 76 | model_init = models.densenet161(pretrained=True) 77 | model_init.arch = 'densenet161' 78 | resize_aspect = [256, 224] 79 | 80 | elif architecture == 'vgg19': 81 | model_init = models.vgg19(pretrained=True) 82 | model_init.arch = 'vgg19' 83 | resize_aspect = [256, 224] 84 | 85 | return model_init, resize_aspect 86 | 87 | def loadImageData(resize_aspect, data_dir): 88 | """Input: 89 | resize_aspect - depends on the architecture 90 | data_dir - directory of all image data""" 91 | train_dir = data_dir + '/train' 92 | valid_dir = data_dir + '/valid' 93 | test_dir = data_dir + '/test' 94 | # Define transforms for the training, validation, and testing sets, using data augumentations on training set, 95 | # Inception_v3 has input size 299x299 96 | 97 | train_transforms = transforms.Compose([transforms.RandomRotation(30), 98 | transforms.RandomResizedCrop(resize_aspect[1]), 99 | transforms.RandomHorizontalFlip(), 100 | transforms.ToTensor(), 101 | transforms.Normalize([0.485, 0.456, 0.406], 102 | [0.229, 0.224, 0.225])]) 103 | 104 | validation_transforms = transforms.Compose([transforms.Resize(resize_aspect[1]), 105 | transforms.CenterCrop(resize_aspect[1]), 106 | transforms.ToTensor(), 107 | transforms.Normalize([0.485, 0.456, 0.406], 108 | [0.229, 0.224, 0.225])]) 109 | 110 | 111 | test_transforms = 
transforms.Compose([transforms.Resize(resize_aspect[1]), 112 | transforms.CenterCrop(resize_aspect[1]), 113 | transforms.ToTensor(), 114 | transforms.Normalize([0.485, 0.456, 0.406], 115 | [0.229, 0.224, 0.225])]) 116 | 117 | # Load the datasets with ImageFolder 118 | train_data = datasets.ImageFolder(train_dir, transform=train_transforms) 119 | validation_data = datasets.ImageFolder(valid_dir, transform=validation_transforms) 120 | test_data = datasets.ImageFolder(test_dir, transform=test_transforms) 121 | 122 | # Using the image datasets and the trainforms, define the dataloaders 123 | trainloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True) 124 | validloader = torch.utils.data.DataLoader(validation_data, batch_size= 32) 125 | testloader = torch.utils.data.DataLoader(test_data, batch_size= 32) 126 | # label mapping 127 | with open('cat_to_name.json', 'r') as f: 128 | cat_to_name = json.load(f) 129 | 130 | image_dataset = {'train': train_data, 'test': test_data, 'validate': validation_data, 131 | 'trainloader':trainloader, 'validloader':validloader, 'testloader': testloader, 132 | 'mapping': cat_to_name} 133 | return image_dataset 134 | 135 | def buildNeuralNetwork(model, hyperparameters, data_dir, device = 'cuda'): 136 | """Builds the transfer learning network according to the given architecture 137 | """ 138 | # turns off gradient 139 | for param in model.parameters(): 140 | param.requires_grad = False 141 | 142 | # input units mapping: 143 | input_units = {'inception_v3': 2048, 'densenet161': 2208, 'vgg19': 25088} 144 | 145 | # rebuild last layer 146 | classifier = nn.Sequential(OrderedDict([ 147 | ('fc1', nn.Linear(input_units[model.arch], 148 | hyperparameters['hidden_units'])), 149 | ('relu1', nn.ReLU()), 150 | ('dropout1', nn.Dropout(hyperparameters['dropout_prob'])), 151 | ('fc2', nn.Linear(hyperparameters['hidden_units'], 152 | 102)), 153 | ('output', nn.LogSoftmax(dim=1)) 154 | ])) 155 | # Attach the feedforward neural network, adjust for nameing conventions 156 | # Define criteria and loss 157 | criterion = nn.NLLLoss() 158 | if model.arch == 'inception_v3': 159 | model.fc = classifier 160 | optimizer = optim.Adam(model.fc.parameters(), lr = hyperparameters['learning_rate']) 161 | 162 | else: 163 | model.classifier = classifier 164 | optimizer = optim.Adam(model.classifier.parameters(), lr = hyperparameters['learning_rate']) 165 | 166 | # Important: Send model to use gpu cuda 167 | model = model.to(device) 168 | model_spec = {'model': model, 'criterion': criterion, 169 | 'optimizer': optimizer, 'classifier':classifier} 170 | 171 | return model_spec 172 | 173 | def evaluate_performance(model, dataloader,criterion, device = 'cuda'): 174 | # Evaluate performance for all batches in an epoch 175 | performance = [evaluate_performance_batch(model, i, criterion) for i in iter(dataloader)] 176 | correct, total = list(map(sum, zip(*performance))) 177 | return correct/total 178 | 179 | def evaluate_performance_batch(model,batch, criterion, device = 'cuda'): 180 | """Evaluate performance for a single batch""" 181 | with torch.no_grad(): 182 | images, labels = tuple(map(lambda x: x.to(device), batch)) 183 | predictions = model.forward(images) 184 | _, predict = torch.max(predictions, 1) 185 | 186 | correct = (predict == labels).sum().item() 187 | total = len(labels) 188 | 189 | return correct, total 190 | 191 | def saveModel(image_dataset, model, classifier, save_dir): 192 | # Saves the pretrained model 193 | with active_session(): 194 | check_point_file = save_dir 
+ model.arch + '_checkpoint.pth' 195 | model.class_to_idx = image_dataset['train'].class_to_idx 196 | 197 | checkpoint_dict = { 198 | 'architecture': 'inception_v3', 199 | 'class_to_idx': model.class_to_idx, 200 | 'state_dict': model.state_dict(), 201 | 'classifier': classifier 202 | } 203 | torch.save(checkpoint_dict, check_point_file) 204 | print("Model saved") 205 | return None 206 | 207 | # ================= Predict Functions ===================== 208 | 209 | def predict(image_path, checkpoint_path, category_names, device, topk): 210 | ''' Predict the class (or classes) of an image using a trained deep learning model. 211 | ''' 212 | # Implement the code to predict the class from an image file 213 | model, resize_aspect = load_model_checkpoint(checkpoint_path) 214 | model.eval() 215 | image = process_image(image_path, resize_aspect) 216 | with open(category_names, 'r') as f: 217 | cat_to_name = json.load(f) 218 | 219 | # use forward propagation to obtain the class probabilities 220 | image = torch.tensor(image, dtype= torch.float).unsqueeze(0).to(device) 221 | predict_prob_tensor = torch.exp(model.forward(image)) # convert log probabilities to real probabilities 222 | predict_prob = predict_prob_tensor.cpu().detach().numpy()[0] # change into numpy array 223 | 224 | # Find the correspoinding top k classes 225 | top_k_idx = predict_prob.argsort()[-topk:][::-1] 226 | probs = predict_prob[top_k_idx] 227 | classes = np.array(list(range(1, 102)))[top_k_idx] 228 | visualize_pred(image, model, probs, classes, cat_to_name, topk) 229 | 230 | return probs, classes 231 | 232 | def load_model_checkpoint(path): 233 | """Load model checkpoint given path""" 234 | checkpoint = torch.load(path, map_location={'cuda:0': 'cpu'}) 235 | model, resize_aspect = get_model(checkpoint['architecture']) 236 | if model.arch == 'inception_v3': 237 | model.fc = checkpoint['classifier'] 238 | else: 239 | model.classifier = checkpoint['classifier'] 240 | 241 | model.load_state_dict(checkpoint['state_dict']) 242 | model.class_to_idx = checkpoint['class_to_idx'] 243 | return model, resize_aspect 244 | 245 | def process_image(image, resize_aspect): 246 | ''' Scales, crops, and normalizes a PIL image for a PyTorch model, 247 | returns an Numpy array 248 | ''' 249 | # Process a PIL image for use in a PyTorch model 250 | im = Image.open(image) 251 | 252 | # resize image to 320 on the shortest side 253 | size = (resize_aspect[0], resize_aspect[0]) 254 | im.thumbnail(size) 255 | 256 | # crop out 299 portion in the center 257 | width, height = im.size 258 | left = (width - resize_aspect[1])/2 259 | top = (height - resize_aspect[1])/2 260 | right = (width + resize_aspect[1])/2 261 | bottom = (height + resize_aspect[1])/2 262 | im = im.crop((left, top, right, bottom)) 263 | 264 | # normalize image 265 | np_image = np.array(im) 266 | im_mean = np.array([0.485, 0.456, 0.406]) 267 | im_sd = np.array([0.229, 0.224, 0.225]) 268 | np_image = (np_image/255 - im_mean)/im_sd 269 | 270 | # transpose the image 271 | np_image = np_image.T 272 | return np_image 273 | 274 | def imshow2(image, ax=None, title=None): 275 | """Returns the original image after preprocessing""" 276 | if ax is None: 277 | fig, ax = plt.subplots() 278 | 279 | # PyTorch tensors assume the color channel is the first dimension 280 | # but matplotlib assumes is the third dimension 281 | image = image.transpose((1, 2, 0)) 282 | 283 | # Undo preprocessing 284 | mean = np.array([0.485, 0.456, 0.406]) 285 | std = np.array([0.229, 0.224, 0.225]) 286 | image = std * image + mean 
287 | 288 | # Image needs to be clipped between 0 and 1 or it looks like noise when displayed 289 | image = np.clip(image, 0, 1) 290 | 291 | #plt.suptitle(title) 292 | ax.imshow(image) 293 | 294 | return ax 295 | 296 | # Display an image along with the top 5 classes 297 | def visualize_pred(image, model, probs, classes, cat_to_name, topk): 298 | """ Visualize the top k probabilities an image is predicted as""" 299 | im = process_image(image) 300 | flower_names = [cat_to_name[str(x)] for x in classes] 301 | 302 | # Build subplots above 303 | fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,10)) 304 | # set axis settings top 305 | imshow2(im, ax =ax1) 306 | ax1.set_title(cat_to_name[image.split('/')[2]]) 307 | # set axis settings bottom 308 | ax2.barh(np.arange(1, topk + 1), probs) 309 | ax2.set_yticks(np.arange(1, topk + 1)) 310 | ax2.set_yticklabels(flower_names) 311 | ax2.set_aspect(0.187) 312 | ax2.set_xlim(0,1) 313 | return None 314 | 315 | #=================== get input args train / predict ====================== 316 | def get_input_args_train(): 317 | parser = argparse.ArgumentParser() 318 | parser.add_argument('data_directory', type=str, default = None, 319 | help="data directory") 320 | parser.add_argument('--save_dir', type=str, default='checkpoints/', 321 | help='save checkpoints to directory') 322 | parser.add_argument('--arch', type=str, default='inception_v3', 323 | help='model architecture') 324 | parser.add_argument('--learning_rate', type=float, default=0.001, 325 | help='learning rate, default 0.001') 326 | parser.add_argument('--hidden_units', type=int, default=500, 327 | help='hidden units, default 500') 328 | parser.add_argument('--print_every', type=int, default=20, 329 | help='print every iterations') 330 | parser.add_argument('--dropout_prob', type=int, default=0.1, 331 | help='print every iterations') 332 | parser.add_argument('--epochs', type=int, default=15, 333 | help='epochs, default 15') 334 | parser.add_argument('--gpu', action='store_true', 335 | default= 'cuda', help='to cuda gpu') 336 | 337 | return parser.parse_args() 338 | 339 | def get_input_args_predict(): 340 | parser = argparse.ArgumentParser() 341 | 342 | parser.add_argument('path_to_image', type=str, default=None, 343 | help='image file to predict') 344 | 345 | parser.add_argument('checkpoint', type=str, default='checkpoints/inception_v3_checkpoint.pth', 346 | help='path to checkpoint') 347 | 348 | parser.add_argument('--topk', type=int, default=5, 349 | help='return top k most likely classes the image belongs to') 350 | parser.add_argument('--category_names', type=str, default='cat_to_name.json', 351 | help='class names mapping') 352 | parser.add_argument('--gpu', default='cuda', 353 | help='use cuda') 354 | 355 | return parser.parse_args() 356 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/predict.py: -------------------------------------------------------------------------------- 1 | from model_spec import * 2 | import argparse 3 | 4 | def main(): 5 | args = get_input_args_predict() 6 | predict(args.path_to_image, args.checkpoint, args.category_names, args. 
gpu, args.topk) 7 | 8 | if __name__ == '__main__': 9 | main() -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/train.py: -------------------------------------------------------------------------------- 1 | from model_spec import * 2 | import argparse 3 | 4 | def main(): 5 | args = get_input_args_train() 6 | # get hyperparameters 7 | hyperparameters = {'architecture': args.arch, 'epochs': args.epochs, 'print_every': args.print_every, 8 | 'hidden_units' : args.hidden_units, 'learning_rate': args.learning_rate, 9 | 'dropout_prob': args.dropout_prob} 10 | 11 | train_model(hyperparameters, data_dir= args.data_directory, save_dir = args.save_dir, device = args.gpu) 12 | 13 | if __name__ == '__main__': 14 | main() 15 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Application/workspace_utils.py: -------------------------------------------------------------------------------- 1 | import signal 2 | 3 | from contextlib import contextmanager 4 | 5 | import requests 6 | 7 | 8 | DELAY = INTERVAL = 4 * 60 # interval time in seconds 9 | MIN_DELAY = MIN_INTERVAL = 2 * 60 10 | KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive" 11 | TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token" 12 | TOKEN_HEADERS = {"Metadata-Flavor":"Google"} 13 | 14 | 15 | def _request_handler(headers): 16 | def _handler(signum, frame): 17 | requests.request("POST", KEEPALIVE_URL, headers=headers) 18 | return _handler 19 | 20 | 21 | @contextmanager 22 | def active_session(delay=DELAY, interval=INTERVAL): 23 | """ 24 | Example: 25 | from workspace_utils import active_session 26 | with active_session(): 27 | # do long-running work here 28 | """ 29 | token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text 30 | headers = {'Authorization': "STAR " + token} 31 | delay = max(delay, MIN_DELAY) 32 | interval = max(interval, MIN_INTERVAL) 33 | original_handler = signal.getsignal(signal.SIGALRM) 34 | try: 35 | signal.signal(signal.SIGALRM, _request_handler(headers)) 36 | signal.setitimer(signal.ITIMER_REAL, delay, interval) 37 | yield 38 | finally: 39 | signal.signal(signal.SIGALRM, original_handler) 40 | signal.setitimer(signal.ITIMER_REAL, 0) 41 | 42 | 43 | def keep_awake(iterable, delay=DELAY, interval=INTERVAL): 44 | """ 45 | Example: 46 | from workspace_utils import keep_awake 47 | for i in keep_awake(range(5)): 48 | # do iteration with lots of work here 49 | """ 50 | with active_session(delay, interval): yield from iterable -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/Readme.md: -------------------------------------------------------------------------------- 1 | # Deep Learning 2 | 3 | ### Project: Image Classifier Project 4 | 5 | 6 | ### Data 7 | 8 | The data for this project is quite large - in fact, it is so large you cannot upload it onto Github. You will be training using 102 different types of flowers, where there are ~20 images per flower to train on. Then you will use your trained classifier to see if you can predict the type for new images of the flowers (Quoted from Udacity). 9 | 10 | ### Jupyter Notebook 11 | 12 | This notebook implements the Inception model in Jupyter notebook format. Most of the functions are static.
To view the notebook, go to the following link 13 | 14 | [Project Notebook: Image Classifier](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%202%20-%20Image%20Classifier%20Application/Image%20Classifier%20Project.ipynb?flush_cache=true) 15 | 16 | ### Application 17 | 18 | The notebook is then converted into a command line application 19 | 20 | Specifications 21 | 22 | The first file, train.py, will train a new network on a dataset and save the model as a checkpoint. The second file, predict.py, uses a trained network to predict the class for an input image. 23 | 24 | Train a new network on a data set with train.py 25 | 26 | Basic usage: python train.py data_directory 27 | * Prints out training loss, validation loss, and validation accuracy as the network trains 28 | 29 | Options: 30 | * Set directory to save checkpoints: python train.py data_dir --save_dir save_directory 31 | * Choose architecture: python train.py data_dir --arch "vgg13" 32 | * Set hyperparameters: python train.py data_dir --learning_rate 0.01 --hidden_units 512 --epochs 20 33 | * Use GPU for training: python train.py data_dir --gpu 34 | 35 | Predict flower name from an image with predict.py along with the probability of that name. That is, you'll pass in a single image * /path/to/image and return the flower name and class probability. 36 | 37 | Basic usage: python predict.py /path/to/image checkpoint 38 | Options: 39 | * Return top K most likely classes: python predict.py input checkpoint --top_k 3 40 | * Use a mapping of categories to real names: python predict.py input checkpoint --category_names cat_to_name.json 41 | * Use GPU for inference: python predict.py input checkpoint --gpu 42 | -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/assets/Flowers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 2 - Image Classifier Application/assets/Flowers.png -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/assets/inference_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 2 - Image Classifier Application/assets/inference_example.png -------------------------------------------------------------------------------- /Project 2 - Image Classifier Application/workspace_utils.py: -------------------------------------------------------------------------------- 1 | import signal 2 | 3 | from contextlib import contextmanager 4 | 5 | import requests 6 | 7 | 8 | DELAY = INTERVAL = 4 * 60 # interval time in seconds 9 | MIN_DELAY = MIN_INTERVAL = 2 * 60 10 | KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive" 11 | TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token" 12 | TOKEN_HEADERS = {"Metadata-Flavor":"Google"} 13 | 14 | 15 | def _request_handler(headers): 16 | def _handler(signum, frame): 17 | requests.request("POST", KEEPALIVE_URL, headers=headers) 18 | return _handler 19 | 20 | 21 | @contextmanager 22 | def active_session(delay=DELAY, interval=INTERVAL): 23 | """ 24 | Example: 25 | 26 | from workspace_utils import active session 27 | 28 | with active_session(): 29 | 
# do long-running work here 30 | """ 31 | token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text 32 | headers = {'Authorization': "STAR " + token} 33 | delay = max(delay, MIN_DELAY) 34 | interval = max(interval, MIN_INTERVAL) 35 | original_handler = signal.getsignal(signal.SIGALRM) 36 | try: 37 | signal.signal(signal.SIGALRM, _request_handler(headers)) 38 | signal.setitimer(signal.ITIMER_REAL, delay, interval) 39 | yield 40 | finally: 41 | signal.signal(signal.SIGALRM, original_handler) 42 | signal.setitimer(signal.ITIMER_REAL, 0) 43 | 44 | 45 | def keep_awake(iterable, delay=DELAY, interval=INTERVAL): 46 | """ 47 | Example: 48 | 49 | from workspace_utils import keep_awake 50 | 51 | for i in keep_awake(range(5)): 52 | # do iteration with lots of work here 53 | """ 54 | with active_session(delay, interval): yield from iterable -------------------------------------------------------------------------------- /Project 3 - Identify Customer Segementation/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Identifying Customer Segments 3 | 4 | **Description** 5 | 6 | In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data used in this project was provided by Udacity's partners at Bertelsmann Arvato Analytics, and represents a real-life data science task. 7 | -------------------------------------------------------------------------------- /Project 4 - Data Science Blog/README.md: -------------------------------------------------------------------------------- 1 | # Understanding Data Scientist Careers in a Data Science Way 2 | 3 | **Project Motivation** 4 | 5 | For this project, I was interested in using the 2011 - 2018 Stack Overflow developer survey data to better understand the career of data scientists. In particular, the questions I am interested in are 6 | 7 | * Does a data science role give you a happier career, or a healthier lifestyle? 8 | * Do data scientists get paid higher salaries for working hard? 9 | * What skills are required to become a data scientist? 10 | 11 | **Installation** 12 | 13 | No extra libraries besides the built-in libraries from Anaconda are needed to run this project 14 | 15 | * numpy 16 | * pandas 17 | * seaborn 18 | * glob 19 | * os 20 | 21 | **File Descriptions** 22 | 23 | * data: Folder containing the Stack Overflow developer survey data files, following the naming convention "YYYY Stack Overflow Survey Responses.csv" 24 | * Understanding the Career of Data Scientists.ipynb: The Jupyter Notebook used for the main analysis 25 | 26 | **Results** 27 | 28 | The main takeaways from this analysis are 29 | 30 | * The best practice for doing data science is to have a question before you collect the data 31 | * Data scientists work hard, but they are satisfied with their careers 32 | * Versatile programming skills and strong communication skills are needed for data scientists 33 | 34 | Specific findings can be found in the Jupyter Notebook and blog post below.
35 | 36 | * [Project Notebook: Understanding the Career of Data Scientists](http://nbviewer.jupyter.org/github/chenbowen184/Data_Scientist_Nanodegree/blob/master/Project%204%20-%20Data%20Science%20Blog/Understanding%20the%20Career%20of%20Data%20Scientists.ipynb) 37 | * [Blog Post: Understanding the Career of Data Scientists Using the Data Science Way](https://medium.com/@bowenchen/understanding-the-career-of-data-scientists-in-a-data-science-way-9bd63817221e) 38 | 39 | **Licensing Acknowledgements** 40 | 41 | Thanks to @StackOverflow for sharing their developer survey data for multiple years 42 | -------------------------------------------------------------------------------- /Project 4 - Data Science Blog/data/2011 Stack Overflow Survey Responses.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 4 - Data Science Blog/data/2011 Stack Overflow Survey Responses.csv -------------------------------------------------------------------------------- /Project 4 - Data Science Blog/data/2012 Stack Overflow Survey Responses.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 4 - Data Science Blog/data/2012 Stack Overflow Survey Responses.csv -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/ETL Pipeline Preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# ETL Pipeline Preparation\n", 8 | "Follow the instructions below to help you create your ETL pipeline.\n", 9 | "### 1. Import libraries and load datasets.\n", 10 | "- Import Python libraries\n", 11 | "- Load `messages.csv` into a dataframe and inspect the first few lines.\n", 12 | "- Load `categories.csv` into a dataframe and inspect the first few lines." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "# import libraries\n", 24 | "import pandas as pd\n", 25 | "import numpy as np\n", 26 | "from sqlalchemy import create_engine" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "data": { 36 | "text/html": [ 37 | "
\n", 38 | "\n", 51 | "\n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | "
idmessageoriginalgenre
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...direct
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirect
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...direct
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...direct
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...direct
\n", 99 | "
" 100 | ], 101 | "text/plain": [ 102 | " id message \\\n", 103 | "0 2 Weather update - a cold front from Cuba that c... \n", 104 | "1 7 Is the Hurricane over or is it not over \n", 105 | "2 8 Looking for someone but no name \n", 106 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 107 | "4 12 says: west side of Haiti, rest of the country ... \n", 108 | "\n", 109 | " original genre \n", 110 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct \n", 111 | "1 Cyclone nan fini osinon li pa fini direct \n", 112 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct \n", 113 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct \n", 114 | "4 facade ouest d Haiti et le reste du pays aujou... direct " 115 | ] 116 | }, 117 | "execution_count": 2, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "# load messages dataset\n", 124 | "messages = pd.read_csv('data/disaster_messages.csv')\n", 125 | "messages.head()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 3, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/html": [ 136 | "
\n", 137 | "\n", 150 | "\n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | "
idcategories
02related-1;request-0;offer-0;aid_related-0;medi...
17related-1;request-0;offer-0;aid_related-1;medi...
28related-1;request-0;offer-0;aid_related-0;medi...
39related-1;request-1;offer-0;aid_related-1;medi...
412related-1;request-0;offer-0;aid_related-0;medi...
\n", 186 | "
" 187 | ], 188 | "text/plain": [ 189 | " id categories\n", 190 | "0 2 related-1;request-0;offer-0;aid_related-0;medi...\n", 191 | "1 7 related-1;request-0;offer-0;aid_related-1;medi...\n", 192 | "2 8 related-1;request-0;offer-0;aid_related-0;medi...\n", 193 | "3 9 related-1;request-1;offer-0;aid_related-1;medi...\n", 194 | "4 12 related-1;request-0;offer-0;aid_related-0;medi..." 195 | ] 196 | }, 197 | "execution_count": 3, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "# load categories dataset\n", 204 | "categories = pd.read_csv('data/disaster_categories.csv')\n", 205 | "categories.head()" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### 2. Merge datasets.\n", 213 | "- Merge the messages and categories datasets using the common id\n", 214 | "- Assign this combined dataset to `df`, which will be cleaned in the following steps" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 4, 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/html": [ 225 | "
\n", 226 | "\n", 239 | "\n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | "
idmessageoriginalgenrecategories
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...directrelated-1;request-0;offer-0;aid_related-0;medi...
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirectrelated-1;request-0;offer-0;aid_related-1;medi...
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...directrelated-1;request-0;offer-0;aid_related-0;medi...
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...directrelated-1;request-1;offer-0;aid_related-1;medi...
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...directrelated-1;request-0;offer-0;aid_related-0;medi...
\n", 293 | "
" 294 | ], 295 | "text/plain": [ 296 | " id message \\\n", 297 | "0 2 Weather update - a cold front from Cuba that c... \n", 298 | "1 7 Is the Hurricane over or is it not over \n", 299 | "2 8 Looking for someone but no name \n", 300 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 301 | "4 12 says: west side of Haiti, rest of the country ... \n", 302 | "\n", 303 | " original genre \\\n", 304 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct \n", 305 | "1 Cyclone nan fini osinon li pa fini direct \n", 306 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct \n", 307 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct \n", 308 | "4 facade ouest d Haiti et le reste du pays aujou... direct \n", 309 | "\n", 310 | " categories \n", 311 | "0 related-1;request-0;offer-0;aid_related-0;medi... \n", 312 | "1 related-1;request-0;offer-0;aid_related-1;medi... \n", 313 | "2 related-1;request-0;offer-0;aid_related-0;medi... \n", 314 | "3 related-1;request-1;offer-0;aid_related-1;medi... \n", 315 | "4 related-1;request-0;offer-0;aid_related-0;medi... " 316 | ] 317 | }, 318 | "execution_count": 4, 319 | "metadata": {}, 320 | "output_type": "execute_result" 321 | } 322 | ], 323 | "source": [ 324 | "# merge datasets\n", 325 | "df = pd.merge(messages, categories, on = 'id')\n", 326 | "df.head()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "### 3. Split `categories` into separate category columns.\n", 334 | "- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.\n", 335 | "- Use the first row of categories dataframe to create column names for the categories data.\n", 336 | "- Rename columns of `categories` with new column names." 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 5, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/html": [ 347 | "
\n", 348 | "\n", 361 | "\n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | "
0123456789...26272829303132333435
0related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
1related-1request-0offer-0aid_related-1medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-1floods-0storm-1fire-0earthquake-0cold-0other_weather-0direct_report-0
2related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
3related-1request-1offer-0aid_related-1medical_help-0medical_products-1search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
4related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
\n", 511 | "

5 rows × 36 columns

\n", 512 | "
" 513 | ], 514 | "text/plain": [ 515 | " 0 1 2 3 4 \\\n", 516 | "0 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 517 | "1 related-1 request-0 offer-0 aid_related-1 medical_help-0 \n", 518 | "2 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 519 | "3 related-1 request-1 offer-0 aid_related-1 medical_help-0 \n", 520 | "4 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 521 | "\n", 522 | " 5 6 7 8 \\\n", 523 | "0 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 524 | "1 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 525 | "2 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 526 | "3 medical_products-1 search_and_rescue-0 security-0 military-0 \n", 527 | "4 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 528 | "\n", 529 | " 9 ... 26 27 \\\n", 530 | "0 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 531 | "1 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 532 | "2 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 533 | "3 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 534 | "4 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 535 | "\n", 536 | " 28 29 30 31 32 33 \\\n", 537 | "0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 538 | "1 weather_related-1 floods-0 storm-1 fire-0 earthquake-0 cold-0 \n", 539 | "2 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 540 | "3 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 541 | "4 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 542 | "\n", 543 | " 34 35 \n", 544 | "0 other_weather-0 direct_report-0 \n", 545 | "1 other_weather-0 direct_report-0 \n", 546 | "2 other_weather-0 direct_report-0 \n", 547 | "3 other_weather-0 direct_report-0 \n", 548 | "4 other_weather-0 direct_report-0 \n", 549 | "\n", 550 | "[5 rows x 36 columns]" 551 | ] 552 | }, 553 | "execution_count": 5, 554 | "metadata": {}, 555 | "output_type": "execute_result" 556 | } 557 | ], 558 | "source": [ 559 | "# create a dataframe of the 36 individual category columns\n", 560 | "categories = df['categories'].str.split(';', expand=True)\n", 561 | "categories.head()" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 6, 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "name": "stdout", 571 | "output_type": "stream", 572 | "text": [ 573 | "0 related\n", 574 | "1 request\n", 575 | "2 offer\n", 576 | "3 aid_related\n", 577 | "4 medical_help\n", 578 | "5 medical_products\n", 579 | "6 search_and_rescue\n", 580 | "7 security\n", 581 | "8 military\n", 582 | "9 child_alone\n", 583 | "10 water\n", 584 | "11 food\n", 585 | "12 shelter\n", 586 | "13 clothing\n", 587 | "14 money\n", 588 | "15 missing_people\n", 589 | "16 refugees\n", 590 | "17 death\n", 591 | "18 other_aid\n", 592 | "19 infrastructure_related\n", 593 | "20 transport\n", 594 | "21 buildings\n", 595 | "22 electricity\n", 596 | "23 tools\n", 597 | "24 hospitals\n", 598 | "25 shops\n", 599 | "26 aid_centers\n", 600 | "27 other_infrastructure\n", 601 | "28 weather_related\n", 602 | "29 floods\n", 603 | "30 storm\n", 604 | "31 fire\n", 605 | "32 earthquake\n", 606 | "33 cold\n", 607 | "34 other_weather\n", 608 | "35 direct_report\n", 609 | "Name: 1, dtype: object\n" 610 | ] 611 | } 612 | ], 613 | "source": [ 614 | "# select the first row of the categories dataframe\n", 615 | "row = categories.iloc[1]\n", 616 | "\n", 617 | "# use this row to extract a list of new column names 
for categories.\n", 618 | "# one way is to apply a lambda function that takes everything \n", 619 | "# up to the second to last character of each string with slicing\n", 620 | "category_colnames = row.apply(lambda x: x.split('-')[0])\n", 621 | "print(category_colnames)" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 7, 627 | "metadata": {}, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/html": [ 632 | "
\n", 633 | "\n", 646 | "\n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | "
1relatedrequestofferaid_relatedmedical_helpmedical_productssearch_and_rescuesecuritymilitarychild_alone...aid_centersother_infrastructureweather_relatedfloodsstormfireearthquakecoldother_weatherdirect_report
0related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
1related-1request-0offer-0aid_related-1medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-1floods-0storm-1fire-0earthquake-0cold-0other_weather-0direct_report-0
2related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
3related-1request-1offer-0aid_related-1medical_help-0medical_products-1search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
4related-1request-0offer-0aid_related-0medical_help-0medical_products-0search_and_rescue-0security-0military-0child_alone-0...aid_centers-0other_infrastructure-0weather_related-0floods-0storm-0fire-0earthquake-0cold-0other_weather-0direct_report-0
\n", 796 | "

5 rows × 36 columns

\n", 797 | "
" 798 | ], 799 | "text/plain": [ 800 | "1 related request offer aid_related medical_help \\\n", 801 | "0 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 802 | "1 related-1 request-0 offer-0 aid_related-1 medical_help-0 \n", 803 | "2 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 804 | "3 related-1 request-1 offer-0 aid_related-1 medical_help-0 \n", 805 | "4 related-1 request-0 offer-0 aid_related-0 medical_help-0 \n", 806 | "\n", 807 | "1 medical_products search_and_rescue security military \\\n", 808 | "0 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 809 | "1 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 810 | "2 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 811 | "3 medical_products-1 search_and_rescue-0 security-0 military-0 \n", 812 | "4 medical_products-0 search_and_rescue-0 security-0 military-0 \n", 813 | "\n", 814 | "1 child_alone ... aid_centers other_infrastructure \\\n", 815 | "0 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 816 | "1 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 817 | "2 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 818 | "3 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 819 | "4 child_alone-0 ... aid_centers-0 other_infrastructure-0 \n", 820 | "\n", 821 | "1 weather_related floods storm fire earthquake cold \\\n", 822 | "0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 823 | "1 weather_related-1 floods-0 storm-1 fire-0 earthquake-0 cold-0 \n", 824 | "2 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 825 | "3 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 826 | "4 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 \n", 827 | "\n", 828 | "1 other_weather direct_report \n", 829 | "0 other_weather-0 direct_report-0 \n", 830 | "1 other_weather-0 direct_report-0 \n", 831 | "2 other_weather-0 direct_report-0 \n", 832 | "3 other_weather-0 direct_report-0 \n", 833 | "4 other_weather-0 direct_report-0 \n", 834 | "\n", 835 | "[5 rows x 36 columns]" 836 | ] 837 | }, 838 | "execution_count": 7, 839 | "metadata": {}, 840 | "output_type": "execute_result" 841 | } 842 | ], 843 | "source": [ 844 | "# rename the columns of `categories`\n", 845 | "categories.columns = category_colnames\n", 846 | "categories.head()" 847 | ] 848 | }, 849 | { 850 | "cell_type": "markdown", 851 | "metadata": {}, 852 | "source": [ 853 | "### 4. Convert category values to just numbers 0 or 1.\n", 854 | "- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.\n", 855 | "- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`." 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": 8, 861 | "metadata": {}, 862 | "outputs": [ 863 | { 864 | "data": { 865 | "text/html": [ 866 | "
\n", 867 | "\n", 880 | "\n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | "
1relatedrequestofferaid_relatedmedical_helpmedical_productssearch_and_rescuesecuritymilitarychild_alone...aid_centersother_infrastructureweather_relatedfloodsstormfireearthquakecoldother_weatherdirect_report
01000000000...0000000000
11001000000...0010100000
21000000000...0000000000
31101010000...0000000000
41000000000...0000000000
\n", 1030 | "

5 rows × 36 columns

\n", 1031 | "
" 1032 | ], 1033 | "text/plain": [ 1034 | "1 related request offer aid_related medical_help medical_products \\\n", 1035 | "0 1 0 0 0 0 0 \n", 1036 | "1 1 0 0 1 0 0 \n", 1037 | "2 1 0 0 0 0 0 \n", 1038 | "3 1 1 0 1 0 1 \n", 1039 | "4 1 0 0 0 0 0 \n", 1040 | "\n", 1041 | "1 search_and_rescue security military child_alone ... \\\n", 1042 | "0 0 0 0 0 ... \n", 1043 | "1 0 0 0 0 ... \n", 1044 | "2 0 0 0 0 ... \n", 1045 | "3 0 0 0 0 ... \n", 1046 | "4 0 0 0 0 ... \n", 1047 | "\n", 1048 | "1 aid_centers other_infrastructure weather_related floods storm fire \\\n", 1049 | "0 0 0 0 0 0 0 \n", 1050 | "1 0 0 1 0 1 0 \n", 1051 | "2 0 0 0 0 0 0 \n", 1052 | "3 0 0 0 0 0 0 \n", 1053 | "4 0 0 0 0 0 0 \n", 1054 | "\n", 1055 | "1 earthquake cold other_weather direct_report \n", 1056 | "0 0 0 0 0 \n", 1057 | "1 0 0 0 0 \n", 1058 | "2 0 0 0 0 \n", 1059 | "3 0 0 0 0 \n", 1060 | "4 0 0 0 0 \n", 1061 | "\n", 1062 | "[5 rows x 36 columns]" 1063 | ] 1064 | }, 1065 | "execution_count": 8, 1066 | "metadata": {}, 1067 | "output_type": "execute_result" 1068 | } 1069 | ], 1070 | "source": [ 1071 | "for column in categories:\n", 1072 | " # set each value to be the last character of the string\n", 1073 | " categories[column] = categories[column].apply(lambda x: int(x.split('-')[1]))\n", 1074 | " \n", 1075 | "categories.head()" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "markdown", 1080 | "metadata": {}, 1081 | "source": [ 1082 | "### 5. Replace `categories` column in `df` with new category columns.\n", 1083 | "- Drop the categories column from the df dataframe since it is no longer needed.\n", 1084 | "- Concatenate df and categories data frames." 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": 9, 1090 | "metadata": {}, 1091 | "outputs": [ 1092 | { 1093 | "data": { 1094 | "text/html": [ 1095 | "
\n", 1096 | "\n", 1109 | "\n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | "
idmessageoriginalgenre
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...direct
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirect
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...direct
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...direct
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...direct
\n", 1157 | "
" 1158 | ], 1159 | "text/plain": [ 1160 | " id message \\\n", 1161 | "0 2 Weather update - a cold front from Cuba that c... \n", 1162 | "1 7 Is the Hurricane over or is it not over \n", 1163 | "2 8 Looking for someone but no name \n", 1164 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 1165 | "4 12 says: west side of Haiti, rest of the country ... \n", 1166 | "\n", 1167 | " original genre \n", 1168 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct \n", 1169 | "1 Cyclone nan fini osinon li pa fini direct \n", 1170 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct \n", 1171 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct \n", 1172 | "4 facade ouest d Haiti et le reste du pays aujou... direct " 1173 | ] 1174 | }, 1175 | "execution_count": 9, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "# drop the original categories column from `df`\n", 1182 | "df.drop('categories', axis = 1, inplace = True)\n", 1183 | "\n", 1184 | "df.head()" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": 10, 1190 | "metadata": {}, 1191 | "outputs": [ 1192 | { 1193 | "data": { 1194 | "text/html": [ 1195 | "
\n", 1196 | "\n", 1209 | "\n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | "
idmessageoriginalgenrerelatedrequestofferaid_relatedmedical_helpmedical_products...aid_centersother_infrastructureweather_relatedfloodsstormfireearthquakecoldother_weatherdirect_report
02Weather update - a cold front from Cuba that c...Un front froid se retrouve sur Cuba ce matin. ...direct100000...0000000000
17Is the Hurricane over or is it not overCyclone nan fini osinon li pa finidirect100100...0010100000
28Looking for someone but no namePatnm, di Maryani relem pou li banm nouvel li ...direct100000...0000000000
39UN reports Leogane 80-90 destroyed. Only Hospi...UN reports Leogane 80-90 destroyed. Only Hospi...direct110101...0000000000
412says: west side of Haiti, rest of the country ...facade ouest d Haiti et le reste du pays aujou...direct100000...0000000000
\n", 1359 | "

5 rows × 40 columns

\n", 1360 | "
" 1361 | ], 1362 | "text/plain": [ 1363 | " id message \\\n", 1364 | "0 2 Weather update - a cold front from Cuba that c... \n", 1365 | "1 7 Is the Hurricane over or is it not over \n", 1366 | "2 8 Looking for someone but no name \n", 1367 | "3 9 UN reports Leogane 80-90 destroyed. Only Hospi... \n", 1368 | "4 12 says: west side of Haiti, rest of the country ... \n", 1369 | "\n", 1370 | " original genre related \\\n", 1371 | "0 Un front froid se retrouve sur Cuba ce matin. ... direct 1 \n", 1372 | "1 Cyclone nan fini osinon li pa fini direct 1 \n", 1373 | "2 Patnm, di Maryani relem pou li banm nouvel li ... direct 1 \n", 1374 | "3 UN reports Leogane 80-90 destroyed. Only Hospi... direct 1 \n", 1375 | "4 facade ouest d Haiti et le reste du pays aujou... direct 1 \n", 1376 | "\n", 1377 | " request offer aid_related medical_help medical_products ... \\\n", 1378 | "0 0 0 0 0 0 ... \n", 1379 | "1 0 0 1 0 0 ... \n", 1380 | "2 0 0 0 0 0 ... \n", 1381 | "3 1 0 1 0 1 ... \n", 1382 | "4 0 0 0 0 0 ... \n", 1383 | "\n", 1384 | " aid_centers other_infrastructure weather_related floods storm fire \\\n", 1385 | "0 0 0 0 0 0 0 \n", 1386 | "1 0 0 1 0 1 0 \n", 1387 | "2 0 0 0 0 0 0 \n", 1388 | "3 0 0 0 0 0 0 \n", 1389 | "4 0 0 0 0 0 0 \n", 1390 | "\n", 1391 | " earthquake cold other_weather direct_report \n", 1392 | "0 0 0 0 0 \n", 1393 | "1 0 0 0 0 \n", 1394 | "2 0 0 0 0 \n", 1395 | "3 0 0 0 0 \n", 1396 | "4 0 0 0 0 \n", 1397 | "\n", 1398 | "[5 rows x 40 columns]" 1399 | ] 1400 | }, 1401 | "execution_count": 10, 1402 | "metadata": {}, 1403 | "output_type": "execute_result" 1404 | } 1405 | ], 1406 | "source": [ 1407 | "# concatenate the original dataframe with the new `categories` dataframe\n", 1408 | "df = df.join(categories)\n", 1409 | "df.head()" 1410 | ] 1411 | }, 1412 | { 1413 | "cell_type": "markdown", 1414 | "metadata": {}, 1415 | "source": [ 1416 | "### 6. Remove duplicates.\n", 1417 | "- Check how many duplicates are in this dataset.\n", 1418 | "- Drop the duplicates.\n", 1419 | "- Confirm duplicates were removed." 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 11, 1425 | "metadata": {}, 1426 | "outputs": [ 1427 | { 1428 | "data": { 1429 | "text/plain": [ 1430 | "170" 1431 | ] 1432 | }, 1433 | "execution_count": 11, 1434 | "metadata": {}, 1435 | "output_type": "execute_result" 1436 | } 1437 | ], 1438 | "source": [ 1439 | "# check number of duplicates\n", 1440 | "sum(df.duplicated())" 1441 | ] 1442 | }, 1443 | { 1444 | "cell_type": "code", 1445 | "execution_count": 12, 1446 | "metadata": { 1447 | "collapsed": true 1448 | }, 1449 | "outputs": [], 1450 | "source": [ 1451 | "# drop duplicates\n", 1452 | "df = df.drop_duplicates()" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "code", 1457 | "execution_count": 13, 1458 | "metadata": {}, 1459 | "outputs": [ 1460 | { 1461 | "data": { 1462 | "text/plain": [ 1463 | "0" 1464 | ] 1465 | }, 1466 | "execution_count": 13, 1467 | "metadata": {}, 1468 | "output_type": "execute_result" 1469 | } 1470 | ], 1471 | "source": [ 1472 | "# check number of duplicates\n", 1473 | "sum(df.duplicated())" 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "markdown", 1478 | "metadata": {}, 1479 | "source": [ 1480 | "### 7. Save the clean dataset into an sqlite database.\n", 1481 | "You can do this with pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. 
Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below." 1482 | ] 1483 | }, 1484 | { 1485 | "cell_type": "code", 1486 | "execution_count": 14, 1487 | "metadata": { 1488 | "collapsed": true 1489 | }, 1490 | "outputs": [], 1491 | "source": [ 1492 | "engine = create_engine('sqlite:///DisasterResponse.db')\n", 1493 | "df.to_sql('DisasterResponse', engine, index=False)" 1494 | ] 1495 | }, 1496 | { 1497 | "cell_type": "markdown", 1498 | "metadata": {}, 1499 | "source": [ 1500 | "### 8. Use this notebook to complete `etl_pipeline.py`\n", 1501 | "Use the template file attached in the Resources folder to write a script that runs the steps above to create a database based on new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom on the `Project Workspace IDE` coming later." 1502 | ] 1503 | } 1504 | ], 1505 | "metadata": { 1506 | "kernelspec": { 1507 | "display_name": "Python 3", 1508 | "language": "python", 1509 | "name": "python3" 1510 | }, 1511 | "language_info": { 1512 | "codemirror_mode": { 1513 | "name": "ipython", 1514 | "version": 3 1515 | }, 1516 | "file_extension": ".py", 1517 | "mimetype": "text/x-python", 1518 | "name": "python", 1519 | "nbconvert_exporter": "python", 1520 | "pygments_lexer": "ipython3", 1521 | "version": "3.6.3" 1522 | } 1523 | }, 1524 | "nbformat": 4, 1525 | "nbformat_minor": 2 1526 | } 1527 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/ML Pipeline Preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# ML Pipeline Preparation\n", 8 | "Follow the instructions below to help you create your ML pipeline.\n", 9 | "### 1. 
Import libraries and load data from database.\n", 10 | "- Import Python libraries\n", 11 | "- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)\n", 12 | "- Define feature and target variables X and Y" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "# import libraries\n", 24 | "import pandas as pd\n", 25 | "import numpy as np\n", 26 | "import pickle\n", 27 | "from sqlalchemy import create_engine\n", 28 | "import warnings\n", 29 | "warnings.filterwarnings(\"ignore\")" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "# import NLP libraries\n", 41 | "import re\n", 42 | "import nltk \n", 43 | "from nltk.corpus import stopwords\n", 44 | "from nltk.tokenize import word_tokenize\n", 45 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 46 | "# nltk.download('punkt')\n", 47 | "# nltk.download('stopwords')\n", 48 | "# nltk.download('wordnet') # download for lemmatization" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "# import sklearn\n", 60 | "from sklearn.pipeline import Pipeline\n", 61 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n", 62 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 63 | "from sklearn.multioutput import MultiOutputClassifier\n", 64 | "from sklearn.metrics import precision_score, recall_score, f1_score\n", 65 | "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "# load data from database\n", 77 | "engine = create_engine('sqlite:///data/DisasterResponse.db')\n", 78 | "df = pd.read_sql_table('DisasterResponse', engine)\n", 79 | "X = df['message']\n", 80 | "Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### 2. Write a tokenization function to process your text data" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 5, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "def tokenize(text):\n", 99 | " # Define url pattern\n", 100 | " url_re = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'\n", 101 | " \n", 102 | " # Detect and replace urls\n", 103 | " detected_urls = re.findall(url_re, text)\n", 104 | " for url in detected_urls:\n", 105 | " text = text.replace(url, \"urlplaceholder\")\n", 106 | " \n", 107 | " # tokenize sentences\n", 108 | " tokens = word_tokenize(text)\n", 109 | " lemmatizer = WordNetLemmatizer()\n", 110 | " \n", 111 | " # save cleaned tokens\n", 112 | " clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]\n", 113 | " \n", 114 | " # remove stopwords\n", 115 | " STOPWORDS = list(set(stopwords.words('english')))\n", 116 | " clean_tokens = [token for token in clean_tokens if token not in STOPWORDS]\n", 117 | " \n", 118 | " return clean_tokens" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### 3. 
Build a machine learning pipeline\n", 126 | "- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "def build_pipeline():\n", 138 | " \n", 139 | " # build NLP pipeline - count words, tf-idf, multiple output classifier\n", 140 | " pipeline = Pipeline([\n", 141 | " ('vec', CountVectorizer(tokenizer=tokenize)),\n", 142 | " ('tfidf', TfidfTransformer()),\n", 143 | " ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators = 100, n_jobs = 6)))\n", 144 | " ])\n", 145 | " \n", 146 | " return pipeline" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "### 4. Train pipeline\n", 154 | "- Split data into train and test sets\n", 155 | "- Train pipeline" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 7, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "Pipeline(memory=None,\n", 167 | " steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 168 | " dtype=, encoding='utf-8', input='content',\n", 169 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 170 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 171 | " strip_..._score=False, random_state=None, verbose=0,\n", 172 | " warm_start=False),\n", 173 | " n_jobs=None))])" 174 | ] 175 | }, 176 | "execution_count": 7, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "X_train, X_test, y_train, y_test = train_test_split(X, Y)\n", 183 | "pipeline = build_pipeline()\n", 184 | "pipeline.fit(X_train, y_train)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### 5. Test your model\n", 192 | "Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each." 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 8, 198 | "metadata": { 199 | "collapsed": true 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "def build_report(pipeline, X_test, y_test):\n", 204 | " # predict on the X_test\n", 205 | " y_pred = pipeline.predict(X_test)\n", 206 | " \n", 207 | " # build classification report on every column\n", 208 | " performances = []\n", 209 | " for i in range(len(y_test.columns)):\n", 210 | " performances.append([f1_score(y_test.iloc[:, i].values, y_pred[:, i], average='micro'),\n", 211 | " precision_score(y_test.iloc[:, i].values, y_pred[:, i], average='micro'),\n", 212 | " recall_score(y_test.iloc[:, i].values, y_pred[:, i], average='micro')])\n", 213 | " # build dataframe\n", 214 | " performances = pd.DataFrame(performances, columns=['f1 score', 'precision', 'recall'],\n", 215 | " index = y_test.columns) \n", 216 | " return performances" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 9, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/html": [ 227 | "
\n", 228 | "\n", 241 | "\n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | "
f1 scoreprecisionrecall
related0.8016480.8016480.801648
request0.8948730.8948730.894873
offer0.9958800.9958800.995880
aid_related0.7773880.7773880.777388
medical_help0.9205070.9205070.920507
medical_products0.9560570.9560570.956057
search_and_rescue0.9717730.9717730.971773
security0.9823010.9823010.982301
military0.9681110.9681110.968111
child_alone1.0000001.0000001.000000
water0.9581930.9581930.958193
food0.9401890.9401890.940189
shelter0.9350020.9350020.935002
clothing0.9867260.9867260.986726
money0.9786390.9786390.978639
missing_people0.9899300.9899300.989930
refugees0.9687210.9687210.968721
death0.9621610.9621610.962161
other_aid0.8712240.8712240.871224
infrastructure_related0.9346960.9346960.934696
transport0.9537690.9537690.953769
buildings0.9517850.9517850.951785
electricity0.9812330.9812330.981233
tools0.9948120.9948120.994812
hospitals0.9905400.9905400.990540
shops0.9952700.9952700.995270
aid_centers0.9879460.9879460.987946
other_infrastructure0.9556000.9556000.955600
weather_related0.8787000.8787000.878700
floods0.9499540.9499540.949954
storm0.9372900.9372900.937290
fire0.9914560.9914560.991456
earthquake0.9710100.9710100.971010
cold0.9806230.9806230.980623
other_weather0.9469030.9469030.946903
direct_report0.8675620.8675620.867562
\n", 469 | "
" 470 | ], 471 | "text/plain": [ 472 | " f1 score precision recall\n", 473 | "related 0.801648 0.801648 0.801648\n", 474 | "request 0.894873 0.894873 0.894873\n", 475 | "offer 0.995880 0.995880 0.995880\n", 476 | "aid_related 0.777388 0.777388 0.777388\n", 477 | "medical_help 0.920507 0.920507 0.920507\n", 478 | "medical_products 0.956057 0.956057 0.956057\n", 479 | "search_and_rescue 0.971773 0.971773 0.971773\n", 480 | "security 0.982301 0.982301 0.982301\n", 481 | "military 0.968111 0.968111 0.968111\n", 482 | "child_alone 1.000000 1.000000 1.000000\n", 483 | "water 0.958193 0.958193 0.958193\n", 484 | "food 0.940189 0.940189 0.940189\n", 485 | "shelter 0.935002 0.935002 0.935002\n", 486 | "clothing 0.986726 0.986726 0.986726\n", 487 | "money 0.978639 0.978639 0.978639\n", 488 | "missing_people 0.989930 0.989930 0.989930\n", 489 | "refugees 0.968721 0.968721 0.968721\n", 490 | "death 0.962161 0.962161 0.962161\n", 491 | "other_aid 0.871224 0.871224 0.871224\n", 492 | "infrastructure_related 0.934696 0.934696 0.934696\n", 493 | "transport 0.953769 0.953769 0.953769\n", 494 | "buildings 0.951785 0.951785 0.951785\n", 495 | "electricity 0.981233 0.981233 0.981233\n", 496 | "tools 0.994812 0.994812 0.994812\n", 497 | "hospitals 0.990540 0.990540 0.990540\n", 498 | "shops 0.995270 0.995270 0.995270\n", 499 | "aid_centers 0.987946 0.987946 0.987946\n", 500 | "other_infrastructure 0.955600 0.955600 0.955600\n", 501 | "weather_related 0.878700 0.878700 0.878700\n", 502 | "floods 0.949954 0.949954 0.949954\n", 503 | "storm 0.937290 0.937290 0.937290\n", 504 | "fire 0.991456 0.991456 0.991456\n", 505 | "earthquake 0.971010 0.971010 0.971010\n", 506 | "cold 0.980623 0.980623 0.980623\n", 507 | "other_weather 0.946903 0.946903 0.946903\n", 508 | "direct_report 0.867562 0.867562 0.867562" 509 | ] 510 | }, 511 | "execution_count": 9, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "build_report(pipeline, X_test, y_test)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "### 6. Improve your model\n", 525 | "Use grid search to find better parameters. 
" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 10, 531 | "metadata": {}, 532 | "outputs": [ 533 | { 534 | "data": { 535 | "text/plain": [ 536 | "GridSearchCV(cv=5, error_score='raise-deprecating',\n", 537 | " estimator=Pipeline(memory=None,\n", 538 | " steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 539 | " dtype=, encoding='utf-8', input='content',\n", 540 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 541 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 542 | " strip_..._score=False, random_state=None, verbose=0,\n", 543 | " warm_start=False),\n", 544 | " n_jobs=None))]),\n", 545 | " fit_params=None, iid='warn', n_jobs=6,\n", 546 | " param_grid={'clf__estimator__max_features': ['sqrt', 0.5], 'clf__estimator__n_estimators': [50, 100]},\n", 547 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 548 | " scoring=None, verbose=0)" 549 | ] 550 | }, 551 | "execution_count": 10, 552 | "metadata": {}, 553 | "output_type": "execute_result" 554 | } 555 | ], 556 | "source": [ 557 | "parameters = {'clf__estimator__max_features':['sqrt', 0.5],\n", 558 | " 'clf__estimator__n_estimators':[50, 100]}\n", 559 | "\n", 560 | "cv = GridSearchCV(estimator=pipeline, param_grid = parameters, cv = 5, n_jobs = 6)\n", 561 | "cv.fit(X_train, y_train)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "### 7. Test your model\n", 569 | "Show the accuracy, precision, and recall of the tuned model. \n", 570 | "\n", 571 | "Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 11, 577 | "metadata": {}, 578 | "outputs": [ 579 | { 580 | "data": { 581 | "text/html": [ 582 | "
" 825 | ], 826 | "text/plain": [ 827 | " f1 score precision recall\n", 828 | "related 0.801953 0.801953 0.801953\n", 829 | "request 0.888007 0.888007 0.888007\n", 830 | "offer 0.995270 0.995270 0.995270\n", 831 | "aid_related 0.765334 0.765334 0.765334\n", 832 | "medical_help 0.920659 0.920659 0.920659\n", 833 | "medical_products 0.962313 0.962313 0.962313\n", 834 | "search_and_rescue 0.970552 0.970552 0.970552\n", 835 | "security 0.978944 0.978944 0.978944\n", 836 | "military 0.966890 0.966890 0.966890\n", 837 | "child_alone 1.000000 1.000000 1.000000\n", 838 | "water 0.966280 0.966280 0.966280\n", 839 | "food 0.951480 0.951480 0.951480\n", 840 | "shelter 0.948581 0.948581 0.948581\n", 841 | "clothing 0.989014 0.989014 0.989014\n", 842 | "money 0.978486 0.978486 0.978486\n", 843 | "missing_people 0.990388 0.990388 0.990388\n", 844 | "refugees 0.971620 0.971620 0.971620\n", 845 | "death 0.973604 0.973604 0.973604\n", 846 | "other_aid 0.868630 0.868630 0.868630\n", 847 | "infrastructure_related 0.929661 0.929661 0.929661\n", 848 | "transport 0.954379 0.954379 0.954379\n", 849 | "buildings 0.956363 0.956363 0.956363\n", 850 | "electricity 0.979707 0.979707 0.979707\n", 851 | "tools 0.993592 0.993592 0.993592\n", 852 | "hospitals 0.988404 0.988404 0.988404\n", 853 | "shops 0.994355 0.994355 0.994355\n", 854 | "aid_centers 0.987794 0.987794 0.987794\n", 855 | "other_infrastructure 0.952395 0.952395 0.952395\n", 856 | "weather_related 0.880684 0.880684 0.880684\n", 857 | "floods 0.955752 0.955752 0.955752\n", 858 | "storm 0.945682 0.945682 0.945682\n", 859 | "fire 0.991913 0.991913 0.991913\n", 860 | "earthquake 0.973146 0.973146 0.973146\n", 861 | "cold 0.983064 0.983064 0.983064\n", 862 | "other_weather 0.941105 0.941105 0.941105\n", 863 | "direct_report 0.856424 0.856424 0.856424" 864 | ] 865 | }, 866 | "execution_count": 11, 867 | "metadata": {}, 868 | "output_type": "execute_result" 869 | } 870 | ], 871 | "source": [ 872 | "build_report(cv, X_test, y_test)" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 16, 878 | "metadata": {}, 879 | "outputs": [ 880 | { 881 | "data": { 882 | "text/plain": [ 883 | "{'clf__estimator__max_features': 0.5, 'clf__estimator__n_estimators': 100}" 884 | ] 885 | }, 886 | "execution_count": 16, 887 | "metadata": {}, 888 | "output_type": "execute_result" 889 | } 890 | ], 891 | "source": [ 892 | "cv.best_params_" 893 | ] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": {}, 898 | "source": [ 899 | "### 8. Try improving your model further. Here are a few ideas:\n", 900 | "* try other machine learning algorithms\n", 901 | "* add other features besides the TF-IDF" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": 12, 907 | "metadata": {}, 908 | "outputs": [ 909 | { 910 | "data": { 911 | "text/html": [ 912 | "
" 1155 | ], 1156 | "text/plain": [ 1157 | " f1 score precision recall\n", 1158 | "related 0.762893 0.762893 0.762893\n", 1159 | "request 0.892127 0.892127 0.892127\n", 1160 | "offer 0.994049 0.994049 0.994049\n", 1161 | "aid_related 0.767318 0.767318 0.767318\n", 1162 | "medical_help 0.923711 0.923711 0.923711\n", 1163 | "medical_products 0.961398 0.961398 0.961398\n", 1164 | "search_and_rescue 0.970705 0.970705 0.970705\n", 1165 | "security 0.977876 0.977876 0.977876\n", 1166 | "military 0.971468 0.971468 0.971468\n", 1167 | "child_alone 1.000000 1.000000 1.000000\n", 1168 | "water 0.963381 0.963381 0.963381\n", 1169 | "food 0.946903 0.946903 0.946903\n", 1170 | "shelter 0.941868 0.941868 0.941868\n", 1171 | "clothing 0.987946 0.987946 0.987946\n", 1172 | "money 0.977418 0.977418 0.977418\n", 1173 | "missing_people 0.989319 0.989319 0.989319\n", 1174 | "refugees 0.969484 0.969484 0.969484\n", 1175 | "death 0.968721 0.968721 0.968721\n", 1176 | "other_aid 0.868782 0.868782 0.868782\n", 1177 | "infrastructure_related 0.928746 0.928746 0.928746\n", 1178 | "transport 0.955447 0.955447 0.955447\n", 1179 | "buildings 0.954837 0.954837 0.954837\n", 1180 | "electricity 0.980928 0.980928 0.980928\n", 1181 | "tools 0.993592 0.993592 0.993592\n", 1182 | "hospitals 0.987489 0.987489 0.987489\n", 1183 | "shops 0.994049 0.994049 0.994049\n", 1184 | "aid_centers 0.986726 0.986726 0.986726\n", 1185 | "other_infrastructure 0.951938 0.951938 0.951938\n", 1186 | "weather_related 0.876259 0.876259 0.876259\n", 1187 | "floods 0.953616 0.953616 0.953616\n", 1188 | "storm 0.938969 0.938969 0.938969\n", 1189 | "fire 0.991150 0.991150 0.991150\n", 1190 | "earthquake 0.970705 0.970705 0.970705\n", 1191 | "cold 0.981843 0.981843 0.981843\n", 1192 | "other_weather 0.942630 0.942630 0.942630\n", 1193 | "direct_report 0.858865 0.858865 0.858865" 1194 | ] 1195 | }, 1196 | "execution_count": 12, 1197 | "metadata": {}, 1198 | "output_type": "execute_result" 1199 | } 1200 | ], 1201 | "source": [ 1202 | "pipeline_improved = Pipeline([\n", 1203 | " ('vect', CountVectorizer(tokenizer=tokenize)),\n", 1204 | " ('tfidf', TfidfTransformer()),\n", 1205 | " ('clf', MultiOutputClassifier(AdaBoostClassifier(n_estimators = 100)))\n", 1206 | " ])\n", 1207 | "pipeline_improved.fit(X_train, y_train)\n", 1208 | "y_pred_improved = pipeline_improved.predict(X_test)\n", 1209 | "build_report(pipeline_improved, X_test, y_test)" 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "markdown", 1214 | "metadata": {}, 1215 | "source": [ 1216 | "### 9. Export your model as a pickle file" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 18, 1222 | "metadata": { 1223 | "collapsed": true 1224 | }, 1225 | "outputs": [], 1226 | "source": [ 1227 | "pickle.dump(pipeline, open('rf_model.pkl', 'wb'))" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "code", 1232 | "execution_count": 14, 1233 | "metadata": { 1234 | "collapsed": true 1235 | }, 1236 | "outputs": [], 1237 | "source": [ 1238 | "pickle.dump(pipeline_improved, open('adaboost_model.pkl', 'wb'))" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "metadata": {}, 1244 | "source": [ 1245 | "### 10. Use this notebook to complete `train.py`\n", 1246 | "Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user." 
1247 | ] 1248 | } 1249 | ], 1250 | "metadata": { 1251 | "kernelspec": { 1252 | "display_name": "Python 3", 1253 | "language": "python", 1254 | "name": "python3" 1255 | }, 1256 | "language_info": { 1257 | "codemirror_mode": { 1258 | "name": "ipython", 1259 | "version": 3 1260 | }, 1261 | "file_extension": ".py", 1262 | "mimetype": "text/x-python", 1263 | "name": "python", 1264 | "nbconvert_exporter": "python", 1265 | "pygments_lexer": "ipython3", 1266 | "version": "3.6.3" 1267 | } 1268 | }, 1269 | "nbformat": 4, 1270 | "nbformat_minor": 2 1271 | } 1272 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/README.md: -------------------------------------------------------------------------------- 1 | # Disaster Response Pipeline Project 2 | 3 | ### Project Description: 4 | 5 | In this project, I built a data transformation and machine learning pipeline that classifies disaster messages into their relevant categories. The pipeline is served through a Flask application. The project includes a web app where an emergency worker can input a new message and get classification results in several categories. The landing page of the web app also includes four visualizations of the training dataset built with Plotly. 6 | 7 | ### File Descriptions: 8 | The project contains the following files: 9 | 10 | * ETL Pipeline Preparation.ipynb: Notebook experiments for the ETL pipeline 11 | * ML Pipeline Preparation.ipynb: Notebook experiments for the machine learning pipeline 12 | * data/process_data.py: The ETL pipeline used to process data in preparation for model building. 13 | * models/train_classifier.py: The machine learning pipeline used to fit, tune, evaluate, and export the model to a Python pickle (the pickle is not uploaded to the repo due to GitHub size constraints). 14 | * app/templates/~.html: HTML pages for the web app. 15 | * app/app.py: Starts the Python server for the web app and prepares the visualizations. 16 | 17 | The app is deployed on Heroku at this [link](https://disaster-response-app184.herokuapp.com/) 18 | 19 | Example message to classify: "Help, Fire!" 20 | 21 | ### Local Instructions: 22 | 1. Run the following commands in the project's root directory to set up the database and model. 23 | 24 | - To run the ETL pipeline that cleans data and stores it in the database: 25 | `python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db` 26 | - To run the ML pipeline that trains and saves the classifier: 27 | `python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl` 28 | 29 | 2. Run the following command in the app's directory to run the web app. 30 | `python app.py` 31 | 32 | 3. 
Go to http://127.0.0.1:5000/ 33 | 34 | 35 | ![Webapp Screenshot](https://raw.githubusercontent.com/chenbowen184/Data_Science_Portfolio/master/Project%205%20-%20Disaster%20Response%20Pipeline/app/webapp%20screenshot.png) 36 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/app.py: -------------------------------------------------------------------------------- 1 | import json 2 | import plotly 3 | import pandas as pd 4 | import re 5 | from collections import Counter 6 | 7 | # import NLP libraries 8 | from tokenizer_function import Tokenizer, tokenize 9 | 10 | from flask import Flask 11 | from flask import render_template, request, jsonify 12 | from plotly.graph_objs import Bar 13 | from sklearn.externals import joblib 14 | from sqlalchemy import create_engine 15 | 16 | 17 | app = Flask(__name__) 18 | 19 | 20 | @app.before_first_request 21 | 22 | def load_model_data(): 23 | global df 24 | global model 25 | # load data 26 | 27 | engine = create_engine('sqlite:///data/DisasterResponse.db') 28 | df = pd.read_sql_table('DisasterResponse', engine) 29 | 30 | # load model 31 | model = joblib.load("models/adaboost_model.pkl") 32 | 33 | # index webpage displays cool visuals and receives user input text for model 34 | @app.route('/') 35 | @app.route('/index') 36 | 37 | def index(): 38 | 39 | # extract data needed for visuals 40 | # Message counts of different generes 41 | genre_counts = df.groupby('genre').count()['message'] 42 | genre_names = list(genre_counts.index) 43 | 44 | # Message counts for different categories 45 | cate_counts_df = df.iloc[:, 4:].sum().sort_values(ascending=False) 46 | cate_counts = list(cate_counts_df) 47 | cate_names = list(cate_counts_df.index) 48 | 49 | # Top keywords in Social Media in percentages 50 | social_media_messages = ' '.join(df[df['genre'] == 'social']['message']) 51 | social_media_tokens = tokenize(social_media_messages) 52 | social_media_wrd_counter = Counter(social_media_tokens).most_common() 53 | social_media_wrd_cnt = [i[1] for i in social_media_wrd_counter] 54 | social_media_wrd_pct = [i/sum(social_media_wrd_cnt) *100 for i in social_media_wrd_cnt] 55 | social_media_wrds = [i[0] for i in social_media_wrd_counter] 56 | 57 | # Top keywords in Direct in percentages 58 | direct_messages = ' '.join(df[df['genre'] == 'direct']['message']) 59 | direct_tokens = tokenize(direct_messages) 60 | direct_wrd_counter = Counter(direct_tokens).most_common() 61 | direct_wrd_cnt = [i[1] for i in direct_wrd_counter] 62 | direct_wrd_pct = [i/sum(direct_wrd_cnt) * 100 for i in direct_wrd_cnt] 63 | direct_wrds = [i[0] for i in direct_wrd_counter] 64 | 65 | # create visuals 66 | 67 | graphs = [ 68 | # Histogram of the message genere 69 | { 70 | 'data': [ 71 | Bar( 72 | x=genre_names, 73 | y=genre_counts 74 | ) 75 | ], 76 | 77 | 'layout': { 78 | 'title': 'Distribution of Message Genres', 79 | 'yaxis': { 80 | 'title': "Count" 81 | }, 82 | 'xaxis': { 83 | 'title': "Genre" 84 | } 85 | } 86 | }, 87 | # histogram of social media messages top 30 keywords 88 | { 89 | 'data': [ 90 | Bar( 91 | x=social_media_wrds[:50], 92 | y=social_media_wrd_pct[:50] 93 | ) 94 | ], 95 | 96 | 'layout':{ 97 | 'title': "Top 50 Keywords in Social Media Messages", 98 | 'xaxis': {'tickangle':60 99 | }, 100 | 'yaxis': { 101 | 'title': "% Total Social Media Messages" 102 | } 103 | } 104 | }, 105 | 106 | # histogram of direct messages top 30 keywords 107 | { 108 | 'data': [ 109 | Bar( 110 | x=direct_wrds[:50], 111 | y=direct_wrd_pct[:50] 
112 | ) 113 | ], 114 | 115 | 'layout':{ 116 | 'title': "Top 50 Keywords in Direct Messages", 117 | 'xaxis': {'tickangle':60 118 | }, 119 | 'yaxis': { 120 | 'title': "% Total Direct Messages" 121 | } 122 | } 123 | }, 124 | 125 | 126 | 127 | # histogram of messages categories distributions 128 | { 129 | 'data': [ 130 | Bar( 131 | x=cate_names, 132 | y=cate_counts 133 | ) 134 | ], 135 | 136 | 'layout':{ 137 | 'title': "Distribution of Message Categories", 138 | 'xaxis': {'tickangle':60 139 | }, 140 | 'yaxis': { 141 | 'title': "count" 142 | } 143 | } 144 | }, 145 | 146 | ] 147 | 148 | # encode plotly graphs in JSON 149 | ids = ["graph-{}".format(i) for i, _ in enumerate(graphs)] 150 | graphJSON = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder) 151 | 152 | # render web page with plotly graphs 153 | return render_template('master.html', ids=ids, graphJSON=graphJSON) 154 | 155 | 156 | # web page that handles user query and displays model results 157 | @app.route('/go') 158 | def go(): 159 | # save user input in query 160 | query = request.args.get('query', '') 161 | 162 | # use model to predict classification for query 163 | classification_labels = model.predict([query])[0] 164 | classification_results = dict(zip(df.columns[4:], classification_labels)) 165 | 166 | # This will render the go.html Please see that file. 167 | return render_template( 168 | 'go.html', 169 | query=query, 170 | classification_result=classification_results 171 | ) 172 | 173 | 174 | def main(): 175 | app.run() 176 | 177 | 178 | if __name__ == '__main__': 179 | main() -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/templates/go.html: -------------------------------------------------------------------------------- 1 | {% extends "master.html" %} 2 | {% block title %}Results{% endblock %} 3 | 4 | {% block message %} 5 |
6 |

MESSAGE

7 |

{{query}}

8 | {% endblock %} 9 | 10 | {% block content %} 11 |

Result

12 |
    13 | {% for category, classification in classification_result.items() %} 14 | {% if classification == 1 %} 15 |
  • {{category.replace('_', ' ').title()}}
  • 16 | {% else %} 17 |
  • {{category.replace('_', ' ').title()}}
  • 18 | {% endif %} 19 | {% endfor %} 20 | 21 | 22 | 23 | 24 | {% endblock %} -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/templates/master.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Disasters 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 31 | 32 | 33 |
    34 |
    35 |

    Disaster Response Application

    36 |

    Analyzing message data for disaster response

    37 |
    38 | 39 |
    40 |
    41 |
    42 | 43 |
    44 | 45 |
    46 |
    47 |
    48 |
    49 | 50 | {% block message %} 51 | {% endblock %} 52 |
    53 |
    54 | 55 |
    56 | {% block content %} 57 | 60 | {% endblock %} 61 | 62 | {% for id in ids %} 63 |
    64 | {% endfor %} 65 |
    66 | 67 | 74 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/app/webapp screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 5 - Disaster Response Pipeline/app/webapp screenshot.png -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/data/DisasterResponse.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 5 - Disaster Response Pipeline/data/DisasterResponse.db -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/data/process_data.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pandas as pd 3 | import numpy as n 4 | from sqlalchemy import create_engine 5 | 6 | def load_data(messages_filepath, categories_filepath): 7 | """ 8 | Load data from the csv. 9 | Args: 10 | messages_filepath: the path of the messages.csv files that needs to be transferred 11 | categories_filepath: the path of the categories.csv files that needs to be transferred 12 | Returns: 13 | merged_df (DataFrame): messages and categories merged dataframe 14 | """ 15 | # load messages and categories 16 | messages = pd.read_csv(messages_filepath) 17 | categories = pd.read_csv(categories_filepath) 18 | # merge two dataframes into one 19 | df = pd.merge(messages, categories, on = 'id') 20 | return df 21 | 22 | def clean_data(df): 23 | """ 24 | Clean the unstructured merged dataframe into structured dataframes. 25 | 1. Rename columns of different categories 26 | 2. 
Remove Duplicates 27 | 28 | Args: 29 | df: The preprocessed dataframe 30 | Returns: 31 | df (DataFrame): messages and categories merged dataframe 32 | """ 33 | 34 | # split the categories columns into multiple columns 35 | categories = df['categories'].str.split(';', expand=True) 36 | 37 | # rename columns 38 | row = categories.iloc[1] 39 | category_colnames = row.apply(lambda x: x.split('-')[0]) 40 | categories.columns = category_colnames 41 | 42 | # replace original values into 1 and 0 43 | for column in categories: 44 | categories[column] = categories[column].apply(lambda x: int(x.split('-')[1])) 45 | 46 | # replace the old categories column 47 | df.drop('categories', axis = 1, inplace = True) 48 | df = df.join(categories) 49 | # drop duplicates 50 | df = df.drop_duplicates() 51 | return df 52 | 53 | 54 | def save_data(df, database_filename): 55 | """ 56 | Save processed dataframe into sqlite database 57 | 58 | Args: 59 | df: The preprocessed dataframe 60 | database_filename: name of the database 61 | Returns: 62 | None 63 | """ 64 | 65 | # save data into a sqlite database 66 | engine = create_engine('sqlite:///Messages.db') 67 | df.to_sql('Messages', engine, index=False, if_exists='replace') 68 | 69 | 70 | def main(): 71 | if len(sys.argv) == 4: 72 | 73 | messages_filepath, categories_filepath, database_filepath = sys.argv[1:] 74 | 75 | print('Loading data...\n MESSAGES: {}\n CATEGORIES: {}' 76 | .format(messages_filepath, categories_filepath)) 77 | df = load_data(messages_filepath, categories_filepath) 78 | 79 | print('Cleaning data...') 80 | df = clean_data(df) 81 | 82 | print('Saving data...\n DATABASE: {}'.format(database_filepath)) 83 | save_data(df, database_filepath) 84 | 85 | print('Cleaned data saved to database!') 86 | 87 | else: 88 | print('Please provide the filepaths of the messages and categories '\ 89 | 'datasets as the first and second argument respectively, as '\ 90 | 'well as the filepath of the database to save the cleaned data '\ 91 | 'to as the third argument. \n\nExample: python process_data.py '\ 92 | 'disaster_messages.csv disaster_categories.csv '\ 93 | 'DisasterResponse.db') 94 | 95 | 96 | if __name__ == '__main__': 97 | main() -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/models/adaboost_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 5 - Disaster Response Pipeline/models/adaboost_model.pkl -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/models/tokenizer_function.py: -------------------------------------------------------------------------------- 1 | import re 2 | import nltk 3 | from nltk.corpus import stopwords 4 | from nltk.tokenize import word_tokenize 5 | from nltk.stem.wordnet import WordNetLemmatizer 6 | import pandas as pd 7 | from sklearn.base import BaseEstimator, TransformerMixin 8 | 9 | class Tokenizer(BaseEstimator, TransformerMixin): 10 | """ Tokenize transformer to be used in the pipeline 11 | """ 12 | def __init__(self): 13 | pass 14 | 15 | def fit(self, X, y=None): 16 | return self 17 | 18 | def transform(self, X): 19 | return pd.Series(X).apply(tokenize).values 20 | 21 | 22 | def tokenize(text): 23 | """ 24 | Tokenize the message into word level features. 25 | 1. replace urls 26 | 2. convert to lower cases 27 | 3. remove stopwords 28 | 4. 
strip white spaces 29 | Args: 30 | text: input text messages 31 | Returns: 32 | cleaned tokens(List) 33 | """ 34 | # Define url pattern 35 | url_re = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' 36 | 37 | # Detect and replace urls 38 | detected_urls = re.findall(url_re, text) 39 | for url in detected_urls: 40 | text = text.replace(url, "urlplaceholder") 41 | 42 | # tokenize sentences 43 | tokens = word_tokenize(text) 44 | lemmatizer = WordNetLemmatizer() 45 | 46 | # save cleaned tokens 47 | clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens] 48 | 49 | # remove stopwords 50 | STOPWORDS = list(set(stopwords.words('english'))) 51 | clean_tokens = [token for token in clean_tokens if token not in STOPWORDS] 52 | 53 | return clean_tokens 54 | -------------------------------------------------------------------------------- /Project 5 - Disaster Response Pipeline/models/train_classifier.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pandas as pd 3 | import numpy as np 4 | import pickle 5 | from sqlalchemy import create_engine 6 | 7 | # import tokenize_function 8 | from models.tokenizer_function import Tokenizer 9 | 10 | # import sklearn 11 | from sklearn.pipeline import Pipeline 12 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer 13 | from sklearn.model_selection import train_test_split, GridSearchCV 14 | from sklearn.multioutput import MultiOutputClassifier 15 | from sklearn.metrics import precision_score, recall_score, f1_score 16 | from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier 17 | from sklearn.externals import joblib 18 | 19 | def load_data(database_filepath): 20 | """ 21 | Load data from the sqlite database. 
22 | Args: 23 | database_filepath: the path of the database file 24 | Returns: 25 | X (DataFrame): messages 26 | Y (DataFrame): One-hot encoded categories 27 | category_names (List) 28 | """ 29 | 30 | # load data from database 31 | engine = create_engine('sqlite:///../data/DisasterResponse.db') 32 | df = pd.read_sql_table('DisasterResponse', engine) 33 | X = df['message'] 34 | Y = df.drop(['id', 'message', 'original', 'genre'], axis=1) 35 | category_names = Y.columns 36 | 37 | return X, Y, category_names 38 | 39 | 40 | def build_model(): 41 | """ 42 | build NLP pipeline - count words, tf-idf, multiple output classifier, 43 | grid search the best parameters 44 | Args: 45 | None 46 | Returns: 47 | cross validated classifier object 48 | """ 49 | # 50 | pipeline = Pipeline([ 51 | ('tokenizer', Tokenizer()), 52 | ('vec', CountVectorizer()), 53 | ('tfidf', TfidfTransformer()), 54 | ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators = 100))) 55 | ]) 56 | 57 | # grid search 58 | parameters = {'clf__estimator__max_features':['sqrt', 0.5], 59 | 'clf__estimator__n_estimators':[50, 100]} 60 | 61 | cv = GridSearchCV(estimator=pipeline, param_grid = parameters, cv = 5, n_jobs = 10) 62 | 63 | return cv 64 | 65 | 66 | def evaluate_model(model, X_test, Y_test, category_names): 67 | """ 68 | Evaluate the model performances, in terms of f1-score, precison and recall 69 | Args: 70 | model: the model to be evaluated 71 | X_test: X_test dataframe 72 | Y_test: Y_test dataframe 73 | category_names: category names list defined in load data 74 | Returns: 75 | perfomances (DataFrame) 76 | """ 77 | # predict on the X_test 78 | y_pred = model.predict(X_test) 79 | 80 | # build classification report on every column 81 | performances = [] 82 | for i in range(len(category_names)): 83 | performances.append([f1_score(Y_test.iloc[:, i].values, y_pred[:, i], average='micro'), 84 | precision_score(Y_test.iloc[:, i].values, y_pred[:, i], average='micro'), 85 | recall_score(Y_test.iloc[:, i].values, y_pred[:, i], average='micro')]) 86 | # build dataframe 87 | performances = pd.DataFrame(performances, columns=['f1 score', 'precision', 'recall'], 88 | index = category_names) 89 | return performances 90 | 91 | 92 | def save_model(model, model_filepath): 93 | """ 94 | Save model to pickle 95 | """ 96 | joblib.dump(model, open(model_filepath, 'wb')) 97 | 98 | 99 | def main(): 100 | if len(sys.argv) == 3: 101 | database_filepath, model_filepath = sys.argv[1:] 102 | print('Loading data...\n DATABASE: {}'.format(database_filepath)) 103 | X, Y, category_names = load_data(database_filepath) 104 | X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2) 105 | 106 | print('Building model...') 107 | model = build_model() 108 | 109 | print('Training model...') 110 | model.fit(X_train, Y_train) 111 | 112 | print('Evaluating model...') 113 | evaluate_model(model, X_test, Y_test, category_names) 114 | 115 | print('Saving model...\n MODEL: {}'.format(model_filepath)) 116 | save_model(model, model_filepath) 117 | 118 | print('Trained model saved!') 119 | 120 | else: 121 | print('Please provide the filepath of the disaster messages database '\ 122 | 'as the first argument and the filepath of the pickle file to '\ 123 | 'save the model to as the second argument. 
\n\nExample: python '\ 124 | 'train_classifier.py ../data/DisasterResponse.db classifier.pkl') 125 | 126 | 127 | if __name__ == '__main__': 128 | main() 129 | -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/__pycache__/project_tests.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/__pycache__/project_tests.cpython-36.pyc -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/project_tests.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import pickle 4 | 5 | df = pd.read_csv('data/user-item-interactions.csv') 6 | df_content = pd.read_csv('data/articles_community.csv') 7 | del df['Unnamed: 0'] 8 | del df_content['Unnamed: 0'] 9 | 10 | 11 | def sol_1_test(sol_1_dict): 12 | sol_1_dict_ = { 13 | '`50% of individuals have _____ or fewer interactions.`': 3, 14 | '`The total number of user-article interactions in the dataset is ______.`': 45993, 15 | '`The maximum number of user-article interactions by any 1 user is ______.`': 364, 16 | '`The most viewed article in the dataset was viewed _____ times.`': 937, 17 | '`The article_id of the most viewed article is ______.`': '1429.0', 18 | '`The number of unique articles that have at least 1 rating ______.`': 714, 19 | '`The number of unique users in the dataset is ______`': 5148, 20 | '`The number of unique articles on the IBM platform`': 1051, 21 | } 22 | 23 | if sol_1_dict_ == sol_1_dict: 24 | print("It looks like you have everything right here! Nice job!") 25 | 26 | else: 27 | for k, v in sol_1_dict.items(): 28 | if sol_1_dict_[k] != sol_1_dict[k]: 29 | print("Oops! It looks like the value associated with: {} wasn't right. Try again. It might just be the datatype. All of the values should be ints except the article_id should be a string. Let each row be considered a separate user-article interaction. If a user interacts with an article 3 times, these are considered 3 separate interactions.".format(k)) 30 | 31 | 32 | def sol_2_test(top_articles): 33 | top_5 = top_articles(5) 34 | top_10 = top_articles(10) 35 | top_20 = top_articles(20) 36 | 37 | checks = ['top_5', 'top_10', 'top_20'] 38 | for idx, file in enumerate(checks): 39 | if set(eval(file)) == set(pickle.load(open( "{}.p".format(file), "rb" ))): 40 | print("Your {} looks like the solution list! Nice job.".format(file)) 41 | else: 42 | print("Oops! The {} list doesn't look how we expected. Try again.".format(file)) 43 | 44 | 45 | 46 | def sol_5_test(sol_5_dict): 47 | sol_5_dict_1 = { 48 | 'The user that is most similar to user 1.': 3933, 49 | 'The user that is the 10th most similar to user 131': 242 50 | } 51 | if sol_5_dict == sol_5_dict_1: 52 | print("This all looks good! Nice job!") 53 | 54 | else: 55 | for k, v in sol_5_dict_1.items(): 56 | if set(sol_5_dict[k]) != set(sol_5_dict_1[k]): 57 | print("Oops! Looks like there is a mistake with the {} key in your dictionary. The answer should be {}. 
Try again.".format(k,v)) 58 | 59 | 60 | def sol_4_test(sol_4_dict): 61 | 62 | a = 662 # len(test_idx) - user_item_test.shape[0] 63 | b = 574 # user_test_shape[1] or len(test_arts) because we can make predictions for all articles 64 | c = 20 # user_item_test.shape[0] 65 | d = 0 # len(test_arts) - user_item_test.shape[1] 66 | 67 | sol_4_dict_1 = { 68 | 'How many users can we make predictions for in the test set?': c, 69 | 'How many users in the test set are we not able to make predictions for because of the cold start problem?': a, 70 | 'How many movies can we make predictions for in the test set?': b, 71 | 'How many movies in the test set are we not able to make predictions for because of the cold start problem?': d 72 | } 73 | 74 | if sol_4_dict == sol_4_dict_1: 75 | print("Awesome job! That's right! All of the test movies are in the training data, but there are only 20 test users that were also in the training set. All of the other users that are in the test set we have no data on. Therefore, we cannot make predictions for these users using SVD.") 76 | else: 77 | for k, v in sol_4_dict_1.items(): 78 | if sol_4_dict_1[k] != sol_4_dict[k]: 79 | print("Sorry it looks like that isn't the right value associated with {}. Try again.".format(k)) 80 | 81 | 82 | -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/top_10.p: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/top_10.p -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/top_20.p: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/top_20.p -------------------------------------------------------------------------------- /Project 6 - Reccomendation System/top_5.p: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen-bowen/Data_Science_Portfolio/7f3acbb403e5666eacd3b973fa812bf87d73b2e4/Project 6 - Reccomendation System/top_5.p -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science Portfolio Projects 2 | 3 | ### Project 1 - Predicting Donors' Income using supervised learning 4 | 5 | **Description:** 6 | 7 | In this project, I employed several supervised learning algorithms to accurately model individuals' income using data collected from the 1994 U.S. Census. I then chose the best candidate algorithm from preliminary results and further optimized it to best model the data. My goal with this implementation was to construct a model that accurately predicts whether an individual makes more than $50,000. 8 | 9 | [Project Notebook: Finding Donors](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%201%20-%20Finding%20Donars/finding_donors.ipynb) 10 | 11 |
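As a rough illustration of the workflow described above (not the notebook's exact code), the sketch below one-hot encodes the census features and tunes a single candidate model with a grid search. The choice of `AdaBoostClassifier`, the hyperparameter grid, the F-beta scorer, and the `'>50K'` label string are assumptions made for this sketch, and the preprocessing in the actual notebook is more involved.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import fbeta_score, make_scorer

# census.csv ships with the project; 'income' is the binary target
data = pd.read_csv('census.csv')
X = pd.get_dummies(data.drop('income', axis=1))   # one-hot encode categorical features
y = (data['income'] == '>50K').astype(int)        # assumed label string

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# tune the best preliminary candidate with a small grid, scoring with
# F-beta (beta = 0.5) so that precision is weighted more heavily than recall
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={'n_estimators': [50, 100, 200], 'learning_rate': [0.5, 1.0, 1.5]},
    scoring=scorer,
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print('Test F-beta score:', grid.score(X_test, y_test))
```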
12 | ### Project 2 - Flower Image Classifier Application 13 | 14 | **Description:** 15 | 16 | Going forward, AI algorithms will be incorporated into more and more everyday applications. For example, you might want to include an image classifier in a smartphone app. To do this, you'd use a deep learning model trained on hundreds of thousands of images as part of the overall application architecture. A large part of software development in the future will be using these types of models as common parts of applications. 17 | 18 | In this project, I trained an image classifier to recognize different species of flowers. You can imagine using something like this in a phone app that tells you the name of the flower your camera is looking at. In practice, one would train this classifier and then export it for use in an application. This project uses a dataset of 102 flower categories. 19 | 20 | [Project Notebook: Image Classifier](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%202%20-%20Image%20Classifier%20Application/Image%20Classifier%20Project.ipynb?flush_cache=true) 21 | 22 | 23 | ### Project 3 - Identifying Customer Segments 24 | 25 | **Description:** 26 | 27 | In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data was provided by Bertelsmann Arvato Analytics and represents a real-life data science task. 28 | 29 | First, the general demographics data are clustered with a KMeans clustering algorithm; the fitted model is then applied to the customer dataset to investigate whether the customers follow the same distribution. 30 | 31 | [Project Notebook: Customer Segmentations](http://nbviewer.jupyter.org/github/chenbowen184/Udacity_Data_Science_Projects/blob/master/Project%203%20-%20Identify%20Customer%20Segementation/Identify_Customer_Segments.ipynb?flush_cache=true) 32 | 33 | 34 |
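To make the "fit on the general population, then apply to the customers" idea concrete, here is a minimal sketch of the clustering step. It is a simplified illustration rather than the notebook's code: the file names, the number of PCA components, and the number of clusters are placeholder assumptions, both tables are assumed to share the same cleaned columns, and the real project does substantial missing-value handling and feature re-encoding before any clustering.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# hypothetical, already-cleaned inputs: general population vs. company customers
azdias = pd.read_csv('azdias_cleaned.csv')        # general demographics data
customers = pd.read_csv('customers_cleaned.csv')  # customer demographics data

# scale -> reduce -> cluster, fitted only on the general population
segmenter = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=30)),
    ('kmeans', KMeans(n_clusters=10, random_state=42)),
])
segmenter.fit(azdias)

# reuse the same fitted transformations and centroids on the customer data
population_clusters = pd.Series(segmenter.predict(azdias), name='cluster')
customer_clusters = pd.Series(segmenter.predict(customers), name='cluster')

# clusters that are over-represented among customers point to the core customer base
comparison = pd.DataFrame({
    'population_pct': population_clusters.value_counts(normalize=True),
    'customer_pct': customer_clusters.value_counts(normalize=True),
}).fillna(0)
comparison['ratio'] = comparison['customer_pct'] / comparison['population_pct']
print(comparison.sort_values('ratio', ascending=False))
```

Comparing the two proportion columns (or their ratio) is what reveals which population segments the mail-order company actually reaches.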
35 | ### Project 4 - Data Science Blog 36 | 37 | **Description:** 38 | 39 | In this project, I analyzed the 2011 - 2018 Stack Overflow developer survey data to create a blog post presenting a comprehensive study of data science careers. The project notebook can be found below. 40 | 41 | [Project Notebook: Understanding the Career of Data Scientists](http://nbviewer.jupyter.org/github/chenbowen184/Data_Scientist_Nanodegree/blob/master/Project%204%20-%20Data%20Science%20Blog/Understanding%20the%20Career%20of%20Data%20Scientists.ipynb) 42 | 43 | [Blog Post: Understanding the Career of Data Scientist Using the Data Science Way](https://medium.com/@bowenchen/understanding-the-career-of-data-scientists-in-a-data-science-way-9bd63817221e) 44 | 45 | ### Project 5 - Disaster Response Pipeline 46 | 47 | **Description:** 48 | 49 | In this project, I built a data transformation and machine learning pipeline that classifies disaster messages into their relevant categories. The pipeline is served through a Flask application. The project includes a web app where an emergency worker can input a new message and get classification results in several categories. The web app also displays visualizations of the data. The project notebooks can be found below. 50 | 51 | [Project Notebook: ETL](http://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Project%205%20-%20Disaster%20Response%20Pipeline/ETL%20Pipeline%20Preparation.ipynb) 52 | 53 | [Project Notebook: Machine Learning Pipeline](http://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Project%205%20-%20Disaster%20Response%20Pipeline/ML%20Pipeline%20Preparation.ipynb) 54 | 55 | [Disaster Response App](https://disaster-response-app184.herokuapp.com/) 56 | 57 | 58 | ### Project 6 - Recommendation System 59 | 60 | **Description:** 61 | 62 | In this project, I developed a recommendation engine using IBM community articles and user-article interactions. The project serves as a prototype of IBM's article recommendation system. The project notebook can be found below. 63 | 64 | [Project Notebook: Recommendations with IBM](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Project%206%20-%20Reccomendation%20System/Recommendations_with_IBM.ipynb) 65 | 66 | ### Capstone Project - Spark Distributed Analytics 67 | 68 | **Description:** 69 | 70 | In the capstone project, I built a distributed machine learning pipeline for a user activity log using Spark, the big data processing framework. The primary objective was to predict the churn probability of every user. Most of the data visualizations were completed on a small subset of the data, while the full-dataset analytics were performed using AWS EMR. 71 | 72 | [Project Notebook: Spark - Subset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Subset%20Analytics.ipynb) 73 | 74 | [Project Notebook: Spark - Full Dataset Analytics](https://nbviewer.jupyter.org/github/chenbowen184/Data_Science_Portfolio/blob/master/Capstone%20Project/Spark%20-%20Full%20Dataset.ipynb) 75 | 76 | [Blog Post: Understanding Customer Churning with Big Data Analytics](https://medium.com/@bowenchen/understanding-customer-churning-with-big-data-analytics-70ce4eb17669) 77 | 78 | ![Certificate](https://raw.githubusercontent.com/chenbowen184/Data_Science_Portfolio/master/Data%20Scientist%20Nanodegree%20certificate.jpg) 79 | 80 | **Disclaimer:** Remember to provide proper citation if you want to use any part of this code. 81 | --------------------------------------------------------------------------------