├── Demo - DataShop.ipynb
├── LICENSE
├── README.md
├── data
│   ├── ds76_tx_All_Data_74_2018_0912_070949.txt
│   └── images
│       ├── difficulty_vs_time.html
│       └── exampleimage.png
├── entityset_function.ipynb
├── requirements.txt
└── utils.py

/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 | 
3 | Copyright (c) 2017, Featuretools
4 | All rights reserved.
5 | 
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 | 
9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 | 
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 | 
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Making predictions from a DataShop dataset
2 | 
3 | Featuretools
4 | 
5 | 
6 | In this tutorial, we show how to predict whether a student will successfully answer a problem, using a dataset from [CMU DataShop](https://pslcdatashop.web.cmu.edu/). While online courses are logistically efficient, their structure can make it more difficult for a teacher to understand how the students in a class are learning. To help fill in those gaps, we can apply machine learning.
7 | 
8 | However, building an accurate machine learning model requires extracting information called **features**. Choosing the right features is crucial both for getting a satisfactory answer and for interpreting the dataset as a whole. The process of **feature engineering** is made simple by [Featuretools](http://www.featuretools.com); a sketch of the end-to-end workflow appears after the setup steps below.
9 | 
10 | *If you're running the [notebook](Demo%20-%20DataShop.ipynb) yourself, please download the [geometry dataset](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) into the `data` folder in this repository. You will only need the `.txt` file. The infrastructure in that notebook will work with **any** DataShop dataset, but you will need to change the filename to the dataset you'd like to load.*
11 | 
12 | ## Highlights
13 | * Show how to import a DataShop dataset into Featuretools
14 | * Demonstrate the efficacy of automatic feature generation by training a machine learning model
15 | * Give an example of how Featuretools can reveal and help answer interesting questions
16 | 
17 | Here is a plot of two automatically generated features:
18 | 
19 | ![Example image](data/images/exampleimage.png)
20 | 
21 | This plot shows the average time spent on a problem against the success rate on that problem. There is an [interactive version](https://www.featuretools.com/wp-content/uploads/2018/03/difficulty_vs_time.html) of this plot which lets you hover over individual points to see the problem and problem step. Notice that, for this dataset, the success rate on problems that take longer is uniformly lower.
22 | 
23 | ## Running the tutorial
24 | 
25 | 1. Clone the repo
26 | 
27 |    ```
28 |    git clone https://github.com/Featuretools/predict-correct-answer.git
29 |    ```
30 | 
31 | 2. Install the requirements
32 | 
33 |    ```
34 |    pip install -r requirements.txt
35 |    ```
36 | 
37 |    *You will also need to install **graphviz** for this demo. Please install graphviz according to the instructions in the [Featuretools documentation](https://docs.featuretools.com/getting_started/install.html).*
38 | 
39 | 3. Download the data
40 | 
41 |    You can download the geometry dataset from [the DataShop website](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) (free account required). Follow [these instructions](https://pslcdatashop.web.cmu.edu/help?datasetId=76&page=export) to export the data. Take the `.txt` file from the zipped download and place it in the `data` folder in this repository.
42 | 
43 | 4. Run the tutorial notebook:
44 | 45 | ``` 46 | jupyter notebook 47 | ``` 48 | 49 | - [Demo - DataShop](Demo%20-%20DataShop.ipynb)
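
At a high level, the notebook builds an EntitySet from the raw export, runs Deep Feature Synthesis, and scores a model. Here is a minimal sketch using the helper functions that ship in this repo's `utils.py` (the exact calls and parameters in the notebook may differ):

```
from sklearn.model_selection import KFold
from utils import datashop_to_entityset, create_features, estimate_score

# Build an EntitySet from the raw DataShop export
es = datashop_to_entityset('data/ds76_tx_All_Data_74_2018_0912_070949.txt')

# Run Deep Feature Synthesis and encode/clean the resulting feature matrix
fm_enc, labels = create_features(es, label='Outcome')

# Train and score a random forest on each cross-validation split
estimate_score(fm_enc, labels.values, KFold(n_splits=3))
```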
50 | 
51 | 
52 | 
53 | *Note: The notebook relies on a `datashop_to_entityset` function, which is described in depth in the [entityset_function notebook](entityset_function.ipynb).*
54 | 
55 | ## Feature Labs
56 | 
57 | Featuretools
58 | 
59 | 
60 | Featuretools is an open source project created by [Feature Labs](https://www.featurelabs.com/). To see the other open source projects we're working on, visit the Feature Labs [Open Source](https://www.featurelabs.com/open) page. If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/).
61 | 
62 | ### Contact
63 | 
64 | Any questions can be directed to help@featurelabs.com
65 | 
--------------------------------------------------------------------------------
/data/images/exampleimage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alteryx/predict-correct-answer/f6127b391264cd2c5bac71b1cffa1a72fbdd8edf/data/images/exampleimage.png
--------------------------------------------------------------------------------
/entityset_function.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {},
7 |    "outputs": [
8 |     {
9 |      "data": {
10 |       "text/plain": [
11 |        "'0.6.0'"
12 |       ]
13 |      },
14 |      "execution_count": 1,
15 |      "metadata": {},
16 |      "output_type": "execute_result"
17 |     }
18 |    ],
19 |    "source": [
20 |     "import numpy as np\n",
21 |     "import pandas as pd\n",
22 |     "import featuretools as ft\n",
23 |     "ft.__version__"
24 |    ]
25 |   },
26 |   {
27 |    "cell_type": "markdown",
28 |    "metadata": {},
29 |    "source": [
30 |     "# About\n",
31 |     "This notebook shows how we can create an EntitySet from a dataset with the DataShop structure. The content of this notebook, the function `datashop_to_entityset`, also lives in [utils](utils.py) so that it can be imported in the [demo](Demo%20-%20DataShop.ipynb) notebook."
32 |    ]
33 |   },
34 |   {
35 |    "cell_type": "code",
36 |    "execution_count": 2,
37 |    "metadata": {},
38 |    "outputs": [],
39 |    "source": [
40 |     "import featuretools.variable_types as vtypes\n",
41 |     "def datashop_to_entityset(filename):\n",
42 |     "    # Make an EntitySet called Dataset with the following structure\n",
43 |     "    #\n",
44 |     "    # schools       students      problems\n",
45 |     "    #      \\           |         /\n",
46 |     "    #   classes     sessions   problem steps\n",
47 |     "    #      \\           |         /\n",
48 |     "    #        transactions -- attempts\n",
49 |     "    #\n",
50 |     "\n",
51 |     "    # Read the tab-separated file into a dataframe using pandas\n",
52 |     "    data = pd.read_csv(filename, sep='\t')\n",
53 |     "\n",
54 |     "    # Make the Transaction Id the index column of the dataframe and clean other columns\n",
55 |     "    data.index = data['Transaction Id']\n",
56 |     "    data = data.drop(['Row'], axis=1)\n",
57 |     "    data['Outcome'] = data['Outcome'].map({'INCORRECT': 0, 'CORRECT': 1})\n",
58 |     "\n",
59 |     "    # Make a new 'End Time' column which is start_time + duration\n",
60 |     "    # This is /super useful/ because you shouldn't be using outcome data at\n",
61 |     "    # any point before the student has attempted the problem.\n",
62 |     "    data['End Time'] = pd.to_datetime(data['Time']) + pd.to_timedelta(pd.to_numeric(data['Duration (sec)']), 's')\n",
63 |     "\n",
64 |     "    # Make a list of all the KC and CF columns present (unused here; the utils.py version moves them onto problem_steps)\n",
65 |     "    kc_and_cf_cols = [x for x in data.columns if (x.startswith('KC ') or x.startswith('CF '))]\n",
66 |     "\n",
67 |     "    # Now we start making an entityset. We make 'End Time' a secondary time index for 'Outcome'\n",
68 |     "    # so that, even though our primary time index for a row is 'Time', we prevent label leakage.\n",
69 |     "    es = ft.EntitySet('Dataset')\n",
70 |     "    es.entity_from_dataframe(entity_id='transactions',\n",
71 |     "                             index='Transaction Id',\n",
72 |     "                             dataframe=data,\n",
73 |     "                             variable_types={'Outcome': vtypes.Boolean},\n",
74 |     "                             time_index='Time',\n",
75 |     "                             secondary_time_index={'End Time': ['Outcome', 'Is Last Attempt', 'Duration (sec)']})\n",
76 |     "\n",
77 |     "    # Every transaction has a `problem_step` which is associated with a problem\n",
78 |     "    es.normalize_entity(base_entity_id='transactions',\n",
79 |     "                        new_entity_id='problem_steps',\n",
80 |     "                        index='Step Name',\n",
81 |     "                        additional_variables=['Problem Name'],\n",
82 |     "                        make_time_index=False)\n",
83 |     "\n",
84 |     "    es.normalize_entity(base_entity_id='problem_steps',\n",
85 |     "                        new_entity_id='problems',\n",
86 |     "                        index='Problem Name',\n",
87 |     "                        make_time_index=False)\n",
88 |     "\n",
89 |     "    # Every transaction has a `session` associated with a student\n",
90 |     "    es.normalize_entity(base_entity_id='transactions',\n",
91 |     "                        new_entity_id='sessions',\n",
92 |     "                        index='Session Id',\n",
93 |     "                        additional_variables=['Anon Student Id'],\n",
94 |     "                        make_time_index=True)\n",
95 |     "\n",
96 |     "    es.normalize_entity(base_entity_id='sessions',\n",
97 |     "                        new_entity_id='students',\n",
98 |     "                        index='Anon Student Id',\n",
99 |     "                        make_time_index=True)\n",
100 |     "\n",
101 |     "    # Every transaction has a `class` associated with a school\n",
102 |     "    es.normalize_entity(base_entity_id='transactions',\n",
103 |     "                        new_entity_id='classes',\n",
104 |     "                        index='Class',\n",
105 |     "                        additional_variables=['School'],\n",
106 |     "                        make_time_index=False)\n",
107 |     "\n",
108 |     "    es.normalize_entity(base_entity_id='classes',\n",
109 |     "                        new_entity_id='schools',\n",
110 |     "                        index='School',\n",
111 |     "                        make_time_index=False)\n",
112 |     "\n",
113 |     "    # And because we might be interested in creating features grouped\n",
114 |     "    # by attempts, we normalize by those as well.\n",
115 |     "    es.normalize_entity(base_entity_id='transactions',\n",
116 |     "                        new_entity_id='attempts',\n",
117 |     "                        index='Attempt At Step',\n",
118 |     "                        additional_variables=[],\n",
119 |     "                        make_time_index=False)\n",
120 |     "    return es"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": 3,
126 |    "metadata": {},
127 |    "outputs": [
128 |     {
129 |      "data": {
130 |       "text/plain": [
131 |        "Entityset: Dataset\n",
132 |        "  Entities:\n",
133 |        "    transactions [Rows: 6778, Columns: 179]\n",
134 |        "    problem_steps [Rows: 78, Columns: 2]\n",
135 |        "    problems [Rows: 20, Columns: 1]\n",
136 |        "    sessions [Rows: 59, Columns: 3]\n",
137 |        "    students [Rows: 59, Columns: 2]\n",
138 |        "    classes [Rows: 1, Columns: 2]\n",
139 |        "    schools [Rows: 1, Columns: 1]\n",
140 |        "    attempts [Rows: 9, Columns: 1]\n",
141 |        "  Relationships:\n",
142 |        "    transactions.Step Name -> problem_steps.Step Name\n",
143 |        "    problem_steps.Problem Name -> problems.Problem Name\n",
144 |        "    transactions.Session Id -> sessions.Session Id\n",
145 |        "    sessions.Anon Student Id -> students.Anon Student Id\n",
146 |        "    transactions.Class -> classes.Class\n",
147 |        "    classes.School -> schools.School\n",
148 |        "    transactions.Attempt At Step -> attempts.Attempt At Step"
149 |       ]
150 |      },
151 |      "execution_count": 3,
152 |      "metadata": {},
153 |      "output_type": "execute_result"
154 |     }
155 |    ],
156 |    "source": [
157 |     "filename = 'data/ds76_tx_All_Data_74_2018_0912_070949.txt'\n",
158 |     "es = datashop_to_entityset(filename)\n",
159 |     "es"
160 |    ]
161 |   },
162 |   {
163 |    "cell_type": "markdown",
164 |    "metadata": {},
165 |    "source": [
166 |     "# Summary\n",
167 |     "In total we have made 8 entities. At the base is `transactions`, and the following chains of one-to-many relationships all end at it:\n",
168 |     "1. `problems` -> `problem_steps` -> `transactions`\n",
169 |     "2. `students` -> `sessions` -> `transactions`\n",
170 |     "3. `schools` -> `classes` -> `transactions`\n",
171 |     "4. `attempts` -> `transactions`\n",
172 |     "\n",
173 |     "Our base entity also has a time index `Time` and a secondary time index `End Time` for columns that can only be known once the event is over. This allows us to use `Outcome` in our feature matrix, since its values will only be visible to later events."
174 |    ]
175 |   },
176 |   {
177 |    "cell_type": "code",
178 |    "execution_count": 4,
179 |    "metadata": {},
180 |    "outputs": [
181 |     {
182 |      "data": {
183 |       "image/svg+xml": [
184 |        "[EntitySet graph omitted: the SVG rendered by es.plot() shows the eight entities above (transactions with its 179 variables, mostly KC and CF columns, plus problem_steps, problems, sessions, students, classes, schools, and attempts) connected by the relationships listed in the previous output.]"
493 |       ],
494 |       "text/plain": [
495 |        ""
496 |       ]
497 |      },
498 |      "execution_count": 4,
499 |      "metadata": {},
500 |      "output_type": "execute_result"
501 |     }
502 |    ],
503 |    "source": [
504 |     "es.plot()"
505 |    ]
506 |   },
507 |   {
508 |    "cell_type": "markdown",
509 |    "metadata": {
510 |     "collapsed": true
511 |    },
512 |    "source": [
513 |     "Featuretools\n",
\n", 516 | "\n", 517 | "Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/)." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [] 526 | } 527 | ], 528 | "metadata": { 529 | "kernelspec": { 530 | "display_name": "Python 3", 531 | "language": "python", 532 | "name": "python3" 533 | }, 534 | "language_info": { 535 | "codemirror_mode": { 536 | "name": "ipython", 537 | "version": 3 538 | }, 539 | "file_extension": ".py", 540 | "mimetype": "text/x-python", 541 | "name": "python", 542 | "nbconvert_exporter": "python", 543 | "pygments_lexer": "ipython3", 544 | "version": "3.7.2" 545 | } 546 | }, 547 | "nbformat": 4, 548 | "nbformat_minor": 2 549 | } 550 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | bokeh>=1.0.2 2 | featuretools>=0.16.0 3 | jupyter>=1.0.0 4 | scikit-learn>=0.20.2 5 | graphviz>=0.10.1 6 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import featuretools.variable_types as vtypes 2 | import pandas as pd 3 | import featuretools as ft 4 | from featuretools.primitives import Sum, Mean, Hour 5 | from featuretools.selection import remove_low_information_features 6 | from sklearn.ensemble import RandomForestClassifier 7 | from sklearn.metrics import roc_auc_score 8 | 9 | from bokeh.plotting import figure 10 | from bokeh.models import ColumnDataSource, HoverTool 11 | from bokeh.io import show 12 | 13 | 14 | def datashop_to_entityset(filename): 15 | # Make an EntitySet called Dataset with the following structure 16 | # 17 | # schools students problems 18 | # \ | / 19 | # classes sessions problem steps 20 | # \ | / 21 | # transactions -- attempts 22 | # 23 | 24 | # Convert the csv into a dataframe using pandas 25 | data = pd.read_csv(filename, '\t', parse_dates=True) 26 | 27 | # Make the Transaction Id the index column of the dataframe and clean other columns 28 | data.index = data['Transaction Id'] 29 | data = data.drop(['Row'], axis=1) 30 | data['Outcome'] = data['Outcome'].map({'INCORRECT': 0, 'CORRECT': 1}) 31 | 32 | # Make a new 'End Time' column which is start_time + duration 33 | # This is /super useful/ because you shouldn't be using outcome data at 34 | # any point before the student has attempted the problem. 35 | data['End Time'] = pd.to_datetime( 36 | data['Time']) + pd.to_timedelta(pd.to_numeric(data['Duration (sec)']), 's') 37 | 38 | # Make a list of all the KC and CF columns present 39 | kc_and_cf_cols = [x for x in data.columns if ( 40 | x.startswith('KC ') or x.startswith('CF '))] 41 | 42 | # Now we start making an entityset. We make 'End Time' a time index for 'Outcome' 43 | # even though our primary time index for a row is 'Time' preventing label leakage. 
44 |     es = ft.EntitySet('Dataset')
45 |     es.entity_from_dataframe(entity_id='transactions',
46 |                              index='Transaction Id',
47 |                              dataframe=data,
48 |                              variable_types={'Outcome': vtypes.Boolean, 'Attempt At Step': vtypes.Categorical},
49 |                              time_index='Time',
50 |                              secondary_time_index={'End Time': [
51 |                                  'Outcome', 'Is Last Attempt', 'Duration (sec)']}
52 |                              )
53 | 
54 |     # Every transaction has a `problem_step` which is associated with a problem
55 |     es.normalize_entity(base_entity_id='transactions',
56 |                         new_entity_id='problem_steps',
57 |                         index='Step Name',
58 |                         additional_variables=['Problem Name'] + kc_and_cf_cols,
59 |                         make_time_index=True)
60 | 
61 |     es.normalize_entity(base_entity_id='problem_steps',
62 |                         new_entity_id='problems',
63 |                         index='Problem Name',
64 |                         make_time_index=True)
65 | 
66 |     # Every transaction has a `session` associated with a student
67 |     es.normalize_entity(base_entity_id='transactions',
68 |                         new_entity_id='sessions',
69 |                         index='Session Id',
70 |                         additional_variables=['Anon Student Id'],
71 |                         make_time_index=True)
72 | 
73 |     es.normalize_entity(base_entity_id='sessions',
74 |                         new_entity_id='students',
75 |                         index='Anon Student Id',
76 |                         make_time_index=True)
77 | 
78 |     # Every transaction has a `class` associated with a school
79 |     es.normalize_entity(base_entity_id='transactions',
80 |                         new_entity_id='classes',
81 |                         index='Class',
82 |                         additional_variables=['School'],
83 |                         make_time_index=False)
84 | 
85 |     es.normalize_entity(base_entity_id='classes',
86 |                         new_entity_id='schools',
87 |                         index='School',
88 |                         make_time_index=False)
89 | 
90 |     # And because we might be interested in creating features grouped
91 |     # by attempts, we could normalize by those as well.
92 |     # es.normalize_entity(base_entity_id='transactions',
93 |     #                     new_entity_id='attempts',
94 |     #                     index='Attempt At Step',
95 |     #                     additional_variables=[],
96 |     #                     make_time_index=False)
97 |     return es
98 | 
99 | 
100 | def create_features(es, label='Outcome', custom_agg=[]):
101 |     cutoff_times = es['transactions'].df[['Transaction Id', 'End Time', label]]
102 |     fm, features = ft.dfs(entityset=es,
103 |                           target_entity='transactions',
104 |                           agg_primitives=[Sum, Mean] + custom_agg,
105 |                           trans_primitives=[Hour],
106 |                           max_depth=3,
107 |                           approximate='2m',
108 |                           cutoff_time=cutoff_times,
109 |                           verbose=True)
110 |     fm_enc, _ = ft.encode_features(fm, features)
111 |     fm_enc = fm_enc.fillna(0)
112 |     fm_enc = remove_low_information_features(fm_enc)
113 |     labels = fm.pop(label)
114 |     return (fm_enc, labels)
115 | 
116 | 
117 | def estimate_score(fm_enc, label, splitter):
118 |     # `label` should be positionally indexable (e.g. pass labels.values)
119 |     k = 0
120 |     for train_index, test_index in splitter.split(fm_enc):
121 |         clf = RandomForestClassifier()
122 |         X_train, X_test = fm_enc.iloc[train_index], fm_enc.iloc[test_index]
123 |         y_train, y_test = label[train_index], label[test_index]
124 |         clf.fit(X_train, y_train)
125 |         preds = clf.predict(X_test)
126 |         # roc_auc_score expects the true labels first, then the predictions
127 |         score = round(roc_auc_score(y_test, preds), 2)
128 |         print("AUC score on time split {} is {}".format(k, score))
129 |         k += 1
130 | 
131 | 
132 | def feature_importances(fm_enc, clf, feats=5):
133 |     feature_imps = [(imp, fm_enc.columns[i])
134 |                     for i, imp in enumerate(clf.feature_importances_)]
135 |     feature_imps.sort()
136 |     feature_imps.reverse()
137 |     print('Feature Importances: ')
138 |     for i, f in enumerate(feature_imps[0:feats]):
139 |         print('{}: {}'.format(i + 1, f[1]))
140 |     print("-----\n")
141 |     return ([f[1] for f in feature_imps[0:feats]])
142 | 
143 | 
144 | def datashop_plot(fm, col1='', col2='', label=None, names=['', '', '']):
145 |     colorlist = ['#3A3A3A', '#1072B9', '#B22222']
146 |     # Map each label value (0 or 1) to a color and a readable description
147 |     colormap = {name: colorlist[name] for name in label}
148 |     colors = [colormap[x] for x in label]
149 |     labelmap = {0: 'INCORRECT', 1: 'CORRECT'}
150 |     desc = [labelmap[x] for x in label]
151 |     source = ColumnDataSource(dict(
152 |         x=fm[col1],
153 |         y=fm[col2],
154 |         desc=desc,
155 |         color=colors,
156 |         index=fm.index,
157 |         problem_step=fm['Step Name'],
158 |         problem=fm['problem_steps.Problem Name'],
159 |         attempt=fm['Attempt At Step']
160 |     ))
161 |     hover = HoverTool(tooltips=[
162 |         ("(x,y)", "(@x, @y)"),
163 |         ("problem", "@problem"),
164 |         ("problem step", "@problem_step"),
165 |     ])
166 | 
167 |     p = figure(title=names[0],
168 |                tools=['box_zoom', hover, 'reset'], width=800)
169 |     p.scatter(x='x',
170 |               y='y',
171 |               color='color',
172 |               legend_group='desc',
173 |               source=source,
174 |               alpha=.6)
175 | 
176 |     p.xaxis.axis_label = names[1]
177 |     p.yaxis.axis_label = names[2]
178 |     return p
179 | 
180 | 
181 | from sklearn.preprocessing import LabelEncoder
182 | 
183 | 
184 | def inplace_encoder(X):
185 |     # Label-encode every column of X in place, casting to str first so
186 |     # missing values and mixed types are handled uniformly
187 |     for col in X:
188 |         le = LabelEncoder()
189 |         # LabelEncoder expects a 1-D array, so pass the column itself
190 |         # rather than a single-column dataframe
191 |         X[col] = le.fit_transform(X[col].astype(str))
192 |     return X
193 | 
--------------------------------------------------------------------------------
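
For reference, here is a minimal sketch of how the helpers in `utils.py` fit together (it mirrors the flow of the demo notebook but is not an exact transcript, and it assumes the geometry dataset `.txt` file has been placed in `data/`):

```
from sklearn.ensemble import RandomForestClassifier
from utils import datashop_to_entityset, create_features, feature_importances

# Build the EntitySet and the encoded feature matrix with its labels
es = datashop_to_entityset('data/ds76_tx_All_Data_74_2018_0912_070949.txt')
fm_enc, labels = create_features(es, label='Outcome')

# Fit a forest on the full matrix and list the most important features
clf = RandomForestClassifier(n_estimators=100)
clf.fit(fm_enc, labels)
top_features = feature_importances(fm_enc, clf, feats=5)
```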