├── Readme.md ├── [Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb ├── [Introduction] - Big tutorial - Text Classification.ipynb ├── [Supervised] [DL method] GRU_HAN.ipynb ├── [Unsupervised] LDA.ipynb └── pictures ├── LDA2VEC.png ├── characters_attention.gif ├── explainability.gif ├── generative_LDA.gif ├── pyldavis.png ├── tsne_lda.png ├── word_correlations.png └── word_frequency.png /Readme.md: -------------------------------------------------------------------------------- 1 | Multi-class text classification and LDA-based topic Recommender System 2 | ======================================================================== 3 | 4 | Here is **my winning strategy** for carrying out a multi-class text 5 | classification task. 6 | 7 | **Data Source** : 8 | https://catalog.data.gov/dataset/consumer-complaint-database 9 | 10 | 1 - Text Mining 11 | =============== 12 | 13 | - **Word Frequency Plot**: Compare frequencies across different texts 14 | and quantify how similar and different these sets of word 15 | frequencies are using a correlation test. How correlated are the 16 | word frequencies between text1 and text2, and between text1 and 17 | text3? 18 | 19 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/word_frequency.png) 20 | 21 | - **Most discriminant and important words per category** 22 | 23 | - **Relationships between words & Pairwise correlations**: examine 24 | which words tend to follow others immediately, or which tend to 25 | co-occur within the same documents. 26 | 27 | Which word is associated with another word? Note that this is a 28 | visualization of a Markov chain, a common model in text processing. In a 29 | Markov chain, each choice of word depends only on the previous word. In 30 | this case, a random generator following this model might spit out 31 | “collect”, then “agency”, then “report/credit/score”, by following each 32 | word to the most common words that follow it. To make the visualization 33 | interpretable, we chose to show only the most common word-to-word 34 | connections, but one could imagine an enormous graph representing all 35 | connections that occur in the text. 36 | 37 | - **Distribution of words**: The goal is to show that all texts have similar 38 | distributions, with many words that occur rarely and 39 | fewer words that occur frequently. This is where Zipf’s Law 40 | (extended with the harmonic mean) comes in - Zipf’s Law is a statistical 41 | distribution in certain data sets, such as words in a linguistic 42 | corpus, in which the frequencies of certain words are inversely 43 | proportional to their ranks. 44 | 45 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/word_correlations.png) 46 | 47 | - **Spelling variants of a given word** 48 | 49 | - **Chi-Square to see which words are associated with each category**: 50 | find the terms that are the most correlated with each of the 51 | categories 52 | 53 | - **Part of Speech Tags** and **Frequency distribution of POS tags**: Noun 54 | Count, Verb Count, Adjective Count, Adverb Count and Pronoun Count 55 | 56 | - **Metrics of words**: *Word Count of the documents* – i.e. the
total 57 | number of words in the documents, *Character Count of the documents* 58 | – total number of characters in the documents, *Average Word Density 59 | of the documents* – average length of the words used in the 60 | documents, *Punctuation Count in the Complete Essay* – total number 61 | of punctuation marks in the documents, *Upper Case Count in the 62 | Complete Essay* – total number of upper case words in the 63 | documents, *Title Word Count in the Complete Essay* – total number 64 | of proper case (title) words in the documents 65 | 66 | 2 - Word Embedding 67 | ================== 68 | 69 | ### A - Frequency Based Embedding 70 | 71 | - Count Vector 72 | - TF-IDF 73 | - Co-Occurrence Matrix with a fixed context window (SVD) 74 | - TF-ICF 75 | - Function Aware Components 76 | 77 | ### B - Prediction Based Embedding 78 | 79 | - CBOW (word2vec) 80 | - Skip-Grams (word2vec) 81 | - GloVe 82 | - FastText (works at the character n-gram level) 83 | - Topic Model as features // LDA features 84 | 85 | #### LDA 86 | 87 | Visualization provides a global view of the topics (and how they differ 88 | from each other), while at the same time allowing for a deep inspection 89 | of the terms most highly associated with each individual topic. It also relies on a novel 90 | method for choosing which terms to present to a user to aid in the task 91 | of topic interpretation, in which the relevance of a term to a 92 | topic is defined. 93 | 94 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/generative_LDA.gif) 95 | 96 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/pyldavis.png) 97 | 98 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/tsne_lda.png) 99 | 100 | ### C - Poincaré Embedding \[Embeddings and Hyperbolic Geometry\] 101 | 102 | The main innovation here is that these embeddings are learnt in 103 | **hyperbolic space**, as opposed to the commonly used **Euclidean 104 | space**. The reason behind this is that hyperbolic space is more 105 | suitable for capturing any hierarchical information inherently present 106 | in the graph. Embedding nodes into a Euclidean space while preserving 107 | the distance between the nodes usually requires a very high number of 108 | dimensions. 109 | 110 | https://arxiv.org/pdf/1705.08039.pdf 111 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb 112 | 113 | **Learning representations** of symbolic data such as text, graphs and 114 | multi-relational data has become a central paradigm in machine learning 115 | and artificial intelligence. For instance, word embeddings such as 116 | **WORD2VEC**, **GLOVE** and **FASTTEXT** are widely used for tasks 117 | ranging from machine translation to sentiment analysis. 118 | 119 | Typically, the **objective of embedding methods** is to organize 120 | symbolic objects (e.g., words, entities, concepts) in a way such that 121 | **their similarity in the embedding space reflects their semantic or 122 | functional similarity**. For this purpose, the similarity of objects is 123 | usually measured either by their **distance** or by their **inner 124 | product** in the embedding space. For instance, Mikolov et al. embed words in 125 | *R^d* such that their **inner product** is maximized when 126 | words co-occur within similar contexts in text corpora. This is 127 | motivated by the **distributional hypothesis**, i.e., that the meaning 128 | of words can be derived from the contexts in which they appear. 
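
To make the feature ideas of sections 1 and 2 concrete, here is a minimal sketch using scikit-learn. The three-document corpus, the labels and the number of topics are invented for illustration only (they are not taken from the consumer-complaint data); for the real task you would load the complaint narratives from the data.gov CSV instead.

```python
# Sketch: TF-IDF vectors, LDA topic features and a chi-square term/category test.
# The toy corpus and labels below are made up for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_selection import chi2

corpus = [
    "I am disputing a charge on my credit report",
    "The collection agency keeps calling about an old debt",
    "My mortgage payment was applied to the wrong account",
]
labels = ["credit_reporting", "debt_collection", "mortgage"]

# Frequency-based embedding (section 2.A): sparse TF-IDF document vectors
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(corpus)            # shape: (n_docs, n_terms)

# Topic-model features (section 2.B): raw counts -> LDA topic proportions
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_counts)           # shape: (n_docs, n_topics)

# Chi-square association between terms and categories (section 1)
chi2_scores, _ = chi2(X_counts, labels)
vocab = counts.get_feature_names_out()           # get_feature_names() on older scikit-learn
top_terms = vocab[np.argsort(chi2_scores)[-5:]]

print(X_tfidf.shape, X_topics.shape)
print("Most category-correlated terms:", list(top_terms))
```

Either the sparse TF-IDF matrix or the dense topic proportions (or a concatenation of the two) can then be fed to any of the classifiers listed in section 3 below.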
129 | 130 | 3 - Algorithms 131 | ============== 132 | 133 | ### A - Traditional Methods 134 | 135 | - CountVectorizer + Logistic Regression 136 | - CountVectorizer + NB 137 | - CountVectorizer + LightGBM 138 | - HashingTF + IDF + Logistic Regression 139 | - TFIDF + NB 140 | - TFIDF + LightGBM 141 | - TF-IDF + SVM 142 | - Hashing Vectorizer + Logistic Regression 143 | - Hashing Vectorizer + NB 144 | - Hashing Vectorizer + LightGBM 145 | - Bagging / Boosting 146 | - Word2Vec + Logistic Regression 147 | - Word2Vec + LightGBM 148 | - Word2Vec + XGBoost 149 | - LSA + SVM 150 | 151 | ### B - Deep Learning Methods 152 | 153 | - GRU + Attention Mechanism 154 | - CNN + RNN + Attention Mechanism 155 | - CNN + LSTM/GRU + Attention Mechanism 156 | 157 | 4 - Explainability 158 | ================== 159 | 160 | **Goal**: explain the predictions of arbitrary classifiers, including text 161 | classifiers (useful when it is hard to get an exact mapping between model 162 | coefficients and text features, e.g. if there is dimensionality reduction 163 | involved) 164 | 165 | - Lime 166 | - Skater 167 | - Shap 168 | 169 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/explainability.gif) 170 | 171 | 5 - My app for multi-class text classification with an attention mechanism 172 | ======================================================================= 173 | 174 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/characters_attention.gif) 175 | 176 | 6 - Resources / Bibliography 177 | ============================= 178 | 179 | - **All models** : 180 | https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/ 181 | 182 | - **CNN Text Classification**: 183 | https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb 184 | 185 | - **CNN Multichannel Text Classification + Hierarchical attention + 186 | …**: 187 | https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb 188 | 189 | - **Notes for Deep Learning** 190 | https://arxiv.org/pdf/1808.09772.pdf 191 | 192 | - **Doc classification with NLP** 193 | https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb 194 | 195 | - **Paragraph Topic Classification** 196 | http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf 197 | 198 | - **1D convolutional neural networks for NLP** 199 | https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb 200 | 201 | - **Hierarchical Attention for text classification** 202 | https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb 203 | 204 | - **Multi-class classification with scikit-learn** (Random forest, SVM, 205 | logistic regression) 206 | https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f 207 | https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb 208 | 209 | - **Text feature extraction TFIDF mathematics** 210 | https://dzone.com/articles/machine-learning-text-feature-0 211 | 212 | - **Classification Yelp Reviews (AWS)** 213 | http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/ 214 | 215 | - **Convolutional Neural Networks for Text Classification (wow)** 216 | http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ 217 | https://github.com/davidsbatista/ConvNets-for-sentence-classification 218 | 219 | - **3 ways to interpret your NLP model 
\[Lime, ELI5, Skater\]** 220 | https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb 221 | https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15 222 | https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375 223 | 224 | - **Deep Learning for text made easy with AllenNLP** 225 | https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31 226 | 227 | - **Ensemble Classifiers** 228 | https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/ 229 | 230 | - **Classification Algorithms** \[tfidf, count features, logistic 231 | regression, naive bayes, svm, xgboost, grid search, word vectors, 232 | LSTM, GRU, Ensembling\] : 233 | https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook 234 | 235 | - **Deep learning architecture** \[TextCNN, Bidirectional 236 | RNN (LSTM/GRU), Attention Models\] : 237 | https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/ 238 | and 239 | https://www.kaggle.com/mlwhiz/attention-pytorch-and-keras 240 | 241 | - **CNN + Word2Vec and LSTM + Word2Vec** : 242 | https://www.kaggle.com/kakiac/deep-learning-4-text-classification-cnn-bi-lstm 243 | 244 | - **Comparison of models** \[Bag of Words - CountVectorizer Features, 245 | TFIDF Features, Hashing Features, Word2Vec Features\] : 246 | https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/ 247 | 248 | - **Embed, encode, attend, predict** : 249 | https://explosion.ai/blog/deep-learning-formula-nlp 250 | 251 | - **Nice visualization for understanding CNNs** : 252 | http://www.thushv.com/natural_language_processing/make-cnns-for-nlp-great-again-classifying-sentences-with-cnns-in-tensorflow/ 253 | 254 | - **Yelp comments classification \[LSTM, LSTM + CNN\]** : 255 | https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb 256 | 257 | - **RNN text classification** : 258 | https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 259 | 260 | - **CNN for Sentence Classification** & **DCNN for Modelling 261 | Sentences** & **VDNN for Text Classification** & **Multi Channel 262 | Variable size CNN** & **Multi Group Norm Constraint CNN** & **RACNN 263 | Neural Networks for Text Classification**: 264 | https://bicepjai.github.io/machine-learning/2017/11/10/text-class-part1.html 265 | 266 | - **Transformers** : 267 | https://towardsdatascience.com/transformers-141e32e69591 268 | 269 | - **Seq2Seq** : 270 | https://guillaumegenthial.github.io/sequence-to-sequence.html 271 | 272 | 273 | - **The Illustrated BERT, ELMo, and co. 
(How NLP Cracked Transfer 274 | Learning)** : 275 | https://jalammar.github.io/ 276 | 277 | - **LSTM & GRU explanation** : 278 | https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 279 | 280 | - **Text classification using attention mechanism in Keras** : 281 | http://androidkt.com/text-classification-using-attention-mechanism-in-keras/ 282 | 283 | - **Bernoulli Naive Bayes & Multinomial Naive Bayes & Random Forests & 284 | Linear SVM & SVM with non-linear kernel** 285 | https://github.com/irfanelahi-ds/document-classification-python/blob/master/document_classification_python_sklearn_nltk.ipynb 286 | and 287 | https://richliao.github.io/ 288 | 289 | - **DL text classification** : 290 | https://gitlab.com/the_insighters/data-university/nuggets/document-classification-with-deep-learning 291 | 292 | - **1-D Convolutions over text** : 293 | http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ 294 | and 295 | https://github.com/davidsbatista/ConvNets-for-sentence-classification/blob/master/Convolutional-Neural-Networks-for-Sentence-Classification.ipynb 296 | 297 | - **\[Bonus\] Sentiment Analysis in PySpark** : 298 | https://github.com/tthustla/setiment_analysis_pyspark/blob/master/Sentiment%20Analysis%20with%20PySpark.ipynb 299 | 300 | - **RNN Text Generation** : 301 | https://github.com/priya-dwivedi/Deep-Learning/blob/master/RNN_text_generation/RNN_project.ipynb 302 | 303 | - **Finding similar documents with Word2Vec and Soft Cosine Measure**: 304 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb 305 | 306 | - **\[!! ESSENTIAL !!\] Text Classification with Hierarchical 307 | Attention Networks**: 308 | https://humboldt-wi.github.io/blog/research/information_systems_1819/group5_han/ 309 | 310 | - **\[ESSENTIAL for any NLP Project\]**: 311 | https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks 312 | 313 | - **Doc2Vec + Logistic Regression** : 314 | https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint_3.ipynb 315 | 316 | - **Doc2Vec -> just embedding**: 317 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb 318 | 319 | - **New way of embedding -> Poincaré Embeddings**: 320 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb 321 | 322 | - **Doc2Vec + Text similarity**: 323 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb 324 | 325 | - **Graph Link prediction + Part-of-Speech tagging tutorial with 326 | Keras**: 327 | https://github.com/Cdiscount/IT-Blog/tree/master/scripts/link-prediction 328 | & 329 | https://techblog.cdiscount.com/link-prediction-in-large-scale-networks/ 330 | 331 | 7 - Other Topics - Text Similarity \[Word Mover's Distance\] 332 | ========================================================= 333 | 334 | - **Finding similar documents with Word2Vec and WMD** : 335 | https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html 336 | 337 | - **Introduction to Wasserstein metric (earth mover’s distance)**: 338 | https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/ 339 | 340 | - **Earthmover Distance**: 341 | https://jeremykun.com/2018/03/05/earthmover-distance/ 342 | Problem: Compute the distance between points with uncertain locations 343 | (given by samples, or differing observations, or clusters). 
For 344 | example, if I have the following three “points” in the plane, as 345 | indicated by their colors, which is closer, blue to green, or blue 346 | to red? 347 | 348 | - **Word Mover’s distance calculation between word pairs of two 349 | documents**: 350 | https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents 351 | 352 | - **Word Mover’s Distance (WMD) for Python**: 353 | https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb 354 | 355 | - \[LECTURES\] : **Computational Optimal Transport** : 356 | https://optimaltransport.github.io/pdf/ComputationalOT.pdf 357 | 358 | - **Computing the Earth Mover’s Distance under Transformations** : 359 | http://robotics.stanford.edu/~scohen/research/emdg/emdg.html 360 | 361 | - **\[LECTURES\] Slides WMD**: 362 | http://robotics.stanford.edu/~rubner/slides/sld014.htm 363 | 364 | Others \[Quora Datset\] : 365 | ------------------------- 366 | 367 | - **BOW + Xgboost Model** + **Word level TF-IDF + XgBoost** + **N-gram 368 | Level TF-IDF + Xgboost** + **Character Level TF-IDF + XGboost**: 369 | https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Xgboost_bow_tfidf.ipynb 370 | 371 | 8 - Other Topics - Topic Modeling [LDA](#lda) 372 | ============================================= 373 | 374 | https://github.com/FelixChop/MediumArticles/blob/master/LDA-BBC.ipynb 375 | 376 | https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb 377 | 378 | - **TF-IDF + K-means & Latent Dirichlet Allocation (with Bokeh)**: 379 | https://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html 380 | 381 | - **\[!! ESSENTIAL !!\] Building a LDA-based Book Recommender 382 | System**: 383 | https://humboldt-wi.github.io/blog/research/information_systems_1819/is_lda_final/ 384 | 385 | 9 - Variational Autoencoder 386 | =========================== 387 | 388 | - **Text generation with a Variational Autoencoder** : 389 | https://github.com/NicGian/text_VAE 390 | 391 | - **Variational\_text\_inference** : 392 | https://github.com/s4sarath/Deep-Learning-Projects/tree/master/variational_text_inference 393 | and 394 | https://s4sarath.github.io/2016/11/23/variational_autoenocder_for_Natural_Language_Processing 395 | -------------------------------------------------------------------------------- /[Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Import necessary dependencies and settings" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import numpy as np\n", 18 | "import re\n", 19 | "import nltk\n", 20 | "import matplotlib.pyplot as plt\n", 21 | "\n", 22 | "pd.options.display.max_colwidth = 200\n", 23 | "%matplotlib inline" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# Sample corpus of text documents" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "text/html": [ 41 | "
\n", 42 | "\n", 55 | "\n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | "
DocumentCategory
0The sky is blue and beautiful.weather
1Love this blue and beautiful sky!weather
2The quick brown fox jumps over the lazy dog.animals
3A king's breakfast has sausages, ham, bacon, eggs, toast and beansfood
4I love green eggs, ham, sausages and bacon!food
5The brown fox is quick and the blue dog is lazy!animals
6The sky is very blue and the sky is very beautiful todayweather
7The dog is lazy but the brown fox is quick!animals
\n", 106 | "
" 107 | ], 108 | "text/plain": [ 109 | " Document Category\n", 110 | "0 The sky is blue and beautiful. weather\n", 111 | "1 Love this blue and beautiful sky! weather\n", 112 | "2 The quick brown fox jumps over the lazy dog. animals\n", 113 | "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans food\n", 114 | "4 I love green eggs, ham, sausages and bacon! food\n", 115 | "5 The brown fox is quick and the blue dog is lazy! animals\n", 116 | "6 The sky is very blue and the sky is very beautiful today weather\n", 117 | "7 The dog is lazy but the brown fox is quick! animals" 118 | ] 119 | }, 120 | "execution_count": 3, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "corpus = ['The sky is blue and beautiful.',\n", 127 | " 'Love this blue and beautiful sky!',\n", 128 | " 'The quick brown fox jumps over the lazy dog.',\n", 129 | " \"A king's breakfast has sausages, ham, bacon, eggs, toast and beans\",\n", 130 | " 'I love green eggs, ham, sausages and bacon!',\n", 131 | " 'The brown fox is quick and the blue dog is lazy!',\n", 132 | " 'The sky is very blue and the sky is very beautiful today',\n", 133 | " 'The dog is lazy but the brown fox is quick!' \n", 134 | "]\n", 135 | "labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']\n", 136 | "\n", 137 | "corpus = np.array(corpus)\n", 138 | "corpus_df = pd.DataFrame({'Document': corpus, \n", 139 | " 'Category': labels})\n", 140 | "corpus_df = corpus_df[['Document', 'Category']]\n", 141 | "corpus_df" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# Simple text pre-processing" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 4, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "wpt = nltk.WordPunctTokenizer()\n", 158 | "stop_words = nltk.corpus.stopwords.words('english')\n", 159 | "\n", 160 | "def normalize_document(doc):\n", 161 | " # lower case and remove special characters\\whitespaces\n", 162 | " doc = re.sub(r'[^a-zA-Z\\s]', '', doc, re.I|re.A)\n", 163 | " doc = doc.lower()\n", 164 | " doc = doc.strip()\n", 165 | " # tokenize document\n", 166 | " tokens = wpt.tokenize(doc)\n", 167 | " # filter stopwords out of document\n", 168 | " filtered_tokens = [token for token in tokens if token not in stop_words]\n", 169 | " # re-create document from filtered tokens\n", 170 | " doc = ' '.join(filtered_tokens)\n", 171 | " return doc\n", 172 | "\n", 173 | "normalize_corpus = np.vectorize(normalize_document)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "array(['sky blue beautiful', 'love blue beautiful sky',\n", 185 | " 'quick brown fox jumps lazy dog',\n", 186 | " 'kings breakfast sausages ham bacon eggs toast beans',\n", 187 | " 'love green eggs ham sausages bacon',\n", 188 | " 'brown fox quick blue dog lazy', 'sky blue sky beautiful today',\n", 189 | " 'dog lazy brown fox quick'], dtype='\n", 251 | "\n", 264 | "\n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | 
" \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | "
baconbeansbeautifulbluebreakfastbrowndogeggsfoxgreenhamjumpskingslazylovequicksausagesskytoasttoday
000110000000000000100
100110000000000100100
200000110100101010000
311001001001010001010
410000001011000101000
500010110100001010000
600110000000000000201
700000110100001010000
\n", 477 | "" 478 | ], 479 | "text/plain": [ 480 | " bacon beans beautiful blue breakfast brown dog eggs fox green \\\n", 481 | "0 0 0 1 1 0 0 0 0 0 0 \n", 482 | "1 0 0 1 1 0 0 0 0 0 0 \n", 483 | "2 0 0 0 0 0 1 1 0 1 0 \n", 484 | "3 1 1 0 0 1 0 0 1 0 0 \n", 485 | "4 1 0 0 0 0 0 0 1 0 1 \n", 486 | "5 0 0 0 1 0 1 1 0 1 0 \n", 487 | "6 0 0 1 1 0 0 0 0 0 0 \n", 488 | "7 0 0 0 0 0 1 1 0 1 0 \n", 489 | "\n", 490 | " ham jumps kings lazy love quick sausages sky toast today \n", 491 | "0 0 0 0 0 0 0 0 1 0 0 \n", 492 | "1 0 0 0 0 1 0 0 1 0 0 \n", 493 | "2 0 1 0 1 0 1 0 0 0 0 \n", 494 | "3 1 0 1 0 0 0 1 0 1 0 \n", 495 | "4 1 0 0 0 1 0 1 0 0 0 \n", 496 | "5 0 0 0 1 0 1 0 0 0 0 \n", 497 | "6 0 0 0 0 0 0 0 2 0 1 \n", 498 | "7 0 0 0 1 0 1 0 0 0 0 " 499 | ] 500 | }, 501 | "execution_count": 7, 502 | "metadata": {}, 503 | "output_type": "execute_result" 504 | } 505 | ], 506 | "source": [ 507 | "# get all unique words in the corpus\n", 508 | "vocab = cv.get_feature_names()\n", 509 | "# show document feature vectors\n", 510 | "pd.DataFrame(cv_matrix, columns=vocab)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "# Bag of N-Grams Model" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 8, 523 | "metadata": {}, 524 | "outputs": [ 525 | { 526 | "data": { 527 | "text/html": [ 528 | "
\n", 529 | "\n", 542 | "\n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | "
bacon eggsbeautiful skybeautiful todayblue beautifulblue dogblue skybreakfast sausagesbrown foxdog lazyeggs ham...lazy doglove bluelove greenquick bluequick brownsausages baconsausages hamsky beautifulsky bluetoast beans
00001000000...0000000010
10101000000...0100000000
20000000100...1000100000
31000001000...0000001001
40000000001...0010010000
50000100110...0001000000
60010010000...0000000110
70000000110...0000000000
\n", 764 | "

8 rows × 29 columns

\n", 765 | "
" 766 | ], 767 | "text/plain": [ 768 | " bacon eggs beautiful sky beautiful today blue beautiful blue dog \\\n", 769 | "0 0 0 0 1 0 \n", 770 | "1 0 1 0 1 0 \n", 771 | "2 0 0 0 0 0 \n", 772 | "3 1 0 0 0 0 \n", 773 | "4 0 0 0 0 0 \n", 774 | "5 0 0 0 0 1 \n", 775 | "6 0 0 1 0 0 \n", 776 | "7 0 0 0 0 0 \n", 777 | "\n", 778 | " blue sky breakfast sausages brown fox dog lazy eggs ham ... \\\n", 779 | "0 0 0 0 0 0 ... \n", 780 | "1 0 0 0 0 0 ... \n", 781 | "2 0 0 1 0 0 ... \n", 782 | "3 0 1 0 0 0 ... \n", 783 | "4 0 0 0 0 1 ... \n", 784 | "5 0 0 1 1 0 ... \n", 785 | "6 1 0 0 0 0 ... \n", 786 | "7 0 0 1 1 0 ... \n", 787 | "\n", 788 | " lazy dog love blue love green quick blue quick brown sausages bacon \\\n", 789 | "0 0 0 0 0 0 0 \n", 790 | "1 0 1 0 0 0 0 \n", 791 | "2 1 0 0 0 1 0 \n", 792 | "3 0 0 0 0 0 0 \n", 793 | "4 0 0 1 0 0 1 \n", 794 | "5 0 0 0 1 0 0 \n", 795 | "6 0 0 0 0 0 0 \n", 796 | "7 0 0 0 0 0 0 \n", 797 | "\n", 798 | " sausages ham sky beautiful sky blue toast beans \n", 799 | "0 0 0 1 0 \n", 800 | "1 0 0 0 0 \n", 801 | "2 0 0 0 0 \n", 802 | "3 1 0 0 1 \n", 803 | "4 0 0 0 0 \n", 804 | "5 0 0 0 0 \n", 805 | "6 0 1 1 0 \n", 806 | "7 0 0 0 0 \n", 807 | "\n", 808 | "[8 rows x 29 columns]" 809 | ] 810 | }, 811 | "execution_count": 8, 812 | "metadata": {}, 813 | "output_type": "execute_result" 814 | } 815 | ], 816 | "source": [ 817 | "# you can set the n-gram range to 1,2 to get unigrams as well as bigrams\n", 818 | "bv = CountVectorizer(ngram_range=(2,2))\n", 819 | "bv_matrix = bv.fit_transform(norm_corpus)\n", 820 | "\n", 821 | "bv_matrix = bv_matrix.toarray()\n", 822 | "vocab = bv.get_feature_names()\n", 823 | "pd.DataFrame(bv_matrix, columns=vocab)" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "# TF-IDF Model" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 12, 836 | "metadata": {}, 837 | "outputs": [ 838 | { 839 | "name": "stderr", 840 | "output_type": "stream", 841 | "text": [ 842 | "C:\\Program Files\\Anaconda3\\lib\\site-packages\\sklearn\\feature_extraction\\text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 843 | " if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):\n" 844 | ] 845 | }, 846 | { 847 | "data": { 848 | "text/html": [ 849 | "
\n", 850 | "\n", 863 | "\n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | "
baconbeansbeautifulbluebreakfastbrowndogeggsfoxgreenhamjumpskingslazylovequicksausagesskytoasttoday
00.000.000.600.530.000.000.000.000.000.000.000.000.000.000.000.000.000.600.000.0
10.000.000.490.430.000.000.000.000.000.000.000.000.000.000.570.000.000.490.000.0
20.000.000.000.000.000.380.380.000.380.000.000.530.000.380.000.380.000.000.000.0
30.320.380.000.000.380.000.000.320.000.000.320.000.380.000.000.000.320.000.380.0
40.390.000.000.000.000.000.000.390.000.470.390.000.000.000.390.000.390.000.000.0
50.000.000.000.370.000.420.420.000.420.000.000.000.000.420.000.420.000.000.000.0
60.000.000.360.320.000.000.000.000.000.000.000.000.000.000.000.000.000.720.000.5
70.000.000.000.000.000.450.450.000.450.000.000.000.000.450.000.450.000.000.000.0
\n", 1076 | "
" 1077 | ], 1078 | "text/plain": [ 1079 | " bacon beans beautiful blue breakfast brown dog eggs fox green \\\n", 1080 | "0 0.00 0.00 0.60 0.53 0.00 0.00 0.00 0.00 0.00 0.00 \n", 1081 | "1 0.00 0.00 0.49 0.43 0.00 0.00 0.00 0.00 0.00 0.00 \n", 1082 | "2 0.00 0.00 0.00 0.00 0.00 0.38 0.38 0.00 0.38 0.00 \n", 1083 | "3 0.32 0.38 0.00 0.00 0.38 0.00 0.00 0.32 0.00 0.00 \n", 1084 | "4 0.39 0.00 0.00 0.00 0.00 0.00 0.00 0.39 0.00 0.47 \n", 1085 | "5 0.00 0.00 0.00 0.37 0.00 0.42 0.42 0.00 0.42 0.00 \n", 1086 | "6 0.00 0.00 0.36 0.32 0.00 0.00 0.00 0.00 0.00 0.00 \n", 1087 | "7 0.00 0.00 0.00 0.00 0.00 0.45 0.45 0.00 0.45 0.00 \n", 1088 | "\n", 1089 | " ham jumps kings lazy love quick sausages sky toast today \n", 1090 | "0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.60 0.00 0.0 \n", 1091 | "1 0.00 0.00 0.00 0.00 0.57 0.00 0.00 0.49 0.00 0.0 \n", 1092 | "2 0.00 0.53 0.00 0.38 0.00 0.38 0.00 0.00 0.00 0.0 \n", 1093 | "3 0.32 0.00 0.38 0.00 0.00 0.00 0.32 0.00 0.38 0.0 \n", 1094 | "4 0.39 0.00 0.00 0.00 0.39 0.00 0.39 0.00 0.00 0.0 \n", 1095 | "5 0.00 0.00 0.00 0.42 0.00 0.42 0.00 0.00 0.00 0.0 \n", 1096 | "6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.72 0.00 0.5 \n", 1097 | "7 0.00 0.00 0.00 0.45 0.00 0.45 0.00 0.00 0.00 0.0 " 1098 | ] 1099 | }, 1100 | "execution_count": 12, 1101 | "metadata": {}, 1102 | "output_type": "execute_result" 1103 | } 1104 | ], 1105 | "source": [ 1106 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1107 | "\n", 1108 | "tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)\n", 1109 | "tv_matrix = tv.fit_transform(norm_corpus)\n", 1110 | "tv_matrix = tv_matrix.toarray()\n", 1111 | "\n", 1112 | "vocab = tv.get_feature_names()\n", 1113 | "pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "# Document Similarity" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 13, 1126 | "metadata": {}, 1127 | "outputs": [ 1128 | { 1129 | "data": { 1130 | "text/html": [ 1131 | "
\n", 1132 | "\n", 1145 | "\n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | "
01234567
01.0000000.8205990.0000000.0000000.0000000.1923530.8172460.000000
10.8205991.0000000.0000000.0000000.2254890.1578450.6706310.000000
20.0000000.0000001.0000000.0000000.0000000.7918210.0000000.850516
30.0000000.0000000.0000001.0000000.5068660.0000000.0000000.000000
40.0000000.2254890.0000000.5068661.0000000.0000000.0000000.000000
50.1923530.1578450.7918210.0000000.0000001.0000000.1154880.930989
60.8172460.6706310.0000000.0000000.0000000.1154881.0000000.000000
70.0000000.0000000.8505160.0000000.0000000.9309890.0000001.000000
\n", 1250 | "
" 1251 | ], 1252 | "text/plain": [ 1253 | " 0 1 2 3 4 5 6 \\\n", 1254 | "0 1.000000 0.820599 0.000000 0.000000 0.000000 0.192353 0.817246 \n", 1255 | "1 0.820599 1.000000 0.000000 0.000000 0.225489 0.157845 0.670631 \n", 1256 | "2 0.000000 0.000000 1.000000 0.000000 0.000000 0.791821 0.000000 \n", 1257 | "3 0.000000 0.000000 0.000000 1.000000 0.506866 0.000000 0.000000 \n", 1258 | "4 0.000000 0.225489 0.000000 0.506866 1.000000 0.000000 0.000000 \n", 1259 | "5 0.192353 0.157845 0.791821 0.000000 0.000000 1.000000 0.115488 \n", 1260 | "6 0.817246 0.670631 0.000000 0.000000 0.000000 0.115488 1.000000 \n", 1261 | "7 0.000000 0.000000 0.850516 0.000000 0.000000 0.930989 0.000000 \n", 1262 | "\n", 1263 | " 7 \n", 1264 | "0 0.000000 \n", 1265 | "1 0.000000 \n", 1266 | "2 0.850516 \n", 1267 | "3 0.000000 \n", 1268 | "4 0.000000 \n", 1269 | "5 0.930989 \n", 1270 | "6 0.000000 \n", 1271 | "7 1.000000 " 1272 | ] 1273 | }, 1274 | "execution_count": 13, 1275 | "metadata": {}, 1276 | "output_type": "execute_result" 1277 | } 1278 | ], 1279 | "source": [ 1280 | "from sklearn.metrics.pairwise import cosine_similarity\n", 1281 | "\n", 1282 | "similarity_matrix = cosine_similarity(tv_matrix)\n", 1283 | "similarity_df = pd.DataFrame(similarity_matrix)\n", 1284 | "similarity_df" 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "markdown", 1289 | "metadata": {}, 1290 | "source": [ 1291 | "## Clustering documents using similarity features" 1292 | ] 1293 | }, 1294 | { 1295 | "cell_type": "code", 1296 | "execution_count": 14, 1297 | "metadata": {}, 1298 | "outputs": [ 1299 | { 1300 | "data": { 1301 | "text/html": [ 1302 | "
\n", 1303 | "\n", 1316 | "\n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | "
Document\\Cluster 1Document\\Cluster 2DistanceCluster Size
0270.2530982
1060.3085392
2580.3869523
3190.4898453
4340.7329452
511122.695655
610133.451088
\n", 1378 | "
" 1379 | ], 1380 | "text/plain": [ 1381 | " Document\\Cluster 1 Document\\Cluster 2 Distance Cluster Size\n", 1382 | "0 2 7 0.253098 2\n", 1383 | "1 0 6 0.308539 2\n", 1384 | "2 5 8 0.386952 3\n", 1385 | "3 1 9 0.489845 3\n", 1386 | "4 3 4 0.732945 2\n", 1387 | "5 11 12 2.69565 5\n", 1388 | "6 10 13 3.45108 8" 1389 | ] 1390 | }, 1391 | "execution_count": 14, 1392 | "metadata": {}, 1393 | "output_type": "execute_result" 1394 | } 1395 | ], 1396 | "source": [ 1397 | "from scipy.cluster.hierarchy import dendrogram, linkage\n", 1398 | "\n", 1399 | "Z = linkage(similarity_matrix, 'ward')\n", 1400 | "pd.DataFrame(Z, columns=['Document\\Cluster 1', 'Document\\Cluster 2', \n", 1401 | " 'Distance', 'Cluster Size'], dtype='object')" 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": 15, 1407 | "metadata": {}, 1408 | "outputs": [ 1409 | { 1410 | "data": { 1411 | "text/plain": [ 1412 | "" 1413 | ] 1414 | }, 1415 | "execution_count": 15, 1416 | "metadata": {}, 1417 | "output_type": "execute_result" 1418 | }, 1419 | { 1420 | "data": { 1421 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAfUAAADjCAYAAACcsI0jAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHzJJREFUeJzt3X+cVnWd9/HXWyDBMKkgQBTJwkqzRp3VtNzm3rSEteTe\nagVK061GTdssqzVv18xuu6t7+ykqTeWi5dhaaouGd7abo1Jqgo4gKkimgaACJTCAKPi5/zjf0YvL\na2augevMNXPm/Xw8rsdc55zvOedznTkzn+t7vt9zvooIzMzMbODbrd4BmJmZWW04qZuZmRWEk7qZ\nmVlBOKmbmZkVhJO6mZlZQTipm5mZFYSTuvV7kpZIauoHcUySFJKGdrH8PEk/ynMfVax/oaSf7koM\ntSKpQ9L+9Y6jFtLv5I31jsOsJ07qVleSHpN0TNm8UyTN75yOiIMioq3Pg+uliPhaRHwi7/1Imilp\nQUqaqyXdLOldNdz+Ln2x6BQRIyPi0VrF1Sl9cXle0sb0WiZplqTxtd6X2UDjpG6FtTNJSdKQPGKp\nFUmfA74LfA0YC0wELgU+UM+4Su3ql4Eq/UdE7Am8BvifwDhgYT0Sey3PGWX8f9l2mk8e6/dKa/OS\ndpN0rqQ/Slon6VpJr0nLOmuYH5f0Z+C3af7PJT0pab2k2yUdVLLtOZIulzRP0ibgf0gaIelbkh5P\n68yXNKIkpI9I+rOktZL+V8m2drj0Leldkn4v6RlJKySdkub/vaT7JG1I8y+s8jjsBVwEnBkR10fE\npoh4PiJuiogvVijfJGllN8fy8FTj3yDpKUnfTsVuTz+fSVcDjkzl/0nSQ5L+KunXkvYr2W5IOlPS\nI8AjJfPeWHKcL5X0q1S7vlvSG0rWf6+kpel4XybpNkk9XvVIn38JcCKwBjinZJvHS2pPx//3kt5W\ndhw+L2lR2ud/SBpesvwL6SrIKkn/VHYMK50ze0m6StKadN6c35mcJQ1J59NaSX+SdJZKroRIapN0\nsaTfAZuB/SWdmo71RkmPSjqt/Pcq6YuSnk5xTpM0VdlVi79IOq+nY2fF5KRuA82ngWnAu4G9gb+S\n1VRLvRt4C/C+NH0zMBl4HXAvcHVZ+ZnAxcCewHzg34DDgKPIaoJfBF4oKf8u4E3Ae4ALJL2lPMiU\n8G4GLgHGAA1Ae1q8CTgZGAX8PXCGpGlVfPYjgeHADVWUrcb3gO9FxKuANwDXpvl/m36OSpfQ75R0\nAnAe8A9kn+cO4Jqy7U0DjgAO7GJ/04GvAK8GlpMdcySNBn4BfAl4LbCU7NhXLSK2A/8JHJ22eQhw\nBXBa2uYPgLmSdi9Z7R+B44DXA28DTknrHgd8HjiW7LzZoXkoKT9nLgH2AvYnO/9OBk5NZT8JTCE7\nBw4lO07lTgKa0/YeB54GjgdelbbzHUmHlpQfR3YuTAAuAH4IfJTsvD0a+FdJr+/qeFmBRYRfftXt\nBTwGdADPlLw2A/PLyhyT3j8EvKdk2XjgeWAoMAkIYP9u9jcqldkrTc8BripZvhuwBXh7hXU7t79P\nybw/ANPT+wuBn6b3XwJuqPIYfBf4Ttk+hlYo9xHgyR62VRpDE7CywvHuPJa3kyXZ0V18zqEl824G\nPl52nDYD+6XpAP6ubDsBvLHkOP+oZNlU4OH0/mTgzpJlAlYAn+jpM5bNPx14JL2/HPhq2fKlwLtL\njsNHS5Z9E5id3l8BfL1k2QEVPkvpOTMEeA44sGTeaUBbev9b4LSSZceUHl+gDbioh9/rL4HPlPxe\ntwBD0vSeaXtHlJRfCEzL6+/Wr/77ck3d+oNpETGq8wV8qpuy+wE3pEuqz5Al+e1k7cudVnS+SZc+\nv67scv0Gsn/mAKMrlU/zhwN/7CaGJ0vebwZGViizb1fbkHSEpFvTpdr1ZMlodKWyZdYBo1W7NuuP\nkyWshyXdI+n4bsruB3yv5Lj/hSz5Tigps6Limi/p6rjtXbpuRASwQ7NBlSakuDrjPacz3hTzvmlf\nvYqHrOZcrvycGVZW7nFeOjbl26t0nHaYJ2mKpLvSpfRnyL4ElZ4j6yK7OgFZggd4qmT5Fiqfl1Zw\nTuo20KwAppR+CYiI4RHxREmZ0qEHZwInkNWO9iKrhUKWkCqVXws8S3Y5elfj7GobrcBcYN+I2AuY\nXRZPV+4EtlL58m0lm4A9OieUdega0zkdEY9ExAyyZolvAL+Q9Ep2PB6dVpDVNkuP+4iI+H1JmZ0d\n8nE1sE9JnCqdrkZqv34/WbNAZ7wXl8W7R0SUNxl0Fc++JdMTK5QpP2eeJ/siUbpO5zm5w+cr2/bL\ntpeaCK4jawYam77ozqO6c8QGOSd1G2hmAxd3dtKSNCa193ZlT7JEuI4swX2tu41HxAtkl1+/LWnv\nVNM/sqwtthpXA8dI+kdJQyW9VlJDSUx/iYhnJR1O9sWjRxGxn
qz99NLUMWoPScNSre6bFVZZBgxX\n1jFvGHA+8OLnkPRRSWPSZ34mzX6BrMPZC2Ttw51mA19S6mSYOoZ9uMpj0ZNfAQenzzQUOJOszbhH\n6di+hax9fxzQ2dnvh8Dp6aqIJL0yHYc9q9jstcApkg6UtAfw5e4KpxrztWTn5Z7p3Pwc0Nlp8lrg\nM5ImSBoF/EsP+38F2e9pDbBN0hTgvVXEbeakbgPO98hqubdI2gjcRdY5qytXkV0KfQJ4MJXvyeeB\nxcA9ZJdzv0Ev/1Yi4s9kl0zPSdtoB96eFn8KuCjFfwEvdVCrZrvfIksY55P9018BnEXW5lpedn3a\n14/IPv8mdrysfRywRFIH2XGdHhFbImIzWSew36VL1++IiBvIjsPPUjPGA2Sdv3ZZRKwFPkzWrr2O\nrKPdArIvY105McW9nux8WAccFhGr0jYXkHVQm0XWmXI5qSNcFfHcTNbP4bdpvd9WsdqnyY7vo2Qd\n51rJvhxC9gXjFmARcB9ZrXsbWbNRpf1vBP6Z7Lz4K9mXvrnVxG6mrPnKzKx/SJfSVwIfiYhb6x1P\nraWa9+yI2K/Hwma95Jq6mdWdpPdJGpWaOc4jaz+u5qpKv6fsuQdTU1PBBLLL+bW6LdFsB07qZtYf\nHEl2t8Basg5v0yJiS/erDBgiu3Xwr2SX3x8ia3YxqzlffjczMysI19TNzMwKwkndzMysIPpiNKWa\nGj16dEyaNKneYZiZmfWZhQsXro2IMT2VG3BJfdKkSSxYsKDeYZiZmfUZSZUeV/wyuV1+lzRc0h8k\n3S9piaSvVCjTpGzYw/b0co9QMzOznZRnTX0r2ahNHekRlfMl3RwR5fee3hER3Q0kYWZmZlXILamn\nkZY60uSw9PL9c2ZmZjnJtfd7GgyjHXga+E1E3F2h2FGSFkm6uXOwiArbaZa0QNKCNWvW5BmymZnZ\ngJVrR7k0elFDGpnoBklvjYgHSorcC0xMl+inkg1KMbnCdlqAFoDGxkbX9su0tEBra72jMLNamTkT\nmpvrHYUNRH1yn3pEPAPcSjYqVOn8DRHRkd7PA4ZJGt0XMRVJayu0t9c7CjOrhfZ2f0m3nZdbTV3S\nGOD5iHhG0gjgWLKhG0vLjAOeiohI40rvRjaEovVSQwO0tdU7CjPbVU1N9Y7ABrI8L7+PB66UNIQs\nWV8bETdJOh0gImYDHwLOkLQN2EI2nrMvr5uZme2EPHu/LwIOqTB/dsn7WcCsvGIwMzMbTPzsdzMz\ns4JwUjczMysIJ3UzM7OCcFI3MzMrCCd1MzOzgnBSNzMzKwgndTMzs4JwUjczMysIJ3UzM7OCcFI3\nMzMrCCd1MzOzgnBSNzMzKwgndTMzs4LILalLGi7pD5Lul7RE0lcqlJGk70taLmmRpEPzisfMzKzo\n8hxPfSvwdxHRIWkYMF/SzRFxV0mZKcDk9DoCuDz9NDMzs17KraYemY40OSy9oqzYCcBVqexdwChJ\n4/OKyczMrMhybVOXNERSO/A08JuIuLusyARgRcn0yjTPzMzMeinXpB4R2yOiAdgHOFzSW3dmO5Ka\nJS2QtGDNmjW1DdLMzKwg+qT3e0Q8A9wKHFe26Alg35LpfdK88vVbIqIxIhrHjBmTX6BmZmYDWJ69\n38dIGpXejwCOBR4uKzYXODn1gn8HsD4iVucVk5mZWZHl2ft9PHClpCFkXx6ujYibJJ0OEBGzgXnA\nVGA5sBk4Ncd4zMzMCi23pB4Ri4BDKsyfXfI+gDPzisHMzGww8RPlzMzMCsJJ3czMrCCc1M3MzArC\nSd3MzKwgnNTNzMwKwkndzMysIJzUzczMCiLPh8+Y2SDX0gKtrfWOYmBpb89+NjXVNYwBZ+ZMaG6u\ndxT155q6meWmtfWlJGXVaWjIXla99nZ/eezkmrqZ5aqhAdra6h2FFZmvarzENXUzM7OCcFI3MzMr\nCCd1MzOzgnBSNzMzK4jckrqkfSXdKulBSUskfaZCmSZJ6yW1p9cFecVjZmZWdHn2ft8GnBMR90ra\nE1go6TcR8WBZuTsi4vgc4zAzMxsUcqupR8TqiLg3vd8IPARMyGt/ZmZmg12ftKlLmgQcAtxdYfFR\nkhZJulnSQV2s3yxpgaQFa9asyTFSMzOzgSv3pC5pJHAdcHZEbChbfC8wMSLeBlwC/LLSNiKiJSIa\nI6JxzJgx+QZsZmY2QOWa1CUNI0voV0fE9eXLI2JDRHSk9/OAYZJG5xmTmZlZUeXZ+13Aj4GHIuLb\nXZQZl8oh6fAUz7q8YjIzMyuyPHu/vxM4CVgsqXNIh/OAiQARMRv4EHCGpG3AFmB6RESOMZmZmRVW\nbkk9IuYD6qHMLGBWXjGYmZkNJn6inJmZWUE4qZuZmRWEk7qZmVlBOKmbmZkVhJO6mZlZQTipm5mZ\nFUTVSV3SfpKOSe9HpJHXzMzMrJ+oKqlL+iTwC+AHadY+dPGcdjMzM6uPamvqZ5I9IW4DQEQ8Arwu\nr6DMzMys96pN6lsj4rnOCUlDAT/O1czMrB+pNqnfJuk8YISkY4GfAzfmF5aZmZn1VrVJ/VxgDbAY\nOA2YB5yfV1BmZmbWe9UO6DICuCIifgggaUiatzmvwMzMzKx3qq2p/zdZEu80Aviv7laQtK+kWyU9\nKGmJpM9UKCNJ35e0XNIiSYdWH7qZmZmVqramPjwiOjonIqJD0h49rLMNOCci7k33tC+U9JuIeLCk\nzBRgcnodAVyefpqZmVkvVVtT31Rai5Z0GLCluxUiYnVE3JvebwQeAiaUFTsBuCoydwGjJI2vOnoz\nMzN7UbU19bOBn0taBQgYB5xY7U4kTQIOAe4uWzQBWFEyvTLNW13tts3MzCxTVVKPiHskvRl4U5q1\nNCKer2ZdSSOB64CzI2LDzgQpqRloBpg4ceLObMLMzKzwqq2pA/wNMCmtc6gkIuKq7laQNIwsoV8d\nEddXKPIEsG/J9D5p3g4iogVoAWhsbPRDb8zMzCqoKqlL+gnwBqAd2J5mB9BlUpck4MfAQxHx7S6K\nzQXOkvQzsg5y6yPCl97NzMx2QrU19UbgwIjoTS35ncBJwGJJ7WneecBEgIiYTfYQm6nAcrJ73k/t\nxfbNzMysRLVJ/QGyznFV16IjYj5Zp7ruygTZYDFmZma2i6pN6qOBByX9AdjaOTMiPpBLVGZmZtZr\n1Sb1C/MMwszMzHZdtbe03ZZ3IGZmZrZrqnqinKR3SLpHUoek5yRtl7RT95ybmZlZPqp9TOwsYAbw\nCNlgLp8ALs0rKDMzM+u9apM6EbEcGBIR2yPi34Hj8gvLzMzMeqvajnKbJb0CaJf0TbJb26r+QmBm\nZmb5qzYxn5TKngVsInu06z/kFZSZmZn1XrVJfVpEPBsRGyLiKxHxOeD4PAMzMzOz3qk2qX+swrxT\nahiHmZmZ7aJu29QlzQBmAq+XNLdk0auAv+QZmJmZmfVOTx3lfk/WKW408K2S+RuBRXkFZWZmZr3X\nbVKPiMeBxyUdA2yJ
iBckHQC8GVjcFwGamZlZdaptU78dGC5pAnALWW/4OXkFZWZmZr1XbVJXRGwm\nu43tsoj4MHBQtytIV0h6WtIDXSxvkrReUnt6XdC70M3MzKxU1Uld0pHAR4BfpXlDelhnDj0/de6O\niGhIr4uqjMXMzMwqqDapnw18CbghIpZI2h+4tbsVIuJ23EPezMysz/Rm6NXbSqYfBf65Bvs/StIi\n4Ang8xGxpFIhSc1AM8DEiRNrsFszM7Pi6ek+9e9GxNmSbgSifHlEfGAX9n0vMDEiOiRNBX4JTK5U\nMCJagBaAxsbGl8VhZmZmPdfUf5J+/lutdxwRG0rez5N0maTREbG21vsyMzMbDHq6T31h+nmbpDHp\n/Zpa7FjSOOCpiAhJh5O176+rxbbNzMwGox7b1CVdSDY6227ZpLYBl/TUW13SNUATMFrSSuDLwDCA\niJgNfAg4I21vCzA9Inxp3czMbCf11Kb+OeCdwN9ExJ/SvP2ByyV9NiK+09W6ETGju21HxCxgVu9D\nNjMzs0p6uqXtJGBGZ0KHF3u+fxQ4Oc/AzMzMrHd6SurDKnVcS+3qw/IJyczMzHZGT0n9uZ1cZmZm\nZn2sp45yb5e0ocJ8AcNziMfMzMx2Uk+3tPX0fHczMzPrJ6p99ruZmZn1c07qZmZmBeGkbmZmVhBO\n6mZmZgXhpG5mZlYQTupmZmYF4aRuZmZWEE7qZmZmBZFbUpd0haSnJT3QxXJJ+r6k5ZIWSTo0r1jM\nzMwGgzxr6nOA47pZPgWYnF7NwOU5xmJmZlZ4uSX1iLgd+Es3RU4ArorMXcAoSePzisfMzKzo6tmm\nPgFYUTK9Ms0zMzOzndDTKG39gqRmskv0jB07lgsvvJAPfvCDtLW1sW7dOpqbm2lpaeHggw9m5MiR\n3HnnncyYMYObbrqJrVu3MnPmTObMmcNhhx0GwMKFCznllFNobW1l99135/jjj+eaa67hyCOPpKOj\ng8WLF7+4zde+9rU0NTVx3XXX0dTUxKpVq1i2bNmLy8ePH09jYyM33ngj733ve1m2bBmPPfbYi8sn\nTZrEAQccwC233ML73/9+FixYwOrVq19cfsABB7D33nvT1ta2059p7Vro6FjIY48V5zMV8fc0GD/T\nmjUdbNq0mFWrivOZivh7Guif6fHHW9ltt91ZurQ4n6n891R1voyIXc25XW9cmgTcFBFvrbDsB0Bb\nRFyTppcCTRGxurttNjY2xoIFC3KIduBqasp+trXVMwqzl/O5aX1hMJxnkhZGRGNP5ep5+X0ucHLq\nBf8OYH1PCd3MzMy6ltvld0nXAE3AaEkrgS8DwwAiYjYwD5gKLAc2A6fmFYuZmdlgkFtSj4gZPSwP\n4My89m9mZjbY+IlyZmZmBeGkbmZmVhBO6mZmZgXhpG5mZlYQTupmZmYF4aRuZmZWEAPiMbFmZjZw\ntKxaRetTT/XZ/to73ghA033L+2yfM8eOpXnvvftsf9VyUjczs5pqfeop2js6aBg5sk/21/DDvkvm\nAO0dHQBO6mZmNjg0jBxJ2yGH1DuMXDTdd1+9Q+iS29TNzMwKwkndzMysIJzUzczMCsJJ3czMrCCc\n1M3MzAoi16Qu6ThJSyUtl3RuheVNktZLak+vC/KMx8zMrMhyu6VN0hDgUuBYYCVwj6S5EfFgWdE7\nIuL4vOIwMzMbLPKsqR8OLI+IRyPiOeBnwAk57s/MzGxQy/PhMxOAFSXTK4EjKpQ7StIi4Ang8xGx\npLyApGagGWDixIk5hGo2ALW0QGtrvaPoXvt3s59NZ9c3jmrMnAnNzfWOwmyX1PuJcvcCEyOiQ9JU\n4JfA5PJCEdECtAA0NjZG34Zo1k+1tkJ7OzQ01DuSLrU1DIBkDtlxBCd1G/DyTOpPAPuWTO+T5r0o\nIjaUvJ8n6TJJoyNibY5xmRVHQwO0tdU7ioGvqaneEZjVRJ5t6vcAkyW9XtIrgOnA3NICksZJUnp/\neIpnXY4xmZmZFVZuNfWI2CbpLODXwBDgiohYIun0tHw28CHgDEnbgC3A9IjoN5fXWxa20Lq4n7dZ\nAu1PZu2WTXP6/6XOmQfPpPkwX+I0M8tDrm3qETEPmFc2b3bJ+1nArDxj2BWti1tpf7KdhnH9t80S\noOHc/p/MAdqfzNotndTNzPJR745y/V7DuAbaTmmrdxiF0DSnqd4hmJkVmh8Ta2ZmVhCuqZtZ/9PX\n9+B33tLWl73gfV+85cA1dTPrfzrvwe8rDQ19e79/e3v/f3CQDUiuqQ9ifd27v7OjXF+2rbu3/QBW\n5HvwfV+85cQ19UGss3d/X2kY19CndxK0P9k+IG5JNDOrFdfUB7ki9+53b3szG2xcUzczMysIJ3Uz\nM7OCcFI3MzMrCCd1MzOzgnBSNzMzKwgndTMzs4LINalLOk7SUknLJZ1bYbkkfT8tXyTp0DzjMTMz\nK7LckrqkIcClwBTgQGCGpAPLik0BJqdXM3B5XvGYmZkVXZ419cOB5RHxaEQ8B/wMOKGszAnAVZG5\nCxglaXyOMZmZmRVWnkl9ArCiZHplmtfbMmZmZlaFAfGYWEnNZJfnATokLe3T/Z+qvtxdn/PnG+BU\n4M9X5M8Ghf98xf50ff759qumUJ5J/Qlg35LpfdK83pYhIlqAlloHaGZmViR5Xn6/B5gs6fWSXgFM\nB+aWlZkLnJx6wb8DWB8Rq3OMyczMrLByq6lHxDZJZwG/BoYAV0TEEkmnp+WzgXnAVGA5sBk4Na94\nzMzMik4RUe8YzMzMrAb8RDkzM7OCcFI3MzMrCCd1MzOzgnBS74KkNknPSupIrz69Nz5PknaX9GNJ\nj0vaKKld0pR6x1UrJb+zztd2SZfUO65akXSWpAWStkqaU+94ak3SayTdIGlTOkdn1jumWpM0XdJD\n6TP+UdLR9Y6pViT9VNKTkjZIWibpE/WOqdYkTU754af1jqXcgHj4TB2dFRE/qncQORhK9iS/dwN/\nJrsD4VpJB0fEY/UMrBYiYmTne0kjgSeBn9cvoppbBfxv4H3AiDrHkodLgeeAsUAD8CtJ90fEkvqG\nVRuSjgW+AZwI/AEo2qOxvw40R8RmSW8G2iTdFxEL6x1YDV1Kdtt2v+Oa+iAUEZsi4sKIeCwiXoiI\nm4A/AYfVO7YcfBB4Grij3oHUSkRcHxG/BNbVO5Zak/RKst/Zv0ZER0TMB/4TOKm+kdXUV4CLIuKu\n9Pf3RES87KFbA1VEPBARmzsn0+sNdQyppiRNB54B/rvesVTipN69/yNpraTfSWqqdzB5kTQWOAAo\nRE2ozMdIgwbVOxCrygHAtohYVjLvfuCgOsVTU2n0ykZgTBpyeqWkWZIKdcVF0mWSNgMPA6vJnkky\n4El6FXAR8Ll6x9IVJ/Wu/QuwP9kAMy3AjZIK822zk6RhwNXAlRHxcL3jqSVJ+5E1MVxZ71isaiOB\nDWXzNgB71iGWPIwFhgEfAo4ma144BDi/nkHVWkR8iux3djRwPbC1vhHVzFeBH
0fEynoH0hUn9S5E\nxN0RsTEitkbElcDvyNqeC0PSbsBPyNovz6pzOHk4CZgfEX+qdyBWtQ7gVWXz9gI21iGWPGxJPy+J\niNURsRb4NgX73wIQEdtT88k+wBn1jmdXSWoAjgG+U+9YuuOOctULCjTokCQBPyarOUyNiOfrHFIe\nTibrtGMDxzJgqKTJEfFImvd2CtI0FBF/lbSS7P/Ji7PrFU8fGUox2tSbgEnAn7N/n4wEhkg6MCIO\nrWNcO3BNvQJJoyS9T9JwSUMlfQT4W+D/1Tu2GroceAvw/ojY0lPhgUbSUWRNJ0Xq9Q5AOieHk42p\nMKTzPK13XLUQEZvILtdeJOmVkt4FfIDsilJR/DvwaUmvk/Rq4LPATXWOqSbSZ5ouaaSkIZLeB8yg\nn3Yq66UWsi8nDek1G/gV2V0o/UYh/hHkYBjZLUNvBraTdfaYVtZ5Z8BKbc2nkbVzPamXxnQ+LSKu\nrltgtfUx4PqIKMpl21LnA18umf4oWY/qC+sSTe19CriC7K6FdcAZRbmdLfkqMJrsqsSzwLXAxXWN\nqHaC7FL7bLJK4+PA2RFRPkLngJN69Hf26kdSB/BsRKypX1Qv5wFdzMzMCsKX383MzArCSd3MzKwg\nnNTNzMwKwkndzMysIJzUzczMCsJJ3czMrCCc1M0KII0Z3y5piaT7JZ2THgPc3TqT+mKsckk/knRg\nD2Wm9VTGzHrmpG5WDFsioiEiDgKOBaaw4wNqKpkE5J7UI+ITEfFgD8WmAU7qZrvISd2sYCLiaaAZ\nOEuZSZLukHRveh2Vin4dODrV8D/bTbkXpTIPS7pa0kOSfiFpj7TsPZLuk7RY0hWSdk/z2yQ1pvcd\nki5OVxPukjQ27ecDwP9NsRThOeFmdeGkblZAEfEo2bPhX0f2uNVj06ATJwLfT8XOBe5INfzvdFOu\n3JuAyyLiLWTDon4qPYt+DnBiRBxM9gjqSiNzvRK4KyLeDtwOfDIifg/MBb6QYvnjLn58s0HLSd2s\n+IYBP5S0mGyAm64uc1dbbkVE/C69/ynwLrJE/6eS8RGuJBsEqdxzvDR4yUKyJgAzqxEP6GJWQJL2\nJxuM6GmytvWnyIYw3Y1sEJFKPltlufIBI3ozgMTz8dKAE9vx/yCzmnJN3axgJI0hGyVrVkqgewGr\nI+IF4CSyy/IAG4E9S1btqly5iZKOTO9nAvOBpcAkSW9M808CbutF2OWxmNlOcFI3K4YRnbe0Af8F\n3EI2HCvAZcDHJN1PNpzwpjR/EbA9dVr7bDflyi0FzpT0EPBq4PKIeBY4Ffh5unz/AtkXi2r9DPhC\n6mjnjnJmO8lDr5pZ1SRNAm6KiLfWORQzq8A1dTMzs4JwTd3MzKwgXFM3MzMrCCd1MzOzgnBSNzMz\nKwgndTMzs4JwUjczMysIJ3UzM7OC+P8o4OY3karuJgAAAABJRU5ErkJggg==\n", 1422 | "text/plain": [ 1423 | "" 1424 | ] 1425 | }, 1426 | "metadata": {}, 1427 | "output_type": "display_data" 1428 | } 1429 | ], 1430 | "source": [ 1431 | "plt.figure(figsize=(8, 3))\n", 1432 | "plt.title('Hierarchical Clustering Dendrogram')\n", 1433 | "plt.xlabel('Data point')\n", 1434 | "plt.ylabel('Distance')\n", 1435 | "dendrogram(Z)\n", 1436 | "plt.axhline(y=1.0, c='k', ls='--', lw=0.5)" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "code", 1441 | "execution_count": 16, 1442 | "metadata": {}, 1443 | "outputs": [ 1444 | { 1445 | "data": { 1446 | "text/html": [ 1447 | "
\n", 1448 | "\n", 1461 | "\n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | "
DocumentCategoryClusterLabel
0The sky is blue and beautiful.weather2
1Love this blue and beautiful sky!weather2
2The quick brown fox jumps over the lazy dog.animals1
3A king's breakfast has sausages, ham, bacon, eggs, toast and beansfood3
4I love green eggs, ham, sausages and bacon!food3
5The brown fox is quick and the blue dog is lazy!animals1
6The sky is very blue and the sky is very beautiful todayweather2
7The dog is lazy but the brown fox is quick!animals1
\n", 1521 | "
" 1522 | ], 1523 | "text/plain": [ 1524 | " Document \\\n", 1525 | "0 The sky is blue and beautiful. \n", 1526 | "1 Love this blue and beautiful sky! \n", 1527 | "2 The quick brown fox jumps over the lazy dog. \n", 1528 | "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans \n", 1529 | "4 I love green eggs, ham, sausages and bacon! \n", 1530 | "5 The brown fox is quick and the blue dog is lazy! \n", 1531 | "6 The sky is very blue and the sky is very beautiful today \n", 1532 | "7 The dog is lazy but the brown fox is quick! \n", 1533 | "\n", 1534 | " Category ClusterLabel \n", 1535 | "0 weather 2 \n", 1536 | "1 weather 2 \n", 1537 | "2 animals 1 \n", 1538 | "3 food 3 \n", 1539 | "4 food 3 \n", 1540 | "5 animals 1 \n", 1541 | "6 weather 2 \n", 1542 | "7 animals 1 " 1543 | ] 1544 | }, 1545 | "execution_count": 16, 1546 | "metadata": {}, 1547 | "output_type": "execute_result" 1548 | } 1549 | ], 1550 | "source": [ 1551 | "from scipy.cluster.hierarchy import fcluster\n", 1552 | "max_dist = 1.0\n", 1553 | "\n", 1554 | "cluster_labels = fcluster(Z, max_dist, criterion='distance')\n", 1555 | "cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])\n", 1556 | "pd.concat([corpus_df, cluster_labels], axis=1)" 1557 | ] 1558 | }, 1559 | { 1560 | "cell_type": "markdown", 1561 | "metadata": {}, 1562 | "source": [ 1563 | "# Topic Models" 1564 | ] 1565 | }, 1566 | { 1567 | "cell_type": "code", 1568 | "execution_count": 17, 1569 | "metadata": {}, 1570 | "outputs": [ 1571 | { 1572 | "name": "stderr", 1573 | "output_type": "stream", 1574 | "text": [ 1575 | "C:\\Program Files\\Anaconda3\\lib\\site-packages\\sklearn\\decomposition\\online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.\n", 1576 | " DeprecationWarning)\n" 1577 | ] 1578 | }, 1579 | { 1580 | "data": { 1581 | "text/html": [ 1582 | "
\n", 1583 | "\n", 1596 | "\n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | "
T1T2T3
00.8321910.0834800.084329
10.8635540.0691000.067346
20.0477940.0477760.904430
30.0372430.9255590.037198
40.0491210.9030760.047802
50.0549010.0477780.897321
60.8882870.0556970.056016
70.0557040.0556890.888607
\n", 1656 | "
" 1657 | ], 1658 | "text/plain": [ 1659 | " T1 T2 T3\n", 1660 | "0 0.832191 0.083480 0.084329\n", 1661 | "1 0.863554 0.069100 0.067346\n", 1662 | "2 0.047794 0.047776 0.904430\n", 1663 | "3 0.037243 0.925559 0.037198\n", 1664 | "4 0.049121 0.903076 0.047802\n", 1665 | "5 0.054901 0.047778 0.897321\n", 1666 | "6 0.888287 0.055697 0.056016\n", 1667 | "7 0.055704 0.055689 0.888607" 1668 | ] 1669 | }, 1670 | "execution_count": 17, 1671 | "metadata": {}, 1672 | "output_type": "execute_result" 1673 | } 1674 | ], 1675 | "source": [ 1676 | "from sklearn.decomposition import LatentDirichletAllocation\n", 1677 | "\n", 1678 | "lda = LatentDirichletAllocation(n_topics=3, max_iter=10000, random_state=0)\n", 1679 | "dt_matrix = lda.fit_transform(cv_matrix)\n", 1680 | "features = pd.DataFrame(dt_matrix, columns=['T1', 'T2', 'T3'])\n", 1681 | "features" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "markdown", 1686 | "metadata": {}, 1687 | "source": [ 1688 | "## Show topics and their weights" 1689 | ] 1690 | }, 1691 | { 1692 | "cell_type": "code", 1693 | "execution_count": 18, 1694 | "metadata": {}, 1695 | "outputs": [ 1696 | { 1697 | "name": "stdout", 1698 | "output_type": "stream", 1699 | "text": [ 1700 | "[('sky', 4.3324395825632624), ('blue', 3.373753174831771), ('beautiful', 3.3323652405224857), ('today', 1.3325579841038182), ('love', 1.3304224288080069)]\n", 1701 | "\n", 1702 | "[('bacon', 2.332695948479998), ('eggs', 2.332695948479998), ('ham', 2.332695948479998), ('sausages', 2.332695948479998), ('love', 1.335454457601996), ('beans', 1.332773525378464), ('breakfast', 1.332773525378464), ('kings', 1.332773525378464), ('toast', 1.332773525378464), ('green', 1.3325433207547732)]\n", 1703 | "\n", 1704 | "[('brown', 3.3323474595768783), ('dog', 3.3323474595768783), ('fox', 3.3323474595768783), ('lazy', 3.3323474595768783), ('quick', 3.3323474595768783), ('jumps', 1.3324193736202712), ('blue', 1.2919635624485213)]\n", 1705 | "\n" 1706 | ] 1707 | } 1708 | ], 1709 | "source": [ 1710 | "tt_matrix = lda.components_\n", 1711 | "for topic_weights in tt_matrix:\n", 1712 | " topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]\n", 1713 | " topic = sorted(topic, key=lambda x: -x[1])\n", 1714 | " topic = [item for item in topic if item[1] > 0.6]\n", 1715 | " print(topic)\n", 1716 | " print()\n" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "markdown", 1721 | "metadata": {}, 1722 | "source": [ 1723 | "## Clustering documents using topic model features" 1724 | ] 1725 | }, 1726 | { 1727 | "cell_type": "code", 1728 | "execution_count": 19, 1729 | "metadata": {}, 1730 | "outputs": [ 1731 | { 1732 | "data": { 1733 | "text/html": [ 1734 | "
\n", 1735 | "\n", 1748 | "\n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " \n", 1786 | " \n", 1787 | " \n", 1788 | " \n", 1789 | " \n", 1790 | " \n", 1791 | " \n", 1792 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " \n", 1798 | " \n", 1799 | " \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | " \n", 1804 | " \n", 1805 | " \n", 1806 | " \n", 1807 | "
DocumentCategoryClusterLabel
0The sky is blue and beautiful.weather2
1Love this blue and beautiful sky!weather2
2The quick brown fox jumps over the lazy dog.animals1
3A king's breakfast has sausages, ham, bacon, eggs, toast and beansfood0
4I love green eggs, ham, sausages and bacon!food0
5The brown fox is quick and the blue dog is lazy!animals1
6The sky is very blue and the sky is very beautiful todayweather2
7The dog is lazy but the brown fox is quick!animals1
\n", 1808 | "
" 1809 | ], 1810 | "text/plain": [ 1811 | " Document \\\n", 1812 | "0 The sky is blue and beautiful. \n", 1813 | "1 Love this blue and beautiful sky! \n", 1814 | "2 The quick brown fox jumps over the lazy dog. \n", 1815 | "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans \n", 1816 | "4 I love green eggs, ham, sausages and bacon! \n", 1817 | "5 The brown fox is quick and the blue dog is lazy! \n", 1818 | "6 The sky is very blue and the sky is very beautiful today \n", 1819 | "7 The dog is lazy but the brown fox is quick! \n", 1820 | "\n", 1821 | " Category ClusterLabel \n", 1822 | "0 weather 2 \n", 1823 | "1 weather 2 \n", 1824 | "2 animals 1 \n", 1825 | "3 food 0 \n", 1826 | "4 food 0 \n", 1827 | "5 animals 1 \n", 1828 | "6 weather 2 \n", 1829 | "7 animals 1 " 1830 | ] 1831 | }, 1832 | "execution_count": 19, 1833 | "metadata": {}, 1834 | "output_type": "execute_result" 1835 | } 1836 | ], 1837 | "source": [ 1838 | "from sklearn.cluster import KMeans\n", 1839 | "\n", 1840 | "km = KMeans(n_clusters=3, random_state=0)\n", 1841 | "km.fit_transform(features)\n", 1842 | "cluster_labels = km.labels_\n", 1843 | "cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])\n", 1844 | "pd.concat([corpus_df, cluster_labels], axis=1)" 1845 | ] 1846 | } 1847 | ], 1848 | "metadata": { 1849 | "kernelspec": { 1850 | "display_name": "Python 3", 1851 | "language": "python", 1852 | "name": "python3" 1853 | }, 1854 | "language_info": { 1855 | "codemirror_mode": { 1856 | "name": "ipython", 1857 | "version": 3 1858 | }, 1859 | "file_extension": ".py", 1860 | "mimetype": "text/x-python", 1861 | "name": "python", 1862 | "nbconvert_exporter": "python", 1863 | "pygments_lexer": "ipython3", 1864 | "version": "3.7.0" 1865 | } 1866 | }, 1867 | "nbformat": 4, 1868 | "nbformat_minor": 2 1869 | } 1870 | -------------------------------------------------------------------------------- /[Introduction] - Big tutorial - Text Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Text Classification" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "All models : https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/\n", 15 | "\n", 16 | "CNN Text Classification\n", 17 | "https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb\n", 18 | "\n", 19 | "CNN Multichannel Text Classification + Hierarchical attention + ...\n", 20 | "https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb\n", 21 | "\n", 22 | "Notes for Deep Learning\n", 23 | "https://arxiv.org/pdf/1808.09772.pdf\n", 24 | "\n", 25 | "Doc classification with NLP\n", 26 | "https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb\n", 27 | "\n", 28 | "Paragraph Topic Classification\n", 29 | "http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf\n", 30 | "\n", 31 | "1D convolutional neural networks for NLP\n", 32 | "https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb\n", 33 | "\n", 34 | "Hierarchical Attention for text classification\n", 35 | "https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb\n", 36 | "\n", 37 | "Multi-class classification scikit learn (Random forest, SVM, logistic regression)\n", 38 | 
"https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f\n", 39 | "https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb\n", 40 | "\n", 41 | "Text feature extraction TFIDF mathematics\n", 42 | "https://dzone.com/articles/machine-learning-text-feature-0\n", 43 | "\n", 44 | "Classification Yelp Reviews (AWS)\n", 45 | "http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/\n", 46 | "\n", 47 | "Convolutional Neural Networks for Text Classification (waouuuuu)\n", 48 | "http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/\n", 49 | "https://github.com/davidsbatista/ConvNets-for-sentence-classification\n", 50 | "\n", 51 | "\n", 52 | "**3 ways to interpretate your NLP model** [Lime, ELI5, Skater]\n", 53 | "https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb\n", 54 | "https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15\n", 55 | "https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375\n", 56 | "\n", 57 | "Deep Learning for text made easy with AllenNLP\n", 58 | "https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31\n", 59 | "\n", 60 | "Ensemble Classifiers\n", 61 | "https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 1, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stderr", 71 | "output_type": "stream", 72 | "text": [ 73 | "C:\\Users\\adsieg\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\sklearn\\ensemble\\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n", 74 | " from numpy.core.umath_tests import inner1d\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm\n", 80 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", 81 | "from sklearn import decomposition, ensemble\n", 82 | "\n", 83 | "import pandas, xgboost, numpy, textblob, string\n", 84 | "#from keras.preprocessing import text, sequence\n", 85 | "#from keras import layers, models, optimizers" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "# DATA LOADING " 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### A. 
FIRST dataset: Consumer Reviews [Amazon] [2 labels / binary classification]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "# load the dataset\n", 109 | "data = open('corpus.txt', encoding=\"utf8\").read()\n", 110 | "labels, texts = [], []\n", 111 | "for i, line in enumerate(data.split(\"\\n\")):\n", 112 | " content = line.split()\n", 113 | " labels.append(content[0])\n", 114 | " texts.append(\" \".join(content[1:]))\n", 115 | "\n", 116 | "# create a dataframe using texts and lables\n", 117 | "trainDF = pandas.DataFrame()\n", 118 | "trainDF['text'] = texts\n", 119 | "trainDF['label'] = labels" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "trainDF.head()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "trainDF.shape" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# split the dataset into training and validation datasets \n", 147 | "train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])\n", 148 | "\n", 149 | "# label encode the target variable \n", 150 | "encoder = preprocessing.LabelEncoder()\n", 151 | "train_y = encoder.fit_transform(train_y)\n", 152 | "valid_y = encoder.fit_transform(valid_y)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "train_x.shape" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### B. Second dataset: Consumer Complaints [Banking industry] [multi-classification]" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 1, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "import pandas as pd\n", 178 | "df = pd.read_csv('C:/Users/adsieg/Desktop/link_news/ML/Consumer_Complaints.csv')\n", 179 | "df.head()\n", 180 | "\n", 181 | "df = df[pd.notnull(df['Consumer complaint narrative'])]\n", 182 | "\n", 183 | "col = ['Product', 'Consumer complaint narrative']\n", 184 | "df = df[col]\n", 185 | "df.columns = ['Product', 'Consumer_complaint_narrative']\n", 186 | "\n", 187 | "df['category_id'] = df['Product'].factorize()[0]\n", 188 | "from io import StringIO\n", 189 | "category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')\n", 190 | "category_to_id = dict(category_id_df.values)\n", 191 | "id_to_category = dict(category_id_df[['category_id', 'Product']].values)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "id_to_category" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 2, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "data": { 210 | "text/html": [ 211 | "
\n", 212 | "\n", 225 | "\n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | "
ProductConsumer_complaint_narrativecategory_id
1Credit reportingI have outdated information on my credit repor...0
2Consumer LoanI purchased a new car on XXXX XXXX. The car de...1
7Credit reportingAn account on my credit report has a mistaken ...0
12Debt collectionThis company refuses to provide me verificatio...2
16Debt collectionThis complaint is in regards to Square Two Fin...2
\n", 267 | "
" 268 | ], 269 | "text/plain": [ 270 | " Product Consumer_complaint_narrative \\\n", 271 | "1 Credit reporting I have outdated information on my credit repor... \n", 272 | "2 Consumer Loan I purchased a new car on XXXX XXXX. The car de... \n", 273 | "7 Credit reporting An account on my credit report has a mistaken ... \n", 274 | "12 Debt collection This company refuses to provide me verificatio... \n", 275 | "16 Debt collection This complaint is in regards to Square Two Fin... \n", 276 | "\n", 277 | " category_id \n", 278 | "1 0 \n", 279 | "2 1 \n", 280 | "7 0 \n", 281 | "12 2 \n", 282 | "16 2 " 283 | ] 284 | }, 285 | "execution_count": 2, 286 | "metadata": {}, 287 | "output_type": "execute_result" 288 | } 289 | ], 290 | "source": [ 291 | "# We take only 10,000 customer complaints to speed up algorithms\n", 292 | "df = df[:10000]\n", 293 | "\n", 294 | "df.head(5)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 7, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "data": { 304 | "text/plain": [ 305 | "'An account on my credit report has a mistaken date. I mailed in a debt validation letter to allow XXXX to correct the information. I received a letter in the mail, stating that Experian received my correspondence and found it to be \" suspicious \\'\\' and that \" I did n\\'t write it \\'\\'. Experian \\'s letter is worded to imply that I am incapable of writing my own letter. I was deeply offended by this implication. \\nI called Experian to figure out why my letter was so suspicious. I spoke to a representative who was incredibly unhelpful, She did not effectively answer any questions I asked of her, and she kept ignoring what I was saying regarding the offensive letter and my dispute process. I feel the representative did what she wanted to do, and I am not satisfied. It is STILL not clear to me why I received this letter. I typed this letter, I signed this letter, and I paid to mail this letter, yet Experian willfully disregarded my lawful request. \\nI am disgusted with this entire situation, and I would like for my dispute to be handled appropriately, and I would like for an Experian representative to contact me and give me a real explanation for this letter.'" 306 | ] 307 | }, 308 | "execution_count": 7, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "df['Consumer_complaint_narrative'].iloc[2]" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "#### Imbalanced dataset" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "Conventional algorithms are often **biased towards the majority class**, not taking the data distribution into consideration. In the worst case, **minority classes are treated as outliers and ignored**. 
\n", 329 | "\n", 330 | "For some cases, such as fraud detection or cancer prediction, we would need to carefully configure our model or artificially balance the dataset, for example by **undersampling or oversampling each class.**" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "import matplotlib.pyplot as plt\n", 340 | "%matplotlib inline\n", 341 | "fig = plt.figure(figsize=(8,6))\n", 342 | "df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)\n", 343 | "plt.show()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "# DATA CLEANING" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "### A. --- A quick and easy function to clean my text" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "import re\n", 367 | "from nltk.corpus import stopwords\n", 368 | "import pandas as pd\n", 369 | "from nltk.stem import PorterStemmer\n", 370 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 371 | "\n", 372 | "def preprocess(raw_text):\n", 373 | "\n", 374 | " # keep only words\n", 375 | " letters_only_text = re.sub(\"[^a-zA-Z]\", \" \", raw_text)\n", 376 | "\n", 377 | " # convert to lower case and split \n", 378 | " words = letters_only_text.lower().split()\n", 379 | "\n", 380 | " # remove stopwords\n", 381 | " stopword_set = set(stopwords.words(\"english\"))\n", 382 | " meaningful_words = [w for w in words if w not in stopword_set]\n", 383 | " \n", 384 | " #stemmed words\n", 385 | " ps = PorterStemmer()\n", 386 | " stemmed_words = [ps.stem(word) for word in meaningful_words]\n", 387 | " \n", 388 | " #join the cleaned words in a list\n", 389 | " cleaned_word_list = \" \".join(stemmed_words)\n", 390 | "\n", 391 | " return cleaned_word_list" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "df['Consumer_complaint_narrative'] = df['Consumer_complaint_narrative'].apply(lambda line : preprocess(line))" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "### B. 
How to decline all ways of a given " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "from nltk.corpus import stopwords\n", 417 | "from nltk.stem import PorterStemmer\n", 418 | "stemmer = PorterStemmer()\n", 419 | "\n", 420 | "def split_dataset_into_words(dataset):\n", 421 | " datawords = dataset.apply(lambda x: x.split())\n", 422 | " return list(datawords)\n", 423 | "\n", 424 | "# my_list = all_incidents \n", 425 | "# dictionnary\n", 426 | "def buffer_stemmisation_keywords(my_list):\n", 427 | " my_list = [item for sublist in my_list for item in sublist]\n", 428 | " aux = pd.DataFrame(my_list, columns =['word'] )\n", 429 | " aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))\n", 430 | " aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))\n", 431 | " aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))\n", 432 | " aux.index = aux['word_stemmed']\n", 433 | " del aux['word_stemmed']\n", 434 | " my_dict = aux.to_dict('dict')['word']\n", 435 | " return my_dict" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "dictionnary_all_words_unstemmed = buffer_stemmisation_keywords(split_dataset_into_words(df['Consumer_complaint_narrative']))\n", 445 | "\n", 446 | "# Dictionnary de-duplicated\n", 447 | "for key, value in dictionnary_all_words_unstemmed.items():\n", 448 | " new_value = value.replace(\",\", \"\")\n", 449 | " new_value = list(set(value.split()))\n", 450 | " new_value = list(set(map(lambda each:each.strip(\",\"), new_value)))\n", 451 | " dictionnary_all_words_unstemmed[key]=new_value" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "dictionnary_all_words_unstemmed" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "# FEATURE ENGINEERING" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "2.1 Count Vectors as features\n", 475 | "\n", 476 | "2.2 TF-IDF Vectors as features\n", 477 | "- --- Word level\n", 478 | "- --- N-Gram level\n", 479 | "- --- Character level\n", 480 | "\n", 481 | "2.3 Word Embeddings as features\n", 482 | "\n", 483 | "2.4 Text / NLP based features\n", 484 | "\n", 485 | "2.5 Topic Models as features" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "The different types of **word embeddings** can be broadly classified into two categories-\n", 493 | "\n", 494 | "- **Frequency based Embedding**\n", 495 | " - Count Vector\n", 496 | " - TF-IDF Vector\n", 497 | " - Co-Occurrence Matrix with a fixed context window (with SVD)\n", 498 | "- **Prediction based Embedding**\n", 499 | " - CBOW (Continuous Bag of words)\n", 500 | " - Skip – Gram model" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "### 2.1 Count Vectors as features" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document." 
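As a toy illustration of that definition (the two sentences and variable names below are made up for this example, not part of the tutorial's data), a two-document corpus yields a document-term matrix like this:

```python
# Minimal sketch of a count vector: rows = documents, columns = terms, cells = raw counts.
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the dog chased the cat", "the cat slept"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)   # sparse document-term matrix

print(sorted(vectorizer.vocabulary_))         # ['cat', 'chased', 'dog', 'slept', 'the']
print(counts.toarray())
# [[1 1 1 0 2]    <- "the dog chased the cat"
#  [1 0 0 1 1]]   <- "the cat slept"
```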
515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [ 523 | "# split the dataset into training and validation datasets \n", 524 | "train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['Consumer_complaint_narrative'], df['Product'])\n", 525 | "\n", 526 | "# label encode the target variable \n", 527 | "encoder = preprocessing.LabelEncoder()\n", 528 | "train_y = encoder.fit_transform(train_y)\n", 529 | "valid_y = encoder.fit_transform(valid_y)" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# create a count vectorizer object \n", 539 | "count_vect = CountVectorizer(analyzer='word', token_pattern=r'\\w{1,}')\n", 540 | "count_vect.fit(df['Consumer_complaint_narrative'])\n", 541 | "\n", 542 | "# transform the training and validation data using count vectorizer object\n", 543 | "xtrain_count = count_vect.transform(train_x)\n", 544 | "xvalid_count = count_vect.transform(valid_x)" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "### 2.2 TF-IDF Vectors as features" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "a. **Word Level TF-IDF**: Matrix representing tf-idf scores of every term in different documents\n", 559 | "\n", 560 | "b. **N-gram Level TF-IDF**: N-grams are the combination of N terms together. This Matrix representing tf-idf scores of N-grams\n", 561 | "\n", 562 | "c. **Character Level TF-IDF**: Matrix representing tf-idf scores of character level n-grams in the corpus" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "Most often **term-frequency** alone is **not** a good measure of the **importance of a word/term to a document's topic**. Very common words like \"the\", \"a\", \"to\" are almost always the terms with the **highest frequency in the text**. Thus, having a high raw count of the number of times a term appears in a document does not necessarily mean that the corresponding word is more important. Furtermore, longer documents could have high frequency of terms that do not correlate with the document topic, but instead occur with high numbers solely due to the length of the document.\n", 570 | "\n", 571 | "To circumvent the limination of term-frequency, we often normalize it by the **inverse document frequency (idf)**. This results in the **term frequency-inverse document frequency (tf-idf)** matrix. The **inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents in the corpus**. We can give a formal defintion of the inverse-document-frequency by letting $\\mathcal{D}$ be the corpus or the set of all documents and $N$ is the number of documents in the corpus and $N_{t,D}$ be the number of documents that contain the term $t$ then, \n", 572 | "\n", 573 | "$$idf(t,\\mathcal{D}) \\, = \\, \\log\\left(\\frac{N_{\\mathcal{D}}}{1 + N_{t,\\mathcal{D}}}\\right) \\, = \\, - \\log\\left(\\frac{1 + N_{t,\\mathcal{D}}}{N_{\\mathcal{D}}}\\right) $$\n", 574 | "\n", 575 | "The reason for the presence of the $1$ is for smoothing. Without it, if the term/word did not appear in any training documents, then its inverse-document-frequency would be $idf(t,\\mathcal{D}) = \\infty$. 
However, with the presense of the $1$ it will now have $idf(t,\\mathcal{D}) = 0$.\n", 576 | "\n", 577 | "\n", 578 | "Now we can formally defined the term frequnecy-inverse document frequency as a normalized version of term-frequency,\n", 579 | "\n", 580 | "\n", 581 | "$$\\text{tf-idf}(t,d) \\, = \\, tf(t,d) \\cdot idf(t,\\mathcal{D}) $$\n", 582 | "\n", 583 | "Like the term-frequency, the term frequency-inverse document frequency is a sparse matrix, where again, each row is a document in our training corpus ($\\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list.\n", 584 | "_________________________________________\n", 585 | "**EXAMPLE:**\n", 586 | "\n", 587 | "**from** sklearn.feature_extraction.text **import** TfidfVectorizer\n", 588 | "\n", 589 | "tfidf = TfidfVectorizer(**sublinear_tf**=True, **min_df**=5, **norm**='l2', **encoding**='latin-1', **ngram_range**=(1, 2), **stop_words**='english')\n", 590 | "\n", 591 | "features = tfidf.fit_transform(df['text']).toarray()\n", 592 | "labels = df.category_id\n", 593 | "_________________________________________\n", 594 | "\n", 595 | "- **sublinear_df** is set to True to use a logarithmic form for frequency.\n", 596 | "- **min_df** is the minimum numbers of documents a word must be present in to be kept.\n", 597 | "- **norm** is set to l2, to ensure all our feature vectors have a euclidian norm of 1.\n", 598 | "- **ngram_range** is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.\n", 599 | "- **stop_words** is set to \"english\" to remove all common pronouns (\"a\", \"the\", ...) to reduce the number of noisy features." 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": null, 605 | "metadata": {}, 606 | "outputs": [], 607 | "source": [ 608 | "# word level tf-idf\n", 609 | "tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\\w{1,}', max_features=5000)\n", 610 | "tfidf_vect.fit(trainDF['text'])\n", 611 | "xtrain_tfidf = tfidf_vect.transform(train_x)\n", 612 | "xvalid_tfidf = tfidf_vect.transform(valid_x)\n", 613 | "\n", 614 | "# ngram level tf-idf \n", 615 | "tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\\w{1,}', ngram_range=(2,3), max_features=5000)\n", 616 | "tfidf_vect_ngram.fit(trainDF['text'])\n", 617 | "xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)\n", 618 | "xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)\n", 619 | "\n", 620 | "# characters level tf-idf\n", 621 | "tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', token_pattern=r'\\w{1,}', ngram_range=(2,3), max_features=5000)\n", 622 | "tfidf_vect_ngram_chars.fit(trainDF['text'])\n", 623 | "xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x) \n", 624 | "xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x) " 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "The term-frequency is a **sparse matrix** where **each row is a document in our training corpus** ($\\mathcal{D}$) and each **column corresponds to a term/word in the bag-of-words list**" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [ 640 | "print('- Size of the matrix is', xvalid_tfidf.shape, 'as we passed 5,000 words and we have 1725 customer comments')" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | 
"print('Here is my bag of words:', tfidf_vect.get_feature_names())" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": null, 655 | "metadata": {}, 656 | "outputs": [], 657 | "source": [ 658 | "print('Size of my bag of words:', len(tfidf_vect.get_feature_names()))" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "#### Other implementation" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": {}, 672 | "outputs": [], 673 | "source": [ 674 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 675 | "\n", 676 | "tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')\n", 677 | "\n", 678 | "features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()\n", 679 | "labels = df.category_id\n", 680 | "\n", 681 | "print('- Size of the matrix is', features.shape)\n", 682 | "print('- Each of', features.shape[0], 'consumer complaint narratives is represented by', features.shape[1], 'features, representing the tf-idf score for different unigrams and bigrams.')\n", 683 | "print('-', features.shape[0], 'is the # of document / complaint and', features.shape[1], 'is my bag of words containing unigram and bigram')" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "The term-frequency is a **sparse matrix** where **each row is a document in our training corpus** ($\\mathcal{D}$) and each **column corresponds to a term/word in the bag-of-words list**\n", 691 | "\n", 692 | "- **sklearn.feature_selection.chi2** to find the terms that are the most correlated with each of the products" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": {}, 699 | "outputs": [], 700 | "source": [ 701 | "from sklearn.feature_selection import chi2\n", 702 | "import numpy as np\n", 703 | "\n", 704 | "N = 2\n", 705 | "for Product, category_id in sorted(category_to_id.items()):\n", 706 | " features_chi2 = chi2(features, labels == category_id)\n", 707 | " indices = np.argsort(features_chi2[0])\n", 708 | " feature_names = np.array(tfidf.get_feature_names())[indices]\n", 709 | " unigrams = [v for v in feature_names if len(v.split(' ')) == 1]\n", 710 | " bigrams = [v for v in feature_names if len(v.split(' ')) == 2]\n", 711 | " print(\"# '{}':\".format(Product))\n", 712 | " print(\" . Most correlated unigrams:\\n . {}\".format('\\n . '.join(unigrams[-N:])))\n", 713 | " print(\" . Most correlated bigrams:\\n . {}\".format('\\n . '.join(bigrams[-N:])))" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": {}, 719 | "source": [ 720 | "### 2.3 Word Embeddings" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "A word embedding is a **form of representing words and documents** using a **dense vector representation**. The position of a word within the vector space is learned from text and is based on the **words that surround the word** when it is used. Word embeddings can be trained using the input corpus **itself** or can **be generated using pre-trained word embeddings** such as **Glove**, **FastText**, and **Word2Vec**. Any one of them can be downloaded and **used as transfer learning**. 
\n", 728 | "\n", 729 | "Four essential steps:\n", 730 | "- Loading the pretrained word embeddings\n", 731 | "- Creating a tokenizer object\n", 732 | "- Transforming text documents to sequence of tokens and pad them\n", 733 | "- Create a mapping of token and their respective embeddings" 734 | ] 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "metadata": {}, 739 | "source": [ 740 | "https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": null, 746 | "metadata": {}, 747 | "outputs": [], 748 | "source": [ 749 | "# load the pre-trained word-embedding vectors \n", 750 | "embeddings_index = {}\n", 751 | "for i, line in enumerate(open('data/wiki-news-300d-1M.vec')):\n", 752 | " values = line.split()\n", 753 | " embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')\n", 754 | "\n", 755 | "# create a tokenizer \n", 756 | "token = text.Tokenizer()\n", 757 | "token.fit_on_texts(trainDF['text'])\n", 758 | "word_index = token.word_index\n", 759 | "\n", 760 | "# convert text to sequence of tokens and pad them to ensure equal length vectors \n", 761 | "train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)\n", 762 | "valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)\n", 763 | "\n", 764 | "# create token-embedding mapping\n", 765 | "embedding_matrix = numpy.zeros((len(word_index) + 1, 300))\n", 766 | "for word, i in word_index.items():\n", 767 | " embedding_vector = embeddings_index.get(word)\n", 768 | " if embedding_vector is not None:\n", 769 | " embedding_matrix[i] = embedding_vector" 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "metadata": {}, 775 | "source": [ 776 | "### 2.4 Text / NLP based features" 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "- 1. **Word Count of the documents** – total number of words in the documents\n", 784 | "- 2. **Character Count of the documents** – total number of characters in the documents\n", 785 | "- 3. **Average Word Density of the documents** – average length of the words used in the documents\n", 786 | "- 4. **Puncutation Count in the Complete Essay** – total number of punctuation marks in the documents\n", 787 | "- 5. **Upper Case Count in the Complete Essay** – total number of upper count words in the documents\n", 788 | "- 6. **Title Word Count in the Complete Essay** – total number of proper case (title) words in the documents\n", 789 | "- 7. 
**Frequency distribution of Part of Speech Tags:**\n", 790 | " - Noun Count\n", 791 | " - Verb Count\n", 792 | " - Adjective Count\n", 793 | " - Adverb Count\n", 794 | " - Pronoun Count" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": {}, 801 | "outputs": [], 802 | "source": [ 803 | "trainDF['char_count'] = trainDF['text'].apply(len)\n", 804 | "trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))\n", 805 | "trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)\n", 806 | "trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len(\"\".join(_ for _ in x if _ in string.punctuation))) \n", 807 | "trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))\n", 808 | "trainDF['upper_case_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": null, 814 | "metadata": {}, 815 | "outputs": [], 816 | "source": [ 817 | "pos_family = {\n", 818 | " 'noun' : ['NN','NNS','NNP','NNPS'],\n", 819 | " 'pron' : ['PRP','PRP$','WP','WP$'],\n", 820 | " 'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],\n", 821 | " 'adj' : ['JJ','JJR','JJS'],\n", 822 | " 'adv' : ['RB','RBR','RBS','WRB']\n", 823 | "}\n", 824 | "\n", 825 | "# function to check and get the part of speech tag count of a words in a given sentence\n", 826 | "def check_pos_tag(x, flag):\n", 827 | " cnt = 0\n", 828 | " try:\n", 829 | " wiki = textblob.TextBlob(x)\n", 830 | " for tup in wiki.tags:\n", 831 | " ppo = list(tup)[1]\n", 832 | " if ppo in pos_family[flag]:\n", 833 | " cnt += 1\n", 834 | " except:\n", 835 | " pass\n", 836 | " return cnt\n", 837 | "\n", 838 | "trainDF['noun_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'noun'))\n", 839 | "trainDF['verb_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'verb'))\n", 840 | "trainDF['adj_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adj'))\n", 841 | "trainDF['adv_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adv'))\n", 842 | "trainDF['pron_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'pron'))" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "metadata": {}, 849 | "outputs": [], 850 | "source": [ 851 | "import nltk\n", 852 | "nltk.download('averaged_perceptron_tagger')" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "### 2.5 Topic Models as features (LDA)" 860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": null, 865 | "metadata": {}, 866 | "outputs": [], 867 | "source": [ 868 | "# train a LDA Model\n", 869 | "lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)\n", 870 | "X_topics = lda_model.fit_transform(xtrain_count)\n", 871 | "topic_word = lda_model.components_ \n", 872 | "vocab = count_vect.get_feature_names()\n", 873 | "\n", 874 | "# view the topic models\n", 875 | "n_top_words = 10\n", 876 | "topic_summaries = []\n", 877 | "for i, topic_dist in enumerate(topic_word):\n", 878 | " topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]\n", 879 | " topic_summaries.append(' '.join(topic_words))" 880 | ] 881 | }, 882 | { 883 | "cell_type": "markdown", 884 | "metadata": {}, 885 | "source": [ 886 | "# MODEL BUILDING" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 
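The evaluation cells in this section call a `train_model(...)` helper. In case it is not defined elsewhere in the notebook, a minimal sketch consistent with how it is called below (fit the classifier on the training features, predict on the validation features, return the accuracy against `valid_y`) could be:

```python
# Hypothetical helper matching calls like train_model(clf, xtrain_count, train_y, xvalid_count).
# It assumes the train/validation variables created earlier (train_y, valid_y) are in scope.
from sklearn import metrics

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    classifier.fit(feature_vector_train, label)             # fit on the training features
    predictions = classifier.predict(feature_vector_valid)  # predict the validation set
    return metrics.accuracy_score(valid_y, predictions)     # accuracy on held-out labels
```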
| "metadata": {}, 892 | "source": [ 893 | "- Naive Bayes Classifier\n", 894 | "- Linear Classifier\n", 895 | "- Support Vector Machine\n", 896 | "- Bagging Models\n", 897 | "- Boosting Models\n", 898 | "- Shallow Neural Networks\n", 899 | "- Deep Neural Networks\n", 900 | "- Convolutional Neural Network (CNN)\n", 901 | "- Long Short Term Modelr (LSTM)\n", 902 | "- Gated Recurrent Unit (GRU)\n", 903 | "- Bidirectional RNN\n", 904 | "- Recurrent Convolutional Neural Network (RCNN)\n", 905 | "- Other Variants of Deep Neural Networks" 906 | ] 907 | }, 908 | { 909 | "cell_type": "markdown", 910 | "metadata": {}, 911 | "source": [ 912 | "### 3.1 Naive Bayes" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": {}, 918 | "source": [ 919 | "-------------------\n", 920 | "\n", 921 | "One of the most basic models for text classification is the Naive Bayes model. The Naive Bayes classification model predicts the document topic, $y = \\{C_{1},C_{2},\\ldots, C_{20}\\}$ where $C_{k}$ is the class or topic based on the document feactures $\\textbf{x} \\in \\mathbb{N}^{p}$, and $p$ is the number of terms in our bag-of-words list. The feature vector,\n", 922 | "\n", 923 | "$$\\textbf{x} \\, = \\, \\left[ x_{1}, x_{2}, \\ldots , x_{p} \\right] $$\n", 924 | "\n", 925 | "contains counts $x_{i}$ for the $\\text{tf-idf}$ value of the i-th term in our bag-of-words list. Using Bayes Theorem we can develop a model to predict the topic class ($C_{k}$) of a document from its feature vector $\\textbf{x}$,\n", 926 | "\n", 927 | "$$P\\left(C_{k} \\, \\vert \\, x_{1}, \\ldots , x_{p} \\right) \\; = \\; \\frac{P\\left(x_{1}, \\ldots, x_{p} \\, \\vert \\, C_{k} \\right)P(C_{k})}{P\\left(x_{1}, \\ldots, x_{p} \\right)}$$\n", 928 | "\n", 929 | "The Naive Bayes model makes the \"Naive\" assumption the probability of each term's $\\text{tf-idf}$ is **conditionally independent** of every other term. This reduces our **conditional probability function** to the product,\n", 930 | "\n", 931 | "$$ P\\left(x_{1}, \\ldots, x_{p} \\, \\vert \\, C_{k} \\right) \\; = \\; \\Pi_{i=1}^{p} P\\left(x_{i} \\, \\vert \\, C_{k} \\right)$$\n", 932 | "\n", 933 | "Subsequently Bayes' theorem for our classification problem becomes,\n", 934 | "\n", 935 | "$$P\\left(C_{k} \\, \\vert \\, x_{1}, \\ldots , x_{p} \\right) \\; = \\; \\frac{ P(C_{k}) \\, \\Pi_{i=1}^{p} P\\left(x_{i} \\, \\vert \\, C_{k} \\right)}{P\\left(x_{1}, \\ldots, x_{p} \\right)}$$\n", 936 | "\n", 937 | "\n", 938 | "Since the denominator is independent of the class ($C_{k}$) we can use a Maxmimum A Posteriori method to estimate the document topic , \n", 939 | "\n", 940 | "$$ \\hat{y} \\, = \\, \\text{arg max}_{k}\\; P(C_{k}) \\, \\Pi_{i=1}^{p} P\\left(x_{i} \\, \\vert \\, C_{k} \\right)$$ \n", 941 | "\n", 942 | "\n", 943 | "The **prior**, $P(C_{k}),$ is often taken to be the relative frequency of the class in the training corpus, while the form of the conditional distribution $P\\left(x_{i} \\, \\vert \\, C_{k} \\right)$ is a choice of the modeler and determines the type of Naive Bayes classifier. \n", 944 | "\n", 945 | "\n", 946 | "We will use a multinomial Naive Bayes model which works well when our features are discrete variables such as those in our $\\text{tf-idf}$ matrix. 
In the multinomial Naive Bayes model the conditional probability takes the form,\n", 947 | "\n", 948 | "\n", 949 | "$$ P\\left(x_{1}, \\ldots, x_{p} \\, \\vert \\, C_{k} \\right) \\, = \\, \\frac{\\left(\\sum_{i=1}^{p} x_{i}\\right)!}{\\Pi_{i=1}^{p} x_{i}!} \\Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$\n", 950 | "\n", 951 | "\n", 952 | "where $p_{k,i}$ is the probability that the $k$-th class will have the $i$-th bag-of-words term in its feature vector. This leads to our **posterior distribution** having the functional form,\n", 953 | "\n", 954 | "$$P\\left(C_{k} \\, \\vert \\, x_{1}, \\ldots , x_{p} \\right) \\; = \\; \\frac{ P(C_{k})}{P\\left(x_{1}, \\ldots, x_{p} \\right)} \\, \\frac{\\left(\\sum_{i=1}^{p} x_{i}\\right)!}{\\Pi_{i=1}^{p} x_{i}!} \\Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$\n", 955 | "\n", 956 | "\n", 957 | "\n", 958 | "We can instantiate a multinomial Naive Bayes classifier using the Scikit-learn library and fit it to our $\\text{tf-idf}$ matrix using the commands," 959 | ] 960 | }, 961 | { 962 | "cell_type": "code", 963 | "execution_count": null, 964 | "metadata": {}, 965 | "outputs": [], 966 | "source": [ 967 | "# Naive Bayes on Count Vectors\n", 968 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)\n", 969 | "print(\"NB, Count Vectors: \", accuracy)\n", 970 | "\n", 971 | "# Naive Bayes on Word Level TF IDF Vectors\n", 972 | "\n", 973 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)\n", 974 | "print(\"NB, WordLevel TF-IDF: \", accuracy)\n", 975 | "\n", 976 | "# Naive Bayes on Ngram Level TF IDF Vectors\n", 977 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)\n", 978 | "print(\"NB, N-Gram Vectors: \", accuracy)\n", 979 | "\n", 980 | "# Naive Bayes on Character Level TF IDF Vectors\n", 981 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)\n", 982 | "print(\"NB, CharLevel Vectors: \", accuracy)" 983 | ] 984 | }, 985 | { 986 | "cell_type": "markdown", 987 | "metadata": {}, 988 | "source": [ 989 | "#### Other implementation" 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "metadata": {}, 996 | "outputs": [], 997 | "source": [ 998 | "# Look at my dataframe\n", 999 | "trainDF.head()" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "code", 1004 | "execution_count": null, 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [ 1008 | "from sklearn.model_selection import train_test_split\n", 1009 | "from sklearn.feature_extraction.text import CountVectorizer\n", 1010 | "from sklearn.feature_extraction.text import TfidfTransformer\n", 1011 | "from sklearn.naive_bayes import MultinomialNB\n", 1012 | "\n", 1013 | "X_train, X_test, y_train, y_test = train_test_split(trainDF['text'], trainDF['label'], random_state = 0)\n", 1014 | "count_vect = CountVectorizer()\n", 1015 | "X_train_counts = count_vect.fit_transform(X_train)\n", 1016 | "tfidf_transformer = TfidfTransformer()\n", 1017 | "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n", 1018 | "\n", 1019 | "mod = MultinomialNB()\n", 1020 | "clf = mod.fit(X_train_tfidf, y_train)" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": null, 1026 | "metadata": {}, 1027 | "outputs": [], 1028 | "source": [ 1029 | "from sklearn.metrics import accuracy_score\n", 1030 | "X_test_tf = count_vect.transform(X_test)\n", 1031 | "X_test_tfidf = 
tfidf_transformer.transform(X_test_tf)\n", 1032 | "\n", 1033 | "predicted = mod.predict(X_test_tfidf)\n", 1034 | "print(\"Accuracy:\", accuracy_score(y_test, predicted))" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | "-- A -- **Accuracy** - Accuracy is the most intuitive performance measure and it is simply **a ratio of correctly predicted observation to the total observations.** One may think that, **if we have high accuracy then our model is best**. Yes, accuracy is a great measure but only when you have symmetric datasets where values of false positive and false negatives are almost same. Therefore, you have to look at other parameters to evaluate the performance of your model. **For our model, we have got 0.803 which means our model is approx. 80% accurate.**\n", 1042 | "\n", 1043 | "**Accuracy = TP+TN/TP+FP+FN+TN**\n", 1044 | "\n", 1045 | "-- B -- **Precision** - Precision is the ratio of **correctly predicted positive observations to the total predicted positive observations.** The question that this metric answer is of **all passengers that labeled as survived, how many actually survived?** High precision relates to the low false positive rate. We have got 0.788 precision which is pretty good.\n", 1046 | "\n", 1047 | "**Precision = TP/TP+FP**\n", 1048 | "\n", 1049 | "-- C -- **Recall (Sensitivity)** - Recall is the **ratio of correctly predicted positive observations to the all observations in actual class - yes**. The question recall answers is: **Of all the passengers that truly survived, how many did we label?** We have got recall of 0.631 which is good for this model as it’s above 0.5.\n", 1050 | "\n", 1051 | "**Recall = TP/TP+FN**\n", 1052 | "\n", 1053 | "-- D -- **F1 score** - F1 Score is the **weighted average of Precision and Recall.** Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the cost of false positives and false negatives are very different, it’s better to look at both Precision and Recall. In our case, F1 score is 0.701.\n", 1054 | "\n", 1055 | "**F1 Score = 2*(Recall * Precision) / (Recall + Precision)**" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": null, 1061 | "metadata": {}, 1062 | "outputs": [], 1063 | "source": [ 1064 | "from sklearn.metrics import classification_report\n", 1065 | "print(classification_report(y_test, predicted))" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": null, 1071 | "metadata": {}, 1072 | "outputs": [], 1073 | "source": [ 1074 | "### Prediction\n", 1075 | "print(clf.predict(count_vect.transform([\"\"\" spent 3 days on the phone with countless \"agents\" - most of that time trying to get each one to understand what I wanted - simply wanted to change my email address in my account. Based on the terrible telephone service I assume they are all located in South America where it is known to be poor quality phone service. Spent 15 - 25 minutes just getting them to understand what the problem was. Ended up having to create a new account losing my entire order history. 
This has to be the worse phone customer service out there!\"\"\"])))" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "markdown", 1080 | "metadata": {}, 1081 | "source": [ 1082 | "### 3.2 Linear Classifier / Logistic Regression" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": {}, 1088 | "source": [ 1089 | "https://stlong0521.github.io/20160228%20-%20Logistic%20Regression.html" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "markdown", 1094 | "metadata": {}, 1095 | "source": [ 1096 | "
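Before the derivation, here is a minimal usage sketch of scikit-learn's implementation; it assumes the word-level tf-idf features (`xtrain_tfidf`, `xvalid_tfidf`) and encoded labels built in the feature-engineering section, and the hyperparameters are illustrative only:

```python
# Sketch: multi-class logistic regression on the word-level tf-idf features built above.
from sklearn import linear_model, metrics

logreg = linear_model.LogisticRegression(C=1.0, max_iter=1000)
logreg.fit(xtrain_tfidf, train_y)
predictions = logreg.predict(xvalid_tfidf)
print("LR, WordLevel TF-IDF accuracy:", metrics.accuracy_score(valid_y, predictions))
```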

The generative classification model, such as Naive Bayes, tries to learn the probabilities and then predicts by using Bayes' rule to calculate the posterior, \(p(y|\textbf{x})\). However, discriminative classifiers model the posterior directly. As one of the most popular discriminative classifiers, logistic regression directly models the linear decision boundary.

\n", 1097 | "

Binary Logistic Regression Classifier1

\n", 1098 | "

Let us start with the binary case. For an M-dimensional feature vector \\(\\textbf{x}=[x_1,x_2,...,x_M]^T\\), the posterior probability of class \\(y\\in\\{\\pm{1}\\}\\) given \\(\\textbf{x}\\) is assumed to satisfy\n", 1099 | "

\n", 1100 | "
\\begin{equation}\n", 1101 | "\\ln{\\frac{p(y=1|\\textbf{x})}{p(y=-1|\\textbf{x})}}=\\textbf{w}^T\\textbf{x},\n", 1102 | "\\end{equation}
\n", 1103 | "

\n", 1104 | "where \\(\\textbf{w}=[w_1,w_2,...,w_M]^T\\) is the weighting vector to be learned. Given the constraint that \\(p(y=1|\\textbf{x})+p(y=-1|\\textbf{x})=1\\), it follows that\n", 1105 | "

\n", 1106 | "
\\begin{equation} \\label{Eqn:Prob_Binary}\n", 1107 | "p(y|\\textbf{x})=\\frac{1}{1+\\exp(-y\\textbf{w}^T\\textbf{x})}=\\sigma(y\\textbf{w}^T\\textbf{x}),\n", 1108 | "\\end{equation}
\n", 1109 | "

\n", 1110 | "in which we can observe the logistic sigmoid function \\(\\sigma(a)=\\frac{1}{1+\\exp(-a)}\\).

\n", 1111 | "

Based on the assumptions above, the weighting vector, \\(\\textbf{w}\\), can be learned by maximum likelihood estimation (MLE). More specifically, given training data set \\(\\mathcal{D}=\\{(\\textbf{x}_1,y_1),(\\textbf{x}_2,y_2),...,(\\textbf{x}_N,y_N)\\}\\),\n", 1112 | "

\n", 1113 | "
\\begin{align}\n", 1114 | "\\begin{aligned}\n", 1115 | "\\textbf{w}^*&=\\max_{\\textbf{w}}{\\mathcal{L}(\\textbf{w})}\\\\\n", 1116 | "&=\\max_{\\textbf{w}}{\\sum_{i=1}^N\\ln{{p(y_i|\\textbf{x}_i)}}}\\\\\n", 1117 | "&=\\max_{\\textbf{w}}{\\sum_{i=1}^N{\\ln{\\frac{1}{1+\\exp(-y_i\\textbf{w}^T\\textbf{x}_i)}}}}\\\\\n", 1118 | "&=\\min_{\\textbf{w}}{\\sum_{i=1}^N{\\ln{(1+\\exp(-y_i\\textbf{w}^T\\textbf{x}_i))}}}.\n", 1119 | "\\end{aligned}\n", 1120 | "\\end{align}
\n", 1121 | "

\n", 1122 | "We have a convex objective function here, and we can calculate the optimal solution by applying gradient descent. The gradient can be drawn as\n", 1123 | "

\n", 1124 | "
\\begin{align}\n", 1125 | "\\begin{aligned}\n", 1126 | "\\nabla{\\mathcal{L}(\\textbf{w})}&=\\sum_{i=1}^N{\\frac{-y_i\\textbf{x}_i\\exp(-y_i\\textbf{w}^T\\textbf{x}_i)}{1+\\exp(-y_i\\textbf{w}^T\\textbf{x}_i)}}\\\\\n", 1127 | "&=-\\sum_{i=1}^N{y_i\\textbf{x}_i(1-p(y_i|\\textbf{x}_i))}.\n", 1128 | "\\end{aligned}\n", 1129 | "\\end{align}
\n", 1130 | "

\n", 1131 | "Then, we can learn the optimal \\(\\textbf{w}\\) by starting with an initial \\(\\textbf{w}_0\\) and iterating as follows:\n", 1132 | "

\n", 1133 | "
\\begin{equation} \\label{Eqn:Iteration_Binary}\n", 1134 | "\\textbf{w}_{t+1}=\\textbf{w}_{t}-\\eta_t\\nabla{\\mathcal{L}(\\textbf{w})},\n", 1135 | "\\end{equation}
\n", 1136 | "

\n", 1137 | "where \\(\\eta_t\\) is the learning step size. It can be invariant to time, but time-varying step sizes could potential reduce the convergence time, e.g., setting \\(\\eta_t\\propto{1/\\sqrt{t}}\\) such that the step size decreases with an increasing time \\(t\\).

\n", 1138 | "

Multiclass Logistic Regression Classifier

\n", 1139 | "

When generalized to the multiclass case, the logistic regression model needs to be adapted accordingly. Now we have \(K\) possible classes, that is, \(y\in\{1,2,...,K\}\). It is assumed that the posterior probability of class \(y=k\) given \(\textbf{x}\) follows\n", 1140 | "

\n", 1141 | "
\\begin{equation}\n", 1142 | "\\ln{p(y=k|\\textbf{x})}\\propto\\textbf{w}_k^T\\textbf{x},\n", 1143 | "\\end{equation}
\n", 1144 | "

\n", 1145 | "where \\(\\textbf{w}_k\\) is a column weighting vector corresponding to class \\(k\\). Considering all classes \\(k=1,2,...,K\\), we would have a weighting matrix that includes all \\(K\\) weighting vectors. That is, \\(\\textbf{W}=[\\textbf{w}_1,\\textbf{w}_2,...,\\textbf{w}_K]\\).\n", 1146 | "Under the constraint\n", 1147 | "

\n", 1148 | "
\\begin{equation}\n", 1149 | "\\sum_{k=1}^K{p(y=k|\\textbf{x})}=1,\n", 1150 | "\\end{equation}
\n", 1151 | "

\n", 1152 | "it then follows that\n", 1153 | "

\n", 1154 | "
\\begin{equation} \\label{Eqn:Prob_Multiple}\n", 1155 | "p(y=k|\\textbf{x})=\\frac{\\exp(\\textbf{w}_k^T\\textbf{x})}{\\sum_{j=1}^K{\\exp(\\textbf{w}_j^T\\textbf{x})}}.\n", 1156 | "\\end{equation}
\n", 1157 | "

The weighting matrix, \(\textbf{W}\), can be similarly learned by maximum likelihood estimation (MLE). More specifically, given the training data set \(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\),\n", 1157 | "

\n", 1159 | "
\begin{align}\n", 1160 | "\begin{aligned}\n", 1161 | "\textbf{W}^*&=\max_{\textbf{W}}{\mathcal{L}(\textbf{W})}\\\n", 1162 | "&=\max_{\textbf{W}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\\n", 1163 | "&=\max_{\textbf{W}}{\sum_{i=1}^N{\ln{\frac{\exp(\textbf{w}_{y_i}^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}}}}.\n", 1164 | "\end{aligned}\n", 1165 | "\end{align}
\n", 1166 | "

\n", 1167 | "The gradient of the objective function with respect to each \\(\\textbf{w}_k\\) can be calculated as\n", 1168 | "

\n", 1169 | "
\begin{align}\n", 1170 | "\begin{aligned}\n", 1171 | "\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}&=\sum_{i=1}^N{\textbf{x}_i\left(I(y_i=k)-\frac{\exp(\textbf{w}_k^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}\right)}\\\n", 1172 | "&=\sum_{i=1}^N{\textbf{x}_i(I(y_i=k)-p(y_i=k|\textbf{x}_i))},\n", 1173 | "\end{aligned}\n", 1174 | "\end{align}
\n", 1175 | "

\n", 1176 | "where \\(I(\\cdot)\\) is a binary indicator function. Applying gradient descent, the optimal solution can be obtained by iterating as follows:\n", 1177 | "

\n", 1178 | "
\\begin{equation}\\label{Eqn:Iteration_Multiple}\n", 1179 | "\\textbf{w}_{k,t+1}=\\textbf{w}_{k,t}+\\eta_{t}\\frac{\\partial{\\mathcal{L}(\\textbf{W})}}{\\partial{\\textbf{w}_k}}.\n", 1180 | "\\end{equation}
\n", 1181 | "

\n", 1182 | "Note that we have \"\\(+\\)\" instead of \"\\(-\\)\", because the maximum likelihood estimation in the binary case is eventually converted to a minimization problem, while here we keep performing maximization.

\n", 1183 | "

How to Perform Predictions?

\n", 1184 | "

Once the optimal weights are learned from the logistic regression model, for any new feature vector \(\textbf{x}\) we can easily calculate the probability that it belongs to each class label \(k\), using the binary-case or the multiclass-case posterior derived above. With the probabilities for each class label available, we can then perform either of the following (a minimal sketch follows the list):

\n", 1185 | "
    \n", 1186 | "
  • a hard decision by identifying the class label with the highest probability, or
\n", 1187 | "
  • a soft decision by showing the top \\(k\\) most probable class labels with their corresponding probabilities.
\n", 1188 | "
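A minimal sketch of both decision rules, applied to a made-up matrix of class probabilities (in the notebook these would come from a fitted estimator's `predict_proba`):

```python
import numpy as np

# made-up posterior probabilities for 2 documents over K = 4 classes
proba = np.array([[0.10, 0.65, 0.20, 0.05],
                  [0.30, 0.05, 0.35, 0.30]])

# hard decision: the single most probable class label per document
print("hard decisions:", proba.argmax(axis=1))

# soft decision: the top-k most probable labels with their probabilities
k = 2
top_k = np.argsort(proba, axis=1)[:, ::-1][:, :k]
for i, labels in enumerate(top_k):
    print("document", i, [(int(c), float(proba[i, c])) for c in labels])
```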
" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "code", 1193 | "execution_count": null, 1194 | "metadata": {}, 1195 | "outputs": [], 1196 | "source": [ 1197 | "# Linear Classifier on Count Vectors\n", 1198 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)\n", 1199 | "print(\"LR, Count Vectors: \", accuracy)\n", 1200 | "\n", 1201 | "# Linear Classifier on Word Level TF IDF Vectors\n", 1202 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)\n", 1203 | "print(\"LR, WordLevel TF-IDF: \", accuracy)\n", 1204 | "\n", 1205 | "# Linear Classifier on Ngram Level TF IDF Vectors\n", 1206 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)\n", 1207 | "print(\"LR, N-Gram Vectors: \", accuracy)\n", 1208 | "\n", 1209 | "# Linear Classifier on Character Level TF IDF Vectors\n", 1210 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)\n", 1211 | "print(\"LR, CharLevel Vectors: \", accuracy)" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "markdown", 1216 | "metadata": {}, 1217 | "source": [ 1218 | "### 3.3 Implementing a SVM Model" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": null, 1224 | "metadata": {}, 1225 | "outputs": [], 1226 | "source": [ 1227 | "# https://svivek.com/teaching/machine-learning/fall2018/slides/svm/svm-sgd.pdf\n", 1228 | "# https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-machine-with-math-47d6193c82be" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "code", 1233 | "execution_count": null, 1234 | "metadata": {}, 1235 | "outputs": [], 1236 | "source": [ 1237 | "\n", 1238 | "\n", 1239 | "# SVM on Ngram Level TF IDF Vectors\n", 1240 | "accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)\n", 1241 | "print(\"SVM, N-Gram Vectors: \", accuracy)" 1242 | ] 1243 | }, 1244 | { 1245 | "cell_type": "code", 1246 | "execution_count": null, 1247 | "metadata": {}, 1248 | "outputs": [], 1249 | "source": [ 1250 | "from sklearn.pipeline import Pipeline" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "code", 1255 | "execution_count": null, 1256 | "metadata": {}, 1257 | "outputs": [], 1258 | "source": [ 1259 | "from sklearn.linear_model import SGDClassifier\n", 1260 | "text_clf_svm = Pipeline([('vect', CountVectorizer()),\n", 1261 | " ('tfidf', TfidfTransformer()),\n", 1262 | " ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, n_iter=5, random_state=42))])" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": null, 1268 | "metadata": {}, 1269 | "outputs": [], 1270 | "source": [ 1271 | ">>> _ = text_clf_svm.fit(twenty_train.data, twenty_train.target)\n", 1272 | ">>> predicted_svm = text_clf_svm.predict(twenty_test.data)\n", 1273 | ">>> np.mean(predicted_svm == twenty_test.target)" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": null, 1279 | "metadata": {}, 1280 | "outputs": [], 1281 | "source": [] 1282 | }, 1283 | { 1284 | "cell_type": "markdown", 1285 | "metadata": {}, 1286 | "source": [ 1287 | "### 3.4 Bagging Model" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "execution_count": null, 1293 | "metadata": {}, 1294 | "outputs": [], 1295 | "source": [ 1296 | "# RF on Count Vectors\n", 1297 | "accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)\n", 1298 | "print 
\"RF, Count Vectors: \", accuracy\n", 1299 | "\n", 1300 | "# RF on Word Level TF IDF Vectors\n", 1301 | "accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)\n", 1302 | "print \"RF, WordLevel TF-IDF: \", accuracy" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "markdown", 1307 | "metadata": {}, 1308 | "source": [ 1309 | "### 3.5 Boosting Model" 1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "code", 1314 | "execution_count": null, 1315 | "metadata": {}, 1316 | "outputs": [], 1317 | "source": [ 1318 | "# Extereme Gradient Boosting on Count Vectors\n", 1319 | "accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())\n", 1320 | "print(\"Xgb, Count Vectors: \", accuracy)\n", 1321 | "\n", 1322 | "# Extereme Gradient Boosting on Word Level TF IDF Vectors\n", 1323 | "accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())\n", 1324 | "print(\"Xgb, WordLevel TF-IDF: \", accuracy)\n", 1325 | "\n", 1326 | "# Extereme Gradient Boosting on Character Level TF IDF Vectors\n", 1327 | "accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xvalid_tfidf_ngram_chars.tocsc())\n", 1328 | "print(\"Xgb, CharLevel Vectors: \", accuracy)" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "markdown", 1333 | "metadata": {}, 1334 | "source": [ 1335 | "### 3.6 Shallow Neural Networks" 1336 | ] 1337 | }, 1338 | { 1339 | "cell_type": "code", 1340 | "execution_count": null, 1341 | "metadata": {}, 1342 | "outputs": [], 1343 | "source": [ 1344 | "def create_model_architecture(input_size):\n", 1345 | " # create input layer \n", 1346 | " input_layer = layers.Input((input_size, ), sparse=True)\n", 1347 | " \n", 1348 | " # create hidden layer\n", 1349 | " hidden_layer = layers.Dense(100, activation=\"relu\")(input_layer)\n", 1350 | " \n", 1351 | " # create output layer\n", 1352 | " output_layer = layers.Dense(1, activation=\"sigmoid\")(hidden_layer)\n", 1353 | "\n", 1354 | " classifier = models.Model(inputs = input_layer, outputs = output_layer)\n", 1355 | " classifier.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1356 | " return classifier \n", 1357 | "\n", 1358 | "classifier = create_model_architecture(xtrain_tfidf_ngram.shape[1])\n", 1359 | "accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, is_neural_net=True)\n", 1360 | "print(\"NN, Ngram Level TF IDF Vectors\", accuracy)" 1361 | ] 1362 | }, 1363 | { 1364 | "cell_type": "markdown", 1365 | "metadata": {}, 1366 | "source": [ 1367 | "### 3.7.1 Convolutional Neural Network [Deep Neural Networks]" 1368 | ] 1369 | }, 1370 | { 1371 | "cell_type": "code", 1372 | "execution_count": null, 1373 | "metadata": {}, 1374 | "outputs": [], 1375 | "source": [ 1376 | "from IPython.display import Image\n", 1377 | "from IPython.core.display import HTML \n", 1378 | "Image(url= \"cnn.png\")" 1379 | ] 1380 | }, 1381 | { 1382 | "cell_type": "code", 1383 | "execution_count": null, 1384 | "metadata": {}, 1385 | "outputs": [], 1386 | "source": [ 1387 | "def create_cnn():\n", 1388 | " # Add an Input Layer\n", 1389 | " input_layer = layers.Input((70, ))\n", 1390 | "\n", 1391 | " # Add the word embedding Layer\n", 1392 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1393 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1394 | "\n", 1395 | " # Add the convolutional Layer\n", 1396 | " 
conv_layer = layers.Convolution1D(100, 3, activation=\"relu\")(embedding_layer)\n", 1397 | "\n", 1398 | " # Add the pooling Layer\n", 1399 | " pooling_layer = layers.GlobalMaxPool1D()(conv_layer)\n", 1400 | "\n", 1401 | " # Add the output Layers\n", 1402 | " output_layer1 = layers.Dense(50, activation=\"relu\")(pooling_layer)\n", 1403 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1404 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1405 | "\n", 1406 | " # Compile the model\n", 1407 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1408 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1409 | " \n", 1410 | " return model\n", 1411 | "\n", 1412 | "classifier = create_cnn()\n", 1413 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1414 | "print(\"CNN, Word Embeddings\", accuracy)" 1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "markdown", 1419 | "metadata": {}, 1420 | "source": [ 1421 | "### 3.7.2 Recurrent Neural Network – LSTM [Deep Neural Networks]" 1422 | ] 1423 | }, 1424 | { 1425 | "cell_type": "code", 1426 | "execution_count": null, 1427 | "metadata": {}, 1428 | "outputs": [], 1429 | "source": [ 1430 | "def create_rnn_lstm():\n", 1431 | " # Add an Input Layer\n", 1432 | " input_layer = layers.Input((70, ))\n", 1433 | "\n", 1434 | " # Add the word embedding Layer\n", 1435 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1436 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1437 | "\n", 1438 | " # Add the LSTM Layer\n", 1439 | " lstm_layer = layers.LSTM(100)(embedding_layer)\n", 1440 | "\n", 1441 | " # Add the output Layers\n", 1442 | " output_layer1 = layers.Dense(50, activation=\"relu\")(lstm_layer)\n", 1443 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1444 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1445 | "\n", 1446 | " # Compile the model\n", 1447 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1448 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1449 | " \n", 1450 | " return model\n", 1451 | "\n", 1452 | "classifier = create_rnn_lstm()\n", 1453 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1454 | "print(\"RNN-LSTM, Word Embeddings\", accuracy)" 1455 | ] 1456 | }, 1457 | { 1458 | "cell_type": "markdown", 1459 | "metadata": {}, 1460 | "source": [ 1461 | "### 3.7.3 Recurrent Neural Network – GRU [Deep Neural Networks]" 1462 | ] 1463 | }, 1464 | { 1465 | "cell_type": "code", 1466 | "execution_count": null, 1467 | "metadata": {}, 1468 | "outputs": [], 1469 | "source": [ 1470 | "def create_rnn_gru():\n", 1471 | " # Add an Input Layer\n", 1472 | " input_layer = layers.Input((70, ))\n", 1473 | "\n", 1474 | " # Add the word embedding Layer\n", 1475 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1476 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1477 | "\n", 1478 | " # Add the GRU Layer\n", 1479 | " lstm_layer = layers.GRU(100)(embedding_layer)\n", 1480 | "\n", 1481 | " # Add the output Layers\n", 1482 | " output_layer1 = layers.Dense(50, activation=\"relu\")(lstm_layer)\n", 1483 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1484 | " output_layer2 = layers.Dense(1, 
activation=\"sigmoid\")(output_layer1)\n", 1485 | "\n", 1486 | " # Compile the model\n", 1487 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1488 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1489 | " \n", 1490 | " return model\n", 1491 | "\n", 1492 | "classifier = create_rnn_gru()\n", 1493 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1494 | "print(\"RNN-GRU, Word Embeddings\", accuracy)" 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "markdown", 1499 | "metadata": {}, 1500 | "source": [ 1501 | "### 3.7.4 Bidirectional RNN [Deep Neural Networks]" 1502 | ] 1503 | }, 1504 | { 1505 | "cell_type": "code", 1506 | "execution_count": null, 1507 | "metadata": {}, 1508 | "outputs": [], 1509 | "source": [ 1510 | "def create_bidirectional_rnn():\n", 1511 | " # Add an Input Layer\n", 1512 | " input_layer = layers.Input((70, ))\n", 1513 | "\n", 1514 | " # Add the word embedding Layer\n", 1515 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1516 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1517 | "\n", 1518 | " # Add the LSTM Layer\n", 1519 | " lstm_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)\n", 1520 | "\n", 1521 | " # Add the output Layers\n", 1522 | " output_layer1 = layers.Dense(50, activation=\"relu\")(lstm_layer)\n", 1523 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1524 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1525 | "\n", 1526 | " # Compile the model\n", 1527 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1528 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1529 | " \n", 1530 | " return model\n", 1531 | "\n", 1532 | "classifier = create_bidirectional_rnn()\n", 1533 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1534 | "print(\"RNN-Bidirectional, Word Embeddings\", accuracy)" 1535 | ] 1536 | }, 1537 | { 1538 | "cell_type": "markdown", 1539 | "metadata": {}, 1540 | "source": [ 1541 | "### 3.7.5 Recurrent Convolutional Neural Network" 1542 | ] 1543 | }, 1544 | { 1545 | "cell_type": "markdown", 1546 | "metadata": {}, 1547 | "source": [ 1548 | "- Hierarichial Attention Networks\n", 1549 | "- Sequence to Sequence Models with Attention\n", 1550 | "- Bidirectional Recurrent Convolutional Neural Networks\n", 1551 | "- CNNs and RNNs with more number of layers" 1552 | ] 1553 | }, 1554 | { 1555 | "cell_type": "code", 1556 | "execution_count": null, 1557 | "metadata": {}, 1558 | "outputs": [], 1559 | "source": [ 1560 | "def create_rcnn():\n", 1561 | " # Add an Input Layer\n", 1562 | " input_layer = layers.Input((70, ))\n", 1563 | "\n", 1564 | " # Add the word embedding Layer\n", 1565 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1566 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1567 | " \n", 1568 | " # Add the recurrent layer\n", 1569 | " rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)\n", 1570 | " \n", 1571 | " # Add the convolutional Layer\n", 1572 | " conv_layer = layers.Convolution1D(100, 3, activation=\"relu\")(embedding_layer)\n", 1573 | "\n", 1574 | " # Add the pooling Layer\n", 1575 | " pooling_layer = layers.GlobalMaxPool1D()(conv_layer)\n", 1576 | "\n", 1577 | " # 
Add the output Layers\n", 1578 | " output_layer1 = layers.Dense(50, activation=\"relu\")(pooling_layer)\n", 1579 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1580 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1581 | "\n", 1582 | " # Compile the model\n", 1583 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1584 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1585 | " \n", 1586 | " return model\n", 1587 | "\n", 1588 | "classifier = create_rcnn()\n", 1589 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1590 | "print(\"CNN, Word Embeddings\", accuracy)" 1591 | ] 1592 | }, 1593 | { 1594 | "cell_type": "code", 1595 | "execution_count": null, 1596 | "metadata": {}, 1597 | "outputs": [], 1598 | "source": [] 1599 | }, 1600 | { 1601 | "cell_type": "code", 1602 | "execution_count": null, 1603 | "metadata": {}, 1604 | "outputs": [], 1605 | "source": [] 1606 | }, 1607 | { 1608 | "cell_type": "markdown", 1609 | "metadata": {}, 1610 | "source": [ 1611 | "# EXPLAIN MODELS" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "markdown", 1616 | "metadata": {}, 1617 | "source": [ 1618 | "# TextExplainer: debugging black-box text classifiers" 1619 | ] 1620 | }, 1621 | { 1622 | "cell_type": "markdown", 1623 | "metadata": {}, 1624 | "source": [ 1625 | "https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html\n", 1626 | "\n", 1627 | "https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html#example-problem-lsa-svm-for-20-newsgroups-dataset\n", 1628 | "\n", 1629 | "**Goal:** explain predictions of arbitrary classifiers, including text classifiers (when it is hard to get exact mapping between model coefficients and text features, e.g. 
if there is dimension reduction involved)" 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "markdown", 1634 | "metadata": {}, 1635 | "source": [ 1636 | "### Example problem: LSA+SVM for 20 Newsgroups dataset" 1637 | ] 1638 | }, 1639 | { 1640 | "cell_type": "code", 1641 | "execution_count": null, 1642 | "metadata": {}, 1643 | "outputs": [], 1644 | "source": [ 1645 | "from sklearn.datasets import fetch_20newsgroups\n", 1646 | "\n", 1647 | "categories = ['alt.atheism', 'soc.religion.christian',\n", 1648 | " 'comp.graphics', 'sci.med']\n", 1649 | "twenty_train = fetch_20newsgroups(\n", 1650 | " subset='train',\n", 1651 | " categories=categories,\n", 1652 | " shuffle=True,\n", 1653 | " random_state=42,\n", 1654 | " remove=('headers', 'footers'),\n", 1655 | ")\n", 1656 | "twenty_test = fetch_20newsgroups(\n", 1657 | " subset='test',\n", 1658 | " categories=categories,\n", 1659 | " shuffle=True,\n", 1660 | " random_state=42,\n", 1661 | " remove=('headers', 'footers'),\n", 1662 | ")" 1663 | ] 1664 | }, 1665 | { 1666 | "cell_type": "code", 1667 | "execution_count": null, 1668 | "metadata": {}, 1669 | "outputs": [], 1670 | "source": [ 1671 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1672 | "from sklearn.svm import SVC\n", 1673 | "from sklearn.decomposition import TruncatedSVD\n", 1674 | "from sklearn.pipeline import Pipeline, make_pipeline\n", 1675 | "\n", 1676 | "vec = TfidfVectorizer(min_df=3, stop_words='english',\n", 1677 | " ngram_range=(1, 2))\n", 1678 | "\n", 1679 | "# The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents.\n", 1680 | "svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)\n", 1681 | "lsa = make_pipeline(vec, svd)\n", 1682 | "\n", 1683 | "clf = SVC(C=150, gamma=2e-2, probability=True)\n", 1684 | "pipe = make_pipeline(lsa, clf)\n", 1685 | "pipe.fit(twenty_train.data, twenty_train.target)\n", 1686 | "pipe.score(twenty_test.data, twenty_test.target)" 1687 | ] 1688 | }, 1689 | { 1690 | "cell_type": "code", 1691 | "execution_count": null, 1692 | "metadata": {}, 1693 | "outputs": [], 1694 | "source": [ 1695 | "def print_prediction(doc):\n", 1696 | " y_pred = pipe.predict_proba([doc])[0]\n", 1697 | " for target, prob in zip(twenty_train.target_names, y_pred):\n", 1698 | " print(\"{:.3f} {}\".format(prob, target))\n", 1699 | "\n", 1700 | "doc = twenty_test.data[0]\n", 1701 | "\n", 1702 | "print(twenty_test.data[0])\n", 1703 | "print('------------------------------------ What is the prediction?-------------------------------------------------------')\n", 1704 | "print_prediction(doc)" 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "markdown", 1709 | "metadata": {}, 1710 | "source": [ 1711 | "### TextExplainer" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "markdown", 1716 | "metadata": {}, 1717 | "source": [ 1718 | "1. Create a TextExplainer instance, \n", 1719 | "2. ... then pass the document to explain and a black-box classifier (a function which returns probabilities) to the fit() method, \n", 1720 | "3. ... 
then check the explanation:" 1721 | ] 1722 | }, 1723 | { 1724 | "cell_type": "code", 1725 | "execution_count": null, 1726 | "metadata": {}, 1727 | "outputs": [], 1728 | "source": [ 1729 | "import eli5\n", 1730 | "from eli5.lime import TextExplainer\n", 1731 | "\n", 1732 | "doc = twenty_test.data[0]\n", 1733 | "\n", 1734 | "te = TextExplainer(random_state=42)\n", 1735 | "te.fit(doc, pipe.predict_proba)\n", 1736 | "te.show_prediction(target_names=twenty_train.target_names)" 1737 | ] 1738 | }, 1739 | { 1740 | "cell_type": "markdown", 1741 | "metadata": {}, 1742 | "source": [ 1743 | "### Why it works?" 1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "markdown", 1748 | "metadata": {}, 1749 | "source": [ 1750 | "Explanation makes sense - we expect reasonable classifier to **take highlighted words in account**. But how can we be sure this is **how the pipeline works**, not just a nice-looking lie? \n", 1751 | "\n", 1752 | "A simple **sanity check** is to **remove or change the highlighted words**, to confirm that **they change the outcome**" 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": null, 1758 | "metadata": {}, 1759 | "outputs": [], 1760 | "source": [ 1761 | "import re\n", 1762 | "doc2 = re.sub(r'(recall|kidney|stones|medication|pain|tech)', '', doc, flags=re.I)\n", 1763 | "print_prediction(doc2)" 1764 | ] 1765 | }, 1766 | { 1767 | "cell_type": "markdown", 1768 | "metadata": {}, 1769 | "source": [ 1770 | "**Predicted probabilities changed a lot indeed.**\n", 1771 | "\n", 1772 | "And in fact, TextExplainer did something similar to get the explanation. TextExplainer generated a lot of texts similar to the document (by removing some of the words), and then trained a white-box classifier which predicts the output of the black-box classifier (not the true labels!). 
The explanation we saw is for this white-box classifier.\n", 1773 | "\n", 1774 | "This approach follows the LIME algorithm; for text data the algorithm is actually pretty straightforward:\n", 1775 | "\n", 1776 | "- generate distorted versions of the text;\n", 1777 | "- predict probabilities for these distorted texts using the black-box classifier;\n", 1778 | "- train another classifier (one of those eli5 supports) which tries to predict output of a black-box classifier on these texts.\n", 1779 | "\n", 1780 | "The algorithm works because even though it could be hard or impossible to approximate a black-box classifier globally (for every possible text), approximating it in a small neighbourhood near a given text often works well, even with simple white-box classifiers.\n", 1781 | "\n", 1782 | "Generated samples (distorted texts) are available in samples_ attribute:" 1783 | ] 1784 | }, 1785 | { 1786 | "cell_type": "code", 1787 | "execution_count": null, 1788 | "metadata": {}, 1789 | "outputs": [], 1790 | "source": [ 1791 | "print(te.samples_[0])" 1792 | ] 1793 | }, 1794 | { 1795 | "cell_type": "code", 1796 | "execution_count": null, 1797 | "metadata": {}, 1798 | "outputs": [], 1799 | "source": [ 1800 | "# By default TextExplainer generates 5000 distorted texts (use n_samples argument to change the amount):\n", 1801 | "len(te.samples_)" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "markdown", 1806 | "metadata": {}, 1807 | "source": [ 1808 | "### Customizing TextExplainer: classifier" 1809 | ] 1810 | }, 1811 | { 1812 | "cell_type": "code", 1813 | "execution_count": null, 1814 | "metadata": {}, 1815 | "outputs": [], 1816 | "source": [ 1817 | "from sklearn.tree import DecisionTreeClassifier\n", 1818 | "dtree=DecisionTreeClassifier()\n", 1819 | "dtree.fit(te5.show_weights())" 1820 | ] 1821 | }, 1822 | { 1823 | "cell_type": "code", 1824 | "execution_count": null, 1825 | "metadata": {}, 1826 | "outputs": [], 1827 | "source": [ 1828 | "explain_prediction_tree_classifier" 1829 | ] 1830 | }, 1831 | { 1832 | "cell_type": "code", 1833 | "execution_count": null, 1834 | "metadata": {}, 1835 | "outputs": [], 1836 | "source": [ 1837 | "from sklearn.tree import DecisionTreeClassifier\n", 1838 | "\n", 1839 | "te5 = TextExplainer(clf=DecisionTreeClassifier(max_depth=2), random_state=0)\n", 1840 | "te5.fit(doc, pipe.predict_proba)\n", 1841 | "print(te5.metrics_)\n", 1842 | "te5.show_weights()" 1843 | ] 1844 | }, 1845 | { 1846 | "cell_type": "markdown", 1847 | "metadata": {}, 1848 | "source": [ 1849 | "So according to this tree if **“kidney” is not in the document** and **“pain” is not in the document** then the **probability of a document** belonging to **sci.med** drops to **0.65**. If at least one of these words remain sci.med probability stays** 0.9+.**" 1850 | ] 1851 | }, 1852 | { 1853 | "cell_type": "markdown", 1854 | "metadata": {}, 1855 | "source": [ 1856 | "# 3 ways to interpretate NLP model" 1857 | ] 1858 | }, 1859 | { 1860 | "cell_type": "markdown", 1861 | "metadata": {}, 1862 | "source": [ 1863 | "https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb\n", 1864 | "\n", 1865 | "**Goal**: want to know why we predict it wrongly\n", 1866 | "\n", 1867 | "**1 . Interpretability**\n", 1868 | "- **Intrinsic**: We do not need to train another model to explain the target. For example, it is using decision tree or linear model\n", 1869 | "- **Post hoc**: The model belongs to black-box model which we need to use another model to interpret it. \n", 1870 | "\n", 1871 | "**2. 
Approach**\n", 1872 | "- **Model-specific**: Some tools are limited to specific model such as liner model and neural network model.\n", 1873 | "- **Model-agnostic**: On the other hand, some tools able to explain any model by building write-box model. \n", 1874 | "\n", 1875 | "** 3. Level**\n", 1876 | "- **Global**: Explain the overall model such as feature weight. This one give you a in general model behavior\n", 1877 | "- **Local**: Explain the specific prediction result." 1878 | ] 1879 | }, 1880 | { 1881 | "cell_type": "code", 1882 | "execution_count": null, 1883 | "metadata": {}, 1884 | "outputs": [], 1885 | "source": [ 1886 | "import random\n", 1887 | "import pandas as pd\n", 1888 | "import IPython\n", 1889 | "import xgboost\n", 1890 | "\n", 1891 | "import eli5\n", 1892 | "from eli5.lime import TextExplainer\n", 1893 | "from lime.lime_text import LimeTextExplainer\n", 1894 | "print('ELI5 Version:', eli5.__version__)\n", 1895 | "print('XGBoost Version:', xgboost.__version__)" 1896 | ] 1897 | }, 1898 | { 1899 | "cell_type": "code", 1900 | "execution_count": null, 1901 | "metadata": {}, 1902 | "outputs": [], 1903 | "source": [ 1904 | "from sklearn.datasets import fetch_20newsgroups\n", 1905 | "train_raw_df = fetch_20newsgroups(subset='train')\n", 1906 | "test_raw_df = fetch_20newsgroups(subset='test')" 1907 | ] 1908 | }, 1909 | { 1910 | "cell_type": "code", 1911 | "execution_count": null, 1912 | "metadata": {}, 1913 | "outputs": [], 1914 | "source": [ 1915 | "x_train = train_raw_df.data\n", 1916 | "y_train = train_raw_df.target\n", 1917 | "\n", 1918 | "x_test = test_raw_df.data\n", 1919 | "y_test = test_raw_df.target" 1920 | ] 1921 | }, 1922 | { 1923 | "cell_type": "code", 1924 | "execution_count": null, 1925 | "metadata": {}, 1926 | "outputs": [], 1927 | "source": [ 1928 | "x_train" 1929 | ] 1930 | }, 1931 | { 1932 | "cell_type": "code", 1933 | "execution_count": null, 1934 | "metadata": {}, 1935 | "outputs": [], 1936 | "source": [ 1937 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1938 | "from sklearn.linear_model import LogisticRegression\n", 1939 | "from sklearn.ensemble import RandomForestClassifier\n", 1940 | "from sklearn.pipeline import make_pipeline\n", 1941 | "from xgboost import XGBClassifier" 1942 | ] 1943 | }, 1944 | { 1945 | "cell_type": "code", 1946 | "execution_count": null, 1947 | "metadata": {}, 1948 | "outputs": [], 1949 | "source": [ 1950 | "names = ['Logistic Regression', 'Random Forest', 'XGBoost Classifier']" 1951 | ] 1952 | }, 1953 | { 1954 | "cell_type": "code", 1955 | "execution_count": null, 1956 | "metadata": {}, 1957 | "outputs": [], 1958 | "source": [ 1959 | "def build_model(names, x, y):\n", 1960 | " pipelines = []\n", 1961 | " vec = TfidfVectorizer()\n", 1962 | " vec.fit(x)\n", 1963 | "\n", 1964 | " for name in names:\n", 1965 | " print('train %s' % name)\n", 1966 | " \n", 1967 | " if name == 'Logistic Regression':\n", 1968 | " estimator = LogisticRegression(solver='newton-cg', n_jobs=-1)\n", 1969 | " pipeline = make_pipeline(vec, estimator)\n", 1970 | " elif name == 'Random Forest':\n", 1971 | " estimator = RandomForestClassifier(n_jobs=-1)\n", 1972 | " pipeline = make_pipeline(vec, estimator)\n", 1973 | " elif name == 'XGBoost Classifier':\n", 1974 | " estimator = XGBClassifier()\n", 1975 | " pipeline = make_pipeline(vec, estimator)\n", 1976 | " \n", 1977 | " pipeline.fit(x, y)\n", 1978 | " pipelines.append({\n", 1979 | " 'name': name,\n", 1980 | " 'pipeline': pipeline\n", 1981 | " })\n", 1982 | " \n", 1983 | " return pipelines, vec" 
1984 | ] 1985 | }, 1986 | { 1987 | "cell_type": "code", 1988 | "execution_count": null, 1989 | "metadata": {}, 1990 | "outputs": [], 1991 | "source": [ 1992 | "pipelines, vec = build_model(names, x_train, y_train)" 1993 | ] 1994 | }, 1995 | { 1996 | "cell_type": "markdown", 1997 | "metadata": {}, 1998 | "source": [ 1999 | "### 1. ELI5" 2000 | ] 2001 | }, 2002 | { 2003 | "cell_type": "markdown", 2004 | "metadata": {}, 2005 | "source": [ 2006 | "#### A. - ELI5 - Global Interpretation" 2007 | ] 2008 | }, 2009 | { 2010 | "cell_type": "code", 2011 | "execution_count": null, 2012 | "metadata": {}, 2013 | "outputs": [], 2014 | "source": [ 2015 | "for pipeline in pipelines:\n", 2016 | " print('Estimator: %s' % (pipeline['name']))\n", 2017 | " labels = pipeline['pipeline'].classes_.tolist()\n", 2018 | " \n", 2019 | " if pipeline['name'] in ['Logistic Regression', 'Random Forest']:\n", 2020 | " estimator = pipeline['pipeline']\n", 2021 | " elif pipeline['name'] == 'XGBoost Classifier':\n", 2022 | " estimator = pipeline['pipeline'].steps[1][1].get_booster()\n", 2023 | "# Not support Keras\n", 2024 | "# elif pipeline['name'] == 'keras':\n", 2025 | "# estimator = pipeline['pipeline']\n", 2026 | " else:\n", 2027 | " continue\n", 2028 | " \n", 2029 | " IPython.display.display(\n", 2030 | " eli5.show_weights(estimator=estimator, top=10, target_names=labels, vec=vec))" 2031 | ] 2032 | }, 2033 | { 2034 | "cell_type": "markdown", 2035 | "metadata": {}, 2036 | "source": [ 2037 | "#### B. - ELI5 - Local Interpretation" 2038 | ] 2039 | }, 2040 | { 2041 | "cell_type": "code", 2042 | "execution_count": null, 2043 | "metadata": {}, 2044 | "outputs": [], 2045 | "source": [ 2046 | "number_of_sample = 1\n", 2047 | "sample_ids = [random.randint(0, len(x_test) -1 ) for p in range(0, number_of_sample)]\n", 2048 | "\n", 2049 | "for idx in sample_ids:\n", 2050 | " print('Index: %d' % (idx))\n", 2051 | "# print('Index: %d, Feature: %s' % (idx, x_test[idx]))\n", 2052 | " for pipeline in pipelines:\n", 2053 | " print('-' * 50)\n", 2054 | " print('Estimator: %s' % (pipeline['name']))\n", 2055 | " \n", 2056 | " print('True Label: %s, Predicted Label: %s' % (y_test[idx], pipeline['pipeline'].predict([x_test[idx]])[0]))\n", 2057 | " labels = pipeline['pipeline'].classes_.tolist()\n", 2058 | " \n", 2059 | " if pipeline['name'] in ['Logistic Regression', 'Random Forest']:\n", 2060 | " estimator = pipeline['pipeline'].steps[1][1]\n", 2061 | " elif pipeline['name'] == 'XGBoost Classifier':\n", 2062 | " estimator = pipeline['pipeline'].steps[1][1].get_booster()\n", 2063 | " # Not support Keras\n", 2064 | "# elif pipeline['name'] == 'Keras':\n", 2065 | "# estimator = pipeline['pipeline'].model\n", 2066 | " else:\n", 2067 | " continue\n", 2068 | "\n", 2069 | " IPython.display.display(\n", 2070 | " eli5.show_prediction(estimator, x_test[idx], top=10, vec=vec, target_names=labels))" 2071 | ] 2072 | }, 2073 | { 2074 | "cell_type": "markdown", 2075 | "metadata": {}, 2076 | "source": [ 2077 | "### 2. 
LIME [2 independent examples]" 2078 | ] 2079 | }, 2080 | { 2081 | "cell_type": "markdown", 2082 | "metadata": {}, 2083 | "source": [ 2084 | "## 1st example" 2085 | ] 2086 | }, 2087 | { 2088 | "cell_type": "markdown", 2089 | "metadata": {}, 2090 | "source": [ 2091 | "https://www.kaggle.com/emanceau/interpreting-machine-learning-lime-explainer/notebook" 2092 | ] 2093 | }, 2094 | { 2095 | "cell_type": "markdown", 2096 | "metadata": {}, 2097 | "source": [ 2098 | "Dataset contains text from works of fiction written by spooky authors of the public domain:\n", 2099 | "- Edgar Allan Poe (EAP)\n", 2100 | "- HP Lovecraft (HPL)\n", 2101 | "- Mary Wollstonecraft Shelley (MWS)\n", 2102 | "\n", 2103 | "The objective is to **accurately identify the author of the sentences in the test set**\n", 2104 | "\n", 2105 | "**Lime explainer mission** is to help human to **understand decisions made by machine learning**. Basically, lime explainer create **a local linear model** around the prediction and try to **explain factor influence**." 2106 | ] 2107 | }, 2108 | { 2109 | "cell_type": "code", 2110 | "execution_count": 2, 2111 | "metadata": {}, 2112 | "outputs": [], 2113 | "source": [ 2114 | "import numpy as np\n", 2115 | "import pandas as pd\n", 2116 | "\n", 2117 | "import matplotlib.pyplot as plt\n", 2118 | "\n", 2119 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", 2120 | "from sklearn.model_selection import train_test_split\n", 2121 | "from sklearn.metrics import confusion_matrix\n", 2122 | "from sklearn import ensemble, metrics, model_selection, naive_bayes\n", 2123 | "from sklearn.pipeline import make_pipeline\n", 2124 | "\n", 2125 | "from lime import lime_text\n", 2126 | "from lime.lime_text import LimeTextExplainer\n", 2127 | "import itertools \n", 2128 | "%matplotlib inline\n", 2129 | "import warnings\n", 2130 | "warnings.simplefilter('ignore')" 2131 | ] 2132 | }, 2133 | { 2134 | "cell_type": "code", 2135 | "execution_count": 3, 2136 | "metadata": {}, 2137 | "outputs": [], 2138 | "source": [ 2139 | "train_df = pd.read_csv(\"train.csv\")\n", 2140 | "test_df = pd.read_csv(\"test.csv\")" 2141 | ] 2142 | }, 2143 | { 2144 | "cell_type": "code", 2145 | "execution_count": 4, 2146 | "metadata": {}, 2147 | "outputs": [ 2148 | { 2149 | "data": { 2150 | "text/html": [ 2151 | "
\n", 2152 | "\n", 2165 | "\n", 2166 | " \n", 2167 | " \n", 2168 | " \n", 2169 | " \n", 2170 | " \n", 2171 | " \n", 2172 | " \n", 2173 | " \n", 2174 | " \n", 2175 | " \n", 2176 | " \n", 2177 | " \n", 2178 | " \n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | "
idtextauthor
0id26305This process, however, afforded me no means of...EAP
1id17569It never once occurred to me that the fumbling...HPL
2id11008In his left hand was a gold snuff box, from wh...EAP
3id27763How lovely is spring As we looked from Windsor...MWS
4id12958Finding nothing else, not even gold, the Super...HPL
\n", 2207 | "
" 2208 | ], 2209 | "text/plain": [ 2210 | " id text author\n", 2211 | "0 id26305 This process, however, afforded me no means of... EAP\n", 2212 | "1 id17569 It never once occurred to me that the fumbling... HPL\n", 2213 | "2 id11008 In his left hand was a gold snuff box, from wh... EAP\n", 2214 | "3 id27763 How lovely is spring As we looked from Windsor... MWS\n", 2215 | "4 id12958 Finding nothing else, not even gold, the Super... HPL" 2216 | ] 2217 | }, 2218 | "execution_count": 4, 2219 | "metadata": {}, 2220 | "output_type": "execute_result" 2221 | } 2222 | ], 2223 | "source": [ 2224 | "train_df.head()" 2225 | ] 2226 | }, 2227 | { 2228 | "cell_type": "code", 2229 | "execution_count": 5, 2230 | "metadata": {}, 2231 | "outputs": [ 2232 | { 2233 | "data": { 2234 | "text/html": [ 2235 | "
\n", 2236 | "\n", 2249 | "\n", 2250 | " \n", 2251 | " \n", 2252 | " \n", 2253 | " \n", 2254 | " \n", 2255 | " \n", 2256 | " \n", 2257 | " \n", 2258 | " \n", 2259 | " \n", 2260 | " \n", 2261 | " \n", 2262 | " \n", 2263 | " \n", 2264 | " \n", 2265 | " \n", 2266 | " \n", 2267 | " \n", 2268 | " \n", 2269 | " \n", 2270 | " \n", 2271 | " \n", 2272 | " \n", 2273 | " \n", 2274 | " \n", 2275 | " \n", 2276 | " \n", 2277 | " \n", 2278 | " \n", 2279 | " \n", 2280 | " \n", 2281 | " \n", 2282 | " \n", 2283 | " \n", 2284 | "
idtext
0id02310Still, as I urged our leaving Ireland with suc...
1id24541If a fire wanted fanning, it could readily be ...
2id00134And when they had broken down the frail door t...
3id27757While I was thinking how I should possibly man...
4id04081I am not sure to what limit his knowledge may ...
\n", 2285 | "
" 2286 | ], 2287 | "text/plain": [ 2288 | " id text\n", 2289 | "0 id02310 Still, as I urged our leaving Ireland with suc...\n", 2290 | "1 id24541 If a fire wanted fanning, it could readily be ...\n", 2291 | "2 id00134 And when they had broken down the frail door t...\n", 2292 | "3 id27757 While I was thinking how I should possibly man...\n", 2293 | "4 id04081 I am not sure to what limit his knowledge may ..." 2294 | ] 2295 | }, 2296 | "execution_count": 5, 2297 | "metadata": {}, 2298 | "output_type": "execute_result" 2299 | } 2300 | ], 2301 | "source": [ 2302 | "test_df.head()" 2303 | ] 2304 | }, 2305 | { 2306 | "cell_type": "markdown", 2307 | "metadata": {}, 2308 | "source": [ 2309 | "#### Explainer with basic model" 2310 | ] 2311 | }, 2312 | { 2313 | "cell_type": "code", 2314 | "execution_count": 6, 2315 | "metadata": {}, 2316 | "outputs": [ 2317 | { 2318 | "data": { 2319 | "text/plain": [ 2320 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 2321 | ] 2322 | }, 2323 | "execution_count": 6, 2324 | "metadata": {}, 2325 | "output_type": "execute_result" 2326 | } 2327 | ], 2328 | "source": [ 2329 | "class_names = ['EAP', 'HPL', 'MWS']\n", 2330 | "cols_to_drop = ['id', 'text']\n", 2331 | "train_X = train_df.drop(cols_to_drop+['author'], axis=1)\n", 2332 | "\n", 2333 | "## Prepare the data for modeling ###\n", 2334 | "author_mapping_dict = {'EAP':0, 'HPL':1, 'MWS':2}\n", 2335 | "train_y = train_df['author'].map(author_mapping_dict)\n", 2336 | "train_id = train_df['id'].values\n", 2337 | "\n", 2338 | "tfidf_vec = TfidfVectorizer(ngram_range=(1,5), analyzer='char')\n", 2339 | "full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())\n", 2340 | "train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())\n", 2341 | "\n", 2342 | "X_train, X_test, y_train, y_test = train_test_split(train_tfidf, train_y, test_size=0.33, random_state=14)\n", 2343 | "model_tf = naive_bayes.MultinomialNB()\n", 2344 | "model_tf.fit(X_train, y_train)" 2345 | ] 2346 | }, 2347 | { 2348 | "cell_type": "code", 2349 | "execution_count": null, 2350 | "metadata": {}, 2351 | "outputs": [], 2352 | "source": [ 2353 | "print(X_train)" 2354 | ] 2355 | }, 2356 | { 2357 | "cell_type": "code", 2358 | "execution_count": null, 2359 | "metadata": {}, 2360 | "outputs": [], 2361 | "source": [ 2362 | "def plot_confusion_matrix(cm, classes,\n", 2363 | " normalize=False,\n", 2364 | " title='Confusion matrix',\n", 2365 | " cmap=plt.cm.Blues):\n", 2366 | " \"\"\"\n", 2367 | " This function prints and plots the confusion matrix.\n", 2368 | " Normalization can be applied by setting `normalize=True`.\n", 2369 | " \"\"\"\n", 2370 | " if normalize:\n", 2371 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", 2372 | " print(\"Normalized confusion matrix\")\n", 2373 | " else:\n", 2374 | " print('Confusion matrix, without normalization')\n", 2375 | "\n", 2376 | " print(cm)\n", 2377 | "\n", 2378 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", 2379 | " plt.title(title)\n", 2380 | " plt.colorbar()\n", 2381 | " tick_marks = np.arange(len(classes))\n", 2382 | " plt.xticks(tick_marks, classes, rotation=45)\n", 2383 | " plt.yticks(tick_marks, classes)\n", 2384 | "\n", 2385 | " fmt = '.2f' if normalize else 'd'\n", 2386 | " thresh = cm.max() / 2.\n", 2387 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 2388 | " plt.text(j, i, format(cm[i, j], fmt),\n", 2389 | " horizontalalignment=\"center\",\n", 2390 | " color=\"white\" if cm[i, j] > 
thresh else \"black\")\n", 2391 | "\n", 2392 | " plt.tight_layout()\n", 2393 | " plt.ylabel('True label')\n", 2394 | " plt.xlabel('Predicted label')" 2395 | ] 2396 | }, 2397 | { 2398 | "cell_type": "code", 2399 | "execution_count": null, 2400 | "metadata": {}, 2401 | "outputs": [], 2402 | "source": [ 2403 | "y_pred = model_tf.predict(X_test)\n", 2404 | "\n", 2405 | "# Compute confusion matrix\n", 2406 | "cnf_matrix = confusion_matrix(y_test, y_pred)\n", 2407 | "np.set_printoptions(precision=2)\n", 2408 | "\n", 2409 | "# Plot non-normalized confusion matrix\n", 2410 | "plt.figure()\n", 2411 | "plot_confusion_matrix(cnf_matrix, classes=class_names,\n", 2412 | " title='Confusion matrix, without normalization')\n", 2413 | "plt.show()" 2414 | ] 2415 | }, 2416 | { 2417 | "cell_type": "code", 2418 | "execution_count": null, 2419 | "metadata": {}, 2420 | "outputs": [], 2421 | "source": [ 2422 | "import re\n", 2423 | "c_tf = make_pipeline(tfidf_vec, model_tf)\n", 2424 | "\n", 2425 | "split_expression = lambda s: re.split(r'\\W+', s)\n", 2426 | "explainer = LimeTextExplainer(class_names=class_names, split_expression=split_expression)" 2427 | ] 2428 | }, 2429 | { 2430 | "cell_type": "code", 2431 | "execution_count": null, 2432 | "metadata": {}, 2433 | "outputs": [], 2434 | "source": [ 2435 | "comp = y_test.to_frame()\n", 2436 | "comp['idx'] = comp.index.values\n", 2437 | "comp['pred'] = y_pred\n", 2438 | "comp.rename(columns={'author': 'real'}, inplace=True)" 2439 | ] 2440 | }, 2441 | { 2442 | "cell_type": "markdown", 2443 | "metadata": {}, 2444 | "source": [ 2445 | "### Explaining errors" 2446 | ] 2447 | }, 2448 | { 2449 | "cell_type": "markdown", 2450 | "metadata": {}, 2451 | "source": [ 2452 | "#### A --- True POE but classified in HPL" 2453 | ] 2454 | }, 2455 | { 2456 | "cell_type": "code", 2457 | "execution_count": null, 2458 | "metadata": {}, 2459 | "outputs": [], 2460 | "source": [ 2461 | "wrong_poe_hpl = comp[(comp.real ==0) & (comp.pred ==1)]\n", 2462 | "wrong_poe_hpl.shape\n", 2463 | "print(wrong_poe_hpl.idx)\n", 2464 | "idx = wrong_poe_hpl.idx.iloc[1]\n", 2465 | "\n", 2466 | "print('We see that we got', len(wrong_poe_hpl.idx), 'as shown by the confusion matrix above')" 2467 | ] 2468 | }, 2469 | { 2470 | "cell_type": "code", 2471 | "execution_count": null, 2472 | "metadata": {}, 2473 | "outputs": [], 2474 | "source": [ 2475 | "c_tf.predict_proba" 2476 | ] 2477 | }, 2478 | { 2479 | "cell_type": "code", 2480 | "execution_count": null, 2481 | "metadata": {}, 2482 | "outputs": [], 2483 | "source": [ 2484 | "tokenizer = lambda doc: re.compile(r\"(?u)\\b\\w\\w+\\b\").findall(doc)\n", 2485 | "explainer = LimeTextExplainer(class_names=class_names, split_expression=tokenizer)\n", 2486 | "exp = explainer.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=6)" 2487 | ] 2488 | }, 2489 | { 2490 | "cell_type": "markdown", 2491 | "metadata": {}, 2492 | "source": [ 2493 | "This error is created by the use of ancient greek words. Possible to improve the model ?" 2494 | ] 2495 | }, 2496 | { 2497 | "cell_type": "code", 2498 | "execution_count": null, 2499 | "metadata": {}, 2500 | "outputs": [], 2501 | "source": [ 2502 | "idx = wrong_poe_hpl.idx.iloc[3]\n", 2503 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=2)\n", 2504 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1))" 2505 | ] 2506 | }, 2507 | { 2508 | "cell_type": "markdown", 2509 | "metadata": {}, 2510 | "source": [ 2511 | "OK, very difficult case. 
Only three words > Not enough to properly classify. No improvement possible." 2512 | ] 2513 | }, 2514 | { 2515 | "cell_type": "markdown", 2516 | "metadata": {}, 2517 | "source": [ 2518 | "#### B. --- True POE but classified in MWS" 2519 | ] 2520 | }, 2521 | { 2522 | "cell_type": "code", 2523 | "execution_count": null, 2524 | "metadata": {}, 2525 | "outputs": [], 2526 | "source": [ 2527 | "wrong_poe_mws = comp[(comp.real ==0) & (comp.pred ==2)]\n", 2528 | "print(wrong_poe_mws.shape)\n", 2529 | "idx = wrong_poe_mws.idx.iloc[12]" 2530 | ] 2531 | }, 2532 | { 2533 | "cell_type": "code", 2534 | "execution_count": null, 2535 | "metadata": {}, 2536 | "outputs": [], 2537 | "source": [ 2538 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2539 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1))" 2540 | ] 2541 | }, 2542 | { 2543 | "cell_type": "markdown", 2544 | "metadata": {}, 2545 | "source": [ 2546 | "OK, this text contains anaphora, possible to improve the model with anaphora feature." 2547 | ] 2548 | }, 2549 | { 2550 | "cell_type": "code", 2551 | "execution_count": null, 2552 | "metadata": {}, 2553 | "outputs": [], 2554 | "source": [ 2555 | "idx = wrong_poe_mws.idx.iloc[18]\n", 2556 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2557 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))" 2558 | ] 2559 | }, 2560 | { 2561 | "cell_type": "markdown", 2562 | "metadata": {}, 2563 | "source": [ 2564 | "OK, probabilities (EAP and MWS) are very close. Possible to improve the model." 2565 | ] 2566 | }, 2567 | { 2568 | "cell_type": "markdown", 2569 | "metadata": {}, 2570 | "source": [ 2571 | "#### C. --- True MWS but classified in HPL" 2572 | ] 2573 | }, 2574 | { 2575 | "cell_type": "code", 2576 | "execution_count": null, 2577 | "metadata": {}, 2578 | "outputs": [], 2579 | "source": [ 2580 | "wrong_mws_hpl = comp[(comp.real ==2) & (comp.pred ==1)]\n", 2581 | "print(wrong_mws_hpl.shape)\n", 2582 | "idx = wrong_mws_hpl.idx.iloc[8]" 2583 | ] 2584 | }, 2585 | { 2586 | "cell_type": "code", 2587 | "execution_count": null, 2588 | "metadata": {}, 2589 | "outputs": [], 2590 | "source": [ 2591 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2592 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))" 2593 | ] 2594 | }, 2595 | { 2596 | "cell_type": "markdown", 2597 | "metadata": {}, 2598 | "source": [ 2599 | "OK, probabilities (HPL and MWS) are very close. Possible to improve the model." 2600 | ] 2601 | }, 2602 | { 2603 | "cell_type": "code", 2604 | "execution_count": null, 2605 | "metadata": {}, 2606 | "outputs": [], 2607 | "source": [ 2608 | "idx = wrong_mws_hpl.idx.iloc[5]\n", 2609 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2610 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))" 2611 | ] 2612 | }, 2613 | { 2614 | "cell_type": "markdown", 2615 | "metadata": {}, 2616 | "source": [ 2617 | "OK, probabilities (EAP, HPL, MWS ) are all very close. 
Possible to improve the model (using repetition pattern ?)" 2618 | ] 2619 | }, 2620 | { 2621 | "cell_type": "markdown", 2622 | "metadata": {}, 2623 | "source": [ 2624 | "## 2nd example" 2625 | ] 2626 | }, 2627 | { 2628 | "cell_type": "markdown", 2629 | "metadata": {}, 2630 | "source": [ 2631 | "https://marcotcr.github.io/lime/tutorials/Lime%20-%20basic%20usage%2C%20two%20class%20case.html" 2632 | ] 2633 | }, 2634 | { 2635 | "cell_type": "markdown", 2636 | "metadata": {}, 2637 | "source": [ 2638 | "### 1st step : Fetching data, training a classifier" 2639 | ] 2640 | }, 2641 | { 2642 | "cell_type": "markdown", 2643 | "metadata": {}, 2644 | "source": [ 2645 | "For simplicity, we'll use a **2-class subset**: atheism and christianity" 2646 | ] 2647 | }, 2648 | { 2649 | "cell_type": "code", 2650 | "execution_count": null, 2651 | "metadata": {}, 2652 | "outputs": [], 2653 | "source": [ 2654 | "import lime\n", 2655 | "import sklearn\n", 2656 | "import numpy as np\n", 2657 | "import sklearn\n", 2658 | "import sklearn.ensemble\n", 2659 | "import sklearn.metrics\n", 2660 | "from __future__ import print_function" 2661 | ] 2662 | }, 2663 | { 2664 | "cell_type": "code", 2665 | "execution_count": null, 2666 | "metadata": {}, 2667 | "outputs": [], 2668 | "source": [ 2669 | "from sklearn.datasets import fetch_20newsgroups\n", 2670 | "categories = ['alt.atheism', 'soc.religion.christian']\n", 2671 | "newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)\n", 2672 | "newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)\n", 2673 | "class_names = ['atheism', 'christian']" 2674 | ] 2675 | }, 2676 | { 2677 | "cell_type": "markdown", 2678 | "metadata": {}, 2679 | "source": [ 2680 | "Let's use the **tfidf vectorizer**, commonly used for text." 2681 | ] 2682 | }, 2683 | { 2684 | "cell_type": "code", 2685 | "execution_count": null, 2686 | "metadata": {}, 2687 | "outputs": [], 2688 | "source": [ 2689 | "vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)\n", 2690 | "train_vectors = vectorizer.fit_transform(newsgroups_train.data)\n", 2691 | "test_vectors = vectorizer.transform(newsgroups_test.data)" 2692 | ] 2693 | }, 2694 | { 2695 | "cell_type": "markdown", 2696 | "metadata": {}, 2697 | "source": [ 2698 | "Now, let's say we want to use **random forests for classification**. It's usually hard to understand what random forests are doing, especially with many trees." 
2699 | ] 2700 | }, 2701 | { 2702 | "cell_type": "code", 2703 | "execution_count": null, 2704 | "metadata": {}, 2705 | "outputs": [], 2706 | "source": [ 2707 | "rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)\n", 2708 | "rf.fit(train_vectors, newsgroups_train.target)" 2709 | ] 2710 | }, 2711 | { 2712 | "cell_type": "code", 2713 | "execution_count": null, 2714 | "metadata": {}, 2715 | "outputs": [], 2716 | "source": [ 2717 | "pred = rf.predict(test_vectors)\n", 2718 | "sklearn.metrics.f1_score(newsgroups_test.target, pred, average='binary')" 2719 | ] 2720 | }, 2721 | { 2722 | "cell_type": "markdown", 2723 | "metadata": {}, 2724 | "source": [ 2725 | "We see that this classifier achieves a very high F score" 2726 | ] 2727 | }, 2728 | { 2729 | "cell_type": "markdown", 2730 | "metadata": {}, 2731 | "source": [ 2732 | "### 2nd step : Explaining predictions using lime" 2733 | ] 2734 | }, 2735 | { 2736 | "cell_type": "markdown", 2737 | "metadata": {}, 2738 | "source": [ 2739 | "Lime explainers assume that **classifiers act on raw text**, but **sklearn classifiers** act on **vectorized representation of texts**. For this purpose, we use sklearn's pipeline, and implements predict_proba on raw_text lists." 2740 | ] 2741 | }, 2742 | { 2743 | "cell_type": "code", 2744 | "execution_count": null, 2745 | "metadata": {}, 2746 | "outputs": [], 2747 | "source": [ 2748 | "from lime import lime_text\n", 2749 | "from sklearn.pipeline import make_pipeline\n", 2750 | "c = make_pipeline(vectorizer, rf)" 2751 | ] 2752 | }, 2753 | { 2754 | "cell_type": "code", 2755 | "execution_count": null, 2756 | "metadata": {}, 2757 | "outputs": [], 2758 | "source": [ 2759 | "print(c.predict_proba([newsgroups_test.data[0]]))" 2760 | ] 2761 | }, 2762 | { 2763 | "cell_type": "markdown", 2764 | "metadata": {}, 2765 | "source": [ 2766 | "Now we create an explainer object. We pass the class_names a an argument for prettier display." 2767 | ] 2768 | }, 2769 | { 2770 | "cell_type": "code", 2771 | "execution_count": null, 2772 | "metadata": {}, 2773 | "outputs": [], 2774 | "source": [ 2775 | "from lime.lime_text import LimeTextExplainer\n", 2776 | "import re\n", 2777 | "split_expression = lambda s: re.split(r'\\W+', s)\n", 2778 | "explainer = LimeTextExplainer(class_names=class_names, split_expression=split_expression)" 2779 | ] 2780 | }, 2781 | { 2782 | "cell_type": "markdown", 2783 | "metadata": {}, 2784 | "source": [ 2785 | "We then generate an explanation with at most 6 features for an arbitrary document in the test set." 2786 | ] 2787 | }, 2788 | { 2789 | "cell_type": "code", 2790 | "execution_count": null, 2791 | "metadata": {}, 2792 | "outputs": [], 2793 | "source": [ 2794 | "idx = 83\n", 2795 | "exp = explainer.explain_instance(newsgroups_test.data[idx], c.predict_proba, num_features=6)\n", 2796 | "print('Document id: %d' % idx)\n", 2797 | "print('Probability(christian) =', c.predict_proba([newsgroups_test.data[idx]])[0,1])\n", 2798 | "print('True class: %s' % class_names[newsgroups_test.target[idx]])" 2799 | ] 2800 | }, 2801 | { 2802 | "cell_type": "markdown", 2803 | "metadata": {}, 2804 | "source": [ 2805 | "The classifier got this example right (it predicted atheism)." 
2806 | ] 2807 | }, 2808 | { 2809 | "cell_type": "code", 2810 | "execution_count": null, 2811 | "metadata": {}, 2812 | "outputs": [], 2813 | "source": [ 2814 | "# The explanation is presented below as a list of weighted features\n", 2815 | "\n", 2816 | "exp.as_list()" 2817 | ] 2818 | }, 2819 | { 2820 | "cell_type": "markdown", 2821 | "metadata": {}, 2822 | "source": [ 2823 | "These weighted features are a linear model, which approximates the **behaviour of the random forest classifier in the vicinity of the test example**. Roughly, if we remove 'Posting' and 'Host' from the document , the prediction should move towards the opposite class (Christianity) by about 0.27 (the sum of the weights for both features). Let's see if this is the case." 2824 | ] 2825 | }, 2826 | { 2827 | "cell_type": "code", 2828 | "execution_count": null, 2829 | "metadata": {}, 2830 | "outputs": [], 2831 | "source": [ 2832 | "print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])\n", 2833 | "tmp = test_vectors[idx].copy()\n", 2834 | "tmp[0,vectorizer.vocabulary_['Posting']] = 0\n", 2835 | "tmp[0,vectorizer.vocabulary_['Host']] = 0\n", 2836 | "print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])\n", 2837 | "print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])" 2838 | ] 2839 | }, 2840 | { 2841 | "cell_type": "markdown", 2842 | "metadata": {}, 2843 | "source": [ 2844 | "Pretty close!\n", 2845 | "**The words that explain the model around this document seem very arbitrary** - not much to do with either Christianity or Atheism.\n", 2846 | "In fact, these are words that appear in the email headers (you will see this clearly soon), which **make distinguishing between the classes much easier.**" 2847 | ] 2848 | }, 2849 | { 2850 | "cell_type": "markdown", 2851 | "metadata": {}, 2852 | "source": [ 2853 | "### 3rd Step: Visualizing explanations" 2854 | ] 2855 | }, 2856 | { 2857 | "cell_type": "code", 2858 | "execution_count": null, 2859 | "metadata": {}, 2860 | "outputs": [], 2861 | "source": [ 2862 | "%matplotlib inline\n", 2863 | "fig = exp.as_pyplot_figure()" 2864 | ] 2865 | }, 2866 | { 2867 | "cell_type": "code", 2868 | "execution_count": null, 2869 | "metadata": {}, 2870 | "outputs": [], 2871 | "source": [ 2872 | "exp.show_in_notebook(text=False)\n", 2873 | "# exp.save_to_file('/tmp/oi.html')" 2874 | ] 2875 | }, 2876 | { 2877 | "cell_type": "code", 2878 | "execution_count": null, 2879 | "metadata": {}, 2880 | "outputs": [], 2881 | "source": [ 2882 | "# how the words that affect the classifier the most are all in the email header.\n", 2883 | "exp.show_in_notebook(text=True)" 2884 | ] 2885 | }, 2886 | { 2887 | "cell_type": "markdown", 2888 | "metadata": {}, 2889 | "source": [ 2890 | "# Clustering documents using similarity features" 2891 | ] 2892 | }, 2893 | { 2894 | "cell_type": "markdown", 2895 | "metadata": {}, 2896 | "source": [ 2897 | "https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/feature%20engineering%20text%20data/Feature%20Engineering%20Text%20Data%20-%20Traditional%20Strategies.ipynb" 2898 | ] 2899 | }, 2900 | { 2901 | "cell_type": "code", 2902 | "execution_count": null, 2903 | "metadata": {}, 2904 | "outputs": [], 2905 | "source": [] 2906 | }, 2907 | { 2908 | "cell_type": "code", 2909 | "execution_count": null, 2910 | "metadata": {}, 2911 | "outputs": [], 2912 | "source": [] 2913 | }, 2914 | { 2915 | "cell_type": "code", 2916 | "execution_count": null, 2917 | "metadata": {}, 2918 | "outputs": 
[], 2919 | "source": [] 2920 | } 2921 | ], 2922 | "metadata": { 2923 | "kernelspec": { 2924 | "display_name": "Python 3", 2925 | "language": "python", 2926 | "name": "python3" 2927 | }, 2928 | "language_info": { 2929 | "codemirror_mode": { 2930 | "name": "ipython", 2931 | "version": 3 2932 | }, 2933 | "file_extension": ".py", 2934 | "mimetype": "text/x-python", 2935 | "name": "python", 2936 | "nbconvert_exporter": "python", 2937 | "pygments_lexer": "ipython3", 2938 | "version": "3.7.0" 2939 | } 2940 | }, 2941 | "nbformat": 4, 2942 | "nbformat_minor": 2 2943 | } 2944 | -------------------------------------------------------------------------------- /pictures/LDA2VEC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/LDA2VEC.png -------------------------------------------------------------------------------- /pictures/characters_attention.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/characters_attention.gif -------------------------------------------------------------------------------- /pictures/explainability.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/explainability.gif -------------------------------------------------------------------------------- /pictures/generative_LDA.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/generative_LDA.gif -------------------------------------------------------------------------------- /pictures/pyldavis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/pyldavis.png -------------------------------------------------------------------------------- /pictures/tsne_lda.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/tsne_lda.png -------------------------------------------------------------------------------- /pictures/word_correlations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/word_correlations.png -------------------------------------------------------------------------------- /pictures/word_frequency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/word_frequency.png --------------------------------------------------------------------------------