├── .gitignore ├── Ch1_PyTorch_CBOW_embeddings.ipynb ├── Ch2_PyTorch_classifier.ipynb ├── Ch3_Elasticsearch_indexing.ipynb ├── README.md ├── crawlers └── config.py ├── helpers ├── __init__.py └── pickle_helpers.py ├── img └── espy.png └── model └── elastic_hashtag_model └── emonet /.gitignore: -------------------------------------------------------------------------------- 1 | */__pycache__ 2 | __pycache__/ 3 | .ipynb_checkpoints 4 | data/ 5 | -------------------------------------------------------------------------------- /Ch1_PyTorch_CBOW_embeddings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Training for CBOW Word Embeddings\n", 8 | "In this notebook we would like to employ the CBOW algorithm to output pretrained word embeddings, which will be used to perform text classification. The text classification will be performed on real-time tweets obtained from the Twitter Public Streaming API. Once we have classified the tweet, we can then further process them and index (store) in an Elasticsearch index (data storage). The Elasticsearch index will serve as a real-time, fast search engine, which will allows us to perform different kinds of analyses and visualizations on the classified and preprocessed tweets using Kibana." 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Importing Libraries\n", 16 | "The first thing is to import the libraries." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "0.4.1\n" 29 | ] 30 | } 31 | ], 32 | "source": [ 33 | "import helpers.pickle_helpers as ph\n", 34 | "import pandas as pd\n", 35 | "import re\n", 36 | "import collections\n", 37 | "import numpy as np\n", 38 | "import torch\n", 39 | "import torch.nn as nn\n", 40 | "from torch.autograd import Variable\n", 41 | "import torch.optim as optim\n", 42 | "import matplotlib.pyplot as plt\n", 43 | "%matplotlib inline\n", 44 | "import torch.nn.functional as F\n", 45 | "\n", 46 | "### checking PyTorch version\n", 47 | "print(torch.__version__)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### Data Loading\n", 55 | "The data has been preprocessed, cleaned, and formated as a pandas dataframe. Below you can see the code used for importing the dataframes. Dataframes allow us to perform basic statistics and transformations to our dataset. Although it is not mandatory to have our data in such format, I strongly recommend it for in case you like to perform some extra analysis on your dataset." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "### load the traing and test dataframes\n", 65 | "train_data = ph.load_from_pickle(directory=\"data/datasets/df_grained_tweet_tr.pkl\")\n", 66 | "test_data = ph.load_from_pickle(directory=\"data/datasets/df_grained_tweet_te_unbal.pkl\")\n", 67 | "\n", 68 | "### renaming the column names of the dataframe\n", 69 | "train_data.rename(index=str, columns={\"emo\":\"emotions\", \"sentence\": \"text\"}, inplace=True);\n", 70 | "test_data.rename(index=str, columns={\"emo\":\"emotions\", \"sentence\": \"text\"}, inplace=True);\n", 71 | "\n", 72 | "### remove hashtags to avoid any bias from them\n", 73 | "train_data.text = train_data.text.str.replace(\" \", \"\")\n", 74 | "test_data.text = test_data.text.str.replace(\" \", \"\")" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Emotion Distribution\n", 82 | "We can now explore the distribution of our dataset." 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 3, 88 | "metadata": {}, 89 | "outputs": [ 90 | { 91 | "data": { 92 | "text/plain": [ 93 | "" 94 | ] 95 | }, 96 | "execution_count": 3, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | }, 100 | { 101 | "data": { 102 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZIAAAD8CAYAAABdCyJkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAHYhJREFUeJzt3Xu4XVV57/Hvr4lcRCGBbPPEXBqq8QLYRrKFeMGDoBjQNrGlQh41ATlEDqC1R0+JpU/JAelBrbXlVPFAySFRJCCIpDQY0xTEWgMJEHPhYrYhHJKGJCZcVBAMvOeP+W4zs11r75091iYr+Ps8z3rWXO8cY8wx51xrvmuOOdfeigjMzMwG6nf2dgfMzGzf5kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZkT4TiaSxkm6XdL+ktZL+LOOHSloiaV0+D8+4JF0uqUvSKklH19qameXXSZpZi0+StDrrXC5JvS3DzMzah/r6ZbukUcCoiLhX0iuBe4BpwBnAjoi4TNJsYHhEXCDpFODjwCnAscA/RMSxkg4FVgCdQGQ7kyLicUl3A58A7gIWAZdHxG2SPt9oGb31d8SIETF+/PiBbQ0zs99S99xzz08jomMgdYf2VSAiNgObc/pnkh4ARgNTgeOz2DzgDuCCjM+PKkMtkzQsk9HxwJKI2AEgaQkwRdIdwMERsSzj86kS1W29LKOp8ePHs2LFij5X3MzMdpH0yEDr7tE1EknjgTdTnTmMzCQD8BgwMqdHA4/Wqm3MWG/xjQ3i9LIMMzNrE/1OJJJeAdwEfDIinqrPy7OPQf3rj70tQ9IsSSskrdi2bdtgdsPMzHroVyKR9DKqJHJtRHwrw1tyyKr7OsrWjG8Cxtaqj8lYb/ExDeK9LWM3EXFlRHRGRGdHx4CG+MzMbID6c9eWgKuBByLi72qzFgLdd17NBG6pxWfk3VuTgSdzeGoxcJKk4Xn31UnA4pz3lKTJuawZPdpqtAwzM2sTfV5sB94OfARYLWllxv4SuAy4QdJZwCPAB3PeIqo7trqAp4EzASJih6RLgOVZ7uLuC+/AucA1wIFUF9lvy3izZZiZWZvo8/bffU1nZ2f4ri0zsz0j6Z6I6BxIXf+y3czMijiRmJlZEScSMzMr0p+L7S9Z42f/S0vb23DZ+1ranpnZvsBnJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKxIn4lE0lxJWyWtqcWul7QyHxu6/5e7pPGSnqnN+2qtziRJqyV1SbpckjJ+qKQlktbl8/CMK8t1SVol6ejWr76ZmZXqzxnJNcCUeiAiTouIiRExEbgJ+FZt9k+650XEObX4FcDZwIR8dLc5G1gaEROApfka4ORa2VlZ38zM2kyfiSQi7gR2NJqXZxUfBK7rrQ1Jo4CDI2JZRAQwH5iWs6cC83J6Xo/4/KgsA4ZlO2Zm1kZKr5EcB2yJiHW12OGS7pP0PUnHZWw0sLFWZmPGAEZGxOacfgwYWavzaJM6ZmbWJkr/1e50dj8b2QyMi4jtkiYB35Z0ZH8bi4iQFHvaCUmzqIa/GDdu3J5WNzOzAgM+I5E0FPhj4PruWEQ8GxHbc/oe4CfA64BNwJha9TEZA9jSPWSVz1szvgkY26TObiLiyojojIjOjo6Oga6SmZkNQMnQ1ruBByPi10NWkjokDcnp36O6UL4+h66ekjQ5r6vMAG7JaguBmTk9s0d8Rt69NRl4sjYEZmZmbaI/t/9eB/wQeL2kjZLOylmn85sX2d8JrMrbgW8EzomI7gv15wL/BHRRnanclvHLgPdIWkeVnC7L+CJgfZa/KuubmVmb6fMaSURMbxI/o0HsJqrbgRuVXwEc1SC+HTixQTyA8/rqn5mZ7V3+ZbuZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyL9+Z/tcyVtlbSmFpsjaZOklfk4pTbvM5K6JD0k6b21+JSMdUmaXYsfLumujF8vab+M75+vu3L++FattJmZtU5/zkiuAaY0iH8pIibmYxGApCOA04Ejs85XJA2RNAT4MnAycAQwPcsCfC7bei3wOHBWxs8CHs/4l7KcmZm1mT4TSUTcCezoZ3tTgQUR8WxEPAx0Acfkoysi1kfEc8ACYKokAScAN2b9ecC0WlvzcvpG4MQsb2ZmbaTkGsn5klbl0NfwjI0GHq2V2ZixZvHDgCciYmeP+G5t5fwns7yZmbWRoQOsdwVwCRD5/EXgo63q1J6SNAuYBTBu3Li91Y3BMeeQQWjzyda3aWa/tQZ0RhIRWyLi+Yh4AbiKaugKYBMwtlZ0TMaaxbcDwyQN7RHfra2cf0iWb9SfKyOiMyI6Ozo6BrJKZmY2QANKJJJG1V5+AOi+o2shcHrecXU4MAG4G1gOTMg7tPajuiC/MCICuB04NevPBG6ptTUzp08F/i3Lm5lZG+lzaEvSdcDxwAhJG4GLgOMlTaQa2toAfAwgItZKugG4H9gJnBcRz2c75wOLgSHA3IhYm4u4AFgg6bPAfcDVGb8a+JqkLqqL/acXr62ZmbVcn4kkIqY3CF/dINZd/lLg0gbxRcCiBvH17Boaq8d/CfxpX/0zM7O9y79sNzOzIk4kZmZWxInEzMyKOJGYmVkRJxIzMyviRGJmZkWcSMzMrIgTiZmZFXEiMTOzIk4kZmZWxInEzMyKOJGYmVkRJxIzMyviRGJmZkWcSMzMrIgTiZmZFXEiMTOzIk4kZmZWpM9EImmupK2S1tRiX5D0oKRVkm6WNCzj4yU9I2llPr5aqzNJ0mpJXZIul6SMHyppiaR1+Tw848pyXbmco1u/+mZmVqo/ZyTXAFN6xJYAR0XE7wM/Bj5Tm/eTiJiYj3Nq8SuAs4EJ+ehuczawNCImAEvzNcDJtbKzsr6ZmbWZPhNJRNwJ7OgR+25E7MyXy4AxvbUhaRRwcEQsi4gA5gPTcvZUYF5Oz+sRnx+VZcCwbMfMzNpIK66RfBS4rfb6cEn3SfqepOMyNhrYWCuzMWMAIyNic04/Boys1Xm0SR0zM2sTQ0sqS7oQ2Alcm6HNwLiI2C5pEvBtSUf2t72ICEkxgH7Mohr+Yty4cXta3czMCgz4jETSGcD7gQ/lcBUR8WxEbM/pe4CfAK8DNrH78NeYjAFs6R6yyuetGd8EjG1SZzcRcWVEdEZEZ0dHx0BXyczMBmBAiUTSFOAvgD+KiKdr8Q5JQ3L696gulK/PoaunJE3Ou7VmALdktYXAzJye2SM+I+/emgw8WRsCMzOzNtHn0Jak64DjgRGSNgIXUd2ltT+wJO/iXZZ3aL0TuFjSr4AXgHMiovtC/blUd4AdSHVNpfu6ymXADZLOAh4BPpjxRcApQBfwNHBmyYqamdng6DORRMT0BuGrm5S9CbipybwVwFEN4tuBExvEAzivr/6Zmdne5V+2m5lZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr0q9EImmupK2S1tRih0paImldPg/PuCRdLqlL0ipJR9fqzMzy6yTNrMUnSVqddS5X/iP4ZsswM7P20d8zkmuAKT1is4GlETEBWJqvAU4GJuRjFnAFVEkBuAg4FjgGuKiWGK4Azq7Vm9LHMszMrE30K5FExJ3Ajh7hqcC8nJ4HTKvF50dlGTBM0ijgvcCSiNgREY8DS4ApOe/giFgWEQHM79FWo2WYmVmbKLlGMjIiNuf0Y8DInB4NPFortzFjvcU3Noj3tozdSJolaYWkFdu2bRvg6piZ2UC05GJ7nklEK9oayDIi4sqI6IyIzo6OjsHshpmZ9VCSSLbksBT5vDXjm4CxtXJjMtZbfEyDeG/LMDOzNlGSSBYC3XdezQRuqcVn5N1bk4Enc3hqMXCSpOF5kf0kYHHOe0rS5Lxba0aPthotw8zM2sTQ/hSSdB1wPDBC0kaqu68uA26QdBbwCPDBLL4IOAXoAp4GzgSIiB2SLgGWZ7mLI6L7Av65VHeGHQjclg96WYaZmbWJfiWSiJjeZNaJDcoGcF6TduYCcxvEVwBHNYhvb7QMMzNrH/5lu5mZFXEiMTOzIk4kZmZWpF/XSMz68qZ5b2p5m6tnrm55mw+84Y0tb/ONDz7Q8jbN9iU+IzEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFBpxIJL1e0sra4ylJn5Q0R9KmWvyUWp3PSOqS9JCk99biUzLWJWl2LX64pLsyfr2k/Qa+qmZmNhgGnEgi4qGImBgRE4FJwNPAzTn7S93zImIRgKQjgNOBI4EpwFckDZE0BPgycDJwBDA9ywJ8Ltt6LfA4cNZA+2tmZoOjVUNbJwI/iYhHeikzFVgQEc9GxMNAF3BMProiYn1EPAcsAKZKEnACcGPWnwdMa1F/zcysRVqVSE4Hrqu9Pl/SKklzJQ3P2Gjg0VqZjRlrFj8MeCIidvaI/wZJsyStkLRi27Zt5WtjZmb9VpxI8rrFHwHfzNAVwGuAicBm4Iuly+hLRFwZEZ0R0dnR0THYizMzs5pW/M/2k4F7I2ILQPczgKSrgFvz5SZgbK3emIzRJL4dGCZpaJ6V1MubmVmbaMXQ1nRqw1qSRtXmfQBYk9MLgdMl7S/pcGACcDewHJiQd2jtRzVMtjAiArgdODXrzwRuaUF/zcyshYrOSCQdBLwH+Fgt/HlJE4EANnTPi4i1km4A7gd2AudFxPPZzvnAYmAIMDci1mZbFwALJH0WuA+4uqS/ZmbWekWJJCJ+QXVRvB77SC/lLwUubRBfBCxqEF9PdVeXmZm1Kf+y3czMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMihQnEkkbJK2WtFLSiowdKmmJpHX5PDzjknS5pC5JqyQdXWtnZpZfJ2lmLT4p2+/Kuirts5mZtU6rzkjeFRETI6IzX88GlkbEBGBpvgY4GZiQj1nAFVAlHuAi4Fiqf617UXfyyTJn1+pNaVGfzcysBQZraGsqMC+n5wHTavH5UVkGDJM0CngvsCQidkTE48ASYErOOzgilkVEAPNrbZmZWRtoRSIJ4LuS7pE0K2MjI2JzTj8GjMzp0cCjtbobM9ZbfGODuJmZtYmhLWjjHRGxSdKrgCWSHqzPjIiQFC1YTlOZwGYBjBs3bjAXZWZmPRSfkUTEpnzeCtxMdY1jSw5Lkc9bs/gmYGyt+piM9RYf0yDesw9XRkRnRHR2dHSUrpKZme2BokQi6SBJr+yeBk4C1gALge47r2YCt+T0QmBG3r01GXgyh8AWAydJGp4X2U8CFue8pyRNzru1ZtTaMjOzNlA6tDUSuDnvyB0KfCMiviNpOXCDpLOAR4APZvlFwClAF/A0cCZAROyQdAmwPMtdHBE7cvpc4BrgQOC2fJiZWZsoSiQRsR74gwbx7cCJDeIBnNekrbnA3AbxFcBRJf00M7PB41+2m5lZEScSMzMr4kRiZmZFnEjMzKxIK36QaGYt9uVz/q3lbZ731RNa3qYZ+IzEzMwKOZGYmVkRJxIzMyviRGJmZkWcSMzMrIgTiZmZFXEiMTOzIk4kZmZWxInEzMyKOJGYmVkRJxIzMyviRGJmZkWcSMzMrMiAE4mksZJul3S/pLWS/izjcyRtkrQyH6fU6nxGUpekhyS9txafkrEuSbNr8cMl3ZXx6yXtN9D+mpnZ4Cg5I9kJfCoijgAmA+dJOiLnfSkiJuZjEUDOOx04EpgCfEXSEElDgC8DJwNHANNr7Xwu23ot8DhwVkF/zcxsEAw4kUTE5oi4N6d/BjwAjO6lylRgQUQ8GxEPA13AMfnoioj1EfEcsACYKknACcCNWX8eMG2g/TUzs8HRkmskksYDbwbuytD5klZJmitpeMZGA4/Wqm3MWLP4YcATEbGzR9zMzNpIcSKR9ArgJuCTEfEUcAXwGmAisBn4Yuky+tGHWZJWSFqxbdu2wV6cmZnVFCUSSS+jSiLXRsS3ACJiS0Q8HxEvAFdRDV0BbALG1qqPyViz+HZgmKShPeK/ISKujIjOiOjs6OgoWSUzM9tDJXdtCbgaeCAi/q4WH1Ur9gFgTU4vBE6XtL+kw4EJwN3AcmBC3qG1H9UF+YUREcDtwKlZfyZwy0D7a2Zmg2No30WaejvwEWC1pJUZ+0uqu64mAgFsAD4GEBFrJd0A3E91x9d5EfE8gKTzgcXAEGBuRKzN9i4AFkj6LHAfVeIyM7M2MuBEEhH/DqjBrEW91LkUuLRBfFGjehGxnl1DY2Zm1ob8y3YzMyviRGJmZkWcSMzMrEjJxXYz+y33xdPe3/I2P3X9rS1v0waXz0jMzKyIE4mZmRVxIjEzsyJOJGZmVsSJxMzMijiRmJlZEScSMzMr4kRiZmZFnEjMzKyIf9luZi95G2d/v+VtjrnsuJa3ua/yGYmZmRVxIjEzsyJOJGZmVsSJxMzMirT9xXZJU4B/oPp/7v8UEZft5S6ZmbXcnDlz9ok2G2nrMxJJQ4AvAycDRwDTJR2xd3tlZmZ1bZ1IgGOArohYHxHPAQuAqXu5T2ZmVtPuiWQ08Gjt9caMmZlZm1BE7O0+NCXpVGBKRPzXfP0R4NiIOL9HuVnArHz5euChFndlBPDTFrc5GNzP1toX+rkv9BHcz1YbjH7+bkR0DKRiu19s3wSMrb0ek7HdRMSVwJWD1QlJKyKic7DabxX3s7X2hX7uC30E97PV2q2f7T60tRyYIOlwSfsBpwML93KfzMyspq3PSCJip6TzgcVUt//OjYi1e7lbZmZW09aJBCAiFgGL9nI3Bm3YrMXcz9baF/q5L/QR3M9Wa6t+tvXFdjMza3/tfo3EzMzanBNJjaTxktbs7X70RdJ/7O0+NCPpE5IekHTt3u7LbztJiyQNG8T2h0k6t0VtHS/pba1oK9ubI+nTki6W9O5WtdvL8qb1969u9CzbVx8ldUq6fID92m0fSXq1pBsH0lavy/HQ1i6SxgO3RsRRe7kr+yxJDwLvjoiNBW0MjYidLezWi0aSqD5XLwxC2/3aLoPZhx7LGU+Dz8tA9p+kOcDPI+JvW9S3lrbXj+VdQ7Ut+jxI70nZFvRrPC/GMS0iXnIP4CDgX4AfAWuA04C/prqdeA3VharuJDopy/0I+AKwJuNnAN8CvgOsAz5fa/8k4IfAvcA3gVdk/DLgfmAV8LcZ+9Nc5o+AO1u0fj8H1N1fYDVwWs6bD0yrlb0WmPoibfevAs9lfy4E5gJ3A/d19wEYD3w/t929wNsyfnzGFwI/HoS+fRu4B1gLzKptx0tz3ywDRmb8Nfl6NfBZqgNSdzv/I99Hq4D/WVunh3Lbr6X6Ydeevj83ACNyfidwR07PAb4G/AC4Lt+XtwB35PvyomZ96G6z0fJq7/3v5XZZDIzaw226AHgGWJnb5Nf7L/uzplb208CcnP4Euz4nC7LsY1S/EVsJHDfAfXxhLvvfc1t9GrgGOLWXz2fDfZ3vx1trbf8jcEajdoC3Ub3vn83tcWGz91eW3QE8nOv6mh59fAvwH1nnbuCV9b7U3g8/zP1/dsZfASyl+kytZtfnrb6PvlDfL8ABwP/N8vcB74o+jn1Nt/2LcYB5sR/AnwBX1V4fAhxae/014A9zehXwzpzumUjWZ90DgEeofhw5ArgTOCjLXUCVpA6j+iB3J6hh+bwaGF2PtWD9fp7ruITqtuiRwP8DRgH/Bfh2bb0fBoa+iNt+Q26jvwE+3L3eVB/wg4CXAwdkfAKwIqePB34BHD5I/To0nw+kOpgeBkTtffB54K9y+lZgek6fw66Dy0nklxCqYeFbgXfmh/MFYHLB+3MDzRPJPcCBtffl5ux/97p0NupDbV80Wt7LqA5YHRk7jer2+j3ZpuPZ9XnZbf/ReyL5T2D/Hp+TOcCnC/bvJKrP2suBg4EuaomE5p/PZvv6eBokkl7a+UYupz/vr2vIxFF/DexHdcx5S8YPprqz9td9ye30o1zOCKo/IfXqLHdwlhmR668G+6G+zz7Vvc+BN1AdQw6gybGvt+3/Ur1Gshp4j6TPSTouIp4E3iXpLkmrgROAI3P8eFhE3Jn1vtajnaUR8WRE/JLqG8jvApOp/hLxDyStBGZm/Engl8DVkv4YeDrb+AFwjaSzqQ76rfIO4LqIeD4itlB9s3xLRHyP6kecHcB04KbYO8NEJwGzcxvdQfWGHEd1ALsq98M3qbZlt7sj4uFB6s8nJHV/MxxLlcSeozqQQHWwHp/Tb82+QXWA6HZSPu6j+ub3hmwH4JGIWNbPvjR6f/ZmYUQ8U3u9JCK2Z+xbVO+F3vrQaHmvB44CluQ++iuqvxxRor/7bxVwraQPA616bx4H3BwRT0fEU/zmD5ebfT6b7etmmrVzBNXZSX/eX828HtgcEcsBIuKpJp/dWyLimYj4KXA71R+3FfA3klYB/0r1NwlH9rG8dwBfz2U9SJUwXpfzGh37mmr735EMRET8WNLRwCnAZyUtBc4DOiPi0Rw/PaAfTT1bm36eanuJ6oM8vWdhSccAJ1J9uzgfOCEizpF0LPA+4B5JkyJie8Hq9cd84MNUfwngzEFeVjMC/iQidvu7Z7nttwB/QPWt/pe12b8YlI5IxwPvBt4aEU9LuoNq//8q8usYu/Zvr00B/ysi/k+P9sezB31v8v7cya6bX3q+N3u23fPCZjQp19vybgbWRsRb+9vvfqgvv74+sPs6vY/qTO4PgQslvamFfWgoqh83/8bns5cqDfvfqB1JF1ONBvx5RHyj4P3V79Vp8PpDQAcwKSJ+JWkD/TvGNdPo2NfUS/KMRNKrgacj4utUw1VH56yfSnoF1RuAiHgCeEJS9ze6D/Wj+WXA2yW9Npd1kKTXZbuHRPUDyj+nOlAi6TURcVdE/DWwjd3/dliJ7wOnSRqSZx/vpBpThepU+ZMAEXF/i5a3pxYDH88Lv0h6c8YPofrW9QLwEVp7ltbMIcDjmUTeQHVW2ZtlVMNBUCXjbouBj+a+RtJoSa/a0840eX9uoBqeobbsZt4j6VBJBwLTqM5693R5DwEdkt6aZV4m6cg9XJWfUY3hN7IFeJWkwyTtD7w/l/M7VMMkt1MNCx9CNb7fW1v9cScwTdKBkl5JlaR+rdnnk+b7+hHgCEn758jFib20cwjVmcn+/Xx/NVvXh4BRkt6Sy3qlpEYH8KmSDpB0GNWw1/Lsw9ZMIu9i1xlEb9v1++QxT9LrqEYMBvQHb1+SZyTAm4AvSHoB+BXw36g+cGuoLuotr5U9E5grKYDv9tVwRGyTdAZwXX5AoBoW+Blwi6QDqL65/vec9wVJEzK2lGp8s1RQfaN8a7YXwF9ExGPZxy2SHqC6wLy3XAL8PbAqDx4PUx1MvgLcJGkG1cW8QTkL6eE7wDm5TR6iOnj05pPA1yVdmHWfBIiI70p6I/DDzI8/pzrze34P+9Po/Xkg1XDJJVRDgb25G7iJaijq6xGxIs+K+r28iHhO1V/XvlzSIVTHgr+nulDfLxGxXdIPVN0y/wxV8uie96v8pn431UX0B3PWEKptewjVZ+LyiHhC0j8DN0qaCnw8Ir7f337k8u6VdD3V52Eru3/GoTqYNvp8NtvXj0q6geqY8TDVcGazdr5DdVz5KrsubPdmAdXw7ifIL7W5zOcknQb87/yS8AzVmXRPq6iGtEYAl0TEf6q63f6fc8h4Bbm9e+yj26j+UWC3rwBXZJ2dVDcTPJvv7T3i23/3Mfkt5N6IaDpmKenlVOPiR/dj/N16yO33TESEpNOpLsa2xT9Uyy8xndHjXynYwLTzvm5EL/Jtzf31Uj0jeUnKIYo7qC7qNSvzbuBq4EtOIgM2CfjHHJZ7AvjoXu6PDR7v6xbwGYmZmRV5SV5sNzOzF48TiZmZFXEiMTOzIk4kZmZWxInEzMyKOJGYmVmR/w/kqp9ccYoQ+wAAAABJRU5ErkJggg==\n", 103 | "text/plain": [ 104 | "
" 105 | ] 106 | }, 107 | "metadata": { 108 | "needs_background": "light" 109 | }, 110 | "output_type": "display_data" 111 | } 112 | ], 113 | "source": [ 114 | "train_data.emotions.value_counts().plot(kind=\"bar\", rot=0)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### Processing and Creating Vocabulary\n", 122 | "Before training and constructing the word embeddings we need to process the data and create a vocabulary. The functions below achieve these two things." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 4, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "def clearstring(string):\n", 132 | " \"\"\" clean tweets \"\"\"\n", 133 | " string = re.sub('[^\\'\\\"A-Za-z0-9 ]+', '', string)\n", 134 | " string = string.split(' ')\n", 135 | " string = filter(None, string)\n", 136 | " string = [y.strip() for y in string]\n", 137 | " string = [y for y in string if len(y) > 3 and y.find('nbsp') < 0]\n", 138 | " return ' '.join(string)\n", 139 | "\n", 140 | "def read_data():\n", 141 | " \"\"\" generate vocabulary \"\"\"\n", 142 | " vocab = []\n", 143 | " text = train_data.text.values.tolist()\n", 144 | " for t in text:\n", 145 | " strings = clearstring(t)\n", 146 | " vocab+=strings.split()\n", 147 | " return vocab" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 5, 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "name": "stdout", 157 | "output_type": "stream", 158 | "text": [ 159 | "example 10 words: ['love', 'with', 'midi', 'skirt', 'woooo', 'usermention', 'holme', 'upon', 'spalding', 'moor']\n", 160 | "size corpus: 3698224\n", 161 | "size of unique words: 92095\n" 162 | ] 163 | } 164 | ], 165 | "source": [ 166 | "### build the vocabulary\n", 167 | "vocabulary = read_data()\n", 168 | "print(\"example 10 words:\", vocabulary[:10])\n", 169 | "print('size corpus:',len(vocabulary))\n", 170 | "vocabulary_size = len(list(set(vocabulary)))\n", 171 | "print('size of unique words:',vocabulary_size)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 6, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/plain": [ 182 | "['love',\n", 183 | " 'with',\n", 184 | " 'midi',\n", 185 | " 'skirt',\n", 186 | " 'woooo',\n", 187 | " 'usermention',\n", 188 | " 'holme',\n", 189 | " 'upon',\n", 190 | " 'spalding',\n", 191 | " 'moor']" 192 | ] 193 | }, 194 | "execution_count": 6, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "### print the vocabulary\n", 201 | "vocabulary[:10]" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "### Build The Dataset\n", 209 | "Now that we have the vocabulary, we would like to format our dataset into sequence of tokens represented by the index of those tokens in the vocabulary. Look at the code below to see how this can be achieved. In particular note the example of how to convert the data to sequence of indexed tokens. The reason the index is necessary is because this is needed by the model in order to identify words by their ids instead of their actual raw representation (i.e., letters)." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 7, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "def build_dataset(words, n_words):\n", 219 | " count = [['UNK', -1]]\n", 220 | " count.extend(collections.Counter(words).most_common(n_words - 1))\n", 221 | " dictionary = dict()\n", 222 | " for word, _ in count:\n", 223 | " dictionary[word] = len(dictionary) # increase index as words added\n", 224 | " data = list()\n", 225 | " unk_count = 0\n", 226 | " for word in words:\n", 227 | " if word in dictionary:\n", 228 | " index = dictionary[word]\n", 229 | " else:\n", 230 | " index = 0 # dictionary['UNK']\n", 231 | " unk_count += 1\n", 232 | " data.append(index)\n", 233 | " count[0][1] = unk_count\n", 234 | " reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n", 235 | " return data, count, dictionary, reverse_dictionary" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 8, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)\n", 245 | "del vocabulary" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 9, 251 | "metadata": {}, 252 | "outputs": [ 253 | { 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | "Most common words (+UNK) [['UNK', 1], ('usermention', 147536), ('this', 52001), ('just', 51505), ('that', 50171)]\n", 258 | "Sample data [34, 5, 16555, 3984, 2340, 1, 30165, 2511, 30166, 24865, 1075, 1815, 161, 97, 117] ['love', 'with', 'midi', 'skirt', 'woooo', 'usermention', 'holme', 'upon', 'spalding', 'moor', 'rock', 'challenge', 'please', 'well', 'done']\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "print('Most common words (+UNK)', count[:5])\n", 264 | "print('Sample data', data[:15], [reverse_dictionary[i] for i in data[:15]])" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 10, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "3698224" 276 | ] 277 | }, 278 | "execution_count": 10, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "len(data)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "### Generating Batches\n", 292 | "For training the CBOW model with the input sequences of tokens, we need to generate batches and feed those to the model. The batches will be of a particular size, in this case 8, and also context window, which determines the final size of the input sequence. For instance, if the context window size is 1, then we have a target and two context words, which means that the sequence size results in three tokens. See the code below for how to achieve this." 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 11, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "data_index = 0\n", 302 | "\n", 303 | "def generate_batch(batch_size, context_window):\n", 304 | " # all context tokens should be used, hence no associated num_skips argument\n", 305 | " global data_index\n", 306 | " context_size = 2 * context_window\n", 307 | " batch = np.ndarray(shape=(batch_size, context_size), dtype=np.int32)\n", 308 | " labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)\n", 309 | " span = 2 * context_window + 1 # [ context_window target context_window ]\n", 310 | " buffer = collections.deque(maxlen=span)\n", 311 | " for _ in range(span):\n", 312 | " buffer.append(data[data_index])\n", 313 | " data_index = (data_index + 1) % len(data)\n", 314 | " for i in range(batch_size):\n", 315 | " # context tokens are just all the tokens in buffer except the target\n", 316 | " batch[i, :] = [token for idx, token in enumerate(buffer) if idx != context_window]\n", 317 | " labels[i, 0] = buffer[context_window]\n", 318 | " buffer.append(data[data_index])\n", 319 | " data_index = (data_index + 1) % len(data)\n", 320 | " data_index-=1\n", 321 | " return batch, labels" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 12, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "batch, labels = generate_batch(batch_size=8, context_window=1)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 13, 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "data": { 340 | "text/plain": [ 341 | "[('love', 34),\n", 342 | " ('with', 5),\n", 343 | " ('midi', 16555),\n", 344 | " ('skirt', 3984),\n", 345 | " ('woooo', 2340),\n", 346 | " ('usermention', 1),\n", 347 | " ('holme', 30165),\n", 348 | " ('upon', 2511),\n", 349 | " ('spalding', 30166),\n", 350 | " ('moor', 24865)]" 351 | ] 352 | }, 353 | "execution_count": 13, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "[(reverse_dictionary[i],i) for i in data[:10]]" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 14, 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "data": { 369 | "text/plain": [ 370 | "[34, 5, 16555, 3984, 2340, 1, 30165, 2511, 30166, 24865]" 371 | ] 372 | }, 373 | "execution_count": 14, 374 | "metadata": {}, 375 | "output_type": "execute_result" 376 | } 377 | ], 378 | "source": [ 379 | "data[:10]" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 16, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "name": "stdout", 389 | "output_type": "stream", 390 | "text": [ 391 | "34 love 16555 midi -> 5 with\n", 392 | "5 with 3984 skirt -> 16555 midi\n", 393 | "16555 midi 2340 woooo -> 3984 skirt\n", 394 | "3984 skirt 1 usermention -> 2340 woooo\n", 395 | "2340 woooo 30165 holme -> 1 usermention\n", 396 | "1 usermention 2511 upon -> 30165 holme\n", 397 | "30165 holme 30166 spalding -> 2511 upon\n", 398 | "2511 upon 24865 moor -> 30166 spalding\n" 399 | ] 400 | } 401 | ], 402 | "source": [ 403 | "### a batch sample\n", 404 | "for i in range(8):\n", 405 | " print(batch[i, 0], reverse_dictionary[batch[i, 0]],\n", 406 | " batch[i, 1], reverse_dictionary[batch[i, 1]],\n", 407 | " '->', labels[i, 0], reverse_dictionary[labels[i, 0]])" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "### Model\n", 415 | "Now that we have a function to generate batches of sequences, which we can feed to the mode, it's time to actually build the model. The model we will use here is a one layer feed forward neural network, which train with the goal to learn parameters that will represent the values of the word embeddings. See code below for the model we will use to train the embeddings. It is assumed that you are familiar with PyTorch to understand the code below. " 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 18, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "class EmoCBOW(nn.Module):\n", 425 | " def __init__(self, vocab_size, embedding_dim, context_size):\n", 426 | " super(EmoCBOW, self).__init__()\n", 427 | " self.context_size = context_size\n", 428 | " self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n", 429 | " self.linear = nn.Linear(embedding_dim, vocab_size)\n", 430 | " \n", 431 | " def forward(self, inputs):\n", 432 | " # batch_size X context X embedding_dim\n", 433 | " embeds = self.embeddings(inputs)\n", 434 | " average_embeds = torch.mean(embeds, dim=1)\n", 435 | " out = self.linear(average_embeds) \n", 436 | " log_probs = F.log_softmax(out, dim=1)\n", 437 | " return log_probs\n", 438 | " \n", 439 | " def get_word_emdedding(self, word):\n", 440 | " word = Variable(torch.LongTensor([dictionary[word]]).cuda())\n", 441 | " return self.embeddings(word).view(1,-1)" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "### Pretesting Model\n", 449 | "We would like to test the model with a sample batch before we do the actual training of the word embeddings. Below is sample code on how to achieve this. " 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 19, 455 | "metadata": {}, 456 | "outputs": [ 457 | { 458 | "name": "stdout", 459 | "output_type": "stream", 460 | "text": [ 461 | "torch.Size([8, 92095])\n" 462 | ] 463 | } 464 | ], 465 | "source": [ 466 | "### test with one batch\n", 467 | "context_batch, target_batch = generate_batch(batch_size=8, context_window=1)\n", 468 | "\n", 469 | "context_var = Variable(torch.LongTensor(context_batch))\n", 470 | "### print(context_var)\n", 471 | "dummy_model = EmoCBOW(vocab_size=vocabulary_size, embedding_dim=10, context_size=1)\n", 472 | "log_probs = dummy_model(context_var)\n", 473 | "print(log_probs.size())\n", 474 | "\n", 475 | "data_index = 0 # reset index" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "### Training\n", 483 | "Now it's time to train the model. Again, I assume you are familiar with the particular bit you need to perform the training with PyTorch. While the model is training you see an output loss which indicates whether the model is learning. We train for a lot of steps, so be patient and if everything went right, you should see the loss decreasing and the model converging, which means that the word embeddings are getting better. I have stopped in the process below, but you can let it train for a couple of steps, or even let it run for the full number of steps I defined. The time of training will vary but usually it takes a while for the model to train the embeddings. " 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": 22, 489 | "metadata": {}, 490 | "outputs": [ 491 | { 492 | "name": "stdout", 493 | "output_type": "stream", 494 | "text": [ 495 | "Average loss at step 0 : 11.477616\n", 496 | "Average loss at step 2000 : 9.01468\n" 497 | ] 498 | }, 499 | { 500 | "ename": "KeyboardInterrupt", 501 | "evalue": "", 502 | "output_type": "error", 503 | "traceback": [ 504 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 505 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 506 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mstep\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnum_steps\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mbatch_inputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch_labels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgenerate_batch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mBATCH_SIZE\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mCONTEXT_WINDOW\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m \u001b[0mcontext_var\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mVariable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mLongTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcuda\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 16\u001b[0m \u001b[0mtargets\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mVariable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msqueeze\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mLongTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_labels\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcuda\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 17\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzero_grad\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 507 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "BATCH_SIZE = 128\n", 513 | "EMBEDDING_DIM = 128\n", 514 | "CONTEXT_WINDOW = 2\n", 515 | "num_steps = 100000\n", 516 | "plot_every = 2000\n", 517 | "\n", 518 | "losses = []\n", 519 | "model = EmoCBOW(vocabulary_size, EMBEDDING_DIM, CONTEXT_WINDOW)\n", 520 | "model.cuda()\n", 521 | "optimizer = optim.SGD(model.parameters(), lr=1.0)\n", 522 | "\n", 523 | "average_loss = 0\n", 524 | "for step in range(num_steps):\n", 525 | " batch_inputs, batch_labels = generate_batch(BATCH_SIZE, CONTEXT_WINDOW)\n", 526 | " context_var = Variable(torch.LongTensor(batch_inputs).cuda())\n", 527 | " targets = Variable(torch.squeeze(torch.LongTensor(batch_labels).cuda()))\n", 528 | " model.zero_grad()\n", 529 | " \n", 530 | " log_probs = model(context_var) # inputs are context vectors\n", 531 | " loss = F.nll_loss(log_probs, targets) # criterion\n", 532 | " loss.backward()\n", 533 | " optimizer.step()\n", 534 | " average_loss += loss.data\n", 535 | " \n", 536 | " if step % plot_every == 0:\n", 537 | " if step > 0:\n", 538 | " average_loss /= plot_every\n", 539 | " print(\"Average loss at step\", step, \": \", average_loss.cpu().numpy())\n", 540 | " average_loss = 0" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### Store the Embeddings\n", 548 | "After you have trained the embeddings, you know want to store them and reuse them for some downstream task like sentiment classification. Essentially, you will be using the embeddings as input features for a model to conduct some task. The format you store the embeddings doesn't really matter, for as long as you can retrieve them easily and efficiently in the future. I am storing the embeddings in numpy in this notebook and further converting the matrix into pickle file. Notice that we are also storing the vocabulary since we will need it again when training our text classifier." 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 20, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "embeddings = model.embeddings.weight.data.cpu().numpy()" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 21, 563 | "metadata": {}, 564 | "outputs": [], 565 | "source": [ 566 | "ph.convert_to_pickle(directory=\"data/hashtags_word_embeddings/es_py_cbow_embeddings.p\", item=embeddings)\n", 567 | "ph.convert_to_pickle(directory=\"data/hashtags_word_embeddings/es_py_cbow_dictionary.p\", item=dictionary)" 568 | ] 569 | } 570 | ], 571 | "metadata": { 572 | "kernelspec": { 573 | "display_name": "Python 3", 574 | "language": "python", 575 | "name": "python3" 576 | }, 577 | "language_info": { 578 | "codemirror_mode": { 579 | "name": "ipython", 580 | "version": 3 581 | }, 582 | "file_extension": ".py", 583 | "mimetype": "text/x-python", 584 | "name": "python", 585 | "nbconvert_exporter": "python", 586 | "pygments_lexer": "ipython3", 587 | "version": "3.6.0" 588 | } 589 | }, 590 | "nbformat": 4, 591 | "nbformat_minor": 2 592 | } 593 | -------------------------------------------------------------------------------- /Ch2_PyTorch_classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Emotion Classification\n", 8 | "This is the second chapter of this series. In this notebook we aim to train a classifier based on the word embeddings we pretrained in the previous notebook. Then we will store the classifier to conduct emotional analysis on real-time tweets, which will be covered in the next chapter." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 14, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "import pandas as pd\n", 18 | "import numpy as np\n", 19 | "from sklearn import preprocessing, metrics, decomposition, pipeline, dummy\n", 20 | "import torch\n", 21 | "import torch.nn.functional as F\n", 22 | "import torch.nn as nn\n", 23 | "import os\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "%matplotlib inline\n", 26 | "import helpers.pickle_helpers as ph\n", 27 | "import time\n", 28 | "import math\n", 29 | "from sklearn.cross_validation import train_test_split\n", 30 | "import re" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Parameters\n", 38 | "First, let's declare our hyperparameters." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "EMBEDDING_DIM = 128\n", 48 | "HIDDEN_SIZE = 256\n", 49 | "KEEP_PROB = 0.8\n", 50 | "BATCH_SIZE = 128\n", 51 | "NUM_EPOCHS = 50 \n", 52 | "DELTA = 0.5\n", 53 | "NUM_LAYERS = 3\n", 54 | "LEARNING_RATE = 0.001" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "### Data Preparation\n", 62 | "Let's import and prepare the data as was done in the previous chapter." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 4, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "train_data = ph.load_from_pickle(directory=\"data/datasets/df_grained_tweet_tr.pkl\")\n", 72 | "test_data = ph.load_from_pickle(directory=\"data/datasets/df_grained_tweet_te_unbal.pkl\")\n", 73 | "\n", 74 | "train_data.rename(index=str, columns={\"emo\":\"emotions\", \"sentence\": \"text\"}, inplace=True);\n", 75 | "test_data.rename(index=str, columns={\"emo\":\"emotions\", \"sentence\": \"text\"}, inplace=True);\n", 76 | "\n", 77 | "train_data.text = train_data.text.str.replace(\" \", \"\")\n", 78 | "test_data.text = test_data.text.str.replace(\" \", \"\")" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/plain": [ 89 | "597192" 90 | ] 91 | }, 92 | "execution_count": 5, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "len(train_data)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "def clearstring(string):\n", 108 | " string = re.sub('[^\\'\\\"A-Za-z0-9 ]+', '', string)\n", 109 | " string = string.split(' ')\n", 110 | " string = filter(None, string)\n", 111 | " string = [y.strip() for y in string]\n", 112 | " string = [y for y in string if len(y) > 3 and y.find('nbsp') < 0]\n", 113 | " return ' '.join(string)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 7, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "train_data.text = train_data.text.apply(lambda d: clearstring(d))\n", 123 | "test_data.text = test_data.text.apply(lambda d: clearstring(d))" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### Obtain Word Embeddings\n", 131 | "Here is the code for importing the word embeddings we pretrained in the previous chapter. Notice that we are also importing the vocabulary. See below how handy the vocabulary is to inspect our word embeddings." 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "### load word embeddings and accompanying vocabulary\n", 141 | "wv = ph.load_from_pickle(\"data/hashtags_word_embeddings/es_py_cbow_embeddings.p\")\n", 142 | "vocab = ph.load_from_pickle(\"data/hashtags_word_embeddings/es_py_cbow_dictionary.p\")" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 9, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/plain": [ 153 | "array([ 0.5389905 , -0.8261296 , -1.8023891 , -0.8072674 , -0.6313184 ,\n", 154 | " -1.3096205 , 1.6170695 , 1.8171018 , 0.05804818, 1.5923933 ,\n", 155 | " 1.2208248 , -0.08000907, 1.4284078 , 0.5594934 , 0.8742701 ,\n", 156 | " 0.04409672, -0.51616585, -0.26882973, 0.2614767 , 1.7617252 ,\n", 157 | " -0.7654648 , -0.1121751 , 0.6021578 , -2.7278464 , -1.5101068 ,\n", 158 | " 1.9514263 , 0.9859432 , -2.0553567 , 0.52864003, -1.5633332 ,\n", 159 | " -2.329722 , 0.33874342, 0.9558916 , 0.9637566 , 0.72352 ,\n", 160 | " -0.60107934, 1.2980587 , 1.3291203 , 0.08595378, -0.96753865,\n", 161 | " -0.47979838, -1.4262284 , 0.80548376, 0.94358546, -0.85197926,\n", 162 | " -1.5562207 , -0.28793994, -0.21579984, -0.6607775 , -0.21598966,\n", 163 | " 1.6049399 , -0.343651 , -0.0540315 , -2.1718023 , -0.98242474,\n", 164 | " -1.6945462 , -1.3239328 , 1.6394376 , -1.1029811 , 0.42646387,\n", 165 | " -1.0574629 , -0.4617092 , -1.0275363 , 1.7248987 , -0.05921336,\n", 166 | " 0.9992472 , 0.7281742 , 1.0187635 , 1.8406339 , -2.0048149 ,\n", 167 | " 2.6621861 , 0.80933565, 0.65741915, -0.1611871 , 0.72472906,\n", 168 | " 1.483416 , -0.800681 , -0.6170338 , 0.9091752 , 0.35176483,\n", 169 | " -1.4197102 , 0.73179495, 1.2767175 , -0.74212426, -0.52197933,\n", 170 | " -1.8342316 , -0.8961808 , 1.1606023 , -2.0411768 , -1.3687735 ,\n", 171 | " -2.1972206 , 0.16410299, -0.6888266 , -1.58254 , 0.4490404 ,\n", 172 | " 2.5568488 , 0.9290964 , -0.9500061 , -0.25545642, -0.19501002,\n", 173 | " -0.9169069 , 1.7392551 , 0.8232341 , 0.93090016, -1.2818229 ,\n", 174 | " -0.18206023, 0.5242739 , 0.6704099 , 1.7621306 , -1.0661116 ,\n", 175 | " 0.850737 , 0.02583216, -1.7723794 , 1.7245288 , 2.6550083 ,\n", 176 | " -0.89468575, -1.0228765 , 1.4283719 , 1.4670846 , -1.108844 ,\n", 177 | " 1.4647324 , 1.5425268 , 0.56075734, 2.5217469 , -0.37765834,\n", 178 | " 0.3549512 , 1.5589453 , 0.5800641 ], dtype=float32)" 179 | ] 180 | }, 181 | "execution_count": 9, 182 | "metadata": {}, 183 | "output_type": "execute_result" 184 | } 185 | ], 186 | "source": [ 187 | "### eg. to obtain embedding for token\n", 188 | "wv[vocab[\"feel\"]]" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "### Tokenization and Label Binarization\n", 196 | "A very important step before training our classifier, is to make sure the data is in the right format so it becomes easy for us to feed the data into the model. In the code below we will tokenize our dataset, in particular the inputs. Then we will also perform binarization on the target values so as to obtain one-hot vectors that will uniquely represent the target of each sentence or tweet. We also do some additional pre-processing which you can follow below. " 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 15, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def remove_unknown_words(tokens):\n", 206 | " return [t for t in tokens if t in vocab]\n", 207 | "\n", 208 | "def check_size(c, size):\n", 209 | " if len(c) <= size:\n", 210 | " return False\n", 211 | " else:\n", 212 | " return True\n", 213 | " \n", 214 | "### tokens and tokensize\n", 215 | "train_data[\"tokens\"] = train_data.text.apply(lambda t: remove_unknown_words(t.split()))\n", 216 | "train_data[\"tokensize\"] = train_data.tokens.apply(lambda t: len(t))\n", 217 | "test_data[\"tokens\"] = test_data.text.apply(lambda t: remove_unknown_words(t.split()))\n", 218 | "test_data[\"tokensize\"] = test_data.tokens.apply(lambda t: len(t))\n", 219 | "\n", 220 | "### filter by tokensize\n", 221 | "train_data = train_data.loc[train_data[\"tokens\"].apply(lambda d: check_size(d, 7)) != False].copy()\n", 222 | "test_data = test_data.loc[test_data[\"tokens\"].apply(lambda d: check_size(d, 7)) != False].copy()\n", 223 | "\n", 224 | "### sorting by tokensize\n", 225 | "train_data.sort_values(by=\"tokensize\", ascending=True, inplace=True)\n", 226 | "test_data.sort_values(by=\"tokensize\", ascending=True, inplace=True)\n", 227 | "\n", 228 | "### resetting index\n", 229 | "train_data.reset_index(drop=True, inplace=True);\n", 230 | "test_data.reset_index(drop=True, inplace=True);\n", 231 | "\n", 232 | "### Binarization\n", 233 | "emotions = list(set(train_data.emotions.unique()))\n", 234 | "num_emotions = len(emotions)\n", 235 | "\n", 236 | "### binarizer\n", 237 | "mlb = preprocessing.MultiLabelBinarizer()\n", 238 | "\n", 239 | "train_data_labels = [set(emos) & set(emotions) for emos in train_data[['emotions']].values]\n", 240 | "test_data_labels = [set(emos) & set(emotions) for emos in test_data[['emotions']].values]\n", 241 | "\n", 242 | "y_bin_emotions = mlb.fit_transform(train_data_labels)\n", 243 | "test_y_bin_emotions = mlb.fit_transform(test_data_labels)\n", 244 | "\n", 245 | "train_data['bin_emotions'] = y_bin_emotions.tolist()\n", 246 | "test_data['bin_emotions'] = test_y_bin_emotions.tolist()" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "### Generate sample input\n", 254 | "Once we have processed our data, let's look at an example of how we will converting sentences into input vectors, which are basically word vectors concatenated to represent the input sequence." 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 16, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "sentence_embeddings = [wv[vocab[w]] for w in \"this feels fantastic\".split()]" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 17, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "data": { 273 | "text/plain": [ 274 | "[array([ 5.7762969e-01, 4.1182911e-01, 1.5915717e+00, 1.9623135e-01,\n", 275 | " 1.4823467e-01, 3.4592927e-02, 1.0979089e-01, -5.4003459e-01,\n", 276 | " 5.5145639e-01, -2.0645244e-01, 6.2708288e-01, 1.9114013e+00,\n", 277 | " 4.1743749e-01, 4.8000565e-01, 1.3688921e+00, -6.0899270e-01,\n", 278 | " -8.2222080e-01, -1.6738379e-01, 2.5278423e-03, -4.4002768e-01,\n", 279 | " -1.7636645e-01, 3.1228867e-01, 8.5302269e-01, -5.5778861e-02,\n", 280 | " -9.6316218e-01, 6.3835210e-01, 1.1264894e+00, -7.7165258e-01,\n", 281 | " 1.7387373e+00, 1.3290544e+00, -2.6808953e-01, 2.6583406e-01,\n", 282 | " 1.7067311e+00, 4.0209743e-01, 1.9354068e+00, -4.4382878e-02,\n", 283 | " -1.7041634e+00, -2.1780021e+00, 6.2105244e-01, 4.5051843e-01,\n", 284 | " -9.4019301e-02, -1.6840085e-01, -6.8932152e-01, -8.8215894e-01,\n", 285 | " -1.4211287e+00, -6.9710428e-01, 9.1269486e-02, -1.3960580e+00,\n", 286 | " -2.6473520e+00, 1.2631515e-01, 1.0753033e+00, -1.7343637e+00,\n", 287 | " -1.2398950e+00, -1.8989055e-01, 5.5069500e-01, -9.9274379e-01,\n", 288 | " -7.4581426e-01, 1.9070454e+00, -2.7693167e-02, -9.6485667e-02,\n", 289 | " 3.6455104e+00, -2.2448828e+00, -2.3194687e+00, -5.6355500e-01,\n", 290 | " -2.2364409e+00, -1.3884341e+00, 6.8607783e-01, 6.3522869e-01,\n", 291 | " 1.6772349e+00, 9.3361482e-02, 1.5434825e+00, -8.9733368e-01,\n", 292 | " -2.0110564e-01, 5.5500650e-01, -2.7845064e-01, 8.5825706e-01,\n", 293 | " 2.6179519e-01, -1.4814560e-01, -5.7858503e-01, 5.2921659e-01,\n", 294 | " -6.2351793e-01, 1.1778877e+00, 6.9542038e-01, 1.8816992e+00,\n", 295 | " 3.1759745e-01, -4.7993168e-01, 6.5814179e-01, 8.7885934e-01,\n", 296 | " 6.0468066e-02, 7.4128270e-02, 2.2988920e+00, 2.1285081e+00,\n", 297 | " 7.0240453e-02, -7.6330572e-01, 8.3526218e-01, 6.0745466e-01,\n", 298 | " -1.0194540e+00, -2.1956379e+00, -1.2714338e+00, 6.3572550e-01,\n", 299 | " 6.6295260e-01, -8.3488572e-01, -3.4988093e-01, -3.5540792e-01,\n", 300 | " 5.4124153e-01, -7.7268988e-01, 1.4683855e-01, -7.3003507e-01,\n", 301 | " -8.1091434e-01, -1.0907569e+00, 7.5887805e-01, -1.1122453e+00,\n", 302 | " -1.6199481e+00, -1.3784732e+00, -6.3396573e-02, 3.2632509e-01,\n", 303 | " 1.0684365e-01, 6.0308921e-01, 4.6167067e-01, -2.0168118e+00,\n", 304 | " -6.7048740e-01, 1.6356069e-01, -5.4351605e-02, -5.2482843e-01,\n", 305 | " 2.2043006e+00, -6.8451458e-01, 5.9733611e-01, 6.6534078e-01],\n", 306 | " dtype=float32),\n", 307 | " array([-3.4321475e-01, -9.7226053e-02, 2.8605098e-01, 1.6047080e+00,\n", 308 | " -3.6453772e-01, -1.3920774e+00, -2.1250713e+00, 1.3125460e-01,\n", 309 | " -1.1453985e+00, 5.4572320e-01, 5.2630770e-01, -1.8370697e-01,\n", 310 | " 1.4765236e+00, 1.1878918e+00, -4.0682489e-01, -4.7836415e-02,\n", 311 | " -5.4636401e-01, 1.9224505e+00, -2.6708531e-01, 5.0754392e-01,\n", 312 | " -6.4190727e-01, 6.8657053e-01, -6.4917213e-01, 5.0319391e-01,\n", 313 | " -4.8732966e-01, 1.8009869e+00, -2.3410184e+00, -9.4244355e-01,\n", 314 | " -9.2588544e-01, 7.6922643e-01, -7.4314862e-01, 7.0185089e-01,\n", 315 | " -1.0966054e+00, 7.8724593e-01, 7.5844818e-01, 9.1872163e-02,\n", 316 | " 7.0011061e-01, 4.7920561e-01, -1.5113609e-02, 1.4994408e+00,\n", 317 | " -1.0265969e+00, 2.9074928e-01, -7.6647228e-01, 1.8470247e+00,\n", 318 | " -8.4488952e-01, -1.3199706e+00, -4.4321135e-01, 5.3276116e-01,\n", 319 | " 1.8265551e-01, 1.1034945e+00, 3.8836792e-01, 2.6915863e-01,\n", 320 | " -6.7549026e-01, -1.2705921e-01, -4.9914065e-01, -3.7022445e+00,\n", 321 | " 1.1977068e+00, 3.4566635e-01, 5.5629976e-02, 1.3779374e+00,\n", 322 | " -3.9924735e-01, -1.2794230e+00, 3.4014046e+00, -1.2588968e+00,\n", 323 | " -1.6168836e-01, 8.2324558e-01, 2.9140648e-01, 2.2544200e+00,\n", 324 | " 1.4198905e+00, -2.7008796e+00, 1.5832986e+00, 1.3438987e-03,\n", 325 | " 4.7332349e-01, 1.9437153e+00, 8.7838221e-01, 1.3765662e+00,\n", 326 | " 6.4651889e-01, -1.0945044e-01, 4.6745947e-01, -7.7465302e-01,\n", 327 | " -1.7219128e-01, 1.3659716e-01, 1.4069235e+00, -1.2043966e+00,\n", 328 | " 2.7390096e-01, -9.2881405e-01, 7.3064059e-02, -6.9506335e-01,\n", 329 | " -2.2899912e-01, -3.2435477e+00, -2.0895963e+00, 1.0968444e+00,\n", 330 | " 7.4347031e-01, -3.1055303e+00, -7.6739632e-02, 4.2136496e-01,\n", 331 | " 3.3838820e-01, -4.1653013e-01, 1.0817224e-01, -2.2449881e-02,\n", 332 | " 7.2924626e-01, 4.5947462e-01, 5.7326639e-01, -1.9229509e-01,\n", 333 | " -1.7776063e-01, 4.1691759e-01, -3.6446020e-01, -1.5269613e-02,\n", 334 | " -1.6729140e+00, 6.9680309e-01, 1.0556157e+00, 1.0876462e+00,\n", 335 | " 9.8904811e-02, 1.4382801e+00, 1.7168192e+00, 1.8068274e+00,\n", 336 | " 2.4255323e-01, -1.0590203e+00, 1.0824920e+00, -2.5140762e+00,\n", 337 | " -2.3148799e-01, 4.1911473e+00, -3.4231823e-02, 1.5553576e+00,\n", 338 | " 2.7134141e-01, 2.6498488e-01, -2.5449184e-01, -1.8989407e+00],\n", 339 | " dtype=float32),\n", 340 | " array([ 0.8383601 , -0.02155342, -0.21082091, 0.6485529 , -0.59349656,\n", 341 | " -0.166402 , 1.3000834 , -0.07898946, -0.16624215, 0.52123684,\n", 342 | " 0.05233976, -1.5532598 , 0.01666474, 0.797122 , -0.7451202 ,\n", 343 | " -1.6641759 , 1.2602216 , -2.0044358 , 0.68592983, 0.7536933 ,\n", 344 | " 1.0812731 , 2.10356 , -1.7539785 , -0.6635254 , -0.7465607 ,\n", 345 | " 0.2638522 , -0.3235982 , 1.4076495 , -0.09119514, 2.086436 ,\n", 346 | " -0.4526414 , 0.26831234, 0.24467596, 2.33805 , -0.7017719 ,\n", 347 | " -0.6682133 , -0.9301834 , 0.21346547, -1.0819333 , 0.03980344,\n", 348 | " -0.07848723, 0.716963 , -0.28034478, 0.6563167 , -0.99363357,\n", 349 | " 0.71183956, 0.05822359, -1.6912135 , -2.4925132 , -0.5482579 ,\n", 350 | " -0.67647994, 1.3980678 , 2.86393 , 1.2885548 , -1.5518631 ,\n", 351 | " -1.1034924 , -1.1662406 , 0.3353053 , 0.19297248, -0.95059246,\n", 352 | " 0.31902936, -1.0137295 , -1.279213 , -0.5329634 , -0.07975607,\n", 353 | " -0.19864069, 0.6106306 , 1.0557752 , -0.878621 , -0.8509309 ,\n", 354 | " -0.12062441, -0.27696317, 2.2124147 , 1.9911683 , 0.7381984 ,\n", 355 | " -0.469987 , -1.6558627 , -0.0847896 , -1.5840882 , 0.74699026,\n", 356 | " -0.13173659, 0.96634436, -1.3921932 , -0.16244002, 0.7752265 ,\n", 357 | " -0.23255356, -0.44541982, -2.2467227 , 0.10506741, -0.20535523,\n", 358 | " -0.09891574, -0.35552487, 0.13457903, -0.18867804, -0.04975915,\n", 359 | " 0.5091362 , -2.1489737 , 0.84570265, 1.2204372 , -1.2863662 ,\n", 360 | " -1.1997837 , -0.1355166 , -1.842612 , 0.27185363, -0.43057394,\n", 361 | " 0.9251916 , -0.45085236, 0.65534955, -1.4492592 , -0.7060368 ,\n", 362 | " 0.58963746, -1.9130523 , 0.74782646, 0.99171853, -0.42570722,\n", 363 | " -0.73163205, 2.2265303 , 1.0439353 , 0.21321568, 0.70397234,\n", 364 | " -0.41201043, -1.3467301 , -0.3377973 , 1.7296644 , -2.1833317 ,\n", 365 | " -1.9238352 , 0.00673127, 1.1287643 ], dtype=float32)]" 366 | ] 367 | }, 368 | "execution_count": 17, 369 | "metadata": {}, 370 | "output_type": "execute_result" 371 | } 372 | ], 373 | "source": [ 374 | "sentence_embeddings" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "### Batching by Bucketing approach\n", 382 | "Here is the code to generate batches for our training. This code is a little bit different from the batching approach we used to train our embeddings. Here we are going to generate batches of input sentences. In addition, we will also use a bucketing approach, which is basically a trick to generate more efficient batches that are of similar size. You don't need to know more about the batching for now, just that it is needed for training. We will explain the purpose of bucketing more in details in a future chapter of this series." 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 18, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [ 391 | "### renders embeddings with paddings; zeros where missing tokens\n", 392 | "def generate_embeds_with_pads(tokens, max_size):\n", 393 | " \n", 394 | " padded_embedding = []\n", 395 | " for i in range(max_size):\n", 396 | " if i+1 > len(tokens): # do padding\n", 397 | " padded_embedding.append(list(np.zeros(EMBEDDING_DIM)))\n", 398 | " else: # do embedding for existing tokens\n", 399 | " padded_embedding.append(list(wv[vocab[tokens[i]]])) \n", 400 | " return padded_embedding\n", 401 | "\n", 402 | "### generate the actual batches\n", 403 | "def generate_batches(data, batch_size):\n", 404 | " actual_batches = math.ceil(len(data) / batch_size)\n", 405 | " bins = np.linspace(0, len(data), actual_batches + 1) # this renders actual batches bins of size batch_size\n", 406 | " groups = data.groupby(np.digitize(data.index, bins))\n", 407 | " \n", 408 | " groups_indices = groups.indices\n", 409 | " groups_maxes = groups.max().tokensize\n", 410 | " \n", 411 | " return groups.indices, groups_maxes" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "### Model\n", 419 | "Let's set up our model." 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 19, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "class EmoNet(torch.nn.Module):\n", 429 | " def __init__(self, num_layers, hidden_size, embedding_dim, output_size, dropout):\n", 430 | " super(EmoNet, self).__init__()\n", 431 | " self.embedding_dim = embedding_dim\n", 432 | " self.keep_prob = dropout\n", 433 | " self.hidden_size = hidden_size\n", 434 | " self.nlayers = num_layers\n", 435 | " self.output = output_size\n", 436 | " \n", 437 | " self.dropout = nn.Dropout(p=self.keep_prob)\n", 438 | " \n", 439 | " self.rnn = nn.LSTM(input_size=self.embedding_dim,\n", 440 | " hidden_size=self.hidden_size, \n", 441 | " num_layers=self.nlayers,\n", 442 | " dropout=self.keep_prob)\n", 443 | " self.linear = nn.Linear(self.hidden_size, output_size)\n", 444 | " \n", 445 | " def forward(self, inputs):\n", 446 | " # batch_size X seq_len X embedding_dim -> seq_len, batch_size, embedding_dim\n", 447 | " X = inputs.permute(1,0,2)\n", 448 | " self.rnn.flatten_parameters()\n", 449 | " output, hidden = self.rnn(X)\n", 450 | " (_, last_state) = hidden \n", 451 | " out = self.dropout(output[-1]) \n", 452 | " out = self.linear(out)\n", 453 | " log_probs = F.log_softmax(out, dim=1)\n", 454 | " return log_probs " 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "### Pretesting with one batch sample\n", 462 | "Let's test the model to make sure that we are getting the right output." 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 20, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "tensor([[-2.0232, -2.0691, -2.1157, -2.0892, -2.1025, -2.0685, -2.0496, -2.1217],\n", 475 | " [-2.0673, -2.1060, -2.0852, -2.0269, -2.1725, -2.0424, -2.0406, -2.1025],\n", 476 | " [-2.0391, -2.1295, -2.1221, -2.0777, -2.0516, -2.0863, -2.0247, -2.1099],\n", 477 | " [-1.9732, -2.0941, -2.1539, -2.1623, -2.0857, -2.0165, -2.0270, -2.1399],\n", 478 | " [-2.0045, -2.1246, -2.2071, -2.1499, -2.1643, -1.9416, -1.9807, -2.0957],\n", 479 | " [-2.0669, -2.0434, -2.0972, -2.1399, -2.1139, -2.0600, -1.9759, -2.1500],\n", 480 | " [-2.0779, -2.0254, -2.1648, -2.0490, -2.1991, -1.9826, -2.0161, -2.1418],\n", 481 | " [-2.0564, -2.1175, -2.0688, -2.1243, -2.1005, -2.0728, -1.9879, -2.1144],\n", 482 | " [-2.0580, -2.0817, -2.1553, -2.0956, -2.0416, -2.0482, -2.0393, -2.1219],\n", 483 | " [-2.0021, -2.0735, -2.1122, -2.0266, -2.1404, -2.0821, -2.0176, -2.1964],\n", 484 | " [-2.0476, -2.1002, -2.1645, -2.1462, -2.0558, -2.0172, -1.9741, -2.1467],\n", 485 | " [-2.0600, -2.0664, -2.0646, -2.1640, -2.1037, -2.0233, -2.0344, -2.1271],\n", 486 | " [-2.1151, -2.0421, -2.0883, -2.1581, -2.1212, -2.0197, -2.0214, -2.0783],\n", 487 | " [-2.0683, -2.0992, -2.0486, -2.0950, -2.1395, -2.0397, -2.0194, -2.1323],\n", 488 | " [-2.0556, -2.0997, -2.1066, -2.0898, -2.0724, -2.0421, -2.0250, -2.1499]],\n", 489 | " grad_fn=)\n" 490 | ] 491 | } 492 | ], 493 | "source": [ 494 | "train_groups_indices, train_groups_maxes = generate_batches(train_data, BATCH_SIZE)\n", 495 | "test_groups_indices, test_groups_maxes = generate_batches(test_data, BATCH_SIZE)\n", 496 | "\n", 497 | "n_train = len(train_data) // BATCH_SIZE\n", 498 | "n_test = len(test_data) // BATCH_SIZE\n", 499 | "\n", 500 | "batch_x = train_data.iloc[train_groups_indices[1]].tokens.apply(lambda d: \n", 501 | " generate_embeds_with_pads(d, train_groups_maxes[1]) ).values.tolist()\n", 502 | "batch_y = train_data.loc[train_groups_indices[1]].bin_emotions.values.tolist()\n", 503 | "\n", 504 | "final_batch_x = torch.FloatTensor(np.array(batch_x))\n", 505 | "final_batch_y = torch.FloatTensor(np.array(batch_y))\n", 506 | "\n", 507 | "dummy_model = EmoNet(NUM_LAYERS, HIDDEN_SIZE, EMBEDDING_DIM, num_emotions, KEEP_PROB)\n", 508 | "log_probs = dummy_model(final_batch_x)\n", 509 | "print(log_probs[:15])" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "### Training\n", 517 | "Now let's train the model. But first, let's define the necessary variables to conduct the training like the optimizer and whether we are training on the cpu or gpu." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 21, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "### define model\n", 527 | "use_cuda = True if torch.cuda.is_available() else False\n", 528 | "device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n", 529 | "model = EmoNet(NUM_LAYERS, HIDDEN_SIZE, EMBEDDING_DIM, num_emotions, KEEP_PROB).to(device)\n", 530 | "\n", 531 | "optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)\n", 532 | "dimension = EMBEDDING_DIM\n", 533 | "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 10, 0, 0, 0\n", 534 | "\n", 535 | "### defining batch generation\n", 536 | "train_groups_indices, train_groups_maxes = generate_batches(train_data, BATCH_SIZE)\n", 537 | "test_groups_indices, test_groups_maxes = generate_batches(test_data, BATCH_SIZE)\n", 538 | "\n", 539 | "n_train = len(train_data) // BATCH_SIZE\n", 540 | "n_test = len(test_data) // BATCH_SIZE" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": 22, 546 | "metadata": {}, 547 | "outputs": [], 548 | "source": [ 549 | "def get_accuracy(logit, target, batch_size):\n", 550 | " ''' Obtain accuracy for training round '''\n", 551 | " corrects = (torch.max(logit, 1)[1].view(target.size()).data == target.data).sum()\n", 552 | " accuracy = 100.0 * corrects/batch_size\n", 553 | " return accuracy" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "metadata": {}, 559 | "source": [ 560 | "...and finally we can train the model. Note that I stopped the training after the first round, since I have already done the training on my computer. You can let the training continue until you have reached a nice accuracy." 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 23, 566 | "metadata": {}, 567 | "outputs": [ 568 | { 569 | "name": "stdout", 570 | "output_type": "stream", 571 | "text": [ 572 | "epoch: 0 , pass acc: 0 , current acc: 46\n", 573 | "time taken: 48.675190687179565\n", 574 | "epoch: 1 , training loss: 1.4679006251582394 , training acc: 46 , valid loss: 1.490867356731467 , valid acc: 46\n" 575 | ] 576 | }, 577 | { 578 | "ename": "KeyboardInterrupt", 579 | "evalue": "", 580 | "output_type": "error", 581 | "traceback": [ 582 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 583 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 584 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mb\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m batch_x = train_data.iloc[train_groups_indices[b+1]].tokens.apply(lambda d: \n\u001b[0m\u001b[1;32m 12\u001b[0m generate_embeds_with_pads(d, train_groups_maxes[b+1]) ).values.tolist()\n\u001b[1;32m 13\u001b[0m \u001b[0mbatch_y\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_groups_indices\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbin_emotions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtolist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 585 | "\u001b[0;32m/home/ellfae/anaconda3/lib/python3.6/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36mapply\u001b[0;34m(self, func, convert_dtype, args, **kwds)\u001b[0m\n\u001b[1;32m 2549\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2550\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masobject\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2551\u001b[0;31m \u001b[0mmapped\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmap_infer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconvert\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconvert_dtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2552\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2553\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmapped\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmapped\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSeries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 586 | "\u001b[0;32mpandas/_libs/src/inference.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.map_infer\u001b[0;34m()\u001b[0m\n", 587 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m(d)\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mb\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m batch_x = train_data.iloc[train_groups_indices[b+1]].tokens.apply(lambda d: \n\u001b[0;32m---> 12\u001b[0;31m generate_embeds_with_pads(d, train_groups_maxes[b+1]) ).values.tolist()\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0mbatch_y\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtrain_groups_indices\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbin_emotions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtolist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mbatch_y\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margmax\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_y\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 588 | "\u001b[0;32m\u001b[0m in \u001b[0;36mgenerate_embeds_with_pads\u001b[0;34m(tokens, max_size)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mpadded_embedding\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mzeros\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mEMBEDDING_DIM\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;31m# do embedding for existing tokens\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mpadded_embedding\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwv\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mvocab\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mtokens\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mpadded_embedding\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 589 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 590 | ] 591 | } 592 | ], 593 | "source": [ 594 | "### training\n", 595 | "while True:\n", 596 | " lasttime = time.time()\n", 597 | " ### early stoping to avoid overfitting\n", 598 | " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n", 599 | " print('break epoch:', EPOCH)\n", 600 | " break\n", 601 | " train_acc, train_loss, test_acc , test_loss = 0, 0, 0, 0\n", 602 | " \n", 603 | " for b in range(n_train):\n", 604 | " batch_x = train_data.iloc[train_groups_indices[b+1]].tokens.apply(lambda d: \n", 605 | " generate_embeds_with_pads(d, train_groups_maxes[b+1]) ).values.tolist()\n", 606 | " batch_y = train_data.loc[train_groups_indices[b+1]].bin_emotions.values.tolist()\n", 607 | " batch_y = np.argmax(batch_y, axis=1) \n", 608 | " final_batch_x = torch.FloatTensor(np.array(batch_x)).to(device)\n", 609 | " final_batch_y = torch.LongTensor(batch_y).to(device)\n", 610 | " \n", 611 | " model.zero_grad()\n", 612 | " y_hat = model(final_batch_x)\n", 613 | " \n", 614 | " loss = F.nll_loss(y_hat, final_batch_y)\n", 615 | " loss.backward()\n", 616 | " optimizer.step()\n", 617 | " \n", 618 | " train_loss += loss.item()\n", 619 | " train_acc += get_accuracy(y_hat, final_batch_y, BATCH_SIZE)\n", 620 | " \n", 621 | " for b in range(n_test):\n", 622 | " batch_x = test_data.iloc[test_groups_indices[b+1]].tokens.apply(lambda d: \n", 623 | " generate_embeds_with_pads(d, test_groups_maxes[b+1]) ).values.tolist()\n", 624 | " batch_y = test_data.loc[test_groups_indices[b+1]].bin_emotions.values.tolist()\n", 625 | " batch_y = np.argmax(batch_y, axis=1)\n", 626 | " final_batch_x = torch.FloatTensor(np.array(batch_x)).to(device)\n", 627 | " final_batch_y = torch.LongTensor(batch_y).to(device)\n", 628 | " \n", 629 | " model.zero_grad()\n", 630 | " y_hat = model(final_batch_x)\n", 631 | " \n", 632 | " loss = F.nll_loss(y_hat, final_batch_y)\n", 633 | " \n", 634 | " test_loss += loss.item()\n", 635 | " test_acc += get_accuracy(y_hat, final_batch_y, BATCH_SIZE)\n", 636 | " \n", 637 | " train_loss /= n_train\n", 638 | " train_acc /= n_train\n", 639 | " test_loss /= n_test\n", 640 | " test_acc /= n_test\n", 641 | " \n", 642 | " if test_acc > CURRENT_ACC:\n", 643 | " print('epoch:', EPOCH, ', pass acc:', CURRENT_ACC, ', current acc:', test_acc.cpu().numpy())\n", 644 | " CURRENT_ACC = test_acc\n", 645 | " CURRENT_CHECKPOINT = 0\n", 646 | " ### TODO: do checkpoint for model here using PyTorch\n", 647 | " else:\n", 648 | " CURRENT_CHECKPOINT += 1\n", 649 | " EPOCH += 1\n", 650 | " print('time taken:', time.time()-lasttime)\n", 651 | " print('epoch:', EPOCH, ', training loss:', train_loss, ', training acc:', train_acc.cpu().numpy(), ', valid loss:', test_loss, ', valid acc:', test_acc.cpu().numpy())\n" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "metadata": {}, 657 | "source": [ 658 | "### Store the Model\n", 659 | "Now that the model has been trained, we can store the classifier and then reuse it again to classify sentences or other tweets in the future. We will do this in the next chapter of this series. By the way, notice that I didn't properly evaluate the performance of the model here. I am sure you can find a way to improve the accuracy of the model by using more advanced deep learning techniques. You can also try to find a method to properly evaluate the model. I will provide that code in a future chapter. For now, we will use the model above, which has a fair accuracy, since the purpose of the series is to show you how to use the inferences of the model to conduct further analysis on a new dataset. We will cover this further analysis in the next chapter. Let's store the model first, and then we will retrieve it in the next notebook for classifying real-time tweets." 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 28, 665 | "metadata": {}, 666 | "outputs": [ 667 | { 668 | "name": "stderr", 669 | "output_type": "stream", 670 | "text": [ 671 | "/home/ellfae/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/serialization.py:193: UserWarning: Couldn't retrieve source code for container of type EmoNet. It won't be checked for correctness upon loading.\n", 672 | " \"type \" + obj.__name__ + \". It won't be checked \"\n" 673 | ] 674 | } 675 | ], 676 | "source": [ 677 | "import copy\n", 678 | "tmodel = copy.deepcopy(model)\n", 679 | "torch.save(tmodel, 'model/elastic_hashtag_model/emonet')" 680 | ] 681 | }, 682 | { 683 | "cell_type": "code", 684 | "execution_count": null, 685 | "metadata": {}, 686 | "outputs": [], 687 | "source": [] 688 | } 689 | ], 690 | "metadata": { 691 | "kernelspec": { 692 | "display_name": "Python 3", 693 | "language": "python", 694 | "name": "python3" 695 | }, 696 | "language_info": { 697 | "codemirror_mode": { 698 | "name": "ipython", 699 | "version": 3 700 | }, 701 | "file_extension": ".py", 702 | "mimetype": "text/x-python", 703 | "name": "python", 704 | "nbconvert_exporter": "python", 705 | "pygments_lexer": "ipython3", 706 | "version": "3.6.0" 707 | } 708 | }, 709 | "nbformat": 4, 710 | "nbformat_minor": 2 711 | } 712 | -------------------------------------------------------------------------------- /Ch3_Elasticsearch_indexing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Elasticsearch Restoring Emotion Classifier/Crawling\n", 8 | "Here we are going to resotre the emotion classifier that we trained in the previous chapter of the series. Then we will crawl real-time tweets from Twitter to classify, and then index into Elasticsearch. Then we are going to use Kibana to analyze the predictions and see where the model is doing well and not so well. In fact, we are going to use the inference of the model, to answer a few interesting questions using Kibana powerful analytic functionalities." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "%load_ext autoreload\n", 18 | "%autoreload 2" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import pandas as pd\n", 28 | "import numpy as np\n", 29 | "from sklearn import preprocessing, metrics, decomposition, pipeline, dummy\n", 30 | "import torch\n", 31 | "import torch.nn.functional as F\n", 32 | "import torch.nn as nn\n", 33 | "import os\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "%matplotlib inline\n", 36 | "import helpers.pickle_helpers as ph\n", 37 | "import time\n", 38 | "import math\n", 39 | "from sklearn.cross_validation import train_test_split\n", 40 | "import re" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Parameters\n", 48 | "Let's define the hyperparameters again since we need then when restoring the model" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "NUM_WORDS = 10000 # max size of vocabulary\n", 58 | "EMBEDDING_DIM = 128\n", 59 | "HIDDEN_SIZE = 256\n", 60 | "ATTENTION_SIZE = 150\n", 61 | "KEEP_PROB = 0.8\n", 62 | "BATCH_SIZE = 128\n", 63 | "NUM_EPOCHS = 50 # Model easily overfits without pre-trained words embeddings, that's why train for a few epochs\n", 64 | "DELTA = 0.5\n", 65 | "NUM_LAYERS = 3\n", 66 | "LEARNING_RATE = 0.001" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Import Embeddings\n", 74 | "Also, we will need the word embeddings to perform classification." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "### load word embeddings and accompanying vocabulary\n", 84 | "wv = ph.load_from_pickle(\"data/hashtags_word_embeddings/es_py_cbow_embeddings.p\")\n", 85 | "vocab = ph.load_from_pickle(\"data/hashtags_word_embeddings/es_py_cbow_dictionary.p\")" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "### Redefine Model\n", 93 | "Since we used a very a naive way to store our model in the previous chapter, we will need to redefine the same model again. With the latest version of PyTorch there better ways to store and restore models without having to do all of this unecessary steps. For now, let's just use this simple approach to restore our models." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "class EmoNet(nn.Module):\n", 103 | " def __init__(self, num_layers, hidden_size, embedding_dim, output_size, dropout):\n", 104 | " super(EmoNet, self).__init__()\n", 105 | " self.embedding_dim = embedding_dim\n", 106 | " self.keep_prob = dropout\n", 107 | " self.hidden_size = hidden_size\n", 108 | " self.nlayers = num_layers\n", 109 | " self.output = output_size\n", 110 | " \n", 111 | " self.dropout = nn.Dropout(p=self.keep_prob)\n", 112 | " \n", 113 | " self.rnn = nn.LSTM(input_size=self.embedding_dim,\n", 114 | " hidden_size=self.hidden_size, \n", 115 | " num_layers=self.nlayers,\n", 116 | " dropout=self.keep_prob)\n", 117 | " self.linear = nn.Linear(self.hidden_size, output_size)\n", 118 | " \n", 119 | " def forward(self, inputs):\n", 120 | " # batch_size X seq_len X embedding_dim -> seq_len, batch_size, embedding_dim\n", 121 | " X = inputs.permute(1,0,2)\n", 122 | " self.rnn.flatten_parameters()\n", 123 | " output, hidden = self.rnn(X)\n", 124 | " (_, last_state) = hidden \n", 125 | " out = self.dropout(output[-1]) \n", 126 | " out = self.linear(out)\n", 127 | " log_probs = F.log_softmax(out, dim=1)\n", 128 | " return log_probs " 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "### restoring the model\n", 138 | "tmodel = torch.load('model/elastic_hashtag_model/emonet')" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "use_cuda = True\n", 148 | "device = torch.device(\"cuda\" if use_cuda else \"cpu\")" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### Helper Function\n", 156 | "For simplicity, let's redefine the helper function we used before. If you want to further optimize your code, you could easily put these reusable functions into a seperate library." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "### TODO: move this preprocessing helper functions\n", 166 | "def clearstring(string):\n", 167 | " string = string.lower()\n", 168 | " string = re.sub('[^\\'\\\"A-Za-z0-9 ]+', '', string)\n", 169 | " string = string.split(' ')\n", 170 | " string = filter(None, string)\n", 171 | " string = [y.strip() for y in string]\n", 172 | " string = [y for y in string if len(y) > 3 and y.find('nbsp') < 0]\n", 173 | " return string\n", 174 | "\n", 175 | "def generate_embeds_with_pads(tokens, max_size):\n", 176 | " \n", 177 | " padded_embedding = []\n", 178 | " for i in range(max_size):\n", 179 | " if i+1 > len(tokens): # do padding\n", 180 | " padded_embedding.append(list(np.zeros(EMBEDDING_DIM)))\n", 181 | " else: # do embedding for existing tokens\n", 182 | " padded_embedding.append(list(wv[vocab[tokens[i]]])) \n", 183 | " return padded_embedding\n", 184 | "\n", 185 | "def remove_unknown_words(tokens):\n", 186 | " return [t for t in tokens if t in vocab]" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "### Input Transformations\n", 194 | "When we are crawling data from the Twitter API, we need to preprocess it and then transform the input to word embedding representations. Same process as used for the classifier int the previous chapter, just that this case we are using it to classify real-time data." 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "def transform_data_to_input(text):\n", 204 | " \"\"\" Accepts only one text as input; can be done by batches later on\"\"\"\n", 205 | " ### TODO Do the preprocessing here\n", 206 | " text = clearstring(text) # list of tokens\n", 207 | " text = remove_unknown_words(text)\n", 208 | " emb = generate_embeds_with_pads(text, len(text))\n", 209 | " \n", 210 | " return emb\n", 211 | "\n", 212 | "emo_map = {0: 'anger', \n", 213 | " 1: 'anticipation', \n", 214 | " 2: 'disgust', \n", 215 | " 3: 'fear', \n", 216 | " 4: 'joy', \n", 217 | " 5: 'sadness',\n", 218 | " 6: 'surprise',\n", 219 | " 7: 'trust'}" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Sample Classification of text\n", 227 | "Let's test to see if those function above work on a dummy text. Wow! You can see that the classifier classifies the word \"unhappy\" to sadness, which means the model is good to some extent." 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "### Get the emotion from tweet\n", 237 | "x = transform_data_to_input(\"unhappy\") # put tweet here\n", 238 | "final_x = torch.FloatTensor(np.array(x))\n", 239 | "final_x = final_x.unsqueeze(0)\n", 240 | "emo_map[torch.argmax(tmodel(final_x.to(device))).detach().item()]" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### Crawl Data And Index to Elasticsearch\n", 248 | "Now to the main part of this series. We have pretrained embeddings, we have trained and stored a classifier, but the best part of all will happen next. We will crawl real-time tweets and classify them into an emotion. We will then store those inferences of the model, along with the text, into Elasticsearch. We will then connect the Elasticsearch with Kibana and analyze our results. " 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "### import a few useful libraries\n", 258 | "import crawlers.config as cf\n", 259 | "from elasticsearch import Elasticsearch\n", 260 | "from elasticsearch import helpers\n", 261 | "es = Elasticsearch(cf.ELASTICSEARCH['hostname'])\n", 262 | "import sys, json\n", 263 | "import crawlers.config as config\n", 264 | "from tweepy import Stream\n", 265 | "from tweepy import OAuthHandler\n", 266 | "from tweepy.streaming import StreamListener\n", 267 | "from elasticsearch import helpers\n", 268 | "from elasticsearch import Elasticsearch\n", 269 | "import re\n", 270 | "from copy import deepcopy" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "### Crawler Configurations\n", 278 | "The following are just some extra configurations that are needed for the crawler. Keep in mind that these configurations are mostly obtained from the config library provided with the repository." 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "LANGUAGES = ['en']\n", 288 | "WANTED_KEYS = [\n", 289 | " 'id_str',\n", 290 | " 'text',\n", 291 | " 'created_at',\n", 292 | " 'in_reply_to_status_id_str',\n", 293 | " 'in_reply_to_user_id_str',\n", 294 | " 'retweeted',\n", 295 | " 'entities'] # Wanted keys to store in the database\n", 296 | "KEYWORDS = config.KEYWORDS['joy'] + \\\n", 297 | " config.KEYWORDS['trust'] + \\\n", 298 | " config.KEYWORDS['fear'] + \\\n", 299 | " config.KEYWORDS['surprise'] + \\\n", 300 | " config.KEYWORDS['sadness'] + \\\n", 301 | " config.KEYWORDS['disgust'] + \\\n", 302 | " config.KEYWORDS['anger'] + \\\n", 303 | " config.KEYWORDS['anticipation'] + \\\n", 304 | " config.KEYWORDS['other']\n", 305 | "\n", 306 | "print(len(KEYWORDS))" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "### Helper functions\n", 314 | "Below are a few helper functions which will be useful for the crawler in order to properly store the information we want on Elasticsearch. Again, the notebook could be simplified by putting this code in a seperate Python file. For now, we will stick to our long functions just to have everything in one place, where I can easily explain the components of the tutorial." 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "def convert_to_es_format(tweet):\n", 324 | " \"\"\"Convert into elastic format\"\"\"\n", 325 | " action = [\n", 326 | " {\n", 327 | " \"_index\": cf.ELASTICSEARCH['index'],\n", 328 | " \"_type\": cf.ELASTICSEARCH['type'],\n", 329 | " \"_source\": {\n", 330 | " \"emotion\": tweet[\"emotion\"],\n", 331 | " \"created_at\": tweet[\"created_at\"],\n", 332 | " \"tweet_id\": tweet[\"tweet_id\"],\n", 333 | " \"text\": tweet[\"text\"],\n", 334 | " \"hashtags\": tweet[\"hashtags\"] \n", 335 | " } \n", 336 | " }\n", 337 | " ]\n", 338 | " return action\n", 339 | "\n", 340 | "def post_tweet_to_es(doc):\n", 341 | " \"\"\" insert into Elasticsearch in bulk \"\"\"\n", 342 | " helpers.bulk(es, doc)\n", 343 | "\n", 344 | "def get_hashtags(list):\n", 345 | " \"\"\"obtain hashtags from tweet\"\"\"\n", 346 | " hashtags = []\n", 347 | " for h in list:\n", 348 | " hashtags.append(h['text'])\n", 349 | " return hashtags\n", 350 | "\n", 351 | "def predict_emotion(text):\n", 352 | " \"\"\" ouput prediction of the model \"\"\"\n", 353 | " x = transform_data_to_input(text) # put tweet here\n", 354 | " final_x = torch.FloatTensor(np.array(x))\n", 355 | " final_x = final_x.unsqueeze(0)\n", 356 | " return emo_map[torch.argmax(tmodel(final_x.to(device))).detach().item()]\n", 357 | "\n", 358 | "def format_to_print(tweet, hashtags):\n", 359 | " \"\"\" format raw tweet \"\"\"\n", 360 | " tweet_dict = {'text':tweet['text'],\n", 361 | " 'created_at': tweet['created_at'],\n", 362 | " 'tweet_id': tweet['id_str'],\n", 363 | " 'emotion': predict_emotion(tweet['text']),\n", 364 | " 'hashtags': hashtags}\n", 365 | " return tweet_dict" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "### The Crawler\n", 373 | "And we are finally ready to start crawling and storing our data. The crawler code below is standard code for crawling from the Twitter API. You will need to configure all your tokens in the config file so that this code can work. The `on_data` function in the class below achives everything we want: from preprocessing it, to classifying it, to indexing it into Elasticsearch. Spend some time analyzing the code below and make sure you understand how it is doing everything which I just explained. " 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "# Stream Listener\n", 383 | "class Listener(StreamListener):\n", 384 | "\n", 385 | " @staticmethod\n", 386 | " def on_data(data):\n", 387 | " try:\n", 388 | " reponse = json.loads(data)\n", 389 | " tweet = {key: reponse[key] for key in set(WANTED_KEYS) & set(reponse.keys())}\n", 390 | "\n", 391 | " hashtags = get_hashtags(tweet['entities']['hashtags'])\n", 392 | " anyretweet= re.findall(r'RT|https|http', str(tweet['text']))\n", 393 | "\n", 394 | " ### formatting tweet\n", 395 | " final_tweet = format_to_print(tweet, hashtags)\n", 396 | "\n", 397 | " ### make insertions\n", 398 | " if not anyretweet: \n", 399 | " f = deepcopy(final_tweet)\n", 400 | " \n", 401 | " ### insert to elasticsearch\n", 402 | " es_final_tweet = convert_to_es_format(f)\n", 403 | " post_tweet_to_es(es_final_tweet)\n", 404 | "\n", 405 | " except Exception as e:\n", 406 | " print(e)\n", 407 | " #print (\"--------------On data function------------\")\n", 408 | " return True\n", 409 | "\n", 410 | " @staticmethod\n", 411 | " def on_error(status):\n", 412 | " print (\"--------------On error function------------\")\n", 413 | " print (status)\n", 414 | " return True\n", 415 | "\n", 416 | " @staticmethod\n", 417 | " def on_timeout():\n", 418 | " print (\"--------------On timeout function------------\")\n", 419 | " print >> sys.stderr, 'Timeout...'\n", 420 | " return True # Don't kill the stream\n", 421 | "\n", 422 | " @staticmethod\n", 423 | " def on_status(status):\n", 424 | " print (\"--------------On status function------------\")\n", 425 | " print (status.text)\n", 426 | " return True" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "### Start Crawling...\n", 434 | "And for the moment you have been waiting for. Let's start the crawler! You can let the crawler run for as much time as you want. Since this is a crawler, I do suggest you convert this notebook into a Python script to make it more efficient. You will also notice sometimes that the crawler will output a warning \"cannot unsqueeze empty tensor\", you can ignore it since sometimes tweets are too short to deduce any information from them and so the classifier won't be able to infer anything from it. " 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "### Starts streaming\n", 444 | "while True:\n", 445 | " try:\n", 446 | " auth = OAuthHandler(\n", 447 | " config.TWITTER['consumer_key'], config.TWITTER['consumer_secret'])\n", 448 | " auth.set_access_token(\n", 449 | " config.TWITTER['access_token'], config.TWITTER['access_secret'])\n", 450 | " print(\"Crawling, Classifying, and Indexing tweets...\")\n", 451 | " twitterStream = Stream(auth, Listener())\n", 452 | " twitterStream.filter(languages=LANGUAGES, track=KEYWORDS)\n", 453 | " except KeyboardInterrupt:\n", 454 | " print (\"--------------On keyboard interruption function------------\")\n", 455 | " print(\"Bye\")\n", 456 | " sys.exit()" 457 | ] 458 | } 459 | ], 460 | "metadata": { 461 | "kernelspec": { 462 | "display_name": "Python 3", 463 | "language": "python", 464 | "name": "python3" 465 | }, 466 | "language_info": { 467 | "codemirror_mode": { 468 | "name": "ipython", 469 | "version": 3 470 | }, 471 | "file_extension": ".py", 472 | "mimetype": "text/x-python", 473 | "name": "python", 474 | "nbconvert_exporter": "python", 475 | "pygments_lexer": "ipython3", 476 | "version": "3.6.0" 477 | } 478 | }, 479 | "nbformat": 4, 480 | "nbformat_minor": 2 481 | } 482 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Emotion Analysis with Elastic and PyTorch 2 | 3 | ![alt txt](./img/espy.png) 4 | 5 | The objective of this project is to integrate real-time search and analytics technologies such as Elasticsearch and Kibana, while leveraging the inferences from deep learning frameworks such as PyTorch. With this integration it is possible to build a holistic NLP system that streamlines emotion analysis research and helps to answer interesting research questions and hypotheses through automated linguistic analyses, visualizations, and inferences. The figure above illustrates the system we intend to build in this project. 6 | 7 | ## The Dataset 8 | Due to Twitter TOS I am not allowed to publicly share the datasets here. However, you can **directly request** the datasets by emailing me at ellfae@gmail.com. 9 | 10 | ## Word Embeddings 11 | The emotion classifier in this project makes use of pretrained word embeddings. I have provided the code in "Ch1_PyTorch_CBOW_embeddings.ipynb" to pre-train your own embeddings for your dataset. Alternatively, you can use pretrained word embeddings from Word2vec or GLoVe or even ELMo embeddings. If you would like access to my pretrained embeddings, you can find them [here](https://www.dropbox.com/sh/d3sgvfwixpza7li/AABFnEAfajzz1hZ9cKRu-wMoa?dl=0). Note that they may need further fine-tuning, but you can just use them to test out the code in the notebooks. 12 | 13 | ## The Classifier 14 | After we have trained the word embeddings, it is time to train the classifier and store it for later use. In the "Ch2_PyTorch_classifier.ipynb" notebook I show all the details of how to achieve this. 15 | 16 | ## Elasticsearch Indexing 17 | In this project we will be pulling real-time data from the Twitter API, pre-processing it, analyzing it, and indexing it to Elasticsearch. Thus, the crawler code found in "Ch3_Elasticsearch_indexing.ipynb" will achieve these steps. You can find more details on how to configure your Twitter crawler [here](https://github.com/IDEA-NTHU-Taiwan/twitter_crawler_by_keywords). 18 | 19 | ## Kibana Analytics 20 | After indexing the classified tweets, I conduct further extensive analysis using Kibana. This part of the project will be uploaded soon. But if you have some expertise in Kibana you can just configure th index and start to create visualizations right away. I am performing complex emotion analysis as part of my work, thus I go the extra mile to find out more about the dataset and try to understand emotion from very different perspectives. All this stuff will be incorporated into this project soon! Take a look at these [preliminary slides](https://docs.google.com/presentation/d/1OFNKZwFyQq0BBxL7ABOosk2hAPul3jMpZj4jNmZNcMA/edit?usp=sharing) to get a feel for what other stuff you can expect to see in this project in the future. 21 | 22 | ## Future Vision 23 | The future vision of this project is to build a system as proposed in the poster below. This project was presented at the first PyTorch Developer Conference in San Francisco. It's an ongoing project with so many different layers of abstractions that help data scientist and machine learning enthusiast learn about search and analytics technologies, in particular how it is applied to emotion analysis. 24 | 25 | ![alt txt](https://github.com/omarsar/omarsar.github.io/blob/master/images/pytorch_conf.png?raw=true) 26 | 27 | ## Todo 28 | This is an ongoing project and these are the future additions I am working on: 29 | - Build PyTorch-based Emotion Recognition API 30 | - Integrate Logstash with the Emotion API 31 | - This avoids the need of a standalone crawler 32 | - We can also store tweets in bulk rather than by single tweets as is currently being done 33 | - Incorporate Opennlp plugin for the NLP part 34 | - Refactor code a bit 35 | - Create scripts to hold classes and functions 36 | -------------------------------------------------------------------------------- /crawlers/config.py: -------------------------------------------------------------------------------- 1 | # create twitter app to access Twitter Streaming API (https://apps.twitter.com) 2 | TWITTER = dict( 3 | consumer_key = '...', 4 | consumer_secret = '...', 5 | access_token = '...', 6 | access_secret = '...' 7 | ) 8 | 9 | # Elasticsearch configurations 10 | ELASTICSEARCH = dict( 11 | hostname = "localhost:9200", 12 | index = "hash_tweets", # new elastic collection index 13 | type = "doc"# new elastic collection type 14 | ) 15 | 16 | # emotion hashtags keywords 17 | KEYWORDS = dict( joy = [ 18 | "#accomplished", 19 | "#alive", 20 | "#amazing", 21 | "#awesome"], 22 | trust = [ 23 | "#acceptance", 24 | "#admiration", 25 | "#amused", 26 | "#appreciated"], 27 | fear = [ 28 | "#afraid", 29 | "#anxious", 30 | "#apprehension", 31 | "#awe", 32 | "#concerned"], 33 | surprise = [ 34 | "#amazed", 35 | "#amazement", 36 | "#crazy", 37 | "#different"], 38 | sadness = [ 39 | "#alone", 40 | "#ashamed", 41 | "#awful", 42 | "#awkward"], 43 | disgust = [ 44 | "#bitter", 45 | "#blah", 46 | "#bored" 47 | "#boredom"], 48 | anger = [ 49 | "#aggravated", 50 | "#aggressiveness", 51 | "#anger", 52 | "#anger2"], 53 | anticipation = [ 54 | "#adventurous", 55 | "#anticipation", 56 | "#curious", 57 | "#desperate"], 58 | other = [ 59 | "#asleep", 60 | "#awake", 61 | "#brave", 62 | "#busy"]) -------------------------------------------------------------------------------- /helpers/__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = ["pickle_helpers"] 2 | -------------------------------------------------------------------------------- /helpers/pickle_helpers.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | ''' 4 | Store and load anything to and from pickle 5 | ''' 6 | 7 | def convert_to_pickle(item, directory): 8 | ''' 9 | Usage: convert dictionary object to pickle format 10 | pickle_helpers.convert_to_pickle(cat_list,"data/liwc_pickle/liwc_cat.p") 11 | ''' 12 | 13 | pickle.dump(item, open(directory,"wb")) 14 | 15 | 16 | def load_from_pickle(directory): 17 | ''' 18 | Usage: Load pickle file 19 | pickle_helpers.load_from_pickle("data/liwc_pickle/liwc_cat.p") 20 | ''' 21 | 22 | return pickle.load(open(directory,"rb")) 23 | -------------------------------------------------------------------------------- /img/espy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/omarsar/emotion_analysis_elastic_pytorch/9d09cebb64ff7cb468e835bfe4e84a53552ab5b2/img/espy.png -------------------------------------------------------------------------------- /model/elastic_hashtag_model/emonet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/omarsar/emotion_analysis_elastic_pytorch/9d09cebb64ff7cb468e835bfe4e84a53552ab5b2/model/elastic_hashtag_model/emonet --------------------------------------------------------------------------------