├── models └── .keep ├── images └── fhchat.png ├── floyd.yml ├── README.md ├── support.py └── language-identification.ipynb /models/.keep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /images/fhchat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/floydhub/language-identification-template/master/images/fhchat.png -------------------------------------------------------------------------------- /floyd.yml: -------------------------------------------------------------------------------- 1 | env: tensorflow-1.7 2 | machine: cpu 3 | data: 4 | - source: floydhub/datasets/language-identification/1 5 | destination: languageidentification -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Language Identification 2 | 3 | [Language identification](https://en.wikipedia.org/wiki/Language_identification) is one of the most common features of social networks and web applications. It is commonly paired with [Machine Translation](https://en.wikipedia.org/wiki/Machine_translation) to improve the user experience and content accessibility (a must-have in the 2.0 society). *What can you use it for?* It is a foundation for other features such as Machine Translation (as mentioned before) and the analysis of posts, tweets, articles, and documents. 4 | 5 | ### Try it now 6 | 7 | [![Run on FloydHub](https://static.floydhub.com/button/button.svg)](https://floydhub.com/run?template=https://github.com/floydhub/language-identification-template) 8 | 9 | Click this button to open a Workspace on FloydHub that will train this model. 
10 | 11 | ### Language identification of short pieces of text from Wikipedia 12 | 13 | In this notebook we will build a deep learning model able to [detect the language of short pieces of text (140 characters, the old tweet length) with high accuracy using neural networks](http://machinelearningexp.com/deep-learning-language-identification-using-keras-tensorflow/). The task is commonly solved using hard-coded rules or an NLP library, but we will attack the problem using Deep Learning. 14 | 15 | ![fhChat](https://raw.githubusercontent.com/floydhub/language-identification-template/master/images/fhchat.png) 16 | 17 | *Made with [Sketch Group Chat](https://www.sketchappsources.com/free-source/1558-group-chat-sketch-freebie-resource.html)* 18 | 19 | We have [already gathered and extracted the raw dataset](https://floydhub.com/floydhub/datasets/language-identification/1) from https://dumps.wikimedia.org for 7 languages: Italian, Spanish, and French, which belong to the Romance (Latin-derived) language group, and English and German, which also share common roots. Czech and Slovakian are extremely similar and are considered one of the major challenges in language recognition. 20 | 21 | iso-code | language | example 22 | ---------|----------|-------- 23 | en | English | Hello world! 24 | fr | French | Bonjour tout le monde! 25 | es | Spanish | Hola mundo! 26 | it | Italian | Ciao mondo! 27 | de | German | Hallo Welt! 28 | cs | Czech | Ahoj světe! 29 | sk | Slovakian | Dobrý deň svet! 30 | 31 | We will: 32 | 33 | - Preprocess text data for NLP 34 | - Build and train a Deep Neural Network using Keras and TensorFlow 35 | - Evaluate our model on the test set 36 | - Run the model on your own text! 
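
As a rough illustration of the preprocessing step listed above, the sketch below turns a string into a row of character counts, in the spirit of `count_chars` and `get_input_row` from `support.py`. The toy alphabet (`TOY_LOWER`, `TOY_UPPER`) and the helper name `to_feature_row` are assumptions made for this example only; the template itself builds a much larger alphabet with `define_alphabet()`.

```python
# Minimal bag-of-characters sketch (hypothetical toy alphabet, not the
# full alphabet that support.define_alphabet() returns).
TOY_LOWER = list("abcdefghijklmnopqrstuvwxyz") + list("äöüß") + [" "]
TOY_UPPER = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + list("ÄÖÜ")


def count_chars(text, alphabet):
    """Return how many times each alphabet character occurs in text."""
    return [text.count(ch) for ch in alphabet]


def to_feature_row(text):
    """Hypothetical helper: case-insensitive counts, then uppercase-only counts."""
    counts_all = count_chars(text.lower(), TOY_LOWER)
    counts_upper = count_chars(text, TOY_UPPER)
    return counts_all + counts_upper


row = to_feature_row("Hallo Welt")
print(len(row))
print(row)
```

Counting the lowercased text and the uppercase letters separately keeps capitalization as a signal, which the comments in `support.py` motivate by pointing at capitalized German nouns.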
37 | -------------------------------------------------------------------------------- /support.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | import seaborn as sns 4 | import re 5 | 6 | ############ 7 | # Alphabet # 8 | ############ 9 | 10 | # we will use alphabet for text cleaning and letter counting 11 | def define_alphabet(): 12 | base_en = 'abcdefghijklmnopqrstuvwxyz' 13 | special_chars = ' !?¿¡' 14 | german = 'äöüß' 15 | italian = 'àèéìíòóùú' 16 | french = 'àâæçéèêêîïôœùûüÿ' 17 | spanish = 'áéíóúüñ' 18 | czech = 'áčďéěíjňóřšťúůýž' 19 | slovak = 'áäčďdzdžéíĺľňóôŕšťúýž' 20 | all_lang_chars = base_en + german + italian + french + spanish + czech + slovak 21 | small_chars = list(set(list(all_lang_chars))) 22 | small_chars.sort() 23 | big_chars = list(set(list(all_lang_chars.upper()))) 24 | big_chars.sort() 25 | small_chars += special_chars 26 | letters_string = '' 27 | letters = small_chars + big_chars 28 | for letter in letters: 29 | letters_string += letter 30 | return small_chars,big_chars,letters_string 31 | 32 | ######## 33 | # Plot # 34 | ######## 35 | 36 | def print_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14): 37 | """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap. 38 | 39 | Arguments 40 | --------- 41 | confusion_matrix: numpy.ndarray 42 | The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix. 43 | Similarly constructed ndarrays can also be used. 44 | class_names: list 45 | An ordered list of class names, in the order they index the given confusion matrix. 46 | figsize: tuple 47 | A 2-long tuple, the first value determining the horizontal size of the ouputted figure, 48 | the second determining the vertical size. Defaults to (10,7). 49 | fontsize: int 50 | Font size for axes labels. Defaults to 14. 
51 | 52 | Returns 53 | ------- 54 | matplotlib.figure.Figure 55 | The resulting confusion matrix figure 56 | 57 | FROM: https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823 58 | """ 59 | df_cm = pd.DataFrame( 60 | confusion_matrix, index=class_names, columns=class_names, 61 | ) 62 | fig = plt.figure(figsize=figsize) 63 | try: 64 | heatmap = sns.heatmap(df_cm, annot=True, fmt="d") 65 | except ValueError: 66 | raise ValueError("Confusion matrix values must be integers.") 67 | heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize) 68 | heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize) 69 | plt.title('Confusion Matrix') 70 | plt.ylabel('True label') 71 | plt.xlabel('Predicted label') 72 | return fig 73 | 74 | ################################### 75 | # Data cleaning utility functions # 76 | ################################### 77 | 78 | # here we define several text-cleaning procedures. 79 | # These procedures help us clean the data we have for training, 80 | # and are also useful for cleaning the text we want to classify before it is passed to the trained DNN 81 | 82 | # remove XML tags procedure 83 | # for example, Wikipedia Extractor creates tags like the one below; we need to remove them 84 | # ... 
85 | def remove_xml(text): 86 | return re.sub(r'<[^<]+?>', '', text) 87 | 88 | # remove new lines - we need dense data 89 | def remove_newlines(text): 90 | return text.replace('\n', ' ') 91 | 92 | # replace runs of whitespace with a single space - extra spaces are unnecessary 93 | # we want to keep single spaces between words 94 | # as this can tell the DNN about the average word length, which may be a useful feature 95 | def remove_manyspaces(text): 96 | return re.sub(r'\s+', ' ', text) 97 | 98 | # and here the whole procedure together 99 | def clean_text(text): 100 | text = remove_xml(text) 101 | text = remove_newlines(text) 102 | text = remove_manyspaces(text) 103 | return text 104 | 105 | ################# 106 | # Preprocessing # 107 | ################# 108 | 109 | # this function gets a sample of text from a cleaned language file. 110 | # It tries to preserve complete words - if a word would be sliced, the sample is shortened to the last full word 111 | def get_sample_text(file_content,start_index,sample_size): 112 | # we want to start from the first full word 113 | # if the first character is not a space, move on to the next ones 114 | while not (file_content[start_index].isspace()): 115 | start_index += 1 116 | # now we look for the first non-space character - the beginning of a word 117 | while file_content[start_index].isspace(): 118 | start_index += 1 119 | end_index = start_index+sample_size 120 | # we also want full words at the end 121 | while not (file_content[end_index].isspace()): 122 | end_index -= 1 123 | return file_content[start_index:end_index] 124 | 125 | # we need only alpha characters and some (very limited) special characters, 126 | # exactly the ones defined in the alphabet 127 | # no numbers; most special characters also bring no value for our classification task 128 | # (like dot or comma - they are the same in all of our languages, so they carry no additional information) 129 | 130 | # count number of chars in text based on given alphabet 131 | def
count_chars(text, alphabet): 132 | alphabet_counts = [] 133 | for letter in alphabet: 134 | count = text.count(letter) 135 | alphabet_counts.append(count) 136 | return alphabet_counts 137 | 138 | # process text and return sample input row for DNN 139 | # note that we are counting separately: 140 | # a) counts of all letters regardless of their case (whole text turned to lowercase) 141 | # b) counts of uppercase letters only 142 | # this is because German capitalizes the first letter of nouns, so this feature is meaningful 143 | def get_input_row(content,start_index,sample_size, alphabet): 144 | sample_text = get_sample_text(content,start_index,sample_size) 145 | counted_chars_all = count_chars(sample_text.lower(), alphabet[0]) 146 | counted_chars_big = count_chars(sample_text, alphabet[1]) 147 | all_parts = counted_chars_all + counted_chars_big 148 | return all_parts 149 | -------------------------------------------------------------------------------- /language-identification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Language Identification\n", 8 | "\n", 9 | "[Language identification](https://en.wikipedia.org/wiki/Language_identification) is one of the most common features of social networks and web applications. It is commonly paired with [Machine Translation](https://en.wikipedia.org/wiki/Machine_translation) to improve the user experience and content accessibility (a must-have in the 2.0 society). The goal of the task is to detect the natural language of a given piece of text. 
*What can you use it for?* It is a foundation for other features such as Machine Translation (as mentioned before) and the analysis of posts, tweets, articles, and documents.\n", 10 | "\n", 11 | "### Language identification of short pieces of text from Wikipedia\n", 12 | "\n", 13 | "In this notebook we will build a deep learning model able to [detect the language of short pieces of text (140 characters, the old tweet length) with high accuracy using neural networks](http://machinelearningexp.com/deep-learning-language-identification-using-keras-tensorflow/). The task is commonly solved using hard-coded rules or an NLP library, but we will attack the problem using Deep Learning. \n", 14 | "\n", 15 | "\n", 16 | "\n", 17 | "*Made with [Sketch Group Chat](https://www.sketchappsources.com/free-source/1558-group-chat-sketch-freebie-resource.html)*\n", 18 | "\n", 19 | "We have [already gathered and extracted the raw dataset](https://floydhub.com/floydhub/datasets/language-identification/1) from https://dumps.wikimedia.org for 7 languages: *Italian*, *Spanish*, and *French*, which belong to the Romance (Latin-derived) language group, and *English* and *German*, which also share common roots. 
*Czech* and *Slovakian* are extremely similar and are considered one of the major challenges in language recognition.\n", 20 | "\n", 21 | "iso-code | language | example\n", 22 | "---------|----------|--------\n", 23 | "en | English | Hello world!\n", 24 | "fr | French | Bonjour tout le monde!\n", 25 | "es | Spanish | Hola mundo!\n", 26 | "it | Italian | Ciao mondo!\n", 27 | "de | German | Hallo Welt!\n", 28 | "cs | Czech | Ahoj světe!\n", 29 | "sk | Slovakian | Dobrý deň svet!\n", 30 | "\n", 31 | "We will:\n", 32 | "\n", 33 | "- Preprocess text data for NLP\n", 34 | "- Build and train a Deep Neural Network using Keras and TensorFlow\n", 35 | "- Evaluate our model on the test set\n", 36 | "- Run the model on your own text!\n", 37 | "\n", 38 | "\n", 39 | "### Instructions\n", 40 | "- To execute a code cell, click on the cell and press `Shift + Enter` (shortcut for Run).\n", 41 | "- To learn more about Workspaces, check out the [Getting Started Notebook](get_started_workspace.ipynb).\n", 42 | "- **Tip**: *Feel free to try this Notebook with your own data and on your own super awesome classification task.*\n", 43 | "\n", 44 | "Now, let's get started! 🚀" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Initial Setup\n", 52 | "Let's start by importing some packages." 
53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "name": "stderr", 62 | "output_type": "stream", 63 | "text": [ 64 | "Using TensorFlow backend.\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "import os\n", 70 | "import random\n", 71 | "import numpy as np\n", 72 | "import tensorflow as tf\n", 73 | "import time\n", 74 | "\n", 75 | "from sklearn import preprocessing\n", 76 | "from sklearn.metrics import classification_report\n", 77 | "from sklearn.model_selection import train_test_split\n", 78 | "\n", 79 | "import keras\n", 80 | "from keras.models import Sequential\n", 81 | "from keras.layers import Dense\n", 82 | "from keras.layers import Dropout\n", 83 | "import keras.optimizers" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "## Training Parameters\n", 91 | "\n", 92 | "We'll set the hyperparameters for training our model. If you understand what they mean, feel free to play around - otherwise, we recommend keeping the defaults for your first run 🙂" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 2, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "# Hyperparams if GPU is available\n", 102 | "if tf.test.is_gpu_available():\n", 103 | " # GPU\n", 104 | " BATCH_SIZE = 512 # Number of text samples used in each iteration\n", 105 | " EPOCHS = 12 # Number of passes through entire dataset\n", 106 | " \n", 107 | "# Hyperparams for CPU training\n", 108 | "else:\n", 109 | " # CPU\n", 110 | " BATCH_SIZE = 64\n", 111 | " EPOCHS = 12" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "## Data preparation\n", 119 | "\n", 120 | "*WARNING*\n", 121 | "\n", 122 | "Make sure that the dataset has been mounted before running the next Code Cells. The data mounting should take about 3 minutes." 
123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 3, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "ALPHABET:\n", 135 | "abcdefghijklmnopqrstuvwxyzßàáâäæçèéêìíîïñòóôöùúûüýÿčďěĺľňœŕřšťůž !?¿¡ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÄÆÇÈÉÊÌÍÎÏÑÒÓÔÖÙÚÛÜÝČĎĚĹĽŇŒŔŘŠŤŮŸŽ\n", 136 | "ALPHABET LEN(VOCAB SIZE): 132\n" 137 | ] 138 | } 139 | ], 140 | "source": [ 141 | "#################\n", 142 | "# Configuration #\n", 143 | "#################\n", 144 | "\n", 145 | "# dictionary of languages that our classifier will cover\n", 146 | "LANGUAGES_DICT = {'en':0,'fr':1,'es':2,'it':3,'de':4,'sk':5,'cs':6}\n", 147 | "\n", 148 | "# Length of cleaned text used for training and prediction - 140 chars\n", 149 | "MAX_LEN = 140\n", 150 | "\n", 151 | "# number of language samples per language that we will extract from source files\n", 152 | "NUM_SAMPLES = 250000\n", 153 | "\n", 154 | "# For reproducibility\n", 155 | "SEED = 42\n", 156 | "\n", 157 | "from support import define_alphabet\n", 158 | "# Load the Alphabet\n", 159 | "alphabet = define_alphabet()\n", 160 | "print('ALPHABET:')\n", 161 | "print(alphabet[2])\n", 162 | "\n", 163 | "VOCAB_SIZE = len(alphabet[2])\n", 164 | "print('ALPHABET LEN(VOCAB SIZE):', VOCAB_SIZE)\n", 165 | "\n", 166 | "# Folders from where load / store the raw, source, cleaned, samples and train_test data\n", 167 | "data_directory = \"/floyd/input/languageidentification/data\"\n", 168 | "source_directory = os.path.join(data_directory, 'source')\n", 169 | "cleaned_directory = os.path.join(data_directory, 'cleaned')\n", 170 | "samples_directory = os.path.join('/tmp', 'samples')\n", 171 | "train_test_directory = os.path.join('/tmp', 'train_test')" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "Before feeding the data into the model, we have to preprocess the text.\n", 179 | "\n", 180 | "We will use the characters 
frequency as features to our model. This representation is similar to the [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model, with the exception that we are using characters and not words for defining the Vocabulary. You can see an example below:" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 4, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "name": "stdout", 190 | "output_type": "stream", 191 | "text": [ 192 | "1. SAMPLE TEXT: \n", 193 | " die Fähre \"Ibn Batouta\", und den Mondkrater Ibn Battuta. Liste der Gemeinden in Österreich Dies ist eine Zusammenstellung von Listen der\n", 194 | "\n", 195 | "2. REFERENCE ALPHABET: \n", 196 | " ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ß', 'à', 'á', 'â', 'ä', 'æ', 'ç', 'è', 'é', 'ê', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'ö', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'č', 'ď', 'ě', 'ĺ', 'ľ', 'ň', 'œ', 'ŕ', 'ř', 'š', 'ť', 'ů', 'ž', ' ', '!', '?', '¿', '¡', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'À', 'Á', 'Â', 'Ä', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ì', 'Í', 'Î', 'Ï', 'Ñ', 'Ò', 'Ó', 'Ô', 'Ö', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Č', 'Ď', 'Ě', 'Ĺ', 'Ľ', 'Ň', 'Œ', 'Ŕ', 'Ř', 'Š', 'Ť', 'Ů', 'Ÿ', 'Ž']\n", 197 | "\n", 198 | "3. SAMPLE INPUT ROW: \n", 199 | " [6, 4, 1, 8, 18, 1, 2, 2, 11, 0, 1, 4, 4, 13, 3, 0, 0, 7, 7, 11, 5, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", 200 | "\n", 201 | "4. 
INPUT SIZE (VOCAB SIZE): 132\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "from support import get_sample_text, get_input_row\n", 207 | " \n", 208 | "# let's see if our processing is returning counts\n", 209 | "# last part calculates also input_size for DNN so this code must be run before DNN is trained\n", 210 | "path = os.path.join(cleaned_directory, \"de_cleaned.txt\")\n", 211 | "with open(path, 'r') as f:\n", 212 | " content = f.read()\n", 213 | " random_index = random.randrange(0,len(content)-2*MAX_LEN)\n", 214 | " sample_text = get_sample_text(content,random_index,MAX_LEN)\n", 215 | " print (\"1. SAMPLE TEXT: \\n\", sample_text)\n", 216 | " print (\"\\n2. REFERENCE ALPHABET: \\n\", alphabet[0]+alphabet[1])\n", 217 | " \n", 218 | " sample_input_row = get_input_row(content, random_index, MAX_LEN, alphabet)\n", 219 | " print (\"\\n3. SAMPLE INPUT ROW: \\n\",sample_input_row)\n", 220 | " \n", 221 | " input_size = len(sample_input_row)\n", 222 | " if input_size != VOCAB_SIZE:\n", 223 | " print(\"Something strange happened!\")\n", 224 | " \n", 225 | " print (\"\\n4. INPUT SIZE (VOCAB SIZE): \", input_size)\n", 226 | " del content" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "Now we will apply the transformation from raw text to Bag of Characters representation for all the data we have collected. 
At the end of the proprocessing, we will have 250k samples per language where every sample will be piece of text 140 characters long, represented using the Bag of Characters model.\n", 234 | "\n", 235 | "Dataset dimension (1750k, 133):\n", 236 | "- rows: 1750k (250k * 7) or (NUM_SAMPLES * num_languages)\n", 237 | "- columns: 133 (132 + 1) or (VOCAB_SIZE + language_index)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 5, 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "name": "stdout", 247 | "output_type": "stream", 248 | "text": [ 249 | "Processing file : /floyd/input/languageidentification/data/cleaned/en_cleaned.txt\n", 250 | "File size : 101.42 MB | # possible samples : 768340 | # skip chars : 199\n", 251 | "----------------------------------------------------------------------------------------------------\n", 252 | "Processing file : /floyd/input/languageidentification/data/cleaned/fr_cleaned.txt\n", 253 | "File size : 98.72 MB | # possible samples : 747899 | # skip chars : 191\n", 254 | "----------------------------------------------------------------------------------------------------\n", 255 | "Processing file : /floyd/input/languageidentification/data/cleaned/es_cleaned.txt\n", 256 | "File size : 97.56 MB | # possible samples : 739111 | # skip chars : 187\n", 257 | "----------------------------------------------------------------------------------------------------\n", 258 | "Processing file : /floyd/input/languageidentification/data/cleaned/it_cleaned.txt\n", 259 | "File size : 101.89 MB | # possible samples : 771891 | # skip chars : 200\n", 260 | "----------------------------------------------------------------------------------------------------\n", 261 | "Processing file : /floyd/input/languageidentification/data/cleaned/de_cleaned.txt\n", 262 | "File size : 101.22 MB | # possible samples : 766797 | # skip chars : 198\n", 263 | 
"----------------------------------------------------------------------------------------------------\n", 264 | "Processing file : /floyd/input/languageidentification/data/cleaned/sk_cleaned.txt\n", 265 | "File size : 85.65 MB | # possible samples : 648847 | # skip chars : 151\n", 266 | "----------------------------------------------------------------------------------------------------\n", 267 | "Processing file : /floyd/input/languageidentification/data/cleaned/cs_cleaned.txt\n", 268 | "File size : 90.56 MB | # possible samples : 686098 | # skip chars : 166\n", 269 | "----------------------------------------------------------------------------------------------------\n", 270 | "Vocab Size : 132\n", 271 | "----------------------------------------------------------------------------------------------------\n", 272 | "Samples array size : (1750000, 133)\n", 273 | "/tmp/samples/lang_samples_132.npz size : 57.21 MB\n" 274 | ] 275 | } 276 | ], 277 | "source": [ 278 | "# Utility function to return file Bytes size in MB\n", 279 | "def size_mb(size):\n", 280 | " size_mb = '{:.2f}'.format(size/(1000*1000.0))\n", 281 | " return size_mb + \" MB\"\n", 282 | "\n", 283 | "# Now we have preprocessing utility functions ready. 
Let's use them to process each cleaned language file\n", 284 | "# and turn text data into numerical data samples for our neural network\n", 285 | "# prepare numpy array\n", 286 | "sample_data = np.empty((NUM_SAMPLES*len(LANGUAGES_DICT),input_size+1),dtype = np.uint16)\n", 287 | "lang_seq = 0 # offset for each language data\n", 288 | "jump_reduce = 0.2 # part of characters removed from jump to avoid passing the end of file\n", 289 | "\n", 290 | "for lang_code in LANGUAGES_DICT:\n", 291 | " start_index = 0\n", 292 | " path = os.path.join(cleaned_directory, lang_code+\"_cleaned.txt\")\n", 293 | " with open(path, 'r') as f:\n", 294 | " print (\"Processing file : \" + path)\n", 295 | " file_content = f.read()\n", 296 | " content_length = len(file_content)\n", 297 | " remaining = content_length - MAX_LEN*NUM_SAMPLES\n", 298 | " jump = int(((remaining/NUM_SAMPLES)*3)/4)\n", 299 | " print (\"File size : \",size_mb(content_length),\\\n", 300 | " \" | # possible samples : \",int(content_length/VOCAB_SIZE),\\\n", 301 | " \"| # skip chars : \" + str(jump))\n", 302 | " for idx in range(NUM_SAMPLES):\n", 303 | " input_row = get_input_row(file_content, start_index, MAX_LEN, alphabet)\n", 304 | " sample_data[NUM_SAMPLES*lang_seq+idx,] = input_row + [LANGUAGES_DICT[lang_code]]\n", 305 | " start_index += MAX_LEN + jump\n", 306 | " del file_content\n", 307 | " lang_seq += 1\n", 308 | " print (100*\"-\")\n", 309 | " \n", 310 | "# Let's randomly shuffle the data\n", 311 | "np.random.shuffle(sample_data)\n", 312 | "# reference input size\n", 313 | "print (\"Vocab Size : \",VOCAB_SIZE )\n", 314 | "print (100*\"-\")\n", 315 | "print (\"Samples array size : \",sample_data.shape )\n", 316 | "\n", 317 | "# Create the samples directory if it does not exist\n", 318 | "if not os.path.exists(samples_directory):\n", 319 | " os.makedirs(samples_directory)\n", 320 | "\n", 321 | "# Save compressed sample data to disk\n", 322 | "path_smpl = 
os.path.join(samples_directory,\"lang_samples_\"+str(VOCAB_SIZE)+\".npz\")\n", 323 | "np.savez_compressed(path_smpl,data=sample_data)\n", 324 | "print(path_smpl, \"size : \", size_mb(os.path.getsize(path_smpl)))\n", 325 | "del sample_data" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "Sanity check." 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 6, 338 | "metadata": {}, 339 | "outputs": [ 340 | { 341 | "name": "stdout", 342 | "output_type": "stream", 343 | "text": [ 344 | "Sample record : \n", 345 | " [10 2 3 6 13 4 4 2 10 0 0 8 3 6 5 3 0 4 10 6 4 0 0 1\n", 346 | " 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 347 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 1\n", 348 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 349 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 350 | " 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", 351 | "\n", 352 | "Sample language : en\n", 353 | "\n", 354 | "Dataset shape (Total_samples, Alphabet): (1750000, 133)\n", 355 | "Language bins count (samples per language): \n", 356 | "en 250000\n", 357 | "fr 250000\n", 358 | "es 250000\n", 359 | "it 250000\n", 360 | "de 250000\n", 361 | "sk 250000\n", 362 | "cs 250000\n" 363 | ] 364 | } 365 | ], 366 | "source": [ 367 | "# utility function to turn language id into language code\n", 368 | "def decode_langid(langid): \n", 369 | " for dname, did in LANGUAGES_DICT.items():\n", 370 | " if did == langid:\n", 371 | " return dname\n", 372 | "\n", 373 | "# Loading the data\n", 374 | "path_smpl = os.path.join(samples_directory,\"lang_samples_\"+str(VOCAB_SIZE)+\".npz\")\n", 375 | "dt = np.load(path_smpl)['data']\n", 376 | "\n", 377 | "# Sanity chech on a random sample\n", 378 | "random_index = random.randrange(0,dt.shape[0])\n", 379 | "print (\"Sample record : \\n\",dt[random_index,])\n", 380 | "print (\"\\nSample language : \",decode_langid(dt[random_index,][VOCAB_SIZE]))\n", 381 | "\n", 382 | "# Check if the 
data have equal share of different languages\n", 383 | "print (\"\\nDataset shape (Total_samples, Alphabet):\", dt.shape)\n", 384 | "bins = np.bincount(dt[:,input_size])\n", 385 | "\n", 386 | "print (\"Language bins count (samples per language): \") \n", 387 | "for lang_code in LANGUAGES_DICT: \n", 388 | " print (lang_code, bins[LANGUAGES_DICT[lang_code]])" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "### Data preprocessing\n", 396 | "\n", 397 | "Even if our data is ready for the training, we have to [Standardize](https://en.wikipedia.org/wiki/Feature_scaling#Standardization) the dataset. This will help our model to converge faster since it makes the data more \"computational friendly\"." 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 7, 403 | "metadata": {}, 404 | "outputs": [ 405 | { 406 | "name": "stdout", 407 | "output_type": "stream", 408 | "text": [ 409 | "Example data before processing:\n", 410 | "X : \n", 411 | " [ 8. 0. 5. 5. 21. 2. 1. 7. 6. 0. 0. 6. 1. 8. 5. 0. 0. 7.\n", 412 | " 10. 10. 4. 3. 2. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", 413 | " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", 414 | " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 24. 0. 0. 0. 0. 1. 0. 0.\n", 415 | " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.\n", 416 | " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", 417 | " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", 418 | " 0. 0. 0. 0. 0. 
0.]\n", 419 | "Y : \n", 420 | " 0.0\n", 421 | "X preprocessed shape : (1750000, 132)\n", 422 | "\n", 423 | "Example data after processing:\n", 424 | "X : \n", 425 | " [-4.17050153e-01 -1.13270748e+00 6.38392270e-01 1.75331384e-01\n", 426 | " 1.73380041e+00 5.33405364e-01 -3.80145878e-01 1.91697395e+00\n", 427 | " -5.14976740e-01 -6.19032502e-01 -7.30089128e-01 1.92491740e-01\n", 428 | " -1.09850657e+00 -2.04314422e-02 -7.94259489e-01 -1.46870685e+00\n", 429 | " -4.61331427e-01 5.33611178e-02 1.16716337e+00 1.14771831e+00\n", 430 | " 3.73478569e-02 2.73657739e-01 1.30994463e+00 -3.63413811e-01\n", 431 | " -1.49887251e-02 -7.82141924e-01 -1.45357341e-01 -2.73461759e-01\n", 432 | " -5.23208618e-01 -8.40839669e-02 -2.57696390e-01 -1.61323845e-02\n", 433 | " -9.63160172e-02 -2.67638355e-01 -5.72866321e-01 -1.24845751e-01\n", 434 | " -8.58770609e-02 -4.83961701e-01 -7.90864602e-02 -4.38546538e-02\n", 435 | " -1.56157464e-01 -1.38136834e-01 -3.22708428e-01 -1.61213249e-01\n", 436 | " -1.93664357e-01 -1.26288012e-01 -3.17232043e-01 -4.67012003e-02\n", 437 | " -2.45021537e-01 -3.98998171e-01 -6.17011636e-03 -3.92622024e-01\n", 438 | " -1.21239848e-01 -3.00596684e-01 -4.27031331e-02 -2.16784000e-01\n", 439 | " -1.39083400e-01 -5.39894253e-02 -2.65944358e-02 -2.88298607e-01\n", 440 | " -3.61458272e-01 -1.90843359e-01 -2.18461171e-01 -3.85120064e-01\n", 441 | " 1.30634761e+00 -3.31691280e-02 -3.66677120e-02 -1.20478580e-02\n", 442 | " -9.79928020e-03 9.66669083e-01 -4.30910617e-01 -4.42056417e-01\n", 443 | " -4.23938990e-01 -4.10972983e-01 -3.52509290e-01 -3.54364634e-01\n", 444 | " -3.37542146e-01 -3.59003991e-01 -3.15173328e-01 -3.17866415e-01\n", 445 | " -4.55488026e-01 -4.60290730e-01 -3.92121077e-01 -3.03513795e-01\n", 446 | " -4.86154377e-01 -1.12909846e-01 1.68673170e+00 -5.44467688e-01\n", 447 | " 1.40097451e+00 -2.54374862e-01 -3.80121171e-01 -2.62794465e-01\n", 448 | " -9.93090570e-02 -1.18509911e-01 -2.20897570e-01 -5.71699217e-02\n", 449 | " -4.62426692e-02 
-1.81627274e-02 -3.68494987e-02 -6.64461171e-03\n", 450 | " -8.24078172e-03 -4.20710333e-02 -8.09805691e-02 -2.85832421e-03\n", 451 | " -7.55929330e-04 -2.11150870e-02 -2.74087414e-02 -1.60357123e-03\n", 452 | " -6.56590983e-03 -2.61863228e-03 -1.99632142e-02 -4.02917853e-03\n", 453 | " -4.37377803e-02 -1.51186134e-03 -5.64606264e-02 0.00000000e+00\n", 454 | " -5.21061346e-02 -5.66610694e-03 -1.08200289e-01 -2.99859717e-02\n", 455 | " -2.85716611e-03 -1.06904586e-03 -2.65672822e-02 -4.56597749e-03\n", 456 | " -6.93865819e-03 -1.01418595e-03 -4.24020365e-02 -8.98558497e-02\n", 457 | " -1.53618287e-02 -1.76383916e-03 0.00000000e+00 -7.48927444e-02]\n", 458 | "Y : \n", 459 | " [1. 0. 0. 0. 0. 0. 0.]\n", 460 | "/tmp/train_test/train_test_data_132.npz size : 94.82 MB\n" 461 | ] 462 | } 463 | ], 464 | "source": [ 465 | "# we need to preprocess data for DNN yet again - scale it \n", 466 | "# scaling will ensure that our optimization algorithm (variation of gradient descent) will converge well\n", 467 | "# we need also ensure one-hot econding of target classes for softmax output layer\n", 468 | "# let's convert datatype before processing to float\n", 469 | "dt = dt.astype(np.float32)\n", 470 | "# X and Y split\n", 471 | "X = dt[:, 0:input_size] # Samples\n", 472 | "Y = dt[:, input_size] # The last element is the label\n", 473 | "del dt\n", 474 | "\n", 475 | "# Random index to check random sample\n", 476 | "random_index = random.randrange(0,X.shape[0])\n", 477 | "print(\"Example data before processing:\")\n", 478 | "print(\"X : \\n\", X[random_index,])\n", 479 | "print(\"Y : \\n\", Y[random_index])\n", 480 | "\n", 481 | "# X PREPROCESSING\n", 482 | "# Feature Standardization - Standar scaler will be useful later during DNN prediction\n", 483 | "standard_scaler = preprocessing.StandardScaler().fit(X)\n", 484 | "X = standard_scaler.transform(X) \n", 485 | "print (\"X preprocessed shape :\", X.shape)\n", 486 | "\n", 487 | "# Y PREPROCESSINGY \n", 488 | "# One-hot encoding\n", 
489 | "Y = keras.utils.to_categorical(Y, num_classes=len(LANGUAGES_DICT))\n", 490 | "\n", 491 | "# See the sample data\n", 492 | "print(\"\\nExample data after processing:\")\n", 493 | "print(\"X : \\n\", X[random_index,])\n", 494 | "print(\"Y : \\n\", Y[random_index])\n", 495 | "\n", 496 | "# Train/test split. Static seed to have comparable results for different runs\n", 497 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=SEED)\n", 498 | "del X, Y\n", 499 | "\n", 500 | "# Create the train/test directory if it does not exist\n", 501 | "if not os.path.exists(train_test_directory):\n", 502 | "    os.makedirs(train_test_directory)\n", 503 | "\n", 504 | "# Save compressed train_test data to disk\n", 505 | "path_tt = os.path.join(train_test_directory,\"train_test_data_\"+str(VOCAB_SIZE)+\".npz\")\n", 506 | "np.savez_compressed(path_tt,X_train=X_train,Y_train=Y_train,X_test=X_test,Y_test=Y_test)\n", 507 | "print(path_tt, \"size : \",size_mb(os.path.getsize(path_tt)))\n", 508 | "del X_train,Y_train,X_test,Y_test" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "## Train - Test Split\n", 516 | "\n", 517 | "Split: 80% for training and 20% for testing" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 8, 523 | "metadata": {}, 524 | "outputs": [ 525 | { 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "X_train: (1400000, 132)\n", 530 | "Y_train: (1400000, 7)\n", 531 | "X_test: (350000, 132)\n", 532 | "Y_test: (350000, 7)\n" 533 | ] 534 | } 535 | ], 536 | "source": [ 537 | "# Load the train/test data back from file\n", 538 | "path_tt = os.path.join(train_test_directory, \"train_test_data_\"+str(VOCAB_SIZE)+\".npz\")\n", 539 | "train_test_data = np.load(path_tt)\n", 540 | "\n", 541 | "# Train Set\n", 542 | "X_train = train_test_data['X_train']\n", 543 | "print (\"X_train: \",X_train.shape)\n", 544 | "Y_train = train_test_data['Y_train']\n", 545 | 
"print (\"Y_train: \",Y_train.shape)\n", 546 | "\n", 547 | "# Test Set\n", 548 | "X_test = train_test_data['X_test']\n", 549 | "print (\"X_test: \",X_test.shape)\n", 550 | "Y_test = train_test_data['Y_test']\n", 551 | "print (\"Y_test: \",Y_test.shape)\n", 552 | "\n", 553 | "del train_test_data" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "metadata": {}, 559 | "source": [ 560 | "### Model\n", 561 | "\n", 562 | "We will implement a simple neural network with three hidden layers (500, 300 and 100 units), each followed by [Dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) to reduce overfitting. We will also use the [Xavier initializer](https://www.quora.com/What-is-an-intuitive-explanation-of-the-Xavier-Initialization-for-Deep-Neural-Networks), a widely used initialization scheme that improves the chance of converging to a better \"place\": a point with better accuracy.\n", 563 | "\n", 564 | "![nn](http://neuralnetworksanddeeplearning.com/images/tikz40.png)\n", 565 | "\n", 566 | "*From http://neuralnetworksanddeeplearning.com/*" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 9, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "name": "stdout", 576 | "output_type": "stream", 577 | "text": [ 578 | "_________________________________________________________________\n", 579 | "Layer (type) Output Shape Param # \n", 580 | "=================================================================\n", 581 | "dense_1 (Dense) (None, 500) 66500 \n", 582 | "_________________________________________________________________\n", 583 | "dropout_1 (Dropout) (None, 500) 0 \n", 584 | "_________________________________________________________________\n", 585 | "dense_2 (Dense) (None, 300) 150300 \n", 586 | "_________________________________________________________________\n", 587 | "dropout_2 (Dropout) (None, 300) 0 \n", 588 | "_________________________________________________________________\n", 589 | "dense_3 (Dense) (None, 100) 30100 \n", 590 | 
"_________________________________________________________________\n", 591 | "dropout_3 (Dropout) (None, 100) 0 \n", 592 | "_________________________________________________________________\n", 593 | "dense_4 (Dense) (None, 7) 707 \n", 594 | "=================================================================\n", 595 | "Total params: 247,607\n", 596 | "Trainable params: 247,607\n", 597 | "Non-trainable params: 0\n", 598 | "_________________________________________________________________\n" 599 | ] 600 | } 601 | ], 602 | "source": [ 603 | "model = Sequential()\n", 604 | "# Note: glorot_uniform is the Xavier uniform initializer.\n", 605 | "\n", 606 | "model.add(Dense(500,input_dim=input_size, kernel_initializer=\"glorot_uniform\", activation=\"sigmoid\"))\n", 607 | "model.add(Dropout(0.5))\n", 608 | "model.add(Dense(300, kernel_initializer=\"glorot_uniform\", activation=\"sigmoid\"))\n", 609 | "model.add(Dropout(0.5))\n", 610 | "model.add(Dense(100, kernel_initializer=\"glorot_uniform\", activation=\"sigmoid\"))\n", 611 | "model.add(Dropout(0.5))\n", 612 | "model.add(Dense(len(LANGUAGES_DICT), kernel_initializer=\"glorot_uniform\", activation=\"softmax\"))\n", 613 | "model_optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)\n", 614 | "model.compile(loss='categorical_crossentropy',\n", 615 | "              optimizer=model_optimizer,\n", 616 | "              metrics=['accuracy'])\n", 617 | "\n", 618 | "model.summary()" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": {}, 624 | "source": [ 625 | "### Train & Evaluate\n", 626 | "\n", 627 | "If you leave the default hyperparameters in the Notebook untouched, training should take approximately:\n", 628 | "\n", 629 | "- On a CPU machine: 24 minutes for 12 epochs.\n", 630 | "- On a GPU machine: 3 minutes for 12 epochs.\n", 631 | "\n", 632 | "*Note*: You can follow the execution on TensorBoard." 
633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "from keras.callbacks import TensorBoard\n", 642 | "\n", 643 | "# Tensorboard\n", 644 | "tensorboard = TensorBoard(log_dir=\"run\")\n", 645 | "\n", 646 | "# let's fit the data\n", 647 | "# history variable will help us to plot results later\n", 648 | "history = model.fit(X_train,Y_train,\n", 649 | " epochs=EPOCHS,\n", 650 | " validation_split=0.1,\n", 651 | " batch_size=BATCH_SIZE,\n", 652 | " callbacks=[tensorboard],\n", 653 | " shuffle=True,\n", 654 | " verbose=2)" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 11, 660 | "metadata": {}, 661 | "outputs": [ 662 | { 663 | "name": "stdout", 664 | "output_type": "stream", 665 | "text": [ 666 | "350000/350000 [==============================] - 15s 42us/step\n", 667 | "acc: 97.60%\n" 668 | ] 669 | } 670 | ], 671 | "source": [ 672 | "# Evaluation on Test set\n", 673 | "scores = model.evaluate(X_test, Y_test, verbose=1)\n", 674 | "print(\"%s: %.2f%%\" % (model.metrics_names[1], scores[1]*100))" 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": 12, 680 | "metadata": {}, 681 | "outputs": [], 682 | "source": [ 683 | "# and now we will prepare data for scikit-learn confusion matrix and classification report\n", 684 | "Y_pred = model.predict_classes(X_test)\n", 685 | "Y_pred = keras.utils.to_categorical(Y_pred, num_classes=len(LANGUAGES_DICT))\n", 686 | "LABELS = list(LANGUAGES_DICT.keys())" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 13, 692 | "metadata": {}, 693 | "outputs": [ 694 | { 695 | "data": { 696 | "image/png": "\n", 697 | "text/plain": [ 698 | "
" 699 | ] 700 | }, 701 | "metadata": {}, 702 | "output_type": "display_data" 703 | } 704 | ], 705 | "source": [ 706 | "# Plot confusion matrix \n", 707 | "from sklearn.metrics import confusion_matrix\n", 708 | "from support import print_confusion_matrix\n", 709 | "\n", 710 | "# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions\n", 711 | "cnf_matrix = confusion_matrix(np.argmax(Y_test,axis=1), np.argmax(Y_pred,axis=1))\n", 711 | "_ = print_confusion_matrix(cnf_matrix, LABELS)" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 14, 717 | "metadata": {}, 718 | "outputs": [ 719 | { 720 | "name": "stdout", 721 | "output_type": "stream", 722 | "text": [ 723 | " precision recall f1-score support\n", 724 | "\n", 725 | " en 0.97 0.97 0.97 49999\n", 726 | " fr 0.98 0.98 0.98 49917\n", 727 | " es 0.98 0.97 0.97 49889\n", 728 | " it 0.97 0.97 0.97 49977\n", 729 | " de 0.99 0.99 0.99 50111\n", 730 | " sk 0.97 0.98 0.98 50047\n", 731 | " cs 0.98 0.97 0.98 50060\n", 732 | "\n", 733 | "avg / total 0.98 0.98 0.98 350000\n", 734 | "\n" 735 | ] 736 | } 737 | ], 738 | "source": [ 739 | "# Classification Report\n", 740 | "print(classification_report(Y_test, Y_pred, target_names=LABELS))" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": {}, 746 | "source": [ 747 | "## It's your turn\n", 748 | "\n", 749 | "Test out the model you just trained. Run the code cell below and type your text in the widget. Have fun! 🎉\n", 750 | "\n", 751 | "Here is some inspiration:\n", 752 | "\n", 753 | "##### EN - Frank Baum, The Wonderful Wizard of Oz, Project Gutenberg, public domain\n", 754 | "You are welcome, most noble Sorceress, to the land of the Munchkins. We are so grateful to you for having killed the Wicked Witch of the East, and for setting our people free from bondage.\n", 755 | "\n", 756 | "##### DE - Johann Wolfgang von Goethe, Faust: Der Tragödie erster Teil, Project Gutenberg, public domain\n", 757 | "Habe nun, ach! 
Philosophie, Juristerei und Medizin, Und leider auch Theologie Durchaus studiert, mit heißem Bemühn. Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor.\n", 758 | "\n", 759 | "##### FR - Pierre Benoît, L'Atlantide, \n", 760 | "Voilà cinq mois que j'en faisais fonction, et, ma foi, je supportais bien cette responsabilité et goûtais fort cette indépendance. Je puis même affirmer, sans me flatter.\n", 761 | "\n", 762 | "##### IT - Alberto Boccardi, Il peccato di Loreta, Project Gutenberg, public domain\n", 763 | "Giovanni Sant'Angelo, che negli anni passati a Padova in mezzo alla baraonda tanto gioconda degli studenti, aveva appreso ad amare con foga di giovane qualche alto ideale, tornato in famiglia dovette fare uno sforzo.\n", 764 | "\n", 765 | "##### ES - Fernando Callejo Ferrer, Música y Músicos Portorriqueños, Project Gutenberg, public domain\n", 766 | "Dedicada esta sección a la reseña de los compositores nativos y obras que han producido, con ligeros comentarios propios a cada uno, parécenos oportuno dar ligeras noticias sobre el origen de la composición.\n", 767 | "\n", 768 | "##### CS - František Omelka, Blesky nad Beskydami, Project Gutenberg, public domain\n", 769 | "A Slávek, jsa povzbuzen, se ptal a otec odpovídal. Přestože byl prostým venkovským listonošem, nepřivedla jej žádná synova otázka do rozpaků. Od mládí se zajímal o dějepis a literaturu.\n", 770 | "\n", 771 | "##### SK - Janko Matúška, Nad Tatrou sa blýska, national anthem of Slovakia, https://en.wikipedia.org/wiki/Nad_Tatrou_sa_blýska\n", 772 | "Nad Tatrou sa blýska Hromy divo bijú Zastavme ich, bratia Veď sa ony stratia Slováci ožijú To Slovensko naše Posiaľ tvrdo spalo Ale blesky hromu Vzbudzujú ho k tomu Aby sa prebralo.\n", 773 | "\n", 774 | "Can you do better? Play around with the model hyperparameters!" 
775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": 15, 780 | "metadata": {}, 781 | "outputs": [ 782 | { 783 | "data": { 784 | "application/vnd.jupyter.widget-view+json": { 785 | "model_id": "51a58e6a7b0f497aa907a3df96d293fb", 786 | "version_major": 2, 787 | "version_minor": 0 788 | }, 789 | "text/plain": [ 790 | "interactive(children=(Textarea(value='', description='TEXT', placeholder='Type the text to identify here'), Bu…" 791 | ] 792 | }, 793 | "metadata": {}, 794 | "output_type": "display_data" 795 | } 796 | ], 797 | "source": [ 798 | "# And now we will have some fun. Seeing is believing!\n", 799 | "# We will take some texts and try to predict their language using our trained neural network.\n", 800 | "\n", 801 | "from ipywidgets import interact_manual\n", 802 | "from ipywidgets import widgets\n", 803 | "from support import clean_text\n", 804 | "\n", 805 | "\n", 806 | "def get_prediction(TEXT):\n", 807 | "    if len(TEXT) < MAX_LEN:\n", 808 | "        print(\"Text has to be at least {} chars long, but it is {}/{}\".format(MAX_LEN, len(TEXT), MAX_LEN))\n", 809 | "        return -1\n", 810 | "    # Data cleaning\n", 811 | "    cleaned_text = clean_text(TEXT)\n", 812 | "\n", 813 | "    # Build the input row from the first MAX_LEN characters\n", 814 | "    input_row = get_input_row(cleaned_text, 0, MAX_LEN, alphabet)\n", 815 | "\n", 816 | "    # Data preprocessing (standardization with the scaler fitted on the training data)\n", 817 | "    test_array = standard_scaler.transform([input_row])\n", 818 | "\n", 819 | "    # Class probabilities from the softmax layer\n", 819 | "    raw_score = model.predict(test_array)\n", 820 | "    pred_idx = np.argmax(raw_score, axis=1)[0]\n", 821 | "    score = raw_score[0][pred_idx]*100\n", 822 | "\n", 823 | "    # Prediction: label of the most probable class\n", 824 | "    prediction = LABELS[pred_idx]\n", 825 | "    print('TEXT:', TEXT, '\\nPREDICTION:', prediction.upper(), '\\nSCORE:', score)\n", 826 | "\n", 827 | "interact_manual(get_prediction, TEXT=widgets.Textarea(placeholder='Type the text to identify here'));" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": 
{}, 833 | "source": [ 834 | "## Save your model" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": 16, 840 | "metadata": {}, 841 | "outputs": [], 842 | "source": [ 843 | "# Save the model weights\n", 844 | "model.save_weights('models/lang_identification_weights.h5')" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "##### That's all folks - don't forget to shut down your workspace once you're done 🙂" 852 | ] 853 | } 854 | ], 855 | "metadata": { 856 | "kernelspec": { 857 | "display_name": "Python 2", 858 | "language": "python", 859 | "name": "python2" 860 | }, 861 | "language_info": { 862 | "codemirror_mode": { 863 | "name": "ipython", 864 | "version": 2 865 | }, 866 | "file_extension": ".py", 867 | "mimetype": "text/x-python", 868 | "name": "python", 869 | "nbconvert_exporter": "python", 870 | "pygments_lexer": "ipython2", 871 | "version": "2.7.10" 872 | } 873 | }, 874 | "nbformat": 4, 875 | "nbformat_minor": 2 876 | } 877 | --------------------------------------------------------------------------------