├── BERT.png ├── Modelling.ipynb ├── README.md ├── Regular expression.ipynb ├── RepresentationAnd Embedding.ipynb ├── Text_Normalization.ipynb ├── alice.txt ├── attention.png ├── biLM.jpg └── transformers.ipynb /BERT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yyanhui/Language-processing-basics/e361f95e89eb5bf8c33ee06d5c6562ba30811f24/BERT.png -------------------------------------------------------------------------------- /Modelling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f3e17361-0d1a-4481-be04-24f529a073b6", 6 | "metadata": {}, 7 | "source": [ 8 | "Language models take sequences of embedded vectors of arbitrary length as input, and the prediction objective depends on the history. Neural networks accept fixed-size inputs, so we need to decide how much context and history to feed in. One way of modelling is to input only a fixed window of history. 
The introduction of encoders and decoders:\n", 9 | "\n", 10 | "Encoder : sends the input through a smaller-than-necessary layer, forcing the neural network to find a small set of parameters whose intermediate activations approximate the output\n", 11 | "\n", 12 | "Decoder : a set of parameters that recovers information from those activations to produce the output" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "id": "95e07707-e9f6-4b56-b6a4-3b5105adae51", 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "## read the text file\n", 23 | "with open('Europarl-french-v7/europarl-en.txt', 'r') as file:\n", 24 | " lines_en = file.readlines()[:5000]\n", 25 | "with open('Europarl-french-v7/europarl-fr.txt', 'r') as file:\n", 26 | " lines_fr = file.readlines()[:5000]" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "f287f3cc-e3a9-48ca-bee1-2d631a8848fa", 32 | "metadata": {}, 33 | "source": [ 34 | "## RNN and LSTM\n", 35 | "RNN structures : handle histories of different lengths by processing each time slice recurrently\n", 36 | "![RNN](RNN.png)\n", 37 | "\n", 38 | "LSTM : instead of replacing the hidden state at each time slice, it adds a memory cell that decides which parts to forget and which to memorize\n", 39 | "![LSTM](LSTM.png)\n", 40 | "\n", 41 | "a. forget gate : $f = \\sigma(W_{x,f}x+b_{x,f}+W_{h,f}h+b_{h,f}) $ output in (0,1)\n", 42 | "\n", 43 | "b. input gate : $i = \\sigma(W_{x,i}x+b_{x,i}+W_{h,i}h+b_{h,i}) $ \n", 44 | "\n", 45 | "c. cell memory : $g = tanh(W_{x,g}x+b_{x,g}+W_{h,g}h+b_{h,g}) $ output in (-1,1)\n", 46 | "\n", 47 | "d. update cell state : $c_i = (f*c_{i-1})+(i*g)$\n", 48 | "\n", 49 | "e. output gate : $o = \\sigma(W_{x,o}x+b_{x,o}+W_{h,o}h+b_{h,o}) $ \n", 50 | "\n", 51 | "f. 
update hidden state: $h_i = o*tanh(c_i)$" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 43, 57 | "id": "c3e95975-f461-4ce5-b7a0-2550d079eb67", 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "import numpy as np\n", 62 | "import tensorflow as tf\n", 63 | "from transformers import BertTokenizer\n", 64 | "from tensorflow.keras.models import Sequential\n", 65 | "from tensorflow.keras.layers import SimpleRNN, LSTM, Dense, Input\n", 66 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", 67 | "\n", 68 | "# Create input sequences and labels for training\n", 69 | "tokenizer = BertTokenizer.from_pretrained(\"bert-base-cased\")\n", 70 | "input_sequences = []\n", 71 | "for line in lines_en[:500]:\n", 72 | " token_list = tokenizer.encode(line.replace(\"\\n\", \"\")[:512])\n", 73 | " input_sequences.append(token_list[:-2])\n", 74 | "\n", 75 | "max_sequence_length = max(len(seq) for seq in input_sequences)\n", 76 | "input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')\n", 77 | "\n", 78 | "X, y = input_sequences[:, :-1], input_sequences[:, -1]\n", 79 | "y = tf.keras.utils.to_categorical(y, num_classes=tokenizer.vocab_size)\n", 80 | "X_ = X.reshape((X.shape[0], 1, X.shape[1]))" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 15, 86 | "id": "4305543a-da8e-470c-9e70-9ad3e8da0f58", 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "(500, 1, 104)" 93 | ] 94 | }, 95 | "execution_count": 15, 96 | "metadata": {}, 97 | "output_type": "execute_result" 98 | } 99 | ], 100 | "source": [ 101 | "X_.shape" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 17, 107 | "id": "329aaffd-2186-4743-a727-5a420bbf7a94", 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "(500, 28996)" 114 | ] 115 | }, 116 | "execution_count": 17, 117 | "metadata": {}, 118 | "output_type": 
"execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "y.shape" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 47, 128 | "id": "d7e99b92-b0de-44d4-86f6-137a6360d993", 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "name": "stderr", 133 | "output_type": "stream", 134 | "text": [ 135 | "/opt/anaconda3/lib/python3.12/site-packages/keras/src/layers/rnn/rnn.py:200: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.\n", 136 | " super().__init__(**kwargs)\n" 137 | ] 138 | }, 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "" 143 | ] 144 | }, 145 | "execution_count": 47, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "# Build and train the SimpleRNN model\n", 152 | "model_rnn = Sequential()\n", 153 | "model_rnn.add(SimpleRNN(100, input_shape=(1, X_.shape[2])))\n", 154 | "model_rnn.add(Dense(tokenizer.vocab_size, activation='softmax'))\n", 155 | "model_rnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n", 156 | "model_rnn.fit(X_, y, epochs=100, verbose=0)\n", 157 | "\n", 158 | "# Build and train the LSTM model\n", 159 | "model_lstm = Sequential()\n", 160 | "model_lstm.add(LSTM(100, input_shape=(1, X_.shape[2])))\n", 161 | "model_lstm.add(Dense(tokenizer.vocab_size, activation='softmax'))\n", 162 | "model_lstm.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n", 163 | "model_lstm.fit(X_, y, epochs=100, verbose=0)\n" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 49, 169 | "id": "e8498e2f-e8e7-4d7d-8c77-b4dc2b840df9", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 
52ms/step\n", 177 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 11ms/step\n", 178 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 9ms/step\n", 179 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 10ms/step\n", 180 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 10ms/step\n", 181 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 43ms/step\n", 182 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 9ms/step\n", 183 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 9ms/step\n", 184 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 9ms/step\n", 185 | "\u001b[1m1/1\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 10ms/step\n", 186 | "Generated Text (SimpleRNN): There should be no confusion in this debate. As environmentalists, we do not want an programmes employment them ##ity report\n", 187 | "Generated Text (LSTM): There should be no confusion in this debate. 
As environmentalists, we do not want an safety Europe them citizens Europe\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "# Generate text using the trained models\n", 193 | "def generate_text(seed_text, model, max_sequence_len, num_words):\n", 194 | " for _ in range(num_words):\n", 195 | " token_list = tokenizer.encode(seed_text)\n", 196 | " token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')\n", 197 | " token_list = token_list.reshape((token_list.shape[0], 1, token_list.shape[1]))\n", 198 | " predicted = np.argmax(model.predict(token_list), axis=-1)\n", 199 | " output_word = tokenizer.decode(predicted)\n", 200 | " seed_text += \" \" + output_word\n", 201 | " return seed_text\n", 202 | "\n", 203 | "# Example of generating text with each model\n", 204 | "generated_text_rnn = generate_text(\"There should be no confusion in this debate. As environmentalists, we do not want an\", model_rnn, max_sequence_length, num_words=5)\n", 205 | "generated_text_lstm = generate_text(\"There should be no confusion in this debate. 
As environmentalists, we do not want an\", model_lstm, max_sequence_length, num_words=5)\n", 206 | "\n", 207 | "print(\"Generated Text (SimpleRNN):\", generated_text_rnn)\n", 208 | "print(\"Generated Text (LSTM):\", generated_text_lstm)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "id": "430984c2-68fd-4cf5-a425-e871fec12722", 214 | "metadata": {}, 215 | "source": [ 216 | "## Sequence to Sequence models\n", 217 | "When dealing with translation tasks, the input and output look like\n", 218 | "\n", 219 | "$input_i = SOSx_{i,1}x_{i,2}\\dots x_{i,n}EOS$\n", 220 | "\n", 221 | "$output_i = SOSy_{i,1}y_{i,2}\\dots y_{i,m}EOS$\n", 222 | "\n", 223 | "There is no one-to-one mapping between input and output words, the output can be of arbitrary length, and the entire context is needed for translation.\n", 224 | "\n", 225 | "A sequence-to-sequence model structure is therefore used: it encodes all the words until EOS is reached, and after decoding all the words an encoding layer puts all the results together\n", 226 | "![s2s](s2s.png)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 19, 232 | "id": "0c199d2b-3b9c-475f-9649-3ec57d40ed0a", 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "import numpy as np\n", 237 | "import keras\n", 238 | "import os\n", 239 | "from pathlib import Path" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 35, 245 | "id": "bf8bc333-36d8-4e87-b1e1-2ecca4758790", 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "batch_size = 64 # Batch size for training.\n", 250 | "epochs = 16 # Number of epochs to train for.\n", 251 | "latent_dim = 256 # Latent dimensionality of the encoding space.\n", 252 | "num_samples = 1000 # Number of samples to train on." 
253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 37, 258 | "id": "997ac799-c57b-4e96-8142-fdea9cd9eeb3", 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "name": "stdout", 263 | "output_type": "stream", 264 | "text": [ 265 | "Number of samples: 1000\n", 266 | "Number of unique input tokens: 86\n", 267 | "Number of unique output tokens: 99\n", 268 | "Max sequence length for inputs: 683\n", 269 | "Max sequence length for outputs: 877\n" 270 | ] 271 | } 272 | ], 273 | "source": [ 274 | "# Vectorize the data.\n", 275 | "input_texts = []\n", 276 | "target_texts = []\n", 277 | "input_characters = set()\n", 278 | "target_characters = set()\n", 279 | "\n", 280 | "for i in range(num_samples):\n", 281 | " input_text = lines_en[i]\n", 282 | " target_text = lines_fr[i]\n", 283 | " target_text = \"\\t\" + target_text + \"\\n\"\n", 284 | " input_texts.append(input_text)\n", 285 | " target_texts.append(target_text)\n", 286 | " for char in input_text:\n", 287 | " if char not in input_characters:\n", 288 | " input_characters.add(char)\n", 289 | " for char in target_text:\n", 290 | " if char not in target_characters:\n", 291 | " target_characters.add(char)\n", 292 | "\n", 293 | "input_characters = sorted(list(input_characters))\n", 294 | "target_characters = sorted(list(target_characters))\n", 295 | "num_encoder_tokens = len(input_characters)\n", 296 | "num_decoder_tokens = len(target_characters)\n", 297 | "max_encoder_seq_length = max([len(txt) for txt in input_texts])\n", 298 | "max_decoder_seq_length = max([len(txt) for txt in target_texts])\n", 299 | "\n", 300 | "print(\"Number of samples:\", len(input_texts))\n", 301 | "print(\"Number of unique input tokens:\", num_encoder_tokens)\n", 302 | "print(\"Number of unique output tokens:\", num_decoder_tokens)\n", 303 | "print(\"Max sequence length for inputs:\", max_encoder_seq_length)\n", 304 | "print(\"Max sequence length for outputs:\", max_decoder_seq_length)" 305 | ] 306 | }, 307 | { 308 | 
"cell_type": "code", 309 | "execution_count": 39, 310 | "id": "65a922e2-3cd9-4f7e-ae16-26183ae06ce9", 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])\n", 315 | "target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])\n", 316 | "\n", 317 | "encoder_input_data = np.zeros(\n", 318 | " (len(input_texts), max_encoder_seq_length, num_encoder_tokens),\n", 319 | " dtype=\"float32\",\n", 320 | ")\n", 321 | "decoder_input_data = np.zeros(\n", 322 | " (len(input_texts), max_decoder_seq_length, num_decoder_tokens),\n", 323 | " dtype=\"float32\",\n", 324 | ")\n", 325 | "decoder_target_data = np.zeros(\n", 326 | " (len(input_texts), max_decoder_seq_length, num_decoder_tokens),\n", 327 | " dtype=\"float32\",\n", 328 | ")\n", 329 | "\n", 330 | "for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):\n", 331 | " for t, char in enumerate(input_text):\n", 332 | " encoder_input_data[i, t, input_token_index[char]] = 1.0\n", 333 | " encoder_input_data[i, t + 1 :, input_token_index[\" \"]] = 1.0\n", 334 | " for t, char in enumerate(target_text):\n", 335 | " # decoder_target_data is ahead of decoder_input_data by one timestep\n", 336 | " decoder_input_data[i, t, target_token_index[char]] = 1.0\n", 337 | " if t > 0:\n", 338 | " # decoder_target_data will be ahead by one timestep\n", 339 | " # and will not include the start character.\n", 340 | " decoder_target_data[i, t - 1, target_token_index[char]] = 1.0\n", 341 | " decoder_input_data[i, t + 1 :, target_token_index[\" \"]] = 1.0\n", 342 | " decoder_target_data[i, t:, target_token_index[\" \"]] = 1.0" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 41, 348 | "id": "82680a1d-8762-44f5-918b-b025acb48363", 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# Define an input sequence and process it.\n", 353 | "encoder_inputs = 
keras.Input(shape=(None, num_encoder_tokens))\n", 354 | "encoder = keras.layers.LSTM(latent_dim, return_state=True)\n", 355 | "encoder_outputs, state_h, state_c = encoder(encoder_inputs)\n", 356 | "\n", 357 | "# We discard `encoder_outputs` and only keep the states.\n", 358 | "encoder_states = [state_h, state_c]\n", 359 | "\n", 360 | "# Set up the decoder, using `encoder_states` as initial state.\n", 361 | "decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))\n", 362 | "\n", 363 | "# We set up our decoder to return full output sequences,\n", 364 | "# and to return internal states as well. We don't use the\n", 365 | "# return states in the training model, but we will use them in inference.\n", 366 | "decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)\n", 367 | "decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)\n", 368 | "decoder_dense = keras.layers.Dense(num_decoder_tokens, activation=\"softmax\")\n", 369 | "decoder_outputs = decoder_dense(decoder_outputs)\n", 370 | "\n", 371 | "# Define the model that will turn\n", 372 | "# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`\n", 373 | "model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 43, 379 | "id": "4238ca01-375a-4e92-9059-61f920d846d8", 380 | "metadata": {}, 381 | "outputs": [ 382 | { 383 | "name": "stdout", 384 | "output_type": "stream", 385 | "text": [ 386 | "Epoch 1/16\n", 387 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.6275 - loss: 2.6168 - val_accuracy: 0.8410 - val_loss: 0.9426\n", 388 | "Epoch 2/16\n", 389 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m30s\u001b[0m 2s/step - accuracy: 0.8151 - loss: 1.0525 - val_accuracy: 0.8409 - val_loss: 0.9267\n", 390 | "Epoch 3/16\n", 391 | 
"\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8183 - loss: 1.0300 - val_accuracy: 0.8409 - val_loss: 0.8873\n", 392 | "Epoch 4/16\n", 393 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8169 - loss: 1.0070 - val_accuracy: 0.8409 - val_loss: 0.8595\n", 394 | "Epoch 5/16\n", 395 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8182 - loss: 0.9703 - val_accuracy: 0.8405 - val_loss: 0.9735\n", 396 | "Epoch 6/16\n", 397 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8213 - loss: 0.9795 - val_accuracy: 0.8409 - val_loss: 0.7349\n", 398 | "Epoch 7/16\n", 399 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m30s\u001b[0m 2s/step - accuracy: 0.8169 - loss: 0.9208 - val_accuracy: 0.8407 - val_loss: 0.7991\n", 400 | "Epoch 8/16\n", 401 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m30s\u001b[0m 2s/step - accuracy: 0.8141 - loss: 0.8940 - val_accuracy: 0.8404 - val_loss: 0.6042\n", 402 | "Epoch 9/16\n", 403 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m30s\u001b[0m 2s/step - accuracy: 0.8155 - loss: 0.7221 - val_accuracy: 0.8408 - val_loss: 0.9141\n", 404 | "Epoch 10/16\n", 405 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8228 - loss: 0.8052 - val_accuracy: 0.8405 - val_loss: 0.5939\n", 406 | "Epoch 11/16\n", 407 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8189 - loss: 0.6609 - val_accuracy: 0.8339 - val_loss: 0.6076\n", 408 | "Epoch 12/16\n", 409 
| "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8103 - loss: 0.9425 - val_accuracy: 0.8405 - val_loss: 0.5868\n", 410 | "Epoch 13/16\n", 411 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8163 - loss: 0.6757 - val_accuracy: 0.8405 - val_loss: 0.5785\n", 412 | "Epoch 14/16\n", 413 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m30s\u001b[0m 2s/step - accuracy: 0.8134 - loss: 0.6801 - val_accuracy: 0.8406 - val_loss: 0.5760\n", 414 | "Epoch 15/16\n", 415 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m31s\u001b[0m 2s/step - accuracy: 0.8135 - loss: 0.6676 - val_accuracy: 0.8332 - val_loss: 0.5806\n", 416 | "Epoch 16/16\n", 417 | "\u001b[1m13/13\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m30s\u001b[0m 2s/step - accuracy: 0.8113 - loss: 0.7723 - val_accuracy: 0.8390 - val_loss: 0.5860\n" 418 | ] 419 | }, 420 | { 421 | "data": { 422 | "text/plain": [ 423 | "" 424 | ] 425 | }, 426 | "execution_count": 43, 427 | "metadata": {}, 428 | "output_type": "execute_result" 429 | } 430 | ], 431 | "source": [ 432 | "model.compile(\n", 433 | " optimizer=\"rmsprop\", loss=\"categorical_crossentropy\", metrics=[\"accuracy\"]\n", 434 | ")\n", 435 | "model.fit(\n", 436 | " [encoder_input_data, decoder_input_data],\n", 437 | " decoder_target_data,\n", 438 | " batch_size=batch_size,\n", 439 | " epochs=epochs,\n", 440 | " validation_split=0.2,\n", 441 | ")\n", 442 | "# Save model\n", 443 | "# model.save(\"s2s_model.keras\")" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 61, 449 | "id": "40668421-3b25-40a4-809f-6a74d18d5598", 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "name": "stdout", 454 | "output_type": "stream", 455 | "text": [ 456 | "-\n", 457 | "Input 
sentence: Are state aid to business or inter-company agreements legitimate in a market economy, and who must supervise these exceptions to the absolute rules of the market economy?\n", 458 | "\n", 459 | "Decoded sentence: eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee\n" 460 | ] 461 | } 462 | ], 463 | "source": [ 464 | "encoder_inputs = model.input[0] # input_1\n", 465 | "encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output # lstm_1\n", 466 | "encoder_states = [state_h_enc, state_c_enc]\n", 467 | "encoder_model = keras.Model(encoder_inputs, encoder_states)\n", 468 | "\n", 469 | "decoder_inputs = model.input[1] # input_2\n", 470 | "decoder_state_input_h = keras.Input(shape=(latent_dim,))\n", 471 | "decoder_state_input_c = keras.Input(shape=(latent_dim,))\n", 472 | "decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]\n", 473 | "decoder_lstm = model.layers[3]\n", 474 | "decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(\n", 475 | " decoder_inputs, initial_state=decoder_states_inputs\n", 476 | ")\n", 477 | "decoder_states = [state_h_dec, state_c_dec]\n", 478 | "decoder_dense = model.layers[4]\n", 479 | 
"decoder_outputs = decoder_dense(decoder_outputs)\n", 480 | "decoder_model = keras.Model(\n", 481 | " [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states\n", 482 | ")\n", 483 | "\n", 484 | "## translation with this model\n", 485 | "reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())\n", 486 | "reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())\n", 487 | "\n", 488 | "\n", 489 | "def decode_sequence(input_seq):\n", 490 | " # Encode the input as state vectors.\n", 491 | " states_value = encoder_model.predict(input_seq, verbose=0)\n", 492 | " \n", 493 | " # Generate empty target sequence of length 1.\n", 494 | " target_seq = np.zeros((1, 1, num_decoder_tokens))\n", 495 | " # Populate the first character of target sequence with the start character.\n", 496 | " target_seq[0, 0, target_token_index[\"\\t\"]] = 1.0\n", 497 | "\n", 498 | " # Sampling loop for a batch of sequences\n", 499 | " # (to simplify, here we assume a batch of size 1).\n", 500 | " stop_condition = False\n", 501 | " decoded_sentence = \"\"\n", 502 | " while not stop_condition:\n", 503 | " output_tokens, h, c = decoder_model.predict(\n", 504 | " [target_seq] + states_value, verbose=0\n", 505 | " )\n", 506 | "\n", 507 | " # Sample a token\n", 508 | " sampled_token_index = np.argmax(output_tokens[0, -1, :])\n", 509 | " sampled_char = reverse_target_char_index[sampled_token_index]\n", 510 | " decoded_sentence += sampled_char\n", 511 | "\n", 512 | " # Exit condition: either hit max length\n", 513 | " # or find stop character.\n", 514 | " if sampled_char == \"\\n\" or len(decoded_sentence) > max_decoder_seq_length:\n", 515 | " stop_condition = True\n", 516 | "\n", 517 | " # Update the target sequence (of length 1).\n", 518 | " target_seq = np.zeros((1, 1, num_decoder_tokens))\n", 519 | " target_seq[0, 0, sampled_token_index] = 1.0\n", 520 | "\n", 521 | " # Update states\n", 522 | " states_value = [h, c]\n", 523 | " 
return decoded_sentence\n", 524 | "\n", 525 | "\n", 526 | "# Take one sequence \n", 527 | "# for trying out decoding.\n", 528 | "input_seq = encoder_input_data[num_samples - 1:num_samples]\n", 529 | "decoded_sentence = decode_sequence(input_seq)\n", 530 | "print(\"-\")\n", 531 | "print(\"Input sentence:\", input_texts[num_samples - 1])\n", 532 | "print(\"Decoded sentence:\", decoded_sentence)" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "id": "eca2a9a9-609f-4e7c-8bb9-c22836a0a008", 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [] 542 | } 543 | ], 544 | "metadata": { 545 | "kernelspec": { 546 | "display_name": "Python 3 (ipykernel)", 547 | "language": "python", 548 | "name": "python3" 549 | }, 550 | "language_info": { 551 | "codemirror_mode": { 552 | "name": "ipython", 553 | "version": 3 554 | }, 555 | "file_extension": ".py", 556 | "mimetype": "text/x-python", 557 | "name": "python", 558 | "nbconvert_exporter": "python", 559 | "pygments_lexer": "ipython3", 560 | "version": "3.12.4" 561 | } 562 | }, 563 | "nbformat": 4, 564 | "nbformat_minor": 5 565 | } 566 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Here's a brief walkthrough of basic language modelling. The content has four main parts. 2 | 3 | 1. Word normalization: preprocessing words into more standard forms. 4 | 2. Representation: embedding the words into vectors. 5 | 3. Modelling: recurrent models and sequence-to-sequence models. 6 | 4. Transformers: the attention mechanism and the architecture of transformers. 
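
As a rough illustration of the first two parts, here is a minimal sketch that normalizes a few sentences and turns them into binary bag-of-words vectors. It uses only the Python standard library; the helper names `normalize` and `bag_of_words` and the toy sentences are assumptions made for this example, not code taken from the notebooks.

```python
import re
from collections import Counter


def normalize(sentence):
    """Lowercase, replace non-word characters with spaces, collapse whitespace."""
    sentence = sentence.lower()
    sentence = re.sub(r"\W", " ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def bag_of_words(sentences, vocab_size):
    """Return the most frequent words and one binary vector per sentence."""
    tokenized = [normalize(s).split() for s in sentences]
    counts = Counter(word for words in tokenized for word in words)
    vocab = [word for word, _ in counts.most_common(vocab_size)]
    vectors = []
    for words in tokenized:
        present = set(words)
        # 1 if the vocabulary word occurs in this sentence, 0 otherwise
        vectors.append([1 if word in present else 0 for word in vocab])
    return vocab, vectors


# Hypothetical toy data, only for illustration
toy_sentences = ["The cat sat.", "The dog sat!", "A cat ran?"]
vocab, vectors = bag_of_words(toy_sentences, vocab_size=3)
print(vocab)
print(vectors)
```

The notebooks do the same kind of preprocessing with `nltk` and `heapq` on real corpora; this version trades that for a dependency-free toy example.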
7 | -------------------------------------------------------------------------------- /Regular expression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c20c2c81-1624-4c97-8ef7-390f0bf8e521", 6 | "metadata": {}, 7 | "source": [ 8 | "## Character classes\n", 9 | "\\s : whitespace characters\n", 10 | "\n", 11 | "\\S : non-whitespace characters\n", 12 | "\n", 13 | "\\d : digits\n", 14 | "\n", 15 | "\\D : non-digits\n", 16 | "\n", 17 | "\\w : any word character\n", 18 | "\n", 19 | "\\W : non-word character\n", 20 | "\n", 21 | "\\b : word boundary" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "aade9e67-d098-4bf7-a767-3dd1c38a7157", 27 | "metadata": {}, 28 | "source": [ 29 | "## Symbols\n", 30 | "#### repeaters\n", 31 | "\\* : 0 or more repetitions\n", 32 | "\n", 33 | "\\+ : 1 or more repetitions\n", 34 | "\n", 35 | "{number} : exactly number repetitions; {2,} : 2 or more; {2,5} : 2 to 5 times\n", 36 | "\n", 37 | "#### optional and wildcard\n", 38 | "\n", 39 | "? : the preceding item is optional (0 or 1 times)\n", 40 | "\n", 41 | ". 
: any character\n", 42 | "\n", 43 | "#### positions\n", 44 | "\n", 45 | "^ : beginning of the string or line\n", 46 | "\n", 47 | "$ : end of the string or line\n", 48 | "\n", 49 | "\\number : backreference, matches the same text as group number\n", 50 | "\n", 51 | "#### range\n", 52 | "[a-z] : lowercase letters\n", 53 | "\n", 54 | "[A-Z] : uppercase letters\n", 55 | "\n", 56 | "[0-9] : digits\n", 57 | "\n", 58 | "() : groups\n", 59 | "\n", 60 | "[^...] : negation, any character not listed\n", 61 | "\n", 62 | "(a|b|c) : a or b or c\n", 63 | "\n", 64 | "#### comments\n", 65 | "\n", 66 | "(?#...) : inline comment\n", 67 | "\n", 68 | "\\# : end-of-line comment (in verbose mode)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "e9cf10d8-dfc9-41c5-968f-c0271629a887", 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [] 78 | } 79 | ], 80 | "metadata": { 81 | "kernelspec": { 82 | "display_name": "Python 3 (ipykernel)", 83 | "language": "python", 84 | "name": "python3" 85 | }, 86 | "language_info": { 87 | "codemirror_mode": { 88 | "name": "ipython", 89 | "version": 3 90 | }, 91 | "file_extension": ".py", 92 | "mimetype": "text/x-python", 93 | "name": "python", 94 | "nbconvert_exporter": "python", 95 | "pygments_lexer": "ipython3", 96 | "version": "3.12.4" 97 | } 98 | }, 99 | "nbformat": 4, 100 | "nbformat_minor": 5 101 | } 102 | -------------------------------------------------------------------------------- /RepresentationAnd Embedding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "df4826be-2d80-4d9a-9521-54092a018867", 6 | "metadata": {}, 7 | "source": [ 8 | "# Representation\n", 9 | "Converts words to numerical vectors" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "df247f1d-2125-40f8-9299-b2c789915e06", 15 | "metadata": { 16 | "jp-MarkdownHeadingCollapsed": true 17 | }, 18 | "source": [ 19 | "### One-hot Encoding (Dummy Encoding)\n", 20 | "Converts categories into multiple binary columns where only one bit is 
active (1) per entry.\n", 21 | "\n", 22 | "PROS: turns categorical data into numbers without imposing an order (for example, encoding categories as 1, 2, 3, 4 would wrongly suggest an ordering)\n", 23 | "\n", 24 | "CONS: higher dimensionality, sparse observations in some dimensions, and overfitting when there are too many categories" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 7, 30 | "id": "340958c7-349d-405c-b1ca-6cb98730714e", 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "name": "stdout", 35 | "output_type": "stream", 36 | "text": [ 37 | "Original Data:\n", 38 | " Employee id Gender Remarks\n", 39 | "0 10 M Good\n", 40 | "1 20 F Nice\n", 41 | "2 15 F Good\n", 42 | "3 25 M Great\n", 43 | "4 30 F Nice\n", 44 | "\n", 45 | "One-Hot Encoded Data using Pandas:\n", 46 | " Employee id Gender_M Remarks_Great Remarks_Nice\n", 47 | "0 10 True False False\n", 48 | "1 20 False False True\n", 49 | "2 15 False False False\n", 50 | "3 25 True True False\n", 51 | "4 30 False False True\n", 52 | "\n", 53 | "One-Hot Encoded Data using Scikit-Learn:\n", 54 | " Employee id Gender_F Gender_M Remarks_Good Remarks_Great Remarks_Nice\n", 55 | "0 10 0.0 1.0 1.0 0.0 0.0\n", 56 | "1 20 1.0 0.0 0.0 0.0 1.0\n", 57 | "2 15 1.0 0.0 1.0 0.0 0.0\n", 58 | "3 25 0.0 1.0 0.0 1.0 0.0\n", 59 | "4 30 1.0 0.0 0.0 0.0 1.0\n", 60 | "\n" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "import pandas as pd\n", 66 | "from sklearn.preprocessing import OneHotEncoder\n", 67 | "data = {\n", 68 | " 'Employee id': [10, 20, 15, 25, 30],\n", 69 | " 'Gender': ['M', 'F', 'F', 'M', 'F'],\n", 70 | " 'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']\n", 71 | "}\n", 72 | "\n", 73 | "df = pd.DataFrame(data)\n", 74 | "print(f\"Original Data:\\n{df}\\n\")\n", 75 | "# Use pd.get_dummies() to one-hot encode the categorical columns\n", 76 | "df_pandas_encoded = pd.get_dummies(df, columns=['Gender', 'Remarks'], drop_first=True)\n", 77 | "print(f\"One-Hot Encoded Data using Pandas:\\n{df_pandas_encoded}\\n\")\n", 78 | "\n", 79 | "encoder = 
OneHotEncoder(sparse_output=False)\n", 80 | "categorical_columns = ['Gender', 'Remarks']\n", 81 | "one_hot_encoded = encoder.fit_transform(df[categorical_columns])\n", 82 | "one_hot_df = pd.DataFrame(one_hot_encoded, \n", 83 | " columns=encoder.get_feature_names_out(categorical_columns))\n", 84 | "\n", 85 | "df_sklearn_encoded = pd.concat([df.drop(categorical_columns, axis=1), one_hot_df], axis=1)\n", 86 | "\n", 87 | "print(f\"One-Hot Encoded Data using Scikit-Learn:\\n{df_sklearn_encoded}\\n\")\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "d9fccfa5-1f1e-4611-b470-ddb2f4380567", 93 | "metadata": {}, 94 | "source": [ 95 | "### Bag of Words\n", 96 | "\n", 97 | "step1 : preprocess the text into a list of words\n", 98 | "\n", 99 | "step2 : count the frequency of each word and select the n most frequent words\n", 100 | "\n", 101 | "step3 : build a binary vector for each sentence, with 1 at the position of each frequent word that appears in the sentence and 0 otherwise" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 1, 107 | "id": "0fde6258-a653-499c-813c-e75d2e4b96db", 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "## read the text file\n", 112 | "with open('de-en.txt', 'r') as file:\n", 113 | " lines = file.readlines()" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "id": "c105b384-e3ae-43b0-a5d7-d4f6d26798d3", 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "name": "stdout", 124 | "output_type": "stream", 125 | "text": [ 126 | "['wiederaufnahme der sitzungsperiode ich erkläre die am freitag dem 17', 'dezember unterbrochene sitzungsperiode des europäischen parlaments für wiederaufgenommen wünsche ihnen nochmals alles gute zum jahreswechsel und hoffe daß sie schöne ferien hatten', 'wie sie feststellen konnten ist der gefürchtete millenium bug nicht eingetreten', 'doch sind bürger einiger unserer mitgliedstaaten opfer von schrecklichen naturkatastrophen geworden', 'im parlament besteht der wunsch nach einer 
aussprache im verlauf dieser sitzungsperiode in den nächsten tagen', 'heute möchte ich sie bitten das ist auch der wunsch einiger kolleginnen und kollegen allen opfern der stürme insbesondere in den verschiedenen ländern der europäischen union in einer schweigeminute zu gedenken', 'ich bitte sie sich zu einer schweigeminute zu erheben', 'das parlament erhebt sich zu einer schweigeminute', 'frau präsidentin zur geschäftsordnung', 'wie sie sicher aus der presse und dem fernsehen wissen gab es in sri lanka mehrere bombenexplosionen mit zahlreichen toten']\n" 127 | ] 128 | } 129 | ], 130 | "source": [ 131 | "import nltk \n", 132 | "import re \n", 133 | "import numpy as np \n", 134 | " \n", 135 | "## preprocessing \n", 136 | "dataset = nltk.sent_tokenize(''.join(lines))\n", 137 | "for i in range(len(dataset)):\n", 138 | " dataset[i] = dataset[i].lower() # Convert to lowercase\n", 139 | " dataset[i] = re.sub(r'\\W', ' ', dataset[i]) # Remove non-word characters\n", 140 | " dataset[i] = re.sub(r'\\s+', ' ', dataset[i]).strip() # Remove extra spaces\n", 141 | "\n", 142 | "# Output cleaned sentences\n", 143 | "print(dataset[:10]) " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 7, 149 | "id": "18ad339c-570e-4157-a185-c086890a0e9b", 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "## frequency\n", 154 | "word2count = {} \n", 155 | "for data in dataset: \n", 156 | " words = nltk.word_tokenize(data) \n", 157 | " for word in words: \n", 158 | " if word not in word2count.keys(): \n", 159 | " word2count[word] = 1\n", 160 | " else: \n", 161 | " word2count[word] += 1" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 15, 167 | "id": "7e8a99d2-e5eb-4a50-9e9b-9ebf93343b1e", 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "## results\n", 172 | "import heapq \n", 173 | "freq_words = heapq.nlargest(200, word2count, key=word2count.get)\n", 174 | "BoW = [] \n", 175 | "for data in dataset[:500]: 
\n", 176 | " vector = [] \n", 177 | " for word in freq_words: \n", 178 | " if word in nltk.word_tokenize(data): \n", 179 | " vector.append(1) \n", 180 | " else: \n", 181 | " vector.append(0) \n", 182 | " BoW.append(vector) \n", 183 | "BoW = np.asarray(BoW) " 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 21, 189 | "id": "8bbbcf00-5a8b-4d01-beb3-bd10332596af", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "['wiederaufnahme der sitzungsperiode ich erkläre die am freitag dem 17',\n", 196 | " 'dezember unterbrochene sitzungsperiode des europäischen parlaments für wiederaufgenommen wünsche ihnen nochmals alles gute zum jahreswechsel und hoffe daß sie schöne ferien hatten',\n", 197 | " 'wie sie feststellen konnten ist der gefürchtete millenium bug nicht eingetreten']" 198 | ] 199 | }, 200 | "execution_count": 21, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "dataset[:3]" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 23, 212 | "id": "8c8117fc-3746-4d00-858d-be07d99e7182", 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 219 | " 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 220 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 221 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 222 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 223 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 224 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 225 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 226 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 227 | " 0, 0],\n", 228 | " [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,\n", 229 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,\n", 230 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,\n", 231 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 232 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,\n", 233 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 234 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 235 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 236 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 237 | " 0, 0],\n", 238 | " [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,\n", 239 | " 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 240 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 241 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 242 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 243 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 244 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 245 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 246 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 247 | " 0, 0]])" 248 | ] 249 | }, 250 | "execution_count": 23, 251 | "metadata": {}, 252 | "output_type": "execute_result" 253 | } 254 | ], 255 | "source": [ 256 | "BoW[:3]" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "id": "78551ad2-4645-43fb-9c78-6ccfec416900", 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "id": "93e179e3-18e4-4980-8968-5e284a59c99c", 270 | "metadata": {}, 271 | "source": [ 272 | "## N-gram\n", 273 | "Contiguous sequence of n items(characters,words,sub-words) from a given 
sample of text or speech. An N-gram language model predicts the probability of a given N-gram within any sequence of words, e.g. the probability of 'python' being the next term given the previous sequence ['Natural', 'language', 'processing', 'in']\n", 274 | "\n", 275 | "$P(w_i|w_1,w_2,...,w_{i-1}) = \\frac{P(w_1,w_2,...,w_i)}{P(w_1,w_2,...,w_{i-1})}$\n", 276 | "\n", 277 | "$\\approx \\frac{P(w_{i-n+1},...,w_{i-1},w_i)}{P(w_{i-n+1},...,w_{i-1})}$ (only the previous $n-1$ words are kept as context)" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 26, 283 | "id": "b250cee5-dc76-4775-b30b-93e93289f6db", 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "import nltk\n", 288 | "from nltk import bigrams, trigrams\n", 289 | "from nltk.corpus import reuters\n", 290 | "from collections import defaultdict" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 38, 296 | "id": "da1202c9-5365-45d6-92a3-d04995303073", 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "name": "stdout", 301 | "output_type": "stream", 302 | "text": [ 303 | "Next Word: No prediction available\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "# Join the lines into a single string\n", 309 | "text = ' '.join(lines)\n", 310 | "# Tokenize the text\n", 311 | "words = nltk.word_tokenize(text)\n", 312 | "# Create trigrams\n", 313 | "tri_grams = list(trigrams(words))\n", 314 | "\n", 315 | "# Build a trigram model\n", 316 | "model = defaultdict(lambda: defaultdict(lambda: 0))\n", 317 | "# Count frequency of co-occurrence\n", 318 | "for w1, w2, w3 in tri_grams:\n", 319 | " model[(w1, w2)][w3] += 1\n", 320 | "\n", 321 | "# Transform the counts into probabilities\n", 322 | "for w1_w2 in model:\n", 323 | " total_count = float(sum(model[w1_w2].values()))\n", 324 | " for w3 in model[w1_w2]:\n", 325 | " model[w1_w2][w3] /= total_count\n", 326 | "\n", 327 | "# Function to predict the next word\n", 328 | "def predict_next_word(w1, w2):\n", 329 | " next_word = model[w1, w2]\n", 330 | " if 
next_word:\n", 331 | " predicted_word = max(next_word, key=next_word.get) # Choose the most likely next word\n", 332 | " return predicted_word\n", 333 | " else:\n", 334 | " return \"No prediction available\"\n", 335 | "\n", 336 | "# Example usage\n", 337 | "print(\"Next Word:\", predict_next_word('Hund', 'läuft'))\n" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 36, 343 | "id": "2e8db58a-bcf3-4caf-b431-fe54936affe5", 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "name": "stdout", 348 | "output_type": "stream", 349 | "text": [ 350 | "Next Word: heart\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "print(\"Next Word:\", predict_next_word('not', 'my'))" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "id": "5ca4cb3e-9386-41c0-b017-8eb9ce02a216", 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "print(\"Next Word:\", predict_next_word('not', 'my'))" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 41, 371 | "id": "60b74938-8d98-4e67-a579-2cd6a0a18179", 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "data": { 376 | "text/plain": [ 377 | "'(Die Sitzung wird um 10.50 Uhr geschlossen.)\\n'" 378 | ] 379 | }, 380 | "execution_count": 41, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "lines[-1]" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "364d013e-fdb8-442a-9dea-bb1c763053d0", 392 | "metadata": {}, 393 | "source": [ 394 | "### TF-IDF (Term Frequency-Inverse Document Frequency)\n", 395 | "evaluates the importance of a word in a document relative to a collection of documents\n", 396 | "\n", 397 | "Term Frequency = number of times term t appears in document d / total number of terms in document d\n", 398 | "\n", 399 | "Inverse Document Frequency = log(total number of documents / number of documents containing term t)\n", 400 | "\n", 401 | "Common but irrelevant words such as 'in' 
and 'out' can have a high TF, while IDF alone cannot reflect how important a term is within a specific document (e.g. keywords in papers). TF-IDF multiplies the two, balancing common and rare words to highlight the most meaningful terms." 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "id": "71725640-49b6-431a-9861-93ed3678ec30", 408 | "metadata": {}, 409 | "outputs": [], 410 | "source": [ 411 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 412 | "\n", 413 | "# assign documents\n", 414 | "d0 = 'Apples are fruits'\n", 415 | "d1 = 'Tomatoes are fruits'\n", 416 | "d2 = 'Tomatoes are also vegetables'\n", 417 | "\n", 418 | "# merge documents into a single corpus\n", 419 | "string = [d0, d1, d2]\n", 420 | "\n", 421 | "# get tf-idf values\n", 422 | "tfidf = TfidfVectorizer()\n", 423 | "result = tfidf.fit_transform(string)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "id": "73d531f1-a64e-484d-a66a-fb757cbd9444", 429 | "metadata": {}, 430 | "source": [ 431 | "the words 'apples', 'also' and 'vegetables' get the highest IDF because they appear in only one document; 'tomatoes' and 'fruits' appear in two of the three documents, so they score lower but still above 'are', which appears in every document" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 50, 437 | "id": "f6576c89-ee0d-4519-8d94-81a2264aef0f", 438 | "metadata": {}, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | "\n", 445 | "idf values:\n", 446 | "also : 1.6931471805599454\n", 447 | "apples : 1.6931471805599454\n", 448 | "are : 1.0\n", 449 | "fruits : 1.2876820724517808\n", 450 | "tomatoes : 1.2876820724517808\n", 451 | "vegetables : 1.6931471805599454\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "print('\\nidf values:')\n", 457 | "for word, idf_value in zip(tfidf.get_feature_names_out(), tfidf.idf_):\n", 458 | " print(word, ':', idf_value)" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "id": "f9bfc364-498e-497c-9894-7874e59b055f", 
464 | "metadata": {}, 465 | "source": [ 466 | "form of output in tf-idf values: (ith document, jth word) tf-idf" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 52, 472 | "id": "49f161f9-3846-4c5f-add1-f7035a1e97fe", 473 | "metadata": {}, 474 | "outputs": [ 475 | { 476 | "name": "stdout", 477 | "output_type": "stream", 478 | "text": [ 479 | "\n", 480 | "Word indexes:\n", 481 | "{'apples': 1, 'are': 2, 'fruits': 3, 'tomatoes': 4, 'also': 0, 'vegetables': 5}\n", 482 | "\n", 483 | "tf-idf value:\n", 484 | " (0, 3)\t0.5478321549274363\n", 485 | " (0, 2)\t0.4254405389711991\n", 486 | " (0, 1)\t0.7203334490549893\n", 487 | " (1, 4)\t0.6198053799406072\n", 488 | " (1, 3)\t0.6198053799406072\n", 489 | " (1, 2)\t0.48133416873660545\n", 490 | " (2, 5)\t0.5844829010200651\n", 491 | " (2, 0)\t0.5844829010200651\n", 492 | " (2, 4)\t0.444514311537431\n", 493 | " (2, 2)\t0.34520501686496574\n", 494 | "\n", 495 | "tf-idf values in matrix form:\n", 496 | "[[0. 0.72033345 0.42544054 0.54783215 0. 0. ]\n", 497 | " [0. 0. 0.48133417 0.61980538 0.61980538 0. ]\n", 498 | " [0.5844829 0. 0.34520502 0. 
0.44451431 0.5844829 ]]\n" 499 | ] 500 | } 501 | ], 502 | "source": [ 503 | "# get indexing\n", 504 | "print('\\nWord indexes:')\n", 505 | "print(tfidf.vocabulary_)\n", 506 | "\n", 507 | "# display tf-idf values\n", 508 | "print('\\ntf-idf value:')\n", 509 | "print(result)\n", 510 | "\n", 511 | "# in matrix form\n", 512 | "print('\\ntf-idf values in matrix form:')\n", 513 | "print(result.toarray())" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "id": "976989e7-22ba-49b9-8a1a-14dbf9a5ee12", 519 | "metadata": {}, 520 | "source": [ 521 | "# Embedding : trained representation models" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "id": "d0244a58-a4d2-4d45-8bca-2e283a5a45b5", 527 | "metadata": {}, 528 | "source": [ 529 | "### Word2Vec\n", 530 | "trained with skip-gram or CBOW (continuous bag of words), using a shallow neural network with a single hidden layer. Example sentence: 'The dog is running'\n", 531 | "\n", 532 | "Skip-gram : predict the context words from the target word: 'dog' >> 'the', 'is', 'running'\n", 533 | "\n", 534 | "CBOW : predict the target word from a given window of context: 'the', 'is', 'running' >> 'dog'\n", 535 | "\n", 536 | "PROS : semantic representation, distributional patterns reflect semantic similarities, allows vector arithmetic\n", 537 | "\n", 538 | "Application : machine translation, text classification, sentiment analysis, and information retrieval" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 1, 544 | "id": "d526ab3e-0d4a-4cc5-8037-4d4156197ce6", 545 | "metadata": {}, 546 | "outputs": [], 547 | "source": [ 548 | "import gensim\n", 549 | "from gensim.models import Word2Vec\n", 550 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 551 | "import warnings\n", 552 | "warnings.filterwarnings(action='ignore')\n", 553 | " \n", 554 | " \n", 555 | "sample = open(\"alice.txt\")\n", 556 | "s = sample.read()\n", 557 | "f = s.replace(\"\\n\", \" \")\n", 558 | " \n", 559 | "data = []\n", 560 | "for i in sent_tokenize(f):\n", 561 | " 
temp = []\n", 562 | " for j in word_tokenize(i):\n", 563 | " temp.append(j.lower())\n", 564 | " data.append(temp)" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 3, 570 | "id": "5f798c93-e635-4ea6-a7e2-6b02cebe2056", 571 | "metadata": {}, 572 | "outputs": [ 573 | { 574 | "name": "stdout", 575 | "output_type": "stream", 576 | "text": [ 577 | "Cosine similarity between 'alice' and 'wonderland' - CBOW : 0.98673964\n", 578 | "Cosine similarity between 'alice' and 'machines' - CBOW : 0.9174969\n" 579 | ] 580 | } 581 | ], 582 | "source": [ 583 | "# CBOW model\n", 584 | "model1 = gensim.models.Word2Vec(data, min_count=1,\n", 585 | " vector_size=100, window=5)\n", 586 | " \n", 587 | "print(\"Cosine similarity between 'alice' \" +\n", 588 | " \"and 'wonderland' - CBOW : \",\n", 589 | " model1.wv.similarity('alice', 'wonderland'))\n", 590 | " \n", 591 | "print(\"Cosine similarity between 'alice' \" +\n", 592 | " \"and 'machines' - CBOW : \",\n", 593 | " model1.wv.similarity('alice', 'machines'))\n", 594 | " " 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 5, 600 | "id": "63ce15d8-3af5-4543-93a4-2e5b3d705cea", 601 | "metadata": {}, 602 | "outputs": [ 603 | { 604 | "name": "stdout", 605 | "output_type": "stream", 606 | "text": [ 607 | "Cosine similarity between 'alice' and 'wonderland' - Skip Gram : 0.8623241\n", 608 | "Cosine similarity between 'alice' and 'machines' - Skip Gram : 0.87525654\n" 609 | ] 610 | } 611 | ], 612 | "source": [ 613 | "#Skip Gram model\n", 614 | "model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,\n", 615 | " window=5, sg=1)\n", 616 | "\n", 617 | "print(\"Cosine similarity between 'alice' \" +\n", 618 | " \"and 'wonderland' - Skip Gram : \",\n", 619 | " model2.wv.similarity('alice', 'wonderland'))\n", 620 | " \n", 621 | "print(\"Cosine similarity between 'alice' \" +\n", 622 | " \"and 'machines' - Skip Gram : \",\n", 623 | " model2.wv.similarity('alice', 'machines'))\n" 624 | 
] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "id": "12b58f6c-9101-4f87-8faa-d98cbb11466e", 629 | "metadata": {}, 630 | "source": [ 631 | "### GloVe\n", 632 | "unsupervised learning on the word co-occurrence statistics of a given corpus; training minimizes a weighted least-squares distance between the dot products of word embeddings and the logarithm of the co-occurrence counts (the co-occurrence matrix can be seen as a probability matrix)\n", 633 | "\n", 634 | "application : named entity recognition, translation, semantic search, word analogy, word clustering" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 11, 640 | "id": "9132a2c2-5a54-4e0d-9043-66d84c4cbda5", 641 | "metadata": {}, 642 | "outputs": [ 643 | { 644 | "name": "stdout", 645 | "output_type": "stream", 646 | "text": [ 647 | "Number of unique words in dictionary= 6\n", 648 | "Dictionary is = {'language': 1, 'text': 2, 'leader': 3, 'the': 4, 'prime': 5, 'natural': 6}\n", 649 | "Dense vector for first word is => [-5.79900026e-01 -1.10100001e-01 -1.15569997e+00 -2.99059995e-03\n", 650 | " -2.06129998e-01 4.52890009e-01 -1.66710004e-01 -1.03820002e+00\n", 651 | " -9.92410004e-01 3.98840010e-01 5.92299998e-01 2.29900002e-01\n", 652 | " 1.52129996e+00 -1.77640006e-01 -2.97259986e-01 -3.92349988e-01\n", 653 | " -7.84709990e-01 1.55939996e-01 6.90769970e-01 5.95369995e-01\n", 654 | " -4.43399996e-01 5.35139978e-01 3.28530014e-01 1.24370003e+00\n", 655 | " 1.29719996e+00 -1.38779998e+00 -1.09249997e+00 -4.09249991e-01\n", 656 | " -5.69710016e-01 -3.46560001e-01 3.71630001e+00 -1.04890001e+00\n", 657 | " -4.67079997e-01 -4.47389990e-01 6.22999994e-03 1.96490008e-02\n", 658 | " -4.01609987e-01 -6.29130006e-01 -8.25060010e-01 4.55909997e-01\n", 659 | " 8.26259971e-01 5.70909977e-01 2.11989999e-01 4.68650013e-01\n", 660 | " -6.00269973e-01 2.99199998e-01 6.79440022e-01 1.42379999e+00\n", 661 | " -3.21520008e-02 -1.26029998e-01]\n" 662 | ] 663 | } 664 | ], 665 | "source": [ 666 | "from tensorflow.keras.preprocessing.text import Tokenizer\n", 667 | "from 
tensorflow.keras.preprocessing.sequence import pad_sequences\n", 668 | "import numpy as np\n", 669 | " \n", 670 | "x = {'text', 'the', 'leader', 'prime', 'natural', 'language'}\n", 671 | " \n", 672 | "# create the dict.\n", 673 | "tokenizer = Tokenizer()\n", 674 | "tokenizer.fit_on_texts(x)\n", 675 | " \n", 676 | "# number of unique words in dict.\n", 677 | "print(\"Number of unique words in dictionary=\", \n", 678 | " len(tokenizer.word_index))\n", 679 | "print(\"Dictionary is = \", tokenizer.word_index)\n", 680 | " \n", 681 | "# download glove and unzip it in Notebook.\n", 682 | "#!wget http://nlp.stanford.edu/data/glove.6B.zip\n", 683 | "#!unzip glove*.zip\n", 684 | " \n", 685 | "def embedding_for_vocab(filepath, word_index,\n", 686 | " embedding_dim):\n", 687 | " vocab_size = len(word_index) + 1\n", 688 | " \n", 689 | " # Adding again 1 because of reserved 0 index\n", 690 | " embedding_matrix_vocab = np.zeros((vocab_size,\n", 691 | " embedding_dim))\n", 692 | " \n", 693 | " with open(filepath, encoding=\"utf8\") as f:\n", 694 | " for line in f:\n", 695 | " word, *vector = line.split()\n", 696 | " if word in word_index:\n", 697 | " idx = word_index[word]\n", 698 | " embedding_matrix_vocab[idx] = np.array(\n", 699 | " vector, dtype=np.float32)[:embedding_dim]\n", 700 | " \n", 701 | " return embedding_matrix_vocab\n", 702 | " \n", 703 | "# matrix for vocab: word_index\n", 704 | "embedding_dim = 50\n", 705 | "embedding_matrix_vocab = embedding_for_vocab('glove/glove.6B.50d.txt', tokenizer.word_index,embedding_dim)\n", 706 | " \n", 707 | "print(\"Dense vector for first word is => \",embedding_matrix_vocab[1])" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "id": "037bb3c0-f6d6-4450-81f3-4c7400419591", 713 | "metadata": {}, 714 | "source": [ 715 | "### fastText\n", 716 | "developed by Facebook's AI Research (FAIR) lab, instead of representing words as single entities, FastText breaks them down into smaller components called character n-grams. 
It is an extension of the Word2Vec model, particularly useful for handling languages with rich morphology or for tasks where out-of-vocabulary words are common\n", 717 | "\n", 718 | "For example, 'basketball' is represented by its character n-grams together with the whole word, and all of them are considered when building its vector" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": 13, 724 | "id": "fa746e85-8408-4ec8-9004-47521b11c9c5", 725 | "metadata": {}, 726 | "outputs": [ 727 | { 728 | "name": "stdout", 729 | "output_type": "stream", 730 | "text": [ 731 | "Most similar words to 'sunflower': [('response', 0.2742580771446228), ('time', 0.19048680365085602), ('computer', 0.18673823773860931), ('survey', 0.07351218163967133), ('system', 0.06112516671419144), ('interface', 0.03289278596639633), ('trees', 0.009829370304942131), ('human', -0.0001683309383224696), ('eps', -0.016743149608373642), ('user', -0.029780348762869835)]\n" 732 | ] 733 | } 734 | ], 735 | "source": [ 736 | "from gensim.models import FastText\n", 737 | "from gensim.test.utils import common_texts\n", 738 | "\n", 739 | "# Training FastText model\n", 740 | "corpus = common_texts\n", 741 | "model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4, sg=1)\n", 742 | "\n", 743 | "# Example usage: getting embeddings for a word\n", 744 | "word_embedding = model.wv['sunflower']\n", 745 | "similar_words = model.wv.most_similar('sunflower')\n", 746 | "print(\"Most similar words to 'sunflower':\", similar_words)" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "id": "1dbe3279-5d7c-4a4b-8019-39051df91d96", 752 | "metadata": {}, 753 | "source": [ 754 | "## pre-trained embedding" 755 | ] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "id": "40486d4c-2acb-4f78-abaa-20f148cdbb7f", 760 | "metadata": {}, 761 | "source": [ 762 | "### Embeddings from Language Models (ELMo)\n", 763 | "word vectors are calculated using a two-layer bidirectional language model (biLM). 
Using the complete sentence containing that word, ELMo captures the context of the word and can generate different embeddings for the same word used in a different context in different sentences." 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 11, 769 | "id": "89e09c63-d0f7-4295-83ab-ea297421d2c1", 770 | "metadata": {}, 771 | "outputs": [ 772 | { 773 | "data": { 774 | "image/jpeg": "", 775 | "text/plain": [ 776 | "" 777 | ] 778 | }, 779 | "execution_count": 11, 780 | "metadata": {}, 781 | "output_type": "execute_result" 782 | } 783 | ], 784 | "source": [ 785 | "from IPython.display import Image\n", 786 | "Image(filename='biLM.jpg')" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 16, 792 | "id": "83828987-8049-4346-8ef8-ad749ebbef1f", 793 | "metadata": {}, 794 | "outputs": [ 795 | { 796 | "name": "stdout", 797 | "output_type": "stream", 798 | "text": [ 799 | "Shape of the embeddings: (2, 6, 1024)\n", 800 | "Word embeddings for the word 'WATCH' in the first sentence:\n", 801 | "[ 0.54308295 -0.3439697 0.24777155 ... 0.652064 -0.75163364\n", 802 | " 0.6272552 ]\n", 803 | "Word embeddings for the word 'WATCH' in the second sentence:\n", 804 | "[-0.08213355 0.01050331 -0.01454124 ... 
0.4870539 -0.5445795\n", 805 | " 0.52623993]\n" 806 | ] 807 | } 808 | ], 809 | "source": [ 810 | "import tensorflow as tf\n", 811 | "import tensorflow_hub as hub\n", 812 | "\n", 813 | "# Load the ELMo model from TensorFlow Hub\n", 814 | "elmo = hub.load(\"https://tfhub.dev/google/elmo/3\")\n", 815 | "\n", 816 | "# Example sentences (no need to split the sentences into words)\n", 817 | "sentences = [\n", 818 | " \"I love to watch TV\",\n", 819 | " \"I am wearing a wrist watch\"\n", 820 | "]\n", 821 | "\n", 822 | "# Convert the sentences to a tensor\n", 823 | "input_tensor = tf.constant(sentences)\n", 824 | "\n", 825 | "# Get the ELMo embeddings for the sentences\n", 826 | "embeddings = elmo.signatures['default'](input_tensor)['elmo']\n", 827 | "\n", 828 | "# The shape of the output embeddings will be [batch_size, sequence_length, embedding_dim]\n", 829 | "# You can get the embedding for each word in the sentence\n", 830 | "print(\"Shape of the embeddings:\", embeddings.shape)\n", 831 | "\n", 832 | "print(\"Word embeddings for the word 'WATCH' in the first sentence:\")\n", 833 | "print(embeddings[0][3].numpy()) # index 3 (0-based) is the word \"watch\" in the first sentence\n", 834 | "\n", 835 | "print(\"Word embeddings for the word 'WATCH' in the second sentence:\")\n", 836 | "print(embeddings[1][5].numpy()) # index 5 (0-based) is the word \"watch\" in the second sentence" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "id": "58f882cc-8ccd-4f6f-aee0-5ffeaa72e979", 842 | "metadata": {}, 843 | "source": [ 844 | "### BERT(Bidirectional Encoder Representations from Transformers)\n", 845 | "Encoder-only bidirectional transformer architecture\n", 846 | "\n", 847 | "Pretraining : 1. Masked Language Model (MLM): some input tokens are masked and BERT predicts them from both the left and right context, using a prediction layer on top of the encoder output. 2. 
Next Sentence Prediction (NSP) : guessing if one sentence follows another\n", 848 | "\n", 849 | "Fine-tuning : supervised learning on labeled data" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 19, 855 | "id": "6d4ad01d-29a3-4a34-99fd-2bce322f8031", 856 | "metadata": {}, 857 | "outputs": [ 858 | { 859 | "data": { 860 | "application/vnd.jupyter.widget-view+json": { 861 | "model_id": "8eec2d290efc4595bb9bbcffc9ded4b3", 862 | "version_major": 2, 863 | "version_minor": 0 864 | }, 865 | "text/plain": [ 866 | "tokenizer_config.json: 0%| | 0.00/49.0 [00:00\n", 341 | " bos_piece: \n", 342 | " eos_piece: \n", 343 | " pad_piece: \n", 344 | " unk_surface: ⁇ \n", 345 | " enable_differential_privacy: 0\n", 346 | " differential_privacy_noise_level: 0\n", 347 | " differential_privacy_clipping_threshold: 0\n", 348 | "}\n", 349 | "normalizer_spec {\n", 350 | " name: nmt_nfkc\n", 351 | " add_dummy_prefix: 1\n", 352 | " remove_extra_whitespaces: 1\n", 353 | " escape_whitespaces: 1\n", 354 | " normalization_rule_tsv: \n", 355 | "}\n", 356 | "denormalizer_spec {}\n", 357 | "trainer_interface.cc(353) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.\n", 358 | "trainer_interface.cc(185) LOG(INFO) Loading corpus: de-en.txt\n", 359 | "trainer_interface.cc(147) LOG(INFO) Loaded 1000000 lines\n", 360 | "trainer_interface.cc(124) LOG(WARNING) Too many sentences are loaded! 
(1917286), which may slow down training.\n", 361 | "trainer_interface.cc(126) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.\n", 362 | "trainer_interface.cc(129) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.\n", 363 | "trainer_interface.cc(409) LOG(INFO) Loaded all 1917286 sentences\n", 364 | "trainer_interface.cc(425) LOG(INFO) Adding meta_piece: \n", 365 | "trainer_interface.cc(425) LOG(INFO) Adding meta_piece: \n", 366 | "trainer_interface.cc(425) LOG(INFO) Adding meta_piece: \n", 367 | "trainer_interface.cc(430) LOG(INFO) Normalizing sentences...\n", 368 | "trainer_interface.cc(539) LOG(INFO) all chars count=322814382\n", 369 | "trainer_interface.cc(550) LOG(INFO) Done: 99.9565% characters are covered.\n", 370 | "trainer_interface.cc(560) LOG(INFO) Alphabet size=78\n", 371 | "trainer_interface.cc(561) LOG(INFO) Final character coverage=0.999565\n", 372 | "trainer_interface.cc(592) LOG(INFO) Done! preprocessed 1917286 sentences.\n", 373 | "unigram_model_trainer.cc(265) LOG(INFO) Making suffix array...\n", 374 | "unigram_model_trainer.cc(269) LOG(INFO) Extracting frequent sub strings... node_num=171000070\n", 375 | "unigram_model_trainer.cc(312) LOG(INFO) Initialized 1000078 seed sentencepieces\n", 376 | "trainer_interface.cc(598) LOG(INFO) Tokenizing input sentences with whitespace: 1917286\n", 377 | "trainer_interface.cc(609) LOG(INFO) Done! 
636628\n", 378 | "unigram_model_trainer.cc(602) LOG(INFO) Using 636628 sentences for EM training\n", 379 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=262286 obj=11.5537 num_tokens=1480006 num_tokens/piece=5.64272\n", 380 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=192464 obj=8.56834 num_tokens=1484136 num_tokens/piece=7.71124\n", 381 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=144182 obj=8.53139 num_tokens=1521601 num_tokens/piece=10.5533\n", 382 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=143918 obj=8.52441 num_tokens=1521914 num_tokens/piece=10.5749\n", 383 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=107917 obj=8.54047 num_tokens=1598786 num_tokens/piece=14.815\n", 384 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=107903 obj=8.53847 num_tokens=1598825 num_tokens/piece=14.8172\n", 385 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=80919 obj=8.56552 num_tokens=1688640 num_tokens/piece=20.8683\n", 386 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=80916 obj=8.56284 num_tokens=1688832 num_tokens/piece=20.8714\n", 387 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=60684 obj=8.60481 num_tokens=1793327 num_tokens/piece=29.5519\n", 388 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=60681 obj=8.60143 num_tokens=1793295 num_tokens/piece=29.5528\n", 389 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=45507 obj=8.66115 num_tokens=1909712 num_tokens/piece=41.9652\n", 390 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=45507 obj=8.65248 num_tokens=1909647 num_tokens/piece=41.9638\n", 391 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=34129 obj=8.73865 num_tokens=2036594 num_tokens/piece=59.6734\n", 392 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=34127 obj=8.73085 num_tokens=2036564 num_tokens/piece=59.676\n", 393 | "unigram_model_trainer.cc(618) LOG(INFO) 
EM sub_iter=0 size=25595 obj=8.84363 num_tokens=2179325 num_tokens/piece=85.1465\n", 394 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=25594 obj=8.83303 num_tokens=2179339 num_tokens/piece=85.1504\n", 395 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=19194 obj=8.98148 num_tokens=2329866 num_tokens/piece=121.385\n", 396 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=19194 obj=8.95582 num_tokens=2330030 num_tokens/piece=121.394\n", 397 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=14395 obj=9.15899 num_tokens=2496081 num_tokens/piece=173.399\n", 398 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=14395 obj=9.12097 num_tokens=2496129 num_tokens/piece=173.403\n", 399 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=10796 obj=9.37226 num_tokens=2669939 num_tokens/piece=247.308\n", 400 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=10796 obj=9.32831 num_tokens=2670057 num_tokens/piece=247.319\n", 401 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=8800 obj=9.5417 num_tokens=2798452 num_tokens/piece=318.006\n", 402 | "unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=8800 obj=9.51506 num_tokens=2799060 num_tokens/piece=318.075\n", 403 | "trainer_interface.cc(687) LOG(INFO) Saving model: mymodel.model\n", 404 | "trainer_interface.cc(699) LOG(INFO) Saving vocabs: mymodel.vocab\n" 405 | ] 406 | } 407 | ], 408 | "source": [ 409 | "import sentencepiece as spm\n", 410 | "\n", 411 | "# Train a tokenizer on a corpus\n", 412 | "spm.SentencePieceTrainer.train(input=\"de-en.txt\", model_prefix=\"mymodel\", vocab_size=8000)\n", 413 | "sp = spm.SentencePieceProcessor(model_file=\"mymodel.model\")" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 32, 419 | "id": "69f057dd-3da4-41a9-9d71-2ef9b991283e", 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "name": "stdout", 424 | "output_type": "stream", 425 | "text": [ 426 | 
"['▁Hel', 'lo', '▁wo', 'r', 'l', 'd', ',', '▁her', 'e', \"'\", 's', '▁an', '▁', 'ill', 'ust', 'r', 'ation', '▁', 'for', '▁', 'to', 'ken', 'iz', 'ation', '▁', 'wi', 'th', '▁B', 'P', 'E', '!']\n" 427 | ] 428 | } 429 | ], 430 | "source": [ 431 | "# Tokenize a sentence\n", 432 | "text = \"Hello world, here's an illustration for tokenization with BPE!\"\n", 433 | "tokens = sp.encode(text, out_type=str)\n", 434 | "print(tokens)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 34, 440 | "id": "287b4367-162f-496d-9b1c-d3b68a3f87c3", 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "name": "stdout", 445 | "output_type": "stream", 446 | "text": [ 447 | "Hello world, here's an illustration for tokenization with BPE!\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "# Decode into sentence\n", 453 | "decoded_text = sp.decode(tokens)\n", 454 | "print(decoded_text)" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "id": "c6ecee0e-455c-49dc-8509-8997826a33ba", 460 | "metadata": {}, 461 | "source": [ 462 | "# Lemmatization\n", 463 | "reduce words to base or root form" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": 1, 469 | "id": "f53e939b-2f76-4005-b86d-1d4bf24a03ce", 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "name": "stderr", 474 | "output_type": "stream", 475 | "text": [ 476 | "[nltk_data] Downloading package wordnet to /Users/xiao/nltk_data...\n", 477 | "[nltk_data] Downloading package omw-1.4 to /Users/xiao/nltk_data...\n" 478 | ] 479 | }, 480 | { 481 | "name": "stdout", 482 | "output_type": "stream", 483 | "text": [ 484 | "running\n", 485 | "run\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "import nltk\n", 491 | "from nltk.stem import WordNetLemmatizer\n", 492 | "from nltk.corpus import wordnet\n", 493 | "\n", 494 | "# Download necessary NLTK resources\n", 495 | "nltk.download('wordnet')\n", 496 | "nltk.download('omw-1.4')\n", 497 | "\n", 498 | "lemmatizer = 
WordNetLemmatizer()\n", 499 | "\n", 500 | "# Lemmatize a word (default POS is noun)\n", 501 | "print(lemmatizer.lemmatize(\"running\")) # \"running\" → \"running\" (as noun by default)\n", 502 | "print(lemmatizer.lemmatize(\"running\", pos=wordnet.VERB)) # \"running\" → \"run\" (as verb)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 3, 508 | "id": "f8c12c41-2964-4375-90fa-dc36459901f4", 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "You -> you\n", 516 | "'re -> be\n", 517 | "looking -> look\n", 518 | "at -> at\n", 519 | "illustrations -> illustration\n", 520 | "for -> for\n", 521 | "tokenization -> tokenization\n", 522 | "with -> with\n", 523 | "SpaCy -> SpaCy\n", 524 | "! -> !\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "import spacy\n", 530 | "\n", 531 | "# Load the pre-trained model for English\n", 532 | "nlp = spacy.load(\"en_core_web_sm\")\n", 533 | "# Sample sentence\n", 534 | "doc = nlp(\"You're looking at illustrations for tokenization with SpaCy!\")\n", 535 | "# Lemmatize each token in the sentence\n", 536 | "for token in doc:\n", 537 | " print(f\"{token.text} -> {token.lemma_}\")" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "id": "3d5fe6ed-aaf1-4143-acc4-bb3a9aef8af9", 543 | "metadata": {}, 544 | "source": [ 545 | "# Stemming\n", 546 | "Reduces words to a root form by stripping suffixes. Common stemmers: Porter, Lancaster, Snowball, Lovins, and other rule-based stemmers.\n", 547 | "\n", 548 | "The Porter stemmer is moderately aggressive and supports only English.\n", 549 | "\n", 550 | "The Lancaster stemmer is the most aggressive of the three and supports only English.\n", 551 | "\n", 552 | "Snowball builds on Porter, is moderately aggressive, and supports many languages besides English (NLTK lists them in `SnowballStemmer.languages`)." 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 18, 558 | "id": "794f06e3-e52f-4e8f-a5c5-473a71ceb327", 559 | "metadata": {}, 560 | "outputs": [ 561 | { 562 | "name": "stderr", 
563 | "output_type": "stream", 564 | "text": [ 565 | "[nltk_data] Downloading package punkt to /Users/xiao/nltk_data...\n", 566 | "[nltk_data] Package punkt is already up-to-date!\n" 567 | ] 568 | } 569 | ], 570 | "source": [ 571 | "import nltk\n", 572 | "from nltk.stem import PorterStemmer\n", 573 | "\n", 574 | "# Download necessary data from NLTK (if needed)\n", 575 | "nltk.download('punkt')\n", 576 | "\n", 577 | "# Sample sentence\n", 578 | "sentence = \"You're looking at illustrations for stemming with nltk!\"\n", 579 | "# Tokenize the sentence into words\n", 580 | "words = nltk.word_tokenize(sentence)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 20, 586 | "id": "c39a79f0-e78b-46fd-afe5-c264c42d2a27", 587 | "metadata": {}, 588 | "outputs": [ 589 | { 590 | "name": "stdout", 591 | "output_type": "stream", 592 | "text": [ 593 | "Original words: ['You', \"'re\", 'looking', 'at', 'illustrations', 'for', 'stemming', 'with', 'nltk', '!']\n", 594 | "Stemmed words: ['you', \"'re\", 'look', 'at', 'illustr', 'for', 'stem', 'with', 'nltk', '!']\n" 595 | ] 596 | } 597 | ], 598 | "source": [ 599 | "# PorterStemmer object\n", 600 | "stemmer = PorterStemmer()\n", 601 | "# Stem each word\n", 602 | "stemmed_words = [stemmer.stem(word) for word in words]\n", 603 | "\n", 604 | "# Print the results\n", 605 | "print(f\"Original words: {words}\")\n", 606 | "print(f\"Stemmed words: {stemmed_words}\")" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 22, 612 | "id": "69a497fc-c950-4942-bdd6-c4e82462155f", 613 | "metadata": {}, 614 | "outputs": [ 615 | { 616 | "name": "stdout", 617 | "output_type": "stream", 618 | "text": [ 619 | "Stemmed words: ['you', \"'re\", 'look', 'at', 'illust', 'for', 'stem', 'with', 'nltk', '!']\n" 620 | ] 621 | } 622 | ], 623 | "source": [ 624 | "# LancasterStemmer\n", 625 | "from nltk.stem import LancasterStemmer\n", 626 | "lancaster_stemmer = LancasterStemmer()\n", 627 | "stemmed_words = 
[lancaster_stemmer.stem(word) for word in words]\n", 628 | "print(f\"Stemmed words: {stemmed_words}\")" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 24, 634 | "id": "a87f7337-6baa-4273-9fd6-942dff120c11", 635 | "metadata": {}, 636 | "outputs": [ 637 | { 638 | "name": "stdout", 639 | "output_type": "stream", 640 | "text": [ 641 | "Stemmed words: ['you', 're', 'look', 'at', 'illustr', 'for', 'stem', 'with', 'nltk', '!']\n" 642 | ] 643 | } 644 | ], 645 | "source": [ 646 | "#SnowballStemmer\n", 647 | "from nltk.stem import SnowballStemmer\n", 648 | "snowball_stemmer = SnowballStemmer('english')\n", 649 | "stemmed_words = [snowball_stemmer.stem(word) for word in words]\n", 650 | "print(f\"Stemmed words: {stemmed_words}\")" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "id": "9b592532-7807-4e99-9ae7-8b86b26e56e5", 656 | "metadata": {}, 657 | "source": [ 658 | "# Stopword removal\n", 659 | "remove common words" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 30, 665 | "id": "eba8b8b2-f5ef-4b5b-8b08-729a75cf26e6", 666 | "metadata": {}, 667 | "outputs": [ 668 | { 669 | "name": "stderr", 670 | "output_type": "stream", 671 | "text": [ 672 | "[nltk_data] Downloading package punkt to /Users/xiao/nltk_data...\n", 673 | "[nltk_data] Package punkt is already up-to-date!\n", 674 | "[nltk_data] Downloading package stopwords to /Users/xiao/nltk_data...\n" 675 | ] 676 | }, 677 | { 678 | "name": "stdout", 679 | "output_type": "stream", 680 | "text": [ 681 | "Filtered Words: ['example', 'sentence', 'demonstrate', 'stopword', 'removal', '.']\n" 682 | ] 683 | }, 684 | { 685 | "name": "stderr", 686 | "output_type": "stream", 687 | "text": [ 688 | "[nltk_data] Unzipping corpora/stopwords.zip.\n" 689 | ] 690 | } 691 | ], 692 | "source": [ 693 | "import nltk\n", 694 | "from nltk.corpus import stopwords\n", 695 | "from nltk.tokenize import word_tokenize\n", 696 | "\n", 697 | "# Download the stopwords list (only needed 
once)\n", 698 | "nltk.download('punkt')\n", 699 | "nltk.download('stopwords')\n", 700 | "\n", 701 | "# Sample text\n", 702 | "text = \"This is an example sentence to demonstrate stopword removal.\"\n", 703 | "\n", 704 | "# Tokenize the text into words\n", 705 | "words = word_tokenize(text)\n", 706 | "\n", 707 | "# Get the list of stopwords in English\n", 708 | "stop_words = set(stopwords.words(\"english\"))\n", 709 | "\n", 710 | "# Remove stopwords from the tokenized words\n", 711 | "filtered_words = [word for word in words if word.lower() not in stop_words]\n", 712 | "\n", 713 | "print(\"Filtered Words:\", filtered_words)" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": 32, 719 | "id": "58500ea2-28f3-4bbe-b816-4823eeffd0e9", 720 | "metadata": {}, 721 | "outputs": [ 722 | { 723 | "name": "stdout", 724 | "output_type": "stream", 725 | "text": [ 726 | "Filtered Words: ['example', 'sentence', 'demonstrate', 'stopword', 'removal', '.']\n" 727 | ] 728 | } 729 | ], 730 | "source": [ 731 | "import spacy\n", 732 | "\n", 733 | "# Load the pre-trained model for English\n", 734 | "nlp = spacy.load(\"en_core_web_sm\")\n", 735 | "\n", 736 | "# Sample text\n", 737 | "text = \"This is an example sentence to demonstrate stopword removal.\"\n", 738 | "\n", 739 | "# Process the text with spaCy\n", 740 | "doc = nlp(text)\n", 741 | "\n", 742 | "# Filter out stopwords from the tokens\n", 743 | "filtered_words = [token.text for token in doc if not token.is_stop]\n", 744 | "\n", 745 | "print(\"Filtered Words:\", filtered_words)" 746 | ] 747 | }, 748 | { 749 | "cell_type": "markdown", 750 | "id": "7401bf7a-2762-491d-8b8c-d8cd197aad69", 751 | "metadata": {}, 752 | "source": [ 753 | "# Part-of-speech tagging\n", 754 | "Assigns a part of speech to each word in a sentence, based on the word's definition and its context." 
755 | ] 756 | }, 757 | { 758 | "cell_type": "code", 759 | "execution_count": 26, 760 | "id": "0c1e3f5f-4e22-496d-a8ea-2ed9b12bb500", 761 | "metadata": {}, 762 | "outputs": [ 763 | { 764 | "name": "stdout", 765 | "output_type": "stream", 766 | "text": [ 767 | "You -> PRON\n", 768 | "'re -> AUX\n", 769 | "looking -> VERB\n", 770 | "at -> ADP\n", 771 | "illustrations -> NOUN\n", 772 | "for -> ADP\n", 773 | "POS -> PROPN\n", 774 | "tagging -> VERB\n", 775 | "with -> ADP\n", 776 | "Spacy -> PROPN\n", 777 | "! -> PUNCT\n" 778 | ] 779 | } 780 | ], 781 | "source": [ 782 | "import spacy\n", 783 | "\n", 784 | "# Load the pre-trained model for English\n", 785 | "nlp = spacy.load(\"en_core_web_sm\")\n", 786 | "\n", 787 | "# Sample sentence\n", 788 | "sentence = \"You're looking at illustrations for POS tagging with Spacy!\"\n", 789 | "\n", 790 | "# Process the sentence\n", 791 | "doc = nlp(sentence)\n", 792 | "\n", 793 | "# Display POS tags for each token\n", 794 | "for token in doc:\n", 795 | " print(f\"{token.text} -> {token.pos_}\")" 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | "execution_count": 28, 801 | "id": "a56c06b2-89ba-4948-b9e6-f4f875756a1c", 802 | "metadata": {}, 803 | "outputs": [ 804 | { 805 | "name": "stderr", 806 | "output_type": "stream", 807 | "text": [ 808 | "[nltk_data] Downloading package punkt to /Users/xiao/nltk_data...\n", 809 | "[nltk_data] Package punkt is already up-to-date!\n", 810 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n", 811 | "[nltk_data] /Users/xiao/nltk_data...\n" 812 | ] 813 | }, 814 | { 815 | "name": "stdout", 816 | "output_type": "stream", 817 | "text": [ 818 | "You -> PRP\n", 819 | "'re -> VBP\n", 820 | "looking -> VBG\n", 821 | "at -> IN\n", 822 | "illustrations -> NNS\n", 823 | "for -> IN\n", 824 | "POS -> NNP\n", 825 | "tagging -> VBG\n", 826 | "with -> IN\n", 827 | "nltk -> NN\n", 828 | "! 
-> .\n" 829 | ] 830 | }, 831 | { 832 | "name": "stderr", 833 | "output_type": "stream", 834 | "text": [ 835 | "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n" 836 | ] 837 | } 838 | ], 839 | "source": [ 840 | "import nltk\n", 841 | "from nltk.tokenize import word_tokenize\n", 842 | "\n", 843 | "# Download necessary NLTK data\n", 844 | "nltk.download('punkt')\n", 845 | "nltk.download('averaged_perceptron_tagger')\n", 846 | "\n", 847 | "# Sample sentence\n", 848 | "sentence = \"You're looking at illustrations for POS tagging with nltk!\"\n", 849 | "\n", 850 | "# Tokenize the sentence\n", 851 | "tokens = word_tokenize(sentence)\n", 852 | "\n", 853 | "# Perform POS tagging\n", 854 | "tags = nltk.pos_tag(tokens)\n", 855 | "\n", 856 | "# Display POS tags\n", 857 | "for word, tag in tags:\n", 858 | " print(f\"{word} -> {tag}\")" 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": null, 864 | "id": "cec42e2a-f326-4d25-9ff0-b6a6c0766690", 865 | "metadata": {}, 866 | "outputs": [], 867 | "source": [] 868 | } 869 | ], 870 | "metadata": { 871 | "kernelspec": { 872 | "display_name": "Python 3 (ipykernel)", 873 | "language": "python", 874 | "name": "python3" 875 | }, 876 | "language_info": { 877 | "codemirror_mode": { 878 | "name": "ipython", 879 | "version": 3 880 | }, 881 | "file_extension": ".py", 882 | "mimetype": "text/x-python", 883 | "name": "python", 884 | "nbconvert_exporter": "python", 885 | "pygments_lexer": "ipython3", 886 | "version": "3.10.14" 887 | } 888 | }, 889 | "nbformat": 4, 890 | "nbformat_minor": 5 891 | } 892 | -------------------------------------------------------------------------------- /attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yyanhui/Language-processing-basics/e361f95e89eb5bf8c33ee06d5c6562ba30811f24/attention.png -------------------------------------------------------------------------------- /biLM.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yyanhui/Language-processing-basics/e361f95e89eb5bf8c33ee06d5c6562ba30811f24/biLM.jpg -------------------------------------------------------------------------------- /transformers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "008cb64e-dc3c-4bf0-88c1-2639681fe897", 6 | "metadata": {}, 7 | "source": [ 8 | "## Attention\n", 9 | "Attention assigns a weight distribution over the parts of a sentence. An RNN depends heavily on the immediately preceding text; to emphasize connections across the wider context, we add an attention mechanism.\n", 10 | "\n", 11 | "### Calculation\n", 12 | "How the attention layer is calculated:\n", 13 | "\n", 14 | "1. Compute query (Q), key (K), and value (V) vectors for each embedded token: $Q = X W_Q$, $K = X W_K$, $V = X W_V$\n", 15 | "2. For each token i, compute the weights $\\alpha_{ij} = softmax_j(q_i \\cdot k_j / \\sqrt{d_k})$ and the output $z_i = \\sum_j \\alpha_{ij} v_j$. 
Dividing by $\\sqrt{d_k}$ (d_k is the dimension of the key vectors) keeps the raw scores at a moderate scale, softmax normalizes them into weights that sum to 1, and multiplying by v_j keeps the values we want to focus on largely intact while down-weighting the rest\n", 16 | "\n", 17 | "The calculation here in matrix form is $Z = SoftMax(\\frac{QK^T}{\\sqrt{d_k}})V$\n", 18 | "\n", 19 | "### Multi-head attention\n", 20 | "Splits the query, key, and value projections into several heads, applies self-attention to each head separately, then concatenates the results and recombines them with a fully connected layer.\n", 21 | "\n", 22 | "![Multi-heads attention](attention.png)\n", 23 | "\n", 24 | "The picture here is from https://jalammar.github.io/illustrated-transformer/\n", 25 | "\n", 26 | "### Masked attention\n", 27 | "At position i, attention must not have access to tokens at positions j > i, so attention to future positions is masked out by adding a causal masking matrix to the scores before the softmax\n", 28 | "\n", 29 | "$$M_{causal} =\n", 30 | "\\begin{bmatrix}\n", 31 | "0 & -\\infty & -\\infty & -\\infty & -\\infty \\\\\n", 32 | "0 & 0 & -\\infty & -\\infty & -\\infty \\\\\n", 33 | "0 & 0 & 0 & -\\infty & -\\infty \\\\\n", 34 | "0 & 0 & 0 & 0 & -\\infty \\\\\n", 35 | "0 & 0 & 0 & 0 & 0 \\\\\n", 36 | "\\end{bmatrix}$$\n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 10, 42 | "id": "8256d3ed-52ea-4ae9-aeed-f6d7c417fea8", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "def dot_product_attention(q, k, v, bias, dropout_rate=0.0):\n", 47 | " \"\"\"Dot-product attention.\n", 48 | "\n", 49 | " Args:\n", 50 | " q: Tensor with shape [..., length_q, depth_k].\n", 51 | " k: Tensor with shape [..., length_kv, depth_k]. 
Leading dimensions must\n", 52 | " match with q.\n", 53 | " v: Tensor with shape [..., length_kv, depth_v] Leading dimensions must\n", 54 | " match with q.\n", 55 | " bias: bias Tensor (see attention_bias())\n", 56 | " dropout_rate: a float.\n", 57 | "\n", 58 | " Returns:\n", 59 | " Tensor with shape [..., length_q, depth_v].\n", 60 | " \"\"\"\n", 61 | " logits = tf.matmul(q, k, transpose_b=True) # [..., length_q, length_kv]\n", 62 | " logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1])))\n", 63 | " if bias is not None:\n", 64 | " # `attention_mask` = [B, T]\n", 65 | " from_shape = get_shape_list(q)\n", 66 | " if len(from_shape) == 4:\n", 67 | " broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], 1], tf.float32)\n", 68 | " elif len(from_shape) == 5:\n", 69 | " # from_shape = [B, N, Block_num, block_size, depth]#\n", 70 | " broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], from_shape[3],\n", 71 | " 1], tf.float32)\n", 72 | "\n", 73 | " bias = tf.matmul(broadcast_ones,\n", 74 | " tf.cast(bias, tf.float32), transpose_b=True)\n", 75 | "\n", 76 | " # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n", 77 | " # masked positions, this operation will create a tensor which is 0.0 for\n", 78 | " # positions we want to attend and -10000.0 for masked positions.\n", 79 | " adder = (1.0 - bias) * -10000.0\n", 80 | "\n", 81 | " # Since we are adding it to the raw scores before the softmax, this is\n", 82 | " # effectively the same as removing these entirely.\n", 83 | " logits += adder\n", 84 | " else:\n", 85 | " adder = 0.0\n", 86 | "\n", 87 | " attention_probs = tf.nn.softmax(logits, name=\"attention_probs\")\n", 88 | " attention_probs = dropout(attention_probs, dropout_rate)\n", 89 | " return tf.matmul(attention_probs, v)\n", 90 | " \n", 91 | "def attention_layer(from_tensor,\n", 92 | " to_tensor,\n", 93 | " attention_mask=None,\n", 94 | " num_attention_heads=1,\n", 95 | " query_act=None,\n", 96 | " key_act=None,\n", 
97 | " value_act=None,\n", 98 | " attention_probs_dropout_prob=0.0,\n", 99 | " initializer_range=0.02,\n", 100 | " batch_size=None,\n", 101 | " from_seq_length=None,\n", 102 | " to_seq_length=None,\n", 103 | " use_einsum=True):\n", 104 | " \"\"\"Performs multi-headed attention from `from_tensor` to `to_tensor`.\n", 105 | "\n", 106 | " Args:\n", 107 | " from_tensor: float Tensor of shape [batch_size, from_seq_length,\n", 108 | " from_width].\n", 109 | " to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].\n", 110 | " attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length].\n", 111 | " The values should be 1 or 0. The attention scores will effectively\n", 112 | " be set to -infinity for any positions in the mask that are 0, and\n", 113 | " will be unchanged for positions that are 1.\n", 114 | " num_attention_heads: int. Number of attention heads.\n", 115 | " query_act: (optional) Activation function for the query transform.\n", 116 | " key_act: (optional) Activation function for the key transform.\n", 117 | " value_act: (optional) Activation function for the value transform.\n", 118 | " attention_probs_dropout_prob: (optional) float. Dropout probability of the\n", 119 | " attention probabilities.\n", 120 | " initializer_range: float. Range of the weight initializer.\n", 121 | " batch_size: (Optional) int. If the input is 2D, this might be the batch size\n", 122 | " of the 3D version of the `from_tensor` and `to_tensor`.\n", 123 | " from_seq_length: (Optional) If the input is 2D, this might be the seq length\n", 124 | " of the 3D version of the `from_tensor`.\n", 125 | " to_seq_length: (Optional) If the input is 2D, this might be the seq length\n", 126 | " of the 3D version of the `to_tensor`.\n", 127 | " use_einsum: bool. 
Whether to use einsum or reshape+matmul for dense layers\n", 128 | "\n", 129 | " Returns:\n", 130 | " float Tensor of shape [batch_size, from_seq_length, num_attention_heads,\n", 131 | " size_per_head].\n", 132 | "\n", 133 | " Raises:\n", 134 | " ValueError: Any of the arguments or tensor shapes are invalid.\n", 135 | " \"\"\"\n", 136 | " from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])\n", 137 | " to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])\n", 138 | " size_per_head = int(from_shape[2]/num_attention_heads)\n", 139 | "\n", 140 | " if len(from_shape) != len(to_shape):\n", 141 | " raise ValueError(\n", 142 | " \"The rank of `from_tensor` must match the rank of `to_tensor`.\")\n", 143 | "\n", 144 | " if len(from_shape) == 3:\n", 145 | " batch_size = from_shape[0]\n", 146 | " from_seq_length = from_shape[1]\n", 147 | " to_seq_length = to_shape[1]\n", 148 | " elif len(from_shape) == 2:\n", 149 | " if (batch_size is None or from_seq_length is None or to_seq_length is None):\n", 150 | " raise ValueError(\n", 151 | " \"When passing in rank 2 tensors to attention_layer, the values \"\n", 152 | " \"for `batch_size`, `from_seq_length`, and `to_seq_length` \"\n", 153 | " \"must all be specified.\")\n", 154 | "\n", 155 | " # Scalar dimensions referenced here:\n", 156 | " # B = batch size (number of sequences)\n", 157 | " # F = `from_tensor` sequence length\n", 158 | " # T = `to_tensor` sequence length\n", 159 | " # N = `num_attention_heads`\n", 160 | " # H = `size_per_head`\n", 161 | "\n", 162 | " # `query_layer` = [B, F, N, H]\n", 163 | " q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head,\n", 164 | " create_initializer(initializer_range), query_act,\n", 165 | " use_einsum, \"query\")\n", 166 | "\n", 167 | " # `key_layer` = [B, T, N, H]\n", 168 | " k = dense_layer_3d(to_tensor, num_attention_heads, size_per_head,\n", 169 | " create_initializer(initializer_range), key_act,\n", 170 | " use_einsum, \"key\")\n", 171 | " # `value_layer` 
= [B, T, N, H]\n", 172 | " v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head,\n", 173 | " create_initializer(initializer_range), value_act,\n", 174 | " use_einsum, \"value\")\n", 175 | " q = tf.transpose(q, [0, 2, 1, 3])\n", 176 | " k = tf.transpose(k, [0, 2, 1, 3])\n", 177 | " v = tf.transpose(v, [0, 2, 1, 3])\n", 178 | " if attention_mask is not None:\n", 179 | " attention_mask = tf.reshape(\n", 180 | " attention_mask, [batch_size, 1, to_seq_length, 1])\n", 181 | " # 'new_embeddings = [B, N, F, H]'\n", 182 | " new_embeddings = dot_product_attention(q, k, v, attention_mask,\n", 183 | " attention_probs_dropout_prob)\n", 184 | "\n", 185 | " return tf.transpose(new_embeddings, [0, 2, 1, 3])\n", 186 | " " 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "id": "9f5fce83-6ad3-4f24-843e-06af60b87c2e", 192 | "metadata": {}, 193 | "source": [ 194 | "## Transformers\n", 195 | "\n", 196 | "A transformer is a multi-layer neural network built from attention layers.\n", 197 | "\n", 198 | "Common transformer models: BERT, ALBERT, RoBERTa (Robustly Optimized BERT approach), DistilBERT, ELECTRA, XLM / XLM-RoBERTa (cross-lingual), GPT, etc.\n", 199 | "\n", 200 | "![Bert](BERT.png)\n", 201 | "This picture is from https://huggingface.co/blog/bert-101#3-bert-model-size--architecture" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "id": "0e580605-1fd6-479b-b374-d598ade23277", 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [] 211 | } 212 | ], 213 | "metadata": { 214 | "kernelspec": { 215 | "display_name": "Python 3 (ipykernel)", 216 | "language": "python", 217 | "name": "python3" 218 | }, 219 | "language_info": { 220 | "codemirror_mode": { 221 | "name": "ipython", 222 | "version": 3 223 | }, 224 | "file_extension": ".py", 225 | "mimetype": "text/x-python", 226 | "name": "python", 227 | "nbconvert_exporter": "python", 228 | "pygments_lexer": "ipython3", 229 | "version": "3.12.4" 230 | } 231 | }, 232 | "nbformat": 4, 233 | "nbformat_minor": 
5 234 | } 235 | --------------------------------------------------------------------------------
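The attention calculation and causal mask described in transformers.ipynb can be sketched in plain NumPy. This is a minimal single-head illustration, not the notebook's TensorFlow code: the function names (`softmax`, `causal_mask`, `scaled_dot_product_attention`), the toy shapes, and the random projection matrices are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating, for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    # 0 on and below the diagonal, -inf above it (the future positions).
    return np.triu(np.full((n, n), -np.inf), k=1)

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: [length_q, d_k], k: [length_kv, d_k], v: [length_kv, d_v]
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # raw scores, scaled by sqrt(d_k)
    if mask is not None:
        scores = scores + mask           # masked positions become -inf
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

# Toy input (hypothetical sizes): 5 tokens, embedding size 8, head size 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))

z, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v, causal_mask(5))
print(z.shape)
print(np.round(weights, 2))
```

With the causal mask applied, every weight above the diagonal is zero, so token i only mixes values from tokens at positions up to i. The first token can attend only to itself.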