├── Duplicate_images ├── Finding_duplicates.ipynb └── Readme.md ├── House_price_prediction_dash ├── Readme.md └── web_home_price.py ├── Image_Classifier ├── Readme.md ├── skin_cancer_classification_1.ipynb ├── skin_cancer_classification_2.ipynb └── snapshot.py ├── Image_Segmentation ├── Readme.md ├── lungs_conv_unet.ipynb ├── lungs_incp_unet.ipynb ├── lungs_incp_unet_snapshot.ipynb ├── seg_mlflow │ ├── MLproject │ ├── Readme.md │ ├── conda.yaml │ ├── lungs_incp_unet_snapshot_MLflow.py │ └── snapshot.py └── snapshot.py ├── Images ├── Bombay_11.jpg ├── Bombay_206.jpg ├── Image_eg.PNG ├── Incp_block.png ├── Readme.md ├── keeshond_59.jpg ├── mask_eg.PNG ├── model_imdb_2.png ├── model_intent.png ├── model_plot.png ├── model_plot_conv.png ├── model_semeval.png ├── screenshot_demo.png ├── training_acc.PNG └── training_loss.PNG ├── Intent_classifier ├── Intent_Classifier.md ├── Intent_classification.png ├── Readme.md └── intentclassifier.ipynb ├── Miscellaneous ├── NER_tagger │ ├── NER_stanford_NLTK.ipynb │ └── Readme.md ├── POS_Tagger │ ├── POSTagger_NLTK.ipynb │ ├── POSTagger_Stanford_NLTK.ipynb │ └── Readme.md ├── Readme.md ├── Word_Embedding.md ├── common_regex.md ├── graph_network_analysis.md ├── pdf_To_doc.ipynb └── topic_modeling.ipynb ├── README.md ├── SSD ├── Readme.md ├── SSD.ipynb ├── boxes.py ├── data_augmentation.py └── data_generator.py ├── Semantic_Relation_Extraction ├── Readme.md └── Relation_extraction_semeval2010.ipynb └── Text_Classification ├── Readme.md ├── classification_imdb.ipynb └── self_Attn_on_seperate_fets_of_2embds.ipynb /Duplicate_images/Readme.md: -------------------------------------------------------------------------------- 1 | # Finding Duplicate Images 2 | 3 | In this notebook task of finding all the different duplicate images among the given set of image is performed. Huge database of images exists which have multiple copies of same image with different ids/ names. To remove duplicate images from such database an efficient way, then comparing similarity of each image with every other image to find the duplicates, is required. 4 | 5 | For the purpose of finding the duplicates, in this notebook, images are first passed through a feature extractor (deep learning network trained on a classification task with last classification layer removed). The feature vector or image vector found are then coverted to hash codes for efficent search using [Locality Sensitivity Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing). 6 | 7 | [annoyindex](https://github.com/spotify/annoy) package is used to perform LSH. The hash code of an image is found by making multiple random hyperplanes on the dimension of image vector and hash code is given by the sides of each hyperplane a particular image lies. This works on the intuition that on high dimensional space similar objects will have similar location on the space. 8 | 9 | Once the hash for all the images are found a particular image is tested for similarity only with images having same hash code. This allows the search for duplicate to be done much quicker. 10 | 11 | ## Packages used 12 | 13 | keras, tensorlfow, numpy, cv2, matplotlib, annoyindex 14 | 15 | ## Dataset 16 | 17 | For this notebook any image dataset containing duplicate images would suffice, although a dataset which already have all the duplicates found will allow for model comparison. 18 | 19 | The link for the dataset used in this notebook is [here](http://www.robots.ox.ac.uk/~vgg/data/pets/?source=post_page---------------------------). 
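Going back to the search procedure described above, here is a minimal sketch of the annoy-based near-duplicate lookup. It is illustrative only: `image_vectors` stands in for the feature vectors produced by the truncated classification network, and the distance threshold is an assumption to be tuned on your data.

```python
import numpy as np
from annoy import AnnoyIndex

# Placeholder for the (n_images, dim) array of feature vectors from the feature extractor.
image_vectors = np.random.rand(100, 512).astype("float32")
dim = image_vectors.shape[1]

index = AnnoyIndex(dim, "angular")   # random-hyperplane (LSH-style) index
for i, vec in enumerate(image_vectors):
    index.add_item(i, vec)
index.build(20)                      # 20 trees; more trees -> better recall, slower build

# For each image, compare only against its nearest neighbours instead of every other image.
for i in range(len(image_vectors)):
    neighbours, dists = index.get_nns_by_item(i, 5, include_distances=True)
    duplicates = [j for j, d in zip(neighbours, dists) if j != i and d < 0.1]  # illustrative threshold
    if duplicates:
        print(i, "looks like a duplicate of", duplicates)
```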
Only the images are required for this notebook. 20 | 21 | ## Results 22 | 23 | Here are a few examples of the duplicate images found: 24 | 25 | ![Example 1](../Images/Bombay_11.jpg) 26 | 27 | ![Example 2](../Images/keeshond_59.jpg) 28 | 29 | ![Example 3](../Images/Bombay_206.jpg) 30 | -------------------------------------------------------------------------------- /House_price_prediction_dash/Readme.md: -------------------------------------------------------------------------------- 1 | # House Price Prediction with Dash Demo 2 | 3 | This is an example script that uses Dash to create a web app for predicting the price of a house. It can be used to demo a simple machine learning program to someone who has no idea how machine learning produces different results for different inputs. 4 | 5 | A simple [XGB](https://xgboost.readthedocs.io/en/latest/index.html) based regressor is trained to produce the results. Since the model was made for demo purposes, only the top 20 most important features were used to train it. This makes the predicted house price vary more noticeably when a single input value is changed, which is ideal for a demo. 6 | 7 | For the purpose of producing the demo, [Dash](https://plot.ly/dash) was used to create the web app. The web app lets the user change a few categorical and a few continuous features. The valid range for each continuous feature is mentioned, and upon a button click a result is produced for the input values. 8 | 9 | ## Packages used 10 | 11 | sklearn, xgb, numpy, pandas, dash, scipy 12 | 13 | ## Dataset 14 | 15 | For this notebook the dataset was taken from this [kaggle competition](https://www.kaggle.com/c/home-data-for-ml-course). Missing values in the dataset were filled in with appropriate assumptions. Since the purpose of this work is a demo, any dataset with continuous/categorical features can be used. 16 | 17 | ## Usage 18 | 19 | Download the [demo script](./web_home_price.py) from this folder and unzip the data in the same folder, so that the downloaded script and the folder home-data-for-ml-course lie at the same location. From a cmd prompt navigate to this location and run the python script. 20 | 21 | If successful, with default Dash settings it will ask you to open the link -> http://127.0.0.1:8050/ . On navigating to the link, a web app like the one shown below should open up for user interaction. 
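For reference, the steps above amount to running the following from the folder that contains both web_home_price.py and the unzipped home-data-for-ml-course folder:

```
python web_home_price.py
```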
22 | 23 | ![Web app Example](../Images/screenshot_demo.png) 24 | -------------------------------------------------------------------------------- /House_price_prediction_dash/web_home_price.py: -------------------------------------------------------------------------------- 1 | import dash 2 | import dash_core_components as dcc 3 | import dash_html_components as html 4 | from textwrap import dedent 5 | from numpy import mean 6 | import pandas as pd 7 | import numpy as np 8 | 9 | from scipy.stats import skew 10 | from sklearn.preprocessing import LabelEncoder 11 | from sklearn.model_selection import train_test_split 12 | 13 | import xgboost as xgb 14 | 15 | train = pd.read_csv("./home-data-for-ml-course/train.csv") 16 | train = train[['LotArea', 'GrLivArea', 'BsmtFinSF1', 'OverallQual', 'OverallCond', 17 | 'TotalBsmtSF', 'YearBuilt', 'GarageYrBlt', '1stFlrSF', 'GarageArea', 18 | '2ndFlrSF', 'Functional', 'OpenPorchSF', 'WoodDeckSF', 'BsmtUnfSF', 19 | 'YearRemodAdd', 'BsmtExposure', 'LotFrontage', 'BsmtQual', 'EnclosedPorch', 'SalePrice']] 20 | 21 | train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median()) 22 | 23 | train["GarageYrBlt"] = train["GarageYrBlt"].fillna(0) 24 | 25 | train["BsmtExposure"] = train["BsmtExposure"].fillna("None") 26 | train["BsmtQual"] = train["BsmtQual"].fillna("None") 27 | 28 | 29 | #Changing OverallCond into a categorical variable 30 | train['OverallCond'] = train['OverallCond'].astype(str) 31 | 32 | cols = ('BsmtQual', 'Functional', 'BsmtExposure','OverallCond') 33 | # process columns, apply LabelEncoder to categorical features 34 | for c in cols: 35 | lbl = LabelEncoder() 36 | lbl.fit(list(train[c].values)) 37 | train[c] = lbl.transform(list(train[c].values)) 38 | 39 | 40 | 41 | train["SalePrice"] = np.log1p(train["SalePrice"]) 42 | 43 | y_train = train["SalePrice"] 44 | x_train = train.drop(axis=1,columns=["SalePrice"]) 45 | 46 | model_xgb = xgb.XGBRegressor(n_jobs=8) 47 | model_xgb.fit(x_train, y_train) 48 | bcnts = 0 49 | 50 | app = dash.Dash() 51 | 52 | app.layout = html.Div([ 53 | html.H1("House price prediction", 54 | style={'text-align': 'center'}), 55 | 56 | html.Div([ 57 | dcc.Markdown(dedent("Please input the **total area** of the house.(range 1500 - 20000).")), 58 | dcc.Input( 59 | placeholder='Full area', 60 | type='number', 61 | value=10000, 62 | id='full_area' 63 | ), 64 | dcc.Markdown(dedent("Please input the **built year** of the house (range 1900 - 2010).")), 65 | dcc.Input( 66 | placeholder='Year Built', 67 | type='number', 68 | value=1975, 69 | id='year_built' 70 | ) 71 | ], style={'display': 'inline-block', 'width': '100%', 'margin': '1.5vh'}), 72 | 73 | html.Div([ 74 | dcc.Markdown(dedent("Please input the **total Garage area** (range 0 - 1500).")), 75 | dcc.Input( 76 | placeholder='Garage area', 77 | type='number', 78 | value=500, 79 | id='grg_area' 80 | ), 81 | dcc.Markdown(dedent("Please input the **garage built year** (range 1900 - 2010).")), 82 | dcc.Input( 83 | placeholder='Garage Year Built', 84 | type='number', 85 | value=1980, 86 | id='grg_year_built' 87 | ) 88 | ], style={'display': 'inline-block', 'width': '100%', 'margin': '1.5vh'}), 89 | 90 | html.Div([ 91 | dcc.Markdown(dedent("Please select the **overall condition** of the house.")), 92 | dcc.Dropdown( 93 | options=[{'label': name, 'value': name} for name in range(9)], 94 | id='overall_cond' 95 | ), 96 | dcc.Markdown(dedent("Please select the **overall quality** of the house.")), 97 | dcc.Dropdown( 98 | options=[{'label': name, 'value': name} for name 
in range(1,11)], 99 | id='overall_qlty' 100 | ) 101 | ], style={'display': 'inline-block', 'width': '100%', 'margin': '1.5vh'}), 102 | 103 | html.Div([ 104 | dcc.Markdown(dedent("Please select the **basement quality** of the house.")), 105 | dcc.Dropdown( 106 | options=[{'label': name, 'value': name} for name in range(5)], 107 | id='bsmt_qual' 108 | ), 109 | dcc.Markdown(dedent("Please select the **basement exposure** of the house.")), 110 | dcc.Dropdown( 111 | options=[{'label': name, 'value': name} for name in range(5)], 112 | id='bsmt_exposure' 113 | )], style={'display': 'inline-block', 'width': '100%', 'margin': '1.5vh'}), 114 | 115 | html.Button('Submit', id='button', style={'display': 'inline-block', 'margin': '1.5vh'}), 116 | 117 | html.Div(id="output") 118 | ], style={'display': 'inline-block', 'width': '120vh', 'margin-left': '35vh', 'margin-right': '35vh'}) 119 | 120 | 121 | 122 | @app.callback(dash.dependencies.Output('output', 'children'), 123 | [dash.dependencies.Input('full_area', 'value'), 124 | dash.dependencies.Input('grg_area', 'value'), 125 | dash.dependencies.Input('year_built', 'value'), 126 | dash.dependencies.Input('grg_year_built', 'value'), 127 | dash.dependencies.Input('overall_cond', 'value'), 128 | dash.dependencies.Input('overall_qlty', 'value'), 129 | dash.dependencies.Input('bsmt_qual', 'value'), 130 | dash.dependencies.Input('bsmt_exposure', 'value'), 131 | dash.dependencies.Input('button', 'n_clicks')]) 132 | def update_output(full_area, grg_area, year_built, grg_year_built, overall_cond, overall_qlty, bsmt_qual, bsmt_exposure, n_clicks): 133 | global bcnts 134 | if n_clicks <=bcnts: 135 | return "" 136 | bcnts+=1 137 | pred = pd.DataFrame(columns=x_train.columns) 138 | pred.loc[0] = [full_area, 1450, 383.5, overall_qlty, overall_cond, 991.5, year_built, grg_year_built, 1087, grg_area, 0, 6, 25, 0, 477, 1994, bsmt_exposure, 69, bsmt_qual, 0] 139 | 140 | out = np.expm1(model_xgb.predict(pred)[0]) 141 | 142 | return dcc.Markdown(dedent("The predicted value for the house is "+str(out)+str(n_clicks)+" dollars.")) 143 | 144 | 145 | 146 | if __name__ == '__main__': 147 | 148 | app.run_server() -------------------------------------------------------------------------------- /Image_Classifier/Readme.md: -------------------------------------------------------------------------------- 1 | # Image Classifier 2 | 3 | Image classification is a task of classifying images into two or more classes. A simple yet powerful neural network models are built to classify the images. 4 | 5 | Two different models are compared here with the only difference of snapshot training being used in 2nd classifier model: 6 | 1. [skin_cancer_classification_1](./skin_cancer_classification_1.ipynb) 7 | 2. [skin_cancer_classification_2](./skin_cancer_classification_2.ipynb) 8 | 9 | ## Packages used 10 | 11 | keras, sklearn, tensorlfow, numpy, pandas, cv2, matplotlib 12 | 13 | ## Dataset 14 | 15 | For performing the task of Image Classification a medical dataset, in which we need to classify the type of skin disease shown in a given image, is used. 16 | 17 | [Here](https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000/kernels) is the link to the dataset. 18 | 19 | ## Model Architecture 20 | 21 | A Convolution based model is used here to classify the images. 22 | 23 | There are 6 convolution layers with kernel size of (1,1) and number of kernels increases from 16 to 512. 24 | 25 | Each Convolution layer is followed by batch normaliztion layer and Leaky ReLU activation. 
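As a rough sketch of one such block (a hypothetical Keras snippet in the style used elsewhere in this repository; the filter count is illustrative, the notebooks stack six of these with 16 to 512 kernels):

```python
from keras.layers import Conv2D, BatchNormalization, LeakyReLU, MaxPooling2D

def conv_block(x, n_filters):
    # Conv2D with a (1,1) kernel, followed by batch normalization and Leaky ReLU,
    # as described above; max pooling then halves the spatial size of the features.
    x = Conv2D(n_filters, (1, 1), padding='same')(x)
    x = BatchNormalization()(x)
    x = LeakyReLU()(x)
    return MaxPooling2D((2, 2))(x)
```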
26 | 27 | Max pooling of features after each convolution block is performed to reduce the size of features by 2. 28 | 29 | Finally dense layers are used to classify the features found from the convolution layer. 30 | 31 | Categorical Cross entropy is used as the loss function and adam is the optimizer. 32 | 33 | ![Model Architecture](../Images/model_plot.png) 34 | 35 | ## Comparison 36 | 37 | Here are the training curves for both the classifiers. 38 | 39 | ![Training accuracy Curve](../Images/training_acc.PNG) ![Training Loss Curve](../Images/training_loss.PNG) 40 | 41 | Orange curve is for the model without cyclic learning rate. The blue curve representing the model with cosine annealed learning rate, deviates from the local minima during the start of each cycle (when learning rate is increased to initial, maximum value). 42 | 43 | 44 | -------------------------------------------------------------------------------- /Image_Classifier/snapshot.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | 4 | import keras.callbacks as callbacks 5 | from keras.callbacks import Callback 6 | 7 | class SnapshotModelCheckpoint(Callback): 8 | """Callback that saves the snapshot weights of the model. 9 | Saves the model weights on certain epochs (which can be considered the 10 | snapshot of the model at that epoch). 11 | Should be used with the cosine annealing learning rate schedule to save 12 | the weight just before learning rate is sharply increased. 13 | # Arguments: 14 | nb_epochs: total number of epochs that the model will be trained for. 15 | nb_snapshots: number of times the weights of the model will be saved. 16 | fn_prefix: prefix for the filename of the weights. 17 | """ 18 | 19 | def __init__(self, nb_epochs, nb_snapshots, fn_prefix='Model'): 20 | super(SnapshotModelCheckpoint, self).__init__() 21 | 22 | self.check = nb_epochs // nb_snapshots 23 | self.fn_prefix = fn_prefix 24 | 25 | def on_epoch_end(self, epoch, logs={}): 26 | if epoch != 0 and (epoch + 1) % self.check == 0: 27 | filepath = self.fn_prefix + "-%d.h5" % ((epoch + 1) // self.check) 28 | self.model.save_weights(filepath, overwrite=True) 29 | #print("Saved snapshot at weights/%s_%d.h5" % (self.fn_prefix, epoch)) 30 | 31 | 32 | class SnapshotCallbackBuilder: 33 | """Callback builder for snapshot ensemble training of a model. 34 | Creates a list of callbacks, which are provided when training a model 35 | so as to save the model weights at certain epochs, and then sharply 36 | increase the learning rate. 37 | """ 38 | 39 | def __init__(self, nb_epochs, nb_snapshots, init_lr=0.1): 40 | """ 41 | Initialize a snapshot callback builder. 42 | # Arguments: 43 | nb_epochs: total number of epochs that the model will be trained for. 44 | nb_snapshots: number of times the weights of the model will be saved. 45 | init_lr: initial learning rate 46 | """ 47 | self.T = nb_epochs 48 | self.M = nb_snapshots 49 | self.alpha_zero = init_lr 50 | 51 | def get_callbacks(self, model_prefix='Model'): 52 | """ 53 | Creates a list of callbacks that can be used during training to create a 54 | snapshot ensemble of the model. 55 | Args: 56 | model_prefix: prefix for the filename of the weights. 
57 | Returns: list of 3 callbacks [ModelCheckpoint, LearningRateScheduler, 58 | SnapshotModelCheckpoint] which can be provided to the 'fit' function 59 | """ 60 | if not os.path.exists('weights/'): 61 | os.makedirs('weights/') 62 | 63 | callback_list = [callbacks.ModelCheckpoint("weights/%s-Best.h5" % model_prefix, monitor="val_acc", 64 | save_best_only=True, save_weights_only=True), 65 | callbacks.LearningRateScheduler(schedule=self._cosine_anneal_schedule), 66 | SnapshotModelCheckpoint(self.T, self.M, fn_prefix='weights/%s' % model_prefix)] 67 | 68 | return callback_list 69 | 70 | def _cosine_anneal_schedule(self, t): 71 | cos_inner = np.pi * (t % (self.T // self.M)) # t - 1 is used when t has 1-based indexing. 72 | cos_inner /= self.T // self.M 73 | cos_out = np.cos(cos_inner) + 1 74 | return float(self.alpha_zero / 2 * cos_out) -------------------------------------------------------------------------------- /Image_Segmentation/Readme.md: -------------------------------------------------------------------------------- 1 | # Image Segmentation 2 | 3 | Image segmentation, or pixel-wise segmentation, is a task in which each pixel in the image is classified into 2 or more classes. 4 | In these notebooks we need to classify the pixels of an image into just 2 classes, so the output needed for this task is a black-and-white mask of the object of interest. 5 | 6 | Here 4 different models are compared: 7 | 1. [lungs_conv_unet](./lungs_conv_unet.ipynb) - An autoencoder model with U-Net architecture is used. 8 | 2. [lungs_incp_unet](./lungs_incp_unet.ipynb) - Same model as above, except the convolution layers are replaced with inception blocks. 9 | 3. [lungs_incp_unet_snapshot](./lungs_incp_unet_snapshot.ipynb) - Exactly the same model as lungs_incp_unet, with the addition of a cosine annealed learning rate. 10 | 4. [lungs_incp_unet_snapshot_mlflow](./seg_mlflow/) - Contains files for lungs_incp_unet_snapshot deployed using the mlflow workflow. 11 | 12 | ## Packages used 13 | 14 | keras, sklearn, tensorflow, numpy, pandas, cv2, matplotlib 15 | 16 | ## Dataset 17 | 18 | For the comparison study on the task of Image Segmentation, a dataset is used where we need to segment the lungs from CT scans. 19 | 20 | There are in total 267 CT scans of lungs, each with a corresponding manually labelled segmentation mask. 21 | 22 | ![Example Image](../Images/Image_eg.PNG) 23 | 24 | ![Example Mask](../Images/mask_eg.PNG) 25 | 26 | [Here](https://www.kaggle.com/kmader/finding-lungs-in-ct-data/home) is the link to the dataset. 27 | 28 | ## Model Architecture 29 | 30 | A Convolution based Auto Encoder model is used here to segment the images. 31 | 32 | In the encoder there are 4 Convolution layers, with an increasing number of kernels in each layer. 33 | 34 | Each Convolution layer is followed by a batch normalization layer and Leaky ReLU activation, and the size of the features is reduced by a factor of 2 by max pooling the features from each kernel. 35 | 36 | The decoder has 4 convolution layers with the same feature sizes as the encoder layers, but with double the number of kernels. The upsampled features and the corresponding features from the encoder layer are stacked. 37 | 38 | The last layer gives an image of the same size with each pixel either 0 or 1. 39 | 40 | ![Model Architecture](../Images/model_plot_conv.png) 41 | 42 | In the models using inception blocks, each block of convolution - batch normalization - Leaky ReLU is replaced with an inception block. 
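The idea, in a compact hypothetical Keras sketch (the notebooks use a larger four-branch version of this, shown in the figure below): the same input is processed by parallel convolution branches with different kernel sizes and the branch outputs are concatenated along the channel axis.

```python
from keras.layers import Conv2D, BatchNormalization, LeakyReLU, concatenate

def mini_inception_block(x, n_filters):
    # Two parallel "views" of the same input with different kernel sizes.
    b1 = LeakyReLU()(BatchNormalization()(Conv2D(n_filters, (1, 1), padding='same')(x)))
    b2 = LeakyReLU()(BatchNormalization()(Conv2D(n_filters, (3, 3), padding='same')(x)))
    # Concatenate the branches along the channel axis and pass them on as one feature map.
    return concatenate([b1, b2], axis=3)
```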
43 | 44 | ![Inception Block](../Images/Incp_block.png) 45 | 46 | ## Comparison 47 | 48 | As found, the model with Convolution layers have an IOU of **0.0**, compared to model with inception block where IOU is **0.43**. 49 | 50 | The best performing model was the [third model](./lungs_incp_unet_snapshot.ipynb) which uses Inception blocks along with cosine anneal learning rate and the IOU here is **0.977**. 51 | 52 | -------------------------------------------------------------------------------- /Image_Segmentation/seg_mlflow/MLproject: -------------------------------------------------------------------------------- 1 | name: lungsegmentation 2 | 3 | conda_env: conda.yaml 4 | 5 | entry_points: 6 | main: 7 | parameters: 8 | image_path: {type: string, default: "../finding-lungs-in-ct-data/2d_images/"} 9 | annotation_path: {type: string, default: "../finding-lungs-in-ct-data/2d_masks/"} 10 | weights_path: {type: string, default: "../weights/"} 11 | log_dir: {type: string, default: "../logs/"} 12 | initial_lr: {type: float, default: 1e-3} 13 | batch_size: {type: int, default: 8} 14 | seed: {type: int, default: 3} 15 | command: "python lungs_incp_unet_snapshot_MLflow.py" -------------------------------------------------------------------------------- /Image_Segmentation/seg_mlflow/Readme.md: -------------------------------------------------------------------------------- 1 | # Image Segmentation tracked/ deployed with mlflow 2 | 3 | In this file the same task from the parent folder is performed using the [same](../lungs_incp_unet_snapshot.ipynb) model with the addition of mlflow workflow. 4 | 5 | [mlflow](https://mlflow.org/) is an open source platform to manage the machine learning lifecycles. Using mlflow one can track the model accuracies/ parameters during training and also save the model. The mlfow projects allows the machine learning code to be reused in any other system just by running the mlflow command. The model saved by the mlflow package can be easily used as an independent entity across different languages, without the need of installing any of the paskages used to train the model. 6 | 7 | ## Packages used 8 | 9 | keras, sklearn, tensorlfow, tenosrboard, numpy, pandas, cv2, matplotlib, mlflow 10 | 11 | ## Dataset 12 | 13 | For showing comparison study in the task of Image Segmentation, a dataset where we need to segment lesions from the CT scans of lungs. 14 | 15 | There are total 267 CT scans of lungs corresponding with the manually labelled segmented masks. 16 | 17 | ![Example Image](../../Images/Image_eg.PNG) 18 | 19 | ![Example Mask](../../Images/mask_eg.PNG) 20 | 21 | [Here](https://www.kaggle.com/kmader/finding-lungs-in-ct-data/home) is the link to the dataset. 22 | 23 | ## Usage 24 | 25 | Make sure you have the above mentioned packages installed in your environment and you have dowloaded and extracted the dataset at Image_Segmentation folder. 26 | 27 | Download the reposritory, and in a cmd prompt navigate to the folder Image_Segmentation folder and run: 28 | 29 | `mlflow run seg_mlflow --no-conda` 30 | 31 | `--no-conda` option is given if you want to run the project in your existing environment. If you want to create a new environment omit this option. 
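Once a run has finished, the model logged by the training script can be loaded back for inference without retraining. A minimal sketch, assuming mlflow >= 1.0: the run ID is taken from the mlflow UI, the artifact name `lungs_incp_unet_snapshot_mlflow_log` is the one used in the training script, and `image_batch` is a placeholder for preprocessed input images.

```python
import mlflow.keras

# <RUN_ID> is the ID of a finished run, visible in the mlflow UI.
model = mlflow.keras.load_model("runs:/<RUN_ID>/lungs_incp_unet_snapshot_mlflow_log")
masks = model.predict(image_batch)  # image_batch: array of preprocessed images, shape (n, H, W, 3)
```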
32 | 33 | For viewing the mlflow ui open another cmd prompt and navigate to the folder semantic-segmentation and run: 34 | 35 | `mlflow ui` 36 | 37 | *Note For running Mlflow ui in windows go to the last comment in the link-> https://github.com/mlflow/mlflow/issues/154 38 | 39 | For viewing Tensorboard ui open a cmd prompt and run: 40 | 41 | `tensorboard --logdir="PATH\TO\LOGS"` 42 | 43 | For passing parameters to the programs run cmd: 44 | 45 | `mlflow run seg_mlflow --no-conda -P PARAM_NAME_1=PARAM_VALUE_1 -P PARAM_NAME_2=PARAM_VALUE_2` 46 | 47 | Here are the available parameters: 48 | 49 | --image_path TEXT Path to images folder 50 | --annotation_path TEXT Path to annotations folder 51 | --weights_path TEXT Path to base model weights file 52 | --log_dir TEXT Path to store log files 53 | --initial_lr FLOAT Initial learning rate 54 | --batch_size INTEGER Batch size for training 55 | --seed INTEGER numpy random seed 56 | 57 | ## Folder Structure 58 | 59 | Image_Segmentation 60 | finding-lungs-in-ct-data 61 | finding-lungs-in-ct-data 62 | 2d_images 63 | 2d_masks 64 | seg_mlflow 65 | weights 66 | logs 67 | -------------------------------------------------------------------------------- /Image_Segmentation/seg_mlflow/conda.yaml: -------------------------------------------------------------------------------- 1 | name: segmentation 2 | channels: 3 | - rdkit 4 | - defaults 5 | dependencies: 6 | - python==3.6 7 | - numpy 8 | - pandas 9 | - glob 10 | - tensorflow-mkl 11 | - matplotlib 12 | - opencv-python 13 | - scikit-learn 14 | - keras 15 | - pip: 16 | - mlflow>=1.0 -------------------------------------------------------------------------------- /Image_Segmentation/seg_mlflow/lungs_incp_unet_snapshot_MLflow.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # # Image Segmentation using an Inception based U-Net model with snapshots and cosine annealed learning rate. 5 | # 6 | # This notebook is an example notebook of using image data and its corresponding masks to produce masks for the test images whose mask are needed. 7 | # 8 | # Segmentation of an image is to classify each pixel of an image into 2 or more classes. For problems where only one object of interest is needed to be determined in an image containing the object and background, we need to classify every pixel into 2 classes as is the case with the dataset/ model used in this notebook. 9 | 10 | import os 11 | import sys 12 | # ## Import packages 13 | # 14 | # Import different packages for handling images and building the keras model. 15 | 16 | 17 | # Packages for processing/ loading image files 18 | from glob import glob 19 | import cv2 20 | import pandas as pd 21 | import numpy as np 22 | import time 23 | from sklearn.model_selection import train_test_split 24 | 25 | # Packages for the model 26 | from keras.layers import Conv2D, MaxPooling2D, Activation, Dropout, Flatten, BatchNormalization 27 | from keras.layers import LeakyReLU, Concatenate, concatenate, UpSampling2D, Add, Input,Dense 28 | from keras.models import Model,load_model 29 | from keras.optimizers import Adam 30 | import tensorflow as tf 31 | import keras.backend as K 32 | from keras.preprocessing.image import ImageDataGenerator 33 | from keras.callbacks import TensorBoard 34 | 35 | # Import the snapshot builder callback file locally which is used for dynamic LR. 
36 | from snapshot import SnapshotCallbackBuilder 37 | 38 | # Package for model deployment 39 | import click 40 | import mlflow 41 | import mlflow.keras 42 | 43 | 44 | # ## Function to load and process Image 45 | # 46 | # Load image using opencv function imread(path, 1 for read as colour image). 47 | # 48 | # Cv2 loads the image with B channel first. We converte this to R channel first. 49 | # 50 | # Gaussian Blur is applied to image to have smoother transitions in image. 51 | # 52 | # Image is scaled to have values between 0 and 1. 53 | # 54 | # For processing masks it is made sure that shapes are as required and the image is thresholded to converte arbitary values to 0 or 255. 55 | 56 | 57 | # Preprocess image for of top view training/ detection 58 | def preprocess_img(img_file): 59 | im = cv2.imread(img_file, 1) 60 | im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) 61 | im = cv2.GaussianBlur(im, (5,5), 0) 62 | im = im/255 63 | return im 64 | 65 | # Preprocess masks top view 66 | def preprocess_mask(img_file): 67 | im = cv2.imread(img_file, 0) 68 | im = im.reshape(im.shape[0], im.shape[1], 1) 69 | _, im = cv2.threshold(im, 200, 255, cv2.THRESH_BINARY) 70 | im = cv2.GaussianBlur(im,(5,5), 0) 71 | im = im.reshape(im.shape[0], im.shape[1], 1) 72 | im = im/255 73 | return im 74 | 75 | 76 | # ## Inception Block 77 | # 78 | # Instead of a single convolution block an inception block is used to extract the features from the image. 79 | # 80 | # In an inception block input is divided into multiple instances and passed to different combinations of convolution layer and then the output from all the combinations are concatenated and passed further. 81 | # 82 | # An inception block extracts richer feature compared to a convolution layer since in an inception block same input feature is passed through different sized kernel of convolutions, which enables the model to process the features from different point of views. 83 | # 84 | # Different combinations of layers can be experimented with. Also different inception blocks can be combined in a same model and used as done in the actual inception model. 85 | # 86 | # Link to the paper for Inception model is [here](https://arxiv.org/pdf/1602.07261.pdf). 
87 | def incp_v3(x, nb_filters): 88 | 89 | b_1 = Conv2D(nb_filters, (1,1), padding='same', kernel_initializer="glorot_normal")(x) 90 | b_1 = BatchNormalization(axis=3)(b_1) 91 | b_1 = LeakyReLU()(b_1) 92 | b_1 = Conv2D(nb_filters, (3,3), padding='same', kernel_initializer="glorot_normal")(b_1) 93 | b_1 = BatchNormalization(axis=3)(b_1) 94 | b_1 = LeakyReLU()(b_1) 95 | b_1 = Conv2D(nb_filters, (3,3), padding='same', kernel_initializer="glorot_normal")(b_1) 96 | b_1 = BatchNormalization(axis=3)(b_1) 97 | b_1 = LeakyReLU()(b_1) 98 | 99 | b_2 = Conv2D(nb_filters, (1,1), padding='same', kernel_initializer="glorot_normal")(x) 100 | b_2 = BatchNormalization(axis=3)(b_2) 101 | b_2 = LeakyReLU()(b_2) 102 | b_2 = Conv2D(nb_filters, (3,3), padding='same', kernel_initializer="glorot_normal")(b_2) 103 | b_2 = BatchNormalization(axis=3)(b_2) 104 | b_2 = LeakyReLU()(b_2) 105 | 106 | b_3 = Conv2D(nb_filters, (1,1), padding='same', kernel_initializer="glorot_normal")(x) 107 | b_3 = BatchNormalization(axis=3)(b_3) 108 | b_3 = LeakyReLU()(b_3) 109 | 110 | b_4 = MaxPooling2D((2,2), strides=(1,1), padding='same')(x) 111 | b_4 = Conv2D(nb_filters, (1,1), padding='same', kernel_initializer="glorot_normal")(b_4) 112 | b_4 = BatchNormalization(axis=3)(b_4) 113 | b_4 = LeakyReLU()(b_4) 114 | 115 | x = concatenate([b_1, b_2, b_3, b_4], axis=3) 116 | return x 117 | 118 | # image_augmenting function 119 | def image_augmentation(imgs, masks, batch_size): 120 | data_gen_args = dict(featurewise_center=False, 121 | featurewise_std_normalization=False, 122 | zoom_range = [0.9, 1.1], 123 | width_shift_range=0.1, 124 | height_shift_range=0.1, 125 | horizontal_flip=True, 126 | vertical_flip = True ) 127 | 128 | image_datagen = ImageDataGenerator(**data_gen_args) 129 | mask_datagen = ImageDataGenerator(**data_gen_args) 130 | seed = 1 131 | 132 | image_datagen.fit(imgs, augment=True, seed=seed) 133 | mask_datagen.fit(masks, augment=True, seed=seed) 134 | 135 | image_generator = image_datagen.flow( 136 | imgs, 137 | batch_size=batch_size, 138 | shuffle=True, 139 | seed=seed) 140 | 141 | mask_generator = mask_datagen.flow( 142 | masks, 143 | batch_size=batch_size, 144 | shuffle=True, 145 | seed=seed) 146 | 147 | train_generator = zip(image_generator, mask_generator) 148 | 149 | return train_generator 150 | 151 | 152 | def get_model(inp): 153 | x = incp_v3(inp, 8) 154 | 155 | b_0 = x 156 | 157 | x = MaxPooling2D((2, 2), strides=2)(x) 158 | b_1 = (x) 159 | 160 | x = incp_v3(x, 16) 161 | 162 | x = MaxPooling2D((2, 2), strides=2)(x) 163 | b_2 = (x) 164 | 165 | x = incp_v3(x, 32) 166 | 167 | x = MaxPooling2D((2, 2), strides=2)(x) 168 | b_3 =(x) 169 | 170 | x = incp_v3(x, 64) 171 | 172 | encoded = MaxPooling2D((2, 2))(x) 173 | 174 | x = incp_v3(encoded, 64) 175 | 176 | x = UpSampling2D((2, 2))(x) 177 | x = Concatenate(axis=3)([x, b_3]) 178 | 179 | x = incp_v3(x, 32) 180 | 181 | x = UpSampling2D((2, 2))(x) 182 | x = Concatenate(axis=3)([x, b_2]) 183 | 184 | x = incp_v3(x, 16) 185 | 186 | x = UpSampling2D((2, 2))(x) 187 | x = Concatenate(axis=3)([x, b_1]) 188 | 189 | x = incp_v3(x, 8) 190 | 191 | x = UpSampling2D((2, 2))(x) 192 | x = Concatenate(axis=3)([x, b_0]) 193 | 194 | decoded = Conv2D(1, (3,3), activation='sigmoid', padding='same', kernel_initializer='glorot_normal')(x) 195 | 196 | return decoded 197 | 198 | 199 | # Callback class for logging metrics into the MLflow pipeline. 
200 | class mlfow_log_callback(tf.keras.callbacks.Callback): 201 | 202 | def on_epoch_end(self, epoch, logs=None): 203 | mlflow.log_metric('Train Loss',logs['loss'], epoch) 204 | mlflow.log_metric('Train Accuracy',logs['acc'], epoch) 205 | mlflow.log_metric('Validation Loss',logs['val_loss'], epoch) 206 | mlflow.log_metric('Valdiation Accuracy',logs['val_acc'], epoch) 207 | print(epoch, logs) 208 | 209 | 210 | # click provides an easy interface for getting cmd line arguments. 211 | @click.command() 212 | @click.option("--image_path", default='../finding-lungs-in-ct-data/2d_images/', help="Path to images folder") 213 | @click.option("--annotation_path", default='../finding-lungs-in-ct-data/2d_masks/', help="Path to annotations folder") 214 | @click.option("--weights_path", default='../weights/', help="Path to base model weights file") 215 | @click.option("--log_dir", default='../logs/', help="Path to store log files") 216 | @click.option("--initial_lr", default=1e-3, type=float, help="Initial learning rate") 217 | @click.option("--batch_size", default=4, type=int, help="Batch size for training") 218 | @click.option("--seed", default=3, type=int, help="numpy random seed") 219 | def train(image_path, annotation_path, weights_path, log_dir, initial_lr, batch_size, seed): 220 | 221 | # ## Dataset 222 | # 223 | # The dataset for this notebook is taken from -> https://www.kaggle.com/kmader/finding-lungs-in-ct-data/home . 224 | # 225 | # The dataset contains 267 images of ct scans of lungs with masks representing the lung portion in each of the image. 226 | # 227 | # The images in the dataset are in tif format. 228 | # Load the training files 229 | # 230 | X = [] 231 | file_train = glob(image_path+"*.tif") 232 | for i, file in enumerate(file_train): 233 | im = preprocess_img(file) 234 | X.append(im) 235 | X = np.array(X) 236 | 237 | y = [] 238 | file_mask = glob(annotation_path+"*.tif") 239 | for i in file_mask: 240 | im = preprocess_mask(i) 241 | y.append(im) 242 | y = np.array(y) 243 | 244 | x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed) 245 | 246 | x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1, random_state=seed) 247 | 248 | # ## Auto Encoder with U-Net Architecture 249 | # 250 | # In an auto encoder model for an image, features are extracted and reduced using a convolution network (or another feature extractor like using inception blocks). The feature size is reduced by max pooling features from each kernel in the convolution. Since, over here, our aim is to get the mask of the input image, we reconstruct the latent features (features found after the last max pooling layer), into an image of same size (channel size may differ). This is done by upsampling (copying the same value in each of the kernel position) and convolution of these features. 251 | # 252 | # To increase the accuracy of the network an architecture known as U-Net was used. On the decoder side where reconstruction of image is done, along with latent features, corresponding features from encoder is concatenated in each layer. These allows the decoder network to directly look at the features of corresponding layers in the encoder, to enchance feature reproduction. The paper on [U-Net architecture](https://arxiv.org/abs/1505.04597) gives detail information on the same. 
253 | # 254 | # Initalize the Inception based UNET model 255 | 256 | inp = Input(X[0].shape) 257 | decoded = get_model(inp) 258 | 259 | train_generator = image_augmentation(x_train, y_train, batch_size) 260 | val_generator = image_augmentation(x_val, y_val, batch_size) 261 | 262 | tb = TensorBoard(log_dir=log_dir+"lungs_incp_unet_snapshot_mlflow", histogram_freq=0, batch_size=batch_size, write_graph=True, write_grads=True, write_images=True, 263 | embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None) 264 | 265 | # Model parameters 266 | M = 3 267 | nb_epoch = T = 60 268 | snapshot = SnapshotCallbackBuilder(T, M, initial_lr) 269 | timestr = time.strftime("%Y-%m-%d_%H-%M-%S") 270 | model_prefix = 'lungs_incp_unet_snapshot_mlflow{}'.format(timestr) 271 | 272 | # Train model 273 | callbacks = [tb] + snapshot.get_callbacks(model_prefix = model_prefix) + [mlfow_log_callback()] 274 | 275 | 276 | with mlflow.start_run() as run: 277 | 278 | model = Model(inp, decoded) 279 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy']) 280 | 281 | hist = model.fit_generator(train_generator, steps_per_epoch=int(len(x_train)/batch_size), epochs=nb_epoch, callbacks=callbacks, verbose=0, 282 | validation_data=val_generator, validation_steps=int(len(x_val)/batch_size)) 283 | 284 | 285 | ## may give location you want to save the model to. 286 | model.save(weights_path+'lungs_incp_unet_snapshot_mlflow') 287 | 288 | pred = [model.predict(x.reshape((1,x.shape[0],x.shape[1],x.shape[2])))[0] for x in x_test] 289 | 290 | temp = [] 291 | for im in y_test: 292 | im *=255 293 | _, im = cv2.threshold(im, 200, 255, cv2.THRESH_BINARY) 294 | temp.append(im) 295 | y_test_t = np.array(temp) 296 | 297 | temp = [] 298 | for im in pred: 299 | im *=255 300 | _, im = cv2.threshold(im, 200, 255, cv2.THRESH_BINARY) 301 | temp.append(im) 302 | pred_t = np.array(temp) 303 | 304 | # ## IOU 305 | # 306 | # Intersection Over Union is used to evaluate the models performance. 307 | # 308 | # The predicted image is compared with the corresponding mask of the test image. 309 | 310 | 311 | component1 = np.float32(y_test_t) 312 | component2 = pred_t 313 | overlap = np.logical_and(component1, component2) # Logical AND 314 | union = np.logical_or(component1, component2) # Logical OR 315 | 316 | IOU = overlap.sum()/float(union.sum()) 317 | print(IOU) 318 | 319 | # Log Parameters and Test score for a run of the model. 320 | mlflow.log_param("batch_size", str(batch_size)) 321 | mlflow.log_param("seed", str(seed)) 322 | mlflow.log_metric('Test IOU',IOU) 323 | mlflow.log_param("initial_lr", str(initial_lr)) 324 | # Log the model. 325 | mlflow.keras.log_model(model, "lungs_incp_unet_snapshot_mlflow_log") 326 | 327 | 328 | if __name__ == '__main__': 329 | train() -------------------------------------------------------------------------------- /Image_Segmentation/seg_mlflow/snapshot.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | 4 | import keras.callbacks as callbacks 5 | from keras.callbacks import Callback 6 | 7 | class SnapshotModelCheckpoint(Callback): 8 | """Callback that saves the snapshot weights of the model. 9 | Saves the model weights on certain epochs (which can be considered the 10 | snapshot of the model at that epoch). 11 | Should be used with the cosine annealing learning rate schedule to save 12 | the weight just before learning rate is sharply increased. 
13 | # Arguments: 14 | nb_epochs: total number of epochs that the model will be trained for. 15 | nb_snapshots: number of times the weights of the model will be saved. 16 | fn_prefix: prefix for the filename of the weights. 17 | """ 18 | 19 | def __init__(self, nb_epochs, nb_snapshots, fn_prefix='Model'): 20 | super(SnapshotModelCheckpoint, self).__init__() 21 | 22 | self.check = nb_epochs // nb_snapshots 23 | self.fn_prefix = fn_prefix 24 | 25 | def on_epoch_end(self, epoch, logs={}): 26 | if epoch != 0 and (epoch + 1) % self.check == 0: 27 | filepath = self.fn_prefix + "-%d.h5" % ((epoch + 1) // self.check) 28 | self.model.save_weights(filepath, overwrite=True) 29 | #print("Saved snapshot at weights/%s_%d.h5" % (self.fn_prefix, epoch)) 30 | 31 | 32 | class SnapshotCallbackBuilder: 33 | """Callback builder for snapshot ensemble training of a model. 34 | Creates a list of callbacks, which are provided when training a model 35 | so as to save the model weights at certain epochs, and then sharply 36 | increase the learning rate. 37 | """ 38 | 39 | def __init__(self, nb_epochs, nb_snapshots, init_lr=0.1): 40 | """ 41 | Initialize a snapshot callback builder. 42 | # Arguments: 43 | nb_epochs: total number of epochs that the model will be trained for. 44 | nb_snapshots: number of times the weights of the model will be saved. 45 | init_lr: initial learning rate 46 | """ 47 | self.T = nb_epochs 48 | self.M = nb_snapshots 49 | self.alpha_zero = init_lr 50 | 51 | def get_callbacks(self, model_prefix='Model'): 52 | """ 53 | Creates a list of callbacks that can be used during training to create a 54 | snapshot ensemble of the model. 55 | Args: 56 | model_prefix: prefix for the filename of the weights. 57 | Returns: list of 3 callbacks [ModelCheckpoint, LearningRateScheduler, 58 | SnapshotModelCheckpoint] which can be provided to the 'fit' function 59 | """ 60 | if not os.path.exists('weights/'): 61 | os.makedirs('weights/') 62 | 63 | callback_list = [callbacks.ModelCheckpoint("weights/%s-Best.h5" % model_prefix, monitor="val_acc", 64 | save_best_only=True, save_weights_only=True), 65 | callbacks.LearningRateScheduler(schedule=self._cosine_anneal_schedule), 66 | SnapshotModelCheckpoint(self.T, self.M, fn_prefix='weights/%s' % model_prefix)] 67 | 68 | return callback_list 69 | 70 | def _cosine_anneal_schedule(self, t): 71 | cos_inner = np.pi * (t % (self.T // self.M)) # t - 1 is used when t has 1-based indexing. 72 | cos_inner /= self.T // self.M 73 | cos_out = np.cos(cos_inner) + 1 74 | return float(self.alpha_zero / 2 * cos_out) -------------------------------------------------------------------------------- /Image_Segmentation/snapshot.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | 4 | import keras.callbacks as callbacks 5 | from keras.callbacks import Callback 6 | 7 | class SnapshotModelCheckpoint(Callback): 8 | """Callback that saves the snapshot weights of the model. 9 | Saves the model weights on certain epochs (which can be considered the 10 | snapshot of the model at that epoch). 11 | Should be used with the cosine annealing learning rate schedule to save 12 | the weight just before learning rate is sharply increased. 13 | # Arguments: 14 | nb_epochs: total number of epochs that the model will be trained for. 15 | nb_snapshots: number of times the weights of the model will be saved. 16 | fn_prefix: prefix for the filename of the weights. 
17 | """ 18 | 19 | def __init__(self, nb_epochs, nb_snapshots, fn_prefix='Model'): 20 | super(SnapshotModelCheckpoint, self).__init__() 21 | 22 | self.check = nb_epochs // nb_snapshots 23 | self.fn_prefix = fn_prefix 24 | 25 | def on_epoch_end(self, epoch, logs={}): 26 | if epoch != 0 and (epoch + 1) % self.check == 0: 27 | filepath = self.fn_prefix + "-%d.h5" % ((epoch + 1) // self.check) 28 | self.model.save_weights(filepath, overwrite=True) 29 | #print("Saved snapshot at weights/%s_%d.h5" % (self.fn_prefix, epoch)) 30 | 31 | 32 | class SnapshotCallbackBuilder: 33 | """Callback builder for snapshot ensemble training of a model. 34 | Creates a list of callbacks, which are provided when training a model 35 | so as to save the model weights at certain epochs, and then sharply 36 | increase the learning rate. 37 | """ 38 | 39 | def __init__(self, nb_epochs, nb_snapshots, init_lr=0.1): 40 | """ 41 | Initialize a snapshot callback builder. 42 | # Arguments: 43 | nb_epochs: total number of epochs that the model will be trained for. 44 | nb_snapshots: number of times the weights of the model will be saved. 45 | init_lr: initial learning rate 46 | """ 47 | self.T = nb_epochs 48 | self.M = nb_snapshots 49 | self.alpha_zero = init_lr 50 | 51 | def get_callbacks(self, model_prefix='Model'): 52 | """ 53 | Creates a list of callbacks that can be used during training to create a 54 | snapshot ensemble of the model. 55 | Args: 56 | model_prefix: prefix for the filename of the weights. 57 | Returns: list of 3 callbacks [ModelCheckpoint, LearningRateScheduler, 58 | SnapshotModelCheckpoint] which can be provided to the 'fit' function 59 | """ 60 | if not os.path.exists('weights/'): 61 | os.makedirs('weights/') 62 | 63 | callback_list = [callbacks.ModelCheckpoint("weights/%s-Best.h5" % model_prefix, monitor="val_acc", 64 | save_best_only=True, save_weights_only=True), 65 | callbacks.LearningRateScheduler(schedule=self._cosine_anneal_schedule), 66 | SnapshotModelCheckpoint(self.T, self.M, fn_prefix='weights/%s' % model_prefix)] 67 | 68 | return callback_list 69 | 70 | def _cosine_anneal_schedule(self, t): 71 | cos_inner = np.pi * (t % (self.T // self.M)) # t - 1 is used when t has 1-based indexing. 
72 | cos_inner /= self.T // self.M 73 | cos_out = np.cos(cos_inner) + 1 74 | return float(self.alpha_zero / 2 * cos_out) -------------------------------------------------------------------------------- /Images/Bombay_11.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/Bombay_11.jpg -------------------------------------------------------------------------------- /Images/Bombay_206.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/Bombay_206.jpg -------------------------------------------------------------------------------- /Images/Image_eg.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/Image_eg.PNG -------------------------------------------------------------------------------- /Images/Incp_block.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/Incp_block.png -------------------------------------------------------------------------------- /Images/Readme.md: -------------------------------------------------------------------------------- 1 | Contains different images to be shown in the readme file throughout the repository. 2 | -------------------------------------------------------------------------------- /Images/keeshond_59.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/keeshond_59.jpg -------------------------------------------------------------------------------- /Images/mask_eg.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/mask_eg.PNG -------------------------------------------------------------------------------- /Images/model_imdb_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/model_imdb_2.png -------------------------------------------------------------------------------- /Images/model_intent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/model_intent.png -------------------------------------------------------------------------------- /Images/model_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/model_plot.png -------------------------------------------------------------------------------- /Images/model_plot_conv.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/model_plot_conv.png -------------------------------------------------------------------------------- /Images/model_semeval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/model_semeval.png -------------------------------------------------------------------------------- /Images/screenshot_demo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/screenshot_demo.png -------------------------------------------------------------------------------- /Images/training_acc.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/training_acc.PNG -------------------------------------------------------------------------------- /Images/training_loss.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Images/training_loss.PNG -------------------------------------------------------------------------------- /Intent_classifier/Intent_Classifier.md: -------------------------------------------------------------------------------- 1 | # Intent Classification 2 | 3 | Intent classification is a step in NLU, where we need to understand what does the user want, by processing the user query. 4 | 5 | A simple yet powerful chatbot can be made by doing an intent classification and than replying with one of the pre written messages. 6 | 7 | Example queries for a chatbot to find places around: 8 | 9 | 1. “I need to buy groceries.”: Intent is to look for grocery store nearby 10 | 2. “I want to have vegetarian food.”: Intent is to look for restaurants around, ideally veg. 11 | 12 | So basically, we need to understand what is user looking for, and accordingly we need to classify his request to certain category of intent. Once the intent is known, a pre written message can be generated according to the intent. In the first example a message would be: 13 | Chatbot reply -> Some of the grocery store around you are GROCERY_1, address … 14 | 15 | To understand the intent of an user query, we need to train a model to classify requests into intents using a ML algorithm, over the sentences transformed to vectors using methods like TF-IDF, word2Vec, GloVe. First, we convert sentences into array of numbers(vector). 16 | 17 | ![General flow of intent classification, setences->vectors->model](./Intent_classification.png?raw=true "General flow of intent classification, setences->vectors->model") 18 | 19 | Tf-Idf: Give each word in sentence a number depending upon how many times that word occurs in that sentence and also upon the document. For words occurring many times in a sentence and not so many occurrences in document will have high value. 
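As a tiny illustration with scikit-learn (which is used elsewhere in this repository), using the two example queries from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I need to buy groceries.",
        "I want to have vegetarian food."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix of shape (2, vocabulary_size)

print(vectorizer.vocabulary_)        # word -> column index
print(X.toarray())                   # each row is the TF-IDF vector of one sentence
```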
20 | 21 | Word2Vec: There are different methods of getting word vectors for a sentence, but the main idea behind all the techniques is to give similar words a similar vector representation. So words like man, boy and girl will have similar vectors. The length of each vector can be set. Examples of word-embedding techniques: GloVe, CBOW and skip-gram. 22 | 23 | We can use Word2Vec by training it on our own dataset (if we have enough data for the problem), or else we can use pretrained word vectors available on the internet. Pretrained word vectors have been trained on huge corpora like Wikipedia data, tweets, etc. and are almost always good for the problem. 24 | 25 | 26 | -------------------------------------------------------------------------------- /Intent_classifier/Intent_classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/netik1020/Concise-iPython-Notebooks-for-Deep-learning/691d7a3709e945c4311fd47095e7285a949b9ba6/Intent_classifier/Intent_classification.png -------------------------------------------------------------------------------- /Intent_classifier/Readme.md: -------------------------------------------------------------------------------- 1 | # Intent Classifier 2 | 3 | Intent classification is a step in NLU, where we need to understand what the user wants by processing the user query. A small document is written [here](./Intent_Classifier.md) to explain Intent Classification and how it can be used to make a simple chatbot. 4 | 5 | [Here](./intentclassifier.ipynb) an example notebook is given that performs intent classification of an incoming user query. 6 | 7 | ## Packages used 8 | 9 | keras, sklearn, tensorflow, numpy, pandas, json 10 | ## Dataset 11 | 12 | For performing the task of Intent Classification, the dataset was taken from [here]( https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines). This dataset has been collected from different sources and has queries pertaining to 7 different intents - addtoplaylist, bookrestaurant, getweather, playmusic, ratebook, searchcreativework and searchscreeningevent. 13 | 14 | ## Model Architecture 15 | 16 | Preloaded GloVe vectors are used as embedding weights for the model. 17 | 18 | Embedded word vectors are first passed to a 1D convolution and then to a bidirectional GRU. The GRU takes care of the sequential information, while the CNN improves the embeddings by emphasizing neighbor information. 19 | 20 | A global max pooling layer is used to pool 1 feature from each feature vector. 21 | 22 | The features are enriched by concatenating self-attended features of the RNN output. 23 | 24 | Finally, multiple fully-connected layers are used to classify the incoming query into one of the possible intents. 25 | 26 | Adam optimizer and sparse categorical crossentropy loss are used. 27 | 28 | ![Model Architecture](../Images/model_intent.png) 29 | -------------------------------------------------------------------------------- /Intent_classifier/intentclassifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Intent Classifier \n", 8 | "\n", 9 | "In this notebook, a simple way to classify an incoming query like \"I want a hot dog.\" into one of the intents is shown. 
Finding the intent of the user query is a very important task in building a chatbot.\n", 10 | "\n", 11 | "Here intent classication is done by using a keras sequence model to extract the feature from the incoming query." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "name": "stderr", 21 | "output_type": "stream", 22 | "text": [ 23 | "Using TensorFlow backend.\n" 24 | ] 25 | }, 26 | { 27 | "data": { 28 | "text/plain": [ 29 | "('2.2.4',\n", 30 | " '1.11.0',\n", 31 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 32 | ] 33 | }, 34 | "execution_count": 1, 35 | "metadata": {}, 36 | "output_type": "execute_result" 37 | } 38 | ], 39 | "source": [ 40 | "import keras, tensorflow, sys\n", 41 | "keras.__version__, tensorflow.__version__, sys.version" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# import required packages\n", 51 | "\n", 52 | "import json\n", 53 | "import pandas as pd\n", 54 | "import numpy as np\n", 55 | "\n", 56 | "import tensorflow as tf\n", 57 | "\n", 58 | "from keras.preprocessing.text import Tokenizer\n", 59 | "from keras.preprocessing.sequence import pad_sequences\n", 60 | "\n", 61 | "from keras.utils.np_utils import to_categorical\n", 62 | "\n", 63 | "from keras.layers import Dense, Input, Flatten, Lambda, Permute, GlobalMaxPooling1D, Activation, Concatenate\n", 64 | "from keras.layers import Convolution1D, MaxPooling1D, Embedding, Dropout, Bidirectional, CuDNNGRU, SpatialDropout1D\n", 65 | "\n", 66 | "from keras.models import Model\n", 67 | "\n", 68 | "from sklearn.preprocessing import LabelEncoder\n", 69 | "from sklearn.model_selection import train_test_split\n", 70 | "from sklearn.metrics import f1_score, accuracy_score\n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "## Dataset\n", 78 | "\n", 79 | "Dataset is taken from the link -> https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines This Dataset have been collected from different sources and have queries pertaining to 7 different intents.\n", 80 | "\n", 81 | "The dataset is given in json format and the below block of code is used to read the data." 
82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "(13784, 8)" 93 | ] 94 | }, 95 | "execution_count": 3, 96 | "metadata": {}, 97 | "output_type": "execute_result" 98 | } 99 | ], 100 | "source": [ 101 | "data = pd.DataFrame()\n", 102 | "\n", 103 | "for intent in ['AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook', 'SearchCreativeWork',\n", 104 | " 'SearchScreeningEvent']:\n", 105 | "\n", 106 | " with open(\"./data/2017-06-custom-intent-engines/\" + intent + \"/train_\" + intent + \"_full.json\",\n", 107 | " encoding='cp1251') as data_file:\n", 108 | " full_data = json.load(data_file)\n", 109 | " \n", 110 | " texts = []\n", 111 | " for i in range(len(full_data[intent])):\n", 112 | " text = ''\n", 113 | " for j in range(len(full_data[intent][i]['data'])):\n", 114 | " text += full_data[intent][i]['data'][j]['text']\n", 115 | " texts.append(text)\n", 116 | "\n", 117 | " dftrain = pd.DataFrame(data=texts, columns=['request'])\n", 118 | " dftrain[intent] = np.ones(dftrain.shape[0], dtype='int')\n", 119 | "\n", 120 | " data = data.append(dftrain, ignore_index=True, sort=False)\n", 121 | "\n", 122 | "data = data.fillna(value=0)\n", 123 | "\n", 124 | "data.shape" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## Sample query\n", 132 | "\n", 133 | "The dataframe contains the query and the column corresponding the intent is marked 1. See below:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/html": [ 144 | "
\n", 145 | "\n", 158 | "\n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | "
requestAddToPlaylistBookRestaurantGetWeatherPlayMusicRateBookSearchCreativeWorkSearchScreeningEvent
7144Play a top five Linda Strawberry ep0.00.00.01.00.00.00.0
5584What will the weather be in IN?0.00.01.00.00.00.00.0
12905Find the movie schedule at twelve AM.0.00.00.00.00.00.01.0
4798will the weather be colder in Naguabo four min...0.00.01.00.00.00.00.0
3989What's the weather close to Cambodia at 05:44:130.00.01.00.00.00.00.0
\n", 230 | "
" 231 | ], 232 | "text/plain": [ 233 | " request AddToPlaylist \\\n", 234 | "7144 Play a top five Linda Strawberry ep 0.0 \n", 235 | "5584 What will the weather be in IN? 0.0 \n", 236 | "12905 Find the movie schedule at twelve AM. 0.0 \n", 237 | "4798 will the weather be colder in Naguabo four min... 0.0 \n", 238 | "3989 What's the weather close to Cambodia at 05:44:13 0.0 \n", 239 | "\n", 240 | " BookRestaurant GetWeather PlayMusic RateBook SearchCreativeWork \\\n", 241 | "7144 0.0 0.0 1.0 0.0 0.0 \n", 242 | "5584 0.0 1.0 0.0 0.0 0.0 \n", 243 | "12905 0.0 0.0 0.0 0.0 0.0 \n", 244 | "4798 0.0 1.0 0.0 0.0 0.0 \n", 245 | "3989 0.0 1.0 0.0 0.0 0.0 \n", 246 | "\n", 247 | " SearchScreeningEvent \n", 248 | "7144 0.0 \n", 249 | "5584 0.0 \n", 250 | "12905 1.0 \n", 251 | "4798 0.0 \n", 252 | "3989 0.0 " 253 | ] 254 | }, 255 | "execution_count": 4, 256 | "metadata": {}, 257 | "output_type": "execute_result" 258 | } 259 | ], 260 | "source": [ 261 | "data.sample(5)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "## Load Glove Embedding\n", 269 | "\n", 270 | "Load the embedding file 'glove.840B.300d.txt' and find the mean and standard deviation vectors of the word vectors. Than for all the words in the vocab initialize the corresponding word vector from the loaded embedded file. For the words for which wordvecs cannot be found in the embedding file, initialize them with a random normal distribution with the above found mean and standard deviation." 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 5, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "def load_glove(word_index):\n", 280 | " EMBEDDING_FILE = '../../embeddings/glove.840B.300d/glove.840B.300d.txt'\n", 281 | " def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')\n", 282 | " embeddings_index = dict(get_coefs(*o.split(\" \")) for o in open(EMBEDDING_FILE, encoding=\"utf8\"))\n", 283 | "\n", 284 | " all_embs = np.stack(embeddings_index.values())\n", 285 | " emb_mean,emb_std = all_embs.mean(), all_embs.std()\n", 286 | " embed_size = all_embs.shape[1]\n", 287 | "\n", 288 | " # word_index = tokenizer.word_index\n", 289 | " nb_words = len(word_index)\n", 290 | " embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\n", 291 | " for word, i in word_index.items():\n", 292 | " if i >= nb_words: continue\n", 293 | " embedding_vector = embeddings_index.get(word)\n", 294 | " if embedding_vector is not None: embedding_matrix[i] = embedding_vector\n", 295 | " \n", 296 | " return embedding_matrix " 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 6, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "# split data into test and train\n", 306 | "X_train, X_test, y_train, y_test = train_test_split(data[\"request\"], data[[\"AddToPlaylist\", \"BookRestaurant\",\n", 307 | " \"GetWeather\", \"PlayMusic\", \"RateBook\", \"SearchCreativeWork\",\n", 308 | " \"SearchScreeningEvent\"]], test_size=0.25)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "## Tokenize and pad the text sequences\n", 316 | "\n", 317 | "Tokenize -> change the word to there integer ids\n", 318 | "\n", 319 | "Pad -> Trim or pad with zeros to make all sentences of same length.\n" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 7, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "\n", 329 | "X_train = 
list(X_train)\n", 330 | "\n", 331 | "# tokenize input strings\n", 332 | "tokenizer = Tokenizer()\n", 333 | "tokenizer.fit_on_texts(X_train)\n", 334 | "\n", 335 | "X_train = tokenizer.texts_to_sequences(X_train)\n", 336 | "X_test = tokenizer.texts_to_sequences(X_test)\n", 337 | "\n", 338 | "word_index = tokenizer.word_index\n", 339 | "vocab_size = len(word_index)\n", 340 | "\n", 341 | "# prune each sentence to a maximum of 100 words.\n", 342 | "max_sent_len = 100\n", 343 | "\n", 344 | "# sentences with less than 100 words will be padded with zeroes to make them of length 100\n", 345 | "# sentences with more than 100 words will be pruned to 100.\n", 346 | "X_train = pad_sequences(X_train, maxlen=max_sent_len)\n", 347 | "X_test = pad_sequences(X_test, maxlen=max_sent_len)\n", 348 | "\n", 349 | "embedding_matrix = load_glove(word_index)" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "Convert the one hot vectors of class labels into numerical labels. " 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 8, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "y_train = np.argmax(np.array(y_train), axis=-1)\n", 366 | "y_test = np.argmax(np.array(y_test), axis=-1)\n" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "## Model\n", 374 | "\n", 375 | "Preloaded GloVe vectors are used as embedding weights for the model.\n", 376 | "\n", 377 | "Embedded word vectors are first featurized with a 1D convolution and then passed to a bidirectional GRU. The GRU takes care of the sequential information, while the CNN improves the embeddings by emphasizing neighbor information. \n", 378 | "\n", 379 | "The global max pool layer pools 1 feature from each feature vector, unlike max pooling where we determine how many values are to be pooled.\n", 380 | "\n", 381 | "Features are enriched by concatenating self-attended features of the RNN output. \n", 382 | "\n", 383 | "Finally, multiple fully-connected layers are used to classify the incoming query into one of the possible intents.\n", 384 | "\n", 385 | "Adam optimizer and sparse categorical crossentropy loss are used."
386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 9, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "name": "stderr", 395 | "output_type": "stream", 396 | "text": [ 397 | "D:\\Program_Files\\Anaconda\\envs\\tensorflow\\lib\\site-packages\\ipykernel_launcher.py:9: UserWarning: Update your `Conv1D` call to the Keras 2 API: `Conv1D(filters=256, activation=\"tanh\", padding=\"same\", strides=1, kernel_size=3)`\n", 398 | " if __name__ == '__main__':\n" 399 | ] 400 | }, 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "__________________________________________________________________________________________________\n", 406 | "Layer (type) Output Shape Param # Connected to \n", 407 | "==================================================================================================\n", 408 | "input_1 (InputLayer) (None, 100) 0 \n", 409 | "__________________________________________________________________________________________________\n", 410 | "embedding_1 (Embedding) (None, 100, 300) 2929200 input_1[0][0] \n", 411 | "__________________________________________________________________________________________________\n", 412 | "dropout_1 (Dropout) (None, 100, 300) 0 embedding_1[0][0] \n", 413 | "__________________________________________________________________________________________________\n", 414 | "conv1d_1 (Conv1D) (None, 100, 256) 230656 dropout_1[0][0] \n", 415 | "__________________________________________________________________________________________________\n", 416 | "dropout_2 (Dropout) (None, 100, 256) 0 conv1d_1[0][0] \n", 417 | "__________________________________________________________________________________________________\n", 418 | "bidirectional_1 (Bidirectional) (None, 100, 128) 123648 dropout_2[0][0] \n", 419 | "__________________________________________________________________________________________________\n", 420 | "activation_1 (Activation) (None, 100, 128) 0 bidirectional_1[0][0] \n", 421 | "__________________________________________________________________________________________________\n", 422 | "dense_1 (Dense) (None, 100, 1) 129 activation_1[0][0] \n", 423 | "__________________________________________________________________________________________________\n", 424 | "permute_1 (Permute) (None, 1, 100) 0 dense_1[0][0] \n", 425 | "__________________________________________________________________________________________________\n", 426 | "attn_softmax (Activation) (None, 1, 100) 0 permute_1[0][0] \n", 427 | "__________________________________________________________________________________________________\n", 428 | "lambda_1 (Lambda) (None, 1, 128) 0 attn_softmax[0][0] \n", 429 | " activation_1[0][0] \n", 430 | "__________________________________________________________________________________________________\n", 431 | "global_max_pooling1d_1 (GlobalM (None, 128) 0 activation_1[0][0] \n", 432 | "__________________________________________________________________________________________________\n", 433 | "flatten_1 (Flatten) (None, 128) 0 lambda_1[0][0] \n", 434 | "__________________________________________________________________________________________________\n", 435 | "concatenate_1 (Concatenate) (None, 256) 0 global_max_pooling1d_1[0][0] \n", 436 | " flatten_1[0][0] \n", 437 | "__________________________________________________________________________________________________\n", 438 | "dropout_3 (Dropout) (None, 256) 0 concatenate_1[0][0] \n", 439 | 
"__________________________________________________________________________________________________\n", 440 | "dense_2 (Dense) (None, 128) 32896 dropout_3[0][0] \n", 441 | "__________________________________________________________________________________________________\n", 442 | "dropout_4 (Dropout) (None, 128) 0 dense_2[0][0] \n", 443 | "__________________________________________________________________________________________________\n", 444 | "dense_3 (Dense) (None, 32) 4128 dropout_4[0][0] \n", 445 | "__________________________________________________________________________________________________\n", 446 | "dense_4 (Dense) (None, 7) 231 dense_3[0][0] \n", 447 | "==================================================================================================\n", 448 | "Total params: 3,320,888\n", 449 | "Trainable params: 3,320,888\n", 450 | "Non-trainable params: 0\n", 451 | "__________________________________________________________________________________________________\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "# Model\n", 457 | "\n", 458 | "sequence_input = Input(shape=(max_sent_len,), dtype='int32')\n", 459 | "\n", 460 | "words = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix],\n", 461 | " trainable=True)(sequence_input)\n", 462 | "words = Dropout(rate=0.3)(words)\n", 463 | "\n", 464 | "output = Convolution1D(filters=256, filter_length=3, activation=\"tanh\", padding='same', strides=1)(words)\n", 465 | "output = Dropout(rate=0.3)(output)\n", 466 | "\n", 467 | "output = Bidirectional(CuDNNGRU(units=64, return_sequences=True), merge_mode='concat')(output)\n", 468 | "output_h = Activation('tanh')(output)\n", 469 | "\n", 470 | "output1 = GlobalMaxPooling1D()(output_h) \n", 471 | "\n", 472 | "# Applying attention to RNN output\n", 473 | "output = Dense(units=1)(output_h)\n", 474 | "output = Permute((2, 1))(output)\n", 475 | "output = Activation('softmax', name=\"attn_softmax\")(output)\n", 476 | "output = Lambda(lambda x: tf.matmul(x[0], x[1])) ([output, output_h])\n", 477 | "output2 = Flatten() (output)\n", 478 | "\n", 479 | "# Concatenating maxpooled and self attended features.\n", 480 | "output = Concatenate()([output1, output2])\n", 481 | "output = Dropout(rate=0.3)(output)\n", 482 | "\n", 483 | "output = Dense(units=128, activation='tanh')(output)\n", 484 | "output = Dropout(rate=0.3)(output)\n", 485 | "\n", 486 | "output = Dense(units=32, activation='tanh')(output)\n", 487 | "output = Dense(units=7, activation='softmax')(output)\n", 488 | "\n", 489 | "model = Model(inputs=sequence_input, outputs=output)\n", 490 | "model.compile(loss='sparse_categorical_crossentropy', optimizer=\"adam\", metrics=['accuracy'])\n", 491 | "\n", 492 | "model.summary()" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 10, 498 | "metadata": {}, 499 | "outputs": [ 500 | { 501 | "name": "stdout", 502 | "output_type": "stream", 503 | "text": [ 504 | "Epoch 1/10\n", 505 | "10338/10338 [==============================] - 6s 628us/step - loss: 0.7464 - acc: 0.7519\n", 506 | "Epoch 2/10\n", 507 | "10338/10338 [==============================] - 4s 378us/step - loss: 0.1104 - acc: 0.9742\n", 508 | "Epoch 3/10\n", 509 | "10338/10338 [==============================] - 4s 383us/step - loss: 0.0646 - acc: 0.9847\n", 510 | "Epoch 4/10\n", 511 | "10338/10338 [==============================] - 4s 381us/step - loss: 0.0472 - acc: 0.9891\n", 512 | "Epoch 5/10\n", 513 | "10338/10338 [==============================] - 
4s 380us/step - loss: 0.0389 - acc: 0.9900\n", 514 | "Epoch 6/10\n", 515 | "10338/10338 [==============================] - 4s 385us/step - loss: 0.0288 - acc: 0.9928\n", 516 | "Epoch 7/10\n", 517 | "10338/10338 [==============================] - 4s 388us/step - loss: 0.0159 - acc: 0.9967\n", 518 | "Epoch 8/10\n", 519 | "10338/10338 [==============================] - 4s 384us/step - loss: 0.0173 - acc: 0.9957\n", 520 | "Epoch 9/10\n", 521 | "10338/10338 [==============================] - 4s 379us/step - loss: 0.0126 - acc: 0.9968\n", 522 | "Epoch 10/10\n", 523 | "10338/10338 [==============================] - 4s 379us/step - loss: 0.0086 - acc: 0.9984\n" 524 | ] 525 | }, 526 | { 527 | "data": { 528 | "text/plain": [ 529 | "" 530 | ] 531 | }, 532 | "execution_count": 10, 533 | "metadata": {}, 534 | "output_type": "execute_result" 535 | } 536 | ], 537 | "source": [ 538 | "# train the model\n", 539 | "model.fit(X_train, np.array(y_train), epochs=10, batch_size=128)" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "## Validation\n", 547 | "\n", 548 | "Being able to classify intents for a query with an accuracy of 98.7%" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 11, 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "name": "stdout", 558 | "output_type": "stream", 559 | "text": [ 560 | "f1_score (macro): 0.9872548281477977\n", 561 | "accuracy_score: 0.9872315728380732\n" 562 | ] 563 | } 564 | ], 565 | "source": [ 566 | "#get scores and predictions.\n", 567 | "p = model.predict(X_test)\n", 568 | "p = [np.argmax(i) for i in p]\n", 569 | "\n", 570 | "print(\"f1_score (macro):\", f1_score(y_test, p, average=\"macro\"))\n", 571 | "print(\"accuracy_score:\", accuracy_score(y_test, p))" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [] 580 | } 581 | ], 582 | "metadata": { 583 | "kernelspec": { 584 | "display_name": "Python 3", 585 | "language": "python", 586 | "name": "python3" 587 | }, 588 | "language_info": { 589 | "codemirror_mode": { 590 | "name": "ipython", 591 | "version": 3 592 | }, 593 | "file_extension": ".py", 594 | "mimetype": "text/x-python", 595 | "name": "python", 596 | "nbconvert_exporter": "python", 597 | "pygments_lexer": "ipython3", 598 | "version": "3.6.6" 599 | } 600 | }, 601 | "nbformat": 4, 602 | "nbformat_minor": 2 603 | } 604 | -------------------------------------------------------------------------------- /Miscellaneous/NER_tagger/NER_stanford_NLTK.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# An example of NER tagging using stanford NER tagger in NLTK.\n", 8 | "\n", 9 | "An example notebook for using pretrained NER - Named Entity Recognition, provided by stanford using NLTK package." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "Using TensorFlow backend.\n" 22 | ] 23 | }, 24 | { 25 | "data": { 26 | "text/plain": [ 27 | "('2.2.4',\n", 28 | " '1.11.0',\n", 29 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 30 | ] 31 | }, 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "output_type": "execute_result" 35 | } 36 | ], 37 | "source": [ 38 | "import keras, tensorflow, sys\n", 39 | "keras.__version__, tensorflow.__version__, sys.version" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Setting path for java\n", 47 | "\n", 48 | "Since the stanford librarires are built upon java, java must be installed and one must speicfy its path if not set in environment variables." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# Setting path for java executable. Required if not set in Environment variables.\n", 58 | "import os\n", 59 | "java_path = r\"C:\\Program Files\\Java\\jdk1.8.0_171\\bin\\java.exe\"\n", 60 | "os.environ['JAVAHOME'] = java_path" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Download the Stanford NER tagger\n", 68 | "\n", 69 | "Download files of pretrained Stanford NER tagger from -> https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip. Then give the paths for the files 'english.all.3class.distsim.crf.ser.gz' and 'stanford-ner.jar'." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "from nltk.tag import StanfordNERTagger\n", 79 | "\n", 80 | "st_ner = StanfordNERTagger('./stanford-ner-2014-06-16/stanford-ner-2014-06-16/classifiers/english.all.3class.distsim.crf.ser.gz',\n", 81 | " './stanford-ner-2014-06-16/stanford-ner-2014-06-16/stanford-ner.jar', encoding='utf-8')\n", 82 | "\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "## Download necessary NLTK bundles\n", 90 | "\n", 91 | "Downloading nltk bundles for handling tokenization of sentence into list of words by considering the punctuation symbols." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 4, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "name": "stderr", 101 | "output_type": "stream", 102 | "text": [ 103 | "[nltk_data] Downloading package punkt to\n", 104 | "[nltk_data] C:\\Users\\Netik\\AppData\\Roaming\\nltk_data...\n", 105 | "[nltk_data] Package punkt is already up-to-date!\n" 106 | ] 107 | }, 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "True" 112 | ] 113 | }, 114 | "execution_count": 4, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "# Download the required nltk bundle, if not already downloaded\n", 121 | "import nltk\n", 122 | "nltk.download('punkt')" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## NER tags for a sample text. \n", 130 | "\n", 131 | "Tokenize the input sentence into words and punctuations and than find the NER tags of the words in a sentence using the above loaded, pretrained Stanford NER tagger." 
132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 5, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "from nltk.tokenize import word_tokenize" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 6, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "[('In', 'O'), ('an', 'O'), ('interview', 'O'), ('with', 'O'), ('CBS', 'ORGANIZATION'), ('News', 'ORGANIZATION'), ('60', 'O'), ('Minutes', 'O'), ('on', 'O'), ('Friday', 'O'), (',', 'O'), ('Musk', 'PERSON'), ('said', 'O'), (\"'its\", 'O'), ('possible', 'O'), (\"'\", 'O'), ('he', 'O'), ('would', 'O'), ('purchase', 'O'), ('a', 'O'), ('GM', 'ORGANIZATION'), ('factory', 'O'), ('in', 'O'), ('North', 'LOCATION'), ('America', 'LOCATION'), ('if', 'O'), ('the', 'O'), ('company', 'O'), ('is', 'O'), (\"'going\", 'O'), ('to', 'O'), ('sell', 'O'), ('a', 'O'), ('plant', 'O'), ('or', 'O'), ('not', 'O'), ('use', 'O'), ('it', 'O'), (\"'\", 'O'), ('.', 'O')]\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "text = \"In an interview with CBS News 60 Minutes on Friday, Musk said 'its possible' he would purchase a GM factory in North America if the company is 'going to sell a plant or not use it'.\"\n", 158 | "\n", 159 | "# Tokenize the sentence into list of words and punctuations.\n", 160 | "tokenized_text = word_tokenize(text)\n", 161 | "\n", 162 | "# Find the NER tags for the tokenized sample text\n", 163 | "tagged_text = st_ner.tag(tokenized_text)\n", 164 | "\n", 165 | "# Print the words in the sentence along with there Named-entitiy tags.\n", 166 | "print(tagged_text)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Python 3", 180 | "language": "python", 181 | "name": "python3" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 3 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython3", 193 | "version": "3.6.6" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 2 198 | } 199 | -------------------------------------------------------------------------------- /Miscellaneous/NER_tagger/Readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Miscellaneous/POS_Tagger/POSTagger_NLTK.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# POS tagging with NLTK\n", 8 | "\n", 9 | "An example notebook for using pretrained pos-tagger provided in the NLTK package. POS tagging is to assign the part-of-speech to each word in the sentence." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "Using TensorFlow backend.\n" 22 | ] 23 | }, 24 | { 25 | "data": { 26 | "text/plain": [ 27 | "('2.2.4',\n", 28 | " '1.11.0',\n", 29 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 30 | ] 31 | }, 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "output_type": "execute_result" 35 | } 36 | ], 37 | "source": [ 38 | "import keras, tensorflow, sys\n", 39 | "keras.__version__, tensorflow.__version__, sys.version" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Install NLTK package.\n", 47 | "\n", 48 | "If not already there install the NLTK package by running the command -> pip install nltk" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "import nltk " 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## Download necessary NLTK bundles\n", 65 | "\n", 66 | "Downloading nltk bundles for removing stop words and handling tokenization of sentence into list of words by considering the punctuation symbols.\n", 67 | "\n", 68 | "Perceptron_tagger is the bundle required for pos tagging in NLTK." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "name": "stderr", 78 | "output_type": "stream", 79 | "text": [ 80 | "[nltk_data] Downloading package stopwords to\n", 81 | "[nltk_data] C:\\Users\\Netik\\AppData\\Roaming\\nltk_data...\n", 82 | "[nltk_data] Package stopwords is already up-to-date!\n", 83 | "[nltk_data] Downloading package punkt to\n", 84 | "[nltk_data] C:\\Users\\Netik\\AppData\\Roaming\\nltk_data...\n", 85 | "[nltk_data] Package punkt is already up-to-date!\n", 86 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n", 87 | "[nltk_data] C:\\Users\\Netik\\AppData\\Roaming\\nltk_data...\n", 88 | "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", 89 | "[nltk_data] date!\n" 90 | ] 91 | }, 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "True" 96 | ] 97 | }, 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "# Download the required nltk bundles, if not already downloaded\n", 105 | "nltk.download('stopwords')\n", 106 | "nltk.download('punkt')\n", 107 | "nltk.download('averaged_perceptron_tagger') " 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 4, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "from nltk.corpus import stopwords \n", 117 | "from nltk.tokenize import word_tokenize, sent_tokenize \n", 118 | "stop_words = set(stopwords.words('english')) " 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## POS tagging a sample text." 
126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[('It', 'PRP'), ('takes', 'VBZ'), ('great', 'JJ'), ('deal', 'NN'), ('bravery', 'NN'), ('stand', 'NN'), ('enemies', 'NNS'), (',', ','), ('much', 'JJ'), ('stand', 'NN'), ('friends', 'NNS'), ('.', '.')]\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "# Dummy text \n", 143 | "txt = \"It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\"\n", 144 | "\n", 145 | " \n", 146 | "tokenized = sent_tokenize(txt) \n", 147 | "for i in tokenized: \n", 148 | " \n", 149 | " # Word tokenizers is used to find the words and punctuation in a string \n", 150 | " wordsList = nltk.word_tokenize(i) \n", 151 | " \n", 152 | " # removing stop words from wordList \n", 153 | " wordsList = [w for w in wordsList if not w in stop_words] \n", 154 | " \n", 155 | " # Using a pretrained part-of-speech tagger or POS-tagger. \n", 156 | " tagged = nltk.pos_tag(wordsList) \n", 157 | " \n", 158 | " # Print the words along with there part-of-speech symbols.\n", 159 | " print(tagged) \n" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [] 168 | } 169 | ], 170 | "metadata": { 171 | "kernelspec": { 172 | "display_name": "Python 3", 173 | "language": "python", 174 | "name": "python3" 175 | }, 176 | "language_info": { 177 | "codemirror_mode": { 178 | "name": "ipython", 179 | "version": 3 180 | }, 181 | "file_extension": ".py", 182 | "mimetype": "text/x-python", 183 | "name": "python", 184 | "nbconvert_exporter": "python", 185 | "pygments_lexer": "ipython3", 186 | "version": "3.6.6" 187 | } 188 | }, 189 | "nbformat": 4, 190 | "nbformat_minor": 2 191 | } 192 | -------------------------------------------------------------------------------- /Miscellaneous/POS_Tagger/POSTagger_Stanford_NLTK.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# POS tagging with Stanford NLTK\n", 8 | "\n", 9 | "An example notebook for using pretrained Stanford pos-tagger provided in the NLTK package. POS tagging is to assign the part-of-speech to each word in the sentence." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "Using TensorFlow backend.\n" 22 | ] 23 | }, 24 | { 25 | "data": { 26 | "text/plain": [ 27 | "('2.2.4',\n", 28 | " '1.11.0',\n", 29 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 30 | ] 31 | }, 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "output_type": "execute_result" 35 | } 36 | ], 37 | "source": [ 38 | "import keras, tensorflow, sys\n", 39 | "keras.__version__, tensorflow.__version__, sys.version" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "from nltk.tag.stanford import StanfordPOSTagger\n", 49 | "import os" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## Setting path for java\n", 57 | "\n", 58 | "Since the stanford librarires are built upon java, you must have running java version to use this." 
59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "java_path = r\"C:\\Program Files\\Java\\jdk1.8.0_171\\bin\\java.exe\"\n", 68 | "os.environ['JAVAHOME'] = java_path" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## POS tagging a sample text.\n", 76 | "\n", 77 | "Download the stanford pos tagger files from the link -> https://nlp.stanford.edu/software/stanford-postagger-2018-10-16.zip and provide the path to the 'english-bidirectional-distsim.tagger' and 'stanford-postagger.jar' files." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 4, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "[('It', 'PRP'), ('takes', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('deal', 'NN'), ('of', 'IN'), ('bravery', 'NN'), ('to', 'TO'), ('stand', 'VB'), ('up', 'RP'), ('to', 'TO'), ('our', 'PRP$'), ('enemies,', 'NN'), ('but', 'CC'), ('just', 'RB'), ('as', 'RB'), ('much', 'JJ'), ('to', 'TO'), ('stand', 'VB'), ('up', 'RP'), ('to', 'TO'), ('our', 'PRP$'), ('friends.', 'NN')]\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "path_to_model = \"./stanford-postagger-2018-10-16/models/english-bidirectional-distsim.tagger\"\n", 95 | "path_to_jar = \"./stanford-postagger-2018-10-16/stanford-postagger.jar\"\n", 96 | "\n", 97 | "# Load the pretrained stanford postagger\n", 98 | "tagger = StanfordPOSTagger(path_to_model, path_to_jar)\n", 99 | "\n", 100 | "# Dummy text \n", 101 | "txt = \"It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\"\n", 102 | "\n", 103 | "# Postag the sentence and print the words along with there part-of-speech symbols.\n", 104 | "print(tagger.tag(txt.split()))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [] 113 | } 114 | ], 115 | "metadata": { 116 | "kernelspec": { 117 | "display_name": "Python 3", 118 | "language": "python", 119 | "name": "python3" 120 | }, 121 | "language_info": { 122 | "codemirror_mode": { 123 | "name": "ipython", 124 | "version": 3 125 | }, 126 | "file_extension": ".py", 127 | "mimetype": "text/x-python", 128 | "name": "python", 129 | "nbconvert_exporter": "python", 130 | "pygments_lexer": "ipython3", 131 | "version": "3.6.6" 132 | } 133 | }, 134 | "nbformat": 4, 135 | "nbformat_minor": 2 136 | } 137 | -------------------------------------------------------------------------------- /Miscellaneous/POS_Tagger/Readme.md: -------------------------------------------------------------------------------- 1 | # POS TAGGER 2 | 3 | Here POS tags of a text sequence is found in Python, using NLTK packages. 4 | 5 | POS tags or the Part-of-Speech tags of the text like noun, pronoun, etc., are important information in understanding a natural language. A POS tag sequence of a generated natural language sequence can be checked to determine if a generated sequence is gramaticaly correct. 6 | 7 | Many libraries have been written in python for the same and over here NLTK packages have been used. The 2 files perform POS tagging of text using: 8 | 9 | 1. [NLTK](./POSTagger_NLTK.ipynb) 10 | 11 | 2. 
[Stanford NLTK](./POSTagger_Stanford_NLTK.ipynb) 12 | 13 | ## Packages used 14 | 15 | nltk 16 | 17 | 18 | -------------------------------------------------------------------------------- /Miscellaneous/Readme.md: -------------------------------------------------------------------------------- 1 | # Notebooks/ Docs for handling text in python. 2 | 3 | This folder contains different documents/ python notebooks for handling text in python. 4 | 5 | The Contents of this folder are: 6 | 7 | 1. [NER Tagger](./NER_tagger) NER tagging of a sequence using stanford NLTK package is shown. 8 | 9 | 2. [POS Tagger](./POS_Tagger) Example of POS tagging of a text sequence is shown using NLTK and stanford packages. 10 | 11 | 3. [Word Embedding](./Word_Embedding.md) which gives an insight on different word embeddings like Glove, Fastext, which are commonly used to extract features from the text. A little about sentence and document level embeddings is also talked about. 12 | 13 | 4. [Commonly used Regular expressions](./common_regex.md) – A file containing some commonly used Regular Expressions along with the descriptions of the expressions. 14 | 15 | 5. [Graph Network Analysis](./graph_network_analysis.md) - Some of the important properties of a graph network. 16 | 17 | 6. [PDF to Doc](./pdf_To_doc.ipynb) – a python notebook to read the pdf documents in python. PDFminer package is used here. 18 | 19 | 7. [Topic Modelling](./Miscellaneous/topic_modeling.ipynb) – Here topic modelling is done using LDA from sklearn and genism packages. 20 | 21 | ## Packages used 22 | 23 | pdfminer, sklearn, gensim, nltk 24 | -------------------------------------------------------------------------------- /Miscellaneous/Word_Embedding.md: -------------------------------------------------------------------------------- 1 | # Word Embedding 2 | 3 | Embeddings are vector representation of an entity, most commonly, with useful information stored in the vectors. Embeddings, in the sense of text corpus, can be of characters, words, phrases, sentences or even documents. Let us look at the different embeddings there. 4 | 5 | Word embeddings are the vector representation of words in a corpus. It can be as simple as one-hot vectors, where only the word is marked 1, for the rest all the words of the vocabulary, it is marked 0 in the vector. Word embeddings can also be computed using complex neural networks, while considering the semantic similarities between the words. 6 | 7 | There are different ways to represent words as numbers, i.e. vectors. 8 | 9 | ## Simple Frequency based practises: 10 | 11 | One of the simple way, one-hot vector, was mentioned above. Basically, the vector size is the number of words in the vocabulary, i.e. total number of distinct, unique words in the whole corpus. A word is represented by marking 1 under its column, and rest all the points of vectors are marked 0. 12 | 13 | Tf-Idf is another way to find the word embeddings of words in a corpus. In Tf-Idf embedding are found use 2 components Term Frequency and Inverse Document frequency. Term frequency is the number of times a word appeared in a document, divided by the total number of words in that document. It’s the measure of how important a word is to the document. 14 | 15 | Inverse Document frequency is the log of total number of documents in the corpus divided by the number of documents a term appeared in it. IDF is the measure of that terms importance in the document. If that term appears in almost all the documents, it must not be useful for the document. 
TF-IDF of a term is then found by multiplying the TF value with the IDF value of the term. 16 | 17 | Another way is to build a co-occurrence matrix. Here, in the corpus, the occurrence of each pair of words within a Context Window is counted. The context window is a number representing how many words to consider in the context of another word. If the context window is 2, then 4 words will be co-occurring with a given word (2 before and 2 after). So a co-occurrence matrix is a square matrix of size equal to the total number of words in the vocabulary. A vector for a word can be taken as either the column or the row of that word in the co-occurrence matrix. 18 | 19 | ## Word2Vec 20 | 21 | In word2vec a single-layer neural network is made to predict the context words of a word. The number of context words to be predicted is determined by the context window size. This technique is known as the skip-gram model. There is another model, known as the CBOW model, which predicts the word given the context words. 22 | 23 | In word2vec, the neural network is given the word as input and made to predict the context words. Once the neural network is trained, the weights of the hidden layer are taken as the word embedding for the input word. The size of the hidden layer determines the vector size of the word embedding. The semantic information is embedded in the vectors because the whole purpose of the neural network being trained is to predict the context words. 24 | 25 | The Word2Vec embedding can be downloaded from -> [https://github.com/mmihaltz/word2vec-GoogleNews-vectors](https://github.com/mmihaltz/word2vec-GoogleNews-vectors) 26 | 27 | ## GloVe 28 | 29 | The GloVe vectors are computed from the co-occurrence matrix and, since the method sees the overall text, the GloVe vectors carry information about the global distribution of words in the corpus. Semantic relations also stay intact, since the objective of GloVe is to consider the co-occurrence of 2 words. As given in their paper: 30 | 31 | The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence. Owing to the fact that the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well. For this reason, the resulting word vectors perform very well on word analogy tasks, such as those examined in the word2vec package. 32 | 33 | Link to GloVe -> [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) 34 | 35 | ## FastText 36 | 37 | All the embeddings discussed so far are meant for whole words. FastText is similar to Word2Vec, but it is meant for sub-word information. FastText is trained very similarly to Word2Vec, with the difference being that the words are now divided into sub-words. Example: “banana” will be divided into “ba”, “na” and “na”, which will have their own embeddings (in this case both the “na” sub-words are the same and will have the same embedding). 38 | 39 | Using FastText embeddings, even out-of-vocabulary (OOV) word embeddings can be estimated, since the sub-words forming the OOV word may carry useful information. FastText can also be used to get character vectors, where each character has its own vector.
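For illustration, here is a minimal sketch of this sub-word behaviour using the gensim package. The toy corpus, the default parameters and the query word are all made up for this example; a real model would be trained on a much larger text collection.

```python
# A minimal sketch (toy corpus, gensim FastText) of how sub-word information
# gives a vector even for an out-of-vocabulary word.
from gensim.models import FastText

# Made-up corpus, only for demonstration.
sentences = [
    ["i", "like", "eating", "bananas"],
    ["bananas", "are", "yellow", "fruits"],
    ["apples", "and", "oranges", "are", "fruits", "too"],
]

model = FastText(sentences, min_count=1)

# "banana" (singular) never occurs in the corpus, but a vector is still
# built for it from the character n-grams it shares with "bananas".
print(model.wv["banana"].shape)
print(model.wv.similarity("banana", "bananas"))
```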
40 | 41 | The FastText embedding file can be downloaded from -> [https://fasttext.cc/docs/en/english-vectors.html](https://fasttext.cc/docs/en/english-vectors.html) 42 | 43 | ## Phrase, Sentence, Document Embeddings 44 | 45 | Some of the simple ways of getting embeddings for components formed from words, like phrases, sentences, or documents, are: 46 | 47 | 1. Get the mean (average) of the word embeddings of the words forming that component. 48 | 49 | 2. Concatenate the word embeddings of the words of the component. This method may not be very useful since it creates very large vectors as the sequence length increases. 50 | 51 | 3. Store and use different distributions (max, min, avg) of the word embeddings of the words forming that component. 52 | 53 | There is also an approach known as 'BERT' to get sentence embeddings, which is explained more [here]. 54 | 55 | 56 | 57 | -------------------------------------------------------------------------------- /Miscellaneous/common_regex.md: -------------------------------------------------------------------------------- 1 | # Common Regular Expressions 2 | 3 | ## Matching a Username 4 | 5 | > /^[a-z0-9_-]{3,16}$/ 6 | 7 | The beginning of the string (^), followed by any lowercase letter (a-z), number (0-9), an underscore, or a hyphen. Next, {3,16} makes sure that there are at least 3 of those characters, but no more than 16. Finally, the end of the string ($). 8 | 9 | ## Matching a Hex Value 10 | 11 | > /^#?([a-f0-9]{6}|[a-f0-9]{3})$/ 12 | 13 | The beginning of the string (^). Next, a (#) is optional because it is followed by a (?). The question mark signifies that the preceding character is optional, but it is captured if it's there. Next, inside the first group (first group of parentheses), we can have two different situations. The first is any lowercase letter between a and f or a number, six times. The vertical bar tells us that we can also have three lowercase letters between a and f or numbers instead. Finally, we want the end of the string ($). 14 | 15 | 16 | ## Matching an Email 17 | 18 | > /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/ 19 | 20 | The beginning of the string (^). Inside the first group, match one or more lowercase letters, numbers, underscores, dots, or hyphens. The dot is escaped because a non-escaped dot means any character. Directly after that, there must be an at sign. Next is the domain name, which must be one or more lowercase letters, numbers, underscores, dots, or hyphens. Then another (escaped) dot, with the extension being two to six letters or dots. The range is 2 to 6 because of country-specific TLDs (.com or .co.in). Finally, we want the end of the string ($). 21 | 22 | 23 | ## Matching a URL 24 | 25 | > /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ 26 | 27 | The first capturing group is all optional. It allows the URL to begin with "http://", "https://", or neither of them. A question mark after the s allows URLs that use either http or https. In order to make this entire group optional, a question mark is added to the end of it. 28 | 29 | Next is the domain name: one or more numbers, letters, dots, or hyphens followed by another dot, then two to six letters or dots. The following section is the optional files and directories. Inside the group, a match is made for any number of forward slashes, letters, numbers, underscores, spaces, dots, or hyphens. Then this group is allowed to be matched as many times as we want. Pretty much this allows multiple directories to be matched along with a file at the end.
A star is used instead of the question mark because the star says zero or more, not zero or one. If a question mark were used there, only one file/directory could be matched. 30 | 31 | Then a trailing slash is matched, but it is optional. Finally, the regex ends with the end of the line. 32 | 33 | 34 | ## Matching an HTML Tag 35 | 36 | > /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/ 37 | 38 | This matches any HTML tag with the content inside. As usual, it begins with the start of the line. 39 | 40 | First comes the tag's name. It must be one or more letters long. Next come the tag's attributes. This is any character but a greater-than sign (>). A star is used since attributes are optional and may be more than 1 character long. The plus sign makes up the attribute and value, and the star says as many attributes as you want. 41 | 42 | Next comes the third non-capture group. Inside, it will contain either a greater-than sign, some content, and a closing tag; or some spaces, a forward slash, and a greater-than sign. The first option looks for a greater-than sign followed by any number of characters, and the closing tag. \1 is used, which represents the content that was captured in the first capturing group; in this case it was the tag's name. Now, if that couldn't be matched, we want to look for a self-closing tag (like an img, br, or hr tag). This needs to have one or more spaces followed by "/>". 43 | 44 | The regex ends with the end of the line. 45 | 46 | 47 | ## Matching an IP Address 48 | 49 | > /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/ 50 | -------------------------------------------------------------------------------- /Miscellaneous/graph_network_analysis.md: -------------------------------------------------------------------------------- 1 | # Graph Network Analysis 2 | 3 | ## Number of edges 4 | 5 | Calculating the distribution of the in-degree and out-degree in directed graphs (or the distribution of degree in undirected graphs) can be useful for understanding the connections of nodes. 6 | 7 | ## Average Path Length 8 | 9 | The average of the shortest path lengths for all possible node pairs. It gives a measure of the ‘tightness’ of the Graph and can be used to understand how quickly/easily something flows in this Network. 10 | 11 | ## BFS and DFS 12 | 13 | Breadth first search and Depth first search are two different algorithms used to search for Nodes in a Graph. They are typically used to figure out whether we can reach a Node from a given Node. This is also known as Graph Traversal. 14 | 15 | ## Centrality 16 | 17 | Centrality aims to find the most important nodes in a network. There may be different notions of “important” and hence there are many centrality measures. 18 | 19 | Some of the most commonly used ones are: 20 | 21 | 1. Degree Centrality – This is the number of edges connected to a node. In the case of a directed graph, we can have 2 degree centrality measures: Inflow and Outflow Centrality. 22 | 2. Closeness Centrality – The closeness centrality of a node is the average length of the shortest paths from the node to all other nodes. 23 | 3. Betweenness Centrality – The number of times a node is present in the shortest path between 2 other nodes. 24 | 25 | These centrality measures have variants and the definitions can be implemented using various algorithms. All in all, this means a large number of definitions and algorithms.
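For illustration, here is a minimal sketch of the three centrality measures described above, computed with the networkx package on a small made-up graph (the edge list is invented for this example).

```python
# A minimal sketch of degree, closeness and betweenness centrality using networkx.
import networkx as nx

# Toy undirected graph, made up for the example.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

# Degree centrality: fraction of the other nodes each node is connected to.
print(nx.degree_centrality(G))

# Closeness centrality: based on the average shortest-path length from a node to all others.
print(nx.closeness_centrality(G))

# Betweenness centrality: how often a node lies on shortest paths between other node pairs.
print(nx.betweenness_centrality(G))
```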
26 | 27 | ## Network Density 28 | 29 | A measure of the structure of a network, which measures how many of all possible links within the network are realized. The density is 0 if there are no edges and 1 for a complete Graph. 30 | 31 | ## Scale-Free Property 32 | 33 | 'Real' networks have a certain underlying creation process, which results in some nodes with a much higher degree compared to other nodes. 34 | 35 | These nodes with a very high degree in the network are called hubs. An example is Twitter as a Social Network, where prominent people represent hubs, having many more edges to other nodes than the average user. 36 | 37 | ## Node Connectivity 38 | 39 | This describes the number of nodes one must delete from the Graph until it is disconnected. Connected means that every node in the graph can reach any other node in the network via edges. If this is not the case, the graph is disconnected. An important property of any graph is that it should not be easy to disconnect. 40 | -------------------------------------------------------------------------------- /Miscellaneous/pdf_To_doc.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example for converting PDF document to string\n", 8 | "\n", 9 | "A function to read any pdf file and get the text in string format using the pdfminer package in python 3.6" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "Using TensorFlow backend.\n" 22 | ] 23 | }, 24 | { 25 | "data": { 26 | "text/plain": [ 27 | "('2.2.4',\n", 28 | " '1.11.0',\n", 29 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 30 | ] 31 | }, 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "output_type": "execute_result" 35 | } 36 | ], 37 | "source": [ 38 | "import keras, tensorflow, sys\n", 39 | "keras.__version__, tensorflow.__version__, sys.version" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Package Required: pdfminer\n", 47 | "\n", 48 | "To install in python 3.6 on Windows 10, use the pip command: pip install pdfminer.six" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# import packages\n", 58 | "\n", 59 | "from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter\n", 60 | "from pdfminer.converter import TextConverter\n", 61 | "from pdfminer.layout import LAParams\n", 62 | "from pdfminer.pdfpage import PDFPage\n", 63 | "from io import StringIO\n", 64 | "\n", 65 | "import re\n" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## Function to read pdf files" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "def convert_pdf_to_txt(path):\n", 82 | " rsrcmgr = PDFResourceManager()\n", 83 | " retstr = StringIO()\n", 84 | " \n", 85 | " # Get a text converter for reading the pdf file\n", 86 | " device = TextConverter(rsrcmgr, retstr, codec=\"utf8\", laparams=LAParams())\n", 87 | " pdf_file = open(path, 'rb')\n", 88 | " interpreter = PDFPageInterpreter(rsrcmgr, device)\n", 89 | " pagenos=set()\n", 90 | "\n", 91 | " # convert the pdf file to StringIO using the interpreter for 
reading pdf file.\n", 92 | " for page in PDFPage.get_pages(pdf_file, pagenos, maxpages=0, password=\"\",caching=True, check_extractable=True):\n", 93 | " interpreter.process_page(page)\n", 94 | "\n", 95 | " text = retstr.getvalue()\n", 96 | " \n", 97 | " # close all the open streams\n", 98 | " pdf_file.close()\n", 99 | " device.close()\n", 100 | " retstr.close()\n", 101 | " \n", 102 | " return text" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 4, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | " \n", 115 | "\n", 116 | " \n", 117 | "\n", 118 | "Contact\n", 119 | "8867211055 (Mobile)\n", 120 | "netik1020@gmail.com\n", 121 | "\n", 122 | "www.linkedin.com/in/netik1020\n", 123 | "(LinkedIn)\n", 124 | "\n", 125 | "Top Skills\n", 126 | "Machine Learning\n", 127 | "Deep Learning\n", 128 | "Research\n", 129 | "\n", 130 | "Certifications\n", 131 | "Applied Text Mining in Python\n", 132 | "Introduction to Data Science in\n", 133 | "Python\n", 134 | "Applied Social Network Analysis in\n", 135 | "Python\n", 136 | "Applied Machine Learning in Python\n", 137 | "Improving Deep Neural Networks:\n", 138 | "Hyperparameter tuning,\n", 139 | "Regularization and Optimization\n", 140 | "\n", 141 | " \n", 142 | "\n", 143 | "Netik Agarwal\n", 144 | "\n", 145 | "Data Science Consultant - Deep Learning Researcher\n", 146 | "Bengaluru, Karnataka, India\n", 147 | "\n", 148 | "Summary\n", 149 | "A Passionate Deep Learning engineer, working as a consultant. I am\n", 150 | "proficient with:-\n", 151 | "1. Neural Networks like LSTM, CNN, GAN and Autoencoders using\n", 152 | "inception blocks, resnet.\n", 153 | "2. Solving text classification problems using Bi-LSTM with word\n", 154 | "embeddings like word2vec/ Glove.\n", 155 | "3. Working on big data in a Hadoop cluster using PySpark API of\n", 156 | "spark.\n", 157 | "4. Working on images using CNN and OpenCV for data\n", 158 | "transformation.\n", 159 | "5. The underlying math behind gradients, optimizers and different\n", 160 | "algorithms.\n", 161 | "6. ML packages in python like scikit-Learn, numpy, pandas to use\n", 162 | "models like SVM, boosted trees, XGB.\n", 163 | "7. 
Deep Neural network models with tensorflow (both high & low\n", 164 | "level API's), mxnet and also Keras API.\n", 165 | "\n", 166 | "Experience\n", 167 | "\n", 168 | "Self employed\n", 169 | "Data science Consultant\n", 170 | "October 2017 - Present \n", 171 | "Bengaluru Area, India\n", 172 | "\n", 173 | "Webyog, Inc.\n", 174 | "Software Engineer\n", 175 | "June 2016 - January 2017 (8 months)\n", 176 | "Bengaluru Area, India\n", 177 | "Worked with Win32 API to make bug-fixes and introduce new features for\n", 178 | "SQLyog - GUI tool to manage databases.\n", 179 | "\n", 180 | "Page 1 of 2\n", 181 | "\n", 182 | "\f", 183 | " \n", 184 | "\n", 185 | " \n", 186 | "\n", 187 | " \n", 188 | "\n", 189 | "Philips\n", 190 | "Intern\n", 191 | "January 2016 - May 2016 (5 months)\n", 192 | "Bengaluru Area, India\n", 193 | "Developing Rest services for management of medical images and integration\n", 194 | "of these services.\n", 195 | "\n", 196 | "Education\n", 197 | "Manipal Institute of Technology\n", 198 | "BTECH, Information Technology · (2012 - 2016)\n", 199 | "\n", 200 | "Page 2 of 2\n", 201 | "\n", 202 | "\f", 203 | "\n" 204 | ] 205 | } 206 | ], 207 | "source": [ 208 | "print(convert_pdf_to_txt('C:/Users/Netik/Downloads/netik/Netik_Resume.pdf'))" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [] 217 | } 218 | ], 219 | "metadata": { 220 | "kernelspec": { 221 | "display_name": "Python 3", 222 | "language": "python", 223 | "name": "python3" 224 | }, 225 | "language_info": { 226 | "codemirror_mode": { 227 | "name": "ipython", 228 | "version": 3 229 | }, 230 | "file_extension": ".py", 231 | "mimetype": "text/x-python", 232 | "name": "python", 233 | "nbconvert_exporter": "python", 234 | "pygments_lexer": "ipython3", 235 | "version": "3.6.6" 236 | } 237 | }, 238 | "nbformat": 4, 239 | "nbformat_minor": 2 240 | } 241 | -------------------------------------------------------------------------------- /Miscellaneous/topic_modeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Using LDA with skLearn and gensim\n", 8 | "\n", 9 | "The notebook uses skLearn and gensim packages to fetch 'n' important topics and 'm' most occuring words in each topic, grouped according to the LDA. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "Using TensorFlow backend.\n" 22 | ] 23 | }, 24 | { 25 | "data": { 26 | "text/plain": [ 27 | "('2.2.4',\n", 28 | " '1.11.0',\n", 29 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 30 | ] 31 | }, 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "output_type": "execute_result" 35 | } 36 | ], 37 | "source": [ 38 | "import keras, tensorflow, sys\n", 39 | "keras.__version__, tensorflow.__version__, sys.version" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Fetching data" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "# Using easy to use 20newsgroups data from sklearn.\n", 56 | "\n", 57 | "from sklearn.datasets import fetch_20newsgroups\n", 58 | "\n", 59 | "dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))\n", 60 | "documents = dataset.data" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "data": { 70 | "text/plain": [ 71 | "\"Well i'm not sure about the story nad it did seem biased. What\\nI disagree with is your statement that the U.S. Media is out to\\nruin Israels reputation. That is rediculous. The U.S. media is\\nthe most pro-israeli media in the world. Having lived in Europe\\nI realize that incidences such as the one described in the\\nletter have occured. The U.S. media as a whole seem to try to\\nignore them. The U.S. is subsidizing Israels existance and the\\nEuropeans are not (at least not to the same degree). So I think\\nthat might be a reason they report more clearly on the\\natrocities.\\n\\tWhat is a shame is that in Austria, daily reports of\\nthe inhuman acts commited by Israeli soldiers and the blessing\\nreceived from the Government makes some of the Holocaust guilt\\ngo away. After all, look how the Jews are treating other races\\nwhen they got power. 
It is unfortunate.\\n\"" 72 | ] 73 | }, 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "output_type": "execute_result" 77 | } 78 | ], 79 | "source": [ 80 | "documents[0]" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## NLTK stopwords for english" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "{'why', 'few', \"hadn't\", 'theirs', 'me', 'he', 'very', \"should've\", \"wouldn't\", 'in', 'my', 'hasn', 'with', 'the', \"hasn't\", 'for', 'down', 'himself', 'we', 'doesn', \"it's\", 'm', 'have', 'that', \"aren't\", 'whom', 'where', 's', 'their', 'other', 'she', 'was', 'shouldn', 'at', 'off', 'if', \"she's\", 'don', 'what', \"you'd\", 'be', 'here', 'do', 'each', 'no', 'isn', 'is', 'from', 'ain', 'couldn', 'more', 'were', 'then', 'too', 'by', 'on', 'its', 'own', \"weren't\", 'both', 'ourselves', 'any', 'up', \"mustn't\", 'and', 'mightn', 'hadn', 'had', 'after', 'weren', 'over', 'itself', 'some', 'will', 'to', 'so', 'during', 'shan', 'under', 'yours', \"don't\", 'these', 'because', 'myself', 'you', 'which', 'until', 't', 're', 'your', 'while', \"isn't\", \"you've\", 'him', 'should', \"couldn't\", 'd', 'our', 'all', \"shouldn't\", 'once', 'of', 'further', 'before', 'ma', 'are', \"needn't\", 'am', 'an', \"didn't\", 'between', \"wasn't\", 'as', 'didn', 'it', 'most', 'did', \"mightn't\", \"that'll\", 'haven', 'about', 'o', 'only', 'won', 'there', 'needn', 'below', 'having', 'but', 'doing', 'them', 'wouldn', 'aren', \"doesn't\", 'yourself', 'again', 'her', 'just', \"haven't\", 'ours', 'his', 'll', 'this', 'being', 'nor', 'themselves', \"shan't\", 'who', 'than', 'mustn', 'now', \"won't\", 'against', 'or', 'y', 'above', 'through', 'how', 'when', \"you're\", 'herself', 'not', 'does', 'i', \"you'll\", 'yourselves', 'into', 'those', 'been', 've', 'such', 'they', 'can', 'hers', 'out', 'same', 'a', 'has', 'wasn'}\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "from nltk.corpus import stopwords\n", 105 | "stopwords_en = set(stopwords.words('english'))\n", 106 | "print(stopwords_en)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## Using skLearn LDA for clustering into topics and finding interesting words in each topic\n", 114 | "\n", 115 | "The documents must be vectorized using countvectorizer when performing LDA clustering on data using sklearn.\n", 116 | "\n", 117 | "The count Vevorizer covertes a document by representing it as a vector of count of all the different words in the vocabulary. As one can see most of the units in a document vector will be 0. " 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 5, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", 127 | "\n", 128 | "\n", 129 | "c_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')\n", 130 | "\n" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "max_df -> if a word occurs in max_df percentage of documents, ignore those words. Ignore words that occurs in almost all the documents. eg. 'a' , 'the'. \n", 138 | "\n", 139 | "min_df -> if a word occurs in less than min_df number of dcouments, ignore those words. Ignore words that occurs in very few documents. eg. 
name of a person.\n", 140 | "\n", 141 | "max_features -> Consider the max_features number of words for the evaluation of topics. Words are taken by considering the ordered frequency of words across the documents (ofcourse, by ignoring the words occuring more than max_df).\n", 142 | "\n", 143 | "stop_words -> Remove english stopwords from the corpus. stop words are words like 'a', 'the', 'of', etc.\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "Fit the above LDA with the countvectorized document." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 6, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "c_vec = c_vectorizer.fit_transform(documents)\n", 160 | "c_feature_names = c_vectorizer.get_feature_names()" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "Finding the LDA components by running the LatentDirichletAllocation function. " 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 7, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stderr", 177 | "output_type": "stream", 178 | "text": [ 179 | "D:\\Program_Files\\Anaconda\\envs\\tensorflow\\lib\\site-packages\\sklearn\\decomposition\\online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21\n", 180 | " DeprecationWarning)\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "from sklearn.decomposition import LatentDirichletAllocation\n", 186 | "\n", 187 | "no_topics = 20\n", 188 | "\n", 189 | "# Run LDA\n", 190 | "lda = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online',\n", 191 | " learning_offset=50.,random_state=0).fit(c_vec)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "learning_method -> 'online' training means topics(components) will be incrementally trained on mini batches of data, rather than updating component values from the whole data at once.\n", 199 | "\n", 200 | "learning_offset -> a parameter for online training to slowly learn at the start of the training." 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "### Example Topics and top words in the topic." 
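Besides the per-topic word lists printed in the next cell, per-document topic assignments can also be obtained from the fitted model. A minimal sketch using the `lda` and `c_vec` objects defined above (the output variable names are illustrative, not from this notebook):

```python
# Topic distribution for every document: shape (n_documents, no_topics)
doc_topic = lda.transform(c_vec)

# Most probable topic per document
dominant_topic = doc_topic.argmax(axis=1)
print(doc_topic.shape, dominant_topic[:10])
```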
208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 8, 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "Topic 0:\n", 220 | "people gun state control right guns crime states law police\n", 221 | "Topic 1:\n", 222 | "time question book years did like don space answer just\n", 223 | "Topic 2:\n", 224 | "mr line rules science stephanopoulos title current define int yes\n", 225 | "Topic 3:\n", 226 | "key chip keys clipper encryption number des algorithm use bit\n", 227 | "Topic 4:\n", 228 | "edu com cs vs w7 cx mail uk 17 send\n", 229 | "Topic 5:\n", 230 | "use does window problem way used point different case value\n", 231 | "Topic 6:\n", 232 | "windows thanks know help db does dos problem like using\n", 233 | "Topic 7:\n", 234 | "bike water effect road design media dod paper like turn\n", 235 | "Topic 8:\n", 236 | "don just like think know people good ve going say\n", 237 | "Topic 9:\n", 238 | "car new price good power used air sale offer ground\n", 239 | "Topic 10:\n", 240 | "file available program edu ftp information files use image version\n", 241 | "Topic 11:\n", 242 | "ax max b8f g9v a86 145 pl 1d9 0t 34u\n", 243 | "Topic 12:\n", 244 | "government law privacy security legal encryption court fbi technology information\n", 245 | "Topic 13:\n", 246 | "card bit memory output video color data mode monitor 16\n", 247 | "Topic 14:\n", 248 | "drive scsi disk mac hard apple drives controller software port\n", 249 | "Topic 15:\n", 250 | "god jesus people believe christian bible say does life church\n", 251 | "Topic 16:\n", 252 | "year game team games season play hockey players league player\n", 253 | "Topic 17:\n", 254 | "10 00 15 25 20 11 12 14 16 13\n", 255 | "Topic 18:\n", 256 | "armenian israel armenians war people jews turkish israeli said women\n", 257 | "Topic 19:\n", 258 | "president people new said health year university school day work\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "\n", 264 | "no_top_words = 10\n", 265 | "\n", 266 | "for topic_idx, topic in enumerate(lda.components_):\n", 267 | " print(\"Topic %d:\" % (topic_idx))\n", 268 | " print(\" \".join([c_feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "## LDA using gensim\n", 276 | "\n", 277 | "The stopwords in the document are removed for finding relevant top words. " 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 9, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stderr", 287 | "output_type": "stream", 288 | "text": [ 289 | "D:\\Program_Files\\Anaconda\\envs\\tensorflow\\lib\\site-packages\\gensim\\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n", 290 | " warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "from gensim import corpora\n", 296 | "dictionary = corpora.Dictionary([x.split() for x in documents])\n", 297 | "corpus = [dictionary.doc2bow([text for text in x.split() if text.lower() not in stopwords_en]) for x in documents] " 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "### Topic words are given along with there importance." 
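Once the `LdaModel` in the next cell has been fitted, the same `dictionary` can be reused to infer topics for unseen text. A small sketch (the example sentence is made up for illustration):

```python
new_doc = "the team won the hockey game last season"
bow = dictionary.doc2bow([w for w in new_doc.split() if w.lower() not in stopwords_en])

# List of (topic_id, probability) pairs for the new document
print(ldamodel.get_document_topics(bow))
```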
305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 10, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "(0, '0.070*\":\" + 0.026*\">\" + 0.005*\"-\" + 0.004*\"anonymous\" + 0.003*\"?\" + 0.003*\"RIPEM\" + 0.003*\"mail\" + 0.002*\"email\" + 0.002*\"information\" + 0.002*\"posting\"')\n", 317 | "(1, '0.023*\".\" + 0.008*\"|\" + 0.004*\"Gordon\" + 0.003*\"----------------------------------------------------------------------------\" + 0.003*\"surrender\" + 0.003*\"Banks\" + 0.003*\"intellect,\" + 0.003*\"shameful\" + 0.003*\"N3JXP\" + 0.003*\"\"Skepticism\"')\n", 318 | "(2, '0.007*\"*\" + 0.004*\"used\" + 0.003*\"use\" + 0.003*\"-\" + 0.003*\"ground\" + 0.002*\"power\" + 0.002*\"one\" + 0.002*\"using\" + 0.002*\"car\" + 0.002*\"may\"')\n", 319 | "(3, '0.012*\"key\" + 0.005*\"space\" + 0.004*\"launch\" + 0.003*\"keys\" + 0.003*\"algorithm\" + 0.003*\"first\" + 0.003*\"chip\" + 0.003*\"satellite\" + 0.003*\"DES\" + 0.003*\"--\"')\n", 320 | "(4, '0.005*\"-\" + 0.004*\"&\" + 0.004*\"Space\" + 0.004*\"University\" + 0.003*\"1993\" + 0.003*\"available\" + 0.003*\"Center\" + 0.003*\"--\" + 0.003*\"NASA\" + 0.003*\"April\"')\n", 321 | "(5, '0.007*\"use\" + 0.006*\"-\" + 0.006*\"get\" + 0.005*\"would\" + 0.005*\"like\" + 0.005*\"using\" + 0.005*\"know\" + 0.004*\"one\" + 0.004*\"anyone\" + 0.004*\"need\"')\n", 322 | "(6, '0.008*\"government\" + 0.007*\"--\" + 0.006*\"Q\" + 0.005*\"would\" + 0.004*\"President\" + 0.004*\"law\" + 0.003*\"encryption\" + 0.003*\"public\" + 0.003*\"American\" + 0.003*\"right\"')\n", 323 | "(7, '0.014*\"#\" + 0.003*\"|>\" + 0.003*\"people\" + 0.003*\"?\" + 0.003*\"one\" + 0.002*\"Israel\" + 0.002*\"may\" + 0.002*\"Israeli\" + 0.002*\"information\" + 0.002*\"also\"')\n", 324 | "(8, '0.082*\"-\" + 0.012*\"$\" + 0.005*\"!\" + 0.005*\"henrik]\" + 0.003*\"vs.\" + 0.003*\"games,\" + 0.002*\"Good\" + 0.002*\"Sharks\" + 0.002*\"Rockefeller\" + 0.002*\"Excellent\"')\n", 325 | "(9, '0.010*\"+\" + 0.007*\";\" + 0.006*\"-\" + 0.003*\"modem\" + 0.002*\"&\" + 0.002*\"shipping\" + 0.002*\"-1\" + 0.002*\"From:\" + 0.002*\"Apr\" + 0.002*\"]\"')\n", 326 | "(10, '0.009*\"->\" + 0.003*\"unit\" + 0.002*\"32-bit\" + 0.002*\"cross\" + 0.002*\"NuBus\" + 0.002*\"allocation\" + 0.002*\"linked\" + 0.002*\"GO\" + 0.002*\"NT\" + 0.002*\"Weaver\"')\n", 327 | "(11, '0.012*\"would\" + 0.009*\"one\" + 0.008*\"like\" + 0.008*\"think\" + 0.007*\"get\" + 0.007*\"know\" + 0.005*\"I\\'m\" + 0.005*\"--\" + 0.005*\"people\" + 0.005*\"could\"')\n", 328 | "(12, '0.007*\"people\" + 0.006*\"would\" + 0.004*\"may\" + 0.004*\"many\" + 0.003*\"one\" + 0.003*\"use\" + 0.002*\"make\" + 0.002*\"also\" + 0.002*\"cause\" + 0.002*\"could\"')\n", 329 | "(13, '0.010*\"God\" + 0.007*\"Jesus\" + 0.005*\"one\" + 0.005*\"believe\" + 0.004*\"Christian\" + 0.003*\"say\" + 0.003*\"Bible\" + 0.003*\"would\" + 0.003*\"$1\" + 0.003*\"Christians\"')\n", 330 | "(14, '0.007*\"Armenian\" + 0.006*\".\" + 0.005*\"Armenians\" + 0.005*\"Turkish\" + 0.005*\"people\" + 0.004*\"Jews\" + 0.004*\"said\" + 0.003*\"-\" + 0.003*\"said,\" + 0.003*\"killed\"')\n", 331 | "(15, '0.043*\"1\" + 0.030*\"0\" + 0.024*\"2\" + 0.014*\"3\" + 0.011*\"4\" + 0.008*\"5\" + 0.008*\"7\" + 0.006*\"6\" + 0.006*\"-\" + 0.006*\"25\"')\n", 332 | "(16, '0.090*\"MAX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'AX>\\'\" + 0.002*\"M\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\"`@(\" + 0.001*\"14\" + 0.001*\"Doug>\" + 
0.001*\"------------\" + 0.001*\"--------\" + 0.001*\"pm)\" + 0.001*\"/lib/libX11.so\" + 0.001*\"MG9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=G9V=\" + 0.001*\"Symbol\"')\n", 333 | "(17, '0.030*\"*/\" + 0.027*\"/*\" + 0.024*\"=\" + 0.016*\"DB\" + 0.015*\"}\" + 0.007*\"{\" + 0.006*\"char\" + 0.005*\"int\" + 0.005*\"_/\" + 0.004*\"*\"')\n", 334 | "(18, '0.088*\"X\" + 0.016*\"*\" + 0.007*\"file\" + 0.005*\"window\" + 0.005*\"----------------------------------------------------------------------\" + 0.005*\"entry\" + 0.005*\"program\" + 0.005*\"available\" + 0.004*\"use\" + 0.003*\"Subject:\"')\n", 335 | "(19, '0.041*\"|\" + 0.009*\"/\" + 0.005*\"||\" + 0.005*\"=\" + 0.004*\"entries\" + 0.003*\"\\\\\" + 0.003*\"wire\" + 0.003*\"radar\" + 0.002*\"(\" + 0.002*\"de\"')\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "from gensim.models.ldamodel import LdaModel\n", 341 | "ldamodel = LdaModel(corpus, num_topics=no_topics, id2word=dictionary, passes=15)\n", 342 | "\n", 343 | "topics = ldamodel.print_topics(num_words=no_top_words)\n", 344 | "for topic in topics:\n", 345 | " print(topic)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [] 354 | } 355 | ], 356 | "metadata": { 357 | "kernelspec": { 358 | "display_name": "Python 3", 359 | "language": "python", 360 | "name": "python3" 361 | }, 362 | "language_info": { 363 | "codemirror_mode": { 364 | "name": "ipython", 365 | "version": 3 366 | }, 367 | "file_extension": ".py", 368 | "mimetype": "text/x-python", 369 | "name": "python", 370 | "nbconvert_exporter": "python", 371 | "pygments_lexer": "ipython3", 372 | "version": "3.6.6" 373 | } 374 | }, 375 | "nbformat": 4, 376 | "nbformat_minor": 2 377 | } 378 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Concise-iPython-Notebooks-for-Deep-learning-on-Images-and-text 2 | 3 | This github repository is a collection of [iPython notebooks](https://ipython.org/ipython-doc/3/notebook/notebook.html) for solving different problems in deep learning using [keras API](https://keras.io/) and [tensorflow](https://www.tensorflow.org/) backend. The notebooks are written in as easy to read manner as possible. Everyone is welcome to openly contribute towards this repository and add/ recommend to add more useful and interesting notebooks. 4 | 5 | The notebooks having examples of image processing deals with problems like: 6 | 7 | 1. [Image Segmentation](./Image_Segmentation) to segment (find the boundary of) certain object in an image, was performed using U-Net architecture of the auto encoder model. 8 | 9 | 2. Object Detection was done using [Single Shot MultiBox Detector (SSD) model](./SSD). 10 | 11 | 3. [Image Classification](./Image_Classifier) was done using convolutional network. 12 | 13 | 4. The task of finding duplicate images among the given set of images is done [here](./Duplicate_images). 14 | 15 | The notebooks meant for processing/ understanding texts deals with problems like: 16 | 17 | 1. Basic entity extraction from text using [Named Entity Recognition](./Miscellaneous/NER_tagger/) and tagging the text using [POS taggers](./Miscellaneous/POS_Tagger/). 18 | 19 | 2. [Topic modelling using LDA](./Miscellaneous/topic_modeling.ipynb) and [converting pdf documents to readable text format](./Miscellaneous/pdf_To_doc.ipynb) in python. 20 | 21 | 3. 
[Classification of text queries](./Text_Classification/) into positive or negative comments. [GloVe](https://nlp.stanford.edu/projects/glove/) and [FastText](https://fasttext.cc/docs/en/english-vectors.html) embedding were used and multiple architectures including [bidirectional GRU](https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15),[LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), [Attention networks](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/) and combinations of these were experimented with. 22 | 23 | 4. [Relation extraction from a sentence](./Semantic_Relation_Extraction/), between 2 entities. A model which was aware of the entity words when finding the relation was made. Self-attention and GRU was used for feature extraction 24 | 25 | 5. [Intent classifier](./Intent_classifier/) was made to classify incoming queries into one of the intents. This can be used to understand the type of query a user is making in a chat-bot application. Intent can be to book a cab, or to find a restaurant, etc. 26 | 27 | [A demo program](./House_price_prediction_dash/) for showing use of machine learning to produce results for different input values is done. This work is done for purpose of demo, which allows the user to change the features of the model using an interactive web app and the output is produced accordingly and shown in the web app. 28 | 29 | ## Motivation 30 | 31 | I wanted to have this repository for everyone with good understanding of theory in Deep Learning, to have easy to read reference for some common tasks in Deep Learning. 32 | 33 | ## Packages used 34 | 35 | Common Packages – [Tensorflow](https://www.tensorflow.org/), [Tensorboard](https://www.tensorflow.org/guide/summaries_and_tensorboard), [keras](https://keras.io/), [sklearn](https://scikit-learn.org/), [numpy](http://www.numpy.org/), [pandas](https://pandas.pydata.org/), [mlflow](https://www.mlflow.org/docs/latest/index.html) 36 | 37 | Text Based Packages – [NLTK](https://www.nltk.org/), [genism](https://pypi.org/project/gensim/), [pdfminer](https://pypi.org/project/pdfminer/), [keras_self_attention](https://pypi.org/project/keras-self-attention/), [keras_multi_head](https://pypi.org/project/keras-multi-head/) 38 | 39 | Image Based Packages – [openCV](https://pypi.org/project/opencv-python/), [matplotlib](https://matplotlib.org/) 40 | 41 | ## Code style 42 | 43 | Standard. Python. 44 | 45 | ## Index 46 | 47 | ### Text 48 | 49 | 1. [Text Classifier](./Text_Classification/) – Have shown examples of different models to classify IMDB dataset . Text classifiers are very useful in tasks like passing queries to relevant department, in understanding customer review like in case of this dataset. 50 | 51 | 1.1. [Text_classifier](./Text_Classification/classification_imdb.ipynb) – Performs text classification by using different architecture/ layers like GRU, LSTM, Sequence-self-attention, multi-head-attention, global max pooling and global average pooling. Different combinations of above layers can be used by passing arguments to a function to train the different models. GloVe and FastText embeddings have been experimented with. 52 | 53 | 1.2. [Text_Classifer_2](./Text_Classification/self_Attn_on_seperate_fets_of_2embds.ipynb) – Here GloVe and FastText embeddings have been used as different features and there features are concatenated just before the dense layers. 
The final f1 score (90.5) for this notebook is highest among all the methods. 54 | 55 | 2. [Relation Extraction](./Semantic_Relation_Extraction/) – Data from SemEval_task8 is used to show an example of finding relations between two entities in a sentence. A keras model is built with concatenation of RNN features, self-attention features and max pooled features of the entities in the sentence. 56 | 57 | 3. [Intent Classifier](./Intent_classifier/) – Intent classifier is performed on a dataset containing 7 different intents. This is an example for how deep learning models can be successfully used to understand the intent of a text query. Customer intent can be determined and then a prewritten text can be generated to answer the user query. A simple yet effective chat-bot can be built this way, depending on the different intents possible and data available for each of those intients. 58 | 59 | 4. [Miscellaneous](./Miscellaneous/) – 60 | 61 | 4.1. [POS tagger](./Miscellaneous/POS_Tagger/) – POS tagging is to find the part-of-speech tag of the words in the sentence. The tags of the words can be passed as information to neural networks. 62 | 63 |     4.1.1. [NLTK POS tagger](./Miscellaneous/POS_Tagger/POSTagger_NLTK.ipynb) – Using NLTK package to perform the POS tagging. 64 | 65 |     4.1.2. [Stanford POS tagger](./Miscellaneous/POS_Tagger/POSTagger_Stanford_NLTK.ipynb) – Using pre-trained model of Stanford for POS tagging. 66 | 67 | 4.2. [NER tagger](./Miscellaneous/NER_tagger/) – NER tagging or Named Entity Recognition is to find some common entities in text like name, place, etc. or more subject dependent entity like years of experience, skills, etc. can all be entities while parsing a resume. NER are generally used to find some important words in the document and one can train their own document specific NER tagger. 68 | 69 |     4.2.1. [Stanford NER tagger](./Miscellaneous/NER_tagger/NER_stanford_NLTK.ipynb) – Pre-trained NER provided by the Stanford libraries for entities – Person, organization, location. 70 | 71 |     4.2.2. Self-trained keras model – An example of training your own NER model. (To be done) 72 | 73 | 4.3. [PDF to Doc](./Miscellaneous/pdf_To_doc.ipynb) – a very useful tool to read the pdf documents in python. PDFminer package is used here. 74 | 75 | 4.4. [RegEx](./Miscellaneous/common_regex.md) – Some powerful and commonly used Regular Expressions. 76 | 77 | 4.5. [Embeddings](./Miscellaneous/Word_Embedding.md) – A document going through different embeddings including sentence embedding. 78 | 79 | 4.6. [Topic Modelling](./Miscellaneous/topic_modeling.ipynb) – Topic modelling in text processing is to cluster document into topics according to the word frequencies, or basically, sentences in the document. Each topic are dominated by certain words which can also be extracted. Here topic modelling is done using LDA from sklearn and genism packages. 80 | 81 | ### Image 82 | 83 | 1. [Image Segmentation](./Image_Segmentation) – Image segmentaion or Pixel wise segmentation is a task in which each pixel in the image are classified into 2 or more classes. All the notebooks here have an auto encoder model with U-Net architecture to find the lung lesions from the CT images. 84 | 85 | 1.1. [lungs_conv_unet](./Image_Segmentation/lungs_conv_unet.ipynb) - An autoencoder model with U-Net architecture is used. 86 | 87 | 1.2. [lungs_incp_unet](./Image_Segmentation/lungs_incp_unet.ipynb) - Here convolution layers are replaced with inception blocks. 88 | 89 | 1.3. 
[lungs_incp_unet_snapshot](./Image_Segmentation/lungs_incp_unet_snapshot.ipynb) - Model exactly same as the lungs_incp_unet model with the addition of cosine annealed Learning rate. 90 | 91 | 1.4. [lungs_incp_unet_snapshot_mlflow](./Image_Segmentation/seg_mlflow/) - Contains files for lungs_incp_unet_snapshot model deployed using mlflow workflow. 92 | 93 | 2. [Single Shot MultiBox Detector (SSD) model](./SSD) – An example implementation of the [SSD model](https://arxiv.org/abs/1512.02325) is shown for objecct detection in pascal VOC dataset. 94 | 95 | 3. [Image Classification](./Image_Classifier) – Image classification is a task of classifying images into two or more classes. A simple yet powerful neural network models are built to classify the images. 96 | 97 | 3.1 [skin_cancer_classification_1](./Image_Classifier/skin_cancer_classification_1.ipynb) - A deep neural network model, with convolution layers for feature extraction. 98 | 99 | 3.2. [skin_cancer_classification_2](./Image_Classifier/skin_cancer_classification_2.ipynb) - Model is same as the skin_cancer_classification_1, only with the addition of cosine annealed learning rate during the training. 100 | 101 | 4. [Finding Duplicates](./Duplicate_images) - Notebook to find duplicate images by comparing image vectors of the image using [Locality Sensitivity Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) for efficient search. 102 | 103 | ## Reproducibility 104 | 105 | To run the code in local machine, requried packasges will need to be installed and the dataset must be downloaded from the links provided. 106 | 107 | If someone chose to run the programs online, [google colab](https://colab.research.google.com/notebooks/welcome.ipynb) provides free GPU access. Also [this](https://www.kaggle.com/general/51898) link can be useful for easily using kaggle datasets in googlecolab environment. 108 | -------------------------------------------------------------------------------- /SSD/Readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /SSD/boxes.py: -------------------------------------------------------------------------------- 1 | from itertools import product 2 | import numpy as np 3 | 4 | 5 | def calculate_intersection_over_union(box_data, prior_boxes): 6 | """Calculate intersection over union of box_data with respect to 7 | prior_boxes. 8 | 9 | Arguments: 10 | ground_truth_data: numpy array with shape (4) indicating x_min, y_min, 11 | x_max and y_max coordinates of the bounding box. 12 | prior_boxes: numpy array with shape (num_boxes, 4). 13 | 14 | Returns: 15 | intersections_over_unions: numpy array with shape (num_boxes) which 16 | corresponds to the intersection over unions of box_data with respect 17 | to all prior_boxes. 
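    A small worked example (the numbers are illustrative and not part of the
    original docstring): a 2x2 ground-truth box against a single 1x2 prior
    box covers half of their union, so the expected result is 0.5.

        >>> box = np.array([0., 0., 2., 2.])
        >>> priors = np.array([[0., 0., 1., 2.]])
        >>> calculate_intersection_over_union(box, priors)
        array([0.5])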
18 | """ 19 | x_min = box_data[0] 20 | y_min = box_data[1] 21 | x_max = box_data[2] 22 | y_max = box_data[3] 23 | prior_boxes_x_min = prior_boxes[:, 0] 24 | prior_boxes_y_min = prior_boxes[:, 1] 25 | prior_boxes_x_max = prior_boxes[:, 2] 26 | prior_boxes_y_max = prior_boxes[:, 3] 27 | # calculating the intersection 28 | intersections_x_min = np.maximum(prior_boxes_x_min, x_min) 29 | intersections_y_min = np.maximum(prior_boxes_y_min, y_min) 30 | intersections_x_max = np.minimum(prior_boxes_x_max, x_max) 31 | intersections_y_max = np.minimum(prior_boxes_y_max, y_max) 32 | intersected_widths = intersections_x_max - intersections_x_min 33 | intersected_heights = intersections_y_max - intersections_y_min 34 | intersected_widths = np.maximum(intersected_widths, 0) 35 | intersected_heights = np.maximum(intersected_heights, 0) 36 | intersections = intersected_widths * intersected_heights 37 | # calculating the union 38 | prior_box_widths = prior_boxes_x_max - prior_boxes_x_min 39 | prior_box_heights = prior_boxes_y_max - prior_boxes_y_min 40 | prior_box_areas = prior_box_widths * prior_box_heights 41 | box_width = x_max - x_min 42 | box_height = y_max - y_min 43 | ground_truth_area = box_width * box_height 44 | unions = prior_box_areas + ground_truth_area - intersections 45 | intersection_over_union = intersections / unions 46 | return intersection_over_union 47 | 48 | 49 | def regress_boxes(assigned_prior_boxes, ground_truth_box, box_scale_factors): 50 | 51 | x_min_scale, y_min_scale, x_max_scale, y_max_scale = box_scale_factors 52 | 53 | assigned_prior_boxes = to_center_form(assigned_prior_boxes) 54 | center_x_prior = assigned_prior_boxes[:, 0] 55 | center_y_prior = assigned_prior_boxes[:, 1] 56 | w_prior = assigned_prior_boxes[:, 2] 57 | h_prior = assigned_prior_boxes[:, 3] 58 | 59 | x_min = ground_truth_box[0] 60 | y_min = ground_truth_box[1] 61 | x_max = ground_truth_box[2] 62 | y_max = ground_truth_box[3] 63 | 64 | encoded_center_x = ((x_min + x_max) / 2.) - center_x_prior 65 | encoded_center_x = encoded_center_x / (x_min_scale * w_prior) 66 | encoded_center_y = ((y_min + y_max) / 2.) 
- center_y_prior 67 | encoded_center_y = encoded_center_y / (y_min_scale * h_prior) 68 | 69 | encoded_w = (x_max - x_min) / w_prior 70 | encoded_w = np.log(encoded_w) / x_max_scale 71 | encoded_h = (y_max - y_min) / h_prior 72 | encoded_h = np.log(encoded_h) / y_max_scale 73 | 74 | regressed_boxes = np.concatenate([encoded_center_x[:, None], 75 | encoded_center_y[:, None], 76 | encoded_w[:, None], 77 | encoded_h[:, None]], axis=1) 78 | 79 | return regressed_boxes 80 | 81 | 82 | def unregress_boxes(predicted_box_data, prior_boxes, 83 | box_scale_factors=[.1, .1, .2, .2]): 84 | 85 | x_min_scale, y_min_scale, x_max_scale, y_max_scale = box_scale_factors 86 | encoded_center_x = predicted_box_data[:, 0] 87 | encoded_center_y = predicted_box_data[:, 1] 88 | encoded_w = predicted_box_data[:, 2] 89 | encoded_h = predicted_box_data[:, 3] 90 | 91 | prior_boxes = to_center_form(prior_boxes) 92 | center_x_prior = prior_boxes[:, 0] 93 | center_y_prior = prior_boxes[:, 1] 94 | w_prior = prior_boxes[:, 2] 95 | h_prior = prior_boxes[:, 3] 96 | 97 | x_min = encoded_center_x * x_min_scale * w_prior 98 | x_min = x_min + center_x_prior 99 | 100 | y_min = encoded_center_y * y_min_scale * h_prior 101 | y_min = y_min + center_y_prior 102 | 103 | x_max = w_prior * np.exp(encoded_w * x_max_scale) 104 | y_max = h_prior * np.exp(encoded_h * y_max_scale) 105 | 106 | unregressed_boxes = np.concatenate([x_min[:, None], y_min[:, None], 107 | x_max[:, None], y_max[:, None]], 108 | axis=1) 109 | 110 | unregressed_boxes[:, :2] -= unregressed_boxes[:, 2:] / 2 111 | unregressed_boxes[:, 2:] += unregressed_boxes[:, :2] 112 | unregressed_boxes = np.clip(unregressed_boxes, 0.0, 1.0) 113 | if predicted_box_data.shape[1] > 4: 114 | unregressed_boxes = np.concatenate([unregressed_boxes, 115 | predicted_box_data[:, 4:]], axis=-1) 116 | return unregressed_boxes 117 | 118 | 119 | def to_point_form(boxes): 120 | 121 | center_x = boxes[:, 0] 122 | center_y = boxes[:, 1] 123 | width = boxes[:, 2] 124 | height = boxes[:, 3] 125 | 126 | x_min = center_x - (width / 2.) 127 | x_max = center_x + (width / 2.) 128 | y_min = center_y - (height / 2.) 129 | y_max = center_y + (height / 2.) 130 | 131 | return np.concatenate([x_min[:, None], y_min[:, None], 132 | x_max[:, None], y_max[:, None]], axis=1) 133 | 134 | 135 | def to_center_form(boxes): 136 | 137 | x_min = boxes[:, 0] 138 | y_min = boxes[:, 1] 139 | x_max = boxes[:, 2] 140 | y_max = boxes[:, 3] 141 | 142 | width = x_max - x_min 143 | height = y_max - y_min 144 | center_x = x_min + (width/2.) 145 | center_y = y_max - (height/2.) 146 | 147 | return np.concatenate([center_x[:, None], center_y[:, None], 148 | width[:, None], height[:, None]], axis=1) 149 | 150 | 151 | def assign_prior_boxes_to_ground_truth(ground_truth_box, prior_boxes, 152 | box_scale_factors=[.1, .1, .2, .2], 153 | regress=True, overlap_threshold=.5, 154 | return_iou=True): 155 | """ Assigns and regresses prior boxes to a single ground_truth_box 156 | data sample. 157 | TODO: Change this function so that it does not regress the boxes 158 | automatically. It should only assign them but not regress them! 159 | Arguments: 160 | prior_boxes: numpy array with shape (num_prior_boxes, 4) 161 | indicating x_min, y_min, x_max and y_max for every prior box. 162 | ground_truth_box: numpy array with shape (4) indicating 163 | x_min, y_min, x_max and y_max of the ground truth box. 164 | box_scale_factors: numpy array with shape (num_boxes, 4) 165 | Which represents a scaling of the localization gradient. 
166 | (https://github.com/weiliu89/caffe/issues/155) 167 | 168 | Returns: 169 | regressed_boxes: numpy array with shape (num_assigned_boxes) 170 | which correspond to the regressed values of all 171 | assigned_prior_boxes to the ground_truth_box 172 | """ 173 | ious = calculate_intersection_over_union(ground_truth_box, prior_boxes) 174 | regressed_boxes = np.zeros((len(prior_boxes), 4 + return_iou)) 175 | assign_mask = ious > overlap_threshold 176 | if not assign_mask.any(): 177 | assign_mask[ious.argmax()] = True 178 | if return_iou: 179 | regressed_boxes[:, -1][assign_mask] = ious[assign_mask] 180 | assigned_prior_boxes = prior_boxes[assign_mask] 181 | if regress: 182 | assigned_regressed_priors = regress_boxes(assigned_prior_boxes, 183 | ground_truth_box, 184 | box_scale_factors) 185 | regressed_boxes[assign_mask, 0:4] = assigned_regressed_priors 186 | return regressed_boxes.ravel() 187 | else: 188 | regressed_boxes[assign_mask, 0:4] = assigned_prior_boxes[:, 0:4] 189 | return regressed_boxes.ravel() 190 | 191 | 192 | # TODO change this name for match 193 | def assign_prior_boxes(prior_boxes, ground_truth_data, num_classes, 194 | box_scale_factors=[.1, .1, .2, .2], regress=True, 195 | overlap_threshold=.5, background_id=0): 196 | """ Assign and regress prior boxes to all ground truth samples. 197 | Arguments: 198 | prior_boxes: numpy array with shape (num_prior_boxes, 4) 199 | indicating x_min, y_min, x_max and y_max for every prior box. 200 | ground_truth_data: numpy array with shape (num_samples, 4) 201 | indicating x_min, y_min, x_max and y_max of the ground truth box. 202 | box_scale_factors: numpy array with shape (num_boxes, 4) 203 | Which represents a scaling of the localization gradient. 204 | (https://github.com/weiliu89/caffe/issues/155) 205 | 206 | Returns: 207 | assignments: numpy array with shape 208 | (num_samples, 4 + num_classes + 8) 209 | which correspond to the regressed values of all 210 | assigned_prior_boxes to the ground_truth_box 211 | """ 212 | assignments = np.zeros((len(prior_boxes), 4 + num_classes)) 213 | assignments[:, 4 + background_id] = 1.0 214 | num_objects_in_image = len(ground_truth_data) 215 | if num_objects_in_image == 0: 216 | return assignments 217 | encoded_boxes = np.apply_along_axis(assign_prior_boxes_to_ground_truth, 1, 218 | ground_truth_data[:, :4], prior_boxes, 219 | box_scale_factors, regress, 220 | overlap_threshold) 221 | encoded_boxes = encoded_boxes.reshape(-1, len(prior_boxes), 5) 222 | best_iou = encoded_boxes[:, :, -1].max(axis=0) 223 | best_iou_indices = encoded_boxes[:, :, -1].argmax(axis=0) 224 | best_iou_mask = best_iou > 0 225 | best_iou_indices = best_iou_indices[best_iou_mask] 226 | num_assigned_boxes = len(best_iou_indices) 227 | encoded_boxes = encoded_boxes[:, best_iou_mask, :] 228 | box_sequence = np.arange(num_assigned_boxes) 229 | assignments[best_iou_mask, :4] = encoded_boxes[best_iou_indices, 230 | box_sequence, :4] 231 | 232 | assignments[:, 4][best_iou_mask] = 0 233 | assignments[:, 5:][best_iou_mask] = ground_truth_data[best_iou_indices, 234 | 5:] 235 | return assignments 236 | 237 | 238 | def create_prior_boxes(configuration=None): 239 | if configuration is None: 240 | configuration = get_configuration_file() 241 | image_size = configuration['image_size'] 242 | feature_map_sizes = configuration['feature_map_sizes'] 243 | min_sizes = configuration['min_sizes'] 244 | max_sizes = configuration['max_sizes'] 245 | steps = configuration['steps'] 246 | model_aspect_ratios = configuration['aspect_ratios'] 247 | mean = [] 
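    # Descriptive note: the nested loops below emit prior boxes in centre form
    # (center_x, center_y, width, height), normalised by the image size. For
    # every cell of every feature map they append one square box of scale
    # s_k = min_size / image_size, one square box of scale
    # sqrt(s_k * max_size / image_size), and, for each aspect ratio a, a pair
    # of boxes whose width/height are stretched by sqrt(a) and 1/sqrt(a).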
248 | for feature_map_arg, feature_map_size in enumerate(feature_map_sizes): 249 | step = steps[feature_map_arg] 250 | min_size = min_sizes[feature_map_arg] 251 | max_size = max_sizes[feature_map_arg] 252 | aspect_ratios = model_aspect_ratios[feature_map_arg] 253 | for y, x in product(range(feature_map_size), repeat=2): 254 | f_k = image_size / step 255 | center_x = (x + 0.5) / f_k 256 | center_y = (y + 0.5) / f_k 257 | s_k = min_size / image_size 258 | mean = mean + [center_x, center_y, s_k, s_k] 259 | s_k_prime = np.sqrt(s_k * (max_size / image_size)) 260 | mean = mean + [center_x, center_y, s_k_prime, s_k_prime] 261 | for aspect_ratio in aspect_ratios: 262 | mean = mean + [center_x, center_y, s_k * np.sqrt(aspect_ratio), 263 | s_k / np.sqrt(aspect_ratio)] 264 | mean = mean + [center_x, center_y, s_k / np.sqrt(aspect_ratio), 265 | s_k * np.sqrt(aspect_ratio)] 266 | 267 | output = np.asarray(mean).reshape((-1, 4)) 268 | output = np.clip(output, 0, 1) 269 | return output 270 | 271 | 272 | def get_configuration_file(): 273 | configuration = {'feature_map_sizes': [38, 19, 10, 5, 3, 1], 274 | 'image_size': 300, 275 | 'steps': [8, 16, 32, 64, 100, 300], 276 | 'min_sizes': [30, 60, 111, 162, 213, 264], 277 | 'max_sizes': [60, 111, 162, 213, 264, 315], 278 | 'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]], 279 | 'variance': [0.1, 0.2]} 280 | return configuration 281 | 282 | 283 | def apply_non_max_suppression(boxes, scores, iou_thresh=.45, top_k=200): 284 | """ non maximum suppression in numpy 285 | 286 | Arguments: 287 | boxes : array of boox coordinates of shape (num_samples, 4) 288 | where each columns corresponds to x_min, y_min, x_max, y_max 289 | scores : array of scores given for each box in 'boxes' 290 | iou_thresh : float intersection over union threshold for removing boxes 291 | top_k : int Number of maximum objects per class 292 | 293 | Returns: 294 | selected_indices : array of integers Selected indices of kept boxes 295 | num_selected_boxes : int Number of selected boxes 296 | """ 297 | 298 | selected_indices = np.zeros(shape=len(scores)) 299 | if boxes is None or len(boxes) == 0: 300 | return selected_indices 301 | x_min = boxes[:, 0] 302 | y_min = boxes[:, 1] 303 | x_max = boxes[:, 2] 304 | y_max = boxes[:, 3] 305 | areas = (x_max - x_min) * (y_max - y_min) 306 | remaining_sorted_box_indices = np.argsort(scores) 307 | remaining_sorted_box_indices = remaining_sorted_box_indices[-top_k:] 308 | 309 | num_selected_boxes = 0 310 | while len(remaining_sorted_box_indices) > 0: 311 | best_score_index = remaining_sorted_box_indices[-1] 312 | selected_indices[num_selected_boxes] = best_score_index 313 | num_selected_boxes = num_selected_boxes + 1 314 | if len(remaining_sorted_box_indices) == 1: 315 | break 316 | 317 | remaining_sorted_box_indices = remaining_sorted_box_indices[:-1] 318 | 319 | best_x_min = x_min[best_score_index] 320 | best_y_min = y_min[best_score_index] 321 | best_x_max = x_max[best_score_index] 322 | best_y_max = y_max[best_score_index] 323 | 324 | remaining_x_min = x_min[remaining_sorted_box_indices] 325 | remaining_y_min = y_min[remaining_sorted_box_indices] 326 | remaining_x_max = x_max[remaining_sorted_box_indices] 327 | remaining_y_max = y_max[remaining_sorted_box_indices] 328 | 329 | inner_x_min = np.maximum(remaining_x_min, best_x_min) 330 | inner_y_min = np.maximum(remaining_y_min, best_y_min) 331 | inner_x_max = np.minimum(remaining_x_max, best_x_max) 332 | inner_y_max = np.minimum(remaining_y_max, best_y_max) 333 | 334 | inner_box_widths = inner_x_max 
- inner_x_min 335 | inner_box_heights = inner_y_max - inner_y_min 336 | 337 | inner_box_widths = np.maximum(inner_box_widths, 0.0) 338 | inner_box_heights = np.maximum(inner_box_heights, 0.0) 339 | 340 | intersections = inner_box_widths * inner_box_heights 341 | remaining_box_areas = areas[remaining_sorted_box_indices] 342 | best_area = areas[best_score_index] 343 | unions = remaining_box_areas + best_area - intersections 344 | 345 | intersec_over_union = intersections / unions 346 | intersec_over_union_mask = intersec_over_union <= iou_thresh 347 | remaining_sorted_box_indices = remaining_sorted_box_indices[ 348 | intersec_over_union_mask] 349 | 350 | return selected_indices.astype(int), num_selected_boxes 351 | -------------------------------------------------------------------------------- /SSD/data_augmentation.py: -------------------------------------------------------------------------------- 1 | # this files is taken from https://github.com/amdegroot/ssd.pytorch 2 | import random 3 | import cv2 4 | import numpy as np 5 | import types 6 | from numpy import random 7 | 8 | 9 | def intersect(box_a, box_b): 10 | max_xy = np.minimum(box_a[:, 2:], box_b[2:]) 11 | min_xy = np.maximum(box_a[:, :2], box_b[:2]) 12 | inter = np.clip((max_xy - min_xy), a_min=0, a_max=np.inf) 13 | return inter[:, 0] * inter[:, 1] 14 | 15 | 16 | def jaccard_numpy(box_a, box_b): 17 | """Compute the jaccard overlap of two sets of boxes. The jaccard overlap 18 | is simply the intersection over union of two boxes. 19 | E.g.: 20 | A ^ B / A u B = A ^ B / (area(A) + area(B) - A ^ B) 21 | Args: 22 | box_a: Multiple bounding boxes, Shape: [num_boxes,4] 23 | box_b: Single bounding box, Shape: [4] 24 | Return: 25 | jaccard overlap: Shape: [box_a.shape[0], box_a.shape[1]] 26 | """ 27 | inter = intersect(box_a, box_b) 28 | area_a = ((box_a[:, 2]-box_a[:, 0]) * 29 | (box_a[:, 3]-box_a[:, 1])) # [A,B] 30 | area_b = ((box_b[2]-box_b[0]) * 31 | (box_b[3]-box_b[1])) # [A,B] 32 | union = area_a + area_b - inter 33 | return inter / union # [A,B] 34 | 35 | 36 | class Compose(object): 37 | """Composes several augmentations together. 38 | Args: 39 | transforms (List[Transform]): list of transforms to compose. 
40 | Example: 41 | >>> augmentations.Compose([ 42 | >>> transforms.CenterCrop(10), 43 | >>> transforms.ToTensor(), 44 | >>> ]) 45 | """ 46 | 47 | def __init__(self, transforms): 48 | self.transforms = transforms 49 | 50 | def __call__(self, img, boxes=None, labels=None): 51 | for t in self.transforms: 52 | img, boxes, labels = t(img, boxes, labels) 53 | return img, boxes, labels 54 | 55 | 56 | class Lambda(object): 57 | """Applies a lambda as a transform.""" 58 | 59 | def __init__(self, lambd): 60 | assert isinstance(lambd, types.LambdaType) 61 | self.lambd = lambd 62 | 63 | def __call__(self, img, boxes=None, labels=None): 64 | return self.lambd(img, boxes, labels) 65 | 66 | 67 | class ConvertFromInts(object): 68 | def __call__(self, image, boxes=None, labels=None): 69 | return image.astype(np.float32), boxes, labels 70 | 71 | 72 | class SubtractMeans(object): 73 | def __init__(self, mean): 74 | self.mean = np.array(mean, dtype=np.float32) 75 | 76 | def __call__(self, image, boxes=None, labels=None): 77 | image = image.astype(np.float32) 78 | image -= self.mean 79 | return image.astype(np.float32), boxes, labels 80 | 81 | 82 | class ToAbsoluteCoords(object): 83 | def __call__(self, image, boxes=None, labels=None): 84 | height, width, channels = image.shape 85 | boxes[:, 0] *= width 86 | boxes[:, 2] *= width 87 | boxes[:, 1] *= height 88 | boxes[:, 3] *= height 89 | 90 | return image, boxes, labels 91 | 92 | 93 | class ToPercentCoords(object): 94 | def __call__(self, image, boxes=None, labels=None): 95 | height, width, channels = image.shape 96 | boxes[:, 0] /= width 97 | boxes[:, 2] /= width 98 | boxes[:, 1] /= height 99 | boxes[:, 3] /= height 100 | 101 | return image, boxes, labels 102 | 103 | 104 | class Resize(object): 105 | def __init__(self, size=300): 106 | self.size = size 107 | 108 | def __call__(self, image, boxes=None, labels=None): 109 | image = cv2.resize(image, (self.size, 110 | self.size)) 111 | return image, boxes, labels 112 | 113 | 114 | class RandomSaturation(object): 115 | def __init__(self, lower=0.5, upper=1.5): 116 | self.lower = lower 117 | self.upper = upper 118 | assert self.upper >= self.lower, "contrast upper must be >= lower." 119 | assert self.lower >= 0, "contrast lower must be non-negative." 
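        # Descriptive note: the __call__ below assumes the image is already in
        # HSV colour space (see ConvertColor / PhotometricDistort); channel 1
        # is saturation, so it is scaled by a random factor in [lower, upper]
        # on roughly half of the calls.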
120 | 121 | def __call__(self, image, boxes=None, labels=None): 122 | if random.randint(2): 123 | image[:, :, 1] *= random.uniform(self.lower, self.upper) 124 | 125 | return image, boxes, labels 126 | 127 | 128 | class RandomHue(object): 129 | def __init__(self, delta=18.0): 130 | assert delta >= 0.0 and delta <= 360.0 131 | self.delta = delta 132 | 133 | def __call__(self, image, boxes=None, labels=None): 134 | if random.randint(2): 135 | image[:, :, 0] += random.uniform(-self.delta, self.delta) 136 | image[:, :, 0][image[:, :, 0] > 360.0] -= 360.0 137 | image[:, :, 0][image[:, :, 0] < 0.0] += 360.0 138 | return image, boxes, labels 139 | 140 | 141 | class RandomLightingNoise(object): 142 | def __init__(self): 143 | self.perms = ((0, 1, 2), (0, 2, 1), 144 | (1, 0, 2), (1, 2, 0), 145 | (2, 0, 1), (2, 1, 0)) 146 | 147 | def __call__(self, image, boxes=None, labels=None): 148 | if random.randint(2): 149 | swap = self.perms[random.randint(len(self.perms))] 150 | shuffle = SwapChannels(swap) # shuffle channels 151 | image = shuffle(image) 152 | return image, boxes, labels 153 | 154 | 155 | class ConvertColor(object): 156 | def __init__(self, current='BGR', transform='HSV'): 157 | self.transform = transform 158 | self.current = current 159 | 160 | def __call__(self, image, boxes=None, labels=None): 161 | if self.current == 'BGR' and self.transform == 'HSV': 162 | image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV) 163 | elif self.current == 'HSV' and self.transform == 'BGR': 164 | image = cv2.cvtColor(image, cv2.COLOR_HSV2BGR) 165 | else: 166 | raise NotImplementedError 167 | return image, boxes, labels 168 | 169 | 170 | class RandomContrast(object): 171 | def __init__(self, lower=0.5, upper=1.5): 172 | self.lower = lower 173 | self.upper = upper 174 | assert self.upper >= self.lower, "contrast upper must be >= lower." 175 | assert self.lower >= 0, "contrast lower must be non-negative." 176 | 177 | # expects float image 178 | def __call__(self, image, boxes=None, labels=None): 179 | if random.randint(2): 180 | alpha = random.uniform(self.lower, self.upper) 181 | image *= alpha 182 | return image, boxes, labels 183 | 184 | 185 | class RandomBrightness(object): 186 | def __init__(self, delta=32): 187 | assert delta >= 0.0 188 | assert delta <= 255.0 189 | self.delta = delta 190 | 191 | def __call__(self, image, boxes=None, labels=None): 192 | if random.randint(2): 193 | delta = random.uniform(-self.delta, self.delta) 194 | image += delta 195 | return image, boxes, labels 196 | 197 | 198 | class RandomSampleCrop(object): 199 | """Crop 200 | Arguments: 201 | img (Image): the image being input during training 202 | boxes (Tensor): the original bounding boxes in pt form 203 | labels (Tensor): the class labels for each bbox 204 | mode (float tuple): the min and max jaccard overlaps 205 | Return: 206 | (img, boxes, classes) 207 | img (Image): the cropped image 208 | boxes (Tensor): the adjusted bounding boxes in pt form 209 | labels (Tensor): the class labels for each bbox 210 | """ 211 | def __init__(self): 212 | self.sample_options = ( 213 | # using entire original input image 214 | None, 215 | # sample a patch s.t. 
MIN jaccard w/ obj in .1,.3,.4,.7,.9 216 | (0.1, None), 217 | (0.3, None), 218 | (0.7, None), 219 | (0.9, None), 220 | # randomly sample a patch 221 | (None, None), 222 | ) 223 | 224 | def __call__(self, image, boxes=None, labels=None): 225 | height, width, _ = image.shape 226 | while True: 227 | # randomly choose a mode 228 | mode = random.choice(self.sample_options) 229 | if mode is None: 230 | return image, boxes, labels 231 | 232 | min_iou, max_iou = mode 233 | if min_iou is None: 234 | min_iou = float('-inf') 235 | if max_iou is None: 236 | max_iou = float('inf') 237 | 238 | # max trails (50) 239 | for _ in range(50): 240 | current_image = image 241 | 242 | w = random.uniform(0.3 * width, width) 243 | h = random.uniform(0.3 * height, height) 244 | 245 | # aspect ratio constraint b/t .5 & 2 246 | if h / w < 0.5 or h / w > 2: 247 | continue 248 | 249 | left = random.uniform(width - w) 250 | top = random.uniform(height - h) 251 | 252 | # convert to integer rect x1,y1,x2,y2 253 | rect = np.array([int(left), int(top), int(left+w), int(top+h)]) 254 | 255 | # calculate IoU (jaccard overlap) b/t the cropped and gt boxes 256 | overlap = jaccard_numpy(boxes, rect) 257 | 258 | # is min and max overlap constraint satisfied? if not try again 259 | if overlap.min() < min_iou and max_iou < overlap.max(): 260 | continue 261 | 262 | # cut the crop from the image 263 | current_image = current_image[rect[1]:rect[3], rect[0]:rect[2], 264 | :] 265 | 266 | # keep overlap with gt box IF center in sampled patch 267 | centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0 268 | 269 | # mask in all gt boxes that above and to the left of centers 270 | m1 = (rect[0] < centers[:, 0]) * (rect[1] < centers[:, 1]) 271 | 272 | # mask in all gt boxes that under and to the right of centers 273 | m2 = (rect[2] > centers[:, 0]) * (rect[3] > centers[:, 1]) 274 | 275 | # mask in that both m1 and m2 are true 276 | mask = m1 * m2 277 | 278 | # have any valid boxes? 
try again if not 279 | if not mask.any(): 280 | continue 281 | 282 | # take only matching gt boxes 283 | current_boxes = boxes[mask, :].copy() 284 | 285 | # take only matching gt labels 286 | current_labels = labels[mask] 287 | 288 | # should we use the box left and top corner or the crop's 289 | current_boxes[:, :2] = np.maximum(current_boxes[:, :2], 290 | rect[:2]) 291 | # adjust to crop (by substracting crop's left,top) 292 | current_boxes[:, :2] -= rect[:2] 293 | 294 | current_boxes[:, 2:] = np.minimum(current_boxes[:, 2:], 295 | rect[2:]) 296 | # adjust to crop (by substracting crop's left,top) 297 | current_boxes[:, 2:] -= rect[:2] 298 | 299 | return current_image, current_boxes, current_labels 300 | 301 | 302 | class Expand(object): 303 | def __init__(self, mean): 304 | self.mean = mean 305 | 306 | def __call__(self, image, boxes, labels): 307 | if random.randint(2): 308 | return image, boxes, labels 309 | 310 | height, width, depth = image.shape 311 | ratio = random.uniform(1, 4) 312 | left = random.uniform(0, width*ratio - width) 313 | top = random.uniform(0, height*ratio - height) 314 | 315 | expand_image = np.zeros( 316 | (int(height*ratio), int(width*ratio), depth), 317 | dtype=image.dtype) 318 | expand_image[:, :, :] = self.mean 319 | expand_image[int(top):int(top + height), 320 | int(left):int(left + width)] = image 321 | image = expand_image 322 | 323 | boxes = boxes.copy() 324 | boxes[:, :2] += (int(left), int(top)) 325 | boxes[:, 2:] += (int(left), int(top)) 326 | 327 | return image, boxes, labels 328 | 329 | 330 | class HorizontalFlip(object): 331 | def __call__(self, image, boxes, classes): 332 | _, width, _ = image.shape 333 | if random.randint(2): 334 | image = image[:, ::-1] 335 | boxes = boxes.copy() 336 | boxes[:, 0::2] = width - boxes[:, 2::-2] 337 | return image, boxes, classes 338 | 339 | 340 | class VerticalFlip(object): 341 | def __call__(self, image, boxes, classes): 342 | height = image.shape[0] 343 | if random.randint(2): 344 | image = image[::-1, :] 345 | boxes = boxes.copy() 346 | # boxes[:, 0::2] = width - boxes[:, 2::-2] 347 | boxes[:, [1, 3]] = height - boxes[:, [3, 1]] 348 | return image, boxes, classes 349 | 350 | 351 | class SwapChannels(object): 352 | """Transforms a tensorized image by swapping the channels in the order 353 | specified in the swap tuple. 
354 | Args: 355 | swaps (int triple): final order of channels 356 | eg: (2, 1, 0) 357 | """ 358 | 359 | def __init__(self, swaps): 360 | self.swaps = swaps 361 | 362 | def __call__(self, image): 363 | """ 364 | Args: 365 | image (Tensor): image tensor to be transformed 366 | Return: 367 | a tensor with channels swapped according to swap 368 | """ 369 | # if torch.is_tensor(image): 370 | # image = image.data.cpu().numpy() 371 | # else: 372 | # image = np.array(image) 373 | image = image[:, :, self.swaps] 374 | return image 375 | 376 | 377 | class PhotometricDistort(object): 378 | def __init__(self): 379 | self.pd = [ 380 | RandomContrast(), 381 | ConvertColor(transform='HSV'), 382 | RandomSaturation(), 383 | RandomHue(), 384 | ConvertColor(current='HSV', transform='BGR'), 385 | RandomContrast() 386 | ] 387 | self.rand_brightness = RandomBrightness() 388 | self.rand_light_noise = RandomLightingNoise() 389 | 390 | def __call__(self, image, boxes, labels): 391 | im = image.copy() 392 | im, boxes, labels = self.rand_brightness(im, boxes, labels) 393 | if random.randint(2): 394 | distort = Compose(self.pd[:-1]) 395 | else: 396 | distort = Compose(self.pd[1:]) 397 | im, boxes, labels = distort(im, boxes, labels) 398 | return self.rand_light_noise(im, boxes, labels) 399 | 400 | 401 | class PhotometricDistort2(object): 402 | def __init__(self, brightness_var=0.5, saturation_var=.5, contrast_var=.5, 403 | lighting_std=.5): 404 | 405 | self.brightness_var = brightness_var 406 | self.saturation_var = saturation_var 407 | self.contrast_var = contrast_var 408 | self.lighting_std = lighting_std 409 | self.color_jitter = [self.saturation, self.brightness, self.contrast, 410 | self.contrast, self.lighting] 411 | 412 | def _gray_scale(self, image_array): 413 | return image_array.dot([0.299, 0.587, 0.114]) 414 | 415 | def saturation(self, image_array): 416 | gray_scale = self._gray_scale(image_array) 417 | alpha = 2.0 * np.random.random() * self.brightness_var 418 | alpha = alpha + 1 - self.saturation_var 419 | image_array = alpha * image_array + (1 - alpha) * gray_scale[:, :, None] 420 | return np.clip(image_array, 0, 255) 421 | 422 | def brightness(self, image_array): 423 | alpha = 2 * np.random.random() * self.brightness_var 424 | alpha = alpha + 1 - self.saturation_var 425 | image_array = alpha * image_array 426 | return np.clip(image_array, 0, 255) 427 | 428 | def contrast(self, image_array): 429 | gray_scale = (self._gray_scale(image_array).mean() * 430 | np.ones_like(image_array)) 431 | alpha = 2 * np.random.random() * self.contrast_var 432 | alpha = alpha + 1 - self.contrast_var 433 | image_array = image_array * alpha + (1 - alpha) * gray_scale 434 | return np.clip(image_array, 0, 255) 435 | 436 | def lighting(self, image_array): 437 | covariance_matrix = np.cov(image_array.reshape(-1, 3) / 438 | 255.0, rowvar=False) 439 | eigen_values, eigen_vectors = np.linalg.eigh(covariance_matrix) 440 | noise = np.random.randn(3) * self.lighting_std 441 | noise = eigen_vectors.dot(eigen_values * noise) * 255 442 | image_array = image_array + noise 443 | return np.clip(image_array, 0, 255) 444 | 445 | def __call__(self, img, boxes, labels): 446 | random.shuffle(self.color_jitter) 447 | for jitter in self.color_jitter: 448 | img = jitter(img) 449 | return (img, boxes, labels) 450 | 451 | 452 | class SSDAugmentation(object): 453 | def __init__(self, mode='train', size=300, mean=(104, 117, 123)): 454 | self.mean = mean 455 | self.size = size 456 | self.mode = mode 457 | 458 | if self.mode == 'train': 459 | 
self.augment = Compose([ 460 | ConvertFromInts(), 461 | ToAbsoluteCoords(), 462 | PhotometricDistort(), 463 | Expand(self.mean), 464 | RandomSampleCrop(), 465 | HorizontalFlip(), 466 | # VerticalFlip(), 467 | ToPercentCoords(), 468 | Resize(self.size), 469 | SubtractMeans(self.mean) 470 | ]) 471 | 472 | elif self.mode == 'val': 473 | self.augment = Compose([ 474 | ConvertFromInts(), 475 | # ToAbsoluteCoords(), 476 | # PhotometricDistort(), 477 | # Expand(self.mean), 478 | # RandomSampleCrop(), 479 | # RandomMirror(), 480 | # ToPercentCoords(), 481 | Resize(self.size), 482 | SubtractMeans(self.mean) 483 | ]) 484 | 485 | else: 486 | raise Exception('Invalid mode:', self.mode) 487 | 488 | def __call__(self, img, boxes, labels): 489 | return self.augment(img, boxes, labels) 490 | 491 | 492 | """ 493 | class SSDAugmentation(object): 494 | def __init__(self, size=300, mean=(104, 117, 123)): 495 | self.mean = mean 496 | self.size = size 497 | self.augment = Compose([ 498 | ConvertFromInts(), 499 | ToAbsoluteCoords(), 500 | PhotometricDistort(), 501 | Expand(self.mean), 502 | RandomSampleCrop(), 503 | RandomMirror(), 504 | ToPercentCoords(), 505 | Resize(self.size), 506 | SubtractMeans(self.mean) 507 | ]) 508 | 509 | def __call__(self, img, boxes, labels): 510 | return self.augment(img, boxes, labels) 511 | """ 512 | 513 | 514 | -------------------------------------------------------------------------------- /SSD/data_generator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from random import shuffle 3 | import cv2 4 | import threading 5 | from itertools import product 6 | 7 | from boxes import assign_prior_boxes 8 | from data_augmentation import SSDAugmentation 9 | 10 | 11 | R_MEAN = 123 12 | G_MEAN = 117 13 | B_MEAN = 104 14 | 15 | 16 | def load_image(image_path, target_size=None, 17 | return_original_shape=False, RGB=True): 18 | image_array = cv2.imread(image_path) 19 | if RGB: 20 | image_array = cv2.cvtColor(image_array, cv2.COLOR_BGR2RGB) 21 | height, width = image_array.shape[:2] 22 | if target_size is not None: 23 | image_array = cv2.resize(image_array, target_size) 24 | if return_original_shape: 25 | return image_array, (height, width) 26 | else: 27 | return image_array 28 | 29 | 30 | class threadsafe_iterator: 31 | """Takes an iterator/generator and makes it thread-safe by 32 | serializing call to the `next` method of given iterator/generator. 33 | """ 34 | def __init__(self, iterator): 35 | self.iterator = iterator 36 | self.lock = threading.Lock() 37 | 38 | def __iter__(self): 39 | return self 40 | 41 | def __next__(self): 42 | with self.lock: 43 | return next(self.iterator) 44 | 45 | 46 | def threadsafe_generator(generator): 47 | """A decorator that takes a generator function and makes it thread-safe. 
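    A typical use (illustrative only, not part of the original docstring) is
    to decorate a generator so that several Keras workers can consume it
    concurrently; every next() call on the wrapped iterator is then guarded
    by a lock:

        @threadsafe_generator
        def count_up():
            n = 0
            while True:
                n += 1
                yield n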
48 |     """
49 |     def wrapped_generator(*args, **kwargs):
50 |         return threadsafe_iterator(generator(*args, **kwargs))
51 |     return wrapped_generator
52 | 
53 | 
54 | 
55 | 
56 | class DataGenerator(object):
57 | 
58 |     def __init__(self, train_data, prior_boxes, batch_size=32, num_classes=21,
59 |                  val_data=None, box_scale_factors=[.1, .1, .2, .2]):
60 | 
61 |         self.train_data = train_data
62 |         self.val_data = val_data
63 |         self.num_classes = num_classes
64 |         self.prior_boxes = prior_boxes
65 |         self.box_scale_factors = box_scale_factors
66 |         self.batch_size = batch_size
67 | 
68 |     @threadsafe_generator
69 |     def flow(self, mode='train'):
70 |         if mode == 'train':
71 |             keys = list(self.train_data.keys())
72 |             shuffle(keys)
73 |             ground_truth_data = self.train_data
74 |         elif mode == 'val':
75 |             if self.val_data is None:
76 |                 raise Exception('Validation data not given in constructor.')
77 |             keys = list(self.val_data.keys())
78 |             ground_truth_data = self.val_data
79 |         else:
80 |             raise Exception('invalid mode: %s' % mode)
81 | 
82 |         self.transform = SSDAugmentation(mode, 300, (B_MEAN, G_MEAN, R_MEAN))
83 | 
84 |         while True:
85 |             inputs = []
86 |             targets = []
87 |             for image_path in keys:
88 |                 image_array = load_image(image_path, RGB=False).copy()
89 |                 box_data = ground_truth_data[image_path].copy()
90 | 
91 |                 data = (image_array, box_data[:, :4], box_data[:, 4:])
92 |                 image_array, box_corners, labels = self.transform(*data)
93 |                 box_data = np.concatenate([box_corners, labels], axis=1)
94 | 
95 |                 box_data = assign_prior_boxes(self.prior_boxes, box_data,
96 |                                               self.num_classes,
97 |                                               self.box_scale_factors)
98 | 
99 |                 inputs.append(image_array)
100 |                 targets.append(box_data)
101 |                 if len(targets) == self.batch_size:
102 |                     inputs = np.asarray(inputs)
103 |                     targets = np.asarray(targets)
104 |                     yield self._wrap_in_dictionary(inputs, targets)
105 |                     inputs = []
106 |                     targets = []
107 | 
108 |     def _wrap_in_dictionary(self, image_array, targets):
109 |         return [{'input_1': image_array},
110 |                 {'predictions': targets}]
111 | 
--------------------------------------------------------------------------------
/Semantic_Relation_Extraction/Readme.md:
--------------------------------------------------------------------------------
1 | # Semantic Relation Extraction
2 | 
3 | This is an example of performing semantic relation extraction in Python, using Keras with a TensorFlow backend.
4 | 
5 | In this notebook, the relation between two marked words (entities) in a sentence is to be found. A simple example of a relation is 'cause of': an entity 'A' can be the cause of an entity 'B'.
6 | 
7 | Relation extraction is a part of Natural Language Understanding and can be used to extract useful features from a text corpus. It helps in understanding the relations among the entities in a sentence.
8 | 
9 | ## Packages used
10 | 
11 | keras, sklearn, tensorflow, numpy, regex, pickle, nltk
12 | 
13 | ## Dataset
14 | 
15 | For the task of semantic relation extraction, the dataset from SemEval-2010 Task 8 was used. The dataset contains sentences with both entities marked and the corresponding relation between them.
16 | 
17 | More about the dataset can be found [here](http://www.aclweb.org/anthology/S10-1006).
18 | 
19 | ## Model Architecture
20 | 
21 | Pre-trained GloVe vectors are used as the embedding weights of the model.
22 | 
23 | Embedded word vectors are first passed to a 1D convolution and then to a bidirectional GRU. The GRU captures the sequential information, while the CNN improves the embeddings by emphasizing neighboring-word information. A hedged sketch of this layer stack is shown below.
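To make the layer stack concrete, here is a minimal Keras sketch of such a relation classifier. It is a sketch under assumptions, not the notebook's exact code: the layer sizes, `max_len`, `num_relations`, and the `build_relation_model` helper are illustrative, and the mask-pooling and self-attention branches described in the following paragraphs are reduced here to a single global max pooling branch.

```python
# Illustrative sketch only: GloVe embeddings -> Conv1D -> BiGRU -> pooling -> dense.
import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, Bidirectional, GRU,
                          GlobalMaxPooling1D, Dense, Dropout)


def build_relation_model(embedding_matrix, max_len=100, num_relations=19):
    # embedding_matrix: (vocab_size, embed_dim) array of pre-trained GloVe vectors
    vocab_size, embed_dim = embedding_matrix.shape
    tokens = Input(shape=(max_len,), dtype='int32')

    # Pre-trained vectors used as frozen embedding weights
    x = Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
                  input_length=max_len, trainable=False)(tokens)

    # The 1D convolution mixes each word with its neighbors, then a
    # bidirectional GRU models the sequential information of the sentence
    x = Conv1D(128, kernel_size=3, padding='same', activation='relu')(x)
    x = Bidirectional(GRU(64, return_sequences=True))(x)

    # One pooled-feature branch; the notebook additionally concatenates
    # mask-pooled and self-attended features at this point
    x = GlobalMaxPooling1D()(x)

    # Fully connected layers ending in a softmax over the relation classes
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    outputs = Dense(num_relations, activation='softmax')(x)

    model = Model(tokens, outputs)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model


# Quick smoke test with a random stand-in for the GloVe matrix
model = build_relation_model(np.random.normal(size=(5000, 300)).astype('float32'))
model.summary()
```

The illustrative `num_relations=19` default reflects the SemEval-2010 Task 8 label set (9 directed relations in both directions plus 'Other').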
24 | 
25 | Global max pooled, mask max pooled and self-attended features of the RNN output are all concatenated and passed to the dense layers.
26 | 
27 | Finally, multiple fully-connected layers are used to classify the sentence into one of the possible relation classes.
28 | 
29 | The Adam optimizer and sparse categorical cross-entropy loss are used.
30 | 
31 | ![Model Architecture](../Images/model_semeval.png)
32 | 
--------------------------------------------------------------------------------
/Text_Classification/Readme.md:
--------------------------------------------------------------------------------
1 | # Text Classification
2 | 
3 | Here, text classification is performed using various neural network layers in Python, using Keras with a TensorFlow backend.
4 | 
5 | In these notebooks, user comments are classified as having positive (1) or negative (0) sentiment. To perform well on this task the model must capture the sentiment of the review, so several models with different parameters and layers have been built and compared.
6 | 
7 | Two different Python notebooks are present:
8 | 
9 | 1. The [comparison notebook](classification_imdb.ipynb) compares many model variants by selecting the recurrent layer (GRU or LSTM), whether to use self-attention or multi-head attention, whether to add a global average pooling layer, and which word-embedding file to use.
10 | 
11 | 2. The [second notebook](self_Attn_on_seperate_fets_of_2embds.ipynb) gave the best result on the test dataset in my experiments. This model uses GloVe and FastText embeddings as separate features in the upper layers; the features extracted from each embedding with SpatialDropout, GRU, sequence self-attention and global pooling layers are concatenated and passed to dense layers for classification.
12 | 
13 | ![Model Architecture](../Images/model_imdb_2.png)
14 | 
15 | ## Packages used
16 | 
17 | keras, sklearn, tensorflow, numpy, keras_self_attention, keras_multi_head
18 | 
19 | ## Dataset
20 | 
21 | For the task of text classification, the IMDB sentiment analysis dataset was used. This is a popular dataset of movie reviews for sentiment classification.
22 | 
23 | The dataset can be found [here](http://ai.stanford.edu/~amaas/data/sentiment/).
24 | 
25 | 
26 | 
27 | 
28 | 
--------------------------------------------------------------------------------
/Text_Classification/self_Attn_on_seperate_fets_of_2embds.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Treating embeddings as separate continuous features\n",
8 |     "\n",
9 |     "The two embeddings used here are GloVe and FastText. In the model they are passed to separate embedding layers, and a separate feature-extraction branch is present for each of the embeddings.\n",
10 |     "\n",
11 |     "Sequence features are extracted by first applying spatial dropout to the embedding vectors. Then global max and average features are pooled from the self-attended RNN output.\n",
12 |     "\n",
13 |     "The max pooled features from both embeddings are then concatenated and passed to a dense layer for feature extraction.\n",
14 |     "\n",
15 |     "The same is done for the average pooled features, and then these pooled features are concatenated and passed to a fully connected network for classification.
\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "name": "stderr", 25 | "output_type": "stream", 26 | "text": [ 27 | "Using TensorFlow backend.\n" 28 | ] 29 | }, 30 | { 31 | "data": { 32 | "text/plain": [ 33 | "('2.2.2',\n", 34 | " '1.10.0',\n", 35 | " '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')" 36 | ] 37 | }, 38 | "execution_count": 1, 39 | "metadata": {}, 40 | "output_type": "execute_result" 41 | } 42 | ], 43 | "source": [ 44 | "import keras, tensorflow, sys\n", 45 | "keras.__version__, tensorflow.__version__, sys.version" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 2, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "name": "stderr", 55 | "output_type": "stream", 56 | "text": [ 57 | "C:\\Users\\Deep Learning 3033\\AppData\\Local\\conda\\conda\\envs\\tensorflow\\lib\\site-packages\\sklearn\\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", 58 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "# import required packages\n", 64 | "import sys\n", 65 | "import warnings\n", 66 | "import os\n", 67 | "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"1\" \n", 68 | "\n", 69 | "if not sys.warnoptions:\n", 70 | " warnings.simplefilter(\"ignore\")\n", 71 | "\n", 72 | "import keras\n", 73 | "import tensorflow as tf\n", 74 | "\n", 75 | "from keras.models import Model\n", 76 | "\n", 77 | "from keras.layers import CuDNNLSTM, CuDNNGRU, BatchNormalization, Dense, Dropout, Activation, Embedding, Input\n", 78 | "from keras.layers import Bidirectional,SpatialDropout1D, GlobalAveragePooling1D, GlobalMaxPooling1D, Concatenate\n", 79 | "\n", 80 | "from keras.optimizers import Adam\n", 81 | "\n", 82 | "from keras.preprocessing.sequence import pad_sequences\n", 83 | "from keras.preprocessing.text import Tokenizer\n", 84 | "from keras.utils.np_utils import to_categorical\n", 85 | "\n", 86 | "from keras_self_attention import SeqSelfAttention\n", 87 | "\n", 88 | "from sklearn.metrics import confusion_matrix,f1_score, precision_score, recall_score, roc_auc_score, accuracy_score\n", 89 | "from sklearn.cross_validation import train_test_split\n", 90 | "\n", 91 | "import pandas as pd\n", 92 | "import numpy as np\n", 93 | "import re\n", 94 | "from glob import glob\n", 95 | "\n", 96 | "import math\n", 97 | "from snapshot import SnapshotCallbackBuilder\n", 98 | "import time" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "def load_imdb_dataset():\n", 108 | "\n", 109 | " # Load the dataset\n", 110 | " train = pd.DataFrame(columns=[\"text\", \"positive\"])\n", 111 | " test = pd.DataFrame(columns=[\"text\", \"positive\"])\n", 112 | " ctr = 0\n", 113 | " cte = 0\n", 114 | " for fil in ['train/', 'test/']:\n", 115 | " for cls in ['pos', 'neg']:\n", 116 | " dset_path = \"./\" + fil + cls\n", 117 | " for fname in sorted(os.listdir(dset_path)):\n", 118 | " if fname.endswith('.txt'):\n", 119 | " with open(os.path.join(dset_path, fname), encoding=\"utf8\") as f:\n", 120 | " if fil == 'train/':\n", 121 | " train.loc[ctr] = (f.read(), int(cls == \"pos\"))\n", 122 | 
" ctr+=1\n", 123 | " else:\n", 124 | " test.loc[cte] = (f.read(), int(cls == \"pos\"))\n", 125 | " cte+=1\n", 126 | " \n", 127 | " return train, test" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 4, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "Train data shape (25000, 2)\n", 140 | "Test data shape (25000, 2)\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "train, test = load_imdb_dataset()\n", 146 | "\n", 147 | "print (\"Train data shape\", train.shape)\n", 148 | "print (\"Test data shape\", test.shape)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 5, 154 | "metadata": {}, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "Train data class distbn 1 12500\n", 161 | "0 12500\n", 162 | "Name: positive, dtype: int64\n", 163 | "Test data class distbn 1 12500\n", 164 | "0 12500\n", 165 | "Name: positive, dtype: int64\n" 166 | ] 167 | } 168 | ], 169 | "source": [ 170 | "print(\"Train data class distbn\", train.positive.value_counts())\n", 171 | "print(\"Test data class distbn\", test.positive.value_counts())" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 6, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/html": [ 182 | "
\n", 183 | "\n", 196 | "\n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | "
textpositive
0Bromwell High is a cartoon comedy. It ran at t...1
1Homelessness (or Houselessness as George Carli...1
2Brilliant over-acting by Lesley Ann Warren. Be...1
3This is easily the most underrated film inn th...1
4This is not the typical Mel Brooks film. It wa...1
\n", 232 | "
" 233 | ], 234 | "text/plain": [ 235 | " text positive\n", 236 | "0 Bromwell High is a cartoon comedy. It ran at t... 1\n", 237 | "1 Homelessness (or Houselessness as George Carli... 1\n", 238 | "2 Brilliant over-acting by Lesley Ann Warren. Be... 1\n", 239 | "3 This is easily the most underrated film inn th... 1\n", 240 | "4 This is not the typical Mel Brooks film. It wa... 1" 241 | ] 242 | }, 243 | "execution_count": 6, 244 | "metadata": {}, 245 | "output_type": "execute_result" 246 | } 247 | ], 248 | "source": [ 249 | "train.head()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "outputs": [ 257 | { 258 | "data": { 259 | "text/html": [ 260 | "
\n", 261 | "\n", 274 | "\n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | "
textpositive
0I went and saw this movie last night after bei...1
1Actor turned director Bill Paxton follows up h...1
2As a recreational golfer with some knowledge o...1
3I saw this film in a sneak preview, and it is ...1
4Bill Paxton has taken the true story of the 19...1
\n", 310 | "
" 311 | ], 312 | "text/plain": [ 313 | " text positive\n", 314 | "0 I went and saw this movie last night after bei... 1\n", 315 | "1 Actor turned director Bill Paxton follows up h... 1\n", 316 | "2 As a recreational golfer with some knowledge o... 1\n", 317 | "3 I saw this film in a sneak preview, and it is ... 1\n", 318 | "4 Bill Paxton has taken the true story of the 19... 1" 319 | ] 320 | }, 321 | "execution_count": 7, 322 | "metadata": {}, 323 | "output_type": "execute_result" 324 | } 325 | ], 326 | "source": [ 327 | "test.head()" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 8, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "name": "stdout", 337 | "output_type": "stream", 338 | "text": [ 339 | "Train Sequence length distribution:\n", 340 | "\n", 341 | "count 25000.000000\n", 342 | "mean 233.787200\n", 343 | "std 173.733032\n", 344 | "min 10.000000\n", 345 | "25% 127.000000\n", 346 | "50% 174.000000\n", 347 | "75% 284.000000\n", 348 | "max 2470.000000\n", 349 | "dtype: float64\n", 350 | "\n", 351 | "\n", 352 | "Test Sequence length distribution:\n", 353 | "\n", 354 | "count 25000.000000\n", 355 | "mean 228.526680\n", 356 | "std 168.883693\n", 357 | "min 4.000000\n", 358 | "25% 126.000000\n", 359 | "50% 172.000000\n", 360 | "75% 277.000000\n", 361 | "max 2278.000000\n", 362 | "dtype: float64\n" 363 | ] 364 | } 365 | ], 366 | "source": [ 367 | "# Average number of words per review \n", 368 | "tr_l = [len(x.split()) for x in train.text]\n", 369 | "te_l = [len(x.split()) for x in test.text]\n", 370 | "print(\"Train Sequence length distribution:\\n\")\n", 371 | "print(pd.Series(tr_l).describe())\n", 372 | "print(\"\\n\\nTest Sequence length distribution:\\n\")\n", 373 | "print(pd.Series(te_l).describe())" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 9, 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "name": "stdout", 383 | "output_type": "stream", 384 | "text": [ 385 | "Vocab size 88582\n" 386 | ] 387 | } 388 | ], 389 | "source": [ 390 | "# Number of unique words by finding the length of dictionary of words mapped with unique tokens (integers)\n", 391 | "tokenizer = Tokenizer()\n", 392 | "tokenizer.fit_on_texts(list(train.text))\n", 393 | "print(\"Vocab size\", len(tokenizer.word_counts))" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 10, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "embed_size = 300 \n", 403 | "\n", 404 | "# mean number of words per sentence in the train set is taken as maximum sentence length.\n", 405 | "max_sent_len = int(np.percentile(tr_l, 50)) \n", 406 | "\n", 407 | "num_words = len(tokenizer.word_counts)" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 11, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "# Converte sentence text to list of token represented sentences, required for training\n", 417 | "X = tokenizer.texts_to_sequences(train.text)\n", 418 | "X = pad_sequences(X, maxlen=max_sent_len)\n", 419 | "\n", 420 | "x_test = tokenizer.texts_to_sequences(test.text)\n", 421 | "x_test = pad_sequences(x_test, maxlen=max_sent_len)" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 12, 427 | "metadata": {}, 428 | "outputs": [ 429 | { 430 | "data": { 431 | "text/plain": [ 432 | "((22500, 174), (2500, 174))" 433 | ] 434 | }, 435 | "execution_count": 12, 436 | "metadata": {}, 437 | "output_type": "execute_result" 438 | } 439 | ], 440 | "source": [ 441 | "# Split into 
train and validation data\n", 442 | "x_train, x_val, y_train, y_val = train_test_split(X, train.positive, test_size=0.1, random_state=3)\n", 443 | "x_train.shape, x_val.shape" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 13, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "# Functions to load different embeddings\n", 453 | "\n", 454 | "def load_glove(word_index):\n", 455 | " EMBEDDING_FILE = '../embeddings/glove.840B.300d/glove.840B.300d.txt'\n", 456 | " def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')\n", 457 | " embeddings_index = dict(get_coefs(*o.split(\" \")) for o in open(EMBEDDING_FILE, encoding=\"utf8\"))\n", 458 | "\n", 459 | " all_embs = np.stack(embeddings_index.values())\n", 460 | " emb_mean,emb_std = all_embs.mean(), all_embs.std()\n", 461 | " embed_size = all_embs.shape[1]\n", 462 | " \n", 463 | " nb_words = min(num_words, len(word_index))\n", 464 | " embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\n", 465 | " for word, i in word_index.items():\n", 466 | " if i >= num_words: continue\n", 467 | " embedding_vector = embeddings_index.get(word)\n", 468 | " if embedding_vector is not None: embedding_matrix[i] = embedding_vector\n", 469 | " \n", 470 | " return embedding_matrix \n", 471 | "\n", 472 | "def load_fasttext(word_index): \n", 473 | " EMBEDDING_FILE = '../embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'\n", 474 | " def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')\n", 475 | " embeddings_index = dict(get_coefs(*o.split(\" \")) for o in open(EMBEDDING_FILE, encoding=\"utf8\") if len(o)>100)\n", 476 | "\n", 477 | " all_embs = np.stack(embeddings_index.values())\n", 478 | " emb_mean,emb_std = all_embs.mean(), all_embs.std()\n", 479 | " embed_size = all_embs.shape[1]\n", 480 | "\n", 481 | " nb_words = min(num_words, len(word_index))\n", 482 | " embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\n", 483 | " for word, i in word_index.items():\n", 484 | " if i >= num_words: continue\n", 485 | " embedding_vector = embeddings_index.get(word)\n", 486 | " if embedding_vector is not None: embedding_matrix[i] = embedding_vector\n", 487 | "\n", 488 | " return embedding_matrix" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 14, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [ 497 | "# Get word_indexes (tokens) for each of the word in vocabulary\n", 498 | "word_index = tokenizer.word_index\n", 499 | "\n", 500 | "embedding_matrix_glove = load_glove(word_index)\n", 501 | "embedding_matrix_ft = load_fasttext(word_index)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 15, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "# Build a snaphot of the model after (nb_epochs/ M) epochs. 
Also cosine anneal the learning rate.\n", 511 | "\n", 512 | "M = 2\n", 513 | "nb_epoch = T = 50\n", 514 | "alpha_zero = 5e-4\n", 515 | "snapshot = SnapshotCallbackBuilder(T, M, alpha_zero)\n", 516 | "timestr = time.strftime(\"%Y-%m-%d_%H-%M-%S\")\n", 517 | "model_prefix = './imdb{}'.format(timestr)\n", 518 | "\n", 519 | "callbacks = snapshot.get_callbacks(model_prefix=model_prefix)\n" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 16, 525 | "metadata": {}, 526 | "outputs": [ 527 | { 528 | "name": "stdout", 529 | "output_type": "stream", 530 | "text": [ 531 | "Model: 0 , Accuracy_score: 0.88964\n", 532 | "Model: 1 , Accuracy_score: 0.89272\n", 533 | "Model: 2 , Accuracy_score: 0.8908\n", 534 | "Model: 3 , Accuracy_score: 0.88996\n", 535 | "Model: 4 , Accuracy_score: 0.89064\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "pred_avg = []\n", 541 | "real = list(test.positive)\n", 542 | "# Performing cross validation of 5\n", 543 | "for cv in range(5):\n", 544 | " \n", 545 | " # Embedding layer to use glove embeddings\n", 546 | " embedding_layer_g = Embedding(num_words, embed_size, input_length=max_sent_len, trainable=False,\n", 547 | " weights=[embedding_matrix_glove])\n", 548 | " sequence_input_g = Input(shape=(max_sent_len,), dtype='int32')\n", 549 | "\n", 550 | " embedded_sequences_g = embedding_layer_g(sequence_input_g)\n", 551 | " embedded_sequences_g = SpatialDropout1D(0.2)(embedded_sequences_g)\n", 552 | "\n", 553 | " x_g = Bidirectional(CuDNNGRU(64, return_sequences=True), merge_mode='concat')(embedded_sequences_g)\n", 554 | " x_g_a = SeqSelfAttention()(x_g)\n", 555 | "\n", 556 | " x_g = Concatenate()([x_g, x_g_a])\n", 557 | "\n", 558 | " x_g_a = GlobalAveragePooling1D()(x_g)\n", 559 | " x_g = GlobalMaxPooling1D()(x_g)\n", 560 | "\n", 561 | "\n", 562 | " # Embedding layer to use fasttext embeddings\n", 563 | " embedding_layer_f = Embedding(num_words, embed_size, input_length=max_sent_len, trainable=False,\n", 564 | " weights=[embedding_matrix_ft])\n", 565 | " sequence_input_f = Input(shape=(max_sent_len,), dtype='int32')\n", 566 | "\n", 567 | " embedded_sequences_f = embedding_layer_f(sequence_input_f)\n", 568 | " embedded_sequences_f = SpatialDropout1D(0.2)(embedded_sequences_f)\n", 569 | "\n", 570 | " x_f = Bidirectional(CuDNNGRU(64, return_sequences=True), merge_mode='concat')(embedded_sequences_f)\n", 571 | " x_f_a = SeqSelfAttention()(x_f)\n", 572 | "\n", 573 | " x_f = Concatenate()([x_f, x_f_a])\n", 574 | "\n", 575 | " x_f_a = GlobalAveragePooling1D()(x_f)\n", 576 | " x_f = GlobalMaxPooling1D()(x_f)\n", 577 | "\n", 578 | "\n", 579 | " # Concatenate the globally pooled features from each of the embeddings\n", 580 | " x_g = Concatenate()([x_g, x_f])\n", 581 | "\n", 582 | " x_g = Dense(128, activation=\"relu\", kernel_initializer=\"glorot_normal\")(x_g)\n", 583 | " x_g = BatchNormalization()(x_g)\n", 584 | " x_g = Dropout(0.4)(x_g)\n", 585 | "\n", 586 | " \n", 587 | " # Concatenate the globally averaged features from each of the embeddings\n", 588 | " x_a = Concatenate()([x_g_a, x_f_a])\n", 589 | "\n", 590 | " x_a = Dense(128, activation=\"relu\", kernel_initializer=\"glorot_normal\")(x_a)\n", 591 | " x_a = BatchNormalization()(x_a)\n", 592 | " x_a = Dropout(0.4)(x_a)\n", 593 | "\n", 594 | " x = Concatenate()([x_g, x_a])\n", 595 | " \n", 596 | " \n", 597 | " # Fully connected layers to classify using features from both the embeddings.\n", 598 | " x = Dense(128, activation=\"relu\", kernel_initializer=\"glorot_normal\")(x)\n", 599 | " x = 
BatchNormalization()(x)\n", 600 | " x = Dropout(0.4)(x)\n", 601 | "\n", 602 | " x = Dense(64, activation=\"relu\", kernel_initializer=\"glorot_normal\")(x)\n", 603 | " x = BatchNormalization()(x)\n", 604 | " x = Dropout(0.4)(x)\n", 605 | "\n", 606 | " x = Dense(16, activation=\"relu\", kernel_initializer=\"glorot_normal\")(x)\n", 607 | " x = BatchNormalization()(x)\n", 608 | " x = Dropout(0.4)(x)\n", 609 | "\n", 610 | " out = Dense(1, activation=\"sigmoid\", kernel_initializer=\"glorot_normal\")(x)\n", 611 | " \n", 612 | " model = Model([sequence_input_g, sequence_input_f], out)\n", 613 | "\n", 614 | " model.compile(loss=\"binary_crossentropy\", optimizer=Adam(5e-5),metrics=['accuracy'])\n", 615 | " model.fit([x_train, x_train], y_train, validation_data=([x_val, x_val], y_val), epochs=nb_epoch, verbose=0,\n", 616 | " batch_size=100, shuffle=True, callbacks=callbacks)\n", 617 | " pred = model.predict(x=[x_test, x_test])\n", 618 | " pred = pred > 0.5\n", 619 | " pred = [int(p[0]) for p in pred]\n", 620 | " pred_avg.append(pred)\n", 621 | " print(\"Model:\", cv, \", Accuracy_score:\", accuracy_score(real, pred))\n", 622 | " del model\n", 623 | "\n", 624 | "pred = np.mean(pred_avg, axis=0)\n", 625 | "pred = pred > 0.5\n", 626 | "pred = [int(p) for p in pred]\n" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "## F1 Score: 90.59 " 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 17, 639 | "metadata": {}, 640 | "outputs": [ 641 | { 642 | "name": "stdout", 643 | "output_type": "stream", 644 | "text": [ 645 | "Confusion Matrix:\n", 646 | " [[11306 1194]\n", 647 | " [ 1161 11339]]\n", 648 | "f1_score: 0.9059241800822915 precision_score: 0.9047315088167238 recall_score: 0.90712 accuracy_score: 0.9058\n" 649 | ] 650 | } 651 | ], 652 | "source": [ 653 | "print(\"Confusion Matrix:\\n\", confusion_matrix(real, pred))\n", 654 | "print(\"f1_score:\",f1_score(real, pred), \"precision_score:\",precision_score(real, pred),\n", 655 | " \"recall_score:\",recall_score(real, pred), \"accuracy_score:\",accuracy_score(real, pred))" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [] 664 | } 665 | ], 666 | "metadata": { 667 | "kernelspec": { 668 | "display_name": "Python 3", 669 | "language": "python", 670 | "name": "python3" 671 | }, 672 | "language_info": { 673 | "codemirror_mode": { 674 | "name": "ipython", 675 | "version": 3 676 | }, 677 | "file_extension": ".py", 678 | "mimetype": "text/x-python", 679 | "name": "python", 680 | "nbconvert_exporter": "python", 681 | "pygments_lexer": "ipython3", 682 | "version": "3.6.6" 683 | } 684 | }, 685 | "nbformat": 4, 686 | "nbformat_minor": 2 687 | } 688 | --------------------------------------------------------------------------------