├── utmn
│   ├── readme.txt
│   ├── 3_feedforward_+_ensembling.ipynb
│   ├── 2_LDA.ipynb
│   └── 1_fine_tuning_and_getting_bert_embs.ipynb
├── IIITT
│   ├── DESCRIPTION.md
│   ├── run3.ipynb
│   └── run2.ipynb
├── README.md
├── FideLIPI
│   ├── description.md
│   ├── topic_model_based_feature_creation.py
│   ├── 5_ensemble.py
│   ├── 3_sentence_level_model.py
│   ├── 4_tf_idf_logistic_model.py
│   ├── 1_roberta_on_whole_abstract.py
│   └── 2_roberta_on_abstract_text_combined_with_lda.py
└── parklize
    └── DESCRIPTION.md

--------------------------------------------------------------------------------
/utmn/readme.txt:
--------------------------------------------------------------------------------
This folder contains the code by the UTMN team. Our final solution achieved a weighted F1-score of 93.82%.

More details: Glazkova A. Identifying Topics of Scientific Articles with BERT-based Approaches and Topic Modeling.
--------------------------------------------------------------------------------
/IIITT/DESCRIPTION.md:
--------------------------------------------------------------------------------
The submission file IIITT.zip contains the following systems:

- run 1 : Pre-trained transformer model (allenai/scibert_scivocab_uncased)
- run 2 : Average of the predicted probabilities of BERT_base_uncased, RoBERTa_base, and SciBERT (a sketch of this averaging is shown after this list)
- run 3 : Ensemble of prediction probabilities, obtained by ranking the percentile of the results stored as a pandas DataFrame
- The saved models and predicted probabilities are available at https://drive.google.com/drive/folders/1zY0dEyf49s0H00T7f_H575IKaJWHfWgO?usp=sharing
- The other models we experimented with are available at https://github.com/adeepH/SDPRA-2021-SharedTask
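
Below is a minimal, illustrative sketch of the probability averaging used in run 2. The arrays are hypothetical placeholders (they are not taken from the notebooks) standing in for the per-class probabilities predicted by BERT_base_uncased, RoBERTa_base, and SciBERT on the same abstracts.

```python
import numpy as np

# Hypothetical per-class probabilities from the three models, shape (n_samples, n_classes).
bert_probs = np.array([[0.10, 0.70, 0.20],
                       [0.50, 0.30, 0.20]])
roberta_probs = np.array([[0.20, 0.60, 0.20],
                          [0.40, 0.40, 0.20]])
scibert_probs = np.array([[0.15, 0.65, 0.20],
                          [0.45, 0.35, 0.20]])

# run 2: average the probabilities and pick the class with the highest mean probability.
avg_probs = (bert_probs + roberta_probs + scibert_probs) / 3.0
predictions = avg_probs.argmax(axis=1)
print(predictions)  # -> [1 0]
```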
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SDPRA 2021 Shared Task
Submitted systems for "The First Workshop & Shared Task on Scope Detection of the Peer Review Articles" (SDPRA 2021) shared task.

## Submitted Teams
* **UTMN**
* **IIITT**
* **FideLIPI**
* **Parklize**


## Dataset Citation
```
Reddy, Saichethan; Saini, Naveen (2021),
“SDPRA 2021 Shared Task Data”, Mendeley Data, V1,
doi: 10.17632/njb74czv49.1
```

## Overview Paper
```
Reddy, S., Saini, N. Overview and Insights
from Scope Detection of the Peer Review Articles
Shared Tasks 2021. In Proceedings of the
First Workshop & Shared Task on Scope Detection of
the Peer Review Articles (SDPRA 2021)
```

--------------------------------------------------------------------------------
/FideLIPI/description.md:
--------------------------------------------------------------------------------
# SDPRA-2021 Shared Task

### Submission by : FideLIPI
### Team Members : Ankush Chopra, Sohom Ghosh

We've built an ensemble of 4 models. These models are:

1. A classification model using the pretrained RoBERTa model, where we finetune the model while training for the task.
2. A classification model using the pretrained RoBERTa model combined with features created using LDA. We also finetune the RoBERTa weights along with the classification layer while training for the task.
3. A classification model where we first break each abstract into sentences and build the model using all sentences longer than 10 words. We perform sentence tokenization using spaCy. Every sentence is given the same label as its abstract. At prediction time, we take the label with the highest combined output probability as the prediction. We used the Simple Transformers library to build this model.
4. A classification model built on TF-IDF features. These features consist of unigrams, bigrams, trigrams, and four-grams. We built a logistic regression model using these features.

The outputs of the four models above are combined to give the final prediction. The final prediction is the most common vote among the four, and ties are broken arbitrarily.
--------------------------------------------------------------------------------
/parklize/DESCRIPTION.md:
--------------------------------------------------------------------------------
This is the repository of the [shared task](https://sdpra-2021.github.io/website/) at PAKDD2021 on scholarly text (abstract) classification for the solution from team **parklize**.


There are two main ```.ipynb``` notebooks for the solution:

- ```pakdd2021_fasttext_entityembeddings.ipynb``` and
- a [Google Colab notebook](https://colab.research.google.com/drive/1x9MUQxXa2BnSVYjUMrgfy3oZa_p0YFXu?usp=sharing)


# Details

```pakdd2021_fasttext_entityembeddings.ipynb``` does two things:

- training a [fasttext](https://fasttext.cc/) classifier
- getting sentence embeddings from extracted entities using [TagMe](https://tagme.d4science.org/tagme/) and [wikipedia2vec](https://wikipedia2vec.github.io/wikipedia2vec/)


Regarding training the fasttext classifier, there are several steps (cells):

- read the challenge data
- split the validation set further into *internal* validation & test sets
- convert the data to fasttext format
- train a fasttext classifier using [fasttext](https://fasttext.cc/)
- predict on the test set(s)


Regarding getting sentence embeddings from extracted entities:

- extract Wikipedia entities/articles using TagMe
- get abstract embeddings by aggregating the entity embeddings of the entities mentioned in each abstract
- the entities are further filtered by applying k-means clustering (with two clusters) and keeping the larger cluster, on the assumption that the smaller cluster consists of noisy entities


The Google Colab notebook does several things, such as:

- training seven Sentence-BERT classifiers with [sentence transformers](https://www.sbert.net/), and testing with those classifiers
- training a classifier with ```universal-sentence-encoder``` from [Tensorflow Hub](https://www.tensorflow.org/hub) for encoding abstract texts, and testing with this classifier
- loading the fasttext classifier's prediction results
--------------------------------------------------------------------------------
/FideLIPI/topic_model_based_feature_creation.py:
--------------------------------------------------------------------------------
1 | """
2 | This script creates features from the abstracts by running LDA on them. LDA gives vectors that represent each abstract. We use these features as input to one of the RoBERTa models that we have built. Details of that model can be found in script 2.
3 | Author: Sohom Ghosh 4 | """ 5 | import re 6 | import os 7 | import pandas as pd 8 | import numpy as np 9 | import gensim 10 | from gensim import corpora 11 | from nltk.corpus import stopwords 12 | from nltk.stem.wordnet import WordNetLemmatizer 13 | import string 14 | 15 | 16 | PATH = "/data/disk3/pakdd/" 17 | 18 | # Reading data from train, test and validation files 19 | train = pd.read_excel(PATH + "train.xlsx", sheet_name="train", header=None) 20 | train.columns = ["text", "label"] 21 | validation = pd.read_excel( 22 | PATH + "validation.xlsx", sheet_name="validation", header=None 23 | ) 24 | validation.columns = ["text", "label"] 25 | test = pd.read_excel(PATH + "test.xlsx", sheet_name="test", header=None) 26 | test.columns = ["text"] 27 | 28 | 29 | # Topic Modeling / LDA feature extraction 30 | stop = set(stopwords.words("english")) 31 | exclude = set(string.punctuation) 32 | lemma = WordNetLemmatizer() 33 | 34 | 35 | def clean(doc): 36 | stop_free = " ".join([i for i in doc.lower().split() if i not in stop]) 37 | punc_free = "".join(ch for ch in stop_free if ch not in exclude) 38 | normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) 39 | return normalized 40 | 41 | 42 | doc_clean_train = [clean(doc).split() for doc in list(train["text"])] 43 | doc_clean_validation = [clean(doc).split() for doc in list(validation["text"])] 44 | doc_clean_test = [clean(doc).split() for doc in list(test["text"])] 45 | 46 | dictionary = corpora.Dictionary(doc_clean_train) 47 | 48 | Lda = gensim.models.ldamodel.LdaModel 49 | 50 | 51 | def tm_lda_feature_extract(doc_clean, df): 52 | doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] 53 | lmodel = Lda(doc_term_matrix, num_topics=50, id2word=dictionary, passes=50) 54 | feature_matrix_lda = np.zeros(shape=(df.shape[0], 50)) # as number of topics is 50 55 | rw = 0 56 | for dd in doc_clean: 57 | bow_vector = dictionary.doc2bow(dd) 58 | lis = lmodel.get_document_topics( 59 | bow_vector, 60 | minimum_probability=None, 61 | minimum_phi_value=None, 62 | per_word_topics=False, 63 | ) 64 | for (a, b) in lis: 65 | feature_matrix_lda[rw, a] = b 66 | rw = rw + 1 67 | feature_lda_df = pd.DataFrame(feature_matrix_lda) 68 | return feature_lda_df 69 | 70 | 71 | feature_lda_df_train = tm_lda_feature_extract(doc_clean_train, train) 72 | feature_lda_df_validation = tm_lda_feature_extract(doc_clean_validation, validation) 73 | feature_lda_df_test = tm_lda_feature_extract(doc_clean_test, test) 74 | -------------------------------------------------------------------------------- /FideLIPI/5_ensemble.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script does an ensemble of the 4 child model by taking the popular vote from the child models. 3 | Ties are broken arbitrarily. 4 | 5 | Author : Ankush Chopra (ankush01729@gmail.com) 6 | """ 7 | import pandas as pd 8 | import numpy as np 9 | import re, sys, os 10 | 11 | import ast 12 | 13 | from collections import Counter 14 | from sklearn.metrics import f1_score 15 | 16 | 17 | def ensemble_model(lda_path, sentence_model_path, vanila_model_path, tf_idf_model_path): 18 | """ 19 | This function takes the 4 child model output file location as input and return the ensemble predictions as output. 20 | """ 21 | 22 | # reading the data prediction by 4 child models. 
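# The three CSV files (RoBERTa+LDA model, sentence-level model, vanilla RoBERTa model)
# each hold per-abstract predicted labels; the TF-IDF model's file is a plain-text Python
# list of predicted label strings, parsed below with ast.literal_eval.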
23 | lda = pd.read_csv(lda_path) 24 | sentence = pd.read_csv(sentence_model_path) 25 | vanila = pd.read_csv(vanila_model_path) 26 | with open(tf_idf_model_path) as f: 27 | g = f.readlines() 28 | tf_idf = pd.DataFrame(ast.literal_eval(g[0]), columns=["predicted_labels"]) 29 | 30 | # combining all 4 model predictions into one dataframe 31 | lda.reset_index(inplace=True) 32 | lda = pd.merge(lda, vanila, how="left", left_index=True, right_index=True) 33 | lda = lda[["index_x", "abs_text_x", "label_text_x", "pred", "pred_text"]] 34 | lda.columns = [ 35 | "ind", 36 | "abs_text", 37 | "label_text", 38 | "model_with_LDA_text", 39 | "whole_abs_model_text", 40 | ] 41 | ddf = pd.merge(lda, sentence, how="left", on="ind") 42 | ddf.columns = [ 43 | "ind", 44 | "abs_text", 45 | "label_text", 46 | "model_with_LDA_text", 47 | "whole_abs_model_text", 48 | "true_label", 49 | "sentence_model_text", 50 | "true_label_text", 51 | "pred_label_text", 52 | ] 53 | ddf = pd.concat([ddf, tf_idf], axis=1) 54 | 55 | # getting the final prediction by taking the max vote and breaking the ties arbitrarily 56 | my_dict = {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6} 57 | ddf["predicted_labels"] = ddf.predicted_labels.map(lambda x: my_dict[x]) 58 | ddf["combined_prediction_4"] = ddf[ 59 | [ 60 | "whole_abs_model_text", 61 | "model_with_LDA_text", 62 | "sentence_model_text", 63 | "predicted_labels", 64 | ] 65 | ].values.tolist() 66 | ddf["selected_from_combined_prediction_4"] = ddf["combined_prediction_4"].apply( 67 | lambda x: Counter(x).most_common(1)[0][0] 68 | ) 69 | 70 | return ddf 71 | 72 | 73 | # f1 score calculation of ensemble model on training data. 74 | train_out = ensemble_model( 75 | "./LDA_and_transformer_on_whole_abstract_train_data.csv", 76 | "./sentence_model_sentence_above_len6_train_prediction_model_above_len_10.csv", 77 | "only_transformer_on_whole_abstract_train_data.csv", 78 | "logistic_regression_tfidf_v2_train_predictions.txt", 79 | ) 80 | train_f1 = f1_score( 81 | train_out.label_text, 82 | train_out.selected_from_combined_prediction_4, 83 | average="weighted", 84 | ) 85 | 86 | # f1 score calculation of ensemble model on validation data. 87 | val_out = ensemble_model( 88 | "./LDA_and_transformer_on_whole_abstract_val_data.csv", 89 | "./sentence_model_sentence_above_len6_val_prediction_model_above_len_10.csv", 90 | "only_transformer_on_whole_abstract_val_data.csv", 91 | "logistic_regression_tfidf_v2_val_predictions.txt", 92 | ) 93 | val_f1 = f1_score( 94 | val_out.label_text, val_out.selected_from_combined_prediction_4, average="weighted" 95 | ) 96 | -------------------------------------------------------------------------------- /FideLIPI/3_sentence_level_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script performs the model training on the abstract dataset using the pretrained robera model. We use simple transformer library to train the model. We first break the abstract into sentences and assign the same label to sentence as original abstract. We then, train a model on this sentence level data. Since smaller sentences may not have enough predictive power. We train 4 models by selecting sentence above certain word count to test this hypothesis. We find that, models trained on sentence length above 10 perform the best on the validation data. Putting a sentence length filter of 6 on validation data gives us the best validation performance. 
3 | 4 | Author: Ankush Chopra (ankush01729@gmail.com) 5 | """ 6 | 7 | import os 8 | import re 9 | import torch 10 | import spacy 11 | import pandas as pd 12 | import numpy as np 13 | from operator import itemgetter 14 | from sklearn.metrics import f1_score, confusion_matrix 15 | from simpletransformers.classification import ClassificationModel, ClassificationArgs 16 | 17 | # setting up the right device type 18 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 19 | 20 | nlp = spacy.load("en_core_web_sm") 21 | 22 | 23 | def sentence_level_data_prep(df): 24 | """ 25 | This function splits the abstracts into sentences. It uses spacy for sentence tokenization. 26 | """ 27 | 28 | inds = [] 29 | sentences_extracted = [] 30 | for abstract, ind in zip(df["text"].values, df.index): 31 | for i in nlp(str(abstract).replace("\n", "")).sents: 32 | sentences_extracted.append(str(i)) 33 | inds.append(ind) 34 | sent_df = pd.DataFrame( 35 | {"ind": inds, "sentences_from_abstract": sentences_extracted} 36 | ) 37 | return sent_df 38 | 39 | 40 | df = pd.read_csv(r"./train.csv", header=None, names=["text", "labels"]) 41 | sentences_train = sentence_level_data_prep(df) 42 | df.reset_index(inplace=True) 43 | df.columns = ["ind", "text", "labels"] 44 | sentences_train.merge(original_train[["ind", "labels"]], on="ind", how="inner") 45 | 46 | sentences_train["sentence_length"] = sentences_train.sentences_from_abstract.map( 47 | lambda x: len(x.split()) 48 | ) 49 | sentences_train["label_text"] = pd.Categorical(sentences_train.labels) 50 | sentences_train["labels"] = sentences_train.label_text.cat.codes 51 | 52 | 53 | model_args = ClassificationArgs( 54 | num_train_epochs=10, 55 | sliding_window=True, 56 | fp16=False, 57 | use_early_stopping=True, 58 | reprocess_input_data=True, 59 | overwrite_output_dir=True, 60 | ) 61 | 62 | # Create a ClassificationModel 63 | model = ClassificationModel("roberta", "roberta-base", num_labels=7, args=model_args) 64 | 65 | # We train 4 models by selecting sentences above sent_len. We save these model for 10 epochs. At the end, we select best model from these 40 saved epoch models by selecting the one doing the best on the validation set. 
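# A minimal sketch of that selection step (not part of this script; the checkpoint layout,
# glob pattern, and the validation dataframe `val_sent_df` with columns "text"/"labels"
# are assumptions made for illustration only):
#
#     import glob
#     from sklearn.metrics import f1_score
#     best_f1, best_ckpt = -1.0, None
#     for ckpt in sorted(glob.glob("./roberta_model_sentence_10/checkpoint-*-epoch-*")):
#         saved_model = ClassificationModel("roberta", ckpt, num_labels=7)
#         result, _, _ = saved_model.eval_model(
#             val_sent_df, f1=lambda y, p: f1_score(y, p, average="weighted")
#         )
#         if result["f1"] > best_f1:
#             best_f1, best_ckpt = result["f1"], ckpt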
66 | # 67 | for sent_len in [0, 6, 10, 15]: 68 | print(sent_len) 69 | sentences_train_filtred = sentences_train[ 70 | (sentences_train["sentence_length"] > sent_len) 71 | ] 72 | sentences_train_filtred.reset_index(inplace=True, drop=True) 73 | train = sentences_train_filtred[["sentences_from_abstract", "labels"]] 74 | 75 | # Optional model configuration 76 | output_dir = "./roberta_model_sentence_" + str(sent_len) 77 | best_model_dir = output_dir + "/best_model/" 78 | cache_dir = output_dir + "/cache/" 79 | print(output_dir) 80 | model_args = ClassificationArgs( 81 | cache_dir=cache_dir, 82 | output_dir=output_dir, 83 | best_model_dir=best_model_dir, 84 | num_train_epochs=10, 85 | sliding_window=True, 86 | fp16=False, 87 | use_early_stopping=True, 88 | reprocess_input_data=True, 89 | overwrite_output_dir=True, 90 | ) 91 | 92 | # Create a ClassificationModel 93 | model = ClassificationModel( 94 | "roberta", "roberta-base", num_labels=7, args=model_args 95 | ) 96 | # You can set class weights by using the optional weight argument 97 | # Train the model 98 | model.train_model(train) 99 | -------------------------------------------------------------------------------- /FideLIPI/4_tf_idf_logistic_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script performs the model training on the abstract dataset using the features created using the TF-IDF vectorizer. Model is trained using the logistic regression algorithm which utilizes the 22K features created using 1 to 4-gram token and their tf-idf vectorized values. 3 | Author: Sohom Ghosh 4 | """ 5 | 6 | import re 7 | import os 8 | import pandas as pd 9 | import numpy as np 10 | import string 11 | from sklearn.feature_extraction.text import TfidfVectorizer 12 | from sklearn.linear_model import LogisticRegression 13 | 14 | 15 | PATH = "/data/disk3/pakdd/" 16 | 17 | # reading the input data files 18 | train = pd.read_excel(PATH + "train.xlsx", sheet_name="train", header=None) 19 | train.columns = ["text", "label"] 20 | validation = pd.read_excel( 21 | PATH + "validation.xlsx", sheet_name="validation", header=None 22 | ) 23 | validation.columns = ["text", "label"] 24 | test = pd.read_excel(PATH + "test.xlsx", sheet_name="test", header=None) 25 | test.columns = ["text"] 26 | 27 | # TF-IDF feature creation 28 | tfidf_model_original_v2 = TfidfVectorizer( 29 | ngram_range=(1, 4), min_df=0.0005, stop_words="english" 30 | ) 31 | tfidf_model_original_v2.fit(train["text"]) 32 | 33 | # train 34 | tfidf_df_train_original_v2 = pd.DataFrame( 35 | tfidf_model_original_v2.transform(train["text"]).todense() 36 | ) 37 | tfidf_df_train_original_v2.columns = sorted(tfidf_model_original_v2.vocabulary_) 38 | 39 | # validation 40 | tfidf_df_valid_original_v2 = pd.DataFrame( 41 | tfidf_model_original_v2.transform(validation["text"]).todense() 42 | ) 43 | tfidf_df_valid_original_v2.columns = sorted(tfidf_model_original_v2.vocabulary_) 44 | 45 | # test 46 | tfidf_df_test_original_v2 = pd.DataFrame( 47 | tfidf_model_original_v2.transform(test["text"]).todense() 48 | ) 49 | tfidf_df_test_original_v2.columns = sorted(tfidf_model_original_v2.vocabulary_) 50 | 51 | 52 | # Logistic Regression on tfidf_v2 (22K features) 53 | def model(clf, train_X, train_y, valid_X, valid_y): 54 | clf.fit(train_X, train_y) 55 | pred_tr = clf.predict(train_X) 56 | pred_valid = clf.predict(valid_X) 57 | print("\nTraining F1:{}".format(f1_score(train_y, pred_tr, average="weighted"))) 58 | print("Training Confusion Matrix 
\n{}".format(confusion_matrix(train_y, pred_tr))) 59 | print("Classification Report: \n{}".format(classification_report(train_y, pred_tr))) 60 | print( 61 | "\nValidation F1:{}".format(f1_score(valid_y, pred_valid, average="weighted")) 62 | ) 63 | print( 64 | "Validation Confusion Matrix \n{}".format(confusion_matrix(valid_y, pred_valid)) 65 | ) 66 | print( 67 | "Classification Report: \n{}".format(classification_report(valid_y, pred_valid)) 68 | ) 69 | 70 | 71 | lr_cnt = 0 72 | train_X = tfidf_df_train_original_v2 73 | valid_X = tfidf_df_valid_original_v2 74 | test_X = tfidf_df_test_original_v2 75 | train_y = train["label"].replace( 76 | {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6} 77 | ) 78 | valid_y = validation["label"].replace( 79 | {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6} 80 | ) 81 | info = "tfidf_v2_only" 82 | 83 | 84 | print("\n ################# LR VERSION ################# " + str(lr_cnt) + "\n") 85 | 86 | # Initializing logistic regression and training the model 87 | lr_clf = LogisticRegression(solver="lbfgs", n_jobs=-1) 88 | model(lr_clf, train_X, train_y, valid_X, valid_y) 89 | params = lr_clf.get_params() 90 | pred_tr = lr_clf.predict(train_X) 91 | pred_valid = lr_clf.predict(valid_X) 92 | open("lr_report_v" + str(lr_cnt) + info + ".txt", "w").write( 93 | str(info) 94 | + "\n\n" 95 | + str(params) 96 | + "\n\n lr_v" 97 | + str(lr_cnt) 98 | + ".pickle.dat" 99 | + "\n\n Training Confusion Matrix \n{}".format(confusion_matrix(train_y, pred_tr)) 100 | + "\n\n Training Classification Report: \n{}".format( 101 | classification_report(train_y, pred_tr) 102 | ) 103 | + "\n\n Validation Confusion Matrix \n{}".format( 104 | confusion_matrix(valid_y, pred_valid) 105 | ) 106 | + "\n\n Validation Classification Report: \n{}".format( 107 | classification_report(valid_y, pred_valid) 108 | ) 109 | ) 110 | validation_predicted_lr_best = lr_clf.predict(valid_X) 111 | repl_di = {0: "CL", 1: "CR", 2: "DC", 3: "DS", 4: "LO", 5: "NI", 6: "SE"} 112 | open(PATH + "logistic_regression_tfidf_v2_validation_predictions.txt", "w").write( 113 | str([repl_di[i] for i in validation_predicted_lr_best]) 114 | ) 115 | 116 | test_predicted_lr_best = lr_clf.predict(test_X) 117 | pd.DataFrame({"predicted_labels": [repl_di[i] for i in test_predicted_lr_best]}).to_csv( 118 | PATH + "logistic_reggression_on_tfidf_v2_22K_features_predicted_on_test.csv", 119 | index=False, 120 | ) 121 | -------------------------------------------------------------------------------- /FideLIPI/1_roberta_on_whole_abstract.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script performs the model training on the abstract dataset using the pretrained robera model with a classifier head. It finetunes the roberta while training for the classification task. 3 | We let this run for 20 epochs and saved all the models. We selected the best models epoch when performance on the validation set stopped improving. 
4 | 5 | Author: Ankush Chopra (ankush01729@gmail.com) 6 | """ 7 | import os 8 | import torch 9 | import pandas as pd 10 | from torch.utils.data import Dataset, DataLoader 11 | from transformers import RobertaModel, RobertaTokenizer 12 | 13 | # setting up the device type 14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 15 | 16 | 17 | # reading dataset 18 | df = pd.read_csv(r"./train.csv", header=None, names=["abs_text", "label_text"]) 19 | val_df = pd.read_csv(r"./validation.csv", header=None, names=["abs_text", "label_text"]) 20 | 21 | 22 | # # Converting the codes to appropriate categories using a dictionary 23 | my_dict = {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6} 24 | 25 | 26 | def update_cat(x): 27 | return my_dict[x] 28 | 29 | 30 | df["label_text"] = df["label_text"].apply(lambda x: update_cat(x)) 31 | val_df["label_text"] = val_df["label_text"].apply(lambda x: update_cat(x)) 32 | 33 | # Defining some key variables that will be used later on in the training 34 | MAX_LEN = 512 35 | TRAIN_BATCH_SIZE = 32 36 | VALID_BATCH_SIZE = 64 37 | EPOCHS = 20 38 | LEARNING_RATE = 2e-05 39 | tokenizer = RobertaTokenizer.from_pretrained("roberta-base") 40 | 41 | 42 | class Triage(Dataset): 43 | """ 44 | This is a subclass of torch packages Dataset class. It processes input to create ids, masks and targets required for model training. 45 | """ 46 | 47 | def __init__(self, dataframe, tokenizer, max_len, text_col_name, category_col): 48 | self.len = len(dataframe) 49 | self.data = dataframe 50 | self.tokenizer = tokenizer 51 | self.max_len = max_len 52 | self.text_col_name = text_col_name 53 | self.category_col = category_col 54 | 55 | def __getitem__(self, index): 56 | title = str(self.data[self.text_col_name][index]) 57 | title = " ".join(title.split()) 58 | inputs = self.tokenizer.encode_plus( 59 | title, 60 | None, 61 | add_special_tokens=True, 62 | max_length=self.max_len, 63 | pad_to_max_length=True, 64 | return_token_type_ids=True, 65 | truncation=True, 66 | ) 67 | ids = inputs["input_ids"] 68 | mask = inputs["attention_mask"] 69 | 70 | return { 71 | "ids": torch.tensor(ids, dtype=torch.long), 72 | "mask": torch.tensor(mask, dtype=torch.long), 73 | "targets": torch.tensor( 74 | self.data[self.category_col][index], dtype=torch.long 75 | ), 76 | } 77 | 78 | def __len__(self): 79 | return self.len 80 | 81 | 82 | # dataset specifics 83 | text_col_name = "abs_text" 84 | category_col = "label_text" 85 | 86 | training_set = Triage(df, tokenizer, MAX_LEN, text_col_name, category_col) 87 | validation_set = Triage(val_df, tokenizer, MAX_LEN, text_col_name, category_col) 88 | 89 | 90 | # data loader parameters 91 | train_params = {"batch_size": TRAIN_BATCH_SIZE, "shuffle": True, "num_workers": 0} 92 | 93 | test_params = {"batch_size": VALID_BATCH_SIZE, "shuffle": False, "num_workers": 0} 94 | 95 | # creating dataloader for modelling 96 | training_loader = DataLoader(training_set, **train_params) 97 | val_loader = DataLoader(validation_set, **test_params) 98 | 99 | 100 | class BERTClass(torch.nn.Module): 101 | """ 102 | This is the modelling class which adds a classification layer on top of Roberta model. We finetune roberta while training for the label classification. 
103 | """ 104 | 105 | def __init__(self, num_class): 106 | super(BERTClass, self).__init__() 107 | self.num_class = num_class 108 | self.l1 = RobertaModel.from_pretrained("roberta-base") 109 | self.pre_classifier = torch.nn.Linear(768, 768) 110 | self.dropout = torch.nn.Dropout(0.3) 111 | self.classifier = torch.nn.Linear(768, self.num_class) 112 | self.history = dict() 113 | 114 | def forward(self, input_ids, attention_mask): 115 | output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask) 116 | hidden_state = output_1[0] 117 | pooler = hidden_state[:, 0] 118 | pooler = self.pre_classifier(pooler) 119 | pooler = torch.nn.ReLU()(pooler) 120 | pooler = self.dropout(pooler) 121 | output = self.classifier(pooler) 122 | return output 123 | 124 | 125 | # initializing and moving the model to the appropriate device 126 | model = BERTClass(7) 127 | model.to(device) 128 | 129 | # Creating the loss function and optimizer 130 | loss_function = torch.nn.CrossEntropyLoss() 131 | optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE) 132 | 133 | 134 | def calcuate_accu(big_idx, targets): 135 | """ 136 | This function compares the predicted output with ground truth to give the count of the correct predictions. 137 | """ 138 | n_correct = (big_idx == targets).sum().item() 139 | return n_correct 140 | 141 | 142 | def train(epoch): 143 | """ 144 | Function to train the model. This function utilizes the model initialized using BERTClass. It trains the model and provides the accuracy on the training set. 145 | """ 146 | tr_loss = 0 147 | n_correct = 0 148 | nb_tr_steps = 0 149 | nb_tr_examples = 0 150 | model.train() 151 | for _, data in enumerate(training_loader, 0): 152 | ids = data["ids"].to(device, dtype=torch.long) 153 | mask = data["mask"].to(device, dtype=torch.long) 154 | targets = data["targets"].to(device, dtype=torch.long) 155 | outputs = model(ids, mask) 156 | loss = loss_function(outputs, targets) 157 | tr_loss += loss.item() 158 | big_val, big_idx = torch.max(outputs.data, dim=1) 159 | n_correct += calcuate_accu(big_idx, targets) 160 | 161 | nb_tr_steps += 1 162 | nb_tr_examples += targets.size(0) 163 | 164 | if _ % 250 == 0: 165 | loss_step = tr_loss / nb_tr_steps 166 | accu_step = (n_correct * 100) / nb_tr_examples 167 | print(f"Training Loss per 250 steps: {loss_step}") 168 | print(f"Training Accuracy per 250 steps: {accu_step}") 169 | 170 | optimizer.zero_grad() 171 | loss.backward() 172 | # # When using GPU 173 | optimizer.step() 174 | 175 | print(f"The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}") 176 | epoch_loss = tr_loss / nb_tr_steps 177 | epoch_accu = (n_correct * 100) / nb_tr_examples 178 | print(f"Training Loss Epoch: {epoch_loss}") 179 | print(f"Training Accuracy Epoch: {epoch_accu}") 180 | 181 | return epoch_loss, epoch_accu 182 | 183 | 184 | def valid(model, testing_loader): 185 | """ 186 | This function calculates the performance numbers on the validation set. 
187 | """ 188 | model.eval() 189 | n_correct = 0 190 | n_wrong = 0 191 | total = 0 192 | tr_loss = 0 193 | nb_tr_steps = 0 194 | nb_tr_examples = 0 195 | with torch.no_grad(): 196 | for _, data in enumerate(testing_loader, 0): 197 | ids = data["ids"].to(device, dtype=torch.long) 198 | mask = data["mask"].to(device, dtype=torch.long) 199 | targets = data["targets"].to(device, dtype=torch.long) 200 | outputs = model(ids, mask).squeeze() 201 | loss = loss_function(outputs, targets) 202 | tr_loss += loss.item() 203 | big_val, big_idx = torch.max(outputs.data, dim=1) 204 | n_correct += calcuate_accu(big_idx, targets) 205 | 206 | nb_tr_steps += 1 207 | nb_tr_examples += targets.size(0) 208 | 209 | epoch_loss = tr_loss / nb_tr_steps 210 | epoch_accu = (n_correct * 100) / nb_tr_examples 211 | print(f"Validation Loss Epoch: {epoch_loss}") 212 | print(f"Validation Accuracy Epoch: {epoch_accu}") 213 | 214 | return epoch_loss, epoch_accu 215 | 216 | 217 | # path to save models at the end of the epochs 218 | PATH = "./transformer_model_roberta/" 219 | if not os.path.exists(PATH): 220 | os.makedirs(PATH) 221 | 222 | # variable to store the model performance at the epoch level 223 | model.history["train_acc"] = [] 224 | model.history["val_acc"] = [] 225 | model.history["train_loss"] = [] 226 | model.history["val_loss"] = [] 227 | 228 | # model training 229 | for epoch in range(EPOCHS): 230 | print("Epoch number : ", epoch) 231 | train_loss, train_accu = train(epoch) 232 | val_loss, val_accu = valid(model, val_loader) 233 | model.history["train_acc"].append(train_accu) 234 | model.history["train_loss"].append(train_loss) 235 | model.history["val_acc"].append(val_accu) 236 | model.history["val_loss"].append(val_loss) 237 | torch.save( 238 | { 239 | "epoch": epoch, 240 | "model_state_dict": model.state_dict(), 241 | "optimizer_state_dict": optimizer.state_dict(), 242 | }, 243 | PATH + "/epoch_" + str(epoch) + ".bin", 244 | ) 245 | 246 | -------------------------------------------------------------------------------- /FideLIPI/2_roberta_on_abstract_text_combined_with_lda.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script performs the model training on the abstract dataset using the pretrained robera model with a classifier head. It finetunes the roberta while training for the classification task. Along with the Roberta representation of the abstract, we also use LDA vectors to train the model. 3 | We let this run for 20 epochs and saved all the models. We selected the best models epoch when performance on the validation set stopped improving. 4 | 5 | Author: Ankush Chopra (ankush01729@gmail.com) 6 | """ 7 | import os 8 | import torch 9 | import pandas as pd 10 | from torch.utils.data import Dataset, DataLoader 11 | from transformers import RobertaModel, RobertaTokenizer 12 | 13 | # setting up the device type 14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 15 | 16 | # reading dataset 17 | df = pd.read_csv(r"./train.csv", header=None, names=["abs_text", "label_text"]) 18 | val_df = pd.read_csv(r"./validation.csv", header=None, names=["abs_text", "label_text"]) 19 | 20 | # reading additional features which are derived from topic models using LDA. 
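# These CSVs are expected to hold the 50-dimensional per-abstract topic distributions
# (one column per LDA topic) built in topic_model_based_feature_creation.py; that script
# creates the feature_lda_df_train / feature_lda_df_validation dataframes, which are
# presumably saved to these paths.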
21 | lda_train = pd.read_csv("./feature_lda_df_train.csv") 22 | lda_valid = pd.read_csv("./feature_lda_df_validation.csv") 23 | 24 | # concatinating topic vectors from LDA to the abstract dataset 25 | df = pd.concat([df, lda_train], axis=1) 26 | val_df = pd.concat([val_df, lda_valid], axis=1) 27 | 28 | # # Converting the codes to appropriate categories using a dictionary 29 | my_dict = {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6} 30 | 31 | 32 | def update_cat(x): 33 | """ 34 | Function to replace text labels with integer classes 35 | """ 36 | return my_dict[x] 37 | 38 | 39 | df["label_text"] = df["label_text"].apply(lambda x: update_cat(x)) 40 | val_df["label_text"] = val_df["label_text"].apply(lambda x: update_cat(x)) 41 | 42 | # Defining some key variables that will be used later on in the training 43 | MAX_LEN = 512 44 | TRAIN_BATCH_SIZE = 32 45 | VALID_BATCH_SIZE = 8 46 | EPOCHS = 1 47 | LEARNING_RATE = 2e-05 48 | tokenizer = RobertaTokenizer.from_pretrained("roberta-base") 49 | 50 | 51 | class Triage(Dataset): 52 | """ 53 | This is a subclass of torch packages Dataset class. It processes input to create ids, masks and targets required for model training. 54 | """ 55 | 56 | def __init__(self, dataframe, tokenizer, max_len, text_col_name, categoty_col): 57 | self.len = len(dataframe) 58 | self.data = dataframe 59 | self.tokenizer = tokenizer 60 | self.max_len = max_len 61 | self.text_col_name = text_col_name 62 | self.categoty_col = categoty_col 63 | self.col_names = list(dataframe) 64 | 65 | def __getitem__(self, index): 66 | title = str(self.data[self.text_col_name][index]) 67 | title = " ".join(title.split()) 68 | inputs = self.tokenizer.encode_plus( 69 | title, 70 | None, 71 | add_special_tokens=True, 72 | max_length=self.max_len, 73 | pad_to_max_length=True, 74 | return_token_type_ids=True, 75 | truncation=True, 76 | ) 77 | ids = inputs["input_ids"] 78 | mask = inputs["attention_mask"] 79 | 80 | return { 81 | "ids": torch.tensor(ids, dtype=torch.long), 82 | "mask": torch.tensor(mask, dtype=torch.long), 83 | "targets": torch.tensor( 84 | self.data[self.categoty_col][index], dtype=torch.long 85 | ), 86 | "tf_idf_feature": torch.tensor( 87 | self.data.loc[index, self.col_names[2:]], dtype=torch.float32 88 | ), 89 | } 90 | 91 | def __len__(self): 92 | return self.len 93 | 94 | 95 | # dataset specifics 96 | text_col_name = "abs_text" 97 | category_col = "label_text" 98 | 99 | training_set = Triage(df, tokenizer, MAX_LEN, text_col_name, category_col) 100 | validation_set = Triage(val_df, tokenizer, MAX_LEN, text_col_name, category_col) 101 | 102 | 103 | # data loader parameters 104 | train_params = {"batch_size": TRAIN_BATCH_SIZE, "shuffle": True, "num_workers": 0} 105 | 106 | test_params = {"batch_size": VALID_BATCH_SIZE, "shuffle": False, "num_workers": 0} 107 | 108 | # creating dataloader for modelling 109 | training_loader = DataLoader(training_set, **train_params) 110 | val_loader = DataLoader(validation_set, **test_params) 111 | 112 | 113 | class BERTClass(torch.nn.Module): 114 | """ 115 | This is the modelling class which adds a classification layer on top of Roberta model. We finetune roberta while training for the label classification. 
116 | """ 117 | 118 | def __init__(self, num_class): 119 | super(BERTClass, self).__init__() 120 | self.num_class = num_class 121 | self.l1 = RobertaModel.from_pretrained("roberta-base") 122 | self.hc_features = torch.nn.Linear(50, 128) 123 | self.from_bert = torch.nn.Linear(768, 128) 124 | self.dropout = torch.nn.Dropout(0.3) 125 | self.pre_classifier = torch.nn.Linear(256, 128) 126 | self.classifier = torch.nn.Linear(128, self.num_class) 127 | self.history = dict() 128 | 129 | def forward(self, input_ids, attention_mask, other_features): 130 | output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask) 131 | hidden_state = output_1[0] 132 | pooler = hidden_state[:, 0] 133 | pooler = self.from_bert(pooler) 134 | other_feature_layer = self.hc_features(other_features) 135 | combined_features = torch.cat((pooler, other_feature_layer), dim=1) 136 | combined_features = torch.nn.ReLU()(combined_features) 137 | combined_features = self.dropout(combined_features) 138 | combined_features = self.pre_classifier(combined_features) 139 | output = self.classifier(combined_features) 140 | 141 | return output 142 | 143 | 144 | # initializing and moving the model to the appropriate device 145 | model = BERTClass(7) 146 | model.to(device) 147 | 148 | # Creating the loss function and optimizer 149 | loss_function = torch.nn.CrossEntropyLoss() 150 | optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE) 151 | 152 | 153 | def calcuate_accu(big_idx, targets): 154 | """ 155 | This function compares the predicted output with ground truth to give the count of the correct predictions. 156 | """ 157 | n_correct = (big_idx == targets).sum().item() 158 | return n_correct 159 | 160 | 161 | def train(epoch): 162 | """ 163 | Function to train the model. This function utilizes the model initialized using BERTClass. It trains the model and provides the accuracy on the training set. 164 | """ 165 | tr_loss = 0 166 | n_correct = 0 167 | nb_tr_steps = 0 168 | nb_tr_examples = 0 169 | model.train() 170 | for _, data in enumerate(training_loader, 0): 171 | ids = data["ids"].to(device, dtype=torch.long) 172 | mask = data["mask"].to(device, dtype=torch.long) 173 | targets = data["targets"].to(device, dtype=torch.long) 174 | tf_idf_feature = data["tf_idf_feature"].to(device, dtype=torch.float32) 175 | 176 | outputs = model(ids, mask, tf_idf_feature) 177 | loss = loss_function(outputs, targets) 178 | tr_loss += loss.item() 179 | big_val, big_idx = torch.max(outputs.data, dim=1) 180 | n_correct += calcuate_accu(big_idx, targets) 181 | 182 | nb_tr_steps += 1 183 | nb_tr_examples += targets.size(0) 184 | 185 | if _ % 250 == 0: 186 | loss_step = tr_loss / nb_tr_steps 187 | accu_step = (n_correct * 100) / nb_tr_examples 188 | print(f"Training Loss per 250 steps: {loss_step}") 189 | print(f"Training Accuracy per 250 steps: {accu_step}") 190 | 191 | optimizer.zero_grad() 192 | loss.backward() 193 | # # When using GPU 194 | optimizer.step() 195 | 196 | print(f"The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}") 197 | epoch_loss = tr_loss / nb_tr_steps 198 | epoch_accu = (n_correct * 100) / nb_tr_examples 199 | print(f"Training Loss Epoch: {epoch_loss}") 200 | print(f"Training Accuracy Epoch: {epoch_accu}") 201 | 202 | return epoch_loss, epoch_accu 203 | 204 | 205 | def valid(model, testing_loader): 206 | """ 207 | This function calculates the performance numbers on the validation set. 
208 | """ 209 | model.eval() 210 | n_correct = 0 211 | n_wrong = 0 212 | total = 0 213 | tr_loss = 0 214 | nb_tr_steps = 0 215 | nb_tr_examples = 0 216 | with torch.no_grad(): 217 | for _, data in enumerate(testing_loader, 0): 218 | ids = data["ids"].to(device, dtype=torch.long) 219 | mask = data["mask"].to(device, dtype=torch.long) 220 | targets = data["targets"].to(device, dtype=torch.long) 221 | tf_idf_feature = data["tf_idf_feature"].to(device, dtype=torch.float32) 222 | outputs = model(ids, mask, tf_idf_feature).squeeze() 223 | loss = loss_function(outputs, targets) 224 | tr_loss += loss.item() 225 | big_val, big_idx = torch.max(outputs.data, dim=1) 226 | n_correct += calcuate_accu(big_idx, targets) 227 | 228 | nb_tr_steps += 1 229 | nb_tr_examples += targets.size(0) 230 | 231 | epoch_loss = tr_loss / nb_tr_steps 232 | epoch_accu = (n_correct * 100) / nb_tr_examples 233 | print(f"Validation Loss Epoch: {epoch_loss}") 234 | print(f"Validation Accuracy Epoch: {epoch_accu}") 235 | 236 | return epoch_loss, epoch_accu 237 | 238 | 239 | # path to save models at the end of the epochs 240 | PATH = "./transformer_model_roberta_with_lda/" 241 | if not os.path.exists(PATH): 242 | os.makedirs(PATH) 243 | 244 | # variable to store the model performance at the epoch level 245 | model.history["train_acc"] = [] 246 | model.history["val_acc"] = [] 247 | model.history["train_loss"] = [] 248 | model.history["val_loss"] = [] 249 | 250 | # model training 251 | for epoch in range(EPOCHS): 252 | print("Epoch number : ", epoch) 253 | train_loss, train_accu = train(epoch) 254 | val_loss, val_accu = valid(model, val_loader) 255 | model.history["train_acc"].append(train_accu) 256 | model.history["train_loss"].append(train_loss) 257 | model.history["val_acc"].append(val_accu) 258 | model.history["val_loss"].append(val_loss) 259 | torch.save( 260 | { 261 | "epoch": epoch, 262 | "model_state_dict": model.state_dict(), 263 | "optimizer_state_dict": optimizer.state_dict(), 264 | }, 265 | PATH + "/epoch_" + str(epoch) + ".bin", 266 | ) 267 | -------------------------------------------------------------------------------- /utmn/3_feedforward_+_ensembling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "3. 
feedforward + ensembling.ipynb", 7 | "provenance": [], 8 | "toc_visible": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "jxiXI1lAheOB" 20 | }, 21 | "source": [ 22 | "#load data and libraries" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "metadata": { 28 | "id": "TBUF6vAff3TA" 29 | }, 30 | "source": [ 31 | "!pip install pymorphy2\r\n", 32 | "import numpy as np\r\n", 33 | "import pandas as pd\r\n", 34 | "import math\r\n", 35 | "from sklearn.preprocessing import OneHotEncoder\r\n", 36 | "import re, os, pickle\r\n", 37 | "\r\n", 38 | "import keras\r\n", 39 | "from keras import Sequential\r\n", 40 | "from keras.preprocessing.text import Tokenizer\r\n", 41 | "from keras.preprocessing.sequence import pad_sequences\r\n", 42 | "from keras.utils import to_categorical\r\n", 43 | "\r\n", 44 | "from keras.layers import Input, Embedding, Activation, Flatten, Dense, concatenate\r\n", 45 | "from keras.layers import Conv1D, MaxPooling1D, Dropout, LSTM\r\n", 46 | "from keras.models import Model\r\n", 47 | "\r\n", 48 | "!pip install imblearn\r\n", 49 | "from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE" 50 | ], 51 | "execution_count": null, 52 | "outputs": [] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "metadata": { 57 | "id": "KZUX1JC_gEZY" 58 | }, 59 | "source": [ 60 | "from google.colab import drive\r\n", 61 | "drive.mount('/content/drive')" 62 | ], 63 | "execution_count": null, 64 | "outputs": [] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "metadata": { 69 | "id": "tkagLwcogd1G" 70 | }, 71 | "source": [ 72 | "with open('/content/drive/bert_embs_val.pickle', 'rb') as f:\r\n", 73 | " val_values = pickle.load(f)\r\n", 74 | "with open('/content/drive/bert_embs_train.pickle', 'rb') as f:\r\n", 75 | " train_values = pickle.load(f)" 76 | ], 77 | "execution_count": null, 78 | "outputs": [] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "metadata": { 83 | "id": "Ob3UgGdtgoGG" 84 | }, 85 | "source": [ 86 | "df = pd.read_csv('/content/drive/train.csv', header=None, names = ['text','label'])\r\n", 87 | "df.head()" 88 | ], 89 | "execution_count": null, 90 | "outputs": [] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "id": "56EoPNvfhS0w" 96 | }, 97 | "source": [ 98 | "train_texts = df.text.values\r\n", 99 | "\r\n", 100 | "possible_labels = df.label.unique()\r\n", 101 | "label_dict = {}\r\n", 102 | "for index, possible_label in enumerate(possible_labels):\r\n", 103 | " label_dict[possible_label] = index\r\n", 104 | "\r\n", 105 | "df['label'] = df.label.replace(label_dict)\r\n", 106 | "train_labels = df.label.values" 107 | ], 108 | "execution_count": null, 109 | "outputs": [] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "metadata": { 114 | "id": "enN2ItahhWsT" 115 | }, 116 | "source": [ 117 | "df = pd.read_csv('/content/drive/validation.csv', header=None, names = ['text','label'])\r\n", 118 | "df.head()" 119 | ], 120 | "execution_count": null, 121 | "outputs": [] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "metadata": { 126 | "id": "Sl6CJ348hZ8e" 127 | }, 128 | "source": [ 129 | "val_texts = df.text.values\r\n", 130 | "df['label'] = df.label.replace(label_dict)\r\n", 131 | "train_labels = list(train_labels) + list(df.label.values)" 132 | ], 133 | "execution_count": null, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "metadata": { 139 | "id": "1F-pQuTIlOhm" 140 | }, 141 | "source": [ 142 | 
"with open('/content/drive/td_100_train.pickle', 'rb') as f:\r\n", 143 | " distributions_train = pickle.load(f)\r\n", 144 | "with open('/content/drive/td_100_val.pickle', 'rb') as f:\r\n", 145 | " distributions_val = pickle.load(f)" 146 | ], 147 | "execution_count": null, 148 | "outputs": [] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "metadata": { 153 | "id": "0C-k_44_lecY" 154 | }, 155 | "source": [ 156 | "train_data = np.hstack((np.array(train_values),np.array(distributions_train)))\r\n", 157 | "test_data = np.hstack((np.array(val_values),np.array(distributions_val)))\r\n", 158 | "\r\n", 159 | "train_data.shape" 160 | ], 161 | "execution_count": null, 162 | "outputs": [] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "metadata": { 167 | "id": "EDoqQ4FOZ5LN" 168 | }, 169 | "source": [ 170 | "len(train_labels)" 171 | ], 172 | "execution_count": null, 173 | "outputs": [] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "metadata": { 178 | "id": "U2VNSGOoIiyw" 179 | }, 180 | "source": [ 181 | "ros = RandomOverSampler(random_state=1)\r\n", 182 | "train_data_resampled, trai_labels_resampled = ros.fit_resample(train_data, train_labels)" 183 | ], 184 | "execution_count": null, 185 | "outputs": [] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "metadata": { 190 | "id": "Wx-BOylEIyGs" 191 | }, 192 | "source": [ 193 | "train_data = train_data_resampled\r\n", 194 | "train_labels = trai_labels_resampled\r\n", 195 | "\r\n", 196 | "train_data.shape" 197 | ], 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "metadata": { 204 | "id": "woNJvWeqcrei" 205 | }, 206 | "source": [ 207 | "df = pd.DataFrame(train_data)\r\n", 208 | "df['label'] = pd.Series(train_labels)\r\n", 209 | "\r\n", 210 | "df = df.sample(frac=1)\r\n", 211 | "\r\n", 212 | "train_labels = df.label.values\r\n", 213 | "df = df.drop(columns = 'label')\r\n", 214 | "train_data = df.values\r\n", 215 | "\r\n", 216 | "train_data.shape" 217 | ], 218 | "execution_count": null, 219 | "outputs": [] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": { 224 | "id": "iJ3Tn7DThkw2" 225 | }, 226 | "source": [ 227 | "#ffn" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "metadata": { 233 | "id": "wmNHLYY2aVGE" 234 | }, 235 | "source": [ 236 | "import math\r\n", 237 | "border = math.ceil(len(train_data) * 0.1)\r\n", 238 | "\r\n", 239 | "val_data, train_data = train_data[:border], train_data[border:]\r\n", 240 | "val_labels, train_labels = train_labels[:border], train_labels[border:]" 241 | ], 242 | "execution_count": null, 243 | "outputs": [] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "metadata": { 248 | "id": "nUjbm_mMhj-J" 249 | }, 250 | "source": [ 251 | "train_labels = keras.utils.to_categorical(np.array(train_labels),len(label_dict))\r\n", 252 | "val_labels = keras.utils.to_categorical(np.array(val_labels),len(label_dict))" 253 | ], 254 | "execution_count": null, 255 | "outputs": [] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "metadata": { 260 | "id": "2mnE3FuwiDv4" 261 | }, 262 | "source": [ 263 | "inputs=Input(shape=(868,), name='input')\r\n", 264 | "x=Dense(2024, activation='tanh', name='fully_connected_2048_tanh')(inputs)\r\n", 265 | "x=Dense(1024, activation='tanh', name='fully_connected_1024_tanh')(x)\r\n", 266 | "predictions=Dense(len(label_dict), activation='softmax', name='output_softmax')(x)\r\n", 267 | "model=Model(inputs=inputs, outputs=predictions)\r\n", 268 | "model.compile(optimizer='adam', loss='categorical_crossentropy', 
metrics=['accuracy'])\r\n", 269 | "model.summary()\r\n", 270 | "\r\n", 271 | "from keras.utils import plot_model\r\n", 272 | "plot_model(model, to_file='fnn.png')" 273 | ], 274 | "execution_count": null, 275 | "outputs": [] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "metadata": { 280 | "id": "exKm5feebKDj" 281 | }, 282 | "source": [ 283 | "from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score\r\n", 284 | "import pickle\r\n", 285 | "\r\n", 286 | "history = model.fit(train_data, train_labels, epochs=5, verbose=2, validation_data=(val_data, val_labels))\r\n", 287 | "\r\n", 288 | "predict = np.argmax(model.predict(val_data), axis=1)\r\n", 289 | "answer = np.argmax(val_labels, axis=1)\r\n", 290 | "\r\n", 291 | "f1=f1_score(predict, answer, average='macro')*100\r\n", 292 | "prec=precision_score(predict, answer, average='macro')*100\r\n", 293 | "recall=recall_score(predict, answer, average='macro')*100\r\n", 294 | "accuracy=accuracy_score(predict, answer)*100\r\n", 295 | "\r\n", 296 | "print(f1)" 297 | ], 298 | "execution_count": null, 299 | "outputs": [] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "metadata": { 304 | "id": "hlk1zYMLbqJi" 305 | }, 306 | "source": [ 307 | "prediction = model.predict(test_data)\r\n", 308 | "\r\n", 309 | "with open('/content/drive/pred_tm.pickle', 'wb') as f:\r\n", 310 | " pickle.dump(prediction, f)" 311 | ], 312 | "execution_count": null, 313 | "outputs": [] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": { 318 | "id": "QhfgnPxi2-xa" 319 | }, 320 | "source": [ 321 | "#Ensembling" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "metadata": { 327 | "id": "VonM6yFLZ4dM" 328 | }, 329 | "source": [ 330 | "labels = {'LO': 0, 'NI': 1, 'DS': 2, 'CL': 3, 'DC': 4, 'SE': 5, 'CR': 6}\r\n", 331 | "inv_labels = {v: k for k, v in labels.items()}\r\n", 332 | "inv_labels" 333 | ], 334 | "execution_count": null, 335 | "outputs": [] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "metadata": { 340 | "id": "a_sIwr7DZ6ie" 341 | }, 342 | "source": [ 343 | "flat_predictions = [inv_labels[f] for f in flat_predictions]\r\n", 344 | "flat_predictions[:10]" 345 | ], 346 | "execution_count": null, 347 | "outputs": [] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "metadata": { 352 | "id": "2s3veVMZdybQ" 353 | }, 354 | "source": [ 355 | "with open('/content/drive/predictions1.pickle', 'rb') as f:\r\n", 356 | " pred1 = pickle.load(f)\r\n", 357 | "\r\n", 358 | "with open('/content/drive/predictions2.pickle', 'rb') as f:\r\n", 359 | " pred2 = pickle.load(f)\r\n", 360 | "\r\n", 361 | "with open('/content/drive/predictions3.pickle', 'rb') as f:\r\n", 362 | " pred3 = pickle.load(f)" 363 | ], 364 | "execution_count": null, 365 | "outputs": [] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "metadata": { 370 | "id": "cB2FuAo3eBOU" 371 | }, 372 | "source": [ 373 | "final = []\r\n", 374 | "for i in range(len(pred1)):\r\n", 375 | " final.append(pred1[i]+pred2[1]+pred3[i])\r\n", 376 | "print(final[0].shape)\r\n", 377 | "final[0]" 378 | ], 379 | "execution_count": null, 380 | "outputs": [] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "metadata": { 385 | "id": "h7-IrL89eLpZ" 386 | }, 387 | "source": [ 388 | "flat_predictions = [item for sublist in final for item in sublist]\r\n", 389 | "flat_predictions[0]" 390 | ], 391 | "execution_count": null, 392 | "outputs": [] 393 | } 394 | ] 395 | } -------------------------------------------------------------------------------- /utmn/2_LDA.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "2. LDA.ipynb", 7 | "provenance": [], 8 | "toc_visible": true 9 | }, 10 | "kernelspec": { 11 | "display_name": "Python 3", 12 | "name": "python3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "8JNQ1Bm6OycD" 20 | }, 21 | "source": [ 22 | "#LDA" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "metadata": { 28 | "id": "a1vTC32gOxYm" 29 | }, 30 | "source": [ 31 | "import nltk\n", 32 | "nltk.download('stopwords')" 33 | ], 34 | "execution_count": null, 35 | "outputs": [] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "metadata": { 40 | "id": "Hs5EQgjhPACg" 41 | }, 42 | "source": [ 43 | "import re\n", 44 | "import numpy as np\n", 45 | "import pandas as pd\n", 46 | "from pprint import pprint\n", 47 | "\n", 48 | "# Gensim\n", 49 | "import gensim\n", 50 | "import gensim.corpora as corpora\n", 51 | "from gensim.utils import simple_preprocess\n", 52 | "from gensim.models import CoherenceModel\n", 53 | "\n", 54 | "# spacy for lemmatization\n", 55 | "import spacy\n", 56 | "\n", 57 | "# Plotting tools\n", 58 | "!pip install pyLDAvis\n", 59 | "import pyLDAvis\n", 60 | "import pyLDAvis.gensim # don't skip this\n", 61 | "import matplotlib.pyplot as plt\n", 62 | "%matplotlib inline\n", 63 | "\n", 64 | "# Enable logging for gensim - optional\n", 65 | "import logging\n", 66 | "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)\n", 67 | "\n", 68 | "import warnings\n", 69 | "warnings.filterwarnings(\"ignore\",category=DeprecationWarning)" 70 | ], 71 | "execution_count": null, 72 | "outputs": [] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "metadata": { 77 | "id": "n7cJ1_vXPNLW" 78 | }, 79 | "source": [ 80 | "# NLTK Stop words\n", 81 | "from nltk.corpus import stopwords\n", 82 | "stop_words = stopwords.words('english')\n", 83 | "stop_words.extend(['from', 'subject', 're', 'edu', 'use'])" 84 | ], 85 | "execution_count": null, 86 | "outputs": [] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "metadata": { 91 | "id": "daDVo9MZPUG7" 92 | }, 93 | "source": [ 94 | "df = pd.read_csv('/content/drive/train.csv', header=None, names = ['text','label'])\n", 95 | "train_texts = df.text.values.tolist()\n", 96 | "df = pd.read_csv('/content/drive/validation.csv', header=None, names = ['text','label'])\n", 97 | "val_texts = df.text.values.tolist()\n", 98 | "df = pd.read_csv('/content/drive/test.csv', header=None, names = ['text'])\n", 99 | "test_texts = df.text.values.tolist()\n", 100 | "data = train_texts + val_texts + test_texts\n", 101 | "\n", 102 | "data = [re.sub('\\S*@\\S*\\s?', '', sent) for sent in data]\n", 103 | "data = [re.sub('\\s+', ' ', sent) for sent in data]\n", 104 | "data = [re.sub(\"\\'\", \"\", sent) for sent in data]\n", 105 | "\n", 106 | "print(data[:1])" 107 | ], 108 | "execution_count": null, 109 | "outputs": [] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "metadata": { 114 | "id": "WeFkk5J7P4xy" 115 | }, 116 | "source": [ 117 | "def sent_to_words(sentences):\n", 118 | " for sentence in sentences:\n", 119 | " yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations\n", 120 | "\n", 121 | "data_words = list(sent_to_words(data))\n", 122 | "\n", 123 | "print(data_words[:1])" 124 | ], 125 | "execution_count": null, 126 | "outputs": [] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "metadata": 
{ 131 | "id": "iJnnzmeyQDco" 132 | }, 133 | "source": [ 134 | "# Build the bigram and trigram models\n", 135 | "bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.\n", 136 | "trigram = gensim.models.Phrases(bigram[data_words], threshold=100) \n", 137 | "\n", 138 | "# Faster way to get a sentence clubbed as a trigram/bigram\n", 139 | "bigram_mod = gensim.models.phrases.Phraser(bigram)\n", 140 | "trigram_mod = gensim.models.phrases.Phraser(trigram)\n", 141 | "\n", 142 | "# See trigram example\n", 143 | "print(trigram_mod[bigram_mod[data_words[0]]])" 144 | ], 145 | "execution_count": null, 146 | "outputs": [] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "metadata": { 151 | "id": "Q_9W3b9cQLPP" 152 | }, 153 | "source": [ 154 | "#import spacy\n", 155 | "# Define functions for stopwords, bigrams, trigrams and lemmatization\n", 156 | "def remove_stopwords(texts):\n", 157 | " return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]\n", 158 | "\n", 159 | "def make_bigrams(texts):\n", 160 | " return [bigram_mod[doc] for doc in texts]\n", 161 | "\n", 162 | "def make_trigrams(texts):\n", 163 | " return [trigram_mod[bigram_mod[doc]] for doc in texts]\n", 164 | "\n", 165 | "def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):\n", 166 | " \"\"\"https://spacy.io/api/annotation\"\"\"\n", 167 | " texts_out = []\n", 168 | " for sent in texts:\n", 169 | " doc = nlp(\" \".join(sent)) \n", 170 | " texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])\n", 171 | " return texts_out" 172 | ], 173 | "execution_count": null, 174 | "outputs": [] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "metadata": { 179 | "id": "rmAnsscWQYsN" 180 | }, 181 | "source": [ 182 | "import spacy\n", 183 | "# Remove Stop Words\n", 184 | "data_words_nostops = remove_stopwords(data_words)\n", 185 | "\n", 186 | "# Form Bigrams\n", 187 | "data_words_bigrams = make_bigrams(data_words_nostops)\n", 188 | "\n", 189 | "# Initialize spacy 'en' model, keeping only tagger component (for efficiency)\n", 190 | "# python3 -m spacy download en\n", 191 | "nlp = spacy.load('en', disable=['parser', 'ner'])\n", 192 | "\n", 193 | "# Do lemmatization keeping only noun, adj, vb, adv\n", 194 | "data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])\n", 195 | "\n", 196 | "print(data_lemmatized[:1])" 197 | ], 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "metadata": { 204 | "id": "GmHv4x0CQrtg" 205 | }, 206 | "source": [ 207 | "# Create Dictionary\n", 208 | "id2word = corpora.Dictionary(data_lemmatized)\n", 209 | "\n", 210 | "# Create Corpus\n", 211 | "texts = data_lemmatized\n", 212 | "\n", 213 | "# Term Document Frequency\n", 214 | "corpus = [id2word.doc2bow(text) for text in texts]\n", 215 | "\n", 216 | "# View\n", 217 | "print(corpus[:1])" 218 | ], 219 | "execution_count": null, 220 | "outputs": [] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "metadata": { 225 | "id": "UDi3SEr9Qw_C" 226 | }, 227 | "source": [ 228 | "id2word[0]" 229 | ], 230 | "execution_count": null, 231 | "outputs": [] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "metadata": { 236 | "id": "16s6HB7VQ0Lh" 237 | }, 238 | "source": [ 239 | "# Human readable format of corpus (term-frequency)\n", 240 | "[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]" 241 | ], 242 | "execution_count": null, 243 | "outputs": [] 244 | }, 245 
| { 246 | "cell_type": "code", 247 | "metadata": { 248 | "id": "lVZAwsQhfNiz" 249 | }, 250 | "source": [ 251 | "import pickle\r\n", 252 | "\r\n", 253 | "with open('/content/drive/corpus.pickle', 'wb') as f:\r\n", 254 | " pickle.dump(corpus, f)\r\n", 255 | "with open('/content/drive/id2word.pickle', 'wb') as f:\r\n", 256 | " pickle.dump(id2word, f)" 257 | ], 258 | "execution_count": null, 259 | "outputs": [] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "metadata": { 264 | "id": "I2TMNp-9Q8nY" 265 | }, 266 | "source": [ 267 | "# Build LDA model\n", 268 | "lda_model = gensim.models.LdaMulticore(corpus=corpus,\n", 269 | " id2word=id2word,\n", 270 | " num_topics=100, \n", 271 | " random_state=100,\n", 272 | " chunksize=100,\n", 273 | " passes=10,\n", 274 | " workers=3,\n", 275 | " per_word_topics=True)" 276 | ], 277 | "execution_count": null, 278 | "outputs": [] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "metadata": { 283 | "id": "U69V41P1R1WQ" 284 | }, 285 | "source": [ 286 | "# Print the Keyword in the 10 topics\n", 287 | "print(lda_model.print_topics())\n", 288 | "doc_lda = lda_model[corpus]" 289 | ], 290 | "execution_count": null, 291 | "outputs": [] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "metadata": { 296 | "id": "mTg1rlk9R5aG" 297 | }, 298 | "source": [ 299 | "# Compute Perplexity\n", 300 | "print('\\nPerplexity: ', lda_model.log_perplexity(corpus)) # a measure of how good the model is. lower the better.\n", 301 | "\n", 302 | "# Compute Coherence Score\n", 303 | "coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')\n", 304 | "coherence_lda = coherence_model_lda.get_coherence()\n", 305 | "print('\\nCoherence Score: ', coherence_lda)" 306 | ], 307 | "execution_count": null, 308 | "outputs": [] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "metadata": { 313 | "id": "aQOX3MZFSBB1" 314 | }, 315 | "source": [ 316 | "# Visualize the topics\n", 317 | "pyLDAvis.enable_notebook()\n", 318 | "vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)\n", 319 | "vis" 320 | ], 321 | "execution_count": null, 322 | "outputs": [] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "metadata": { 327 | "id": "Jg76SsEshJAB" 328 | }, 329 | "source": [ 330 | "with open('/content/drive/lda_model.pickle', 'wb') as f:\r\n", 331 | " pickle.dump(lda_model, f)" 332 | ], 333 | "execution_count": null, 334 | "outputs": [] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": { 339 | "id": "G25c2pseSMC3" 340 | }, 341 | "source": [ 342 | "#choose the best model" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "metadata": { 348 | "id": "f7V-IChJSE8z" 349 | }, 350 | "source": [ 351 | "def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):\n", 352 | " \"\"\"\n", 353 | " Compute c_v coherence for various number of topics\n", 354 | "\n", 355 | " Parameters:\n", 356 | " ----------\n", 357 | " dictionary : Gensim dictionary\n", 358 | " corpus : Gensim corpus\n", 359 | " texts : List of input texts\n", 360 | " limit : Max num of topics\n", 361 | "\n", 362 | " Returns:\n", 363 | " -------\n", 364 | " model_list : List of LDA topic models\n", 365 | " coherence_values : Coherence values corresponding to the LDA model with respective number of topics\n", 366 | " \"\"\"\n", 367 | " coherence_values = []\n", 368 | " model_list = []\n", 369 | " for num_topics in range(start, limit, step):\n", 370 | " print(num_topics)\n", 371 | " model = gensim.models.LdaMulticore(corpus=corpus,\n", 372 | " 
id2word=id2word,\n", 373 | " num_topics=num_topics, \n", 374 | " random_state=100,\n", 375 | " chunksize=100,\n", 376 | " passes=10,\n", 377 | " workers=None,\n", 378 | " per_word_topics=True)\n", 379 | " model_list.append(model)\n", 380 | " coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')\n", 381 | " coherence_values.append(coherencemodel.get_coherence())\n", 382 | " print(coherence_values[len(coherence_values)-1])\n", 383 | "\n", 384 | " return model_list, coherence_values" 385 | ], 386 | "execution_count": null, 387 | "outputs": [] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "metadata": { 392 | "id": "v14G8uOVSfIm" 393 | }, 394 | "source": [ 395 | "# Can take a long time to run.\n", 396 | "model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=10, limit=200, step=10)" 397 | ], 398 | "execution_count": null, 399 | "outputs": [] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "metadata": { 404 | "id": "x41IUc8YSqXm" 405 | }, 406 | "source": [ 407 | "# Show graph\n", 408 | "limit=205; start=10; step=10;\n", 409 | "x = range(start, limit, step)\n", 410 | "plt.plot(x, coherence_values)\n", 411 | "plt.xlabel(\"Number of topics\")\n", 412 | "plt.ylabel(\"Coherence value\")\n", 413 | "#plt.legend((\"coherence_values\"), loc='best')\n", 414 | "plt.show()" 415 | ], 416 | "execution_count": null, 417 | "outputs": [] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": { 422 | "id": "CnNNVgzPUdjC" 423 | }, 424 | "source": [ 425 | "#get topic distributions" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "metadata": { 431 | "id": "LdZFUJzVUgbx" 432 | }, 433 | "source": [ 434 | "def get_dist(dist):\n", 435 | " new_dist = []\n", 436 | " for d in dist:\n", 437 | " new_dist.append(d[1])\n", 438 | " return new_dist\n", 439 | "\n", 440 | "corpus_train, corpus_val = corpus[:16800],corpus[16800:]\n", 441 | "distributions_train = []\n", 442 | "for doc in corpus_train:\n", 443 | " distributions_train.append(get_dist(lda_model.get_document_topics(doc, minimum_probability=0.0)))\n", 444 | "\n", 445 | "with open('/content/drive/td_100_train.pickle', 'wb') as f:\n", 446 | " pickle.dump(distributions_train, f)" 447 | ], 448 | "execution_count": null, 449 | "outputs": [] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "metadata": { 454 | "id": "PwFbHp3Hk_qt" 455 | }, 456 | "source": [ 457 | "distributions_val = []\r\n", 458 | "for doc in corpus_val:\r\n", 459 | " distributions_val.append(get_dist(lda_model.get_document_topics(doc, minimum_probability=0.0)))\r\n", 460 | "\r\n", 461 | "with open('/content/drive/td_100_val.pickle', 'wb') as f:\r\n", 462 | " pickle.dump(distributions_val, f)" 463 | ], 464 | "execution_count": null, 465 | "outputs": [] 466 | } 467 | ] 468 | } -------------------------------------------------------------------------------- /IIITT/run3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Ensemble_classifier.ipynb", 7 | "provenance": [], 8 | "toc_visible": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "code", 18 | "metadata": { 19 | "id": "lWckrP6pStpg" 20 | }, 21 | "source": [ 22 | "import pandas as pd\r\n", 23 | "import numpy as np\r\n", 24 | "import matplotlib.pyplot as plt\r\n" 25 | ], 26 | "execution_count": 1, 27 
| "outputs": [] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "metadata": { 32 | "id": "sLnqm1V6KnJB", 33 | "outputId": "11b139ae-7be7-40df-8bb7-ec947c095a6c", 34 | "colab": { 35 | "base_uri": "https://localhost:8080/" 36 | } 37 | }, 38 | "source": [ 39 | "from google.colab import drive\n", 40 | "drive.mount('/content/drive')" 41 | ], 42 | "execution_count": 3, 43 | "outputs": [ 44 | { 45 | "output_type": "stream", 46 | "text": [ 47 | "Mounted at /content/drive\n" 48 | ], 49 | "name": "stdout" 50 | } 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "metadata": { 56 | "colab": { 57 | "base_uri": "https://localhost:8080/" 58 | }, 59 | "id": "4FUUUT_FX1ze", 60 | "outputId": "580b07bd-70fc-4602-86e2-1ea22174a0f7" 61 | }, 62 | "source": [ 63 | "cd /content/drive/MyDrive/sdpra2021/pred_probs/" 64 | ], 65 | "execution_count": 6, 66 | "outputs": [ 67 | { 68 | "output_type": "stream", 69 | "text": [ 70 | "/content/drive/MyDrive/spdra2021/pred_probs\n" 71 | ], 72 | "name": "stdout" 73 | } 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "metadata": { 79 | "colab": { 80 | "base_uri": "https://localhost:8080/", 81 | "height": 204 82 | }, 83 | "id": "ounVF-yhYQsL", 84 | "outputId": "5d3367cd-4f56-4c2a-d3a7-713f5211cb52" 85 | }, 86 | "source": [ 87 | "bert = pd.read_csv('bert.csv')\r\n", 88 | "bert = bert.drop(columns='Unnamed: 0')\r\n", 89 | "bert.head() " 90 | ], 91 | "execution_count": 7, 92 | "outputs": [ 93 | { 94 | "output_type": "execute_result", 95 | "data": { 96 | "text/html": [ 97 | "
\n", 98 | "\n", 111 | "\n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
CLCRDCDSLONISEresultabstract
00.0001500.0016500.9860690.0035390.0026750.0029750.0029422This paper analyses the possibilities of per...
10.0008090.0114020.8954570.0100340.0040860.0732800.0049312A finite element method is presented to comp...
20.9981920.0001820.0000380.0000980.0003800.0001120.0009990This paper includes a reflection on the role...
30.0001240.0015550.0025900.0006850.0006270.9934620.0009585In this document, we describe the fractal st...
40.0001660.0006560.0017650.9956290.0008730.0004290.0004823We show how to test whether a graph with n v...
\n", 189 | "
" 190 | ], 191 | "text/plain": [ 192 | " CL ... abstract\n", 193 | "0 0.000150 ... This paper analyses the possibilities of per...\n", 194 | "1 0.000809 ... A finite element method is presented to comp...\n", 195 | "2 0.998192 ... This paper includes a reflection on the role...\n", 196 | "3 0.000124 ... In this document, we describe the fractal st...\n", 197 | "4 0.000166 ... We show how to test whether a graph with n v...\n", 198 | "\n", 199 | "[5 rows x 9 columns]" 200 | ] 201 | }, 202 | "metadata": { 203 | "tags": [] 204 | }, 205 | "execution_count": 7 206 | } 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "metadata": { 212 | "colab": { 213 | "base_uri": "https://localhost:8080/", 214 | "height": 204 215 | }, 216 | "id": "46d-Z4cYYc5r", 217 | "outputId": "77d2ba85-cba6-4851-c6a1-f841e4287cfb" 218 | }, 219 | "source": [ 220 | "roberta = pd.read_csv('roberta.csv')\r\n", 221 | "roberta = roberta.drop(columns=['Unnamed: 0'])\r\n", 222 | "roberta.head() " 223 | ], 224 | "execution_count": 8, 225 | "outputs": [ 226 | { 227 | "output_type": "execute_result", 228 | "data": { 229 | "text/html": [ 230 | "
\n", 231 | "\n", 244 | "\n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | "
CLCRDCDSLONISEresultabstract
00.0023130.0023870.9809380.0039600.0012520.0050760.0040742This paper analyses the possibilities of per...
10.0021940.0032060.9786230.0054030.0008720.0077450.0019572A finite element method is presented to comp...
20.9979380.0000900.0002470.0004610.0006200.0003090.0003350This paper includes a reflection on the role...
30.0062360.2982820.3845940.0640930.0152410.2030370.0285182In this document, we describe the fractal st...
40.0007520.0009670.0012570.9946320.0015380.0005340.0003213We show how to test whether a graph with n v...
\n", 322 | "
" 323 | ], 324 | "text/plain": [ 325 | " CL ... abstract\n", 326 | "0 0.002313 ... This paper analyses the possibilities of per...\n", 327 | "1 0.002194 ... A finite element method is presented to comp...\n", 328 | "2 0.997938 ... This paper includes a reflection on the role...\n", 329 | "3 0.006236 ... In this document, we describe the fractal st...\n", 330 | "4 0.000752 ... We show how to test whether a graph with n v...\n", 331 | "\n", 332 | "[5 rows x 9 columns]" 333 | ] 334 | }, 335 | "metadata": { 336 | "tags": [] 337 | }, 338 | "execution_count": 8 339 | } 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "metadata": { 345 | "colab": { 346 | "base_uri": "https://localhost:8080/", 347 | "height": 204 348 | }, 349 | "id": "7N1k70ZBZdKp", 350 | "outputId": "4df5ba44-1dac-4fe1-825f-7f3b518cc2ac" 351 | }, 352 | "source": [ 353 | "scibert = pd.read_csv('scibert.csv')\r\n", 354 | "scibert = scibert.drop(columns='Unnamed: 0')\r\n", 355 | "scibert.head() " 356 | ], 357 | "execution_count": 9, 358 | "outputs": [ 359 | { 360 | "output_type": "execute_result", 361 | "data": { 362 | "text/html": [ 363 | "
\n", 364 | "\n", 377 | "\n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | "
CLCRDCDSLONISEresultabstract
00.0001590.0007610.9947990.0007680.0002880.0018390.0013862This paper analyses the possibilities of per...
10.0002860.0015980.8480900.0025780.0007140.1441840.0025502A finite element method is presented to comp...
20.9991090.0002520.0001330.0001480.0001780.0000620.0001170This paper includes a reflection on the role...
30.0001460.0003130.0021940.0001690.0001530.9964660.0005595In this document, we describe the fractal st...
40.0002250.0002350.0004930.9983020.0004250.0001910.0001293We show how to test whether a graph with n v...
\n", 455 | "
" 456 | ], 457 | "text/plain": [ 458 | " CL ... abstract\n", 459 | "0 0.000159 ... This paper analyses the possibilities of per...\n", 460 | "1 0.000286 ... A finite element method is presented to comp...\n", 461 | "2 0.999109 ... This paper includes a reflection on the role...\n", 462 | "3 0.000146 ... In this document, we describe the fractal st...\n", 463 | "4 0.000225 ... We show how to test whether a graph with n v...\n", 464 | "\n", 465 | "[5 rows x 9 columns]" 466 | ] 467 | }, 468 | "metadata": { 469 | "tags": [] 470 | }, 471 | "execution_count": 9 472 | } 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "metadata": { 478 | "id": "ULhUk0xsh3_g" 479 | }, 480 | "source": [ 481 | "test = pd.read_csv('/content/drive/MyDrive/spdra2021/Datasets/test.csv',delimiter=',',\r\n", 482 | " header=None,names=['text'])\r\n", 483 | " " 484 | ], 485 | "execution_count": 10, 486 | "outputs": [] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "metadata": { 491 | "id": "3ZWYYi6hrC_-" 492 | }, 493 | "source": [ 494 | "labels = ['CL','CR','DC','DS','LO','NI','SE']" 495 | ], 496 | "execution_count": 11, 497 | "outputs": [] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "metadata": { 502 | "colab": { 503 | "base_uri": "https://localhost:8080/" 504 | }, 505 | "id": "Q1rhux11aG7f", 506 | "outputId": "0a471a06-5282-4170-e3c7-786f8b098ac5" 507 | }, 508 | "source": [ 509 | "for label in labels:\r\n", 510 | " print(label)\r\n", 511 | " print(np.corrcoef([bert[label].rank(pct=True), roberta[label].rank(pct=True), scibert[label].rank(pct=True)]))\r\n", 512 | "submission = pd.DataFrame()\r\n", 513 | "#submission['id'] = a['abstract']\r\n", 514 | "for label in labels:\r\n", 515 | " submission[label] = (bert[label].rank(pct=True) * 0.3 + roberta[label].rank(pct=True) * 0.3 + scibert[label].rank(pct=True)*0.4)\r\n", 516 | "submission['result'] = submission.idxmax(axis = 1) \r\n", 517 | "submission['result'] = submission['result'].apply({'CL':0,'CR':1,'DC':2,\r\n", 518 | "'DS':3,'LO':4, 'NI':5, 'SE':6}.get) \r\n", 519 | "submission['id'] = test['text']\r\n", 520 | "submission.to_csv('submission.csv', index=False)" 521 | ], 522 | "execution_count": 18, 523 | "outputs": [ 524 | { 525 | "output_type": "stream", 526 | "text": [ 527 | "CL\n", 528 | "[[1. 0.60876764 0.79489693]\n", 529 | " [0.60876764 1. 0.43417398]\n", 530 | " [0.79489693 0.43417398 1. ]]\n", 531 | "CR\n", 532 | "[[1. 0.81781081 0.77273806]\n", 533 | " [0.81781081 1. 0.69869303]\n", 534 | " [0.77273806 0.69869303 1. ]]\n", 535 | "DC\n", 536 | "[[1. 0.84889632 0.85096035]\n", 537 | " [0.84889632 1. 0.88747852]\n", 538 | " [0.85096035 0.88747852 1. ]]\n", 539 | "DS\n", 540 | "[[1. 0.92145531 0.84394307]\n", 541 | " [0.92145531 1. 0.82648213]\n", 542 | " [0.84394307 0.82648213 1. ]]\n", 543 | "LO\n", 544 | "[[1. 0.82319259 0.72438774]\n", 545 | " [0.82319259 1. 0.80665013]\n", 546 | " [0.72438774 0.80665013 1. ]]\n", 547 | "NI\n", 548 | "[[1. 0.92307865 0.91320051]\n", 549 | " [0.92307865 1. 0.90765773]\n", 550 | " [0.91320051 0.90765773 1. ]]\n", 551 | "SE\n", 552 | "[[1. 0.71567135 0.72420244]\n", 553 | " [0.71567135 1. 0.89973318]\n", 554 | " [0.72420244 0.89973318 1. 
]]\n" 555 | ], 556 | "name": "stdout" 557 | } 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "metadata": { 563 | "colab": { 564 | "base_uri": "https://localhost:8080/", 565 | "height": 419 566 | }, 567 | "id": "EMqzUplRHjxa", 568 | "outputId": "7e83ec84-1508-4402-d0ac-1188bcf64a8d" 569 | }, 570 | "source": [ 571 | "submission" 572 | ], 573 | "execution_count": 19, 574 | "outputs": [ 575 | { 576 | "output_type": "execute_result", 577 | "data": { 578 | "text/html": [ 579 | "
\n", 580 | "\n", 593 | "\n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | "
CLCRDCDSLONISEresultid
00.2953860.6498000.9720140.6869570.5836860.6883860.7625862This paper analyses the possibilities of per...
10.5150570.7697570.9129290.7873140.6589290.7983140.7519002A finite element method is presented to comp...
20.9190860.1090290.0202290.0747430.2545570.0813290.2085290This paper includes a reflection on the role...
30.3108860.5975860.7353290.4819000.4966430.8686140.5347005In this document, we describe the fractal st...
40.2248290.2467860.2912430.9731570.5695000.2551570.0996143We show how to test whether a graph with n v...
..............................
69950.5712860.3981000.3971290.6380140.9604210.2563140.5070574It is common practice to compare the computa...
69960.6717860.3153710.2967430.5918290.9693360.2715430.5107714Defeasible reasoning is a simple but efficie...
69970.6098710.5570430.4101140.6892140.9290860.3519860.6411144The almost periodic functions form a natural...
69980.5715430.3090000.3208290.6146570.9837430.2261000.5450144A notion of alternating timed automata is pr...
69990.6417000.2917290.2683570.5964000.9701710.2665570.5416434We present a hierarchical framework for anal...
\n", 743 | "

7000 rows × 9 columns

\n", 744 | "
" 745 | ], 746 | "text/plain": [ 747 | " CL ... id\n", 748 | "0 0.295386 ... This paper analyses the possibilities of per...\n", 749 | "1 0.515057 ... A finite element method is presented to comp...\n", 750 | "2 0.919086 ... This paper includes a reflection on the role...\n", 751 | "3 0.310886 ... In this document, we describe the fractal st...\n", 752 | "4 0.224829 ... We show how to test whether a graph with n v...\n", 753 | "... ... ... ...\n", 754 | "6995 0.571286 ... It is common practice to compare the computa...\n", 755 | "6996 0.671786 ... Defeasible reasoning is a simple but efficie...\n", 756 | "6997 0.609871 ... The almost periodic functions form a natural...\n", 757 | "6998 0.571543 ... A notion of alternating timed automata is pr...\n", 758 | "6999 0.641700 ... We present a hierarchical framework for anal...\n", 759 | "\n", 760 | "[7000 rows x 9 columns]" 761 | ] 762 | }, 763 | "metadata": { 764 | "tags": [] 765 | }, 766 | "execution_count": 19 767 | } 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "metadata": { 773 | "id": "7Nl0V7Bm47yj" 774 | }, 775 | "source": [ 776 | "submission['result'] = submission['result'].apply({0:'CL', 1:'CR', 2:'DC',\r\n", 777 | "3:'DS', 4:'LO', 5:'NI', 6:'SE' }.get)\r\n" 778 | ], 779 | "execution_count": 20, 780 | "outputs": [] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "metadata": { 785 | "colab": { 786 | "base_uri": "https://localhost:8080/", 787 | "height": 34 788 | }, 789 | "id": "HlxxQwny5kfl", 790 | "outputId": "51674744-3f1b-4f70-f3ea-5aade7e3a646" 791 | }, 792 | "source": [ 793 | "result = submission['result'].to_numpy()\r\n", 794 | "print(len(result))\r\n", 795 | "np.savetxt(\"run3.txt\", result, fmt = \"%s\")\r\n", 796 | "from google.colab import files\r\n", 797 | "files.download('run3.txt')" 798 | ], 799 | "execution_count": 21, 800 | "outputs": [ 801 | { 802 | "output_type": "stream", 803 | "text": [ 804 | "7000\n" 805 | ], 806 | "name": "stdout" 807 | }, 808 | { 809 | "output_type": "display_data", 810 | "data": { 811 | "application/javascript": [ 812 | "\n", 813 | " async function download(id, filename, size) {\n", 814 | " if (!google.colab.kernel.accessAllowed) {\n", 815 | " return;\n", 816 | " }\n", 817 | " const div = document.createElement('div');\n", 818 | " const label = document.createElement('label');\n", 819 | " label.textContent = `Downloading \"${filename}\": `;\n", 820 | " div.appendChild(label);\n", 821 | " const progress = document.createElement('progress');\n", 822 | " progress.max = size;\n", 823 | " div.appendChild(progress);\n", 824 | " document.body.appendChild(div);\n", 825 | "\n", 826 | " const buffers = [];\n", 827 | " let downloaded = 0;\n", 828 | "\n", 829 | " const channel = await google.colab.kernel.comms.open(id);\n", 830 | " // Send a message to notify the kernel that we're ready.\n", 831 | " channel.send({})\n", 832 | "\n", 833 | " for await (const message of channel.messages) {\n", 834 | " // Send a message to notify the kernel that we're ready.\n", 835 | " channel.send({})\n", 836 | " if (message.buffers) {\n", 837 | " for (const buffer of message.buffers) {\n", 838 | " buffers.push(buffer);\n", 839 | " downloaded += buffer.byteLength;\n", 840 | " progress.value = downloaded;\n", 841 | " }\n", 842 | " }\n", 843 | " }\n", 844 | " const blob = new Blob(buffers, {type: 'application/binary'});\n", 845 | " const a = document.createElement('a');\n", 846 | " a.href = window.URL.createObjectURL(blob);\n", 847 | " a.download = filename;\n", 848 | " div.appendChild(a);\n", 849 | " a.click();\n", 
850 | " div.remove();\n", 851 | " }\n", 852 | " " 853 | ], 854 | "text/plain": [ 855 | "" 856 | ] 857 | }, 858 | "metadata": { 859 | "tags": [] 860 | } 861 | }, 862 | { 863 | "output_type": "display_data", 864 | "data": { 865 | "application/javascript": [ 866 | "download(\"download_d23b42c0-641a-49da-8061-44c07f762bf1\", \"run3.txt\", 21000)" 867 | ], 868 | "text/plain": [ 869 | "" 870 | ] 871 | }, 872 | "metadata": { 873 | "tags": [] 874 | } 875 | } 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": { 881 | "id": "0pBDXhATIqGJ" 882 | }, 883 | "source": [ 884 | "#Predictions " 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "metadata": { 890 | "id": "0T6MeF7NkxY4" 891 | }, 892 | "source": [ 893 | "\r\n", 894 | "\r\n", 895 | "\"\"\"\r\n", 896 | "The submission file IIITT.zip has the systems as follows:\r\n", 897 | "\r\n", 898 | "run 1 : Pre-trained Transformer Model (allenai/scibert_scivocab_uncased)\r\n", 899 | "run 2 : Average of probabities of predictions of ( BERT_base_uncased + RoBERTa_base + SciBERT)\r\n", 900 | "run 3 : Ensemble of probabilities of predictions by ranking the percentile of the result stored as a pandas DataFrame\r\n", 901 | "\"\"\"" 902 | ], 903 | "execution_count": null, 904 | "outputs": [] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "metadata": { 909 | "id": "bQH0FM_ZMBVP" 910 | }, 911 | "source": [ 912 | "" 913 | ], 914 | "execution_count": null, 915 | "outputs": [] 916 | } 917 | ] 918 | } -------------------------------------------------------------------------------- /utmn/1_fine_tuning_and_getting_bert_embs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "accelerator": "GPU", 6 | "colab": { 7 | "name": "1. fine-tuning and getting bert embs", 8 | "provenance": [], 9 | "collapsed_sections": [], 10 | "toc_visible": true 11 | }, 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "name": "python3" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "code", 20 | "metadata": { 21 | "id": "hTjQwN2Ebfd9" 22 | }, 23 | "source": [ 24 | "import pandas as pd\n", 25 | "\n", 26 | "#import google disk (for data loading)\n", 27 | "from google.colab import drive\n", 28 | "drive.mount('/content/drive')" 29 | ], 30 | "execution_count": null, 31 | "outputs": [] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "id": "RmP4Hy9UZZKA" 37 | }, 38 | "source": [ 39 | "#Import libraries" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "metadata": { 45 | "id": "DEfSbAA4QHas" 46 | }, 47 | "source": [ 48 | "import tensorflow as tf\n", 49 | "\n", 50 | "device_name = tf.test.gpu_device_name()\n", 51 | "if device_name != '/device:GPU:0':\n", 52 | " raise SystemError('GPU device not found')\n", 53 | "print('Found GPU at: {}'.format(device_name))" 54 | ], 55 | "execution_count": null, 56 | "outputs": [] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "metadata": { 61 | "id": "0NmMdkZO8R6q" 62 | }, 63 | "source": [ 64 | "import torch\n", 65 | "\n", 66 | "# If there's a GPU available...\n", 67 | "if torch.cuda.is_available(): \n", 68 | "\n", 69 | " # Tell PyTorch to use the GPU. \n", 70 | " device = torch.device(\"cuda\")\n", 71 | "\n", 72 | " print('There are %d GPU(s) available.' 
% torch.cuda.device_count())\n", 73 | "\n", 74 | " print('We will use the GPU:', torch.cuda.get_device_name(0))\n", 75 | "\n", 76 | "# If not...\n", 77 | "else:\n", 78 | " print('No GPU available, using the CPU instead.')\n", 79 | " device = torch.device(\"cpu\")" 80 | ], 81 | "execution_count": null, 82 | "outputs": [] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "metadata": { 87 | "id": "XfMNQrCXrhKP" 88 | }, 89 | "source": [ 90 | "!pip install transformers\n", 91 | "!pip install pytorch-pretrained-bert pytorch-nlp\n", 92 | "from transformers import BertModel#, RobertaModel\n", 93 | "import numpy as np\n", 94 | "import tensorflow as tf\n", 95 | "\n", 96 | "from transformers import *\n", 97 | "import pandas as pd\n", 98 | "#import torch\n", 99 | "from keras.preprocessing.sequence import pad_sequences\n", 100 | "from sklearn.model_selection import train_test_split\n", 101 | "import numpy as np\n", 102 | "import os\n", 103 | "import pickle" 104 | ], 105 | "execution_count": null, 106 | "outputs": [] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "id": "E2gF3nZWZwNy" 112 | }, 113 | "source": [ 114 | "#Import data" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "metadata": { 120 | "id": "699-L4GYHV2I" 121 | }, 122 | "source": [ 123 | "df = pd.read_csv('/content/drive/MyDrive/sdpra/train.csv', header=None, names = ['text','label'])\n", 124 | "df.head()" 125 | ], 126 | "execution_count": null, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "metadata": { 132 | "id": "Q0IFhZ5IHjxg" 133 | }, 134 | "source": [ 135 | "train_texts = df.text.values\n", 136 | "\n", 137 | "possible_labels = df.label.unique()\n", 138 | "label_dict = {}\n", 139 | "for index, possible_label in enumerate(possible_labels):\n", 140 | " label_dict[possible_label] = index\n", 141 | "\n", 142 | "print(label_dict)\n", 143 | "\n", 144 | "df['label'] = df.label.replace(label_dict)\n", 145 | "train_labels = df.label.values" 146 | ], 147 | "execution_count": null, 148 | "outputs": [] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "metadata": { 153 | "id": "c8G1dCa8HniE" 154 | }, 155 | "source": [ 156 | "df = pd.read_csv('/content/drive/MyDrive/sdpra/validation.csv', header=None, names = ['text','label'])\n", 157 | "df.head()" 158 | ], 159 | "execution_count": null, 160 | "outputs": [] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "metadata": { 165 | "id": "L5XsBI0gNWwD" 166 | }, 167 | "source": [ 168 | "val_texts = df.text.values\r\n", 169 | "df['label'] = df.label.replace(label_dict)\r\n", 170 | "val_labels = df.label.values" 171 | ], 172 | "execution_count": null, 173 | "outputs": [] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "metadata": { 178 | "id": "j3ha28ClHtMM" 179 | }, 180 | "source": [ 181 | "train_texts = list(train_texts) + list(df.text.values)\n", 182 | "df['label'] = df.label.replace(label_dict)\n", 183 | "train_labels = list(train_labels) + list(df.label.values)\n", 184 | "\n", 185 | "\n", 186 | "df = pd.DataFrame()\n", 187 | "df['text'] = pd.Series(train_texts)\n", 188 | "df['label'] = pd.Series(train_labels)\n", 189 | "df = df.sample(frac=1)\n", 190 | "df.head()" 191 | ], 192 | "execution_count": null, 193 | "outputs": [] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "metadata": { 198 | "id": "8SnmkiWZxqwu" 199 | }, 200 | "source": [ 201 | "train_labels = df.label.values\r\n", 202 | "train_texts = df.text.values\r\n", 203 | "\r\n", 204 | "df = pd.read_csv('/content/drive/MyDrive/sdpra/test.csv', header=None, names = ['text'])\r\n", 205 | 
"val_texts = df.text.values" 206 | ], 207 | "execution_count": null, 208 | "outputs": [] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "metadata": { 213 | "id": "G-dWiGvSy3Ey" 214 | }, 215 | "source": [ 216 | "print(len(train_labels))\r\n", 217 | "print(len(train_texts))\r\n", 218 | "\r\n", 219 | "print(len(val_texts))" 220 | ], 221 | "execution_count": null, 222 | "outputs": [] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": { 227 | "id": "7WtgjJLo5NsX" 228 | }, 229 | "source": [ 230 | "#Preprocessing" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "metadata": { 236 | "id": "dLnCNHaEoGIR" 237 | }, 238 | "source": [ 239 | "train_texts = [text.lower().replace('\\r\\n',' ').replace('\\n',' ') for text in train_texts]\n", 240 | "val_texts = [text.lower().replace('\\r\\n',' ').replace('\\n',' ') for text in val_texts]" 241 | ], 242 | "execution_count": null, 243 | "outputs": [] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "metadata": { 248 | "id": "9rHaME_JOIDD" 249 | }, 250 | "source": [ 251 | "train_texts[0]" 252 | ], 253 | "execution_count": null, 254 | "outputs": [] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": { 259 | "id": "1GjZ7Z6naPTY" 260 | }, 261 | "source": [ 262 | "#Tokenization" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "metadata": { 268 | "id": "SrA8iiXCrylF" 269 | }, 270 | "source": [ 271 | "model_name = 'allenai/scibert_scivocab_uncased'\n", 272 | "#model_name = 'bert-base-uncased'\n", 273 | "#model_name = \"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract\"\n", 274 | "#model_name = 'roberta-large'\n", 275 | "tokenizer = BertTokenizer.from_pretrained(model_name)\n", 276 | "#tokenizer = RobertaTokenizer.from_pretrained(model_name) #for roberta large" 277 | ], 278 | "execution_count": null, 279 | "outputs": [] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "metadata": { 284 | "id": "MxB6Z3Bzr60m" 285 | }, 286 | "source": [ 287 | "# Tokenize all of the sentences and map the tokens to thier word IDs.\n", 288 | "input_ids = []\n", 289 | "attention_masks = []\n", 290 | "\n", 291 | "# For every sentence...\n", 292 | "for i, sent in enumerate(train_texts):\n", 293 | " # `encode_plus` will:\n", 294 | " # (1) Tokenize the sentence.\n", 295 | " # (2) Prepend the `[CLS]` token to the start.\n", 296 | " # (3) Append the `[SEP]` token to the end.\n", 297 | " # (4) Map tokens to their IDs.\n", 298 | " # (5) Pad or truncate the sentence to `max_length`\n", 299 | " # (6) Create attention masks for [PAD] tokens.\n", 300 | " encoded_dict = tokenizer.encode_plus(\n", 301 | " sent, # Sentence to encode.\n", 302 | " truncation = True,\n", 303 | " add_special_tokens = True, # Add '[CLS]' and '[SEP]'\n", 304 | " max_length = 256, # Pad & truncate all sentences.\n", 305 | " pad_to_max_length = True,\n", 306 | " return_attention_mask = True, # Construct attn. masks.\n", 307 | " return_tensors = 'pt', # Return pytorch tensors.\n", 308 | " )\n", 309 | " # Add the encoded sentence to the list. 
\n", 310 | " input_ids.append(encoded_dict['input_ids'])\n", 311 | " \n", 312 | " # And its attention mask (simply differentiates padding from non-padding).\n", 313 | " attention_masks.append(encoded_dict['attention_mask'])\n", 314 | "\n", 315 | "# Convert the lists into tensors.\n", 316 | "input_ids = torch.cat(input_ids, dim=0)\n", 317 | "attention_masks = torch.cat(attention_masks, dim=0)\n", 318 | "labels = torch.tensor(train_labels)\n", 319 | "\n", 320 | "# Print sentence 0, now as a list of IDs.\n", 321 | "print('Original: ', train_texts[0])\n", 322 | "print('Token IDs:', input_ids[0])" 323 | ], 324 | "execution_count": null, 325 | "outputs": [] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": { 330 | "id": "f9UK8OpDaUZo" 331 | }, 332 | "source": [ 333 | "#Split into train and validation samples" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "metadata": { 339 | "id": "NfCcyf_XsASo" 340 | }, 341 | "source": [ 342 | "from torch.utils.data import TensorDataset, random_split\n", 343 | "\n", 344 | "# Combine the training inputs into a TensorDataset.\n", 345 | "dataset = TensorDataset(input_ids, attention_masks, labels)\n", 346 | "\n", 347 | "# Create a 90-10 train-validation split.\n", 348 | "\n", 349 | "# Calculate the number of samples to include in each set.\n", 350 | "train_size = int(0.9 * len(dataset))\n", 351 | "val_size = len(dataset) - train_size\n", 352 | "\n", 353 | "# Divide the dataset by randomly selecting samples.\n", 354 | "train_dataset, val_dataset = random_split(dataset, [train_size, val_size])\n", 355 | "\n", 356 | "print('{:>5,} training samples'.format(train_size))\n", 357 | "print('{:>5,} validation samples'.format(val_size))" 358 | ], 359 | "execution_count": null, 360 | "outputs": [] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": { 365 | "id": "2nGKq9cnagyT" 366 | }, 367 | "source": [ 368 | "#Model" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "metadata": { 374 | "id": "kUji4egasCrM" 375 | }, 376 | "source": [ 377 | "from torch.utils.data import DataLoader, RandomSampler, SequentialSampler\n", 378 | "\n", 379 | "# The DataLoader needs to know our batch size for training, so we specify it \n", 380 | "# here. For fine-tuning BERT on a specific task, the authors recommend a batch \n", 381 | "# size of 16 or 32.\n", 382 | "batch_size = 8\n", 383 | "\n", 384 | "# Create the DataLoaders for our training and validation sets.\n", 385 | "# We'll take training samples in random order. \n", 386 | "train_dataloader = DataLoader(\n", 387 | " train_dataset, # The training samples.\n", 388 | " sampler = RandomSampler(train_dataset), # Select batches randomly\n", 389 | " batch_size = batch_size # Trains with this batch size.\n", 390 | " )\n", 391 | "\n", 392 | "# For validation the order doesn't matter, so we'll just read them sequentially.\n", 393 | "validation_dataloader = DataLoader(\n", 394 | " val_dataset, # The validation samples.\n", 395 | " sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.\n", 396 | " batch_size = batch_size # Evaluate with this batch size.\n", 397 | " )" 398 | ], 399 | "execution_count": null, 400 | "outputs": [] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "metadata": { 405 | "id": "gez19WipsGrk" 406 | }, 407 | "source": [ 408 | "from transformers import RobertaForSequenceClassification, AdamW, RobertaConfig\n", 409 | "\n", 410 | "# Load BertForSequenceClassification, the pretrained BERT model with a single \n", 411 | "# linear classification layer on top. 
\n", 412 | "#'''\n", 413 | "model = BertForSequenceClassification.from_pretrained(\n", 414 | " model_name, # Use the 12-layer BERT model, with an uncased vocab.\n", 415 | " num_labels = len(label_dict), # The number of output labels--2 for binary classification.\n", 416 | " # You can increase this for multi-class tasks. \n", 417 | " output_attentions = False, # Whether the model returns attentions weights.\n", 418 | " output_hidden_states = False, # Whether the model returns all hidden-states.\n", 419 | ")\n", 420 | "'''\n", 421 | "#for roberta large\n", 422 | "model = RobertaForSequenceClassification.from_pretrained(\n", 423 | " \"roberta-large\", # Use the 12-layer BERT model, with an uncased vocab.\n", 424 | " num_labels = len(label_dict), # The number of output labels--2 for binary classification.\n", 425 | " # You can increase this for multi-class tasks. \n", 426 | " output_attentions = False, # Whether the model returns attentions weights.\n", 427 | " output_hidden_states = False, # Whether the model returns all hidden-states.\n", 428 | ")\n", 429 | "'''\n", 430 | "model.cuda()" 431 | ], 432 | "execution_count": null, 433 | "outputs": [] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "metadata": { 438 | "id": "_5Bki-YbsKHz" 439 | }, 440 | "source": [ 441 | "# Get all of the model's parameters as a list of tuples.\n", 442 | "params = list(model.named_parameters())\n", 443 | "\n", 444 | "print('The BERT model has {:} different named parameters.\\n'.format(len(params)))\n", 445 | "\n", 446 | "print('==== Embedding Layer ====\\n')\n", 447 | "\n", 448 | "for p in params[0:5]:\n", 449 | " print(\"{:<55} {:>12}\".format(p[0], str(tuple(p[1].size()))))\n", 450 | "\n", 451 | "print('\\n==== First Transformer ====\\n')\n", 452 | "\n", 453 | "for p in params[5:21]:\n", 454 | " print(\"{:<55} {:>12}\".format(p[0], str(tuple(p[1].size()))))\n", 455 | "\n", 456 | "print('\\n==== Output Layer ====\\n')\n", 457 | "\n", 458 | "for p in params[-4:]:\n", 459 | " print(\"{:<55} {:>12}\".format(p[0], str(tuple(p[1].size()))))" 460 | ], 461 | "execution_count": null, 462 | "outputs": [] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "metadata": { 467 | "id": "qwTEGVO_sULP" 468 | }, 469 | "source": [ 470 | "# Note: AdamW is a class from the huggingface library (as opposed to pytorch) \n", 471 | "# I believe the 'W' stands for 'Weight Decay fix\"\n", 472 | "optimizer = AdamW(model.parameters(),\n", 473 | " lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5\n", 474 | " eps = 1e-8 # args.adam_epsilon - default is 1e-8.\n", 475 | " )" 476 | ], 477 | "execution_count": null, 478 | "outputs": [] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "metadata": { 483 | "id": "-FSrNbcasXpI" 484 | }, 485 | "source": [ 486 | "from transformers import get_linear_schedule_with_warmup\n", 487 | "\n", 488 | "# Number of training epochs. The BERT authors recommend between 2 and 4. \n", 489 | "# We chose to run for 4, but we'll see later that this may be over-fitting the\n", 490 | "# training data.\n", 491 | "epochs = 3\n", 492 | "\n", 493 | "# Total number of training steps is [number of batches] x [number of epochs]. 
\n", 494 | "# (Note that this is not the same as the number of training samples).\n", 495 | "total_steps = len(train_dataloader) * epochs\n", 496 | "\n", 497 | "# Create the learning rate scheduler.\n", 498 | "scheduler = get_linear_schedule_with_warmup(optimizer, \n", 499 | " num_warmup_steps = 0, # Default value in run_glue.py\n", 500 | " num_training_steps = total_steps)" 501 | ], 502 | "execution_count": null, 503 | "outputs": [] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "metadata": { 508 | "id": "Ew9TYlDGsa_R" 509 | }, 510 | "source": [ 511 | "import numpy as np\n", 512 | "\n", 513 | "# Function to calculate the accuracy of our predictions vs labels\n", 514 | "def flat_accuracy(preds, labels):\n", 515 | " pred_flat = np.argmax(preds, axis=1).flatten()\n", 516 | " labels_flat = labels.flatten()\n", 517 | " return np.sum(pred_flat == labels_flat) / len(labels_flat)" 518 | ], 519 | "execution_count": null, 520 | "outputs": [] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "metadata": { 525 | "id": "Li765m2DseoO" 526 | }, 527 | "source": [ 528 | "import time\n", 529 | "import datetime\n", 530 | "\n", 531 | "def format_time(elapsed):\n", 532 | " '''\n", 533 | " Takes a time in seconds and returns a string hh:mm:ss\n", 534 | " '''\n", 535 | " # Round to the nearest second.\n", 536 | " elapsed_rounded = int(round((elapsed)))\n", 537 | " \n", 538 | " # Format as hh:mm:ss\n", 539 | " return str(datetime.timedelta(seconds=elapsed_rounded))\n" 540 | ], 541 | "execution_count": null, 542 | "outputs": [] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "metadata": { 547 | "id": "SnPqEl8rsgL5" 548 | }, 549 | "source": [ 550 | "import random\n", 551 | "import numpy as np\n", 552 | "\n", 553 | "# This training code is based on the `run_glue.py` script here:\n", 554 | "# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128\n", 555 | "\n", 556 | "# Set the seed value all over the place to make this reproducible.\n", 557 | "seed_val = 1\n", 558 | "\n", 559 | "random.seed(seed_val)\n", 560 | "np.random.seed(seed_val)\n", 561 | "torch.manual_seed(seed_val)\n", 562 | "torch.cuda.manual_seed_all(seed_val)\n", 563 | "\n", 564 | "# We'll store a number of quantities such as training and validation loss, \n", 565 | "# validation accuracy, and timings.\n", 566 | "training_stats = []\n", 567 | "\n", 568 | "# Measure the total training time for the whole run.\n", 569 | "total_t0 = time.time()\n", 570 | "\n", 571 | "# For each epoch...\n", 572 | "for epoch_i in range(0, epochs):\n", 573 | " \n", 574 | " # ========================================\n", 575 | " # Training\n", 576 | " # ========================================\n", 577 | " \n", 578 | " # Perform one full pass over the training set.\n", 579 | "\n", 580 | " print(\"\")\n", 581 | " print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))\n", 582 | " print('Training...')\n", 583 | "\n", 584 | " # Measure how long the training epoch takes.\n", 585 | " t0 = time.time()\n", 586 | "\n", 587 | " # Reset the total loss for this epoch.\n", 588 | " total_train_loss = 0\n", 589 | "\n", 590 | " # Put the model into training mode. Don't be mislead--the call to \n", 591 | " # `train` just changes the *mode*, it doesn't *perform* the training.\n", 592 | " # `dropout` and `batchnorm` layers behave differently during training\n", 593 | " # vs. 
test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)\n", 594 | " model.train()\n", 595 | "\n", 596 | " # For each batch of training data...\n", 597 | " for step, batch in enumerate(train_dataloader):\n", 598 | "\n", 599 | " # Progress update every 40 batches.\n", 600 | " if step % 40 == 0 and not step == 0:\n", 601 | " # Calculate elapsed time in minutes.\n", 602 | " elapsed = format_time(time.time() - t0)\n", 603 | " \n", 604 | " # Report progress.\n", 605 | " print(' Batch {:>5,} of {:>5,}. Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))\n", 606 | "\n", 607 | " # Unpack this training batch from our dataloader. \n", 608 | " #\n", 609 | " # As we unpack the batch, we'll also copy each tensor to the GPU using the \n", 610 | " # `to` method.\n", 611 | " #\n", 612 | " # `batch` contains three pytorch tensors:\n", 613 | " # [0]: input ids \n", 614 | " # [1]: attention masks\n", 615 | " # [2]: labels \n", 616 | " b_input_ids = batch[0].to(device)\n", 617 | " b_input_mask = batch[1].to(device)\n", 618 | " b_labels = batch[2].to(device)\n", 619 | "\n", 620 | " # Always clear any previously calculated gradients before performing a\n", 621 | " # backward pass. PyTorch doesn't do this automatically because \n", 622 | " # accumulating the gradients is \"convenient while training RNNs\". \n", 623 | " # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)\n", 624 | " model.zero_grad() \n", 625 | "\n", 626 | " # Perform a forward pass (evaluate the model on this training batch).\n", 627 | " # The documentation for this `model` function is here: \n", 628 | " # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification\n", 629 | " # It returns different numbers of parameters depending on what arguments\n", 630 | " # arge given and what flags are set. For our useage here, it returns\n", 631 | " # the loss (because we provided labels) and the \"logits\"--the model\n", 632 | " # outputs prior to activation.\n", 633 | " loss = model(b_input_ids, \n", 634 | " token_type_ids=None, \n", 635 | " attention_mask=b_input_mask, \n", 636 | " labels=b_labels)[0]\n", 637 | " logits = model(b_input_ids, \n", 638 | " token_type_ids=None, \n", 639 | " attention_mask=b_input_mask, \n", 640 | " labels=b_labels)[1] \n", 641 | "\n", 642 | " # Accumulate the training loss over all of the batches so that we can\n", 643 | " # calculate the average loss at the end. 
`loss` is a Tensor containing a\n", 644 | " # single value; the `.item()` function just returns the Python value \n", 645 | " # from the tensor.\n", 646 | " total_train_loss += loss.item()\n", 647 | "\n", 648 | " # Perform a backward pass to calculate the gradients.\n", 649 | " loss.backward()\n", 650 | "\n", 651 | " # Clip the norm of the gradients to 1.0.\n", 652 | " # This is to help prevent the \"exploding gradients\" problem.\n", 653 | " torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n", 654 | "\n", 655 | " # Update parameters and take a step using the computed gradient.\n", 656 | " # The optimizer dictates the \"update rule\"--how the parameters are\n", 657 | " # modified based on their gradients, the learning rate, etc.\n", 658 | " optimizer.step()\n", 659 | "\n", 660 | " # Update the learning rate.\n", 661 | " scheduler.step()\n", 662 | "\n", 663 | " # Calculate the average loss over all of the batches.\n", 664 | " avg_train_loss = total_train_loss / len(train_dataloader) \n", 665 | " \n", 666 | " # Measure how long this epoch took.\n", 667 | " training_time = format_time(time.time() - t0)\n", 668 | "\n", 669 | " print(\"\")\n", 670 | " print(\" Average training loss: {0:.2f}\".format(avg_train_loss))\n", 671 | " print(\" Training epcoh took: {:}\".format(training_time))\n", 672 | " \n", 673 | " # ========================================\n", 674 | " # Validation\n", 675 | " # ========================================\n", 676 | " # After the completion of each training epoch, measure our performance on\n", 677 | " # our validation set.\n", 678 | "\n", 679 | " print(\"\")\n", 680 | " print(\"Running Validation...\")\n", 681 | "\n", 682 | " t0 = time.time()\n", 683 | "\n", 684 | " # Put the model in evaluation mode--the dropout layers behave differently\n", 685 | " # during evaluation.\n", 686 | " model.eval()\n", 687 | "\n", 688 | " # Tracking variables \n", 689 | " total_eval_accuracy = 0\n", 690 | " total_eval_loss = 0\n", 691 | " nb_eval_steps = 0\n", 692 | "\n", 693 | " # Evaluate data for one epoch\n", 694 | " for batch in validation_dataloader:\n", 695 | " \n", 696 | " # Unpack this training batch from our dataloader. \n", 697 | " #\n", 698 | " # As we unpack the batch, we'll also copy each tensor to the GPU using \n", 699 | " # the `to` method.\n", 700 | " #\n", 701 | " # `batch` contains three pytorch tensors:\n", 702 | " # [0]: input ids \n", 703 | " # [1]: attention masks\n", 704 | " # [2]: labels \n", 705 | " b_input_ids = batch[0].to(device)\n", 706 | " b_input_mask = batch[1].to(device)\n", 707 | " b_labels = batch[2].to(device)\n", 708 | " \n", 709 | " # Tell pytorch not to bother with constructing the compute graph during\n", 710 | " # the forward pass, since this is only needed for backprop (training).\n", 711 | " with torch.no_grad(): \n", 712 | "\n", 713 | " # Forward pass, calculate logit predictions.\n", 714 | " # token_type_ids is the same as the \"segment ids\", which \n", 715 | " # differentiates sentence 1 and 2 in 2-sentence tasks.\n", 716 | " # The documentation for this `model` function is here: \n", 717 | " # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification\n", 718 | " # Get the \"logits\" output by the model. 
The \"logits\" are the output\n", 719 | " # values prior to applying an activation function like the softmax.\n", 720 | " loss = model(b_input_ids, \n", 721 | " token_type_ids=None, \n", 722 | " attention_mask=b_input_mask,\n", 723 | " labels=b_labels)[0]\n", 724 | " logits = model(b_input_ids, \n", 725 | " token_type_ids=None, \n", 726 | " attention_mask=b_input_mask,\n", 727 | " labels=b_labels)[1]\n", 728 | " \n", 729 | " # Accumulate the validation loss.\n", 730 | " total_eval_loss += loss.item()\n", 731 | "\n", 732 | " # Move logits and labels to CPU\n", 733 | " logits = logits.detach().cpu().numpy()\n", 734 | " label_ids = b_labels.to('cpu').numpy()\n", 735 | "\n", 736 | " # Calculate the accuracy for this batch of test sentences, and\n", 737 | " # accumulate it over all batches.\n", 738 | " total_eval_accuracy += flat_accuracy(logits, label_ids)\n", 739 | " \n", 740 | "\n", 741 | " # Report the final accuracy for this validation run.\n", 742 | " avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)\n", 743 | " print(\" Accuracy: {0:.2f}\".format(avg_val_accuracy))\n", 744 | "\n", 745 | " # Calculate the average loss over all of the batches.\n", 746 | " avg_val_loss = total_eval_loss / len(validation_dataloader)\n", 747 | " \n", 748 | " # Measure how long the validation run took.\n", 749 | " validation_time = format_time(time.time() - t0)\n", 750 | " \n", 751 | " print(\" Validation Loss: {0:.2f}\".format(avg_val_loss))\n", 752 | " print(\" Validation took: {:}\".format(validation_time))\n", 753 | "\n", 754 | " # Record all statistics from this epoch.\n", 755 | " training_stats.append(\n", 756 | " {\n", 757 | " 'epoch': epoch_i + 1,\n", 758 | " 'Training Loss': avg_train_loss,\n", 759 | " 'Valid. Loss': avg_val_loss,\n", 760 | " 'Valid. 
Accur.': avg_val_accuracy,\n", 761 | " 'Training Time': training_time,\n", 762 | " 'Validation Time': validation_time\n", 763 | " }\n", 764 | " )\n", 765 | "\n", 766 | "print(\"\")\n", 767 | "print(\"Training complete!\")\n", 768 | "\n", 769 | "print(\"Total training took {:} (h:mm:ss)\".format(format_time(time.time()-total_t0)))" 770 | ], 771 | "execution_count": null, 772 | "outputs": [] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "metadata": { 777 | "id": "Z55Kdx7mi1xa" 778 | }, 779 | "source": [ 780 | "model.save_pretrained('/content/drive/bert')" 781 | ], 782 | "execution_count": null, 783 | "outputs": [] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "metadata": { 788 | "id": "imcRg9Z3bDlw" 789 | }, 790 | "source": [ 791 | "#Test" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "metadata": { 797 | "id": "IEAg_ois9225" 798 | }, 799 | "source": [ 800 | "import pandas as pd\n", 801 | "\n", 802 | "# Tokenize all of the sentences and map the tokens to thier word IDs.\n", 803 | "input_ids = []\n", 804 | "attention_masks = []\n", 805 | "\n", 806 | "# For every sentence...\n", 807 | "for sent in val_texts:\n", 808 | " # `encode_plus` will:\n", 809 | " # (1) Tokenize the sentence.\n", 810 | " # (2) Prepend the `[CLS]` token to the start.\n", 811 | " # (3) Append the `[SEP]` token to the end.\n", 812 | " # (4) Map tokens to their IDs.\n", 813 | " # (5) Pad or truncate the sentence to `max_length`\n", 814 | " # (6) Create attention masks for [PAD] tokens.\n", 815 | " encoded_dict = tokenizer.encode_plus(\n", 816 | " sent, # Sentence to encode.\n", 817 | " add_special_tokens = True, # Add '[CLS]' and '[SEP]'\n", 818 | " max_length = 256, # Pad & truncate all sentences.\n", 819 | " truncation = True,\n", 820 | " pad_to_max_length = True,\n", 821 | " return_attention_mask = True, # Construct attn. masks.\n", 822 | " return_tensors = 'pt', # Return pytorch tensors.\n", 823 | " )\n", 824 | " \n", 825 | " # Add the encoded sentence to the list. \n", 826 | " input_ids.append(encoded_dict['input_ids'])\n", 827 | " \n", 828 | " # And its attention mask (simply differentiates padding from non-padding).\n", 829 | " attention_masks.append(encoded_dict['attention_mask'])\n", 830 | "\n", 831 | "# Convert the lists into tensors.\n", 832 | "input_ids = torch.cat(input_ids, dim=0)\n", 833 | "attention_masks = torch.cat(attention_masks, dim=0)\n", 834 | "#labels = torch.tensor(val_labels)\n", 835 | "\n", 836 | "# Set the batch size. 
\n", 837 | "batch_size = 8\n", 838 | "\n", 839 | "# Create the DataLoader.\n", 840 | "prediction_data = TensorDataset(input_ids, attention_masks)\n", 841 | "prediction_sampler = SequentialSampler(prediction_data)\n", 842 | "prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)" 843 | ], 844 | "execution_count": null, 845 | "outputs": [] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "metadata": { 850 | "id": "lLlmfbbP-Lof" 851 | }, 852 | "source": [ 853 | "# Prediction on test set\n", 854 | "\n", 855 | "print('Predicting labels for {:,} test sentences...'.format(len(input_ids)))\n", 856 | "\n", 857 | "\n", 858 | "# Put model in evaluation mode\n", 859 | "model.eval()\n", 860 | "\n", 861 | "# Tracking variables \n", 862 | "predictions, true_labels = [], []\n", 863 | "\n", 864 | "# Predict \n", 865 | "for batch in prediction_dataloader:\n", 866 | " # Add batch to GPU\n", 867 | " batch = tuple(t.to(device) for t in batch)\n", 868 | " \n", 869 | " # Unpack the inputs from our dataloader\n", 870 | " b_input_ids, b_input_mask = batch\n", 871 | " \n", 872 | " # Telling the model not to compute or store gradients, saving memory and \n", 873 | " # speeding up prediction\n", 874 | " with torch.no_grad():\n", 875 | " # Forward pass, calculate logit predictions\n", 876 | " outputs = model(b_input_ids, token_type_ids=None, \n", 877 | " attention_mask=b_input_mask)\n", 878 | "\n", 879 | " logits = outputs[0]\n", 880 | "\n", 881 | " # Move logits and labels to CPU\n", 882 | " logits = logits.detach().cpu().numpy()\n", 883 | " #label_ids = b_labels.to('cpu').numpy()\n", 884 | " \n", 885 | " # Store predictions and true labels\n", 886 | " predictions.append(logits)\n", 887 | " #true_labels.append(label_ids)\n", 888 | "\n", 889 | "print(' DONE.')" 890 | ], 891 | "execution_count": null, 892 | "outputs": [] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "metadata": { 897 | "id": "JQDwYWHGcAOx" 898 | }, 899 | "source": [ 900 | "import pickle\r\n", 901 | "\r\n", 902 | "with open('/content/drive/predictions.pickle', 'wb') as f:\r\n", 903 | " pickle.dump(predictions, f)" 904 | ], 905 | "execution_count": null, 906 | "outputs": [] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "metadata": { 911 | "id": "5nJEm2o4izmB" 912 | }, 913 | "source": [ 914 | "model.save_pretrained('/content/drive/bert')" 915 | ], 916 | "execution_count": null, 917 | "outputs": [] 918 | }, 919 | { 920 | "cell_type": "markdown", 921 | "metadata": { 922 | "id": "Vo9NjfLCMgKl" 923 | }, 924 | "source": [ 925 | "# Getting bert embeddings" 926 | ] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "metadata": { 931 | "colab": { 932 | "background_save": true 933 | }, 934 | "id": "myFoZyZTMfPs" 935 | }, 936 | "source": [ 937 | "model = BertModel.from_pretrained('/content/drive/bert')\n", 938 | "\n", 939 | "values = []\n", 940 | "\n", 941 | "for text in train_texts:\n", 942 | " input_ids = torch.tensor(tokenizer.encode(text, truncation = True, \\\n", 943 | " add_special_tokens = True, max_length = 256)).unsqueeze(0) # Batch size 1\n", 944 | " outputs = model(input_ids)\n", 945 | " last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple\n", 946 | " values.append(np.average(last_hidden_states[0].detach().cpu().numpy(), axis = 0))\n", 947 | "\n", 948 | "with open('/content/drive/bert_embs_train.pickle', 'wb') as f:\n", 949 | " pickle.dump(values, f)" 950 | ], 951 | "execution_count": null, 952 | "outputs": [] 953 | }, 954 | { 955 | "cell_type": "code", 
956 | "metadata": { 957 | "id": "zFhgjuphOaRY" 958 | }, 959 | "source": [ 960 | "values = []\n", 961 | "\n", 962 | "for text in val_texts:\n", 963 | " input_ids = torch.tensor(tokenizer.encode(text, truncation = True, \\\n", 964 | " add_special_tokens = True, max_length = 256)).unsqueeze(0) # Batch size 1\n", 965 | " outputs = model(input_ids)\n", 966 | " last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple\n", 967 | " values.append(np.average(last_hidden_states[0].detach().cpu().numpy(), axis = 0))\n", 968 | "\n", 969 | "with open('/content/drive/bert_embs_val.pickle', 'wb') as f:\n", 970 | " pickle.dump(values, f)" 971 | ], 972 | "execution_count": null, 973 | "outputs": [] 974 | } 975 | ] 976 | } -------------------------------------------------------------------------------- /IIITT/run2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Ensemble_classifier.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | } 13 | }, 14 | "cells": [ 15 | { 16 | "cell_type": "code", 17 | "metadata": { 18 | "id": "lWckrP6pStpg" 19 | }, 20 | "source": [ 21 | "import pandas as pd\r\n", 22 | "import numpy as np\r\n", 23 | "import matplotlib.pyplot as plt\r\n" 24 | ], 25 | "execution_count": null, 26 | "outputs": [] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "metadata": { 31 | "colab": { 32 | "base_uri": "https://localhost:8080/" 33 | }, 34 | "id": "4FUUUT_FX1ze", 35 | "outputId": "bba714b3-29dd-4e68-acbd-3624f1ceb354" 36 | }, 37 | "source": [ 38 | "cd /content/drive/MyDrive/sdpra2021/pred_probs/" 39 | ], 40 | "execution_count": null, 41 | "outputs": [ 42 | { 43 | "output_type": "stream", 44 | "text": [ 45 | "/content/drive/MyDrive/sdpra2021/pred_probs\n" 46 | ], 47 | "name": "stdout" 48 | } 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "metadata": { 54 | "colab": { 55 | "base_uri": "https://localhost:8080/", 56 | "height": 204 57 | }, 58 | "id": "ounVF-yhYQsL", 59 | "outputId": "8020bb22-24ad-499d-d87b-a12cac302747" 60 | }, 61 | "source": [ 62 | "bert = pd.read_csv('bert.csv')\r\n", 63 | "bert = bert.drop(columns='Unnamed: 0')\r\n", 64 | "bert.head() " 65 | ], 66 | "execution_count": null, 67 | "outputs": [ 68 | { 69 | "output_type": "execute_result", 70 | "data": { 71 | "text/html": [ 72 | "
\n", 73 | "\n", 86 | "\n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | "
   CL        CR        DC        DS        LO        NI        SE        result  abstract
0  0.000150  0.001650  0.986069  0.003539  0.002675  0.002975  0.002942  2       This paper analyses the possibilities of per...
1  0.000809  0.011402  0.895457  0.010034  0.004086  0.073280  0.004931  2       A finite element method is presented to comp...
2  0.998192  0.000182  0.000038  0.000098  0.000380  0.000112  0.000999  0       This paper includes a reflection on the role...
3  0.000124  0.001555  0.002590  0.000685  0.000627  0.993462  0.000958  5       In this document, we describe the fractal st...
4  0.000166  0.000656  0.001765  0.995629  0.000873  0.000429  0.000482  3       We show how to test whether a graph with n v...
\n", 164 | "
" 165 | ], 166 | "text/plain": [ 167 | " CL ... abstract\n", 168 | "0 0.000150 ... This paper analyses the possibilities of per...\n", 169 | "1 0.000809 ... A finite element method is presented to comp...\n", 170 | "2 0.998192 ... This paper includes a reflection on the role...\n", 171 | "3 0.000124 ... In this document, we describe the fractal st...\n", 172 | "4 0.000166 ... We show how to test whether a graph with n v...\n", 173 | "\n", 174 | "[5 rows x 9 columns]" 175 | ] 176 | }, 177 | "metadata": { 178 | "tags": [] 179 | }, 180 | "execution_count": 5 181 | } 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "metadata": { 187 | "colab": { 188 | "base_uri": "https://localhost:8080/", 189 | "height": 204 190 | }, 191 | "id": "46d-Z4cYYc5r", 192 | "outputId": "97fcfe13-9e50-44ed-bf96-282ada494420" 193 | }, 194 | "source": [ 195 | "roberta = pd.read_csv('roberta.csv')\r\n", 196 | "roberta = roberta.drop(columns=['Unnamed: 0'])\r\n", 197 | "roberta.head() " 198 | ], 199 | "execution_count": null, 200 | "outputs": [ 201 | { 202 | "output_type": "execute_result", 203 | "data": { 204 | "text/html": [ 205 | "
\n", 206 | "\n", 219 | "\n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | "
   CL        CR        DC        DS        LO        NI        SE        result  abstract
0  0.002313  0.002387  0.980938  0.003960  0.001252  0.005076  0.004074  2       This paper analyses the possibilities of per...
1  0.002194  0.003206  0.978623  0.005403  0.000872  0.007745  0.001957  2       A finite element method is presented to comp...
2  0.997938  0.000090  0.000247  0.000461  0.000620  0.000309  0.000335  0       This paper includes a reflection on the role...
3  0.006236  0.298282  0.384594  0.064093  0.015241  0.203037  0.028518  2       In this document, we describe the fractal st...
4  0.000752  0.000967  0.001257  0.994632  0.001538  0.000534  0.000321  3       We show how to test whether a graph with n v...
\n", 297 | "
" 298 | ], 299 | "text/plain": [ 300 | " CL ... abstract\n", 301 | "0 0.002313 ... This paper analyses the possibilities of per...\n", 302 | "1 0.002194 ... A finite element method is presented to comp...\n", 303 | "2 0.997938 ... This paper includes a reflection on the role...\n", 304 | "3 0.006236 ... In this document, we describe the fractal st...\n", 305 | "4 0.000752 ... We show how to test whether a graph with n v...\n", 306 | "\n", 307 | "[5 rows x 9 columns]" 308 | ] 309 | }, 310 | "metadata": { 311 | "tags": [] 312 | }, 313 | "execution_count": 7 314 | } 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "metadata": { 320 | "colab": { 321 | "base_uri": "https://localhost:8080/", 322 | "height": 204 323 | }, 324 | "id": "7N1k70ZBZdKp", 325 | "outputId": "c0facf87-f366-43a4-aa0c-2be4a66110c2" 326 | }, 327 | "source": [ 328 | "scibert = pd.read_csv('scibert.csv')\r\n", 329 | "scibert = scibert.drop(columns='Unnamed: 0')\r\n", 330 | "scibert.head() " 331 | ], 332 | "execution_count": null, 333 | "outputs": [ 334 | { 335 | "output_type": "execute_result", 336 | "data": { 337 | "text/html": [ 338 | "
\n", 339 | "\n", 352 | "\n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | "
   CL        CR        DC        DS        LO        NI        SE        result  abstract
0  0.000159  0.000761  0.994799  0.000768  0.000288  0.001839  0.001386  2       This paper analyses the possibilities of per...
1  0.000286  0.001598  0.848090  0.002578  0.000714  0.144184  0.002550  2       A finite element method is presented to comp...
2  0.999109  0.000252  0.000133  0.000148  0.000178  0.000062  0.000117  0       This paper includes a reflection on the role...
3  0.000146  0.000313  0.002194  0.000169  0.000153  0.996466  0.000559  5       In this document, we describe the fractal st...
4  0.000225  0.000235  0.000493  0.998302  0.000425  0.000191  0.000129  3       We show how to test whether a graph with n v...
\n", 430 | "
" 431 | ], 432 | "text/plain": [ 433 | " CL ... abstract\n", 434 | "0 0.000159 ... This paper analyses the possibilities of per...\n", 435 | "1 0.000286 ... A finite element method is presented to comp...\n", 436 | "2 0.999109 ... This paper includes a reflection on the role...\n", 437 | "3 0.000146 ... In this document, we describe the fractal st...\n", 438 | "4 0.000225 ... We show how to test whether a graph with n v...\n", 439 | "\n", 440 | "[5 rows x 9 columns]" 441 | ] 442 | }, 443 | "metadata": { 444 | "tags": [] 445 | }, 446 | "execution_count": 19 447 | } 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "metadata": { 453 | "id": "ULhUk0xsh3_g" 454 | }, 455 | "source": [ 456 | "test = pd.read_csv('/content/drive/MyDrive/spdra2021/Datasets/test.csv',delimiter=',',\r\n", 457 | " header=None,names=['text'])\r\n", 458 | " " 459 | ], 460 | "execution_count": null, 461 | "outputs": [] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "metadata": { 466 | "id": "3ZWYYi6hrC_-" 467 | }, 468 | "source": [ 469 | "labels = ['CL','CR','DC','DS','LO','NI','SE']" 470 | ], 471 | "execution_count": null, 472 | "outputs": [] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "metadata": { 477 | "colab": { 478 | "base_uri": "https://localhost:8080/" 479 | }, 480 | "id": "_Zj5_hh64Wk4", 481 | "outputId": "3ec48089-225f-44c3-e524-099d8a43ce86" 482 | }, 483 | "source": [ 484 | "test['text']" 485 | ], 486 | "execution_count": null, 487 | "outputs": [ 488 | { 489 | "output_type": "execute_result", 490 | "data": { 491 | "text/plain": [ 492 | "0 This paper analyses the possibilities of per...\n", 493 | "1 A finite element method is presented to comp...\n", 494 | "2 This paper includes a reflection on the role...\n", 495 | "3 In this document, we describe the fractal st...\n", 496 | "4 We show how to test whether a graph with n v...\n", 497 | " ... \n", 498 | "6995 It is common practice to compare the computa...\n", 499 | "6996 Defeasible reasoning is a simple but efficie...\n", 500 | "6997 The almost periodic functions form a natural...\n", 501 | "6998 A notion of alternating timed automata is pr...\n", 502 | "6999 We present a hierarchical framework for anal...\n", 503 | "Name: text, Length: 7000, dtype: object" 504 | ] 505 | }, 506 | "metadata": { 507 | "tags": [] 508 | }, 509 | "execution_count": 42 510 | } 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "metadata": { 516 | "colab": { 517 | "base_uri": "https://localhost:8080/" 518 | }, 519 | "id": "Q1rhux11aG7f", 520 | "outputId": "8a7ef2ef-5f1b-4b7a-a0f5-3b6c4099bf86" 521 | }, 522 | "source": [ 523 | "for label in labels:\r\n", 524 | " print(label)\r\n", 525 | " print(np.corrcoef([bert[label].rank(pct=True), roberta[label].rank(pct=True), scibert[label].rank(pct=True)]))\r\n", 526 | "submission = pd.DataFrame()\r\n", 527 | "#submission['id'] = a['abstract']\r\n", 528 | "for label in labels:\r\n", 529 | " submission[label] = round((bert[label] + roberta[label] + scibert[label])/3,6)\r\n", 530 | "submission['result'] = submission.idxmax(axis = 1) \r\n", 531 | "submission['result'] = submission['result'].apply({'CL':0,'CR':1,'DC':2,\r\n", 532 | "'DS':3,'LO':4, 'NI':5, 'SE':6}.get) \r\n", 533 | "submission['id'] = test['text']\r\n", 534 | "submission.to_csv('submission.csv', index=False)" 535 | ], 536 | "execution_count": null, 537 | "outputs": [ 538 | { 539 | "output_type": "stream", 540 | "text": [ 541 | "CL\n", 542 | "[[1. 0.60876764 0.79489693]\n", 543 | " [0.60876764 1. 0.43417398]\n", 544 | " [0.79489693 0.43417398 1. 
]]\n", 545 | "CR\n", 546 | "[[1. 0.81781081 0.77273806]\n", 547 | " [0.81781081 1. 0.69869303]\n", 548 | " [0.77273806 0.69869303 1. ]]\n", 549 | "DC\n", 550 | "[[1. 0.84889632 0.85096035]\n", 551 | " [0.84889632 1. 0.88747852]\n", 552 | " [0.85096035 0.88747852 1. ]]\n", 553 | "DS\n", 554 | "[[1. 0.92145531 0.84394307]\n", 555 | " [0.92145531 1. 0.82648213]\n", 556 | " [0.84394307 0.82648213 1. ]]\n", 557 | "LO\n", 558 | "[[1. 0.82319259 0.72438774]\n", 559 | " [0.82319259 1. 0.80665013]\n", 560 | " [0.72438774 0.80665013 1. ]]\n", 561 | "NI\n", 562 | "[[1. 0.92307865 0.91320051]\n", 563 | " [0.92307865 1. 0.90765773]\n", 564 | " [0.91320051 0.90765773 1. ]]\n", 565 | "SE\n", 566 | "[[1. 0.71567135 0.72420244]\n", 567 | " [0.71567135 1. 0.89973318]\n", 568 | " [0.72420244 0.89973318 1. ]]\n" 569 | ], 570 | "name": "stdout" 571 | } 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "metadata": { 577 | "colab": { 578 | "base_uri": "https://localhost:8080/", 579 | "height": 405 580 | }, 581 | "id": "EMqzUplRHjxa", 582 | "outputId": "590418e4-cadb-4e96-c48d-da936cc05ce7" 583 | }, 584 | "source": [ 585 | "submission" 586 | ], 587 | "execution_count": null, 588 | "outputs": [ 589 | { 590 | "output_type": "execute_result", 591 | "data": { 592 | "text/html": [ 593 | "
\n", 594 | "\n", 607 | "\n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | "
      CL        CR        DC        DS        LO        NI        SE        result  id
0     0.000874  0.001599  0.987269  0.002756  0.001405  0.003297  0.002801  2       This paper analyses the possibilities of per...
1     0.001096  0.005402  0.907390  0.006005  0.001891  0.075070  0.003146  2       A finite element method is presented to comp...
2     0.998413  0.000175  0.000140  0.000236  0.000392  0.000161  0.000484  0       This paper includes a reflection on the role...
3     0.002169  0.100050  0.129793  0.021649  0.005341  0.730988  0.010011  5       In this document, we describe the fractal st...
4     0.000381  0.000619  0.001172  0.996188  0.000945  0.000385  0.000311  3       We show how to test whether a graph with n v...
...   ...       ...       ...       ...       ...       ...       ...       ...     ...
6995  0.001178  0.000596  0.001409  0.002084  0.992788  0.000509  0.001437  4       It is common practice to compare the computa...
6996  0.001471  0.000475  0.001046  0.001694  0.993390  0.000495  0.001430  4       Defeasible reasoning is a simple but efficie...
6997  0.001451  0.001185  0.001605  0.004924  0.988287  0.000649  0.001898  4       The almost periodic functions form a natural...
6998  0.001222  0.000495  0.001089  0.001730  0.993416  0.000470  0.001579  4       A notion of alternating timed automata is pr...
6999  0.001455  0.000472  0.001029  0.001684  0.993397  0.000490  0.001474  4       We present a hierarchical framework for anal...
\n", 757 | "

7000 rows × 9 columns

\n", 758 | "
" 759 | ], 760 | "text/plain": [ 761 | " CL ... id\n", 762 | "0 0.000874 ... This paper analyses the possibilities of per...\n", 763 | "1 0.001096 ... A finite element method is presented to comp...\n", 764 | "2 0.998413 ... This paper includes a reflection on the role...\n", 765 | "3 0.002169 ... In this document, we describe the fractal st...\n", 766 | "4 0.000381 ... We show how to test whether a graph with n v...\n", 767 | "... ... ... ...\n", 768 | "6995 0.001178 ... It is common practice to compare the computa...\n", 769 | "6996 0.001471 ... Defeasible reasoning is a simple but efficie...\n", 770 | "6997 0.001451 ... The almost periodic functions form a natural...\n", 771 | "6998 0.001222 ... A notion of alternating timed automata is pr...\n", 772 | "6999 0.001455 ... We present a hierarchical framework for anal...\n", 773 | "\n", 774 | "[7000 rows x 9 columns]" 775 | ] 776 | }, 777 | "metadata": { 778 | "tags": [] 779 | }, 780 | "execution_count": 48 781 | } 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "metadata": { 787 | "id": "7Nl0V7Bm47yj" 788 | }, 789 | "source": [ 790 | "submission['result'] = submission['result'].apply({0:'CL', 1:'CR', 2:'DC',\r\n", 791 | "3:'DS', 4:'LO', 5:'NI', 6:'SE' }.get)\r\n" 792 | ], 793 | "execution_count": null, 794 | "outputs": [] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "metadata": { 799 | "colab": { 800 | "base_uri": "https://localhost:8080/", 801 | "height": 34 802 | }, 803 | "id": "HlxxQwny5kfl", 804 | "outputId": "61254762-c27b-41e1-adee-584c01dc7269" 805 | }, 806 | "source": [ 807 | "result = submission['result'].to_numpy()\r\n", 808 | "print(len(result))\r\n", 809 | "np.savetxt(\"run2.txt\", result, fmt = \"%s\")\r\n", 810 | "from google.colab import files\r\n", 811 | "files.download('run2.txt')" 812 | ], 813 | "execution_count": null, 814 | "outputs": [ 815 | { 816 | "output_type": "stream", 817 | "text": [ 818 | "7000\n" 819 | ], 820 | "name": "stdout" 821 | }, 822 | { 823 | "output_type": "display_data", 824 | "data": { 825 | "application/javascript": [ 826 | "\n", 827 | " async function download(id, filename, size) {\n", 828 | " if (!google.colab.kernel.accessAllowed) {\n", 829 | " return;\n", 830 | " }\n", 831 | " const div = document.createElement('div');\n", 832 | " const label = document.createElement('label');\n", 833 | " label.textContent = `Downloading \"${filename}\": `;\n", 834 | " div.appendChild(label);\n", 835 | " const progress = document.createElement('progress');\n", 836 | " progress.max = size;\n", 837 | " div.appendChild(progress);\n", 838 | " document.body.appendChild(div);\n", 839 | "\n", 840 | " const buffers = [];\n", 841 | " let downloaded = 0;\n", 842 | "\n", 843 | " const channel = await google.colab.kernel.comms.open(id);\n", 844 | " // Send a message to notify the kernel that we're ready.\n", 845 | " channel.send({})\n", 846 | "\n", 847 | " for await (const message of channel.messages) {\n", 848 | " // Send a message to notify the kernel that we're ready.\n", 849 | " channel.send({})\n", 850 | " if (message.buffers) {\n", 851 | " for (const buffer of message.buffers) {\n", 852 | " buffers.push(buffer);\n", 853 | " downloaded += buffer.byteLength;\n", 854 | " progress.value = downloaded;\n", 855 | " }\n", 856 | " }\n", 857 | " }\n", 858 | " const blob = new Blob(buffers, {type: 'application/binary'});\n", 859 | " const a = document.createElement('a');\n", 860 | " a.href = window.URL.createObjectURL(blob);\n", 861 | " a.download = filename;\n", 862 | " div.appendChild(a);\n", 863 | " 
a.click();\n", 864 | " div.remove();\n", 865 | " }\n", 866 | " " 867 | ], 868 | "text/plain": [ 869 | "" 870 | ] 871 | }, 872 | "metadata": { 873 | "tags": [] 874 | } 875 | }, 876 | { 877 | "output_type": "display_data", 878 | "data": { 879 | "application/javascript": [ 880 | "download(\"download_a34366a4-3e84-43f2-bd10-e8e55c7350e6\", \"ensemble1.txt\", 21000)" 881 | ], 882 | "text/plain": [ 883 | "" 884 | ] 885 | }, 886 | "metadata": { 887 | "tags": [] 888 | } 889 | } 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "metadata": { 895 | "id": "cBhMAqPaINUo", 896 | "colab": { 897 | "base_uri": "https://localhost:8080/" 898 | }, 899 | "outputId": "6380bb97-4f42-414b-84d2-a030502a7ec5" 900 | }, 901 | "source": [ 902 | "from sklearn.metrics import classification_report\r\n", 903 | "print(classification_report(scibert['result'],submission['result']))" 904 | ], 905 | "execution_count": null, 906 | "outputs": [ 907 | { 908 | "output_type": "stream", 909 | "text": [ 910 | " precision recall f1-score support\n", 911 | "\n", 912 | " 0 0.99 0.99 0.99 1184\n", 913 | " 1 0.96 0.96 0.96 1122\n", 914 | " 2 0.94 0.92 0.93 812\n", 915 | " 3 0.97 0.97 0.97 1074\n", 916 | " 4 0.97 0.97 0.97 759\n", 917 | " 5 0.97 0.97 0.97 1199\n", 918 | " 6 0.96 0.96 0.96 850\n", 919 | "\n", 920 | " accuracy 0.97 7000\n", 921 | " macro avg 0.96 0.96 0.96 7000\n", 922 | "weighted avg 0.97 0.97 0.97 7000\n", 923 | "\n" 924 | ], 925 | "name": "stdout" 926 | } 927 | ] 928 | }, 929 | { 930 | "cell_type": "code", 931 | "metadata": { 932 | "colab": { 933 | "base_uri": "https://localhost:8080/" 934 | }, 935 | "id": "QHC1Pec_keR-", 936 | "outputId": "98347f5d-082f-4435-85cc-c026acb78375" 937 | }, 938 | "source": [ 939 | "print(classification_report(val['label'],scibert['result']))" 940 | ], 941 | "execution_count": null, 942 | "outputs": [ 943 | { 944 | "output_type": "stream", 945 | "text": [ 946 | " precision recall f1-score support\n", 947 | "\n", 948 | " 0 0.98 0.98 0.98 1866\n", 949 | " 1 0.92 0.91 0.91 1835\n", 950 | " 2 0.82 0.83 0.83 1355\n", 951 | " 3 0.93 0.93 0.93 1774\n", 952 | " 4 0.93 0.93 0.93 1217\n", 953 | " 5 0.91 0.91 0.91 1826\n", 954 | " 6 0.89 0.91 0.90 1327\n", 955 | "\n", 956 | " accuracy 0.92 11200\n", 957 | " macro avg 0.91 0.91 0.91 11200\n", 958 | "weighted avg 0.92 0.92 0.92 11200\n", 959 | "\n" 960 | ], 961 | "name": "stdout" 962 | } 963 | ] 964 | }, 965 | { 966 | "cell_type": "markdown", 967 | "metadata": { 968 | "id": "0pBDXhATIqGJ" 969 | }, 970 | "source": [ 971 | "#Predictions " 972 | ] 973 | }, 974 | { 975 | "cell_type": "code", 976 | "metadata": { 977 | "id": "0T6MeF7NkxY4" 978 | }, 979 | "source": [ 980 | "\r\n", 981 | "\r\n", 982 | "\"\"\"\r\n", 983 | "The submission file IIITT.zip has the systems as follows:\r\n", 984 | "\r\n", 985 | "run 1 : Pre-trained Transformer Model (allenai/scibert_scivocab_uncased)\r\n", 986 | "run 2 : Average of probabities of predictions of ( BERT_base_uncased + RoBERTa_base + SciBERT)\r\n", 987 | "run 3 : Ensemble of probabilities of predictions by ranking the percentile of the result stored as a pandas DataFrame\r\n", 988 | "\"\"\"" 989 | ], 990 | "execution_count": null, 991 | "outputs": [] 992 | } 993 | ] 994 | } --------------------------------------------------------------------------------
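For reference, the run 2 ensemble implemented in `IIITT/run2.ipynb` above reduces to averaging the three models' per-class probabilities and taking the highest-scoring column as the prediction. Below is a minimal standalone sketch of that step; the CSV names, label order, and output file mirror the notebook, while the surrounding structure is illustrative rather than part of the submitted system.

```python
import pandas as pd

# Class labels in the order used by the notebook.
LABELS = ['CL', 'CR', 'DC', 'DS', 'LO', 'NI', 'SE']

# Per-model class-probability tables, one row per test abstract
# (same layout as bert.csv / roberta.csv / scibert.csv above).
frames = [pd.read_csv(name) for name in ('bert.csv', 'roberta.csv', 'scibert.csv')]

submission = pd.DataFrame()
for label in LABELS:
    # Run 2: unweighted mean of the three models' probabilities for each class.
    submission[label] = sum(df[label] for df in frames) / len(frames)

# The predicted class is the column with the highest averaged probability.
submission['result'] = submission[LABELS].idxmax(axis=1)

# One label string per test abstract, as in the notebook's run2.txt.
submission['result'].to_csv('run2.txt', index=False, header=False)
```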