├── utmn
│   ├── readme.txt
│   ├── 3_feedforward_+_ensembling.ipynb
│   ├── 2_LDA.ipynb
│   └── 1_fine_tuning_and_getting_bert_embs.ipynb
├── IIITT
│   ├── DESCRIPTION.md
│   ├── run3.ipynb
│   └── run2.ipynb
├── README.md
├── FideLIPI
│   ├── description.md
│   ├── topic_model_based_feature_creation.py
│   ├── 5_ensemble.py
│   ├── 3_sentence_level_model.py
│   ├── 4_tf_idf_logistic_model.py
│   ├── 1_roberta_on_whole_abstract.py
│   └── 2_roberta_on_abstract_text_combined_with_lda.py
└── parklize
    └── DESCRIPTION.md
/utmn/readme.txt:
--------------------------------------------------------------------------------
1 | This folder contains the code of the UTMN team. Our final solution achieved a weighted F1-score of 93.82%.
2 |
3 | More details: Glazkova A. Identifying Topics of Scientific Articles with BERT-based Approaches and Topic Modeling.
--------------------------------------------------------------------------------
/IIITT/DESCRIPTION.md:
--------------------------------------------------------------------------------
1 | The submission file IIITT.zip contains the following systems:
2 |
3 | - run 1 : Pre-trained Transformer Model (allenai/scibert_scivocab_uncased)
4 | - run 2 : Average of the predicted probabilities of (BERT_base_uncased + RoBERTa_base + SciBERT); see the sketch below
5 | - run 3 : Ensemble of the predicted probabilities, combined by ranking the percentiles of the results stored as a pandas DataFrame
6 | - The saved models and predicted probabilities are available at https://drive.google.com/drive/folders/1zY0dEyf49s0H00T7f_H575IKaJWHfWgO?usp=sharing
7 | - The other models we experimented with are available at https://github.com/adeepH/SDPRA-2021-SharedTask
8 |
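
A minimal sketch of the run 2 averaging, assuming the three models' predicted probabilities are stored as numpy arrays of shape `(n_samples, n_classes)` (the file names below are illustrative, not from this repository):

```python
import numpy as np

# assumed dumps of the per-class probabilities predicted by each fine-tuned model
probs_bert = np.load("probs_bert_base_uncased.npy")
probs_roberta = np.load("probs_roberta_base.npy")
probs_scibert = np.load("probs_scibert.npy")

# run 2: average the probabilities and pick the most probable class per abstract
avg_probs = (probs_bert + probs_roberta + probs_scibert) / 3.0
run2_predictions = avg_probs.argmax(axis=1)
```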
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SDPRA 2021 Shared Task
2 | Submitted systems for the 2021 shared task of "The First Workshop & Shared Task on Scope Detection of the Peer Review Articles" (SDPRA 2021)
3 |
4 | ## Submitted Teams
5 | * **UTMN**
6 | * **IIITT**
7 | * **FideLIPI**
8 | * **Parklize**
9 |
10 |
11 | ## Dataset Citation
12 | ```
13 | Reddy, Saichethan; Saini, Naveen (2021),
14 | “SDPRA 2021 Shared Task Data”, Mendeley Data, V1,
15 | doi: 10.17632/njb74czv49.1
16 | ```
17 |
18 | ## Overview Paper
19 | ```
20 | Reddy, S., Saini, N. Overview and Insights
21 | from Scope Detection of the Peer Review Articles
22 | Shared Tasks 2021. In Proceedings of the
23 | First Workshop & Shared Task on Scope Detection of
24 | the Peer Review Articles (SDPRA 2021)
25 | ```
26 |
27 |
--------------------------------------------------------------------------------
/FideLIPI/description.md:
--------------------------------------------------------------------------------
1 | # SDPRA-2021 Shared Task
2 |
3 | ### Submission by : FideLIPI
4 | ### Team Members : Ankush Chopra, Sohom Ghosh
5 |
6 | We've built an ensemble of 4 models. These models are:
7 |
8 | 1. A classification model using the pretrained RoBERTa model, which we fine-tune while training for the task.
9 | 2. A classification model using the pretrained RoBERTa model combined with features created using LDA. Here too we fine-tune the RoBERTa weights along with the classification layer while training for the task.
10 | 3. A classification model where we first break the abstract into sentences and build the model using all sentences longer than 10 words. We perform sentence tokenization using spaCy. Every sentence is given the same label as its abstract. At prediction time, we take the label with the highest combined output probability as the prediction. We used the Simple Transformers library to build this model.
11 | 4. A classification model built on TF-IDF features. These features consist of uni-, bi-, tri- and four-grams. We built a logistic regression model using these features.
12 |
13 | The predictions of these 4 models are combined to give the final prediction. The final prediction is made by popular vote, and ties are broken arbitrarily (a minimal sketch of this voting scheme follows).
14 |
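A minimal sketch of the popular-vote scheme (the label lists are illustrative; `5_ensemble.py` contains the actual implementation):

```python
from collections import Counter

def popular_vote(*per_model_predictions):
    """Majority vote over the label predicted by each child model for each abstract."""
    final = []
    for votes in zip(*per_model_predictions):
        # most_common(1) keeps the first label among equally frequent ones,
        # i.e. ties are broken arbitrarily
        final.append(Counter(votes).most_common(1)[0][0])
    return final

print(popular_vote(["CL", "DS"], ["CL", "NI"], ["CR", "NI"], ["CL", "SE"]))  # ['CL', 'NI']
```
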
--------------------------------------------------------------------------------
/parklize/DESCRIPTION.md:
--------------------------------------------------------------------------------
1 | This is the repository for team **parklize**'s solution to the [shared task](https://sdpra-2021.github.io/website/) at PAKDD 2021 on scholarly text (abstract) classification.
2 |
3 |
4 |
5 | There are two main ```.ipynb``` notebooks for the solution:
6 |
7 | - ```pakdd2021_fasttext_entityembeddings.ipynb``` and
8 | - [Google Colab notebook](https://colab.research.google.com/drive/1x9MUQxXa2BnSVYjUMrgfy3oZa_p0YFXu?usp=sharing)
9 |
10 |
11 |
12 | # Details
13 |
14 | ```pakdd2021_fasttext_entityembeddings.ipynb``` does two things:
15 |
16 | - training a [fasttext](https://fasttext.cc/) classifier
17 | - getting sentence embeddings with extracted entities using [TagMe](https://tagme.d4science.org/tagme/) and [wikipedia2vec](https://wikipedia2vec.github.io/wikipedia2vec/)
18 |
19 |
20 |
21 | Regarding training a fasttext classifier, there are several steps (cells), sketched below:
22 |
23 | - read challenge data
24 | - split validation set further into *internal* validation & test sets
25 | - change data to fasttext format
26 | - train a fasttext classifier using [fasttext](https://fasttext.cc/)
27 | - predict on test set(s)
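
A minimal sketch of these steps (file names and hyper-parameters are assumptions):

```python
import fasttext
import pandas as pd

train = pd.read_csv("train.csv", header=None, names=["text", "label"])

# fastText's supervised format: one example per line, "__label__<LABEL> <text>"
with open("train_fasttext.txt", "w") as f:
    for text, label in zip(train["text"], train["label"]):
        f.write(f"__label__{label} {' '.join(str(text).split())}\n")

# train and run a quick prediction
model = fasttext.train_supervised(input="train_fasttext.txt", epoch=25, lr=0.5, wordNgrams=2)
print(model.predict("this paper proposes a neural model for ..."))
```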
28 |
29 |
30 |
31 | Regarding getting sentence embeddings with extracted entities, there are several steps (sketched below):
32 |
33 | - extract Wikipedia entities/articles using TagMe
34 | - get abstract embeddings by aggregating entity embeddings for those entities mentioned in each abstract
35 | - the entities are further filtered by applying k-means clustering (with two clusters) and keeping the larger cluster, under the assumption that the smaller one consists of noisy entities
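
A minimal sketch of this aggregation, assuming the wikipedia2vec vectors of the entities TagMe found in one abstract are already stacked into a `(n_entities, dim)` numpy array:

```python
import numpy as np
from sklearn.cluster import KMeans

def abstract_embedding(entity_vectors):
    # with fewer than two entities there is nothing to filter
    if len(entity_vectors) < 2:
        return entity_vectors.mean(axis=0)
    # two clusters: keep the larger one, assumed to hold the clean entities
    labels = KMeans(n_clusters=2, random_state=0).fit_predict(entity_vectors)
    keep = 0 if (labels == 0).sum() >= (labels == 1).sum() else 1
    return entity_vectors[labels == keep].mean(axis=0)
```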
36 |
37 |
38 |
39 | The Google Colab notebook does several things such as:
40 |
41 | - training 7 Sentence-BERT classifiers with [sentence transformers](https://www.sbert.net/) and testing with those classifiers (a minimal sketch follows the list)
42 | - training a classifier with ```universal-sentence-encoder``` from [Tensorflow Hub](https://www.tensorflow.org/hub) for encoding abstract texts, and testing with this classifier
43 | - loading the fasttext classifier's prediction result
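
A minimal sketch of one Sentence-BERT based classifier (the notebook trains 7 of them; the encoder name and the logistic-regression head here are assumptions):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv", header=None, names=["text", "label"])
valid = pd.read_csv("validation.csv", header=None, names=["text", "label"])

# encode each abstract into a sentence embedding, then fit a simple classifier on top
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_train = encoder.encode(train["text"].tolist(), show_progress_bar=True)
X_valid = encoder.encode(valid["text"].tolist(), show_progress_bar=True)

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print(clf.score(X_valid, valid["label"]))
```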
--------------------------------------------------------------------------------
/FideLIPI/topic_model_based_feature_creation.py:
--------------------------------------------------------------------------------
1 | """
2 | This script creates features from the abstracts by running LDA on them. LDA gives the vectors that represent the abstracts. We use these features as input to one of the RoBERTa models that we have built. Details of that model can be found in script 2.
3 | Author: Sohom Ghosh
4 | """
5 | import re
6 | import os
7 | import pandas as pd
8 | import numpy as np
9 | import gensim
10 | from gensim import corpora
11 | from nltk.corpus import stopwords
12 | from nltk.stem.wordnet import WordNetLemmatizer
13 | import string
14 |
15 |
16 | PATH = "/data/disk3/pakdd/"
17 |
18 | # Reading data from train, test and validation files
19 | train = pd.read_excel(PATH + "train.xlsx", sheet_name="train", header=None)
20 | train.columns = ["text", "label"]
21 | validation = pd.read_excel(
22 | PATH + "validation.xlsx", sheet_name="validation", header=None
23 | )
24 | validation.columns = ["text", "label"]
25 | test = pd.read_excel(PATH + "test.xlsx", sheet_name="test", header=None)
26 | test.columns = ["text"]
27 |
28 |
29 | # Topic Modeling / LDA feature extraction
30 | stop = set(stopwords.words("english"))
31 | exclude = set(string.punctuation)
32 | lemma = WordNetLemmatizer()
33 |
34 |
35 | def clean(doc):
36 | stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
37 | punc_free = "".join(ch for ch in stop_free if ch not in exclude)
38 | normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
39 | return normalized
40 |
41 |
42 | doc_clean_train = [clean(doc).split() for doc in list(train["text"])]
43 | doc_clean_validation = [clean(doc).split() for doc in list(validation["text"])]
44 | doc_clean_test = [clean(doc).split() for doc in list(test["text"])]
45 |
46 | dictionary = corpora.Dictionary(doc_clean_train)
47 |
48 | Lda = gensim.models.ldamodel.LdaModel
49 |
50 |
51 | def tm_lda_feature_extract(doc_clean, df):
52 | doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
53 | lmodel = Lda(doc_term_matrix, num_topics=50, id2word=dictionary, passes=50)
54 | feature_matrix_lda = np.zeros(shape=(df.shape[0], 50)) # as number of topics is 50
55 | rw = 0
56 | for dd in doc_clean:
57 | bow_vector = dictionary.doc2bow(dd)
58 | lis = lmodel.get_document_topics(
59 | bow_vector,
60 | minimum_probability=None,
61 | minimum_phi_value=None,
62 | per_word_topics=False,
63 | )
64 | for (a, b) in lis:
65 | feature_matrix_lda[rw, a] = b
66 | rw = rw + 1
67 | feature_lda_df = pd.DataFrame(feature_matrix_lda)
68 | return feature_lda_df
69 |
70 |
71 | feature_lda_df_train = tm_lda_feature_extract(doc_clean_train, train)
72 | feature_lda_df_validation = tm_lda_feature_extract(doc_clean_validation, validation)
73 | feature_lda_df_test = tm_lda_feature_extract(doc_clean_test, test)
74 |
--------------------------------------------------------------------------------
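The script above computes `feature_lda_df_train`, `feature_lda_df_validation` and `feature_lda_df_test` but does not persist them, while `2_roberta_on_abstract_text_combined_with_lda.py` reads `./feature_lda_df_train.csv` and `./feature_lda_df_validation.csv`. A save step along these lines is presumably needed (paths are assumptions based on the file names used in script 2):

```python
# Presumed persistence step: write the 50-dimensional topic features to CSV so that
# 2_roberta_on_abstract_text_combined_with_lda.py can read them back.
feature_lda_df_train.to_csv("./feature_lda_df_train.csv", index=False)
feature_lda_df_validation.to_csv("./feature_lda_df_validation.csv", index=False)
feature_lda_df_test.to_csv("./feature_lda_df_test.csv", index=False)
```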
/FideLIPI/5_ensemble.py:
--------------------------------------------------------------------------------
1 | """
2 | This script ensembles the 4 child models by taking the popular vote across their predictions.
3 | Ties are broken arbitrarily.
4 |
5 | Author : Ankush Chopra (ankush01729@gmail.com)
6 | """
7 | import pandas as pd
8 | import numpy as np
9 | import re, sys, os
10 |
11 | import ast
12 |
13 | from collections import Counter
14 | from sklearn.metrics import f1_score
15 |
16 |
17 | def ensemble_model(lda_path, sentence_model_path, vanila_model_path, tf_idf_model_path):
18 | """
19 |     This function takes the 4 child models' output file locations as input and returns the ensemble predictions as output.
20 | """
21 |
22 | # reading the data prediction by 4 child models.
23 | lda = pd.read_csv(lda_path)
24 | sentence = pd.read_csv(sentence_model_path)
25 | vanila = pd.read_csv(vanila_model_path)
26 | with open(tf_idf_model_path) as f:
27 | g = f.readlines()
28 | tf_idf = pd.DataFrame(ast.literal_eval(g[0]), columns=["predicted_labels"])
29 |
30 | # combining all 4 model predictions into one dataframe
31 | lda.reset_index(inplace=True)
32 | lda = pd.merge(lda, vanila, how="left", left_index=True, right_index=True)
33 | lda = lda[["index_x", "abs_text_x", "label_text_x", "pred", "pred_text"]]
34 | lda.columns = [
35 | "ind",
36 | "abs_text",
37 | "label_text",
38 | "model_with_LDA_text",
39 | "whole_abs_model_text",
40 | ]
41 | ddf = pd.merge(lda, sentence, how="left", on="ind")
42 | ddf.columns = [
43 | "ind",
44 | "abs_text",
45 | "label_text",
46 | "model_with_LDA_text",
47 | "whole_abs_model_text",
48 | "true_label",
49 | "sentence_model_text",
50 | "true_label_text",
51 | "pred_label_text",
52 | ]
53 | ddf = pd.concat([ddf, tf_idf], axis=1)
54 |
55 | # getting the final prediction by taking the max vote and breaking the ties arbitrarily
56 | my_dict = {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6}
57 | ddf["predicted_labels"] = ddf.predicted_labels.map(lambda x: my_dict[x])
58 | ddf["combined_prediction_4"] = ddf[
59 | [
60 | "whole_abs_model_text",
61 | "model_with_LDA_text",
62 | "sentence_model_text",
63 | "predicted_labels",
64 | ]
65 | ].values.tolist()
66 | ddf["selected_from_combined_prediction_4"] = ddf["combined_prediction_4"].apply(
67 | lambda x: Counter(x).most_common(1)[0][0]
68 | )
69 |
70 | return ddf
71 |
72 |
73 | # f1 score calculation of ensemble model on training data.
74 | train_out = ensemble_model(
75 | "./LDA_and_transformer_on_whole_abstract_train_data.csv",
76 | "./sentence_model_sentence_above_len6_train_prediction_model_above_len_10.csv",
77 | "only_transformer_on_whole_abstract_train_data.csv",
78 | "logistic_regression_tfidf_v2_train_predictions.txt",
79 | )
80 | train_f1 = f1_score(
81 | train_out.label_text,
82 | train_out.selected_from_combined_prediction_4,
83 | average="weighted",
84 | )
85 |
86 | # f1 score calculation of ensemble model on validation data.
87 | val_out = ensemble_model(
88 | "./LDA_and_transformer_on_whole_abstract_val_data.csv",
89 | "./sentence_model_sentence_above_len6_val_prediction_model_above_len_10.csv",
90 | "only_transformer_on_whole_abstract_val_data.csv",
91 | "logistic_regression_tfidf_v2_val_predictions.txt",
92 | )
93 | val_f1 = f1_score(
94 | val_out.label_text, val_out.selected_from_combined_prediction_4, average="weighted"
95 | )
96 |
--------------------------------------------------------------------------------
/FideLIPI/3_sentence_level_model.py:
--------------------------------------------------------------------------------
1 | """
2 | This script trains a model on the abstract dataset using the pretrained RoBERTa model. We use the Simple Transformers library to train the model. We first break the abstracts into sentences and assign each sentence the same label as its original abstract. We then train a model on this sentence-level data. Since shorter sentences may not have enough predictive power, we train 4 models by selecting sentences above certain word counts to test this hypothesis. We find that models trained on sentences longer than 10 words perform best, and applying a sentence-length filter of 6 on the validation data gives us the best validation performance.
3 |
4 | Author: Ankush Chopra (ankush01729@gmail.com)
5 | """
6 |
7 | import os
8 | import re
9 | import torch
10 | import spacy
11 | import pandas as pd
12 | import numpy as np
13 | from operator import itemgetter
14 | from sklearn.metrics import f1_score, confusion_matrix
15 | from simpletransformers.classification import ClassificationModel, ClassificationArgs
16 |
17 | # setting up the right device type
18 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
19 |
20 | nlp = spacy.load("en_core_web_sm")
21 |
22 |
23 | def sentence_level_data_prep(df):
24 | """
25 | This function splits the abstracts into sentences. It uses spacy for sentence tokenization.
26 | """
27 |
28 | inds = []
29 | sentences_extracted = []
30 | for abstract, ind in zip(df["text"].values, df.index):
31 | for i in nlp(str(abstract).replace("\n", "")).sents:
32 | sentences_extracted.append(str(i))
33 | inds.append(ind)
34 | sent_df = pd.DataFrame(
35 | {"ind": inds, "sentences_from_abstract": sentences_extracted}
36 | )
37 | return sent_df
38 |
39 |
40 | df = pd.read_csv(r"./train.csv", header=None, names=["text", "labels"])
41 | sentences_train = sentence_level_data_prep(df)
42 | df.reset_index(inplace=True)
43 | df.columns = ["ind", "text", "labels"]
44 | sentences_train = sentences_train.merge(df[["ind", "labels"]], on="ind", how="inner")
45 |
46 | sentences_train["sentence_length"] = sentences_train.sentences_from_abstract.map(
47 | lambda x: len(x.split())
48 | )
49 | sentences_train["label_text"] = pd.Categorical(sentences_train.labels)
50 | sentences_train["labels"] = sentences_train.label_text.cat.codes
51 |
52 |
53 | model_args = ClassificationArgs(
54 | num_train_epochs=10,
55 | sliding_window=True,
56 | fp16=False,
57 | use_early_stopping=True,
58 | reprocess_input_data=True,
59 | overwrite_output_dir=True,
60 | )
61 |
62 | # Create a ClassificationModel
63 | model = ClassificationModel("roberta", "roberta-base", num_labels=7, args=model_args)
64 |
65 | # We train 4 models by selecting sentences longer than sent_len words. Each model is trained for 10 epochs and every epoch is saved. At the end, we select the best model from these 40 saved epoch checkpoints by picking the one that does best on the validation set.
66 | #
67 | for sent_len in [0, 6, 10, 15]:
68 | print(sent_len)
69 |     sentences_train_filtered = sentences_train[
70 |         (sentences_train["sentence_length"] > sent_len)
71 |     ]
72 |     sentences_train_filtered.reset_index(inplace=True, drop=True)
73 |     train = sentences_train_filtered[["sentences_from_abstract", "labels"]]
74 |
75 | # Optional model configuration
76 | output_dir = "./roberta_model_sentence_" + str(sent_len)
77 | best_model_dir = output_dir + "/best_model/"
78 | cache_dir = output_dir + "/cache/"
79 | print(output_dir)
80 | model_args = ClassificationArgs(
81 | cache_dir=cache_dir,
82 | output_dir=output_dir,
83 | best_model_dir=best_model_dir,
84 | num_train_epochs=10,
85 | sliding_window=True,
86 | fp16=False,
87 | use_early_stopping=True,
88 | reprocess_input_data=True,
89 | overwrite_output_dir=True,
90 | )
91 |
92 | # Create a ClassificationModel
93 | model = ClassificationModel(
94 | "roberta", "roberta-base", num_labels=7, args=model_args
95 | )
96 | # You can set class weights by using the optional weight argument
97 | # Train the model
98 | model.train_model(train)
99 |
--------------------------------------------------------------------------------
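The script above only trains the sentence-level models; the prediction step described in the docstring (taking the label with the highest combined probability over an abstract's sentences) is not included. A minimal sketch of that aggregation, assuming per-sentence class probabilities from the selected checkpoint are already available:

```python
import numpy as np
import pandas as pd

def aggregate_sentence_predictions(sent_df, sent_probs):
    """Combine per-sentence class probabilities into one label per abstract.

    sent_df is the sentence-level frame built above (one row per sentence, with the
    abstract index in the "ind" column); sent_probs is an assumed (n_sentences, n_classes)
    array of class probabilities from the selected checkpoint.
    """
    probs = pd.DataFrame(np.asarray(sent_probs), index=sent_df["ind"].values)
    # sum the probabilities of all sentences from the same abstract and
    # pick the class with the highest combined probability
    return probs.groupby(level=0).sum().idxmax(axis=1)
```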
/FideLIPI/4_tf_idf_logistic_model.py:
--------------------------------------------------------------------------------
1 | """
2 | This script trains a model on the abstract dataset using features created with the TF-IDF vectorizer. The model is trained using the logistic regression algorithm, which utilizes the ~22K features created from 1- to 4-gram tokens and their TF-IDF values.
3 | Author: Sohom Ghosh
4 | """
5 |
6 | import re
7 | import os
8 | import pandas as pd
9 | import numpy as np
10 | import string
11 | from sklearn.feature_extraction.text import TfidfVectorizer
12 | from sklearn.linear_model import LogisticRegression
13 | from sklearn.metrics import f1_score, confusion_matrix, classification_report
13 |
14 |
15 | PATH = "/data/disk3/pakdd/"
16 |
17 | # reading the input data files
18 | train = pd.read_excel(PATH + "train.xlsx", sheet_name="train", header=None)
19 | train.columns = ["text", "label"]
20 | validation = pd.read_excel(
21 | PATH + "validation.xlsx", sheet_name="validation", header=None
22 | )
23 | validation.columns = ["text", "label"]
24 | test = pd.read_excel(PATH + "test.xlsx", sheet_name="test", header=None)
25 | test.columns = ["text"]
26 |
27 | # TF-IDF feature creation
28 | tfidf_model_original_v2 = TfidfVectorizer(
29 | ngram_range=(1, 4), min_df=0.0005, stop_words="english"
30 | )
31 | tfidf_model_original_v2.fit(train["text"])
32 |
33 | # train
34 | tfidf_df_train_original_v2 = pd.DataFrame(
35 | tfidf_model_original_v2.transform(train["text"]).todense()
36 | )
37 | tfidf_df_train_original_v2.columns = sorted(tfidf_model_original_v2.vocabulary_)
38 |
39 | # validation
40 | tfidf_df_valid_original_v2 = pd.DataFrame(
41 | tfidf_model_original_v2.transform(validation["text"]).todense()
42 | )
43 | tfidf_df_valid_original_v2.columns = sorted(tfidf_model_original_v2.vocabulary_)
44 |
45 | # test
46 | tfidf_df_test_original_v2 = pd.DataFrame(
47 | tfidf_model_original_v2.transform(test["text"]).todense()
48 | )
49 | tfidf_df_test_original_v2.columns = sorted(tfidf_model_original_v2.vocabulary_)
50 |
51 |
52 | # Logistic Regression on tfidf_v2 (22K features)
53 | def model(clf, train_X, train_y, valid_X, valid_y):
54 | clf.fit(train_X, train_y)
55 | pred_tr = clf.predict(train_X)
56 | pred_valid = clf.predict(valid_X)
57 | print("\nTraining F1:{}".format(f1_score(train_y, pred_tr, average="weighted")))
58 | print("Training Confusion Matrix \n{}".format(confusion_matrix(train_y, pred_tr)))
59 | print("Classification Report: \n{}".format(classification_report(train_y, pred_tr)))
60 | print(
61 | "\nValidation F1:{}".format(f1_score(valid_y, pred_valid, average="weighted"))
62 | )
63 | print(
64 | "Validation Confusion Matrix \n{}".format(confusion_matrix(valid_y, pred_valid))
65 | )
66 | print(
67 | "Classification Report: \n{}".format(classification_report(valid_y, pred_valid))
68 | )
69 |
70 |
71 | lr_cnt = 0
72 | train_X = tfidf_df_train_original_v2
73 | valid_X = tfidf_df_valid_original_v2
74 | test_X = tfidf_df_test_original_v2
75 | train_y = train["label"].replace(
76 | {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6}
77 | )
78 | valid_y = validation["label"].replace(
79 | {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6}
80 | )
81 | info = "tfidf_v2_only"
82 |
83 |
84 | print("\n ################# LR VERSION ################# " + str(lr_cnt) + "\n")
85 |
86 | # Initializing logistic regression and training the model
87 | lr_clf = LogisticRegression(solver="lbfgs", n_jobs=-1)
88 | model(lr_clf, train_X, train_y, valid_X, valid_y)
89 | params = lr_clf.get_params()
90 | pred_tr = lr_clf.predict(train_X)
91 | pred_valid = lr_clf.predict(valid_X)
92 | open("lr_report_v" + str(lr_cnt) + info + ".txt", "w").write(
93 | str(info)
94 | + "\n\n"
95 | + str(params)
96 | + "\n\n lr_v"
97 | + str(lr_cnt)
98 | + ".pickle.dat"
99 | + "\n\n Training Confusion Matrix \n{}".format(confusion_matrix(train_y, pred_tr))
100 | + "\n\n Training Classification Report: \n{}".format(
101 | classification_report(train_y, pred_tr)
102 | )
103 | + "\n\n Validation Confusion Matrix \n{}".format(
104 | confusion_matrix(valid_y, pred_valid)
105 | )
106 | + "\n\n Validation Classification Report: \n{}".format(
107 | classification_report(valid_y, pred_valid)
108 | )
109 | )
110 | validation_predicted_lr_best = lr_clf.predict(valid_X)
111 | repl_di = {0: "CL", 1: "CR", 2: "DC", 3: "DS", 4: "LO", 5: "NI", 6: "SE"}
112 | open(PATH + "logistic_regression_tfidf_v2_validation_predictions.txt", "w").write(
113 | str([repl_di[i] for i in validation_predicted_lr_best])
114 | )
115 |
116 | test_predicted_lr_best = lr_clf.predict(test_X)
117 | pd.DataFrame({"predicted_labels": [repl_di[i] for i in test_predicted_lr_best]}).to_csv(
118 | PATH + "logistic_reggression_on_tfidf_v2_22K_features_predicted_on_test.csv",
119 | index=False,
120 | )
121 |
--------------------------------------------------------------------------------
/FideLIPI/1_roberta_on_whole_abstract.py:
--------------------------------------------------------------------------------
1 | """
2 | This script performs the model training on the abstract dataset using the pretrained RoBERTa model with a classifier head. It fine-tunes RoBERTa while training for the classification task.
3 | We let this run for 20 epochs and saved all the models. We selected the best model's epoch as the point where performance on the validation set stopped improving.
4 |
5 | Author: Ankush Chopra (ankush01729@gmail.com)
6 | """
7 | import os
8 | import torch
9 | import pandas as pd
10 | from torch.utils.data import Dataset, DataLoader
11 | from transformers import RobertaModel, RobertaTokenizer
12 |
13 | # setting up the device type
14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
15 |
16 |
17 | # reading dataset
18 | df = pd.read_csv(r"./train.csv", header=None, names=["abs_text", "label_text"])
19 | val_df = pd.read_csv(r"./validation.csv", header=None, names=["abs_text", "label_text"])
20 |
21 |
22 | # # Converting the codes to appropriate categories using a dictionary
23 | my_dict = {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6}
24 |
25 |
26 | def update_cat(x):
27 | return my_dict[x]
28 |
29 |
30 | df["label_text"] = df["label_text"].apply(lambda x: update_cat(x))
31 | val_df["label_text"] = val_df["label_text"].apply(lambda x: update_cat(x))
32 |
33 | # Defining some key variables that will be used later on in the training
34 | MAX_LEN = 512
35 | TRAIN_BATCH_SIZE = 32
36 | VALID_BATCH_SIZE = 64
37 | EPOCHS = 20
38 | LEARNING_RATE = 2e-05
39 | tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
40 |
41 |
42 | class Triage(Dataset):
43 | """
44 |     This is a subclass of the torch package's Dataset class. It processes the input to create the ids, masks and targets required for model training.
45 | """
46 |
47 | def __init__(self, dataframe, tokenizer, max_len, text_col_name, category_col):
48 | self.len = len(dataframe)
49 | self.data = dataframe
50 | self.tokenizer = tokenizer
51 | self.max_len = max_len
52 | self.text_col_name = text_col_name
53 | self.category_col = category_col
54 |
55 | def __getitem__(self, index):
56 | title = str(self.data[self.text_col_name][index])
57 | title = " ".join(title.split())
58 | inputs = self.tokenizer.encode_plus(
59 | title,
60 | None,
61 | add_special_tokens=True,
62 | max_length=self.max_len,
63 | pad_to_max_length=True,
64 | return_token_type_ids=True,
65 | truncation=True,
66 | )
67 | ids = inputs["input_ids"]
68 | mask = inputs["attention_mask"]
69 |
70 | return {
71 | "ids": torch.tensor(ids, dtype=torch.long),
72 | "mask": torch.tensor(mask, dtype=torch.long),
73 | "targets": torch.tensor(
74 | self.data[self.category_col][index], dtype=torch.long
75 | ),
76 | }
77 |
78 | def __len__(self):
79 | return self.len
80 |
81 |
82 | # dataset specifics
83 | text_col_name = "abs_text"
84 | category_col = "label_text"
85 |
86 | training_set = Triage(df, tokenizer, MAX_LEN, text_col_name, category_col)
87 | validation_set = Triage(val_df, tokenizer, MAX_LEN, text_col_name, category_col)
88 |
89 |
90 | # data loader parameters
91 | train_params = {"batch_size": TRAIN_BATCH_SIZE, "shuffle": True, "num_workers": 0}
92 |
93 | test_params = {"batch_size": VALID_BATCH_SIZE, "shuffle": False, "num_workers": 0}
94 |
95 | # creating dataloader for modelling
96 | training_loader = DataLoader(training_set, **train_params)
97 | val_loader = DataLoader(validation_set, **test_params)
98 |
99 |
100 | class BERTClass(torch.nn.Module):
101 | """
102 |     This is the modelling class which adds a classification layer on top of the RoBERTa model. We fine-tune RoBERTa while training for the label classification.
103 | """
104 |
105 | def __init__(self, num_class):
106 | super(BERTClass, self).__init__()
107 | self.num_class = num_class
108 | self.l1 = RobertaModel.from_pretrained("roberta-base")
109 | self.pre_classifier = torch.nn.Linear(768, 768)
110 | self.dropout = torch.nn.Dropout(0.3)
111 | self.classifier = torch.nn.Linear(768, self.num_class)
112 | self.history = dict()
113 |
114 | def forward(self, input_ids, attention_mask):
115 | output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
116 | hidden_state = output_1[0]
117 | pooler = hidden_state[:, 0]
118 | pooler = self.pre_classifier(pooler)
119 | pooler = torch.nn.ReLU()(pooler)
120 | pooler = self.dropout(pooler)
121 | output = self.classifier(pooler)
122 | return output
123 |
124 |
125 | # initializing and moving the model to the appropriate device
126 | model = BERTClass(7)
127 | model.to(device)
128 |
129 | # Creating the loss function and optimizer
130 | loss_function = torch.nn.CrossEntropyLoss()
131 | optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
132 |
133 |
134 | def calcuate_accu(big_idx, targets):
135 | """
136 | This function compares the predicted output with ground truth to give the count of the correct predictions.
137 | """
138 | n_correct = (big_idx == targets).sum().item()
139 | return n_correct
140 |
141 |
142 | def train(epoch):
143 | """
144 | Function to train the model. This function utilizes the model initialized using BERTClass. It trains the model and provides the accuracy on the training set.
145 | """
146 | tr_loss = 0
147 | n_correct = 0
148 | nb_tr_steps = 0
149 | nb_tr_examples = 0
150 | model.train()
151 | for _, data in enumerate(training_loader, 0):
152 | ids = data["ids"].to(device, dtype=torch.long)
153 | mask = data["mask"].to(device, dtype=torch.long)
154 | targets = data["targets"].to(device, dtype=torch.long)
155 | outputs = model(ids, mask)
156 | loss = loss_function(outputs, targets)
157 | tr_loss += loss.item()
158 | big_val, big_idx = torch.max(outputs.data, dim=1)
159 | n_correct += calcuate_accu(big_idx, targets)
160 |
161 | nb_tr_steps += 1
162 | nb_tr_examples += targets.size(0)
163 |
164 | if _ % 250 == 0:
165 | loss_step = tr_loss / nb_tr_steps
166 | accu_step = (n_correct * 100) / nb_tr_examples
167 | print(f"Training Loss per 250 steps: {loss_step}")
168 | print(f"Training Accuracy per 250 steps: {accu_step}")
169 |
170 | optimizer.zero_grad()
171 | loss.backward()
172 | # # When using GPU
173 | optimizer.step()
174 |
175 | print(f"The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}")
176 | epoch_loss = tr_loss / nb_tr_steps
177 | epoch_accu = (n_correct * 100) / nb_tr_examples
178 | print(f"Training Loss Epoch: {epoch_loss}")
179 | print(f"Training Accuracy Epoch: {epoch_accu}")
180 |
181 | return epoch_loss, epoch_accu
182 |
183 |
184 | def valid(model, testing_loader):
185 | """
186 | This function calculates the performance numbers on the validation set.
187 | """
188 | model.eval()
189 | n_correct = 0
190 | n_wrong = 0
191 | total = 0
192 | tr_loss = 0
193 | nb_tr_steps = 0
194 | nb_tr_examples = 0
195 | with torch.no_grad():
196 | for _, data in enumerate(testing_loader, 0):
197 | ids = data["ids"].to(device, dtype=torch.long)
198 | mask = data["mask"].to(device, dtype=torch.long)
199 | targets = data["targets"].to(device, dtype=torch.long)
200 | outputs = model(ids, mask).squeeze()
201 | loss = loss_function(outputs, targets)
202 | tr_loss += loss.item()
203 | big_val, big_idx = torch.max(outputs.data, dim=1)
204 | n_correct += calcuate_accu(big_idx, targets)
205 |
206 | nb_tr_steps += 1
207 | nb_tr_examples += targets.size(0)
208 |
209 | epoch_loss = tr_loss / nb_tr_steps
210 | epoch_accu = (n_correct * 100) / nb_tr_examples
211 | print(f"Validation Loss Epoch: {epoch_loss}")
212 | print(f"Validation Accuracy Epoch: {epoch_accu}")
213 |
214 | return epoch_loss, epoch_accu
215 |
216 |
217 | # path to save models at the end of the epochs
218 | PATH = "./transformer_model_roberta/"
219 | if not os.path.exists(PATH):
220 | os.makedirs(PATH)
221 |
222 | # variable to store the model performance at the epoch level
223 | model.history["train_acc"] = []
224 | model.history["val_acc"] = []
225 | model.history["train_loss"] = []
226 | model.history["val_loss"] = []
227 |
228 | # model training
229 | for epoch in range(EPOCHS):
230 | print("Epoch number : ", epoch)
231 | train_loss, train_accu = train(epoch)
232 | val_loss, val_accu = valid(model, val_loader)
233 | model.history["train_acc"].append(train_accu)
234 | model.history["train_loss"].append(train_loss)
235 | model.history["val_acc"].append(val_accu)
236 | model.history["val_loss"].append(val_loss)
237 | torch.save(
238 | {
239 | "epoch": epoch,
240 | "model_state_dict": model.state_dict(),
241 | "optimizer_state_dict": optimizer.state_dict(),
242 | },
243 | PATH + "/epoch_" + str(epoch) + ".bin",
244 | )
245 |
246 |
--------------------------------------------------------------------------------
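The loop above saves a checkpoint at every epoch; per the docstring, the best epoch is the one whose validation performance stopped improving. A minimal sketch of restoring one of those checkpoints for inference (the epoch number is an assumption):

```python
import torch

# `model`, `PATH` and `device` are the objects defined in 1_roberta_on_whole_abstract.py above;
# the epoch number is an assumption (chosen from model.history["val_acc"] / ["val_loss"]).
best_epoch = 7
checkpoint = torch.load(PATH + "/epoch_" + str(best_epoch) + ".bin", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```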
/FideLIPI/2_roberta_on_abstract_text_combined_with_lda.py:
--------------------------------------------------------------------------------
1 | """
2 | This script performs the model training on the abstract dataset using the pretrained RoBERTa model with a classifier head. It fine-tunes RoBERTa while training for the classification task. Along with the RoBERTa representation of the abstract, we also use LDA vectors to train the model.
3 | We let this run for 20 epochs and saved all the models. We selected the best model's epoch as the point where performance on the validation set stopped improving.
4 |
5 | Author: Ankush Chopra (ankush01729@gmail.com)
6 | """
7 | import os
8 | import torch
9 | import pandas as pd
10 | from torch.utils.data import Dataset, DataLoader
11 | from transformers import RobertaModel, RobertaTokenizer
12 |
13 | # setting up the device type
14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
15 |
16 | # reading dataset
17 | df = pd.read_csv(r"./train.csv", header=None, names=["abs_text", "label_text"])
18 | val_df = pd.read_csv(r"./validation.csv", header=None, names=["abs_text", "label_text"])
19 |
20 | # reading additional features which are derived from topic models using LDA.
21 | lda_train = pd.read_csv("./feature_lda_df_train.csv")
22 | lda_valid = pd.read_csv("./feature_lda_df_validation.csv")
23 |
24 | # concatinating topic vectors from LDA to the abstract dataset
25 | df = pd.concat([df, lda_train], axis=1)
26 | val_df = pd.concat([val_df, lda_valid], axis=1)
27 |
28 | # # Converting the codes to appropriate categories using a dictionary
29 | my_dict = {"CL": 0, "CR": 1, "DC": 2, "DS": 3, "LO": 4, "NI": 5, "SE": 6}
30 |
31 |
32 | def update_cat(x):
33 | """
34 | Function to replace text labels with integer classes
35 | """
36 | return my_dict[x]
37 |
38 |
39 | df["label_text"] = df["label_text"].apply(lambda x: update_cat(x))
40 | val_df["label_text"] = val_df["label_text"].apply(lambda x: update_cat(x))
41 |
42 | # Defining some key variables that will be used later on in the training
43 | MAX_LEN = 512
44 | TRAIN_BATCH_SIZE = 32
45 | VALID_BATCH_SIZE = 8
46 | EPOCHS = 1
47 | LEARNING_RATE = 2e-05
48 | tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
49 |
50 |
51 | class Triage(Dataset):
52 | """
53 |     This is a subclass of the torch package's Dataset class. It processes the input to create the ids, masks and targets required for model training.
54 | """
55 |
56 |     def __init__(self, dataframe, tokenizer, max_len, text_col_name, category_col):
57 | self.len = len(dataframe)
58 | self.data = dataframe
59 | self.tokenizer = tokenizer
60 | self.max_len = max_len
61 | self.text_col_name = text_col_name
62 |         self.category_col = category_col
63 | self.col_names = list(dataframe)
64 |
65 | def __getitem__(self, index):
66 | title = str(self.data[self.text_col_name][index])
67 | title = " ".join(title.split())
68 | inputs = self.tokenizer.encode_plus(
69 | title,
70 | None,
71 | add_special_tokens=True,
72 | max_length=self.max_len,
73 | pad_to_max_length=True,
74 | return_token_type_ids=True,
75 | truncation=True,
76 | )
77 | ids = inputs["input_ids"]
78 | mask = inputs["attention_mask"]
79 |
80 | return {
81 | "ids": torch.tensor(ids, dtype=torch.long),
82 | "mask": torch.tensor(mask, dtype=torch.long),
83 | "targets": torch.tensor(
84 |                     self.data[self.category_col][index], dtype=torch.long
85 | ),
86 | "tf_idf_feature": torch.tensor(
87 | self.data.loc[index, self.col_names[2:]], dtype=torch.float32
88 | ),
89 | }
90 |
91 | def __len__(self):
92 | return self.len
93 |
94 |
95 | # dataset specifics
96 | text_col_name = "abs_text"
97 | category_col = "label_text"
98 |
99 | training_set = Triage(df, tokenizer, MAX_LEN, text_col_name, category_col)
100 | validation_set = Triage(val_df, tokenizer, MAX_LEN, text_col_name, category_col)
101 |
102 |
103 | # data loader parameters
104 | train_params = {"batch_size": TRAIN_BATCH_SIZE, "shuffle": True, "num_workers": 0}
105 |
106 | test_params = {"batch_size": VALID_BATCH_SIZE, "shuffle": False, "num_workers": 0}
107 |
108 | # creating dataloader for modelling
109 | training_loader = DataLoader(training_set, **train_params)
110 | val_loader = DataLoader(validation_set, **test_params)
111 |
112 |
113 | class BERTClass(torch.nn.Module):
114 | """
115 |     This is the modelling class which adds a classification layer on top of the RoBERTa model. We fine-tune RoBERTa while training for the label classification.
116 | """
117 |
118 | def __init__(self, num_class):
119 | super(BERTClass, self).__init__()
120 | self.num_class = num_class
121 | self.l1 = RobertaModel.from_pretrained("roberta-base")
122 | self.hc_features = torch.nn.Linear(50, 128)
123 | self.from_bert = torch.nn.Linear(768, 128)
124 | self.dropout = torch.nn.Dropout(0.3)
125 | self.pre_classifier = torch.nn.Linear(256, 128)
126 | self.classifier = torch.nn.Linear(128, self.num_class)
127 | self.history = dict()
128 |
129 | def forward(self, input_ids, attention_mask, other_features):
130 | output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
131 | hidden_state = output_1[0]
132 | pooler = hidden_state[:, 0]
133 | pooler = self.from_bert(pooler)
134 | other_feature_layer = self.hc_features(other_features)
135 | combined_features = torch.cat((pooler, other_feature_layer), dim=1)
136 | combined_features = torch.nn.ReLU()(combined_features)
137 | combined_features = self.dropout(combined_features)
138 | combined_features = self.pre_classifier(combined_features)
139 | output = self.classifier(combined_features)
140 |
141 | return output
142 |
143 |
144 | # initializing and moving the model to the appropriate device
145 | model = BERTClass(7)
146 | model.to(device)
147 |
148 | # Creating the loss function and optimizer
149 | loss_function = torch.nn.CrossEntropyLoss()
150 | optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
151 |
152 |
153 | def calcuate_accu(big_idx, targets):
154 | """
155 | This function compares the predicted output with ground truth to give the count of the correct predictions.
156 | """
157 | n_correct = (big_idx == targets).sum().item()
158 | return n_correct
159 |
160 |
161 | def train(epoch):
162 | """
163 | Function to train the model. This function utilizes the model initialized using BERTClass. It trains the model and provides the accuracy on the training set.
164 | """
165 | tr_loss = 0
166 | n_correct = 0
167 | nb_tr_steps = 0
168 | nb_tr_examples = 0
169 | model.train()
170 | for _, data in enumerate(training_loader, 0):
171 | ids = data["ids"].to(device, dtype=torch.long)
172 | mask = data["mask"].to(device, dtype=torch.long)
173 | targets = data["targets"].to(device, dtype=torch.long)
174 | tf_idf_feature = data["tf_idf_feature"].to(device, dtype=torch.float32)
175 |
176 | outputs = model(ids, mask, tf_idf_feature)
177 | loss = loss_function(outputs, targets)
178 | tr_loss += loss.item()
179 | big_val, big_idx = torch.max(outputs.data, dim=1)
180 | n_correct += calcuate_accu(big_idx, targets)
181 |
182 | nb_tr_steps += 1
183 | nb_tr_examples += targets.size(0)
184 |
185 | if _ % 250 == 0:
186 | loss_step = tr_loss / nb_tr_steps
187 | accu_step = (n_correct * 100) / nb_tr_examples
188 | print(f"Training Loss per 250 steps: {loss_step}")
189 | print(f"Training Accuracy per 250 steps: {accu_step}")
190 |
191 | optimizer.zero_grad()
192 | loss.backward()
193 | # # When using GPU
194 | optimizer.step()
195 |
196 | print(f"The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}")
197 | epoch_loss = tr_loss / nb_tr_steps
198 | epoch_accu = (n_correct * 100) / nb_tr_examples
199 | print(f"Training Loss Epoch: {epoch_loss}")
200 | print(f"Training Accuracy Epoch: {epoch_accu}")
201 |
202 | return epoch_loss, epoch_accu
203 |
204 |
205 | def valid(model, testing_loader):
206 | """
207 | This function calculates the performance numbers on the validation set.
208 | """
209 | model.eval()
210 | n_correct = 0
211 | n_wrong = 0
212 | total = 0
213 | tr_loss = 0
214 | nb_tr_steps = 0
215 | nb_tr_examples = 0
216 | with torch.no_grad():
217 | for _, data in enumerate(testing_loader, 0):
218 | ids = data["ids"].to(device, dtype=torch.long)
219 | mask = data["mask"].to(device, dtype=torch.long)
220 | targets = data["targets"].to(device, dtype=torch.long)
221 | tf_idf_feature = data["tf_idf_feature"].to(device, dtype=torch.float32)
222 | outputs = model(ids, mask, tf_idf_feature).squeeze()
223 | loss = loss_function(outputs, targets)
224 | tr_loss += loss.item()
225 | big_val, big_idx = torch.max(outputs.data, dim=1)
226 | n_correct += calcuate_accu(big_idx, targets)
227 |
228 | nb_tr_steps += 1
229 | nb_tr_examples += targets.size(0)
230 |
231 | epoch_loss = tr_loss / nb_tr_steps
232 | epoch_accu = (n_correct * 100) / nb_tr_examples
233 | print(f"Validation Loss Epoch: {epoch_loss}")
234 | print(f"Validation Accuracy Epoch: {epoch_accu}")
235 |
236 | return epoch_loss, epoch_accu
237 |
238 |
239 | # path to save models at the end of the epochs
240 | PATH = "./transformer_model_roberta_with_lda/"
241 | if not os.path.exists(PATH):
242 | os.makedirs(PATH)
243 |
244 | # variable to store the model performance at the epoch level
245 | model.history["train_acc"] = []
246 | model.history["val_acc"] = []
247 | model.history["train_loss"] = []
248 | model.history["val_loss"] = []
249 |
250 | # model training
251 | for epoch in range(EPOCHS):
252 | print("Epoch number : ", epoch)
253 | train_loss, train_accu = train(epoch)
254 | val_loss, val_accu = valid(model, val_loader)
255 | model.history["train_acc"].append(train_accu)
256 | model.history["train_loss"].append(train_loss)
257 | model.history["val_acc"].append(val_accu)
258 | model.history["val_loss"].append(val_loss)
259 | torch.save(
260 | {
261 | "epoch": epoch,
262 | "model_state_dict": model.state_dict(),
263 | "optimizer_state_dict": optimizer.state_dict(),
264 | },
265 | PATH + "/epoch_" + str(epoch) + ".bin",
266 | )
267 |
--------------------------------------------------------------------------------
/utmn/3_feedforward_+_ensembling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "3. feedforward + ensembling.ipynb",
7 | "provenance": [],
8 | "toc_visible": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | }
14 | },
15 | "cells": [
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {
19 | "id": "jxiXI1lAheOB"
20 | },
21 | "source": [
22 | "#load data and libraries"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "metadata": {
28 | "id": "TBUF6vAff3TA"
29 | },
30 | "source": [
31 | "!pip install pymorphy2\r\n",
32 | "import numpy as np\r\n",
33 | "import pandas as pd\r\n",
34 | "import math\r\n",
35 | "from sklearn.preprocessing import OneHotEncoder\r\n",
36 | "import re, os, pickle\r\n",
37 | "\r\n",
38 | "import keras\r\n",
39 | "from keras import Sequential\r\n",
40 | "from keras.preprocessing.text import Tokenizer\r\n",
41 | "from keras.preprocessing.sequence import pad_sequences\r\n",
42 | "from keras.utils import to_categorical\r\n",
43 | "\r\n",
44 | "from keras.layers import Input, Embedding, Activation, Flatten, Dense, concatenate\r\n",
45 | "from keras.layers import Conv1D, MaxPooling1D, Dropout, LSTM\r\n",
46 | "from keras.models import Model\r\n",
47 | "\r\n",
48 | "!pip install imblearn\r\n",
49 | "from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE"
50 | ],
51 | "execution_count": null,
52 | "outputs": []
53 | },
54 | {
55 | "cell_type": "code",
56 | "metadata": {
57 | "id": "KZUX1JC_gEZY"
58 | },
59 | "source": [
60 | "from google.colab import drive\r\n",
61 | "drive.mount('/content/drive')"
62 | ],
63 | "execution_count": null,
64 | "outputs": []
65 | },
66 | {
67 | "cell_type": "code",
68 | "metadata": {
69 | "id": "tkagLwcogd1G"
70 | },
71 | "source": [
72 | "with open('/content/drive/bert_embs_val.pickle', 'rb') as f:\r\n",
73 | " val_values = pickle.load(f)\r\n",
74 | "with open('/content/drive/bert_embs_train.pickle', 'rb') as f:\r\n",
75 | " train_values = pickle.load(f)"
76 | ],
77 | "execution_count": null,
78 | "outputs": []
79 | },
80 | {
81 | "cell_type": "code",
82 | "metadata": {
83 | "id": "Ob3UgGdtgoGG"
84 | },
85 | "source": [
86 | "df = pd.read_csv('/content/drive/train.csv', header=None, names = ['text','label'])\r\n",
87 | "df.head()"
88 | ],
89 | "execution_count": null,
90 | "outputs": []
91 | },
92 | {
93 | "cell_type": "code",
94 | "metadata": {
95 | "id": "56EoPNvfhS0w"
96 | },
97 | "source": [
98 | "train_texts = df.text.values\r\n",
99 | "\r\n",
100 | "possible_labels = df.label.unique()\r\n",
101 | "label_dict = {}\r\n",
102 | "for index, possible_label in enumerate(possible_labels):\r\n",
103 | " label_dict[possible_label] = index\r\n",
104 | "\r\n",
105 | "df['label'] = df.label.replace(label_dict)\r\n",
106 | "train_labels = df.label.values"
107 | ],
108 | "execution_count": null,
109 | "outputs": []
110 | },
111 | {
112 | "cell_type": "code",
113 | "metadata": {
114 | "id": "enN2ItahhWsT"
115 | },
116 | "source": [
117 | "df = pd.read_csv('/content/drive/validation.csv', header=None, names = ['text','label'])\r\n",
118 | "df.head()"
119 | ],
120 | "execution_count": null,
121 | "outputs": []
122 | },
123 | {
124 | "cell_type": "code",
125 | "metadata": {
126 | "id": "Sl6CJ348hZ8e"
127 | },
128 | "source": [
129 | "val_texts = df.text.values\r\n",
130 | "df['label'] = df.label.replace(label_dict)\r\n",
131 | "train_labels = list(train_labels) + list(df.label.values)"
132 | ],
133 | "execution_count": null,
134 | "outputs": []
135 | },
136 | {
137 | "cell_type": "code",
138 | "metadata": {
139 | "id": "1F-pQuTIlOhm"
140 | },
141 | "source": [
142 | "with open('/content/drive/td_100_train.pickle', 'rb') as f:\r\n",
143 | " distributions_train = pickle.load(f)\r\n",
144 | "with open('/content/drive/td_100_val.pickle', 'rb') as f:\r\n",
145 | " distributions_val = pickle.load(f)"
146 | ],
147 | "execution_count": null,
148 | "outputs": []
149 | },
150 | {
151 | "cell_type": "code",
152 | "metadata": {
153 | "id": "0C-k_44_lecY"
154 | },
155 | "source": [
156 | "train_data = np.hstack((np.array(train_values),np.array(distributions_train)))\r\n",
157 | "test_data = np.hstack((np.array(val_values),np.array(distributions_val)))\r\n",
158 | "\r\n",
159 | "train_data.shape"
160 | ],
161 | "execution_count": null,
162 | "outputs": []
163 | },
164 | {
165 | "cell_type": "code",
166 | "metadata": {
167 | "id": "EDoqQ4FOZ5LN"
168 | },
169 | "source": [
170 | "len(train_labels)"
171 | ],
172 | "execution_count": null,
173 | "outputs": []
174 | },
175 | {
176 | "cell_type": "code",
177 | "metadata": {
178 | "id": "U2VNSGOoIiyw"
179 | },
180 | "source": [
181 | "ros = RandomOverSampler(random_state=1)\r\n",
182 |         "train_data_resampled, train_labels_resampled = ros.fit_resample(train_data, train_labels)"
183 | ],
184 | "execution_count": null,
185 | "outputs": []
186 | },
187 | {
188 | "cell_type": "code",
189 | "metadata": {
190 | "id": "Wx-BOylEIyGs"
191 | },
192 | "source": [
193 | "train_data = train_data_resampled\r\n",
194 |         "train_labels = train_labels_resampled\r\n",
195 | "\r\n",
196 | "train_data.shape"
197 | ],
198 | "execution_count": null,
199 | "outputs": []
200 | },
201 | {
202 | "cell_type": "code",
203 | "metadata": {
204 | "id": "woNJvWeqcrei"
205 | },
206 | "source": [
207 | "df = pd.DataFrame(train_data)\r\n",
208 | "df['label'] = pd.Series(train_labels)\r\n",
209 | "\r\n",
210 | "df = df.sample(frac=1)\r\n",
211 | "\r\n",
212 | "train_labels = df.label.values\r\n",
213 | "df = df.drop(columns = 'label')\r\n",
214 | "train_data = df.values\r\n",
215 | "\r\n",
216 | "train_data.shape"
217 | ],
218 | "execution_count": null,
219 | "outputs": []
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {
224 | "id": "iJ3Tn7DThkw2"
225 | },
226 | "source": [
227 | "#ffn"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "metadata": {
233 | "id": "wmNHLYY2aVGE"
234 | },
235 | "source": [
236 | "import math\r\n",
237 | "border = math.ceil(len(train_data) * 0.1)\r\n",
238 | "\r\n",
239 | "val_data, train_data = train_data[:border], train_data[border:]\r\n",
240 | "val_labels, train_labels = train_labels[:border], train_labels[border:]"
241 | ],
242 | "execution_count": null,
243 | "outputs": []
244 | },
245 | {
246 | "cell_type": "code",
247 | "metadata": {
248 | "id": "nUjbm_mMhj-J"
249 | },
250 | "source": [
251 | "train_labels = keras.utils.to_categorical(np.array(train_labels),len(label_dict))\r\n",
252 | "val_labels = keras.utils.to_categorical(np.array(val_labels),len(label_dict))"
253 | ],
254 | "execution_count": null,
255 | "outputs": []
256 | },
257 | {
258 | "cell_type": "code",
259 | "metadata": {
260 | "id": "2mnE3FuwiDv4"
261 | },
262 | "source": [
263 | "inputs=Input(shape=(868,), name='input')\r\n",
264 | "x=Dense(2024, activation='tanh', name='fully_connected_2048_tanh')(inputs)\r\n",
265 | "x=Dense(1024, activation='tanh', name='fully_connected_1024_tanh')(x)\r\n",
266 | "predictions=Dense(len(label_dict), activation='softmax', name='output_softmax')(x)\r\n",
267 | "model=Model(inputs=inputs, outputs=predictions)\r\n",
268 | "model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\r\n",
269 | "model.summary()\r\n",
270 | "\r\n",
271 | "from keras.utils import plot_model\r\n",
272 | "plot_model(model, to_file='fnn.png')"
273 | ],
274 | "execution_count": null,
275 | "outputs": []
276 | },
277 | {
278 | "cell_type": "code",
279 | "metadata": {
280 | "id": "exKm5feebKDj"
281 | },
282 | "source": [
283 | "from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score\r\n",
284 | "import pickle\r\n",
285 | "\r\n",
286 | "history = model.fit(train_data, train_labels, epochs=5, verbose=2, validation_data=(val_data, val_labels))\r\n",
287 | "\r\n",
288 | "predict = np.argmax(model.predict(val_data), axis=1)\r\n",
289 | "answer = np.argmax(val_labels, axis=1)\r\n",
290 | "\r\n",
291 | "f1=f1_score(predict, answer, average='macro')*100\r\n",
292 | "prec=precision_score(predict, answer, average='macro')*100\r\n",
293 | "recall=recall_score(predict, answer, average='macro')*100\r\n",
294 | "accuracy=accuracy_score(predict, answer)*100\r\n",
295 | "\r\n",
296 | "print(f1)"
297 | ],
298 | "execution_count": null,
299 | "outputs": []
300 | },
301 | {
302 | "cell_type": "code",
303 | "metadata": {
304 | "id": "hlk1zYMLbqJi"
305 | },
306 | "source": [
307 | "prediction = model.predict(test_data)\r\n",
308 | "\r\n",
309 | "with open('/content/drive/pred_tm.pickle', 'wb') as f:\r\n",
310 | " pickle.dump(prediction, f)"
311 | ],
312 | "execution_count": null,
313 | "outputs": []
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {
318 | "id": "QhfgnPxi2-xa"
319 | },
320 | "source": [
321 | "#Ensembling"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "metadata": {
327 | "id": "VonM6yFLZ4dM"
328 | },
329 | "source": [
330 | "labels = {'LO': 0, 'NI': 1, 'DS': 2, 'CL': 3, 'DC': 4, 'SE': 5, 'CR': 6}\r\n",
331 | "inv_labels = {v: k for k, v in labels.items()}\r\n",
332 | "inv_labels"
333 | ],
334 | "execution_count": null,
335 | "outputs": []
336 | },
337 | {
338 | "cell_type": "code",
339 | "metadata": {
340 | "id": "a_sIwr7DZ6ie"
341 | },
342 | "source": [
343 | "flat_predictions = [inv_labels[f] for f in flat_predictions]\r\n",
344 | "flat_predictions[:10]"
345 | ],
346 | "execution_count": null,
347 | "outputs": []
348 | },
349 | {
350 | "cell_type": "code",
351 | "metadata": {
352 | "id": "2s3veVMZdybQ"
353 | },
354 | "source": [
355 | "with open('/content/drive/predictions1.pickle', 'rb') as f:\r\n",
356 | " pred1 = pickle.load(f)\r\n",
357 | "\r\n",
358 | "with open('/content/drive/predictions2.pickle', 'rb') as f:\r\n",
359 | " pred2 = pickle.load(f)\r\n",
360 | "\r\n",
361 | "with open('/content/drive/predictions3.pickle', 'rb') as f:\r\n",
362 | " pred3 = pickle.load(f)"
363 | ],
364 | "execution_count": null,
365 | "outputs": []
366 | },
367 | {
368 | "cell_type": "code",
369 | "metadata": {
370 | "id": "cB2FuAo3eBOU"
371 | },
372 | "source": [
373 | "final = []\r\n",
374 | "for i in range(len(pred1)):\r\n",
375 |         "  final.append(pred1[i]+pred2[i]+pred3[i])\r\n",
376 | "print(final[0].shape)\r\n",
377 | "final[0]"
378 | ],
379 | "execution_count": null,
380 | "outputs": []
381 | },
382 | {
383 | "cell_type": "code",
384 | "metadata": {
385 | "id": "h7-IrL89eLpZ"
386 | },
387 | "source": [
388 | "flat_predictions = [item for sublist in final for item in sublist]\r\n",
389 | "flat_predictions[0]"
390 | ],
391 | "execution_count": null,
392 | "outputs": []
393 | }
394 | ]
395 | }
--------------------------------------------------------------------------------
/utmn/2_LDA.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "2. LDA.ipynb",
7 | "provenance": [],
8 | "toc_visible": true
9 | },
10 | "kernelspec": {
11 | "display_name": "Python 3",
12 | "name": "python3"
13 | }
14 | },
15 | "cells": [
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {
19 | "id": "8JNQ1Bm6OycD"
20 | },
21 | "source": [
22 | "#LDA"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "metadata": {
28 | "id": "a1vTC32gOxYm"
29 | },
30 | "source": [
31 | "import nltk\n",
32 | "nltk.download('stopwords')"
33 | ],
34 | "execution_count": null,
35 | "outputs": []
36 | },
37 | {
38 | "cell_type": "code",
39 | "metadata": {
40 | "id": "Hs5EQgjhPACg"
41 | },
42 | "source": [
43 | "import re\n",
44 | "import numpy as np\n",
45 | "import pandas as pd\n",
46 | "from pprint import pprint\n",
47 | "\n",
48 | "# Gensim\n",
49 | "import gensim\n",
50 | "import gensim.corpora as corpora\n",
51 | "from gensim.utils import simple_preprocess\n",
52 | "from gensim.models import CoherenceModel\n",
53 | "\n",
54 | "# spacy for lemmatization\n",
55 | "import spacy\n",
56 | "\n",
57 | "# Plotting tools\n",
58 | "!pip install pyLDAvis\n",
59 | "import pyLDAvis\n",
60 | "import pyLDAvis.gensim # don't skip this\n",
61 | "import matplotlib.pyplot as plt\n",
62 | "%matplotlib inline\n",
63 | "\n",
64 | "# Enable logging for gensim - optional\n",
65 | "import logging\n",
66 | "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)\n",
67 | "\n",
68 | "import warnings\n",
69 | "warnings.filterwarnings(\"ignore\",category=DeprecationWarning)"
70 | ],
71 | "execution_count": null,
72 | "outputs": []
73 | },
74 | {
75 | "cell_type": "code",
76 | "metadata": {
77 | "id": "n7cJ1_vXPNLW"
78 | },
79 | "source": [
80 | "# NLTK Stop words\n",
81 | "from nltk.corpus import stopwords\n",
82 | "stop_words = stopwords.words('english')\n",
83 | "stop_words.extend(['from', 'subject', 're', 'edu', 'use'])"
84 | ],
85 | "execution_count": null,
86 | "outputs": []
87 | },
88 | {
89 | "cell_type": "code",
90 | "metadata": {
91 | "id": "daDVo9MZPUG7"
92 | },
93 | "source": [
94 | "df = pd.read_csv('/content/drive/train.csv', header=None, names = ['text','label'])\n",
95 | "train_texts = df.text.values.tolist()\n",
96 | "df = pd.read_csv('/content/drive/validation.csv', header=None, names = ['text','label'])\n",
97 | "val_texts = df.text.values.tolist()\n",
98 | "df = pd.read_csv('/content/drive/test.csv', header=None, names = ['text'])\n",
99 | "test_texts = df.text.values.tolist()\n",
100 | "data = train_texts + val_texts + test_texts\n",
101 | "\n",
102 | "data = [re.sub('\\S*@\\S*\\s?', '', sent) for sent in data]\n",
103 | "data = [re.sub('\\s+', ' ', sent) for sent in data]\n",
104 | "data = [re.sub(\"\\'\", \"\", sent) for sent in data]\n",
105 | "\n",
106 | "print(data[:1])"
107 | ],
108 | "execution_count": null,
109 | "outputs": []
110 | },
111 | {
112 | "cell_type": "code",
113 | "metadata": {
114 | "id": "WeFkk5J7P4xy"
115 | },
116 | "source": [
117 | "def sent_to_words(sentences):\n",
118 | " for sentence in sentences:\n",
119 | " yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations\n",
120 | "\n",
121 | "data_words = list(sent_to_words(data))\n",
122 | "\n",
123 | "print(data_words[:1])"
124 | ],
125 | "execution_count": null,
126 | "outputs": []
127 | },
128 | {
129 | "cell_type": "code",
130 | "metadata": {
131 | "id": "iJnnzmeyQDco"
132 | },
133 | "source": [
134 | "# Build the bigram and trigram models\n",
135 | "bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.\n",
136 | "trigram = gensim.models.Phrases(bigram[data_words], threshold=100) \n",
137 | "\n",
138 | "# Faster way to get a sentence clubbed as a trigram/bigram\n",
139 | "bigram_mod = gensim.models.phrases.Phraser(bigram)\n",
140 | "trigram_mod = gensim.models.phrases.Phraser(trigram)\n",
141 | "\n",
142 | "# See trigram example\n",
143 | "print(trigram_mod[bigram_mod[data_words[0]]])"
144 | ],
145 | "execution_count": null,
146 | "outputs": []
147 | },
148 | {
149 | "cell_type": "code",
150 | "metadata": {
151 | "id": "Q_9W3b9cQLPP"
152 | },
153 | "source": [
154 | "#import spacy\n",
155 | "# Define functions for stopwords, bigrams, trigrams and lemmatization\n",
156 | "def remove_stopwords(texts):\n",
157 | " return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]\n",
158 | "\n",
159 | "def make_bigrams(texts):\n",
160 | " return [bigram_mod[doc] for doc in texts]\n",
161 | "\n",
162 | "def make_trigrams(texts):\n",
163 | " return [trigram_mod[bigram_mod[doc]] for doc in texts]\n",
164 | "\n",
165 | "def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):\n",
166 | " \"\"\"https://spacy.io/api/annotation\"\"\"\n",
167 | " texts_out = []\n",
168 | " for sent in texts:\n",
169 | " doc = nlp(\" \".join(sent)) \n",
170 | " texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])\n",
171 | " return texts_out"
172 | ],
173 | "execution_count": null,
174 | "outputs": []
175 | },
176 | {
177 | "cell_type": "code",
178 | "metadata": {
179 | "id": "rmAnsscWQYsN"
180 | },
181 | "source": [
182 | "import spacy\n",
183 | "# Remove Stop Words\n",
184 | "data_words_nostops = remove_stopwords(data_words)\n",
185 | "\n",
186 | "# Form Bigrams\n",
187 | "data_words_bigrams = make_bigrams(data_words_nostops)\n",
188 | "\n",
189 | "# Initialize spacy 'en' model, keeping only tagger component (for efficiency)\n",
190 | "# python3 -m spacy download en\n",
191 | "nlp = spacy.load('en', disable=['parser', 'ner'])\n",
192 | "\n",
193 | "# Do lemmatization keeping only noun, adj, vb, adv\n",
194 | "data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])\n",
195 | "\n",
196 | "print(data_lemmatized[:1])"
197 | ],
198 | "execution_count": null,
199 | "outputs": []
200 | },
201 | {
202 | "cell_type": "code",
203 | "metadata": {
204 | "id": "GmHv4x0CQrtg"
205 | },
206 | "source": [
207 | "# Create Dictionary\n",
208 | "id2word = corpora.Dictionary(data_lemmatized)\n",
209 | "\n",
210 | "# Create Corpus\n",
211 | "texts = data_lemmatized\n",
212 | "\n",
213 | "# Term Document Frequency\n",
214 | "corpus = [id2word.doc2bow(text) for text in texts]\n",
215 | "\n",
216 | "# View\n",
217 | "print(corpus[:1])"
218 | ],
219 | "execution_count": null,
220 | "outputs": []
221 | },
222 | {
223 | "cell_type": "code",
224 | "metadata": {
225 | "id": "UDi3SEr9Qw_C"
226 | },
227 | "source": [
228 | "id2word[0]"
229 | ],
230 | "execution_count": null,
231 | "outputs": []
232 | },
233 | {
234 | "cell_type": "code",
235 | "metadata": {
236 | "id": "16s6HB7VQ0Lh"
237 | },
238 | "source": [
239 | "# Human readable format of corpus (term-frequency)\n",
240 | "[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]"
241 | ],
242 | "execution_count": null,
243 | "outputs": []
244 | },
245 | {
246 | "cell_type": "code",
247 | "metadata": {
248 | "id": "lVZAwsQhfNiz"
249 | },
250 | "source": [
251 | "import pickle\r\n",
252 | "\r\n",
253 | "with open('/content/drive/corpus.pickle', 'wb') as f:\r\n",
254 | " pickle.dump(corpus, f)\r\n",
255 | "with open('/content/drive/id2word.pickle', 'wb') as f:\r\n",
256 | " pickle.dump(id2word, f)"
257 | ],
258 | "execution_count": null,
259 | "outputs": []
260 | },
261 | {
262 | "cell_type": "code",
263 | "metadata": {
264 | "id": "I2TMNp-9Q8nY"
265 | },
266 | "source": [
267 | "# Build LDA model\n",
268 | "lda_model = gensim.models.LdaMulticore(corpus=corpus,\n",
269 | " id2word=id2word,\n",
270 | " num_topics=100, \n",
271 | " random_state=100,\n",
272 | " chunksize=100,\n",
273 | " passes=10,\n",
274 | " workers=3,\n",
275 | " per_word_topics=True)"
276 | ],
277 | "execution_count": null,
278 | "outputs": []
279 | },
280 | {
281 | "cell_type": "code",
282 | "metadata": {
283 | "id": "U69V41P1R1WQ"
284 | },
285 | "source": [
286 | "# Print the Keyword in the 10 topics\n",
287 | "print(lda_model.print_topics())\n",
288 | "doc_lda = lda_model[corpus]"
289 | ],
290 | "execution_count": null,
291 | "outputs": []
292 | },
293 | {
294 | "cell_type": "code",
295 | "metadata": {
296 | "id": "mTg1rlk9R5aG"
297 | },
298 | "source": [
299 | "# Compute Perplexity\n",
300 | "print('\\nPerplexity: ', lda_model.log_perplexity(corpus)) # a measure of how good the model is. lower the better.\n",
301 | "\n",
302 | "# Compute Coherence Score\n",
303 | "coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')\n",
304 | "coherence_lda = coherence_model_lda.get_coherence()\n",
305 | "print('\\nCoherence Score: ', coherence_lda)"
306 | ],
307 | "execution_count": null,
308 | "outputs": []
309 | },
310 | {
311 | "cell_type": "code",
312 | "metadata": {
313 | "id": "aQOX3MZFSBB1"
314 | },
315 | "source": [
316 | "# Visualize the topics\n",
317 | "pyLDAvis.enable_notebook()\n",
318 | "vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)\n",
319 | "vis"
320 | ],
321 | "execution_count": null,
322 | "outputs": []
323 | },
324 | {
325 | "cell_type": "code",
326 | "metadata": {
327 | "id": "Jg76SsEshJAB"
328 | },
329 | "source": [
330 | "with open('/content/drive/lda_model.pickle', 'wb') as f:\r\n",
331 | " pickle.dump(lda_model, f)"
332 | ],
333 | "execution_count": null,
334 | "outputs": []
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {
339 | "id": "G25c2pseSMC3"
340 | },
341 | "source": [
342 | "#choose the best model"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "metadata": {
348 | "id": "f7V-IChJSE8z"
349 | },
350 | "source": [
351 | "def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):\n",
352 | " \"\"\"\n",
353 | " Compute c_v coherence for various number of topics\n",
354 | "\n",
355 | " Parameters:\n",
356 | " ----------\n",
357 | " dictionary : Gensim dictionary\n",
358 | " corpus : Gensim corpus\n",
359 | " texts : List of input texts\n",
360 | " limit : Max num of topics\n",
361 | "\n",
362 | " Returns:\n",
363 | " -------\n",
364 | " model_list : List of LDA topic models\n",
365 | " coherence_values : Coherence values corresponding to the LDA model with respective number of topics\n",
366 | " \"\"\"\n",
367 | " coherence_values = []\n",
368 | " model_list = []\n",
369 | " for num_topics in range(start, limit, step):\n",
370 | " print(num_topics)\n",
371 | " model = gensim.models.LdaMulticore(corpus=corpus,\n",
372 | " id2word=id2word,\n",
373 | " num_topics=num_topics, \n",
374 | " random_state=100,\n",
375 | " chunksize=100,\n",
376 | " passes=10,\n",
377 | " workers=None,\n",
378 | " per_word_topics=True)\n",
379 | " model_list.append(model)\n",
380 | " coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')\n",
381 | " coherence_values.append(coherencemodel.get_coherence())\n",
382 | " print(coherence_values[len(coherence_values)-1])\n",
383 | "\n",
384 | " return model_list, coherence_values"
385 | ],
386 | "execution_count": null,
387 | "outputs": []
388 | },
389 | {
390 | "cell_type": "code",
391 | "metadata": {
392 | "id": "v14G8uOVSfIm"
393 | },
394 | "source": [
395 | "# Can take a long time to run.\n",
396 | "model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=10, limit=200, step=10)"
397 | ],
398 | "execution_count": null,
399 | "outputs": []
400 | },
401 | {
402 | "cell_type": "code",
403 | "metadata": {
404 | "id": "x41IUc8YSqXm"
405 | },
406 | "source": [
407 | "# Show graph\n",
408 | "limit=205; start=10; step=10;\n",
409 | "x = range(start, limit, step)\n",
410 | "plt.plot(x, coherence_values)\n",
411 | "plt.xlabel(\"Number of topics\")\n",
412 | "plt.ylabel(\"Coherence value\")\n",
413 | "#plt.legend((\"coherence_values\"), loc='best')\n",
414 | "plt.show()"
415 | ],
416 | "execution_count": null,
417 | "outputs": []
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {
422 | "id": "CnNNVgzPUdjC"
423 | },
424 | "source": [
425 | "#get topic distributions"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "metadata": {
431 | "id": "LdZFUJzVUgbx"
432 | },
433 | "source": [
434 | "def get_dist(dist):\n",
435 | " new_dist = []\n",
436 | " for d in dist:\n",
437 | " new_dist.append(d[1])\n",
438 | " return new_dist\n",
439 | "\n",
440 | "corpus_train, corpus_val = corpus[:16800],corpus[16800:]\n",
441 | "distributions_train = []\n",
442 | "for doc in corpus_train:\n",
443 | " distributions_train.append(get_dist(lda_model.get_document_topics(doc, minimum_probability=0.0)))\n",
444 | "\n",
445 | "with open('/content/drive/td_100_train.pickle', 'wb') as f:\n",
446 | " pickle.dump(distributions_train, f)"
447 | ],
448 | "execution_count": null,
449 | "outputs": []
450 | },
451 | {
452 | "cell_type": "code",
453 | "metadata": {
454 | "id": "PwFbHp3Hk_qt"
455 | },
456 | "source": [
457 | "distributions_val = []\r\n",
458 | "for doc in corpus_val:\r\n",
459 | " distributions_val.append(get_dist(lda_model.get_document_topics(doc, minimum_probability=0.0)))\r\n",
460 | "\r\n",
461 | "with open('/content/drive/td_100_val.pickle', 'wb') as f:\r\n",
462 | " pickle.dump(distributions_val, f)"
463 | ],
464 | "execution_count": null,
465 | "outputs": []
466 | }
467 | ]
468 | }
--------------------------------------------------------------------------------
/IIITT/run3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Ensemble_classifier.ipynb",
7 | "provenance": [],
8 | "toc_visible": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | }
14 | },
15 | "cells": [
16 | {
17 | "cell_type": "code",
18 | "metadata": {
19 | "id": "lWckrP6pStpg"
20 | },
21 | "source": [
22 | "import pandas as pd\r\n",
23 | "import numpy as np\r\n",
24 | "import matplotlib.pyplot as plt\r\n"
25 | ],
26 | "execution_count": 1,
27 | "outputs": []
28 | },
29 | {
30 | "cell_type": "code",
31 | "metadata": {
32 | "id": "sLnqm1V6KnJB",
33 | "outputId": "11b139ae-7be7-40df-8bb7-ec947c095a6c",
34 | "colab": {
35 | "base_uri": "https://localhost:8080/"
36 | }
37 | },
38 | "source": [
39 | "from google.colab import drive\n",
40 | "drive.mount('/content/drive')"
41 | ],
42 | "execution_count": 3,
43 | "outputs": [
44 | {
45 | "output_type": "stream",
46 | "text": [
47 | "Mounted at /content/drive\n"
48 | ],
49 | "name": "stdout"
50 | }
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "metadata": {
56 | "colab": {
57 | "base_uri": "https://localhost:8080/"
58 | },
59 | "id": "4FUUUT_FX1ze",
60 | "outputId": "580b07bd-70fc-4602-86e2-1ea22174a0f7"
61 | },
62 | "source": [
63 | "cd /content/drive/MyDrive/sdpra2021/pred_probs/"
64 | ],
65 | "execution_count": 6,
66 | "outputs": [
67 | {
68 | "output_type": "stream",
69 | "text": [
70 | "/content/drive/MyDrive/spdra2021/pred_probs\n"
71 | ],
72 | "name": "stdout"
73 | }
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "metadata": {
79 | "colab": {
80 | "base_uri": "https://localhost:8080/",
81 | "height": 204
82 | },
83 | "id": "ounVF-yhYQsL",
84 | "outputId": "5d3367cd-4f56-4c2a-d3a7-713f5211cb52"
85 | },
86 | "source": [
87 | "bert = pd.read_csv('bert.csv')\r\n",
88 | "bert = bert.drop(columns='Unnamed: 0')\r\n",
89 | "bert.head() "
90 | ],
91 | "execution_count": 7,
92 | "outputs": [
93 | {
94 | "output_type": "execute_result",
95 | "data": {
96 | "text/html": [
97 | "
\n",
98 | "\n",
111 | "
\n",
112 | " \n",
113 | " \n",
114 | " | \n",
115 | " CL | \n",
116 | " CR | \n",
117 | " DC | \n",
118 | " DS | \n",
119 | " LO | \n",
120 | " NI | \n",
121 | " SE | \n",
122 | " result | \n",
123 | " abstract | \n",
124 | "
\n",
125 | " \n",
126 | " \n",
127 | " \n",
128 | " | 0 | \n",
129 | " 0.000150 | \n",
130 | " 0.001650 | \n",
131 | " 0.986069 | \n",
132 | " 0.003539 | \n",
133 | " 0.002675 | \n",
134 | " 0.002975 | \n",
135 | " 0.002942 | \n",
136 | " 2 | \n",
137 | " This paper analyses the possibilities of per... | \n",
138 | "
\n",
139 | " \n",
140 | " | 1 | \n",
141 | " 0.000809 | \n",
142 | " 0.011402 | \n",
143 | " 0.895457 | \n",
144 | " 0.010034 | \n",
145 | " 0.004086 | \n",
146 | " 0.073280 | \n",
147 | " 0.004931 | \n",
148 | " 2 | \n",
149 | " A finite element method is presented to comp... | \n",
150 | "
\n",
151 | " \n",
152 | " | 2 | \n",
153 | " 0.998192 | \n",
154 | " 0.000182 | \n",
155 | " 0.000038 | \n",
156 | " 0.000098 | \n",
157 | " 0.000380 | \n",
158 | " 0.000112 | \n",
159 | " 0.000999 | \n",
160 | " 0 | \n",
161 | " This paper includes a reflection on the role... | \n",
162 | "
\n",
163 | " \n",
164 | " | 3 | \n",
165 | " 0.000124 | \n",
166 | " 0.001555 | \n",
167 | " 0.002590 | \n",
168 | " 0.000685 | \n",
169 | " 0.000627 | \n",
170 | " 0.993462 | \n",
171 | " 0.000958 | \n",
172 | " 5 | \n",
173 | " In this document, we describe the fractal st... | \n",
174 | "
\n",
175 | " \n",
176 | " | 4 | \n",
177 | " 0.000166 | \n",
178 | " 0.000656 | \n",
179 | " 0.001765 | \n",
180 | " 0.995629 | \n",
181 | " 0.000873 | \n",
182 | " 0.000429 | \n",
183 | " 0.000482 | \n",
184 | " 3 | \n",
185 | " We show how to test whether a graph with n v... | \n",
186 | "
\n",
187 | " \n",
188 | "
\n",
189 | "
"
190 | ],
191 | "text/plain": [
192 | " CL ... abstract\n",
193 | "0 0.000150 ... This paper analyses the possibilities of per...\n",
194 | "1 0.000809 ... A finite element method is presented to comp...\n",
195 | "2 0.998192 ... This paper includes a reflection on the role...\n",
196 | "3 0.000124 ... In this document, we describe the fractal st...\n",
197 | "4 0.000166 ... We show how to test whether a graph with n v...\n",
198 | "\n",
199 | "[5 rows x 9 columns]"
200 | ]
201 | },
202 | "metadata": {
203 | "tags": []
204 | },
205 | "execution_count": 7
206 | }
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "metadata": {
212 | "colab": {
213 | "base_uri": "https://localhost:8080/",
214 | "height": 204
215 | },
216 | "id": "46d-Z4cYYc5r",
217 | "outputId": "77d2ba85-cba6-4851-c6a1-f841e4287cfb"
218 | },
219 | "source": [
220 | "roberta = pd.read_csv('roberta.csv')\r\n",
221 | "roberta = roberta.drop(columns=['Unnamed: 0'])\r\n",
222 | "roberta.head() "
223 | ],
224 | "execution_count": 8,
225 | "outputs": [
226 | {
227 | "output_type": "execute_result",
228 | "data": {
229 | "text/html": [
230 | "\n",
231 | "\n",
244 | "
\n",
245 | " \n",
246 | " \n",
247 | " | \n",
248 | " CL | \n",
249 | " CR | \n",
250 | " DC | \n",
251 | " DS | \n",
252 | " LO | \n",
253 | " NI | \n",
254 | " SE | \n",
255 | " result | \n",
256 | " abstract | \n",
257 | "
\n",
258 | " \n",
259 | " \n",
260 | " \n",
261 | " | 0 | \n",
262 | " 0.002313 | \n",
263 | " 0.002387 | \n",
264 | " 0.980938 | \n",
265 | " 0.003960 | \n",
266 | " 0.001252 | \n",
267 | " 0.005076 | \n",
268 | " 0.004074 | \n",
269 | " 2 | \n",
270 | " This paper analyses the possibilities of per... | \n",
271 | "
\n",
272 | " \n",
273 | " | 1 | \n",
274 | " 0.002194 | \n",
275 | " 0.003206 | \n",
276 | " 0.978623 | \n",
277 | " 0.005403 | \n",
278 | " 0.000872 | \n",
279 | " 0.007745 | \n",
280 | " 0.001957 | \n",
281 | " 2 | \n",
282 | " A finite element method is presented to comp... | \n",
283 | "
\n",
284 | " \n",
285 | " | 2 | \n",
286 | " 0.997938 | \n",
287 | " 0.000090 | \n",
288 | " 0.000247 | \n",
289 | " 0.000461 | \n",
290 | " 0.000620 | \n",
291 | " 0.000309 | \n",
292 | " 0.000335 | \n",
293 | " 0 | \n",
294 | " This paper includes a reflection on the role... | \n",
295 | "
\n",
296 | " \n",
297 | " | 3 | \n",
298 | " 0.006236 | \n",
299 | " 0.298282 | \n",
300 | " 0.384594 | \n",
301 | " 0.064093 | \n",
302 | " 0.015241 | \n",
303 | " 0.203037 | \n",
304 | " 0.028518 | \n",
305 | " 2 | \n",
306 | " In this document, we describe the fractal st... | \n",
307 | "
\n",
308 | " \n",
309 | " | 4 | \n",
310 | " 0.000752 | \n",
311 | " 0.000967 | \n",
312 | " 0.001257 | \n",
313 | " 0.994632 | \n",
314 | " 0.001538 | \n",
315 | " 0.000534 | \n",
316 | " 0.000321 | \n",
317 | " 3 | \n",
318 | " We show how to test whether a graph with n v... | \n",
319 | "
\n",
320 | " \n",
321 | "
\n",
322 | "
"
323 | ],
324 | "text/plain": [
325 | " CL ... abstract\n",
326 | "0 0.002313 ... This paper analyses the possibilities of per...\n",
327 | "1 0.002194 ... A finite element method is presented to comp...\n",
328 | "2 0.997938 ... This paper includes a reflection on the role...\n",
329 | "3 0.006236 ... In this document, we describe the fractal st...\n",
330 | "4 0.000752 ... We show how to test whether a graph with n v...\n",
331 | "\n",
332 | "[5 rows x 9 columns]"
333 | ]
334 | },
335 | "metadata": {
336 | "tags": []
337 | },
338 | "execution_count": 8
339 | }
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "metadata": {
345 | "colab": {
346 | "base_uri": "https://localhost:8080/",
347 | "height": 204
348 | },
349 | "id": "7N1k70ZBZdKp",
350 | "outputId": "4df5ba44-1dac-4fe1-825f-7f3b518cc2ac"
351 | },
352 | "source": [
353 | "scibert = pd.read_csv('scibert.csv')\r\n",
354 | "scibert = scibert.drop(columns='Unnamed: 0')\r\n",
355 | "scibert.head() "
356 | ],
357 | "execution_count": 9,
358 | "outputs": [
359 | {
360 | "output_type": "execute_result",
361 | "data": {
362 | "text/html": [
363 | "\n",
364 | "\n",
377 | "
\n",
378 | " \n",
379 | " \n",
380 | " | \n",
381 | " CL | \n",
382 | " CR | \n",
383 | " DC | \n",
384 | " DS | \n",
385 | " LO | \n",
386 | " NI | \n",
387 | " SE | \n",
388 | " result | \n",
389 | " abstract | \n",
390 | "
\n",
391 | " \n",
392 | " \n",
393 | " \n",
394 | " | 0 | \n",
395 | " 0.000159 | \n",
396 | " 0.000761 | \n",
397 | " 0.994799 | \n",
398 | " 0.000768 | \n",
399 | " 0.000288 | \n",
400 | " 0.001839 | \n",
401 | " 0.001386 | \n",
402 | " 2 | \n",
403 | " This paper analyses the possibilities of per... | \n",
404 | "
\n",
405 | " \n",
406 | " | 1 | \n",
407 | " 0.000286 | \n",
408 | " 0.001598 | \n",
409 | " 0.848090 | \n",
410 | " 0.002578 | \n",
411 | " 0.000714 | \n",
412 | " 0.144184 | \n",
413 | " 0.002550 | \n",
414 | " 2 | \n",
415 | " A finite element method is presented to comp... | \n",
416 | "
\n",
417 | " \n",
418 | " | 2 | \n",
419 | " 0.999109 | \n",
420 | " 0.000252 | \n",
421 | " 0.000133 | \n",
422 | " 0.000148 | \n",
423 | " 0.000178 | \n",
424 | " 0.000062 | \n",
425 | " 0.000117 | \n",
426 | " 0 | \n",
427 | " This paper includes a reflection on the role... | \n",
428 | "
\n",
429 | " \n",
430 | " | 3 | \n",
431 | " 0.000146 | \n",
432 | " 0.000313 | \n",
433 | " 0.002194 | \n",
434 | " 0.000169 | \n",
435 | " 0.000153 | \n",
436 | " 0.996466 | \n",
437 | " 0.000559 | \n",
438 | " 5 | \n",
439 | " In this document, we describe the fractal st... | \n",
440 | "
\n",
441 | " \n",
442 | " | 4 | \n",
443 | " 0.000225 | \n",
444 | " 0.000235 | \n",
445 | " 0.000493 | \n",
446 | " 0.998302 | \n",
447 | " 0.000425 | \n",
448 | " 0.000191 | \n",
449 | " 0.000129 | \n",
450 | " 3 | \n",
451 | " We show how to test whether a graph with n v... | \n",
452 | "
\n",
453 | " \n",
454 | "
\n",
455 | "
"
456 | ],
457 | "text/plain": [
458 | " CL ... abstract\n",
459 | "0 0.000159 ... This paper analyses the possibilities of per...\n",
460 | "1 0.000286 ... A finite element method is presented to comp...\n",
461 | "2 0.999109 ... This paper includes a reflection on the role...\n",
462 | "3 0.000146 ... In this document, we describe the fractal st...\n",
463 | "4 0.000225 ... We show how to test whether a graph with n v...\n",
464 | "\n",
465 | "[5 rows x 9 columns]"
466 | ]
467 | },
468 | "metadata": {
469 | "tags": []
470 | },
471 | "execution_count": 9
472 | }
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "metadata": {
478 | "id": "ULhUk0xsh3_g"
479 | },
480 | "source": [
481 | "test = pd.read_csv('/content/drive/MyDrive/spdra2021/Datasets/test.csv',delimiter=',',\r\n",
482 | " header=None,names=['text'])\r\n",
483 | " "
484 | ],
485 | "execution_count": 10,
486 | "outputs": []
487 | },
488 | {
489 | "cell_type": "code",
490 | "metadata": {
491 | "id": "3ZWYYi6hrC_-"
492 | },
493 | "source": [
494 | "labels = ['CL','CR','DC','DS','LO','NI','SE']"
495 | ],
496 | "execution_count": 11,
497 | "outputs": []
498 | },
499 | {
500 | "cell_type": "code",
501 | "metadata": {
502 | "colab": {
503 | "base_uri": "https://localhost:8080/"
504 | },
505 | "id": "Q1rhux11aG7f",
506 | "outputId": "0a471a06-5282-4170-e3c7-786f8b098ac5"
507 | },
508 | "source": [
509 | "for label in labels:\r\n",
510 | " print(label)\r\n",
511 | " print(np.corrcoef([bert[label].rank(pct=True), roberta[label].rank(pct=True), scibert[label].rank(pct=True)]))\r\n",
512 | "submission = pd.DataFrame()\r\n",
513 | "#submission['id'] = a['abstract']\r\n",
514 | "for label in labels:\r\n",
515 | " submission[label] = (bert[label].rank(pct=True) * 0.3 + roberta[label].rank(pct=True) * 0.3 + scibert[label].rank(pct=True)*0.4)\r\n",
516 | "submission['result'] = submission.idxmax(axis = 1) \r\n",
517 | "submission['result'] = submission['result'].apply({'CL':0,'CR':1,'DC':2,\r\n",
518 | "'DS':3,'LO':4, 'NI':5, 'SE':6}.get) \r\n",
519 | "submission['id'] = test['text']\r\n",
520 | "submission.to_csv('submission.csv', index=False)"
521 | ],
522 | "execution_count": 18,
523 | "outputs": [
524 | {
525 | "output_type": "stream",
526 | "text": [
527 | "CL\n",
528 | "[[1. 0.60876764 0.79489693]\n",
529 | " [0.60876764 1. 0.43417398]\n",
530 | " [0.79489693 0.43417398 1. ]]\n",
531 | "CR\n",
532 | "[[1. 0.81781081 0.77273806]\n",
533 | " [0.81781081 1. 0.69869303]\n",
534 | " [0.77273806 0.69869303 1. ]]\n",
535 | "DC\n",
536 | "[[1. 0.84889632 0.85096035]\n",
537 | " [0.84889632 1. 0.88747852]\n",
538 | " [0.85096035 0.88747852 1. ]]\n",
539 | "DS\n",
540 | "[[1. 0.92145531 0.84394307]\n",
541 | " [0.92145531 1. 0.82648213]\n",
542 | " [0.84394307 0.82648213 1. ]]\n",
543 | "LO\n",
544 | "[[1. 0.82319259 0.72438774]\n",
545 | " [0.82319259 1. 0.80665013]\n",
546 | " [0.72438774 0.80665013 1. ]]\n",
547 | "NI\n",
548 | "[[1. 0.92307865 0.91320051]\n",
549 | " [0.92307865 1. 0.90765773]\n",
550 | " [0.91320051 0.90765773 1. ]]\n",
551 | "SE\n",
552 | "[[1. 0.71567135 0.72420244]\n",
553 | " [0.71567135 1. 0.89973318]\n",
554 | " [0.72420244 0.89973318 1. ]]\n"
555 | ],
556 | "name": "stdout"
557 | }
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "metadata": {
563 | "colab": {
564 | "base_uri": "https://localhost:8080/",
565 | "height": 419
566 | },
567 | "id": "EMqzUplRHjxa",
568 | "outputId": "7e83ec84-1508-4402-d0ac-1188bcf64a8d"
569 | },
570 | "source": [
571 | "submission"
572 | ],
573 | "execution_count": 19,
574 | "outputs": [
575 | {
576 | "output_type": "execute_result",
577 | "data": {
578 | "text/html": [
579 | "\n",
580 | "\n",
593 | "
\n",
594 | " \n",
595 | " \n",
596 | " | \n",
597 | " CL | \n",
598 | " CR | \n",
599 | " DC | \n",
600 | " DS | \n",
601 | " LO | \n",
602 | " NI | \n",
603 | " SE | \n",
604 | " result | \n",
605 | " id | \n",
606 | "
\n",
607 | " \n",
608 | " \n",
609 | " \n",
610 | " | 0 | \n",
611 | " 0.295386 | \n",
612 | " 0.649800 | \n",
613 | " 0.972014 | \n",
614 | " 0.686957 | \n",
615 | " 0.583686 | \n",
616 | " 0.688386 | \n",
617 | " 0.762586 | \n",
618 | " 2 | \n",
619 | " This paper analyses the possibilities of per... | \n",
620 | "
\n",
621 | " \n",
622 | " | 1 | \n",
623 | " 0.515057 | \n",
624 | " 0.769757 | \n",
625 | " 0.912929 | \n",
626 | " 0.787314 | \n",
627 | " 0.658929 | \n",
628 | " 0.798314 | \n",
629 | " 0.751900 | \n",
630 | " 2 | \n",
631 | " A finite element method is presented to comp... | \n",
632 | "
\n",
633 | " \n",
634 | " | 2 | \n",
635 | " 0.919086 | \n",
636 | " 0.109029 | \n",
637 | " 0.020229 | \n",
638 | " 0.074743 | \n",
639 | " 0.254557 | \n",
640 | " 0.081329 | \n",
641 | " 0.208529 | \n",
642 | " 0 | \n",
643 | " This paper includes a reflection on the role... | \n",
644 | "
\n",
645 | " \n",
646 | " | 3 | \n",
647 | " 0.310886 | \n",
648 | " 0.597586 | \n",
649 | " 0.735329 | \n",
650 | " 0.481900 | \n",
651 | " 0.496643 | \n",
652 | " 0.868614 | \n",
653 | " 0.534700 | \n",
654 | " 5 | \n",
655 | " In this document, we describe the fractal st... | \n",
656 | "
\n",
657 | " \n",
658 | " | 4 | \n",
659 | " 0.224829 | \n",
660 | " 0.246786 | \n",
661 | " 0.291243 | \n",
662 | " 0.973157 | \n",
663 | " 0.569500 | \n",
664 | " 0.255157 | \n",
665 | " 0.099614 | \n",
666 | " 3 | \n",
667 | " We show how to test whether a graph with n v... | \n",
668 | "
\n",
669 | " \n",
670 | " | ... | \n",
671 | " ... | \n",
672 | " ... | \n",
673 | " ... | \n",
674 | " ... | \n",
675 | " ... | \n",
676 | " ... | \n",
677 | " ... | \n",
678 | " ... | \n",
679 | " ... | \n",
680 | "
\n",
681 | " \n",
682 | " | 6995 | \n",
683 | " 0.571286 | \n",
684 | " 0.398100 | \n",
685 | " 0.397129 | \n",
686 | " 0.638014 | \n",
687 | " 0.960421 | \n",
688 | " 0.256314 | \n",
689 | " 0.507057 | \n",
690 | " 4 | \n",
691 | " It is common practice to compare the computa... | \n",
692 | "
\n",
693 | " \n",
694 | " | 6996 | \n",
695 | " 0.671786 | \n",
696 | " 0.315371 | \n",
697 | " 0.296743 | \n",
698 | " 0.591829 | \n",
699 | " 0.969336 | \n",
700 | " 0.271543 | \n",
701 | " 0.510771 | \n",
702 | " 4 | \n",
703 | " Defeasible reasoning is a simple but efficie... | \n",
704 | "
\n",
705 | " \n",
706 | " | 6997 | \n",
707 | " 0.609871 | \n",
708 | " 0.557043 | \n",
709 | " 0.410114 | \n",
710 | " 0.689214 | \n",
711 | " 0.929086 | \n",
712 | " 0.351986 | \n",
713 | " 0.641114 | \n",
714 | " 4 | \n",
715 | " The almost periodic functions form a natural... | \n",
716 | "
\n",
717 | " \n",
718 | " | 6998 | \n",
719 | " 0.571543 | \n",
720 | " 0.309000 | \n",
721 | " 0.320829 | \n",
722 | " 0.614657 | \n",
723 | " 0.983743 | \n",
724 | " 0.226100 | \n",
725 | " 0.545014 | \n",
726 | " 4 | \n",
727 | " A notion of alternating timed automata is pr... | \n",
728 | "
\n",
729 | " \n",
730 | " | 6999 | \n",
731 | " 0.641700 | \n",
732 | " 0.291729 | \n",
733 | " 0.268357 | \n",
734 | " 0.596400 | \n",
735 | " 0.970171 | \n",
736 | " 0.266557 | \n",
737 | " 0.541643 | \n",
738 | " 4 | \n",
739 | " We present a hierarchical framework for anal... | \n",
740 | "
\n",
741 | " \n",
742 | "
\n",
743 | "
7000 rows × 9 columns
\n",
744 | "
"
745 | ],
746 | "text/plain": [
747 | " CL ... id\n",
748 | "0 0.295386 ... This paper analyses the possibilities of per...\n",
749 | "1 0.515057 ... A finite element method is presented to comp...\n",
750 | "2 0.919086 ... This paper includes a reflection on the role...\n",
751 | "3 0.310886 ... In this document, we describe the fractal st...\n",
752 | "4 0.224829 ... We show how to test whether a graph with n v...\n",
753 | "... ... ... ...\n",
754 | "6995 0.571286 ... It is common practice to compare the computa...\n",
755 | "6996 0.671786 ... Defeasible reasoning is a simple but efficie...\n",
756 | "6997 0.609871 ... The almost periodic functions form a natural...\n",
757 | "6998 0.571543 ... A notion of alternating timed automata is pr...\n",
758 | "6999 0.641700 ... We present a hierarchical framework for anal...\n",
759 | "\n",
760 | "[7000 rows x 9 columns]"
761 | ]
762 | },
763 | "metadata": {
764 | "tags": []
765 | },
766 | "execution_count": 19
767 | }
768 | ]
769 | },
770 | {
771 | "cell_type": "code",
772 | "metadata": {
773 | "id": "7Nl0V7Bm47yj"
774 | },
775 | "source": [
776 | "submission['result'] = submission['result'].apply({0:'CL', 1:'CR', 2:'DC',\r\n",
777 | "3:'DS', 4:'LO', 5:'NI', 6:'SE' }.get)\r\n"
778 | ],
779 | "execution_count": 20,
780 | "outputs": []
781 | },
782 | {
783 | "cell_type": "code",
784 | "metadata": {
785 | "colab": {
786 | "base_uri": "https://localhost:8080/",
787 | "height": 34
788 | },
789 | "id": "HlxxQwny5kfl",
790 | "outputId": "51674744-3f1b-4f70-f3ea-5aade7e3a646"
791 | },
792 | "source": [
793 | "result = submission['result'].to_numpy()\r\n",
794 | "print(len(result))\r\n",
795 | "np.savetxt(\"run3.txt\", result, fmt = \"%s\")\r\n",
796 | "from google.colab import files\r\n",
797 | "files.download('run3.txt')"
798 | ],
799 | "execution_count": 21,
800 | "outputs": [
801 | {
802 | "output_type": "stream",
803 | "text": [
804 | "7000\n"
805 | ],
806 | "name": "stdout"
807 | },
808 | {
809 | "output_type": "display_data",
810 | "data": {
811 | "application/javascript": [
812 | "\n",
813 | " async function download(id, filename, size) {\n",
814 | " if (!google.colab.kernel.accessAllowed) {\n",
815 | " return;\n",
816 | " }\n",
817 | " const div = document.createElement('div');\n",
818 | " const label = document.createElement('label');\n",
819 | " label.textContent = `Downloading \"${filename}\": `;\n",
820 | " div.appendChild(label);\n",
821 | " const progress = document.createElement('progress');\n",
822 | " progress.max = size;\n",
823 | " div.appendChild(progress);\n",
824 | " document.body.appendChild(div);\n",
825 | "\n",
826 | " const buffers = [];\n",
827 | " let downloaded = 0;\n",
828 | "\n",
829 | " const channel = await google.colab.kernel.comms.open(id);\n",
830 | " // Send a message to notify the kernel that we're ready.\n",
831 | " channel.send({})\n",
832 | "\n",
833 | " for await (const message of channel.messages) {\n",
834 | " // Send a message to notify the kernel that we're ready.\n",
835 | " channel.send({})\n",
836 | " if (message.buffers) {\n",
837 | " for (const buffer of message.buffers) {\n",
838 | " buffers.push(buffer);\n",
839 | " downloaded += buffer.byteLength;\n",
840 | " progress.value = downloaded;\n",
841 | " }\n",
842 | " }\n",
843 | " }\n",
844 | " const blob = new Blob(buffers, {type: 'application/binary'});\n",
845 | " const a = document.createElement('a');\n",
846 | " a.href = window.URL.createObjectURL(blob);\n",
847 | " a.download = filename;\n",
848 | " div.appendChild(a);\n",
849 | " a.click();\n",
850 | " div.remove();\n",
851 | " }\n",
852 | " "
853 | ],
854 | "text/plain": [
855 | ""
856 | ]
857 | },
858 | "metadata": {
859 | "tags": []
860 | }
861 | },
862 | {
863 | "output_type": "display_data",
864 | "data": {
865 | "application/javascript": [
866 | "download(\"download_d23b42c0-641a-49da-8061-44c07f762bf1\", \"run3.txt\", 21000)"
867 | ],
868 | "text/plain": [
869 | ""
870 | ]
871 | },
872 | "metadata": {
873 | "tags": []
874 | }
875 | }
876 | ]
877 | },
878 | {
879 | "cell_type": "markdown",
880 | "metadata": {
881 | "id": "0pBDXhATIqGJ"
882 | },
883 | "source": [
884 | "#Predictions "
885 | ]
886 | },
887 | {
888 | "cell_type": "code",
889 | "metadata": {
890 | "id": "0T6MeF7NkxY4"
891 | },
892 | "source": [
893 | "\r\n",
894 | "\r\n",
895 | "\"\"\"\r\n",
896 | "The submission file IIITT.zip has the systems as follows:\r\n",
897 | "\r\n",
898 | "run 1 : Pre-trained Transformer Model (allenai/scibert_scivocab_uncased)\r\n",
899 | "run 2 : Average of probabities of predictions of ( BERT_base_uncased + RoBERTa_base + SciBERT)\r\n",
900 | "run 3 : Ensemble of probabilities of predictions by ranking the percentile of the result stored as a pandas DataFrame\r\n",
901 | "\"\"\""
902 | ],
903 | "execution_count": null,
904 | "outputs": []
905 | },
906 | {
907 | "cell_type": "code",
908 | "metadata": {
909 | "id": "bQH0FM_ZMBVP"
910 | },
911 | "source": [
912 | ""
913 | ],
914 | "execution_count": null,
915 | "outputs": []
916 | }
917 | ]
918 | }
--------------------------------------------------------------------------------
/utmn/1_fine_tuning_and_getting_bert_embs.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "accelerator": "GPU",
6 | "colab": {
7 | "name": "1. fine-tuning and getting bert embs",
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "toc_visible": true
11 | },
12 | "kernelspec": {
13 | "display_name": "Python 3",
14 | "name": "python3"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "code",
20 | "metadata": {
21 | "id": "hTjQwN2Ebfd9"
22 | },
23 | "source": [
24 | "import pandas as pd\n",
25 | "\n",
26 | "#import google disk (for data loading)\n",
27 | "from google.colab import drive\n",
28 | "drive.mount('/content/drive')"
29 | ],
30 | "execution_count": null,
31 | "outputs": []
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {
36 | "id": "RmP4Hy9UZZKA"
37 | },
38 | "source": [
39 | "#Import libraries"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "metadata": {
45 | "id": "DEfSbAA4QHas"
46 | },
47 | "source": [
48 | "import tensorflow as tf\n",
49 | "\n",
50 | "device_name = tf.test.gpu_device_name()\n",
51 | "if device_name != '/device:GPU:0':\n",
52 | " raise SystemError('GPU device not found')\n",
53 | "print('Found GPU at: {}'.format(device_name))"
54 | ],
55 | "execution_count": null,
56 | "outputs": []
57 | },
58 | {
59 | "cell_type": "code",
60 | "metadata": {
61 | "id": "0NmMdkZO8R6q"
62 | },
63 | "source": [
64 | "import torch\n",
65 | "\n",
66 | "# If there's a GPU available...\n",
67 | "if torch.cuda.is_available(): \n",
68 | "\n",
69 | " # Tell PyTorch to use the GPU. \n",
70 | " device = torch.device(\"cuda\")\n",
71 | "\n",
72 | " print('There are %d GPU(s) available.' % torch.cuda.device_count())\n",
73 | "\n",
74 | " print('We will use the GPU:', torch.cuda.get_device_name(0))\n",
75 | "\n",
76 | "# If not...\n",
77 | "else:\n",
78 | " print('No GPU available, using the CPU instead.')\n",
79 | " device = torch.device(\"cpu\")"
80 | ],
81 | "execution_count": null,
82 | "outputs": []
83 | },
84 | {
85 | "cell_type": "code",
86 | "metadata": {
87 | "id": "XfMNQrCXrhKP"
88 | },
89 | "source": [
90 | "!pip install transformers\n",
91 | "!pip install pytorch-pretrained-bert pytorch-nlp\n",
92 | "from transformers import BertModel#, RobertaModel\n",
93 | "import numpy as np\n",
94 | "import tensorflow as tf\n",
95 | "\n",
96 | "from transformers import *\n",
97 | "import pandas as pd\n",
98 | "#import torch\n",
99 | "from keras.preprocessing.sequence import pad_sequences\n",
100 | "from sklearn.model_selection import train_test_split\n",
101 | "import numpy as np\n",
102 | "import os\n",
103 | "import pickle"
104 | ],
105 | "execution_count": null,
106 | "outputs": []
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {
111 | "id": "E2gF3nZWZwNy"
112 | },
113 | "source": [
114 | "#Import data"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "metadata": {
120 | "id": "699-L4GYHV2I"
121 | },
122 | "source": [
123 | "df = pd.read_csv('/content/drive/MyDrive/sdpra/train.csv', header=None, names = ['text','label'])\n",
124 | "df.head()"
125 | ],
126 | "execution_count": null,
127 | "outputs": []
128 | },
129 | {
130 | "cell_type": "code",
131 | "metadata": {
132 | "id": "Q0IFhZ5IHjxg"
133 | },
134 | "source": [
135 | "train_texts = df.text.values\n",
136 | "\n",
137 | "possible_labels = df.label.unique()\n",
138 | "label_dict = {}\n",
139 | "for index, possible_label in enumerate(possible_labels):\n",
140 | " label_dict[possible_label] = index\n",
141 | "\n",
142 | "print(label_dict)\n",
143 | "\n",
144 | "df['label'] = df.label.replace(label_dict)\n",
145 | "train_labels = df.label.values"
146 | ],
147 | "execution_count": null,
148 | "outputs": []
149 | },
150 | {
151 | "cell_type": "code",
152 | "metadata": {
153 | "id": "c8G1dCa8HniE"
154 | },
155 | "source": [
156 | "df = pd.read_csv('/content/drive/MyDrive/sdpra/validation.csv', header=None, names = ['text','label'])\n",
157 | "df.head()"
158 | ],
159 | "execution_count": null,
160 | "outputs": []
161 | },
162 | {
163 | "cell_type": "code",
164 | "metadata": {
165 | "id": "L5XsBI0gNWwD"
166 | },
167 | "source": [
168 | "val_texts = df.text.values\r\n",
169 | "df['label'] = df.label.replace(label_dict)\r\n",
170 | "val_labels = df.label.values"
171 | ],
172 | "execution_count": null,
173 | "outputs": []
174 | },
175 | {
176 | "cell_type": "code",
177 | "metadata": {
178 | "id": "j3ha28ClHtMM"
179 | },
180 | "source": [
181 | "train_texts = list(train_texts) + list(df.text.values)\n",
182 | "df['label'] = df.label.replace(label_dict)\n",
183 | "train_labels = list(train_labels) + list(df.label.values)\n",
184 | "\n",
185 | "\n",
186 | "df = pd.DataFrame()\n",
187 | "df['text'] = pd.Series(train_texts)\n",
188 | "df['label'] = pd.Series(train_labels)\n",
189 | "df = df.sample(frac=1)\n",
190 | "df.head()"
191 | ],
192 | "execution_count": null,
193 | "outputs": []
194 | },
195 | {
196 | "cell_type": "code",
197 | "metadata": {
198 | "id": "8SnmkiWZxqwu"
199 | },
200 | "source": [
201 | "train_labels = df.label.values\r\n",
202 | "train_texts = df.text.values\r\n",
203 | "\r\n",
204 | "df = pd.read_csv('/content/drive/MyDrive/sdpra/test.csv', header=None, names = ['text'])\r\n",
205 | "val_texts = df.text.values"
206 | ],
207 | "execution_count": null,
208 | "outputs": []
209 | },
210 | {
211 | "cell_type": "code",
212 | "metadata": {
213 | "id": "G-dWiGvSy3Ey"
214 | },
215 | "source": [
216 | "print(len(train_labels))\r\n",
217 | "print(len(train_texts))\r\n",
218 | "\r\n",
219 | "print(len(val_texts))"
220 | ],
221 | "execution_count": null,
222 | "outputs": []
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {
227 | "id": "7WtgjJLo5NsX"
228 | },
229 | "source": [
230 | "#Preprocessing"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "metadata": {
236 | "id": "dLnCNHaEoGIR"
237 | },
238 | "source": [
239 | "train_texts = [text.lower().replace('\\r\\n',' ').replace('\\n',' ') for text in train_texts]\n",
240 | "val_texts = [text.lower().replace('\\r\\n',' ').replace('\\n',' ') for text in val_texts]"
241 | ],
242 | "execution_count": null,
243 | "outputs": []
244 | },
245 | {
246 | "cell_type": "code",
247 | "metadata": {
248 | "id": "9rHaME_JOIDD"
249 | },
250 | "source": [
251 | "train_texts[0]"
252 | ],
253 | "execution_count": null,
254 | "outputs": []
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {
259 | "id": "1GjZ7Z6naPTY"
260 | },
261 | "source": [
262 | "#Tokenization"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "metadata": {
268 | "id": "SrA8iiXCrylF"
269 | },
270 | "source": [
271 | "model_name = 'allenai/scibert_scivocab_uncased'\n",
272 | "#model_name = 'bert-base-uncased'\n",
273 | "#model_name = \"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract\"\n",
274 | "#model_name = 'roberta-large'\n",
275 | "tokenizer = BertTokenizer.from_pretrained(model_name)\n",
276 | "#tokenizer = RobertaTokenizer.from_pretrained(model_name) #for roberta large"
277 | ],
278 | "execution_count": null,
279 | "outputs": []
280 | },
281 | {
282 | "cell_type": "code",
283 | "metadata": {
284 | "id": "MxB6Z3Bzr60m"
285 | },
286 | "source": [
287 | "# Tokenize all of the sentences and map the tokens to thier word IDs.\n",
288 | "input_ids = []\n",
289 | "attention_masks = []\n",
290 | "\n",
291 | "# For every sentence...\n",
292 | "for i, sent in enumerate(train_texts):\n",
293 | " # `encode_plus` will:\n",
294 | " # (1) Tokenize the sentence.\n",
295 | " # (2) Prepend the `[CLS]` token to the start.\n",
296 | " # (3) Append the `[SEP]` token to the end.\n",
297 | " # (4) Map tokens to their IDs.\n",
298 | " # (5) Pad or truncate the sentence to `max_length`\n",
299 | " # (6) Create attention masks for [PAD] tokens.\n",
300 | " encoded_dict = tokenizer.encode_plus(\n",
301 | " sent, # Sentence to encode.\n",
302 | " truncation = True,\n",
303 | " add_special_tokens = True, # Add '[CLS]' and '[SEP]'\n",
304 | " max_length = 256, # Pad & truncate all sentences.\n",
305 | " pad_to_max_length = True,\n",
306 | " return_attention_mask = True, # Construct attn. masks.\n",
307 | " return_tensors = 'pt', # Return pytorch tensors.\n",
308 | " )\n",
309 | " # Add the encoded sentence to the list. \n",
310 | " input_ids.append(encoded_dict['input_ids'])\n",
311 | " \n",
312 | " # And its attention mask (simply differentiates padding from non-padding).\n",
313 | " attention_masks.append(encoded_dict['attention_mask'])\n",
314 | "\n",
315 | "# Convert the lists into tensors.\n",
316 | "input_ids = torch.cat(input_ids, dim=0)\n",
317 | "attention_masks = torch.cat(attention_masks, dim=0)\n",
318 | "labels = torch.tensor(train_labels)\n",
319 | "\n",
320 | "# Print sentence 0, now as a list of IDs.\n",
321 | "print('Original: ', train_texts[0])\n",
322 | "print('Token IDs:', input_ids[0])"
323 | ],
324 | "execution_count": null,
325 | "outputs": []
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {
330 | "id": "f9UK8OpDaUZo"
331 | },
332 | "source": [
333 | "#Split into train and validation samples"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "metadata": {
339 | "id": "NfCcyf_XsASo"
340 | },
341 | "source": [
342 | "from torch.utils.data import TensorDataset, random_split\n",
343 | "\n",
344 | "# Combine the training inputs into a TensorDataset.\n",
345 | "dataset = TensorDataset(input_ids, attention_masks, labels)\n",
346 | "\n",
347 | "# Create a 90-10 train-validation split.\n",
348 | "\n",
349 | "# Calculate the number of samples to include in each set.\n",
350 | "train_size = int(0.9 * len(dataset))\n",
351 | "val_size = len(dataset) - train_size\n",
352 | "\n",
353 | "# Divide the dataset by randomly selecting samples.\n",
354 | "train_dataset, val_dataset = random_split(dataset, [train_size, val_size])\n",
355 | "\n",
356 | "print('{:>5,} training samples'.format(train_size))\n",
357 | "print('{:>5,} validation samples'.format(val_size))"
358 | ],
359 | "execution_count": null,
360 | "outputs": []
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {
365 | "id": "2nGKq9cnagyT"
366 | },
367 | "source": [
368 | "#Model"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "metadata": {
374 | "id": "kUji4egasCrM"
375 | },
376 | "source": [
377 | "from torch.utils.data import DataLoader, RandomSampler, SequentialSampler\n",
378 | "\n",
379 | "# The DataLoader needs to know our batch size for training, so we specify it \n",
380 | "# here. For fine-tuning BERT on a specific task, the authors recommend a batch \n",
381 | "# size of 16 or 32.\n",
382 | "batch_size = 8\n",
383 | "\n",
384 | "# Create the DataLoaders for our training and validation sets.\n",
385 | "# We'll take training samples in random order. \n",
386 | "train_dataloader = DataLoader(\n",
387 | " train_dataset, # The training samples.\n",
388 | " sampler = RandomSampler(train_dataset), # Select batches randomly\n",
389 | " batch_size = batch_size # Trains with this batch size.\n",
390 | " )\n",
391 | "\n",
392 | "# For validation the order doesn't matter, so we'll just read them sequentially.\n",
393 | "validation_dataloader = DataLoader(\n",
394 | " val_dataset, # The validation samples.\n",
395 | " sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.\n",
396 | " batch_size = batch_size # Evaluate with this batch size.\n",
397 | " )"
398 | ],
399 | "execution_count": null,
400 | "outputs": []
401 | },
402 | {
403 | "cell_type": "code",
404 | "metadata": {
405 | "id": "gez19WipsGrk"
406 | },
407 | "source": [
408 | "from transformers import RobertaForSequenceClassification, AdamW, RobertaConfig\n",
409 | "\n",
410 | "# Load BertForSequenceClassification, the pretrained BERT model with a single \n",
411 | "# linear classification layer on top. \n",
412 | "#'''\n",
413 | "model = BertForSequenceClassification.from_pretrained(\n",
414 | " model_name, # Use the 12-layer BERT model, with an uncased vocab.\n",
415 | " num_labels = len(label_dict), # The number of output labels--2 for binary classification.\n",
416 | " # You can increase this for multi-class tasks. \n",
417 | " output_attentions = False, # Whether the model returns attentions weights.\n",
418 | " output_hidden_states = False, # Whether the model returns all hidden-states.\n",
419 | ")\n",
420 | "'''\n",
421 | "#for roberta large\n",
422 | "model = RobertaForSequenceClassification.from_pretrained(\n",
423 | " \"roberta-large\", # Use the 12-layer BERT model, with an uncased vocab.\n",
424 | " num_labels = len(label_dict), # The number of output labels--2 for binary classification.\n",
425 | " # You can increase this for multi-class tasks. \n",
426 | " output_attentions = False, # Whether the model returns attentions weights.\n",
427 | " output_hidden_states = False, # Whether the model returns all hidden-states.\n",
428 | ")\n",
429 | "'''\n",
430 | "model.cuda()"
431 | ],
432 | "execution_count": null,
433 | "outputs": []
434 | },
435 | {
436 | "cell_type": "code",
437 | "metadata": {
438 | "id": "_5Bki-YbsKHz"
439 | },
440 | "source": [
441 | "# Get all of the model's parameters as a list of tuples.\n",
442 | "params = list(model.named_parameters())\n",
443 | "\n",
444 | "print('The BERT model has {:} different named parameters.\\n'.format(len(params)))\n",
445 | "\n",
446 | "print('==== Embedding Layer ====\\n')\n",
447 | "\n",
448 | "for p in params[0:5]:\n",
449 | " print(\"{:<55} {:>12}\".format(p[0], str(tuple(p[1].size()))))\n",
450 | "\n",
451 | "print('\\n==== First Transformer ====\\n')\n",
452 | "\n",
453 | "for p in params[5:21]:\n",
454 | " print(\"{:<55} {:>12}\".format(p[0], str(tuple(p[1].size()))))\n",
455 | "\n",
456 | "print('\\n==== Output Layer ====\\n')\n",
457 | "\n",
458 | "for p in params[-4:]:\n",
459 | " print(\"{:<55} {:>12}\".format(p[0], str(tuple(p[1].size()))))"
460 | ],
461 | "execution_count": null,
462 | "outputs": []
463 | },
464 | {
465 | "cell_type": "code",
466 | "metadata": {
467 | "id": "qwTEGVO_sULP"
468 | },
469 | "source": [
470 | "# Note: AdamW is a class from the huggingface library (as opposed to pytorch) \n",
471 | "# I believe the 'W' stands for 'Weight Decay fix\"\n",
472 | "optimizer = AdamW(model.parameters(),\n",
473 | " lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5\n",
474 | " eps = 1e-8 # args.adam_epsilon - default is 1e-8.\n",
475 | " )"
476 | ],
477 | "execution_count": null,
478 | "outputs": []
479 | },
480 | {
481 | "cell_type": "code",
482 | "metadata": {
483 | "id": "-FSrNbcasXpI"
484 | },
485 | "source": [
486 | "from transformers import get_linear_schedule_with_warmup\n",
487 | "\n",
488 | "# Number of training epochs. The BERT authors recommend between 2 and 4. \n",
489 | "# We chose to run for 4, but we'll see later that this may be over-fitting the\n",
490 | "# training data.\n",
491 | "epochs = 3\n",
492 | "\n",
493 | "# Total number of training steps is [number of batches] x [number of epochs]. \n",
494 | "# (Note that this is not the same as the number of training samples).\n",
495 | "total_steps = len(train_dataloader) * epochs\n",
496 | "\n",
497 | "# Create the learning rate scheduler.\n",
498 | "scheduler = get_linear_schedule_with_warmup(optimizer, \n",
499 | " num_warmup_steps = 0, # Default value in run_glue.py\n",
500 | " num_training_steps = total_steps)"
501 | ],
502 | "execution_count": null,
503 | "outputs": []
504 | },
505 | {
506 | "cell_type": "code",
507 | "metadata": {
508 | "id": "Ew9TYlDGsa_R"
509 | },
510 | "source": [
511 | "import numpy as np\n",
512 | "\n",
513 | "# Function to calculate the accuracy of our predictions vs labels\n",
514 | "def flat_accuracy(preds, labels):\n",
515 | " pred_flat = np.argmax(preds, axis=1).flatten()\n",
516 | " labels_flat = labels.flatten()\n",
517 | " return np.sum(pred_flat == labels_flat) / len(labels_flat)"
518 | ],
519 | "execution_count": null,
520 | "outputs": []
521 | },
522 | {
523 | "cell_type": "code",
524 | "metadata": {
525 | "id": "Li765m2DseoO"
526 | },
527 | "source": [
528 | "import time\n",
529 | "import datetime\n",
530 | "\n",
531 | "def format_time(elapsed):\n",
532 | " '''\n",
533 | " Takes a time in seconds and returns a string hh:mm:ss\n",
534 | " '''\n",
535 | " # Round to the nearest second.\n",
536 | " elapsed_rounded = int(round((elapsed)))\n",
537 | " \n",
538 | " # Format as hh:mm:ss\n",
539 | " return str(datetime.timedelta(seconds=elapsed_rounded))\n"
540 | ],
541 | "execution_count": null,
542 | "outputs": []
543 | },
544 | {
545 | "cell_type": "code",
546 | "metadata": {
547 | "id": "SnPqEl8rsgL5"
548 | },
549 | "source": [
550 | "import random\n",
551 | "import numpy as np\n",
552 | "\n",
553 | "# This training code is based on the `run_glue.py` script here:\n",
554 | "# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128\n",
555 | "\n",
556 | "# Set the seed value all over the place to make this reproducible.\n",
557 | "seed_val = 1\n",
558 | "\n",
559 | "random.seed(seed_val)\n",
560 | "np.random.seed(seed_val)\n",
561 | "torch.manual_seed(seed_val)\n",
562 | "torch.cuda.manual_seed_all(seed_val)\n",
563 | "\n",
564 | "# We'll store a number of quantities such as training and validation loss, \n",
565 | "# validation accuracy, and timings.\n",
566 | "training_stats = []\n",
567 | "\n",
568 | "# Measure the total training time for the whole run.\n",
569 | "total_t0 = time.time()\n",
570 | "\n",
571 | "# For each epoch...\n",
572 | "for epoch_i in range(0, epochs):\n",
573 | " \n",
574 | " # ========================================\n",
575 | " # Training\n",
576 | " # ========================================\n",
577 | " \n",
578 | " # Perform one full pass over the training set.\n",
579 | "\n",
580 | " print(\"\")\n",
581 | " print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))\n",
582 | " print('Training...')\n",
583 | "\n",
584 | " # Measure how long the training epoch takes.\n",
585 | " t0 = time.time()\n",
586 | "\n",
587 | " # Reset the total loss for this epoch.\n",
588 | " total_train_loss = 0\n",
589 | "\n",
590 | " # Put the model into training mode. Don't be mislead--the call to \n",
591 | " # `train` just changes the *mode*, it doesn't *perform* the training.\n",
592 | " # `dropout` and `batchnorm` layers behave differently during training\n",
593 | " # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)\n",
594 | " model.train()\n",
595 | "\n",
596 | " # For each batch of training data...\n",
597 | " for step, batch in enumerate(train_dataloader):\n",
598 | "\n",
599 | " # Progress update every 40 batches.\n",
600 | " if step % 40 == 0 and not step == 0:\n",
601 | " # Calculate elapsed time in minutes.\n",
602 | " elapsed = format_time(time.time() - t0)\n",
603 | " \n",
604 | " # Report progress.\n",
605 | " print(' Batch {:>5,} of {:>5,}. Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))\n",
606 | "\n",
607 | " # Unpack this training batch from our dataloader. \n",
608 | " #\n",
609 | " # As we unpack the batch, we'll also copy each tensor to the GPU using the \n",
610 | " # `to` method.\n",
611 | " #\n",
612 | " # `batch` contains three pytorch tensors:\n",
613 | " # [0]: input ids \n",
614 | " # [1]: attention masks\n",
615 | " # [2]: labels \n",
616 | " b_input_ids = batch[0].to(device)\n",
617 | " b_input_mask = batch[1].to(device)\n",
618 | " b_labels = batch[2].to(device)\n",
619 | "\n",
620 | " # Always clear any previously calculated gradients before performing a\n",
621 | " # backward pass. PyTorch doesn't do this automatically because \n",
622 | " # accumulating the gradients is \"convenient while training RNNs\". \n",
623 | " # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)\n",
624 | " model.zero_grad() \n",
625 | "\n",
626 | " # Perform a forward pass (evaluate the model on this training batch).\n",
627 | " # The documentation for this `model` function is here: \n",
628 | " # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification\n",
629 | " # It returns different numbers of parameters depending on what arguments\n",
630 | " # arge given and what flags are set. For our useage here, it returns\n",
631 | " # the loss (because we provided labels) and the \"logits\"--the model\n",
632 | " # outputs prior to activation.\n",
633 | " loss = model(b_input_ids, \n",
634 | " token_type_ids=None, \n",
635 | " attention_mask=b_input_mask, \n",
636 | " labels=b_labels)[0]\n",
637 | " logits = model(b_input_ids, \n",
638 | " token_type_ids=None, \n",
639 | " attention_mask=b_input_mask, \n",
640 | " labels=b_labels)[1] \n",
641 | "\n",
642 | " # Accumulate the training loss over all of the batches so that we can\n",
643 | " # calculate the average loss at the end. `loss` is a Tensor containing a\n",
644 | " # single value; the `.item()` function just returns the Python value \n",
645 | " # from the tensor.\n",
646 | " total_train_loss += loss.item()\n",
647 | "\n",
648 | " # Perform a backward pass to calculate the gradients.\n",
649 | " loss.backward()\n",
650 | "\n",
651 | " # Clip the norm of the gradients to 1.0.\n",
652 | " # This is to help prevent the \"exploding gradients\" problem.\n",
653 | " torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
654 | "\n",
655 | " # Update parameters and take a step using the computed gradient.\n",
656 | " # The optimizer dictates the \"update rule\"--how the parameters are\n",
657 | " # modified based on their gradients, the learning rate, etc.\n",
658 | " optimizer.step()\n",
659 | "\n",
660 | " # Update the learning rate.\n",
661 | " scheduler.step()\n",
662 | "\n",
663 | " # Calculate the average loss over all of the batches.\n",
664 | " avg_train_loss = total_train_loss / len(train_dataloader) \n",
665 | " \n",
666 | " # Measure how long this epoch took.\n",
667 | " training_time = format_time(time.time() - t0)\n",
668 | "\n",
669 | " print(\"\")\n",
670 | " print(\" Average training loss: {0:.2f}\".format(avg_train_loss))\n",
671 | " print(\" Training epcoh took: {:}\".format(training_time))\n",
672 | " \n",
673 | " # ========================================\n",
674 | " # Validation\n",
675 | " # ========================================\n",
676 | " # After the completion of each training epoch, measure our performance on\n",
677 | " # our validation set.\n",
678 | "\n",
679 | " print(\"\")\n",
680 | " print(\"Running Validation...\")\n",
681 | "\n",
682 | " t0 = time.time()\n",
683 | "\n",
684 | " # Put the model in evaluation mode--the dropout layers behave differently\n",
685 | " # during evaluation.\n",
686 | " model.eval()\n",
687 | "\n",
688 | " # Tracking variables \n",
689 | " total_eval_accuracy = 0\n",
690 | " total_eval_loss = 0\n",
691 | " nb_eval_steps = 0\n",
692 | "\n",
693 | " # Evaluate data for one epoch\n",
694 | " for batch in validation_dataloader:\n",
695 | " \n",
696 | " # Unpack this training batch from our dataloader. \n",
697 | " #\n",
698 | " # As we unpack the batch, we'll also copy each tensor to the GPU using \n",
699 | " # the `to` method.\n",
700 | " #\n",
701 | " # `batch` contains three pytorch tensors:\n",
702 | " # [0]: input ids \n",
703 | " # [1]: attention masks\n",
704 | " # [2]: labels \n",
705 | " b_input_ids = batch[0].to(device)\n",
706 | " b_input_mask = batch[1].to(device)\n",
707 | " b_labels = batch[2].to(device)\n",
708 | " \n",
709 | " # Tell pytorch not to bother with constructing the compute graph during\n",
710 | " # the forward pass, since this is only needed for backprop (training).\n",
711 | " with torch.no_grad(): \n",
712 | "\n",
713 | " # Forward pass, calculate logit predictions.\n",
714 | " # token_type_ids is the same as the \"segment ids\", which \n",
715 | " # differentiates sentence 1 and 2 in 2-sentence tasks.\n",
716 | " # The documentation for this `model` function is here: \n",
717 | " # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification\n",
718 | " # Get the \"logits\" output by the model. The \"logits\" are the output\n",
719 | " # values prior to applying an activation function like the softmax.\n",
720 | " loss = model(b_input_ids, \n",
721 | " token_type_ids=None, \n",
722 | " attention_mask=b_input_mask,\n",
723 | " labels=b_labels)[0]\n",
724 | " logits = model(b_input_ids, \n",
725 | " token_type_ids=None, \n",
726 | " attention_mask=b_input_mask,\n",
727 | " labels=b_labels)[1]\n",
728 | " \n",
729 | " # Accumulate the validation loss.\n",
730 | " total_eval_loss += loss.item()\n",
731 | "\n",
732 | " # Move logits and labels to CPU\n",
733 | " logits = logits.detach().cpu().numpy()\n",
734 | " label_ids = b_labels.to('cpu').numpy()\n",
735 | "\n",
736 | " # Calculate the accuracy for this batch of test sentences, and\n",
737 | " # accumulate it over all batches.\n",
738 | " total_eval_accuracy += flat_accuracy(logits, label_ids)\n",
739 | " \n",
740 | "\n",
741 | " # Report the final accuracy for this validation run.\n",
742 | " avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)\n",
743 | " print(\" Accuracy: {0:.2f}\".format(avg_val_accuracy))\n",
744 | "\n",
745 | " # Calculate the average loss over all of the batches.\n",
746 | " avg_val_loss = total_eval_loss / len(validation_dataloader)\n",
747 | " \n",
748 | " # Measure how long the validation run took.\n",
749 | " validation_time = format_time(time.time() - t0)\n",
750 | " \n",
751 | " print(\" Validation Loss: {0:.2f}\".format(avg_val_loss))\n",
752 | " print(\" Validation took: {:}\".format(validation_time))\n",
753 | "\n",
754 | " # Record all statistics from this epoch.\n",
755 | " training_stats.append(\n",
756 | " {\n",
757 | " 'epoch': epoch_i + 1,\n",
758 | " 'Training Loss': avg_train_loss,\n",
759 | " 'Valid. Loss': avg_val_loss,\n",
760 | " 'Valid. Accur.': avg_val_accuracy,\n",
761 | " 'Training Time': training_time,\n",
762 | " 'Validation Time': validation_time\n",
763 | " }\n",
764 | " )\n",
765 | "\n",
766 | "print(\"\")\n",
767 | "print(\"Training complete!\")\n",
768 | "\n",
769 | "print(\"Total training took {:} (h:mm:ss)\".format(format_time(time.time()-total_t0)))"
770 | ],
771 | "execution_count": null,
772 | "outputs": []
773 | },
774 | {
775 | "cell_type": "code",
776 | "metadata": {
777 | "id": "Z55Kdx7mi1xa"
778 | },
779 | "source": [
780 | "model.save_pretrained('/content/drive/bert')"
781 | ],
782 | "execution_count": null,
783 | "outputs": []
784 | },
785 | {
786 | "cell_type": "markdown",
787 | "metadata": {
788 | "id": "imcRg9Z3bDlw"
789 | },
790 | "source": [
791 | "#Test"
792 | ]
793 | },
794 | {
795 | "cell_type": "code",
796 | "metadata": {
797 | "id": "IEAg_ois9225"
798 | },
799 | "source": [
800 | "import pandas as pd\n",
801 | "\n",
802 | "# Tokenize all of the sentences and map the tokens to their word IDs.\n",
803 | "input_ids = []\n",
804 | "attention_masks = []\n",
805 | "\n",
806 | "# For every sentence...\n",
807 | "for sent in val_texts:\n",
808 | " # `encode_plus` will:\n",
809 | " # (1) Tokenize the sentence.\n",
810 | " # (2) Prepend the `[CLS]` token to the start.\n",
811 | " # (3) Append the `[SEP]` token to the end.\n",
812 | " # (4) Map tokens to their IDs.\n",
813 | " # (5) Pad or truncate the sentence to `max_length`\n",
814 | " # (6) Create attention masks for [PAD] tokens.\n",
815 | " encoded_dict = tokenizer.encode_plus(\n",
816 | " sent, # Sentence to encode.\n",
817 | " add_special_tokens = True, # Add '[CLS]' and '[SEP]'\n",
818 | " max_length = 256, # Pad & truncate all sentences.\n",
819 | " truncation = True,\n",
820 | " pad_to_max_length = True,\n",
821 | " return_attention_mask = True, # Construct attn. masks.\n",
822 | " return_tensors = 'pt', # Return pytorch tensors.\n",
823 | " )\n",
824 | " \n",
825 | " # Add the encoded sentence to the list. \n",
826 | " input_ids.append(encoded_dict['input_ids'])\n",
827 | " \n",
828 | " # And its attention mask (simply differentiates padding from non-padding).\n",
829 | " attention_masks.append(encoded_dict['attention_mask'])\n",
830 | "\n",
831 | "# Convert the lists into tensors.\n",
832 | "input_ids = torch.cat(input_ids, dim=0)\n",
833 | "attention_masks = torch.cat(attention_masks, dim=0)\n",
834 | "#labels = torch.tensor(val_labels)\n",
835 | "\n",
836 | "# Set the batch size. \n",
837 | "batch_size = 8\n",
838 | "\n",
839 | "# Create the DataLoader.\n",
840 | "prediction_data = TensorDataset(input_ids, attention_masks)\n",
841 | "prediction_sampler = SequentialSampler(prediction_data)\n",
842 | "prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)"
843 | ],
844 | "execution_count": null,
845 | "outputs": []
846 | },
847 | {
848 | "cell_type": "code",
849 | "metadata": {
850 | "id": "lLlmfbbP-Lof"
851 | },
852 | "source": [
853 | "# Prediction on test set\n",
854 | "\n",
855 | "print('Predicting labels for {:,} test sentences...'.format(len(input_ids)))\n",
856 | "\n",
857 | "\n",
858 | "# Put model in evaluation mode\n",
859 | "model.eval()\n",
860 | "\n",
861 | "# Tracking variables \n",
862 | "predictions, true_labels = [], []\n",
863 | "\n",
864 | "# Predict \n",
865 | "for batch in prediction_dataloader:\n",
866 | " # Add batch to GPU\n",
867 | " batch = tuple(t.to(device) for t in batch)\n",
868 | " \n",
869 | " # Unpack the inputs from our dataloader\n",
870 | " b_input_ids, b_input_mask = batch\n",
871 | " \n",
872 | " # Telling the model not to compute or store gradients, saving memory and \n",
873 | " # speeding up prediction\n",
874 | " with torch.no_grad():\n",
875 | " # Forward pass, calculate logit predictions\n",
876 | " outputs = model(b_input_ids, token_type_ids=None, \n",
877 | " attention_mask=b_input_mask)\n",
878 | "\n",
879 | " logits = outputs[0]\n",
880 | "\n",
881 | " # Move logits and labels to CPU\n",
882 | " logits = logits.detach().cpu().numpy()\n",
883 | " #label_ids = b_labels.to('cpu').numpy()\n",
884 | " \n",
885 | " # Store predictions and true labels\n",
886 | " predictions.append(logits)\n",
887 | " #true_labels.append(label_ids)\n",
888 | "\n",
889 | "print(' DONE.')"
890 | ],
891 | "execution_count": null,
892 | "outputs": []
893 | },
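{
"cell_type": "code",
"metadata": {},
"source": [
"# A minimal sketch (assuming numpy is already imported as np): the cell\n",
"# above only collects the raw per-batch logits, so turning them into a\n",
"# flat array of predicted label ids takes one concatenate + argmax.\n",
"flat_predictions = np.argmax(np.concatenate(predictions, axis=0), axis=1)\n",
"print(flat_predictions[:10])"
],
"execution_count": null,
"outputs": []
},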
894 | {
895 | "cell_type": "code",
896 | "metadata": {
897 | "id": "JQDwYWHGcAOx"
898 | },
899 | "source": [
900 | "import pickle\r\n",
901 | "\r\n",
902 | "with open('/content/drive/predictions.pickle', 'wb') as f:\r\n",
903 | " pickle.dump(predictions, f)"
904 | ],
905 | "execution_count": null,
906 | "outputs": []
907 | },
908 | {
909 | "cell_type": "code",
910 | "metadata": {
911 | "id": "5nJEm2o4izmB"
912 | },
913 | "source": [
914 | "model.save_pretrained('/content/drive/bert')"
915 | ],
916 | "execution_count": null,
917 | "outputs": []
918 | },
919 | {
920 | "cell_type": "markdown",
921 | "metadata": {
922 | "id": "Vo9NjfLCMgKl"
923 | },
924 | "source": [
925 | "# Getting BERT embeddings"
926 | ]
927 | },
928 | {
929 | "cell_type": "code",
930 | "metadata": {
931 | "colab": {
932 | "background_save": true
933 | },
934 | "id": "myFoZyZTMfPs"
935 | },
936 | "source": [
937 | "model = BertModel.from_pretrained('/content/drive/bert')\n",
938 | "\n",
939 | "values = []\n",
940 | "\n",
941 | "for text in train_texts:\n",
942 | " input_ids = torch.tensor(tokenizer.encode(text, truncation = True, \\\n",
943 | " add_special_tokens = True, max_length = 256)).unsqueeze(0) # Batch size 1\n",
944 | " outputs = model(input_ids)\n",
945 | " last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple\n",
946 | " values.append(np.average(last_hidden_states[0].detach().cpu().numpy(), axis = 0))\n",
947 | "\n",
948 | "with open('/content/drive/bert_embs_train.pickle', 'wb') as f:\n",
949 | " pickle.dump(values, f)"
950 | ],
951 | "execution_count": null,
952 | "outputs": []
953 | },
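{
"cell_type": "code",
"metadata": {},
"source": [
"# Optional variant (a sketch, not the submitted pipeline): the same averaged\n",
"# embedding can be computed inside torch.no_grad() so that no gradient\n",
"# buffers are kept while encoding an abstract.\n",
"with torch.no_grad():\n",
"    example_ids = torch.tensor(tokenizer.encode(train_texts[0], truncation = True,\n",
"        add_special_tokens = True, max_length = 256)).unsqueeze(0)\n",
"    example_emb = model(example_ids)[0][0].mean(dim=0)\n",
"print(example_emb.shape)"
],
"execution_count": null,
"outputs": []
},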
954 | {
955 | "cell_type": "code",
956 | "metadata": {
957 | "id": "zFhgjuphOaRY"
958 | },
959 | "source": [
960 | "values = []\n",
961 | "\n",
962 | "for text in val_texts:\n",
963 | " input_ids = torch.tensor(tokenizer.encode(text, truncation = True, \\\n",
964 | " add_special_tokens = True, max_length = 256)).unsqueeze(0) # Batch size 1\n",
965 | " outputs = model(input_ids)\n",
966 | " last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple\n",
967 | " values.append(np.average(last_hidden_states[0].detach().cpu().numpy(), axis = 0))\n",
968 | "\n",
969 | "with open('/content/drive/bert_embs_val.pickle', 'wb') as f:\n",
970 | " pickle.dump(values, f)"
971 | ],
972 | "execution_count": null,
973 | "outputs": []
974 | }
975 | ]
976 | }
--------------------------------------------------------------------------------
/IIITT/run2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Ensemble_classifier.ipynb",
7 | "provenance": []
8 | },
9 | "kernelspec": {
10 | "name": "python3",
11 | "display_name": "Python 3"
12 | }
13 | },
14 | "cells": [
15 | {
16 | "cell_type": "code",
17 | "metadata": {
18 | "id": "lWckrP6pStpg"
19 | },
20 | "source": [
21 | "import pandas as pd\r\n",
22 | "import numpy as np\r\n",
23 | "import matplotlib.pyplot as plt\r\n"
24 | ],
25 | "execution_count": null,
26 | "outputs": []
27 | },
28 | {
29 | "cell_type": "code",
30 | "metadata": {
31 | "colab": {
32 | "base_uri": "https://localhost:8080/"
33 | },
34 | "id": "4FUUUT_FX1ze",
35 | "outputId": "bba714b3-29dd-4e68-acbd-3624f1ceb354"
36 | },
37 | "source": [
38 | "cd /content/drive/MyDrive/sdpra2021/pred_probs/"
39 | ],
40 | "execution_count": null,
41 | "outputs": [
42 | {
43 | "output_type": "stream",
44 | "text": [
45 | "/content/drive/MyDrive/sdpra2021/pred_probs\n"
46 | ],
47 | "name": "stdout"
48 | }
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "metadata": {
54 | "colab": {
55 | "base_uri": "https://localhost:8080/",
56 | "height": 204
57 | },
58 | "id": "ounVF-yhYQsL",
59 | "outputId": "8020bb22-24ad-499d-d87b-a12cac302747"
60 | },
61 | "source": [
62 | "bert = pd.read_csv('bert.csv')\r\n",
63 | "bert = bert.drop(columns='Unnamed: 0')\r\n",
64 | "bert.head() "
65 | ],
66 | "execution_count": null,
67 | "outputs": [
68 | {
69 | "output_type": "execute_result",
70 | "data": {
166 | "text/plain": [
167 | " CL ... abstract\n",
168 | "0 0.000150 ... This paper analyses the possibilities of per...\n",
169 | "1 0.000809 ... A finite element method is presented to comp...\n",
170 | "2 0.998192 ... This paper includes a reflection on the role...\n",
171 | "3 0.000124 ... In this document, we describe the fractal st...\n",
172 | "4 0.000166 ... We show how to test whether a graph with n v...\n",
173 | "\n",
174 | "[5 rows x 9 columns]"
175 | ]
176 | },
177 | "metadata": {
178 | "tags": []
179 | },
180 | "execution_count": 5
181 | }
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "metadata": {
187 | "colab": {
188 | "base_uri": "https://localhost:8080/",
189 | "height": 204
190 | },
191 | "id": "46d-Z4cYYc5r",
192 | "outputId": "97fcfe13-9e50-44ed-bf96-282ada494420"
193 | },
194 | "source": [
195 | "roberta = pd.read_csv('roberta.csv')\r\n",
196 | "roberta = roberta.drop(columns=['Unnamed: 0'])\r\n",
197 | "roberta.head() "
198 | ],
199 | "execution_count": null,
200 | "outputs": [
201 | {
202 | "output_type": "execute_result",
203 | "data": {
299 | "text/plain": [
300 | " CL ... abstract\n",
301 | "0 0.002313 ... This paper analyses the possibilities of per...\n",
302 | "1 0.002194 ... A finite element method is presented to comp...\n",
303 | "2 0.997938 ... This paper includes a reflection on the role...\n",
304 | "3 0.006236 ... In this document, we describe the fractal st...\n",
305 | "4 0.000752 ... We show how to test whether a graph with n v...\n",
306 | "\n",
307 | "[5 rows x 9 columns]"
308 | ]
309 | },
310 | "metadata": {
311 | "tags": []
312 | },
313 | "execution_count": 7
314 | }
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "metadata": {
320 | "colab": {
321 | "base_uri": "https://localhost:8080/",
322 | "height": 204
323 | },
324 | "id": "7N1k70ZBZdKp",
325 | "outputId": "c0facf87-f366-43a4-aa0c-2be4a66110c2"
326 | },
327 | "source": [
328 | "scibert = pd.read_csv('scibert.csv')\r\n",
329 | "scibert = scibert.drop(columns='Unnamed: 0')\r\n",
330 | "scibert.head() "
331 | ],
332 | "execution_count": null,
333 | "outputs": [
334 | {
335 | "output_type": "execute_result",
336 | "data": {
432 | "text/plain": [
433 | " CL ... abstract\n",
434 | "0 0.000159 ... This paper analyses the possibilities of per...\n",
435 | "1 0.000286 ... A finite element method is presented to comp...\n",
436 | "2 0.999109 ... This paper includes a reflection on the role...\n",
437 | "3 0.000146 ... In this document, we describe the fractal st...\n",
438 | "4 0.000225 ... We show how to test whether a graph with n v...\n",
439 | "\n",
440 | "[5 rows x 9 columns]"
441 | ]
442 | },
443 | "metadata": {
444 | "tags": []
445 | },
446 | "execution_count": 19
447 | }
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "metadata": {
453 | "id": "ULhUk0xsh3_g"
454 | },
455 | "source": [
456 | "test = pd.read_csv('/content/drive/MyDrive/spdra2021/Datasets/test.csv',delimiter=',',\r\n",
457 | " header=None,names=['text'])\r\n",
458 | " "
459 | ],
460 | "execution_count": null,
461 | "outputs": []
462 | },
463 | {
464 | "cell_type": "code",
465 | "metadata": {
466 | "id": "3ZWYYi6hrC_-"
467 | },
468 | "source": [
469 | "labels = ['CL','CR','DC','DS','LO','NI','SE']"
470 | ],
471 | "execution_count": null,
472 | "outputs": []
473 | },
474 | {
475 | "cell_type": "code",
476 | "metadata": {
477 | "colab": {
478 | "base_uri": "https://localhost:8080/"
479 | },
480 | "id": "_Zj5_hh64Wk4",
481 | "outputId": "3ec48089-225f-44c3-e524-099d8a43ce86"
482 | },
483 | "source": [
484 | "test['text']"
485 | ],
486 | "execution_count": null,
487 | "outputs": [
488 | {
489 | "output_type": "execute_result",
490 | "data": {
491 | "text/plain": [
492 | "0 This paper analyses the possibilities of per...\n",
493 | "1 A finite element method is presented to comp...\n",
494 | "2 This paper includes a reflection on the role...\n",
495 | "3 In this document, we describe the fractal st...\n",
496 | "4 We show how to test whether a graph with n v...\n",
497 | " ... \n",
498 | "6995 It is common practice to compare the computa...\n",
499 | "6996 Defeasible reasoning is a simple but efficie...\n",
500 | "6997 The almost periodic functions form a natural...\n",
501 | "6998 A notion of alternating timed automata is pr...\n",
502 | "6999 We present a hierarchical framework for anal...\n",
503 | "Name: text, Length: 7000, dtype: object"
504 | ]
505 | },
506 | "metadata": {
507 | "tags": []
508 | },
509 | "execution_count": 42
510 | }
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "metadata": {
516 | "colab": {
517 | "base_uri": "https://localhost:8080/"
518 | },
519 | "id": "Q1rhux11aG7f",
520 | "outputId": "8a7ef2ef-5f1b-4b7a-a0f5-3b6c4099bf86"
521 | },
522 | "source": [
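"# Quick check of how strongly the three models' class probabilities agree\r\n",
"# (percentile-rank correlation) before averaging them into the submission.\r\n",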
523 | "for label in labels:\r\n",
524 | " print(label)\r\n",
525 | " print(np.corrcoef([bert[label].rank(pct=True), roberta[label].rank(pct=True), scibert[label].rank(pct=True)]))\r\n",
526 | "submission = pd.DataFrame()\r\n",
527 | "#submission['id'] = a['abstract']\r\n",
528 | "for label in labels:\r\n",
529 | " submission[label] = round((bert[label] + roberta[label] + scibert[label])/3,6)\r\n",
530 | "submission['result'] = submission.idxmax(axis = 1) \r\n",
531 | "submission['result'] = submission['result'].apply({'CL':0,'CR':1,'DC':2,\r\n",
532 | "'DS':3,'LO':4, 'NI':5, 'SE':6}.get) \r\n",
533 | "submission['id'] = test['text']\r\n",
534 | "submission.to_csv('submission.csv', index=False)"
535 | ],
536 | "execution_count": null,
537 | "outputs": [
538 | {
539 | "output_type": "stream",
540 | "text": [
541 | "CL\n",
542 | "[[1. 0.60876764 0.79489693]\n",
543 | " [0.60876764 1. 0.43417398]\n",
544 | " [0.79489693 0.43417398 1. ]]\n",
545 | "CR\n",
546 | "[[1. 0.81781081 0.77273806]\n",
547 | " [0.81781081 1. 0.69869303]\n",
548 | " [0.77273806 0.69869303 1. ]]\n",
549 | "DC\n",
550 | "[[1. 0.84889632 0.85096035]\n",
551 | " [0.84889632 1. 0.88747852]\n",
552 | " [0.85096035 0.88747852 1. ]]\n",
553 | "DS\n",
554 | "[[1. 0.92145531 0.84394307]\n",
555 | " [0.92145531 1. 0.82648213]\n",
556 | " [0.84394307 0.82648213 1. ]]\n",
557 | "LO\n",
558 | "[[1. 0.82319259 0.72438774]\n",
559 | " [0.82319259 1. 0.80665013]\n",
560 | " [0.72438774 0.80665013 1. ]]\n",
561 | "NI\n",
562 | "[[1. 0.92307865 0.91320051]\n",
563 | " [0.92307865 1. 0.90765773]\n",
564 | " [0.91320051 0.90765773 1. ]]\n",
565 | "SE\n",
566 | "[[1. 0.71567135 0.72420244]\n",
567 | " [0.71567135 1. 0.89973318]\n",
568 | " [0.72420244 0.89973318 1. ]]\n"
569 | ],
570 | "name": "stdout"
571 | }
572 | ]
573 | },
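{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch of the percentile-rank ensembling described for run 3 (the submitted\r\n",
"# version is in run3.ipynb): each model's class probabilities are replaced by\r\n",
"# their percentile ranks before averaging, so the three models contribute on a\r\n",
"# comparable scale even when their probabilities are calibrated differently.\r\n",
"rank_ensemble = pd.DataFrame()\r\n",
"for label in labels:\r\n",
"    rank_ensemble[label] = (bert[label].rank(pct=True)\r\n",
"                            + roberta[label].rank(pct=True)\r\n",
"                            + scibert[label].rank(pct=True)) / 3\r\n",
"rank_ensemble['result'] = rank_ensemble.idxmax(axis=1)"
],
"execution_count": null,
"outputs": []
},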
574 | {
575 | "cell_type": "code",
576 | "metadata": {
577 | "colab": {
578 | "base_uri": "https://localhost:8080/",
579 | "height": 405
580 | },
581 | "id": "EMqzUplRHjxa",
582 | "outputId": "590418e4-cadb-4e96-c48d-da936cc05ce7"
583 | },
584 | "source": [
585 | "submission"
586 | ],
587 | "execution_count": null,
588 | "outputs": [
589 | {
590 | "output_type": "execute_result",
591 | "data": {
760 | "text/plain": [
761 | " CL ... id\n",
762 | "0 0.000874 ... This paper analyses the possibilities of per...\n",
763 | "1 0.001096 ... A finite element method is presented to comp...\n",
764 | "2 0.998413 ... This paper includes a reflection on the role...\n",
765 | "3 0.002169 ... In this document, we describe the fractal st...\n",
766 | "4 0.000381 ... We show how to test whether a graph with n v...\n",
767 | "... ... ... ...\n",
768 | "6995 0.001178 ... It is common practice to compare the computa...\n",
769 | "6996 0.001471 ... Defeasible reasoning is a simple but efficie...\n",
770 | "6997 0.001451 ... The almost periodic functions form a natural...\n",
771 | "6998 0.001222 ... A notion of alternating timed automata is pr...\n",
772 | "6999 0.001455 ... We present a hierarchical framework for anal...\n",
773 | "\n",
774 | "[7000 rows x 9 columns]"
775 | ]
776 | },
777 | "metadata": {
778 | "tags": []
779 | },
780 | "execution_count": 48
781 | }
782 | ]
783 | },
784 | {
785 | "cell_type": "code",
786 | "metadata": {
787 | "id": "7Nl0V7Bm47yj"
788 | },
789 | "source": [
790 | "submission['result'] = submission['result'].apply({0:'CL', 1:'CR', 2:'DC',\r\n",
791 | "3:'DS', 4:'LO', 5:'NI', 6:'SE' }.get)\r\n"
792 | ],
793 | "execution_count": null,
794 | "outputs": []
795 | },
796 | {
797 | "cell_type": "code",
798 | "metadata": {
799 | "colab": {
800 | "base_uri": "https://localhost:8080/",
801 | "height": 34
802 | },
803 | "id": "HlxxQwny5kfl",
804 | "outputId": "61254762-c27b-41e1-adee-584c01dc7269"
805 | },
806 | "source": [
807 | "result = submission['result'].to_numpy()\r\n",
808 | "print(len(result))\r\n",
809 | "np.savetxt(\"run2.txt\", result, fmt = \"%s\")\r\n",
810 | "from google.colab import files\r\n",
811 | "files.download('run2.txt')"
812 | ],
813 | "execution_count": null,
814 | "outputs": [
815 | {
816 | "output_type": "stream",
817 | "text": [
818 | "7000\n"
819 | ],
820 | "name": "stdout"
821 | },
822 | {
823 | "output_type": "display_data",
824 | "data": {
825 | "application/javascript": [
826 | "\n",
827 | " async function download(id, filename, size) {\n",
828 | " if (!google.colab.kernel.accessAllowed) {\n",
829 | " return;\n",
830 | " }\n",
831 | " const div = document.createElement('div');\n",
832 | " const label = document.createElement('label');\n",
833 | " label.textContent = `Downloading \"${filename}\": `;\n",
834 | " div.appendChild(label);\n",
835 | " const progress = document.createElement('progress');\n",
836 | " progress.max = size;\n",
837 | " div.appendChild(progress);\n",
838 | " document.body.appendChild(div);\n",
839 | "\n",
840 | " const buffers = [];\n",
841 | " let downloaded = 0;\n",
842 | "\n",
843 | " const channel = await google.colab.kernel.comms.open(id);\n",
844 | " // Send a message to notify the kernel that we're ready.\n",
845 | " channel.send({})\n",
846 | "\n",
847 | " for await (const message of channel.messages) {\n",
848 | " // Send a message to notify the kernel that we're ready.\n",
849 | " channel.send({})\n",
850 | " if (message.buffers) {\n",
851 | " for (const buffer of message.buffers) {\n",
852 | " buffers.push(buffer);\n",
853 | " downloaded += buffer.byteLength;\n",
854 | " progress.value = downloaded;\n",
855 | " }\n",
856 | " }\n",
857 | " }\n",
858 | " const blob = new Blob(buffers, {type: 'application/binary'});\n",
859 | " const a = document.createElement('a');\n",
860 | " a.href = window.URL.createObjectURL(blob);\n",
861 | " a.download = filename;\n",
862 | " div.appendChild(a);\n",
863 | " a.click();\n",
864 | " div.remove();\n",
865 | " }\n",
866 | " "
867 | ],
868 | "text/plain": [
869 | ""
870 | ]
871 | },
872 | "metadata": {
873 | "tags": []
874 | }
875 | },
876 | {
877 | "output_type": "display_data",
878 | "data": {
879 | "application/javascript": [
880 | "download(\"download_a34366a4-3e84-43f2-bd10-e8e55c7350e6\", \"ensemble1.txt\", 21000)"
881 | ],
882 | "text/plain": [
883 | ""
884 | ]
885 | },
886 | "metadata": {
887 | "tags": []
888 | }
889 | }
890 | ]
891 | },
892 | {
893 | "cell_type": "code",
894 | "metadata": {
895 | "id": "cBhMAqPaINUo",
896 | "colab": {
897 | "base_uri": "https://localhost:8080/"
898 | },
899 | "outputId": "6380bb97-4f42-414b-84d2-a030502a7ec5"
900 | },
901 | "source": [
902 | "from sklearn.metrics import classification_report\r\n",
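"# Agreement between the ensemble and the SciBERT run (predictions against\r\n",
"# predictions, not against gold labels).\r\n",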
903 | "print(classification_report(scibert['result'],submission['result']))"
904 | ],
905 | "execution_count": null,
906 | "outputs": [
907 | {
908 | "output_type": "stream",
909 | "text": [
910 | " precision recall f1-score support\n",
911 | "\n",
912 | " 0 0.99 0.99 0.99 1184\n",
913 | " 1 0.96 0.96 0.96 1122\n",
914 | " 2 0.94 0.92 0.93 812\n",
915 | " 3 0.97 0.97 0.97 1074\n",
916 | " 4 0.97 0.97 0.97 759\n",
917 | " 5 0.97 0.97 0.97 1199\n",
918 | " 6 0.96 0.96 0.96 850\n",
919 | "\n",
920 | " accuracy 0.97 7000\n",
921 | " macro avg 0.96 0.96 0.96 7000\n",
922 | "weighted avg 0.97 0.97 0.97 7000\n",
923 | "\n"
924 | ],
925 | "name": "stdout"
926 | }
927 | ]
928 | },
929 | {
930 | "cell_type": "code",
931 | "metadata": {
932 | "colab": {
933 | "base_uri": "https://localhost:8080/"
934 | },
935 | "id": "QHC1Pec_keR-",
936 | "outputId": "98347f5d-082f-4435-85cc-c026acb78375"
937 | },
938 | "source": [
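"# `val` (gold labels for the validation split) is assumed to be already loaded in the session.\r\n",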
939 | "print(classification_report(val['label'],scibert['result']))"
940 | ],
941 | "execution_count": null,
942 | "outputs": [
943 | {
944 | "output_type": "stream",
945 | "text": [
946 | " precision recall f1-score support\n",
947 | "\n",
948 | " 0 0.98 0.98 0.98 1866\n",
949 | " 1 0.92 0.91 0.91 1835\n",
950 | " 2 0.82 0.83 0.83 1355\n",
951 | " 3 0.93 0.93 0.93 1774\n",
952 | " 4 0.93 0.93 0.93 1217\n",
953 | " 5 0.91 0.91 0.91 1826\n",
954 | " 6 0.89 0.91 0.90 1327\n",
955 | "\n",
956 | " accuracy 0.92 11200\n",
957 | " macro avg 0.91 0.91 0.91 11200\n",
958 | "weighted avg 0.92 0.92 0.92 11200\n",
959 | "\n"
960 | ],
961 | "name": "stdout"
962 | }
963 | ]
964 | },
965 | {
966 | "cell_type": "markdown",
967 | "metadata": {
968 | "id": "0pBDXhATIqGJ"
969 | },
970 | "source": [
971 | "# Predictions"
972 | ]
973 | },
974 | {
975 | "cell_type": "code",
976 | "metadata": {
977 | "id": "0T6MeF7NkxY4"
978 | },
979 | "source": [
980 | "\r\n",
981 | "\r\n",
982 | "\"\"\"\r\n",
983 | "The submission file IIITT.zip has the systems as follows:\r\n",
984 | "\r\n",
985 | "run 1 : Pre-trained Transformer Model (allenai/scibert_scivocab_uncased)\r\n",
986 | "run 2 : Average of probabilities of predictions of (BERT_base_uncased + RoBERTa_base + SciBERT)\r\n",
987 | "run 3 : Ensemble of probabilities of predictions by ranking the percentile of the result stored as a pandas DataFrame\r\n",
988 | "\"\"\""
989 | ],
990 | "execution_count": null,
991 | "outputs": []
992 | }
993 | ]
994 | }
--------------------------------------------------------------------------------