├── .github
│   └── ISSUE_TEMPLATE
│       ├── bug_report.md
│       ├── custom.md
│       └── feature_request.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.rst
├── Data
│   ├── Download_Glove.py
│   ├── Download_WOS.py
│   └── README.rst
├── LICENSE
├── README.rst
├── WordArt.png
├── code
│   ├── Bagging.py
│   ├── Boost.py
│   ├── CNN.py
│   ├── CRF.py
│   ├── DNN.py
│   ├── Decision_Tree.py
│   ├── HDLTex
│   │   └── README.rst
│   ├── Hierarchical_Attention_Networks
│   │   ├── README.md
│   │   ├── textClassifierConv.py
│   │   ├── textClassifierHATT.py
│   │   └── textClassifierRNN.py
│   ├── K-nearest_Neighbor.py
│   ├── MultinomialNB.py
│   ├── RCNN.py
│   ├── RMDL
│   │   └── README.rst
│   ├── RNN.py
│   ├── Random_Forest.py
│   ├── Rocchio_classification.py
│   └── SVM.py
└── docs
    ├── Text_Classification.pdf
    ├── _config.yml
    ├── eq
    │   ├── tf-idf.gif
    │   └── tfidf.gif
    └── pic
        ├── Autoencoder.png
        ├── BPW.png
        ├── Bagging.PNG
        ├── Boosting.PNG
        ├── CBOW.png
        ├── CNN.png
        ├── CRF.png
        ├── DNN.png
        ├── F1.png
        ├── GitHub-Mark-32px.png
        ├── GitHub-Mark-64px.png
        ├── Glove.PNG
        ├── Glove_VS_DCWE.png
        ├── HAN.png
        ├── HDLTex.png
        ├── KNN.png
        ├── LSTM.png
        ├── OverviewTextClassification.png
        ├── RDL.jpg
        ├── RDL.png
        ├── RF.png
        ├── RMDL.jpg
        ├── RMDL.png
        ├── RMDL_Results.png
        ├── RMDL_Results_small.png
        ├── RNN.png
        ├── Random Projection.png
        ├── SVM.png
        ├── TSNE.png
        ├── Word Art.png
        ├── WordArt.png
        ├── fasttext-logo-color-web.png
        ├── github-logo.png
        ├── line.png
        ├── ngram_cnn_highway_1.png
        └── sphx_glr_plot_roc_001.png
/.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/custom.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Custom issue template 3 | about: Describe this issue template's purpose here. 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 
15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, sex characteristics, gender identity and expression, 9 | level of experience, education, socio-economic status, nationality, personal 10 | appearance, race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the project team at kk7nc@virginia.edu. All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 
63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /CONTRIBUTING.rst: -------------------------------------------------------------------------------- 1 | 2 | ************* 3 | Contributing 4 | ************* 5 | 6 | *For typos, please do not create a pull request. Instead, declare them in issues or email the repository owner*. Please note that we have a code of conduct; please follow it in all your interactions with the project. 7 | 8 | ==================== 9 | Pull Request Process 10 | ==================== 11 | 12 | Please consider the following criteria so that we can review your pull request efficiently: 13 | 14 | 1. The pull request is mainly expected to be a link suggestion. 15 | 2. Please make sure your suggested resources are not obsolete or broken. 16 | 3. Ensure any install or build dependencies are removed before the end of the build when creating a 17 | pull request. 18 | 4. Add comments with details of changes to the interface; this includes new environment 19 | variables, exposed ports, useful file locations, and container parameters. 20 | 5. You may merge the Pull Request once you have the sign-off of at least one other developer; if you 21 | do not have permission to do that, you may request the owner to merge it for you if you believe all checks have passed. 22 | 23 | Thank you! 24 | -------------------------------------------------------------------------------- /Data/Download_Glove.py: -------------------------------------------------------------------------------- 1 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 2 | RMDL: Random Multimodel Deep Learning for Classification 3 | * Copyright (C) 2018 Kamran Kowsari 4 | * Last Update: 04/25/2018 5 | * This file is part of the RMDL project, University of Virginia. 
6 | * Free to use, change, share and distribute source code of RMDL 7 | * Referenced paper: RMDL: Random Multimodel Deep Learning for Classification 8 | * Referenced paper: An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL) 9 | * Comments and errors: email: kk7nc@virginia.edu 10 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 11 | 12 | 13 | from __future__ import print_function 14 | 15 | import os, sys, tarfile 16 | import numpy as np 17 | import zipfile 18 | 19 | if sys.version_info >= (3, 0, 0): 20 | import urllib.request as urllib # ugly but works 21 | else: 22 | import urllib 23 | 24 | print(sys.version_info) 25 | 26 | # image shape 27 | 28 | 29 | # path to the directory with the data 30 | DATA_DIR = '.\Glove' 31 | 32 | # url of the binary data 33 | 34 | 35 | 36 | # path to the binary train file with image data 37 | 38 | 39 | def download_and_extract(data='Wikipedia'): 40 | """ 41 | Download and extract the GloVe word vectors 42 | :return: path to the extraction directory 43 | """ 44 | 45 | if data=='Wikipedia': 46 | DATA_URL = 'http://nlp.stanford.edu/data/glove.6B.zip' 47 | elif data=='Common_Crawl_840B': 48 | DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip' 49 | elif data=='Common_Crawl_42B': 50 | DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip' 51 | elif data=='Twitter': 52 | DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip' 53 | else: 54 | print("parameter must be Twitter, Common_Crawl_42B, Common_Crawl_840B, or Wikipedia") 55 | exit(0) 56 | 57 | 58 | dest_directory = DATA_DIR 59 | if not os.path.exists(dest_directory): 60 | os.makedirs(dest_directory) 61 | filename = DATA_URL.split('/')[-1] 62 | filepath = os.path.join(dest_directory, filename) 63 | print(filepath) 64 | 65 | path = os.path.abspath(dest_directory) 66 | if not os.path.exists(filepath): 67 | def _progress(count, block_size, total_size): 68 | sys.stdout.write('\rDownloading %s %.2f%%' % (filename, 69 | float(count * block_size) / float(total_size) * 100.0)) 70 | sys.stdout.flush() 71 | 72 | filepath, _ = urllib.urlretrieve(DATA_URL, filepath)#, reporthook=_progress) 73 | 74 | 75 | zip_ref = zipfile.ZipFile(filepath, 'r') 76 | zip_ref.extractall(DATA_DIR) 77 | zip_ref.close() 78 | return path 79 | -------------------------------------------------------------------------------- /Data/Download_WOS.py: -------------------------------------------------------------------------------- 1 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 2 | RMDL: Random Multimodel Deep Learning for Classification 3 | * Copyright (C) 2018 Kamran Kowsari 4 | * Last Update: 04/25/2018 5 | * This file is part of the RMDL project, University of Virginia. 
6 | * Free to use, change, share and distribute source code of RMDL 7 | * Referenced paper: RMDL: Random Multimodel Deep Learning for Classification 8 | * Referenced paper: An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL) 9 | * Comments and errors: email: kk7nc@virginia.edu 10 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 11 | 12 | 13 | from __future__ import print_function 14 | 15 | import os, sys, tarfile 16 | import numpy as np 17 | 18 | if sys.version_info >= (3, 0, 0): 19 | import urllib.request as urllib # ugly but works 20 | else: 21 | import urllib 22 | 23 | print(sys.version_info) 24 | 25 | # image shape 26 | 27 | 28 | # path to the directory with the data 29 | DATA_DIR = '.\data_WOS' 30 | 31 | # url of the binary data 32 | DATA_URL = 'http://kowsari.net/WebOfScience.tar.gz' 33 | 34 | 35 | # path to the binary train file with image data 36 | 37 | 38 | def download_and_extract(): 39 | """ 40 | Download and extract the WOS datasets 41 | :return: path to the extraction directory 42 | """ 43 | dest_directory = DATA_DIR 44 | if not os.path.exists(dest_directory): 45 | os.makedirs(dest_directory) 46 | filename = DATA_URL.split('/')[-1] 47 | filepath = os.path.join(dest_directory, filename) 48 | 49 | 50 | path = os.path.abspath(dest_directory) 51 | if not os.path.exists(filepath): 52 | def _progress(count, block_size, total_size): 53 | sys.stdout.write('\rDownloading %s %.2f%%' % (filename, 54 | float(count * block_size) / float(total_size) * 100.0)) 55 | sys.stdout.flush() 56 | 57 | filepath, _ = urllib.urlretrieve(DATA_URL, filepath, reporthook=_progress) 58 | 59 | print('Downloaded', filename) 60 | 61 | tarfile.open(filepath, 'r').extractall(dest_directory) 62 | return path 63 | -------------------------------------------------------------------------------- /Data/README.rst: -------------------------------------------------------------------------------- 1 | 2 | ################################################ 3 | Text Classification Algorithms: A Brief Overview 4 | ################################################ 5 | 6 | ################## 7 | Table of Contents 8 | ################## 9 | .. contents:: 10 | :local: 11 | :depth: 4 12 | 13 | 14 | IMDB 15 | ----- 16 | 17 | - This dataset contains 50,000 documents with 2 categories. 18 | 19 | Import Packages 20 | ~~~~~~~~~~~~~~~ 21 | 22 | .. code:: python 23 | 24 | import sys 25 | import os 26 | from RMDL import text_feature_extraction as txt 27 | from keras.datasets import imdb 28 | import numpy as np 29 | from RMDL import RMDL_Text as RMDL 30 | 31 | Load Data 32 | ~~~~~~~~~ 33 | 34 | .. code:: python 35 | 36 | print("Load IMDB dataset....") 37 | (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_NB_WORDS) 38 | print(len(X_train)) 39 | print(y_test) 40 | word_index = imdb.get_word_index() 41 | index_word = {v: k for k, v in word_index.items()} 42 | X_train = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_train] 43 | X_test = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_test] 44 | X_train = np.array(X_train) 45 | X_train = np.array(X_train).ravel() 46 | print(X_train.shape) 47 | X_test = np.array(X_test) 48 | X_test = np.array(X_test).ravel() 49 | 50 | 51 | Web Of Science 52 | -------------- 53 | 54 | - Link to dataset: |Data| 55 | 56 | - Web of Science Dataset 57 | `WOS-11967 `__ 58 | 59 | - This dataset contains 11,967 documents with 35 categories which 60 | include 7 parent categories. 
61 | 62 | - Web of Science Dataset 63 | `WOS-46985 `__ 64 | 65 | - This dataset contains 46,985 documents with 134 categories 66 | which include 7 parent categories. 67 | 68 | - Web of Science Dataset 69 | `WOS-5736 `__ 70 | 71 | - This dataset contains 5,736 documents with 11 categories which 72 | include 3 parent categories. 73 | 74 | Import Packages 75 | ~~~~~~~~~~~~~~~ 76 | 77 | .. code:: python 78 | 79 | from RMDL import text_feature_extraction as txt 80 | from sklearn.model_selection import train_test_split 81 | from RMDL.Download import Download_WOS as WOS 82 | import os, numpy as np 83 | from RMDL import RMDL_Text as RMDL 84 | 85 | Load Data 86 | ~~~~~~~~~ 87 | .. code:: python 88 | 89 | path_WOS = WOS.download_and_extract() 90 | fname = os.path.join(path_WOS,"WebOfScience/WOS11967/X.txt") 91 | fnamek = os.path.join(path_WOS,"WebOfScience/WOS11967/Y.txt") 92 | with open(fname, encoding="utf-8") as f: 93 | content = f.readlines() 94 | content = [txt.text_cleaner(x) for x in content] 95 | with open(fnamek) as fk: 96 | contentk = fk.readlines() 97 | contentk = [x.strip() for x in contentk] 98 | Label = np.matrix(contentk, dtype=int) 99 | Label = np.transpose(Label) 100 | np.random.seed(7) 101 | print(Label.shape) 102 | X_train, X_test, y_train, y_test = train_test_split(content, Label, test_size=0.2, random_state=4) 103 | 104 | 105 | 106 | Reuters-21578 107 | ------------- 108 | 109 | - This dataset contains 21,578 documents with 90 categories. 110 | 111 | Import Packages 112 | ~~~~~~~~~~~~~~~ 113 | 114 | .. code:: python 115 | 116 | import sys 117 | import os 118 | import nltk 119 | nltk.download("reuters") 120 | from nltk.corpus import reuters 121 | from sklearn.preprocessing import MultiLabelBinarizer 122 | import numpy as np 123 | from RMDL import RMDL_Text as RMDL 124 | 125 | Load Data 126 | ~~~~~~~~~ 127 | .. code:: python 128 | 129 | documents = reuters.fileids() 130 | 131 | train_docs_id = list(filter(lambda doc: doc.startswith("train"), 132 | documents)) 133 | test_docs_id = list(filter(lambda doc: doc.startswith("test"), 134 | documents)) 135 | X_train = [(reuters.raw(doc_id)) for doc_id in train_docs_id] 136 | X_test = [(reuters.raw(doc_id)) for doc_id in test_docs_id] 137 | mlb = MultiLabelBinarizer() 138 | y_train = mlb.fit_transform([reuters.categories(doc_id) 139 | for doc_id in train_docs_id]) 140 | y_test = mlb.transform([reuters.categories(doc_id) 141 | for doc_id in test_docs_id]) 142 | y_train = np.argmax(y_train, axis=1) 143 | y_test = np.argmax(y_test, axis=1) 144 | 145 | 146 | 147 | 148 | ========== 149 | Citations: 150 | ========== 151 | 152 | ---- 153 | 154 | .. code:: 155 | 156 | @ARTICLE{Kowsari2018Text_Classification, 157 | title={Text Classification Algorithms: A Survey}, 158 | author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. 
and Brown, Donald E.}, 159 | journal={Information}, 160 | year={2019}, 161 | volume={10}, 162 | number={4}, 163 | article-number={150}, 164 | url={http://www.mdpi.com/2078-2489/10/4/150}, 165 | issn={2078-2489}, 166 | publisher={Multidisciplinary Digital Publishing Institute} 167 | } 168 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Kamran Kowsari 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /WordArt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/WordArt.png -------------------------------------------------------------------------------- /code/Bagging.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import BaggingClassifier 2 | from sklearn.neighbors import KNeighborsClassifier 3 | from sklearn.pipeline import Pipeline 4 | from sklearn import metrics 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | from sklearn.feature_extraction.text import TfidfTransformer 7 | from sklearn.datasets import fetch_20newsgroups 8 | 9 | newsgroups_train = fetch_20newsgroups(subset='train') 10 | newsgroups_test = fetch_20newsgroups(subset='test') 11 | X_train = newsgroups_train.data 12 | X_test = newsgroups_test.data 13 | y_train = newsgroups_train.target 14 | y_test = newsgroups_test.target 15 | 16 | text_clf = Pipeline([('vect', CountVectorizer()), 17 | ('tfidf', TfidfTransformer()), 18 | ('clf', BaggingClassifier(KNeighborsClassifier())), 19 | ]) 20 | 21 | text_clf.fit(X_train, y_train) 22 | 23 | 24 | predicted = text_clf.predict(X_test) 25 | 26 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/Boost.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import GradientBoostingClassifier 2 | from sklearn.pipeline import Pipeline 3 | from sklearn import metrics 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | from sklearn.feature_extraction.text import TfidfTransformer 6 | from sklearn.datasets import fetch_20newsgroups
7 | 8 | newsgroups_train = fetch_20newsgroups(subset='train') 9 | newsgroups_test = fetch_20newsgroups(subset='test') 10 | X_train = newsgroups_train.data 11 | X_test = newsgroups_test.data 12 | y_train = newsgroups_train.target 13 | y_test = newsgroups_test.target 14 | 15 | text_clf = Pipeline([('vect', CountVectorizer()), 16 | ('tfidf', TfidfTransformer()), 17 | ('clf', GradientBoostingClassifier(n_estimators=50,verbose=2)), 18 | ]) 19 | 20 | text_clf.fit(X_train, y_train) 21 | 22 | 23 | predicted = text_clf.predict(X_test) 24 | 25 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/CNN.py: -------------------------------------------------------------------------------- 1 | from keras.layers import Dropout, Dense,Input,Embedding,Flatten, AveragePooling2D, Conv2D,Reshape 2 | from keras.models import Sequential,Model 3 | from sklearn.feature_extraction.text import TfidfVectorizer 4 | import numpy as np 5 | from sklearn import metrics 6 | from keras.preprocessing.text import Tokenizer 7 | from keras.preprocessing.sequence import pad_sequences 8 | from sklearn.datasets import fetch_20newsgroups 9 | from keras.layers.merge import Concatenate 10 | 11 | 12 | def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=1000): 13 | np.random.seed(7) 14 | text = np.concatenate((X_train, X_test), axis=0) 15 | text = np.array(text) 16 | tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 17 | tokenizer.fit_on_texts(text) 18 | sequences = tokenizer.texts_to_sequences(text) 19 | word_index = tokenizer.word_index 20 | text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) 21 | print('Found %s unique tokens.' % len(word_index)) 22 | indices = np.arange(text.shape[0]) 23 | # np.random.shuffle(indices) 24 | text = text[indices] 25 | print(text.shape) 26 | X_train = text[0:len(X_train), ] 27 | X_test = text[len(X_train):, ] 28 | embeddings_index = {} 29 | f = open(".\glove.6B.100d.txt", encoding="utf8") ## GloVe file; it can be downloaded from https://nlp.stanford.edu/projects/glove/ 30 | for line in f: 31 | values = line.split() 32 | word = values[0] 33 | try: 34 | coefs = np.asarray(values[1:], dtype='float32') 35 | except: 36 | pass 37 | embeddings_index[word] = coefs 38 | f.close() 39 | print('Total %s word vectors.' % len(embeddings_index)) 40 | return (X_train, X_test, word_index,embeddings_index) 41 | 42 | 43 | 44 | def Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=100, dropout=0.5): 45 | 46 | """ 47 | Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=100, dropout=0.5): 48 | word_index is the word index, 49 | embeddings_index is the embeddings index; look at data_helper.py 50 | nclasses is the number of classes, 51 | MAX_SEQUENCE_LENGTH is the maximum length of text sequences, 52 | EMBEDDING_DIM is an int value for the dimension of the word embedding; look at data_helper.py 53 | """ 54 | 55 | model = Sequential() 56 | embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM)) 57 | for word, i in word_index.items(): 58 | embedding_vector = embeddings_index.get(word) 59 | if embedding_vector is not None: 60 | # words not found in the embedding index keep the random initialization above 
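# note: the check below guards against a GloVe file whose vector length differs
# from EMBEDDING_DIM; a mismatch would otherwise raise a broadcast error when the
# vector is assigned into embedding_matrix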
61 | if len(embedding_matrix[i]) !=len(embedding_vector): 62 | print("could not broadcast input array from shape",str(len(embedding_matrix[i])), 63 | "into shape",str(len(embedding_vector))," Please make sure your" 64 | " EMBEDDING_DIM matches the dimension of the GloVe embedding file") 65 | exit(1) 66 | 67 | embedding_matrix[i] = embedding_vector 68 | 69 | embedding_layer = Embedding(len(word_index) + 1, 70 | EMBEDDING_DIM, 71 | weights=[embedding_matrix], 72 | input_length=MAX_SEQUENCE_LENGTH, 73 | trainable=True) 74 | 75 | # applying a more complex convolutional approach 76 | convs = [] 77 | filter_sizes = [] 78 | layer = 5 79 | print("Filter ",layer) 80 | for fl in range(0,layer): 81 | filter_sizes.append((fl+2,fl+2)) 82 | 83 | node = 128 84 | sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 85 | embedded_sequences = embedding_layer(sequence_input) 86 | emb = Reshape((500,10, 10), input_shape=(500,100))(embedded_sequences) 87 | 88 | for fsz in filter_sizes: 89 | l_conv = Conv2D(node, padding="same", kernel_size=fsz, activation='relu')(emb) 90 | l_pool = AveragePooling2D(pool_size=(5,1), padding="same")(l_conv) 91 | #l_pool = Dropout(0.25)(l_pool) 92 | convs.append(l_pool) 93 | 94 | l_merge = Concatenate(axis=1)(convs) 95 | l_cov1 = Conv2D(node, (5,5), padding="same", activation='relu')(l_merge) 96 | l_cov1 = AveragePooling2D(pool_size=(5,2), padding="same")(l_cov1) 97 | l_cov2 = Conv2D(node, (5,5), padding="same", activation='relu')(l_cov1) 98 | l_pool2 = AveragePooling2D(pool_size=(5,2), padding="same")(l_cov2) 99 | l_cov2 = Dropout(dropout)(l_pool2) 100 | l_flat = Flatten()(l_cov2) 101 | l_dense = Dense(128, activation='relu')(l_flat) 102 | l_dense = Dropout(dropout)(l_dense) 103 | 104 | preds = Dense(nclasses, activation='softmax')(l_dense) 105 | model = Model(sequence_input, preds) 106 | 107 | model.compile(loss='sparse_categorical_crossentropy', 108 | optimizer='adam', 109 | metrics=['accuracy']) 110 | 111 | 112 | 113 | return model 114 | 115 | 116 | 117 | from sklearn.datasets import fetch_20newsgroups 118 | from RMDL import text_feature_extraction as txt 119 | 120 | newsgroups_train = fetch_20newsgroups(subset='train') 121 | newsgroups_test = fetch_20newsgroups(subset='test') 122 | X_train = newsgroups_train.data 123 | X_test = newsgroups_test.data 124 | y_train = newsgroups_train.target 125 | y_test = newsgroups_test.target 126 | 127 | 128 | X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test) 129 | 130 | 131 | model_CNN = Build_Model_CNN_Text(word_index,embeddings_index, 20) 132 | 133 | 134 | model_CNN.summary() 135 | 136 | model_CNN.fit(X_train_Glove, y_train, 137 | validation_data=(X_test_Glove, y_test), 138 | epochs=1000, 139 | batch_size=128, 140 | verbose=2) 141 | 142 | predicted = model_CNN.predict(X_test_Glove) 143 | 144 | predicted = np.argmax(predicted, axis=1) 145 | 146 | 147 | print(metrics.classification_report(y_test, predicted)) 148 | -------------------------------------------------------------------------------- /code/CRF.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import sklearn_crfsuite 3 | from sklearn_crfsuite import metrics 4 | nltk.corpus.conll2002.fileids() 5 | train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train')) 6 | test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb')) 7 | def word2features(sent, i): 8 | word = sent[i][0] 9 | postag = sent[i][1] 10 | 11 | features = { 12 | 'bias': 1.0, 13 | 'word.lower()': word.lower(), 
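# the character-suffix entries below act as cheap morphological features;
# in Spanish NER (CoNLL-2002) word endings often signal entity type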
14 | 'word[-3:]': word[-3:], 15 | 'word[-2:]': word[-2:], 16 | 'word.isupper()': word.isupper(), 17 | 'word.istitle()': word.istitle(), 18 | 'word.isdigit()': word.isdigit(), 19 | 'postag': postag, 20 | 'postag[:2]': postag[:2], 21 | } 22 | if i > 0: 23 | word1 = sent[i-1][0] 24 | postag1 = sent[i-1][1] 25 | features.update({ 26 | '-1:word.lower()': word1.lower(), 27 | '-1:word.istitle()': word1.istitle(), 28 | '-1:word.isupper()': word1.isupper(), 29 | '-1:postag': postag1, 30 | '-1:postag[:2]': postag1[:2], 31 | }) 32 | else: 33 | features['BOS'] = True 34 | 35 | if i < len(sent)-1: 36 | word1 = sent[i+1][0] 37 | postag1 = sent[i+1][1] 38 | features.update({ 39 | '+1:word.lower()': word1.lower(), 40 | '+1:word.istitle()': word1.istitle(), 41 | '+1:word.isupper()': word1.isupper(), 42 | '+1:postag': postag1, 43 | '+1:postag[:2]': postag1[:2], 44 | }) 45 | else: 46 | features['EOS'] = True 47 | 48 | return features 49 | 50 | 51 | def sent2features(sent): 52 | return [word2features(sent, i) for i in range(len(sent))] 53 | 54 | def sent2labels(sent): 55 | return [label for token, postag, label in sent] 56 | 57 | def sent2tokens(sent): 58 | return [token for token, postag, label in sent] 59 | 60 | X_train = [sent2features(s) for s in train_sents] 61 | y_train = [sent2labels(s) for s in train_sents] 62 | 63 | X_test = [sent2features(s) for s in test_sents] 64 | y_test = [sent2labels(s) for s in test_sents] 65 | 66 | 67 | 68 | crf = sklearn_crfsuite.CRF( 69 | algorithm='lbfgs', 70 | c1=0.1, 71 | c2=0.1, 72 | max_iterations=100, 73 | all_possible_transitions=True 74 | ) 75 | crf.fit(X_train, y_train) 76 | 77 | y_pred = crf.predict(X_test) 78 | print(metrics.flat_classification_report( 79 | y_test, y_pred, digits=3 80 | )) -------------------------------------------------------------------------------- /code/DNN.py: -------------------------------------------------------------------------------- 1 | from keras.layers import Dropout, Dense 2 | from keras.models import Sequential 3 | from sklearn.feature_extraction.text import TfidfVectorizer 4 | import numpy as np 5 | from sklearn import metrics 6 | 7 | 8 | def TFIDF(X_train, X_test,MAX_NB_WORDS=75000): 9 | vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS) 10 | X_train = vectorizer_x.fit_transform(X_train).toarray() 11 | X_test = vectorizer_x.transform(X_test).toarray() 12 | print("tf-idf with",str(np.array(X_train).shape[1]),"features") 13 | return (X_train,X_test) 14 | 15 | 16 | def Build_Model_DNN_Text(shape, nClasses, dropout=0.5): 17 | """ 18 | Build_Model_DNN_Text(shape, nClasses, dropout) 19 | Build a deep neural network model for text classification 20 | shape is the input feature space 21 | nClasses is the number of classes 22 | """ 23 | model = Sequential() 24 | node = 512 # number of nodes 25 | nLayers = 4 # number of hidden layers 26 | 27 | model.add(Dense(node,input_dim=shape,activation='relu')) 28 | model.add(Dropout(dropout)) 29 | for i in range(0,nLayers): 30 | model.add(Dense(node,input_dim=node,activation='relu')) 31 | model.add(Dropout(dropout)) 32 | model.add(Dense(nClasses, activation='softmax')) 33 | 34 | model.compile(loss='sparse_categorical_crossentropy', 35 | optimizer='adam', 36 | metrics=['accuracy']) 37 | 38 | return model 39 | 40 | 41 | from sklearn.datasets import fetch_20newsgroups 42 | 43 | newsgroups_train = fetch_20newsgroups(subset='train') 44 | newsgroups_test = fetch_20newsgroups(subset='test') 45 | X_train = newsgroups_train.data 46 | X_test = newsgroups_test.data 47 | y_train = newsgroups_train.target 48 | y_test = newsgroups_test.target
49 | 50 | X_train_tfidf,X_test_tfidf = TFIDF(X_train,X_test) 51 | 52 | 53 | model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 20) 54 | model_DNN.summary() 55 | 56 | model_DNN.fit(X_train_tfidf, y_train, 57 | validation_data=(X_test_tfidf, y_test), 58 | epochs=10, 59 | batch_size=128, 60 | verbose=2) 61 | 62 | predicted = model_DNN.predict_classes(X_test_tfidf) 63 | 64 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/Decision_Tree.py: -------------------------------------------------------------------------------- 1 | from sklearn import tree 2 | from sklearn.pipeline import Pipeline 3 | from sklearn import metrics 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | from sklearn.feature_extraction.text import TfidfTransformer 6 | from sklearn.datasets import fetch_20newsgroups 7 | 8 | newsgroups_train = fetch_20newsgroups(subset='train') 9 | newsgroups_test = fetch_20newsgroups(subset='test') 10 | X_train = newsgroups_train.data 11 | X_test = newsgroups_test.data 12 | y_train = newsgroups_train.target 13 | y_test = newsgroups_test.target 14 | 15 | text_clf = Pipeline([('vect', CountVectorizer()), 16 | ('tfidf', TfidfTransformer()), 17 | ('clf', tree.DecisionTreeClassifier()), 18 | ]) 19 | 20 | text_clf.fit(X_train, y_train) 21 | 22 | 23 | predicted = text_clf.predict(X_test) 24 | 25 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/HDLTex/README.rst: -------------------------------------------------------------------------------- 1 | |DOI| |travis| |appveyor| |wercker status| |Join the chat at 2 | https://gitter.im/HDLTex| |arXiv| |RG| |Binder| |license| |twitter| 3 | 4 | HDLTex: Hierarchical Deep Learning for Text Classification 5 | ========================================================== 6 | 7 | Referenced paper: `HDLTex: Hierarchical Deep Learning for Text 8 | Classification `__ 9 | 10 | .. image:: /docs/pic/github-logo.png 11 | :target: https://github.com/kk7nc/HDLTex 12 | 13 | 14 | |Pic| 15 | 16 | Documentation: 17 | =============== 18 | 19 | Increasingly large document collections require improved information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of traditional supervised classifiers has degraded as the number of documents has increased. This is because along with growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy. 20 | 21 | Installation 22 | ============= 23 | 24 | Using pip 25 | ---------- 26 | .. code:: bash 27 | 28 | pip install HDLTex 29 | 30 | Using git 31 | ---------- 32 | .. code:: bash 33 | 34 | git clone --recursive https://github.com/kk7nc/HDLTex.git 35 | 36 | 37 | The primary requirements for this package are Python 3 with TensorFlow. 
38 | The requirements.txt file contains a listing of the required Python 39 | packages; to install all requirements, run the following: 40 | 41 | .. code:: bash 42 | 43 | pip install -r requirements.txt 44 | 45 | Or 46 | 47 | .. code:: bash 48 | 49 | pip3 install -r requirements.txt 50 | 51 | Or: 52 | 53 | .. code:: bash 54 | 55 | conda install --file requirements.txt 56 | 57 | 58 | If the above command does not work, use the following: 59 | 60 | .. code:: bash 61 | 62 | sudo -H pip install -r requirements.txt 63 | 64 | 65 | Datasets for HDLTex: 66 | ===================== 67 | 68 | Link to dataset: |Data| 69 | 70 | Web of Science Dataset 71 | `WOS-11967 `__ 72 | 73 | :: 74 | 75 | This dataset contains 11,967 documents with 35 categories which include 7 parent categories. 76 | 77 | 78 | Web of Science Dataset 79 | `WOS-46985 `__ 80 | 81 | :: 82 | 83 | This dataset contains 46,985 documents with 134 categories which include 7 parent categories. 84 | 85 | 86 | Web of Science Dataset 87 | `WOS-5736 `__ 88 | 89 | :: 90 | 91 | This dataset contains 5,736 documents with 11 categories which include 3 parent categories. 92 | 93 | Requirements: 94 | ---------------- 95 | General: 96 | 97 | - Python 3.5 or later see `Instruction Documents `__ 98 | - TensorFlow see `Instruction Documents `__. 99 | - scikit-learn see `Instruction Documents `__ 100 | - Keras see `Instruction Documents `__ 101 | - scipy see `Instruction Documents `__ 102 | - GPU 103 | 104 | - CUDA® Toolkit 8.0. For details, see `NVIDIA's documentation `__. 105 | - The `NVIDIA drivers associated with CUDA Toolkit 8.0 `__. 106 | - cuDNN v6. For details, see `NVIDIA's documentation `__. 107 | - GPU card with CUDA Compute Capability 3.0 or higher. 108 | - The libcupti-dev library. 109 | - To install this library, issue the following command: 110 | 111 | :: 112 | 113 | $ sudo apt-get install libcupti-dev 114 | 115 | 116 | Feature Extraction: 117 | =================== 118 | 119 | Global Vectors for Word Representation 120 | (`GLOVE `__) 121 | 122 | :: 123 | 124 | For CNN and RNN, you need to download GloVe and link the folder location to it 125 | 126 | 127 | 128 | Error and Comments: 129 | =================== 130 | 131 | Send an email to kk7nc@virginia.edu 132 | 133 | Citation: 134 | ========= 135 | 136 | .. code:: bash 137 | 138 | @inproceedings{Kowsari2018HDLTex, 139 | author={Kowsari, Kamran and Brown, Donald E and Heidarysafa, Mojtaba and Meimandi, Kiana Jafari and Gerber, Matthew S and Barnes, Laura E}, 140 | booktitle={2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)}, 141 | title={HDLTex: Hierarchical Deep Learning for Text Classification}, 142 | year={2017}, 143 | pages={364-371}, 144 | doi={10.1109/ICMLA.2017.0-134}, 145 | month={Dec} 146 | } 147 | 148 | .. |DOI| image:: http://kowsari.net/HDLTex_DOI.svg?maxAge=2592000 149 | :target: https://doi.org/10.1109/ICMLA.2017.0-134 150 | .. |travis| image:: https://travis-ci.org/kk7nc/HDLTex.svg?branch=master 151 | :target: https://travis-ci.org/kk7nc/HDLTex 152 | .. |wercker status| image:: https://app.wercker.com/status/24a123448ba8764b257a1df242146b8e/s/master 153 | :target: https://app.wercker.com/project/byKey/24a123448ba8764b257a1df242146b8e 154 | .. |Join the chat at https://gitter.im/HDLTex| image:: https://badges.gitter.im/Join%20Chat.svg 155 | :target: https://gitter.im/HDLTex/Lobby?source=orgpage 156 | .. |appveyor| image:: https://ci.appveyor.com/api/projects/status/github/kk7nc/HDLTex?branch=master&svg=true
157 | :target: https://ci.appveyor.com/project/kk7nc/hdltex 158 | .. |arXiv| image:: https://img.shields.io/badge/arXiv-1709.08267-red.svg?style=flat 159 | :target: https://arxiv.org/abs/1709.08267 160 | .. |RG| image:: https://img.shields.io/badge/ResearchGate-HDLTex-blue.svg?style=flat 161 | :target: https://www.researchgate.net/publication/319968747_HDLTex_Hierarchical_Deep_Learning_for_Text_Classification 162 | .. |Binder| image:: https://mybinder.org/badge.svg 163 | :target: https://mybinder.org/v2/gh/kk7nc/HDLTex/master 164 | .. |license| image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592104 165 | :target: https://github.com/kk7nc/HDLTex/blob/master/LICENSE 166 | .. |Data| image:: https://img.shields.io/badge/DOI-10.17632/9rw3vkcfy4.6-blue.svg?style=flat 167 | :target: http://dx.doi.org/10.17632/9rw3vkcfy4.6 168 | .. |Pic| image:: http://kowsari.net/____impro/1/onewebmedia/HDLTex.png?etag=W%2F%22c90cd-59c4019b%22&sourceContentType=image%2Fpng&ignoreAspectRatio&resize=821%2B326&extract=0%2B0%2B821%2B325?raw=false 169 | :alt: HDLTex with both hierarchy levels as DNNs 170 | .. |twitter| image:: https://img.shields.io/twitter/url/http/shields.io.svg?style=social 171 | :target: https://twitter.com/intent/tweet?text=HDLTex:%20Hierarchical%20Deep%20Learning%20for%20Text%20Classification%0aGitHub:&url=https://github.com/kk7nc/HDLTex&hashtags=DeepLearning,Text_Classification,classification,MachineLearning,deep_neural_networks 172 | 173 | -------------------------------------------------------------------------------- /code/Hierarchical_Attention_Networks/README.md: -------------------------------------------------------------------------------- 1 | # textClassifier 2 | 3 | [richliao/textClassifier](https://github.com/richliao/textClassifier) 4 | 5 | 6 | 7 | 8 | textClassifierHATT.py has the implementation of [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf). Please see [this blog](https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-HATN/) for full detail. Also see the [Keras Google group discussion](https://groups.google.com/forum/#!topic/keras-users/IWK9opMFavQ). 9 | 10 | textClassifierConv.py has the implementation of [Convolutional Neural Networks for Sentence Classification - Yoon Kim](https://arxiv.org/abs/1408.5882). Please see [this blog](https://richliao.github.io/supervised/classification/2016/11/26/textclassifier-convolutional/) for full detail. 
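For reference, the core of the HAN model is its attention pooling: each word-level hidden state $h_{it}$ (from a bidirectional GRU) is scored against a learned context vector $u_w$, and the sentence vector $s_i$ is the attention-weighted sum; the same construction is repeated over sentence vectors to form the document vector (notation follows the Yang et al. paper linked above):

```latex
u_{it} = \tanh(W_w h_{it} + b_w), \qquad
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t'} \exp(u_{it'}^{\top} u_w)}, \qquad
s_i = \sum_{t} \alpha_{it} h_{it}
```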
11 | -------------------------------------------------------------------------------- /code/Hierarchical_Attention_Networks/textClassifierConv.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | # author - Richard Liao 3 | # Dec 26 2016 4 | import numpy as np 5 | import pandas as pd 6 | import cPickle 7 | from collections import defaultdict 8 | import re 9 | 10 | from bs4 import BeautifulSoup 11 | 12 | import sys 13 | import os 14 | 15 | os.environ['KERAS_BACKEND']='theano' 16 | 17 | from keras.preprocessing.text import Tokenizer 18 | from keras.preprocessing.sequence import pad_sequences 19 | from keras.utils.np_utils import to_categorical 20 | 21 | from keras.layers import Embedding 22 | from keras.layers import Dense, Input, Flatten 23 | from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout 24 | from keras.models import Model 25 | 26 | MAX_SEQUENCE_LENGTH = 1000 27 | MAX_NB_WORDS = 20000 28 | EMBEDDING_DIM = 100 29 | VALIDATION_SPLIT = 0.2 30 | 31 | def clean_str(string): 32 | """ 33 | Tokenization/string cleaning for dataset 34 | Every dataset is lower cased 35 | """ 36 | string = re.sub(r"\\", "", string) 37 | string = re.sub(r"\'", "", string) 38 | string = re.sub(r"\"", "", string) 39 | return string.strip().lower() 40 | 41 | data_train = pd.read_csv('~/Testground/data/imdb/labeledTrainData.tsv', sep='\t') 42 | print(data_train.shape) 43 | 44 | texts = [] 45 | labels = [] 46 | 47 | for idx in range(data_train.review.shape[0]): 48 | text = BeautifulSoup(data_train.review[idx]) 49 | texts.append(clean_str(text.get_text().encode('ascii','ignore'))) 50 | labels.append(data_train.sentiment[idx]) 51 | 52 | 53 | tokenizer = Tokenizer(nb_words=MAX_NB_WORDS) 54 | tokenizer.fit_on_texts(texts) 55 | sequences = tokenizer.texts_to_sequences(texts) 56 | 57 | word_index = tokenizer.word_index 58 | print('Found %s unique tokens.' % len(word_index)) 59 | 60 | data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) 61 | 62 | labels = to_categorical(np.asarray(labels)) 63 | print(('Shape of data tensor:', data.shape)) 64 | print(('Shape of label tensor:', labels.shape)) 65 | 66 | indices = np.arange(data.shape[0]) 67 | np.random.shuffle(indices) 68 | data = data[indices] 69 | labels = labels[indices] 70 | nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0]) 71 | 72 | x_train = data[:-nb_validation_samples] 73 | y_train = labels[:-nb_validation_samples] 74 | x_val = data[-nb_validation_samples:] 75 | y_val = labels[-nb_validation_samples:] 76 | 77 | print('Number of positive and negative reviews in training and validation set ') 78 | print(y_train.sum(axis=0)) 79 | print(y_val.sum(axis=0)) 80 | 81 | GLOVE_DIR = "/ext/home/analyst/Testground/data/glove" 82 | embeddings_index = {} 83 | f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) 84 | for line in f: 85 | values = line.split() 86 | word = values[0] 87 | coefs = np.asarray(values[1:], dtype='float32') 88 | embeddings_index[word] = coefs 89 | f.close() 90 | 91 | print('Total %s word vectors in Glove 6B 100d.' % len(embeddings_index)) 92 | 93 | embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM)) 94 | for word, i in word_index.items(): 95 | embedding_vector = embeddings_index.get(word) 96 | if embedding_vector is not None: 97 | # words not found in the embedding index keep the random initialization above 
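# note: rows are indexed from 1 because Keras' Tokenizer numbers words from 1,
# hence the len(word_index) + 1 row count when embedding_matrix was created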
98 | embedding_matrix[i] = embedding_vector 99 | 100 | embedding_layer = Embedding(len(word_index) + 1, 101 | EMBEDDING_DIM, 102 | weights=[embedding_matrix], 103 | input_length=MAX_SEQUENCE_LENGTH, 104 | trainable=True) 105 | 106 | sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 107 | embedded_sequences = embedding_layer(sequence_input) 108 | l_cov1= Conv1D(128, 5, activation='relu')(embedded_sequences) 109 | l_pool1 = MaxPooling1D(5)(l_cov1) 110 | l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1) 111 | l_pool2 = MaxPooling1D(5)(l_cov2) 112 | l_cov3 = Conv1D(128, 5, activation='relu')(l_pool2) 113 | l_pool3 = MaxPooling1D(35)(l_cov3) # global max pooling 114 | l_flat = Flatten()(l_pool3) 115 | l_dense = Dense(128, activation='relu')(l_flat) 116 | preds = Dense(2, activation='softmax')(l_dense) 117 | 118 | model = Model(sequence_input, preds) 119 | model.compile(loss='categorical_crossentropy', 120 | optimizer='rmsprop', 121 | metrics=['acc']) 122 | 123 | print("model fitting - simplified convolutional neural network") 124 | model.summary() 125 | model.fit(x_train, y_train, validation_data=(x_val, y_val), 126 | nb_epoch=10, batch_size=128) 127 | 128 | embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM)) 129 | for word, i in word_index.items(): 130 | embedding_vector = embeddings_index.get(word) 131 | if embedding_vector is not None: 132 | # words not found in the embedding index keep the random initialization above 133 | embedding_matrix[i] = embedding_vector 134 | 135 | embedding_layer = Embedding(len(word_index) + 1, 136 | EMBEDDING_DIM, 137 | weights=[embedding_matrix], 138 | input_length=MAX_SEQUENCE_LENGTH, 139 | trainable=True) 140 | 141 | # applying a more complex convolutional approach 142 | convs = [] 143 | filter_sizes = [3,4,5] 144 | 145 | sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 146 | embedded_sequences = embedding_layer(sequence_input) 147 | 148 | for fsz in filter_sizes: 149 | l_conv = Conv1D(nb_filter=128,filter_length=fsz,activation='relu')(embedded_sequences) 150 | l_pool = MaxPooling1D(5)(l_conv) 151 | convs.append(l_pool) 152 | 153 | l_merge = Merge(mode='concat', concat_axis=1)(convs) 154 | l_cov1= Conv1D(128, 5, activation='relu')(l_merge) 155 | l_pool1 = MaxPooling1D(5)(l_cov1) 156 | l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1) 157 | l_pool2 = MaxPooling1D(30)(l_cov2) 158 | l_flat = Flatten()(l_pool2) 159 | l_dense = Dense(128, activation='relu')(l_flat) 160 | preds = Dense(2, activation='softmax')(l_dense) 161 | 162 | model = Model(sequence_input, preds) 163 | model.compile(loss='categorical_crossentropy', 164 | optimizer='rmsprop', 165 | metrics=['acc']) 166 | 167 | print("model fitting - more complex convolutional neural network") 168 | model.summary() 169 | model.fit(x_train, y_train, validation_data=(x_val, y_val), 170 | nb_epoch=20, batch_size=50) 171 | -------------------------------------------------------------------------------- /code/Hierarchical_Attention_Networks/textClassifierHATT.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | # author - Richard Liao 3 | # Dec 26 2016 4 | import numpy as np 5 | import pandas as pd 6 | import cPickle 7 | from collections import defaultdict 8 | import re 9 | 10 | from bs4 import BeautifulSoup 11 | 12 | import sys 13 | import os 14 | 15 | os.environ['KERAS_BACKEND']='theano' 16 | 17 | from keras.preprocessing.text import Tokenizer, text_to_word_sequence 18 | from keras.preprocessing.sequence import pad_sequences
19 | from keras.utils.np_utils import to_categorical 20 | 21 | from keras.layers import Embedding 22 | from keras.layers import Dense, Input, Flatten 23 | from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed 24 | from keras.models import Model 25 | 26 | from keras import backend as K 27 | from keras.engine.topology import Layer, InputSpec 28 | from keras import initializations 29 | 30 | MAX_SENT_LENGTH = 100 31 | MAX_SENTS = 15 32 | MAX_NB_WORDS = 20000 33 | EMBEDDING_DIM = 100 34 | VALIDATION_SPLIT = 0.2 35 | 36 | def clean_str(string): 37 | """ 38 | Tokenization/string cleaning for dataset 39 | Every dataset is lower cased 40 | """ 41 | string = re.sub(r"\\", "", string) 42 | string = re.sub(r"\'", "", string) 43 | string = re.sub(r"\"", "", string) 44 | return string.strip().lower() 45 | 46 | data_train = pd.read_csv('~/Testground/data/imdb/labeledTrainData.tsv', sep='\t') 47 | print(data_train.shape) 48 | 49 | from nltk import tokenize 50 | 51 | reviews = [] 52 | labels = [] 53 | texts = [] 54 | 55 | for idx in range(data_train.review.shape[0]): 56 | text = BeautifulSoup(data_train.review[idx]) 57 | text = clean_str(text.get_text().encode('ascii','ignore')) 58 | texts.append(text) 59 | sentences = tokenize.sent_tokenize(text) 60 | reviews.append(sentences) 61 | 62 | labels.append(data_train.sentiment[idx]) 63 | 64 | tokenizer = Tokenizer(nb_words=MAX_NB_WORDS) 65 | tokenizer.fit_on_texts(texts) 66 | 67 | data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32') 68 | 69 | for i, sentences in enumerate(reviews): 70 | for j, sent in enumerate(sentences): 71 | if j< MAX_SENTS: 72 | wordTokens = text_to_word_sequence(sent) 73 | k=0 74 | for _, word in enumerate(wordTokens): 75 | if k`__ 6 | 7 | .. image:: /docs/pic/github-logo.png 8 | :target: https://github.com/kk7nc/RMDL 9 | 10 | Random Multimodel Deep Learning (RMDL): 11 | ======================================= 12 | 13 | A new ensemble, deep learning approach for classification. Deep 14 | learning models have achieved state-of-the-art results across many domains. 15 | RMDL solves the problem of finding the best deep learning structure 16 | and architecture while simultaneously improving robustness and accuracy 17 | through ensembles of deep learning architectures. RMDL can accept 18 | as input a variety of data, including text, video, images, and symbolic data. 19 | 20 | 21 | |RMDL| 22 | 23 | Random Multimodel Deep Learning (RMDL) architecture for classification. 24 | RMDL includes 3 Random models: one DNN classifier on the left, one Deep CNN 25 | classifier in the middle, and one Deep RNN classifier on the right (each unit could be an LSTM or GRU). 26 | 27 | 28 | Installation 29 | ============= 30 | 31 | RMDL can be installed via pip or git: 32 | 33 | Using pip 34 | ---------- 35 | 36 | .. code:: bash 37 | 38 | pip install RMDL 39 | 40 | Using git 41 | --------- 42 | .. code:: bash 43 | 44 | git clone --recursive https://github.com/kk7nc/RMDL.git 45 | 46 | The primary requirements for this package are Python 3 with TensorFlow. The requirements.txt file 47 | contains a listing of the required Python packages; to install all requirements, run the following: 48 | 49 | .. code:: bash 50 | 51 | pip install -r requirements.txt 52 | 53 | Or 54 | 55 | .. code:: bash 56 | 57 | pip3 install -r requirements.txt 58 | 59 | Or: 60 | 61 | .. code:: bash
62 | 63 | conda install --file requirements.txt 64 | 65 | Documentation: 66 | ============== 67 | 68 | The exponential growth in the number of complex datasets every year requires more enhancement in 69 | machine learning methods to provide robust and accurate data classification. Lately, deep learning 70 | approaches have achieved surpassing results in comparison to previous machine learning algorithms 71 | on tasks such as image classification, natural language processing, face recognition, etc. The 72 | success of these deep learning algorithms relies on their capacity to model complex and non-linear 73 | relationships within data. However, finding a suitable structure for these models has been a challenge 74 | for researchers. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning 75 | approach for classification. RMDL solves the problem of finding the best deep learning structure and 76 | architecture while simultaneously improving robustness and accuracy through ensembles of deep 77 | learning architectures. In short, RMDL trains multiple models of Deep Neural Network (DNN), 78 | Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) in parallel and combines 79 | their results to produce a better result than any of those models individually. To create these models, 80 | each deep learning model has been constructed in a random fashion regarding the number of layers and 81 | nodes in their neural network structure. The resulting RMDL model can be used for various domains such 82 | as text, video, images, and symbolic data. In this project, we describe the RMDL model in depth and show the results 83 | for image and text classification as well as face recognition. For image classification, we compared our 84 | model with some of the available baselines using MNIST and CIFAR-10 datasets. Similarly, we used four 85 | datasets namely, WOS, Reuters, IMDB, and 20newsgroup and compared our results with available baselines. 86 | Web of Science (WOS) has been collected by authors and consists of three sets (small, medium, and large). 87 | Lastly, we used the ORL dataset to compare the performance of our approach with other face recognition methods. 88 | These test results show that the RMDL model consistently outperforms standard methods over a broad range of 89 | data types and classification problems. 90 | 91 | Datasets for RMDL: 92 | ================== 93 | 94 | Text Datasets: 95 | -------------- 96 | 97 | - `IMDB Dataset `__ 98 | 99 | * This dataset contains 50,000 documents with 2 categories. 100 | 101 | - `Reuters-21578 Dataset `__ 102 | 103 | * This dataset contains 21,578 documents with 90 categories. 104 | 105 | - `20Newsgroups Dataset `__ 106 | 107 | * This dataset contains 20,000 documents with 20 categories. 108 | 109 | - Web of Science Dataset (DOI: 110 | `10.17632/9rw3vkcfy4.2 `__) 111 | 112 | - Web of Science Dataset 113 | `WOS-11967 `__ 114 | 115 | - This dataset contains 11,967 documents with 35 categories which 116 | include 7 parent categories. 117 | 118 | - Web of Science Dataset 119 | `WOS-46985 `__ 120 | 121 | - This dataset contains 46,985 documents with 134 categories 122 | which include 7 parent categories. 123 | 124 | - Web of Science Dataset 125 | `WOS-5736 `__ 126 | 127 | - This dataset contains 5,736 documents with 11 categories which 128 | include 3 parent categories. 
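Any of the text datasets above can be passed to RMDL's text classifier, whose full parameter list is documented under Parameters below. As a minimal sketch on 20 Newsgroups — assuming the RMDL package is installed, and with the epoch counts cut far below the [500, 500, 500] default purely to keep the illustration cheap:

.. code:: python

    from sklearn.datasets import fetch_20newsgroups
    from RMDL import RMDL_Text as RMDL

    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')

    # ensemble of 3 DNNs, 3 RNNs, and 3 CNNs; epochs reduced for illustration
    RMDL.Text_Classification(newsgroups_train.data, newsgroups_train.target,
                             newsgroups_test.data, newsgroups_test.target,
                             batch_size=128,
                             random_deep=[3, 3, 3],
                             epochs=[20, 5, 10])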
129 | 130 | Image datasets: 131 | --------------- 132 | 133 | - `MNIST Dataset `__ 134 | 135 | - The MNIST database contains 60,000 training images and 10,000 136 | testing images. 137 | 138 | - `CIFAR-10 Dataset `__ 139 | 140 | - The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 141 | classes, with 6000 images per class. There are 50000 training 142 | images and 10000 test images. 143 | 144 | Face Recognition 145 | ---------------- 146 | 147 | `The Database of Faces (The Olivetti Faces 148 | Dataset) `__ 149 | 150 | - The Database of Faces dataset consists of 400 92x112 grayscale 151 | images of 40 subjects 152 | 153 | Requirements for RMDL: 154 | ======================= 155 | 156 | General: 157 | ---------- 158 | 159 | - Python 3.5 or later see `Instruction 160 | Documents `__ 161 | 162 | - TensorFlow see `Instruction 163 | Documents `__. 164 | 165 | - scikit-learn see `Instruction 166 | Documents `__ 167 | 168 | - Keras see `Instruction Documents `__ 169 | 170 | - scipy see `Instruction 171 | Documents `__ 172 | 173 | 174 | GPU (if you want to run on GPU): 175 | -------------------------------- 176 | 177 | - CUDA® Toolkit 8.0. For details, see `NVIDIA's 178 | documentation `__. 179 | 180 | - The `NVIDIA drivers associated with CUDA Toolkit 181 | 8.0 `__. 182 | 183 | - cuDNN v6. For details, see `NVIDIA's 184 | documentation `__. 185 | 186 | - GPU card with CUDA Compute Capability 3.0 or higher. 187 | 188 | - The libcupti-dev library. 189 | 190 | Text and Document Classification 191 | ================================= 192 | 193 | - Download GloVe: Global Vectors for Word Representation `Instruction 194 | Documents `__ 195 | 196 | - Set data directory into 197 | `Global.py `__ 198 | 199 | - If you do not set the GloVe directory, GloVe will be downloaded 200 | 201 | Parameters: 202 | =========== 203 | 204 | Text_Classification 205 | ------------------- 206 | 207 | .. code:: python 208 | 209 | from RMDL import RMDL_Text 210 | 211 | .. code:: python 212 | 213 | Text_Classification(x_train, y_train, x_test, y_test, batch_size=128, 214 | EMBEDDING_DIM=50,MAX_SEQUENCE_LENGTH = 500, MAX_NB_WORDS = 75000, 215 | GloVe_dir="", GloVe_file = "glove.6B.50d.txt", 216 | sparse_categorical=True, random_deep=[3, 3, 3], epochs=[500, 500, 500], plot=True, 217 | min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=1024, 218 | min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128, 219 | min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512, 220 | random_state=42, random_optimizor=True, dropout=0.05) 221 | 222 | 223 | Input 224 | ~~~~~ 225 | 226 | - x_train 227 | - y_train 228 | - x_test 229 | - y_test 230 | 231 | batch_size 232 | ~~~~~~~~~~ 233 | 234 | - batch_size: Integer. Number of samples per gradient update. If unspecified, it will default to 128. 235 | 236 | EMBEDDING_DIM 237 | ~~~~~~~~~~~~~~ 238 | 239 | - EMBEDDING_DIM: Integer. Dimension of the word embedding (this number must match the GloVe or other pre-trained embedding that is used); it will default to 50, which is the dimension of the glove.6B.50d.txt file. 240 | 241 | 242 | MAX_SEQUENCE_LENGTH 243 | ~~~~~~~~~~~~~~~~~~~ 244 | 245 | - MAX_SEQUENCE_LENGTH: Integer. Maximum length of a sequence or document in the datasets, it will default to 500. 246 | 247 | 248 | MAX_NB_WORDS 249 | ~~~~~~~~~~~~~~~~~~~~~~~ 250 | 251 | - MAX_NB_WORDS: Integer. Maximum number of unique words in the datasets, it will default to 75000. 
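Because EMBEDDING_DIM must match the word-vector file that is loaded (see GloVe_file below), the two are usually changed together. A hedged sketch of pairing 100-dimensional vectors with the call — the "Glove" directory name here is only an assumption about where you extracted GloVe:

.. code:: python

    # sketch: glove.6B.100d.txt holds 100-d vectors, so EMBEDDING_DIM must be 100;
    # the GloVe_dir value is hypothetical and depends on your extraction location
    RMDL_Text.Text_Classification(x_train, y_train, x_test, y_test,
                                  EMBEDDING_DIM=100,
                                  GloVe_dir="Glove",
                                  GloVe_file="glove.6B.100d.txt")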
Text and Document Classification
================================

- Download GloVe (Global Vectors for Word Representation); see `Instruction
  Documents `__

- Set the data directory in
  `Global.py `__

- If you do not set a GloVe directory, GloVe will be downloaded automatically.
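The automatic download can also be triggered ahead of time. A minimal sketch, assuming the
packaged download helpers expose a GloVe downloader analogous to the ``Download_WOS`` helper used
in the examples below (check ``RMDL/Download`` in the package for the exact module name):

.. code:: python

    # Sketch: fetch GloVe once, before the first training run.
    from RMDL.Download import Download_Glove as GloVe

    GloVe_DIR = GloVe.download_and_extract()
    print("GloVe extracted to:", GloVe_DIR)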
Parameters:
===========

Text_Classification
-------------------

.. code:: python

    from RMDL import RMDL_Text

.. code:: python

    Text_Classification(x_train, y_train, x_test, y_test, batch_size=128,
                        EMBEDDING_DIM=50, MAX_SEQUENCE_LENGTH=500, MAX_NB_WORDS=75000,
                        GloVe_dir="", GloVe_file="glove.6B.50d.txt",
                        sparse_categorical=True, random_deep=[3, 3, 3], epochs=[500, 500, 500], plot=True,
                        min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=1024,
                        min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128,
                        min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512,
                        random_state=42, random_optimizor=True, dropout=0.05)

Input
~~~~~

- x_train
- y_train
- x_test
- y_test

batch_size
~~~~~~~~~~

- batch_size: Integer. Number of samples per gradient update. If unspecified, it defaults to 128.

EMBEDDING_DIM
~~~~~~~~~~~~~

- EMBEDDING_DIM: Integer. Dimension of the word embeddings (this number must match the GloVe or other pre-trained embedding that is used); defaults to 50, which matches the glove.6B.50d.txt file.

MAX_SEQUENCE_LENGTH
~~~~~~~~~~~~~~~~~~~

- MAX_SEQUENCE_LENGTH: Integer. Maximum length of a sequence (document) in the datasets; defaults to 500.

MAX_NB_WORDS
~~~~~~~~~~~~

- MAX_NB_WORDS: Integer. Maximum number of unique words kept from the datasets; defaults to 75000.

GloVe_dir
~~~~~~~~~

- GloVe_dir: String. Path of the GloVe (or other pre-trained embedding) directory; defaults to empty, in which case glove.6B.zip will be downloaded.

GloVe_file
~~~~~~~~~~

- GloVe_file: String. Which version of GloVe (or other pre-trained word embedding) will be used; defaults to glove.6B.50d.txt.

- NOTE: if you use another version of GloVe, EMBEDDING_DIM must match its dimension.

sparse_categorical
~~~~~~~~~~~~~~~~~~

- sparse_categorical: bool. Should be True when the target of the dataset has shape (n, 1); defaults to True.

random_deep
~~~~~~~~~~~

- random_deep: list of 3 Integers. Number of ensembled models used in RMDL: random_deep[0] is the number of DNNs, random_deep[1] the number of RNNs, random_deep[2] the number of CNNs; defaults to [3, 3, 3].

epochs
~~~~~~

- epochs: list of 3 Integers. Number of epochs for each ensembled model used in RMDL: epochs[0] is the number of epochs used for the DNNs, epochs[1] for the RNNs, epochs[2] for the CNNs; defaults to [500, 500, 500].

plot
~~~~

- plot: bool. If True, shows the confusion matrix and the accuracy and loss curves.

min_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_dnn: Integer. Lower bound on the number of hidden layers of the DNNs used in RMDL; defaults to 1.

max_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_dnn: Integer. Upper bound on the number of hidden layers of the DNNs used in RMDL; defaults to 8.

min_nodes_dnn
~~~~~~~~~~~~~

- min_nodes_dnn: Integer. Lower bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 128.

max_nodes_dnn
~~~~~~~~~~~~~

- max_nodes_dnn: Integer. Upper bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 1024.

min_hidden_layer_rnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_rnn: Integer. Lower bound on the number of hidden layers of the RNNs used in RMDL; defaults to 1.

max_hidden_layer_rnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_rnn: Integer. Upper bound on the number of hidden layers of the RNNs used in RMDL; defaults to 5.

min_nodes_rnn
~~~~~~~~~~~~~

- min_nodes_rnn: Integer. Lower bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 32.

max_nodes_rnn
~~~~~~~~~~~~~

- max_nodes_rnn: Integer. Upper bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 128.

min_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_cnn: Integer. Lower bound on the number of hidden layers of the CNNs used in RMDL; defaults to 3.

max_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_cnn: Integer. Upper bound on the number of hidden layers of the CNNs used in RMDL; defaults to 10.

min_nodes_cnn
~~~~~~~~~~~~~

- min_nodes_cnn: Integer. Lower bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 128.

max_nodes_cnn
~~~~~~~~~~~~~

- max_nodes_cnn: Integer. Upper bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 512.

random_state
~~~~~~~~~~~~

- random_state: Integer, RandomState instance or None, optional (default=None)

  * If Integer, random_state is the seed used by the random number generator.

random_optimizor
~~~~~~~~~~~~~~~~

- random_optimizor: bool. If False, all models use the adam optimizer; if True, each model uses a randomly chosen optimizer. Defaults to True.

dropout
~~~~~~~

- dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
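Putting the parameters together, a minimal call looks like the sketch below; x_train/x_test are
assumed to be lists of raw document strings and y_train/y_test integer labels, and the small
ensemble and epoch counts are only placeholders for a quick trial run:

.. code:: python

    from RMDL import RMDL_Text

    # Sketch: a small ensemble (1 DNN, 1 RNN, 1 CNN), a few epochs each.
    RMDL_Text.Text_Classification(x_train, y_train, x_test, y_test,
                                  batch_size=128,
                                  random_deep=[1, 1, 1],
                                  epochs=[10, 10, 10])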
Image_Classification
--------------------

.. code:: python

    from RMDL import RMDL_Image

.. code:: python

    Image_Classification(x_train, y_train, x_test, y_test, shape, batch_size=128,
                         sparse_categorical=True, random_deep=[3, 3, 3], epochs=[500, 500, 500], plot=True,
                         min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=1024,
                         min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128,
                         min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512,
                         random_state=42, random_optimizor=True, dropout=0.05)

Input
~~~~~

- x_train
- y_train
- x_test
- y_test

shape
~~~~~

- shape: tuple (np.shape). Shape of a single input image, e.g. (28, 28, 1) for MNIST.

batch_size
~~~~~~~~~~

- batch_size: Integer. Number of samples per gradient update. If unspecified, it defaults to 128.

sparse_categorical
~~~~~~~~~~~~~~~~~~

- sparse_categorical: bool. Should be True when the target of the dataset has shape (n, 1); defaults to True.

random_deep
~~~~~~~~~~~

- random_deep: list of 3 Integers. Number of ensembled models used in RMDL: random_deep[0] is the number of DNNs, random_deep[1] the number of RNNs, random_deep[2] the number of CNNs; defaults to [3, 3, 3].

epochs
~~~~~~

- epochs: list of 3 Integers. Number of epochs for each ensembled model used in RMDL: epochs[0] is the number of epochs used for the DNNs, epochs[1] for the RNNs, epochs[2] for the CNNs; defaults to [500, 500, 500].

plot
~~~~

- plot: bool. If True, shows the confusion matrix and the accuracy and loss curves.

min_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_dnn: Integer. Lower bound on the number of hidden layers of the DNNs used in RMDL; defaults to 1.

max_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_dnn: Integer. Upper bound on the number of hidden layers of the DNNs used in RMDL; defaults to 8.

min_nodes_dnn
~~~~~~~~~~~~~

- min_nodes_dnn: Integer. Lower bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 128.

max_nodes_dnn
~~~~~~~~~~~~~

- max_nodes_dnn: Integer. Upper bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 1024.

min_nodes_rnn
~~~~~~~~~~~~~

- min_nodes_rnn: Integer. Lower bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 32.

max_nodes_rnn
~~~~~~~~~~~~~

- max_nodes_rnn: Integer. Upper bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 128.

min_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_cnn: Integer. Lower bound on the number of hidden layers of the CNNs used in RMDL; defaults to 3.

max_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_cnn: Integer. Upper bound on the number of hidden layers of the CNNs used in RMDL; defaults to 10.

min_nodes_cnn
~~~~~~~~~~~~~

- min_nodes_cnn: Integer. Lower bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 128.

max_nodes_cnn
~~~~~~~~~~~~~

- max_nodes_cnn: Integer. Upper bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 512.

random_state
~~~~~~~~~~~~

- random_state: Integer, RandomState instance or None, optional (default=None)

  * If Integer, random_state is the seed used by the random number generator.

random_optimizor
~~~~~~~~~~~~~~~~

- random_optimizor: bool. If False, all models use the adam optimizer; if True, each model uses a randomly chosen optimizer. Defaults to True.

dropout
~~~~~~~

- dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
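The only image-specific preparation is a channels-last layout: inputs must be reshaped to
(n_samples, height, width, channels), with the per-image shape passed separately. A sketch for
hypothetical 32x32 grayscale numpy arrays x_train/x_test (the MNIST example below does the same
for 28x28):

.. code:: python

    # Sketch: bring image arrays into the channels-last layout RMDL expects.
    x_train = x_train.reshape(x_train.shape[0], 32, 32, 1).astype('float32') / 255.0
    x_test = x_test.reshape(x_test.shape[0], 32, 32, 1).astype('float32') / 255.0
    shape = (32, 32, 1)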
Example
=======

MNIST
-----

- The MNIST database contains 60,000 training images and 10,000 testing images.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    from keras.datasets import mnist
    import numpy as np
    from RMDL import RMDL_Image as RMDL


Load Data
~~~~~~~~~

.. code:: python

    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    X_train_D = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32')
    X_test_D = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')
    X_train = X_train_D / 255.0
    X_test = X_test_D / 255.0
    number_of_classes = np.unique(y_train).shape[0]
    shape = (28, 28, 1)

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 128
    n_epochs = [100, 100, 100]  ## DNN-RNN-CNN
    Random_Deep = [3, 3, 3]  ## DNN-RNN-CNN

    RMDL.Image_Classification(X_train, y_train, X_test, y_test, shape,
                              batch_size=batch_size,
                              sparse_categorical=True,
                              random_deep=Random_Deep,
                              epochs=n_epochs)

IMDB
----

- This dataset contains 50,000 documents with 2 categories.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    import sys
    import os
    from RMDL import text_feature_extraction as txt
    from keras.datasets import imdb
    import numpy as np
    from RMDL import RMDL_Text as RMDL

Load Data
~~~~~~~~~

.. code:: python

    print("Load IMDB dataset....")
    MAX_NB_WORDS = 75000  # keep only the most frequent words (RMDL default)
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_NB_WORDS)
    print(len(X_train))
    print(y_test)
    word_index = imdb.get_word_index()
    index_word = {v: k for k, v in word_index.items()}
    # Decode the integer sequences back into cleaned text documents.
    X_train = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_train]
    X_test = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_test]
    X_train = np.array(X_train).ravel()
    print(X_train.shape)
    X_test = np.array(X_test).ravel()

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    sparse_categorical = 0
    n_epochs = [100, 100, 100]  ## DNN-RNN-CNN
    Random_Deep = [3, 3, 3]  ## DNN-RNN-CNN

    RMDL.Text_Classification(X_train, y_train, X_test, y_test,
                             batch_size=batch_size,
                             sparse_categorical=sparse_categorical,
                             random_deep=Random_Deep,
                             epochs=n_epochs)
Web Of Science
--------------

- Link to dataset: |Data|

  - Web of Science Dataset
    `WOS-11967 `__

    - This dataset contains 11,967 documents with 35 categories, which
      include 7 parent categories.

  - Web of Science Dataset
    `WOS-46985 `__

    - This dataset contains 46,985 documents with 134 categories,
      which include 7 parent categories.

  - Web of Science Dataset
    `WOS-5736 `__

    - This dataset contains 5,736 documents with 11 categories, which
      include 3 parent categories.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    import os
    from RMDL import text_feature_extraction as txt
    from sklearn.model_selection import train_test_split
    from RMDL.Download import Download_WOS as WOS
    import numpy as np
    from RMDL import RMDL_Text as RMDL

Load Data
~~~~~~~~~

.. code:: python

    path_WOS = WOS.download_and_extract()
    fname = os.path.join(path_WOS, "WebOfScience/WOS11967/X.txt")
    fnamek = os.path.join(path_WOS, "WebOfScience/WOS11967/Y.txt")
    with open(fname, encoding="utf-8") as f:
        content = f.readlines()
        content = [txt.text_cleaner(x) for x in content]
    with open(fnamek) as fk:
        contentk = fk.readlines()
    contentk = [x.strip() for x in contentk]
    Label = np.matrix(contentk, dtype=int)
    Label = np.transpose(Label)
    np.random.seed(7)
    print(Label.shape)
    X_train, X_test, y_train, y_test = train_test_split(content, Label, test_size=0.2, random_state=4)

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    n_epochs = [5000, 500, 500]  ## DNN-RNN-CNN
    Random_Deep = [3, 3, 3]  ## DNN-RNN-CNN

    RMDL.Text_Classification(X_train, y_train, X_test, y_test,
                             batch_size=batch_size,
                             sparse_categorical=True,
                             random_deep=Random_Deep,
                             epochs=n_epochs, no_of_classes=12)

Reuters-21578
-------------

- This dataset contains 21,578 documents with 90 categories.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    import sys
    import os
    import nltk
    nltk.download("reuters")
    from nltk.corpus import reuters
    from sklearn.preprocessing import MultiLabelBinarizer
    import numpy as np
    from RMDL import RMDL_Text as RMDL

Load Data
~~~~~~~~~

.. code:: python

    documents = reuters.fileids()

    train_docs_id = list(filter(lambda doc: doc.startswith("train"),
                                documents))
    test_docs_id = list(filter(lambda doc: doc.startswith("test"),
                               documents))
    X_train = [reuters.raw(doc_id) for doc_id in train_docs_id]
    X_test = [reuters.raw(doc_id) for doc_id in test_docs_id]
    mlb = MultiLabelBinarizer()
    y_train = mlb.fit_transform([reuters.categories(doc_id)
                                 for doc_id in train_docs_id])
    y_test = mlb.transform([reuters.categories(doc_id)
                            for doc_id in test_docs_id])
    # Reduce the multi-label targets to a single class id per document.
    y_train = np.argmax(y_train, axis=1)
    y_test = np.argmax(y_test, axis=1)


Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    n_epochs = [20, 500, 50]  ## DNN-RNN-CNN
    Random_Deep = [3, 0, 0]  ## DNN-RNN-CNN

    RMDL.Text_Classification(X_train, y_train, X_test, y_test,
                             batch_size=batch_size,
                             sparse_categorical=True,
                             random_deep=Random_Deep,
                             epochs=n_epochs)
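Note that Reuters-21578 is natively multi-label; the ``np.argmax`` step above keeps only the
first active category of each document so the task becomes single-label. A toy illustration of
that conversion (not part of the original example, made-up category sets for demonstration):

.. code:: python

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer

    # Two toy documents: one with two categories, one with a single category.
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform([{'grain', 'wheat'}, {'trade'}])
    print(mlb.classes_)          # ['grain' 'trade' 'wheat']
    print(y)                     # [[1 0 1], [0 1 0]]
    print(np.argmax(y, axis=1))  # [0 1] -> first active column wins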
Olivetti Faces
--------------

- There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.model_selection import train_test_split
    from RMDL import RMDL_Image as RMDL

Load Data
~~~~~~~~~

.. code:: python

    number_of_classes = 40
    shape = (64, 64, 1)
    data = fetch_olivetti_faces()
    X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                        data.target, stratify=data.target, test_size=40)
    X_train = X_train.reshape(X_train.shape[0], 64, 64, 1).astype('float32')
    X_test = X_test.reshape(X_test.shape[0], 64, 64, 1).astype('float32')

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    n_epochs = [500, 500, 50]  ## DNN-RNN-CNN
    Random_Deep = [0, 0, 1]  ## DNN-RNN-CNN

    RMDL.Image_Classification(X_train, y_train, X_test, y_test,
                              shape,
                              random_optimizor=False,
                              batch_size=batch_size,
                              random_deep=Random_Deep,
                              epochs=n_epochs)



More examples:
`link `__

|Results|


Error and Comments:
-------------------

Send an email to kk7nc@virginia.edu

Citations
---------

.. code::

    @inproceedings{Kowsari2018RMDL,
      title={RMDL: Random Multimodel Deep Learning for Classification},
      author={Kowsari, Kamran and Heidarysafa, Mojtaba and Brown, Donald E. and Jafari Meimandi, Kiana and Barnes, Laura E.},
      booktitle={Proceedings of the 2018 International Conference on Information System and Data Mining},
      year={2018},
      DOI={https://doi.org/10.1145/3206098.3206111},
      organization={ACM}
    }

.. |werckerstatus| image:: https://app.wercker.com/status/3a564158e809971e2f7416beba5d05af/s/master
   :target: https://app.wercker.com/project/byKey/3a564158e809971e2f7416beba5d05af
.. |BuildStatus| image:: https://travis-ci.org/kk7nc/RMDL.svg?branch=master
   :target: https://travis-ci.org/kk7nc/RMDL
.. |PowerPoint| image:: https://img.shields.io/badge/Presentation-download-red.svg?style=flat
   :target: https://github.com/kk7nc/RMDL/blob/master/docs/RMDL.pdf
.. |researchgate| image:: https://img.shields.io/badge/ResearchGate-RMDL-blue.svg?style=flat
   :target: https://www.researchgate.net/publication/324922651_RMDL_Random_Multimodel_Deep_Learning_for_Classification
.. |Binder| image:: https://mybinder.org/badge.svg
   :target: https://mybinder.org/v2/gh/kk7nc/RMDL/master
.. |pdf| image:: https://img.shields.io/badge/pdf-download-red.svg?style=flat
   :target: https://github.com/kk7nc/RMDL/blob/master/docs/ACM-RMDL.pdf
.. |GitHublicense| image:: https://img.shields.io/badge/licence-GPL-blue.svg
   :target: ./LICENSE
.. |RDL| image:: http://kowsari.net/onewebmedia/RDL.jpg
.. |RMDL| image:: http://kowsari.net/onewebmedia/RMDL.jpg
.. |Results| image:: http://kowsari.net/onewebmedia/RMDL_Results.png
.. |Data| image:: https://img.shields.io/badge/DOI-10.17632/9rw3vkcfy4.6-blue.svg?style=flat
   :target: http://dx.doi.org/10.17632/9rw3vkcfy4.6
.. |Pypi| image:: https://img.shields.io/badge/Pypi-RMDL%201.0.5-blue.svg
   :target: https://pypi.org/project/RMDL/
.. |DOI| image:: https://img.shields.io/badge/DOI-10.1145/3206098.3206111-blue.svg?style=flat
   :target: https://doi.org/10.1145/3206098.3206111
.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/github/kk7nc/RMDL?branch=master&svg=true
   :target: https://ci.appveyor.com/project/kk7nc/RMDL
.. |arxiv| image:: https://img.shields.io/badge/arXiv-1805.01890-red.svg
   :target: https://arxiv.org/abs/1805.01890
.. |twitter| image:: https://img.shields.io/twitter/url/http/shields.io.svg?style=social
   :target: https://twitter.com/intent/tweet?text=RMDL:%20Random%20Multimodel%20Deep%20Learning%20for%20Classification%0aGitHub:&url=https://github.com/kk7nc/RMDL&hashtags=DeepLearning,classification,MachineLearning,deep_neural_networks,Image_Classification,Text_Classification,EnsembleLearning
.. |Join the chat at https://gitter.im/RMDL-Random-Multimodel-Deep-Learning| image:: https://badges.gitter.im/Join%20Chat.svg
   :target: https://gitter.im/RMDL-Random-Multimodel-Deep-Learning/Lobby?source=orgpage
--------------------------------------------------------------------------------
/code/RNN.py:
--------------------------------------------------------------------------------
from keras.layers import Dropout, Dense, GRU, Embedding
from keras.models import Sequential
import numpy as np
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups


def loadData_Tokenizer(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    # Load the pre-trained GloVe vectors; update this path to your own copy.
    embeddings_index = {}
    f = open("C:\\Users\\kamran\\Documents\\GitHub\\RMDL\\Examples\\Glove\\glove.6B.50d.txt", encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            continue  # skip malformed lines instead of reusing a stale vector
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)


def Build_Model_RNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):
    """
    word_index is the word index,
    embeddings_index is the embeddings index (see loadData_Tokenizer),
    nclasses is the number of classes,
    MAX_SEQUENCE_LENGTH is the maximum length of the text sequences.
    """

    model = Sequential()
    hidden_layer = 3
    gru_node = 256

    # Words not found in the embedding index keep random initial vectors.
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)), "- please make sure your"
                      " EMBEDDING_DIM is equal to the dimension of the embedding (GloVe) file")
                exit(1)
            embedding_matrix[i] = embedding_vector
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embedding_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))

    # Stack GRU layers; all but the last return full sequences.
    for i in range(0, hidden_layer):
        model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
    model.add(GRU(gru_node, recurrent_dropout=0.2))
    model.add(Dense(nclasses, activation='softmax'))

    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test)

model_RNN = Build_Model_RNN_Text(word_index, embeddings_index, 20)

model_RNN.summary()

model_RNN.fit(X_train_Glove, y_train,
              validation_data=(X_test_Glove, y_test),
              epochs=20,
              batch_size=128,
              verbose=2)

predicted = model_RNN.predict_classes(X_test_Glove)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/code/Random_Forest.py:
--------------------------------------------------------------------------------
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Bag-of-words counts -> tf-idf weighting -> random forest classifier.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=100)),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/code/Rocchio_classification.py:
--------------------------------------------------------------------------------
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# NearestCentroid over tf-idf vectors implements Rocchio classification.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', NearestCentroid()),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/code/SVM.py:
--------------------------------------------------------------------------------
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC()),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/docs/Text_Classification.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/Text_Classification.pdf
--------------------------------------------------------------------------------
/docs/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-slate
--------------------------------------------------------------------------------
/docs/eq/tf-idf.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/eq/tf-idf.gif
--------------------------------------------------------------------------------
/docs/eq/tfidf.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/eq/tfidf.gif
--------------------------------------------------------------------------------
/docs/pic/Autoencoder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Autoencoder.png
--------------------------------------------------------------------------------
/docs/pic/BPW.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/BPW.png
--------------------------------------------------------------------------------
/docs/pic/Bagging.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Bagging.PNG
--------------------------------------------------------------------------------
/docs/pic/Boosting.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Boosting.PNG -------------------------------------------------------------------------------- /docs/pic/CBOW.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/CBOW.png -------------------------------------------------------------------------------- /docs/pic/CNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/CNN.png -------------------------------------------------------------------------------- /docs/pic/CRF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/CRF.png -------------------------------------------------------------------------------- /docs/pic/DNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/DNN.png -------------------------------------------------------------------------------- /docs/pic/F1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/F1.png -------------------------------------------------------------------------------- /docs/pic/GitHub-Mark-32px.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/GitHub-Mark-32px.png -------------------------------------------------------------------------------- /docs/pic/GitHub-Mark-64px.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/GitHub-Mark-64px.png -------------------------------------------------------------------------------- /docs/pic/Glove.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Glove.PNG -------------------------------------------------------------------------------- /docs/pic/Glove_VS_DCWE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Glove_VS_DCWE.png -------------------------------------------------------------------------------- /docs/pic/HAN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/HAN.png -------------------------------------------------------------------------------- /docs/pic/HDLTex.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/HDLTex.png -------------------------------------------------------------------------------- /docs/pic/KNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/KNN.png -------------------------------------------------------------------------------- /docs/pic/LSTM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/LSTM.png -------------------------------------------------------------------------------- /docs/pic/OverviewTextClassification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/OverviewTextClassification.png -------------------------------------------------------------------------------- /docs/pic/RDL.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RDL.jpg -------------------------------------------------------------------------------- /docs/pic/RDL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RDL.png -------------------------------------------------------------------------------- /docs/pic/RF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RF.png -------------------------------------------------------------------------------- /docs/pic/RMDL.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL.jpg -------------------------------------------------------------------------------- /docs/pic/RMDL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL.png -------------------------------------------------------------------------------- /docs/pic/RMDL_Results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL_Results.png -------------------------------------------------------------------------------- /docs/pic/RMDL_Results_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL_Results_small.png -------------------------------------------------------------------------------- /docs/pic/RNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RNN.png 
-------------------------------------------------------------------------------- /docs/pic/Random Projection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Random Projection.png -------------------------------------------------------------------------------- /docs/pic/SVM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/SVM.png -------------------------------------------------------------------------------- /docs/pic/TSNE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/TSNE.png -------------------------------------------------------------------------------- /docs/pic/Word Art.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Word Art.png -------------------------------------------------------------------------------- /docs/pic/WordArt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/WordArt.png -------------------------------------------------------------------------------- /docs/pic/fasttext-logo-color-web.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/fasttext-logo-color-web.png -------------------------------------------------------------------------------- /docs/pic/github-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/github-logo.png -------------------------------------------------------------------------------- /docs/pic/line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/line.png -------------------------------------------------------------------------------- /docs/pic/ngram_cnn_highway_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/ngram_cnn_highway_1.png -------------------------------------------------------------------------------- /docs/pic/sphx_glr_plot_roc_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/sphx_glr_plot_roc_001.png --------------------------------------------------------------------------------