├── .github
│   └── ISSUE_TEMPLATE
│       ├── bug_report.md
│       ├── custom.md
│       └── feature_request.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.rst
├── Data
│   ├── Download_Glove.py
│   ├── Download_WOS.py
│   └── README.rst
├── LICENSE
├── README.rst
├── WordArt.png
├── code
│   ├── Bagging.py
│   ├── Boost.py
│   ├── CNN.py
│   ├── CRF.py
│   ├── DNN.py
│   ├── Decision_Tree.py
│   ├── HDLTex
│   │   └── README.rst
│   ├── Hierarchical_Attention_Networks
│   │   ├── README.md
│   │   ├── textClassifierConv.py
│   │   ├── textClassifierHATT.py
│   │   └── textClassifierRNN.py
│   ├── K-nearest_Neighbor.py
│   ├── MultinomialNB.py
│   ├── RCNN.py
│   ├── RMDL
│   │   └── README.rst
│   ├── RNN.py
│   ├── Random_Forest.py
│   ├── Rocchio_classification.py
│   └── SVM.py
└── docs
    ├── Text_Classification.pdf
    ├── _config.yml
    ├── eq
    │   ├── tf-idf.gif
    │   └── tfidf.gif
    └── pic
        ├── Autoencoder.png
        ├── BPW.png
        ├── Bagging.PNG
        ├── Boosting.PNG
        ├── CBOW.png
        ├── CNN.png
        ├── CRF.png
        ├── DNN.png
        ├── F1.png
        ├── GitHub-Mark-32px.png
        ├── GitHub-Mark-64px.png
        ├── Glove.PNG
        ├── Glove_VS_DCWE.png
        ├── HAN.png
        ├── HDLTex.png
        ├── KNN.png
        ├── LSTM.png
        ├── OverviewTextClassification.png
        ├── RDL.jpg
        ├── RDL.png
        ├── RF.png
        ├── RMDL.jpg
        ├── RMDL.png
        ├── RMDL_Results.png
        ├── RMDL_Results_small.png
        ├── RNN.png
        ├── Random Projection.png
        ├── SVM.png
        ├── TSNE.png
        ├── Word Art.png
        ├── WordArt.png
        ├── fasttext-logo-color-web.png
        ├── github-logo.png
        ├── line.png
        ├── ngram_cnn_highway_1.png
        └── sphx_glr_plot_roc_001.png
/.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/custom.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Custom issue template 3 | about: Describe this issue template's purpose here. 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 
15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, sex characteristics, gender identity and expression, 9 | level of experience, education, socio-economic status, nationality, personal 10 | appearance, race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the project team at kk7nc@virginia.edu. All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 
63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /CONTRIBUTING.rst: -------------------------------------------------------------------------------- 1 | 2 | ************* 3 | Contributing 4 | ************* 5 | 6 | *For typos, please do not create a pull request. Instead, declare them in issues or email the repository owner*. Please note that we have a code of conduct; please follow it in all your interactions with the project. 7 | 8 | ==================== 9 | Pull Request Process 10 | ==================== 11 | 12 | Please consider the following criteria so that we can review your pull request efficiently: 13 | 14 | 1. The pull request is mainly expected to be a link suggestion. 15 | 2. Please make sure your suggested resources are not obsolete or broken. 16 | 3. Ensure any install or build dependencies are removed before the end of the build when creating a 17 | pull request. 18 | 4. Add comments with details of changes to the interface; this includes new environment 19 | variables, exposed ports, useful file locations, and container parameters. 20 | 5. You may merge the Pull Request once you have the sign-off of at least one other developer; if you 21 | do not have permission to do that, you may request the owner to merge it for you if you believe all checks have passed. 22 | 23 | Thank you! 24 | -------------------------------------------------------------------------------- /Data/Download_Glove.py: -------------------------------------------------------------------------------- 1 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 2 | RMDL: Random Multimodel Deep Learning for Classification 3 | * Copyright (C) 2018 Kamran Kowsari 4 | * Last Update: 04/25/2018 5 | * This file is part of the RMDL project, University of Virginia. 
6 | * Free to use, change, share and distribute source code of RMDL 7 | * Referenced paper: RMDL: Random Multimodel Deep Learning for Classification 8 | * Referenced paper: An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL) 9 | * Comments and errors: email: kk7nc@virginia.edu 10 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 11 | 12 | 13 | from __future__ import print_function 14 | 15 | import os, sys, tarfile 16 | import numpy as np 17 | import zipfile 18 | 19 | if sys.version_info >= (3, 0, 0): 20 | import urllib.request as urllib # ugly but works 21 | else: 22 | import urllib 23 | 24 | print(sys.version_info) 25 | 26 | # image shape 27 | 28 | 29 | # path to the directory with the data 30 | DATA_DIR = '.\Glove' 31 | 32 | # url of the binary data 33 | 34 | 35 | 36 | # path to the binary train file with image data 37 | 38 | 39 | def download_and_extract(data='Wikipedia'): 40 | """ 41 | Download and extract the GloVe word vectors 42 | :return: path to the extraction directory 43 | """ 44 | 45 | if data=='Wikipedia': 46 | DATA_URL = 'http://nlp.stanford.edu/data/glove.6B.zip' 47 | elif data=='Common_Crawl_840B': 48 | DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip' 49 | elif data=='Common_Crawl_42B': 50 | DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip' 51 | elif data=='Twitter': 52 | DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip' 53 | else: 54 | print("parameter must be Twitter, Common_Crawl_42B, Common_Crawl_840B, or Wikipedia") 55 | exit(0) 56 | 57 | 58 | dest_directory = DATA_DIR 59 | if not os.path.exists(dest_directory): 60 | os.makedirs(dest_directory) 61 | filename = DATA_URL.split('/')[-1] 62 | filepath = os.path.join(dest_directory, filename) 63 | print(filepath) 64 | 65 | path = os.path.abspath(dest_directory) 66 | if not os.path.exists(filepath): 67 | def _progress(count, block_size, total_size): 68 | sys.stdout.write('\rDownloading %s %.2f%%' % (filename, 69 | float(count * block_size) / float(total_size) * 100.0)) 70 | sys.stdout.flush() 71 | 72 | filepath, _ = urllib.urlretrieve(DATA_URL, filepath)#, reporthook=_progress) 73 | 74 | 75 | zip_ref = zipfile.ZipFile(filepath, 'r') 76 | zip_ref.extractall(DATA_DIR) 77 | zip_ref.close() 78 | return path 79 | -------------------------------------------------------------------------------- /Data/Download_WOS.py: -------------------------------------------------------------------------------- 1 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 2 | RMDL: Random Multimodel Deep Learning for Classification 3 | * Copyright (C) 2018 Kamran Kowsari 4 | * Last Update: 04/25/2018 5 | * This file is part of the RMDL project, University of Virginia. 
6 | * Free to use, change, share and distribute source code of RMDL 7 | * Referenced paper: RMDL: Random Multimodel Deep Learning for Classification 8 | * Referenced paper: An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL) 9 | * Comments and errors: email: kk7nc@virginia.edu 10 | ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 11 | 12 | 13 | from __future__ import print_function 14 | 15 | import os, sys, tarfile 16 | import numpy as np 17 | 18 | if sys.version_info >= (3, 0, 0): 19 | import urllib.request as urllib # ugly but works 20 | else: 21 | import urllib 22 | 23 | print(sys.version_info) 24 | 25 | # image shape 26 | 27 | 28 | # path to the directory with the data 29 | DATA_DIR = '.\data_WOS' 30 | 31 | # url of the binary data 32 | DATA_URL = 'http://kowsari.net/WebOfScience.tar.gz' 33 | 34 | 35 | # path to the binary train file with image data 36 | 37 | 38 | def download_and_extract(): 39 | """ 40 | Download and extract the WOS datasets 41 | :return: path to the extraction directory 42 | """ 43 | dest_directory = DATA_DIR 44 | if not os.path.exists(dest_directory): 45 | os.makedirs(dest_directory) 46 | filename = DATA_URL.split('/')[-1] 47 | filepath = os.path.join(dest_directory, filename) 48 | 49 | 50 | path = os.path.abspath(dest_directory) 51 | if not os.path.exists(filepath): 52 | def _progress(count, block_size, total_size): 53 | sys.stdout.write('\rDownloading %s %.2f%%' % (filename, 54 | float(count * block_size) / float(total_size) * 100.0)) 55 | sys.stdout.flush() 56 | 57 | filepath, _ = urllib.urlretrieve(DATA_URL, filepath, reporthook=_progress) 58 | 59 | print('Downloaded', filename) 60 | 61 | tarfile.open(filepath, 'r').extractall(dest_directory) 62 | return path 63 | -------------------------------------------------------------------------------- /Data/README.rst: -------------------------------------------------------------------------------- 1 | 2 | ################################################ 3 | Text Classification Algorithms: A Brief Overview 4 | ################################################ 5 | 6 | ################## 7 | Table of Contents 8 | ################## 9 | .. contents:: 10 | :local: 11 | :depth: 4 12 | 13 | 14 | IMDB 15 | ----- 16 | 17 | - This dataset contains 50,000 documents with 2 categories. 18 | 19 | Import Packages 20 | ~~~~~~~~~~~~~~~ 21 | 22 | .. code:: python 23 | 24 | import sys 25 | import os 26 | from RMDL import text_feature_extraction as txt 27 | from keras.datasets import imdb 28 | import numpy as np 29 | from RMDL import RMDL_Text as RMDL 30 | 31 | Load Data 32 | ~~~~~~~~~ 33 | 34 | .. code:: python 35 | 36 | print("Load IMDB dataset....") 37 | (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_NB_WORDS) 38 | print(len(X_train)) 39 | print(y_test) 40 | word_index = imdb.get_word_index() 41 | index_word = {v: k for k, v in word_index.items()} 42 | X_train = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_train] 43 | X_test = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_test] 44 | X_train = np.array(X_train) 45 | X_train = np.array(X_train).ravel() 46 | print(X_train.shape) 47 | X_test = np.array(X_test) 48 | X_test = np.array(X_test).ravel() 49 | 50 | 51 | Web Of Science 52 | -------------- 53 | 54 | - Link to dataset: |Data| 55 | 56 | - Web of Science Dataset 57 | `WOS-11967 `__ 58 | 59 | - This dataset contains 11,967 documents with 35 categories which 60 | include 7 parent categories. 
61 | 62 | - Web of Science Dataset 63 | `WOS-46985 `__ 64 | 65 | - This dataset contains 46,985 documents with 134 categories 66 | which include 7 parent categories. 67 | 68 | - Web of Science Dataset 69 | `WOS-5736 `__ 70 | 71 | - This dataset contains 5,736 documents with 11 categories which 72 | include 3 parent categories. 73 | 74 | Import Packages 75 | ~~~~~~~~~~~~~~~ 76 | 77 | .. code:: python 78 | 79 | from RMDL import text_feature_extraction as txt 80 | from sklearn.model_selection import train_test_split 81 | from RMDL.Download import Download_WOS as WOS 82 | import os, numpy as np 83 | from RMDL import RMDL_Text as RMDL 84 | 85 | Load Data 86 | ~~~~~~~~~ 87 | .. code:: python 88 | 89 | path_WOS = WOS.download_and_extract() 90 | fname = os.path.join(path_WOS,"WebOfScience/WOS11967/X.txt") 91 | fnamek = os.path.join(path_WOS,"WebOfScience/WOS11967/Y.txt") 92 | with open(fname, encoding="utf-8") as f: 93 | content = f.readlines() 94 | content = [txt.text_cleaner(x) for x in content] 95 | with open(fnamek) as fk: 96 | contentk = fk.readlines() 97 | contentk = [x.strip() for x in contentk] 98 | Label = np.matrix(contentk, dtype=int) 99 | Label = np.transpose(Label) 100 | np.random.seed(7) 101 | print(Label.shape) 102 | X_train, X_test, y_train, y_test = train_test_split(content, Label, test_size=0.2, random_state=4) 103 | 104 | 105 | 106 | Reuters-21578 107 | ------------- 108 | 109 | - This dataset contains 21,578 documents with 90 categories. 110 | 111 | Import Packages 112 | ~~~~~~~~~~~~~~~ 113 | 114 | .. code:: python 115 | 116 | import sys 117 | import os 118 | import nltk 119 | nltk.download("reuters") 120 | from nltk.corpus import reuters 121 | from sklearn.preprocessing import MultiLabelBinarizer 122 | import numpy as np 123 | from RMDL import RMDL_Text as RMDL 124 | 125 | Load Data 126 | ~~~~~~~~~ 127 | .. code:: python 128 | 129 | documents = reuters.fileids() 130 | 131 | train_docs_id = list(filter(lambda doc: doc.startswith("train"), 132 | documents)) 133 | test_docs_id = list(filter(lambda doc: doc.startswith("test"), 134 | documents)) 135 | X_train = [(reuters.raw(doc_id)) for doc_id in train_docs_id] 136 | X_test = [(reuters.raw(doc_id)) for doc_id in test_docs_id] 137 | mlb = MultiLabelBinarizer() 138 | y_train = mlb.fit_transform([reuters.categories(doc_id) 139 | for doc_id in train_docs_id]) 140 | y_test = mlb.transform([reuters.categories(doc_id) 141 | for doc_id in test_docs_id]) 142 | y_train = np.argmax(y_train, axis=1) 143 | y_test = np.argmax(y_test, axis=1) 144 | 145 | 146 | 147 | 148 | ========== 149 | Citations: 150 | ========== 151 | 152 | ---- 153 | 154 | .. code:: 155 | 156 | @ARTICLE{Kowsari2018Text_Classification, 157 | title={Text Classification Algorithms: A Survey}, 158 | author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. 
and Brown, Donald E.}, 159 | journal={Information}, 160 | year={2019}, 161 | volume={10}, 162 | number={4}, 163 | article-number={150}, 164 | url={http://www.mdpi.com/2078-2489/10/4/150}, 165 | issn={2078-2489}, 166 | publisher={Multidisciplinary Digital Publishing Institute} 167 | } 168 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Kamran Kowsari 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /WordArt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/WordArt.png -------------------------------------------------------------------------------- /code/Bagging.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import BaggingClassifier 2 | from sklearn.neighbors import KNeighborsClassifier 3 | from sklearn.pipeline import Pipeline 4 | from sklearn import metrics 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | from sklearn.feature_extraction.text import TfidfTransformer 7 | from sklearn.datasets import fetch_20newsgroups 8 | 9 | newsgroups_train = fetch_20newsgroups(subset='train') 10 | newsgroups_test = fetch_20newsgroups(subset='test') 11 | X_train = newsgroups_train.data 12 | X_test = newsgroups_test.data 13 | y_train = newsgroups_train.target 14 | y_test = newsgroups_test.target 15 | 16 | text_clf = Pipeline([('vect', CountVectorizer()), 17 | ('tfidf', TfidfTransformer()), 18 | ('clf', BaggingClassifier(KNeighborsClassifier())), 19 | ]) 20 | 21 | text_clf.fit(X_train, y_train) 22 | 23 | 24 | predicted = text_clf.predict(X_test) 25 | 26 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/Boost.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import GradientBoostingClassifier 2 | from sklearn.pipeline import Pipeline 3 | from sklearn import metrics 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | from sklearn.feature_extraction.text import TfidfTransformer 6 | from sklearn.datasets import fetch_20newsgroups
7 | 8 | newsgroups_train = fetch_20newsgroups(subset='train') 9 | newsgroups_test = fetch_20newsgroups(subset='test') 10 | X_train = newsgroups_train.data 11 | X_test = newsgroups_test.data 12 | y_train = newsgroups_train.target 13 | y_test = newsgroups_test.target 14 | 15 | text_clf = Pipeline([('vect', CountVectorizer()), 16 | ('tfidf', TfidfTransformer()), 17 | ('clf', GradientBoostingClassifier(n_estimators=50,verbose=2)), 18 | ]) 19 | 20 | text_clf.fit(X_train, y_train) 21 | 22 | 23 | predicted = text_clf.predict(X_test) 24 | 25 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/CNN.py: -------------------------------------------------------------------------------- 1 | from keras.layers import Dropout, Dense,Input,Embedding,Flatten, AveragePooling2D, Conv2D,Reshape 2 | from keras.models import Sequential,Model 3 | from sklearn.feature_extraction.text import TfidfVectorizer 4 | import numpy as np 5 | from sklearn import metrics 6 | from keras.preprocessing.text import Tokenizer 7 | from keras.preprocessing.sequence import pad_sequences 8 | from sklearn.datasets import fetch_20newsgroups 9 | from keras.layers.merge import Concatenate 10 | 11 | 12 | def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=1000): 13 | np.random.seed(7) 14 | text = np.concatenate((X_train, X_test), axis=0) 15 | text = np.array(text) 16 | tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 17 | tokenizer.fit_on_texts(text) 18 | sequences = tokenizer.texts_to_sequences(text) 19 | word_index = tokenizer.word_index 20 | text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) 21 | print('Found %s unique tokens.' % len(word_index)) 22 | indices = np.arange(text.shape[0]) 23 | # np.random.shuffle(indices) 24 | text = text[indices] 25 | print(text.shape) 26 | X_train = text[0:len(X_train), ] 27 | X_test = text[len(X_train):, ] 28 | embeddings_index = {} 29 | f = open(".\glove.6B.100d.txt", encoding="utf8") ## GloVe file; it can be downloaded from https://nlp.stanford.edu/projects/glove/ 30 | for line in f: 31 | values = line.split() 32 | word = values[0] 33 | try: 34 | coefs = np.asarray(values[1:], dtype='float32') 35 | except: 36 | pass 37 | embeddings_index[word] = coefs 38 | f.close() 39 | print('Total %s word vectors.' % len(embeddings_index)) 40 | return (X_train, X_test, word_index,embeddings_index) 41 | 42 | 43 | 44 | def Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=100, dropout=0.5): 45 | 46 | """ 47 | Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=100, dropout=0.5): 48 | word_index is the word index, 49 | embeddings_index is the embeddings index; look at data_helper.py 50 | nclasses is the number of classes, 51 | MAX_SEQUENCE_LENGTH is the maximum length of text sequences, 52 | EMBEDDING_DIM is an int value for the dimension of the word embedding; look at data_helper.py 53 | """ 54 | 55 | model = Sequential() 56 | embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM)) 57 | for word, i in word_index.items(): 58 | embedding_vector = embeddings_index.get(word) 59 | if embedding_vector is not None: 60 | # words not found in the embedding index keep the random initialization above 
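# note: the check below guards against a GloVe file whose vector length differs
# from EMBEDDING_DIM; a mismatch would otherwise raise a broadcast error when the
# vector is assigned into embedding_matrix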
61 | if len(embedding_matrix[i]) !=len(embedding_vector): 62 | print("could not broadcast input array from shape",str(len(embedding_matrix[i])), 63 | "into shape",str(len(embedding_vector))," Please make sure your" 64 | " EMBEDDING_DIM matches the dimension of the GloVe embedding file") 65 | exit(1) 66 | 67 | embedding_matrix[i] = embedding_vector 68 | 69 | embedding_layer = Embedding(len(word_index) + 1, 70 | EMBEDDING_DIM, 71 | weights=[embedding_matrix], 72 | input_length=MAX_SEQUENCE_LENGTH, 73 | trainable=True) 74 | 75 | # applying a more complex convolutional approach 76 | convs = [] 77 | filter_sizes = [] 78 | layer = 5 79 | print("Filter ",layer) 80 | for fl in range(0,layer): 81 | filter_sizes.append((fl+2,fl+2)) 82 | 83 | node = 128 84 | sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 85 | embedded_sequences = embedding_layer(sequence_input) 86 | emb = Reshape((500,10, 10), input_shape=(500,100))(embedded_sequences) 87 | 88 | for fsz in filter_sizes: 89 | l_conv = Conv2D(node, padding="same", kernel_size=fsz, activation='relu')(emb) 90 | l_pool = AveragePooling2D(pool_size=(5,1), padding="same")(l_conv) 91 | #l_pool = Dropout(0.25)(l_pool) 92 | convs.append(l_pool) 93 | 94 | l_merge = Concatenate(axis=1)(convs) 95 | l_cov1 = Conv2D(node, (5,5), padding="same", activation='relu')(l_merge) 96 | l_cov1 = AveragePooling2D(pool_size=(5,2), padding="same")(l_cov1) 97 | l_cov2 = Conv2D(node, (5,5), padding="same", activation='relu')(l_cov1) 98 | l_pool2 = AveragePooling2D(pool_size=(5,2), padding="same")(l_cov2) 99 | l_cov2 = Dropout(dropout)(l_pool2) 100 | l_flat = Flatten()(l_cov2) 101 | l_dense = Dense(128, activation='relu')(l_flat) 102 | l_dense = Dropout(dropout)(l_dense) 103 | 104 | preds = Dense(nclasses, activation='softmax')(l_dense) 105 | model = Model(sequence_input, preds) 106 | 107 | model.compile(loss='sparse_categorical_crossentropy', 108 | optimizer='adam', 109 | metrics=['accuracy']) 110 | 111 | 112 | 113 | return model 114 | 115 | 116 | 117 | from sklearn.datasets import fetch_20newsgroups 118 | from RMDL import text_feature_extraction as txt 119 | 120 | newsgroups_train = fetch_20newsgroups(subset='train') 121 | newsgroups_test = fetch_20newsgroups(subset='test') 122 | X_train = newsgroups_train.data 123 | X_test = newsgroups_test.data 124 | y_train = newsgroups_train.target 125 | y_test = newsgroups_test.target 126 | 127 | 128 | X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test) 129 | 130 | 131 | model_CNN = Build_Model_CNN_Text(word_index,embeddings_index, 20) 132 | 133 | 134 | model_CNN.summary() 135 | 136 | model_CNN.fit(X_train_Glove, y_train, 137 | validation_data=(X_test_Glove, y_test), 138 | epochs=1000, 139 | batch_size=128, 140 | verbose=2) 141 | 142 | predicted = model_CNN.predict(X_test_Glove) 143 | 144 | predicted = np.argmax(predicted, axis=1) 145 | 146 | 147 | print(metrics.classification_report(y_test, predicted)) 148 | -------------------------------------------------------------------------------- /code/CRF.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import sklearn_crfsuite 3 | from sklearn_crfsuite import metrics 4 | nltk.corpus.conll2002.fileids() 5 | train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train')) 6 | test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb')) 7 | def word2features(sent, i): 8 | word = sent[i][0] 9 | postag = sent[i][1] 10 | 11 | features = { 12 | 'bias': 1.0, 13 | 'word.lower()': word.lower(), 
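# the character-suffix entries below act as cheap morphological features;
# in Spanish NER (CoNLL-2002) word endings often signal entity type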
14 | 'word[-3:]': word[-3:], 15 | 'word[-2:]': word[-2:], 16 | 'word.isupper()': word.isupper(), 17 | 'word.istitle()': word.istitle(), 18 | 'word.isdigit()': word.isdigit(), 19 | 'postag': postag, 20 | 'postag[:2]': postag[:2], 21 | } 22 | if i > 0: 23 | word1 = sent[i-1][0] 24 | postag1 = sent[i-1][1] 25 | features.update({ 26 | '-1:word.lower()': word1.lower(), 27 | '-1:word.istitle()': word1.istitle(), 28 | '-1:word.isupper()': word1.isupper(), 29 | '-1:postag': postag1, 30 | '-1:postag[:2]': postag1[:2], 31 | }) 32 | else: 33 | features['BOS'] = True 34 | 35 | if i < len(sent)-1: 36 | word1 = sent[i+1][0] 37 | postag1 = sent[i+1][1] 38 | features.update({ 39 | '+1:word.lower()': word1.lower(), 40 | '+1:word.istitle()': word1.istitle(), 41 | '+1:word.isupper()': word1.isupper(), 42 | '+1:postag': postag1, 43 | '+1:postag[:2]': postag1[:2], 44 | }) 45 | else: 46 | features['EOS'] = True 47 | 48 | return features 49 | 50 | 51 | def sent2features(sent): 52 | return [word2features(sent, i) for i in range(len(sent))] 53 | 54 | def sent2labels(sent): 55 | return [label for token, postag, label in sent] 56 | 57 | def sent2tokens(sent): 58 | return [token for token, postag, label in sent] 59 | 60 | X_train = [sent2features(s) for s in train_sents] 61 | y_train = [sent2labels(s) for s in train_sents] 62 | 63 | X_test = [sent2features(s) for s in test_sents] 64 | y_test = [sent2labels(s) for s in test_sents] 65 | 66 | 67 | 68 | crf = sklearn_crfsuite.CRF( 69 | algorithm='lbfgs', 70 | c1=0.1, 71 | c2=0.1, 72 | max_iterations=100, 73 | all_possible_transitions=True 74 | ) 75 | crf.fit(X_train, y_train) 76 | 77 | y_pred = crf.predict(X_test) 78 | print(metrics.flat_classification_report( 79 | y_test, y_pred, digits=3 80 | )) -------------------------------------------------------------------------------- /code/DNN.py: -------------------------------------------------------------------------------- 1 | from keras.layers import Dropout, Dense 2 | from keras.models import Sequential 3 | from sklearn.feature_extraction.text import TfidfVectorizer 4 | import numpy as np 5 | from sklearn import metrics 6 | 7 | 8 | def TFIDF(X_train, X_test,MAX_NB_WORDS=75000): 9 | vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS) 10 | X_train = vectorizer_x.fit_transform(X_train).toarray() 11 | X_test = vectorizer_x.transform(X_test).toarray() 12 | print("tf-idf with",str(np.array(X_train).shape[1]),"features") 13 | return (X_train,X_test) 14 | 15 | 16 | def Build_Model_DNN_Text(shape, nClasses, dropout=0.5): 17 | """ 18 | Build_Model_DNN_Text(shape, nClasses, dropout) 19 | Build a deep neural network model for text classification 20 | shape is the input feature space 21 | nClasses is the number of classes 22 | """ 23 | model = Sequential() 24 | node = 512 # number of nodes 25 | nLayers = 4 # number of hidden layers 26 | 27 | model.add(Dense(node,input_dim=shape,activation='relu')) 28 | model.add(Dropout(dropout)) 29 | for i in range(0,nLayers): 30 | model.add(Dense(node,input_dim=node,activation='relu')) 31 | model.add(Dropout(dropout)) 32 | model.add(Dense(nClasses, activation='softmax')) 33 | 34 | model.compile(loss='sparse_categorical_crossentropy', 35 | optimizer='adam', 36 | metrics=['accuracy']) 37 | 38 | return model 39 | 40 | 41 | from sklearn.datasets import fetch_20newsgroups 42 | 43 | newsgroups_train = fetch_20newsgroups(subset='train') 44 | newsgroups_test = fetch_20newsgroups(subset='test') 45 | X_train = newsgroups_train.data 46 | X_test = newsgroups_test.data 47 | y_train = newsgroups_train.target 48 | y_test = newsgroups_test.target
49 | 50 | X_train_tfidf,X_test_tfidf = TFIDF(X_train,X_test) 51 | 52 | 53 | model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 20) 54 | model_DNN.summary() 55 | 56 | model_DNN.fit(X_train_tfidf, y_train, 57 | validation_data=(X_test_tfidf, y_test), 58 | epochs=10, 59 | batch_size=128, 60 | verbose=2) 61 | 62 | predicted = model_DNN.predict_classes(X_test_tfidf) 63 | 64 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/Decision_Tree.py: -------------------------------------------------------------------------------- 1 | from sklearn import tree 2 | from sklearn.pipeline import Pipeline 3 | from sklearn import metrics 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | from sklearn.feature_extraction.text import TfidfTransformer 6 | from sklearn.datasets import fetch_20newsgroups 7 | 8 | newsgroups_train = fetch_20newsgroups(subset='train') 9 | newsgroups_test = fetch_20newsgroups(subset='test') 10 | X_train = newsgroups_train.data 11 | X_test = newsgroups_test.data 12 | y_train = newsgroups_train.target 13 | y_test = newsgroups_test.target 14 | 15 | text_clf = Pipeline([('vect', CountVectorizer()), 16 | ('tfidf', TfidfTransformer()), 17 | ('clf', tree.DecisionTreeClassifier()), 18 | ]) 19 | 20 | text_clf.fit(X_train, y_train) 21 | 22 | 23 | predicted = text_clf.predict(X_test) 24 | 25 | print(metrics.classification_report(y_test, predicted)) -------------------------------------------------------------------------------- /code/HDLTex/README.rst: -------------------------------------------------------------------------------- 1 | |DOI| |travis| |appveyor| |wercker status| |Join the chat at 2 | https://gitter.im/HDLTex| |arXiv| |RG| |Binder| |license| |twitter| 3 | 4 | HDLTex: Hierarchical Deep Learning for Text Classification 5 | ========================================================== 6 | 7 | Referenced paper: `HDLTex: Hierarchical Deep Learning for Text 8 | Classification `__ 9 | 10 | .. image:: /docs/pic/github-logo.png 11 | :target: https://github.com/kk7nc/HDLTex 12 | 13 | 14 | |Pic| 15 | 16 | Documentation: 17 | =============== 18 | 19 | Increasingly large document collections require improved information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of traditional supervised classifiers has degraded as the number of documents has increased. This is because along with growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy. 20 | 21 | Installation 22 | ============= 23 | 24 | Using pip 25 | ---------- 26 | .. code:: bash 27 | 28 | pip install HDLTex 29 | 30 | Using git 31 | ---------- 32 | .. code:: bash 33 | 34 | git clone --recursive https://github.com/kk7nc/HDLTex.git 35 | 36 | 37 | The primary requirements for this package are Python 3 with TensorFlow. 
38 | The requirements.txt file contains a listing of the required Python 39 | packages; to install all requirements, run the following: 40 | 41 | .. code:: bash 42 | 43 | pip install -r requirements.txt 44 | 45 | Or 46 | 47 | .. code:: bash 48 | 49 | pip3 install -r requirements.txt 50 | 51 | Or: 52 | 53 | .. code:: bash 54 | 55 | conda install --file requirements.txt 56 | 57 | 58 | If the above command does not work, use the following: 59 | 60 | .. code:: bash 61 | 62 | sudo -H pip install -r requirements.txt 63 | 64 | 65 | Datasets for HDLTex: 66 | ===================== 67 | 68 | Link to dataset: |Data| 69 | 70 | Web of Science Dataset 71 | `WOS-11967 `__ 72 | 73 | :: 74 | 75 | This dataset contains 11,967 documents with 35 categories which include 7 parent categories. 76 | 77 | 78 | Web of Science Dataset 79 | `WOS-46985 `__ 80 | 81 | :: 82 | 83 | This dataset contains 46,985 documents with 134 categories which include 7 parent categories. 84 | 85 | 86 | Web of Science Dataset 87 | `WOS-5736 `__ 88 | 89 | :: 90 | 91 | This dataset contains 5,736 documents with 11 categories which include 3 parent categories. 92 | 93 | Requirements: 94 | ---------------- 95 | General: 96 | 97 | - Python 3.5 or later see `Instruction Documents `__ 98 | - TensorFlow see `Instruction Documents `__. 99 | - scikit-learn see `Instruction Documents `__ 100 | - Keras see `Instruction Documents `__ 101 | - scipy see `Instruction Documents `__ 102 | - GPU 103 | 104 | - CUDA® Toolkit 8.0. For details, see `NVIDIA's documentation `__. 105 | - The `NVIDIA drivers associated with CUDA Toolkit 8.0 `__. 106 | - cuDNN v6. For details, see `NVIDIA's documentation `__. 107 | - GPU card with CUDA Compute Capability 3.0 or higher. 108 | - The libcupti-dev library. 109 | - To install this library, issue the following command: 110 | 111 | :: 112 | 113 | $ sudo apt-get install libcupti-dev 114 | 115 | 116 | Feature Extraction: 117 | =================== 118 | 119 | Global Vectors for Word Representation 120 | (`GLOVE `__) 121 | 122 | :: 123 | 124 | For CNN and RNN, you need to download GloVe and link the folder location to it 125 | 126 | 127 | 128 | Error and Comments: 129 | =================== 130 | 131 | Send an email to kk7nc@virginia.edu 132 | 133 | Citation: 134 | ========= 135 | 136 | .. code:: bash 137 | 138 | @inproceedings{Kowsari2018HDLTex, 139 | author={Kowsari, Kamran and Brown, Donald E and Heidarysafa, Mojtaba and Meimandi, Kiana Jafari and Gerber, Matthew S and Barnes, Laura E}, 140 | booktitle={2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)}, 141 | title={HDLTex: Hierarchical Deep Learning for Text Classification}, 142 | year={2017}, 143 | pages={364-371}, 144 | doi={10.1109/ICMLA.2017.0-134}, 145 | month={Dec} 146 | } 147 | 148 | .. |DOI| image:: http://kowsari.net/HDLTex_DOI.svg?maxAge=2592000 149 | :target: https://doi.org/10.1109/ICMLA.2017.0-134 150 | .. |travis| image:: https://travis-ci.org/kk7nc/HDLTex.svg?branch=master 151 | :target: https://travis-ci.org/kk7nc/HDLTex 152 | .. |wercker status| image:: https://app.wercker.com/status/24a123448ba8764b257a1df242146b8e/s/master 153 | :target: https://app.wercker.com/project/byKey/24a123448ba8764b257a1df242146b8e 154 | .. |Join the chat at https://gitter.im/HDLTex| image:: https://badges.gitter.im/Join%20Chat.svg 155 | :target: https://gitter.im/HDLTex/Lobby?source=orgpage 156 | .. |appveyor| image:: https://ci.appveyor.com/api/projects/status/github/kk7nc/HDLTex?branch=master&svg=true
157 | :target: https://ci.appveyor.com/project/kk7nc/hdltex 158 | .. |arXiv| image:: https://img.shields.io/badge/arXiv-1709.08267-red.svg?style=flat 159 | :target: https://arxiv.org/abs/1709.08267 160 | .. |RG| image:: https://img.shields.io/badge/ResearchGate-HDLTex-blue.svg?style=flat 161 | :target: https://www.researchgate.net/publication/319968747_HDLTex_Hierarchical_Deep_Learning_for_Text_Classification 162 | .. |Binder| image:: https://mybinder.org/badge.svg 163 | :target: https://mybinder.org/v2/gh/kk7nc/HDLTex/master 164 | .. |license| image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592104 165 | :target: https://github.com/kk7nc/HDLTex/blob/master/LICENSE 166 | .. |Data| image:: https://img.shields.io/badge/DOI-10.17632/9rw3vkcfy4.6-blue.svg?style=flat 167 | :target: http://dx.doi.org/10.17632/9rw3vkcfy4.6 168 | .. |Pic| image:: http://kowsari.net/____impro/1/onewebmedia/HDLTex.png?etag=W%2F%22c90cd-59c4019b%22&sourceContentType=image%2Fpng&ignoreAspectRatio&resize=821%2B326&extract=0%2B0%2B821%2B325?raw=false 169 | :alt: HDLTex with both hierarchy levels as DNNs 170 | .. |twitter| image:: https://img.shields.io/twitter/url/http/shields.io.svg?style=social 171 | :target: https://twitter.com/intent/tweet?text=HDLTex:%20Hierarchical%20Deep%20Learning%20for%20Text%20Classification%0aGitHub:&url=https://github.com/kk7nc/HDLTex&hashtags=DeepLearning,Text_Classification,classification,MachineLearning,deep_neural_networks 172 | 173 | -------------------------------------------------------------------------------- /code/Hierarchical_Attention_Networks/README.md: -------------------------------------------------------------------------------- 1 | # textClassifier 2 | 3 | [richliao/textClassifier](https://github.com/richliao/textClassifier) 4 | 5 | 6 | 7 | 8 | textClassifierHATT.py has the implementation of [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf). Please see [this blog](https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-HATN/) for full detail. Also see the [Keras Google group discussion](https://groups.google.com/forum/#!topic/keras-users/IWK9opMFavQ). 9 | 10 | textClassifierConv.py has the implementation of [Convolutional Neural Networks for Sentence Classification - Yoon Kim](https://arxiv.org/abs/1408.5882). Please see [this blog](https://richliao.github.io/supervised/classification/2016/11/26/textclassifier-convolutional/) for full detail. 
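For reference, the core of the HAN model is its attention pooling: each word-level hidden state $h_{it}$ (from a bidirectional GRU) is scored against a learned context vector $u_w$, and the sentence vector $s_i$ is the attention-weighted sum; the same construction is repeated over sentence vectors to form the document vector (notation follows the Yang et al. paper linked above):

```latex
u_{it} = \tanh(W_w h_{it} + b_w), \qquad
\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t'} \exp(u_{it'}^{\top} u_w)}, \qquad
s_i = \sum_{t} \alpha_{it} h_{it}
```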
11 | -------------------------------------------------------------------------------- /code/Hierarchical_Attention_Networks/textClassifierConv.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | # author - Richard Liao 3 | # Dec 26 2016 4 | import numpy as np 5 | import pandas as pd 6 | import cPickle 7 | from collections import defaultdict 8 | import re 9 | 10 | from bs4 import BeautifulSoup 11 | 12 | import sys 13 | import os 14 | 15 | os.environ['KERAS_BACKEND']='theano' 16 | 17 | from keras.preprocessing.text import Tokenizer 18 | from keras.preprocessing.sequence import pad_sequences 19 | from keras.utils.np_utils import to_categorical 20 | 21 | from keras.layers import Embedding 22 | from keras.layers import Dense, Input, Flatten 23 | from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout 24 | from keras.models import Model 25 | 26 | MAX_SEQUENCE_LENGTH = 1000 27 | MAX_NB_WORDS = 20000 28 | EMBEDDING_DIM = 100 29 | VALIDATION_SPLIT = 0.2 30 | 31 | def clean_str(string): 32 | """ 33 | Tokenization/string cleaning for dataset 34 | Every dataset is lower cased 35 | """ 36 | string = re.sub(r"\\", "", string) 37 | string = re.sub(r"\'", "", string) 38 | string = re.sub(r"\"", "", string) 39 | return string.strip().lower() 40 | 41 | data_train = pd.read_csv('~/Testground/data/imdb/labeledTrainData.tsv', sep='\t') 42 | print(data_train.shape) 43 | 44 | texts = [] 45 | labels = [] 46 | 47 | for idx in range(data_train.review.shape[0]): 48 | text = BeautifulSoup(data_train.review[idx]) 49 | texts.append(clean_str(text.get_text().encode('ascii','ignore'))) 50 | labels.append(data_train.sentiment[idx]) 51 | 52 | 53 | tokenizer = Tokenizer(nb_words=MAX_NB_WORDS) 54 | tokenizer.fit_on_texts(texts) 55 | sequences = tokenizer.texts_to_sequences(texts) 56 | 57 | word_index = tokenizer.word_index 58 | print('Found %s unique tokens.' % len(word_index)) 59 | 60 | data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) 61 | 62 | labels = to_categorical(np.asarray(labels)) 63 | print(('Shape of data tensor:', data.shape)) 64 | print(('Shape of label tensor:', labels.shape)) 65 | 66 | indices = np.arange(data.shape[0]) 67 | np.random.shuffle(indices) 68 | data = data[indices] 69 | labels = labels[indices] 70 | nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0]) 71 | 72 | x_train = data[:-nb_validation_samples] 73 | y_train = labels[:-nb_validation_samples] 74 | x_val = data[-nb_validation_samples:] 75 | y_val = labels[-nb_validation_samples:] 76 | 77 | print('Number of positive and negative reviews in training and validation set ') 78 | print(y_train.sum(axis=0)) 79 | print(y_val.sum(axis=0)) 80 | 81 | GLOVE_DIR = "/ext/home/analyst/Testground/data/glove" 82 | embeddings_index = {} 83 | f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) 84 | for line in f: 85 | values = line.split() 86 | word = values[0] 87 | coefs = np.asarray(values[1:], dtype='float32') 88 | embeddings_index[word] = coefs 89 | f.close() 90 | 91 | print('Total %s word vectors in Glove 6B 100d.' % len(embeddings_index)) 92 | 93 | embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM)) 94 | for word, i in word_index.items(): 95 | embedding_vector = embeddings_index.get(word) 96 | if embedding_vector is not None: 97 | # words not found in the embedding index keep the random initialization above 
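# note: rows are indexed from 1 because Keras' Tokenizer numbers words from 1,
# hence the len(word_index) + 1 row count when embedding_matrix was created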
98 | embedding_matrix[i] = embedding_vector 99 | 100 | embedding_layer = Embedding(len(word_index) + 1, 101 | EMBEDDING_DIM, 102 | weights=[embedding_matrix], 103 | input_length=MAX_SEQUENCE_LENGTH, 104 | trainable=True) 105 | 106 | sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 107 | embedded_sequences = embedding_layer(sequence_input) 108 | l_cov1= Conv1D(128, 5, activation='relu')(embedded_sequences) 109 | l_pool1 = MaxPooling1D(5)(l_cov1) 110 | l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1) 111 | l_pool2 = MaxPooling1D(5)(l_cov2) 112 | l_cov3 = Conv1D(128, 5, activation='relu')(l_pool2) 113 | l_pool3 = MaxPooling1D(35)(l_cov3) # global max pooling 114 | l_flat = Flatten()(l_pool3) 115 | l_dense = Dense(128, activation='relu')(l_flat) 116 | preds = Dense(2, activation='softmax')(l_dense) 117 | 118 | model = Model(sequence_input, preds) 119 | model.compile(loss='categorical_crossentropy', 120 | optimizer='rmsprop', 121 | metrics=['acc']) 122 | 123 | print("model fitting - simplified convolutional neural network") 124 | model.summary() 125 | model.fit(x_train, y_train, validation_data=(x_val, y_val), 126 | nb_epoch=10, batch_size=128) 127 | 128 | embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM)) 129 | for word, i in word_index.items(): 130 | embedding_vector = embeddings_index.get(word) 131 | if embedding_vector is not None: 132 | # words not found in the embedding index keep the random initialization above 133 | embedding_matrix[i] = embedding_vector 134 | 135 | embedding_layer = Embedding(len(word_index) + 1, 136 | EMBEDDING_DIM, 137 | weights=[embedding_matrix], 138 | input_length=MAX_SEQUENCE_LENGTH, 139 | trainable=True) 140 | 141 | # applying a more complex convolutional approach 142 | convs = [] 143 | filter_sizes = [3,4,5] 144 | 145 | sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 146 | embedded_sequences = embedding_layer(sequence_input) 147 | 148 | for fsz in filter_sizes: 149 | l_conv = Conv1D(nb_filter=128,filter_length=fsz,activation='relu')(embedded_sequences) 150 | l_pool = MaxPooling1D(5)(l_conv) 151 | convs.append(l_pool) 152 | 153 | l_merge = Merge(mode='concat', concat_axis=1)(convs) 154 | l_cov1= Conv1D(128, 5, activation='relu')(l_merge) 155 | l_pool1 = MaxPooling1D(5)(l_cov1) 156 | l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1) 157 | l_pool2 = MaxPooling1D(30)(l_cov2) 158 | l_flat = Flatten()(l_pool2) 159 | l_dense = Dense(128, activation='relu')(l_flat) 160 | preds = Dense(2, activation='softmax')(l_dense) 161 | 162 | model = Model(sequence_input, preds) 163 | model.compile(loss='categorical_crossentropy', 164 | optimizer='rmsprop', 165 | metrics=['acc']) 166 | 167 | print("model fitting - more complex convolutional neural network") 168 | model.summary() 169 | model.fit(x_train, y_train, validation_data=(x_val, y_val), 170 | nb_epoch=20, batch_size=50) 171 | -------------------------------------------------------------------------------- /code/Hierarchical_Attention_Networks/textClassifierHATT.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | # author - Richard Liao 3 | # Dec 26 2016 4 | import numpy as np 5 | import pandas as pd 6 | import cPickle 7 | from collections import defaultdict 8 | import re 9 | 10 | from bs4 import BeautifulSoup 11 | 12 | import sys 13 | import os 14 | 15 | os.environ['KERAS_BACKEND']='theano' 16 | 17 | from keras.preprocessing.text import Tokenizer, text_to_word_sequence 18 | from keras.preprocessing.sequence import pad_sequences
19 | from keras.utils.np_utils import to_categorical 20 | 21 | from keras.layers import Embedding 22 | from keras.layers import Dense, Input, Flatten 23 | from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed 24 | from keras.models import Model 25 | 26 | from keras import backend as K 27 | from keras.engine.topology import Layer, InputSpec 28 | from keras import initializations 29 | 30 | MAX_SENT_LENGTH = 100 31 | MAX_SENTS = 15 32 | MAX_NB_WORDS = 20000 33 | EMBEDDING_DIM = 100 34 | VALIDATION_SPLIT = 0.2 35 | 36 | def clean_str(string): 37 | """ 38 | Tokenization/string cleaning for dataset 39 | Every dataset is lower cased 40 | """ 41 | string = re.sub(r"\\", "", string) 42 | string = re.sub(r"\'", "", string) 43 | string = re.sub(r"\"", "", string) 44 | return string.strip().lower() 45 | 46 | data_train = pd.read_csv('~/Testground/data/imdb/labeledTrainData.tsv', sep='\t') 47 | print(data_train.shape) 48 | 49 | from nltk import tokenize 50 | 51 | reviews = [] 52 | labels = [] 53 | texts = [] 54 | 55 | for idx in range(data_train.review.shape[0]): 56 | text = BeautifulSoup(data_train.review[idx]) 57 | text = clean_str(text.get_text().encode('ascii','ignore')) 58 | texts.append(text) 59 | sentences = tokenize.sent_tokenize(text) 60 | reviews.append(sentences) 61 | 62 | labels.append(data_train.sentiment[idx]) 63 | 64 | tokenizer = Tokenizer(nb_words=MAX_NB_WORDS) 65 | tokenizer.fit_on_texts(texts) 66 | 67 | data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32') 68 | 69 | for i, sentences in enumerate(reviews): 70 | for j, sent in enumerate(sentences): 71 | if j< MAX_SENTS: 72 | wordTokens = text_to_word_sequence(sent) 73 | k=0 74 | for _, word in enumerate(wordTokens): 75 | if k`__ 6 | 7 | .. image:: /docs/pic/github-logo.png 8 | :target: https://github.com/kk7nc/RMDL 9 | 10 | Random Multimodel Deep Learning (RMDL): 11 | ======================================= 12 | 13 | A new ensemble, deep learning approach for classification. Deep 14 | learning models have achieved state-of-the-art results across many domains. 15 | RMDL solves the problem of finding the best deep learning structure 16 | and architecture while simultaneously improving robustness and accuracy 17 | through ensembles of deep learning architectures. RMDL can accept 18 | as input a variety of data, including text, video, images, and symbolic data. 19 | 20 | 21 | |RMDL| 22 | 23 | Random Multimodel Deep Learning (RMDL) architecture for classification. 24 | RMDL includes 3 Random models: one DNN classifier on the left, one Deep CNN 25 | classifier in the middle, and one Deep RNN classifier on the right (each unit could be an LSTM or GRU). 26 | 27 | 28 | Installation 29 | ============= 30 | 31 | RMDL can be installed via pip or git: 32 | 33 | Using pip 34 | ---------- 35 | 36 | .. code:: bash 37 | 38 | pip install RMDL 39 | 40 | Using git 41 | --------- 42 | .. code:: bash 43 | 44 | git clone --recursive https://github.com/kk7nc/RMDL.git 45 | 46 | The primary requirements for this package are Python 3 with TensorFlow. The requirements.txt file 47 | contains a listing of the required Python packages; to install all requirements, run the following: 48 | 49 | .. code:: bash 50 | 51 | pip install -r requirements.txt 52 | 53 | Or 54 | 55 | .. code:: bash 56 | 57 | pip3 install -r requirements.txt 58 | 59 | Or: 60 | 61 | .. code:: bash
62 | 63 | conda install --file requirements.txt 64 | 65 | Documentation: 66 | ============== 67 | 68 | The exponential growth in the number of complex datasets every year requires more enhancement in 69 | machine learning methods to provide robust and accurate data classification. Lately, deep learning 70 | approaches have achieved surpassing results in comparison to previous machine learning algorithms 71 | on tasks such as image classification, natural language processing, face recognition, etc. The 72 | success of these deep learning algorithms relies on their capacity to model complex and non-linear 73 | relationships within data. However, finding a suitable structure for these models has been a challenge 74 | for researchers. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning 75 | approach for classification. RMDL solves the problem of finding the best deep learning structure and 76 | architecture while simultaneously improving robustness and accuracy through ensembles of deep 77 | learning architectures. In short, RMDL trains multiple models of Deep Neural Network (DNN), 78 | Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) in parallel and combines 79 | their results to produce a better result than any of those models individually. To create these models, 80 | each deep learning model has been constructed in a random fashion regarding the number of layers and 81 | nodes in their neural network structure. The resulting RMDL model can be used for various domains such 82 | as text, video, images, and symbolic data. In this project, we describe the RMDL model in depth and show the results 83 | for image and text classification as well as face recognition. For image classification, we compared our 84 | model with some of the available baselines using MNIST and CIFAR-10 datasets. Similarly, we used four 85 | datasets namely, WOS, Reuters, IMDB, and 20newsgroup and compared our results with available baselines. 86 | Web of Science (WOS) has been collected by authors and consists of three sets (small, medium, and large). 87 | Lastly, we used the ORL dataset to compare the performance of our approach with other face recognition methods. 88 | These test results show that the RMDL model consistently outperforms standard methods over a broad range of 89 | data types and classification problems. 90 | 91 | Datasets for RMDL: 92 | ================== 93 | 94 | Text Datasets: 95 | -------------- 96 | 97 | - `IMDB Dataset `__ 98 | 99 | * This dataset contains 50,000 documents with 2 categories. 100 | 101 | - `Reuters-21578 Dataset `__ 102 | 103 | * This dataset contains 21,578 documents with 90 categories. 104 | 105 | - `20Newsgroups Dataset `__ 106 | 107 | * This dataset contains 20,000 documents with 20 categories. 108 | 109 | - Web of Science Dataset (DOI: 110 | `10.17632/9rw3vkcfy4.2 `__) 111 | 112 | - Web of Science Dataset 113 | `WOS-11967 `__ 114 | 115 | - This dataset contains 11,967 documents with 35 categories which 116 | include 7 parent categories. 117 | 118 | - Web of Science Dataset 119 | `WOS-46985 `__ 120 | 121 | - This dataset contains 46,985 documents with 134 categories 122 | which include 7 parent categories. 123 | 124 | - Web of Science Dataset 125 | `WOS-5736 `__ 126 | 127 | - This dataset contains 5,736 documents with 11 categories which 128 | include 3 parent categories. 
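Any of the text datasets above can be passed to RMDL's text classifier, whose full parameter list is documented under Parameters below. As a minimal sketch on 20 Newsgroups — assuming the RMDL package is installed, and with the epoch counts cut far below the [500, 500, 500] default purely to keep the illustration cheap:

.. code:: python

    from sklearn.datasets import fetch_20newsgroups
    from RMDL import RMDL_Text as RMDL

    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')

    # ensemble of 3 DNNs, 3 RNNs, and 3 CNNs; epochs reduced for illustration
    RMDL.Text_Classification(newsgroups_train.data, newsgroups_train.target,
                             newsgroups_test.data, newsgroups_test.target,
                             batch_size=128,
                             random_deep=[3, 3, 3],
                             epochs=[20, 5, 10])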
129 | 130 | Image datasets: 131 | --------------- 132 | 133 | - `MNIST Dataset `__ 134 | 135 | - The MNIST database contains 60,000 training images and 10,000 136 | testing images. 137 | 138 | - `CIFAR-10 Dataset `__ 139 | 140 | - The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 141 | classes, with 6000 images per class. There are 50000 training 142 | images and 10000 test images. 143 | 144 | Face Recognition 145 | ---------------- 146 | 147 | `The Database of Faces (The Olivetti Faces 148 | Dataset) `__ 149 | 150 | - The Database of Faces dataset consists of 400 92x112 grayscale 151 | images of 40 subjects 152 | 153 | Requirements for RMDL: 154 | ======================= 155 | 156 | General: 157 | ---------- 158 | 159 | - Python 3.5 or later see `Instruction 160 | Documents `__ 161 | 162 | - TensorFlow see `Instruction 163 | Documents `__. 164 | 165 | - scikit-learn see `Instruction 166 | Documents `__ 167 | 168 | - Keras see `Instruction Documents `__ 169 | 170 | - scipy see `Instruction 171 | Documents `__ 172 | 173 | 174 | GPU (if you want to run on GPU): 175 | -------------------------------- 176 | 177 | - CUDA® Toolkit 8.0. For details, see `NVIDIA's 178 | documentation `__. 179 | 180 | - The `NVIDIA drivers associated with CUDA Toolkit 181 | 8.0 `__. 182 | 183 | - cuDNN v6. For details, see `NVIDIA's 184 | documentation `__. 185 | 186 | - GPU card with CUDA Compute Capability 3.0 or higher. 187 | 188 | - The libcupti-dev library. 189 | 190 | Text and Document Classification 191 | ================================= 192 | 193 | - Download GloVe: Global Vectors for Word Representation `Instruction 194 | Documents `__ 195 | 196 | - Set data directory into 197 | `Global.py `__ 198 | 199 | - If you do not set the GloVe directory, GloVe will be downloaded 200 | 201 | Parameters: 202 | =========== 203 | 204 | Text_Classification 205 | ------------------- 206 | 207 | .. code:: python 208 | 209 | from RMDL import RMDL_Text 210 | 211 | .. code:: python 212 | 213 | Text_Classification(x_train, y_train, x_test, y_test, batch_size=128, 214 | EMBEDDING_DIM=50,MAX_SEQUENCE_LENGTH = 500, MAX_NB_WORDS = 75000, 215 | GloVe_dir="", GloVe_file = "glove.6B.50d.txt", 216 | sparse_categorical=True, random_deep=[3, 3, 3], epochs=[500, 500, 500], plot=True, 217 | min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=1024, 218 | min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128, 219 | min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512, 220 | random_state=42, random_optimizor=True, dropout=0.05) 221 | 222 | 223 | Input 224 | ~~~~~ 225 | 226 | - x_train 227 | - y_train 228 | - x_test 229 | - y_test 230 | 231 | batch_size 232 | ~~~~~~~~~~ 233 | 234 | - batch_size: Integer. Number of samples per gradient update. If unspecified, it will default to 128. 235 | 236 | EMBEDDING_DIM 237 | ~~~~~~~~~~~~~~ 238 | 239 | - EMBEDDING_DIM: Integer. Dimension of the word embedding (this number must match the GloVe or other pre-trained embedding that is used); it will default to 50, which is the dimension of the glove.6B.50d.txt file. 240 | 241 | 242 | MAX_SEQUENCE_LENGTH 243 | ~~~~~~~~~~~~~~~~~~~ 244 | 245 | - MAX_SEQUENCE_LENGTH: Integer. Maximum length of a sequence or document in the datasets, it will default to 500. 246 | 247 | 248 | MAX_NB_WORDS 249 | ~~~~~~~~~~~~~~~~~~~~~~~ 250 | 251 | - MAX_NB_WORDS: Integer. Maximum number of unique words in the datasets, it will default to 75000. 
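Because EMBEDDING_DIM must match the word-vector file that is loaded (see GloVe_file below), the two are usually changed together. A hedged sketch of pairing 100-dimensional vectors with the call — the "Glove" directory name here is only an assumption about where you extracted GloVe:

.. code:: python

    # sketch: glove.6B.100d.txt holds 100-d vectors, so EMBEDDING_DIM must be 100;
    # the GloVe_dir value is hypothetical and depends on your extraction location
    RMDL_Text.Text_Classification(x_train, y_train, x_test, y_test,
                                  EMBEDDING_DIM=100,
                                  GloVe_dir="Glove",
                                  GloVe_file="glove.6B.100d.txt")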
Text and Document Classification
================================

- Download GloVe (Global Vectors for Word Representation); see `Instruction
  Documents `__

- Set the data directory in
  `Global.py `__

- If you do not set a GloVe directory, GloVe will be downloaded automatically.
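The automatic download can also be triggered ahead of time. A minimal sketch, assuming the
packaged download helpers expose a GloVe downloader analogous to the ``Download_WOS`` helper used
in the examples below (check ``RMDL/Download`` in the package for the exact module name):

.. code:: python

    # Sketch: fetch GloVe once, before the first training run.
    from RMDL.Download import Download_Glove as GloVe

    GloVe_DIR = GloVe.download_and_extract()
    print("GloVe extracted to:", GloVe_DIR)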
Parameters:
===========

Text_Classification
-------------------

.. code:: python

    from RMDL import RMDL_Text

.. code:: python

    Text_Classification(x_train, y_train, x_test, y_test, batch_size=128,
                        EMBEDDING_DIM=50, MAX_SEQUENCE_LENGTH=500, MAX_NB_WORDS=75000,
                        GloVe_dir="", GloVe_file="glove.6B.50d.txt",
                        sparse_categorical=True, random_deep=[3, 3, 3], epochs=[500, 500, 500], plot=True,
                        min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=1024,
                        min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128,
                        min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512,
                        random_state=42, random_optimizor=True, dropout=0.05)

Input
~~~~~

- x_train
- y_train
- x_test
- y_test

batch_size
~~~~~~~~~~

- batch_size: Integer. Number of samples per gradient update. If unspecified, it defaults to 128.

EMBEDDING_DIM
~~~~~~~~~~~~~

- EMBEDDING_DIM: Integer. Dimension of the word embeddings (this number must match the GloVe or other pre-trained embedding that is used); defaults to 50, which matches the glove.6B.50d.txt file.

MAX_SEQUENCE_LENGTH
~~~~~~~~~~~~~~~~~~~

- MAX_SEQUENCE_LENGTH: Integer. Maximum length of a sequence (document) in the datasets; defaults to 500.

MAX_NB_WORDS
~~~~~~~~~~~~

- MAX_NB_WORDS: Integer. Maximum number of unique words kept from the datasets; defaults to 75000.

GloVe_dir
~~~~~~~~~

- GloVe_dir: String. Path of the GloVe (or other pre-trained embedding) directory; defaults to empty, in which case glove.6B.zip will be downloaded.

GloVe_file
~~~~~~~~~~

- GloVe_file: String. Which version of GloVe (or other pre-trained word embedding) will be used; defaults to glove.6B.50d.txt.

- NOTE: if you use another version of GloVe, EMBEDDING_DIM must match its dimension.

sparse_categorical
~~~~~~~~~~~~~~~~~~

- sparse_categorical: bool. Should be True when the target of the dataset has shape (n, 1); defaults to True.

random_deep
~~~~~~~~~~~

- random_deep: list of 3 Integers. Number of ensembled models used in RMDL: random_deep[0] is the number of DNNs, random_deep[1] the number of RNNs, random_deep[2] the number of CNNs; defaults to [3, 3, 3].

epochs
~~~~~~

- epochs: list of 3 Integers. Number of epochs for each ensembled model used in RMDL: epochs[0] is the number of epochs used for the DNNs, epochs[1] for the RNNs, epochs[2] for the CNNs; defaults to [500, 500, 500].

plot
~~~~

- plot: bool. If True, shows the confusion matrix and the accuracy and loss curves.

min_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_dnn: Integer. Lower bound on the number of hidden layers of the DNNs used in RMDL; defaults to 1.

max_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_dnn: Integer. Upper bound on the number of hidden layers of the DNNs used in RMDL; defaults to 8.

min_nodes_dnn
~~~~~~~~~~~~~

- min_nodes_dnn: Integer. Lower bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 128.

max_nodes_dnn
~~~~~~~~~~~~~

- max_nodes_dnn: Integer. Upper bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 1024.

min_hidden_layer_rnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_rnn: Integer. Lower bound on the number of hidden layers of the RNNs used in RMDL; defaults to 1.

max_hidden_layer_rnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_rnn: Integer. Upper bound on the number of hidden layers of the RNNs used in RMDL; defaults to 5.

min_nodes_rnn
~~~~~~~~~~~~~

- min_nodes_rnn: Integer. Lower bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 32.

max_nodes_rnn
~~~~~~~~~~~~~

- max_nodes_rnn: Integer. Upper bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 128.

min_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_cnn: Integer. Lower bound on the number of hidden layers of the CNNs used in RMDL; defaults to 3.

max_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_cnn: Integer. Upper bound on the number of hidden layers of the CNNs used in RMDL; defaults to 10.

min_nodes_cnn
~~~~~~~~~~~~~

- min_nodes_cnn: Integer. Lower bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 128.

max_nodes_cnn
~~~~~~~~~~~~~

- max_nodes_cnn: Integer. Upper bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 512.

random_state
~~~~~~~~~~~~

- random_state: Integer, RandomState instance or None, optional (default=None)

  * If Integer, random_state is the seed used by the random number generator.

random_optimizor
~~~~~~~~~~~~~~~~

- random_optimizor: bool. If False, all models use the adam optimizer; if True, each model uses a randomly chosen optimizer. Defaults to True.

dropout
~~~~~~~

- dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
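Putting the parameters together, a minimal call looks like the sketch below; x_train/x_test are
assumed to be lists of raw document strings and y_train/y_test integer labels, and the small
ensemble and epoch counts are only placeholders for a quick trial run:

.. code:: python

    from RMDL import RMDL_Text

    # Sketch: a small ensemble (1 DNN, 1 RNN, 1 CNN), a few epochs each.
    RMDL_Text.Text_Classification(x_train, y_train, x_test, y_test,
                                  batch_size=128,
                                  random_deep=[1, 1, 1],
                                  epochs=[10, 10, 10])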
Image_Classification
--------------------

.. code:: python

    from RMDL import RMDL_Image

.. code:: python

    Image_Classification(x_train, y_train, x_test, y_test, shape, batch_size=128,
                         sparse_categorical=True, random_deep=[3, 3, 3], epochs=[500, 500, 500], plot=True,
                         min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=1024,
                         min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128,
                         min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512,
                         random_state=42, random_optimizor=True, dropout=0.05)

Input
~~~~~

- x_train
- y_train
- x_test
- y_test

shape
~~~~~

- shape: tuple (np.shape). Shape of a single input image, e.g. (28, 28, 1) for MNIST.

batch_size
~~~~~~~~~~

- batch_size: Integer. Number of samples per gradient update. If unspecified, it defaults to 128.

sparse_categorical
~~~~~~~~~~~~~~~~~~

- sparse_categorical: bool. Should be True when the target of the dataset has shape (n, 1); defaults to True.

random_deep
~~~~~~~~~~~

- random_deep: list of 3 Integers. Number of ensembled models used in RMDL: random_deep[0] is the number of DNNs, random_deep[1] the number of RNNs, random_deep[2] the number of CNNs; defaults to [3, 3, 3].

epochs
~~~~~~

- epochs: list of 3 Integers. Number of epochs for each ensembled model used in RMDL: epochs[0] is the number of epochs used for the DNNs, epochs[1] for the RNNs, epochs[2] for the CNNs; defaults to [500, 500, 500].

plot
~~~~

- plot: bool. If True, shows the confusion matrix and the accuracy and loss curves.

min_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_dnn: Integer. Lower bound on the number of hidden layers of the DNNs used in RMDL; defaults to 1.

max_hidden_layer_dnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_dnn: Integer. Upper bound on the number of hidden layers of the DNNs used in RMDL; defaults to 8.

min_nodes_dnn
~~~~~~~~~~~~~

- min_nodes_dnn: Integer. Lower bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 128.

max_nodes_dnn
~~~~~~~~~~~~~

- max_nodes_dnn: Integer. Upper bound on the number of nodes in each layer of the DNNs used in RMDL; defaults to 1024.

min_nodes_rnn
~~~~~~~~~~~~~

- min_nodes_rnn: Integer. Lower bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 32.

max_nodes_rnn
~~~~~~~~~~~~~

- max_nodes_rnn: Integer. Upper bound on the number of nodes (LSTM or GRU units) in each layer of the RNNs used in RMDL; defaults to 128.

min_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- min_hidden_layer_cnn: Integer. Lower bound on the number of hidden layers of the CNNs used in RMDL; defaults to 3.

max_hidden_layer_cnn
~~~~~~~~~~~~~~~~~~~~

- max_hidden_layer_cnn: Integer. Upper bound on the number of hidden layers of the CNNs used in RMDL; defaults to 10.

min_nodes_cnn
~~~~~~~~~~~~~

- min_nodes_cnn: Integer. Lower bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 128.

max_nodes_cnn
~~~~~~~~~~~~~

- max_nodes_cnn: Integer. Upper bound on the number of filters in each 2D convolution layer of the CNNs used in RMDL; defaults to 512.

random_state
~~~~~~~~~~~~

- random_state: Integer, RandomState instance or None, optional (default=None)

  * If Integer, random_state is the seed used by the random number generator.

random_optimizor
~~~~~~~~~~~~~~~~

- random_optimizor: bool. If False, all models use the adam optimizer; if True, each model uses a randomly chosen optimizer. Defaults to True.

dropout
~~~~~~~

- dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
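The only image-specific preparation is a channels-last layout: inputs must be reshaped to
(n_samples, height, width, channels), with the per-image shape passed separately. A sketch for
hypothetical 32x32 grayscale numpy arrays x_train/x_test (the MNIST example below does the same
for 28x28):

.. code:: python

    # Sketch: bring image arrays into the channels-last layout RMDL expects.
    x_train = x_train.reshape(x_train.shape[0], 32, 32, 1).astype('float32') / 255.0
    x_test = x_test.reshape(x_test.shape[0], 32, 32, 1).astype('float32') / 255.0
    shape = (32, 32, 1)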
Example
=======

MNIST
-----

- The MNIST database contains 60,000 training images and 10,000 testing images.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    from keras.datasets import mnist
    import numpy as np
    from RMDL import RMDL_Image as RMDL


Load Data
~~~~~~~~~

.. code:: python

    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    X_train_D = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32')
    X_test_D = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')
    X_train = X_train_D / 255.0
    X_test = X_test_D / 255.0
    number_of_classes = np.unique(y_train).shape[0]
    shape = (28, 28, 1)

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 128
    n_epochs = [100, 100, 100]  ## DNN-RNN-CNN
    Random_Deep = [3, 3, 3]  ## DNN-RNN-CNN

    RMDL.Image_Classification(X_train, y_train, X_test, y_test, shape,
                              batch_size=batch_size,
                              sparse_categorical=True,
                              random_deep=Random_Deep,
                              epochs=n_epochs)

IMDB
----

- This dataset contains 50,000 documents with 2 categories.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    import sys
    import os
    from RMDL import text_feature_extraction as txt
    from keras.datasets import imdb
    import numpy as np
    from RMDL import RMDL_Text as RMDL

Load Data
~~~~~~~~~

.. code:: python

    print("Load IMDB dataset....")
    MAX_NB_WORDS = 75000  # keep only the most frequent words (RMDL default)
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_NB_WORDS)
    print(len(X_train))
    print(y_test)
    word_index = imdb.get_word_index()
    index_word = {v: k for k, v in word_index.items()}
    # Decode the integer sequences back into cleaned text documents.
    X_train = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_train]
    X_test = [txt.text_cleaner(' '.join(index_word.get(w) for w in x)) for x in X_test]
    X_train = np.array(X_train).ravel()
    print(X_train.shape)
    X_test = np.array(X_test).ravel()

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    sparse_categorical = 0
    n_epochs = [100, 100, 100]  ## DNN-RNN-CNN
    Random_Deep = [3, 3, 3]  ## DNN-RNN-CNN

    RMDL.Text_Classification(X_train, y_train, X_test, y_test,
                             batch_size=batch_size,
                             sparse_categorical=sparse_categorical,
                             random_deep=Random_Deep,
                             epochs=n_epochs)
Web Of Science
--------------

- Link to dataset: |Data|

  - Web of Science Dataset
    `WOS-11967 `__

    - This dataset contains 11,967 documents with 35 categories, which
      include 7 parent categories.

  - Web of Science Dataset
    `WOS-46985 `__

    - This dataset contains 46,985 documents with 134 categories,
      which include 7 parent categories.

  - Web of Science Dataset
    `WOS-5736 `__

    - This dataset contains 5,736 documents with 11 categories, which
      include 3 parent categories.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    import os
    from RMDL import text_feature_extraction as txt
    from sklearn.model_selection import train_test_split
    from RMDL.Download import Download_WOS as WOS
    import numpy as np
    from RMDL import RMDL_Text as RMDL

Load Data
~~~~~~~~~

.. code:: python

    path_WOS = WOS.download_and_extract()
    fname = os.path.join(path_WOS, "WebOfScience/WOS11967/X.txt")
    fnamek = os.path.join(path_WOS, "WebOfScience/WOS11967/Y.txt")
    with open(fname, encoding="utf-8") as f:
        content = f.readlines()
        content = [txt.text_cleaner(x) for x in content]
    with open(fnamek) as fk:
        contentk = fk.readlines()
    contentk = [x.strip() for x in contentk]
    Label = np.matrix(contentk, dtype=int)
    Label = np.transpose(Label)
    np.random.seed(7)
    print(Label.shape)
    X_train, X_test, y_train, y_test = train_test_split(content, Label, test_size=0.2, random_state=4)

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    n_epochs = [5000, 500, 500]  ## DNN-RNN-CNN
    Random_Deep = [3, 3, 3]  ## DNN-RNN-CNN

    RMDL.Text_Classification(X_train, y_train, X_test, y_test,
                             batch_size=batch_size,
                             sparse_categorical=True,
                             random_deep=Random_Deep,
                             epochs=n_epochs, no_of_classes=12)

Reuters-21578
-------------

- This dataset contains 21,578 documents with 90 categories.

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    import sys
    import os
    import nltk
    nltk.download("reuters")
    from nltk.corpus import reuters
    from sklearn.preprocessing import MultiLabelBinarizer
    import numpy as np
    from RMDL import RMDL_Text as RMDL

Load Data
~~~~~~~~~

.. code:: python

    documents = reuters.fileids()

    train_docs_id = list(filter(lambda doc: doc.startswith("train"),
                                documents))
    test_docs_id = list(filter(lambda doc: doc.startswith("test"),
                               documents))
    X_train = [reuters.raw(doc_id) for doc_id in train_docs_id]
    X_test = [reuters.raw(doc_id) for doc_id in test_docs_id]
    mlb = MultiLabelBinarizer()
    y_train = mlb.fit_transform([reuters.categories(doc_id)
                                 for doc_id in train_docs_id])
    y_test = mlb.transform([reuters.categories(doc_id)
                            for doc_id in test_docs_id])
    # Reduce the multi-label targets to a single class id per document.
    y_train = np.argmax(y_train, axis=1)
    y_test = np.argmax(y_test, axis=1)


Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    n_epochs = [20, 500, 50]  ## DNN-RNN-CNN
    Random_Deep = [3, 0, 0]  ## DNN-RNN-CNN

    RMDL.Text_Classification(X_train, y_train, X_test, y_test,
                             batch_size=batch_size,
                             sparse_categorical=True,
                             random_deep=Random_Deep,
                             epochs=n_epochs)
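Note that Reuters-21578 is natively multi-label; the ``np.argmax`` step above keeps only the
first active category of each document so the task becomes single-label. A toy illustration of
that conversion (not part of the original example, made-up category sets for demonstration):

.. code:: python

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer

    # Two toy documents: one with two categories, one with a single category.
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform([{'grain', 'wheat'}, {'trade'}])
    print(mlb.classes_)          # ['grain' 'trade' 'wheat']
    print(y)                     # [[1 0 1], [0 1 0]]
    print(np.argmax(y, axis=1))  # [0 1] -> first active column wins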
Olivetti Faces
--------------

- There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

Import Packages
~~~~~~~~~~~~~~~

.. code:: python

    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.model_selection import train_test_split
    from RMDL import RMDL_Image as RMDL

Load Data
~~~~~~~~~

.. code:: python

    number_of_classes = 40
    shape = (64, 64, 1)
    data = fetch_olivetti_faces()
    X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                        data.target, stratify=data.target, test_size=40)
    X_train = X_train.reshape(X_train.shape[0], 64, 64, 1).astype('float32')
    X_test = X_test.reshape(X_test.shape[0], 64, 64, 1).astype('float32')

Using RMDL
~~~~~~~~~~

.. code:: python

    batch_size = 100
    n_epochs = [500, 500, 50]  ## DNN-RNN-CNN
    Random_Deep = [0, 0, 1]  ## DNN-RNN-CNN

    RMDL.Image_Classification(X_train, y_train, X_test, y_test,
                              shape,
                              random_optimizor=False,
                              batch_size=batch_size,
                              random_deep=Random_Deep,
                              epochs=n_epochs)



More examples:
`link `__

|Results|


Error and Comments:
-------------------

Send an email to kk7nc@virginia.edu

Citations
---------

.. code::

    @inproceedings{Kowsari2018RMDL,
      title={RMDL: Random Multimodel Deep Learning for Classification},
      author={Kowsari, Kamran and Heidarysafa, Mojtaba and Brown, Donald E. and Jafari Meimandi, Kiana and Barnes, Laura E.},
      booktitle={Proceedings of the 2018 International Conference on Information System and Data Mining},
      year={2018},
      DOI={https://doi.org/10.1145/3206098.3206111},
      organization={ACM}
    }

.. |werckerstatus| image:: https://app.wercker.com/status/3a564158e809971e2f7416beba5d05af/s/master
   :target: https://app.wercker.com/project/byKey/3a564158e809971e2f7416beba5d05af
.. |BuildStatus| image:: https://travis-ci.org/kk7nc/RMDL.svg?branch=master
   :target: https://travis-ci.org/kk7nc/RMDL
.. |PowerPoint| image:: https://img.shields.io/badge/Presentation-download-red.svg?style=flat
   :target: https://github.com/kk7nc/RMDL/blob/master/docs/RMDL.pdf
.. |researchgate| image:: https://img.shields.io/badge/ResearchGate-RMDL-blue.svg?style=flat
   :target: https://www.researchgate.net/publication/324922651_RMDL_Random_Multimodel_Deep_Learning_for_Classification
.. |Binder| image:: https://mybinder.org/badge.svg
   :target: https://mybinder.org/v2/gh/kk7nc/RMDL/master
.. |pdf| image:: https://img.shields.io/badge/pdf-download-red.svg?style=flat
   :target: https://github.com/kk7nc/RMDL/blob/master/docs/ACM-RMDL.pdf
.. |GitHublicense| image:: https://img.shields.io/badge/licence-GPL-blue.svg
   :target: ./LICENSE
.. |RDL| image:: http://kowsari.net/onewebmedia/RDL.jpg
.. |RMDL| image:: http://kowsari.net/onewebmedia/RMDL.jpg
.. |Results| image:: http://kowsari.net/onewebmedia/RMDL_Results.png
.. |Data| image:: https://img.shields.io/badge/DOI-10.17632/9rw3vkcfy4.6-blue.svg?style=flat
   :target: http://dx.doi.org/10.17632/9rw3vkcfy4.6
.. |Pypi| image:: https://img.shields.io/badge/Pypi-RMDL%201.0.5-blue.svg
   :target: https://pypi.org/project/RMDL/
.. |DOI| image:: https://img.shields.io/badge/DOI-10.1145/3206098.3206111-blue.svg?style=flat
   :target: https://doi.org/10.1145/3206098.3206111
.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/github/kk7nc/RMDL?branch=master&svg=true
   :target: https://ci.appveyor.com/project/kk7nc/RMDL
.. |arxiv| image:: https://img.shields.io/badge/arXiv-1805.01890-red.svg
   :target: https://arxiv.org/abs/1805.01890
.. |twitter| image:: https://img.shields.io/twitter/url/http/shields.io.svg?style=social
   :target: https://twitter.com/intent/tweet?text=RMDL:%20Random%20Multimodel%20Deep%20Learning%20for%20Classification%0aGitHub:&url=https://github.com/kk7nc/RMDL&hashtags=DeepLearning,classification,MachineLearning,deep_neural_networks,Image_Classification,Text_Classification,EnsembleLearning
.. |Join the chat at https://gitter.im/RMDL-Random-Multimodel-Deep-Learning| image:: https://badges.gitter.im/Join%20Chat.svg
   :target: https://gitter.im/RMDL-Random-Multimodel-Deep-Learning/Lobby?source=orgpage
--------------------------------------------------------------------------------
/code/RNN.py:
--------------------------------------------------------------------------------
from keras.layers import Dropout, Dense, GRU, Embedding
from keras.models import Sequential
import numpy as np
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups


def loadData_Tokenizer(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    # Load the pre-trained GloVe vectors; update this path to your own copy.
    embeddings_index = {}
    f = open("C:\\Users\\kamran\\Documents\\GitHub\\RMDL\\Examples\\Glove\\glove.6B.50d.txt", encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            continue  # skip malformed lines instead of reusing a stale vector
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)


def Build_Model_RNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):
    """
    word_index is the word index,
    embeddings_index is the embeddings index (see loadData_Tokenizer),
    nclasses is the number of classes,
    MAX_SEQUENCE_LENGTH is the maximum length of the text sequences.
    """

    model = Sequential()
    hidden_layer = 3
    gru_node = 256

    # Words not found in the embedding index keep random initial vectors.
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)), "- please make sure your"
                      " EMBEDDING_DIM is equal to the dimension of the embedding (GloVe) file")
                exit(1)
            embedding_matrix[i] = embedding_vector
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embedding_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))

    # Stack GRU layers; all but the last return full sequences.
    for i in range(0, hidden_layer):
        model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
    model.add(GRU(gru_node, recurrent_dropout=0.2))
    model.add(Dense(nclasses, activation='softmax'))

    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test)

model_RNN = Build_Model_RNN_Text(word_index, embeddings_index, 20)

model_RNN.summary()

model_RNN.fit(X_train_Glove, y_train,
              validation_data=(X_test_Glove, y_test),
              epochs=20,
              batch_size=128,
              verbose=2)

predicted = model_RNN.predict_classes(X_test_Glove)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/code/Random_Forest.py:
--------------------------------------------------------------------------------
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Bag-of-words counts -> tf-idf weighting -> random forest classifier.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=100)),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/code/Rocchio_classification.py:
--------------------------------------------------------------------------------
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# NearestCentroid over tf-idf vectors implements Rocchio classification.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', NearestCentroid()),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/code/SVM.py:
--------------------------------------------------------------------------------
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC()),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))
--------------------------------------------------------------------------------
/docs/Text_Classification.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/Text_Classification.pdf
--------------------------------------------------------------------------------
/docs/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-slate
--------------------------------------------------------------------------------
/docs/eq/tf-idf.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/eq/tf-idf.gif
--------------------------------------------------------------------------------
/docs/eq/tfidf.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/eq/tfidf.gif
--------------------------------------------------------------------------------
/docs/pic/Autoencoder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Autoencoder.png
--------------------------------------------------------------------------------
/docs/pic/BPW.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/BPW.png
--------------------------------------------------------------------------------
/docs/pic/Bagging.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Bagging.PNG
--------------------------------------------------------------------------------
/docs/pic/Boosting.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Boosting.PNG -------------------------------------------------------------------------------- /docs/pic/CBOW.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/CBOW.png -------------------------------------------------------------------------------- /docs/pic/CNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/CNN.png -------------------------------------------------------------------------------- /docs/pic/CRF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/CRF.png -------------------------------------------------------------------------------- /docs/pic/DNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/DNN.png -------------------------------------------------------------------------------- /docs/pic/F1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/F1.png -------------------------------------------------------------------------------- /docs/pic/GitHub-Mark-32px.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/GitHub-Mark-32px.png -------------------------------------------------------------------------------- /docs/pic/GitHub-Mark-64px.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/GitHub-Mark-64px.png -------------------------------------------------------------------------------- /docs/pic/Glove.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Glove.PNG -------------------------------------------------------------------------------- /docs/pic/Glove_VS_DCWE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Glove_VS_DCWE.png -------------------------------------------------------------------------------- /docs/pic/HAN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/HAN.png -------------------------------------------------------------------------------- /docs/pic/HDLTex.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/HDLTex.png -------------------------------------------------------------------------------- /docs/pic/KNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/KNN.png -------------------------------------------------------------------------------- /docs/pic/LSTM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/LSTM.png -------------------------------------------------------------------------------- /docs/pic/OverviewTextClassification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/OverviewTextClassification.png -------------------------------------------------------------------------------- /docs/pic/RDL.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RDL.jpg -------------------------------------------------------------------------------- /docs/pic/RDL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RDL.png -------------------------------------------------------------------------------- /docs/pic/RF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RF.png -------------------------------------------------------------------------------- /docs/pic/RMDL.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL.jpg -------------------------------------------------------------------------------- /docs/pic/RMDL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL.png -------------------------------------------------------------------------------- /docs/pic/RMDL_Results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL_Results.png -------------------------------------------------------------------------------- /docs/pic/RMDL_Results_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RMDL_Results_small.png -------------------------------------------------------------------------------- /docs/pic/RNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/RNN.png 
-------------------------------------------------------------------------------- /docs/pic/Random Projection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Random Projection.png -------------------------------------------------------------------------------- /docs/pic/SVM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/SVM.png -------------------------------------------------------------------------------- /docs/pic/TSNE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/TSNE.png -------------------------------------------------------------------------------- /docs/pic/Word Art.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/Word Art.png -------------------------------------------------------------------------------- /docs/pic/WordArt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/WordArt.png -------------------------------------------------------------------------------- /docs/pic/fasttext-logo-color-web.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/fasttext-logo-color-web.png -------------------------------------------------------------------------------- /docs/pic/github-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/github-logo.png -------------------------------------------------------------------------------- /docs/pic/line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/line.png -------------------------------------------------------------------------------- /docs/pic/ngram_cnn_highway_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/ngram_cnn_highway_1.png -------------------------------------------------------------------------------- /docs/pic/sphx_glr_plot_roc_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kk7nc/Text_Classification/4d72fc8854cd7ab5604fdf6145d18cde22736758/docs/pic/sphx_glr_plot_roc_001.png --------------------------------------------------------------------------------