├── .gitignore
├── README.md
├── dataset
    └── ag_news
    │   ├── test.csv
    │   └── train.csv
├── document_classifier.ipynb
├── images
    ├── cnn.png
    ├── nb.png
    ├── rnn.png
    ├── shallow_nn.png
    └── tfidf.png
└── presentation.ppt


/.gitignore:
--------------------------------------------------------------------------------
.ipynb_checkpoints
word_embeddings/*
save/*

### Linux ###
*~

# KDE directory preferences
.directory

# Linux trash folder which might appear on any partition or disk
.Trash-*

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

### macOS ###
*.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

### PyCharm ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

.idea/

# CMake
cmake-build-debug/

## File-based project format:
*.iws

## Plugin-specific files:

# IntelliJ
/out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Sonarlint plugin
.idea/sonarlint

### Python ###
# Byte-compiled / optimized / DLL files
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo

# Django stuff:

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# End of https://www.gitignore.io/api/macos,linux,django,python,pycharm

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Document Classification using NLP, Machine Learning
## Objective
Performed document classification
into four defined categories (World, Sports, Business, Sci/Tech). Trained classifiers with models ranging from Naïve Bayes to a Convolutional Neural Network (CNN) and an RCNN, and compared their accuracy. Combined several feature engineering techniques with Natural Language Processing (NLP) features to build an accurate text classifier.


## Tech Stack
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, scikit-learn, NLTK, Keras (TensorFlow backend)
- Models: Naive Bayes, Logistic Regression, Random Forest, XGBoost, Shallow Neural Network, Convolutional Neural Network, RCNN

## Implementation

### Open the document_classifier.ipynb Jupyter notebook for the implementation details

### The trained model can be downloaded from the link below.
https://drive.google.com/drive/folders/10Ivt175DEkILxwHsF2Ltti8IZpVLtOyo?usp=sharing

### The Jupyter notebook also demonstrates loading and using the model for real-time predictions

--------------------------------------------------------------------------------
/images/cnn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/saurabh1907/document-classification-ml-nlp/f59d4e5cca4fde9df5573cb2421b48655af0cfb5/images/cnn.png

--------------------------------------------------------------------------------
/images/nb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/saurabh1907/document-classification-ml-nlp/f59d4e5cca4fde9df5573cb2421b48655af0cfb5/images/nb.png

--------------------------------------------------------------------------------
/images/rnn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/saurabh1907/document-classification-ml-nlp/f59d4e5cca4fde9df5573cb2421b48655af0cfb5/images/rnn.png
--------------------------------------------------------------------------------
/images/shallow_nn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/saurabh1907/document-classification-ml-nlp/f59d4e5cca4fde9df5573cb2421b48655af0cfb5/images/shallow_nn.png

--------------------------------------------------------------------------------
/images/tfidf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/saurabh1907/document-classification-ml-nlp/f59d4e5cca4fde9df5573cb2421b48655af0cfb5/images/tfidf.png

--------------------------------------------------------------------------------
/presentation.ppt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/saurabh1907/document-classification-ml-nlp/f59d4e5cca4fde9df5573cb2421b48655af0cfb5/presentation.ppt

--------------------------------------------------------------------------------
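
The README names Naive Bayes as the simplest of the compared models but does not show what such a baseline looks like. The following is a minimal sketch of a multinomial Naive Bayes text classifier over the four categories. It is not the notebook's code: the README lists scikit-learn and Keras, which the notebook presumably uses, whereas this sketch uses only the standard library. The `NaiveBayes` class, the `tokenize` helper, and the `TOY_TRAIN` sentences are hypothetical and invented for illustration; they are not drawn from `dataset/ag_news`.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over raw word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences (assumption): two per category named in the README.
TOY_TRAIN = [
    ("Leaders meet for peace talks after border conflict", "World"),
    ("Election results announced as president addresses nation", "World"),
    ("Team wins championship game in overtime thriller", "Sports"),
    ("Striker scores twice as club wins league match", "Sports"),
    ("Stocks rise as company reports record quarterly profit", "Business"),
    ("Oil prices fall while investors watch the market", "Business"),
    ("New software update improves smartphone security", "Sci/Tech"),
    ("Scientists launch satellite to study the climate", "Sci/Tech"),
]

texts, labels = zip(*TOY_TRAIN)
model = NaiveBayes().fit(texts, labels)
print(model.predict("Investors cheer as profit lifts stocks"))
```

On the real data, the same idea would be trained on the rows of `dataset/ag_news/train.csv` and scored on `test.csv`. TF-IDF weighting, which the repository's `tfidf.png` suggests the notebook explores, would replace the raw counts used here.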