├── CNN
    ├── requirements.txt
    ├── config.yaml
    ├── model.py
    ├── train.py
    ├── utils.py
    └── TextClassifierCNN.ipynb
├── .gitignore
├── README.md
└── LICENSE


/CNN/requirements.txt:
--------------------------------------------------------------------------------
keras
nltk
numpy
pandas
regex
scikit-learn
tensorflow
tensorflow-datasets==4.0.1
tensorflowjs
transformers
pyyaml

--------------------------------------------------------------------------------
/CNN/config.yaml:
--------------------------------------------------------------------------------
app: cnn-model
task: text-clf
output:
  path: ./models/

dataset:
  plugin: text
  file:
    path: dataset.csv

input_features:
  text: Text
  labels: ClassIndex

trn_val_splits:
  # Split based on a column's values: specify the column containing
  # the string "validation" for validation examples.
  # { type: fixed, value: "dataset"}

  # Random split
  { type: random, value: 0.2 }

module:
  embedding_dim: 64
  dropout_rate: 0.3

tokeniser:
  tokeniser: keras
  max_seq_length: 75
  pad: true

training:
  batch_size: 16
  epochs: 5
  lr: 1.6e-04

  # Important for reproducibility
  random_state: 42

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Data folder
data

# Mac OS files
.DS_Store

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TextClassifierModels
This repository contains the code to develop neural-network-based text classifiers.

## Models

The repository implements several models for text classification.
For a _ready-to-explore_ version, see the notebooks provided.
The models are heavy and memory-hungry, so training them on a GPU machine is strongly advised.

### Available models

| Model | Demo | Details | CLI | Accuracy score on AG news dataset |
|---|---|---|---|---|
| CNN TextClassifier | | Classify texts with labels from the AG news dataset using a convolutional neural network. | `python3 -m train -c config.yaml` | 90.71 |
| webApp | | | | |
| source | | | | |
| BERT TextClassifier | | Classify texts with labels from the AG news dataset using an attention model based on BERT. | `python3 -m train -c config.yaml` | 93.95 |
| webApp | | | | |
| source | | | | |
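
The CLI column passes `config.yaml` to the training module with `-c`. As a rough sketch only (this is not the repository's `train.py`, which is not shown here), the snippet below parses that flag and applies a reproducible random hold-out in the spirit of the `{ type: random, value: 0.2 }` entry and `random_state: 42` from the config. The `--config` long option and the helper names `parse_args` and `random_split` are assumptions made for illustration, and the split uses only the standard library rather than whatever the project actually relies on.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the -c option shown in the CLI column (--config is an assumed alias)."""
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("-c", "--config", required=True, help="path to the YAML config")
    return parser.parse_args(argv)


def random_split(rows, value, random_state):
    """Hold out a fraction `value` of `rows` for validation, reproducibly."""
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(random_state).shuffle(indices)
    n_val = int(round(len(rows) * value))
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    val = [row for i, row in enumerate(rows) if i in val_idx]
    return train, val


args = parse_args(["-c", "config.yaml"])
train_rows, val_rows = random_split(list(range(100)), value=0.2, random_state=42)
print(args.config, len(train_rows), len(val_rows))
```

Fixing the seed means that repeated runs with the same config hold out the same examples, which is what makes accuracy figures comparable between runs.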
| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |

| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Fears for T N pension after talks | Unions representing workers at Turner Newall... |
| 1 | 4 | The Race is On: Second Private Team Sets Launc... | SPACE.com - TORONTO, Canada -- A second\team o... |
| 2 | 4 | Ky. Company Wins Grant to Study Peptides (AP) | AP - A company founded by a chemistry research... |
| 3 | 4 | Prediction Unit Helps Forecast Wildfires (AP) | AP - It's barely dawn when Mike Fitzpatrick st... |
| 4 | 4 | Calif. Aims to Limit Farm-Related Smog (AP) | AP - Southern California's smog-fighting agenc... |

| | Class Index | Title | Description | Text |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Reuters Shortsellers Wall Streets dwindlingba... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Reuters Private investment firm Carlyle Group... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Reuters Soaring crude prices plus worriesabou... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Reuters Authorities have halted oil exportflo... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | AFP Tearaway world oil prices toppling record... |
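
Judging from the rows above, the `Text` column looks like `Description` with punctuation and backslashes removed and nothing put in their place (`Short-sellers,` becomes `Shortsellers`, `export\f...` becomes `exportf...`). The notebook's actual cleaning code is not shown here, so the following is only a minimal sketch of one way to get that effect with the standard `re` module; the `clean_text` name and the exact character class are assumptions.

```python
import re


def clean_text(text):
    """Delete every character that is not an ASCII letter, digit, or space."""
    # Removed characters are not replaced by spaces, so words joined by a
    # hyphen or backslash end up fused together.
    return re.sub(r"[^A-Za-z0-9 ]", "", text)


# Illustrative input modelled on the start of row 0's Description.
sample = "Reuters - Short-sellers, Wall Street's"
print(clean_text(sample))
```

Because the hyphen between `Reuters` and `Short-sellers` is dropped without substitution, the cleaned string keeps the two spaces that surrounded it; collapsing runs of whitespace would need an extra step.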