├── .bandit
├── requirements-test.txt
├── docs
│   ├── swagger-screenshot.png
│   └── deploy-max-to-ibm-cloud-with-kubernetes-button.png
├── requirements.txt
├── training
│   └── README.md
├── .dockerignore
├── samples
│   ├── README.md
│   └── test-examples.tsv
├── core
│   ├── __init__.py
│   ├── model.py
│   └── bert
│       ├── run_classifier.py
│       └── tokenization.py
├── api
│   ├── __init__.py
│   ├── metadata.py
│   └── predict.py
├── sha512sums.txt
├── max-text-sentiment-classifier.yaml
├── app.py
├── .travis.yml
├── tests
│   ├── training
│   │   └── test_sample_training_response.py
│   ├── test_api.py
│   └── test_response.py
├── config.py
├── Dockerfile
├── .gitignore
├── README.md
└── LICENSE
/.bandit:
--------------------------------------------------------------------------------
1 | [bandit]
2 | exclude: /tests,/training
3 |
--------------------------------------------------------------------------------
/requirements-test.txt:
--------------------------------------------------------------------------------
1 | pytest==6.1.2
2 | requests==2.25.0
3 | flake8==3.8.4
4 | bandit==1.6.2
5 |
--------------------------------------------------------------------------------
/docs/swagger-screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/MAX-Text-Sentiment-Classifier/HEAD/docs/swagger-screenshot.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # python3.6
2 | tensorflow == 2.5.0 # CPU Version of TensorFlow.
3 | # tensorflow-gpu == 1.12.2 # GPU version of TensorFlow.
4 |
--------------------------------------------------------------------------------
/docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/MAX-Text-Sentiment-Classifier/HEAD/docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png
--------------------------------------------------------------------------------
/training/README.md:
--------------------------------------------------------------------------------
1 | We have removed support for training with the move to TensorFlow v2. If you are interested in training, check out the last TensorFlow v1 version of the training script at https://github.com/IBM/MAX-Text-Sentiment-Classifier/tree/v2.1.0/training .
2 |
--------------------------------------------------------------------------------
/.dockerignore:
--------------------------------------------------------------------------------
1 | training/
2 | README.*
3 | .idea/
4 | .git/
5 | .gitignore
6 | tests/
7 | .pytest_cache
8 | assets/.pytest_cache
9 | venv/
10 | assets/sentiment_BERT_base_uncased/
11 | assets/assets.tar.gz
12 | assets/sentiment_BERT_base_uncased.zip
13 | benchmark_model/
14 | docs/
--------------------------------------------------------------------------------
/samples/README.md:
--------------------------------------------------------------------------------
1 | # Sample Details
2 |
3 | ## Test Examples
4 |
5 | The tab-separated-values file `test-examples.tsv` in the `samples` folder contains a fraction of the [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Claim%20Stance) ([CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)) that was not used for fine-tuning. The first column contains the claim; the second column contains the corresponding sentiment label (`pos` or `neg`). The claims in this file may be used to try out the model and benchmark its performance.
6 |
--------------------------------------------------------------------------------
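The two-column layout described above makes it straightforward to benchmark a running instance of the model. Below is a minimal sketch, assuming an instance serving at `http://localhost:5000` (the endpoint documented in the README) and the `pos`/`neg` column labels noted above; the `positive`/`negative` response keys are those served by the pre-trained model:

```python
# Minimal benchmarking sketch for samples/test-examples.tsv, assuming a model
# instance is serving at localhost:5000 and the TSV has two tab-separated
# columns: claim and label ('pos' or 'neg').
import csv

import requests

MODEL_ENDPOINT = 'http://localhost:5000/model/predict'

with open('samples/test-examples.tsv', newline='') as f:
    rows = [row for row in csv.reader(f, delimiter='\t') if row]

claims = [row[0] for row in rows]
labels = [row[1] for row in rows]

r = requests.post(url=MODEL_ENDPOINT, json={'text': claims})
r.raise_for_status()
predictions = r.json()['predictions']

# The served model reports 'positive'/'negative' probabilities, while the
# TSV labels are 'pos'/'neg', so map the argmax back onto the TSV labels.
predicted = ['pos' if p['positive'] >= p['negative'] else 'neg' for p in predictions]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(f'Accuracy: {accuracy:.2%} on {len(labels)} claims')
```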
/core/__init__.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
--------------------------------------------------------------------------------
/api/__init__.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from .metadata import ModelMetadataAPI # noqa
18 | from .predict import ModelPredictAPI # noqa
--------------------------------------------------------------------------------
/sha512sums.txt:
--------------------------------------------------------------------------------
1 | 4e9ed575594d83d53af2726926239df2a5e85d0fc2884238a512627ad56ffd243f92dc75dcc2c64c716b9556b177557528cdf143fd10d2cf517289076028aaaf assets/labels.txt
2 | 296397595b1fcedd3a37464d3aa14a57526820d2ff96795eef533c634173ab744a0309b4431b7918089b424fdc0962d78082e19936b31a8565e6b6e8413f7dbe assets/saved_model.pb
3 | f51f06aae7f580a88998f6f7f24b52495c8d3d289fdbdc21231c05f3a8965783074d95c17b819186f9a63b622280e8a051105a2161cd0d153fa57db7a0aba9f9 assets/vocab.txt
4 | c58d6f1107456635fc403caede31eaf831b985c61429e85eea3d3edc8281a6af09ea63d0986c4d110ae90547aeb6d312c75a66280723f614ce6246a353b58626 assets/variables/variables.data-00000-of-00001
5 | 42dc3a7620e8a712065ae7bc6973e654cb4e515dffd3dc8289b90571cac578c1afbc8cb558b44b268da7be56e5b5045380949459274736bdaf566f207970e795 assets/variables/variables.index
6 |
--------------------------------------------------------------------------------
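These digests are verified inside the Docker build with `sha512sum -c sha512sums.txt` (see the Dockerfile below). For reference, a rough Python equivalent of that check, run from the repository root, could look like this:

```python
# Rough Python equivalent of `sha512sum -c sha512sums.txt`: recompute each
# asset's SHA-512 digest and compare it with the recorded value.
import hashlib

with open('sha512sums.txt') as f:
    for line in f:
        if not line.strip():
            continue
        expected, path = line.split(maxsplit=1)
        path = path.strip()
        digest = hashlib.sha512()
        with open(path, 'rb') as asset:
            for chunk in iter(lambda: asset.read(1 << 20), b''):
                digest.update(chunk)
        print(f"{path}: {'OK' if digest.hexdigest() == expected else 'FAILED'}")
```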
/max-text-sentiment-classifier.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: v1
2 | kind: Service
3 | metadata:
4 | name: max-text-sentiment-classifier
5 | spec:
6 | selector:
7 | app: max-text-sentiment-classifier
8 | ports:
9 | - port: 5000
10 | type: NodePort
11 | ---
12 | apiVersion: apps/v1
13 | kind: Deployment
14 | metadata:
15 | name: max-text-sentiment-classifier
16 | labels:
17 | app: max-text-sentiment-classifier
18 | spec:
19 | selector:
20 | matchLabels:
21 | app: max-text-sentiment-classifier
22 | replicas: 1
23 | template:
24 | metadata:
25 | labels:
26 | app: max-text-sentiment-classifier
27 | spec:
28 | containers:
29 | - name: max-text-sentiment-classifier
30 | image: quay.io/codait/max-text-sentiment-classifier:latest
31 | ports:
32 | - containerPort: 5000
33 |
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from maxfw.core import MAXApp
18 | from api import ModelMetadataAPI, ModelPredictAPI
19 | from config import API_TITLE, API_DESC, API_VERSION
20 |
21 | max = MAXApp(API_TITLE, API_DESC, API_VERSION)
22 | max.add_api(ModelMetadataAPI, '/metadata')
23 | max.add_api(ModelPredictAPI, '/predict')
24 | max.run()
25 |
--------------------------------------------------------------------------------
/api/metadata.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from core.model import ModelWrapper
18 | from maxfw.core import MAX_API, MetadataAPI, METADATA_SCHEMA
19 |
20 |
21 | class ModelMetadataAPI(MetadataAPI):
22 |
23 | @MAX_API.marshal_with(METADATA_SCHEMA)
24 | def get(self):
25 | """Return the metadata associated with the model"""
26 | return ModelWrapper.MODEL_META_DATA
27 |
--------------------------------------------------------------------------------
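Once the service is running, the metadata endpoint registered by this class can be exercised with a short snippet (assuming a local instance on port 5000, as used throughout the tests):

```python
# Fetch the model metadata served by ModelMetadataAPI, assuming a local instance.
import requests

r = requests.get('http://localhost:5000/model/metadata')
r.raise_for_status()
metadata = r.json()
print(metadata['id'])    # 'max-text-sentiment-classifier'
print(metadata['name'])  # 'MAX Text Sentiment Classifier'
```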
/.travis.yml:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | language: python
18 | python:
19 | - 3.6
20 | services:
21 | - docker
22 | install:
23 | - docker build -t max-text-sentiment-classifier .
24 | - docker run -it -d --rm -p 5000:5000 max-text-sentiment-classifier
25 | - pip install -r requirements-test.txt
26 | before_script:
27 | - flake8 . --max-line-length=127
28 | - bandit -r .
29 | - sleep 30
30 | script:
31 | - pytest tests/test_api.py
32 | - pytest tests/test_response.py
33 |
--------------------------------------------------------------------------------
/tests/training/test_sample_training_response.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | import pytest
18 | import requests
19 |
20 |
21 | def test_response():
22 |
23 |     # test that the predict endpoint returns status code 200
24 | model_endpoint = 'http://localhost:5000/model/predict'
25 |
26 | json_data = {
27 | "text": ["good string",
28 | "bad string"]
29 | }
30 |
31 | r = requests.post(url=model_endpoint, json=json_data)
32 |
33 | assert r.status_code == 200
34 | response = r.json()
35 | assert response['status'] == 'ok'
36 |
37 | # test whether the labels have changed
38 | assert 'pos' in response['predictions'][0].keys()
39 | assert 'neg' in response['predictions'][0].keys()
40 |
41 |
42 | if __name__ == '__main__':
43 | pytest.main([__file__])
44 |
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | # Flask settings
18 | DEBUG = False
19 |
20 | # Flask-restplus settings
21 | RESTPLUS_MASK_SWAGGER = False
22 | SWAGGER_UI_DOC_EXPANSION = 'none'
23 |
24 | # API metadata
25 | API_TITLE = 'MAX Text Sentiment Classifier'
26 | API_DESC = 'Detect the sentiment captured in short pieces of text. ' \
27 | 'The model was finetuned on the IBM Project Debater Claim Sentiment dataset.'
28 | API_VERSION = '2.0.0'
29 |
30 | # default model
31 | DEFAULT_MODEL_PATH = 'assets'
32 |
33 | # the metadata of the model
34 | MODEL_META_DATA = {
35 | 'id': 'max-text-sentiment-classifier',
36 | 'name': 'MAX Text Sentiment Classifier',
37 | 'description': 'BERT Base finetuned on the IBM Project Debater Claim Sentiment dataset.',
38 | 'type': 'Text Classification',
39 | 'source': 'https://developer.ibm.com/exchanges/models/all/max-text-sentiment-classifier/',
40 | 'license': 'Apache V2'
41 | }
42 |
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | FROM quay.io/codait/max-base:v1.4.0
18 |
19 | ARG model_bucket=https://codait-cos-max.s3.us.cloud-object-storage.appdomain.cloud/max-text-sentiment-classifier/1.2.0
20 | ARG model_file=assets.tar.gz
21 |
22 | ARG use_pre_trained_model=true
23 |
24 | RUN if [ "$use_pre_trained_model" = "true" ] ; then\
25 | # download pre-trained model artifacts from Cloud Object Storage
26 | wget -nv --show-progress --progress=bar:force:noscroll ${model_bucket}/${model_file} --output-document=assets/${model_file} &&\
27 | tar -x -C assets/ -f assets/${model_file} -v && rm assets/${model_file} ; \
28 | fi
29 |
30 | COPY requirements.txt .
31 | RUN pip install -r requirements.txt
32 |
33 | COPY . .
34 |
35 | RUN if [ "$use_pre_trained_model" = "true" ] ; then \
36 | # validate downloaded pre-trained model assets
37 | sha512sum -c sha512sums.txt ; \
38 | else \
39 | # rename the directory that contains the custom-trained model artifacts
40 | if [ -d "./custom_assets/" ] ; then \
41 | rm -rf ./assets && ln -s ./custom_assets ./assets ; \
42 |         fi ; \
43 | fi
44 |
45 | EXPOSE 5000
46 |
47 | CMD python app.py
48 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *-model-building-code.zip
2 | .idea/
3 | # Byte-compiled / optimized / DLL files
4 | __pycache__/
5 | *.py[cod]
6 | *$py.class
7 |
8 | # C extensions
9 | *.so
10 |
11 | # Distribution / packaging
12 | .Python
13 | build/
14 | develop-eggs/
15 | dist/
16 | downloads/
17 | eggs/
18 | .eggs/
19 | lib/
20 | lib64/
21 | parts/
22 | sdist/
23 | var/
24 | wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | .hypothesis/
50 | .pytest_cache/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 | db.sqlite3
60 |
61 | # Flask stuff:
62 | instance/
63 | .webassets-cache
64 |
65 | # Scrapy stuff:
66 | .scrapy
67 |
68 | # Sphinx documentation
69 | docs/_build/
70 |
71 | # PyBuilder
72 | target/
73 |
74 | # Jupyter Notebook
75 | .ipynb_checkpoints
76 |
77 | # pyenv
78 | .python-version
79 |
80 | # celery beat schedule file
81 | celerybeat-schedule
82 |
83 | # SageMath parsed files
84 | *.sage.py
85 |
86 | # Environments
87 | .env
88 | .venv
89 | env/
90 | venv/
91 | ENV/
92 | env.bak/
93 | venv.bak/
94 |
95 | # Spyder project settings
96 | .spyderproject
97 | .spyproject
98 |
99 | # Rope project settings
100 | .ropeproject
101 |
102 | # mkdocs documentation
103 | /site
104 |
105 | # mypy
106 | .mypy_cache/
107 |
--------------------------------------------------------------------------------
/tests/test_api.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | import pytest
18 | import requests
19 |
20 |
21 | def test_swagger():
22 |
23 | model_endpoint = 'http://localhost:5000/swagger.json'
24 |
25 | r = requests.get(url=model_endpoint)
26 | assert r.status_code == 200
27 | assert r.headers['Content-Type'] == 'application/json'
28 |
29 | json = r.json()
30 | assert 'swagger' in json
31 | assert json.get('info') and json.get('info').get('title') == 'MAX Text Sentiment Classifier'
32 |
33 |
34 | def test_metadata():
35 |
36 | model_endpoint = 'http://localhost:5000/model/metadata'
37 |
38 | r = requests.get(url=model_endpoint)
39 | assert r.status_code == 200
40 |
41 | metadata = r.json()
42 | assert metadata['id'] == 'max-text-sentiment-classifier'
43 | assert metadata['name'] == 'MAX Text Sentiment Classifier'
44 | assert metadata['description'] == 'BERT Base finetuned on the IBM Project Debater Claim Sentiment dataset.'
45 | assert metadata['license'] == 'Apache V2'
46 | assert metadata['type'] == 'Text Classification'
47 | assert 'developer.ibm.com' in metadata['source']
48 |
49 |
50 | if __name__ == '__main__':
51 | pytest.main([__file__])
52 |
--------------------------------------------------------------------------------
/api/predict.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from core.model import ModelWrapper
18 | from maxfw.core import MAX_API, PredictAPI
19 | from flask_restx import fields
20 | from flask import abort
21 |
22 | # Set up parser for input data (https://flask-restx.readthedocs.io/en/latest/parsing.html)
23 | input_parser = MAX_API.model('ModelInput', {
24 | 'text': fields.List(fields.String, required=True,
25 | description='List of claims (strings) to be analyzed for either a positive or negative sentiment.')
26 | })
27 |
28 | with open('assets/labels.txt', 'r') as f:
29 | class_labels = [x.strip() for x in f]
30 |
31 | # Creating a JSON response model: https://flask-restx.readthedocs.io/en/latest/marshalling.html#the-api-model-factory
32 | label_prediction = MAX_API.model('LabelPrediction',
33 | {l: fields.Float(required=True, description='Class probability') for l in class_labels}) # noqa - E741
34 |
35 | predict_response = MAX_API.model('ModelPredictResponse', {
36 | 'status': fields.String(required=True, description='Response status message'),
37 | 'predictions': fields.List(fields.Nested(label_prediction), description='Predicted labels and probabilities')
38 | })
39 |
40 |
41 | class ModelPredictAPI(PredictAPI):
42 |
43 | model_wrapper = ModelWrapper()
44 |
45 | @MAX_API.doc('predict')
46 | @MAX_API.expect(input_parser)
47 | @MAX_API.marshal_with(predict_response)
48 | def post(self):
49 | """Make a prediction given input data"""
50 | result = {'status': 'error'}
51 |
52 | input_json = MAX_API.payload
53 |
54 | try:
55 | preds = self.model_wrapper.predict(input_json['text'])
56 | except: # noqa
57 | abort(400, "Please supply a valid input json. "
58 | "The json structure should have a 'text' field containing a list of strings")
59 |
60 | # Generate the output format for every input string
61 | output = [{l: p[i] for i, l in enumerate(class_labels)} for p in preds]
62 |
63 | result['predictions'] = output
64 | result['status'] = 'ok'
65 |
66 | return result
67 |
--------------------------------------------------------------------------------
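The `try`/`except` around `model_wrapper.predict` above is what turns malformed payloads into an HTTP 400. A quick sketch of that contract against a local instance (hypothetical inputs, same endpoint as the tests):

```python
# Sketch of the /model/predict input contract, assuming a local instance.
import requests

endpoint = 'http://localhost:5000/model/predict'

# 'text' must be a list of strings; a bare string makes the model wrapper
# raise, which ModelPredictAPI.post() converts into abort(400).
bad = requests.post(url=endpoint, json={'text': 'not a list'})
assert bad.status_code == 400

# A well-formed payload returns one probability dict per input string.
good = requests.post(url=endpoint, json={'text': ['A fine claim.']})
assert good.status_code == 200
print(good.json()['predictions'][0])  # {'positive': ..., 'negative': ...}
```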
/tests/test_response.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | import pytest
18 | import requests
19 |
20 |
21 | def test_response():
22 | model_endpoint = 'http://localhost:5000/model/predict'
23 |
24 | json_data = {
25 | "text": ["good string",
26 | "bad string"]
27 | }
28 |
29 | r = requests.post(url=model_endpoint, json=json_data)
30 |
31 | assert r.status_code == 200
32 | response = r.json()
33 | assert response['status'] == 'ok'
34 |
35 | # verify that 'good string' is in fact positive
36 | assert round(float(response['predictions'][0]['positive'])) == 1
37 | # verify that 'bad string' is in fact negative
38 | assert round(float(response['predictions'][1]['negative'])) == 1
39 |
40 | json_data2 = {
41 | "text": [
42 | "2008 was a dark, dark year for stock markets worldwide.",
43 | "The Model Asset Exchange is a crucial element of a developer's toolkit."
44 | ]
45 | }
46 |
47 | r = requests.post(url=model_endpoint, json=json_data2)
48 |
49 | assert r.status_code == 200
50 | response = r.json()
51 | assert response['status'] == 'ok'
52 |
53 | # verify that "2008 was a dark, dark year for stock markets worldwide." is in fact negative
54 | assert round(float(response['predictions'][0]['positive'])) == 0
55 | assert round(float(response['predictions'][0]['negative'])) == 1
56 | # verify that "The Model Asset Exchange is a crucial element of a developer's toolkit." is in fact positive
57 | assert round(float(response['predictions'][1]['negative'])) == 0
58 | assert round(float(response['predictions'][1]['positive'])) == 1
59 |
60 | # Test different input batch sizes
61 | for input_size in [4, 16, 32, 64, 75]:
62 | json_data3 = {
63 | "text": ["good string"]*input_size
64 | }
65 |
66 | r = requests.post(url=model_endpoint, json=json_data3)
67 |
68 | assert r.status_code == 200
69 | response = r.json()
70 | assert response['status'] == 'ok'
71 |
72 | assert len(response['predictions']) == len(json_data3["text"])
73 |
74 |
75 | if __name__ == '__main__':
76 | pytest.main([__file__])
77 |
--------------------------------------------------------------------------------
/core/model.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from maxfw.model import MAXModelWrapper
18 |
19 | import logging
20 | from config import DEFAULT_MODEL_PATH, MODEL_META_DATA as model_meta
21 |
22 | from core.bert.run_classifier import convert_single_example, MAXAPIProcessor
23 | from core.bert import tokenization
24 | from tensorflow.python.saved_model import tag_constants
25 | import tensorflow as tf
26 | import numpy as np
27 |
28 | logger = logging.getLogger()
29 |
30 |
31 | class ModelWrapper(MAXModelWrapper):
32 |
33 | MODEL_META_DATA = model_meta
34 |
35 | def __init__(self, path=DEFAULT_MODEL_PATH):
36 | tf.compat.v1.disable_v2_behavior()
37 |
38 | logger.info('Loading model from: {}...'.format(path))
39 |
40 | self.max_seq_length = 128
41 | self.do_lower_case = True
42 |
43 | # Set Logging verbosity
44 | tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
45 |
46 | # Loading the tf Graph
47 | self.graph = tf.Graph()
48 | self.sess = tf.compat.v1.Session(graph=self.graph)
49 |         tf.compat.v1.saved_model.loader.load(self.sess, [tag_constants.SERVING], path)
50 |
51 | # Validate init_checkpoint
52 | tokenization.validate_case_matches_checkpoint(self.do_lower_case,
53 |                                                       path)
54 |
55 | # Initialize the dataprocessor
56 | self.processor = MAXAPIProcessor()
57 |
58 | # Get the labels
59 | self.label_list = self.processor.get_labels()
60 |
61 | # Initialize the tokenizer
62 | self.tokenizer = tokenization.FullTokenizer(
63 |             vocab_file=f'{path}/vocab.txt', do_lower_case=self.do_lower_case)
64 |
65 | logger.info('Loaded model')
66 |
67 | def _pre_process(self, input):
68 | '''Preprocessing of the input is not required as it is carried out by the BERT model (Tokenization).'''
69 | return input
70 |
71 | def _post_process(self, result):
72 | '''Reformat the results if needed.'''
73 | return result
74 |
75 | def _predict(self, x, predict_batch_size=32):
76 | '''Predict the class probabilities using the BERT model.'''
77 |
78 | # Get the input examples
79 | predict_examples = self.processor.get_test_examples(x)
80 |
81 | # Grab the input tensors of the Graph
82 | tensor_input_ids = self.sess.graph.get_tensor_by_name('input_ids_1:0')
83 | tensor_input_mask = self.sess.graph.get_tensor_by_name('input_mask_1:0')
84 | tensor_label_ids = self.sess.graph.get_tensor_by_name('label_ids_1:0')
85 | tensor_segment_ids = self.sess.graph.get_tensor_by_name('segment_ids_1:0')
86 | tensor_outputs = self.sess.graph.get_tensor_by_name('loss/Softmax:0')
87 |
88 | # Grab the examples, convert to features, and create batches. In the loop,
89 | # Go over all examples in chunks of size `predict_batch_size`.
90 | predictions = []
91 |
92 | for i in range(0, len(predict_examples), predict_batch_size):
93 | examples = predict_examples[i:i+predict_batch_size]
94 |
95 | tf.compat.v1.logging.info(
96 | f"{i} out of {len(predict_examples)} examples done ({round(i * 100 / len(predict_examples))}%).")
97 |
98 | # Convert example to feature in batches.
99 | input_ids, input_mask, label_ids, segment_ids = zip(
100 | *tuple(convert_single_example(i + j, example, self.label_list, self.max_seq_length, self.tokenizer)
101 | for j, example in enumerate(examples)))
102 |
103 | # Convert to a format that is consistent with input tensors
104 | feed_dict = {}
105 | feed_dict[tensor_input_ids] = np.vstack(
106 | tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in input_ids))
107 | feed_dict[tensor_input_mask] = np.vstack(
108 | tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in input_mask))
109 | feed_dict[tensor_label_ids] = np.vstack(
110 | tuple(np.array(arr) for arr in label_ids)).flatten()
111 | feed_dict[tensor_segment_ids] = np.vstack(
112 | tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in segment_ids))
113 |
114 | # Make a prediction
115 | result = self.sess.run(tensor_outputs, feed_dict=feed_dict)
116 | # Add the predictions
117 | predictions.extend(p.tolist() for p in result)
118 |
119 | return predictions
120 |
--------------------------------------------------------------------------------
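The feature-batching step in `_predict` above relies on a `zip(*...)` transpose followed by `np.vstack` to turn per-example feature tuples into the `(batch, max_seq_length)` arrays the graph expects. A toy illustration of that pattern, with made-up feature values and a tiny `max_seq_length`, is sketched below:

```python
# Toy illustration of the zip(*)/np.vstack batching in ModelWrapper._predict,
# using fake per-example features and max_seq_length=4.
import numpy as np

max_seq_length = 4
# Each converted example is a tuple: (input_ids, input_mask, label_id, segment_ids)
examples = [
    ([101, 7592, 102, 0], [1, 1, 1, 0], 0, [0, 0, 0, 0]),
    ([101, 2088, 102, 0], [1, 1, 1, 0], 0, [0, 0, 0, 0]),
]

# zip(*examples) transposes per-example tuples into per-feature tuples.
input_ids, input_mask, label_ids, segment_ids = zip(*examples)

batch_ids = np.vstack(tuple(np.array(a).reshape(-1, max_seq_length) for a in input_ids))
batch_labels = np.vstack(tuple(np.array(a) for a in label_ids)).flatten()

print(batch_ids.shape)     # (2, 4): one row per example
print(batch_labels.shape)  # (2,): one label id per example
```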
/core/bert/run_classifier.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """BERT finetuning runner."""
16 |
17 | import csv
18 | from core.bert import tokenization
19 | import tensorflow as tf
20 |
21 |
22 | class InputExample(object):
23 | """A single training/test example for simple sequence classification."""
24 |
25 | def __init__(self, guid, text_a, text_b=None, label=None):
26 | """Constructs a InputExample.
27 |
28 | Args:
29 | guid: Unique id for the example.
30 | text_a: string. The untokenized text of the first sequence. For single
31 | sequence tasks, only this sequence must be specified.
32 | text_b: (Optional) string. The untokenized text of the second sequence.
33 | Only must be specified for sequence pair tasks.
34 | label: (Optional) string. The label of the example. This should be
35 | specified for train and dev examples, but not for test examples.
36 | """
37 | self.guid = guid
38 | self.text_a = text_a
39 | self.text_b = text_b
40 | self.label = label
41 |
42 |
43 | class PaddingInputExample(object):
44 | """Fake example so the num input examples is a multiple of the batch size.
45 |
46 | When running eval/predict on the TPU, we need to pad the number of examples
47 | to be a multiple of the batch size, because the TPU requires a fixed batch
48 | size. The alternative is to drop the last batch, which is bad because it means
49 | the entire output data won't be generated.
50 |
51 | We use this class instead of `None` because treating `None` as padding
52 | batches could cause silent errors.
53 | """
54 |
55 |
56 | class InputFeatures(object):
57 | """A single set of features of data."""
58 |
59 | def __init__(self,
60 | input_ids,
61 | input_mask,
62 | segment_ids,
63 | label_id,
64 | is_real_example=True):
65 | self.input_ids = input_ids
66 | self.input_mask = input_mask
67 | self.segment_ids = segment_ids
68 | self.label_id = label_id
69 | self.is_real_example = is_real_example
70 |
71 |
72 | class DataProcessor(object):
73 | """Base class for data converters for sequence classification data sets."""
74 |
75 | def get_train_examples(self, data_dir):
76 | """Gets a collection of `InputExample`s for the train set."""
77 | raise NotImplementedError()
78 |
79 | def get_dev_examples(self, data_dir):
80 | """Gets a collection of `InputExample`s for the dev set."""
81 | raise NotImplementedError()
82 |
83 | def get_test_examples(self, data_dir):
84 | """Gets a collection of `InputExample`s for prediction."""
85 | raise NotImplementedError()
86 |
87 | def get_labels(self):
88 | """Gets the list of labels for this data set."""
89 | raise NotImplementedError()
90 |
91 | @classmethod
92 | def _read_tsv(cls, input_file, quotechar=None):
93 | """Reads a tab separated value file."""
94 |         with tf.compat.v1.gfile.Open(input_file, "r") as f:
95 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
96 | lines = []
97 | for line in reader:
98 | lines.append(line)
99 | return lines
100 |
101 | @classmethod
102 | def _read_csv(cls, input_file, quotechar=None):
103 | """Reads a comma separated value file."""
104 |         with tf.compat.v1.gfile.Open(input_file, "r") as f:
105 | reader = csv.reader(f, delimiter=",", quotechar=quotechar)
106 | lines = []
107 | for line in reader:
108 | lines.append(line)
109 | return lines
110 |
111 |
112 | class MAXAPIProcessor(DataProcessor):
113 | """Custom Data Processor for the MAX API."""
114 |
115 | def get_test_examples(self, test_data):
116 | """See base class."""
117 |
118 |         # Verify that the input is a list of strings
119 |         if not isinstance(test_data, list):
120 |             raise TypeError("'test_data' must be a list of strings, got %r" % type(test_data))
121 | # Create InputExample objects from the input data
122 | return self._create_examples(test_data, "test")
123 |
124 | def get_labels(self):
125 | """See base class."""
126 | return ["pos", "neg"]
127 |
128 | def _create_examples(self, lines, set_type):
129 |         """Creates examples from the input strings."""
130 | examples = []
131 | for (i, line) in enumerate(lines):
132 | guid = "%s-%s" % (set_type, i)
133 | text_a = tokenization.convert_to_unicode(line)
134 | label = self.get_labels()[0]
135 | examples.append(
136 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
137 | return examples
138 |
139 |
140 | def convert_single_example(ex_index, example, label_list, max_seq_length,
141 | tokenizer):
142 | """Converts a single `InputExample` into a single `InputFeatures`."""
143 |
144 | if isinstance(example, PaddingInputExample):
145 | return InputFeatures(
146 | input_ids=[0] * max_seq_length,
147 | input_mask=[0] * max_seq_length,
148 | segment_ids=[0] * max_seq_length,
149 | label_id=0,
150 | is_real_example=False)
151 |
152 | label_map = {}
153 | for (i, label) in enumerate(label_list):
154 | label_map[label] = i
155 |
156 | tokens_a = tokenizer.tokenize(example.text_a)
157 | tokens_b = None
158 | if example.text_b:
159 | tokens_b = tokenizer.tokenize(example.text_b)
160 |
161 | if tokens_b:
162 |         # Truncate `tokens_a` and `tokens_b` in place so the total length fits;
163 |         # account for [CLS], [SEP], [SEP] with "- 3" (inlines BERT's `_truncate_seq_pair`).
164 |         while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
165 |             (tokens_a if len(tokens_a) > len(tokens_b) else tokens_b).pop()
166 | else:
167 | # Account for [CLS] and [SEP] with "- 2"
168 | if len(tokens_a) > max_seq_length - 2:
169 | tokens_a = tokens_a[0:(max_seq_length - 2)]
170 |
171 | # The convention in BERT is:
172 | # (a) For sequence pairs:
173 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
174 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
175 | # (b) For single sequences:
176 | # tokens: [CLS] the dog is hairy . [SEP]
177 | # type_ids: 0 0 0 0 0 0 0
178 | #
179 | # Where "type_ids" are used to indicate whether this is the first
180 | # sequence or the second sequence. The embedding vectors for `type=0` and
181 | # `type=1` were learned during pre-training and are added to the wordpiece
182 | # embedding vector (and position vector). This is not *strictly* necessary
183 | # since the [SEP] token unambiguously separates the sequences, but it makes
184 | # it easier for the model to learn the concept of sequences.
185 | #
186 | # For classification tasks, the first vector (corresponding to [CLS]) is
187 | # used as the "sentence vector". Note that this only makes sense because
188 | # the entire model is fine-tuned.
189 | tokens = []
190 | segment_ids = []
191 | tokens.append("[CLS]")
192 | segment_ids.append(0)
193 | for token in tokens_a:
194 | tokens.append(token)
195 | segment_ids.append(0)
196 | tokens.append("[SEP]")
197 | segment_ids.append(0)
198 |
199 | if tokens_b:
200 | for token in tokens_b:
201 | tokens.append(token)
202 | segment_ids.append(1)
203 | tokens.append("[SEP]")
204 | segment_ids.append(1)
205 |
206 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
207 |
208 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
209 | # tokens are attended to.
210 | input_mask = [1] * len(input_ids)
211 |
212 | # Zero-pad up to the sequence length.
213 | while len(input_ids) < max_seq_length:
214 | input_ids.append(0)
215 | input_mask.append(0)
216 | segment_ids.append(0)
217 |
218 | if len(input_ids) != max_seq_length:
219 | raise ValueError("'input_ids' has an incorrect length: %r" % len(input_ids))
220 | if len(input_mask) != max_seq_length:
221 | raise ValueError("'input_mask' has an incorrect length: %r" % len(input_mask))
222 | if len(segment_ids) != max_seq_length:
223 | raise ValueError("'segment_ids' has an incorrect length: %r" % len(segment_ids))
224 |
225 | label_id = label_map[example.label]
226 | if ex_index < 5:
227 | tf.compat.v1.logging.info("*** Example ***")
228 | tf.compat.v1.logging.info("guid: %s" % (example.guid))
229 | tf.compat.v1.logging.info("tokens: %s" % " ".join(
230 | [tokenization.printable_text(x) for x in tokens]))
231 | tf.compat.v1.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
232 | tf.compat.v1.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
233 | tf.compat.v1.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
234 | tf.compat.v1.logging.info("label: %s (id = %d)" % (example.label, label_id))
235 |
236 | # return feature
237 | return input_ids, input_mask, label_id, segment_ids
238 |
--------------------------------------------------------------------------------
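To make the single-sequence `[CLS]`/`[SEP]` layout and zero-padding in `convert_single_example` concrete, here is a toy walk-through with a whitespace "tokenizer" and `max_seq_length=8` (real inputs go through the WordPiece `FullTokenizer` instead):

```python
# Toy walk-through of the single-sequence layout from convert_single_example:
# [CLS] tok1 ... tokN [SEP], then zero-padding up to max_seq_length.
max_seq_length = 8
tokens_a = "the dog is hairy .".split()  # stand-in for WordPiece tokens

tokens = ["[CLS]"] + tokens_a[:max_seq_length - 2] + ["[SEP]"]
segment_ids = [0] * len(tokens)  # single sequence -> all type 0
input_mask = [1] * len(tokens)   # 1 marks real (non-padding) tokens

pad = max_seq_length - len(tokens)
input_mask += [0] * pad
segment_ids += [0] * pad

print(tokens)       # ['[CLS]', 'the', 'dog', 'is', 'hairy', '.', '[SEP]']
print(input_mask)   # [1, 1, 1, 1, 1, 1, 1, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```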
/README.md:
--------------------------------------------------------------------------------
1 | [Build Status](https://travis-ci.com/IBM/MAX-Text-Sentiment-Classifier) [API demo](http://max-text-sentiment-classifier.codait-prod-41208c73af8fca213512856c7a09db52-0000.us-east.containers.appdomain.cloud)
2 |
3 | [![Deploy MAX to IBM Cloud with Kubernetes](docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png)](http://ibm.biz/max-to-ibm-cloud-tutorial)
4 |
5 | # IBM Developer Model Asset Exchange: Text Sentiment Classifier
6 |
7 | This repository contains code to instantiate and deploy a text sentiment classifier. The model detects whether a text fragment leans towards a positive or a negative sentiment. Optimal inputs are short strings (preferably a single sentence) with correct grammar, although correct grammar is not strictly required.
8 |
9 | The model is based on the [pre-trained BERT-Base, English Uncased](https://github.com/google-research/bert/blob/master/README.md) model and was fine-tuned on the [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml). The model files are hosted on
10 | [IBM Cloud Object Storage](https://max-cdn.cdn.appdomain.cloud/max-text-sentiment-classifier/1.2.0/assets.tar.gz).
11 | The code in this repository deploys the model as a web service in a Docker container. This repository was developed
12 | as part of the [IBM Developer Model Asset Exchange](https://developer.ibm.com/exchanges/models/) and the public API is powered by [IBM Cloud](https://ibm.biz/Bdz2XM).
13 |
14 | ## Model Metadata
15 | | Domain | Application | Industry | Framework | Training Data | Input Data |
16 | | --------- | -------- | -------- | --------- | --------- | --------------- |
17 | | Natural Language Processing (NLP) | Sentiment Analysis | General | TensorFlow | [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml) | Text |
18 |
19 | ## Benchmark
20 | The table below lists the prediction accuracy of the model on the test sets of three different datasets.
21 |
22 | The first row showcases the generalization power of our model after fine-tuning on the IBM Claims Dataset.
23 | The Sentiment140 (Tweets) and IMDB Reviews datasets are only used for evaluating the transfer-learning capabilities of this model. The implementation in this repository was **not** trained or fine-tuned on the Sentiment140 or IMDB reviews datasets.
24 |
25 | The second row describes the performance of the BERT-Base (English - Uncased) model when fine-tuned on the specific task. This was done simply for reference, and the weights are therefore not made available.
26 |
27 |
28 | The generalization results (first row) are very good when the input data is similar to the data used for fine-tuning (e.g. Sentiment140 (tweets) when fine-tuned on the IBM Claims Dataset). However, when a different style of text is given as input, and with a longer median length (e.g. multi-sentence IMDB reviews), the results are not as good.
29 |
30 | | Model Type | IBM Claims | Sentiment140 | IMDB Reviews |
31 | | ------------- | -------- | -------- | -------------- |
32 | | This model (fine-tuned on IBM Claims) | 94% | 83.84% | 81% |
33 | | Models fine-tuned on the specific dataset | 94% | 84% | 90% |
34 |
35 | ## References
36 | * _J. Devlin, M. Chang, K. Lee, K. Toutanova_, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), arXiv, 2018.
37 | * [Google BERT repository](https://github.com/google-research/bert)
38 | * [IBM Claims Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Project) and [IBM Project Debater](https://www.research.ibm.com/artificial-intelligence/project-debater/)
39 |
40 | ## Licenses
41 | | Component | License | Link |
42 | | ------------- | -------- | -------- |
43 | | This repository | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/IBM/MAX-Text-Sentiment-Classifier/blob/master/LICENSE) |
44 | | Fine-tuned Model Weights | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/IBM/MAX-Text-Sentiment-Classifier/blob/master/LICENSE) |
45 | | Pre-trained Model Weights | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/google-research/bert/blob/master/LICENSE) |
46 | | Model Code (3rd party) | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/google-research/bert/blob/master/LICENSE) |
47 | | IBM Claims Stance Dataset for fine-tuning | [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/) | [LICENSE 1](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Project) [LICENSE 2](https://en.wikipedia.org/wiki/Wikipedia:Copyrights#Reusers.27_rights_and_obligations) |
48 |
49 | ## Pre-requisites:
50 | * `docker`: The [Docker](https://www.docker.com/) command-line interface. Follow the [installation instructions](https://docs.docker.com/install/) for your system.
51 | * The minimum recommended resources for this model are 4 GB of memory and 4 CPUs.
52 | * If you are on x86-64/AMD64, your CPU must support [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) at the minimum.
53 |
54 | # Deployment options
55 |
56 | * [Deploy from Quay](#deploy-from-quay)
57 | * [Deploy on Red Hat OpenShift](#deploy-on-red-hat-openshift)
58 | * [Deploy on Kubernetes](#deploy-on-kubernetes)
59 | * [Run Locally](#run-locally)
60 |
61 | ## Deploy from Quay
62 | To run the docker image, which automatically starts the model serving API, run:
63 |
64 | ```
65 | $ docker run -it -p 5000:5000 quay.io/codait/max-text-sentiment-classifier
66 | ```
67 |
68 | This will pull a pre-built image from the Quay.io container registry (or use an existing image if already cached locally) and run it.
69 | If you'd rather check out and build the model locally, you can follow the [run locally](#run-locally) steps below.
70 |
71 | ## Deploy on Red Hat OpenShift
72 |
73 | You can deploy the model-serving microservice on Red Hat OpenShift by following the instructions for the OpenShift web console or the OpenShift Container Platform CLI [in this tutorial](https://developer.ibm.com/tutorials/deploy-a-model-asset-exchange-microservice-on-red-hat-openshift/), specifying `quay.io/codait/max-text-sentiment-classifier` as the image name.
74 |
75 | > Note that this model requires at least 4GB of RAM. Therefore this model will not run in a cluster that was provisioned under the [OpenShift Online starter plan](https://www.openshift.com/products/online/), which is capped at 2GB.
76 |
77 | ## Deploy on Kubernetes
78 | You can also deploy the model on Kubernetes using the latest docker image on Quay.
79 |
80 | On your Kubernetes cluster, run the following commands:
81 |
82 | ```
83 | $ kubectl apply -f https://github.com/IBM/MAX-Text-Sentiment-Classifier/raw/master/max-text-sentiment-classifier.yaml
84 | ```
85 |
86 | The model will be available internally at port `5000`, but can also be accessed externally through the `NodePort`.
87 |
88 | A more elaborate tutorial on how to deploy this MAX model to production on [IBM Cloud](https://ibm.biz/Bdz2XM) can be found [here](http://ibm.biz/max-to-ibm-cloud-tutorial).
89 |
90 | ## Run Locally
91 | 1. [Build the Model](#1-build-the-model)
92 | 2. [Deploy the Model](#2-deploy-the-model)
93 | 3. [Use the Model](#3-use-the-model)
94 | 4. [Development](#4-development)
95 | 5. [Cleanup](#5-cleanup)
96 |
97 |
98 | ### 1. Build the Model
99 | Clone this repository locally. In a terminal, run the following command:
100 |
101 | ```
102 | $ git clone https://github.com/IBM/MAX-Text-Sentiment-Classifier.git
103 | ```
104 |
105 | Change directory into the repository base folder:
106 |
107 | ```
108 | $ cd MAX-Text-Sentiment-Classifier
109 | ```
110 |
111 | To build the docker image locally, run:
112 |
113 | ```
114 | $ docker build -t max-text-sentiment-classifier .
115 | ```
116 |
117 | All required model assets will be downloaded during the build process. _Note_ that currently this docker image is CPU only (we will add support for GPU images later).
118 |
119 |
120 | ### 2. Deploy the Model
121 | To run the docker image, which automatically starts the model serving API, run:
122 |
123 | ```
124 | $ docker run -it -p 5000:5000 max-text-sentiment-classifier
125 | ```
126 |
127 | ### 3. Use the Model
128 |
129 | The API server automatically generates an interactive Swagger documentation page. Go to `http://localhost:5000` to load it. From there you can explore the API and also create test requests.
130 |
131 | ```
132 | Example:
133 | [
134 | "The Model Asset Exchange is a crucial element of a developer's toolkit.",
135 | "2008 was a dark, dark year for stock markets worldwide."
136 | ]
137 |
138 | Result:
139 | [
140 |   {
141 |     "positive": 0.9977352619171143,
142 |     "negative": 0.002264695707708597
143 |   },
144 |   {
145 |     "positive": 0.001138084102421999,
146 |     "negative": 0.9988619089126587
147 |   }
148 | ]
149 |
150 |
151 | ```
152 |
153 |
154 | Use the `model/predict` endpoint to submit input text in json format. The json structure should have one key, `text`, whose value is a list of input strings to be analyzed. An example can be found in the image below.
155 |
156 | Submitting valid json data triggers the model and returns a json response with a `status` and a `predictions` key. The `predictions` field holds a list of class labels with their corresponding probabilities; the first element in the list corresponds to the prediction for the first string in the input list.
157 |
158 |
159 | ![Swagger UI screenshot](docs/swagger-screenshot.png)
160 |
161 | You can also test it on the command line, for example:
162 |
163 | ```bash
164 | $ curl -d "{ \"text\": [ \"The Model Asset Exchange is a crucial element of a developer's toolkit.\" ]}" -X POST "http://localhost:5000/model/predict" -H "Content-Type: application/json"
165 | ```
166 |
167 | You should see a JSON response like that below:
168 |
169 | ```json
170 | {
171 |   "status": "ok",
172 |   "predictions": [
173 |     {
174 |       "positive": 0.9977352619171143,
175 |       "negative": 0.0022646968718618155
176 |     }
177 |   ]
178 | }
179 |
180 |
181 | ```
182 |
183 | ### 4. Development
184 | To run the Flask API app in debug mode, edit `config.py` to set `DEBUG = True` under the application settings. You will then need to rebuild the docker image (see [step 1](#1-build-the-model)).
185 |
186 | ### 5. Cleanup
187 | To stop the Docker container, press `CTRL` + `C` in your terminal.
188 |
189 | ## Resources and Contributions
190 |
191 | If you are interested in contributing to the Model Asset Exchange project or have any queries, please follow the instructions [here](https://github.com/CODAIT/max-central-repo).
192 |
--------------------------------------------------------------------------------
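As a companion to the `curl` example in the README above, an equivalent Python call (same local endpoint and payload) might look like:

```python
# Python equivalent of the README's curl example against a local instance.
import requests

r = requests.post(
    'http://localhost:5000/model/predict',
    json={'text': ["The Model Asset Exchange is a crucial element of a developer's toolkit."]},
)
r.raise_for_status()
for prediction in r.json()['predictions']:
    print(prediction)  # {'positive': 0.99..., 'negative': 0.00...}
```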
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright 2019, IBM (Center of Open-Source Data & AI Technologies)
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/core/bert/tokenization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """Tokenization classes."""
16 |
17 | import collections
18 | import re
19 | import unicodedata
20 | import six
21 | import tensorflow as tf
22 |
23 |
24 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
25 | """Checks whether the casing config is consistent with the checkpoint name."""
26 |
27 | # The casing has to be passed in by the user and there is no explicit check
28 | # as to whether it matches the checkpoint. The casing information probably
29 | # should have been stored in the bert_config.json file, but it's not, so
30 | # we have to heuristically detect it to validate.
31 |
32 | if not init_checkpoint:
33 | return
34 |
35 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)
36 | if m is None:
37 | return
38 |
39 | model_name = m.group(1)
40 |
41 | lower_models = [
42 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
43 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
44 | ]
45 |
46 | cased_models = [
47 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
48 | "multi_cased_L-12_H-768_A-12"
49 | ]
50 |
51 | is_bad_config = False
52 | if model_name in lower_models and not do_lower_case:
53 | is_bad_config = True
54 | actual_flag = "False"
55 | case_name = "lowercased"
56 | opposite_flag = "True"
57 |
58 | if model_name in cased_models and do_lower_case:
59 | is_bad_config = True
60 | actual_flag = "True"
61 | case_name = "cased"
62 | opposite_flag = "False"
63 |
64 | if is_bad_config:
65 | raise ValueError(
66 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
67 | "However, `%s` seems to be a %s model, so you "
68 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
69 |         "how the model was pre-trained. If this error is wrong, please "
70 | "just comment out this check." % (actual_flag, init_checkpoint,
71 | model_name, case_name, opposite_flag))
72 |
73 |
74 | def convert_to_unicode(text):
75 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
76 | if six.PY3:
77 | if isinstance(text, str):
78 | return text
79 | elif isinstance(text, bytes):
80 | return text.decode("utf-8", "ignore")
81 | else:
82 | raise ValueError("Unsupported string type: %s" % (type(text)))
83 | elif six.PY2:
84 | if isinstance(text, str):
85 | return text.decode("utf-8", "ignore")
86 | elif isinstance(text, unicode): # noqa
87 | return text
88 | else:
89 | raise ValueError("Unsupported string type: %s" % (type(text)))
90 | else:
91 |     raise ValueError("Not running on Python 2 or Python 3?")
92 |
93 |
94 | def printable_text(text):
95 | """Returns text encoded in a way suitable for print or `tf.logging`."""
96 |
97 | # These functions want `str` for both Python2 and Python3, but in one case
98 | # it's a Unicode string and in the other it's a byte string.
99 | if six.PY3:
100 | if isinstance(text, str):
101 | return text
102 | elif isinstance(text, bytes):
103 | return text.decode("utf-8", "ignore")
104 | else:
105 | raise ValueError("Unsupported string type: %s" % (type(text)))
106 | elif six.PY2:
107 | if isinstance(text, str):
108 | return text
109 | elif isinstance(text, unicode): # noqa
110 | return text.encode("utf-8")
111 | else:
112 | raise ValueError("Unsupported string type: %s" % (type(text)))
113 | else:
114 |     raise ValueError("Not running on Python 2 or Python 3?")
115 |
116 |
117 | def load_vocab(vocab_file):
118 | """Loads a vocabulary file into a dictionary."""
119 | vocab = collections.OrderedDict()
120 | index = 0
121 | with tf.compat.v1.gfile.GFile(vocab_file, "r") as reader:
122 | while True:
123 | token = convert_to_unicode(reader.readline())
124 | if not token:
125 | break
126 | token = token.strip()
127 | vocab[token] = index
128 | index += 1
129 | return vocab
130 |
131 |
132 | def convert_by_vocab(vocab, items):
133 | """Converts a sequence of [tokens|ids] using the vocab."""
134 | output = []
135 | for item in items:
136 | output.append(vocab[item])
137 | return output
138 |
139 |
140 | def convert_tokens_to_ids(vocab, tokens):
141 | return convert_by_vocab(vocab, tokens)
142 |
143 |
144 | def convert_ids_to_tokens(inv_vocab, ids):
145 | return convert_by_vocab(inv_vocab, ids)
146 |
147 |
148 | def whitespace_tokenize(text):
149 | """Runs basic whitespace cleaning and splitting on a piece of text."""
150 | text = text.strip()
151 | if not text:
152 | return []
153 | tokens = text.split()
154 | return tokens
155 |
156 |
157 | class FullTokenizer(object):
158 |   """Runs end-to-end tokenization."""
159 |
160 | def __init__(self, vocab_file, do_lower_case=True):
161 | self.vocab = load_vocab(vocab_file)
162 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
163 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
164 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
165 |
166 | def tokenize(self, text):
167 | split_tokens = []
168 | for token in self.basic_tokenizer.tokenize(text):
169 | for sub_token in self.wordpiece_tokenizer.tokenize(token):
170 | split_tokens.append(sub_token)
171 |
172 | return split_tokens
173 |
174 | def convert_tokens_to_ids(self, tokens):
175 | return convert_by_vocab(self.vocab, tokens)
176 |
177 | def convert_ids_to_tokens(self, ids):
178 | return convert_by_vocab(self.inv_vocab, ids)
179 |
180 |
181 | class BasicTokenizer(object):
182 | """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
183 |
184 | def __init__(self, do_lower_case=True):
185 | """Constructs a BasicTokenizer.
186 |
187 | Args:
188 | do_lower_case: Whether to lower case the input.
189 | """
190 | self.do_lower_case = do_lower_case
191 |
192 | def tokenize(self, text):
193 | """Tokenizes a piece of text."""
194 | text = convert_to_unicode(text)
195 | text = self._clean_text(text)
196 |
197 | # This was added on November 1st, 2018 for the multilingual and Chinese
198 | # models. This is also applied to the English models now, but it doesn't
199 | # matter since the English models were not trained on any Chinese data
200 | # and generally don't have any Chinese data in them (there are Chinese
201 | # characters in the vocabulary because Wikipedia does have some Chinese
202 |     # words in the English Wikipedia).
203 | text = self._tokenize_chinese_chars(text)
204 |
205 | orig_tokens = whitespace_tokenize(text)
206 | split_tokens = []
207 | for token in orig_tokens:
208 | if self.do_lower_case:
209 | token = token.lower()
210 | token = self._run_strip_accents(token)
211 | split_tokens.extend(self._run_split_on_punc(token))
212 |
213 | output_tokens = whitespace_tokenize(" ".join(split_tokens))
214 | return output_tokens
215 |
216 | def _run_strip_accents(self, text):
217 | """Strips accents from a piece of text."""
218 | text = unicodedata.normalize("NFD", text)
219 | output = []
220 | for char in text:
221 | cat = unicodedata.category(char)
222 | if cat == "Mn":
223 | continue
224 | output.append(char)
225 | return "".join(output)
226 |
227 | def _run_split_on_punc(self, text):
228 | """Splits punctuation on a piece of text."""
229 | chars = list(text)
230 | i = 0
231 | start_new_word = True
232 | output = []
233 | while i < len(chars):
234 | char = chars[i]
235 | if _is_punctuation(char):
236 | output.append([char])
237 | start_new_word = True
238 | else:
239 | if start_new_word:
240 | output.append([])
241 | start_new_word = False
242 | output[-1].append(char)
243 | i += 1
244 |
245 | return ["".join(x) for x in output]
246 |
247 | def _tokenize_chinese_chars(self, text):
248 | """Adds whitespace around any CJK character."""
249 | output = []
250 | for char in text:
251 | cp = ord(char)
252 | if self._is_chinese_char(cp):
253 | output.append(" ")
254 | output.append(char)
255 | output.append(" ")
256 | else:
257 | output.append(char)
258 | return "".join(output)
259 |
260 | def _is_chinese_char(self, cp):
261 | """Checks whether CP is the codepoint of a CJK character."""
262 | # This defines a "chinese character" as anything in the CJK Unicode block:
263 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
264 | #
265 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
266 | # despite its name. The modern Korean Hangul alphabet is a different block,
267 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write
268 | # space-separated words, so they are not treated specially and handled
269 |     # like all of the other languages.
270 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
271 | (cp >= 0x3400 and cp <= 0x4DBF) or #
272 | (cp >= 0x20000 and cp <= 0x2A6DF) or #
273 | (cp >= 0x2A700 and cp <= 0x2B73F) or #
274 | (cp >= 0x2B740 and cp <= 0x2B81F) or #
275 | (cp >= 0x2B820 and cp <= 0x2CEAF) or
276 | (cp >= 0xF900 and cp <= 0xFAFF) or #
277 | (cp >= 0x2F800 and cp <= 0x2FA1F)): #
278 | return True
279 |
280 | return False
281 |
282 | def _clean_text(self, text):
283 | """Performs invalid character removal and whitespace cleanup on text."""
284 | output = []
285 | for char in text:
286 | cp = ord(char)
287 | if cp == 0 or cp == 0xfffd or _is_control(char):
288 | continue
289 | if _is_whitespace(char):
290 | output.append(" ")
291 | else:
292 | output.append(char)
293 | return "".join(output)
294 |
295 |
296 | class WordpieceTokenizer(object):
297 |   """Runs WordPiece tokenization."""
298 |
299 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): # nosec - "[UNK]" is not a password
300 | self.vocab = vocab
301 | self.unk_token = unk_token
302 | self.max_input_chars_per_word = max_input_chars_per_word
303 |
304 | def tokenize(self, text):
305 | """Tokenizes a piece of text into its word pieces.
306 |
307 | This uses a greedy longest-match-first algorithm to perform tokenization
308 | using the given vocabulary.
309 |
310 | For example:
311 | input = "unaffable"
312 | output = ["un", "##aff", "##able"]
313 |
314 | Args:
315 |       text: A single token or whitespace-separated tokens. This should have
316 |         already been passed through `BasicTokenizer`.
317 |
318 | Returns:
319 | A list of wordpiece tokens.
320 | """
321 |
322 | text = convert_to_unicode(text)
323 |
324 | output_tokens = []
325 | for token in whitespace_tokenize(text):
326 | chars = list(token)
327 | if len(chars) > self.max_input_chars_per_word:
328 | output_tokens.append(self.unk_token)
329 | continue
330 |
331 | is_bad = False
332 | start = 0
333 | sub_tokens = []
334 | while start < len(chars):
335 | end = len(chars)
336 | cur_substr = None
337 | while start < end:
338 | substr = "".join(chars[start:end])
339 | if start > 0:
340 | substr = "##" + substr
341 | if substr in self.vocab:
342 | cur_substr = substr
343 | break
344 | end -= 1
345 | if cur_substr is None:
346 | is_bad = True
347 | break
348 | sub_tokens.append(cur_substr)
349 | start = end
350 |
351 | if is_bad:
352 | output_tokens.append(self.unk_token)
353 | else:
354 | output_tokens.extend(sub_tokens)
355 | return output_tokens
356 |
357 |
358 | def _is_whitespace(char):
359 |   """Checks whether `char` is a whitespace character."""
360 |   # \t, \n, and \r are technically control characters but we treat them
361 | # as whitespace since they are generally considered as such.
362 | if char == " " or char == "\t" or char == "\n" or char == "\r":
363 | return True
364 | cat = unicodedata.category(char)
365 | if cat == "Zs":
366 | return True
367 | return False
368 |
369 |
370 | def _is_control(char):
371 |   """Checks whether `char` is a control character."""
372 | # These are technically control characters but we count them as whitespace
373 | # characters.
374 | if char == "\t" or char == "\n" or char == "\r":
375 | return False
376 | cat = unicodedata.category(char)
377 | if cat.startswith("C"):
378 | return True
379 | return False
380 |
381 |
382 | def _is_punctuation(char):
383 |   """Checks whether `char` is a punctuation character."""
384 | cp = ord(char)
385 | # We treat all non-letter/number ASCII as punctuation.
386 | # Characters such as "^", "$", and "`" are not in the Unicode
387 | # Punctuation class but we treat them as punctuation anyways, for
388 | # consistency.
389 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
390 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
391 | return True
392 | cat = unicodedata.category(char)
393 | if cat.startswith("P"):
394 | return True
395 | return False
396 |
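397 | # A minimal usage sketch of the classes above, assuming a BERT `vocab.txt`
398 | # is available locally (the path below is illustrative, not a file shipped
399 | # with this module):
400 | #
401 | #   tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
402 | #   tokens = tokenizer.tokenize("unaffable")  # -> ["un", "##aff", "##able"]
403 | #   ids = tokenizer.convert_tokens_to_ids(tokens)
404 | #   assert tokenizer.convert_ids_to_tokens(ids) == tokens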
--------------------------------------------------------------------------------
/samples/test-examples.tsv:
--------------------------------------------------------------------------------
1 | video game violence is not related to serious aggressive behavior in real life pos
2 | Neurological link was found between playing violent video games and aggressive behaviour in children and teenagers neg
3 | many skills can be learned from the gaming experience, it builds practical and intellectual skills pos
4 | children may imitate aggressive behaviors witnessed in media neg
5 | Violent video games - especially first-person shooter games - encourages real-life acts of violence in teenagers neg
6 | there is social utility in expressive and imaginative forms of entertainment, even if they contain violence pos
7 | China's birth control policy contributes to forced abortions neg
8 | the dramatic decrease in Chinese fertility started before the program began in 1979 for unrelated factors neg
9 | overpopulation has been blamed for a variety of issues, including increasing poverty neg
10 | Lack of siblings has been blamed for a number of social ills neg
11 | The use of drugs to enhance performance is considered unethical by most international sports organizations neg
12 | There is a wide range of health concerns for users of anabolic steroids neg
13 | Physical exercise helps prevent the diseases of affluence pos
14 | Exercise alone is a potential prevention method and/or treatment for mild forms of depression pos
15 | Too much exercise can be harmful neg
16 | affirmative action devalues the accomplishments of people who are chosen based on the social group to which they belong rather than their qualifications neg
17 | affirmative action has undesirable side-effects in addition to failing to achieve its goals neg
18 | attempts at antidiscrimination have been criticized as reverse discrimination neg
19 | States must take measures to seek to eliminate prejudices pos
20 | Historical racism continues to be reflected in socio-economic inequality neg
21 | Policies adopted as affirmative action have been criticized as a form of referse discrimination neg
22 | Affirmative action perpetuates racial division neg
23 | no one has a legal right to have any demographic characteristic they possess be considered a favorable point on their behalf neg
24 | Affirmative action is counter-productive neg
25 | Affirmative action policies engender animosity toward preferred groups neg
26 | The identification of oppressed classes is difficult to carry out neg
27 | The typical knock out results in a sustained loss of consciousness neg
28 | The idea of multiculturalism had reached the end of its useful life neg
29 | multiculturalism works better in theory than in practice neg
30 | some forms of multiculturalism can divide people neg
31 | Multiculturalism would lead to acceptance of barbaric practices neg
32 | Addicted gamblers cost companies loss of productivity and profit neg
33 | Gamblers may suffer from depression and bankruptcy neg
34 | Compulsive gambling is often very detrimental to personal relationships neg
35 | Abuse is common in homes where pathological gambling is present neg
36 | Internet gambling is a legitimate activity that citizens have the right to engage in pos
37 | Gambling is a popular leisure activity enjoyed in many forms by millions of people pos
38 | electronic funds transfers inherent in online gambling are being exploited by criminal interests neg
39 | everyone has the right to leave or enter a country, along with movement within it pos
40 | Immigrants are considered hostile or alien to the natural culture neg
41 | Monarchy is a check against possible illegal action by politicians pos
42 | Monarchs rule with the intent of improving the lives of their subjects pos
43 | Royals are simply celebrities who should not have any formal role neg
44 | Monarchy encourages a feeling of dependency in many people who should instead have confidence in themselves and their fellow citizens neg
45 | Monarchical prerogative powers can be used to circumvent normal democratic process with no accountability neg
46 | Monarchy devalues intellect and achievement neg
47 | A constitutional monarch with limited powers and non-partisan nature can provide a focus for national unity pos
48 | the monarchy is inherently contrary to egalitarianism and multiculturalism neg
49 | "a republic is ""inevitable" pos
50 | "the Church's ban on condoms has ""caused the death of millions" neg
51 | release of tactical information usually presents a greater risk of casualties among one's own forces neg
52 | the right to freedom of speech is not absolute neg
53 | When resources are put into tobacco production they are taken away from food production neg
54 | tobacco causes cancer neg
55 | Freedom of expression is subject to certain restrictions neg
56 | The free communication of ideas and opinions is one of the most precious of the rights of man pos
57 | Speech can be justifiably suppressed in order to prevent harm from a clear and direct threat neg
58 | liberty of expression 'is not absolute neg
59 | government may not prohibit the expression of an idea simply because society finds the idea offensive or disagreeable neg
60 | Punishment of dangerous or offensive writings is necessary for the preservation of peace and good order pos
61 | Freedom of speech and expression can not be an excuse for distribution of indecent and immoral content neg
62 | censorship violates multiple Basic Human Rights neg
63 | by distributing vouchers to the families of students equal to the tuition that he/she would receive at his/her local public school, a student’s family could then choose from options where best to send their child pos
64 | Vouchers function to increase racial and economic discrimination in schools neg
65 | Earmarks are Good for American Democracy pos
66 | Congressmen tend to distribute specialized benefits at a great cost and ignore the particular costs the legislation bears upon the taxpayers neg
67 | Nuclear weapons decrease the chances of crisis escalation pos
68 | new nuclear states will use their acquired nuclear capabilities to deter threats and preserve peace pos
69 | more countries with nuclear weapons may increase the possibility of nuclear warfare neg
70 | abortion has negative effects on society neg
71 | Intact dilation and extraction is never needed to protect the health of a pregnant woman neg
72 | Certain restrictions on abortion could be used to form a slippery slope against all abortions neg
73 | Abortion, which would involve the deliberate destruction of life, should be rejected neg
74 | In all circumstances, it should be the woman's decision whether or not to terminate a pregnancy pos
75 | the embryo has a right to life pos
76 | it should be illegal for governments to regulate abortion neg
77 | working parents wish their children to be supervised pos
78 | students’ attitudes towards school did significantly increase as they spent more time on a year-round schedule pos
79 | year-round schools showed a substantial gain in academic achievement for at-risk, low performing students pos
80 | The year round schedule provides more opportunities for family vacations pos
81 | Students that attend year round schooling may miss out on experiences neg
82 | Welfare sustains or even creates poverty neg
83 | Safety nets enable households to make productive investments in their future that they may otherwise miss pos
84 | a lower rate of redistribution in a given society increases the inequality found among future incomes neg
85 | a certain amount of redistribution would be justified pos
86 | Because current science can't figure out exactly how life started, it must be God who caused life to start pos
87 | Atheism has been criticized as a faith in itself neg
88 | the most immoral acts in human history were performed by atheists neg
89 | belief in God and religion are social functions, used by those in power to oppress the working class neg
90 | the idea of God implies the abdication of human reason and justice neg
91 | The idea of God necessarily ends in the enslavement of mankind neg
92 | the theism of people throughout most of recorded history and in many different places provides prima facie demonstration of God's existence pos
93 | There is evidence for the existence of a God pos
94 | evolution can explain the apparent design in nature pos
95 | "Natural selection and similar scientific theories are superior to a ""God hypothesis""—the illusion of intelligent design—in explaining the living world and the cosmos" pos
96 | religion is needed to make us behave morally pos
97 | a god created the Universe pos
98 | Separation of older people from active roles in society benefits both society and older individuals pos
99 | personal and technical skills learned in the military will improve later employment prospects in civilian life pos
100 | Professionally-skilled conscripts are difficult to replace in the civilian workforce neg
101 | conscription would not provide adequate protection for the rights of conscientious objectors neg
102 | adequate military strength could be maintained without having conscription neg
103 | unpaid domestic work is just as valuable as paid work pos
104 | property rights encourage their holders to develop their property or generate wealth pos
105 | Patents are criticised as inonsistent with free trade neg
106 | intellectual property rights are essential to maintaining economic growth pos
107 | A positive correlation between the strengthening of the IP system and subsequent economic growth was found pos
108 | To violate intellectual property is no different morally than violating other property rights neg
109 | The cost of trying to enforce copyright is unreasonable neg
110 | all proposed alternatives to copyright protection do not allow for viable business models neg
111 | the very concept of copyright has never benefited society neg
112 | Wind power has gained very high social acceptance pos
113 | Wind energy is a clean energy source pos
114 | wind energy is one of the most cost efficient sources of renewable energy pos
115 | wind power is dependent on weather systems neg
116 | Wind power produces no greenhouse gas emissions during operation pos
117 | Wind power uses little land pos
118 | any effects on the environment from wind power are generally much less problematic than those of any other power source pos
119 | Wind power has low ongoing costs pos
120 | Wind projects revitalize the economy of rural communities pos
121 | The cost of repairing damaged ecosystems is considered to be much higher than the cost of conserving natural ecosystems pos
122 | People who live close to nature can be dependent on the survival of all the species in their environment pos
123 | "since species become extinct ""all the time"" the disappearance of a few more will not destroy the ecosystem" pos
124 | rapid rates of biodiversity loss threatens the sustained well-being of humanity neg
125 | Biodiversity is directly involved in water purification pos
126 | Austerity programs tend to have an impact on the poorest segments of the population neg
127 | the right to bear arms is absolute and unqualified pos
128 | an armed citizens' militia can help deter crime and tyranny pos
129 | arms allow for successful rebellions against tyranny pos
130 | The possibility of getting shot by an armed victim is a substantial deterrent to crime pos
131 | widespread gun ownership is protection against tyranny pos
132 | Bribery encourages rent seeking behaviour neg
133 | In some cases where the system of law is not well-implemented, bribes may be a way for companies to continue their businesses pos
134 | Corruption undermines the legitimacy of government and democratic values neg
135 | Availability of bribes can induce officials to contrive new rules and delays neg
136 | Corruption favors the most connected and unscrupulous, rather than the efficient neg
137 | The production of bitumen and synthetic crude oil emits more greenhouse gases than the production of conventional crude oil neg
138 | During Operation Cast Lead, the Israeli Defense Forces did more to safeguard the rights of civilians in a combat zone than any other army in the history of warfare pos
139 | Israeli citizens in the south have been suffering from rockets being fired at them neg
140 | the Israeli Defence Force breached laws of armed conflict by attacking indiscriminately civilians neg
141 | Israel's UAV attacks were a violation of International Humanitarian Law neg
142 | Israel was acting in self-defence pos
143 | Israelis have been killed by the unlawful rocket and mortar attacks from Gaza neg
144 | The repeated firing of rockets by Hamas endangers the lives of both Israeli and Palestinian civilians neg
145 | the military solution won't conduct to peace neg
146 | nothing justifies the suffering inflicted to civilian populations who live trapped in the Gaza strip neg
147 | This cycle of violence and retaliation impedes efforts to broker lasting peace in the region neg
148 | High-rise structures also pose serious challenges to firefighters during emergencies neg
149 | Tower Blocks grew a reputation for being undesirable low cost housing neg
150 | many tower blocks saw rising crime levels neg
151 | Tower blocks may be inherently more prone to casualties from a fire neg
152 | Tower blocks can hold thousands of families in a single building pos
153 | The right to freedom of thought and expression, sanctioned by the Declaration of the Rights of Man, cannot imply the right to offend the religious sentiment of believers neg
154 | In all societies there is a need to show sensitivity and responsibility in treating issues of special significance for the adherents of any particular faith pos
155 | the Israeli blockade and closures had pushed the Palestinian economy into a stage of de-development neg
156 | The blockade is a collective form of punishment on a civilian population neg
157 | The purpose of the blockade is to pressure Hamas into ending the rocket attacks and to deprive them of the supplies necessary for the continuation of rocket attacks pos
158 | "the blockade of Gaza is causing ""unacceptable suffering" neg
159 | The blockade is possibly a crime against humanity neg
160 | "all that is being achieved through the blockade is to ""enrich Hamas and marginalize even further the voices of moderation" neg
161 | Israel's blockade of the Gaza Strip was described as totally intolerable and unacceptable in the 21st century neg
162 | Israel restricts the ability for the Palestinian authority to exercise control neg
163 | Gaza was blockaded by Israel in response to the rocket and mortar attacks by Hamas and other militant groups operating inside Gaza pos
164 | the purpose of the restrictions in import of goods into Gaza are to pressure Hamas, which does not recognise Israel and backs attacks on its citizens pos
165 | The stated purpose of the blockade was to pressure Hamas into ending the rocket attacks pos
166 | Israel is not legally responsible for Gaza and not obliged to help a hostile territory beyond whatever is necessary to avoid a humanitarian crisis pos
167 | The Gaza blockade inflicted excessive damage to the civilian population in relation to the concrete military advantage expected neg
168 | Holocaust denial is a convenient polemical substitute for anti-semitism neg
169 | Open-source-appropriate technology built in continuous peer-review can result in better quality pos
170 | wide availability results in increased scrutiny of the source code, making open source software more secure pos
171 | With free software, businesses can fit software to their specific needs by changing the software pos
172 | proprietary software is unethical and unjust neg
173 | software produced in this fashion may lack standardization and be incompatible with various computer applications neg
174 | corruption is more prevalent in non-privatized sectors neg
175 | A state-monopolized function is prone to corruption neg
176 | certain public goods and services should remain primarily in the hands of government in order to ensure that everyone in society has access to them pos
177 | Academics are demoralized by government interference with admissions procedures neg
178 | Dependence on government funding has had disastrous effects on the higher education sector in continental Europe neg
179 | ASEAN Way has recently proven itself relatively successful in the settlements of disputes by peaceful manner pos
180 | "the importance of the ""wait to have sex"" message ends up being lost when programs demonstrate and encourage the use of contraception" neg
181 | abstinence-only sex ed and conservative moralizing will only alienate students neg
182 | sex education needs to be comprehensive to be effective pos
183 | Abstinence-only programs delay the initiation of sex pos
184 | abstinence-only education is ineffective neg
185 | abstinence-only-until-marriage programs are ineffective neg
186 | Comprehensive sex education covers abstinence as a positive choice, but also teaches about contraception and avoidance of STIs pos
187 | all abstinence programs are ineffective neg
188 | Abstinence until marriage is the most effective way to avoid HIV infection pos
189 | Providing safe-sex education promotes promiscuity neg
190 | Countries with conservative attitudes towards sex education have a higher incidence of STIs and teenage pregnancy neg
191 | sexual abstinence in teenagers decreases the risk of contracting STDs and having children outside marriage pos
192 | "voters ""overwhelmingly support term limits" pos
193 | Lawmakers are in gridlock because of becoming locked into entrenched positions over time neg
194 | Freed from political considerations related to re-election, lawmakers would be more free to vote on the merits pos
195 | Flag burning is protected by the First Amendment pos
196 | Americans oppose amending the constitution to outlaw flag burning neg
197 | The First Amendment reaches non-speech acts pos
198 | Flag burning tends to incite breaches of the peace neg
199 | laws against flag-burning are constitutional pos
200 | the BCS routinely involves controversy about which two teams are the best in the nation neg
201 | if everyone is left to their own economic devices instead of being controlled by the state, then the result would be a harmonious and more equal society pos
202 | the state has a legitimate role in providing public goods pos
203 | Financial liberalization and privatization coincide with democratization pos
204 | wages are naturally driven down in free market systems neg
205 | free markets usually fail to deal with the problem of externalities neg
206 | free market relationships are considered as structured upon coercion neg
207 | free trade gives optimal economic advantages pos
208 | Free trade allows maximum exploitation of workers by capital neg
209 | The justification for central planning is that the consolidation of economic resources can allow for the economy to take advantage of more perfect information when making decisions regarding investment and production pos
210 | market economies are inherently stable if left to themselves pos
211 | competition leads to innovation and more affordable prices pos
212 | "the state has a role to play in the economy, and specifically, in creating a ""safety net" pos
213 | free markets have the potential to free states from the looming prospect of recurrent warfare pos
214 | competition in the free market is more effective than the regulation of industry pos
215 | markets have inefficient outcomes neg
216 | no coercive monopoly of force can arise on a truly free market pos
217 | Unfettered markets are the best means to economic growth pos
218 | The regulation of markets is widely acknowledged as important to safeguard social and environmental values pos
219 | what starts as temporary governmental fixes usually become permanent and expanding government programs, which stifle the private sector and civil society neg
220 | left to its devices, the market will adjust efficiently pos
221 | Advertising increasingly invades public spaces neg
222 | Advertising has been criticized as inadvertently or even intentionally promoting sexism neg
223 | advertising directed at young children is per se manipulative neg
224 | the advertising techniques used to create consumer behaviour amount to the destruction of psychic and collective individuation neg
225 | it is inherently immoral to bring people into the world neg
226 | there ought to be a higher rate of population growth than what is currently mainstream pos
227 | the birth of a new person always entails nontrivial harm to that person neg
228 | the best thing for Earth's biosphere is for humans to voluntarily cease reproducing pos
229 | Voluntary human extinction will prevent human suffering and the extinction of other species pos
230 | non-reproduction would eventually allow humans to lead idyllic lifestyles pos
231 | attempting to reduce the Earth's population is the only moral option pos
232 | voluntary human extinction is advisable due to limited resources pos
233 | the decision not to procreate at all could be regarded as immoral neg
234 | Dam construction often leads to abuses of the masses by planners neg
235 | people worldwide have been physically displaced from their homes as a result of dam construction neg
236 | Dam failures are generally catastrophic neg
237 | In many reservoir construction projects people have to be moved and re-housed neg
238 | Farms and villages can be flooded by the creation of reservoirs, ruining many livelihoods neg
239 | once the renewable infrastructure is built, the fuel is free forever pos
240 | Large hydropower provides one of the lowest cost options in today’s energy market pos
241 | Development of large-scale hydroelectric power has environmental impacts neg
242 | The filling of large reservoirs can induce earth tremors, which may be large enough to be objectionable or destructive neg
243 | Hydroelectric dams with large reserviors can be operated to provide peak generation at times of peak demand pos
244 | Hydro plants are able to act as load following power plants pos
245 | Integrating ever-higher levels of renewables is being successfully demonstrated in the real world pos
246 | There are no harmful emissions associated with hydroelectric plant operation pos
247 | Truth commissions are sometimes criticised for allowing crimes to go unpunished, and creating impunity for serious human rights abusers neg
248 | victims and communities affected by past crimes have the right to know the identity of suspected perpetrators pos
249 | Restorative approaches seek a balanced approach to the needs of the victim, wrongdoer and community through processes that preserve the safety and dignity of all pos
250 | democracies have less internal systematic violence pos
251 | the more democratic a regime, the less its democide pos
252 | democracy causes peace pos
253 | democracies treat each other with trust and respect even during crises pos
254 | the best strategy to ensure our security and to build a durable peace is to support the advance of democracy elsewhere pos
255 | Democracy is economically inefficient neg
256 | the benefits of a specialised society may be compromised by democracy neg
257 | A majority bullying a minority is just as bad as a dictator, communist or otherwise, doing so neg
258 | democracy would bring peace pos
259 | democracy leads to less internal violence and mass murder by the government pos
260 | only in a democracy the citizens can have a share in freedom pos
261 |
--------------------------------------------------------------------------------