├── .bandit
├── requirements-test.txt
├── docs
│   ├── swagger-screenshot.png
│   └── deploy-max-to-ibm-cloud-with-kubernetes-button.png
├── requirements.txt
├── training
│   └── README.md
├── .dockerignore
├── samples
│   ├── README.md
│   └── test-examples.tsv
├── core
│   ├── __init__.py
│   ├── model.py
│   └── bert
│       ├── run_classifier.py
│       └── tokenization.py
├── api
│   ├── __init__.py
│   ├── metadata.py
│   └── predict.py
├── sha512sums.txt
├── max-text-sentiment-classifier.yaml
├── app.py
├── .travis.yml
├── tests
│   ├── training
│   │   └── test_sample_training_response.py
│   ├── test_api.py
│   └── test_response.py
├── config.py
├── Dockerfile
├── .gitignore
├── README.md
└── LICENSE

/.bandit:
--------------------------------------------------------------------------------
1 | [bandit]
2 | exclude: /tests,/training
--------------------------------------------------------------------------------
/requirements-test.txt:
--------------------------------------------------------------------------------
1 | pytest==6.1.2
2 | requests==2.25.0
3 | flake8==3.8.4
4 | bandit==1.6.2
--------------------------------------------------------------------------------
/docs/swagger-screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/MAX-Text-Sentiment-Classifier/HEAD/docs/swagger-screenshot.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # python3.6
2 | tensorflow == 2.5.0  # CPU version of TensorFlow.
3 | # tensorflow-gpu == 2.5.0  # GPU version of TensorFlow.
--------------------------------------------------------------------------------
/docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/MAX-Text-Sentiment-Classifier/HEAD/docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png
--------------------------------------------------------------------------------
/training/README.md:
--------------------------------------------------------------------------------
1 | Support for training was removed with the move to TensorFlow v2. If you are interested in training, check out the last TensorFlow v1 version of the training script at https://github.com/IBM/MAX-Text-Sentiment-Classifier/tree/v2.1.0/training .
--------------------------------------------------------------------------------
/.dockerignore:
--------------------------------------------------------------------------------
1 | training/
2 | README.*
3 | .idea/
4 | .git/
5 | .gitignore
6 | tests/
7 | .pytest_cache
8 | assets/.pytest_cache
9 | venv/
10 | assets/sentiment_BERT_base_uncased/
11 | assets/assets.tar.gz
12 | assets/sentiment_BERT_base_uncased.zip
13 | benchmark_model/
14 | docs/
--------------------------------------------------------------------------------
/samples/README.md:
--------------------------------------------------------------------------------
1 | # Sample Details
2 |
3 | ## Test Examples
4 |
5 | The tab-separated-values file `test-examples.tsv` in the `samples` folder contains a fraction of the [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Claim%20Stance) ([CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)) that was not used for fine-tuning. The first column lists the claim; the second column lists the corresponding sentiment label (`pos` or `neg`).
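As an illustration, the following sketch (not part of the repository; it assumes the file has no header row and that the model API described in the main README is running locally on port 5000) reads the file and scores the first few claims:

```python
import csv
import requests

# Read (claim, sentiment) pairs from the tab-separated file
with open('test-examples.tsv', newline='') as f:
    rows = [(claim, sentiment) for claim, sentiment in csv.reader(f, delimiter='\t')]

# Send the first five claims to the locally running model API
claims = [claim for claim, _ in rows[:5]]
r = requests.post('http://localhost:5000/model/predict', json={'text': claims})
r.raise_for_status()

# Print the labelled sentiment next to the predicted probabilities
for (claim, sentiment), probs in zip(rows[:5], r.json()['predictions']):
    print(f"{sentiment}\t{probs}\t{claim[:60]}")
```

(Note that the label names returned by the API come from `assets/labels.txt` and may differ from the `pos`/`neg` shorthand used in this file.)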
Claims in this file may be used to try out and benchmark the performance of this model. 6 | -------------------------------------------------------------------------------- /core/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | -------------------------------------------------------------------------------- /api/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | from .metadata import ModelMetadataAPI # noqa 18 | from .predict import ModelPredictAPI # noqa -------------------------------------------------------------------------------- /sha512sums.txt: -------------------------------------------------------------------------------- 1 | 4e9ed575594d83d53af2726926239df2a5e85d0fc2884238a512627ad56ffd243f92dc75dcc2c64c716b9556b177557528cdf143fd10d2cf517289076028aaaf assets/labels.txt 2 | 296397595b1fcedd3a37464d3aa14a57526820d2ff96795eef533c634173ab744a0309b4431b7918089b424fdc0962d78082e19936b31a8565e6b6e8413f7dbe assets/saved_model.pb 3 | f51f06aae7f580a88998f6f7f24b52495c8d3d289fdbdc21231c05f3a8965783074d95c17b819186f9a63b622280e8a051105a2161cd0d153fa57db7a0aba9f9 assets/vocab.txt 4 | c58d6f1107456635fc403caede31eaf831b985c61429e85eea3d3edc8281a6af09ea63d0986c4d110ae90547aeb6d312c75a66280723f614ce6246a353b58626 assets/variables/variables.data-00000-of-00001 5 | 42dc3a7620e8a712065ae7bc6973e654cb4e515dffd3dc8289b90571cac578c1afbc8cb558b44b268da7be56e5b5045380949459274736bdaf566f207970e795 assets/variables/variables.index 6 | -------------------------------------------------------------------------------- /max-text-sentiment-classifier.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Service 3 | metadata: 4 | name: max-text-sentiment-classifier 5 | spec: 6 | selector: 7 | app: max-text-sentiment-classifier 8 | ports: 9 | - port: 5000 10 | type: NodePort 11 | --- 12 | apiVersion: apps/v1 13 | kind: Deployment 14 | metadata: 15 | name: max-text-sentiment-classifier 16 | labels: 17 | app: max-text-sentiment-classifier 18 | spec: 19 | selector: 20 | matchLabels: 21 | app: max-text-sentiment-classifier 22 | replicas: 1 23 | template: 24 | metadata: 25 | labels: 26 | app: max-text-sentiment-classifier 27 | spec: 28 | containers: 29 | - name: max-text-sentiment-classifier 30 | image: quay.io/codait/max-text-sentiment-classifier:latest 31 | ports: 32 | - containerPort: 5000 33 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from maxfw.core import MAXApp 18 | from api import ModelMetadataAPI, ModelPredictAPI 19 | from config import API_TITLE, API_DESC, API_VERSION 20 | 21 | max = MAXApp(API_TITLE, API_DESC, API_VERSION) 22 | max.add_api(ModelMetadataAPI, '/metadata') 23 | max.add_api(ModelPredictAPI, '/predict') 24 | max.run() 25 | -------------------------------------------------------------------------------- /api/metadata.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | from core.model import ModelWrapper 18 | from maxfw.core import MAX_API, MetadataAPI, METADATA_SCHEMA 19 | 20 | 21 | class ModelMetadataAPI(MetadataAPI): 22 | 23 | @MAX_API.marshal_with(METADATA_SCHEMA) 24 | def get(self): 25 | """Return the metadata associated with the model""" 26 | return ModelWrapper.MODEL_META_DATA 27 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | language: python 18 | python: 19 | - 3.6 20 | services: 21 | - docker 22 | install: 23 | - docker build -t max-text-sentiment-classifier . 24 | - docker run -it -d --rm -p 5000:5000 max-text-sentiment-classifier 25 | - pip install -r requirements-test.txt 26 | before_script: 27 | - flake8 . --max-line-length=127 28 | - bandit -r . 29 | - sleep 30 30 | script: 31 | - pytest tests/test_api.py 32 | - pytest tests/test_response.py 33 | -------------------------------------------------------------------------------- /tests/training/test_sample_training_response.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | import pytest 18 | import requests 19 | 20 | 21 | def test_response(): 22 | 23 | # test code 200 24 | model_endpoint = 'http://localhost:5000/model/predict' 25 | 26 | json_data = { 27 | "text": ["good string", 28 | "bad string"] 29 | } 30 | 31 | r = requests.post(url=model_endpoint, json=json_data) 32 | 33 | assert r.status_code == 200 34 | response = r.json() 35 | assert response['status'] == 'ok' 36 | 37 | # test whether the labels have changed 38 | assert 'pos' in response['predictions'][0].keys() 39 | assert 'neg' in response['predictions'][0].keys() 40 | 41 | 42 | if __name__ == '__main__': 43 | pytest.main([__file__]) 44 | -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | # Flask settings 18 | DEBUG = False 19 | 20 | # Flask-restplus settings 21 | RESTPLUS_MASK_SWAGGER = False 22 | SWAGGER_UI_DOC_EXPANSION = 'none' 23 | 24 | # API metadata 25 | API_TITLE = 'MAX Text Sentiment Classifier' 26 | API_DESC = 'Detect the sentiment captured in short pieces of text. ' \ 27 | 'The model was finetuned on the IBM Project Debater Claim Sentiment dataset.' 28 | API_VERSION = '2.0.0' 29 | 30 | # default model 31 | DEFAULT_MODEL_PATH = 'assets' 32 | 33 | # the metadata of the model 34 | MODEL_META_DATA = { 35 | 'id': 'max-text-sentiment-classifier', 36 | 'name': 'MAX Text Sentiment Classifier', 37 | 'description': 'BERT Base finetuned on the IBM Project Debater Claim Sentiment dataset.', 38 | 'type': 'Text Classification', 39 | 'source': 'https://developer.ibm.com/exchanges/models/all/max-text-sentiment-classifier/', 40 | 'license': 'Apache V2' 41 | } 42 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | #
16 |
17 | FROM quay.io/codait/max-base:v1.4.0
18 |
19 | ARG model_bucket=https://codait-cos-max.s3.us.cloud-object-storage.appdomain.cloud/max-text-sentiment-classifier/1.2.0
20 | ARG model_file=assets.tar.gz
21 |
22 | ARG use_pre_trained_model=true
23 |
24 | RUN if [ "$use_pre_trained_model" = "true" ] ; then \
25 |       # download pre-trained model artifacts from Cloud Object Storage
26 |       wget -nv --show-progress --progress=bar:force:noscroll ${model_bucket}/${model_file} --output-document=assets/${model_file} && \
27 |       tar -x -C assets/ -f assets/${model_file} -v && rm assets/${model_file} ; \
28 |     fi
29 |
30 | COPY requirements.txt .
31 | RUN pip install -r requirements.txt
32 |
33 | COPY . .
34 |
35 | RUN if [ "$use_pre_trained_model" = "true" ] ; then \
36 |       # validate downloaded pre-trained model assets
37 |       sha512sum -c sha512sums.txt ; \
38 |     else \
39 |       # rename the directory that contains the custom-trained model artifacts
40 |       if [ -d "./custom_assets/" ] ; then \
41 |         rm -rf ./assets && ln -s ./custom_assets ./assets ; \
42 |       fi ; \
43 |     fi
44 |
45 | EXPOSE 5000
46 |
47 | CMD python app.py
48 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *-model-building-code.zip
2 | .idea/
3 | # Byte-compiled / optimized / DLL files
4 | __pycache__/
5 | *.py[cod]
6 | *$py.class
7 |
8 | # C extensions
9 | *.so
10 |
11 | # Distribution / packaging
12 | .Python
13 | build/
14 | develop-eggs/
15 | dist/
16 | downloads/
17 | eggs/
18 | .eggs/
19 | lib/
20 | lib64/
21 | parts/
22 | sdist/
23 | var/
24 | wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | .hypothesis/
50 | .pytest_cache/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 | db.sqlite3
60 |
61 | # Flask stuff:
62 | instance/
63 | .webassets-cache
64 |
65 | # Scrapy stuff:
66 | .scrapy
67 |
68 | # Sphinx documentation
69 | docs/_build/
70 |
71 | # PyBuilder
72 | target/
73 |
74 | # Jupyter Notebook
75 | .ipynb_checkpoints
76 |
77 | # pyenv
78 | .python-version
79 |
80 | # celery beat schedule file
81 | celerybeat-schedule
82 |
83 | # SageMath parsed files
84 | *.sage.py
85 |
86 | # Environments
87 | .env
88 | .venv
89 | env/
90 | venv/
91 | ENV/
92 | env.bak/
93 | venv.bak/
94 |
95 | # Spyder project settings
96 | .spyderproject
97 | .spyproject
98 |
99 | # Rope project settings
100 | .ropeproject
101 |
102 | # mkdocs documentation
103 | /site
104 |
105 | # mypy
106 | .mypy_cache/
--------------------------------------------------------------------------------
/tests/test_api.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # 16 | 17 | import pytest 18 | import requests 19 | 20 | 21 | def test_swagger(): 22 | 23 | model_endpoint = 'http://localhost:5000/swagger.json' 24 | 25 | r = requests.get(url=model_endpoint) 26 | assert r.status_code == 200 27 | assert r.headers['Content-Type'] == 'application/json' 28 | 29 | json = r.json() 30 | assert 'swagger' in json 31 | assert json.get('info') and json.get('info').get('title') == 'MAX Text Sentiment Classifier' 32 | 33 | 34 | def test_metadata(): 35 | 36 | model_endpoint = 'http://localhost:5000/model/metadata' 37 | 38 | r = requests.get(url=model_endpoint) 39 | assert r.status_code == 200 40 | 41 | metadata = r.json() 42 | assert metadata['id'] == 'max-text-sentiment-classifier' 43 | assert metadata['name'] == 'MAX Text Sentiment Classifier' 44 | assert metadata['description'] == 'BERT Base finetuned on the IBM Project Debater Claim Sentiment dataset.' 45 | assert metadata['license'] == 'Apache V2' 46 | assert metadata['type'] == 'Text Classification' 47 | assert 'developer.ibm.com' in metadata['source'] 48 | 49 | 50 | if __name__ == '__main__': 51 | pytest.main([__file__]) 52 | -------------------------------------------------------------------------------- /api/predict.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | #
16 |
17 | from core.model import ModelWrapper
18 | from maxfw.core import MAX_API, PredictAPI
19 | from flask_restx import fields
20 | from flask import abort
21 |
22 | # Define the expected input model (https://flask-restx.readthedocs.io/en/latest/marshalling.html)
23 | input_parser = MAX_API.model('ModelInput', {
24 |     'text': fields.List(fields.String, required=True,
25 |                         description='List of claims (strings) to be analyzed for either a positive or negative sentiment.')
26 | })
27 |
28 | with open('assets/labels.txt', 'r') as f:
29 |     class_labels = [x.strip() for x in f]
30 |
31 | # Creating a JSON response model: https://flask-restx.readthedocs.io/en/latest/marshalling.html#the-api-model-factory
32 | label_prediction = MAX_API.model('LabelPrediction',
33 |                                  {label: fields.Float(required=True, description='Class probability') for label in class_labels})
34 |
35 | predict_response = MAX_API.model('ModelPredictResponse', {
36 |     'status': fields.String(required=True, description='Response status message'),
37 |     'predictions': fields.List(fields.Nested(label_prediction), description='Predicted labels and probabilities')
38 | })
39 |
40 |
41 | class ModelPredictAPI(PredictAPI):
42 |
43 |     model_wrapper = ModelWrapper()
44 |
45 |     @MAX_API.doc('predict')
46 |     @MAX_API.expect(input_parser)
47 |     @MAX_API.marshal_with(predict_response)
48 |     def post(self):
49 |         """Make a prediction given input data"""
50 |         result = {'status': 'error'}
51 |
52 |         input_json = MAX_API.payload
53 |
54 |         try:
55 |             preds = self.model_wrapper.predict(input_json['text'])
56 |         except Exception:
57 |             abort(400, "Please supply a valid input json. "
58 |                        "The json structure should have a 'text' field containing a list of strings")
59 |
60 |         # Generate the output format for every input string
61 |         output = [{label: p[i] for i, label in enumerate(class_labels)} for p in preds]
62 |
63 |         result['predictions'] = output
64 |         result['status'] = 'ok'
65 |
66 |         return result
--------------------------------------------------------------------------------
/tests/test_response.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | #     http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | # 16 | 17 | import pytest 18 | import requests 19 | 20 | 21 | def test_response(): 22 | model_endpoint = 'http://localhost:5000/model/predict' 23 | 24 | json_data = { 25 | "text": ["good string", 26 | "bad string"] 27 | } 28 | 29 | r = requests.post(url=model_endpoint, json=json_data) 30 | 31 | assert r.status_code == 200 32 | response = r.json() 33 | assert response['status'] == 'ok' 34 | 35 | # verify that 'good string' is in fact positive 36 | assert round(float(response['predictions'][0]['positive'])) == 1 37 | # verify that 'bad string' is in fact negative 38 | assert round(float(response['predictions'][1]['negative'])) == 1 39 | 40 | json_data2 = { 41 | "text": [ 42 | "2008 was a dark, dark year for stock markets worldwide.", 43 | "The Model Asset Exchange is a crucial element of a developer's toolkit." 44 | ] 45 | } 46 | 47 | r = requests.post(url=model_endpoint, json=json_data2) 48 | 49 | assert r.status_code == 200 50 | response = r.json() 51 | assert response['status'] == 'ok' 52 | 53 | # verify that "2008 was a dark, dark year for stock markets worldwide." is in fact negative 54 | assert round(float(response['predictions'][0]['positive'])) == 0 55 | assert round(float(response['predictions'][0]['negative'])) == 1 56 | # verify that "The Model Asset Exchange is a crucial element of a developer's toolkit." is in fact positive 57 | assert round(float(response['predictions'][1]['negative'])) == 0 58 | assert round(float(response['predictions'][1]['positive'])) == 1 59 | 60 | # Test different input batch sizes 61 | for input_size in [4, 16, 32, 64, 75]: 62 | json_data3 = { 63 | "text": ["good string"]*input_size 64 | } 65 | 66 | r = requests.post(url=model_endpoint, json=json_data3) 67 | 68 | assert r.status_code == 200 69 | response = r.json() 70 | assert response['status'] == 'ok' 71 | 72 | assert len(response['predictions']) == len(json_data3["text"]) 73 | 74 | 75 | if __name__ == '__main__': 76 | pytest.main([__file__]) 77 | -------------------------------------------------------------------------------- /core/model.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | #
16 |
17 | from maxfw.model import MAXModelWrapper
18 |
19 | import logging
20 | from config import DEFAULT_MODEL_PATH, MODEL_META_DATA as model_meta
21 |
22 | from core.bert.run_classifier import convert_single_example, MAXAPIProcessor
23 | from core.bert import tokenization
24 | from tensorflow.python.saved_model import tag_constants
25 | import tensorflow as tf
26 | import numpy as np
27 |
28 | logger = logging.getLogger()
29 |
30 |
31 | class ModelWrapper(MAXModelWrapper):
32 |
33 |     MODEL_META_DATA = model_meta
34 |
35 |     def __init__(self, path=DEFAULT_MODEL_PATH):
36 |         tf.compat.v1.disable_v2_behavior()
37 |
38 |         logger.info('Loading model from: {}...'.format(path))
39 |
40 |         self.max_seq_length = 128
41 |         self.do_lower_case = True
42 |
43 |         # Set logging verbosity
44 |         tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
45 |
46 |         # Load the TF graph from `path` (not the hard-coded default), so custom model locations work
47 |         self.graph = tf.Graph()
48 |         self.sess = tf.compat.v1.Session(graph=self.graph)
49 |         tf.compat.v1.saved_model.loader.load(self.sess, [tag_constants.SERVING], path)
50 |
51 |         # Validate init_checkpoint
52 |         tokenization.validate_case_matches_checkpoint(self.do_lower_case,
53 |                                                       path)
54 |
55 |         # Initialize the data processor
56 |         self.processor = MAXAPIProcessor()
57 |
58 |         # Get the labels
59 |         self.label_list = self.processor.get_labels()
60 |
61 |         # Initialize the tokenizer
62 |         self.tokenizer = tokenization.FullTokenizer(
63 |             vocab_file=f'{path}/vocab.txt', do_lower_case=self.do_lower_case)
64 |
65 |         logger.info('Loaded model')
66 |
67 |     def _pre_process(self, x):
68 |         '''Preprocessing of the input is not required as it is carried out by the BERT model (tokenization).'''
69 |         return x
70 |
71 |     def _post_process(self, result):
72 |         '''Reformat the results if needed.'''
73 |         return result
74 |
75 |     def _predict(self, x, predict_batch_size=32):
76 |         '''Predict the class probabilities using the BERT model.'''
77 |
78 |         # Get the input examples
79 |         predict_examples = self.processor.get_test_examples(x)
80 |
81 |         # Grab the input tensors of the graph
82 |         tensor_input_ids = self.sess.graph.get_tensor_by_name('input_ids_1:0')
83 |         tensor_input_mask = self.sess.graph.get_tensor_by_name('input_mask_1:0')
84 |         tensor_label_ids = self.sess.graph.get_tensor_by_name('label_ids_1:0')
85 |         tensor_segment_ids = self.sess.graph.get_tensor_by_name('segment_ids_1:0')
86 |         tensor_outputs = self.sess.graph.get_tensor_by_name('loss/Softmax:0')
87 |
88 |         # Convert the examples to features and predict in batches:
89 |         # go over all examples in chunks of size `predict_batch_size`.
90 |         predictions = []
91 |
92 |         for i in range(0, len(predict_examples), predict_batch_size):
93 |             examples = predict_examples[i:i + predict_batch_size]
94 |
95 |             tf.compat.v1.logging.info(
96 |                 f"{i} out of {len(predict_examples)} examples done ({round(i * 100 / len(predict_examples))}%).")
97 |
98 |             # Convert the examples in this chunk to features.
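            # (Each convert_single_example call below returns a
            # (input_ids, input_mask, label_id, segment_ids) tuple whose id
            # lists are padded to max_seq_length; zip(*...) transposes the
            # per-example tuples into one sequence per field, ready to be
            # stacked into the feed_dict arrays.)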
99 |             input_ids, input_mask, label_ids, segment_ids = zip(
100 |                 *tuple(convert_single_example(i + j, example, self.label_list, self.max_seq_length, self.tokenizer)
101 |                        for j, example in enumerate(examples)))
102 |
103 |             # Convert to a format that is consistent with the input tensors
104 |             feed_dict = {}
105 |             feed_dict[tensor_input_ids] = np.vstack(
106 |                 tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in input_ids))
107 |             feed_dict[tensor_input_mask] = np.vstack(
108 |                 tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in input_mask))
109 |             feed_dict[tensor_label_ids] = np.vstack(
110 |                 tuple(np.array(arr) for arr in label_ids)).flatten()
111 |             feed_dict[tensor_segment_ids] = np.vstack(
112 |                 tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in segment_ids))
113 |
114 |             # Make a prediction
115 |             result = self.sess.run(tensor_outputs, feed_dict=feed_dict)
116 |             # Collect the predictions for this batch
117 |             predictions.extend(p.tolist() for p in result)
118 |
119 |         return predictions
--------------------------------------------------------------------------------
/core/bert/run_classifier.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | #     http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """BERT finetuning runner."""
16 |
17 | import csv
18 | from core.bert import tokenization
19 | import tensorflow as tf
20 |
21 |
22 | class InputExample(object):
23 |   """A single training/test example for simple sequence classification."""
24 |
25 |   def __init__(self, guid, text_a, text_b=None, label=None):
26 |     """Constructs an InputExample.
27 |
28 |     Args:
29 |       guid: Unique id for the example.
30 |       text_a: string. The untokenized text of the first sequence. For single
31 |         sequence tasks, only this sequence must be specified.
32 |       text_b: (Optional) string. The untokenized text of the second sequence.
33 |         Must only be specified for sequence pair tasks.
34 |       label: (Optional) string. The label of the example. This should be
35 |         specified for train and dev examples, but not for test examples.
36 |     """
37 |     self.guid = guid
38 |     self.text_a = text_a
39 |     self.text_b = text_b
40 |     self.label = label
41 |
42 |
43 | class PaddingInputExample(object):
44 |   """Fake example so the num input examples is a multiple of the batch size.
45 |
46 |   When running eval/predict on the TPU, we need to pad the number of examples
47 |   to be a multiple of the batch size, because the TPU requires a fixed batch
48 |   size. The alternative is to drop the last batch, which is bad because it means
49 |   the entire output data won't be generated.
50 |
51 |   We use this class instead of `None` because treating `None` as padding
52 |   batches could cause silent errors.
53 | """ 54 | 55 | 56 | class InputFeatures(object): 57 | """A single set of features of data.""" 58 | 59 | def __init__(self, 60 | input_ids, 61 | input_mask, 62 | segment_ids, 63 | label_id, 64 | is_real_example=True): 65 | self.input_ids = input_ids 66 | self.input_mask = input_mask 67 | self.segment_ids = segment_ids 68 | self.label_id = label_id 69 | self.is_real_example = is_real_example 70 | 71 | 72 | class DataProcessor(object): 73 | """Base class for data converters for sequence classification data sets.""" 74 | 75 | def get_train_examples(self, data_dir): 76 | """Gets a collection of `InputExample`s for the train set.""" 77 | raise NotImplementedError() 78 | 79 | def get_dev_examples(self, data_dir): 80 | """Gets a collection of `InputExample`s for the dev set.""" 81 | raise NotImplementedError() 82 | 83 | def get_test_examples(self, data_dir): 84 | """Gets a collection of `InputExample`s for prediction.""" 85 | raise NotImplementedError() 86 | 87 | def get_labels(self): 88 | """Gets the list of labels for this data set.""" 89 | raise NotImplementedError() 90 | 91 | @classmethod 92 | def _read_tsv(cls, input_file, quotechar=None): 93 | """Reads a tab separated value file.""" 94 | with tf.gfile.Open(input_file, "r") as f: 95 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 96 | lines = [] 97 | for line in reader: 98 | lines.append(line) 99 | return lines 100 | 101 | @classmethod 102 | def _read_csv(cls, input_file, quotechar=None): 103 | """Reads a comma separated value file.""" 104 | with tf.gfile.Open(input_file, "r") as f: 105 | reader = csv.reader(f, delimiter=",", quotechar=quotechar) 106 | lines = [] 107 | for line in reader: 108 | lines.append(line) 109 | return lines 110 | 111 | 112 | class MAXAPIProcessor(DataProcessor): 113 | """Custom Data Processor for the MAX API.""" 114 | 115 | def get_test_examples(self, test_data): 116 | """See base class.""" 117 | 118 | # Verify that the input is a list of strings 119 | if type(test_data) != list: 120 | raise TypeError("'test_data' is type %r" % type(test_data)) 121 | # Create InputExample objects from the input data 122 | return self._create_examples(test_data, "test") 123 | 124 | def get_labels(self): 125 | """See base class.""" 126 | return ["pos", "neg"] 127 | 128 | def _create_examples(self, lines, set_type): 129 | """Creates examples for the training and dev sets.""" 130 | examples = [] 131 | for (i, line) in enumerate(lines): 132 | guid = "%s-%s" % (set_type, i) 133 | text_a = tokenization.convert_to_unicode(line) 134 | label = self.get_labels()[0] 135 | examples.append( 136 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 137 | return examples 138 | 139 | 140 | def convert_single_example(ex_index, example, label_list, max_seq_length, 141 | tokenizer): 142 | """Converts a single `InputExample` into a single `InputFeatures`.""" 143 | 144 | if isinstance(example, PaddingInputExample): 145 | return InputFeatures( 146 | input_ids=[0] * max_seq_length, 147 | input_mask=[0] * max_seq_length, 148 | segment_ids=[0] * max_seq_length, 149 | label_id=0, 150 | is_real_example=False) 151 | 152 | label_map = {} 153 | for (i, label) in enumerate(label_list): 154 | label_map[label] = i 155 | 156 | tokens_a = tokenizer.tokenize(example.text_a) 157 | tokens_b = None 158 | if example.text_b: 159 | tokens_b = tokenizer.tokenize(example.text_b) 160 | 161 | if tokens_b: 162 | # Modifies `tokens_a` and `tokens_b` in place so that the total 163 | # length is less than the specified length. 
164 | # Account for [CLS], [SEP], [SEP] with "- 3" 165 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) # noqa 166 | else: 167 | # Account for [CLS] and [SEP] with "- 2" 168 | if len(tokens_a) > max_seq_length - 2: 169 | tokens_a = tokens_a[0:(max_seq_length - 2)] 170 | 171 | # The convention in BERT is: 172 | # (a) For sequence pairs: 173 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 174 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 175 | # (b) For single sequences: 176 | # tokens: [CLS] the dog is hairy . [SEP] 177 | # type_ids: 0 0 0 0 0 0 0 178 | # 179 | # Where "type_ids" are used to indicate whether this is the first 180 | # sequence or the second sequence. The embedding vectors for `type=0` and 181 | # `type=1` were learned during pre-training and are added to the wordpiece 182 | # embedding vector (and position vector). This is not *strictly* necessary 183 | # since the [SEP] token unambiguously separates the sequences, but it makes 184 | # it easier for the model to learn the concept of sequences. 185 | # 186 | # For classification tasks, the first vector (corresponding to [CLS]) is 187 | # used as the "sentence vector". Note that this only makes sense because 188 | # the entire model is fine-tuned. 189 | tokens = [] 190 | segment_ids = [] 191 | tokens.append("[CLS]") 192 | segment_ids.append(0) 193 | for token in tokens_a: 194 | tokens.append(token) 195 | segment_ids.append(0) 196 | tokens.append("[SEP]") 197 | segment_ids.append(0) 198 | 199 | if tokens_b: 200 | for token in tokens_b: 201 | tokens.append(token) 202 | segment_ids.append(1) 203 | tokens.append("[SEP]") 204 | segment_ids.append(1) 205 | 206 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 207 | 208 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 209 | # tokens are attended to. 210 | input_mask = [1] * len(input_ids) 211 | 212 | # Zero-pad up to the sequence length. 
213 |   while len(input_ids) < max_seq_length:
214 |     input_ids.append(0)
215 |     input_mask.append(0)
216 |     segment_ids.append(0)
217 |
218 |   if len(input_ids) != max_seq_length:
219 |     raise ValueError("'input_ids' has an incorrect length: %r" % len(input_ids))
220 |   if len(input_mask) != max_seq_length:
221 |     raise ValueError("'input_mask' has an incorrect length: %r" % len(input_mask))
222 |   if len(segment_ids) != max_seq_length:
223 |     raise ValueError("'segment_ids' has an incorrect length: %r" % len(segment_ids))
224 |
225 |   label_id = label_map[example.label]
226 |   if ex_index < 5:
227 |     tf.compat.v1.logging.info("*** Example ***")
228 |     tf.compat.v1.logging.info("guid: %s" % (example.guid))
229 |     tf.compat.v1.logging.info("tokens: %s" % " ".join(
230 |         [tokenization.printable_text(x) for x in tokens]))
231 |     tf.compat.v1.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
232 |     tf.compat.v1.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
233 |     tf.compat.v1.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
234 |     tf.compat.v1.logging.info("label: %s (id = %d)" % (example.label, label_id))
235 |
236 |   # Return the feature components as a tuple (not an InputFeatures object)
237 |   return input_ids, input_mask, label_id, segment_ids
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![Build Status](https://travis-ci.com/IBM/MAX-Text-Sentiment-Classifier.svg?branch=master)](https://travis-ci.com/IBM/MAX-Text-Sentiment-Classifier) [![API demo](https://img.shields.io/website/http/max-text-sentiment-classifier.codait-prod-41208c73af8fca213512856c7a09db52-0000.us-east.containers.appdomain.cloud/swagger.json.svg?label=API%20demo&down_message=down&up_message=up)](http://max-text-sentiment-classifier.codait-prod-41208c73af8fca213512856c7a09db52-0000.us-east.containers.appdomain.cloud)
2 |
3 | [![Deploy MAX to IBM Cloud with Kubernetes](docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png)](http://ibm.biz/max-to-ibm-cloud-tutorial)
4 |
5 | # IBM Developer Model Asset Exchange: Text Sentiment Classifier
6 |
7 | This repository contains code to instantiate and deploy a text sentiment classifier. The model detects whether a text fragment leans towards a positive or a negative sentiment. Optimal inputs are short strings (preferably a single sentence) with correct grammar, although correct grammar is not a strict requirement.
8 |
9 | The model is based on the [pre-trained BERT-Base, English Uncased](https://github.com/google-research/bert/blob/master/README.md) model and was fine-tuned on the [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml). The model files are hosted on
10 | [IBM Cloud Object Storage](https://max-cdn.cdn.appdomain.cloud/max-text-sentiment-classifier/1.2.0/assets.tar.gz).
11 | The code in this repository deploys the model as a web service in a Docker container. This repository was developed
12 | as part of the [IBM Developer Model Asset Exchange](https://developer.ibm.com/exchanges/models/) and the public API is powered by [IBM Cloud](https://ibm.biz/Bdz2XM).
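Once a container is running (see [Run Locally](#run-locally) below), a minimal Python client sketch like the following can exercise the service; the endpoints and response shape match the tests in this repository, while the example claim is just an illustration:

```python
import requests

BASE_URL = 'http://localhost:5000'  # assumes a locally running container

# Fetch the model metadata defined in config.py
metadata = requests.get(f'{BASE_URL}/model/metadata').json()
print(metadata['name'])  # 'MAX Text Sentiment Classifier'

# Score a batch of text fragments; 'text' maps to a list of strings
payload = {'text': ["The Model Asset Exchange is a crucial element of a developer's toolkit."]}
response = requests.post(f'{BASE_URL}/model/predict', json=payload)
response.raise_for_status()

for prediction in response.json()['predictions']:
    print(prediction)  # e.g. {'positive': 0.99..., 'negative': 0.00...}
```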
13 | 14 | ## Model Metadata 15 | | Domain | Application | Industry | Framework | Training Data | Input Data | 16 | | --------- | -------- | -------- | --------- | --------- | --------------- | 17 | | Natural Language Processing (NLP) | Sentiment Analysis | General | TensorFlow | [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml) | Text | 18 | 19 | ## Benchmark 20 | In the table below, the prediction accuracy of the model on the test sets of three different datasets is listed. 21 | 22 | The first row showcases the generalization power of our model after fine-tuning on the IBM Claims Dataset. 23 | The Sentiment140 (Tweets) and IMDB Reviews datasets are only used for evaluating the transfer-learning capabilities of this model. The implementation in this repository was **not** trained or fine-tuned on the Sentiment140 or IMDB reviews datasets. 24 | 25 | The second row describes the performance of the BERT-Base (English - Uncased) model when fine-tuned on the specific task. This was done simply for reference, and the weights are therefore not made available. 26 | 27 | 28 | The generalization results (first row) are very good when the input data is similar to the data used for fine-tuning (e.g. Sentiment140 (tweets) when fine-tuned on the IBM Claims Dataset). However, when a different style of text is given as input, and with a longer median length (e.g. multi-sentence IMDB reviews), the results are not as good. 29 | 30 | | Model Type | IBM Claims | Sentiment140 | IMDB Reviews | 31 | | ------------- | -------- | -------- | -------------- | 32 | | This model (fine-tuned on IBM Claims) | 94% | 83.84% | 81% | 33 | | Models fine-tuned on the specific dataset | 94% | 84% | 90% | 34 | 35 | ## References 36 | * _J. Devlin, M. Chang, K. Lee, K. Toutanova_, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), arXiv, 2018. 37 | * [Google BERT repository](https://github.com/google-research/bert) 38 | * [IBM Claims Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Project) and [IBM Project Debater](https://www.research.ibm.com/artificial-intelligence/project-debater/) 39 | 40 | ## Licenses 41 | | Component | License | Link | 42 | | ------------- | -------- | -------- | 43 | | This repository | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/IBM/MAX-Text-Sentiment-Classifier/blob/master/LICENSE) | 44 | | Fine-tuned Model Weights | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/IBM/MAX-Text-Sentiment-Classifier/blob/master/LICENSE) | 45 | | Pre-trained Model Weights | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/google-research/bert/blob/master/LICENSE) | 46 | | Model Code (3rd party) | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/google-research/bert/blob/master/LICENSE) | 47 | | IBM Claims Stance Dataset for fine-tuning | [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/) | [LICENSE 1](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Project)
[LICENSE 2](https://en.wikipedia.org/wiki/Wikipedia:Copyrights#Reusers.27_rights_and_obligations)|
48 |
49 | ## Pre-requisites:
50 | * `docker`: The [Docker](https://www.docker.com/) command-line interface. Follow the [installation instructions](https://docs.docker.com/install/) for your system.
51 | * The minimum recommended resources for this model are 4 GB of memory and 4 CPUs.
52 | * If you are on x86-64/AMD64, your CPU must support [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) at the minimum.
53 |
54 | # Deployment options
55 |
56 | * [Deploy from Quay](#deploy-from-quay)
57 | * [Deploy on Red Hat OpenShift](#deploy-on-red-hat-openshift)
58 | * [Deploy on Kubernetes](#deploy-on-kubernetes)
59 | * [Run Locally](#run-locally)
60 |
61 | ## Deploy from Quay
62 | To run the docker image, which automatically starts the model serving API, run:
63 |
64 | ```
65 | $ docker run -it -p 5000:5000 quay.io/codait/max-text-sentiment-classifier
66 | ```
67 |
68 | This will pull a pre-built image from the Quay.io container registry (or use an existing image if it is already cached locally) and run it.
69 | If you'd rather check out and build the model locally, you can follow the [run locally](#run-locally) steps below.
70 |
71 | ## Deploy on Red Hat OpenShift
72 |
73 | You can deploy the model-serving microservice on Red Hat OpenShift by following the instructions for the OpenShift web console or the OpenShift Container Platform CLI [in this tutorial](https://developer.ibm.com/tutorials/deploy-a-model-asset-exchange-microservice-on-red-hat-openshift/), specifying `quay.io/codait/max-text-sentiment-classifier` as the image name.
74 |
75 | > Note that this model requires at least 4GB of RAM. Therefore this model will not run in a cluster that was provisioned under the [OpenShift Online starter plan](https://www.openshift.com/products/online/), which is capped at 2GB.
76 |
77 | ## Deploy on Kubernetes
78 | You can also deploy the model on Kubernetes using the latest docker image on Quay.
79 |
80 | On your Kubernetes cluster, run the following command:
81 |
82 | ```
83 | $ kubectl apply -f https://github.com/IBM/MAX-Text-Sentiment-Classifier/raw/master/max-text-sentiment-classifier.yaml
84 | ```
85 |
86 | The model will be available internally at port `5000`, but can also be accessed externally through the `NodePort`.
87 |
88 | A more elaborate tutorial on how to deploy this MAX model to production on [IBM Cloud](https://ibm.biz/Bdz2XM) can be found [here](http://ibm.biz/max-to-ibm-cloud-tutorial).
89 |
90 | ## Run Locally
91 | 1. [Build the Model](#1-build-the-model)
92 | 2. [Deploy the Model](#2-deploy-the-model)
93 | 3. [Use the Model](#3-use-the-model)
94 | 4. [Development](#4-development)
95 | 5. [Cleanup](#5-cleanup)
96 |
97 |
98 | ### 1. Build the Model
99 | Clone this repository locally. In a terminal, run the following command:
100 |
101 | ```
102 | $ git clone https://github.com/IBM/MAX-Text-Sentiment-Classifier.git
103 | ```
104 |
105 | Change directory into the repository base folder:
106 |
107 | ```
108 | $ cd MAX-Text-Sentiment-Classifier
109 | ```
110 |
111 | To build the docker image locally, run:
112 |
113 | ```
114 | $ docker build -t max-text-sentiment-classifier .
115 | ```
116 |
117 | All required model assets will be downloaded during the build process. _Note_ that currently this docker image is CPU only (we will add support for GPU images later).
118 |
119 |
120 | ### 2. Deploy the Model
121 | To run the docker image, which automatically starts the model serving API, run:
122 |
123 | ```
124 | $ docker run -it -p 5000:5000 max-text-sentiment-classifier
125 | ```
126 |
127 | ### 3. Use the Model
128 |
129 | The API server automatically generates an interactive Swagger documentation page. Go to `http://localhost:5000` to load it. From there you can explore the API and also create test requests.
130 |
131 | ```
132 | Example:
133 | [
134 |   "The Model Asset Exchange is a crucial element of a developer's toolkit.",
135 |   "2008 was a dark, dark year for stock markets worldwide."
136 | ]
137 |
138 | Result:
139 | [
140 |   {
141 |     "positive": 0.9977352619171143,
142 |     "negative": 0.002264695707708597
143 |   }
144 | ],
145 | [
146 |   {
147 |     "positive": 0.001138084102421999,
148 |     "negative": 0.9988619089126587
149 |   }
150 | ]
151 | ```
152 |
153 |
154 | Use the `model/predict` endpoint to submit input text in JSON format. The JSON structure should have one key, `text`, whose value is a list of input strings to be analyzed. An example can be found in the image below.
155 |
156 | Submitting valid JSON data triggers the model and returns a JSON response with a `status` and a `predictions` key. The `predictions` field holds a list of class labels and their corresponding probabilities. The first element in the list corresponds to the prediction for the first string in the input list.
157 |
158 |
159 | ![Swagger UI Screenshot](docs/swagger-screenshot.png)
160 |
161 | You can also test it on the command line, for example:
162 |
163 | ```bash
164 | $ curl -d "{ \"text\": [ \"The Model Asset Exchange is a crucial element of a developer's toolkit.\" ]}" -X POST "http://localhost:5000/model/predict" -H "Content-Type: application/json"
165 | ```
166 |
167 | You should see a JSON response like the one below:
168 |
169 | ```json
170 | {
171 |   "status": "ok",
172 |   "predictions": [
173 |     [
174 |       {
175 |         "positive": 0.9977352619171143,
176 |         "negative": 0.0022646968718618155
177 |       }
178 |     ]
179 |   ]
180 | }
181 | ```
182 |
183 | ### 4. Development
184 | To run the Flask API app in debug mode, edit `config.py` to set `DEBUG = True` under the application settings. You will then need to rebuild the docker image (see [step 1](#1-build-the-model)).
185 |
186 | ### 5. Cleanup
187 | To stop the Docker container, type `CTRL` + `C` in your terminal.
188 |
189 | ## Resources and Contributions
190 |
191 | If you are interested in contributing to the Model Asset Exchange project or have any queries, please follow the instructions [here](https://github.com/CODAIT/max-central-repo).
192 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 |                                  Apache License
2 |                            Version 2.0, January 2004
3 |                         http://www.apache.org/licenses/
4 |
5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 |    1. Definitions.
8 |
9 |       "License" shall mean the terms and conditions for use, reproduction,
10 |       and distribution as defined by Sections 1 through 9 of this document.
11 |
12 |       "Licensor" shall mean the copyright owner or entity authorized by
13 |       the copyright owner that is granting the License.
14 |
15 |       "Legal Entity" shall mean the union of the acting entity and all
16 |       other entities that control, are controlled by, or are under common
17 |       control with that entity.
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2019, IBM (Center of Open-Source Data & AI Technologies) 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /core/bert/tokenization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Tokenization classes.""" 16 | 17 | import collections 18 | import re 19 | import unicodedata 20 | import six 21 | import tensorflow as tf 22 | 23 | 24 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): 25 | """Checks whether the casing config is consistent with the checkpoint name.""" 26 | 27 | # The casing has to be passed in by the user and there is no explicit check 28 | # as to whether it matches the checkpoint. The casing information probably 29 | # should have been stored in the bert_config.json file, but it's not, so 30 | # we have to heuristically detect it to validate. 31 | 32 | if not init_checkpoint: 33 | return 34 | 35 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint) 36 | if m is None: 37 | return 38 | 39 | model_name = m.group(1) 40 | 41 | lower_models = [ 42 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", 43 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" 44 | ] 45 | 46 | cased_models = [ 47 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", 48 | "multi_cased_L-12_H-768_A-12" 49 | ] 50 | 51 | is_bad_config = False 52 | if model_name in lower_models and not do_lower_case: 53 | is_bad_config = True 54 | actual_flag = "False" 55 | case_name = "lowercased" 56 | opposite_flag = "True" 57 | 58 | if model_name in cased_models and do_lower_case: 59 | is_bad_config = True 60 | actual_flag = "True" 61 | case_name = "cased" 62 | opposite_flag = "False" 63 | 64 | if is_bad_config: 65 | raise ValueError( 66 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " 67 | "However, `%s` seems to be a %s model, so you " 68 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches " 69 | "how the model was pre-trained. If this error is wrong, please " 70 | "just comment out this check."
% (actual_flag, init_checkpoint, 71 | model_name, case_name, opposite_flag)) 72 | 73 | 74 | def convert_to_unicode(text): 75 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 76 | if six.PY3: 77 | if isinstance(text, str): 78 | return text 79 | elif isinstance(text, bytes): 80 | return text.decode("utf-8", "ignore") 81 | else: 82 | raise ValueError("Unsupported string type: %s" % (type(text))) 83 | elif six.PY2: 84 | if isinstance(text, str): 85 | return text.decode("utf-8", "ignore") 86 | elif isinstance(text, unicode): # noqa 87 | return text 88 | else: 89 | raise ValueError("Unsupported string type: %s" % (type(text))) 90 | else: 91 | raise ValueError("Not running on Python 2 or Python 3?") 92 | 93 | 94 | def printable_text(text): 95 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 96 | 97 | # These functions want `str` for both Python 2 and Python 3, but in one case 98 | # it's a Unicode string and in the other it's a byte string. 99 | if six.PY3: 100 | if isinstance(text, str): 101 | return text 102 | elif isinstance(text, bytes): 103 | return text.decode("utf-8", "ignore") 104 | else: 105 | raise ValueError("Unsupported string type: %s" % (type(text))) 106 | elif six.PY2: 107 | if isinstance(text, str): 108 | return text 109 | elif isinstance(text, unicode): # noqa 110 | return text.encode("utf-8") 111 | else: 112 | raise ValueError("Unsupported string type: %s" % (type(text))) 113 | else: 114 | raise ValueError("Not running on Python 2 or Python 3?") 115 | 116 | 117 | def load_vocab(vocab_file): 118 | """Loads a vocabulary file into a dictionary.""" 119 | vocab = collections.OrderedDict() 120 | index = 0 121 | with tf.compat.v1.gfile.GFile(vocab_file, "r") as reader: 122 | while True: 123 | token = convert_to_unicode(reader.readline()) 124 | if not token: 125 | break 126 | token = token.strip() 127 | vocab[token] = index 128 | index += 1 129 | return vocab 130 | 131 | 132 | def convert_by_vocab(vocab, items): 133 | """Converts a sequence of [tokens|ids] using the vocab.""" 134 | output = [] 135 | for item in items: 136 | output.append(vocab[item]) 137 | return output 138 | 139 | 140 | def convert_tokens_to_ids(vocab, tokens): 141 | return convert_by_vocab(vocab, tokens) 142 | 143 | 144 | def convert_ids_to_tokens(inv_vocab, ids): 145 | return convert_by_vocab(inv_vocab, ids) 146 | 147 | 148 | def whitespace_tokenize(text): 149 | """Runs basic whitespace cleaning and splitting on a piece of text.""" 150 | text = text.strip() 151 | if not text: 152 | return [] 153 | tokens = text.split() 154 | return tokens 155 | 156 | 157 | class FullTokenizer(object): 158 | """Runs end-to-end tokenization.""" 159 | 160 | def __init__(self, vocab_file, do_lower_case=True): 161 | self.vocab = load_vocab(vocab_file) 162 | self.inv_vocab = {v: k for k, v in self.vocab.items()} 163 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 164 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 165 | 166 | def tokenize(self, text): 167 | split_tokens = [] 168 | for token in self.basic_tokenizer.tokenize(text): 169 | for sub_token in self.wordpiece_tokenizer.tokenize(token): 170 | split_tokens.append(sub_token) 171 | 172 | return split_tokens 173 | 174 | def convert_tokens_to_ids(self, tokens): 175 | return convert_by_vocab(self.vocab, tokens) 176 | 177 | def convert_ids_to_tokens(self, ids): 178 | return convert_by_vocab(self.inv_vocab, ids) 179 | 180 | 181 | class BasicTokenizer(object): 182 | """Runs basic tokenization
(punctuation splitting, lower casing, etc.).""" 183 | 184 | def __init__(self, do_lower_case=True): 185 | """Constructs a BasicTokenizer. 186 | 187 | Args: 188 | do_lower_case: Whether to lower case the input. 189 | """ 190 | self.do_lower_case = do_lower_case 191 | 192 | def tokenize(self, text): 193 | """Tokenizes a piece of text.""" 194 | text = convert_to_unicode(text) 195 | text = self._clean_text(text) 196 | 197 | # This was added on November 1st, 2018 for the multilingual and Chinese 198 | # models. This is also applied to the English models now, but it doesn't 199 | # matter since the English models were not trained on any Chinese data 200 | # and generally don't have any Chinese data in them (there are Chinese 201 | # characters in the vocabulary because Wikipedia does have some Chinese 202 | # words in the English Wikipedia). 203 | text = self._tokenize_chinese_chars(text) 204 | 205 | orig_tokens = whitespace_tokenize(text) 206 | split_tokens = [] 207 | for token in orig_tokens: 208 | if self.do_lower_case: 209 | token = token.lower() 210 | token = self._run_strip_accents(token) 211 | split_tokens.extend(self._run_split_on_punc(token)) 212 | 213 | output_tokens = whitespace_tokenize(" ".join(split_tokens)) 214 | return output_tokens 215 | 216 | def _run_strip_accents(self, text): 217 | """Strips accents from a piece of text.""" 218 | text = unicodedata.normalize("NFD", text) 219 | output = [] 220 | for char in text: 221 | cat = unicodedata.category(char) 222 | if cat == "Mn": 223 | continue 224 | output.append(char) 225 | return "".join(output) 226 | 227 | def _run_split_on_punc(self, text): 228 | """Splits punctuation on a piece of text.""" 229 | chars = list(text) 230 | i = 0 231 | start_new_word = True 232 | output = [] 233 | while i < len(chars): 234 | char = chars[i] 235 | if _is_punctuation(char): 236 | output.append([char]) 237 | start_new_word = True 238 | else: 239 | if start_new_word: 240 | output.append([]) 241 | start_new_word = False 242 | output[-1].append(char) 243 | i += 1 244 | 245 | return ["".join(x) for x in output] 246 | 247 | def _tokenize_chinese_chars(self, text): 248 | """Adds whitespace around any CJK character.""" 249 | output = [] 250 | for char in text: 251 | cp = ord(char) 252 | if self._is_chinese_char(cp): 253 | output.append(" ") 254 | output.append(char) 255 | output.append(" ") 256 | else: 257 | output.append(char) 258 | return "".join(output) 259 | 260 | def _is_chinese_char(self, cp): 261 | """Checks whether CP is the codepoint of a CJK character.""" 262 | # This defines a "chinese character" as anything in the CJK Unicode block: 263 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) 264 | # 265 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters, 266 | # despite its name. The modern Korean Hangul alphabet is a different block, 267 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write 268 | # space-separated words, so they are not treated specially and handled 269 | # like all of the other languages.
270 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or # 271 | (cp >= 0x3400 and cp <= 0x4DBF) or # 272 | (cp >= 0x20000 and cp <= 0x2A6DF) or # 273 | (cp >= 0x2A700 and cp <= 0x2B73F) or # 274 | (cp >= 0x2B740 and cp <= 0x2B81F) or # 275 | (cp >= 0x2B820 and cp <= 0x2CEAF) or 276 | (cp >= 0xF900 and cp <= 0xFAFF) or # 277 | (cp >= 0x2F800 and cp <= 0x2FA1F)): # 278 | return True 279 | 280 | return False 281 | 282 | def _clean_text(self, text): 283 | """Performs invalid character removal and whitespace cleanup on text.""" 284 | output = [] 285 | for char in text: 286 | cp = ord(char) 287 | if cp == 0 or cp == 0xfffd or _is_control(char): 288 | continue 289 | if _is_whitespace(char): 290 | output.append(" ") 291 | else: 292 | output.append(char) 293 | return "".join(output) 294 | 295 | 296 | class WordpieceTokenizer(object): 297 | """Runs WordPiece tokenization.""" 298 | 299 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): # nosec - "[UNK]" is not a password 300 | self.vocab = vocab 301 | self.unk_token = unk_token 302 | self.max_input_chars_per_word = max_input_chars_per_word 303 | 304 | def tokenize(self, text): 305 | """Tokenizes a piece of text into its word pieces. 306 | 307 | This uses a greedy longest-match-first algorithm to perform tokenization 308 | using the given vocabulary. 309 | 310 | For example: 311 | input = "unaffable" 312 | output = ["un", "##aff", "##able"] 313 | 314 | Args: 315 | text: A single token or whitespace separated tokens. This should have 316 | already been passed through `BasicTokenizer`. 317 | 318 | Returns: 319 | A list of wordpiece tokens. 320 | """ 321 | 322 | text = convert_to_unicode(text) 323 | 324 | output_tokens = [] 325 | for token in whitespace_tokenize(text): 326 | chars = list(token) 327 | if len(chars) > self.max_input_chars_per_word: 328 | output_tokens.append(self.unk_token) 329 | continue 330 | 331 | is_bad = False 332 | start = 0 333 | sub_tokens = [] 334 | while start < len(chars): 335 | end = len(chars) 336 | cur_substr = None 337 | while start < end: 338 | substr = "".join(chars[start:end]) 339 | if start > 0: 340 | substr = "##" + substr 341 | if substr in self.vocab: 342 | cur_substr = substr 343 | break 344 | end -= 1 345 | if cur_substr is None: 346 | is_bad = True 347 | break 348 | sub_tokens.append(cur_substr) 349 | start = end 350 | 351 | if is_bad: 352 | output_tokens.append(self.unk_token) 353 | else: 354 | output_tokens.extend(sub_tokens) 355 | return output_tokens 356 | 357 | 358 | def _is_whitespace(char): 359 | """Checks whether `char` is a whitespace character.""" 360 | # \t, \n, and \r are technically control characters but we treat them 361 | # as whitespace since they are generally considered as such. 362 | if char == " " or char == "\t" or char == "\n" or char == "\r": 363 | return True 364 | cat = unicodedata.category(char) 365 | if cat == "Zs": 366 | return True 367 | return False 368 | 369 | 370 | def _is_control(char): 371 | """Checks whether `char` is a control character.""" 372 | # These are technically control characters but we count them as whitespace 373 | # characters. 374 | if char == "\t" or char == "\n" or char == "\r": 375 | return False 376 | cat = unicodedata.category(char) 377 | if cat.startswith("C"): 378 | return True 379 | return False 380 | 381 | 382 | def _is_punctuation(char): 383 | """Checks whether `char` is a punctuation character.""" 384 | cp = ord(char) 385 | # We treat all non-letter/number ASCII as punctuation.
386 | # Characters such as "^", "$", and "`" are not in the Unicode 387 | # Punctuation class but we treat them as punctuation anyway, for 388 | # consistency. 389 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or 390 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): 391 | return True 392 | cat = unicodedata.category(char) 393 | if cat.startswith("P"): 394 | return True 395 | return False 396 | -------------------------------------------------------------------------------- /samples/test-examples.tsv: -------------------------------------------------------------------------------- 1 | video game violence is not related to serious aggressive behavior in real life pos 2 | Neurological link was found between playing violent video games and aggressive behaviour in children and teenagers neg 3 | many skills can be learned from the gaming experience, it builds practical and intellectual skills pos 4 | children may imitate aggressive behaviors witnessed in media neg 5 | Violent video games - especially first-person shooter games - encourages real-life acts of violence in teenagers neg 6 | there is social utility in expressive and imaginative forms of entertainment, even if they contain violence pos 7 | China's birth control policy contributes to forced abortions neg 8 | the dramatic decrease in Chinese fertility started before the program began in 1979 for unrelated factors neg 9 | overpopulation has been blamed for a variety of issues, including increasing poverty neg 10 | Lack of siblings has been blamed for a number of social ills neg 11 | The use of drugs to enhance performance is considered unethical by most international sports organizations neg 12 | There is a wide range of health concerns for users of anabolic steroids neg 13 | Physical exercise helps prevent the diseases of affluence pos 14 | Exercise alone is a potential prevention method and/or treatment for mild forms of depression pos 15 | Too much exercise can be harmful neg 16 | affirmative action devalues the accomplishments of people who are chosen based on the social group to which they belong rather than their qualifications neg 17 | affirmative action has undesirable side-effects in addition to failing to achieve its goals neg 18 | attempts at antidiscrimination have been criticized as reverse discrimination neg 19 | States must take measures to seek to eliminate prejudices pos 20 | Historical racism continues to be reflected in socio-economic inequality neg 21 | Policies adopted as affirmative action have been criticized as a form of reverse discrimination neg 22 | Affirmative action perpetuates racial division neg 23 | no one has a legal right to have any demographic characteristic they possess be considered a favorable point on their behalf neg 24 | Affirmative action is counter-productive neg 25 | Affirmative action policies engender animosity toward preferred groups neg 26 | The identification of oppressed classes is difficult to carry out neg 27 | The typical knock out results in a sustained loss of consciousness neg 28 | The idea of multiculturalism had reached the end of its useful life neg 29 | multiculturalism works better in theory than in practice neg 30 | some forms of multiculturalism can divide people neg 31 | Multiculturalism would lead to acceptance of barbaric practices neg 32 | Addicted gamblers cost companies loss of productivity and profit neg 33 | Gamblers may suffer from depression and bankruptcy neg 34 | Compulsive gambling is often very detrimental to personal relationships neg 35 | 
Abuse is common in homes where pathological gambling is present neg 36 | Internet gambling is a legitimate activity that citizens have the right to engage in pos 37 | Gambling is a popular leisure activity enjoyed in many forms by millions of people pos 38 | electronic funds transfers inherent in online gambling are being exploited by criminal interests neg 39 | everyone has the right to leave or enter a country, along with movement within it pos 40 | Immigrants are considered hostile or alien to the natural culture neg 41 | Monarchy is a check against possible illegal action by politicians pos 42 | Monarchs rule with the intent of improving the lives of their subjects pos 43 | Royals are simply celebrities who should not have any formal role neg 44 | Monarchy encourages a feeling of dependency in many people who should instead have confidence in themselves and their fellow citizens neg 45 | Monarchical prerogative powers can be used to circumvent normal democratic process with no accountability neg 46 | Monarchy devalues intellect and achievement neg 47 | A constitutional monarch with limited powers and non-partisan nature can provide a focus for national unity pos 48 | the monarchy is inherently contrary to egalitarianism and multiculturalism neg 49 | "a republic is ""inevitable" pos 50 | "the Church's ban on condoms has ""caused the death of millions" neg 51 | release of tactical information usually presents a greater risk of casualties among one's own forces neg 52 | the right to freedom of speech is not absolute neg 53 | When resources are put into tobacco production they are taken away from food production neg 54 | tobacco causes cancer neg 55 | Freedom of expression is subject to certain restrictions neg 56 | The free communication of ideas and opinions is one of the most precious of the rights of man pos 57 | Speech can be justifiably suppressed in order to prevent harm from a clear and direct threat neg 58 | liberty of expression 'is not absolute neg 59 | government may not prohibit the expression of an idea simply because society finds the idea offensive or disagreeable neg 60 | Punishment of dangerous or offensive writings is necessary for the preservation of peace and good order pos 61 | Freedom of speech and expression can not be an excuse for distribution of indecent and immoral content neg 62 | censorship violates multiple Basic Human Rights neg 63 | by distributing vouchers to the families of students equal to the tuition that he/she would receive at his/her local public school, a student’s family could then choose from options where best to send their child pos 64 | Vouchers function to increase racial and economic discrimination in schools neg 65 | Earmarks are Good for American Democracy pos 66 | Congressmen tend to distribute specialized benefits at a great cost and ignore the particular costs the legislation bears upon the taxpayers neg 67 | Nuclear weapons decrease the chances of crisis escalation pos 68 | new nuclear states will use their acquired nuclear capabilities to deter threats and preserve peace pos 69 | more countries with nuclear weapons may increase the possibility of nuclear warfare neg 70 | abortion has negative effects on society neg 71 | Intact dilation and extraction is never needed to protect the health of a pregnant woman neg 72 | Certain restrictions on abortion could be used to form a slippery slope against all abortions neg 73 | Abortion, which would involve the deliberate destruction of life, should be rejected neg 74 | In all circumstances, it 
should be the woman's decision whether or not to terminate a pregnancy pos 75 | the embryo has a right to life pos 76 | it should be illegal for governments to regulate abortion neg 77 | working parents wish their children to be supervised pos 78 | students’ attitudes towards school did significantly increase as they spent more time on a year-round schedule pos 79 | year-round schools showed a substantial gain in academic achievement for at-risk, low performing students pos 80 | The year round schedule provides more opportunities for family vacations pos 81 | Students that attend year round schooling may miss out on experiences neg 82 | Welfare sustains or even creates poverty neg 83 | Safety nets enable households to make productive investments in their future that they may otherwise miss pos 84 | a lower rate of redistribution in a given society increases the inequality found among future incomes neg 85 | a certain amount of redistribution would be justified pos 86 | Because current science can't figure out exactly how life started, it must be God who caused life to start pos 87 | Atheism has been criticized as a faith in itself neg 88 | the most immoral acts in human history were performed by atheists neg 89 | belief in God and religion are social functions, used by those in power to oppress the working class neg 90 | the idea of God implies the abdication of human reason and justice neg 91 | The idea of God necessarily ends in the enslavement of mankind neg 92 | the theism of people throughout most of recorded history and in many different places provides prima facie demonstration of God's existence pos 93 | There is evidence for the existence of a God pos 94 | evolution can explain the apparent design in nature pos 95 | "Natural selection and similar scientific theories are superior to a ""God hypothesis""—the illusion of intelligent design—in explaining the living world and the cosmos" pos 96 | religion is needed to make us behave morally pos 97 | a god created the Universe pos 98 | Separation of older people from active roles in society benefits both society and older individuals pos 99 | personal and technical skills learned in the military will improve later employment prospects in civilian life pos 100 | Professionally-skilled conscripts are difficult to replace in the civilian workforce neg 101 | conscription would not provide adequate protection for the rights of conscientious objectors neg 102 | adequate military strength could be maintained without having conscription neg 103 | unpaid domestic work is just as valuable as paid work pos 104 | property rights encourage their holders to develop their property or generate wealth pos 105 | Patents are criticised as inconsistent with free trade neg 106 | intellectual property rights are essential to maintaining economic growth pos 107 | A positive correlation between the strengthening of the IP system and subsequent economic growth was found pos 108 | To violate intellectual property is no different morally than violating other property rights neg 109 | The cost of trying to enforce copyright is unreasonable neg 110 | all proposed alternatives to copyright protection do not allow for viable business models neg 111 | the very concept of copyright has never benefited society neg 112 | Wind power has gained very high social acceptance pos 113 | Wind energy is a clean energy source pos 114 | wind energy is one of the most cost efficient sources of renewable energy pos 115 | wind power is dependent on weather systems neg 116 | Wind power 
produces no greenhouse gas emissions during operation pos 117 | Wind power uses little land pos 118 | any effects on the environment from wind power are generally much less problematic than those of any other power source pos 119 | Wind power has low ongoing costs pos 120 | Wind projects revitalize the economy of rural communities pos 121 | The cost of repairing damaged ecosystems is considered to be much higher than the cost of conserving natural ecosystems pos 122 | People who live close to nature can be dependent on the survival of all the species in their environment pos 123 | "since species become extinct ""all the time"" the disappearance of a few more will not destroy the ecosystem" pos 124 | rapid rates of biodiversity loss threatens the sustained well-being of humanity neg 125 | Biodiversity is directly involved in water purification pos 126 | Austerity programs tend to have an impact on the poorest segments of the population neg 127 | the right to bear arms is absolute and unqualified pos 128 | an armed citizens' militia can help deter crime and tyranny pos 129 | arms allow for successful rebellions against tyranny pos 130 | The possibility of getting shot by an armed victim is a substantial deterrent to crime pos 131 | widespread gun ownership is protection against tyranny pos 132 | Bribery encourages rent seeking behaviour neg 133 | In some cases where the system of law is not well-implemented, bribes may be a way for companies to continue their businesses pos 134 | Corruption undermines the legitimacy of government and democratic values neg 135 | Availability of bribes can induce officials to contrive new rules and delays neg 136 | Corruption favors the most connected and unscrupulous, rather than the efficient neg 137 | The production of bitumen and synthetic crude oil emits more greenhouse gases than the production of conventional crude oil neg 138 | During Operation Cast Lead, the Israeli Defense Forces did more to safeguard the rights of civilians in a combat zone than any other army in the history of warfare pos 139 | Israeli citizens in the south have been suffering from rockets being fired at them neg 140 | the Israeli Defence Force breached laws of armed conflict by attacking indiscriminately civilians neg 141 | Israel's UAV attacks were a violation of International Humanitarian Law neg 142 | Israel was acting in self-defence pos 143 | Israelis have been killed by the unlawful rocket and mortar attacks from Gaza neg 144 | The repeated firing of rockets by Hamas endangers the lives of both Israeli and Palestinian civilians neg 145 | the military solution won't conduct to peace neg 146 | nothing justifies the suffering inflicted to civilian populations who live trapped in the Gaza strip neg 147 | This cycle of violence and retaliation impedes efforts to broker lasting peace in the region neg 148 | High-rise structures also pose serious challenges to firefighters during emergencies neg 149 | Tower Blocks grew a reputation for being undesirable low cost housing neg 150 | many tower blocks saw rising crime levels neg 151 | Tower blocks may be inherently more prone to casualties from a fire neg 152 | Tower blocks can hold thousands of families in a single building pos 153 | The right to freedom of thought and expression, sanctioned by the Declaration of the Rights of Man, cannot imply the right to offend the religious sentiment of believers neg 154 | In all societies there is a need to show sensitivity and responsibility in treating issues of special significance for the 
adherents of any particular faith pos 155 | the Israeli blockade and closures had pushed the Palestinian economy into a stage of de-development neg 156 | The blockade is a collective form of punishment on a civilian population neg 157 | The purpose of the blockade is to pressure Hamas into ending the rocket attacks and to deprive them of the supplies necessary for the continuation of rocket attacks pos 158 | "the blockade of Gaza is causing ""unacceptable suffering" neg 159 | The blockade is possibly a crime against humanity neg 160 | "all that is being achieved through the blockade is to ""enrich Hamas and marginalize even further the voices of moderation" neg 161 | Israel's blockade of the Gaza Strip was described as totally intolerable and unacceptable in the 21st century neg 162 | Israel restricts the ability for the Palestinian authority to exercise control neg 163 | Gaza was blockaded by Israel in response to the rocket and mortar attacks by Hamas and other militant groups operating inside Gaza pos 164 | the purpose of the restrictions in import of goods into Gaza are to pressure Hamas, which does not recognise Israel and backs attacks on its citizens pos 165 | The stated purpose of the blockade was to pressure Hamas into ending the rocket attacks pos 166 | Israel is not legally responsible for Gaza and not obliged to help a hostile territory beyond whatever is necessary to avoid a humanitarian crisis pos 167 | The Gaza blockade inflicted excessive damage to the civilian population in relation to the concrete military advantage expected neg 168 | Holocaust denial is a convenient polemical substitute for anti-semitism neg 169 | Open-source-appropriate technology built in continuous peer-review can result in better quality pos 170 | wide availability results in increased scrutiny of the source code, making open source software more secure pos 171 | With free software, businesses can fit software to their specific needs by changing the software pos 172 | proprietary software is unethical and unjust neg 173 | software produced in this fashion may lack standardization and be incompatible with various computer applications neg 174 | corruption is more prevalent in non-privatized sectors neg 175 | A state-monopolized function is prone to corruption neg 176 | certain public goods and services should remain primarily in the hands of government in order to ensure that everyone in society has access to them pos 177 | Academics are demoralized by government interference with admissions procedures neg 178 | Dependence on government funding has had disastrous effects on the higher education sector in continental Europe neg 179 | ASEAN Way has recently proven itself relatively successful in the settlements of disputes by peaceful manner pos 180 | "the importance of the ""wait to have sex"" message ends up being lost when programs demonstrate and encourage the use of contraception" neg 181 | abstinence-only sex ed and conservative moralizing will only alienate students neg 182 | sex education needs to be comprehensive to be effective pos 183 | Abstinence-only programs delay the initiation of sex pos 184 | abstinence-only education is ineffective neg 185 | abstinence-only-until-marriage programs are ineffective neg 186 | Comprehensive sex education covers abstinence as a positive choice, but also teaches about contraception and avoidance of STIs pos 187 | all abstinence programs are ineffective neg 188 | Abstinence until marriage is the most effective way to avoid HIV infection pos 189 | Providing 
safe-sex education promotes promiscuity neg 190 | Countries with conservative attitudes towards sex education have a higher incidence of STIs and teenage pregnancy neg 191 | sexual abstinence in teenagers decreases the risk of contracting STDs and having children outside marriage pos 192 | "voters ""overwhelmingly support term limits" pos 193 | Lawmakers are in gridlock because of becoming locked into entrenched positions over time neg 194 | Freed from political considerations related to re-election, lawmakers would be more free to vote on the merits pos 195 | Flag burning is protected by the First Amendment pos 196 | Americans oppose amending the constitution to outlaw flag burning neg 197 | The First Amendment reaches non-speech acts pos 198 | Flag burning tends to incite breaches of the peace neg 199 | laws against flag-burning are constitutional pos 200 | the BCS routinely involves controversy about which two teams are the best in the nation neg 201 | if everyone is left to their own economic devices instead of being controlled by the state, then the result would be a harmonious and more equal society pos 202 | the state has a legitimate role in providing public goods pos 203 | Financial liberalization and privatization coincide with democratization pos 204 | wages are naturally driven down in free market systems neg 205 | free markets usually fail to deal with the problem of externalities neg 206 | free market relationships are considered as structured upon coercion neg 207 | free trade gives optimal economic advantages pos 208 | Free trade allows maximum exploitation of workers by capital neg 209 | The justification for central planning is that the consolidation of economic resources can allow for the economy to take advantage of more perfect information when making decisions regarding investment and production pos 210 | market economies are inherently stable if left to themselves pos 211 | competition leads to innovation and more affordable prices pos 212 | "the state has a role to play in the economy, and specifically, in creating a ""safety net" pos 213 | free markets have the potential to free states from the looming prospect of recurrent warfare pos 214 | competition in the free market is more effective than the regulation of industry pos 215 | markets have inefficient outcomes neg 216 | no coercive monopoly of force can arise on a truly free market pos 217 | Unfettered markets are the best means to economic growth pos 218 | The regulation of markets is widely acknowledged as important to safeguard social and environmental values pos 219 | what starts as temporary governmental fixes usually become permanent and expanding government programs, which stifle the private sector and civil society neg 220 | left to its devices, the market will adjust efficiently pos 221 | Advertising increasingly invades public spaces neg 222 | Advertising has been criticized as inadvertently or even intentionally promoting sexism neg 223 | advertising directed at young children is per se manipulative neg 224 | the advertising techniques used to create consumer behaviour amount to the destruction of psychic and collective individuation neg 225 | it is inherently immoral to bring people into the world neg 226 | there ought to be a higher rate of population growth than what is currently mainstream pos 227 | the birth of a new person always entails nontrivial harm to that person neg 228 | the best thing for Earth's biosphere is for humans to voluntarily cease reproducing pos 229 | Voluntary human 
extinction will prevent human suffering and the extinction of other species pos 230 | non-reproduction would eventually allow humans to lead idyllic lifestyles pos 231 | attempting to reduce the Earth's population is the only moral option pos 232 | voluntary human extinction is advisable due to limited resources pos 233 | the decision not to procreate at all could be regarded as immoral neg 234 | Dam construction often leads to abuses of the masses by planners neg 235 | people worldwide have been physically displaced from their homes as a result of dam construction neg 236 | Dam failures are generally catastrophic neg 237 | In many reservoir construction projects people have to be moved and re-housed neg 238 | Farms and villages can be flooded by the creation of reservoirs, ruining many livelihoods neg 239 | once the renewable infrastructure is built, the fuel is free forever pos 240 | Large hydropower provides one of the lowest cost options in today’s energy market pos 241 | Development of large-scale hydroelectric power has environmental impacts neg 242 | The filling of large reservoirs can induce earth tremors, which may be large enough to be objectionable or destructive neg 243 | Hydroelectric dams with large reservoirs can be operated to provide peak generation at times of peak demand pos 244 | Hydro plants are able to act as load following power plants pos 245 | Integrating ever-higher levels of renewables is being successfully demonstrated in the real world pos 246 | There are no harmful emissions associated with hydroelectric plant operation pos 247 | Truth commissions are sometimes criticised for allowing crimes to go unpunished, and creating impunity for serious human rights abusers neg 248 | victims and communities affected by past crimes have the right to know the identity of suspected perpetrators pos 249 | Restorative approaches seek a balanced approach to the needs of the victim, wrongdoer and community through processes that preserve the safety and dignity of all pos 250 | democracies have less internal systematic violence pos 251 | the more democratic a regime, the less its democide pos 252 | democracy causes peace pos 253 | democracies treat each other with trust and respect even during crises pos 254 | the best strategy to ensure our security and to build a durable peace is to support the advance of democracy elsewhere pos 255 | Democracy is economically inefficient neg 256 | the benefits of a specialised society may be compromised by democracy neg 257 | A majority bullying a minority is just as bad as a dictator, communist or otherwise, doing so neg 258 | democracy would bring peace pos 259 | democracy leads to less internal violence and mass murder by the government pos 260 | only in a democracy the citizens can have a share in freedom pos 261 | --------------------------------------------------------------------------------
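The tokenizer in core/bert/tokenization.py above chains two stages: BasicTokenizer performs Unicode cleanup, optional lowercasing, accent stripping, and punctuation/CJK splitting, after which WordpieceTokenizer segments each token by greedy longest-match-first lookup against the vocabulary. A minimal usage sketch follows; the checkpoint directory and vocab.txt path are assumptions, since the vocabulary ships with a BERT checkpoint such as uncased_L-12_H-768_A-12 rather than with this repository.

from core.bert import tokenization

# Assumed path: vocab.txt comes from a downloaded BERT checkpoint
# (e.g. uncased_L-12_H-768_A-12/vocab.txt); it is not in this repo.
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True,  # must match the checkpoint casing; see validate_case_matches_checkpoint
)

# BasicTokenizer lowercases and splits on punctuation; WordpieceTokenizer
# then applies greedy longest-match-first sub-word segmentation.
tokens = tokenizer.tokenize("The claim was unaffable.")
# With the standard uncased vocabulary this yields roughly:
# ["the", "claim", "was", "un", "##aff", "##able", "."]

ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)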