├── .bandit
├── requirements-test.txt
├── docs
│   ├── swagger-screenshot.png
│   └── deploy-max-to-ibm-cloud-with-kubernetes-button.png
├── requirements.txt
├── training
│   └── README.md
├── .dockerignore
├── samples
│   ├── README.md
│   └── test-examples.tsv
├── core
│   ├── __init__.py
│   ├── model.py
│   └── bert
│       ├── run_classifier.py
│       └── tokenization.py
├── api
│   ├── __init__.py
│   ├── metadata.py
│   └── predict.py
├── sha512sums.txt
├── max-text-sentiment-classifier.yaml
├── app.py
├── .travis.yml
├── tests
│   ├── training
│   │   └── test_sample_training_response.py
│   ├── test_api.py
│   └── test_response.py
├── config.py
├── Dockerfile
├── .gitignore
├── README.md
└── LICENSE
/.bandit:
--------------------------------------------------------------------------------
1 | [bandit]
2 | exclude: /tests,/training
3 |
--------------------------------------------------------------------------------
/requirements-test.txt:
--------------------------------------------------------------------------------
1 | pytest==6.1.2
2 | requests==2.25.0
3 | flake8==3.8.4
4 | bandit==1.6.2
5 |
--------------------------------------------------------------------------------
/docs/swagger-screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/MAX-Text-Sentiment-Classifier/HEAD/docs/swagger-screenshot.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # python3.6
2 | tensorflow == 2.5.0 # CPU Version of TensorFlow.
3 | # tensorflow-gpu == 1.12.2 # GPU version of TensorFlow.
4 |
--------------------------------------------------------------------------------
/docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/MAX-Text-Sentiment-Classifier/HEAD/docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png
--------------------------------------------------------------------------------
/training/README.md:
--------------------------------------------------------------------------------
1 | We have removed support for training with the move to TensorFlow v2. If you are interested in training, check out the last TensorFlow v1 version of the training script at https://github.com/IBM/MAX-Text-Sentiment-Classifier/tree/v2.1.0/training .
2 |
--------------------------------------------------------------------------------
/.dockerignore:
--------------------------------------------------------------------------------
1 | training/
2 | README.*
3 | .idea/
4 | .git/
5 | .gitignore
6 | tests/
7 | .pytest_cache
8 | assets/.pytest_cache
9 | venv/
10 | assets/sentiment_BERT_base_uncased/
11 | assets/assets.tar.gz
12 | assets/sentiment_BERT_base_uncased.zip
13 | benchmark_model/
14 | docs/
--------------------------------------------------------------------------------
/samples/README.md:
--------------------------------------------------------------------------------
1 | # Sample Details
2 |
3 | ## Test Examples
4 |
5 | The tab-separated-values file `test-examples.tsv` in the `samples` folder contains a fraction of the [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Claim%20Stance) ([CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)) that was not used for fine-tuning. The first column contains the claim; the second column contains the corresponding sentiment label (`pos` or `neg`). The claims in this file may be used to try out the model and benchmark its performance.
6 |
--------------------------------------------------------------------------------
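The two-column layout described above makes it straightforward to benchmark a running instance of the model. Below is a minimal sketch, assuming an instance serving at `http://localhost:5000` (the endpoint documented in the README) and the `pos`/`neg` column labels noted above; the `positive`/`negative` response keys are those served by the pre-trained model:

```python
# Minimal benchmarking sketch for samples/test-examples.tsv, assuming a model
# instance is serving at localhost:5000 and the TSV has two tab-separated
# columns: claim and label ('pos' or 'neg').
import csv

import requests

MODEL_ENDPOINT = 'http://localhost:5000/model/predict'

with open('samples/test-examples.tsv', newline='') as f:
    rows = [row for row in csv.reader(f, delimiter='\t') if row]

claims = [row[0] for row in rows]
labels = [row[1] for row in rows]

r = requests.post(url=MODEL_ENDPOINT, json={'text': claims})
r.raise_for_status()
predictions = r.json()['predictions']

# The served model reports 'positive'/'negative' probabilities, while the
# TSV labels are 'pos'/'neg', so map the argmax back onto the TSV labels.
predicted = ['pos' if p['positive'] >= p['negative'] else 'neg' for p in predictions]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(f'Accuracy: {accuracy:.2%} on {len(labels)} claims')
```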
/core/__init__.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
--------------------------------------------------------------------------------
/api/__init__.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from .metadata import ModelMetadataAPI # noqa
18 | from .predict import ModelPredictAPI # noqa
--------------------------------------------------------------------------------
/sha512sums.txt:
--------------------------------------------------------------------------------
1 | 4e9ed575594d83d53af2726926239df2a5e85d0fc2884238a512627ad56ffd243f92dc75dcc2c64c716b9556b177557528cdf143fd10d2cf517289076028aaaf assets/labels.txt
2 | 296397595b1fcedd3a37464d3aa14a57526820d2ff96795eef533c634173ab744a0309b4431b7918089b424fdc0962d78082e19936b31a8565e6b6e8413f7dbe assets/saved_model.pb
3 | f51f06aae7f580a88998f6f7f24b52495c8d3d289fdbdc21231c05f3a8965783074d95c17b819186f9a63b622280e8a051105a2161cd0d153fa57db7a0aba9f9 assets/vocab.txt
4 | c58d6f1107456635fc403caede31eaf831b985c61429e85eea3d3edc8281a6af09ea63d0986c4d110ae90547aeb6d312c75a66280723f614ce6246a353b58626 assets/variables/variables.data-00000-of-00001
5 | 42dc3a7620e8a712065ae7bc6973e654cb4e515dffd3dc8289b90571cac578c1afbc8cb558b44b268da7be56e5b5045380949459274736bdaf566f207970e795 assets/variables/variables.index
6 |
--------------------------------------------------------------------------------
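These digests are verified inside the Docker build with `sha512sum -c sha512sums.txt` (see the Dockerfile below). For reference, a rough Python equivalent of that check, run from the repository root, could look like this:

```python
# Rough Python equivalent of `sha512sum -c sha512sums.txt`: recompute each
# asset's SHA-512 digest and compare it with the recorded value.
import hashlib

with open('sha512sums.txt') as f:
    for line in f:
        if not line.strip():
            continue
        expected, path = line.split(maxsplit=1)
        path = path.strip()
        digest = hashlib.sha512()
        with open(path, 'rb') as asset:
            for chunk in iter(lambda: asset.read(1 << 20), b''):
                digest.update(chunk)
        print(f"{path}: {'OK' if digest.hexdigest() == expected else 'FAILED'}")
```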
/max-text-sentiment-classifier.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: v1
2 | kind: Service
3 | metadata:
4 | name: max-text-sentiment-classifier
5 | spec:
6 | selector:
7 | app: max-text-sentiment-classifier
8 | ports:
9 | - port: 5000
10 | type: NodePort
11 | ---
12 | apiVersion: apps/v1
13 | kind: Deployment
14 | metadata:
15 | name: max-text-sentiment-classifier
16 | labels:
17 | app: max-text-sentiment-classifier
18 | spec:
19 | selector:
20 | matchLabels:
21 | app: max-text-sentiment-classifier
22 | replicas: 1
23 | template:
24 | metadata:
25 | labels:
26 | app: max-text-sentiment-classifier
27 | spec:
28 | containers:
29 | - name: max-text-sentiment-classifier
30 | image: quay.io/codait/max-text-sentiment-classifier:latest
31 | ports:
32 | - containerPort: 5000
33 |
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from maxfw.core import MAXApp
18 | from api import ModelMetadataAPI, ModelPredictAPI
19 | from config import API_TITLE, API_DESC, API_VERSION
20 |
21 | max = MAXApp(API_TITLE, API_DESC, API_VERSION)
22 | max.add_api(ModelMetadataAPI, '/metadata')
23 | max.add_api(ModelPredictAPI, '/predict')
24 | max.run()
25 |
--------------------------------------------------------------------------------
/api/metadata.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from core.model import ModelWrapper
18 | from maxfw.core import MAX_API, MetadataAPI, METADATA_SCHEMA
19 |
20 |
21 | class ModelMetadataAPI(MetadataAPI):
22 |
23 | @MAX_API.marshal_with(METADATA_SCHEMA)
24 | def get(self):
25 | """Return the metadata associated with the model"""
26 | return ModelWrapper.MODEL_META_DATA
27 |
--------------------------------------------------------------------------------
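Once the service is running, the metadata endpoint registered by this class can be exercised with a short snippet (assuming a local instance on port 5000, as used throughout the tests):

```python
# Fetch the model metadata served by ModelMetadataAPI, assuming a local instance.
import requests

r = requests.get('http://localhost:5000/model/metadata')
r.raise_for_status()
metadata = r.json()
print(metadata['id'])    # 'max-text-sentiment-classifier'
print(metadata['name'])  # 'MAX Text Sentiment Classifier'
```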
/.travis.yml:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | language: python
18 | python:
19 | - 3.6
20 | services:
21 | - docker
22 | install:
23 | - docker build -t max-text-sentiment-classifier .
24 | - docker run -it -d --rm -p 5000:5000 max-text-sentiment-classifier
25 | - pip install -r requirements-test.txt
26 | before_script:
27 | - flake8 . --max-line-length=127
28 | - bandit -r .
29 | - sleep 30
30 | script:
31 | - pytest tests/test_api.py
32 | - pytest tests/test_response.py
33 |
--------------------------------------------------------------------------------
/tests/training/test_sample_training_response.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | import pytest
18 | import requests
19 |
20 |
21 | def test_response():
22 |
23 |     # test that the predict endpoint returns status code 200
24 | model_endpoint = 'http://localhost:5000/model/predict'
25 |
26 | json_data = {
27 | "text": ["good string",
28 | "bad string"]
29 | }
30 |
31 | r = requests.post(url=model_endpoint, json=json_data)
32 |
33 | assert r.status_code == 200
34 | response = r.json()
35 | assert response['status'] == 'ok'
36 |
37 | # test whether the labels have changed
38 | assert 'pos' in response['predictions'][0].keys()
39 | assert 'neg' in response['predictions'][0].keys()
40 |
41 |
42 | if __name__ == '__main__':
43 | pytest.main([__file__])
44 |
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | # Flask settings
18 | DEBUG = False
19 |
20 | # Flask-restplus settings
21 | RESTPLUS_MASK_SWAGGER = False
22 | SWAGGER_UI_DOC_EXPANSION = 'none'
23 |
24 | # API metadata
25 | API_TITLE = 'MAX Text Sentiment Classifier'
26 | API_DESC = 'Detect the sentiment captured in short pieces of text. ' \
27 | 'The model was finetuned on the IBM Project Debater Claim Sentiment dataset.'
28 | API_VERSION = '2.0.0'
29 |
30 | # default model
31 | DEFAULT_MODEL_PATH = 'assets'
32 |
33 | # the metadata of the model
34 | MODEL_META_DATA = {
35 | 'id': 'max-text-sentiment-classifier',
36 | 'name': 'MAX Text Sentiment Classifier',
37 | 'description': 'BERT Base finetuned on the IBM Project Debater Claim Sentiment dataset.',
38 | 'type': 'Text Classification',
39 | 'source': 'https://developer.ibm.com/exchanges/models/all/max-text-sentiment-classifier/',
40 | 'license': 'Apache V2'
41 | }
42 |
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | FROM quay.io/codait/max-base:v1.4.0
18 |
19 | ARG model_bucket=https://codait-cos-max.s3.us.cloud-object-storage.appdomain.cloud/max-text-sentiment-classifier/1.2.0
20 | ARG model_file=assets.tar.gz
21 |
22 | ARG use_pre_trained_model=true
23 |
24 | RUN if [ "$use_pre_trained_model" = "true" ] ; then\
25 | # download pre-trained model artifacts from Cloud Object Storage
26 | wget -nv --show-progress --progress=bar:force:noscroll ${model_bucket}/${model_file} --output-document=assets/${model_file} &&\
27 | tar -x -C assets/ -f assets/${model_file} -v && rm assets/${model_file} ; \
28 | fi
29 |
30 | COPY requirements.txt .
31 | RUN pip install -r requirements.txt
32 |
33 | COPY . .
34 |
35 | RUN if [ "$use_pre_trained_model" = "true" ] ; then \
36 | # validate downloaded pre-trained model assets
37 | sha512sum -c sha512sums.txt ; \
38 | else \
39 | # rename the directory that contains the custom-trained model artifacts
40 | if [ -d "./custom_assets/" ] ; then \
41 | rm -rf ./assets && ln -s ./custom_assets ./assets ; \
42 |         fi ; \
43 | fi
44 |
45 | EXPOSE 5000
46 |
47 | CMD python app.py
48 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *-model-building-code.zip
2 | .idea/
3 | # Byte-compiled / optimized / DLL files
4 | __pycache__/
5 | *.py[cod]
6 | *$py.class
7 |
8 | # C extensions
9 | *.so
10 |
11 | # Distribution / packaging
12 | .Python
13 | build/
14 | develop-eggs/
15 | dist/
16 | downloads/
17 | eggs/
18 | .eggs/
19 | lib/
20 | lib64/
21 | parts/
22 | sdist/
23 | var/
24 | wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | .hypothesis/
50 | .pytest_cache/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 | db.sqlite3
60 |
61 | # Flask stuff:
62 | instance/
63 | .webassets-cache
64 |
65 | # Scrapy stuff:
66 | .scrapy
67 |
68 | # Sphinx documentation
69 | docs/_build/
70 |
71 | # PyBuilder
72 | target/
73 |
74 | # Jupyter Notebook
75 | .ipynb_checkpoints
76 |
77 | # pyenv
78 | .python-version
79 |
80 | # celery beat schedule file
81 | celerybeat-schedule
82 |
83 | # SageMath parsed files
84 | *.sage.py
85 |
86 | # Environments
87 | .env
88 | .venv
89 | env/
90 | venv/
91 | ENV/
92 | env.bak/
93 | venv.bak/
94 |
95 | # Spyder project settings
96 | .spyderproject
97 | .spyproject
98 |
99 | # Rope project settings
100 | .ropeproject
101 |
102 | # mkdocs documentation
103 | /site
104 |
105 | # mypy
106 | .mypy_cache/
107 |
--------------------------------------------------------------------------------
/tests/test_api.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | import pytest
18 | import requests
19 |
20 |
21 | def test_swagger():
22 |
23 | model_endpoint = 'http://localhost:5000/swagger.json'
24 |
25 | r = requests.get(url=model_endpoint)
26 | assert r.status_code == 200
27 | assert r.headers['Content-Type'] == 'application/json'
28 |
29 | json = r.json()
30 | assert 'swagger' in json
31 | assert json.get('info') and json.get('info').get('title') == 'MAX Text Sentiment Classifier'
32 |
33 |
34 | def test_metadata():
35 |
36 | model_endpoint = 'http://localhost:5000/model/metadata'
37 |
38 | r = requests.get(url=model_endpoint)
39 | assert r.status_code == 200
40 |
41 | metadata = r.json()
42 | assert metadata['id'] == 'max-text-sentiment-classifier'
43 | assert metadata['name'] == 'MAX Text Sentiment Classifier'
44 | assert metadata['description'] == 'BERT Base finetuned on the IBM Project Debater Claim Sentiment dataset.'
45 | assert metadata['license'] == 'Apache V2'
46 | assert metadata['type'] == 'Text Classification'
47 | assert 'developer.ibm.com' in metadata['source']
48 |
49 |
50 | if __name__ == '__main__':
51 | pytest.main([__file__])
52 |
--------------------------------------------------------------------------------
/api/predict.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from core.model import ModelWrapper
18 | from maxfw.core import MAX_API, PredictAPI
19 | from flask_restx import fields
20 | from flask import abort
21 |
22 | # Set up parser for input data (https://flask-restx.readthedocs.io/en/latest/parsing.html)
23 | input_parser = MAX_API.model('ModelInput', {
24 | 'text': fields.List(fields.String, required=True,
25 | description='List of claims (strings) to be analyzed for either a positive or negative sentiment.')
26 | })
27 |
28 | with open('assets/labels.txt', 'r') as f:
29 | class_labels = [x.strip() for x in f]
30 |
31 | # Creating a JSON response model: https://flask-restx.readthedocs.io/en/latest/marshalling.html#the-api-model-factory
32 | label_prediction = MAX_API.model('LabelPrediction',
33 | {l: fields.Float(required=True, description='Class probability') for l in class_labels}) # noqa - E741
34 |
35 | predict_response = MAX_API.model('ModelPredictResponse', {
36 | 'status': fields.String(required=True, description='Response status message'),
37 | 'predictions': fields.List(fields.Nested(label_prediction), description='Predicted labels and probabilities')
38 | })
39 |
40 |
41 | class ModelPredictAPI(PredictAPI):
42 |
43 | model_wrapper = ModelWrapper()
44 |
45 | @MAX_API.doc('predict')
46 | @MAX_API.expect(input_parser)
47 | @MAX_API.marshal_with(predict_response)
48 | def post(self):
49 | """Make a prediction given input data"""
50 | result = {'status': 'error'}
51 |
52 | input_json = MAX_API.payload
53 |
54 | try:
55 | preds = self.model_wrapper.predict(input_json['text'])
56 | except: # noqa
57 | abort(400, "Please supply a valid input json. "
58 | "The json structure should have a 'text' field containing a list of strings")
59 |
60 | # Generate the output format for every input string
61 | output = [{l: p[i] for i, l in enumerate(class_labels)} for p in preds]
62 |
63 | result['predictions'] = output
64 | result['status'] = 'ok'
65 |
66 | return result
67 |
--------------------------------------------------------------------------------
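The `try`/`except` around `model_wrapper.predict` above is what turns malformed payloads into an HTTP 400. A quick sketch of that contract against a local instance (hypothetical inputs, same endpoint as the tests):

```python
# Sketch of the /model/predict input contract, assuming a local instance.
import requests

endpoint = 'http://localhost:5000/model/predict'

# 'text' must be a list of strings; a bare string makes the model wrapper
# raise, which ModelPredictAPI.post() converts into abort(400).
bad = requests.post(url=endpoint, json={'text': 'not a list'})
assert bad.status_code == 400

# A well-formed payload returns one probability dict per input string.
good = requests.post(url=endpoint, json={'text': ['A fine claim.']})
assert good.status_code == 200
print(good.json()['predictions'][0])  # {'positive': ..., 'negative': ...}
```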
/tests/test_response.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | import pytest
18 | import requests
19 |
20 |
21 | def test_response():
22 | model_endpoint = 'http://localhost:5000/model/predict'
23 |
24 | json_data = {
25 | "text": ["good string",
26 | "bad string"]
27 | }
28 |
29 | r = requests.post(url=model_endpoint, json=json_data)
30 |
31 | assert r.status_code == 200
32 | response = r.json()
33 | assert response['status'] == 'ok'
34 |
35 | # verify that 'good string' is in fact positive
36 | assert round(float(response['predictions'][0]['positive'])) == 1
37 | # verify that 'bad string' is in fact negative
38 | assert round(float(response['predictions'][1]['negative'])) == 1
39 |
40 | json_data2 = {
41 | "text": [
42 | "2008 was a dark, dark year for stock markets worldwide.",
43 | "The Model Asset Exchange is a crucial element of a developer's toolkit."
44 | ]
45 | }
46 |
47 | r = requests.post(url=model_endpoint, json=json_data2)
48 |
49 | assert r.status_code == 200
50 | response = r.json()
51 | assert response['status'] == 'ok'
52 |
53 | # verify that "2008 was a dark, dark year for stock markets worldwide." is in fact negative
54 | assert round(float(response['predictions'][0]['positive'])) == 0
55 | assert round(float(response['predictions'][0]['negative'])) == 1
56 | # verify that "The Model Asset Exchange is a crucial element of a developer's toolkit." is in fact positive
57 | assert round(float(response['predictions'][1]['negative'])) == 0
58 | assert round(float(response['predictions'][1]['positive'])) == 1
59 |
60 | # Test different input batch sizes
61 | for input_size in [4, 16, 32, 64, 75]:
62 | json_data3 = {
63 | "text": ["good string"]*input_size
64 | }
65 |
66 | r = requests.post(url=model_endpoint, json=json_data3)
67 |
68 | assert r.status_code == 200
69 | response = r.json()
70 | assert response['status'] == 'ok'
71 |
72 | assert len(response['predictions']) == len(json_data3["text"])
73 |
74 |
75 | if __name__ == '__main__':
76 | pytest.main([__file__])
77 |
--------------------------------------------------------------------------------
/core/model.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2018-2019 IBM Corp. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | from maxfw.model import MAXModelWrapper
18 |
19 | import logging
20 | from config import DEFAULT_MODEL_PATH, MODEL_META_DATA as model_meta
21 |
22 | from core.bert.run_classifier import convert_single_example, MAXAPIProcessor
23 | from core.bert import tokenization
24 | from tensorflow.python.saved_model import tag_constants
25 | import tensorflow as tf
26 | import numpy as np
27 |
28 | logger = logging.getLogger()
29 |
30 |
31 | class ModelWrapper(MAXModelWrapper):
32 |
33 | MODEL_META_DATA = model_meta
34 |
35 | def __init__(self, path=DEFAULT_MODEL_PATH):
36 | tf.compat.v1.disable_v2_behavior()
37 |
38 | logger.info('Loading model from: {}...'.format(path))
39 |
40 | self.max_seq_length = 128
41 | self.do_lower_case = True
42 |
43 | # Set Logging verbosity
44 | tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
45 |
46 | # Loading the tf Graph
47 | self.graph = tf.Graph()
48 | self.sess = tf.compat.v1.Session(graph=self.graph)
49 |         tf.compat.v1.saved_model.loader.load(self.sess, [tag_constants.SERVING], path)
50 |
51 | # Validate init_checkpoint
52 | tokenization.validate_case_matches_checkpoint(self.do_lower_case,
53 |                                                       path)
54 |
55 | # Initialize the dataprocessor
56 | self.processor = MAXAPIProcessor()
57 |
58 | # Get the labels
59 | self.label_list = self.processor.get_labels()
60 |
61 | # Initialize the tokenizer
62 | self.tokenizer = tokenization.FullTokenizer(
63 |             vocab_file=f'{path}/vocab.txt', do_lower_case=self.do_lower_case)
64 |
65 | logger.info('Loaded model')
66 |
67 | def _pre_process(self, input):
68 | '''Preprocessing of the input is not required as it is carried out by the BERT model (Tokenization).'''
69 | return input
70 |
71 | def _post_process(self, result):
72 | '''Reformat the results if needed.'''
73 | return result
74 |
75 | def _predict(self, x, predict_batch_size=32):
76 | '''Predict the class probabilities using the BERT model.'''
77 |
78 | # Get the input examples
79 | predict_examples = self.processor.get_test_examples(x)
80 |
81 | # Grab the input tensors of the Graph
82 | tensor_input_ids = self.sess.graph.get_tensor_by_name('input_ids_1:0')
83 | tensor_input_mask = self.sess.graph.get_tensor_by_name('input_mask_1:0')
84 | tensor_label_ids = self.sess.graph.get_tensor_by_name('label_ids_1:0')
85 | tensor_segment_ids = self.sess.graph.get_tensor_by_name('segment_ids_1:0')
86 | tensor_outputs = self.sess.graph.get_tensor_by_name('loss/Softmax:0')
87 |
88 | # Grab the examples, convert to features, and create batches. In the loop,
89 | # Go over all examples in chunks of size `predict_batch_size`.
90 | predictions = []
91 |
92 | for i in range(0, len(predict_examples), predict_batch_size):
93 | examples = predict_examples[i:i+predict_batch_size]
94 |
95 | tf.compat.v1.logging.info(
96 | f"{i} out of {len(predict_examples)} examples done ({round(i * 100 / len(predict_examples))}%).")
97 |
98 | # Convert example to feature in batches.
99 | input_ids, input_mask, label_ids, segment_ids = zip(
100 | *tuple(convert_single_example(i + j, example, self.label_list, self.max_seq_length, self.tokenizer)
101 | for j, example in enumerate(examples)))
102 |
103 | # Convert to a format that is consistent with input tensors
104 | feed_dict = {}
105 | feed_dict[tensor_input_ids] = np.vstack(
106 | tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in input_ids))
107 | feed_dict[tensor_input_mask] = np.vstack(
108 | tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in input_mask))
109 | feed_dict[tensor_label_ids] = np.vstack(
110 | tuple(np.array(arr) for arr in label_ids)).flatten()
111 | feed_dict[tensor_segment_ids] = np.vstack(
112 | tuple(np.array(arr).reshape(-1, self.max_seq_length) for arr in segment_ids))
113 |
114 | # Make a prediction
115 | result = self.sess.run(tensor_outputs, feed_dict=feed_dict)
116 | # Add the predictions
117 | predictions.extend(p.tolist() for p in result)
118 |
119 | return predictions
120 |
--------------------------------------------------------------------------------
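The feature-batching step in `_predict` above relies on a `zip(*...)` transpose followed by `np.vstack` to turn per-example feature tuples into the `(batch, max_seq_length)` arrays the graph expects. A toy illustration of that pattern, with made-up feature values and a tiny `max_seq_length`, is sketched below:

```python
# Toy illustration of the zip(*)/np.vstack batching in ModelWrapper._predict,
# using fake per-example features and max_seq_length=4.
import numpy as np

max_seq_length = 4
# Each converted example is a tuple: (input_ids, input_mask, label_id, segment_ids)
examples = [
    ([101, 7592, 102, 0], [1, 1, 1, 0], 0, [0, 0, 0, 0]),
    ([101, 2088, 102, 0], [1, 1, 1, 0], 0, [0, 0, 0, 0]),
]

# zip(*examples) transposes per-example tuples into per-feature tuples.
input_ids, input_mask, label_ids, segment_ids = zip(*examples)

batch_ids = np.vstack(tuple(np.array(a).reshape(-1, max_seq_length) for a in input_ids))
batch_labels = np.vstack(tuple(np.array(a) for a in label_ids)).flatten()

print(batch_ids.shape)     # (2, 4): one row per example
print(batch_labels.shape)  # (2,): one label id per example
```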
/core/bert/run_classifier.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """BERT finetuning runner."""
16 |
17 | import csv
18 | from core.bert import tokenization
19 | import tensorflow as tf
20 |
21 |
22 | class InputExample(object):
23 | """A single training/test example for simple sequence classification."""
24 |
25 | def __init__(self, guid, text_a, text_b=None, label=None):
26 | """Constructs a InputExample.
27 |
28 | Args:
29 | guid: Unique id for the example.
30 | text_a: string. The untokenized text of the first sequence. For single
31 | sequence tasks, only this sequence must be specified.
32 | text_b: (Optional) string. The untokenized text of the second sequence.
33 | Only must be specified for sequence pair tasks.
34 | label: (Optional) string. The label of the example. This should be
35 | specified for train and dev examples, but not for test examples.
36 | """
37 | self.guid = guid
38 | self.text_a = text_a
39 | self.text_b = text_b
40 | self.label = label
41 |
42 |
43 | class PaddingInputExample(object):
44 | """Fake example so the num input examples is a multiple of the batch size.
45 |
46 | When running eval/predict on the TPU, we need to pad the number of examples
47 | to be a multiple of the batch size, because the TPU requires a fixed batch
48 | size. The alternative is to drop the last batch, which is bad because it means
49 | the entire output data won't be generated.
50 |
51 | We use this class instead of `None` because treating `None` as padding
52 | batches could cause silent errors.
53 | """
54 |
55 |
56 | class InputFeatures(object):
57 | """A single set of features of data."""
58 |
59 | def __init__(self,
60 | input_ids,
61 | input_mask,
62 | segment_ids,
63 | label_id,
64 | is_real_example=True):
65 | self.input_ids = input_ids
66 | self.input_mask = input_mask
67 | self.segment_ids = segment_ids
68 | self.label_id = label_id
69 | self.is_real_example = is_real_example
70 |
71 |
72 | class DataProcessor(object):
73 | """Base class for data converters for sequence classification data sets."""
74 |
75 | def get_train_examples(self, data_dir):
76 | """Gets a collection of `InputExample`s for the train set."""
77 | raise NotImplementedError()
78 |
79 | def get_dev_examples(self, data_dir):
80 | """Gets a collection of `InputExample`s for the dev set."""
81 | raise NotImplementedError()
82 |
83 | def get_test_examples(self, data_dir):
84 | """Gets a collection of `InputExample`s for prediction."""
85 | raise NotImplementedError()
86 |
87 | def get_labels(self):
88 | """Gets the list of labels for this data set."""
89 | raise NotImplementedError()
90 |
91 | @classmethod
92 | def _read_tsv(cls, input_file, quotechar=None):
93 | """Reads a tab separated value file."""
94 |         with tf.compat.v1.gfile.Open(input_file, "r") as f:
95 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
96 | lines = []
97 | for line in reader:
98 | lines.append(line)
99 | return lines
100 |
101 | @classmethod
102 | def _read_csv(cls, input_file, quotechar=None):
103 | """Reads a comma separated value file."""
104 |         with tf.compat.v1.gfile.Open(input_file, "r") as f:
105 | reader = csv.reader(f, delimiter=",", quotechar=quotechar)
106 | lines = []
107 | for line in reader:
108 | lines.append(line)
109 | return lines
110 |
111 |
112 | class MAXAPIProcessor(DataProcessor):
113 | """Custom Data Processor for the MAX API."""
114 |
115 | def get_test_examples(self, test_data):
116 | """See base class."""
117 |
118 |         # Verify that the input is a list of strings
119 |         if not isinstance(test_data, list):
120 |             raise TypeError("'test_data' must be a list of strings, got %r" % type(test_data))
121 | # Create InputExample objects from the input data
122 | return self._create_examples(test_data, "test")
123 |
124 | def get_labels(self):
125 | """See base class."""
126 | return ["pos", "neg"]
127 |
128 | def _create_examples(self, lines, set_type):
129 |         """Creates examples from the input strings."""
130 | examples = []
131 | for (i, line) in enumerate(lines):
132 | guid = "%s-%s" % (set_type, i)
133 | text_a = tokenization.convert_to_unicode(line)
134 | label = self.get_labels()[0]
135 | examples.append(
136 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
137 | return examples
138 |
139 |
140 | def convert_single_example(ex_index, example, label_list, max_seq_length,
141 | tokenizer):
142 | """Converts a single `InputExample` into a single `InputFeatures`."""
143 |
144 | if isinstance(example, PaddingInputExample):
145 | return InputFeatures(
146 | input_ids=[0] * max_seq_length,
147 | input_mask=[0] * max_seq_length,
148 | segment_ids=[0] * max_seq_length,
149 | label_id=0,
150 | is_real_example=False)
151 |
152 | label_map = {}
153 | for (i, label) in enumerate(label_list):
154 | label_map[label] = i
155 |
156 | tokens_a = tokenizer.tokenize(example.text_a)
157 | tokens_b = None
158 | if example.text_b:
159 | tokens_b = tokenizer.tokenize(example.text_b)
160 |
161 | if tokens_b:
162 |         # Truncate `tokens_a` and `tokens_b` in place so the total length fits;
163 |         # account for [CLS], [SEP], [SEP] with "- 3" (inlines BERT's `_truncate_seq_pair`).
164 |         while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
165 |             (tokens_a if len(tokens_a) > len(tokens_b) else tokens_b).pop()
166 | else:
167 | # Account for [CLS] and [SEP] with "- 2"
168 | if len(tokens_a) > max_seq_length - 2:
169 | tokens_a = tokens_a[0:(max_seq_length - 2)]
170 |
171 | # The convention in BERT is:
172 | # (a) For sequence pairs:
173 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
174 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
175 | # (b) For single sequences:
176 | # tokens: [CLS] the dog is hairy . [SEP]
177 | # type_ids: 0 0 0 0 0 0 0
178 | #
179 | # Where "type_ids" are used to indicate whether this is the first
180 | # sequence or the second sequence. The embedding vectors for `type=0` and
181 | # `type=1` were learned during pre-training and are added to the wordpiece
182 | # embedding vector (and position vector). This is not *strictly* necessary
183 | # since the [SEP] token unambiguously separates the sequences, but it makes
184 | # it easier for the model to learn the concept of sequences.
185 | #
186 | # For classification tasks, the first vector (corresponding to [CLS]) is
187 | # used as the "sentence vector". Note that this only makes sense because
188 | # the entire model is fine-tuned.
189 | tokens = []
190 | segment_ids = []
191 | tokens.append("[CLS]")
192 | segment_ids.append(0)
193 | for token in tokens_a:
194 | tokens.append(token)
195 | segment_ids.append(0)
196 | tokens.append("[SEP]")
197 | segment_ids.append(0)
198 |
199 | if tokens_b:
200 | for token in tokens_b:
201 | tokens.append(token)
202 | segment_ids.append(1)
203 | tokens.append("[SEP]")
204 | segment_ids.append(1)
205 |
206 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
207 |
208 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
209 | # tokens are attended to.
210 | input_mask = [1] * len(input_ids)
211 |
212 | # Zero-pad up to the sequence length.
213 | while len(input_ids) < max_seq_length:
214 | input_ids.append(0)
215 | input_mask.append(0)
216 | segment_ids.append(0)
217 |
218 | if len(input_ids) != max_seq_length:
219 | raise ValueError("'input_ids' has an incorrect length: %r" % len(input_ids))
220 | if len(input_mask) != max_seq_length:
221 | raise ValueError("'input_mask' has an incorrect length: %r" % len(input_mask))
222 | if len(segment_ids) != max_seq_length:
223 | raise ValueError("'segment_ids' has an incorrect length: %r" % len(segment_ids))
224 |
225 | label_id = label_map[example.label]
226 | if ex_index < 5:
227 | tf.compat.v1.logging.info("*** Example ***")
228 | tf.compat.v1.logging.info("guid: %s" % (example.guid))
229 | tf.compat.v1.logging.info("tokens: %s" % " ".join(
230 | [tokenization.printable_text(x) for x in tokens]))
231 | tf.compat.v1.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
232 | tf.compat.v1.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
233 | tf.compat.v1.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
234 | tf.compat.v1.logging.info("label: %s (id = %d)" % (example.label, label_id))
235 |
236 | # return feature
237 | return input_ids, input_mask, label_id, segment_ids
238 |
--------------------------------------------------------------------------------
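To make the single-sequence `[CLS]`/`[SEP]` layout and zero-padding in `convert_single_example` concrete, here is a toy walk-through with a whitespace "tokenizer" and `max_seq_length=8` (real inputs go through the WordPiece `FullTokenizer` instead):

```python
# Toy walk-through of the single-sequence layout from convert_single_example:
# [CLS] tok1 ... tokN [SEP], then zero-padding up to max_seq_length.
max_seq_length = 8
tokens_a = "the dog is hairy .".split()  # stand-in for WordPiece tokens

tokens = ["[CLS]"] + tokens_a[:max_seq_length - 2] + ["[SEP]"]
segment_ids = [0] * len(tokens)  # single sequence -> all type 0
input_mask = [1] * len(tokens)   # 1 marks real (non-padding) tokens

pad = max_seq_length - len(tokens)
input_mask += [0] * pad
segment_ids += [0] * pad

print(tokens)       # ['[CLS]', 'the', 'dog', 'is', 'hairy', '.', '[SEP]']
print(input_mask)   # [1, 1, 1, 1, 1, 1, 1, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```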
/README.md:
--------------------------------------------------------------------------------
1 | [Build Status](https://travis-ci.com/IBM/MAX-Text-Sentiment-Classifier) [API demo](http://max-text-sentiment-classifier.codait-prod-41208c73af8fca213512856c7a09db52-0000.us-east.containers.appdomain.cloud)
2 |
3 | [![Deploy MAX to IBM Cloud with Kubernetes](docs/deploy-max-to-ibm-cloud-with-kubernetes-button.png)](http://ibm.biz/max-to-ibm-cloud-tutorial)
4 |
5 | # IBM Developer Model Asset Exchange: Text Sentiment Classifier
6 |
7 | This repository contains code to instantiate and deploy a text sentiment classifier. The model detects whether a text fragment leans towards a positive or a negative sentiment. Optimal inputs are short strings (preferably a single sentence) with correct grammar, although correct grammar is not strictly required.
8 |
9 | The model is based on the [pre-trained BERT-Base, English Uncased](https://github.com/google-research/bert/blob/master/README.md) model and was fine-tuned on the [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml). The model files are hosted on
10 | [IBM Cloud Object Storage](https://max-cdn.cdn.appdomain.cloud/max-text-sentiment-classifier/1.2.0/assets.tar.gz).
11 | The code in this repository deploys the model as a web service in a Docker container. This repository was developed
12 | as part of the [IBM Developer Model Asset Exchange](https://developer.ibm.com/exchanges/models/) and the public API is powered by [IBM Cloud](https://ibm.biz/Bdz2XM).
13 |
14 | ## Model Metadata
15 | | Domain | Application | Industry | Framework | Training Data | Input Data |
16 | | --------- | -------- | -------- | --------- | --------- | --------------- |
17 | | Natural Language Processing (NLP) | Sentiment Analysis | General | TensorFlow | [IBM Claim Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml) | Text |
18 |
19 | ## Benchmark
20 | The table below lists the prediction accuracy of the model on the test sets of three different datasets.
21 |
22 | The first row showcases the generalization power of our model after fine-tuning on the IBM Claims Dataset.
23 | The Sentiment140 (Tweets) and IMDB Reviews datasets are only used for evaluating the transfer-learning capabilities of this model. The implementation in this repository was **not** trained or fine-tuned on the Sentiment140 or IMDB reviews datasets.
24 |
25 | The second row describes the performance of the BERT-Base (English - Uncased) model when fine-tuned on the specific task. This was done simply for reference, and the weights are therefore not made available.
26 |
27 |
28 | The generalization results (first row) are very good when the input data is similar to the data used for fine-tuning (e.g. Sentiment140 (tweets) when fine-tuned on the IBM Claims Dataset). However, when a different style of text is given as input, and with a longer median length (e.g. multi-sentence IMDB reviews), the results are not as good.
29 |
30 | | Model Type | IBM Claims | Sentiment140 | IMDB Reviews |
31 | | ------------- | -------- | -------- | -------------- |
32 | | This model (fine-tuned on IBM Claims) | 94% | 83.84% | 81% |
33 | | Models fine-tuned on the specific dataset | 94% | 84% | 90% |
34 |
35 | ## References
36 | * _J. Devlin, M. Chang, K. Lee, K. Toutanova_, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), arXiv, 2018.
37 | * [Google BERT repository](https://github.com/google-research/bert)
38 | * [IBM Claims Stance Dataset](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Project) and [IBM Project Debater](https://www.research.ibm.com/artificial-intelligence/project-debater/)
39 |
40 | ## Licenses
41 | | Component | License | Link |
42 | | ------------- | -------- | -------- |
43 | | This repository | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/IBM/MAX-Text-Sentiment-Classifier/blob/master/LICENSE) |
44 | | Fine-tuned Model Weights | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/IBM/MAX-Text-Sentiment-Classifier/blob/master/LICENSE) |
45 | | Pre-trained Model Weights | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/google-research/bert/blob/master/LICENSE) |
46 | | Model Code (3rd party) | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | [LICENSE](https://github.com/google-research/bert/blob/master/LICENSE) |
47 | | IBM Claims Stance Dataset for fine-tuning | [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/) | [LICENSE 1](http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Project) [LICENSE 2](https://en.wikipedia.org/wiki/Wikipedia:Copyrights#Reusers.27_rights_and_obligations) |
48 |
49 | ## Pre-requisites:
50 | * `docker`: The [Docker](https://www.docker.com/) command-line interface. Follow the [installation instructions](https://docs.docker.com/install/) for your system.
51 | * The minimum recommended resources for this model are 4 GB of memory and 4 CPUs.
52 | * If you are on x86-64/AMD64, your CPU must support [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) at the minimum.
53 |
54 | # Deployment options
55 |
56 | * [Deploy from Quay](#deploy-from-quay)
57 | * [Deploy on Red Hat OpenShift](#deploy-on-red-hat-openshift)
58 | * [Deploy on Kubernetes](#deploy-on-kubernetes)
59 | * [Run Locally](#run-locally)
60 |
61 | ## Deploy from Quay
62 | To run the docker image, which automatically starts the model serving API, run:
63 |
64 | ```
65 | $ docker run -it -p 5000:5000 quay.io/codait/max-text-sentiment-classifier
66 | ```
67 |
68 | This will pull a pre-built image from the Quay.io container registry (or use an existing image if already cached locally) and run it.
69 | If you'd rather check out and build the model locally, you can follow the [run locally](#run-locally) steps below.
70 |
71 | ## Deploy on Red Hat OpenShift
72 |
73 | You can deploy the model-serving microservice on Red Hat OpenShift by following the instructions for the OpenShift web console or the OpenShift Container Platform CLI [in this tutorial](https://developer.ibm.com/tutorials/deploy-a-model-asset-exchange-microservice-on-red-hat-openshift/), specifying `quay.io/codait/max-text-sentiment-classifier` as the image name.
74 |
75 | > Note that this model requires at least 4GB of RAM. Therefore this model will not run in a cluster that was provisioned under the [OpenShift Online starter plan](https://www.openshift.com/products/online/), which is capped at 2GB.
76 |
77 | ## Deploy on Kubernetes
78 | You can also deploy the model on Kubernetes using the latest docker image on Quay.
79 |
80 | On your Kubernetes cluster, run the following commands:
81 |
82 | ```
83 | $ kubectl apply -f https://github.com/IBM/MAX-Text-Sentiment-Classifier/raw/master/max-text-sentiment-classifier.yaml
84 | ```
85 |
86 | The model will be available internally at port `5000`, but can also be accessed externally through the `NodePort`.
87 |
88 | A more elaborate tutorial on how to deploy this MAX model to production on [IBM Cloud](https://ibm.biz/Bdz2XM) can be found [here](http://ibm.biz/max-to-ibm-cloud-tutorial).
89 |
90 | ## Run Locally
91 | 1. [Build the Model](#1-build-the-model)
92 | 2. [Deploy the Model](#2-deploy-the-model)
93 | 3. [Use the Model](#3-use-the-model)
94 | 4. [Development](#4-development)
95 | 5. [Cleanup](#5-cleanup)
96 |
97 |
98 | ### 1. Build the Model
99 | Clone this repository locally. In a terminal, run the following command:
100 |
101 | ```
102 | $ git clone https://github.com/IBM/MAX-Text-Sentiment-Classifier.git
103 | ```
104 |
105 | Change directory into the repository base folder:
106 |
107 | ```
108 | $ cd MAX-Text-Sentiment-Classifier
109 | ```
110 |
111 | To build the docker image locally, run:
112 |
113 | ```
114 | $ docker build -t max-text-sentiment-classifier .
115 | ```
116 |
117 | All required model assets will be downloaded during the build process. _Note_ that currently this docker image is CPU only (we will add support for GPU images later).
118 |
119 |
120 | ### 2. Deploy the Model
121 | To run the docker image, which automatically starts the model serving API, run:
122 |
123 | ```
124 | $ docker run -it -p 5000:5000 max-text-sentiment-classifier
125 | ```
126 |
127 | ### 3. Use the Model
128 |
129 | The API server automatically generates an interactive Swagger documentation page. Go to `http://localhost:5000` to load it. From there you can explore the API and also create test requests.
130 |
131 | ```
132 | Example:
133 | [
134 | "The Model Asset Exchange is a crucial element of a developer's toolkit.",
135 | "2008 was a dark, dark year for stock markets worldwide."
136 | ]
137 |
138 | Result:
139 | [
140 |   {
141 |     "positive": 0.9977352619171143,
142 |     "negative": 0.002264695707708597
143 |   },
144 |   {
145 |     "positive": 0.001138084102421999,
146 |     "negative": 0.9988619089126587
147 |   }
148 | ]
149 |
150 |
151 | ```
152 |
153 |
154 | Use the `model/predict` endpoint to submit input text in json format. The json structure should have one key, `text`, whose value is a list of input strings to be analyzed. An example can be found in the image below.
155 |
156 | Submitting valid json data triggers the model and returns a json response with a `status` and a `predictions` key. The `predictions` field holds a list of class labels with their corresponding probabilities; the first element in the list corresponds to the prediction for the first string in the input list.
157 |
158 |
159 | ![Swagger UI screenshot](docs/swagger-screenshot.png)
160 |
161 | You can also test it on the command line, for example:
162 |
163 | ```bash
164 | $ curl -d "{ \"text\": [ \"The Model Asset Exchange is a crucial element of a developer's toolkit.\" ]}" -X POST "http://localhost:5000/model/predict" -H "Content-Type: application/json"
165 | ```
166 |
167 | You should see a JSON response like that below:
168 |
169 | ```json
170 | {
171 |   "status": "ok",
172 |   "predictions": [
173 |     {
174 |       "positive": 0.9977352619171143,
175 |       "negative": 0.0022646968718618155
176 |     }
177 |   ]
178 | }
179 |
180 |
181 | ```
182 |
183 | ### 4. Development
184 | To run the Flask API app in debug mode, edit `config.py` to set `DEBUG = True` under the application settings. You will then need to rebuild the docker image (see [step 1](#1-build-the-model)).
185 |
186 | ### 5. Cleanup
187 | To stop the Docker container, press `CTRL` + `C` in your terminal.
188 |
189 | ## Resources and Contributions
190 |
191 | If you are interested in contributing to the Model Asset Exchange project or have any queries, please follow the instructions [here](https://github.com/CODAIT/max-central-repo).
192 |
--------------------------------------------------------------------------------
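As a companion to the `curl` example in the README above, an equivalent Python call (same local endpoint and payload) might look like:

```python
# Python equivalent of the README's curl example against a local instance.
import requests

r = requests.post(
    'http://localhost:5000/model/predict',
    json={'text': ["The Model Asset Exchange is a crucial element of a developer's toolkit."]},
)
r.raise_for_status()
for prediction in r.json()['predictions']:
    print(prediction)  # {'positive': 0.99..., 'negative': 0.00...}
```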
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright 2019, IBM (Center of Open-Source Data & AI Technologies)
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/core/bert/tokenization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """Tokenization classes."""
16 |
17 | import collections
18 | import re
19 | import unicodedata
20 | import six
21 | import tensorflow as tf
22 |
23 |
24 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
25 | """Checks whether the casing config is consistent with the checkpoint name."""
26 |
27 | # The casing has to be passed in by the user and there is no explicit check
28 | # as to whether it matches the checkpoint. The casing information probably
29 | # should have been stored in the bert_config.json file, but it's not, so
30 | # we have to heuristically detect it to validate.
31 |
32 | if not init_checkpoint:
33 | return
34 |
35 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)
36 | if m is None:
37 | return
38 |
39 | model_name = m.group(1)
40 |
41 | lower_models = [
42 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
43 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
44 | ]
45 |
46 | cased_models = [
47 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
48 | "multi_cased_L-12_H-768_A-12"
49 | ]
50 |
51 | is_bad_config = False
52 | if model_name in lower_models and not do_lower_case:
53 | is_bad_config = True
54 | actual_flag = "False"
55 | case_name = "lowercased"
56 | opposite_flag = "True"
57 |
58 | if model_name in cased_models and do_lower_case:
59 | is_bad_config = True
60 | actual_flag = "True"
61 | case_name = "cased"
62 | opposite_flag = "False"
63 |
64 | if is_bad_config:
65 | raise ValueError(
66 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
67 | "However, `%s` seems to be a %s model, so you "
68 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
69 |         "how the model was pre-trained. If this error is wrong, please "
70 | "just comment out this check." % (actual_flag, init_checkpoint,
71 | model_name, case_name, opposite_flag))
72 |
73 |
74 | def convert_to_unicode(text):
75 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
76 | if six.PY3:
77 | if isinstance(text, str):
78 | return text
79 | elif isinstance(text, bytes):
80 | return text.decode("utf-8", "ignore")
81 | else:
82 | raise ValueError("Unsupported string type: %s" % (type(text)))
83 | elif six.PY2:
84 | if isinstance(text, str):
85 | return text.decode("utf-8", "ignore")
86 | elif isinstance(text, unicode): # noqa
87 | return text
88 | else:
89 | raise ValueError("Unsupported string type: %s" % (type(text)))
90 | else:
91 |     raise ValueError("Not running on Python 2 or Python 3?")
92 |
93 |
94 | def printable_text(text):
95 | """Returns text encoded in a way suitable for print or `tf.logging`."""
96 |
97 | # These functions want `str` for both Python2 and Python3, but in one case
98 | # it's a Unicode string and in the other it's a byte string.
99 | if six.PY3:
100 | if isinstance(text, str):
101 | return text
102 | elif isinstance(text, bytes):
103 | return text.decode("utf-8", "ignore")
104 | else:
105 | raise ValueError("Unsupported string type: %s" % (type(text)))
106 | elif six.PY2:
107 | if isinstance(text, str):
108 | return text
109 | elif isinstance(text, unicode): # noqa
110 | return text.encode("utf-8")
111 | else:
112 | raise ValueError("Unsupported string type: %s" % (type(text)))
113 | else:
114 |     raise ValueError("Not running on Python 2 or Python 3?")
115 |
116 |
117 | def load_vocab(vocab_file):
118 | """Loads a vocabulary file into a dictionary."""
119 | vocab = collections.OrderedDict()
120 | index = 0
121 | with tf.compat.v1.gfile.GFile(vocab_file, "r") as reader:
122 | while True:
123 | token = convert_to_unicode(reader.readline())
124 | if not token:
125 | break
126 | token = token.strip()
127 | vocab[token] = index
128 | index += 1
129 | return vocab
130 |
131 |
132 | def convert_by_vocab(vocab, items):
133 | """Converts a sequence of [tokens|ids] using the vocab."""
134 | output = []
135 | for item in items:
136 | output.append(vocab[item])
137 | return output
138 |
139 |
140 | def convert_tokens_to_ids(vocab, tokens):
141 | return convert_by_vocab(vocab, tokens)
142 |
143 |
144 | def convert_ids_to_tokens(inv_vocab, ids):
145 | return convert_by_vocab(inv_vocab, ids)
146 |
147 |
148 | def whitespace_tokenize(text):
149 | """Runs basic whitespace cleaning and splitting on a piece of text."""
150 | text = text.strip()
151 | if not text:
152 | return []
153 | tokens = text.split()
154 | return tokens
155 |
156 |
157 | class FullTokenizer(object):
158 |   """Runs end-to-end tokenization."""
159 |
160 | def __init__(self, vocab_file, do_lower_case=True):
161 | self.vocab = load_vocab(vocab_file)
162 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
163 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
164 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
165 |
166 | def tokenize(self, text):
167 | split_tokens = []
168 | for token in self.basic_tokenizer.tokenize(text):
169 | for sub_token in self.wordpiece_tokenizer.tokenize(token):
170 | split_tokens.append(sub_token)
171 |
172 | return split_tokens
173 |
174 | def convert_tokens_to_ids(self, tokens):
175 | return convert_by_vocab(self.vocab, tokens)
176 |
177 | def convert_ids_to_tokens(self, ids):
178 | return convert_by_vocab(self.inv_vocab, ids)
179 |
180 |
181 | class BasicTokenizer(object):
182 | """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
183 |
184 | def __init__(self, do_lower_case=True):
185 | """Constructs a BasicTokenizer.
186 |
187 | Args:
188 | do_lower_case: Whether to lower case the input.
189 | """
190 | self.do_lower_case = do_lower_case
191 |
192 | def tokenize(self, text):
193 | """Tokenizes a piece of text."""
194 | text = convert_to_unicode(text)
195 | text = self._clean_text(text)
196 |
197 | # This was added on November 1st, 2018 for the multilingual and Chinese
198 | # models. This is also applied to the English models now, but it doesn't
199 | # matter since the English models were not trained on any Chinese data
200 | # and generally don't have any Chinese data in them (there are Chinese
201 | # characters in the vocabulary because Wikipedia does have some Chinese
202 |     # words in the English Wikipedia).
203 | text = self._tokenize_chinese_chars(text)
204 |
205 | orig_tokens = whitespace_tokenize(text)
206 | split_tokens = []
207 | for token in orig_tokens:
208 | if self.do_lower_case:
209 | token = token.lower()
210 | token = self._run_strip_accents(token)
211 | split_tokens.extend(self._run_split_on_punc(token))
212 |
213 | output_tokens = whitespace_tokenize(" ".join(split_tokens))
214 | return output_tokens
215 |
216 | def _run_strip_accents(self, text):
217 | """Strips accents from a piece of text."""
218 | text = unicodedata.normalize("NFD", text)
219 | output = []
220 | for char in text:
221 | cat = unicodedata.category(char)
222 | if cat == "Mn":
223 | continue
224 | output.append(char)
225 | return "".join(output)
226 |
227 | def _run_split_on_punc(self, text):
228 | """Splits punctuation on a piece of text."""
229 | chars = list(text)
230 | i = 0
231 | start_new_word = True
232 | output = []
233 | while i < len(chars):
234 | char = chars[i]
235 | if _is_punctuation(char):
236 | output.append([char])
237 | start_new_word = True
238 | else:
239 | if start_new_word:
240 | output.append([])
241 | start_new_word = False
242 | output[-1].append(char)
243 | i += 1
244 |
245 | return ["".join(x) for x in output]
246 |
247 | def _tokenize_chinese_chars(self, text):
248 | """Adds whitespace around any CJK character."""
249 | output = []
250 | for char in text:
251 | cp = ord(char)
252 | if self._is_chinese_char(cp):
253 | output.append(" ")
254 | output.append(char)
255 | output.append(" ")
256 | else:
257 | output.append(char)
258 | return "".join(output)
259 |
260 | def _is_chinese_char(self, cp):
261 | """Checks whether CP is the codepoint of a CJK character."""
262 | # This defines a "chinese character" as anything in the CJK Unicode block:
263 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
264 | #
265 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
266 | # despite its name. The modern Korean Hangul alphabet is a different block,
267 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write
268 | # space-separated words, so they are not treated specially and handled
269 |     # like all of the other languages.
270 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
271 | (cp >= 0x3400 and cp <= 0x4DBF) or #
272 | (cp >= 0x20000 and cp <= 0x2A6DF) or #
273 | (cp >= 0x2A700 and cp <= 0x2B73F) or #
274 | (cp >= 0x2B740 and cp <= 0x2B81F) or #
275 | (cp >= 0x2B820 and cp <= 0x2CEAF) or
276 | (cp >= 0xF900 and cp <= 0xFAFF) or #
277 | (cp >= 0x2F800 and cp <= 0x2FA1F)): #
278 | return True
279 |
280 | return False
281 |
282 | def _clean_text(self, text):
283 | """Performs invalid character removal and whitespace cleanup on text."""
284 | output = []
285 | for char in text:
286 | cp = ord(char)
287 | if cp == 0 or cp == 0xfffd or _is_control(char):
288 | continue
289 | if _is_whitespace(char):
290 | output.append(" ")
291 | else:
292 | output.append(char)
293 | return "".join(output)
294 |
295 |
296 | class WordpieceTokenizer(object):
297 |   """Runs WordPiece tokenization."""
298 |
299 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): # nosec - "[UNK]" is not a password
300 | self.vocab = vocab
301 | self.unk_token = unk_token
302 | self.max_input_chars_per_word = max_input_chars_per_word
303 |
304 | def tokenize(self, text):
305 | """Tokenizes a piece of text into its word pieces.
306 |
307 | This uses a greedy longest-match-first algorithm to perform tokenization
308 | using the given vocabulary.
309 |
310 | For example:
311 | input = "unaffable"
312 | output = ["un", "##aff", "##able"]
313 |
314 | Args:
315 |       text: A single token or whitespace-separated tokens. This should have
316 |         already been passed through `BasicTokenizer`.
317 |
318 | Returns:
319 | A list of wordpiece tokens.
320 | """
321 |
322 | text = convert_to_unicode(text)
323 |
324 | output_tokens = []
325 | for token in whitespace_tokenize(text):
326 | chars = list(token)
327 | if len(chars) > self.max_input_chars_per_word:
328 | output_tokens.append(self.unk_token)
329 | continue
330 |
331 | is_bad = False
332 | start = 0
333 | sub_tokens = []
334 | while start < len(chars):
335 | end = len(chars)
336 | cur_substr = None
337 | while start < end:
338 | substr = "".join(chars[start:end])
339 | if start > 0:
340 | substr = "##" + substr
341 | if substr in self.vocab:
342 | cur_substr = substr
343 | break
344 | end -= 1
345 | if cur_substr is None:
346 | is_bad = True
347 | break
348 | sub_tokens.append(cur_substr)
349 | start = end
350 |
351 | if is_bad:
352 | output_tokens.append(self.unk_token)
353 | else:
354 | output_tokens.extend(sub_tokens)
355 | return output_tokens
356 |
357 |
358 | def _is_whitespace(char):
359 |   """Checks whether `char` is a whitespace character."""
360 |   # \t, \n, and \r are technically control characters but we treat them
361 | # as whitespace since they are generally considered as such.
362 | if char == " " or char == "\t" or char == "\n" or char == "\r":
363 | return True
364 | cat = unicodedata.category(char)
365 | if cat == "Zs":
366 | return True
367 | return False
368 |
369 |
370 | def _is_control(char):
371 |   """Checks whether `char` is a control character."""
372 | # These are technically control characters but we count them as whitespace
373 | # characters.
374 | if char == "\t" or char == "\n" or char == "\r":
375 | return False
376 | cat = unicodedata.category(char)
377 | if cat.startswith("C"):
378 | return True
379 | return False
380 |
381 |
382 | def _is_punctuation(char):
383 |   """Checks whether `char` is a punctuation character."""
384 | cp = ord(char)
385 | # We treat all non-letter/number ASCII as punctuation.
386 | # Characters such as "^", "$", and "`" are not in the Unicode
387 | # Punctuation class but we treat them as punctuation anyways, for
388 | # consistency.
389 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
390 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
391 | return True
392 | cat = unicodedata.category(char)
393 | if cat.startswith("P"):
394 | return True
395 | return False
396 |
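397 | # A minimal usage sketch of the classes above, assuming a BERT `vocab.txt`
398 | # is available locally (the path below is illustrative, not a file shipped
399 | # with this module):
400 | #
401 | #   tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
402 | #   tokens = tokenizer.tokenize("unaffable")  # -> ["un", "##aff", "##able"]
403 | #   ids = tokenizer.convert_tokens_to_ids(tokens)
404 | #   assert tokenizer.convert_ids_to_tokens(ids) == tokens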
--------------------------------------------------------------------------------
/samples/test-examples.tsv:
--------------------------------------------------------------------------------
1 | video game violence is not related to serious aggressive behavior in real life pos
2 | Neurological link was found between playing violent video games and aggressive behaviour in children and teenagers neg
3 | many skills can be learned from the gaming experience, it builds practical and intellectual skills pos
4 | children may imitate aggressive behaviors witnessed in media neg
5 | Violent video games - especially first-person shooter games - encourages real-life acts of violence in teenagers neg
6 | there is social utility in expressive and imaginative forms of entertainment, even if they contain violence pos
7 | China's birth control policy contributes to forced abortions neg
8 | the dramatic decrease in Chinese fertility started before the program began in 1979 for unrelated factors neg
9 | overpopulation has been blamed for a variety of issues, including increasing poverty neg
10 | Lack of siblings has been blamed for a number of social ills neg
11 | The use of drugs to enhance performance is considered unethical by most international sports organizations neg
12 | There is a wide range of health concerns for users of anabolic steroids neg
13 | Physical exercise helps prevent the diseases of affluence pos
14 | Exercise alone is a potential prevention method and/or treatment for mild forms of depression pos
15 | Too much exercise can be harmful neg
16 | affirmative action devalues the accomplishments of people who are chosen based on the social group to which they belong rather than their qualifications neg
17 | affirmative action has undesirable side-effects in addition to failing to achieve its goals neg
18 | attempts at antidiscrimination have been criticized as reverse discrimination neg
19 | States must take measures to seek to eliminate prejudices pos
20 | Historical racism continues to be reflected in socio-economic inequality neg
21 | Policies adopted as affirmative action have been criticized as a form of referse discrimination neg
22 | Affirmative action perpetuates racial division neg
23 | no one has a legal right to have any demographic characteristic they possess be considered a favorable point on their behalf neg
24 | Affirmative action is counter-productive neg
25 | Affirmative action policies engender animosity toward preferred groups neg
26 | The identification of oppressed classes is difficult to carry out neg
27 | The typical knock out results in a sustained loss of consciousness neg
28 | The idea of multiculturalism had reached the end of its useful life neg
29 | multiculturalism works better in theory than in practice neg
30 | some forms of multiculturalism can divide people neg
31 | Multiculturalism would lead to acceptance of barbaric practices neg
32 | Addicted gamblers cost companies loss of productivity and profit neg
33 | Gamblers may suffer from depression and bankruptcy neg
34 | Compulsive gambling is often very detrimental to personal relationships neg
35 | Abuse is common in homes where pathological gambling is present neg
36 | Internet gambling is a legitimate activity that citizens have the right to engage in pos
37 | Gambling is a popular leisure activity enjoyed in many forms by millions of people pos
38 | electronic funds transfers inherent in online gambling are being exploited by criminal interests neg
39 | everyone has the right to leave or enter a country, along with movement within it pos
40 | Immigrants are considered hostile or alien to the natural culture neg
41 | Monarchy is a check against possible illegal action by politicians pos
42 | Monarchs rule with the intent of improving the lives of their subjects pos
43 | Royals are simply celebrities who should not have any formal role neg
44 | Monarchy encourages a feeling of dependency in many people who should instead have confidence in themselves and their fellow citizens neg
45 | Monarchical prerogative powers can be used to circumvent normal democratic process with no accountability neg
46 | Monarchy devalues intellect and achievement neg
47 | A constitutional monarch with limited powers and non-partisan nature can provide a focus for national unity pos
48 | the monarchy is inherently contrary to egalitarianism and multiculturalism neg
49 | "a republic is ""inevitable" pos
50 | "the Church's ban on condoms has ""caused the death of millions" neg
51 | release of tactical information usually presents a greater risk of casualties among one's own forces neg
52 | the right to freedom of speech is not absolute neg
53 | When resources are put into tobacco production they are taken away from food production neg
54 | tobacco causes cancer neg
55 | Freedom of expression is subject to certain restrictions neg
56 | The free communication of ideas and opinions is one of the most precious of the rights of man pos
57 | Speech can be justifiably suppressed in order to prevent harm from a clear and direct threat neg
58 | liberty of expression 'is not absolute neg
59 | government may not prohibit the expression of an idea simply because society finds the idea offensive or disagreeable neg
60 | Punishment of dangerous or offensive writings is necessary for the preservation of peace and good order pos
61 | Freedom of speech and expression can not be an excuse for distribution of indecent and immoral content neg
62 | censorship violates multiple Basic Human Rights neg
63 | by distributing vouchers to the families of students equal to the tuition that he/she would receive at his/her local public school, a student’s family could then choose from options where best to send their child pos
64 | Vouchers function to increase racial and economic discrimination in schools neg
65 | Earmarks are Good for American Democracy pos
66 | Congressmen tend to distribute specialized benefits at a great cost and ignore the particular costs the legislation bears upon the taxpayers neg
67 | Nuclear weapons decrease the chances of crisis escalation pos
68 | new nuclear states will use their acquired nuclear capabilities to deter threats and preserve peace pos
69 | more countries with nuclear weapons may increase the possibility of nuclear warfare neg
70 | abortion has negative effects on society neg
71 | Intact dilation and extraction is never needed to protect the health of a pregnant woman neg
72 | Certain restrictions on abortion could be used to form a slippery slope against all abortions neg
73 | Abortion, which would involve the deliberate destruction of life, should be rejected neg
74 | In all circumstances, it should be the woman's decision whether or not to terminate a pregnancy pos
75 | the embryo has a right to life pos
76 | it should be illegal for governments to regulate abortion neg
77 | working parents wish their children to be supervised pos
78 | students’ attitudes towards school did significantly increase as they spent more time on a year-round schedule pos
79 | year-round schools showed a substantial gain in academic achievement for at-risk, low performing students pos
80 | The year round schedule provides more opportunities for family vacations pos
81 | Students that attend year round schooling may miss out on experiences neg
82 | Welfare sustains or even creates poverty neg
83 | Safety nets enable households to make productive investments in their future that they may otherwise miss pos
84 | a lower rate of redistribution in a given society increases the inequality found among future incomes neg
85 | a certain amount of redistribution would be justified pos
86 | Because current science can't figure out exactly how life started, it must be God who caused life to start pos
87 | Atheism has been criticized as a faith in itself neg
88 | the most immoral acts in human history were performed by atheists neg
89 | belief in God and religion are social functions, used by those in power to oppress the working class neg
90 | the idea of God implies the abdication of human reason and justice neg
91 | The idea of God necessarily ends in the enslavement of mankind neg
92 | the theism of people throughout most of recorded history and in many different places provides prima facie demonstration of God's existence pos
93 | There is evidence for the existence of a God pos
94 | evolution can explain the apparent design in nature pos
95 | "Natural selection and similar scientific theories are superior to a ""God hypothesis""—the illusion of intelligent design—in explaining the living world and the cosmos" pos
96 | religion is needed to make us behave morally pos
97 | a god created the Universe pos
98 | Separation of older people from active roles in society benefits both society and older individuals pos
99 | personal and technical skills learned in the military will improve later employment prospects in civilian life pos
100 | Professionally-skilled conscripts are difficult to replace in the civilian workforce neg
101 | conscription would not provide adequate protection for the rights of conscientious objectors neg
102 | adequate military strength could be maintained without having conscription neg
103 | unpaid domestic work is just as valuable as paid work pos
104 | property rights encourage their holders to develop their property or generate wealth pos
105 | Patents are criticised as inonsistent with free trade neg
106 | intellectual property rights are essential to maintaining economic growth pos
107 | A positive correlation between the strengthening of the IP system and subsequent economic growth was found pos
108 | To violate intellectual property is no different morally than violating other property rights neg
109 | The cost of trying to enforce copyright is unreasonable neg
110 | all proposed alternatives to copyright protection do not allow for viable business models neg
111 | the very concept of copyright has never benefited society neg
112 | Wind power has gained very high social acceptance pos
113 | Wind energy is a clean energy source pos
114 | wind energy is one of the most cost efficient sources of renewable energy pos
115 | wind power is dependent on weather systems neg
116 | Wind power produces no greenhouse gas emissions during operation pos
117 | Wind power uses little land pos
118 | any effects on the environment from wind power are generally much less problematic than those of any other power source pos
119 | Wind power has low ongoing costs pos
120 | Wind projects revitalize the economy of rural communities pos
121 | The cost of repairing damaged ecosystems is considered to be much higher than the cost of conserving natural ecosystems pos
122 | People who live close to nature can be dependent on the survival of all the species in their environment pos
123 | "since species become extinct ""all the time"" the disappearance of a few more will not destroy the ecosystem" pos
124 | rapid rates of biodiversity loss threatens the sustained well-being of humanity neg
125 | Biodiversity is directly involved in water purification pos
126 | Austerity programs tend to have an impact on the poorest segments of the population neg
127 | the right to bear arms is absolute and unqualified pos
128 | an armed citizens' militia can help deter crime and tyranny pos
129 | arms allow for successful rebellions against tyranny pos
130 | The possibility of getting shot by an armed victim is a substantial deterrent to crime pos
131 | widespread gun ownership is protection against tyranny pos
132 | Bribery encourages rent seeking behaviour neg
133 | In some cases where the system of law is not well-implemented, bribes may be a way for companies to continue their businesses pos
134 | Corruption undermines the legitimacy of government and democratic values neg
135 | Availability of bribes can induce officials to contrive new rules and delays neg
136 | Corruption favors the most connected and unscrupulous, rather than the efficient neg
137 | The production of bitumen and synthetic crude oil emits more greenhouse gases than the production of conventional crude oil neg
138 | During Operation Cast Lead, the Israeli Defense Forces did more to safeguard the rights of civilians in a combat zone than any other army in the history of warfare pos
139 | Israeli citizens in the south have been suffering from rockets being fired at them neg
140 | the Israeli Defence Force breached laws of armed conflict by attacking indiscriminately civilians neg
141 | Israel's UAV attacks were a violation of International Humanitarian Law neg
142 | Israel was acting in self-defence pos
143 | Israelis have been killed by the unlawful rocket and mortar attacks from Gaza neg
144 | The repeated firing of rockets by Hamas endangers the lives of both Israeli and Palestinian civilians neg
145 | the military solution won't conduct to peace neg
146 | nothing justifies the suffering inflicted to civilian populations who live trapped in the Gaza strip neg
147 | This cycle of violence and retaliation impedes efforts to broker lasting peace in the region neg
148 | High-rise structures also pose serious challenges to firefighters during emergencies neg
149 | Tower Blocks grew a reputation for being undesirable low cost housing neg
150 | many tower blocks saw rising crime levels neg
151 | Tower blocks may be inherently more prone to casualties from a fire neg
152 | Tower blocks can hold thousands of families in a single building pos
153 | The right to freedom of thought and expression, sanctioned by the Declaration of the Rights of Man, cannot imply the right to offend the religious sentiment of believers neg
154 | In all societies there is a need to show sensitivity and responsibility in treating issues of special significance for the adherents of any particular faith pos
155 | the Israeli blockade and closures had pushed the Palestinian economy into a stage of de-development neg
156 | The blockade is a collective form of punishment on a civilian population neg
157 | The purpose of the blockade is to pressure Hamas into ending the rocket attacks and to deprive them of the supplies necessary for the continuation of rocket attacks pos
158 | "the blockade of Gaza is causing ""unacceptable suffering" neg
159 | The blockade is possibly a crime against humanity neg
160 | "all that is being achieved through the blockade is to ""enrich Hamas and marginalize even further the voices of moderation" neg
161 | Israel's blockade of the Gaza Strip was described as totally intolerable and unacceptable in the 21st century neg
162 | Israel restricts the ability for the Palestinian authority to exercise control neg
163 | Gaza was blockaded by Israel in response to the rocket and mortar attacks by Hamas and other militant groups operating inside Gaza pos
164 | the purpose of the restrictions in import of goods into Gaza are to pressure Hamas, which does not recognise Israel and backs attacks on its citizens pos
165 | The stated purpose of the blockade was to pressure Hamas into ending the rocket attacks pos
166 | Israel is not legally responsible for Gaza and not obliged to help a hostile territory beyond whatever is necessary to avoid a humanitarian crisis pos
167 | The Gaza blockade inflicted excessive damage to the civilian population in relation to the concrete military advantage expected neg
168 | Holocaust denial is a convenient polemical substitute for anti-semitism neg
169 | Open-source-appropriate technology built in continuous peer-review can result in better quality pos
170 | wide availability results in increased scrutiny of the source code, making open source software more secure pos
171 | With free software, businesses can fit software to their specific needs by changing the software pos
172 | proprietary software is unethical and unjust neg
173 | software produced in this fashion may lack standardization and be incompatible with various computer applications neg
174 | corruption is more prevalent in non-privatized sectors neg
175 | A state-monopolized function is prone to corruption neg
176 | certain public goods and services should remain primarily in the hands of government in order to ensure that everyone in society has access to them pos
177 | Academics are demoralized by government interference with admissions procedures neg
178 | Dependence on government funding has had disastrous effects on the higher education sector in continental Europe neg
179 | ASEAN Way has recently proven itself relatively successful in the settlements of disputes by peaceful manner pos
180 | "the importance of the ""wait to have sex"" message ends up being lost when programs demonstrate and encourage the use of contraception" neg
181 | abstinence-only sex ed and conservative moralizing will only alienate students neg
182 | sex education needs to be comprehensive to be effective pos
183 | Abstinence-only programs delay the initiation of sex pos
184 | abstinence-only education is ineffective neg
185 | abstinence-only-until-marriage programs are ineffective neg
186 | Comprehensive sex education covers abstinence as a positive choice, but also teaches about contraception and avoidance of STIs pos
187 | all abstinence programs are ineffective neg
188 | Abstinence until marriage is the most effective way to avoid HIV infection pos
189 | Providing safe-sex education promotes promiscuity neg
190 | Countries with conservative attitudes towards sex education have a higher incidence of STIs and teenage pregnancy neg
191 | sexual abstinence in teenagers decreases the risk of contracting STDs and having children outside marriage pos
192 | "voters ""overwhelmingly support term limits" pos
193 | Lawmakers are in gridlock because of becoming locked into entrenched positions over time neg
194 | Freed from political considerations related to re-election, lawmakers would be more free to vote on the merits pos
195 | Flag burning is protected by the First Amendment pos
196 | Americans oppose amending the constitution to outlaw flag burning neg
197 | The First Amendment reaches non-speech acts pos
198 | Flag burning tends to incite breaches of the peace neg
199 | laws against flag-burning are constitutional pos
200 | the BCS routinely involves controversy about which two teams are the best in the nation neg
201 | if everyone is left to their own economic devices instead of being controlled by the state, then the result would be a harmonious and more equal society pos
202 | the state has a legitimate role in providing public goods pos
203 | Financial liberalization and privatization coincide with democratization pos
204 | wages are naturally driven down in free market systems neg
205 | free markets usually fail to deal with the problem of externalities neg
206 | free market relationships are considered as structured upon coercion neg
207 | free trade gives optimal economic advantages pos
208 | Free trade allows maximum exploitation of workers by capital neg
209 | The justification for central planning is that the consolidation of economic resources can allow for the economy to take advantage of more perfect information when making decisions regarding investment and production pos
210 | market economies are inherently stable if left to themselves pos
211 | competition leads to innovation and more affordable prices pos
212 | "the state has a role to play in the economy, and specifically, in creating a ""safety net" pos
213 | free markets have the potential to free states from the looming prospect of recurrent warfare pos
214 | competition in the free market is more effective than the regulation of industry pos
215 | markets have inefficient outcomes neg
216 | no coercive monopoly of force can arise on a truly free market pos
217 | Unfettered markets are the best means to economic growth pos
218 | The regulation of markets is widely acknowledged as important to safeguard social and environmental values pos
219 | what starts as temporary governmental fixes usually become permanent and expanding government programs, which stifle the private sector and civil society neg
220 | left to its devices, the market will adjust efficiently pos
221 | Advertising increasingly invades public spaces neg
222 | Advertising has been criticized as inadvertently or even intentionally promoting sexism neg
223 | advertising directed at young children is per se manipulative neg
224 | the advertising techniques used to create consumer behaviour amount to the destruction of psychic and collective individuation neg
225 | it is inherently immoral to bring people into the world neg
226 | there ought to be a higher rate of population growth than what is currently mainstream pos
227 | the birth of a new person always entails nontrivial harm to that person neg
228 | the best thing for Earth's biosphere is for humans to voluntarily cease reproducing pos
229 | Voluntary human extinction will prevent human suffering and the extinction of other species pos
230 | non-reproduction would eventually allow humans to lead idyllic lifestyles pos
231 | attempting to reduce the Earth's population is the only moral option pos
232 | voluntary human extinction is advisable due to limited resources pos
233 | the decision not to procreate at all could be regarded as immoral neg
234 | Dam construction often leads to abuses of the masses by planners neg
235 | people worldwide have been physically displaced from their homes as a result of dam construction neg
236 | Dam failures are generally catastrophic neg
237 | In many reservoir construction projects people have to be moved and re-housed neg
238 | Farms and villages can be flooded by the creation of reservoirs, ruining many livelihoods neg
239 | once the renewable infrastructure is built, the fuel is free forever pos
240 | Large hydropower provides one of the lowest cost options in today’s energy market pos
241 | Development of large-scale hydroelectric power has environmental impacts neg
242 | The filling of large reservoirs can induce earth tremors, which may be large enough to be objectionable or destructive neg
243 | Hydroelectric dams with large reserviors can be operated to provide peak generation at times of peak demand pos
244 | Hydro plants are able to act as load following power plants pos
245 | Integrating ever-higher levels of renewables is being successfully demonstrated in the real world pos
246 | There are no harmful emissions associated with hydroelectric plant operation pos
247 | Truth commissions are sometimes criticised for allowing crimes to go unpunished, and creating impunity for serious human rights abusers neg
248 | victims and communities affected by past crimes have the right to know the identity of suspected perpetrators pos
249 | Restorative approaches seek a balanced approach to the needs of the victim, wrongdoer and community through processes that preserve the safety and dignity of all pos
250 | democracies have less internal systematic violence pos
251 | the more democratic a regime, the less its democide pos
252 | democracy causes peace pos
253 | democracies treat each other with trust and respect even during crises pos
254 | the best strategy to ensure our security and to build a durable peace is to support the advance of democracy elsewhere pos
255 | Democracy is economically inefficient neg
256 | the benefits of a specialised society may be compromised by democracy neg
257 | A majority bullying a minority is just as bad as a dictator, communist or otherwise, doing so neg
258 | democracy would bring peace pos
259 | democracy leads to less internal violence and mass murder by the government pos
260 | only in a democracy the citizens can have a share in freedom pos
261 |
--------------------------------------------------------------------------------