├── requirements.txt
├── README.md
├── LICENSE
├── .devcontainer
│   ├── Dockerfile
│   └── devcontainer.json
├── examples
│   ├── mlops-wikipedia.txt
│   ├── download.ipynb
│   └── mlflow.txt
├── .gitignore
└── notebooks
    ├── try-datasets.ipynb
    └── try-transformers.ipynb
/requirements.txt:
--------------------------------------------------------------------------------
1 | ipywidgets==8.0.1
2 | ipykernel==6.15.1
3 | transformers==4.21.1
4 | datasets==2.4.0
5 | tensorflow==2.10
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Try 🤗 HuggingFace!
2 | 
3 | Examples for trying out the HuggingFace `datasets` and `transformers` libraries.
4 | 
5 | * Search the [model hub](https://huggingface.co/models) for existing models
6 | * Install the [Datasets](https://github.com/huggingface/datasets/) and [Transformers](https://github.com/huggingface/transformers) Python packages with `pip install -r requirements.txt` (see [requirements.txt](./requirements.txt))
7 | 
8 | Browse through the [Jupyter Notebook examples](./notebooks).
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2022 Alfredo Deza
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/.devcontainer/Dockerfile:
--------------------------------------------------------------------------------
1 | # See here for image contents: https://github.com/microsoft/vscode-dev-containers/tree/v0.245.2/containers/python-3/.devcontainer/base.Dockerfile
2 | 
3 | # [Choice] Python version (use -bullseye variants on local arm64/Apple Silicon): 3, 3.10, 3.9, 3.8, 3.7, 3.6, 3-bullseye, 3.10-bullseye, 3.9-bullseye, 3.8-bullseye, 3.7-bullseye, 3.6-bullseye, 3-buster, 3.10-buster, 3.9-buster, 3.8-buster, 3.7-buster, 3.6-buster
4 | ARG VARIANT="3.10-bullseye"
5 | FROM mcr.microsoft.com/vscode/devcontainers/python:0-${VARIANT}
6 | 
7 | 
8 | # Install the pip requirements at build time so they are baked into the image.
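# After changing requirements.txt, rebuild the dev container
# ("Dev Containers: Rebuild Container" in VS Code) so the image
# picks up the new pins.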
9 | COPY requirements.txt /tmp/pip-tmp/
10 | RUN pip3 --disable-pip-version-check --no-cache-dir install -r /tmp/pip-tmp/requirements.txt \
11 |    && rm -rf /tmp/pip-tmp
12 | 
13 | # [Optional] Uncomment this section to install additional OS packages.
14 | # RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
15 | #     && apt-get -y install --no-install-recommends <your-package-list-here>
16 | 
17 | # [Optional] Uncomment this line to install global node packages.
18 | # RUN su vscode -c "source /usr/local/share/nvm/nvm.sh && npm install -g <your-package-here>" 2>&1
--------------------------------------------------------------------------------
/examples/mlops-wikipedia.txt:
--------------------------------------------------------------------------------
1 | MLOps or ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently.[1] The word is a compound of "machine learning" and the continuous development practice of DevOps in the software field. Machine learning models are tested and developed in isolated experimental systems. When an algorithm is ready to be launched, MLOps is practiced between Data Scientists, DevOps, and Machine Learning engineers to transition the algorithm to production systems.[2] Similar to DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements. While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management. MLOps applies to the entire lifecycle - from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics. According to Gartner, MLOps is a subset of ModelOps. MLOps is focused on the operationalization of ML models, while ModelOps covers the operationalization of all types of AI models.[3]
2 | 
--------------------------------------------------------------------------------
/.devcontainer/devcontainer.json:
--------------------------------------------------------------------------------
1 | // For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
2 | // https://github.com/microsoft/vscode-dev-containers/tree/v0.245.2/containers/python-3
3 | {
4 | 	"name": "Python 3",
5 | 	"build": {
6 | 		"dockerfile": "Dockerfile",
7 | 		"context": "..",
8 | 		"args": { 
9 | 			// Update 'VARIANT' to pick a Python version: 3, 3.10, 3.9, 3.8, 3.7, 3.6
10 | 			// Append -bullseye or -buster to pin to an OS version.
11 | 			// Use -bullseye variants locally on arm64/Apple Silicon.
12 | 			"VARIANT": "3.8",
13 | 			// Options
14 | 			"NODE_VERSION": "none"
15 | 		}
16 | 	},
17 | 
18 | 	// Configure tool-specific properties.
19 | 	"customizations": {
20 | 		// Configure properties specific to VS Code.
21 | 		"vscode": {
22 | 			// Set *default* container specific settings.json values on container create.
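			// The linter/formatter paths below point at /usr/local/py-utils, where
			// the Python dev container base image pre-installs these tools.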
23 | "settings": { 24 | "python.defaultInterpreterPath": "/usr/local/bin/python", 25 | "python.linting.enabled": true, 26 | "python.linting.pylintEnabled": true, 27 | "python.formatting.autopep8Path": "/usr/local/py-utils/bin/autopep8", 28 | "python.formatting.blackPath": "/usr/local/py-utils/bin/black", 29 | "python.formatting.yapfPath": "/usr/local/py-utils/bin/yapf", 30 | "python.linting.banditPath": "/usr/local/py-utils/bin/bandit", 31 | "python.linting.flake8Path": "/usr/local/py-utils/bin/flake8", 32 | "python.linting.mypyPath": "/usr/local/py-utils/bin/mypy", 33 | "python.linting.pycodestylePath": "/usr/local/py-utils/bin/pycodestyle", 34 | "python.linting.pydocstylePath": "/usr/local/py-utils/bin/pydocstyle", 35 | "python.linting.pylintPath": "/usr/local/py-utils/bin/pylint" 36 | }, 37 | 38 | // Add the IDs of extensions you want installed when the container is created. 39 | "extensions": [ 40 | "ms-python.python", 41 | "ms-python.vscode-pylance" 42 | ] 43 | } 44 | }, 45 | 46 | // Use 'forwardPorts' to make a list of ports inside the container available locally. 47 | // "forwardPorts": [], 48 | 49 | // Use 'postCreateCommand' to run commands after the container is created. 50 | // "postCreateCommand": "pip3 install --user -r requirements.txt", 51 | 52 | // Comment out to connect as root instead. More info: https://aka.ms/vscode-remote/containers/non-root. 53 | "remoteUser": "vscode" 54 | } 55 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | myenv/ 131 | -------------------------------------------------------------------------------- /examples/download.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 12, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from transformers import pipeline" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 14, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n", 22 | "To disable this warning, you can either:\n", 23 | "\t- Avoid using `tokenizers` before the fork if possible\n" 24 | ] 25 | }, 26 | { 27 | "name": "stderr", 28 | "output_type": "stream", 29 | "text": [ 30 | "Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 172kB/s]" 31 | ] 32 | }, 33 | { 34 | "name": "stdout", 35 | "output_type": "stream", 36 | "text": [ 37 | "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" 38 | ] 39 | }, 40 | { 41 | "name": "stderr", 42 | "output_type": "stream", 43 | "text": [ 44 | "\n", 45 | "Downloading tf_model.h5: 100%|██████████| 511M/511M [00:05<00:00, 98.2MB/s] \n", 46 | "All model checkpoint layers were used when initializing TFBertForMaskedLM.\n", 47 | "\n", 48 | "All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-uncased.\n", 49 | "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.\n", 50 | "Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 19.8kB/s]\n", 51 | "Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 4.62MB/s]\n", 52 | "Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 6.54MB/s]\n", 53 | "The model 'TFBertForMaskedLM' is not supported for summarization. 
Supported models are ['TFBartForConditionalGeneration', 'TFBlenderbotForConditionalGeneration', 'TFBlenderbotSmallForConditionalGeneration', 'TFEncoderDecoderModel', 'TFLEDForConditionalGeneration', 'TFMarianMTModel', 'TFMBartForConditionalGeneration', 'TFMT5ForConditionalGeneration', 'TFPegasusForConditionalGeneration', 'TFT5ForConditionalGeneration'].\n"
54 |         ]
55 |       }
56 |     ],
57 |     "source": [
58 |       "summarizer = pipeline(\"summarization\", model=\"t5-small\", tokenizer=\"t5-small\", truncation=True, framework=\"tf\")"
59 |     ]
60 |   },
61 |   {
62 |     "cell_type": "code",
63 |     "execution_count": null,
64 |     "metadata": {},
65 |     "outputs": [],
66 |     "source": [
67 |       "with open(\"mlflow.txt\", \"r\") as _f:\n",
68 |       "    print(summarizer(_f.read()))"
69 |     ]
70 |   },
71 |   {
72 |     "cell_type": "code",
73 |     "execution_count": null,
74 |     "metadata": {},
75 |     "outputs": [],
76 |     "source": [
77 |       "from huggingface_hub import hf_hub_download\n",
78 |       "hf_hub_download(repo_id=\"t5-small\", filename=\"pytorch_model.bin\")"
79 |     ]
80 |   }
81 |  ],
82 |  "metadata": {
83 |   "kernelspec": {
84 |    "display_name": "Python 3.8.13 ('summarize')",
85 |    "language": "python",
86 |    "name": "python3"
87 |   },
88 |   "language_info": {
89 |    "codemirror_mode": {
90 |     "name": "ipython",
91 |     "version": 3
92 |    },
93 |    "file_extension": ".py",
94 |    "mimetype": "text/x-python",
95 |    "name": "python",
96 |    "nbconvert_exporter": "python",
97 |    "pygments_lexer": "ipython3",
98 |    "version": "3.8.13"
99 |   },
100 |   "orig_nbformat": 4,
101 |   "vscode": {
102 |    "interpreter": {
103 |     "hash": "66521298b426c9301623669054a50dd0a367106800a6395f809220cc391585ba"
104 |    }
105 |   }
106 |  },
107 |  "nbformat": 4,
108 |  "nbformat_minor": 2
109 | }
--------------------------------------------------------------------------------
/examples/mlflow.txt:
--------------------------------------------------------------------------------
1 | How we build MLflow projects and rapidly iterate over them
2 | 
3 | Introduction
4 | Starting a new machine learning (ML) project is always cumbersome, especially when collaborating with many people: different standards need to be followed, some files need to exist by default, artifacts need to be stored at a certain location, and so on.
5 | 
6 | As the number of projects keeps increasing, they end up looking very similar in structure and standards (writing tests, maintaining documentation, etc.), and while these projects are similar, they do not always work as intended.
7 | 
8 | As mentioned in a previous article, we use Databricks as our data platform and MLflow at the core of all our projects that involve machine learning. For those of you who are new to MLflow, it is basically a tool/platform that manages the ML life-cycle — creating experiments, registering and deploying models. While the MLflow CLI is very handy, creating new projects that follow the structure of an MLflow project needs to be done manually.
9 | 
10 | To tackle the issues mentioned above, we used cookiecutter to create a template for MLflow projects. By doing this, the data team can focus less on the structure and configuration of the project and more on the implementation of the model.
11 | 
12 | The following article shows how we designed our cookiecutter template and how we use it to run our projects on Databricks.
13 | 
14 | Project templating using cookiecutter
15 | The cookiecutter template helps streamline creating, testing, running and deploying MLflow projects. 
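
As an illustration, creating a project from such a template can look like the following Python sketch (the template URL and the prompt fields are hypothetical; the real template and its variables live in our internal repository):

from cookiecutter.main import cookiecutter

# Hypothetical template location and prompt values.
cookiecutter(
    "https://github.com/example-org/mlflow-project-template",
    no_input=True,
    extra_context={
        "project_name": "churn-model",
        "artifact_path": "dbfs:/mlflow/artifacts/churn-model",
        "experiment_name": "/Shared/churn-model",
    },
)
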
We designed the template in such a way that:
16 | 
17 | - the structure of our projects is consistent with the structure MLflow defines
18 | - an internal standard can be enforced within the team
19 | - a default set of requirements for local setup and development is maintained
20 | - it facilitates continuous integration/continuous delivery (CI/CD)
21 | When the cookiecutter template is used, required information such as the specifics of the project, the location where artifacts will be stored (the artifact path), and the MLflow experiment name is configured. Once this information is recorded, cookiecutter creates a new git-initialized folder with the structure defined in the template and the details entered earlier. Once the steps in the cookiecutter template have been completed, all we need to do is add the GitHub remote and we are good to go!
22 | 
23 | Since the cookiecutter structures the project and its files, the next and exciting step is to start building the model by adding logic to it. Once the model is ready, we make sure that the tests pass and that the MLflow project runs locally without any errors. As an added step, the project can be deployed and served locally to test whether responses are received when requests are sent to it.
24 | 
25 | Apart from local development and testing, workflows that assist continuous integration have been set up using GitHub Actions. The cookiecutter template includes two workflows:
26 | 
27 | - the push workflow, which runs the tests and the project
28 | - the release workflow, which runs the project on a Databricks cluster once all the tests have passed
29 | MLflow on Databricks
30 | With every new release, a new run of the project is logged on Databricks. Every logged run contains information such as the run name, the time it was started, who started it, the GitHub commit hash, and a description if given. An example of how runs are logged is shown in Fig 1:
31 | 
32 | 
33 | Figure 1: Logging of MLflow runs on Databricks (Sensitive information hidden)
34 | From the list of all runs, we can then choose the latest successful run and register the model associated with it, if the output of the run is what was expected.
35 | 
36 | 
37 | Figure 2: Model registration (Sensitive information hidden)
38 | MLflow facilitates model registration with the help of a button and also automatically versions every model that is registered.
39 | 
40 | 
41 | Figure 3: Model versioning (Sensitive information hidden)
42 | Registered models can subsequently be assigned a stage: Production, Staging, Archived, or None. By default, when a model is served, it does not have a stage. We always transition newly registered models to Staging so that teams such as Frontend/Backend can communicate with them and integrate the new logic from the MLflow model into their respective staging environments.
43 | 
44 | 
45 | Figure 4: Model stage transitioning
46 | Once a model is registered, we can serve it at a REST API endpoint. Managing models with MLflow alone is pretty easy, but managing MLflow models on Databricks is a lot easier! For example, model serving on Databricks requires no configuration of the computing resources on which the model will be served, or of the endpoint URL: Databricks handles all of this automatically when model serving is enabled. If the model lives in the Production or Staging stage, the invocation URL contains /Production or /Staging accordingly.
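
As an illustration, a POST request to a served model could look like the Python sketch below (the workspace URL, model name, token, and feature names are placeholders, and the exact payload schema depends on the MLflow version):

import requests

# Hypothetical values; the real "Model URL" and token come from the
# Databricks serving UI and your workspace settings.
MODEL_URL = "https://my-workspace.cloud.databricks.com/model/my-model/Staging/invocations"
TOKEN = "dapi-XXXXXXXX"  # Databricks personal access token (placeholder)

response = requests.post(
    MODEL_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    # Split-oriented pandas JSON is one commonly accepted input format.
    json={"columns": ["feature_a", "feature_b"], "data": [[1.0, 2.0]]},
    timeout=30,
)
print(response.json())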
47 | 
48 | 
49 | Figure 5: Model serving (Sensitive information hidden)
50 | POST requests can now be sent to the served model using the “Model URL”. Databricks provides a friendly UI to test the served model on their platform (as shown in Fig 5).
51 | 
52 | Once all systems are go, with a click of a button, we can easily transition the stage to Production and watch our newly created model work its magic!
53 | 
54 | Entire process in a nutshell:
55 | While the process of initializing an MLflow project and putting it into production does require some manual effort (to ensure that the right models are registered and served), all other checks and balances are taken care of by the CI/CD workflows that are part of our cookiecutter template by default. This makes it extremely easy to continuously add new logic to our models (or create new ones) and put these changes into production.
56 | 
57 | The entire process from initialization to putting the model into production is summarized in Fig 6.
58 | 
59 | 
60 | Figure 6: Summary of process
61 | Final thoughts:
62 | Templating using cookiecutter has helped the team focus less on the structure of the project and more on its core implementation. Adding to this, by using MLflow, managing and tracking the models that we build has never been easier. And finally, the seamless integration of MLflow into the Databricks platform has made the setup a lot simpler, and deployment a whole lot faster!
63 | 
64 | In short, the process we follow allows us to focus solely on adding and improving the implementation/logic of the model, which can then be integrated into our core product, the app.
65 | 
66 | 
67 | 
--------------------------------------------------------------------------------
/notebooks/try-datasets.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# 🤗 HuggingFace Datasets\n",
8 |     "\n",
9 |     "HuggingFace provides the ability to interact with datasets dynamically. There is no need to pre-download large datasets and include them in a repository like this one. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from datasets import load_dataset, list_datasets" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stdout", 28 | "output_type": "stream", 29 | "text": [ 30 | "175825\n", 31 | "[]\n" 32 | ] 33 | } 34 | ], 35 | "source": [ 36 | "# Explore available datasets\n", 37 | "available = list_datasets()\n", 38 | "print(len(available))\n", 39 | "print([i for i in available if '/' not in i])" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stderr", 49 | "output_type": "stream", 50 | "text": [ 51 | "Using custom data configuration default\n", 52 | "Reusing dataset movie_rationales (/home/vscode/.cache/huggingface/datasets/movie_rationales/default/0.1.0/70ed6b72496c90835e8ee73ebf8d0e49f5ad3aa93f302c8a4b6c886143cfb779)\n" 53 | ] 54 | }, 55 | { 56 | "data": { 57 | "application/vnd.jupyter.widget-view+json": { 58 | "model_id": "9a1ad48369fb4d42b0188e632be57928", 59 | "version_major": 2, 60 | "version_minor": 0 61 | }, 62 | "text/plain": [ 63 | " 0%| | 0/3 [00:00\n", 149 | "\n", 162 | "\n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | "
reviewlabelevidences
0plot : two teen couples go to a church party ,...0[mind - fuck movie, the sad part is, downshift...
1the happy bastard 's quick movie review damn\\n...0[it 's pretty much a sunken ship, sutherland i...
2it is movies like these that make a jaded movi...0[the characters and acting is nothing spectacu...
3\" quest for camelot \" is warner bros . '\\nfirs...0[dead on arrival, the characters stink, subpar...
4synopsis : a mentally unstable man undergoing ...0[it is highly derivative and somewhat boring, ...
\n", 204 | "" 205 | ], 206 | "text/plain": [ 207 | " review label \\\n", 208 | "0 plot : two teen couples go to a church party ,... 0 \n", 209 | "1 the happy bastard 's quick movie review damn\\n... 0 \n", 210 | "2 it is movies like these that make a jaded movi... 0 \n", 211 | "3 \" quest for camelot \" is warner bros . '\\nfirs... 0 \n", 212 | "4 synopsis : a mentally unstable man undergoing ... 0 \n", 213 | "\n", 214 | " evidences \n", 215 | "0 [mind - fuck movie, the sad part is, downshift... \n", 216 | "1 [it 's pretty much a sunken ship, sutherland i... \n", 217 | "2 [the characters and acting is nothing spectacu... \n", 218 | "3 [dead on arrival, the characters stink, subpar... \n", 219 | "4 [it is highly derivative and somewhat boring, ... " 220 | ] 221 | }, 222 | "execution_count": 7, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "df.head()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 8, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/html": [ 239 | "
\n", 240 | "\n", 253 | "\n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | "
label
count1600.000000
mean0.500000
std0.500156
min0.000000
25%0.000000
50%0.500000
75%1.000000
max1.000000
\n", 295 | "
" 296 | ], 297 | "text/plain": [ 298 | " label\n", 299 | "count 1600.000000\n", 300 | "mean 0.500000\n", 301 | "std 0.500156\n", 302 | "min 0.000000\n", 303 | "25% 0.000000\n", 304 | "50% 0.500000\n", 305 | "75% 1.000000\n", 306 | "max 1.000000" 307 | ] 308 | }, 309 | "execution_count": 8, 310 | "metadata": {}, 311 | "output_type": "execute_result" 312 | } 313 | ], 314 | "source": [ 315 | "df.describe()" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 9, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "data": { 325 | "text/plain": [ 326 | "label\n", 327 | "0 800\n", 328 | "1 800\n", 329 | "Name: count, dtype: int64" 330 | ] 331 | }, 332 | "execution_count": 9, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "df['label'].value_counts()" 339 | ] 340 | } 341 | ], 342 | "metadata": { 343 | "kernelspec": { 344 | "display_name": "Python 3.9.13 ('huggingface')", 345 | "language": "python", 346 | "name": "python3" 347 | }, 348 | "language_info": { 349 | "codemirror_mode": { 350 | "name": "ipython", 351 | "version": 3 352 | }, 353 | "file_extension": ".py", 354 | "mimetype": "text/x-python", 355 | "name": "python", 356 | "nbconvert_exporter": "python", 357 | "pygments_lexer": "ipython3", 358 | "version": "3.8.17" 359 | }, 360 | "orig_nbformat": 4, 361 | "vscode": { 362 | "interpreter": { 363 | "hash": "920d5173f2c6743f2c8a5baff36bfaa747ac4cb3d34512c636ba17b43fdf31dc" 364 | } 365 | } 366 | }, 367 | "nbformat": 4, 368 | "nbformat_minor": 2 369 | } 370 | -------------------------------------------------------------------------------- /notebooks/try-transformers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Trying 🤗 HuggingFace Transformers\n", 8 | "\n", 9 | "Make sure you install the dependencies from `requirements.txt` before executing cells in this notebook." 
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "code",
14 |    "execution_count": 1,
15 |    "metadata": {},
16 |    "outputs": [
17 |     {
18 |      "name": "stderr",
19 |      "output_type": "stream",
20 |      "text": [
21 |       "2024-07-10 14:48:23.676226: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n",
22 |       "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
23 |       "2024-07-10 14:48:25.555554: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
24 |       "2024-07-10 14:48:25.555602: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n",
25 |       "2024-07-10 14:48:25.705622: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
26 |       "2024-07-10 14:48:26.813722: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n",
27 |       "2024-07-10 14:48:26.813905: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n",
28 |       "2024-07-10 14:48:26.813920: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
29 |      ]
30 |     }
31 |    ],
32 |    "source": [
33 |     "from transformers import pipeline"
34 |    ]
35 |   },
36 |   {
37 |    "cell_type": "markdown",
38 |    "metadata": {},
39 |    "source": [
40 |     "Define the generator pipeline. In this case, use the `text2text` task for NLP processing."
41 |    ]
42 |   },
43 |   {
44 |    "cell_type": "code",
45 |    "execution_count": 2,
46 |    "metadata": {},
47 |    "outputs": [],
48 |    "source": [
49 |     "import tensorflow as tf\n",
50 |     "import keras\n",
51 |     "from transformers import TFAutoModel"
52 |    ]
53 |   },
54 |   {
55 |    "cell_type": "code",
56 |    "execution_count": 10,
57 |    "metadata": {},
58 |    "outputs": [
59 |     {
60 |      "data": {
61 |       "application/vnd.jupyter.widget-view+json": {
62 |        "model_id": "c52e68ef515a4fe5ae5a3f64a4e72be8",
63 |        "version_major": 2,
64 |        "version_minor": 0
65 |       },
66 |       "text/plain": [
67 |        "Downloading tf_model.h5:   0%|          | 0.00/851M [00:00<?, ?B/s]