├── LICENSE ├── README.md ├── SECURITY.md ├── cookiecutter.json ├── images └── cookiecutter-docs.png ├── requirements.txt └── {{cookiecutter.project_slug}} ├── .gitignore ├── Dockerfile ├── README.md ├── app ├── __init__.py ├── api.py ├── data │ └── example_request.json ├── models.py ├── spacy_extractor.py └── tests │ ├── __init__.py │ └── test_api.py ├── images └── cookiecutter-docs.png ├── main.py └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 - present Microsoft Corporation 4 | 5 | All rights reserved. 6 | 7 | Permission is hereby granted, free of charge, to any person obtaining a copy 8 | of this software and associated documentation files (the "Software"), to deal 9 | in the Software without restriction, including without limitation the rights 10 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 11 | copies of the Software, and to permit persons to whom the Software is 12 | furnished to do so, subject to the following conditions: 13 | 14 | The above copyright notice and this permission notice shall be included in all 15 | copies or substantial portions of the Software. 16 | 17 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 18 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 19 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 20 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 21 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 22 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 23 | SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # cookiecutter-spacy-fastapi 2 | 3 | Python cookiecutter API for quick deployments of spaCy models with FastAPI 4 | 5 | ## Azure Search 6 | The API interface is compatible with Azure Search Cognitive Skills. 7 | 8 | For instructions on adding your API as a Custom Cognitive Skill in Azure Search see: 9 | https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface 10 | 11 | ## Requirements 12 | - Python >= 3.6 with pip installed 13 | 14 | ## Quickstart 15 | 16 | ### Install the latest [Cookiecutter](https://github.com/audreyr/cookiecutter) if you haven't installed it yet (this requires Cookiecutter 1.4.0 or higher): 17 | ``` 18 | pip install --user cookiecutter 19 | ``` 20 | 21 | ### Point cookiecutter to this GitHub repository to automatically download and generate your project 22 | 23 | ``` 24 | cookiecutter https://github.com/Microsoft/cookiecutter-azure-search-cognitive-skill 25 | ``` 26 | 27 | View the README.md of your new project for instructions on next steps 28 | 29 | ## Resources 30 | This project has two key dependencies: 31 | 32 | | Dependency Name | Documentation | Description | 33 | |-----------------|------------------------------|----------------------------------------------------------------------------------------| 34 | | spaCy | https://spacy.io | Industrial-strength Natural Language Processing (NLP) with Python and Cython | 35 | | FastAPI | https://fastapi.tiangolo.com | FastAPI framework, high performance, easy to learn, fast to code, ready for production | 36 | 37 | 38 | # Contributing 39 | 40 | This project welcomes contributions and suggestions. Most contributions require you to agree to a 41 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us 42 | the rights to use your contribution. 
For details, visit https://cla.microsoft.com. 43 | 44 | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide 45 | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions 46 | provided by the bot. You will only need to do this once across all repos using our CLA. 47 | 48 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 49 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or 50 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Security 4 | 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). 6 | 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below. 8 | 9 | ## Reporting Security Issues 10 | 11 | **Please do not report security vulnerabilities through public GitHub issues.** 12 | 13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report). 
14 | 15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey). 16 | 17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc). 18 | 19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: 20 | 21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) 22 | * Full paths of source file(s) related to the manifestation of the issue 23 | * The location of the affected source code (tag/branch/commit or direct URL) 24 | * Any special configuration required to reproduce the issue 25 | * Step-by-step instructions to reproduce the issue 26 | * Proof-of-concept or exploit code (if possible) 27 | * Impact of the issue, including how an attacker might exploit the issue 28 | 29 | This information will help us triage your report more quickly. 30 | 31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs. 32 | 33 | ## Preferred Languages 34 | 35 | We prefer all communications to be in English. 36 | 37 | ## Policy 38 | 39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd). 
40 | 41 | 42 | -------------------------------------------------------------------------------- /cookiecutter.json: -------------------------------------------------------------------------------- 1 | { 2 | "project_name": "spaCy FastAPI Azure Cognitive Skill", 3 | "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}", 4 | "short_description": "spaCy FastAPI for Custom Cognitive Skills in Azure Search", 5 | "spacy_model": "This must be one of spaCy's default models. See https://spacy.io/usage for a supported list." 6 | } -------------------------------------------------------------------------------- /images/cookiecutter-docs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/cookiecutter-spacy-fastapi/78edb2597405e6a8537c55c1f33a3e00451ce9e2/images/cookiecutter-docs.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cookiecutter 2 | vsts 3 | ruamel.yaml 4 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | .venv 3 | app/.env 4 | 5 | forge.yaml 6 | .forge 7 | 8 | *.py[cod] 9 | 10 | # C extensions 11 | *.so 12 | 13 | # Packages 14 | *.egg 15 | *.egg-info 16 | build 17 | eggs 18 | parts 19 | bin 20 | var 21 | sdist 22 | develop-eggs 23 | .installed.cfg 24 | lib 25 | lib64 26 | 27 | # Installer logs 28 | pip-log.txt 29 | 30 | # Unit test / coverage reports 31 | .coverage 32 | .tox 33 | nosetests.xml 34 | 35 | # si 36 | *.mo 37 | 38 | # Mr Developer 39 | .mr.developer.cfg 40 | .project 41 | .pydevproject 42 | 43 | # Complexity 44 | output/*.html 45 | output/*/index.html 46 | 47 | # Sphinx 48 | docs/_build 49 | README.html 50 | 51 | # Cookiecutter 52 
| output/ 53 | 54 | myflaskapp/ 55 | bower_components 56 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tiangolo/uvicorn-gunicorn-fastapi:python3.9 2 | ENV PORT 8080 3 | ENV APP_MODULE app.api:app 4 | ENV LOG_LEVEL debug 5 | ENV WEB_CONCURRENCY 2 6 | 7 | COPY ./requirements.txt ./requirements.txt 8 | RUN pip install -r requirements.txt 9 | 10 | RUN spacy download {{cookiecutter.spacy_model}} 11 | 12 | COPY ./app /app/app 13 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/README.md: -------------------------------------------------------------------------------- 1 | # {{cookiecutter.project_name}} 2 | 3 | {{cookiecutter.short_description}} 4 | 5 | --- 6 | 7 | ## Azure Search Cognitive Skills 8 | For instructions on adding your API as a Custom Cognitive Skill in Azure Search see: 9 | https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface 10 | 11 | ## Resources 12 | This project has two key dependencies: 13 | 14 | | Dependency Name | Documentation | Description | 15 | |-----------------|------------------------------|----------------------------------------------------------------------------------------| 16 | | spaCy | https://spacy.io | Industrial-strength Natural Language Processing (NLP) with Python and Cython | 17 | | FastAPI | https://fastapi.tiangolo.com | FastAPI framework, high performance, easy to learn, fast to code, ready for production | 18 | --- 19 | 20 | ## Run Locally 21 | To run locally in debug mode, run: 22 | 23 | ``` 24 | cd ./{{cookiecutter.project_slug}} 25 | python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt && python -m spacy download {{cookiecutter.spacy_model}} 26 | uvicorn app.api:app --reload 27 | ``` 28 | Open your browser to http://localhost:8000/docs to view the OpenAPI UI. 
29 | 30 | ![Open API Image](./images/cookiecutter-docs.png) 31 | 32 | 33 | For an alternate view of the docs, navigate to http://localhost:8000/redoc 34 | 35 | --- 36 | 37 | ## Deploy with Azure Pipelines 38 | Follow this guide to set up an Azure Resource Group with instances of Azure Kubernetes Service and Azure Container Registry, and to set up CI/CD with Azure Pipelines. 39 | 40 | https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/kubernetes/aks-template?view=azure-devops 41 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/app/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. All rights reserved. 2 | # Licensed under the MIT License. 
3 | 4 | from collections import defaultdict 5 | import os 6 | 7 | from dotenv import load_dotenv, find_dotenv 8 | from fastapi import Body, FastAPI 9 | from starlette.middleware.cors import CORSMiddleware 10 | from starlette.responses import RedirectResponse 11 | import spacy 12 | import srsly 13 | import uvicorn 14 | 15 | from app.models import ( 16 | ENT_PROP_MAP, 17 | RecordsRequest, 18 | RecordsResponse, 19 | RecordsEntitiesByTypeResponse, 20 | ) 21 | from app.spacy_extractor import SpacyExtractor 22 | 23 | 24 | app = FastAPI( 25 | title="{{cookiecutter.project_name}}", 26 | version="1.0", 27 | description="{{cookiecutter.short_description}}", 28 | ) 29 | 30 | example_request = srsly.read_json("app/data/example_request.json") 31 | 32 | nlp = spacy.load("{{cookiecutter.spacy_model}}") 33 | extractor = SpacyExtractor(nlp) 34 | 35 | 36 | @app.get("/", include_in_schema=False) 37 | def docs_redirect(): 38 | return RedirectResponse(f"/docs") 39 | 40 | 41 | @app.post("/entities", response_model=RecordsResponse, tags=["NER"]) 42 | async def extract_entities(body: RecordsRequest = Body(..., example=example_request)): 43 | """Extract Named Entities from a batch of Records.""" 44 | 45 | res = [] 46 | documents = [] 47 | 48 | for val in body.values: 49 | documents.append({"id": val.recordId, "text": val.data.text}) 50 | 51 | entities_res = extractor.extract_entities(documents) 52 | 53 | res = [ 54 | {"recordId": er["id"], "data": {"entities": er["entities"]}} 55 | for er in entities_res 56 | ] 57 | 58 | return {"values": res} 59 | 60 | 61 | @app.post( 62 | "/entities_by_type", response_model=RecordsEntitiesByTypeResponse, tags=["NER"] 63 | ) 64 | async def extract_entities_by_type(body: RecordsRequest = Body(..., example=example_request)): 65 | """Extract Named Entities from a batch of Records separated by entity label. 
66 | This route can be used directly as a Cognitive Skill in Azure Search 67 | For Documentation on integration with Azure Search, see here: 68 | https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface""" 69 | 70 | res = [] 71 | documents = [] 72 | 73 | for val in body.values: 74 | documents.append({"id": val.recordId, "text": val.data.text}) 75 | 76 | entities_res = extractor.extract_entities(documents) 77 | res = [] 78 | 79 | for er in entities_res: 80 | groupby = defaultdict(list) 81 | for ent in er["entities"]: 82 | ent_prop = ENT_PROP_MAP[ent["label"]] 83 | groupby[ent_prop].append(ent["name"]) 84 | record = {"recordId": er["id"], "data": groupby} 85 | res.append(record) 86 | 87 | return {"values": res} 88 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/app/data/example_request.json: -------------------------------------------------------------------------------- 1 | { 2 | "values": [ 3 | { 4 | "recordId": "a1", 5 | "data": { 6 | "text": "But Google is starting from behind. The company made a late push into hardware, and Apple's Siri, available on iPhones, and Amazon's Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption.", 7 | "language": "en" 8 | } 9 | } 10 | ] 11 | } -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/app/models.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. All rights reserved. 2 | # Licensed under the MIT License. 
3 | 4 | from typing import Dict, List, Optional 5 | from pydantic import BaseModel 6 | 7 | 8 | ENT_PROP_MAP = { 9 | "CARDINAL": "cardinals", 10 | "DATE": "dates", 11 | "EVENT": "events", 12 | "FAC": "facilities", 13 | "GPE": "gpes", 14 | "LANGUAGE": "languages", 15 | "LAW": "laws", 16 | "LOC": "locations", 17 | "MONEY": "money", 18 | "NORP": "norps", 19 | "ORDINAL": "ordinals", 20 | "ORG": "organizations", 21 | "PERCENT": "percentages", 22 | "PERSON": "people", 23 | "PRODUCT": "products", 24 | "QUANTITY": "quantities", 25 | "TIME": "times", 26 | "WORK_OF_ART": "worksOfArt", 27 | } 28 | 29 | 30 | class RecordDataRequest(BaseModel): 31 | text: str 32 | language: str = "en" 33 | 34 | 35 | class RecordRequest(BaseModel): 36 | recordId: str 37 | data: RecordDataRequest 38 | 39 | 40 | class RecordsRequest(BaseModel): 41 | values: List[RecordRequest] 42 | 43 | 44 | class RecordDataResponse(BaseModel): 45 | entities: List 46 | 47 | 48 | class Message(BaseModel): 49 | message: str 50 | 51 | 52 | class RecordResponse(BaseModel): 53 | recordId: str 54 | data: RecordDataResponse 55 | errors: Optional[List[Message]] = None 56 | warnings: Optional[List[Message]] = None 57 | 58 | 59 | class RecordsResponse(BaseModel): 60 | values: List[RecordResponse] 61 | 62 | 63 | class RecordEntitiesByTypeResponse(BaseModel): 64 | recordId: str 65 | data: Dict[str, List[str]] 66 | 67 | 68 | class RecordsEntitiesByTypeResponse(BaseModel): 69 | values: List[RecordEntitiesByTypeResponse] 70 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/app/spacy_extractor.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. All rights reserved. 2 | # Licensed under the MIT License. 
3 | 4 | from typing import Dict, List 5 | import spacy 6 | from spacy.language import Language 7 | 8 | 9 | class SpacyExtractor: 10 | """class SpacyExtractor encapsulates logic to pipe Records with an id and text body 11 | through a spacy model and return entities separated by Entity Type 12 | """ 13 | 14 | def __init__( 15 | self, nlp: Language, input_id_col: str = "id", input_text_col: str = "text" 16 | ): 17 | """Initialize the SpacyExtractor pipeline. 18 | 19 | nlp (spacy.language.Language): pre-loaded spacy language model 20 | input_text_col (str): property on each document to run the model on 21 | input_id_col (str): property on each document to correlate with request 22 | 23 | RETURNS (SpacyExtractor): The newly constructed object. 24 | """ 25 | self.nlp = nlp 26 | self.input_id_col = input_id_col 27 | self.input_text_col = input_text_col 28 | 29 | def _name_to_id(self, text: str): 30 | """Utility function to do a messy normalization of an entity name 31 | 32 | text (str): text to create "id" from 33 | """ 34 | return "-".join([s.lower() for s in text.split()]) 35 | 36 | def extract_entities(self, records: List[Dict[str, str]]): 37 | """Apply the pre-trained model to a batch of records 38 | 39 | records (list): The list of "document" dictionaries each with an 40 | `id` and `text` property 41 | 42 | RETURNS (list): List of responses containing the id of 43 | the corresponding document and a list of entities. 
44 | """ 45 | ids = (doc[self.input_id_col] for doc in records) 46 | texts = (doc[self.input_text_col] for doc in records) 47 | 48 | res = [] 49 | 50 | for doc_id, spacy_doc in zip(ids, self.nlp.pipe(texts)): 51 | entities = {} 52 | for ent in spacy_doc.ents: 53 | ent_id = ent.kb_id 54 | if not ent_id: 55 | ent_id = ent.ent_id 56 | if not ent_id: 57 | ent_id = self._name_to_id(ent.text) 58 | 59 | if ent_id not in entities: 60 | if ent.text.lower() == ent.text: 61 | ent_name = ent.text.capitalize() 62 | else: 63 | ent_name = ent.text 64 | entities[ent_id] = { 65 | "name": ent_name, 66 | "label": ent.label_, 67 | "matches": [], 68 | } 69 | entities[ent_id]["matches"].append( 70 | {"start": ent.start_char, "end": ent.end_char, "text": ent.text} 71 | ) 72 | 73 | res.append({"id": doc_id, "entities": list(entities.values())}) 74 | return res 75 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/app/tests/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. All rights reserved. 2 | # Licensed under the MIT License. 3 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/app/tests/test_api.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. All rights reserved. 2 | # Licensed under the MIT License. 3 | 4 | from starlette.testclient import TestClient 5 | from app.api import app 6 | 7 | 8 | def test_docs_redirect(): 9 | client = TestClient(app) 10 | response = client.get("/") 11 | assert response.history[0].status_code == 307 12 | assert response.status_code == 200 13 | assert response.url == "http://testserver/docs" 14 | 15 | 16 | def test_api(): 17 | client = TestClient(app) 18 | 19 | text = """But Google is starting from behind. 
The company made a late push 20 | into hardware, and Apple's Siri, available on iPhones, and Amazon's Alexa 21 | software, which runs on its Echo and Dot devices, have clear leads in 22 | consumer adoption.""" 23 | 24 | request_data = { 25 | "values": [{"recordId": "a1", "data": {"text": text, "language": "en"}}] 26 | } 27 | 28 | response = client.post("/entities", json=request_data) 29 | assert response.status_code == 200 30 | 31 | first_record = response.json()["values"][0] 32 | assert first_record["recordId"] == "a1" 33 | assert first_record["errors"] is None 34 | assert first_record["warnings"] is None 35 | 36 | entity_names = [ent["name"] for ent in first_record["data"]["entities"]] 37 | assert sorted(entity_names, key=str.lower) == [ 38 | "Alexa", 39 | "Amazon", 40 | "Apple", 41 | "Echo and Dot", 42 | "Google", 43 | "iPhones", 44 | "Siri", 45 | ] 46 | -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/images/cookiecutter-docs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/cookiecutter-spacy-fastapi/78edb2597405e6a8537c55c1f33a3e00451ce9e2/{{cookiecutter.project_slug}}/images/cookiecutter-docs.png -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/main.py: -------------------------------------------------------------------------------- 1 | import uvicorn 2 | from app.api import app 3 | 4 | 5 | if __name__ == '__main__': 6 | uvicorn.run(app, host='0.0.0.0', port=8080, log_level='info') -------------------------------------------------------------------------------- /{{cookiecutter.project_slug}}/requirements.txt: -------------------------------------------------------------------------------- 1 | fastapi >= 0.75.0 2 | uvicorn >= 0.17.1 3 | spacy >= 3.2, <3.3 4 | python-dotenv --------------------------------------------------------------------------------
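The `/entities_by_type` route in `app/api.py` groups entity names by looking up each spaCy label in `ENT_PROP_MAP`, so a label missing from that map would raise a `KeyError`. The sketch below reproduces the grouping step on plain dictionaries, without spaCy or FastAPI, and shows one possible way to tolerate unmapped labels. The function name `group_entities_by_type`, the `fallback` parameter, the `"other"` bucket, and the `LOC_CUSTOM` label are assumptions made for illustration; none of them exist in the template.

```python
from collections import defaultdict
from typing import Dict, List

# Trimmed copy of the label-to-property mapping from app/models.py.
ENT_PROP_MAP = {
    "ORG": "organizations",
    "PERSON": "people",
    "PRODUCT": "products",
}


def group_entities_by_type(
    entities: List[dict], fallback: str = "other"
) -> Dict[str, List[str]]:
    """Group entity names under the property name mapped from their label.

    `fallback` is a hypothetical addition: labels that are not in
    ENT_PROP_MAP are collected under it instead of raising KeyError.
    """
    grouped = defaultdict(list)
    for ent in entities:
        prop = ENT_PROP_MAP.get(ent["label"], fallback)
        grouped[prop].append(ent["name"])
    return dict(grouped)


# Entities shaped like the output of SpacyExtractor.extract_entities,
# reduced to the two keys the grouping step reads.
entities = [
    {"name": "Google", "label": "ORG"},
    {"name": "Siri", "label": "PRODUCT"},
    {"name": "Apple", "label": "ORG"},
    {"name": "Berlin", "label": "LOC_CUSTOM"},  # made-up label, not in the map
]

print(group_entities_by_type(entities))
```

Whether unmapped labels should be bucketed, skipped, or rejected depends on how the Azure Search skillset consumes the output, so treat the fallback as one option rather than the template's behavior.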