├── LICENSE
├── README.md
├── SECURITY.md
├── cookiecutter.json
├── images
│   └── cookiecutter-docs.png
├── requirements.txt
└── {{cookiecutter.project_slug}}
    ├── .gitignore
    ├── Dockerfile
    ├── README.md
    ├── app
    │   ├── __init__.py
    │   ├── api.py
    │   ├── data
    │   │   └── example_request.json
    │   ├── models.py
    │   ├── spacy_extractor.py
    │   └── tests
    │       ├── __init__.py
    │       └── test_api.py
    ├── images
    │   └── cookiecutter-docs.png
    ├── main.py
    └── requirements.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 - present Microsoft Corporation
 4 | 
 5 | All rights reserved.
 6 | 
 7 | Permission is hereby granted, free of charge, to any person obtaining a copy
 8 | of this software and associated documentation files (the "Software"), to deal
 9 | in the Software without restriction, including without limitation the rights
10 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
11 | copies of the Software, and to permit persons to whom the Software is
12 | furnished to do so, subject to the following conditions:
13 | 
14 | The above copyright notice and this permission notice shall be included in all
15 | copies or substantial portions of the Software.
16 | 
17 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
18 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
19 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
20 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
21 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
22 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
23 | SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # cookiecutter-spacy-fastapi
 2 | 
 3 | Python cookiecutter API for quick deployments of spaCy models with FastAPI
 4 | 
 5 | ## Azure Search
 6 | The API interface is compatible with Azure Search Cognitive Skills.
 7 | 
 8 | For instructions on adding your API as a Custom Cognitive Skill in Azure Search see:
 9 | https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface
10 | 
11 | ## Requirements
12 | - Python >= 3.6 with pip installed
13 | 
14 | ## Quickstart
15 | 
16 | ### Install the latest [Cookiecutter](https://github.com/audreyr/cookiecutter) if you haven't installed it yet (this requires Cookiecutter 1.4.0 or higher):
17 | ```
18 | pip install --user cookiecutter
19 | ```
20 | 
21 | ### Point cookiecutter to this GitHub repository to automatically download and generate your project
22 | 
23 | ```
24 | cookiecutter https://github.com/microsoft/cookiecutter-spacy-fastapi
25 | ```
26 | 
27 | See the README.md of your new project for next steps.
28 | 
29 | ## Resources
30 | This project has two key dependencies:
31 | 
32 | | Dependency Name | Documentation                | Description                                                                            |
33 | |-----------------|------------------------------|----------------------------------------------------------------------------------------|
34 | | spaCy           | https://spacy.io             | Industrial-strength Natural Language Processing (NLP) with Python and Cython           |
35 | | FastAPI         | https://fastapi.tiangolo.com | FastAPI framework, high performance, easy to learn, fast to code, ready for production |
36 | 
37 | 
38 | # Contributing
39 | 
40 | This project welcomes contributions and suggestions.  Most contributions require you to agree to a
41 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
42 | the rights to use your contribution. For details, visit https://cla.microsoft.com.
43 | 
44 | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
45 | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
46 | provided by the bot. You will only need to do this once across all repos using our CLA.
47 | 
48 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
49 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
50 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.


--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
 1 | <!-- BEGIN MICROSOFT SECURITY.MD V0.0.7 BLOCK -->
 2 | 
 3 | ## Security
 4 | 
 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
 6 | 
 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below.
 8 | 
 9 | ## Reporting Security Issues
10 | 
11 | **Please do not report security vulnerabilities through public GitHub issues.**
12 | 
13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report).
14 | 
15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com).  If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey).
16 | 
17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc). 
18 | 
19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
20 | 
21 |   * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
22 |   * Full paths of source file(s) related to the manifestation of the issue
23 |   * The location of the affected source code (tag/branch/commit or direct URL)
24 |   * Any special configuration required to reproduce the issue
25 |   * Step-by-step instructions to reproduce the issue
26 |   * Proof-of-concept or exploit code (if possible)
27 |   * Impact of the issue, including how an attacker might exploit the issue
28 | 
29 | This information will help us triage your report more quickly.
30 | 
31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs.
32 | 
33 | ## Preferred Languages
34 | 
35 | We prefer all communications to be in English.
36 | 
37 | ## Policy
38 | 
39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd).
40 | 
41 | <!-- END MICROSOFT SECURITY.MD BLOCK -->
42 | 


--------------------------------------------------------------------------------
/cookiecutter.json:
--------------------------------------------------------------------------------
1 | {
2 |     "project_name": "spaCy FastAPI Azure Cognitive Skill",
3 |     "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}",
4 |     "short_description": "spaCy FastAPI for Custom Cognitive Skills in Azure Search",
5 |     "spacy_model": "This must be one of spaCy's default models. See https://spacy.io/usage for a supported list."
6 | }
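
The `project_slug` default above is a Jinja expression evaluated against `project_name`. As a rough illustration, the same string operations in plain Python might look like this (the `slugify` helper name is hypothetical and not part of the template):

```python
def slugify(project_name: str) -> str:
    """Mirror the template default: lowercase, then spaces and hyphens become underscores."""
    return project_name.lower().replace(" ", "_").replace("-", "_")


# The default project name from cookiecutter.json.
print(slugify("spaCy FastAPI Azure Cognitive Skill"))
# prints: spacy_fastapi_azure_cognitive_skill
```

The resulting slug is used as the name of the generated project directory.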


--------------------------------------------------------------------------------
/images/cookiecutter-docs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/cookiecutter-spacy-fastapi/78edb2597405e6a8537c55c1f33a3e00451ce9e2/images/cookiecutter-docs.png


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | cookiecutter
2 | vsts
3 | ruamel.yaml
4 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/.gitignore:
--------------------------------------------------------------------------------
 1 | .vscode
 2 | .venv
 3 | app/.env
 4 | 
 5 | forge.yaml
 6 | .forge
 7 | 
 8 | *.py[cod]
 9 | 
10 | # C extensions
11 | *.so
12 | 
13 | # Packages
14 | *.egg
15 | *.egg-info
16 | build
17 | eggs
18 | parts
19 | bin
20 | var
21 | sdist
22 | develop-eggs
23 | .installed.cfg
24 | lib
25 | lib64
26 | 
27 | # Installer logs
28 | pip-log.txt
29 | 
30 | # Unit test / coverage reports
31 | .coverage
32 | .tox
33 | nosetests.xml
34 | 
35 | # Translations
36 | *.mo
37 | 
38 | # Mr Developer
39 | .mr.developer.cfg
40 | .project
41 | .pydevproject
42 | 
43 | # Complexity
44 | output/*.html
45 | output/*/index.html
46 | 
47 | # Sphinx
48 | docs/_build
49 | README.html
50 | 
51 | # Cookiecutter
52 | output/
53 | 
54 | myflaskapp/
55 | bower_components
56 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM tiangolo/uvicorn-gunicorn-fastapi:python3.9
 2 | ENV PORT 8080
 3 | ENV APP_MODULE app.api:app
 4 | ENV LOG_LEVEL debug
 5 | ENV WEB_CONCURRENCY 2
 6 | 
 7 | COPY ./requirements.txt ./requirements.txt
 8 | RUN pip install -r requirements.txt
 9 | 
10 | RUN spacy download {{cookiecutter.spacy_model}}
11 | 
12 | COPY ./app /app/app
13 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/README.md:
--------------------------------------------------------------------------------
 1 | # {{cookiecutter.project_name}}
 2 | 
 3 | {{cookiecutter.short_description}}
 4 | 
 5 | ---
 6 | 
 7 | ## Azure Search Cognitive Skills
 8 | For instructions on adding your API as a Custom Cognitive Skill in Azure Search see:
 9 | https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface
10 | 
11 | ## Resources
12 | This project has two key dependencies:
13 | 
14 | | Dependency Name | Documentation                | Description                                                                            |
15 | |-----------------|------------------------------|----------------------------------------------------------------------------------------|
16 | | spaCy           | https://spacy.io             | Industrial-strength Natural Language Processing (NLP) with Python and Cython           |
17 | | FastAPI         | https://fastapi.tiangolo.com | FastAPI framework, high performance, easy to learn, fast to code, ready for production |
18 | ---
19 | 
20 | ## Run Locally
21 | To run locally in debug mode run:
22 | 
23 | ```
24 | cd ./{{cookiecutter.project_slug}}
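25 | python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt && python -m spacy download {{cookiecutter.spacy_model}}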
26 | uvicorn app.api:app --reload
27 | ```
28 | Open your browser to http://localhost:8000/docs to view the OpenAPI UI.
29 | 
30 | ![Open API Image](./images/cookiecutter-docs.png)
31 | 
32 | 
33 | For an alternate view of the docs navigate to http://localhost:8000/redoc
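
With the server running, the API can also be called from a script. The following is a minimal sketch using only the Python standard library. It assumes the default uvicorn address `http://localhost:8000` and the `/entities_by_type` route defined in `app/api.py`; the helper names `build_request` and `post_entities` are hypothetical and not part of the generated project.

```python
import json
import urllib.request


def build_request(texts):
    """Wrap plain strings in the records format the API expects."""
    return {
        "values": [
            {"recordId": str(i), "data": {"text": text, "language": "en"}}
            for i, text in enumerate(texts, start=1)
        ]
    }


def post_entities(payload, url="http://localhost:8000/entities_by_type"):
    """POST the payload to a locally running server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build and inspect a payload; no network call happens here.
payload = build_request(["Apple's Siri is available on iPhones."])
print(json.dumps(payload, indent=2))
# With the server running: print(post_entities(payload))
```

Each record in the response carries the same `recordId` as the request, so results can be matched back to their inputs.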
34 | 
35 | ---
36 | 
37 | ## Deploy with Azure Pipelines
38 | Follow this guide to set up an Azure Resource Group with instances of Azure Kubernetes Service and Azure Container Registry, and to set up CI/CD with Azure Pipelines.
39 | 
40 | https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/kubernetes/aks-template?view=azure-devops
41 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/__init__.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation. All rights reserved.
2 | # Licensed under the MIT License.
3 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/api.py:
--------------------------------------------------------------------------------
 1 | # Copyright (c) Microsoft Corporation. All rights reserved.
 2 | # Licensed under the MIT License.
 3 | 
 4 | from collections import defaultdict
 5 | 
 6 | from fastapi import Body, FastAPI
 7 | from starlette.responses import RedirectResponse
 8 | import spacy
 9 | import srsly
10 | 
11 | from app.models import (
12 |     ENT_PROP_MAP,
13 |     RecordsRequest,
14 |     RecordsResponse,
15 |     RecordsEntitiesByTypeResponse,
16 | )
17 | from app.spacy_extractor import SpacyExtractor
18 | 
19 | 
20 | app = FastAPI(
21 |     title="{{cookiecutter.project_name}}",
22 |     version="1.0",
23 |     description="{{cookiecutter.short_description}}",
24 | )
25 | 
26 | # Example payload shown in the OpenAPI docs (path is relative to the working directory).
27 | example_request = srsly.read_json("app/data/example_request.json")
28 | 
29 | # Load the spaCy model once at import time and reuse it for every request.
30 | nlp = spacy.load("{{cookiecutter.spacy_model}}")
31 | extractor = SpacyExtractor(nlp)
32 | 
33 | 
34 | @app.get("/", include_in_schema=False)
35 | def docs_redirect():
36 |     return RedirectResponse("/docs")
37 | 
38 | 
39 | @app.post("/entities", response_model=RecordsResponse, tags=["NER"])
40 | async def extract_entities(body: RecordsRequest = Body(..., example=example_request)):
41 |     """Extract Named Entities from a batch of Records."""
42 | 
43 |     documents = []
44 | 
45 |     for val in body.values:
46 |         documents.append({"id": val.recordId, "text": val.data.text})
47 | 
48 |     entities_res = extractor.extract_entities(documents)
49 | 
50 |     res = [
51 |         {"recordId": er["id"], "data": {"entities": er["entities"]}}
52 |         for er in entities_res
53 |     ]
54 | 
55 |     return {"values": res}
56 | 
57 | 
58 | @app.post(
59 |     "/entities_by_type", response_model=RecordsEntitiesByTypeResponse, tags=["NER"]
60 | )
61 | async def extract_entities_by_type(body: RecordsRequest = Body(..., example=example_request)):
62 |     """Extract Named Entities from a batch of Records, grouped by entity label.
63 | 
64 |     This route can be used directly as a Cognitive Skill in Azure Search.
65 |     For documentation on integration with Azure Search, see:
66 |     https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface
67 |     """
68 | 
69 |     documents = []
70 | 
71 |     for val in body.values:
72 |         documents.append({"id": val.recordId, "text": val.data.text})
73 | 
74 |     entities_res = extractor.extract_entities(documents)
75 |     res = []
76 | 
77 |     for er in entities_res:
78 |         groupby = defaultdict(list)
79 |         for ent in er["entities"]:
80 |             # Fall back to the lowercased label for labels missing from ENT_PROP_MAP.
81 |             ent_prop = ENT_PROP_MAP.get(ent["label"], ent["label"].lower())
82 |             groupby[ent_prop].append(ent["name"])
83 |         record = {"recordId": er["id"], "data": groupby}
84 |         res.append(record)
85 | 
86 |     return {"values": res}
87 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/data/example_request.json:
--------------------------------------------------------------------------------
 1 | {
 2 |     "values": [
 3 |         {
 4 |             "recordId": "a1",
 5 |             "data": {
 6 |                 "text": "But Google is starting from behind. The company made a late push into hardware, and Apple's Siri, available on iPhones, and Amazon's Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption.",
 7 |                 "language": "en"
 8 |             }
 9 |         }
10 |     ]
11 | }


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/models.py:
--------------------------------------------------------------------------------
 1 | # Copyright (c) Microsoft Corporation. All rights reserved.
 2 | # Licensed under the MIT License.
 3 | 
 4 | from typing import Dict, List, Optional
 5 | from pydantic import BaseModel
 6 | 
 7 | 
 8 | ENT_PROP_MAP = {
 9 |     "CARDINAL": "cardinals",
10 |     "DATE": "dates",
11 |     "EVENT": "events",
12 |     "FAC": "facilities",
13 |     "GPE": "gpes",
14 |     "LANGUAGE": "languages",
15 |     "LAW": "laws",
16 |     "LOC": "locations",
17 |     "MONEY": "money",
18 |     "NORP": "norps",
19 |     "ORDINAL": "ordinals",
20 |     "ORG": "organizations",
21 |     "PERCENT": "percentages",
22 |     "PERSON": "people",
23 |     "PRODUCT": "products",
24 |     "QUANTITY": "quantities",
25 |     "TIME": "times",
26 |     "WORK_OF_ART": "worksOfArt",
27 | }
28 | 
29 | 
30 | class RecordDataRequest(BaseModel):
31 |     text: str
32 |     language: str = "en"
33 | 
34 | 
35 | class RecordRequest(BaseModel):
36 |     recordId: str
37 |     data: RecordDataRequest
38 | 
39 | 
40 | class RecordsRequest(BaseModel):
41 |     values: List[RecordRequest]
42 | 
43 | 
44 | class RecordDataResponse(BaseModel):
45 |     entities: List
46 | 
47 | 
48 | class Message(BaseModel):
49 |     message: str
50 | 
51 | 
52 | class RecordResponse(BaseModel):
53 |     recordId: str
54 |     data: RecordDataResponse
55 |     errors: Optional[List[Message]] = None
56 |     warnings: Optional[List[Message]] = None
57 | 
58 | 
59 | class RecordsResponse(BaseModel):
60 |     values: List[RecordResponse]
61 | 
62 | 
63 | class RecordEntitiesByTypeResponse(BaseModel):
64 |     recordId: str
65 |     data: Dict[str, List[str]]
66 | 
67 | 
68 | class RecordsEntitiesByTypeResponse(BaseModel):
69 |     values: List[RecordEntitiesByTypeResponse]
70 | 
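
`ENT_PROP_MAP` above maps spaCy entity labels to the property names returned by the `/entities_by_type` route. As a simplified sketch of that grouping step, the snippet below uses a two-entry subset of the mapping and hand-written sample entities rather than real model output. The `group_by_type` helper name is hypothetical.

```python
from collections import defaultdict

# Subset of the label-to-property mapping, copied here to keep the sketch self-contained.
ENT_PROP_MAP = {"ORG": "organizations", "PRODUCT": "products"}


def group_by_type(entities):
    """Group entity names under the output property for their label."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ENT_PROP_MAP[ent["label"]]].append(ent["name"])
    return dict(grouped)


# Hand-written sample entities, not actual model output.
sample = [
    {"name": "Google", "label": "ORG"},
    {"name": "Apple", "label": "ORG"},
    {"name": "Echo", "label": "PRODUCT"},
]
print(group_by_type(sample))
# prints: {'organizations': ['Google', 'Apple'], 'products': ['Echo']}
```

The grouped dictionary has the `Dict[str, List[str]]` shape declared for `RecordEntitiesByTypeResponse.data`.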


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/spacy_extractor.py:
--------------------------------------------------------------------------------
 1 | # Copyright (c) Microsoft Corporation. All rights reserved.
 2 | # Licensed under the MIT License.
 3 | 
 4 | from typing import Dict, List
 5 | import spacy
 6 | from spacy.language import Language
 7 | 
 8 | 
 9 | class SpacyExtractor:
10 |     """class SpacyExtractor encapsulates logic to pipe Records with an id and text body
11 |     through a spacy model and return entities separated by Entity Type
12 |     """
13 | 
14 |     def __init__(
15 |         self, nlp: Language, input_id_col: str = "id", input_text_col: str = "text"
16 |     ):
17 |         """Initialize the SpacyExtractor pipeline.
18 |         
19 |         nlp (spacy.language.Language): pre-loaded spacy language model
20 |         input_text_col (str): property on each document to run the model on
21 |         input_id_col (str): property on each document to correlate with request
22 | 
23 |         RETURNS (SpacyExtractor): The newly constructed object.
24 |         """
25 |         self.nlp = nlp
26 |         self.input_id_col = input_id_col
27 |         self.input_text_col = input_text_col
28 | 
29 |     def _name_to_id(self, text: str):
30 |         """Utility function to do a messy normalization of an entity name
31 | 
32 |         text (str): text to create "id" from
33 |         """
34 |         return "-".join([s.lower() for s in text.split()])
35 | 
36 |     def extract_entities(self, records: List[Dict[str, str]]):
37 |         """Apply the pre-trained model to a batch of records
38 |         
39 |         records (list): The list of "document" dictionaries each with an
40 |             `id` and `text` property
41 |         
42 |         RETURNS (list): List of responses containing the id of 
43 |             the correlating document and a list of entities.
44 |         """
45 |         ids = (doc[self.input_id_col] for doc in records)
46 |         texts = (doc[self.input_text_col] for doc in records)
47 | 
48 |         res = []
49 | 
50 |         for doc_id, spacy_doc in zip(ids, self.nlp.pipe(texts)):
51 |             entities = {}
52 |             for ent in spacy_doc.ents:
53 |                 ent_id = ent.kb_id
54 |                 if not ent_id:
55 |                     ent_id = ent.ent_id
56 |                 if not ent_id:
57 |                     ent_id = self._name_to_id(ent.text)
58 | 
59 |                 if ent_id not in entities:
60 |                     if ent.text.lower() == ent.text:
61 |                         ent_name = ent.text.capitalize()
62 |                     else:
63 |                         ent_name = ent.text
64 |                     entities[ent_id] = {
65 |                         "name": ent_name,
66 |                         "label": ent.label_,
67 |                         "matches": [],
68 |                     }
69 |                 entities[ent_id]["matches"].append(
70 |                     {"start": ent.start_char, "end": ent.end_char, "text": ent.text}
71 |                 )
72 | 
73 |             res.append({"id": doc_id, "entities": list(entities.values())})
74 |         return res
75 | 
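
When an entity has no `kb_id` or `ent_id`, the extractor above falls back to an id derived from the entity text and collects repeated mentions under that id. The sketch below reproduces that fallback without spaCy, using hand-written mentions and made-up character offsets. The standalone function names are hypothetical, and the capitalization of all-lowercase names is left out for brevity.

```python
def name_to_id(text: str) -> str:
    """Lowercase the text and join its whitespace-separated tokens with hyphens."""
    return "-".join(s.lower() for s in text.split())


def collect_matches(mentions):
    """Merge (text, start, end) mentions that normalize to the same id."""
    entities = {}
    for text, start, end in mentions:
        # The first mention seen for an id supplies the display name.
        entry = entities.setdefault(name_to_id(text), {"name": text, "matches": []})
        entry["matches"].append({"start": start, "end": end, "text": text})
    return entities


# Hand-written mentions with made-up offsets, not model output.
merged = collect_matches([("Echo and Dot", 0, 12), ("echo  and dot", 40, 53)])
print(list(merged))
# prints: ['echo-and-dot']
print(len(merged["echo-and-dot"]["matches"]))
# prints: 2
```

Because the id ignores case and extra whitespace, differently written mentions of the same name end up in one entry with several matches.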


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/tests/__init__.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation. All rights reserved.
2 | # Licensed under the MIT License.
3 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/app/tests/test_api.py:
--------------------------------------------------------------------------------
 1 | # Copyright (c) Microsoft Corporation. All rights reserved.
 2 | # Licensed under the MIT License.
 3 | 
 4 | from starlette.testclient import TestClient
 5 | from app.api import app
 6 | 
 7 | 
 8 | def test_docs_redirect():
 9 |     client = TestClient(app)
10 |     response = client.get("/")
11 |     # Starlette's RedirectResponse defaults to 307 Temporary Redirect.
12 |     assert response.history[0].status_code == 307
13 |     assert response.status_code == 200
14 |     assert str(response.url) == "http://testserver/docs"
15 | 
16 | 
17 | def test_api():
18 |     client = TestClient(app)
19 | 
20 |     text = """But Google is starting from behind. The company made a late push
21 |     into hardware, and Apple's Siri, available on iPhones, and Amazon's Alexa
22 |     software, which runs on its Echo and Dot devices, have clear leads in
23 |     consumer adoption."""
24 | 
25 |     request_data = {
26 |         "values": [{"recordId": "a1", "data": {"text": text, "language": "en"}}]
27 |     }
28 | 
29 |     # The NER route is registered as "/entities" in app/api.py.
30 |     response = client.post("/entities", json=request_data)
31 |     assert response.status_code == 200
32 | 
33 |     first_record = response.json()["values"][0]
34 |     assert first_record["recordId"] == "a1"
35 |     assert first_record["errors"] is None
36 |     assert first_record["warnings"] is None
37 | 
38 |     # Each entity is a dict with "name", "label" and "matches"; compare names only.
39 |     # The exact entities found depend on the spaCy model in use.
40 |     entity_names = sorted(
41 |         (ent["name"] for ent in first_record["data"]["entities"]), key=str.lower
42 |     )
43 |     assert entity_names == [
44 |         "Alexa",
45 |         "Amazon",
46 |         "Apple",
47 |         "Echo and Dot",
48 |         "Google",
49 |         "iPhones",
50 |         "Siri",
51 |     ]
52 | 


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/images/cookiecutter-docs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/cookiecutter-spacy-fastapi/78edb2597405e6a8537c55c1f33a3e00451ce9e2/{{cookiecutter.project_slug}}/images/cookiecutter-docs.png


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/main.py:
--------------------------------------------------------------------------------
1 | import uvicorn
2 | from app.api import app
3 | 
4 | 
5 | if __name__ == '__main__':
6 |     uvicorn.run(app, host='0.0.0.0', port=8080, log_level='info')


--------------------------------------------------------------------------------
/{{cookiecutter.project_slug}}/requirements.txt:
--------------------------------------------------------------------------------
1 | fastapi >= 0.75.0
2 | uvicorn >= 0.17.1
3 | spacy >= 3.2, <3.3


--------------------------------------------------------------------------------