├── tests ├── __init__.py └── pii_codex │ ├── __init__.py │ ├── utils │ ├── __init__.py │ └── test_pii_mapping_util.py │ └── services │ ├── __init__.py │ ├── adapters │ ├── __init__.py │ └── detection_adapters │ │ ├── __init__.py │ │ ├── test_azure_detection_adapter.py │ │ ├── test_presidio_detection_adapter.py │ │ └── test_aws_detection_adapter.py │ ├── test_assessment_service.py │ └── test_detection_service.py ├── pii_codex ├── models │ ├── __init__.py │ ├── microsoft_presidio_pii.py │ ├── aws_pii.py │ ├── analysis.py │ └── common.py ├── utils │ ├── __init__.py │ ├── package_installer_util.py │ ├── logging.py │ ├── statistics_util.py │ └── pii_mapping_util.py ├── services │ ├── __init__.py │ ├── adapters │ │ ├── __init__.py │ │ └── detection_adapters │ │ │ ├── __init__.py │ │ │ ├── detection_adapter_base.py │ │ │ ├── azure_detection_adapter.py │ │ │ ├── presidio_detection_adapter.py │ │ │ └── aws_detection_adapter.py │ ├── analyzers │ │ ├── __init__.py │ │ └── presidio_analysis.py │ └── assessment_service.py ├── __init__.py └── config.py ├── docs ├── PII_Codex_Logo.png ├── UC1_Converting_Existing_Detections_With_Adapters.png ├── UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png ├── LOCAL_SETUP.md ├── MAPPING.md ├── dev │ └── pii_codex │ │ ├── config.html │ │ ├── services │ │ ├── analyzers │ │ │ └── index.html │ │ ├── adapters │ │ │ ├── index.html │ │ │ └── detection_adapters │ │ │ │ └── index.html │ │ └── index.html │ │ ├── index.html │ │ ├── models │ │ └── index.html │ │ └── utils │ │ ├── package_installer_util.html │ │ ├── index.html │ │ └── logging.html └── DETECTION_AND_ANALYSIS.md ├── scripts └── update_citation.sh ├── .pre-commit-config.yaml ├── codecov.yml ├── .github ├── pull_request_template.md ├── workflows │ ├── draft-pdf.yml │ ├── checks.yml │ ├── test.yml │ └── release.yml └── ISSUE_TEMPLATE │ └── issue_template.md ├── CITATION.cff ├── mypy.ini ├── .zenodo.json ├── LICENSE ├── Makefile ├── .gitignore ├── CONTRIBUTING.md ├── pyproject.toml └── joss ├── paper.bib └── paper.md /tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/models/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/services/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/services/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/__init__.py: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /pii_codex/services/analyzers/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.5.0" 2 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/PII_Codex_Logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EdyVision/pii-codex/HEAD/docs/PII_Codex_Logo.png -------------------------------------------------------------------------------- /scripts/update_citation.sh: -------------------------------------------------------------------------------- 1 | sed -r -i '' "s/([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])/$(date '+%Y-%m-%d')/g" CITATION.cff -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: https://github.com/ambv/black 3 | rev: 22.12.0 4 | hooks: 5 | - id: black 6 | language_version: python3.9 -------------------------------------------------------------------------------- /docs/UC1_Converting_Existing_Detections_With_Adapters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EdyVision/pii-codex/HEAD/docs/UC1_Converting_Existing_Detections_With_Adapters.png -------------------------------------------------------------------------------- /docs/UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EdyVision/pii-codex/HEAD/docs/UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png -------------------------------------------------------------------------------- /pii_codex/config.py: -------------------------------------------------------------------------------- 1 | from pii_codex.utils.pii_mapping_util import PIIMapper 2 | 3 | PII_MAPPER = PIIMapper() 4 | DEFAULT_LANG = "en" 5 | DEFAULT_ANALYSIS_MODE = "POPULATION" 6 | DEFAULT_TOKEN_REPLACEMENT_VALUE = "" 7 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | coverage: 2 | status: 3 | project: 4 | default: 5 | target: 90% # the required coverage value 6 | threshold: 1% # the leniency in hitting the target 7 | patch: 8 | default: 9 | target: 90% # the required coverage value 10 | threshold: 1% # the leniency in hitting the target 
-------------------------------------------------------------------------------- /pii_codex/utils/package_installer_util.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import sys 3 | 4 | 5 | def install_spacy_package(package_name): 6 | """ 7 | Installs missing spacy package (if found missing) 8 | @param package_name: 9 | @return: 10 | """ 11 | subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name]) 12 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ### Overview 2 | 3 | PR description. Explain what it's doing and why. 4 | 5 | - Implementation details... 6 | - Bugfix details... 7 | 8 | ### Checklist 9 | 10 | - [ ] New PII Detection Service Added (Optional) 11 | - [ ] New PII types added to pii_type_mappings (Optional) 12 | - [ ] Tests added 13 | - [ ] CI passed 14 | - [ ] Commits Squashed -------------------------------------------------------------------------------- /CITATION.cff: -------------------------------------------------------------------------------- 1 | cff-version: 1.2.0 2 | message: "If you use this software, please cite it as below." 3 | authors: 4 | - family-names: Rosado 5 | given-names: Eidan J. 6 | orcid: https://orcid.org/0000-0003-0665-098X 7 | title: "pii-codex: a Python library for PII detection, categorization, and severity assessment" 8 | version: 0.5.0 9 | doi: 10.5281/zenodo.7212576 10 | date-released: 2025-12-16 11 | -------------------------------------------------------------------------------- /.github/workflows/draft-pdf.yml: -------------------------------------------------------------------------------- 1 | on: [push] 2 | 3 | jobs: 4 | paper: 5 | runs-on: ubuntu-latest 6 | name: Paper Draft 7 | steps: 8 | - name: Checkout 9 | uses: actions/checkout@v3 10 | - name: Build draft PDF 11 | uses: openjournals/openjournals-draft-action@master 12 | with: 13 | journal: joss 14 | paper-path: ./joss/paper.md 15 | - name: Upload 16 | uses: actions/upload-artifact@v4 17 | with: 18 | name: paper 19 | path: ./joss/paper.pdf -------------------------------------------------------------------------------- /pii_codex/utils/logging.py: -------------------------------------------------------------------------------- 1 | from time import time 2 | import logging 3 | 4 | logger = logging.getLogger() 5 | 6 | 7 | def timed_operation(func): 8 | """ 9 | Used to show execution time for function 10 | @param func: 11 | @return: 12 | """ 13 | 14 | def wrapper_function(*args, **kwargs): 15 | """ 16 | Logs the function execution time 17 | 18 | @param args: 19 | @param kwargs: 20 | @return: 21 | """ 22 | start_time = time() 23 | result = func(*args, **kwargs) 24 | end_time = time() 25 | logger.info(f"{func.__name__!r} executed in {(end_time - start_time):.4f}s") 26 | return result 27 | 28 | return wrapper_function 29 | -------------------------------------------------------------------------------- /mypy.ini: -------------------------------------------------------------------------------- 1 | [mypy] 2 | python_version = 3.11 3 | no_implicit_optional = False 4 | warn_unused_ignores = True 5 | warn_redundant_casts = True 6 | warn_unused_configs = True 7 | disallow_untyped_defs = False 8 | disallow_incomplete_defs = False 9 | check_untyped_defs = False 10 | disable_error_code = annotation-unchecked 11 | 12 | [mypy-pytest.*] 13 | ignore_missing_imports = 
True 14 | 15 | [mypy-_pytest.*] 16 | ignore_missing_imports = True 17 | 18 | [mypy-assertpy.*] 19 | ignore_missing_imports = True 20 | 21 | [mypy-presidio_analyzer.*] 22 | ignore_missing_imports = True 23 | 24 | [mypy-presidio_anonymizer.*] 25 | ignore_missing_imports = True 26 | 27 | [mypy-spacy.*] 28 | ignore_missing_imports = True 29 | 30 | [mypy-pandas.*] 31 | ignore_missing_imports = True -------------------------------------------------------------------------------- /.github/workflows/checks.yml: -------------------------------------------------------------------------------- 1 | # Runs typecheck and lint process on feature branches only 2 | name: Checks 3 | 4 | on: 5 | push: 6 | branches-ignore: 7 | - main 8 | jobs: 9 | typecheck: 10 | runs-on: ubuntu-latest 11 | steps: 12 | - name: checkout 13 | uses: actions/checkout@v4 14 | - name: setup - python 15 | uses: actions/setup-python@v4 16 | with: 17 | python-version: 3.12 18 | - name: Install Global Dependencies 19 | run: pip install -U pip && pip install uv 20 | - name: install 21 | run: make install 22 | - name: typecheck 23 | run: make typecheck 24 | - name: lint 25 | run: make lint -------------------------------------------------------------------------------- /.zenodo.json: -------------------------------------------------------------------------------- 1 | { 2 | "access_right": "open", 3 | "version": "0.5.0", 4 | "creators": [ 5 | { 6 | "orcid": "0000-0003-0665-098X", 7 | "name": "Eidan J. Rosado", 8 | "affiliation": "College of Computing and Engineering, Nova Southeastern University, Fort Lauderdale, FL 33314, USA" 9 | } 10 | ], 11 | "keywords": ["Python", "PII", "PII Detection", "risk categories", 12 | "PII categorization","personal identifiable information", 13 | "Named Entity Recognition", "PII Topology"], 14 | "license": "BSD-3-Clause", 15 | "title": "pii-codex: a Python library for PII detection, categorization, and severity assessment", 16 | "language": "eng", 17 | "description": "A research python package for detecting, categorizing, and assessing the severity of personal identifiable information (PII)" 18 | } -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/detection_adapter_base.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.models.analysis import DetectionResult, DetectionResultItem 4 | 5 | 6 | class BasePIIDetectionAdapter: 7 | def convert_analyzed_item(self, pii_detection) -> List[DetectionResultItem]: 8 | """ 9 | Converts a detection result into a collection of DetectionResultItem 10 | 11 | @param pii_detection: dict 12 | @return: List[DetectionResultItem] 13 | """ 14 | raise Exception("Not implemented yet") 15 | 16 | def convert_analyzed_collection(self, pii_detections) -> List[DetectionResult]: 17 | """ 18 | Converts a collection of detection results to a collection of DetectionResult. 19 | 20 | @param pii_detections: List[dict] 21 | @return: List[DetectionResult] 22 | """ 23 | raise Exception("Not implemented yet") 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/issue_template.md: -------------------------------------------------------------------------------- 1 | Issue tracker is **ONLY** used for reporting bugs. New features should be discussed on our slack channel. Please use [stackoverflow](https://stackoverflow.com) for supporting issues. 
2 | 3 | 4 | 5 | ## Expected Behavior 6 | 7 | 8 | ## Current Behavior 9 | 10 | 11 | ## Possible Solution 12 | 13 | 14 | ## Steps to Reproduce 15 | 16 | 17 | 1. 18 | 2. 19 | 3. 20 | 4. 21 | 22 | ## Context (Environment) 23 | 24 | 25 | 26 | 27 | 28 | ## Detailed Description 29 | 30 | 31 | ## Possible Implementation 32 | -------------------------------------------------------------------------------- /pii_codex/models/microsoft_presidio_pii.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from enum import Enum 4 | 5 | 6 | class MSFTPresidioPIIType(Enum): 7 | """ 8 | PII Types associated with Microsoft Presidio Analyzer 9 | Supported Entities: https://microsoft.github.io/presidio/supported_entities/ 10 | """ 11 | 12 | PHONE_NUMBER = "PHONE_NUMBER" 13 | EMAIL_ADDRESS = "EMAIL_ADDRESS" 14 | ABA_ROUTING_NUMBER = "ABA_ROUTING_NUMBER" 15 | IP_ADDRESS = "IP_ADDRESS" 16 | DATE = "DATE_TIME" 17 | ADDRESS = "LOCATION" 18 | AGE = "AGE" 19 | PERSON = "PERSON" 20 | CREDIT_CARD_NUMBER = "CREDIT_CARD" 21 | CRYPTO = "CRYPTO" 22 | URL = "URL" 23 | DATE_TIME = "DATE_TIME" 24 | LOCATION = "LOCATION" 25 | NRP = "NRP" 26 | MEDICAL_LICENSE = "MEDICAL_LICENSE" 27 | US_SOCIAL_SECURITY_NUMBER = "US_SSN" 28 | US_BANK_ACCOUNT_NUMBER = "US_BANK_NUMBER" 29 | US_DRIVERS_LICENSE_NUMBER = "US_DRIVER_LICENSE" 30 | US_PASSPORT_NUMBER = "US_PASSPORT" 31 | US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_ITIN" 32 | INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "IBAN_CODE" 33 | # UK_NATIONAL_HEALTH_NUMBER = "UK_NHS" # To be added in future versions 34 | AU_BUSINESS_NUMBER = "AU_ABN" 35 | AU_COMPANY_NUMBER = "AU_ACN" 36 | AU_MEDICAL_ACCOUNT_NUMBER = "AU_MEDICARE" 37 | AU_TAX_FILE_NUMBER = "AU_TFN" 38 | -------------------------------------------------------------------------------- /.github/workflows/test.yml: -------------------------------------------------------------------------------- 1 | name: Test 2 | 3 | on: 4 | # Triggers the test workflow on push for all branches 5 | push: 6 | paths: 7 | - "pii_codex/**" 8 | - "uv.lock" 9 | - "pyproject.toml" 10 | - ".github/workflows/test.yml" 11 | 12 | # Allows you to run this workflow manually from the Actions tab 13 | workflow_dispatch: 14 | 15 | jobs: 16 | build: 17 | name: Run Tests 18 | strategy: 19 | matrix: 20 | python-version: [ "3.9", "3.10", "3.11", "3.12" ] 21 | os: [ubuntu-latest, macos-latest] 22 | runs-on: ${{ matrix.os }} 23 | 24 | # Checkout the code, install uv, install dependencies, 25 | # and run tests with coverage 26 | steps: 27 | - name: Environment Setup 28 | uses: actions/checkout@v4 29 | - name: Setup Python 30 | uses: actions/setup-python@v4 31 | with: 32 | python-version: ${{ matrix.python-version }} 33 | - name: Install Global Dependencies 34 | run: pip install -U pip && pip install uv 35 | - name: Install Project Dependencies 36 | run: | 37 | make install 38 | - name: Run Tests 39 | run: | 40 | make test.coverage 41 | - uses: codecov/codecov-action@v3 # Coverage report submitted only on merge to main 42 | with: 43 | token: ${{ secrets.CODECOV_TOKEN }} 44 | name: codecov-umbrella 45 | files: ./coverage.xml 46 | verbose: true 47 | -------------------------------------------------------------------------------- /pii_codex/utils/statistics_util.py: -------------------------------------------------------------------------------- 1 | import statistics 2 | 3 | import numpy as np 4 | 5 | 6 | def get_population_standard_deviation(values) -> float: 7 | return statistics.pstdev(values) 8 | 9 | 
10 | def get_population_variance(values) -> float: 11 | return statistics.pvariance(values) 12 | 13 | 14 | def get_standard_deviation(values, collection_type: str) -> float: 15 | if collection_type.lower() != "sample" and collection_type.lower() != "population": 16 | raise Exception("Invalid collection type. Must be 'SAMPLE' or 'POPULATION'.") 17 | 18 | return ( 19 | statistics.stdev(values) 20 | if collection_type.lower() == "sample" 21 | else get_population_standard_deviation(values) 22 | ) 23 | 24 | 25 | def get_variance(values, collection_type: str) -> float: 26 | if collection_type.lower() != "sample" and collection_type.lower() != "population": 27 | raise Exception("Invalid collection type. Must be 'SAMPLE' or 'POPULATION'.") 28 | 29 | return ( 30 | statistics.variance(values) 31 | if collection_type.lower() == "sample" 32 | else get_population_variance(values) 33 | ) 34 | 35 | 36 | def get_mean(values) -> float: 37 | return statistics.mean(values) 38 | 39 | 40 | def get_median(values) -> float: 41 | return statistics.median(values) 42 | 43 | 44 | def get_mode(values): 45 | return statistics.mode(values) 46 | 47 | 48 | def get_sum(values): 49 | return np.sum(values) 50 | -------------------------------------------------------------------------------- /docs/LOCAL_SETUP.md: -------------------------------------------------------------------------------- 1 | # New Project Setup 2 | See README.md for step-by-step details on installing the pii-codex package as an import. 3 | 4 | Video Demo of package import and usage: 5 |
6 | 7 | [![PII-Codex Video Demo](https://img.youtube.com/vi/51TP2I5SNlo/0.jpg)](https://youtu.be/51TP2I5SNlo) 8 | 9 |
10 | 11 | `Note: This video has no sound, it just shows steps taken to install the package and use it in a file.` 12 | 13 | # Local Repo Setup 14 | For those contributing or modifying the source, use the following to set up locally. 15 | 16 | ## Environment Config 17 | You'll need Python (^3.11) and `uv` configured on your machine. Once those are configured, create a virtual 18 | environment and install dependencies. 19 | 20 | ```bash 21 | uv sync 22 | ``` 23 | 24 | Installing dependencies will vary by usage. For those in need of the PII-Codex integration of the MSFT Presidio Analyzer, it is recommended to install the `detections` extras: 25 | 26 | ```bash 27 | uv sync --extra detections 28 | ``` 29 | 30 | As part of the `detections` extras installation, the download for the `en_core_web_lg` spaCy model will be enabled on first use of the `PresidioPIIAnalyzer()`. If more language support is needed, you'll need to download it separately. Reference explosion/spacy-models. 31 | 32 | Depending on your setup and need, you may need to add the virtual environment to Jupyter. You may do so with the following command: 33 | 34 | ```bash 35 | make jupyter.attach.venv 36 | ``` -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Eidan J. Rosado 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # PII Codex Makefile 2 | # Used from GitHub workflow and locally 3 | default: install test lint 4 | 5 | test: lint test.all 6 | test.cov: test.coverage 7 | 8 | install: 9 | @uv sync 10 | @uv sync --all-extras 11 | @uv sync --extra dev 12 | $(MAKE) install.pre_commit 13 | @echo "Installation complete!" 
14 | 15 | install.pre_commit: 16 | @uv run pre-commit install || (echo "Warning: pre-commit installation failed. You may need to run 'uv sync --extra dev' first." && exit 1) 17 | 18 | install.extras: 19 | @echo "Installing detection dependencies (spaCy, Presidio, etc.)..." 20 | @uv sync --extra detections 21 | 22 | install_spacy_en_core_web_lg: 23 | @python3 -m spacy download en_core_web_lg 24 | 25 | jupyter.attach.venv: 26 | @python3 -m ipykernel install --user --name=venv 27 | 28 | test.all: 29 | @pytest tests 30 | 31 | test.coverage: 32 | @uv run coverage run -m pytest -vv tests && uv run coverage report -m --omit="*/test*,config/*.conf" --fail-under=95 33 | @uv run coverage xml 34 | 35 | lint: 36 | @uv run pylint pii_codex tests 37 | 38 | typecheck: 39 | @uv run mypy pii_codex tests 40 | 41 | format.check: 42 | @black . --check 43 | 44 | format.fix: 45 | @black . 46 | 47 | bump.citation.date: 48 | ./scripts/update_citation.sh 49 | 50 | docs: 51 | @pdoc --html pii_codex --force -o ./docs/dev 52 | 53 | version.bump.patch: 54 | @uv run bumpver update --patch 55 | # $(MAKE) bump.citation.date 56 | 57 | version.bump.minor: 58 | @uv run bumpver update --minor 59 | $(MAKE) bump.citation.date 60 | 61 | version.bump.major: 62 | @uv run bumpver update --major 63 | $(MAKE) bump.citation.date 64 | 65 | package: 66 | @uv build 67 | -------------------------------------------------------------------------------- /tests/pii_codex/services/test_assessment_service.py: -------------------------------------------------------------------------------- 1 | from assertpy import assert_that 2 | from pii_codex.models.common import PIIType 3 | from pii_codex.models.analysis import RiskAssessment 4 | from pii_codex.services.assessment_service import PIIAssessmentService 5 | from pii_codex.utils.statistics_util import get_mean 6 | 7 | 8 | class TestPIIAssessmentService: 9 | pii_assessment_service = PIIAssessmentService() 10 | 11 | def test_assess_pii_type(self): 12 | risk_assessment: RiskAssessment = self.pii_assessment_service.assess_pii_type( 13 | detected_pii_type=PIIType.US_SOCIAL_SECURITY_NUMBER.name 14 | ) 15 | assert_that(risk_assessment.risk_level).is_equal_to(3) 16 | assert_that(risk_assessment.pii_type_detected).is_equal_to( 17 | PIIType.US_SOCIAL_SECURITY_NUMBER.name 18 | ) 19 | 20 | def test_assess_pii_type_list(self): 21 | # PII types with same ratings 22 | risk_assessment_list = self.pii_assessment_service.assess_pii_type_list( 23 | detected_pii_types=[ 24 | PIIType.US_SOCIAL_SECURITY_NUMBER.name, 25 | PIIType.PHONE_NUMBER.name, 26 | ] 27 | ) 28 | assert_that(isinstance(risk_assessment_list[0], RiskAssessment)) 29 | assert_that( 30 | get_mean([assessment.risk_level for assessment in risk_assessment_list]) 31 | ).is_equal_to(3.0) 32 | 33 | # PII types with different ratings 34 | risk_assessment_list = self.pii_assessment_service.assess_pii_type_list( 35 | detected_pii_types=[ 36 | PIIType.US_SOCIAL_SECURITY_NUMBER.name, 37 | PIIType.RACE.name, 38 | ] 39 | ) 40 | assert_that(isinstance(risk_assessment_list[0], RiskAssessment)) 41 | assert_that( 42 | get_mean([assessment.risk_level for assessment in risk_assessment_list]) 43 | ).is_equal_to(2.5) 44 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/azure_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.config import PII_MAPPER 4 | from pii_codex.models.analysis import 
DetectionResultItem, DetectionResult 5 | from pii_codex.services.adapters.detection_adapters.detection_adapter_base import ( 6 | BasePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | class AzurePIIDetectionAdapter(BasePIIDetectionAdapter): 11 | def convert_analyzed_item(self, pii_detection: dict): 12 | """ 13 | Converts a detection result into a collection of DetectionResultItem 14 | 15 | @param pii_detection: dict 16 | @return: List[DetectionResultItem] 17 | """ 18 | return [ 19 | DetectionResultItem( 20 | entity_type=PII_MAPPER.convert_azure_pii_to_common_pii_type( 21 | entity["category"] 22 | ).name, 23 | score=entity["confidence_score"], 24 | start=entity["offset"], 25 | end=entity["offset"] + entity["length"], 26 | ) 27 | for entity in pii_detection["entities"] 28 | ] 29 | 30 | def convert_analyzed_collection( 31 | self, pii_detections: List[dict] 32 | ) -> List[DetectionResult]: 33 | """ 34 | Converts a collection of detection results to a collection of DetectionResult. 35 | 36 | @param pii_detections: List[dict] 37 | @return: List[DetectionResultItem] 38 | """ 39 | detection_results: List[DetectionResult] = [] 40 | for i, detection in enumerate(pii_detections): 41 | # Return results in formatted Analysis Result List object 42 | detection_results.append( 43 | DetectionResult( 44 | index=i, 45 | detections=self.convert_analyzed_item(pii_detection=detection), 46 | ) 47 | ) 48 | 49 | return detection_results 50 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | **/.DS_Store 30 | **/__pycache__ 31 | bin 32 | 33 | # PyInstaller 34 | # Usually these files are written by a python script from a template 35 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 36 | *.manifest 37 | *.spec 38 | 39 | # Installer logs 40 | pip-log.txt 41 | pip-delete-this-directory.txt 42 | 43 | # Unit test / coverage reports 44 | htmlcov/ 45 | .tox/ 46 | .nox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *.cover 53 | *.py,cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # IPython 71 | profile_default/ 72 | ipython_config.py 73 | 74 | # pyenv 75 | .python-version 76 | 77 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 78 | __pypackages__/ 79 | 80 | # Environments 81 | .env 82 | .venv 83 | env/ 84 | venv/ 85 | ENV/ 86 | env.bak/ 87 | venv.bak/ 88 | .envrc 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | .dmypy.json 103 | dmypy.json 104 | 105 | # Pyre type checker 106 | .pyre/ 107 | 108 | # Editor stuff 109 | .idea 110 | .vscode 111 | 112 | backend/bin 113 | 114 | # Test output 115 | coverage.xml 116 | pytest.log 117 | tests/pii_codex/data -------------------------------------------------------------------------------- /pii_codex/models/aws_pii.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from enum import Enum 4 | 5 | 6 | # PII Types and Models as expanded in research 7 | # Research Page: 8 | # AWS Comprehend PII Docs: https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html 9 | 10 | 11 | class AWSComprehendPIIType(Enum): 12 | """ 13 | AWS Comprehend-Supported PII types 14 | """ 15 | 16 | EMAIL_ADDRESS = "EMAIL" 17 | ADDRESS = "ADDRESS" 18 | PERSON = "NAME" 19 | PHONE_NUMBER = "PHONE" 20 | DATE = "DATE_TIME" 21 | URL = "URL" 22 | AGE = "AGE" 23 | USERNAME = "USERNAME" 24 | PASSWORD = "PASSWORD" 25 | CREDIT_DEBIT_NUMBER = "CREDIT_DEBIT_NUMBER" 26 | CREDIT_DEBIT_CVV = "CREDIT_DEBIT_CVV" 27 | CREDIT_DEBIT_EXPIRY = "CREDIT_DEBIT_EXPIRY" 28 | PIN = "PIN" 29 | US_DRIVERS_LICENSE_NUMBER = "DRIVER_ID" 30 | LICENSE_PLATE_NUMBER = "LICENSE_PLATE" 31 | VEHICLE_IDENTIFICATION_NUMBER = "VEHICLE_IDENTIFICATION_NUMBER" 32 | INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "INTERNATIONAL_BANK_ACCOUNT_NUMBER" 33 | SWIFT_CODE = "SWIFT_CODE" 34 | CRYPTO = "CRYPTO_WALLET_ADDRESS" 35 | IP_ADDRESS = "IP_ADDRESS" 36 | IPV6_ADDRESS = "IPV6_ADDRESS" 37 | MAC_ADDRESS = "MAC_ADDRESS" 38 | AWS_ACCESS_KEY = "AWS_ACCESS_KEY" 39 | AWS_SECRET_KEY = "AWS_SECRET_KEY" 40 | US_PASSPORT_NUMBER = "PASSPORT_NUMBER" 41 | US_SOCIAL_SECURITY_NUMBER = "SSN" 42 | US_BANK_ACCOUNT_NUMBER = "BANK_ACCOUNT_NUMBER" 43 | ABA_ROUTING_NUMBER = "BANK_ROUTING" 44 | US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_INDIVIDUAL_TAXPAYER_IDENTIFICATION" 45 | UK_NATIONAL_HEALTH_SERVICE_NUMBER = "UK_NATIONAL_HEALTH_SERVICE_NUMBER" 46 | UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER = "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER" 47 | UK_NATIONAL_INSURANCE_NUMBER = "UK_NATIONAL_INSURANCE_NUMBER" 48 | CA_HEALTH_NUMBER = "CA_HEALTH_NUMBER" 49 | CA_SOCIAL_INSURANCE_NUMBER = "CA_SOCIAL_INSURANCE_NUMBER" 50 | IN_AADHAAR = "IN_AADHAAR" 51 | IN_VOTER_NUMBER = "IN_VOTER_NUMBER" 52 | IN_PERMANENT_ACCOUNT_NUMBER = "IN_PERMANENT_ACCOUNT_NUMBER" 53 | IN_NREGA = "IN_NREGA" 54 | ALL = "ALL" 55 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/presidio_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.config import PII_MAPPER 4 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 5 | from pii_codex.services.adapters.detection_adapters.detection_adapter_base import ( 6 | BasePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | class PresidioPIIDetectionAdapter(BasePIIDetectionAdapter): 11 | 12 | """ 13 | Intended for those that are using their own pre-detected result set from Presidio 14 | """ 15 | 16 | def convert_analyzed_item(self, 
pii_detection) -> List[DetectionResultItem]: 17 | """ 18 | Converts a single Presidio analysis attempt into a collection of DetectionResultItem objects. One string 19 | analysis by Presidio returns an array of RecognizerResult objects. 20 | 21 | @param pii_detection: RecognizerResult from presidio analyzer 22 | @return: List[DetectionResultItem] 23 | """ 24 | 25 | return [ 26 | DetectionResultItem( 27 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 28 | result.entity_type 29 | ).name, 30 | score=result.score, 31 | start=result.start, 32 | end=result.end, 33 | ) 34 | for result in pii_detection 35 | ] 36 | 37 | def convert_analyzed_collection(self, pii_detections) -> List[DetectionResult]: 38 | """ 39 | Converts a collection of Presidio analysis results to a collection of DetectionResult. A collection of Presidio 40 | analysis results ends up being a 2D array. 41 | 42 | @param pii_detections: List[List[RecognizerResult]] - list of individual analyses from Presidio 43 | 44 | """ 45 | 46 | detection_results: List[DetectionResult] = [] 47 | for i, detection in enumerate(pii_detections): 48 | # Return results in formatted Analysis Result List object 49 | detection_results.append( 50 | DetectionResult( 51 | index=i, 52 | detections=self.convert_analyzed_item(pii_detection=detection), 53 | ) 54 | ) 55 | 56 | return detection_results 57 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/test_azure_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from assertpy import assert_that 2 | 3 | from pii_codex.models.analysis import DetectionResultItem 4 | from pii_codex.models.azure_pii import AzurePIIType 5 | from pii_codex.services.adapters.detection_adapters.azure_detection_adapter import ( 6 | AzurePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | def test_azure_analysis_single_item_conversion(): 11 | # with pytest.raises(Exception) as ex_info: 12 | conversion_results = AzurePIIDetectionAdapter().convert_analyzed_item( 13 | pii_detection={ 14 | "entities": [ 15 | { 16 | "text": "My email is example@example.eu.edu", 17 | "category": AzurePIIType.EMAIL_ADDRESS.value, 18 | "subcategory": None, 19 | "length": 22, 20 | "offset": 11, 21 | "confidence_score": 0.8, 22 | } 23 | ] 24 | } 25 | ) 26 | 27 | assert_that(conversion_results).is_not_none() 28 | assert_that(isinstance(conversion_results[0], DetectionResultItem)).is_true() 29 | 30 | 31 | def test_azure_analysis_collection_conversion(): 32 | conversion_results = AzurePIIDetectionAdapter().convert_analyzed_collection( 33 | pii_detections=[ 34 | { 35 | "entities": [ 36 | { 37 | "text": "My email is example@example.eu.edu", 38 | "category": AzurePIIType.EMAIL_ADDRESS.value, 39 | "subcategory": None, 40 | "length": 22, 41 | "offset": 11, 42 | "confidence_score": 0.8, 43 | } 44 | ] 45 | }, 46 | { 47 | "entities": [ 48 | { 49 | "text": "My email is example1@example.eu.edu", 50 | "category": AzurePIIType.EMAIL_ADDRESS.value, 51 | "subcategory": None, 52 | "length": 23, 53 | "offset": 11, 54 | "confidence_score": 0.8, 55 | } 56 | ] 57 | }, 58 | ] 59 | ) 60 | 61 | assert_that(conversion_results).is_not_empty() 62 | assert_that(conversion_results[1].index).is_greater_than( 63 | conversion_results[0].index 64 | ) 65 | assert_that( 66 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 67 | ).is_true() 68 | -------------------------------------------------------------------------------- 
 /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # PII-Codex Community Guidelines 2 | The following aims to highlight ways to contribute to PII-Codex. Code reviews, documentation updates, and pull requests are always welcome. Contribute to the source code directly by forking the project, modifying the code, and creating a pull request for review. 3 | 4 | Please use clear and organized descriptions when creating issues and pull requests and leverage the templates when possible. 5 | 6 | ### Bug Report and Support Requests 7 | You can use issues to report bugs and seek support. Before creating any new issues, please check for similar ones in the issue list first. 8 | 9 | ### Submission Checklist 10 | When submitting a pull request, please check the following: 11 | 12 | Unit tests, documentation, and code style are in order. The GitHub action `test.yml` will automatically run tests and check whether the coverage threshold is still being met. 13 | 14 | Other checks such as the typechecker and linter will also run with every feature branch push. These checks can also be performed with the following commands: 15 | 16 | ```bash 17 | make typecheck 18 | make lint 19 | make test.coverage 20 | ``` 21 | 22 | Works in progress will be considered if you're unsure whether these requirements are fully met; in that case, you'll likely be asked to make some further changes. 23 | 24 | ### Updating Documentation 25 | Docs are autogenerated by [pdoc](https://pdoc3.github.io/pdoc/). To update documentation, run the command `make docs` from terminal / command line. The generated docs will appear in docs/dev in HTML format. If the auto-generated doc entry is missing context, check the docstring added to the new code. 26 | 27 | ### Creating Releases 28 | With any change (non-breaking dependency upgrades, bug fixes, etc.), the version will need to be updated to the next value and a new release created. Using the command listing below, run the one associated with the changes: 29 | 30 | ```bash 31 | make version.bump.patch # use this for bug fixes, test coverage, non-breaking dependency upgrades 32 | 33 | make version.bump.minor # use this for minor business logic changes - must be non-breaking 34 | 35 | make version.bump.major # use this for anything that is not backwards compatible 36 | ``` 37 | 38 | Upon merging the changes into main, the `release.yml` workflow will automatically compare the last release with the version noted in the CITATION.cff and pyproject.toml. If a mismatch is found, a release will be created. 39 | 40 | ### Licensing 41 | The contributed code will be licensed under PII-Codex's license, https://github.com/EdyVision/pii-codex/blob/main/LICENSE. If you did not write the code yourself, you must ensure the existing license is compatible and include the license information in the contributed files, or obtain permission from the original author to relicense the contributed code. -------------------------------------------------------------------------------- /.github/workflows/release.yml: -------------------------------------------------------------------------------- 1 | # Automatically creates a new tag and GitHub release if the pyproject.toml version doesn't match the last release version 2 | name: Release 3 | 4 | on: 5 | workflow_dispatch: 6 | inputs: 7 | tag: 8 | description: 'Tag or ref to release from (e.g. v0.5.0). Leave empty to use main.'
9 | required: false 10 | default: '' 11 | 12 | push: 13 | branches: 14 | - main 15 | 16 | jobs: 17 | build: 18 | name: Release 19 | runs-on: ubuntu-latest 20 | steps: 21 | - name: Setup Python 22 | uses: actions/setup-python@v4 23 | with: 24 | python-version: 3.12 25 | - name: Checkout branch "main" 26 | uses: actions/checkout@v4 27 | with: 28 | ref: ${{ github.event.inputs.tag || 'main' }} 29 | fetch-depth: 0 30 | - name: Install Global Dependencies 31 | run: pip install -U pip && pip install uv 32 | - name: Build project for distribution 33 | run: uv build 34 | - name: Get Current Version 35 | id: get_version 36 | run: | 37 | TAG_NAME=$(uv run python -c "import tomllib; print(tomllib.load(open('pyproject.toml', 'rb'))['project']['version'])") 38 | echo "TAG_NAME=v$TAG_NAME" >> $GITHUB_ENV 39 | echo "$TAG_NAME" 40 | - name: Check Released Versions 41 | id: get_last_release_version 42 | run: | 43 | LAST_RELEASE=$(git tag --sort=committerdate | tail -1) 44 | echo "LAST_RELEASE_VERSION=$LAST_RELEASE" >> $GITHUB_ENV 45 | echo "Last released tag: $LAST_RELEASE" 46 | - name: Check for Version Mismatch 47 | shell: bash 48 | if: ${{ env.LAST_RELEASE_VERSION != env.TAG_NAME }} 49 | run: | 50 | echo "New version found. Matching release will be created." 51 | echo "Last version: ${{ env.LAST_RELEASE_VERSION }}" 52 | echo "Current version: ${{ env.TAG_NAME }}" 53 | - name: Tag and Release GitHub Snapshot 54 | id: release-snapshot 55 | if: ${{ github.event_name == 'workflow_dispatch' || env.LAST_RELEASE_VERSION != env.TAG_NAME }} 56 | uses: ncipollo/release-action@v1 57 | with: 58 | token: ${{ secrets.GITHUB_TOKEN }} 59 | commit: main 60 | tag: ${{ env.TAG_NAME }} 61 | skipIfReleaseExists: true 62 | draft: false 63 | prerelease: false 64 | - name: Publish to PyPI 65 | if: ${{ github.event_name == 'workflow_dispatch' || env.LAST_RELEASE_VERSION != env.TAG_NAME }} 66 | env: 67 | TWINE_USERNAME: __token__ 68 | TWINE_PASSWORD: ${{ secrets.PII_CODEX_PYPI_TOKEN }} 69 | run: | 70 | python -m pip install --upgrade pip 71 | pip install twine 72 | twine upload dist/* -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/test_presidio_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from assertpy import assert_that 4 | from presidio_analyzer import RecognizerResult 5 | 6 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 7 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 8 | from pii_codex.services.adapters.detection_adapters.presidio_detection_adapter import ( 9 | PresidioPIIDetectionAdapter, 10 | ) 11 | 12 | 13 | presidio_adapter = PresidioPIIDetectionAdapter() 14 | 15 | 16 | def test_presidio_analysis_single_item_conversion(): 17 | conversion_results: List[ 18 | DetectionResultItem 19 | ] = presidio_adapter.convert_analyzed_item( 20 | pii_detection=[ 21 | RecognizerResult( 22 | entity_type=MSFTPresidioPIIType.EMAIL_ADDRESS.value, 23 | start=123, 24 | end=456, 25 | score=0.98, 26 | ), 27 | RecognizerResult( 28 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 29 | start=123, 30 | end=456, 31 | score=0.73, 32 | ), 33 | ] 34 | ) 35 | 36 | assert_that(conversion_results).is_not_empty() 37 | assert_that(isinstance(conversion_results[0], DetectionResultItem)).is_true() 38 | 39 | 40 | def test_presidio_analysis_collection_conversion(): 41 | conversion_results: List[ 42 | DetectionResult 43 | ] 
= presidio_adapter.convert_analyzed_collection( 44 | pii_detections=[ 45 | [ 46 | RecognizerResult( 47 | entity_type=MSFTPresidioPIIType.EMAIL_ADDRESS.value, 48 | start=123, 49 | end=456, 50 | score=0.98, 51 | ), 52 | RecognizerResult( 53 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 54 | start=123, 55 | end=456, 56 | score=0.973, 57 | ), 58 | ], 59 | [ 60 | RecognizerResult( 61 | entity_type=MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value, 62 | start=123, 63 | end=456, 64 | score=0.98, 65 | ), 66 | RecognizerResult( 67 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 68 | start=123, 69 | end=456, 70 | score=0.973, 71 | ), 72 | ], 73 | ] 74 | ) 75 | 76 | assert_that(conversion_results).is_not_empty() 77 | assert_that(conversion_results[1].index).is_greater_than( 78 | conversion_results[0].index 79 | ) 80 | assert_that( 81 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 82 | ).is_true() 83 | -------------------------------------------------------------------------------- /pii_codex/services/assessment_service.py: -------------------------------------------------------------------------------- 1 | from typing import List, Tuple 2 | from collections import Counter 3 | from itertools import chain 4 | 5 | from ..config import PII_MAPPER 6 | from ..models.analysis import RiskAssessment, AnalysisResult 7 | from ..utils.statistics_util import get_mean, get_sum 8 | 9 | 10 | class PIIAssessmentService: 11 | """ 12 | Class for mapping PII types to categories and extracting them. 13 | """ 14 | 15 | def assess_pii_type(self, detected_pii_type: str) -> RiskAssessment: 16 | """ 17 | Assesses a singular detected PII type given a type name string from common.PIIType enum 18 | @param detected_pii_type: type name strings from common.PIIType enum 19 | @return: RiskAssessment 20 | """ 21 | return PII_MAPPER.map_pii_type(detected_pii_type) 22 | 23 | def assess_pii_type_list( 24 | self, detected_pii_types: List[str] 25 | ) -> List[RiskAssessment]: 26 | """ 27 | Assesses a list of detected PII types given an array of type name strings from common.PIIType enum 28 | @param detected_pii_types: array type name strings from common.PIIType 29 | enum (e.g. 
["PHONE_NUMBER", "US_SOCIAL_SECURITY_NUMBER"]) 30 | @return: List[RiskAssessment] 31 | """ 32 | ranked_pii: List[RiskAssessment] = [] 33 | 34 | for pii_type in detected_pii_types: 35 | ranked_pii.append(PII_MAPPER.map_pii_type(pii_type)) 36 | 37 | return ranked_pii 38 | 39 | @staticmethod 40 | def calculate_risk_assessment_score_average( 41 | risk_assessments: List[RiskAssessment], 42 | ) -> float: 43 | """ 44 | Returns the average risk score per token 45 | 46 | @param risk_assessments: 47 | @return: float 48 | """ 49 | return get_mean([assessment.risk_level for assessment in risk_assessments]) 50 | 51 | @staticmethod 52 | def get_detected_pii_count(analyses: List[AnalysisResult]) -> int: 53 | """ 54 | Returns the count of detected PII for analyses performed on a collection 55 | 56 | @param analyses: List[ScoredAnalysisResult] 57 | @return: float 58 | """ 59 | return get_sum( 60 | [ 61 | len(analysis.analysis) 62 | for analysis in analyses 63 | if analysis.get_detected_types() 64 | ] 65 | ) 66 | 67 | @staticmethod 68 | def get_detected_pii_types( 69 | analyses: List[AnalysisResult], 70 | ) -> Tuple[set[str], Counter]: 71 | """ 72 | Returns the list of detected PII types and their frequencies for analyses performed on a collection 73 | 74 | @param analyses: List[ScoredAnalysisResult] 75 | @return: Tuple[List[str], Counter] 76 | """ 77 | flattened_list_of_detections = list( 78 | chain.from_iterable( 79 | [analysis.get_detected_types() for analysis in analyses] 80 | ) 81 | ) 82 | 83 | return set(flattened_list_of_detections), Counter(flattened_list_of_detections) 84 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "pii-codex" 3 | version = "0.5.0" 4 | description = "PII Detection, Categorization, and Severity Assessment" 5 | authors = [ 6 | {name = "Eidan J. 
Rosado"} 7 | ] 8 | license = {text = "BSD 3-Clause"} 9 | readme = "README.md" 10 | homepage = "https://github.com/EdyVision/pii-codex" 11 | repository = "https://github.com/EdyVision/pii-codex" 12 | keywords = ["PII", "PII topology", "risk categories", "personal identifiable information", "risk assessment"] 13 | classifiers = [ 14 | "Development Status :: 1 - Planning", 15 | "Intended Audience :: Science/Research", 16 | "License :: OSI Approved :: BSD License", 17 | "Programming Language :: Python :: 3", 18 | ] 19 | requires-python = ">=3.9,<3.13" 20 | dependencies = [ 21 | "dataclasses-json>=0.5.7,<0.6.0", 22 | "pydantic[dotenv]>=1.10.2,<2.0.0", 23 | "pandas>=2.0.0,<3.0.0", 24 | "pillow>=10.0.0,<11.0.0", 25 | "pdoc>=13.1.1,<14.0.0", 26 | ] 27 | 28 | [project.optional-dependencies] 29 | detections = [ 30 | "spacy>=3.8.7", 31 | "presidio-analyzer>=2.2.359", 32 | "presidio-anonymizer>=2.2.359" 33 | ] 34 | dev = [ 35 | "pytest>=7.4.0,<8.0.0", 36 | "black>=23.0.0,<24.0.0", 37 | "pylint>=3.0.0,<4.0.0", 38 | "mypy>=1.5.0,<2.0.0", 39 | "coverage>=7.2.0,<8.0.0", 40 | "assertpy>=1.1,<2.0.0", 41 | "Faker>=19.0.0,<20.0.0", 42 | "matplotlib>=3.7.0,<4.0.0", 43 | "ipykernel>=6.25.0,<7.0.0", 44 | "jupyter>=1.0.0,<2.0.0", 45 | "jupyter_core>=5.0.0,<6.0.0", 46 | "importlib-resources>=6.0.0,<7.0.0", 47 | "seaborn>=0.12.0,<1.0.0", 48 | "pre-commit>=3.0.0,<4.0.0", 49 | "spacy>=3.8.7", 50 | "presidio-analyzer>=2.2.360", 51 | "presidio-anonymizer>=2.2.360", 52 | "tomli>=1.2.0,<3.0.0" 53 | ] 54 | 55 | [project.urls] 56 | Homepage = "https://github.com/EdyVision/pii-codex" 57 | Repository = "https://github.com/EdyVision/pii-codex" 58 | 59 | [bumpver] 60 | current_version = "0.5.0" 61 | version_pattern = "MAJOR.MINOR.PATCH" 62 | files = [ 63 | "pyproject.toml", 64 | "CITATION.cff", 65 | ".zenodo.json" 66 | ] 67 | 68 | [bumpver.file_patterns] 69 | "pyproject.toml" = [ 70 | '^version = "{version}"$', 71 | '^current_version = "{version}"$', 72 | ] 73 | "pii_codex/__init__.py" = [ 74 | '^__version__ = "{version}"$', 75 | ] 76 | "CITATION.cff" = [ 77 | '^version: {version}$', 78 | ] 79 | ".zenodo.json" = [ 80 | '^ "version": "{version}",$', 81 | ] 82 | 83 | [tool.pytest.ini_options] 84 | minversion = "7.0" 85 | testpaths = [ 86 | "tests" 87 | ] 88 | 89 | log_cli = true 90 | log_cli_level = "INFO" 91 | log_cli_format = "%(message)s" 92 | 93 | log_file = "pytest.log" 94 | log_file_level = "DEBUG" 95 | log_file_format = "%(asctime)s [%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)" 96 | log_file_date_format = "%Y-%m-%d %H:%M:%S" 97 | 98 | [build-system] 99 | requires = ["hatchling"] 100 | build-backend = "hatchling.build" 101 | 102 | [dependency-groups] 103 | dev = [ 104 | "bumpver>=2025.1131", 105 | ] 106 | 107 | [tool.hatch.metadata] 108 | allow-direct-references = true 109 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/test_aws_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import pytest 4 | from assertpy import assert_that 5 | 6 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 7 | from pii_codex.models.aws_pii import AWSComprehendPIIType 8 | from pii_codex.services.adapters.detection_adapters.aws_detection_adapter import ( 9 | AWSComprehendPIIDetectionAdapter, 10 | ) 11 | 12 | 13 | @pytest.mark.parametrize( 14 | "expected_result,expected_pii", 15 | [ 16 | (False, []), 17 | (False, []), 18 | (True, 
[AWSComprehendPIIType.EMAIL_ADDRESS]), 19 | ( 20 | True, 21 | [AWSComprehendPIIType.EMAIL_ADDRESS], 22 | ), 23 | ( 24 | True, 25 | [AWSComprehendPIIType.PHONE_NUMBER], 26 | ), 27 | ( 28 | True, 29 | [AWSComprehendPIIType.EMAIL_ADDRESS, AWSComprehendPIIType.PHONE_NUMBER], 30 | ), 31 | ( 32 | True, 33 | [AWSComprehendPIIType.EMAIL_ADDRESS, AWSComprehendPIIType.PHONE_NUMBER], 34 | ), 35 | ], 36 | ) 37 | def test_aws_comprehend_analysis_single_item_conversion(expected_result, expected_pii): 38 | conversion_results: List[ 39 | DetectionResultItem 40 | ] = AWSComprehendPIIDetectionAdapter().convert_analyzed_item( 41 | pii_detection={ 42 | "Entities": [ 43 | { 44 | "Score": 0.99, 45 | "Type": pii_type.value, 46 | "BeginOffset": 123, 47 | "EndOffset": 456, 48 | } 49 | for pii_type in expected_pii 50 | ] 51 | }, 52 | ) 53 | 54 | if expected_result: 55 | assert_that(conversion_results).is_not_empty() 56 | assert_that(isinstance(conversion_results[0], DetectionResultItem)).is_true() 57 | else: 58 | assert_that(conversion_results).is_empty() 59 | 60 | 61 | def test_aws_comprehend_analysis_collection_conversion(): 62 | conversion_results: List[ 63 | DetectionResult 64 | ] = AWSComprehendPIIDetectionAdapter().convert_analyzed_collection( 65 | pii_detections=[ 66 | { 67 | "Entities": [ 68 | { 69 | "Score": 0.99, 70 | "Type": AWSComprehendPIIType.EMAIL_ADDRESS.value, 71 | "BeginOffset": 123, 72 | "EndOffset": 456, 73 | } 74 | ] 75 | }, 76 | { 77 | "Entities": [ 78 | { 79 | "Score": 0.73, 80 | "Type": AWSComprehendPIIType.PHONE_NUMBER.value, 81 | "BeginOffset": 456, 82 | "EndOffset": 789, 83 | } 84 | ] 85 | }, 86 | ] 87 | ) 88 | 89 | assert_that(conversion_results).is_not_empty() 90 | assert_that(conversion_results[1].index).is_greater_than( 91 | conversion_results[0].index 92 | ) 93 | assert_that( 94 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 95 | ).is_true() 96 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/aws_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.config import PII_MAPPER 4 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 5 | from pii_codex.services.adapters.detection_adapters.detection_adapter_base import ( 6 | BasePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | class AWSComprehendPIIDetectionAdapter(BasePIIDetectionAdapter): 11 | def convert_analyzed_item(self, pii_detection: dict) -> List[DetectionResultItem]: 12 | """ 13 | Converts an AWS Comprehend detect_pii() result into a collection of DetectionResultItem 14 | 15 | @param pii_detection: dict from AWS Comprehend detect_pii { 16 | 'Entities': [ 17 | { 18 | 'Score': ..., 19 | 'Type': 'BANK_ACCOUNT_NUMBER'|'BANK_ROUTING'|'CREDIT_DEBIT_NUMBER'|'CREDIT_DEBIT_CVV'| 20 | 'CREDIT_DEBIT_EXPIRY'|'PIN'|'EMAIL'|'ADDRESS'|'NAME'|'PHONE'|'SSN'|'DATE_TIME'|'PASSPORT_NUMBER'| 21 | 'DRIVER_ID'|'URL'|'AGE'|'USERNAME'|'PASSWORD'|'AWS_ACCESS_KEY'|'AWS_SECRET_KEY'|'IP_ADDRESS'| 22 | 'MAC_ADDRESS'|'ALL'|'LICENSE_PLATE'|'VEHICLE_IDENTIFICATION_NUMBER'|'UK_NATIONAL_INSURANCE_NUMBER'| 23 | 'CA_SOCIAL_INSURANCE_NUMBER'|'US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER'| 24 | 'UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER'|'IN_PERMANENT_ACCOUNT_NUMBER'|'IN_NREGA'| 25 | 'INTERNATIONAL_BANK_ACCOUNT_NUMBER'|'SWIFT_CODE'|'UK_NATIONAL_HEALTH_SERVICE_NUMBER'| 26 | 'CA_HEALTH_NUMBER'|'IN_AADHAAR'|'IN_VOTER_NUMBER', 27 | 'BeginOffset': 123, 28 | 
'EndOffset': 123 29 | }, 30 | ] 31 | } 32 | @return: List[DetectionResultItem] 33 | """ 34 | 35 | return [ 36 | DetectionResultItem( 37 | entity_type=PII_MAPPER.convert_aws_comprehend_pii_to_common_pii_type( 38 | result["Type"] 39 | ).name, 40 | score=result["Score"], 41 | start=result["BeginOffset"], 42 | end=result["EndOffset"], 43 | ) 44 | for result in pii_detection["Entities"] 45 | ] 46 | 47 | def convert_analyzed_collection( 48 | self, pii_detections: List[dict] 49 | ) -> List[DetectionResult]: 50 | """ 51 | Converts a collection of AWS Comprehend detect_pii() results to a collection of DetectionResult. 52 | 53 | @param pii_detections: List[dict] of response from AWS Comprehend detect_pii - [{ 54 | 'Entities': [ 55 | { 56 | 'Score': ..., 57 | 'Type': 'BANK_ACCOUNT_NUMBER'|'BANK_ROUTING'|'CREDIT_DEBIT_NUMBER'|'CREDIT_DEBIT_CVV'| 58 | 'CREDIT_DEBIT_EXPIRY'|'PIN'|'EMAIL'|'ADDRESS'|'NAME'|'PHONE'|'SSN'|'DATE_TIME'|'PASSPORT_NUMBER'| 59 | 'DRIVER_ID'|'URL'|'AGE'|'USERNAME'|'PASSWORD'|'AWS_ACCESS_KEY'|'AWS_SECRET_KEY'|'IP_ADDRESS'| 60 | 'MAC_ADDRESS'|'ALL'|'LICENSE_PLATE'|'VEHICLE_IDENTIFICATION_NUMBER'|'UK_NATIONAL_INSURANCE_NUMBER'| 61 | 'CA_SOCIAL_INSURANCE_NUMBER'|'US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER'| 62 | 'UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER'|'IN_PERMANENT_ACCOUNT_NUMBER'|'IN_NREGA'| 63 | 'INTERNATIONAL_BANK_ACCOUNT_NUMBER'|'SWIFT_CODE'|'UK_NATIONAL_HEALTH_SERVICE_NUMBER'| 64 | 'CA_HEALTH_NUMBER'|'IN_AADHAAR'|'IN_VOTER_NUMBER', 65 | 'BeginOffset': 123, 66 | 'EndOffset': 123 67 | }, 68 | ] 69 | }] 70 | 71 | """ 72 | 73 | detection_results: List[DetectionResult] = [] 74 | for i, detection in enumerate(pii_detections): 75 | # Return results in formatted Analysis Result List object 76 | detection_results.append( 77 | DetectionResult( 78 | index=i, 79 | detections=self.convert_analyzed_item(pii_detection=detection), 80 | ) 81 | ) 82 | 83 | return detection_results 84 | -------------------------------------------------------------------------------- /pii_codex/models/analysis.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=too-many-instance-attributes 2 | from __future__ import annotations 3 | 4 | from dataclasses import dataclass, field 5 | from typing import List, Counter, Optional 6 | 7 | from pii_codex.models.common import RiskLevel, RiskLevelDefinition 8 | 9 | 10 | # PII detection, risk assessment, and analysis models 11 | 12 | 13 | @dataclass 14 | class RiskAssessment: 15 | """ 16 | Singular risk assessment for a string token 17 | """ 18 | 19 | pii_type_detected: Optional[str] = None 20 | risk_level: int = RiskLevel.LEVEL_ONE.value 21 | risk_level_definition: str = ( 22 | RiskLevelDefinition.LEVEL_ONE.value 23 | ) # Default if it's not semi or fully identifiable 24 | cluster_membership_type: Optional[str] = None 25 | hipaa_category: Optional[str] = None 26 | dhs_category: Optional[str] = None 27 | nist_category: Optional[str] = None 28 | 29 | 30 | @dataclass 31 | class RiskAssessmentList: 32 | """ 33 | Risk Assessments and the average risk score of all list items 34 | """ 35 | 36 | risk_assessments: List[RiskAssessment] 37 | average_risk_score: float 38 | 39 | 40 | @dataclass 41 | class DetectionResultItem: 42 | """ 43 | Type associated with a singular PII detection (e.g. detection of an email in a string), its associated risk score, 44 | and where it is located in a string. 
45 | """ 46 | 47 | entity_type: str 48 | score: float = 0.0 # metadata detections don't have confidence score values 49 | start: int = 0 # metadata detections don't have offset values 50 | end: int = 0 # metadata detections don't have offset values 51 | 52 | 53 | @dataclass 54 | class DetectionResult: 55 | """ 56 | List of detection results and the index of the string they pertain to 57 | """ 58 | 59 | detections: List[DetectionResultItem] 60 | index: int = 0 61 | 62 | 63 | @dataclass 64 | class AnalysisResultItem: 65 | """ 66 | The results associated to a single detection of a single string (e.g. Social Media Post, SMS, etc.) 67 | """ 68 | 69 | detection: Optional[DetectionResultItem] 70 | risk_assessment: RiskAssessment 71 | 72 | def to_dict(self): 73 | return { 74 | "riskAssessment": self.risk_assessment.__dict__, 75 | "detection": self.detection.__dict__, 76 | } 77 | 78 | def to_flattened_dict(self): 79 | assessment = self.risk_assessment.__dict__.copy() 80 | 81 | if self.detection: 82 | assessment.update(self.detection.__dict__) 83 | 84 | return assessment 85 | 86 | 87 | @dataclass 88 | class AnalysisResult: 89 | """ 90 | The analysis results associated with several detections within a single string (e.g. Social Media Post, SMS, etc.) 91 | """ 92 | 93 | analysis: List[AnalysisResultItem] 94 | index: int = 0 95 | risk_score_mean: float = 0.0 96 | sanitized_text: str = "" 97 | 98 | def to_dict(self): 99 | return { 100 | "analysis": [item.to_flattened_dict() for item in self.analysis], 101 | "index": self.index, 102 | "risk_score_mean": self.risk_score_mean, 103 | "sanitized_text": self.sanitized_text, 104 | } 105 | 106 | def get_detected_types(self) -> List[str]: 107 | return [pii.detection.entity_type for pii in self.analysis if pii.detection] 108 | 109 | 110 | @dataclass 111 | class AnalysisResultSet: 112 | """ 113 | The analysis results associated with a collection of strings or documents (e.g. Social Media Posts, forum thread, 114 | etc.). 
Includes most/least detected PII types within the collection, average risk score of analyses, 115 | """ 116 | 117 | analyses: List[AnalysisResult] 118 | detection_count: int = 0 119 | detected_pii_types: set[str] = field(default_factory=set) 120 | detected_pii_type_frequencies: Counter = None # type: ignore 121 | risk_scores: List[float] = field(default_factory=list) 122 | risk_score_mean: float = 1.0 # Default is 1 for non-identifiable 123 | risk_score_mode: float = 0.0 124 | risk_score_median: float = 0.0 125 | risk_score_standard_deviation: float = 0.0 126 | risk_score_variance: float = 0.0 127 | collection_name: Optional[ 128 | str 129 | ] = None # Optional ability for analysts to name a set (see analysis storage step in notebooks) 130 | collection_type: str = "POPULATION" # Other option is SAMPLE 131 | 132 | def to_dict(self): 133 | return { 134 | "collection_name": self.collection_name, 135 | "collection_type": self.collection_type, 136 | "analyses": [item.to_dict() for item in self.analyses], 137 | "detection_count": self.detection_count, 138 | "risk_scores": self.risk_scores, 139 | "risk_score_mean": self.risk_score_mean, 140 | "risk_score_mode": self.risk_score_mode, 141 | "risk_score_median": self.risk_score_median, 142 | "risk_score_standard_deviation": self.risk_score_standard_deviation, 143 | "risk_score_variance": self.risk_score_variance, 144 | "detected_pii_types": self.detected_pii_types, 145 | "detected_pii_type_frequencies": dict(self.detected_pii_type_frequencies), 146 | } 147 | -------------------------------------------------------------------------------- /pii_codex/models/common.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from enum import Enum 4 | 5 | # All listed PII Types from Milne et al (2018) and a few others along with 6 | # models used for PII categorization for DHS, NIST, and HIPAA 7 | 8 | 9 | class AnalysisProviderType(Enum): 10 | """ 11 | Analysis Provider Types - software and cloud service APIs providing PII detection results 12 | """ 13 | 14 | AZURE = "AZURE" 15 | AWS = "AWS" 16 | PRESIDIO = "PRESIDIO" 17 | 18 | 19 | class RiskLevel(Enum): 20 | """ 21 | Numerical values assigned to the levels on the continuum presented by Schwartz and Solove (2011) 22 | """ 23 | 24 | LEVEL_ONE = 1 # Not-Identifiable 25 | LEVEL_TWO = 2 # Semi-Identifiable 26 | LEVEL_THREE = 3 # Identifiable 27 | 28 | 29 | class RiskLevelDefinition(Enum): 30 | """ 31 | Levels on the continuum presented by Schwartz and Solove (2011) 32 | """ 33 | 34 | LEVEL_ONE = "Non-Identifiable" # Default if no entities were detected, risk level is set to this 35 | LEVEL_TWO = "Semi-Identifiable" 36 | LEVEL_THREE = "Identifiable" # Level associated with Directly PII, PHI, and Standalone PII info types 37 | 38 | 39 | class MetadataType(Enum): 40 | """ 41 | Common metadata types associated with social media posts and other online platforms 42 | """ 43 | 44 | SCREEN_NAME = "screen_name" 45 | NAME = "name" 46 | LOCATION = "location" 47 | URL = "url" 48 | USER_ID = "user_id" 49 | 50 | 51 | class PIIType(Enum): 52 | """ 53 | Commonly observed PII types across services and software 54 | """ 55 | 56 | PHONE_NUMBER = "PHONE" 57 | WORK_PHONE_NUMBER = "PHONE" 58 | EMAIL_ADDRESS = "EMAIL" 59 | ABA_ROUTING_NUMBER = "ABA_ROUTING_NUMBER" 60 | IP_ADDRESS = "IP_ADDRESS" 61 | DATE = "DATE" 62 | ADDRESS = "ADDRESS" 63 | HOME_ADDRESS = "ADDRESS" 64 | WORK_ADDRESS = "ADDRESS" 65 | AGE = "AGE" 66 | PERSON = "PERSON" 67 | CREDIT_CARD_NUMBER 
= "CREDIT_CARD_NUMBER" 68 | CREDIT_SCORE = "CREDIT_SCORE" 69 | CRYPTO = "CRYPTO" 70 | URL = "URL" 71 | DATE_TIME = "DATE_TIME" 72 | LOCATION = "LOCATION" 73 | ZIPCODE = "ZIPCODE" 74 | RACE = "RACE" 75 | HEIGHT = "HEIGHT" 76 | WEIGHT = "WEIGHT" 77 | GENDER = "GENDER" 78 | HOMETOWN = "HOMETOWN" 79 | SCREEN_NAME = "SCREEN_NAME" 80 | MARITAL_STATUS = "MARITAL_STATUS" 81 | NUMBER_OF_CHILDREN = "NUMBER_OF_CHILDREN" 82 | COUNTRY_OF_CITIZENSHIP = "COUNTRY_OF_CITIZENSHIP" 83 | VOICE_PRINT = "VOICE_PRINT" 84 | FINGERPRINT = "FINGERPRINT" 85 | DNA_PROFILE = "DNA_PROFILE" 86 | PICTURE_FACE = "PICTURE_FACE" 87 | HANDWRITING_SAMPLE = "HANDWRITING_SAMPLE" 88 | MOTHERS_MAIDEN_NAME = "MOTHERS_MAIDEN_NAME" 89 | DIGITAL_SIGNATURE = "DIGITAL_SIGNATURE" 90 | HEALTH_INSURANCE_ID = "HEALTH_INSURANCE_ID" 91 | SHOPPING_BEHAVIOR = "SHOPPING_BEHAVIOR" 92 | SEXUAL_PREFERENCE = "SEXUAL_PREFERENCE" 93 | SOCIAL_NETWORK_PROFILE = "SOCIAL_NETWORK_PROFILE" 94 | JOB_TITLE = "JOB_TITLE" 95 | INCOME_LEVEL = "INCOME_LEVEL" 96 | OCCUPATION = "OCCUPATION" 97 | DOCUMENTS = "DOCUMENTS" 98 | MEDICAL_LICENSE = "MEDICAL_LICENSE" 99 | LICENSE_PLATE_NUMBER = "LICENSE_PLATE_NUMBER" 100 | SECURITY_ACCESS_CODES = "SECURITY_ACCESS_CODES" 101 | PASSWORD = "PASSWORD" 102 | US_SOCIAL_SECURITY_NUMBER = "US_SOCIAL_SECURITY_NUMBER" 103 | US_BANK_ACCOUNT_NUMBER = "US_BANK_ACCOUNT_NUMBER" 104 | US_DRIVERS_LICENSE_NUMBER = "US_DRIVERS_LICENSE_NUMBER" 105 | US_PASSPORT_NUMBER = "US_PASSPORT_NUMBER" 106 | US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_INDIVIDUAL_TAXPAYER_IDENTIFICATION" 107 | INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "INTERNATIONAL_BANKING_ACCOUNT_NUMBER" 108 | SWIFT_CODE = "SWIFTCode" 109 | NRP = "NRP" # A person's nationality, religion, or political group 110 | # Australian PII types 111 | AU_BUSINESS_NUMBER = "AU_BUSINESS_NUMBER" 112 | AU_COMPANY_NUMBER = "AU_COMPANY_NUMBER" 113 | AU_MEDICAL_ACCOUNT_NUMBER = "AU_MEDICAL_ACCOUNT_NUMBER" 114 | AU_TAX_FILE_NUMBER = "AU_TAX_FILE_NUMBER" 115 | 116 | 117 | class NISTCategory(Enum): 118 | """ 119 | Information Categories presented by NIST as noted in Milne et al., 2016 120 | """ 121 | 122 | LINKABLE = "Linkable" 123 | DIRECTLY_PII = "Directly PII" 124 | 125 | 126 | class DHSCategory(Enum): 127 | """ 128 | Information Categories presented by DHS as noted in Milne et al., 2016 129 | """ 130 | 131 | NOT_MENTIONED = "Not Mentioned" 132 | LINKABLE = "Linkable" 133 | STAND_ALONE_PII = "Stand Alone PII" 134 | 135 | 136 | class HIPAACategory(Enum): 137 | """ 138 | Information Categories presented by HIPAA guidelines 139 | """ 140 | 141 | NON_PHI = "Not Protected Health Information" 142 | PHI = "Protected Health Information" 143 | 144 | 145 | class ClusterMembershipType(Enum): 146 | """ 147 | Information Cluster Memberships presented by Milne et al., 2016 148 | """ 149 | 150 | BASIC_DEMOGRAPHICS = "Basic Demographics" 151 | PERSONAL_PREFERENCES = "Personal Preferences" 152 | CONTACT_INFORMATION = "Contact Information" 153 | COMMUNITY_INTERACTION = "Community Interaction" 154 | FINANCIAL_INFORMATION = "Financial Information" 155 | SECURE_IDENTIFIERS = "Secure Identifiers" 156 | -------------------------------------------------------------------------------- /docs/MAPPING.md: -------------------------------------------------------------------------------- 1 | # Mapping Across PII Detections with PII-Codex 2 | The PII Codex has a number of enums to help with the definitions and labeling of PII, their categories, and their severity rankings across modules. 
At this time, only AWS Comprehend, Microsoft Presidio, and Microsoft Azure PII entity types are mapped to the common PII types listing. 3 | 4 | Selecting a PII type from the common PII type listing: 5 | 6 | ## Mapping Between PII Types 7 | ```python 8 | from pii_codex.models.common import PIIType 9 | PIIType.EMAIL_ADDRESS # Selecting a single common PIIType 10 | PIIType.EMAIL_ADDRESS.name # The name of the enum entry 11 | PIIType.EMAIL_ADDRESS.value # The String value of the enum entry 12 | ``` 13 | 14 | Iterating through all common types supported: 15 | 16 | ```python 17 | from pii_codex.models.common import PIIType 18 | pii_types = [pii_type.name for pii_type in PIIType] 19 | ``` 20 | 21 | Each module or cloud resource will have its own string labeling for the PII Type. It is sometimes required to map to that string value in order to properly parse out a PII detection or to initialize an analyzer. To map to a different PII type (if supported with the version, using MSFT Presidio for example): 22 | 23 | ```python 24 | from pii_codex.models.common import PIIType 25 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 26 | from pii_codex.models.azure_pii import AzureDetectionType 27 | from pii_codex.models.aws_pii import AWSComprehendPIIType 28 | 29 | presidio_pii_type = MSFTPresidioPIIType[PIIType.EMAIL_ADDRESS.name] # MSFT Presidio enum entry 30 | 31 | print("Presidio Enum Type Name for Email: ", presidio_pii_type.name) 32 | print("Presidio Enum Type Value for Email: ", presidio_pii_type.value) 33 | ``` 34 | 35 | The built-in mapper can also help: pass in the conversion you'd like to perform and it will return the corresponding enum entry. If the conversion is not supported, an error is raised instead. 36 | 37 | ```python 38 | from pii_codex.models.common import PIIType 39 | from pii_codex.utils.pii_mapping_util import PIIMapper 40 | 41 | pii_mapper = PIIMapper() 42 | 43 | azure_pii = pii_mapper.convert_common_pii_to_azure_pii_type(PIIType.EMAIL_ADDRESS) 44 | 45 | aws_pii = pii_mapper.convert_common_pii_to_aws_comprehend_type(PIIType.EMAIL_ADDRESS) 46 | presidio_pii = pii_mapper.convert_common_pii_to_msft_presidio_type(PIIType.EMAIL_ADDRESS) 47 | ``` 48 | 49 | If you are using the PII-Codex module for just detection conversions and analysis, there is an inverse set of mappers that will take Presidio, Azure, or AWS Comprehend PII types and convert them to the PII-Codex common types: 50 | 51 | ```python 52 | from pii_codex.models.common import PIIType 53 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 54 | from pii_codex.models.azure_pii import AzureDetectionType 55 | from pii_codex.models.aws_pii import AWSComprehendPIIType 56 | from pii_codex.utils.pii_mapping_util import PIIMapper 57 | 58 | pii_mapper = PIIMapper() 59 | 60 | azure_to_common_pii = pii_mapper.convert_azure_pii_to_common_pii_type( 61 | AzureDetectionType.EMAIL_ADDRESS.value 62 | ) 63 | aws_to_common_pii = pii_mapper.convert_aws_comprehend_pii_to_common_pii_type( 64 | AWSComprehendPIIType.EMAIL_ADDRESS.value 65 | ) 66 | presidio_to_common_pii = pii_mapper.convert_msft_presidio_pii_to_common_pii_type( 67 | MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value # e.g. "US_SSN" 68 | ) 69 | ``` 70 | 71 | ### Example: provider‑specific labels vs common PII types 72 | 73 | Some providers use compact or region‑encoded labels which do **not** match the human‑readable common PII type names.
For example: 74 | 75 | ```python 76 | from pii_codex.models.common import PIIType 77 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 78 | from pii_codex.utils.pii_mapping_util import PIIMapper 79 | 80 | pii_mapper = PIIMapper() 81 | 82 | # Presidio emits "US_SSN", which we represent as: 83 | assert MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value == "US_SSN" 84 | 85 | # PII Codex maps this back to the canonical common type: 86 | common_type = pii_mapper.convert_msft_presidio_pii_to_common_pii_type("US_SSN") 87 | assert common_type is PIIType.US_SOCIAL_SECURITY_NUMBER 88 | 89 | # Likewise for AU tax identifiers: 90 | assert MSFTPresidioPIIType.AU_TAX_FILE_NUMBER.value == "AU_TFN" 91 | common_au_tfn = pii_mapper.convert_msft_presidio_pii_to_common_pii_type("AU_TFN") 92 | assert common_au_tfn is PIIType.AU_TAX_FILE_NUMBER 93 | 94 | # The key idea: provider enums mirror the provider's own labels, 95 | # and PIIType is the canonical, provider‑independent surface. 96 | ``` 97 | 98 | ## Importing Updated Files 99 | 100 | ```python 101 | 102 | from pii_codex.utils import pii_mapping_util 103 | from pii_codex.models.common import PIIType 104 | # Data frame loaded from csv mapping file (assumes /data location in pii-codex) 105 | csv_file_dataframe = pii_mapping_util.open_pii_type_mapping_csv("v1") 106 | 107 | # Data frame loaded from json mapping file (assumes /data location in pii-codex) 108 | json_file_dataframe = pii_mapping_util.open_pii_type_mapping_json("v1") 109 | 110 | # Retrieving the entries for "IP Address" Information Type, for example 111 | ip_address = json_file_dataframe[json_file_dataframe.Information_Type=='IP Address'].item() 112 | 113 | pii_type = PIIType[ip_address] 114 | print("Enum Type Name for IP Address: ", pii_type.name) 115 | print("Enum Type Name for IP Address: ", pii_type.value) 116 | ``` 117 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/config.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.config API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.config

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
from pii_codex.utils.pii_mapping_util import PIIMapper
30 | 
31 | PII_MAPPER = PIIMapper()
32 | DEFAULT_LANG = "en"
33 | DEFAULT_ANALYSIS_MODE = "POPULATION"
34 | DEFAULT_TOKEN_REPLACEMENT_VALUE = "<REDACTED>"
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 | 59 |
60 | 63 | 64 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/analyzers/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services.analyzers API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 42 | 60 |
61 | 64 | 65 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/adapters/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services.adapters API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 42 | 60 |
61 | 64 | 65 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Package pii_codex

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
__version__ = "0.1.0"
30 |
31 |
32 |
33 |

Sub-modules

34 |
35 |
pii_codex.config
36 |
37 |
38 |
39 |
pii_codex.models
40 |
41 |
42 |
43 |
pii_codex.services
44 |
45 |
46 |
47 |
pii_codex.utils
48 |
49 |
50 |
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
60 | 76 |
77 | 80 | 81 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 54 | 75 |
76 | 79 | 80 | -------------------------------------------------------------------------------- /pii_codex/utils/pii_mapping_util.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=broad-except, unused-variable, no-else-return 2 | from typing import Optional 3 | 4 | from pii_codex.models.aws_pii import AWSComprehendPIIType 5 | from pii_codex.models.azure_pii import AzureDetectionType 6 | from pii_codex.models.common import ( 7 | RiskLevel, 8 | PIIType, 9 | MetadataType, 10 | RiskLevelDefinition, 11 | ) 12 | from pii_codex.models.analysis import RiskAssessment 13 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 14 | 15 | from pii_codex.services.pii_type_mappings import get_pii_mapping 16 | 17 | 18 | class PIIMapper: 19 | """ 20 | Class to map PII types listed as Common Types, Azure Types, AWS Comprehend Types, and Presidio Types 21 | """ 22 | 23 | def __init__(self): 24 | # No need to load CSV anymore - using Python structure 25 | pass 26 | 27 | def map_pii_type(self, pii_type: str) -> RiskAssessment: 28 | """ 29 | Maps the PII Type to a full RiskAssessment including categories it belongs to, risk level, and 30 | its location in the text. This cross-references some of the types listed by Milne et al. (2016) 31 | 32 | @param pii_type: 33 | @return: 34 | """ 35 | 36 | try: 37 | mapping = get_pii_mapping(pii_type) 38 | 39 | # Get the risk level definition string based on the risk level 40 | risk_level_to_definition = { 41 | RiskLevel.LEVEL_ONE: RiskLevelDefinition.LEVEL_ONE.value, 42 | RiskLevel.LEVEL_TWO: RiskLevelDefinition.LEVEL_TWO.value, 43 | RiskLevel.LEVEL_THREE: RiskLevelDefinition.LEVEL_THREE.value, 44 | } 45 | risk_level_definition = risk_level_to_definition[mapping.risk_level] 46 | 47 | return RiskAssessment( 48 | pii_type_detected=pii_type, 49 | risk_level=mapping.risk_level.value, 50 | risk_level_definition=risk_level_definition, 51 | cluster_membership_type=mapping.cluster_membership_type.value, 52 | hipaa_category=mapping.hipaa_category.value, 53 | dhs_category=mapping.dhs_category.value, 54 | nist_category=mapping.nist_category.value, 55 | ) 56 | except KeyError: 57 | raise Exception( 58 | f"An error occurred while processing the detected entity {pii_type}" 59 | ) 60 | 61 | @classmethod 62 | def convert_common_pii_to_msft_presidio_type( 63 | cls, pii_type: PIIType 64 | ) -> MSFTPresidioPIIType: 65 | """ 66 | Converts a common PII Type to a MSFT Presidio Type 67 | @param pii_type: 68 | @return: 69 | """ 70 | 71 | try: 72 | converted_type = MSFTPresidioPIIType[pii_type.name] 73 | except Exception as ex: 74 | raise Exception( 75 | "The current version does not support this PII Type conversion." 76 | ) 77 | 78 | return converted_type 79 | 80 | @classmethod 81 | def convert_common_pii_to_azure_pii_type( 82 | cls, pii_type: PIIType 83 | ) -> AzureDetectionType: 84 | """ 85 | Converts a common PII Type to an Azure PII Type 86 | @param pii_type: 87 | @return: 88 | """ 89 | try: 90 | return AzureDetectionType[pii_type.name] 91 | except Exception as ex: 92 | raise Exception( 93 | "The current version does not support this PII Type conversion." 
94 | ) 95 | 96 | @classmethod 97 | def convert_common_pii_to_aws_comprehend_type( 98 | cls, 99 | pii_type: PIIType, 100 | ) -> AWSComprehendPIIType: 101 | """ 102 | Converts a common PII Type to an AWS PII Type 103 | @param pii_type: 104 | @return: 105 | """ 106 | try: 107 | return AWSComprehendPIIType[pii_type.name] 108 | except Exception as ex: 109 | raise Exception( 110 | "The current version does not support this PII Type conversion." 111 | ) 112 | 113 | @classmethod 114 | def convert_azure_pii_to_common_pii_type(cls, pii_type: str) -> PIIType: 115 | """ 116 | Converts an Azure PII Type to a common PII Type 117 | @param pii_type: 118 | @return: 119 | """ 120 | try: 121 | if pii_type == AzureDetectionType.USUK_PASSPORT_NUMBER.value: 122 | # Special case, map to USUK for all US and UK Passport types 123 | return PIIType.US_PASSPORT_NUMBER 124 | 125 | return PIIType[AzureDetectionType(pii_type).name] 126 | except Exception as ex: 127 | raise Exception( 128 | "The current version does not support this PII Type conversion." 129 | ) 130 | 131 | @classmethod 132 | def convert_aws_comprehend_pii_to_common_pii_type( 133 | cls, 134 | pii_type: str, 135 | ) -> PIIType: 136 | """ 137 | Converts an AWS PII Type to a common PII Type 138 | @param pii_type: str from AWS Comprehend (maps to value of AWSComprehendPIIType) 139 | @return: 140 | """ 141 | try: 142 | return PIIType[AWSComprehendPIIType(pii_type).name] 143 | except Exception as ex: 144 | raise Exception( 145 | "The current version does not support this PII Type conversion." 146 | ) 147 | 148 | @classmethod 149 | def convert_msft_presidio_pii_to_common_pii_type( 150 | cls, 151 | pii_type: str, 152 | ) -> PIIType: 153 | """ 154 | Converts a Microsoft Presidio PII Type to a common PII Type 155 | @param pii_type: str from Presidio (maps to value of PIIType) 156 | @return: 157 | """ 158 | try: 159 | # Handle specific cases where Presidio returns different values than enum names 160 | if pii_type == "US_SSN": 161 | return PIIType.US_SOCIAL_SECURITY_NUMBER 162 | if pii_type == "US_BANK_NUMBER": 163 | return PIIType.US_BANK_ACCOUNT_NUMBER 164 | if pii_type == "AU_MEDICARE": 165 | return PIIType.AU_MEDICAL_ACCOUNT_NUMBER 166 | if pii_type == "DATE": 167 | return PIIType.DATE_TIME 168 | 169 | # For everything else, use the original approach that was working 170 | return PIIType[MSFTPresidioPIIType(pii_type).name] 171 | 172 | except Exception as ex: 173 | raise Exception( 174 | f"The current version does not support this PII Type conversion: {pii_type}. Error: {str(ex)}" 175 | ) 176 | 177 | @classmethod 178 | def convert_metadata_type_to_common_pii_type( 179 | cls, metadata_type: str 180 | ) -> Optional[PIIType]: 181 | """ 182 | Converts metadata type str entry to common PII type 183 | @param metadata_type: 184 | @return: PIIType 185 | """ 186 | 187 | try: 188 | if metadata_type.lower() == "name": 189 | return PIIType.PERSON 190 | 191 | if metadata_type.lower() == "user_id": 192 | # If dealing with public data, user_id can be used to pull down 193 | # social network profile 194 | return PIIType.SOCIAL_NETWORK_PROFILE 195 | 196 | return PIIType[MetadataType(metadata_type.lower()).name] 197 | except Exception as ex: 198 | raise Exception( 199 | "The current version does not support this Metadata to PII Type conversion." 
200 | ) 201 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/models/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.models API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 58 | 80 |
81 | 84 | 85 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/utils/package_installer_util.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.utils.package_installer_util API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.utils.package_installer_util

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
import subprocess
30 | import sys
31 | 
32 | 
33 | def install_spacy_package(package_name):
34 |     """
35 |     Installs missing spacy package (if found missing)
36 |     @param package_name:
37 |     @return:
38 |     """
39 |     subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |

Functions

48 |
49 |
50 | def install_spacy_package(package_name) 51 |
52 |
53 |

Installs missing spacy package (if found missing) 54 | @param package_name: 55 | @return:

56 |
57 | 58 | Expand source code 59 | 60 |
def install_spacy_package(package_name):
61 |     """
62 |     Installs missing spacy package (if found missing)
63 |     @param package_name:
64 |     @return:
65 |     """
66 |     subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])
67 |
68 |
69 |
70 |
71 |
72 |
73 |
74 | 92 |
93 | 96 | 97 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/utils/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.utils API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 58 | 80 |
81 | 84 | 85 | -------------------------------------------------------------------------------- /joss/paper.bib: -------------------------------------------------------------------------------- 1 | @article{milne_pettinico_hajjat_markos_2016, 2 | title={Information sensitivity typology: Mapping the degree and type of risk consumers perceive in personal data sharing}, 3 | volume={51}, 4 | DOI={10.1111/joca.12111}, 5 | number={1}, 6 | journal={Journal of Consumer Affairs}, 7 | author={Milne, George R. and Pettinico, George and Hajjat, Fatima M. and Markos, Ereni}, 8 | year={2016}, pages={133–161} 9 | } 10 | @misc{OCR_2022, 11 | title={Summary of the HIPAA privacy rule}, 12 | url={https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html#:~:text=The%20U.S.%20Department%20of%20Health,1996%20(%22HIPAA%22).}, 13 | journal={HHS.gov}, 14 | author={OCR, Office for Civil Rights}, 15 | year={2022}, 16 | month={Oct} 17 | } 18 | @article{West_2017, 19 | title={Data capitalism: Redefining the logics of surveillance and privacy}, 20 | volume={58}, 21 | DOI={10.1177/0007650317718185}, 22 | number={1}, 23 | journal={Business &amp; Society}, 24 | author={West, Sarah Myers}, 25 | year={2017}, 26 | pages={20–41} 27 | } 28 | @article{schwartz_solove_2011, 29 | title={The PII Problem: Privacy and a New Concept of Personally Identifiable Information}, 30 | volume={86}, 31 | journal={New York University Law Review}, 32 | author={Schwartz, Paul M and Solove, Daniel J}, 33 | year={2011}, 34 | pages={1814} 35 | } 36 | @article{trepte2020, 37 | author = {Trepte, Sabine}, 38 | year = {2020}, 39 | month = {05}, 40 | pages = {}, 41 | title = {The Social Media Privacy Model: Privacy and Communication in the Light of Social Media Affordances}, 42 | volume = {31}, 43 | journal = {Communication Theory}, 44 | doi = {10.1093/ct/qtz035} 45 | } 46 | @software{rosado2022, 47 | author = {Rosado, Eidan J.}, 48 | doi = {10.5281/zenodo.7212576}, 49 | month = {12}, 50 | title = {{pii-codex: a Python library for PII detection, categorization, and severity assessment}}, 51 | version = {0.4.3}, 52 | year = {2023} 53 | } 54 | @misc{microsoft_presidio, 55 | title={Microsoft/Presidio: Context Aware, pluggable and customizable data protection and Anonymization SDK for text and images}, 56 | url={https://github.com/microsoft/presidio}, 57 | publisher={Microsoft}, 58 | author={Microsoft} 59 | } 60 | @misc{microsoft_presidio_entities, 61 | title={PII entities supported by Presidio}, 62 | url={https://microsoft.github.io/presidio/supported_entities/}, 63 | journal={Supported entities - Microsoft Presidio}, 64 | publisher={Microsoft}, 65 | author={Microsoft}, 66 | year={2022} 67 | } 68 | @misc{azure_detection_cognitive_skill, 69 | title={PII detection cognitive skill - azure cognitive search}, 70 | url={https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-pii-detection}, 71 | journal={PII Detection cognitive skill - Azure Cognitive Search | Microsoft Learn}, 72 | publisher={Microsoft Azure}, 73 | author={Microsoft Azure}, 74 | year={2022} 75 | } 76 | @misc{aws_comprehend, 77 | title={PII detection cognitive skill - azure cognitive search}, 78 | url={https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html}, 79 | journal={What is Amazon Comprehend | AWS Comprehend}, 80 | publisher={Amazon Web Services}, 81 | author={Amazon Web Services}, 82 | year={2022} 83 | } 84 | @misc{gdpr_erasure_right, 85 | 
url={https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/individual-rights/right-to-erasure/#:~:text=Under%20Article%2017%20of%20the,be%20created%20in%20the%20future.}, 86 | title={Right to erasure}, 87 | publisher={Information Commissioners Office}, 88 | year={2022} 89 | } 90 | @misc{GDPR_eu_2022, 91 | url={https://gdpr.eu/what-is-gdpr/}, 92 | journal={GDPR.eu}, 93 | year={2022}, 94 | month={May} 95 | } 96 | @misc{hipaa, 97 | url={https://www.hhs.gov/hipaa/index.html}, 98 | title={Health Information Privacy}, 99 | publisher={U.S. Department of Health and Human Services}, 100 | year={2022} 101 | } 102 | @article{mccallister_grance_scarfone_2010, 103 | title={Guide to protecting the confidentiality of personally identifiable information (PII)}, 104 | DOI={10.6028/nist.sp.800-122}, 105 | journal={S Department of Commerce: National Institute of Standards and Technology (NIST)}, 106 | author={McCallister, E and Grance, T and Scarfone, K A}, 107 | year={2010} 108 | } 109 | @misc{dhs_2012, 110 | title={Handbook for Safeguarding Sensitive Personally Identifying Information}, 111 | publisher={Privacy Office, Department of Homeland Security}, 112 | year={2012} 113 | } 114 | @article{beigi_liu_2020, 115 | title={A survey on privacy in Social Media}, volume={1}, 116 | DOI={10.1145/3343038}, 117 | number={1}, 118 | journal={ACM/IMS Transactions on Data Science}, 119 | author={Beigi, Ghazaleh and Liu, Huan}, 120 | year={2020}, 121 | pages={1–38} 122 | } 123 | @article{moura_serrão_2019, 124 | title={Security and privacy issues of Big Data}, 125 | DOI={10.4018/978-1-5225-8897-9.ch019}, 126 | journal={Cyber Law, Privacy, and Security}, 127 | author={Moura, José and Serrão, Carlos}, 128 | year={2019}, 129 | pages={375–407} 130 | } 131 | 132 | @unpublished{spacy2, 133 | AUTHOR = {Honnibal, Matthew and Montani, Ines}, 134 | TITLE = {{spaCy 2}: Natural language understanding with {B}loom embeddings, convolutional neural networks and incremental parsing}, 135 | YEAR = {2017}, 136 | Note = {To appear} 137 | } 138 | 139 | @article{belanger_crossler_2011, 140 | ISSN = {02767783}, 141 | URL = {http://www.jstor.org/stable/41409971}, 142 | abstract = {Information privacy refers to the desire of individuals to control or have some influence over data about themselves. Advances in information technology have raised concerns about information privacy and its impacts, and have motivated Information Systems researchers to explore information privacy issues, including technical solutions to address these concerns. In this paper, we inform researchers about the current state of information privacy research in IS through a critical analysis of the IS literature that considers information privacy as a key construct. The review of the literature reveals that information privacy is a multilevel concept, but rarely studied as such. We also find that information privacy research has been heavily reliant on studentbased and USA-centric samples, which results in findings of limited generalizability. Information privacy research focuses on explaining and predicting theoretical contributions, with few studies in journal articles focusing on design and action contributions. We recommend that future research should consider different levels of analysis as well as multilevel effects of information privacy. We illustrate this with a multilevel framework for information privacy concerns. 
We call for research on information privacy to use a broader diversity of sampling populations, and for more design and action information privacy research to be published in journal articles that can result in IT artifacts for protection or control of information privacy.}, 143 | author = {France Bélanger and Robert E. Crossler}, 144 | journal = {MIS Quarterly}, 145 | number = {4}, 146 | pages = {1017--1041}, 147 | publisher = {Management Information Systems Research Center, University of Minnesota}, 148 | title = {Privacy in the Digital Age: A Review of Information Privacy Research in Information Systems}, 149 | urldate = {2023-04-26}, 150 | volume = {35}, 151 | year = {2011} 152 | } 153 | 154 | @article{tene_polonetsky_2012, 155 | author = {Tene, Omer and Polonetsky, Jules}, 156 | year = {2012}, 157 | month = {09}, 158 | pages = {}, 159 | title = {Big Data for All: Privacy and User Control in the Age of Analytics}, 160 | volume = {11}, 161 | journal = {Northwestern Journal of Technology and Intellectual Property} 162 | } -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/adapters/detection_adapters/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services.adapters.detection_adapters API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.services.adapters.detection_adapters

23 |
24 |
25 |
26 |
27 |

Sub-modules

28 |
29 |
pii_codex.services.adapters.detection_adapters.aws_detection_adapter
30 |
31 |
32 |
33 |
pii_codex.services.adapters.detection_adapters.azure_detection_adapter
34 |
35 |
36 |
37 |
pii_codex.services.adapters.detection_adapters.detection_adapter_base
38 |
39 |
40 |
41 |
pii_codex.services.adapters.detection_adapters.presidio_detection_adapter
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 |
50 |
51 |
52 |
53 |
54 | 75 |
76 | 79 | 80 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/utils/logging.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.utils.logging API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.utils.logging

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
from time import time
 30 | import logging
 31 | 
 32 | logger = logging.getLogger()
 33 | 
 34 | 
 35 | def timed_operation(func):
 36 |     """
 37 |     Used to show execution time for function
 38 |     @param func:
 39 |     @return:
 40 |     """
 41 | 
 42 |     def wrapper_function(*args, **kwargs):
 43 |         """
 44 |         Logs the function execution time
 45 | 
 46 |         @param args:
 47 |         @param kwargs:
 48 |         @return:
 49 |         """
 50 |         start_time = time()
 51 |         result = func(*args, **kwargs)
 52 |         end_time = time()
 53 |         logger.info(f"{func.__name__!r} executed in {(end_time - start_time):.4f}s")
 54 |         return result
 55 | 
 56 |     return wrapper_function
57 |
58 |
59 |
60 |
61 |
62 |
63 |
64 |

Functions

65 |
66 |
67 | def timed_operation(func) 68 |
69 |
70 |

Used to show execution time for function 71 | @param func: 72 | @return:

73 |
74 | 75 | Expand source code 76 | 77 |
def timed_operation(func):
 78 |     """
 79 |     Used to show execution time for function
 80 |     @param func:
 81 |     @return:
 82 |     """
 83 | 
 84 |     def wrapper_function(*args, **kwargs):
 85 |         """
 86 |         Logs the function execution time
 87 | 
 88 |         @param args:
 89 |         @param kwargs:
 90 |         @return:
 91 |         """
 92 |         start_time = time()
 93 |         result = func(*args, **kwargs)
 94 |         end_time = time()
 95 |         logger.info(f"{func.__name__!r} executed in {(end_time - start_time):.4f}s")
 96 |         return result
 97 | 
 98 |     return wrapper_function
99 |
100 |
101 |
102 |
103 |
104 |
105 |
106 | 124 |
125 | 128 | 129 | -------------------------------------------------------------------------------- /pii_codex/services/analyzers/presidio_analysis.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=broad-except,unused-argument,import-outside-toplevel,unused-variable 2 | from typing import List, Tuple 3 | 4 | from presidio_anonymizer.entities.engine.recognizer_result import RecognizerResult 5 | 6 | from ...config import PII_MAPPER, DEFAULT_LANG, DEFAULT_TOKEN_REPLACEMENT_VALUE 7 | from ...models.analysis import DetectionResultItem, DetectionResult 8 | from ...utils.package_installer_util import install_spacy_package 9 | from ...utils.pii_mapping_util import PIIMapper 10 | from ...utils.logging import logger 11 | 12 | 13 | class PresidioPIIAnalyzer: 14 | """ 15 | Presidio PII Analyzer - a wrapper for the Microsoft Presidio Analyzer and Anonymization functions 16 | """ 17 | 18 | def __init__( 19 | self, pii_token_replacement_value: str = DEFAULT_TOKEN_REPLACEMENT_VALUE 20 | ): 21 | """ 22 | Since installing Spacy, the en_core_web_lg model, and the MSFT Presidio package are optional installs 23 | the imports are wrapped to prevent any failures 24 | @param pii_token_replacement_value: str to replace detected pii token with (e.g. ) 25 | """ 26 | 27 | try: 28 | import spacy 29 | from presidio_analyzer import AnalyzerEngine 30 | from presidio_anonymizer import AnonymizerEngine 31 | from presidio_anonymizer.entities import OperatorConfig 32 | 33 | if not spacy.util.is_package("en_core_web_lg"): 34 | # Last resort. Will install the en_core_web_lg package if end-user hadn't already. 35 | install_spacy_package("en_core_web_lg") 36 | 37 | self.analyzer = AnalyzerEngine() 38 | self.anonymizer = AnonymizerEngine() 39 | self.pii_mapper = PIIMapper() 40 | 41 | self.operators = { 42 | "DEFAULT": OperatorConfig( 43 | "replace", {"new_value": pii_token_replacement_value} 44 | ), 45 | "TITLE": OperatorConfig("redact", {}), 46 | } 47 | 48 | except ImportError: 49 | raise Exception( 50 | 'Missing dependencies from extras. Install the PII-Codex extras: "detections"' 51 | ) 52 | 53 | def get_supported_entities(self, language_code=DEFAULT_LANG) -> List[str]: 54 | """ 55 | Retrieves a list of supported entities, this will narrow down what is available for a given language 56 | 57 | @param language_code: str - defaults to "en" 58 | @return: List[str] 59 | """ 60 | return self.analyzer.get_supported_entities(language=language_code) 61 | 62 | def get_loaded_recognizers(self, language_code: str = DEFAULT_LANG): 63 | """ 64 | Retrieves a list of loaded recognizers, narrowing down the list of what is available for a given language 65 | @param language_code: 66 | @return: 67 | """ 68 | return self.analyzer.get_recognizers(language=language_code) 69 | 70 | def analyze_item( 71 | self, text: str, language_code: str = DEFAULT_LANG, entities: List[str] = None 72 | ) -> Tuple[List[DetectionResultItem], str]: 73 | """ 74 | Uses Microsoft Presidio (spaCy module) to analyze given a set of entities to analyze the provided text against. 75 | Will log an error if the identifier or entity recognizer is not added to Presidio's base recognizers or 76 | a custom recognizer created. 
Returns the list of detected items and the sanitized string 77 | 78 | @param language_code: str "en" is default 79 | @param entities: str - List[MSFTPresidioPIIType.name] 80 | @param text: str 81 | @return: Tuple[List[DetectionResultItem], str] 82 | """ 83 | 84 | detections = [] 85 | 86 | if not entities: 87 | entities = self.get_supported_entities(language_code) 88 | 89 | try: 90 | # Engine Setup - spaCy model setup and PII recognizers 91 | detections = self.analyzer.analyze( 92 | text=text, entities=entities, language=language_code 93 | ) 94 | 95 | except Exception as ex: 96 | logger.error(ex) 97 | 98 | # Return analyzer results in formatted Analysis Result List object 99 | detection_items = [ 100 | DetectionResultItem( 101 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 102 | result.entity_type 103 | ).name, 104 | score=result.score, 105 | start=result.start, 106 | end=result.end, 107 | ) 108 | for result in detections 109 | ] 110 | return detection_items, self.sanitize_text( 111 | text=text, analysis_items=detection_items 112 | ) 113 | 114 | def sanitize_text( 115 | self, text: str, analysis_items: List[DetectionResultItem] 116 | ) -> str: 117 | """ 118 | Sanitizes the text analyzed with MSFT Presidio's Anonymizer 119 | @param text: 120 | @param analysis_items: 121 | @return: 122 | """ 123 | try: 124 | # Convert DetectionResultItem back to RecognizerResult for Presidio anonymizer 125 | 126 | recognizer_results = [ 127 | RecognizerResult( 128 | entity_type=item.entity_type, 129 | start=item.start, 130 | end=item.end, 131 | score=item.score, 132 | ) 133 | for item in analysis_items 134 | ] 135 | 136 | anonymization_result = self.anonymizer.anonymize( 137 | text=text, analyzer_results=recognizer_results, operators=self.operators 138 | ) 139 | 140 | return anonymization_result.text or "" 141 | 142 | except Exception as ex: 143 | logger.error("An error occurred sanitizing the string") 144 | return "" 145 | 146 | def analyze_collection( 147 | self, texts: List[str], language_code: str = "en", entities: List[str] = None 148 | ) -> List[DetectionResult]: 149 | """ 150 | Uses Microsoft Presidio (spaCy module) to analyze given a set of entities to analyze the provided text against. 151 | Will log an error if the identifier or entity recognizer is not added to Presidio's base recognizers or 152 | a custom recognizer created. 
153 | 154 | @param language_code: str "en" is default 155 | @param entities: List[MSFTPresidioPIIType.name] defaults to all possible entities for selected language 156 | @param texts: List[str] 157 | @return: List[DetectionResult] 158 | """ 159 | 160 | detection_results = [] 161 | try: 162 | if not entities: 163 | entities = self.get_supported_entities(language_code) 164 | 165 | # Engine Setup - spaCy model setup and PII recognizers 166 | for i, text in enumerate(texts): 167 | text_analysis = self.analyzer.analyze( 168 | text=text, entities=entities, language=language_code 169 | ) 170 | 171 | # Every analysis by the analyzer will have a set of detections within 172 | detections = [ 173 | DetectionResultItem( 174 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 175 | result.entity_type 176 | ).name, 177 | score=result.score, 178 | start=result.start, 179 | end=result.end, 180 | ) 181 | for result in text_analysis 182 | ] 183 | detection_results.append( 184 | DetectionResult(index=i, detections=detections) 185 | ) 186 | 187 | # Return analyzer results in formatted Analysis Result List object 188 | 189 | except Exception as ex: 190 | logger.error(ex) 191 | 192 | return detection_results 193 | 194 | @classmethod 195 | def convert_analyzed_item(cls, pii_detection) -> List[DetectionResultItem]: 196 | """ 197 | Converts a single Presidio analysis attempt into a collection of DetectionResultItem objects. One string 198 | analysis by Presidio returns an array of RecognizerResult objects. 199 | 200 | @param pii_detection: RecognizerResult from presidio analyzer 201 | @return: List[DetectionResultItem] 202 | """ 203 | 204 | return [ 205 | DetectionResultItem( 206 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 207 | result.entity_type 208 | ).name, 209 | score=result.score, 210 | start=result.start, 211 | end=result.end, 212 | ) 213 | for result in pii_detection 214 | ] 215 | 216 | @classmethod 217 | def convert_analyzed_collection(cls, pii_detections) -> List[DetectionResult]: 218 | """ 219 | Converts a collection of Presidio analysis results to a collection of DetectionResult. A collection of Presidio 220 | analysis results ends up being a 2D array. 
221 | 222 | @param pii_detections: List[RecognizerResult] from Presidio analyzer 223 | @return: List[DetectionResult] 224 | """ 225 | 226 | detection_results: List[DetectionResult] = [] 227 | for i, result in enumerate(pii_detections): 228 | # Return results in formatted Analysis Result List object 229 | detections = [] 230 | for entity in result: 231 | detections.append( 232 | DetectionResultItem( 233 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 234 | entity.entity_type 235 | ).name, 236 | score=entity.score, 237 | start=entity.start, 238 | end=entity.end, 239 | ) 240 | ) 241 | 242 | detection_results.append(DetectionResult(index=i, detections=detections)) 243 | 244 | return detection_results 245 | -------------------------------------------------------------------------------- /tests/pii_codex/services/test_detection_service.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import pytest 4 | from assertpy import assert_that 5 | from presidio_analyzer import RecognizerResult 6 | from pii_codex.config import DEFAULT_LANG 7 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 8 | from pii_codex.models.microsoft_presidio_pii import ( 9 | MSFTPresidioPIIType, 10 | ) 11 | from pii_codex.services.analyzers.presidio_analysis import ( 12 | PresidioPIIAnalyzer, 13 | ) 14 | 15 | 16 | class TestDetectionService: 17 | presidio_analyzer = PresidioPIIAnalyzer() 18 | 19 | @pytest.mark.parametrize( 20 | "test_input,pii_types,expected_result", 21 | [ 22 | ("Not", [MSFTPresidioPIIType.PHONE_NUMBER.value], False), 23 | ("PII", [MSFTPresidioPIIType.EMAIL_ADDRESS.value], False), 24 | ("example@example.com", [MSFTPresidioPIIType.EMAIL_ADDRESS.value], True), 25 | ( 26 | "My email is example@example.eu.edu", 27 | [MSFTPresidioPIIType.EMAIL_ADDRESS.value], 28 | True, 29 | ), 30 | ( 31 | "My phone number is 191-212-456-7890", 32 | [MSFTPresidioPIIType.PHONE_NUMBER.value], 33 | False, 34 | ), # International number not working 35 | ( 36 | "My phone number is 305-555-5555", 37 | [MSFTPresidioPIIType.PHONE_NUMBER.value], 38 | True, 39 | ), 40 | ( 41 | "My phone number is 305-555-5555 and email is example@example.com", 42 | [ 43 | MSFTPresidioPIIType.PHONE_NUMBER.value, 44 | MSFTPresidioPIIType.EMAIL_ADDRESS.value, 45 | ], 46 | True, 47 | ), 48 | ], 49 | ) 50 | def test_msft_presidio_analysis_single_item( 51 | self, test_input, pii_types, expected_result 52 | ): 53 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 54 | text=test_input, 55 | entities=pii_types, 56 | ) 57 | 58 | if expected_result: 59 | assert_that(presidio_results).is_not_empty() 60 | assert_that(isinstance(presidio_results[0], DetectionResultItem)).is_true() 61 | assert_that(sanitized_text).is_not_empty() 62 | else: 63 | assert_that(presidio_results).is_empty() 64 | 65 | def test_msft_presidio_analysis_collection(self): 66 | presidio_results = self.presidio_analyzer.analyze_collection( 67 | texts=[ 68 | "My email is example@example.eu.edu", 69 | "My phone number is 305-555-5555 and email is example@example.com", 70 | ], 71 | entities=self.presidio_analyzer.get_supported_entities(language_code="en"), 72 | language_code=DEFAULT_LANG, 73 | ) 74 | 75 | assert_that(presidio_results).is_not_empty() 76 | assert_that(presidio_results[1].index).is_greater_than( 77 | presidio_results[0].index 78 | ) 79 | assert_that( 80 | isinstance(presidio_results[0].detections[0], DetectionResultItem) 81 | ).is_true() 82 | 83 | def 
test_presidio_analysis_collection_conversion(self): 84 | conversion_results: List[ 85 | DetectionResult 86 | ] = self.presidio_analyzer.convert_analyzed_collection( 87 | pii_detections=[ 88 | [ 89 | RecognizerResult( 90 | entity_type=MSFTPresidioPIIType.EMAIL_ADDRESS.value, 91 | start=123, 92 | end=456, 93 | score=0.98, 94 | ), 95 | RecognizerResult( 96 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 97 | start=123, 98 | end=456, 99 | score=0.973, 100 | ), 101 | ], 102 | [ 103 | RecognizerResult( 104 | entity_type=MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value, 105 | start=123, 106 | end=456, 107 | score=0.98, 108 | ), 109 | RecognizerResult( 110 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 111 | start=123, 112 | end=456, 113 | score=0.973, 114 | ), 115 | ], 116 | ] 117 | ) 118 | 119 | assert_that(conversion_results).is_not_empty() 120 | assert_that(conversion_results[1].index).is_greater_than( 121 | conversion_results[0].index 122 | ) 123 | assert_that( 124 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 125 | ).is_true() 126 | 127 | @pytest.mark.parametrize( 128 | "ssn_text,expected_detection", 129 | [ 130 | ("My SSN is 489-36-8350", True), # Robert Aragon from DLP test data 131 | ("SSN: 514-14-8905", True), # Ashley Borden from DLP test data 132 | ( 133 | "Social Security Number: 690-05-5315", 134 | True, 135 | ), # Thomas Conley from DLP test data 136 | ("My number is 421-37-1396", True), # Susan Davis from DLP test data 137 | ("SSN 458-02-6124", True), # Christopher Diaz from DLP test data 138 | ("No SSN here", False), # No SSN 139 | ("Random text 123-45-6789", False), # Generic SSN format without context 140 | ], 141 | ) 142 | def test_ssn_detection_with_dlp_data(self, ssn_text, expected_detection): 143 | """Test SSN detection using DLP test data from dlptest.com""" 144 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 145 | text=ssn_text, 146 | entities=[MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value], 147 | ) 148 | 149 | if expected_detection: 150 | assert_that(presidio_results).is_not_empty() 151 | assert_that(presidio_results[0].entity_type).is_equal_to( 152 | "US_SOCIAL_SECURITY_NUMBER" 153 | ) 154 | assert_that(sanitized_text).is_not_empty() 155 | else: 156 | assert_that(presidio_results).is_empty() 157 | 158 | def test_ssn_conversion_to_common_type(self): 159 | """Test that SSN detection results are properly converted to common PII types""" 160 | # Test with DLP test data SSN 161 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 162 | text="SSN: 489-36-8350", # Robert Aragon from DLP test data 163 | entities=[MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value], 164 | ) 165 | 166 | assert_that(presidio_results).is_not_empty() 167 | # The conversion should map US_SSN to US_SOCIAL_SECURITY_NUMBER 168 | assert_that(presidio_results[0].entity_type).is_equal_to( 169 | "US_SOCIAL_SECURITY_NUMBER" 170 | ) 171 | assert_that(sanitized_text).is_not_empty() 172 | assert_that(sanitized_text).does_not_contain("489-36-8350") 173 | 174 | def test_bank_number_detection_and_conversion(self): 175 | """Test bank account number detection and conversion""" 176 | # Test with a sample bank account number 177 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 178 | text="Bank account: 1234567890", 179 | entities=[MSFTPresidioPIIType.US_BANK_ACCOUNT_NUMBER.value], 180 | ) 181 | 182 | assert_that(presidio_results).is_not_empty() 183 | # The conversion should map US_BANK_NUMBER to 
US_BANK_ACCOUNT_NUMBER 184 | assert_that(presidio_results[0].entity_type).is_equal_to( 185 | "US_BANK_ACCOUNT_NUMBER" 186 | ) 187 | assert_that(sanitized_text).is_not_empty() 188 | assert_that(sanitized_text).does_not_contain("1234567890") 189 | 190 | def test_au_medicare_detection_and_conversion(self): 191 | """Test Australian Medicare number detection and conversion""" 192 | # Test with a sample Australian Medicare number 193 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 194 | text="Medicare: 1234567890", 195 | entities=[MSFTPresidioPIIType.AU_MEDICAL_ACCOUNT_NUMBER.value], 196 | ) 197 | 198 | # Note: Presidio doesn't have a recognizer for AU_MEDICARE in English 199 | # This test demonstrates the mapping conversion but won't detect anything 200 | # The conversion should map AU_MEDICARE to AU_MEDICAL_ACCOUNT_NUMBER when it exists 201 | if presidio_results: 202 | assert_that(presidio_results[0].entity_type).is_equal_to( 203 | "AU_MEDICAL_ACCOUNT_NUMBER" 204 | ) 205 | assert_that(sanitized_text).is_not_empty() 206 | assert_that(sanitized_text).does_not_contain("1234567890") 207 | else: 208 | # If no recognizer is available, that's also acceptable 209 | pass 210 | 211 | def test_multiple_pii_types_with_dlp_data(self): 212 | """Test detection of multiple PII types using DLP test data""" 213 | test_text = ( 214 | "Robert Aragon, SSN: 489-36-8350, DOB: 6/7/1981" # Test entry from DLP data 215 | ) 216 | 217 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 218 | text=test_text, 219 | entities=[ 220 | MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value, 221 | MSFTPresidioPIIType.DATE.value, 222 | MSFTPresidioPIIType.PERSON.value, 223 | ], 224 | ) 225 | 226 | assert_that(presidio_results).is_not_empty() 227 | # Should detect SSN, date, and person 228 | detected_types = [result.entity_type for result in presidio_results] 229 | assert_that(detected_types).contains("US_SOCIAL_SECURITY_NUMBER") 230 | assert_that(detected_types).contains("PERSON") 231 | assert_that(sanitized_text).is_not_empty() 232 | assert_that(sanitized_text).does_not_contain("489-36-8350") 233 | assert_that(sanitized_text).does_not_contain("6/7/1981") 234 | assert_that(sanitized_text).does_not_contain("Robert Aragon") 235 | -------------------------------------------------------------------------------- /tests/pii_codex/utils/test_pii_mapping_util.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=broad-except, line-too-long 2 | from assertpy import assert_that 3 | import pytest 4 | 5 | from pii_codex.config import PII_MAPPER 6 | from pii_codex.models.aws_pii import AWSComprehendPIIType 7 | from pii_codex.models.azure_pii import AzureDetectionType 8 | from pii_codex.models.common import ( 9 | PIIType, 10 | ClusterMembershipType, 11 | DHSCategory, 12 | NISTCategory, 13 | HIPAACategory, 14 | RiskLevel, 15 | RiskLevelDefinition, 16 | ) 17 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 18 | from pii_codex.services.pii_type_mappings import PII_TYPE_MAPPINGS 19 | import pii_codex.utils.pii_mapping_util as util_module 20 | 21 | 22 | class TestPIIMappingUtil: 23 | # region PII MAPPING AND CONVERSION FUNCTIONS 24 | @pytest.mark.parametrize( 25 | "pii_type", 26 | [pii_type.name for pii_type in PIIType], 27 | ) 28 | def test_map_pii_type(self, pii_type): 29 | """ 30 | Requires the type mapping to be in the associated version file in pii_codex/data/ 31 | """ 32 | if pii_type is not PIIType.DOCUMENTS.name: 33 | 
mapped_pii = PII_MAPPER.map_pii_type(pii_type) 34 | assert_that(mapped_pii.risk_level).is_greater_than(1) 35 | assert_that( 36 | isinstance( 37 | ClusterMembershipType(mapped_pii.cluster_membership_type), 38 | ClusterMembershipType, 39 | ) 40 | ).is_true() 41 | assert_that( 42 | isinstance(DHSCategory(mapped_pii.dhs_category), DHSCategory) 43 | ).is_true() 44 | assert_that( 45 | isinstance(NISTCategory(mapped_pii.nist_category), NISTCategory) 46 | ).is_true() 47 | assert_that( 48 | isinstance(HIPAACategory(mapped_pii.hipaa_category), HIPAACategory) 49 | ).is_true() 50 | 51 | @pytest.mark.parametrize( 52 | "pii_type", 53 | PIIType, 54 | ) 55 | def test_convert_common_pii_to_msft_presidio_type(self, pii_type): 56 | try: 57 | converted_pii = PII_MAPPER.convert_common_pii_to_msft_presidio_type( 58 | pii_type 59 | ) 60 | assert_that(isinstance(converted_pii, MSFTPresidioPIIType)).is_true() 61 | except Exception as ex: 62 | assert_that(ex.args[0]).contains( 63 | "The current version does not support this PII Type conversion." 64 | ) 65 | 66 | @pytest.mark.parametrize( 67 | "pii_type", 68 | PIIType, 69 | ) 70 | def test_convert_common_pii_to_azure_pii_type(self, pii_type): 71 | try: 72 | converted_pii = PII_MAPPER.convert_common_pii_to_azure_pii_type(pii_type) 73 | assert_that(isinstance(converted_pii, AzureDetectionType)).is_true() 74 | except Exception as ex: 75 | assert_that(ex.args[0]).contains( 76 | "The current version does not support this PII Type conversion." 77 | ) 78 | 79 | @pytest.mark.parametrize( 80 | "pii_type", 81 | AzureDetectionType, 82 | ) 83 | def test_convert_azure_pii_to_common_pii_type(self, pii_type): 84 | try: 85 | converted_pii = PII_MAPPER.convert_azure_pii_to_common_pii_type(pii_type) 86 | assert_that(isinstance(converted_pii, PIIType)).is_true() 87 | except Exception as ex: 88 | assert_that(ex.args[0]).contains( 89 | "The current version does not support this PII Type conversion." 90 | ) 91 | 92 | @pytest.mark.parametrize( 93 | "pii_type", 94 | AWSComprehendPIIType, 95 | ) 96 | def test_convert_aws_pii_to_common_pii_type(self, pii_type): 97 | try: 98 | converted_pii = PII_MAPPER.convert_aws_comprehend_pii_to_common_pii_type( 99 | pii_type 100 | ) 101 | assert_that(isinstance(converted_pii, PIIType)).is_true() 102 | except Exception as ex: 103 | assert_that(ex.args[0]).contains( 104 | "The current version does not support this PII Type conversion." 105 | ) 106 | 107 | @pytest.mark.parametrize( 108 | "pii_type", 109 | PIIType, 110 | ) 111 | def test_convert_common_pii_to_aws_comprehend_type(self, pii_type): 112 | try: 113 | converted_pii = PII_MAPPER.convert_common_pii_to_aws_comprehend_type( 114 | pii_type 115 | ) 116 | assert_that(isinstance(converted_pii, AWSComprehendPIIType)).is_true() 117 | except Exception as ex: 118 | assert_that(ex.args[0]).contains( 119 | "The current version does not support this PII Type conversion." 120 | ) 121 | 122 | def test_convert_metadata_to_pii_failure(self): 123 | with pytest.raises(Exception) as execinfo: 124 | PII_MAPPER.convert_metadata_type_to_common_pii_type("other_type") 125 | 126 | assert_that(str(execinfo.value)).contains( 127 | "The current version does not support this Metadata to PII Type conversion." 
128 | ) 129 | 130 | def test_convert_msft_presidio_pii_to_common_pii_type_failure(self): 131 | with pytest.raises(Exception) as execinfo: 132 | PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type("other_type") 133 | 134 | assert_that(str(execinfo.value)).contains( 135 | "The current version does not support this PII Type conversion: other_type. Error: 'other_type' is not a valid MSFTPresidioPIIType" 136 | ) 137 | 138 | def test_pii_mapping_enum_consistency(self): 139 | """Test that all PII mappings have consistent enum references""" 140 | for mapping in PII_TYPE_MAPPINGS.values(): 141 | # Test that risk level mapping works correctly (it's an enum, not int) 142 | assert_that(mapping.risk_level).is_instance_of(RiskLevel) 143 | assert_that(mapping.risk_level.value).is_between(1, 3) 144 | 145 | # Test that cluster membership type is valid 146 | assert_that(mapping.cluster_membership_type).is_instance_of( 147 | ClusterMembershipType 148 | ) 149 | 150 | # Test that categories are valid enum instances 151 | assert_that(mapping.nist_category).is_instance_of(NISTCategory) 152 | assert_that(mapping.dhs_category).is_instance_of(DHSCategory) 153 | assert_that(mapping.hipaa_category).is_instance_of(HIPAACategory) 154 | 155 | # Test that provider enums are either valid enum instances or None 156 | if mapping.presidio_enum is not None: 157 | assert_that(mapping.presidio_enum).is_instance_of(MSFTPresidioPIIType) 158 | if mapping.azure_enum is not None: 159 | assert_that(mapping.azure_enum).is_instance_of(AzureDetectionType) 160 | if mapping.aws_enum is not None: 161 | assert_that(mapping.aws_enum).is_instance_of(AWSComprehendPIIType) 162 | 163 | def test_risk_level_definition_mapping(self): 164 | """Test that risk level to definition mapping works correctly""" 165 | for pii_type in PII_TYPE_MAPPINGS: 166 | # Test that we can create a RiskAssessment without errors 167 | risk_assessment = PII_MAPPER.map_pii_type(pii_type) 168 | 169 | # Test that risk level definition is a valid string 170 | assert_that(risk_assessment.risk_level_definition).is_instance_of(str) 171 | 172 | # Test that risk level definition is one of the valid values 173 | valid_definitions = [level.value for level in RiskLevelDefinition] 174 | assert_that( 175 | risk_assessment.risk_level_definition in valid_definitions 176 | ).is_true() 177 | 178 | # Test that risk level is an integer 179 | assert_that(risk_assessment.risk_level).is_instance_of(int) 180 | assert_that(risk_assessment.risk_level).is_between(1, 3) 181 | 182 | def test_provider_enum_consistency(self): 183 | """Test that provider-specific enum mappings are consistent""" 184 | # Test some key mappings that should have all three providers 185 | key_mappings = [ 186 | "EMAIL_ADDRESS", 187 | "PHONE_NUMBER", 188 | "PERSON", 189 | "US_SOCIAL_SECURITY_NUMBER", 190 | "CREDIT_CARD_NUMBER", 191 | ] 192 | 193 | for pii_type in key_mappings: 194 | if pii_type in PII_TYPE_MAPPINGS: 195 | mapping = PII_TYPE_MAPPINGS[pii_type] 196 | 197 | # These should have all three provider enums 198 | assert_that(mapping.presidio_enum).is_not_none() 199 | assert_that(mapping.azure_enum).is_not_none() 200 | assert_that(mapping.aws_enum).is_not_none() 201 | 202 | # Test that the enum names are consistent 203 | if mapping.presidio_enum: 204 | assert_that(mapping.presidio_enum.name).is_equal_to(pii_type) 205 | if mapping.azure_enum: 206 | assert_that(mapping.azure_enum.name).is_equal_to(pii_type) 207 | if mapping.aws_enum: 208 | # AWS might have different naming (e.g., CREDIT_DEBIT_NUMBER vs 
CREDIT_CARD_NUMBER) 209 | assert_that(mapping.aws_enum.name).is_not_empty() 210 | 211 | def test_azure_usuk_passport_mapping(self): 212 | """Test the specific fix for US_PASSPORT_NUMBER -> USUK_PASSPORT_NUMBER""" 213 | us_passport_mapping = PII_TYPE_MAPPINGS.get("US_PASSPORT_NUMBER") 214 | assert_that(us_passport_mapping).is_not_none() 215 | 216 | if us_passport_mapping and us_passport_mapping.azure_enum: 217 | assert_that(us_passport_mapping.azure_enum.name).is_equal_to( 218 | "USUK_PASSPORT_NUMBER" 219 | ) 220 | 221 | def test_aws_australian_types_none(self): 222 | """Test that AWS doesn't have Australian business/company types (set to None)""" 223 | au_business_mapping = PII_TYPE_MAPPINGS.get("AU_BUSINESS_NUMBER") 224 | au_company_mapping = PII_TYPE_MAPPINGS.get("AU_COMPANY_NUMBER") 225 | 226 | if au_business_mapping: 227 | assert_that(au_business_mapping.aws_enum).is_none() 228 | if au_company_mapping: 229 | assert_that(au_company_mapping.aws_enum).is_none() 230 | 231 | def test_no_csv_dependencies(self): 232 | """Test that we no longer have CSV file dependencies""" 233 | # Check that the module doesn't have pandas-related attributes 234 | assert_that(hasattr(util_module, "_pii_mapping_data_frame")).is_false() 235 | 236 | # Check that we're using the new mapping system (PII_MAPPER is imported from config) 237 | assert_that(PII_MAPPER).is_not_none() 238 | 239 | # Test that map_pii_type works without CSV 240 | result = PII_MAPPER.map_pii_type("EMAIL_ADDRESS") 241 | assert_that(result).is_not_none() 242 | assert_that(result.pii_type_detected).is_equal_to("EMAIL_ADDRESS") 243 | 244 | # endregion 245 | 246 | # CSV-related tests removed - no longer using CSV files 247 | -------------------------------------------------------------------------------- /docs/DETECTION_AND_ANALYSIS.md: -------------------------------------------------------------------------------- 1 | # Detection and Analysis with PII-Codex 2 | In the case that you are not bringing your own detection service, Microsoft Presidio is integrated into PII Codex that provides flexibility in analyzer type. At the time of this repo's creation, only a select number of evaluators exist. You may create your own evaluators and swap out the version of presidio that pii-codex uses. Note that with this change, you will also need to update the mappings where applicable. 3 | 4 | The following are not integrated into the service, but have PII type mapping and detection object conversion support: 5 | 6 |
 7 | 1. AWS Comprehend (Requires AWS Account) [docs]
 8 | 2. Azure PII Detection Cognitive Skill (Requires Azure Account) [docs]
 9 | 
10 | 
11 | For those using pre-detected results, adapters are provided to convert types and results to the expected DetectionResult/DetectionResultItem format (see diagram below):
12 | 
13 | ![Converting And Analyzing Existing Detections](UC1_Converting_Existing_Detections_With_Adapters.png)
14 | 
15 | To supply the analyzer module with a collection of pre-detected results from your own Microsoft Presidio, Azure, or AWS Comprehend analysis process, you will first need to convert the detections to a set of DetectionResult objects to feed into the analyzer as follows:
16 | 
17 | ```python
18 | from typing import List
19 | from pii_codex.models.common import (
20 |     AnalysisProviderType,
21 | )
22 | from presidio_analyzer import RecognizerResult
23 | from pii_codex.services.analysis_service import PIIAnalysisService
24 | from pii_codex.services.adapters.detection_adapters.presidio_detection_adapter import PresidioPIIDetectionAdapter
25 | from pii_codex.models.analysis import DetectionResult
26 | 
27 | presidio_detection_service = PresidioPIIDetectionAdapter()
28 | 
29 | list_of_detections: List[RecognizerResult] = []  # your list of detections
30 | converted_detections: List[DetectionResult] = presidio_detection_service.convert_analyzed_collection(
31 |     pii_detections=list_of_detections
32 | )
33 | pii_analysis_service = PIIAnalysisService(
34 |     analysis_provider=AnalysisProviderType.PRESIDIO.name
35 | )  # If you don't intend to use presidio, override the analysis_provider value
36 | 
37 | results = pii_analysis_service.analyze_detection_collection(
38 |     detection_collection=converted_detections,
39 |     collection_name="Data Set Label",  # this is more for those that intend to find a way to label collections
40 |     collection_type="SAMPLE"  # defaults to POPULATION, input used for standard deviation and variance calculations
41 | )
42 | ```
43 | 
44 | The other two detection adapters available are AWSComprehendPIIDetectionAdapter and AzurePIIDetectionAdapter.
45 | 
46 | 
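As a rough sketch of the same conversion flow with one of those other adapters, the snippet below mirrors the Presidio example above. The module path, the `convert_analyzed_collection` method name, and the shape of the Comprehend payloads are assumptions carried over from the Presidio adapter example rather than confirmed API details, so verify them against the adapter source before relying on them.

```python
from typing import List

from pii_codex.models.analysis import DetectionResult
# Assumed module path, mirroring the Presidio adapter import shown above
from pii_codex.services.adapters.detection_adapters.aws_detection_adapter import (
    AWSComprehendPIIDetectionAdapter,
)

aws_detection_adapter = AWSComprehendPIIDetectionAdapter()

# Pre-existing responses from AWS Comprehend PII detection calls
# (the payload shape is an assumption -- consult the adapter source)
comprehend_detections: List[dict] = []

# Assumes the AWS adapter exposes the same convert_analyzed_collection
# method demonstrated for the Presidio adapter above
converted_detections: List[DetectionResult] = aws_detection_adapter.convert_analyzed_collection(
    pii_detections=comprehend_detections
)
```

The converted detections can then be passed to analyze_detection_collection exactly as shown above; AzurePIIDetectionAdapter would be used the same way for Azure PII Detection Cognitive Skill results.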
47 | 48 | In the case you require the built-in Presidio functionality, you can call the analysis service as follows: 49 | 50 | ```python 51 | from pii_codex.services.analysis_service import PIIAnalysisService 52 | 53 | pii_analysis_service = PIIAnalysisService() 54 | 55 | strings_to_analyze = ["string to analyze", "string to analyze"] # strings to analyze 56 | results = pii_analysis_service.analyze_collection( 57 | texts=strings_to_analyze, 58 | language_code="en", 59 | collection_name="Data Set Label", # this is optional, geared for those that require labeling of collections 60 | collection_type="SAMPLE" # defaults to POPULATION, input used for standard deviation and variance calculations 61 | ) 62 | ``` 63 | 64 | This functionality can easily take a singular text item or a collection of them and runs through the presidio analysis and assessment service files as presented in the diagram below. 65 | 66 | ![Converting And Analyzing Text with Presidio Builtin](UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png) 67 | 68 | 69 | For those analyzing social media posts, you can also supply metadata per text sample to be analyzed in a dataframe. 70 | 71 | ```python 72 | import pandas as pd 73 | from pii_codex.services.analysis_service import PIIAnalysisService 74 | 75 | pii_analysis_service = PIIAnalysisService() 76 | 77 | results = pii_analysis_service.analyze_collection( 78 | data=pd.DataFrame({ 79 | "text": [ 80 | "I attend the University of Central Florida, how about you?", 81 | "If anyone needs trig help, my phone number 555-555-5555 and my email is example123@email.com", 82 | "Oh I do! My number is 777-777-7777. Where is the residence hall?", 83 | "The dorm is over at 123 Dark Data Lane, OH, 11111", 84 | "Cool, I'll be there!" 85 | ], 86 | "metadata": [ 87 | {"location": True, "url": False, "screen_name": True}, 88 | {"location": True, "url": False, "screen_name": True}, 89 | {"location": False, "url": False, "screen_name": True}, # Not all social media posts will have location metadata 90 | {"location": False, "url": False, "screen_name": True}, 91 | {"location": True, "url": False, "screen_name": True}, 92 | ] 93 | }), 94 | language_code="en", 95 | collection_name="Data Set Label", # this is optional, geared for those that require labeling of collections 96 | collection_type="SAMPLE" # defaults to POPULATION, input used for standard deviation and variance calculations 97 | ) 98 | ``` 99 | 100 | Sample output: 101 | 102 | ``` 103 | { 104 | "collection_name": "PII Collection 1", 105 | "collection_type": "POPULATION", 106 | "analyses": [ 107 | { 108 | "analysis": [ 109 | { 110 | "pii_type_detected": "PERSON", 111 | "risk_level": 3, 112 | "risk_level_definition": "Identifiable", 113 | "cluster_membership_type": "Financial Information", 114 | "hipaa_category": "Protected Health Information", 115 | "dhs_category": "Linkable", 116 | "nist_category": "Directly PII", 117 | "entity_type": "PERSON", 118 | "score": 0.85, 119 | "start": 21, 120 | "end": 24, 121 | } 122 | ], 123 | "index": 0, 124 | "risk_score_mean": 3, 125 | "sanitized_text": "Hi! 
My name is ", 126 | }, 127 | { 128 | "analysis": [ 129 | { 130 | "pii_type_detected": "EMAIL_ADDRESS", 131 | "risk_level": 3, 132 | "risk_level_definition": "Identifiable", 133 | "cluster_membership_type": "Personal Preferences", 134 | "hipaa_category": "Protected Health Information", 135 | "dhs_category": "Stand Alone PII", 136 | "nist_category": "Directly PII", 137 | "entity_type": "EMAIL_ADDRESS", 138 | "score": 1.0, 139 | "start": 74, 140 | "end": 94, 141 | }, 142 | { 143 | "pii_type_detected": "PHONE_NUMBER", 144 | "risk_level": 3, 145 | "risk_level_definition": "Identifiable", 146 | "cluster_membership_type": "Contact Information", 147 | "hipaa_category": "Protected Health Information", 148 | "dhs_category": "Stand Alone PII", 149 | "nist_category": "Directly PII", 150 | "entity_type": "PHONE_NUMBER", 151 | "score": 0.75, 152 | "start": 45, 153 | "end": 57, 154 | }, 155 | { 156 | "pii_type_detected": "URL", 157 | "risk_level": 2, 158 | "risk_level_definition": "Semi-Identifiable", 159 | "cluster_membership_type": "Community Interaction", 160 | "hipaa_category": "Not Protected Health Information", 161 | "dhs_category": "Linkable", 162 | "nist_category": "Linkable", 163 | "entity_type": "URL", 164 | "score": 0.5, 165 | "start": 85, 166 | "end": 94, 167 | }, 168 | ], 169 | "index": 1, 170 | "risk_score_mean": 2.6666666666666665, 171 | "sanitized_text": "Hi! My phone number is . You can also reach me by email at ", 172 | }, 173 | { 174 | "analysis": [ 175 | { 176 | "pii_type_detected": None, 177 | "risk_level": 1, 178 | "risk_level_definition": "Non-Identifiable", 179 | "cluster_membership_type": None, 180 | "hipaa_category": None, 181 | "dhs_category": None, 182 | "nist_category": None, 183 | } 184 | ], 185 | "index": 2, 186 | "risk_score_mean": 1, 187 | "sanitized_text": "Hi! What is the title of this book?", 188 | }, 189 | { 190 | "analysis": [ 191 | { 192 | "pii_type_detected": "LOCATION", 193 | "risk_level": 2, 194 | "risk_level_definition": "Semi-Identifiable", 195 | "cluster_membership_type": "Secure Identifiers", 196 | "hipaa_category": "Protected Health Information", 197 | "dhs_category": "Not Mentioned", 198 | "nist_category": "Linkable", 199 | "entity_type": "LOCATION", 200 | "score": 0.85, 201 | "start": 42, 202 | "end": 44, 203 | } 204 | ], 205 | "index": 3, 206 | "risk_score_mean": 2, 207 | "sanitized_text": "", 208 | }, 209 | { 210 | "analysis": [ 211 | { 212 | "pii_type_detected": None, 213 | "risk_level": 1, 214 | "risk_level_definition": "Non-Identifiable", 215 | "cluster_membership_type": None, 216 | "hipaa_category": None, 217 | "dhs_category": None, 218 | "nist_category": None, 219 | } 220 | ], 221 | "index": 4, 222 | "risk_score_mean": 1, 223 | "sanitized_text": "Hi! I have a cat too.", 224 | }, 225 | ], 226 | "detection_count": 5, 227 | "risk_scores": [3, 2.6666666666666665, 1, 2, 1], 228 | "risk_score_mean": 1.9333333333333333, 229 | "risk_score_mode": 1, 230 | "risk_score_median": 2, 231 | "risk_score_standard_deviation": 0.8273115763993905, 232 | "risk_score_variance": 0.6844444444444444, 233 | "detected_pii_types": { 234 | "LOCATION", 235 | "EMAIL_ADDRESS", 236 | "URL", 237 | "PHONE_NUMBER", 238 | "PERSON", 239 | }, 240 | "detected_pii_type_frequencies": { 241 | "PERSON": 1, 242 | "EMAIL_ADDRESS": 1, 243 | "PHONE_NUMBER": 1, 244 | "URL": 1, 245 | "LOCATION": 1, 246 | }, 247 | } 248 | 249 | ``` 250 | 251 | Check out full analysis example in the notebook: notebooks/pii-analysis-ms-presidio. 
-------------------------------------------------------------------------------- /joss/paper.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'PII-Codex: a Python library for PII detection, categorization, and severity assessment' 3 | 4 | tags: 5 | - Python 6 | - PII 7 | - PII topology 8 | - risk categories 9 | - personal identifiable information 10 | 11 | authors: 12 | - name: Eidan J. Rosado 13 | orcid: 0000-0003-0665-098X 14 | affiliation: 1 15 | affiliations: 16 | - name: 17 | College of Computing and Engineering, Nova Southeastern University, Fort Lauderdale, FL 33314, 18 | USA 19 | index: 1 20 | 21 | date: 30 Dec 2022 22 | 23 | bibliography: paper.bib 24 | 25 | --- 26 | 27 | # Summary 28 | There have been a number of advancements in the detection of personal identifiable information (PII) and scrubbing libraries to aid developers and researchers in their detection and anonymization efforts. With the recent shift in data handling procedures and global policy implementations regarding identifying information, it is becoming more important for data consumers to be aware of what data needs to be scrubbed, why it's being scrubbed, and to have the means to perform said scrubbing. 29 | 30 | PII-Codex is a collection of extended theoretical, conceptual, and policy works in PII categorization and severity assessment [@schwartz_solove_2011; @milne_pettinico_hajjat_markos_2016], and the integration thereof with PII detection software and API client adapters. It allows researchers to analyze a body of text or a collection thereof and determine whether the PII detected within these texts, if any, are considered identifiable. Furthermore, it allows end-users to determine the severity and associated categorizations of detected PII tokens. 31 | 32 | # Challenges 33 | 34 | While a number of open-source PII detection libraries have been created and PII detection APIs are provided by cloud service providers [@azure_detection_cognitive_skill; @aws_comprehend], the detection results are often provided with the type of PII detected, an index reference of where the detection is within the text, and a confidence score associated with the detection. Those receiving these results aren’t provided with a means of understanding why the text token is classified as PII, what framework, policy, or convention labels it as such, and just how severe its exposure is. 35 | 36 | # Statement of Need 37 | 38 | The general knowledge base of identifiable data, the usage restrictions of this data, and the associated policies surrounding it have shifted drastically over the years. Between the mid-1990s and 2000s, or the dotcom bubble, the industry saw a rise in data capitalism by way of making information freely accessible, fostering a way to make the web personal, and finally, placing value on data and the potential it had to impact consumerism [@West_2017]. Alongside the rise in data capitalism came early data policy initiatives. In 1995, the EU Data Protection Directive was created to establish some minimum data privacy and security standards [@GDPR_eu_2022] and the US Health Insurance Portability and Accountability Act (HIPAA) was enacted in 1996 with the final regulation being published in 2000 [@OCR_2022] to help battle healthcare fraud and to provide regulations governing the privacy and security of an individual's patient details. 
Both of these policies have evolved over the years to include protected entities and have paved the way to the policies and protective technologies the world sees today aimed at protecting PII. 39 | 40 | The tech industry specifically has had to adjust to these policy changes regarding the tracking of individuals, the usage of data from online profiles and platforms, and the right to be forgotten entirely from a service or platform [@gdpr_erasure_right]. While the shift has provided data protections around the globe, the majority of technology users continue to have little to no control over their personal information with third-party data consumers [@tene_polonetsky_2012; @trepte2020]. From an individual researcher's perspective, understanding if identifiable data types exist in a data set can prevent accidental sharing of such data by allowing its detection in the first place and, in the case of this software package, permit for the results to be publishable by sanitizing the text tokens and provide transparency on the reasons why the token was considered to be PII. From a platform user's perspective, detecting PII ahead of publication and understanding why it is considered PII can prevent an accidental disclosure that can later be used by adversaries. This need is what drives the development of PII-Codex. 41 | 42 | # The PII-Codex Package 43 | 44 | PII-Codex is a Python package built to combine the Information Sensitivity Typology works of Milne et al. [@milne_pettinico_hajjat_markos_2016], categorizations and guidelines from the National Institute of Standards and Technology (NIST) [@mccallister_grance_scarfone_2010], Department of Homeland Security (DHS) [@dhs_2012], and the Health Insurance Portability and Accountability Act (HIPAA) [@hipaa]. It combines these categories to rate the detection on a scale of 1 to 3, labeling it as Non-Identifiable, Semi-Identifiable, or Identifiable as presented by the risk continuum by Schwartz and Solove [@schwartz_solove_2011]. The package provides a subset of Milne et al.'s Information Sensitivity Typology as some technologies group entries into a singular category or the detection of the entry may not yet be available. 45 | 46 | Built into the package is an analyzer service that leverages Microsoft’s Presidio library for PII detection and anonymization [@microsoft_presidio] as well as the option to use the built-in detection adapters for Microsoft Presidio, Azure Detection Cognitive Skill [@azure_detection_cognitive_skill], and AWS Comprehend [@aws_comprehend] for pre-existing detections. The output of the adapters and the analysis service are analysis objects with a listing of detections, detection frequencies, severities, mean risk scores for each string processed, and summary statistics on the analysis made. 47 | 48 | The final outputs do not contain the original texts but provide the sanitized or anonymized texts and where to find the detections, should the end-user require this information. In providing this capability, one can prevent the accidental dissemination of private information in downstream research efforts, an issue commonly discussed in cybersecurity research [@belanger_crossler_2011; @moura_serrão_2019; @beigi_liu_2020]. 49 | 50 | ## Design 51 | 52 | PII-Codex is broken down into a series of services, utilities, and adapters. For a majority of cases, end-users may already have used Microsoft Presidio, Azure, AWS Comprehend or some other solution to detect PII in text. 
To account for these cases, adapters were provided to convert the varying detection results into a common form, DetectionResultItem and DetectionResult objects, which are later used by the Analysis Service and Assessment Service. This usage flow is presented in Figure 1. 53 | 54 | ![Converting And Analyzing Existing Detections\label{fig:converting-and-analyzing-existing-detections}](../docs/UC1_Converting_Existing_Detections_With_Adapters.png) 55 | 56 | As shown in Figure 2, for end-users that still require detections to be carried out, Microsoft Presidio was integrated as the primary analysis provider within the Analysis Service. 57 | 58 | ![Using Presidio-Enabled Builtin Service for Detection and Analysis\label{fig:using-presidio-builtins-for-analyses}](../docs/UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png) 59 | 60 | The Analysis and Assessment services expose functions for those defining their own detectors and enable the conversion to a common detection type so that the full Analysis Result set can be built. 61 | 62 | ## Example Usage 63 | 64 | The collection analysis permits a list of strings under 65 | texts parameter or a DataFrame with a text column under the data parameter. The collection will be analyzed and a summary provided in an AnalysisResultSet object. The AnalysisResultSet object will show individual detections and their risk assessments which includes risk score assessment and associated PII categories. Each analysis is provided with the sanitized input text when using the default analysis service. Unless supplied with another replacement token, the sanitized input text will contain in place of detected PII tokens: 66 | 67 | ``` 68 | Hi! My phone number is ." 69 | ``` 70 | 71 | Email detections, for example, are presented as Identifiable, which automatically places it at a risk level of 3, the highest a token is assigned. Something like a URL is considered Semi-Identifiable and therefore is assigned a risk level of 2. Other texts will fall under Non-Identifiable and will be assigned a risk level of 1. 72 | 73 | Using the `texts` parameter: 74 | 75 | ```python 76 | from pii_codex.services.analysis_service import PIIAnalysisService 77 | 78 | results = PIIAnalysisService().analyze_collection( 79 | texts=[ 80 | "email@example.com is the email I can be reached at.", 81 | "Their number is 555-555-5555" 82 | ] 83 | ) 84 | ``` 85 | 86 | Using the `data` parameter with metadata support for social media analysis: 87 | 88 | ```python 89 | import pandas as pd 90 | from pii_codex.services.analysis_service import PIIAnalysisService 91 | 92 | results = PIIAnalysisService().analyze_collection( 93 | data=pd.DataFrame.from_dict({ 94 | "text": [ 95 | "email@example.com is the email I can be reached at.", 96 | "Their number is 555-555-5555" 97 | ], 98 | "metadata": [ 99 | {"location": True, "url": False, "screen_name": True}, 100 | {"location": False, "url": False, "screen_name": True} 101 | ] 102 | }), 103 | collection_name="Social Media Example", 104 | collection_type="SAMPLE" 105 | ) 106 | ``` 107 | 108 | The AnalysisResultSet object will show individual detections and their risk assessments. Email detections, for example, are presented as identifiable and direct PII which automatically place it at a risk level of 3, the highest a token is assigned. 
109 | 
110 | 
111 | ```json
112 | {
113 |     "pii_type_detected": "EMAIL_ADDRESS",
114 |     "risk_level": 3,
115 |     "risk_level_definition": "Identifiable",
116 |     "cluster_membership_type": "Personal Preferences",
117 |     "hipaa_category": "Protected Health Information",
118 |     "dhs_category": "Stand Alone PII",
119 |     "nist_category": "Directly PII",
120 |     "entity_type": "EMAIL_ADDRESS",
121 |     "score": 1.0,
122 |     "start": 74,
123 |     "end": 94
124 | }
125 | ```
126 | 
127 | Each string analyzed may contain $n$ PII detections, with each detection having a risk severity between 1 and 3 inclusive. The risk score mean $\overline{rs}$ is the average of all token risk scores $rs$ for that one string. Since other detected data, while non-identifiable on its own, may provide context that can lead to identification, its tokens (assigned a risk score of 1 for non-identifiable) are also taken into account in the calculation. The calculation for a single string's risk score mean is presented in the formula below.
128 | 
129 | \begin{equation}
130 | \overline{rs} = \frac{1}{n} \sum_{i=1}^{n} rs_{i}
131 | \end{equation}
132 | 
133 | For collections of strings being analyzed, each risk score mean is taken into account to provide a collection-wide risk score mean value. Given that a collection can have $n$ analyzed strings, the collection risk score mean can be calculated with the mean-of-means formula below.
134 | 
135 | \begin{equation}
136 | \mu_{\overline{rs}} = \frac{\overline{rs}_1 + \overline{rs}_2 + \dots + \overline{rs}_n}{n}
137 | \end{equation}
138 | 
139 | In the AnalysisResult object, the mean risk score of all detected tokens in a string is provided as the risk score mean. In the AnalysisResultSet object, the mean of means, or the average of all the individual risk score means, is provided as the risk score mean.
140 | 
141 | ## Availability
142 | PII-Codex can be installed via pip or poetry. The source code of PII-Codex is available at the GitHub repository (https://github.com/EdyVision/pii-codex). The builds can be obtained from https://github.com/EdyVision/pii-codex/releases and via Zenodo [@rosado2022].
143 | 
144 | # References
--------------------------------------------------------------------------------