├── tests ├── __init__.py └── pii_codex │ ├── __init__.py │ ├── utils │ ├── __init__.py │ └── test_pii_mapping_util.py │ └── services │ ├── __init__.py │ ├── adapters │ ├── __init__.py │ └── detection_adapters │ │ ├── __init__.py │ │ ├── test_azure_detection_adapter.py │ │ ├── test_presidio_detection_adapter.py │ │ └── test_aws_detection_adapter.py │ ├── test_assessment_service.py │ └── test_detection_service.py ├── pii_codex ├── models │ ├── __init__.py │ ├── microsoft_presidio_pii.py │ ├── aws_pii.py │ ├── analysis.py │ └── common.py ├── utils │ ├── __init__.py │ ├── package_installer_util.py │ ├── logging.py │ ├── statistics_util.py │ └── pii_mapping_util.py ├── services │ ├── __init__.py │ ├── adapters │ │ ├── __init__.py │ │ └── detection_adapters │ │ │ ├── __init__.py │ │ │ ├── detection_adapter_base.py │ │ │ ├── azure_detection_adapter.py │ │ │ ├── presidio_detection_adapter.py │ │ │ └── aws_detection_adapter.py │ ├── analyzers │ │ ├── __init__.py │ │ └── presidio_analysis.py │ └── assessment_service.py ├── __init__.py └── config.py ├── docs ├── PII_Codex_Logo.png ├── UC1_Converting_Existing_Detections_With_Adapters.png ├── UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png ├── LOCAL_SETUP.md ├── MAPPING.md ├── dev │ └── pii_codex │ │ ├── config.html │ │ ├── services │ │ ├── analyzers │ │ │ └── index.html │ │ ├── adapters │ │ │ ├── index.html │ │ │ └── detection_adapters │ │ │ │ └── index.html │ │ └── index.html │ │ ├── index.html │ │ ├── models │ │ └── index.html │ │ └── utils │ │ ├── package_installer_util.html │ │ ├── index.html │ │ └── logging.html └── DETECTION_AND_ANALYSIS.md ├── scripts └── update_citation.sh ├── .pre-commit-config.yaml ├── codecov.yml ├── .github ├── pull_request_template.md ├── workflows │ ├── draft-pdf.yml │ ├── checks.yml │ ├── test.yml │ └── release.yml └── ISSUE_TEMPLATE │ └── issue_template.md ├── CITATION.cff ├── mypy.ini ├── .zenodo.json ├── LICENSE ├── Makefile ├── .gitignore ├── CONTRIBUTING.md ├── pyproject.toml └── joss ├── paper.bib └── paper.md /tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/models/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/services/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/services/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/__init__.py: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /pii_codex/services/analyzers/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.5.0" 2 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/PII_Codex_Logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EdyVision/pii-codex/HEAD/docs/PII_Codex_Logo.png -------------------------------------------------------------------------------- /scripts/update_citation.sh: -------------------------------------------------------------------------------- 1 | sed -r -i '' "s/([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])/$(date '+%Y-%m-%d')/g" CITATION.cff -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: https://github.com/ambv/black 3 | rev: 22.12.0 4 | hooks: 5 | - id: black 6 | language_version: python3.9 -------------------------------------------------------------------------------- /docs/UC1_Converting_Existing_Detections_With_Adapters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EdyVision/pii-codex/HEAD/docs/UC1_Converting_Existing_Detections_With_Adapters.png -------------------------------------------------------------------------------- /docs/UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EdyVision/pii-codex/HEAD/docs/UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png -------------------------------------------------------------------------------- /pii_codex/config.py: -------------------------------------------------------------------------------- 1 | from pii_codex.utils.pii_mapping_util import PIIMapper 2 | 3 | PII_MAPPER = PIIMapper() 4 | DEFAULT_LANG = "en" 5 | DEFAULT_ANALYSIS_MODE = "POPULATION" 6 | DEFAULT_TOKEN_REPLACEMENT_VALUE = "" 7 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | coverage: 2 | status: 3 | project: 4 | default: 5 | target: 90% # the required coverage value 6 | threshold: 1% # the leniency in hitting the target 7 | patch: 8 | default: 9 | target: 90% # the required coverage value 10 | threshold: 1% # the leniency in hitting the target 
-------------------------------------------------------------------------------- /pii_codex/utils/package_installer_util.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import sys 3 | 4 | 5 | def install_spacy_package(package_name): 6 | """ 7 | Installs missing spacy package (if found missing) 8 | @param package_name: 9 | @return: 10 | """ 11 | subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name]) 12 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ### Overview 2 | 3 | PR description. Explain what it's doing and why. 4 | 5 | - Implementation details... 6 | - Bugfix details... 7 | 8 | ### Checklist 9 | 10 | - [ ] New PII Detection Service Added (Optional) 11 | - [ ] New PII types added to pii_type_mappings (Optional) 12 | - [ ] Tests added 13 | - [ ] CI passed 14 | - [ ] Commits Squashed -------------------------------------------------------------------------------- /CITATION.cff: -------------------------------------------------------------------------------- 1 | cff-version: 1.2.0 2 | message: "If you use this software, please cite it as below." 3 | authors: 4 | - family-names: Rosado 5 | given-names: Eidan J. 6 | orcid: https://orcid.org/0000-0003-0665-098X 7 | title: "pii-codex: a Python library for PII detection, categorization, and severity assessment" 8 | version: 0.5.0 9 | doi: 10.5281/zenodo.7212576 10 | date-released: 2025-12-16 11 | -------------------------------------------------------------------------------- /.github/workflows/draft-pdf.yml: -------------------------------------------------------------------------------- 1 | on: [push] 2 | 3 | jobs: 4 | paper: 5 | runs-on: ubuntu-latest 6 | name: Paper Draft 7 | steps: 8 | - name: Checkout 9 | uses: actions/checkout@v3 10 | - name: Build draft PDF 11 | uses: openjournals/openjournals-draft-action@master 12 | with: 13 | journal: joss 14 | paper-path: ./joss/paper.md 15 | - name: Upload 16 | uses: actions/upload-artifact@v4 17 | with: 18 | name: paper 19 | path: ./joss/paper.pdf -------------------------------------------------------------------------------- /pii_codex/utils/logging.py: -------------------------------------------------------------------------------- 1 | from time import time 2 | import logging 3 | 4 | logger = logging.getLogger() 5 | 6 | 7 | def timed_operation(func): 8 | """ 9 | Used to show execution time for function 10 | @param func: 11 | @return: 12 | """ 13 | 14 | def wrapper_function(*args, **kwargs): 15 | """ 16 | Logs the function execution time 17 | 18 | @param args: 19 | @param kwargs: 20 | @return: 21 | """ 22 | start_time = time() 23 | result = func(*args, **kwargs) 24 | end_time = time() 25 | logger.info(f"{func.__name__!r} executed in {(end_time - start_time):.4f}s") 26 | return result 27 | 28 | return wrapper_function 29 | -------------------------------------------------------------------------------- /mypy.ini: -------------------------------------------------------------------------------- 1 | [mypy] 2 | python_version = 3.11 3 | no_implicit_optional = False 4 | warn_unused_ignores = True 5 | warn_redundant_casts = True 6 | warn_unused_configs = True 7 | disallow_untyped_defs = False 8 | disallow_incomplete_defs = False 9 | check_untyped_defs = False 10 | disable_error_code = annotation-unchecked 11 | 12 | [mypy-pytest.*] 13 | ignore_missing_imports = 
True 14 | 15 | [mypy-_pytest.*] 16 | ignore_missing_imports = True 17 | 18 | [mypy-assertpy.*] 19 | ignore_missing_imports = True 20 | 21 | [mypy-presidio_analyzer.*] 22 | ignore_missing_imports = True 23 | 24 | [mypy-presidio_anonymizer.*] 25 | ignore_missing_imports = True 26 | 27 | [mypy-spacy.*] 28 | ignore_missing_imports = True 29 | 30 | [mypy-pandas.*] 31 | ignore_missing_imports = True -------------------------------------------------------------------------------- /.github/workflows/checks.yml: -------------------------------------------------------------------------------- 1 | # Runs typecheck and lint process on feature branches only 2 | name: Checks 3 | 4 | on: 5 | push: 6 | branches-ignore: 7 | - main 8 | jobs: 9 | typecheck: 10 | runs-on: ubuntu-latest 11 | steps: 12 | - name: checkout 13 | uses: actions/checkout@v4 14 | - name: setup - python 15 | uses: actions/setup-python@v4 16 | with: 17 | python-version: 3.12 18 | - name: Install Global Dependencies 19 | run: pip install -U pip && pip install uv 20 | - name: install 21 | run: make install 22 | - name: typecheck 23 | run: make typecheck 24 | - name: lint 25 | run: make lint -------------------------------------------------------------------------------- /.zenodo.json: -------------------------------------------------------------------------------- 1 | { 2 | "access_right": "open", 3 | "version": "0.5.0", 4 | "creators": [ 5 | { 6 | "orcid": "0000-0003-0665-098X", 7 | "name": "Eidan J. Rosado", 8 | "affiliation": "College of Computing and Engineering, Nova Southeastern University, Fort Lauderdale, FL 33314, USA" 9 | } 10 | ], 11 | "keywords": ["Python", "PII", "PII Detection", "risk categories", 12 | "PII categorization","personal identifiable information", 13 | "Named Entity Recognition", "PII Topology"], 14 | "license": "BSD-3-Clause", 15 | "title": "pii-codex: a Python library for PII detection, categorization, and severity assessment", 16 | "language": "eng", 17 | "description": "A research python package for detecting, categorizing, and assessing the severity of personal identifiable information (PII)" 18 | } -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/detection_adapter_base.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.models.analysis import DetectionResult, DetectionResultItem 4 | 5 | 6 | class BasePIIDetectionAdapter: 7 | def convert_analyzed_item(self, pii_detection) -> List[DetectionResultItem]: 8 | """ 9 | Converts a detection result into a collection of DetectionResultItem 10 | 11 | @param pii_detection: dict 12 | @return: List[DetectionResultItem] 13 | """ 14 | raise Exception("Not implemented yet") 15 | 16 | def convert_analyzed_collection(self, pii_detections) -> List[DetectionResult]: 17 | """ 18 | Converts a collection of detection results to a collection of DetectionResult. 19 | 20 | @param pii_detections: List[dict] 21 | @return: List[DetectionResult] 22 | """ 23 | raise Exception("Not implemented yet") 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/issue_template.md: -------------------------------------------------------------------------------- 1 | Issue tracker is **ONLY** used for reporting bugs. New features should be discussed on our slack channel. Please use [stackoverflow](https://stackoverflow.com) for supporting issues. 
2 | 3 | 4 | 5 | ## Expected Behavior 6 | 7 | 8 | ## Current Behavior 9 | 10 | 11 | ## Possible Solution 12 | 13 | 14 | ## Steps to Reproduce 15 | 16 | 17 | 1. 18 | 2. 19 | 3. 20 | 4. 21 | 22 | ## Context (Environment) 23 | 24 | 25 | 26 | 27 | 28 | ## Detailed Description 29 | 30 | 31 | ## Possible Implementation 32 | -------------------------------------------------------------------------------- /pii_codex/models/microsoft_presidio_pii.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from enum import Enum 4 | 5 | 6 | class MSFTPresidioPIIType(Enum): 7 | """ 8 | PII Types associated with Microsoft Presidio Analyzer 9 | Supported Entities: https://microsoft.github.io/presidio/supported_entities/ 10 | """ 11 | 12 | PHONE_NUMBER = "PHONE_NUMBER" 13 | EMAIL_ADDRESS = "EMAIL_ADDRESS" 14 | ABA_ROUTING_NUMBER = "ABA_ROUTING_NUMBER" 15 | IP_ADDRESS = "IP_ADDRESS" 16 | DATE = "DATE_TIME" 17 | ADDRESS = "LOCATION" 18 | AGE = "AGE" 19 | PERSON = "PERSON" 20 | CREDIT_CARD_NUMBER = "CREDIT_CARD" 21 | CRYPTO = "CRYPTO" 22 | URL = "URL" 23 | DATE_TIME = "DATE_TIME" 24 | LOCATION = "LOCATION" 25 | NRP = "NRP" 26 | MEDICAL_LICENSE = "MEDICAL_LICENSE" 27 | US_SOCIAL_SECURITY_NUMBER = "US_SSN" 28 | US_BANK_ACCOUNT_NUMBER = "US_BANK_NUMBER" 29 | US_DRIVERS_LICENSE_NUMBER = "US_DRIVER_LICENSE" 30 | US_PASSPORT_NUMBER = "US_PASSPORT" 31 | US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_ITIN" 32 | INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "IBAN_CODE" 33 | # UK_NATIONAL_HEALTH_NUMBER = "UK_NHS" # To be added in future versions 34 | AU_BUSINESS_NUMBER = "AU_ABN" 35 | AU_COMPANY_NUMBER = "AU_ACN" 36 | AU_MEDICAL_ACCOUNT_NUMBER = "AU_MEDICARE" 37 | AU_TAX_FILE_NUMBER = "AU_TFN" 38 | -------------------------------------------------------------------------------- /.github/workflows/test.yml: -------------------------------------------------------------------------------- 1 | name: Test 2 | 3 | on: 4 | # Triggers the test workflow on push for all branches 5 | push: 6 | paths: 7 | - "pii_codex/**" 8 | - "uv.lock" 9 | - "pyproject.toml" 10 | - ".github/workflows/test.yml" 11 | 12 | # Allows you to run this workflow manually from the Actions tab 13 | workflow_dispatch: 14 | 15 | jobs: 16 | build: 17 | name: Run Tests 18 | strategy: 19 | matrix: 20 | python-version: [ "3.9", "3.10", "3.11", "3.12" ] 21 | os: [ubuntu-latest, macos-latest] 22 | runs-on: ${{ matrix.os }} 23 | 24 | # Checkout the code, install uv, install dependencies, 25 | # and run tests with coverage 26 | steps: 27 | - name: Environment Setup 28 | uses: actions/checkout@v4 29 | - name: Setup Python 30 | uses: actions/setup-python@v4 31 | with: 32 | python-version: ${{ matrix.python-version }} 33 | - name: Install Global Dependencies 34 | run: pip install -U pip && pip install uv 35 | - name: Install Project Dependencies 36 | run: | 37 | make install 38 | - name: Run Tests 39 | run: | 40 | make test.coverage 41 | - uses: codecov/codecov-action@v3 # Coverage report submitted only on merge to main 42 | with: 43 | token: ${{ secrets.CODECOV_TOKEN }} 44 | name: codecov-umbrella 45 | files: ./coverage.xml 46 | verbose: true 47 | -------------------------------------------------------------------------------- /pii_codex/utils/statistics_util.py: -------------------------------------------------------------------------------- 1 | import statistics 2 | 3 | import numpy as np 4 | 5 | 6 | def get_population_standard_deviation(values) -> float: 7 | return statistics.pstdev(values) 8 | 9 | 
10 | def get_population_variance(values) -> float: 11 | return statistics.pvariance(values) 12 | 13 | 14 | def get_standard_deviation(values, collection_type: str) -> float: 15 | if collection_type.lower() != "sample" and collection_type.lower() != "population": 16 | raise Exception("Invalid collection type. Must be 'SAMPLE' or 'POPULATION'.") 17 | 18 | return ( 19 | statistics.stdev(values) 20 | if collection_type.lower() == "sample" 21 | else get_population_standard_deviation(values) 22 | ) 23 | 24 | 25 | def get_variance(values, collection_type: str) -> float: 26 | if collection_type.lower() != "sample" and collection_type.lower() != "population": 27 | raise Exception("Invalid collection type. Must be 'SAMPLE' or 'POPULATION'.") 28 | 29 | return ( 30 | statistics.variance(values) 31 | if collection_type.lower() == "sample" 32 | else get_population_variance(values) 33 | ) 34 | 35 | 36 | def get_mean(values) -> float: 37 | return statistics.mean(values) 38 | 39 | 40 | def get_median(values) -> float: 41 | return statistics.median(values) 42 | 43 | 44 | def get_mode(values): 45 | return statistics.mode(values) 46 | 47 | 48 | def get_sum(values): 49 | return np.sum(values) 50 | -------------------------------------------------------------------------------- /docs/LOCAL_SETUP.md: -------------------------------------------------------------------------------- 1 | # New Project Setup 2 | See README.md for step-by-step details on installing the pii-codex package as an import. 3 | 4 | Video Demo of package import and usage: 5 |
6 | 7 | [![PII-Codex Video Demo](https://img.youtube.com/vi/51TP2I5SNlo/0.jpg)](https://youtu.be/51TP2I5SNlo) 8 | 9 |
10 | 11 | `Note: This video has no sound, it just shows steps taken to install the package and use it in a file.` 12 | 13 | # Local Repo Setup 14 | For those contributing or modifying the source, use the following to set up locally. 15 | 16 | ## Environment Config 17 | You'll need Python (^3.11) and `uv` configured on your machine. Once those are configured, create a virtual 18 | environment and install dependencies. 19 | 20 | ```bash 21 | uv sync 22 | ``` 23 | 24 | Installing dependencies will vary by usage. For those in need of the PII-Codex integration of the MSFT Presidio Analyzer, it is recommended to install the `detections` extras: 25 | 26 | ```bash 27 | uv sync --extra detections 28 | ``` 29 | 30 | As part of the `detections` extras installation, the download for the `en_core_web_lg` spaCy model will be enabled on first use of the `PresidioPIIAnalyzer()`. If more language support is needed, you'll need to download it separately. Reference explosion/spacy-models. 31 | 32 | Depending on your setup and need, you may need to add the virtual environment to Jupyter. You may do so with the following command: 33 | 34 | ```bash 35 | make jupyter.attach.venv 36 | ``` -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Eidan J. Rosado 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # PII Codex Makefile 2 | # Used from GitHub workflow and locally 3 | default: install test lint 4 | 5 | test: lint test.all 6 | test.cov: test.coverage 7 | 8 | install: 9 | @uv sync 10 | @uv sync --all-extras 11 | @uv sync --extra dev 12 | $(MAKE) install.pre_commit 13 | @echo "Installation complete!" 
14 | 15 | install.pre_commit: 16 | @uv run pre-commit install || (echo "Warning: pre-commit installation failed. You may need to run 'uv sync --extra dev' first." && exit 1) 17 | 18 | install.extras: 19 | @echo "Installing detection dependencies (spaCy, Presidio, etc.)..." 20 | @uv sync --extra detections 21 | 22 | install_spacy_en_core_web_lg: 23 | @python3 -m spacy download en_core_web_lg 24 | 25 | jupyter.attach.venv: 26 | @python3 -m ipykernel install --user --name=venv 27 | 28 | test.all: 29 | @pytest tests 30 | 31 | test.coverage: 32 | @uv run coverage run -m pytest -vv tests && uv run coverage report -m --omit="*/test*,config/*.conf" --fail-under=95 33 | @uv run coverage xml 34 | 35 | lint: 36 | @uv run pylint pii_codex tests 37 | 38 | typecheck: 39 | @uv run mypy pii_codex tests 40 | 41 | format.check: 42 | @black . --check 43 | 44 | format.fix: 45 | @black . 46 | 47 | bump.citation.date: 48 | ./scripts/update_citation.sh 49 | 50 | docs: 51 | @pdoc --html pii_codex --force -o ./docs/dev 52 | 53 | version.bump.patch: 54 | @uv run bumpver update --patch 55 | # $(MAKE) bump.citation.date 56 | 57 | version.bump.minor: 58 | @uv run bumpver update --minor 59 | $(MAKE) bump.citation.date 60 | 61 | version.bump.major: 62 | @uv run bumpver update --major 63 | $(MAKE) bump.citation.date 64 | 65 | package: 66 | @uv build 67 | -------------------------------------------------------------------------------- /tests/pii_codex/services/test_assessment_service.py: -------------------------------------------------------------------------------- 1 | from assertpy import assert_that 2 | from pii_codex.models.common import PIIType 3 | from pii_codex.models.analysis import RiskAssessment 4 | from pii_codex.services.assessment_service import PIIAssessmentService 5 | from pii_codex.utils.statistics_util import get_mean 6 | 7 | 8 | class TestPIIAssessmentService: 9 | pii_assessment_service = PIIAssessmentService() 10 | 11 | def test_assess_pii_type(self): 12 | risk_assessment: RiskAssessment = self.pii_assessment_service.assess_pii_type( 13 | detected_pii_type=PIIType.US_SOCIAL_SECURITY_NUMBER.name 14 | ) 15 | assert_that(risk_assessment.risk_level).is_equal_to(3) 16 | assert_that(risk_assessment.pii_type_detected).is_equal_to( 17 | PIIType.US_SOCIAL_SECURITY_NUMBER.name 18 | ) 19 | 20 | def test_assess_pii_type_list(self): 21 | # PII types with same ratings 22 | risk_assessment_list = self.pii_assessment_service.assess_pii_type_list( 23 | detected_pii_types=[ 24 | PIIType.US_SOCIAL_SECURITY_NUMBER.name, 25 | PIIType.PHONE_NUMBER.name, 26 | ] 27 | ) 28 | assert_that(isinstance(risk_assessment_list[0], RiskAssessment)) 29 | assert_that( 30 | get_mean([assessment.risk_level for assessment in risk_assessment_list]) 31 | ).is_equal_to(3.0) 32 | 33 | # PII types with different ratings 34 | risk_assessment_list = self.pii_assessment_service.assess_pii_type_list( 35 | detected_pii_types=[ 36 | PIIType.US_SOCIAL_SECURITY_NUMBER.name, 37 | PIIType.RACE.name, 38 | ] 39 | ) 40 | assert_that(isinstance(risk_assessment_list[0], RiskAssessment)) 41 | assert_that( 42 | get_mean([assessment.risk_level for assessment in risk_assessment_list]) 43 | ).is_equal_to(2.5) 44 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/azure_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.config import PII_MAPPER 4 | from pii_codex.models.analysis import 
DetectionResultItem, DetectionResult 5 | from pii_codex.services.adapters.detection_adapters.detection_adapter_base import ( 6 | BasePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | class AzurePIIDetectionAdapter(BasePIIDetectionAdapter): 11 | def convert_analyzed_item(self, pii_detection: dict): 12 | """ 13 | Converts a detection result into a collection of DetectionResultItem 14 | 15 | @param pii_detection: dict 16 | @return: List[DetectionResultItem] 17 | """ 18 | return [ 19 | DetectionResultItem( 20 | entity_type=PII_MAPPER.convert_azure_pii_to_common_pii_type( 21 | entity["category"] 22 | ).name, 23 | score=entity["confidence_score"], 24 | start=entity["offset"], 25 | end=entity["offset"] + entity["length"], 26 | ) 27 | for entity in pii_detection["entities"] 28 | ] 29 | 30 | def convert_analyzed_collection( 31 | self, pii_detections: List[dict] 32 | ) -> List[DetectionResult]: 33 | """ 34 | Converts a collection of detection results to a collection of DetectionResult. 35 | 36 | @param pii_detections: List[dict] 37 | @return: List[DetectionResultItem] 38 | """ 39 | detection_results: List[DetectionResult] = [] 40 | for i, detection in enumerate(pii_detections): 41 | # Return results in formatted Analysis Result List object 42 | detection_results.append( 43 | DetectionResult( 44 | index=i, 45 | detections=self.convert_analyzed_item(pii_detection=detection), 46 | ) 47 | ) 48 | 49 | return detection_results 50 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | **/.DS_Store 30 | **/__pycache__ 31 | bin 32 | 33 | # PyInstaller 34 | # Usually these files are written by a python script from a template 35 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 36 | *.manifest 37 | *.spec 38 | 39 | # Installer logs 40 | pip-log.txt 41 | pip-delete-this-directory.txt 42 | 43 | # Unit test / coverage reports 44 | htmlcov/ 45 | .tox/ 46 | .nox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *.cover 53 | *.py,cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # IPython 71 | profile_default/ 72 | ipython_config.py 73 | 74 | # pyenv 75 | .python-version 76 | 77 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 78 | __pypackages__/ 79 | 80 | # Environments 81 | .env 82 | .venv 83 | env/ 84 | venv/ 85 | ENV/ 86 | env.bak/ 87 | venv.bak/ 88 | .envrc 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | .dmypy.json 103 | dmypy.json 104 | 105 | # Pyre type checker 106 | .pyre/ 107 | 108 | # Editor stuff 109 | .idea 110 | .vscode 111 | 112 | backend/bin 113 | 114 | # Test output 115 | coverage.xml 116 | pytest.log 117 | tests/pii_codex/data -------------------------------------------------------------------------------- /pii_codex/models/aws_pii.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from enum import Enum 4 | 5 | 6 | # PII Types and Models as expanded in research 7 | # Research Page: 8 | # AWS Comprehend PII Docs: https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html 9 | 10 | 11 | class AWSComprehendPIIType(Enum): 12 | """ 13 | AWS Comprehend-Supported PII types 14 | """ 15 | 16 | EMAIL_ADDRESS = "EMAIL" 17 | ADDRESS = "ADDRESS" 18 | PERSON = "NAME" 19 | PHONE_NUMBER = "PHONE" 20 | DATE = "DATE_TIME" 21 | URL = "URL" 22 | AGE = "AGE" 23 | USERNAME = "USERNAME" 24 | PASSWORD = "PASSWORD" 25 | CREDIT_DEBIT_NUMBER = "CREDIT_DEBIT_NUMBER" 26 | CREDIT_DEBIT_CVV = "CREDIT_DEBIT_CVV" 27 | CREDIT_DEBIT_EXPIRY = "CREDIT_DEBIT_EXPIRY" 28 | PIN = "PIN" 29 | US_DRIVERS_LICENSE_NUMBER = "DRIVER_ID" 30 | LICENSE_PLATE_NUMBER = "LICENSE_PLATE" 31 | VEHICLE_IDENTIFICATION_NUMBER = "VEHICLE_IDENTIFICATION_NUMBER" 32 | INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "INTERNATIONAL_BANK_ACCOUNT_NUMBER" 33 | SWIFT_CODE = "SWIFT_CODE" 34 | CRYPTO = "CRYPTO_WALLET_ADDRESS" 35 | IP_ADDRESS = "IP_ADDRESS" 36 | IPV6_ADDRESS = "IPV6_ADDRESS" 37 | MAC_ADDRESS = "MAC_ADDRESS" 38 | AWS_ACCESS_KEY = "AWS_ACCESS_KEY" 39 | AWS_SECRET_KEY = "AWS_SECRET_KEY" 40 | US_PASSPORT_NUMBER = "PASSPORT_NUMBER" 41 | US_SOCIAL_SECURITY_NUMBER = "SSN" 42 | US_BANK_ACCOUNT_NUMBER = "BANK_ACCOUNT_NUMBER" 43 | ABA_ROUTING_NUMBER = "BANK_ROUTING" 44 | US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_INDIVIDUAL_TAXPAYER_IDENTIFICATION" 45 | UK_NATIONAL_HEALTH_SERVICE_NUMBER = "UK_NATIONAL_HEALTH_SERVICE_NUMBER" 46 | UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER = "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER" 47 | UK_NATIONAL_INSURANCE_NUMBER = "UK_NATIONAL_INSURANCE_NUMBER" 48 | CA_HEALTH_NUMBER = "CA_HEALTH_NUMBER" 49 | CA_SOCIAL_INSURANCE_NUMBER = "CA_SOCIAL_INSURANCE_NUMBER" 50 | IN_AADHAAR = "IN_AADHAAR" 51 | IN_VOTER_NUMBER = "IN_VOTER_NUMBER" 52 | IN_PERMANENT_ACCOUNT_NUMBER = "IN_PERMANENT_ACCOUNT_NUMBER" 53 | IN_NREGA = "IN_NREGA" 54 | ALL = "ALL" 55 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/presidio_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.config import PII_MAPPER 4 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 5 | from pii_codex.services.adapters.detection_adapters.detection_adapter_base import ( 6 | BasePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | class PresidioPIIDetectionAdapter(BasePIIDetectionAdapter): 11 | 12 | """ 13 | Intended for those that are using their own pre-detected result set from Presidio 14 | """ 15 | 16 | def convert_analyzed_item(self, 
pii_detection) -> List[DetectionResultItem]: 17 | """ 18 | Converts a single Presidio analysis attempt into a collection of DetectionResultItem objects. One string 19 | analysis by Presidio returns an array of RecognizerResult objects. 20 | 21 | @param pii_detection: RecognizerResult from presidio analyzer 22 | @return: List[DetectionResultItem] 23 | """ 24 | 25 | return [ 26 | DetectionResultItem( 27 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 28 | result.entity_type 29 | ).name, 30 | score=result.score, 31 | start=result.start, 32 | end=result.end, 33 | ) 34 | for result in pii_detection 35 | ] 36 | 37 | def convert_analyzed_collection(self, pii_detections) -> List[DetectionResult]: 38 | """ 39 | Converts a collection of Presidio analysis results to a collection of DetectionResult. A collection of Presidio 40 | analysis results ends up being a 2D array. 41 | 42 | @param pii_detections: List[List[RecognizerResult]] - list of individual analyses from Presidio 43 | 44 | """ 45 | 46 | detection_results: List[DetectionResult] = [] 47 | for i, detection in enumerate(pii_detections): 48 | # Return results in formatted Analysis Result List object 49 | detection_results.append( 50 | DetectionResult( 51 | index=i, 52 | detections=self.convert_analyzed_item(pii_detection=detection), 53 | ) 54 | ) 55 | 56 | return detection_results 57 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/test_azure_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from assertpy import assert_that 2 | 3 | from pii_codex.models.analysis import DetectionResultItem 4 | from pii_codex.models.azure_pii import AzurePIIType 5 | from pii_codex.services.adapters.detection_adapters.azure_detection_adapter import ( 6 | AzurePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | def test_azure_analysis_single_item_conversion(): 11 | # with pytest.raises(Exception) as ex_info: 12 | conversion_results = AzurePIIDetectionAdapter().convert_analyzed_item( 13 | pii_detection={ 14 | "entities": [ 15 | { 16 | "text": "My email is example@example.eu.edu", 17 | "category": AzurePIIType.EMAIL_ADDRESS.value, 18 | "subcategory": None, 19 | "length": 22, 20 | "offset": 11, 21 | "confidence_score": 0.8, 22 | } 23 | ] 24 | } 25 | ) 26 | 27 | assert_that(conversion_results).is_not_none() 28 | assert_that(isinstance(conversion_results[0], DetectionResultItem)).is_true() 29 | 30 | 31 | def test_azure_analysis_collection_conversion(): 32 | conversion_results = AzurePIIDetectionAdapter().convert_analyzed_collection( 33 | pii_detections=[ 34 | { 35 | "entities": [ 36 | { 37 | "text": "My email is example@example.eu.edu", 38 | "category": AzurePIIType.EMAIL_ADDRESS.value, 39 | "subcategory": None, 40 | "length": 22, 41 | "offset": 11, 42 | "confidence_score": 0.8, 43 | } 44 | ] 45 | }, 46 | { 47 | "entities": [ 48 | { 49 | "text": "My email is example1@example.eu.edu", 50 | "category": AzurePIIType.EMAIL_ADDRESS.value, 51 | "subcategory": None, 52 | "length": 23, 53 | "offset": 11, 54 | "confidence_score": 0.8, 55 | } 56 | ] 57 | }, 58 | ] 59 | ) 60 | 61 | assert_that(conversion_results).is_not_empty() 62 | assert_that(conversion_results[1].index).is_greater_than( 63 | conversion_results[0].index 64 | ) 65 | assert_that( 66 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 67 | ).is_true() 68 | -------------------------------------------------------------------------------- 
 /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # PII-Codex Community Guidelines 2 | The following aims to highlight ways to contribute to PII-Codex. Code reviews, documentation updates, and pull requests are always welcome. Contribute to the source code directly by forking the project, modifying the code, and creating a pull request for review. 3 | 4 | Please use clear and organized descriptions when creating issues and pull requests and leverage the templates when possible. 5 | 6 | ### Bug Report and Support Requests 7 | You can use issues to report bugs and seek support. Before creating any new issues, please check for similar ones in the issue list first. 8 | 9 | ### Submission Checklist 10 | When submitting a pull request, please check the following: 11 | 12 | Unit tests, documentation, and code style are in order. The GitHub action `test.yml` will automatically run tests and check whether the coverage threshold is still being met. 13 | 14 | Other checks such as the typechecker and linter will also run with every feature branch push. These checks can also be performed with the following commands: 15 | 16 | ```bash 17 | make typecheck 18 | make lint 19 | make test.coverage 20 | ``` 21 | 22 | Works in progress will be considered if you're unsure whether these requirements are fully met; in that case, you'll likely be asked to make some further changes. 23 | 24 | ### Updating Documentation 25 | Docs are autogenerated by [pdoc](https://pdoc3.github.io/pdoc/). To update documentation, run the command `make docs` from terminal / command line. The generated docs will appear in docs/dev in HTML format. If the auto-generated doc entry is missing context, check the docstring added to the new code. 26 | 27 | ### Creating Releases 28 | With any change (non-breaking dependency upgrades, bug fixes, etc.), the version will need to be updated to the next value and a new release created. Using the command listing below, run the one associated with the changes: 29 | 30 | ```bash 31 | make version.bump.patch # use this for bug fixes, test coverage, non-breaking dependency upgrades 32 | 33 | make version.bump.minor # use this for minor business logic changes - must be non-breaking 34 | 35 | make version.bump.major # use this for anything that is not backwards compatible 36 | ``` 37 | 38 | Upon merging the changes into main, the `release.yml` workflow will automatically compare the last release with the version noted in the CITATION.cff and pyproject.toml. If a mismatch is found, a release will be created. 39 | 40 | ### Licensing 41 | The contributed code will be licensed under PII-Codex's license, https://github.com/EdyVision/pii-codex/blob/main/LICENSE. If you did not write the code yourself, you must ensure the existing license is compatible and include the license information in the contributed files, or obtain permission from the original author to relicense the contributed code. -------------------------------------------------------------------------------- /.github/workflows/release.yml: -------------------------------------------------------------------------------- 1 | # Automatically creates a new tag and GitHub release if the pyproject.toml version doesn't match the last release version 2 | name: Release 3 | 4 | on: 5 | workflow_dispatch: 6 | inputs: 7 | tag: 8 | description: 'Tag or ref to release from (e.g. v0.5.0). Leave empty to use main.'
9 | required: false 10 | default: '' 11 | 12 | push: 13 | branches: 14 | - main 15 | 16 | jobs: 17 | build: 18 | name: Release 19 | runs-on: ubuntu-latest 20 | steps: 21 | - name: Setup Python 22 | uses: actions/setup-python@v4 23 | with: 24 | python-version: 3.12 25 | - name: Checkout branch "main" 26 | uses: actions/checkout@v4 27 | with: 28 | ref: ${{ github.event.inputs.tag || 'main' }} 29 | fetch-depth: 0 30 | - name: Install Global Dependencies 31 | run: pip install -U pip && pip install uv 32 | - name: Build project for distribution 33 | run: uv build 34 | - name: Get Current Version 35 | id: get_version 36 | run: | 37 | TAG_NAME=$(uv run python -c "import tomllib; print(tomllib.load(open('pyproject.toml', 'rb'))['project']['version'])") 38 | echo "TAG_NAME=v$TAG_NAME" >> $GITHUB_ENV 39 | echo "$TAG_NAME" 40 | - name: Check Released Versions 41 | id: get_last_release_version 42 | run: | 43 | LAST_RELEASE=$(git tag --sort=committerdate | tail -1) 44 | echo "LAST_RELEASE_VERSION=$LAST_RELEASE" >> $GITHUB_ENV 45 | echo "Last released tag: $LAST_RELEASE" 46 | - name: Check for Version Mismatch 47 | shell: bash 48 | if: ${{ env.LAST_RELEASE_VERSION != env.TAG_NAME }} 49 | run: | 50 | echo "New version found. Matching release will be created." 51 | echo "Last version: ${{ env.LAST_RELEASE_VERSION }}" 52 | echo "Current version: ${{ env.TAG_NAME }}" 53 | - name: Tag and Release GitHub Snapshot 54 | id: release-snapshot 55 | if: ${{ github.event_name == 'workflow_dispatch' || env.LAST_RELEASE_VERSION != env.TAG_NAME }} 56 | uses: ncipollo/release-action@v1 57 | with: 58 | token: ${{ secrets.GITHUB_TOKEN }} 59 | commit: main 60 | tag: ${{ env.TAG_NAME }} 61 | skipIfReleaseExists: true 62 | draft: false 63 | prerelease: false 64 | - name: Publish to PyPI 65 | if: ${{ github.event_name == 'workflow_dispatch' || env.LAST_RELEASE_VERSION != env.TAG_NAME }} 66 | env: 67 | TWINE_USERNAME: __token__ 68 | TWINE_PASSWORD: ${{ secrets.PII_CODEX_PYPI_TOKEN }} 69 | run: | 70 | python -m pip install --upgrade pip 71 | pip install twine 72 | twine upload dist/* -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/test_presidio_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from assertpy import assert_that 4 | from presidio_analyzer import RecognizerResult 5 | 6 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 7 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 8 | from pii_codex.services.adapters.detection_adapters.presidio_detection_adapter import ( 9 | PresidioPIIDetectionAdapter, 10 | ) 11 | 12 | 13 | presidio_adapter = PresidioPIIDetectionAdapter() 14 | 15 | 16 | def test_presidio_analysis_single_item_conversion(): 17 | conversion_results: List[ 18 | DetectionResultItem 19 | ] = presidio_adapter.convert_analyzed_item( 20 | pii_detection=[ 21 | RecognizerResult( 22 | entity_type=MSFTPresidioPIIType.EMAIL_ADDRESS.value, 23 | start=123, 24 | end=456, 25 | score=0.98, 26 | ), 27 | RecognizerResult( 28 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 29 | start=123, 30 | end=456, 31 | score=0.73, 32 | ), 33 | ] 34 | ) 35 | 36 | assert_that(conversion_results).is_not_empty() 37 | assert_that(isinstance(conversion_results[0], DetectionResultItem)).is_true() 38 | 39 | 40 | def test_presidio_analysis_collection_conversion(): 41 | conversion_results: List[ 42 | DetectionResult 43 | ] 
= presidio_adapter.convert_analyzed_collection( 44 | pii_detections=[ 45 | [ 46 | RecognizerResult( 47 | entity_type=MSFTPresidioPIIType.EMAIL_ADDRESS.value, 48 | start=123, 49 | end=456, 50 | score=0.98, 51 | ), 52 | RecognizerResult( 53 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 54 | start=123, 55 | end=456, 56 | score=0.973, 57 | ), 58 | ], 59 | [ 60 | RecognizerResult( 61 | entity_type=MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value, 62 | start=123, 63 | end=456, 64 | score=0.98, 65 | ), 66 | RecognizerResult( 67 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 68 | start=123, 69 | end=456, 70 | score=0.973, 71 | ), 72 | ], 73 | ] 74 | ) 75 | 76 | assert_that(conversion_results).is_not_empty() 77 | assert_that(conversion_results[1].index).is_greater_than( 78 | conversion_results[0].index 79 | ) 80 | assert_that( 81 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 82 | ).is_true() 83 | -------------------------------------------------------------------------------- /pii_codex/services/assessment_service.py: -------------------------------------------------------------------------------- 1 | from typing import List, Tuple 2 | from collections import Counter 3 | from itertools import chain 4 | 5 | from ..config import PII_MAPPER 6 | from ..models.analysis import RiskAssessment, AnalysisResult 7 | from ..utils.statistics_util import get_mean, get_sum 8 | 9 | 10 | class PIIAssessmentService: 11 | """ 12 | Class for mapping PII types to categories and extracting them. 13 | """ 14 | 15 | def assess_pii_type(self, detected_pii_type: str) -> RiskAssessment: 16 | """ 17 | Assesses a singular detected PII type given a type name string from common.PIIType enum 18 | @param detected_pii_type: type name strings from common.PIIType enum 19 | @return: RiskAssessment 20 | """ 21 | return PII_MAPPER.map_pii_type(detected_pii_type) 22 | 23 | def assess_pii_type_list( 24 | self, detected_pii_types: List[str] 25 | ) -> List[RiskAssessment]: 26 | """ 27 | Assesses a list of detected PII types given an array of type name strings from common.PIIType enum 28 | @param detected_pii_types: array type name strings from common.PIIType 29 | enum (e.g. 
["PHONE_NUMBER", "US_SOCIAL_SECURITY_NUMBER"]) 30 | @return: List[RiskAssessment] 31 | """ 32 | ranked_pii: List[RiskAssessment] = [] 33 | 34 | for pii_type in detected_pii_types: 35 | ranked_pii.append(PII_MAPPER.map_pii_type(pii_type)) 36 | 37 | return ranked_pii 38 | 39 | @staticmethod 40 | def calculate_risk_assessment_score_average( 41 | risk_assessments: List[RiskAssessment], 42 | ) -> float: 43 | """ 44 | Returns the average risk score per token 45 | 46 | @param risk_assessments: 47 | @return: float 48 | """ 49 | return get_mean([assessment.risk_level for assessment in risk_assessments]) 50 | 51 | @staticmethod 52 | def get_detected_pii_count(analyses: List[AnalysisResult]) -> int: 53 | """ 54 | Returns the count of detected PII for analyses performed on a collection 55 | 56 | @param analyses: List[ScoredAnalysisResult] 57 | @return: float 58 | """ 59 | return get_sum( 60 | [ 61 | len(analysis.analysis) 62 | for analysis in analyses 63 | if analysis.get_detected_types() 64 | ] 65 | ) 66 | 67 | @staticmethod 68 | def get_detected_pii_types( 69 | analyses: List[AnalysisResult], 70 | ) -> Tuple[set[str], Counter]: 71 | """ 72 | Returns the list of detected PII types and their frequencies for analyses performed on a collection 73 | 74 | @param analyses: List[ScoredAnalysisResult] 75 | @return: Tuple[List[str], Counter] 76 | """ 77 | flattened_list_of_detections = list( 78 | chain.from_iterable( 79 | [analysis.get_detected_types() for analysis in analyses] 80 | ) 81 | ) 82 | 83 | return set(flattened_list_of_detections), Counter(flattened_list_of_detections) 84 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "pii-codex" 3 | version = "0.5.0" 4 | description = "PII Detection, Categorization, and Severity Assessment" 5 | authors = [ 6 | {name = "Eidan J. 
Rosado"} 7 | ] 8 | license = {text = "BSD 3-Clause"} 9 | readme = "README.md" 10 | homepage = "https://github.com/EdyVision/pii-codex" 11 | repository = "https://github.com/EdyVision/pii-codex" 12 | keywords = ["PII", "PII topology", "risk categories", "personal identifiable information", "risk assessment"] 13 | classifiers = [ 14 | "Development Status :: 1 - Planning", 15 | "Intended Audience :: Science/Research", 16 | "License :: OSI Approved :: BSD License", 17 | "Programming Language :: Python :: 3", 18 | ] 19 | requires-python = ">=3.9,<3.13" 20 | dependencies = [ 21 | "dataclasses-json>=0.5.7,<0.6.0", 22 | "pydantic[dotenv]>=1.10.2,<2.0.0", 23 | "pandas>=2.0.0,<3.0.0", 24 | "pillow>=10.0.0,<11.0.0", 25 | "pdoc>=13.1.1,<14.0.0", 26 | ] 27 | 28 | [project.optional-dependencies] 29 | detections = [ 30 | "spacy>=3.8.7", 31 | "presidio-analyzer>=2.2.359", 32 | "presidio-anonymizer>=2.2.359" 33 | ] 34 | dev = [ 35 | "pytest>=7.4.0,<8.0.0", 36 | "black>=23.0.0,<24.0.0", 37 | "pylint>=3.0.0,<4.0.0", 38 | "mypy>=1.5.0,<2.0.0", 39 | "coverage>=7.2.0,<8.0.0", 40 | "assertpy>=1.1,<2.0.0", 41 | "Faker>=19.0.0,<20.0.0", 42 | "matplotlib>=3.7.0,<4.0.0", 43 | "ipykernel>=6.25.0,<7.0.0", 44 | "jupyter>=1.0.0,<2.0.0", 45 | "jupyter_core>=5.0.0,<6.0.0", 46 | "importlib-resources>=6.0.0,<7.0.0", 47 | "seaborn>=0.12.0,<1.0.0", 48 | "pre-commit>=3.0.0,<4.0.0", 49 | "spacy>=3.8.7", 50 | "presidio-analyzer>=2.2.360", 51 | "presidio-anonymizer>=2.2.360", 52 | "tomli>=1.2.0,<3.0.0" 53 | ] 54 | 55 | [project.urls] 56 | Homepage = "https://github.com/EdyVision/pii-codex" 57 | Repository = "https://github.com/EdyVision/pii-codex" 58 | 59 | [bumpver] 60 | current_version = "0.5.0" 61 | version_pattern = "MAJOR.MINOR.PATCH" 62 | files = [ 63 | "pyproject.toml", 64 | "CITATION.cff", 65 | ".zenodo.json" 66 | ] 67 | 68 | [bumpver.file_patterns] 69 | "pyproject.toml" = [ 70 | '^version = "{version}"$', 71 | '^current_version = "{version}"$', 72 | ] 73 | "pii_codex/__init__.py" = [ 74 | '^__version__ = "{version}"$', 75 | ] 76 | "CITATION.cff" = [ 77 | '^version: {version}$', 78 | ] 79 | ".zenodo.json" = [ 80 | '^ "version": "{version}",$', 81 | ] 82 | 83 | [tool.pytest.ini_options] 84 | minversion = "7.0" 85 | testpaths = [ 86 | "tests" 87 | ] 88 | 89 | log_cli = true 90 | log_cli_level = "INFO" 91 | log_cli_format = "%(message)s" 92 | 93 | log_file = "pytest.log" 94 | log_file_level = "DEBUG" 95 | log_file_format = "%(asctime)s [%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)" 96 | log_file_date_format = "%Y-%m-%d %H:%M:%S" 97 | 98 | [build-system] 99 | requires = ["hatchling"] 100 | build-backend = "hatchling.build" 101 | 102 | [dependency-groups] 103 | dev = [ 104 | "bumpver>=2025.1131", 105 | ] 106 | 107 | [tool.hatch.metadata] 108 | allow-direct-references = true 109 | -------------------------------------------------------------------------------- /tests/pii_codex/services/adapters/detection_adapters/test_aws_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import pytest 4 | from assertpy import assert_that 5 | 6 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 7 | from pii_codex.models.aws_pii import AWSComprehendPIIType 8 | from pii_codex.services.adapters.detection_adapters.aws_detection_adapter import ( 9 | AWSComprehendPIIDetectionAdapter, 10 | ) 11 | 12 | 13 | @pytest.mark.parametrize( 14 | "expected_result,expected_pii", 15 | [ 16 | (False, []), 17 | (False, []), 18 | (True, 
[AWSComprehendPIIType.EMAIL_ADDRESS]), 19 | ( 20 | True, 21 | [AWSComprehendPIIType.EMAIL_ADDRESS], 22 | ), 23 | ( 24 | True, 25 | [AWSComprehendPIIType.PHONE_NUMBER], 26 | ), 27 | ( 28 | True, 29 | [AWSComprehendPIIType.EMAIL_ADDRESS, AWSComprehendPIIType.PHONE_NUMBER], 30 | ), 31 | ( 32 | True, 33 | [AWSComprehendPIIType.EMAIL_ADDRESS, AWSComprehendPIIType.PHONE_NUMBER], 34 | ), 35 | ], 36 | ) 37 | def test_aws_comprehend_analysis_single_item_conversion(expected_result, expected_pii): 38 | conversion_results: List[ 39 | DetectionResultItem 40 | ] = AWSComprehendPIIDetectionAdapter().convert_analyzed_item( 41 | pii_detection={ 42 | "Entities": [ 43 | { 44 | "Score": 0.99, 45 | "Type": pii_type.value, 46 | "BeginOffset": 123, 47 | "EndOffset": 456, 48 | } 49 | for pii_type in expected_pii 50 | ] 51 | }, 52 | ) 53 | 54 | if expected_result: 55 | assert_that(conversion_results).is_not_empty() 56 | assert_that(isinstance(conversion_results[0], DetectionResultItem)).is_true() 57 | else: 58 | assert_that(conversion_results).is_empty() 59 | 60 | 61 | def test_aws_comprehend_analysis_collection_conversion(): 62 | conversion_results: List[ 63 | DetectionResult 64 | ] = AWSComprehendPIIDetectionAdapter().convert_analyzed_collection( 65 | pii_detections=[ 66 | { 67 | "Entities": [ 68 | { 69 | "Score": 0.99, 70 | "Type": AWSComprehendPIIType.EMAIL_ADDRESS.value, 71 | "BeginOffset": 123, 72 | "EndOffset": 456, 73 | } 74 | ] 75 | }, 76 | { 77 | "Entities": [ 78 | { 79 | "Score": 0.73, 80 | "Type": AWSComprehendPIIType.PHONE_NUMBER.value, 81 | "BeginOffset": 456, 82 | "EndOffset": 789, 83 | } 84 | ] 85 | }, 86 | ] 87 | ) 88 | 89 | assert_that(conversion_results).is_not_empty() 90 | assert_that(conversion_results[1].index).is_greater_than( 91 | conversion_results[0].index 92 | ) 93 | assert_that( 94 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 95 | ).is_true() 96 | -------------------------------------------------------------------------------- /pii_codex/services/adapters/detection_adapters/aws_detection_adapter.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | from pii_codex.config import PII_MAPPER 4 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 5 | from pii_codex.services.adapters.detection_adapters.detection_adapter_base import ( 6 | BasePIIDetectionAdapter, 7 | ) 8 | 9 | 10 | class AWSComprehendPIIDetectionAdapter(BasePIIDetectionAdapter): 11 | def convert_analyzed_item(self, pii_detection: dict) -> List[DetectionResultItem]: 12 | """ 13 | Converts an AWS Comprehend detect_pii() result into a collection of DetectionResultItem 14 | 15 | @param pii_detection: dict from AWS Comprehend detect_pii { 16 | 'Entities': [ 17 | { 18 | 'Score': ..., 19 | 'Type': 'BANK_ACCOUNT_NUMBER'|'BANK_ROUTING'|'CREDIT_DEBIT_NUMBER'|'CREDIT_DEBIT_CVV'| 20 | 'CREDIT_DEBIT_EXPIRY'|'PIN'|'EMAIL'|'ADDRESS'|'NAME'|'PHONE'|'SSN'|'DATE_TIME'|'PASSPORT_NUMBER'| 21 | 'DRIVER_ID'|'URL'|'AGE'|'USERNAME'|'PASSWORD'|'AWS_ACCESS_KEY'|'AWS_SECRET_KEY'|'IP_ADDRESS'| 22 | 'MAC_ADDRESS'|'ALL'|'LICENSE_PLATE'|'VEHICLE_IDENTIFICATION_NUMBER'|'UK_NATIONAL_INSURANCE_NUMBER'| 23 | 'CA_SOCIAL_INSURANCE_NUMBER'|'US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER'| 24 | 'UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER'|'IN_PERMANENT_ACCOUNT_NUMBER'|'IN_NREGA'| 25 | 'INTERNATIONAL_BANK_ACCOUNT_NUMBER'|'SWIFT_CODE'|'UK_NATIONAL_HEALTH_SERVICE_NUMBER'| 26 | 'CA_HEALTH_NUMBER'|'IN_AADHAAR'|'IN_VOTER_NUMBER', 27 | 'BeginOffset': 123, 28 | 
'EndOffset': 123 29 | }, 30 | ] 31 | } 32 | @return: List[DetectionResultItem] 33 | """ 34 | 35 | return [ 36 | DetectionResultItem( 37 | entity_type=PII_MAPPER.convert_aws_comprehend_pii_to_common_pii_type( 38 | result["Type"] 39 | ).name, 40 | score=result["Score"], 41 | start=result["BeginOffset"], 42 | end=result["EndOffset"], 43 | ) 44 | for result in pii_detection["Entities"] 45 | ] 46 | 47 | def convert_analyzed_collection( 48 | self, pii_detections: List[dict] 49 | ) -> List[DetectionResult]: 50 | """ 51 | Converts a collection of AWS Comprehend detect_pii() results to a collection of DetectionResult. 52 | 53 | @param pii_detections: List[dict] of response from AWS Comprehend detect_pii - [{ 54 | 'Entities': [ 55 | { 56 | 'Score': ..., 57 | 'Type': 'BANK_ACCOUNT_NUMBER'|'BANK_ROUTING'|'CREDIT_DEBIT_NUMBER'|'CREDIT_DEBIT_CVV'| 58 | 'CREDIT_DEBIT_EXPIRY'|'PIN'|'EMAIL'|'ADDRESS'|'NAME'|'PHONE'|'SSN'|'DATE_TIME'|'PASSPORT_NUMBER'| 59 | 'DRIVER_ID'|'URL'|'AGE'|'USERNAME'|'PASSWORD'|'AWS_ACCESS_KEY'|'AWS_SECRET_KEY'|'IP_ADDRESS'| 60 | 'MAC_ADDRESS'|'ALL'|'LICENSE_PLATE'|'VEHICLE_IDENTIFICATION_NUMBER'|'UK_NATIONAL_INSURANCE_NUMBER'| 61 | 'CA_SOCIAL_INSURANCE_NUMBER'|'US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER'| 62 | 'UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER'|'IN_PERMANENT_ACCOUNT_NUMBER'|'IN_NREGA'| 63 | 'INTERNATIONAL_BANK_ACCOUNT_NUMBER'|'SWIFT_CODE'|'UK_NATIONAL_HEALTH_SERVICE_NUMBER'| 64 | 'CA_HEALTH_NUMBER'|'IN_AADHAAR'|'IN_VOTER_NUMBER', 65 | 'BeginOffset': 123, 66 | 'EndOffset': 123 67 | }, 68 | ] 69 | }] 70 | 71 | """ 72 | 73 | detection_results: List[DetectionResult] = [] 74 | for i, detection in enumerate(pii_detections): 75 | # Return results in formatted Analysis Result List object 76 | detection_results.append( 77 | DetectionResult( 78 | index=i, 79 | detections=self.convert_analyzed_item(pii_detection=detection), 80 | ) 81 | ) 82 | 83 | return detection_results 84 | -------------------------------------------------------------------------------- /pii_codex/models/analysis.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=too-many-instance-attributes 2 | from __future__ import annotations 3 | 4 | from dataclasses import dataclass, field 5 | from typing import List, Counter, Optional 6 | 7 | from pii_codex.models.common import RiskLevel, RiskLevelDefinition 8 | 9 | 10 | # PII detection, risk assessment, and analysis models 11 | 12 | 13 | @dataclass 14 | class RiskAssessment: 15 | """ 16 | Singular risk assessment for a string token 17 | """ 18 | 19 | pii_type_detected: Optional[str] = None 20 | risk_level: int = RiskLevel.LEVEL_ONE.value 21 | risk_level_definition: str = ( 22 | RiskLevelDefinition.LEVEL_ONE.value 23 | ) # Default if it's not semi or fully identifiable 24 | cluster_membership_type: Optional[str] = None 25 | hipaa_category: Optional[str] = None 26 | dhs_category: Optional[str] = None 27 | nist_category: Optional[str] = None 28 | 29 | 30 | @dataclass 31 | class RiskAssessmentList: 32 | """ 33 | Risk Assessments and the average risk score of all list items 34 | """ 35 | 36 | risk_assessments: List[RiskAssessment] 37 | average_risk_score: float 38 | 39 | 40 | @dataclass 41 | class DetectionResultItem: 42 | """ 43 | Type associated with a singular PII detection (e.g. detection of an email in a string), its associated risk score, 44 | and where it is located in a string. 
45 | """ 46 | 47 | entity_type: str 48 | score: float = 0.0 # metadata detections don't have confidence score values 49 | start: int = 0 # metadata detections don't have offset values 50 | end: int = 0 # metadata detections don't have offset values 51 | 52 | 53 | @dataclass 54 | class DetectionResult: 55 | """ 56 | List of detection results and the index of the string they pertain to 57 | """ 58 | 59 | detections: List[DetectionResultItem] 60 | index: int = 0 61 | 62 | 63 | @dataclass 64 | class AnalysisResultItem: 65 | """ 66 | The results associated to a single detection of a single string (e.g. Social Media Post, SMS, etc.) 67 | """ 68 | 69 | detection: Optional[DetectionResultItem] 70 | risk_assessment: RiskAssessment 71 | 72 | def to_dict(self): 73 | return { 74 | "riskAssessment": self.risk_assessment.__dict__, 75 | "detection": self.detection.__dict__, 76 | } 77 | 78 | def to_flattened_dict(self): 79 | assessment = self.risk_assessment.__dict__.copy() 80 | 81 | if self.detection: 82 | assessment.update(self.detection.__dict__) 83 | 84 | return assessment 85 | 86 | 87 | @dataclass 88 | class AnalysisResult: 89 | """ 90 | The analysis results associated with several detections within a single string (e.g. Social Media Post, SMS, etc.) 91 | """ 92 | 93 | analysis: List[AnalysisResultItem] 94 | index: int = 0 95 | risk_score_mean: float = 0.0 96 | sanitized_text: str = "" 97 | 98 | def to_dict(self): 99 | return { 100 | "analysis": [item.to_flattened_dict() for item in self.analysis], 101 | "index": self.index, 102 | "risk_score_mean": self.risk_score_mean, 103 | "sanitized_text": self.sanitized_text, 104 | } 105 | 106 | def get_detected_types(self) -> List[str]: 107 | return [pii.detection.entity_type for pii in self.analysis if pii.detection] 108 | 109 | 110 | @dataclass 111 | class AnalysisResultSet: 112 | """ 113 | The analysis results associated with a collection of strings or documents (e.g. Social Media Posts, forum thread, 114 | etc.). 
Includes most/least detected PII types within the collection, average risk score of analyses, 115 | """ 116 | 117 | analyses: List[AnalysisResult] 118 | detection_count: int = 0 119 | detected_pii_types: set[str] = field(default_factory=set) 120 | detected_pii_type_frequencies: Counter = None # type: ignore 121 | risk_scores: List[float] = field(default_factory=list) 122 | risk_score_mean: float = 1.0 # Default is 1 for non-identifiable 123 | risk_score_mode: float = 0.0 124 | risk_score_median: float = 0.0 125 | risk_score_standard_deviation: float = 0.0 126 | risk_score_variance: float = 0.0 127 | collection_name: Optional[ 128 | str 129 | ] = None # Optional ability for analysts to name a set (see analysis storage step in notebooks) 130 | collection_type: str = "POPULATION" # Other option is SAMPLE 131 | 132 | def to_dict(self): 133 | return { 134 | "collection_name": self.collection_name, 135 | "collection_type": self.collection_type, 136 | "analyses": [item.to_dict() for item in self.analyses], 137 | "detection_count": self.detection_count, 138 | "risk_scores": self.risk_scores, 139 | "risk_score_mean": self.risk_score_mean, 140 | "risk_score_mode": self.risk_score_mode, 141 | "risk_score_median": self.risk_score_median, 142 | "risk_score_standard_deviation": self.risk_score_standard_deviation, 143 | "risk_score_variance": self.risk_score_variance, 144 | "detected_pii_types": self.detected_pii_types, 145 | "detected_pii_type_frequencies": dict(self.detected_pii_type_frequencies), 146 | } 147 | -------------------------------------------------------------------------------- /pii_codex/models/common.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from enum import Enum 4 | 5 | # All listed PII Types from Milne et al (2018) and a few others along with 6 | # models used for PII categorization for DHS, NIST, and HIPAA 7 | 8 | 9 | class AnalysisProviderType(Enum): 10 | """ 11 | Analysis Provider Types - software and cloud service APIs providing PII detection results 12 | """ 13 | 14 | AZURE = "AZURE" 15 | AWS = "AWS" 16 | PRESIDIO = "PRESIDIO" 17 | 18 | 19 | class RiskLevel(Enum): 20 | """ 21 | Numerical values assigned to the levels on the continuum presented by Schwartz and Solove (2011) 22 | """ 23 | 24 | LEVEL_ONE = 1 # Not-Identifiable 25 | LEVEL_TWO = 2 # Semi-Identifiable 26 | LEVEL_THREE = 3 # Identifiable 27 | 28 | 29 | class RiskLevelDefinition(Enum): 30 | """ 31 | Levels on the continuum presented by Schwartz and Solove (2011) 32 | """ 33 | 34 | LEVEL_ONE = "Non-Identifiable" # Default if no entities were detected, risk level is set to this 35 | LEVEL_TWO = "Semi-Identifiable" 36 | LEVEL_THREE = "Identifiable" # Level associated with Directly PII, PHI, and Standalone PII info types 37 | 38 | 39 | class MetadataType(Enum): 40 | """ 41 | Common metadata types associated with social media posts and other online platforms 42 | """ 43 | 44 | SCREEN_NAME = "screen_name" 45 | NAME = "name" 46 | LOCATION = "location" 47 | URL = "url" 48 | USER_ID = "user_id" 49 | 50 | 51 | class PIIType(Enum): 52 | """ 53 | Commonly observed PII types across services and software 54 | """ 55 | 56 | PHONE_NUMBER = "PHONE" 57 | WORK_PHONE_NUMBER = "PHONE" 58 | EMAIL_ADDRESS = "EMAIL" 59 | ABA_ROUTING_NUMBER = "ABA_ROUTING_NUMBER" 60 | IP_ADDRESS = "IP_ADDRESS" 61 | DATE = "DATE" 62 | ADDRESS = "ADDRESS" 63 | HOME_ADDRESS = "ADDRESS" 64 | WORK_ADDRESS = "ADDRESS" 65 | AGE = "AGE" 66 | PERSON = "PERSON" 67 | CREDIT_CARD_NUMBER 
= "CREDIT_CARD_NUMBER" 68 | CREDIT_SCORE = "CREDIT_SCORE" 69 | CRYPTO = "CRYPTO" 70 | URL = "URL" 71 | DATE_TIME = "DATE_TIME" 72 | LOCATION = "LOCATION" 73 | ZIPCODE = "ZIPCODE" 74 | RACE = "RACE" 75 | HEIGHT = "HEIGHT" 76 | WEIGHT = "WEIGHT" 77 | GENDER = "GENDER" 78 | HOMETOWN = "HOMETOWN" 79 | SCREEN_NAME = "SCREEN_NAME" 80 | MARITAL_STATUS = "MARITAL_STATUS" 81 | NUMBER_OF_CHILDREN = "NUMBER_OF_CHILDREN" 82 | COUNTRY_OF_CITIZENSHIP = "COUNTRY_OF_CITIZENSHIP" 83 | VOICE_PRINT = "VOICE_PRINT" 84 | FINGERPRINT = "FINGERPRINT" 85 | DNA_PROFILE = "DNA_PROFILE" 86 | PICTURE_FACE = "PICTURE_FACE" 87 | HANDWRITING_SAMPLE = "HANDWRITING_SAMPLE" 88 | MOTHERS_MAIDEN_NAME = "MOTHERS_MAIDEN_NAME" 89 | DIGITAL_SIGNATURE = "DIGITAL_SIGNATURE" 90 | HEALTH_INSURANCE_ID = "HEALTH_INSURANCE_ID" 91 | SHOPPING_BEHAVIOR = "SHOPPING_BEHAVIOR" 92 | SEXUAL_PREFERENCE = "SEXUAL_PREFERENCE" 93 | SOCIAL_NETWORK_PROFILE = "SOCIAL_NETWORK_PROFILE" 94 | JOB_TITLE = "JOB_TITLE" 95 | INCOME_LEVEL = "INCOME_LEVEL" 96 | OCCUPATION = "OCCUPATION" 97 | DOCUMENTS = "DOCUMENTS" 98 | MEDICAL_LICENSE = "MEDICAL_LICENSE" 99 | LICENSE_PLATE_NUMBER = "LICENSE_PLATE_NUMBER" 100 | SECURITY_ACCESS_CODES = "SECURITY_ACCESS_CODES" 101 | PASSWORD = "PASSWORD" 102 | US_SOCIAL_SECURITY_NUMBER = "US_SOCIAL_SECURITY_NUMBER" 103 | US_BANK_ACCOUNT_NUMBER = "US_BANK_ACCOUNT_NUMBER" 104 | US_DRIVERS_LICENSE_NUMBER = "US_DRIVERS_LICENSE_NUMBER" 105 | US_PASSPORT_NUMBER = "US_PASSPORT_NUMBER" 106 | US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_INDIVIDUAL_TAXPAYER_IDENTIFICATION" 107 | INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "INTERNATIONAL_BANKING_ACCOUNT_NUMBER" 108 | SWIFT_CODE = "SWIFTCode" 109 | NRP = "NRP" # A person's nationality, religion, or political group 110 | # Australian PII types 111 | AU_BUSINESS_NUMBER = "AU_BUSINESS_NUMBER" 112 | AU_COMPANY_NUMBER = "AU_COMPANY_NUMBER" 113 | AU_MEDICAL_ACCOUNT_NUMBER = "AU_MEDICAL_ACCOUNT_NUMBER" 114 | AU_TAX_FILE_NUMBER = "AU_TAX_FILE_NUMBER" 115 | 116 | 117 | class NISTCategory(Enum): 118 | """ 119 | Information Categories presented by NIST as noted in Milne et al., 2016 120 | """ 121 | 122 | LINKABLE = "Linkable" 123 | DIRECTLY_PII = "Directly PII" 124 | 125 | 126 | class DHSCategory(Enum): 127 | """ 128 | Information Categories presented by DHS as noted in Milne et al., 2016 129 | """ 130 | 131 | NOT_MENTIONED = "Not Mentioned" 132 | LINKABLE = "Linkable" 133 | STAND_ALONE_PII = "Stand Alone PII" 134 | 135 | 136 | class HIPAACategory(Enum): 137 | """ 138 | Information Categories presented by HIPAA guidelines 139 | """ 140 | 141 | NON_PHI = "Not Protected Health Information" 142 | PHI = "Protected Health Information" 143 | 144 | 145 | class ClusterMembershipType(Enum): 146 | """ 147 | Information Cluster Memberships presented by Milne et al., 2016 148 | """ 149 | 150 | BASIC_DEMOGRAPHICS = "Basic Demographics" 151 | PERSONAL_PREFERENCES = "Personal Preferences" 152 | CONTACT_INFORMATION = "Contact Information" 153 | COMMUNITY_INTERACTION = "Community Interaction" 154 | FINANCIAL_INFORMATION = "Financial Information" 155 | SECURE_IDENTIFIERS = "Secure Identifiers" 156 | -------------------------------------------------------------------------------- /docs/MAPPING.md: -------------------------------------------------------------------------------- 1 | # Mapping Across PII Detections with PII-Codex 2 | The PII Codex has a number of enums to help with the definitions and labeling of PII, their categories, and their severity rankings across modules. 
At this time, only AWS Comprehend, Microsoft Presidio, and Microsoft Azure PII entity types are mapped to the common PII types listing. 3 | 4 | Selecting a PII type from the common PII type listing: 5 | 6 | ## Mapping Between PII Types 7 | ```python 8 | from pii_codex.models.common import PIIType 9 | PIIType.EMAIL_ADDRESS # Selecting a single common PIIType 10 | PIIType.EMAIL_ADDRESS.name # The name of the enum entry 11 | PIIType.EMAIL_ADDRESS.value # The String value of the enum entry 12 | ``` 13 | 14 | Iterating through all common types supported: 15 | 16 | ```python 17 | from pii_codex.models.common import PIIType 18 | pii_types = [pii_type.name for pii_type in PIIType] 19 | ``` 20 | 21 | Each module or cloud resource will have its own string labeling for the PII Type. It is sometimes required to map to that string value in order to properly parse out a PII detection or to initialize an analyzer. To map to a different PII type (if supported with the version, using MSFT Presidio for example): 22 | 23 | ```python 24 | from pii_codex.models.common import PIIType 25 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 26 | from pii_codex.models.azure_pii import AzureDetectionType 27 | from pii_codex.models.aws_pii import AWSComprehendPIIType 28 | 29 | presidio_pii_type = MSFTPresidioPIIType[PIIType.EMAIL_ADDRESS.name] # MSFT Presidio enum entry 30 | 31 | print("Presidio Enum Type Name for Email: ", presidio_pii_type.name) 32 | print("Presidio Enum Type Value for Email: ", presidio_pii_type.value) 33 | ``` 34 | 35 | The built-in mapper can also help: pass in the conversion you'd like to perform and it will return the corresponding enum entry. If the conversion is not supported, an error is raised instead. 36 | 37 | ```python 38 | from pii_codex.models.common import PIIType 39 | from pii_codex.utils.pii_mapping_util import PIIMapper 40 | 41 | pii_mapper = PIIMapper() 42 | 43 | azure_pii = pii_mapper.convert_common_pii_to_azure_pii_type(PIIType.EMAIL_ADDRESS) 44 | 45 | aws_pii = pii_mapper.convert_common_pii_to_aws_comprehend_type(PIIType.EMAIL_ADDRESS) 46 | presidio_pii = pii_mapper.convert_common_pii_to_msft_presidio_type(PIIType.EMAIL_ADDRESS) 47 | ``` 48 | 49 | If you are using the PII-Codex module for just detection conversions and analysis, there is an inverse set of mappers that will take Presidio, Azure, or AWS Comprehend PII types and convert them to the PII-Codex common types: 50 | 51 | ```python 52 | from pii_codex.models.common import PIIType 53 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 54 | from pii_codex.models.azure_pii import AzureDetectionType 55 | from pii_codex.models.aws_pii import AWSComprehendPIIType 56 | from pii_codex.utils.pii_mapping_util import PIIMapper 57 | 58 | pii_mapper = PIIMapper() 59 | 60 | azure_to_common_pii = pii_mapper.convert_azure_pii_to_common_pii_type( 61 | AzureDetectionType.EMAIL_ADDRESS.value 62 | ) 63 | aws_to_common_pii = pii_mapper.convert_aws_comprehend_pii_to_common_pii_type( 64 | AWSComprehendPIIType.EMAIL_ADDRESS.value 65 | ) 66 | presidio_to_common_pii = pii_mapper.convert_msft_presidio_pii_to_common_pii_type( 67 | MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value # e.g. "US_SSN" 68 | ) 69 | ``` 70 | 71 | ### Example: provider‑specific labels vs common PII types 72 | 73 | Some providers use compact or region‑encoded labels which do **not** match the human‑readable common PII type names.
For example: 74 | 75 | ```python 76 | from pii_codex.models.common import PIIType 77 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 78 | from pii_codex.utils.pii_mapping_util import PIIMapper 79 | 80 | pii_mapper = PIIMapper() 81 | 82 | # Presidio emits "US_SSN", which we represent as: 83 | assert MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value == "US_SSN" 84 | 85 | # PII Codex maps this back to the canonical common type: 86 | common_type = pii_mapper.convert_msft_presidio_pii_to_common_pii_type("US_SSN") 87 | assert common_type is PIIType.US_SOCIAL_SECURITY_NUMBER 88 | 89 | # Likewise for AU tax identifiers: 90 | assert MSFTPresidioPIIType.AU_TAX_FILE_NUMBER.value == "AU_TFN" 91 | common_au_tfn = pii_mapper.convert_msft_presidio_pii_to_common_pii_type("AU_TFN") 92 | assert common_au_tfn is PIIType.AU_TAX_FILE_NUMBER 93 | 94 | # The key idea: provider enums mirror the provider's own labels, 95 | # and PIIType is the canonical, provider‑independent surface. 96 | ``` 97 | 98 | ## Importing Updated Files 99 | 100 | ```python 101 | 102 | from pii_codex.utils import pii_mapping_util 103 | from pii_codex.models.common import PIIType 104 | # Data frame loaded from csv mapping file (assumes /data location in pii-codex) 105 | csv_file_dataframe = pii_mapping_util.open_pii_type_mapping_csv("v1") 106 | 107 | # Data frame loaded from json mapping file (assumes /data location in pii-codex) 108 | json_file_dataframe = pii_mapping_util.open_pii_type_mapping_json("v1") 109 | 110 | # Retrieving the entries for "IP Address" Information Type, for example 111 | ip_address = json_file_dataframe[json_file_dataframe.Information_Type=='IP Address'].item() 112 | 113 | pii_type = PIIType[ip_address] 114 | print("Enum Type Name for IP Address: ", pii_type.name) 115 | print("Enum Type Name for IP Address: ", pii_type.value) 116 | ``` 117 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/config.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.config API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.config

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
from pii_codex.utils.pii_mapping_util import PIIMapper
30 | 
31 | PII_MAPPER = PIIMapper()
32 | DEFAULT_LANG = "en"
33 | DEFAULT_ANALYSIS_MODE = "POPULATION"
34 | DEFAULT_TOKEN_REPLACEMENT_VALUE = "<REDACTED>"
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 | 59 |
60 | 63 | 64 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/analyzers/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services.analyzers API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 42 | 60 |
61 | 64 | 65 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/adapters/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services.adapters API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 42 | 60 |
61 | 64 | 65 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Package pii_codex

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
__version__ = "0.1.0"
30 |
31 |
32 |
33 |

Sub-modules

34 |
35 |
pii_codex.config
36 |
37 |
38 |
39 |
pii_codex.models
40 |
41 |
42 |
43 |
pii_codex.services
44 |
45 |
46 |
47 |
pii_codex.utils
48 |
49 |
50 |
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
60 | 76 |
77 | 80 | 81 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 54 | 75 |
76 | 79 | 80 | -------------------------------------------------------------------------------- /pii_codex/utils/pii_mapping_util.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=broad-except, unused-variable, no-else-return 2 | from typing import Optional 3 | 4 | from pii_codex.models.aws_pii import AWSComprehendPIIType 5 | from pii_codex.models.azure_pii import AzureDetectionType 6 | from pii_codex.models.common import ( 7 | RiskLevel, 8 | PIIType, 9 | MetadataType, 10 | RiskLevelDefinition, 11 | ) 12 | from pii_codex.models.analysis import RiskAssessment 13 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 14 | 15 | from pii_codex.services.pii_type_mappings import get_pii_mapping 16 | 17 | 18 | class PIIMapper: 19 | """ 20 | Class to map PII types listed as Common Types, Azure Types, AWS Comprehend Types, and Presidio Types 21 | """ 22 | 23 | def __init__(self): 24 | # No need to load CSV anymore - using Python structure 25 | pass 26 | 27 | def map_pii_type(self, pii_type: str) -> RiskAssessment: 28 | """ 29 | Maps the PII Type to a full RiskAssessment including categories it belongs to, risk level, and 30 | its location in the text. This cross-references some of the types listed by Milne et al. (2016) 31 | 32 | @param pii_type: 33 | @return: 34 | """ 35 | 36 | try: 37 | mapping = get_pii_mapping(pii_type) 38 | 39 | # Get the risk level definition string based on the risk level 40 | risk_level_to_definition = { 41 | RiskLevel.LEVEL_ONE: RiskLevelDefinition.LEVEL_ONE.value, 42 | RiskLevel.LEVEL_TWO: RiskLevelDefinition.LEVEL_TWO.value, 43 | RiskLevel.LEVEL_THREE: RiskLevelDefinition.LEVEL_THREE.value, 44 | } 45 | risk_level_definition = risk_level_to_definition[mapping.risk_level] 46 | 47 | return RiskAssessment( 48 | pii_type_detected=pii_type, 49 | risk_level=mapping.risk_level.value, 50 | risk_level_definition=risk_level_definition, 51 | cluster_membership_type=mapping.cluster_membership_type.value, 52 | hipaa_category=mapping.hipaa_category.value, 53 | dhs_category=mapping.dhs_category.value, 54 | nist_category=mapping.nist_category.value, 55 | ) 56 | except KeyError: 57 | raise Exception( 58 | f"An error occurred while processing the detected entity {pii_type}" 59 | ) 60 | 61 | @classmethod 62 | def convert_common_pii_to_msft_presidio_type( 63 | cls, pii_type: PIIType 64 | ) -> MSFTPresidioPIIType: 65 | """ 66 | Converts a common PII Type to a MSFT Presidio Type 67 | @param pii_type: 68 | @return: 69 | """ 70 | 71 | try: 72 | converted_type = MSFTPresidioPIIType[pii_type.name] 73 | except Exception as ex: 74 | raise Exception( 75 | "The current version does not support this PII Type conversion." 76 | ) 77 | 78 | return converted_type 79 | 80 | @classmethod 81 | def convert_common_pii_to_azure_pii_type( 82 | cls, pii_type: PIIType 83 | ) -> AzureDetectionType: 84 | """ 85 | Converts a common PII Type to an Azure PII Type 86 | @param pii_type: 87 | @return: 88 | """ 89 | try: 90 | return AzureDetectionType[pii_type.name] 91 | except Exception as ex: 92 | raise Exception( 93 | "The current version does not support this PII Type conversion." 
94 | ) 95 | 96 | @classmethod 97 | def convert_common_pii_to_aws_comprehend_type( 98 | cls, 99 | pii_type: PIIType, 100 | ) -> AWSComprehendPIIType: 101 | """ 102 | Converts a common PII Type to an AWS PII Type 103 | @param pii_type: 104 | @return: 105 | """ 106 | try: 107 | return AWSComprehendPIIType[pii_type.name] 108 | except Exception as ex: 109 | raise Exception( 110 | "The current version does not support this PII Type conversion." 111 | ) 112 | 113 | @classmethod 114 | def convert_azure_pii_to_common_pii_type(cls, pii_type: str) -> PIIType: 115 | """ 116 | Converts an Azure PII Type to a common PII Type 117 | @param pii_type: 118 | @return: 119 | """ 120 | try: 121 | if pii_type == AzureDetectionType.USUK_PASSPORT_NUMBER.value: 122 | # Special case, map to USUK for all US and UK Passport types 123 | return PIIType.US_PASSPORT_NUMBER 124 | 125 | return PIIType[AzureDetectionType(pii_type).name] 126 | except Exception as ex: 127 | raise Exception( 128 | "The current version does not support this PII Type conversion." 129 | ) 130 | 131 | @classmethod 132 | def convert_aws_comprehend_pii_to_common_pii_type( 133 | cls, 134 | pii_type: str, 135 | ) -> PIIType: 136 | """ 137 | Converts an AWS PII Type to a common PII Type 138 | @param pii_type: str from AWS Comprehend (maps to value of AWSComprehendPIIType) 139 | @return: 140 | """ 141 | try: 142 | return PIIType[AWSComprehendPIIType(pii_type).name] 143 | except Exception as ex: 144 | raise Exception( 145 | "The current version does not support this PII Type conversion." 146 | ) 147 | 148 | @classmethod 149 | def convert_msft_presidio_pii_to_common_pii_type( 150 | cls, 151 | pii_type: str, 152 | ) -> PIIType: 153 | """ 154 | Converts a Microsoft Presidio PII Type to a common PII Type 155 | @param pii_type: str from Presidio (maps to value of PIIType) 156 | @return: 157 | """ 158 | try: 159 | # Handle specific cases where Presidio returns different values than enum names 160 | if pii_type == "US_SSN": 161 | return PIIType.US_SOCIAL_SECURITY_NUMBER 162 | if pii_type == "US_BANK_NUMBER": 163 | return PIIType.US_BANK_ACCOUNT_NUMBER 164 | if pii_type == "AU_MEDICARE": 165 | return PIIType.AU_MEDICAL_ACCOUNT_NUMBER 166 | if pii_type == "DATE": 167 | return PIIType.DATE_TIME 168 | 169 | # For everything else, use the original approach that was working 170 | return PIIType[MSFTPresidioPIIType(pii_type).name] 171 | 172 | except Exception as ex: 173 | raise Exception( 174 | f"The current version does not support this PII Type conversion: {pii_type}. Error: {str(ex)}" 175 | ) 176 | 177 | @classmethod 178 | def convert_metadata_type_to_common_pii_type( 179 | cls, metadata_type: str 180 | ) -> Optional[PIIType]: 181 | """ 182 | Converts metadata type str entry to common PII type 183 | @param metadata_type: 184 | @return: PIIType 185 | """ 186 | 187 | try: 188 | if metadata_type.lower() == "name": 189 | return PIIType.PERSON 190 | 191 | if metadata_type.lower() == "user_id": 192 | # If dealing with public data, user_id can be used to pull down 193 | # social network profile 194 | return PIIType.SOCIAL_NETWORK_PROFILE 195 | 196 | return PIIType[MetadataType(metadata_type.lower()).name] 197 | except Exception as ex: 198 | raise Exception( 199 | "The current version does not support this Metadata to PII Type conversion." 
200 | ) 201 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/models/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.models API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 58 | 80 |
81 | 84 | 85 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/utils/package_installer_util.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.utils.package_installer_util API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.utils.package_installer_util

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
import subprocess
30 | import sys
31 | 
32 | 
33 | def install_spacy_package(package_name):
34 |     """
35 |     Installs missing spacy package (if found missing)
36 |     @param package_name:
37 |     @return:
38 |     """
39 |     subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |

Functions

48 |
49 |
50 | def install_spacy_package(package_name) 51 |
52 |
53 |

Installs missing spacy package (if found missing) 54 | @param package_name: 55 | @return:

56 |
57 | 58 | Expand source code 59 | 60 |
def install_spacy_package(package_name):
61 |     """
62 |     Installs missing spacy package (if found missing)
63 |     @param package_name:
64 |     @return:
65 |     """
66 |     subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])
67 |
68 |
69 |
70 |
71 |
72 |
73 |
74 | 92 |
93 | 96 | 97 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/utils/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.utils API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 58 | 80 |
81 | 84 | 85 | -------------------------------------------------------------------------------- /joss/paper.bib: -------------------------------------------------------------------------------- 1 | @article{milne_pettinico_hajjat_markos_2016, 2 | title={Information sensitivity typology: Mapping the degree and type of risk consumers perceive in personal data sharing}, 3 | volume={51}, 4 | DOI={10.1111/joca.12111}, 5 | number={1}, 6 | journal={Journal of Consumer Affairs}, 7 | author={Milne, George R. and Pettinico, George and Hajjat, Fatima M. and Markos, Ereni}, 8 | year={2016}, pages={133–161} 9 | } 10 | @misc{OCR_2022, 11 | title={Summary of the HIPAA privacy rule}, 12 | url={https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html#:~:text=The%20U.S.%20Department%20of%20Health,1996%20(%22HIPAA%22).}, 13 | journal={HHS.gov}, 14 | author={OCR, Office for Civil Rights}, 15 | year={2022}, 16 | month={Oct} 17 | } 18 | @article{West_2017, 19 | title={Data capitalism: Redefining the logics of surveillance and privacy}, 20 | volume={58}, 21 | DOI={10.1177/0007650317718185}, 22 | number={1}, 23 | journal={Business &amp; Society}, 24 | author={West, Sarah Myers}, 25 | year={2017}, 26 | pages={20–41} 27 | } 28 | @article{schwartz_solove_2011, 29 | title={The PII Problem: Privacy and a New Concept of Personally Identifiable Information}, 30 | volume={86}, 31 | journal={New York University Law Review}, 32 | author={Schwartz, Paul M and Solove, Daniel J}, 33 | year={2011}, 34 | pages={1814} 35 | } 36 | @article{trepte2020, 37 | author = {Trepte, Sabine}, 38 | year = {2020}, 39 | month = {05}, 40 | pages = {}, 41 | title = {The Social Media Privacy Model: Privacy and Communication in the Light of Social Media Affordances}, 42 | volume = {31}, 43 | journal = {Communication Theory}, 44 | doi = {10.1093/ct/qtz035} 45 | } 46 | @software{rosado2022, 47 | author = {Rosado, Eidan J.}, 48 | doi = {10.5281/zenodo.7212576}, 49 | month = {12}, 50 | title = {{pii-codex: a Python library for PII detection, categorization, and severity assessment}}, 51 | version = {0.4.3}, 52 | year = {2023} 53 | } 54 | @misc{microsoft_presidio, 55 | title={Microsoft/Presidio: Context Aware, pluggable and customizable data protection and Anonymization SDK for text and images}, 56 | url={https://github.com/microsoft/presidio}, 57 | publisher={Microsoft}, 58 | author={Microsoft} 59 | } 60 | @misc{microsoft_presidio_entities, 61 | title={PII entities supported by Presidio}, 62 | url={https://microsoft.github.io/presidio/supported_entities/}, 63 | journal={Supported entities - Microsoft Presidio}, 64 | publisher={Microsoft}, 65 | author={Microsoft}, 66 | year={2022} 67 | } 68 | @misc{azure_detection_cognitive_skill, 69 | title={PII detection cognitive skill - azure cognitive search}, 70 | url={https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-pii-detection}, 71 | journal={PII Detection cognitive skill - Azure Cognitive Search | Microsoft Learn}, 72 | publisher={Microsoft Azure}, 73 | author={Microsoft Azure}, 74 | year={2022} 75 | } 76 | @misc{aws_comprehend, 77 | title={PII detection cognitive skill - azure cognitive search}, 78 | url={https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html}, 79 | journal={What is Amazon Comprehend | AWS Comprehend}, 80 | publisher={Amazon Web Services}, 81 | author={Amazon Web Services}, 82 | year={2022} 83 | } 84 | @misc{gdpr_erasure_right, 85 | 
url={https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/individual-rights/right-to-erasure/#:~:text=Under%20Article%2017%20of%20the,be%20created%20in%20the%20future.}, 86 | title={Right to erasure}, 87 | publisher={Information Commissioners Office}, 88 | year={2022} 89 | } 90 | @misc{GDPR_eu_2022, 91 | url={https://gdpr.eu/what-is-gdpr/}, 92 | journal={GDPR.eu}, 93 | year={2022}, 94 | month={May} 95 | } 96 | @misc{hipaa, 97 | url={https://www.hhs.gov/hipaa/index.html}, 98 | title={Health Information Privacy}, 99 | publisher={U.S. Department of Health and Human Services}, 100 | year={2022} 101 | } 102 | @article{mccallister_grance_scarfone_2010, 103 | title={Guide to protecting the confidentiality of personally identifiable information (PII)}, 104 | DOI={10.6028/nist.sp.800-122}, 105 | journal={S Department of Commerce: National Institute of Standards and Technology (NIST)}, 106 | author={McCallister, E and Grance, T and Scarfone, K A}, 107 | year={2010} 108 | } 109 | @misc{dhs_2012, 110 | title={Handbook for Safeguarding Sensitive Personally Identifying Information}, 111 | publisher={Privacy Office, Department of Homeland Security}, 112 | year={2012} 113 | } 114 | @article{beigi_liu_2020, 115 | title={A survey on privacy in Social Media}, volume={1}, 116 | DOI={10.1145/3343038}, 117 | number={1}, 118 | journal={ACM/IMS Transactions on Data Science}, 119 | author={Beigi, Ghazaleh and Liu, Huan}, 120 | year={2020}, 121 | pages={1–38} 122 | } 123 | @article{moura_serrão_2019, 124 | title={Security and privacy issues of Big Data}, 125 | DOI={10.4018/978-1-5225-8897-9.ch019}, 126 | journal={Cyber Law, Privacy, and Security}, 127 | author={Moura, José and Serrão, Carlos}, 128 | year={2019}, 129 | pages={375–407} 130 | } 131 | 132 | @unpublished{spacy2, 133 | AUTHOR = {Honnibal, Matthew and Montani, Ines}, 134 | TITLE = {{spaCy 2}: Natural language understanding with {B}loom embeddings, convolutional neural networks and incremental parsing}, 135 | YEAR = {2017}, 136 | Note = {To appear} 137 | } 138 | 139 | @article{belanger_crossler_2011, 140 | ISSN = {02767783}, 141 | URL = {http://www.jstor.org/stable/41409971}, 142 | abstract = {Information privacy refers to the desire of individuals to control or have some influence over data about themselves. Advances in information technology have raised concerns about information privacy and its impacts, and have motivated Information Systems researchers to explore information privacy issues, including technical solutions to address these concerns. In this paper, we inform researchers about the current state of information privacy research in IS through a critical analysis of the IS literature that considers information privacy as a key construct. The review of the literature reveals that information privacy is a multilevel concept, but rarely studied as such. We also find that information privacy research has been heavily reliant on studentbased and USA-centric samples, which results in findings of limited generalizability. Information privacy research focuses on explaining and predicting theoretical contributions, with few studies in journal articles focusing on design and action contributions. We recommend that future research should consider different levels of analysis as well as multilevel effects of information privacy. We illustrate this with a multilevel framework for information privacy concerns. 
We call for research on information privacy to use a broader diversity of sampling populations, and for more design and action information privacy research to be published in journal articles that can result in IT artifacts for protection or control of information privacy.}, 143 | author = {France Bélanger and Robert E. Crossler}, 144 | journal = {MIS Quarterly}, 145 | number = {4}, 146 | pages = {1017--1041}, 147 | publisher = {Management Information Systems Research Center, University of Minnesota}, 148 | title = {Privacy in the Digital Age: A Review of Information Privacy Research in Information Systems}, 149 | urldate = {2023-04-26}, 150 | volume = {35}, 151 | year = {2011} 152 | } 153 | 154 | @article{tene_polonetsky_2012, 155 | author = {Tene, Omer and Polonetsky, Jules}, 156 | year = {2012}, 157 | month = {09}, 158 | pages = {}, 159 | title = {Big Data for All: Privacy and User Control in the Age of Analytics}, 160 | volume = {11}, 161 | journal = {Northwestern Journal of Technology and Intellectual Property} 162 | } -------------------------------------------------------------------------------- /docs/dev/pii_codex/services/adapters/detection_adapters/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.services.adapters.detection_adapters API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.services.adapters.detection_adapters

23 |
24 |
25 |
26 |
27 |

Sub-modules

28 |
29 |
pii_codex.services.adapters.detection_adapters.aws_detection_adapter
30 |
31 |
32 |
33 |
pii_codex.services.adapters.detection_adapters.azure_detection_adapter
34 |
35 |
36 |
37 |
pii_codex.services.adapters.detection_adapters.detection_adapter_base
38 |
39 |
40 |
41 |
pii_codex.services.adapters.detection_adapters.presidio_detection_adapter
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 |
50 |
51 |
52 |
53 |
54 | 75 |
76 | 79 | 80 | -------------------------------------------------------------------------------- /docs/dev/pii_codex/utils/logging.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | pii_codex.utils.logging API documentation 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 |
21 |
22 |

Module pii_codex.utils.logging

23 |
24 |
25 |
26 | 27 | Expand source code 28 | 29 |
from time import time
 30 | import logging
 31 | 
 32 | logger = logging.getLogger()
 33 | 
 34 | 
 35 | def timed_operation(func):
 36 |     """
 37 |     Used to show execution time for function
 38 |     @param func:
 39 |     @return:
 40 |     """
 41 | 
 42 |     def wrapper_function(*args, **kwargs):
 43 |         """
 44 |         Logs the function execution time
 45 | 
 46 |         @param args:
 47 |         @param kwargs:
 48 |         @return:
 49 |         """
 50 |         start_time = time()
 51 |         result = func(*args, **kwargs)
 52 |         end_time = time()
 53 |         logger.info(f"{func.__name__!r} executed in {(end_time - start_time):.4f}s")
 54 |         return result
 55 | 
 56 |     return wrapper_function
57 |
58 |
59 |
60 |
61 |
62 |
63 |
64 |

Functions

65 |
66 |
67 | def timed_operation(func) 68 |
69 |
70 |

Used to show execution time for function 71 | @param func: 72 | @return:

73 |
74 | 75 | Expand source code 76 | 77 |
def timed_operation(func):
 78 |     """
 79 |     Used to show execution time for function
 80 |     @param func:
 81 |     @return:
 82 |     """
 83 | 
 84 |     def wrapper_function(*args, **kwargs):
 85 |         """
 86 |         Logs the function execution time
 87 | 
 88 |         @param args:
 89 |         @param kwargs:
 90 |         @return:
 91 |         """
 92 |         start_time = time()
 93 |         result = func(*args, **kwargs)
 94 |         end_time = time()
 95 |         logger.info(f"{func.__name__!r} executed in {(end_time - start_time):.4f}s")
 96 |         return result
 97 | 
 98 |     return wrapper_function
99 |
100 |
101 |
102 |
103 |
104 |
105 |
106 | 124 |
125 | 128 | 129 | -------------------------------------------------------------------------------- /pii_codex/services/analyzers/presidio_analysis.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=broad-except,unused-argument,import-outside-toplevel,unused-variable 2 | from typing import List, Tuple 3 | 4 | from presidio_anonymizer.entities.engine.recognizer_result import RecognizerResult 5 | 6 | from ...config import PII_MAPPER, DEFAULT_LANG, DEFAULT_TOKEN_REPLACEMENT_VALUE 7 | from ...models.analysis import DetectionResultItem, DetectionResult 8 | from ...utils.package_installer_util import install_spacy_package 9 | from ...utils.pii_mapping_util import PIIMapper 10 | from ...utils.logging import logger 11 | 12 | 13 | class PresidioPIIAnalyzer: 14 | """ 15 | Presidio PII Analyzer - a wrapper for the Microsoft Presidio Analyzer and Anonymization functions 16 | """ 17 | 18 | def __init__( 19 | self, pii_token_replacement_value: str = DEFAULT_TOKEN_REPLACEMENT_VALUE 20 | ): 21 | """ 22 | Since installing Spacy, the en_core_web_lg model, and the MSFT Presidio package are optional installs 23 | the imports are wrapped to prevent any failures 24 | @param pii_token_replacement_value: str to replace detected pii token with (e.g. ) 25 | """ 26 | 27 | try: 28 | import spacy 29 | from presidio_analyzer import AnalyzerEngine 30 | from presidio_anonymizer import AnonymizerEngine 31 | from presidio_anonymizer.entities import OperatorConfig 32 | 33 | if not spacy.util.is_package("en_core_web_lg"): 34 | # Last resort. Will install the en_core_web_lg package if end-user hadn't already. 35 | install_spacy_package("en_core_web_lg") 36 | 37 | self.analyzer = AnalyzerEngine() 38 | self.anonymizer = AnonymizerEngine() 39 | self.pii_mapper = PIIMapper() 40 | 41 | self.operators = { 42 | "DEFAULT": OperatorConfig( 43 | "replace", {"new_value": pii_token_replacement_value} 44 | ), 45 | "TITLE": OperatorConfig("redact", {}), 46 | } 47 | 48 | except ImportError: 49 | raise Exception( 50 | 'Missing dependencies from extras. Install the PII-Codex extras: "detections"' 51 | ) 52 | 53 | def get_supported_entities(self, language_code=DEFAULT_LANG) -> List[str]: 54 | """ 55 | Retrieves a list of supported entities, this will narrow down what is available for a given language 56 | 57 | @param language_code: str - defaults to "en" 58 | @return: List[str] 59 | """ 60 | return self.analyzer.get_supported_entities(language=language_code) 61 | 62 | def get_loaded_recognizers(self, language_code: str = DEFAULT_LANG): 63 | """ 64 | Retrieves a list of loaded recognizers, narrowing down the list of what is available for a given language 65 | @param language_code: 66 | @return: 67 | """ 68 | return self.analyzer.get_recognizers(language=language_code) 69 | 70 | def analyze_item( 71 | self, text: str, language_code: str = DEFAULT_LANG, entities: List[str] = None 72 | ) -> Tuple[List[DetectionResultItem], str]: 73 | """ 74 | Uses Microsoft Presidio (spaCy module) to analyze given a set of entities to analyze the provided text against. 75 | Will log an error if the identifier or entity recognizer is not added to Presidio's base recognizers or 76 | a custom recognizer created. 
Returns the list of detected items and the sanitized string 77 | 78 | @param language_code: str "en" is default 79 | @param entities: str - List[MSFTPresidioPIIType.name] 80 | @param text: str 81 | @return: Tuple[List[DetectionResultItem], str] 82 | """ 83 | 84 | detections = [] 85 | 86 | if not entities: 87 | entities = self.get_supported_entities(language_code) 88 | 89 | try: 90 | # Engine Setup - spaCy model setup and PII recognizers 91 | detections = self.analyzer.analyze( 92 | text=text, entities=entities, language=language_code 93 | ) 94 | 95 | except Exception as ex: 96 | logger.error(ex) 97 | 98 | # Return analyzer results in formatted Analysis Result List object 99 | detection_items = [ 100 | DetectionResultItem( 101 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 102 | result.entity_type 103 | ).name, 104 | score=result.score, 105 | start=result.start, 106 | end=result.end, 107 | ) 108 | for result in detections 109 | ] 110 | return detection_items, self.sanitize_text( 111 | text=text, analysis_items=detection_items 112 | ) 113 | 114 | def sanitize_text( 115 | self, text: str, analysis_items: List[DetectionResultItem] 116 | ) -> str: 117 | """ 118 | Sanitizes the text analyzed with MSFT Presidio's Anonymizer 119 | @param text: 120 | @param analysis_items: 121 | @return: 122 | """ 123 | try: 124 | # Convert DetectionResultItem back to RecognizerResult for Presidio anonymizer 125 | 126 | recognizer_results = [ 127 | RecognizerResult( 128 | entity_type=item.entity_type, 129 | start=item.start, 130 | end=item.end, 131 | score=item.score, 132 | ) 133 | for item in analysis_items 134 | ] 135 | 136 | anonymization_result = self.anonymizer.anonymize( 137 | text=text, analyzer_results=recognizer_results, operators=self.operators 138 | ) 139 | 140 | return anonymization_result.text or "" 141 | 142 | except Exception as ex: 143 | logger.error("An error occurred sanitizing the string") 144 | return "" 145 | 146 | def analyze_collection( 147 | self, texts: List[str], language_code: str = "en", entities: List[str] = None 148 | ) -> List[DetectionResult]: 149 | """ 150 | Uses Microsoft Presidio (spaCy module) to analyze given a set of entities to analyze the provided text against. 151 | Will log an error if the identifier or entity recognizer is not added to Presidio's base recognizers or 152 | a custom recognizer created. 
153 | 154 | @param language_code: str "en" is default 155 | @param entities: List[MSFTPresidioPIIType.name] defaults to all possible entities for selected language 156 | @param texts: List[str] 157 | @return: List[DetectionResult] 158 | """ 159 | 160 | detection_results = [] 161 | try: 162 | if not entities: 163 | entities = self.get_supported_entities(language_code) 164 | 165 | # Engine Setup - spaCy model setup and PII recognizers 166 | for i, text in enumerate(texts): 167 | text_analysis = self.analyzer.analyze( 168 | text=text, entities=entities, language=language_code 169 | ) 170 | 171 | # Every analysis by the analyzer will have a set of detections within 172 | detections = [ 173 | DetectionResultItem( 174 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 175 | result.entity_type 176 | ).name, 177 | score=result.score, 178 | start=result.start, 179 | end=result.end, 180 | ) 181 | for result in text_analysis 182 | ] 183 | detection_results.append( 184 | DetectionResult(index=i, detections=detections) 185 | ) 186 | 187 | # Return analyzer results in formatted Analysis Result List object 188 | 189 | except Exception as ex: 190 | logger.error(ex) 191 | 192 | return detection_results 193 | 194 | @classmethod 195 | def convert_analyzed_item(cls, pii_detection) -> List[DetectionResultItem]: 196 | """ 197 | Converts a single Presidio analysis attempt into a collection of DetectionResultItem objects. One string 198 | analysis by Presidio returns an array of RecognizerResult objects. 199 | 200 | @param pii_detection: RecognizerResult from presidio analyzer 201 | @return: List[DetectionResultItem] 202 | """ 203 | 204 | return [ 205 | DetectionResultItem( 206 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 207 | result.entity_type 208 | ).name, 209 | score=result.score, 210 | start=result.start, 211 | end=result.end, 212 | ) 213 | for result in pii_detection 214 | ] 215 | 216 | @classmethod 217 | def convert_analyzed_collection(cls, pii_detections) -> List[DetectionResult]: 218 | """ 219 | Converts a collection of Presidio analysis results to a collection of DetectionResult. A collection of Presidio 220 | analysis results ends up being a 2D array. 
221 | 222 | @param pii_detections: List[RecognizerResult] from Presidio analyzer 223 | @return: List[DetectionResult] 224 | """ 225 | 226 | detection_results: List[DetectionResult] = [] 227 | for i, result in enumerate(pii_detections): 228 | # Return results in formatted Analysis Result List object 229 | detections = [] 230 | for entity in result: 231 | detections.append( 232 | DetectionResultItem( 233 | entity_type=PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type( 234 | entity.entity_type 235 | ).name, 236 | score=entity.score, 237 | start=entity.start, 238 | end=entity.end, 239 | ) 240 | ) 241 | 242 | detection_results.append(DetectionResult(index=i, detections=detections)) 243 | 244 | return detection_results 245 | -------------------------------------------------------------------------------- /tests/pii_codex/services/test_detection_service.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import pytest 4 | from assertpy import assert_that 5 | from presidio_analyzer import RecognizerResult 6 | from pii_codex.config import DEFAULT_LANG 7 | from pii_codex.models.analysis import DetectionResultItem, DetectionResult 8 | from pii_codex.models.microsoft_presidio_pii import ( 9 | MSFTPresidioPIIType, 10 | ) 11 | from pii_codex.services.analyzers.presidio_analysis import ( 12 | PresidioPIIAnalyzer, 13 | ) 14 | 15 | 16 | class TestDetectionService: 17 | presidio_analyzer = PresidioPIIAnalyzer() 18 | 19 | @pytest.mark.parametrize( 20 | "test_input,pii_types,expected_result", 21 | [ 22 | ("Not", [MSFTPresidioPIIType.PHONE_NUMBER.value], False), 23 | ("PII", [MSFTPresidioPIIType.EMAIL_ADDRESS.value], False), 24 | ("example@example.com", [MSFTPresidioPIIType.EMAIL_ADDRESS.value], True), 25 | ( 26 | "My email is example@example.eu.edu", 27 | [MSFTPresidioPIIType.EMAIL_ADDRESS.value], 28 | True, 29 | ), 30 | ( 31 | "My phone number is 191-212-456-7890", 32 | [MSFTPresidioPIIType.PHONE_NUMBER.value], 33 | False, 34 | ), # International number not working 35 | ( 36 | "My phone number is 305-555-5555", 37 | [MSFTPresidioPIIType.PHONE_NUMBER.value], 38 | True, 39 | ), 40 | ( 41 | "My phone number is 305-555-5555 and email is example@example.com", 42 | [ 43 | MSFTPresidioPIIType.PHONE_NUMBER.value, 44 | MSFTPresidioPIIType.EMAIL_ADDRESS.value, 45 | ], 46 | True, 47 | ), 48 | ], 49 | ) 50 | def test_msft_presidio_analysis_single_item( 51 | self, test_input, pii_types, expected_result 52 | ): 53 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 54 | text=test_input, 55 | entities=pii_types, 56 | ) 57 | 58 | if expected_result: 59 | assert_that(presidio_results).is_not_empty() 60 | assert_that(isinstance(presidio_results[0], DetectionResultItem)).is_true() 61 | assert_that(sanitized_text).is_not_empty() 62 | else: 63 | assert_that(presidio_results).is_empty() 64 | 65 | def test_msft_presidio_analysis_collection(self): 66 | presidio_results = self.presidio_analyzer.analyze_collection( 67 | texts=[ 68 | "My email is example@example.eu.edu", 69 | "My phone number is 305-555-5555 and email is example@example.com", 70 | ], 71 | entities=self.presidio_analyzer.get_supported_entities(language_code="en"), 72 | language_code=DEFAULT_LANG, 73 | ) 74 | 75 | assert_that(presidio_results).is_not_empty() 76 | assert_that(presidio_results[1].index).is_greater_than( 77 | presidio_results[0].index 78 | ) 79 | assert_that( 80 | isinstance(presidio_results[0].detections[0], DetectionResultItem) 81 | ).is_true() 82 | 83 | def 
test_presidio_analysis_collection_conversion(self): 84 | conversion_results: List[ 85 | DetectionResult 86 | ] = self.presidio_analyzer.convert_analyzed_collection( 87 | pii_detections=[ 88 | [ 89 | RecognizerResult( 90 | entity_type=MSFTPresidioPIIType.EMAIL_ADDRESS.value, 91 | start=123, 92 | end=456, 93 | score=0.98, 94 | ), 95 | RecognizerResult( 96 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 97 | start=123, 98 | end=456, 99 | score=0.973, 100 | ), 101 | ], 102 | [ 103 | RecognizerResult( 104 | entity_type=MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value, 105 | start=123, 106 | end=456, 107 | score=0.98, 108 | ), 109 | RecognizerResult( 110 | entity_type=MSFTPresidioPIIType.PHONE_NUMBER.value, 111 | start=123, 112 | end=456, 113 | score=0.973, 114 | ), 115 | ], 116 | ] 117 | ) 118 | 119 | assert_that(conversion_results).is_not_empty() 120 | assert_that(conversion_results[1].index).is_greater_than( 121 | conversion_results[0].index 122 | ) 123 | assert_that( 124 | isinstance(conversion_results[0].detections[0], DetectionResultItem) 125 | ).is_true() 126 | 127 | @pytest.mark.parametrize( 128 | "ssn_text,expected_detection", 129 | [ 130 | ("My SSN is 489-36-8350", True), # Robert Aragon from DLP test data 131 | ("SSN: 514-14-8905", True), # Ashley Borden from DLP test data 132 | ( 133 | "Social Security Number: 690-05-5315", 134 | True, 135 | ), # Thomas Conley from DLP test data 136 | ("My number is 421-37-1396", True), # Susan Davis from DLP test data 137 | ("SSN 458-02-6124", True), # Christopher Diaz from DLP test data 138 | ("No SSN here", False), # No SSN 139 | ("Random text 123-45-6789", False), # Generic SSN format without context 140 | ], 141 | ) 142 | def test_ssn_detection_with_dlp_data(self, ssn_text, expected_detection): 143 | """Test SSN detection using DLP test data from dlptest.com""" 144 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 145 | text=ssn_text, 146 | entities=[MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value], 147 | ) 148 | 149 | if expected_detection: 150 | assert_that(presidio_results).is_not_empty() 151 | assert_that(presidio_results[0].entity_type).is_equal_to( 152 | "US_SOCIAL_SECURITY_NUMBER" 153 | ) 154 | assert_that(sanitized_text).is_not_empty() 155 | else: 156 | assert_that(presidio_results).is_empty() 157 | 158 | def test_ssn_conversion_to_common_type(self): 159 | """Test that SSN detection results are properly converted to common PII types""" 160 | # Test with DLP test data SSN 161 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 162 | text="SSN: 489-36-8350", # Robert Aragon from DLP test data 163 | entities=[MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value], 164 | ) 165 | 166 | assert_that(presidio_results).is_not_empty() 167 | # The conversion should map US_SSN to US_SOCIAL_SECURITY_NUMBER 168 | assert_that(presidio_results[0].entity_type).is_equal_to( 169 | "US_SOCIAL_SECURITY_NUMBER" 170 | ) 171 | assert_that(sanitized_text).is_not_empty() 172 | assert_that(sanitized_text).does_not_contain("489-36-8350") 173 | 174 | def test_bank_number_detection_and_conversion(self): 175 | """Test bank account number detection and conversion""" 176 | # Test with a sample bank account number 177 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 178 | text="Bank account: 1234567890", 179 | entities=[MSFTPresidioPIIType.US_BANK_ACCOUNT_NUMBER.value], 180 | ) 181 | 182 | assert_that(presidio_results).is_not_empty() 183 | # The conversion should map US_BANK_NUMBER to 
US_BANK_ACCOUNT_NUMBER 184 | assert_that(presidio_results[0].entity_type).is_equal_to( 185 | "US_BANK_ACCOUNT_NUMBER" 186 | ) 187 | assert_that(sanitized_text).is_not_empty() 188 | assert_that(sanitized_text).does_not_contain("1234567890") 189 | 190 | def test_au_medicare_detection_and_conversion(self): 191 | """Test Australian Medicare number detection and conversion""" 192 | # Test with a sample Australian Medicare number 193 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 194 | text="Medicare: 1234567890", 195 | entities=[MSFTPresidioPIIType.AU_MEDICAL_ACCOUNT_NUMBER.value], 196 | ) 197 | 198 | # Note: Presidio doesn't have a recognizer for AU_MEDICARE in English 199 | # This test demonstrates the mapping conversion but won't detect anything 200 | # The conversion should map AU_MEDICARE to AU_MEDICAL_ACCOUNT_NUMBER when it exists 201 | if presidio_results: 202 | assert_that(presidio_results[0].entity_type).is_equal_to( 203 | "AU_MEDICAL_ACCOUNT_NUMBER" 204 | ) 205 | assert_that(sanitized_text).is_not_empty() 206 | assert_that(sanitized_text).does_not_contain("1234567890") 207 | else: 208 | # If no recognizer is available, that's also acceptable 209 | pass 210 | 211 | def test_multiple_pii_types_with_dlp_data(self): 212 | """Test detection of multiple PII types using DLP test data""" 213 | test_text = ( 214 | "Robert Aragon, SSN: 489-36-8350, DOB: 6/7/1981" # Test entry from DLP data 215 | ) 216 | 217 | presidio_results, sanitized_text = self.presidio_analyzer.analyze_item( 218 | text=test_text, 219 | entities=[ 220 | MSFTPresidioPIIType.US_SOCIAL_SECURITY_NUMBER.value, 221 | MSFTPresidioPIIType.DATE.value, 222 | MSFTPresidioPIIType.PERSON.value, 223 | ], 224 | ) 225 | 226 | assert_that(presidio_results).is_not_empty() 227 | # Should detect SSN, date, and person 228 | detected_types = [result.entity_type for result in presidio_results] 229 | assert_that(detected_types).contains("US_SOCIAL_SECURITY_NUMBER") 230 | assert_that(detected_types).contains("PERSON") 231 | assert_that(sanitized_text).is_not_empty() 232 | assert_that(sanitized_text).does_not_contain("489-36-8350") 233 | assert_that(sanitized_text).does_not_contain("6/7/1981") 234 | assert_that(sanitized_text).does_not_contain("Robert Aragon") 235 | -------------------------------------------------------------------------------- /tests/pii_codex/utils/test_pii_mapping_util.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=broad-except, line-too-long 2 | from assertpy import assert_that 3 | import pytest 4 | 5 | from pii_codex.config import PII_MAPPER 6 | from pii_codex.models.aws_pii import AWSComprehendPIIType 7 | from pii_codex.models.azure_pii import AzureDetectionType 8 | from pii_codex.models.common import ( 9 | PIIType, 10 | ClusterMembershipType, 11 | DHSCategory, 12 | NISTCategory, 13 | HIPAACategory, 14 | RiskLevel, 15 | RiskLevelDefinition, 16 | ) 17 | from pii_codex.models.microsoft_presidio_pii import MSFTPresidioPIIType 18 | from pii_codex.services.pii_type_mappings import PII_TYPE_MAPPINGS 19 | import pii_codex.utils.pii_mapping_util as util_module 20 | 21 | 22 | class TestPIIMappingUtil: 23 | # region PII MAPPING AND CONVERSION FUNCTIONS 24 | @pytest.mark.parametrize( 25 | "pii_type", 26 | [pii_type.name for pii_type in PIIType], 27 | ) 28 | def test_map_pii_type(self, pii_type): 29 | """ 30 | Requires the type mapping to be in the associated version file in pii_codex/data/ 31 | """ 32 | if pii_type is not PIIType.DOCUMENTS.name: 33 | 
mapped_pii = PII_MAPPER.map_pii_type(pii_type) 34 | assert_that(mapped_pii.risk_level).is_greater_than(1) 35 | assert_that( 36 | isinstance( 37 | ClusterMembershipType(mapped_pii.cluster_membership_type), 38 | ClusterMembershipType, 39 | ) 40 | ).is_true() 41 | assert_that( 42 | isinstance(DHSCategory(mapped_pii.dhs_category), DHSCategory) 43 | ).is_true() 44 | assert_that( 45 | isinstance(NISTCategory(mapped_pii.nist_category), NISTCategory) 46 | ).is_true() 47 | assert_that( 48 | isinstance(HIPAACategory(mapped_pii.hipaa_category), HIPAACategory) 49 | ).is_true() 50 | 51 | @pytest.mark.parametrize( 52 | "pii_type", 53 | PIIType, 54 | ) 55 | def test_convert_common_pii_to_msft_presidio_type(self, pii_type): 56 | try: 57 | converted_pii = PII_MAPPER.convert_common_pii_to_msft_presidio_type( 58 | pii_type 59 | ) 60 | assert_that(isinstance(converted_pii, MSFTPresidioPIIType)).is_true() 61 | except Exception as ex: 62 | assert_that(ex.args[0]).contains( 63 | "The current version does not support this PII Type conversion." 64 | ) 65 | 66 | @pytest.mark.parametrize( 67 | "pii_type", 68 | PIIType, 69 | ) 70 | def test_convert_common_pii_to_azure_pii_type(self, pii_type): 71 | try: 72 | converted_pii = PII_MAPPER.convert_common_pii_to_azure_pii_type(pii_type) 73 | assert_that(isinstance(converted_pii, AzureDetectionType)).is_true() 74 | except Exception as ex: 75 | assert_that(ex.args[0]).contains( 76 | "The current version does not support this PII Type conversion." 77 | ) 78 | 79 | @pytest.mark.parametrize( 80 | "pii_type", 81 | AzureDetectionType, 82 | ) 83 | def test_convert_azure_pii_to_common_pii_type(self, pii_type): 84 | try: 85 | converted_pii = PII_MAPPER.convert_azure_pii_to_common_pii_type(pii_type) 86 | assert_that(isinstance(converted_pii, PIIType)).is_true() 87 | except Exception as ex: 88 | assert_that(ex.args[0]).contains( 89 | "The current version does not support this PII Type conversion." 90 | ) 91 | 92 | @pytest.mark.parametrize( 93 | "pii_type", 94 | AWSComprehendPIIType, 95 | ) 96 | def test_convert_aws_pii_to_common_pii_type(self, pii_type): 97 | try: 98 | converted_pii = PII_MAPPER.convert_aws_comprehend_pii_to_common_pii_type( 99 | pii_type 100 | ) 101 | assert_that(isinstance(converted_pii, PIIType)).is_true() 102 | except Exception as ex: 103 | assert_that(ex.args[0]).contains( 104 | "The current version does not support this PII Type conversion." 105 | ) 106 | 107 | @pytest.mark.parametrize( 108 | "pii_type", 109 | PIIType, 110 | ) 111 | def test_convert_common_pii_to_aws_comprehend_type(self, pii_type): 112 | try: 113 | converted_pii = PII_MAPPER.convert_common_pii_to_aws_comprehend_type( 114 | pii_type 115 | ) 116 | assert_that(isinstance(converted_pii, AWSComprehendPIIType)).is_true() 117 | except Exception as ex: 118 | assert_that(ex.args[0]).contains( 119 | "The current version does not support this PII Type conversion." 120 | ) 121 | 122 | def test_convert_metadata_to_pii_failure(self): 123 | with pytest.raises(Exception) as execinfo: 124 | PII_MAPPER.convert_metadata_type_to_common_pii_type("other_type") 125 | 126 | assert_that(str(execinfo.value)).contains( 127 | "The current version does not support this Metadata to PII Type conversion." 
128 | ) 129 | 130 | def test_convert_msft_presidio_pii_to_common_pii_type_failure(self): 131 | with pytest.raises(Exception) as execinfo: 132 | PII_MAPPER.convert_msft_presidio_pii_to_common_pii_type("other_type") 133 | 134 | assert_that(str(execinfo.value)).contains( 135 | "The current version does not support this PII Type conversion: other_type. Error: 'other_type' is not a valid MSFTPresidioPIIType" 136 | ) 137 | 138 | def test_pii_mapping_enum_consistency(self): 139 | """Test that all PII mappings have consistent enum references""" 140 | for mapping in PII_TYPE_MAPPINGS.values(): 141 | # Test that risk level mapping works correctly (it's an enum, not int) 142 | assert_that(mapping.risk_level).is_instance_of(RiskLevel) 143 | assert_that(mapping.risk_level.value).is_between(1, 3) 144 | 145 | # Test that cluster membership type is valid 146 | assert_that(mapping.cluster_membership_type).is_instance_of( 147 | ClusterMembershipType 148 | ) 149 | 150 | # Test that categories are valid enum instances 151 | assert_that(mapping.nist_category).is_instance_of(NISTCategory) 152 | assert_that(mapping.dhs_category).is_instance_of(DHSCategory) 153 | assert_that(mapping.hipaa_category).is_instance_of(HIPAACategory) 154 | 155 | # Test that provider enums are either valid enum instances or None 156 | if mapping.presidio_enum is not None: 157 | assert_that(mapping.presidio_enum).is_instance_of(MSFTPresidioPIIType) 158 | if mapping.azure_enum is not None: 159 | assert_that(mapping.azure_enum).is_instance_of(AzureDetectionType) 160 | if mapping.aws_enum is not None: 161 | assert_that(mapping.aws_enum).is_instance_of(AWSComprehendPIIType) 162 | 163 | def test_risk_level_definition_mapping(self): 164 | """Test that risk level to definition mapping works correctly""" 165 | for pii_type in PII_TYPE_MAPPINGS: 166 | # Test that we can create a RiskAssessment without errors 167 | risk_assessment = PII_MAPPER.map_pii_type(pii_type) 168 | 169 | # Test that risk level definition is a valid string 170 | assert_that(risk_assessment.risk_level_definition).is_instance_of(str) 171 | 172 | # Test that risk level definition is one of the valid values 173 | valid_definitions = [level.value for level in RiskLevelDefinition] 174 | assert_that( 175 | risk_assessment.risk_level_definition in valid_definitions 176 | ).is_true() 177 | 178 | # Test that risk level is an integer 179 | assert_that(risk_assessment.risk_level).is_instance_of(int) 180 | assert_that(risk_assessment.risk_level).is_between(1, 3) 181 | 182 | def test_provider_enum_consistency(self): 183 | """Test that provider-specific enum mappings are consistent""" 184 | # Test some key mappings that should have all three providers 185 | key_mappings = [ 186 | "EMAIL_ADDRESS", 187 | "PHONE_NUMBER", 188 | "PERSON", 189 | "US_SOCIAL_SECURITY_NUMBER", 190 | "CREDIT_CARD_NUMBER", 191 | ] 192 | 193 | for pii_type in key_mappings: 194 | if pii_type in PII_TYPE_MAPPINGS: 195 | mapping = PII_TYPE_MAPPINGS[pii_type] 196 | 197 | # These should have all three provider enums 198 | assert_that(mapping.presidio_enum).is_not_none() 199 | assert_that(mapping.azure_enum).is_not_none() 200 | assert_that(mapping.aws_enum).is_not_none() 201 | 202 | # Test that the enum names are consistent 203 | if mapping.presidio_enum: 204 | assert_that(mapping.presidio_enum.name).is_equal_to(pii_type) 205 | if mapping.azure_enum: 206 | assert_that(mapping.azure_enum.name).is_equal_to(pii_type) 207 | if mapping.aws_enum: 208 | # AWS might have different naming (e.g., CREDIT_DEBIT_NUMBER vs 
CREDIT_CARD_NUMBER) 209 | assert_that(mapping.aws_enum.name).is_not_empty() 210 | 211 | def test_azure_usuk_passport_mapping(self): 212 | """Test the specific fix for US_PASSPORT_NUMBER -> USUK_PASSPORT_NUMBER""" 213 | us_passport_mapping = PII_TYPE_MAPPINGS.get("US_PASSPORT_NUMBER") 214 | assert_that(us_passport_mapping).is_not_none() 215 | 216 | if us_passport_mapping and us_passport_mapping.azure_enum: 217 | assert_that(us_passport_mapping.azure_enum.name).is_equal_to( 218 | "USUK_PASSPORT_NUMBER" 219 | ) 220 | 221 | def test_aws_australian_types_none(self): 222 | """Test that AWS doesn't have Australian business/company types (set to None)""" 223 | au_business_mapping = PII_TYPE_MAPPINGS.get("AU_BUSINESS_NUMBER") 224 | au_company_mapping = PII_TYPE_MAPPINGS.get("AU_COMPANY_NUMBER") 225 | 226 | if au_business_mapping: 227 | assert_that(au_business_mapping.aws_enum).is_none() 228 | if au_company_mapping: 229 | assert_that(au_company_mapping.aws_enum).is_none() 230 | 231 | def test_no_csv_dependencies(self): 232 | """Test that we no longer have CSV file dependencies""" 233 | # Check that the module doesn't have pandas-related attributes 234 | assert_that(hasattr(util_module, "_pii_mapping_data_frame")).is_false() 235 | 236 | # Check that we're using the new mapping system (PII_MAPPER is imported from config) 237 | assert_that(PII_MAPPER).is_not_none() 238 | 239 | # Test that map_pii_type works without CSV 240 | result = PII_MAPPER.map_pii_type("EMAIL_ADDRESS") 241 | assert_that(result).is_not_none() 242 | assert_that(result.pii_type_detected).is_equal_to("EMAIL_ADDRESS") 243 | 244 | # endregion 245 | 246 | # CSV-related tests removed - no longer using CSV files 247 | -------------------------------------------------------------------------------- /docs/DETECTION_AND_ANALYSIS.md: -------------------------------------------------------------------------------- 1 | # Detection and Analysis with PII-Codex 2 | In the case that you are not bringing your own detection service, Microsoft Presidio is integrated into PII Codex that provides flexibility in analyzer type. At the time of this repo's creation, only a select number of evaluators exist. You may create your own evaluators and swap out the version of presidio that pii-codex uses. Note that with this change, you will also need to update the mappings where applicable. 3 | 4 | The following are not integrated into the service, but have PII type mapping and detection object conversion support: 5 | 6 |
 7 | 1. AWS Comprehend (Requires AWS Account) [docs]
 8 | 2. Azure PII Detection Cognitive Skill (Requires Azure Account) [docs]
 9 | 
10 | 
11 | For those using pre-detected results, adapters are provided to convert types and results to the expected DetectionResult/DetectionResultItem format (see diagram below):
12 | 
13 | ![Converting And Analyzing Existing Detections](UC1_Converting_Existing_Detections_With_Adapters.png)
14 | 
15 | To supply the analyzer module with a collection of pre-detected results from your own Microsoft Presidio, Azure, or AWS Comprehend analysis process, you will first need to convert the detections to a set of DetectionResult objects to feed into the analyzer as follows:
16 | 
17 | ```python
18 | from typing import List
19 | from pii_codex.models.common import (
20 |     AnalysisProviderType,
21 | )
22 | from presidio_analyzer import RecognizerResult
23 | from pii_codex.services.analysis_service import PIIAnalysisService
24 | from pii_codex.services.adapters.detection_adapters.presidio_detection_adapter import PresidioPIIDetectionAdapter
25 | from pii_codex.models.analysis import DetectionResult
26 | 
27 | presidio_detection_service = PresidioPIIDetectionAdapter()
28 | 
29 | list_of_detections: List[RecognizerResult] = []  # your list of detections
30 | converted_detections: List[DetectionResult] = presidio_detection_service.convert_analyzed_collection(
31 |     pii_detections=list_of_detections
32 | )
33 | pii_analysis_service = PIIAnalysisService(
34 |     analysis_provider=AnalysisProviderType.PRESIDIO.name
35 | )  # If you don't intend to use presidio, override the analysis_provider value
36 | 
37 | results = pii_analysis_service.analyze_detection_collection(
38 |     detection_collection=converted_detections,
39 |     collection_name="Data Set Label",  # this is more for those that intend to find a way to label collections
40 |     collection_type="SAMPLE"  # defaults to POPULATION, input used for standard deviation and variance calculations
41 | )
42 | ```
43 | 
44 | The other two detection adapters available are AWSComprehendPIIDetectionAdapter and AzurePIIDetectionAdapter.
45 | 
46 | 
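As a rough sketch of the same conversion flow with one of those other adapters, the snippet below mirrors the Presidio example above. The module path, the `convert_analyzed_collection` method name, and the shape of the Comprehend payloads are assumptions carried over from the Presidio adapter example rather than confirmed API details, so verify them against the adapter source before relying on them.

```python
from typing import List

from pii_codex.models.analysis import DetectionResult
# Assumed module path, mirroring the Presidio adapter import shown above
from pii_codex.services.adapters.detection_adapters.aws_detection_adapter import (
    AWSComprehendPIIDetectionAdapter,
)

aws_detection_adapter = AWSComprehendPIIDetectionAdapter()

# Pre-existing responses from AWS Comprehend PII detection calls
# (the payload shape is an assumption -- consult the adapter source)
comprehend_detections: List[dict] = []

# Assumes the AWS adapter exposes the same convert_analyzed_collection
# method demonstrated for the Presidio adapter above
converted_detections: List[DetectionResult] = aws_detection_adapter.convert_analyzed_collection(
    pii_detections=comprehend_detections
)
```

The converted detections can then be passed to analyze_detection_collection exactly as shown above; AzurePIIDetectionAdapter would be used the same way for Azure PII Detection Cognitive Skill results.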
47 | 48 | In the case you require the built-in Presidio functionality, you can call the analysis service as follows: 49 | 50 | ```python 51 | from pii_codex.services.analysis_service import PIIAnalysisService 52 | 53 | pii_analysis_service = PIIAnalysisService() 54 | 55 | strings_to_analyze = ["string to analyze", "string to analyze"] # strings to analyze 56 | results = pii_analysis_service.analyze_collection( 57 | texts=strings_to_analyze, 58 | language_code="en", 59 | collection_name="Data Set Label", # this is optional, geared for those that require labeling of collections 60 | collection_type="SAMPLE" # defaults to POPULATION, input used for standard deviation and variance calculations 61 | ) 62 | ``` 63 | 64 | This functionality can easily take a singular text item or a collection of them and runs through the presidio analysis and assessment service files as presented in the diagram below. 65 | 66 | ![Converting And Analyzing Text with Presidio Builtin](UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png) 67 | 68 | 69 | For those analyzing social media posts, you can also supply metadata per text sample to be analyzed in a dataframe. 70 | 71 | ```python 72 | import pandas as pd 73 | from pii_codex.services.analysis_service import PIIAnalysisService 74 | 75 | pii_analysis_service = PIIAnalysisService() 76 | 77 | results = pii_analysis_service.analyze_collection( 78 | data=pd.DataFrame({ 79 | "text": [ 80 | "I attend the University of Central Florida, how about you?", 81 | "If anyone needs trig help, my phone number 555-555-5555 and my email is example123@email.com", 82 | "Oh I do! My number is 777-777-7777. Where is the residence hall?", 83 | "The dorm is over at 123 Dark Data Lane, OH, 11111", 84 | "Cool, I'll be there!" 85 | ], 86 | "metadata": [ 87 | {"location": True, "url": False, "screen_name": True}, 88 | {"location": True, "url": False, "screen_name": True}, 89 | {"location": False, "url": False, "screen_name": True}, # Not all social media posts will have location metadata 90 | {"location": False, "url": False, "screen_name": True}, 91 | {"location": True, "url": False, "screen_name": True}, 92 | ] 93 | }), 94 | language_code="en", 95 | collection_name="Data Set Label", # this is optional, geared for those that require labeling of collections 96 | collection_type="SAMPLE" # defaults to POPULATION, input used for standard deviation and variance calculations 97 | ) 98 | ``` 99 | 100 | Sample output: 101 | 102 | ``` 103 | { 104 | "collection_name": "PII Collection 1", 105 | "collection_type": "POPULATION", 106 | "analyses": [ 107 | { 108 | "analysis": [ 109 | { 110 | "pii_type_detected": "PERSON", 111 | "risk_level": 3, 112 | "risk_level_definition": "Identifiable", 113 | "cluster_membership_type": "Financial Information", 114 | "hipaa_category": "Protected Health Information", 115 | "dhs_category": "Linkable", 116 | "nist_category": "Directly PII", 117 | "entity_type": "PERSON", 118 | "score": 0.85, 119 | "start": 21, 120 | "end": 24, 121 | } 122 | ], 123 | "index": 0, 124 | "risk_score_mean": 3, 125 | "sanitized_text": "Hi! 
My name is ", 126 | }, 127 | { 128 | "analysis": [ 129 | { 130 | "pii_type_detected": "EMAIL_ADDRESS", 131 | "risk_level": 3, 132 | "risk_level_definition": "Identifiable", 133 | "cluster_membership_type": "Personal Preferences", 134 | "hipaa_category": "Protected Health Information", 135 | "dhs_category": "Stand Alone PII", 136 | "nist_category": "Directly PII", 137 | "entity_type": "EMAIL_ADDRESS", 138 | "score": 1.0, 139 | "start": 74, 140 | "end": 94, 141 | }, 142 | { 143 | "pii_type_detected": "PHONE_NUMBER", 144 | "risk_level": 3, 145 | "risk_level_definition": "Identifiable", 146 | "cluster_membership_type": "Contact Information", 147 | "hipaa_category": "Protected Health Information", 148 | "dhs_category": "Stand Alone PII", 149 | "nist_category": "Directly PII", 150 | "entity_type": "PHONE_NUMBER", 151 | "score": 0.75, 152 | "start": 45, 153 | "end": 57, 154 | }, 155 | { 156 | "pii_type_detected": "URL", 157 | "risk_level": 2, 158 | "risk_level_definition": "Semi-Identifiable", 159 | "cluster_membership_type": "Community Interaction", 160 | "hipaa_category": "Not Protected Health Information", 161 | "dhs_category": "Linkable", 162 | "nist_category": "Linkable", 163 | "entity_type": "URL", 164 | "score": 0.5, 165 | "start": 85, 166 | "end": 94, 167 | }, 168 | ], 169 | "index": 1, 170 | "risk_score_mean": 2.6666666666666665, 171 | "sanitized_text": "Hi! My phone number is . You can also reach me by email at ", 172 | }, 173 | { 174 | "analysis": [ 175 | { 176 | "pii_type_detected": None, 177 | "risk_level": 1, 178 | "risk_level_definition": "Non-Identifiable", 179 | "cluster_membership_type": None, 180 | "hipaa_category": None, 181 | "dhs_category": None, 182 | "nist_category": None, 183 | } 184 | ], 185 | "index": 2, 186 | "risk_score_mean": 1, 187 | "sanitized_text": "Hi! What is the title of this book?", 188 | }, 189 | { 190 | "analysis": [ 191 | { 192 | "pii_type_detected": "LOCATION", 193 | "risk_level": 2, 194 | "risk_level_definition": "Semi-Identifiable", 195 | "cluster_membership_type": "Secure Identifiers", 196 | "hipaa_category": "Protected Health Information", 197 | "dhs_category": "Not Mentioned", 198 | "nist_category": "Linkable", 199 | "entity_type": "LOCATION", 200 | "score": 0.85, 201 | "start": 42, 202 | "end": 44, 203 | } 204 | ], 205 | "index": 3, 206 | "risk_score_mean": 2, 207 | "sanitized_text": "", 208 | }, 209 | { 210 | "analysis": [ 211 | { 212 | "pii_type_detected": None, 213 | "risk_level": 1, 214 | "risk_level_definition": "Non-Identifiable", 215 | "cluster_membership_type": None, 216 | "hipaa_category": None, 217 | "dhs_category": None, 218 | "nist_category": None, 219 | } 220 | ], 221 | "index": 4, 222 | "risk_score_mean": 1, 223 | "sanitized_text": "Hi! I have a cat too.", 224 | }, 225 | ], 226 | "detection_count": 5, 227 | "risk_scores": [3, 2.6666666666666665, 1, 2, 1], 228 | "risk_score_mean": 1.9333333333333333, 229 | "risk_score_mode": 1, 230 | "risk_score_median": 2, 231 | "risk_score_standard_deviation": 0.8273115763993905, 232 | "risk_score_variance": 0.6844444444444444, 233 | "detected_pii_types": { 234 | "LOCATION", 235 | "EMAIL_ADDRESS", 236 | "URL", 237 | "PHONE_NUMBER", 238 | "PERSON", 239 | }, 240 | "detected_pii_type_frequencies": { 241 | "PERSON": 1, 242 | "EMAIL_ADDRESS": 1, 243 | "PHONE_NUMBER": 1, 244 | "URL": 1, 245 | "LOCATION": 1, 246 | }, 247 | } 248 | 249 | ``` 250 | 251 | Check out full analysis example in the notebook: notebooks/pii-analysis-ms-presidio. 
-------------------------------------------------------------------------------- /joss/paper.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'PII-Codex: a Python library for PII detection, categorization, and severity assessment' 3 | 4 | tags: 5 | - Python 6 | - PII 7 | - PII topology 8 | - risk categories 9 | - personal identifiable information 10 | 11 | authors: 12 | - name: Eidan J. Rosado 13 | orcid: 0000-0003-0665-098X 14 | affiliation: 1 15 | affiliations: 16 | - name: 17 | College of Computing and Engineering, Nova Southeastern University, Fort Lauderdale, FL 33314, 18 | USA 19 | index: 1 20 | 21 | date: 30 Dec 2022 22 | 23 | bibliography: paper.bib 24 | 25 | --- 26 | 27 | # Summary 28 | There have been a number of advancements in the detection of personal identifiable information (PII) and scrubbing libraries to aid developers and researchers in their detection and anonymization efforts. With the recent shift in data handling procedures and global policy implementations regarding identifying information, it is becoming more important for data consumers to be aware of what data needs to be scrubbed, why it's being scrubbed, and to have the means to perform said scrubbing. 29 | 30 | PII-Codex is a collection of extended theoretical, conceptual, and policy works in PII categorization and severity assessment [@schwartz_solove_2011; @milne_pettinico_hajjat_markos_2016], and the integration thereof with PII detection software and API client adapters. It allows researchers to analyze a body of text or a collection thereof and determine whether the PII detected within these texts, if any, are considered identifiable. Furthermore, it allows end-users to determine the severity and associated categorizations of detected PII tokens. 31 | 32 | # Challenges 33 | 34 | While a number of open-source PII detection libraries have been created and PII detection APIs are provided by cloud service providers [@azure_detection_cognitive_skill; @aws_comprehend], the detection results are often provided with the type of PII detected, an index reference of where the detection is within the text, and a confidence score associated with the detection. Those receiving these results aren’t provided with a means of understanding why the text token is classified as PII, what framework, policy, or convention labels it as such, and just how severe its exposure is. 35 | 36 | # Statement of Need 37 | 38 | The general knowledge base of identifiable data, the usage restrictions of this data, and the associated policies surrounding it have shifted drastically over the years. Between the mid-1990s and 2000s, or the dotcom bubble, the industry saw a rise in data capitalism by way of making information freely accessible, fostering a way to make the web personal, and finally, placing value on data and the potential it had to impact consumerism [@West_2017]. Alongside the rise in data capitalism came early data policy initiatives. In 1995, the EU Data Protection Directive was created to establish some minimum data privacy and security standards [@GDPR_eu_2022] and the US Health Insurance Portability and Accountability Act (HIPAA) was enacted in 1996 with the final regulation being published in 2000 [@OCR_2022] to help battle healthcare fraud and to provide regulations governing the privacy and security of an individual's patient details. 
Both of these policies have evolved over the years to include protected entities and have paved the way to the policies and protective technologies the world sees today aimed at protecting PII. 39 | 40 | The tech industry specifically has had to adjust to these policy changes regarding the tracking of individuals, the usage of data from online profiles and platforms, and the right to be forgotten entirely from a service or platform [@gdpr_erasure_right]. While the shift has provided data protections around the globe, the majority of technology users continue to have little to no control over their personal information with third-party data consumers [@tene_polonetsky_2012; @trepte2020]. From an individual researcher's perspective, understanding if identifiable data types exist in a data set can prevent accidental sharing of such data by allowing its detection in the first place and, in the case of this software package, permit for the results to be publishable by sanitizing the text tokens and provide transparency on the reasons why the token was considered to be PII. From a platform user's perspective, detecting PII ahead of publication and understanding why it is considered PII can prevent an accidental disclosure that can later be used by adversaries. This need is what drives the development of PII-Codex. 41 | 42 | # The PII-Codex Package 43 | 44 | PII-Codex is a Python package built to combine the Information Sensitivity Typology works of Milne et al. [@milne_pettinico_hajjat_markos_2016], categorizations and guidelines from the National Institute of Standards and Technology (NIST) [@mccallister_grance_scarfone_2010], Department of Homeland Security (DHS) [@dhs_2012], and the Health Insurance Portability and Accountability Act (HIPAA) [@hipaa]. It combines these categories to rate the detection on a scale of 1 to 3, labeling it as Non-Identifiable, Semi-Identifiable, or Identifiable as presented by the risk continuum by Schwartz and Solove [@schwartz_solove_2011]. The package provides a subset of Milne et al.'s Information Sensitivity Typology as some technologies group entries into a singular category or the detection of the entry may not yet be available. 45 | 46 | Built into the package is an analyzer service that leverages Microsoft’s Presidio library for PII detection and anonymization [@microsoft_presidio] as well as the option to use the built-in detection adapters for Microsoft Presidio, Azure Detection Cognitive Skill [@azure_detection_cognitive_skill], and AWS Comprehend [@aws_comprehend] for pre-existing detections. The output of the adapters and the analysis service are analysis objects with a listing of detections, detection frequencies, severities, mean risk scores for each string processed, and summary statistics on the analysis made. 47 | 48 | The final outputs do not contain the original texts but provide the sanitized or anonymized texts and where to find the detections, should the end-user require this information. In providing this capability, one can prevent the accidental dissemination of private information in downstream research efforts, an issue commonly discussed in cybersecurity research [@belanger_crossler_2011; @moura_serrão_2019; @beigi_liu_2020]. 49 | 50 | ## Design 51 | 52 | PII-Codex is broken down into a series of services, utilities, and adapters. For a majority of cases, end-users may already have used Microsoft Presidio, Azure, AWS Comprehend or some other solution to detect PII in text. 
To account for these cases, adapters were provided to convert the varying detection results into a common form, DetectionResultItem and DetectionResult objects, which are later used by the Analysis Service and Assessment Service. This usage flow is presented in Figure 1. 53 | 54 | ![Converting And Analyzing Existing Detections\label{fig:converting-and-analyzing-existing-detections}](../docs/UC1_Converting_Existing_Detections_With_Adapters.png) 55 | 56 | As shown in Figure 2, for end-users that still require detections to be carried out, Microsoft Presidio was integrated as the primary analysis provider within the Analysis Service. 57 | 58 | ![Using Presidio-Enabled Builtin Service for Detection and Analysis\label{fig:using-presidio-builtins-for-analyses}](../docs/UC2_Using_Presidio_Builtin_Service_for_Detection_and_Analysis.png) 59 | 60 | The Analysis and Assessment services expose functions for those defining their own detectors and enable the conversion to a common detection type so that the full Analysis Result set can be built. 61 | 62 | ## Example Usage 63 | 64 | The collection analysis permits a list of strings under 65 | texts parameter or a DataFrame with a text column under the data parameter. The collection will be analyzed and a summary provided in an AnalysisResultSet object. The AnalysisResultSet object will show individual detections and their risk assessments which includes risk score assessment and associated PII categories. Each analysis is provided with the sanitized input text when using the default analysis service. Unless supplied with another replacement token, the sanitized input text will contain in place of detected PII tokens: 66 | 67 | ``` 68 | Hi! My phone number is ." 69 | ``` 70 | 71 | Email detections, for example, are presented as Identifiable, which automatically places it at a risk level of 3, the highest a token is assigned. Something like a URL is considered Semi-Identifiable and therefore is assigned a risk level of 2. Other texts will fall under Non-Identifiable and will be assigned a risk level of 1. 72 | 73 | Using the `texts` parameter: 74 | 75 | ```python 76 | from pii_codex.services.analysis_service import PIIAnalysisService 77 | 78 | results = PIIAnalysisService().analyze_collection( 79 | texts=[ 80 | "email@example.com is the email I can be reached at.", 81 | "Their number is 555-555-5555" 82 | ] 83 | ) 84 | ``` 85 | 86 | Using the `data` parameter with metadata support for social media analysis: 87 | 88 | ```python 89 | import pandas as pd 90 | from pii_codex.services.analysis_service import PIIAnalysisService 91 | 92 | results = PIIAnalysisService().analyze_collection( 93 | data=pd.DataFrame.from_dict({ 94 | "text": [ 95 | "email@example.com is the email I can be reached at.", 96 | "Their number is 555-555-5555" 97 | ], 98 | "metadata": [ 99 | {"location": True, "url": False, "screen_name": True}, 100 | {"location": False, "url": False, "screen_name": True} 101 | ] 102 | }), 103 | collection_name="Social Media Example", 104 | collection_type="SAMPLE" 105 | ) 106 | ``` 107 | 108 | The AnalysisResultSet object will show individual detections and their risk assessments. Email detections, for example, are presented as identifiable and direct PII which automatically place it at a risk level of 3, the highest a token is assigned. 
109 | 
110 | 
111 | ```json
112 | {
113 |     "pii_type_detected": "EMAIL_ADDRESS",
114 |     "risk_level": 3,
115 |     "risk_level_definition": "Identifiable",
116 |     "cluster_membership_type": "Personal Preferences",
117 |     "hipaa_category": "Protected Health Information",
118 |     "dhs_category": "Stand Alone PII",
119 |     "nist_category": "Directly PII",
120 |     "entity_type": "EMAIL_ADDRESS",
121 |     "score": 1.0,
122 |     "start": 74,
123 |     "end": 94
124 | }
125 | ```
126 | 
127 | Each string analyzed may contain $n$ PII detections, with each detection having a risk severity between 1 and 3 inclusive. The risk score mean $\overline{rs}$ is the average of all token risk scores $rs$ for that one string. Since other detected data, while non-identifiable on its own, may provide context that can lead to identification, its tokens (assigned a risk score of 1 for non-identifiable) are also taken into account in the calculation. The calculation for a single string's risk score mean is presented in the formula below.
128 | 
129 | \begin{equation}
130 | \overline{rs} = \frac{1}{n} \sum_{i=1}^{n} rs_{i}
131 | \end{equation}
132 | 
133 | For collections of strings being analyzed, each risk score mean is taken into account to provide a collection-wide risk score mean value. Given that a collection can have $n$ analyzed strings, the collection risk score mean can be calculated with the mean-of-means formula below.
134 | 
135 | \begin{equation}
136 | \mu_{\overline{rs}} = \frac{\overline{rs}_1 + \overline{rs}_2 + \dots + \overline{rs}_n}{n}
137 | \end{equation}
138 | 
139 | In the AnalysisResult object, the mean risk score of all detected tokens in a string is provided as the risk score mean. In the AnalysisResultSet object, the mean of means, or the average of all the individual risk score means, is provided as the risk score mean.
140 | 
141 | ## Availability
142 | PII-Codex can be installed via pip or poetry. The source code of PII-Codex is available at the GitHub repository (https://github.com/EdyVision/pii-codex). The builds can be obtained from https://github.com/EdyVision/pii-codex/releases and via Zenodo [@rosado2022].
143 | 
144 | # References
--------------------------------------------------------------------------------