├── .gitignore
├── README.md
├── data
│   ├── quranic-treebank-0.4-chunks.tsv
│   └── vectors.txt
├── poetry.lock
├── pyproject.toml
├── src
│   ├── chunks
│   │   ├── chunk.py
│   │   ├── chunks.py
│   │   ├── location.py
│   │   └── preprocessor.py
│   ├── data.py
│   ├── embeddings.py
│   └── models
│       ├── bilstm_chunker.py
│       └── evaluator.py
└── tests
    ├── chunker_test.py
    └── dataset_splitter.py

/.gitignore:
--------------------------------------------------------------------------------
# hidden
.*
!.gitignore

# Python
__pycache__
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Quran Neural Chunker

*Join us on a new journey! Visit the [Corpus 2.0 upgrade project](https://github.com/kaisdukes/quranic-corpus) for new work on the Quranic Arabic Corpus.*

## What’s in this Repo?

A data preprocessor for the [Quranic Treebank](https://qurancorpus.app/treebank/2:258) that uses neural networks to divide longer verses into smaller chunks.

To work with this codebase, you will need a strong background in Artificial Intelligence applied to Quranic Research, specifically in the fields of Computational Linguistics and Natural Language Processing (NLP).

## Why Do We Need This?

Large portions of the Quran contain lengthy verses. The Quranic Treebank is designed primarily as an educational resource, allowing users of the corpus to gain a deeper linguistic understanding of the Classical Arabic language of the Quran through side-by-side comparison with *i’rāb* (إعراب), traditional linguistic analysis. The treebank breaks down longer verses into multiple dependency graphs to annotate syntactic structure. Each graph corresponds to a chunk.

Dependency graphs are kept intentionally short for easier display on mobile devices. Larger syntactic structures that cross graphs are linked together through reference nodes. To construct the treebank, we first need to perform verse chunking. There are several ways this could be done, but one possibility is to train a machine learning model using four sources of data:

**Existing chunk boundaries:** The existing division of verses implied by the dependency graphs in the treebank.

**Reference grammar alignment:** The breakdown of verses into word groups in the reference grammar used to construct the treebank, Salih’s *al-I’rāb al-Mufassal*. In principle, this could be a strong choice for training the model, as the treebank was initially chunked to support easier alignment and cross-referencing with this reference work.

**Pause marks:** Although the Classical Arabic Uthmani script of the Quran doesn’t contain modern punctuation like full stops or commas, it does contain [pause marks](https://corpus.quran.com/documentation/pausemarks.jsp) (to support *waqf* and *tajweed*), which may aid in chunking.

**Punctuation from translations:** The Quranic Arabic Corpus has word-aligned translations into English, which often include punctuation. Using this data may also help boost the accuracy of the chunker.

Because the evaluation step needs to test against the treebank, it makes sense to include the existing implied chunk boundaries as part of the training dataset. Other data sources are included to test how they might boost accuracy. Choosing just one signal, like *waqf* marks, might not be optimal.
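As a quick sanity check, how well any single signal predicts chunk boundaries can be measured directly from the training data described in the next section. Below is a minimal sketch of a pause-mark baseline, using the column names that appear in the model code, and assuming missing pause marks are read as empty/NaN values (as the `fillna(0)` call in the model code suggests):

```
import csv

import pandas as pd

df = pd.read_csv('data/quranic-treebank-0.4-chunks.tsv', sep='\t', quoting=csv.QUOTE_NONE)

# predict a chunk boundary wherever a token carries a pause mark
predicted = df['pause_mark'].notna()
actual = df['chunk_end'] == 1

hits = (predicted & actual).sum()
print(f'pause-mark baseline: precision={hits / predicted.sum():.3f}, '
      f'recall={hits / actual.sum():.3f}')
```

If either score turns out low, that supports combining several signals rather than relying on *waqf* marks alone.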

## What’s in the Training Data?

A ‘word’ in the Quran isn't easily defined, due to the rich morphology of Classical Arabic. The Quranic Arabic Corpus uses the terminology ‘segment’ to denote a morphological segment and ‘token’ to denote a whitespace-separated token.

The [quranic-treebank-0.4-chunks.tsv](https://github.com/kaisdukes/quran-verse-chunker/tree/main/data) file has one row per token, with 9 columns:

* Chapter number
* Verse number
* Token number
* The Arabic form of the token (in Unicode)
* The corresponding embedding ID (i.e. word ID) from GloVe-Arabic
* The word-aligned English translation, including punctuation marks such as full stops
* The [pause mark](https://corpus.quran.com/documentation/pausemarks.jsp) (*waqf*) associated with the token
* A binary flag indicating if the token is at the end of a word group in the corresponding *i’rāb* (إعراب) in the reference grammar *al-I’rāb al-Mufassal*
* A binary flag indicating if the token is at the end of a dependency graph. This value is the expected output of the chunker.

## Quranic Word Embeddings

This repo also contains Quranic word embeddings. These are not perfect, but they are good enough for the task at hand (verse chunking). For this reason, we do not recommend using these embeddings for other Quranic NLP tasks without further refinement. The mapping process, described below, may not be suitable or accurate enough for this.

The embeddings were derived using the following process:

1. For more accurate mappings, the simplified *imla'ei* (إملائي) Quranic script was used, instead of the more commonly used *uthmani* script. This is freely available from the [Tanzil project](https://tanzil.net/docs/quran_text_types).
2. We use the [GloVe-Arabic](https://github.com/tarekeldeeb/GloVe-Arabic) word embeddings dataset, based on an Arabic corpus of 1.75B tokens. This corpus includes Classical Arabic, which the Quran is written in.
3. Mapping Quranic words to GloVe-Arabic was possible by dropping diacritics and vocative prefixes, resulting in 76,205 / 77,429 = 98.419% of words mapped. Again, it’s worth stressing that this is not a perfect mapping, due to similar word-forms being ambiguous without diacritics, but it is a useful approximation for the specific task of verse chunking.

## Getting Started

This project uses [Poetry](https://python-poetry.org) to manage package dependencies.

First, clone the repository:

```
git clone https://github.com/kaisdukes/quran-neural-chunker.git
cd quran-neural-chunker
```

Install Poetry using [Homebrew](https://brew.sh):

```
brew install poetry
```

Next, install project dependencies:

```
poetry install
```

All dependencies, such as [pandas](https://pandas.pydata.org), are installed in the virtual environment.

Use the Poetry shell:

```
poetry shell
```

Test the chunker:

```
python tests/chunker_test.py
```
--------------------------------------------------------------------------------
/poetry.lock:
--------------------------------------------------------------------------------
1 | # This file is automatically @generated by Poetry 1.5.1 and should not be changed by hand.
2 |
3 | [[package]]
4 | name = "filelock"
5 | version = "3.12.2"
6 | description = "A platform independent file lock."
7 | optional = false 8 | python-versions = ">=3.7" 9 | files = [ 10 | {file = "filelock-3.12.2-py3-none-any.whl", hash = "sha256:cbb791cdea2a72f23da6ac5b5269ab0a0d161e9ef0100e653b69049a7706d1ec"}, 11 | {file = "filelock-3.12.2.tar.gz", hash = "sha256:002740518d8aa59a26b0c76e10fb8c6e15eae825d34b6fdf670333fd7b938d81"}, 12 | ] 13 | 14 | [package.extras] 15 | docs = ["furo (>=2023.5.20)", "sphinx (>=7.0.1)", "sphinx-autodoc-typehints (>=1.23,!=1.23.4)"] 16 | testing = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "diff-cover (>=7.5)", "pytest (>=7.3.1)", "pytest-cov (>=4.1)", "pytest-mock (>=3.10)", "pytest-timeout (>=2.1)"] 17 | 18 | [[package]] 19 | name = "jinja2" 20 | version = "3.1.2" 21 | description = "A very fast and expressive template engine." 22 | optional = false 23 | python-versions = ">=3.7" 24 | files = [ 25 | {file = "Jinja2-3.1.2-py3-none-any.whl", hash = "sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61"}, 26 | {file = "Jinja2-3.1.2.tar.gz", hash = "sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852"}, 27 | ] 28 | 29 | [package.dependencies] 30 | MarkupSafe = ">=2.0" 31 | 32 | [package.extras] 33 | i18n = ["Babel (>=2.7)"] 34 | 35 | [[package]] 36 | name = "joblib" 37 | version = "1.3.1" 38 | description = "Lightweight pipelining with Python functions" 39 | optional = false 40 | python-versions = ">=3.7" 41 | files = [ 42 | {file = "joblib-1.3.1-py3-none-any.whl", hash = "sha256:89cf0529520e01b3de7ac7b74a8102c90d16d54c64b5dd98cafcd14307fdf915"}, 43 | {file = "joblib-1.3.1.tar.gz", hash = "sha256:1f937906df65329ba98013dc9692fe22a4c5e4a648112de500508b18a21b41e3"}, 44 | ] 45 | 46 | [[package]] 47 | name = "markupsafe" 48 | version = "2.1.3" 49 | description = "Safely add untrusted strings to HTML/XML markup." 
50 | optional = false 51 | python-versions = ">=3.7" 52 | files = [ 53 | {file = "MarkupSafe-2.1.3-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:cd0f502fe016460680cd20aaa5a76d241d6f35a1c3350c474bac1273803893fa"}, 54 | {file = "MarkupSafe-2.1.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:e09031c87a1e51556fdcb46e5bd4f59dfb743061cf93c4d6831bf894f125eb57"}, 55 | {file = "MarkupSafe-2.1.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:68e78619a61ecf91e76aa3e6e8e33fc4894a2bebe93410754bd28fce0a8a4f9f"}, 56 | {file = "MarkupSafe-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:65c1a9bcdadc6c28eecee2c119465aebff8f7a584dd719facdd9e825ec61ab52"}, 57 | {file = "MarkupSafe-2.1.3-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:525808b8019e36eb524b8c68acdd63a37e75714eac50e988180b169d64480a00"}, 58 | {file = "MarkupSafe-2.1.3-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:962f82a3086483f5e5f64dbad880d31038b698494799b097bc59c2edf392fce6"}, 59 | {file = "MarkupSafe-2.1.3-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:aa7bd130efab1c280bed0f45501b7c8795f9fdbeb02e965371bbef3523627779"}, 60 | {file = "MarkupSafe-2.1.3-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:c9c804664ebe8f83a211cace637506669e7890fec1b4195b505c214e50dd4eb7"}, 61 | {file = "MarkupSafe-2.1.3-cp310-cp310-win32.whl", hash = "sha256:10bbfe99883db80bdbaff2dcf681dfc6533a614f700da1287707e8a5d78a8431"}, 62 | {file = "MarkupSafe-2.1.3-cp310-cp310-win_amd64.whl", hash = "sha256:1577735524cdad32f9f694208aa75e422adba74f1baee7551620e43a3141f559"}, 63 | {file = "MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:ad9e82fb8f09ade1c3e1b996a6337afac2b8b9e365f926f5a61aacc71adc5b3c"}, 64 | {file = "MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:3c0fae6c3be832a0a0473ac912810b2877c8cb9d76ca48de1ed31e1c68386575"}, 65 | {file = "MarkupSafe-2.1.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b076b6226fb84157e3f7c971a47ff3a679d837cf338547532ab866c57930dbee"}, 66 | {file = "MarkupSafe-2.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bfce63a9e7834b12b87c64d6b155fdd9b3b96191b6bd334bf37db7ff1fe457f2"}, 67 | {file = "MarkupSafe-2.1.3-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:338ae27d6b8745585f87218a3f23f1512dbf52c26c28e322dbe54bcede54ccb9"}, 68 | {file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:e4dd52d80b8c83fdce44e12478ad2e85c64ea965e75d66dbeafb0a3e77308fcc"}, 69 | {file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:df0be2b576a7abbf737b1575f048c23fb1d769f267ec4358296f31c2479db8f9"}, 70 | {file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:5bbe06f8eeafd38e5d0a4894ffec89378b6c6a625ff57e3028921f8ff59318ac"}, 71 | {file = "MarkupSafe-2.1.3-cp311-cp311-win32.whl", hash = "sha256:dd15ff04ffd7e05ffcb7fe79f1b98041b8ea30ae9234aed2a9168b5797c3effb"}, 72 | {file = "MarkupSafe-2.1.3-cp311-cp311-win_amd64.whl", hash = "sha256:134da1eca9ec0ae528110ccc9e48041e0828d79f24121a1a146161103c76e686"}, 73 | {file = "MarkupSafe-2.1.3-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:8e254ae696c88d98da6555f5ace2279cf7cd5b3f52be2b5cf97feafe883b58d2"}, 74 | {file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = 
"sha256:cb0932dc158471523c9637e807d9bfb93e06a95cbf010f1a38b98623b929ef2b"}, 75 | {file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9402b03f1a1b4dc4c19845e5c749e3ab82d5078d16a2a4c2cd2df62d57bb0707"}, 76 | {file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ca379055a47383d02a5400cb0d110cef0a776fc644cda797db0c5696cfd7e18e"}, 77 | {file = "MarkupSafe-2.1.3-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:b7ff0f54cb4ff66dd38bebd335a38e2c22c41a8ee45aa608efc890ac3e3931bc"}, 78 | {file = "MarkupSafe-2.1.3-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:c011a4149cfbcf9f03994ec2edffcb8b1dc2d2aede7ca243746df97a5d41ce48"}, 79 | {file = "MarkupSafe-2.1.3-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:56d9f2ecac662ca1611d183feb03a3fa4406469dafe241673d521dd5ae92a155"}, 80 | {file = "MarkupSafe-2.1.3-cp37-cp37m-win32.whl", hash = "sha256:8758846a7e80910096950b67071243da3e5a20ed2546e6392603c096778d48e0"}, 81 | {file = "MarkupSafe-2.1.3-cp37-cp37m-win_amd64.whl", hash = "sha256:787003c0ddb00500e49a10f2844fac87aa6ce977b90b0feaaf9de23c22508b24"}, 82 | {file = "MarkupSafe-2.1.3-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:2ef12179d3a291be237280175b542c07a36e7f60718296278d8593d21ca937d4"}, 83 | {file = "MarkupSafe-2.1.3-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:2c1b19b3aaacc6e57b7e25710ff571c24d6c3613a45e905b1fde04d691b98ee0"}, 84 | {file = "MarkupSafe-2.1.3-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8afafd99945ead6e075b973fefa56379c5b5c53fd8937dad92c662da5d8fd5ee"}, 85 | {file = "MarkupSafe-2.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8c41976a29d078bb235fea9b2ecd3da465df42a562910f9022f1a03107bd02be"}, 86 | {file = "MarkupSafe-2.1.3-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d080e0a5eb2529460b30190fcfcc4199bd7f827663f858a226a81bc27beaa97e"}, 87 | {file = "MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:69c0f17e9f5a7afdf2cc9fb2d1ce6aabdb3bafb7f38017c0b77862bcec2bbad8"}, 88 | {file = "MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:504b320cd4b7eff6f968eddf81127112db685e81f7e36e75f9f84f0df46041c3"}, 89 | {file = "MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:42de32b22b6b804f42c5d98be4f7e5e977ecdd9ee9b660fda1a3edf03b11792d"}, 90 | {file = "MarkupSafe-2.1.3-cp38-cp38-win32.whl", hash = "sha256:ceb01949af7121f9fc39f7d27f91be8546f3fb112c608bc4029aef0bab86a2a5"}, 91 | {file = "MarkupSafe-2.1.3-cp38-cp38-win_amd64.whl", hash = "sha256:1b40069d487e7edb2676d3fbdb2b0829ffa2cd63a2ec26c4938b2d34391b4ecc"}, 92 | {file = "MarkupSafe-2.1.3-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:8023faf4e01efadfa183e863fefde0046de576c6f14659e8782065bcece22198"}, 93 | {file = "MarkupSafe-2.1.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:6b2b56950d93e41f33b4223ead100ea0fe11f8e6ee5f641eb753ce4b77a7042b"}, 94 | {file = "MarkupSafe-2.1.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9dcdfd0eaf283af041973bff14a2e143b8bd64e069f4c383416ecd79a81aab58"}, 95 | {file = "MarkupSafe-2.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:05fb21170423db021895e1ea1e1f3ab3adb85d1c2333cbc2310f2a26bc77272e"}, 96 | {file = "MarkupSafe-2.1.3-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = 
"sha256:282c2cb35b5b673bbcadb33a585408104df04f14b2d9b01d4c345a3b92861c2c"}, 97 | {file = "MarkupSafe-2.1.3-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:ab4a0df41e7c16a1392727727e7998a467472d0ad65f3ad5e6e765015df08636"}, 98 | {file = "MarkupSafe-2.1.3-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:7ef3cb2ebbf91e330e3bb937efada0edd9003683db6b57bb108c4001f37a02ea"}, 99 | {file = "MarkupSafe-2.1.3-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:0a4e4a1aff6c7ac4cd55792abf96c915634c2b97e3cc1c7129578aa68ebd754e"}, 100 | {file = "MarkupSafe-2.1.3-cp39-cp39-win32.whl", hash = "sha256:fec21693218efe39aa7f8599346e90c705afa52c5b31ae019b2e57e8f6542bb2"}, 101 | {file = "MarkupSafe-2.1.3-cp39-cp39-win_amd64.whl", hash = "sha256:3fd4abcb888d15a94f32b75d8fd18ee162ca0c064f35b11134be77050296d6ba"}, 102 | {file = "MarkupSafe-2.1.3.tar.gz", hash = "sha256:af598ed32d6ae86f1b747b82783958b1a4ab8f617b06fe68795c7f026abbdcad"}, 103 | ] 104 | 105 | [[package]] 106 | name = "mpmath" 107 | version = "1.3.0" 108 | description = "Python library for arbitrary-precision floating-point arithmetic" 109 | optional = false 110 | python-versions = "*" 111 | files = [ 112 | {file = "mpmath-1.3.0-py3-none-any.whl", hash = "sha256:a0b2b9fe80bbcd81a6647ff13108738cfb482d481d826cc0e02f5b35e5c88d2c"}, 113 | {file = "mpmath-1.3.0.tar.gz", hash = "sha256:7a28eb2a9774d00c7bc92411c19a89209d5da7c4c9a9e227be8330a23a25b91f"}, 114 | ] 115 | 116 | [package.extras] 117 | develop = ["codecov", "pycodestyle", "pytest (>=4.6)", "pytest-cov", "wheel"] 118 | docs = ["sphinx"] 119 | gmpy = ["gmpy2 (>=2.1.0a4)"] 120 | tests = ["pytest (>=4.6)"] 121 | 122 | [[package]] 123 | name = "networkx" 124 | version = "3.1" 125 | description = "Python package for creating and manipulating graphs and networks" 126 | optional = false 127 | python-versions = ">=3.8" 128 | files = [ 129 | {file = "networkx-3.1-py3-none-any.whl", hash = "sha256:4f33f68cb2afcf86f28a45f43efc27a9386b535d567d2127f8f61d51dec58d36"}, 130 | {file = "networkx-3.1.tar.gz", hash = "sha256:de346335408f84de0eada6ff9fafafff9bcda11f0a0dfaa931133debb146ab61"}, 131 | ] 132 | 133 | [package.extras] 134 | default = ["matplotlib (>=3.4)", "numpy (>=1.20)", "pandas (>=1.3)", "scipy (>=1.8)"] 135 | developer = ["mypy (>=1.1)", "pre-commit (>=3.2)"] 136 | doc = ["nb2plots (>=0.6)", "numpydoc (>=1.5)", "pillow (>=9.4)", "pydata-sphinx-theme (>=0.13)", "sphinx (>=6.1)", "sphinx-gallery (>=0.12)", "texext (>=0.6.7)"] 137 | extra = ["lxml (>=4.6)", "pydot (>=1.4.2)", "pygraphviz (>=1.10)", "sympy (>=1.10)"] 138 | test = ["codecov (>=2.1)", "pytest (>=7.2)", "pytest-cov (>=4.0)"] 139 | 140 | [[package]] 141 | name = "numpy" 142 | version = "1.25.0" 143 | description = "Fundamental package for array computing in Python" 144 | optional = false 145 | python-versions = ">=3.9" 146 | files = [ 147 | {file = "numpy-1.25.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:8aa130c3042052d656751df5e81f6d61edff3e289b5994edcf77f54118a8d9f4"}, 148 | {file = "numpy-1.25.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:9e3f2b96e3b63c978bc29daaa3700c028fe3f049ea3031b58aa33fe2a5809d24"}, 149 | {file = "numpy-1.25.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d6b267f349a99d3908b56645eebf340cb58f01bd1e773b4eea1a905b3f0e4208"}, 150 | {file = "numpy-1.25.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4aedd08f15d3045a4e9c648f1e04daca2ab1044256959f1f95aafeeb3d794c16"}, 151 | {file = "numpy-1.25.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = 
"sha256:6d183b5c58513f74225c376643234c369468e02947b47942eacbb23c1671f25d"}, 152 | {file = "numpy-1.25.0-cp310-cp310-win32.whl", hash = "sha256:d76a84998c51b8b68b40448ddd02bd1081bb33abcdc28beee6cd284fe11036c6"}, 153 | {file = "numpy-1.25.0-cp310-cp310-win_amd64.whl", hash = "sha256:c0dc071017bc00abb7d7201bac06fa80333c6314477b3d10b52b58fa6a6e38f6"}, 154 | {file = "numpy-1.25.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:4c69fe5f05eea336b7a740e114dec995e2f927003c30702d896892403df6dbf0"}, 155 | {file = "numpy-1.25.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:9c7211d7920b97aeca7b3773a6783492b5b93baba39e7c36054f6e749fc7490c"}, 156 | {file = "numpy-1.25.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ecc68f11404930e9c7ecfc937aa423e1e50158317bf67ca91736a9864eae0232"}, 157 | {file = "numpy-1.25.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e559c6afbca484072a98a51b6fa466aae785cfe89b69e8b856c3191bc8872a82"}, 158 | {file = "numpy-1.25.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:6c284907e37f5e04d2412950960894b143a648dea3f79290757eb878b91acbd1"}, 159 | {file = "numpy-1.25.0-cp311-cp311-win32.whl", hash = "sha256:95367ccd88c07af21b379be1725b5322362bb83679d36691f124a16357390153"}, 160 | {file = "numpy-1.25.0-cp311-cp311-win_amd64.whl", hash = "sha256:b76aa836a952059d70a2788a2d98cb2a533ccd46222558b6970348939e55fc24"}, 161 | {file = "numpy-1.25.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:b792164e539d99d93e4e5e09ae10f8cbe5466de7d759fc155e075237e0c274e4"}, 162 | {file = "numpy-1.25.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:7cd981ccc0afe49b9883f14761bb57c964df71124dcd155b0cba2b591f0d64b9"}, 163 | {file = "numpy-1.25.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5aa48bebfb41f93043a796128854b84407d4df730d3fb6e5dc36402f5cd594c0"}, 164 | {file = "numpy-1.25.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5177310ac2e63d6603f659fadc1e7bab33dd5a8db4e0596df34214eeab0fee3b"}, 165 | {file = "numpy-1.25.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:0ac6edfb35d2a99aaf102b509c8e9319c499ebd4978df4971b94419a116d0790"}, 166 | {file = "numpy-1.25.0-cp39-cp39-win32.whl", hash = "sha256:7412125b4f18aeddca2ecd7219ea2d2708f697943e6f624be41aa5f8a9852cc4"}, 167 | {file = "numpy-1.25.0-cp39-cp39-win_amd64.whl", hash = "sha256:26815c6c8498dc49d81faa76d61078c4f9f0859ce7817919021b9eba72b425e3"}, 168 | {file = "numpy-1.25.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:5b1b90860bf7d8a8c313b372d4f27343a54f415b20fb69dd601b7efe1029c91e"}, 169 | {file = "numpy-1.25.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:85cdae87d8c136fd4da4dad1e48064d700f63e923d5af6c8c782ac0df8044542"}, 170 | {file = "numpy-1.25.0-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:cc3fda2b36482891db1060f00f881c77f9423eead4c3579629940a3e12095fe8"}, 171 | {file = "numpy-1.25.0.tar.gz", hash = "sha256:f1accae9a28dc3cda46a91de86acf69de0d1b5f4edd44a9b0c3ceb8036dfff19"}, 172 | ] 173 | 174 | [[package]] 175 | name = "pandas" 176 | version = "2.0.3" 177 | description = "Powerful data structures for data analysis, time series, and statistics" 178 | optional = false 179 | python-versions = ">=3.8" 180 | files = [ 181 | {file = "pandas-2.0.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:e4c7c9f27a4185304c7caf96dc7d91bc60bc162221152de697c98eb0b2648dd8"}, 182 | {file = "pandas-2.0.3-cp310-cp310-macosx_11_0_arm64.whl", hash = 
"sha256:f167beed68918d62bffb6ec64f2e1d8a7d297a038f86d4aed056b9493fca407f"}, 183 | {file = "pandas-2.0.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce0c6f76a0f1ba361551f3e6dceaff06bde7514a374aa43e33b588ec10420183"}, 184 | {file = "pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba619e410a21d8c387a1ea6e8a0e49bb42216474436245718d7f2e88a2f8d7c0"}, 185 | {file = "pandas-2.0.3-cp310-cp310-win32.whl", hash = "sha256:3ef285093b4fe5058eefd756100a367f27029913760773c8bf1d2d8bebe5d210"}, 186 | {file = "pandas-2.0.3-cp310-cp310-win_amd64.whl", hash = "sha256:9ee1a69328d5c36c98d8e74db06f4ad518a1840e8ccb94a4ba86920986bb617e"}, 187 | {file = "pandas-2.0.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:b084b91d8d66ab19f5bb3256cbd5ea661848338301940e17f4492b2ce0801fe8"}, 188 | {file = "pandas-2.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:37673e3bdf1551b95bf5d4ce372b37770f9529743d2498032439371fc7b7eb26"}, 189 | {file = "pandas-2.0.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b9cb1e14fdb546396b7e1b923ffaeeac24e4cedd14266c3497216dd4448e4f2d"}, 190 | {file = "pandas-2.0.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d9cd88488cceb7635aebb84809d087468eb33551097d600c6dad13602029c2df"}, 191 | {file = "pandas-2.0.3-cp311-cp311-win32.whl", hash = "sha256:694888a81198786f0e164ee3a581df7d505024fbb1f15202fc7db88a71d84ebd"}, 192 | {file = "pandas-2.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:6a21ab5c89dcbd57f78d0ae16630b090eec626360085a4148693def5452d8a6b"}, 193 | {file = "pandas-2.0.3-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:9e4da0d45e7f34c069fe4d522359df7d23badf83abc1d1cef398895822d11061"}, 194 | {file = "pandas-2.0.3-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:32fca2ee1b0d93dd71d979726b12b61faa06aeb93cf77468776287f41ff8fdc5"}, 195 | {file = "pandas-2.0.3-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:258d3624b3ae734490e4d63c430256e716f488c4fcb7c8e9bde2d3aa46c29089"}, 196 | {file = "pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9eae3dc34fa1aa7772dd3fc60270d13ced7346fcbcfee017d3132ec625e23bb0"}, 197 | {file = "pandas-2.0.3-cp38-cp38-win32.whl", hash = "sha256:f3421a7afb1a43f7e38e82e844e2bca9a6d793d66c1a7f9f0ff39a795bbc5e02"}, 198 | {file = "pandas-2.0.3-cp38-cp38-win_amd64.whl", hash = "sha256:69d7f3884c95da3a31ef82b7618af5710dba95bb885ffab339aad925c3e8ce78"}, 199 | {file = "pandas-2.0.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:5247fb1ba347c1261cbbf0fcfba4a3121fbb4029d95d9ef4dc45406620b25c8b"}, 200 | {file = "pandas-2.0.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:81af086f4543c9d8bb128328b5d32e9986e0c84d3ee673a2ac6fb57fd14f755e"}, 201 | {file = "pandas-2.0.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1994c789bf12a7c5098277fb43836ce090f1073858c10f9220998ac74f37c69b"}, 202 | {file = "pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5ec591c48e29226bcbb316e0c1e9423622bc7a4eaf1ef7c3c9fa1a3981f89641"}, 203 | {file = "pandas-2.0.3-cp39-cp39-win32.whl", hash = "sha256:04dbdbaf2e4d46ca8da896e1805bc04eb85caa9a82e259e8eed00254d5e0c682"}, 204 | {file = "pandas-2.0.3-cp39-cp39-win_amd64.whl", hash = "sha256:1168574b036cd8b93abc746171c9b4f1b83467438a5e45909fed645cf8692dbc"}, 205 | {file = "pandas-2.0.3.tar.gz", hash = "sha256:c02f372a88e0d17f36d3093a644c73cfc1788e876a7c4bcb4020a77512e2043c"}, 206 | ] 207 | 208 
| [package.dependencies] 209 | numpy = [ 210 | {version = ">=1.21.0", markers = "python_version >= \"3.10\""}, 211 | {version = ">=1.23.2", markers = "python_version >= \"3.11\""}, 212 | ] 213 | python-dateutil = ">=2.8.2" 214 | pytz = ">=2020.1" 215 | tzdata = ">=2022.1" 216 | 217 | [package.extras] 218 | all = ["PyQt5 (>=5.15.1)", "SQLAlchemy (>=1.4.16)", "beautifulsoup4 (>=4.9.3)", "bottleneck (>=1.3.2)", "brotlipy (>=0.7.0)", "fastparquet (>=0.6.3)", "fsspec (>=2021.07.0)", "gcsfs (>=2021.07.0)", "html5lib (>=1.1)", "hypothesis (>=6.34.2)", "jinja2 (>=3.0.0)", "lxml (>=4.6.3)", "matplotlib (>=3.6.1)", "numba (>=0.53.1)", "numexpr (>=2.7.3)", "odfpy (>=1.4.1)", "openpyxl (>=3.0.7)", "pandas-gbq (>=0.15.0)", "psycopg2 (>=2.8.6)", "pyarrow (>=7.0.0)", "pymysql (>=1.0.2)", "pyreadstat (>=1.1.2)", "pytest (>=7.3.2)", "pytest-asyncio (>=0.17.0)", "pytest-xdist (>=2.2.0)", "python-snappy (>=0.6.0)", "pyxlsb (>=1.0.8)", "qtpy (>=2.2.0)", "s3fs (>=2021.08.0)", "scipy (>=1.7.1)", "tables (>=3.6.1)", "tabulate (>=0.8.9)", "xarray (>=0.21.0)", "xlrd (>=2.0.1)", "xlsxwriter (>=1.4.3)", "zstandard (>=0.15.2)"] 219 | aws = ["s3fs (>=2021.08.0)"] 220 | clipboard = ["PyQt5 (>=5.15.1)", "qtpy (>=2.2.0)"] 221 | compression = ["brotlipy (>=0.7.0)", "python-snappy (>=0.6.0)", "zstandard (>=0.15.2)"] 222 | computation = ["scipy (>=1.7.1)", "xarray (>=0.21.0)"] 223 | excel = ["odfpy (>=1.4.1)", "openpyxl (>=3.0.7)", "pyxlsb (>=1.0.8)", "xlrd (>=2.0.1)", "xlsxwriter (>=1.4.3)"] 224 | feather = ["pyarrow (>=7.0.0)"] 225 | fss = ["fsspec (>=2021.07.0)"] 226 | gcp = ["gcsfs (>=2021.07.0)", "pandas-gbq (>=0.15.0)"] 227 | hdf5 = ["tables (>=3.6.1)"] 228 | html = ["beautifulsoup4 (>=4.9.3)", "html5lib (>=1.1)", "lxml (>=4.6.3)"] 229 | mysql = ["SQLAlchemy (>=1.4.16)", "pymysql (>=1.0.2)"] 230 | output-formatting = ["jinja2 (>=3.0.0)", "tabulate (>=0.8.9)"] 231 | parquet = ["pyarrow (>=7.0.0)"] 232 | performance = ["bottleneck (>=1.3.2)", "numba (>=0.53.1)", "numexpr (>=2.7.1)"] 233 | plot = ["matplotlib (>=3.6.1)"] 234 | postgresql = ["SQLAlchemy (>=1.4.16)", "psycopg2 (>=2.8.6)"] 235 | spss = ["pyreadstat (>=1.1.2)"] 236 | sql-other = ["SQLAlchemy (>=1.4.16)"] 237 | test = ["hypothesis (>=6.34.2)", "pytest (>=7.3.2)", "pytest-asyncio (>=0.17.0)", "pytest-xdist (>=2.2.0)"] 238 | xml = ["lxml (>=4.6.3)"] 239 | 240 | [[package]] 241 | name = "python-dateutil" 242 | version = "2.8.2" 243 | description = "Extensions to the standard Python datetime module" 244 | optional = false 245 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7" 246 | files = [ 247 | {file = "python-dateutil-2.8.2.tar.gz", hash = "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86"}, 248 | {file = "python_dateutil-2.8.2-py2.py3-none-any.whl", hash = "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9"}, 249 | ] 250 | 251 | [package.dependencies] 252 | six = ">=1.5" 253 | 254 | [[package]] 255 | name = "pytz" 256 | version = "2023.3" 257 | description = "World timezone definitions, modern and historical" 258 | optional = false 259 | python-versions = "*" 260 | files = [ 261 | {file = "pytz-2023.3-py2.py3-none-any.whl", hash = "sha256:a151b3abb88eda1d4e34a9814df37de2a80e301e68ba0fd856fb9b46bfbbbffb"}, 262 | {file = "pytz-2023.3.tar.gz", hash = "sha256:1d8ce29db189191fb55338ee6d0387d82ab59f3d00eac103412d64e0ebd0c588"}, 263 | ] 264 | 265 | [[package]] 266 | name = "scikit-learn" 267 | version = "1.3.0" 268 | description = "A set of python modules for machine learning and data mining" 269 | optional = false 270 
| python-versions = ">=3.8" 271 | files = [ 272 | {file = "scikit-learn-1.3.0.tar.gz", hash = "sha256:8be549886f5eda46436b6e555b0e4873b4f10aa21c07df45c4bc1735afbccd7a"}, 273 | {file = "scikit_learn-1.3.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:981287869e576d42c682cf7ca96af0c6ac544ed9316328fd0d9292795c742cf5"}, 274 | {file = "scikit_learn-1.3.0-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:436aaaae2c916ad16631142488e4c82f4296af2404f480e031d866863425d2a2"}, 275 | {file = "scikit_learn-1.3.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c7e28d8fa47a0b30ae1bd7a079519dd852764e31708a7804da6cb6f8b36e3630"}, 276 | {file = "scikit_learn-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ae80c08834a473d08a204d966982a62e11c976228d306a2648c575e3ead12111"}, 277 | {file = "scikit_learn-1.3.0-cp310-cp310-win_amd64.whl", hash = "sha256:552fd1b6ee22900cf1780d7386a554bb96949e9a359999177cf30211e6b20df6"}, 278 | {file = "scikit_learn-1.3.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:79970a6d759eb00a62266a31e2637d07d2d28446fca8079cf9afa7c07b0427f8"}, 279 | {file = "scikit_learn-1.3.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:850a00b559e636b23901aabbe79b73dc604b4e4248ba9e2d6e72f95063765603"}, 280 | {file = "scikit_learn-1.3.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ee04835fb016e8062ee9fe9074aef9b82e430504e420bff51e3e5fffe72750ca"}, 281 | {file = "scikit_learn-1.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9d953531f5d9f00c90c34fa3b7d7cfb43ecff4c605dac9e4255a20b114a27369"}, 282 | {file = "scikit_learn-1.3.0-cp311-cp311-win_amd64.whl", hash = "sha256:151ac2bf65ccf363664a689b8beafc9e6aae36263db114b4ca06fbbbf827444a"}, 283 | {file = "scikit_learn-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:6a885a9edc9c0a341cab27ec4f8a6c58b35f3d449c9d2503a6fd23e06bbd4f6a"}, 284 | {file = "scikit_learn-1.3.0-cp38-cp38-macosx_12_0_arm64.whl", hash = "sha256:9877af9c6d1b15486e18a94101b742e9d0d2f343d35a634e337411ddb57783f3"}, 285 | {file = "scikit_learn-1.3.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c470f53cea065ff3d588050955c492793bb50c19a92923490d18fcb637f6383a"}, 286 | {file = "scikit_learn-1.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fd6e2d7389542eae01077a1ee0318c4fec20c66c957f45c7aac0c6eb0fe3c612"}, 287 | {file = "scikit_learn-1.3.0-cp38-cp38-win_amd64.whl", hash = "sha256:3a11936adbc379a6061ea32fa03338d4ca7248d86dd507c81e13af428a5bc1db"}, 288 | {file = "scikit_learn-1.3.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:998d38fcec96584deee1e79cd127469b3ad6fefd1ea6c2dfc54e8db367eb396b"}, 289 | {file = "scikit_learn-1.3.0-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:ded35e810438a527e17623ac6deae3b360134345b7c598175ab7741720d7ffa7"}, 290 | {file = "scikit_learn-1.3.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0e8102d5036e28d08ab47166b48c8d5e5810704daecf3a476a4282d562be9a28"}, 291 | {file = "scikit_learn-1.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7617164951c422747e7c32be4afa15d75ad8044f42e7d70d3e2e0429a50e6718"}, 292 | {file = "scikit_learn-1.3.0-cp39-cp39-win_amd64.whl", hash = "sha256:1d54fb9e6038284548072df22fd34777e434153f7ffac72c8596f2d6987110dd"}, 293 | ] 294 | 295 | [package.dependencies] 296 | joblib = ">=1.1.1" 297 | numpy = ">=1.17.3" 298 | scipy = ">=1.5.0" 299 | threadpoolctl = ">=2.0.0" 300 | 301 | 
[package.extras] 302 | benchmark = ["matplotlib (>=3.1.3)", "memory-profiler (>=0.57.0)", "pandas (>=1.0.5)"] 303 | docs = ["Pillow (>=7.1.2)", "matplotlib (>=3.1.3)", "memory-profiler (>=0.57.0)", "numpydoc (>=1.2.0)", "pandas (>=1.0.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.16.2)", "seaborn (>=0.9.0)", "sphinx (>=6.0.0)", "sphinx-copybutton (>=0.5.2)", "sphinx-gallery (>=0.10.1)", "sphinx-prompt (>=1.3.0)", "sphinxext-opengraph (>=0.4.2)"] 304 | examples = ["matplotlib (>=3.1.3)", "pandas (>=1.0.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.16.2)", "seaborn (>=0.9.0)"] 305 | tests = ["black (>=23.3.0)", "matplotlib (>=3.1.3)", "mypy (>=1.3)", "numpydoc (>=1.2.0)", "pandas (>=1.0.5)", "pooch (>=1.6.0)", "pyamg (>=4.0.0)", "pytest (>=7.1.2)", "pytest-cov (>=2.9.0)", "ruff (>=0.0.272)", "scikit-image (>=0.16.2)"] 306 | 307 | [[package]] 308 | name = "scipy" 309 | version = "1.11.1" 310 | description = "Fundamental algorithms for scientific computing in Python" 311 | optional = false 312 | python-versions = "<3.13,>=3.9" 313 | files = [ 314 | {file = "scipy-1.11.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:aec8c62fbe52914f9cf28d846cf0401dd80ab80788bbab909434eb336ed07c04"}, 315 | {file = "scipy-1.11.1-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:3b9963798df1d8a52db41a6fc0e6fa65b1c60e85d73da27ae8bb754de4792481"}, 316 | {file = "scipy-1.11.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3e8eb42db36526b130dfbc417609498a6192381abc1975b91e3eb238e0b41c1a"}, 317 | {file = "scipy-1.11.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:366a6a937110d80dca4f63b3f5b00cc89d36f678b2d124a01067b154e692bab1"}, 318 | {file = "scipy-1.11.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:08d957ca82d3535b3b9ba6c8ff355d78fe975271874e2af267cb5add5bd78625"}, 319 | {file = "scipy-1.11.1-cp310-cp310-win_amd64.whl", hash = "sha256:e866514bc2d660608447b6ba95c8900d591f2865c07cca0aa4f7ff3c4ca70f30"}, 320 | {file = "scipy-1.11.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ba94eeef3c9caa4cea7b402a35bb02a5714ee1ee77eb98aca1eed4543beb0f4c"}, 321 | {file = "scipy-1.11.1-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:512fdc18c65f76dadaca139348e525646d440220d8d05f6d21965b8d4466bccd"}, 322 | {file = "scipy-1.11.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cce154372f0ebe88556ed06d7b196e9c2e0c13080ecb58d0f35062dc7cc28b47"}, 323 | {file = "scipy-1.11.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b4bb943010203465ac81efa392e4645265077b4d9e99b66cf3ed33ae12254173"}, 324 | {file = "scipy-1.11.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:249cfa465c379c9bb2c20123001e151ff5e29b351cbb7f9c91587260602c58d0"}, 325 | {file = "scipy-1.11.1-cp311-cp311-win_amd64.whl", hash = "sha256:ffb28e3fa31b9c376d0fb1f74c1f13911c8c154a760312fbee87a21eb21efe31"}, 326 | {file = "scipy-1.11.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:39154437654260a52871dfde852adf1b93b1d1bc5dc0ffa70068f16ec0be2624"}, 327 | {file = "scipy-1.11.1-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:b588311875c58d1acd4ef17c983b9f1ab5391755a47c3d70b6bd503a45bfaf71"}, 328 | {file = "scipy-1.11.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d51565560565a0307ed06fa0ec4c6f21ff094947d4844d6068ed04400c72d0c3"}, 329 | {file = "scipy-1.11.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = 
"sha256:b41a0f322b4eb51b078cb3441e950ad661ede490c3aca66edef66f4b37ab1877"}, 330 | {file = "scipy-1.11.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:396fae3f8c12ad14c5f3eb40499fd06a6fef8393a6baa352a652ecd51e74e029"}, 331 | {file = "scipy-1.11.1-cp39-cp39-win_amd64.whl", hash = "sha256:be8c962a821957fdde8c4044efdab7a140c13294997a407eaee777acf63cbf0c"}, 332 | {file = "scipy-1.11.1.tar.gz", hash = "sha256:fb5b492fa035334fd249f0973cc79ecad8b09c604b42a127a677b45a9a3d4289"}, 333 | ] 334 | 335 | [package.dependencies] 336 | numpy = ">=1.21.6,<1.28.0" 337 | 338 | [package.extras] 339 | dev = ["click", "cython-lint (>=0.12.2)", "doit (>=0.36.0)", "mypy", "pycodestyle", "pydevtool", "rich-click", "ruff", "types-psutil", "typing_extensions"] 340 | doc = ["jupytext", "matplotlib (>2)", "myst-nb", "numpydoc", "pooch", "pydata-sphinx-theme (==0.9.0)", "sphinx (!=4.1.0)", "sphinx-design (>=0.2.0)"] 341 | test = ["asv", "gmpy2", "mpmath", "pooch", "pytest", "pytest-cov", "pytest-timeout", "pytest-xdist", "scikit-umfpack", "threadpoolctl"] 342 | 343 | [[package]] 344 | name = "six" 345 | version = "1.16.0" 346 | description = "Python 2 and 3 compatibility utilities" 347 | optional = false 348 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*" 349 | files = [ 350 | {file = "six-1.16.0-py2.py3-none-any.whl", hash = "sha256:8abb2f1d86890a2dfb989f9a77cfcfd3e47c2a354b01111771326f8aa26e0254"}, 351 | {file = "six-1.16.0.tar.gz", hash = "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926"}, 352 | ] 353 | 354 | [[package]] 355 | name = "sympy" 356 | version = "1.12" 357 | description = "Computer algebra system (CAS) in Python" 358 | optional = false 359 | python-versions = ">=3.8" 360 | files = [ 361 | {file = "sympy-1.12-py3-none-any.whl", hash = "sha256:c3588cd4295d0c0f603d0f2ae780587e64e2efeedb3521e46b9bb1d08d184fa5"}, 362 | {file = "sympy-1.12.tar.gz", hash = "sha256:ebf595c8dac3e0fdc4152c51878b498396ec7f30e7a914d6071e674d49420fb8"}, 363 | ] 364 | 365 | [package.dependencies] 366 | mpmath = ">=0.19" 367 | 368 | [[package]] 369 | name = "threadpoolctl" 370 | version = "3.1.0" 371 | description = "threadpoolctl" 372 | optional = false 373 | python-versions = ">=3.6" 374 | files = [ 375 | {file = "threadpoolctl-3.1.0-py3-none-any.whl", hash = "sha256:8b99adda265feb6773280df41eece7b2e6561b772d21ffd52e372f999024907b"}, 376 | {file = "threadpoolctl-3.1.0.tar.gz", hash = "sha256:a335baacfaa4400ae1f0d8e3a58d6674d2f8828e3716bb2802c44955ad391380"}, 377 | ] 378 | 379 | [[package]] 380 | name = "torch" 381 | version = "2.0.1" 382 | description = "Tensors and Dynamic neural networks in Python with strong GPU acceleration" 383 | optional = false 384 | python-versions = ">=3.8.0" 385 | files = [ 386 | {file = "torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl", hash = "sha256:8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a"}, 387 | {file = "torch-2.0.1-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:359bfaad94d1cda02ab775dc1cc386d585712329bb47b8741607ef6ef4950747"}, 388 | {file = "torch-2.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:7c84e44d9002182edd859f3400deaa7410f5ec948a519cc7ef512c2f9b34d2c4"}, 389 | {file = "torch-2.0.1-cp310-none-macosx_10_9_x86_64.whl", hash = "sha256:567f84d657edc5582d716900543e6e62353dbe275e61cdc36eda4929e46df9e7"}, 390 | {file = "torch-2.0.1-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:787b5a78aa7917465e9b96399b883920c88a08f4eb63b5a5d2d1a16e27d2f89b"}, 391 | {file = "torch-2.0.1-cp311-cp311-manylinux1_x86_64.whl", hash = 
"sha256:e617b1d0abaf6ced02dbb9486803abfef0d581609b09641b34fa315c9c40766d"}, 392 | {file = "torch-2.0.1-cp311-cp311-manylinux2014_aarch64.whl", hash = "sha256:b6019b1de4978e96daa21d6a3ebb41e88a0b474898fe251fd96189587408873e"}, 393 | {file = "torch-2.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:dbd68cbd1cd9da32fe5d294dd3411509b3d841baecb780b38b3b7b06c7754434"}, 394 | {file = "torch-2.0.1-cp311-none-macosx_10_9_x86_64.whl", hash = "sha256:ef654427d91600129864644e35deea761fb1fe131710180b952a6f2e2207075e"}, 395 | {file = "torch-2.0.1-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:25aa43ca80dcdf32f13da04c503ec7afdf8e77e3a0183dd85cd3e53b2842e527"}, 396 | {file = "torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl", hash = "sha256:5ef3ea3d25441d3957348f7e99c7824d33798258a2bf5f0f0277cbcadad2e20d"}, 397 | {file = "torch-2.0.1-cp38-cp38-manylinux2014_aarch64.whl", hash = "sha256:0882243755ff28895e8e6dc6bc26ebcf5aa0911ed81b2a12f241fc4b09075b13"}, 398 | {file = "torch-2.0.1-cp38-cp38-win_amd64.whl", hash = "sha256:f66aa6b9580a22b04d0af54fcd042f52406a8479e2b6a550e3d9f95963e168c8"}, 399 | {file = "torch-2.0.1-cp38-none-macosx_10_9_x86_64.whl", hash = "sha256:1adb60d369f2650cac8e9a95b1d5758e25d526a34808f7448d0bd599e4ae9072"}, 400 | {file = "torch-2.0.1-cp38-none-macosx_11_0_arm64.whl", hash = "sha256:1bcffc16b89e296826b33b98db5166f990e3b72654a2b90673e817b16c50e32b"}, 401 | {file = "torch-2.0.1-cp39-cp39-manylinux1_x86_64.whl", hash = "sha256:e10e1597f2175365285db1b24019eb6f04d53dcd626c735fc502f1e8b6be9875"}, 402 | {file = "torch-2.0.1-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:423e0ae257b756bb45a4b49072046772d1ad0c592265c5080070e0767da4e490"}, 403 | {file = "torch-2.0.1-cp39-cp39-win_amd64.whl", hash = "sha256:8742bdc62946c93f75ff92da00e3803216c6cce9b132fbca69664ca38cfb3e18"}, 404 | {file = "torch-2.0.1-cp39-none-macosx_10_9_x86_64.whl", hash = "sha256:c62df99352bd6ee5a5a8d1832452110435d178b5164de450831a3a8cc14dc680"}, 405 | {file = "torch-2.0.1-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:671a2565e3f63b8fe8e42ae3e36ad249fe5e567435ea27b94edaa672a7d0c416"}, 406 | ] 407 | 408 | [package.dependencies] 409 | filelock = "*" 410 | jinja2 = "*" 411 | networkx = "*" 412 | sympy = "*" 413 | typing-extensions = "*" 414 | 415 | [package.extras] 416 | opt-einsum = ["opt-einsum (>=3.3)"] 417 | 418 | [[package]] 419 | name = "typing-extensions" 420 | version = "4.7.1" 421 | description = "Backported and Experimental Type Hints for Python 3.7+" 422 | optional = false 423 | python-versions = ">=3.7" 424 | files = [ 425 | {file = "typing_extensions-4.7.1-py3-none-any.whl", hash = "sha256:440d5dd3af93b060174bf433bccd69b0babc3b15b1a8dca43789fd7f61514b36"}, 426 | {file = "typing_extensions-4.7.1.tar.gz", hash = "sha256:b75ddc264f0ba5615db7ba217daeb99701ad295353c45f9e95963337ceeeffb2"}, 427 | ] 428 | 429 | [[package]] 430 | name = "tzdata" 431 | version = "2023.3" 432 | description = "Provider of IANA time zone data" 433 | optional = false 434 | python-versions = ">=2" 435 | files = [ 436 | {file = "tzdata-2023.3-py2.py3-none-any.whl", hash = "sha256:7e65763eef3120314099b6939b5546db7adce1e7d6f2e179e3df563c70511eda"}, 437 | {file = "tzdata-2023.3.tar.gz", hash = "sha256:11ef1e08e54acb0d4f95bdb1be05da659673de4acbd21bf9c69e94cc5e907a3a"}, 438 | ] 439 | 440 | [metadata] 441 | lock-version = "2.0" 442 | python-versions = ">=3.11,<3.13" 443 | content-hash = "0558e6e5010d26a95e44cdc71fb11f9e37839a48fe0988f8009cd97d93ce8b9f" 444 | -------------------------------------------------------------------------------- 
/pyproject.toml:
--------------------------------------------------------------------------------
[tool.poetry]
name = "quran-neural-chunker"
version = "0.1.0"
description = ""
authors = ["Kais Dukes"]
readme = "README.md"
packages = [{include = "src"}]

[tool.poetry.dependencies]
python = ">=3.11,<3.13"
pandas = "^2.0.3"
torch = "^2.0.1"
scikit-learn = "^1.3.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
--------------------------------------------------------------------------------
/src/chunks/chunk.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass

from .location import Location


@dataclass(frozen=True)
class Chunk:
    start: Location
    end: Location
--------------------------------------------------------------------------------
/src/chunks/chunks.py:
--------------------------------------------------------------------------------
from typing import List, Optional

from pandas import DataFrame

from .chunk import Chunk
from .location import Location


def get_chunks(df: DataFrame) -> List[Chunk]:
    chunks: List[Chunk] = []
    start: Optional[Location] = None

    for _, row in df.iterrows():
        loc = Location(row['chapter_number'], row['verse_number'], row['token_number'])
        if start is None:
            start = loc
        if row['chunk_end'] == 1:
            end = loc
            chunk = Chunk(start, end)
            chunks.append(chunk)
            start = None

    return chunks
--------------------------------------------------------------------------------
/src/chunks/location.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    chapter_number: int
    verse_number: int
    token_number: int

    def __str__(self):
        parts = [str(self.chapter_number), str(self.verse_number)]
        if self.token_number > 0:
            parts.append(str(self.token_number))
        return ':'.join(parts)
--------------------------------------------------------------------------------
/src/chunks/preprocessor.py:
--------------------------------------------------------------------------------
from pandas import DataFrame


def preprocess(df: DataFrame):
    # flag the last token of each verse
    df['verse_end'] = (
        (df.groupby(['chapter_number', 'verse_number']).token_number.transform('max') == df.token_number)
        .astype(int))

    df['punctuation'] = df['translation'].apply(_punctuation)


PUNCTUATION = [',', '.', '\'', '\"', '!', '?']


def _punctuation(text: str) -> str:
    # returns the trailing punctuation of a translated token, e.g. 'said.' -> '.'
    n = len(text)
    for i in range(n - 1, -1, -1):
        if text[i] not in PUNCTUATION:
            return text[i+1:] if i < n - 1 else ''
    return text
--------------------------------------------------------------------------------
/src/data.py:
--------------------------------------------------------------------------------
import csv

import pandas as pd


def load_data():
    CHUNKS_FILE = 'data/quranic-treebank-0.4-chunks.tsv'
    return pd.read_csv(CHUNKS_FILE, sep='\t', quoting=csv.QUOTE_NONE)
--------------------------------------------------------------------------------
/src/embeddings.py:
--------------------------------------------------------------------------------
from typing import Dict

import numpy as np

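# Note on the vector file format: the layout below is inferred from the
# parsing in _load_embeddings and is an assumption, not a documented spec.
# Each line of data/vectors.txt appears to be:
#
#   <embedding id>: <word> <v1> <v2> ... <v256>
#
# where the id carries a trailing colon and the word itself is skipped;
# only the id and the 256-dimensional vector are stored.
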
class Embeddings:

    def __init__(self):
        self._embeddings: Dict[int, np.ndarray] = {}
        self._default_vector: np.ndarray = np.zeros(256)
        self._load_embeddings()

    def get_vector(self, embedding_id: int):
        return self._embeddings.get(embedding_id, self._default_vector)

    def _load_embeddings(self):
        VECTOR_FILE = 'data/vectors.txt'
        with open(VECTOR_FILE, 'r', encoding='utf-8') as file:
            for line in file:
                parts = line.strip().split()
                embedding_id = int(parts[0][:-1])  # strip the trailing colon from the id
                vector = np.array(list(map(float, parts[2:])))  # parts[1] is the word, skipped
                self._embeddings[embedding_id] = vector
--------------------------------------------------------------------------------
/src/models/bilstm_chunker.py:
--------------------------------------------------------------------------------
from typing import List

import numpy as np
import pandas as pd
from pandas import DataFrame
import torch
from torch import nn
from torch.optim import Adam
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

from .evaluator import Evaluator
from ..data import load_data
from ..embeddings import Embeddings
from ..chunks.preprocessor import preprocess
from ..chunks.chunks import get_chunks

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
max_length = 128


class BiLSTMModel(nn.Module):

    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(BiLSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2*hidden_size, output_size)

    def forward(self, x, lengths):
        h0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(x.device)

        x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

        packed_output, _ = self.lstm(x, (h0, c0))

        # unpack the output before passing through the linear layer
        output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)

        # manually pad the sequences to max_length
        if output.size(1) < max_length:
            output = nn.functional.pad(output, (0, 0, 0, max_length - output.size(1)))

        out = self.fc(output)
        return out


class QuranDataset(Dataset):
    def __init__(self, verses, labels):
        self.verses = verses
        self.labels = labels

    def __len__(self):
        return len(self.verses)

    def __getitem__(self, index):
        verse = self.verses[index]
        label = self.labels[index]
        length = len(verse)

        # pad copies, so that indexing is side-effect free and `length`
        # stays the true sequence length on repeated access
        if length < max_length:
            verse = verse + [[0]*len(verse[0])] * (max_length - length)
            label = label + [0] * (max_length - length)

        return torch.tensor(verse, dtype=torch.float32), torch.tensor(label), length


def get_verses(df: DataFrame):
    le = LabelEncoder()
    df['encoded_punctuation'] = le.fit_transform(df['punctuation'])

    word_vectors = Embeddings()

    rows = []
    for _, row in df.iterrows():
        embedding_vector = word_vectors.get_vector(row['embedding_id'])
        core_values = row[['token_number', 'pause_mark', 'irab_end', 'verse_end', 'encoded_punctuation']].values
        full_vector = np.concatenate([core_values, embedding_vector]).tolist()
        rows.append(full_vector + [row['chunk_end']])
    X = pd.DataFrame(rows, columns=[f'feature_{i}' for i in range(261)]+['chunk_end'])

    verses: List[List[List[float]]] = []
    labels: List[List[int]] = []
    verse_info: List[List[List[int]]] = []

    for _, group in df.groupby(['chapter_number', 'verse_number']):
        group_df = X.loc[group.index]
        # note: columns.difference returns columns in lexicographic order,
        # which is consistent across verses
        verse = group_df[group_df.columns.difference(['chunk_end'])].values.tolist()
        label = group_df['chunk_end'].tolist()

        verses.append(verse)
        labels.append(label)

        verse_info_single = group[['chapter_number', 'verse_number', 'token_number']].values.tolist()
        verse_info.append(verse_info_single)

    temp_data = list(zip(verses, verse_info, labels))
    train_temp, test_temp = train_test_split(temp_data, test_size=0.10, random_state=42)

    train_verses, train_verse_info, train_labels = zip(*train_temp)
    test_verses, test_verse_info, test_labels = zip(*test_temp)

    return train_verses, test_verses, train_labels, test_labels, train_verse_info, test_verse_info


def pack_labels(labels):
    # currently unused helper
    lengths = [len(label) for label in labels]
    max_len = max(lengths)
    labels_padded = [torch.cat([label, torch.zeros(max_len - len(label))]) for label in labels]
    return torch.stack(labels_padded)


def train_and_test():
    df = load_data()
    preprocess(df)

    df.fillna(0, inplace=True)

    input_size = 261
    hidden_size = 512
    num_layers = 2
    output_size = 2
    num_epochs = 50
    batch_size = 64
    learning_rate = 0.001

    train_verses, test_verses, train_labels, test_labels, train_verse_info, test_verse_info = get_verses(df)
    print(f'Train verse count: {len(train_verses)}')
    print(f'Test verse count: {len(test_verses)}')

    training_data = QuranDataset(train_verses, train_labels)
    testing_data = QuranDataset(test_verses, test_labels)

    train_loader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(testing_data, batch_size=batch_size, shuffle=False)

    model = BiLSTMModel(input_size, hidden_size, num_layers, output_size)
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    # train
    model.train()
    for epoch in range(num_epochs):
        for i, (verses, labels, lengths) in enumerate(train_loader):
            verses = verses.to(device)
            labels = labels.to(device)

            # forward pass
            raw_outputs = model(verses, lengths)
            labels = labels.view(-1)  # reshape labels to be a 1D tensor
            # note: padded positions (label 0) are included in this loss
            loss = criterion(raw_outputs.view(-1, output_size), labels)

            # backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

    # test
    model.eval()

    expected_results_df = DataFrame(columns=['chapter_number', 'verse_number', 'token_number', 'chunk_end'])
    output_results_df = DataFrame(columns=['chapter_number', 'verse_number', 'token_number', 'chunk_end'])

    evaluator = Evaluator()
    with torch.no_grad():
        for test_index in range(len(testing_data)):
            verse, label, length = testing_data[test_index]
            verse, label, length = verse.to(device).unsqueeze(0), label.to(device).unsqueeze(0), torch.tensor([length])

            raw_output = model(verse, length)
            _, predicted = torch.max(raw_output.data, 2)
            predicted = predicted.cpu().numpy()

            verse_info = test_verse_info[test_index]

            for idx, token in enumerate(verse_info):
                expected_row = DataFrame({
                    'chapter_number': token[0],
                    'verse_number': token[1],
                    'token_number': token[2],
                    'chunk_end': label.cpu().numpy()[0][idx]}, index=[0])
                expected_results_df = pd.concat([expected_results_df, expected_row])

                output_row = DataFrame({
                    'chapter_number': token[0],
                    'verse_number': token[1],
                    'token_number': token[2],
                    'chunk_end': predicted[0][idx]}, index=[0])
                output_results_df = pd.concat([output_results_df, output_row])

    # chunk-level evaluation
    expected_chunks = get_chunks(expected_results_df)
    output_chunks = get_chunks(output_results_df)
    print(f'Expected: {len(expected_chunks)} chunks')
    print(f'Output: {len(output_chunks)} chunks')

    evaluator.compare(expected_chunks, output_chunks)
    print(f'Precision: {evaluator.precision}')
    print(f'Recall: {evaluator.recall}')
    print(f'F1 score: {evaluator.f1_score}')
    print()
--------------------------------------------------------------------------------
/src/models/evaluator.py:
--------------------------------------------------------------------------------
from typing import List

from ..chunks.chunk import Chunk


class Evaluator:

    def __init__(self):
        self._expected_chunks = 0
        self._output_chunks = 0
        self._equivalent_chunks = 0

    def compare(self, expected_chunks: List[Chunk], output_chunks: List[Chunk]):
        expected_set = set(expected_chunks)
        output_set = set(output_chunks)

        self._expected_chunks += len(expected_set)
        self._output_chunks += len(output_set)
        self._equivalent_chunks += len(expected_set & output_set)

    @property
    def precision(self):
        return 0 if self._output_chunks == 0 else self._equivalent_chunks / self._output_chunks

    @property
    def recall(self):
        return 0 if self._expected_chunks == 0 else self._equivalent_chunks / self._expected_chunks

    @property
    def f1_score(self):
        precision = self.precision
        recall = self.recall
        return 0 if precision + recall == 0 else 2 * (precision * recall) / (precision + recall)
--------------------------------------------------------------------------------
/tests/chunker_test.py:
--------------------------------------------------------------------------------
import unittest

from src.models.bilstm_chunker import train_and_test


class ChunkerTest(unittest.TestCase):

    def test_chunker(self):
        train_and_test()


if __name__ == '__main__':
    unittest.main()
--------------------------------------------------------------------------------
/tests/dataset_splitter.py:
--------------------------------------------------------------------------------
from pandas import DataFrame
from sklearn.model_selection import GroupShuffleSplit


def split_dataset(df: DataFrame, fold: int):

    # group by verse so all tokens of a verse land in the same split
    df['verse_id'] = df['chapter_number'].astype(str) + ':' + df['verse_number'].astype(str)

    train_idx, test_idx = next(
        GroupShuffleSplit(test_size=.10, n_splits=2, random_state=fold)
        .split(df, groups=df['verse_id'])
    )

    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]

    train_df = train_df.drop(columns=['verse_id'])
    test_df = test_df.drop(columns=['verse_id'])

    return train_df, test_df
--------------------------------------------------------------------------------
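A note on `tests/dataset_splitter.py` above: the `fold` argument seeds `GroupShuffleSplit`, so each fold number produces a different randomised 90/10 split in which all tokens of a verse stay on the same side. A minimal sketch of how this could drive repeated evaluation runs (hypothetical usage, not part of the repo):

```
from src.data import load_data
from tests.dataset_splitter import split_dataset

df = load_data()
for fold in range(5):
    # copy so split_dataset's added verse_id column doesn't mutate the original
    train_df, test_df = split_dataset(df.copy(), fold)
    print(f'fold {fold}: {len(train_df)} train rows, {len(test_df)} test rows')
```

Note that unlike k-fold cross-validation, these splits are independent draws, so the test sets may overlap across fold numbers.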