├── .gitignore
├── README.md
├── data
│   ├── quranic-treebank-0.4-chunks.tsv
│   └── vectors.txt
├── poetry.lock
├── pyproject.toml
├── src
│   ├── chunks
│   │   ├── chunk.py
│   │   ├── chunks.py
│   │   ├── location.py
│   │   └── preprocessor.py
│   ├── data.py
│   ├── embeddings.py
│   └── models
│       ├── bilstm_chunker.py
│       └── evaluator.py
└── tests
    ├── chunker_test.py
    └── dataset_splitter.py

/.gitignore:
--------------------------------------------------------------------------------
# hidden
.*
!.gitignore

# Python
__pycache__
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Quran Neural Chunker

*Join us on a new journey! Visit the [Corpus 2.0 upgrade project](https://github.com/kaisdukes/quranic-corpus) for new work on the Quranic Arabic Corpus.*

## What’s in this Repo?

A data preprocessor for the [Quranic Treebank](https://qurancorpus.app/treebank/2:258) that uses neural networks to divide longer verses into smaller chunks.

To work with this codebase, you will need a strong background in Artificial Intelligence applied to Quranic Research, specifically in the fields of Computational Linguistics and Natural Language Processing (NLP).

## Why Do We Need This?

Large portions of the Quran contain lengthy verses. The Quranic Treebank is designed primarily as an educational resource, allowing users of the corpus to gain a deeper linguistic understanding of the Classical Arabic language of the Quran through side-by-side comparison with *i’rāb* (إعراب), traditional linguistic analysis. The treebank breaks down longer verses into multiple dependency graphs to annotate syntactic structure. Each graph corresponds to a chunk.

Dependency graphs are kept intentionally short for easier display on mobile devices. Larger syntactic structures that cross graphs are linked together through reference nodes. To construct the treebank, we first need to perform verse chunking. There are several ways this could be done, but one possibility is to train a machine learning model using four sources of data:

**Existing chunk boundaries:** The existing division of verses implied by the dependency graphs in the treebank.

**Reference grammar alignment:** The breakdown of verses into word groups in the reference grammar used to construct the treebank, Salih’s *al-I’rāb al-Mufassal*. In principle, this could be a strong choice for training the model, as the treebank was initially chunked to support easier alignment and cross-referencing with this reference work.

**Pause marks:** Although the Classical Arabic Uthmani script of the Quran doesn’t contain modern punctuation like full stops or commas, it does contain [pause marks](https://corpus.quran.com/documentation/pausemarks.jsp) (to support *waqf* and *tajweed*), which may aid in chunking.

**Punctuation from translations:** The Quranic Arabic Corpus has word-aligned translations into English, which often include punctuation. Using this data may also help boost the accuracy of the chunker.

Because the evaluation step needs to test against the treebank, it makes sense to include the existing implied chunk boundaries as part of the training dataset. Other data sources are included to test how they might boost accuracy. Choosing just one signal, like *waqf* marks, might not be optimal.
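As a quick sanity check, how well any single signal predicts chunk boundaries can be measured directly from the training data described in the next section. Below is a minimal sketch of a pause-mark baseline, using the column names that appear in the model code, and assuming missing pause marks are read as empty/NaN values (as the `fillna(0)` call in the model code suggests):

```
import csv

import pandas as pd

df = pd.read_csv('data/quranic-treebank-0.4-chunks.tsv', sep='\t', quoting=csv.QUOTE_NONE)

# predict a chunk boundary wherever a token carries a pause mark
predicted = df['pause_mark'].notna()
actual = df['chunk_end'] == 1

hits = (predicted & actual).sum()
print(f'pause-mark baseline: precision={hits / predicted.sum():.3f}, '
      f'recall={hits / actual.sum():.3f}')
```

If either score turns out low, that supports combining several signals rather than relying on *waqf* marks alone.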

## What’s in the Training Data?

A ‘word’ in the Quran isn't easily defined, due to the rich morphology of Classical Arabic. The Quranic Arabic Corpus uses the terminology ‘segment’ to denote a morphological segment and ‘token’ to denote a whitespace-separated token.

The [quranic-treebank-0.4-chunks.tsv](https://github.com/kaisdukes/quran-verse-chunker/tree/main/data) file has one row per token, with 9 columns:

* Chapter number
* Verse number
* Token number
* The Arabic form of the token (in Unicode)
* The corresponding embedding ID (i.e. word ID) from GloVe-Arabic
* The word-aligned English translation, including punctuation marks such as full stops
* The [pause mark](https://corpus.quran.com/documentation/pausemarks.jsp) (*waqf*) associated with the token
* A binary flag indicating if the token is at the end of a word group in the corresponding *i’rāb* (إعراب) in the reference grammar *al-I’rāb al-Mufassal*
* A binary flag indicating if the token is at the end of a dependency graph. This value is the expected output of the chunker.

## Quranic Word Embeddings

This repo also contains Quranic word embeddings. These are not perfect, but they are good enough for the task at hand (verse chunking). For this reason, we do not recommend using these embeddings for other Quranic NLP tasks without further refinement. The mapping process, described below, may not be suitable or accurate enough for this.

The embeddings were derived using the following process:

1. For more accurate mappings, the simplified *imla'ei* (إملائي) Quranic script was used, instead of the more commonly used *uthmani* script. This is freely available from the [Tanzil project](https://tanzil.net/docs/quran_text_types).
2. We use the [GloVe-Arabic](https://github.com/tarekeldeeb/GloVe-Arabic) word embeddings dataset, based on an Arabic corpus of 1.75B tokens. This corpus includes Classical Arabic, which the Quran is written in.
3. Mapping Quranic words to GloVe-Arabic was possible by dropping diacritics and vocative prefixes, resulting in 76,205 / 77,429 = 98.419% of words mapped. Again, it’s worth stressing that this is not a perfect mapping, due to similar word-forms being ambiguous without diacritics, but it is a useful approximation for the specific task of verse chunking.

## Getting Started

This project uses [Poetry](https://python-poetry.org) to manage package dependencies.

First, clone the repository:

```
git clone https://github.com/kaisdukes/quran-neural-chunker.git
cd quran-neural-chunker
```

Install Poetry using [Homebrew](https://brew.sh):

```
brew install poetry
```

Next, install project dependencies:

```
poetry install
```

All dependencies, such as [pandas](https://pandas.pydata.org), are installed in the virtual environment.

Use the Poetry shell:

```
poetry shell
```

Test the chunker:

```
python tests/chunker_test.py
```
--------------------------------------------------------------------------------
/poetry.lock:
--------------------------------------------------------------------------------
1 | # This file is automatically @generated by Poetry 1.5.1 and should not be changed by hand.
2 |
3 | [[package]]
4 | name = "filelock"
5 | version = "3.12.2"
6 | description = "A platform independent file lock."
7 | optional = false 8 | python-versions = ">=3.7" 9 | files = [ 10 | {file = "filelock-3.12.2-py3-none-any.whl", hash = "sha256:cbb791cdea2a72f23da6ac5b5269ab0a0d161e9ef0100e653b69049a7706d1ec"}, 11 | {file = "filelock-3.12.2.tar.gz", hash = "sha256:002740518d8aa59a26b0c76e10fb8c6e15eae825d34b6fdf670333fd7b938d81"}, 12 | ] 13 | 14 | [package.extras] 15 | docs = ["furo (>=2023.5.20)", "sphinx (>=7.0.1)", "sphinx-autodoc-typehints (>=1.23,!=1.23.4)"] 16 | testing = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "diff-cover (>=7.5)", "pytest (>=7.3.1)", "pytest-cov (>=4.1)", "pytest-mock (>=3.10)", "pytest-timeout (>=2.1)"] 17 | 18 | [[package]] 19 | name = "jinja2" 20 | version = "3.1.2" 21 | description = "A very fast and expressive template engine." 22 | optional = false 23 | python-versions = ">=3.7" 24 | files = [ 25 | {file = "Jinja2-3.1.2-py3-none-any.whl", hash = "sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61"}, 26 | {file = "Jinja2-3.1.2.tar.gz", hash = "sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852"}, 27 | ] 28 | 29 | [package.dependencies] 30 | MarkupSafe = ">=2.0" 31 | 32 | [package.extras] 33 | i18n = ["Babel (>=2.7)"] 34 | 35 | [[package]] 36 | name = "joblib" 37 | version = "1.3.1" 38 | description = "Lightweight pipelining with Python functions" 39 | optional = false 40 | python-versions = ">=3.7" 41 | files = [ 42 | {file = "joblib-1.3.1-py3-none-any.whl", hash = "sha256:89cf0529520e01b3de7ac7b74a8102c90d16d54c64b5dd98cafcd14307fdf915"}, 43 | {file = "joblib-1.3.1.tar.gz", hash = "sha256:1f937906df65329ba98013dc9692fe22a4c5e4a648112de500508b18a21b41e3"}, 44 | ] 45 | 46 | [[package]] 47 | name = "markupsafe" 48 | version = "2.1.3" 49 | description = "Safely add untrusted strings to HTML/XML markup." 
50 | optional = false 51 | python-versions = ">=3.7" 52 | files = [ 53 | {file = "MarkupSafe-2.1.3-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:cd0f502fe016460680cd20aaa5a76d241d6f35a1c3350c474bac1273803893fa"}, 54 | {file = "MarkupSafe-2.1.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:e09031c87a1e51556fdcb46e5bd4f59dfb743061cf93c4d6831bf894f125eb57"}, 55 | {file = "MarkupSafe-2.1.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:68e78619a61ecf91e76aa3e6e8e33fc4894a2bebe93410754bd28fce0a8a4f9f"}, 56 | {file = "MarkupSafe-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:65c1a9bcdadc6c28eecee2c119465aebff8f7a584dd719facdd9e825ec61ab52"}, 57 | {file = "MarkupSafe-2.1.3-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:525808b8019e36eb524b8c68acdd63a37e75714eac50e988180b169d64480a00"}, 58 | {file = "MarkupSafe-2.1.3-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:962f82a3086483f5e5f64dbad880d31038b698494799b097bc59c2edf392fce6"}, 59 | {file = "MarkupSafe-2.1.3-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:aa7bd130efab1c280bed0f45501b7c8795f9fdbeb02e965371bbef3523627779"}, 60 | {file = "MarkupSafe-2.1.3-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:c9c804664ebe8f83a211cace637506669e7890fec1b4195b505c214e50dd4eb7"}, 61 | {file = "MarkupSafe-2.1.3-cp310-cp310-win32.whl", hash = "sha256:10bbfe99883db80bdbaff2dcf681dfc6533a614f700da1287707e8a5d78a8431"}, 62 | {file = "MarkupSafe-2.1.3-cp310-cp310-win_amd64.whl", hash = "sha256:1577735524cdad32f9f694208aa75e422adba74f1baee7551620e43a3141f559"}, 63 | {file = "MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:ad9e82fb8f09ade1c3e1b996a6337afac2b8b9e365f926f5a61aacc71adc5b3c"}, 64 | {file = "MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:3c0fae6c3be832a0a0473ac912810b2877c8cb9d76ca48de1ed31e1c68386575"}, 65 | {file = "MarkupSafe-2.1.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b076b6226fb84157e3f7c971a47ff3a679d837cf338547532ab866c57930dbee"}, 66 | {file = "MarkupSafe-2.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bfce63a9e7834b12b87c64d6b155fdd9b3b96191b6bd334bf37db7ff1fe457f2"}, 67 | {file = "MarkupSafe-2.1.3-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:338ae27d6b8745585f87218a3f23f1512dbf52c26c28e322dbe54bcede54ccb9"}, 68 | {file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:e4dd52d80b8c83fdce44e12478ad2e85c64ea965e75d66dbeafb0a3e77308fcc"}, 69 | {file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:df0be2b576a7abbf737b1575f048c23fb1d769f267ec4358296f31c2479db8f9"}, 70 | {file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:5bbe06f8eeafd38e5d0a4894ffec89378b6c6a625ff57e3028921f8ff59318ac"}, 71 | {file = "MarkupSafe-2.1.3-cp311-cp311-win32.whl", hash = "sha256:dd15ff04ffd7e05ffcb7fe79f1b98041b8ea30ae9234aed2a9168b5797c3effb"}, 72 | {file = "MarkupSafe-2.1.3-cp311-cp311-win_amd64.whl", hash = "sha256:134da1eca9ec0ae528110ccc9e48041e0828d79f24121a1a146161103c76e686"}, 73 | {file = "MarkupSafe-2.1.3-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:8e254ae696c88d98da6555f5ace2279cf7cd5b3f52be2b5cf97feafe883b58d2"}, 74 | {file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = 
"sha256:cb0932dc158471523c9637e807d9bfb93e06a95cbf010f1a38b98623b929ef2b"}, 75 | {file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9402b03f1a1b4dc4c19845e5c749e3ab82d5078d16a2a4c2cd2df62d57bb0707"}, 76 | {file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ca379055a47383d02a5400cb0d110cef0a776fc644cda797db0c5696cfd7e18e"}, 77 | {file = "MarkupSafe-2.1.3-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:b7ff0f54cb4ff66dd38bebd335a38e2c22c41a8ee45aa608efc890ac3e3931bc"}, 78 | {file = "MarkupSafe-2.1.3-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:c011a4149cfbcf9f03994ec2edffcb8b1dc2d2aede7ca243746df97a5d41ce48"}, 79 | {file = "MarkupSafe-2.1.3-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:56d9f2ecac662ca1611d183feb03a3fa4406469dafe241673d521dd5ae92a155"}, 80 | {file = "MarkupSafe-2.1.3-cp37-cp37m-win32.whl", hash = "sha256:8758846a7e80910096950b67071243da3e5a20ed2546e6392603c096778d48e0"}, 81 | {file = "MarkupSafe-2.1.3-cp37-cp37m-win_amd64.whl", hash = "sha256:787003c0ddb00500e49a10f2844fac87aa6ce977b90b0feaaf9de23c22508b24"}, 82 | {file = "MarkupSafe-2.1.3-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:2ef12179d3a291be237280175b542c07a36e7f60718296278d8593d21ca937d4"}, 83 | {file = "MarkupSafe-2.1.3-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:2c1b19b3aaacc6e57b7e25710ff571c24d6c3613a45e905b1fde04d691b98ee0"}, 84 | {file = "MarkupSafe-2.1.3-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8afafd99945ead6e075b973fefa56379c5b5c53fd8937dad92c662da5d8fd5ee"}, 85 | {file = "MarkupSafe-2.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8c41976a29d078bb235fea9b2ecd3da465df42a562910f9022f1a03107bd02be"}, 86 | {file = "MarkupSafe-2.1.3-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d080e0a5eb2529460b30190fcfcc4199bd7f827663f858a226a81bc27beaa97e"}, 87 | {file = "MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:69c0f17e9f5a7afdf2cc9fb2d1ce6aabdb3bafb7f38017c0b77862bcec2bbad8"}, 88 | {file = "MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:504b320cd4b7eff6f968eddf81127112db685e81f7e36e75f9f84f0df46041c3"}, 89 | {file = "MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:42de32b22b6b804f42c5d98be4f7e5e977ecdd9ee9b660fda1a3edf03b11792d"}, 90 | {file = "MarkupSafe-2.1.3-cp38-cp38-win32.whl", hash = "sha256:ceb01949af7121f9fc39f7d27f91be8546f3fb112c608bc4029aef0bab86a2a5"}, 91 | {file = "MarkupSafe-2.1.3-cp38-cp38-win_amd64.whl", hash = "sha256:1b40069d487e7edb2676d3fbdb2b0829ffa2cd63a2ec26c4938b2d34391b4ecc"}, 92 | {file = "MarkupSafe-2.1.3-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:8023faf4e01efadfa183e863fefde0046de576c6f14659e8782065bcece22198"}, 93 | {file = "MarkupSafe-2.1.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:6b2b56950d93e41f33b4223ead100ea0fe11f8e6ee5f641eb753ce4b77a7042b"}, 94 | {file = "MarkupSafe-2.1.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9dcdfd0eaf283af041973bff14a2e143b8bd64e069f4c383416ecd79a81aab58"}, 95 | {file = "MarkupSafe-2.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:05fb21170423db021895e1ea1e1f3ab3adb85d1c2333cbc2310f2a26bc77272e"}, 96 | {file = "MarkupSafe-2.1.3-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = 
"sha256:282c2cb35b5b673bbcadb33a585408104df04f14b2d9b01d4c345a3b92861c2c"}, 97 | {file = "MarkupSafe-2.1.3-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:ab4a0df41e7c16a1392727727e7998a467472d0ad65f3ad5e6e765015df08636"}, 98 | {file = "MarkupSafe-2.1.3-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:7ef3cb2ebbf91e330e3bb937efada0edd9003683db6b57bb108c4001f37a02ea"}, 99 | {file = "MarkupSafe-2.1.3-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:0a4e4a1aff6c7ac4cd55792abf96c915634c2b97e3cc1c7129578aa68ebd754e"}, 100 | {file = "MarkupSafe-2.1.3-cp39-cp39-win32.whl", hash = "sha256:fec21693218efe39aa7f8599346e90c705afa52c5b31ae019b2e57e8f6542bb2"}, 101 | {file = "MarkupSafe-2.1.3-cp39-cp39-win_amd64.whl", hash = "sha256:3fd4abcb888d15a94f32b75d8fd18ee162ca0c064f35b11134be77050296d6ba"}, 102 | {file = "MarkupSafe-2.1.3.tar.gz", hash = "sha256:af598ed32d6ae86f1b747b82783958b1a4ab8f617b06fe68795c7f026abbdcad"}, 103 | ] 104 | 105 | [[package]] 106 | name = "mpmath" 107 | version = "1.3.0" 108 | description = "Python library for arbitrary-precision floating-point arithmetic" 109 | optional = false 110 | python-versions = "*" 111 | files = [ 112 | {file = "mpmath-1.3.0-py3-none-any.whl", hash = "sha256:a0b2b9fe80bbcd81a6647ff13108738cfb482d481d826cc0e02f5b35e5c88d2c"}, 113 | {file = "mpmath-1.3.0.tar.gz", hash = "sha256:7a28eb2a9774d00c7bc92411c19a89209d5da7c4c9a9e227be8330a23a25b91f"}, 114 | ] 115 | 116 | [package.extras] 117 | develop = ["codecov", "pycodestyle", "pytest (>=4.6)", "pytest-cov", "wheel"] 118 | docs = ["sphinx"] 119 | gmpy = ["gmpy2 (>=2.1.0a4)"] 120 | tests = ["pytest (>=4.6)"] 121 | 122 | [[package]] 123 | name = "networkx" 124 | version = "3.1" 125 | description = "Python package for creating and manipulating graphs and networks" 126 | optional = false 127 | python-versions = ">=3.8" 128 | files = [ 129 | {file = "networkx-3.1-py3-none-any.whl", hash = "sha256:4f33f68cb2afcf86f28a45f43efc27a9386b535d567d2127f8f61d51dec58d36"}, 130 | {file = "networkx-3.1.tar.gz", hash = "sha256:de346335408f84de0eada6ff9fafafff9bcda11f0a0dfaa931133debb146ab61"}, 131 | ] 132 | 133 | [package.extras] 134 | default = ["matplotlib (>=3.4)", "numpy (>=1.20)", "pandas (>=1.3)", "scipy (>=1.8)"] 135 | developer = ["mypy (>=1.1)", "pre-commit (>=3.2)"] 136 | doc = ["nb2plots (>=0.6)", "numpydoc (>=1.5)", "pillow (>=9.4)", "pydata-sphinx-theme (>=0.13)", "sphinx (>=6.1)", "sphinx-gallery (>=0.12)", "texext (>=0.6.7)"] 137 | extra = ["lxml (>=4.6)", "pydot (>=1.4.2)", "pygraphviz (>=1.10)", "sympy (>=1.10)"] 138 | test = ["codecov (>=2.1)", "pytest (>=7.2)", "pytest-cov (>=4.0)"] 139 | 140 | [[package]] 141 | name = "numpy" 142 | version = "1.25.0" 143 | description = "Fundamental package for array computing in Python" 144 | optional = false 145 | python-versions = ">=3.9" 146 | files = [ 147 | {file = "numpy-1.25.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:8aa130c3042052d656751df5e81f6d61edff3e289b5994edcf77f54118a8d9f4"}, 148 | {file = "numpy-1.25.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:9e3f2b96e3b63c978bc29daaa3700c028fe3f049ea3031b58aa33fe2a5809d24"}, 149 | {file = "numpy-1.25.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d6b267f349a99d3908b56645eebf340cb58f01bd1e773b4eea1a905b3f0e4208"}, 150 | {file = "numpy-1.25.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4aedd08f15d3045a4e9c648f1e04daca2ab1044256959f1f95aafeeb3d794c16"}, 151 | {file = "numpy-1.25.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = 
"sha256:6d183b5c58513f74225c376643234c369468e02947b47942eacbb23c1671f25d"}, 152 | {file = "numpy-1.25.0-cp310-cp310-win32.whl", hash = "sha256:d76a84998c51b8b68b40448ddd02bd1081bb33abcdc28beee6cd284fe11036c6"}, 153 | {file = "numpy-1.25.0-cp310-cp310-win_amd64.whl", hash = "sha256:c0dc071017bc00abb7d7201bac06fa80333c6314477b3d10b52b58fa6a6e38f6"}, 154 | {file = "numpy-1.25.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:4c69fe5f05eea336b7a740e114dec995e2f927003c30702d896892403df6dbf0"}, 155 | {file = "numpy-1.25.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:9c7211d7920b97aeca7b3773a6783492b5b93baba39e7c36054f6e749fc7490c"}, 156 | {file = "numpy-1.25.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ecc68f11404930e9c7ecfc937aa423e1e50158317bf67ca91736a9864eae0232"}, 157 | {file = "numpy-1.25.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e559c6afbca484072a98a51b6fa466aae785cfe89b69e8b856c3191bc8872a82"}, 158 | {file = "numpy-1.25.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:6c284907e37f5e04d2412950960894b143a648dea3f79290757eb878b91acbd1"}, 159 | {file = "numpy-1.25.0-cp311-cp311-win32.whl", hash = "sha256:95367ccd88c07af21b379be1725b5322362bb83679d36691f124a16357390153"}, 160 | {file = "numpy-1.25.0-cp311-cp311-win_amd64.whl", hash = "sha256:b76aa836a952059d70a2788a2d98cb2a533ccd46222558b6970348939e55fc24"}, 161 | {file = "numpy-1.25.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:b792164e539d99d93e4e5e09ae10f8cbe5466de7d759fc155e075237e0c274e4"}, 162 | {file = "numpy-1.25.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:7cd981ccc0afe49b9883f14761bb57c964df71124dcd155b0cba2b591f0d64b9"}, 163 | {file = "numpy-1.25.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5aa48bebfb41f93043a796128854b84407d4df730d3fb6e5dc36402f5cd594c0"}, 164 | {file = "numpy-1.25.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5177310ac2e63d6603f659fadc1e7bab33dd5a8db4e0596df34214eeab0fee3b"}, 165 | {file = "numpy-1.25.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:0ac6edfb35d2a99aaf102b509c8e9319c499ebd4978df4971b94419a116d0790"}, 166 | {file = "numpy-1.25.0-cp39-cp39-win32.whl", hash = "sha256:7412125b4f18aeddca2ecd7219ea2d2708f697943e6f624be41aa5f8a9852cc4"}, 167 | {file = "numpy-1.25.0-cp39-cp39-win_amd64.whl", hash = "sha256:26815c6c8498dc49d81faa76d61078c4f9f0859ce7817919021b9eba72b425e3"}, 168 | {file = "numpy-1.25.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:5b1b90860bf7d8a8c313b372d4f27343a54f415b20fb69dd601b7efe1029c91e"}, 169 | {file = "numpy-1.25.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:85cdae87d8c136fd4da4dad1e48064d700f63e923d5af6c8c782ac0df8044542"}, 170 | {file = "numpy-1.25.0-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:cc3fda2b36482891db1060f00f881c77f9423eead4c3579629940a3e12095fe8"}, 171 | {file = "numpy-1.25.0.tar.gz", hash = "sha256:f1accae9a28dc3cda46a91de86acf69de0d1b5f4edd44a9b0c3ceb8036dfff19"}, 172 | ] 173 | 174 | [[package]] 175 | name = "pandas" 176 | version = "2.0.3" 177 | description = "Powerful data structures for data analysis, time series, and statistics" 178 | optional = false 179 | python-versions = ">=3.8" 180 | files = [ 181 | {file = "pandas-2.0.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:e4c7c9f27a4185304c7caf96dc7d91bc60bc162221152de697c98eb0b2648dd8"}, 182 | {file = "pandas-2.0.3-cp310-cp310-macosx_11_0_arm64.whl", hash = 
"sha256:f167beed68918d62bffb6ec64f2e1d8a7d297a038f86d4aed056b9493fca407f"}, 183 | {file = "pandas-2.0.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce0c6f76a0f1ba361551f3e6dceaff06bde7514a374aa43e33b588ec10420183"}, 184 | {file = "pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba619e410a21d8c387a1ea6e8a0e49bb42216474436245718d7f2e88a2f8d7c0"}, 185 | {file = "pandas-2.0.3-cp310-cp310-win32.whl", hash = "sha256:3ef285093b4fe5058eefd756100a367f27029913760773c8bf1d2d8bebe5d210"}, 186 | {file = "pandas-2.0.3-cp310-cp310-win_amd64.whl", hash = "sha256:9ee1a69328d5c36c98d8e74db06f4ad518a1840e8ccb94a4ba86920986bb617e"}, 187 | {file = "pandas-2.0.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:b084b91d8d66ab19f5bb3256cbd5ea661848338301940e17f4492b2ce0801fe8"}, 188 | {file = "pandas-2.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:37673e3bdf1551b95bf5d4ce372b37770f9529743d2498032439371fc7b7eb26"}, 189 | {file = "pandas-2.0.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b9cb1e14fdb546396b7e1b923ffaeeac24e4cedd14266c3497216dd4448e4f2d"}, 190 | {file = "pandas-2.0.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d9cd88488cceb7635aebb84809d087468eb33551097d600c6dad13602029c2df"}, 191 | {file = "pandas-2.0.3-cp311-cp311-win32.whl", hash = "sha256:694888a81198786f0e164ee3a581df7d505024fbb1f15202fc7db88a71d84ebd"}, 192 | {file = "pandas-2.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:6a21ab5c89dcbd57f78d0ae16630b090eec626360085a4148693def5452d8a6b"}, 193 | {file = "pandas-2.0.3-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:9e4da0d45e7f34c069fe4d522359df7d23badf83abc1d1cef398895822d11061"}, 194 | {file = "pandas-2.0.3-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:32fca2ee1b0d93dd71d979726b12b61faa06aeb93cf77468776287f41ff8fdc5"}, 195 | {file = "pandas-2.0.3-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:258d3624b3ae734490e4d63c430256e716f488c4fcb7c8e9bde2d3aa46c29089"}, 196 | {file = "pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9eae3dc34fa1aa7772dd3fc60270d13ced7346fcbcfee017d3132ec625e23bb0"}, 197 | {file = "pandas-2.0.3-cp38-cp38-win32.whl", hash = "sha256:f3421a7afb1a43f7e38e82e844e2bca9a6d793d66c1a7f9f0ff39a795bbc5e02"}, 198 | {file = "pandas-2.0.3-cp38-cp38-win_amd64.whl", hash = "sha256:69d7f3884c95da3a31ef82b7618af5710dba95bb885ffab339aad925c3e8ce78"}, 199 | {file = "pandas-2.0.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:5247fb1ba347c1261cbbf0fcfba4a3121fbb4029d95d9ef4dc45406620b25c8b"}, 200 | {file = "pandas-2.0.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:81af086f4543c9d8bb128328b5d32e9986e0c84d3ee673a2ac6fb57fd14f755e"}, 201 | {file = "pandas-2.0.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1994c789bf12a7c5098277fb43836ce090f1073858c10f9220998ac74f37c69b"}, 202 | {file = "pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5ec591c48e29226bcbb316e0c1e9423622bc7a4eaf1ef7c3c9fa1a3981f89641"}, 203 | {file = "pandas-2.0.3-cp39-cp39-win32.whl", hash = "sha256:04dbdbaf2e4d46ca8da896e1805bc04eb85caa9a82e259e8eed00254d5e0c682"}, 204 | {file = "pandas-2.0.3-cp39-cp39-win_amd64.whl", hash = "sha256:1168574b036cd8b93abc746171c9b4f1b83467438a5e45909fed645cf8692dbc"}, 205 | {file = "pandas-2.0.3.tar.gz", hash = "sha256:c02f372a88e0d17f36d3093a644c73cfc1788e876a7c4bcb4020a77512e2043c"}, 206 | ] 207 | 208 
| [package.dependencies] 209 | numpy = [ 210 | {version = ">=1.21.0", markers = "python_version >= \"3.10\""}, 211 | {version = ">=1.23.2", markers = "python_version >= \"3.11\""}, 212 | ] 213 | python-dateutil = ">=2.8.2" 214 | pytz = ">=2020.1" 215 | tzdata = ">=2022.1" 216 | 217 | [package.extras] 218 | all = ["PyQt5 (>=5.15.1)", "SQLAlchemy (>=1.4.16)", "beautifulsoup4 (>=4.9.3)", "bottleneck (>=1.3.2)", "brotlipy (>=0.7.0)", "fastparquet (>=0.6.3)", "fsspec (>=2021.07.0)", "gcsfs (>=2021.07.0)", "html5lib (>=1.1)", "hypothesis (>=6.34.2)", "jinja2 (>=3.0.0)", "lxml (>=4.6.3)", "matplotlib (>=3.6.1)", "numba (>=0.53.1)", "numexpr (>=2.7.3)", "odfpy (>=1.4.1)", "openpyxl (>=3.0.7)", "pandas-gbq (>=0.15.0)", "psycopg2 (>=2.8.6)", "pyarrow (>=7.0.0)", "pymysql (>=1.0.2)", "pyreadstat (>=1.1.2)", "pytest (>=7.3.2)", "pytest-asyncio (>=0.17.0)", "pytest-xdist (>=2.2.0)", "python-snappy (>=0.6.0)", "pyxlsb (>=1.0.8)", "qtpy (>=2.2.0)", "s3fs (>=2021.08.0)", "scipy (>=1.7.1)", "tables (>=3.6.1)", "tabulate (>=0.8.9)", "xarray (>=0.21.0)", "xlrd (>=2.0.1)", "xlsxwriter (>=1.4.3)", "zstandard (>=0.15.2)"] 219 | aws = ["s3fs (>=2021.08.0)"] 220 | clipboard = ["PyQt5 (>=5.15.1)", "qtpy (>=2.2.0)"] 221 | compression = ["brotlipy (>=0.7.0)", "python-snappy (>=0.6.0)", "zstandard (>=0.15.2)"] 222 | computation = ["scipy (>=1.7.1)", "xarray (>=0.21.0)"] 223 | excel = ["odfpy (>=1.4.1)", "openpyxl (>=3.0.7)", "pyxlsb (>=1.0.8)", "xlrd (>=2.0.1)", "xlsxwriter (>=1.4.3)"] 224 | feather = ["pyarrow (>=7.0.0)"] 225 | fss = ["fsspec (>=2021.07.0)"] 226 | gcp = ["gcsfs (>=2021.07.0)", "pandas-gbq (>=0.15.0)"] 227 | hdf5 = ["tables (>=3.6.1)"] 228 | html = ["beautifulsoup4 (>=4.9.3)", "html5lib (>=1.1)", "lxml (>=4.6.3)"] 229 | mysql = ["SQLAlchemy (>=1.4.16)", "pymysql (>=1.0.2)"] 230 | output-formatting = ["jinja2 (>=3.0.0)", "tabulate (>=0.8.9)"] 231 | parquet = ["pyarrow (>=7.0.0)"] 232 | performance = ["bottleneck (>=1.3.2)", "numba (>=0.53.1)", "numexpr (>=2.7.1)"] 233 | plot = ["matplotlib (>=3.6.1)"] 234 | postgresql = ["SQLAlchemy (>=1.4.16)", "psycopg2 (>=2.8.6)"] 235 | spss = ["pyreadstat (>=1.1.2)"] 236 | sql-other = ["SQLAlchemy (>=1.4.16)"] 237 | test = ["hypothesis (>=6.34.2)", "pytest (>=7.3.2)", "pytest-asyncio (>=0.17.0)", "pytest-xdist (>=2.2.0)"] 238 | xml = ["lxml (>=4.6.3)"] 239 | 240 | [[package]] 241 | name = "python-dateutil" 242 | version = "2.8.2" 243 | description = "Extensions to the standard Python datetime module" 244 | optional = false 245 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7" 246 | files = [ 247 | {file = "python-dateutil-2.8.2.tar.gz", hash = "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86"}, 248 | {file = "python_dateutil-2.8.2-py2.py3-none-any.whl", hash = "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9"}, 249 | ] 250 | 251 | [package.dependencies] 252 | six = ">=1.5" 253 | 254 | [[package]] 255 | name = "pytz" 256 | version = "2023.3" 257 | description = "World timezone definitions, modern and historical" 258 | optional = false 259 | python-versions = "*" 260 | files = [ 261 | {file = "pytz-2023.3-py2.py3-none-any.whl", hash = "sha256:a151b3abb88eda1d4e34a9814df37de2a80e301e68ba0fd856fb9b46bfbbbffb"}, 262 | {file = "pytz-2023.3.tar.gz", hash = "sha256:1d8ce29db189191fb55338ee6d0387d82ab59f3d00eac103412d64e0ebd0c588"}, 263 | ] 264 | 265 | [[package]] 266 | name = "scikit-learn" 267 | version = "1.3.0" 268 | description = "A set of python modules for machine learning and data mining" 269 | optional = false 270 
| python-versions = ">=3.8" 271 | files = [ 272 | {file = "scikit-learn-1.3.0.tar.gz", hash = "sha256:8be549886f5eda46436b6e555b0e4873b4f10aa21c07df45c4bc1735afbccd7a"}, 273 | {file = "scikit_learn-1.3.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:981287869e576d42c682cf7ca96af0c6ac544ed9316328fd0d9292795c742cf5"}, 274 | {file = "scikit_learn-1.3.0-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:436aaaae2c916ad16631142488e4c82f4296af2404f480e031d866863425d2a2"}, 275 | {file = "scikit_learn-1.3.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c7e28d8fa47a0b30ae1bd7a079519dd852764e31708a7804da6cb6f8b36e3630"}, 276 | {file = "scikit_learn-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ae80c08834a473d08a204d966982a62e11c976228d306a2648c575e3ead12111"}, 277 | {file = "scikit_learn-1.3.0-cp310-cp310-win_amd64.whl", hash = "sha256:552fd1b6ee22900cf1780d7386a554bb96949e9a359999177cf30211e6b20df6"}, 278 | {file = "scikit_learn-1.3.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:79970a6d759eb00a62266a31e2637d07d2d28446fca8079cf9afa7c07b0427f8"}, 279 | {file = "scikit_learn-1.3.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:850a00b559e636b23901aabbe79b73dc604b4e4248ba9e2d6e72f95063765603"}, 280 | {file = "scikit_learn-1.3.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ee04835fb016e8062ee9fe9074aef9b82e430504e420bff51e3e5fffe72750ca"}, 281 | {file = "scikit_learn-1.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9d953531f5d9f00c90c34fa3b7d7cfb43ecff4c605dac9e4255a20b114a27369"}, 282 | {file = "scikit_learn-1.3.0-cp311-cp311-win_amd64.whl", hash = "sha256:151ac2bf65ccf363664a689b8beafc9e6aae36263db114b4ca06fbbbf827444a"}, 283 | {file = "scikit_learn-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:6a885a9edc9c0a341cab27ec4f8a6c58b35f3d449c9d2503a6fd23e06bbd4f6a"}, 284 | {file = "scikit_learn-1.3.0-cp38-cp38-macosx_12_0_arm64.whl", hash = "sha256:9877af9c6d1b15486e18a94101b742e9d0d2f343d35a634e337411ddb57783f3"}, 285 | {file = "scikit_learn-1.3.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c470f53cea065ff3d588050955c492793bb50c19a92923490d18fcb637f6383a"}, 286 | {file = "scikit_learn-1.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fd6e2d7389542eae01077a1ee0318c4fec20c66c957f45c7aac0c6eb0fe3c612"}, 287 | {file = "scikit_learn-1.3.0-cp38-cp38-win_amd64.whl", hash = "sha256:3a11936adbc379a6061ea32fa03338d4ca7248d86dd507c81e13af428a5bc1db"}, 288 | {file = "scikit_learn-1.3.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:998d38fcec96584deee1e79cd127469b3ad6fefd1ea6c2dfc54e8db367eb396b"}, 289 | {file = "scikit_learn-1.3.0-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:ded35e810438a527e17623ac6deae3b360134345b7c598175ab7741720d7ffa7"}, 290 | {file = "scikit_learn-1.3.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0e8102d5036e28d08ab47166b48c8d5e5810704daecf3a476a4282d562be9a28"}, 291 | {file = "scikit_learn-1.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7617164951c422747e7c32be4afa15d75ad8044f42e7d70d3e2e0429a50e6718"}, 292 | {file = "scikit_learn-1.3.0-cp39-cp39-win_amd64.whl", hash = "sha256:1d54fb9e6038284548072df22fd34777e434153f7ffac72c8596f2d6987110dd"}, 293 | ] 294 | 295 | [package.dependencies] 296 | joblib = ">=1.1.1" 297 | numpy = ">=1.17.3" 298 | scipy = ">=1.5.0" 299 | threadpoolctl = ">=2.0.0" 300 | 301 | 
[package.extras] 302 | benchmark = ["matplotlib (>=3.1.3)", "memory-profiler (>=0.57.0)", "pandas (>=1.0.5)"] 303 | docs = ["Pillow (>=7.1.2)", "matplotlib (>=3.1.3)", "memory-profiler (>=0.57.0)", "numpydoc (>=1.2.0)", "pandas (>=1.0.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.16.2)", "seaborn (>=0.9.0)", "sphinx (>=6.0.0)", "sphinx-copybutton (>=0.5.2)", "sphinx-gallery (>=0.10.1)", "sphinx-prompt (>=1.3.0)", "sphinxext-opengraph (>=0.4.2)"] 304 | examples = ["matplotlib (>=3.1.3)", "pandas (>=1.0.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.16.2)", "seaborn (>=0.9.0)"] 305 | tests = ["black (>=23.3.0)", "matplotlib (>=3.1.3)", "mypy (>=1.3)", "numpydoc (>=1.2.0)", "pandas (>=1.0.5)", "pooch (>=1.6.0)", "pyamg (>=4.0.0)", "pytest (>=7.1.2)", "pytest-cov (>=2.9.0)", "ruff (>=0.0.272)", "scikit-image (>=0.16.2)"] 306 | 307 | [[package]] 308 | name = "scipy" 309 | version = "1.11.1" 310 | description = "Fundamental algorithms for scientific computing in Python" 311 | optional = false 312 | python-versions = "<3.13,>=3.9" 313 | files = [ 314 | {file = "scipy-1.11.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:aec8c62fbe52914f9cf28d846cf0401dd80ab80788bbab909434eb336ed07c04"}, 315 | {file = "scipy-1.11.1-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:3b9963798df1d8a52db41a6fc0e6fa65b1c60e85d73da27ae8bb754de4792481"}, 316 | {file = "scipy-1.11.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3e8eb42db36526b130dfbc417609498a6192381abc1975b91e3eb238e0b41c1a"}, 317 | {file = "scipy-1.11.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:366a6a937110d80dca4f63b3f5b00cc89d36f678b2d124a01067b154e692bab1"}, 318 | {file = "scipy-1.11.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:08d957ca82d3535b3b9ba6c8ff355d78fe975271874e2af267cb5add5bd78625"}, 319 | {file = "scipy-1.11.1-cp310-cp310-win_amd64.whl", hash = "sha256:e866514bc2d660608447b6ba95c8900d591f2865c07cca0aa4f7ff3c4ca70f30"}, 320 | {file = "scipy-1.11.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ba94eeef3c9caa4cea7b402a35bb02a5714ee1ee77eb98aca1eed4543beb0f4c"}, 321 | {file = "scipy-1.11.1-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:512fdc18c65f76dadaca139348e525646d440220d8d05f6d21965b8d4466bccd"}, 322 | {file = "scipy-1.11.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cce154372f0ebe88556ed06d7b196e9c2e0c13080ecb58d0f35062dc7cc28b47"}, 323 | {file = "scipy-1.11.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b4bb943010203465ac81efa392e4645265077b4d9e99b66cf3ed33ae12254173"}, 324 | {file = "scipy-1.11.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:249cfa465c379c9bb2c20123001e151ff5e29b351cbb7f9c91587260602c58d0"}, 325 | {file = "scipy-1.11.1-cp311-cp311-win_amd64.whl", hash = "sha256:ffb28e3fa31b9c376d0fb1f74c1f13911c8c154a760312fbee87a21eb21efe31"}, 326 | {file = "scipy-1.11.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:39154437654260a52871dfde852adf1b93b1d1bc5dc0ffa70068f16ec0be2624"}, 327 | {file = "scipy-1.11.1-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:b588311875c58d1acd4ef17c983b9f1ab5391755a47c3d70b6bd503a45bfaf71"}, 328 | {file = "scipy-1.11.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d51565560565a0307ed06fa0ec4c6f21ff094947d4844d6068ed04400c72d0c3"}, 329 | {file = "scipy-1.11.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = 
"sha256:b41a0f322b4eb51b078cb3441e950ad661ede490c3aca66edef66f4b37ab1877"}, 330 | {file = "scipy-1.11.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:396fae3f8c12ad14c5f3eb40499fd06a6fef8393a6baa352a652ecd51e74e029"}, 331 | {file = "scipy-1.11.1-cp39-cp39-win_amd64.whl", hash = "sha256:be8c962a821957fdde8c4044efdab7a140c13294997a407eaee777acf63cbf0c"}, 332 | {file = "scipy-1.11.1.tar.gz", hash = "sha256:fb5b492fa035334fd249f0973cc79ecad8b09c604b42a127a677b45a9a3d4289"}, 333 | ] 334 | 335 | [package.dependencies] 336 | numpy = ">=1.21.6,<1.28.0" 337 | 338 | [package.extras] 339 | dev = ["click", "cython-lint (>=0.12.2)", "doit (>=0.36.0)", "mypy", "pycodestyle", "pydevtool", "rich-click", "ruff", "types-psutil", "typing_extensions"] 340 | doc = ["jupytext", "matplotlib (>2)", "myst-nb", "numpydoc", "pooch", "pydata-sphinx-theme (==0.9.0)", "sphinx (!=4.1.0)", "sphinx-design (>=0.2.0)"] 341 | test = ["asv", "gmpy2", "mpmath", "pooch", "pytest", "pytest-cov", "pytest-timeout", "pytest-xdist", "scikit-umfpack", "threadpoolctl"] 342 | 343 | [[package]] 344 | name = "six" 345 | version = "1.16.0" 346 | description = "Python 2 and 3 compatibility utilities" 347 | optional = false 348 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*" 349 | files = [ 350 | {file = "six-1.16.0-py2.py3-none-any.whl", hash = "sha256:8abb2f1d86890a2dfb989f9a77cfcfd3e47c2a354b01111771326f8aa26e0254"}, 351 | {file = "six-1.16.0.tar.gz", hash = "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926"}, 352 | ] 353 | 354 | [[package]] 355 | name = "sympy" 356 | version = "1.12" 357 | description = "Computer algebra system (CAS) in Python" 358 | optional = false 359 | python-versions = ">=3.8" 360 | files = [ 361 | {file = "sympy-1.12-py3-none-any.whl", hash = "sha256:c3588cd4295d0c0f603d0f2ae780587e64e2efeedb3521e46b9bb1d08d184fa5"}, 362 | {file = "sympy-1.12.tar.gz", hash = "sha256:ebf595c8dac3e0fdc4152c51878b498396ec7f30e7a914d6071e674d49420fb8"}, 363 | ] 364 | 365 | [package.dependencies] 366 | mpmath = ">=0.19" 367 | 368 | [[package]] 369 | name = "threadpoolctl" 370 | version = "3.1.0" 371 | description = "threadpoolctl" 372 | optional = false 373 | python-versions = ">=3.6" 374 | files = [ 375 | {file = "threadpoolctl-3.1.0-py3-none-any.whl", hash = "sha256:8b99adda265feb6773280df41eece7b2e6561b772d21ffd52e372f999024907b"}, 376 | {file = "threadpoolctl-3.1.0.tar.gz", hash = "sha256:a335baacfaa4400ae1f0d8e3a58d6674d2f8828e3716bb2802c44955ad391380"}, 377 | ] 378 | 379 | [[package]] 380 | name = "torch" 381 | version = "2.0.1" 382 | description = "Tensors and Dynamic neural networks in Python with strong GPU acceleration" 383 | optional = false 384 | python-versions = ">=3.8.0" 385 | files = [ 386 | {file = "torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl", hash = "sha256:8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a"}, 387 | {file = "torch-2.0.1-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:359bfaad94d1cda02ab775dc1cc386d585712329bb47b8741607ef6ef4950747"}, 388 | {file = "torch-2.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:7c84e44d9002182edd859f3400deaa7410f5ec948a519cc7ef512c2f9b34d2c4"}, 389 | {file = "torch-2.0.1-cp310-none-macosx_10_9_x86_64.whl", hash = "sha256:567f84d657edc5582d716900543e6e62353dbe275e61cdc36eda4929e46df9e7"}, 390 | {file = "torch-2.0.1-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:787b5a78aa7917465e9b96399b883920c88a08f4eb63b5a5d2d1a16e27d2f89b"}, 391 | {file = "torch-2.0.1-cp311-cp311-manylinux1_x86_64.whl", hash = 
"sha256:e617b1d0abaf6ced02dbb9486803abfef0d581609b09641b34fa315c9c40766d"}, 392 | {file = "torch-2.0.1-cp311-cp311-manylinux2014_aarch64.whl", hash = "sha256:b6019b1de4978e96daa21d6a3ebb41e88a0b474898fe251fd96189587408873e"}, 393 | {file = "torch-2.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:dbd68cbd1cd9da32fe5d294dd3411509b3d841baecb780b38b3b7b06c7754434"}, 394 | {file = "torch-2.0.1-cp311-none-macosx_10_9_x86_64.whl", hash = "sha256:ef654427d91600129864644e35deea761fb1fe131710180b952a6f2e2207075e"}, 395 | {file = "torch-2.0.1-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:25aa43ca80dcdf32f13da04c503ec7afdf8e77e3a0183dd85cd3e53b2842e527"}, 396 | {file = "torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl", hash = "sha256:5ef3ea3d25441d3957348f7e99c7824d33798258a2bf5f0f0277cbcadad2e20d"}, 397 | {file = "torch-2.0.1-cp38-cp38-manylinux2014_aarch64.whl", hash = "sha256:0882243755ff28895e8e6dc6bc26ebcf5aa0911ed81b2a12f241fc4b09075b13"}, 398 | {file = "torch-2.0.1-cp38-cp38-win_amd64.whl", hash = "sha256:f66aa6b9580a22b04d0af54fcd042f52406a8479e2b6a550e3d9f95963e168c8"}, 399 | {file = "torch-2.0.1-cp38-none-macosx_10_9_x86_64.whl", hash = "sha256:1adb60d369f2650cac8e9a95b1d5758e25d526a34808f7448d0bd599e4ae9072"}, 400 | {file = "torch-2.0.1-cp38-none-macosx_11_0_arm64.whl", hash = "sha256:1bcffc16b89e296826b33b98db5166f990e3b72654a2b90673e817b16c50e32b"}, 401 | {file = "torch-2.0.1-cp39-cp39-manylinux1_x86_64.whl", hash = "sha256:e10e1597f2175365285db1b24019eb6f04d53dcd626c735fc502f1e8b6be9875"}, 402 | {file = "torch-2.0.1-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:423e0ae257b756bb45a4b49072046772d1ad0c592265c5080070e0767da4e490"}, 403 | {file = "torch-2.0.1-cp39-cp39-win_amd64.whl", hash = "sha256:8742bdc62946c93f75ff92da00e3803216c6cce9b132fbca69664ca38cfb3e18"}, 404 | {file = "torch-2.0.1-cp39-none-macosx_10_9_x86_64.whl", hash = "sha256:c62df99352bd6ee5a5a8d1832452110435d178b5164de450831a3a8cc14dc680"}, 405 | {file = "torch-2.0.1-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:671a2565e3f63b8fe8e42ae3e36ad249fe5e567435ea27b94edaa672a7d0c416"}, 406 | ] 407 | 408 | [package.dependencies] 409 | filelock = "*" 410 | jinja2 = "*" 411 | networkx = "*" 412 | sympy = "*" 413 | typing-extensions = "*" 414 | 415 | [package.extras] 416 | opt-einsum = ["opt-einsum (>=3.3)"] 417 | 418 | [[package]] 419 | name = "typing-extensions" 420 | version = "4.7.1" 421 | description = "Backported and Experimental Type Hints for Python 3.7+" 422 | optional = false 423 | python-versions = ">=3.7" 424 | files = [ 425 | {file = "typing_extensions-4.7.1-py3-none-any.whl", hash = "sha256:440d5dd3af93b060174bf433bccd69b0babc3b15b1a8dca43789fd7f61514b36"}, 426 | {file = "typing_extensions-4.7.1.tar.gz", hash = "sha256:b75ddc264f0ba5615db7ba217daeb99701ad295353c45f9e95963337ceeeffb2"}, 427 | ] 428 | 429 | [[package]] 430 | name = "tzdata" 431 | version = "2023.3" 432 | description = "Provider of IANA time zone data" 433 | optional = false 434 | python-versions = ">=2" 435 | files = [ 436 | {file = "tzdata-2023.3-py2.py3-none-any.whl", hash = "sha256:7e65763eef3120314099b6939b5546db7adce1e7d6f2e179e3df563c70511eda"}, 437 | {file = "tzdata-2023.3.tar.gz", hash = "sha256:11ef1e08e54acb0d4f95bdb1be05da659673de4acbd21bf9c69e94cc5e907a3a"}, 438 | ] 439 | 440 | [metadata] 441 | lock-version = "2.0" 442 | python-versions = ">=3.11,<3.13" 443 | content-hash = "0558e6e5010d26a95e44cdc71fb11f9e37839a48fe0988f8009cd97d93ce8b9f" 444 | -------------------------------------------------------------------------------- 
/pyproject.toml:
--------------------------------------------------------------------------------
[tool.poetry]
name = "quran-neural-chunker"
version = "0.1.0"
description = ""
authors = ["Kais Dukes"]
readme = "README.md"
packages = [{include = "src"}]

[tool.poetry.dependencies]
python = ">=3.11,<3.13"
pandas = "^2.0.3"
torch = "^2.0.1"
scikit-learn = "^1.3.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
--------------------------------------------------------------------------------
/src/chunks/chunk.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass

from .location import Location


@dataclass(frozen=True)
class Chunk:
    start: Location
    end: Location
--------------------------------------------------------------------------------
/src/chunks/chunks.py:
--------------------------------------------------------------------------------
from typing import List, Optional

from pandas import DataFrame

from .chunk import Chunk
from .location import Location


def get_chunks(df: DataFrame) -> List[Chunk]:
    chunks: List[Chunk] = []
    start: Optional[Location] = None

    for _, row in df.iterrows():
        loc = Location(row['chapter_number'], row['verse_number'], row['token_number'])
        if start is None:
            start = loc
        if row['chunk_end'] == 1:
            end = loc
            chunk = Chunk(start, end)
            chunks.append(chunk)
            start = None

    return chunks
--------------------------------------------------------------------------------
/src/chunks/location.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    chapter_number: int
    verse_number: int
    token_number: int

    def __str__(self):
        parts = [str(self.chapter_number), str(self.verse_number)]
        if self.token_number > 0:
            parts.append(str(self.token_number))
        return ':'.join(parts)
--------------------------------------------------------------------------------
/src/chunks/preprocessor.py:
--------------------------------------------------------------------------------
from pandas import DataFrame


def preprocess(df: DataFrame):
    # flag the last token of each verse
    df['verse_end'] = (
        (df.groupby(['chapter_number', 'verse_number']).token_number.transform('max') == df.token_number)
        .astype(int))

    df['punctuation'] = df['translation'].apply(_punctuation)


PUNCTUATION = [',', '.', '\'', '\"', '!', '?']


def _punctuation(text: str) -> str:
    # returns the trailing punctuation of a translated token, e.g. 'said.' -> '.'
    n = len(text)
    for i in range(n - 1, -1, -1):
        if text[i] not in PUNCTUATION:
            return text[i+1:] if i < n - 1 else ''
    return text
--------------------------------------------------------------------------------
/src/data.py:
--------------------------------------------------------------------------------
import csv

import pandas as pd


def load_data():
    CHUNKS_FILE = 'data/quranic-treebank-0.4-chunks.tsv'
    return pd.read_csv(CHUNKS_FILE, sep='\t', quoting=csv.QUOTE_NONE)
--------------------------------------------------------------------------------
/src/embeddings.py:
--------------------------------------------------------------------------------
from typing import Dict

import numpy as np

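# Note on the vector file format: the layout below is inferred from the
# parsing in _load_embeddings and is an assumption, not a documented spec.
# Each line of data/vectors.txt appears to be:
#
#   <embedding id>: <word> <v1> <v2> ... <v256>
#
# where the id carries a trailing colon and the word itself is skipped;
# only the id and the 256-dimensional vector are stored.
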
class Embeddings:

    def __init__(self):
        self._embeddings: Dict[int, np.ndarray] = {}
        self._default_vector: np.ndarray = np.zeros(256)
        self._load_embeddings()

    def get_vector(self, embedding_id: int):
        return self._embeddings.get(embedding_id, self._default_vector)

    def _load_embeddings(self):
        VECTOR_FILE = 'data/vectors.txt'
        with open(VECTOR_FILE, 'r', encoding='utf-8') as file:
            for line in file:
                parts = line.strip().split()
                embedding_id = int(parts[0][:-1])  # strip the trailing colon from the id
                vector = np.array(list(map(float, parts[2:])))  # parts[1] is the word, skipped
                self._embeddings[embedding_id] = vector
--------------------------------------------------------------------------------
/src/models/bilstm_chunker.py:
--------------------------------------------------------------------------------
from typing import List

import numpy as np
import pandas as pd
from pandas import DataFrame
import torch
from torch import nn
from torch.optim import Adam
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

from .evaluator import Evaluator
from ..data import load_data
from ..embeddings import Embeddings
from ..chunks.preprocessor import preprocess
from ..chunks.chunks import get_chunks

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
max_length = 128


class BiLSTMModel(nn.Module):

    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(BiLSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2*hidden_size, output_size)

    def forward(self, x, lengths):
        h0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(x.device)

        x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

        packed_output, _ = self.lstm(x, (h0, c0))

        # unpack the output before passing through the linear layer
        output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)

        # manually pad the sequences to max_length
        if output.size(1) < max_length:
            output = nn.functional.pad(output, (0, 0, 0, max_length - output.size(1)))

        out = self.fc(output)
        return out


class QuranDataset(Dataset):
    def __init__(self, verses, labels):
        self.verses = verses
        self.labels = labels

    def __len__(self):
        return len(self.verses)

    def __getitem__(self, index):
        verse = self.verses[index]
        label = self.labels[index]
        length = len(verse)

        # pad copies, so that indexing is side-effect free and `length`
        # stays the true sequence length on repeated access
        if length < max_length:
            verse = verse + [[0]*len(verse[0])] * (max_length - length)
            label = label + [0] * (max_length - length)

        return torch.tensor(verse, dtype=torch.float32), torch.tensor(label), length


def get_verses(df: DataFrame):
    le = LabelEncoder()
    df['encoded_punctuation'] = le.fit_transform(df['punctuation'])

    word_vectors = Embeddings()

    rows = []
    for _, row in df.iterrows():
        embedding_vector = word_vectors.get_vector(row['embedding_id'])
        core_values = row[['token_number', 'pause_mark', 'irab_end', 'verse_end', 'encoded_punctuation']].values
        full_vector = np.concatenate([core_values, embedding_vector]).tolist()
        rows.append(full_vector + [row['chunk_end']])
    X = pd.DataFrame(rows, columns=[f'feature_{i}' for i in range(261)]+['chunk_end'])

    verses: List[List[List[float]]] = []
    labels: List[List[int]] = []
    verse_info: List[List[List[int]]] = []

    for _, group in df.groupby(['chapter_number', 'verse_number']):
        group_df = X.loc[group.index]
        # note: columns.difference returns columns in lexicographic order,
        # which is consistent across verses
        verse = group_df[group_df.columns.difference(['chunk_end'])].values.tolist()
        label = group_df['chunk_end'].tolist()

        verses.append(verse)
        labels.append(label)

        verse_info_single = group[['chapter_number', 'verse_number', 'token_number']].values.tolist()
        verse_info.append(verse_info_single)

    temp_data = list(zip(verses, verse_info, labels))
    train_temp, test_temp = train_test_split(temp_data, test_size=0.10, random_state=42)

    train_verses, train_verse_info, train_labels = zip(*train_temp)
    test_verses, test_verse_info, test_labels = zip(*test_temp)

    return train_verses, test_verses, train_labels, test_labels, train_verse_info, test_verse_info


def pack_labels(labels):
    # currently unused helper
    lengths = [len(label) for label in labels]
    max_len = max(lengths)
    labels_padded = [torch.cat([label, torch.zeros(max_len - len(label))]) for label in labels]
    return torch.stack(labels_padded)


def train_and_test():
    df = load_data()
    preprocess(df)

    df.fillna(0, inplace=True)

    input_size = 261
    hidden_size = 512
    num_layers = 2
    output_size = 2
    num_epochs = 50
    batch_size = 64
    learning_rate = 0.001

    train_verses, test_verses, train_labels, test_labels, train_verse_info, test_verse_info = get_verses(df)
    print(f'Train verse count: {len(train_verses)}')
    print(f'Test verse count: {len(test_verses)}')

    training_data = QuranDataset(train_verses, train_labels)
    testing_data = QuranDataset(test_verses, test_labels)

    train_loader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(testing_data, batch_size=batch_size, shuffle=False)

    model = BiLSTMModel(input_size, hidden_size, num_layers, output_size)
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    # train
    model.train()
    for epoch in range(num_epochs):
        for i, (verses, labels, lengths) in enumerate(train_loader):
            verses = verses.to(device)
            labels = labels.to(device)

            # forward pass
            raw_outputs = model(verses, lengths)
            labels = labels.view(-1)  # reshape labels to be a 1D tensor
            # note: padded positions (label 0) are included in this loss
            loss = criterion(raw_outputs.view(-1, output_size), labels)

            # backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

    # test
    model.eval()

    expected_results_df = DataFrame(columns=['chapter_number', 'verse_number', 'token_number', 'chunk_end'])
    output_results_df = DataFrame(columns=['chapter_number', 'verse_number', 'token_number', 'chunk_end'])

    evaluator = Evaluator()
    with torch.no_grad():
        for test_index in range(len(testing_data)):
            verse, label, length = testing_data[test_index]
            verse, label, length = verse.to(device).unsqueeze(0), label.to(device).unsqueeze(0), torch.tensor([length])

            raw_output = model(verse, length)
            _, predicted = torch.max(raw_output.data, 2)
            predicted = predicted.cpu().numpy()

            verse_info = test_verse_info[test_index]

            for idx, token in enumerate(verse_info):
                expected_row = DataFrame({
                    'chapter_number': token[0],
                    'verse_number': token[1],
                    'token_number': token[2],
                    'chunk_end': label.cpu().numpy()[0][idx]}, index=[0])
                expected_results_df = pd.concat([expected_results_df, expected_row])

                output_row = DataFrame({
                    'chapter_number': token[0],
                    'verse_number': token[1],
                    'token_number': token[2],
                    'chunk_end': predicted[0][idx]}, index=[0])
                output_results_df = pd.concat([output_results_df, output_row])

    # chunk-level evaluation
    expected_chunks = get_chunks(expected_results_df)
    output_chunks = get_chunks(output_results_df)
    print(f'Expected: {len(expected_chunks)} chunks')
    print(f'Output: {len(output_chunks)} chunks')

    evaluator.compare(expected_chunks, output_chunks)
    print(f'Precision: {evaluator.precision}')
    print(f'Recall: {evaluator.recall}')
    print(f'F1 score: {evaluator.f1_score}')
    print()
--------------------------------------------------------------------------------
/src/models/evaluator.py:
--------------------------------------------------------------------------------
from typing import List

from ..chunks.chunk import Chunk


class Evaluator:

    def __init__(self):
        self._expected_chunks = 0
        self._output_chunks = 0
        self._equivalent_chunks = 0

    def compare(self, expected_chunks: List[Chunk], output_chunks: List[Chunk]):
        expected_set = set(expected_chunks)
        output_set = set(output_chunks)

        self._expected_chunks += len(expected_set)
        self._output_chunks += len(output_set)
        self._equivalent_chunks += len(expected_set & output_set)

    @property
    def precision(self):
        return 0 if self._output_chunks == 0 else self._equivalent_chunks / self._output_chunks

    @property
    def recall(self):
        return 0 if self._expected_chunks == 0 else self._equivalent_chunks / self._expected_chunks

    @property
    def f1_score(self):
        precision = self.precision
        recall = self.recall
        return 0 if precision + recall == 0 else 2 * (precision * recall) / (precision + recall)
--------------------------------------------------------------------------------
/tests/chunker_test.py:
--------------------------------------------------------------------------------
import unittest

from src.models.bilstm_chunker import train_and_test


class ChunkerTest(unittest.TestCase):

    def test_chunker(self):
        train_and_test()


if __name__ == '__main__':
    unittest.main()
--------------------------------------------------------------------------------
/tests/dataset_splitter.py:
--------------------------------------------------------------------------------
from pandas import DataFrame
from sklearn.model_selection import GroupShuffleSplit


def split_dataset(df: DataFrame, fold: int):

    # group by verse so all tokens of a verse land in the same split
    df['verse_id'] = df['chapter_number'].astype(str) + ':' + df['verse_number'].astype(str)

    train_idx, test_idx = next(
        GroupShuffleSplit(test_size=.10, n_splits=2, random_state=fold)
        .split(df, groups=df['verse_id'])
    )

    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]

    train_df = train_df.drop(columns=['verse_id'])
    test_df = test_df.drop(columns=['verse_id'])

    return train_df, test_df
--------------------------------------------------------------------------------
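A note on `tests/dataset_splitter.py` above: the `fold` argument seeds `GroupShuffleSplit`, so each fold number produces a different randomised 90/10 split in which all tokens of a verse stay on the same side. A minimal sketch of how this could drive repeated evaluation runs (hypothetical usage, not part of the repo):

```
from src.data import load_data
from tests.dataset_splitter import split_dataset

df = load_data()
for fold in range(5):
    # copy so split_dataset's added verse_id column doesn't mutate the original
    train_df, test_df = split_dataset(df.copy(), fold)
    print(f'fold {fold}: {len(train_df)} train rows, {len(test_df)} test rows')
```

Note that unlike k-fold cross-validation, these splits are independent draws, so the test sets may overlap across fold numbers.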