├── Khmer
    ├── README.md
    └── best
    │   └── khmLimon.traineddata
├── LICENSE
├── Old_Persian
    ├── README.md
    └── legacy
    │   └── peo.traineddata
├── README.md
├── _config.yml
├── akk
    ├── README.md
    ├── best
    │   └── akk.traineddata
    ├── fast
    │   └── akk.traineddata
    └── legacy
    │   └── akk.traineddata
├── grc_hist
    ├── README.md
    └── best
    │   └── grc_hist.traineddata
└── urd_naw
    ├── README.md
    └── best
        └── urd_naw.traineddata

/Khmer/README.md:
--------------------------------------------------------------------------------
1 | # khmLimon.traineddata
2 |
3 | * Language - Khmer
4 | * Langcode - khmLimon
5 | * Type of training - Fine-tuning with new fonts
6 | * Contributed by - [@phyrumsk](https://github.com/phyrumsk)
7 |
8 | ## Training Procedure
9 |
10 | [@phyrumsk](https://github.com/phyrumsk) from [Open Institute Cambodia](https://github.com/OpenInstituteCambodia) fine-tuned the [tesseract_best engine](https://github.com/OpenInstituteCambodia/tessdata_best) with 6 new Limon fonts (Limon S1, S2, F1, F2, R1, R2) using the same Tesseract netspec.
11 |
12 | The accuracy reports show an improvement of over 10% in recognition for the newly fine-tuned fonts, without any appreciable loss in accuracy for the earlier fonts.
13 |
14 | They used isri-ocr-evaluation-tools to measure accuracy on two types of images: images generated with the text2image command (without noise) and scanned images (with noise).
15 |
16 | [Samples of the test images are also included](https://github.com/tesseract-ocr/tessdata_best/files/2066786/2018_06_04_KhmLimon_result_for_github.zip).
17 |
18 | ## Accuracy Reports
19 |
20 | ### Accuracy_Tesseract_4.0
21 |
22 |
23 | Testing text: 19618 characters, 11272 clusters
24 |
25 |
26 | | No. | Font Name | Size | Character Accuracy | | Cluster Accuracy | |
27 | |-----------|--------------------------|------|--------------------|----------|------------------|----------|
28 | | | | | Errors | Accuracy | Errors | Accuracy |
29 | | 1 | Khmer OS | 12 | 1495 | 92,38% | 650 | 94,23% |
30 | | 2 | Khmer OS Battambang | | 562 | 97,14% | 140 | 98,76% |
31 | | 3 | Khmer OS Bokor | | 3288 | 83,24% | 1846 | 83,62% |
32 | | 4 | Khmer OS Content | | 644 | 96,72% | 159 | 98,59% |
33 | | 5 | Khmer OS Fasthand | | 4605 | 76,53% | 2012 | 82,15% |
34 | | 6 | Khmer OS Freehand | | 2184 | 88,87% | 505 | 95,52% |
35 | | 7 | Khmer OS Metal Chrieng | | 3230 | 83,54% | 1694 | 84,97% |
36 | | 8 | Khmer OS Muol | | 525 | 97,32% | 135 | 98,80% |
37 | | 9 | Khmer OS Muol Light | | 3430 | 82,52% | 2115 | 81,24% |
38 | | 10 | Khmer OS Muol Pali | | 6697 | 65,86% | 3688 | 67,28% |
39 | | 11 | Khmer OS Siemreap | | 1318 | 93,28% | 582 | 94,84% |
40 | | 12 | Khmer OS System | | 2338 | 88,08% | 1076 | 90,45% |
41 | | 13 | Noto Serif Khmer Bold | | 2006 | 89,77% | 1038 | 90,79% |
42 | | 14 | Noto Serif Khmer Regular | | 2043 | 89,59% | 1045 | 90,73% |
43 | | Average : | | | 2454,64 | 87,49% | 1191,79 | 89,43% |
44 |
45 | | No. | Font Name | Size | Character Accuracy | | Cluster Accuracy | |
46 | |-----------|-----------|------|--------------------|----------|------------------|----------|
47 | | | | | Errors | Accuracy | Errors | Accuracy |
48 | | 1 | Limon F1 | 22 | 11459 | 41,59% | 7340 | 34,88% |
49 | | 2 | Limon F2 | | 11152 | 43,15% | 7652 | 32,11% |
50 | | 3 | Limon F3 | | 8454 | 56,91% | 5590 | 50,41% |
51 | | 4 | Limon F4 | | 17082 | 12,93% | 10319 | 8,45% |
52 | | 5 | Limon F5 | | 12259 | 37,51% | 7689 | 31,79% |
53 | | 6 | Limon F6 | | 6146 | 68,67% | 3858 | 65,77% |
54 | | 7 | Limon F7 | | 6750 | 65,59% | 4031 | 64,24% |
55 | | 8 | Limon F8 | | 5512 | 71,90% | 3482 | 69,11% |
56 | | 9 | Limon R1 | | 7053 | 64,05% | 4336 | 61,53% |
57 | | 10 | Limon R2 | | 8044 | 59,00% | 4584 | 59,33% |
58 | | 11 | Limon R3 | | 10467 | 46,65% | 5357 | 52,48% |
59 | | 12 | Limon R4 | | 11815 | 39,77% | 7068 | 37,30% |
60 | | 13 | Limon R5 | | 9479 | 51,68% | 5811 | 48,45% |
61 | | 14 | Limon S1 | | 2704 | 86,22% | 1479 | 86,88% |
62 | | 15 | Limon S2 | | 2277 | 88,39% | 1248 | 88,93% |
63 | | 16 | Limon S3 | | 2863 | 85,41% | 1413 | 87,46% |
64 | | 17 | Limon S4 | | 4631 | 76,39% | 3074 | 72,73% |
65 | | 18 | Limon S5 | | 2552 | 86,99% | 1488 | 86,80% |
66 | | 19 | Limon S6 | | 4577 | 76,67% | 2700 | 76,05% |
67 | | 20 | Limon S7 | | 2773 | 85,87% | 1485 | 86,83% |
68 | | Average : | | | 7402,45 | 62,27% | 4500,20 | 60,08% |
69 |
70 | ### Acc_LimonS1S2F1F2R1R2Unicode
71 |
72 | Engine: Fine Tune For Limon S1, S2, F1, F2, R1 & R2 Unicode
73 |
74 | Net_spec value: `[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx384 O1c1]`
75 |
76 | | No. | Font Name | Size | Character Accuracy | | Cluster Accuracy | |
77 | |-----------|--------------------------|------|--------------------|----------|------------------|----------|
78 | | | | | Errors | Accuracy | Misrecognized | Accuracy |
79 | | 1 | Khmer OS | 12 | 1818 | 90,73% | 900 | 92,02% |
80 | | 2 | Khmer OS Battambang | | 1058 | 94,61% | 476 | 95,78% |
81 | | 3 | Khmer OS Bokor | | 2746 | 86,00% | 1599 | 85,81% |
82 | | 4 | Khmer OS Content | | 844 | 95,70% | 353 | 96,87% |
83 | | 5 | Khmer OS Fasthand | | 5063 | 74,19% | 2296 | 79,63% |
84 | | 6 | Khmer OS Freehand | | 2370 | 87,92% | 674 | 94,02% |
85 | | 7 | Khmer OS Metal Chrieng | | 3428 | 82,53% | 1811 | 83,93% |
86 | | 8 | Khmer OS Muol | | 946 | 95,18% | 464 | 95,88% |
87 | | 9 | Khmer OS Muol Light | | 1313 | 93,31% | 720 | 93,61% |
88 | | 10 | Khmer OS Muol Pali | | 6194 | 68,43% | 3413 | 69,72% |
89 | | 11 | Khmer OS Siemreap | | 1939 | 90,12% | 1041 | 90,76% |
90 | | 12 | Khmer OS System | | 2562 | 86,94% | 1248 | 88,93% |
91 | | 13 | Noto Serif Khmer Bold | | 2270 | 88,43% | 1256 | 88,86% |
92 | | 14 | Noto Serif Khmer Regular | | 2334 | 88,10% | 1278 | 88,66% |
93 | | Average : | | | 2491,79 | 87,30% | 1252,07 | 88,89% |
94 |
95 |
96 | | No. | Font Name | Size | Character Accuracy | | Cluster Accuracy | |
97 | |-----------|-----------|------|--------------------|----------|------------------|----------|
98 | | | | | Errors | Accuracy | Misrecognized | Accuracy |
99 | | 1 | Limon F1 | 22 | 2599 | 86,75% | 1696 | 84,95% |
100 | | 2 | Limon F2 | | 3896 | 80,14% | 2661 | 76,39% |
101 | | 3 | Limon F3 | | 4178 | 78,70% | 2786 | 75,28% |
102 | | 4 | Limon F4 | | 14807 | 24,52% | 9374 | 16,84% |
103 | | 5 | Limon F5 | | 4859 | 75,23% | 3282 | 70,88% |
104 | | 6 | Limon F6 | | 2576 | 86,87% | 1622 | 85,61% |
105 | | 7 | Limon F7 | | 4115 | 79,02% | 2145 | 80,97% |
106 | | 8 | Limon F8 | | 3580 | 81,75% | 2166 | 80,78% |
107 | | 9 | Limon R1 | | 2144 | 89,07% | 1358 | 87,95% |
108 | | 10 | Limon R2 | | 3681 | 81,24% | 1626 | 85,57% |
109 | | 11 | Limon R3 | | 8224 | 58,08% | 3482 | 69,11% |
110 | | 12 | Limon R4 | | 9265 | 52,77% | 4504 | 60,04% |
111 | | 13 | Limon R5 | | 3120 | 84,10% | 1995 | 82,30% |
112 | | 14 | Limon S1 | | 1262 | 93,57% | 612 | 94,57% |
113 | | 15 | Limon S2 | | 1557 | 92,06% | 756 | 93,29% |
114 | | 16 | Limon S3 | | 2100 | 89,30% | 1070 | 90,51% |
115 | | 17 | Limon S4 | | 2515 | 87,18% | 1670 | 85,18% |
116 | | 18 | Limon S5 | | 2069 | 89,45% | 1122 | 90,05% |
117 | | 19 | Limon S6 | | 2885 | 85,29% | 1473 | 86,93% |
118 | | 20 | Limon S7 | | 1852 | 90,56% | 889 | 92,11% |
119 | | Average : | | | 4064,20 | 79,28% | 2314,45 | 79,47% |
120 |
--------------------------------------------------------------------------------
/Khmer/best/khmLimon.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/Khmer/best/khmLimon.traineddata
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 
2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/Old_Persian/README.md:
--------------------------------------------------------------------------------
1 | # peo.traineddata
2 |
3 | ## tesseract_old_persian:
4 |
5 | This project aims to create an OCR model (image to text) to help translate Old Persian cuneiform. It is part of the [Electronic Old Persian Library](https://github.com/Electronic-Old-Persian-Library) organization.
6 |
7 | This pre-trained Tesseract OCR model (version 3, for the Legacy engine) converts Old Persian cuneiform to English transcription and was developed by [S. Muhammad Hossein Mousavi](https://github.com/SeyedMuhammadHosseinMousavi/Extracting-Old-Persian-Cuneiform/tree/main).
8 |
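
As a rough illustration of how a legacy-engine model such as `peo` might be invoked, here is a minimal Python sketch around the Tesseract command line. It assumes the `tesseract` executable is installed with legacy engine support; the image name and the tessdata directory are hypothetical placeholders, not paths taken from this project.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang, tessdata_dir, oem):
    """Return the argument list for a Tesseract CLI call (nothing is executed here)."""
    return [
        "tesseract",
        str(image),
        str(output_base),
        "-l", lang,
        "--tessdata-dir", str(tessdata_dir),
        "--oem", str(oem),
    ]


def run_ocr(image, output_base, lang, tessdata_dir, oem):
    """Run Tesseract and return the text it writes to <output_base>.txt."""
    command = build_tesseract_command(image, output_base, lang, tessdata_dir, oem)
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical paths: adjust them to where the image and the model actually live.
PEO_COMMAND = build_tesseract_command(
    image="inscription.png",
    output_base="inscription",
    lang="peo",
    tessdata_dir="tessdata_contrib/Old_Persian/legacy",
    oem=0,  # 0 selects the legacy engine, which this model targets
)
```

Calling `run_ocr` with the same arguments would execute the command and return the transcription; building the argument list separately keeps the call easy to inspect before anything is run.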
9 |
10 |
11 | ## Notebook:
12 | https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/blob/master/tesseract_old_persian/Tesseract_Old_Persian_OCR.ipynb
13 |
14 |
15 |
16 |
17 |
18 | ## An example:
19 |
20 | The last 12 lines of Darius the Great's inscription at Persepolis, [DPd inscription](https://www.livius.org/sources/content/achaemenid-royal-inscriptions/dpd/):
21 |
22 | Input:
23 |
24 | ![darius2](https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/assets/74653444/fc8f2a4c-b8b4-4b46-97e3-c87d506fd6fd)
25 |
26 |
27 |
28 | Output:
29 |
30 | Zittiy ; iaryvuS ; xrSayZiy;
31 |
32 | mnc;aurmzia;upstam; rlauv;
33 |
34 | hia ; ViZiriS ; rgiriS ; uta;
35 |
36 | im am ; i h yaum ; au lm z i a ;
37 |
38 | pitTucs;hca;hinaya; hca;
39 |
40 | QuSiyala ; hca;iruga;ariy;
41 |
42 | imam ;ihyaum;ma; ajMiya; ait;
43 |
44 | aim ;yanm;jDiyaMiy;
45 |
46 | aitmiy ; iiaTuv
47 |
48 | ## In the next stage, we can translate the Old Persian transcription into modern languages with [ChatGPT](https://chatgpt.com/):
49 |
50 | Translate to Modern Persian:
51 |
52 |
53 |
54 |
55 | این منم داریوش شاهنشاه؛
56 | به لطف اهورامزدا، من این را بنا کردم؛
57 | من این امپراتوری را بنیان نهادم و آن را نیرومند ساختم.
58 | باشد که اهورامزدا من و پادشاهی مرا محافظت کند؛
59 | باشد که برای همیشه پایدار بماند؛
60 | و باشد که از دروغ در امان باشد؛
61 | این است آنچه من انجام دادم؛
62 |
63 | این است آنچه من می‌گویم.
64 |
65 | Translate to Modern English:
66 |
67 | “This is me, Dariush king; By the grace of Ahura Mazda, I have built this; I founded this empire and made it strong. May Ahuramazda protect me and my kingdom; may it last forever; and it would be safe from lies; that is what I did;
68 | That is what I am saying.”
69 |
70 | ## Article
71 |
72 | I wrote an [article](https://www.researchgate.net/publication/382528886_Translating_Old_Persian_cuneiform_by_artificial_intelligence_AI) as a short report on what I have done for this project so far.
73 |
74 | ## Notice
75 |
76 | This [project](https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR) is still under development. To contribute, contact me by email: melaneepython@gmail.com
77 |
--------------------------------------------------------------------------------
/Old_Persian/legacy/peo.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/Old_Persian/legacy/peo.traineddata
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # tessdata_contrib
2 |
3 | User-contributed (non-Google) data repository for Tesseract 4 and 5 (Akkadian, Ancient Greek, Old Persian, ...)
4 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-cayman
--------------------------------------------------------------------------------
/akk/README.md:
--------------------------------------------------------------------------------
1 | # akk.traineddata
2 |
3 | ## Model card
4 |
5 | * Language - Akkadian
6 | * Language code - akk
7 | * Type of training - fine-tuning for LSTM models, training for legacy model
8 | * Contributed by - [@Shreeshrii](https://github.com/Shreeshrii) and [@wincentbalin](https://github.com/wincentbalin)
9 |
10 | ## Model files
11 |
12 | * LSTM model - [akk.traineddata](best/akk.traineddata)
13 | * Fast LSTM model - [akk.traineddata](fast/akk.traineddata)
14 | * Legacy model - [akk.traineddata](legacy/akk.traineddata)
15 |
16 | ## Training Procedure
17 |
18 | Training was performed by [@Shreeshrii](https://github.com/Shreeshrii) and [@wincentbalin](https://github.com/wincentbalin).
19 |
20 | The source data for the LSTM model is provided in the [LSTM langdata](https://github.com/tesseract-ocr/langdata_lstm/tree/main/akk)
21 | repository and for the legacy model in the [legacy langdata](https://github.com/tesseract-ocr/langdata/tree/main/akk) repository.
22 |
23 | The description of the training procedure is provided in the [tesstrain wiki](https://github.com/tesseract-ocr/tesstrain/wiki/Akkadian-Cuneiform). The repository with the most recent setup is [here](https://github.com/wincentbalin/tesstrain-akk).
24 |
25 |
--------------------------------------------------------------------------------
/akk/best/akk.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/akk/best/akk.traineddata
--------------------------------------------------------------------------------
/akk/fast/akk.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/akk/fast/akk.traineddata
--------------------------------------------------------------------------------
/akk/legacy/akk.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/akk/legacy/akk.traineddata
--------------------------------------------------------------------------------
/grc_hist/README.md:
--------------------------------------------------------------------------------
1 | # grc_hist: a Tesseract model for historical documents written in (polytonic) Greek
2 |
3 | `grc_hist` was developed in the context of the project [AjaxMultiCommentary](https://github.com/AjaxMultiCommentary/). 
It is our best Tesseract model for recognition of historical documents written in (polytonic) Greek.
4 |
5 | **Training.** This model starts from `grc`, updates its dictionary and was fine-tuned on 35K+ real-life ground-truth lines for 40 epochs. This model is the best checkpoint on our evaluation data and was produced at epoch 23.
6 |
7 | Fine-tuning datasets are [GT-commentaries-OCR](https://github.com/AjaxMultiCommentary/GT-commentaries-OCR) and [Pogretra](https://zenodo.org/record/4774201), which contain a very broad set of images in terms of period, font, type and style.
8 |
9 | The training command used is the following (to be run once the training requirements are in place):
10 |
11 | ```shell
12 | cd /your/tesstrain/dir
13 | export TESSDATA_PREFIX=/your/tessdata_best/dir
14 | export LD_LIBRARY_PATH=/your/lib/dir # ~/anaconda3/lib depending on your installation
15 | make training MODEL_NAME=grc_hist START_MODEL=grc GROUND_TRUTH_DIR=/your/path/to/dir/containing/both/datasets/ \
16 |     LANGDATA_DIR=/your/lib/langdata_lstm/dir/ TESSDATA=/your/tessdata_best/dir/ DATA_DIR=/your/dir/ CORES=30 EPOCHS=40 LEARNING_RATE=0.0001 PSM=7 RATIO_TRAIN=0.95 TARGET_ERROR_RATE=0.001
17 | ```
18 |
19 | **Results**. Tested on both datasets and on altered images, `grc_hist` drastically surpasses `grc`, with error rates as low as .007 (0.7%) on Greek characters. The table below shows the results of our main experiments, with error rates (ER) for characters, words and Greek characters only.
20 |
21 |
22 | | model | test_dataset | chars_ER | words_ER | greek_chars_ER |
23 | | -------- | ------------ | -------- | -------- | -------------- |
24 | | grc | ajmc | .096 | .347 | .091 |
25 | | grc | pogretra | .059 | .214 | .049 |
26 | | grc_hist | ajmc | .013 | .061 | .011 |
27 | | grc_hist | pogretra | .015 | .05 | .007 |
28 |
29 |
30 |
31 | **Usage**. The model could be of great value for libraries, researchers and anyone interested in historical Greek documents.
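
To put the results table above in perspective, the following Python sketch computes the relative error reduction of `grc_hist` over `grc` for each metric. The error rates are copied from the table; the function and variable names are illustrative only and are not part of the project.

```python
# Error rates copied from the results table: (chars_ER, words_ER, greek_chars_ER).
ERROR_RATES = {
    ("grc", "ajmc"): (0.096, 0.347, 0.091),
    ("grc", "pogretra"): (0.059, 0.214, 0.049),
    ("grc_hist", "ajmc"): (0.013, 0.061, 0.011),
    ("grc_hist", "pogretra"): (0.015, 0.05, 0.007),
}
METRICS = ("chars_ER", "words_ER", "greek_chars_ER")


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that the improved model removes."""
    return (baseline - improved) / baseline


def summarize(dataset):
    """Map each metric to the relative error reduction on one test dataset."""
    base = ERROR_RATES[("grc", dataset)]
    tuned = ERROR_RATES[("grc_hist", dataset)]
    return {m: relative_reduction(b, t) for m, b, t in zip(METRICS, base, tuned)}


if __name__ == "__main__":
    for dataset in ("ajmc", "pogretra"):
        for metric, reduction in summarize(dataset).items():
            print(f"{dataset:9s} {metric:15s} {reduction:6.1%}")
```

On the `ajmc` set, for example, the character error rate drops from .096 to .013, a relative reduction of roughly 86%.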
32 |
--------------------------------------------------------------------------------
/grc_hist/best/grc_hist.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/grc_hist/best/grc_hist.traineddata
--------------------------------------------------------------------------------
/urd_naw/README.md:
--------------------------------------------------------------------------------
1 | # urd_naw - Improved Tesseract Model for Urdu OCR
2 |
3 | This repository provides an enhanced Tesseract OCR model (`urd_naw`) specifically trained for improved text extraction from Urdu script images. This custom model demonstrates significantly better performance compared to the standard Tesseract Urdu model (`urd`).
4 |
5 | ## 📚 Training Dataset
6 |
7 | The `urd_naw` model was trained on a comprehensive dataset aggregating Urdu text from diverse sources. This dataset comprises 12,748 image-text pairs, meticulously curated for effective model fine-tuning.
8 |
9 | - **Access the dataset:** [Training Dataset Link](https://drive.google.com/file/d/1jbdFo1ea8MxZ7yHOVIcPgP5yOd4ihvBU/view?usp=drive_link)
10 |
11 | ## 🚀 Training
12 |
13 | The model was fine-tuned using the Tesseract training tools. Below is an example command for initiating training on Windows using PowerShell. Please adapt the paths according to your environment setup.
14 |
15 | ```powershell
16 | # Ensure you are in your tesstrain data directory
17 | cd C:\path\to\your\tesstrain\data
18 |
19 | # Set TESSDATA_PREFIX environment variable
20 | $env:TESSDATA_PREFIX = "C:\path\to\your\tessdata_best\data"
21 |
22 | # Run the training command
23 | make training MODEL_NAME=urd_naw START_MODEL=urd `
24 | LANGDATA_DIR="C:\path\to\your\langdata_lstm\dir\" `
25 | TESSDATA="C:\path\to\your\tessdata_best\dir\" `
26 | LANG_TYPE=RTL
27 | ```
28 |
29 | ## 📊 Model Performance and Results
30 |
31 | The `urd_naw` model exhibits substantial improvements over the baseline `urd` model. Key performance metrics from the training process (Iteration 9724) are summarized below:
32 |
33 | | Metric | Value | Interpretation |
34 | | :----------- | :------- | :--------------------------------------------------- |
35 | | Mean RMS | 2.125% | High accuracy in character shape recognition. |
36 | | Delta | 9.981% | Significant improvement during training iterations. |
37 | | BCER (train) | 27.125% | ~73% character-level accuracy on the training set. |
38 | | BWER (train) | 63.071% | ~37% word-level accuracy on the training set; room for improvement. |
39 | | Skip ratio | 0.100% | Minimal skipping of image regions during processing. |
40 |
41 | **Key Advantages:**
42 |
43 | - **Enhanced Recognition:** Superior accuracy for Urdu text compared to the default Tesseract model.
44 | - **Character Accuracy:** More precise recognition at the character level.
45 | - **Complex Script Handling:** Better processing of intricate Urdu script patterns (ligatures, diacritics).
46 | - **Reliability:** Increased robustness for processing real-world Urdu documents.
47 |
48 | ## 💡 Potential Applications
49 |
50 | This improved model is well-suited for various tasks, including:
51 |
52 | - Digitizing printed Urdu books and historical documents.
53 | - Extracting text from handwritten Urdu notes or manuscripts.
54 | - Creating searchable text layers for Urdu image archives.
55 | - Assisting researchers and libraries working with Urdu materials.
56 |
57 | ## 🖼️ Visual Comparison
58 |
59 | To visually demonstrate the performance difference, here is a comparison using a sample test image:
60 |
61 | **Test Image:**
62 |
63 | ![Test Urdu Image](https://nawadiraat.org/images/ocr-image.jpg)
64 |
65 | *(Original Link: https://nawadiraat.org/images/ocr-image.jpg)*
66 |
67 | **Model Output Comparison:**
68 |
69 | | Model | Output Text |
70 | | :---------------- | :------------------------------------------------- |
71 | | Default `urd` | سے تو دی یں ف تھی ,جس دسیان اپ جما لکا اس ےکیا خر مرے شو قکیء اس ےکرا ین مرے عا لکا |
72 | | Improved `urd_naw`| جے خودہی نہیں فر تیں، جس دھیان اپنے جمال کا اسے کیاخبر مرے شوق کی، اسے کیاپنہ مرے حال کا |
73 |
74 | **Observation:** The `urd_naw` model demonstrates notably fewer character-level errors and significantly improved word segmentation. Complex ligatures and diacritics are more accurately recognized, resulting in a more coherent and readable output compared to the standard `urd` model.
75 |
76 | ## 📜 License
77 |
78 | This project is licensed under the Apache 2.0 License. See the `LICENSE` file for details.
79 |
--------------------------------------------------------------------------------
/urd_naw/best/urd_naw.traineddata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tesseract-ocr/tessdata_contrib/1b7ada6f9ed0e165f06b3212500e1433fdf4dfc7/urd_naw/best/urd_naw.traineddata
--------------------------------------------------------------------------------