├── .gitignore ├── 1 CLTK Setup.ipynb ├── 2 Import corpora.ipynb ├── 3 Basic NLP.ipynb ├── 4 Lemmatization.ipynb ├── 5 Text reuse.ipynb ├── 6 N-grams.ipynb ├── 7 Syllabification, prosody, phonetics.ipynb ├── 8 Part-of-speech tagging.ipynb ├── 9 Lexical Dispersion Plot.ipynb ├── LICENSE ├── README.md ├── images ├── lexical_diversity_greek_canon.png └── tableau_bubble.png └── languages ├── old-norse ├── old-norse-tutorial.ipynb └── runes.ipynb ├── old_english └── Phonological Rules.ipynb └── south_asia ├── Bengali_tutorial.ipynb ├── Gujarati_tutorial.ipynb ├── Hindi_tutorial.ipynb ├── Kannada_tutorial.ipynb ├── Malayalam_tutorial.ipynb ├── Marathi_tutorial.ipynb ├── Odia_tutorial.ipynb ├── Pali_tutorial.ipynb ├── Prakrit_tutorial.ipynb ├── Punjabi_tutorial.ipynb ├── Sanskrit_tutorial.ipynb └── Telugu_tutorial.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /1 CLTK Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Install Python\n", 8 | "\n", 9 | "## Mac\n", 10 | "\n", 11 | "See (current version is 3.6.4).\n", 12 | "\n", 13 | "\n", 14 | "## Linux\n", 15 | "\n", 16 | "Open Terminal and check current version with `python --version` or `python3 --version`. If 3.4 or 3.5, you're fine. If Python version is out of date, run these:\n", 17 | "\n", 18 | "``` bash\n", 19 | "$ curl -O https://raw.githubusercontent.com/kylepjohnson/python3_bootstrap/master/install.sh\n", 20 | "$ chmod +x install.sh\n", 21 | "$ ./install.sh\n", 22 | "```\n", 23 | "\n", 24 | "This Linux build from source will take around 5 minutes." 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "# Install Git\n", 32 | "\n", 33 | "The CLTK uses Git for corpus management. For Mac, install it from here: . 
For Linux, check if present (`git --version`); if not then use your package manager to get it (e.g., `apt-get install git`)." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# Create a virtual environment\n", 41 | "\n", 42 | "This makes a special environment (a \"sandbox\") just for the CLTK. If something goes wrong, you can just delete it and start again.\n", 43 | "\n", 44 | "``` bash\n", 45 | "$ cd ~/\n", 46 | "$ mkdir cltk\n", 47 | "$ cd cltk\n", 48 | "$ pyvenv venv\n", 49 | "$ source venv/bin/activate\n", 50 | "```\n", 51 | "\n", 52 | "Now you can see that you're not using your system Python but this particular one:\n", 53 | "\n", 54 | "``` bash\n", 55 | "$ which python\n", 56 | "```\n", 57 | "\n", 58 | "Note that every time you open a new Terminal window, you'll need to \"activate\" this environment with `source ~/cltk/venv/bin/activate`." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "# Install CLTK\n", 66 | "\n", 67 | "``` bash\n", 68 | "$ pip install cltk\n", 69 | "```\n", 70 | "\n", 71 | "This will take a few minutes, as it will install several \"dependencies\", being other Python libraries which the CLTK uses.\n", 72 | "\n", 73 | "Also install Jupyter, which is a really handy way of writing code.\n", 74 | "\n", 75 | "``` bash\n", 76 | "$ pip install jupyter\n", 77 | "```" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "# Test Jupyter\n", 85 | "\n", 86 | "From your `cltk` directory, launch a notebook (such as this one) from the Terminal with `jupyter notebook`. Then open your preferred browser to ." 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "# Download these tutorials\n", 94 | "\n", 95 | "You may find these instructions at ." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "# Join GitHub\n", 103 | "\n", 104 | "GitHub is a nice way to share code. Come visit us at !" 105 | ] 106 | } 107 | ], 108 | "metadata": { 109 | "kernelspec": { 110 | "display_name": "Python 3", 111 | "language": "python", 112 | "name": "python3" 113 | }, 114 | "language_info": { 115 | "codemirror_mode": { 116 | "name": "ipython", 117 | "version": 3 118 | }, 119 | "file_extension": ".py", 120 | "mimetype": "text/x-python", 121 | "name": "python", 122 | "nbconvert_exporter": "python", 123 | "pygments_lexer": "ipython3", 124 | "version": "3.6.4" 125 | } 126 | }, 127 | "nbformat": 4, 128 | "nbformat_minor": 1 129 | } 130 | -------------------------------------------------------------------------------- /2 Import corpora.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "The CLTK has a distributed infrastructure that lets you download official CLTK texts or other corpora shared by others. For full docs, see .\n", 8 | "\n", 9 | "To get started, from the Terminal, open a new Jupyter notebook from within your `~/cltk` directory (see notebook 1 \"CLTK Setup\" for instructions): `jupyter notebook`. Then go to ." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# See what corpora are available\n", 17 | "\n", 18 | "First we need to \"import\" the right part of the CLTK library. Think of this as pulling just the book you need off the shelf and having it ready to read." 
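The cells below walk through this one step at a time. As a compact preview, the whole download flow, using only the calls demonstrated in this notebook (the variable name `downloader` is arbitrary), looks roughly like this:

``` python
# Condensed sketch of the corpus-download flow demonstrated cell by cell below
from cltk.corpus.utils.importer import CorpusImporter

downloader = CorpusImporter('latin')            # one importer per language
print(downloader.list_corpora)                  # names of the official corpora for that language
downloader.import_corpus('latin_models_cltk')   # clones the corpus under ~/cltk_data/latin/
```

Each language gets its own importer, so the Greek corpora further down use `CorpusImporter('greek')` in exactly the same way.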
19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "# This is the import of the right part of the CLTK library\n", 28 | "\n", 29 | "from cltk.corpus.utils.importer import CorpusImporter" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# See https://github.com/cltk for all official corpora\n", 39 | "\n", 40 | "my_latin_downloader = CorpusImporter('latin')\n", 41 | "\n", 42 | "# Now 'my_latin_downloader' is the variable by which we call the CorpusImporter" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "data": { 52 | "text/plain": [ 53 | "['latin_text_perseus',\n", 54 | " 'latin_treebank_perseus',\n", 55 | " 'latin_text_latin_library',\n", 56 | " 'phi5',\n", 57 | " 'phi7',\n", 58 | " 'latin_proper_names_cltk',\n", 59 | " 'latin_models_cltk',\n", 60 | " 'latin_pos_lemmata_cltk',\n", 61 | " 'latin_treebank_index_thomisticus',\n", 62 | " 'latin_lexica_perseus',\n", 63 | " 'latin_training_set_sentence_cltk',\n", 64 | " 'latin_word2vec_cltk',\n", 65 | " 'latin_text_antique_digiliblt',\n", 66 | " 'latin_text_corpus_grammaticorum_latinorum',\n", 67 | " 'latin_text_poeti_ditalia']" 68 | ] 69 | }, 70 | "execution_count": 3, 71 | "metadata": {}, 72 | "output_type": "execute_result" 73 | } 74 | ], 75 | "source": [ 76 | "my_latin_downloader.list_corpora" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "# Import several corpora" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 4, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "my_latin_downloader.import_corpus('latin_text_latin_library')\n", 93 | "my_latin_downloader.import_corpus('latin_models_cltk')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "You can verify the files were downloaded in the Terminal with `$ ls -l ~/cltk_data/latin/text/latin_text_latin_library/`" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 6, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "['greek_software_tlgu',\n", 112 | " 'greek_text_perseus',\n", 113 | " 'phi7',\n", 114 | " 'tlg',\n", 115 | " 'greek_proper_names_cltk',\n", 116 | " 'greek_models_cltk',\n", 117 | " 'greek_treebank_perseus',\n", 118 | " 'greek_lexica_perseus',\n", 119 | " 'greek_training_set_sentence_cltk',\n", 120 | " 'greek_word2vec_cltk',\n", 121 | " 'greek_text_lacus_curtius',\n", 122 | " 'greek_text_first1kgreek']" 123 | ] 124 | }, 125 | "execution_count": 6, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "# Let's get some Greek corpora, too\n", 132 | "\n", 133 | "my_greek_downloader = CorpusImporter('greek')\n", 134 | "my_greek_downloader.import_corpus('greek_models_cltk')\n", 135 | "my_greek_downloader.list_corpora" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 6, 141 | "metadata": { 142 | "collapsed": true 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "my_greek_downloader.import_corpus('greek_text_lacus_curtius')" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Likewise, verify with `ls -l ~/cltk_data/greek/text/greek_text_lacus_curtius/plain/`" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 
158 | "execution_count": 3, 159 | "metadata": { 160 | "scrolled": true 161 | }, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Downloaded 5% 2.22 MiB | 4.15 MiB/s \r", 168 | "Downloaded 6% 2.22 MiB | 4.15 MiB/s \r", 169 | "Downloaded 7% 2.22 MiB | 4.15 MiB/s \r", 170 | "Downloaded 8% 2.22 MiB | 4.15 MiB/s \r", 171 | "Downloaded 8% 2.22 MiB | 4.15 MiB/s \r", 172 | "Downloaded 9% 2.22 MiB | 4.15 MiB/s \r", 173 | "Downloaded 10% 5.62 MiB | 5.43 MiB/s \r", 174 | "Downloaded 11% 5.62 MiB | 5.43 MiB/s \r", 175 | "Downloaded 12% 5.62 MiB | 5.43 MiB/s \r", 176 | "Downloaded 13% 5.62 MiB | 5.43 MiB/s \r", 177 | "Downloaded 14% 5.62 MiB | 5.43 MiB/s \r", 178 | "Downloaded 15% 5.62 MiB | 5.43 MiB/s \r", 179 | "Downloaded 16% 5.62 MiB | 5.43 MiB/s \r", 180 | "Downloaded 17% 5.62 MiB | 5.43 MiB/s \r", 181 | "Downloaded 18% 5.62 MiB | 5.43 MiB/s \r", 182 | "Downloaded 19% 5.62 MiB | 5.43 MiB/s \r", 183 | "Downloaded 20% 5.62 MiB | 5.43 MiB/s \r", 184 | "Downloaded 21% 14.47 MiB | 9.42 MiB/s \r", 185 | "Downloaded 22% 14.47 MiB | 9.42 MiB/s \r", 186 | "Downloaded 23% 14.47 MiB | 9.42 MiB/s \r", 187 | "Downloaded 24% 14.47 MiB | 9.42 MiB/s \r", 188 | "Downloaded 24% 17.04 MiB | 8.32 MiB/s \r", 189 | "Downloaded 25% 17.04 MiB | 8.32 MiB/s \r", 190 | "Downloaded 26% 17.04 MiB | 8.32 MiB/s \r", 191 | "Downloaded 27% 17.04 MiB | 8.32 MiB/s \r", 192 | "Downloaded 28% 19.54 MiB | 7.33 MiB/s \r", 193 | "Downloaded 29% 19.54 MiB | 7.33 MiB/s \r", 194 | "Downloaded 30% 19.54 MiB | 7.33 MiB/s \r", 195 | "Downloaded 30% 19.54 MiB | 7.33 MiB/s \r", 196 | "Downloaded 31% 19.54 MiB | 7.33 MiB/s \r", 197 | "Downloaded 32% 19.54 MiB | 7.33 MiB/s \r", 198 | "Downloaded 33% 19.54 MiB | 7.33 MiB/s \r", 199 | "Downloaded 34% 19.54 MiB | 7.33 MiB/s \r", 200 | "Downloaded 35% 19.54 MiB | 7.33 MiB/s \r", 201 | "Downloaded 36% 19.54 MiB | 7.33 MiB/s \r", 202 | "Downloaded 37% 23.64 MiB | 7.33 MiB/s \r", 203 | "Downloaded 38% 24.18 MiB | 6.49 MiB/s \r", 204 | "Downloaded 38% 26.77 MiB | 6.18 MiB/s \r", 205 | "Downloaded 39% 26.77 MiB | 6.18 MiB/s \r", 206 | "Downloaded 40% 26.77 MiB | 6.18 MiB/s \r", 207 | "Downloaded 41% 26.77 MiB | 6.18 MiB/s \r", 208 | "Downloaded 42% 26.77 MiB | 6.18 MiB/s \r", 209 | "Downloaded 43% 26.77 MiB | 6.18 MiB/s \r", 210 | "Downloaded 44% 26.77 MiB | 6.18 MiB/s \r", 211 | "Downloaded 44% 29.04 MiB | 5.83 MiB/s \r", 212 | "Downloaded 45% 29.04 MiB | 5.83 MiB/s \r", 213 | "Downloaded 46% 32.96 MiB | 5.45 MiB/s \r", 214 | "Downloaded 46% 32.96 MiB | 5.45 MiB/s \r", 215 | "Downloaded 47% 34.78 MiB | 4.05 MiB/s \r", 216 | "Downloaded 48% 34.78 MiB | 4.05 MiB/s \r", 217 | "Downloaded 49% 34.78 MiB | 4.05 MiB/s \r", 218 | "Downloaded 49% 37.17 MiB | 4.02 MiB/s \r", 219 | "Downloaded 50% 37.17 MiB | 4.02 MiB/s \r", 220 | "Downloaded 51% 39.68 MiB | 4.07 MiB/s \r", 221 | "Downloaded 51% 39.68 MiB | 4.07 MiB/s \r", 222 | "Downloaded 52% 42.10 MiB | 3.68 MiB/s \r", 223 | "Downloaded 53% 42.10 MiB | 3.68 MiB/s \r", 224 | "Downloaded 54% 42.10 MiB | 3.68 MiB/s \r", 225 | "Downloaded 55% 42.10 MiB | 3.68 MiB/s \r", 226 | "Downloaded 56% 42.10 MiB | 3.68 MiB/s \r", 227 | "Downloaded 57% 42.10 MiB | 3.68 MiB/s \r", 228 | "Downloaded 58% 42.10 MiB | 3.68 MiB/s \r", 229 | "Downloaded 59% 42.10 MiB | 3.68 MiB/s \r", 230 | "Downloaded 60% 44.94 MiB | 4.14 MiB/s \r", 231 | "Downloaded 61% 44.94 MiB | 4.14 MiB/s \r", 232 | "Downloaded 62% 44.94 MiB | 4.14 MiB/s \r", 233 | "Downloaded 62% 44.94 MiB | 4.14 MiB/s \r", 234 | "Downloaded 63% 46.48 MiB | 3.99 MiB/s \r", 
235 | "Downloaded 64% 46.48 MiB | 3.99 MiB/s \r", 236 | "Downloaded 64% 50.14 MiB | 4.04 MiB/s \r", 237 | "Downloaded 65% 50.14 MiB | 4.04 MiB/s \r", 238 | "Downloaded 65% 52.54 MiB | 4.17 MiB/s \r", 239 | "Downloaded 66% 55.90 MiB | 4.08 MiB/s \r", 240 | "Downloaded 66% 57.35 MiB | 3.87 MiB/s \r", 241 | "Downloaded 67% 59.61 MiB | 3.81 MiB/s \r", 242 | "Downloaded 67% 59.61 MiB | 3.81 MiB/s \r", 243 | "Downloaded 68% 63.14 MiB | 3.48 MiB/s \r", 244 | "Downloaded 68% 63.14 MiB | 3.48 MiB/s \r", 245 | "Downloaded 69% 65.30 MiB | 3.52 MiB/s \r", 246 | "Downloaded 70% 65.30 MiB | 3.52 MiB/s \r", 247 | "Downloaded 70% 69.03 MiB | 3.83 MiB/s \r", 248 | "Downloaded 71% 70.43 MiB | 3.62 MiB/s \r", 249 | "Downloaded 72% 70.43 MiB | 3.62 MiB/s \r", 250 | "Downloaded 72% 70.43 MiB | 3.62 MiB/s \r", 251 | "Downloaded 73% 72.05 MiB | 3.57 MiB/s \r", 252 | "Downloaded 74% 72.05 MiB | 3.57 MiB/s \r", 253 | "Downloaded 75% 74.07 MiB | 3.66 MiB/s \r", 254 | "Downloaded 76% 76.26 MiB | 3.80 MiB/s \r", 255 | "Downloaded 76% 76.26 MiB | 3.80 MiB/s \r", 256 | "Downloaded 77% 78.72 MiB | 3.79 MiB/s \r", 257 | "Downloaded 77% 78.72 MiB | 3.79 MiB/s \r", 258 | "Downloaded 78% 78.72 MiB | 3.79 MiB/s \r", 259 | "Downloaded 79% 78.72 MiB | 3.79 MiB/s \r", 260 | "Downloaded 80% 85.18 MiB | 4.70 MiB/s \r", 261 | "Downloaded 81% 85.18 MiB | 4.70 MiB/s \r", 262 | "Downloaded 82% 85.18 MiB | 4.70 MiB/s \r", 263 | "Downloaded 83% 85.18 MiB | 4.70 MiB/s \r", 264 | "Downloaded 84% 85.18 MiB | 4.70 MiB/s \r", 265 | "Downloaded 85% 85.18 MiB | 4.70 MiB/s \r", 266 | "Downloaded 86% 85.18 MiB | 4.70 MiB/s \r", 267 | "Downloaded 87% 92.76 MiB | 6.02 MiB/s \r", 268 | "Downloaded 88% 92.76 MiB | 6.02 MiB/s \r", 269 | "Downloaded 89% 92.76 MiB | 6.02 MiB/s \r", 270 | "Downloaded 89% 92.76 MiB | 6.02 MiB/s \r", 271 | "Downloaded 90% 92.76 MiB | 6.02 MiB/s \r", 272 | "Downloaded 91% 92.76 MiB | 6.02 MiB/s \r", 273 | "Downloaded 92% 92.76 MiB | 6.02 MiB/s \r", 274 | "Downloaded 93% 102.39 MiB | 7.76 MiB/s \r", 275 | "Downloaded 94% 102.39 MiB | 7.76 MiB/s \r", 276 | "Downloaded 95% 113.15 MiB | 9.19 MiB/s \r", 277 | "Downloaded 95% 113.15 MiB | 9.19 MiB/s \r", 278 | "Downloaded 96% 114.78 MiB | 9.33 MiB/s \r", 279 | "Downloaded 97% 114.78 MiB | 9.33 MiB/s \r", 280 | "Downloaded 98% 123.62 MiB | 10.40 MiB/s \r", 281 | "Downloaded 98% 125.96 MiB | 9.94 MiB/s \r", 282 | "Downloaded 98% 125.96 MiB | 9.94 MiB/s \r", 283 | "Downloaded 98% 136.67 MiB | 11.52 MiB/s \r", 284 | "Downloaded 99% 138.21 MiB | 10.27 MiB/s \r", 285 | "Downloaded 99% 146.75 MiB | 8.54 MiB/s \r", 286 | "Downloaded 99% 146.75 MiB | 8.54 MiB/s \r", 287 | "Downloaded 99% 153.79 MiB | 7.57 MiB/s \r", 288 | "Downloaded 100% 157.59 MiB | 6.26 MiB/s \r", 289 | "Downloaded 100% 160.82 MiB | 5.18 MiB/s \r", 290 | "Downloaded 100% 163.52 MiB | 5.21 MiB/s \r" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "my_greek_downloader.import_corpus('greek_text_first1kgreek')" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 4, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "name": "stdout", 305 | "output_type": "stream", 306 | "text": [ 307 | "total 2176\r\n", 308 | "-rw-r--r-- 1 root root 126919 Jul 13 10:05 Committing Issues using GitHub.docx\r\n", 309 | "-rwxr-xr-x 1 root root 1889 Jul 13 10:05 cselstats.pl\r\n", 310 | "drwxr-xr-x 118 root root 4096 Jul 13 10:05 data\r\n", 311 | "-rwxr-xr-x 1 root root 1955024 Jul 13 10:05 #gelasius-kg.xml#\r\n", 312 | "-rwxr-xr-x 1 root root 2414 Jul 13 10:05 greek-justwork.txt\r\n", 313 | "-rwxr-xr-x 1 root 
root 3249 Jul 13 10:05 greek.txt\r\n", 314 | "-rwxr-xr-x 1 root root 19777 Jul 13 10:05 Greek-works.txt\r\n", 315 | "-rw-r--r-- 1 root root 19125 Jul 13 10:05 license.md\r\n", 316 | "-rw-r--r-- 1 root root 58346 Jul 13 10:05 new_edition_metadata.csv\r\n", 317 | "-rw-r--r-- 1 root root 697 Jul 13 10:05 pages.sh\r\n", 318 | "-rwxr-xr-x 1 root root 1901 Jul 13 10:05 pnumber.xsl\r\n", 319 | "-rw-r--r-- 1 root root 1658 Jul 13 10:05 README.md\r\n", 320 | "drwxr-xr-x 2 root root 4096 Jul 13 10:05 save\r\n", 321 | "drwxr-xr-x 48 root root 4096 Jul 13 10:05 split\r\n", 322 | "drwxr-xr-x 2 root root 4096 Jul 13 10:05 volume_xml\r\n" 323 | ] 324 | } 325 | ], 326 | "source": [ 327 | "!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek/" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "# Convert TEI XML texts\n", 335 | "\n", 336 | "Here we'll convert the First 1K Years' Greek corpus from TEI XML to plain text." 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 3, 342 | "metadata": { 343 | "collapsed": true 344 | }, 345 | "outputs": [], 346 | "source": [ 347 | "from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 4, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "#! If you get the following error: 'Install `bs4` and `lxml` to parse these TEI files.'\n", 357 | "# then run: `pip install bs4 lxml`.\n", 358 | "\n", 359 | "onekgreek_tei_xml_to_text()" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 5, 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "name": "stdout", 369 | "output_type": "stream", 370 | "text": [ 371 | "677\r\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "# Count the converted plaintext files\n", 377 | "\n", 378 | "!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext/ | wc -l" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "# Import local corpora" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 10, 391 | "metadata": { 392 | "collapsed": true 393 | }, 394 | "outputs": [], 395 | "source": [ 396 | "my_latin_downloader.import_corpus('phi5', '~/cltk/corpora/PHI5/')" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 11, 402 | "metadata": { 403 | "collapsed": true 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "my_latin_downloader.import_corpus('phi7', '~/cltk/corpora/PHI7/')" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 7, 413 | "metadata": { 414 | "collapsed": true 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "my_greek_downloader.import_corpus('tlg', '~/cltk/corpora/TLG_E/')" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 12, 424 | "metadata": {}, 425 | "outputs": [ 426 | { 427 | "name": "stdout", 428 | "output_type": "stream", 429 | "text": [ 430 | "total 204\r\n", 431 | "drwxr-xr-x 2 kyle kyle 32768 Mar 30 2014 phi5\r\n", 432 | "drwxr-xr-x 2 kyle kyle 24576 Mar 30 2014 phi7\r\n", 433 | "drwxr-xr-x 2 kyle kyle 151552 Mar 30 2014 tlg\r\n" 434 | ] 435 | } 436 | ], 437 | "source": [ 438 | "!ls -l /home/kyle/cltk_data/originals/" 439 | ] 440 | } 441 | ], 442 | "metadata": { 443 | "kernelspec": { 444 | "display_name": "Python 3", 445 | "language": "python", 446 | "name": "python3" 447 | }, 448 | "language_info": { 449 | "codemirror_mode": { 450 | "name": "ipython", 451 | 
"version": 3 452 | }, 453 | "file_extension": ".py", 454 | "mimetype": "text/x-python", 455 | "name": "python", 456 | "nbconvert_exporter": "python", 457 | "pygments_lexer": "ipython3", 458 | "version": "3.6.4" 459 | } 460 | }, 461 | "nbformat": 4, 462 | "nbformat_minor": 1 463 | } 464 | -------------------------------------------------------------------------------- /4 Lemmatization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Thinking back to the previous examples for tokenization and lexical counting, there is an obvious shortcoming, that it does not assimilate lexically identical words with one another. For example, we may want to count \"est\" and \"sunt\" as instances of \"esse\".\n", 8 | "\n", 9 | "Lemmatization is the non-trivial process of reconciling inflected forms to their dictionary headword. The CLTK offers several methods. We'll show here one of the less sophisticated approaches. (Documentation for a new statistical method is in the works.)\n", 10 | "\n", 11 | "Note: You may have heard of stemming, which is similar in purpose, however it does not convert a word to a dictionary form, but only reduces commonly related forms into a new, unambiguous string (e.g., 'amicitia' --> 'amiciti'). This is not what we need for Greek and Latin." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Latin" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "cato_agri_praef = \"Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiverunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem civem existimarint foeneratorem quam furem, hinc licet existimare. Et virum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, verum, ut supra dixi, periculosum et calamitosum. At ex agricolis et viri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque invidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. 
Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.\"" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "# First import a repository: the CLTK data models for Latin\n", 37 | "\n", 38 | "from cltk.corpus.utils.importer import CorpusImporter\n", 39 | "corpus_importer = CorpusImporter('latin')\n", 40 | "corpus_importer.import_corpus('latin_models_cltk')" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 5, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# Replace j/v and tokenize\n", 50 | "\n", 51 | "from cltk.stem.latin.j_v import JVReplacer\n", 52 | "from cltk.tokenize.word import WordTokenizer\n", 53 | "\n", 54 | "jv_replacer = JVReplacer()\n", 55 | "cato_agri_praef = jv_replacer.replace(cato_agri_praef.lower())\n", 56 | "\n", 57 | "word_tokenizer = WordTokenizer('latin')\n", 58 | "cato_word_tokens = word_tokenizer.tokenize(cato_agri_praef.lower())\n", 59 | "cato_word_tokens = [token for token in cato_word_tokens if token not in ['.', ',', ':', ';']]" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 6, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "name": "stdout", 69 | "output_type": "stream", 70 | "text": [ 71 | "['edo1', 'interdum', 'praesto2', 'mercor', 'res', 'quaero', 'nitor1', 'tam', 'periculosus', 'sum1', 'et', 'ito', 'foenerari', 'si', 'tam', 'honestus', 'magnus', 'noster', 'sic', 'habeo', 'et', 'ito', 'in', 'lex', 'posiuerunt', 'fur', 'duplum', 'condemno', 'foeneratorem', 'quadruplus', 'quantus', 'malus', 'civis', 'existimo', 'foeneratorem', 'qui1', 'fur', 'hinc', 'liceo1', 'existimo', 'et', 'vir', 'bonus', 'cum', 'laudo', 'ito', 'laudo', 'bonus', 'agricola1', 'bonus', '-que', 'colonus', 'amplus', 'laudo', 'existimo', 'qui1', 'ito', 'laudo', 'mercator', 'autem', 'strenuus', 'studiosus', '-que', 'redeo', 'quaero', 'existimo', 'verus', 'ut', 'supra', 'dico2', 'periculosus', 'et', 'calamitosus', 'at', 'ex', 'agricola1', 'et', 'vir', 'fortis', 'et', 'milito', 'strenuus', 'gigno', 'magnus', '-que', 'pius', 'quaestus', 'stabilissimus', '-que', 'consequor', 'minimus', '-que', 'invidiosus', 'minimus', '-que', 'malus', 'cogito', 'sum1', 'qui1', 'in', 'eo1', 'studium', 'occupo', 'sum1', 'nunc', 'ut', 'ad', 'res', 'redeo', 'qui1', 'promitto', 'instituo', 'principium', 'hic', 'sum1']\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "from cltk.stem.lemma import LemmaReplacer\n", 77 | "\n", 78 | "lemmatizer = LemmaReplacer('latin')\n", 79 | "lemmata = lemmatizer.lemmatize(cato_word_tokens)\n", 80 | "print(lemmata)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 7, 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "name": "stdout", 90 | "output_type": "stream", 91 | "text": [ 92 | "['est/edo1', 'interdum/interdum', 'praestare/praesto2', 'mercaturis/mercor', 'rem/res', 'quaerere/quaero', 'nisi/nitor1', 'tam/tam', 'periculosum/periculosus', 'sit/sum1', 'et/et', 'item/ito', 'foenerari/foenerari', 'si/si', 'tam/tam', 'honestum/honestus', 'maiores/magnus', 'nostri/noster', 'sic/sic', 'habuerunt/habeo', 'et/et', 'ita/ito', 'in/in', 'legibus/lex', 'posiuerunt/posiuerunt', 'furem/fur', 'dupli/duplum', 'condemnari/condemno', 'foeneratorem/foeneratorem', 'quadrupli/quadruplus', 'quanto/quantus', 'peiorem/malus', 'ciuem/civis', 'existimarint/existimo', 'foeneratorem/foeneratorem', 'quam/qui1', 'furem/fur', 'hinc/hinc', 'licet/liceo1', 'existimare/existimo', 'et/et', 'uirum/vir', 'bonum/bonus', 'quom/cum', 
'laudabant/laudo', 'ita/ito', 'laudabant/laudo', 'bonum/bonus', 'agricolam/agricola1', 'bonum/bonus', '-que/-que', 'colonum/colonus', 'amplissime/amplus', 'laudari/laudo', 'existimabatur/existimo', 'qui/qui1', 'ita/ito', 'laudabatur/laudo', 'mercatorem/mercator', 'autem/autem', 'strenuum/strenuus', 'studiosum/studiosus', '-que/-que', 'rei/redeo', 'quaerendae/quaero', 'existimo/existimo', 'uerum/verus', 'ut/ut', 'supra/supra', 'dixi/dico2', 'periculosum/periculosus', 'et/et', 'calamitosum/calamitosus', 'at/at', 'ex/ex', 'agricolis/agricola1', 'et/et', 'uiri/vir', 'fortissimi/fortis', 'et/et', 'milites/milito', 'strenuissimi/strenuus', 'gignuntur/gigno', 'maxime/magnus', '-que/-que', 'pius/pius', 'quaestus/quaestus', 'stabilissimus/stabilissimus', '-que/-que', 'consequitur/consequor', 'minime/minimus', '-que/-que', 'inuidiosus/invidiosus', 'minime/minimus', '-que/-que', 'male/malus', 'cogitantes/cogito', 'sunt/sum1', 'qui/qui1', 'in/in', 'eo/eo1', 'studio/studium', 'occupati/occupo', 'sunt/sum1', 'nunc/nunc', 'ut/ut', 'ad/ad', 'rem/res', 'redeam/redeo', 'quod/qui1', 'promisi/promitto', 'institutum/instituo', 'principium/principium', 'hoc/hic', 'erit/sum1']\n" 93 | ] 94 | } 95 | ], 96 | "source": [ 97 | "# Now we do the same but also return the original form\n", 98 | "# This is useful for checking accuracy\n", 99 | "\n", 100 | "lemmata_orig = lemmatizer.lemmatize(cato_word_tokens, return_raw=True)\n", 101 | "print(lemmata_orig)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 8, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "name": "stdout", 111 | "output_type": "stream", 112 | "text": [ 113 | "115\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "# Let's count again\n", 119 | "\n", 120 | "# Count all words\n", 121 | "\n", 122 | "print(len(lemmata))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 10, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "73\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "# Count unique words\n", 140 | "\n", 141 | "print(len(set(lemmata)))" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 11, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "0.6347826086956522\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "# Finally, measure lexical diversity, using lemmata\n", 159 | "\n", 160 | "print(len(set(lemmata)) / len(lemmata))" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "Greek" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 2, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "athenaeus_incipit = \"Ἀθήναιος μὲν ὁ τῆς βίβλου πατήρ· ποιεῖται δὲ τὸν λόγον πρὸς Τιμοκράτην· Δειπνοσοφιστὴς δὲ ταύτῃ τὸ ὄνομα. Ὑπόκειται δὲ τῷ λόγῳ Λαρήνσιος Ῥωμαῖος, ἀνὴρ τῇ τύχῃ περιφανής, τοὺς κατὰ πᾶσαν παιδείαν ἐμπειροτάτους ἐν αὑτοῦ δαιτυμόνας ποιούμενος· ἐν οἷς οὐκ ἔσθ᾽ οὗτινος τῶν καλλίστων οὐκ ἐμνημόνευσεν. Ἰχθῦς τε γὰρ τῇ βίβλῳ ἐνέθετο καὶ τὰς τούτων χρείας καὶ τὰς τῶν ὀνομάτων ἀναπτύξεις καὶ λαχάνων γένη παντοῖα καὶ ζῴων παντοδαπῶν καὶ ἄνδρας ἱστορίας συγγεγραφότας καὶ ποιητὰς καὶ φιλοσόφους καὶ ὄργανα μουσικὰ καὶ σκωμμάτων εἴδη μυρία καὶ ἐκπωμάτων διαφορὰς καὶ πλούτους βασιλέων διηγήσατο καὶ νηῶν μεγέθη καὶ ὅσα ἄλλα οὐδ᾽ ἂν εὐχερῶς ἀπομνημονεύσαιμι, ἢ ἐπιλίποι μ᾽ ἂν ἡ ἡμέρα κατ᾽ εἶδος διεξερχόμενον. 
Καί ἐστιν ἡ τοῦ λόγου οἰκονομία μίμημα τῆς τοῦ δείπνου πολυτελείας καὶ ἡ τῆς βίβλου διασκευὴ τῆς ἐν τῷ δείπνῳ παρασκευῆς. Τοιοῦτον ὁ θαυμαστὸς οὗτος τοῦ λόγου οἰκονόμος Ἀθήναιος ἥδιστον λογόδειπνον εἰσηγεῖται κρείττων τε αὐτὸς ἑαυτοῦ γινόμενος, ὥσπερ οἱ Ἀθήνησι ῥήτορες, ὑπὸ τῆς ἐν τῷ λέγειν θερμότητος πρὸς τὰ ἑπόμενα τῆς βίβλου βαθμηδὸν ὑπεράλλεται.\"" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 3, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "from cltk.corpus.utils.importer import CorpusImporter\n", 186 | "corpus_importer = CorpusImporter('greek')\n", 187 | "corpus_importer.import_corpus('greek_models_cltk')\n", 188 | "\n", 189 | "from cltk.tokenize.word import WordTokenizer\n", 190 | "word_tokenizer = WordTokenizer('greek')\n", 191 | "athenaeus_word_tokens = word_tokenizer.tokenize(athenaeus_incipit.lower())\n", 192 | "athenaeus_word_tokens = [token for token in athenaeus_word_tokens if token not in ['.', ',', ':', ';']]\n", 193 | "\n", 194 | "from cltk.stem.lemma import LemmaReplacer\n", 195 | "lemmatizer = LemmaReplacer('greek')\n", 196 | "lemmata = lemmatizer.lemmatize(athenaeus_word_tokens)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 4, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "name": "stdout", 206 | "output_type": "stream", 207 | "text": [ 208 | "['ἀθήναιος', 'μὲν', 'ὁ', 'ὁ', 'βίβλος', 'πατήρ·', 'ποιέω', 'δὲ', 'τὸν', 'λόγος', 'πρὸς', 'τιμοκράτην·', 'δειπνοσοφιστὴς', 'δὲ', 'οὗτος', 'τὸ', 'ὄνομα', 'ὑπόκειμαι', 'δὲ', 'ὁ', 'λόγος', 'λαρήνσιος', 'ῥωμαῖος', 'ἀνὴρ', 'ὁ', 'τυγχάνω', 'περιφανής', 'τοὺς', 'κατὰ', 'πᾶς', 'παιδεία', 'ἔμπειρος', 'ἐν', 'ἑαυτοῦ', 'δαιτυμών', 'ποιούμενος·', 'ἐν', 'ὅς', 'οὐ', 'ἔσθ᾽', 'ὅστις', 'ὁ', 'καλός', 'οὐ', 'μνημονεύω', 'ἰχθύς', 'τε', 'γὰρ', 'ὁ', 'βίβλος', 'ἐντίθημι', 'καὶ', 'τὰς', 'οὗτος', 'χρεία', 'καὶ', 'τὰς', 'ὁ', 'ὄνομα', 'ἀναπτύσσω', 'καὶ', 'λάχανον', 'γένος', 'παντοῖος', 'καὶ', 'ζωιόω', 'παντοδαπός', 'καὶ', 'ἀνήρ', 'ἱστορία', 'συγγράφω', 'καὶ', 'ποιητὰς', 'καὶ', 'φιλόσοφος', 'καὶ', 'ὀργαίνω', 'μουσικὰ', 'καὶ', 'σκῶμμα', 'εἶδος', 'μυρίος', 'καὶ', 'ἔκπωμα', 'διαφορὰς', 'καὶ', 'πλοῦτος', 'βασιλίς', 'διηγέομαι', 'καὶ', 'ναῦς', 'μέγεθος', 'καὶ', 'ὅσος', 'ἄλλος', 'οὐδ᾽', 'ἂν', 'εὐχερής', 'ἀπομνημονεύω', 'ἢ', 'ἐπιλείπω', 'μ᾽', 'ἂν', 'ὁ', 'ἥμερος', 'κατ᾽', 'εἶδος', 'διεξέρχομαι', 'καί', 'εἰμί', 'ὁ', 'ὁ', 'λογόω', 'οἰκονομία', 'μίμημα', 'ὁ', 'ὁ', 'δεῖπνος', 'πολυτέλεια', 'καὶ', 'ὁ', 'ὁ', 'βίβλος', 'διασκευὴ', 'ὁ', 'ἐν', 'ὁ', 'δεῖπνος', 'παρασκευάζω', 'τοιοῦτος', 'ὁ', 'θαυμαστὸς', 'οὗτος', 'ὁ', 'λογόω', 'οἰκονόμος', 'ἀθήναιος', 'ἡδύς', 'λογόδειπνον', 'εἰσηγέομαι', 'κρείσσων', 'τε', 'αὐτὸς', 'ἑαυτοῦ', 'γίγνομαι', 'ὥσπερ', 'ὁ', 'ἀθήνευς', 'ῥήτωρ', 'ὑπὸ', 'ὁ', 'ἐν', 'ὁ', 'λέγω1', 'θερμότης', 'πρὸς', 'τὰ', 'ἕπομαι', 'ὁ', 'βίβλος', 'βαθμηδὸν', 'ὑπεράλλομαι']\n" 209 | ] 210 | } 211 | ], 212 | "source": [ 213 | "print(lemmata)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 5, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | "['ἀθήναιος/ἀθήναιος', 'μὲν/μὲν', 'ὁ/ὁ', 'τῆς/ὁ', 'βίβλου/βίβλος', 'πατήρ·/πατήρ·', 'ποιεῖται/ποιέω', 'δὲ/δὲ', 'τὸν/τὸν', 'λόγον/λόγος', 'πρὸς/πρὸς', 'τιμοκράτην·/τιμοκράτην·', 'δειπνοσοφιστὴς/δειπνοσοφιστὴς', 'δὲ/δὲ', 'ταύτῃ/οὗτος', 'τὸ/τὸ', 'ὄνομα/ὄνομα', 'ὑπόκειται/ὑπόκειμαι', 'δὲ/δὲ', 'τῷ/ὁ', 'λόγῳ/λόγος', 'λαρήνσιος/λαρήνσιος', 'ῥωμαῖος/ῥωμαῖος', 'ἀνὴρ/ἀνὴρ', 'τῇ/ὁ', 'τύχῃ/τυγχάνω', 'περιφανής/περιφανής', 'τοὺς/τοὺς', 'κατὰ/κατὰ', 
'πᾶσαν/πᾶς', 'παιδείαν/παιδεία', 'ἐμπειροτάτους/ἔμπειρος', 'ἐν/ἐν', 'αὑτοῦ/ἑαυτοῦ', 'δαιτυμόνας/δαιτυμών', 'ποιούμενος·/ποιούμενος·', 'ἐν/ἐν', 'οἷς/ὅς', 'οὐκ/οὐ', 'ἔσθ᾽/ἔσθ᾽', 'οὗτινος/ὅστις', 'τῶν/ὁ', 'καλλίστων/καλός', 'οὐκ/οὐ', 'ἐμνημόνευσεν/μνημονεύω', 'ἰχθῦς/ἰχθύς', 'τε/τε', 'γὰρ/γὰρ', 'τῇ/ὁ', 'βίβλῳ/βίβλος', 'ἐνέθετο/ἐντίθημι', 'καὶ/καὶ', 'τὰς/τὰς', 'τούτων/οὗτος', 'χρείας/χρεία', 'καὶ/καὶ', 'τὰς/τὰς', 'τῶν/ὁ', 'ὀνομάτων/ὄνομα', 'ἀναπτύξεις/ἀναπτύσσω', 'καὶ/καὶ', 'λαχάνων/λάχανον', 'γένη/γένος', 'παντοῖα/παντοῖος', 'καὶ/καὶ', 'ζῴων/ζωιόω', 'παντοδαπῶν/παντοδαπός', 'καὶ/καὶ', 'ἄνδρας/ἀνήρ', 'ἱστορίας/ἱστορία', 'συγγεγραφότας/συγγράφω', 'καὶ/καὶ', 'ποιητὰς/ποιητὰς', 'καὶ/καὶ', 'φιλοσόφους/φιλόσοφος', 'καὶ/καὶ', 'ὄργανα/ὀργαίνω', 'μουσικὰ/μουσικὰ', 'καὶ/καὶ', 'σκωμμάτων/σκῶμμα', 'εἴδη/εἶδος', 'μυρία/μυρίος', 'καὶ/καὶ', 'ἐκπωμάτων/ἔκπωμα', 'διαφορὰς/διαφορὰς', 'καὶ/καὶ', 'πλούτους/πλοῦτος', 'βασιλέων/βασιλίς', 'διηγήσατο/διηγέομαι', 'καὶ/καὶ', 'νηῶν/ναῦς', 'μεγέθη/μέγεθος', 'καὶ/καὶ', 'ὅσα/ὅσος', 'ἄλλα/ἄλλος', 'οὐδ᾽/οὐδ᾽', 'ἂν/ἂν', 'εὐχερῶς/εὐχερής', 'ἀπομνημονεύσαιμι/ἀπομνημονεύω', 'ἢ/ἢ', 'ἐπιλίποι/ἐπιλείπω', 'μ᾽/μ᾽', 'ἂν/ἂν', 'ἡ/ὁ', 'ἡμέρα/ἥμερος', 'κατ᾽/κατ᾽', 'εἶδος/εἶδος', 'διεξερχόμενον/διεξέρχομαι', 'καί/καί', 'ἐστιν/εἰμί', 'ἡ/ὁ', 'τοῦ/ὁ', 'λόγου/λογόω', 'οἰκονομία/οἰκονομία', 'μίμημα/μίμημα', 'τῆς/ὁ', 'τοῦ/ὁ', 'δείπνου/δεῖπνος', 'πολυτελείας/πολυτέλεια', 'καὶ/καὶ', 'ἡ/ὁ', 'τῆς/ὁ', 'βίβλου/βίβλος', 'διασκευὴ/διασκευὴ', 'τῆς/ὁ', 'ἐν/ἐν', 'τῷ/ὁ', 'δείπνῳ/δεῖπνος', 'παρασκευῆς/παρασκευάζω', 'τοιοῦτον/τοιοῦτος', 'ὁ/ὁ', 'θαυμαστὸς/θαυμαστὸς', 'οὗτος/οὗτος', 'τοῦ/ὁ', 'λόγου/λογόω', 'οἰκονόμος/οἰκονόμος', 'ἀθήναιος/ἀθήναιος', 'ἥδιστον/ἡδύς', 'λογόδειπνον/λογόδειπνον', 'εἰσηγεῖται/εἰσηγέομαι', 'κρείττων/κρείσσων', 'τε/τε', 'αὐτὸς/αὐτὸς', 'ἑαυτοῦ/ἑαυτοῦ', 'γινόμενος/γίγνομαι', 'ὥσπερ/ὥσπερ', 'οἱ/ὁ', 'ἀθήνησι/ἀθήνευς', 'ῥήτορες/ῥήτωρ', 'ὑπὸ/ὑπὸ', 'τῆς/ὁ', 'ἐν/ἐν', 'τῷ/ὁ', 'λέγειν/λέγω1', 'θερμότητος/θερμότης', 'πρὸς/πρὸς', 'τὰ/τὰ', 'ἑπόμενα/ἕπομαι', 'τῆς/ὁ', 'βίβλου/βίβλος', 'βαθμηδὸν/βαθμηδὸν', 'ὑπεράλλεται/ὑπεράλλομαι']\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "lemmata_orig = lemmatizer.lemmatize(athenaeus_word_tokens, return_raw=True)\n", 231 | "print(lemmata_orig)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 6, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "name": "stdout", 241 | "output_type": "stream", 242 | "text": [ 243 | "162\n" 244 | ] 245 | } 246 | ], 247 | "source": [ 248 | "print(len(lemmata))" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 7, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "name": "stdout", 258 | "output_type": "stream", 259 | "text": [ 260 | "106\n" 261 | ] 262 | } 263 | ], 264 | "source": [ 265 | "print(len(set(lemmata)))" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 8, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "0.654320987654321\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "print(len(set(lemmata)) / len(lemmata))" 283 | ] 284 | } 285 | ], 286 | "metadata": { 287 | "kernelspec": { 288 | "display_name": "Python 3", 289 | "language": "python", 290 | "name": "python3" 291 | }, 292 | "language_info": { 293 | "codemirror_mode": { 294 | "name": "ipython", 295 | "version": 3 296 | }, 297 | "file_extension": ".py", 298 | "mimetype": "text/x-python", 299 | "name": "python", 300 | "nbconvert_exporter": "python", 301 | "pygments_lexer": "ipython3", 302 | 
"version": "3.6.4" 303 | } 304 | }, 305 | "nbformat": 4, 306 | "nbformat_minor": 1 307 | } 308 | -------------------------------------------------------------------------------- /5 Text reuse.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Latin" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Load all of Cicero's De divinatione\n", 17 | "\n", 18 | "import os\n", 19 | "\n", 20 | "div1_fp = os.path.expanduser('~/cltk_data/latin/text/latin_text_latin_library/cicero/divinatione1.txt')\n", 21 | "div2_fp = os.path.expanduser('~/cltk_data/latin/text/latin_text_latin_library/cicero/divinatione2.txt')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "with open(div1_fp) as fo:\n", 31 | " div1 = fo.read()\n", 32 | "\n", 33 | "with open(div2_fp) as fo:\n", 34 | " div2 = fo.read()" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/plain": [ 45 | "'126 127 128 129 130 131 132 \\n \\n\\n \\n\\n\\n \\nI 1 Vetus opinio est iam usque ab heroicis ducta temporibus, eaque et populi Romani et omnium gentium firmata consensu, versari quandam inter homines divinationem, quam Graeci mantikh/n appellant, id est praesensionem et scientiam rerum futurarum. Magnifica quaedam res et salutaris, si modo est ulla, quaque proxime ad deorum vim natura mortalis possit accedere. Itaque ut alia nos melius multa quam Graeci, sic huic praestantissimae rei nomen nostri a divis, Graeci, ut Plato interpretatur, a furore duxerunt. 2 Gentem quidem nullam video neque tam humanam atque doctam neque tam immanem tamque barbaram, quae non significari futura et a quibusdam intellegi praedicique posse censeat. 
Principio Assyrii, ut ab ultimis auctoritatem repetam, propter planitiam magnitudinemque regionum quas incolebant, cum caelum ex omni parte patens atque apertum intuerentur, traiectiones motusque stellarum observitaverunt, quibus notati, quid cuique significaretur memoriae p'" 46 | ] 47 | }, 48 | "execution_count": 3, 49 | "metadata": {}, 50 | "output_type": "execute_result" 51 | } 52 | ], 53 | "source": [ 54 | "div1[500:1500]" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 4, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "# We will calculate the Levenstein distance\n", 64 | "# See http://docs.cltk.org/en/latest/multilingual.html#text-reuse\n", 65 | "\n", 66 | "from cltk.text_reuse.levenshtein import Levenshtein" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 5, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "data": { 76 | "text/plain": [ 77 | "0.44" 78 | ] 79 | }, 80 | "execution_count": 5, 81 | "metadata": {}, 82 | "output_type": "execute_result" 83 | } 84 | ], 85 | "source": [ 86 | "lev_dist = Levenshtein()\n", 87 | "lev_dist.ratio(div1, div2)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 6, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "multa quoque et bello passus\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "# Find the longest common substring\n", 105 | "# This can take some time!\n", 106 | "\n", 107 | "from cltk.text_reuse.comparison import long_substring\n", 108 | "\n", 109 | "# Aen 1.1-6\n", 110 | "aen = \"\"\"arma virumque cano, Troiae qui primus ab oris\n", 111 | "Italiam, fato profugus, Laviniaque venit\n", 112 | "litora, multum ille et terris iactatus et alto\n", 113 | "vi superum saevae memorem Iunonis ob iram;\n", 114 | "multa quoque et bello passus, dum conderet urbem, 5\n", 115 | "inferretque deos Latio, genus unde Latinum,\n", 116 | "Albanique patres, atque altae moenia Romae.\"\"\"\n", 117 | "\n", 118 | "# Servius 1.1\n", 119 | "serv = \"\"\"arma multi varie disserunt cur ab armis Vergilius coeperit, omnes tamen inania sentire manifestum est, cum eum constet aliunde sumpsisse principium, sicut in praemissa eius vita monstratum est. per 'arma' autem bellum significat, et est tropus metonymia. nam arma quibus in bello utimur pro bello posuit, sicut toga qua in pace utimur pro pace ponitur, ut Cicero “cedant arma togae” , id est bellum paci. alii ideo 'arma' hoc loco proprie dicta accipiunt, primo quod fuerint victricia, secundo quod divina, tertio quod prope semper armis virum subiungit, ut “arma virumque ferens” et “arma acri facienda viro” . arma virumque figura usitata est ut non eo ordine respondeamus quo proposuimus; nam prius de erroribus Aeneae dicit, post de bello. hac autem figura etiam in prosa utimur. sic Cicero in Verrinis “nam sine ullo sumptu nostro coriis, tunicis frumentoque suppeditato maximos exercitus nostros vestivit, aluit, armavit” . non nulli autem hyperbaton putant, ut sit sensus talis 'arma virumque cano, genus unde Latinum Albanique patres atque altae moenia Romae', mox illa revoces 'Troiae qui primus ab oris'; sic enim causa operis declaratur, cur cogentibus fatis in Latium venerit. et est poeticum principium professivum 'arma virumque cano', invocativum 'Musa mihi causas memora', narrativum 'urbs antiqua fuit'. 
et professivum quattuor modis sumpsit: a duce 'arma virumque cano', ab itinere 'Troiae qui primus ab oris', a bello 'multa quoque et bello passus', a generis successu 'genus unde Latinum'. virum quem non dicit, sed circumstantiis ostendit Aeneam. et bene addidit post 'arma' 'virum', quia arma possunt et aliarum artium instrumenta dici, ut “Cerealiaque arma” . cano polysemus sermo est. tria enim significat: aliquando laudo, ut “regemque canebant” ; aliquando divino, ut “ipsa canas oro” ; aliquando canto, ut in hoc loco. nam proprie canto significat, quia cantanda sunt carmina. Troiae Troia regio est Asiae, Ilium civitas Troiae; plerumque tamen usurpant poetae et pro civitate vel regionem vel provinciam ponunt, ut Iuvenalis “et flammis Asiam ferroque cadentem” . Probus ait Troiam Graios et Aiax non debere per unam i scribi. qui primus quaerunt multi, cur Aeneam primum ad Italiam venisse dixerit, cum paulo post dicat Antenorem ante adventum Aeneae fundasse civitatem. constat quidem, sed habita temporum ratione peritissime Vergilius dixit. namque illo tempore, quo Aeneas ad Italiam venit, finis erat Italiae usque ad Rubiconem fluvium: cuius rei meminit Lucanus et Gallica certus limes ab Ausoniis disterminat arva colonis. unde apparet Antenorem non ad Italiam venisse, sed ad Galliam cisalpinam, in qua Venetia est. postea vero promotis usque ad Alpes Italiae finibus, novitas creavit errorem. plerique tamen quaestionem hanc volunt ex sequentibus solvi, ut videatur ob hoc addidisse Vergilius 'ad Lavina litora', ne significaret Antenorem. melior tamen est superior expositio. primus [ergo] non ante quem nemo, sed post quem nullus, “tuque o, cui prima furentem fundit equum magno tellus percussa tridenti et hic mihi responsum primus dedit” . vel laudative 'primus', ut “primam qui legibus urbem fundabit, Curibus parvis” . ab oris speciem pro genere; nam oras terras generaliter debemus accipere. sane praepositionem mutavit, nam 'ex oris' melius potuit dicere.\n", 120 | "\"\"\"\n", 121 | "\n", 122 | "print(long_substring(aen, serv))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "0.6717661057283699\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "from cltk.text_reuse.comparison import minhash\n", 140 | "\n", 141 | "print(minhash(div1,div2))" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 8, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "0.4822404371584699\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "# Try with texts by different authors, adding Apuleius' Apologia\n", 159 | "\n", 160 | "ap_fp = os.path.expanduser('~/cltk_data/latin/text/latin_text_latin_library/apuleius/apuleius.apol.txt')\n", 161 | "\n", 162 | "with open(ap_fp) as fo:\n", 163 | " ap = fo.read()\n", 164 | "\n", 165 | "print(minhash(ap, div1))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "# Greek" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "For Greek, we'll use the two books of Plutarch's De esu carnium." 
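The next cells load both books from the Lacus Curtius corpus and compare them with the same measures used for Latin above. Those steps can also be wrapped in one small helper; this is only a sketch (the function name is ours), built from calls already used in this notebook:

``` python
import os

from cltk.text_reuse.levenshtein import Levenshtein
from cltk.text_reuse.comparison import long_substring, minhash

def compare_texts(path_a, path_b, substring=False):
    """Print the text-reuse measures used in this notebook for two plaintext files."""
    with open(os.path.expanduser(path_a)) as fo:
        text_a = fo.read()
    with open(os.path.expanduser(path_b)) as fo:
        text_b = fo.read()
    print('Levenshtein ratio:', Levenshtein().ratio(text_a, text_b))
    print('MinHash similarity:', minhash(text_a, text_b))
    if substring:  # the longest-common-substring search can take a while on long texts
        print('Longest common substring:', long_substring(text_a, text_b))

compare_texts('~/cltk_data/greek/text/greek_text_lacus_curtius/plain/Plutarch/De_esu_carnium/1.txt',
              '~/cltk_data/greek/text/greek_text_lacus_curtius/plain/Plutarch/De_esu_carnium/2.txt')
```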
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 10, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "'ΠΕΡΙ ΣΑΡΚΟΦΑΓΙΑΣ\\n\\t\\nΛΟΓΟΣ Α´\\n\\t\\n\\n\\n(993)\\n1\\n\\nἈλλὰ σὺ μὲν ἐρῶτᾷς τίνι λόγῳ Πυθαγόρας ἀπείχετο σαρκοφαγίας; ἐγὼ δὲ θαυμάζω καὶ τίνι πάθει καὶ ποίᾳ ψυχῇ ἢ λόγῳ Bὁ πρῶτος ἄνθρωπος ἥψατο φόνου στόματι καὶ τεθνηκότος ζῴου χείλεσι προσήψατο σαρκὸς καὶ νεκρῶν σωμάτων καὶ ἐώλων1 προθέμενος τραπέζας ὄψα καὶ τροφὰς2 προσεῖπεν3 τὰ μικρὸν ἔμπροσθεν βρυχώμενα μέρη καὶ φθεγγόμενα καὶ κινούμενα καὶ βλέποντα. πῶς ἡ ὄψις ὑπέμεινε τὸν φόνον σφαζομένων δερομένων διαμελιζομένων, πῶς ἡ ὄσφρησις ἤνεγκε τὴν ἀποφοράν, πῶς τὴν γεῦσιν οὐκ ἀπέτρεψεν ὁ μολυσμὸς ἑλκῶν ψαύουσαν ἀλλοτρίων καὶ τραυμάτων θανασίμων χυμοὺς καὶ ἰχῶρας ἀπολαμβάνουσαν;4\\n\\n\\n\\n\\nCεἷρπον μὲν ῥινοί, κρέα δ᾽ ἀμφ᾽ ὀβελοῖς ἐμεμύκει\\n\\t\\t\\t\\nὀπταλέα τε καὶ ὠμά, βοῶν δ᾽ ὡς γίγνετο φωνή·\\n\\t\\t\\n\\n\\n\\nτοῦτο μὲν5 πλάσμα καὶ μῦθός ἐστι, τὸ δέ γε δεῖπνον ἀληθῶς τερατῶδες, πεινῆν τινα τῶν μυκωμένων\\n\\n\\n\\np542ἔτι6 διδάσκοντα ὑφ᾽ ὧν δεῖ τρέφεσθαι ζώντων ἔτι καὶ λαλούντων καὶ7 διαταττόμενον ἀρτύσεις τινὰς καὶ ὀπτήσεις καὶ παραθέσεις· τούτων8 ἔδει ζητεῖν τὸν πρῶτον ἀρξάμενον οὐ τὸν ὀψὲ παυσάμενον.\\n\\n\\t\\n2\\n\\nἪ τοῖς μὲν πρώτοις ἐκείνοις ἐπιχειρήσασι σαρκοφαγεῖν τὴν αἰτίαν εἴποι'" 191 | ] 192 | }, 193 | "execution_count": 10, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "import os\n", 200 | "\n", 201 | "carn1_fp = os.path.expanduser('~/cltk_data/greek/text/greek_text_lacus_curtius/plain/Plutarch/De_esu_carnium/1.txt')\n", 202 | "carn2_fp = os.path.expanduser('~/cltk_data/greek/text/greek_text_lacus_curtius/plain/Plutarch/De_esu_carnium/2.txt')\n", 203 | "\n", 204 | "with open(carn1_fp) as fo:\n", 205 | " carn1 = fo.read()\n", 206 | "\n", 207 | "with open(carn2_fp) as fo:\n", 208 | " carn2 = fo.read()\n", 209 | "\n", 210 | "carn1[908:2001]" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 11, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "data": { 220 | "text/plain": [ 221 | "0.46" 222 | ] 223 | }, 224 | "execution_count": 11, 225 | "metadata": {}, 226 | "output_type": "execute_result" 227 | } 228 | ], 229 | "source": [ 230 | "from cltk.text_reuse.levenshtein import Levenshtein\n", 231 | "lev_dist = Levenshtein()\n", 232 | "lev_dist.ratio(carn1, carn2)" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 12, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "0.43631915182819087\n" 245 | ] 246 | } 247 | ], 248 | "source": [ 249 | "from cltk.text_reuse.comparison import minhash\n", 250 | "print(minhash(carn1,carn2))" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 16, 256 | "metadata": {}, 257 | "outputs": [ 258 | { 259 | "name": "stdout", 260 | "output_type": "stream", 261 | "text": [ 262 | "0.2874497799885211\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "# Compare with Book 1 of Oppian's Cynegetica\n", 268 | "# We're comparing prose with verse: the difference is clear\n", 269 | "\n", 270 | "cyn_fp = os.path.expanduser('~/cltk_data/greek/text/greek_text_lacus_curtius/plain/Oppian/Cynegetica/1*.txt')\n", 271 | "with open(cyn_fp) as fo:\n", 272 | " cyn = fo.read()\n", 273 | "\n", 274 | "print(minhash(cyn, carn1))" 275 | ] 276 | } 277 | ], 278 | "metadata": { 279 | "kernelspec": { 280 | "display_name": "Python 3", 281 | 
"language": "python", 282 | "name": "python3" 283 | }, 284 | "language_info": { 285 | "codemirror_mode": { 286 | "name": "ipython", 287 | "version": 3 288 | }, 289 | "file_extension": ".py", 290 | "mimetype": "text/x-python", 291 | "name": "python", 292 | "nbconvert_exporter": "python", 293 | "pygments_lexer": "ipython3", 294 | "version": "3.6.4" 295 | } 296 | }, 297 | "nbformat": 4, 298 | "nbformat_minor": 2 299 | } 300 | -------------------------------------------------------------------------------- /7 Syllabification, prosody, phonetics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Latin syllables" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stdout", 17 | "output_type": "stream", 18 | "text": [ 19 | "['est', 'interdum', 'praestare', 'mercaturis', 'rem', 'quaerere', 'nisi', 'tam', 'periculosum', 'sit', 'et', 'item', 'foenerari', 'si', 'tam', 'honestum', 'maiores', 'nostri', 'sic', 'habuerunt', 'et', 'ita', 'in', 'legibus', 'posiverunt', 'furem', 'dupli', 'condemnari', 'foeneratorem', 'quadrupli', 'quanto', 'peiorem', 'civem', 'existimarint', 'foeneratorem', 'quam', 'furem', 'hinc', 'licet', 'existimare', 'et', 'virum', 'bonum', 'quom', 'laudabant', 'ita', 'laudabant', 'bonum', 'agricolam', 'bonum', '-que', 'colonum', 'amplissime', 'laudari', 'existimabatur', 'qui', 'ita', 'laudabatur', 'mercatorem', 'autem', 'strenuum', 'studiosum', '-que', 'rei', 'quaerendae', 'existimo', 'verum', 'ut', 'supra', 'dixi', 'periculosum', 'et', 'calamitosum', 'at', 'ex', 'agricolis', 'et', 'viri', 'fortissimi', 'et', 'milites', 'strenuissimi', 'gignuntur', 'maxime', '-que', 'pius', 'quaestus', 'stabilissimus', '-que', 'consequitur', 'minime', '-que', 'invidiosus', 'minime', '-que', 'male', 'cogitantes', 'sunt', 'qui', 'in', 'eo', 'studio', 'occupati', 'sunt', 'nunc', 'ut', 'ad', 'rem', 'redeam', 'quod', 'promisi', 'institutum', 'principium', 'hoc', 'erit']\n" 20 | ] 21 | } 22 | ], 23 | "source": [ 24 | "# See http://docs.cltk.org/en/latest/latin.html#syllabifier\n", 25 | "\n", 26 | "from cltk.stem.latin.syllabifier import Syllabifier\n", 27 | "\n", 28 | "cato_agri_praef = \"Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiverunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem civem existimarint foeneratorem quam furem, hinc licet existimare. Et virum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, verum, ut supra dixi, periculosum et calamitosum. At ex agricolis et viri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque invidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. 
Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.\"\n", 29 | "\n", 30 | "from cltk.tokenize.word import WordTokenizer\n", 31 | "word_tokenizer = WordTokenizer('latin')\n", 32 | "cato_cltk_word_tokens = word_tokenizer.tokenize(cato_agri_praef.lower())\n", 33 | "cato_cltk_word_tokens_no_punt = [token for token in cato_cltk_word_tokens if token not in ['.', ',', ':', ';']]\n", 34 | "\n", 35 | "# Now you can see the word \"-que\"\n", 36 | "\n", 37 | "print(cato_cltk_word_tokens_no_punt)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "est ['est']\n", 50 | "interdum ['in', 'ter', 'dum']\n", 51 | "praestare ['praes', 'ta', 're']\n", 52 | "mercaturis ['mer', 'ca', 'tu', 'ris']\n", 53 | "rem ['rem']\n", 54 | "quaerere ['quae', 're', 're']\n", 55 | "nisi ['ni', 'si']\n", 56 | "tam ['tam']\n", 57 | "periculosum ['pe', 'ri', 'cu', 'lo', 'sum']\n", 58 | "sit ['sit']\n", 59 | "et ['et']\n", 60 | "item ['i', 'tem']\n", 61 | "foenerari ['foe', 'ne', 'ra', 'ri']\n", 62 | "si ['si']\n", 63 | "tam ['tam']\n", 64 | "honestum ['ho', 'nes', 'tum']\n", 65 | "maiores ['ma', 'io', 'res']\n", 66 | "nostri ['nos', 'tri']\n", 67 | "sic ['sic']\n", 68 | "habuerunt ['ha', 'bu', 'e', 'runt']\n", 69 | "et ['et']\n", 70 | "ita ['i', 'ta']\n", 71 | "in ['in']\n", 72 | "legibus ['le', 'gi', 'bus']\n", 73 | "posiverunt ['po', 'si', 've', 'runt']\n", 74 | "furem ['fu', 'rem']\n", 75 | "dupli ['du', 'pli']\n", 76 | "condemnari ['con', 'dem', 'na', 'ri']\n", 77 | "foeneratorem ['foe', 'ne', 'ra', 'to', 'rem']\n", 78 | "quadrupli ['qua', 'dru', 'pli']\n", 79 | "quanto ['quan', 'to']\n", 80 | "peiorem ['peio', 'rem']\n", 81 | "civem ['ci', 'vem']\n", 82 | "existimarint ['ex', 'is', 'ti', 'ma', 'rint']\n", 83 | "foeneratorem ['foe', 'ne', 'ra', 'to', 'rem']\n", 84 | "quam ['quam']\n", 85 | "furem ['fu', 'rem']\n", 86 | "hinc ['hinc']\n", 87 | "licet ['li', 'cet']\n", 88 | "existimare ['ex', 'is', 'ti', 'ma', 're']\n", 89 | "et ['et']\n", 90 | "virum ['vi', 'rum']\n", 91 | "bonum ['bo', 'num']\n", 92 | "quom ['quom']\n", 93 | "laudabant ['lau', 'da', 'bant']\n", 94 | "ita ['i', 'ta']\n", 95 | "laudabant ['lau', 'da', 'bant']\n", 96 | "bonum ['bo', 'num']\n", 97 | "agricolam ['a', 'gri', 'co', 'lam']\n", 98 | "bonum ['bo', 'num']\n", 99 | "-que ['-que']\n", 100 | "colonum ['co', 'lo', 'num']\n", 101 | "amplissime ['am', 'plis', 'si', 'me']\n", 102 | "laudari ['lau', 'da', 'ri']\n", 103 | "existimabatur ['ex', 'is', 'ti', 'ma', 'ba', 'tur']\n", 104 | "qui ['qui']\n", 105 | "ita ['i', 'ta']\n", 106 | "laudabatur ['lau', 'da', 'ba', 'tur']\n", 107 | "mercatorem ['mer', 'ca', 'to', 'rem']\n", 108 | "autem ['au', 'tem']\n", 109 | "strenuum ['stre', 'nu', 'um']\n", 110 | "studiosum ['stu', 'di', 'o', 'sum']\n", 111 | "-que ['-que']\n", 112 | "rei ['rei']\n", 113 | "quaerendae ['quae', 'ren', 'dae']\n", 114 | "existimo ['ex', 'is', 'ti', 'mo']\n", 115 | "verum ['ve', 'rum']\n", 116 | "ut ['ut']\n", 117 | "supra ['su', 'pra']\n", 118 | "dixi ['di', 'xi']\n", 119 | "periculosum ['pe', 'ri', 'cu', 'lo', 'sum']\n", 120 | "et ['et']\n", 121 | "calamitosum ['ca', 'la', 'mi', 'to', 'sum']\n", 122 | "at ['at']\n", 123 | "ex ['ex']\n", 124 | "agricolis ['a', 'gri', 'co', 'lis']\n", 125 | "et ['et']\n", 126 | "viri ['vi', 'ri']\n", 127 | "fortissimi ['for', 'tis', 'si', 'mi']\n", 128 | "et ['et']\n", 129 | "milites ['mi', 'li', 'tes']\n", 130 | "strenuissimi ['stre', 'nu', 'is', 'si', 
'mi']\n", 131 | "gignuntur ['gig', 'nun', 'tur']\n", 132 | "maxime ['ma', 'xi', 'me']\n", 133 | "-que ['-que']\n", 134 | "pius ['pi', 'us']\n", 135 | "quaestus ['quaes', 'tus']\n", 136 | "stabilissimus ['sta', 'bi', 'lis', 'si', 'mus']\n", 137 | "-que ['-que']\n", 138 | "consequitur ['con', 'se', 'qui', 'tur']\n", 139 | "minime ['mi', 'ni', 'me']\n", 140 | "-que ['-que']\n", 141 | "invidiosus ['in', 'vi', 'di', 'o', 'sus']\n", 142 | "minime ['mi', 'ni', 'me']\n", 143 | "-que ['-que']\n", 144 | "male ['ma', 'le']\n", 145 | "cogitantes ['co', 'gi', 'tan', 'tes']\n", 146 | "sunt ['sunt']\n", 147 | "qui ['qui']\n", 148 | "in ['in']\n", 149 | "eo ['e', 'o']\n", 150 | "studio ['stu', 'di', 'o']\n", 151 | "occupati ['oc', 'cu', 'pa', 'ti']\n", 152 | "sunt ['sunt']\n", 153 | "nunc ['nunc']\n", 154 | "ut ['ut']\n", 155 | "ad ['ad']\n", 156 | "rem ['rem']\n", 157 | "redeam ['re', 'de', 'am']\n", 158 | "quod ['quod']\n", 159 | "promisi ['pro', 'mi', 'si']\n", 160 | "institutum ['in', 'sti', 'tu', 'tum']\n", 161 | "principium ['prin', 'ci', 'pi', 'um']\n", 162 | "hoc ['hoc']\n", 163 | "erit ['e', 'rit']\n" 164 | ] 165 | } 166 | ], 167 | "source": [ 168 | "syllabifier = Syllabifier()\n", 169 | "\n", 170 | "for word in cato_cltk_word_tokens_no_punt:\n", 171 | " syllables = syllabifier.syllabify(word)\n", 172 | " print(word, syllables)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "# Latin prosody\n", 180 | "\n", 181 | "This is a two-step process: first find the long vowels, then scan the actual meter." 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 3, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# Use the macronizer\n", 191 | "# See http://docs.cltk.org/en/latest/latin.html#macronizer\n", 192 | "\n", 193 | "from cltk.prosody.latin.macronizer import Macronizer\n", 194 | "\n", 195 | "macronizer = Macronizer('tag_ngram_123_backoff')\n", 196 | "\n", 197 | "text = 'Quo usque tandem, O Catilina, abutere nostra patientia?'\n", 198 | "\n", 199 | "scanned_text = macronizer.macronize_text(text)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 4, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "name": "stdout", 209 | "output_type": "stream", 210 | "text": [ 211 | "[('quō', None, 'quō'), ('usque', 'd--------', 'usque'), ('tandem', 'd--------', 'tandem'), (',', 'u--------', ','), ('ō', None, 'ō'), ('catilīnā', None, 'catilīnā'), (',', 'u--------', ','), ('abūtēre', None, 'abūtēre'), ('nostrā', None, 'nostrā'), ('patientia', 'n-s---fn-', 'patientia'), ('?', None, '?')]\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "# Use the scanner\n", 217 | "# See http://docs.cltk.org/en/latest/latin.html#prosody-scanning\n", 218 | "\n", 219 | "from cltk.prosody.latin.scanner import Scansion\n", 220 | "\n", 221 | "scanner = Scansion()\n", 222 | "prose_text = macronizer.macronize_tags(scanned_text)\n", 223 | "print(prose_text)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "# Greek scansion" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 1, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "data": { 240 | "text/plain": [ 241 | "['˘¯¯¯˘¯¯˘¯˘¯˘˘x', '¯¯˘¯x']" 242 | ] 243 | }, 244 | "execution_count": 1, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "from cltk.prosody.greek.scanner import Scansion\n", 251 | "\n", 252 | "scanner = Scansion()\n", 
253 | "\n", 254 | "scanner.scan_text('νέος μὲν καὶ ἄπειρος, δικῶν ἔγωγε ἔτι. μὲν καὶ ἄπειρος.')" 255 | ] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 3", 261 | "language": "python", 262 | "name": "python3" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 3 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython3", 274 | "version": "3.6.4" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 2 279 | } 280 | -------------------------------------------------------------------------------- /8 Part-of-speech tagging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Latin\n", 8 | "\n", 9 | "For Latin POS tags, see https://github.com/cltk/latin_treebank_perseus." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 16, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "aen = \"\"\"arma virumque cano, Troiae qui primus ab oris\n", 19 | "Italiam, fato profugus, Laviniaque venit\n", 20 | "litora, multum ille et terris iactatus et alto\n", 21 | "vi superum saevae memorem Iunonis ob iram;\n", 22 | "multa quoque et bello passus, dum conderet urbem, 5\n", 23 | "inferretque deos Latio, genus unde Latinum,\n", 24 | "Albanique patres, atque altae moenia Romae.\"\"\"\n", 25 | "\n", 26 | "# rm line breaks\n", 27 | "aen = aen.replace('\\n',' ')" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 17, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "arma virumque cano, Troiae qui primus ab oris Italiam, fato profugus, Laviniaque venit litora, multum ille et terris iactatus et alto vi superum saevae memorem Iunonis ob iram; multa quoque et bello passus, dum conderet urbem, 5 inferretque deos Latio, genus unde Latinum, Albanique patres, atque altae moenia Romae.\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "print(aen)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 18, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "from cltk.tag.pos import POSTag" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 19, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "tagger = POSTag('latin')\n", 63 | "aen_tagged = tagger.tag_ngram_123_backoff(aen)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 20, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "[('arma', 'N-P---NA-'), ('virumque', None), ('cano', None), (',', 'U--------'), ('Troiae', None), ('qui', 'P-S---MN-'), ('primus', 'A-S---MN-'), ('ab', 'R--------'), ('oris', 'N-P---NB-'), ('Italiam', None), (',', 'U--------'), ('fato', None), ('profugus', None), (',', 'U--------'), ('Laviniaque', None), ('venit', 'V3SPIA---'), ('litora', 'N-P---NA-'), (',', 'U--------'), ('multum', 'A-S---MA-'), ('ille', 'P-S---MN-'), ('et', 'C--------'), ('terris', 'N-P---FB-'), ('iactatus', None), ('et', 'C--------'), ('alto', 'A-S---MB-'), ('vi', 'N-S---FB-'), ('superum', 'N-P---MG-'), ('saevae', None), ('memorem', 'V1SPSA---'), ('Iunonis', None), ('ob', 'R--------'), ('iram', 'N-S---FA-'), (';', None), ('multa', 'A-P---NA-'), ('quoque', 'D--------'), 
('et', 'C--------'), ('bello', 'N-S---NB-'), ('passus', 'N-P---MA-'), (',', 'U--------'), ('dum', 'C--------'), ('conderet', None), ('urbem', 'N-S---FA-'), (',', 'U--------'), ('5', None), ('inferretque', None), ('deos', 'N-P---MA-'), ('Latio', None), (',', 'U--------'), ('genus', 'N-S---NN-'), ('unde', 'D--------'), ('Latinum', None), (',', 'U--------'), ('Albanique', None), ('patres', 'N-P---MV-'), (',', 'U--------'), ('atque', 'C--------'), ('altae', None), ('moenia', 'N-P---NA-'), ('Romae', None), ('.', 'U--------')]\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "print(aen_tagged)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 21, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "# There are options, as the following\n", 90 | "\n", 91 | "aen_tagged = tagger.tag_crf('Gallia est omnis divisa in partes tres')" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 7, 97 | "metadata": { 98 | "scrolled": false 99 | }, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "[('Gallia', 'A-P---NA-'), ('est', 'V3SPIA---'), ('omnis', 'A-S---FN-'), ('divisa', 'N-S---FN-'), ('in', 'R--------'), ('partes', 'N-P---FA-'), ('tres', 'M--------')]\n" 106 | ] 107 | } 108 | ], 109 | "source": [ 110 | "print(aen_tagged)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "# Greek\n", 118 | "\n", 119 | "For Greek POS tags, see https://github.com/cltk/greek_treebank_perseus." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 9, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "athenaeus = \"Ἀθήναιος μὲν ὁ τῆς βίβλου πατήρ· ποιεῖται δὲ τὸν λόγον πρὸς Τιμοκράτην· Δειπνοσοφιστὴς δὲ ταύτῃ τὸ ὄνομα. 
Ὑπόκειται δὲ τῷ λόγῳ Λαρήνσιος Ῥωμαῖος, ἀνὴρ τῇ τύχῃ περιφανής, τοὺς κατὰ πᾶσαν παιδείαν ἐμπειροτάτους ἐν αὑτοῦ δαιτυμόνας ποιούμενος·\"" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 10, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "[('Ἀθήναιος', 'Unk'), ('μὲν', 'G--------'), ('ὁ', 'L-S---MN-'), ('τῆς', 'L-S---FG-'), ('βίβλου', 'Unk'), ('πατήρ', 'N-S---MN-'), ('·', 'U--------'), ('ποιεῖται', 'V3SPIE---'), ('δὲ', 'G--------'), ('τὸν', 'P-S---MA-'), ('λόγον', 'N-S---MA-'), ('πρὸς', 'R--------'), ('Τιμοκράτην', 'Unk'), ('·', 'U--------'), ('Δειπνοσοφιστὴς', 'Unk'), ('δὲ', 'G--------'), ('ταύτῃ', 'D--------'), ('τὸ', 'L-S---NN-'), ('ὄνομα', 'N-S---NN-'), ('.', 'U--------'), ('Ὑπόκειται', 'Unk'), ('δὲ', 'G--------'), ('τῷ', 'L-S---MD-'), ('λόγῳ', 'N-S---MD-'), ('Λαρήνσιος', 'Unk'), ('Ῥωμαῖος', 'Unk'), (',', 'U--------'), ('ἀνὴρ', 'N-S---MN-'), ('τῇ', 'L-S---FD-'), ('τύχῃ', 'N-S---FD-'), ('περιφανής', 'Unk'), (',', 'U--------'), ('τοὺς', 'P-P---MA-'), ('κατὰ', 'R--------'), ('πᾶσαν', 'A-S---FA-'), ('παιδείαν', 'Unk'), ('ἐμπειροτάτους', 'Unk'), ('ἐν', 'R--------'), ('αὑτοῦ', 'A-S---MG-'), ('δαιτυμόνας', 'N-P---MA-'), ('ποιούμενος', 'T-SPPEMN-'), ('·', 'U--------')]\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "from cltk.tag.pos import POSTag\n", 146 | "tagger = POSTag('greek')\n", 147 | "\n", 148 | "# Using another tagger\n", 149 | "\n", 150 | "athenaeus_tagged = tagger.tag_tnt(athenaeus)\n", 151 | "print(athenaeus_tagged)" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | "kernelspec": { 157 | "display_name": "Python 3", 158 | "language": "python", 159 | "name": "python3" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": "ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.6.5" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 2 176 | } 177 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Classical Language Toolkit 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | These are tutorials on how to get started using the CLTK in your research. Whereas [the CLTK official docs](https://docs.cltk.org) only explain individual functions, these notebooks are intended to be more illustrative of how you can use the functions together and answer real scholarly questions. 4 | 5 | Pull requests are welcome. 6 | 7 | # License 8 | 9 | MIT (see `LICENSE`). 10 | -------------------------------------------------------------------------------- /images/lexical_diversity_greek_canon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cltk/tutorials/49376bf6c07e2394141a87624a192c90450c5f20/images/lexical_diversity_greek_canon.png -------------------------------------------------------------------------------- /images/tableau_bubble.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cltk/tutorials/49376bf6c07e2394141a87624a192c90450c5f20/images/tableau_bubble.png -------------------------------------------------------------------------------- /languages/old-norse/old-norse-tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Old Norse with CLTK\n", 8 | "\n", 9 | "Process your Old Norse texts with the CLTK. This notebook presents several tools adapted to Old Norse." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# Set your own user path\n", 19 | "USER_PATH = \"/home/pi\"" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Import Old Norse corpora\n", 27 | "* old_norse_text_perseus contains different Old Norse books\n", 28 | "* old_norse_texts_heimskringla contains the Eddas\n", 29 | "* old_norse_models_cltk is data for a Part Of Speech tagger \n", 30 | "\n", 31 | "By default, corpora are imported into ~/cltk_data." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "from cltk.corpus.utils.importer import CorpusImporter\n", 41 | "onc = CorpusImporter(\"old_norse\")\n", 42 | "onc.import_corpus(\"old_norse_text_perseus\")\n", 43 | "onc.import_corpus(\"old_norse_texts_heimskringla\")\n", 44 | "onc.import_corpus(\"old_norse_models_cltk\")" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### Configure IPython\n", 52 | "\n", 53 | "Configure IPython if you want to use this notebook\n", 54 | "```bash\n", 55 | "$ ipython profile create\n", 56 | "$ ipython locate\n", 57 | "$ nano ~/.ipython/profile_default/ipython_config.py\n", 58 | "```\n", 59 | "Add this at the end of the file:\n", 60 | "```python\n", 61 | "c.InteractiveShellApp.exec_lines = [\n", 62 | " 'import sys; sys.path.append(\"~/cltk_data/old_norse\")'\n", 63 | "]\n", 64 | "```\n", 65 | "And... It's done!"
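] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would rather not edit the IPython profile, a session-only alternative is to extend `sys.path` directly from the notebook. This is just a sketch: adjust the path to wherever your corpora were imported (by default `~/cltk_data`).\n", "```python\n", "import os\n", "import sys\n", "\n", "# Session-only alternative to the profile configuration above\n", "sys.path.append(os.path.expanduser('~/cltk_data'))\n", "```"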
66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### old_norse_text_perseus" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "Ögmundr er maðr nefndr, er kal\n", 85 | "Sigurðr hefir átt sér einn fós\n", 86 | "Nú halda þeir þangat, ok er þe\n", 87 | "Nú líða stundir fram, ok var s\n", 88 | "Herruðr hét jarl ríkr ok ágætr\n", 89 | "Í þann tíma réð fyrir Danmörku\n", 90 | "Nú er þat eitt sumar, at hann \n", 91 | "Þetta spyrst til skipa Ragnars\n", 92 | "Nú ráða þeir þetta með sér, at\n", 93 | "HEIMIR í Hlymdölum spyrr nú þe\n", 94 | "Nú halda þeir í brott þaðan, þ\n", 95 | "Eptir þetta fara þeir Hvítserk\n", 96 | "Nú er sú stund var liðin, er á\n", 97 | "Nú er þar til máls at taka, er\n", 98 | "Nú ráða þeir þat með sér, at þ\n", 99 | "Sá atburðr hefir verit út í lö\n", 100 | "Nú segir hann, at honum lízt v\n", 101 | "Nú er þat eitthvert sinn, at m\n", 102 | "Nú berr svá til, at þeir koma \n", 103 | "Eysteinn hefir konungr heitit,\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "import os\n", 109 | "import json\n", 110 | "\n", 111 | "corpus = os.path.join(USER_PATH, \"cltk_data/old_norse/text/old_norse_text_perseus/plain_text/Ragnars_saga_loðbrókar_ok_sona_hans\")\n", 112 | "chapters = []\n", 113 | "for filename in os.listdir(corpus):\n", 114 | " with open(os.path.join(corpus, filename)) as f:\n", 115 | " chapter_text = f.read() # json.load(filename)\n", 116 | " print(chapter_text[:30])\n", 117 | " chapters.append(chapter_text)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### old_norse_texts_heimskringla" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 4, 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "['Snorra-Edda', '__pycache__', 'Sæmundar-Edda']\n", 137 | "\n", 138 | "Atlakviða\n", 139 | "\n", 140 | "Dauði Atla\n", 141 | "\n", 142 | "Guðrún Gjúkadóttir hefndi bræðra sinna, svá sem frægt er orðit. Hon drap fyr\n" 143 | ] 144 | } 145 | ], 146 | "source": [ 147 | "from old_norse.text.old_norse_texts_heimskringla.text_manager import *\n", 148 | "corpus_path = USER_PATH+\"/cltk_data/old_norse/text/old_norse_texts_heimskringla\"\n", 149 | "here = os.getcwd()\n", 150 | "os.chdir(corpus_path)\n", 151 | "loader = TextLoader(os.path.join(corpus_path, \"Sæmundar-Edda\", \"Atlakviða\"), \"txt\")\n", 152 | "print(loader.get_available_names())\n", 153 | "complete_text = loader.load()\n", 154 | "print(complete_text[:100])\n", 155 | "os.chdir(here)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "### POS tagging\n", 163 | "Unknown tags are marked with 'Unk'." 
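] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Out-of-vocabulary words fall back to 'Unk', so it can be useful to check what share of a text the tagger actually recognises. A rough sketch, assuming the old_norse_models_cltk corpus was imported above and reusing `complete_text` from the previous section:\n", "```python\n", "from cltk.tag.pos import POSTag\n", "\n", "tnt_tagger = POSTag('old_norse')\n", "tagged = tnt_tagger.tag_tnt(complete_text[:500])\n", "unknown = [word for word, tag in tagged if tag == 'Unk']\n", "print(len(unknown), 'of', len(tagged), 'tokens tagged as Unk')\n", "```"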
164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 5, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "[('Hlióðs', 'Unk'),\n", 175 | " ('bið', 'VBPI'),\n", 176 | " ('ek', 'PRO-N'),\n", 177 | " ('allar', 'Q-A'),\n", 178 | " ('.', '.')]" 179 | ] 180 | }, 181 | "execution_count": 5, 182 | "metadata": {}, 183 | "output_type": "execute_result" 184 | } 185 | ], 186 | "source": [ 187 | "from cltk.tag.pos import POSTag\n", 188 | "import cltk.tag.pos as cltkonpos\n", 189 | "tagger = POSTag('old_norse')\n", 190 | "sent = 'Hlióðs bið ek allar.'\n", 191 | "tagger.tag_tnt(sent)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Word tokenizing\n", 199 | "For now, the word tokenizer is basic, but Old Norse actually does not need a sophisticated one." 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 6, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "['Gylfi', 'konungr', 'var', 'maðr', 'vitr', 'ok', 'fjölkunnigr', '.']" 211 | ] 212 | }, 213 | "execution_count": 6, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "from cltk.tokenize.word import WordTokenizer\n", 220 | "word_tokenizer = WordTokenizer('old_norse')\n", 221 | "sentence = \"Gylfi konungr var maðr vitr ok fjölkunnigr.\"\n", 222 | "word_tokenizer.tokenize(sentence)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "### Old Norse Stop Words\n", 230 | "A list of stop words was elaborated with the most insignificant words of a sentence. Of course, according to your needs, you can change it." 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 7, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "data": { 240 | "text/plain": [ 241 | "['var',\n", 242 | " 'einn',\n", 243 | " 'morgin',\n", 244 | " ',',\n", 245 | " 'karlsefni',\n", 246 | " 'rjóðrit',\n", 247 | " 'flekk',\n", 248 | " 'nökkurn',\n", 249 | " ',',\n", 250 | " 'glitraði']" 251 | ] 252 | }, 253 | "execution_count": 7, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "from nltk.tokenize.punkt import PunktLanguageVars\n", 260 | "from cltk.stop.old_norse.stops import STOPS_LIST\n", 261 | "sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'\n", 262 | "p = PunktLanguageVars()\n", 263 | "\n", 264 | "tokens = p.word_tokenize(sentence.lower())\n", 265 | "[w for w in tokens if not w in STOPS_LIST]" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "### Swadesh list for Old Norse\n", 273 | "In the following Swadesh list, an item may have several words if they have a similar meaning, and some words lack because I have not found any corresponding Old Norse word." 
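] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One simple use of such a list is to check which of these core vocabulary items occur in a given text. A rough sketch reusing `complete_text` from above; note that multi-word entries like 'sjá, þessi' will not match with this naive token lookup:\n", "```python\n", "from cltk.corpus.swadesh import Swadesh\n", "\n", "core_vocabulary = Swadesh('old_norse').words()\n", "text_tokens = set(complete_text.lower().split())\n", "attested = [w for w in core_vocabulary if w in text_tokens]\n", "print(attested[:10])\n", "```"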
274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 8, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "data": { 283 | "text/plain": [ 284 | "['ek',\n", 285 | " 'þú',\n", 286 | " 'hann',\n", 287 | " 'vér',\n", 288 | " 'þér',\n", 289 | " 'þeir',\n", 290 | " 'sjá, þessi',\n", 291 | " 'sá',\n", 292 | " 'hér',\n", 293 | " 'þar',\n", 294 | " 'hvar',\n", 295 | " 'hvat',\n", 296 | " 'hvar',\n", 297 | " 'hvenær',\n", 298 | " 'hvé',\n", 299 | " 'eigi',\n", 300 | " 'allr',\n", 301 | " 'margr',\n", 302 | " 'nǫkkurr',\n", 303 | " 'fár',\n", 304 | " 'annarr',\n", 305 | " 'einn',\n", 306 | " 'tveir',\n", 307 | " 'þrír',\n", 308 | " 'fjórir',\n", 309 | " 'fimm',\n", 310 | " 'stórr',\n", 311 | " 'langr',\n", 312 | " 'breiðr',\n", 313 | " 'þykkr']" 314 | ] 315 | }, 316 | "execution_count": 8, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "from cltk.corpus.swadesh import Swadesh\n", 323 | "swadesh = Swadesh('old_norse')\n", 324 | "words = swadesh.words()\n", 325 | "words[:30]" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "By Clément Besnier, email address: clemsciences@aol.com" 333 | ] 334 | } 335 | ], 336 | "metadata": { 337 | "kernelspec": { 338 | "display_name": "Python 3.6", 339 | "language": "python", 340 | "name": "python3" 341 | }, 342 | "language_info": { 343 | "codemirror_mode": { 344 | "name": "ipython", 345 | "version": 3 346 | }, 347 | "file_extension": ".py", 348 | "mimetype": "text/x-python", 349 | "name": "python", 350 | "nbconvert_exporter": "python", 351 | "pygments_lexer": "ipython3", 352 | "version": "3.6.3" 353 | } 354 | }, 355 | "nbformat": 4, 356 | "nbformat_minor": 1 357 | } 358 | -------------------------------------------------------------------------------- /languages/old-norse/runes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Runes\n", 8 | "Note: in order to use this **Jupyter notebook**, you need at least **python 3.6** or above." 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Configuration\n", 16 | "\n", 17 | "Install required modules.\n", 18 | "```bash\n", 19 | "$ sudo pip3.6 install requests lxml \n", 20 | "```" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "Configure **ipython**.\n", 28 | "\n", 29 | "```bash\n", 30 | "$ ipython profile create\n", 31 | "$ ipython locate\n", 32 | "$ nano .ipython/profile_default/ipython_config.py\n", 33 | "```\n", 34 | " Add it a the end of the file:\n", 35 | "```bash\n", 36 | "c.InteractiveShellApp.exec_lines = [\n", 37 | " 'import sys; sys.path.append(\"/home/pi/cltk_data\")'\n", 38 | "]\n", 39 | "```\n", 40 | "It is necessary to do that because it makes things easier to utilize data furnished by CLTK. You will see later in the notebook how it is used.\n", 41 | "\n", 42 | "And... It's done!" 
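] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check that the configuration took effect, you can inspect `sys.path` from the notebook; this is only a quick sanity check:\n", "```python\n", "import sys\n", "\n", "# The cltk_data directory added above should show up here\n", "print([p for p in sys.path if 'cltk_data' in p])\n", "```"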
43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Install the **kernel** associated with **python3.6** [https://ipython.readthedocs.io/en/stable/install/kernel_install.html](https://ipython.readthedocs.io/en/stable/install/kernel_install.html) " 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "Let's test if the import is correct:\n", 57 | "```bash\n", 58 | "$ python3.6\n", 59 | "```" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "### Runes and CLTK\n", 67 | "\n", 68 | "How can we work on runes with CLK?" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 1, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "from cltk.corpus.old_norse import runes" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "First of all, let's see what runes are:" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 2, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "'᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬'" 96 | ] 97 | }, 98 | "execution_count": 2, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "from old_norse.text.old_norse_runic_transcriptions.denmark.data import little_jelling_stone \n", 105 | "little_jelling_stone" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "The oldest runic inscriptions found are from 200 AC. They have always denoted Germanic languages. Until the 8th century, the elder *futhark* alphabet was used. It was compouned with 24 characters: ᚠ, ᚢ, ᚦ, ᚨ, ᚱ, ᚲ, ᚷ, ᚹ, ᚺ, ᚾ, ᛁ, ᛃ, ᛇ, ᛈ, ᛉ, ᛊ, ᛏ, ᛒ, ᛖ, ᛗ, ᛚ, ᛜ, ᛟ, ᛞ. The word *Futhark* comes from the 6 first characters of the alphabet: ᚠ (f), ᚢ (u), ᚦ (th), ᚨ (a), ᚱ (r), ᚲ (k). Later, this alphabet was reduced to 16 runes, the *younger futhark* ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚼ, ᚾ, ᛁ, ᛅ, ᛋ, ᛏ, ᛒ, ᛖ, ᛘ, ᛚ, ᛦ, with more ambiguity on sounds. Shapes of runes may vary according to which matter they are carved on, that is why there is a variant of the *younger futhark* like this: ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚽ, ᚿ, ᛁ, ᛅ, ᛌ, ᛐ, ᛓ, ᛖ, ᛙ, ᛚ, ᛧ." 
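] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration, the name of the alphabet can be read off the first six runes of the elder futhark provided by the `runes` module imported above (a minimal sketch):\n", "```python\n", "# f, u, th, a, r, k: the first six runes give the alphabet its name\n", "print(runes.ELDER_FUTHARK[:6])\n", "```"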
113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Get the available runic alphabets with **RunicAlphabetName**" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 3, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "from cltk.corpus.old_norse.runes import RunicAlphabetName" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 4, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "elder_futhark\n", 141 | "younger_futhark\n", 142 | "short_twig_younger_futhark\n" 143 | ] 144 | } 145 | ], 146 | "source": [ 147 | "for name in RunicAlphabetName:\n", 148 | " print(name.value)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Retrieve the contents of the alphabets:" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "[ᚠ, ᚢ, ᚦ, ᚨ, ᚱ, ᚲ, ᚷ, ᚹ, ᚺ, ᚾ, ᛁ, ᛃ, ᛇ, ᛈ, ᛉ, ᛊ, ᛏ, ᛒ, ᛖ, ᛗ, ᛚ, ᛜ, ᛟ, ᛞ]\n", 168 | "[ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚼ, ᚾ, ᛁ, ᛅ, ᛋ, ᛏ, ᛒ, ᛖ, ᛘ, ᛚ, ᛦ]\n", 169 | "[ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚽ, ᚿ, ᛁ, ᛅ, ᛌ, ᛐ, ᛓ, ᛖ, ᛙ, ᛚ, ᛧ]\n" 170 | ] 171 | } 172 | ], 173 | "source": [ 174 | "for alphabet in [runes.ELDER_FUTHARK, runes.YOUNGER_FUTHARK, runes.SHORT_TWIG_YOUNGER_FUTHARK]:\n", 175 | " print(alphabet)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "### Runic inscriptions\n", 183 | "\n", 184 | "May I get examples from the real world? Of course! For that, we use **CorpusImporter** class from CLTK to import the data contained in a CLTK project named \"old_norse_runic_transcriptions\"." 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 6, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "import os\n", 194 | "from cltk.corpus.utils.importer import CorpusImporter\n", 195 | "\n", 196 | "onc = CorpusImporter(\"old_norse\")\n", 197 | "onc.import_corpus(\"old_norse_runic_transcriptions\")" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "Once the corpus has been downloaded, you can visualize ome famous runic inscriptions like the Jelling stones in the *data.py* file." 
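] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at them, you can also ask the importer which Old Norse corpora it knows about. A minimal sketch reusing the `onc` importer created above:\n", "```python\n", "# List the corpora registered for Old Norse\n", "onc.list_corpora\n", "```"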
205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 7, 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "data": { 214 | "text/plain": [ 215 | "'᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬'" 216 | ] 217 | }, 218 | "execution_count": 7, 219 | "metadata": {}, 220 | "output_type": "execute_result" 221 | } 222 | ], 223 | "source": [ 224 | "from old_norse.text.old_norse_runic_transcriptions.denmark.data import little_jelling_stone , big_jelling_stone\n", 225 | "little_jelling_stone" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 8, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/plain": [ 236 | "'ᚼᛅᚱᛅᛚᛏᚱ᛬ᚴᚢᚾᚢᚴᛦ᛬ᛒᛅᚦ᛬ᚴᛅᚢᚱᚢᛅ ᚴᚢᛒᛚ᛬ᚦᛅᚢᛋᛁ᛬ᛅᚠᛏ᛬ᚴᚢᚱᛘ ᚠᛅᚦᚢᚱ ᛋᛁᚾ ᛅᚢᚴ ᛅᚠᛏ᛬ᚦᚭᚢᚱᚢᛁ᛬ᛘᚢᚦᚢᚱ᛬ᛋᛁᚾᛅ᛬ᛋᛅ ᚼᛅᚱᛅᛚᛏᚱ(᛬)ᛁᛅᛋ᛬ᛋᚭᛦ᛫ᚢᛅᚾ᛫ᛏᛅᚾᛘᛅᚢᚱᚴ\\nᛅᛚᛅ᛫ᛅᚢᚴ᛫ᚾᚢᚱᚢᛁᚴ\\n᛫ᛅᚢᚴ᛫ᛏ(ᛅ)ᚾᛁ(᛫ᚴᛅᚱᚦᛁ᛫)ᚴᚱᛁᛋᛏᚾᚭ'" 237 | ] 238 | }, 239 | "execution_count": 8, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "big_jelling_stone" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Encoding and data format\n", 253 | "\n", 254 | "Runes are encoded in UTF-8 from \\u16A0 ᚠ to \\u16FF ᛪ. See https://en.wikipedia.org/wiki/Runic_(Unicode_block)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "Interesting, but why a Python module for runes? This module provides:\n", 262 | "* metadata attached to runes (runic alphabet which it is in, its representation, the approximate sound it describes, the transcription, the name)\n", 263 | "* a rune to latin character transcriber \n", 264 | "* a unified method to retrieve corpora of runic inscriptions " 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 9, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "ᚠ" 276 | ] 277 | }, 278 | "execution_count": 9, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "runes.ELDER_FUTHARK[0]" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Runes are defined with the **Rune** class in the *rune* module." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 10, 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "data": { 301 | "text/plain": [ 302 | "ᚠ" 303 | ] 304 | }, 305 | "execution_count": 10, 306 | "metadata": {}, 307 | "output_type": "execute_result" 308 | } 309 | ], 310 | "source": [ 311 | "runes.Rune(runes.RunicAlphabetName.elder_futhark, \"\\u16A0\", \"f\", \"f\", \"fehu\")" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "### Runic transcription\n", 319 | "Use the **Transcriber** class to get a basic transcription of a runic inscription. To transcribe correctly a runic inscription, you have to take care about which runic alphabets it was written in. In the following exampls, the *younger Futhark* was used. An incorrect alphabet makes the transcription quite useless as in the second example." 
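] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, the big Jelling stone imported above can be transcribed with the younger futhark in the same way (a minimal sketch; the cells below do the same for the little stone with a correct and then an incorrect alphabet):\n", "```python\n", "runes.Transcriber.transcribe(big_jelling_stone, runes.YOUNGER_FUTHARK)\n", "```"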
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 11, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "text/plain": [ 330 | "'᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫'" 331 | ] 332 | }, 333 | "execution_count": 11, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "runes.Transcriber.transcribe(little_jelling_stone, runes.YOUNGER_FUTHARK)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 12, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "'᛫᛫ur᛫᛫᛫᛫unu᛫᛫᛫᛫(᛫r)þi᛫᛫ubl᛫þu᛫i᛫᛫(ft)᛫þurui᛫᛫unu᛫᛫in᛫᛫t᛫n᛫᛫r᛫᛫᛫᛫but᛫'" 351 | ] 352 | }, 353 | "execution_count": 12, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "runes.Transcriber.transcribe(little_jelling_stone, runes.ELDER_FUTHARK)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 13, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "from old_norse.text.old_norse_runic_transcriptions.sweden import scraper" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "If you want to import all the Sweden runic inscriptions, call the following function." 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 14, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "data": { 385 | "text/plain": [ 386 | "" 387 | ] 388 | }, 389 | "execution_count": 14, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "scraper.retrieve_sweden_runic_inscriptions" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Future tasks:\n", 403 | "* normalizing runic inscriptions and transcriptions,\n", 404 | "* tag runic inscriptions with locations and estimated dates,\n", 405 | "* making a statistics module to analyze frequencies of words, runes, spellings in runic inscriptions,\n", 406 | "* getting more runic inscriptions from Norway, Denmark, etc,\n", 407 | "* using phonetical rules [module](https://github.com/cltk/cltk/blob/master/cltk/phonology/utils.py) to get a normalized, pronunciation of Old norse inscriptions written with runes.\n", 408 | " " 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "By Clément Besnier, email address: clemsciences@aol.com, web site: https://clementbesnier.pythonanywhere.com/, twitter: clemsciences" 416 | ] 417 | } 418 | ], 419 | "metadata": { 420 | "kernelspec": { 421 | "display_name": "Python 3.6", 422 | "language": "python", 423 | "name": "python3" 424 | }, 425 | "language_info": { 426 | "codemirror_mode": { 427 | "name": "ipython", 428 | "version": 3 429 | }, 430 | "file_extension": ".py", 431 | "mimetype": "text/x-python", 432 | "name": "python", 433 | "nbconvert_exporter": "python", 434 | "pygments_lexer": "ipython3", 435 | "version": "3.6.3" 436 | } 437 | }, 438 | "nbformat": 4, 439 | "nbformat_minor": 1 440 | } 441 | -------------------------------------------------------------------------------- /languages/south_asia/Bengali_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Bengali with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 
| "Use CLTK to analyse Bengali texts!" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Let us first add user path, where our scripts will be downloaded to.." 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import os\n", 31 | "USER_PATH = os.path.expanduser('~')" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Now, before we can analyse the texts, let us first download the Bengali texts from CLTK's Github repo, for which, we will be needing an importer." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "from cltk.corpus.utils.importer import CorpusImporter\n", 48 | "bengali_corpus_downloader = CorpusImporter('bengali')" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "Once we have our importer ready, we can view which corpora are available for download." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 3, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "['bengali_text_wikisource']" 67 | ] 68 | }, 69 | "execution_count": 3, 70 | "metadata": {}, 71 | "output_type": "execute_result" 72 | } 73 | ], 74 | "source": [ 75 | "bengali_corpus_downloader.list_corpora" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Let us now download the corpus bengali_text_wikisource. The corpus will be downloaded to the home directory of the user in a directory called `cltk_data/text`" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 4, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "bengali_corpus_downloader.import_corpus('bengali_text_wikisource')" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "Let us now open the text শকুন্তলা by Abanindranath Tagore." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 5, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "bengali_corpus_path = os.path.join(USER_PATH,'cltk_data/bengali/text/bengali_text_wikisource')\n", 108 | "bengali_text_path = os.path.join(bengali_corpus_path,'শকুন্তলা')" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Since we have the data differentiated into different text files, let us combine them to form a single text block. Then we print first 1000 characters of first text file." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 6, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "name": "stdout", 125 | "output_type": "stream", 126 | "text": [ 127 | "তপোবনে ।\n", 128 | " রাজা রাজ্যে চলে গেলেন, আর শকুন্তলা সেই বনে দিন গুন্‌তে লাগল।\n", 129 | " যাবার সময় রাজা নিজের মোহর আংটী শকুন্তলাকে দিয়ে গেলেন, বলে গেলেন—সুন্দরি, তুমি প্রতিদিন আমার নামের একটি করে অক্ষর পড়বে, নামও শেষ হবে আর বনপথে সোণার রথ তোমাকে নিতে আসবে।”\n", 130 | " কিন্তু হায়, সোণার রথ কই এল ?\n", 131 | " কত দিন গেল, কত রাত গেল; দুষ্মন্ত নাম কতবার পড়া হয়ে গেল, তবু সোণার রথ কই এল! হায় হায়, সোণার সাঁঝে সোণার রথ সেই যে গেল আর ফিরল না!\n", 132 | " পৃথিবীর রাজা সোণার সিংহাসনে, আর বনের রাণী কুটীর দুয়ারে,—দুই জনে দুই খানে।\n", 133 | " রাজার শোকে শকুন্তলার মন ভেঙ্গে পড়ল। কোথা রইল অতিথিসেবা, কোথা রইল পোষা হরিণ, কোথা রইল সাধের নিকুঞ্জবনে প্রাণের দুই প্রিয়সখী! 
শকুন্তলার মুখে হাসি নেই, চোখে ঘুম নেই! রাজার ভাবনা নিয়ে কুটীর দুয়ারে পাষাণ-প্রতিমা বসে রইল।\n", 134 | " রাজার রথ কেন এল না? কেন রাজা ভুলে রইলেন?\n", 135 | " রাজা রাজ্যে গেলে একদিন শকুন্তলা কুটীর দুয়ারে গালে হাত দিয়ে বসে বসে রাজার কথা ভাবছে—ভাবছে আর কাঁদছে, এমন সময় মহর্ষি দুর্ব্বাসা দুয়ারে অতিথি এলেন, শকুন্তলা জানতেও পারলে না, ফিরেও দেখলে না। একে দুর্ব্বাসা মহা অভিমানী, একটুতেই মহা রাগ হয়, কথায় কথায়\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "bengali_text_shakuntala = []\n", 141 | "for filename in os.listdir(bengali_text_path):\n", 142 | " if filename[-3:] == 'txt':\n", 143 | " with open(os.path.join(bengali_text_path,filename)) as f:\n", 144 | " file_text = f.read()\n", 145 | " bengali_text_shakuntala.append(file_text)\n", 146 | "bengali_text_shakuntala_first = bengali_text_shakuntala[0][:1000]\n", 147 | "print(bengali_text_shakuntala_first)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "## Sentence tokenization" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Let us now perform tokenization on the first part of the text Shakuntala . " 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 7, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "['তপোবনে', '।', '\\n\\xa0রাজা', 'রাজ্যে', 'চলে', 'গেলেন', ',', 'আর', 'শকুন্তলা', 'সেই', 'বনে', 'দিন', 'গুন্\\u200cতে', 'লাগল', '।', '\\n\\xa0যাবার', 'সময়', 'রাজা', 'নিজের', 'মোহর', 'আংটী', 'শকুন্তলাকে', 'দিয়ে', 'গেলেন', ',', 'বলে', 'গেলেন—সুন্দরি', ',', 'তুমি', 'প্রতিদিন', 'আমার', 'নামের', 'একটি', 'করে', 'অক্ষর', 'পড়বে', ',', 'নামও', 'শেষ', 'হবে', 'আর', 'বনপথে', 'সোণার', 'রথ', 'তোমাকে', 'নিতে', 'আসবে', '।', '”\\n\\xa0কিন্তু', 'হায়', ',', 'সোণার', 'রথ', 'কই', 'এল\\xa0', '?', '\\n\\xa0কত', 'দিন', 'গেল', ',', 'কত', 'রাত', 'গেল', ';', 'দুষ্মন্ত', 'নাম', 'কতবার', 'পড়া', 'হয়ে', 'গেল', ',', 'তবু', 'সোণার', 'রথ', 'কই', 'এল', '!', 'হায়', 'হায়', ',', 'সোণার', 'সাঁঝে', 'সোণার', 'রথ', 'সেই', 'যে', 'গেল', 'আর', 'ফিরল', 'না', '!', '\\n\\xa0পৃথিবীর', 'রাজা', 'সোণার', 'সিংহাসনে', ',', 'আর', 'বনের', 'রাণী', 'কুটীর', 'দুয়ারে', ',', '—দুই', 'জনে', 'দুই', 'খানে', '।', '\\n\\xa0রাজার', 'শোকে', 'শকুন্তলার', 'মন', 'ভেঙ্গে', 'পড়ল', '।', 'কোথা', 'রইল', 'অতিথিসেবা', ',', 'কোথা', 'রইল', 'পোষা', 'হরিণ', ',', 'কোথা', 'রইল', 'সাধের', 'নিকুঞ্জবনে', 'প্রাণের', 'দুই', 'প্রিয়সখী', '!', 'শকুন্তলার', 'মুখে', 'হাসি', 'নেই', ',', 'চোখে', 'ঘুম', 'নেই', '!', 'রাজার', 'ভাবনা', 'নিয়ে', 'কুটীর', 'দুয়ারে', 'পাষাণ', '-', 'প্রতিমা', 'বসে', 'রইল', '।', '\\n\\xa0রাজার', 'রথ', 'কেন', 'এল', 'না', '?', 'কেন', 'রাজা', 'ভুলে', 'রইলেন', '?', '\\n\\xa0রাজা', 'রাজ্যে', 'গেলে', 'একদিন', 'শকুন্তলা', 'কুটীর', 'দুয়ারে', 'গালে', 'হাত', 'দিয়ে', 'বসে', 'বসে', 'রাজার', 'কথা', 'ভাবছে—ভাবছে', 'আর', 'কাঁদছে', ',', 'এমন', 'সময়', 'মহর্ষি', 'দুর্ব্বাসা', 'দুয়ারে', 'অতিথি', 'এলেন', ',', 'শকুন্তলা', 'জানতেও', 'পারলে', 'না', ',', 'ফিরেও', 'দেখলে', 'না', '।', 'একে', 'দুর্ব্বাসা', 'মহা', 'অভিমানী', ',', 'একটুতেই', 'মহা', 'রাগ', 'হয়', ',', 'কথায়']\n" 174 | ] 175 | } 176 | ], 177 | "source": [ 178 | "\n", 179 | "from cltk.tokenize.sentence import TokenizeSentence\n", 180 | "tokenizer = TokenizeSentence('bengali')\n", 181 | "bengali_text_shakuntala_first_tokens = tokenizer.tokenize(bengali_text_shakuntala_first)\n", 182 | "print(bengali_text_shakuntala_first_tokens[:-1]) ##omit last word due to incompleteness\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": 
{}, 188 | "source": [ 189 | "## Transliterations" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "We can transliterate Bengali scripts to that of other Indic languages. Let us transliterate ` আমি বই পছন্দ করি `to Telugu:" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 8, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "text/plain": [ 207 | "'ఆమి బఇ పఛన్ద కరి'" 208 | ] 209 | }, 210 | "execution_count": 8, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "bengali_text_two = 'আমি বই পছন্দ করি'\n", 217 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 218 | "UnicodeIndicTransliterator.transliterate(bengali_text_two,\"bn\",\"te\")" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "We can also romanize the text as shown:" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 9, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/plain": [ 236 | "'aami bi paChanda kari'" 237 | ] 238 | }, 239 | "execution_count": 9, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 246 | "ItransTransliterator.to_itrans(bengali_text_two,'bn')" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 10, 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "data": { 263 | "text/plain": [ 264 | "'শিক্ষা'" 265 | ] 266 | }, 267 | "execution_count": 10, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "bengali_text_itrans = 'shikshhaa'\n", 274 | "ItransTransliterator.from_itrans(bengali_text_itrans,'bn')" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## Syllabifier" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "We can use the indian_syllabifier to syllabify the Bengali sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 11, 294 | "metadata": { 295 | "scrolled": true 296 | }, 297 | "outputs": [], 298 | "source": [ 299 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 300 | "phonetics_model_importer.list_corpora\n", 301 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "Now we import the syllabifier and syllabify as follows:" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 12, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "%%capture\n", 318 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 319 | "bengali_syllabifier = Syllabifier('bengali')\n", 320 | "bengali_syllables = bengali_syllabifier.orthographic_syllabify('আমি')" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "The syllables of the word `আমি` will thus be:" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 13, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "name": "stdout", 337 | "output_type": "stream", 338 | "text": [ 339 | "['আ', 'মি']\n" 340 | ] 341 | } 342 | ], 343 | "source": [ 344 | "print(bengali_syllables)" 345 | ] 346 | } 347 | ], 348 | "metadata": { 349 | "kernelspec": { 350 | "display_name": "Python 3", 351 | "language": "python", 352 | "name": "python3" 353 | }, 354 | "language_info": { 355 | "codemirror_mode": { 356 | "name": "ipython", 357 | "version": 3 358 | }, 359 | "file_extension": ".py", 360 | "mimetype": "text/x-python", 361 | "name": "python", 362 | "nbconvert_exporter": "python", 363 | "pygments_lexer": "ipython3", 364 | "version": "3.6.5" 365 | } 366 | }, 367 | "nbformat": 4, 368 | "nbformat_minor": 2 369 | } 370 | -------------------------------------------------------------------------------- /languages/south_asia/Gujarati_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Gujarati with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "See how you can analyse your Gujarati texts with CLTK !
\n", 15 | "Let's begin by adding the `USER_PATH`.." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "In order to be able to download Gujarati texts from CLTK's Github repo, we will require an importer." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "gujarati_downloader = CorpusImporter('gujarati')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "We can now see the corpora available for download, by using `list_corpora` feature of the importer. Let's go ahead and try it out!" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['gujarati_text_wikisource']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "gujarati_downloader.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "The corpus gujarati_text_wikisource can be downloaded from the Github repo. The corpus will be downloaded to the directory `cltk_data/gujarati` at the above mentioned `USER_PATH`" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "gujarati_downloader.import_corpus('gujarati_text_wikisource')" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "You can see the texts downloaded by doing the following, or checking out the `cltk_data/gujarati/text/gujarati_text_wikisource` directory." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 7, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "['narsinh_mehta', 'kabir', 'vallabhacharya']\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "gujarati_corpus_path = os.path.join(USER_PATH,'cltk_data/gujarati/text/gujarati_text_wikisource')\n", 110 | "list_of_texts = [text for text in os.listdir(gujarati_corpus_path) if '.' not in text]\n", 111 | "print(list_of_texts)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Great, now that we have our texts, let's take a sample from one of them. For this tutorial, we shall be using govinda_khele_holi , a text by the Gujarati poet Narsinh Mehta." 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 10, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "વૃંદાવન જઈએ,\n", 131 | "જીહાં ગોવિંદ ખેલે હોળી;\n", 132 | "નટવર વેશ ધર્યો નંદ નંદન,\n", 133 | "મળી મહાવન ટોળી... ચાલો સખી !\n", 134 | "\n", 135 | "એક નાચે એક ચંગ વજાડે,\n", 136 | "છાંટે કેસર ઘોળી;\n", 137 | "એક અબીરગુલાલ ઉડાડે,\n", 138 | "એક ગાય ભાંભર ભોળી... ચાલો સખી !\n", 139 | "\n", 140 | "એક એકને કરે છમકલાં,\n", 141 | "હસી હસી કર લે તાળી;\n", 142 | "માહોમાહે કરે મરકલાં,\n", 143 | "મધ્ય ખેલે વનમાળી... 
ચાલો સખી !\n", 144 | "\n", 145 | "વસંત ઋતુ વૃંદાવન સરી,\n", 146 | "ફૂલ્યો ફાગણ માસ;\n", 147 | "ગોવિંદગોપી રમે રંગભર,\n", 148 | "જુએ નરસૈંયો દાસ... ચાલો સખી !\n", 149 | " \n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "gujarati_text_path = os.path.join(gujarati_corpus_path,'narsinh_mehta/govinda_khele_holi.txt')\n", 155 | "gujarati_text = open(gujarati_text_path,'r').read()\n", 156 | "print(gujarati_text)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "## Gujarati Alphabets" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "There are 13 vowels, 33 consonants, which are grouped as follows:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 12, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "Digits: ['૦', '૧', '૨', '૩', '૪', '૫', '૬', '૭', '૮', '૯', '૧૦']\n", 183 | "Vowels: ['અ', 'આ', 'ઇ', 'ઈ', 'ઉ', 'ઊ', 'ઋ', 'એ', 'ઐ', 'ઓ', 'ઔ', 'અં', 'અઃ']\n", 184 | "Dependent vowels: ['ા ', 'િ', 'ી', 'ો', 'ૌ']\n", 185 | "Consonants: ['ક', 'ખ', 'ગ', 'ઘ', 'ચ', 'છ', 'જ', 'ઝ', 'ઞ', 'ટ', 'ઠ', 'ડ', 'ઢ', 'ણ', 'ત', 'થ', 'દ', 'ધ', 'ન', 'પ', 'ફ', 'બ', 'ભ', 'મ', 'ય', 'ર', 'લ', 'ળ', 'વ', 'શ', 'ષ', 'સ', 'હ']\n", 186 | "Velar consonants: ['ક', 'ખ', 'ગ', 'ઘ', 'ઙ']\n", 187 | "Palatal consonants: ['ચ', 'છ', 'જ', 'ઝ', 'ઞ']\n", 188 | "Retroflex consonants: ['ટ', 'ઠ', 'ડ', 'ઢ', 'ણ']\n", 189 | "Dental consonants: ['ત', 'થ', 'દ', 'ધ', 'ન']\n", 190 | "Labial consonants: ['પ', 'ફ', 'બ', 'ભ', 'મ']\n", 191 | "Sonorant consonants: ['ય', 'ર', 'લ', 'વ']\n", 192 | "Sibilant consonants: ['શ', 'ષ', 'સ']\n", 193 | "Guttural consonant: ['હ']\n", 194 | "Additional consonants: ['ળ', 'ક્ષ', 'જ્ઞ']\n", 195 | "Modifiers: [' ्', ' ॓', ' ॔']\n" 196 | ] 197 | } 198 | ], 199 | "source": [ 200 | "from cltk.corpus.gujarati.alphabet import *\n", 201 | "print(\"Digits:\",DIGITS)\n", 202 | "print(\"Vowels:\",VOWELS)\n", 203 | "print(\"Dependent vowels:\",DEPENDENT_VOWELS)\n", 204 | "print(\"Consonants:\",CONSONANTS)\n", 205 | "print(\"Velar consonants:\",VELAR_CONSONANTS)\n", 206 | "print(\"Palatal consonants:\",PALATAL_CONSONANTS)\n", 207 | "print(\"Retroflex consonants:\",RETROFLEX_CONSONANTS)\n", 208 | "print(\"Dental consonants:\",DENTAL_CONSONANTS)\n", 209 | "print(\"Labial consonants:\",LABIAL_CONSONANTS)\n", 210 | "print(\"Sonorant consonants:\",SONORANT_CONSONANTS)\n", 211 | "print(\"Sibilant consonants:\",SIBILANT_CONSONANTS)\n", 212 | "print(\"Guttural consonant:\",GUTTURAL_CONSONANT)\n", 213 | "print(\"Additional consonants:\",ADDITIONAL_CONSONANTS)\n", 214 | "print(\"Modifiers:\",MODIFIERS)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "## Transliterations" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "We can transliterate Gujarati scripts to that of other Indic languages. 
Let us transliterate `કમળ ભારતનો રાષ્ટ્રીય ફૂલ છે`to Kannada:" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 16, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "'ಕಮಳ ಭಾರತನೋ ರಾಷ್ಟ್ರೀಯ ಫೂಲ ಛೇ'" 240 | ] 241 | }, 242 | "execution_count": 16, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "gujarati_text_two = 'કમળ ભારતનો રાષ્ટ્રીય ફૂલ છે'\n", 249 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 250 | "UnicodeIndicTransliterator.transliterate(gujarati_text_two,\"gu\",\"kn\")" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "We can also romanize the text as shown:" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 26, 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "data": { 267 | "text/plain": [ 268 | "'kamalda bhaaratano raashhTriiya phuula Che'" 269 | ] 270 | }, 271 | "execution_count": 26, 272 | "metadata": {}, 273 | "output_type": "execute_result" 274 | } 275 | ], 276 | "source": [ 277 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 278 | "ItransTransliterator.to_itrans(gujarati_text_two,'gu')" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 25, 291 | "metadata": {}, 292 | "outputs": [ 293 | { 294 | "data": { 295 | "text/plain": [ 296 | "'ભાવના'" 297 | ] 298 | }, 299 | "execution_count": 25, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "gujarati_text_itrans = 'bhaawanaa'\n", 306 | "ItransTransliterator.from_itrans(gujarati_text_itrans,'gu')" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "## Syllabifier" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "We can use the indian_syllabifier to syllabify the Gujarati sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 11, 326 | "metadata": { 327 | "scrolled": true 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 332 | "phonetics_model_importer.list_corpora\n", 333 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "Now we import the syllabifier and syllabify as follows:" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 27, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "%%capture\n", 350 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 351 | "gujarati_syllabifier = Syllabifier('gujarati')\n", 352 | "gujarati_syllables = gujarati_syllabifier.orthographic_syllabify('ભાવના')" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "The syllables of the word `ભાવના` will thus be:" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 28, 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "name": "stdout", 369 | "output_type": "stream", 370 | "text": [ 371 | "['ભા', 'વ', 'ના']\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "print(gujarati_syllables)" 377 | ] 378 | } 379 | ], 380 | "metadata": { 381 | "kernelspec": { 382 | "display_name": "Python 3", 383 | "language": "python", 384 | "name": "python3" 385 | }, 386 | "language_info": { 387 | "codemirror_mode": { 388 | "name": "ipython", 389 | "version": 3 390 | }, 391 | "file_extension": ".py", 392 | "mimetype": "text/x-python", 393 | "name": "python", 394 | "nbconvert_exporter": "python", 395 | "pygments_lexer": "ipython3", 396 | "version": "3.6.5" 397 | } 398 | }, 399 | "nbformat": 4, 400 | "nbformat_minor": 2 401 | } 402 | -------------------------------------------------------------------------------- /languages/south_asia/Hindi_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Hindi with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Analyse hindi texts using CLTK!
\n", 15 | "Firstly, we need to add the path where our corpora will reside." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Before we begin analysing the texts, we will need to download the Hindi corpora, for which, we will be using an Importer. Call the importer to download Hindi texts, as follows.. " 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "hindi_corpus_importer = CorpusImporter('hindi')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "You can view which corpora to download by calling list_corpora() method" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['hindi_text_ltrc']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "hindi_corpus_importer.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "hindi_corpus_importer.import_corpus('hindi_text_ltrc');" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "It can be verified that the `hindi_text_ltrc` corpus is downloaded in a `cltk_data/hindi/text` folder which at the path given by `USER_PATH`. It is now possible to analyse the texts within. For this tutorial, let us analyse the text `Akiri Kalam` by the poet Malik Muhammad Jayasi, which is at the path as shown." 
86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "hindi_corpus_path = os.path.join(USER_PATH,'cltk_data/hindi/text/hindi_text_ltrc/')\n", 95 | "hindi_text_path = os.path.join(hindi_corpus_path,'JayasI/AKirIkalAm/main.u')\n", 96 | "hindi_text = open(hindi_text_path,'r').read()" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Let us see the first two stanzas of `hindi_text`" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 6, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | " आखरी कलाम\n", 116 | "\n", 117 | "पहिले नावँ दैउ कर लीन्हा । जेंइ जिउ दीन्ह, बोल मुख कीन्हा॥\n", 118 | "दीन्हेसि सिर जो सँवारै पागा । दीन्हेसि कया जो पहिरै बागा॥\n", 119 | "दीन्हेसि नयन जोति, उजियारा । दीन्हेसि देखै कहँ संसारा॥\n", 120 | "दीन्हेसि स्रवन बात जेहि सुनै । दीन्हेसि बुध्दि, ज्ञान बहु गुनै॥\n", 121 | "दीन्हेसि नासिक लीजै बासा । दीन्हेसि सुमन सुगंधा बिरासा॥\n", 122 | "दीन्हेसि जीभ बैन रस भाखै । दीन्हेसि भुगुति, साधा सब राखै॥\n", 123 | "दीन्हेसि दसन, सुरग कपोला । दीन्हेसि अधार जे रचैं तँबोला॥\n", 124 | "दीन्हेसि बदन सुरूप रँग, दीन्हेसि माथे भाग।\n", 125 | "देखि दयाल, 'मुहम्मद', सीस नाइ पद लाग॥1॥\n", 126 | "\n", 127 | "दीन्हेसि कंठ बोल जेहि माहाँ । दीन्हेसि भुजादंड, बल बाहाँ॥\n", 128 | "दीन्हेसि हिया भोग जेहि जमा । दीन्हेसि पाँच भूत, आतमा॥\n", 129 | "दीन्हेसि बदन सीत औ घामू । दीन्हेसि सुक्ख नींद बिसरामू॥\n", 130 | "दीन्हेसि हाथ चाह जस कीजै । दीन्हेसि कर पल्लव गहि लीजै॥\n", 131 | "दीन्हेसि रहस कूद बहुतेरा । दीन्हेसि हरष हिया बहु मेरा॥\n", 132 | "दीन्हेसि बैठक आसन मारै । दीन्हेसि बूत जो उठें सँभारैं॥\n", 133 | "दीन्हेसि सबै सँपूरन काया । दीन्हेसि होइ चलै कहँ पाया॥\n", 134 | "दीन्हेसि नौ नौ फाटका, दीन्हेसि दसवँ दुवार।\n", 135 | "सो अस दानि 'मुहम्मद' तिन्ह कै हौं बलिहार॥2॥\n", 136 | "\n", 137 | "\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "print(hindi_text[:990])" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "## Tokenizing Sentences" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Let us tokenize the sentences in `hindi_text`." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "name": "stdout", 166 | "output_type": "stream", 167 | "text": [ 168 | "['आखरी', 'कलाम\\n\\nपहिले', 'नावँ', 'दैउ', 'कर', 'लीन्हा', '।', 'जेंइ', 'जिउ', 'दीन्ह', ',', 'बोल', 'मुख', 'कीन्हा', '॥', '\\nदीन्हेसि', 'सिर', 'जो', 'सँवारै', 'पागा', '।', 'दीन्हेसि', 'कया', 'जो', 'पहिरै', 'बागा', '॥', '\\nदीन्हेसि', 'नयन', 'जोति', ',', 'उजियारा', '।', 'दीन्हेसि', 'देखै', 'कहँ', 'संसारा', '॥', '\\nदीन्हेसि', 'स्रवन', 'बात', 'जेहि', 'सुनै', '।', 'दीन्हेसि', 'बुध्दि', ',', 'ज्ञान', 'बहु', 'गुनै']\n" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "from cltk.tokenize.sentence import TokenizeSentence\n", 174 | "hindi_tokenizer = TokenizeSentence('hindi')\n", 175 | "hindi_tokens = hindi_tokenizer.tokenize(hindi_text)\n", 176 | "print(hindi_tokens[:50])" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "## Stopword filtering" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Stopwords list for Hindi can be found at `stop` module of cltk." 
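,
"\n",
"The stop list is a plain Python list, so you can extend it before filtering if you also want to drop punctuation such as the danda marks seen in the token list above. A small sketch (the extra entries are our own additions, not part of CLTK's list):\n",
"\n",
"``` python\n",
"from cltk.stop.classical_hindi.stops import STOPS_LIST\n",
"custom_stops = set(STOPS_LIST) | {'।', '॥', ','}  # assumed extras: treat punctuation tokens as stop words\n",
"```"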
191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 8, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "['हें', 'है', 'हैं', 'हि', 'ही', 'हो', 'हे', 'से', 'अत', 'के']\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "from cltk.stop.classical_hindi.stops import STOPS_LIST\n", 208 | "print (STOPS_LIST[:10])" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "Let us filter the `hindi_tokens` list for words that are not stop words." 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 9, 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "['आखरी', 'कलाम\\n\\nपहिले', 'नावँ', 'दैउ', 'कर']\n" 228 | ] 229 | } 230 | ], 231 | "source": [ 232 | "hindi_tokens_no_stop = [token for token in hindi_tokens if token not in STOPS_LIST]\n", 233 | "print(hindi_tokens_no_stop[:5])" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 10, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "6404\n", 246 | "5987\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "print(len(hindi_tokens))\n", 252 | "print(len(hindi_tokens_no_stop))" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "As one can see, `hindi_tokens` had 6404 tokens whereas `hindi_tokens_no_stop` has 5987" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "## Swadesh list for Hindi" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "Swadesh list for Hindi can be obtained as follows:" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 11, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "['मैं', 'तू', 'वह', 'हम', 'तुम', 'वे', 'यह', 'वह', 'यहाँ', 'वहाँ', 'कौन', 'क्या', 'कहाँ', 'कब', 'कैसा', 'नहीं', 'सब', 'बहुत', 'कुछ', 'थोड़ा', 'दूसरा', 'एक', 'दो', 'तीन', 'चार', 'पाँच', 'बड़ा', 'लम्बा', 'चौड़ा', 'गाढ़ा', 'भारी', 'छोटा', 'छोटा', 'तंग', 'पतला', 'औरत', 'आदमी', 'इंसान', 'बच्चा', 'पत्नी', 'पति', 'माता', 'पिता', 'जानवर', 'मछली', 'चिड़िया', 'कुत्ता', 'जूँ', 'साँप', 'कीड़ा', 'पेड़', 'जंगल', 'डण्डा', 'फल', 'बीज', 'पत्ता', 'जड़', 'छाल', 'फूल', 'घास', 'रस्सी', 'त्वचा', 'माँस', 'ख़ून', 'हड्डी', 'चरबी', 'अंडा', 'सींग', 'पूँछ', 'पंख', 'बाल', 'सर', 'कान', 'आँख', 'नाक', 'मुँह', 'दाँत', 'जीभ', 'नाख़ुन', 'पैर', 'टांग', 'घुटना', 'हाथ', 'पंख', 'पेट', 'अंतड़ी', 'गरदन', 'पीठ', 'छाती', 'दिल', 'जिगर', 'पीना', 'खाना', 'काटना', 'चूसना', 'थूकना', 'उल्टी करना', 'फूँक मारना', 'साँस लेना', 'हँसना', 'देखना', 'सुनना', 'जानना', 'सोचना', 'सूंघना', '(से) डरना ((se) ḍarnā', 'सोना', 'जीना', 'मरना', 'मारना', 'लड़ना', 'शिकार करना', 'मारना', 'काटना', 'बंटना', 'भोंकना', 'खरोंचना', 'खोदना', 'तैरना', 'उड़ना', 'चलना', 'आना', 'लेटना', 'बैठना', 'खड़ा होना', 'मुड़ना', 'गिरना', 'देना', 'पकड़ना', 'घुसा देना', 'मलना', 'धोना', 'पोंछना', 'खींचना', 'धक्का देना', 'फेंकना', 'बाँधना', 'सीना', 'गिनना', 'कहना', 'गाना', 'खेलना', 'तैरना', 'बहना', 'जमना', 'सूजना', 'सूरज', 'चांद', 'तारा', 'पानी', 'बारिश', 'नदी', 'झील', 'समन्दर', 'नमक', 'पत्थर', 'रेत', 'धूल', 'धरती', 'बादल', 'धुंध', 'आसमान', 'हवा', 'बर्फ़', 'बर्फ़', 'धुआँ', 'आग', 'राख', 'जलना', 'सड़क', 'पहाड़', 'लाल', 'हरा', 'पीला', 'सफ़ेद', 'काला', 'रात', 
'दिन', 'साल', 'गर्म', 'ठंडा', 'पूरा', 'नया', 'पुराना', 'अच्छा', 'बुरा', 'सड़ा', 'गन्दा', 'सीधा', 'गोल', 'तीखा', 'कुंद', 'चिकना', 'गीला', 'सूखा', 'सही', 'नज़दीक', 'दूर', 'दायाँ', 'बायाँ', 'पे', 'में', 'के साथ', 'और', 'अगर', 'क्योंकि', 'नाम']\n" 286 | ] 287 | } 288 | ], 289 | "source": [ 290 | "from cltk.corpus.swadesh import Swadesh\n", 291 | "swadesh_list = Swadesh('hi')\n", 292 | "print(swadesh_list.words())" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "## Transliterations" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "We can transliterate Hindi scripts to that of other Indic languages. Let us transliterate ` फूल `to Malayalam:" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 12, 312 | "metadata": {}, 313 | "outputs": [ 314 | { 315 | "data": { 316 | "text/plain": [ 317 | "' ഫൂല '" 318 | ] 319 | }, 320 | "execution_count": 12, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 327 | "UnicodeIndicTransliterator.transliterate(' फूल ',\"hi\",\"ml\")" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "We can also romanize the text as shown:" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 13, 340 | "metadata": {}, 341 | "outputs": [ 342 | { 343 | "data": { 344 | "text/plain": [ 345 | "'paDha़naa eka achChii aadata hai.'" 346 | ] 347 | }, 348 | "execution_count": 13, 349 | "metadata": {}, 350 | "output_type": "execute_result" 351 | } 352 | ], 353 | "source": [ 354 | "hindi_text_two = 'पढ़ना एक अच्छी आदत है।'\n", 355 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 356 | "ItransTransliterator.to_itrans(hindi_text_two,'hi')" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 14, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "'भाषा विचारों को व्यक्त करने का माध्यम है'" 375 | ] 376 | }, 377 | "execution_count": 14, 378 | "metadata": {}, 379 | "output_type": "execute_result" 380 | } 381 | ], 382 | "source": [ 383 | "hindi_text_itrans = 'bhaashhaa wichaaro.m ko wyakta karane kaa maadhyama hai'\n", 384 | "ItransTransliterator.from_itrans(hindi_text_itrans,'hi')" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "## Syllabifier" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "We can use the `indian_syllabifier` to syllabify the Hindi sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
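,
"\n",
"If you have already run another of the tutorials in this series, the models may be on disk already. A quick check like the following (a sketch that assumes the default `cltk_data` install location) lets you skip the re-download:\n",
"\n",
"``` python\n",
"import os\n",
"models_path = os.path.join(USER_PATH, 'cltk_data/sanskrit/model/sanskrit_models_cltk')  # assumed path\n",
"print(os.path.isdir(models_path))  # True once the models have been imported\n",
"```"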
399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 15, 404 | "metadata": { 405 | "scrolled": true 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 410 | "phonetics_model_importer.list_corpora\n", 411 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "Now we import the syllabifier and syllabify as follows:" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 16, 424 | "metadata": {}, 425 | "outputs": [], 426 | "source": [ 427 | "%%capture\n", 428 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 429 | "hindi_syllabifier = Syllabifier('hindi')\n", 430 | "hindi_syllables = hindi_syllabifier.orthographic_syllabify('पुस्तकालय')" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "The syllables of the word पुस्तकालय will thus be:" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 17, 443 | "metadata": {}, 444 | "outputs": [ 445 | { 446 | "name": "stdout", 447 | "output_type": "stream", 448 | "text": [ 449 | "['पु', 'स्त', 'का', 'ल', 'य']\n" 450 | ] 451 | } 452 | ], 453 | "source": [ 454 | "print(hindi_syllables)" 455 | ] 456 | } 457 | ], 458 | "metadata": { 459 | "kernelspec": { 460 | "display_name": "Python 3", 461 | "language": "python", 462 | "name": "python3" 463 | }, 464 | "language_info": { 465 | "codemirror_mode": { 466 | "name": "ipython", 467 | "version": 3 468 | }, 469 | "file_extension": ".py", 470 | "mimetype": "text/x-python", 471 | "name": "python", 472 | "nbconvert_exporter": "python", 473 | "pygments_lexer": "ipython3", 474 | "version": "3.6.5" 475 | } 476 | }, 477 | "nbformat": 4, 478 | "nbformat_minor": 2 479 | } 480 | -------------------------------------------------------------------------------- /languages/south_asia/Kannada_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Kannada with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Analyse Kannada texts with CLTK!
\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Kannada Alphabets" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "There are 14 Swaras or vowels, 25 Structured and 11 Unstructured consonants collectively known as Vynjanas and 2 Yogavaahakas
\n", 29 | "A Consonant plus Vowel symbol makes a kagunita" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "metadata": {}, 36 | "outputs": [ 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | "Vowels: ['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ', 'ಊ', 'ಋ', 'ೠ', 'ಎ', 'ಏ', 'ಐಒ', 'ಒ', 'ಓ', 'ಔ']\n", 42 | "Yogavaahakas: ['ಅಂ', 'ಅಃ']\n", 43 | "Structured consonants: ['ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙಚ', 'ಚ', 'ಛ', 'ಜ', 'ಝ', 'ಞ', 'ಟ', 'ಠ', 'ಡ', 'ಢ', 'ಣ', 'ತ', 'ಥ', 'ದ', 'ಧ', 'ನ', 'ಪ', 'ಫ', 'ಬ', 'ಭ', 'ಮ']\n", 44 | "Unstructured consonants: ['ಯ', 'ರ', 'ಱ', 'ಲ', 'ವ', 'ಶ', 'ಷ', 'ಸ', 'ಹ', 'ಳ', 'ೞ']\n", 45 | "Numerals: ['೦', '೧', '೨', '೩', '೪', '೫', '೬', '೭', '೮', '೯']\n", 46 | "Vowel signs: ['', 'ಾ', 'ಿ', 'ೀ', 'ು', 'ೂ', 'ೃ', 'ೆ', 'ೇ', 'ೈ', 'ೊ', 'ೋ', 'ೌ', 'ಂ', 'ಃ']\n" 47 | ] 48 | } 49 | ], 50 | "source": [ 51 | "from cltk.corpus.kannada.alphabet import *\n", 52 | "print(\"Vowels: \", VOWELS)\n", 53 | "print(\"Yogavaahakas: \", YOGAVAAHAKAS)\n", 54 | "print(\"Structured consonants: \",STRUCTURED_CONSONANTS)\n", 55 | "print(\"Unstructured consonants: \",UNSTRUCTURED_CONSONANTS)\n", 56 | "print(\"Numerals: \",NUMERALS)\n", 57 | "print(\"Vowel signs: \",VOWEL_SIGNS)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## Transliterations" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "We can transliterate Kannada scripts to that of other Indic languages. Let us take an example Kannada text and transliterate it to Hindi:" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "ಗ್ರಂಥಾಲಯಗಳು ಅರಿವಿನ ಜ್ಞಾನದೀವಿಗೆಗಳು. ಇಷ್ಟಪಟ್ಟು ಓದಲು ಬರುವವರಿಗೆ, ಜ್ಞಾನದ ಹೊಸ ಹೊಳಹನ್ನು ನೀಡುವ ಅಕ್ಷಯ ಭಂಡಾರಗಳು. ಗ್ರಂಥಾಲಯಗಳ ಸಂಪನ್ಮೂಲಗಳು ಎಂದಿಗೂ ಎಲ್ಲಿಯೂ ಬತ್ತಿಹೋಗುವುದಿಲ್ಲ. ಪ್ರಾಚೀನ ಕಾಲದಲ್ಲಿ ಮುದ್ರಾಣಾಲಯಗಳಿರಲಿಲ್ಲ. ಆದ್ದರಿಂದ ಜ್ಞಾನವನ್ನು ಸಂಪಾದಿಸಲು ಬಹಳ ಕಷ್ಟಪಡಬೇಕಾಗುತ್ತಿತ್ತು. ಈಗ ಗ್ರಂಥಗಳು ನಮಗೆ ಬೇಕಾದ ವಿಷಯಗಳನ್ನು ತಿಳಿಸಲು ಸಿದ್ಧವಿರುವುವು. ನಾವು ಬೇಕಾದಾಗ ಗ್ರಂಥಾಲಯಕ್ಕೆ ಹೋಗಿ ಬೇಕಾದ ಗ್ರಂಥಗಳನ್ನು ಓದಿ ಜ್ಞಾನ ಪಡೆಯಬಹುದು.\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "kannada_text = \"ಗ್ರಂಥಾಲಯಗಳು ಅರಿವಿನ ಜ್ಞಾನದೀವಿಗೆಗಳು. ಇಷ್ಟಪಟ್ಟು ಓದಲು ಬರುವವರಿಗೆ, ಜ್ಞಾನದ ಹೊಸ ಹೊಳಹನ್ನು ನೀಡುವ ಅಕ್ಷಯ ಭಂಡಾರಗಳು. ಗ್ರಂಥಾಲಯಗಳ ಸಂಪನ್ಮೂಲಗಳು ಎಂದಿಗೂ ಎಲ್ಲಿಯೂ ಬತ್ತಿಹೋಗುವುದಿಲ್ಲ. ಪ್ರಾಚೀನ ಕಾಲದಲ್ಲಿ ಮುದ್ರಾಣಾಲಯಗಳಿರಲಿಲ್ಲ. ಆದ್ದರಿಂದ ಜ್ಞಾನವನ್ನು ಸಂಪಾದಿಸಲು ಬಹಳ ಕಷ್ಟಪಡಬೇಕಾಗುತ್ತಿತ್ತು. ಈಗ ಗ್ರಂಥಗಳು ನಮಗೆ ಬೇಕಾದ ವಿಷಯಗಳನ್ನು ತಿಳಿಸಲು ಸಿದ್ಧವಿರುವುವು. ನಾವು ಬೇಕಾದಾಗ ಗ್ರಂಥಾಲಯಕ್ಕೆ ಹೋಗಿ ಬೇಕಾದ ಗ್ರಂಥಗಳನ್ನು ಓದಿ ಜ್ಞಾನ ಪಡೆಯಬಹುದು.\"\n", 89 | "print(kannada_text)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "metadata": {}, 96 | "outputs": [ 97 | { 98 | "data": { 99 | "text/plain": [ 100 | "'ग्रंथालयगळु अरिविन ज्ञानदीविगॆगळु. इष्टपट्टु ओदलु बरुववरिगॆ, ज्ञानद हॊस हॊळहन्नु नीडुव अक्षय भंडारगळु. ग्रंथालयगळ संपन्मूलगळु ऎंदिगू ऎल्लियू बत्तिहोगुवुदिल्ल. प्राचीन कालदल्लि मुद्राणालयगळिरलिल्ल. आद्दरिंद ज्ञानवन्नु संपादिसलु बहळ कष्टपडबेकागुत्तित्तु. ईग ग्रंथगळु नमगॆ बेकाद विषयगळन्नु तिळिसलु सिद्धविरुवुवु. 
नावु बेकादाग ग्रंथालयक्कॆ होगि बेकाद ग्रंथगळन्नु ओदि ज्ञान पडॆयबहुदु.'" 101 | ] 102 | }, 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "output_type": "execute_result" 106 | } 107 | ], 108 | "source": [ 109 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 110 | "UnicodeIndicTransliterator.transliterate(kannada_text,\"kn\",\"hi\")" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "We can also romanize the text as shown:" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "'shraddh.e mattu shramawu pratibh.eyannu solisabahudu'" 129 | ] 130 | }, 131 | "execution_count": 4, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "kannada_text_two = \"ಶ್ರದ್ಧೆ ಮತ್ತು ಶ್ರಮವು ಪ್ರತಿಭೆಯನ್ನು ಸೋಲಿಸಬಹುದು\"\n", 138 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 139 | "ItransTransliterator.to_itrans(kannada_text_two,'kn')\n" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 5, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "'ಪ್ರಾಚೀನ ಗ್ರಂಥಾಲಯಗಳು ಕೇವಲ ಹಸ್ತಪ್ರತಿ, ತಾಳೇಗರಿ, ಚರ್ಮಪಟ್ಟಿ ಮೊದಲಾದುವುಗಳ ಸಂಗ್ರಹಗಳಾಗಿದ್ದುವು'" 158 | ] 159 | }, 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "kannada_text_itrans = 'praachiina gra.mthaalayagaldu kewala hastaprati, taaldegari, charmapaTTi m.odalaaduwugalda sa.mgrahagaldaagidduwu'\n", 167 | "ItransTransliterator.from_itrans(kannada_text_itrans,'kn')" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "## Syllabifier" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "We can use the indian_syllabifier to syllabify the Kannada sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
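,
"\n",
"Incidentally, the kagunita combinations mentioned in the alphabet section above can be formed directly from the exported lists. A small sketch (how the combined glyphs render depends on your font):\n",
"\n",
"``` python\n",
"from cltk.corpus.kannada.alphabet import STRUCTURED_CONSONANTS, VOWEL_SIGNS\n",
"ka_row = [STRUCTURED_CONSONANTS[0] + sign for sign in VOWEL_SIGNS]  # ಕ combined with each vowel sign\n",
"print(ka_row)\n",
"```"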
182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 6, 187 | "metadata": { 188 | "scrolled": true 189 | }, 190 | "outputs": [], 191 | "source": [ 192 | "from cltk.corpus.utils.importer import CorpusImporter\n", 193 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 194 | "phonetics_model_importer.list_corpora\n", 195 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "Now we import the syllabifier and syllabify as follows:" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 7, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "%%capture\n", 212 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 213 | "kannada_syllabifier = Syllabifier('kannada')\n", 214 | "kannada_syllables = kannada_syllabifier.orthographic_syllabify('ಹಸ್ತಪ್ರತಿ')" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "The syllables of the word ಹಸ್ತಪ್ರತಿ will thus be:" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 8, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "['ಹ', 'ಸ್ತ', 'ಪ್ರ', 'ತಿ']\n" 234 | ] 235 | } 236 | ], 237 | "source": [ 238 | "print(kannada_syllables)" 239 | ] 240 | } 241 | ], 242 | "metadata": { 243 | "kernelspec": { 244 | "display_name": "Python 3", 245 | "language": "python", 246 | "name": "python3" 247 | }, 248 | "language_info": { 249 | "codemirror_mode": { 250 | "name": "ipython", 251 | "version": 3 252 | }, 253 | "file_extension": ".py", 254 | "mimetype": "text/x-python", 255 | "name": "python", 256 | "nbconvert_exporter": "python", 257 | "pygments_lexer": "ipython3", 258 | "version": "3.6.5" 259 | } 260 | }, 261 | "nbformat": 4, 262 | "nbformat_minor": 2 263 | } 264 | -------------------------------------------------------------------------------- /languages/south_asia/Malayalam_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Malayalam with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Use CLTK to analyze your Malayalam texts.\n", 15 | "
Let us start by setting `USER_PATH`" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Let us try to download a Malayalam corpora that is available remotely at CLTK's Github repo. To do this, first we will need an importer." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "malayalam_downloader = CorpusImporter('malayalam')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Once we have this, we can check the corpora available for download as follows:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "['malayalam_text_gretil']\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "print(malayalam_downloader.list_corpora)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "Let us now download the malayalam_text_gretil corpus. " 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "malayalam_downloader.import_corpus('malayalam_text_gretil')" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "The corpus has been downloaded to a directory in `cltk_data` , which resides in the `USER_PATH`. Let us open the text, Jyotsnika." 
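,
"\n",
"If you are not sure where a particular file sits inside the GRETIL tree, a short search will locate it. This is a sketch; the filename `jyotsniu.txt` is simply the one used in the next cells:\n",
"\n",
"``` python\n",
"import os\n",
"corpus_root = os.path.join(USER_PATH, 'cltk_data/malayalam/text/malayalam_text_gretil')\n",
"matches = [os.path.join(root, name)\n",
"           for root, dirs, files in os.walk(corpus_root)\n",
"           for name in files if name == 'jyotsniu.txt']\n",
"print(matches)\n",
"```"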
90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 5, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "malayalam_corpus_path = os.path.join(USER_PATH, 'cltk_data/malayalam/text/malayalam_text_gretil/text')\n", 99 | "malayalam_text_path = os.path.join(malayalam_corpus_path,'jyotsniu.txt')\n", 100 | "malayalam_text = open(malayalam_text_path,'r').read()" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 6, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stdout", 110 | "output_type": "stream", 111 | "text": [ 112 | "\n", 113 | "jyōtsnikā viṣavaidyaṃ\n", 114 | "\n", 115 | "[1]abhivandanādhikāraṃ\n", 116 | "hariḥ ṣrī gaṇapatayē namaḥ\n", 117 | "avighnamastu\n", 118 | "\n", 119 | "maṃgaḷaṃ\n", 120 | "vandē varadamācāryyamantarāyōpaśāntayē /\n", 121 | "gaṇanāthaṃ ca gōvindaṃ kumārakamalōtbhavau // Jyo_1.1 //\n", 122 | "\n", 123 | "muṭiyil tiṅkaḷuṃ pāṃpuṃ maṭiyil gauriyuṃ sadā /\n", 124 | "kuṭi koṇṭoru dēvantannaṭiyāṃ paṅkajam bhajē // Jyo_1.2 //\n", 125 | "\n", 126 | "gatvā svarggamatandritassuravaraṃ\n", 127 | "jitvā sudhāṃ bāhubhirddhṛtvā\n", 128 | "mātaramētya vidrutataraṃ\n", 129 | "datvāśu tasyai tataḥ\n", 130 | "hṛtvā dāsyamanēkakadrutanayān\n", 131 | "hatvā muhurmmātaraṃ\n", 132 | "natvā yastu virājatē tamaniśaṃ\n", 133 | "vandē khagādhīśvaraṃ // Jyo_1.3 //\n", 134 | "\n", 135 | "yēnaviṣṇōrddhvajaṃ sākṣādrājatē paramātmanaḥ /\n", 136 | "tasmai namōstu satataṃ garuḍāya mahātmanē // Jyo_1.4 //\n", 137 | "\n", 138 | "\n", 139 | "pratijñā\n", 140 | "viṣapīḍitarāyuḷḷa narāṇāṃ hitasiddhayē /\n", 141 | "taccikitsāṃ pravakṣyāmi prasannāstu sarasvatī // Jyo_1.5 //\n", 142 | "\n", 143 | "gurudēvadvijātīnāṃ bhaktaḥ śuddhō dayāparaḥ /\n", 144 | "svakarmmābhirataḥ kuryyāl garapīḍitarakṣaṇaṃ // Jyo_1.6 //\n", 145 | "\n", 146 | "tathā bahujanadrōhaṃ ceyvōnuṃ brahmahāvinuṃ /\n", 147 | "svadharmmācāramaryyādāhīnanuṃ dviṣatāmapi // Jyo_1.7 //\n", 148 | "\n", 149 | "kṛtaghnabhīruśōkārttacaṇḍānāṃ vyagracētasāṃ /\n", 150 | "gatāyuṣmānumavvaṇṇamavidhēyanumaṅṅine // Jyo_1.8 //\n", 151 | "\n", 152 | "\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "print(malayalam_text[1930:2998]) #indices adjusted" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "## Transliterations" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Transliterations of Malayalam text from Malayalam to other scripts, indicizing ITRANS-transliteration and romanizing Malayalam script can be done. Let us convert a sample text from Malayalam to Hindi." 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 7, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/plain": [ 182 | "'कायिक'" 183 | ] 184 | }, 185 | "execution_count": 7, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 192 | "malayalam_text_two = 'കായിക'\n", 193 | "UnicodeIndicTransliterator.transliterate(malayalam_text_two,'ml','hi') ##transliterating to hindi" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "Now, let us try transliterating ITRANS-transliteration of `അവിഘ്നമസ്തു` to Malayalam.." 
201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 8, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "'അവിഘ്നമസ്തു'" 212 | ] 213 | }, 214 | "execution_count": 8, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 221 | "ItransTransliterator.from_itrans('avighnamastu','ml')" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "Similiarly, we can romanize the Malayalam words as follows:" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 9, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "'tasyai'" 240 | ] 241 | }, 242 | "execution_count": 9, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 249 | "ItransTransliterator.to_itrans('തസ്യൈ','ml')" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "## Syllabifier" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "We can use the indian_syllabifier to syllabify the Malayalam sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 10, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 273 | "phonetics_model_importer.list_corpora\n", 274 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "Now we import the syllabifier and syllabify as follows:" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 11, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "%%capture\n", 291 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 292 | "malayalam_syllabifier = Syllabifier('malayalam')\n", 293 | "malayalam_syllables = malayalam_syllabifier.orthographic_syllabify('ജാലവിദ്യ')" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "The syllables of the word ജാലവിദ്യ will thus be:" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 12, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "name": "stdout", 310 | "output_type": "stream", 311 | "text": [ 312 | "['ജാ', 'ല', 'വി', 'ദ്യ']\n" 313 | ] 314 | } 315 | ], 316 | "source": [ 317 | "print(malayalam_syllables)" 318 | ] 319 | } 320 | ], 321 | "metadata": { 322 | "kernelspec": { 323 | "display_name": "Python 3", 324 | "language": "python", 325 | "name": "python3" 326 | }, 327 | "language_info": { 328 | "codemirror_mode": { 329 | "name": "ipython", 330 | "version": 3 331 | }, 332 | "file_extension": ".py", 333 | "mimetype": "text/x-python", 334 | "name": "python", 335 | "nbconvert_exporter": "python", 336 | "pygments_lexer": "ipython3", 337 | "version": "3.6.5" 338 | } 339 | }, 340 | "nbformat": 4, 341 | "nbformat_minor": 2 342 | } 343 | -------------------------------------------------------------------------------- /languages/south_asia/Marathi_tutorial.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Marathi with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Analyse marathi texts using CLTK!
\n", 15 | "Firstly, we need to add the path where our corpora will reside." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Before we begin analysing the texts, we will need to download the marathi corpora, for which, we will be using an Importer. Call the importer to download marathi texts, as follows.. " 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "marathi_corpus_importer = CorpusImporter('marathi')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "You can view which corpora to download by calling list_corpora() method" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['marathi_text_wikisource']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "marathi_corpus_importer.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "marathi_corpus_importer.import_corpus('marathi_text_wikisource');" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "It can be verified that the `marathi_text_wikisource` corpus is downloaded in a `cltk_data/marathi/text` folder which at the path given by `USER_PATH`. It is now possible to analyse the texts within. See what datasets are available as shown:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "['dnyaneshwari', 'haripath']\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "marathi_corpus_path = os.path.join(USER_PATH,'cltk_data/marathi/text/marathi_text_wikisource/datasets')\n", 103 | "print(os.listdir(marathi_corpus_path))" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "For this tutorial, let us analyse the texts by poet Dnyaneshwari, which is at the path as shown." 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 6, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "marathi_text_path = os.path.join(marathi_corpus_path,'dnyaneshwari')\n", 120 | "marathi_chapters = []\n", 121 | "for filename in os.listdir(marathi_text_path):\n", 122 | " with open(os.path.join(marathi_text_path,filename),'r') as file:\n", 123 | " chapter_text = file.read()\n", 124 | " marathi_chapters.append(chapter_text)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Let us see take the first 1005 characters of the first chapter for the analysis.." 
132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 7, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | " ॥ ॐ श्री परमात्मने नमः ॥ ॥ अथ श्रीमद्भगवद्गीता ॥ । अश्टादशोऽध्यायः - अध्याय अठरावा । । । मोक्षसंज्ञासयोगः ।\n", 144 | "जयजय देव निर्मळ । निजजनाखिलमंगळ । जन्मजराजलदजाळ । प्रभंजन ॥ १ ॥ जयजय देव प्रबळ । विदळितामंगळकुळ । निगमागमद्रुमफळ । फलप्रद ॥ २ ॥ जयजय देव सकल । विगतविषयवत्सल । कलितकाळकौतूहल । कलातीत ॥ ३ ॥ जयजय देव निश्चळ । चलितचित्तपानतुंदिल । जगदुन्मीलनाविरल । केलिप्रिय ॥ ४ ॥ जयजय देव निष्कळ । स्फुरदमंदानंदबहळ । नित्यनिरस्ताखिलमळ । मूळभूत ॥ ५ ॥ जयजय देव स्वप्रभ । जगदंबुदगर्भनभ । भुवनोद्भवारंभस्तंभ । भवध्वंस ॥ ६ ॥ जयजय देव विशुद्ध । विदुदयोद्यानद्विरद । शमदम\\-मदनमदभेद । दयार्णव ॥ ७ ॥ जयजय देवैकरूप । अतिकृतकंदर्पसर्पदर्प । भक्तभावभुवनदीप । तापापह ॥ ८ ॥ जयजय देव अद्वितीय । परीणतोपरमैकप्रिय । निजजनजित भजनीय । मायागम्य ॥ ९ ॥ जयजय देव श्रीगुरो । अकल्पनाख्यकल्पतरो । स्वसंविद्रुमबीजप्ररो । हणावनी ॥ १० ॥ हे काय एकैक ऐसैसें । नानापरीभाषावशें । स्तोत्र करूं तुजोद्देशें । निर्विशेषा ॥ ११ ॥ जिहींं विशेषणीं विशेषिजे । तें दृश्य नव्हे रूप तुझें । हें जाणें मी म्हणौनि लाजें । वानणा इहीं ॥ १२ ॥ परी मर्यादेचा सागरु ।\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "marathi_text = marathi_chapters[0]\n", 150 | "print(marathi_text[:1005])" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "## Tokenizing Sentences" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Let us tokenize the sentences in `marathi_text`." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 8, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "['॥', 'ॐ', 'श्री', 'परमात्मने', 'नमः', '॥', '॥', 'अथ', 'श्रीमद्भगवद्गीता', '॥', '।', 'अश्टादशोऽध्यायः', '-', 'अध्याय', 'अठरावा', '।', '।', '।', 'मोक्षसंज्ञासयोगः', '।', '\\nजयजय', 'देव', 'निर्मळ', '।', 'निजजनाखिलमंगळ', '।', 'जन्मजराजलदजाळ', '।', 'प्रभंजन', '॥', '१', '॥', 'जयजय', 'देव', 'प्रबळ', '।', 'विदळितामंगळकुळ', '।', 'निगमागमद्रुमफळ', '।', 'फलप्रद', '॥', '२', '॥', 'जयजय', 'देव', 'सकल', '।', 'विगतविषयवत्सल', '।']\n" 177 | ] 178 | } 179 | ], 180 | "source": [ 181 | "from cltk.tokenize.sentence import TokenizeSentence\n", 182 | "marathi_tokenizer = TokenizeSentence('marathi')\n", 183 | "marathi_tokens = marathi_tokenizer.tokenize(marathi_text)\n", 184 | "print(marathi_tokens[:50])" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "## Stopword filtering" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "Stopwords list for marathi can be found at `stop` module of cltk." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 9, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "['न', 'तरी', 'तो', 'हें', 'तें', 'कां', 'आणि', 'जें', 'जे', 'मग']\n" 211 | ] 212 | } 213 | ], 214 | "source": [ 215 | "from cltk.stop.marathi.stops import STOP_LIST\n", 216 | "print (STOP_LIST[:10])" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Let us filter the `marathi_tokens` list for words that are not stop words." 
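,
"\n",
"Once the filtered list `marathi_tokens_no_stop` has been built in the next cell, a frequency count is a natural follow-up. A quick sketch using only the standard library:\n",
"\n",
"``` python\n",
"from collections import Counter\n",
"word_counts = Counter(marathi_tokens_no_stop)  # uses the variable defined in the cell below\n",
"print(word_counts.most_common(10))  # the ten most frequent non-stopword tokens\n",
"```"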
224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 10, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": "stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "['॥', 'ॐ', 'श्री', 'परमात्मने', 'नमः']\n" 236 | ] 237 | } 238 | ], 239 | "source": [ 240 | "marathi_tokens_no_stop = [token for token in marathi_tokens if token not in STOP_LIST]\n", 241 | "print(marathi_tokens_no_stop[:5])" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 11, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "name": "stdout", 251 | "output_type": "stream", 252 | "text": [ 253 | "33475\n", 254 | "27835\n" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "print(len(marathi_tokens))\n", 260 | "print(len(marathi_tokens_no_stop))" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "As one can see, `marathi_tokens` had 33475 tokens whereas `marathi_tokens_no_stop` has 27835" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "## Transliterations" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "We can transliterate marathi scripts to that of other Indic languages. Let us transliterate ` शब्दकोश `to Gujarati:" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 12, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "text/plain": [ 292 | "' શબ્દકોશ '" 293 | ] 294 | }, 295 | "execution_count": 12, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | } 299 | ], 300 | "source": [ 301 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 302 | "UnicodeIndicTransliterator.transliterate(' शब्दकोश ',\"mr\",\"gu\")" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "We can also romanize the text as shown:" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 13, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "data": { 319 | "text/plain": [ 320 | "'tulasii aushhadhii wanaspatii aahe'" 321 | ] 322 | }, 323 | "execution_count": 13, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "marathi_text_two = 'तुलसी औषधी वनस्पती आहे'\n", 330 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 331 | "ItransTransliterator.to_itrans(marathi_text_two,'mr')" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 14, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "data": { 348 | "text/plain": [ 349 | "'आपण दररोज एक पुस्तक वाचले पाहिजे।'" 350 | ] 351 | }, 352 | "execution_count": 14, 353 | "metadata": {}, 354 | "output_type": "execute_result" 355 | } 356 | ], 357 | "source": [ 358 | "marathi_text_itrans = 'aapaNa dararoja eka pustaka waachale paahije.'\n", 359 | "ItransTransliterator.from_itrans(marathi_text_itrans,'mr')" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "## Syllabifier" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "We can use the `indian_syllabifier` to syllabify the Marathi sentences. 
To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 15, 379 | "metadata": { 380 | "scrolled": true 381 | }, 382 | "outputs": [], 383 | "source": [ 384 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 385 | "phonetics_model_importer.list_corpora\n", 386 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "Now we import the syllabifier and syllabify as follows:" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 16, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "%%capture\n", 403 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 404 | "marathi_syllabifier = Syllabifier('marathi')\n", 405 | "marathi_syllables = marathi_syllabifier.orthographic_syllabify('इतिहास')" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "The syllables of the word इतिहास will thus be:" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 17, 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "name": "stdout", 422 | "output_type": "stream", 423 | "text": [ 424 | "['इ', 'ति', 'हा', 'स']\n" 425 | ] 426 | } 427 | ], 428 | "source": [ 429 | "print(marathi_syllables)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "## Marathi Alphabets" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "There are 13 vowels in Marathi, which can be printed out as follows:" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 18, 449 | "metadata": {}, 450 | "outputs": [ 451 | { 452 | "name": "stdout", 453 | "output_type": "stream", 454 | "text": [ 455 | "Vowels: ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अॅ', 'ऑ']\n", 456 | "IAST Representation of vowels: ['a', 'ā', 'i', 'ī', 'u', 'ū', 'ṛ', 'e', 'ai', 'o', 'au', 'ae', 'ao']\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "from cltk.corpus.marathi.alphabet import *\n", 462 | "print(\"Vowels: \", VOWELS)\n", 463 | "print(\"IAST Representation of vowels: \",IAST_REPRESENTATION_VOWELS)" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "Also, there are 25 consonnants divided into 5 groups or vargas :" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 19, 476 | "metadata": {}, 477 | "outputs": [ 478 | { 479 | "name": "stdout", 480 | "output_type": "stream", 481 | "text": [ 482 | "Velar consonants: ['क', 'ख', 'ग', 'घ', 'ङ']\n", 483 | "IAST Representation of Velar consonants: ['k', 'kh', 'g', 'gh', 'ṅ']\n", 484 | "\n", 485 | "Palatal consonants: ['च', 'छ', 'ज', 'झ', 'ञ']\n", 486 | "IAST Representation of Palatal consonants: ['c', 'ch', 'j', 'jh', 'ñ']\n", 487 | "\n", 488 | "Retroflex consonants: ['ट', 'ठ', 'ड', 'ढ', 'ण']\n", 489 | "IAST Representation of Retroflex consonants: ['ṭ', 'ṭh', 'ḍ', 'ḍh', 'ṇ']\n", 490 | "\n", 491 | "Dental consonants: ['त', 'थ', 'द', 'ध', 'न']\n", 492 | "IAST Representation of Dental consonants: ['t', 'th', 'd', 'dh', 'n']\n", 493 | "\n", 494 | "Labial consonants: ['प', 'फ', 'ब', 'भ', 'म']\n", 495 | "IAST Representation of Labial consonants: ['p', 'ph', 'b', 'bh', 'm']\n" 496 | ] 497 | } 498 | ], 499 | "source": [ 500 | "print(\"Velar 
consonants:\",VELAR_CONSONANTS)\n", 501 | "print(\"IAST Representation of Velar consonants:\",IAST_VELAR_CONSONANTS)\n", 502 | "print(\"\\nPalatal consonants:\",PALATAL_CONSONANTS)\n", 503 | "print(\"IAST Representation of Palatal consonants:\",IAST_PALATAL_CONSONANTS)\n", 504 | "print(\"\\nRetroflex consonants:\",RETROFLEX_CONSONANTS)\n", 505 | "print(\"IAST Representation of Retroflex consonants:\",IAST_RETROFLEX_CONSONANTS)\n", 506 | "print(\"\\nDental consonants:\",DENTAL_CONSONANTS)\n", 507 | "print(\"IAST Representation of Dental consonants:\",IAST_DENTAL_CONSONANTS)\n", 508 | "print(\"\\nLabial consonants:\",LABIAL_CONSONANTS)\n", 509 | "print(\"IAST Representation of Labial consonants:\",IAST_LABIAL_CONSONANTS)" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "There are 4 semi-vowels, 3 sibilants, 1 fricative and 3 additional consonants in Marathi." 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 20, 522 | "metadata": {}, 523 | "outputs": [ 524 | { 525 | "name": "stdout", 526 | "output_type": "stream", 527 | "text": [ 528 | "Semi-vowels: ['य', 'र', 'ल', 'व']\n", 529 | "IAST Representation of Semi-vowels: ['y', 'r', 'l', 'w']\n", 530 | "\n", 531 | "Sibilants ['श', 'ष', 'स']\n", 532 | "IAST Representation of Sibilants ['ś', 'ṣ', 's']\n", 533 | "\n", 534 | "Fricative consonants: ['ह']\n", 535 | "IAST Representation of Fricative consonants: ['h']\n", 536 | "\n", 537 | "Additional consonants: ['ळ', 'क्ष', 'ज्ञ']\n", 538 | "IAST Representation of Additional consonants: ['La', 'kSha', 'dnya']\n" 539 | ] 540 | } 541 | ], 542 | "source": [ 543 | "print(\"Semi-vowels: \",SEMI_VOWELS)\n", 544 | "print(\"IAST Representation of Semi-vowels: \",IAST_SEMI_VOWELS)\n", 545 | "\n", 546 | "print(\"\\nSibilants\",SIBILANTS)\n", 547 | "print(\"IAST Representation of Sibilants\",IAST_SIBILANTS)\n", 548 | "\n", 549 | "print(\"\\nFricative consonants:\",FRIACTIVE_CONSONANTS)\n", 550 | "print(\"IAST Representation of Fricative consonants:\",IAST_FRIACTIVE_CONSONANTS)\n", 551 | "\n", 552 | "print(\"\\nAdditional consonants:\",ADDITIONAL_CONSONANTS)\n", 553 | "print(\"IAST Representation of Additional consonants:\",IAST_ADDITIONAL_CONSONANTS)" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "metadata": {}, 559 | "source": [ 560 | "Following are the digits in the Marathi Script:" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 21, 566 | "metadata": {}, 567 | "outputs": [ 568 | { 569 | "name": "stdout", 570 | "output_type": "stream", 571 | "text": [ 572 | "Digits: ['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']\n" 573 | ] 574 | } 575 | ], 576 | "source": [ 577 | "print(\"Digits:\",DIGITS)" 578 | ] 579 | } 580 | ], 581 | "metadata": { 582 | "kernelspec": { 583 | "display_name": "Python 3", 584 | "language": "python", 585 | "name": "python3" 586 | }, 587 | "language_info": { 588 | "codemirror_mode": { 589 | "name": "ipython", 590 | "version": 3 591 | }, 592 | "file_extension": ".py", 593 | "mimetype": "text/x-python", 594 | "name": "python", 595 | "nbconvert_exporter": "python", 596 | "pygments_lexer": "ipython3", 597 | "version": "3.6.5" 598 | } 599 | }, 600 | "nbformat": 4, 601 | "nbformat_minor": 2 602 | } 603 | -------------------------------------------------------------------------------- /languages/south_asia/Odia_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | 
"metadata": {}, 6 | "source": [ 7 | "# Odia with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "You can now analyse Odia texts with CLTK!
\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Odia Alphabets" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "There are 14 vowels, 25 Structured and 11 Unstructured consonants in Odia language. See them by doing as follows:" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "name": "stdout", 38 | "output_type": "stream", 39 | "text": [ 40 | "Vowels: ['ଅ', 'ଆ', 'ଇ', 'ଈ', 'ଉ', 'ଊ', 'ଋ', 'ୠ', 'ଌ', 'ୡ', 'ଏ', 'ଐ', 'ଓ', 'ଔ']\n", 41 | "Structured consonants: ['କ', 'ଖ', 'ଗ', 'ଘ', 'ଙ', 'ଚ', 'ଛ', 'ଜ', 'ଝ', 'ଞ', 'ଟ', 'ଠ', 'ଡ', 'ଢ', 'ଣ', 'ତ', 'ଥ', 'ଦ', 'ଧ', 'ନ', 'ପ', 'ଫ', 'ବ', 'ଭ', 'ମ']\n", 42 | "Unstructured consonants: ['ଯ', 'ୟ', 'ର', 'ଲ', 'ଳ', 'ୱ', 'ଶ', 'ଷ', 'ସ', 'ହ', 'କ୍ଷ']\n", 43 | "Numerals: ['୦', '୧', '୨', '୩', '୪', '୫', '୬', '୭', '୮', '୯']\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "from cltk.corpus.odia.alphabet import *\n", 49 | "print(\"Vowels: \", VOWELS)\n", 50 | "print(\"Structured consonants: \",STRUCTURED_CONSONANTS)\n", 51 | "print(\"Unstructured consonants: \",UNSTRUCTURED_CONSONANTS)\n", 52 | "print(\"Numerals: \",NUMERALS)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## Transliterations" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "We can transliterate Odia scripts to that of other Indic languages. Let us take an example Odia text and transliterate it to Hindi:" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "name": "stdout", 76 | "output_type": "stream", 77 | "text": [ 78 | "ମୁଁ ତୁମକୁ ସାହାଯ୍ୟ କରିପାରେ କି? \n" 79 | ] 80 | } 81 | ], 82 | "source": [ 83 | "odia_text = \"ମୁଁ ତୁମକୁ ସାହାଯ୍ୟ କରିପାରେ କି? \"\n", 84 | "print(odia_text)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 7, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "'मुँ तुमकु साहाय्य़ करिपारे कि? 
'" 96 | ] 97 | }, 98 | "execution_count": 7, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 105 | "UnicodeIndicTransliterator.transliterate(odia_text,\"or\",\"hi\")" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "We can also romanize the text as shown:" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 11, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "'ghaaguDi'" 124 | ] 125 | }, 126 | "execution_count": 11, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "odia_text_two = \"ଘାଗୁଡି\"\n", 133 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 134 | "ItransTransliterator.to_itrans(odia_text_two,'or')\n" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "'ସୁନ୍ଦର'" 153 | ] 154 | }, 155 | "execution_count": 9, 156 | "metadata": {}, 157 | "output_type": "execute_result" 158 | } 159 | ], 160 | "source": [ 161 | "odia_text_itrans = 'sundara'\n", 162 | "ItransTransliterator.from_itrans(odia_text_itrans,'or')" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Syllabifier" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "We can use the indian_syllabifier to syllabify the odia sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 6, 182 | "metadata": { 183 | "scrolled": true 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "from cltk.corpus.utils.importer import CorpusImporter\n", 188 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 189 | "phonetics_model_importer.list_corpora\n", 190 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "Now we import the syllabifier and syllabify as follows:" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 13, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "%%capture\n", 207 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 208 | "odia_syllabifier = Syllabifier('oriya')\n", 209 | "odia_syllables = odia_syllabifier.orthographic_syllabify('ସୁଗନ୍ଧ')" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "The syllables of the word ସୁଗନ୍ଧ will thus be:" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 14, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "['ସୁ', 'ଗ', 'ନ୍ଧ']\n" 229 | ] 230 | } 231 | ], 232 | "source": [ 233 | "print(odia_syllables)" 234 | ] 235 | } 236 | ], 237 | "metadata": { 238 | "kernelspec": { 239 | "display_name": "Python 3", 240 | "language": "python", 241 | "name": "python3" 242 | }, 243 | "language_info": { 244 | "codemirror_mode": { 245 | "name": "ipython", 246 | "version": 3 247 | }, 248 | "file_extension": ".py", 249 | "mimetype": "text/x-python", 250 | "name": "python", 251 | "nbconvert_exporter": "python", 252 | "pygments_lexer": "ipython3", 253 | "version": "3.6.5" 254 | } 255 | }, 256 | "nbformat": 4, 257 | "nbformat_minor": 2 258 | } 259 | -------------------------------------------------------------------------------- /languages/south_asia/Pali_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pali with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Here's a quick overview on how you can analyse your Pali texts with CLTK !
\n", 15 | "Let's begin by adding the `USER_PATH`.." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "In order to be able to download Pali texts from CLTK's Github repo, we will require an importer." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "pali_downloader = CorpusImporter('pali')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "We can now see the corpora available for download, by using `list_corpora` feature of the importer. Let's go ahead and try it out!" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['pali_text_ptr_tipitaka', 'pali_texts_gretil']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "pali_downloader.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "The corpus pali_texts_gretil can be downloaded from the Github repo. The corpus will be downloaded to the directory `cltk_data/pali` at the above mentioned `USER_PATH`" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "pali_downloader.import_corpus('pali_texts_gretil')" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "You can see the texts downloaded by doing the following, or checking out the `cltk_data/pali/text/pali_texts_gretil` directory." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "['9_phil', '2_parcan', '4_comm', '1_tipit', '3_chron', '6_suanco']\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "pali_corpus_path = os.path.join(USER_PATH,'cltk_data/pali/text/pali_texts_gretil')\n", 110 | "list_of_texts = [text for text in os.listdir(pali_corpus_path) if '.' not in text]\n", 111 | "print(list_of_texts)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Digha Nikaya is a Buddhist scripture and is one of the five nikayas in Sutta Pitaka, which is one of the three parts of Pali Tipitaka. Let us view the contents of the first chapter of Digha Nikaya." 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 6, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "\n", 131 | "1. \n", 132 | "\n", 133 | "Brahmajālasuttaṃ\n", 134 | "\n", 135 | "1. Evaṃ me sutaṃ ekaṃ samayaṃ bhagavā antarā ca rājagahaṃ antarā ca nālandaṃ\n", 136 | "addhānamaggapaṭipanno hoti mahatā bhikkhusaṅghena saddhiṃ pañcamattehi\n", 137 | "bhikkhusatehi. Suppiyo'pi kho paribbājako antarā ca rājagahaṃ antarā ca nālandaṃ\n", 138 | "addhānamaggapaṭipanno hoti saddhiṃ antevāsinā brahmadattena māṇavena. 
\n", 139 | "\n", 140 | "Tatra sudaṃ suppiyo paribbājako anekapariyāyena buddhassa avaṇṇaṃ bhāsati, dhammassa\n", 141 | "avaṇṇaṃ bhāsati, saṅghassa avaṇṇaṃ bhāsati. Suppiyassa pana paribbājakassa antevāsī\n", 142 | "brahmadatto māṇavo anekapariyāyena buddhassa vaṇṇaṃ bhāsati, dhammassa vaṇṇaṃ\n", 143 | "bhāsati, saṅghassa vaṇṇaṃ bhāsati. Itiha te ubho ācariyantevāsī aññamaññassa\n", 144 | "ujuvipaccanīkavādā bhagavantaṃ piṭṭhito piṭṭhito anubaddhā1 honti bhikkhusaṅghaṃ ca. \n", 145 | "\n", 146 | "2. Atha kho bhagavā ambalaṭṭhikāyaṃ rājāgārake ekarattivāsaṃ upagaṃchi saddhiṃ\n", 147 | "bhikkhusaṅghena. Suppiyo'pi kho paribbājako ambalaṭṭhikāyaṃ rājāgārake ekarattivāsaṃ\n", 148 | "upagaṃchi saddhiṃ antevāsinā brahmadattena māṇavena. Tatra'pi sudaṃ suppiyo\n", 149 | "paribbājako anekapariyāyena buddhassa avaṇṇaṃ bhāsati, dhammassa avaṇṇaṃ bhāsati,\n", 150 | "saṅghassa avaṇṇaṃ bhāsati. Suppiyassa [PTS Page 002] [\\q 2/] pana paribbājakassa\n", 151 | "antevāsī brahmadatto māṇavo buddhassa vaṇṇaṃ bhāsati, dhammassa vaṇṇaṃ bhāsati,\n", 152 | "saṅghassa vaṇṇaṃ bhāsati. Itiha te ubho ācariyantevāsī aññamaññassa ujuvipaccanīkavādā\n", 153 | "viharanti. \n", 154 | " - - - - - - - - - - - - - - -\n" 155 | ] 156 | } 157 | ], 158 | "source": [ 159 | "pali_text_path = os.path.join(pali_corpus_path,'1_tipit/2_sut/1_digh/dighan1u.txt')\n", 160 | "pali_text = open(pali_text_path,'r').read()\n", 161 | "print(pali_text[1445:2800])" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "## Pali Alphabets" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "Pali is usually written in Sinhalese, Brāhmi, Khmer, Burmese or Devanagari. There are 7 independent vowels, 7 dependent vowels and 33 consonants in Pali (Sinhalese script) which are as follows:" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 7, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "name": "stdout", 185 | "output_type": "stream", 186 | "text": [ 187 | "Independent vowels: ['අ', 'ආ', 'ඉ', 'ඊ', 'උ', 'එ', 'ඔ']\n", 188 | "Dependent vowels: ['ා', 'ි', 'ී', 'ු', 'ූ', 'ෙ', 'ො']\n", 189 | "Consonants: ['ක', 'ඛ', 'ග', 'ඝ', 'ඞ', 'ච', 'ඡ', 'ජ', 'ඣ', 'ඤ', 'ට', 'ඨ', 'ඩ', 'ඪ', 'ණ', 'ත', 'ථ', 'ද', 'ධ', 'න', 'ප', 'ඵ', 'බ', 'භ', 'ම', 'ය', 'ර', 'ල', 'ව', 'ස', 'හ', 'ළ', 'අං']\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "from cltk.corpus.pali.alphabet import *\n", 195 | "print(\"Independent vowels:\",INDEPENDENT_VOWELS)\n", 196 | "print(\"Dependent vowels:\",DEPENDENT_VOWELS)\n", 197 | "print(\"Consonants:\",CONSONANTS)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "## Transliterations" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "We can transliterate Pali texts to that of other Indic languages. 
Let us transliterate `අභිරුචිර` to Hindi:" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 8, 217 | "metadata": {}, 218 | "outputs": [ 219 | { 220 | "data": { 221 | "text/plain": [ 222 | "'अभिरुचिर'" 223 | ] 224 | }, 225 | "execution_count": 8, 226 | "metadata": {}, 227 | "output_type": "execute_result" 228 | } 229 | ], 230 | "source": [ 231 | "pali_text_two = 'අභිරුචිර'\n", 232 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 233 | "UnicodeIndicTransliterator.transliterate(pali_text_two,\"si\",\"hi\")" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "We can also romanize the text as shown:" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 9, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "data": { 250 | "text/plain": [ 251 | "'abhiruchira'" 252 | ] 253 | }, 254 | "execution_count": 9, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 261 | "ItransTransliterator.to_itrans(pali_text_two,'si')" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 10, 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "'දේවධම්මෝ'" 280 | ] 281 | }, 282 | "execution_count": 10, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "pali_text_itrans = 'devadhammo'\n", 289 | "ItransTransliterator.from_itrans(pali_text_itrans,'si')" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "## Syllabifier" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "We can use the indian_syllabifier to syllabify the Pali sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 11, 309 | "metadata": { 310 | "scrolled": true 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 315 | "phonetics_model_importer.list_corpora\n", 316 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "Now we import the syllabifier and syllabify as follows:" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 12, 329 | "metadata": {}, 330 | "outputs": [], 331 | "source": [ 332 | "%%capture\n", 333 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 334 | "pali_syllabifier = Syllabifier('sinhalese')\n", 335 | "pali_syllables = pali_syllabifier.orthographic_syllabify(pali_text_two)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "The syllables of the word `pali_text_two` will thus be:" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 13, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "name": "stdout", 352 | "output_type": "stream", 353 | "text": [ 354 | "['අ', 'භ', 'ි', 'ර', 'ු', 'ච', 'ි', 'ර']\n" 355 | ] 356 | } 357 | ], 358 | "source": [ 359 | "print(pali_syllables)" 360 | ] 361 | } 362 | ], 363 | "metadata": { 364 | "kernelspec": { 365 | "display_name": "Python 3", 366 | "language": "python", 367 | "name": "python3" 368 | }, 369 | "language_info": { 370 | "codemirror_mode": { 371 | "name": "ipython", 372 | "version": 3 373 | }, 374 | "file_extension": ".py", 375 | "mimetype": "text/x-python", 376 | "name": "python", 377 | "nbconvert_exporter": "python", 378 | "pygments_lexer": "ipython3", 379 | "version": "3.6.5" 380 | } 381 | }, 382 | "nbformat": 4, 383 | "nbformat_minor": 2 384 | } 385 | -------------------------------------------------------------------------------- /languages/south_asia/Prakrit_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Prakrit with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Prakrits are any of several Middle Indo-Aryan languages formerly spoken in India (Wikipedia)
\n", 15 | "Here's how you can import Prakrit corpora. Let us start off by setting `USER_PATH` for our reference." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "In order to be able to download Prakrit texts from CLTK's Github repo, we will require an importer." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "prakrit_downloader = CorpusImporter('prakrit')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "We can now see the corpora available for download, by using `list_corpora` feature of the importer. Let's go ahead and try it out!" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['prakrit_texts_gretil']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "prakrit_downloader.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "The corpus Prakrit_texts_gretil can be downloaded from the Github repo. The corpus will be downloaded to the directory `cltk_data/prakrit` at the above mentioned `USER_PATH`" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "prakrit_downloader.import_corpus('prakrit_texts_gretil')" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "You can see the texts downloaded by doing the following, or checking out the `cltk_data/prakrit/text/prakrit_texts_gretil/txt` directory." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "['dasaveyu.txt', 'halsatsu.txt', 'uttaraju.txt', 'hadhut_u.txt', 'isibhasu.txt', 'ayarangu.txt', 'spaucaru.txt', 'suyagadu.txt']\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "prakrit_corpus_path = os.path.join(USER_PATH,'cltk_data/prakrit/text/prakrit_texts_gretil/txt')\n", 110 | "list_of_texts = os.listdir(prakrit_corpus_path)\n", 111 | "print(list_of_texts)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "As an example, let us open the file `dasaveyu.txt` which holds the contents of The Dasaveyāliya Sutta" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 6, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "\n", 131 | "\n", 132 | "1 prathamamadhyayanam (dumapupphiyā.) 
/\n", 133 | "\n", 134 | "dhammo maṅgalamukkaṭhaṃ | ahiṃsā saṃjamo tavo /\n", 135 | "devā vi taṃ namaṃsanti | jassa dhamme sayā maṇo \n", 136 | "jahā dumassa pupphesu | bhamaro āviyai rasaṃ /\n", 137 | "na ya pupphaṃ kilāmei | so ya pīṇei appayaṃ \n", 138 | "emee samaṇā muttā | je loe santi sāhuṇo /\n", 139 | "vihaṃgamā va pupphesu | dāṇa-bhattesaṇe rayā \n", 140 | "vayaṃ ca vittiṃ labbhāmo | na ya koi uvahammaī /\n", 141 | "ahāgaḍesu rīyante | pupphesu bhamarā jahā \n", 142 | "mahukāra-samā buddhā | je bhavanti aṇissiyā /\n", 143 | "nāṇā-piṇḍa-rayā dantā, | teṇa vuccantisāhuṇo tti bemi ||\n", 144 | "|| prathamamadhyayanam \n", 145 | "\n", 146 | " Dasaveyāliya: 2 dvitīyamadhyayanam (sāmaṇṇapuvvagaṃ.) /\n", 147 | "\n", 148 | "kahaṃ nu kujjā sāmaṇṇaṃ | jo kāme na nivārae /\n", 149 | "pae pae visīyanto | saṃkappassa vasaṃ gao? \n", 150 | "vattha-gandhamalaṃkāraṃ | itthīo sayaṇāṇi ya /\n", 151 | "acchandā je na bhuñjanti | na se \"cāi\" tti vuccaī \n", 152 | "je ya kante pie bhoe | laddhe vippiṭhi-kuvvaī /\n", 153 | "sāhīṇe cayai bhoe | se hu \"cāi\" tti vuccaī \n", 154 | " samāe pehāe parivvayanto | siyā maṇo nissaraī bahiddhā, /\n", 155 | " \"na sā mahaṃ no vi ahaṃ pi tīse\" | icceva tāo viṇaejja rāgaṃ \n", 156 | " āyāvayāhī! caya sogumallaṃ! | kāme kamāhī! kamiyaṃ khu dukkhaṃ /\n", 157 | " chindāhi dosaṃ! viṇaejja rāgaṃ! | evaṃ suhī hohisi saṃparāe \n", 158 | "pakkhande jaliyaṃ joiṃ | dhūma-keuṃ durāsayaṃ /\n", 159 | "necchanti vantayaṃ bhottuṃ | kule jāyā agandhaṇe \n", 160 | "dhiratthu te jaso-kāmī | jo taṃ jīviya-kāraṇā /\n", 161 | "vantaṃ icchasi āveuṃ! | seyaṃ te maraṇaṃ bhave \n", 162 | "ahaṃ ca bhoga-rāyassa, | taṃ ca si andhavaṇhiṇo /\n", 163 | "mā kule gandhaṇā homo, | saṃjamaṃ nihuo cara \n", 164 | "jai taṃ kāhisi bhāvaṃ | jā jā dacchisi nārio /\n", 165 | "vāyāiddho vva haḍho | aṭhiyappā bhavissasi \n", 166 | "tīse so vayaṇaṃ soccā | saṃjayāe subhāsiyaṃ /\n", 167 | "aṅkuseṇa jahā nāgo | dhamme saṃpaḍivāio \n", 168 | "evaṃ karenti saṃbuddhā | paṇḍiyā paviyakkhaṇā /\n", 169 | "viṇiyaanti bhogesu | jahā se purisuttamo tti bemi ||\n", 170 | "|| dvitīyamadhyayanam\n" 171 | ] 172 | } 173 | ], 174 | "source": [ 175 | "prakrit_text_path = os.path.join(prakrit_corpus_path,'dasaveyu.txt')\n", 176 | "prakrit_text = open(prakrit_text_path,'r').read()\n", 177 | "print(prakrit_text[1225:2950])" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "It is possible to analyse Prakrit texts in Devanagari. One can refer to Hindi or Sanskrit tutorials for more information." 185 | ] 186 | } 187 | ], 188 | "metadata": { 189 | "kernelspec": { 190 | "display_name": "Python 3", 191 | "language": "python", 192 | "name": "python3" 193 | }, 194 | "language_info": { 195 | "codemirror_mode": { 196 | "name": "ipython", 197 | "version": 3 198 | }, 199 | "file_extension": ".py", 200 | "mimetype": "text/x-python", 201 | "name": "python", 202 | "nbconvert_exporter": "python", 203 | "pygments_lexer": "ipython3", 204 | "version": "3.6.5" 205 | } 206 | }, 207 | "nbformat": 4, 208 | "nbformat_minor": 2 209 | } 210 | -------------------------------------------------------------------------------- /languages/south_asia/Punjabi_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Punjabi with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Use CLTK to analyse your Punjabi texts!
\n", 15 | "Firstly, we need to add the path where our corpora will reside." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Before we begin analysing the texts, we will need to download the Punjabi corpora, for which, we will be using an Importer. Call the importer to download Punjabi texts, as follows.. " 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "punjabi_corpus_importer = CorpusImporter('punjabi')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "You can view which corpora to download by calling list_corpora() method" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['punjabi_text_gurban']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "punjabi_corpus_importer.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "punjabi_corpus_importer.import_corpus('punjabi_text_gurban');" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "It can be verified that the `punjabi_text_gurban` corpus is downloaded in a `cltk_data/punjabi/text` folder which at the path given by `USER_PATH`. It is now possible to analyse the texts within. As an example, let us open the first page of Guru Granth Sahib.." 
86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "punjabi_corpus_path = os.path.join(USER_PATH,'cltk_data/punjabi/text/punjabi_text_gurban/guru_granth_sahib')\n", 95 | "punjabi_text_path = os.path.join(punjabi_corpus_path,'pg1.txt')\n", 96 | "punjabi_text = open(punjabi_text_path,'r').read()" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Let us see the contents of `punjabi_text`" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 6, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "#1\n", 116 | "ੴ ਸਤਿ ਨਾਮੁ ਕਰਤਾ ਪੁਰਖੁ ਨਿਰਭਉ ਨਿਰਵੈਰੁ ਅਕਾਲ ਮੂਰਤਿ ਅਜੂਨੀ ਸੈਭੰ ਗੁਰ ਪ੍ਰਸਾਦਿ ॥\n", 117 | "॥ ਜਪੁ ॥\n", 118 | "ਆਦਿ ਸਚੁ ਜੁਗਾਦਿ ਸਚੁ ॥\n", 119 | "ਹੈ ਭੀ ਸਚੁ ਨਾਨਕ ਹੋਸੀ ਭੀ ਸਚੁ ॥੧॥\n", 120 | "ਸੋਚੈ ਸੋਚਿ ਨ ਹੋਵਈ ਜੇ ਸੋਚੀ ਲਖ ਵਾਰ ॥\n", 121 | "ਚੁਪੈ ਚੁਪ ਨ ਹੋਵਈ ਜੇ ਲਾਇ ਰਹਾ ਲਿਵ ਤਾਰ ॥\n", 122 | "ਭੁਖਿਆ ਭੁਖ ਨ ਉਤਰੀ ਜੇ ਬੰਨਾ ਪੁਰੀਆ ਭਾਰ ॥\n", 123 | "ਸਹਸ ਸਿਆਣਪਾ ਲਖ ਹੋਹਿ ਤ ਇਕ ਨ ਚਲੈ ਨਾਲਿ ॥\n", 124 | "ਕਿਵ ਸਚਿਆਰਾ ਹੋਈਐ ਕਿਵ ਕੂੜੈ ਤੁਟੈ ਪਾਲਿ ॥\n", 125 | "ਹੁਕਮਿ ਰਜਾਈ ਚਲਣਾ ਨਾਨਕ ਲਿਖਿਆ ਨਾਲਿ ॥੧॥\n", 126 | "ਹੁਕਮੀ ਹੋਵਨਿ ਆਕਾਰ ਹੁਕਮੁ ਨ ਕਹਿਆ ਜਾਈ ॥\n", 127 | "ਹੁਕਮੀ ਹੋਵਨਿ ਜੀਅ ਹੁਕਮਿ ਮਿਲੈ ਵਡਿਆਈ ॥\n", 128 | "ਹੁਕਮੀ ਉਤਮੁ ਨੀਚੁ ਹੁਕਮਿ ਲਿਖਿ ਦੁਖ ਸੁਖ ਪਾਈਅਹਿ ॥\n", 129 | "ਇਕਨਾ ਹੁਕਮੀ ਬਖਸੀਸ ਇਕਿ ਹੁਕਮੀ ਸਦਾ ਭਵਾਈਅਹਿ ॥\n", 130 | "ਹੁਕਮੈ ਅੰਦਰਿ ਸਭੁ ਕੋ ਬਾਹਰਿ ਹੁਕਮ ਨ ਕੋਇ ॥\n", 131 | "ਨਾਨਕ ਹੁਕਮੈ ਜੇ ਬੁਝੈ ਤ ਹਉਮੈ ਕਹੈ ਨ ਕੋਇ ॥੨॥\n", 132 | "ਗਾਵੈ ਕੋ ਤਾਣੁ ਹੋਵੈ ਕਿਸੈ ਤਾਣੁ ॥\n", 133 | "ਗਾਵੈ ਕੋ ਦਾਤਿ ਜਾਣੈ ਨੀਸਾਣੁ ॥\n", 134 | "ਗਾਵੈ ਕੋ ਗੁਣ ਵਡਿਆਈਆ ਚਾਰ ॥\n", 135 | "ਗਾਵੈ ਕੋ ਵਿਦਿਆ ਵਿਖਮੁ ਵੀਚਾਰੁ ॥\n", 136 | "ਗਾਵੈ ਕੋ ਸਾਜਿ ਕਰੇ ਤਨੁ ਖੇਹ ॥\n", 137 | "ਗਾਵੈ ਕੋ ਜੀਅ ਲੈ ਫਿਰਿ ਦੇਹ ॥\n", 138 | "ਗਾਵੈ ਕੋ ਜਾਪੈ ਦਿਸੈ ਦੂਰਿ ॥\n", 139 | "\n", 140 | "\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "print(punjabi_text)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Tokenisation in Punjabi" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Tokeniser for Punjabi is not available. However, to serve our purpose, we may use the tokeniser for Hindi language." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 7, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "name": "stdout", 169 | "output_type": "stream", 170 | "text": [ 171 | "['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਦੀ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਵਾਲੀ', 'ਭਾਸ਼ਾ', 'ਹੈ', '।']\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "from cltk.tokenize.sentence import TokenizeSentence\n", 177 | "punjabi_sentence = \"ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।\"\n", 178 | "punjabi_tokeniser = TokenizeSentence('hindi')\n", 179 | "punjabi_tokens = punjabi_tokeniser.indian_punctuation_tokenize_regex(punjabi_sentence)\n", 180 | "print(punjabi_tokens[:50])" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## Stopword filtering" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "Stopwords list for Punjabi can be found at `stop` module of cltk." 
195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "['ਦੇ', 'ਦੀ', 'ਵਿਚ', 'ਦਾ', 'ਨੂੰ', 'ਹੈ', 'ਹੀ', 'ਹੇ', 'ਕੇ', 'ਉਸ']\n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "from cltk.stop.punjabi.stops import STOPS_LIST\n", 212 | "print (STOPS_LIST[:10])" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "Let us filter the `punjabi_tokens` list for words that are not stop words." 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 9, 225 | "metadata": {}, 226 | "outputs": [ 227 | { 228 | "name": "stdout", 229 | "output_type": "stream", 230 | "text": [ 231 | "['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਭਾਸ਼ਾ', '।']\n" 232 | ] 233 | } 234 | ], 235 | "source": [ 236 | "punjabi_tokens_no_stop = [token for token in punjabi_tokens if token not in STOPS_LIST]\n", 237 | "print(punjabi_tokens_no_stop)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 10, 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "name": "stdout", 247 | "output_type": "stream", 248 | "text": [ 249 | "10\n", 250 | "7\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "print(len(punjabi_tokens))\n", 256 | "print(len(punjabi_tokens_no_stop))" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "As one can see, `punjabi_tokens` had 10 tokens whereas `punjabi_tokens_no_stop` has 7" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "## Transliterations" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "We can transliterate Punjabi scripts to that of other Indic languages. 
Let us transliterate `punjabi_sentence`to Hindi:" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 11, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "text/plain": [ 288 | "'पੰजाबी पੰजाब दी मुखੱ बोोलण जाण वाली भाशा है।'" 289 | ] 290 | }, 291 | "execution_count": 11, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 298 | "UnicodeIndicTransliterator.transliterate(punjabi_sentence,\"pa\",\"hi\")" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "We can also romanize the text as shown:" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 12, 311 | "metadata": {}, 312 | "outputs": [ 313 | { 314 | "data": { 315 | "text/plain": [ 316 | "'iੱka sohaNii sa़aama'" 317 | ] 318 | }, 319 | "execution_count": 12, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "punjabi_text_two = 'ਇੱਕ ਸੋਹਣੀ ਸ਼ਾਮ'\n", 326 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 327 | "ItransTransliterator.to_itrans(punjabi_text_two,'pa')" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 13, 340 | "metadata": {}, 341 | "outputs": [ 342 | { 343 | "data": { 344 | "text/plain": [ 345 | "'ਮੈਨੂੰ ਖਾਣਾ ਪਸੰਦ ਹੈ'" 346 | ] 347 | }, 348 | "execution_count": 13, 349 | "metadata": {}, 350 | "output_type": "execute_result" 351 | } 352 | ], 353 | "source": [ 354 | "punjabi_text_itrans = 'mainuuੰ khaaNaa pasaੰda hai'\n", 355 | "ItransTransliterator.from_itrans(punjabi_text_itrans,'pa')" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "## Syllabifier" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "We can use the `indian_syllabifier` to syllabify the Punjabi sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 14, 375 | "metadata": { 376 | "scrolled": true 377 | }, 378 | "outputs": [], 379 | "source": [ 380 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 381 | "phonetics_model_importer.list_corpora\n", 382 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "Now we import the syllabifier and syllabify as follows:" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 15, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "%%capture\n", 399 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 400 | "punjabi_syllabifier = Syllabifier('punjabi')\n", 401 | "punjabi_syllables = punjabi_syllabifier.orthographic_syllabify('ਪੰਜਾਬੀ')" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "The syllables of the word ਪੰਜਾਬੀ will thus be:" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 16, 414 | "metadata": {}, 415 | "outputs": [ 416 | { 417 | "name": "stdout", 418 | "output_type": "stream", 419 | "text": [ 420 | "['ਪ', 'ੰ', 'ਜਾ', 'ਬੀ']\n" 421 | ] 422 | } 423 | ], 424 | "source": [ 425 | "print(punjabi_syllables)" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "## Punjabi Alphabets" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "There are two scripts used in Punjabi: Gurmukhi and Shahmukhi" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "Digits in Punjabi are as follows:" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 17, 452 | "metadata": {}, 453 | "outputs": [ 454 | { 455 | "name": "stdout", 456 | "output_type": "stream", 457 | "text": [ 458 | "Gurumukhi digits: ['੦', '੧', '੨', '੩', '੪', '੫', '੬', '੭', '੮', '੯']\n", 459 | "Shahmukhi digits: ['۰', '۱', '۲', '۳', '٤', '۵', '٦', '۷', '۸', '۹']\n" 460 | ] 461 | } 462 | ], 463 | "source": [ 464 | "from cltk.corpus.punjabi.alphabet import *\n", 465 | "print(\"Gurumukhi digits:\",DIGITS_GURMUKHI)\n", 466 | "print(\"Shahmukhi digits:\",DIGITS_SHAHMUKHI)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "Vowels in Punjabi are as follows:" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 19, 479 | "metadata": {}, 480 | "outputs": [ 481 | { 482 | "name": "stdout", 483 | "output_type": "stream", 484 | "text": [ 485 | "Independent vowels in Gurmukhi: ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']\n", 486 | "Independent vowels in Shahmukhi: ['ا', 'و', 'ی', 'ے']\n", 487 | "Dependent vowels in Gurmukhi: ['ਾ', 'ਿ', 'ੀ', 'ੁ', 'ੂ', 'ੇ', 'ੈ', 'ੋ', 'ੌ']\n", 488 | "Dependent vowels in Shahmukhi: ['َ', 'ِ', 'ُ']\n" 489 | ] 490 | } 491 | ], 492 | "source": [ 493 | "print(\"Independent vowels in Gurmukhi: \",INDEPENDENT_VOWELS_GURMUKHI)\n", 494 | "print(\"Independent vowels in Shahmukhi: \",INDEPENDENT_VOWELS_SHAHMUKHI)\n", 495 | "print(\"Dependent vowels in Gurmukhi: \",DEPENDENT_VOWELS_GURMUKHI)\n", 496 | "print(\"Dependent vowels in Shahmukhi: \",DEPENDENT_VOWELS_SHAHMUKHI)" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "Consonants and other symbols in Punjabi are as follows:" 504 | ] 505 | }, 506 | { 507 | 
"cell_type": "code", 508 | "execution_count": 20, 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "Consonants in Gurmukhi: ['ੳ', 'ਅ', 'ੲ', 'ਸ', 'ਹ', 'ਕ', 'ਖ', 'ਗ', 'ਘ', 'ਙ', 'ਚ', 'ਛ', 'ਜ', 'ਝ', 'ਞ', 'ਟ', 'ਠ', 'ਡ', 'ਢ', 'ਣ', 'ਤ', 'ਥ', 'ਦ', 'ਧ', 'ਨ', 'ਪ', 'ਫ', 'ਬ', 'ਭ', 'ਮ', 'ਯ', 'ਰ', 'ਲ', 'ਵ', 'ੜ']\n", 516 | "Consonants in Shahmukhi: ['ء', 'ب', 'پ', 'ت', 'ٹ', 'ث', 'ج', 'چ', 'ح', 'خ', 'د', 'ڈ', 'ذ', 'ر', 'ڑ', 'ز', 'ژ', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ک', 'گ', 'ل', 'م', 'ن', 'ه', 'ھ']\n", 517 | "Bindi consonants in Gurmukhi: ['ਖ਼', 'ਗ਼', 'ਜ਼', 'ਫ਼', 'ਲ਼', 'ਸ਼']\n", 518 | "Other symbols in Gurmukhi: ['ੱ', 'ਂ', 'ਃ', 'ੰ', 'ੑ', 'ੴ', 'ਁ']\n", 519 | "Other symbols in Shahmukhi: ['ﺁ', 'ۀ', 'ﻻ']\n" 520 | ] 521 | } 522 | ], 523 | "source": [ 524 | "print(\"Consonants in Gurmukhi: \",CONSONANTS_GURMUKHI)\n", 525 | "print(\"Consonants in Shahmukhi: \",CONSONANTS_SHAHMUKHI)\n", 526 | "print(\"Bindi consonants in Gurmukhi: \",BINDI_CONSONANTS_GURMUKHI)\n", 527 | "print(\"Other symbols in Gurmukhi:\", OTHER_SYMBOLS_GURMUKHI)\n", 528 | "print(\"Other symbols in Shahmukhi:\", OTHER_SYMBOLS_SHAHMUKHI)" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "## Numerifier" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": {}, 541 | "source": [ 542 | "This is a tool that lets you convert English numerals to Punjabi and vice-versa." 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 23, 548 | "metadata": {}, 549 | "outputs": [ 550 | { 551 | "name": "stdout", 552 | "output_type": "stream", 553 | "text": [ 554 | "1234567890\n" 555 | ] 556 | } 557 | ], 558 | "source": [ 559 | "from cltk.corpus.punjabi.numerifier import punToEnglish_number, englishToPun_number\n", 560 | "c = punToEnglish_number('੧੨੩੪੫੬੭੮੯੦')\n", 561 | "print(c)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 24, 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "name": "stdout", 571 | "output_type": "stream", 572 | "text": [ 573 | "੧੨੩੪੫੬੭੮੯੦\n" 574 | ] 575 | } 576 | ], 577 | "source": [ 578 | "c = englishToPun_number('1234567890')\n", 579 | "print(c)" 580 | ] 581 | } 582 | ], 583 | "metadata": { 584 | "kernelspec": { 585 | "display_name": "Python 3", 586 | "language": "python", 587 | "name": "python3" 588 | }, 589 | "language_info": { 590 | "codemirror_mode": { 591 | "name": "ipython", 592 | "version": 3 593 | }, 594 | "file_extension": ".py", 595 | "mimetype": "text/x-python", 596 | "name": "python", 597 | "nbconvert_exporter": "python", 598 | "pygments_lexer": "ipython3", 599 | "version": "3.6.5" 600 | } 601 | }, 602 | "nbformat": 4, 603 | "nbformat_minor": 2 604 | } 605 | -------------------------------------------------------------------------------- /languages/south_asia/Telugu_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Telugu with CLTK" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "A quick overview of how you can analyse your Telugu texts with CLTK !
\n", 15 | "Let's begin by adding the `USER_PATH`.." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import os\n", 25 | "USER_PATH = os.path.expanduser('~')" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "In order to be able to download Telugu texts from CLTK's Github repo, we will require an importer." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from cltk.corpus.utils.importer import CorpusImporter\n", 42 | "telugu_downloader = CorpusImporter('telugu')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "We can now see the corpora available for download, by using `list_corpora` feature of the importer. Let's go ahead and try it out!" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['telugu_text_wikisource']" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "telugu_downloader.list_corpora" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "The corpus telugu_text_wikisource can be downloaded from the Github repo. The corpus will be downloaded to the directory `cltk_data/telugu` at the above mentioned `USER_PATH`" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "telugu_downloader.import_corpus('telugu_text_wikisource')" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "You can see the texts downloaded by doing the following, or checking out the `cltk_data/telugu/text/telugu_text_wikisource/texts` directory." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "['kumari_satakam', 'vemana_satakam', 'brahma_puranam', 'neeti_satakam', 'srimadbhagavatham', 'dwarakapathi_satakam', 'krishna_satakam', 'bharga_satakam', 'dasaradhi_satakam', 'ramachandarprabhu_satakam', 'molla_ramayanam', 'srungara_satakam', 'sri_kalahistiswara_satakam', 'siva_puranam', 'narasimha_satakam', 'vrushadhipa_satakam', 'sumathi_satakam', 'kumara_satakam', 'bhadadri_satakam', 'pothana_telugu_bhagavatham', 'narayana_satakam', 'bhaskara_satakam', 'mrutyunjaya_siva_satakam', 'suryasatakam', 'devakinandana_satakam', 'vairagya_satakam']\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "telugu_corpus_path = os.path.join(USER_PATH,'cltk_data/telugu/text/telugu_text_wikisource/texts')\n", 110 | "list_of_texts = [text for text in os.listdir(telugu_corpus_path) if '.' not in text]\n", 111 | "print(list_of_texts)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Now that we have our texts, let's take a sample from one of them. For this tutorial, we shall be using Kumari Satakam by Pakki Venkata Narasimha Kavi." 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 6, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "1. 
శ్రీ భూ నీళా హైమవ\n", 131 | "తీ భారతు లతుల శుభవ తిగ నెన్ను చు స \n", 132 | "త్సౌభాగ్యము నీ కొసగంగ\n", 133 | "లో భావించెదరు ధర్మ లోల కుమారీ!\n", 134 | "\n", 135 | "2. చెప్పెడి బుద్ధులలోపల\n", 136 | "దప్పకు మొక టైన సర్వ ధర్మములందున్\n", 137 | "మెప్పొంది యిహపరంబులన్\n", 138 | "దప్పింతయు లేక మెలగ దగును కుమారీ!\n", 139 | "\n", 140 | "3. ఆటల బాటలలోనే\n", 141 | "మాటయు రాకుండన్ దండ్రి మందిరమందున్\n", 142 | "బాటిల్లు గాపురములో\n", 143 | "వాట మెఱిగి బాల! తిరుగ వలయున్ గుమారీ!\n", 144 | "\n", 145 | "4. మగనికి నత్తకు మామకున్\n", 146 | "దగ సేవ యొనర్చుచోటన్ దత్పరిచర్యన్\n", 147 | "మిగుల నుతి బొందుచుండుట\n", 148 | "మగువలకున్ బాడి తెలిసి మసలు కుమారీ!\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "telugu_text_path = os.path.join(telugu_corpus_path,'kumari_satakam/satakam.txt')\n", 154 | "telugu_text = open(telugu_text_path,'r').read()\n", 155 | "print(telugu_text[:446])" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "## Telugu Alphabets" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "Telugu alphabets are grouped as follows:" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 7, 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "NumeralS: ['౦', '౧', '౨', '౩', '౪\\t', '౫', '౬', '౭', '౮', '౯']\n", 182 | "Vowels: ['అ ', 'ఆ', 'ఇ', 'ఈ ', 'ఉ ', 'ఊ ', 'ఋ ', 'ౠ ', 'ఌ ', 'ౡ', 'ఎ', 'ఏ', 'ఐ', 'ఒ', 'ఓ', 'ఔ ', 'అం', 'అః']\n", 183 | "Consonants: ['క', 'ఖ', 'గ', 'ఘ', 'ఙచ', 'ఛ', 'జ', 'ఝ', 'ఞ', 'ట', 'ఠ', 'డ ', 'ఢ', 'ణ', 'త', 'థ', 'ద', 'ధ', 'న', 'ప', 'ఫ', 'బ', 'భ', 'మ', 'య', 'ర', 'ల', 'వ', 'శ', 'ష', 'స ', 'హ', 'ళ', 'క్ష ', 'ఱ']\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "from cltk.corpus.telugu.alphabet import *\n", 189 | "print(\"NumeralS:\",NUMERLALS)\n", 190 | "print(\"Vowels:\",VOWELS)\n", 191 | "print(\"Consonants:\",CONSONANTS)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## Transliterations" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "We can transliterate Telugu scripts to that of other Indic languages. Let us transliterate one of the lines from the above text, `లో భావించెదరు ధర్మ లోల కుమారీ! 
`to Hindi:" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 8, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "'लो भाविंचॆदरु धर्म लोल कुमारी!'" 217 | ] 218 | }, 219 | "execution_count": 8, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "telugu_text_two = 'లో భావించెదరు ధర్మ లోల కుమారీ!'\n", 226 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator\n", 227 | "UnicodeIndicTransliterator.transliterate(telugu_text_two,\"te\",\"hi\")" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "We can also romanize the text as shown:" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 9, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "'lo bhaawi.mch.edaru dharma lola kumaarii!'" 246 | ] 247 | }, 248 | "execution_count": 9, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator\n", 255 | "ItransTransliterator.to_itrans(telugu_text_two,'te')" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "Similarly, we can indicize a text given in its ITRANS-transliteration" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 10, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/plain": [ 273 | "'కుమారీ'" 274 | ] 275 | }, 276 | "execution_count": 10, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "telugu_text_itrans = 'kumaarii'\n", 283 | "ItransTransliterator.from_itrans(telugu_text_itrans,'te')" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "## Syllabifier" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "We can use the indian_syllabifier to syllabify the Telugu sentences. To do this, we will have to import models as follows. The importing of `sanskrit_models_cltk` might take some time." 
298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 11, 303 | "metadata": { 304 | "scrolled": true 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "phonetics_model_importer = CorpusImporter('sanskrit')\n", 309 | "phonetics_model_importer.list_corpora\n", 310 | "phonetics_model_importer.import_corpus('sanskrit_models_cltk') " 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "Now we import the syllabifier and syllabify as follows:" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 12, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "%%capture\n", 327 | "from cltk.stem.sanskrit.indian_syllabifier import Syllabifier\n", 328 | "telugu_syllabifier = Syllabifier('telegu')\n", 329 | "telugu_syllables = telugu_syllabifier.orthographic_syllabify('కుమారీ')" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "The syllables of the word `'కుమారీ'` will thus be:" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 13, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "['కు', 'మా', 'రీ']\n" 349 | ] 350 | } 351 | ], 352 | "source": [ 353 | "print(telugu_syllables)" 354 | ] 355 | } 356 | ], 357 | "metadata": { 358 | "kernelspec": { 359 | "display_name": "Python 3", 360 | "language": "python", 361 | "name": "python3" 362 | }, 363 | "language_info": { 364 | "codemirror_mode": { 365 | "name": "ipython", 366 | "version": 3 367 | }, 368 | "file_extension": ".py", 369 | "mimetype": "text/x-python", 370 | "name": "python", 371 | "nbconvert_exporter": "python", 372 | "pygments_lexer": "ipython3", 373 | "version": "3.6.5" 374 | } 375 | }, 376 | "nbformat": 4, 377 | "nbformat_minor": 2 378 | } 379 | --------------------------------------------------------------------------------