├── .gitattributes
├── .gitignore
├── MDD.svg
├── MDX.svg
├── README.md
├── notes
│   ├── 01-测试结果.ipynb
│   └── parse_mdx.ipynb
├── parser.py
├── pureSalsa20.py
├── readmdict.py
├── ripemd128.py
└── xiaozhan.py

/.gitattributes:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binhetech/mdict-parser/257885176aa572953b044e9ff68b88fecc86cdf9/.gitattributes
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | .idea
3 | .ipynb_checkpoints
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | An Analysis of MDX/MDD File Format
2 | ==================================
3 | 
4 | MDict is promoted as a multi-platform, open dictionary platform --
5 | a description in which both "multi-platform" and "open"
6 | are questionable. It is not available for every platform, e.g. OS X, Linux.
7 | Its dictionary file format is not open. But this has not hindered its popularity,
8 | and many dictionaries have been created for it.
9 | 
10 | This is an attempt to reveal the MDX/MDD file format, so that my favorite dictionaries,
11 | created by MDict users, could be used elsewhere.
12 | 
13 | 
14 | MDict Files
15 | ===========
16 | MDict stores the dictionary definitions, i.e. (key word, explanation) pairs, in an MDX file and
17 | the dictionary reference data, e.g. images, pronunciations, stylesheets, in an MDD file.
18 | Although they hold different contents, the two file formats share the same structure.
19 | 
20 | MDX File Format
21 | ===============
22 | 
23 | 
24 | 
25 | MDD File Format
26 | ===============
27 | 
28 | 
29 | 
30 | Example Programs
31 | ================
32 | 
33 | readmdict.py
34 | ------------
35 | readmdict.py is an example implementation in Python. This program can read/extract mdx/mdd files.
36 | 
37 | .. note:: python-lzo is required to read mdx files created with engine 1.2.
38 |    Get the Windows version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-lzo
39 | 
40 | It can be used as a command line tool. Suppose one has oald8.mdx and oald8.mdd::
41 | 
42 |     $ python readmdict.py -x oald8.mdx
43 | 
44 | This will create an *oald8.txt* dictionary file and a *data* folder for images and pronunciation audio files.
45 | 
46 | On Windows, one can also double-click it and select the file in the popup dialog.
47 | 
48 | Or as a module::
49 | 
50 |     In [1]: from readmdict import MDX, MDD
51 | 
52 | Read an MDX file and print the first entry::
53 | 
54 |     In [2]: mdx = MDX('oald8.mdx')
55 | 
56 |     In [3]: items = mdx.items()
57 | 
58 |     In [4]: items.next()
59 |     Out[4]:
60 |     ('A',
61 |     '.........')
62 | ``mdx`` is an object holding all the info from an MDX file. ``items`` is an iterator producing 2-item tuples.
63 | In each tuple, the first element is the entry text and the second is the explanation. Both are UTF-8 encoded strings.
64 | 
65 | Read an MDD file and print the first entry::
66 | 
67 |     In [5]: mdd = MDD('oald8.mdd')
68 | 
69 |     In [6]: items = mdd.items()
70 | 
71 |     In [7]: items.next()
72 |     Out[7]:
73 |     (u'\\pic\\accordion_concertina.jpg',
74 |     '\xff\xd8\xff\xe0\x00\x10JFIF...........')
75 | 
76 | ``mdd`` is an object holding all the info from an MDD file. ``items`` is an iterator producing 2-item tuples.
77 | In each tuple, the first element is the file name and the second element is the corresponding file content.
78 | The file name is encoded in UTF-8; the file content is a plain bytes array.
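79 | 
80 | Putting the two together, the following sketch exports a whole dictionary to
81 | disk. It is not part of readmdict itself; it only assumes the ``items()``
82 | interface shown above, with both tuple elements returned as raw bytes (as in
83 | the Python 3 port)::
84 | 
85 |     import os
86 |     from readmdict import MDX, MDD
87 | 
88 |     # Write every (entry, explanation) pair to a tab-separated text file.
89 |     mdx = MDX('oald8.mdx')
90 |     with open('oald8.txt', 'wb') as f:
91 |         for key, value in mdx.items():
92 |             f.write(key + b'\t' + value + b'\n')
93 | 
94 |     # Mirror every MDD resource under ./data, mapping the internal
95 |     # '\pic\...'-style paths to native filesystem paths.
96 |     mdd = MDD('oald8.mdd')
97 |     for name, content in mdd.items():
98 |         rel = name.decode('utf-8').replace('\\', os.sep).lstrip(os.sep)
99 |         path = os.path.join('data', rel)
100 |         os.makedirs(os.path.dirname(path), exist_ok=True)
101 |         with open(path, 'wb') as f:
102 |             f.write(content)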
103 | 
104 | Acknowledgements
105 | ================
106 | The MDX/MDD file format was fully disclosed by https://github.com/zhansliu/writemdict.
107 | The encryption-handling code here is adapted from that project.
108 | 
--------------------------------------------------------------------------------
/notes/01-测试结果.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {},
7 |    "outputs": [],
8 |    "source": [
9 |     "import json"
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "code",
14 |    "execution_count": 4,
15 |    "metadata": {},
16 |    "outputs": [
17 |     {
18 |      "name": "stdout",
19 |      "output_type": "stream",
20 |      "text": [
21 |       "314205 items found\n"
22 |      ]
23 |     }
24 |    ],
25 |    "source": [
26 |     "with open(\"../output/dict_jianming-2_output.json\", \"r\", encoding=\"utf-8\") as f:\n",
27 |     "    data = json.load(f)\n",
28 |     "print(\"{} items found\".format(len(data)))"
29 |    ]
30 |   },
31 |   {
32 |    "cell_type": "code",
33 |    "execution_count": 10,
34 |    "metadata": {},
35 |    "outputs": [
36 |     {
37 |      "data": {
38 |       "text/plain": [
39 |        "{'Lexicon': 'apply oneself to',\n",
40 |        " 'type': 'Phrase',\n",
41 |        " 'PhoneticSymbols': [],\n",
42 |        " 'ParaPhrases': [{'pos': 'v.',\n",
43 |        " 'english': '',\n",
44 |        " 'chinese': '致力于',\n",
45 |        " 'Sentences': [],\n",
46 |        " 'source': 'jianming-2',\n",
47 |        " 'scene': '',\n",
48 |        " 'category': ''}]}"
49 |       ]
50 |      },
51 |      "execution_count": 10,
52 |      "metadata": {},
53 |      "output_type": "execute_result"
54 |     }
55 |    ],
56 |    "source": [
57 |     "data[10000]"
58 |    ]
59 |   },
60 |   {
61 |    "cell_type": "code",
62 |    "execution_count": 11,
63 |    "metadata": {},
64 |    "outputs": [
65 |     {
66 |      "name": "stdout",
67 |      "output_type": "stream",
68 |      "text": [
69 |       "280540 words found\n"
70 |      ]
71 |     }
72 |    ],
73 |    "source": [
74 |     "words = {i[\"Lexicon\"]: i for i in data if i[\"type\"] == \"Word\"}\n",
75 |     "print(\"{} words found\".format(len(words)))"
76 |    ]
77 |   },
78 |   {
79 |    "cell_type": "code",
80 |    "execution_count": 24,
81 |    "metadata": {},
82 |    "outputs": [
83 |     {
84 |      "data": {
85 |       "text/plain": [
86 |        "{'Lexicon': 'well',\n",
87 |        " 'type': 'Word',\n",
88 |        " 'PhoneticSymbols': [],\n",
89 |        " 'ParaPhrases': [{'pos': 'adv.',\n",
90 |        " 'english': '',\n",
91 |        " 'chinese': '好, 对, 满意地; 友好地, 和蔼地; 彻底地, 完全地',\n",
92 |        " 'Sentences': [{'english': ' Do you eat well at school?',\n",
93 |        " 'chinese': '你在学校吃得好吗?',\n",
94 |        " 'audioUrlUS': '',\n",
95 |        " 'audioUrlUK': '',\n",
96 |        " 'source': 'jianming-2'}],\n",
97 |        " 'source': 'jianming-2',\n",
98 |        " 'scene': '',\n",
99 |        " 'category': ''},\n",
100 |        " {'pos': 'adv.',\n",
101 |        " 'english': '',\n",
102 |        " 'chinese': '夸奖地, 称赞地',\n",
103 |        " 'Sentences': [{'english': ' They speak well of him at school.',\n",
104 |        " 'chinese': '学校里的人都称赞他。',\n",
105 |        " 'audioUrlUS': '',\n",
106 |        " 'audioUrlUK': '',\n",
107 |        " 'source': 'jianming-2'}],\n",
108 |        " 'source': 'jianming-2',\n",
109 |        " 'scene': '',\n",
110 |        " 'category': ''},\n",
111 |        " {'pos': 'adv.',\n",
112 |        " 'english': '',\n",
113 |        " 'chinese': '有理由地, 恰当地',\n",
114 |        " 'Sentences': [{'english': ' You did well to tell him.',\n",
115 |        " 'chinese': '你告诉了他, 做得对。',\n",
116 |        " 'audioUrlUS': '',\n",
117 |        " 'audioUrlUK': '',\n",
118 |        " 'source': 'jianming-2'}],\n",
119 |        " 'source': 'jianming-2',\n",
120 |        " 'scene': '',\n",
121 |        " 'category': ''},\n",
122 |        " {'pos': 'adv.',\n",
123 |        " 'english': '',\n",
124 |        " 'chinese': '很, 相当',\n",
125 |        " 'Sentences': 
[{'english': ' You may well be right.',\n", 126 | " 'chinese': '很可能是你对。',\n", 127 | " 'audioUrlUS': '',\n", 128 | " 'audioUrlUK': '',\n", 129 | " 'source': 'jianming-2'}],\n", 130 | " 'source': 'jianming-2',\n", 131 | " 'scene': '',\n", 132 | " 'category': ''},\n", 133 | " {'pos': 'adj.',\n", 134 | " 'english': '',\n", 135 | " 'chinese': '健康的; 痊愈的',\n", 136 | " 'Sentences': [{'english': \" I don't think he is really a well man.\",\n", 137 | " 'chinese': '我认为他并不是真正健康的人。',\n", 138 | " 'audioUrlUS': '',\n", 139 | " 'audioUrlUK': '',\n", 140 | " 'source': 'jianming-2'}],\n", 141 | " 'source': 'jianming-2',\n", 142 | " 'scene': '',\n", 143 | " 'category': ''},\n", 144 | " {'pos': 'adj.',\n", 145 | " 'english': '',\n", 146 | " 'chinese': '良好的; 正常的; 令人满意的',\n", 147 | " 'Sentences': [{'english': ' All is not well in this country.',\n", 148 | " 'chinese': '这个国家的情况不能令人满意。',\n", 149 | " 'audioUrlUS': '',\n", 150 | " 'audioUrlUK': '',\n", 151 | " 'source': 'jianming-2'}],\n", 152 | " 'source': 'jianming-2',\n", 153 | " 'scene': '',\n", 154 | " 'category': ''},\n", 155 | " {'pos': 'int.',\n", 156 | " 'english': '',\n", 157 | " 'chinese': '(用于表示惊讶, 疑虑, 接受等)',\n", 158 | " 'Sentences': [{'english': ' Well!Look at that amazing sight!',\n", 159 | " 'chinese': '哦!看那迷人的景色!',\n", 160 | " 'audioUrlUS': '',\n", 161 | " 'audioUrlUK': '',\n", 162 | " 'source': 'jianming-2'}],\n", 163 | " 'source': 'jianming-2',\n", 164 | " 'scene': '',\n", 165 | " 'category': ''},\n", 166 | " {'pos': 'n.',\n", 167 | " 'english': '',\n", 168 | " 'chinese': '井, 水井',\n", 169 | " 'Sentences': [{'english': ' They dug another well in the village.',\n", 170 | " 'chinese': '他们在村里又挖了一口井。',\n", 171 | " 'audioUrlUS': '',\n", 172 | " 'audioUrlUK': '',\n", 173 | " 'source': 'jianming-2'}],\n", 174 | " 'source': 'jianming-2',\n", 175 | " 'scene': '',\n", 176 | " 'category': ''},\n", 177 | " {'pos': 'n.',\n", 178 | " 'english': '',\n", 179 | " 'chinese': '泉; 源泉',\n", 180 | " 'Sentences': [{'english': ' A book is a well of knowledge.',\n", 181 | " 'chinese': '书是知识的源泉。',\n", 182 | " 'audioUrlUS': '',\n", 183 | " 'audioUrlUK': '',\n", 184 | " 'source': 'jianming-2'}],\n", 185 | " 'source': 'jianming-2',\n", 186 | " 'scene': '',\n", 187 | " 'category': ''},\n", 188 | " {'pos': 'vi.',\n", 189 | " 'english': '',\n", 190 | " 'chinese': '(液体)涌出; 流出; 涌流',\n", 191 | " 'Sentences': [{'english': ' Oil welled out of the ground.',\n", 192 | " 'chinese': '原油从地下涌出。',\n", 193 | " 'audioUrlUS': '',\n", 194 | " 'audioUrlUK': '',\n", 195 | " 'source': 'jianming-2'}],\n", 196 | " 'source': 'jianming-2',\n", 197 | " 'scene': '',\n", 198 | " 'category': ''}]}" 199 | ] 200 | }, 201 | "execution_count": 24, 202 | "metadata": {}, 203 | "output_type": "execute_result" 204 | } 205 | ], 206 | "source": [ 207 | "words[\"well\"]" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 18, 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "{'Lexicon': 'modern',\n", 219 | " 'type': 'Word',\n", 220 | " 'PhoneticSymbols': [],\n", 221 | " 'ParaPhrases': [{'pos': 'adj.',\n", 222 | " 'english': '',\n", 223 | " 'chinese': '现代的; 近代的',\n", 224 | " 'Sentences': [{'english': ' He was steeped in modern history.',\n", 225 | " 'chinese': '他埋头于近代史的研究。',\n", 226 | " 'audioUrlUS': '',\n", 227 | " 'audioUrlUK': '',\n", 228 | " 'source': 'jianming-2'}],\n", 229 | " 'source': 'jianming-2',\n", 230 | " 'scene': '',\n", 231 | " 'category': ''},\n", 232 | " {'pos': 'adj.',\n", 233 | " 'english': '',\n", 234 | " 'chinese': '新式的, 时髦的, 
最新的',\n", 235 | " 'Sentences': [{'english': ' He has modern ideas in spite of his great age.',\n", 236 | " 'chinese': '尽管他年事很高, 但思想观念却很入时。',\n", 237 | " 'audioUrlUS': '',\n", 238 | " 'audioUrlUK': '',\n", 239 | " 'source': 'jianming-2'},\n", 240 | " {'english': ' The dress is the most modern.',\n", 241 | " 'chinese': '这件衣服是最时髦的。',\n", 242 | " 'audioUrlUS': '',\n", 243 | " 'audioUrlUK': '',\n", 244 | " 'source': 'jianming-2'}],\n", 245 | " 'source': 'jianming-2',\n", 246 | " 'scene': '',\n", 247 | " 'category': ''},\n", 248 | " {'pos': 'adj.',\n", 249 | " 'english': '',\n", 250 | " 'chinese': '当代风格的, 现代派的',\n", 251 | " 'Sentences': [{'english': ' They went to an exhibition of modern art yesterday.',\n", 252 | " 'chinese': '昨天, 他们参观了现代美术展览。',\n", 253 | " 'audioUrlUS': '',\n", 254 | " 'audioUrlUK': '',\n", 255 | " 'source': 'jianming-2'}],\n", 256 | " 'source': 'jianming-2',\n", 257 | " 'scene': '',\n", 258 | " 'category': ''}]}" 259 | ] 260 | }, 261 | "execution_count": 18, 262 | "metadata": {}, 263 | "output_type": "execute_result" 264 | } 265 | ], 266 | "source": [ 267 | "words[\"modern\"]" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 19, 273 | "metadata": {}, 274 | "outputs": [ 275 | { 276 | "name": "stdout", 277 | "output_type": "stream", 278 | "text": [ 279 | "22334 items found\n" 280 | ] 281 | } 282 | ], 283 | "source": [ 284 | "with open(\"D:/work/python_work/中心词库/音标-22334-牛津源-2020-8-29.json\", \"r\", encoding=\"utf-8\") as f:\n", 285 | " psdata = json.load(f)\n", 286 | "print(\"{} items found\".format(len(psdata)))" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 21, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "16717\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "num = []\n", 304 | "for i in psdata:\n", 305 | " if i[\"word\"] in words:\n", 306 | " num += [i]\n", 307 | "print(len(num))" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 23, 313 | "metadata": {}, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/plain": [ 318 | "{'Lexicon': 'triathlon',\n", 319 | " 'type': 'Word',\n", 320 | " 'PhoneticSymbols': [],\n", 321 | " 'ParaPhrases': [{'pos': 'n.',\n", 322 | " 'english': '',\n", 323 | " 'chinese': '三项全能运动',\n", 324 | " 'Sentences': [],\n", 325 | " 'source': 'jianming-2',\n", 326 | " 'scene': '',\n", 327 | " 'category': ''}]}" 328 | ] 329 | }, 330 | "execution_count": 23, 331 | "metadata": {}, 332 | "output_type": "execute_result" 333 | } 334 | ], 335 | "source": [ 336 | "words[num[99][\"word\"]]" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [] 345 | } 346 | ], 347 | "metadata": { 348 | "kernelspec": { 349 | "display_name": "py36", 350 | "language": "python", 351 | "name": "py36" 352 | }, 353 | "language_info": { 354 | "codemirror_mode": { 355 | "name": "ipython", 356 | "version": 3 357 | }, 358 | "file_extension": ".py", 359 | "mimetype": "text/x-python", 360 | "name": "python", 361 | "nbconvert_exporter": "python", 362 | "pygments_lexer": "ipython3", 363 | "version": "3.7.6" 364 | } 365 | }, 366 | "nbformat": 4, 367 | "nbformat_minor": 4 368 | } 369 | -------------------------------------------------------------------------------- /notes/parse_mdx.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | 
"execution_count": 126, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import re\n", 11 | "import bs4\n", 12 | "from bs4 import BeautifulSoup\n", 13 | "from readmdict import MDX, MDD" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "mdx = MDX(r'D:\\work\\database\\dict\\牛津高阶英汉双解词典(第9版).mdx')\n", 23 | "items = mdx.items()\n", 24 | "w2item = {i[0].decode(\"utf-8\"): i[1].decode(\"utf-8\") for i in items}\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 3, 30 | "metadata": {}, 31 | "outputs": [ 32 | { 33 | "data": { 34 | "text/plain": [ 35 | "\"noun,verb 🔑 soapBrE /səʊp/ 🔊NAmE /soʊp/ 🔊 noun🔑 [uncountable, countable] a substance that you use with water for washing your body 肥皂◆soap and water肥皂和水◆a bar/piece of soap 一块肥皂◆soap bubbles肥皂泡 [countable] (informal) = soap opera ◆soaps on TV电视上播出的肥皂剧◆She's a US soap star. 她是美国肥皂剧明星。🔊🔊 🔑 soapBrE /səʊp/ 🔊NAmE /soʊp/ 🔊 verbpresent simple - I / you / we / they soap BrE /səʊp/ 🔊 NAmE /soʊp/ 🔊present simple - he / she / it soaps BrE /səʊps/ 🔊 NAmE /soʊps/ 🔊past simple soaped BrE /səʊpt/ 🔊 NAmE /soʊpt/ 🔊past participle soaped BrE /səʊpt/ 🔊 NAmE /soʊpt/ 🔊 -ing form soaping BrE /ˈsəʊpɪŋ/ 🔊 NAmE /ˈsoʊpɪŋ/ 🔊~ yourself/sb/sth to rub yourself/sb/sth with soap 抹肥皂;用肥皂擦洗 \\u2002➡\\u2002 see also soft-soap \\n\"" 36 | ] 37 | }, 38 | "execution_count": 3, 39 | "metadata": {}, 40 | "output_type": "execute_result" 41 | } 42 | ], 43 | "source": [ 44 | "# 输出所有文本内容信息\n", 45 | "bs.get_text()" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 143, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "name": "stdout", 55 | "output_type": "stream", 56 | "text": [ 57 | "\n", 58 | " \n", 59 | " \n", 61 | "\n", 62 | "
\n", 63 | "
\n", 64 | " \n", 65 | " adjective\n", 66 | " \n", 67 | " ,\n", 68 | "
\n", 69 | "
\n", 70 | " \n", 71 | " verb\n", 72 | " \n", 73 | "
\n", 74 | "
\n", 75 | "
\n", 76 | " \n", 77 | " \n", 78 | " weird\n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " BrE\n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " /\n", 97 | " \n", 98 | " wɪəd\n", 99 | " \n", 100 | " /\n", 101 | " \n", 102 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " 🔊\n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " NAmE\n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " /\n", 124 | " \n", 125 | " wɪrd\n", 126 | " \n", 127 | " /\n", 128 | " \n", 129 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " 🔊\n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | "
\n", 142 | "\n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " adjective\n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " (\n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " weird·er\n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " ,\n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " weird·est\n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " )\n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " very\n", 194 | " \n", 195 | " \n", 196 | " strange\n", 197 | " \n", 198 | " \n", 199 | " or\n", 200 | " \n", 201 | " \n", 202 | " unusual\n", 203 | " \n", 204 | " \n", 205 | " and\n", 206 | " \n", 207 | " \n", 208 | " difficult\n", 209 | " \n", 210 | " \n", 211 | " to\n", 212 | " \n", 213 | " \n", 214 | " explain\n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " 奇异的;不寻常的;怪诞的\n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " SYN\n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " strange\n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " ◆\n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " a\n", 254 | " \n", 255 | " \n", 256 | " weird\n", 257 | " \n", 258 | " \n", 259 | " dream\n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " 离奇的梦\n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " ◆\n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " She's\n", 280 | " \n", 281 | " \n", 282 | " a\n", 283 | " \n", 284 | " \n", 285 | " really\n", 286 | " \n", 287 | " \n", 288 | " weird\n", 289 | " \n", 290 | " \n", 291 | " girl\n", 292 | " \n", 293 | " .\n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " 她真是个古怪的女孩。\n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " 🔊\n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " 🔊\n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | "\n", 322 | "\n", 323 | " \n", 324 | " \n", 325 | " ◆\n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " He's\n", 332 | " \n", 333 | " \n", 334 | " got\n", 335 | " \n", 336 | " \n", 337 | " some\n", 338 | " \n", 339 | " \n", 340 | " weird\n", 341 | " \n", 342 | " \n", 343 | " ideas\n", 344 | " \n", 345 | " .\n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " 他有些怪念头。\n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " 🔊\n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " 🔊\n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | "\n", 366 | "\n", 367 | " \n", 368 | " \n", 369 | " ◆\n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " It's\n", 376 | " \n", 377 | " \n", 378 | " really\n", 379 | " \n", 380 | " \n", 381 | " weird\n", 382 | " \n", 383 | " \n", 384 | " seeing\n", 385 | " \n", 386 | " \n", 387 | " yourself\n", 388 | " \n", 389 | " \n", 390 | " on\n", 391 | " \n", 392 | " \n", 393 | " television\n", 394 | " \n", 395 | " .\n", 396 | " \n", 397 | " \n", 398 | " 
\n", 399 | " \n", 400 | " \n", 401 | " 看到自己上了电视感觉怪怪的。\n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " 🔊\n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " 🔊\n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | "\n", 416 | "\n", 417 | " \n", 418 | " \n", 419 | " ◆\n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " the\n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " weird\n", 430 | " \n", 431 | " \n", 432 | " and\n", 433 | " \n", 434 | " \n", 435 | " wonderful\n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " creatures\n", 441 | " \n", 442 | " \n", 443 | " that\n", 444 | " \n", 445 | " \n", 446 | " live\n", 447 | " \n", 448 | " \n", 449 | " beneath\n", 450 | " \n", 451 | " \n", 452 | " the\n", 453 | " \n", 454 | " \n", 455 | " sea\n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " 奇异美丽的海底生物\n", 463 | " \n", 464 | " \n", 465 | "\n", 466 | "\n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " strange\n", 474 | " \n", 475 | " \n", 476 | " in\n", 477 | " \n", 478 | " \n", 479 | " a\n", 480 | " \n", 481 | " \n", 482 | " mysterious\n", 483 | " \n", 484 | " \n", 485 | " and\n", 486 | " \n", 487 | " \n", 488 | " frightening\n", 489 | " \n", 490 | " \n", 491 | " way\n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " 离奇的;诡异的\n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " SYN\n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " eerie\n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " ◆\n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " She\n", 532 | " \n", 533 | " \n", 534 | " began\n", 535 | " \n", 536 | " \n", 537 | " to\n", 538 | " \n", 539 | " \n", 540 | " make\n", 541 | " \n", 542 | " \n", 543 | " weird\n", 544 | " \n", 545 | " \n", 546 | " inhuman\n", 547 | " \n", 548 | " \n", 549 | " sounds\n", 550 | " \n", 551 | " .\n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " 她开始发出可怕的非人的声音。\n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " 🔊\n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " 🔊\n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | "\n", 578 | "\n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " ▸\n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " weird·ly\n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " BrE\n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " /\n", 605 | " \n", 606 | " ˈwɪədli\n", 607 | " \n", 608 | " /\n", 609 | " \n", 610 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " 🔊\n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " NAmE\n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " /\n", 632 | " \n", 633 | " ˈwɪrdli\n", 634 | " \n", 635 | " /\n", 636 | " \n", 637 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " 🔊\n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | "\n", 652 | "\n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " adverb\n", 657 | " 
\n", 658 | " \n", 659 | " \n", 660 | "\n", 661 | "\n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " ◆\n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " The\n", 675 | " \n", 676 | " \n", 677 | " town\n", 678 | " \n", 679 | " \n", 680 | " was\n", 681 | " \n", 682 | " \n", 683 | " weirdly\n", 684 | " \n", 685 | " \n", 686 | " familiar\n", 687 | " \n", 688 | " .\n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " 这个城镇怪面熟的。\n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " 🔊\n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " 🔊\n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | "\n", 713 | "\n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " ▸\n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " weird·ness\n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " BrE\n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " /\n", 739 | " \n", 740 | " ˈwɪədnəs\n", 741 | " \n", 742 | " /\n", 743 | " \n", 744 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " 🔊\n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " NAmE\n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " /\n", 766 | " \n", 767 | " ˈwɪrdnəs\n", 768 | " \n", 769 | " /\n", 770 | " \n", 771 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " 🔊\n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | "\n", 785 | "\n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " noun\n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | "\n", 794 | "\n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " [\n", 800 | " \n", 801 | " \n", 802 | " uncountable\n", 803 | " \n", 804 | " \n", 805 | " ]\n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | "\n", 811 | "
\n", 812 | " \n", 813 | " \n", 814 | " weird\n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " BrE\n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " /\n", 833 | " \n", 834 | " wɪəd\n", 835 | " \n", 836 | " /\n", 837 | " \n", 838 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " 🔊\n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " NAmE\n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " /\n", 860 | " \n", 861 | " wɪrd\n", 862 | " \n", 863 | " /\n", 864 | " \n", 865 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " 🔊\n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | "
\n", 878 | "\n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " verb\n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " present simple - I / you / we / they\n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " weird\n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " BrE\n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " /\n", 913 | " \n", 914 | " wɪəd\n", 915 | " \n", 916 | " /\n", 917 | " \n", 918 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " 🔊\n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " NAmE\n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " /\n", 938 | " \n", 939 | " wɪrd\n", 940 | " \n", 941 | " /\n", 942 | " \n", 943 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " 🔊\n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | "\n", 959 | "\n", 960 | " \n", 961 | " present simple - he / she / it\n", 962 | " \n", 963 | "\n", 964 | "\n", 965 | " \n", 966 | " \n", 967 | " weirds\n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " BrE\n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " /\n", 981 | " \n", 982 | " wɪədz\n", 983 | " \n", 984 | " /\n", 985 | " \n", 986 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " 🔊\n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " NAmE\n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " /\n", 1006 | " \n", 1007 | " wɪrdz\n", 1008 | " \n", 1009 | " /\n", 1010 | " \n", 1011 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " 🔊\n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | "\n", 1023 | "\n", 1024 | " \n", 1025 | " past simple\n", 1026 | " \n", 1027 | "\n", 1028 | "\n", 1029 | " \n", 1030 | " \n", 1031 | " weirded\n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " BrE\n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " /\n", 1045 | " \n", 1046 | " ˈwɪədɪd\n", 1047 | " \n", 1048 | " /\n", 1049 | " \n", 1050 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " 🔊\n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " NAmE\n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " /\n", 1070 | " \n", 1071 | " ˈwɪrdɪd\n", 1072 | " \n", 1073 | " /\n", 1074 | " \n", 1075 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " 🔊\n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | "\n", 1087 | "\n", 1088 | " \n", 1089 | " past participle\n", 1090 | " \n", 1091 | "\n", 1092 | "\n", 1093 | " \n", 1094 | " \n", 1095 | " weirded\n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " BrE\n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " /\n", 1109 | " \n", 1110 | " ˈwɪədɪd\n", 1111 | " \n", 1112 | " /\n", 1113 | " \n", 1114 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " 🔊\n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " NAmE\n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 
1133 | " /\n", 1134 | " \n", 1135 | " ˈwɪrdɪd\n", 1136 | " \n", 1137 | " /\n", 1138 | " \n", 1139 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " 🔊\n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | "\n", 1151 | "\n", 1152 | " \n", 1153 | " -ing form\n", 1154 | " \n", 1155 | "\n", 1156 | "\n", 1157 | " \n", 1158 | " \n", 1159 | " weirding\n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " BrE\n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " /\n", 1173 | " \n", 1174 | " ˈwɪədɪŋ\n", 1175 | " \n", 1176 | " /\n", 1177 | " \n", 1178 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " 🔊\n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " NAmE\n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " /\n", 1198 | " \n", 1199 | " ˈwɪrdɪŋ\n", 1200 | " \n", 1201 | " /\n", 1202 | " \n", 1203 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " 🔊\n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | "\n", 1215 | "\n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " ●\n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " ˌweird\n", 1238 | " \n", 1239 | " \n", 1240 | " sb\n", 1241 | " \n", 1242 | " \n", 1243 | " ˈout\n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " (\n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " informal\n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " )\n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " to\n", 1267 | " \n", 1268 | " \n", 1269 | " seem\n", 1270 | " \n", 1271 | " \n", 1272 | " strange\n", 1273 | " \n", 1274 | " \n", 1275 | " or\n", 1276 | " \n", 1277 | " \n", 1278 | " worrying\n", 1279 | " \n", 1280 | " \n", 1281 | " to\n", 1282 | " \n", 1283 | " \n", 1284 | " sb\n", 1285 | " \n", 1286 | " \n", 1287 | " and\n", 1288 | " \n", 1289 | " \n", 1290 | " make\n", 1291 | " \n", 1292 | " \n", 1293 | " them\n", 1294 | " \n", 1295 | " \n", 1296 | " feel\n", 1297 | " \n", 1298 | " \n", 1299 | " uncomfortable\n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " 使感到奇怪;使感到烦恼;使感到不舒服\n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " ◆\n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " The\n", 1320 | " \n", 1321 | " \n", 1322 | " whole\n", 1323 | " \n", 1324 | " \n", 1325 | " concept\n", 1326 | " \n", 1327 | " \n", 1328 | " really\n", 1329 | " \n", 1330 | " \n", 1331 | " weirds\n", 1332 | " \n", 1333 | " \n", 1334 | " me\n", 1335 | " \n", 1336 | " \n", 1337 | " out\n", 1338 | " \n", 1339 | " .\n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " 这整个想法让我觉得十分怪异。\n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " 🔊\n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " 🔊\n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | "\n", 1370 | "\n" 1371 | ] 1372 | } 1373 | ], 1374 | "source": [ 1375 | "# 10139\n", 1376 | 
"bs = BeautifulSoup(w2item[\"weird\"], \"html.parser\")\n", 1377 | "\n", 1378 | "\n", 1379 | "def iter_parse(node, result):\n", 1380 | " if node.contents:\n", 1381 | " for i in node.children:\n", 1382 | " result[i] = {}\n", 1383 | " out = iter_parse(i, result[i])\n", 1384 | " result[i] = {\n", 1385 | " \"name\": out[0],\n", 1386 | " \"text\": out[1]\n", 1387 | " }\n", 1388 | " else:\n", 1389 | " return node.name, node.text\n", 1390 | "\n", 1391 | "\n", 1392 | "def parse_sentence(nodes):\n", 1393 | " outs = []\n", 1394 | " if not isinstance(nodes, str):\n", 1395 | " for k in nodes.children:\n", 1396 | " if not isinstance(k, str):\n", 1397 | " print(\"------------name={}, attrs={}, text={}\".format(k.name, k.attrs,\n", 1398 | " k.get_text()))\n", 1399 | " outs.append({\"english\": k.text})\n", 1400 | " return outs\n", 1401 | "\n", 1402 | "\n", 1403 | "print(bs.prettify())\n", 1404 | "\n" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "code", 1409 | "execution_count": 144, 1410 | "metadata": {}, 1411 | "outputs": [ 1412 | { 1413 | "name": "stdout", 1414 | "output_type": "stream", 1415 | "text": [ 1416 | "\n", 1417 | "id=0, node=\n", 1418 | "\n", 1419 | "id=1, node=
\n", 1420 | "\n", 1421 | "id=2, node=
weirdBrE /wɪəd/ 🔊NAmE /wɪrd/ 🔊
\n", 1422 | "\n", 1423 | "id=3, node= adjective (weird·er, weird·est) very strange or unusual and difficult to explain 奇异的;不寻常的;怪诞的 SYN strange a weird dream离奇的梦She's a really weird girl. 她真是个古怪的女孩。🔊🔊\n", 1424 | "\n", 1425 | "id=4, node=He's got some weird ideas. 他有些怪念头。🔊🔊\n", 1426 | "\n", 1427 | "id=5, node=It's really weird seeing yourself on television. 看到自己上了电视感觉怪怪的。🔊🔊\n", 1428 | "\n", 1429 | "id=6, node=the weird and wonderful creatures that live beneath the sea奇异美丽的海底生物\n", 1430 | "\n", 1431 | "id=7, node=strange in a mysterious and frightening way 离奇的;诡异的 SYN eerie She began to make weird inhuman sounds. 她开始发出可怕的非人的声音。🔊🔊\n", 1432 | "\n", 1433 | "id=8, node= weird·ly BrE /ˈwɪədli/ 🔊NAmE /ˈwɪrdli/ 🔊\n", 1434 | "\n", 1435 | "id=9, node= adverb\n", 1436 | "\n", 1437 | "id=10, node=The town was weirdly familiar. 这个城镇怪面熟的。🔊🔊\n", 1438 | "\n", 1439 | "id=11, node= weird·ness BrE /ˈwɪədnəs/ 🔊NAmE /ˈwɪrdnəs/ 🔊\n", 1440 | "\n", 1441 | "id=12, node= noun\n", 1442 | "\n", 1443 | "id=13, node= [uncountable] \n", 1444 | "\n", 1445 | "id=14, node=
weirdBrE /wɪəd/ 🔊NAmE /wɪrd/ 🔊
\n", 1446 | "\n", 1447 | "id=15, node= verbpresent simple - I / you / we / they weird BrE /wɪəd/ 🔊 NAmE /wɪrd/ 🔊\n", 1448 | "\n", 1449 | "id=16, node=present simple - he / she / it \n", 1450 | "\n", 1451 | "id=17, node=weirds BrE /wɪədz/ 🔊 NAmE /wɪrdz/ 🔊\n", 1452 | "\n", 1453 | "id=18, node=past simple \n", 1454 | "\n", 1455 | "id=19, node=weirded BrE /ˈwɪədɪd/ 🔊 NAmE /ˈwɪrdɪd/ 🔊\n", 1456 | "\n", 1457 | "id=20, node=past participle \n", 1458 | "\n", 1459 | "id=21, node=weirded BrE /ˈwɪədɪd/ 🔊 NAmE /ˈwɪrdɪd/ 🔊\n", 1460 | "\n", 1461 | "id=22, node= -ing form \n", 1462 | "\n", 1463 | "id=23, node=weirding BrE /ˈwɪədɪŋ/ 🔊 NAmE /ˈwɪrdɪŋ/ 🔊\n", 1464 | "\n", 1465 | "id=24, node= ˌweird sb ˈout(informal) to seem strange or worrying to sb and make them feel uncomfortable 使感到奇怪;使感到烦恼;使感到不舒服The whole concept really weirds me out. 这整个想法让我觉得十分怪异。🔊🔊\n", 1466 | "\n", 1467 | "id=25, node=\n", 1468 | "\n" 1469 | ] 1470 | } 1471 | ], 1472 | "source": [ 1473 | "for ic, node in enumerate(bs.children):\n", 1474 | " print(\"\\nid={}, node={}\".format(ic, node))" 1475 | ] 1476 | }, 1477 | { 1478 | "cell_type": "code", 1479 | "execution_count": 140, 1480 | "metadata": {}, 1481 | "outputs": [ 1482 | { 1483 | "name": "stdout", 1484 | "output_type": "stream", 1485 | "text": [ 1486 | "soap and water肥皂和水 肥皂和水\n", 1487 | "a bar/piece of soap 一块肥皂 一块肥皂\n", 1488 | "soap bubbles肥皂泡 肥皂泡\n", 1489 | "soaps on TV电视上播出的肥皂剧 电视上播出的肥皂剧\n", 1490 | "She's a US soap star. 她是美国肥皂剧明星。 她是美国肥皂剧明星。\n" 1491 | ] 1492 | } 1493 | ], 1494 | "source": [ 1495 | "for i in bs.find_all(\"div\", \"cixing_part\"):\n", 1496 | "# print(i.tag)\n", 1497 | " top = i.find_all(\"top-g\")\n", 1498 | " subentry = i.find_all(\"subentry-g\")\n", 1499 | " for j in subentry:\n", 1500 | " sngs=j.find_all(\"sn-gs\")\n", 1501 | " for k in sngs:\n", 1502 | " sng = k.find_all(\"sn-g\")\n", 1503 | " for m in sng:\n", 1504 | " xgs = m.find_all(\"x-gs\")\n", 1505 | " for n in xgs:\n", 1506 | " xgblk = n.find_all(\"x-g-blk\")\n", 1507 | " # 3个例句\n", 1508 | " sentences = []\n", 1509 | " for w in xgblk:\n", 1510 | " x = w.find_all(\"x\")\n", 1511 | " for v in x:\n", 1512 | " # 添加例句\n", 1513 | " # 先把中文释义提出,并从树中移除\n", 1514 | "# sentCh=v.chn.extract().get_text()\n", 1515 | " sentCh=v.chn.get_text()\n", 1516 | " # 再提取全部英文例句,可解决标点无法提取问题\n", 1517 | " sentEn=v.get_text()\n", 1518 | " sentence={\"english\": sentEn.strip(), \"chinese\": sentCh.strip(), \n", 1519 | " \"audioUrlUS\": \"\", \"audioUrlUK\":\"\", \"resource\": \"oxld_9\"}\n", 1520 | " sentences.append(sentence)\n", 1521 | " print(sentEn, sentCh)\n", 1522 | " \n", 1523 | "# print(x[0].prettify())\n" 1524 | ] 1525 | }, 1526 | { 1527 | "cell_type": "code", 1528 | "execution_count": 141, 1529 | "metadata": { 1530 | "scrolled": true 1531 | }, 1532 | "outputs": [ 1533 | { 1534 | "ename": "AttributeError", 1535 | "evalue": "ResultSet object has no attribute 'findall'. You're probably treating a list of elements like a single element. 
Did you call find_all() when you meant to call find()?", 1536 | "output_type": "error", 1537 | "traceback": [ 1538 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 1539 | "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", 1540 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0msubentry\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfindall\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"def\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 1541 | "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\bs4\\element.py\u001b[0m in \u001b[0;36m__getattr__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 2079\u001b[0m \u001b[1;34m\"\"\"Raise a helpful exception to explain a common code fix.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2080\u001b[0m raise AttributeError(\n\u001b[1;32m-> 2081\u001b[1;33m \u001b[1;34m\"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?\"\u001b[0m \u001b[1;33m%\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2082\u001b[0m )\n", 1542 | "\u001b[1;31mAttributeError\u001b[0m: ResultSet object has no attribute 'findall'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" 1543 | ] 1544 | } 1545 | ], 1546 | "source": [ 1547 | "subentry.findall(\"def\")" 1548 | ] 1549 | }, 1550 | { 1551 | "cell_type": "code", 1552 | "execution_count": 134, 1553 | "metadata": {}, 1554 | "outputs": [ 1555 | { 1556 | "data": { 1557 | "text/plain": [ 1558 | "🔑 [uncountable, countable] a substance that you use with water for washing your body 肥皂soap and water肥皂和水a bar/piece of soap 一块肥皂soap bubbles肥皂泡 [countable] (informal) = soap opera soaps on TV电视上播出的肥皂剧She's a US soap star. 她是美国肥皂剧明星。🔊🔊" 1559 | ] 1560 | }, 1561 | "execution_count": 134, 1562 | "metadata": {}, 1563 | "output_type": "execute_result" 1564 | } 1565 | ], 1566 | "source": [ 1567 | "k" 1568 | ] 1569 | }, 1570 | { 1571 | "cell_type": "code", 1572 | "execution_count": 133, 1573 | "metadata": {}, 1574 | "outputs": [ 1575 | { 1576 | "data": { 1577 | "text/plain": [ 1578 | "She's a US soap star. 
她是美国肥皂剧明星。" 1579 | ] 1580 | }, 1581 | "execution_count": 133, 1582 | "metadata": {}, 1583 | "output_type": "execute_result" 1584 | } 1585 | ], 1586 | "source": [ 1587 | "v" 1588 | ] 1589 | }, 1590 | { 1591 | "cell_type": "code", 1592 | "execution_count": 132, 1593 | "metadata": {}, 1594 | "outputs": [ 1595 | { 1596 | "data": { 1597 | "text/plain": [ 1598 | "{'e_id': 'u4cdebea65f7df6b4.-37641f80.142d72656f3.-1b5',\n", 1599 | " 'eid': 'soap_x_6',\n", 1600 | " 'status': '6',\n", 1601 | " 'tranid': '6',\n", 1602 | " 'wd': \"She's a US soap star.\"}" 1603 | ] 1604 | }, 1605 | "execution_count": 132, 1606 | "metadata": {}, 1607 | "output_type": "execute_result" 1608 | } 1609 | ], 1610 | "source": [ 1611 | "v.attrs" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "code", 1616 | "execution_count": 131, 1617 | "metadata": {}, 1618 | "outputs": [ 1619 | { 1620 | "data": { 1621 | "text/plain": [ 1622 | "\"She's a US soap star.\"" 1623 | ] 1624 | }, 1625 | "execution_count": 131, 1626 | "metadata": {}, 1627 | "output_type": "execute_result" 1628 | } 1629 | ], 1630 | "source": [ 1631 | "v.attrs['wd']" 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "code", 1636 | "execution_count": 129, 1637 | "metadata": {}, 1638 | "outputs": [ 1639 | { 1640 | "data": { 1641 | "text/plain": [ 1642 | "[She's,\n", 1643 | " a,\n", 1644 | " US,\n", 1645 | " soap,\n", 1646 | " star,\n", 1647 | " ,\n", 1648 | " 她是美国肥皂剧明星。]" 1649 | ] 1650 | }, 1651 | "execution_count": 129, 1652 | "metadata": {}, 1653 | "output_type": "execute_result" 1654 | } 1655 | ], 1656 | "source": [ 1657 | "[i for i in v.children if isinstance(i, bs4.element.Tag)]" 1658 | ] 1659 | }, 1660 | { 1661 | "cell_type": "code", 1662 | "execution_count": 130, 1663 | "metadata": {}, 1664 | "outputs": [ 1665 | { 1666 | "ename": "SyntaxError", 1667 | "evalue": "invalid syntax (, line 1)", 1668 | "output_type": "error", 1669 | "traceback": [ 1670 | "\u001b[1;36m File \u001b[1;32m\"\"\u001b[1;36m, line \u001b[1;32m1\u001b[0m\n\u001b[1;33m v.\"xhtml:a\"\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n" 1671 | ] 1672 | } 1673 | ], 1674 | "source": [ 1675 | "v.\"xhtml:a\"" 1676 | ] 1677 | }, 1678 | { 1679 | "cell_type": "code", 1680 | "execution_count": 142, 1681 | "metadata": {}, 1682 | "outputs": [ 1683 | { 1684 | "name": "stdout", 1685 | "output_type": "stream", 1686 | "text": [ 1687 | "\n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " verb\n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " present simple - I / you / we / they\n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " soap\n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " BrE\n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " /\n", 1722 | " \n", 1723 | " səʊp\n", 1724 | " \n", 1725 | " /\n", 1726 | " \n", 1727 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " 🔊\n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " NAmE\n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " /\n", 1747 | " \n", 1748 | " soʊp\n", 1749 | " \n", 1750 | " /\n", 1751 | " \n", 1752 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " 🔊\n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " 
present simple - he / she / it\n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " soaps\n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " BrE\n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " /\n", 1786 | " \n", 1787 | " səʊps\n", 1788 | " \n", 1789 | " /\n", 1790 | " \n", 1791 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " 🔊\n", 1798 | " \n", 1799 | " \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | " \n", 1804 | " NAmE\n", 1805 | " \n", 1806 | " \n", 1807 | " \n", 1808 | " \n", 1809 | " \n", 1810 | " /\n", 1811 | " \n", 1812 | " soʊps\n", 1813 | " \n", 1814 | " /\n", 1815 | " \n", 1816 | " \n", 1818 | " \n", 1819 | " \n", 1820 | " \n", 1821 | " \n", 1822 | " 🔊\n", 1823 | " \n", 1824 | " \n", 1825 | " \n", 1826 | " \n", 1827 | " \n", 1828 | " \n", 1829 | " \n", 1830 | " \n", 1831 | "\n" 1832 | ] 1833 | } 1834 | ], 1835 | "source": [ 1836 | "print(j.prettify())" 1837 | ] 1838 | }, 1839 | { 1840 | "cell_type": "code", 1841 | "execution_count": 19, 1842 | "metadata": {}, 1843 | "outputs": [ 1844 | { 1845 | "name": "stdout", 1846 | "output_type": "stream", 1847 | "text": [ 1848 | "\n", 1849 | " \n", 1850 | " \n", 1851 | " [\n", 1852 | " \n", 1853 | " \n", 1854 | " uncountable\n", 1855 | " \n", 1856 | " \n", 1857 | " \n", 1858 | " \n", 1859 | " ,\n", 1860 | " \n", 1861 | " \n", 1862 | " countable\n", 1863 | " \n", 1864 | " \n", 1865 | " ]\n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | " \n", 1870 | " a\n", 1871 | " \n", 1872 | " \n", 1873 | " substance\n", 1874 | " \n", 1875 | " \n", 1876 | " that\n", 1877 | " \n", 1878 | " \n", 1879 | " you\n", 1880 | " \n", 1881 | " \n", 1882 | " use\n", 1883 | " \n", 1884 | " \n", 1885 | " with\n", 1886 | " \n", 1887 | " \n", 1888 | " water\n", 1889 | " \n", 1890 | " \n", 1891 | " for\n", 1892 | " \n", 1893 | " \n", 1894 | " washing\n", 1895 | " \n", 1896 | " \n", 1897 | " your\n", 1898 | " \n", 1899 | " \n", 1900 | " body\n", 1901 | " \n", 1902 | " \n", 1903 | " \n", 1904 | " \n", 1905 | " \n", 1906 | " \n", 1907 | " 肥皂\n", 1908 | " \n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " ◆\n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " soap\n", 1920 | " \n", 1921 | " \n", 1922 | " and\n", 1923 | " \n", 1924 | " \n", 1925 | " water\n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " 肥皂和水\n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " ◆\n", 1940 | " \n", 1941 | " \n", 1942 | " \n", 1943 | " \n", 1944 | " a\n", 1945 | " \n", 1946 | " \n", 1947 | " \n", 1948 | " \n", 1949 | " bar\n", 1950 | " \n", 1951 | " /\n", 1952 | " \n", 1953 | " piece\n", 1954 | " \n", 1955 | " \n", 1956 | " of\n", 1957 | " \n", 1958 | " \n", 1959 | " soap\n", 1960 | " \n", 1961 | " \n", 1962 | " \n", 1963 | " \n", 1964 | " \n", 1965 | " \n", 1966 | " \n", 1967 | " \n", 1968 | " 一块肥皂\n", 1969 | " \n", 1970 | " \n", 1971 | " \n", 1972 | " \n", 1973 | " \n", 1974 | " \n", 1975 | " ◆\n", 1976 | " \n", 1977 | " \n", 1978 | " \n", 1979 | " \n", 1980 | " soap\n", 1981 | " \n", 1982 | " \n", 1983 | " bubbles\n", 1984 | " \n", 1985 | " \n", 1986 | " \n", 1987 | " \n", 1988 | " \n", 1989 | " \n", 1990 | " 肥皂泡\n", 1991 | " \n", 1992 | " \n", 1993 | " \n", 1994 | " \n", 1995 | "\n" 1996 | ] 1997 | } 1998 | ], 1999 | "source": [ 2000 | "print(sng[0].prettify())" 2001 | ] 2002 | }, 2003 | { 2004 | "cell_type": 
"code", 2005 | "execution_count": null, 2006 | "metadata": {}, 2007 | "outputs": [], 2008 | "source": [ 2009 | "paraphrases = []\n", 2010 | "for ic, node in enumerate(bs.children):\n", 2011 | " print(\"\\nid={}, name={}\".format(ic, node.name))\n", 2012 | " # div, {'id': 'noun', 'class': ['cixing_part']}\n", 2013 | " if not isinstance(node, str) and node.attrs and \"class\" in node.attrs.keys():\n", 2014 | " # 词性跳转\n", 2015 | " if \"cixing_tiaozhuan\" in node.attrs[\"class\"]:\n", 2016 | " poss = node.text.split(\",\")\n", 2017 | " # 词性部分\n", 2018 | " elif \"cixing_part\" in node.attrs[\"class\"]:\n", 2019 | " # 添加词性\n", 2020 | " paraphrase = {\"pos\": node.attrs[\"id\"]}\n", 2021 | " titleList = node.findall(\"top-g\")\n", 2022 | " posList = node.findall(\"subentry-g\")\n", 2023 | " for paras in node.children:\n", 2024 | " # 2 items: top-g;subentry-g\n", 2025 | " print(\"------name={}, attrs={}, text={}\".format(paras.name, paras.attrs, paras.get_text()))\n", 2026 | " # if not isinstance(s, str):\n", 2027 | " # 添加释义\n", 2028 | " posList = paras.findall(\"pos-g\")\n", 2029 | " if paras.name == \"top-g\":\n", 2030 | " paraphrase[\"text\"] = paras.get_text()\n", 2031 | " else:\n", 2032 | " # subentry-g\n", 2033 | " for sentence in paras.children:\n", 2034 | " # 解析例句列表\n", 2035 | " if sentence.name != \"pos-g\":\n", 2036 | " continue\n", 2037 | " paraphrase[\"sentences\"] = []\n", 2038 | " if not isinstance(sentence, str):\n", 2039 | " for s in sentence.children:\n", 2040 | " sents = parse_sentence(s)\n", 2041 | " if sents:\n", 2042 | " paraphrase[\"sentences\"] += sents\n", 2043 | " paraphrases.append(paraphrase)\n", 2044 | "\n", 2045 | "print(bs.get_text())\n", 2046 | "\n", 2047 | "print(paraphrases)\n" 2048 | ] 2049 | } 2050 | ], 2051 | "metadata": { 2052 | "kernelspec": { 2053 | "display_name": "py36", 2054 | "language": "python", 2055 | "name": "py36" 2056 | }, 2057 | "language_info": { 2058 | "codemirror_mode": { 2059 | "name": "ipython", 2060 | "version": 3 2061 | }, 2062 | "file_extension": ".py", 2063 | "mimetype": "text/x-python", 2064 | "name": "python", 2065 | "nbconvert_exporter": "python", 2066 | "pygments_lexer": "ipython3", 2067 | "version": "3.7.6" 2068 | } 2069 | }, 2070 | "nbformat": 4, 2071 | "nbformat_minor": 4 2072 | } 2073 | -------------------------------------------------------------------------------- /parser.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import re 3 | import bs4 4 | from bs4 import BeautifulSoup 5 | from readmdict import MDX, MDD 6 | import json 7 | import unicodedata 8 | 9 | tagSets = set() 10 | 11 | 12 | def text_norm(text, lang): 13 | """ 14 | 文本规范化. 
15 | """ 16 | if lang == "en" or lang == "english": 17 | text = " ".join(text.strip().split()) 18 | else: 19 | text = text.strip() 20 | text = unicodedata.normalize("NFKC", text) 21 | return text 22 | 23 | 24 | def get_tag_list(node): 25 | return [i.name for i in node if isinstance(i, bs4.element.Tag)] 26 | 27 | 28 | def get_audio(tags, pos, source, name="pron-g"): 29 | audios = [] 30 | for i in tags: 31 | audio = { 32 | "name": i.find_all(name)[0].get_text(), 33 | "audioUrl": i.find_all("a")[0].attrs["href"], 34 | "country": i.find_all(re.compile("label"))[0].get_text(), 35 | "source": source, "pos": pos 36 | } 37 | audios.append(audio) 38 | return audios 39 | 40 | 41 | def split_node_en_ch(tags): 42 | # 获取标签列表:[i.name for i in v.children if isinstance(i, bs4.element.Tag)] 43 | chs, ens = [], [] 44 | for v in tags: 45 | # 根据标签提取, chn: 先把中文提出,并从树中移除 46 | vv = v 47 | ch = unicodedata.normalize("NFKC", vv.chn.extract().get_text()) 48 | # 再提取全部英文,可解决标点无法提取问题 49 | # 根据属性提取 50 | en = vv.get_text() 51 | chs.append(ch.strip()) 52 | ens.append(en.strip()) 53 | return chs, ens 54 | 55 | 56 | def parse_sentences(node, source): 57 | sentences = [] 58 | chs, ens = split_node_en_ch(node) 59 | for ch, en in zip(chs, ens): 60 | sentence = {"chinese": ch, "english": en, 61 | "audioUrlUS": "", "audioUrlUK": "", "source": source} 62 | sentences.append(sentence) 63 | return sentences 64 | 65 | 66 | def get_sn(m, source): 67 | for t in get_tag_list(m): 68 | tagSets.add(t) 69 | # print("m=", get_tag_list(m)) 70 | # 释义信息1 71 | # xr-gs; = soap opera 72 | # gram-g: [countable] 73 | # label-g-blk: (informal) 74 | # 提取词性小类 75 | if m.find_all("gram-g"): 76 | category = m.find_all("gram-g")[0].get_text() 77 | else: 78 | category = "" 79 | # 提取应用场景 80 | if m.find_all("label-g-blk"): 81 | scene = m.find_all("label-g-blk")[0].get_text() 82 | else: 83 | scene = "" 84 | # xr-gs: = 同义词 85 | xdef = m.find_all("def") 86 | chs, ens = split_node_en_ch(xdef) 87 | # 例句列表 88 | xgs = m.find_all("x-gs") 89 | sents = [] 90 | for n in xgs: 91 | for t in get_tag_list(m): 92 | tagSets.add(t) 93 | # print("n=", get_tag_list(n)) 94 | xgblk = n.find_all("x") 95 | # 3个例句 96 | sentences = parse_sentences(xgblk, source) 97 | sents += sentences 98 | paras = {"chinese": chs[0] if chs else "", "english": ens[0] if ens else "", 99 | "category": category, "scene": scene, 100 | "Sentences": sents, "source": source} 101 | return paras 102 | 103 | 104 | def get_cixing(tags, source): 105 | phoneticsymbols = [] 106 | paraphrases = [] 107 | for i in tags: 108 | for t in get_tag_list(i): 109 | tagSets.add(t) 110 | # 词性 111 | pos = i.attrs["id"] 112 | # 提取美英音标、音频url 113 | if not phoneticsymbols: 114 | phoneticsymbols = get_audio(i.find_all("pron-g-blk"), pos, source) 115 | # 词性小类 116 | 117 | # 动词释义 118 | if i.find_all("vp-g"): 119 | # root词根、第三人称单数vp-g-ps 120 | wdVp = get_vp(i.find_all("vp-g")) 121 | if i.find_all("vpform"): 122 | # present simple一般现在时 123 | wdVpForm = i.find_all("vpform")[0].get_text() 124 | 125 | # 释义列表 126 | for m in i.find_all("sn-g"): 127 | # 获取释义、例句列表 128 | paras = get_sn(m, source) 129 | paras["pos"] = pos 130 | paraphrases.append(paras) 131 | return phoneticsymbols, paraphrases 132 | 133 | 134 | def get_vp(tags): 135 | # 获取动词的各种形式 136 | vps = [] 137 | for i in tags: 138 | vps.append({"form": i.attrs["form"], "text": i.find_all("vp")[0].get_text()}) 139 | return vps 140 | 141 | 142 | def get_sns(tags): 143 | # label, def, sn 144 | label = tags.find_all("label")[0].get_text() 145 | return 146 | 147 | 148 | def get_pv(tags, source): 
149 | """ 150 | phrasal verbs动词短语. 151 | 152 | """ 153 | # 获取短语搭配 154 | outs = [] 155 | # 短语列表 156 | for i in tags: 157 | # 释义列表 158 | paraphrases = [] 159 | for j in i.find_all("sn-g"): 160 | # 例句列表 161 | paras = get_sn(j, source) 162 | paras["pos"] = "" 163 | paraphrases.append(paras) 164 | outs.append({"phrase": i.find_all("pv")[0].get_text(), 165 | "ParaPhrases": paraphrases}) 166 | return outs 167 | 168 | 169 | def parse_oxld(lexicon, bs, source="oxld_9"): 170 | if len(lexicon.split()) > 1: 171 | lexcionType = "Phrase" 172 | else: 173 | lexcionType = "Word" 174 | result = {"Lexicon": lexicon, "type": lexcionType} 175 | phoneticsymbols = [] 176 | paraphrases = [] 177 | vpg = [] 178 | pvg = [] 179 | 180 | print(get_tag_list(bs)) 181 | 182 | # 获取词性部分(音标,释义,例句) 183 | if bs.find_all("div", "cixing_part", recursive=False): 184 | phoneticsymbols, paraphrases = get_cixing(bs.find_all("div", "cixing_part", recursive=False), source) 185 | 186 | # verb past动词时态 187 | if bs.find_all("vp-g", recursive=False): 188 | vpg = get_vp(bs.find_all("vp-g", recursive=False)) 189 | 190 | # phrasal verbs动词短语 191 | if bs.find_all("pv-gs-blk", recursive=False): 192 | for tag in bs.find_all("pv-gs-blk", recursive=False): 193 | pvg += get_pv(tag.find_all("pv-g"), source) 194 | 195 | result["PhoneticSymbols"] = phoneticsymbols 196 | result["ParaPhrases"] = paraphrases 197 | result["Inflection"] = vpg 198 | result["PhrasalVerbs"] = pvg 199 | return result 200 | 201 | 202 | def parse_jianming(lexicon, bs, source="jianming"): 203 | """ 204 | 简明英汉汉英词典 205 | """ 206 | if len(lexicon.split()) > 1: 207 | lexcionType = "Phrase" 208 | else: 209 | lexcionType = "Word" 210 | result = {"Lexicon": lexicon, "type": lexcionType} 211 | phoneticsymbols = [] 212 | paraphrases = [] 213 | 214 | contents = bs.find_all(["font", "b"]) 215 | newpos = "" 216 | newpp = "" 217 | sentences = [] 218 | curIc = 0 219 | for ic, i in enumerate(contents): 220 | if ic < curIc: 221 | continue 222 | if i.name == "font" and i.attrs["color"] == "DarkMagenta": 223 | # 新词性 224 | newpos = i.get_text() 225 | elif i.name == "b": 226 | # 新释义 227 | newpp = i.get_text() 228 | if ic + 1 < len(contents): 229 | notSent = (contents[ic + 1].name == "font" and contents[ic + 1].attrs["color"] == "DarkMagenta") or \ 230 | contents[ic + 1].name == "b" 231 | if notSent: 232 | if newpos != "" and newpp != "": 233 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences, 234 | "source": source, 235 | "scene": "", "category": ""} 236 | paraphrases.append(paras) 237 | sentences = [] 238 | elif ic + 1 == len(contents): 239 | if newpos != "" and newpp != "": 240 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences, "source": source, 241 | "scene": "", "category": ""} 242 | paraphrases.append(paras) 243 | sentences = [] 244 | elif i.name == "font" and i.attrs["color"] == "Navy": 245 | # 添加例句 246 | en = i.get_text().strip() 247 | ch = contents[ic + 1].get_text().strip() 248 | if newpos != "" and newpp != "" and en != "" and ch != "": 249 | sentence = {"english": text_norm(en, "english"), "chinese": text_norm(ch, "chinese"), "audioUrlUS": "", 250 | "audioUrlUK": "", "source": source} 251 | sentences.append(sentence) 252 | # 只有当下一个不是例句时:append释义信息 253 | if ic + 2 < len(contents): 254 | notSent = (contents[ic + 2].name == "font" and contents[ic + 2].attrs["color"] == "DarkMagenta") or \ 255 | contents[ic + 2].name == "b" 256 | if notSent: 257 | if newpos != "" and newpp != "": 258 | paras = {"pos": newpos, "english": "", 
"chinese": newpp, "Sentences": sentences, 259 | "source": source, 260 | "scene": "", "category": ""} 261 | paraphrases.append(paras) 262 | sentences = [] 263 | elif ic + 2 == len(contents): 264 | if newpos != "" and newpp != "": 265 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences, "source": source, 266 | "scene": "", "category": ""} 267 | paraphrases.append(paras) 268 | sentences = [] 269 | curIc = ic + 2 270 | 271 | result["PhoneticSymbols"] = phoneticsymbols 272 | result["ParaPhrases"] = paraphrases 273 | return result 274 | 275 | 276 | def parse_item(item, source, item2infos): 277 | bs = BeautifulSoup(item2infos[item], "html.parser") 278 | # print(bs.prettify()) 279 | result = {} 280 | if "oxld" in source: 281 | result = parse_oxld(item, bs, source) 282 | elif "jianming" in source: 283 | result = parse_jianming(item, bs, source) 284 | return result 285 | 286 | 287 | def parse_items(items, source, item2infos): 288 | results = [] 289 | for i in items: 290 | result = parse_item(i, source, item2infos) 291 | if result: 292 | results.append(result) 293 | return results 294 | 295 | 296 | def write_json(infos, fileOut): 297 | print("{} lexicons writing...".format(len(infos))) 298 | with open(fileOut, "w", encoding="utf-8") as f: 299 | json.dump(infos, f, ensure_ascii=False) 300 | 301 | 302 | def gen_dict(fileIn, source): 303 | mdx = MDX(fileIn) 304 | items = [i for i in mdx.items()] 305 | print("{} items loaded from {}".format(len(items), fileIn)) 306 | item2infos = {i[0].decode("utf-8"): i[1].decode("utf-8") for i in items} 307 | # outs = parse_items(["sorb", "weird"], source, item2infos) 308 | outs = [] 309 | for i in items: 310 | try: 311 | out = parse_item(i[0].decode("utf-8"), source, item2infos) 312 | if out["PhoneticSymbols"] or out["ParaPhrases"]: 313 | outs.append(out) 314 | except Exception as e: 315 | print("Error: {}".format(repr(e))) 316 | write_json(outs, f"./output/dict_{len(outs)}_{source}_output.json") 317 | 318 | 319 | def test(fileIn, source): 320 | mdx = MDX(fileIn) 321 | items = [i for i in mdx.items()] 322 | print("{} items loaded from {}".format(len(items), fileIn)) 323 | item2infos = {i[0].decode("utf-8"): i[1].decode("utf-8") for i in items} 324 | outs = parse_items(["sorb", "weird"], source, item2infos) 325 | # outs = [] 326 | # for i in items: 327 | # try: 328 | # out = parse_item(i[0].decode("utf-8"), source, item2infos) 329 | # if out["PhoneticSymbols"] or out["ParaPhrases"]: 330 | # outs.append(out) 331 | # except Exception as e: 332 | # print("Error: {}".format(repr(e))) 333 | write_json(outs, f"./output/dict_output.json") 334 | 335 | 336 | def test_jianming(): 337 | fileIn = r'D:\work\database\dict\简明英汉汉英词典.mdx' 338 | source = "jianming-2" 339 | test(fileIn, source) 340 | 341 | 342 | def test_oxld(): 343 | fileIn = r'D:\work\database\dict\牛津高阶英汉双解词典(第9版).mdx' 344 | source = "oxld-9" 345 | test(fileIn, source) 346 | 347 | 348 | if __name__ == "__main__": 349 | test_oxld() 350 | 351 | print("{} tags writing...".format(len(tagSets))) 352 | with open("./output/tag_set1.txt", "w") as f: 353 | for i in sorted(tagSets): 354 | f.write(i + "\n") 355 | -------------------------------------------------------------------------------- /pureSalsa20.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | """ 5 | Copyright by https://github.com/zhansliu/writemdict 6 | 7 | pureSalsa20.py -- a pure Python implementation of the Salsa20 cipher, ported to Python 3 8 | 
9 | v4.0: Added Python 3 support, dropped support for Python <= 2.5. 10 | 11 | // zhansliu 12 | 13 | Original comments below. 14 | 15 | ==================================================================== 16 | There are comments here by two authors about three pieces of software: 17 | comments by Larry Bugbee about 18 | Salsa20, the stream cipher by Daniel J. Bernstein 19 | (including comments about the speed of the C version) and 20 | pySalsa20, Bugbee's own Python wrapper for salsa20.c 21 | (including some references), and 22 | comments by Steve Witham about 23 | pureSalsa20, Witham's pure Python 2.5 implementation of Salsa20, 24 | which follows pySalsa20's API, and is in this file. 25 | 26 | Salsa20: a Fast Streaming Cipher (comments by Larry Bugbee) 27 | ----------------------------------------------------------- 28 | 29 | Salsa20 is a fast stream cipher written by Daniel Bernstein 30 | that basically uses a hash function and XOR making for fast 31 | encryption. (Decryption uses the same function.) Salsa20 32 | is simple and quick. 33 | 34 | Some Salsa20 parameter values... 35 | design strength 128 bits 36 | key length 128 or 256 bits, exactly 37 | IV, aka nonce 64 bits, always 38 | chunk size must be in multiples of 64 bytes 39 | 40 | Salsa20 has two reduced versions, 8 and 12 rounds each. 41 | 42 | One benchmark (10 MB): 43 | 1.5GHz PPC G4 102/97/89 MB/sec for 8/12/20 rounds 44 | AMD Athlon 2500+ 77/67/53 MB/sec for 8/12/20 rounds 45 | (no I/O and before Python GC kicks in) 46 | 47 | Salsa20 is a Phase 3 finalist in the EU eSTREAM competition 48 | and appears to be one of the fastest ciphers. It is well 49 | documented so I will not attempt any injustice here. Please 50 | see "References" below. 51 | 52 | ...and Salsa20 is "free for any use". 53 | 54 | 55 | pySalsa20: a Python wrapper for Salsa20 (Comments by Larry Bugbee) 56 | ------------------------------------------------------------------ 57 | 58 | pySalsa20.py is a simple ctypes Python wrapper. Salsa20 is 59 | as it's name implies, 20 rounds, but there are two reduced 60 | versions, 8 and 12 rounds each. Because the APIs are 61 | identical, pySalsa20 is capable of wrapping all three 62 | versions (number of rounds hardcoded), including a special 63 | version that allows you to set the number of rounds with a 64 | set_rounds() function. Compile the version of your choice 65 | as a shared library (not as a Python extension), name and 66 | install it as libsalsa20.so. 67 | 68 | Sample usage: 69 | from pySalsa20 import Salsa20 70 | s20 = Salsa20(key, IV) 71 | dataout = s20.encryptBytes(datain) # same for decrypt 72 | 73 | This is EXPERIMENTAL software and intended for educational 74 | purposes only. To make experimentation less cumbersome, 75 | pySalsa20 is also free for any use. 76 | 77 | THIS PROGRAM IS PROVIDED WITHOUT WARRANTY OR GUARANTEE OF 78 | ANY KIND. USE AT YOUR OWN RISK. 
79 | 80 | Enjoy, 81 | 82 | Larry Bugbee 83 | bugbee@seanet.com 84 | April 2007 85 | 86 | 87 | References: 88 | ----------- 89 | http://en.wikipedia.org/wiki/Salsa20 90 | http://en.wikipedia.org/wiki/Daniel_Bernstein 91 | http://cr.yp.to/djb.html 92 | http://www.ecrypt.eu.org/stream/salsa20p3.html 93 | http://www.ecrypt.eu.org/stream/p3ciphers/salsa20/salsa20_p3source.zip 94 | 95 | 96 | Prerequisites for pySalsa20: 97 | ---------------------------- 98 | - Python 2.5 (haven't tested in 2.4) 99 | 100 | 101 | pureSalsa20: Salsa20 in pure Python 2.5 (comments by Steve Witham) 102 | ------------------------------------------------------------------ 103 | 104 | pureSalsa20 is the stand-alone Python code in this file. 105 | It implements the underlying Salsa20 core algorithm 106 | and emulates pySalsa20's Salsa20 class API (minus a bug(*)). 107 | 108 | pureSalsa20 is MUCH slower than libsalsa20.so wrapped with pySalsa20-- 109 | about 1/1000 the speed for Salsa20/20 and 1/500 the speed for Salsa20/8, 110 | when encrypting 64k-byte blocks on my computer. 111 | 112 | pureSalsa20 is for cases where portability is much more important than 113 | speed. I wrote it for use in a "structured" random number generator. 114 | 115 | There are comments about the reasons for this slowness in 116 | http://www.tiac.net/~sw/2010/02/PureSalsa20 117 | 118 | Sample usage: 119 | from pureSalsa20 import Salsa20 120 | s20 = Salsa20(key, IV) 121 | dataout = s20.encryptBytes(datain) # same for decrypt 122 | 123 | I took the test code from pySalsa20, added a bunch of tests including 124 | rough speed tests, and moved them into the file testSalsa20.py. 125 | To test both pySalsa20 and pureSalsa20, type 126 | python testSalsa20.py 127 | 128 | (*)The bug (?) in pySalsa20 is this. The rounds variable is global to the 129 | libsalsa20.so library and not switched when switching between instances 130 | of the Salsa20 class. 131 | s1 = Salsa20( key, IV, 20 ) 132 | s2 = Salsa20( key, IV, 8 ) 133 | In this example, 134 | with pySalsa20, both s1 and s2 will do 8 rounds of encryption. 135 | with pureSalsa20, s1 will do 20 rounds and s2 will do 8 rounds. 136 | Perhaps giving each instance its own nRounds variable, which 137 | is passed to the salsa20wordtobyte() function, is insecure. I'm not a 138 | cryptographer. 139 | 140 | pureSalsa20.py and testSalsa20.py are EXPERIMENTAL software and 141 | intended for educational purposes only. To make experimentation less 142 | cumbersome, pureSalsa20.py and testSalsa20.py are free for any use. 143 | 144 | Revisions: 145 | ---------- 146 | p3.2 Fixed bug that initialized the output buffer with plaintext! 147 | Saner ramping of nreps in speed test. 148 | Minor changes and print statements. 149 | p3.1 Took timing variability out of add32() and rot32(). 150 | Made the internals more like pySalsa20/libsalsa . 151 | Put the semicolons back in the main loop! 152 | In encryptBytes(), modify a byte array instead of appending. 153 | Fixed speed calculation bug. 154 | Used subclasses instead of patches in testSalsa20.py . 155 | Added 64k-byte messages to speed test to be fair to pySalsa20. 156 | p3 First version, intended to parallel pySalsa20 version 3. 
157 | 
158 | More references:
159 | ----------------
160 | http://www.seanet.com/~bugbee/crypto/salsa20/ [pySalsa20]
161 | http://cr.yp.to/snuffle.html [The original name of Salsa20]
162 | http://cr.yp.to/snuffle/salsafamily-20071225.pdf [ Salsa20 design]
163 | http://www.tiac.net/~sw/2010/02/PureSalsa20
164 | 
165 | THIS PROGRAM IS PROVIDED WITHOUT WARRANTY OR GUARANTEE OF
166 | ANY KIND. USE AT YOUR OWN RISK.
167 | 
168 | Cheers,
169 | 
170 | Steve Witham sw at remove-this tiac dot net
171 | February, 2010
172 | """
173 | import sys
174 | assert(sys.version_info >= (2, 6))
175 | 
176 | if sys.version_info >= (3,):
177 |     integer_types = (int,)
178 |     python3 = True
179 | else:
180 |     integer_types = (int, long)
181 |     python3 = False
182 | 
183 | from struct import Struct
184 | little_u64 = Struct("<Q")       #    little-endian 64-bit unsigned.
185 |                                 #    Unpacks to a tuple of one element!
186 | little16_i32 = Struct("<16i")   # 16 little-endian 32-bit signed ints.
187 | little4_i32 = Struct("<4i")     #  4 little-endian 32-bit signed ints.
188 | little2_i32 = Struct("<2i")     #  2 little-endian 32-bit signed ints.
189 | 
190 | _version = 'p4.0'
191 | 
192 | #--------------------------------------------------------------------------
193 | 
194 | class Salsa20(object):
195 |     def __init__(self, key=None, IV=None, rounds=20):
196 |         self._lastChunk64 = True
197 |         self._IVbitlen = 64             # must be 64 bits
198 |         self.ctx = [0] * 16
199 |         if key:
200 |             self.setKey(key)
201 |         if IV:
202 |             self.setIV(IV)
203 | 
204 |         self.setRounds(rounds)
205 | 
206 | 
207 |     def setKey(self, key):
208 |         assert type(key) == bytes, 'key must be byte string'
209 |         ctx = self.ctx
210 |         if len(key) == 32:  # recommended
211 |             constants = b"expand 32-byte k"
212 |             ctx[1:5] = little4_i32.unpack(key[0:16])
213 |             ctx[11:15] = little4_i32.unpack(key[16:32])
214 |         elif len(key) == 16:
215 |             constants = b"expand 16-byte k"
216 |             ctx[1:5] = little4_i32.unpack(key[0:16])
217 |             ctx[11:15] = little4_i32.unpack(key[0:16])
218 |         else:
219 |             raise Exception("key length isn't 32 or 16 bytes.")
220 |         ctx[0], ctx[5], ctx[10], ctx[15] = little4_i32.unpack(constants)
221 | 
222 | 
223 |     def setIV(self, IV):
224 |         assert type(IV) == bytes, 'IV must be byte string'
225 |         assert len(IV) * 8 == 64, 'nonce (IV) not 64 bits'
226 |         self.IV = IV
227 |         ctx = self.ctx
228 |         ctx[6:8] = little2_i32.unpack(IV)
229 |         ctx[8:10] = [0, 0]  # Reset the block counter.
230 | 
231 | 
232 |     setNonce = setIV    # support an alternate name
233 | 
234 | 
235 |     def setCounter(self, counter):
236 |         assert type(counter) in integer_types
237 |         assert 0 <= counter < 2**64, "counter >= 2**64"
238 |         ctx = self.ctx
239 |         ctx[ 8],ctx[ 9] = little2_i32.unpack( little_u64.pack( counter ) )
240 | 
241 |     def getCounter( self ):
242 |         return little_u64.unpack( little2_i32.pack( *self.ctx[ 8:10 ] ) ) [0]
243 | 
244 | 
245 |     def setRounds(self, rounds, testing=False ):
246 |         assert testing or rounds in [8, 12, 20], 'rounds must be 8, 12, 20'
247 |         self.rounds = rounds
248 | 
249 | 
250 |     def encryptBytes(self, data):
251 |         assert type(data) == bytes, 'data must be byte string'
252 |         assert self._lastChunk64, 'previous chunk not multiple of 64 bytes'
253 |         lendata = len(data)
254 |         munged = bytearray(lendata)
255 |         for i in range( 0, lendata, 64 ):
256 |             h = salsa20_wordtobyte( self.ctx, self.rounds, checkRounds=False )
257 |             self.setCounter( ( self.getCounter() + 1 ) % 2**64 )
258 |             # Stopping at 2^70 bytes per nonce is user's responsibility.
259 |             for j in range( min( 64, lendata - i ) ):
260 |                 if python3:
261 |                     munged[ i+j ] = data[ i+j ] ^ h[j]
262 |                 else:
263 |                     munged[ i+j ] = ord(data[ i+j ]) ^ ord(h[j])
264 | 
265 |         self._lastChunk64 = not lendata % 64
266 |         return bytes(munged)
267 | 
268 |     decryptBytes = encryptBytes  # encrypt and decrypt use same function
269 | 
270 | #--------------------------------------------------------------------------
271 | 
272 | def salsa20_wordtobyte( input, nRounds=20, checkRounds=True ):
273 |     """ Do nRounds Salsa20 rounds on a copy of
274 |             input: list or tuple of 16 ints treated as little-endian unsigneds.
275 |         Returns a 64-byte string.
276 |     """
277 | 
278 |     assert( type(input) in ( list, tuple ) and len(input) == 16 )
279 |     assert( not(checkRounds) or ( nRounds in [ 8, 12, 20 ] ) )
280 | 
281 |     x = list( input )
282 | 
283 |     def XOR( a, b ):  return a ^ b
284 |     ROTATE = rot32
285 |     PLUS = add32
286 | 
287 |     for i in range( nRounds // 2 ):
288 |         # These ...XOR...ROTATE...PLUS...
lines are from ecrypt-linux.c 289 | # unchanged except for indents and the blank line between rounds: 290 | x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 0],x[12]), 7)); 291 | x[ 8] = XOR(x[ 8],ROTATE(PLUS(x[ 4],x[ 0]), 9)); 292 | x[12] = XOR(x[12],ROTATE(PLUS(x[ 8],x[ 4]),13)); 293 | x[ 0] = XOR(x[ 0],ROTATE(PLUS(x[12],x[ 8]),18)); 294 | x[ 9] = XOR(x[ 9],ROTATE(PLUS(x[ 5],x[ 1]), 7)); 295 | x[13] = XOR(x[13],ROTATE(PLUS(x[ 9],x[ 5]), 9)); 296 | x[ 1] = XOR(x[ 1],ROTATE(PLUS(x[13],x[ 9]),13)); 297 | x[ 5] = XOR(x[ 5],ROTATE(PLUS(x[ 1],x[13]),18)); 298 | x[14] = XOR(x[14],ROTATE(PLUS(x[10],x[ 6]), 7)); 299 | x[ 2] = XOR(x[ 2],ROTATE(PLUS(x[14],x[10]), 9)); 300 | x[ 6] = XOR(x[ 6],ROTATE(PLUS(x[ 2],x[14]),13)); 301 | x[10] = XOR(x[10],ROTATE(PLUS(x[ 6],x[ 2]),18)); 302 | x[ 3] = XOR(x[ 3],ROTATE(PLUS(x[15],x[11]), 7)); 303 | x[ 7] = XOR(x[ 7],ROTATE(PLUS(x[ 3],x[15]), 9)); 304 | x[11] = XOR(x[11],ROTATE(PLUS(x[ 7],x[ 3]),13)); 305 | x[15] = XOR(x[15],ROTATE(PLUS(x[11],x[ 7]),18)); 306 | 307 | x[ 1] = XOR(x[ 1],ROTATE(PLUS(x[ 0],x[ 3]), 7)); 308 | x[ 2] = XOR(x[ 2],ROTATE(PLUS(x[ 1],x[ 0]), 9)); 309 | x[ 3] = XOR(x[ 3],ROTATE(PLUS(x[ 2],x[ 1]),13)); 310 | x[ 0] = XOR(x[ 0],ROTATE(PLUS(x[ 3],x[ 2]),18)); 311 | x[ 6] = XOR(x[ 6],ROTATE(PLUS(x[ 5],x[ 4]), 7)); 312 | x[ 7] = XOR(x[ 7],ROTATE(PLUS(x[ 6],x[ 5]), 9)); 313 | x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 7],x[ 6]),13)); 314 | x[ 5] = XOR(x[ 5],ROTATE(PLUS(x[ 4],x[ 7]),18)); 315 | x[11] = XOR(x[11],ROTATE(PLUS(x[10],x[ 9]), 7)); 316 | x[ 8] = XOR(x[ 8],ROTATE(PLUS(x[11],x[10]), 9)); 317 | x[ 9] = XOR(x[ 9],ROTATE(PLUS(x[ 8],x[11]),13)); 318 | x[10] = XOR(x[10],ROTATE(PLUS(x[ 9],x[ 8]),18)); 319 | x[12] = XOR(x[12],ROTATE(PLUS(x[15],x[14]), 7)); 320 | x[13] = XOR(x[13],ROTATE(PLUS(x[12],x[15]), 9)); 321 | x[14] = XOR(x[14],ROTATE(PLUS(x[13],x[12]),13)); 322 | x[15] = XOR(x[15],ROTATE(PLUS(x[14],x[13]),18)); 323 | 324 | for i in range( len( input ) ): 325 | x[i] = PLUS( x[i], input[i] ) 326 | return little16_i32.pack( *x ) 327 | 328 | #--------------------------- 32-bit ops ------------------------------- 329 | 330 | def trunc32( w ): 331 | """ Return the bottom 32 bits of w as a Python int. 332 | This creates longs temporarily, but returns an int. """ 333 | w = int( ( w & 0x7fffFFFF ) | -( w & 0x80000000 ) ) 334 | assert type(w) == int 335 | return w 336 | 337 | 338 | def add32( a, b ): 339 | """ Add two 32-bit words discarding carry above 32nd bit, 340 | and without creating a Python long. 341 | Timing shouldn't vary. 342 | """ 343 | lo = ( a & 0xFFFF ) + ( b & 0xFFFF ) 344 | hi = ( a >> 16 ) + ( b >> 16 ) + ( lo >> 16 ) 345 | return ( -(hi & 0x8000) | ( hi & 0x7FFF ) ) << 16 | ( lo & 0xFFFF ) 346 | 347 | 348 | def rot32( w, nLeft ): 349 | """ Rotate 32-bit word left by nLeft or right by -nLeft 350 | without creating a Python long. 351 | Timing depends on nLeft but not on w. 352 | """ 353 | nLeft &= 31 # which makes nLeft >= 0 354 | if nLeft == 0: 355 | return w 356 | 357 | # Note: now 1 <= nLeft <= 31. 358 | # RRRsLLLLLL There are nLeft RRR's, (31-nLeft) LLLLLL's, 359 | # => sLLLLLLRRR and one s which becomes the sign bit. 
360 |     RRR = ( ( ( w >> 1 ) & 0x7fffFFFF ) >> ( 31 - nLeft ) )
361 |     sLLLLLL = -( (1<<(31-nLeft)) & w ) | (0x7fffFFFF>>nLeft) & w
362 |     return RRR | ( sLLLLLL << nLeft )
363 | 
364 | 
365 | # --------------------------------- end -----------------------------------
366 | 
--------------------------------------------------------------------------------
/readmdict.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # readmdict.py
4 | # Octopus MDict Dictionary File (.mdx) and Resource File (.mdd) Analyser
5 | #
6 | # Copyright (C) 2012, 2013, 2015 Xiaoqiang Wang <xiaoqiangwang AT gmail DOT com>
7 | #
8 | # This program is a free software; you can redistribute it and/or modify
9 | # it under the terms of the GNU General Public License as published by
10 | # the Free Software Foundation, version 3 of the License.
11 | #
12 | # You can get a copy of GNU General Public License along this program
13 | # But you can always get it from http://www.gnu.org/licenses/gpl.txt
14 | #
15 | # This program is distributed in the hope that it will be useful,
16 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
17 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
18 | # GNU General Public License for more details.
19 | 
20 | from struct import pack, unpack
21 | from io import BytesIO
22 | import re
23 | import sys
24 | 
25 | from ripemd128 import ripemd128
26 | from pureSalsa20 import Salsa20
27 | 
28 | # zlib compression is used for engine version >=2.0
29 | import zlib
30 | # LZO compression is used for engine version < 2.0
31 | try:
32 |     import lzo
33 | except ImportError:
34 |     lzo = None
35 |     print("LZO compression support is not available")
36 | 
37 | # 2x3 compatible
38 | if sys.hexversion >= 0x03000000:
39 |     unicode = str
40 | 
41 | 
42 | def _unescape_entities(text):
43 |     """
44 |     unescape offending tags &lt; &gt; &quot; &amp;
45 |     """
46 |     text = text.replace(b'&lt;', b'<')
47 |     text = text.replace(b'&gt;', b'>')
48 |     text = text.replace(b'&quot;', b'"')
49 |     text = text.replace(b'&amp;', b'&')
50 |     return text
51 | 
52 | 
53 | def _fast_decrypt(data, key):
54 |     b = bytearray(data)
55 |     key = bytearray(key)
56 |     previous = 0x36
57 |     for i in range(len(b)):
58 |         t = (b[i] >> 4 | b[i] << 4) & 0xff
59 |         t = t ^ previous ^ (i & 0xff) ^ key[i % len(key)]
60 |         previous = b[i]
61 |         b[i] = t
62 |     return bytes(b)
63 | 
64 | 
65 | def _mdx_decrypt(comp_block):
66 |     key = ripemd128(comp_block[4:8] + pack(b'<L', 0x3695))
67 |     return comp_block[0:8] + _fast_decrypt(comp_block[8:], key)
68 | 
69 | 
70 | def _salsa_decrypt(ciphertext, encrypt_key):
71 |     s20 = Salsa20(key=encrypt_key, IV=b"\x00" * 8, rounds=8)
72 |     return s20.encryptBytes(ciphertext)
73 | 
74 | 
75 | def _decrypt_regcode_by_deviceid(reg_code, deviceid):
76 |     deviceid_digest = ripemd128(deviceid)
77 |     s20 = Salsa20(key=deviceid_digest, IV=b"\x00" * 8, rounds=8)
78 |     encrypt_key = s20.encryptBytes(reg_code)
79 |     return encrypt_key
80 | 
81 | 
82 | def _decrypt_regcode_by_email(reg_code, email):
83 |     email_digest = ripemd128(email.decode().encode('utf-16-le'))
84 |     s20 = Salsa20(key=email_digest, IV=b"\x00" * 8, rounds=8)
85 |     encrypt_key = s20.encryptBytes(reg_code)
86 |     return encrypt_key
87 | 
88 | 
89 | class MDict(object):
90 |     """
91 |     Base class which reads in header and key block.
92 |     It has no public methods and serves only as code sharing base class.
93 |     """
94 |     def __init__(self, fname, encoding='', passcode=None):
95 |         self._fname = fname
96 |         self._encoding = encoding.upper()
97 |         self._passcode = passcode
98 | 
99 |         self.header = self._read_header()
100 |         try:
101 |             self._key_list = self._read_keys()
102 |         except:
103 |             print("Try Brutal Force on Encrypted Key Blocks")
104 |             self._key_list = self._read_keys_brutal()
105 | 
106 |     def __len__(self):
107 |         return self._num_entries
108 | 
109 |     def __iter__(self):
110 |         return self.keys()
111 | 
112 |     def keys(self):
113 |         """
114 |         Return an iterator over dictionary keys.
115 |         """
116 |         return (key_value for key_id, key_value in self._key_list)
117 | 
118 |     def _read_number(self, f):
119 |         return unpack(self._number_format, f.read(self._number_width))[0]
120 | 
121 |     def _parse_header(self, header):
122 |         """
123 |         extract attributes from <Dict attr="value" ... >
124 |         """
125 |         taglist = re.findall(b'(\w+)="(.*?)"', header, re.DOTALL)
126 |         tagdict = {}
127 |         for key, value in taglist:
128 |             tagdict[key] = _unescape_entities(value)
129 |         return tagdict
130 | 
131 |     def _decode_key_block_info(self, key_block_info_compressed):
132 |         if self._version >= 2:
133 |             # zlib compression
134 |             assert(key_block_info_compressed[:4] == b'\x02\x00\x00\x00')
135 |             # decrypt if needed
136 |             if self._encrypt & 0x02:
137 |                 key_block_info_compressed = _mdx_decrypt(key_block_info_compressed)
138 |             # decompress
139 |             key_block_info = zlib.decompress(key_block_info_compressed[8:])
140 |             # adler checksum
141 |             adler32 = unpack('>I', key_block_info_compressed[4:8])[0]
142 |             assert(adler32 == zlib.adler32(key_block_info) & 0xffffffff)
143 |         else:
144 |             # no compression
145 |             key_block_info = key_block_info_compressed
146 |         # decode
147 |         key_block_info_list = []
148 |         num_entries = 0
149 |         i = 0
150 |         if self._version >= 2:
151 |             byte_format = '>H'
152 |             byte_width = 2
153 |             text_term = 1
154 |         else:
155 |             byte_format = '>B'
156 |             byte_width = 1
157 | text_term = 0 158 | 159 | while i < len(key_block_info): 160 | # number of entries in current key block 161 | num_entries += unpack(self._number_format, key_block_info[i:i+self._number_width])[0] 162 | i += self._number_width 163 | # text head size 164 | text_head_size = unpack(byte_format, key_block_info[i:i+byte_width])[0] 165 | i += byte_width 166 | # text head 167 | if self._encoding != 'UTF-16': 168 | i += text_head_size + text_term 169 | else: 170 | i += (text_head_size + text_term) * 2 171 | # text tail size 172 | text_tail_size = unpack(byte_format, key_block_info[i:i+byte_width])[0] 173 | i += byte_width 174 | # text tail 175 | if self._encoding != 'UTF-16': 176 | i += text_tail_size + text_term 177 | else: 178 | i += (text_tail_size + text_term) * 2 179 | # key block compressed size 180 | key_block_compressed_size = unpack(self._number_format, key_block_info[i:i+self._number_width])[0] 181 | i += self._number_width 182 | # key block decompressed size 183 | key_block_decompressed_size = unpack(self._number_format, key_block_info[i:i+self._number_width])[0] 184 | i += self._number_width 185 | key_block_info_list += [(key_block_compressed_size, key_block_decompressed_size)] 186 | 187 | #assert(num_entries == self._num_entries) 188 | 189 | return key_block_info_list 190 | 191 | def _decode_key_block(self, key_block_compressed, key_block_info_list): 192 | key_list = [] 193 | i = 0 194 | for compressed_size, decompressed_size in key_block_info_list: 195 | start = i 196 | end = i + compressed_size 197 | # 4 bytes : compression type 198 | key_block_type = key_block_compressed[start:start+4] 199 | # 4 bytes : adler checksum of decompressed key block 200 | adler32 = unpack('>I', key_block_compressed[start+4:start+8])[0] 201 | if key_block_type == b'\x00\x00\x00\x00': 202 | key_block = key_block_compressed[start+8:end] 203 | elif key_block_type == b'\x01\x00\x00\x00': 204 | if lzo is None: 205 | print("LZO compression is not supported") 206 | break 207 | # decompress key block 208 | header = b'\xf0' + pack('>I', decompressed_size) 209 | key_block = lzo.decompress(header + key_block_compressed[start+8:end]) 210 | elif key_block_type == b'\x02\x00\x00\x00': 211 | # decompress key block 212 | key_block = zlib.decompress(key_block_compressed[start+8:end]) 213 | # extract one single key block into a key list 214 | key_list += self._split_key_block(key_block) 215 | # notice that adler32 returns signed value 216 | assert(adler32 == zlib.adler32(key_block) & 0xffffffff) 217 | 218 | i += compressed_size 219 | return key_list 220 | 221 | def _split_key_block(self, key_block): 222 | key_list = [] 223 | key_start_index = 0 224 | while key_start_index < len(key_block): 225 | # the corresponding record's offset in record block 226 | key_id = unpack(self._number_format, key_block[key_start_index:key_start_index+self._number_width])[0] 227 | # key text ends with '\x00' 228 | if self._encoding == 'UTF-16': 229 | delimiter = b'\x00\x00' 230 | width = 2 231 | else: 232 | delimiter = b'\x00' 233 | width = 1 234 | i = key_start_index + self._number_width 235 | while i < len(key_block): 236 | if key_block[i:i+width] == delimiter: 237 | key_end_index = i 238 | break 239 | i += width 240 | key_text = key_block[key_start_index+self._number_width:key_end_index]\ 241 | .decode(self._encoding, errors='ignore').encode('utf-8').strip() 242 | key_start_index = key_end_index + width 243 | key_list += [(key_id, key_text)] 244 | return key_list 245 | 246 | def _read_header(self): 247 | f = open(self._fname, 'rb') 248 
|         # number of bytes of header text
249 |         header_bytes_size = unpack('>I', f.read(4))[0]
250 |         header_bytes = f.read(header_bytes_size)
251 |         # 4 bytes: adler32 checksum of header, in little endian
252 |         adler32 = unpack('<I', f.read(4))[0]
253 |         assert adler32 == zlib.adler32(header_bytes) & 0xffffffff
254 |         # mark down key block offset
255 |         self._key_block_offset = f.tell()
256 |         f.close()
257 | 
258 |         # header text in utf-16 encoding ending with '\x00\x00'
259 |         header_text = header_bytes[:-2].decode('utf-16').encode('utf-8')
260 |         header_tag = self._parse_header(header_text)
261 |         if not self._encoding:
262 |             encoding = header_tag[b'Encoding']
263 |             if sys.hexversion >= 0x03000000:
264 |                 encoding = encoding.decode('utf-8')
265 |             # GB18030 > GBK > GB2312
266 |             if encoding in ['GBK', 'GB2312']:
267 |                 encoding = 'GB18030'
268 |             self._encoding = encoding
269 |         # encryption flag
270 |         #   0x00 - no encryption
271 |         #   0x01 - encrypt record block
272 |         #   0x02 - encrypt key info block
273 |         if b'Encrypted' not in header_tag or header_tag[b'Encrypted'] == b'No':
274 |             self._encrypt = 0
275 |         elif header_tag[b'Encrypted'] == b'Yes':
276 |             self._encrypt = 1
277 |         else:
278 |             self._encrypt = int(header_tag[b'Encrypted'])
279 | 
280 |         # stylesheet attribute if present takes form of:
281 |         #   style_number # 1-255
282 |         #   style_begin  # or ''
283 |         #   style_end    # or ''
284 |         # store stylesheet in dict in the form of
285 |         # {'number' : ('style_begin', 'style_end')}
286 |         self._stylesheet = {}
287 |         if header_tag.get(b'StyleSheet'):
288 |             lines = header_tag[b'StyleSheet'].splitlines()
289 |             for i in range(0, len(lines), 3):
290 |                 self._stylesheet[lines[i]] = (lines[i+1], lines[i+2])
291 | 
292 |         # before version 2.0, number is 4 bytes integer
293 |         # version 2.0 and above uses 8 bytes
294 |         self._version = float(header_tag[b'GeneratedByEngineVersion'])
295 |         if self._version < 2.0:
296 |             self._number_width = 4
297 |             self._number_format = '>I'
298 |         else:
299 |             self._number_width = 8
300 |             self._number_format = '>Q'
301 | 
302 |         return header_tag
303 | 
304 |     def _read_keys(self):
305 |         f = open(self._fname, 'rb')
306 |         f.seek(self._key_block_offset)
307 | 
308 |         # the following numbers could be encrypted
309 |         if self._version >= 2.0:
310 |             num_bytes = 8 * 5
311 |         else:
312 |             num_bytes = 4 * 4
313 |         block = f.read(num_bytes)
314 | 
315 |         if self._encrypt & 1:
316 |             if self._passcode is None:
317 |                 raise RuntimeError('user identification is needed to read encrypted file')
318 |             regcode, userid = self._passcode
319 |             if isinstance(userid, unicode):
320 |                 userid = userid.encode('utf8')
321 |             if self.header[b'RegisterBy'] == b'EMail':
322 |                 encrypted_key = _decrypt_regcode_by_email(regcode, userid)
323 |             else:
324 |                 encrypted_key = _decrypt_regcode_by_deviceid(regcode, userid)
325 |             block = _salsa_decrypt(block, encrypted_key)
326 | 
327 |         # decode this block
328 |         sf = BytesIO(block)
329 |         # number of key blocks
330 |         num_key_blocks = self._read_number(sf)
331 |         # number of entries
332 |         self._num_entries = self._read_number(sf)
333 |         # number of bytes of key block info after decompression
334 |         if self._version >= 2.0:
335 |             key_block_info_decomp_size = self._read_number(sf)
336 |         # number of bytes of key block info
337 |         key_block_info_size = self._read_number(sf)
338 |         # number of bytes of key block
339 |         key_block_size = self._read_number(sf)
340 | 
341 |         # 4 bytes: adler checksum of previous 5 numbers
342 |         if self._version >= 2.0:
343 |             adler32 = unpack('>I', f.read(4))[0]
344 |             assert adler32 == (zlib.adler32(block) & 0xffffffff)
345 | 
346 |         # read key block info, which indicates key block's compressed and decompressed size
347 |         key_block_info = f.read(key_block_info_size)
348 |         key_block_info_list = self._decode_key_block_info(key_block_info)
349 |         assert(num_key_blocks == len(key_block_info_list))
350 | 
351 |         # read key block
352 |         key_block_compressed = f.read(key_block_size)
353 |         # extract key block
354 |         key_list =
self._decode_key_block(key_block_compressed, key_block_info_list) 355 | 356 | self._record_block_offset = f.tell() 357 | f.close() 358 | 359 | return key_list 360 | 361 | def _read_keys_brutal(self): 362 | f = open(self._fname, 'rb') 363 | f.seek(self._key_block_offset) 364 | 365 | # the following numbers could be encrypted, disregard them! 366 | if self._version >= 2.0: 367 | num_bytes = 8 * 5 + 4 368 | key_block_type = b'\x02\x00\x00\x00' 369 | else: 370 | num_bytes = 4 * 4 371 | key_block_type = b'\x01\x00\x00\x00' 372 | block = f.read(num_bytes) 373 | 374 | # key block info 375 | # 4 bytes '\x02\x00\x00\x00' 376 | # 4 bytes adler32 checksum 377 | # unknown number of bytes follows until '\x02\x00\x00\x00' which marks the beginning of key block 378 | key_block_info = f.read(8) 379 | if self._version >= 2.0: 380 | assert key_block_info[:4] == b'\x02\x00\x00\x00' 381 | while True: 382 | fpos = f.tell() 383 | t = f.read(1024) 384 | index = t.find(key_block_type) 385 | if index != -1: 386 | key_block_info += t[:index] 387 | f.seek(fpos + index) 388 | break 389 | else: 390 | key_block_info += t 391 | 392 | key_block_info_list = self._decode_key_block_info(key_block_info) 393 | key_block_size = sum(list(zip(*key_block_info_list))[0]) 394 | 395 | # read key block 396 | key_block_compressed = f.read(key_block_size) 397 | # extract key block 398 | key_list = self._decode_key_block(key_block_compressed, key_block_info_list) 399 | 400 | self._record_block_offset = f.tell() 401 | f.close() 402 | 403 | self._num_entries = len(key_list) 404 | return key_list 405 | 406 | 407 | class MDD(MDict): 408 | """ 409 | MDict resource file format (*.MDD) reader. 410 | >>> mdd = MDD('example.mdd') 411 | >>> len(mdd) 412 | 208 413 | >>> for filename,content in mdd.items(): 414 | ... 
print filename, content[:10]
415 |     """
416 |     def __init__(self, fname, passcode=None):
417 |         MDict.__init__(self, fname, encoding='UTF-16', passcode=passcode)
418 | 
419 |     def items(self):
420 |         """Return a generator which in turn produces tuples in the form of (filename, content)
421 |         """
422 |         return self._decode_record_block()
423 | 
424 |     def _decode_record_block(self):
425 |         f = open(self._fname, 'rb')
426 |         f.seek(self._record_block_offset)
427 | 
428 |         num_record_blocks = self._read_number(f)
429 |         num_entries = self._read_number(f)
430 |         assert(num_entries == self._num_entries)
431 |         record_block_info_size = self._read_number(f)
432 |         record_block_size = self._read_number(f)
433 | 
434 |         # record block info section
435 |         record_block_info_list = []
436 |         size_counter = 0
437 |         for i in range(num_record_blocks):
438 |             compressed_size = self._read_number(f)
439 |             decompressed_size = self._read_number(f)
440 |             record_block_info_list += [(compressed_size, decompressed_size)]
441 |             size_counter += self._number_width * 2
442 |         assert(size_counter == record_block_info_size)
443 | 
444 |         # actual record block
445 |         offset = 0
446 |         i = 0
447 |         size_counter = 0
448 |         for compressed_size, decompressed_size in record_block_info_list:
449 |             record_block_compressed = f.read(compressed_size)
450 |             # 4 bytes: compression type
451 |             record_block_type = record_block_compressed[:4]
452 |             # 4 bytes: adler32 checksum of decompressed record block
453 |             adler32 = unpack('>I', record_block_compressed[4:8])[0]
454 |             if record_block_type == b'\x00\x00\x00\x00':
455 |                 record_block = record_block_compressed[8:]
456 |             elif record_block_type == b'\x01\x00\x00\x00':
457 |                 if lzo is None:
458 |                     print("LZO compression is not supported")
459 |                     break
460 |                 # decompress
461 |                 header = b'\xf0' + pack('>I', decompressed_size)
462 |                 record_block = lzo.decompress(header + record_block_compressed[8:])
463 |             elif record_block_type == b'\x02\x00\x00\x00':
464 |                 # decompress
465 |                 record_block = zlib.decompress(record_block_compressed[8:])
466 | 
467 |             # notice that adler32 returns a signed value
468 |             assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
469 | 
470 |             assert(len(record_block) == decompressed_size)
471 |             # split record block according to the offset info from key block
472 |             while i < len(self._key_list):
473 |                 record_start, key_text = self._key_list[i]
474 |                 # reach the end of current record block
475 |                 if record_start - offset >= len(record_block):
476 |                     break
477 |                 # record end index
478 |                 if i < len(self._key_list) - 1:
479 |                     record_end = self._key_list[i + 1][0]
480 |                 else:
481 |                     record_end = len(record_block) + offset
482 |                 i += 1
483 |                 data = record_block[record_start - offset:record_end - offset]
484 |                 yield key_text, data
485 |             offset += len(record_block)
486 |             size_counter += compressed_size
487 |         assert(size_counter == record_block_size)
488 | 
489 |         f.close()
490 | 
491 | 
492 | class MDX(MDict):
493 |     """
494 |     MDict dictionary file format (*.MDX) reader.
495 |     >>> mdx = MDX('example.mdx')
496 |     >>> len(mdx)
497 |     42481
498 |     >>> for key, value in mdx.items():
499 |     ...
print key, value[:10] 500 | """ 501 | def __init__(self, fname, encoding='', substyle=False, passcode=None): 502 | MDict.__init__(self, fname, encoding, passcode) 503 | self._substyle = substyle 504 | 505 | def items(self): 506 | """Return a generator which in turn produce tuples in the form of (key, value) 507 | """ 508 | return self._decode_record_block() 509 | 510 | def _substitute_stylesheet(self, txt): 511 | # substitute stylesheet definition 512 | txt_list = re.split('`\d+`', txt) 513 | txt_tag = re.findall('`\d+`', txt) 514 | txt_styled = txt_list[0] 515 | for j, p in enumerate(txt_list[1:]): 516 | style = self._stylesheet[txt_tag[j][1:-1]] 517 | if p and p[-1] == '\n': 518 | txt_styled = txt_styled + style[0] + p.rstrip() + style[1] + '\r\n' 519 | else: 520 | txt_styled = txt_styled + style[0] + p + style[1] 521 | return txt_styled 522 | 523 | def _decode_record_block(self): 524 | f = open(self._fname, 'rb') 525 | f.seek(self._record_block_offset) 526 | 527 | num_record_blocks = self._read_number(f) 528 | num_entries = self._read_number(f) 529 | assert(num_entries == self._num_entries) 530 | record_block_info_size = self._read_number(f) 531 | record_block_size = self._read_number(f) 532 | 533 | # record block info section 534 | record_block_info_list = [] 535 | size_counter = 0 536 | for i in range(num_record_blocks): 537 | compressed_size = self._read_number(f) 538 | decompressed_size = self._read_number(f) 539 | record_block_info_list += [(compressed_size, decompressed_size)] 540 | size_counter += self._number_width * 2 541 | assert(size_counter == record_block_info_size) 542 | 543 | # actual record block data 544 | offset = 0 545 | i = 0 546 | size_counter = 0 547 | for compressed_size, decompressed_size in record_block_info_list: 548 | record_block_compressed = f.read(compressed_size) 549 | # 4 bytes indicates block compression type 550 | record_block_type = record_block_compressed[:4] 551 | # 4 bytes adler checksum of uncompressed content 552 | adler32 = unpack('>I', record_block_compressed[4:8])[0] 553 | # no compression 554 | if record_block_type == b'\x00\x00\x00\x00': 555 | record_block = record_block_compressed[8:] 556 | # lzo compression 557 | elif record_block_type == b'\x01\x00\x00\x00': 558 | if lzo is None: 559 | print("LZO compression is not supported") 560 | break 561 | # decompress 562 | header = b'\xf0' + pack('>I', decompressed_size) 563 | record_block = lzo.decompress(header + record_block_compressed[8:]) 564 | # zlib compression 565 | elif record_block_type == b'\x02\x00\x00\x00': 566 | # decompress 567 | record_block = zlib.decompress(record_block_compressed[8:]) 568 | 569 | # notice that adler32 return signed value 570 | assert(adler32 == zlib.adler32(record_block) & 0xffffffff) 571 | 572 | assert(len(record_block) == decompressed_size) 573 | # split record block according to the offset info from key block 574 | while i < len(self._key_list): 575 | record_start, key_text = self._key_list[i] 576 | # reach the end of current record block 577 | if record_start - offset >= len(record_block): 578 | break 579 | # record end index 580 | if i < len(self._key_list)-1: 581 | record_end = self._key_list[i+1][0] 582 | else: 583 | record_end = len(record_block) + offset 584 | i += 1 585 | record = record_block[record_start-offset:record_end-offset] 586 | # convert to utf-8 587 | record = record.decode(self._encoding, errors='ignore').strip(u'\x00').encode('utf-8') 588 | # substitute styles 589 | if self._substyle and self._stylesheet: 590 | record = 
self._substitute_stylesheet(record)
591 | 
592 |                 yield key_text, record
593 |             offset += len(record_block)
594 |             size_counter += compressed_size
595 |         assert(size_counter == record_block_size)
596 | 
597 |         f.close()
598 | 
599 | 
600 | if __name__ == '__main__':
601 |     import sys
602 |     import os
603 |     import os.path
604 |     import argparse
605 |     import codecs
606 | 
607 |     def passcode(s):
608 |         try:
609 |             regcode, userid = s.split(',')
610 |         except:
611 |             raise argparse.ArgumentTypeError("Passcode must be regcode,userid")
612 |         try:
613 |             regcode = codecs.decode(regcode, 'hex')
614 |         except:
615 |             raise argparse.ArgumentTypeError("regcode must be a 32 bytes hexadecimal string")
616 |         return regcode, userid
617 | 
618 |     parser = argparse.ArgumentParser()
619 |     parser.add_argument('-x', '--extract', action="store_true",
620 |                         help='extract mdx to source format and extract files from mdd')
621 |     parser.add_argument('-s', '--substyle', action="store_true",
622 |                         help='substitute style definition if present')
623 |     parser.add_argument('-d', '--datafolder', default="data",
624 |                         help='folder to extract data files from mdd')
625 |     parser.add_argument('-e', '--encoding', default="",
626 |                         help='override the encoding specified in the file header')
627 |     parser.add_argument('-p', '--passcode', default=None, type=passcode,
628 |                         help='register_code,email_or_deviceid')
629 |     parser.add_argument("filename", nargs='?', help="mdx file name")
630 |     args = parser.parse_args()
631 | 
632 |     # use GUI to select file, default to extract
633 |     if not args.filename:
634 |         import Tkinter
635 |         import tkFileDialog
636 |         root = Tkinter.Tk()
637 |         root.withdraw()
638 |         args.filename = tkFileDialog.askopenfilename(parent=root)
639 |         args.extract = True
640 | 
641 |     if not os.path.exists(args.filename):
642 |         sys.exit("Please specify a valid MDX/MDD file")
643 | 
644 |     base, ext = os.path.splitext(args.filename)
645 | 
646 |     # read mdx file
647 |     if ext.lower() == os.path.extsep + 'mdx':
648 |         mdx = MDX(args.filename, args.encoding, args.substyle, args.passcode)
649 |         if type(args.filename) is unicode:
650 |             bfname = args.filename.encode('utf-8')
651 |         else:
652 |             bfname = args.filename
653 |         print('======== %s ========' % bfname)
654 |         print('  Number of Entries : %d' % len(mdx))
655 |         for key, value in mdx.header.items():
656 |             print('  %s : %s' % (key, value))
657 |     else:
658 |         mdx = None
659 | 
660 |     # find companion mdd file
661 |     mdd_filename = ''.join([base, os.path.extsep, 'mdd'])
662 |     if os.path.exists(mdd_filename):
663 |         mdd = MDD(mdd_filename, args.passcode)
664 |         if type(mdd_filename) is unicode:
665 |             bfname = mdd_filename.encode('utf-8')
666 |         else:
667 |             bfname = mdd_filename
668 |         print('======== %s ========' % bfname)
669 |         print('  Number of Entries : %d' % len(mdd))
670 |         for key, value in mdd.header.items():
671 |             print('  %s : %s' % (key, value))
672 |     else:
673 |         mdd = None
674 | 
675 |     if args.extract:
676 |         # write out glos
677 |         if mdx:
678 |             output_fname = ''.join([base, os.path.extsep, 'txt'])
679 |             tf = open(output_fname, 'wb')
680 |             for key, value in mdx.items():
681 |                 tf.write(key)
682 |                 tf.write(b'\r\n')
683 |                 tf.write(value)
684 |                 if not value.endswith(b'\n'):
685 |                     tf.write(b'\r\n')
686 |                 tf.write(b'\r\n')
687 |             tf.close()
688 |             # write out style
689 |             if mdx.header.get(b'StyleSheet'):
690 |                 style_fname = ''.join([base, '_style', os.path.extsep, 'txt'])
691 |                 sf = open(style_fname, 'wb')
692 |                 sf.write(b'\r\n'.join(mdx.header[b'StyleSheet'].splitlines()))
693 |                 sf.close()
694 |         # write out optional data files
695 |         if
mdd: 696 | datafolder = os.path.join(os.path.dirname(args.filename), args.datafolder) 697 | if not os.path.exists(datafolder): 698 | os.makedirs(datafolder) 699 | for key, value in mdd.items(): 700 | fname = key.decode('utf-8').replace('\\', os.path.sep) 701 | dfname = datafolder + fname 702 | if not os.path.exists(os.path.dirname(dfname)): 703 | os.makedirs(os.path.dirname(dfname)) 704 | df = open(dfname, 'wb') 705 | df.write(value) 706 | df.close() 707 | -------------------------------------------------------------------------------- /ripemd128.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright by https://github.com/zhansliu/writemdict 3 | 4 | ripemd128.py - A simple ripemd128 library in pure Python. 5 | 6 | Supports both Python 2 (versions >= 2.6) and Python 3. 7 | 8 | Usage: 9 | from ripemd128 import ripemd128 10 | digest = ripemd128(b"The quick brown fox jumps over the lazy dog") 11 | assert(digest == b"\x3f\xa9\xb5\x7f\x05\x3c\x05\x3f\xbe\x27\x35\xb2\x38\x0d\xb5\x96") 12 | 13 | """ 14 | 15 | 16 | 17 | import struct 18 | 19 | 20 | # follows this description: http://homes.esat.kuleuven.be/~bosselae/ripemd/rmd128.txt 21 | 22 | def f(j, x, y, z): 23 | assert(0 <= j and j < 64) 24 | if j < 16: 25 | return x ^ y ^ z 26 | elif j < 32: 27 | return (x & y) | (z & ~x) 28 | elif j < 48: 29 | return (x | (0xffffffff & ~y)) ^ z 30 | else: 31 | return (x & z) | (y & ~z) 32 | 33 | def K(j): 34 | assert(0 <= j and j < 64) 35 | if j < 16: 36 | return 0x00000000 37 | elif j < 32: 38 | return 0x5a827999 39 | elif j < 48: 40 | return 0x6ed9eba1 41 | else: 42 | return 0x8f1bbcdc 43 | 44 | def Kp(j): 45 | assert(0 <= j and j < 64) 46 | if j < 16: 47 | return 0x50a28be6 48 | elif j < 32: 49 | return 0x5c4dd124 50 | elif j < 48: 51 | return 0x6d703ef3 52 | else: 53 | return 0x00000000 54 | 55 | def padandsplit(message): 56 | """ 57 | returns a two-dimensional array X[i][j] of 32-bit integers, where j ranges 58 | from 0 to 16. 59 | First pads the message to length in bytes is congruent to 56 (mod 64), 60 | by first adding a byte 0x80, and then padding with 0x00 bytes until the 61 | message length is congruent to 56 (mod 64). Then adds the little-endian 62 | 64-bit representation of the original length. Finally, splits the result 63 | up into 64-byte blocks, which are further parsed as 32-bit integers. 64 | """ 65 | origlen = len(message) 66 | padlength = 64 - ((origlen - 56) % 64) #minimum padding is 1! 
67 |     message += b"\x80"
68 |     message += b"\x00" * (padlength - 1)
69 |     message += struct.pack("<Q", origlen * 8)
70 |     assert(len(message) % 64 == 0)
71 |     return [
72 |         [
73 |             struct.unpack("<L", message[i + j:i + j + 4])[0]
74 |             for j in range(0, 64, 4)
75 |         ]
76 |         for i in range(0, len(message), 64)
77 |     ]
78 | 
79 | 
80 | def add(*args):
81 |     return sum(args) & 0xffffffff
82 | 
83 | 
84 | def rol(s, x):
85 |     return (x << s | x >> (32 - s)) & 0xffffffff
86 | 
87 | r = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
88 |       7, 4,13, 1,10, 6,15, 3,12, 0, 9, 5, 2,14,11, 8,
89 |       3,10,14, 4, 9,15, 8, 1, 2, 7, 0, 6,13,11, 5,12,
90 |       1, 9,11,10, 0, 8,12, 4,13, 3, 7,15,14, 5, 6, 2]
91 | rp = [ 5,14, 7, 0, 9, 2,11, 4,13, 6,15, 8, 1,10, 3,12,
92 |        6,11, 3, 7, 0,13, 5,10,14,15, 8,12, 4, 9, 1, 2,
93 |       15, 5, 1, 3, 7,14, 6, 9,11, 8,12, 2,10, 0, 4,13,
94 |        8, 6, 4, 1, 3,11,15, 0, 5,12, 2,13, 9, 7,10,14]
95 | s = [11,14,15,12, 5, 8, 7, 9,11,13,14,15, 6, 7, 9, 8,
96 |       7, 6, 8,13,11, 9, 7,15, 7,12,15, 9,11, 7,13,12,
97 |      11,13, 6, 7,14, 9,13,15,14, 8,13, 6, 5,12, 7, 5,
98 |      11,12,14,15,14,15, 9, 8, 9,14, 5, 6, 8, 6, 5,12]
99 | sp = [ 8, 9, 9,11,13,15,15, 5, 7, 7, 8,11,14,14,12, 6,
100 |        9,13,15, 7,12, 8, 9,11, 7, 7,12, 7, 6,15,13,11,
101 |        9, 7,15,11, 8, 6, 6,14,12,13, 5,14,13,13, 7, 5,
102 |       15, 5, 8,11,14,14, 6,14, 6, 9,12, 9,12, 5,15, 8]
103 | 
104 | 
105 | def ripemd128(message):
106 |     h0 = 0x67452301
107 |     h1 = 0xefcdab89
108 |     h2 = 0x98badcfe
109 |     h3 = 0x10325476
110 |     X = padandsplit(message)
111 |     for i in range(len(X)):
112 |         (A,B,C,D) = (h0,h1,h2,h3)
113 |         (Ap,Bp,Cp,Dp) = (h0,h1,h2,h3)
114 |         for j in range(64):
115 |             T = rol(s[j], add(A, f(j,B,C,D), X[i][r[j]], K(j)))
116 |             (A,D,C,B) = (D,C,B,T)
117 |             T = rol(sp[j], add(Ap, f(63-j,Bp,Cp,Dp), X[i][rp[j]], Kp(j)))
118 |             (Ap,Dp,Cp,Bp)=(Dp,Cp,Bp,T)
119 |         T = add(h1,C,Dp)
120 |         h1 = add(h2,D,Ap)
121 |         h2 = add(h3,A,Bp)
122 |         h3 = add(h0,B,Cp)
123 |         h0 = T
124 | 
125 | 
126 |     return struct.pack("<4L", h0, h1, h2, h3)
--------------------------------------------------------------------------------
/xiaozhan.py:
--------------------------------------------------------------------------------
32 |         if len(lexicon.split()) > 1:
33 |             lexiconType = "Phrase"
34 |         else:
35 |             lexiconType = "Word"
36 |         result = {"Lexicon": lexicon, "type": lexiconType}
37 |         for k in self.items:
38 |             try:
39 |                 if k == "PhoneticSymbols":
40 |                     result[k] = self.get_PhoneticSymbols(lexicon)
41 |                 elif k == "Derivatives":
42 |                     word_id = html.xpath("//body")[0].attrib["data-word_id"]
43 |                     result[k] = self.get_Derivatives(word_id, lexicon)
44 |                 else:
45 |                     result[k] = getattr(self, "get_%s" % k)(html)
46 |             except Exception as e:
47 |                 print("Error: {}, {}".format(lexicon, repr(e)))
48 |         # save the lexicon info
49 |         if lexicon not in result["Inflections"].values():
50 |             isSave = False
51 |             for k in self.items:
52 |                 if k in result.keys() and result[k]:
53 |                     isSave = True
54 |                     break
55 |             if isSave:
56 |                 self.save_infos(lexicon, result)
57 |         else:
58 |             print("Warning: {} in Inflections: {}".format(lexicon, result["Inflections"]))
59 |         return result
60 | 
61 |     def get_phonetic_symbol(self, html):
62 |         """
63 |         Extract one phonetic symbol entry from a page.
64 |         """
65 | 
66 |         ps = html.xpath("//div[@class='cssVocWordVideo jsControlAudio']/span")
67 |         outs = []
68 |         if len(ps) >= 2:
69 |             country = ps[0].text
70 |             name = "/" + ps[1].text[1:-1] + "/"
71 |             out = {"country": self.countryCh2En[country], "audioUrl": "", "name": name, "source": self.source}
72 |             outs.append(out)
73 |         return outs
74 | 
75 |     def get_PhoneticSymbols(self, lexicon):
76 |         """
77 |         Extract US/UK phonetic symbols.
78 |         """
79 |         outs = []
80 |         # IELTS
81 |         url = self.url % ("ielts", lexicon)
82 |         html = etree.parse(url, etree.HTMLParser(encoding="utf-8"))
83 |         outs += self.get_phonetic_symbol(html)
84 | 
85 |         # TOEFL
86 |         url = self.url % ("toefl", lexicon)
87 |         html = etree.parse(url, etree.HTMLParser(encoding="utf-8"))
88 |         outs += self.get_phonetic_symbol(html)
89 | 
90 |         return outs
91 | 
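    # --- Illustrative sketch (added note, not part of the crawler): the
    # pairing pattern used by the methods around here is two parallel XPath
    # queries -- one for the English nodes, one for the Chinese nodes --
    # aligned with zip(). With an invented stand-in for a top.zhan.com
    # fragment:
    #
    #   from lxml import etree
    #   page = etree.HTML("<div><p class='cssVocExEnglish'>I like tea.</p>"
    #                     "<p class='cssVocExChinese'>我喜欢茶。</p></div>")
    #   ens = [i.xpath("string(.)").strip()
    #          for i in page.xpath("//p[@class='cssVocExEnglish']")]
    #   chs = [i.xpath("string(.)").strip()
    #          for i in page.xpath("//p[@class='cssVocExChinese']")]
    #   assert list(zip(ens, chs)) == [("I like tea.", "我喜欢茶。")]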
95 | """ 96 | ps = html.xpath("//li[@class='cssVocCont jsVocCont active']/ul/li") 97 | paraphrases = [] 98 | for p in ps: 99 | try: 100 | # 获取词性、释义信息 101 | paras = p.xpath("./div/p[@class='cssVocTotoleChinese']/text()")[0] 102 | para = paras.split(".", 1) 103 | if len(para) != 2: 104 | continue 105 | pos = para[0] + "." 106 | parapch = para[1].strip() 107 | 108 | # 获取简明例句 109 | sentInfos = p.xpath("./div/div/div[1]/descendant::p[@class='cssVocExEnglish']") 110 | jianmingSentEns = [i.xpath('string(.)').strip() for i in sentInfos] 111 | sentInfos = p.xpath("./div/div/div[1]/descendant::p[@class='cssVocExChinese']") 112 | jianmingSentChs = [i.xpath('string(.)').strip() for i in sentInfos] 113 | 114 | # 获取情景例句 115 | sentInfos = p.xpath("./div/div/div[2]/descendant::p[@class='cssVocExEnglish']") 116 | sceneSentEns = [i.xpath('string(.)').strip() for i in sentInfos] 117 | sentInfos = p.xpath("./div/div/div[2]/descendant::p[@class='cssVocExChinese']") 118 | sceneSentChs = [i.xpath('string(.)').strip() for i in sentInfos] 119 | 120 | # 获取托福考试例句 121 | sentInfos = p.xpath("./div/div/div[3]/descendant::p[@class='cssVocExEnglish']") 122 | toeflSentEns = [i.xpath('string(.)').strip() for i in sentInfos] 123 | sentInfos = p.xpath("./div/div/div[3]/descendant::p[@class='cssVocExChinese']") 124 | toeflSentChs = [i.xpath('string(.)').strip() for i in sentInfos] 125 | 126 | # 添加例句 127 | sentences = [] 128 | if len(jianmingSentEns) == len(jianmingSentChs): 129 | sentences += [{"english": e, "chinese": c, "source": self.source + "-jianming", "audioUrlUS": "", 130 | "audioUrlUK": ""} 131 | for e, c in zip(jianmingSentEns, jianmingSentChs)] 132 | if len(sceneSentEns) == len(sceneSentChs): 133 | sentences += [{"english": e, "chinese": c, "source": self.source + "-scene", "audioUrlUS": "", 134 | "audioUrlUK": ""} 135 | for e, c in zip(sceneSentEns, sceneSentChs)] 136 | if len(toeflSentEns) == len(toeflSentChs): 137 | sentences += [{"english": e, "chinese": c, "source": self.source + "-" + name, "audioUrlUS": "", 138 | "audioUrlUK": ""} 139 | for e, c in zip(toeflSentEns, toeflSentChs)] 140 | paraphrase = {"pos": pos, "english": "", "chinese": parapch, "Sentences": sentences, 141 | "source": self.source} 142 | 143 | paraphrases.append(paraphrase) 144 | except Exception as e: 145 | print("Error: {}".format(repr(e))) 146 | pass 147 | return paraphrases 148 | 149 | def get_Inflections(self, html): 150 | """ 151 | 变形词提取. 152 | """ 153 | words = html.xpath("//ul[@class='cssVocForMatVaried']/li/text()") 154 | names = html.xpath("//ul[@class='cssVocForMatVaried']/li/span/text()") 155 | assert len(words) == len(names) 156 | out = {} 157 | for w, n in zip(words, names): 158 | out[n] = w.strip() 159 | return out 160 | 161 | def get_fixed_collocations(self, html): 162 | """ 163 | 固定搭配提取. 164 | """ 165 | result = html.xpath("//li[@class='cssVocContTwo jsVocContTwo active']/ul/li") 166 | outs = [] 167 | for r in result: 168 | collection = r.xpath("./div/p[@class='cssVocTotoleChinese']/text()")[0].strip() 169 | ch = r.xpath("./div/p[@class='cssVocTotoleEng']/text()")[0].strip() 170 | outs.append({"name": collection, "chinese": ch, "source": self.source}) 171 | return outs 172 | 173 | def get_idiomatic_usage(self, html): 174 | """ 175 | 习惯用法提取. 
176 | """ 177 | result = html.xpath("//ul[@class='cssVocContTogole jsVocContTogole']") 178 | outs = [] 179 | for r in result: 180 | usage = r.xpath("./div/p[@class='cssVocTotoleChinese']/text()") 181 | ch = r.xpath("./div/p[@class='cssVocTotoleEng']/text()")[0].strip() 182 | outs.append({"idiomatic_usage": usage, "chinese": ch, "source": self.source}) 183 | return outs 184 | 185 | def get_Collocations(self, html): 186 | """ 187 | 搭配提取. 188 | """ 189 | # 固定搭配 190 | outs = self.get_fixed_collocations(html) 191 | # 习惯用法 192 | # outs += self.get_idiomatic_usage(html) 193 | return outs 194 | 195 | def get_Derivatives(self, id, lexicon): 196 | """ 197 | 派生词提取. 198 | """ 199 | outs = [] 200 | html = etree.parse(f"http://top.zhan.com/vocab/detail/one-2-ten.html?test_type=2&word_id={id}&word={lexicon}", 201 | etree.HTMLParser(encoding="utf-8")) 202 | ens = html.xpath("//p[@class='cssDeriWordsBoxId']/text()") 203 | chs = html.xpath("//ul[@class='cssDeriWordsBoxType']/li/text()") 204 | assert len(ens) == len(chs) 205 | for en, ch in zip(ens, chs): 206 | para = ch.split(".", 1) 207 | if len(para) != 2: 208 | continue 209 | pos = para[0].strip() + "." 210 | parapch = para[1].strip() 211 | outs.append({"Lexicon": en.strip(), "chinese": parapch, "pos": pos, "source": self.source}) 212 | return outs 213 | 214 | def save_infos(self, lexicon, infos): 215 | """ 216 | 保存词汇信息. 217 | """ 218 | with open(self.dictPath + str(lexicon), "w", encoding="utf-8") as f: 219 | json.dump(infos, f, ensure_ascii=False) 220 | 221 | def read_infos(self, lexicon): 222 | """ 223 | 读取词汇信息. 224 | """ 225 | with open(self.dictPath + str(lexicon), "r", encoding="utf-8") as f: 226 | return json.load(f) 227 | 228 | 229 | if __name__ == "__main__": 230 | c = XiaozhanCrawler() 231 | import pandas as pd 232 | from multiprocessing import Pool 233 | 234 | c.get_infos("taste") 235 | 236 | # words = pd.read_csv(r"D:\work\database\单词-缺失美英音标-汇总2-sorted.csv", header=None) 237 | # print("words shape={}".format(words.shape)) 238 | # 239 | # p = Pool(1) 240 | # words = words[0].values 241 | # for i in range(len(words)): 242 | # p.apply(c.get_infos, (words[i],)) 243 | --------------------------------------------------------------------------------