├── .gitattributes
├── .gitignore
├── MDD.svg
├── MDX.svg
├── README.md
├── notes
│   ├── 01-测试结果.ipynb
│   └── parse_mdx.ipynb
├── parser.py
├── pureSalsa20.py
├── readmdict.py
├── ripemd128.py
└── xiaozhan.py
/.gitattributes:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binhetech/mdict-parser/257885176aa572953b044e9ff68b88fecc86cdf9/.gitattributes
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | .idea
3 | .ipynb_checkpoints
4 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | An Analysis of MDX/MDD File Format
2 | ==================================
3 |
4 | MDict claims to be a multi-platform, open dictionary application, but both
5 | claims are questionable: it is not available for every platform, e.g.
6 | OS X and Linux, and its dictionary file format is not open. Still, this
7 | has not hindered its popularity, and many dictionaries have been created
8 | for it.
9 |
10 | This is an attempt to reverse-engineer the MDX/MDD file formats, so that my favorite
11 | dictionaries, created by MDict users, can be used elsewhere.
12 |
13 |
14 | MDict Files
15 | ===========
16 | MDict stores the dictionary definitions, i.e. (keyword, explanation) pairs, in the MDX file
17 | and the dictionary reference data, e.g. images, pronunciations and stylesheets, in the MDD file.
18 | Although they hold different content, the two file formats share the same structure.
19 |
20 | MDX File Format
21 | ===============
22 |
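23 | The overall layout is sketched in *MDX.svg* at the repository root:
24 |
25 | .. image:: MDX.svg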
23 |
24 |
25 | MDD File Format
26 | ===============
27 |
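28 | The corresponding layout is sketched in *MDD.svg*:
29 |
30 | .. image:: MDD.svg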
28 |
29 |
30 | Example Programs
31 | ================
32 |
33 | readmdict.py
34 | ------------
35 | readmdict.py is an example implementation in Python. It can read and extract MDX/MDD files.
36 |
37 | .. note:: python-lzo is required to read mdx files created with engine 1.2.
38 | Get Windows version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-lzo
39 |
40 | It can be used as a command line tool. Suppose one has oald8.mdx and oald8.mdd::
41 |
42 | $ python readmdict.py -x oald8.mdx
43 |
44 | This creates an *oald8.txt* dictionary file and a *data* folder for images and pronunciation audio files.
45 |
46 | On Windows, one can also double-click it and select the file in the popup dialog.
47 |
48 | Or as a module::
49 |
50 | In [1]: from readmdict import MDX, MDD
51 |
52 | Read MDX file and print the first entry::
53 |
54 | In [2]: mdx = MDX('oald8.mdx')
55 |
56 | In [3]: items = mdx.items()
57 |
58 |     In [4]: next(items)
59 | Out[4]:
60 | ('A',
61 | '.........')
62 | ``mdx`` is an object holding all the information from an MDX file. ``items`` is an iterator producing 2-item tuples.
63 | In each tuple, the first element is the entry text and the second is the explanation. Both are UTF-8 encoded byte strings.
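64 |
65 | Since both elements are byte strings, decode them before use. A minimal sketch
66 | (assuming Python 3) that dumps every entry to a text file::
67 |
68 |     from readmdict import MDX
69 |
70 |     with open('oald8.txt', 'w', encoding='utf-8') as f:
71 |         for key, value in MDX('oald8.mdx').items():
72 |             f.write(key.decode('utf-8') + '\n' + value.decode('utf-8') + '\n')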
64 |
65 | Read MDD file and print the first entry::
66 |
67 | In [5]: mdd = MDD('oald8.mdd')
68 |
69 | In [6]: items = mdd.items()
70 |
71 |     In [7]: next(items)
72 | Out[7]:
73 | (u'\\pic\\accordion_concertina.jpg',
74 | '\xff\xd8\xff\xe0\x00\x10JFIF...........')
75 |
76 | ``mdd`` is an object holding all the information from an MDD file. ``items`` is an iterator producing 2-item tuples.
77 | In each tuple, the first element is the file name and the second is the corresponding file content.
78 | The file name is UTF-8 encoded; the file content is a plain byte array.
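79 |
80 | A minimal sketch (assuming Python 3) that unpacks an MDD into a local folder,
81 | given that the internal names use backslash separators as shown above::
82 |
83 |     import os
84 |     from readmdict import MDD
85 |
86 |     for name, content in MDD('oald8.mdd').items():
87 |         path = os.path.join('data', *name.decode('utf-8').strip('\\').split('\\'))
88 |         os.makedirs(os.path.dirname(path), exist_ok=True)
89 |         with open(path, 'wb') as f:
90 |             f.write(content)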
79 |
80 | Acknowledgements
81 | ================
82 | The file format was fully disclosed by https://github.com/zhansliu/writemdict.
83 | The encryption handling is taken from that project.
84 |
--------------------------------------------------------------------------------
/notes/01-测试结果.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import json"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 4,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "name": "stdout",
19 | "output_type": "stream",
20 | "text": [
21 | "314205 items found\n"
22 | ]
23 | }
24 | ],
25 | "source": [
26 | "with open(\"../output/dict_jianming-2_output.json\", \"r\", encoding=\"utf-8\") as f:\n",
27 | " data = json.load(f)\n",
28 | "print(\"{} items found\".format(len(data)))"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 10,
34 | "metadata": {},
35 | "outputs": [
36 | {
37 | "data": {
38 | "text/plain": [
39 | "{'Lexicon': 'apply oneself to',\n",
40 | " 'type': 'Phrase',\n",
41 | " 'PhoneticSymbols': [],\n",
42 | " 'ParaPhrases': [{'pos': 'v.',\n",
43 | " 'english': '',\n",
44 | " 'chinese': '致力于',\n",
45 | " 'Sentences': [],\n",
46 | " 'source': 'jianming-2',\n",
47 | " 'scene': '',\n",
48 | " 'category': ''}]}"
49 | ]
50 | },
51 | "execution_count": 10,
52 | "metadata": {},
53 | "output_type": "execute_result"
54 | }
55 | ],
56 | "source": [
57 | "data[10000]"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 11,
63 | "metadata": {},
64 | "outputs": [
65 | {
66 | "name": "stdout",
67 | "output_type": "stream",
68 | "text": [
69 | "280540 words found\n"
70 | ]
71 | }
72 | ],
73 | "source": [
74 | "words = {i[\"Lexicon\"]: i for i in data if i[\"type\"] == \"Word\"}\n",
75 | "print(\"{} words found\".format(len(words)))"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 24,
81 | "metadata": {},
82 | "outputs": [
83 | {
84 | "data": {
85 | "text/plain": [
86 | "{'Lexicon': 'well',\n",
87 | " 'type': 'Word',\n",
88 | " 'PhoneticSymbols': [],\n",
89 | " 'ParaPhrases': [{'pos': 'adv.',\n",
90 | " 'english': '',\n",
91 | " 'chinese': '好, 对, 满意地; 友好地, 和蔼地; 彻底地, 完全地',\n",
92 | " 'Sentences': [{'english': ' Do you eat well at school?',\n",
93 | " 'chinese': '你在学校吃得好吗?',\n",
94 | " 'audioUrlUS': '',\n",
95 | " 'audioUrlUK': '',\n",
96 | " 'source': 'jianming-2'}],\n",
97 | " 'source': 'jianming-2',\n",
98 | " 'scene': '',\n",
99 | " 'category': ''},\n",
100 | " {'pos': 'adv.',\n",
101 | " 'english': '',\n",
102 | " 'chinese': '夸奖地, 称赞地',\n",
103 | " 'Sentences': [{'english': ' They speak well of him at school.',\n",
104 | " 'chinese': '学校里的人都称赞他。',\n",
105 | " 'audioUrlUS': '',\n",
106 | " 'audioUrlUK': '',\n",
107 | " 'source': 'jianming-2'}],\n",
108 | " 'source': 'jianming-2',\n",
109 | " 'scene': '',\n",
110 | " 'category': ''},\n",
111 | " {'pos': 'adv.',\n",
112 | " 'english': '',\n",
113 | " 'chinese': '有理由地, 恰当地',\n",
114 | " 'Sentences': [{'english': ' You did well to tell him.',\n",
115 | " 'chinese': '你告诉了他, 做得对。',\n",
116 | " 'audioUrlUS': '',\n",
117 | " 'audioUrlUK': '',\n",
118 | " 'source': 'jianming-2'}],\n",
119 | " 'source': 'jianming-2',\n",
120 | " 'scene': '',\n",
121 | " 'category': ''},\n",
122 | " {'pos': 'adv.',\n",
123 | " 'english': '',\n",
124 | " 'chinese': '很, 相当',\n",
125 | " 'Sentences': [{'english': ' You may well be right.',\n",
126 | " 'chinese': '很可能是你对。',\n",
127 | " 'audioUrlUS': '',\n",
128 | " 'audioUrlUK': '',\n",
129 | " 'source': 'jianming-2'}],\n",
130 | " 'source': 'jianming-2',\n",
131 | " 'scene': '',\n",
132 | " 'category': ''},\n",
133 | " {'pos': 'adj.',\n",
134 | " 'english': '',\n",
135 | " 'chinese': '健康的; 痊愈的',\n",
136 | " 'Sentences': [{'english': \" I don't think he is really a well man.\",\n",
137 | " 'chinese': '我认为他并不是真正健康的人。',\n",
138 | " 'audioUrlUS': '',\n",
139 | " 'audioUrlUK': '',\n",
140 | " 'source': 'jianming-2'}],\n",
141 | " 'source': 'jianming-2',\n",
142 | " 'scene': '',\n",
143 | " 'category': ''},\n",
144 | " {'pos': 'adj.',\n",
145 | " 'english': '',\n",
146 | " 'chinese': '良好的; 正常的; 令人满意的',\n",
147 | " 'Sentences': [{'english': ' All is not well in this country.',\n",
148 | " 'chinese': '这个国家的情况不能令人满意。',\n",
149 | " 'audioUrlUS': '',\n",
150 | " 'audioUrlUK': '',\n",
151 | " 'source': 'jianming-2'}],\n",
152 | " 'source': 'jianming-2',\n",
153 | " 'scene': '',\n",
154 | " 'category': ''},\n",
155 | " {'pos': 'int.',\n",
156 | " 'english': '',\n",
157 | " 'chinese': '(用于表示惊讶, 疑虑, 接受等)',\n",
158 | " 'Sentences': [{'english': ' Well!Look at that amazing sight!',\n",
159 | " 'chinese': '哦!看那迷人的景色!',\n",
160 | " 'audioUrlUS': '',\n",
161 | " 'audioUrlUK': '',\n",
162 | " 'source': 'jianming-2'}],\n",
163 | " 'source': 'jianming-2',\n",
164 | " 'scene': '',\n",
165 | " 'category': ''},\n",
166 | " {'pos': 'n.',\n",
167 | " 'english': '',\n",
168 | " 'chinese': '井, 水井',\n",
169 | " 'Sentences': [{'english': ' They dug another well in the village.',\n",
170 | " 'chinese': '他们在村里又挖了一口井。',\n",
171 | " 'audioUrlUS': '',\n",
172 | " 'audioUrlUK': '',\n",
173 | " 'source': 'jianming-2'}],\n",
174 | " 'source': 'jianming-2',\n",
175 | " 'scene': '',\n",
176 | " 'category': ''},\n",
177 | " {'pos': 'n.',\n",
178 | " 'english': '',\n",
179 | " 'chinese': '泉; 源泉',\n",
180 | " 'Sentences': [{'english': ' A book is a well of knowledge.',\n",
181 | " 'chinese': '书是知识的源泉。',\n",
182 | " 'audioUrlUS': '',\n",
183 | " 'audioUrlUK': '',\n",
184 | " 'source': 'jianming-2'}],\n",
185 | " 'source': 'jianming-2',\n",
186 | " 'scene': '',\n",
187 | " 'category': ''},\n",
188 | " {'pos': 'vi.',\n",
189 | " 'english': '',\n",
190 | " 'chinese': '(液体)涌出; 流出; 涌流',\n",
191 | " 'Sentences': [{'english': ' Oil welled out of the ground.',\n",
192 | " 'chinese': '原油从地下涌出。',\n",
193 | " 'audioUrlUS': '',\n",
194 | " 'audioUrlUK': '',\n",
195 | " 'source': 'jianming-2'}],\n",
196 | " 'source': 'jianming-2',\n",
197 | " 'scene': '',\n",
198 | " 'category': ''}]}"
199 | ]
200 | },
201 | "execution_count": 24,
202 | "metadata": {},
203 | "output_type": "execute_result"
204 | }
205 | ],
206 | "source": [
207 | "words[\"well\"]"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 18,
213 | "metadata": {},
214 | "outputs": [
215 | {
216 | "data": {
217 | "text/plain": [
218 | "{'Lexicon': 'modern',\n",
219 | " 'type': 'Word',\n",
220 | " 'PhoneticSymbols': [],\n",
221 | " 'ParaPhrases': [{'pos': 'adj.',\n",
222 | " 'english': '',\n",
223 | " 'chinese': '现代的; 近代的',\n",
224 | " 'Sentences': [{'english': ' He was steeped in modern history.',\n",
225 | " 'chinese': '他埋头于近代史的研究。',\n",
226 | " 'audioUrlUS': '',\n",
227 | " 'audioUrlUK': '',\n",
228 | " 'source': 'jianming-2'}],\n",
229 | " 'source': 'jianming-2',\n",
230 | " 'scene': '',\n",
231 | " 'category': ''},\n",
232 | " {'pos': 'adj.',\n",
233 | " 'english': '',\n",
234 | " 'chinese': '新式的, 时髦的, 最新的',\n",
235 | " 'Sentences': [{'english': ' He has modern ideas in spite of his great age.',\n",
236 | " 'chinese': '尽管他年事很高, 但思想观念却很入时。',\n",
237 | " 'audioUrlUS': '',\n",
238 | " 'audioUrlUK': '',\n",
239 | " 'source': 'jianming-2'},\n",
240 | " {'english': ' The dress is the most modern.',\n",
241 | " 'chinese': '这件衣服是最时髦的。',\n",
242 | " 'audioUrlUS': '',\n",
243 | " 'audioUrlUK': '',\n",
244 | " 'source': 'jianming-2'}],\n",
245 | " 'source': 'jianming-2',\n",
246 | " 'scene': '',\n",
247 | " 'category': ''},\n",
248 | " {'pos': 'adj.',\n",
249 | " 'english': '',\n",
250 | " 'chinese': '当代风格的, 现代派的',\n",
251 | " 'Sentences': [{'english': ' They went to an exhibition of modern art yesterday.',\n",
252 | " 'chinese': '昨天, 他们参观了现代美术展览。',\n",
253 | " 'audioUrlUS': '',\n",
254 | " 'audioUrlUK': '',\n",
255 | " 'source': 'jianming-2'}],\n",
256 | " 'source': 'jianming-2',\n",
257 | " 'scene': '',\n",
258 | " 'category': ''}]}"
259 | ]
260 | },
261 | "execution_count": 18,
262 | "metadata": {},
263 | "output_type": "execute_result"
264 | }
265 | ],
266 | "source": [
267 | "words[\"modern\"]"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": 19,
273 | "metadata": {},
274 | "outputs": [
275 | {
276 | "name": "stdout",
277 | "output_type": "stream",
278 | "text": [
279 | "22334 items found\n"
280 | ]
281 | }
282 | ],
283 | "source": [
284 | "with open(\"D:/work/python_work/中心词库/音标-22334-牛津源-2020-8-29.json\", \"r\", encoding=\"utf-8\") as f:\n",
285 | " psdata = json.load(f)\n",
286 | "print(\"{} items found\".format(len(psdata)))"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": 21,
292 | "metadata": {},
293 | "outputs": [
294 | {
295 | "name": "stdout",
296 | "output_type": "stream",
297 | "text": [
298 | "16717\n"
299 | ]
300 | }
301 | ],
302 | "source": [
303 | "num = []\n",
304 | "for i in psdata:\n",
305 | " if i[\"word\"] in words:\n",
306 | " num += [i]\n",
307 | "print(len(num))"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": 23,
313 | "metadata": {},
314 | "outputs": [
315 | {
316 | "data": {
317 | "text/plain": [
318 | "{'Lexicon': 'triathlon',\n",
319 | " 'type': 'Word',\n",
320 | " 'PhoneticSymbols': [],\n",
321 | " 'ParaPhrases': [{'pos': 'n.',\n",
322 | " 'english': '',\n",
323 | " 'chinese': '三项全能运动',\n",
324 | " 'Sentences': [],\n",
325 | " 'source': 'jianming-2',\n",
326 | " 'scene': '',\n",
327 | " 'category': ''}]}"
328 | ]
329 | },
330 | "execution_count": 23,
331 | "metadata": {},
332 | "output_type": "execute_result"
333 | }
334 | ],
335 | "source": [
336 | "words[num[99][\"word\"]]"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": null,
342 | "metadata": {},
343 | "outputs": [],
344 | "source": []
345 | }
346 | ],
347 | "metadata": {
348 | "kernelspec": {
349 | "display_name": "py36",
350 | "language": "python",
351 | "name": "py36"
352 | },
353 | "language_info": {
354 | "codemirror_mode": {
355 | "name": "ipython",
356 | "version": 3
357 | },
358 | "file_extension": ".py",
359 | "mimetype": "text/x-python",
360 | "name": "python",
361 | "nbconvert_exporter": "python",
362 | "pygments_lexer": "ipython3",
363 | "version": "3.7.6"
364 | }
365 | },
366 | "nbformat": 4,
367 | "nbformat_minor": 4
368 | }
369 |
--------------------------------------------------------------------------------
/notes/parse_mdx.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 126,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pandas as pd\n",
10 | "import re\n",
11 | "import bs4\n",
12 | "from bs4 import BeautifulSoup\n",
13 | "from readmdict import MDX, MDD"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": null,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "mdx = MDX(r'D:\\work\\database\\dict\\牛津高阶英汉双解词典(第9版).mdx')\n",
23 | "items = mdx.items()\n",
24 | "w2item = {i[0].decode(\"utf-8\"): i[1].decode(\"utf-8\") for i in items}\n"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 3,
30 | "metadata": {},
31 | "outputs": [
32 | {
33 | "data": {
34 | "text/plain": [
35 | "\"noun,verb 🔑 soapBrE /səʊp/ 🔊NAmE /soʊp/ 🔊 noun🔑 [uncountable, countable] a substance that you use with water for washing your body 肥皂◆soap and water肥皂和水◆a bar/piece of soap 一块肥皂◆soap bubbles肥皂泡 [countable] (informal) = soap opera ◆soaps on TV电视上播出的肥皂剧◆She's a US soap star. 她是美国肥皂剧明星。🔊🔊 🔑 soapBrE /səʊp/ 🔊NAmE /soʊp/ 🔊 verbpresent simple - I / you / we / they soap BrE /səʊp/ 🔊 NAmE /soʊp/ 🔊present simple - he / she / it soaps BrE /səʊps/ 🔊 NAmE /soʊps/ 🔊past simple soaped BrE /səʊpt/ 🔊 NAmE /soʊpt/ 🔊past participle soaped BrE /səʊpt/ 🔊 NAmE /soʊpt/ 🔊 -ing form soaping BrE /ˈsəʊpɪŋ/ 🔊 NAmE /ˈsoʊpɪŋ/ 🔊~ yourself/sb/sth to rub yourself/sb/sth with soap 抹肥皂;用肥皂擦洗 \\u2002➡\\u2002 see also soft-soap \\n\""
36 | ]
37 | },
38 | "execution_count": 3,
39 | "metadata": {},
40 | "output_type": "execute_result"
41 | }
42 | ],
43 | "source": [
44 | "# 输出所有文本内容信息\n",
45 | "bs.get_text()"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 143,
51 | "metadata": {},
52 | "outputs": [
53 | {
54 | "name": "stdout",
55 | "output_type": "stream",
56 | "text": [
57 | "\n",
58 | " \n",
59 | " \n",
61 | "\n",
62 | "\n",
63 | "
\n",
69 | "
\n",
74 | "
\n",
75 | "\n",
76 | "
\n",
77 | " \n",
78 | " weird\n",
79 | " \n",
80 | " \n",
81 | " \n",
82 | " \n",
83 | " \n",
84 | " \n",
85 | " \n",
86 | " \n",
87 | " \n",
88 | " \n",
89 | " \n",
90 | " BrE\n",
91 | " \n",
92 | " \n",
93 | " \n",
94 | " \n",
95 | " \n",
96 | " /\n",
97 | " \n",
98 | " wɪəd\n",
99 | " \n",
100 | " /\n",
101 | " \n",
102 | " \n",
104 | " \n",
105 | " \n",
106 | " \n",
107 | " \n",
108 | " 🔊\n",
109 | " \n",
110 | " \n",
111 | " \n",
112 | " \n",
113 | " \n",
114 | " \n",
115 | " \n",
116 | " \n",
117 | " NAmE\n",
118 | " \n",
119 | " \n",
120 | " \n",
121 | " \n",
122 | " \n",
123 | " /\n",
124 | " \n",
125 | " wɪrd\n",
126 | " \n",
127 | " /\n",
128 | " \n",
129 | " \n",
131 | " \n",
132 | " \n",
133 | " \n",
134 | " \n",
135 | " 🔊\n",
136 | " \n",
137 | " \n",
138 | " \n",
139 | " \n",
140 | " \n",
141 | "
\n",
142 | "\n",
143 | " \n",
144 | " \n",
145 | " \n",
146 | " \n",
147 | " \n",
148 | " adjective\n",
149 | " \n",
150 | " \n",
151 | " \n",
152 | " \n",
153 | " \n",
154 | " \n",
155 | " \n",
156 | " (\n",
157 | " \n",
158 | " \n",
159 | " \n",
160 | " \n",
161 | " \n",
162 | " \n",
163 | " weird·er\n",
164 | " \n",
165 | " \n",
166 | " \n",
167 | " \n",
168 | " \n",
169 | " \n",
170 | " ,\n",
171 | " \n",
172 | " \n",
173 | " \n",
174 | " \n",
175 | " weird·est\n",
176 | " \n",
177 | " \n",
178 | " \n",
179 | " \n",
180 | " \n",
181 | " \n",
182 | " )\n",
183 | " \n",
184 | " \n",
185 | " \n",
186 | " \n",
187 | " \n",
188 | " \n",
189 | " \n",
190 | " \n",
191 | " \n",
192 | " \n",
193 | " very\n",
194 | " \n",
195 | " \n",
196 | " strange\n",
197 | " \n",
198 | " \n",
199 | " or\n",
200 | " \n",
201 | " \n",
202 | " unusual\n",
203 | " \n",
204 | " \n",
205 | " and\n",
206 | " \n",
207 | " \n",
208 | " difficult\n",
209 | " \n",
210 | " \n",
211 | " to\n",
212 | " \n",
213 | " \n",
214 | " explain\n",
215 | " \n",
216 | " \n",
217 | " \n",
218 | " \n",
219 | " \n",
220 | " \n",
221 | " 奇异的;不寻常的;怪诞的\n",
222 | " \n",
223 | " \n",
224 | " \n",
225 | " \n",
226 | " \n",
227 | " \n",
228 | " SYN\n",
229 | " \n",
230 | " \n",
231 | " \n",
232 | " \n",
233 | " \n",
234 | " \n",
235 | " \n",
236 | " \n",
237 | " strange\n",
238 | " \n",
239 | " \n",
240 | " \n",
241 | " \n",
242 | " \n",
243 | " \n",
244 | " \n",
245 | " \n",
246 | " \n",
247 | " \n",
248 | " ◆\n",
249 | " \n",
250 | " \n",
251 | " \n",
252 | " \n",
253 | " a\n",
254 | " \n",
255 | " \n",
256 | " weird\n",
257 | " \n",
258 | " \n",
259 | " dream\n",
260 | " \n",
261 | " \n",
262 | " \n",
263 | " \n",
264 | " \n",
265 | " \n",
266 | " 离奇的梦\n",
267 | " \n",
268 | " \n",
269 | " \n",
270 | " \n",
271 | " \n",
272 | " \n",
273 | " ◆\n",
274 | " \n",
275 | " \n",
276 | " \n",
277 | " \n",
278 | " \n",
279 | " She's\n",
280 | " \n",
281 | " \n",
282 | " a\n",
283 | " \n",
284 | " \n",
285 | " really\n",
286 | " \n",
287 | " \n",
288 | " weird\n",
289 | " \n",
290 | " \n",
291 | " girl\n",
292 | " \n",
293 | " .\n",
294 | " \n",
295 | " \n",
296 | " \n",
297 | " \n",
298 | " \n",
299 | " 她真是个古怪的女孩。\n",
300 | " \n",
301 | " \n",
302 | " \n",
303 | " \n",
304 | " 🔊\n",
305 | " \n",
306 | " \n",
307 | " \n",
308 | " \n",
309 | " 🔊\n",
310 | " \n",
311 | " \n",
312 | " \n",
313 | " \n",
314 | " \n",
315 | " \n",
316 | " \n",
317 | " \n",
318 | " \n",
319 | " \n",
320 | " \n",
321 | "\n",
322 | "\n",
323 | " \n",
324 | " \n",
325 | " ◆\n",
326 | " \n",
327 | " \n",
328 | " \n",
329 | " \n",
330 | " \n",
331 | " He's\n",
332 | " \n",
333 | " \n",
334 | " got\n",
335 | " \n",
336 | " \n",
337 | " some\n",
338 | " \n",
339 | " \n",
340 | " weird\n",
341 | " \n",
342 | " \n",
343 | " ideas\n",
344 | " \n",
345 | " .\n",
346 | " \n",
347 | " \n",
348 | " \n",
349 | " \n",
350 | " \n",
351 | " 他有些怪念头。\n",
352 | " \n",
353 | " \n",
354 | " \n",
355 | " \n",
356 | " 🔊\n",
357 | " \n",
358 | " \n",
359 | " \n",
360 | " \n",
361 | " 🔊\n",
362 | " \n",
363 | " \n",
364 | " \n",
365 | "\n",
366 | "\n",
367 | " \n",
368 | " \n",
369 | " ◆\n",
370 | " \n",
371 | " \n",
372 | " \n",
373 | " \n",
374 | " \n",
375 | " It's\n",
376 | " \n",
377 | " \n",
378 | " really\n",
379 | " \n",
380 | " \n",
381 | " weird\n",
382 | " \n",
383 | " \n",
384 | " seeing\n",
385 | " \n",
386 | " \n",
387 | " yourself\n",
388 | " \n",
389 | " \n",
390 | " on\n",
391 | " \n",
392 | " \n",
393 | " television\n",
394 | " \n",
395 | " .\n",
396 | " \n",
397 | " \n",
398 | " \n",
399 | " \n",
400 | " \n",
401 | " 看到自己上了电视感觉怪怪的。\n",
402 | " \n",
403 | " \n",
404 | " \n",
405 | " \n",
406 | " 🔊\n",
407 | " \n",
408 | " \n",
409 | " \n",
410 | " \n",
411 | " 🔊\n",
412 | " \n",
413 | " \n",
414 | " \n",
415 | "\n",
416 | "\n",
417 | " \n",
418 | " \n",
419 | " ◆\n",
420 | " \n",
421 | " \n",
422 | " \n",
423 | " \n",
424 | " the\n",
425 | " \n",
426 | " \n",
427 | " \n",
428 | " \n",
429 | " weird\n",
430 | " \n",
431 | " \n",
432 | " and\n",
433 | " \n",
434 | " \n",
435 | " wonderful\n",
436 | " \n",
437 | " \n",
438 | " \n",
439 | " \n",
440 | " creatures\n",
441 | " \n",
442 | " \n",
443 | " that\n",
444 | " \n",
445 | " \n",
446 | " live\n",
447 | " \n",
448 | " \n",
449 | " beneath\n",
450 | " \n",
451 | " \n",
452 | " the\n",
453 | " \n",
454 | " \n",
455 | " sea\n",
456 | " \n",
457 | " \n",
458 | " \n",
459 | " \n",
460 | " \n",
461 | " \n",
462 | " 奇异美丽的海底生物\n",
463 | " \n",
464 | " \n",
465 | "\n",
466 | "\n",
467 | " \n",
468 | " \n",
469 | " \n",
470 | " \n",
471 | " \n",
472 | " \n",
473 | " strange\n",
474 | " \n",
475 | " \n",
476 | " in\n",
477 | " \n",
478 | " \n",
479 | " a\n",
480 | " \n",
481 | " \n",
482 | " mysterious\n",
483 | " \n",
484 | " \n",
485 | " and\n",
486 | " \n",
487 | " \n",
488 | " frightening\n",
489 | " \n",
490 | " \n",
491 | " way\n",
492 | " \n",
493 | " \n",
494 | " \n",
495 | " \n",
496 | " \n",
497 | " \n",
498 | " 离奇的;诡异的\n",
499 | " \n",
500 | " \n",
501 | " \n",
502 | " \n",
503 | " \n",
504 | " \n",
505 | " SYN\n",
506 | " \n",
507 | " \n",
508 | " \n",
509 | " \n",
510 | " \n",
511 | " \n",
512 | " \n",
513 | " \n",
514 | " eerie\n",
515 | " \n",
516 | " \n",
517 | " \n",
518 | " \n",
519 | " \n",
520 | " \n",
521 | " \n",
522 | " \n",
523 | " \n",
524 | " \n",
525 | " ◆\n",
526 | " \n",
527 | " \n",
528 | " \n",
529 | " \n",
530 | " \n",
531 | " She\n",
532 | " \n",
533 | " \n",
534 | " began\n",
535 | " \n",
536 | " \n",
537 | " to\n",
538 | " \n",
539 | " \n",
540 | " make\n",
541 | " \n",
542 | " \n",
543 | " weird\n",
544 | " \n",
545 | " \n",
546 | " inhuman\n",
547 | " \n",
548 | " \n",
549 | " sounds\n",
550 | " \n",
551 | " .\n",
552 | " \n",
553 | " \n",
554 | " \n",
555 | " \n",
556 | " \n",
557 | " 她开始发出可怕的非人的声音。\n",
558 | " \n",
559 | " \n",
560 | " \n",
561 | " \n",
562 | " 🔊\n",
563 | " \n",
564 | " \n",
565 | " \n",
566 | " \n",
567 | " 🔊\n",
568 | " \n",
569 | " \n",
570 | " \n",
571 | " \n",
572 | " \n",
573 | " \n",
574 | " \n",
575 | " \n",
576 | " \n",
577 | "\n",
578 | "\n",
579 | " \n",
580 | " \n",
581 | " \n",
582 | " \n",
583 | " ▸\n",
584 | " \n",
585 | " \n",
586 | " \n",
587 | " \n",
588 | " \n",
589 | " \n",
590 | " weird·ly\n",
591 | " \n",
592 | " \n",
593 | " \n",
594 | " \n",
595 | " \n",
596 | " \n",
597 | " \n",
598 | " BrE\n",
599 | " \n",
600 | " \n",
601 | " \n",
602 | " \n",
603 | " \n",
604 | " /\n",
605 | " \n",
606 | " ˈwɪədli\n",
607 | " \n",
608 | " /\n",
609 | " \n",
610 | " \n",
612 | " \n",
613 | " \n",
614 | " \n",
615 | " \n",
616 | " 🔊\n",
617 | " \n",
618 | " \n",
619 | " \n",
620 | " \n",
621 | " \n",
622 | " \n",
623 | " \n",
624 | " \n",
625 | " NAmE\n",
626 | " \n",
627 | " \n",
628 | " \n",
629 | " \n",
630 | " \n",
631 | " /\n",
632 | " \n",
633 | " ˈwɪrdli\n",
634 | " \n",
635 | " /\n",
636 | " \n",
637 | " \n",
639 | " \n",
640 | " \n",
641 | " \n",
642 | " \n",
643 | " 🔊\n",
644 | " \n",
645 | " \n",
646 | " \n",
647 | " \n",
648 | " \n",
649 | " \n",
650 | " \n",
651 | "\n",
652 | "\n",
653 | " \n",
654 | " \n",
655 | " \n",
656 | " adverb\n",
657 | " \n",
658 | " \n",
659 | " \n",
660 | "\n",
661 | "\n",
662 | " \n",
663 | " \n",
664 | " \n",
665 | " \n",
666 | " \n",
667 | " \n",
668 | " ◆\n",
669 | " \n",
670 | " \n",
671 | " \n",
672 | " \n",
673 | " \n",
674 | " The\n",
675 | " \n",
676 | " \n",
677 | " town\n",
678 | " \n",
679 | " \n",
680 | " was\n",
681 | " \n",
682 | " \n",
683 | " weirdly\n",
684 | " \n",
685 | " \n",
686 | " familiar\n",
687 | " \n",
688 | " .\n",
689 | " \n",
690 | " \n",
691 | " \n",
692 | " \n",
693 | " \n",
694 | " 这个城镇怪面熟的。\n",
695 | " \n",
696 | " \n",
697 | " \n",
698 | " \n",
699 | " 🔊\n",
700 | " \n",
701 | " \n",
702 | " \n",
703 | " \n",
704 | " 🔊\n",
705 | " \n",
706 | " \n",
707 | " \n",
708 | " \n",
709 | " \n",
710 | " \n",
711 | " \n",
712 | "\n",
713 | "\n",
714 | " \n",
715 | " \n",
716 | " \n",
717 | " ▸\n",
718 | " \n",
719 | " \n",
720 | " \n",
721 | " \n",
722 | " \n",
723 | " \n",
724 | " weird·ness\n",
725 | " \n",
726 | " \n",
727 | " \n",
728 | " \n",
729 | " \n",
730 | " \n",
731 | " \n",
732 | " BrE\n",
733 | " \n",
734 | " \n",
735 | " \n",
736 | " \n",
737 | " \n",
738 | " /\n",
739 | " \n",
740 | " ˈwɪədnəs\n",
741 | " \n",
742 | " /\n",
743 | " \n",
744 | " \n",
746 | " \n",
747 | " \n",
748 | " \n",
749 | " \n",
750 | " 🔊\n",
751 | " \n",
752 | " \n",
753 | " \n",
754 | " \n",
755 | " \n",
756 | " \n",
757 | " \n",
758 | " \n",
759 | " NAmE\n",
760 | " \n",
761 | " \n",
762 | " \n",
763 | " \n",
764 | " \n",
765 | " /\n",
766 | " \n",
767 | " ˈwɪrdnəs\n",
768 | " \n",
769 | " /\n",
770 | " \n",
771 | " \n",
773 | " \n",
774 | " \n",
775 | " \n",
776 | " \n",
777 | " 🔊\n",
778 | " \n",
779 | " \n",
780 | " \n",
781 | " \n",
782 | " \n",
783 | " \n",
784 | "\n",
785 | "\n",
786 | " \n",
787 | " \n",
788 | " \n",
789 | " noun\n",
790 | " \n",
791 | " \n",
792 | " \n",
793 | "\n",
794 | "\n",
795 | " \n",
796 | " \n",
797 | " \n",
798 | " \n",
799 | " [\n",
800 | " \n",
801 | " \n",
802 | " uncountable\n",
803 | " \n",
804 | " \n",
805 | " ]\n",
806 | " \n",
807 | " \n",
808 | " \n",
809 | " \n",
810 | "\n",
811 | "\n",
812 | "
\n",
813 | " \n",
814 | " weird\n",
815 | " \n",
816 | " \n",
817 | " \n",
818 | " \n",
819 | " \n",
820 | " \n",
821 | " \n",
822 | " \n",
823 | " \n",
824 | " \n",
825 | " \n",
826 | " BrE\n",
827 | " \n",
828 | " \n",
829 | " \n",
830 | " \n",
831 | " \n",
832 | " /\n",
833 | " \n",
834 | " wɪəd\n",
835 | " \n",
836 | " /\n",
837 | " \n",
838 | " \n",
840 | " \n",
841 | " \n",
842 | " \n",
843 | " \n",
844 | " 🔊\n",
845 | " \n",
846 | " \n",
847 | " \n",
848 | " \n",
849 | " \n",
850 | " \n",
851 | " \n",
852 | " \n",
853 | " NAmE\n",
854 | " \n",
855 | " \n",
856 | " \n",
857 | " \n",
858 | " \n",
859 | " /\n",
860 | " \n",
861 | " wɪrd\n",
862 | " \n",
863 | " /\n",
864 | " \n",
865 | " \n",
867 | " \n",
868 | " \n",
869 | " \n",
870 | " \n",
871 | " 🔊\n",
872 | " \n",
873 | " \n",
874 | " \n",
875 | " \n",
876 | " \n",
877 | "
\n",
878 | "\n",
879 | " \n",
880 | " \n",
881 | " \n",
882 | " \n",
883 | " \n",
884 | " verb\n",
885 | " \n",
886 | " \n",
887 | " \n",
888 | " \n",
889 | " \n",
890 | " \n",
891 | " \n",
892 | " \n",
893 | " present simple - I / you / we / they\n",
894 | " \n",
895 | " \n",
896 | " \n",
897 | " \n",
898 | " \n",
899 | " weird\n",
900 | " \n",
901 | " \n",
902 | " \n",
903 | " \n",
904 | " \n",
905 | " \n",
906 | " BrE\n",
907 | " \n",
908 | " \n",
909 | " \n",
910 | " \n",
911 | " \n",
912 | " /\n",
913 | " \n",
914 | " wɪəd\n",
915 | " \n",
916 | " /\n",
917 | " \n",
918 | " \n",
920 | " \n",
921 | " \n",
922 | " \n",
923 | " \n",
924 | " 🔊\n",
925 | " \n",
926 | " \n",
927 | " \n",
928 | " \n",
929 | " \n",
930 | " \n",
931 | " NAmE\n",
932 | " \n",
933 | " \n",
934 | " \n",
935 | " \n",
936 | " \n",
937 | " /\n",
938 | " \n",
939 | " wɪrd\n",
940 | " \n",
941 | " /\n",
942 | " \n",
943 | " \n",
945 | " \n",
946 | " \n",
947 | " \n",
948 | " \n",
949 | " 🔊\n",
950 | " \n",
951 | " \n",
952 | " \n",
953 | " \n",
954 | " \n",
955 | " \n",
956 | " \n",
957 | " \n",
958 | "\n",
959 | "\n",
960 | " \n",
961 | " present simple - he / she / it\n",
962 | " \n",
963 | "\n",
964 | "\n",
965 | " \n",
966 | " \n",
967 | " weirds\n",
968 | " \n",
969 | " \n",
970 | " \n",
971 | " \n",
972 | " \n",
973 | " \n",
974 | " BrE\n",
975 | " \n",
976 | " \n",
977 | " \n",
978 | " \n",
979 | " \n",
980 | " /\n",
981 | " \n",
982 | " wɪədz\n",
983 | " \n",
984 | " /\n",
985 | " \n",
986 | " \n",
988 | " \n",
989 | " \n",
990 | " \n",
991 | " \n",
992 | " 🔊\n",
993 | " \n",
994 | " \n",
995 | " \n",
996 | " \n",
997 | " \n",
998 | " \n",
999 | " NAmE\n",
1000 | " \n",
1001 | " \n",
1002 | " \n",
1003 | " \n",
1004 | " \n",
1005 | " /\n",
1006 | " \n",
1007 | " wɪrdz\n",
1008 | " \n",
1009 | " /\n",
1010 | " \n",
1011 | " \n",
1013 | " \n",
1014 | " \n",
1015 | " \n",
1016 | " \n",
1017 | " 🔊\n",
1018 | " \n",
1019 | " \n",
1020 | " \n",
1021 | " \n",
1022 | "\n",
1023 | "\n",
1024 | " \n",
1025 | " past simple\n",
1026 | " \n",
1027 | "\n",
1028 | "\n",
1029 | " \n",
1030 | " \n",
1031 | " weirded\n",
1032 | " \n",
1033 | " \n",
1034 | " \n",
1035 | " \n",
1036 | " \n",
1037 | " \n",
1038 | " BrE\n",
1039 | " \n",
1040 | " \n",
1041 | " \n",
1042 | " \n",
1043 | " \n",
1044 | " /\n",
1045 | " \n",
1046 | " ˈwɪədɪd\n",
1047 | " \n",
1048 | " /\n",
1049 | " \n",
1050 | " \n",
1052 | " \n",
1053 | " \n",
1054 | " \n",
1055 | " \n",
1056 | " 🔊\n",
1057 | " \n",
1058 | " \n",
1059 | " \n",
1060 | " \n",
1061 | " \n",
1062 | " \n",
1063 | " NAmE\n",
1064 | " \n",
1065 | " \n",
1066 | " \n",
1067 | " \n",
1068 | " \n",
1069 | " /\n",
1070 | " \n",
1071 | " ˈwɪrdɪd\n",
1072 | " \n",
1073 | " /\n",
1074 | " \n",
1075 | " \n",
1077 | " \n",
1078 | " \n",
1079 | " \n",
1080 | " \n",
1081 | " 🔊\n",
1082 | " \n",
1083 | " \n",
1084 | " \n",
1085 | " \n",
1086 | "\n",
1087 | "\n",
1088 | " \n",
1089 | " past participle\n",
1090 | " \n",
1091 | "\n",
1092 | "\n",
1093 | " \n",
1094 | " \n",
1095 | " weirded\n",
1096 | " \n",
1097 | " \n",
1098 | " \n",
1099 | " \n",
1100 | " \n",
1101 | " \n",
1102 | " BrE\n",
1103 | " \n",
1104 | " \n",
1105 | " \n",
1106 | " \n",
1107 | " \n",
1108 | " /\n",
1109 | " \n",
1110 | " ˈwɪədɪd\n",
1111 | " \n",
1112 | " /\n",
1113 | " \n",
1114 | " \n",
1116 | " \n",
1117 | " \n",
1118 | " \n",
1119 | " \n",
1120 | " 🔊\n",
1121 | " \n",
1122 | " \n",
1123 | " \n",
1124 | " \n",
1125 | " \n",
1126 | " \n",
1127 | " NAmE\n",
1128 | " \n",
1129 | " \n",
1130 | " \n",
1131 | " \n",
1132 | " \n",
1133 | " /\n",
1134 | " \n",
1135 | " ˈwɪrdɪd\n",
1136 | " \n",
1137 | " /\n",
1138 | " \n",
1139 | " \n",
1141 | " \n",
1142 | " \n",
1143 | " \n",
1144 | " \n",
1145 | " 🔊\n",
1146 | " \n",
1147 | " \n",
1148 | " \n",
1149 | " \n",
1150 | "\n",
1151 | "\n",
1152 | " \n",
1153 | " -ing form\n",
1154 | " \n",
1155 | "\n",
1156 | "\n",
1157 | " \n",
1158 | " \n",
1159 | " weirding\n",
1160 | " \n",
1161 | " \n",
1162 | " \n",
1163 | " \n",
1164 | " \n",
1165 | " \n",
1166 | " BrE\n",
1167 | " \n",
1168 | " \n",
1169 | " \n",
1170 | " \n",
1171 | " \n",
1172 | " /\n",
1173 | " \n",
1174 | " ˈwɪədɪŋ\n",
1175 | " \n",
1176 | " /\n",
1177 | " \n",
1178 | " \n",
1180 | " \n",
1181 | " \n",
1182 | " \n",
1183 | " \n",
1184 | " 🔊\n",
1185 | " \n",
1186 | " \n",
1187 | " \n",
1188 | " \n",
1189 | " \n",
1190 | " \n",
1191 | " NAmE\n",
1192 | " \n",
1193 | " \n",
1194 | " \n",
1195 | " \n",
1196 | " \n",
1197 | " /\n",
1198 | " \n",
1199 | " ˈwɪrdɪŋ\n",
1200 | " \n",
1201 | " /\n",
1202 | " \n",
1203 | " \n",
1205 | " \n",
1206 | " \n",
1207 | " \n",
1208 | " \n",
1209 | " 🔊\n",
1210 | " \n",
1211 | " \n",
1212 | " \n",
1213 | " \n",
1214 | "\n",
1215 | "\n",
1216 | " \n",
1217 | " \n",
1218 | "
\n",
1219 | " \n",
1220 | " \n",
1221 | " \n",
1222 | " \n",
1223 | " \n",
1224 | " \n",
1225 | " \n",
1226 | " \n",
1227 | " \n",
1228 | " \n",
1229 | " \n",
1230 | " \n",
1231 | " \n",
1232 | " ●\n",
1233 | " \n",
1234 | " \n",
1235 | " \n",
1236 | " \n",
1237 | " ˌweird\n",
1238 | " \n",
1239 | " \n",
1240 | " sb\n",
1241 | " \n",
1242 | " \n",
1243 | " ˈout\n",
1244 | " \n",
1245 | " \n",
1246 | " \n",
1247 | " \n",
1248 | " \n",
1249 | " \n",
1250 | " \n",
1251 | " \n",
1252 | " (\n",
1253 | " \n",
1254 | " \n",
1255 | " \n",
1256 | " \n",
1257 | " informal\n",
1258 | " \n",
1259 | " \n",
1260 | " \n",
1261 | " \n",
1262 | " )\n",
1263 | " \n",
1264 | " \n",
1265 | " \n",
1266 | " to\n",
1267 | " \n",
1268 | " \n",
1269 | " seem\n",
1270 | " \n",
1271 | " \n",
1272 | " strange\n",
1273 | " \n",
1274 | " \n",
1275 | " or\n",
1276 | " \n",
1277 | " \n",
1278 | " worrying\n",
1279 | " \n",
1280 | " \n",
1281 | " to\n",
1282 | " \n",
1283 | " \n",
1284 | " sb\n",
1285 | " \n",
1286 | " \n",
1287 | " and\n",
1288 | " \n",
1289 | " \n",
1290 | " make\n",
1291 | " \n",
1292 | " \n",
1293 | " them\n",
1294 | " \n",
1295 | " \n",
1296 | " feel\n",
1297 | " \n",
1298 | " \n",
1299 | " uncomfortable\n",
1300 | " \n",
1301 | " \n",
1302 | " \n",
1303 | " \n",
1304 | " \n",
1305 | " \n",
1306 | " 使感到奇怪;使感到烦恼;使感到不舒服\n",
1307 | " \n",
1308 | " \n",
1309 | " \n",
1310 | " \n",
1311 | " \n",
1312 | " \n",
1313 | " ◆\n",
1314 | " \n",
1315 | " \n",
1316 | " \n",
1317 | " \n",
1318 | " \n",
1319 | " The\n",
1320 | " \n",
1321 | " \n",
1322 | " whole\n",
1323 | " \n",
1324 | " \n",
1325 | " concept\n",
1326 | " \n",
1327 | " \n",
1328 | " really\n",
1329 | " \n",
1330 | " \n",
1331 | " weirds\n",
1332 | " \n",
1333 | " \n",
1334 | " me\n",
1335 | " \n",
1336 | " \n",
1337 | " out\n",
1338 | " \n",
1339 | " .\n",
1340 | " \n",
1341 | " \n",
1342 | " \n",
1343 | " \n",
1344 | " \n",
1345 | " 这整个想法让我觉得十分怪异。\n",
1346 | " \n",
1347 | " \n",
1348 | " \n",
1349 | " \n",
1350 | " 🔊\n",
1351 | " \n",
1352 | " \n",
1353 | " \n",
1354 | " \n",
1355 | " 🔊\n",
1356 | " \n",
1357 | " \n",
1358 | " \n",
1359 | " \n",
1360 | " \n",
1361 | " \n",
1362 | " \n",
1363 | " \n",
1364 | " \n",
1365 | " \n",
1366 | " \n",
1367 | " \n",
1368 | " \n",
1369 | "\n",
1370 | "\n"
1371 | ]
1372 | }
1373 | ],
1374 | "source": [
1375 | "# 10139\n",
1376 | "bs = BeautifulSoup(w2item[\"weird\"], \"html.parser\")\n",
1377 | "\n",
1378 | "\n",
1379 | "def iter_parse(node, result):\n",
1380 | " if node.contents:\n",
1381 | " for i in node.children:\n",
1382 | " result[i] = {}\n",
1383 | " out = iter_parse(i, result[i])\n",
1384 | " result[i] = {\n",
1385 | " \"name\": out[0],\n",
1386 | " \"text\": out[1]\n",
1387 | " }\n",
1388 | " else:\n",
1389 | " return node.name, node.text\n",
1390 | "\n",
1391 | "\n",
1392 | "def parse_sentence(nodes):\n",
1393 | " outs = []\n",
1394 | " if not isinstance(nodes, str):\n",
1395 | " for k in nodes.children:\n",
1396 | " if not isinstance(k, str):\n",
1397 | " print(\"------------name={}, attrs={}, text={}\".format(k.name, k.attrs,\n",
1398 | " k.get_text()))\n",
1399 | " outs.append({\"english\": k.text})\n",
1400 | " return outs\n",
1401 | "\n",
1402 | "\n",
1403 | "print(bs.prettify())\n",
1404 | "\n"
1405 | ]
1406 | },
1407 | {
1408 | "cell_type": "code",
1409 | "execution_count": 144,
1410 | "metadata": {},
1411 | "outputs": [
1412 | {
1413 | "name": "stdout",
1414 | "output_type": "stream",
1415 | "text": [
1416 | "\n",
1417 | "id=0, node=\n",
1418 | "\n",
1419 | "id=1, node=\n",
1420 | "\n",
1421 | "id=2, node= weirdBrE /wɪəd/ 🔊NAmE /wɪrd/ 🔊 \n",
1422 | "\n",
1423 | "id=3, node= adjective (weird·er, weird·est) very strange or unusual and difficult to explain 奇异的;不寻常的;怪诞的 SYN strange ◆a weird dream离奇的梦◆She's a really weird girl. 她真是个古怪的女孩。🔊🔊\n",
1424 | "\n",
1425 | "id=4, node=◆He's got some weird ideas. 他有些怪念头。🔊🔊\n",
1426 | "\n",
1427 | "id=5, node=◆It's really weird seeing yourself on television. 看到自己上了电视感觉怪怪的。🔊🔊\n",
1428 | "\n",
1429 | "id=6, node=◆the weird and wonderful creatures that live beneath the sea奇异美丽的海底生物\n",
1430 | "\n",
1431 | "id=7, node=strange in a mysterious and frightening way 离奇的;诡异的 SYN eerie ◆She began to make weird inhuman sounds. 她开始发出可怕的非人的声音。🔊🔊\n",
1432 | "\n",
1433 | "id=8, node=▸ weird·ly BrE /ˈwɪədli/ 🔊NAmE /ˈwɪrdli/ 🔊\n",
1434 | "\n",
1435 | "id=9, node= adverb\n",
1436 | "\n",
1437 | "id=10, node=◆The town was weirdly familiar. 这个城镇怪面熟的。🔊🔊\n",
1438 | "\n",
1439 | "id=11, node=▸ weird·ness BrE /ˈwɪədnəs/ 🔊NAmE /ˈwɪrdnəs/ 🔊\n",
1440 | "\n",
1441 | "id=12, node= noun\n",
1442 | "\n",
1443 | "id=13, node= [uncountable] \n",
1444 | "\n",
1445 | "id=14, node= weirdBrE /wɪəd/ 🔊NAmE /wɪrd/ 🔊 \n",
1446 | "\n",
1447 | "id=15, node= verbpresent simple - I / you / we / they weird BrE /wɪəd/ 🔊 NAmE /wɪrd/ 🔊\n",
1448 | "\n",
1449 | "id=16, node=present simple - he / she / it \n",
1450 | "\n",
1451 | "id=17, node=weirds BrE /wɪədz/ 🔊 NAmE /wɪrdz/ 🔊\n",
1452 | "\n",
1453 | "id=18, node=past simple \n",
1454 | "\n",
1455 | "id=19, node=weirded BrE /ˈwɪədɪd/ 🔊 NAmE /ˈwɪrdɪd/ 🔊\n",
1456 | "\n",
1457 | "id=20, node=past participle \n",
1458 | "\n",
1459 | "id=21, node=weirded BrE /ˈwɪədɪd/ 🔊 NAmE /ˈwɪrdɪd/ 🔊\n",
1460 | "\n",
1461 | "id=22, node= -ing form \n",
1462 | "\n",
1463 | "id=23, node=weirding BrE /ˈwɪədɪŋ/ 🔊 NAmE /ˈwɪrdɪŋ/ 🔊\n",
1464 | "\n",
1465 | "id=24, node=
●ˌweird sb ˈout(informal) to seem strange or worrying to sb and make them feel uncomfortable 使感到奇怪;使感到烦恼;使感到不舒服◆The whole concept really weirds me out. 这整个想法让我觉得十分怪异。🔊🔊\n",
1466 | "\n",
1467 | "id=25, node=\n",
1468 | "\n"
1469 | ]
1470 | }
1471 | ],
1472 | "source": [
1473 | "for ic, node in enumerate(bs.children):\n",
1474 | " print(\"\\nid={}, node={}\".format(ic, node))"
1475 | ]
1476 | },
1477 | {
1478 | "cell_type": "code",
1479 | "execution_count": 140,
1480 | "metadata": {},
1481 | "outputs": [
1482 | {
1483 | "name": "stdout",
1484 | "output_type": "stream",
1485 | "text": [
1486 | "soap and water肥皂和水 肥皂和水\n",
1487 | "a bar/piece of soap 一块肥皂 一块肥皂\n",
1488 | "soap bubbles肥皂泡 肥皂泡\n",
1489 | "soaps on TV电视上播出的肥皂剧 电视上播出的肥皂剧\n",
1490 | "She's a US soap star. 她是美国肥皂剧明星。 她是美国肥皂剧明星。\n"
1491 | ]
1492 | }
1493 | ],
1494 | "source": [
1495 | "for i in bs.find_all(\"div\", \"cixing_part\"):\n",
1496 | "# print(i.tag)\n",
1497 | " top = i.find_all(\"top-g\")\n",
1498 | " subentry = i.find_all(\"subentry-g\")\n",
1499 | " for j in subentry:\n",
1500 | " sngs=j.find_all(\"sn-gs\")\n",
1501 | " for k in sngs:\n",
1502 | " sng = k.find_all(\"sn-g\")\n",
1503 | " for m in sng:\n",
1504 | " xgs = m.find_all(\"x-gs\")\n",
1505 | " for n in xgs:\n",
1506 | " xgblk = n.find_all(\"x-g-blk\")\n",
1507 | " # 3个例句\n",
1508 | " sentences = []\n",
1509 | " for w in xgblk:\n",
1510 | " x = w.find_all(\"x\")\n",
1511 | " for v in x:\n",
1512 | " # 添加例句\n",
1513 | " # 先把中文释义提出,并从树中移除\n",
1514 | "# sentCh=v.chn.extract().get_text()\n",
1515 | " sentCh=v.chn.get_text()\n",
1516 | " # 再提取全部英文例句,可解决标点无法提取问题\n",
1517 | " sentEn=v.get_text()\n",
1518 | " sentence={\"english\": sentEn.strip(), \"chinese\": sentCh.strip(), \n",
1519 | " \"audioUrlUS\": \"\", \"audioUrlUK\":\"\", \"resource\": \"oxld_9\"}\n",
1520 | " sentences.append(sentence)\n",
1521 | " print(sentEn, sentCh)\n",
1522 | " \n",
1523 | "# print(x[0].prettify())\n"
1524 | ]
1525 | },
1526 | {
1527 | "cell_type": "code",
1528 | "execution_count": 141,
1529 | "metadata": {
1530 | "scrolled": true
1531 | },
1532 | "outputs": [
1533 | {
1534 | "ename": "AttributeError",
1535 | "evalue": "ResultSet object has no attribute 'findall'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?",
1536 | "output_type": "error",
1537 | "traceback": [
1538 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
1539 | "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)",
1540 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0msubentry\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfindall\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"def\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
1541 | "\u001b[1;32mC:\\ProgramData\\Anaconda3\\lib\\site-packages\\bs4\\element.py\u001b[0m in \u001b[0;36m__getattr__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 2079\u001b[0m \u001b[1;34m\"\"\"Raise a helpful exception to explain a common code fix.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2080\u001b[0m raise AttributeError(\n\u001b[1;32m-> 2081\u001b[1;33m \u001b[1;34m\"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?\"\u001b[0m \u001b[1;33m%\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2082\u001b[0m )\n",
1542 | "\u001b[1;31mAttributeError\u001b[0m: ResultSet object has no attribute 'findall'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"
1543 | ]
1544 | }
1545 | ],
1546 | "source": [
1547 | "subentry.findall(\"def\")"
1548 | ]
1549 | },
1550 | {
1551 | "cell_type": "code",
1552 | "execution_count": 134,
1553 | "metadata": {},
1554 | "outputs": [
1555 | {
1556 | "data": {
1557 | "text/plain": [
1558 | "🔑 [uncountable, countable] a substance that you use with water for washing your body 肥皂◆soap and water肥皂和水◆a bar/piece of soap 一块肥皂◆soap bubbles肥皂泡 [countable] (informal) = soap opera ◆soaps on TV电视上播出的肥皂剧◆She's a US soap star. 她是美国肥皂剧明星。🔊🔊"
1559 | ]
1560 | },
1561 | "execution_count": 134,
1562 | "metadata": {},
1563 | "output_type": "execute_result"
1564 | }
1565 | ],
1566 | "source": [
1567 | "k"
1568 | ]
1569 | },
1570 | {
1571 | "cell_type": "code",
1572 | "execution_count": 133,
1573 | "metadata": {},
1574 | "outputs": [
1575 | {
1576 | "data": {
1577 | "text/plain": [
1578 | "She's a US soap star. 她是美国肥皂剧明星。"
1579 | ]
1580 | },
1581 | "execution_count": 133,
1582 | "metadata": {},
1583 | "output_type": "execute_result"
1584 | }
1585 | ],
1586 | "source": [
1587 | "v"
1588 | ]
1589 | },
1590 | {
1591 | "cell_type": "code",
1592 | "execution_count": 132,
1593 | "metadata": {},
1594 | "outputs": [
1595 | {
1596 | "data": {
1597 | "text/plain": [
1598 | "{'e_id': 'u4cdebea65f7df6b4.-37641f80.142d72656f3.-1b5',\n",
1599 | " 'eid': 'soap_x_6',\n",
1600 | " 'status': '6',\n",
1601 | " 'tranid': '6',\n",
1602 | " 'wd': \"She's a US soap star.\"}"
1603 | ]
1604 | },
1605 | "execution_count": 132,
1606 | "metadata": {},
1607 | "output_type": "execute_result"
1608 | }
1609 | ],
1610 | "source": [
1611 | "v.attrs"
1612 | ]
1613 | },
1614 | {
1615 | "cell_type": "code",
1616 | "execution_count": 131,
1617 | "metadata": {},
1618 | "outputs": [
1619 | {
1620 | "data": {
1621 | "text/plain": [
1622 | "\"She's a US soap star.\""
1623 | ]
1624 | },
1625 | "execution_count": 131,
1626 | "metadata": {},
1627 | "output_type": "execute_result"
1628 | }
1629 | ],
1630 | "source": [
1631 | "v.attrs['wd']"
1632 | ]
1633 | },
1634 | {
1635 | "cell_type": "code",
1636 | "execution_count": 129,
1637 | "metadata": {},
1638 | "outputs": [
1639 | {
1640 | "data": {
1641 | "text/plain": [
1642 | "[She's,\n",
1643 | " a,\n",
1644 | " US,\n",
1645 | " soap,\n",
1646 | " star,\n",
1647 | " ,\n",
1648 | " 她是美国肥皂剧明星。]"
1649 | ]
1650 | },
1651 | "execution_count": 129,
1652 | "metadata": {},
1653 | "output_type": "execute_result"
1654 | }
1655 | ],
1656 | "source": [
1657 | "[i for i in v.children if isinstance(i, bs4.element.Tag)]"
1658 | ]
1659 | },
1660 | {
1661 | "cell_type": "code",
1662 | "execution_count": 130,
1663 | "metadata": {},
1664 | "outputs": [
1665 | {
1666 | "ename": "SyntaxError",
1667 | "evalue": "invalid syntax (, line 1)",
1668 | "output_type": "error",
1669 | "traceback": [
1670 | "\u001b[1;36m File \u001b[1;32m\"\"\u001b[1;36m, line \u001b[1;32m1\u001b[0m\n\u001b[1;33m v.\"xhtml:a\"\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n"
1671 | ]
1672 | }
1673 | ],
1674 | "source": [
1675 | "v.\"xhtml:a\""
1676 | ]
1677 | },
1678 | {
1679 | "cell_type": "code",
1680 | "execution_count": 142,
1681 | "metadata": {},
1682 | "outputs": [
1683 | {
1684 | "name": "stdout",
1685 | "output_type": "stream",
1686 | "text": [
1687 | "\n",
1688 | " \n",
1689 | " \n",
1690 | " \n",
1691 | " \n",
1692 | " \n",
1693 | " verb\n",
1694 | " \n",
1695 | " \n",
1696 | " \n",
1697 | " \n",
1698 | " \n",
1699 | " \n",
1700 | " \n",
1701 | " \n",
1702 | " present simple - I / you / we / they\n",
1703 | " \n",
1704 | " \n",
1705 | " \n",
1706 | " \n",
1707 | " \n",
1708 | " soap\n",
1709 | " \n",
1710 | " \n",
1711 | " \n",
1712 | " \n",
1713 | " \n",
1714 | " \n",
1715 | " BrE\n",
1716 | " \n",
1717 | " \n",
1718 | " \n",
1719 | " \n",
1720 | " \n",
1721 | " /\n",
1722 | " \n",
1723 | " səʊp\n",
1724 | " \n",
1725 | " /\n",
1726 | " \n",
1727 | " \n",
1729 | " \n",
1730 | " \n",
1731 | " \n",
1732 | " \n",
1733 | " 🔊\n",
1734 | " \n",
1735 | " \n",
1736 | " \n",
1737 | " \n",
1738 | " \n",
1739 | " \n",
1740 | " NAmE\n",
1741 | " \n",
1742 | " \n",
1743 | " \n",
1744 | " \n",
1745 | " \n",
1746 | " /\n",
1747 | " \n",
1748 | " soʊp\n",
1749 | " \n",
1750 | " /\n",
1751 | " \n",
1752 | " \n",
1754 | " \n",
1755 | " \n",
1756 | " \n",
1757 | " \n",
1758 | " 🔊\n",
1759 | " \n",
1760 | " \n",
1761 | " \n",
1762 | " \n",
1763 | " \n",
1764 | " \n",
1765 | " \n",
1766 | " present simple - he / she / it\n",
1767 | " \n",
1768 | " \n",
1769 | " \n",
1770 | " \n",
1771 | " \n",
1772 | " soaps\n",
1773 | " \n",
1774 | " \n",
1775 | " \n",
1776 | " \n",
1777 | " \n",
1778 | " \n",
1779 | " BrE\n",
1780 | " \n",
1781 | " \n",
1782 | " \n",
1783 | " \n",
1784 | " \n",
1785 | " /\n",
1786 | " \n",
1787 | " səʊps\n",
1788 | " \n",
1789 | " /\n",
1790 | " \n",
1791 | " \n",
1793 | " \n",
1794 | " \n",
1795 | " \n",
1796 | " \n",
1797 | " 🔊\n",
1798 | " \n",
1799 | " \n",
1800 | " \n",
1801 | " \n",
1802 | " \n",
1803 | " \n",
1804 | " NAmE\n",
1805 | " \n",
1806 | " \n",
1807 | " \n",
1808 | " \n",
1809 | " \n",
1810 | " /\n",
1811 | " \n",
1812 | " soʊps\n",
1813 | " \n",
1814 | " /\n",
1815 | " \n",
1816 | " \n",
1818 | " \n",
1819 | " \n",
1820 | " \n",
1821 | " \n",
1822 | " 🔊\n",
1823 | " \n",
1824 | " \n",
1825 | " \n",
1826 | " \n",
1827 | " \n",
1828 | " \n",
1829 | " \n",
1830 | " \n",
1831 | "\n"
1832 | ]
1833 | }
1834 | ],
1835 | "source": [
1836 | "print(j.prettify())"
1837 | ]
1838 | },
1839 | {
1840 | "cell_type": "code",
1841 | "execution_count": 19,
1842 | "metadata": {},
1843 | "outputs": [
1844 | {
1845 | "name": "stdout",
1846 | "output_type": "stream",
1847 | "text": [
1848 | "\n",
1849 | " \n",
1850 | " \n",
1851 | " [\n",
1852 | " \n",
1853 | " \n",
1854 | " uncountable\n",
1855 | " \n",
1856 | " \n",
1857 | " \n",
1858 | " \n",
1859 | " ,\n",
1860 | " \n",
1861 | " \n",
1862 | " countable\n",
1863 | " \n",
1864 | " \n",
1865 | " ]\n",
1866 | " \n",
1867 | " \n",
1868 | " \n",
1869 | " \n",
1870 | " a\n",
1871 | " \n",
1872 | " \n",
1873 | " substance\n",
1874 | " \n",
1875 | " \n",
1876 | " that\n",
1877 | " \n",
1878 | " \n",
1879 | " you\n",
1880 | " \n",
1881 | " \n",
1882 | " use\n",
1883 | " \n",
1884 | " \n",
1885 | " with\n",
1886 | " \n",
1887 | " \n",
1888 | " water\n",
1889 | " \n",
1890 | " \n",
1891 | " for\n",
1892 | " \n",
1893 | " \n",
1894 | " washing\n",
1895 | " \n",
1896 | " \n",
1897 | " your\n",
1898 | " \n",
1899 | " \n",
1900 | " body\n",
1901 | " \n",
1902 | " \n",
1903 | " \n",
1904 | " \n",
1905 | " \n",
1906 | " \n",
1907 | " 肥皂\n",
1908 | " \n",
1909 | " \n",
1910 | " \n",
1911 | " \n",
1912 | " \n",
1913 | " \n",
1914 | " ◆\n",
1915 | " \n",
1916 | " \n",
1917 | " \n",
1918 | " \n",
1919 | " soap\n",
1920 | " \n",
1921 | " \n",
1922 | " and\n",
1923 | " \n",
1924 | " \n",
1925 | " water\n",
1926 | " \n",
1927 | " \n",
1928 | " \n",
1929 | " \n",
1930 | " \n",
1931 | " \n",
1932 | " 肥皂和水\n",
1933 | " \n",
1934 | " \n",
1935 | " \n",
1936 | " \n",
1937 | " \n",
1938 | " \n",
1939 | " ◆\n",
1940 | " \n",
1941 | " \n",
1942 | " \n",
1943 | " \n",
1944 | " a\n",
1945 | " \n",
1946 | " \n",
1947 | " \n",
1948 | " \n",
1949 | " bar\n",
1950 | " \n",
1951 | " /\n",
1952 | " \n",
1953 | " piece\n",
1954 | " \n",
1955 | " \n",
1956 | " of\n",
1957 | " \n",
1958 | " \n",
1959 | " soap\n",
1960 | " \n",
1961 | " \n",
1962 | " \n",
1963 | " \n",
1964 | " \n",
1965 | " \n",
1966 | " \n",
1967 | " \n",
1968 | " 一块肥皂\n",
1969 | " \n",
1970 | " \n",
1971 | " \n",
1972 | " \n",
1973 | " \n",
1974 | " \n",
1975 | " ◆\n",
1976 | " \n",
1977 | " \n",
1978 | " \n",
1979 | " \n",
1980 | " soap\n",
1981 | " \n",
1982 | " \n",
1983 | " bubbles\n",
1984 | " \n",
1985 | " \n",
1986 | " \n",
1987 | " \n",
1988 | " \n",
1989 | " \n",
1990 | " 肥皂泡\n",
1991 | " \n",
1992 | " \n",
1993 | " \n",
1994 | " \n",
1995 | "\n"
1996 | ]
1997 | }
1998 | ],
1999 | "source": [
2000 | "print(sng[0].prettify())"
2001 | ]
2002 | },
2003 | {
2004 | "cell_type": "code",
2005 | "execution_count": null,
2006 | "metadata": {},
2007 | "outputs": [],
2008 | "source": [
2009 | "paraphrases = []\n",
2010 | "for ic, node in enumerate(bs.children):\n",
2011 | " print(\"\\nid={}, name={}\".format(ic, node.name))\n",
2012 | " # div, {'id': 'noun', 'class': ['cixing_part']}\n",
2013 | " if not isinstance(node, str) and node.attrs and \"class\" in node.attrs.keys():\n",
2014 | "        # part-of-speech jump links\n",
2015 | " if \"cixing_tiaozhuan\" in node.attrs[\"class\"]:\n",
2016 | " poss = node.text.split(\",\")\n",
2017 | "        # part-of-speech section\n",
2018 | " elif \"cixing_part\" in node.attrs[\"class\"]:\n",
2019 | "            # record the part of speech\n",
2020 | "            paraphrase = {\"pos\": node.attrs[\"id\"]}\n",
2021 | "            titleList = node.find_all(\"top-g\")\n",
2022 | "            posList = node.find_all(\"subentry-g\")\n",
2023 | " for paras in node.children:\n",
2024 | " # 2 items: top-g;subentry-g\n",
2025 | " print(\"------name={}, attrs={}, text={}\".format(paras.name, paras.attrs, paras.get_text()))\n",
2026 | " # if not isinstance(s, str):\n",
2027 | "                # add the definition\n",
2028 | "                posList = paras.find_all(\"pos-g\")\n",
2029 | " if paras.name == \"top-g\":\n",
2030 | " paraphrase[\"text\"] = paras.get_text()\n",
2031 | " else:\n",
2032 | " # subentry-g\n",
2033 | " for sentence in paras.children:\n",
2034 | "                        # parse the example-sentence list\n",
2035 | " if sentence.name != \"pos-g\":\n",
2036 | " continue\n",
2037 | " paraphrase[\"sentences\"] = []\n",
2038 | " if not isinstance(sentence, str):\n",
2039 | " for s in sentence.children:\n",
2040 | " sents = parse_sentence(s)\n",
2041 | " if sents:\n",
2042 | " paraphrase[\"sentences\"] += sents\n",
2043 | " paraphrases.append(paraphrase)\n",
2044 | "\n",
2045 | "print(bs.get_text())\n",
2046 | "\n",
2047 | "print(paraphrases)\n"
2048 | ]
2049 | }
2050 | ],
2051 | "metadata": {
2052 | "kernelspec": {
2053 | "display_name": "py36",
2054 | "language": "python",
2055 | "name": "py36"
2056 | },
2057 | "language_info": {
2058 | "codemirror_mode": {
2059 | "name": "ipython",
2060 | "version": 3
2061 | },
2062 | "file_extension": ".py",
2063 | "mimetype": "text/x-python",
2064 | "name": "python",
2065 | "nbconvert_exporter": "python",
2066 | "pygments_lexer": "ipython3",
2067 | "version": "3.7.6"
2068 | }
2069 | },
2070 | "nbformat": 4,
2071 | "nbformat_minor": 4
2072 | }
2073 |
--------------------------------------------------------------------------------
/parser.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import re
3 | import bs4
4 | from bs4 import BeautifulSoup
5 | from readmdict import MDX, MDD
6 | import json
7 | import unicodedata
8 |
9 | tagSets = set()
10 |
11 |
12 | def text_norm(text, lang):
13 | """
14 |     Text normalization: strip/collapse whitespace, then apply NFKC.
15 | """
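16 |     # e.g. text_norm("  soap \t and  water ", "en") -> "soap and water";
17 |     # NFKC also folds full-width characters to their ASCII equivalents.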
16 | if lang == "en" or lang == "english":
17 | text = " ".join(text.strip().split())
18 | else:
19 | text = text.strip()
20 | text = unicodedata.normalize("NFKC", text)
21 | return text
22 |
23 |
24 | def get_tag_list(node):
25 | return [i.name for i in node if isinstance(i, bs4.element.Tag)]
26 |
27 |
28 | def get_audio(tags, pos, source, name="pron-g"):
29 | audios = []
30 | for i in tags:
31 | audio = {
32 | "name": i.find_all(name)[0].get_text(),
33 | "audioUrl": i.find_all("a")[0].attrs["href"],
34 | "country": i.find_all(re.compile("label"))[0].get_text(),
35 | "source": source, "pos": pos
36 | }
37 | audios.append(audio)
38 | return audios
39 |
40 |
41 | def split_node_en_ch(tags):
42 |     # child tag names: [i.name for i in v.children if isinstance(i, bs4.element.Tag)]
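43 |     # Illustration of the chn-extraction trick below, on hypothetical markup:
44 |     #   x = BeautifulSoup("<x>soap and water<chn>肥皂和水</chn></x>", "html.parser").x
45 |     #   x.chn.extract().get_text()  -> '肥皂和水'  (the <chn> node is removed)
46 |     #   x.get_text()                -> 'soap and water', punctuation intact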
43 | chs, ens = [], []
44 | for v in tags:
45 |         # extract by tag; chn: pull the Chinese text out first and remove it from the tree
46 | vv = v
47 | ch = unicodedata.normalize("NFKC", vv.chn.extract().get_text())
48 |         # then take the remaining text as the English part; this keeps the punctuation
49 |         # (extraction by attribute is also possible)
50 | en = vv.get_text()
51 | chs.append(ch.strip())
52 | ens.append(en.strip())
53 | return chs, ens
54 |
55 |
56 | def parse_sentences(node, source):
57 | sentences = []
58 | chs, ens = split_node_en_ch(node)
59 | for ch, en in zip(chs, ens):
60 | sentence = {"chinese": ch, "english": en,
61 | "audioUrlUS": "", "audioUrlUK": "", "source": source}
62 | sentences.append(sentence)
63 | return sentences
64 |
65 |
66 | def get_sn(m, source):
67 | for t in get_tag_list(m):
68 | tagSets.add(t)
69 | # print("m=", get_tag_list(m))
70 |     # definition info, e.g.:
71 |     # xr-gs; = soap opera
72 |     # gram-g: [countable]
73 |     # label-g-blk: (informal)
74 |     # extract the grammatical sub-category
75 | if m.find_all("gram-g"):
76 | category = m.find_all("gram-g")[0].get_text()
77 | else:
78 | category = ""
79 |     # extract the usage label ("scene")
80 | if m.find_all("label-g-blk"):
81 | scene = m.find_all("label-g-blk")[0].get_text()
82 | else:
83 | scene = ""
84 |     # xr-gs: synonyms
85 | xdef = m.find_all("def")
86 | chs, ens = split_node_en_ch(xdef)
87 |     # example sentence list
88 | xgs = m.find_all("x-gs")
89 | sents = []
90 | for n in xgs:
91 | for t in get_tag_list(m):
92 | tagSets.add(t)
93 | # print("n=", get_tag_list(n))
94 | xgblk = n.find_all("x")
95 |         # the example sentences
96 | sentences = parse_sentences(xgblk, source)
97 | sents += sentences
98 | paras = {"chinese": chs[0] if chs else "", "english": ens[0] if ens else "",
99 | "category": category, "scene": scene,
100 | "Sentences": sents, "source": source}
101 | return paras
102 |
103 |
104 | def get_cixing(tags, source):
105 | phoneticsymbols = []
106 | paraphrases = []
107 | for i in tags:
108 | for t in get_tag_list(i):
109 | tagSets.add(t)
110 |         # part of speech
111 | pos = i.attrs["id"]
112 |         # extract BrE/NAmE phonetic symbols and audio URLs
113 | if not phoneticsymbols:
114 | phoneticsymbols = get_audio(i.find_all("pron-g-blk"), pos, source)
115 |         # grammatical sub-category
116 |
117 |         # verb forms
118 | if i.find_all("vp-g"):
119 |             # root form, third person singular (vp-g-ps), etc.
120 | wdVp = get_vp(i.find_all("vp-g"))
121 | if i.find_all("vpform"):
122 |             # present simple tense
123 | wdVpForm = i.find_all("vpform")[0].get_text()
124 |
125 |         # definition list
126 | for m in i.find_all("sn-g"):
127 |             # get the definition and its example sentences
128 | paras = get_sn(m, source)
129 | paras["pos"] = pos
130 | paraphrases.append(paras)
131 | return phoneticsymbols, paraphrases
132 |
133 |
134 | def get_vp(tags):
135 |     # collect the verb's inflected forms
136 | vps = []
137 | for i in tags:
138 | vps.append({"form": i.attrs["form"], "text": i.find_all("vp")[0].get_text()})
139 | return vps
140 |
141 |
142 | def get_sns(tags):
143 | # label, def, sn
144 | label = tags.find_all("label")[0].get_text()
145 | return
146 |
147 |
148 | def get_pv(tags, source):
149 | """
150 |     Phrasal verbs.
151 |
152 |     """
153 |     # collect phrasal-verb entries
154 | outs = []
155 |     # phrase list
156 | for i in tags:
157 |         # definition list
158 | paraphrases = []
159 | for j in i.find_all("sn-g"):
160 |             # example sentence list
161 | paras = get_sn(j, source)
162 | paras["pos"] = ""
163 | paraphrases.append(paras)
164 | outs.append({"phrase": i.find_all("pv")[0].get_text(),
165 | "ParaPhrases": paraphrases})
166 | return outs
167 |
168 |
169 | def parse_oxld(lexicon, bs, source="oxld_9"):
170 |     if len(lexicon.split()) > 1:
171 |         lexiconType = "Phrase"
172 |     else:
173 |         lexiconType = "Word"
174 |     result = {"Lexicon": lexicon, "type": lexiconType}
175 | phoneticsymbols = []
176 | paraphrases = []
177 | vpg = []
178 | pvg = []
179 |
180 | print(get_tag_list(bs))
181 |
182 |     # part-of-speech sections (phonetic symbols, definitions, example sentences)
183 | if bs.find_all("div", "cixing_part", recursive=False):
184 | phoneticsymbols, paraphrases = get_cixing(bs.find_all("div", "cixing_part", recursive=False), source)
185 |
186 |     # verb tenses
187 | if bs.find_all("vp-g", recursive=False):
188 | vpg = get_vp(bs.find_all("vp-g", recursive=False))
189 |
190 |     # phrasal verbs
191 | if bs.find_all("pv-gs-blk", recursive=False):
192 | for tag in bs.find_all("pv-gs-blk", recursive=False):
193 | pvg += get_pv(tag.find_all("pv-g"), source)
194 |
195 | result["PhoneticSymbols"] = phoneticsymbols
196 | result["ParaPhrases"] = paraphrases
197 | result["Inflection"] = vpg
198 | result["PhrasalVerbs"] = pvg
199 | return result
200 |
201 |
202 | def parse_jianming(lexicon, bs, source="jianming"):
203 | """
204 |     Concise English-Chinese / Chinese-English Dictionary ("jianming").
205 | """
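206 |     # Expected jianming markup, inferred from the tag walk below (an assumption):
207 |     #   <font color="DarkMagenta">pos</font> <b>Chinese definition</b>
208 |     #   <font color="Navy">English example</font><font>Chinese translation</font> ...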
206 |     if len(lexicon.split()) > 1:
207 |         lexiconType = "Phrase"
208 |     else:
209 |         lexiconType = "Word"
210 |     result = {"Lexicon": lexicon, "type": lexiconType}
211 | phoneticsymbols = []
212 | paraphrases = []
213 |
214 | contents = bs.find_all(["font", "b"])
215 | newpos = ""
216 | newpp = ""
217 | sentences = []
218 | curIc = 0
219 | for ic, i in enumerate(contents):
220 | if ic < curIc:
221 | continue
222 | if i.name == "font" and i.attrs["color"] == "DarkMagenta":
223 | # new part of speech
224 | newpos = i.get_text()
225 | elif i.name == "b":
226 | # new sense
227 | newpp = i.get_text()
228 | if ic + 1 < len(contents):
229 | notSent = (contents[ic + 1].name == "font" and contents[ic + 1].attrs["color"] == "DarkMagenta") or \
230 | contents[ic + 1].name == "b"
231 | if notSent:
232 | if newpos != "" and newpp != "":
233 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences,
234 | "source": source,
235 | "scene": "", "category": ""}
236 | paraphrases.append(paras)
237 | sentences = []
238 | elif ic + 1 == len(contents):
239 | if newpos != "" and newpp != "":
240 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences, "source": source,
241 | "scene": "", "category": ""}
242 | paraphrases.append(paras)
243 | sentences = []
244 | elif i.name == "font" and i.attrs["color"] == "Navy":
245 | # add an example sentence
246 | en = i.get_text().strip()
247 | ch = contents[ic + 1].get_text().strip()
248 | if newpos != "" and newpp != "" and en != "" and ch != "":
249 | sentence = {"english": text_norm(en, "english"), "chinese": text_norm(ch, "chinese"), "audioUrlUS": "",
250 | "audioUrlUK": "", "source": source}
251 | sentences.append(sentence)
252 | # append the sense info only when the next element is not another example sentence
253 | if ic + 2 < len(contents):
254 | notSent = (contents[ic + 2].name == "font" and contents[ic + 2].attrs["color"] == "DarkMagenta") or \
255 | contents[ic + 2].name == "b"
256 | if notSent:
257 | if newpos != "" and newpp != "":
258 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences,
259 | "source": source,
260 | "scene": "", "category": ""}
261 | paraphrases.append(paras)
262 | sentences = []
263 | elif ic + 2 == len(contents):
264 | if newpos != "" and newpp != "":
265 | paras = {"pos": newpos, "english": "", "chinese": newpp, "Sentences": sentences, "source": source,
266 | "scene": "", "category": ""}
267 | paraphrases.append(paras)
268 | sentences = []
269 | curIc = ic + 2
270 |
271 | result["PhoneticSymbols"] = phoneticsymbols
272 | result["ParaPhrases"] = paraphrases
273 | return result
274 |
275 |
276 | def parse_item(item, source, item2infos):
277 | bs = BeautifulSoup(item2infos[item], "html.parser")
278 | # print(bs.prettify())
279 | result = {}
280 | if "oxld" in source:
281 | result = parse_oxld(item, bs, source)
282 | elif "jianming" in source:
283 | result = parse_jianming(item, bs, source)
284 | return result
285 |
286 |
287 | def parse_items(items, source, item2infos):
288 | results = []
289 | for i in items:
290 | result = parse_item(i, source, item2infos)
291 | if result:
292 | results.append(result)
293 | return results
294 |
295 |
296 | def write_json(infos, fileOut):
297 | print("{} lexicons writing...".format(len(infos)))
298 | with open(fileOut, "w", encoding="utf-8") as f:
299 | json.dump(infos, f, ensure_ascii=False)
300 |
301 |
302 | def gen_dict(fileIn, source):
303 | mdx = MDX(fileIn)
304 | items = [i for i in mdx.items()]
305 | print("{} items loaded from {}".format(len(items), fileIn))
306 | item2infos = {i[0].decode("utf-8"): i[1].decode("utf-8") for i in items}
307 | # outs = parse_items(["sorb", "weird"], source, item2infos)
308 | outs = []
309 | for i in items:
310 | try:
311 | out = parse_item(i[0].decode("utf-8"), source, item2infos)
312 | if out["PhoneticSymbols"] or out["ParaPhrases"]:
313 | outs.append(out)
314 | except Exception as e:
315 | print("Error: {}".format(repr(e)))
316 | write_json(outs, f"./output/dict_{len(outs)}_{source}_output.json")
317 |
318 |
319 | def test(fileIn, source):
320 | mdx = MDX(fileIn)
321 | items = [i for i in mdx.items()]
322 | print("{} items loaded from {}".format(len(items), fileIn))
323 | item2infos = {i[0].decode("utf-8"): i[1].decode("utf-8") for i in items}
324 | outs = parse_items(["sorb", "weird"], source, item2infos)
325 | # outs = []
326 | # for i in items:
327 | # try:
328 | # out = parse_item(i[0].decode("utf-8"), source, item2infos)
329 | # if out["PhoneticSymbols"] or out["ParaPhrases"]:
330 | # outs.append(out)
331 | # except Exception as e:
332 | # print("Error: {}".format(repr(e)))
333 | write_json(outs, "./output/dict_output.json")
334 |
335 |
336 | def test_jianming():
337 | fileIn = r'D:\work\database\dict\简明英汉汉英词典.mdx'
338 | source = "jianming-2"
339 | test(fileIn, source)
340 |
341 |
342 | def test_oxld():
343 | fileIn = r'D:\work\database\dict\牛津高阶英汉双解词典(第9版).mdx'
344 | source = "oxld-9"
345 | test(fileIn, source)
346 |
347 |
348 | if __name__ == "__main__":
349 | test_oxld()
350 |
351 | print("{} tags writing...".format(len(tagSets)))
352 | with open("./output/tag_set1.txt", "w") as f:
353 | for i in sorted(tagSets):
354 | f.write(i + "\n")
355 |
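356 |
357 | # Read-back sketch (hypothetical file name): write_json() above dumps a list of
358 | # lexicon dicts, each carrying the "Lexicon", "type", "PhoneticSymbols" and
359 | # "ParaPhrases" keys assembled in parse_oxld()/parse_jianming().
360 | def load_dict(fileIn="./output/dict_output.json"):
361 |     with open(fileIn, "r", encoding="utf-8") as f:
362 |         return json.load(f)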
--------------------------------------------------------------------------------
/pureSalsa20.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | """
5 | Copyright by https://github.com/zhansliu/writemdict
6 |
7 | pureSalsa20.py -- a pure Python implementation of the Salsa20 cipher, ported to Python 3
8 |
9 | v4.0: Added Python 3 support, dropped support for Python <= 2.5.
10 |
11 | // zhansliu
12 |
13 | Original comments below.
14 |
15 | ====================================================================
16 | There are comments here by two authors about three pieces of software:
17 | comments by Larry Bugbee about
18 | Salsa20, the stream cipher by Daniel J. Bernstein
19 | (including comments about the speed of the C version) and
20 | pySalsa20, Bugbee's own Python wrapper for salsa20.c
21 | (including some references), and
22 | comments by Steve Witham about
23 | pureSalsa20, Witham's pure Python 2.5 implementation of Salsa20,
24 | which follows pySalsa20's API, and is in this file.
25 |
26 | Salsa20: a Fast Streaming Cipher (comments by Larry Bugbee)
27 | -----------------------------------------------------------
28 |
29 | Salsa20 is a fast stream cipher written by Daniel Bernstein
30 | that basically uses a hash function and XOR making for fast
31 | encryption. (Decryption uses the same function.) Salsa20
32 | is simple and quick.
33 |
34 | Some Salsa20 parameter values...
35 | design strength 128 bits
36 | key length 128 or 256 bits, exactly
37 | IV, aka nonce 64 bits, always
38 | chunk size must be in multiples of 64 bytes
39 |
40 | Salsa20 has two reduced versions, 8 and 12 rounds each.
41 |
42 | One benchmark (10 MB):
43 | 1.5GHz PPC G4 102/97/89 MB/sec for 8/12/20 rounds
44 | AMD Athlon 2500+ 77/67/53 MB/sec for 8/12/20 rounds
45 | (no I/O and before Python GC kicks in)
46 |
47 | Salsa20 is a Phase 3 finalist in the EU eSTREAM competition
48 | and appears to be one of the fastest ciphers. It is well
49 | documented so I will not attempt any injustice here. Please
50 | see "References" below.
51 |
52 | ...and Salsa20 is "free for any use".
53 |
54 |
55 | pySalsa20: a Python wrapper for Salsa20 (Comments by Larry Bugbee)
56 | ------------------------------------------------------------------
57 |
58 | pySalsa20.py is a simple ctypes Python wrapper. Salsa20 is
59 | as its name implies, 20 rounds, but there are two reduced
60 | versions, 8 and 12 rounds each. Because the APIs are
61 | identical, pySalsa20 is capable of wrapping all three
62 | versions (number of rounds hardcoded), including a special
63 | version that allows you to set the number of rounds with a
64 | set_rounds() function. Compile the version of your choice
65 | as a shared library (not as a Python extension), name and
66 | install it as libsalsa20.so.
67 |
68 | Sample usage:
69 | from pySalsa20 import Salsa20
70 | s20 = Salsa20(key, IV)
71 | dataout = s20.encryptBytes(datain) # same for decrypt
72 |
73 | This is EXPERIMENTAL software and intended for educational
74 | purposes only. To make experimentation less cumbersome,
75 | pySalsa20 is also free for any use.
76 |
77 | THIS PROGRAM IS PROVIDED WITHOUT WARRANTY OR GUARANTEE OF
78 | ANY KIND. USE AT YOUR OWN RISK.
79 |
80 | Enjoy,
81 |
82 | Larry Bugbee
83 | bugbee@seanet.com
84 | April 2007
85 |
86 |
87 | References:
88 | -----------
89 | http://en.wikipedia.org/wiki/Salsa20
90 | http://en.wikipedia.org/wiki/Daniel_Bernstein
91 | http://cr.yp.to/djb.html
92 | http://www.ecrypt.eu.org/stream/salsa20p3.html
93 | http://www.ecrypt.eu.org/stream/p3ciphers/salsa20/salsa20_p3source.zip
94 |
95 |
96 | Prerequisites for pySalsa20:
97 | ----------------------------
98 | - Python 2.5 (haven't tested in 2.4)
99 |
100 |
101 | pureSalsa20: Salsa20 in pure Python 2.5 (comments by Steve Witham)
102 | ------------------------------------------------------------------
103 |
104 | pureSalsa20 is the stand-alone Python code in this file.
105 | It implements the underlying Salsa20 core algorithm
106 | and emulates pySalsa20's Salsa20 class API (minus a bug(*)).
107 |
108 | pureSalsa20 is MUCH slower than libsalsa20.so wrapped with pySalsa20--
109 | about 1/1000 the speed for Salsa20/20 and 1/500 the speed for Salsa20/8,
110 | when encrypting 64k-byte blocks on my computer.
111 |
112 | pureSalsa20 is for cases where portability is much more important than
113 | speed. I wrote it for use in a "structured" random number generator.
114 |
115 | There are comments about the reasons for this slowness in
116 | http://www.tiac.net/~sw/2010/02/PureSalsa20
117 |
118 | Sample usage:
119 | from pureSalsa20 import Salsa20
120 | s20 = Salsa20(key, IV)
121 | dataout = s20.encryptBytes(datain) # same for decrypt
122 |
123 | I took the test code from pySalsa20, added a bunch of tests including
124 | rough speed tests, and moved them into the file testSalsa20.py.
125 | To test both pySalsa20 and pureSalsa20, type
126 | python testSalsa20.py
127 |
128 | (*)The bug (?) in pySalsa20 is this. The rounds variable is global to the
129 | libsalsa20.so library and not switched when switching between instances
130 | of the Salsa20 class.
131 | s1 = Salsa20( key, IV, 20 )
132 | s2 = Salsa20( key, IV, 8 )
133 | In this example,
134 | with pySalsa20, both s1 and s2 will do 8 rounds of encryption.
135 | with pureSalsa20, s1 will do 20 rounds and s2 will do 8 rounds.
136 | Perhaps giving each instance its own nRounds variable, which
137 | is passed to the salsa20wordtobyte() function, is insecure. I'm not a
138 | cryptographer.
139 |
140 | pureSalsa20.py and testSalsa20.py are EXPERIMENTAL software and
141 | intended for educational purposes only. To make experimentation less
142 | cumbersome, pureSalsa20.py and testSalsa20.py are free for any use.
143 |
144 | Revisions:
145 | ----------
146 | p3.2 Fixed bug that initialized the output buffer with plaintext!
147 | Saner ramping of nreps in speed test.
148 | Minor changes and print statements.
149 | p3.1 Took timing variability out of add32() and rot32().
150 | Made the internals more like pySalsa20/libsalsa .
151 | Put the semicolons back in the main loop!
152 | In encryptBytes(), modify a byte array instead of appending.
153 | Fixed speed calculation bug.
154 | Used subclasses instead of patches in testSalsa20.py .
155 | Added 64k-byte messages to speed test to be fair to pySalsa20.
156 | p3 First version, intended to parallel pySalsa20 version 3.
157 |
158 | More references:
159 | ----------------
160 | http://www.seanet.com/~bugbee/crypto/salsa20/ [pySalsa20]
161 | http://cr.yp.to/snuffle.html [The original name of Salsa20]
162 | http://cr.yp.to/snuffle/salsafamily-20071225.pdf [ Salsa20 design]
163 | http://www.tiac.net/~sw/2010/02/PureSalsa20
164 |
165 | THIS PROGRAM IS PROVIDED WITHOUT WARRANTY OR GUARANTEE OF
166 | ANY KIND. USE AT YOUR OWN RISK.
167 |
168 | Cheers,
169 |
170 | Steve Witham sw at remove-this tiac dot net
171 | February, 2010
172 | """
173 | import sys
174 | assert(sys.version_info >= (2, 6))
175 |
176 | if sys.version_info >= (3,):
177 | integer_types = (int,)
178 | python3 = True
179 | else:
180 | integer_types = (int, long)
181 | python3 = False
182 |
183 | from struct import Struct
184 | little_u64 = Struct( "<Q" )     #    little-endian 64-bit unsigned
185 | little16_i32 = Struct( "<16i" ) # 16 little-endian 32-bit signed ints
186 | little2_i32 = Struct( "<2i" )   #  2 little-endian 32-bit signed ints
235 | def setCounter( self, counter ):
236 | assert( type(counter) in integer_types )
237 | assert( 0 <= counter < 1<<64 ), "counter < 2**64"
238 | ctx = self.ctx
239 | ctx[ 8],ctx[ 9] = little2_i32.unpack( little_u64.pack( counter ) )
240 |
241 | def getCounter( self ):
242 | return little_u64.unpack( little2_i32.pack( *self.ctx[ 8:10 ] ) ) [0]
243 |
244 |
245 | def setRounds(self, rounds, testing=False ):
246 | assert testing or rounds in [8, 12, 20], 'rounds must be 8, 12, 20'
247 | self.rounds = rounds
248 |
249 |
250 | def encryptBytes(self, data):
251 | assert type(data) == bytes, 'data must be byte string'
252 | assert self._lastChunk64, 'previous chunk not multiple of 64 bytes'
253 | lendata = len(data)
254 | munged = bytearray(lendata)
255 | for i in range( 0, lendata, 64 ):
256 | h = salsa20_wordtobyte( self.ctx, self.rounds, checkRounds=False )
257 | self.setCounter( ( self.getCounter() + 1 ) % 2**64 )
258 | # Stopping at 2^70 bytes per nonce is user's responsibility.
259 | for j in range( min( 64, lendata - i ) ):
260 | if python3:
261 | munged[ i+j ] = data[ i+j ] ^ h[j]
262 | else:
263 | munged[ i+j ] = ord(data[ i+j ]) ^ ord(h[j])
264 |
265 | self._lastChunk64 = not lendata % 64
266 | return bytes(munged)
267 |
268 | decryptBytes = encryptBytes # encrypt and decrypt use same function
269 |
270 | #--------------------------------------------------------------------------
271 |
272 | def salsa20_wordtobyte( input, nRounds=20, checkRounds=True ):
273 | """ Do nRounds Salsa20 rounds on a copy of
274 | input: list or tuple of 16 ints treated as little-endian unsigneds.
275 | Returns a 64-byte string.
276 | """
277 |
278 | assert( type(input) in ( list, tuple ) and len(input) == 16 )
279 | assert( not(checkRounds) or ( nRounds in [ 8, 12, 20 ] ) )
280 |
281 | x = list( input )
282 |
283 | def XOR( a, b ): return a ^ b
284 | ROTATE = rot32
285 | PLUS = add32
286 |
287 | for i in range( nRounds // 2 ):
288 | # These ...XOR...ROTATE...PLUS... lines are from ecrypt-linux.c
289 | # unchanged except for indents and the blank line between rounds:
290 | x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 0],x[12]), 7));
291 | x[ 8] = XOR(x[ 8],ROTATE(PLUS(x[ 4],x[ 0]), 9));
292 | x[12] = XOR(x[12],ROTATE(PLUS(x[ 8],x[ 4]),13));
293 | x[ 0] = XOR(x[ 0],ROTATE(PLUS(x[12],x[ 8]),18));
294 | x[ 9] = XOR(x[ 9],ROTATE(PLUS(x[ 5],x[ 1]), 7));
295 | x[13] = XOR(x[13],ROTATE(PLUS(x[ 9],x[ 5]), 9));
296 | x[ 1] = XOR(x[ 1],ROTATE(PLUS(x[13],x[ 9]),13));
297 | x[ 5] = XOR(x[ 5],ROTATE(PLUS(x[ 1],x[13]),18));
298 | x[14] = XOR(x[14],ROTATE(PLUS(x[10],x[ 6]), 7));
299 | x[ 2] = XOR(x[ 2],ROTATE(PLUS(x[14],x[10]), 9));
300 | x[ 6] = XOR(x[ 6],ROTATE(PLUS(x[ 2],x[14]),13));
301 | x[10] = XOR(x[10],ROTATE(PLUS(x[ 6],x[ 2]),18));
302 | x[ 3] = XOR(x[ 3],ROTATE(PLUS(x[15],x[11]), 7));
303 | x[ 7] = XOR(x[ 7],ROTATE(PLUS(x[ 3],x[15]), 9));
304 | x[11] = XOR(x[11],ROTATE(PLUS(x[ 7],x[ 3]),13));
305 | x[15] = XOR(x[15],ROTATE(PLUS(x[11],x[ 7]),18));
306 |
307 | x[ 1] = XOR(x[ 1],ROTATE(PLUS(x[ 0],x[ 3]), 7));
308 | x[ 2] = XOR(x[ 2],ROTATE(PLUS(x[ 1],x[ 0]), 9));
309 | x[ 3] = XOR(x[ 3],ROTATE(PLUS(x[ 2],x[ 1]),13));
310 | x[ 0] = XOR(x[ 0],ROTATE(PLUS(x[ 3],x[ 2]),18));
311 | x[ 6] = XOR(x[ 6],ROTATE(PLUS(x[ 5],x[ 4]), 7));
312 | x[ 7] = XOR(x[ 7],ROTATE(PLUS(x[ 6],x[ 5]), 9));
313 | x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 7],x[ 6]),13));
314 | x[ 5] = XOR(x[ 5],ROTATE(PLUS(x[ 4],x[ 7]),18));
315 | x[11] = XOR(x[11],ROTATE(PLUS(x[10],x[ 9]), 7));
316 | x[ 8] = XOR(x[ 8],ROTATE(PLUS(x[11],x[10]), 9));
317 | x[ 9] = XOR(x[ 9],ROTATE(PLUS(x[ 8],x[11]),13));
318 | x[10] = XOR(x[10],ROTATE(PLUS(x[ 9],x[ 8]),18));
319 | x[12] = XOR(x[12],ROTATE(PLUS(x[15],x[14]), 7));
320 | x[13] = XOR(x[13],ROTATE(PLUS(x[12],x[15]), 9));
321 | x[14] = XOR(x[14],ROTATE(PLUS(x[13],x[12]),13));
322 | x[15] = XOR(x[15],ROTATE(PLUS(x[14],x[13]),18));
323 |
324 | for i in range( len( input ) ):
325 | x[i] = PLUS( x[i], input[i] )
326 | return little16_i32.pack( *x )
327 |
328 | #--------------------------- 32-bit ops -------------------------------
329 |
330 | def trunc32( w ):
331 | """ Return the bottom 32 bits of w as a Python int.
332 | This creates longs temporarily, but returns an int. """
333 | w = int( ( w & 0x7fffFFFF ) | -( w & 0x80000000 ) )
334 | assert type(w) == int
335 | return w
336 |
337 |
338 | def add32( a, b ):
339 | """ Add two 32-bit words discarding carry above 32nd bit,
340 | and without creating a Python long.
341 | Timing shouldn't vary.
342 | """
343 | lo = ( a & 0xFFFF ) + ( b & 0xFFFF )
344 | hi = ( a >> 16 ) + ( b >> 16 ) + ( lo >> 16 )
345 | return ( -(hi & 0x8000) | ( hi & 0x7FFF ) ) << 16 | ( lo & 0xFFFF )
346 |
347 |
348 | def rot32( w, nLeft ):
349 | """ Rotate 32-bit word left by nLeft or right by -nLeft
350 | without creating a Python long.
351 | Timing depends on nLeft but not on w.
352 | """
353 | nLeft &= 31 # which makes nLeft >= 0
354 | if nLeft == 0:
355 | return w
356 |
357 | # Note: now 1 <= nLeft <= 31.
358 | # RRRsLLLLLL There are nLeft RRR's, (31-nLeft) LLLLLL's,
359 | # => sLLLLLLRRR and one s which becomes the sign bit.
360 | RRR = ( ( ( w >> 1 ) & 0x7fffFFFF ) >> ( 31 - nLeft ) )
361 | sLLLLLL = -( (1<<(31-nLeft)) & w ) | (0x7fffFFFF>>nLeft) & w
362 | return RRR | ( sLLLLLL << nLeft )
363 |
364 |
365 | # --------------------------------- end -----------------------------------
366 |
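367 |
368 | # A round-trip sketch (assumed 32-byte key and 8-byte IV values; not an
369 | # official test vector): decryption is the same keystream XOR as encryption,
370 | # so a fresh instance with the same key/IV restores the plaintext.
371 | #
372 | #     s = Salsa20(b'k' * 32, b'i' * 8, rounds=20)
373 | #     ct = s.encryptBytes(b'attack at dawn!!' * 4)
374 | #     assert Salsa20(b'k' * 32, b'i' * 8, rounds=20).encryptBytes(ct) == b'attack at dawn!!' * 4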
--------------------------------------------------------------------------------
/readmdict.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # readmdict.py
4 | # Octopus MDict Dictionary File (.mdx) and Resource File (.mdd) Analyser
5 | #
6 | # Copyright (C) 2012, 2013, 2015 Xiaoqiang Wang
7 | #
8 | # This program is a free software; you can redistribute it and/or modify
9 | # it under the terms of the GNU General Public License as published by
10 | # the Free Software Foundation, version 3 of the License.
11 | #
12 | # You should have received a copy of the GNU General Public License along with
13 | # this program; if not, you can always get it from http://www.gnu.org/licenses/gpl.txt
14 | #
15 | # This program is distributed in the hope that it will be useful,
16 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
17 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
18 | # GNU General Public License for more details.
19 |
20 | from struct import pack, unpack
21 | from io import BytesIO
22 | import re
23 | import sys
24 |
25 | from ripemd128 import ripemd128
26 | from pureSalsa20 import Salsa20
27 |
28 | # zlib compression is used for engine version >=2.0
29 | import zlib
30 | # LZO compression is used for engine version < 2.0
31 | try:
32 | import lzo
33 | except ImportError:
34 | lzo = None
35 | print("LZO compression support is not available")
36 |
37 | # 2x3 compatible
38 | if sys.hexversion >= 0x03000000:
39 | unicode = str
40 |
41 |
42 | def _unescape_entities(text):
43 | """
44 | unescape offending tags < > " &
45 | """
46 | text = text.replace(b'<', b'<')
47 | text = text.replace(b'>', b'>')
48 | text = text.replace(b'"', b'"')
49 | text = text.replace(b'&', b'&')
50 | return text
51 |
52 |
53 | def _fast_decrypt(data, key):
54 | b = bytearray(data)
55 | key = bytearray(key)
56 | previous = 0x36
57 | for i in range(len(b)):
58 | t = (b[i] >> 4 | b[i] << 4) & 0xff
59 | t = t ^ previous ^ (i & 0xff) ^ key[i % len(key)]
60 | previous = b[i]
61 | b[i] = t
62 | return bytes(b)
63 |
64 |
65 | def _mdx_decrypt(comp_block):
66 | key = ripemd128(comp_block[4:8] + pack(b'<L', 0x3695))
67 | return comp_block[0:8] + _fast_decrypt(comp_block[8:], key)
121 | def _parse_header(self, header):
122 | """
123 | extract attributes from <Dict attr="value" ... >
124 | """
125 | taglist = re.findall(rb'(\w+)="(.*?)"', header, re.DOTALL)
126 | tagdict = {}
127 | for key, value in taglist:
128 | tagdict[key] = _unescape_entities(value)
129 | return tagdict
130 |
131 | def _decode_key_block_info(self, key_block_info_compressed):
132 | if self._version >= 2:
133 | # zlib compression
134 | assert(key_block_info_compressed[:4] == b'\x02\x00\x00\x00')
135 | # decrypt if needed
136 | if self._encrypt & 0x02:
137 | key_block_info_compressed = _mdx_decrypt(key_block_info_compressed)
138 | # decompress
139 | key_block_info = zlib.decompress(key_block_info_compressed[8:])
140 | # adler checksum
141 | adler32 = unpack('>I', key_block_info_compressed[4:8])[0]
142 | assert(adler32 == zlib.adler32(key_block_info) & 0xffffffff)
143 | else:
144 | # no compression
145 | key_block_info = key_block_info_compressed
146 | # decode
147 | key_block_info_list = []
148 | num_entries = 0
149 | i = 0
150 | if self._version >= 2:
151 | byte_format = '>H'
152 | byte_width = 2
153 | text_term = 1
154 | else:
155 | byte_format = '>B'
156 | byte_width = 1
157 | text_term = 0
158 |
159 | while i < len(key_block_info):
160 | # number of entries in current key block
161 | num_entries += unpack(self._number_format, key_block_info[i:i+self._number_width])[0]
162 | i += self._number_width
163 | # text head size
164 | text_head_size = unpack(byte_format, key_block_info[i:i+byte_width])[0]
165 | i += byte_width
166 | # text head
167 | if self._encoding != 'UTF-16':
168 | i += text_head_size + text_term
169 | else:
170 | i += (text_head_size + text_term) * 2
171 | # text tail size
172 | text_tail_size = unpack(byte_format, key_block_info[i:i+byte_width])[0]
173 | i += byte_width
174 | # text tail
175 | if self._encoding != 'UTF-16':
176 | i += text_tail_size + text_term
177 | else:
178 | i += (text_tail_size + text_term) * 2
179 | # key block compressed size
180 | key_block_compressed_size = unpack(self._number_format, key_block_info[i:i+self._number_width])[0]
181 | i += self._number_width
182 | # key block decompressed size
183 | key_block_decompressed_size = unpack(self._number_format, key_block_info[i:i+self._number_width])[0]
184 | i += self._number_width
185 | key_block_info_list += [(key_block_compressed_size, key_block_decompressed_size)]
186 |
187 | #assert(num_entries == self._num_entries)
188 |
189 | return key_block_info_list
190 |
191 | def _decode_key_block(self, key_block_compressed, key_block_info_list):
192 | key_list = []
193 | i = 0
194 | for compressed_size, decompressed_size in key_block_info_list:
195 | start = i
196 | end = i + compressed_size
197 | # 4 bytes : compression type
198 | key_block_type = key_block_compressed[start:start+4]
199 | # 4 bytes : adler checksum of decompressed key block
200 | adler32 = unpack('>I', key_block_compressed[start+4:start+8])[0]
201 | if key_block_type == b'\x00\x00\x00\x00':
202 | key_block = key_block_compressed[start+8:end]
203 | elif key_block_type == b'\x01\x00\x00\x00':
204 | if lzo is None:
205 | print("LZO compression is not supported")
206 | break
207 | # decompress key block
208 | header = b'\xf0' + pack('>I', decompressed_size)
209 | key_block = lzo.decompress(header + key_block_compressed[start+8:end])
210 | elif key_block_type == b'\x02\x00\x00\x00':
211 | # decompress key block
212 | key_block = zlib.decompress(key_block_compressed[start+8:end])
213 | # extract one single key block into a key list
214 | key_list += self._split_key_block(key_block)
215 | # notice that adler32 returns signed value
216 | assert(adler32 == zlib.adler32(key_block) & 0xffffffff)
217 |
218 | i += compressed_size
219 | return key_list
220 |
221 | def _split_key_block(self, key_block):
222 | key_list = []
223 | key_start_index = 0
224 | while key_start_index < len(key_block):
225 | # the corresponding record's offset in record block
226 | key_id = unpack(self._number_format, key_block[key_start_index:key_start_index+self._number_width])[0]
227 | # key text ends with '\x00'
228 | if self._encoding == 'UTF-16':
229 | delimiter = b'\x00\x00'
230 | width = 2
231 | else:
232 | delimiter = b'\x00'
233 | width = 1
234 | i = key_start_index + self._number_width
235 | while i < len(key_block):
236 | if key_block[i:i+width] == delimiter:
237 | key_end_index = i
238 | break
239 | i += width
240 | key_text = key_block[key_start_index+self._number_width:key_end_index]\
241 | .decode(self._encoding, errors='ignore').encode('utf-8').strip()
242 | key_start_index = key_end_index + width
243 | key_list += [(key_id, key_text)]
244 | return key_list
245 |
246 | def _read_header(self):
247 | f = open(self._fname, 'rb')
248 | # number of bytes of header text
249 | header_bytes_size = unpack('>I', f.read(4))[0]
250 | header_bytes = f.read(header_bytes_size)
251 | # 4 bytes: adler32 checksum of header, in little endian
252 | adler32 = unpack('<I', f.read(4))[0]
253 | assert adler32 == zlib.adler32(header_bytes) & 0xffffffff
254 | # mark down key block offset
255 | self._key_block_offset = f.tell()
256 | f.close()
257 |
258 | # header text in utf-16 encoding ending with '\x00\x00'
259 | header_text = header_bytes[:-2].decode('utf-16').encode('utf-8')
260 | header_tag = self._parse_header(header_text)
261 | if not self._encoding:
262 | encoding = header_tag[b'Encoding']
263 | if sys.hexversion >= 0x03000000:
264 | encoding = encoding.decode('utf-8')
265 | # GB18030 > GBK > GB2312
266 | if encoding in ['GBK', 'GB2312']:
267 | encoding = 'GB18030'
268 | self._encoding = encoding
269 | # encryption flag
270 | # 0x00 - no encryption
271 | # 0x01 - encrypt record block
272 | # 0x02 - encrypt key info block
273 | if b'Encrypted' not in header_tag or header_tag[b'Encrypted'] == b'No':
274 | self._encrypt = 0
275 | elif header_tag[b'Encrypted'] == b'Yes':
276 | self._encrypt = 1
277 | else:
278 | self._encrypt = int(header_tag[b'Encrypted'])
279 |
280 | # stylesheet attribute if present takes form of:
281 | # style_number # 1-255
282 | # style_begin # or ''
283 | # style_end # or ''
284 | # store stylesheet in dict in the form of
285 | # {'number' : ('style_begin', 'style_end')}
286 | self._stylesheet = {}
287 | if header_tag.get(b'StyleSheet'):  # header tags are parsed with bytes keys
288 | lines = header_tag[b'StyleSheet'].splitlines()
289 | for i in range(0, len(lines), 3):
290 | self._stylesheet[lines[i]] = (lines[i+1], lines[i+2])
291 |
292 | # before version 2.0, number is 4 bytes integer
293 | # version 2.0 and above uses 8 bytes
294 | self._version = float(header_tag[b'GeneratedByEngineVersion'])
295 | if self._version < 2.0:
296 | self._number_width = 4
297 | self._number_format = '>I'
298 | else:
299 | self._number_width = 8
300 | self._number_format = '>Q'
301 |
302 | return header_tag
303 |
304 | def _read_keys(self):
305 | f = open(self._fname, 'rb')
306 | f.seek(self._key_block_offset)
307 |
308 | # the following numbers could be encrypted
309 | if self._version >= 2.0:
310 | num_bytes = 8 * 5
311 | else:
312 | num_bytes = 4 * 4
313 | block = f.read(num_bytes)
314 |
315 | if self._encrypt & 1:
316 | if self._passcode is None:
317 | raise RuntimeError('user identification is needed to read encrypted file')
318 | regcode, userid = self._passcode
319 | if isinstance(userid, unicode):
320 | userid = userid.encode('utf8')
321 | if self.header[b'RegisterBy'] == b'EMail':
322 | encrypted_key = _decrypt_regcode_by_email(regcode, userid)
323 | else:
324 | encrypted_key = _decrypt_regcode_by_deviceid(regcode, userid)
325 | block = _salsa_decrypt(block, encrypted_key)
326 |
327 | # decode this block
328 | sf = BytesIO(block)
329 | # number of key blocks
330 | num_key_blocks = self._read_number(sf)
331 | # number of entries
332 | self._num_entries = self._read_number(sf)
333 | # number of bytes of key block info after decompression
334 | if self._version >= 2.0:
335 | key_block_info_decomp_size = self._read_number(sf)
336 | # number of bytes of key block info
337 | key_block_info_size = self._read_number(sf)
338 | # number of bytes of key block
339 | key_block_size = self._read_number(sf)
340 |
341 | # 4 bytes: adler checksum of previous 5 numbers
342 | if self._version >= 2.0:
343 | adler32 = unpack('>I', f.read(4))[0]
344 | assert adler32 == (zlib.adler32(block) & 0xffffffff)
345 |
346 | # read key block info, which indicates key block's compressed and decompressed size
347 | key_block_info = f.read(key_block_info_size)
348 | key_block_info_list = self._decode_key_block_info(key_block_info)
349 | assert(num_key_blocks == len(key_block_info_list))
350 |
351 | # read key block
352 | key_block_compressed = f.read(key_block_size)
353 | # extract key block
354 | key_list = self._decode_key_block(key_block_compressed, key_block_info_list)
355 |
356 | self._record_block_offset = f.tell()
357 | f.close()
358 |
359 | return key_list
360 |
361 | def _read_keys_brutal(self):
362 | f = open(self._fname, 'rb')
363 | f.seek(self._key_block_offset)
364 |
365 | # the following numbers could be encrypted, disregard them!
366 | if self._version >= 2.0:
367 | num_bytes = 8 * 5 + 4
368 | key_block_type = b'\x02\x00\x00\x00'
369 | else:
370 | num_bytes = 4 * 4
371 | key_block_type = b'\x01\x00\x00\x00'
372 | block = f.read(num_bytes)
373 |
374 | # key block info
375 | # 4 bytes '\x02\x00\x00\x00'
376 | # 4 bytes adler32 checksum
377 | # unknown number of bytes follows until '\x02\x00\x00\x00' which marks the beginning of key block
378 | key_block_info = f.read(8)
379 | if self._version >= 2.0:
380 | assert key_block_info[:4] == b'\x02\x00\x00\x00'
381 | while True:
382 | fpos = f.tell()
383 | t = f.read(1024)
384 | index = t.find(key_block_type)
385 | if index != -1:
386 | key_block_info += t[:index]
387 | f.seek(fpos + index)
388 | break
389 | else:
390 | key_block_info += t
391 |
392 | key_block_info_list = self._decode_key_block_info(key_block_info)
393 | key_block_size = sum(list(zip(*key_block_info_list))[0])
394 |
395 | # read key block
396 | key_block_compressed = f.read(key_block_size)
397 | # extract key block
398 | key_list = self._decode_key_block(key_block_compressed, key_block_info_list)
399 |
400 | self._record_block_offset = f.tell()
401 | f.close()
402 |
403 | self._num_entries = len(key_list)
404 | return key_list
405 |
406 |
407 | class MDD(MDict):
408 | """
409 | MDict resource file format (*.MDD) reader.
410 | >>> mdd = MDD('example.mdd')
411 | >>> len(mdd)
412 | 208
413 | >>> for filename,content in mdd.items():
414 | ... print(filename, content[:10])
415 | """
416 | def __init__(self, fname, passcode=None):
417 | MDict.__init__(self, fname, encoding='UTF-16', passcode=passcode)
418 |
419 | def items(self):
420 | """Return a generator which in turn produce tuples in the form of (filename, content)
421 | """
422 | return self._decode_record_block()
423 |
424 | def _decode_record_block(self):
425 | f = open(self._fname, 'rb')
426 | f.seek(self._record_block_offset)
427 |
428 | num_record_blocks = self._read_number(f)
429 | num_entries = self._read_number(f)
430 | assert(num_entries == self._num_entries)
431 | record_block_info_size = self._read_number(f)
432 | record_block_size = self._read_number(f)
433 |
434 | # record block info section
435 | record_block_info_list = []
436 | size_counter = 0
437 | for i in range(num_record_blocks):
438 | compressed_size = self._read_number(f)
439 | decompressed_size = self._read_number(f)
440 | record_block_info_list += [(compressed_size, decompressed_size)]
441 | size_counter += self._number_width * 2
442 | assert(size_counter == record_block_info_size)
443 |
444 | # actual record block
445 | offset = 0
446 | i = 0
447 | size_counter = 0
448 | for compressed_size, decompressed_size in record_block_info_list:
449 | record_block_compressed = f.read(compressed_size)
450 | # 4 bytes: compression type
451 | record_block_type = record_block_compressed[:4]
452 | # 4 bytes: adler32 checksum of decompressed record block
453 | adler32 = unpack('>I', record_block_compressed[4:8])[0]
454 | if record_block_type == b'\x00\x00\x00\x00':
455 | record_block = record_block_compressed[8:]
456 | elif record_block_type == b'\x01\x00\x00\x00':
457 | if lzo is None:
458 | print("LZO compression is not supported")
459 | break
460 | # decompress
461 | header = b'\xf0' + pack('>I', decompressed_size)
462 | record_block = lzo.decompress(header + record_block_compressed[8:])
463 | elif record_block_type == b'\x02\x00\x00\x00':
464 | # decompress
465 | record_block = zlib.decompress(record_block_compressed[8:])
466 |
467 | # notice that adler32 return signed value
468 | assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
469 |
470 | assert(len(record_block) == decompressed_size)
471 | # split record block according to the offset info from key block
472 | while i < len(self._key_list):
473 | record_start, key_text = self._key_list[i]
474 | # reach the end of current record block
475 | if record_start - offset >= len(record_block):
476 | break
477 | # record end index
478 | if i < len(self._key_list)-1:
479 | record_end = self._key_list[i+1][0]
480 | else:
481 | record_end = len(record_block) + offset
482 | i += 1
483 | data = record_block[record_start-offset:record_end-offset]
484 | yield key_text, data
485 | offset += len(record_block)
486 | size_counter += compressed_size
487 | assert(size_counter == record_block_size)
488 |
489 | f.close()
490 |
491 |
492 | class MDX(MDict):
493 | """
494 | MDict dictionary file format (*.MDX) reader.
495 | >>> mdx = MDX('example.mdx')
496 | >>> len(mdx)
497 | 42481
498 | >>> for key,value in mdx.items():
499 | ... print(key, value[:10])
500 | """
501 | def __init__(self, fname, encoding='', substyle=False, passcode=None):
502 | MDict.__init__(self, fname, encoding, passcode)
503 | self._substyle = substyle
504 |
505 | def items(self):
506 | """Return a generator which in turn produce tuples in the form of (key, value)
507 | """
508 | return self._decode_record_block()
509 |
510 | def _substitute_stylesheet(self, txt):
511 | # substitute stylesheet definition
512 | txt_list = re.split('`\d+`', txt)
513 | txt_tag = re.findall('`\d+`', txt)
514 | txt_styled = txt_list[0]
515 | for j, p in enumerate(txt_list[1:]):
516 | style = self._stylesheet[txt_tag[j][1:-1]]
517 | if p and p[-1] == '\n':
518 | txt_styled = txt_styled + style[0] + p.rstrip() + style[1] + '\r\n'
519 | else:
520 | txt_styled = txt_styled + style[0] + p + style[1]
521 | return txt_styled
522 |
523 | def _decode_record_block(self):
524 | f = open(self._fname, 'rb')
525 | f.seek(self._record_block_offset)
526 |
527 | num_record_blocks = self._read_number(f)
528 | num_entries = self._read_number(f)
529 | assert(num_entries == self._num_entries)
530 | record_block_info_size = self._read_number(f)
531 | record_block_size = self._read_number(f)
532 |
533 | # record block info section
534 | record_block_info_list = []
535 | size_counter = 0
536 | for i in range(num_record_blocks):
537 | compressed_size = self._read_number(f)
538 | decompressed_size = self._read_number(f)
539 | record_block_info_list += [(compressed_size, decompressed_size)]
540 | size_counter += self._number_width * 2
541 | assert(size_counter == record_block_info_size)
542 |
543 | # actual record block data
544 | offset = 0
545 | i = 0
546 | size_counter = 0
547 | for compressed_size, decompressed_size in record_block_info_list:
548 | record_block_compressed = f.read(compressed_size)
549 | # 4 bytes indicates block compression type
550 | record_block_type = record_block_compressed[:4]
551 | # 4 bytes adler checksum of uncompressed content
552 | adler32 = unpack('>I', record_block_compressed[4:8])[0]
553 | # no compression
554 | if record_block_type == b'\x00\x00\x00\x00':
555 | record_block = record_block_compressed[8:]
556 | # lzo compression
557 | elif record_block_type == b'\x01\x00\x00\x00':
558 | if lzo is None:
559 | print("LZO compression is not supported")
560 | break
561 | # decompress
562 | header = b'\xf0' + pack('>I', decompressed_size)
563 | record_block = lzo.decompress(header + record_block_compressed[8:])
564 | # zlib compression
565 | elif record_block_type == b'\x02\x00\x00\x00':
566 | # decompress
567 | record_block = zlib.decompress(record_block_compressed[8:])
568 |
569 | # notice that adler32 return signed value
570 | assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
571 |
572 | assert(len(record_block) == decompressed_size)
573 | # split record block according to the offset info from key block
574 | while i < len(self._key_list):
575 | record_start, key_text = self._key_list[i]
576 | # reach the end of current record block
577 | if record_start - offset >= len(record_block):
578 | break
579 | # record end index
580 | if i < len(self._key_list)-1:
581 | record_end = self._key_list[i+1][0]
582 | else:
583 | record_end = len(record_block) + offset
584 | i += 1
585 | record = record_block[record_start-offset:record_end-offset]
586 | # convert to utf-8
587 | record = record.decode(self._encoding, errors='ignore').strip(u'\x00').encode('utf-8')
588 | # substitute styles
589 | if self._substyle and self._stylesheet:
590 | record = self._substitute_stylesheet(record)
591 |
592 | yield key_text, record
593 | offset += len(record_block)
594 | size_counter += compressed_size
595 | assert(size_counter == record_block_size)
596 |
597 | f.close()
598 |
599 |
600 | if __name__ == '__main__':
601 | import sys
602 | import os
603 | import os.path
604 | import argparse
605 | import codecs
606 |
607 | def passcode(s):
608 | try:
609 | regcode, userid = s.split(',')
610 | except:
611 | raise argparse.ArgumentTypeError("Passcode must be regcode,userid")
612 | try:
613 | regcode = codecs.decode(regcode, 'hex')
614 | except:
615 | raise argparse.ArgumentTypeError("regcode must be a 32 bytes hexadecimal string")
616 | return regcode, userid
617 |
618 | parser = argparse.ArgumentParser()
619 | parser.add_argument('-x', '--extract', action="store_true",
620 | help='extract mdx to source format and extract files from mdd')
621 | parser.add_argument('-s', '--substyle', action="store_true",
622 | help='substitute style definition if present')
623 | parser.add_argument('-d', '--datafolder', default="data",
624 | help='folder to extract data files from mdd')
625 | parser.add_argument('-e', '--encoding', default="",
626 | help='override the encoding declared in the dictionary header')
627 | parser.add_argument('-p', '--passcode', default=None, type=passcode,
628 | help='register_code,email_or_deviceid')
629 | parser.add_argument("filename", nargs='?', help="mdx file name")
630 | args = parser.parse_args()
631 |
632 | # use GUI to select file, default to extract
633 | if not args.filename:
634 | try: import Tkinter, tkFileDialog  # Python 2
635 | except ImportError: import tkinter as Tkinter, tkinter.filedialog as tkFileDialog  # Python 3
636 | root = Tkinter.Tk()
637 | root.withdraw()
638 | args.filename = tkFileDialog.askopenfilename(parent=root)
639 | args.extract = True
640 |
641 | if not os.path.exists(args.filename):
642 | print("Please specify a valid MDX/MDD file")
643 |
644 | base, ext = os.path.splitext(args.filename)
645 |
646 | # read mdx file
647 | if ext.lower() == os.path.extsep + 'mdx':
648 | mdx = MDX(args.filename, args.encoding, args.substyle, args.passcode)
649 | if type(args.filename) is unicode:
650 | bfname = args.filename.encode('utf-8')
651 | else:
652 | bfname = args.filename
653 | print('======== %s ========' % bfname)
654 | print(' Number of Entries : %d' % len(mdx))
655 | for key, value in mdx.header.items():
656 | print(' %s : %s' % (key, value))
657 | else:
658 | mdx = None
659 |
660 | # find companion mdd file
661 | mdd_filename = ''.join([base, os.path.extsep, 'mdd'])
662 | if os.path.exists(mdd_filename):
663 | mdd = MDD(mdd_filename, args.passcode)
664 | if type(mdd_filename) is unicode:
665 | bfname = mdd_filename.encode('utf-8')
666 | else:
667 | bfname = mdd_filename
668 | print('======== %s ========' % bfname)
669 | print(' Number of Entries : %d' % len(mdd))
670 | for key, value in mdd.header.items():
671 | print(' %s : %s' % (key, value))
672 | else:
673 | mdd = None
674 |
675 | if args.extract:
676 | # write out glos
677 | if mdx:
678 | output_fname = ''.join([base, os.path.extsep, 'txt'])
679 | tf = open(output_fname, 'wb')
680 | for key, value in mdx.items():
681 | tf.write(key)
682 | tf.write(b'\r\n')
683 | tf.write(value)
684 | if not value.endswith(b'\n'):
685 | tf.write(b'\r\n')
686 | tf.write(b'>\r\n')
687 | tf.close()
688 | # write out style
689 | if mdx.header.get(b'StyleSheet'):  # header keys are bytes
690 | style_fname = ''.join([base, '_style', os.path.extsep, 'txt'])
691 | sf = open(style_fname, 'wb')
692 | sf.write(b'\r\n'.join(mdx.header[b'StyleSheet'].splitlines()))
693 | sf.close()
694 | # write out optional data files
695 | if mdd:
696 | datafolder = os.path.join(os.path.dirname(args.filename), args.datafolder)
697 | if not os.path.exists(datafolder):
698 | os.makedirs(datafolder)
699 | for key, value in mdd.items():
700 | fname = key.decode('utf-8').replace('\\', os.path.sep)
701 | dfname = datafolder + fname
702 | if not os.path.exists(os.path.dirname(dfname)):
703 | os.makedirs(os.path.dirname(dfname))
704 | df = open(dfname, 'wb')
705 | df.write(value)
706 | df.close()
707 |
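708 |
709 | # Header-layout sketch (assumed file name): the MDX/MDD header read by
710 | # _read_header() above is a 4-byte big-endian text length, the UTF-16
711 | # header text, then a 4-byte little-endian adler32 of that text.
712 | #
713 | #     with open('example.mdx', 'rb') as f:
714 | #         size = unpack('>I', f.read(4))[0]
715 | #         header_bytes = f.read(size)
716 | #         assert unpack('<I', f.read(4))[0] == zlib.adler32(header_bytes) & 0xffffffff
717 | #         print(header_bytes[:-2].decode('utf-16'))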
--------------------------------------------------------------------------------
/ripemd128.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright by https://github.com/zhansliu/writemdict
3 |
4 | ripemd128.py - A simple ripemd128 library in pure Python.
5 |
6 | Supports both Python 2 (versions >= 2.6) and Python 3.
7 |
8 | Usage:
9 | from ripemd128 import ripemd128
10 | digest = ripemd128(b"The quick brown fox jumps over the lazy dog")
11 | assert(digest == b"\x3f\xa9\xb5\x7f\x05\x3c\x05\x3f\xbe\x27\x35\xb2\x38\x0d\xb5\x96")
12 |
13 | """
14 |
15 |
16 |
17 | import struct
18 |
19 |
20 | # follows this description: http://homes.esat.kuleuven.be/~bosselae/ripemd/rmd128.txt
21 |
22 | def f(j, x, y, z):
23 | assert(0 <= j and j < 64)
24 | if j < 16:
25 | return x ^ y ^ z
26 | elif j < 32:
27 | return (x & y) | (z & ~x)
28 | elif j < 48:
29 | return (x | (0xffffffff & ~y)) ^ z
30 | else:
31 | return (x & z) | (y & ~z)
32 |
33 | def K(j):
34 | assert(0 <= j and j < 64)
35 | if j < 16:
36 | return 0x00000000
37 | elif j < 32:
38 | return 0x5a827999
39 | elif j < 48:
40 | return 0x6ed9eba1
41 | else:
42 | return 0x8f1bbcdc
43 |
44 | def Kp(j):
45 | assert(0 <= j and j < 64)
46 | if j < 16:
47 | return 0x50a28be6
48 | elif j < 32:
49 | return 0x5c4dd124
50 | elif j < 48:
51 | return 0x6d703ef3
52 | else:
53 | return 0x00000000
54 |
55 | def padandsplit(message):
56 | """
57 | returns a two-dimensional array X[i][j] of 32-bit integers, where j ranges
58 | from 0 to 15.
59 | First pads the message so that its length in bytes is congruent to
60 | 56 (mod 64), by adding a byte 0x80 and then padding with 0x00 bytes as
61 | needed. Then adds the little-endian
62 | 64-bit representation of the original length. Finally, splits the result
63 | up into 64-byte blocks, which are further parsed as 32-bit integers.
64 | """
65 | origlen = len(message)
66 | padlength = 64 - ((origlen - 56) % 64) #minimum padding is 1!
67 | message += b"\x80"
68 | message += b"\x00" * (padlength - 1)
69 | message += struct.pack("<Q", 8 * origlen)
70 | x = []
71 | for i in range(0, len(message), 4):
72 | x.append(struct.unpack("<L", message[i:i+4])[0])
73 | return [x[i:i+16] for i in range(0, len(x), 16)]
74 |
75 |
76 | def add(*args):
77 | return sum(args) & 0xffffffff
78 |
79 |
80 | def rol(s, x):
81 | assert(s < 32)
82 | return (x << s | x >> (32-s)) & 0xffffffff
86 |
87 | r = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
88 | 7, 4,13, 1,10, 6,15, 3,12, 0, 9, 5, 2,14,11, 8,
89 | 3,10,14, 4, 9,15, 8, 1, 2, 7, 0, 6,13,11, 5,12,
90 | 1, 9,11,10, 0, 8,12, 4,13, 3, 7,15,14, 5, 6, 2]
91 | rp = [ 5,14, 7, 0, 9, 2,11, 4,13, 6,15, 8, 1,10, 3,12,
92 | 6,11, 3, 7, 0,13, 5,10,14,15, 8,12, 4, 9, 1, 2,
93 | 15, 5, 1, 3, 7,14, 6, 9,11, 8,12, 2,10, 0, 4,13,
94 | 8, 6, 4, 1, 3,11,15, 0, 5,12, 2,13, 9, 7,10,14]
95 | s = [11,14,15,12, 5, 8, 7, 9,11,13,14,15, 6, 7, 9, 8,
96 | 7, 6, 8,13,11, 9, 7,15, 7,12,15, 9,11, 7,13,12,
97 | 11,13, 6, 7,14, 9,13,15,14, 8,13, 6, 5,12, 7, 5,
98 | 11,12,14,15,14,15, 9, 8, 9,14, 5, 6, 8, 6, 5,12]
99 | sp = [ 8, 9, 9,11,13,15,15, 5, 7, 7, 8,11,14,14,12, 6,
100 | 9,13,15, 7,12, 8, 9,11, 7, 7,12, 7, 6,15,13,11,
101 | 9, 7,15,11, 8, 6, 6,14,12,13, 5,14,13,13, 7, 5,
102 | 15, 5, 8,11,14,14, 6,14, 6, 9,12, 9,12, 5,15, 8]
103 |
104 |
105 | def ripemd128(message):
106 | h0 = 0x67452301
107 | h1 = 0xefcdab89
108 | h2 = 0x98badcfe
109 | h3 = 0x10325476
110 | X = padandsplit(message)
111 | for i in range(len(X)):
112 | (A,B,C,D) = (h0,h1,h2,h3)
113 | (Ap,Bp,Cp,Dp) = (h0,h1,h2,h3)
114 | for j in range(64):
115 | T = rol(s[j], add(A, f(j,B,C,D), X[i][r[j]], K(j)))
116 | (A,D,C,B) = (D,C,B,T)
117 | T = rol(sp[j], add(Ap, f(63-j,Bp,Cp,Dp), X[i][rp[j]], Kp(j)))
118 | (Ap,Dp,Cp,Bp)=(Dp,Cp,Bp,T)
119 | T = add(h1,C,Dp)
120 | h1 = add(h2,D,Ap)
121 | h2 = add(h3,A,Bp)
122 | h3 = add(h0,B,Cp)
123 | h0 = T
124 |
125 |
126 | return struct.pack("<4L", h0, h1, h2, h3)
127 |
--------------------------------------------------------------------------------
/xiaozhan.py:
--------------------------------------------------------------------------------
32 | if len(lexicon.split()) > 1:
33 | lexiconType = "Phrase"
34 | else:
35 | lexiconType = "Word"
36 | result = {"Lexicon": lexicon, "type": lexiconType}
37 | for k in self.items:
38 | try:
39 | if k == "PhoneticSymbols":
40 | result[k] = getattr(self, "get_" + k)(lexicon)
41 | elif k == "Derivatives":
42 | word_id = html.xpath("//body")[0].attrib["data-word_id"]
43 | result[k] = getattr(self, "get_" + k)(word_id, lexicon)
44 | else:
45 | result[k] = getattr(self, "get_" + k)(html)
46 | except Exception as e:
47 | print("Error: {}, {}".format(lexicon, repr(e)))
48 | # save the lexicon info
49 | if lexicon not in result["Inflections"].values():
50 | isSave = False
51 | for k in self.items:
52 | if k in result.keys() and result[k]:
53 | isSave = True
54 | break
55 | if isSave:
56 | self.save_infos(lexicon, result)
57 | else:
58 | print("Warning: {} in Inflections: {}".format(lexicon, result["Inflections"]))
59 | return result
60 |
61 | def get_phonetic_symbol(self, html):
62 | """
63 | Extract phonetic symbols.
64 | """
65 |
66 | ps = html.xpath("//div[@class='cssVocWordVideo jsControlAudio']/span")
67 | outs = []
68 | if len(ps) >= 2:
69 | country = ps[0].text
70 | name = "/" + ps[1].text[1:-1] + "/"
71 | out = {"country": self.countryCh2En[country], "audioUrl": "", "name": name, "source": self.source}
72 | outs.append(out)
73 | return outs
74 |
75 | def get_PhoneticSymbols(self, lexicon):
76 | """
77 | Extract US/UK phonetic symbols.
78 | """
79 | outs = []
80 | # IELTS
81 | url = self.url % ("ielts", lexicon)
82 | html = etree.parse(url, etree.HTMLParser(encoding="utf-8"))
83 | outs += self.get_phonetic_symbol(html)
84 |
85 | # TOEFL
86 | url = self.url % ("toefl", lexicon)
87 | html = etree.parse(url, etree.HTMLParser(encoding="utf-8"))
88 | outs += self.get_phonetic_symbol(html)
89 |
90 | return outs
91 |
92 | def get_ParaPhrases(self, html, name="toefl"):
93 | """
94 | Extract senses and example sentences.
95 | """
96 | ps = html.xpath("//li[@class='cssVocCont jsVocCont active']/ul/li")
97 | paraphrases = []
98 | for p in ps:
99 | try:
100 | # get the part of speech and the sense text
101 | paras = p.xpath("./div/p[@class='cssVocTotoleChinese']/text()")[0]
102 | para = paras.split(".", 1)
103 | if len(para) != 2:
104 | continue
105 | pos = para[0] + "."
106 | parapch = para[1].strip()
107 |
108 | # get concise example sentences
109 | sentInfos = p.xpath("./div/div/div[1]/descendant::p[@class='cssVocExEnglish']")
110 | jianmingSentEns = [i.xpath('string(.)').strip() for i in sentInfos]
111 | sentInfos = p.xpath("./div/div/div[1]/descendant::p[@class='cssVocExChinese']")
112 | jianmingSentChs = [i.xpath('string(.)').strip() for i in sentInfos]
113 |
114 | # get scenario example sentences
115 | sentInfos = p.xpath("./div/div/div[2]/descendant::p[@class='cssVocExEnglish']")
116 | sceneSentEns = [i.xpath('string(.)').strip() for i in sentInfos]
117 | sentInfos = p.xpath("./div/div/div[2]/descendant::p[@class='cssVocExChinese']")
118 | sceneSentChs = [i.xpath('string(.)').strip() for i in sentInfos]
119 |
120 | # get TOEFL exam example sentences
121 | sentInfos = p.xpath("./div/div/div[3]/descendant::p[@class='cssVocExEnglish']")
122 | toeflSentEns = [i.xpath('string(.)').strip() for i in sentInfos]
123 | sentInfos = p.xpath("./div/div/div[3]/descendant::p[@class='cssVocExChinese']")
124 | toeflSentChs = [i.xpath('string(.)').strip() for i in sentInfos]
125 |
126 | # assemble the example sentences
127 | sentences = []
128 | if len(jianmingSentEns) == len(jianmingSentChs):
129 | sentences += [{"english": e, "chinese": c, "source": self.source + "-jianming", "audioUrlUS": "",
130 | "audioUrlUK": ""}
131 | for e, c in zip(jianmingSentEns, jianmingSentChs)]
132 | if len(sceneSentEns) == len(sceneSentChs):
133 | sentences += [{"english": e, "chinese": c, "source": self.source + "-scene", "audioUrlUS": "",
134 | "audioUrlUK": ""}
135 | for e, c in zip(sceneSentEns, sceneSentChs)]
136 | if len(toeflSentEns) == len(toeflSentChs):
137 | sentences += [{"english": e, "chinese": c, "source": self.source + "-" + name, "audioUrlUS": "",
138 | "audioUrlUK": ""}
139 | for e, c in zip(toeflSentEns, toeflSentChs)]
140 | paraphrase = {"pos": pos, "english": "", "chinese": parapch, "Sentences": sentences,
141 | "source": self.source}
142 |
143 | paraphrases.append(paraphrase)
144 | except Exception as e:
145 | print("Error: {}".format(repr(e)))
146 | pass
147 | return paraphrases
148 |
149 | def get_Inflections(self, html):
150 | """
151 | Extract inflected forms.
152 | """
153 | words = html.xpath("//ul[@class='cssVocForMatVaried']/li/text()")
154 | names = html.xpath("//ul[@class='cssVocForMatVaried']/li/span/text()")
155 | assert len(words) == len(names)
156 | out = {}
157 | for w, n in zip(words, names):
158 | out[n] = w.strip()
159 | return out
160 |
161 | def get_fixed_collocations(self, html):
162 | """
163 | Extract fixed collocations.
164 | """
165 | result = html.xpath("//li[@class='cssVocContTwo jsVocContTwo active']/ul/li")
166 | outs = []
167 | for r in result:
168 | collection = r.xpath("./div/p[@class='cssVocTotoleChinese']/text()")[0].strip()
169 | ch = r.xpath("./div/p[@class='cssVocTotoleEng']/text()")[0].strip()
170 | outs.append({"name": collection, "chinese": ch, "source": self.source})
171 | return outs
172 |
173 | def get_idiomatic_usage(self, html):
174 | """
175 | Extract idiomatic usages.
176 | """
177 | result = html.xpath("//ul[@class='cssVocContTogole jsVocContTogole']")
178 | outs = []
179 | for r in result:
180 | usage = r.xpath("./div/p[@class='cssVocTotoleChinese']/text()")
181 | ch = r.xpath("./div/p[@class='cssVocTotoleEng']/text()")[0].strip()
182 | outs.append({"idiomatic_usage": usage, "chinese": ch, "source": self.source})
183 | return outs
184 |
185 | def get_Collocations(self, html):
186 | """
187 | Extract collocations.
188 | """
189 | # fixed collocations
190 | outs = self.get_fixed_collocations(html)
191 | # idiomatic usages
192 | # outs += self.get_idiomatic_usage(html)
193 | return outs
194 |
195 | def get_Derivatives(self, id, lexicon):
196 | """
197 | Extract derivative words.
198 | """
199 | outs = []
200 | html = etree.parse(f"http://top.zhan.com/vocab/detail/one-2-ten.html?test_type=2&word_id={id}&word={lexicon}",
201 | etree.HTMLParser(encoding="utf-8"))
202 | ens = html.xpath("//p[@class='cssDeriWordsBoxId']/text()")
203 | chs = html.xpath("//ul[@class='cssDeriWordsBoxType']/li/text()")
204 | assert len(ens) == len(chs)
205 | for en, ch in zip(ens, chs):
206 | para = ch.split(".", 1)
207 | if len(para) != 2:
208 | continue
209 | pos = para[0].strip() + "."
210 | parapch = para[1].strip()
211 | outs.append({"Lexicon": en.strip(), "chinese": parapch, "pos": pos, "source": self.source})
212 | return outs
213 |
214 | def save_infos(self, lexicon, infos):
215 | """
216 | Save the lexicon info to a JSON file.
217 | """
218 | with open(self.dictPath + str(lexicon), "w", encoding="utf-8") as f:
219 | json.dump(infos, f, ensure_ascii=False)
220 |
221 | def read_infos(self, lexicon):
222 | """
223 | Read the lexicon info back from its JSON file.
224 | """
225 | with open(self.dictPath + str(lexicon), "r", encoding="utf-8") as f:
226 | return json.load(f)
227 |
228 |
229 | if __name__ == "__main__":
230 | c = XiaozhanCrawler()
231 | import pandas as pd
232 | from multiprocessing import Pool
233 |
234 | c.get_infos("taste")
235 |
236 | # words = pd.read_csv(r"D:\work\database\单词-缺失美英音标-汇总2-sorted.csv", header=None)
237 | # print("words shape={}".format(words.shape))
238 | #
239 | # p = Pool(1)
240 | # words = words[0].values
241 | # for i in range(len(words)):
242 | # p.apply(c.get_infos, (words[i],))
243 |
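244 | # Note on the commented loop above: Pool.apply blocks until each call returns,
245 | # so it would crawl serially. A sketch that actually fans out (assumed word
246 | # list; network access required):
247 | #
248 | #     with Pool(4) as p:
249 | #         p.map(c.get_infos, ["taste", "sorb", "weird"])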
--------------------------------------------------------------------------------