├── MyEnglish ├── README.md ├── english1.txt ├── english2.txt ├── text split.ipynb └── 使用说明.txt /MyEnglish: -------------------------------------------------------------------------------- 1 | i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't the of and to a in that is it was for on be with I he as by you at are this not have had his from but they which or an were her she we there been their one has will n't can all would do if more when who what so about up its some into them could no him said time only two out other then my may like over also new these your any me now did people first very after should just than most where made between back way our many years work much know being those down how before see through such make get because us three year own good think still well must right even go both too last take used er government use however off under same Mr does 're [=are world here man day might got say going life come against another while need again each old never part long thought little since number put house end different found home place within local children system want great without set left small few came something case around look always group went percent party company second given give find four important next information until point women high later public often why every national men things far fact took really further help head seen quite British form business school possible away area less London taken general water large family five early whether hand told best money face already looked having power young better night days country change asked side called says become times enough mean support done service together whole himself John members nothing control market able council room major eyes though thing act open court problem week others towards available working war report law interest held following problems research making round full felt either yes policy level question six education half known show police once mind body main clear Britain period services voice person above tell saw name minister care keep anything office feel past ever road health itself child mother months across am areas book society words upon car themselves therefore likely looking economic turned million probably began black kind view white community seemed England among doing provide father today centre result south city English study future door present became behind one people may new make say text use study get many world even much way test high job could like go take come help show child need well also find two day best today us thing good keep three know see example bad idea old start woman big number u put end c dr 
gdp far true early phone house four five six seven eight nine ten men book add b ii friend ig look age return top mind loss week ago money follow lot try open dnt usps play csr ad easy th kea tell else third story uk john paid born dna next second user mail thus ceo stay fast e player soccer cut nh do key slow type run feel face web baby tsa let seze buy anyone food send rsc white away co head car love fish net self win vote stop kid own hope worry m st tmt city cry post iasb vos able error welcome red hour lane full star mr green l eat n pass wear jump ftc ill sex sport laugh oecd myers whole advice hawaii culture home f double girl halo owe vaux feed sit h hoffa mouth upon output length fly cup youth bos de waal ai fell card quo unable pre clear yes bag pas fan hair happily care maybe twice san juan east orin kerr andy skip v iucn arc club safe nrc gyo rgyi eye fat sad pen teeth reply cute bother sunday box imago man size sale life law data finn table a b c d e f g h i j k l m n o p q r s t u v w x y z vip abc taxi adam freud cleaner ok nice anne gaap fun air month non iq every problem young read back set call low cost pay search america paragraph 2 | change 3 | less 4 | accord 5 | student 6 | american 7 | become 8 | learn 9 | answer 10 | family 11 | human 12 | give -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # kaoyan-words 2 | Word extraction 3 | https://pan.baidu.com/s/1hFUNHwYNfIU5GuTENdg46g extraction code: UPPc 4 | For more details, please follow:
5 | ![qrcode_for_gh_7c5b4ccc7292_258.jpg](https://i.loli.net/2019/06/22/5d0e0259b656249627.jpg) 6 | -------------------------------------------------------------------------------- /text split.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import sys,re,collections,nltk\n", 10 | "from nltk.corpus import wordnet\n", 11 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 12 | "from nltk.tokenize import word_tokenize\n", 13 | "from nltk.corpus import stopwords\n", 14 | "stw = stopwords.words('MyEnglish')" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# Regex that replaces anything other than letters, spaces and apostrophes (double quotes, periods, commas, etc.) with a space\n", 24 | "pat_letter = re.compile(r'[^a-zA-Z \\']+')\n", 25 | "# Patterns for expanding common English contractions\n", 26 | "pat_is = re.compile(\"(it|he|she|that|this|there|here)(\\'s)\", re.I)\n", 27 | "pat_s = re.compile(\"(?<=[a-zA-Z])\\'s\") # possessive 's after a letter\n", 28 | "pat_s2 = re.compile(\"(?<=s)\\'s?\") # ' or 's after a word ending in s\n", 29 | "pat_not = re.compile(\"(?<=[a-zA-Z])n\\'t\") # contraction of not\n", 30 | "pat_would = re.compile(\"(?<=[a-zA-Z])\\'d\") # contraction of would\n", 31 | "pat_will = re.compile(\"(?<=[a-zA-Z])\\'ll\") # contraction of will\n", 32 | "pat_am = re.compile(\"(?<=[I|i])\\'m\") # contraction of am\n", 33 | "pat_are = re.compile(\"(?<=[a-zA-Z])\\'re\") # contraction of are\n", 34 | "pat_ve = re.compile(\"(?<=[a-zA-Z])\\'ve\") # contraction of have\n", 35 | "\n", 36 | "lmtzr = WordNetLemmatizer()" 37 | ] 38 | },
39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "def replace_abbreviations(text):\n", 46 | "    new_text = text\n", 47 | "    new_text = pat_letter.sub(' ', text).strip().lower()\n", 48 | "    new_text = pat_is.sub(r\"\\1 is\", new_text)\n", 49 | "    new_text = pat_s.sub(\"\", new_text)\n", 50 | "    new_text = pat_s2.sub(\"\", new_text)\n", 51 | "    new_text = pat_not.sub(\" not\", new_text)\n", 52 | "    new_text = pat_would.sub(\" would\", new_text)\n", 53 | "    new_text = pat_will.sub(\" will\", new_text)\n", 54 | "    new_text = pat_am.sub(\" am\", new_text)\n", 55 | "    new_text = pat_are.sub(\" are\", new_text)\n", 56 | "    new_text = pat_ve.sub(\" have\", new_text)\n", 57 | "    new_text = new_text.replace('\\'', ' ')\n", 58 | "    return new_text" 59 | ] 60 | },
61 | { 62 | "cell_type": "code", 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "# Map a Penn Treebank POS tag to the corresponding WordNet part of speech\n", 68 | "def get_wordnet_pos(treebank_tag):\n", 69 | "    if treebank_tag.startswith('J'):\n", 70 | "        return nltk.corpus.wordnet.ADJ\n", 71 | "    elif treebank_tag.startswith('V'):\n", 72 | "        return nltk.corpus.wordnet.VERB\n", 73 | "    elif treebank_tag.startswith('N'):\n", 74 | "        return nltk.corpus.wordnet.NOUN\n", 75 | "    elif treebank_tag.startswith('R'):\n", 76 | "        return nltk.corpus.wordnet.ADV\n", 77 | "    else:\n", 78 | "        return ''\n", 79 | "\n", 80 | "def merge(words):\n", 81 | "    new_words = []\n", 82 | "    for word in words:\n", 83 | "        if word not in stw and wordnet.synsets(word):\n", 84 | "            tag = nltk.pos_tag(word_tokenize(word)) # tag is like [('bigger', 'JJR')]\n", 85 | "            pos = get_wordnet_pos(tag[0][1])\n", 86 | "            if pos:\n", 87 | "                # lemmatize() reduces word to its base form for the given part of speech\n", 88 | "                lemmatized_word = lmtzr.lemmatize(word, pos)\n", 89 | "                if lemmatized_word not in stw and wordnet.synsets(lemmatized_word):\n", 90 | "                    new_words.append(lemmatized_word)\n", 91 | "            else:\n", 92 | "                new_words.append(word)\n", 93 | "    return new_words" 94 | ] 95 | },
96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "def get_words(file):\n", 103 | "    with open(file) as f:\n", 104 | "        words_box=[]\n", 105 | "        # pat = re.compile(r'[^a-zA-Z \\']+') # filter special symbols\n", 106 | "        for line in f:\n", 107 | "            words_box.extend(merge(replace_abbreviations(line).split()))\n", 108 | "    return collections.Counter(words_box) # return each word and its frequency\n", 109 | "\n", 110 | "\n", 111 | "# Write the word-frequency statistics to a file\n", 112 | "def write_to_file(words, file=\"english2.csv\"):\n", 113 | "    f = open(file, 'w')\n", 114 | "    for item in words:\n", 115 | "        for field in item:\n", 116 | "            f.write(str(field)+',')\n", 117 | "        f.write('\\n')" 118 | ] 119 | },
120 | { 121 | "cell_type": "code", 122 | "execution_count": 6, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "counting...\n" 130 | ] 131 | } 132 | ], 133 | "source": [ 134 | "if __name__=='__main__':\n", 135 | "    print (\"counting...\")\n", 136 | "    words = get_words(\"english2.txt\")\n", 137 | "    write_to_file((words.most_common()))" 138 | ] 139 | },
140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "# Data analysis" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 8, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "import numpy as np\n", 154 | "import pandas as pd" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 11, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "res = pd.read_csv('result.csv')" 164 | ] 165 | },
166 | { 167 | "cell_type": "code", 168 | "execution_count": 14, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "2013" 175 | ] 176 | }, 177 | "execution_count": 14, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "len(res[res['次数']==1])" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 15, 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "777" 195 | ] 196 | }, 197 | "execution_count": 15, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "len(res[res['次数']==2])" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 16, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "412" 215 | ] 216 | }, 217 | "execution_count": 16, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "len(res[res['次数']==3])" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 20, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "data": { 233 | "text/plain": [ 234 | "1289" 235 | ] 236 | }, 237 | "execution_count": 20, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "len(res[res['次数']>=4])" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 21, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "data": { 253 | "text/plain": [ 254 | "4491" 255 | ] 256 | }, 257 | "execution_count": 21, 258 | "metadata": {}, 259 | "output_type": "execute_result" 260 | } 261 | ], 262 | "source": [ 263 | "len(res)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | 
"outputs": [], 271 | "source": [] 272 | } 273 | ], 274 | "metadata": { 275 | "kernelspec": { 276 | "display_name": "Python 3", 277 | "language": "python", 278 | "name": "python3" 279 | }, 280 | "language_info": { 281 | "codemirror_mode": { 282 | "name": "ipython", 283 | "version": 3 284 | }, 285 | "file_extension": ".py", 286 | "mimetype": "text/x-python", 287 | "name": "python", 288 | "nbconvert_exporter": "python", 289 | "pygments_lexer": "ipython3", 290 | "version": "3.7.1" 291 | } 292 | }, 293 | "nbformat": 4, 294 | "nbformat_minor": 2 295 | } 296 | -------------------------------------------------------------------------------- /使用说明.txt: -------------------------------------------------------------------------------- 1 | 1. 使用pip install nltk命令安装NLTK库 2 | 2. 在python中执行 3 | import nltk 4 | nltk.download() 5 | 在弹出的用户界面中勾选all 然后download 6 | 3. 把MyEnglish文件夹放在/…/nltk_data/corpora/stopwords/ 下 7 | 4. 运行代码 8 | 9 | 注意:MyEnglish是一些过于简单的初级词汇,只要考研文章出出现里面的单词都会被排除,如run,apple等,你也可以打开然后自定义添加一些不想要背的单词,一个单词占一行 10 | --------------------------------------------------------------------------------