├── MyEnglish ├── README.md ├── english1.txt ├── english2.txt ├── text split.ipynb └── 使用说明.txt /MyEnglish: -------------------------------------------------------------------------------- 1 | i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't the of and to a in that is it was for on be with I he as by you at are this not have had his from but they which or an were her she we there been their one has will n't can all would do if more when who what so about up its some into them could no him said time only two out other then my may like over also new these your any me now did people first very after should just than most where made between back way our many years work much know being those down how before see through such make get because us three year own good think still well must right even go both too last take used er government use however off under same Mr does 're [=are world here man day might got say going life come against another while need again each old never part long thought little since number put house end different found home place within local children system want great without set left small few came something case around look always group went percent party company second given give find four important next information until point women high later public often why every national men things far fact took really further help head seen quite British form business school possible away area less London taken general water large family five early whether hand told best money face already looked having power young better night days country change asked side called says become times enough mean support done service together whole himself John members nothing control market able council room major eyes though thing act open court problem week others towards available working war report law interest held following problems research making round full felt either yes policy level question six education half known show police once mind body main clear Britain period services voice person above tell saw name minister care keep anything office feel past ever road health itself child mother months across am areas book society words upon car themselves therefore likely looking economic turned million probably began black kind view white community seemed England among doing provide father today centre result south city English study future door present became behind one people may new make say text use study get many world even much way test high job could like go take come help show child need well also find two day best today us thing good keep three know see example bad idea old start woman big number u put end c dr 
gdp far true early phone house four five six seven eight nine ten men book add b ii friend ig look age return top mind loss week ago money follow lot try open dnt usps play csr ad easy th kea tell else third story uk john paid born dna next second user mail thus ceo stay fast e player soccer cut nh do key slow type run feel face web baby tsa let seze buy anyone food send rsc white away co head car love fish net self win vote stop kid own hope worry m st tmt city cry post iasb vos able error welcome red hour lane full star mr green l eat n pass wear jump ftc ill sex sport laugh oecd myers whole advice hawaii culture home f double girl halo owe vaux feed sit h hoffa mouth upon output length fly cup youth bos de waal ai fell card quo unable pre clear yes bag pas fan hair happily care maybe twice san juan east orin kerr andy skip v iucn arc club safe nrc gyo rgyi eye fat sad pen teeth reply cute bother sunday box imago man size sale life law data finn table a b c d e f g h i j k l m n o p q r s t u v w x y z vip abc taxi adam freud cleaner ok nice anne gaap fun air month non iq every problem young read back set call low cost pay search america paragraph 2 | change 3 | less 4 | accord 5 | student 6 | american 7 | become 8 | learn 9 | answer 10 | family 11 | human 12 | give -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # kaoyan-words 2 | Word extraction 3 | https://pan.baidu.com/s/1hFUNHwYNfIU5GuTENdg46g extraction code: UPPc 4 | For more details, please follow:
5 | ![qrcode_for_gh_7c5b4ccc7292_258.jpg](https://i.loli.net/2019/06/22/5d0e0259b656249627.jpg) 6 | -------------------------------------------------------------------------------- /text split.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import sys,re,collections,nltk\n", 10 | "from nltk.corpus import wordnet\n", 11 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 12 | "from nltk.tokenize import word_tokenize\n", 13 | "from nltk.corpus import stopwords\n", 14 | "stw = stopwords.words('MyEnglish')" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# Regex that replaces anything other than letters, spaces and apostrophes (double quotes, periods, commas, etc.) with a space\n", 24 | "pat_letter = re.compile(r'[^a-zA-Z \\']+')\n", 25 | "# Patterns for expanding common English contractions\n", 26 | "pat_is = re.compile(\"(it|he|she|that|this|there|here)(\\'s)\", re.I)\n", 27 | "pat_s = re.compile(\"(?<=[a-zA-Z])\\'s\") # possessive 's after a letter\n", 28 | "pat_s2 = re.compile(\"(?<=s)\\'s?\") # ' or 's after a word ending in s\n", 29 | "pat_not = re.compile(\"(?<=[a-zA-Z])n\\'t\") # contraction of not\n", 30 | "pat_would = re.compile(\"(?<=[a-zA-Z])\\'d\") # contraction of would\n", 31 | "pat_will = re.compile(\"(?<=[a-zA-Z])\\'ll\") # contraction of will\n", 32 | "pat_am = re.compile(\"(?<=[I|i])\\'m\") # contraction of am\n", 33 | "pat_are = re.compile(\"(?<=[a-zA-Z])\\'re\") # contraction of are\n", 34 | "pat_ve = re.compile(\"(?<=[a-zA-Z])\\'ve\") # contraction of have\n", 35 | "\n", 36 | "lmtzr = WordNetLemmatizer()" 37 | ] 38 | },
39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "def replace_abbreviations(text):\n", 46 | "    new_text = text\n", 47 | "    new_text = pat_letter.sub(' ', text).strip().lower()\n", 48 | "    new_text = pat_is.sub(r\"\\1 is\", new_text)\n", 49 | "    new_text = pat_s.sub(\"\", new_text)\n", 50 | "    new_text = pat_s2.sub(\"\", new_text)\n", 51 | "    new_text = pat_not.sub(\" not\", new_text)\n", 52 | "    new_text = pat_would.sub(\" would\", new_text)\n", 53 | "    new_text = pat_will.sub(\" will\", new_text)\n", 54 | "    new_text = pat_am.sub(\" am\", new_text)\n", 55 | "    new_text = pat_are.sub(\" are\", new_text)\n", 56 | "    new_text = pat_ve.sub(\" have\", new_text)\n", 57 | "    new_text = new_text.replace('\\'', ' ')\n", 58 | "    return new_text" 59 | ] 60 | },
61 | { 62 | "cell_type": "code", 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "# Map a Penn Treebank POS tag to the corresponding WordNet part of speech\n", 68 | "def get_wordnet_pos(treebank_tag):\n", 69 | "    if treebank_tag.startswith('J'):\n", 70 | "        return nltk.corpus.wordnet.ADJ\n", 71 | "    elif treebank_tag.startswith('V'):\n", 72 | "        return nltk.corpus.wordnet.VERB\n", 73 | "    elif treebank_tag.startswith('N'):\n", 74 | "        return nltk.corpus.wordnet.NOUN\n", 75 | "    elif treebank_tag.startswith('R'):\n", 76 | "        return nltk.corpus.wordnet.ADV\n", 77 | "    else:\n", 78 | "        return ''\n", 79 | "\n", 80 | "def merge(words):\n", 81 | "    new_words = []\n", 82 | "    for word in words:\n", 83 | "        if word not in stw and wordnet.synsets(word):\n", 84 | "            tag = nltk.pos_tag(word_tokenize(word)) # tag is like [('bigger', 'JJR')]\n", 85 | "            pos = get_wordnet_pos(tag[0][1])\n", 86 | "            if pos:\n", 87 | "                # lemmatize() reduces word to its base form for the given part of speech\n", 88 | "                lemmatized_word = lmtzr.lemmatize(word, pos)\n", 89 | "                if lemmatized_word not in stw and wordnet.synsets(lemmatized_word):\n", 90 | "                    new_words.append(lemmatized_word)\n", 91 | "            else:\n", 92 | "                new_words.append(word)\n", 93 | "    return new_words" 94 | ] 95 | },
96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "def get_words(file):\n", 103 | "    with open(file) as f:\n", 104 | "        words_box=[]\n", 105 | "        # pat = re.compile(r'[^a-zA-Z \\']+') # filter special symbols\n", 106 | "        for line in f:\n", 107 | "            words_box.extend(merge(replace_abbreviations(line).split()))\n", 108 | "    return collections.Counter(words_box) # return each word and its frequency\n", 109 | "\n", 110 | "\n", 111 | "# Write the word-frequency statistics to a file\n", 112 | "def write_to_file(words, file=\"english2.csv\"):\n", 113 | "    f = open(file, 'w')\n", 114 | "    for item in words:\n", 115 | "        for field in item:\n", 116 | "            f.write(str(field)+',')\n", 117 | "        f.write('\\n')" 118 | ] 119 | },
120 | { 121 | "cell_type": "code", 122 | "execution_count": 6, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "counting...\n" 130 | ] 131 | } 132 | ], 133 | "source": [ 134 | "if __name__=='__main__':\n", 135 | "    print (\"counting...\")\n", 136 | "    words = get_words(\"english2.txt\")\n", 137 | "    write_to_file((words.most_common()))" 138 | ] 139 | },
140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "# Data analysis" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 8, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "import numpy as np\n", 154 | "import pandas as pd" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 11, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "res = pd.read_csv('result.csv')" 164 | ] 165 | },
166 | { 167 | "cell_type": "code", 168 | "execution_count": 14, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "2013" 175 | ] 176 | }, 177 | "execution_count": 14, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "len(res[res['次数']==1])" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 15, 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "777" 195 | ] 196 | }, 197 | "execution_count": 15, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "len(res[res['次数']==2])" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 16, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "412" 215 | ] 216 | }, 217 | "execution_count": 16, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "len(res[res['次数']==3])" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 20, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "data": { 233 | "text/plain": [ 234 | "1289" 235 | ] 236 | }, 237 | "execution_count": 20, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "len(res[res['次数']>=4])" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 21, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "data": { 253 | "text/plain": [ 254 | "4491" 255 | ] 256 | }, 257 | "execution_count": 21, 258 | "metadata": {}, 259 | "output_type": "execute_result" 260 | } 261 | ], 262 | "source": [ 263 | "len(res)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | 
"outputs": [], 271 | "source": [] 272 | } 273 | ], 274 | "metadata": { 275 | "kernelspec": { 276 | "display_name": "Python 3", 277 | "language": "python", 278 | "name": "python3" 279 | }, 280 | "language_info": { 281 | "codemirror_mode": { 282 | "name": "ipython", 283 | "version": 3 284 | }, 285 | "file_extension": ".py", 286 | "mimetype": "text/x-python", 287 | "name": "python", 288 | "nbconvert_exporter": "python", 289 | "pygments_lexer": "ipython3", 290 | "version": "3.7.1" 291 | } 292 | }, 293 | "nbformat": 4, 294 | "nbformat_minor": 2 295 | } 296 | -------------------------------------------------------------------------------- /使用说明.txt: -------------------------------------------------------------------------------- 1 | 1. 使用pip install nltk命令安装NLTK库 2 | 2. 在python中执行 3 | import nltk 4 | nltk.download() 5 | 在弹出的用户界面中勾选all 然后download 6 | 3. 把MyEnglish文件夹放在/…/nltk_data/corpora/stopwords/ 下 7 | 4. 运行代码 8 | 9 | 注意:MyEnglish是一些过于简单的初级词汇,只要考研文章出出现里面的单词都会被排除,如run,apple等,你也可以打开然后自定义添加一些不想要背的单词,一个单词占一行 10 | --------------------------------------------------------------------------------