├── .gitignore ├── README.md ├── additional ├── __init__.py └── resultdb.py ├── config ├── __init__.py └── config.json ├── dev ├── __init__.py ├── check_it_in_hs300.py └── process_it.py ├── east_sentiment ├── __init__.py ├── aggregateFactor.py ├── dailyResult.py ├── outputResult.py ├── produceFactor.py └── sendMail.py ├── example ├── ApSchedulerExample │ ├── __init__.py │ └── processpoll.py └── __init__.py ├── main.py ├── script ├── __init__.py ├── gubaEast.py ├── sina.py └── snowball.py ├── set_codes ├── HS300.txt ├── IT_code.txt ├── IT_unique.txt ├── set_IT.py └── set_hs300.py ├── test ├── Scores.xls ├── __init__.py ├── testPandas.py └── testPandasExcel.py └── tools ├── __init__.py ├── drop.py └── mongotool.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | gitignore 3 | eastSentiment/*.pyc 4 | *.pyc 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pyspider-stock 2 | 3 | **Note: This README has both a Chinese and an English version; the Chinese version comes first because the project targets the Chinese stock market.** 4 | 5 | *Update*: 6 | 7 | - [x] 增加[IT][1]版块股票的抓取和分析 8 | 9 | 10 | ## 这个项目做什么? 11 | 这个项目使用[pyspider][2]抓取[东方财富网股吧][3]、[雪球网][4]、[新浪股吧][5]的帖子,然后使用自然语言处理(情感分析)的方式分析舆论 12 | 13 | 所以 14 | 15 | 它有两个部分 16 | 17 | 1. 抓取帖子 18 | 2. 情感分析 19 | 20 | ## 如何运行它? 21 | 22 | ### 第一步 抓取帖子 23 | 24 | * 下载[pyspider][6],[mongoDB][7],[redis][8],[snowNLP][9],[pymongo(2.9)][10]及相应的依赖库 25 | * 运行`set_codes/set_hs300.py`和`set_IT.py`(为了将HS300成份股的股票代码装入mongoDB,后者的目的是放入IT股票的代码) 26 | * 然后,将`resultdb.py`放入pyspider的`database/mongodb`目录下(为了将爬取到的数据放入mongoDB),pyspider路径使用`pip show pyspider`命令 27 | * 启动`redis` 28 | * 然后,在有`config.json`的目录下,**command line** 运行`pyspider -c config.json all &` 29 | * 其次,将script里的脚本复制后,粘贴到localhost:5000下你自己的工程里(想要爬取哪个网站就粘贴哪个script),保存 30 | * 最后在网页localhost:5000里单击run 31 | 32 | 在早上开盘前执行完最后两步即可在每天早上开盘前获取到HS300昨日的舆论数据 33 | 34 | ### 第二步 情感分析 35 | 在完成第一步30分钟后即可执行该步骤 36 | 37 | 第一次运行时在和`main.py`同目录下新建目录`data` 38 | 39 | 运行 `main.py`即可 40 | 41 | 42 | ## 发生了什么? 43 | 44 | 默认使用`gubaEast.py`抓取东方财富网下的**股友汇**版块,因为它最稳定 45 | 46 | 执行完第一步后,你会在名为`[stockcode]eastmoney`的database下发现`[date]GuYouHui`的collection,其中`[stockcode]`和`[date]`分别是HS300成份股的股票代码和昨天的日期 47 | 48 | 接着是情感分析部分 49 | 50 | 核心是3段代码: 51 | 52 | produceFactor.getSentimentFactor(stockCode, grab_time) 53 | 用于获得抓取日期的特定股票帖子的情感因子和情感值(由情感因子乘以阅读量获得) 54 | 55 | aggregateFactor.aggregate(stockCode, grab_time) 56 | 用于获得抓取日期的特定股票的情感值(由所有帖子情感值相加得到),结果保存在`[stockcode]eastmoney`下的`[date]SentimentFactor`中 57 | 58 | dailyResult.setDailyResult(stockCode, grab_time) 59 | 用于汇总抓取日期的所有HS300股票的情感值和帖子数,结果在`[date]`database的`DailyResult`collection下 60 | 61 | 而后结果会以excel的格式保存在`data`目录下 62 | 63 | 结果会以邮件形式发给你指定的人,通过`sendMail`模块 64 | 65 | 最后taskdb里面这个任务会被清除,以便明天增量抓取。同时会将5天前数据库中的数据导出,存在本地,并删除数据库中的数据 66 | 67 | 如果想用[app][11]在android端查看结果,就保留 68 | 69 | os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html') 70 | 71 | 72 | *English version* 73 | 74 | ## What's the aim of this project? 75 | This project uses [pyspider][12] to crawl posts from [eastmoney][13], [xueqiu][14] and [sinaguba][15], then applies NLP (sentiment analysis) to gauge public sentiment as an aid to stock selection. 76 | 77 | So 78 | 79 | It has two parts: 80 | 81 | 1. Crawl posts 82 | 2. Sentiment analysis 83 | 84 | ## How to run this project? 
85 | 86 | ### Step 1 Crawl posts 87 | 88 | * Install [pyspider][16], [mongoDB][17], [redis][18], [snowNLP][19], [pymongo(2.9)][10] and the other dependencies 89 | * Run `set_codes/set_hs300.py` and `set_codes/set_IT.py` (the former loads the HS300 constituent codes into mongoDB, the latter the IT-sector codes) 90 | * Put `resultdb.py` into the `database/mongodb` directory of **pyspider**, so that the crawled data is saved to mongoDB (find the pyspider path with `pip show pyspider`) 91 | * Start `redis` 92 | * From the **command line**, run `pyspider -c config.json all &` in the directory that contains `config.json` 93 | * Copy the script you need from the `script` folder (one per site you want to crawl), paste it into your own project at localhost:5000, and save 94 | * Click the run button at localhost:5000 95 | 96 | Complete the last two steps before the market opens and you will get the previous day's sentiment data every morning. 97 | 98 | ### Step 2 Sentiment analysis 99 | 100 | Run `main.py` about 30 minutes after the posts have been crawled and stored; before the first run, create a `data` directory next to `main.py`. 101 | 102 | 103 | ## What happened? 104 | 105 | By default `gubaEast.py` crawls the **GuYouHui** section of eastmoney, because it is the most stable one. 106 | 107 | After Step 1 finishes, you'll find a collection named `[date]GuYouHui` in a database called `[stockcode]eastmoney`, where `[stockcode]` is an HS300 constituent's symbol and `[date]` is yesterday's date. 108 | 109 | The second part is sentiment analysis. 110 | 111 | The core consists of three calls: 112 | 113 | produceFactor.getSentimentFactor(stockCode, grab_time) 114 | Computes a sentiment value and a sentiment factor for every **post** of a given symbol on the crawl date (the sentiment value comes from snowNLP; the sentiment factor is the sentiment value multiplied by the post's read count; see the sketch further below) 115 | 116 | aggregateFactor.aggregate(stockCode, grab_time) 117 | Aggregates the sentiment factor of a given symbol on the crawl date by summing the factors of all its posts that day; the result is stored in `[date]SentimentFactor` under `[stockcode]eastmoney` 118 | 119 | dailyResult.setDailyResult(stockCode, grab_time) 120 | Collects the sentiment factor and the number of posts of every stock for the crawl date; the result is stored in the `DailyResult` collection of the `[date]` database. 121 | 122 | The final result is then saved as an Excel file in the `data` directory. 123 | 124 | The result is mailed to the specified recipients through the `sendMail` module. 125 | 126 | The task in `taskdb` is then cleared so that the next day's crawl can run. Data from 5 days earlier is also dumped to a local backup and then deleted from mongoDB. 
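For reference, here is a minimal sketch of how `produceFactor.getSentimentFactor` derives a single post's sentiment factor (the sample post dict is made up for illustration; posts with empty text get a sentiment and factor of 0, as in `produceFactor.py`):

    # coding: utf-8
    from snownlp import SnowNLP

    # a made-up post, shaped like the documents stored in a [date]GuYouHui collection
    post = {'text': u'业绩超预期,继续看好', 'read': '1523'}

    sentiment = SnowNLP(post['text']).sentiments       # float in [0, 1]; above 0.5 leans positive
    sentiment_factor = sentiment * int(post['read'])   # weight the sentiment by the post's read count
    print('sentiment=%.3f, factor=%.1f' % (sentiment, sentiment_factor))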
127 | 128 | If you want to use the [app][20] to check the result on android, keep the following code 129 | 130 | os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html') 131 | 132 | 133 | [1]: http://quote.eastmoney.com/center/list.html#28002737_0_2 134 | [2]: http://docs.pyspider.org/en/latest/ 135 | [3]: http://guba.eastmoney.com/ 136 | [4]: https://xueqiu.com/ 137 | [5]: http://guba.sina.com.cn/ 138 | [6]: http://docs.pyspider.org/en/latest/ 139 | [7]: https://www.mongodb.com/ 140 | [8]: https://redis.io/ 141 | [9]: https://github.com/isnowfy/snownlp 142 | [10]: http://api.mongodb.com/python/current/installation.html 143 | [11]: https://github.com/ryh95/huaxiApp 144 | [12]: http://docs.pyspider.org/en/latest/ 145 | [13]: http://guba.eastmoney.com/ 146 | [14]: https://xueqiu.com/ 147 | [15]: http://guba.sina.com.cn/ 148 | [16]: http://docs.pyspider.org/en/latest/ 149 | [17]: https://www.mongodb.com/ 150 | [18]: https://redis.io/ 151 | [19]: https://github.com/isnowfy/snownlp 152 | [20]: https://github.com/ryh95/huaxiApp 153 | -------------------------------------------------------------------------------- /additional/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/additional/__init__.py -------------------------------------------------------------------------------- /additional/resultdb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | # vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8: 4 | # Author: Binux 5 | # http://binux.me 6 | # Created on 2014-10-13 22:18:36 7 | 8 | import json 9 | import logging 10 | import time 11 | import re 12 | 13 | import datetime 14 | from pymongo import MongoClient 15 | from pyspider.database.base.resultdb import ResultDB as BaseResultDB 16 | from .mongodbbase import SplitTableMixin 17 | logger = logging.getLogger("result") 18 | 19 | class ResultDB(SplitTableMixin, BaseResultDB): 20 | collection_prefix = '' 21 | 22 | def __init__(self, url, database='resultdb'): 23 | self.conn = MongoClient(url) 24 | self.database = self.conn[database] 25 | self.projects = set() 26 | 27 | self._list_project() 28 | 29 | def _parse(self, data): 30 | data['_id'] = str(data['_id']) 31 | if 'result' in data: 32 | data['result'] = json.loads(data['result']) 33 | return data 34 | 35 | def _stringify(self, data): 36 | if 'result' in data: 37 | data['result'] = json.dumps(data['result']) 38 | return data 39 | 40 | def save(self, project, taskid, url, result): 41 | 42 | #logger.info('url : %s',url) 43 | 44 | #(1) deal with guba.eastmoney.com 45 | if url.split('/')[2] == 'guba.eastmoney.com': 46 | #1.specify the database_name 47 | stockCode = url.split(',')[1] 48 | self.database = self.conn[stockCode+'eastmoney'] 49 | 50 | #2.specify the collection_name 51 | flag = result['url'].split(',')[2][0] 52 | 53 | # add time for GuYouHui 54 | now_time = datetime.datetime.now() 55 | yes_time = now_time + datetime.timedelta(days=-1) 56 | grab_time = yes_time.strftime('%m-%d') 57 | 58 | if flag == '5': 59 | collection_name = grab_time+'GuYouHui' 60 | elif flag == '1': 61 | collection_name = 'XinWen' 62 | elif flag == '2': 63 | collection_name = 'YanBao' 64 | elif flag == '3': 65 | collection_name = 'GongGao' 66 | else: 67 | collection_name = self._collection_name(project) 68 | 69 | #3.create the item which is going to insert to the mongoDB 70 | obj = { 71 
| # 'taskid': taskid, 72 | # 'url': url, 73 | # 'result': result, 74 | # 'updatetime': time.time(), 75 | # here has changed 76 | 'read' : result['read'], 77 | 'comment' : result['comment'], 78 | 'title' : result['title'], 79 | 'author' : result['author'], 80 | # 'date' : result['date'], 81 | 'last' : result['last'], 82 | 'text' : result['text'], 83 | 'url' : result['url'], 84 | 'create' : result['create'], 85 | 'created_at' :result['created_at'] 86 | } 87 | #(2) deal with xueqiu.com 88 | elif url.split('/')[2] == 'xueqiu.com': 89 | 90 | #1.specify the database_name 91 | stockCode = re.findall('\d{6}',url)[0] 92 | self.database = self.conn[stockCode+'xueqiu'] 93 | #2.specify the collection_name 94 | flag = re.findall('source=(.*?)&',url)[0] 95 | 96 | logger.info('flag : %s',flag) 97 | 98 | if flag == 'user': 99 | collection_name = 'TaoLun' 100 | elif flag == 'trans': 101 | collection_name = 'JiaoYi' 102 | elif flag == '%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 103 | collection_name = 'XinWen' 104 | elif flag == '%E5%85%AC%E5%91%8A': 105 | collection_name = 'GongGao' 106 | elif flag == '%E7%A0%94%E6%8A%A5': 107 | collection_name = 'YanBao' 108 | else: 109 | collection_name = self._collection_name(project) 110 | 111 | #3.create the item which is going to insert to the mongoDB 112 | if flag == '%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 113 | 114 | obj = { 115 | 'name' : result['name'], 116 | 'text' : result['text'], 117 | 'time' : long(result['time']), 118 | 'title' : result['title'], 119 | 120 | } 121 | else: 122 | obj = { 123 | 'name' : result['name'], 124 | 'text' : result['text'], 125 | 'time' : long(result['time']), 126 | # 'title' : result['title'], 127 | 128 | } 129 | # (3)deal with guba.sina.com 130 | elif url.split('/')[2] == 'guba.sina.com.cn': 131 | # 1.specify the database name 132 | stockCode = re.findall('name=(.*?)&type',result['url'])[0] 133 | self.database = self.conn[stockCode+'sina'] 134 | # 2.specify the collection name 135 | collection_name = 'TaoLun' 136 | 137 | # 3.create the item which is going to insert to the mongoDB 138 | obj = { 139 | 'author' : result['author'], 140 | 'comment' : result['comment'], 141 | 'read' : long(result['read']), 142 | 'title' : result['title'], 143 | 'text' : result['text'], 144 | 'tid' : result['tid'], 145 | 'time' : result['time'], 146 | 'url' : result['url'] 147 | } 148 | 149 | 150 | return self.database[collection_name].update( 151 | {'taskid': taskid}, {"$set": self._stringify(obj)}, upsert=True 152 | ) 153 | 154 | def select(self, project, fields=None, offset=0, limit=0): 155 | if project not in self.projects: 156 | self._list_project() 157 | if project not in self.projects: 158 | return 159 | collection_name = self._collection_name(project) 160 | for result in self.database[collection_name].find(fields=fields, skip=offset, limit=limit): 161 | yield self._parse(result) 162 | 163 | def count(self, project): 164 | if project not in self.projects: 165 | self._list_project() 166 | if project not in self.projects: 167 | return 168 | collection_name = self._collection_name(project) 169 | return self.database[collection_name].count() 170 | 171 | def get(self, project, taskid, fields=None): 172 | if project not in self.projects: 173 | self._list_project() 174 | if project not in self.projects: 175 | return 176 | collection_name = self._collection_name(project) 177 | ret = self.database[collection_name].find_one({'taskid': taskid}, fields=fields) 178 | if not ret: 179 | return ret 180 | return self._parse(ret) 181 | 
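The `save()` method above picks the target database and collection purely from the crawled URL. A minimal sketch of that routing for an eastmoney list page (the URL below is illustrative):

    # illustrative URL in the format that gubaEast.py crawls
    url = 'http://guba.eastmoney.com/list,601001,5_1.html'

    host = url.split('/')[2]        # 'guba.eastmoney.com' -> eastmoney branch of save()
    stock_code = url.split(',')[1]  # '601001'             -> database '601001eastmoney'
    flag = url.split(',')[2][0]     # '5'                  -> collection '[date]GuYouHui'
    print('%s %s %s' % (host, stock_code, flag))

Note that `save()` reads the flag from `result['url']` rather than from the task URL, but the comma-separated format is the same.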
-------------------------------------------------------------------------------- /config/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/config/__init__.py -------------------------------------------------------------------------------- /config/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "taskdb": "mongodb+taskdb://127.0.0.1:27017/taskdb", 3 | "projectdb": "mongodb+projectdb://127.0.0.1:27017/projectdb", 4 | "resultdb": "mongodb+resultdb://127.0.0.1:27017/resultdb", 5 | "message_queue": "redis://127.0.0.1:6379/db" 6 | } 7 | -------------------------------------------------------------------------------- /dev/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/dev/__init__.py -------------------------------------------------------------------------------- /dev/check_it_in_hs300.py: -------------------------------------------------------------------------------- 1 | from pymongo import MongoClient 2 | 3 | client = MongoClient() 4 | db = client['stockcodes'] 5 | documents = db.HS300.find() 6 | # read and store hs300 stocks in 'hs300' map 7 | hs300 = {} 8 | for document in documents: 9 | hs300[document['stockcode']] = 1 10 | 11 | it_file = open('IT_code.txt','r') 12 | out = file('IT_unique.txt','a+') 13 | 14 | num = 0 15 | for line in it_file: 16 | line = line.replace('\n','') 17 | line = line.decode('utf8') 18 | if line not in hs300: 19 | print line 20 | out.write(str(line)+'\n') 21 | num+=1 22 | print num -------------------------------------------------------------------------------- /dev/process_it.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | input_file = open('IT.txt','r') 4 | output_file = file('IT.txt','a+') 5 | 6 | for line in input_file: 7 | # output_file.write(re.findall(r'\d{6}',line)[0]) 8 | # output_file.write('\n') 9 | output_file.writelines([item+' ' for item in line.split()[:3]]) 10 | output_file.write('\n') 11 | 12 | input_file.close() 13 | output_file.close() -------------------------------------------------------------------------------- /east_sentiment/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/east_sentiment/__init__.py -------------------------------------------------------------------------------- /east_sentiment/aggregateFactor.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | from pymongo import MongoClient 4 | 5 | def aggregate(stockcode, date): 6 | client = MongoClient() 7 | # 获取昨天的日期 8 | # now_time = datetime.datetime.now() 9 | # yes_time = now_time + datetime.timedelta(days=-1) 10 | # grab_time = yes_time.strftime('%m-%d') 11 | 12 | db = client[stockcode+'eastmoney'] 13 | coll = db[date+'GuYouHui'] 14 | 15 | documents = coll.aggregate( 16 | [ 17 | {"$group": {"_id": "$last_date", "sentiment_factor": {"$sum": "$sentiment_factor"}}} 18 | ] 19 | ) 20 | 21 | coll2 = db[date+'SentimentFactor'] 22 | for result in documents['result']: 23 | coll2.insert_one({ 24 | "sentiment_factor": result['sentiment_factor'], 25 | "last_date": date 26 | }) 27 | 28 | print 
date+stockcode+'SentimentFactor has been aggregated!' 29 | -------------------------------------------------------------------------------- /east_sentiment/dailyResult.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | from pymongo import MongoClient 4 | 5 | 6 | def setDailyResult(stockcode,date,section_name=''): 7 | client = MongoClient() 8 | 9 | db = client[stockcode + 'eastmoney'] 10 | # 获取昨天的日期 11 | # now_time = datetime.datetime.now() 12 | # yes_time = now_time + datetime.timedelta(days=-1) 13 | # grab_time = yes_time.strftime('%m-%d') 14 | 15 | coll = db[date + 'SentimentFactor'] 16 | 17 | cusor = coll.find({"last_date":date}) 18 | for document in cusor: 19 | sentimentFactor = document['sentiment_factor'] 20 | 21 | coll = db[date+'GuYouHui'] 22 | 23 | dailyCounts = coll.count() 24 | 25 | db = client[date+section_name] 26 | try: 27 | db.DailyResult.insert_one( 28 | { 29 | "stock_code": stockcode, 30 | "sentiment_factor": sentimentFactor, 31 | "daily_counts": dailyCounts 32 | } 33 | ) 34 | except UnboundLocalError,e: 35 | pass 36 | 37 | print date+stockcode+'has been set in daily result!' 38 | 39 | 40 | -------------------------------------------------------------------------------- /east_sentiment/outputResult.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | 4 | import pymongo 5 | import pandas as pd 6 | import xlwt 7 | from pymongo import MongoClient 8 | 9 | def getDailyResult(date,section_name=''): 10 | client = MongoClient() 11 | # now_time = datetime.datetime.now() 12 | # yes_time = now_time + datetime.timedelta(days=-1) 13 | # grab_time = yes_time.strftime('%m-%d') 14 | db = client[date+section_name] 15 | 16 | 17 | # 找到前20名 18 | threshold = 20 19 | wb = xlwt.Workbook() 20 | ws = wb.add_sheet('InputSheet') 21 | ws.write(0, 0, 'positive') 22 | ws.write(0, 1, 'value') 23 | ws.write(0, 2, 'negative') 24 | ws.write(0, 3, 'value') 25 | ws.write(0, 4, 'hottest') 26 | ws.write(0, 5, 'numPosts') 27 | ws.write(0, 6, 'positive_hottest') 28 | ws.write(0, 7, 'value') 29 | 30 | documents = db.DailyResult.find().sort([ 31 | ("sentiment_factor", pymongo.DESCENDING) 32 | ]).limit(threshold) 33 | 34 | # 写入前2列 35 | i = 0 36 | for document in documents: 37 | # print document 38 | ws.write(i + 1, 0, str(document['stock_code'])) 39 | ws.write(i + 1, 1, document['sentiment_factor']) 40 | i += 1 41 | 42 | # 写入后2列 43 | documents = db.DailyResult.find().sort([ 44 | ("sentiment_factor", pymongo.ASCENDING) 45 | ]).limit(threshold) 46 | 47 | i = 0 48 | for document in documents: 49 | ws.write(i + 1, 2, str(document['stock_code'])) 50 | ws.write(i + 1, 3, document['sentiment_factor']) 51 | i += 1 52 | 53 | # 写入最热门的股票 54 | # 写入最热门股票每天的帖子数 55 | documents = db.DailyResult.find().sort([ 56 | ("daily_counts", pymongo.DESCENDING) 57 | ]).limit(threshold) 58 | 59 | i = 0 60 | for document in documents: 61 | ws.write(i + 1, 4, str(document['stock_code'])) 62 | ws.write(i + 1, 5, str(document['daily_counts'])) 63 | i += 1 64 | 65 | 66 | # 对最热门的股票按照情感值递减排序 67 | documents = db.DailyResult.aggregate([ 68 | {"$sort" : {"daily_counts":pymongo.DESCENDING}}, 69 | {"$limit" : threshold}, 70 | {"$sort" : {"sentiment_factor":pymongo.DESCENDING}} 71 | ]) 72 | 73 | i = 0 74 | for document in documents['result']: 75 | # print document 76 | ws.write(i + 1, 6, str(document['stock_code'])) 77 | ws.write(i + 1, 7, str(document['sentiment_factor'])) 78 | i += 1 79 | 80 | # 保存 81 | 
wb.save('data/'+date+section_name + 'result' + '.xls') 82 | 83 | print date+section_name+'result.xlsx'+'has been produced!' 84 | 85 | def getDailyStockInfo(stockcode, date,type='sentiment'): 86 | client = MongoClient() 87 | db = client[date] 88 | cursor = db.DailyResult.find({"stock_code":stockcode}) 89 | sentiment = None 90 | posts = None 91 | for document in cursor: 92 | if type == 'sentiment': 93 | sentiment = document['sentiment_factor'] 94 | else: 95 | posts = document['daily_counts'] 96 | if type == 'sentiment': 97 | return sentiment 98 | else: 99 | return posts 100 | 101 | 102 | def getDailyAttachment(date): 103 | df = pd.read_excel("data/"+date+"result.xls", 104 | converters={'positive':str,'negative':str,'hottest':str}) 105 | positive_stockcodes = df['positive'].tolist() 106 | negative_stockcodes = df['negative'].tolist() 107 | hottest_stockcodes = df['hottest'].tolist() 108 | 109 | dict_pos = getResultDict(positive_stockcodes,date) 110 | dict_neg = getResultDict(negative_stockcodes,date) 111 | dict_hot = getResultDict(hottest_stockcodes,date,type='hottest') 112 | 113 | writer = pd.ExcelWriter(date + 'attachment.xls') 114 | 115 | table_pos = pd.DataFrame(dict_pos) 116 | table_neg = pd.DataFrame(dict_neg) 117 | table_hot = pd.DataFrame(dict_hot) 118 | 119 | table_pos.to_excel(writer, 'positive') 120 | table_neg.to_excel(writer, 'negative') 121 | table_hot.to_excel(writer, 'hottest') 122 | 123 | writer.save() 124 | 125 | def getResultDict(codes,date,type='sentiment'): 126 | result_dict = {'stockcodes':codes} 127 | now_time = datetime.datetime.strptime(date, '%m-%d') 128 | for i in range(20): 129 | yes_time = now_time + datetime.timedelta(days=-i) 130 | yes_time = yes_time.strftime('%m-%d') 131 | col_sentiments = [] 132 | for stockcode in codes: 133 | dailysentiment = getDailyStockInfo(stockcode, yes_time,type=type) 134 | col_sentiments.append(dailysentiment) 135 | result_dict[yes_time] = col_sentiments 136 | return result_dict -------------------------------------------------------------------------------- /east_sentiment/produceFactor.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from pymongo import * 3 | from snownlp import SnowNLP 4 | import pymongo 5 | 6 | # connecting database 601001eastmoney 7 | def getSentimentFactor(stockcode,date): 8 | client = MongoClient() 9 | 10 | # now_time = datetime.datetime.now() 11 | # yes_time = now_time + datetime.timedelta(days=-1) 12 | # grab_time = yes_time.strftime('%m-%d') 13 | 14 | db = client[stockcode+'eastmoney'] 15 | 16 | # sort the collection GuYouHui and point to it 17 | coll = db[date+'GuYouHui'] 18 | 19 | documents = coll.find().sort([ 20 | ("created_at", pymongo.DESCENDING) 21 | ]) 22 | 23 | for document in documents: 24 | 25 | create_date = document['create'][:10] 26 | last = document['last'][:5] 27 | 28 | if (document['text'] != ''): 29 | s = SnowNLP(document['text']) 30 | sentiment = s.sentiments 31 | sentiment_factor = (sentiment) * int(document['read']) 32 | else: 33 | sentiment,sentiment_factor = 0,0 34 | coll.update({"_id":document['_id']},{"$set":{"sentiment":sentiment,"sentiment_factor":sentiment_factor}}) 35 | 36 | print document['_id'] 37 | -------------------------------------------------------------------------------- /east_sentiment/sendMail.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import smtplib 3 | from email import encoders, MIMEMultipart 4 | from email.mime.base import MIMEBase 5 | from 
email.mime.text import MIMEText 6 | 7 | import datetime 8 | 9 | import pandas as pd 10 | 11 | 12 | def send(date,section_list): 13 | _user = "1971990184@qq.com" 14 | _pwd = "rxdaltsmieszgfje" 15 | # _to = ['henry.duye@gmail.com','qliu.net@gmail.com','380312089@qq.com'] 16 | 17 | # test 18 | _to = ['ryuanhang@gmail.com'] 19 | 20 | # now_time = datetime.datetime.now() 21 | # yes_time = now_time + datetime.timedelta(days=-1) 22 | # grab_time = yes_time.strftime('%m-%d') 23 | 24 | msg = MIMEMultipart.MIMEMultipart() 25 | # mail's title 26 | msg["Subject"] = date+"Result" 27 | msg["From"] = _user 28 | msg["To"] = ",".join(_to) 29 | 30 | # add content for the mail 31 | msg_content = '' 32 | for section_name in section_list: 33 | msg_content += excel2str(date, section_name=section_name) 34 | msg_content += '\n' 35 | 36 | msg.attach(MIMEText(msg_content, 'plain', 'utf-8')) 37 | 38 | #with open(date+'attachment.xls', 'rb') as f: 39 | # # 设置附件的MIME和文件名,这里是png类型: 40 | # mime = MIMEBase('file', 'xls', filename=date+'attachment.xls') 41 | # # 加上必要的头信息: 42 | # mime.add_header('Content-Disposition', 'attachment', filename=date+'attachment.xls') 43 | # mime.add_header('Content-ID', '<0>') 44 | # mime.add_header('X-Attachment-Id', '0') 45 | # # 把附件的内容读进来: 46 | # mime.set_payload(f.read()) 47 | # # 用Base64编码: 48 | # encoders.encode_base64(mime) 49 | # # 添加到MIMEMultipart: 50 | # msg.attach(mime) 51 | 52 | try: 53 | s = smtplib.SMTP_SSL("smtp.qq.com", 465) 54 | s.login(_user, _pwd) 55 | s.sendmail(_user, _to, msg.as_string()) 56 | s.quit() 57 | print "Succeed in sending mail!" 58 | except smtplib.SMTPException, e: 59 | print "Falied,%s" % e 60 | 61 | def excel2str(date,section_name=''): 62 | df = pd.read_excel("data/" + date+section_name + "result.xls", 63 | converters={'positive': str, 'negative': str, 'hottest': str},header=None) 64 | string = '' 65 | for list in df.values: 66 | for item in list: 67 | string += str(item)+' ' 68 | string += "\n" 69 | return string 70 | -------------------------------------------------------------------------------- /example/ApSchedulerExample/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/example/ApSchedulerExample/__init__.py -------------------------------------------------------------------------------- /example/ApSchedulerExample/processpoll.py: -------------------------------------------------------------------------------- 1 | """ 2 | Demonstrates how to schedule a job to be run in a process pool on 3 second intervals. 3 | """ 4 | 5 | from datetime import datetime 6 | import os 7 | 8 | from apscheduler.schedulers.blocking import BlockingScheduler 9 | 10 | 11 | def tick(): 12 | print('Tick! 
The time is: %s' % datetime.now()) 13 | 14 | 15 | if __name__ == '__main__': 16 | scheduler = BlockingScheduler() 17 | scheduler.add_executor('processpool') 18 | scheduler.add_job(tick, 'interval', seconds=3) 19 | print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C')) 20 | 21 | try: 22 | scheduler.start() 23 | except (KeyboardInterrupt, SystemExit): 24 | pass -------------------------------------------------------------------------------- /example/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/example/__init__.py -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | import os 4 | import time 5 | from apscheduler.schedulers.blocking import BlockingScheduler 6 | from pymongo import MongoClient 7 | 8 | from east_sentiment import dailyResult, sendMail 9 | from east_sentiment import produceFactor,aggregateFactor, outputResult 10 | from tools import mongotool 11 | 12 | stockCodes = [] 13 | # stockCodes = ['000001','000002','000009','000046','000063','000069','000333','000402','000559','000568','000623','000630','000651','000686','000712','000725','000738','000776','000783','000792','000793', 14 | # '000800','000858','000895','000898','000937','000999','002007','002065','002142','002236','002292','002294','002385','002465','002736','300015','300017','300024','300058','300070'] 15 | 16 | # stockCodes = ['000001','000002','000009','000027','000039','000046','000060','000061','000063','000069'] 17 | client = MongoClient() 18 | db = client['stockcodes'] 19 | documents = db.HS300.find() 20 | 21 | for document in documents: 22 | stockCodes.append(document['stockcode']) 23 | 24 | # append IT stock codes 25 | IT_stockCodes = [] 26 | documents_IT = db.IT.find() 27 | for document in documents_IT: 28 | IT_stockCodes.append(document['stockcode']) 29 | 30 | if __name__ == '__main__': 31 | 32 | while True: 33 | t0 = time.clock() 34 | 35 | now_time = datetime.datetime.now() 36 | yes_time = now_time + datetime.timedelta(days=-1) 37 | grab_time = yes_time.strftime('%m-%d') 38 | for stockCode in stockCodes: 39 | # 一个帖子的 40 | produceFactor.getSentimentFactor(stockCode, grab_time) 41 | # 一个股票的 42 | aggregateFactor.aggregate(stockCode, grab_time) 43 | # 一天的 44 | dailyResult.setDailyResult(stockCode, grab_time) 45 | 46 | outputResult.getDailyResult(grab_time) 47 | 48 | # For IT stocks 49 | 50 | for stockCode in IT_stockCodes: 51 | # 一个帖子的 52 | produceFactor.getSentimentFactor(stockCode, grab_time) 53 | # 一个股票的 54 | aggregateFactor.aggregate(stockCode, grab_time) 55 | # 一天的 56 | dailyResult.setDailyResult(stockCode, grab_time, section_name='IT') 57 | 58 | outputResult.getDailyResult(grab_time, section_name='IT') 59 | 60 | sendMail.send(grab_time, section_list=['', 'IT']) 61 | 62 | # dump and drop part 63 | client = MongoClient() 64 | db = client.taskdb 65 | db.east.drop() 66 | print 'east collection has been dropped!' 
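        # Housekeeping, as described in the README: dropping taskdb.east above lets pyspider
        # crawl the same pages again tomorrow; mongotool.dump()/drop() below back up and then
        # delete the databases from 5 days ago; the final mv publishes the xls for the Android
        # app and assumes a web root at /var/www/html exists.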
67 | 68 | mongotool.dump() 69 | mongotool.drop() 70 | os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html') 71 | 72 | t1 = time.clock() 73 | 74 | time.sleep(86400-(t1-t0)) -------------------------------------------------------------------------------- /script/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/script/__init__.py -------------------------------------------------------------------------------- /script/gubaEast.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | # Created on 2016-02-02 11:38:51 4 | # Project: HelloWorld 5 | import re 6 | 7 | import datetime 8 | from pyspider.libs.base_handler import * 9 | from lxml import etree 10 | from pymongo import * 11 | 12 | #num = 0 13 | class Handler(BaseHandler): 14 | crawl_config = { 15 | } 16 | 17 | def __init__(self): 18 | client = MongoClient() 19 | db = client['stockcodes'] 20 | self.StockCodes = [] 21 | documents = db.HS300.find() 22 | for document in documents: 23 | self.StockCodes.append(document['stockcode']) 24 | 25 | # Add IT stock codes 26 | documents_it = db.IT.find() 27 | for document in documents_it: 28 | self.StockCodes.append(document['stockcode']) 29 | 30 | @every(minutes = 24*60) 31 | def on_start(self): 32 | 33 | for stockcode in self.StockCodes: 34 | self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',5_1.html', callback=self.index_page) 35 | # self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',1,'+'f_1.html', callback=self.index_page) 36 | # self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',2,'+'f_1.html', callback=self.index_page) 37 | # self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',3,'+'f_1.html', callback=self.index_page) 38 | 39 | @config(age = 60*60*24) 40 | def index_page(self, response): 41 | selector = etree.HTML(response.text) 42 | content_field = selector.xpath('//*[@id="articlelistnew"]/div[starts-with(@class,"articleh")]') 43 | # 获取昨天时间,用于抓取 44 | now_time = datetime.datetime.now() 45 | yes_time = now_time + datetime.timedelta(days=-1) 46 | grab_time = yes_time.strftime('%m-%d') 47 | first = True 48 | flag = False 49 | 50 | # 提取每一页的所有帖子 51 | for each in content_field: 52 | last = each.xpath('span[6]/text()')[0] 53 | last_time = last[:5] 54 | # 根据时间来判断 55 | if grab_time != last_time: 56 | if first != True: 57 | flag = True 58 | break 59 | continue 60 | first = False 61 | item = {} 62 | read = each.xpath('span[1]/text()')[0] 63 | comment = each.xpath('span[2]/text()')[0] 64 | title = each.xpath('span[3]/a/text()')[0] 65 | # author容易因为结构出现异常 66 | if each.xpath('span[4]/a/text()'): 67 | author = each.xpath('span[4]/a/text()')[0] 68 | else: 69 | author = each.xpath('span[4]/span/text()')[0] 70 | 71 | # date = each.xpath('span[5]/text()')[0] 72 | 73 | address = each.xpath('span[3]/a/@href')[0] 74 | baseUrl = 'http://guba.eastmoney.com' 75 | Url = baseUrl + address 76 | # 将数据放入item 77 | item['read'] = read 78 | item['comment'] = comment 79 | item['title'] = title 80 | item['author'] = author 81 | # item['date'] = date 82 | item['last'] = last 83 | item['url'] = response.url 84 | 85 | # 提取内容 86 | self.crawl(Url, callback=self.detail_page, save={'item': item}) 87 | 88 | 89 | 90 | # 生成下一页链接 91 | if not flag: 92 | info = selector.xpath('//*[@id="articlelistnew"]/div[@class="pager"]/span/@data-pager')[0] 93 | List = 
info.split('|') 94 | if int(List[2])*int(List[3]) span').text() 79 | time = re.findall('
(.*?)
',response.text)[0] 80 | #time = selector.xpath('//*[@id="thread"]/div[@class="il_txt"]/div[@class="ilt_panel clearfix"]/div[@class="fl_left iltp_time"]/span/text()') 81 | item['time'] = time 82 | item['tid'] = int(re.findall('tid=(.*?)&bid',response.url)[0]) 83 | return item 84 | 85 | -------------------------------------------------------------------------------- /script/snowball.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | # Created on 2016-02-24 14:54:21 4 | # Project: Snowball 5 | import datetime 6 | import time 7 | import re 8 | from pyspider.libs.base_handler import * 9 | from lxml import etree 10 | 11 | 12 | class Handler(BaseHandler): 13 | crawl_config = { 14 | 'headers' : {'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36','X-Requested-With' : 'XMLHttpRequest','Referer' : 'http://xueqiu.com/S/SH601001','Host' : 'xueqiu.com','cache-control' : 'no-cache'} 15 | } 16 | 17 | def __init__(self): 18 | self.stockcode=['000003'] 19 | 20 | 21 | #这个函数用于产生cookie,为后面做准备 22 | @every(minutes=5) 23 | def on_start(self): 24 | for stockcode in self.stockcode: 25 | self.crawl('http://xueqiu.com/S/'+'SH'+stockcode,callback=self.first_scrape,save = {'stockcode':stockcode}) 26 | 27 | #这个函数用来将on_start函数中产生的cookie传入,用于得到最大的页面数 28 | @config(age=3*60) 29 | def first_scrape(self, response): 30 | List = ['%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB','%E5%85%AC%E5%91%8A','%E7%A0%94%E6%8A%A5'] 31 | 32 | self.crawl('http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + 'user' + '&sort=time&page=' + '1' +'&_=' + str(int(time.time()*1000)),headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.produce_page,save={'stockcode':response.save['stockcode']}) 33 | self.crawl('http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + 'trans' + '&page=' + '1' +'&_=' + str(int(time.time()*1000)),headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.produce_page,save={'stockcode':response.save['stockcode']}) 34 | for module in List: 35 | self.crawl('http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + module + '&page=' + '1' +'&_=' + str(int(time.time()*1000)),headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.produce_page,save={'stockcode':response.save['stockcode']}) 36 | 37 | #这个函数抓取所有板块的有用信息 38 | @config(priority=2) 39 | def produce_page(self, response): 40 | 41 | flag = re.findall('source=(.*?)&',response.url)[0] 42 | 43 | if flag == 'user': 44 | for i in range(1,int(response.json['maxPage']+1)): 45 | 46 | url = 'http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + flag + '&sort=time&page=' + str(i)+'&_=' + str(int(time.time()*1000)) 47 | 48 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 49 | 50 | elif flag == 'trans': 51 | 52 | for i in range(1,int(response.json['maxPage']+1)): 53 | 54 | url = 
'http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 55 | 56 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 57 | 58 | elif flag == '%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 59 | 60 | for i in range(1,int(response.json['maxPage']+1)): 61 | 62 | url = 'http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 63 | 64 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 65 | 66 | elif flag == '%E5%85%AC%E5%91%8A': 67 | 68 | for i in range(1,int(response.json['maxPage']+1)): 69 | 70 | url = 'http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 71 | 72 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 73 | 74 | elif flag == '%E7%A0%94%E6%8A%A5': 75 | 76 | for i in range(1,int(response.json['maxPage']+1)): 77 | 78 | url = 'http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 79 | 80 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 81 | 82 | #这个函数用于格式化数据 83 | def deal_page(self,response): 84 | 85 | url = '' 86 | for item in response.url.split('&')[:-2]: 87 | temp = item+'&' 88 | url+=temp 89 | 90 | if re.findall('source=(.*?)&',response.url)[0]=='%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 91 | for each in response.json['list']: 92 | self.send_message(self.project_name,{ 93 | 'name' : each['user']['screen_name'], 94 | 'text' : etree.HTML(each['text']).xpath('string(.)'), 95 | 'time' : each['created_at'], 96 | 'title' : each['title'] 97 | }, url = "%s" % (url+str(each['created_at']))) 98 | else: 99 | for each in response.json['list']: 100 | self.send_message(self.project_name,{ 101 | 'name' : each['user']['screen_name'], 102 | 'text' : etree.HTML(each['text']).xpath('string(.)'), 103 | 'time' : each['created_at'], 104 | #'title' : each['title'] 105 | }, url = "%s" % (url+str(each['created_at']))) 106 | #返回数据到数据库 107 | def on_message(self,project,msg): 108 | return msg -------------------------------------------------------------------------------- /set_codes/HS300.txt: -------------------------------------------------------------------------------- 1 | 证券代码 证券简称 2 | 000001.SZ 平安银行 3 | 000002.SZ 万科A 4 | 000009.SZ 中国宝安 5 | 000027.SZ 深圳能源 6 | 000039.SZ 中集集团 7 | 000046.SZ 泛海控股 8 | 000060.SZ 中金岭南 9 | 000061.SZ 农产品 10 | 000063.SZ 中兴通讯 11 | 000069.SZ 华侨城A 12 | 000100.SZ TCL集团 13 | 000156.SZ 华数传媒 14 | 000157.SZ 中联重科 15 | 000166.SZ 申万宏源 16 | 000333.SZ 美的集团 17 | 000338.SZ 潍柴动力 18 | 000400.SZ 许继电气 19 | 000402.SZ 金融街 20 | 000413.SZ 东旭光电 21 | 000415.SZ 渤海金控 22 | 000423.SZ 东阿阿胶 23 | 000425.SZ 徐工机械 24 | 000503.SZ 海虹控股 25 | 000538.SZ 云南白药 26 | 000539.SZ 粤电力A 27 | 000540.SZ 中天城投 28 | 000559.SZ 万向钱潮 29 | 000568.SZ 泸州老窖 30 | 000581.SZ 威孚高科 31 | 000598.SZ 兴蓉环境 32 
| 000623.SZ 吉林敖东 33 | 000625.SZ 长安汽车 34 | 000629.SZ 攀钢钒钛 35 | 000630.SZ 铜陵有色 36 | 000651.SZ 格力电器 37 | 000686.SZ 东北证券 38 | 000709.SZ 河钢股份 39 | 000712.SZ 锦龙股份 40 | 000725.SZ 京东方A 41 | 000728.SZ 国元证券 42 | 000729.SZ 燕京啤酒 43 | 000738.SZ 中航动控 44 | 000750.SZ 国海证券 45 | 000768.SZ 中航飞机 46 | 000776.SZ 广发证券 47 | 000778.SZ 新兴铸管 48 | 000783.SZ 长江证券 49 | 000792.SZ 盐湖股份 50 | 000793.SZ 华闻传媒 51 | 000800.SZ 一汽轿车 52 | 000825.SZ 太钢不锈 53 | 000826.SZ 启迪桑德 54 | 000831.SZ 五矿稀土 55 | 000858.SZ 五粮液 56 | 000876.SZ 新希望 57 | 000883.SZ 湖北能源 58 | 000895.SZ 双汇发展 59 | 000898.SZ 鞍钢股份 60 | 000917.SZ 电广传媒 61 | 000937.SZ 冀中能源 62 | 000963.SZ 华东医药 63 | 000983.SZ 西山煤电 64 | 000999.SZ 华润三九 65 | 001979.SZ 招商蛇口 66 | 002007.SZ 华兰生物 67 | 002008.SZ 大族激光 68 | 002024.SZ 苏宁云商 69 | 002038.SZ 双鹭药业 70 | 002065.SZ 东华软件 71 | 002081.SZ 金螳螂 72 | 002129.SZ 中环股份 73 | 002142.SZ 宁波银行 74 | 002146.SZ 荣盛发展 75 | 002153.SZ 石基信息 76 | 002195.SZ 二三四五 77 | 002202.SZ 金风科技 78 | 002230.SZ 科大讯飞 79 | 002236.SZ 大华股份 80 | 002241.SZ 歌尔声学 81 | 002252.SZ 上海莱士 82 | 002292.SZ 奥飞动漫 83 | 002294.SZ 信立泰 84 | 002304.SZ 洋河股份 85 | 002353.SZ 杰瑞股份 86 | 002375.SZ 亚厦股份 87 | 002385.SZ 大北农 88 | 002399.SZ 海普瑞 89 | 002410.SZ 广联达 90 | 002415.SZ 海康威视 91 | 002422.SZ 科伦药业 92 | 002450.SZ 康得新 93 | 002456.SZ 欧菲光 94 | 002465.SZ 海格通信 95 | 002470.SZ 金正大 96 | 002475.SZ 立讯精密 97 | 002500.SZ 山西证券 98 | 002594.SZ 比亚迪 99 | 002673.SZ 西部证券 100 | 002736.SZ 国信证券 101 | 002739.SZ 万达院线 102 | 300002.SZ 神州泰岳 103 | 300003.SZ 乐普医疗 104 | 300015.SZ 爱尔眼科 105 | 300017.SZ 网宿科技 106 | 300024.SZ 机器人 107 | 300027.SZ 华谊兄弟 108 | 300058.SZ 蓝色光标 109 | 300059.SZ 东方财富 110 | 300070.SZ 碧水源 111 | 300104.SZ 乐视网 112 | 300124.SZ 汇川技术 113 | 300133.SZ 华策影视 114 | 300144.SZ 宋城演艺 115 | 300146.SZ 汤臣倍健 116 | 300251.SZ 光线传媒 117 | 300315.SZ 掌趣科技 118 | 600000.SH 浦发银行 119 | 600005.SH 武钢股份 120 | 600008.SH 首创股份 121 | 600009.SH 上海机场 122 | 600010.SH 包钢股份 123 | 600011.SH 华能国际 124 | 600015.SH 华夏银行 125 | 600016.SH 民生银行 126 | 600018.SH 上港集团 127 | 600019.SH 宝钢股份 128 | 600021.SH 上海电力 129 | 600023.SH 浙能电力 130 | 600027.SH 华电国际 131 | 600028.SH 中国石化 132 | 600029.SH 南方航空 133 | 600030.SH 中信证券 134 | 600031.SH 三一重工 135 | 600036.SH 招商银行 136 | 600038.SH 中直股份 137 | 600048.SH 保利地产 138 | 600050.SH 中国联通 139 | 600060.SH 海信电器 140 | 600066.SH 宇通客车 141 | 600068.SH 葛洲坝 142 | 600085.SH 同仁堂 143 | 600089.SH 特变电工 144 | 600100.SH 同方股份 145 | 600104.SH 上汽集团 146 | 600109.SH 国金证券 147 | 600111.SH 北方稀土 148 | 600115.SH 东方航空 149 | 600118.SH 中国卫星 150 | 600150.SH 中国船舶 151 | 600153.SH 建发股份 152 | 600157.SH 永泰能源 153 | 600166.SH 福田汽车 154 | 600170.SH 上海建工 155 | 600177.SH 雅戈尔 156 | 600188.SH 兖州煤业 157 | 600196.SH 复星医药 158 | 600208.SH 新湖中宝 159 | 600221.SH 海南航空 160 | 600252.SH 中恒集团 161 | 600256.SH 广汇能源 162 | 600271.SH 航天信息 163 | 600276.SH 恒瑞医药 164 | 600309.SH 万华化学 165 | 600315.SH 上海家化 166 | 600317.SH 营口港 167 | 600332.SH 白云山 168 | 600340.SH 华夏幸福 169 | 600350.SH 山东高速 170 | 600352.SH 浙江龙盛 171 | 600362.SH 江西铜业 172 | 600369.SH 西南证券 173 | 600372.SH 中航电子 174 | 600373.SH 中文传媒 175 | 600383.SH 金地集团 176 | 600398.SH 海澜之家 177 | 600406.SH 国电南瑞 178 | 600415.SH 小商品城 179 | 600485.SH 信威集团 180 | 600489.SH 中金黄金 181 | 600518.SH 康美药业 182 | 600519.SH 贵州茅台 183 | 600535.SH 天士力 184 | 600547.SH 山东黄金 185 | 600549.SH 厦门钨业 186 | 600570.SH 恒生电子 187 | 600578.SH 京能电力 188 | 600583.SH 海油工程 189 | 600585.SH 海螺水泥 190 | 600588.SH 用友网络 191 | 600600.SH 青岛啤酒 192 | 600633.SH 浙报传媒 193 | 600637.SH 东方明珠 194 | 600642.SH 申能股份 195 | 600648.SH 外高桥 196 | 600649.SH 城投控股 197 | 600660.SH 福耀玻璃 198 | 600663.SH 陆家嘴 199 | 600674.SH 川投能源 200 | 600688.SH 上海石化 201 | 600690.SH 青岛海尔 202 | 600703.SH 三安光电 203 | 600705.SH 中航资本 204 | 600717.SH 天津港 205 | 600718.SH 
东软集团 206 | 600739.SH 辽宁成大 207 | 600741.SH 华域汽车 208 | 600783.SH 鲁信创投 209 | 600795.SH 国电电力 210 | 600804.SH 鹏博士 211 | 600820.SH 隧道股份 212 | 600827.SH 百联股份 213 | 600837.SH 海通证券 214 | 600839.SH 四川长虹 215 | 600863.SH 内蒙华电 216 | 600867.SH 通化东宝 217 | 600873.SH 梅花生物 218 | 600875.SH 东方电气 219 | 600886.SH 国投电力 220 | 600887.SH 伊利股份 221 | 600893.SH 中航动力 222 | 600895.SH 张江高科 223 | 600900.SH 长江电力 224 | 600958.SH 东方证券 225 | 600959.SH 江苏有线 226 | 600998.SH 九州通 227 | 600999.SH 招商证券 228 | 601006.SH 大秦铁路 229 | 601009.SH 南京银行 230 | 601016.SH 节能风电 231 | 601018.SH 宁波港 232 | 601021.SH 春秋航空 233 | 601088.SH 中国神华 234 | 601098.SH 中南传媒 235 | 601099.SH 太平洋 236 | 601106.SH 中国一重 237 | 601111.SH 中国国航 238 | 601117.SH 中国化学 239 | 601118.SH 海南橡胶 240 | 601158.SH 重庆水务 241 | 601166.SH 兴业银行 242 | 601169.SH 北京银行 243 | 601179.SH 中国西电 244 | 601186.SH 中国铁建 245 | 601198.SH 东兴证券 246 | 601211.SH 国泰君安 247 | 601216.SH 君正集团 248 | 601225.SH 陕西煤业 249 | 601231.SH 环旭电子 250 | 601238.SH 广汽集团 251 | 601258.SH 庞大集团 252 | 601288.SH 农业银行 253 | 601318.SH 中国平安 254 | 601328.SH 交通银行 255 | 601333.SH 广深铁路 256 | 601336.SH 新华保险 257 | 601377.SH 兴业证券 258 | 601390.SH 中国中铁 259 | 601398.SH 工商银行 260 | 601555.SH 东吴证券 261 | 601600.SH 中国铝业 262 | 601601.SH 中国太保 263 | 601607.SH 上海医药 264 | 601608.SH 中信重工 265 | 601618.SH 中国中冶 266 | 601628.SH 中国人寿 267 | 601633.SH 长城汽车 268 | 601668.SH 中国建筑 269 | 601669.SH 中国电建 270 | 601688.SH 华泰证券 271 | 601699.SH 潞安环能 272 | 601718.SH 际华集团 273 | 601727.SH 上海电气 274 | 601766.SH 中国中车 275 | 601788.SH 光大证券 276 | 601800.SH 中国交建 277 | 601808.SH 中海油服 278 | 601818.SH 光大银行 279 | 601857.SH 中国石油 280 | 601866.SH 中海集运 281 | 601872.SH 招商轮船 282 | 601888.SH 中国国旅 283 | 601898.SH 中煤能源 284 | 601899.SH 紫金矿业 285 | 601901.SH 方正证券 286 | 601919.SH 中国远洋 287 | 601928.SH 凤凰传媒 288 | 601933.SH 永辉超市 289 | 601939.SH 建设银行 290 | 601958.SH 金钼股份 291 | 601969.SH 海南矿业 292 | 601985.SH 中国核电 293 | 601988.SH 中国银行 294 | 601989.SH 中国重工 295 | 601991.SH 大唐发电 296 | 601992.SH 金隅股份 297 | 601998.SH 中信银行 298 | 603000.SH 人民网 299 | 603288.SH 海天味业 300 | 603993.SH 洛阳钼业 301 | 603885.SH 吉祥航空 302 | -------------------------------------------------------------------------------- /set_codes/IT_code.txt: -------------------------------------------------------------------------------- 1 | 300597 2 | 603039 3 | 300209 4 | 300598 5 | 300523 6 | 300451 7 | 300561 8 | 600728 9 | 600634 10 | 300277 11 | 300448 12 | 600455 13 | 603636 14 | 002174 15 | 603990 16 | 300579 17 | 300248 18 | 002339 19 | 300044 20 | 300300 21 | 603444 22 | 002065 23 | 600271 24 | 002474 25 | 300096 26 | 300508 27 | 300311 28 | 002230 29 | 300031 30 | 300074 31 | 002439 32 | 002279 33 | 300302 34 | 002362 35 | 600476 36 | 002657 37 | 300168 38 | 300075 39 | 002649 40 | 002063 41 | 600718 42 | 600446 43 | 600570 44 | 300047 45 | 002421 46 | 300002 47 | 600756 48 | 002373 49 | 300045 50 | 300287 51 | 600536 52 | 002771 53 | 601519 54 | 000948 55 | 002268 56 | 300290 57 | 600289 58 | 002232 59 | 002331 60 | 300264 61 | 600410 62 | 300365 63 | 300339 64 | 300033 65 | 300253 66 | 300036 67 | 002410 68 | 300170 69 | 300085 70 | 600571 71 | 300324 72 | 300297 73 | 002405 74 | 300166 75 | 300017 76 | 300231 77 | 600588 78 | 300352 79 | 300465 80 | 300010 81 | 600845 82 | 300348 83 | 300020 84 | 002153 85 | 002178 86 | 603918 87 | 300188 88 | 300182 89 | 300113 90 | 002642 91 | 300245 92 | 002280 93 | 300533 94 | 002253 95 | 300369 96 | 300559 97 | 603189 98 | 300520 99 | 300229 100 | 300377 101 | 002401 102 | 300542 103 | 603258 104 | 300552 105 | 300541 106 | 002558 107 | 300550 108 | 300379 109 | 300386 110 | 300378 111 | 002368 112 | 
300525 113 | 300556 114 | 600767 115 | 300469 116 | 600242 117 | 300467 118 | 300578 119 | 300271 120 | 300518 121 | 300468 122 | 600654 123 | 000711 124 | 603986 125 | 300605 126 | 300603 127 | 603881 128 | 300513 129 | 300212 130 | -------------------------------------------------------------------------------- /set_codes/IT_unique.txt: -------------------------------------------------------------------------------- 1 | 300597 2 | 603039 3 | 300209 4 | 300598 5 | 300523 6 | 300451 7 | 300561 8 | 600728 9 | 600634 10 | 300277 11 | 300448 12 | 600455 13 | 603636 14 | 002174 15 | 603990 16 | 300579 17 | 300248 18 | 002339 19 | 300044 20 | 300300 21 | 603444 22 | 002474 23 | 300096 24 | 300508 25 | 300311 26 | 300031 27 | 300074 28 | 002439 29 | 002279 30 | 300302 31 | 002362 32 | 600476 33 | 002657 34 | 300168 35 | 300075 36 | 002649 37 | 002063 38 | 600446 39 | 300047 40 | 002421 41 | 600756 42 | 002373 43 | 300045 44 | 300287 45 | 600536 46 | 002771 47 | 601519 48 | 000948 49 | 002268 50 | 300290 51 | 600289 52 | 002232 53 | 002331 54 | 300264 55 | 600410 56 | 300365 57 | 300339 58 | 300033 59 | 300253 60 | 300036 61 | 300170 62 | 300085 63 | 600571 64 | 300324 65 | 300297 66 | 002405 67 | 300166 68 | 300231 69 | 300352 70 | 300465 71 | 300010 72 | 600845 73 | 300348 74 | 300020 75 | 002178 76 | 603918 77 | 300188 78 | 300182 79 | 300113 80 | 002642 81 | 300245 82 | 002280 83 | 300533 84 | 002253 85 | 300369 86 | 300559 87 | 603189 88 | 300520 89 | 300229 90 | 300377 91 | 002401 92 | 300542 93 | 603258 94 | 300552 95 | 300541 96 | 002558 97 | 300550 98 | 300379 99 | 300386 100 | 300378 101 | 002368 102 | 300525 103 | 300556 104 | 600767 105 | 300469 106 | 600242 107 | 300467 108 | 300578 109 | 300271 110 | 300518 111 | 300468 112 | 600654 113 | 000711 114 | 603986 115 | 300605 116 | 300603 117 | 603881 118 | 300513 119 | 300212 120 | -------------------------------------------------------------------------------- /set_codes/set_IT.py: -------------------------------------------------------------------------------- 1 | import re 2 | from pymongo import * 3 | # import pymongo 4 | 5 | # this script is used to insert the HS300 stockcodes into MongoDB 6 | client = MongoClient() 7 | db = client['stockcodes'] 8 | 9 | f= open('IT_unique.txt','r') 10 | all_text = f.read() 11 | stockcodes = re.findall('\d{6}',all_text) 12 | for stockcode in stockcodes: 13 | result = db.IT.insert_one({ 14 | "stockcode" : stockcode 15 | }) 16 | print result.inserted_id 17 | f.close() -------------------------------------------------------------------------------- /set_codes/set_hs300.py: -------------------------------------------------------------------------------- 1 | import re 2 | from pymongo import * 3 | # import pymongo 4 | 5 | # this script is used to insert the HS300 stockcodes into MongoDB 6 | client = MongoClient() 7 | db = client['stockcodes'] 8 | 9 | f= open('HS300.txt','r') 10 | all_text = f.read() 11 | stockcodes = re.findall('\d{6}',all_text) 12 | for stockcode in stockcodes: 13 | result = db.HS300.insert_one({ 14 | "stockcode" : stockcode 15 | }) 16 | print result.inserted_id 17 | f.close() -------------------------------------------------------------------------------- /test/Scores.xls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/test/Scores.xls -------------------------------------------------------------------------------- /test/__init__.py: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/test/__init__.py -------------------------------------------------------------------------------- /test/testPandas.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | df = pd.read_excel("../data/07-01result.xls") 4 | stock = df['positive'].tolist() 5 | print stock -------------------------------------------------------------------------------- /test/testPandasExcel.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | data = {'names':['John Doe', 'Zoe McCarty', 'Pam Ferris'], 4 | 'scores': [76, 98, 90]} 5 | table = pd.DataFrame(data) 6 | 7 | writer = pd.ExcelWriter('Scores.xls') 8 | table.to_excel(writer, 'Scores 1') 9 | writer.save() -------------------------------------------------------------------------------- /tools/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/tools/__init__.py -------------------------------------------------------------------------------- /tools/drop.py: -------------------------------------------------------------------------------- 1 | from pymongo import MongoClient 2 | 3 | client = MongoClient() 4 | db = client['stockcodes'] 5 | documents = db.HS300.find() 6 | stockcodes = [] 7 | for document in documents: 8 | stockcodes.append(document['stockcode']) 9 | 10 | for stockcode in stockcodes: 11 | client = MongoClient() 12 | client.drop_database(stockcode+'eastmoney') 13 | 14 | -------------------------------------------------------------------------------- /tools/mongotool.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import datetime 4 | from pymongo import MongoClient 5 | 6 | def dump(): 7 | client = MongoClient() 8 | db = client['stockcodes'] 9 | documents = db.HS300.find() 10 | documents_it = db.IT.find() 11 | stockcodes = [] 12 | for document in documents: 13 | stockcodes.append(document['stockcode']) 14 | # add IT stocks 15 | for document in documents_it: 16 | stockcodes.append(document['stockcode']) 17 | 18 | now_time = datetime.datetime.now() 19 | dump_time = now_time + datetime.timedelta(days=-5) 20 | dump_time = dump_time.strftime('%m-%d') 21 | 22 | os.system('mkdir '+dump_time) 23 | 24 | for stockcode in stockcodes: 25 | os.system('mongodump --db '+stockcode+'eastmoney'+' --collection '+dump_time+'GuYouHui') 26 | os.system('mongodump --db '+stockcode+'eastmoney'+' --collection '+dump_time+'SentimentFactor') 27 | os.system('mongodump --db '+dump_time) 28 | os.system('mongodump --db '+dump_time+'IT') 29 | 30 | os.system('mv dump '+dump_time) 31 | 32 | print dump_time+'data has been dumped!' 33 | print dump_time +'IT'+ 'data has been dumped!' 
34 | 35 | def drop(): 36 | client = MongoClient() 37 | db = client['stockcodes'] 38 | documents = db.HS300.find() 39 | documents_it = db.IT.find() 40 | stockcodes = [] 41 | for document in documents: 42 | stockcodes.append(document['stockcode']) 43 | 44 | # add IT stocks 45 | for document in documents_it: 46 | stockcodes.append(document['stockcode']) 47 | 48 | now_time = datetime.datetime.now() 49 | drop_time = now_time + datetime.timedelta(days=-5) 50 | drop_time = drop_time.strftime('%m-%d') 51 | 52 | for stockcode in stockcodes: 53 | db = client[stockcode+'eastmoney'] 54 | coll = db[drop_time+'GuYouHui'] 55 | coll.drop() 56 | coll = db[drop_time+'SentimentFactor'] 57 | coll.drop() 58 | client.drop_database(drop_time) 59 | 60 | client.drop_database(drop_time+'IT') 61 | 62 | print drop_time+'data has been dropped!' 63 | print drop_time+'IT'+'data has been dropped!' --------------------------------------------------------------------------------
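A note on restoring the backups: `dump()` above shells out to `mongodump` and then moves its output under a `<MM-DD>/dump/` directory, so a collection can be restored with `mongorestore` in the same `os.system` style (the stock code, date and paths below are illustrative):

    import os

    # restore one stock's GuYouHui collection from the 07-01 backup produced by dump()
    os.system('mongorestore --db 601001eastmoney --collection 07-01GuYouHui '
              '07-01/dump/601001eastmoney/07-01GuYouHui.bson')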