├── .gitignore ├── README.md ├── additional ├── __init__.py └── resultdb.py ├── config ├── __init__.py └── config.json ├── dev ├── __init__.py ├── check_it_in_hs300.py └── process_it.py ├── east_sentiment ├── __init__.py ├── aggregateFactor.py ├── dailyResult.py ├── outputResult.py ├── produceFactor.py └── sendMail.py ├── example ├── ApSchedulerExample │ ├── __init__.py │ └── processpoll.py └── __init__.py ├── main.py ├── script ├── __init__.py ├── gubaEast.py ├── sina.py └── snowball.py ├── set_codes ├── HS300.txt ├── IT_code.txt ├── IT_unique.txt ├── set_IT.py └── set_hs300.py ├── test ├── Scores.xls ├── __init__.py ├── testPandas.py └── testPandasExcel.py └── tools ├── __init__.py ├── drop.py └── mongotool.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | gitignore 3 | eastSentiment/*.pyc 4 | *.pyc 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pyspider-stock 2 | 3 | **Note: This README has both a Chinese and an English version; the Chinese version comes first because the project targets the Chinese stock market.** 4 | 5 | *Update*: 6 | 7 | - [x] 增加[IT][1]版块股票的抓取和分析 8 | 9 | 10 | ## 这个项目做什么? 11 | 这个项目使用[pyspider][2]抓取[东方财富网股吧][3]、[雪球网][4]、[新浪股吧][5]的帖子,然后使用自然语言处理(情感分析)的方式分析舆论 12 | 13 | 所以 14 | 15 | 它有两个部分 16 | 17 | 1. 抓取帖子 18 | 2. 情感分析 19 | 20 | ## 如何运行它? 21 | 22 | ### 第一步 抓取帖子 23 | 24 | * 下载[pyspider][6],[mongoDB][7],[redis][8],[snowNLP][9],[pymongo(2.9)][10]及相应的依赖库 25 | * 运行`set_codes/set_hs300.py`和`set_IT.py`(为了将HS300成份股的股票代码装入mongoDB,后者的目的是放入IT股票的代码) 26 | * 然后,将`resultdb.py`放入pyspider的`database/mongodb`目录下(为了将爬取到的数据放入mongoDB),pyspider路径使用`pip show pyspider`命令 27 | * 启动`redis` 28 | * 然后,在有`config.json`的目录下,**command line** 运行`pyspider -c config.json all &` 29 | * 其次,将script里的脚本复制后,粘贴到localhost:5000下你自己的工程里(想要爬取哪个网站就粘贴哪个script),保存 30 | * 最后在网页localhost:5000里单击run 31 | 32 | 在早上开盘前执行完最后两步即可在每天早上开盘前获取到HS300昨日的舆论数据 33 | 34 | ### 第二步 情感分析 35 | 在完成第一步30分钟后即可执行该步骤 36 | 37 | 第一次运行时在和`main.py`同目录下新建目录`data` 38 | 39 | 运行 `main.py`即可 40 | 41 | 42 | ## 发生了什么? 43 | 44 | 默认使用`gubaEast.py`抓取东方财富网下的**股友汇**版块,因为它最稳定 45 | 46 | 执行完第一步后,你会在名为`[stockcode]eastmoney`的database下发现`[date]GuYouHui`的collection,其中`[stockcode]`和`[date]`分别是HS300成份股的股票代码和昨天的日期 47 | 48 | 接着是情感分析部分 49 | 50 | 核心是3段代码: 51 | 52 | produceFactor.getSentimentFactor(stockCode, grab_time) 53 | 用于获得抓取日期的特定股票帖子的情感因子和情感值(由情感因子乘以阅读量获得) 54 | 55 | aggregateFactor.aggregate(stockCode, grab_time) 56 | 用于获得抓取日期的特定股票的情感值(由所有帖子情感值相加得到),结果保存在`[stockcode]eastmoney`下的`[date]SentimentFactor`中 57 | 58 | dailyResult.setDailyResult(stockCode, grab_time) 59 | 用于汇总抓取日期的所有HS300股票的情感值和帖子数,结果在`[date]`database的`DailyResult`collection下 60 | 61 | 而后结果会以excel的格式保存在`data`目录下 62 | 63 | 结果会以邮件形式发给你指定的人,通过`sendMail`模块 64 | 65 | 最后taskdb里面这个任务会被清除,以便明天增量抓取。同时会将5天前数据库中的数据导出,存在本地,并删除数据库中的数据 66 | 67 | 如果想用[app][11]在android端查看结果,就保留 68 | 69 | os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html') 70 | 71 | 72 | *English version* 73 | 74 | ## What's the aim of this project? 75 | This project uses [pyspider][12] to crawl posts from [eastmoney][13], [xueqiu][14] and [sinaguba][15], then applies NLP (sentiment analysis) to gauge public sentiment as an aid to stock selection. 76 | 77 | So 78 | 79 | It has two parts: 80 | 81 | 1. Crawl posts 82 | 2. Sentiment analysis 83 | 84 | ## How to run this project? 
85 | 86 | ### Step 1 Crawl posts 87 | 88 | * Install [pyspider][16], [mongoDB][17], [redis][18], [snowNLP][19], [pymongo(2.9)][10] and the other dependencies 89 | * Run `set_codes/set_hs300.py` and `set_codes/set_IT.py` (the former loads the HS300 constituent codes into mongoDB, the latter the IT-sector codes) 90 | * Put `resultdb.py` into the `database/mongodb` directory of **pyspider**, so that the crawled data is saved to mongoDB (find the pyspider path with `pip show pyspider`) 91 | * Start `redis` 92 | * From the **command line**, run `pyspider -c config.json all &` in the directory that contains `config.json` 93 | * Copy the script you need from the `script` folder (one per site you want to crawl), paste it into your own project at localhost:5000, and save 94 | * Click the run button at localhost:5000 95 | 96 | Complete the last two steps before the market opens and you will get the previous day's sentiment data every morning. 97 | 98 | ### Step 2 Sentiment analysis 99 | 100 | Run `main.py` about 30 minutes after the posts have been crawled and stored; before the first run, create a `data` directory next to `main.py`. 101 | 102 | 103 | ## What happened? 104 | 105 | By default `gubaEast.py` crawls the **GuYouHui** section of eastmoney, because it is the most stable one. 106 | 107 | After Step 1 finishes, you'll find a collection named `[date]GuYouHui` in a database called `[stockcode]eastmoney`, where `[stockcode]` is an HS300 constituent's symbol and `[date]` is yesterday's date. 108 | 109 | The second part is sentiment analysis. 110 | 111 | The core consists of three calls: 112 | 113 | produceFactor.getSentimentFactor(stockCode, grab_time) 114 | Computes a sentiment value and a sentiment factor for every **post** of a given symbol on the crawl date (the sentiment value comes from snowNLP; the sentiment factor is the sentiment value multiplied by the post's read count; see the sketch further below) 115 | 116 | aggregateFactor.aggregate(stockCode, grab_time) 117 | Aggregates the sentiment factor of a given symbol on the crawl date by summing the factors of all its posts that day; the result is stored in `[date]SentimentFactor` under `[stockcode]eastmoney` 118 | 119 | dailyResult.setDailyResult(stockCode, grab_time) 120 | Collects the sentiment factor and the number of posts of every stock for the crawl date; the result is stored in the `DailyResult` collection of the `[date]` database. 121 | 122 | The final result is then saved as an Excel file in the `data` directory. 123 | 124 | The result is mailed to the specified recipients through the `sendMail` module. 125 | 126 | The task in `taskdb` is then cleared so that the next day's crawl can run. Data from 5 days earlier is also dumped to a local backup and then deleted from mongoDB. 
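For reference, here is a minimal sketch of how `produceFactor.getSentimentFactor` derives a single post's sentiment factor (the sample post dict is made up for illustration; posts with empty text get a sentiment and factor of 0, as in `produceFactor.py`):

    # coding: utf-8
    from snownlp import SnowNLP

    # a made-up post, shaped like the documents stored in a [date]GuYouHui collection
    post = {'text': u'业绩超预期,继续看好', 'read': '1523'}

    sentiment = SnowNLP(post['text']).sentiments       # float in [0, 1]; above 0.5 leans positive
    sentiment_factor = sentiment * int(post['read'])   # weight the sentiment by the post's read count
    print('sentiment=%.3f, factor=%.1f' % (sentiment, sentiment_factor))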
127 | 128 | If you want to use the [app][20] to check the result on android, keep the following code 129 | 130 | os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html') 131 | 132 | 133 | [1]: http://quote.eastmoney.com/center/list.html#28002737_0_2 134 | [2]: http://docs.pyspider.org/en/latest/ 135 | [3]: http://guba.eastmoney.com/ 136 | [4]: https://xueqiu.com/ 137 | [5]: http://guba.sina.com.cn/ 138 | [6]: http://docs.pyspider.org/en/latest/ 139 | [7]: https://www.mongodb.com/ 140 | [8]: https://redis.io/ 141 | [9]: https://github.com/isnowfy/snownlp 142 | [10]: http://api.mongodb.com/python/current/installation.html 143 | [11]: https://github.com/ryh95/huaxiApp 144 | [12]: http://docs.pyspider.org/en/latest/ 145 | [13]: http://guba.eastmoney.com/ 146 | [14]: https://xueqiu.com/ 147 | [15]: http://guba.sina.com.cn/ 148 | [16]: http://docs.pyspider.org/en/latest/ 149 | [17]: https://www.mongodb.com/ 150 | [18]: https://redis.io/ 151 | [19]: https://github.com/isnowfy/snownlp 152 | [20]: https://github.com/ryh95/huaxiApp 153 | -------------------------------------------------------------------------------- /additional/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/additional/__init__.py -------------------------------------------------------------------------------- /additional/resultdb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | # vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8: 4 | # Author: Binux 5 | # http://binux.me 6 | # Created on 2014-10-13 22:18:36 7 | 8 | import json 9 | import logging 10 | import time 11 | import re 12 | 13 | import datetime 14 | from pymongo import MongoClient 15 | from pyspider.database.base.resultdb import ResultDB as BaseResultDB 16 | from .mongodbbase import SplitTableMixin 17 | logger = logging.getLogger("result") 18 | 19 | class ResultDB(SplitTableMixin, BaseResultDB): 20 | collection_prefix = '' 21 | 22 | def __init__(self, url, database='resultdb'): 23 | self.conn = MongoClient(url) 24 | self.database = self.conn[database] 25 | self.projects = set() 26 | 27 | self._list_project() 28 | 29 | def _parse(self, data): 30 | data['_id'] = str(data['_id']) 31 | if 'result' in data: 32 | data['result'] = json.loads(data['result']) 33 | return data 34 | 35 | def _stringify(self, data): 36 | if 'result' in data: 37 | data['result'] = json.dumps(data['result']) 38 | return data 39 | 40 | def save(self, project, taskid, url, result): 41 | 42 | #logger.info('url : %s',url) 43 | 44 | #(1) deal with guba.eastmoney.com 45 | if url.split('/')[2] == 'guba.eastmoney.com': 46 | #1.specify the database_name 47 | stockCode = url.split(',')[1] 48 | self.database = self.conn[stockCode+'eastmoney'] 49 | 50 | #2.specify the collection_name 51 | flag = result['url'].split(',')[2][0] 52 | 53 | # add time for GuYouHui 54 | now_time = datetime.datetime.now() 55 | yes_time = now_time + datetime.timedelta(days=-1) 56 | grab_time = yes_time.strftime('%m-%d') 57 | 58 | if flag == '5': 59 | collection_name = grab_time+'GuYouHui' 60 | elif flag == '1': 61 | collection_name = 'XinWen' 62 | elif flag == '2': 63 | collection_name = 'YanBao' 64 | elif flag == '3': 65 | collection_name = 'GongGao' 66 | else: 67 | collection_name = self._collection_name(project) 68 | 69 | #3.create the item which is going to insert to the mongoDB 70 | obj = { 71 
| # 'taskid': taskid, 72 | # 'url': url, 73 | # 'result': result, 74 | # 'updatetime': time.time(), 75 | # here has changed 76 | 'read' : result['read'], 77 | 'comment' : result['comment'], 78 | 'title' : result['title'], 79 | 'author' : result['author'], 80 | # 'date' : result['date'], 81 | 'last' : result['last'], 82 | 'text' : result['text'], 83 | 'url' : result['url'], 84 | 'create' : result['create'], 85 | 'created_at' :result['created_at'] 86 | } 87 | #(2) deal with xueqiu.com 88 | elif url.split('/')[2] == 'xueqiu.com': 89 | 90 | #1.specify the database_name 91 | stockCode = re.findall('\d{6}',url)[0] 92 | self.database = self.conn[stockCode+'xueqiu'] 93 | #2.specify the collection_name 94 | flag = re.findall('source=(.*?)&',url)[0] 95 | 96 | logger.info('flag : %s',flag) 97 | 98 | if flag == 'user': 99 | collection_name = 'TaoLun' 100 | elif flag == 'trans': 101 | collection_name = 'JiaoYi' 102 | elif flag == '%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 103 | collection_name = 'XinWen' 104 | elif flag == '%E5%85%AC%E5%91%8A': 105 | collection_name = 'GongGao' 106 | elif flag == '%E7%A0%94%E6%8A%A5': 107 | collection_name = 'YanBao' 108 | else: 109 | collection_name = self._collection_name(project) 110 | 111 | #3.create the item which is going to insert to the mongoDB 112 | if flag == '%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 113 | 114 | obj = { 115 | 'name' : result['name'], 116 | 'text' : result['text'], 117 | 'time' : long(result['time']), 118 | 'title' : result['title'], 119 | 120 | } 121 | else: 122 | obj = { 123 | 'name' : result['name'], 124 | 'text' : result['text'], 125 | 'time' : long(result['time']), 126 | # 'title' : result['title'], 127 | 128 | } 129 | # (3)deal with guba.sina.com 130 | elif url.split('/')[2] == 'guba.sina.com.cn': 131 | # 1.specify the database name 132 | stockCode = re.findall('name=(.*?)&type',result['url'])[0] 133 | self.database = self.conn[stockCode+'sina'] 134 | # 2.specify the collection name 135 | collection_name = 'TaoLun' 136 | 137 | # 3.create the item which is going to insert to the mongoDB 138 | obj = { 139 | 'author' : result['author'], 140 | 'comment' : result['comment'], 141 | 'read' : long(result['read']), 142 | 'title' : result['title'], 143 | 'text' : result['text'], 144 | 'tid' : result['tid'], 145 | 'time' : result['time'], 146 | 'url' : result['url'] 147 | } 148 | 149 | 150 | return self.database[collection_name].update( 151 | {'taskid': taskid}, {"$set": self._stringify(obj)}, upsert=True 152 | ) 153 | 154 | def select(self, project, fields=None, offset=0, limit=0): 155 | if project not in self.projects: 156 | self._list_project() 157 | if project not in self.projects: 158 | return 159 | collection_name = self._collection_name(project) 160 | for result in self.database[collection_name].find(fields=fields, skip=offset, limit=limit): 161 | yield self._parse(result) 162 | 163 | def count(self, project): 164 | if project not in self.projects: 165 | self._list_project() 166 | if project not in self.projects: 167 | return 168 | collection_name = self._collection_name(project) 169 | return self.database[collection_name].count() 170 | 171 | def get(self, project, taskid, fields=None): 172 | if project not in self.projects: 173 | self._list_project() 174 | if project not in self.projects: 175 | return 176 | collection_name = self._collection_name(project) 177 | ret = self.database[collection_name].find_one({'taskid': taskid}, fields=fields) 178 | if not ret: 179 | return ret 180 | return self._parse(ret) 181 | 
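The `save()` method above picks the target database and collection purely from the crawled URL. A minimal sketch of that routing for an eastmoney list page (the URL below is illustrative):

    # illustrative URL in the format that gubaEast.py crawls
    url = 'http://guba.eastmoney.com/list,601001,5_1.html'

    host = url.split('/')[2]        # 'guba.eastmoney.com' -> eastmoney branch of save()
    stock_code = url.split(',')[1]  # '601001'             -> database '601001eastmoney'
    flag = url.split(',')[2][0]     # '5'                  -> collection '[date]GuYouHui'
    print('%s %s %s' % (host, stock_code, flag))

Note that `save()` reads the flag from `result['url']` rather than from the task URL, but the comma-separated format is the same.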
-------------------------------------------------------------------------------- /config/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/config/__init__.py -------------------------------------------------------------------------------- /config/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "taskdb": "mongodb+taskdb://127.0.0.1:27017/taskdb", 3 | "projectdb": "mongodb+projectdb://127.0.0.1:27017/projectdb", 4 | "resultdb": "mongodb+resultdb://127.0.0.1:27017/resultdb", 5 | "message_queue": "redis://127.0.0.1:6379/db" 6 | } 7 | -------------------------------------------------------------------------------- /dev/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/dev/__init__.py -------------------------------------------------------------------------------- /dev/check_it_in_hs300.py: -------------------------------------------------------------------------------- 1 | from pymongo import MongoClient 2 | 3 | client = MongoClient() 4 | db = client['stockcodes'] 5 | documents = db.HS300.find() 6 | # read and store hs300 stocks in 'hs300' map 7 | hs300 = {} 8 | for document in documents: 9 | hs300[document['stockcode']] = 1 10 | 11 | it_file = open('IT_code.txt','r') 12 | out = file('IT_unique.txt','a+') 13 | 14 | num = 0 15 | for line in it_file: 16 | line = line.replace('\n','') 17 | line = line.decode('utf8') 18 | if line not in hs300: 19 | print line 20 | out.write(str(line)+'\n') 21 | num+=1 22 | print num -------------------------------------------------------------------------------- /dev/process_it.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | input_file = open('IT.txt','r') 4 | output_file = file('IT.txt','a+') 5 | 6 | for line in input_file: 7 | # output_file.write(re.findall(r'\d{6}',line)[0]) 8 | # output_file.write('\n') 9 | output_file.writelines([item+' ' for item in line.split()[:3]]) 10 | output_file.write('\n') 11 | 12 | input_file.close() 13 | output_file.close() -------------------------------------------------------------------------------- /east_sentiment/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/east_sentiment/__init__.py -------------------------------------------------------------------------------- /east_sentiment/aggregateFactor.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | from pymongo import MongoClient 4 | 5 | def aggregate(stockcode, date): 6 | client = MongoClient() 7 | # 获取昨天的日期 8 | # now_time = datetime.datetime.now() 9 | # yes_time = now_time + datetime.timedelta(days=-1) 10 | # grab_time = yes_time.strftime('%m-%d') 11 | 12 | db = client[stockcode+'eastmoney'] 13 | coll = db[date+'GuYouHui'] 14 | 15 | documents = coll.aggregate( 16 | [ 17 | {"$group": {"_id": "$last_date", "sentiment_factor": {"$sum": "$sentiment_factor"}}} 18 | ] 19 | ) 20 | 21 | coll2 = db[date+'SentimentFactor'] 22 | for result in documents['result']: 23 | coll2.insert_one({ 24 | "sentiment_factor": result['sentiment_factor'], 25 | "last_date": date 26 | }) 27 | 28 | print 
date+stockcode+'SentimentFactor has been aggregated!' 29 | -------------------------------------------------------------------------------- /east_sentiment/dailyResult.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | from pymongo import MongoClient 4 | 5 | 6 | def setDailyResult(stockcode,date,section_name=''): 7 | client = MongoClient() 8 | 9 | db = client[stockcode + 'eastmoney'] 10 | # 获取昨天的日期 11 | # now_time = datetime.datetime.now() 12 | # yes_time = now_time + datetime.timedelta(days=-1) 13 | # grab_time = yes_time.strftime('%m-%d') 14 | 15 | coll = db[date + 'SentimentFactor'] 16 | 17 | cusor = coll.find({"last_date":date}) 18 | for document in cusor: 19 | sentimentFactor = document['sentiment_factor'] 20 | 21 | coll = db[date+'GuYouHui'] 22 | 23 | dailyCounts = coll.count() 24 | 25 | db = client[date+section_name] 26 | try: 27 | db.DailyResult.insert_one( 28 | { 29 | "stock_code": stockcode, 30 | "sentiment_factor": sentimentFactor, 31 | "daily_counts": dailyCounts 32 | } 33 | ) 34 | except UnboundLocalError,e: 35 | pass 36 | 37 | print date+stockcode+'has been set in daily result!' 38 | 39 | 40 | -------------------------------------------------------------------------------- /east_sentiment/outputResult.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | 4 | import pymongo 5 | import pandas as pd 6 | import xlwt 7 | from pymongo import MongoClient 8 | 9 | def getDailyResult(date,section_name=''): 10 | client = MongoClient() 11 | # now_time = datetime.datetime.now() 12 | # yes_time = now_time + datetime.timedelta(days=-1) 13 | # grab_time = yes_time.strftime('%m-%d') 14 | db = client[date+section_name] 15 | 16 | 17 | # 找到前20名 18 | threshold = 20 19 | wb = xlwt.Workbook() 20 | ws = wb.add_sheet('InputSheet') 21 | ws.write(0, 0, 'positive') 22 | ws.write(0, 1, 'value') 23 | ws.write(0, 2, 'negative') 24 | ws.write(0, 3, 'value') 25 | ws.write(0, 4, 'hottest') 26 | ws.write(0, 5, 'numPosts') 27 | ws.write(0, 6, 'positive_hottest') 28 | ws.write(0, 7, 'value') 29 | 30 | documents = db.DailyResult.find().sort([ 31 | ("sentiment_factor", pymongo.DESCENDING) 32 | ]).limit(threshold) 33 | 34 | # 写入前2列 35 | i = 0 36 | for document in documents: 37 | # print document 38 | ws.write(i + 1, 0, str(document['stock_code'])) 39 | ws.write(i + 1, 1, document['sentiment_factor']) 40 | i += 1 41 | 42 | # 写入后2列 43 | documents = db.DailyResult.find().sort([ 44 | ("sentiment_factor", pymongo.ASCENDING) 45 | ]).limit(threshold) 46 | 47 | i = 0 48 | for document in documents: 49 | ws.write(i + 1, 2, str(document['stock_code'])) 50 | ws.write(i + 1, 3, document['sentiment_factor']) 51 | i += 1 52 | 53 | # 写入最热门的股票 54 | # 写入最热门股票每天的帖子数 55 | documents = db.DailyResult.find().sort([ 56 | ("daily_counts", pymongo.DESCENDING) 57 | ]).limit(threshold) 58 | 59 | i = 0 60 | for document in documents: 61 | ws.write(i + 1, 4, str(document['stock_code'])) 62 | ws.write(i + 1, 5, str(document['daily_counts'])) 63 | i += 1 64 | 65 | 66 | # 对最热门的股票按照情感值递减排序 67 | documents = db.DailyResult.aggregate([ 68 | {"$sort" : {"daily_counts":pymongo.DESCENDING}}, 69 | {"$limit" : threshold}, 70 | {"$sort" : {"sentiment_factor":pymongo.DESCENDING}} 71 | ]) 72 | 73 | i = 0 74 | for document in documents['result']: 75 | # print document 76 | ws.write(i + 1, 6, str(document['stock_code'])) 77 | ws.write(i + 1, 7, str(document['sentiment_factor'])) 78 | i += 1 79 | 80 | # 保存 81 | 
wb.save('data/'+date+section_name + 'result' + '.xls') 82 | 83 | print date+section_name+'result.xlsx'+'has been produced!' 84 | 85 | def getDailyStockInfo(stockcode, date,type='sentiment'): 86 | client = MongoClient() 87 | db = client[date] 88 | cursor = db.DailyResult.find({"stock_code":stockcode}) 89 | sentiment = None 90 | posts = None 91 | for document in cursor: 92 | if type == 'sentiment': 93 | sentiment = document['sentiment_factor'] 94 | else: 95 | posts = document['daily_counts'] 96 | if type == 'sentiment': 97 | return sentiment 98 | else: 99 | return posts 100 | 101 | 102 | def getDailyAttachment(date): 103 | df = pd.read_excel("data/"+date+"result.xls", 104 | converters={'positive':str,'negative':str,'hottest':str}) 105 | positive_stockcodes = df['positive'].tolist() 106 | negative_stockcodes = df['negative'].tolist() 107 | hottest_stockcodes = df['hottest'].tolist() 108 | 109 | dict_pos = getResultDict(positive_stockcodes,date) 110 | dict_neg = getResultDict(negative_stockcodes,date) 111 | dict_hot = getResultDict(hottest_stockcodes,date,type='hottest') 112 | 113 | writer = pd.ExcelWriter(date + 'attachment.xls') 114 | 115 | table_pos = pd.DataFrame(dict_pos) 116 | table_neg = pd.DataFrame(dict_neg) 117 | table_hot = pd.DataFrame(dict_hot) 118 | 119 | table_pos.to_excel(writer, 'positive') 120 | table_neg.to_excel(writer, 'negative') 121 | table_hot.to_excel(writer, 'hottest') 122 | 123 | writer.save() 124 | 125 | def getResultDict(codes,date,type='sentiment'): 126 | result_dict = {'stockcodes':codes} 127 | now_time = datetime.datetime.strptime(date, '%m-%d') 128 | for i in range(20): 129 | yes_time = now_time + datetime.timedelta(days=-i) 130 | yes_time = yes_time.strftime('%m-%d') 131 | col_sentiments = [] 132 | for stockcode in codes: 133 | dailysentiment = getDailyStockInfo(stockcode, yes_time,type=type) 134 | col_sentiments.append(dailysentiment) 135 | result_dict[yes_time] = col_sentiments 136 | return result_dict -------------------------------------------------------------------------------- /east_sentiment/produceFactor.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from pymongo import * 3 | from snownlp import SnowNLP 4 | import pymongo 5 | 6 | # connecting database 601001eastmoney 7 | def getSentimentFactor(stockcode,date): 8 | client = MongoClient() 9 | 10 | # now_time = datetime.datetime.now() 11 | # yes_time = now_time + datetime.timedelta(days=-1) 12 | # grab_time = yes_time.strftime('%m-%d') 13 | 14 | db = client[stockcode+'eastmoney'] 15 | 16 | # sort the collection GuYouHui and point to it 17 | coll = db[date+'GuYouHui'] 18 | 19 | documents = coll.find().sort([ 20 | ("created_at", pymongo.DESCENDING) 21 | ]) 22 | 23 | for document in documents: 24 | 25 | create_date = document['create'][:10] 26 | last = document['last'][:5] 27 | 28 | if (document['text'] != ''): 29 | s = SnowNLP(document['text']) 30 | sentiment = s.sentiments 31 | sentiment_factor = (sentiment) * int(document['read']) 32 | else: 33 | sentiment,sentiment_factor = 0,0 34 | coll.update({"_id":document['_id']},{"$set":{"sentiment":sentiment,"sentiment_factor":sentiment_factor}}) 35 | 36 | print document['_id'] 37 | -------------------------------------------------------------------------------- /east_sentiment/sendMail.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import smtplib 3 | from email import encoders, MIMEMultipart 4 | from email.mime.base import MIMEBase 5 | from 
email.mime.text import MIMEText 6 | 7 | import datetime 8 | 9 | import pandas as pd 10 | 11 | 12 | def send(date,section_list): 13 | _user = "1971990184@qq.com" 14 | _pwd = "rxdaltsmieszgfje" 15 | # _to = ['henry.duye@gmail.com','qliu.net@gmail.com','380312089@qq.com'] 16 | 17 | # test 18 | _to = ['ryuanhang@gmail.com'] 19 | 20 | # now_time = datetime.datetime.now() 21 | # yes_time = now_time + datetime.timedelta(days=-1) 22 | # grab_time = yes_time.strftime('%m-%d') 23 | 24 | msg = MIMEMultipart.MIMEMultipart() 25 | # mail's title 26 | msg["Subject"] = date+"Result" 27 | msg["From"] = _user 28 | msg["To"] = ",".join(_to) 29 | 30 | # add content for the mail 31 | msg_content = '' 32 | for section_name in section_list: 33 | msg_content += excel2str(date, section_name=section_name) 34 | msg_content += '\n' 35 | 36 | msg.attach(MIMEText(msg_content, 'plain', 'utf-8')) 37 | 38 | #with open(date+'attachment.xls', 'rb') as f: 39 | # # 设置附件的MIME和文件名,这里是png类型: 40 | # mime = MIMEBase('file', 'xls', filename=date+'attachment.xls') 41 | # # 加上必要的头信息: 42 | # mime.add_header('Content-Disposition', 'attachment', filename=date+'attachment.xls') 43 | # mime.add_header('Content-ID', '<0>') 44 | # mime.add_header('X-Attachment-Id', '0') 45 | # # 把附件的内容读进来: 46 | # mime.set_payload(f.read()) 47 | # # 用Base64编码: 48 | # encoders.encode_base64(mime) 49 | # # 添加到MIMEMultipart: 50 | # msg.attach(mime) 51 | 52 | try: 53 | s = smtplib.SMTP_SSL("smtp.qq.com", 465) 54 | s.login(_user, _pwd) 55 | s.sendmail(_user, _to, msg.as_string()) 56 | s.quit() 57 | print "Succeed in sending mail!" 58 | except smtplib.SMTPException, e: 59 | print "Falied,%s" % e 60 | 61 | def excel2str(date,section_name=''): 62 | df = pd.read_excel("data/" + date+section_name + "result.xls", 63 | converters={'positive': str, 'negative': str, 'hottest': str},header=None) 64 | string = '' 65 | for list in df.values: 66 | for item in list: 67 | string += str(item)+' ' 68 | string += "\n" 69 | return string 70 | -------------------------------------------------------------------------------- /example/ApSchedulerExample/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/example/ApSchedulerExample/__init__.py -------------------------------------------------------------------------------- /example/ApSchedulerExample/processpoll.py: -------------------------------------------------------------------------------- 1 | """ 2 | Demonstrates how to schedule a job to be run in a process pool on 3 second intervals. 3 | """ 4 | 5 | from datetime import datetime 6 | import os 7 | 8 | from apscheduler.schedulers.blocking import BlockingScheduler 9 | 10 | 11 | def tick(): 12 | print('Tick! 
The time is: %s' % datetime.now()) 13 | 14 | 15 | if __name__ == '__main__': 16 | scheduler = BlockingScheduler() 17 | scheduler.add_executor('processpool') 18 | scheduler.add_job(tick, 'interval', seconds=3) 19 | print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C')) 20 | 21 | try: 22 | scheduler.start() 23 | except (KeyboardInterrupt, SystemExit): 24 | pass -------------------------------------------------------------------------------- /example/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/example/__init__.py -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # coding:utf8 2 | import datetime 3 | import os 4 | import time 5 | from apscheduler.schedulers.blocking import BlockingScheduler 6 | from pymongo import MongoClient 7 | 8 | from east_sentiment import dailyResult, sendMail 9 | from east_sentiment import produceFactor,aggregateFactor, outputResult 10 | from tools import mongotool 11 | 12 | stockCodes = [] 13 | # stockCodes = ['000001','000002','000009','000046','000063','000069','000333','000402','000559','000568','000623','000630','000651','000686','000712','000725','000738','000776','000783','000792','000793', 14 | # '000800','000858','000895','000898','000937','000999','002007','002065','002142','002236','002292','002294','002385','002465','002736','300015','300017','300024','300058','300070'] 15 | 16 | # stockCodes = ['000001','000002','000009','000027','000039','000046','000060','000061','000063','000069'] 17 | client = MongoClient() 18 | db = client['stockcodes'] 19 | documents = db.HS300.find() 20 | 21 | for document in documents: 22 | stockCodes.append(document['stockcode']) 23 | 24 | # append IT stock codes 25 | IT_stockCodes = [] 26 | documents_IT = db.IT.find() 27 | for document in documents_IT: 28 | IT_stockCodes.append(document['stockcode']) 29 | 30 | if __name__ == '__main__': 31 | 32 | while True: 33 | t0 = time.clock() 34 | 35 | now_time = datetime.datetime.now() 36 | yes_time = now_time + datetime.timedelta(days=-1) 37 | grab_time = yes_time.strftime('%m-%d') 38 | for stockCode in stockCodes: 39 | # 一个帖子的 40 | produceFactor.getSentimentFactor(stockCode, grab_time) 41 | # 一个股票的 42 | aggregateFactor.aggregate(stockCode, grab_time) 43 | # 一天的 44 | dailyResult.setDailyResult(stockCode, grab_time) 45 | 46 | outputResult.getDailyResult(grab_time) 47 | 48 | # For IT stocks 49 | 50 | for stockCode in IT_stockCodes: 51 | # 一个帖子的 52 | produceFactor.getSentimentFactor(stockCode, grab_time) 53 | # 一个股票的 54 | aggregateFactor.aggregate(stockCode, grab_time) 55 | # 一天的 56 | dailyResult.setDailyResult(stockCode, grab_time, section_name='IT') 57 | 58 | outputResult.getDailyResult(grab_time, section_name='IT') 59 | 60 | sendMail.send(grab_time, section_list=['', 'IT']) 61 | 62 | # dump and drop part 63 | client = MongoClient() 64 | db = client.taskdb 65 | db.east.drop() 66 | print 'east collection has been dropped!' 
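        # Housekeeping, as described in the README: dropping taskdb.east above lets pyspider
        # crawl the same pages again tomorrow; mongotool.dump()/drop() below back up and then
        # delete the databases from 5 days ago; the final mv publishes the xls for the Android
        # app and assumes a web root at /var/www/html exists.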
67 | 68 | mongotool.dump() 69 | mongotool.drop() 70 | os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html') 71 | 72 | t1 = time.clock() 73 | 74 | time.sleep(86400-(t1-t0)) -------------------------------------------------------------------------------- /script/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/script/__init__.py -------------------------------------------------------------------------------- /script/gubaEast.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | # Created on 2016-02-02 11:38:51 4 | # Project: HelloWorld 5 | import re 6 | 7 | import datetime 8 | from pyspider.libs.base_handler import * 9 | from lxml import etree 10 | from pymongo import * 11 | 12 | #num = 0 13 | class Handler(BaseHandler): 14 | crawl_config = { 15 | } 16 | 17 | def __init__(self): 18 | client = MongoClient() 19 | db = client['stockcodes'] 20 | self.StockCodes = [] 21 | documents = db.HS300.find() 22 | for document in documents: 23 | self.StockCodes.append(document['stockcode']) 24 | 25 | # Add IT stock codes 26 | documents_it = db.IT.find() 27 | for document in documents_it: 28 | self.StockCodes.append(document['stockcode']) 29 | 30 | @every(minutes = 24*60) 31 | def on_start(self): 32 | 33 | for stockcode in self.StockCodes: 34 | self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',5_1.html', callback=self.index_page) 35 | # self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',1,'+'f_1.html', callback=self.index_page) 36 | # self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',2,'+'f_1.html', callback=self.index_page) 37 | # self.crawl('http://guba.eastmoney.com/list,' + stockcode + ',3,'+'f_1.html', callback=self.index_page) 38 | 39 | @config(age = 60*60*24) 40 | def index_page(self, response): 41 | selector = etree.HTML(response.text) 42 | content_field = selector.xpath('//*[@id="articlelistnew"]/div[starts-with(@class,"articleh")]') 43 | # 获取昨天时间,用于抓取 44 | now_time = datetime.datetime.now() 45 | yes_time = now_time + datetime.timedelta(days=-1) 46 | grab_time = yes_time.strftime('%m-%d') 47 | first = True 48 | flag = False 49 | 50 | # 提取每一页的所有帖子 51 | for each in content_field: 52 | last = each.xpath('span[6]/text()')[0] 53 | last_time = last[:5] 54 | # 根据时间来判断 55 | if grab_time != last_time: 56 | if first != True: 57 | flag = True 58 | break 59 | continue 60 | first = False 61 | item = {} 62 | read = each.xpath('span[1]/text()')[0] 63 | comment = each.xpath('span[2]/text()')[0] 64 | title = each.xpath('span[3]/a/text()')[0] 65 | # author容易因为结构出现异常 66 | if each.xpath('span[4]/a/text()'): 67 | author = each.xpath('span[4]/a/text()')[0] 68 | else: 69 | author = each.xpath('span[4]/span/text()')[0] 70 | 71 | # date = each.xpath('span[5]/text()')[0] 72 | 73 | address = each.xpath('span[3]/a/@href')[0] 74 | baseUrl = 'http://guba.eastmoney.com' 75 | Url = baseUrl + address 76 | # 将数据放入item 77 | item['read'] = read 78 | item['comment'] = comment 79 | item['title'] = title 80 | item['author'] = author 81 | # item['date'] = date 82 | item['last'] = last 83 | item['url'] = response.url 84 | 85 | # 提取内容 86 | self.crawl(Url, callback=self.detail_page, save={'item': item}) 87 | 88 | 89 | 90 | # 生成下一页链接 91 | if not flag: 92 | info = selector.xpath('//*[@id="articlelistnew"]/div[@class="pager"]/span/@data-pager')[0] 93 | List = 
info.split('|') 94 | if int(List[2])*int(List[3]) span').text() 79 | time = re.findall('
(.*?)
',response.text)[0] 80 | #time = selector.xpath('//*[@id="thread"]/div[@class="il_txt"]/div[@class="ilt_panel clearfix"]/div[@class="fl_left iltp_time"]/span/text()') 81 | item['time'] = time 82 | item['tid'] = int(re.findall('tid=(.*?)&bid',response.url)[0]) 83 | return item 84 | 85 | -------------------------------------------------------------------------------- /script/snowball.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | # Created on 2016-02-24 14:54:21 4 | # Project: Snowball 5 | import datetime 6 | import time 7 | import re 8 | from pyspider.libs.base_handler import * 9 | from lxml import etree 10 | 11 | 12 | class Handler(BaseHandler): 13 | crawl_config = { 14 | 'headers' : {'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36','X-Requested-With' : 'XMLHttpRequest','Referer' : 'http://xueqiu.com/S/SH601001','Host' : 'xueqiu.com','cache-control' : 'no-cache'} 15 | } 16 | 17 | def __init__(self): 18 | self.stockcode=['000003'] 19 | 20 | 21 | #这个函数用于产生cookie,为后面做准备 22 | @every(minutes=5) 23 | def on_start(self): 24 | for stockcode in self.stockcode: 25 | self.crawl('http://xueqiu.com/S/'+'SH'+stockcode,callback=self.first_scrape,save = {'stockcode':stockcode}) 26 | 27 | #这个函数用来将on_start函数中产生的cookie传入,用于得到最大的页面数 28 | @config(age=3*60) 29 | def first_scrape(self, response): 30 | List = ['%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB','%E5%85%AC%E5%91%8A','%E7%A0%94%E6%8A%A5'] 31 | 32 | self.crawl('http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + 'user' + '&sort=time&page=' + '1' +'&_=' + str(int(time.time()*1000)),headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.produce_page,save={'stockcode':response.save['stockcode']}) 33 | self.crawl('http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + 'trans' + '&page=' + '1' +'&_=' + str(int(time.time()*1000)),headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.produce_page,save={'stockcode':response.save['stockcode']}) 34 | for module in List: 35 | self.crawl('http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + module + '&page=' + '1' +'&_=' + str(int(time.time()*1000)),headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.produce_page,save={'stockcode':response.save['stockcode']}) 36 | 37 | #这个函数抓取所有板块的有用信息 38 | @config(priority=2) 39 | def produce_page(self, response): 40 | 41 | flag = re.findall('source=(.*?)&',response.url)[0] 42 | 43 | if flag == 'user': 44 | for i in range(1,int(response.json['maxPage']+1)): 45 | 46 | url = 'http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + flag + '&sort=time&page=' + str(i)+'&_=' + str(int(time.time()*1000)) 47 | 48 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 49 | 50 | elif flag == 'trans': 51 | 52 | for i in range(1,int(response.json['maxPage']+1)): 53 | 54 | url = 
'http://xueqiu.com/statuses/search.json?count=10&comment=0&symbol=SH'+response.save['stockcode']+'&hl=0&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 55 | 56 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 57 | 58 | elif flag == '%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 59 | 60 | for i in range(1,int(response.json['maxPage']+1)): 61 | 62 | url = 'http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 63 | 64 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 65 | 66 | elif flag == '%E5%85%AC%E5%91%8A': 67 | 68 | for i in range(1,int(response.json['maxPage']+1)): 69 | 70 | url = 'http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 71 | 72 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 73 | 74 | elif flag == '%E7%A0%94%E6%8A%A5': 75 | 76 | for i in range(1,int(response.json['maxPage']+1)): 77 | 78 | url = 'http://xueqiu.com/statuses/stock_timeline.json?symbol_id=SH'+response.save['stockcode']+'&count=10&source=' + flag + '&page=' + str(i) +'&_=' + str(int(time.time()*1000)) 79 | 80 | self.crawl(url,headers = {'Cookie' : str(response.cookies).replace('\'','').replace(',',';').replace('{','').replace('}','').replace(':','=')},callback = self.deal_page) 81 | 82 | #这个函数用于格式化数据 83 | def deal_page(self,response): 84 | 85 | url = '' 86 | for item in response.url.split('&')[:-2]: 87 | temp = item+'&' 88 | url+=temp 89 | 90 | if re.findall('source=(.*?)&',response.url)[0]=='%E8%87%AA%E9%80%89%E8%82%A1%E6%96%B0%E9%97%BB': 91 | for each in response.json['list']: 92 | self.send_message(self.project_name,{ 93 | 'name' : each['user']['screen_name'], 94 | 'text' : etree.HTML(each['text']).xpath('string(.)'), 95 | 'time' : each['created_at'], 96 | 'title' : each['title'] 97 | }, url = "%s" % (url+str(each['created_at']))) 98 | else: 99 | for each in response.json['list']: 100 | self.send_message(self.project_name,{ 101 | 'name' : each['user']['screen_name'], 102 | 'text' : etree.HTML(each['text']).xpath('string(.)'), 103 | 'time' : each['created_at'], 104 | #'title' : each['title'] 105 | }, url = "%s" % (url+str(each['created_at']))) 106 | #返回数据到数据库 107 | def on_message(self,project,msg): 108 | return msg -------------------------------------------------------------------------------- /set_codes/HS300.txt: -------------------------------------------------------------------------------- 1 | 证券代码 证券简称 2 | 000001.SZ 平安银行 3 | 000002.SZ 万科A 4 | 000009.SZ 中国宝安 5 | 000027.SZ 深圳能源 6 | 000039.SZ 中集集团 7 | 000046.SZ 泛海控股 8 | 000060.SZ 中金岭南 9 | 000061.SZ 农产品 10 | 000063.SZ 中兴通讯 11 | 000069.SZ 华侨城A 12 | 000100.SZ TCL集团 13 | 000156.SZ 华数传媒 14 | 000157.SZ 中联重科 15 | 000166.SZ 申万宏源 16 | 000333.SZ 美的集团 17 | 000338.SZ 潍柴动力 18 | 000400.SZ 许继电气 19 | 000402.SZ 金融街 20 | 000413.SZ 东旭光电 21 | 000415.SZ 渤海金控 22 | 000423.SZ 东阿阿胶 23 | 000425.SZ 徐工机械 24 | 000503.SZ 海虹控股 25 | 000538.SZ 云南白药 26 | 000539.SZ 粤电力A 27 | 000540.SZ 中天城投 28 | 000559.SZ 万向钱潮 29 | 000568.SZ 泸州老窖 30 | 000581.SZ 威孚高科 31 | 000598.SZ 兴蓉环境 32 
| 000623.SZ 吉林敖东 33 | 000625.SZ 长安汽车 34 | 000629.SZ 攀钢钒钛 35 | 000630.SZ 铜陵有色 36 | 000651.SZ 格力电器 37 | 000686.SZ 东北证券 38 | 000709.SZ 河钢股份 39 | 000712.SZ 锦龙股份 40 | 000725.SZ 京东方A 41 | 000728.SZ 国元证券 42 | 000729.SZ 燕京啤酒 43 | 000738.SZ 中航动控 44 | 000750.SZ 国海证券 45 | 000768.SZ 中航飞机 46 | 000776.SZ 广发证券 47 | 000778.SZ 新兴铸管 48 | 000783.SZ 长江证券 49 | 000792.SZ 盐湖股份 50 | 000793.SZ 华闻传媒 51 | 000800.SZ 一汽轿车 52 | 000825.SZ 太钢不锈 53 | 000826.SZ 启迪桑德 54 | 000831.SZ 五矿稀土 55 | 000858.SZ 五粮液 56 | 000876.SZ 新希望 57 | 000883.SZ 湖北能源 58 | 000895.SZ 双汇发展 59 | 000898.SZ 鞍钢股份 60 | 000917.SZ 电广传媒 61 | 000937.SZ 冀中能源 62 | 000963.SZ 华东医药 63 | 000983.SZ 西山煤电 64 | 000999.SZ 华润三九 65 | 001979.SZ 招商蛇口 66 | 002007.SZ 华兰生物 67 | 002008.SZ 大族激光 68 | 002024.SZ 苏宁云商 69 | 002038.SZ 双鹭药业 70 | 002065.SZ 东华软件 71 | 002081.SZ 金螳螂 72 | 002129.SZ 中环股份 73 | 002142.SZ 宁波银行 74 | 002146.SZ 荣盛发展 75 | 002153.SZ 石基信息 76 | 002195.SZ 二三四五 77 | 002202.SZ 金风科技 78 | 002230.SZ 科大讯飞 79 | 002236.SZ 大华股份 80 | 002241.SZ 歌尔声学 81 | 002252.SZ 上海莱士 82 | 002292.SZ 奥飞动漫 83 | 002294.SZ 信立泰 84 | 002304.SZ 洋河股份 85 | 002353.SZ 杰瑞股份 86 | 002375.SZ 亚厦股份 87 | 002385.SZ 大北农 88 | 002399.SZ 海普瑞 89 | 002410.SZ 广联达 90 | 002415.SZ 海康威视 91 | 002422.SZ 科伦药业 92 | 002450.SZ 康得新 93 | 002456.SZ 欧菲光 94 | 002465.SZ 海格通信 95 | 002470.SZ 金正大 96 | 002475.SZ 立讯精密 97 | 002500.SZ 山西证券 98 | 002594.SZ 比亚迪 99 | 002673.SZ 西部证券 100 | 002736.SZ 国信证券 101 | 002739.SZ 万达院线 102 | 300002.SZ 神州泰岳 103 | 300003.SZ 乐普医疗 104 | 300015.SZ 爱尔眼科 105 | 300017.SZ 网宿科技 106 | 300024.SZ 机器人 107 | 300027.SZ 华谊兄弟 108 | 300058.SZ 蓝色光标 109 | 300059.SZ 东方财富 110 | 300070.SZ 碧水源 111 | 300104.SZ 乐视网 112 | 300124.SZ 汇川技术 113 | 300133.SZ 华策影视 114 | 300144.SZ 宋城演艺 115 | 300146.SZ 汤臣倍健 116 | 300251.SZ 光线传媒 117 | 300315.SZ 掌趣科技 118 | 600000.SH 浦发银行 119 | 600005.SH 武钢股份 120 | 600008.SH 首创股份 121 | 600009.SH 上海机场 122 | 600010.SH 包钢股份 123 | 600011.SH 华能国际 124 | 600015.SH 华夏银行 125 | 600016.SH 民生银行 126 | 600018.SH 上港集团 127 | 600019.SH 宝钢股份 128 | 600021.SH 上海电力 129 | 600023.SH 浙能电力 130 | 600027.SH 华电国际 131 | 600028.SH 中国石化 132 | 600029.SH 南方航空 133 | 600030.SH 中信证券 134 | 600031.SH 三一重工 135 | 600036.SH 招商银行 136 | 600038.SH 中直股份 137 | 600048.SH 保利地产 138 | 600050.SH 中国联通 139 | 600060.SH 海信电器 140 | 600066.SH 宇通客车 141 | 600068.SH 葛洲坝 142 | 600085.SH 同仁堂 143 | 600089.SH 特变电工 144 | 600100.SH 同方股份 145 | 600104.SH 上汽集团 146 | 600109.SH 国金证券 147 | 600111.SH 北方稀土 148 | 600115.SH 东方航空 149 | 600118.SH 中国卫星 150 | 600150.SH 中国船舶 151 | 600153.SH 建发股份 152 | 600157.SH 永泰能源 153 | 600166.SH 福田汽车 154 | 600170.SH 上海建工 155 | 600177.SH 雅戈尔 156 | 600188.SH 兖州煤业 157 | 600196.SH 复星医药 158 | 600208.SH 新湖中宝 159 | 600221.SH 海南航空 160 | 600252.SH 中恒集团 161 | 600256.SH 广汇能源 162 | 600271.SH 航天信息 163 | 600276.SH 恒瑞医药 164 | 600309.SH 万华化学 165 | 600315.SH 上海家化 166 | 600317.SH 营口港 167 | 600332.SH 白云山 168 | 600340.SH 华夏幸福 169 | 600350.SH 山东高速 170 | 600352.SH 浙江龙盛 171 | 600362.SH 江西铜业 172 | 600369.SH 西南证券 173 | 600372.SH 中航电子 174 | 600373.SH 中文传媒 175 | 600383.SH 金地集团 176 | 600398.SH 海澜之家 177 | 600406.SH 国电南瑞 178 | 600415.SH 小商品城 179 | 600485.SH 信威集团 180 | 600489.SH 中金黄金 181 | 600518.SH 康美药业 182 | 600519.SH 贵州茅台 183 | 600535.SH 天士力 184 | 600547.SH 山东黄金 185 | 600549.SH 厦门钨业 186 | 600570.SH 恒生电子 187 | 600578.SH 京能电力 188 | 600583.SH 海油工程 189 | 600585.SH 海螺水泥 190 | 600588.SH 用友网络 191 | 600600.SH 青岛啤酒 192 | 600633.SH 浙报传媒 193 | 600637.SH 东方明珠 194 | 600642.SH 申能股份 195 | 600648.SH 外高桥 196 | 600649.SH 城投控股 197 | 600660.SH 福耀玻璃 198 | 600663.SH 陆家嘴 199 | 600674.SH 川投能源 200 | 600688.SH 上海石化 201 | 600690.SH 青岛海尔 202 | 600703.SH 三安光电 203 | 600705.SH 中航资本 204 | 600717.SH 天津港 205 | 600718.SH 
东软集团 206 | 600739.SH 辽宁成大 207 | 600741.SH 华域汽车 208 | 600783.SH 鲁信创投 209 | 600795.SH 国电电力 210 | 600804.SH 鹏博士 211 | 600820.SH 隧道股份 212 | 600827.SH 百联股份 213 | 600837.SH 海通证券 214 | 600839.SH 四川长虹 215 | 600863.SH 内蒙华电 216 | 600867.SH 通化东宝 217 | 600873.SH 梅花生物 218 | 600875.SH 东方电气 219 | 600886.SH 国投电力 220 | 600887.SH 伊利股份 221 | 600893.SH 中航动力 222 | 600895.SH 张江高科 223 | 600900.SH 长江电力 224 | 600958.SH 东方证券 225 | 600959.SH 江苏有线 226 | 600998.SH 九州通 227 | 600999.SH 招商证券 228 | 601006.SH 大秦铁路 229 | 601009.SH 南京银行 230 | 601016.SH 节能风电 231 | 601018.SH 宁波港 232 | 601021.SH 春秋航空 233 | 601088.SH 中国神华 234 | 601098.SH 中南传媒 235 | 601099.SH 太平洋 236 | 601106.SH 中国一重 237 | 601111.SH 中国国航 238 | 601117.SH 中国化学 239 | 601118.SH 海南橡胶 240 | 601158.SH 重庆水务 241 | 601166.SH 兴业银行 242 | 601169.SH 北京银行 243 | 601179.SH 中国西电 244 | 601186.SH 中国铁建 245 | 601198.SH 东兴证券 246 | 601211.SH 国泰君安 247 | 601216.SH 君正集团 248 | 601225.SH 陕西煤业 249 | 601231.SH 环旭电子 250 | 601238.SH 广汽集团 251 | 601258.SH 庞大集团 252 | 601288.SH 农业银行 253 | 601318.SH 中国平安 254 | 601328.SH 交通银行 255 | 601333.SH 广深铁路 256 | 601336.SH 新华保险 257 | 601377.SH 兴业证券 258 | 601390.SH 中国中铁 259 | 601398.SH 工商银行 260 | 601555.SH 东吴证券 261 | 601600.SH 中国铝业 262 | 601601.SH 中国太保 263 | 601607.SH 上海医药 264 | 601608.SH 中信重工 265 | 601618.SH 中国中冶 266 | 601628.SH 中国人寿 267 | 601633.SH 长城汽车 268 | 601668.SH 中国建筑 269 | 601669.SH 中国电建 270 | 601688.SH 华泰证券 271 | 601699.SH 潞安环能 272 | 601718.SH 际华集团 273 | 601727.SH 上海电气 274 | 601766.SH 中国中车 275 | 601788.SH 光大证券 276 | 601800.SH 中国交建 277 | 601808.SH 中海油服 278 | 601818.SH 光大银行 279 | 601857.SH 中国石油 280 | 601866.SH 中海集运 281 | 601872.SH 招商轮船 282 | 601888.SH 中国国旅 283 | 601898.SH 中煤能源 284 | 601899.SH 紫金矿业 285 | 601901.SH 方正证券 286 | 601919.SH 中国远洋 287 | 601928.SH 凤凰传媒 288 | 601933.SH 永辉超市 289 | 601939.SH 建设银行 290 | 601958.SH 金钼股份 291 | 601969.SH 海南矿业 292 | 601985.SH 中国核电 293 | 601988.SH 中国银行 294 | 601989.SH 中国重工 295 | 601991.SH 大唐发电 296 | 601992.SH 金隅股份 297 | 601998.SH 中信银行 298 | 603000.SH 人民网 299 | 603288.SH 海天味业 300 | 603993.SH 洛阳钼业 301 | 603885.SH 吉祥航空 302 | -------------------------------------------------------------------------------- /set_codes/IT_code.txt: -------------------------------------------------------------------------------- 1 | 300597 2 | 603039 3 | 300209 4 | 300598 5 | 300523 6 | 300451 7 | 300561 8 | 600728 9 | 600634 10 | 300277 11 | 300448 12 | 600455 13 | 603636 14 | 002174 15 | 603990 16 | 300579 17 | 300248 18 | 002339 19 | 300044 20 | 300300 21 | 603444 22 | 002065 23 | 600271 24 | 002474 25 | 300096 26 | 300508 27 | 300311 28 | 002230 29 | 300031 30 | 300074 31 | 002439 32 | 002279 33 | 300302 34 | 002362 35 | 600476 36 | 002657 37 | 300168 38 | 300075 39 | 002649 40 | 002063 41 | 600718 42 | 600446 43 | 600570 44 | 300047 45 | 002421 46 | 300002 47 | 600756 48 | 002373 49 | 300045 50 | 300287 51 | 600536 52 | 002771 53 | 601519 54 | 000948 55 | 002268 56 | 300290 57 | 600289 58 | 002232 59 | 002331 60 | 300264 61 | 600410 62 | 300365 63 | 300339 64 | 300033 65 | 300253 66 | 300036 67 | 002410 68 | 300170 69 | 300085 70 | 600571 71 | 300324 72 | 300297 73 | 002405 74 | 300166 75 | 300017 76 | 300231 77 | 600588 78 | 300352 79 | 300465 80 | 300010 81 | 600845 82 | 300348 83 | 300020 84 | 002153 85 | 002178 86 | 603918 87 | 300188 88 | 300182 89 | 300113 90 | 002642 91 | 300245 92 | 002280 93 | 300533 94 | 002253 95 | 300369 96 | 300559 97 | 603189 98 | 300520 99 | 300229 100 | 300377 101 | 002401 102 | 300542 103 | 603258 104 | 300552 105 | 300541 106 | 002558 107 | 300550 108 | 300379 109 | 300386 110 | 300378 111 | 002368 112 | 
300525 113 | 300556 114 | 600767 115 | 300469 116 | 600242 117 | 300467 118 | 300578 119 | 300271 120 | 300518 121 | 300468 122 | 600654 123 | 000711 124 | 603986 125 | 300605 126 | 300603 127 | 603881 128 | 300513 129 | 300212 130 | -------------------------------------------------------------------------------- /set_codes/IT_unique.txt: -------------------------------------------------------------------------------- 1 | 300597 2 | 603039 3 | 300209 4 | 300598 5 | 300523 6 | 300451 7 | 300561 8 | 600728 9 | 600634 10 | 300277 11 | 300448 12 | 600455 13 | 603636 14 | 002174 15 | 603990 16 | 300579 17 | 300248 18 | 002339 19 | 300044 20 | 300300 21 | 603444 22 | 002474 23 | 300096 24 | 300508 25 | 300311 26 | 300031 27 | 300074 28 | 002439 29 | 002279 30 | 300302 31 | 002362 32 | 600476 33 | 002657 34 | 300168 35 | 300075 36 | 002649 37 | 002063 38 | 600446 39 | 300047 40 | 002421 41 | 600756 42 | 002373 43 | 300045 44 | 300287 45 | 600536 46 | 002771 47 | 601519 48 | 000948 49 | 002268 50 | 300290 51 | 600289 52 | 002232 53 | 002331 54 | 300264 55 | 600410 56 | 300365 57 | 300339 58 | 300033 59 | 300253 60 | 300036 61 | 300170 62 | 300085 63 | 600571 64 | 300324 65 | 300297 66 | 002405 67 | 300166 68 | 300231 69 | 300352 70 | 300465 71 | 300010 72 | 600845 73 | 300348 74 | 300020 75 | 002178 76 | 603918 77 | 300188 78 | 300182 79 | 300113 80 | 002642 81 | 300245 82 | 002280 83 | 300533 84 | 002253 85 | 300369 86 | 300559 87 | 603189 88 | 300520 89 | 300229 90 | 300377 91 | 002401 92 | 300542 93 | 603258 94 | 300552 95 | 300541 96 | 002558 97 | 300550 98 | 300379 99 | 300386 100 | 300378 101 | 002368 102 | 300525 103 | 300556 104 | 600767 105 | 300469 106 | 600242 107 | 300467 108 | 300578 109 | 300271 110 | 300518 111 | 300468 112 | 600654 113 | 000711 114 | 603986 115 | 300605 116 | 300603 117 | 603881 118 | 300513 119 | 300212 120 | -------------------------------------------------------------------------------- /set_codes/set_IT.py: -------------------------------------------------------------------------------- 1 | import re 2 | from pymongo import * 3 | # import pymongo 4 | 5 | # this script is used to insert the HS300 stockcodes into MongoDB 6 | client = MongoClient() 7 | db = client['stockcodes'] 8 | 9 | f= open('IT_unique.txt','r') 10 | all_text = f.read() 11 | stockcodes = re.findall('\d{6}',all_text) 12 | for stockcode in stockcodes: 13 | result = db.IT.insert_one({ 14 | "stockcode" : stockcode 15 | }) 16 | print result.inserted_id 17 | f.close() -------------------------------------------------------------------------------- /set_codes/set_hs300.py: -------------------------------------------------------------------------------- 1 | import re 2 | from pymongo import * 3 | # import pymongo 4 | 5 | # this script is used to insert the HS300 stockcodes into MongoDB 6 | client = MongoClient() 7 | db = client['stockcodes'] 8 | 9 | f= open('HS300.txt','r') 10 | all_text = f.read() 11 | stockcodes = re.findall('\d{6}',all_text) 12 | for stockcode in stockcodes: 13 | result = db.HS300.insert_one({ 14 | "stockcode" : stockcode 15 | }) 16 | print result.inserted_id 17 | f.close() -------------------------------------------------------------------------------- /test/Scores.xls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/test/Scores.xls -------------------------------------------------------------------------------- /test/__init__.py: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/test/__init__.py -------------------------------------------------------------------------------- /test/testPandas.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | df = pd.read_excel("../data/07-01result.xls") 4 | stock = df['positive'].tolist() 5 | print stock -------------------------------------------------------------------------------- /test/testPandasExcel.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | data = {'names':['John Doe', 'Zoe McCarty', 'Pam Ferris'], 4 | 'scores': [76, 98, 90]} 5 | table = pd.DataFrame(data) 6 | 7 | writer = pd.ExcelWriter('Scores.xls') 8 | table.to_excel(writer, 'Scores 1') 9 | writer.save() -------------------------------------------------------------------------------- /tools/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryh95/pyspider-stock/8ba2d7964944f7c1baad17af37215152c18b5f5e/tools/__init__.py -------------------------------------------------------------------------------- /tools/drop.py: -------------------------------------------------------------------------------- 1 | from pymongo import MongoClient 2 | 3 | client = MongoClient() 4 | db = client['stockcodes'] 5 | documents = db.HS300.find() 6 | stockcodes = [] 7 | for document in documents: 8 | stockcodes.append(document['stockcode']) 9 | 10 | for stockcode in stockcodes: 11 | client = MongoClient() 12 | client.drop_database(stockcode+'eastmoney') 13 | 14 | -------------------------------------------------------------------------------- /tools/mongotool.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import datetime 4 | from pymongo import MongoClient 5 | 6 | def dump(): 7 | client = MongoClient() 8 | db = client['stockcodes'] 9 | documents = db.HS300.find() 10 | documents_it = db.IT.find() 11 | stockcodes = [] 12 | for document in documents: 13 | stockcodes.append(document['stockcode']) 14 | # add IT stocks 15 | for document in documents_it: 16 | stockcodes.append(document['stockcode']) 17 | 18 | now_time = datetime.datetime.now() 19 | dump_time = now_time + datetime.timedelta(days=-5) 20 | dump_time = dump_time.strftime('%m-%d') 21 | 22 | os.system('mkdir '+dump_time) 23 | 24 | for stockcode in stockcodes: 25 | os.system('mongodump --db '+stockcode+'eastmoney'+' --collection '+dump_time+'GuYouHui') 26 | os.system('mongodump --db '+stockcode+'eastmoney'+' --collection '+dump_time+'SentimentFactor') 27 | os.system('mongodump --db '+dump_time) 28 | os.system('mongodump --db '+dump_time+'IT') 29 | 30 | os.system('mv dump '+dump_time) 31 | 32 | print dump_time+'data has been dumped!' 33 | print dump_time +'IT'+ 'data has been dumped!' 
34 | 35 | def drop(): 36 | client = MongoClient() 37 | db = client['stockcodes'] 38 | documents = db.HS300.find() 39 | documents_it = db.IT.find() 40 | stockcodes = [] 41 | for document in documents: 42 | stockcodes.append(document['stockcode']) 43 | 44 | # add IT stocks 45 | for document in documents_it: 46 | stockcodes.append(document['stockcode']) 47 | 48 | now_time = datetime.datetime.now() 49 | drop_time = now_time + datetime.timedelta(days=-5) 50 | drop_time = drop_time.strftime('%m-%d') 51 | 52 | for stockcode in stockcodes: 53 | db = client[stockcode+'eastmoney'] 54 | coll = db[drop_time+'GuYouHui'] 55 | coll.drop() 56 | coll = db[drop_time+'SentimentFactor'] 57 | coll.drop() 58 | client.drop_database(drop_time) 59 | 60 | client.drop_database(drop_time+'IT') 61 | 62 | print drop_time+'data has been dropped!' 63 | print drop_time+'IT'+'data has been dropped!' --------------------------------------------------------------------------------
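A note on restoring the backups: `dump()` above shells out to `mongodump` and then moves its output under a `<MM-DD>/dump/` directory, so a collection can be restored with `mongorestore` in the same `os.system` style (the stock code, date and paths below are illustrative):

    import os

    # restore one stock's GuYouHui collection from the 07-01 backup produced by dump()
    os.system('mongorestore --db 601001eastmoney --collection 07-01GuYouHui '
              '07-01/dump/601001eastmoney/07-01GuYouHui.bson')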