├── .gitignore
├── README.md
├── contirbutor.txt
├── core
│   ├── __init__.py
│   ├── dispatch_center.py
│   └── scrap.py
├── distribute_ids.pkl
├── main.py
├── scraper
│   ├── __init__.py
│   ├── weibo_scraper.py
│   └── weibo_scraper_m.py
├── settings
│   ├── accounts.py
│   └── config.py
└── utils
    ├── __init__.py
    ├── connection.py
    ├── cookies.py
    └── string.py

/.gitignore:
--------------------------------------------------------------------------------
ghostdriver.log
.idea/
__pycache__/
save_to_distribute.py
test.py
save_distribute.py
contributor.txt

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Weibo Terminator Work Flow

![PicName](http://ofwzcunzi.bkt.clouddn.com/7gOwGp5K4FPWHkHQ.png)

> This project is a reboot of the earlier project, which lives [here](https://github.com/jinfagang/weibo_terminater.git) and will keep being updated. This is the working version of weibo terminator, with a number of optimizations over the previous release. The end goal is to scrape corpora together, for applications such as sentiment analysis, dialogue corpora, public-opinion monitoring and big-data analysis.

# UPDATE 2017-5-16

Updates:
* Adjusted the first-time cookie acquisition logic: the program now exits if it detects no cookies, instead of crashing later when nothing more can be scraped;
* Added the WeiBoScraperM class, which is still under construction; PRs implementing it are welcome. It scrapes from the other Weibo domain, namely the mobile domain;

Please pull to get these updates.

# UPDATE 2017-5-15

After some small fixes and PRs from several contributors, the code has changed slightly, mostly bug fixes and tightened logic:

1. Fixed the saving error; if you cloned the code right after the first push, please pull again;
2. The `WeiboScraper has no attribute weibo_content` error is fixed in the new code;

@Fence submitted a PR with these changes:
1. The fixed 30s rest is now a random interval whose bounds you can set yourself
2. Added big_v_ids_file, a txt file recording the celebrity ids whose fans have already been saved, so contributors can add or remove entries by hand
3. Both scraping functions now resume from page+1, so a resumed run no longer re-scrapes the last page finished in the previous run
4. Changed "save after scraping all weibo and comments of an id" to "save after scraping each weibo and all of its comments"
5. (Optional) Factored the saving code out into its own function, since it is needed in 2 and 3 places respectively

Run `git pull origin master` to get the updated version. You are also welcome to keep asking me for a uuid; I publish the list in `contirbutor.txt` from time to time. I am currently merging, cleaning and categorizing the data, and once the merge is done the large dataset will be distributed to everyone.


# Improve

Improvements over the previous version:

* Far less distraction, straight to the point: given an id, fetch all of that user's weibo, the weibo count, the follower count, the full weibo content and the comments;
* Unlike the previous version, all data is now saved into three pickle files, stored as dictionaries, which makes resuming an interrupted scrape easy;
* Ids that have already been scraped are never scraped twice: the scraper remembers finished ids and marks each id as done once all of its content has been fetched;
* Weibo content and weibo comments are stored separately, so if content scraping is interrupted, the next run does not start over but continues from the page where it stopped;
* Even more important!!! Every id is scraped independently, so you can pull any id's weibo content straight out of the pickle files and process it however you like (a short loading sketch is appended at the end of this document);
* On top of that, a new anti-ban strategy has been tested; the delay mechanism works well, although the scraper still cannot run completely unattended.

**Most important of all!!!** In this version the scraper has become much smarter: while scraping each id it will **automatically collect all of that id's fan ids!!**
In effect, the ids I hand out are seed ids (celebrities, companies and big media accounts), and from those seeds you can obtain thousands upon thousands of further ids!!
If a celebrity has 34,000 fans, the first pass already gives you 34,000 ids; keep scraping from those child ids, and if each child id has 100 fans, the second pass gives you 3.4 million ids!!! Is that enough?!!! Of course not!!!

**This project will never stop!!!** It will keep going until we have harvested enough corpus!!!

(In practice we cannot get every fan, but this is already plenty.)

![PicName](http://ofwzcunzi.bkt.clouddn.com/lqcx6MLSdS8whJVt.png)

# Work Flow

This version is aimed at contributors, and the workflow is very simple:

1. Get a uuid. Each uuid maps to 2-3 ids in distribute_ids.pkl; these are our seed ids. You could grab all the ids directly, but to avoid duplicated work, please request a uuid from me and take care of only your share. Once you finish scraping, send the resulting files back to me; after I deduplicate and merge everything, the final large corpus will be handed back to all contributors.
2. Run `python3 main.py uuid`. Note that fan ids are only scraped after the ids assigned to your uuid are finished;
3. Done!

# Discuss

As before, here are the discussion groups; everyone is welcome to join:
```
QQ
AI智能自然语言处理: 476464663
Tensorflow智能聊天Bot: 621970965
GitHub深度学习开源交流: 263018023
```
You can also add me as a friend on WeChat: jintianiloveu

# Copyright

```
(c) 2017 Jin Fagang & Tianmu Inc.
& weibo_terminator authors LICENSE Apache 2.0 76 | ``` 77 | -------------------------------------------------------------------------------- /contirbutor.txt: -------------------------------------------------------------------------------- 1 | https://github.com/MichaelFeng87 s00002 2 | https://github.com/jiang20160402 s00001 3 | https://github.com/richard1225 s00003 4 | https://github.com/little1tow s00004 5 | https://github.com/Linlinlin111 s00005 6 | https://github.com/Fence m00001 7 | https://github.com/Da-Capo m00002 8 | https://github.com/liangWenPeng m00002 9 | https://github.com/af1ynch m00003 10 | https://github.com/elviswxy m00005 11 | https://github.com/liuyijiang1994 c00001 12 | -------------------------------------------------------------------------------- /core/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/core/__init__.py -------------------------------------------------------------------------------- /core/dispatch_center.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: dispatch_center.py 3 | # author: JinTian 4 | # time: 13/04/2017 9:52 AM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 
18 | # ------------------------------------------------------------------------ 19 | from scraper.weibo_scraper import WeiBoScraper 20 | from settings.config import COOKIES_SAVE_PATH 21 | from settings.accounts import accounts 22 | import os 23 | from utils.cookies import get_cookie_from_network 24 | import pickle 25 | 26 | 27 | class Dispatcher(object): 28 | """ 29 | Dispatch center, if your cookies is out of date, 30 | set update_cookies to True to update all accounts cookies 31 | """ 32 | 33 | def __init__(self, id_file_path, mode, uid, filter_flag=0, update_cookies=False): 34 | self.mode = mode 35 | self.filter_flag = filter_flag 36 | self.update_cookies = update_cookies 37 | 38 | self._init_accounts_cookies() 39 | self._init_accounts() 40 | 41 | if self.mode == 'single': 42 | self.user_id = uid 43 | elif self.mode == 'multi': 44 | self.id_file_path = id_file_path 45 | else: 46 | raise Exception('mode option only support single and multi') 47 | 48 | def execute(self): 49 | if self.mode == 'single': 50 | self._init_single_mode() 51 | elif self.mode == 'multi': 52 | self._init_multi_mode() 53 | else: 54 | raise Exception('mode option only support single and multi') 55 | 56 | def _init_accounts_cookies(self): 57 | """ 58 | get all cookies for accounts, dump into pkl, this will only run once, if 59 | you update accounts, set update to True 60 | :return: 61 | """ 62 | if self.update_cookies: 63 | for account in accounts: 64 | print('preparing cookies for account {}'.format(account)) 65 | get_cookie_from_network(account['id'], account['password']) 66 | print('all accounts getting cookies finished. starting scrap..') 67 | else: 68 | if os.path.exists(COOKIES_SAVE_PATH): 69 | pass 70 | else: 71 | for account in accounts: 72 | print('preparing cookies for account {}'.format(account)) 73 | get_cookie_from_network(account['id'], account['password']) 74 | print('all accounts getting cookies finished. starting scrap..') 75 | 76 | def _init_accounts(self): 77 | """ 78 | setting accounts 79 | :return: 80 | """ 81 | try: 82 | with open(COOKIES_SAVE_PATH, 'rb') as f: 83 | cookies_dict = pickle.load(f) 84 | self.all_accounts = list(cookies_dict.keys()) 85 | print('----------- detected {} accounts, weibo_terminator will using all accounts to scrap ' 86 | 'automatically -------------'.format(len(self.all_accounts))) 87 | print('detected accounts: ', self.all_accounts) 88 | except Exception as e: 89 | print(e) 90 | print('error, not find cookies file.') 91 | 92 | def _init_single_mode(self): 93 | scraper = WeiBoScraper(using_account=self.all_accounts[0], uuid=self.user_id, filter_flag=self.filter_flag) 94 | i = 1 95 | while True: 96 | result = scraper.crawl() 97 | if result: 98 | print('finished!!!') 99 | break 100 | else: 101 | if i >= len(self.all_accounts): 102 | print('scrap not finish, account resource run out. update account, move on scrap.') 103 | break 104 | else: 105 | scraper.switch_account(self.all_accounts[i]) 106 | i += 1 107 | print('account {} being banned or error weibo is none for current user id, switch to {}..'.format( 108 | self.all_accounts[i - 1], self.all_accounts[i])) 109 | 110 | def _init_multi_mode(self): 111 | pass 112 | 113 | 114 | -------------------------------------------------------------------------------- /core/scrap.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: scrap.py 3 | # author: JinTian 4 | # time: 10/05/2017 10:38 PM 5 | # Copyright 2017 JinTian. All Rights Reserved. 
6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | from scraper.weibo_scraper import WeiBoScraper 20 | from utils.cookies import get_cookie_from_network 21 | from settings.config import * 22 | import pickle 23 | from settings.accounts import accounts 24 | 25 | 26 | def init_accounts_cookies(): 27 | if os.path.exists(COOKIES_SAVE_PATH): 28 | with open(COOKIES_SAVE_PATH, 'rb') as f: 29 | cookies_dict = pickle.load(f) 30 | return list(cookies_dict.keys()) 31 | else: 32 | for account in accounts: 33 | print('preparing cookies for account {}'.format(account)) 34 | get_cookie_from_network(account['id'], account['password']) 35 | print('checking account validation...') 36 | valid_accounts = get_valid_accounts() 37 | 38 | if len(valid_accounts) == len(accounts): 39 | print('all accounts checked valid... start scrap') 40 | return valid_accounts 41 | elif len(valid_accounts) < 1: 42 | print('error, not find valid accounts, please check accounts.') 43 | exit(0) 44 | elif len(valid_accounts) > 1: 45 | print('find valid accounts: ', valid_accounts) 46 | print('starting scrap..') 47 | return valid_accounts 48 | 49 | 50 | def get_valid_accounts(): 51 | with open(COOKIES_SAVE_PATH, 'rb') as f: 52 | cookies_dict = pickle.load(f) 53 | return list(cookies_dict.keys()) 54 | 55 | 56 | def get_cookies_by_account(account_id): 57 | with open(COOKIES_SAVE_PATH, 'rb') as f: 58 | cookies_dict = pickle.load(f) 59 | return cookies_dict[account_id] 60 | 61 | 62 | def scrap(scrap_id): 63 | """ 64 | scrap a single id 65 | :return: 66 | """ 67 | valid_accounts = init_accounts_cookies() 68 | print('valid accounts: ', valid_accounts) 69 | 70 | # TODO currently only using single account, multi accounts using multi thread maybe quicker but seems like a mess 71 | # TODO maybe will adding multi thread feature when our code comes steady 72 | # so that maybe need to manually change accounts when one account being baned 73 | account_id = valid_accounts[0] 74 | print('using accounts: ', account_id) 75 | 76 | cookies = get_cookies_by_account(account_id) 77 | 78 | # alternative this scraper can changed into WeiBoScraperM in the future which scrap from http://m.weibo.cn 79 | scraper = WeiBoScraper(account_id, scrap_id, cookies) 80 | scraper.crawl() 81 | 82 | 83 | def main(args): 84 | scrap_id = args 85 | scrap(scrap_id) 86 | -------------------------------------------------------------------------------- /distribute_ids.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/distribute_ids.pkl -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: main.py 3 | # author: JinTian 4 | # time: 13/04/2017 10:01 AM 5 | # 
Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | """ 20 | weibo_terminator_workflow 21 | 22 | run main.py will do: 23 | 1. first run will scrap distribute ids, every id finished all task will mark as done. 24 | 2. after all distribute ids were done, will scrap all fans id. 25 | 26 | """ 27 | import argparse 28 | 29 | from core.dispatch_center import Dispatcher 30 | from settings.config import * 31 | from utils.string import is_valid_id 32 | from core.scrap import scrap 33 | import sys 34 | import pickle 35 | 36 | 37 | def mission(distribute_uuid=None): 38 | """ 39 | mission for workflow. 40 | this method is very simple. 41 | for the first run, it will scrap distribute ids. you just get distribute_uuid from wechat 42 | `jintianiloveu` who is administrator of this project. get your uuid and paste it into mission param 43 | 44 | 45 | this will get 2-5 ids from distribute_ids.pkl, every uuid got ids are different. 46 | After mission complete scrap, continue scrap fans_ids.pkl which contains many many ids. as possiable as you 47 | can to scrap those fans ids. 48 | :return: 49 | """ 50 | scrap('3879293449') 51 | if os.path.exists(DISTRIBUTE_IDS): 52 | print('find distribute ids from {}'.format(DISTRIBUTE_IDS)) 53 | with open(DISTRIBUTE_IDS, 'rb') as f: 54 | distribute_dict = pickle.load(f) 55 | if os.path.exists(SCRAPED_MARK): 56 | finished_ids = pickle.load(open(SCRAPED_MARK, 'rb')) 57 | else: 58 | finished_ids = [] 59 | try: 60 | mission_ids = distribute_dict[distribute_uuid] 61 | 62 | if len([i for i in mission_ids if i in finished_ids]) == len(mission_ids): 63 | print('Good Done!!! Mission Complete!!') 64 | print('now will continue scrap fans_ids.pkl file.') 65 | 66 | fans_id_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_fans.pkl') 67 | if os.path.exists(fans_id_file): 68 | fans_id = pickle.load(open(fans_id_file, 'rb')) 69 | for fd in fans_id: 70 | scrap(fd) 71 | else: 72 | for md in mission_ids: 73 | scrap(md) 74 | except Exception as e: 75 | print(e) 76 | print('distribute uuid invalid.') 77 | 78 | 79 | def scrap_single(sid): 80 | """ 81 | this method scrap single id. 82 | For you want scrap you own ids. 
just send it here 83 | :param sid: 84 | :return: 85 | """ 86 | scrap(sid) 87 | 88 | 89 | if __name__ == '__main__': 90 | if len(sys.argv) < 2: 91 | print('run as python3 main.py your-uuid, you can get uuid via wechat `jintianiloveu`.') 92 | else: 93 | mission(sys.argv[1]) 94 | -------------------------------------------------------------------------------- /scraper/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/scraper/__init__.py -------------------------------------------------------------------------------- /scraper/weibo_scraper.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: sentence_similarity.py 3 | # author: JinTian 4 | # time: 24/03/2017 6:46 PM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | """ 20 | using guide: 21 | setting accounts first: 22 | 23 | under: weibo_terminator/settings/accounts.py 24 | you can set more than one accounts, WT will using all accounts one by one, 25 | if one banned, another will move on. 26 | 27 | if you care about security, using subsidiary accounts instead. 
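for illustration, the accounts list in settings/accounts.py has this shape (the id and password values here are placeholders, not real accounts):

    accounts = [
        {'id': 'your_weibo_account', 'password': 'your_password'},
        {'id': 'a_spare_account', 'password': 'its_password'},
    ]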
28 | 29 | """ 30 | import re 31 | import sys 32 | import os 33 | import requests 34 | from lxml import etree 35 | import traceback 36 | from settings.config import COOKIES_SAVE_PATH 37 | import pickle 38 | import time 39 | from utils.string import is_number 40 | import numpy as np 41 | from settings.config import * 42 | import logging 43 | 44 | 45 | class WeiBoScraper(object): 46 | def __init__(self, using_account, scrap_id, cookies, filter_flag=0): 47 | self.using_account = using_account 48 | self.cookies = cookies 49 | 50 | self._init_cookies() 51 | self._init_headers() 52 | 53 | self.scrap_id = scrap_id 54 | self.filter = filter_flag 55 | self.user_name = '' 56 | self.weibo_num = 0 57 | self.weibo_scraped = 0 58 | self.following = 0 59 | self.followers = 0 60 | self.weibo_content = [] 61 | self.num_zan = [] 62 | self.num_forwarding = [] 63 | self.num_comment = [] 64 | self.weibo_detail_urls = [] 65 | self.rest_time = 20 66 | self.rest_min_time = 20 # create random rest time 67 | self.rest_max_time = 30 68 | 69 | self.weibo_content_save_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_content.pkl') 70 | self.weibo_content_and_comment_save_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_content_and_comment.pkl') 71 | self.weibo_fans_save_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_fans.pkl') 72 | self.big_v_ids_file = os.path.join(CORPUS_SAVE_DIR, 'big_v_ids.txt') 73 | 74 | def _init_cookies(self): 75 | cookie = { 76 | "Cookie": self.cookies 77 | } 78 | self.cookie = cookie 79 | 80 | def _init_headers(self): 81 | """ 82 | avoid span 83 | :return: 84 | """ 85 | headers = requests.utils.default_headers() 86 | user_agent = { 87 | 'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0' 88 | } 89 | headers.update(user_agent) 90 | print('headers: ', headers) 91 | self.headers = headers 92 | 93 | def jump_scraped_id(self): 94 | """ 95 | check mark scrapd ids, jump scraped one. 
96 | :return: 97 | """ 98 | if os.path.exists(SCRAPED_MARK) and os.path.getsize(SCRAPED_MARK) > 0: 99 | with open(SCRAPED_MARK, 'rb') as f: 100 | scraped_ids = pickle.load(f) 101 | if self.scrap_id in scraped_ids: 102 | return True 103 | else: 104 | return False 105 | else: 106 | return False 107 | 108 | def crawl(self): 109 | # this is the most time-cost part, we have to catch errors, return to dispatch center 110 | if self.jump_scraped_id(): 111 | print('scrap id {} already scraped, directly pass it.'.format(self.scrap_id)) 112 | else: 113 | try: 114 | self._get_html() 115 | self._get_user_name() 116 | self._get_user_info() 117 | 118 | self._get_fans_ids() 119 | self._get_weibo_content() 120 | self._get_weibo_content_and_comment() 121 | return True 122 | except Exception as e: 123 | logging.error(e) 124 | print('some error above not catch, return to dispatch center, resign for new account..') 125 | return False 126 | 127 | def _get_html(self): 128 | try: 129 | url = 'http://weibo.cn/%s?filter=%s&page=1' % (self.scrap_id, self.filter) 130 | print(url) 131 | self.html = requests.get(url, cookies=self.cookie, headers=self.headers).content 132 | print('success load html..') 133 | except Exception as e: 134 | print(e) 135 | 136 | def _get_user_name(self): 137 | print('\n' + '-' * 30) 138 | print('getting user name') 139 | try: 140 | selector = etree.HTML(self.html) 141 | self.user_name = selector.xpath('//table//div[@class="ut"]/span[1]/text()')[0] 142 | print('current user name is: {}'.format(self.user_name)) 143 | except Exception as e: 144 | print(e) 145 | print('html not properly loaded, maybe cookies out of date or account being banned. ' 146 | 'change an account please') 147 | exit() 148 | 149 | def _get_user_info(self): 150 | print('\n' + '-' * 30) 151 | print('getting user info') 152 | selector = etree.HTML(self.html) 153 | pattern = r"\d+\.?\d*" 154 | str_wb = selector.xpath('//span[@class="tc"]/text()')[0] 155 | guid = re.findall(pattern, str_wb, re.S | re.M) 156 | for value in guid: 157 | num_wb = int(value) 158 | break 159 | self.weibo_num = num_wb 160 | 161 | str_gz = selector.xpath("//div[@class='tip2']/a/text()")[0] 162 | guid = re.findall(pattern, str_gz, re.M) 163 | self.following = int(guid[0]) 164 | 165 | str_fs = selector.xpath("//div[@class='tip2']/a/text()")[1] 166 | guid = re.findall(pattern, str_fs, re.M) 167 | self.followers = int(guid[0]) 168 | print('current user all weibo num {}, following {}, followers {}'.format(self.weibo_num, self.following, 169 | self.followers)) 170 | 171 | def _get_fans_ids(self): 172 | """ 173 | this method will execute to scrap scrap_user's fans, 174 | which means every time you scrap an user, you will get a bunch of fans ids, 175 | more importantly, in this fans ids have no repeat one. 176 | 177 | BEWARE THAT: this method will not execute if self.followers < 200, you can edit this 178 | value. 
179 | :return: 180 | """ 181 | print('\n' + '-' * 30) 182 | print('getting fans ids...') 183 | if os.path.exists(self.big_v_ids_file): 184 | with open(self.big_v_ids_file) as f: 185 | big_v_ids = f.read().split('\n') 186 | if self.scrap_id in big_v_ids: 187 | print('Fans ids of big_v {} (id {}) have been saved before.'.format( 188 | self.user_name, self.scrap_id)) 189 | return 190 | print(self.followers) 191 | if self.followers < 200: 192 | pass 193 | else: 194 | fans_ids = [] 195 | if os.path.exists(self.weibo_fans_save_file) and os.path.getsize(self.weibo_fans_save_file) > 0: 196 | with open(self.weibo_fans_save_file, 'rb') as f: 197 | fans_ids = pickle.load(f) 198 | 199 | fans_url = 'https://weibo.cn/{}/fans?'.format(self.scrap_id) 200 | # first from fans html get how many page fans have 201 | # beware that, this 202 | print(fans_url) 203 | # r = requests.get(fans_url, cookies=self.cookie, headers=self.headers).content 204 | # print(r) 205 | html_fans = requests.get(fans_url, cookies=self.cookie, headers=self.headers).content 206 | selector = etree.HTML(html_fans) 207 | try: 208 | if selector.xpath('//input[@name="mp"]') is None: 209 | page_num = 1 210 | else: 211 | page_num = int(selector.xpath('//input[@name="mp"]')[0].attrib['value']) 212 | print('all fans have {} pages.'.format(page_num)) 213 | 214 | try: 215 | for i in range(page_num): 216 | if i % 5 == 0 and i != 0: 217 | print('[REST] rest {}s for cheating....'.format(self.rest_time)) 218 | time.sleep(self.rest_time) 219 | fans_url_child = 'https://weibo.cn/{}/fans?page={}'.format(self.scrap_id, i) 220 | print('requesting fans url: {}'.format(fans_url)) 221 | html_child = requests.get(fans_url_child, cookies=self.cookie, headers=self.headers).content 222 | selector_child = etree.HTML(html_child) 223 | fans_ids_content = selector_child.xpath("//div[@class='c']/table//a[1]/@href") 224 | ids = [i.split('/')[-1] for i in fans_ids_content] 225 | ids = list(set(ids)) 226 | for d in ids: 227 | print('appending fans id {}'.format(d)) 228 | fans_ids.append(d) 229 | except Exception as e: 230 | print('error: ', e) 231 | dump_fans_list = list(set(fans_ids)) 232 | print(dump_fans_list) 233 | with open(self.weibo_fans_save_file, 'wb') as f: 234 | pickle.dump(dump_fans_list, f) 235 | print('fans ids not fully added, but this is enough, saved into {}'.format( 236 | self.weibo_fans_save_file)) 237 | 238 | dump_fans_list = list(set(fans_ids)) 239 | print(dump_fans_list) 240 | with open(self.weibo_fans_save_file, 'wb') as f: 241 | pickle.dump(dump_fans_list, f) 242 | with open(self.big_v_ids_file, 'a') as f: 243 | f.write(self.scrap_id + '\n') 244 | print('successfully saved fans id file into {}'.format(self.weibo_fans_save_file)) 245 | 246 | except Exception as e: 247 | logging.error(e) 248 | 249 | def _get_weibo_content(self): 250 | print('\n' + '-' * 30) 251 | print('getting weibo content...') 252 | selector = etree.HTML(self.html) 253 | try: 254 | if selector.xpath('//input[@name="mp"]') is None: 255 | page_num = 1 256 | else: 257 | page_num = int(selector.xpath('//input[@name="mp"]')[0].attrib['value']) 258 | pattern = r"\d+\.?\d*" 259 | print('all weibo page {}'.format(page_num)) 260 | 261 | start_page = 0 262 | if os.path.exists(self.weibo_content_save_file) and os.path.getsize(self.weibo_content_save_file) > 0: 263 | print('load previous weibo_content file from {}'.format(self.weibo_content_save_file)) 264 | obj = pickle.load(open(self.weibo_content_save_file, 'rb')) 265 | if self.scrap_id in obj.keys(): 266 | self.weibo_content = 
obj[self.scrap_id]['weibo_content'] 267 | start_page = obj[self.scrap_id]['last_scrap_page'] 268 | if start_page >= page_num: 269 | print('\nAll weibo contents of {} have been scrapped before\n'.format(self.user_name)) 270 | return 271 | 272 | try: 273 | # traverse all weibo, and we will got weibo detail urls 274 | for page in range(start_page + 1, page_num + 1): 275 | url2 = 'http://weibo.cn/%s?filter=%s&page=%s' % (self.scrap_id, self.filter, page) 276 | html2 = requests.get(url2, cookies=self.cookie, headers=self.headers).content 277 | selector2 = etree.HTML(html2) 278 | content = selector2.xpath("//div[@class='c']") 279 | print('\n---- current solving page {} of {}'.format(page, page_num)) 280 | 281 | if page % 5 == 0: 282 | print('[REST] rest for %ds to cheat weibo site, avoid being banned.' % self.rest_time) 283 | time.sleep(self.rest_time) 284 | 285 | if len(content) > 3: 286 | for i in range(0, len(content) - 2): 287 | detail = content[i].xpath("@id")[0] 288 | self.weibo_detail_urls.append('http://weibo.cn/comment/{}?uid={}&rl=0'. 289 | format(detail.split('_')[-1], self.scrap_id)) 290 | 291 | self.weibo_scraped += 1 292 | str_t = content[i].xpath("div/span[@class='ctt']") 293 | weibos = str_t[0].xpath('string(.)') 294 | self.weibo_content.append(weibos) 295 | print(weibos) 296 | 297 | str_zan = content[i].xpath("div/a/text()")[-4] 298 | guid = re.findall(pattern, str_zan, re.M) 299 | num_zan = int(guid[0]) 300 | self.num_zan.append(num_zan) 301 | 302 | forwarding = content[i].xpath("div/a/text()")[-3] 303 | guid = re.findall(pattern, forwarding, re.M) 304 | num_forwarding = int(guid[0]) 305 | self.num_forwarding.append(num_forwarding) 306 | 307 | comment = content[i].xpath("div/a/text()")[-2] 308 | guid = re.findall(pattern, comment, re.M) 309 | num_comment = int(guid[0]) 310 | self.num_comment.append(num_comment) 311 | except etree.XMLSyntaxError as e: 312 | print('\n' * 2) 313 | print('=' * 20) 314 | print('weibo user {} all weibo content finished scrap.'.format(self.user_name)) 315 | print('all weibo {}, all like {}, all comments {}'.format( 316 | len(self.weibo_content), np.sum(self.num_zan), np.sum(self.num_comment))) 317 | print('try saving weibo content for now...') 318 | self._save_content(page) 319 | # maybe should not need to release memory 320 | # del self.weibo_content 321 | except Exception as e: 322 | print(e) 323 | print('\n' * 2) 324 | print('=' * 20) 325 | print('weibo user {} content scrap error occured {}.'.format(self.user_name, e)) 326 | print('all weibo {}, all like {}, all comments {}'.format( 327 | len(self.weibo_content), np.sum(self.num_zan), np.sum(self.num_comment))) 328 | print('try saving weibo content for now...') 329 | self._save_content(page) 330 | # del self.weibo_content 331 | # should keep self.weibo_content 332 | print('\n' * 2) 333 | print('=' * 20) 334 | print('all weibo {}, all like {}, all comments {}'.format( 335 | len(self.weibo_content), np.sum(self.num_zan), np.sum(self.num_comment))) 336 | print('try saving weibo content for now...') 337 | self._save_content(page) 338 | del self.weibo_content 339 | if self.filter == 0: 340 | print('共' + str(self.weibo_scraped) + '条微博') 341 | else: 342 | print('共' + str(self.weibo_num) + '条微博,其中' + str(self.weibo_scraped) + '条为原创微博') 343 | except IndexError as e: 344 | print('get weibo info done, current user {} has no weibo yet.'.format(self.scrap_id)) 345 | except KeyboardInterrupt: 346 | print('manually interrupted... 
try save wb_content for now...') 347 | self._save_content(page - 1) 348 | 349 | def _save_content(self, page): 350 | dump_obj = dict() 351 | if os.path.exists(self.weibo_content_save_file) and os.path.getsize(self.weibo_content_save_file) > 0: 352 | with open(self.weibo_content_save_file, 'rb') as f: 353 | dump_obj = pickle.load(f) 354 | dump_obj[self.scrap_id] = { 355 | 'weibo_content': self.weibo_content, 356 | 'last_scrap_page': page 357 | } 358 | with open(self.weibo_content_save_file, 'wb') as f: 359 | pickle.dump(dump_obj, f) 360 | 361 | dump_obj[self.scrap_id] = { 362 | 'weibo_content': self.weibo_content, 363 | 'last_scrap_page': page 364 | } 365 | with open(self.weibo_content_save_file, 'wb') as f: 366 | pickle.dump(dump_obj, f) 367 | print('\nUser name: {} \t ID: {} '.format(self.user_name, self.scrap_id)) 368 | print('[CHEER] weibo content saved into {}, finished this part successfully.\n'.format( 369 | self.weibo_content_save_file, page)) 370 | 371 | def _get_weibo_content_and_comment(self): 372 | """ 373 | all weibo will be saved into weibo_content_and_comment.pkl 374 | in format: 375 | { 376 | scrap_id: { 377 | 'weibo_detail_urls': [....], 378 | 'last_scrap_index': 5, 379 | 'content_and_comment': [ 380 | {'content': '...', 'comment': ['..', '...', '...', '...',], 'last_idx':num0}, 381 | {'content': '...', 'comment': ['..', '...', '...', '...',], 'last_idx':num1}, 382 | {'content': '...', 'comment': ['..', '...', '...', '...',], 'last_idx':num2} 383 | ] 384 | } 385 | } 386 | :return: 387 | """ 388 | print('\n' + '-' * 30) 389 | print('getting content and comment...') 390 | weibo_detail_urls = self.weibo_detail_urls 391 | start_scrap_index = 0 392 | content_and_comment = [] 393 | if os.path.exists(self.weibo_content_save_file) and os.path.getsize(self.weibo_content_save_file) > 0: 394 | print('load previous weibo_content file from {}'.format(self.weibo_content_save_file)) 395 | obj = pickle.load(open(self.weibo_content_save_file, 'rb')) 396 | self.weibo_content = obj[self.scrap_id]['weibo_content'] 397 | 398 | if os.path.exists(self.weibo_content_and_comment_save_file) and os.path.getsize(self.weibo_content_and_comment_save_file) > 0: 399 | with open(self.weibo_content_and_comment_save_file, 'rb') as f: 400 | obj = pickle.load(f) 401 | if self.scrap_id in obj.keys(): 402 | obj = obj[self.scrap_id] 403 | weibo_detail_urls = obj['weibo_detail_urls'] 404 | start_scrap_index = obj['last_scrap_index'] 405 | content_and_comment = obj['content_and_comment'] 406 | 407 | end_scrap_index = len(weibo_detail_urls) 408 | try: 409 | for i in range(start_scrap_index + 1, end_scrap_index): 410 | url = weibo_detail_urls[i] 411 | one_content_and_comment = dict() 412 | 413 | print('\n\nsolving weibo detail from {}'.format(url)) 414 | print('No.{} weibo of total {}'.format(i, end_scrap_index)) 415 | html_detail = requests.get(url, cookies=self.cookie, headers=self.headers).content 416 | selector_detail = etree.HTML(html_detail) 417 | # if current weibo content has no comment, skip it 418 | if not selector_detail.xpath('//*[@id="pagelist"]/form/div/input[1]/@value'): 419 | continue 420 | all_comment_pages = selector_detail.xpath('//*[@id="pagelist"]/form/div/input[1]/@value')[0] 421 | print('这是 {} 的微博:'.format(self.user_name)) 422 | print('微博内容: {}'.format(self.weibo_content[i])) 423 | print('接下来是下面的评论:\n\n') 424 | 425 | one_content_and_comment['content'] = self.weibo_content[i] 426 | one_content_and_comment['comment'] = [] 427 | start_idx = 0 428 | end_idx = int(all_comment_pages) - 2 429 | if i 
== start_scrap_index + 1 and content_and_comment: 430 | one_cac = content_and_comment[-1] 431 | # use the following judgement because the previous data don't have 'last_idx' 432 | if 'last_idx' in one_cac.keys(): 433 | print('\nTrying to recover from the last comment of last content...\n') 434 | if one_cac['last_idx'] + 1 < end_idx: 435 | one_content_and_comment['comment'] = one_cac['comment'] 436 | start_idx = one_cac['last_idx'] + 1 437 | print('last_idx: {}\n'.format(one_cac['last_idx'])) 438 | 439 | for page in range(start_idx, end_idx): 440 | print('\n---- current solving page {} of {}'.format(page, int(all_comment_pages) - 3)) 441 | if page % 5 == 0: 442 | self.rest_time = np.random.randint(self.rest_min_time, self.rest_max_time) 443 | print('[ATTEMPTING] rest for %ds to cheat weibo site, avoid being banned.' % self.rest_time) 444 | time.sleep(self.rest_time) 445 | 446 | # we crawl from page 2, cause front pages have some noise 447 | detail_comment_url = url + '&page=' + str(page + 2) 448 | no_content_pages = [] 449 | try: 450 | # from every detail comment url we will got all comment 451 | html_detail_page = requests.get(detail_comment_url, cookies=self.cookie).content 452 | selector_comment = etree.HTML(html_detail_page) 453 | 454 | comment_div_element = selector_comment.xpath('//div[starts-with(@id, "C_")]') 455 | 456 | for child in comment_div_element: 457 | single_comment_user_name = child.xpath('a[1]/text()')[0] 458 | if child.xpath('span[1][count(*)=0]'): 459 | single_comment_content = child.xpath('span[1][count(*)=0]/text()')[0] 460 | else: 461 | span_element = child.xpath('span[1]')[0] 462 | at_user_name = span_element.xpath('a/text()')[0] 463 | at_user_name = '$' + at_user_name.split('@')[-1] + '$' 464 | single_comment_content = span_element.xpath('/text()') 465 | single_comment_content.insert(1, at_user_name) 466 | single_comment_content = ' '.join(single_comment_content) 467 | 468 | full_single_comment = '<' + single_comment_user_name + '>' + ': ' + single_comment_content 469 | print(full_single_comment) 470 | one_content_and_comment['comment'].append(full_single_comment) 471 | one_content_and_comment['last_idx'] = page 472 | except etree.XMLSyntaxError as e: 473 | no_content_pages.append(page) 474 | print('\n\nThis page has no contents and is passed: ', e) 475 | print('Total no_content_pages: {}'.format(len(no_content_pages))) 476 | 477 | except Exception as e: 478 | print('Raise Exception in _get_weibo_content_and_comment, error:', e) 479 | print('\n' * 2) 480 | print('=' * 20) 481 | print('weibo user {} content_and_comment scrap error occured {}.'.format(self.user_name, e)) 482 | self._save_content_and_comment(i, one_content_and_comment, weibo_detail_urls) 483 | print("\n\nComments are successfully save:\n User name: {}\n weibo content: {}\n\n".format( 484 | self.user_name, one_content_and_comment['content'])) 485 | # save every one_content_and_comment 486 | self._save_content_and_comment(i, one_content_and_comment, weibo_detail_urls) 487 | print('weibo scrap done!') 488 | self.mark_as_scraped(self.scrap_id) 489 | print('*' * 30) 490 | print("\n\nComments are successfully save:\n User name: {}\n weibo content: {}".format( 491 | self.user_name, one_content_and_comment['content'])) 492 | except KeyboardInterrupt: 493 | print('manually interrupted.. 
try save wb_content_and_comment for now...') 494 | self._save_content_and_comment(i - 1, one_content_and_comment, weibo_detail_urls) 495 | 496 | print('\n' * 2) 497 | print('=' * 20) 498 | print('user {}, all weibo content and comment finished.'.format(self.user_name)) 499 | 500 | def _save_content_and_comment(self, i, one_content_and_comment, weibo_detail_urls): 501 | dump_dict = dict() 502 | if os.path.exists(self.weibo_content_and_comment_save_file) and os.path.getsize(self.weibo_content_and_comment_save_file) > 0: 503 | with open(self.weibo_content_and_comment_save_file, 'rb') as f: 504 | obj = pickle.load(f) 505 | dump_dict = obj 506 | if self.scrap_id in dump_dict.keys(): 507 | dump_dict[self.scrap_id]['last_scrap_index'] = i 508 | dump_dict[self.scrap_id]['content_and_comment'].append(one_content_and_comment) 509 | else: 510 | dump_dict[self.scrap_id] = { 511 | 'weibo_detail_urls': weibo_detail_urls, 512 | 'last_scrap_index': i, 513 | 'content_and_comment': [one_content_and_comment] 514 | } 515 | else: 516 | dump_dict[self.scrap_id] = { 517 | 'weibo_detail_urls': weibo_detail_urls, 518 | 'last_scrap_index': i, 519 | 'content_and_comment': [one_content_and_comment] 520 | } 521 | with open(self.weibo_content_and_comment_save_file, 'wb') as f: 522 | print('try saving weibo content and comment for now.') 523 | pickle.dump(dump_dict, f) 524 | 525 | def switch_account(self, new_account): 526 | assert new_account.isinstance(str), 'account must be string' 527 | self.using_account = new_account 528 | 529 | @staticmethod 530 | def mark_as_scraped(scrap_id): 531 | """ 532 | this will mark an id to be scraped, next time will jump this id directly 533 | :return: 534 | """ 535 | scraped_ids = [] 536 | if os.path.exists(SCRAPED_MARK) and os.path.getsize(SCRAPED_MARK) > 0: 537 | with open(SCRAPED_MARK, 'rb') as f: 538 | scraped_ids = pickle.load(f) 539 | scraped_ids.append(scrap_id) 540 | with open(SCRAPED_MARK, 'wb') as f: 541 | pickle.dump(scraped_ids, f) 542 | print('scrap id {} marked as scraped, next time will jump this id directly.'.format(scrap_id)) 543 | -------------------------------------------------------------------------------- /scraper/weibo_scraper_m.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: weibo_scraper_m.py 3 | # author: JinTian 4 | # time: 16/05/2017 11:32 AM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 
18 | # ------------------------------------------------------------------------ 19 | """ 20 | this is another option for WeiBo scraper, using domain: 21 | http://m.weibo.cn 22 | 23 | this scraper under development, welcome submit PR to add more features 24 | """ 25 | import os 26 | import requests 27 | import numpy as np 28 | from settings.config import * 29 | import pickle 30 | 31 | 32 | class WeiBoScraperM(object): 33 | """ 34 | this scraper under construct, contributor can follow this template to compatible WeiBoScraper API 35 | """ 36 | 37 | def __init__(self, using_account, scrap_id, cookies, filter_flag=0): 38 | self.using_account = using_account 39 | self.scrap_id = scrap_id 40 | self.cookies = cookies 41 | self.filter_flag = filter_flag 42 | 43 | def _init_cookies(self): 44 | cookie = { 45 | "Cookie": self.cookies 46 | } 47 | self.cookie = cookie 48 | 49 | def _init_headers(self): 50 | """ 51 | avoid span 52 | :return: 53 | """ 54 | headers = requests.utils.default_headers() 55 | user_agent = { 56 | 'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0' 57 | } 58 | headers.update(user_agent) 59 | print('headers: ', headers) 60 | self.headers = headers 61 | 62 | def jump_scraped_id(self): 63 | """ 64 | check mark scrapd ids, jump scraped one. 65 | :return: 66 | """ 67 | if os.path.exists(SCRAPED_MARK): 68 | with open(SCRAPED_MARK, 'rb') as f: 69 | scraped_ids = pickle.load(f) 70 | if self.scrap_id in scraped_ids: 71 | return True 72 | else: 73 | return False 74 | else: 75 | return False 76 | 77 | def crawl(self): 78 | if self.jump_scraped_id(): 79 | print('scrap id {} already scraped, directly pass it.'.format(self.scrap_id)) 80 | else: 81 | try: 82 | self._get_html() 83 | self._get_user_name() 84 | self._get_user_info() 85 | 86 | self._get_fans_ids() 87 | self._get_wb_content() 88 | self._get_wb_content_and_comment() 89 | return True 90 | except Exception as e: 91 | print('error when crawl: ', e) 92 | return False 93 | 94 | def _get_html(self): 95 | pass 96 | 97 | def _get_user_name(self): 98 | pass 99 | 100 | def _get_user_info(self): 101 | pass 102 | 103 | def _get_fans_ids(self): 104 | pass 105 | 106 | def _get_wb_content(self): 107 | pass 108 | 109 | def _save_content(self, page): 110 | pass 111 | 112 | def _get_wb_content_and_comment(self): 113 | pass 114 | 115 | def _save_content_and_comment(self, i, one_content_and_comment, weibo_detail_urls): 116 | pass 117 | 118 | def switch_account(self, new_account): 119 | assert new_account.isinstance(str), 'account must be string' 120 | self.using_account = new_account 121 | 122 | @staticmethod 123 | def mark_as_scraped(scrap_id): 124 | """ 125 | this will mark an id to be scraped, next time will jump this id directly 126 | :return: 127 | """ 128 | scraped_ids = [] 129 | if os.path.exists(SCRAPED_MARK): 130 | with open(SCRAPED_MARK, 'rb') as f: 131 | scraped_ids = pickle.load(f) 132 | scraped_ids.append(scrap_id) 133 | with open(SCRAPED_MARK, 'wb') as f: 134 | pickle.dump(scraped_ids, f) 135 | print('scrap id {} marked as scraped, next time will jump this id directly.'.format(scrap_id)) -------------------------------------------------------------------------------- /settings/accounts.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: accounts.py 3 | # author: JinTian 4 | # time: 17/04/2017 2:26 PM 5 | # Copyright 2017 JinTian. All Rights Reserved. 
6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | """ 20 | place all your accounts inside here, 21 | terminator will using all accounts, if one being banned, another will move on. 22 | until all accounts were banned, scrap stop. 23 | """ 24 | 25 | 26 | # please set this to your own, this is fake accounts 27 | accounts = [ 28 | { 29 | "id": 'jintianiloveu', 30 | "password": '77888', 31 | }, 32 | 33 | ] 34 | -------------------------------------------------------------------------------- /settings/config.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: settings.py 3 | # author: JinTian 4 | # time: 13/04/2017 10:10 AM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | """ 20 | all configurations set here, follow the instructions 21 | """ 22 | import os 23 | 24 | # you should not change this properly 25 | DEFAULT_USER_ID = 'realangelababy' 26 | LOGIN_URL = 'https://passport.weibo.cn/signin/login' 27 | 28 | ID_FILE_PATH = './settings/id_file' 29 | 30 | 31 | # change this to your PhantomJS unzip path, point to bin/phantomjs executable file, full path 32 | PHANTOM_JS_PATH = '/Users/jintian/phantomjs-2.1.1-macosx/bin/phantomjs' 33 | 34 | 35 | COOKIES_SAVE_PATH = 'settings/cookies.pkl' 36 | 37 | 38 | CORPUS_SAVE_DIR = './scraped_corpus/' 39 | 40 | DISTRIBUTE_IDS = 'distribute_ids.pkl' 41 | 42 | SCRAPED_MARK = './settings/scraped.mark' 43 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/utils/__init__.py -------------------------------------------------------------------------------- /utils/connection.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: connection.py 3 | # author: JinTian 4 | # time: 13/04/2017 9:53 AM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 
9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ -------------------------------------------------------------------------------- /utils/cookies.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: cookies.py 3 | # author: JinTian 4 | # time: 17/04/2017 12:55 PM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | """ 20 | to get weibo_terminator switch accounts automatically, please be sure install: 21 | 22 | PhantomJS, from http://phantomjs.org/download.html 23 | 24 | """ 25 | import os 26 | from selenium import webdriver 27 | from selenium.webdriver.common.keys import Keys 28 | from selenium.common.exceptions import InvalidElementStateException 29 | import time 30 | import sys 31 | from tqdm import * 32 | import pickle 33 | 34 | from settings.accounts import accounts 35 | from settings.config import LOGIN_URL, PHANTOM_JS_PATH, COOKIES_SAVE_PATH 36 | 37 | 38 | def count_time(): 39 | for i in tqdm(range(40)): 40 | time.sleep(0.5) 41 | 42 | 43 | def get_cookie_from_network(account_id, account_password): 44 | url_login = LOGIN_URL 45 | phantom_js_driver_file = os.path.abspath(PHANTOM_JS_PATH) 46 | if os.path.exists(phantom_js_driver_file): 47 | try: 48 | print('loading PhantomJS from {}'.format(phantom_js_driver_file)) 49 | driver = webdriver.PhantomJS(phantom_js_driver_file) 50 | # must set window size or will not find element 51 | driver.set_window_size(1640, 688) 52 | driver.get(url_login) 53 | # before get element sleep for 4 seconds, waiting for page render complete. 54 | print('opening weibo login page, this is first done for prepare for cookies. be patience to waite load ' 55 | 'complete.') 56 | count_time() 57 | driver.find_element_by_xpath('//input[@id="loginName"]').send_keys(account_id) 58 | driver.find_element_by_xpath('//input[@id="loginPassword"]').send_keys(account_password) 59 | # driver.find_element_by_xpath('//input[@id="loginPassword"]').send_keys(Keys.RETURN) 60 | print('account id: {}'.format(account_id)) 61 | print('account password: {}'.format(account_password)) 62 | 63 | driver.find_element_by_xpath('//a[@id="loginAction"]').click() 64 | except InvalidElementStateException as e: 65 | print(e) 66 | print('error, account id {} is not valid, pass this account, you can edit it and then ' 67 | 'update cookies. 
\n' 68 | .format(account_id)) 69 | 70 | try: 71 | cookie_list = driver.get_cookies() 72 | cookie_string = '' 73 | for cookie in cookie_list: 74 | if 'name' in cookie and 'value' in cookie: 75 | cookie_string += cookie['name'] + '=' + cookie['value'] + ';' 76 | if 'SSOLoginState' in cookie_string: 77 | print('success get cookies!! \n {}'.format(cookie_string)) 78 | if os.path.exists(COOKIES_SAVE_PATH): 79 | with open(COOKIES_SAVE_PATH, 'rb') as f: 80 | cookies_dict = pickle.load(f) 81 | if cookies_dict[account_id] is not None: 82 | cookies_dict[account_id] = cookie_string 83 | with open(COOKIES_SAVE_PATH, 'wb') as f: 84 | pickle.dump(cookies_dict, f) 85 | print('successfully save cookies into {}. \n'.format(COOKIES_SAVE_PATH)) 86 | else: 87 | pass 88 | else: 89 | cookies_dict = dict() 90 | cookies_dict[account_id] = cookie_string 91 | with open(COOKIES_SAVE_PATH, 'wb') as f: 92 | pickle.dump(cookies_dict, f) 93 | print('successfully save cookies into {}. \n'.format(COOKIES_SAVE_PATH)) 94 | return cookie_string 95 | else: 96 | print('error, account id {} is not valid, pass this account, you can edit it and then ' 97 | 'update cookies. \n' 98 | .format(account_id)) 99 | pass 100 | 101 | except Exception as e: 102 | print(e) 103 | 104 | else: 105 | print('can not find PhantomJS driver, please download from http://phantomjs.org/download.html based on your ' 106 | 'system.') 107 | -------------------------------------------------------------------------------- /utils/string.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # file: string.py 3 | # author: JinTian 4 | # time: 17/04/2017 7:06 PM 5 | # Copyright 2017 JinTian. All Rights Reserved. 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # ------------------------------------------------------------------------ 19 | 20 | 21 | def is_valid_id(s): 22 | try: 23 | a = float(s) 24 | return True 25 | except ValueError as e: 26 | return False 27 | 28 | 29 | def is_number(s): 30 | try: 31 | a = float(s) 32 | return True 33 | except ValueError as e: 34 | return False 35 | 36 | --------------------------------------------------------------------------------
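For contributors who just want to inspect what has been scraped so far, below is a minimal loading sketch. It is not part of the repository; it assumes the default `CORPUS_SAVE_DIR` (`./scraped_corpus/`) from settings/config.py, and the pickle layouts follow `_save_content` and `_save_content_and_comment` in scraper/weibo_scraper.py. The id used at the bottom is the seed id hard-coded in main.py, purely as an example.

```
# -*- coding: utf-8 -*-
# Minimal sketch (not part of the repo): peek into the pickles written by WeiBoScraper.
import os
import pickle

CORPUS_SAVE_DIR = './scraped_corpus/'


def load_pickle(name):
    """Load one of the scraped pickle files, or return None if it does not exist yet."""
    path = os.path.join(CORPUS_SAVE_DIR, name)
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as f:
        return pickle.load(f)


def show_scraped_id(scrap_id):
    # weibo_content.pkl: {scrap_id: {'weibo_content': [...], 'last_scrap_page': N}}
    contents = load_pickle('weibo_content.pkl')
    if contents and scrap_id in contents:
        record = contents[scrap_id]
        print('id {}: {} weibo scraped, up to page {}'.format(
            scrap_id, len(record['weibo_content']), record['last_scrap_page']))

    # weibo_content_and_comment.pkl:
    # {scrap_id: {'weibo_detail_urls': [...], 'last_scrap_index': N,
    #             'content_and_comment': [{'content': ..., 'comment': [...], 'last_idx': ...}]}}
    cac = load_pickle('weibo_content_and_comment.pkl')
    if cac and scrap_id in cac:
        for item in cac[scrap_id]['content_and_comment']:
            print('weibo: {}'.format(item['content']))
            print('comments: {}'.format(len(item['comment'])))


if __name__ == '__main__':
    # weibo_fans.pkl holds a flat, de-duplicated list of fan ids
    fans = load_pickle('weibo_fans.pkl') or []
    print('collected {} fan ids so far'.format(len(fans)))
    show_scraped_id('3879293449')  # example: the seed id used in main.py
```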