├── .gitignore
├── README.md
├── contirbutor.txt
├── core
    ├── __init__.py
    ├── dispatch_center.py
    └── scrap.py
├── distribute_ids.pkl
├── main.py
├── scraper
    ├── __init__.py
    ├── weibo_scraper.py
    └── weibo_scraper_m.py
├── settings
    ├── accounts.py
    └── config.py
└── utils
    ├── __init__.py
    ├── connection.py
    ├── cookies.py
    └── string.py


/.gitignore:
--------------------------------------------------------------------------------
1 | ghostdriver.log
2 | .idea/
3 | __pycache__/
4 | save_to_distribute.py
5 | test.py
6 | save_distribute.py
7 | contributor.txt
8 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Weibo Terminator Work Flow
 2 | 
 3 | ![PicName](http://ofwzcunzi.bkt.clouddn.com/7gOwGp5K4FPWHkHQ.png)
 4 | 
 5 | > 这个项目是之前项目的重启版本，之前的项目地址[这里](https://github.com/jinfagang/weibo_terminater.git)，那个项目依旧会保持更新，这是weibo terminator的工作版本，这个版本对上一个版本做了一些优化，这里的最终目标是一起爬取语料，包括情感分析、对话语料、舆论风控、大数据分析等应用。
 6 | 
 7 | # UPDATE 2017-5-16
 8 | 
 9 | 更新：
10 | * 调整了首次cookies获取逻辑，如果程序没有检测到cookies就会退出，防止后面爬取不到更多的内容而crash；
11 | * 增加了WeiBoScraperM 类，目前还在构建中，欢迎submit PR 实现，这个类主要实现从另外一个微博域名爬取，也就是手机域名；
12 | 
13 | 大家可以pull一下更新。
14 | 
15 | # UPDATE 2017-5-15
16 | 
17 | 经过一些小修改和几位contributor的PR，代码发生了一些小变化，基本上都是在修复bug和完善一些逻辑，修改如下：
18 | 
19 | 1. 修复了保存出错的问题，这个大家在第一次push的时候clone的代码要pull一下;
20 | 2. 关于 `WeiboScraper has not attribute weibo_content`的错误，新代码已经修复;
21 | 
22 | @Fence 提交PR修改了一些内容:
23 | 1. 原先的固定30s休息换成随机时间，具体参数可自己定义
24 | 2. 增加了big_v_ids_file，记录已经保存过粉丝的明星id； 用txt格式，方便contributor手动增删
25 | 3. 两个函数的爬取页面都改成了page+1，避免断点续爬时重复爬取上次已经爬过最后一页
26 | 4. 把原先的“爬取完一个id的所有微博及其评论”改为“爬完一条微博及其所有评论就保存”
27 | 5. （Optional）把保存文件的部分单独为函数，因为分别有2个和3个地方需要保存
28 | 
29 | 大家可以`git pull origin master`， 获取一下新更新的版本，同时也欢迎大家继续问我要uuid，我会定时把名单公布在`contirbutor.txt` 中，我近期在做数据merge的工作，以及数据清洗，分类等工作，merge工作完成之后会把大数据集分发给大家。
30 | 
31 | 
32 | # Improve
33 | 
34 | 对上一版本做了以下改进：
35 | 
36 | * 没有了太多的distraction，直奔主题，给定id，获取该用户的所有微博，微博数量，粉丝数，所有微博内容以及评论内容；
37 | * 和上一版本不同的是，这一次我们的理念是把所有数据保存到三个pickle文件中，以字典的文件存储，这么做的目的是方便断点续爬；
38 | * 同时做到了，已经爬过的id爬虫不会再次爬取，也就是说爬虫会记住爬取过的id，每个id获取完了所有内容之后会被标记为已经爬取；
39 | * 除此之外，微博内容和微博评论被单独分开，微博内容爬取过程中出现中断，第二次不会重新爬取，会从中断的页码继续爬取；
40 | * 更加重要的是！！！每个id爬取互不影响，你可以直接从pickle文件中调取出任何你想要的id的微博内容，可以做任何处理！！
41 | * 除此之外之外，测试了新的反爬策略，采用的延时机制能够很好的工作，不过还无法完全做到无人控制。
42 | 
43 | **更更加重要的是！！！**，在这一版本中，爬虫的智能性得到了很大提升，爬虫会在爬取每个id的时候，**自动去获取该id的所有粉丝id！！**
44 | 相当于是，我给大家的都是种子id，种子id都是一些明星或者公司或者媒体大V的id，从这些种子id你可以获取到成千上万的其他种子id！！
45 | 假如一个明星粉丝是3.4万，第一次爬取你就可以获得3.4万个id，然后在从子代id继续爬，每个子代id有粉丝100，第二次你就可以获取到340万个id！！！足够了吗？！！！当然不够！！！
46 | 
47 | **我们这个项目永远不会停止！！！** 会一直进行下去，直到收获足够多的语料！！！
48 | 
49 | （当然实际上我们不能获得所有粉丝，不过这些也足够了。）
50 | 
51 | ![PicName](http://ofwzcunzi.bkt.clouddn.com/lqcx6MLSdS8whJVt.png)
52 | 
53 | # Work Flow
54 | 
55 | 这一版本的目标是针对contributor，我们的工作流程也非常简单：
56 | 
57 | 1. 获取uuid，这个uuid可以调取到 distribute_ids.pkl 的2-3个id，这个是我们的种子id，当然大家也可以直接获取到所有id，但是为了防止重复工作，建议大家向我申请一个uuid，你只负责你的那个，爬完之后，把最终文件反馈给我，我整理去重之后，把最终的大语料发放给大家。
58 | 2. 运行 `python3 main.py uuid`，这里说明一下，uuid指定的id爬取完之后才会取爬fans id；
59 | 3. Done！
60 | 
61 | # Discuss
62 | 
63 | 依旧贴出一下讨论群，欢迎大家添加：
64 | ```
65 | QQ
66 | AI智能自然语言处理: 476464663
67 | Tensorflow智能聊天Bot: 621970965
68 | GitHub深度学习开源交流: 263018023
69 | ```
70 | 微信可以加我好友： jintianiloveu
71 | 
72 | # Copyright
73 | 
74 | ```
75 | (c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors LICENSE Apache 2.0
76 | ```
77 | 


--------------------------------------------------------------------------------
/contirbutor.txt:
--------------------------------------------------------------------------------
 1 | https://github.com/MichaelFeng87    s00002
 2 | https://github.com/jiang20160402    s00001
 3 | https://github.com/richard1225  s00003
 4 | https://github.com/little1tow   s00004
 5 | https://github.com/Linlinlin111 s00005
 6 | https://github.com/Fence    m00001
 7 | https://github.com/Da-Capo  m00002
 8 | https://github.com/liangWenPeng m00002
 9 | https://github.com/af1ynch  m00003
10 | https://github.com/elviswxy  m00005
11 | https://github.com/liuyijiang1994   c00001
12 | 


--------------------------------------------------------------------------------
/core/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/core/__init__.py


--------------------------------------------------------------------------------
/core/dispatch_center.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | # file: dispatch_center.py
  3 | # author: JinTian
  4 | # time: 13/04/2017 9:52 AM
  5 | # Copyright 2017 JinTian. All Rights Reserved.
  6 | #
  7 | # Licensed under the Apache License, Version 2.0 (the "License");
  8 | # you may not use this file except in compliance with the License.
  9 | # You may obtain a copy of the License at
 10 | #
 11 | #     http://www.apache.org/licenses/LICENSE-2.0
 12 | #
 13 | # Unless required by applicable law or agreed to in writing, software
 14 | # distributed under the License is distributed on an "AS IS" BASIS,
 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 16 | # See the License for the specific language governing permissions and
 17 | # limitations under the License.
 18 | # ------------------------------------------------------------------------
 19 | from scraper.weibo_scraper import WeiBoScraper
 20 | from settings.config import COOKIES_SAVE_PATH
 21 | from settings.accounts import accounts
 22 | import os
 23 | from utils.cookies import get_cookie_from_network
 24 | import pickle
 25 | 
 26 | 
 27 | class Dispatcher(object):
 28 |     """
 29 |     Dispatch center, if your cookies is out of date,
 30 |     set update_cookies to True to update all accounts cookies
 31 |     """
 32 | 
 33 |     def __init__(self, id_file_path, mode, uid, filter_flag=0, update_cookies=False):
 34 |         self.mode = mode
 35 |         self.filter_flag = filter_flag
 36 |         self.update_cookies = update_cookies
 37 | 
 38 |         self._init_accounts_cookies()
 39 |         self._init_accounts()
 40 | 
 41 |         if self.mode == 'single':
 42 |             self.user_id = uid
 43 |         elif self.mode == 'multi':
 44 |             self.id_file_path = id_file_path
 45 |         else:
 46 |             raise Exception('mode option only support single and multi')
 47 | 
 48 |     def execute(self):
 49 |         if self.mode == 'single':
 50 |             self._init_single_mode()
 51 |         elif self.mode == 'multi':
 52 |             self._init_multi_mode()
 53 |         else:
 54 |             raise Exception('mode option only support single and multi')
 55 | 
 56 |     def _init_accounts_cookies(self):
 57 |         """
 58 |         get all cookies for accounts, dump into pkl, this will only run once, if
 59 |         you update accounts, set update to True
 60 |         :return:
 61 |         """
 62 |         if self.update_cookies:
 63 |             for account in accounts:
 64 |                 print('preparing cookies for account {}'.format(account))
 65 |                 get_cookie_from_network(account['id'], account['password'])
 66 |             print('all accounts getting cookies finished. starting scrap..')
 67 |         else:
 68 |             if os.path.exists(COOKIES_SAVE_PATH):
 69 |                 pass
 70 |             else:
 71 |                 for account in accounts:
 72 |                     print('preparing cookies for account {}'.format(account))
 73 |                     get_cookie_from_network(account['id'], account['password'])
 74 |                 print('all accounts getting cookies finished. starting scrap..')
 75 | 
 76 |     def _init_accounts(self):
 77 |         """
 78 |         setting accounts
 79 |         :return:
 80 |         """
 81 |         try:
 82 |             with open(COOKIES_SAVE_PATH, 'rb') as f:
 83 |                 cookies_dict = pickle.load(f)
 84 |             self.all_accounts = list(cookies_dict.keys())
 85 |             print('----------- detected {} accounts, weibo_terminator will using all accounts to scrap '
 86 |                   'automatically -------------'.format(len(self.all_accounts)))
 87 |             print('detected accounts: ', self.all_accounts)
 88 |         except Exception as e:
 89 |             print(e)
 90 |             print('error, not find cookies file.')
 91 |       
 92 |     def _init_single_mode(self):
 93 |         scraper = WeiBoScraper(using_account=self.all_accounts[0], uuid=self.user_id, filter_flag=self.filter_flag)
 94 |         i = 1
 95 |         while True:
 96 |             result = scraper.crawl()
 97 |             if result:
 98 |                 print('finished!!!')
 99 |                 break
100 |             else:
101 |                 if i >= len(self.all_accounts):
102 |                     print('scrap not finish, account resource run out. update account, move on scrap.')
103 |                     break
104 |                 else:
105 |                     scraper.switch_account(self.all_accounts[i])
106 |                     i += 1
107 |                     print('account {} being banned or error weibo is none for current user id, switch to {}..'.format(
108 |                         self.all_accounts[i - 1], self.all_accounts[i]))
109 | 
110 |     def _init_multi_mode(self):
111 |         pass
112 | 
113 | 
114 | 


--------------------------------------------------------------------------------
/core/scrap.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # file: scrap.py
 3 | # author: JinTian
 4 | # time: 10/05/2017 10:38 PM
 5 | # Copyright 2017 JinTian. All Rights Reserved.
 6 | #
 7 | # Licensed under the Apache License, Version 2.0 (the "License");
 8 | # you may not use this file except in compliance with the License.
 9 | # You may obtain a copy of the License at
10 | #
11 | #     http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | # ------------------------------------------------------------------------
19 | from scraper.weibo_scraper import WeiBoScraper
20 | from utils.cookies import get_cookie_from_network
21 | from settings.config import *
22 | import pickle
23 | from settings.accounts import accounts
24 | 
25 | 
26 | def init_accounts_cookies():
27 |     if os.path.exists(COOKIES_SAVE_PATH):
28 |         with open(COOKIES_SAVE_PATH, 'rb') as f:
29 |             cookies_dict = pickle.load(f)
30 |         return list(cookies_dict.keys())
31 |     else:
32 |         for account in accounts:
33 |             print('preparing cookies for account {}'.format(account))
34 |             get_cookie_from_network(account['id'], account['password'])
35 |         print('checking account validation...')
36 |         valid_accounts = get_valid_accounts()
37 | 
38 |         if len(valid_accounts) == len(accounts):
39 |             print('all accounts checked valid... start scrap')
40 |             return valid_accounts
41 |         elif len(valid_accounts) < 1:
42 |             print('error, not find valid accounts, please check accounts.')
43 |             exit(0)
44 |         elif len(valid_accounts) > 1:
45 |             print('find valid accounts: ', valid_accounts)
46 |             print('starting scrap..')
47 |             return valid_accounts
48 | 
49 | 
50 | def get_valid_accounts():
51 |     with open(COOKIES_SAVE_PATH, 'rb') as f:
52 |         cookies_dict = pickle.load(f)
53 |     return list(cookies_dict.keys())
54 | 
55 | 
56 | def get_cookies_by_account(account_id):
57 |     with open(COOKIES_SAVE_PATH, 'rb') as f:
58 |         cookies_dict = pickle.load(f)
59 |     return cookies_dict[account_id]
60 | 
61 | 
62 | def scrap(scrap_id):
63 |     """
64 |     scrap a single id
65 |     :return:
66 |     """
67 |     valid_accounts = init_accounts_cookies()
68 |     print('valid accounts: ', valid_accounts)
69 | 
70 |     # TODO currently only using single account, multi accounts using multi thread maybe quicker but seems like a mess
71 |     # TODO maybe will adding multi thread feature when our code comes steady
72 |     # so that maybe need to manually change accounts when one account being baned
73 |     account_id = valid_accounts[0]
74 |     print('using accounts: ', account_id)
75 | 
76 |     cookies = get_cookies_by_account(account_id)
77 | 
78 |     # alternative this scraper can changed into WeiBoScraperM in the future which scrap from http://m.weibo.cn
79 |     scraper = WeiBoScraper(account_id, scrap_id, cookies)
80 |     scraper.crawl()
81 | 
82 | 
83 | def main(args):
84 |     scrap_id = args
85 |     scrap(scrap_id)
86 | 


--------------------------------------------------------------------------------
/distribute_ids.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/distribute_ids.pkl


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # file: main.py
 3 | # author: JinTian
 4 | # time: 13/04/2017 10:01 AM
 5 | # Copyright 2017 JinTian. All Rights Reserved.
 6 | #
 7 | # Licensed under the Apache License, Version 2.0 (the "License");
 8 | # you may not use this file except in compliance with the License.
 9 | # You may obtain a copy of the License at
10 | #
11 | #     http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | # ------------------------------------------------------------------------
19 | """
20 | weibo_terminator_workflow
21 | 
22 | run main.py will do:
23 | 1. first run will scrap distribute ids, every id finished all task will mark as done.
24 | 2. after all distribute ids were done, will scrap all fans id.
25 | 
26 | """
27 | import argparse
28 | 
29 | from core.dispatch_center import Dispatcher
30 | from settings.config import *
31 | from utils.string import is_valid_id
32 | from core.scrap import scrap
33 | import sys
34 | import pickle
35 | 
36 | 
37 | def mission(distribute_uuid=None):
38 |     """
39 |     mission for workflow.
40 |     this method is very simple.
41 |     for the first run, it will scrap distribute ids. you just get distribute_uuid from wechat
42 |     `jintianiloveu` who is administrator of this project. get your uuid and paste it into mission param
43 | 
44 | 
45 |     this will get 2-5 ids from distribute_ids.pkl, every uuid got ids are different.
46 |     After mission complete scrap, continue scrap fans_ids.pkl which contains many many ids. as possiable as you
47 |     can to scrap those fans ids.
48 |     :return:
49 |     """
50 |     scrap('3879293449')
51 |     if os.path.exists(DISTRIBUTE_IDS):
52 |         print('find distribute ids from {}'.format(DISTRIBUTE_IDS))
53 |         with open(DISTRIBUTE_IDS, 'rb') as f:
54 |             distribute_dict = pickle.load(f)
55 |         if os.path.exists(SCRAPED_MARK):
56 |             finished_ids = pickle.load(open(SCRAPED_MARK, 'rb'))
57 |         else:
58 |             finished_ids = []
59 |         try:
60 |             mission_ids = distribute_dict[distribute_uuid]
61 | 
62 |             if len([i for i in mission_ids if i in finished_ids]) == len(mission_ids):
63 |                 print('Good Done!!! Mission Complete!!')
64 |                 print('now will continue scrap fans_ids.pkl file.')
65 | 
66 |                 fans_id_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_fans.pkl')
67 |                 if os.path.exists(fans_id_file):
68 |                     fans_id = pickle.load(open(fans_id_file, 'rb'))
69 |                     for fd in fans_id:
70 |                         scrap(fd)
71 |             else:
72 |                 for md in mission_ids:
73 |                     scrap(md)
74 |         except Exception as e:
75 |             print(e)
76 |             print('distribute uuid invalid.')
77 | 
78 | 
79 | def scrap_single(sid):
80 |     """
81 |     this method scrap single id.
82 |     For you want scrap you own ids. just send it here
83 |     :param sid:
84 |     :return:
85 |     """
86 |     scrap(sid)
87 | 
88 | 
89 | if __name__ == '__main__':
90 |     if len(sys.argv) < 2:
91 |         print('run as python3 main.py your-uuid, you can get uuid via wechat `jintianiloveu`.')
92 |     else:
93 |         mission(sys.argv[1])
94 | 


--------------------------------------------------------------------------------
/scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/scraper/__init__.py


--------------------------------------------------------------------------------
/scraper/weibo_scraper.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | # file: sentence_similarity.py
  3 | # author: JinTian
  4 | # time: 24/03/2017 6:46 PM
  5 | # Copyright 2017 JinTian. All Rights Reserved.
  6 | #
  7 | # Licensed under the Apache License, Version 2.0 (the "License");
  8 | # you may not use this file except in compliance with the License.
  9 | # You may obtain a copy of the License at
 10 | #
 11 | #     http://www.apache.org/licenses/LICENSE-2.0
 12 | #
 13 | # Unless required by applicable law or agreed to in writing, software
 14 | # distributed under the License is distributed on an "AS IS" BASIS,
 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 16 | # See the License for the specific language governing permissions and
 17 | # limitations under the License.
 18 | # ------------------------------------------------------------------------
 19 | """
 20 | using guide:
 21 | setting accounts first:
 22 | 
 23 | under: weibo_terminator/settings/accounts.py
 24 | you can set more than one accounts, WT will using all accounts one by one,
 25 | if one banned, another will move on.
 26 | 
 27 | if you care about security, using subsidiary accounts instead.
 28 | 
 29 | """
 30 | import re
 31 | import sys
 32 | import os
 33 | import requests
 34 | from lxml import etree
 35 | import traceback
 36 | from settings.config import COOKIES_SAVE_PATH
 37 | import pickle
 38 | import time
 39 | from utils.string import is_number
 40 | import numpy as np
 41 | from settings.config import *
 42 | import logging
 43 | 
 44 | 
 45 | class WeiBoScraper(object):
 46 |     def __init__(self, using_account, scrap_id, cookies, filter_flag=0):
 47 |         self.using_account = using_account
 48 |         self.cookies = cookies
 49 | 
 50 |         self._init_cookies()
 51 |         self._init_headers()
 52 | 
 53 |         self.scrap_id = scrap_id
 54 |         self.filter = filter_flag
 55 |         self.user_name = ''
 56 |         self.weibo_num = 0
 57 |         self.weibo_scraped = 0
 58 |         self.following = 0
 59 |         self.followers = 0
 60 |         self.weibo_content = []
 61 |         self.num_zan = []
 62 |         self.num_forwarding = []
 63 |         self.num_comment = []
 64 |         self.weibo_detail_urls = []
 65 |         self.rest_time = 20
 66 |         self.rest_min_time = 20  # create random rest time
 67 |         self.rest_max_time = 30
 68 | 
 69 |         self.weibo_content_save_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_content.pkl')
 70 |         self.weibo_content_and_comment_save_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_content_and_comment.pkl')
 71 |         self.weibo_fans_save_file = os.path.join(CORPUS_SAVE_DIR, 'weibo_fans.pkl')
 72 |         self.big_v_ids_file = os.path.join(CORPUS_SAVE_DIR, 'big_v_ids.txt')
 73 | 
 74 |     def _init_cookies(self):
 75 |         cookie = {
 76 |             "Cookie": self.cookies
 77 |         }
 78 |         self.cookie = cookie
 79 | 
 80 |     def _init_headers(self):
 81 |         """
 82 |         avoid span
 83 |         :return:
 84 |         """
 85 |         headers = requests.utils.default_headers()
 86 |         user_agent = {
 87 |             'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0'
 88 |         }
 89 |         headers.update(user_agent)
 90 |         print('headers: ', headers)
 91 |         self.headers = headers
 92 | 
 93 |     def jump_scraped_id(self):
 94 |         """
 95 |         check mark scrapd ids, jump scraped one.
 96 |         :return:
 97 |         """
 98 |         if os.path.exists(SCRAPED_MARK) and os.path.getsize(SCRAPED_MARK) > 0:
 99 |             with open(SCRAPED_MARK, 'rb') as f:
100 |                 scraped_ids = pickle.load(f)
101 |             if self.scrap_id in scraped_ids:
102 |                 return True
103 |             else:
104 |                 return False
105 |         else:
106 |             return False
107 | 
108 |     def crawl(self):
109 |         # this is the most time-cost part, we have to catch errors, return to dispatch center
110 |         if self.jump_scraped_id():
111 |             print('scrap id {} already scraped, directly pass it.'.format(self.scrap_id))
112 |         else:
113 |             try:
114 |                 self._get_html()
115 |                 self._get_user_name()
116 |                 self._get_user_info()
117 | 
118 |                 self._get_fans_ids()
119 |                 self._get_weibo_content()
120 |                 self._get_weibo_content_and_comment()
121 |                 return True
122 |             except Exception as e:
123 |                 logging.error(e)
124 |                 print('some error above not catch, return to dispatch center, resign for new account..')
125 |                 return False
126 | 
127 |     def _get_html(self):
128 |         try:
129 |             url = 'http://weibo.cn/%s?filter=%s&page=1' % (self.scrap_id, self.filter)
130 |             print(url)
131 |             self.html = requests.get(url, cookies=self.cookie, headers=self.headers).content
132 |             print('success load html..')
133 |         except Exception as e:
134 |             print(e)
135 | 
136 |     def _get_user_name(self):
137 |         print('\n' + '-' * 30)
138 |         print('getting user name')
139 |         try:
140 |             selector = etree.HTML(self.html)
141 |             self.user_name = selector.xpath('//table//div[@class="ut"]/span[1]/text()')[0]
142 |             print('current user name is: {}'.format(self.user_name))
143 |         except Exception as e:
144 |             print(e)
145 |             print('html not properly loaded, maybe cookies out of date or account being banned. '
146 |                   'change an account please')
147 |             exit()
148 | 
149 |     def _get_user_info(self):
150 |         print('\n' + '-' * 30)
151 |         print('getting user info')
152 |         selector = etree.HTML(self.html)
153 |         pattern = r"\d+\.?\d*"
154 |         str_wb = selector.xpath('//span[@class="tc"]/text()')[0]
155 |         guid = re.findall(pattern, str_wb, re.S | re.M)
156 |         for value in guid:
157 |             num_wb = int(value)
158 |             break
159 |         self.weibo_num = num_wb
160 | 
161 |         str_gz = selector.xpath("//div[@class='tip2']/a/text()")[0]
162 |         guid = re.findall(pattern, str_gz, re.M)
163 |         self.following = int(guid[0])
164 | 
165 |         str_fs = selector.xpath("//div[@class='tip2']/a/text()")[1]
166 |         guid = re.findall(pattern, str_fs, re.M)
167 |         self.followers = int(guid[0])
168 |         print('current user all weibo num {}, following {}, followers {}'.format(self.weibo_num, self.following,
169 |                                                                                  self.followers))
170 | 
171 |     def _get_fans_ids(self):
172 |         """
173 |         this method will execute to scrap scrap_user's fans,
174 |         which means every time you scrap an user, you will get a bunch of fans ids,
175 |         more importantly, in this fans ids have no repeat one.
176 | 
177 |         BEWARE THAT: this method will not execute if self.followers < 200, you can edit this
178 |         value.
179 |         :return:
180 |         """
181 |         print('\n' + '-' * 30)
182 |         print('getting fans ids...')
183 |         if os.path.exists(self.big_v_ids_file):
184 |             with open(self.big_v_ids_file) as f:
185 |                 big_v_ids = f.read().split('\n')
186 |                 if self.scrap_id in big_v_ids:
187 |                     print('Fans ids of big_v {} (id {}) have been saved before.'.format(
188 |                         self.user_name, self.scrap_id))
189 |                     return
190 |         print(self.followers)
191 |         if self.followers < 200:
192 |             pass
193 |         else:
194 |             fans_ids = []
195 |             if os.path.exists(self.weibo_fans_save_file) and os.path.getsize(self.weibo_fans_save_file) > 0:
196 |                 with open(self.weibo_fans_save_file, 'rb') as f:
197 |                     fans_ids = pickle.load(f)
198 | 
199 |             fans_url = 'https://weibo.cn/{}/fans?'.format(self.scrap_id)
200 |             # first from fans html get how many page fans have
201 |             # beware that, this
202 |             print(fans_url)
203 |             # r = requests.get(fans_url, cookies=self.cookie, headers=self.headers).content
204 |             # print(r)
205 |             html_fans = requests.get(fans_url, cookies=self.cookie, headers=self.headers).content
206 |             selector = etree.HTML(html_fans)
207 |             try:
208 |                 if selector.xpath('//input[@name="mp"]') is None:
209 |                     page_num = 1
210 |                 else:
211 |                     page_num = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
212 |                 print('all fans have {} pages.'.format(page_num))
213 | 
214 |                 try:
215 |                     for i in range(page_num):
216 |                         if i % 5 == 0 and i != 0:
217 |                             print('[REST] rest {}s for cheating....'.format(self.rest_time))
218 |                             time.sleep(self.rest_time)
219 |                         fans_url_child = 'https://weibo.cn/{}/fans?page={}'.format(self.scrap_id, i)
220 |                         print('requesting fans url: {}'.format(fans_url))
221 |                         html_child = requests.get(fans_url_child, cookies=self.cookie, headers=self.headers).content
222 |                         selector_child = etree.HTML(html_child)
223 |                         fans_ids_content = selector_child.xpath("//div[@class='c']/table//a[1]/@href")
224 |                         ids = [i.split('/')[-1] for i in fans_ids_content]
225 |                         ids = list(set(ids))
226 |                         for d in ids:
227 |                             print('appending fans id {}'.format(d))
228 |                             fans_ids.append(d)
229 |                 except Exception as e:
230 |                     print('error: ', e)
231 |                     dump_fans_list = list(set(fans_ids))
232 |                     print(dump_fans_list)
233 |                     with open(self.weibo_fans_save_file, 'wb') as f:
234 |                         pickle.dump(dump_fans_list, f)
235 |                     print('fans ids not fully added, but this is enough, saved into {}'.format(
236 |                         self.weibo_fans_save_file))
237 | 
238 |                 dump_fans_list = list(set(fans_ids))
239 |                 print(dump_fans_list)
240 |                 with open(self.weibo_fans_save_file, 'wb') as f:
241 |                     pickle.dump(dump_fans_list, f)
242 |                 with open(self.big_v_ids_file, 'a') as f:
243 |                     f.write(self.scrap_id + '\n')
244 |                 print('successfully saved fans id file into {}'.format(self.weibo_fans_save_file))
245 | 
246 |             except Exception as e:
247 |                 logging.error(e)
248 | 
249 |     def _get_weibo_content(self):
250 |         print('\n' + '-' * 30)
251 |         print('getting weibo content...')
252 |         selector = etree.HTML(self.html)
253 |         try:
254 |             if selector.xpath('//input[@name="mp"]') is None:
255 |                 page_num = 1
256 |             else:
257 |                 page_num = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
258 |             pattern = r"\d+\.?\d*"
259 |             print('all weibo page {}'.format(page_num))
260 | 
261 |             start_page = 0
262 |             if os.path.exists(self.weibo_content_save_file) and os.path.getsize(self.weibo_content_save_file) > 0:
263 |                 print('load previous weibo_content file from {}'.format(self.weibo_content_save_file))
264 |                 obj = pickle.load(open(self.weibo_content_save_file, 'rb'))
265 |                 if self.scrap_id in obj.keys():
266 |                     self.weibo_content = obj[self.scrap_id]['weibo_content']
267 |                     start_page = obj[self.scrap_id]['last_scrap_page']
268 |             if start_page >= page_num:
269 |                 print('\nAll weibo contents of {} have been scrapped before\n'.format(self.user_name))
270 |                 return
271 | 
272 |             try:
273 |                 # traverse all weibo, and we will got weibo detail urls
274 |                 for page in range(start_page + 1, page_num + 1):
275 |                     url2 = 'http://weibo.cn/%s?filter=%s&page=%s' % (self.scrap_id, self.filter, page)
276 |                     html2 = requests.get(url2, cookies=self.cookie, headers=self.headers).content
277 |                     selector2 = etree.HTML(html2)
278 |                     content = selector2.xpath("//div[@class='c']")
279 |                     print('\n---- current solving page {} of {}'.format(page, page_num))
280 | 
281 |                     if page % 5 == 0:
282 |                         print('[REST] rest for %ds to cheat weibo site, avoid being banned.' % self.rest_time)
283 |                         time.sleep(self.rest_time)
284 | 
285 |                     if len(content) > 3:
286 |                         for i in range(0, len(content) - 2):
287 |                             detail = content[i].xpath("@id")[0]
288 |                             self.weibo_detail_urls.append('http://weibo.cn/comment/{}?uid={}&rl=0'.
289 |                                                           format(detail.split('_')[-1], self.scrap_id))
290 | 
291 |                             self.weibo_scraped += 1
292 |                             str_t = content[i].xpath("div/span[@class='ctt']")
293 |                             weibos = str_t[0].xpath('string(.)')
294 |                             self.weibo_content.append(weibos)
295 |                             print(weibos)
296 | 
297 |                             str_zan = content[i].xpath("div/a/text()")[-4]
298 |                             guid = re.findall(pattern, str_zan, re.M)
299 |                             num_zan = int(guid[0])
300 |                             self.num_zan.append(num_zan)
301 | 
302 |                             forwarding = content[i].xpath("div/a/text()")[-3]
303 |                             guid = re.findall(pattern, forwarding, re.M)
304 |                             num_forwarding = int(guid[0])
305 |                             self.num_forwarding.append(num_forwarding)
306 | 
307 |                             comment = content[i].xpath("div/a/text()")[-2]
308 |                             guid = re.findall(pattern, comment, re.M)
309 |                             num_comment = int(guid[0])
310 |                             self.num_comment.append(num_comment)
311 |             except etree.XMLSyntaxError as e:
312 |                 print('\n' * 2)
313 |                 print('=' * 20)
314 |                 print('weibo user {} all weibo content finished scrap.'.format(self.user_name))
315 |                 print('all weibo {}, all like {}, all comments {}'.format(
316 |                     len(self.weibo_content), np.sum(self.num_zan), np.sum(self.num_comment)))
317 |                 print('try saving weibo content for now...')
318 |                 self._save_content(page)
319 |                 # maybe should not need to release memory
320 |                 # del self.weibo_content
321 |             except Exception as e:
322 |                 print(e)
323 |                 print('\n' * 2)
324 |                 print('=' * 20)
325 |                 print('weibo user {} content scrap error occured {}.'.format(self.user_name, e))
326 |                 print('all weibo {}, all like {}, all comments {}'.format(
327 |                     len(self.weibo_content), np.sum(self.num_zan), np.sum(self.num_comment)))
328 |                 print('try saving weibo content for now...')
329 |                 self._save_content(page)
330 |                 # del self.weibo_content
331 |                 # should keep self.weibo_content
332 |             print('\n' * 2)
333 |             print('=' * 20)
334 |             print('all weibo {}, all like {}, all comments {}'.format(
335 |                 len(self.weibo_content), np.sum(self.num_zan), np.sum(self.num_comment)))
336 |             print('try saving weibo content for now...')
337 |             self._save_content(page)
338 |             del self.weibo_content
339 |             if self.filter == 0:
340 |                 print('共' + str(self.weibo_scraped) + '条微博')
341 |             else:
342 |                 print('共' + str(self.weibo_num) + '条微博，其中' + str(self.weibo_scraped) + '条为原创微博')
343 |         except IndexError as e:
344 |             print('get weibo info done, current user {} has no weibo yet.'.format(self.scrap_id))
345 |         except KeyboardInterrupt:
346 |             print('manually interrupted... try save wb_content for now...')
347 |             self._save_content(page - 1)
348 | 
349 |     def _save_content(self, page):
350 |         dump_obj = dict()
351 |         if os.path.exists(self.weibo_content_save_file) and os.path.getsize(self.weibo_content_save_file) > 0:
352 |             with open(self.weibo_content_save_file, 'rb') as f:
353 |                 dump_obj = pickle.load(f)
354 |             dump_obj[self.scrap_id] = {
355 |                 'weibo_content': self.weibo_content,
356 |                 'last_scrap_page': page
357 |             }
358 |             with open(self.weibo_content_save_file, 'wb') as f:
359 |                 pickle.dump(dump_obj, f)
360 | 
361 |         dump_obj[self.scrap_id] = {
362 |             'weibo_content': self.weibo_content,
363 |             'last_scrap_page': page
364 |         }
365 |         with open(self.weibo_content_save_file, 'wb') as f:
366 |             pickle.dump(dump_obj, f)
367 |         print('\nUser name: {} \t ID: {} '.format(self.user_name, self.scrap_id))
368 |         print('[CHEER] weibo content saved into {}, finished this part successfully.\n'.format(
369 |             self.weibo_content_save_file, page))
370 | 
371 |     def _get_weibo_content_and_comment(self):
372 |         """
373 |         all weibo will be saved into weibo_content_and_comment.pkl
374 |         in format:
375 |         {
376 |             scrap_id: {
377 |                 'weibo_detail_urls': [....],
378 |                 'last_scrap_index': 5,
379 |                 'content_and_comment': [
380 |                 {'content': '...', 'comment': ['..', '...', '...', '...',], 'last_idx':num0},
381 |                 {'content': '...', 'comment': ['..', '...', '...', '...',], 'last_idx':num1},
382 |                 {'content': '...', 'comment': ['..', '...', '...', '...',], 'last_idx':num2}
383 |                 ]
384 |             }
385 |         }
386 |         :return:
387 |         """
388 |         print('\n' + '-' * 30)
389 |         print('getting content and comment...')
390 |         weibo_detail_urls = self.weibo_detail_urls
391 |         start_scrap_index = 0
392 |         content_and_comment = []
393 |         if os.path.exists(self.weibo_content_save_file) and os.path.getsize(self.weibo_content_save_file) > 0:
394 |             print('load previous weibo_content file from {}'.format(self.weibo_content_save_file))
395 |             obj = pickle.load(open(self.weibo_content_save_file, 'rb'))
396 |             self.weibo_content = obj[self.scrap_id]['weibo_content']
397 | 
398 |         if os.path.exists(self.weibo_content_and_comment_save_file) and os.path.getsize(self.weibo_content_and_comment_save_file) > 0:
399 |             with open(self.weibo_content_and_comment_save_file, 'rb') as f:
400 |                 obj = pickle.load(f)
401 |                 if self.scrap_id in obj.keys():
402 |                     obj = obj[self.scrap_id]
403 |                     weibo_detail_urls = obj['weibo_detail_urls']
404 |                     start_scrap_index = obj['last_scrap_index']
405 |                     content_and_comment = obj['content_and_comment']
406 | 
407 |         end_scrap_index = len(weibo_detail_urls)
408 |         try:
409 |             for i in range(start_scrap_index + 1, end_scrap_index):
410 |                 url = weibo_detail_urls[i]
411 |                 one_content_and_comment = dict()
412 | 
413 |                 print('\n\nsolving weibo detail from {}'.format(url))
414 |                 print('No.{} weibo of total {}'.format(i, end_scrap_index))
415 |                 html_detail = requests.get(url, cookies=self.cookie, headers=self.headers).content
416 |                 selector_detail = etree.HTML(html_detail)
417 |                 # if current weibo content has no comment, skip it
418 |                 if not selector_detail.xpath('//*[@id="pagelist"]/form/div/input[1]/@value'):
419 |                     continue
420 |                 all_comment_pages = selector_detail.xpath('//*[@id="pagelist"]/form/div/input[1]/@value')[0]
421 |                 print('这是 {} 的微博：'.format(self.user_name))
422 |                 print('微博内容： {}'.format(self.weibo_content[i]))
423 |                 print('接下来是下面的评论：\n\n')
424 | 
425 |                 one_content_and_comment['content'] = self.weibo_content[i]
426 |                 one_content_and_comment['comment'] = []
427 |                 start_idx = 0
428 |                 end_idx = int(all_comment_pages) - 2
429 |                 if i == start_scrap_index + 1 and content_and_comment:
430 |                     one_cac = content_and_comment[-1]
431 |                     # use the following judgement because the previous data don't have 'last_idx'
432 |                     if 'last_idx' in one_cac.keys():  
433 |                         print('\nTrying to recover from the last comment of last content...\n')
434 |                         if one_cac['last_idx'] + 1 < end_idx:
435 |                             one_content_and_comment['comment'] = one_cac['comment']
436 |                             start_idx = one_cac['last_idx'] + 1
437 |                             print('last_idx: {}\n'.format(one_cac['last_idx']))
438 | 
439 |                 for page in range(start_idx, end_idx):
440 |                     print('\n---- current solving page {} of {}'.format(page, int(all_comment_pages) - 3))
441 |                     if page % 5 == 0:
442 |                         self.rest_time = np.random.randint(self.rest_min_time, self.rest_max_time)
443 |                         print('[ATTEMPTING] rest for %ds to cheat weibo site, avoid being banned.' % self.rest_time)
444 |                         time.sleep(self.rest_time)
445 | 
446 |                     # we crawl from page 2, cause front pages have some noise
447 |                     detail_comment_url = url + '&page=' + str(page + 2)
448 |                     no_content_pages = []
449 |                     try:
450 |                         # from every detail comment url we will got all comment
451 |                         html_detail_page = requests.get(detail_comment_url, cookies=self.cookie).content
452 |                         selector_comment = etree.HTML(html_detail_page)
453 | 
454 |                         comment_div_element = selector_comment.xpath('//div[starts-with(@id, "C_")]')
455 | 
456 |                         for child in comment_div_element:
457 |                             single_comment_user_name = child.xpath('a[1]/text()')[0]
458 |                             if child.xpath('span[1][count(*)=0]'):
459 |                                 single_comment_content = child.xpath('span[1][count(*)=0]/text()')[0]
460 |                             else:
461 |                                 span_element = child.xpath('span[1]')[0]
462 |                                 at_user_name = span_element.xpath('a/text()')[0]
463 |                                 at_user_name = '$' + at_user_name.split('@')[-1] + '$'
464 |                                 single_comment_content = span_element.xpath('/text()')
465 |                                 single_comment_content.insert(1, at_user_name)
466 |                                 single_comment_content = ' '.join(single_comment_content)
467 | 
468 |                             full_single_comment = '<' + single_comment_user_name + '>' + ': ' + single_comment_content
469 |                             print(full_single_comment)
470 |                             one_content_and_comment['comment'].append(full_single_comment)
471 |                             one_content_and_comment['last_idx'] = page
472 |                     except etree.XMLSyntaxError as e:
473 |                         no_content_pages.append(page)
474 |                         print('\n\nThis page has no contents and is passed: ', e)
475 |                         print('Total no_content_pages: {}'.format(len(no_content_pages)))
476 | 
477 |                     except Exception as e:
478 |                         print('Raise Exception in _get_weibo_content_and_comment, error:', e)
479 |                         print('\n' * 2)
480 |                         print('=' * 20)
481 |                         print('weibo user {} content_and_comment scrap error occured {}.'.format(self.user_name, e))
482 |                         self._save_content_and_comment(i, one_content_and_comment, weibo_detail_urls)
483 |                         print("\n\nComments are successfully save:\n User name: {}\n weibo content: {}\n\n".format(
484 |                             self.user_name, one_content_and_comment['content']))
485 |                 # save every one_content_and_comment
486 |                 self._save_content_and_comment(i, one_content_and_comment, weibo_detail_urls)
487 |                 print('weibo scrap done!')
488 |                 self.mark_as_scraped(self.scrap_id)
489 |                 print('*' * 30)
490 |                 print("\n\nComments are successfully save:\n User name: {}\n weibo content: {}".format(
491 |                     self.user_name, one_content_and_comment['content']))
492 |         except KeyboardInterrupt:
493 |             print('manually interrupted.. try save wb_content_and_comment for now...')
494 |             self._save_content_and_comment(i - 1, one_content_and_comment, weibo_detail_urls)
495 | 
496 |         print('\n' * 2)
497 |         print('=' * 20)
498 |         print('user {}, all weibo content and comment finished.'.format(self.user_name))
499 | 
500 |     def _save_content_and_comment(self, i, one_content_and_comment, weibo_detail_urls):
501 |         dump_dict = dict()
502 |         if os.path.exists(self.weibo_content_and_comment_save_file) and os.path.getsize(self.weibo_content_and_comment_save_file) > 0:
503 |             with open(self.weibo_content_and_comment_save_file, 'rb') as f:
504 |                 obj = pickle.load(f)
505 |                 dump_dict = obj
506 |                 if self.scrap_id in dump_dict.keys():
507 |                     dump_dict[self.scrap_id]['last_scrap_index'] = i
508 |                     dump_dict[self.scrap_id]['content_and_comment'].append(one_content_and_comment)
509 |                 else:
510 |                     dump_dict[self.scrap_id] = {
511 |                         'weibo_detail_urls': weibo_detail_urls,
512 |                         'last_scrap_index': i,
513 |                         'content_and_comment': [one_content_and_comment]
514 |                     }
515 |         else:
516 |             dump_dict[self.scrap_id] = {
517 |                 'weibo_detail_urls': weibo_detail_urls,
518 |                 'last_scrap_index': i,
519 |                 'content_and_comment': [one_content_and_comment]
520 |             }
521 |         with open(self.weibo_content_and_comment_save_file, 'wb') as f:
522 |             print('try saving weibo content and comment for now.')
523 |             pickle.dump(dump_dict, f)
524 | 
525 |     def switch_account(self, new_account):
526 |         assert new_account.isinstance(str), 'account must be string'
527 |         self.using_account = new_account
528 | 
529 |     @staticmethod
530 |     def mark_as_scraped(scrap_id):
531 |         """
532 |         this will mark an id to be scraped, next time will jump this id directly
533 |         :return:
534 |         """
535 |         scraped_ids = []
536 |         if os.path.exists(SCRAPED_MARK) and os.path.getsize(SCRAPED_MARK) > 0:
537 |             with open(SCRAPED_MARK, 'rb') as f:
538 |                 scraped_ids = pickle.load(f)
539 |         scraped_ids.append(scrap_id)
540 |         with open(SCRAPED_MARK, 'wb') as f:
541 |             pickle.dump(scraped_ids, f)
542 |         print('scrap id {} marked as scraped, next time will jump this id directly.'.format(scrap_id))
543 | 


--------------------------------------------------------------------------------
/scraper/weibo_scraper_m.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | # file: weibo_scraper_m.py
  3 | # author: JinTian
  4 | # time: 16/05/2017 11:32 AM
  5 | # Copyright 2017 JinTian. All Rights Reserved.
  6 | #
  7 | # Licensed under the Apache License, Version 2.0 (the "License");
  8 | # you may not use this file except in compliance with the License.
  9 | # You may obtain a copy of the License at
 10 | #
 11 | #     http://www.apache.org/licenses/LICENSE-2.0
 12 | #
 13 | # Unless required by applicable law or agreed to in writing, software
 14 | # distributed under the License is distributed on an "AS IS" BASIS,
 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 16 | # See the License for the specific language governing permissions and
 17 | # limitations under the License.
 18 | # ------------------------------------------------------------------------
 19 | """
 20 | this is another option for WeiBo scraper, using domain:
 21 | http://m.weibo.cn
 22 | 
 23 | this scraper under development, welcome submit PR to add more features
 24 | """
 25 | import os
 26 | import requests
 27 | import numpy as np
 28 | from settings.config import *
 29 | import pickle
 30 | 
 31 | 
 32 | class WeiBoScraperM(object):
 33 |     """
 34 |     this scraper under construct, contributor can follow this template to compatible WeiBoScraper API
 35 |     """
 36 | 
 37 |     def __init__(self, using_account, scrap_id, cookies, filter_flag=0):
 38 |         self.using_account = using_account
 39 |         self.scrap_id = scrap_id
 40 |         self.cookies = cookies
 41 |         self.filter_flag = filter_flag
 42 | 
 43 |     def _init_cookies(self):
 44 |         cookie = {
 45 |             "Cookie": self.cookies
 46 |         }
 47 |         self.cookie = cookie
 48 | 
 49 |     def _init_headers(self):
 50 |         """
 51 |         avoid span
 52 |         :return:
 53 |         """
 54 |         headers = requests.utils.default_headers()
 55 |         user_agent = {
 56 |             'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0'
 57 |         }
 58 |         headers.update(user_agent)
 59 |         print('headers: ', headers)
 60 |         self.headers = headers
 61 | 
 62 |     def jump_scraped_id(self):
 63 |         """
 64 |         check mark scrapd ids, jump scraped one.
 65 |         :return:
 66 |         """
 67 |         if os.path.exists(SCRAPED_MARK):
 68 |             with open(SCRAPED_MARK, 'rb') as f:
 69 |                 scraped_ids = pickle.load(f)
 70 |             if self.scrap_id in scraped_ids:
 71 |                 return True
 72 |             else:
 73 |                 return False
 74 |         else:
 75 |             return False
 76 | 
 77 |     def crawl(self):
 78 |         if self.jump_scraped_id():
 79 |             print('scrap id {} already scraped, directly pass it.'.format(self.scrap_id))
 80 |         else:
 81 |             try:
 82 |                 self._get_html()
 83 |                 self._get_user_name()
 84 |                 self._get_user_info()
 85 | 
 86 |                 self._get_fans_ids()
 87 |                 self._get_wb_content()
 88 |                 self._get_wb_content_and_comment()
 89 |                 return True
 90 |             except Exception as e:
 91 |                 print('error when crawl: ', e)
 92 |                 return False
 93 | 
 94 |     def _get_html(self):
 95 |         pass
 96 | 
 97 |     def _get_user_name(self):
 98 |         pass
 99 | 
100 |     def _get_user_info(self):
101 |         pass
102 | 
103 |     def _get_fans_ids(self):
104 |         pass
105 | 
106 |     def _get_wb_content(self):
107 |         pass
108 | 
109 |     def _save_content(self, page):
110 |         pass
111 | 
112 |     def _get_wb_content_and_comment(self):
113 |        pass
114 | 
115 |     def _save_content_and_comment(self, i, one_content_and_comment, weibo_detail_urls):
116 |        pass
117 | 
118 |     def switch_account(self, new_account):
119 |         assert new_account.isinstance(str), 'account must be string'
120 |         self.using_account = new_account
121 | 
122 |     @staticmethod
123 |     def mark_as_scraped(scrap_id):
124 |         """
125 |         this will mark an id to be scraped, next time will jump this id directly
126 |         :return:
127 |         """
128 |         scraped_ids = []
129 |         if os.path.exists(SCRAPED_MARK):
130 |             with open(SCRAPED_MARK, 'rb') as f:
131 |                 scraped_ids = pickle.load(f)
132 |         scraped_ids.append(scrap_id)
133 |         with open(SCRAPED_MARK, 'wb') as f:
134 |             pickle.dump(scraped_ids, f)
135 |         print('scrap id {} marked as scraped, next time will jump this id directly.'.format(scrap_id))


--------------------------------------------------------------------------------
/settings/accounts.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # file: accounts.py
 3 | # author: JinTian
 4 | # time: 17/04/2017 2:26 PM
 5 | # Copyright 2017 JinTian. All Rights Reserved.
 6 | #
 7 | # Licensed under the Apache License, Version 2.0 (the "License");
 8 | # you may not use this file except in compliance with the License.
 9 | # You may obtain a copy of the License at
10 | #
11 | #     http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | # ------------------------------------------------------------------------
19 | """
20 | place all your accounts inside here,
21 | terminator will using all accounts, if one being banned, another will move on.
22 | until all accounts were banned, scrap stop.
23 | """
24 | 
25 | 
26 | # please set this to your own, this is fake accounts
27 | accounts = [
28 |     {
29 |         "id": 'jintianiloveu',
30 |         "password": '77888',
31 |     },
32 | 
33 | ]
34 | 


--------------------------------------------------------------------------------
/settings/config.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # file: settings.py
 3 | # author: JinTian
 4 | # time: 13/04/2017 10:10 AM
 5 | # Copyright 2017 JinTian. All Rights Reserved.
 6 | #
 7 | # Licensed under the Apache License, Version 2.0 (the "License");
 8 | # you may not use this file except in compliance with the License.
 9 | # You may obtain a copy of the License at
10 | #
11 | #     http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | # ------------------------------------------------------------------------
19 | """
20 | all configurations set here, follow the instructions
21 | """
22 | import os
23 | 
24 | # you should not change this properly
25 | DEFAULT_USER_ID = 'realangelababy'
26 | LOGIN_URL = 'https://passport.weibo.cn/signin/login'
27 | 
28 | ID_FILE_PATH = './settings/id_file'
29 | 
30 | 
31 | # change this to your PhantomJS unzip path, point to bin/phantomjs executable file, full path
32 | PHANTOM_JS_PATH = '/Users/jintian/phantomjs-2.1.1-macosx/bin/phantomjs'
33 | 
34 | 
35 | COOKIES_SAVE_PATH = 'settings/cookies.pkl'
36 | 
37 | 
38 | CORPUS_SAVE_DIR = './scraped_corpus/'
39 | 
40 | DISTRIBUTE_IDS = 'distribute_ids.pkl'
41 | 
42 | SCRAPED_MARK = './settings/scraped.mark'
43 | 


--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lucasjinreal/weibo_terminator_workflow/0a709d70c2159417b1328d20f60d03de03aceb8c/utils/__init__.py


--------------------------------------------------------------------------------
/utils/connection.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # file: connection.py
 3 | # author: JinTian
 4 | # time: 13/04/2017 9:53 AM
 5 | # Copyright 2017 JinTian. All Rights Reserved.
 6 | #
 7 | # Licensed under the Apache License, Version 2.0 (the "License");
 8 | # you may not use this file except in compliance with the License.
 9 | # You may obtain a copy of the License at
10 | #
11 | #     http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | # ------------------------------------------------------------------------


--------------------------------------------------------------------------------
/utils/cookies.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | # file: cookies.py
  3 | # author: JinTian
  4 | # time: 17/04/2017 12:55 PM
  5 | # Copyright 2017 JinTian. All Rights Reserved.
  6 | #
  7 | # Licensed under the Apache License, Version 2.0 (the "License");
  8 | # you may not use this file except in compliance with the License.
  9 | # You may obtain a copy of the License at
 10 | #
 11 | #     http://www.apache.org/licenses/LICENSE-2.0
 12 | #
 13 | # Unless required by applicable law or agreed to in writing, software
 14 | # distributed under the License is distributed on an "AS IS" BASIS,
 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 16 | # See the License for the specific language governing permissions and
 17 | # limitations under the License.
 18 | # ------------------------------------------------------------------------
 19 | """
 20 | to get weibo_terminator switch accounts automatically, please be sure install:
 21 | 
 22 | PhantomJS, from http://phantomjs.org/download.html
 23 | 
 24 | """
 25 | import os
 26 | from selenium import webdriver
 27 | from selenium.webdriver.common.keys import Keys
 28 | from selenium.common.exceptions import InvalidElementStateException
 29 | import time
 30 | import sys
 31 | from tqdm import *
 32 | import pickle
 33 | 
 34 | from settings.accounts import accounts
 35 | from settings.config import LOGIN_URL, PHANTOM_JS_PATH, COOKIES_SAVE_PATH
 36 | 
 37 | 
 38 | def count_time():
 39 |     for i in tqdm(range(40)):
 40 |         time.sleep(0.5)
 41 | 
 42 | 
 43 | def get_cookie_from_network(account_id, account_password):
 44 |     url_login = LOGIN_URL
 45 |     phantom_js_driver_file = os.path.abspath(PHANTOM_JS_PATH)
 46 |     if os.path.exists(phantom_js_driver_file):
 47 |         try:
 48 |             print('loading PhantomJS from {}'.format(phantom_js_driver_file))
 49 |             driver = webdriver.PhantomJS(phantom_js_driver_file)
 50 |             # must set window size or will not find element
 51 |             driver.set_window_size(1640, 688)
 52 |             driver.get(url_login)
 53 |             # before get element sleep for 4 seconds, waiting for page render complete.
 54 |             print('opening weibo login page, this is first done for prepare for cookies. be patience to waite load '
 55 |                   'complete.')
 56 |             count_time()
 57 |             driver.find_element_by_xpath('//input[@id="loginName"]').send_keys(account_id)
 58 |             driver.find_element_by_xpath('//input[@id="loginPassword"]').send_keys(account_password)
 59 |             # driver.find_element_by_xpath('//input[@id="loginPassword"]').send_keys(Keys.RETURN)
 60 |             print('account id: {}'.format(account_id))
 61 |             print('account password: {}'.format(account_password))
 62 | 
 63 |             driver.find_element_by_xpath('//a[@id="loginAction"]').click()
 64 |         except InvalidElementStateException as e:
 65 |             print(e)
 66 |             print('error, account id {} is not valid, pass this account, you can edit it and then '
 67 |                   'update cookies. \n'
 68 |                   .format(account_id))
 69 | 
 70 |         try:
 71 |             cookie_list = driver.get_cookies()
 72 |             cookie_string = ''
 73 |             for cookie in cookie_list:
 74 |                 if 'name' in cookie and 'value' in cookie:
 75 |                     cookie_string += cookie['name'] + '=' + cookie['value'] + ';'
 76 |             if 'SSOLoginState' in cookie_string:
 77 |                 print('success get cookies!! \n {}'.format(cookie_string))
 78 |                 if os.path.exists(COOKIES_SAVE_PATH):
 79 |                     with open(COOKIES_SAVE_PATH, 'rb') as f:
 80 |                         cookies_dict = pickle.load(f)
 81 |                     if cookies_dict[account_id] is not None:
 82 |                         cookies_dict[account_id] = cookie_string
 83 |                         with open(COOKIES_SAVE_PATH, 'wb') as f:
 84 |                             pickle.dump(cookies_dict, f)
 85 |                         print('successfully save cookies into {}. \n'.format(COOKIES_SAVE_PATH))
 86 |                     else:
 87 |                         pass
 88 |                 else:
 89 |                     cookies_dict = dict()
 90 |                     cookies_dict[account_id] = cookie_string
 91 |                     with open(COOKIES_SAVE_PATH, 'wb') as f:
 92 |                         pickle.dump(cookies_dict, f)
 93 |                     print('successfully save cookies into {}. \n'.format(COOKIES_SAVE_PATH))
 94 |                 return cookie_string
 95 |             else:
 96 |                 print('error, account id {} is not valid, pass this account, you can edit it and then '
 97 |                       'update cookies. \n'
 98 |                       .format(account_id))
 99 |                 pass
100 | 
101 |         except Exception as e:
102 |             print(e)
103 | 
104 |     else:
105 |         print('can not find PhantomJS driver, please download from http://phantomjs.org/download.html based on your '
106 |               'system.')
107 | 


--------------------------------------------------------------------------------
/utils/string.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # file: string.py
 3 | # author: JinTian
 4 | # time: 17/04/2017 7:06 PM
 5 | # Copyright 2017 JinTian. All Rights Reserved.
 6 | #
 7 | # Licensed under the Apache License, Version 2.0 (the "License");
 8 | # you may not use this file except in compliance with the License.
 9 | # You may obtain a copy of the License at
10 | #
11 | #     http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | # ------------------------------------------------------------------------
19 | 
20 | 
21 | def is_valid_id(s):
22 |     try:
23 |         a = float(s)
24 |         return True
25 |     except ValueError as e:
26 |         return False
27 | 
28 | 
29 | def is_number(s):
30 |     try:
31 |         a = float(s)
32 |         return True
33 |     except ValueError as e:
34 |         return False
35 | 
36 | 


--------------------------------------------------------------------------------