├── README.md
├── amazon-spider-comments
│   ├── README.md
│   ├── multiplePull.py
│   └── pull.py
└── appannie-spider-comments
    ├── README.md
    ├── comments.xls
    ├── pull2.py
    ├── test.py
    └── word island.xls

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

## iOS game review crawler

A friend's game company needed to analyze reviews of its games, so they asked me to write this crawler.

### Install & run
See
amazon-spider-comments/README.md
appannie-spider-comments/README.md

### Overview & notes
Scrapes game reviews from amazon and appannie respectively.

#### amazon (requests + BeautifulSoup + xlwt + multiprocessing)

When the same IP sends a large number of requests to the Amazon servers within a short time, it is forcibly redirected to a CAPTCHA page and can only keep browsing normally after passing the check.

To get past this anti-scraping mechanism the crawler uses a simple "sleep and retry" strategy:
```
if type(comments) == type(None):
    print "try page",realpage,"again..."
    time.sleep(3)
```
Not exactly high tech, but it works: after a burst of requests, pausing for a few seconds is enough to keep crawling, so Amazon's defenses are apparently not that clever ^_^

PS: the better approach here would be proxy IPs. I had no reliable proxies at hand, the free ones found online vary wildly in quality, and my friend needed the data urgently with no time for testing, so I went with the lazy option.
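
For reference, if usable proxies ever turn up, requests can route the crawler's traffic through them via its `proxies` argument instead of hammering Amazon from a single IP. A minimal sketch, assuming a placeholder proxy address that would have to be replaced with a real one (this is not part of pull.py / multiplePull.py):
```
# hypothetical sketch, not part of the repo's scripts
import requests

proxies = {
    "http": "http://12.34.56.78:8080",    # placeholder address, swap in a real proxy
    "https": "http://12.34.56.78:8080",
}
url = "http://www.amazon.com/Word-Genius-Challenging-Exercise-Puzzle/product-reviews/B01A0MWG40/"
headers = {"User-Agent": "Mozilla/5.0"}
# the same kind of request downWeb() makes, just routed through the proxy
r = requests.get(url, headers=headers, proxies=proxies, timeout=10)
print r.status_code
```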

#### appannie (requests + xlwt + simulated login + simulated Ajax)
The crawler does not scrape the appannie pages directly. Inspecting the site turned up its API endpoint, and pulling the data straight from the API saves a lot of effort, but calling the API directly only returns three words:
```
ajax call only
```
So the program pretends to be an Ajax call by adding this header:
```
'X-Requested-With':'XMLHttpRequest'
```
Fetching appannie data also requires logging in and passing a valid csrftoken; see the code for the details.

#### ------update------
Rewrote the amazon crawler to fetch pages with multiple processes, see
amazon-spider-comments/multiplePull.py
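
multiplePull.py starts one `Process` per review page. Purely as a point of comparison (this is a sketch, not what the script actually does), the same idea can be written with `multiprocessing.Pool`, which caps how many pages are in flight at once and collects the results in the parent process; `fetchPage` below is a simplified stand-in for the script's downWeb/matchRe/saveToExcel pipeline:
```
# hypothetical Pool-based sketch, not the approach multiplePull.py uses
from multiprocessing import Pool
import requests

URL = ("http://www.amazon.com/Word-Genius-Challenging-Exercise-Puzzle/product-reviews/"
       "B01A0MWG40/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d")

def fetchPage(page):
    # simplified stand-in for downWeb(): just fetch one review page
    r = requests.get(URL % (page, page), headers={"User-Agent": "Mozilla/5.0"})
    return page, len(r.content)

if __name__ == '__main__':
    pool = Pool(processes=4)                       # at most 4 pages in flight at once
    results = pool.map(fetchPage, range(1, 16))    # pages 1..15, results come back in order
    pool.close()
    pool.join()
    for page, size in results:
        print "page", page, "->", size, "bytes"
```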

--------------------------------------------------------------------------------
/amazon-spider-comments/README.md:
--------------------------------------------------------------------------------

#### ------requirements------
requests==2.4.3

xlwt==0.7.5

beautifulsoup4

--------------------------------------------------------------------------------
/amazon-spider-comments/multiplePull.py:
--------------------------------------------------------------------------------

# -*- coding:utf-8 -*-
import requests
import re
import json
from bs4 import BeautifulSoup
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from multiprocessing import Process


def downWeb(url):
    # download a single review page; retry with growing pauses if Amazon
    # answers with its CAPTCHA page instead of the review list
    user_agent = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0")
    headers = {"User-Agent":user_agent,
               #"Referer":referer,
               "Host":"www.amazon.com",
               'Connection':'keep-alive',
               'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Encoding':'gzip, deflate, sdch',
               'Accept-Language':'zh-CN,zh;q=0.8',
               'Cache-Control':'no-cache',
               'Pragma':'no-cache',
               'Upgrade-Insecure-Requests':'1'
               }
    contentlist = []

    try:
        time.sleep(2)
        realUrl = url
        realpage = realUrl[-2:]   # the tail of the URL doubles as a page label in log messages
        r = requests.get(realUrl, headers=headers)
        soup = BeautifulSoup(r.content)
        comments = soup.find(id="cm_cr-review_list")
        if type(comments) == type(None):
            print "try page",realpage,"again..."
            time.sleep(3)
            r = requests.get(realUrl, headers=headers)
            soup = BeautifulSoup(r.content)
            comments = soup.find(id="cm_cr-review_list")
            if type(comments) == type(None):
                print "try page",realpage,"again..."
                time.sleep(5)
                r = requests.get(realUrl, headers=headers)
                soup = BeautifulSoup(r.content)
                comments = soup.find(id="cm_cr-review_list")
                if type(comments) == type(None):
                    print "page ",realpage,"gave up !!!!"
                else:
                    print "page",realpage,"downloaded successfully..."
                    contentlist.append(r.content)
            else:
                print "page",realpage,"downloaded successfully..."
                contentlist.append(r.content)
        else:
            print "page",realpage,"downloaded successfully..."
            contentlist.append(r.content)
    except requests.exceptions.RequestException as e:
        print e
    return contentlist

def matchRe(contentlist):
    # parse the downloaded pages and collect the review fields into parallel lists
    system = 'amazon'
    rate = []
    title = []
    author = []
    text = []
    translate = ""
    date = []
    num = 1

    for each in contentlist:
        content = each
        soup = BeautifulSoup(content)

        comments = soup.find(id="cm_cr-review_list")
        comment = comments.find_all("div","a-section review")
        #print "put in page",num,"comment"

        for each in comment:
            eachrate = each.i.span.get_text().split()[0].decode().encode('utf-8')
            rate.append(eachrate)
            eachtitle = each.find_all("a", class_="review-title")[0].get_text().decode('utf-8')
            title.append(repr(eachtitle)[2:-1])
            eachauthor = each.find_all("a", class_="author")[0].get_text().decode('utf-8')
            author.append(repr(eachauthor)[2:-1])
            eachtext = (each.find_all("span", class_="review-text")[0].get_text().decode('utf-8'))
            text.append(repr(eachtext)[2:-1])
            eachdate = each.find_all("span", class_="review-date")[0].get_text().decode().encode('utf-8')
            date.append(eachdate)

    commentDict = dict(system=system,rate=rate,
                       title=title,author=author,text=text,
                       translate=translate,date=date)
    print len(author),"comments downloaded successfully..."
    return commentDict

def saveToExcel(comment, filename='WordGenius.xls'):
    # each worker process writes its own workbook, so the caller passes a per-page filename
    system,rate,title,author,text,translate,date = (comment["system"],
        comment["rate"],comment["title"],comment["author"],
        comment["text"],comment["translate"],comment["date"])

    import xlwt

    efile = xlwt.Workbook()
    table = efile.add_sheet('Sheet1')
    table.write(0,0,u'平台')
    table.write(0,1,u'评分rate')
    table.write(0,2,u'review标题')
    table.write(0,3,u'reviewer作者')
    table.write(0,4,u'review正文')
    table.write(0,5,u'中文简单描述')
    table.write(0,6,u'时间')

    for num,each in enumerate(rate):
        index = num +1
        try:
            table.write(index,0,system)
            table.write(index,1,rate[num])
            table.write(index,2,title[num])
            table.write(index,3,author[num])
            table.write(index,4,text[num])
            table.write(index,5,translate)
            table.write(index,6,date[num])
        except Exception as e:
            print "row",index,"skipped:",e

    efile.save(filename)
    print "Save data successful..."

def savefile(content):
    f = open('comments.txt','w')
    f.write(content)
    f.close()

def runMultiple(url, page):
    # fetch one review page, parse it and save it to its own workbook
    # (previously every worker wrote the same WordGenius.xls, so pages overwrote each other)
    content = downWeb(url)
    comments = matchRe(content)
    saveToExcel(comments, 'WordGenius_page%d.xls' % page)


if __name__ == '__main__':
    page = 15
    Crushurl = "http://www.amazon.com/Crush-Letters-Challenging-Search-Puzzle/product-reviews/B00JPVPX2A/ref=cm_cr_pr_btm_link_%d?ie=UTF8&%%3BshowViewpoints=1&%%3BsortBy=recent&%%3BpageNumber=9&pageNumber=%d"
    WordFallurl = "http://www.amazon.com/WordFall-Addictive-Words-Search-Puzzle/product-reviews/B016EX8NK0/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d"
    WordJungle ="http://www.amazon.com/Word-Jungle-Challenging-Brain-Puzzle/product-reviews/B015II9BBC/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d"
    HiWords ="http://www.amazon.com/Hi-Words-Word-Search-Puzzle/product-reviews/B00GY0PQZ4/ref=cm_cr_pr_btm_link_%d?ie=UTF8&%%3BshowViewpoints=1&%%3BsortBy=recent&pageNumber=%d"
    WordGenius ="http://www.amazon.com/Word-Genius-Challenging-Exercise-Puzzle/product-reviews/B01A0MWG40/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d"

    # crawl the pages with multiple processes: start one worker per page,
    # then wait for all of them to finish
    workers = []
    for each in range(0,page):
        realpage = each +1
        url = WordGenius %(realpage, realpage)
        p = Process(target=runMultiple, args=(url, realpage))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()

--------------------------------------------------------------------------------
/amazon-spider-comments/pull.py:
--------------------------------------------------------------------------------

# -*- coding:utf-8 -*-
import requests
import re
import json
from bs4 import BeautifulSoup
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


def downWeb(url,page):
    # download review pages 1..page; retry with growing pauses when Amazon
    # answers with its CAPTCHA page instead of the review list
    user_agent = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0")
    headers = {"User-Agent":user_agent,
               #"Referer":referer,
               "Host":"www.amazon.com",
               'Connection':'keep-alive',
               'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Encoding':'gzip, deflate, sdch',
               'Accept-Language':'zh-CN,zh;q=0.8',
               'Cache-Control':'no-cache',
               'Pragma':'no-cache',
               'Upgrade-Insecure-Requests':'1'
               }
    contentlist = []
    for each in range(0,page):
        realpage = each +1
        realUrl = url %(realpage, realpage)
        try:
            time.sleep(2)
            r = requests.get(realUrl, headers=headers)
            soup = BeautifulSoup(r.content)
            comments = soup.find(id="cm_cr-review_list")
            if type(comments) == type(None):
                print "try page",realpage,"again..."
                time.sleep(3)
                r = requests.get(realUrl, headers=headers)
                soup = BeautifulSoup(r.content)
                comments = soup.find(id="cm_cr-review_list")
                if type(comments) == type(None):
                    print "try page",realpage,"again..."
                    time.sleep(5)
                    r = requests.get(realUrl, headers=headers)
                    soup = BeautifulSoup(r.content)
                    comments = soup.find(id="cm_cr-review_list")
                    if type(comments) == type(None):
                        print "page ",realpage,"gave up !!!!"
                    else:
                        print "page",realpage,"downloaded successfully..."
                        contentlist.append(r.content)
                else:
                    print "page",realpage,"downloaded successfully..."
                    contentlist.append(r.content)
            else:
                print "page",realpage,"downloaded successfully..."
                contentlist.append(r.content)
        except requests.exceptions.RequestException as e:
            print e
    return contentlist

def matchRe(contentlist):
    # parse the downloaded pages and collect the review fields into parallel lists
    system = 'amazon'
    rate = []
    title = []
    author = []
    text = []
    translate = ""
    date = []
    num = 1

    for each in contentlist:
        content = each
        soup = BeautifulSoup(content)

        comments = soup.find(id="cm_cr-review_list")
        comment = comments.find_all("div","a-section review")
        #print "put in page",num,"comment"

        for each in comment:
            eachrate = each.i.span.get_text().split()[0].decode().encode('utf-8')
            rate.append(eachrate)
            eachtitle = each.find_all("a", class_="review-title")[0].get_text().decode('utf-8')
            title.append(repr(eachtitle)[2:-1])
            eachauthor = each.find_all("a", class_="author")[0].get_text().decode('utf-8')
            author.append(repr(eachauthor)[2:-1])
            eachtext = (each.find_all("span", class_="review-text")[0].get_text().decode('utf-8'))
            text.append(repr(eachtext)[2:-1])
            eachdate = each.find_all("span", class_="review-date")[0].get_text().decode().encode('utf-8')
            date.append(eachdate)

    commentDict = dict(system=system,rate=rate,
                       title=title,author=author,text=text,
                       translate=translate,date=date)
    print len(author),"comments downloaded successfully..."
    return commentDict

def saveToExcel(comment):
    system,rate,title,author,text,translate,date = (comment["system"],
        comment["rate"],comment["title"],comment["author"],
        comment["text"],comment["translate"],comment["date"])

    import xlwt

    efile = xlwt.Workbook()
    table = efile.add_sheet('Sheet1')
    table.write(0,0,u'平台')
    table.write(0,1,u'评分rate')
    table.write(0,2,u'review标题')
    table.write(0,3,u'reviewer作者')
    table.write(0,4,u'review正文')
    table.write(0,5,u'中文简单描述')
    table.write(0,6,u'时间')

    for num,each in enumerate(rate):
        index = num +1
        try:
            table.write(index,0,system)
            table.write(index,1,rate[num])
            table.write(index,2,title[num])
            table.write(index,3,author[num])
            table.write(index,4,text[num])
            table.write(index,5,translate)
            table.write(index,6,date[num])
        except Exception as e:
            print "row",index,"skipped:",e

    efile.save('WordGenius.xls')
    print "Save data successful..."

def savefile(content):
    f = open('comments.txt','w')
    f.write(content)
    f.close()

if __name__ == '__main__':
    page = 15
    Crushurl = "http://www.amazon.com/Crush-Letters-Challenging-Search-Puzzle/product-reviews/B00JPVPX2A/ref=cm_cr_pr_btm_link_%d?ie=UTF8&%%3BshowViewpoints=1&%%3BsortBy=recent&%%3BpageNumber=9&pageNumber=%d"
    WordFallurl = "http://www.amazon.com/WordFall-Addictive-Words-Search-Puzzle/product-reviews/B016EX8NK0/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d"
    WordJungle ="http://www.amazon.com/Word-Jungle-Challenging-Brain-Puzzle/product-reviews/B015II9BBC/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d"
    HiWords ="http://www.amazon.com/Hi-Words-Word-Search-Puzzle/product-reviews/B00GY0PQZ4/ref=cm_cr_pr_btm_link_%d?ie=UTF8&%%3BshowViewpoints=1&%%3BsortBy=recent&pageNumber=%d"
    WordGenius ="http://www.amazon.com/Word-Genius-Challenging-Exercise-Puzzle/product-reviews/B01A0MWG40/ref=cm_cr_pr_btm_link_%d?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=%d"
    content = downWeb(WordGenius,page)
    comments = matchRe(content)
    saveToExcel(comments)
    #savefile(comments)

--------------------------------------------------------------------------------
/appannie-spider-comments/README.md:
--------------------------------------------------------------------------------

#### ------requirements------
requests==2.4.3

xlwt==0.7.5

--------------------------------------------------------------------------------
/appannie-spider-comments/comments.xls:
--------------------------------------------------------------------------------

https://raw.githubusercontent.com/fankcoder/spider-comments/faaf505646e3ac6ff14b86cbab8028921f33a405/appannie-spider-comments/comments.xls

--------------------------------------------------------------------------------
/appannie-spider-comments/pull2.py:
--------------------------------------------------------------------------------

# -*- coding:utf-8 -*-
import requests
import re
import json

def loginWeb(url):
    referer = "https://www.appannie.com/account/login/?_ref=header"
    user_agent = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0")
    headers = {"User-Agent":user_agent,
               "Referer":referer,
               "Host":"www.appannie.com",
               'Connection':'keep-alive',
               'Accept':'application/json, text/plain,*/*',
               'Accept-Encoding':'gzip, deflate, sdch',
               'Accept-Language':'zh-CN,zh;q=0.8',
               'X-NewRelic-ID':'VwcPUFJXGwEBUlJSDgc=',
               # pretend to be an Ajax call, otherwise the API only answers "ajax call only"
               'X-Requested-With':'XMLHttpRequest',
               }

    urlogin = "https://www.appannie.com/account/login/"
    s = requests.Session()
    # GET the login page first so the session picks up a csrftoken cookie
    s.get(urlogin,headers=headers)
    csrftoken = s.cookies['csrftoken']
    postdata = {
        'csrfmiddlewaretoken':csrftoken,
        'next':'/dashboard/home/',
        'username':'ynt30869@figjs.com',
        'password':'thevmorgEej6',
    }

    s.headers=headers
    login_response = s.post(url=urlogin,data=postdata)
    if not 200 <= login_response.status_code < 300:
        raise Exception("Error while logging in, code: %d" % (login_response.status_code))
    else:
        print "login success..."
        headers["X-CSRFToken"] = csrftoken
        r = s.get(url=url,headers=headers)
        if not 200 <= r.status_code < 300:
            raise Exception("Error while downloading website, code:%d" % (r.status_code))
        else:
            return r.content

def matchRe(content):
    # the review endpoint returns JSON; pull the wanted fields out of data["rows"]
    decodejson = json.loads(content)
    rowslist = decodejson["data"]["rows"]

    system = 'ios'
    version = []
    rate = []
    title = []
    author = []
    text = []
    translate = ""
    date = []
    country = []

    for each in rowslist:
        version.append(each["version"])
        rate.append(each["rating"])
        title.append(each["title"])
        author.append(each["author"])
        text.append(each["content"])
        date.append(each["date"])
        country.append(each["country"]["code"])

    commentDict = dict(system=system,version=version,rate=rate,
                       title=title,author=author,text=text,
                       translate=translate,date=date,country=country)
    print "create Dict done..."
    return commentDict

def saveToExcel(comment):
    system,version,rate,title,author,text,translate,date,country = (comment["system"],
        comment["version"],comment["rate"],comment["title"],comment["author"],
        comment["text"],comment["translate"],comment["date"],comment["country"])

    import xlwt

    efile = xlwt.Workbook()
    table = efile.add_sheet('Sheet1')
    table.write(0,0,u'平台')
    table.write(0,1,u'版本')
    table.write(0,2,u'评分rate')
    table.write(0,3,u'review标题')
    table.write(0,4,u'reviewer作者')
    table.write(0,5,u'review正文')
    table.write(0,6,u'中文简单描述')
    table.write(0,7,u'时间')
    table.write(0,8,u'国家')

    for num,each in enumerate(version):
        index = num +1
        table.write(index,0,system)
        table.write(index,1,version[num])
        table.write(index,2,rate[num])
        table.write(index,3,title[num])
        table.write(index,4,author[num])
        table.write(index,5,text[num])
        table.write(index,6,translate)
        table.write(index,7,date[num])
        table.write(index,8,country[num])

    efile.save('word island.xls')

def savefile(content):
    f = open('index.html','wb')
    f.write(content)
    f.close()

if __name__ == '__main__':
    indexUrl = "https://www.appannie.com"
    #commentUrl="https://www.appannie.com/apps/ios/app/hi-words-a-new-word-search-puzzle-game/reviews/table/?date=2015-10-12~2016-01-02&orderby=&desc=t&page=1&limit=10"
    #commentUrl="https://www.appannie.com/apps/ios/app/wordfall-most-addictive-words/reviews/table/?date=2015-01-29~2015-12-21&orderby=&desc=t&page=1&limit=200"
    #commentUrl="https://www.appannie.com/apps/ios/app/word-jungle-challenging-word/reviews/table/?date=2015-02-01~2016-01-03&orderby=&desc=t&page=1&limit=200"
    commentUrl="https://www.appannie.com/apps/ios/app/word-island-new-challenging/reviews/table/?date=2015-01-10~2015-10-18&orderby=&desc=t&page=1&limit=200"
    content = loginWeb(commentUrl)
    commentData = matchRe(content)
    saveToExcel(commentData)
    #savefile(content)

--------------------------------------------------------------------------------
/appannie-spider-comments/test.py:
--------------------------------------------------------------------------------

# scratch script: create a Mozilla-format cookie jar and write it out (empty) to hello.txt
import cookielib
filename = 'hello.txt'
cookie = cookielib.MozillaCookieJar(filename)
cookie.save(ignore_discard=True, ignore_expires=True)
-------------------------------------------------------------------------------- /appannie-spider-comments/word island.xls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/spider-comments/faaf505646e3ac6ff14b86cbab8028921f33a405/appannie-spider-comments/word island.xls --------------------------------------------------------------------------------