├── .gitignore ├── LICENSE.txt ├── MANIFEST.in ├── README.md ├── quora_scraper ├── chromedriver └── scraper.py ├── requirements.txt └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.md
2 | include LICENSE.txt
3 | include quora_scraper/chromedriver
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Quora-scraper
2 | 
3 | [![N|Solid](https://cldup.com/dTxpPi9lDf.thumb.png)](https://github.com/banyous/quora-scraper)
4 | 
5 | 
6 | Quora-scraper is a command-line application written in Python that simulates a browser environment to let you scrape Quora's rich textual data. You can use one of its three scraping modules to: find questions that discuss certain topics (such as Finance, Politics, Tesla or Donald-Trump), scrape Quora answers related to certain questions, or scrape user profiles. Please use it responsibly!
7 | 
8 | ## Install
9 | To use the scraper, please follow the steps below:
10 | - Install Python 3.6 or a later version.
11 | - Install the latest version of Google Chrome.
12 | - Install quora-scraper:
13 | 
14 | ```sh
15 | $ pip install quora-scraper
16 | ```
17 | To update quora-scraper:
18 | 
19 | ```sh
20 | $ pip install quora-scraper --upgrade
21 | ```
22 | 
23 | Alternatively, you can clone the project and install it with the command below (make sure you cd into the quora-scraper folder first):
24 | 
25 | ```sh
26 | $ python setup.py install
27 | ```
28 | 
29 | ## Usage
30 | 
31 | quora-scraper has three scraping modules: ```questions```, ```answers``` and ```users```.
32 | #### 1) Scraping question URLs:
33 | 
34 | You can scrape questions related to certain topics using the ```questions``` command. This module takes as input a list of topic keywords. The output is a questions_URL file containing the topic's question links.
35 | 
36 | Scraping a topic's questions can be done as follows:
37 | 
38 | - a) Use the -l parameter + a topic keywords list.
39 | 
40 | ```sh
41 | $ quora-scraper questions -l [finance,politics,Donald-Trump]
42 | ```
43 | 
44 | - b) Use the -f parameter + a topic keywords file location (keywords must be line-separated inside the file):
45 | 
46 | ```sh
47 | $ quora-scraper questions -f topics_file.txt
48 | ```
49 | 
50 | #### 2) Scraping answers:
51 | 
52 | Quora answers are scraped using the ```answers``` command. This module takes as input a list of question URLs. The output is a file of scraped answers (answers.txt). An answer consists of:
53 | 
54 | Quest-ID | AnswerDate | AnswerAuthor-ID | Quest-tags | Answer-Text
55 | 
56 | To scrape answers, use one of the following methods:
57 | 
58 | - a) Use the -l parameter + a question URLs list.
59 | 
60 | ```sh
61 | $ quora-scraper answers -l [https://www.quora.com/Is-milk-good,https://www.quora.com/Was-Einstein-a-fake-and-a-plagiarist]
62 | ```
63 | 
64 | - b) Use the -f parameter + a question URLs file location:
65 | 
66 | ```sh
67 | $ quora-scraper answers -f questions_url.txt
68 | ```
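Each line of answers.txt stores one answer with the fields above, separated by tabs. A minimal sketch for loading the output from Python is shown below (the path is an assumption based on the scraper's default `QuoraScraperData/answers` output folder under your Documents directory):

```python
# minimal sketch: read the tab-separated answers.txt written by the answers module
# (the Documents-based path is an assumption; adjust it to wherever your output landed)
from pathlib import Path

answers_file = Path.home() / "Documents" / "QuoraScraperData" / "answers" / "answers.txt"

with open(answers_file, encoding="utf-8") as f:
    for line in f:
        # fields: Quest-ID, AnswerDate, AnswerAuthor-ID, Quest-tags, Answer-Text
        question_id, answer_date, author_id, tags, answer_text = line.rstrip("\n").split("\t", 4)
        print(question_id, answer_date, author_id)
```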
69 | 
70 | #### 3) Scraping Quora user profiles:
71 | 
72 | You can scrape Quora user profiles using the ```users``` command. The users module takes as input a list of Quora user IDs. The output is a UserProfile file containing:
73 | 
74 | First line (profile information):
75 | UserID | ProfileDescription | ProfileBio | Location | TotalViews | NBAnswers | NBQuestions | NBFollowers | NBFollowing
76 | 
77 | Remaining lines (the user's answers):
78 | AnswerDate | QuestionID | AnswerText
79 | 
80 | Scraping user profiles can be done as follows:
81 | 
82 | - a) Use the -l parameter + a User-IDs list.
83 | ```sh
84 | $ quora-scraper users -l [Albert-Einstein-195,Jackie-Chan-8]
85 | ```
86 | 
87 | - b) Use the -f parameter + a User-IDs file.
88 | 
89 | ```sh
90 | $ quora-scraper users -f quora_username_file.txt
91 | ```
92 | 
93 | ### Notes
94 | a) Input files must be line-separated.
95 | 
96 | b) Output file fields are tab-separated.
97 | 
98 | c) You can add a list/line index parameter in order to start the scraping from that index. The command below will start scraping from the "physics" keyword:
99 | ```sh
100 | $ quora-scraper questions -l [finance,politics,tech,physics,life,sports] -i 3
101 | ```
102 | 
103 | d) The Quora website puts a limit on the number of questions accessible on a topic page. Thus, even if a topic has a large number of questions (e.g. 100k), the number of scraped question links will not exceed 2k or 3k questions.
104 | 
105 | e) For more help use:
106 | ```sh
107 | $ quora-scraper --help
108 | ```
109 | f) Quora-scraper uses XPaths and bs4 methods to scrape Quora webpage elements. Since Quora's HTML structure is constantly changing, the code may need modification from time to time. Please feel free to update and contribute to the source code in order to keep the scraper up to date.
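g) Besides the command line, the three scraping functions can also be imported and called directly from Python. Below is a minimal sketch (it assumes the package is installed; the output folder is just an example path, and each function mirrors the corresponding CLI module):

```python
# minimal sketch: calling the scraper modules from Python instead of the CLI
# (assumes quora-scraper is installed; "quora_output" is an arbitrary example folder)
from pathlib import Path
from quora_scraper import scraper

out_dir = Path("quora_output")
out_dir.mkdir(exist_ok=True)

scraper.questions(["finance", "politics"], out_dir)               # topic keywords -> question URL files
scraper.answers(["https://www.quora.com/Is-milk-good"], out_dir)  # question URLs  -> answers.txt
scraper.users(["Jackie-Chan-8"], out_dir)                         # user IDs       -> one profile file per user
```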
110 | 
111 | 
112 | License
113 | ----
114 | 
115 | This project uses the following license: [MIT]
116 | 
117 | 
118 | 
119 | 
120 | [//]: # (These are reference links used in the body of this note and get stripped out when the markdown processor does its job. There is no need to format nicely because it shouldn't be seen. Thanks SO - http://stackoverflow.com/questions/4823468/store-comments-in-markdown-syntax)
121 | 
122 | 
123 | [MIT]: https://opensource.org/licenses/MIT
124 | 
125 | 
--------------------------------------------------------------------------------
/quora_scraper/chromedriver:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/banyous/quora-scraper/1bed6227bfb46b618379a33ead2ab368101e3cbc/quora_scraper/chromedriver
--------------------------------------------------------------------------------
/quora_scraper/scraper.py:
--------------------------------------------------------------------------------
1 | # scraper.py
2 | DEBUG = 1
3 | import os
4 | import sys
5 | import time
6 | import json
7 | import pathlib
8 | from pathlib import Path
9 | import random
10 | import userpaths
11 | import dateparser
12 | import argparse
13 | from datetime import datetime, timedelta
14 | from bs4 import BeautifulSoup
15 | from selenium import webdriver
16 | from selenium.webdriver.chrome.options import Options
17 | from selenium.webdriver.common.action_chains import ActionChains
18 | from selenium.webdriver.common.by import By
19 | from selenium.webdriver.support import expected_conditions as EC
20 | from selenium.webdriver.support.ui import WebDriverWait
21 | 
22 | 
23 | # -------------------------------------------------------------
24 | # -------------------------------------------------------------
25 | def connect_chrome():
26 |     options = Options()
27 |     options.add_argument('--headless')
28 |     options.add_argument('--log-level=3')
29 |     options.add_argument("--incognito")
30 |     options.add_argument("--no-sandbox")
31 |     options.add_argument("--disable-dev-shm-usage")
32 |     # try:
33 |     #     import quora_scraper
34 |     #     package_path=str(quora_scraper.__path__).split("'")[1]
35 |     #     driver_path= Path(package_path) / "chromedriver"
36 |     # except:
37 |     #     driver_path= Path.cwd() / "chromedriver"
38 |     # driver_path= Path.cwd() / "chromedriver"
39 | 
40 |     driver = webdriver.Chrome(options=options)
41 |     driver.maximize_window()
42 |     time.sleep(2)
43 |     return driver
44 | 
45 | 
46 | # -------------------------------------------------------------
47 | # -------------------------------------------------------------
48 | # convert Quora counts such as '1.2k' (kilo) or '3m' (million) into integers
49 | def convert_number(number):
50 |     if 'k' in number:
51 |         n = float(number.lower().replace('k', '').replace(' ', '')) * 1000
52 |     elif 'm' in number:
53 |         n = float(number.lower().replace('m', '').replace(' ', '')) * 1000000
54 |     else:
55 |         n = number
56 |     return int(n)
57 | 
58 | 
59 | # -------------------------------------------------------------
60 | # -------------------------------------------------------------
61 | # convert Quora dates (such as "Answered 2 months ago") to YYYY-MM-DD format
62 | def convert_date_format(date_text):
63 |     try:
64 |         if "Updated" in date_text:
65 |             date = date_text[8:]
66 |         else:
67 |             date = date_text[9:]
68 |         date = dateparser.parse(date).strftime("%Y-%m-%d")  # parse the stripped date string
69 |     except:  # when updated or answered in the same week (ex: Updated Sat)
70 |         date = dateparser.parse("7 days ago").strftime("%Y-%m-%d")
71 |     return date
72 | 
73 | 
74 | # -------------------------------------------------------------
75 | # -------------------------------------------------------------
76 | def scroll_up(self, nb_times):  # 'self' is the Selenium WebDriver instance
77 |     for iii in range(0, nb_times):
78 |         self.execute_script("window.scrollBy(0,-200)")
79 |         time.sleep(1)
80 | 
81 | 
82 | # -------------------------------------------------------------
83 | #
------------------------------------------------------------- 84 | # method for loading quora dynamic content 85 | def scroll_down(self, type_of_page='users'): 86 | last_height = self.page_source 87 | loop_scroll = True 88 | attempt = 0 89 | # we generate a random waiting time between 2 and 4 90 | waiting_scroll_time = round(random.uniform(2, 4), 1) 91 | print('scrolling down to get all answers...') 92 | max_waiting_time = round(random.uniform(5, 7), 1) 93 | # we increase waiting time when we look for questions urls 94 | if type_of_page == 'questions': max_waiting_time = round(random.uniform(20, 30), 1) 95 | # scroll down loop until page not changing 96 | while loop_scroll: 97 | self.execute_script("window.scrollTo(0, document.body.scrollHeight);") 98 | time.sleep(2) 99 | if type_of_page == 'answers': 100 | scroll_up(self, 2) 101 | new_height = self.page_source 102 | if new_height == last_height: 103 | # in case of not change, we increase the waiting time 104 | waiting_scroll_time = max_waiting_time 105 | attempt += 1 106 | if attempt == 3: # in the third attempt we end the scrolling 107 | loop_scroll = False 108 | # print('attempt',attempt) 109 | else: 110 | attempt = 0 111 | waiting_scroll_time = round(random.uniform(2, 4), 1) 112 | last_height = new_height 113 | 114 | 115 | # ------------------------------------------------------------- 116 | # ------------------------------------------------------------- 117 | # questions urls crawler 118 | def questions(topics_list, save_path): 119 | browser = connect_chrome() 120 | topic_index = -1 121 | loop_limit = len(topics_list) 122 | print('Starting the questions crawling') 123 | while True: 124 | print('--------------------------------------------------') 125 | topic_index += 1 126 | if topic_index >= loop_limit: 127 | print('Crawling completed, questions have been saved to : ', save_path) 128 | browser.quit() 129 | break 130 | topic_term = topics_list[topic_index].strip() 131 | # we remove hashtags (optional) 132 | topic_term.replace("#", '') 133 | # Looking if the topic has an existing Quora url 134 | print('#########################################################') 135 | print('Looking for topic number : ', topic_index, ' | ', topic_term) 136 | try: 137 | url = "https://www.quora.com/topic/" + topic_term.strip() 138 | browser.get(url) 139 | time.sleep(2) 140 | except Exception as e0: 141 | print('topic does not exist in Quora') 142 | # print('exception e0') 143 | # print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(e0).__name__, e0) 144 | continue 145 | 146 | # get browser source 147 | html_source = browser.page_source 148 | question_count_soup = BeautifulSoup(html_source, 'html.parser') 149 | all_question_htmls = question_count_soup.find_all('div', {'class': 'CssComponent-sc-1oskqb9-0 cXjXFI'}) 150 | 151 | # get total number of questions 152 | question_count = len(all_question_htmls) 153 | if question_count is None: 154 | print('topic does not have questions...') 155 | continue 156 | if question_count == 0: 157 | print('topic does not have questions...') 158 | continue 159 | 160 | # Get scroll height 161 | last_height = browser.execute_script("return document.body.scrollHeight") 162 | 163 | # infinite while loop, break it when you reach the end of the page or not able to scroll further. 
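        # Quora loads topic pages lazily: only the first batch of questions is present in the
        # initial page source. scroll_down() (defined above) keeps scrolling to the bottom and
        # comparing page_source snapshots, giving up after three consecutive snapshots without
        # change, so the links harvested below cover everything Quora is willing to serve.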
164 | # Note that Quora 165 | # if there is only 10 questions, we need to scroll down the profile to load more questions 166 | if question_count == 10: 167 | scroll_down(browser, 'questions') 168 | 169 | # next we harvest all questions URLs that exists in the Quora topic's page 170 | # get html page source 171 | html_source = browser.page_source 172 | soup = BeautifulSoup(html_source, 'html.parser') 173 | 174 | all_htmls = soup.find_all('div', {'class': 'CssComponent-sc-1oskqb9-0 cXjXFI'}) 175 | question_count_after_scroll = len(all_htmls) 176 | print(f'number of questions for this topic : {question_count_after_scroll}') 177 | 178 | # add questions to a set for uniqueness 179 | question_set = set() 180 | for html in all_htmls: 181 | all_links = html.find_all('a', {'href': True}) 182 | # in one question we get 3 links and the third link is the question link 183 | try: 184 | question_link = all_links[2] 185 | question_set.add(question_link) 186 | except IndexError: 187 | question_link = all_links[0] 188 | question_set.add(question_link) 189 | # write content of set to Questions_URLs/ folder 190 | save_file = Path(save_path) / str(topic_term.strip('\n') + '_question_urls.txt') 191 | file_question_urls = open(save_file, mode='w', encoding='utf-8') 192 | for ques in question_set: 193 | link_url = ques.attrs['href'] 194 | file_question_urls.write(link_url + '\n') 195 | file_question_urls.close() 196 | 197 | # sleep every while in order to not get banned 198 | if topic_index % 5 == 4: 199 | sleep_time = (round(random.uniform(5, 10), 1)) 200 | time.sleep(sleep_time) 201 | 202 | browser.quit() 203 | 204 | 205 | # ------------------------------------------------------------- 206 | # ------------------------------------------------------------- 207 | # answers crawler 208 | def answers(urls_list, save_path): 209 | browser = connect_chrome() 210 | url_index = -1 211 | loop_limit = len(urls_list) 212 | # output file containing all answers 213 | file_answers = open(Path(save_path) / "answers.txt", mode='a') 214 | print('Starting the answers crawling...') 215 | while True: 216 | url_index += 1 217 | print('--------------------------------------------------') 218 | if url_index >= loop_limit: 219 | print('Crawling completed, answers have been saved to : ', save_path) 220 | browser.quit() 221 | file_answers.close() 222 | break 223 | current_line = urls_list[url_index] 224 | print('processing question number : ' + str(url_index + 1)) 225 | print(current_line) 226 | if '/unanswered/' in str(current_line): 227 | print('answer is unanswered') 228 | continue 229 | question_id = current_line 230 | # opening Question page 231 | try: 232 | browser.get(current_line) 233 | time.sleep(2) 234 | except Exception as OpenEx: 235 | print('cant open the following question link : ', current_line) 236 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(OpenEx).__name__, OpenEx) 237 | print(str(OpenEx)) 238 | continue 239 | try: 240 | nb_answers_text = WebDriverWait(browser, 10).until( 241 | EC.visibility_of_element_located((By.XPATH, "//div[text()[contains(.,'Answer')]]"))).text 242 | nb_answers = [int(s.strip('+')) for s in nb_answers_text.split() if s.strip('+').isdigit()][0] 243 | print('Question have :', nb_answers_text) 244 | except Exception as Openans: 245 | print('cant get answers') 246 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(Openans).__name__, Openans) 247 | print(str(Openans)) 248 | continue 249 | # nb_answers_text = 
browser.find_element_by_xpath("//div[@class='QuestionPageAnswerHeader']//div[@class='answer_count']").text 250 | 251 | if nb_answers > 7: 252 | scroll_down(browser, 'answers') 253 | continue_reading_buttons = browser.find_elements_by_xpath("//a[@role='button']") 254 | time.sleep(2) 255 | for button in continue_reading_buttons: 256 | try: 257 | ActionChains(browser).click(button).perform() 258 | time.sleep(1) 259 | except: 260 | print('cant click more') 261 | continue 262 | time.sleep(2) 263 | html_source = browser.page_source 264 | soup = BeautifulSoup(html_source, "html.parser") 265 | # get the question-id 266 | question_id = current_line.rsplit('/', 1)[-1] 267 | # find title 268 | title = current_line.replace("https://www.quora.com/", "") 269 | # find question's topics 270 | questions_topics = soup.findAll("div", {"class": "q-box qu-mr--tiny qu-mb--tiny"}) 271 | questions_topics_text = [] 272 | for topic in questions_topics: 273 | questions_topics_text.append(topic.text.rstrip()) 274 | # number of answers 275 | # not all answers are saved! 276 | # answers that collapsed, and those written by anonymous users are not saved 277 | try: 278 | split_html = html_source.split('class="q-box qu-pt--medium qu-pb--medium"') 279 | except Exception as not_exist: # mostly because question is deleted by quora 280 | print('question no long exists') 281 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(not_exist).__name__, not_exist) 282 | print(str(not_exist)) 283 | continue 284 | # The underneath loop will generate len(split_html)/2 exceptions, cause answers in split_html 285 | # are either in Odd or Pair positions, so ignore printed exceptions. 286 | # print('len split : ',len(split_html)) 287 | for i in range(1, len(split_html)): 288 | try: 289 | part = split_html[i] 290 | part_soup = BeautifulSoup(part, "html.parser") 291 | # print('===============================================================') 292 | # find users names of answers authors 293 | try: 294 | authors = part_soup.find("a", href=lambda href: href and "/profile/" in href) 295 | user_id = authors['href'].rsplit('/', 1)[-1] 296 | # print(user_id) 297 | except Exception as not_exist2: # mostly because question is deleted by quora 298 | print('author extract pb') 299 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(not_exist2).__name__, 300 | not_exist2) 301 | print(str(not_exist2)) 302 | continue 303 | 304 | # find answer dates 305 | 306 | answer_date = part_soup.find("a", string=lambda string: string and ( 307 | "Answered" in string or "Updated" in string)) # ("a", {"class": "answer_permalink"}) 308 | try: 309 | date = answer_date.text 310 | if "Updated" in date: 311 | date = date[8:] 312 | else: 313 | date = date[9:] 314 | date = dateparser.parse(date).strftime("%Y-%m-%d") 315 | except: # when updated or answered in the same week (ex: Updated Sat) 316 | date = dateparser.parse("7 days ago").strftime("%Y-%m-%d") 317 | # print(date) 318 | # find answers text 319 | answer_text = part_soup.find("div", {"class": "q-relative spacing_log_answer_content"}) 320 | # print(" answer_text", answer_text.text) 321 | answer_text = answer_text.text 322 | # write answer elements to file 323 | s = str(question_id.rstrip()) + '\t' + str(date) + "\t" + user_id + "\t" + str( 324 | questions_topics_text) + "\t" + str(answer_text.rstrip()) + "\n" 325 | # print("writing down the answer...") 326 | file_answers.write(s) 327 | print('writing down answers...') 328 | except Exception as e1: # Most times because user is anonymous 
, continue without saving anything 329 | print('---------------There is an Exception-----------') 330 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(e1).__name__, e1) 331 | print(str(e1)) 332 | o = 1 333 | 334 | # we sleep every while in order to avoid IP ban 335 | if url_index % 3 == 2: 336 | sleep_time = (round(random.uniform(5, 10), 1)) 337 | time.sleep(sleep_time) 338 | browser.quit() 339 | 340 | 341 | # ------------------------------------------------------------- 342 | # ------------------------------------------------------------- 343 | # Users profile crawler 344 | def users(users_list, save_path): 345 | browser = connect_chrome() 346 | user_index = -1 347 | loop_limit = len(users_list) 348 | print('Starting the users crawling...') 349 | while True: 350 | print('_______________________________________________________________') 351 | user_index += 1 352 | if user_index >= loop_limit: 353 | print('Crawling completed, answers have been saved to : ', save_path) 354 | browser.quit() 355 | break 356 | # a dict to contain information about profile 357 | quora_profile_information = dict() 358 | current_line = users_list[user_index].strip() 359 | current_line = current_line.replace('http', 'https') 360 | # we change proxy and sleep every 200 request (number can be changed) 361 | # sleep every while in order to not get banned 362 | if user_index % 5 == 4: 363 | sleep_time = (round(random.uniform(5, 10), 1)) 364 | # print('*********') 365 | # print('Sleeping the browser for ', sleep_time) 366 | # print('*********') 367 | time.sleep(sleep_time) 368 | user_id = current_line.strip().replace('\r', '').replace('\n', '') 369 | url = "https://www.quora.com/profile/" + user_id 370 | print('processing quora user number : ', user_index + 1, ' ', url) 371 | browser.get(url) 372 | time.sleep(2) 373 | # get profile description 374 | try: 375 | description = browser.find_element_by_class_name('IdentityCredential') 376 | description = description.text.replace('\n', ' ') 377 | # print(description) 378 | except: 379 | description = '' 380 | # print('no description') 381 | quora_profile_information['description'] = description 382 | # get profile bio 383 | try: 384 | more_button = browser.find_elements_by_link_text('(more)') 385 | ActionChains(browser).move_to_element(more_button[0]).click(more_button[0]).perform() 386 | time.sleep(0.5) 387 | profile_bio = browser.find_element_by_class_name('ProfileDescriptionPreviewSection') 388 | profile_bio_text = profile_bio.text.replace('\n', ' ') 389 | # print(profile_bio_text) 390 | except Exception as e: 391 | # print('no profile bio') 392 | # print(e) 393 | profile_bio_text = '' 394 | quora_profile_information['profile_bio'] = profile_bio_text 395 | html_source = browser.page_source 396 | source_soup = BeautifulSoup(html_source, "html.parser") 397 | # get location 398 | # print('trying to get location') 399 | location = 'None' 400 | try: 401 | location1 = (source_soup.find(attrs={"class": "LocationCredentialListItem"})) 402 | location2 = (location1.find(attrs={"class": "main_text"})).text 403 | location = location2.replace('Lives in ', '') 404 | except Exception as e3: 405 | # print('exception regarding finding location') 406 | # print(e3) 407 | pass 408 | quora_profile_information['location'] = location 409 | # get total number of views 410 | total_views = '0' 411 | try: 412 | # views=wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "AnswerViewsAboutListItem.AboutListItem"))) 413 | views = (source_soup.find(attrs={"class": 
"ContentViewsAboutListItem"})) 414 | total_views = views.text.split("content")[0] 415 | except Exception as e4: 416 | ###print('exception regarding finding number of views') 417 | ###print(e4) 418 | pass 419 | # print(total_views) 420 | # print('@@@@@@@@@') 421 | total_views = convert_number(total_views) 422 | # print(' location : ',location) 423 | # print("total_views",total_views) 424 | # print(total_views) 425 | quora_profile_information['total_views'] = total_views 426 | nbanswers = 0 427 | nbquestions = 0 428 | nbfollowers = 0 429 | nbfollowing = 0 430 | # print('trying to get answers stats') 431 | try: 432 | html_source = browser.page_source 433 | source_soup = BeautifulSoup(html_source, "html.parser") 434 | # Find user social attributes : #answers, #questions, #shares, #posts, #blogs, #followers, #following, #topics, #edits 435 | nbanswers = browser.find_element_by_xpath("//span[text()[contains(.,'Answers')]]/parent::*") 436 | nbanswers = nbanswers.text.strip('Answers').strip().replace(',', '') 437 | nbquestions = browser.find_element_by_xpath("//span[text()[contains(.,'Questions')]]/parent::*") 438 | nbquestions = nbquestions.text.strip('Questions').strip().replace(',', '') 439 | # print("questions ",nbquestions) 440 | nbfollowers = browser.find_element_by_xpath("//span[text()[contains(.,'Followers')]]/parent::*") 441 | nbfollowers = nbfollowers.text.strip('Followers').strip().replace(',', '') 442 | # print("followers ",nbfollowers) 443 | nbfollowing = browser.find_element_by_xpath("//span[text()[contains(.,'Following')]]/parent::*") 444 | nbfollowing = nbfollowing.text.strip('Following').strip().replace(',', '') 445 | # print("following ",nbfollowing) 446 | except Exception as ea: 447 | # print('cant get profile attributes answers questions followers following') 448 | # print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(ea).__name__, ea) 449 | time.sleep(1) 450 | if nbanswers == 0: 451 | print(' User does not exists or does not have answers...') 452 | continue 453 | 454 | # Open User profile file (save file) 455 | save_file = save_path / str(user_id + '.txt') 456 | file_user_profile = open(save_file, "w", encoding="utf8") 457 | quora_profile_information['user_id'] = user_id 458 | 459 | # writing answers stats to file 460 | quora_profile_information['nb_answers'] = nbanswers 461 | quora_profile_information['nb_questions'] = nbquestions 462 | quora_profile_information['nb_followers'] = nbfollowers 463 | quora_profile_information['nb_following'] = nbfollowing 464 | json.dump(quora_profile_information, file_user_profile) 465 | file_user_profile.write('\n') 466 | 467 | # scroll down profile for loading all answers 468 | print('user has ', nbanswers, ' answers') 469 | if int(nbanswers) > 9: 470 | scroll_down(browser) 471 | # get answers text (we click on (more) button of each answer) 472 | if int(nbanswers) > 0: 473 | # print('scrolling down for answers collect') 474 | i = 0 475 | # Find and click on all (more) to load full text of answers 476 | more_button = browser.find_elements_by_xpath("//div[contains(text(), '(more)')]") 477 | # print('nb more buttons',len(more_button)) 478 | for jk in range(0, len(more_button)): 479 | ActionChains(browser).move_to_element(more_button[jk]).click(more_button[jk]).perform() 480 | time.sleep(1) 481 | try: 482 | questions_and_dates_tags = browser.find_elements_by_xpath( 483 | "//a[@class='q-box qu-cursor--pointer qu-hover--textDecoration--underline' and contains(@href,'/answer/') and not(contains(@href,'/comment/')) and 
not(contains(@style,'font-style: normal')) ]") 484 | questions_link = [] 485 | questions_date = [] 486 | # filtering only unique questions and dates 487 | for QD in questions_and_dates_tags: 488 | Qlink = QD.get_attribute("href").split('/')[3] 489 | if Qlink not in questions_link: 490 | questions_link.append(Qlink) 491 | questions_date.append(QD.get_attribute("text")) 492 | 493 | questions_date = [convert_date_format(d) for d in questions_date] 494 | answers_text = browser.find_elements_by_xpath("//div[@class='q-relative spacing_log_answer_content']") 495 | answers_text = [' '.join(answer.text.split('\n')[:]).replace('\r', '').replace('\t', '').strip() for 496 | answer in answers_text] 497 | except Exception as eans: 498 | print('cant get answers') 499 | print(eans) 500 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(eans).__name__, eans) 501 | continue 502 | # writing down answers ( date+ Question-ID + Answer text) 503 | for ind in range(0, int(nbanswers)): 504 | try: 505 | # print(ind) 506 | file_user_profile.write( 507 | questions_date[ind] + '\t' + questions_link[ind].rstrip() + '\t' + answers_text[ 508 | ind].rstrip() + '\n') 509 | except Exception as ew: 510 | # print(ew) 511 | # print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(ew).__name__, ew) 512 | print('could not write to file...') 513 | continue 514 | file_user_profile.close() 515 | 516 | browser.quit() 517 | 518 | 519 | # ------------------------------------------------------------- 520 | # ------------------------------------------------------------- 521 | def main(): 522 | start_time = datetime.now() 523 | 524 | # Input Folder 525 | input_path = Path(userpaths.get_my_documents()) / "QuoraScraperData" / "input" 526 | pathlib.Path(input_path).mkdir(parents=True, exist_ok=True) 527 | 528 | # Read arguments 529 | parser = argparse.ArgumentParser() 530 | parser.add_argument("module", choices=['questions', 'answers', 'users'], help="type of crawler") 531 | group = parser.add_mutually_exclusive_group() 532 | group.add_argument("-f", "--verbose", action="store_true", help="input keywords file path ") 533 | group.add_argument("-l", "--quiet", action="store_true", help="input keywords list") 534 | parser.add_argument("input", help=" Input filepath or input list") 535 | parser.add_argument("-i", "--index", type=int, default=0, help="index from which to start scraping ") 536 | args = parser.parse_args() 537 | 538 | # set starting crawl index 539 | list_index = args.index 540 | 541 | # set input list for crawling 542 | # if input is filepath 543 | keywords_list = [] 544 | if args.verbose: 545 | filename = args.input 546 | print("Input file is : ", filename) 547 | if os.path.isfile(filename): 548 | with open(filename, mode='r', encoding='utf-8') as keywords_file: 549 | keywords_list = keywords_file.readlines() 550 | elif os.path.isfile(Path(input_path) / filename): 551 | with open(Path(input_path) / filename, mode='r', encoding='utf-8') as keywords_file: 552 | keywords_list = keywords_file.readlines() 553 | else: 554 | print() 555 | print("Reading file error: Please put the file in the program directory: ", Path.cwd(), 556 | " or in the QuoraScraperData folder :", input_path, " and try again") 557 | print() 558 | 559 | # if input is list 560 | elif args.quiet: 561 | keywords_list = [item.strip() for item in args.input.strip('[]').split(',')] 562 | 563 | keywords_list = keywords_list[list_index:] 564 | 565 | # create output folder 566 | module_name = args.module 567 | save_path = 
Path(userpaths.get_my_documents()) / "QuoraScraperData" / module_name 568 | pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) 569 | 570 | # launch scraper 571 | if module_name.strip() == 'questions': 572 | questions(keywords_list, save_path) 573 | elif module_name.strip() == 'answers': 574 | answers(keywords_list, save_path) 575 | elif module_name.strip() == 'users': 576 | users(keywords_list, save_path) 577 | 578 | end_time = datetime.now() 579 | print(' Crawling took a total time of : ', end_time - start_time) 580 | 581 | 582 | if __name__ == '__main__': 583 | main() 584 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | beautifulsoup4==4.9.3 2 | bs4==0.0.1 3 | certifi==2021.5.30 4 | chardet==4.0.0 5 | colorama==0.4.4 6 | configparser==5.0.2 7 | crayons==0.4.0 8 | dateparser==1.0.0 9 | idna==2.10 10 | python-dateutil==2.8.1 11 | pytz==2021.1 12 | regex==2021.7.6 13 | requests==2.25.1 14 | selenium==3.141.0 15 | six==1.16.0 16 | soupsieve==2.2.1 17 | tzlocal==2.1 18 | urllib3==1.26.6 19 | userpaths==0.1.3 20 | webdriver-manager==3.4.2 21 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | def readme(): 4 | with open('README.md') as f: 5 | README = f.read() 6 | return README 7 | 8 | 9 | 10 | setup( 11 | name = 'quora-scraper', 12 | packages = ['quora_scraper'], 13 | version = '1.1.3', 14 | license='MIT', 15 | description = "Python based code to scrap and download data from quora website: questions related to certain topics, answers given on certain questions and users profile data", 16 | long_description=readme(), 17 | long_description_content_type="text/markdown", 18 | author = 'Youcef Benkhedda', 19 | author_email = 'y_benkhedda@esi.dz', 20 | url="https://github.com/banyous/quora-scraper", 21 | download_url = 'https://github.com/user/reponame/archive/v_01.tar.gz', 22 | keywords = ['quora', 'topics', 'Q&A','user','scraper', 'download','answers','questions'], 23 | include_package_data=True, 24 | install_requires=[ 25 | 'selenium', 26 | 'bs4', 27 | 'webdriver-manager', 28 | 'dateparser', 29 | 'userpaths' 30 | ], 31 | entry_points={ 32 | "console_scripts": [ 33 | "quora-scraper=quora_scraper.scraper:main", 34 | ] 35 | }, 36 | classifiers=[ 37 | 'Development Status :: 3 - Alpha', 38 | 'Intended Audience :: Developers', 39 | 'Topic :: Software Development :: Build Tools', 40 | 'License :: OSI Approved :: MIT License', 41 | 'Programming Language :: Python :: 3.6', 42 | ], 43 | 44 | ) 45 | --------------------------------------------------------------------------------
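Note on the packaging above: the `console_scripts` entry point is what exposes the `quora-scraper` command; it calls `main()` in `quora_scraper/scraper.py`, which parses its arguments from `sys.argv`. A rough Python equivalent of one CLI invocation (a sketch only, assuming the package and its dependencies are installed):

```python
# rough equivalent of: quora-scraper questions -l [finance,politics]
# (sketch only; assumes quora-scraper and its dependencies are installed)
import sys
from quora_scraper.scraper import main

sys.argv = ["quora-scraper", "questions", "-l", "[finance,politics]"]
main()
```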