├── .gitignore ├── LICENSE.txt ├── MANIFEST.in ├── README.md ├── quora_scraper ├── chromedriver └── scraper.py ├── requirements.txt └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.md
2 | include LICENSE.txt
3 | include quora_scraper/chromedriver
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Quora-scraper
2 | 
3 | [![N|Solid](https://cldup.com/dTxpPi9lDf.thumb.png)](https://github.com/banyous/quora-scraper)
4 | 
5 | 
6 | Quora-scraper is a command-line application written in Python that simulates a browser environment to let you scrape Quora's rich textual data. You can use one of its three scraping modules to: find questions that discuss certain topics (such as Finance, Politics, Tesla or Donald-Trump), scrape Quora answers related to certain questions, or scrape user profiles. Please use it responsibly!
7 | 
8 | ## Install
9 | To use the scraper, please follow the steps below:
10 | - Install Python 3.6 or a later version.
11 | - Install the latest version of Google Chrome.
12 | - Install quora-scraper:
13 | 
14 | ```sh
15 | $ pip install quora-scraper
16 | ```
17 | To update quora-scraper:
18 | 
19 | ```sh
20 | $ pip install quora-scraper --upgrade
21 | ```
22 | 
23 | Alternatively, you can clone the project and install it with the command below (make sure you cd into the quora-scraper folder first):
24 | 
25 | ```sh
26 | $ python setup.py install
27 | ```
28 | 
29 | ## Usage
30 | 
31 | quora-scraper has three scraping modules: ```questions```, ```answers``` and ```users```.
32 | #### 1) Scraping question URLs:
33 | 
34 | You can scrape questions related to certain topics using the ```questions``` command. This module takes as input a list of topic keywords. The output is a questions_URL file containing the topic's question links.
35 | 
36 | Scraping a topic's questions can be done as follows:
37 | 
38 | - a) Use the -l parameter + a topic keywords list.
39 | 
40 | ```sh
41 | $ quora-scraper questions -l [finance,politics,Donald-Trump]
42 | ```
43 | 
44 | - b) Use the -f parameter + a topic keywords file location (keywords must be line-separated inside the file):
45 | 
46 | ```sh
47 | $ quora-scraper questions -f topics_file.txt
48 | ```
49 | 
50 | #### 2) Scraping answers:
51 | 
52 | Quora answers are scraped using the ```answers``` command. This module takes as input a list of question URLs. The output is a file of scraped answers (answers.txt). An answer consists of:
53 | 
54 | Quest-ID | AnswerDate | AnswerAuthor-ID | Quest-tags | Answer-Text
55 | 
56 | To scrape answers, use one of the following methods:
57 | 
58 | - a) Use the -l parameter + a question URLs list.
59 | 
60 | ```sh
61 | $ quora-scraper answers -l [https://www.quora.com/Is-milk-good,https://www.quora.com/Was-Einstein-a-fake-and-a-plagiarist]
62 | ```
63 | 
64 | - b) Use the -f parameter + a question URLs file location:
65 | 
66 | ```sh
67 | $ quora-scraper answers -f questions_url.txt
68 | ```
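Each line of answers.txt stores one answer with the fields above, separated by tabs. A minimal sketch for loading the output from Python is shown below (the path is an assumption based on the scraper's default `QuoraScraperData/answers` output folder under your Documents directory):

```python
# minimal sketch: read the tab-separated answers.txt written by the answers module
# (the Documents-based path is an assumption; adjust it to wherever your output landed)
from pathlib import Path

answers_file = Path.home() / "Documents" / "QuoraScraperData" / "answers" / "answers.txt"

with open(answers_file, encoding="utf-8") as f:
    for line in f:
        # fields: Quest-ID, AnswerDate, AnswerAuthor-ID, Quest-tags, Answer-Text
        question_id, answer_date, author_id, tags, answer_text = line.rstrip("\n").split("\t", 4)
        print(question_id, answer_date, author_id)
```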
69 | 
70 | #### 3) Scraping Quora user profiles:
71 | 
72 | You can scrape Quora user profiles using the ```users``` command. The users module takes as input a list of Quora user IDs. The output is a UserProfile file containing:
73 | 
74 | First line (profile information):
75 | UserID | ProfileDescription | ProfileBio | Location | TotalViews | NBAnswers | NBQuestions | NBFollowers | NBFollowing
76 | 
77 | Remaining lines (the user's answers):
78 | AnswerDate | QuestionID | AnswerText
79 | 
80 | Scraping user profiles can be done as follows:
81 | 
82 | - a) Use the -l parameter + a User-IDs list.
83 | ```sh
84 | $ quora-scraper users -l [Albert-Einstein-195,Jackie-Chan-8]
85 | ```
86 | 
87 | - b) Use the -f parameter + a User-IDs file.
88 | 
89 | ```sh
90 | $ quora-scraper users -f quora_username_file.txt
91 | ```
92 | 
93 | ### Notes
94 | a) Input files must be line-separated.
95 | 
96 | b) Output file fields are tab-separated.
97 | 
98 | c) You can add a list/line index parameter in order to start the scraping from that index. The command below will start scraping from the "physics" keyword:
99 | ```sh
100 | $ quora-scraper questions -l [finance,politics,tech,physics,life,sports] -i 3
101 | ```
102 | 
103 | d) The Quora website puts a limit on the number of questions accessible on a topic page. Thus, even if a topic has a large number of questions (e.g. 100k), the number of scraped question links will not exceed 2k or 3k questions.
104 | 
105 | e) For more help use:
106 | ```sh
107 | $ quora-scraper --help
108 | ```
109 | f) Quora-scraper uses XPaths and bs4 methods to scrape Quora webpage elements. Since Quora's HTML structure is constantly changing, the code may need modification from time to time. Please feel free to update and contribute to the source code in order to keep the scraper up to date.
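g) Besides the command line, the three scraping functions can also be imported and called directly from Python. Below is a minimal sketch (it assumes the package is installed; the output folder is just an example path, and each function mirrors the corresponding CLI module):

```python
# minimal sketch: calling the scraper modules from Python instead of the CLI
# (assumes quora-scraper is installed; "quora_output" is an arbitrary example folder)
from pathlib import Path
from quora_scraper import scraper

out_dir = Path("quora_output")
out_dir.mkdir(exist_ok=True)

scraper.questions(["finance", "politics"], out_dir)               # topic keywords -> question URL files
scraper.answers(["https://www.quora.com/Is-milk-good"], out_dir)  # question URLs  -> answers.txt
scraper.users(["Jackie-Chan-8"], out_dir)                         # user IDs       -> one profile file per user
```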
110 | 
111 | 
112 | License
113 | ----
114 | 
115 | This project uses the following license: [MIT]
116 | 
117 | 
118 | 
119 | 
120 | [//]: # (These are reference links used in the body of this note and get stripped out when the markdown processor does its job. There is no need to format nicely because it shouldn't be seen. Thanks SO - http://stackoverflow.com/questions/4823468/store-comments-in-markdown-syntax)
121 | 
122 | 
123 | [MIT]: https://opensource.org/licenses/MIT
124 | 
125 | 
--------------------------------------------------------------------------------
/quora_scraper/chromedriver:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/banyous/quora-scraper/1bed6227bfb46b618379a33ead2ab368101e3cbc/quora_scraper/chromedriver
--------------------------------------------------------------------------------
/quora_scraper/scraper.py:
--------------------------------------------------------------------------------
1 | # scraper.py
2 | DEBUG = 1
3 | import os
4 | import sys
5 | import time
6 | import json
7 | import pathlib
8 | from pathlib import Path
9 | import random
10 | import userpaths
11 | import dateparser
12 | import argparse
13 | from datetime import datetime, timedelta
14 | from bs4 import BeautifulSoup
15 | from selenium import webdriver
16 | from selenium.webdriver.chrome.options import Options
17 | from selenium.webdriver.common.action_chains import ActionChains
18 | from selenium.webdriver.common.by import By
19 | from selenium.webdriver.support import expected_conditions as EC
20 | from selenium.webdriver.support.ui import WebDriverWait
21 | 
22 | 
23 | # -------------------------------------------------------------
24 | # -------------------------------------------------------------
25 | def connect_chrome():
26 |     options = Options()
27 |     options.add_argument('--headless')
28 |     options.add_argument('--log-level=3')
29 |     options.add_argument("--incognito")
30 |     options.add_argument("--no-sandbox")
31 |     options.add_argument("--disable-dev-shm-usage")
32 |     # try:
33 |     #     import quora_scraper
34 |     #     package_path=str(quora_scraper.__path__).split("'")[1]
35 |     #     driver_path= Path(package_path) / "chromedriver"
36 |     # except:
37 |     #     driver_path= Path.cwd() / "chromedriver"
38 |     # driver_path= Path.cwd() / "chromedriver"
39 | 
40 |     driver = webdriver.Chrome(options=options)
41 |     driver.maximize_window()
42 |     time.sleep(2)
43 |     return driver
44 | 
45 | 
46 | # -------------------------------------------------------------
47 | # -------------------------------------------------------------
48 | # convert Quora counts such as '1.2k' (kilo) or '3m' (million) into integers
49 | def convert_number(number):
50 |     if 'k' in number:
51 |         n = float(number.lower().replace('k', '').replace(' ', '')) * 1000
52 |     elif 'm' in number:
53 |         n = float(number.lower().replace('m', '').replace(' ', '')) * 1000000
54 |     else:
55 |         n = number
56 |     return int(n)
57 | 
58 | 
59 | # -------------------------------------------------------------
60 | # -------------------------------------------------------------
61 | # convert Quora dates (such as "Answered 2 months ago") to YYYY-MM-DD format
62 | def convert_date_format(date_text):
63 |     try:
64 |         if "Updated" in date_text:
65 |             date = date_text[8:]
66 |         else:
67 |             date = date_text[9:]
68 |         date = dateparser.parse(date).strftime("%Y-%m-%d")  # parse the stripped date string
69 |     except:  # when updated or answered in the same week (ex: Updated Sat)
70 |         date = dateparser.parse("7 days ago").strftime("%Y-%m-%d")
71 |     return date
72 | 
73 | 
74 | # -------------------------------------------------------------
75 | # -------------------------------------------------------------
76 | def scroll_up(self, nb_times):  # 'self' is the Selenium WebDriver instance
77 |     for iii in range(0, nb_times):
78 |         self.execute_script("window.scrollBy(0,-200)")
79 |         time.sleep(1)
80 | 
81 | 
82 | # -------------------------------------------------------------
83 | #
------------------------------------------------------------- 84 | # method for loading quora dynamic content 85 | def scroll_down(self, type_of_page='users'): 86 | last_height = self.page_source 87 | loop_scroll = True 88 | attempt = 0 89 | # we generate a random waiting time between 2 and 4 90 | waiting_scroll_time = round(random.uniform(2, 4), 1) 91 | print('scrolling down to get all answers...') 92 | max_waiting_time = round(random.uniform(5, 7), 1) 93 | # we increase waiting time when we look for questions urls 94 | if type_of_page == 'questions': max_waiting_time = round(random.uniform(20, 30), 1) 95 | # scroll down loop until page not changing 96 | while loop_scroll: 97 | self.execute_script("window.scrollTo(0, document.body.scrollHeight);") 98 | time.sleep(2) 99 | if type_of_page == 'answers': 100 | scroll_up(self, 2) 101 | new_height = self.page_source 102 | if new_height == last_height: 103 | # in case of not change, we increase the waiting time 104 | waiting_scroll_time = max_waiting_time 105 | attempt += 1 106 | if attempt == 3: # in the third attempt we end the scrolling 107 | loop_scroll = False 108 | # print('attempt',attempt) 109 | else: 110 | attempt = 0 111 | waiting_scroll_time = round(random.uniform(2, 4), 1) 112 | last_height = new_height 113 | 114 | 115 | # ------------------------------------------------------------- 116 | # ------------------------------------------------------------- 117 | # questions urls crawler 118 | def questions(topics_list, save_path): 119 | browser = connect_chrome() 120 | topic_index = -1 121 | loop_limit = len(topics_list) 122 | print('Starting the questions crawling') 123 | while True: 124 | print('--------------------------------------------------') 125 | topic_index += 1 126 | if topic_index >= loop_limit: 127 | print('Crawling completed, questions have been saved to : ', save_path) 128 | browser.quit() 129 | break 130 | topic_term = topics_list[topic_index].strip() 131 | # we remove hashtags (optional) 132 | topic_term.replace("#", '') 133 | # Looking if the topic has an existing Quora url 134 | print('#########################################################') 135 | print('Looking for topic number : ', topic_index, ' | ', topic_term) 136 | try: 137 | url = "https://www.quora.com/topic/" + topic_term.strip() 138 | browser.get(url) 139 | time.sleep(2) 140 | except Exception as e0: 141 | print('topic does not exist in Quora') 142 | # print('exception e0') 143 | # print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(e0).__name__, e0) 144 | continue 145 | 146 | # get browser source 147 | html_source = browser.page_source 148 | question_count_soup = BeautifulSoup(html_source, 'html.parser') 149 | all_question_htmls = question_count_soup.find_all('div', {'class': 'CssComponent-sc-1oskqb9-0 cXjXFI'}) 150 | 151 | # get total number of questions 152 | question_count = len(all_question_htmls) 153 | if question_count is None: 154 | print('topic does not have questions...') 155 | continue 156 | if question_count == 0: 157 | print('topic does not have questions...') 158 | continue 159 | 160 | # Get scroll height 161 | last_height = browser.execute_script("return document.body.scrollHeight") 162 | 163 | # infinite while loop, break it when you reach the end of the page or not able to scroll further. 
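        # Quora loads topic pages lazily: only the first batch of questions is present in the
        # initial page source. scroll_down() (defined above) keeps scrolling to the bottom and
        # comparing page_source snapshots, giving up after three consecutive snapshots without
        # change, so the links harvested below cover everything Quora is willing to serve.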
164 | # Note that Quora 165 | # if there is only 10 questions, we need to scroll down the profile to load more questions 166 | if question_count == 10: 167 | scroll_down(browser, 'questions') 168 | 169 | # next we harvest all questions URLs that exists in the Quora topic's page 170 | # get html page source 171 | html_source = browser.page_source 172 | soup = BeautifulSoup(html_source, 'html.parser') 173 | 174 | all_htmls = soup.find_all('div', {'class': 'CssComponent-sc-1oskqb9-0 cXjXFI'}) 175 | question_count_after_scroll = len(all_htmls) 176 | print(f'number of questions for this topic : {question_count_after_scroll}') 177 | 178 | # add questions to a set for uniqueness 179 | question_set = set() 180 | for html in all_htmls: 181 | all_links = html.find_all('a', {'href': True}) 182 | # in one question we get 3 links and the third link is the question link 183 | try: 184 | question_link = all_links[2] 185 | question_set.add(question_link) 186 | except IndexError: 187 | question_link = all_links[0] 188 | question_set.add(question_link) 189 | # write content of set to Questions_URLs/ folder 190 | save_file = Path(save_path) / str(topic_term.strip('\n') + '_question_urls.txt') 191 | file_question_urls = open(save_file, mode='w', encoding='utf-8') 192 | for ques in question_set: 193 | link_url = ques.attrs['href'] 194 | file_question_urls.write(link_url + '\n') 195 | file_question_urls.close() 196 | 197 | # sleep every while in order to not get banned 198 | if topic_index % 5 == 4: 199 | sleep_time = (round(random.uniform(5, 10), 1)) 200 | time.sleep(sleep_time) 201 | 202 | browser.quit() 203 | 204 | 205 | # ------------------------------------------------------------- 206 | # ------------------------------------------------------------- 207 | # answers crawler 208 | def answers(urls_list, save_path): 209 | browser = connect_chrome() 210 | url_index = -1 211 | loop_limit = len(urls_list) 212 | # output file containing all answers 213 | file_answers = open(Path(save_path) / "answers.txt", mode='a') 214 | print('Starting the answers crawling...') 215 | while True: 216 | url_index += 1 217 | print('--------------------------------------------------') 218 | if url_index >= loop_limit: 219 | print('Crawling completed, answers have been saved to : ', save_path) 220 | browser.quit() 221 | file_answers.close() 222 | break 223 | current_line = urls_list[url_index] 224 | print('processing question number : ' + str(url_index + 1)) 225 | print(current_line) 226 | if '/unanswered/' in str(current_line): 227 | print('answer is unanswered') 228 | continue 229 | question_id = current_line 230 | # opening Question page 231 | try: 232 | browser.get(current_line) 233 | time.sleep(2) 234 | except Exception as OpenEx: 235 | print('cant open the following question link : ', current_line) 236 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(OpenEx).__name__, OpenEx) 237 | print(str(OpenEx)) 238 | continue 239 | try: 240 | nb_answers_text = WebDriverWait(browser, 10).until( 241 | EC.visibility_of_element_located((By.XPATH, "//div[text()[contains(.,'Answer')]]"))).text 242 | nb_answers = [int(s.strip('+')) for s in nb_answers_text.split() if s.strip('+').isdigit()][0] 243 | print('Question have :', nb_answers_text) 244 | except Exception as Openans: 245 | print('cant get answers') 246 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(Openans).__name__, Openans) 247 | print(str(Openans)) 248 | continue 249 | # nb_answers_text = 
browser.find_element_by_xpath("//div[@class='QuestionPageAnswerHeader']//div[@class='answer_count']").text 250 | 251 | if nb_answers > 7: 252 | scroll_down(browser, 'answers') 253 | continue_reading_buttons = browser.find_elements_by_xpath("//a[@role='button']") 254 | time.sleep(2) 255 | for button in continue_reading_buttons: 256 | try: 257 | ActionChains(browser).click(button).perform() 258 | time.sleep(1) 259 | except: 260 | print('cant click more') 261 | continue 262 | time.sleep(2) 263 | html_source = browser.page_source 264 | soup = BeautifulSoup(html_source, "html.parser") 265 | # get the question-id 266 | question_id = current_line.rsplit('/', 1)[-1] 267 | # find title 268 | title = current_line.replace("https://www.quora.com/", "") 269 | # find question's topics 270 | questions_topics = soup.findAll("div", {"class": "q-box qu-mr--tiny qu-mb--tiny"}) 271 | questions_topics_text = [] 272 | for topic in questions_topics: 273 | questions_topics_text.append(topic.text.rstrip()) 274 | # number of answers 275 | # not all answers are saved! 276 | # answers that collapsed, and those written by anonymous users are not saved 277 | try: 278 | split_html = html_source.split('class="q-box qu-pt--medium qu-pb--medium"') 279 | except Exception as not_exist: # mostly because question is deleted by quora 280 | print('question no long exists') 281 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(not_exist).__name__, not_exist) 282 | print(str(not_exist)) 283 | continue 284 | # The underneath loop will generate len(split_html)/2 exceptions, cause answers in split_html 285 | # are either in Odd or Pair positions, so ignore printed exceptions. 286 | # print('len split : ',len(split_html)) 287 | for i in range(1, len(split_html)): 288 | try: 289 | part = split_html[i] 290 | part_soup = BeautifulSoup(part, "html.parser") 291 | # print('===============================================================') 292 | # find users names of answers authors 293 | try: 294 | authors = part_soup.find("a", href=lambda href: href and "/profile/" in href) 295 | user_id = authors['href'].rsplit('/', 1)[-1] 296 | # print(user_id) 297 | except Exception as not_exist2: # mostly because question is deleted by quora 298 | print('author extract pb') 299 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(not_exist2).__name__, 300 | not_exist2) 301 | print(str(not_exist2)) 302 | continue 303 | 304 | # find answer dates 305 | 306 | answer_date = part_soup.find("a", string=lambda string: string and ( 307 | "Answered" in string or "Updated" in string)) # ("a", {"class": "answer_permalink"}) 308 | try: 309 | date = answer_date.text 310 | if "Updated" in date: 311 | date = date[8:] 312 | else: 313 | date = date[9:] 314 | date = dateparser.parse(date).strftime("%Y-%m-%d") 315 | except: # when updated or answered in the same week (ex: Updated Sat) 316 | date = dateparser.parse("7 days ago").strftime("%Y-%m-%d") 317 | # print(date) 318 | # find answers text 319 | answer_text = part_soup.find("div", {"class": "q-relative spacing_log_answer_content"}) 320 | # print(" answer_text", answer_text.text) 321 | answer_text = answer_text.text 322 | # write answer elements to file 323 | s = str(question_id.rstrip()) + '\t' + str(date) + "\t" + user_id + "\t" + str( 324 | questions_topics_text) + "\t" + str(answer_text.rstrip()) + "\n" 325 | # print("writing down the answer...") 326 | file_answers.write(s) 327 | print('writing down answers...') 328 | except Exception as e1: # Most times because user is anonymous 
, continue without saving anything 329 | print('---------------There is an Exception-----------') 330 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(e1).__name__, e1) 331 | print(str(e1)) 332 | o = 1 333 | 334 | # we sleep every while in order to avoid IP ban 335 | if url_index % 3 == 2: 336 | sleep_time = (round(random.uniform(5, 10), 1)) 337 | time.sleep(sleep_time) 338 | browser.quit() 339 | 340 | 341 | # ------------------------------------------------------------- 342 | # ------------------------------------------------------------- 343 | # Users profile crawler 344 | def users(users_list, save_path): 345 | browser = connect_chrome() 346 | user_index = -1 347 | loop_limit = len(users_list) 348 | print('Starting the users crawling...') 349 | while True: 350 | print('_______________________________________________________________') 351 | user_index += 1 352 | if user_index >= loop_limit: 353 | print('Crawling completed, answers have been saved to : ', save_path) 354 | browser.quit() 355 | break 356 | # a dict to contain information about profile 357 | quora_profile_information = dict() 358 | current_line = users_list[user_index].strip() 359 | current_line = current_line.replace('http', 'https') 360 | # we change proxy and sleep every 200 request (number can be changed) 361 | # sleep every while in order to not get banned 362 | if user_index % 5 == 4: 363 | sleep_time = (round(random.uniform(5, 10), 1)) 364 | # print('*********') 365 | # print('Sleeping the browser for ', sleep_time) 366 | # print('*********') 367 | time.sleep(sleep_time) 368 | user_id = current_line.strip().replace('\r', '').replace('\n', '') 369 | url = "https://www.quora.com/profile/" + user_id 370 | print('processing quora user number : ', user_index + 1, ' ', url) 371 | browser.get(url) 372 | time.sleep(2) 373 | # get profile description 374 | try: 375 | description = browser.find_element_by_class_name('IdentityCredential') 376 | description = description.text.replace('\n', ' ') 377 | # print(description) 378 | except: 379 | description = '' 380 | # print('no description') 381 | quora_profile_information['description'] = description 382 | # get profile bio 383 | try: 384 | more_button = browser.find_elements_by_link_text('(more)') 385 | ActionChains(browser).move_to_element(more_button[0]).click(more_button[0]).perform() 386 | time.sleep(0.5) 387 | profile_bio = browser.find_element_by_class_name('ProfileDescriptionPreviewSection') 388 | profile_bio_text = profile_bio.text.replace('\n', ' ') 389 | # print(profile_bio_text) 390 | except Exception as e: 391 | # print('no profile bio') 392 | # print(e) 393 | profile_bio_text = '' 394 | quora_profile_information['profile_bio'] = profile_bio_text 395 | html_source = browser.page_source 396 | source_soup = BeautifulSoup(html_source, "html.parser") 397 | # get location 398 | # print('trying to get location') 399 | location = 'None' 400 | try: 401 | location1 = (source_soup.find(attrs={"class": "LocationCredentialListItem"})) 402 | location2 = (location1.find(attrs={"class": "main_text"})).text 403 | location = location2.replace('Lives in ', '') 404 | except Exception as e3: 405 | # print('exception regarding finding location') 406 | # print(e3) 407 | pass 408 | quora_profile_information['location'] = location 409 | # get total number of views 410 | total_views = '0' 411 | try: 412 | # views=wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "AnswerViewsAboutListItem.AboutListItem"))) 413 | views = (source_soup.find(attrs={"class": 
"ContentViewsAboutListItem"})) 414 | total_views = views.text.split("content")[0] 415 | except Exception as e4: 416 | ###print('exception regarding finding number of views') 417 | ###print(e4) 418 | pass 419 | # print(total_views) 420 | # print('@@@@@@@@@') 421 | total_views = convert_number(total_views) 422 | # print(' location : ',location) 423 | # print("total_views",total_views) 424 | # print(total_views) 425 | quora_profile_information['total_views'] = total_views 426 | nbanswers = 0 427 | nbquestions = 0 428 | nbfollowers = 0 429 | nbfollowing = 0 430 | # print('trying to get answers stats') 431 | try: 432 | html_source = browser.page_source 433 | source_soup = BeautifulSoup(html_source, "html.parser") 434 | # Find user social attributes : #answers, #questions, #shares, #posts, #blogs, #followers, #following, #topics, #edits 435 | nbanswers = browser.find_element_by_xpath("//span[text()[contains(.,'Answers')]]/parent::*") 436 | nbanswers = nbanswers.text.strip('Answers').strip().replace(',', '') 437 | nbquestions = browser.find_element_by_xpath("//span[text()[contains(.,'Questions')]]/parent::*") 438 | nbquestions = nbquestions.text.strip('Questions').strip().replace(',', '') 439 | # print("questions ",nbquestions) 440 | nbfollowers = browser.find_element_by_xpath("//span[text()[contains(.,'Followers')]]/parent::*") 441 | nbfollowers = nbfollowers.text.strip('Followers').strip().replace(',', '') 442 | # print("followers ",nbfollowers) 443 | nbfollowing = browser.find_element_by_xpath("//span[text()[contains(.,'Following')]]/parent::*") 444 | nbfollowing = nbfollowing.text.strip('Following').strip().replace(',', '') 445 | # print("following ",nbfollowing) 446 | except Exception as ea: 447 | # print('cant get profile attributes answers questions followers following') 448 | # print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(ea).__name__, ea) 449 | time.sleep(1) 450 | if nbanswers == 0: 451 | print(' User does not exists or does not have answers...') 452 | continue 453 | 454 | # Open User profile file (save file) 455 | save_file = save_path / str(user_id + '.txt') 456 | file_user_profile = open(save_file, "w", encoding="utf8") 457 | quora_profile_information['user_id'] = user_id 458 | 459 | # writing answers stats to file 460 | quora_profile_information['nb_answers'] = nbanswers 461 | quora_profile_information['nb_questions'] = nbquestions 462 | quora_profile_information['nb_followers'] = nbfollowers 463 | quora_profile_information['nb_following'] = nbfollowing 464 | json.dump(quora_profile_information, file_user_profile) 465 | file_user_profile.write('\n') 466 | 467 | # scroll down profile for loading all answers 468 | print('user has ', nbanswers, ' answers') 469 | if int(nbanswers) > 9: 470 | scroll_down(browser) 471 | # get answers text (we click on (more) button of each answer) 472 | if int(nbanswers) > 0: 473 | # print('scrolling down for answers collect') 474 | i = 0 475 | # Find and click on all (more) to load full text of answers 476 | more_button = browser.find_elements_by_xpath("//div[contains(text(), '(more)')]") 477 | # print('nb more buttons',len(more_button)) 478 | for jk in range(0, len(more_button)): 479 | ActionChains(browser).move_to_element(more_button[jk]).click(more_button[jk]).perform() 480 | time.sleep(1) 481 | try: 482 | questions_and_dates_tags = browser.find_elements_by_xpath( 483 | "//a[@class='q-box qu-cursor--pointer qu-hover--textDecoration--underline' and contains(@href,'/answer/') and not(contains(@href,'/comment/')) and 
not(contains(@style,'font-style: normal')) ]") 484 | questions_link = [] 485 | questions_date = [] 486 | # filtering only unique questions and dates 487 | for QD in questions_and_dates_tags: 488 | Qlink = QD.get_attribute("href").split('/')[3] 489 | if Qlink not in questions_link: 490 | questions_link.append(Qlink) 491 | questions_date.append(QD.get_attribute("text")) 492 | 493 | questions_date = [convert_date_format(d) for d in questions_date] 494 | answers_text = browser.find_elements_by_xpath("//div[@class='q-relative spacing_log_answer_content']") 495 | answers_text = [' '.join(answer.text.split('\n')[:]).replace('\r', '').replace('\t', '').strip() for 496 | answer in answers_text] 497 | except Exception as eans: 498 | print('cant get answers') 499 | print(eans) 500 | print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(eans).__name__, eans) 501 | continue 502 | # writing down answers ( date+ Question-ID + Answer text) 503 | for ind in range(0, int(nbanswers)): 504 | try: 505 | # print(ind) 506 | file_user_profile.write( 507 | questions_date[ind] + '\t' + questions_link[ind].rstrip() + '\t' + answers_text[ 508 | ind].rstrip() + '\n') 509 | except Exception as ew: 510 | # print(ew) 511 | # print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(ew).__name__, ew) 512 | print('could not write to file...') 513 | continue 514 | file_user_profile.close() 515 | 516 | browser.quit() 517 | 518 | 519 | # ------------------------------------------------------------- 520 | # ------------------------------------------------------------- 521 | def main(): 522 | start_time = datetime.now() 523 | 524 | # Input Folder 525 | input_path = Path(userpaths.get_my_documents()) / "QuoraScraperData" / "input" 526 | pathlib.Path(input_path).mkdir(parents=True, exist_ok=True) 527 | 528 | # Read arguments 529 | parser = argparse.ArgumentParser() 530 | parser.add_argument("module", choices=['questions', 'answers', 'users'], help="type of crawler") 531 | group = parser.add_mutually_exclusive_group() 532 | group.add_argument("-f", "--verbose", action="store_true", help="input keywords file path ") 533 | group.add_argument("-l", "--quiet", action="store_true", help="input keywords list") 534 | parser.add_argument("input", help=" Input filepath or input list") 535 | parser.add_argument("-i", "--index", type=int, default=0, help="index from which to start scraping ") 536 | args = parser.parse_args() 537 | 538 | # set starting crawl index 539 | list_index = args.index 540 | 541 | # set input list for crawling 542 | # if input is filepath 543 | keywords_list = [] 544 | if args.verbose: 545 | filename = args.input 546 | print("Input file is : ", filename) 547 | if os.path.isfile(filename): 548 | with open(filename, mode='r', encoding='utf-8') as keywords_file: 549 | keywords_list = keywords_file.readlines() 550 | elif os.path.isfile(Path(input_path) / filename): 551 | with open(Path(input_path) / filename, mode='r', encoding='utf-8') as keywords_file: 552 | keywords_list = keywords_file.readlines() 553 | else: 554 | print() 555 | print("Reading file error: Please put the file in the program directory: ", Path.cwd(), 556 | " or in the QuoraScraperData folder :", input_path, " and try again") 557 | print() 558 | 559 | # if input is list 560 | elif args.quiet: 561 | keywords_list = [item.strip() for item in args.input.strip('[]').split(',')] 562 | 563 | keywords_list = keywords_list[list_index:] 564 | 565 | # create output folder 566 | module_name = args.module 567 | save_path = 
Path(userpaths.get_my_documents()) / "QuoraScraperData" / module_name 568 | pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) 569 | 570 | # launch scraper 571 | if module_name.strip() == 'questions': 572 | questions(keywords_list, save_path) 573 | elif module_name.strip() == 'answers': 574 | answers(keywords_list, save_path) 575 | elif module_name.strip() == 'users': 576 | users(keywords_list, save_path) 577 | 578 | end_time = datetime.now() 579 | print(' Crawling took a total time of : ', end_time - start_time) 580 | 581 | 582 | if __name__ == '__main__': 583 | main() 584 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | beautifulsoup4==4.9.3 2 | bs4==0.0.1 3 | certifi==2021.5.30 4 | chardet==4.0.0 5 | colorama==0.4.4 6 | configparser==5.0.2 7 | crayons==0.4.0 8 | dateparser==1.0.0 9 | idna==2.10 10 | python-dateutil==2.8.1 11 | pytz==2021.1 12 | regex==2021.7.6 13 | requests==2.25.1 14 | selenium==3.141.0 15 | six==1.16.0 16 | soupsieve==2.2.1 17 | tzlocal==2.1 18 | urllib3==1.26.6 19 | userpaths==0.1.3 20 | webdriver-manager==3.4.2 21 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | def readme(): 4 | with open('README.md') as f: 5 | README = f.read() 6 | return README 7 | 8 | 9 | 10 | setup( 11 | name = 'quora-scraper', 12 | packages = ['quora_scraper'], 13 | version = '1.1.3', 14 | license='MIT', 15 | description = "Python based code to scrap and download data from quora website: questions related to certain topics, answers given on certain questions and users profile data", 16 | long_description=readme(), 17 | long_description_content_type="text/markdown", 18 | author = 'Youcef Benkhedda', 19 | author_email = 'y_benkhedda@esi.dz', 20 | url="https://github.com/banyous/quora-scraper", 21 | download_url = 'https://github.com/user/reponame/archive/v_01.tar.gz', 22 | keywords = ['quora', 'topics', 'Q&A','user','scraper', 'download','answers','questions'], 23 | include_package_data=True, 24 | install_requires=[ 25 | 'selenium', 26 | 'bs4', 27 | 'webdriver-manager', 28 | 'dateparser', 29 | 'userpaths' 30 | ], 31 | entry_points={ 32 | "console_scripts": [ 33 | "quora-scraper=quora_scraper.scraper:main", 34 | ] 35 | }, 36 | classifiers=[ 37 | 'Development Status :: 3 - Alpha', 38 | 'Intended Audience :: Developers', 39 | 'Topic :: Software Development :: Build Tools', 40 | 'License :: OSI Approved :: MIT License', 41 | 'Programming Language :: Python :: 3.6', 42 | ], 43 | 44 | ) 45 | --------------------------------------------------------------------------------
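Note on the packaging above: the `console_scripts` entry point is what exposes the `quora-scraper` command; it calls `main()` in `quora_scraper/scraper.py`, which parses its arguments from `sys.argv`. A rough Python equivalent of one CLI invocation (a sketch only, assuming the package and its dependencies are installed):

```python
# rough equivalent of: quora-scraper questions -l [finance,politics]
# (sketch only; assumes quora-scraper and its dependencies are installed)
import sys
from quora_scraper.scraper import main

sys.argv = ["quora-scraper", "questions", "-l", "[finance,politics]"]
main()
```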