├── .gitignore ├── Dockerfile ├── LICENSE ├── MANIFEST.in ├── README.rst ├── changelog.txt ├── examples ├── get_twitter_user_data.py └── get_twitter_user_data_parallel.py ├── requirements.txt ├── setup.py └── twitterscraper ├── __init__.py ├── main.py ├── query.py ├── tweet.py └── user.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | /dist/ 3 | /*.egg-info 4 | /build/ 5 | .idea/ 6 | venv/ -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.7-alpine 2 | RUN apk add --update --no-cache g++ gcc libxslt-dev 3 | COPY . /app 4 | WORKDIR /app 5 | RUN python setup.py install 6 | CMD ["twitterscraper"] 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016-2019 by Ahmet Taspinar (taspinar@gmail.com) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include requirements.txt 2 | include LICENSE.txt 3 | include README.rst 4 | include HISTORY.rst -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | |Downloads| |Downloads_month| |PyPI version| |GitHub contributors| 2 | 3 | .. |Downloads| image:: https://pepy.tech/badge/twitterscraper 4 | :target: https://pepy.tech/project/twitterscraper 5 | .. |Downloads_month| image:: https://pepy.tech/badge/twitterscraper/month 6 | :target: https://pepy.tech/project/twitterscraper/month 7 | .. |PyPI version| image:: https://badge.fury.io/py/twitterscraper.svg 8 | :target: https://badge.fury.io/py/twitterscraper 9 | .. |GitHub contributors| image:: https://img.shields.io/github/contributors/taspinar/twitterscraper.svg 10 | :target: https://github.com/taspinar/twitterscraper/graphs/contributors 11 | 12 | 13 | Backers 14 | ======== 15 | 16 | Thank you to all our backers! 🙏 [`Become a backer`_] 17 | 18 | Sponsors 19 | ======== 20 | 21 | Support this project by becoming a sponsor. Your logo will show up here 22 | with a link to your website. [`Become a sponsor`_] 23 | 24 | .. 
_Become a backer: https://opencollective.com/twitterscraper#backer
25 | .. _Become a sponsor: https://opencollective.com/twitterscraper#sponsor
26 |
27 |
28 | Synopsis
29 | ========
30 |
31 | A simple script to scrape Tweets using the Python package ``requests``
32 | to retrieve the content and ``Beautifulsoup4`` to parse the retrieved
33 | content.
34 |
35 | 1. Motivation
36 | =============
37 |
38 | Twitter has provided `REST
39 | APIs `__ which can be used by
40 | developers to access and read Twitter data. They have also provided a
41 | `Streaming API `__ which can
42 | be used to access Twitter data in real-time.
43 |
44 | Most of the software written to access Twitter data provides a library
45 | which functions as a wrapper around Twitter's Search and Streaming APIs
46 | and is therefore constrained by the limitations of those APIs.
47 |
48 | With Twitter's Search API you can only send 180 requests every 15
49 | minutes. With a maximum number of 100 tweets per request, you
50 | can mine 72,000 tweets per hour (4 x 180 x 100 = 72,000). By using
51 | TwitterScraper you are not limited by this number but by your internet
52 | speed/bandwidth and the number of instances of TwitterScraper you are
53 | willing to start.
54 |
55 | One of the bigger disadvantages of the Search API is that you can only
56 | access Tweets written in the **past 7 days**. This is a major bottleneck
57 | for anyone looking for older data. With TwitterScraper there is no such
58 | limitation.
59 |
60 | Per Tweet it scrapes the following information:
61 | + Tweet-id
62 | + Tweet-url
63 | + Tweet text
64 | + Tweet html
65 | + Links inside Tweet
66 | + Hashtags inside Tweet
67 | + Image URLs inside Tweet
68 | + Video URL inside Tweet
69 | + Tweet timestamp
70 | + Tweet Epoch timestamp
71 | + Tweet No. of likes
72 | + Tweet No. of replies
73 | + Tweet No. of retweets
74 | + Username
75 | + User Full Name / Screen Name
76 | + User ID
77 | + Tweet is a reply to
78 | + Tweet is replied to
79 | + List of users the Tweet is a reply to
80 | + Tweet ID of parent tweet
81 |
82 |
83 | In addition it can scrape for the following user information:
84 | + Date user joined
85 | + User location (if filled in)
86 | + User blog (if filled in)
87 | + User No. of tweets
88 | + User No. of following
89 | + User No. of followers
90 | + User No. of likes
91 | + User No. of lists
92 | + User is verified
93 |
94 |
95 | 2. Installation and Usage
96 | =========================
97 |
98 | To install **twitterscraper**:
99 |
100 | .. code:: python
101 |
102 |     (sudo) pip install twitterscraper
103 |
104 | or you can clone the repository and, from the folder containing setup.py, run:
105 |
106 | .. code:: python
107 |
108 |     python setup.py install
109 |
110 | If you prefer more isolation, you can build a Docker image:
111 |
112 | .. code:: python
113 |
114 |     docker build -t twitterscraper:build .
115 |
116 | and run your container with:
117 |
118 | .. code:: python
119 |
120 |
121 |     docker run --rm -it -v/:/app/data twitterscraper:build
122 |
123 | 2.2 The CLI
124 | -----------
125 |
126 | You can use the command line application to get your tweets stored to
127 | JSON right away. Twitterscraper takes several arguments (a combined example is given below this list):
128 |
129 | - ``-h`` or ``--help`` Prints out the help message and exits.
130 |
131 | - ``-l`` or ``--limit`` TwitterScraper stops scraping when *at least*
132 |   the number of tweets indicated with ``--limit`` is scraped. Since
133 |   tweets are retrieved in batches of 20, this will always be a multiple
134 |   of 20. Omit the limit to retrieve all tweets. You can at any time abort the
135 |   scraping by pressing Ctrl+C; the scraped tweets will be stored safely
136 |   in your JSON file.
137 |
138 | - ``--lang`` Retrieves tweets written in a specific language. Currently
139 |   30+ languages are supported. For a full list of the languages print
140 |   out the help message.
141 |
142 | - ``-bd`` or ``--begindate`` Set the date from which TwitterScraper
143 |   should start scraping for your query. Format is YYYY-MM-DD. The
144 |   default value is set to 2006-03-21. This does not work in combination with ``--user``.
145 |
146 | - ``-ed`` or ``--enddate`` Set the end date at which TwitterScraper should
147 |   stop scraping for your query. Format is YYYY-MM-DD. The
148 |   default value is set to today. This does not work in combination with ``--user``.
149 |
150 | - ``-u`` or ``--user`` Scrapes the tweets from that user's profile page.
151 |   This also includes all retweets by that user. See section 2.2.3 in the examples below
152 |   for more information.
153 |
154 | - ``--profiles``: In addition to the tweets, Twitterscraper will also scrape the profile
155 |   information of the users who have written these tweets. The results will be saved in the
156 |   file userprofiles_.
157 |
158 | - ``-p`` or ``--poolsize`` Set the number of parallel processes
159 |   TwitterScraper should initiate while scraping for your query. Default
160 |   value is set to 20. Depending on the computational power you have,
161 |   you can increase this number. It is advised to keep this number below
162 |   the number of days you are scraping. For example, if you are
163 |   scraping from 2017-01-10 to 2017-01-20, you can set this number to a
164 |   maximum of 10. If you are scraping from 2016-01-01 to 2016-12-31, you
165 |   can increase this number to a maximum of 150, if you have the
166 |   computational resources. Does not work in combination with ``--user``.
167 |
168 | - ``-o`` or ``--output`` Gives the name of the output file. If no
169 |   output filename is given, the default filename 'tweets.json' or 'tweets.csv'
170 |   will be used.
171 |
172 | - ``-c`` or ``--csv`` Write the result to a CSV file instead of a JSON file.
173 |
174 | - ``-d`` or ``--dump``: With this argument, the scraped tweets will be
175 |   printed to the screen instead of written to an output file. If you are using this
176 |   argument, the ``--output`` argument does not need to be used.
177 |
178 | - ``-ow`` or ``--overwrite``: With this argument, if the output file already exists
179 |   it will be overwritten. If this argument is not set (default), twitterscraper will
180 |   exit with the warning that the output file already exists.
181 |
182 | - ``-dp`` or ``--disableproxy``: With this argument, proxy servers are not used when scraping tweets or user profiles from Twitter.
183 |
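Taken together, a typical run that combines several of the arguments above could look like this (the search term, dates and filenames are only illustrative):

``twitterscraper "climate change" --lang en -bd 2019-01-01 -ed 2019-06-30 -p 10 --csv --overwrite -o climate_tweets.csv``

``twitterscraper realDonaldTrump --user --limit 500 --dump``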
184 | 2.2.1 Examples of simple queries
185 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
186 |
187 | Below are some examples of how twitterscraper can be used:
188 |
189 | ``twitterscraper Trump --limit 1000 --output=tweets.json``
190 |
191 | ``twitterscraper Trump -l 1000 -o tweets.json``
192 |
193 | ``twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``
194 |
195 |
196 |
197 | 2.2.2 Examples of advanced queries
198 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
199 |
200 | You can use any advanced query Twitter supports. An advanced query
201 | should be placed within quotes, so that twitterscraper can recognize it
202 | as one single query.
203 |
204 | Here are some examples:
205 |
206 | - search for the occurrence of 'Bitcoin' or 'BTC':
207 |   ``twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000``
208 | - search for the occurrence of 'Bitcoin' and 'BTC':
209 |   ``twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000``
210 | - search for tweets from a specific user:
211 |   ``twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000``
212 | - search for tweets to a specific user:
213 |   ``twitterscraper "Blockchain to:VitalikButerin" -o blockchain_tweets.json -l 1000``
214 | - search for tweets written from a location:
215 |   ``twitterscraper "Blockchain near:Seattle within:15mi" -o blockchain_tweets.json -l 1000``
216 |
217 | You can construct an advanced query on `Twitter Advanced Search `__ or use one of the operators shown on `this page `__.
218 | Also see `Twitter's Standard operators `__.
219 |
220 |
221 |
222 | 2.2.3 Examples of scraping user pages
223 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
224 |
225 | You can also scrape all tweets written or retweeted by a specific user.
226 | This can be done by adding the ``-u / --user`` argument.
227 | If this argument is used, the search term should be equal to the username.
228 |
229 | Here is an example of scraping a specific user:
230 |
231 | ``twitterscraper realDonaldTrump --user -o tweets_username.json``
232 |
233 | This does not work in combination with ``-p``, ``-bd``, or ``-ed``.
234 |
235 | The main difference with the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes
236 | all tweets from a profile page (including retweets).
237 | The example in 2.2.2 scrapes the results from the search page (excluding retweets).
238 |
239 |
240 | 2.3 From within Python
241 | ----------------------
242 |
243 | You can easily use TwitterScraper from within Python (a more complete example follows section 2.3.1 below):
244 |
245 | ::
246 |
247 |     from twitterscraper import query_tweets
248 |
249 |     if __name__ == '__main__':
250 |         list_of_tweets = query_tweets("Trump OR Clinton", 10)
251 |
252 |         #print the retrieved tweets to the screen:
253 |         for tweet in query_tweets("Trump OR Clinton", 10):
254 |             print(tweet)
255 |
256 |         #Or save the retrieved tweets to file:
257 |         file = open("output.txt", "w")
258 |         for tweet in query_tweets("Trump OR Clinton", 10):
259 |             file.write(str(tweet.text.encode('utf-8')))
260 |         file.close()
261 |
262 | 2.3.1 Examples of Python Queries
263 | --------------------------------
264 |
265 | - Query tweets from a given URL:
266 |   Parameters:
267 |    - query: The query search parameter of url
268 |    - lang: Language of queried url
269 |    - pos: Parameter passed for where to start looking in url
270 |    - retry: Number of times to retry if error
271 |
272 | .. code:: python
273 |
274 |     query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60)
275 |
276 | - Query all tweets that match a query:
277 |   Parameters:
278 |    - query: The query search parameter
279 |    - limit: Number of tweets returned
280 |    - begindate: Start date of query
281 |    - enddate: End date of query
282 |    - poolsize: Number of parallel processes to use
283 |    - lang: Language of query
284 |
285 | .. code:: python
286 |
287 |     query_tweets('query', limit=None, begindate=dt.date.today(), enddate=dt.date.today(), poolsize=20, lang='')
288 |
289 | - Query tweets from a specific user:
290 |   Parameters:
291 |    - user: Twitter username
292 |    - limit: Number of tweets returned
293 |
294 | .. code:: python
295 |
296 |     query_tweets_from_user(user, limit=None)
297 |
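Putting these together, below is a more complete sketch of a Python script that scrapes a search query as well as a user's profile page and saves a few fields per tweet. It only uses functions shipped with twitterscraper (``query_tweets`` and ``query_tweets_from_user``); the query, dates, username and filename are placeholders you should adapt to your own needs.

.. code:: python

    import datetime as dt
    import json

    from twitterscraper import query_tweets, query_tweets_from_user

    if __name__ == '__main__':
        # Scrape a search query for a fixed date range and language.
        tweets = query_tweets('Bitcoin OR BTC', limit=200,
                              begindate=dt.date(2019, 1, 1),
                              enddate=dt.date(2019, 6, 30),
                              poolsize=10, lang='en')

        # Scrape the profile page of a single user (this includes retweets).
        user_tweets = query_tweets_from_user('VitalikButerin', limit=100)

        # Keep a few fields per tweet and store them as UTF-8 encoded JSON.
        records = [{'user': t.username, 'timestamp': t.timestamp.isoformat(),
                    'likes': t.likes, 'retweets': t.retweets, 'text': t.text}
                   for t in tweets + user_tweets]
        with open('my_tweets.json', 'w', encoding='utf-8') as f:
            json.dump(records, f, ensure_ascii=False)

As in the example above, the ``if __name__ == '__main__'`` guard matters here because ``query_tweets`` starts parallel processes while scraping.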
298 | 2.4 Scraping for retweets
299 | -------------------------
300 |
301 | A regular search within Twitter will not show you any retweets.
302 | The output of twitterscraper therefore does not contain any retweets.
303 |
304 | To give an example: If user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet,
305 | a search for ``#trump2020`` will only show the original tweet.
306 |
307 | The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the ``-u / --user`` argument.
308 |
309 |
310 | 2.5 Scraping for User Profile information
311 | -----------------------------------------
312 | By adding the argument ``--profiles``, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets.
313 | The results will be saved in the file "userprofiles_".
314 |
315 | Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :)
316 | It is also possible to scrape for profile information without scraping for tweets.
317 | Examples of this can be found in the examples folder.
318 |
319 |
320 | 3. Output
321 | =========
322 |
323 | All of the retrieved Tweets are stored in the indicated output file. The
324 | contents of the output file will look like:
325 |
326 | ::
327 |
328 |     [{"fullname": "Rupert Meehl", "id": "892397793071050752", "likes": "1", "replies": "0", "retweets": "0", "text": "Latest: Trump now at lowest Approval and highest Disapproval ratings yet. Oh, we're winning bigly here ...\n\nhttps://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "Rupert_Meehl"}, {"fullname": "Barry Shapiro", "id": "892397794375327744", "likes": "0", "replies": "0", "retweets": "0", "text": "A former GOP Rep quoted this line, which pretty much sums up Donald Trump. https://twitter.com/davidfrum/status/863017301595107329\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "barryshap"}, (...)
329 |     ]
330 |
331 | 3.1 Opening the output file
332 | ---------------------------
333 |
334 | In order to correctly handle all possible characters in the tweets
335 | (think of Japanese or Arabic characters), the output is saved as utf-8
336 | encoded bytes. That is why you could see text like
337 | "\u30b1 \u30f3 \u3055 \u307e \u30fe ..." in the output file.
338 |
339 | What you should do is open the file with the proper encoding:
340 |
341 | .. figure:: https://user-images.githubusercontent.com/4409108/30702318-f05bc196-9eec-11e7-8234-a07aabec294f.PNG
342 |
343 |     Example of output with Japanese characters
344 |
345 | 3.1.2 Opening into a pandas dataframe
346 | -------------------------------------
347 |
348 | After the file has been opened, it can easily be converted into a ``pandas`` DataFrame:
349 |
350 | ::
351 |
352 |     import pandas as pd
353 |     df = pd.read_json('tweets.json', encoding='utf-8')
354 |
--------------------------------------------------------------------------------
/changelog.txt:
--------------------------------------------------------------------------------
1 | # twitterscraper changelog
2 |
3 | # 1.6.1 ( 2020-07-28 )
4 | ## Fixed
5 | - Issue 330: Added KeyError to the try / except so that it no longer breaks when json_resp does not have this key.
6 | 7 | # 1.6.0 ( 2020-07-22 ) 8 | ## Added 9 | - PR234: Adds command line argument -dp or --disableproxy to disable to use of proxy when querying. 10 | ## Improved 11 | - PR261: Improve logging; there is no ts_logger file, logger is initiated in main.py and query.py, loglevel is set via CLI. 12 | 13 | # 1.5.0 ( 2020-07-22 ) 14 | ## Fixed 15 | - PR304: Fixed query.py by adding 'X-Requested-With': 'XMLHttpRequest' to header value. 16 | - PR253: Fixed Docker build 17 | ## Added 18 | - PR313: Added example to README (section 2.3.1). 19 | - PR277: Support emojis by adding the alt text of images to the tweet text. 20 | 21 | # 1.4.0 ( 2019-11-03 ) 22 | ## Fixed 23 | - PR228: Fixed Typo in Readme 24 | - PR224: Force CSV quoting for all non-numeric values 25 | ## Added 26 | - PR213: Added Dockerfile for Docker support 27 | - PR220: Passed timeout value of 60s from method to requests.get() 28 | - PR231: Added a lot of tweet attributes to the output, regarding links, media and replies. 29 | - PR233: Added support for searching for the '&' sign. 30 | ## Improved 31 | - PR223: Pretty printing the output which is dumped 32 | 33 | # 1.3.1 ( 2019-09-07 ) 34 | ## Fixed 35 | - Change two uses of f-strings to .format() since f-strings only work well with Python 3.6+ 36 | 37 | # 1.3.0 ( 2019-09-07 ) 38 | ## Added 39 | - Added the use of proxies while making an request. 40 | - PR #204: Added a max timeout to twitterscraper requests which is set to 60s by default. 41 | 42 | # 1.2.1 ( 2019-08-06 ) 43 | ### Fixed 44 | - PR #208: Fixed a type in a print statement which was breaking down twitterscraper 45 | - Remove the use of fake_useragent library 46 | 47 | # 1.2.0 ( 2019-06-22 ) 48 | ### Added 49 | - PR #186: adds the fields is_retweet, retweeter related information, and timestamp_epochs to the output. 50 | - PR #184: use fake_useragent for generation of random user agent headers. 51 | - Additionally scraper for 'is_verified' when scraping for user profile pages. 52 | 53 | # 1.1.0 ( 2019-06-15 ) 54 | ### Added 55 | - PR #176: Using billiard library instead of multiprocessing to add the ability to use this library with Celery. 56 | 57 | # 1.0.1 ( 2019-06-15 ) 58 | ### Fixed 59 | - PR #191: wrong argument was used in the method query_tweets_from_user() 60 | - CSV output file has as default ";" as a separator. 61 | - PR #173: Some small improvements on the profile page scraping. 62 | ### Added 63 | - Command line argument -ow / --overwrite to indicate if an existing output file should be overwritten. 64 | 65 | # 1.0.0 ( 2019-02-04 ) 66 | ### Added 67 | - PR #159: scrapes user profile pages for additional information. 68 | ### Fixed: 69 | - Moved example scripts demonstrating use of get_user_info() functionality to examples folder 70 | - removed screenshot demonstrating get_user_info() works 71 | - Added command line argument to main.py which calls get_user_info() for all users in list of scraped tweets. 72 | 73 | # 0.9.3 ( 2018-11-04 ) 74 | ### Fixed 75 | - PR #143: cancels query if end-date is earlier than begin-date. 76 | - PR #151: returned json_resp['min_position] is parsed in order to quote special characters. 77 | - PR #153: cast Tweets attributes to proper data types (int instead of str) 78 | - Use codecs.open() to write to file. Should fix issues 144 and 147. 79 | 80 | # 0.9.0 ( 2018-07-18 ) 81 | ### Added 82 | - Added -u / --user command line argument which can be used to scrape all 83 | tweets from an users profile page. 84 | 85 | ## 0.8.1 ( 2018-07-18 ) 86 | - saving .csv files as an utf-8 encoded file. 
This fixes https://github.com/taspinar/twitterscraper/issues/138 87 | 88 | ## 0.8.0 ( 2018-07-17 ) 89 | ### Fixed 90 | - remove two headers which caused bad fetching results https://github.com/taspinar/twitterscraper/issues/126#issuecomment-405132147 91 | - fix python2 logger bug https://github.com/taspinar/twitterscraper/issues/134 https://github.com/taspinar/twitterscraper/issues/132 https://github.com/taspinar/twitterscraper/issues/127 92 | 93 | ### Improved 94 | - Use a generator to get tweets, but convert to list in `query_tweets_once` 95 | - this is useful for low memory applications, like massively parallelizing twitter scraping through AWS Lambda (128MB RAM) 96 | - use single quotes for all strings (it was inconsistent prior) 97 | - pep8 compliance on L28 98 | 99 | ### Removed 100 | - remove `eliminate_duplicates` dead code 101 | 102 | # 0.7.2 ( 2018-07-09 ) 103 | ### Fixed 104 | - twitterscraper.logging is imported as logger instead of logging in order to 105 | avoid a module name clash with Python2's logging module. 106 | 107 | # 0.7.1 ( 2018-06-12 ) 108 | ### Improved 109 | - Give access to logger for scripts which import this module. Create the module, 110 | `logging.py`, which contains the logger used by twitterscraper. 111 | 112 | ### Removed 113 | - fake_useragent is removed as a dependency, since it has been giving 114 | user-agent headers which keep being blocked by Twitter. 115 | 116 | ## 0.7.0 ( 2018-05-06 ) 117 | ### Fixed 118 | - By using linspace() instead of range() to divide the number of days into 119 | the number of parallel processes, edge cases ( p = 1 ) now also work fine. 120 | This fixes https://github.com/taspinar/twitterscraper/issues/108. 121 | 122 | ### Improved 123 | - The default value of begindate is set to 2006-03-21. The previous value (2017-01-01) 124 | was chosen arbitrarily and leaded to questions why not all tweets were retrieved. 125 | This fixes https://github.com/taspinar/twitterscraper/issues/88. 126 | 127 | ### Added 128 | - Users can now save the tweets in a csv-format, with the arguments "-c" or "--csv" 129 | 130 | ## 0.6.2 ( 2018-03-21 ) 131 | ### Fixed 132 | - Errors occuring during the serialization of a non-html response (everything after 1st request), 133 | No longer crashes the program but is catched with a try / except. 134 | - Fixes https://github.com/taspinar/twitterscraper/issues/93 135 | 136 | - The '@' character in an username is now removed by the ".strip('\@')" method instead of "[1:]". 137 | - This fixes issue https://github.com/taspinar/twitterscraper/issues/105 138 | 139 | ## 0.6.1 ( 2018-03-17 ) 140 | ### Improved 141 | - The way the number of days are divided over the number of parallel processes is improved. 142 | - The maximum number of parallel processes is limited to the max no of days. 143 | - Fixes https://github.com/taspinar/twitterscraper/issues/101 144 | 145 | ## 0.6.0 ( 2018-02-17 ) 146 | ### Fixed 147 | - PR #89: closed pools to prevent zombie processes. 148 | 149 | 150 | ## 0.5.1 ( 2018-02-17 ) 151 | ### Fixed 152 | - Fixed MaxRecursionError crashes which was introduced with version 0.5.0 153 | 154 | ## 0.5.0 ( 2018-01-11 ) 155 | ### Added 156 | - Added the html code of a tweet message to the Tweet class as one of its attributes 157 | 158 | ## 0.4.2 ( 2018-01-09 ) 159 | ### Fixed 160 | - Fixed backward compatability of the new --lang parameter by placing it at the end of all arguments. 
161 | 162 | ## 0.4.1 ( 2018-01-07 ) 163 | ### Fixed 164 | - Fixed --lang functionality by passing the lang parameter from its CL argument form to the generater url. 165 | 166 | ## 0.4 ( 2017-12-19 ) 167 | ----------- 168 | ### Added 169 | - Added "-bd / --begindate" command line arguments to set the begin date of the query 170 | - Added "-ed / --enddate" command line arguments to set the end date of the query. 171 | - Added "-p / --poolsize" command line arguments which can change the number of parallel processes. 172 | Default number of parallel processes is set to 20. 173 | 174 | ### Improved 175 | - Outputfile is only created if tweets are actually retrieved. 176 | 177 | ### Removed 178 | - The ´query_all_tweets' method in the Query module is removed. Since twitterscraper is starting parallel processes by default, 179 | this method is no longer necessary. 180 | 181 | ### Changed 182 | - The 'query_tweets' method now takes as arguments query, limit, begindate, enddate, poolsize. 183 | - The 'query_tweets_once' no longer has the argument 'num_tweets' 184 | - The default value of the 'retry' argument of the 'query_single_page' method has been increased from 3 to 10. 185 | - The ´query_tweets_once' method does not log to screen at every single scrape, but at the end of a batch. 186 | 187 | 188 | ## 0.3.3 ( 2017-12-06 ) 189 | ----------- 190 | ### Added 191 | -PR #61: Adding --lang functionality which can retrieve tweets written in a specific language. 192 | -PR #62: Tweet class now also contains the tweet url. This closes https://github.com/taspinar/twitterscraper/issues/59 193 | 194 | 195 | ## 0.3.2 ( 2017-11-12 ) 196 | ----------- 197 | ### Improved 198 | -PR #55: Adding --dump functionality which dumps the scraped tweets to screen, instead of an outputfile. 199 | 200 | 201 | ## 0.3.1 ( 2017-11-05 ) 202 | ----------- 203 | ### Improved 204 | -PR #49: scraping of replies, retweets and likes is improved. 205 | 206 | 207 | ## 0.3.0 ( 2017-08-01 ) 208 | ----------- 209 | ### Added 210 | - Tweet class now also includes 'replies', 'retweets' and 'likes' 211 | 212 | 213 | ## 0.2.7 ( 2017-01-10 ) 214 | ----------- 215 | ### Improved 216 | - PR #26: use ``requests`` library for HTTP requests. Makes the use of urllib2 / urllib redundant. 
217 | ### Added: 218 | - changelog.txt for GitHub 219 | - HISTORY.rst for PyPi 220 | - README.rst for PyPi 221 | 222 | ## 0.2.6 ( 2017-01-02 ) 223 | ----------- 224 | ### Improved 225 | - PR #25: convert date retrieved from timestamp to day precision 226 | -------------------------------------------------------------------------------- /examples/get_twitter_user_data.py: -------------------------------------------------------------------------------- 1 | from twitterscraper.query import query_user_info 2 | import pandas as pd 3 | from multiprocessing import Pool 4 | import time 5 | from IPython.display import display 6 | 7 | 8 | global twitter_user_info 9 | twitter_user_info=[] 10 | 11 | 12 | def get_user_info(twitter_user): 13 | """ 14 | An example of using the query_user_info method 15 | :param twitter_user: the twitter user to capture user data 16 | :return: twitter_user_data: returns a dictionary of twitter user data 17 | """ 18 | user_info = query_user_info(user= twitter_user) 19 | twitter_user_data = {} 20 | twitter_user_data["user"] = user_info.user 21 | twitter_user_data["fullname"] = user_info.full_name 22 | twitter_user_data["location"] = user_info.location 23 | twitter_user_data["blog"] = user_info.blog 24 | twitter_user_data["date_joined"] = user_info.date_joined 25 | twitter_user_data["id"] = user_info.id 26 | twitter_user_data["num_tweets"] = user_info.tweets 27 | twitter_user_data["following"] = user_info.following 28 | twitter_user_data["followers"] = user_info.followers 29 | twitter_user_data["likes"] = user_info.likes 30 | twitter_user_data["lists"] = user_info.lists 31 | 32 | return twitter_user_data 33 | 34 | 35 | def main(): 36 | start = time.time() 37 | users = ['Carlos_F_Enguix', 'mmtung', 'dremio', 'MongoDB', 'JenWike', 'timberners_lee','ataspinar2', 'realDonaldTrump', 38 | 'BarackObama', 'elonmusk', 'BillGates', 'BillClinton','katyperry','KimKardashian'] 39 | 40 | pool = Pool(8) 41 | for user in pool.map(get_user_info,users): 42 | twitter_user_info.append(user) 43 | 44 | cols=['id','fullname','date_joined','location','blog', 'num_tweets','following','followers','likes','lists'] 45 | data_frame = pd.DataFrame(twitter_user_info, index=users, columns=cols) 46 | data_frame.index.name = "Users" 47 | data_frame.sort_values(by="followers", ascending=False, inplace=True, kind='quicksort', na_position='last') 48 | elapsed = time.time() - start 49 | print(f"Elapsed time: {elapsed}") 50 | display(data_frame) 51 | 52 | 53 | if __name__ == '__main__': 54 | main() -------------------------------------------------------------------------------- /examples/get_twitter_user_data_parallel.py: -------------------------------------------------------------------------------- 1 | from twitterscraper.query import query_user_info 2 | import pandas as pd 3 | from multiprocessing import Pool 4 | from IPython.display import display 5 | import sys 6 | 7 | global twitter_user_info 8 | twitter_user_info = [] 9 | 10 | 11 | def get_user_info(twitter_user): 12 | """ 13 | An example of using the query_user_info method 14 | :param twitter_user: the twitter user to capture user data 15 | :return: twitter_user_data: returns a dictionary of twitter user data 16 | """ 17 | user_info = query_user_info(user=twitter_user) 18 | twitter_user_data = {} 19 | twitter_user_data["user"] = user_info.user 20 | twitter_user_data["fullname"] = user_info.full_name 21 | twitter_user_data["location"] = user_info.location 22 | twitter_user_data["blog"] = user_info.blog 23 | twitter_user_data["date_joined"] = 
user_info.date_joined 24 | twitter_user_data["id"] = user_info.id 25 | twitter_user_data["num_tweets"] = user_info.tweets 26 | twitter_user_data["following"] = user_info.following 27 | twitter_user_data["followers"] = user_info.followers 28 | twitter_user_data["likes"] = user_info.likes 29 | twitter_user_data["lists"] = user_info.lists 30 | 31 | return twitter_user_data 32 | 33 | 34 | def main(args): 35 | users = [] 36 | 37 | for arg in args: 38 | users.append(arg) 39 | 40 | pool_size = len(users) 41 | if pool_size < 8: 42 | pool = Pool(pool_size) 43 | else: 44 | pool = Pool(8) 45 | 46 | for user in pool.map(get_user_info, users): 47 | twitter_user_info.append(user) 48 | 49 | cols = ['id', 'fullname', 'date_joined', 'location', 'blog', 'num_tweets', 'following', 'followers', 'likes', 50 | 'lists'] 51 | data_frame = pd.DataFrame(twitter_user_info, index=users, columns=cols) 52 | data_frame.index.name = "Users" 53 | data_frame.sort_values(by="followers", ascending=False, inplace=True, kind='quicksort', na_position='last') 54 | display(data_frame) 55 | 56 | 57 | if __name__ == '__main__': 58 | main(sys.argv[1:]) 59 | 60 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | coala-utils~=0.5.0 2 | bs4 3 | lxml 4 | requests 5 | billiard 6 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from setuptools import setup, find_packages 4 | with open('requirements.txt') as requirements: 5 | required = requirements.read().splitlines() 6 | 7 | setup( 8 | name='twitterscraper', 9 | version='1.6.1', 10 | description='Tool for scraping Tweets', 11 | url='https://github.com/taspinar/twitterscraper', 12 | author=['Ahmet Taspinar', 'Lasse Schuirmann'], 13 | author_email='taspinar@gmail.com', 14 | license='MIT', 15 | packages=find_packages(exclude=["build.*", "tests", "tests.*"]), 16 | install_requires=required, 17 | entry_points={ 18 | "console_scripts": [ 19 | "twitterscraper = twitterscraper.main:main" 20 | ] 21 | }) 22 | -------------------------------------------------------------------------------- /twitterscraper/__init__.py: -------------------------------------------------------------------------------- 1 | # TwitterScraper 2 | # Copyright 2016-2020 Ahmet Taspinar 3 | # See LICENSE for details. 4 | """ 5 | Twitter Scraper tool 6 | """ 7 | 8 | __version__ = '1.6.1' 9 | __author__ = 'Ahmet Taspinar' 10 | __license__ = 'MIT' 11 | 12 | 13 | from twitterscraper.query import query_tweets 14 | from twitterscraper.query import query_tweets_from_user 15 | from twitterscraper.query import query_user_info 16 | from twitterscraper.tweet import Tweet 17 | from twitterscraper.user import User 18 | -------------------------------------------------------------------------------- /twitterscraper/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a command line application that allows you to scrape twitter! 
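Example: twitterscraper Trump --limit 1000 --output=tweets.json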
3 | """ 4 | import argparse 5 | import collections 6 | import csv 7 | import datetime as dt 8 | import json 9 | import logging 10 | from os.path import isfile 11 | from pprint import pprint 12 | 13 | from twitterscraper.query import (query_tweets, query_tweets_from_user, 14 | query_user_info) 15 | 16 | logger = logging.getLogger('twitterscraper') 17 | 18 | 19 | class JSONEncoder(json.JSONEncoder): 20 | def default(self, obj): 21 | if hasattr(obj, '__json__'): 22 | return obj.__json__() 23 | elif isinstance(obj, collections.Iterable): 24 | return list(obj) 25 | elif isinstance(obj, dt.datetime): 26 | return obj.isoformat() 27 | elif hasattr(obj, '__getitem__') and hasattr(obj, 'keys'): 28 | return dict(obj) 29 | elif hasattr(obj, '__dict__'): 30 | return {member: getattr(obj, member) 31 | for member in dir(obj) 32 | if not member.startswith('_') and 33 | not hasattr(getattr(obj, member), '__call__')} 34 | 35 | return json.JSONEncoder.default(self, obj) 36 | 37 | def valid_date(s): 38 | try: 39 | return dt.datetime.strptime(s, "%Y-%m-%d").date() 40 | except ValueError: 41 | msg = "Not a valid date: '{0}'.".format(s) 42 | raise argparse.ArgumentTypeError(msg) 43 | 44 | def valid_loglevel(level): 45 | try: 46 | return logging._checkLevel(level) 47 | except (ValueError, TypeError) as ex: 48 | raise argparse.ArgumentTypeError(ex) 49 | 50 | def main(): 51 | try: 52 | parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter, 53 | description=__doc__ 54 | ) 55 | 56 | parser.add_argument("query", type=str, help="Advanced twitter query") 57 | parser.add_argument("-o", "--output", type=str, default="tweets.json", 58 | help="Path to a JSON file to store the gathered " 59 | "tweets to.") 60 | parser.add_argument("-l", "--limit", type=int, default=None, 61 | help="Number of minimum tweets to gather.") 62 | parser.add_argument("-a", "--all", action='store_true', 63 | help="Set this flag if you want to get all tweets " 64 | "in the history of twitter. Begindate is set to 2006-03-01." 65 | "This may take a while. You can increase the number of parallel" 66 | "processes depending on the computational power you have.") 67 | parser.add_argument("-c", "--csv", action='store_true', 68 | help="Set this flag if you want to save the results to a CSV format.") 69 | parser.add_argument("-u", "--user", action='store_true', 70 | help="Set this flag to if you want to scrape tweets from a specific user" 71 | "The query should then consist of the profilename you want to scrape without @") 72 | parser.add_argument("--profiles", action='store_true', 73 | help="Set this flag to if you want to scrape profile info of all the users where you" 74 | "have previously scraped from. After all of the tweets have been scraped it will start" 75 | "a new process of scraping profile pages.") 76 | parser.add_argument("--lang", type=str, default=None, 77 | help="Set this flag if you want to query tweets in \na specific language. 
You can choose from:\n" 78 | "en (English)\nar (Arabic)\nbn (Bengali)\n" 79 | "cs (Czech)\nda (Danish)\nde (German)\nel (Greek)\nes (Spanish)\n" 80 | "fa (Persian)\nfi (Finnish)\nfil (Filipino)\nfr (French)\n" 81 | "he (Hebrew)\nhi (Hindi)\nhu (Hungarian)\n" 82 | "id (Indonesian)\nit (Italian)\nja (Japanese)\n" 83 | "ko (Korean)\nmsa (Malay)\nnl (Dutch)\n" 84 | "no (Norwegian)\npl (Polish)\npt (Portuguese)\n" 85 | "ro (Romanian)\nru (Russian)\nsv (Swedish)\n" 86 | "th (Thai)\ntr (Turkish)\nuk (Ukranian)\n" 87 | "ur (Urdu)\nvi (Vietnamese)\n" 88 | "zh-cn (Chinese Simplified)\n" 89 | "zh-tw (Chinese Traditional)" 90 | ) 91 | parser.add_argument("-d", "--dump", action="store_true", 92 | help="Set this flag if you want to dump the tweets \nto the console rather than outputting to a file") 93 | parser.add_argument("-ow", "--overwrite", action="store_true", 94 | help="Set this flag if you want to overwrite the existing output file.") 95 | parser.add_argument("-bd", "--begindate", type=valid_date, default="2006-03-21", 96 | help="Scrape for tweets starting from this date. Format YYYY-MM-DD. \nDefault value is 2006-03-21", metavar='\b') 97 | parser.add_argument("-ed", "--enddate", type=valid_date, default=dt.date.today(), 98 | help="Scrape for tweets until this date. Format YYYY-MM-DD. \nDefault value is the date of today.", metavar='\b') 99 | parser.add_argument("-p", "--poolsize", type=int, default=20, help="Specify the number of parallel process you want to run. \n" 100 | "Default value is set to 20. \nYou can change this number if you have more computing power available. \n" 101 | "Set to 1 if you dont want to run any parallel processes.", metavar='\b') 102 | parser.add_argument("--loglevel", type=valid_loglevel, default=logging.INFO, help="Specify the level for logging. \n" 103 | "Must be a valid value from https://docs.python.org/2/library/logging.html#logging-levels. \n" 104 | "Default log level is set to INFO.") 105 | parser.add_argument("-dp", "--disableproxy", action="store_true", default=False, help="Set this flag if you want to disable use of proxy servers when scrapping tweets and user profiles. \n") 106 | args = parser.parse_args() 107 | 108 | logging.basicConfig() 109 | logger.setLevel(args.loglevel) 110 | 111 | if isfile(args.output) and not args.dump and not args.overwrite: 112 | logger.error("Output file already exists! 
Aborting.") 113 | exit(-1) 114 | 115 | if args.all: 116 | args.begindate = dt.date(2006,3,1) 117 | 118 | if args.user: 119 | tweets = query_tweets_from_user(user = args.query, limit = args.limit, use_proxy = not args.disableproxy) 120 | else: 121 | tweets = query_tweets(query = args.query, limit = args.limit, 122 | begindate = args.begindate, enddate = args.enddate, 123 | poolsize = args.poolsize, lang = args.lang, use_proxy = not args.disableproxy) 124 | 125 | if args.dump: 126 | pprint([tweet.__dict__ for tweet in tweets]) 127 | else: 128 | if tweets: 129 | with open(args.output, "w", encoding="utf-8") as output: 130 | if args.csv: 131 | f = csv.writer(output, delimiter=";", quoting=csv.QUOTE_NONNUMERIC) 132 | f.writerow([ 133 | "screen_name", "username", "user_id", "tweet_id", 134 | "tweet_url", "timestamp", "timestamp_epochs", 135 | "text", "text_html", "links", "hashtags", 136 | "has_media", "img_urls", "video_url", "likes", 137 | "retweets", "replies", "is_replied", "is_reply_to", 138 | "parent_tweet_id", "reply_to_users" 139 | ]) 140 | for t in tweets: 141 | f.writerow([ 142 | t.screen_name, t.username, t.user_id, 143 | t.tweet_id, t.tweet_url, t.timestamp, 144 | t.timestamp_epochs, t.text, t.text_html, 145 | t.links, t.hashtags, t.has_media, t.img_urls, 146 | t.video_url, t.likes, t.retweets, t.replies, 147 | t.is_replied, t.is_reply_to, t.parent_tweet_id, 148 | t.reply_to_users 149 | ]) 150 | else: 151 | json.dump(tweets, output, cls=JSONEncoder) 152 | if args.profiles and tweets: 153 | list_users = list(set([tweet.username for tweet in tweets])) 154 | list_users_info = [query_user_info(elem, not args.disableproxy) for elem in list_users] 155 | filename = 'userprofiles_' + args.output 156 | with open(filename, "w", encoding="utf-8") as output: 157 | json.dump(list_users_info, output, cls=JSONEncoder) 158 | except KeyboardInterrupt: 159 | logger.info("Program interrupted by user. 
Quitting...") 160 | -------------------------------------------------------------------------------- /twitterscraper/query.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import datetime as dt 4 | import json 5 | import logging 6 | import random 7 | import sys 8 | import urllib 9 | from functools import partial 10 | from itertools import cycle 11 | 12 | import requests 13 | from billiard.pool import Pool 14 | from bs4 import BeautifulSoup 15 | 16 | from twitterscraper.tweet import Tweet 17 | from twitterscraper.user import User 18 | 19 | logger = logging.getLogger('twitterscraper') 20 | 21 | #from fake_useragent import UserAgent 22 | #ua = UserAgent() 23 | #HEADER = {'User-Agent': ua.random} 24 | HEADERS_LIST = [ 25 | 'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13', 26 | 'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko', 27 | 'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201', 28 | 'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16', 29 | 'Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre' 30 | ] 31 | 32 | HEADER = {'User-Agent': random.choice(HEADERS_LIST), 'X-Requested-With': 'XMLHttpRequest'} 33 | logger.info(HEADER) 34 | 35 | INIT_URL = 'https://twitter.com/search?f=tweets&vertical=default&q={q}&l={lang}' 36 | RELOAD_URL = 'https://twitter.com/i/search/timeline?f=tweets&vertical=' \ 37 | 'default&include_available_features=1&include_entities=1&' \ 38 | 'reset_error_state=false&src=typd&max_position={pos}&q={q}&l={lang}' 39 | INIT_URL_USER = 'https://twitter.com/{u}' 40 | RELOAD_URL_USER = 'https://twitter.com/i/profiles/show/{u}/timeline/tweets?' \ 41 | 'include_available_features=1&include_entities=1&' \ 42 | 'max_position={pos}&reset_error_state=false' 43 | PROXY_URL = 'https://free-proxy-list.net/' 44 | 45 | def get_proxies(): 46 | response = requests.get(PROXY_URL) 47 | soup = BeautifulSoup(response.text, 'lxml') 48 | table = soup.find('table',id='proxylisttable') 49 | list_tr = table.find_all('tr') 50 | list_td = [elem.find_all('td') for elem in list_tr] 51 | list_td = list(filter(None, list_td)) 52 | list_ip = [elem[0].text for elem in list_td] 53 | list_ports = [elem[1].text for elem in list_td] 54 | list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))] 55 | return list_proxies 56 | 57 | def get_query_url(query, lang, pos, from_user = False): 58 | if from_user: 59 | if pos is None: 60 | return INIT_URL_USER.format(u=query) 61 | else: 62 | return RELOAD_URL_USER.format(u=query, pos=pos) 63 | if pos is None: 64 | return INIT_URL.format(q=query, lang=lang) 65 | else: 66 | return RELOAD_URL.format(q=query, pos=pos, lang=lang) 67 | 68 | def linspace(start, stop, n): 69 | if n == 1: 70 | yield stop 71 | return 72 | h = (stop - start) / (n - 1) 73 | for i in range(n): 74 | yield start + h * i 75 | 76 | proxies = get_proxies() 77 | proxy_pool = cycle(proxies) 78 | 79 | def query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60, use_proxy=True): 80 | """ 81 | Returns tweets from the given URL. 82 | 83 | :param query: The query parameter of the query url 84 | :param lang: The language parameter of the query url 85 | :param pos: The query url parameter that determines where to start looking 86 | :param retry: Number of retries if something goes wrong. 
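    :param from_user: When True, the URL of a user's timeline is queried instead of the search timeline.
    :param timeout: Maximum number of seconds to wait for the HTTP response.
    :param use_proxy: When True, the request is routed through a proxy taken from free-proxy-list.net.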
87 | :return: The list of tweets, the pos argument for getting the next page. 88 | """ 89 | url = get_query_url(query, lang, pos, from_user) 90 | logger.info('Scraping tweets from {}'.format(url)) 91 | 92 | try: 93 | if use_proxy: 94 | proxy = next(proxy_pool) 95 | logger.info('Using proxy {}'.format(proxy)) 96 | response = requests.get(url, headers=HEADER, proxies={"http": proxy}, timeout=timeout) 97 | else: 98 | print('not using proxy') 99 | response = requests.get(url, headers=HEADER, timeout=timeout) 100 | if pos is None: # html response 101 | html = response.text or '' 102 | json_resp = None 103 | else: 104 | html = '' 105 | try: 106 | json_resp = response.json() 107 | html = json_resp['items_html'] or '' 108 | except (ValueError, KeyError) as e: 109 | logger.exception('Failed to parse JSON while requesting "{}"'.format(url)) 110 | 111 | tweets = list(Tweet.from_html(html)) 112 | 113 | if not tweets: 114 | try: 115 | if json_resp: 116 | pos = json_resp['min_position'] 117 | has_more_items = json_resp['has_more_items'] 118 | if not has_more_items: 119 | logger.info("Twitter returned : 'has_more_items' ") 120 | return [], None 121 | else: 122 | pos = None 123 | except: 124 | pass 125 | if retry > 0: 126 | logger.info('Retrying... (Attempts left: {})'.format(retry)) 127 | return query_single_page(query, lang, pos, retry - 1, from_user, use_proxy=use_proxy) 128 | else: 129 | return [], pos 130 | 131 | if json_resp: 132 | return tweets, urllib.parse.quote(json_resp['min_position']) 133 | if from_user: 134 | return tweets, tweets[-1].tweet_id 135 | return tweets, "TWEET-{}-{}".format(tweets[-1].tweet_id, tweets[0].tweet_id) 136 | 137 | except requests.exceptions.HTTPError as e: 138 | logger.exception('HTTPError {} while requesting "{}"'.format( 139 | e, url)) 140 | except requests.exceptions.ConnectionError as e: 141 | logger.exception('ConnectionError {} while requesting "{}"'.format( 142 | e, url)) 143 | except requests.exceptions.Timeout as e: 144 | logger.exception('TimeOut {} while requesting "{}"'.format( 145 | e, url)) 146 | except json.decoder.JSONDecodeError as e: 147 | logger.exception('Failed to parse JSON "{}" while requesting "{}".'.format( 148 | e, url)) 149 | 150 | if retry > 0: 151 | logger.info('Retrying... (Attempts left: {})'.format(retry)) 152 | return query_single_page(query, lang, pos, retry - 1, use_proxy=use_proxy) 153 | 154 | logger.error('Giving up.') 155 | return [], None 156 | 157 | 158 | def query_tweets_once_generator(query, limit=None, lang='', pos=None, use_proxy=True): 159 | """ 160 | Queries twitter for all the tweets you want! It will load all pages it gets 161 | from twitter. However, twitter might out of a sudden stop serving new pages, 162 | in that case, use the `query_tweets` method. 163 | 164 | Note that this function catches the KeyboardInterrupt so it can return 165 | tweets on incomplete queries if the user decides to abort. 166 | 167 | :param query: Any advanced query you want to do! Compile it at 168 | https://twitter.com/search-advanced and just copy the query! 169 | :param limit: Scraping will be stopped when at least ``limit`` number of 170 | items are fetched. 171 | :param pos: Field used as a "checkpoint" to continue where you left off in iteration 172 | :return: A list of twitterscraper.Tweet objects. You will get at least 173 | ``limit`` number of items. 
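    :param lang: Language code to restrict the search to (same values as the --lang command line argument); an empty string means no restriction.
    :param use_proxy: When True, requests are routed through proxies scraped from free-proxy-list.net.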
174 | """ 175 | logger.info('Querying {}'.format(query)) 176 | query = query.replace(' ', '%20').replace('#', '%23').replace(':', '%3A').replace('&', '%26') 177 | num_tweets = 0 178 | try: 179 | while True: 180 | new_tweets, new_pos = query_single_page(query, lang, pos, use_proxy=use_proxy) 181 | if len(new_tweets) == 0: 182 | logger.info('Got {} tweets for {}.'.format( 183 | num_tweets, query)) 184 | return 185 | 186 | for t in new_tweets: 187 | yield t, pos 188 | 189 | # use new_pos only once you have iterated through all old tweets 190 | pos = new_pos 191 | 192 | num_tweets += len(new_tweets) 193 | 194 | if limit and num_tweets >= limit: 195 | logger.info('Got {} tweets for {}.'.format( 196 | num_tweets, query)) 197 | return 198 | 199 | except KeyboardInterrupt: 200 | logger.info('Program interrupted by user. Returning tweets gathered ' 201 | 'so far...') 202 | except BaseException: 203 | logger.exception('An unknown error occurred! Returning tweets ' 204 | 'gathered so far.') 205 | logger.info('Got {} tweets for {}.'.format( 206 | num_tweets, query)) 207 | 208 | 209 | def query_tweets_once(*args, **kwargs): 210 | res = list(query_tweets_once_generator(*args, **kwargs)) 211 | if res: 212 | tweets, positions = zip(*res) 213 | return tweets 214 | else: 215 | return [] 216 | 217 | 218 | def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang='', use_proxy=True): 219 | no_days = (enddate - begindate).days 220 | 221 | if(no_days < 0): 222 | sys.exit('Begin date must occur before end date.') 223 | 224 | if poolsize > no_days: 225 | # Since we are assigning each pool a range of dates to query, 226 | # the number of pools should not exceed the number of dates. 227 | poolsize = no_days 228 | dateranges = [begindate + dt.timedelta(days=elem) for elem in linspace(0, no_days, poolsize+1)] 229 | 230 | if limit and poolsize: 231 | limit_per_pool = (limit // poolsize)+1 232 | else: 233 | limit_per_pool = None 234 | 235 | queries = ['{} since:{} until:{}'.format(query, since, until) 236 | for since, until in zip(dateranges[:-1], dateranges[1:])] 237 | 238 | all_tweets = [] 239 | try: 240 | pool = Pool(poolsize) 241 | logger.info('queries: {}'.format(queries)) 242 | try: 243 | for new_tweets in pool.imap_unordered(partial(query_tweets_once, limit=limit_per_pool, lang=lang, use_proxy=use_proxy), queries): 244 | all_tweets.extend(new_tweets) 245 | logger.info('Got {} tweets ({} new).'.format( 246 | len(all_tweets), len(new_tweets))) 247 | except KeyboardInterrupt: 248 | logger.info('Program interrupted by user. Returning all tweets ' 249 | 'gathered so far.') 250 | finally: 251 | pool.close() 252 | pool.join() 253 | 254 | return all_tweets 255 | 256 | 257 | def query_tweets_from_user(user, limit=None, use_proxy=True): 258 | pos = None 259 | tweets = [] 260 | try: 261 | while True: 262 | new_tweets, pos = query_single_page(user, lang='', pos=pos, from_user=True, use_proxy=use_proxy) 263 | if len(new_tweets) == 0: 264 | logger.info("Got {} tweets from username {}".format(len(tweets), user)) 265 | return tweets 266 | 267 | tweets += new_tweets 268 | 269 | if limit and len(tweets) >= limit: 270 | logger.info("Got {} tweets from username {}".format(len(tweets), user)) 271 | return tweets 272 | 273 | except KeyboardInterrupt: 274 | logger.info("Program interrupted by user. Returning tweets gathered " 275 | "so far...") 276 | except BaseException: 277 | logger.exception("An unknown error occurred! 
Returning tweets " 278 | "gathered so far.") 279 | logger.info("Got {} tweets from username {}.".format( 280 | len(tweets), user)) 281 | return tweets 282 | 283 | 284 | def query_user_page(url, retry=10, timeout=60, use_proxy=True): 285 | """ 286 | Returns the scraped user data from a twitter user page. 287 | 288 | :param url: The URL to get the twitter user info from (url contains the user page) 289 | :param retry: Number of retries if something goes wrong. 290 | :return: Returns the scraped user data from a twitter user page. 291 | """ 292 | 293 | try: 294 | if use_proxy: 295 | proxy = next(proxy_pool) 296 | logger.info('Using proxy {}'.format(proxy)) 297 | response = requests.get(url, headers=HEADER, proxies={"http": proxy}) 298 | else: 299 | response = requests.get(url, headers=HEADER) 300 | html = response.text or '' 301 | 302 | user_info = User.from_html(html) 303 | if not user_info: 304 | return None 305 | 306 | return user_info 307 | 308 | except requests.exceptions.HTTPError as e: 309 | logger.exception('HTTPError {} while requesting "{}"'.format( 310 | e, url)) 311 | except requests.exceptions.ConnectionError as e: 312 | logger.exception('ConnectionError {} while requesting "{}"'.format( 313 | e, url)) 314 | except requests.exceptions.Timeout as e: 315 | logger.exception('TimeOut {} while requesting "{}"'.format( 316 | e, url)) 317 | 318 | if retry > 0: 319 | logger.info('Retrying... (Attempts left: {})'.format(retry)) 320 | return query_user_page(url, retry-1, use_proxy) 321 | 322 | logger.error('Giving up.') 323 | return None 324 | 325 | 326 | def query_user_info(user, use_proxy=True): 327 | """ 328 | Returns the scraped user data from a twitter user page. 329 | 330 | :param user: the twitter user to web scrape its twitter page info 331 | """ 332 | 333 | 334 | try: 335 | user_info = query_user_page(INIT_URL_USER.format(u=user), use_proxy=use_proxy) 336 | if user_info: 337 | logger.info("Got user information from username {}".format(user)) 338 | return user_info 339 | 340 | except KeyboardInterrupt: 341 | logger.info("Program interrupted by user. Returning user information gathered so far...") 342 | except BaseException: 343 | logger.exception("An unknown error occurred! 
Returning user information gathered so far...") 344 | 345 | logger.info("Got user information from username {}".format(user)) 346 | return user_info 347 | -------------------------------------------------------------------------------- /twitterscraper/tweet.py: -------------------------------------------------------------------------------- 1 | import re 2 | from datetime import datetime 3 | 4 | from bs4 import BeautifulSoup 5 | from coala_utils.decorators import generate_ordering 6 | 7 | 8 | @generate_ordering('timestamp', 'id', 'text', 'user', 'replies', 'retweets', 'likes') 9 | class Tweet: 10 | def __init__( 11 | self, screen_name, username, user_id, tweet_id, tweet_url, timestamp, 12 | timestamp_epochs, text, text_html, links, hashtags, has_media, img_urls, 13 | video_url, likes, retweets, replies, is_replied, is_reply_to, 14 | parent_tweet_id, reply_to_users 15 | ): 16 | # user name & id 17 | self.screen_name = screen_name 18 | self.username = username 19 | self.user_id = user_id 20 | # tweet basic data 21 | self.tweet_id = tweet_id 22 | self.tweet_url = tweet_url 23 | self.timestamp = timestamp 24 | self.timestamp_epochs = timestamp_epochs 25 | # tweet text 26 | self.text = text 27 | self.text_html = text_html 28 | self.links = links 29 | self.hashtags = hashtags 30 | # tweet media 31 | self.has_media = has_media 32 | self.img_urls = img_urls 33 | self.video_url = video_url 34 | # tweet actions numbers 35 | self.likes = likes 36 | self.retweets = retweets 37 | self.replies = replies 38 | self.is_replied = is_replied 39 | # detail of reply to others 40 | self.is_reply_to = is_reply_to 41 | self.parent_tweet_id = parent_tweet_id 42 | self.reply_to_users = reply_to_users 43 | 44 | @classmethod 45 | def from_soup(cls, tweet): 46 | tweet_div = tweet.find('div', 'tweet') 47 | 48 | # user name & id 49 | screen_name = tweet_div["data-screen-name"].strip('@') 50 | username = tweet_div["data-name"] 51 | user_id = tweet_div["data-user-id"] 52 | 53 | # tweet basic data 54 | tweet_id = tweet_div["data-tweet-id"] # equal to 'data-item-id' 55 | tweet_url = tweet_div["data-permalink-path"] 56 | timestamp_epochs = int(tweet.find('span', '_timestamp')['data-time']) 57 | timestamp = datetime.utcfromtimestamp(timestamp_epochs) 58 | 59 | # tweet text 60 | soup_html = tweet_div \ 61 | .find('div', 'js-tweet-text-container') \ 62 | .find('p', 'tweet-text') 63 | text_html = str(soup_html) or "" 64 | for img in soup_html.findAll("img", "Emoji"): 65 | img.replace_with(img.attrs.get("alt", '')) 66 | text = soup_html.get_text() or "" 67 | links = [ 68 | atag.get('data-expanded-url', atag['href']) 69 | for atag in soup_html.find_all('a', class_='twitter-timeline-link') 70 | if 'pic.twitter' not in atag.text # eliminate picture 71 | ] 72 | hashtags = [tag.strip('#')for tag in re.findall(r'#\w+', text)] 73 | 74 | # tweet media 75 | # --- imgs 76 | soup_imgs = tweet_div.find_all('div', 'AdaptiveMedia-photoContainer') 77 | img_urls = [ 78 | img['data-image-url'] for img in soup_imgs 79 | ] if soup_imgs else [] 80 | 81 | # --- videos 82 | video_div = tweet_div.find('div', 'PlayableMedia-container') 83 | video_url = video_div.find('a')['href'] if video_div else '' 84 | has_media = True if img_urls or video_url else False 85 | 86 | # update 'links': eliminate 'video_url' from 'links' for duplicate 87 | links = list(filter(lambda x: x != video_url, links)) 88 | 89 | # tweet actions numbers 90 | action_div = tweet_div.find('div', 'ProfileTweet-actionCountList') 91 | 92 | # --- likes 93 | likes = int(action_div.find( 94 
| 'span', 'ProfileTweet-action--favorite').find( 95 | 'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0') 96 | # --- RT 97 | retweets = int(action_div.find( 98 | 'span', 'ProfileTweet-action--retweet').find( 99 | 'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0') 100 | # --- replies 101 | replies = int(action_div.find( 102 | 'span', 'ProfileTweet-action--reply u-hiddenVisually').find( 103 | 'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0') 104 | is_replied = False if replies == 0 else True 105 | 106 | # detail of reply to others 107 | # - reply to others 108 | parent_tweet_id = tweet_div['data-conversation-id'] # parent tweet 109 | 110 | if tweet_id == parent_tweet_id: 111 | is_reply_to = False 112 | parent_tweet_id = '' 113 | reply_to_users = [] 114 | else: 115 | is_reply_to = True 116 | soup_reply_to_users = \ 117 | tweet_div.find('div', 'ReplyingToContextBelowAuthor') \ 118 | .find_all('a') 119 | reply_to_users = [{ 120 | 'screen_name': user.text.strip('@'), 121 | 'user_id': user['data-user-id'] 122 | } for user in soup_reply_to_users] 123 | 124 | return cls( 125 | screen_name, username, user_id, tweet_id, tweet_url, timestamp, 126 | timestamp_epochs, text, text_html, links, hashtags, has_media, 127 | img_urls, video_url, likes, retweets, replies, is_replied, 128 | is_reply_to, parent_tweet_id, reply_to_users 129 | ) 130 | 131 | @classmethod 132 | def from_html(cls, html): 133 | soup = BeautifulSoup(html, "lxml") 134 | tweets = soup.find_all('li', 'js-stream-item') 135 | if tweets: 136 | for tweet in tweets: 137 | try: 138 | yield cls.from_soup(tweet) 139 | except AttributeError: 140 | pass # Incomplete info? Discard! 141 | except TypeError: 142 | pass # Incomplete info? Discard! 143 | -------------------------------------------------------------------------------- /twitterscraper/user.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | 3 | 4 | class User: 5 | def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0, 6 | following=0, followers=0, likes=0, lists=0, is_verified=0): 7 | self.user = user 8 | self.full_name = full_name 9 | self.location = location 10 | self.blog = blog 11 | self.date_joined = date_joined 12 | self.id = id 13 | self.tweets = tweets 14 | self.following = following 15 | self.followers = followers 16 | self.likes = likes 17 | self.lists = lists 18 | self.is_verified = is_verified 19 | 20 | @classmethod 21 | def from_soup(self, tag_prof_header, tag_prof_nav): 22 | """ 23 | Returns the scraped user data from a twitter user page. 
24 |
25 |         :param tag_prof_header: captures the left hand part of user info
26 |         :param tag_prof_nav: captures the upper part of user info
27 |         :return: Returns a User object with captured data via beautifulsoup
28 |         """
29 |
30 |         self.user = tag_prof_header.find('a', {'class':'ProfileHeaderCard-nameLink u-textInheritColor js-nav'})['href'].strip("/")
31 |         self.full_name = tag_prof_header.find('a', {'class':'ProfileHeaderCard-nameLink u-textInheritColor js-nav'}).text
32 |
33 |         location = tag_prof_header.find('span', {'class':'ProfileHeaderCard-locationText u-dir'})
34 |         if location is None:
35 |             self.location = "None"
36 |         else:
37 |             self.location = location.text.strip()
38 |
39 |         blog = tag_prof_header.find('span', {'class':"ProfileHeaderCard-urlText u-dir"})
40 |         if blog is None:
41 |             self.blog = "None"
42 |         else:
43 |             self.blog = blog.text.strip()
44 |
45 |         date_joined = tag_prof_header.find('div', {'class':"ProfileHeaderCard-joinDate"}).find('span', {'class':'ProfileHeaderCard-joinDateText js-tooltip u-dir'})['title']
46 |         if date_joined is None:
47 |             self.date_joined = "Unknown"
48 |         else:
49 |             self.date_joined = date_joined.strip()
50 |
51 |         tag_verified = tag_prof_header.find('span', {'class': "ProfileHeaderCard-badges"})
52 |         if tag_verified is not None:
53 |             self.is_verified = 1
54 |
55 |         self.id = tag_prof_nav.find('div',{'class':'ProfileNav'})['data-user-id']
56 |         tweets = tag_prof_nav.find('span', {'class':"ProfileNav-value"})['data-count']
57 |         if tweets is None:
58 |             self.tweets = 0
59 |         else:
60 |             self.tweets = int(tweets)
61 |
62 |         following = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--following"}).\
63 |             find('span', {'class':"ProfileNav-value"})['data-count']
64 |         if following is None:
65 |             self.following = 0
66 |         else:
67 |             self.following = int(following)
68 |
69 |         followers = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--followers"}).\
70 |             find('span', {'class':"ProfileNav-value"})['data-count']
71 |         if followers is None:
72 |             self.followers = 0
73 |         else:
74 |             self.followers = int(followers)
75 |
76 |         likes = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--favorites"}).\
77 |             find('span', {'class':"ProfileNav-value"})['data-count']
78 |         if likes is None:
79 |             self.likes = 0
80 |         else:
81 |             self.likes = int(likes)
82 |
83 |         lists = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--lists"})
84 |         if lists is None:
85 |             self.lists = 0
86 |         elif lists.find('span', {'class':"ProfileNav-value"}) is None:
87 |             self.lists = 0
88 |         else:
89 |             lists = lists.find('span', {'class':"ProfileNav-value"}).text
90 |             self.lists = int(lists)
91 |         return self
92 |
93 |     @classmethod
94 |     def from_html(self, html):
95 |         soup = BeautifulSoup(html, "lxml")
96 |         user_profile_header = soup.find("div", {"class":'ProfileHeaderCard'})
97 |         user_profile_canopy = soup.find("div", {"class":'ProfileCanopy-nav'})
98 |         if user_profile_header and user_profile_canopy:
99 |             try:
100 |                 return self.from_soup(user_profile_header, user_profile_canopy)
101 |             except AttributeError:
102 |                 pass  # Incomplete info? Discard!
103 |             except TypeError:
104 |                 pass  # Incomplete info? Discard!
105 |
--------------------------------------------------------------------------------