├── .gitignore ├── Dockerfile ├── LICENSE ├── MANIFEST.in ├── README.rst ├── changelog.txt ├── examples ├── get_twitter_user_data.py └── get_twitter_user_data_parallel.py ├── requirements.txt ├── setup.py └── twitterscraper ├── __init__.py ├── main.py ├── query.py ├── tweet.py └── user.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | /dist/ 3 | /*.egg-info 4 | /build/ 5 | .idea/ 6 | venv/ -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.7-alpine 2 | RUN apk add --update --no-cache g++ gcc libxslt-dev 3 | COPY . /app 4 | WORKDIR /app 5 | RUN python setup.py install 6 | CMD ["twitterscraper"] 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016-2019 by Ahmet Taspinar (taspinar@gmail.com) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include requirements.txt 2 | include LICENSE.txt 3 | include README.rst 4 | include HISTORY.rst -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | |Downloads| |Downloads_month| |PyPI version| |GitHub contributors| 2 | 3 | .. |Downloads| image:: https://pepy.tech/badge/twitterscraper 4 | :target: https://pepy.tech/project/twitterscraper 5 | .. |Downloads_month| image:: https://pepy.tech/badge/twitterscraper/month 6 | :target: https://pepy.tech/project/twitterscraper/month 7 | .. |PyPI version| image:: https://badge.fury.io/py/twitterscraper.svg 8 | :target: https://badge.fury.io/py/twitterscraper 9 | .. |GitHub contributors| image:: https://img.shields.io/github/contributors/taspinar/twitterscraper.svg 10 | :target: https://github.com/taspinar/twitterscraper/graphs/contributors 11 | 12 | 13 | Backers 14 | ======== 15 | 16 | Thank you to all our backers! 🙏 [`Become a backer`_] 17 | 18 | Sponsors 19 | ======== 20 | 21 | Support this project by becoming a sponsor. Your logo will show up here 22 | with a link to your website. [`Become a sponsor`_] 23 | 24 | .. 
_Become a backer: https://opencollective.com/twitterscraper#backer
25 | .. _Become a sponsor: https://opencollective.com/twitterscraper#sponsor
26 |
27 |
28 | Synopsis
29 | ========
30 |
31 | A simple script to scrape Tweets using the Python package ``requests``
32 | to retrieve the content and ``Beautifulsoup4`` to parse the retrieved
33 | content.
34 |
35 | 1. Motivation
36 | =============
37 |
38 | Twitter has provided `REST
39 | APIs `__ which can be used by
40 | developers to access and read Twitter data. They have also provided a
41 | `Streaming API `__ which can
42 | be used to access Twitter data in real-time.
43 |
44 | Most of the software written to access Twitter data provides a library
45 | which functions as a wrapper around Twitter's Search and Streaming APIs
46 | and is therefore constrained by the limitations of those APIs.
47 |
48 | With Twitter's Search API you can only send 180 requests every 15
49 | minutes. With a maximum number of 100 tweets per request, you
50 | can mine 72,000 tweets per hour (4 x 180 x 100 = 72,000). By using
51 | TwitterScraper you are not limited by this number but by your internet
52 | speed/bandwidth and the number of instances of TwitterScraper you are
53 | willing to start.
54 |
55 | One of the bigger disadvantages of the Search API is that you can only
56 | access Tweets written in the **past 7 days**. This is a major bottleneck
57 | for anyone looking for older data. With TwitterScraper there is no such
58 | limitation.
59 |
60 | Per Tweet it scrapes the following information:
61 | + Tweet-id
62 | + Tweet-url
63 | + Tweet text
64 | + Tweet html
65 | + Links inside Tweet
66 | + Hashtags inside Tweet
67 | + Image URLs inside Tweet
68 | + Video URL inside Tweet
69 | + Tweet timestamp
70 | + Tweet Epoch timestamp
71 | + Tweet No. of likes
72 | + Tweet No. of replies
73 | + Tweet No. of retweets
74 | + Username
75 | + User Full Name / Screen Name
76 | + User ID
77 | + Tweet is a reply to
78 | + Tweet is replied to
79 | + List of users the Tweet is a reply to
80 | + Tweet ID of parent tweet
81 |
82 |
83 | In addition it can scrape for the following user information:
84 | + Date user joined
85 | + User location (if filled in)
86 | + User blog (if filled in)
87 | + User No. of tweets
88 | + User No. of following
89 | + User No. of followers
90 | + User No. of likes
91 | + User No. of lists
92 | + User is verified
93 |
94 |
95 | 2. Installation and Usage
96 | =========================
97 |
98 | To install **twitterscraper**:
99 |
100 | .. code:: python
101 |
102 |     (sudo) pip install twitterscraper
103 |
104 | or you can clone the repository and, from the folder containing setup.py, run:
105 |
106 | .. code:: python
107 |
108 |     python setup.py install
109 |
110 | If you prefer more isolation, you can build a Docker image:
111 |
112 | .. code:: python
113 |
114 |     docker build -t twitterscraper:build .
115 |
116 | and run your container with:
117 |
118 | .. code:: python
119 |
120 |
121 |     docker run --rm -it -v/:/app/data twitterscraper:build
122 |
123 | 2.2 The CLI
124 | -----------
125 |
126 | You can use the command line application to get your tweets stored to
127 | JSON right away. Twitterscraper takes several arguments (a combined example is given below this list):
128 |
129 | - ``-h`` or ``--help`` Prints out the help message and exits.
130 |
131 | - ``-l`` or ``--limit`` TwitterScraper stops scraping when *at least*
132 |   the number of tweets indicated with ``--limit`` is scraped. Since
133 |   tweets are retrieved in batches of 20, this will always be a multiple
134 |   of 20. Omit the limit to retrieve all tweets. You can at any time abort the
135 |   scraping by pressing Ctrl+C; the scraped tweets will be stored safely
136 |   in your JSON file.
137 |
138 | - ``--lang`` Retrieves tweets written in a specific language. Currently
139 |   30+ languages are supported. For a full list of the languages print
140 |   out the help message.
141 |
142 | - ``-bd`` or ``--begindate`` Set the date from which TwitterScraper
143 |   should start scraping for your query. Format is YYYY-MM-DD. The
144 |   default value is set to 2006-03-21. This does not work in combination with ``--user``.
145 |
146 | - ``-ed`` or ``--enddate`` Set the end date at which TwitterScraper should
147 |   stop scraping for your query. Format is YYYY-MM-DD. The
148 |   default value is set to today. This does not work in combination with ``--user``.
149 |
150 | - ``-u`` or ``--user`` Scrapes the tweets from that user's profile page.
151 |   This also includes all retweets by that user. See section 2.2.3 in the examples below
152 |   for more information.
153 |
154 | - ``--profiles``: In addition to the tweets, Twitterscraper will also scrape the profile
155 |   information of the users who have written these tweets. The results will be saved in the
156 |   file userprofiles_.
157 |
158 | - ``-p`` or ``--poolsize`` Set the number of parallel processes
159 |   TwitterScraper should initiate while scraping for your query. Default
160 |   value is set to 20. Depending on the computational power you have,
161 |   you can increase this number. It is advised to keep this number below
162 |   the number of days you are scraping. For example, if you are
163 |   scraping from 2017-01-10 to 2017-01-20, you can set this number to a
164 |   maximum of 10. If you are scraping from 2016-01-01 to 2016-12-31, you
165 |   can increase this number to a maximum of 150, if you have the
166 |   computational resources. Does not work in combination with ``--user``.
167 |
168 | - ``-o`` or ``--output`` Gives the name of the output file. If no
169 |   output filename is given, the default filename 'tweets.json' or 'tweets.csv'
170 |   will be used.
171 |
172 | - ``-c`` or ``--csv`` Write the result to a CSV file instead of a JSON file.
173 |
174 | - ``-d`` or ``--dump``: With this argument, the scraped tweets will be
175 |   printed to the screen instead of written to an output file. If you are using this
176 |   argument, the ``--output`` argument does not need to be used.
177 |
178 | - ``-ow`` or ``--overwrite``: With this argument, if the output file already exists
179 |   it will be overwritten. If this argument is not set (default), twitterscraper will
180 |   exit with the warning that the output file already exists.
181 |
182 | - ``-dp`` or ``--disableproxy``: With this argument, proxy servers are not used when scraping tweets or user profiles from Twitter.
183 |
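Taken together, a typical run that combines several of the arguments above could look like this (the search term, dates and filenames are only illustrative):

``twitterscraper "climate change" --lang en -bd 2019-01-01 -ed 2019-06-30 -p 10 --csv --overwrite -o climate_tweets.csv``

``twitterscraper realDonaldTrump --user --limit 500 --dump``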
184 | 2.2.1 Examples of simple queries
185 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
186 |
187 | Below are some examples of how twitterscraper can be used:
188 |
189 | ``twitterscraper Trump --limit 1000 --output=tweets.json``
190 |
191 | ``twitterscraper Trump -l 1000 -o tweets.json``
192 |
193 | ``twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``
194 |
195 |
196 |
197 | 2.2.2 Examples of advanced queries
198 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
199 |
200 | You can use any advanced query Twitter supports. An advanced query
201 | should be placed within quotes, so that twitterscraper can recognize it
202 | as one single query.
203 |
204 | Here are some examples:
205 |
206 | - search for the occurrence of 'Bitcoin' or 'BTC':
207 |   ``twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000``
208 | - search for the occurrence of 'Bitcoin' and 'BTC':
209 |   ``twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000``
210 | - search for tweets from a specific user:
211 |   ``twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000``
212 | - search for tweets to a specific user:
213 |   ``twitterscraper "Blockchain to:VitalikButerin" -o blockchain_tweets.json -l 1000``
214 | - search for tweets written from a location:
215 |   ``twitterscraper "Blockchain near:Seattle within:15mi" -o blockchain_tweets.json -l 1000``
216 |
217 | You can construct an advanced query on `Twitter Advanced Search `__ or use one of the operators shown on `this page `__.
218 | Also see `Twitter's Standard operators `__.
219 |
220 |
221 |
222 | 2.2.3 Examples of scraping user pages
223 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
224 |
225 | You can also scrape all tweets written or retweeted by a specific user.
226 | This can be done by adding the ``-u / --user`` argument.
227 | If this argument is used, the search term should be equal to the username.
228 |
229 | Here is an example of scraping a specific user:
230 |
231 | ``twitterscraper realDonaldTrump --user -o tweets_username.json``
232 |
233 | This does not work in combination with ``-p``, ``-bd``, or ``-ed``.
234 |
235 | The main difference with the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes
236 | all tweets from a profile page (including retweets).
237 | The example in 2.2.2 scrapes the results from the search page (excluding retweets).
238 |
239 |
240 | 2.3 From within Python
241 | ----------------------
242 |
243 | You can easily use TwitterScraper from within Python (a more complete example follows section 2.3.1 below):
244 |
245 | ::
246 |
247 |     from twitterscraper import query_tweets
248 |
249 |     if __name__ == '__main__':
250 |         list_of_tweets = query_tweets("Trump OR Clinton", 10)
251 |
252 |         #print the retrieved tweets to the screen:
253 |         for tweet in query_tweets("Trump OR Clinton", 10):
254 |             print(tweet)
255 |
256 |         #Or save the retrieved tweets to file:
257 |         file = open("output.txt", "w")
258 |         for tweet in query_tweets("Trump OR Clinton", 10):
259 |             file.write(str(tweet.text.encode('utf-8')))
260 |         file.close()
261 |
262 | 2.3.1 Examples of Python Queries
263 | --------------------------------
264 |
265 | - Query tweets from a given URL:
266 |   Parameters:
267 |    - query: The query search parameter of url
268 |    - lang: Language of queried url
269 |    - pos: Parameter passed for where to start looking in url
270 |    - retry: Number of times to retry if error
271 |
272 | .. code:: python
273 |
274 |     query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60)
275 |
276 | - Query all tweets that match a query:
277 |   Parameters:
278 |    - query: The query search parameter
279 |    - limit: Number of tweets returned
280 |    - begindate: Start date of query
281 |    - enddate: End date of query
282 |    - poolsize: Number of parallel processes to use
283 |    - lang: Language of query
284 |
285 | .. code:: python
286 |
287 |     query_tweets('query', limit=None, begindate=dt.date.today(), enddate=dt.date.today(), poolsize=20, lang='')
288 |
289 | - Query tweets from a specific user:
290 |   Parameters:
291 |    - user: Twitter username
292 |    - limit: Number of tweets returned
293 |
294 | .. code:: python
295 |
296 |     query_tweets_from_user(user, limit=None)
297 |
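Putting these together, below is a more complete sketch of a Python script that scrapes a search query as well as a user's profile page and saves a few fields per tweet. It only uses functions shipped with twitterscraper (``query_tweets`` and ``query_tweets_from_user``); the query, dates, username and filename are placeholders you should adapt to your own needs.

.. code:: python

    import datetime as dt
    import json

    from twitterscraper import query_tweets, query_tweets_from_user

    if __name__ == '__main__':
        # Scrape a search query for a fixed date range and language.
        tweets = query_tweets('Bitcoin OR BTC', limit=200,
                              begindate=dt.date(2019, 1, 1),
                              enddate=dt.date(2019, 6, 30),
                              poolsize=10, lang='en')

        # Scrape the profile page of a single user (this includes retweets).
        user_tweets = query_tweets_from_user('VitalikButerin', limit=100)

        # Keep a few fields per tweet and store them as UTF-8 encoded JSON.
        records = [{'user': t.username, 'timestamp': t.timestamp.isoformat(),
                    'likes': t.likes, 'retweets': t.retweets, 'text': t.text}
                   for t in tweets + user_tweets]
        with open('my_tweets.json', 'w', encoding='utf-8') as f:
            json.dump(records, f, ensure_ascii=False)

As in the example above, the ``if __name__ == '__main__'`` guard matters here because ``query_tweets`` starts parallel processes while scraping.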
298 | 2.4 Scraping for retweets
299 | -------------------------
300 |
301 | A regular search within Twitter will not show you any retweets.
302 | The output of twitterscraper therefore does not contain any retweets.
303 |
304 | To give an example: If user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet,
305 | a search for ``#trump2020`` will only show the original tweet.
306 |
307 | The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the ``-u / --user`` argument.
308 |
309 |
310 | 2.5 Scraping for User Profile information
311 | -----------------------------------------
312 | By adding the argument ``--profiles``, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets.
313 | The results will be saved in the file "userprofiles_".
314 |
315 | Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :)
316 | It is also possible to scrape for profile information without scraping for tweets.
317 | Examples of this can be found in the examples folder.
318 |
319 |
320 | 3. Output
321 | =========
322 |
323 | All of the retrieved Tweets are stored in the indicated output file. The
324 | contents of the output file will look like:
325 |
326 | ::
327 |
328 |     [{"fullname": "Rupert Meehl", "id": "892397793071050752", "likes": "1", "replies": "0", "retweets": "0", "text": "Latest: Trump now at lowest Approval and highest Disapproval ratings yet. Oh, we're winning bigly here ...\n\nhttps://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "Rupert_Meehl"}, {"fullname": "Barry Shapiro", "id": "892397794375327744", "likes": "0", "replies": "0", "retweets": "0", "text": "A former GOP Rep quoted this line, which pretty much sums up Donald Trump. https://twitter.com/davidfrum/status/863017301595107329\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "barryshap"}, (...)
329 |     ]
330 |
331 | 3.1 Opening the output file
332 | ---------------------------
333 |
334 | In order to correctly handle all possible characters in the tweets
335 | (think of Japanese or Arabic characters), the output is saved as utf-8
336 | encoded bytes. That is why you could see text like
337 | "\u30b1 \u30f3 \u3055 \u307e \u30fe ..." in the output file.
338 |
339 | What you should do is open the file with the proper encoding:
340 |
341 | .. figure:: https://user-images.githubusercontent.com/4409108/30702318-f05bc196-9eec-11e7-8234-a07aabec294f.PNG
342 |
343 |     Example of output with Japanese characters
344 |
345 | 3.1.2 Opening into a pandas dataframe
346 | -------------------------------------
347 |
348 | After the file has been opened, it can easily be converted into a ``pandas`` DataFrame:
349 |
350 | ::
351 |
352 |     import pandas as pd
353 |     df = pd.read_json('tweets.json', encoding='utf-8')
354 |
--------------------------------------------------------------------------------
/changelog.txt:
--------------------------------------------------------------------------------
1 | # twitterscraper changelog
2 |
3 | # 1.6.1 ( 2020-07-28 )
4 | ## Fixed
5 | - Issue 330: Added KeyError to the try / except so that it no longer breaks when json_resp does not have this key.
6 | 7 | # 1.6.0 ( 2020-07-22 ) 8 | ## Added 9 | - PR234: Adds command line argument -dp or --disableproxy to disable to use of proxy when querying. 10 | ## Improved 11 | - PR261: Improve logging; there is no ts_logger file, logger is initiated in main.py and query.py, loglevel is set via CLI. 12 | 13 | # 1.5.0 ( 2020-07-22 ) 14 | ## Fixed 15 | - PR304: Fixed query.py by adding 'X-Requested-With': 'XMLHttpRequest' to header value. 16 | - PR253: Fixed Docker build 17 | ## Added 18 | - PR313: Added example to README (section 2.3.1). 19 | - PR277: Support emojis by adding the alt text of images to the tweet text. 20 | 21 | # 1.4.0 ( 2019-11-03 ) 22 | ## Fixed 23 | - PR228: Fixed Typo in Readme 24 | - PR224: Force CSV quoting for all non-numeric values 25 | ## Added 26 | - PR213: Added Dockerfile for Docker support 27 | - PR220: Passed timeout value of 60s from method to requests.get() 28 | - PR231: Added a lot of tweet attributes to the output, regarding links, media and replies. 29 | - PR233: Added support for searching for the '&' sign. 30 | ## Improved 31 | - PR223: Pretty printing the output which is dumped 32 | 33 | # 1.3.1 ( 2019-09-07 ) 34 | ## Fixed 35 | - Change two uses of f-strings to .format() since f-strings only work well with Python 3.6+ 36 | 37 | # 1.3.0 ( 2019-09-07 ) 38 | ## Added 39 | - Added the use of proxies while making an request. 40 | - PR #204: Added a max timeout to twitterscraper requests which is set to 60s by default. 41 | 42 | # 1.2.1 ( 2019-08-06 ) 43 | ### Fixed 44 | - PR #208: Fixed a type in a print statement which was breaking down twitterscraper 45 | - Remove the use of fake_useragent library 46 | 47 | # 1.2.0 ( 2019-06-22 ) 48 | ### Added 49 | - PR #186: adds the fields is_retweet, retweeter related information, and timestamp_epochs to the output. 50 | - PR #184: use fake_useragent for generation of random user agent headers. 51 | - Additionally scraper for 'is_verified' when scraping for user profile pages. 52 | 53 | # 1.1.0 ( 2019-06-15 ) 54 | ### Added 55 | - PR #176: Using billiard library instead of multiprocessing to add the ability to use this library with Celery. 56 | 57 | # 1.0.1 ( 2019-06-15 ) 58 | ### Fixed 59 | - PR #191: wrong argument was used in the method query_tweets_from_user() 60 | - CSV output file has as default ";" as a separator. 61 | - PR #173: Some small improvements on the profile page scraping. 62 | ### Added 63 | - Command line argument -ow / --overwrite to indicate if an existing output file should be overwritten. 64 | 65 | # 1.0.0 ( 2019-02-04 ) 66 | ### Added 67 | - PR #159: scrapes user profile pages for additional information. 68 | ### Fixed: 69 | - Moved example scripts demonstrating use of get_user_info() functionality to examples folder 70 | - removed screenshot demonstrating get_user_info() works 71 | - Added command line argument to main.py which calls get_user_info() for all users in list of scraped tweets. 72 | 73 | # 0.9.3 ( 2018-11-04 ) 74 | ### Fixed 75 | - PR #143: cancels query if end-date is earlier than begin-date. 76 | - PR #151: returned json_resp['min_position] is parsed in order to quote special characters. 77 | - PR #153: cast Tweets attributes to proper data types (int instead of str) 78 | - Use codecs.open() to write to file. Should fix issues 144 and 147. 79 | 80 | # 0.9.0 ( 2018-07-18 ) 81 | ### Added 82 | - Added -u / --user command line argument which can be used to scrape all 83 | tweets from an users profile page. 84 | 85 | ## 0.8.1 ( 2018-07-18 ) 86 | - saving .csv files as an utf-8 encoded file. 
This fixes https://github.com/taspinar/twitterscraper/issues/138 87 | 88 | ## 0.8.0 ( 2018-07-17 ) 89 | ### Fixed 90 | - remove two headers which caused bad fetching results https://github.com/taspinar/twitterscraper/issues/126#issuecomment-405132147 91 | - fix python2 logger bug https://github.com/taspinar/twitterscraper/issues/134 https://github.com/taspinar/twitterscraper/issues/132 https://github.com/taspinar/twitterscraper/issues/127 92 | 93 | ### Improved 94 | - Use a generator to get tweets, but convert to list in `query_tweets_once` 95 | - this is useful for low memory applications, like massively parallelizing twitter scraping through AWS Lambda (128MB RAM) 96 | - use single quotes for all strings (it was inconsistent prior) 97 | - pep8 compliance on L28 98 | 99 | ### Removed 100 | - remove `eliminate_duplicates` dead code 101 | 102 | # 0.7.2 ( 2018-07-09 ) 103 | ### Fixed 104 | - twitterscraper.logging is imported as logger instead of logging in order to 105 | avoid a module name clash with Python2's logging module. 106 | 107 | # 0.7.1 ( 2018-06-12 ) 108 | ### Improved 109 | - Give access to logger for scripts which import this module. Create the module, 110 | `logging.py`, which contains the logger used by twitterscraper. 111 | 112 | ### Removed 113 | - fake_useragent is removed as a dependency, since it has been giving 114 | user-agent headers which keep being blocked by Twitter. 115 | 116 | ## 0.7.0 ( 2018-05-06 ) 117 | ### Fixed 118 | - By using linspace() instead of range() to divide the number of days into 119 | the number of parallel processes, edge cases ( p = 1 ) now also work fine. 120 | This fixes https://github.com/taspinar/twitterscraper/issues/108. 121 | 122 | ### Improved 123 | - The default value of begindate is set to 2006-03-21. The previous value (2017-01-01) 124 | was chosen arbitrarily and leaded to questions why not all tweets were retrieved. 125 | This fixes https://github.com/taspinar/twitterscraper/issues/88. 126 | 127 | ### Added 128 | - Users can now save the tweets in a csv-format, with the arguments "-c" or "--csv" 129 | 130 | ## 0.6.2 ( 2018-03-21 ) 131 | ### Fixed 132 | - Errors occuring during the serialization of a non-html response (everything after 1st request), 133 | No longer crashes the program but is catched with a try / except. 134 | - Fixes https://github.com/taspinar/twitterscraper/issues/93 135 | 136 | - The '@' character in an username is now removed by the ".strip('\@')" method instead of "[1:]". 137 | - This fixes issue https://github.com/taspinar/twitterscraper/issues/105 138 | 139 | ## 0.6.1 ( 2018-03-17 ) 140 | ### Improved 141 | - The way the number of days are divided over the number of parallel processes is improved. 142 | - The maximum number of parallel processes is limited to the max no of days. 143 | - Fixes https://github.com/taspinar/twitterscraper/issues/101 144 | 145 | ## 0.6.0 ( 2018-02-17 ) 146 | ### Fixed 147 | - PR #89: closed pools to prevent zombie processes. 148 | 149 | 150 | ## 0.5.1 ( 2018-02-17 ) 151 | ### Fixed 152 | - Fixed MaxRecursionError crashes which was introduced with version 0.5.0 153 | 154 | ## 0.5.0 ( 2018-01-11 ) 155 | ### Added 156 | - Added the html code of a tweet message to the Tweet class as one of its attributes 157 | 158 | ## 0.4.2 ( 2018-01-09 ) 159 | ### Fixed 160 | - Fixed backward compatability of the new --lang parameter by placing it at the end of all arguments. 
161 | 162 | ## 0.4.1 ( 2018-01-07 ) 163 | ### Fixed 164 | - Fixed --lang functionality by passing the lang parameter from its CL argument form to the generater url. 165 | 166 | ## 0.4 ( 2017-12-19 ) 167 | ----------- 168 | ### Added 169 | - Added "-bd / --begindate" command line arguments to set the begin date of the query 170 | - Added "-ed / --enddate" command line arguments to set the end date of the query. 171 | - Added "-p / --poolsize" command line arguments which can change the number of parallel processes. 172 | Default number of parallel processes is set to 20. 173 | 174 | ### Improved 175 | - Outputfile is only created if tweets are actually retrieved. 176 | 177 | ### Removed 178 | - The ´query_all_tweets' method in the Query module is removed. Since twitterscraper is starting parallel processes by default, 179 | this method is no longer necessary. 180 | 181 | ### Changed 182 | - The 'query_tweets' method now takes as arguments query, limit, begindate, enddate, poolsize. 183 | - The 'query_tweets_once' no longer has the argument 'num_tweets' 184 | - The default value of the 'retry' argument of the 'query_single_page' method has been increased from 3 to 10. 185 | - The ´query_tweets_once' method does not log to screen at every single scrape, but at the end of a batch. 186 | 187 | 188 | ## 0.3.3 ( 2017-12-06 ) 189 | ----------- 190 | ### Added 191 | -PR #61: Adding --lang functionality which can retrieve tweets written in a specific language. 192 | -PR #62: Tweet class now also contains the tweet url. This closes https://github.com/taspinar/twitterscraper/issues/59 193 | 194 | 195 | ## 0.3.2 ( 2017-11-12 ) 196 | ----------- 197 | ### Improved 198 | -PR #55: Adding --dump functionality which dumps the scraped tweets to screen, instead of an outputfile. 199 | 200 | 201 | ## 0.3.1 ( 2017-11-05 ) 202 | ----------- 203 | ### Improved 204 | -PR #49: scraping of replies, retweets and likes is improved. 205 | 206 | 207 | ## 0.3.0 ( 2017-08-01 ) 208 | ----------- 209 | ### Added 210 | - Tweet class now also includes 'replies', 'retweets' and 'likes' 211 | 212 | 213 | ## 0.2.7 ( 2017-01-10 ) 214 | ----------- 215 | ### Improved 216 | - PR #26: use ``requests`` library for HTTP requests. Makes the use of urllib2 / urllib redundant. 
217 | ### Added: 218 | - changelog.txt for GitHub 219 | - HISTORY.rst for PyPi 220 | - README.rst for PyPi 221 | 222 | ## 0.2.6 ( 2017-01-02 ) 223 | ----------- 224 | ### Improved 225 | - PR #25: convert date retrieved from timestamp to day precision 226 | -------------------------------------------------------------------------------- /examples/get_twitter_user_data.py: -------------------------------------------------------------------------------- 1 | from twitterscraper.query import query_user_info 2 | import pandas as pd 3 | from multiprocessing import Pool 4 | import time 5 | from IPython.display import display 6 | 7 | 8 | global twitter_user_info 9 | twitter_user_info=[] 10 | 11 | 12 | def get_user_info(twitter_user): 13 | """ 14 | An example of using the query_user_info method 15 | :param twitter_user: the twitter user to capture user data 16 | :return: twitter_user_data: returns a dictionary of twitter user data 17 | """ 18 | user_info = query_user_info(user= twitter_user) 19 | twitter_user_data = {} 20 | twitter_user_data["user"] = user_info.user 21 | twitter_user_data["fullname"] = user_info.full_name 22 | twitter_user_data["location"] = user_info.location 23 | twitter_user_data["blog"] = user_info.blog 24 | twitter_user_data["date_joined"] = user_info.date_joined 25 | twitter_user_data["id"] = user_info.id 26 | twitter_user_data["num_tweets"] = user_info.tweets 27 | twitter_user_data["following"] = user_info.following 28 | twitter_user_data["followers"] = user_info.followers 29 | twitter_user_data["likes"] = user_info.likes 30 | twitter_user_data["lists"] = user_info.lists 31 | 32 | return twitter_user_data 33 | 34 | 35 | def main(): 36 | start = time.time() 37 | users = ['Carlos_F_Enguix', 'mmtung', 'dremio', 'MongoDB', 'JenWike', 'timberners_lee','ataspinar2', 'realDonaldTrump', 38 | 'BarackObama', 'elonmusk', 'BillGates', 'BillClinton','katyperry','KimKardashian'] 39 | 40 | pool = Pool(8) 41 | for user in pool.map(get_user_info,users): 42 | twitter_user_info.append(user) 43 | 44 | cols=['id','fullname','date_joined','location','blog', 'num_tweets','following','followers','likes','lists'] 45 | data_frame = pd.DataFrame(twitter_user_info, index=users, columns=cols) 46 | data_frame.index.name = "Users" 47 | data_frame.sort_values(by="followers", ascending=False, inplace=True, kind='quicksort', na_position='last') 48 | elapsed = time.time() - start 49 | print(f"Elapsed time: {elapsed}") 50 | display(data_frame) 51 | 52 | 53 | if __name__ == '__main__': 54 | main() -------------------------------------------------------------------------------- /examples/get_twitter_user_data_parallel.py: -------------------------------------------------------------------------------- 1 | from twitterscraper.query import query_user_info 2 | import pandas as pd 3 | from multiprocessing import Pool 4 | from IPython.display import display 5 | import sys 6 | 7 | global twitter_user_info 8 | twitter_user_info = [] 9 | 10 | 11 | def get_user_info(twitter_user): 12 | """ 13 | An example of using the query_user_info method 14 | :param twitter_user: the twitter user to capture user data 15 | :return: twitter_user_data: returns a dictionary of twitter user data 16 | """ 17 | user_info = query_user_info(user=twitter_user) 18 | twitter_user_data = {} 19 | twitter_user_data["user"] = user_info.user 20 | twitter_user_data["fullname"] = user_info.full_name 21 | twitter_user_data["location"] = user_info.location 22 | twitter_user_data["blog"] = user_info.blog 23 | twitter_user_data["date_joined"] = 
user_info.date_joined 24 | twitter_user_data["id"] = user_info.id 25 | twitter_user_data["num_tweets"] = user_info.tweets 26 | twitter_user_data["following"] = user_info.following 27 | twitter_user_data["followers"] = user_info.followers 28 | twitter_user_data["likes"] = user_info.likes 29 | twitter_user_data["lists"] = user_info.lists 30 | 31 | return twitter_user_data 32 | 33 | 34 | def main(args): 35 | users = [] 36 | 37 | for arg in args: 38 | users.append(arg) 39 | 40 | pool_size = len(users) 41 | if pool_size < 8: 42 | pool = Pool(pool_size) 43 | else: 44 | pool = Pool(8) 45 | 46 | for user in pool.map(get_user_info, users): 47 | twitter_user_info.append(user) 48 | 49 | cols = ['id', 'fullname', 'date_joined', 'location', 'blog', 'num_tweets', 'following', 'followers', 'likes', 50 | 'lists'] 51 | data_frame = pd.DataFrame(twitter_user_info, index=users, columns=cols) 52 | data_frame.index.name = "Users" 53 | data_frame.sort_values(by="followers", ascending=False, inplace=True, kind='quicksort', na_position='last') 54 | display(data_frame) 55 | 56 | 57 | if __name__ == '__main__': 58 | main(sys.argv[1:]) 59 | 60 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | coala-utils~=0.5.0 2 | bs4 3 | lxml 4 | requests 5 | billiard 6 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from setuptools import setup, find_packages 4 | with open('requirements.txt') as requirements: 5 | required = requirements.read().splitlines() 6 | 7 | setup( 8 | name='twitterscraper', 9 | version='1.6.1', 10 | description='Tool for scraping Tweets', 11 | url='https://github.com/taspinar/twitterscraper', 12 | author=['Ahmet Taspinar', 'Lasse Schuirmann'], 13 | author_email='taspinar@gmail.com', 14 | license='MIT', 15 | packages=find_packages(exclude=["build.*", "tests", "tests.*"]), 16 | install_requires=required, 17 | entry_points={ 18 | "console_scripts": [ 19 | "twitterscraper = twitterscraper.main:main" 20 | ] 21 | }) 22 | -------------------------------------------------------------------------------- /twitterscraper/__init__.py: -------------------------------------------------------------------------------- 1 | # TwitterScraper 2 | # Copyright 2016-2020 Ahmet Taspinar 3 | # See LICENSE for details. 4 | """ 5 | Twitter Scraper tool 6 | """ 7 | 8 | __version__ = '1.6.1' 9 | __author__ = 'Ahmet Taspinar' 10 | __license__ = 'MIT' 11 | 12 | 13 | from twitterscraper.query import query_tweets 14 | from twitterscraper.query import query_tweets_from_user 15 | from twitterscraper.query import query_user_info 16 | from twitterscraper.tweet import Tweet 17 | from twitterscraper.user import User 18 | -------------------------------------------------------------------------------- /twitterscraper/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a command line application that allows you to scrape twitter! 
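Example: twitterscraper Trump --limit 1000 --output=tweets.json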
3 | """ 4 | import argparse 5 | import collections 6 | import csv 7 | import datetime as dt 8 | import json 9 | import logging 10 | from os.path import isfile 11 | from pprint import pprint 12 | 13 | from twitterscraper.query import (query_tweets, query_tweets_from_user, 14 | query_user_info) 15 | 16 | logger = logging.getLogger('twitterscraper') 17 | 18 | 19 | class JSONEncoder(json.JSONEncoder): 20 | def default(self, obj): 21 | if hasattr(obj, '__json__'): 22 | return obj.__json__() 23 | elif isinstance(obj, collections.Iterable): 24 | return list(obj) 25 | elif isinstance(obj, dt.datetime): 26 | return obj.isoformat() 27 | elif hasattr(obj, '__getitem__') and hasattr(obj, 'keys'): 28 | return dict(obj) 29 | elif hasattr(obj, '__dict__'): 30 | return {member: getattr(obj, member) 31 | for member in dir(obj) 32 | if not member.startswith('_') and 33 | not hasattr(getattr(obj, member), '__call__')} 34 | 35 | return json.JSONEncoder.default(self, obj) 36 | 37 | def valid_date(s): 38 | try: 39 | return dt.datetime.strptime(s, "%Y-%m-%d").date() 40 | except ValueError: 41 | msg = "Not a valid date: '{0}'.".format(s) 42 | raise argparse.ArgumentTypeError(msg) 43 | 44 | def valid_loglevel(level): 45 | try: 46 | return logging._checkLevel(level) 47 | except (ValueError, TypeError) as ex: 48 | raise argparse.ArgumentTypeError(ex) 49 | 50 | def main(): 51 | try: 52 | parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter, 53 | description=__doc__ 54 | ) 55 | 56 | parser.add_argument("query", type=str, help="Advanced twitter query") 57 | parser.add_argument("-o", "--output", type=str, default="tweets.json", 58 | help="Path to a JSON file to store the gathered " 59 | "tweets to.") 60 | parser.add_argument("-l", "--limit", type=int, default=None, 61 | help="Number of minimum tweets to gather.") 62 | parser.add_argument("-a", "--all", action='store_true', 63 | help="Set this flag if you want to get all tweets " 64 | "in the history of twitter. Begindate is set to 2006-03-01." 65 | "This may take a while. You can increase the number of parallel" 66 | "processes depending on the computational power you have.") 67 | parser.add_argument("-c", "--csv", action='store_true', 68 | help="Set this flag if you want to save the results to a CSV format.") 69 | parser.add_argument("-u", "--user", action='store_true', 70 | help="Set this flag to if you want to scrape tweets from a specific user" 71 | "The query should then consist of the profilename you want to scrape without @") 72 | parser.add_argument("--profiles", action='store_true', 73 | help="Set this flag to if you want to scrape profile info of all the users where you" 74 | "have previously scraped from. After all of the tweets have been scraped it will start" 75 | "a new process of scraping profile pages.") 76 | parser.add_argument("--lang", type=str, default=None, 77 | help="Set this flag if you want to query tweets in \na specific language. 
You can choose from:\n" 78 | "en (English)\nar (Arabic)\nbn (Bengali)\n" 79 | "cs (Czech)\nda (Danish)\nde (German)\nel (Greek)\nes (Spanish)\n" 80 | "fa (Persian)\nfi (Finnish)\nfil (Filipino)\nfr (French)\n" 81 | "he (Hebrew)\nhi (Hindi)\nhu (Hungarian)\n" 82 | "id (Indonesian)\nit (Italian)\nja (Japanese)\n" 83 | "ko (Korean)\nmsa (Malay)\nnl (Dutch)\n" 84 | "no (Norwegian)\npl (Polish)\npt (Portuguese)\n" 85 | "ro (Romanian)\nru (Russian)\nsv (Swedish)\n" 86 | "th (Thai)\ntr (Turkish)\nuk (Ukranian)\n" 87 | "ur (Urdu)\nvi (Vietnamese)\n" 88 | "zh-cn (Chinese Simplified)\n" 89 | "zh-tw (Chinese Traditional)" 90 | ) 91 | parser.add_argument("-d", "--dump", action="store_true", 92 | help="Set this flag if you want to dump the tweets \nto the console rather than outputting to a file") 93 | parser.add_argument("-ow", "--overwrite", action="store_true", 94 | help="Set this flag if you want to overwrite the existing output file.") 95 | parser.add_argument("-bd", "--begindate", type=valid_date, default="2006-03-21", 96 | help="Scrape for tweets starting from this date. Format YYYY-MM-DD. \nDefault value is 2006-03-21", metavar='\b') 97 | parser.add_argument("-ed", "--enddate", type=valid_date, default=dt.date.today(), 98 | help="Scrape for tweets until this date. Format YYYY-MM-DD. \nDefault value is the date of today.", metavar='\b') 99 | parser.add_argument("-p", "--poolsize", type=int, default=20, help="Specify the number of parallel process you want to run. \n" 100 | "Default value is set to 20. \nYou can change this number if you have more computing power available. \n" 101 | "Set to 1 if you dont want to run any parallel processes.", metavar='\b') 102 | parser.add_argument("--loglevel", type=valid_loglevel, default=logging.INFO, help="Specify the level for logging. \n" 103 | "Must be a valid value from https://docs.python.org/2/library/logging.html#logging-levels. \n" 104 | "Default log level is set to INFO.") 105 | parser.add_argument("-dp", "--disableproxy", action="store_true", default=False, help="Set this flag if you want to disable use of proxy servers when scrapping tweets and user profiles. \n") 106 | args = parser.parse_args() 107 | 108 | logging.basicConfig() 109 | logger.setLevel(args.loglevel) 110 | 111 | if isfile(args.output) and not args.dump and not args.overwrite: 112 | logger.error("Output file already exists! 
Aborting.") 113 | exit(-1) 114 | 115 | if args.all: 116 | args.begindate = dt.date(2006,3,1) 117 | 118 | if args.user: 119 | tweets = query_tweets_from_user(user = args.query, limit = args.limit, use_proxy = not args.disableproxy) 120 | else: 121 | tweets = query_tweets(query = args.query, limit = args.limit, 122 | begindate = args.begindate, enddate = args.enddate, 123 | poolsize = args.poolsize, lang = args.lang, use_proxy = not args.disableproxy) 124 | 125 | if args.dump: 126 | pprint([tweet.__dict__ for tweet in tweets]) 127 | else: 128 | if tweets: 129 | with open(args.output, "w", encoding="utf-8") as output: 130 | if args.csv: 131 | f = csv.writer(output, delimiter=";", quoting=csv.QUOTE_NONNUMERIC) 132 | f.writerow([ 133 | "screen_name", "username", "user_id", "tweet_id", 134 | "tweet_url", "timestamp", "timestamp_epochs", 135 | "text", "text_html", "links", "hashtags", 136 | "has_media", "img_urls", "video_url", "likes", 137 | "retweets", "replies", "is_replied", "is_reply_to", 138 | "parent_tweet_id", "reply_to_users" 139 | ]) 140 | for t in tweets: 141 | f.writerow([ 142 | t.screen_name, t.username, t.user_id, 143 | t.tweet_id, t.tweet_url, t.timestamp, 144 | t.timestamp_epochs, t.text, t.text_html, 145 | t.links, t.hashtags, t.has_media, t.img_urls, 146 | t.video_url, t.likes, t.retweets, t.replies, 147 | t.is_replied, t.is_reply_to, t.parent_tweet_id, 148 | t.reply_to_users 149 | ]) 150 | else: 151 | json.dump(tweets, output, cls=JSONEncoder) 152 | if args.profiles and tweets: 153 | list_users = list(set([tweet.username for tweet in tweets])) 154 | list_users_info = [query_user_info(elem, not args.disableproxy) for elem in list_users] 155 | filename = 'userprofiles_' + args.output 156 | with open(filename, "w", encoding="utf-8") as output: 157 | json.dump(list_users_info, output, cls=JSONEncoder) 158 | except KeyboardInterrupt: 159 | logger.info("Program interrupted by user. 
Quitting...") 160 | -------------------------------------------------------------------------------- /twitterscraper/query.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import datetime as dt 4 | import json 5 | import logging 6 | import random 7 | import sys 8 | import urllib 9 | from functools import partial 10 | from itertools import cycle 11 | 12 | import requests 13 | from billiard.pool import Pool 14 | from bs4 import BeautifulSoup 15 | 16 | from twitterscraper.tweet import Tweet 17 | from twitterscraper.user import User 18 | 19 | logger = logging.getLogger('twitterscraper') 20 | 21 | #from fake_useragent import UserAgent 22 | #ua = UserAgent() 23 | #HEADER = {'User-Agent': ua.random} 24 | HEADERS_LIST = [ 25 | 'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13', 26 | 'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko', 27 | 'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201', 28 | 'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16', 29 | 'Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre' 30 | ] 31 | 32 | HEADER = {'User-Agent': random.choice(HEADERS_LIST), 'X-Requested-With': 'XMLHttpRequest'} 33 | logger.info(HEADER) 34 | 35 | INIT_URL = 'https://twitter.com/search?f=tweets&vertical=default&q={q}&l={lang}' 36 | RELOAD_URL = 'https://twitter.com/i/search/timeline?f=tweets&vertical=' \ 37 | 'default&include_available_features=1&include_entities=1&' \ 38 | 'reset_error_state=false&src=typd&max_position={pos}&q={q}&l={lang}' 39 | INIT_URL_USER = 'https://twitter.com/{u}' 40 | RELOAD_URL_USER = 'https://twitter.com/i/profiles/show/{u}/timeline/tweets?' \ 41 | 'include_available_features=1&include_entities=1&' \ 42 | 'max_position={pos}&reset_error_state=false' 43 | PROXY_URL = 'https://free-proxy-list.net/' 44 | 45 | def get_proxies(): 46 | response = requests.get(PROXY_URL) 47 | soup = BeautifulSoup(response.text, 'lxml') 48 | table = soup.find('table',id='proxylisttable') 49 | list_tr = table.find_all('tr') 50 | list_td = [elem.find_all('td') for elem in list_tr] 51 | list_td = list(filter(None, list_td)) 52 | list_ip = [elem[0].text for elem in list_td] 53 | list_ports = [elem[1].text for elem in list_td] 54 | list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))] 55 | return list_proxies 56 | 57 | def get_query_url(query, lang, pos, from_user = False): 58 | if from_user: 59 | if pos is None: 60 | return INIT_URL_USER.format(u=query) 61 | else: 62 | return RELOAD_URL_USER.format(u=query, pos=pos) 63 | if pos is None: 64 | return INIT_URL.format(q=query, lang=lang) 65 | else: 66 | return RELOAD_URL.format(q=query, pos=pos, lang=lang) 67 | 68 | def linspace(start, stop, n): 69 | if n == 1: 70 | yield stop 71 | return 72 | h = (stop - start) / (n - 1) 73 | for i in range(n): 74 | yield start + h * i 75 | 76 | proxies = get_proxies() 77 | proxy_pool = cycle(proxies) 78 | 79 | def query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60, use_proxy=True): 80 | """ 81 | Returns tweets from the given URL. 82 | 83 | :param query: The query parameter of the query url 84 | :param lang: The language parameter of the query url 85 | :param pos: The query url parameter that determines where to start looking 86 | :param retry: Number of retries if something goes wrong. 
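    :param from_user: When True, the URL of a user's timeline is queried instead of the search timeline.
    :param timeout: Maximum number of seconds to wait for the HTTP response.
    :param use_proxy: When True, the request is routed through a proxy taken from free-proxy-list.net.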
87 | :return: The list of tweets, the pos argument for getting the next page. 88 | """ 89 | url = get_query_url(query, lang, pos, from_user) 90 | logger.info('Scraping tweets from {}'.format(url)) 91 | 92 | try: 93 | if use_proxy: 94 | proxy = next(proxy_pool) 95 | logger.info('Using proxy {}'.format(proxy)) 96 | response = requests.get(url, headers=HEADER, proxies={"http": proxy}, timeout=timeout) 97 | else: 98 | print('not using proxy') 99 | response = requests.get(url, headers=HEADER, timeout=timeout) 100 | if pos is None: # html response 101 | html = response.text or '' 102 | json_resp = None 103 | else: 104 | html = '' 105 | try: 106 | json_resp = response.json() 107 | html = json_resp['items_html'] or '' 108 | except (ValueError, KeyError) as e: 109 | logger.exception('Failed to parse JSON while requesting "{}"'.format(url)) 110 | 111 | tweets = list(Tweet.from_html(html)) 112 | 113 | if not tweets: 114 | try: 115 | if json_resp: 116 | pos = json_resp['min_position'] 117 | has_more_items = json_resp['has_more_items'] 118 | if not has_more_items: 119 | logger.info("Twitter returned : 'has_more_items' ") 120 | return [], None 121 | else: 122 | pos = None 123 | except: 124 | pass 125 | if retry > 0: 126 | logger.info('Retrying... (Attempts left: {})'.format(retry)) 127 | return query_single_page(query, lang, pos, retry - 1, from_user, use_proxy=use_proxy) 128 | else: 129 | return [], pos 130 | 131 | if json_resp: 132 | return tweets, urllib.parse.quote(json_resp['min_position']) 133 | if from_user: 134 | return tweets, tweets[-1].tweet_id 135 | return tweets, "TWEET-{}-{}".format(tweets[-1].tweet_id, tweets[0].tweet_id) 136 | 137 | except requests.exceptions.HTTPError as e: 138 | logger.exception('HTTPError {} while requesting "{}"'.format( 139 | e, url)) 140 | except requests.exceptions.ConnectionError as e: 141 | logger.exception('ConnectionError {} while requesting "{}"'.format( 142 | e, url)) 143 | except requests.exceptions.Timeout as e: 144 | logger.exception('TimeOut {} while requesting "{}"'.format( 145 | e, url)) 146 | except json.decoder.JSONDecodeError as e: 147 | logger.exception('Failed to parse JSON "{}" while requesting "{}".'.format( 148 | e, url)) 149 | 150 | if retry > 0: 151 | logger.info('Retrying... (Attempts left: {})'.format(retry)) 152 | return query_single_page(query, lang, pos, retry - 1, use_proxy=use_proxy) 153 | 154 | logger.error('Giving up.') 155 | return [], None 156 | 157 | 158 | def query_tweets_once_generator(query, limit=None, lang='', pos=None, use_proxy=True): 159 | """ 160 | Queries twitter for all the tweets you want! It will load all pages it gets 161 | from twitter. However, twitter might out of a sudden stop serving new pages, 162 | in that case, use the `query_tweets` method. 163 | 164 | Note that this function catches the KeyboardInterrupt so it can return 165 | tweets on incomplete queries if the user decides to abort. 166 | 167 | :param query: Any advanced query you want to do! Compile it at 168 | https://twitter.com/search-advanced and just copy the query! 169 | :param limit: Scraping will be stopped when at least ``limit`` number of 170 | items are fetched. 171 | :param pos: Field used as a "checkpoint" to continue where you left off in iteration 172 | :return: A list of twitterscraper.Tweet objects. You will get at least 173 | ``limit`` number of items. 
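    :param lang: Language code to restrict the search to (same values as the --lang command line argument); an empty string means no restriction.
    :param use_proxy: When True, requests are routed through proxies scraped from free-proxy-list.net.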
174 | """ 175 | logger.info('Querying {}'.format(query)) 176 | query = query.replace(' ', '%20').replace('#', '%23').replace(':', '%3A').replace('&', '%26') 177 | num_tweets = 0 178 | try: 179 | while True: 180 | new_tweets, new_pos = query_single_page(query, lang, pos, use_proxy=use_proxy) 181 | if len(new_tweets) == 0: 182 | logger.info('Got {} tweets for {}.'.format( 183 | num_tweets, query)) 184 | return 185 | 186 | for t in new_tweets: 187 | yield t, pos 188 | 189 | # use new_pos only once you have iterated through all old tweets 190 | pos = new_pos 191 | 192 | num_tweets += len(new_tweets) 193 | 194 | if limit and num_tweets >= limit: 195 | logger.info('Got {} tweets for {}.'.format( 196 | num_tweets, query)) 197 | return 198 | 199 | except KeyboardInterrupt: 200 | logger.info('Program interrupted by user. Returning tweets gathered ' 201 | 'so far...') 202 | except BaseException: 203 | logger.exception('An unknown error occurred! Returning tweets ' 204 | 'gathered so far.') 205 | logger.info('Got {} tweets for {}.'.format( 206 | num_tweets, query)) 207 | 208 | 209 | def query_tweets_once(*args, **kwargs): 210 | res = list(query_tweets_once_generator(*args, **kwargs)) 211 | if res: 212 | tweets, positions = zip(*res) 213 | return tweets 214 | else: 215 | return [] 216 | 217 | 218 | def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang='', use_proxy=True): 219 | no_days = (enddate - begindate).days 220 | 221 | if(no_days < 0): 222 | sys.exit('Begin date must occur before end date.') 223 | 224 | if poolsize > no_days: 225 | # Since we are assigning each pool a range of dates to query, 226 | # the number of pools should not exceed the number of dates. 227 | poolsize = no_days 228 | dateranges = [begindate + dt.timedelta(days=elem) for elem in linspace(0, no_days, poolsize+1)] 229 | 230 | if limit and poolsize: 231 | limit_per_pool = (limit // poolsize)+1 232 | else: 233 | limit_per_pool = None 234 | 235 | queries = ['{} since:{} until:{}'.format(query, since, until) 236 | for since, until in zip(dateranges[:-1], dateranges[1:])] 237 | 238 | all_tweets = [] 239 | try: 240 | pool = Pool(poolsize) 241 | logger.info('queries: {}'.format(queries)) 242 | try: 243 | for new_tweets in pool.imap_unordered(partial(query_tweets_once, limit=limit_per_pool, lang=lang, use_proxy=use_proxy), queries): 244 | all_tweets.extend(new_tweets) 245 | logger.info('Got {} tweets ({} new).'.format( 246 | len(all_tweets), len(new_tweets))) 247 | except KeyboardInterrupt: 248 | logger.info('Program interrupted by user. Returning all tweets ' 249 | 'gathered so far.') 250 | finally: 251 | pool.close() 252 | pool.join() 253 | 254 | return all_tweets 255 | 256 | 257 | def query_tweets_from_user(user, limit=None, use_proxy=True): 258 | pos = None 259 | tweets = [] 260 | try: 261 | while True: 262 | new_tweets, pos = query_single_page(user, lang='', pos=pos, from_user=True, use_proxy=use_proxy) 263 | if len(new_tweets) == 0: 264 | logger.info("Got {} tweets from username {}".format(len(tweets), user)) 265 | return tweets 266 | 267 | tweets += new_tweets 268 | 269 | if limit and len(tweets) >= limit: 270 | logger.info("Got {} tweets from username {}".format(len(tweets), user)) 271 | return tweets 272 | 273 | except KeyboardInterrupt: 274 | logger.info("Program interrupted by user. Returning tweets gathered " 275 | "so far...") 276 | except BaseException: 277 | logger.exception("An unknown error occurred! 
Returning tweets " 278 | "gathered so far.") 279 | logger.info("Got {} tweets from username {}.".format( 280 | len(tweets), user)) 281 | return tweets 282 | 283 | 284 | def query_user_page(url, retry=10, timeout=60, use_proxy=True): 285 | """ 286 | Returns the scraped user data from a twitter user page. 287 | 288 | :param url: The URL to get the twitter user info from (url contains the user page) 289 | :param retry: Number of retries if something goes wrong. 290 | :return: Returns the scraped user data from a twitter user page. 291 | """ 292 | 293 | try: 294 | if use_proxy: 295 | proxy = next(proxy_pool) 296 | logger.info('Using proxy {}'.format(proxy)) 297 | response = requests.get(url, headers=HEADER, proxies={"http": proxy}) 298 | else: 299 | response = requests.get(url, headers=HEADER) 300 | html = response.text or '' 301 | 302 | user_info = User.from_html(html) 303 | if not user_info: 304 | return None 305 | 306 | return user_info 307 | 308 | except requests.exceptions.HTTPError as e: 309 | logger.exception('HTTPError {} while requesting "{}"'.format( 310 | e, url)) 311 | except requests.exceptions.ConnectionError as e: 312 | logger.exception('ConnectionError {} while requesting "{}"'.format( 313 | e, url)) 314 | except requests.exceptions.Timeout as e: 315 | logger.exception('TimeOut {} while requesting "{}"'.format( 316 | e, url)) 317 | 318 | if retry > 0: 319 | logger.info('Retrying... (Attempts left: {})'.format(retry)) 320 | return query_user_page(url, retry-1, use_proxy) 321 | 322 | logger.error('Giving up.') 323 | return None 324 | 325 | 326 | def query_user_info(user, use_proxy=True): 327 | """ 328 | Returns the scraped user data from a twitter user page. 329 | 330 | :param user: the twitter user to web scrape its twitter page info 331 | """ 332 | 333 | 334 | try: 335 | user_info = query_user_page(INIT_URL_USER.format(u=user), use_proxy=use_proxy) 336 | if user_info: 337 | logger.info("Got user information from username {}".format(user)) 338 | return user_info 339 | 340 | except KeyboardInterrupt: 341 | logger.info("Program interrupted by user. Returning user information gathered so far...") 342 | except BaseException: 343 | logger.exception("An unknown error occurred! 
Returning user information gathered so far...") 344 | 345 | logger.info("Got user information from username {}".format(user)) 346 | return user_info 347 | -------------------------------------------------------------------------------- /twitterscraper/tweet.py: -------------------------------------------------------------------------------- 1 | import re 2 | from datetime import datetime 3 | 4 | from bs4 import BeautifulSoup 5 | from coala_utils.decorators import generate_ordering 6 | 7 | 8 | @generate_ordering('timestamp', 'id', 'text', 'user', 'replies', 'retweets', 'likes') 9 | class Tweet: 10 | def __init__( 11 | self, screen_name, username, user_id, tweet_id, tweet_url, timestamp, 12 | timestamp_epochs, text, text_html, links, hashtags, has_media, img_urls, 13 | video_url, likes, retweets, replies, is_replied, is_reply_to, 14 | parent_tweet_id, reply_to_users 15 | ): 16 | # user name & id 17 | self.screen_name = screen_name 18 | self.username = username 19 | self.user_id = user_id 20 | # tweet basic data 21 | self.tweet_id = tweet_id 22 | self.tweet_url = tweet_url 23 | self.timestamp = timestamp 24 | self.timestamp_epochs = timestamp_epochs 25 | # tweet text 26 | self.text = text 27 | self.text_html = text_html 28 | self.links = links 29 | self.hashtags = hashtags 30 | # tweet media 31 | self.has_media = has_media 32 | self.img_urls = img_urls 33 | self.video_url = video_url 34 | # tweet actions numbers 35 | self.likes = likes 36 | self.retweets = retweets 37 | self.replies = replies 38 | self.is_replied = is_replied 39 | # detail of reply to others 40 | self.is_reply_to = is_reply_to 41 | self.parent_tweet_id = parent_tweet_id 42 | self.reply_to_users = reply_to_users 43 | 44 | @classmethod 45 | def from_soup(cls, tweet): 46 | tweet_div = tweet.find('div', 'tweet') 47 | 48 | # user name & id 49 | screen_name = tweet_div["data-screen-name"].strip('@') 50 | username = tweet_div["data-name"] 51 | user_id = tweet_div["data-user-id"] 52 | 53 | # tweet basic data 54 | tweet_id = tweet_div["data-tweet-id"] # equal to 'data-item-id' 55 | tweet_url = tweet_div["data-permalink-path"] 56 | timestamp_epochs = int(tweet.find('span', '_timestamp')['data-time']) 57 | timestamp = datetime.utcfromtimestamp(timestamp_epochs) 58 | 59 | # tweet text 60 | soup_html = tweet_div \ 61 | .find('div', 'js-tweet-text-container') \ 62 | .find('p', 'tweet-text') 63 | text_html = str(soup_html) or "" 64 | for img in soup_html.findAll("img", "Emoji"): 65 | img.replace_with(img.attrs.get("alt", '')) 66 | text = soup_html.get_text() or "" 67 | links = [ 68 | atag.get('data-expanded-url', atag['href']) 69 | for atag in soup_html.find_all('a', class_='twitter-timeline-link') 70 | if 'pic.twitter' not in atag.text # eliminate picture 71 | ] 72 | hashtags = [tag.strip('#')for tag in re.findall(r'#\w+', text)] 73 | 74 | # tweet media 75 | # --- imgs 76 | soup_imgs = tweet_div.find_all('div', 'AdaptiveMedia-photoContainer') 77 | img_urls = [ 78 | img['data-image-url'] for img in soup_imgs 79 | ] if soup_imgs else [] 80 | 81 | # --- videos 82 | video_div = tweet_div.find('div', 'PlayableMedia-container') 83 | video_url = video_div.find('a')['href'] if video_div else '' 84 | has_media = True if img_urls or video_url else False 85 | 86 | # update 'links': eliminate 'video_url' from 'links' for duplicate 87 | links = list(filter(lambda x: x != video_url, links)) 88 | 89 | # tweet actions numbers 90 | action_div = tweet_div.find('div', 'ProfileTweet-actionCountList') 91 | 92 | # --- likes 93 | likes = int(action_div.find( 94 
| 'span', 'ProfileTweet-action--favorite').find( 95 | 'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0') 96 | # --- RT 97 | retweets = int(action_div.find( 98 | 'span', 'ProfileTweet-action--retweet').find( 99 | 'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0') 100 | # --- replies 101 | replies = int(action_div.find( 102 | 'span', 'ProfileTweet-action--reply u-hiddenVisually').find( 103 | 'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0') 104 | is_replied = False if replies == 0 else True 105 | 106 | # detail of reply to others 107 | # - reply to others 108 | parent_tweet_id = tweet_div['data-conversation-id'] # parent tweet 109 | 110 | if tweet_id == parent_tweet_id: 111 | is_reply_to = False 112 | parent_tweet_id = '' 113 | reply_to_users = [] 114 | else: 115 | is_reply_to = True 116 | soup_reply_to_users = \ 117 | tweet_div.find('div', 'ReplyingToContextBelowAuthor') \ 118 | .find_all('a') 119 | reply_to_users = [{ 120 | 'screen_name': user.text.strip('@'), 121 | 'user_id': user['data-user-id'] 122 | } for user in soup_reply_to_users] 123 | 124 | return cls( 125 | screen_name, username, user_id, tweet_id, tweet_url, timestamp, 126 | timestamp_epochs, text, text_html, links, hashtags, has_media, 127 | img_urls, video_url, likes, retweets, replies, is_replied, 128 | is_reply_to, parent_tweet_id, reply_to_users 129 | ) 130 | 131 | @classmethod 132 | def from_html(cls, html): 133 | soup = BeautifulSoup(html, "lxml") 134 | tweets = soup.find_all('li', 'js-stream-item') 135 | if tweets: 136 | for tweet in tweets: 137 | try: 138 | yield cls.from_soup(tweet) 139 | except AttributeError: 140 | pass # Incomplete info? Discard! 141 | except TypeError: 142 | pass # Incomplete info? Discard! 143 | -------------------------------------------------------------------------------- /twitterscraper/user.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | 3 | 4 | class User: 5 | def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0, 6 | following=0, followers=0, likes=0, lists=0, is_verified=0): 7 | self.user = user 8 | self.full_name = full_name 9 | self.location = location 10 | self.blog = blog 11 | self.date_joined = date_joined 12 | self.id = id 13 | self.tweets = tweets 14 | self.following = following 15 | self.followers = followers 16 | self.likes = likes 17 | self.lists = lists 18 | self.is_verified = is_verified 19 | 20 | @classmethod 21 | def from_soup(self, tag_prof_header, tag_prof_nav): 22 | """ 23 | Returns the scraped user data from a twitter user page. 
24 |
25 |         :param tag_prof_header: captures the left hand part of user info
26 |         :param tag_prof_nav: captures the upper part of user info
27 |         :return: Returns a User object with captured data via beautifulsoup
28 |         """
29 |
30 |         self.user = tag_prof_header.find('a', {'class':'ProfileHeaderCard-nameLink u-textInheritColor js-nav'})['href'].strip("/")
31 |         self.full_name = tag_prof_header.find('a', {'class':'ProfileHeaderCard-nameLink u-textInheritColor js-nav'}).text
32 |
33 |         location = tag_prof_header.find('span', {'class':'ProfileHeaderCard-locationText u-dir'})
34 |         if location is None:
35 |             self.location = "None"
36 |         else:
37 |             self.location = location.text.strip()
38 |
39 |         blog = tag_prof_header.find('span', {'class':"ProfileHeaderCard-urlText u-dir"})
40 |         if blog is None:
41 |             self.blog = "None"
42 |         else:
43 |             self.blog = blog.text.strip()
44 |
45 |         date_joined = tag_prof_header.find('div', {'class':"ProfileHeaderCard-joinDate"}).find('span', {'class':'ProfileHeaderCard-joinDateText js-tooltip u-dir'})['title']
46 |         if date_joined is None:
47 |             self.date_joined = "Unknown"
48 |         else:
49 |             self.date_joined = date_joined.strip()
50 |
51 |         tag_verified = tag_prof_header.find('span', {'class': "ProfileHeaderCard-badges"})
52 |         if tag_verified is not None:
53 |             self.is_verified = 1
54 |
55 |         self.id = tag_prof_nav.find('div',{'class':'ProfileNav'})['data-user-id']
56 |         tweets = tag_prof_nav.find('span', {'class':"ProfileNav-value"})['data-count']
57 |         if tweets is None:
58 |             self.tweets = 0
59 |         else:
60 |             self.tweets = int(tweets)
61 |
62 |         following = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--following"}).\
63 |             find('span', {'class':"ProfileNav-value"})['data-count']
64 |         if following is None:
65 |             self.following = 0
66 |         else:
67 |             self.following = int(following)
68 |
69 |         followers = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--followers"}).\
70 |             find('span', {'class':"ProfileNav-value"})['data-count']
71 |         if followers is None:
72 |             self.followers = 0
73 |         else:
74 |             self.followers = int(followers)
75 |
76 |         likes = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--favorites"}).\
77 |             find('span', {'class':"ProfileNav-value"})['data-count']
78 |         if likes is None:
79 |             self.likes = 0
80 |         else:
81 |             self.likes = int(likes)
82 |
83 |         lists = tag_prof_nav.find('li', {'class':"ProfileNav-item ProfileNav-item--lists"})
84 |         if lists is None:
85 |             self.lists = 0
86 |         elif lists.find('span', {'class':"ProfileNav-value"}) is None:
87 |             self.lists = 0
88 |         else:
89 |             lists = lists.find('span', {'class':"ProfileNav-value"}).text
90 |             self.lists = int(lists)
91 |         return self
92 |
93 |     @classmethod
94 |     def from_html(self, html):
95 |         soup = BeautifulSoup(html, "lxml")
96 |         user_profile_header = soup.find("div", {"class":'ProfileHeaderCard'})
97 |         user_profile_canopy = soup.find("div", {"class":'ProfileCanopy-nav'})
98 |         if user_profile_header and user_profile_canopy:
99 |             try:
100 |                 return self.from_soup(user_profile_header, user_profile_canopy)
101 |             except AttributeError:
102 |                 pass  # Incomplete info? Discard!
103 |             except TypeError:
104 |                 pass  # Incomplete info? Discard!
105 |
--------------------------------------------------------------------------------