├── AdvScraper
│   ├── GetOldTweets3
│   │   ├── GetOldTweets3_Article_Scraper.ipynb
│   │   └── GetOldTweets3_Companion_Scraper.ipynb
│   ├── README.md
│   ├── Tweepy
│   │   ├── Tweepy_Article_Scraper.ipynb
│   │   ├── Tweepy_Companion_Scraper.ipynb
│   │   └── credentials.csv
│   └── Tweepy_and_GetOldTweets3.ipynb
├── BasicScraper
│   ├── GetOldTweets3_Basic_Scraper.ipynb
│   ├── README.md
│   └── Tweepy_Basic_Scraper.ipynb
├── README.md
├── ScraperV4
│   ├── README.md
│   └── Tweepy_Scraper_V4.ipynb
└── snscrape
    ├── README.md
    ├── cli-with-python
    │   ├── snscrape-python-cli.ipynb
    │   └── snscrape-python-cli.py
    └── python-wrapper
        ├── snscrape-python-wrapper.ipynb
        └── snscrape-python-wrapper.py

/AdvScraper/GetOldTweets3/GetOldTweets3_Article_Scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Article Notebook for Scraping Twitter Using GetOldTweets3\n", 8 | "\n", 9 | "Package: https://github.com/Mottl/GetOldTweets3\n", 10 | "\n", 11 | "Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f\n", 12 | "\n", 13 | "### Notebook Author: Martin Beck\n", 14 | "#### Information current as of August 13th, 2020\n", 15 | " Dependencies: Make sure GetOldTweets3 is already installed in your Python environment. If not, you can pip install GetOldTweets3 to install the package. If you want more information on setting up, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into greater detail." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Notebook's Table of Contents\n", 23 | "\n", 24 | "1. [Getting More Information From Tweets](#Section1)\n", 25 | "
How to scrape more information from tweets such as favorite count, retweet count, mentions, permalinks, etc.\n", 26 | "2. [Getting User Information From Tweets](#Section2)\n", 27 | "
GetOldTweets3 does not offer any more user information than the author's screen name or Twitter @ name, which is shown in section 1.\n", 28 | "3. [Scraping Tweets With Advanced Queries](#Section3)\n", 29 | "
How to scrape for tweets using deeper queries such as searching by language of tweets, tweets within a certain location, tweets within specific date ranges, top tweets, etc.\n", 30 | "4. [Putting It All Together](#Section4)\n", 31 | "
Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## Imports for Notebook" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 27, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "# Pip install GetOldTweets3 if you don't already have the package\n", 48 | "# !pip install GetOldTweets3\n", 49 | "\n", 50 | "# Imports\n", 51 | "import GetOldTweets3 as got\n", 52 | "import pandas as pd" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## 1. Getting More Information From Tweets \n", 60 | "[Return to Table of Contents](#TOC)\n", 61 | "
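Before walking the field-by-field list below, it can help to inspect everything a GetOldTweets3 tweet object actually exposes in one shot. A minimal sketch, assuming the package is installed and the legacy endpoints it scrapes still respond (per the repository README, they no longer do):

```python
import GetOldTweets3 as got

# Pull a single tweet and dump every attribute the Tweet object carries.
criteria = got.manager.TweetCriteria().setUsername('jack').setMaxTweets(1)
tweet = got.manager.TweetManager.getTweets(criteria)[0]

# vars() returns the instance's attribute dictionary, so each field in the
# list below (id, username, text, retweets, ...) prints with a live value.
for name, value in vars(tweet).items():
    print(f'{name}: {value}')
```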
\n", 62 | "List of information available in the tweet object with GetOldTweets3\n", 63 | "* tweet.geo: *NOTE GEO-DATA NOT WORKING BASED ON ISSUE

\n", 64 | "\n", 65 | "* tweet.id: Id of tweet\n", 66 | "* tweet.author_id: User id of tweet's author\n", 67 | "* tweet.username: Username of tweet's author, commonly called User @ name\n", 68 | "* tweet.to: If tweet is a reply, the original tweet's username\n", 69 | "* tweet.text: Text content of tweet\n", 70 | "* tweet.retweets: Count of retweets\n", 71 | "* tweet.favorites: Count of favorites\n", 72 | "* tweet.replies: Count of replies\n", 73 | "* tweet.date: Date tweet was created\n", 74 | "* tweet.formatted_date: Formatted version of when tweet was created\n", 75 | "* tweet.hashtags: Hashtags that tweet contains\n", 76 | "* tweet.mentions: Mentions of other users that tweet contains\n", 77 | "* tweet.urls: Urls that are in the tweet\n", 78 | "* tweet.permalink: Permalink of tweet itself" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 35, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "username = 'jack'\n", 88 | "count = 150\n", 89 | " \n", 90 | "# Creation of tweetCriteria query object with methods to specify further\n", 91 | "tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 92 | ".setMaxTweets(count)\n", 93 | " \n", 94 | "# Creation of tweets iterable containing all queried tweet data\n", 95 | "tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 96 | " \n", 97 | "# List comprehension pulling chosen tweet information from tweets\n", 98 | "# Add or remove tweet information you want in the below list comprehension\n", 99 | "tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date, tweet.formatted_date, tweet.hashtags, tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 100 | " \n", 101 | "# Creation of dataframe from tweets_list\n", 102 | "# Add or remove columns as you remove tweet information\n", 103 | "tweets_df1 = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 104 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 36, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/html": [ 115 | "
\n", 116 | "\n", 129 | "\n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | "
Tweet IdTweet User IdTweet UserReply toTextRetweetsFavoritesRepliesDatetimeFormatted dateHashtagsMentionsUrlsPermalink
0129476528925570662412jackjsngrJordan is incredible1161272622020-08-15 22:37:55+00:00Sat Aug 15 22:37:55 +0000 2020https://twitter.com/jsngr/status/1294635175222...https://twitter.com/jack/status/12947652892557...
1129375388415923405012jackSpaceForceDoD?74191135832020-08-13 03:38:57+00:00Thu Aug 13 03:38:57 +0000 2020https://twitter.com/spaceforcedod/status/12936...https://twitter.com/jack/status/12937538841592...
2129368763667522355212jackTwitterDevBuild on Twitter again!61949454422020-08-12 23:15:42+00:00Wed Aug 12 23:15:42 +0000 2020https://twitter.com/TwitterDev/status/12935935...https://twitter.com/jack/status/12936876366752...
3129364129745938841612jackboardroomThanks for the chat @richkleiman and Gianni! G...52385892020-08-12 20:11:34+00:00Wed Aug 12 20:11:34 +0000 2020@richkleimanhttps://twitter.com/boardroom/status/129356427...https://twitter.com/jack/status/12936412974593...
4129195627381499084812jackMayalangersegalThank you. Thank you. Thank you. @RemindMe_OfT...293162020-08-08 04:35:53+00:00Sat Aug 08 04:35:53 +0000 2020@RemindMe_OfThishttps://twitter.com/jack/status/12919562738149...
\n", 237 | "
" 238 | ], 239 | "text/plain": [ 240 | " Tweet Id Tweet User Id Tweet User Reply to \\\n", 241 | "0 1294765289255706624 12 jack jsngr \n", 242 | "1 1293753884159234050 12 jack SpaceForceDoD \n", 243 | "2 1293687636675223552 12 jack TwitterDev \n", 244 | "3 1293641297459388416 12 jack boardroom \n", 245 | "4 1291956273814990848 12 jack Mayalangersegal \n", 246 | "\n", 247 | " Text Retweets Favorites \\\n", 248 | "0 Jordan is incredible 116 1272 \n", 249 | "1 ? 741 9113 \n", 250 | "2 Build on Twitter again! 619 4945 \n", 251 | "3 Thanks for the chat @richkleiman and Gianni! G... 52 385 \n", 252 | "4 Thank you. Thank you. Thank you. @RemindMe_OfT... 2 93 \n", 253 | "\n", 254 | " Replies Datetime Formatted date Hashtags \\\n", 255 | "0 62 2020-08-15 22:37:55+00:00 Sat Aug 15 22:37:55 +0000 2020 \n", 256 | "1 583 2020-08-13 03:38:57+00:00 Thu Aug 13 03:38:57 +0000 2020 \n", 257 | "2 442 2020-08-12 23:15:42+00:00 Wed Aug 12 23:15:42 +0000 2020 \n", 258 | "3 89 2020-08-12 20:11:34+00:00 Wed Aug 12 20:11:34 +0000 2020 \n", 259 | "4 16 2020-08-08 04:35:53+00:00 Sat Aug 08 04:35:53 +0000 2020 \n", 260 | "\n", 261 | " Mentions Urls \\\n", 262 | "0 https://twitter.com/jsngr/status/1294635175222... \n", 263 | "1 https://twitter.com/spaceforcedod/status/12936... \n", 264 | "2 https://twitter.com/TwitterDev/status/12935935... \n", 265 | "3 @richkleiman https://twitter.com/boardroom/status/129356427... \n", 266 | "4 @RemindMe_OfThis \n", 267 | "\n", 268 | " Permalink \n", 269 | "0 https://twitter.com/jack/status/12947652892557... \n", 270 | "1 https://twitter.com/jack/status/12937538841592... \n", 271 | "2 https://twitter.com/jack/status/12936876366752... \n", 272 | "3 https://twitter.com/jack/status/12936412974593... \n", 273 | "4 https://twitter.com/jack/status/12919562738149... " 274 | ] 275 | }, 276 | "execution_count": 36, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "tweets_df1.head()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "## 2. Getting User Information From Tweets\n", 290 | "[Return to Table of Contents](#TOC)\n", 291 | "
GetOldTweets3 is limited in the user information that is accessible. This library only allows access to a tweet author's username and user_id. If you want user information, I recommend using Tweepy for all of your scraping, or using Tweepy in tandem with GetOldTweets3 (as sketched above) to play to each library's strengths." 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "## 3. Scraping Tweets With Advanced Queries\n", 299 | "[Return to Table of Contents](#TOC)\n", 300 | "
\n", 301 | "List of methods available with GetOldTweets3 to refine your queries.\n", 302 | "\n", 303 | "* setUsername(str): Setting query based on username\n", 304 | "* setMaxTweets(int): Setting maximum number of tweets to search\n", 305 | "* setQuerySearch(str): Setting query based on text\n", 306 | "* setSince(str \"yyyy-mm-dd\"): Setting lower bound date on query\n", 307 | "* setUntil(str \"yyyy-mm-dd\"): Setting upper bound date on query\n", 308 | "* setNear(str): Setting location of query search\n", 309 | "* setWithin(str): Setting radius of query search location\n", 310 | "* setLang(str): Setting language of query\n", 311 | "* setTopTweets(bool): Setting query to search only for top tweets\n", 312 | "* setEmoji(\"ignore\"/\"unicode\"/\"name\"): Setting query to search using emoji styles" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 37, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "username = \"BarackObama\"\n", 322 | "text_query = \"Hello\"\n", 323 | "since_date = \"2011-01-01\"\n", 324 | "until_date = \"2016-12-20\"\n", 325 | "count = 150\n", 326 | " \n", 327 | "# Creation of tweetCriteria query object with methods to specify further\n", 328 | "tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 329 | ".setQuerySearch(text_query).setSince(since_date)\\\n", 330 | ".setUntil(until_date).setMaxTweets(count)\n", 331 | " \n", 332 | "# Creation of tweets iterable containing all queried tweet data\n", 333 | "tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 334 | " \n", 335 | "# List comprehension pulling chosen tweet information from tweets\n", 336 | "# Add or remove tweet information you want in the below list comprehension\n", 337 | "tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.text, tweet.retweets, tweet.favorites,tweet.replies,tweet.date] for tweet in tweets]\n", 338 | " \n", 339 | "# Creation of dataframe from tweets list\n", 340 | "# Add or remove columns as you remove tweet information\n", 341 | "tweets_df3 = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User', 'Text','Retweets', 'Favorites', \n", 342 | " 'Replies', 'Datetime'])" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 38, 348 | "metadata": { 349 | "scrolled": false 350 | }, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/html": [ 355 | "
\n", 356 | "\n", 369 | "\n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | "
Tweet IdTweet User IdTweet UserTextRetweetsFavoritesRepliesDatetime
0682986933862154241813286BarackObamaHello, 2016.3506130107602016-01-01 18:09:08+00:00
1547783171199496192813286BarackObamaSay hello to friends you know and everyone you...3555907510872014-12-24 15:57:39+00:00
2457281289351999489813286BarackObamaHello, spring.58071008910402014-04-18 22:15:30+00:00
3438453976833343488813286BarackObama\"Hello OFA!\" —President Obama at the #ActionSu...134244572014-02-25 23:22:28+00:00
4265569746991333377813286BarackObama“Hello, Columbus! Hello, Ohio! Are you fired u...513208812012-11-05 21:42:16+00:00
\n", 441 | "
" 442 | ], 443 | "text/plain": [ 444 | " Tweet Id Tweet User Id Tweet User \\\n", 445 | "0 682986933862154241 813286 BarackObama \n", 446 | "1 547783171199496192 813286 BarackObama \n", 447 | "2 457281289351999489 813286 BarackObama \n", 448 | "3 438453976833343488 813286 BarackObama \n", 449 | "4 265569746991333377 813286 BarackObama \n", 450 | "\n", 451 | " Text Retweets Favorites \\\n", 452 | "0 Hello, 2016. 3506 13010 \n", 453 | "1 Say hello to friends you know and everyone you... 3555 9075 \n", 454 | "2 Hello, spring. 5807 10089 \n", 455 | "3 \"Hello OFA!\" —President Obama at the #ActionSu... 134 244 \n", 456 | "4 “Hello, Columbus! Hello, Ohio! Are you fired u... 513 208 \n", 457 | "\n", 458 | " Replies Datetime \n", 459 | "0 760 2016-01-01 18:09:08+00:00 \n", 460 | "1 1087 2014-12-24 15:57:39+00:00 \n", 461 | "2 1040 2014-04-18 22:15:30+00:00 \n", 462 | "3 57 2014-02-25 23:22:28+00:00 \n", 463 | "4 81 2012-11-05 21:42:16+00:00 " 464 | ] 465 | }, 466 | "execution_count": 38, 467 | "metadata": {}, 468 | "output_type": "execute_result" 469 | } 470 | ], 471 | "source": [ 472 | "tweets_df3.head()" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": {}, 478 | "source": [ 479 | "## 4. Putting It All Together\n", 480 | "[Return to Table of Contents](#TOC)\n", 481 | "
\n", 482 | "Great, we now know how to pull more information from tweets and querying with advanced parameters. The great thing is how easy it is to mix and match whatever you want to search for. While it was shown above several times. The point is that you can mix and match the information you want from the tweets and the type of queries you conduct. It's just important that you update the column names in the pandas dataframe so you don't get errors.\n", 483 | "\n", 484 | "
\n", 485 | "Below is an example of a search for 150 top tweets with 'coronavirus' in it that occurred between August 5th and August 8th 2020 in Washington D.C." 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 39, 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [ 494 | "text_query = 'Coronavirus'\n", 495 | "since_date = '2020-08-05'\n", 496 | "until_date = '2020-08-10'\n", 497 | "location = 'Washington, D.C.'\n", 498 | "top_tweets = True\n", 499 | "count = 150\n", 500 | " \n", 501 | "# Creation of tweetCriteria query object with methods to specify further\n", 502 | "tweetCriteria = got.manager.TweetCriteria()\\\n", 503 | ".setQuerySearch(text_query).setSince(since_date)\\\n", 504 | ".setUntil(until_date).setNear(location).setTopTweets(top_tweets)\\\n", 505 | ".setMaxTweets(count)\n", 506 | " \n", 507 | "# Creation of tweets iterable containing all queried tweet data\n", 508 | "tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 509 | " \n", 510 | "# List comprehension pulling chosen tweet information from tweets\n", 511 | "# Add or remove tweet information you want in the below list comprehension\n", 512 | "tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date, tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 513 | " \n", 514 | "# Creation of dataframe from tweets list\n", 515 | "# Add or remove columns as you remove tweet information\n", 516 | "tweets_df4 = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text',\n", 517 | " 'Retweets', 'Favorites', 'Replies', 'Datetime', 'Mentions','Urls','Permalink'])" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 40, 523 | "metadata": {}, 524 | "outputs": [ 525 | { 526 | "data": { 527 | "text/html": [ 528 | "
\n", 529 | "\n", 542 | "\n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | "
Tweet IdTweet User IdTweet UserReply toTextRetweetsFavoritesRepliesDatetimeMentionsUrlsPermalink
01292610170309181447535643852JordanSchachtelNoneFauci had a very interesting Q&A this weekend ...276563922020-08-09 23:54:14+00:00https://www.cnbc.com/2020/08/07/coronavirus-va...https://twitter.com/JordanSchachtel/status/129...
11292584089833349121225265639ddale8NoneIf the president confused you about what was a...174334811432020-08-09 22:10:36+00:00https://cnn.it/31zwwirhttps://twitter.com/ddale8/status/129258408983...
2129254381123584000053809979davidalimNoneAntigen tests have been touted as a way to sca...164212272020-08-09 19:30:33+00:00@rachel_roubeinhttps://www.politico.com/news/2020/08/09/coron...https://twitter.com/davidalim/status/129254381...
3129252566042293043218956073dcexaminerNoneA Nashville, Tennessee, councilwoman wants tho...3152743902020-08-09 18:18:26+00:00https://washex.am/3kD8L1Ehttps://twitter.com/dcexaminer/status/12925256...
41292468804648394752309822757ryanstruykNoneThe United States just reached 5 million repor...9741257522020-08-09 14:32:30+00:00https://twitter.com/ryanstruyk/status/12924688...
\n", 638 | "
" 639 | ], 640 | "text/plain": [ 641 | " Tweet Id Tweet User Id Tweet User Reply to \\\n", 642 | "0 1292610170309181447 535643852 JordanSchachtel None \n", 643 | "1 1292584089833349121 225265639 ddale8 None \n", 644 | "2 1292543811235840000 53809979 davidalim None \n", 645 | "3 1292525660422930432 18956073 dcexaminer None \n", 646 | "4 1292468804648394752 309822757 ryanstruyk None \n", 647 | "\n", 648 | " Text Retweets Favorites \\\n", 649 | "0 Fauci had a very interesting Q&A this weekend ... 276 563 \n", 650 | "1 If the president confused you about what was a... 1743 3481 \n", 651 | "2 Antigen tests have been touted as a way to sca... 164 212 \n", 652 | "3 A Nashville, Tennessee, councilwoman wants tho... 315 274 \n", 653 | "4 The United States just reached 5 million repor... 974 1257 \n", 654 | "\n", 655 | " Replies Datetime Mentions \\\n", 656 | "0 92 2020-08-09 23:54:14+00:00 \n", 657 | "1 143 2020-08-09 22:10:36+00:00 \n", 658 | "2 27 2020-08-09 19:30:33+00:00 @rachel_roubein \n", 659 | "3 390 2020-08-09 18:18:26+00:00 \n", 660 | "4 52 2020-08-09 14:32:30+00:00 \n", 661 | "\n", 662 | " Urls \\\n", 663 | "0 https://www.cnbc.com/2020/08/07/coronavirus-va... \n", 664 | "1 https://cnn.it/31zwwir \n", 665 | "2 https://www.politico.com/news/2020/08/09/coron... \n", 666 | "3 https://washex.am/3kD8L1E \n", 667 | "4 \n", 668 | "\n", 669 | " Permalink \n", 670 | "0 https://twitter.com/JordanSchachtel/status/129... \n", 671 | "1 https://twitter.com/ddale8/status/129258408983... \n", 672 | "2 https://twitter.com/davidalim/status/129254381... \n", 673 | "3 https://twitter.com/dcexaminer/status/12925256... \n", 674 | "4 https://twitter.com/ryanstruyk/status/12924688... " 675 | ] 676 | }, 677 | "execution_count": 40, 678 | "metadata": {}, 679 | "output_type": "execute_result" 680 | } 681 | ], 682 | "source": [ 683 | "tweets_df4.head()" 684 | ] 685 | } 686 | ], 687 | "metadata": { 688 | "kernelspec": { 689 | "display_name": "Python 3", 690 | "language": "python", 691 | "name": "python3" 692 | }, 693 | "language_info": { 694 | "codemirror_mode": { 695 | "name": "ipython", 696 | "version": 3 697 | }, 698 | "file_extension": ".py", 699 | "mimetype": "text/x-python", 700 | "name": "python", 701 | "nbconvert_exporter": "python", 702 | "pygments_lexer": "ipython3", 703 | "version": "3.7.3" 704 | } 705 | }, 706 | "nbformat": 4, 707 | "nbformat_minor": 2 708 | } 709 | -------------------------------------------------------------------------------- /AdvScraper/GetOldTweets3/GetOldTweets3_Companion_Scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Companion Notebook for Scraping Twitter Using GetOldTweets3\n", 8 | "\n", 9 | "Package: https://github.com/Mottl/GetOldTweets3\n", 10 | "\n", 11 | "Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f\n", 12 | "\n", 13 | "### Notebook Author: Martin Beck\n", 14 | "#### Information current as of August, 13th 2020\n", 15 | " Dependencies: Make sure GetOldTweets3 is already installed in your Python environment. If not, you can pip install GetOldTweets3 to install the package. If you want more information on setting up I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into deeper detail." 
16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Notebook's Table of Contents\n", 23 | "
\n", 24 | "This companion notebook is meant to build on the scraping article and article notebook as it covers more scenarios that may come up and provides more examples.\n", 25 | "\n", 26 | "1. [Getting More Information From Tweets](#Section1)\n", 27 | "
How to scrape more information from tweets such as favorite count, retweet count, mentions, permalinks, etc.\n", 28 | "2. [Getting User Information From Tweets](#Section2)\n", 29 | "
GetOldTweets3 does not offer any more user information than the author's screen name or Twitter @ name, which is shown in section 1.\n", 30 | "3. [Scraping Tweets With Advanced Queries](#Section3)\n", 31 | "
How to scrape for tweets using deeper queries such as searching by language of tweets, tweets within a certain location, tweets within specific date ranges, top tweets, etc.\n", 32 | "4. [Putting It All Together](#Section4)\n", 33 | "
Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Imports for Notebook" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# Pip install GetOldTweets3 if you don't already have the package\n", 50 | "# !pip install GetOldTweets3\n", 51 | "\n", 52 | "# Imports\n", 53 | "import GetOldTweets3 as got\n", 54 | "import pandas as pd" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## 1. Getting More Information From Tweets \n", 62 | "[Return to Table of Contents](#TOC)\n", 63 | "
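The functions in this section loop over whole username lists, so longer runs benefit from progress feedback and a short pause between users. A sketch, assuming tqdm is installed and using the scrape_user_tweets function defined later in this section:

```python
import time
from tqdm import tqdm

usernames = ['jack', 'billgates', 'random']
for username in tqdm(usernames, desc='Scraping users'):
    scrape_user_tweets(username, 150)
    time.sleep(2)  # brief pause between users to go easy on the endpoints
```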
\n", 64 | "List of information available in the tweet object with GetOldTweets3 I included everything except geo data due to issues that are currently still open.\n", 65 | "\n", 66 | "* tweet.geo: *NOTE GEO-DATA NOT WORKING BASED ON ISSUE

\n", 67 | "\n", 68 | "* tweet.id: Id of tweet\n", 69 | "* tweet.author_id: User id of tweet's author\n", 70 | "* tweet.username: Username of tweet's author, commonly called User's @ name\n", 71 | "* tweet.to: If tweet is a reply, the original tweet's username\n", 72 | "* tweet.text: Text content of tweet\n", 73 | "* tweet.retweets: Count of retweets\n", 74 | "* tweet.favorites: Count of favorites\n", 75 | "* tweet.replies: Count of replies\n", 76 | "* tweet.date: Date tweet was created\n", 77 | "* tweet.formatted_date: Formatted version of when tweet was created\n", 78 | "* tweet.hashtags: Hashtags that tweet contains\n", 79 | "* tweet.mentions: Mentions of other users that tweet contains\n", 80 | "* tweet.urls: Urls that are in the tweet\n", 81 | "* tweet.permalink: Permalink of tweet itself" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Query by Username\n", 89 | "I created three functions to build off of based off of various scenarios that are likely to happen for someone scraping tweets from users. After each function I call them to showcase an example of them being used.\n", 90 | "\n", 91 | "#### F1. scrape_user_tweets\n", 92 | "This function scrapes a single users tweets and exports the data as a csv or excel file\n", 93 | "\n", 94 | "#### F2. scrape_multiple_users_multifile\n", 95 | "This function scrapes multiple users based on a list and exports separate csv or excel files per user.\n", 96 | "\n", 97 | "#### F3. scrape_multiple_users_singlefile\n", 98 | "This function scrapes multiple users based on a list and exports one csv or excel file containing all tweets" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 34, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "def scrape_user_tweets(username, max_tweets):\n", 108 | " # Creation of query object\n", 109 | " tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 110 | " .setMaxTweets(max_tweets)\n", 111 | " # Creation of list that contains all tweets\n", 112 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 113 | "\n", 114 | " # Pulling information from tweets iterable object\n", 115 | " # Add or remove tweet information you want in the below list comprehension\n", 116 | " tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 117 | " tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, \n", 118 | " tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 119 | "\n", 120 | " # Creation of dataframe from tweets list\n", 121 | " # Add or remove columns as you remove tweet information\n", 122 | " tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 123 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])\n", 124 | " \n", 125 | " # Removing timezone information to allow excel file download\n", 126 | " tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 127 | " \n", 128 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 129 | " tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n", 130 | "# tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 35, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | 
"# Creating example username to scrape from\n", 140 | "username = 'jack'\n", 141 | "\n", 142 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 143 | "max_tweets = 150\n", 144 | "\n", 145 | "# Function will scrape username, attempt to pull max_tweet amount, and create csv/excel file from data.\n", 146 | "scrape_user_tweets(username,max_tweets)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 36, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "def scrape_multiple_users_multifile(username_list, max_tweets_per):\n", 156 | " # Looping through each username in user list\n", 157 | " for username in username_list:\n", 158 | " # Creation of query object\n", 159 | " tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 160 | " .setMaxTweets(max_tweets_per)\n", 161 | " # Creation of list that contains all tweets\n", 162 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 163 | "\n", 164 | " # Creating list of chosen tweet data\n", 165 | " # Add or remove tweet information you want in the below list comprehension\n", 166 | " tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 167 | " tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, \n", 168 | " tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 169 | "\n", 170 | " # Creation of dataframe from tweets list\n", 171 | " # Add or remove columns as you remove tweet information\n", 172 | " tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 173 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])\n", 174 | " \n", 175 | " # Removing timezone information to allow excel file download\n", 176 | " tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 177 | " \n", 178 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 179 | " tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n", 180 | "# tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 37, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "# Creating example user list with 3 users\n", 190 | "user_name_list = ['jack','billgates','random']\n", 191 | "\n", 192 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 193 | "max_tweets_per = 150\n", 194 | "\n", 195 | "# Function will scrape each user, attempting to pull max_tweet amount, and create csv/excel file per user.\n", 196 | "scrape_multiple_users_multifile(user_name_list, max_tweets_per)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 40, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def scrape_multiple_users_singlefile(username_list, max_tweets_per):\n", 206 | " # Creating master list to contain all tweets\n", 207 | " master_tweets_list = []\n", 208 | " \n", 209 | " # Looping through each username in user list\n", 210 | " for username in user_name_list:\n", 211 | " # Creation of query object\n", 212 | " tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 213 | " .setMaxTweets(max_tweets_per)\n", 214 | " # Creation of list that contains all tweets\n", 215 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 216 | "\n", 
217 | " # Creating list of chosen tweet data\n", 218 | " # Appending new tweets per user into the master tweet list\n", 219 | " # Add or remove tweet information you want in the below list comprehension\n", 220 | " for tweet in tweets:\n", 221 | " master_tweets_list.append((tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 222 | " tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, \n", 223 | " tweet.mentions, tweet.urls, tweet.permalink))\n", 224 | "\n", 225 | " # Creation of dataframe from tweets list\n", 226 | " # Add or remove columns as you remove tweet information\n", 227 | " tweets_df = pd.DataFrame(master_tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 228 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])\n", 229 | " \n", 230 | " # Removing timezone information to allow excel file download\n", 231 | " tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 232 | " \n", 233 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 234 | " tweets_df.to_csv('multi-user-tweets.csv', sep=',', index = False)\n", 235 | "# tweets_df.to_excel('multi-user-tweets.xlsx', index = False)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 41, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "# Creating example user list with 3 users\n", 245 | "user_name_list = ['jack','billgates','random']\n", 246 | "\n", 247 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 248 | "max_tweets_per = 150\n", 249 | "\n", 250 | "# Function will scrape each user, attempting to pull max_tweet amount, and create one csv/excel file containing all data name multi-user-tweets.\n", 251 | "scrape_multiple_users_singlefile(user_name_list, max_tweets_per)" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "### Query by Text Search\n", 259 | "I created a function to build off of for scraping tweets by text search.\n", 260 | "\n", 261 | "#### F1. 
scrape_text_query\n", 262 | "This function scrapes tweets from Twitter based on the text search and exports the data as a csv or excel file" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 44, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "def scrape_text_query(text_query, count):\n", 272 | " # Creation of query object\n", 273 | " tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\\\n", 274 | " .setMaxTweets(count)\n", 275 | " # Creation of list that contains all tweets\n", 276 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 277 | "\n", 278 | " # Creating list of chosen tweet data\n", 279 | " # Add or remove tweet information you want in the below list comprehension\n", 280 | " tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 281 | " tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, \n", 282 | " tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 283 | "\n", 284 | " # Creation of dataframe from tweets\n", 285 | " # Add or remove columns as you remove tweet information\n", 286 | " tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 287 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])\n", 288 | " \n", 289 | " # Removing timezone information to allow excel file download\n", 290 | " tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 291 | " \n", 292 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 293 | " tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index = False)\n", 294 | "# tweets_df.to_excel('{}-tweets.xlsx'.format(text_query), index = False)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "# Input search query to scrape tweets and name csv file\n", 304 | "text_query = 'Coronavirus'\n", 305 | "\n", 306 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 307 | "max_tweets = 150\n", 308 | "\n", 309 | "# Function scrapes for tweets containing text_query, attempting to pull max_tweet amount and create csv/excel file containing data.\n", 310 | "scrape_text_query(text_query, max_tweets)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "## 2. Getting User Information From Tweets\n", 318 | "[Return to Table of Contents](#TOC)\n", 319 | "
GetOldTweets3 is limited in the user information that is accessible. This library only allows access to a tweet author's username and user_id. If you want user information, I recommend using Tweepy for all of your scraping, or using Tweepy in tandem with GetOldTweets3 to play to each library's strengths." 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "## 3. Scraping Tweets With Advanced Queries\n", 327 | "[Return to Table of Contents](#TOC)\n", 328 | "
\n", 329 | "List of methods available with GetOldTweets3 to refine your queries.\n", 330 | "\n", 331 | "* setUsername(str): Setting query based on username\n", 332 | "* setMaxTweets(int): Setting maximum number of tweets to search\n", 333 | "* setQuerySearch(str): Setting query based on text\n", 334 | "* setSince(str \"yyyy-mm-dd\"): Setting lower bound date on query\n", 335 | "* setUntil(str \"yyyy-mm-dd\"): Setting upper bound date on query\n", 336 | "* setNear(str): Setting location of query search\n", 337 | "* setWithin(str): Setting radius of query search location\n", 338 | "* setLang(str): Setting language of query\n", 339 | "* setTopTweets(bool): Setting query to search only for top tweets\n", 340 | "* setEmoji(\"ignore\"/\"unicode\"/\"name\"): Setting query to search using emoji styles" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "I created two functions to build off of that utilize the different query methods available through the TweetCriteria class. As you can see you can mix and match the above methods in any way. It's important to remember that the more restrictive you make the search the more likely that a smaller amount of tweets that will come up.\n", 348 | "\n", 349 | "#### F1. scrape_advanced_queries1\n", 350 | "This function queries by using .setUsername to set the username, .setQuerySearch to set text to query for, .setSince to set the oldest date of the tweets to query, .setUntil to set the most recent date of the tweets to query, .setMaxTweets to set the amount of tweets to query for.\n", 351 | "\n", 352 | "#### F2. scrape_advanced_queries2\n", 353 | "This function queries by using .setQuerySearch, .setNear to set a location to query for tweets around, .setWithin to set a radius restriction around the chosen location, .setLang to scrape for tweets written in a specific language, .setMaxTweets" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 45, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "def scrape_advanced_queries1(username, text_query, since_date, until_date, count):\n", 363 | " # Creation of query object with as many specific queries as you want\n", 364 | " tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 365 | " .setQuerySearch(text_query).setSince(since_date)\\\n", 366 | " .setUntil(until_date).setMaxTweets(count)\n", 367 | " \n", 368 | " # Creation of list that contains all tweets\n", 369 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 370 | "\n", 371 | " # Creating list of chosen tweet data\n", 372 | " # Add or remove tweet information you want in the below list comprehension\n", 373 | " tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 374 | " tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, \n", 375 | " tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 376 | "\n", 377 | " # Creation of dataframe from tweets list\n", 378 | " # Add or remove columns as you remove tweet information\n", 379 | " tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 380 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])\n", 381 | " \n", 382 | " # Removing timezone information to allow excel file download\n", 383 | " tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 384 | 
" \n", 385 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 386 | " tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n", 387 | "# tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 46, 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "username = \"BarackObama\"\n", 397 | "text_query = \"Hello\"\n", 398 | "since_date = \"2011-01-01\"\n", 399 | "until_date = \"2016-12-20\"\n", 400 | "count = 150\n", 401 | "\n", 402 | "scrape_advanced_queries1(username, text_query, since_date, until_date, count)" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 47, 408 | "metadata": {}, 409 | "outputs": [], 410 | "source": [ 411 | "def scrape_advanced_queries2(text_query, location, radius, language, count):\n", 412 | " # Creation of query object with as many specific queries as you want\n", 413 | " tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\\\n", 414 | " .setNear(location).setWithin(radius).setLang(language).setMaxTweets(count)\n", 415 | " \n", 416 | " # Creation of list that contains all tweets\n", 417 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 418 | "\n", 419 | " # Creating list of chosen tweet data\n", 420 | " # Add or remove tweet information you want in the below list comprehension\n", 421 | " tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 422 | " tweet.replies,tweet.date, tweet.formatted_date, tweet.hashtags, \n", 423 | " tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]\n", 424 | "\n", 425 | " # Creation of dataframe from tweets list\n", 426 | " # Add or remove columns as you remove tweet information\n", 427 | " tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User','Reply to', 'Text','Retweets', 'Favorites', 'Replies', 'Datetime',\n", 428 | " 'Formatted date', 'Hashtags','Mentions','Urls','Permalink'])\n", 429 | " \n", 430 | " # Removing timezone information to allow excel file download\n", 431 | " tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 432 | " \n", 433 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 434 | " tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index = False)\n", 435 | " tweets_df.to_excel('{}-tweets.xlsx'.format(text_query), index = False)" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 48, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "text_query = \"Hola\"\n", 445 | "location = \"Mexico\"\n", 446 | "radius = \"100mi\"\n", 447 | "language = \"Spanish\"\n", 448 | "count = 150\n", 449 | "\n", 450 | "scrape_advanced_queries2(text_query, location, radius, language, count)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "## 4. Putting It All Together\n", 458 | "[Return to Table of Contents](#TOC)\n", 459 | "
\n", 460 | "Great, we now know how to pull more information from tweets and querying with advanced parameters. The great thing is how easy it is to mix and match whatever you want to search for. While it was shown above several times. The point is that you can mix and match the information you want from the tweets and the type of queries you conduct. It's just important that you update the column names in the pandas dataframe so you don't get errors.\n", 461 | "\n", 462 | "
\n", 463 | "Below is an example of a search for 150 top tweets with 'coronavirus' in it that occurred between August 5th and August 8th 2020 in Washington D.C.\n" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": 49, 469 | "metadata": {}, 470 | "outputs": [], 471 | "source": [ 472 | "text_query = 'Coronavirus'\n", 473 | "since_date = '2020-08-05'\n", 474 | "until_date = '2020-08-10'\n", 475 | "location = 'Washington, D.C.'\n", 476 | "top_tweets = True\n", 477 | "count = 150\n", 478 | "\n", 479 | "# Creation of tweetCriteria query object with methods to specify further\n", 480 | "tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query).setSince(since_date)\\\n", 481 | ".setUntil(until_date).setNear(location).setTopTweets(top_tweets).setMaxTweets(count)\n", 482 | "\n", 483 | "# Creation of tweets iterable containing all queried tweet data\n", 484 | "tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 485 | "\n", 486 | "# List comprehension pulling chosen tweet information per tweet from tweets\n", 487 | "# Add or remove tweet information you want in the below list comprehension\n", 488 | "tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites,\n", 489 | " tweet.replies,tweet.date, tweet.mentions, tweet.urls, tweet.permalink,] \n", 490 | " for tweet in tweets]\n", 491 | "\n", 492 | "# Creation of dataframe from tweets list\n", 493 | "# Add or remove columns as you remove tweet information\n", 494 | "tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Twitter User Id', 'Twitter @ Name','Reply to', 'Text','Retweets', 'Favorites', \n", 495 | " 'Replies', 'Datetime','Mentions','Urls','Permalink'])\n", 496 | "# Removing timezone information to allow excel file download\n", 497 | "tweets_df['Datetime'] = tweets_df['Datetime'].apply(lambda x: x.replace(tzinfo=None))\n", 498 | "\n", 499 | "# Uncomment/comment below lines to decide between creating csv or excel file \n", 500 | "tweets_df.to_csv('put-together-tweets.csv', sep=',', index = False)\n", 501 | "# tweets_df.to_excel('put-together-tweets.xlsx', index = False)" 502 | ] 503 | } 504 | ], 505 | "metadata": { 506 | "kernelspec": { 507 | "display_name": "Python 3", 508 | "language": "python", 509 | "name": "python3" 510 | }, 511 | "language_info": { 512 | "codemirror_mode": { 513 | "name": "ipython", 514 | "version": 3 515 | }, 516 | "file_extension": ".py", 517 | "mimetype": "text/x-python", 518 | "name": "python", 519 | "nbconvert_exporter": "python", 520 | "pygments_lexer": "ipython3", 521 | "version": "3.7.3" 522 | } 523 | }, 524 | "nbformat": 4, 525 | "nbformat_minor": 2 526 | } 527 | -------------------------------------------------------------------------------- /AdvScraper/README.md: -------------------------------------------------------------------------------- 1 | # NOTE, the following information is heavily outdated, GetOldTweets3 is no longer usable, and the Tweepy code utilizes Twitter API V1, V2 is currently used. 2 | 3 | 4 | 5 | 6 | --- 7 | --- 8 | 9 | # How to Scrape More Information From Tweets on Twitter 10 | This folder contains the jupyter notebooks for my advanced scraping tutorial published [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1 "written article"). 11 | 12 | This folder contains two subfolders based on two different Python packages I used to scrape tweets. 
Each sub-folder contains an article notebook that follows the code snippets in my article and a companion notebook that provides more code examples and easy-to-use functions. This folder also contains a third item titled Tweepy_and_GetOldTweets3.ipynb that utilizes both Python packages, so you can scrape with GetOldTweets3 while still having access to user information. 13 | 14 | The contents of this folder and its subfolders are shown below. 15 | 16 | 17 | * GetOldTweets3 18 |   * GetOldTweets3_Article_Scraper.ipynb 19 |   * GetOldTweets3_Companion_Scraper.ipynb 20 | * Tweepy 21 |   * Tweepy_Article_Scraper.ipynb 22 |   * Tweepy_Companion_Scraper.ipynb 23 |   * credentials.csv 24 | * Tweepy_and_GetOldTweets3.ipynb 25 | -------------------------------------------------------------------------------- /AdvScraper/Tweepy/Tweepy_Companion_Scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Companion Notebook for Scraping Twitter Using Tweepy\n", 8 | "\n", 9 | "Package Github: https://github.com/tweepy/tweepy\n", 10 | "\n", 11 | "Package Documentation: https://tweepy.readthedocs.io/en/latest/\n", 12 | "\n", 13 | "Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f\n", 14 | "\n", 15 | "### Notebook Author: Martin Beck\n", 16 | "#### Information current as of August 13th, 2020\n", 17 | " Dependencies: Make sure Tweepy is already installed in your Python environment. If not, you can pip install Tweepy to install the package. If you want more information on setting up, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into greater detail." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## Notebook's Table of Contents\n", 25 | "
\n", 26 | "This companion notebook is meant to build on the scraping article and article notebook as it covers more scenarios that may come up and provides more examples.\n", 27 | "\n", 28 | "0. [Credentials and Authorization](#Section0)\n", 29 | "
Setting up credentials and authorization in order to utilize Tweepy\n", 30 | "1. [Getting More Information From Tweets](#Section1)\n", 31 | "
How to scrape more information from tweets such as favorite count, retweet count, whether they're replying to someone else, the coordinates the tweet was sent from (if the user has location turned on), etc.\n", 32 | "2. [Getting User Information From Tweets](#Section2)\n", 33 | "
How to scrape user information from tweets such as their follower count, total amount of tweets, if they're a verified user, location of where account is registered, etc.\n", 34 | "3. [Scraping Tweets With Advanced Queries](#Section3)\n", 35 | "
How to scrape for tweets using deeper queries such as searching by language of tweets, tweets within a certain location, tweets within specific date ranges, top tweets, etc.\n", 36 | "4. [Putting It All Together](#Section4)\n", 37 | "
Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Imports for Notebook" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 11, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "# Pip install Tweepy if you don't already have the package\n", 54 | "# !pip install tweepy\n", 55 | "\n", 56 | "# Imports\n", 57 | "import tweepy\n", 58 | "import pandas as pd\n", 59 | "import time" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## 0. Credentials and Authorization\n", 67 | "[Return to Table of Contents](#TOC)\n", 68 | "
Tweepy requires credentials before you can utilize its API. The below code helps set up the notebook for authorization. I already have an article covering setting up Tweepy and getting credentials [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) if further instructions are needed.\n", 69 | "\n", 70 | "You don't necessarily have to create a credentials file; however, if you find yourself sharing Tweepy code with other parties, I recommend it so you don't accidentally share your credentials. Otherwise, skip the below cell and just enter your credentials hardcoded below." 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 19, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/html": [ 81 | "
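An alternative to both the CSV file and hardcoding, if you'd rather keep keys out of the project folder entirely, is to read them from environment variables; the variable names in this sketch are placeholders, not anything Tweepy itself expects:

```python
import os
import tweepy

# Keys come from the shell environment instead of a file or the notebook.
auth = tweepy.OAuthHandler(os.environ['TWITTER_CONSUMER_KEY'],
                           os.environ['TWITTER_CONSUMER_SECRET'])
auth.set_access_token(os.environ['TWITTER_ACCESS_TOKEN'],
                      os.environ['TWITTER_ACCESS_SECRET'])
api = tweepy.API(auth, wait_on_rate_limit=True)
```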
\n", 82 | "\n", 95 | "\n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | "
namekey
0consumer_keyXXXXXXXXXXX
1consumer_secretXXXXXXXXXXX
2access_tokenXXXXXXXXXXX
3access_secretXXXXXXXXXXX
\n", 126 | "
" 127 | ], 128 | "text/plain": [ 129 | " name key\n", 130 | "0 consumer_key XXXXXXXXXXX\n", 131 | "1 consumer_secret XXXXXXXXXXX\n", 132 | "2 access_token XXXXXXXXXXX\n", 133 | "3 access_secret XXXXXXXXXXX" 134 | ] 135 | }, 136 | "execution_count": 19, 137 | "metadata": {}, 138 | "output_type": "execute_result" 139 | } 140 | ], 141 | "source": [ 142 | "# Loading in from csv file\n", 143 | "\n", 144 | "credentials_df = pd.read_csv('credentials.csv',header=None,names=['name','key'])\n", 145 | "\n", 146 | "credentials_df" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 13, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# Credentials from csv file\n", 156 | "\n", 157 | "consumer_key = credentials_df.loc[credentials_df['name']=='consumer_key','key'].iloc[0]\n", 158 | "consumer_secret = credentials_df.loc[credentials_df['name']=='consumer_secret','key'].iloc[0]\n", 159 | "access_token = credentials_df.loc[credentials_df['name']=='access_token','key'].iloc[0]\n", 160 | "access_token_secret = credentials_df.loc[credentials_df['name']=='access_secret','key'].iloc[0]\n", 161 | "\n", 162 | "# Credentials hardcoded\n", 163 | "\n", 164 | "# consumer_key = \"XXXXX\"\n", 165 | "# consumer_secret = \"XXXXX\"\n", 166 | "# access_token = \"XXXXX\"\n", 167 | "# access_token_secret = \"XXXXXX\"\n", 168 | "\n", 169 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", 170 | "auth.set_access_token(access_token, access_token_secret)\n", 171 | "api = tweepy.API(auth,wait_on_rate_limit=True)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "## 1. Getting More Information From Tweets\n", 179 | "[Return to Table of Contents](#TOC)\n", 180 | "
List of information available in the tweet object with Tweepy. This is not an exhaustive list, but it does contain a majority of the available information. If you want an exhaustive list of everything contained in the tweet object, there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) describing all the attributes. \n", 181 | "\n", 182 | "String versions of Ids (e.g., id_str, in_reply_to_status_id_str) are used instead to best keep data integrity, as Ids stored as integers can be cut off.\n", 183 | "\n", 184 | "* tweet.user: User information is covered in part 2 in greater detail
\n", 185 | "\n", 186 | "* tweet.full_text: Text content of tweet when API is told to pull all contents of tweets that have more than 140 characters

\n",
 187 |         "\n",
 188 |         "* tweet.text: Text content of tweet\n",
 189 |         "* tweet.created_at: Date tweet was created\n",
 190 |         "* tweet.id_str: Id of tweet\n",
 191 |         "* tweet.user.screen_name: Username of tweet's author\n",
 192 |         "* tweet.coordinates: Geographic location as reported by user or client. May be null, which is why the extract_coordinates function below was created\n",
 193 |         "* tweet.place: Indicates the place associated with the tweet, such as Las Vegas, NV. May be null, so the extract_place function below was created\n",
 194 |         "* tweet.retweet_count: Count of retweets\n",
 195 |         "* tweet.favorite_count: Count of favorites\n",
 196 |         "* tweet.lang: Indicates a BCP 47 language identifier corresponding to machine detected language of tweet text.\n",
 197 |         "* tweet.source: Source where tweet was posted through. Ex: Twitter Web Client\n",
 198 |         "* tweet.in_reply_to_status_id_str: If a tweet is a reply, the original tweet's id. Can be null if tweet is not a reply\n",
 199 |         "* tweet.in_reply_to_user_id_str: If a tweet is a reply, string representation of original tweet's user id\n",
 200 |         "* tweet.is_quote_status: If tweet is a quote tweet"
 201 |     ]
 202 |    },
 203 |    {
 204 |     "cell_type": "markdown",
 205 |     "metadata": {},
 206 |     "source": [
 207 |         "### Query by Username\n",
 208 |         "I created three functions to build off of, based on various scenarios that are likely to come up when scraping tweets from users. After each function I call it to showcase an example of it being used.\n",
 209 |         "\n",
 210 |         "#### F0. extract_coordinates and extract_place\n",
 211 |         "These functions check whether a tweet has coordinate or place information and extract the pertinent information from its json. They are separate functions because both fields can be null, so it's important to first check that the information exists before extracting it and replacing it in the dataframe.\n",
 212 |         "\n",
 213 |         "#### F1. scrape_user_tweets\n",
 214 |         "This function scrapes a single user's tweets and exports the data as a csv or excel file\n",
 215 |         "\n",
 216 |         "#### F2. scrape_multiple_users_multifile\n",
 217 |         "This function scrapes multiple users based on a list and exports separate csv or excel files per user.\n",
 218 |         "\n",
 219 |         "#### F3. 
scrape_multiple_users_singlefile\n", 220 | "This function scrapes multiple users based on a list and exports one csv or excel file containing all tweets" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 14, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "# Function created to extract coordinates from tweet if it has coordinate info\n", 230 | "# Tweets tend to have null so important to run check\n", 231 | "# Make sure to run this cell as it is used in a lot of different functions below\n", 232 | "def extract_coordinates(row):\n", 233 | " if row['Tweet Coordinates']:\n", 234 | " return row['Tweet Coordinates']['coordinates']\n", 235 | " else:\n", 236 | " return None\n", 237 | "\n", 238 | "# Function created to extract place such as city, state or country from tweet if it has place info\n", 239 | "# Tweets tend to have null so important to run check\n", 240 | "# Make sure to run this cell as it is used in a lot of different functions below\n", 241 | "def extract_place(row):\n", 242 | " if row['Place Info']:\n", 243 | " return row['Place Info'].full_name\n", 244 | " else:\n", 245 | " return None" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 127, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "def scrape_user_tweets(username, max_tweets):\n", 255 | " # Creation of query method using parameters\n", 256 | " tweets = tweepy.Cursor(api.user_timeline,id=username).items(max_tweets)\n", 257 | "\n", 258 | " # List comprehension pulling chosen tweet information from tweets iterable object\n", 259 | " # Add or remove tweet information you want in the below list comprehension\n", 260 | " tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates,\n", 261 | " tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang,\n", 262 | " tweet.source, tweet.in_reply_to_status_id_str, \n", 263 | " tweet.in_reply_to_user_id_str, tweet.is_quote_status,\n", 264 | " ] for tweet in tweets]\n", 265 | "\n", 266 | " # Creation of dataframe from tweets_list\n", 267 | " # Add or remove columns as you remove tweet information\n", 268 | " tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info',\n", 269 | " 'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id',\n", 270 | " 'Replied Tweet User Id Str', 'Quote Status Bool'])\n", 271 | " \n", 272 | " # Checks if there are coordinates attached to tweets, if so extracts them\n", 273 | " tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)\n", 274 | " \n", 275 | " # Checks if there is place information available, if so extracts them\n", 276 | " tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)\n", 277 | " \n", 278 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 279 | " tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n", 280 | "# tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 128, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "# Creating example username to scrape from\n", 290 | "username = 'random'\n", 291 | "\n", 292 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 293 | "max_tweets = 150\n", 294 | "\n", 295 | "# Function will scrape username, attempt to pull max_tweet amount, and create 
csv/excel file from data.\n",
 296 |         "scrape_user_tweets(username,max_tweets)"
 297 |     ]
 298 |    },
 299 |    {
 300 |     "cell_type": "code",
 301 |     "execution_count": 131,
 302 |     "metadata": {},
 303 |     "outputs": [],
 304 |     "source": [
 305 |         "def scrape_multiple_users_multifile(username_list, max_tweets_per):\n",
 306 |         "    # Looping through each username in user list\n",
 307 |         "    \n",
 308 |         "    for username in username_list: \n",
 309 |         "        # Creation of query method using parameters\n",
 310 |         "        tweets = tweepy.Cursor(api.user_timeline,id=username).items(max_tweets_per)\n",
 311 |         "\n",
 312 |         "        # List comprehension pulling chosen tweet information from tweets iterable object\n",
 313 |         "        # Add or remove tweet information you want in the below list comprehension\n",
 314 |         "        tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates,\n",
 315 |         "                    tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang,\n",
 316 |         "                    tweet.source, tweet.in_reply_to_status_id_str, \n",
 317 |         "                    tweet.in_reply_to_user_id_str, tweet.is_quote_status,] for tweet in tweets]\n",
 318 |         "\n",
 319 |         "        # Creation of dataframe from tweets_list\n",
 320 |         "        # Add or remove columns as you remove tweet information\n",
 321 |         "        tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info',\n",
 322 |         "                                                'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id',\n",
 323 |         "                                                'Replied Tweet User Id Str', 'Quote Status Bool'])\n",
 324 |         "    \n",
 325 |         "        # Checks if there are coordinates attached to tweets, if so extracts them\n",
 326 |         "        tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)\n",
 327 |         "    \n",
 328 |         "        # Checks if there is place information available, if so extracts them\n",
 329 |         "        tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)\n",
 330 |         "    \n",
 331 |         "        # Uncomment/comment below lines to decide between creating csv or excel file \n",
 332 |         "        tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n",
 333 |         "#         tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)"
 334 |     ]
 335 |    },
 336 |    {
 337 |     "cell_type": "code",
 338 |     "execution_count": 130,
 339 |     "metadata": {},
 340 |     "outputs": [],
 341 |     "source": [
 342 |         "# Creating example user list with 3 users\n",
 343 |         "user_name_list = ['jack','billgates','random']\n",
 344 |         "\n",
 345 |         "# Max recent tweets pulls x amount of most recent tweets from that user\n",
 346 |         "max_tweets_per = 150\n",
 347 |         "\n",
 348 |         "# Function will scrape each user, attempting to pull max_tweet amount, and create csv/excel file per user.\n",
 349 |         "scrape_multiple_users_multifile(user_name_list, max_tweets_per)"
 350 |     ]
 351 |    },
 352 |    {
 353 |     "cell_type": "code",
 354 |     "execution_count": 134,
 355 |     "metadata": {},
 356 |     "outputs": [],
 357 |     "source": [
 358 |         "def scrape_multiple_users_singlefile(username_list, max_tweets_per):\n",
 359 |         "    # Creating master list to contain all tweets\n",
 360 |         "    master_tweets_list = []\n",
 361 |         "    \n",
 362 |         "    # Looping through each username in the username_list parameter\n",
 363 |         "    for username in username_list:\n",
 364 |         "        # Creation of query method using parameters\n",
 365 |         "        tweets = tweepy.Cursor(api.user_timeline,id=username).items(max_tweets_per)\n",
 366 |         "        \n",
 367 |         "        # List comprehension pulling chosen tweet information from tweets iterable object\n",
 368 |         "        # Appending new tweets per user into the master tweet list\n",
 369 |         "        # Add or remove tweet information you want in the below 
list comprehension\n",
 370 |         "    for tweet in tweets:\n",
 371 |         "        master_tweets_list.append((tweet.text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates,\n",
 372 |         "                             tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang,\n",
 373 |         "                             tweet.source, tweet.in_reply_to_status_id_str, \n",
 374 |         "                             tweet.in_reply_to_user_id_str, tweet.is_quote_status))\n",
 375 |         "    \n",
 376 |         "    # Creation of dataframe from tweets_list\n",
 377 |         "    # Add or remove columns as you remove tweet information\n",
 378 |         "    tweets_df = pd.DataFrame(master_tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info',\n",
 379 |         "                                                'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id',\n",
 380 |         "                                                'Replied Tweet User Id Str', 'Quote Status Bool'])\n",
 381 |         "\n",
 382 |         "    # Checks if there are coordinates attached to tweets, if so extracts them\n",
 383 |         "    tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)\n",
 384 |         "    \n",
 385 |         "    # Checks if there is place information available, if so extracts them\n",
 386 |         "    tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)\n",
 387 |         "    \n",
 388 |         "    # Uncomment/comment below lines to decide between creating csv or excel file \n",
 389 |         "    tweets_df.to_csv('multi-user-tweets.csv', sep=',', index = False)\n",
 390 |         "#     tweets_df.to_excel('multi-user-tweets.xlsx', index = False)"
 391 |     ]
 392 |    },
 393 |    {
 394 |     "cell_type": "code",
 395 |     "execution_count": 133,
 396 |     "metadata": {},
 397 |     "outputs": [],
 398 |     "source": [
 399 |         "# Creating example user list with 3 users\n",
 400 |         "user_name_list = ['jack','billgates','random']\n",
 401 |         "\n",
 402 |         "# Max recent tweets pulls x amount of most recent tweets from that user\n",
 403 |         "max_tweets_per = 150\n",
 404 |         "\n",
 405 |         "# Function will scrape each user, attempting to pull max_tweet amount, and create one csv/excel file containing all data, named multi-user-tweets.\n",
 406 |         "scrape_multiple_users_singlefile(user_name_list, max_tweets_per)"
 407 |     ]
 408 |    },
 409 |    {
 410 |     "cell_type": "markdown",
 411 |     "metadata": {},
 412 |     "source": [
 413 |         "## Allowing API to Access up to 280 Characters From Tweets\n",
 414 |         "\n",
 415 |         "In the cursor parameters add tweet_mode='extended' to access tweet text that goes beyond Twitter's original 140 character limit.\n",
 416 |         "\n",
 417 |         "If tweet_mode is set to extended, the tweet attribute tweet.text becomes tweet.full_text instead.\n",
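        "\n",
        "A minimal sketch of the difference, assuming `api` is the authorized tweepy.API object from section 0 (the username is just an example):\n",
        "\n",
        "```python\n",
        "# Default mode: tweet.text may be truncated for tweets over 140 characters\n",
        "for tweet in tweepy.Cursor(api.user_timeline, id='jack').items(1):\n",
        "    print(tweet.text)\n",
        "\n",
        "# Extended mode: the untruncated text lives in tweet.full_text\n",
        "for tweet in tweepy.Cursor(api.user_timeline, id='jack', tweet_mode='extended').items(1):\n",
        "    print(tweet.full_text)\n",
        "```"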
418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 17, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "def scrape_extended_tweets(username, max_tweets):\n", 427 | " # Creation of query method using parameters\n", 428 | " tweets = tweepy.Cursor(api.user_timeline,id=username, tweet_mode='extended').items(max_tweets)\n", 429 | "\n", 430 | " # List comprehension pulling chosen tweet information from tweets iterable object\n", 431 | " # Add or remove tweet information you want in the below list comprehension\n", 432 | " tweets_list = [[tweet.full_text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates,\n", 433 | " tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang,\n", 434 | " tweet.source, tweet.in_reply_to_status_id_str, \n", 435 | " tweet.in_reply_to_user_id_str, tweet.is_quote_status,\n", 436 | " ] for tweet in tweets]\n", 437 | "\n", 438 | " # Creation of dataframe from tweets_list\n", 439 | " # Add or remove columns as you remove tweet information\n", 440 | " tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info',\n", 441 | " 'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id',\n", 442 | " 'Replied Tweet User Id Str', 'Quote Status Bool'])\n", 443 | " \n", 444 | " # Checks if there are coordinates attached to tweets, if so extracts them\n", 445 | " tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)\n", 446 | " \n", 447 | " # Checks if there is place information available, if so extracts them\n", 448 | " tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)\n", 449 | " \n", 450 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 451 | " tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n", 452 | "# tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 18, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "# Creating example username to scrape from\n", 462 | "username = 'billgates'\n", 463 | "\n", 464 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 465 | "max_tweets = 150\n", 466 | "\n", 467 | "# Function will scrape username, attempt to pull max_tweet amount, and create csv/excel file from data.\n", 468 | "scrape_extended_tweets(username,max_tweets)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### Query by Text Search\n", 476 | "I created one function to build off of for scraping tweets by text search.\n", 477 | "\n", 478 | "#### F1. 
scrape_text_query\n",
 479 |         "This function scrapes tweets from Twitter based on a text search and exports the data as a csv or excel file"
 480 |     ]
 481 |    },
 482 |    {
 483 |     "cell_type": "code",
 484 |     "execution_count": 9,
 485 |     "metadata": {},
 486 |     "outputs": [],
 487 |     "source": [
 488 |         "def scrape_text_query(text_query, max_tweets):\n",
 489 |         "    # Creation of query method using parameters\n",
 490 |         "    tweets = tweepy.Cursor(api.search,q=text_query, tweet_mode='extended').items(max_tweets)\n",
 491 |         "\n",
 492 |         "    # List comprehension pulling chosen tweet information from tweets iterable object\n",
 493 |         "    # Add or remove tweet information you want in the below list comprehension\n",
 494 |         "    tweets_list = [[tweet.full_text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates,\n",
 495 |         "                tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang,\n",
 496 |         "                tweet.source, tweet.in_reply_to_status_id_str, \n",
 497 |         "                tweet.in_reply_to_user_id_str, tweet.is_quote_status,\n",
 498 |         "                ] for tweet in tweets]\n",
 499 |         "\n",
 500 |         "    # Creation of dataframe from tweets_list\n",
 501 |         "    # Add or remove columns as you remove tweet information\n",
 502 |         "    tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info',\n",
 503 |         "                                                'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id',\n",
 504 |         "                                                'Replied Tweet User Id Str', 'Quote Status Bool'])\n",
 505 |         "\n",
 506 |         "    # Checks if there are coordinates attached to tweets, if so extracts them\n",
 507 |         "    tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)\n",
 508 |         "    \n",
 509 |         "    # Checks if there is place information available, if so extracts them\n",
 510 |         "    tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)\n",
 511 |         "\n",
 512 |         "    # Uncomment/comment below lines to decide between creating csv or excel file \n",
 513 |         "    tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index = False)\n",
 514 |         "#     tweets_df.to_excel('{}-tweets.xlsx'.format(text_query), index = False)"
 515 |     ]
 516 |    },
 517 |    {
 518 |     "cell_type": "code",
 519 |     "execution_count": 10,
 520 |     "metadata": {},
 521 |     "outputs": [],
 522 |     "source": [
 523 |         "# Input search query to scrape tweets and name csv file\n",
 524 |         "text_query = 'Coronavirus'\n",
 525 |         "\n",
 526 |         "# Max recent tweets pulls x amount of most recent tweets from that user\n",
 527 |         "max_tweets = 150\n",
 528 |         "\n",
 529 |         "# Function scrapes for tweets containing text_query, attempting to pull max_tweet amount and create csv/excel file containing data.\n",
 530 |         "scrape_text_query(text_query, max_tweets)"
 531 |     ]
 532 |    },
 533 |    {
 534 |     "cell_type": "markdown",
 535 |     "metadata": {},
 536 |     "source": [
 537 |         "## 2. Getting User Information From Tweets\n",
 538 |         "[Return to Table of Contents](#TOC)\n",
 539 |         "\n",
 540 |         "Tweepy excels in this category, having more access to user information than GetOldTweets3.\n",
 541 |         "
List of information available in the user object with Tweepy. This is not an exhaustive list, but it does contain a majority of the available information. If you want an exhaustive list of everything contained in the user object, there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/user-object) describing all the attributes. \n",
 542 |         "\n",
 543 |         "String versions of Ids (e.g., id_str, user.id_str) are used to preserve data integrity, since Ids stored as integers can be truncated.\n",
 544 |         "\n",
 545 |         "* tweet.text: Text content of tweet\n",
 546 |         "* tweet.created_at: Date tweet was created\n",
 547 |         "* tweet.id_str: Id of tweet\n",
 548 |         "* tweet.user.name: Name of the user as they've defined it\n",
 549 |         "* tweet.user.screen_name: Username of tweet's author, commonly called User @ name\n",
 550 |         "* tweet.user.id_str: User id of tweet's author\n",
 551 |         "* tweet.user.location: User defined location for account's profile. Can be nullable\n",
 552 |         "* tweet.user.url: URL provided by user in bio. Can be nullable\n",
 553 |         "* tweet.user.description: Text in user bio. Can be nullable\n",
 554 |         "* tweet.user.verified: Boolean indicating whether user has a verified account\n",
 555 |         "* tweet.user.followers_count: Count of followers user has\n",
 556 |         "* tweet.user.friends_count: Count of other users that user is following\n",
 557 |         "* tweet.user.favourites_count: Count of tweets user has liked in the account's lifetime\n",
 558 |         "* tweet.user.statuses_count: Count of tweets (including retweets) issued by user\n",
 559 |         "* tweet.user.listed_count: Count of public lists that user is member of\n",
 560 |         "* tweet.user.created_at: Date that the user account was created on Twitter\n",
 561 |         "* tweet.user.profile_image_url_https: HTTPS-based URL pointing to user's profile image\n",
 562 |         "* tweet.user.default_profile: When true, indicates user has not altered the theme or background of user profile\n",
 563 |         "* tweet.user.default_profile_image: When true, indicates if user has not uploaded their own profile image and default image is used instead\n",
 564 |         "\n",
 565 |         "### Query by Text Search\n",
 566 |         "I created one function to build off of that searches by text and pulls all the user information available.\n",
 567 |         "\n",
 568 |         "#### F1. 
scrape_user_information\n", 569 | "This function scrapes tweets from Twitter based on the text search, pulls user information and exports the data as a csv or excel file" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": 137, 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "def scrape_user_information(text_query, max_tweets):\n", 579 | " # Creation of query method using parameters\n", 580 | " tweets = tweepy.Cursor(api.search,q=text_query).items(max_tweets)\n", 581 | "\n", 582 | " # List comprehension pulling chosen tweet information from tweets iterable object\n", 583 | " # Add or remove tweet information you want in the below list comprehension\n", 584 | " tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.name, tweet.user.screen_name, \n", 585 | " tweet.user.id_str, tweet.user.location, tweet.user.url,\n", 586 | " tweet.user.description, tweet.user.verified, tweet.user.followers_count,\n", 587 | " tweet.user.friends_count, tweet.user.favourites_count, tweet.user.statuses_count,\n", 588 | " tweet.user.listed_count, tweet.user.created_at, tweet.user.profile_image_url_https,\n", 589 | " tweet.user.default_profile, tweet.user.default_profile_image] for tweet in tweets]\n", 590 | "\n", 591 | " # Creation of dataframe from tweets_list\n", 592 | " # Add or remove columns as you remove tweet information\n", 593 | " tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter Username', 'Twitter @ name',\n", 594 | " 'Twitter User Id', 'Twitter User Location', 'URL in Bio', 'Twitter Bio',\n", 595 | " 'User Verified Status', 'Users Following Count',\n", 596 | " 'Number users this account is following', 'Users Number of Likes', 'Users Tweet Count',\n", 597 | " 'Lists Containing User', 'Account Created Time', 'Profile Image URL',\n", 598 | " 'User Default Profile', 'User Default Profile Image'])\n", 599 | "\n", 600 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 601 | " tweets_df.to_csv('{}-userinfo-tweets.csv'.format(text_query), sep=',', index = False)\n", 602 | " # tweets_df.to_excel('{}-userinfo-tweets.xlsx'.format(text_query), index = False)" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 138, 608 | "metadata": { 609 | "scrolled": false 610 | }, 611 | "outputs": [], 612 | "source": [ 613 | "# Input search query to scrape tweets and name csv file\n", 614 | "text_query = 'Coronavirus'\n", 615 | "\n", 616 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 617 | "max_tweets = 150\n", 618 | "\n", 619 | "# Function scrapes for tweets containing text_query, attempting to pull max_tweet amount and create csv/excel file containing data.\n", 620 | "scrape_user_information(text_query, max_tweets)" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "## 3. Scraping Tweets With Advanced Queries\n", 628 | "[Return to Table of Contents](#TOC)\n", 629 | "
List of query methods available with Tweepy. This is not an exhaustive list but does contain a majority of the methods available. If you want an exhaustive list of everything available there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets).\n",
 630 |         "\n",
 631 |         "* q = str: Setting query based on text\n",
 632 |         "* geocode = str \"lat,long,radius\": Setting location of query and radius\n",
 633 |         "* lang = str: Setting language of query, full list of language codes [here](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)\n",
 634 |         "* result_type = str \"mixed\"/\"recent\"/\"popular\": Setting popularity preference of query\n",
 635 |         "* until = str \"yyyy-mm-dd\": Setting upper bound date on query, if using standard search API be cognizant of 7-day limit\n",
 636 |         "* since_id = str or int: Returns results with Id's more recent than given Id\n",
 637 |         "* max_id = str or int: Returns results with Id's older than given Id\n",
 638 |         "* count = int: Number of tweets to return per page. Max is 100, defaults to 15"
 639 |     ]
 640 |    },
 641 |    {
 642 |     "cell_type": "markdown",
 643 |     "metadata": {},
 644 |     "source": [
 645 |         "I created a function to build off of that utilizes the different query methods available with Tweepy. As you can see, you can mix and match the above methods in any way. It's important to remember that the more restrictive you make the search, the fewer tweets are likely to come up.\n",
 646 |         "\n",
 647 |         "#### F1. scrape_advanced_queries\n",
 648 |         "This function queries by using geocode to set a location to query for tweets and restrict within a certain radius, lang to scrape for tweets written in a specific language, result_type to search for tweets based on popularity, until to set an upper bound date on tweets, since_id to set a restriction on the oldest id possible, and max_id to set a restriction on the earliest id possible"
 649 |     ]
 650 |    },
 651 |    {
 652 |     "cell_type": "code",
 653 |     "execution_count": 139,
 654 |     "metadata": {},
 655 |     "outputs": [],
 656 |     "source": [
 657 |         "def scrape_advanced_queries(coordinates, language, result_type, until_date, max_tweets):\n",
 658 |         "    # Creation of query method using parameters\n",
 659 |         "    tweets = tweepy.Cursor(api.search, geocode=coordinates, lang=language, result_type = result_type, \n",
 660 |         "                       until = until_date, count = 100).items(max_tweets)\n",
 661 |         "\n",
 662 |         "    # List comprehension pulling chosen tweet information from tweets iterable object\n",
 663 |         "    # Add or remove tweet information you want in the below list comprehension\n",
 664 |         "    tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.favorite_count, tweet.user.screen_name, \n",
 665 |         "                tweet.user.id_str, tweet.user.location, tweet.user.url, \n",
 666 |         "                tweet.user.verified, tweet.user.followers_count,\n",
 667 |         "                tweet.user.friends_count, tweet.user.statuses_count,\n",
 668 |         "                tweet.user.default_profile_image, tweet.lang] for tweet in tweets]\n",
 669 |         "\n",
 670 |         "    # Creation of dataframe from tweets_list\n",
 671 |         "    # Add or remove columns as you remove tweet information\n",
 672 |         "    tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Tweet Favorite Count', 'Twitter @ name',\n",
 673 |         "                                                'Twitter User Id', 'Twitter User Location', 'URL in Bio','User Verified Status', 'Users Current Following Count',\n",
 674 |         "                                                'Number of accounts user is following', 'Users Tweet Count',\n",
 675 |         "                                                'Profile Image URL','Tweet 
Language'])\n", 676 | " \n", 677 | " # Uncomment/comment below lines to decide between creating csv or excel file \n", 678 | " tweets_df.to_csv('advancedqueries-tweets.csv', sep=',', index = False)\n", 679 | " # tweets_df.to_excel('advancedqueries-tweets.xlsx', index = False)" 680 | ] 681 | }, 682 | { 683 | "cell_type": "code", 684 | "execution_count": 140, 685 | "metadata": {}, 686 | "outputs": [], 687 | "source": [ 688 | "# Example may no longer show tweets if until_date falls outside \n", 689 | "# of 7-day period from when you run cell\n", 690 | "\n", 691 | "coordinates = '19.402833,-99.141051,50mi'\n", 692 | "language = 'es'\n", 693 | "result_type = 'recent'\n", 694 | "until_date = '2020-08-10'\n", 695 | "max_tweets = 150\n", 696 | "\n", 697 | "scrape_advanced_queries(coordinates, language, result_type, until_date, max_tweets)" 698 | ] 699 | }, 700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "## 4. Putting It All Together\n", 705 | "[Return to Table of Contents](#TOC)\n", 706 | "
\n",
 707 |         "Great, we now know how to pull more information from tweets and how to query with advanced parameters. The great thing, as was shown above several times, is how easy it is to mix and match: you can combine whatever information you want from the tweets with whatever type of query you conduct. It's just important that you update the column names in the pandas dataframe so you don't get errors.\n",
 708 |         "\n",
 709 |         "<br>
\n",
 710 |         "Below is an example of a search for 150 tweets containing 'Coronavirus' that occurred within a 50 mile radius of Las Vegas, NV, which in this case has the geo coordinates of lat 36.169786, long -115.139858.\n"
 711 |     ]
 712 |    },
 713 |    {
 714 |     "cell_type": "code",
 715 |     "execution_count": 142,
 716 |     "metadata": {},
 717 |     "outputs": [],
 718 |     "source": [
 719 |         "text_query = 'Coronavirus'\n",
 720 |         "coordinates = '36.169786,-115.139858,50mi'\n",
 721 |         "max_tweets = 150\n",
 722 |         "\n",
 723 |         "# Creation of query method using parameters\n",
 724 |         "tweets = tweepy.Cursor(api.search, q = text_query, geocode = coordinates, count = 100).items(max_tweets)\n",
 725 |         "\n",
 726 |         "# List comprehension pulling chosen tweet information from tweets iterable object\n",
 727 |         "# Add or remove tweet information you want in the below list comprehension\n",
 728 |         "tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.favorite_count, tweet.user.screen_name, \n",
 729 |         "            tweet.user.id_str, tweet.user.location, tweet.user.followers_count, tweet.coordinates, tweet.place] for tweet in tweets]\n",
 730 |         "\n",
 731 |         "# Creation of dataframe from tweets_list\n",
 732 |         "# Add or remove columns as you remove tweet information\n",
 733 |         "tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Tweet Favorite Count', 'Twitter @ name',\n",
 734 |         "                                          'Twitter User Id', 'Twitter User Location', 'Users Current Following Count', 'Tweet Coordinates', 'Place Info'])\n",
 735 |         "\n",
 736 |         "# Checks if there are coordinates attached to tweets, if so extracts them\n",
 737 |         "tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)\n",
 738 |         "    \n",
 739 |         "# Checks if there is place information available, if so extracts them\n",
 740 |         "tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)\n",
 741 |         "\n",
 742 |         "# Uncomment/comment below lines to decide between creating csv or excel file \n",
 743 |         "tweets_df.to_csv('put-together-tweets.csv', sep=',', index = False)\n",
 744 |         "# tweets_df.to_excel('put-together-tweets.xlsx', index = False)"
 745 |     ]
 746 |    }
 747 |   ],
 748 |   "metadata": {
 749 |    "kernelspec": {
 750 |     "display_name": "Python 3",
 751 |     "language": "python",
 752 |     "name": "python3"
 753 |    },
 754 |    "language_info": {
 755 |     "codemirror_mode": {
 756 |      "name": "ipython",
 757 |      "version": 3
 758 |     },
 759 |     "file_extension": ".py",
 760 |     "mimetype": "text/x-python",
 761 |     "name": "python",
 762 |     "nbconvert_exporter": "python",
 763 |     "pygments_lexer": "ipython3",
 764 |     "version": "3.7.3"
 765 |    }
 766 |   },
 767 |  "nbformat": 4,
 768 |  "nbformat_minor": 2
 769 | }
 770 | 
--------------------------------------------------------------------------------
/AdvScraper/Tweepy/credentials.csv:
--------------------------------------------------------------------------------
1 | consumer_key,XXXXXX
2 | consumer_secret,XXXXXX
3 | access_token,XXXXXX
4 | access_secret,XXXXXX
5 | 
--------------------------------------------------------------------------------
/AdvScraper/Tweepy_and_GetOldTweets3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Notebook for Scraping Twitter With Tweepy and GetOldTweets3\n",
8 |     "\n",
9 |     "Tweepy Package Github: https://github.com/tweepy/tweepy\n",
10 |     "\n",
11 |     "GetOldTweets3 Package Github: https://github.com/Mottl/GetOldTweets3\n",
12 |     "\n",
13 |     "Tweepy Package Documentation: 
https://tweepy.readthedocs.io/en/latest/\n",
 14 |     "\n",
 15 |     "Article Read-Along: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f\n",
 16 |     "\n",
 17 |     "### Notebook Author: Martin Beck\n",
 18 |     "#### Information current as of August, 13th 2020\n",
 19 |     " Dependencies: Make sure Tweepy and GetOldTweets3 are already installed in your Python environment. If not, you can pip install both packages. If you want more information on setting up, I have an article [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) that goes into deeper detail."
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "## Notebook's Table of Contents\n",
 27 |     "\n",
 28 |     "0. [Credentials and Authorization](#Section0)\n",
 29 |     "
Setting up credentials and authorization in order to utilize Tweepy\n", 30 | "1. [Available Methods With Tweepy](#Section1)\n", 31 | "
Methods available with Tweepy to pull more information\n", 32 | "2. [How to Use Tweepy With GetOldTweets3](#Section2)\n", 33 | "
Examples of Tweepy's methods and how to use them on your datasets."
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "markdown",
 38 |    "metadata": {},
 39 |    "source": [
 40 |     "## Imports for Notebook"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": 1,
 46 |    "metadata": {},
 47 |    "outputs": [],
 48 |    "source": [
 49 |     "# Pip install Tweepy if you don't already have the package\n",
 50 |     "# !pip install tweepy\n",
 51 |     "\n",
 52 |     "# Imports\n",
 53 |     "import tweepy\n",
 54 |     "import pandas as pd\n",
 55 |     "import GetOldTweets3 as got\n",
 56 |     "import time"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "markdown",
 61 |    "metadata": {},
 62 |    "source": [
 63 |     "## 0. Credentials and Authorization\n",
 64 |     "[Return to Table of Contents](#TOC)\n",
 65 |     "
Tweepy requires credentials before you can utilize its API. The below code helps set up the notebook for authorization. I already have an article covering setting up Tweepy and getting credentials [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) if further instructions are needed.\n",
 66 |     "\n",
 67 |     "You don't necessarily have to create a credentials file; however, if you find yourself sharing Tweepy code with other parties, I recommend one so you don't accidentally share your credentials. Otherwise, skip the below cell and just hardcode your credentials below."
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "code",
 72 |    "execution_count": 44,
 73 |    "metadata": {},
 74 |    "outputs": [
 75 |     {
 76 |      "data": {
 77 |       "text/html": [
 78 |        "<div>
\n", 79 | "\n", 92 | "\n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | "
namekey
0consumer_keyXXXXXX
1consumer_secretXXXXXX
2access_tokenXXXXXX
3access_secretXXXXXX
\n", 123 | "
" 124 | ], 125 | "text/plain": [ 126 | " name key\n", 127 | "0 consumer_key XXXXXX\n", 128 | "1 consumer_secret XXXXXX\n", 129 | "2 access_token XXXXXX\n", 130 | "3 access_secret XXXXXX" 131 | ] 132 | }, 133 | "execution_count": 44, 134 | "metadata": {}, 135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "# Loading in from csv file\n", 140 | "\n", 141 | "credentials_df = pd.read_csv('credentials.csv',header=None,names=['name','key'])\n", 142 | "\n", 143 | "credentials_df" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 3, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "# Credentials from csv file\n", 153 | "\n", 154 | "consumer_key = credentials_df.loc[credentials_df['name']=='consumer_key','key'].iloc[0]\n", 155 | "consumer_secret = credentials_df.loc[credentials_df['name']=='consumer_secret','key'].iloc[0]\n", 156 | "access_token = credentials_df.loc[credentials_df['name']=='access_token','key'].iloc[0]\n", 157 | "access_token_secret = credentials_df.loc[credentials_df['name']=='access_secret','key'].iloc[0]\n", 158 | "\n", 159 | "# Credentials hardcoded\n", 160 | "\n", 161 | "# consumer_key = \"XXXXX\"\n", 162 | "# consumer_secret = \"XXXXX\"\n", 163 | "# access_token = \"XXXXX\"\n", 164 | "# access_token_secret = \"XXXXXX\"\n", 165 | "\n", 166 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", 167 | "auth.set_access_token(access_token, access_token_secret)\n", 168 | "api = tweepy.API(auth,wait_on_rate_limit=True)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## 1. Available Methods With Tweepy\n", 176 | "[Return to Table of Contents](#TOC)\n", 177 | "
For the most part there are only two relevant methods. If you're curious about what else you can do with Tweepy, the documentation is available [here](http://docs.tweepy.org/en/latest/api.html#search-methods). \n",
 178 |     "\n",
 179 |     "The relevant methods are api.get_status and api.get_user\n",
 180 |     "\n",
 181 |     "api.get_status provides you with access to Tweepy's tweet object, which by default also includes user information.\n",
 182 |     "\n",
 183 |     "api.get_user only provides you with user information. \n",
 184 |     "\n",
 185 |     "You can use either if you only care about accessing user data since they both contain it. However, if you want access to tweet information that is only available with Tweepy, such as tweet.in_reply_to_user_id_str, I'd recommend using api.get_status"
 186 |    ]
 187 |   },
 188 |   {
 189 |    "cell_type": "markdown",
 190 |    "metadata": {},
 191 |    "source": [
 192 |     "## 2. How to Use Tweepy With GetOldTweets3\n",
 193 |     "[Return to Table of Contents](#TOC)\n",
 194 |     "\n",
 195 |     "Below is a list of information accessible in both Tweepy's tweet and user object. This is not an exhaustive list for either object. If you want an exhaustive list of everything contained in Tweepy's tweet object there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object). If you want an exhaustive list of everything contained in Tweepy's user object there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/user-object). \n",
 196 |     "\n",
 197 |     "* tweet.coordinates: Geographic location as reported by user or client. May be null, which is why the extract_coordinates function below was created\n",
 198 |     "* tweet.place: Indicates the place associated with the tweet, such as Las Vegas, NV. May be null, so the extract_place function below was created\n",
 199 |     "* tweet.lang: Indicates a BCP 47 language identifier corresponding to machine detected language of tweet text.\n",
 200 |     "* tweet.source: Source where tweet was posted through. Ex: Twitter Web Client\n",
 201 |     "* tweet.in_reply_to_status_id_str: If a tweet is a reply, the original tweet's id. Can be null if tweet is not a reply\n",
 202 |     "* tweet.in_reply_to_user_id_str: If a tweet is a reply, string representation of original tweet's user id\n",
 203 |     "* tweet.user.location: User defined location for account's profile. Can be nullable\n",
 204 |     "* tweet.user.url: URL provided by user in bio. Can be nullable\n",
 205 |     "* tweet.user.description: Text in user bio. 
Can be nullable\n",
 206 |     "* tweet.user.verified: Boolean indicating whether user has a verified account\n",
 207 |     "* tweet.user.followers_count: Count of followers user has\n",
 208 |     "* tweet.user.friends_count: Count of other users that user is following\n",
 209 |     "* tweet.user.favourites_count: Count of tweets user has liked in the account's lifetime\n",
 210 |     "* tweet.user.statuses_count: Count of tweets (including retweets) issued by user\n",
 211 |     "* tweet.user.listed_count: Count of public lists that user is member of\n",
 212 |     "* tweet.user.created_at: Date that the user account was created on Twitter\n",
 213 |     "* tweet.user.profile_image_url_https: HTTPS-based URL pointing to user's profile image\n",
 214 |     "* tweet.user.default_profile: When true, indicates user has not altered the theme or background of user profile\n",
 215 |     "* tweet.user.default_profile_image: When true, indicates if user has not uploaded their own profile image and default image is used instead\n",
 216 |     "\n",
 217 |     "Remember that Tweepy still has its request limitations, meaning that running these requests on larger datasets may take time. I've run this workaround on a smaller dataset of 5k tweets and it took me 1-2hrs to finish running. It's up to you whether you'd rather let your computer spend time running for free or spend money on using Twitter's Premium/Enterprise APIs to work with bigger datasets."
 218 |    ]
 219 |   },
 220 |   {
 221 |    "cell_type": "markdown",
 222 |    "metadata": {},
 223 |    "source": [
 224 |     "### Preparation\n",
 225 |     "\n",
 226 |     "To use Tweepy with GetOldTweets3 there is a little bit of preparation required. Depending on whether you're using the api.get_status or api.get_user method, you'll need to have the relevant information available.\n",
 227 |     "\n",
 228 |     "In the case of api.get_status, make sure you use GOT3 to scrape for tweet.id\n",
 229 |     "\n",
 230 |     "In the case of api.get_user, make sure you use GOT3 to scrape for either tweet.author_id or tweet.username\n",
 231 |     "\n",
 232 |     "I'll showcase this below.\n",
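    "\n",
    "One aside tied to the request limitations mentioned above: the walkthrough below calls api.get_status once per tweet, which is what makes larger datasets slow. A hedged sketch of a batched alternative, assuming Tweepy 3.x, where api.statuses_lookup accepts up to 100 tweet ids per request (it was renamed lookup_statuses in Tweepy 4):\n",
    "\n",
    "```python\n",
    "# Hydrate scraped tweet ids in chunks of 100 instead of one get_status call each\n",
    "tweet_ids = tweets_df['Tweet Id'].tolist()  # tweets_df comes from the GOT3 scrape below\n",
    "hydrated_tweets = []\n",
    "for i in range(0, len(tweet_ids), 100):\n",
    "    hydrated_tweets.extend(api.statuses_lookup(tweet_ids[i:i + 100]))\n",
    "```"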
233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 4, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "text_query = 'Hello'\n", 242 | "since_date = \"2020-7-20\"\n", 243 | "until_date = \"2020-7-21\"\n", 244 | "\n", 245 | "count = 150\n", 246 | " \n", 247 | "# Creation of tweetCriteria query object with methods to specify further\n", 248 | "tweetCriteria = got.manager.TweetCriteria()\\\n", 249 | ".setQuerySearch(text_query).setSince(since_date)\\\n", 250 | ".setUntil(until_date).setMaxTweets(count)\n", 251 | " \n", 252 | "# Creation of tweets iterable containing all queried tweet data\n", 253 | "tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 254 | " \n", 255 | "# List comprehension pulling chosen tweet information from tweets\n", 256 | "# Add or remove tweet information you want in the below list comprehension\n", 257 | "tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date] for tweet in tweets]\n", 258 | " \n", 259 | "# Creation of dataframe from tweets list\n", 260 | "# Add or remove columns as you remove tweet information\n", 261 | "tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User', 'Text',\n", 262 | " 'Retweets', 'Favorites', 'Replies', 'Datetime'])" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "### I scraped with GetOldTweets3 making sure that I have tweet.id, and tweet.author_id or tweet.username." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 41, 275 | "metadata": { 276 | "scrolled": true 277 | }, 278 | "outputs": [ 279 | { 280 | "data": { 281 | "text/html": [ 282 | "
\n", 283 | "\n", 296 | "\n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | "
Tweet IdTweet User IdTweet UserTextRetweetsFavoritesRepliesDatetime
012853638588323635201182717701203972096workinclassbirdfriend..... hello0312020-07-20 23:59:59+00:00
112853638572429475841183184898070405120Soap_The_Scrubhello yes i interacted0432020-07-20 23:59:59+00:00
21285363856202698753844768299388813314kuroslaysHello lew,0002020-07-20 23:59:58+00:00
312853638560559513631214501518646247425bubsjiim nervous HELLO0002020-07-20 23:59:58+00:00
41285363852851511301811267164476841984realJakeLoganButt Stallion says hello neck gaiter0002020-07-20 23:59:58+00:00
\n", 368 | "
" 369 | ], 370 | "text/plain": [ 371 | " Tweet Id Tweet User Id Tweet User \\\n", 372 | "0 1285363858832363520 1182717701203972096 workinclassbird \n", 373 | "1 1285363857242947584 1183184898070405120 Soap_The_Scrub \n", 374 | "2 1285363856202698753 844768299388813314 kuroslays \n", 375 | "3 1285363856055951363 1214501518646247425 bubsji \n", 376 | "4 1285363852851511301 811267164476841984 realJakeLogan \n", 377 | "\n", 378 | " Text Retweets Favorites Replies \\\n", 379 | "0 friend..... hello 0 3 1 \n", 380 | "1 hello yes i interacted 0 4 3 \n", 381 | "2 Hello lew, 0 0 0 \n", 382 | "3 im nervous HELLO 0 0 0 \n", 383 | "4 Butt Stallion says hello neck gaiter 0 0 0 \n", 384 | "\n", 385 | " Datetime \n", 386 | "0 2020-07-20 23:59:59+00:00 \n", 387 | "1 2020-07-20 23:59:59+00:00 \n", 388 | "2 2020-07-20 23:59:58+00:00 \n", 389 | "3 2020-07-20 23:59:58+00:00 \n", 390 | "4 2020-07-20 23:59:58+00:00 " 391 | ] 392 | }, 393 | "execution_count": 41, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "# Taking a quick look at the data scraped\n", 400 | "tweets_df.head()" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "### Alright now we have our data, let's look at a row for information to test how api.get_status and api.get_user work." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 6, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "data": { 417 | "text/plain": [ 418 | "Tweet Id 1285363852851511301\n", 419 | "Tweet User Id 811267164476841984\n", 420 | "Tweet User realJakeLogan\n", 421 | "Text Butt Stallion says hello neck gaiter \n", 422 | "Retweets 0\n", 423 | "Favorites 0\n", 424 | "Replies 0\n", 425 | "Datetime 2020-07-20 23:59:58+00:00\n", 426 | "Name: 4, dtype: object" 427 | ] 428 | }, 429 | "execution_count": 6, 430 | "metadata": {}, 431 | "output_type": "execute_result" 432 | } 433 | ], 434 | "source": [ 435 | "# Using iloc to show a specific row of data\n", 436 | "tweets_df.iloc[4]" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 7, 442 | "metadata": {}, 443 | "outputs": [ 444 | { 445 | "name": "stdout", 446 | "output_type": "stream", 447 | "text": [ 448 | "Tweet Id: 1285363852851511301\n", 449 | "User Id: 811267164476841984\n", 450 | "Username: realJakeLogan\n" 451 | ] 452 | } 453 | ], 454 | "source": [ 455 | "# Printing out the relevant information for us\n", 456 | "print(\"Tweet Id: \",tweets_df.iloc[4][0])\n", 457 | "print(\"User Id: \",tweets_df.iloc[4][1])\n", 458 | "print(\"Username: \",tweets_df.iloc[4][2])" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "### Perfect now let's test get_status and get_user with the above Tweet Id, User Id, and Username." 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 48, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "api.get_status(1285363852851511301)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "### There's a lot going on with that. Remember the list from above that shows the attributes of tweet and user objects? We can use that to focus on the relevant parts." 
482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": 8, 487 | "metadata": {}, 488 | "outputs": [ 489 | { 490 | "name": "stdout", 491 | "output_type": "stream", 492 | "text": [ 493 | "Salinas Valley, CA\n", 494 | "9\n", 495 | "WordPress.com\n" 496 | ] 497 | } 498 | ], 499 | "source": [ 500 | "# Using the get_status method to request for the tweet data and pull out requested information\n", 501 | "print(api.get_status(1285363852851511301).user.location)\n", 502 | "print(api.get_status(1285363852851511301).user.followers_count)\n", 503 | "print(api.get_status(1285363852851511301).source)" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 9, 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "Salinas Valley, CA\n", 516 | "9\n" 517 | ] 518 | }, 519 | { 520 | "ename": "AttributeError", 521 | "evalue": "'User' object has no attribute 'source'", 522 | "output_type": "error", 523 | "traceback": [ 524 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 525 | "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", 526 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mapi\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_user\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m811267164476841984\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlocation\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mapi\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_user\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'realJakeLogan'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfollowers_count\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mapi\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_user\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m811267164476841984\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msource\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 527 | "\u001b[1;31mAttributeError\u001b[0m: 'User' object has no attribute 'source'" 528 | ] 529 | } 530 | ], 531 | "source": [ 532 | "print(api.get_user(811267164476841984).location)\n", 533 | "print(api.get_user('realJakeLogan').followers_count)\n", 534 | "\n", 535 | "# Should throw an error because user object only has user information\n", 536 | "print(api.get_user(811267164476841984).source)" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "As you can see user information is available with either method. The only difference is api.get_status requires you to enter the user keyword as seen with user.location to look at its user information whereas api.get_user only requires .location because it is the user information. That's why we see the error above with looking at the source information with api.get_user because there is no tweet information.\n", 544 | "\n", 545 | "Lastly, as you can see api.get_user is able to use either User Id or a Twitter Username to pull up user information.\n", 546 | "\n", 547 | "These methods are great, but using it on a single item is only good for testing. 
The power really comes in when you can create a function allowing you to use it with a whole dataset." 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 10, 553 | "metadata": {}, 554 | "outputs": [], 555 | "source": [ 556 | "# Creating copy of original df to mess around with\n", 557 | "tweet_df_test = tweets_df.copy()" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 20, 563 | "metadata": {}, 564 | "outputs": [], 565 | "source": [ 566 | "# Creating functions to request tweet or user information and extract them\n", 567 | "def extract_tweepy_tweet_info(row):\n", 568 | " tweet = api.get_status(row['Tweet Id'])\n", 569 | " return tweet.source\n", 570 | "\n", 571 | "def extract_tweepy_tweet_user_info(row):\n", 572 | " tweet = api.get_status(row['Tweet Id'])\n", 573 | " return tweet.user.statuses_count\n", 574 | " \n", 575 | "def extract_tweepy_user_info1(row):\n", 576 | " user = api.get_user(row['Tweet User Id'])\n", 577 | " return user.followers_count\n", 578 | "\n", 579 | "def extract_tweepy_user_info2(row):\n", 580 | " user = api.get_user(row['Tweet User'])\n", 581 | " return user.verified" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 21, 587 | "metadata": { 588 | "scrolled": true 589 | }, 590 | "outputs": [], 591 | "source": [ 592 | "# Setting new columns to be equal to the returned data from each function\n", 593 | "tweet_df_test['Tweet Source'] = tweet_df_test.apply(extract_tweepy_tweet_info,axis=1)\n", 594 | "tweet_df_test['Tweets Count'] = tweet_df_test.apply(extract_tweepy_tweet_user_info,axis=1)\n", 595 | "tweet_df_test['Follower Count'] = tweet_df_test.apply(extract_tweepy_user_info1,axis=1)\n", 596 | "tweet_df_test['Verified Status'] = tweet_df_test.apply(extract_tweepy_user_info2,axis=1)" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 26, 602 | "metadata": {}, 603 | "outputs": [ 604 | { 605 | "data": { 606 | "text/html": [ 607 | "
\n", 608 | "\n", 621 | "\n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | "
Tweet IdTweet User IdTweet UserTextRetweetsFavoritesRepliesDatetimeTweet SourceFollower CountVerified StatusTweets Count
012853638588323635201182717701203972096workinclassbirdfriend..... hello0312020-07-20 23:59:59+00:00Twitter for iPhone1877False561
112853638572429475841183184898070405120Soap_The_Scrubhello yes i interacted0432020-07-20 23:59:59+00:00Twitter for iPhone1265False11815
21285363856202698753844768299388813314kuroslaysHello lew,0002020-07-20 23:59:58+00:00Twitter for iPhone1201False7332
312853638560559513631214501518646247425bubsjiim nervous HELLO0002020-07-20 23:59:58+00:00Twitter for Android568False10844
41285363852851511301811267164476841984realJakeLoganButt Stallion says hello neck gaiter0002020-07-20 23:59:58+00:00WordPress.com9False147
\n", 717 | "
" 718 | ], 719 | "text/plain": [ 720 | " Tweet Id Tweet User Id Tweet User \\\n", 721 | "0 1285363858832363520 1182717701203972096 workinclassbird \n", 722 | "1 1285363857242947584 1183184898070405120 Soap_The_Scrub \n", 723 | "2 1285363856202698753 844768299388813314 kuroslays \n", 724 | "3 1285363856055951363 1214501518646247425 bubsji \n", 725 | "4 1285363852851511301 811267164476841984 realJakeLogan \n", 726 | "\n", 727 | " Text Retweets Favorites Replies \\\n", 728 | "0 friend..... hello 0 3 1 \n", 729 | "1 hello yes i interacted 0 4 3 \n", 730 | "2 Hello lew, 0 0 0 \n", 731 | "3 im nervous HELLO 0 0 0 \n", 732 | "4 Butt Stallion says hello neck gaiter 0 0 0 \n", 733 | "\n", 734 | " Datetime Tweet Source Follower Count \\\n", 735 | "0 2020-07-20 23:59:59+00:00 Twitter for iPhone 1877 \n", 736 | "1 2020-07-20 23:59:59+00:00 Twitter for iPhone 1265 \n", 737 | "2 2020-07-20 23:59:58+00:00 Twitter for iPhone 1201 \n", 738 | "3 2020-07-20 23:59:58+00:00 Twitter for Android 568 \n", 739 | "4 2020-07-20 23:59:58+00:00 WordPress.com 9 \n", 740 | "\n", 741 | " Verified Status Tweets Count \n", 742 | "0 False 561 \n", 743 | "1 False 11815 \n", 744 | "2 False 7332 \n", 745 | "3 False 10844 \n", 746 | "4 False 147 " 747 | ] 748 | }, 749 | "execution_count": 26, 750 | "metadata": {}, 751 | "output_type": "execute_result" 752 | } 753 | ], 754 | "source": [ 755 | "# Output of data\n", 756 | "tweet_df_test.head()" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "As you can see there are now four new columns added on at the end of this dataframe.\n", 764 | "\n", 765 | "It's worth noting the above code is not done efficiently in regards to time and API requests. If you find yourself using either method to access more than one piece of information for each tweet the functions above are not the best way to do so because they send one request per tweet.attribute instead of collecting several attributes for one request.\n", 766 | "\n", 767 | "If you want to access several attributes per Tweet, there's a couple ways of doing so. Either create a list, store the data in the list then add it to the dataframe. Or create a function that will create a series and return it, then use pandas to apply this method to a dataframe. I'll showcase the former as it's easier to grasp." 
768 | ]
769 | },
770 | {
771 | "cell_type": "code",
772 | "execution_count": 24,
773 | "metadata": {},
774 | "outputs": [],
775 | "source": [
776 | "# Creating a copy of the original df to experiment with\n",
777 | "tweets_df_test_efficient = tweets_df.copy()"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": 34,
783 | "metadata": {},
784 | "outputs": [],
785 | "source": [
786 | "# Creation of list to store scraped tweet data\n",
787 | "tweets_holding_list = []\n",
788 | "\n",
789 | "def extract_tweepy_tweet_info_efficient(row):\n",
790 | "    # Using Tweepy API to request tweet data\n",
791 | "    tweet = api.get_status(row['Tweet Id'])\n",
792 | "    \n",
793 | "    # Storing chosen tweet data in tweets_holding_list to be used later\n",
794 | "    tweets_holding_list.append((tweet.source, tweet.user.statuses_count, tweet.user.followers_count, tweet.user.verified))"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": null,
800 | "metadata": {},
801 | "outputs": [],
802 | "source": [
803 | "# Applying the extract_tweepy_tweet_info_efficient function to store tweet data in the tweets_holding_list\n",
804 | "tweets_df_test_efficient.apply(extract_tweepy_tweet_info_efficient, axis=1)\n",
805 | "\n",
806 | "# Creating new columns from the data currently held in tweets_holding_list\n",
807 | "tweets_df_test_efficient[['Tweet Source', 'User Tweet Count', 'Follower Count', 'User Verified Status']] = pd.DataFrame(tweets_holding_list)"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "execution_count": 43,
813 | "metadata": {},
814 | "outputs": [
815 | {
816 | "data": {
817 | "text/html": [
818 | "<div>
\n", 819 | "\n", 832 | "\n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | "
Tweet IdTweet User IdTweet UserTextRetweetsFavoritesRepliesDatetimeTweet SourceUser Tweet CountFollower CountUser Verified Status
012853638588323635201182717701203972096workinclassbirdfriend..... hello0312020-07-20 23:59:59+00:00Twitter for iPhone5611878False
112853638572429475841183184898070405120Soap_The_Scrubhello yes i interacted0432020-07-20 23:59:59+00:00Twitter for iPhone118191266False
21285363856202698753844768299388813314kuroslaysHello lew,0002020-07-20 23:59:58+00:00Twitter for iPhone73331200False
312853638560559513631214501518646247425bubsjiim nervous HELLO0002020-07-20 23:59:58+00:00Twitter for Android10861568False
41285363852851511301811267164476841984realJakeLoganButt Stallion says hello neck gaiter0002020-07-20 23:59:58+00:00WordPress.com1479False
\n", 928 | "
" 929 | ], 930 | "text/plain": [ 931 | " Tweet Id Tweet User Id Tweet User \\\n", 932 | "0 1285363858832363520 1182717701203972096 workinclassbird \n", 933 | "1 1285363857242947584 1183184898070405120 Soap_The_Scrub \n", 934 | "2 1285363856202698753 844768299388813314 kuroslays \n", 935 | "3 1285363856055951363 1214501518646247425 bubsji \n", 936 | "4 1285363852851511301 811267164476841984 realJakeLogan \n", 937 | "\n", 938 | " Text Retweets Favorites Replies \\\n", 939 | "0 friend..... hello 0 3 1 \n", 940 | "1 hello yes i interacted 0 4 3 \n", 941 | "2 Hello lew, 0 0 0 \n", 942 | "3 im nervous HELLO 0 0 0 \n", 943 | "4 Butt Stallion says hello neck gaiter 0 0 0 \n", 944 | "\n", 945 | " Datetime Tweet Source User Tweet Count \\\n", 946 | "0 2020-07-20 23:59:59+00:00 Twitter for iPhone 561 \n", 947 | "1 2020-07-20 23:59:59+00:00 Twitter for iPhone 11819 \n", 948 | "2 2020-07-20 23:59:58+00:00 Twitter for iPhone 7333 \n", 949 | "3 2020-07-20 23:59:58+00:00 Twitter for Android 10861 \n", 950 | "4 2020-07-20 23:59:58+00:00 WordPress.com 147 \n", 951 | "\n", 952 | " Follower Count User Verified Status \n", 953 | "0 1878 False \n", 954 | "1 1266 False \n", 955 | "2 1200 False \n", 956 | "3 568 False \n", 957 | "4 9 False " 958 | ] 959 | }, 960 | "execution_count": 43, 961 | "metadata": {}, 962 | "output_type": "execute_result" 963 | } 964 | ], 965 | "source": [ 966 | "# Output of data\n", 967 | "tweets_df_test_efficient.head()" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "There you go. That's all there is to it. It's more efficient to only run the api request once and pull all the information you need than to send a request for each tweet.attribute. It'll save a lot more time in the long run." 975 | ] 976 | } 977 | ], 978 | "metadata": { 979 | "kernelspec": { 980 | "display_name": "Python 3", 981 | "language": "python", 982 | "name": "python3" 983 | }, 984 | "language_info": { 985 | "codemirror_mode": { 986 | "name": "ipython", 987 | "version": 3 988 | }, 989 | "file_extension": ".py", 990 | "mimetype": "text/x-python", 991 | "name": "python", 992 | "nbconvert_exporter": "python", 993 | "pygments_lexer": "ipython3", 994 | "version": "3.7.3" 995 | } 996 | }, 997 | "nbformat": 4, 998 | "nbformat_minor": 2 999 | } 1000 | -------------------------------------------------------------------------------- /BasicScraper/GetOldTweets3_Basic_Scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "3MDPXp5-X80r" 8 | }, 9 | "source": [ 10 | "# Scraper for Twitter using GetOldTweets3\n", 11 | "\n", 12 | "Package: https://github.com/Mottl/GetOldTweets3\n", 13 | "\n", 14 | "### Notebook Author: Martin Beck" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "colab": { 22 | "base_uri": "https://localhost:8080/", 23 | "height": 1000 24 | }, 25 | "colab_type": "code", 26 | "id": "vp7x7kWeYABh", 27 | "outputId": "af1a20c2-2262-47f8-e27f-90076bd7860b", 28 | "scrolled": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "# Pip install GetOldTweets3 if you don't already have the package\n", 33 | "# !pip install GetOldTweets3\n", 34 | "\n", 35 | "# Imports\n", 36 | "import GetOldTweets3 as got\n", 37 | "import pandas as pd" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "colab_type": "text", 44 | "id": "he3accCbyaWG" 45 | }, 46 | 
"source": [ 47 | "## Query by Username\n", 48 | "Creation of queries using GetOldTweets3\n", 49 | "\n", 50 | "Function is focused on querying by username then providing a CSV file of that query using pandas." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 2, 56 | "metadata": { 57 | "colab": {}, 58 | "colab_type": "code", 59 | "id": "54rhT5wfZVXD" 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "# Function that pulls tweets from a specific username and turns to csv file\n", 64 | "\n", 65 | "# Parameters: (list of twitter usernames), (max number of most recent tweets to pull from)\n", 66 | "def username_tweets_to_csv(username, count):\n", 67 | " # Creation of query object\n", 68 | " tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n", 69 | " .setMaxTweets(count)\n", 70 | " # Creation of list that contains all tweets\n", 71 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", 72 | "\n", 73 | " # Creating list of chosen tweet data\n", 74 | " user_tweets = [[tweet.date, tweet.text] for tweet in tweets]\n", 75 | "\n", 76 | " # Creation of dataframe from tweets list\n", 77 | " tweets_df = pd.DataFrame(user_tweets, columns = ['Datetime', 'Text'])\n", 78 | "\n", 79 | " # Converting dataframe to CSV\n", 80 | " tweets_df.to_csv('{}-{}k-tweets.csv'.format(username, int(count/1000)), sep=',')" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "# Now to use the function created\n", 90 | "# Input username(s) to scrape tweets and name csv file\n", 91 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 92 | "username = 'jack'\n", 93 | "count = 2000\n", 94 | "\n", 95 | "# Calling function to turn username's past x amount of tweets into a CSV file\n", 96 | "username_tweets_to_csv(username, count)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": { 102 | "colab_type": "text", 103 | "id": "G7r4McYgyoQy" 104 | }, 105 | "source": [ 106 | "## Query by Text Search\n", 107 | "Function is focused on querying by text query then providing a CSV file of that query using pandas." 
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 34,
113 | "metadata": {
114 | "colab": {},
115 | "colab_type": "code",
116 | "id": "JSjpix_9A5e6"
117 | },
118 | "outputs": [],
119 | "source": [
120 | "# Function that pulls tweets based on a general search query and turns them into a csv file\n",
121 | "\n",
122 | "# Parameters: (text query you want to search), (max number of most recent tweets to pull from)\n",
123 | "def text_query_to_csv(text_query, count):\n",
124 | "    # Creation of query object\n",
125 | "    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\\\n",
126 | "                                            .setMaxTweets(count)\n",
127 | "    # Creation of list that contains all tweets\n",
128 | "    tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n",
129 | "\n",
130 | "    # Creating list of chosen tweet data\n",
131 | "    text_tweets = [[tweet.date, tweet.text] for tweet in tweets]\n",
132 | "\n",
133 | "    # Creation of dataframe from tweets\n",
134 | "    tweets_df = pd.DataFrame(text_tweets, columns = ['Datetime', 'Text'])\n",
135 | "\n",
136 | "    # Converting tweets dataframe to csv file\n",
137 | "    tweets_df.to_csv('{}-{}k-tweets.csv'.format(text_query, int(count/1000)), sep=',')"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "# Now to use the function created\n",
147 | "# Input search query to scrape tweets and name the csv file\n",
148 | "# count pulls the x most recent tweets matching that search\n",
149 | "text_query = 'USA Election 2020'\n",
150 | "count = 5000\n",
151 | "\n",
152 | "# Calling function to query X relevant tweets and create a CSV file\n",
153 | "text_query_to_csv(text_query, count)"
154 | ]
155 | }
156 | ],
157 | "metadata": {
158 | "colab": {
159 | "collapsed_sections": [],
160 | "name": "GetOldTweets3 Twitter Scraper",
161 | "provenance": []
162 | },
163 | "kernelspec": {
164 | "display_name": "Python 3",
165 | "language": "python",
166 | "name": "python3"
167 | },
168 | "language_info": {
169 | "codemirror_mode": {
170 | "name": "ipython",
171 | "version": 3
172 | },
173 | "file_extension": ".py",
174 | "mimetype": "text/x-python",
175 | "name": "python",
176 | "nbconvert_exporter": "python",
177 | "pygments_lexer": "ipython3",
178 | "version": "3.7.3"
179 | }
180 | },
181 | "nbformat": 4,
182 | "nbformat_minor": 1
183 | }
184 | --------------------------------------------------------------------------------
/BasicScraper/README.md:
1 | # NOTE: the following information is heavily outdated. GetOldTweets3 is no longer usable, and the Tweepy code utilizes Twitter API V1; V2 is what is currently used.
2 | 
3 | 
4 | 
5 | 
6 | ---
7 | ---
8 | 
9 | # How to Scrape Tweets from Twitter
10 | This folder contains my Jupyter notebooks for my basic scraping tutorial published [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1 "written article").
11 | 
12 | This folder's notebooks scrape tweets using two different packages in Python.
13 | * GetOldTweets3 14 | * Tweepy 15 | -------------------------------------------------------------------------------- /BasicScraper/Tweepy_Basic_Scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "xh92xbMkLy28" 8 | }, 9 | "source": [ 10 | "# Scraper for Twitter using Tweepy\n", 11 | "\n", 12 | "Package Github: https://github.com/tweepy/tweepy\n", 13 | "\n", 14 | "Package Documentation: https://tweepy.readthedocs.io/en/latest/\n", 15 | "\n", 16 | "### Notebook Author: Martin Beck" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 3, 22 | "metadata": { 23 | "colab": { 24 | "base_uri": "https://localhost:8080/", 25 | "height": 213 26 | }, 27 | "colab_type": "code", 28 | "id": "90OU2SDJL2Q9", 29 | "outputId": "89d239d4-dc97-43c7-fff0-cbbe793bf094" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "# Pip install Tweepy if you don't already have the package\n", 34 | "# !pip install tweepy\n", 35 | "\n", 36 | "# Imports\n", 37 | "import tweepy\n", 38 | "import pandas as pd\n", 39 | "import time" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": { 45 | "colab_type": "text", 46 | "id": "5q3dtxauP0KR" 47 | }, 48 | "source": [ 49 | "## Credentials and Authorization" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 4, 55 | "metadata": { 56 | "colab": {}, 57 | "colab_type": "code", 58 | "id": "4NcOQy9XM5hR" 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "# Credentials\n", 63 | "\n", 64 | "consumer_key = \"XXXXXX\"\n", 65 | "consumer_secret = \"XXXXXX\"\n", 66 | "access_token = \"XXXXXX\"\n", 67 | "access_token_secret = \"XXXXXX\"\n", 68 | "\n", 69 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", 70 | "auth.set_access_token(access_token, access_token_secret)\n", 71 | "api = tweepy.API(auth,wait_on_rate_limit=True)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": { 77 | "colab_type": "text", 78 | "id": "LvBbNQXgM3QI" 79 | }, 80 | "source": [ 81 | "## Query by Username\n", 82 | "Creation of queries using Tweepy API\n", 83 | "\n", 84 | "Function is focused on completing the query then providing a CSV file of that query using pandas" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 5, 90 | "metadata": { 91 | "colab": {}, 92 | "colab_type": "code", 93 | "id": "fguMqU2ifc5h" 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "tweets = []\n", 98 | "\n", 99 | "def username_tweets_to_csv(username,count):\n", 100 | " try: \n", 101 | " # Creation of query method using parameters\n", 102 | " tweets = tweepy.Cursor(api.user_timeline,id=username).items(count)\n", 103 | "\n", 104 | " # Pulling information from tweets iterable object\n", 105 | " tweets_list = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]\n", 106 | "\n", 107 | " # Creation of dataframe from tweets list\n", 108 | " # Add or remove columns as you remove tweet information\n", 109 | " tweets_df = pd.DataFrame(tweets_list,columns=['Datetime', 'Tweet Id', 'Text'])\n", 110 | "\n", 111 | " # Converting dataframe to CSV \n", 112 | " tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)\n", 113 | "\n", 114 | " except BaseException as e:\n", 115 | " print('failed on_status,',str(e))\n", 116 | " time.sleep(3)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 6, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 
125 | "# Input username to scrape tweets and name csv file\n", 126 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 127 | "username = 'jack'\n", 128 | "count = 150\n", 129 | "\n", 130 | "# Calling function to turn username's past X amount of tweets into a CSV file\n", 131 | "username_tweets_to_csv(username, count)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": { 137 | "colab_type": "text", 138 | "id": "jFe9EonmM6u9" 139 | }, 140 | "source": [ 141 | "## Query by Text Search\n", 142 | "Function is focused on completing the query then providing a CSV file of that query using pandas" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 7, 148 | "metadata": { 149 | "colab": {}, 150 | "colab_type": "code", 151 | "id": "1hOeCFq6M83k" 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "tweets = []\n", 156 | "\n", 157 | "def text_query_to_csv(text_query,count):\n", 158 | " try:\n", 159 | " # Creation of query method using parameters\n", 160 | " tweets = tweepy.Cursor(api.search,q=text_query).items(count)\n", 161 | "\n", 162 | " # Pulling information from tweets iterable object\n", 163 | " tweets_list = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]\n", 164 | "\n", 165 | " # Creation of dataframe from tweets list\n", 166 | " # Add or remove columns as you remove tweet information\n", 167 | " tweets_df = pd.DataFrame(tweets_list,columns=['Datetime', 'Tweet Id', 'Text'])\n", 168 | "\n", 169 | " # Converting dataframe to CSV \n", 170 | " tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index = False)\n", 171 | "\n", 172 | " except BaseException as e:\n", 173 | " print('failed on_status,',str(e))\n", 174 | " time.sleep(3)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 8, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "# Input search query to scrape tweets and name csv file\n", 184 | "# Max recent tweets pulls x amount of most recent tweets from that user\n", 185 | "text_query = 'USA Election 2020'\n", 186 | "count = 150\n", 187 | "\n", 188 | "# Calling function to query X amount of relevant tweets and create a CSV file\n", 189 | "text_query_to_csv(text_query, count)" 190 | ] 191 | } 192 | ], 193 | "metadata": { 194 | "colab": { 195 | "collapsed_sections": [], 196 | "name": "Tweepy Twitter Scraper", 197 | "provenance": [] 198 | }, 199 | "kernelspec": { 200 | "display_name": "Python 3", 201 | "language": "python", 202 | "name": "python3" 203 | }, 204 | "language_info": { 205 | "codemirror_mode": { 206 | "name": "ipython", 207 | "version": 3 208 | }, 209 | "file_extension": ".py", 210 | "mimetype": "text/x-python", 211 | "name": "python", 212 | "nbconvert_exporter": "python", 213 | "pygments_lexer": "ipython3", 214 | "version": "3.7.3" 215 | } 216 | }, 217 | "nbformat": 4, 218 | "nbformat_minor": 1 219 | } 220 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Scraping Tweets from Twitter 2 | This repository contains various materials that follow my series of tweet scraping articles. 3 | 4 | The **ScraperV4** folder contains materials from my Twitter scraping tutorial article available [here](https://betterprogramming.pub/how-to-scrape-tweets-from-twitter-141ed19abb10). 
5 | This article covers:
6 | 
7 | * Setting up Tweepy with Twitter API V2
8 | * Simple queries with Tweepy
9 | 
10 | ## OUTDATED MATERIALS
11 | All materials pertaining to the sections below are outdated and are left for archival reasons. Many API changes prevent scrapers like snscrape and GetOldTweets3 from working. The only version that consistently works uses Twitter API V2, as shown in the section above.
12 | If you raise any issues on my code, please refer to the specific sub-directory and Python libraries used so I know where to help.
13 | 
14 | The BasicScraper folder contains materials from my beginner scraping tutorial article available [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1 "written article").
15 | This article covers:
16 | * Setting up Tweepy and GetOldTweets3
17 | * Simple queries with Tweepy and GetOldTweets3
18 | 
19 | The AdvScraper folder contains materials from my advanced scraping tutorial article available [here](https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f "written article").
20 | This article covers:
21 | * Pulling more information from tweets with Tweepy and GetOldTweets3
22 | * Pulling user information from tweets with Tweepy and GetOldTweets3
23 | * Scraping using filters with Tweepy and GetOldTweets3
24 | 
25 | The snscrape folder contains materials from my snscrape scraping tutorial article available [here](https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af).
26 | This article covers:
27 | * Setting up snscrape
28 | * Different ways to use snscrape
29 | * Simple queries with snscrape
30 | --------------------------------------------------------------------------------
/ScraperV4/README.md:
1 | # How to Scrape Tweets from Twitter
2 | This folder contains my Jupyter notebooks for my updated basic scraping tutorial published [here](https://betterprogramming.pub/how-to-scrape-tweets-from-twitter-141ed19abb10).
3 | 
4 | This folder's notebook scrapes tweets using the Tweepy package in Python.
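
For orientation, the core pattern the notebook follows looks roughly like this (a minimal sketch, assuming you have a valid API V2 bearer token; see the notebook for the full functions):

```python
import tweepy

# Authenticate against Twitter API V2 with a bearer token
client = tweepy.Client("YOUR_BEARER_TOKEN")

# Grab up to 10 recent tweets matching a keyword
response = client.search_recent_tweets("dogs", max_results=10)
for tweet in response.data or []:  # .data is None when nothing matched
    print(tweet.id, tweet.text)
```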
5 | --------------------------------------------------------------------------------
/ScraperV4/Tweepy_Scraper_V4.ipynb:
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "554d9ee5",
6 | "metadata": {},
7 | "source": [
8 | "# Scraper for Twitter Using Tweepy\n",
9 | "Package GitHub: https://github.com/tweepy/tweepy\n",
10 | "\n",
11 | "Package Documentation: https://tweepy.readthedocs.io/en/latest/"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "id": "ba8aff69",
17 | "metadata": {},
18 | "source": [
19 | "## Notebook Author: Martin Beck"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 1,
25 | "id": "1e38a14f",
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "# Pip install Tweepy if you don't already have the package\n",
30 | "# !pip install tweepy\n",
31 | "\n",
32 | "# Imports\n",
33 | "import tweepy\n",
34 | "import pandas as pd"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "id": "c8f69264",
40 | "metadata": {},
41 | "source": [
42 | "## Credentials and Authorization"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "id": "5d396c11",
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "# Credentials\n",
53 | "bearer_token = \"XXXXXXX\"\n",
54 | "\n",
55 | "client = tweepy.Client(bearer_token)"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "id": "77359a30",
61 | "metadata": {},
62 | "source": [
63 | "## Query by Username\n",
64 | "Function is focused on searching by username, then providing a CSV file of that scrape using pandas"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 189,
70 | "id": "1dc585a4",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "def username_search_to_csv(username, count):\n",
75 | "    try:\n",
76 | "        # Grabbing user id from username\n",
77 | "        user_id = client.get_user(username=username).data.id\n",
78 | "        \n",
79 | "        # Creation of query method using parameters\n",
80 | "        tweets = tweepy.Paginator(client.get_users_tweets, user_id, tweet_fields=[\"author_id\", \"created_at\", \"lang\", \"public_metrics\"], expansions=[\"author_id\"], max_results=100).flatten(limit = count)\n",
81 | "        \n",
82 | "        tweets_list = []\n",
83 | "        \n",
84 | "        # Pulling information from tweets generator\n",
85 | "        tweets_list = [[tweet.created_at, tweet.id, tweet.text, tweet.public_metrics[\"retweet_count\"], tweet.public_metrics[\"like_count\"]] for tweet in tweets]\n",
86 | "        \n",
87 | "        # Creation of dataframe from tweets list\n",
88 | "        tweets_df = pd.DataFrame(tweets_list, columns=[\"Created At\", \"Tweet Id\", \"Text\", \"Retweet Count\", \"Like Count\"])\n",
89 | "        \n",
90 | "        # Converting dataframe to CSV\n",
91 | "        tweets_df.to_csv(\"{}-tweets.csv\".format(username), sep=\",\", index = False)\n",
92 | "        \n",
93 | "        print(\"Completed Scrape!\")\n",
94 | "        \n",
95 | "    except BaseException as e:\n",
96 | "        print(\"failed on_status,\",str(e))"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 4,
102 | "id": "1abdf7d4",
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "# Input username to scrape tweets and name the csv file\n",
107 | "username = \"BillGates\"\n",
108 | "count = 10\n",
109 | "\n",
110 | "# Calling function to scrape the user's X most recent tweets and create a CSV file\n",
111 | "username_search_to_csv(username, count)"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "id":
"4eee4399", 117 | "metadata": {}, 118 | "source": [ 119 | "## Scrape by Keyword Search\n", 120 | "Function is focused on using Keyword Search then providing a CSV file of that scrape using pandas" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 2, 126 | "id": "0f3198ff", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "def keyword_search_to_csv(keyword_search, count):\n", 131 | " try:\n", 132 | " # Creation of query method using parameters\n", 133 | " tweets = tweepy.Paginator(client.search_recent_tweets, keyword_search, tweet_fields=[\"author_id\", \"created_at\", \"lang\", \"public_metrics\"], user_fields=[\"username\"]).flatten(limit = count)\n", 134 | " \n", 135 | " tweets_list = []\n", 136 | " \n", 137 | " # Pulling information from tweets generator\n", 138 | " tweets_list = [[tweet.created_at, tweet.id, tweet.text, tweet.public_metrics[\"retweet_count\"], tweet.public_metrics[\"like_count\"]]for tweet in tweets]\n", 139 | " \n", 140 | " # Creation of dataframe from tweets list\n", 141 | " tweets_df = pd.DataFrame(tweets_list, columns=[\"Created At\", \"Tweet Id\", \"Text\", \"Retweet Count\", \"Like Count\"])\n", 142 | " \n", 143 | " # Converting dataframe to CSV \n", 144 | " tweets_df.to_csv(\"{}-tweets.csv\".format(keyword_search), sep=\",\", index = False)\n", 145 | " \n", 146 | " print(\"Completed Scrape!\")\n", 147 | " \n", 148 | " except BaseException as e:\n", 149 | " print(\"failed on_status,\",str(e))" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 3, 155 | "id": "8adcbae1", 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "# Input search query to scrape tweets and name csv file\n", 160 | "keyword_search = \"Dogs\"\n", 161 | "count = 10\n", 162 | "\n", 163 | "# Calling function to query X amount of relevant tweets and create a CSV file\n", 164 | "keyword_search_to_csv(keyword_search, count)" 165 | ] 166 | } 167 | ], 168 | "metadata": { 169 | "kernelspec": { 170 | "display_name": "Python 3 (ipykernel)", 171 | "language": "python", 172 | "name": "python3" 173 | }, 174 | "language_info": { 175 | "codemirror_mode": { 176 | "name": "ipython", 177 | "version": 3 178 | }, 179 | "file_extension": ".py", 180 | "mimetype": "text/x-python", 181 | "name": "python", 182 | "nbconvert_exporter": "python", 183 | "pygments_lexer": "ipython3", 184 | "version": "3.11.2" 185 | } 186 | }, 187 | "nbformat": 4, 188 | "nbformat_minor": 5 189 | } 190 | -------------------------------------------------------------------------------- /snscrape/README.md: -------------------------------------------------------------------------------- 1 | # NOTE, the following information is heavily outdated. Snscrape is currently unusable as mentioned in its GitHub issues: https://github.com/JustAnotherArchivist/snscrape/issues/996 2 | 3 | 4 | 5 | 6 | --- 7 | ---# How to Scrape Tweets with snscrape 8 | This folder contains the jupyter notebooks for my snscrape scraping tutorial published [here](https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af). 9 | 10 | This folder contains two subfolders based on what method you use with snscrape such as using the CLI commands or the Python wrapper available with snscrape. Each sub-folder contains a Jupyter notebook and Python script that follows the code snippets in my article. 11 | 12 | The contents of this folder and its subfolders are shown below. 
13 | 14 | * cli-with-python 15 | * snscrape-python-cli.ipynb 16 | * snscrape-python-cli.py 17 | * python-wrapper 18 | * snscrape-python-wrapper.ipynb 19 | * snscrape-python-wrapper.py 20 | -------------------------------------------------------------------------------- /snscrape/cli-with-python/snscrape-python-cli.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Article Notebook for Scraping Twitter Using snscrape's CLI Commands With Python\n", 8 | "
Package GitHub: https://github.com/JustAnotherArchivist/snscrape\n",
This notebook will be using the development version of snscrape\n", 10 | "\n", 11 | "Article Read-Along: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af\n", 12 | "\n", 13 | "### Notebook Author: Martin Beck\n", 14 | "Information current as of November 26th, 2020
\n",
15 | "\n",
16 | "This notebook contains materials for scraping tweets from Twitter using snscrape's CLI commands with Python.\n",
17 | "\n",
18 | "Dependencies: \n",
19 | "- Your Python version must be 3.8 or higher. The development version of snscrape will not work with Python 3.7 or lower. You can download the latest Python version [here](https://www.python.org/downloads/).\n",
20 | "- The development version of snscrape; uncomment the pip install line in the cell below if you don't already have it.\n",
21 | "- Pandas; its dataframes allow easy manipulation and indexing of the data. This is more of a preference, but it's what I follow in this notebook."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 4,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "# Run the pip install command below if you don't already have the library\n",
31 | "# !pip install git+https://github.com/JustAnotherArchivist/snscrape.git\n",
32 | "\n",
33 | "# Run the command below if you don't already have Pandas\n",
34 | "# !pip install pandas\n",
35 | "\n",
36 | "# Imports\n",
37 | "import os\n",
38 | "import pandas as pd"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "# Query by Username\n",
46 | "The code below will scrape 100 tweets from a username, then provide a CSV file with Pandas"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {
53 | "scrolled": true
54 | },
55 | "outputs": [],
56 | "source": [
57 | "# Setting variables to be used in format string command below\n",
58 | "tweet_count = 100\n",
59 | "username = \"jack\"\n",
60 | "\n",
61 | "# Using OS library to call CLI commands in Python\n",
62 | "os.system(\"snscrape --jsonl --max-results {} twitter-search 'from:{}' > user-tweets.json\".format(tweet_count, username))"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 6,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/html": [
73 | "<div>
\n", 74 | "\n", 87 | "\n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | "
urldatecontentrenderedContentiduseroutlinkstcooutlinksreplyCountretweetCountlikeCountquoteCountconversationIdlangsourcemediaretweetedTweetquotedTweetmentionedUsers
0https://twitter.com/jack/status/13324354308016...2020-11-27 21:25:36+00:00@JesseDorogusker @Square ❤️@JesseDorogusker @Square ❤️1332435430801690624{'username': 'jack', 'displayname': 'jack', 'i...[][]54822611332428871891775488und<a href=\"http://twitter.com/download/iphone\" r...NaNNaNNone[{'username': 'JesseDorogusker', 'displayname'...
1https://twitter.com/jack/status/13291496370060...2020-11-18 19:49:02+00:00@NeerajKA Welcome!@NeerajKA Welcome!1329149637006041088{'username': 'jack', 'displayname': 'jack', 'i...[][]721480081329140522565439490en<a href=\"http://twitter.com/download/iphone\" r...NaNNaNNone[{'username': 'NeerajKA', 'displayname': 'Neer...
2https://twitter.com/jack/status/13291372550263...2020-11-18 18:59:50+00:00Join @CashApp! #Bitcoin https://t.co/SbYANIZyixJoin @CashApp! #Bitcoin twitter.com/owenbjenni...1329137255026311168{'username': 'jack', 'displayname': 'jack', 'i...[https://twitter.com/owenbjennings/status/1329...[https://t.co/SbYANIZyix]58527725071321329137255026311168en<a href=\"http://twitter.com/download/iphone\" r...NaNNaN{'url': 'https://twitter.com/owenbjennings/sta...[{'username': 'CashApp', 'displayname': 'Cash ...
3https://twitter.com/jack/status/13291366656847...2020-11-18 18:57:29+00:00@kateconger @sarahintampa Nah@kateconger @sarahintampa Nah1329136665684705280{'username': 'jack', 'displayname': 'jack', 'i...[][]385176101329126492731699203und<a href=\"http://twitter.com/download/iphone\" r...NaNNaNNone[{'username': 'kateconger', 'displayname': 'o....
4https://twitter.com/jack/status/13291358061921...2020-11-18 18:54:05+00:00@mmasnick Terrible idea! And terribly false.@mmasnick Terrible idea! And terribly false.1329135806192107521{'username': 'jack', 'displayname': 'jack', 'i...[][]5113222161329128773845860352en<a href=\"http://twitter.com/download/iphone\" r...NaNNaNNone[{'username': 'mmasnick', 'displayname': 'Mike...
\n", 225 | "
" 226 | ], 227 | "text/plain": [ 228 | " url \\\n", 229 | "0 https://twitter.com/jack/status/13324354308016... \n", 230 | "1 https://twitter.com/jack/status/13291496370060... \n", 231 | "2 https://twitter.com/jack/status/13291372550263... \n", 232 | "3 https://twitter.com/jack/status/13291366656847... \n", 233 | "4 https://twitter.com/jack/status/13291358061921... \n", 234 | "\n", 235 | " date content \\\n", 236 | "0 2020-11-27 21:25:36+00:00 @JesseDorogusker @Square ❤️ \n", 237 | "1 2020-11-18 19:49:02+00:00 @NeerajKA Welcome! \n", 238 | "2 2020-11-18 18:59:50+00:00 Join @CashApp! #Bitcoin https://t.co/SbYANIZyix \n", 239 | "3 2020-11-18 18:57:29+00:00 @kateconger @sarahintampa Nah \n", 240 | "4 2020-11-18 18:54:05+00:00 @mmasnick Terrible idea! And terribly false. \n", 241 | "\n", 242 | " renderedContent id \\\n", 243 | "0 @JesseDorogusker @Square ❤️ 1332435430801690624 \n", 244 | "1 @NeerajKA Welcome! 1329149637006041088 \n", 245 | "2 Join @CashApp! #Bitcoin twitter.com/owenbjenni... 1329137255026311168 \n", 246 | "3 @kateconger @sarahintampa Nah 1329136665684705280 \n", 247 | "4 @mmasnick Terrible idea! And terribly false. 1329135806192107521 \n", 248 | "\n", 249 | " user \\\n", 250 | "0 {'username': 'jack', 'displayname': 'jack', 'i... \n", 251 | "1 {'username': 'jack', 'displayname': 'jack', 'i... \n", 252 | "2 {'username': 'jack', 'displayname': 'jack', 'i... \n", 253 | "3 {'username': 'jack', 'displayname': 'jack', 'i... \n", 254 | "4 {'username': 'jack', 'displayname': 'jack', 'i... \n", 255 | "\n", 256 | " outlinks \\\n", 257 | "0 [] \n", 258 | "1 [] \n", 259 | "2 [https://twitter.com/owenbjennings/status/1329... \n", 260 | "3 [] \n", 261 | "4 [] \n", 262 | "\n", 263 | " tcooutlinks replyCount retweetCount likeCount quoteCount \\\n", 264 | "0 [] 54 8 226 1 \n", 265 | "1 [] 72 14 800 8 \n", 266 | "2 [https://t.co/SbYANIZyix] 585 277 2507 132 \n", 267 | "3 [] 38 5 176 10 \n", 268 | "4 [] 51 13 222 16 \n", 269 | "\n", 270 | " conversationId lang \\\n", 271 | "0 1332428871891775488 und \n", 272 | "1 1329140522565439490 en \n", 273 | "2 1329137255026311168 en \n", 274 | "3 1329126492731699203 und \n", 275 | "4 1329128773845860352 en \n", 276 | "\n", 277 | " source media retweetedTweet \\\n", 278 | "0 text-query-tweets.json'.format(tweet_count, since_date, text_query, until_date))" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 9, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "data": { 353 | "text/html": [ 354 | "
\n", 355 | "\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | "
urldatecontentrenderedContentiduseroutlinkstcooutlinksreplyCountretweetCountlikeCountquoteCountconversationIdlangsourcemediaretweetedTweetquotedTweetmentionedUsers
0https://twitter.com/TylerPaulUtt1/status/12889...2020-07-30 23:57:02+00:00@SiBuduh @langoinstitute do you know the Ko wo...@SiBuduh @langoinstitute do you know the Ko wo...1288986997143601152{'username': 'TylerPaulUtt1', 'displayname': '...[][]10001288307058928947204en<a href=\"http://twitter.com/#!/download/ipad\" ...NoneNaNNone[{'username': 'SiBuduh', 'displayname': 'Ed Lu...
1https://twitter.com/EndlessSynthwav/status/128...2020-07-30 23:44:04+00:00@RockstarGames Any idea if the elephant rifle ...@RockstarGames Any idea if the elephant rifle ...1288983731122966534{'username': 'EndlessSynthwav', 'displayname':...[][]00001288983731122966534en<a href=\"http://twitter.com/download/android\" ...NoneNaNNone[{'username': 'RockstarGames', 'displayname': ...
2https://twitter.com/aanalyst50/status/12889677...2020-07-30 22:40:40+00:00@realDonaldTrump Trump just keeps ignoring the...@realDonaldTrump Trump just keeps ignoring the...1288967774795116550{'username': 'aanalyst50', 'displayname': 'Don...[][]00101288966119676616704en<a href=\"http://twitter.com/download/iphone\" r...NoneNaNNone[{'username': 'realDonaldTrump', 'displayname'...
3https://twitter.com/RozeyBozzy/status/12889669...2020-07-30 22:37:18+00:00@cslogan88 Famous 19th century song from Engla...@cslogan88 Famous 19th century song from Engla...1288966929236066309{'username': 'RozeyBozzy', 'displayname': 'Roz...[][]00001288965246602838017en<a href=\"https://mobile.twitter.com\" rel=\"nofo...NoneNaNNone[{'username': 'cslogan88', 'displayname': 'Chr...
4https://twitter.com/alfred_hanan/status/128896...2020-07-30 22:32:44+00:00@realDonaldTrump #RepublicanTrumpVirus.\\nLets ...@realDonaldTrump #RepublicanTrumpVirus.\\nLets ...1288965780030144512{'username': 'alfred_hanan', 'displayname': 'A...[][]00001288947487911419905en<a href=\"http://twitter.com/download/android\" ...NoneNaNNone[{'username': 'realDonaldTrump', 'displayname'...
\n", 506 | "
" 507 | ], 508 | "text/plain": [ 509 | " url \\\n", 510 | "0 https://twitter.com/TylerPaulUtt1/status/12889... \n", 511 | "1 https://twitter.com/EndlessSynthwav/status/128... \n", 512 | "2 https://twitter.com/aanalyst50/status/12889677... \n", 513 | "3 https://twitter.com/RozeyBozzy/status/12889669... \n", 514 | "4 https://twitter.com/alfred_hanan/status/128896... \n", 515 | "\n", 516 | " date \\\n", 517 | "0 2020-07-30 23:57:02+00:00 \n", 518 | "1 2020-07-30 23:44:04+00:00 \n", 519 | "2 2020-07-30 22:40:40+00:00 \n", 520 | "3 2020-07-30 22:37:18+00:00 \n", 521 | "4 2020-07-30 22:32:44+00:00 \n", 522 | "\n", 523 | " content \\\n", 524 | "0 @SiBuduh @langoinstitute do you know the Ko wo... \n", 525 | "1 @RockstarGames Any idea if the elephant rifle ... \n", 526 | "2 @realDonaldTrump Trump just keeps ignoring the... \n", 527 | "3 @cslogan88 Famous 19th century song from Engla... \n", 528 | "4 @realDonaldTrump #RepublicanTrumpVirus.\\nLets ... \n", 529 | "\n", 530 | " renderedContent id \\\n", 531 | "0 @SiBuduh @langoinstitute do you know the Ko wo... 1288986997143601152 \n", 532 | "1 @RockstarGames Any idea if the elephant rifle ... 1288983731122966534 \n", 533 | "2 @realDonaldTrump Trump just keeps ignoring the... 1288967774795116550 \n", 534 | "3 @cslogan88 Famous 19th century song from Engla... 1288966929236066309 \n", 535 | "4 @realDonaldTrump #RepublicanTrumpVirus.\\nLets ... 1288965780030144512 \n", 536 | "\n", 537 | " user outlinks tcooutlinks \\\n", 538 | "0 {'username': 'TylerPaulUtt1', 'displayname': '... [] [] \n", 539 | "1 {'username': 'EndlessSynthwav', 'displayname':... [] [] \n", 540 | "2 {'username': 'aanalyst50', 'displayname': 'Don... [] [] \n", 541 | "3 {'username': 'RozeyBozzy', 'displayname': 'Roz... [] [] \n", 542 | "4 {'username': 'alfred_hanan', 'displayname': 'A... [] [] \n", 543 | "\n", 544 | " replyCount retweetCount likeCount quoteCount conversationId lang \\\n", 545 | "0 1 0 0 0 1288307058928947204 en \n", 546 | "1 0 0 0 0 1288983731122966534 en \n", 547 | "2 0 0 1 0 1288966119676616704 en \n", 548 | "3 0 0 0 0 1288965246602838017 en \n", 549 | "4 0 0 0 0 1288947487911419905 en \n", 550 | "\n", 551 | " source media retweetedTweet \\\n", 552 | "0
user-tweets.json".format(tweet_count, username)) 25 | 26 | # Reads the json generated from the CLI command above and creates a pandas dataframe 27 | tweets_df1 = pd.read_json('user-tweets.json', lines=True) 28 | 29 | # Displays first 5 entries from dataframe 30 | # tweets_df1.head() 31 | 32 | # Export dataframe into a CSV 33 | tweets_df1.to_csv('user-tweets.csv', sep=',', index=False) 34 | 35 | 36 | # Query by text search 37 | # Setting variables to be used in format string command below 38 | tweet_count = 500 39 | text_query = "its the elephant" 40 | since_date = "2020-06-01" 41 | until_date = "2020-07-31" 42 | 43 | # Using OS library to call CLI commands in Python 44 | os.system('snscrape --jsonl --max-results {} --since {} twitter-search "{} until:{}"> text-query-tweets.json'.format(tweet_count, since_date, text_query, until_date)) 45 | 46 | # Reads the json generated from the CLI command above and creates a pandas dataframe 47 | tweets_df2 = pd.read_json('text-query-tweets.json', lines=True) 48 | 49 | # Displays first 5 entries from dataframe 50 | # tweets_df2.head() 51 | 52 | # Export dataframe into a CSV 53 | tweets_df2.to_csv('text-query-tweets.csv', sep=',', index=False) -------------------------------------------------------------------------------- /snscrape/python-wrapper/snscrape-python-wrapper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Article Notebook for Scraping Twitter Using snscrape's Python Wrapper\n", 8 | "
Package GitHub: https://github.com/JustAnotherArchivist/snscrape\n",
This notebook will be using the development version of snscrape\n", 10 | "\n", 11 | "Article Read-Along: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af\n", 12 | "\n", 13 | "### Notebook Author: Martin Beck\n", 14 | "Information current as of November 28th, 2020
\n",
15 | "\n",
16 | "This notebook contains materials for scraping tweets from Twitter using snscrape's Python Wrapper.\n",
17 | "\n",
18 | "Dependencies: \n",
19 | "- Your Python version must be 3.8 or higher. The development version of snscrape will not work with Python 3.7 or lower. You can download the latest Python version [here](https://www.python.org/downloads/).\n",
20 | "- The development version of snscrape; uncomment the pip install line in the cell below if you don't already have it.\n",
21 | "- Pandas; its dataframes allow easy manipulation and indexing of the data. This is more of a preference, but it's what I follow in this notebook."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 4,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "# Run the pip install command below if you don't already have the library\n",
31 | "# !pip install git+https://github.com/JustAnotherArchivist/snscrape.git\n",
32 | "\n",
33 | "# Run the command below if you don't already have Pandas\n",
34 | "# !pip install pandas\n",
35 | "\n",
36 | "# Imports\n",
37 | "import snscrape.modules.twitter as sntwitter\n",
38 | "import pandas as pd"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "# Query by Username\n",
46 | "The code below will scrape 100 tweets from a username, then provide a CSV file with Pandas"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 35,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "# Setting variables to be used below\n",
56 | "maxTweets = 100\n",
57 | "\n",
58 | "# Creating list to append tweet data to\n",
59 | "tweets_list1 = []\n",
60 | "\n",
61 | "# Using TwitterSearchScraper to scrape data and append tweets to list\n",
62 | "for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:jack').get_items()):\n",
63 | "    if i >= maxTweets:  # stop once maxTweets tweets have been collected\n",
64 | "        break\n",
65 | "    tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.user.username])"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 36,
71 | "metadata": {
72 | "scrolled": false
73 | },
74 | "outputs": [
75 | {
76 | "data": {
77 | "text/html": [
78 | "<div>
\n", 79 | "\n", 92 | "\n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | "
DatetimeTweet IdTextUsername
02020-11-27 21:25:36+00:001332435430801690624@JesseDorogusker @Square ❤️jack
12020-11-18 19:49:02+00:001329149637006041088@NeerajKA Welcome!jack
22020-11-18 18:59:50+00:001329137255026311168Join @CashApp! #Bitcoin https://t.co/SbYANIZyixjack
32020-11-18 18:57:29+00:001329136665684705280@kateconger @sarahintampa Nahjack
42020-11-18 18:54:05+00:001329135806192107521@mmasnick Terrible idea! And terribly false.jack
\n", 140 | "
" 141 | ], 142 | "text/plain": [ 143 | " Datetime Tweet Id \\\n", 144 | "0 2020-11-27 21:25:36+00:00 1332435430801690624 \n", 145 | "1 2020-11-18 19:49:02+00:00 1329149637006041088 \n", 146 | "2 2020-11-18 18:59:50+00:00 1329137255026311168 \n", 147 | "3 2020-11-18 18:57:29+00:00 1329136665684705280 \n", 148 | "4 2020-11-18 18:54:05+00:00 1329135806192107521 \n", 149 | "\n", 150 | " Text Username \n", 151 | "0 @JesseDorogusker @Square ❤️ jack \n", 152 | "1 @NeerajKA Welcome! jack \n", 153 | "2 Join @CashApp! #Bitcoin https://t.co/SbYANIZyix jack \n", 154 | "3 @kateconger @sarahintampa Nah jack \n", 155 | "4 @mmasnick Terrible idea! And terribly false. jack " 156 | ] 157 | }, 158 | "execution_count": 36, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "# Creating a dataframe from the tweets list above\n", 165 | "tweets_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])\n", 166 | "\n", 167 | "# Display first 5 entries from dataframe\n", 168 | "tweets_df1.head()" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 37, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "# Export dataframe into a CSV\n", 178 | "tweets_df1.to_csv('user-tweets.csv', sep=',', index=False)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "# Query by Text Search\n", 186 | "The code below will scrape for 500 tweets between June 1st, 2020 and July 31st, 2020, by a text search then provide a CSV file with Pandas" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 27, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "# Setting variables to be used below\n", 196 | "maxTweets = 500\n", 197 | "\n", 198 | "# Creating list to append tweet data to\n", 199 | "tweets_list2 = []\n", 200 | "\n", 201 | "# Using TwitterSearchScraper to scrape data and append tweets to list\n", 202 | "for i,tweet in enumerate(sntwitter.TwitterSearchScraper('its the elephant since:2020-06-01 until:2020-07-31').get_items()):\n", 203 | " if i>maxTweets:\n", 204 | " break\n", 205 | " tweets_list2.append([tweet.date, tweet.id, tweet.content, tweet.user.username])" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 28, 211 | "metadata": { 212 | "scrolled": false 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/html": [ 218 | "
\n", 219 | "\n", 232 | "\n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | "
DatetimeTweet IdTextUsername
02020-07-30 23:57:02+00:001288986997143601152@SiBuduh @langoinstitute do you know the Ko wo...TylerPaulUtt1
12020-07-30 23:44:04+00:001288983731122966534@RockstarGames Any idea if the elephant rifle ...EndlessSynthwav
22020-07-30 22:40:40+00:001288967774795116550@realDonaldTrump Trump just keeps ignoring the...aanalyst50
32020-07-30 22:37:18+00:001288966929236066309@cslogan88 Famous 19th century song from Engla...RozeyBozzy
42020-07-30 22:32:44+00:001288965780030144512@realDonaldTrump #RepublicanTrumpVirus.\\nLets ...alfred_hanan
\n", 280 | "
" 281 | ], 282 | "text/plain": [ 283 | " Datetime Tweet Id \\\n", 284 | "0 2020-07-30 23:57:02+00:00 1288986997143601152 \n", 285 | "1 2020-07-30 23:44:04+00:00 1288983731122966534 \n", 286 | "2 2020-07-30 22:40:40+00:00 1288967774795116550 \n", 287 | "3 2020-07-30 22:37:18+00:00 1288966929236066309 \n", 288 | "4 2020-07-30 22:32:44+00:00 1288965780030144512 \n", 289 | "\n", 290 | " Text Username \n", 291 | "0 @SiBuduh @langoinstitute do you know the Ko wo... TylerPaulUtt1 \n", 292 | "1 @RockstarGames Any idea if the elephant rifle ... EndlessSynthwav \n", 293 | "2 @realDonaldTrump Trump just keeps ignoring the... aanalyst50 \n", 294 | "3 @cslogan88 Famous 19th century song from Engla... RozeyBozzy \n", 295 | "4 @realDonaldTrump #RepublicanTrumpVirus.\\nLets ... alfred_hanan " 296 | ] 297 | }, 298 | "execution_count": 28, 299 | "metadata": {}, 300 | "output_type": "execute_result" 301 | } 302 | ], 303 | "source": [ 304 | "# Creating a dataframe from the tweets list above\n", 305 | "tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])\n", 306 | "\n", 307 | "# Display first 5 entries from dataframe\n", 308 | "tweets_df2.head()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 38, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# Export dataframe into a CSV\n", 318 | "tweets_df2.to_csv('text-query-tweets.csv', sep=',', index=False)" 319 | ] 320 | } 321 | ], 322 | "metadata": { 323 | "kernelspec": { 324 | "display_name": "Python 3", 325 | "language": "python", 326 | "name": "python3" 327 | }, 328 | "language_info": { 329 | "codemirror_mode": { 330 | "name": "ipython", 331 | "version": 3 332 | }, 333 | "file_extension": ".py", 334 | "mimetype": "text/x-python", 335 | "name": "python", 336 | "nbconvert_exporter": "python", 337 | "pygments_lexer": "ipython3", 338 | "version": "3.7.3" 339 | } 340 | }, 341 | "nbformat": 4, 342 | "nbformat_minor": 4 343 | } 344 | -------------------------------------------------------------------------------- /snscrape/python-wrapper/snscrape-python-wrapper.py: -------------------------------------------------------------------------------- 1 | # Script Author: Martin Beck 2 | # Medium Article Follow-Along: https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af 3 | 4 | # Pip install the command below if you don't have the development version of snscrape 5 | # !pip install git+https://github.com/JustAnotherArchivist/snscrape.git 6 | 7 | # Run the below command if you don't already have Pandas 8 | # !pip install pandas 9 | 10 | # Imports 11 | import snscrape.modules.twitter as sntwitter 12 | import pandas as pd 13 | 14 | # Below are two ways of scraping using the Python Wrapper. 15 | # Comment or uncomment as you need. If you currently run the script as is it will scrape both queries 16 | # then output two different csv files. 
17 | 
18 | # Query by username
19 | # Setting variables to be used below
20 | maxTweets = 100
21 | 
22 | # Creating list to append tweet data to
23 | tweets_list1 = []
24 | 
25 | # Using TwitterSearchScraper to scrape data
26 | for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:jack').get_items()):
27 |     if i >= maxTweets:  # stop once maxTweets tweets have been collected
28 |         break
29 |     tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
30 | 
31 | # Creating a dataframe from the tweets list above
32 | tweets_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])
33 | 
34 | # Display first 5 entries from dataframe
35 | # tweets_df1.head()
36 | 
37 | # Export dataframe into a CSV
38 | tweets_df1.to_csv('user-tweets.csv', sep=',', index=False)
39 | 
40 | 
41 | # Query by text search
42 | # Setting variables to be used below
43 | maxTweets = 500
44 | 
45 | # Creating list to append tweet data to
46 | tweets_list2 = []
47 | 
48 | # Using TwitterSearchScraper to scrape data and append tweets to list
49 | for i,tweet in enumerate(sntwitter.TwitterSearchScraper('its the elephant since:2020-06-01 until:2020-07-31').get_items()):
50 |     if i >= maxTweets:  # stop once maxTweets tweets have been collected
51 |         break
52 |     tweets_list2.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
53 | 
54 | # Creating a dataframe from the tweets list above
55 | tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])
56 | 
57 | # Display first 5 entries from dataframe (head() has no visible effect in a script, so it is commented out)
58 | # tweets_df2.head()
59 | 
60 | # Export dataframe into a CSV
61 | tweets_df2.to_csv('text-query-tweets.csv', sep=',', index=False)
--------------------------------------------------------------------------------