├── .github
│   └── ISSUE_TEMPLATE
│       ├── bug_report.md
│       ├── feature_request.md
│       └── question.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── Pipfile
├── README.md
├── collector.py
├── config_template.yml
├── create_dense_result.sql
├── create_node_view.sql
├── database_handler.py
├── empty_keys.json
├── exceptions.py
├── functional_test.py
├── helpers.py
├── make_config.py
├── make_test_tweet_jsons.py
├── passwords_template.py
├── seed_with_lots_of_friends.csv
├── seeds.csv
├── seeds_empty.csv
├── seeds_template.csv
├── seeds_test.csv
├── setup.py
├── setup_server.sh
├── start.py
├── test_helpers.py
├── test_run.sh
├── tests
│   ├── config_test_empty.yml
│   └── tests.py
├── twauth.py
├── two_seeds.csv
└── wrong_tokens.csv

/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**Possibly related issues I have found under "Issues":**
Use the search function in Issues to find related issues (also closed ones).

**To Reproduce**
Steps to reproduce the behavior:


**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots/command line outputs to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Python: [e.g. 3.6]
 [- contents of Pipfile and lockfile]

**Additional context**
Add any other context about the problem here.

--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Possibly related issues/feature requests I have found under "Issues":**
Use the search function in issues to find related issues (also closed ones).

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.

--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
---
name: Question
about: Ask a question that is not answered by the documentation, the readme, or related publications.
title: ''
labels: ''
assignees: ''

---

**I have a question regarding:**

**This is my question:**

**I already have consulted the following resources (e.g.
README, documentation, linked talks, linked articles, stackoverflow), from which my understanding is that …:**

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
environment.yml
.idea/
.python-version
__pycache__/
keys*.*
token*.csv
.swp
*.db
passwords.py
config_bu_perm.yml
test_config.yml
tests/tweet_jsons
*latest_seeds*
seeds_de_ids.csv
user_ids_de.csv
results/
Pipfile.lock.bk
config.yml
Pipfile.lock
.vscode/settings.json

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event.
Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at f.muench@leibniz-hbi.de. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributor's Guidelines

Contributions are possible in the form of Issues or Pull Requests.

For Issues, please:

1. Read the documentation and/or the README.
2. Search for your problem in the issues section first.
3. If you cannot find a solution/answer yourself with these resources, raise an issue.

For PRs:

1. Every PR must contain a passing unit-test or an adaptation of existing tests that tests the proposed changes. All existing tests must pass.
(If you are unfamiliar with Test Driven Development (TDD) read [this](https://code.tutsplus.com/tutorials/beginning-test-driven-development-in-python--net-30137) and maybe the first chapters of [this](https://www.oreilly.com/library/view/test-driven-development-with/9781449365141/)).
2. Keep your commits small.
3. Document your code, preferably with Google style docstrings (https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html), or at least with intuitive comments.
4. Keep your code as readable as possible (https://docs.python-guide.org/writing/style/).
5. Keep your code compliant with flake8 (http://flake8.pycqa.org/en/latest/index.html#).

By submitting a pull request to this repository, you agree to license your contribution under the MIT license of this project.

By contributing you agree to follow the [Code of Conduct of this project](CODE_OF_CONDUCT.md). The project owner(s) reserve the right to exclude/ban contributors from this project, in particular (but not only) if they violate the CoC.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Felix Victor Münch

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/Pipfile:
--------------------------------------------------------------------------------
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pandas = "*"
tweepy = "<4"
pyyaml = "<6"
sqlalchemy = "<2"
pymysql = "*"
argparse = "*"
urllib3 = ">=1.26.5"

[dev-packages]
"flake8" = "*"
pytest = "*"
isort = "*"
ipython = ">=8.10"
"autopep8" = "*"
pydocstyle = "*"

[requires]
python_version = ">=3.8"

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![LOGO](https://upload.wikimedia.org/wikipedia/commons/thumb/4/48/Radishes.svg/173px-Radishes.svg.png)

# RADICES

This software prototype creates an explorative sample of core accounts in (optionally language-based) Twitter follow networks.

If you use this for your research please cite [the article](https://journals.sagepub.com/doi/full/10.1177/2056305120984475) and/or [cite the software itself](https://doi.org/10.6084/m9.figshare.8864777).

## Why is this useful and how does this work?

In this journal article we explain the underlying method and draw a map of the German Twittersphere:

https://journals.sagepub.com/doi/full/10.1177/2056305120984475

This talk sums up the article:

https://youtu.be/qsnGTl8d3qU?t=21823.

A short usage demo that was prepared for the ICA conference in 2020 can be found here:

https://www.youtube.com/watch?v=i_p-tjvmrR4

(**PLEASE NOTE:** The language specification is not working as it did for our paper due to changes in the Twitter API. Now it uses the language of the last tweet (or optionally the last 200 tweets with a threshold fraction defined by you to avoid false positives) by a user as determined by Twitter instead of the interface language. This might lead to different results.)

A large-scale test of the 'bootstrapping' feature regarding the preservation of k-coreness-ranking of the sampled accounts is presented here:

https://youtu.be/sV8Giaj9UwI (video for IC2S2 2020)

A test of the sampling method across language communities (Italian and German) can be watched here:

https://www.youtube.com/watch?v=dhXRO2d1Eno (video for AoIR 2020)

Please feel free to open an issue or comment if you have any questions.

Moreover, if you find any bugs, you are invited to report them as an [Issue](https://github.com/FlxVctr/SparseTwitter/issues).

Before contributing/raising an issue, please read the [Contributor Guidelines](CONTRIBUTING.md).

## Installation & Usage
1. [Create a Twitter Developer app](https://developer.twitter.com/en/docs/basics/getting-started)
2. Set up your virtual environment with [pipenv](https://pipenv.readthedocs.io/en/latest/) [(see here)](#Create-Virtual-Environment-with-Pipenv)
3. Have users authorise your app (the more the better - at least one) [(see here)](#authorise-app--get-tokens)
4. [Set up a mysql Database locally or online](https://dev.mysql.com/doc/mysql-getting-started/en/).
5. Fill out config.yml according to your requirements [(see here)](#configuration-configyml)
6. Fill out the seeds_template with your starting seeds or use the given ones [(see here)](#Indicate-starting-seeds-for-the-walkers)
7. [Start software](#Start), be happy
8. (Develop the app further - [run tests](#Testing))

### Create Virtual Environment with Pipenv
We recommend installing [pipenv](https://pipenv.readthedocs.io/en/latest) (including the installation of pyenv) to create a virtual environment with all the required packages in the respective versions.
After installing pipenv, navigate to the project directory and run:

```
pipenv install
```
This creates a virtual environment and installs the packages specified in the Pipfile.

Run
```
pipenv shell
```
to start a shell in the virtual environment.

### Authorise App & Get Tokens
This app is based on a [Twitter Developer](https://developer.twitter.com/) app. To use it you first have to create a Twitter app.
Once you have done that, paste your Consumer API Key and Secret into a `keys.json`, for which you can copy `empty_keys.json` (do not delete or change this file if you want to use the developer tests).
You are now ready to have users authorise your app so that it will get more API calls. To do so, run
```
python twauth.py
```
This will open a link to Twitter that requires you (or someone else) to log in with their Twitter account. Once logged in, a 6-digit authorisation key will be shown on the screen. This key has to be entered into the console window where `twauth.py` is still running. Once the code has been entered, a new token will be added to the `tokens.csv` file. For this software to run, the app has to be authorised by at least one Twitter user.

### Configuration (config.yml)
After setting up your mysql database, copy `config_template.yml` to a file named `config.yml` and enter the database information. Do not change the dbtype argument since, at the moment, only mySQL databases are supported.
Note that the password field is required (this also means that your database has to be password-protected). If no password is given (even if none is needed for the database), the app will raise an Exception.

You can also indicate which Twitter user account details you want to collect. Those will be stored in a database table called `user_details`. At the moment, the software requires account id, follower count, account creation time and account tweets count to be collected, and you have to activate those by uncommenting them in the config. If you wish to collect more user details, just enter the mysql type after the colon (":") of the respective user detail in the list. The suggested type is already indicated in the comment in the respective line. Note, however, that collecting huge amounts of data has not been tested with all the user details being collected, so we do not guarantee that the code works with them. Moreover, due to Twitter API changes, some of the user details may become private properties and thus no longer be collectable through the API.

If you have a mailgun account, you can also add your details at the bottom of the `config.yml`. If you do so, you will receive an email when the software encounters an error.

### Indicate starting seeds for the walkers
The algorithm needs seeds (i.e. Twitter Account IDs) to draw randomly from when initialising the walkers or when it has reached an impasse. These seeds have to be specified in `seeds.csv`, one Twitter account ID per line.
Feel free to use `seeds_template.csv` (and rename it to `seeds.csv`) to replace the existing seeds, which are 200 randomly drawn accounts from the TrISMA dataset (Bruns, Moon, Münch & Sadkowsky, 2017) that use German as interface language.

Note that `seeds.csv` has to contain at least as many account IDs as walkers should run in parallel. We suggest using at least 100 seeds, the more the better (we used 15.000.000). However, since a recent update, the algorithm can gather ('bootstrap') its own seeds and there is no need to give a comprehensive seed list. This changes the quality of the sample (whether for the worse or the better is the subject of ongoing research); however, it makes it a very powerful exploratory tool.

## Start

**PLEASE NOTE:** The language specification is not working as it did for our paper due to changes in the Twitter API. Now it uses the language of the last tweet(s) by a user as determined by Twitter instead of the interface language. This might lead to different results from our paper (even though the macrostructures of a certain network should remain very similar).
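
To make the language check from the note above more concrete, here is a minimal, self-contained sketch of how a threshold fraction over a user's recent tweets could be evaluated. It is not the code RADICES runs: the function names `language_fractions` and `matches_languages` are hypothetical, and the rule that the requested languages *together* have to reach the threshold (with `>=`) is an assumption made for illustration.

```python
from collections import Counter


def language_fractions(tweet_langs):
    """Map each language code to its share of the given tweets."""
    total = len(tweet_langs)
    if total == 0:
        return {}
    return {lang: count / total for lang, count in Counter(tweet_langs).items()}


def matches_languages(tweet_langs, wanted, threshold):
    """Assumed rule: the wanted languages together reach the threshold share."""
    fractions = language_fractions(tweet_langs)
    share = sum(fractions.get(lang, 0.0) for lang in wanted)
    return share >= threshold


# Hypothetical language codes, as detected by Twitter, for one account's recent tweets.
langs = ["en"] * 18 + ["de", "it"]
print(language_fractions(langs))
print(matches_languages(langs, ["de", "it"], 0.05))
```

A higher threshold makes the check stricter, which is how false positives from a single stray tweet in a requested language can be avoided.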

Run (while you are in the pipenv virtual environment)
```
python start.py -n 2 -l de it -lt 0.05 -p 1 -k "keyword1" "keyword2" "keyword3"
```
where

* -n takes the number of seeds to be drawn from the seed pool,
* -l sets the [status languages](https://developer.twitter.com/en/docs/developer-utilities/supported-languages/api-reference/get-help-languages) of the Twitter accounts' last tweets that you are interested in,
* -lt defines the fraction of tweets within the last 200 tweets that has to be detected to be in the requested languages (might slow down collection),
* -k can be used to only follow paths to seeds who used defined keywords in their last 200 tweets (keywords are interpreted as [regexes](https://docs.python.org/3/howto/regex.html), ignoring case),
* and -p takes the number of pages to look at when identifying the next node.

For an explanation of advanced usage and more features (like 'bootstrapping', an approach reminiscent of snowballing to grow the seed pool) use

```
python start.py --help
```
which will show a help dialogue with explanations and default values. Please raise an issue if those should not be clear enough.

Note:
- If the program freezes after saying "Starting x Collectors", it is likely that either your keys.json or your tokens.csv contains wrong information. We are working on a solution that is more user-friendly!
- If you get an error saying "lookup_users() got an unexpected keyword argument", you likely have the wrong version of tweepy installed. Either update your tweepy package or use pipenv to create a virtual environment and install all the packages you need.
- If an error is encountered at some point: there is a -r (restart with latest seeds) option to resume collection after interrupting the crawler with `control-c`. This is also handy in case you need to reboot your machine.
**Note that you will still have to define the other parameters as you did when you started the collection the first time.**

## Analysis (with Gephi)

It is possible to import the data into tools like Gephi via a MySQL connector. However, Gephi apparently supports only MySQL 5 at the time of writing.

To do so, it is helpful to use [`create_node_view.sql`](https://github.com/FlxVctr/RADICES/blob/master/create_node_view.sql) and [`create_dense_result.sql`](https://github.com/FlxVctr/RADICES/blob/master/create_dense_result.sql) to create views for Gephi to import.

Then you can import the results, in the case of Gephi via the menu item **File -> Import Database -> Edge List**, using your database credentials and

* `SELECT * FROM nodes` as the "Node Query"
* `SELECT * FROM result` as the "Edge Query" if you want to analyse the walked edges only (as done in the German Twittersphere paper)
* `SELECT * FROM dense_result` as the "Edge Query" if you want to analyse all edges between collected accounts (which will be a much denser network)

Other tables created by RADICES in the database that might be interesting for analysis are:

* **result**: edge list (columns: source,target) containing the Twitter IDs of walked accounts
* **friends**: cache of collected follow connections, up to p * 5000 connections per walked account (might contain connections to accounts which do not fulfill language or keyword criteria)
* **user_details**: user details cache, as defined in `config.yml`, of all accounts in **result** and **friends** (might still contain data from accounts which do not fulfill language or keyword criteria, since this data is not deleted)

Other tables contain only data that is necessary for internal functions.

## Testing

For development purposes. Note that you still need a functional (i.e.
filled out) `keys.json` and tokens indicated in `tokens.csv` to work with.
Moreover, for some tests to run through, some user details json files are needed. They have to be stored in `tests/tweet_jsons/` and can be downloaded by running
```
python make_test_tweet_jsons.py -s 1670174994
```
where -s stands for the seed to use and can be replaced by any Twitter seed of your choice.
Note: the name `tweet_jsons` is misleading, since the json files actually contain information about specific users (friends of the given seed). This will be changed in a later version.

### passwords.py
Before testing, please enter the password of the sparsetwitter mySQL user into `passwords_template.py`. Then rename it to `passwords.py`. If you would like to use (and test) mailgun notifications, please enter the relevant information as well.

### Local mysql database
Some of the tests try to connect to a local mySQL database using the user "sparsetwitter@localhost". For these tests to run properly, a mySQL server has to be running on the device and a user 'sparsetwitter'@'localhost' with the relevant permissions has to exist.

Please refer to the [mySQL documentation](https://dev.mysql.com/doc/mysql-installation-excerpt/5.5/en/installing.html) on how to install mySQL on your system. If your mySQL server is up and running, the following command will create the user 'sparsetwitter' and will give it full permissions (replace "" with a password):

```
CREATE USER 'sparsetwitter'@'localhost' IDENTIFIED BY ''; GRANT ALL ON *.* TO 'sparsetwitter'@'localhost' WITH GRANT OPTION;
```

### Tests that will fail
For the functional tests, the test `FirstUseTest.test_restarts_after_exception` will fail if you did not provide (or did not provide valid) Mailgun credentials. Also one unit-test will fail in this case.

### Running the tests
To run the tests, just type

```
python functional_test.py
```

and / or

```
python tests/tests.py -s
```
The -s parameter is for skipping API call-draining tests. Note that even if -s is set, the tests can take very long to run if only a few API tokens are given in the tokens.csv. The whole software relies on a sufficiently high number of tokens. We used 15.

## Disclaimer
By submitting a pull request to this repository, you agree to license your contribution under the MIT license (as this project is).

The "Logo" above is from https://commons.wikimedia.org/wiki/File:Radishes.svg and licensed as being in the public domain ([CC0](https://creativecommons.org/publicdomain/zero/1.0/deed.en)).

--------------------------------------------------------------------------------
/collector.py:
--------------------------------------------------------------------------------
import multiprocessing.dummy as mp
import time
from exceptions import TestException
from functools import wraps
from sys import stdout, stderr

import numpy as np
import pandas as pd
import tweepy
from sqlalchemy.exc import IntegrityError, ProgrammingError

from database_handler import DataBaseHandler
from helpers import friends_details_dtypes
from setup import FileImport

# mp.set_start_method('spawn')


def get_latest_tweets(user_id, connection, fields=('lang', 'full_text')):
    """Returns the latest (up to 200) tweets of a user as a DataFrame, one column per field."""

    fields = list(fields)

    statuses = connection.api.user_timeline(user_id=user_id, count=200, tweet_mode='extended')

    # Collect all rows first: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
    rows = [{field: getattr(status, field) for field in fields} for status in statuses]

    return pd.DataFrame(rows, columns=fields)


def get_fraction_of_tweets_in_language(tweets):
    """Returns fraction of languages in a
    tweet dataframe as a dictionary

    Args:
        tweets (pandas.DataFrame): Tweet DataFrame as returned by `get_latest_tweets`
    Returns:
        language_fractions (dict): {languagecode (str): fraction (float)}
    """

    language_fractions = tweets['lang'].value_counts(normalize=True)

    language_fractions = language_fractions.to_dict()

    return language_fractions


# TODO: there might be a better way to drop columns that we don't want than flatten everything
# and removing the columns thereafter.
def flatten_json(y: dict, columns: list, sep: str = "_",
                 nonetype: dict = {'date': None, 'num': None, 'str': None, 'bool': None}):
    '''
    Flattens nested dictionaries.
    adapted from: https://medium.com/@amirziai/flattening-json-objects-in-python-f5343c794b10
    Args:
        y (dict): Nested dictionary to be flattened.
        columns (list of str): Dictionary keys that should not be flattened.
        sep (str): Separator for new dictionary keys of nested structures.
        nonetype (dict): values to use, per data type ('date', 'num', 'str', 'bool'),
            if a key's value is None
    '''

    out = {}

    def flatten(x, name=''):
        if type(x) is dict and str(name[:-1]) not in columns:  # don't flatten nested fields
            for a in x:
                flatten(x[a], name + a + sep)
        elif type(x) is list and str(name[:-1]) not in columns:  # same
            i = 0
            for a in x:
                flatten(a, name + str(i) + sep)
                i += 1
        elif type(x) is list and str(name[:-1]) in columns:
            out[str(name[:-1])] = str(x)  # Must be str so that nested lists are written to db
        elif type(x) is dict and str(name[:-1]) in columns:
            out[str(name[:-1])] = str(x)  # Same here
        elif type(x) is bool and str(name[:-1]) in columns:
            out[str(name[:-1])] = int(x)  # Booleans are written to db as integers
        elif x is None and str(name[:-1]) in columns:
            if friends_details_dtypes[str(name[:-1])] == np.datetime64:
                out[str(name[:-1])] = nonetype["date"]
            elif friends_details_dtypes[str(name[:-1])] == np.int64:
                out[str(name[:-1])] = nonetype["num"]
            elif friends_details_dtypes[str(name[:-1])] == str:
                out[str(name[:-1])] = nonetype["str"]
            elif friends_details_dtypes[str(name[:-1])] == np.int8:
                out[str(name[:-1])] = nonetype["bool"]
            else:
                raise NotImplementedError("twitter user_detail does not have a supported "
                                          "corresponding data type")
        else:
            out[str(name[:-1])] = x

    flatten(y)
    return out


# Decorator function for re-executing x times (with exponentially developing
# waiting times)
def retry_x_times(x):
    def retry_decorator(func):

        @wraps(func)
        def func_wrapper(*args, **kwargs):

            try:
                if kwargs['fail'] is True:
                    # if we're testing fails:
                    return func(*args, **kwargs)
            except KeyError:
                try:
                    if kwargs['test_fail'] is True:
                        return func(*args, **kwargs)
                except KeyError:
                    pass

            if 'restart' in kwargs:
                restart = kwargs['restart']

            if 'retries' in kwargs:
                retries = kwargs['retries']
            else:
                retries = x

            for i in range(retries - 1):
                try:
                    if 'restart' in kwargs:
                        kwargs['restart'] = restart
                    return func(*args, **kwargs)
                except Exception as e:
                    # every attempt after a failure is a restart
                    restart = True
                    waiting_time = 2**i
                    stdout.write(f"Encountered exception in {func.__name__}{args, kwargs}.\n{e}\n")
                    stdout.write(f"Retrying in {waiting_time} seconds.\n")
                    stdout.flush()
                    time.sleep(waiting_time)

            # last attempt: exceptions are passed on to the caller
            if 'restart' in kwargs:
                kwargs['restart'] = restart
            return func(*args, **kwargs)

        return func_wrapper

    return retry_decorator


class MyProcess(mp.Process):
    def run(self):
        try:
            mp.Process.run(self)
        except Exception as err:
            self.err = err
            raise self.err
        else:
            self.err = None


class Connection(object):
    """Class that handles the connection to Twitter

    Attributes:
        token_file_name (str): Path to file with user tokens
    """

    def __init__(self, token_file_name="tokens.csv", token_queue=None):
        self.credentials = FileImport().read_app_key_file()

        self.ctoken = self.credentials[0]
        self.csecret = self.credentials[1]

        if token_queue is None:
            self.tokens = FileImport().read_token_file(token_file_name)

            self.token_queue = mp.Queue()

            for token, secret in self.tokens.values:
                self.token_queue.put((token, secret, {}, {}))
        else:
            self.token_queue = token_queue

        self.token, self.secret, self.reset_time_dict, self.calls_dict = self.token_queue.get()
        self.auth = tweepy.OAuthHandler(self.ctoken, self.csecret)
        self.auth.set_access_token(self.token, self.secret)
        self.api = tweepy.API(self.auth, wait_on_rate_limit=False, wait_on_rate_limit_notify=False)

    def next_token(self):

self.token_queue.put((self.token, self.secret, self.reset_time_dict, self.calls_dict)) 189 | 190 | (self.token, self.secret, 191 | self.reset_time_dict, self.calls_dict) = self.token_queue.get() 192 | 193 | self.auth = tweepy.OAuthHandler(self.ctoken, self.csecret) 194 | self.auth.set_access_token(self.token, self.secret) 195 | 196 | self.api = tweepy.API(self.auth) 197 | 198 | def remaining_calls(self, endpoint='/friends/ids'): 199 | """Returns the number of remaining calls until reset time. 200 | 201 | Args: 202 | endpoint (str): 203 | API endpoint. 204 | Defaults to '/friends/ids' 205 | Returns: 206 | remaining calls (int) 207 | """ 208 | 209 | rate_limits = self.api.rate_limit_status() 210 | 211 | path = endpoint.split('/') 212 | 213 | path = path[1:] 214 | 215 | rate_limits = rate_limits['resources'][path[0]] 216 | 217 | key = "/" + path[0] 218 | 219 | for item in path[1:]: 220 | key = key + '/' + item 221 | rate_limits = rate_limits[key] 222 | 223 | rate_limits = rate_limits['remaining'] 224 | 225 | return rate_limits 226 | 227 | def reset_time(self, endpoint='/friends/ids'): 228 | """Returns the time until reset time. 229 | 230 | Args: 231 | endpoint (str): 232 | API endpoint. 233 | Defaults to '/friends/ids' 234 | Returns: 235 | remaining time in seconds (int) 236 | """ 237 | 238 | reset_time = self.api.rate_limit_status() 239 | 240 | path = endpoint.split('/') 241 | 242 | path = path[1:] 243 | 244 | reset_time = reset_time['resources'][path[0]] 245 | 246 | key = "/" + path[0] 247 | 248 | for item in path[1:]: 249 | key = key + '/' + item 250 | reset_time = reset_time[key] 251 | 252 | reset_time = reset_time['reset'] - int(time.time()) 253 | 254 | return reset_time 255 | 256 | 257 | class Collector(object): 258 | """Does the collecting of friends. 
259 | 260 | Attributes: 261 | connection (Connection object): 262 | Connection object with actually active credentials 263 | seed (int): Twitter id of seed user 264 | """ 265 | 266 | def __init__(self, connection, seed, following_pages_limit=0): 267 | self.seed = seed 268 | self.connection = connection 269 | 270 | self.token_blacklist = {} 271 | self.following_pages_limit = following_pages_limit 272 | 273 | class Decorators(object): 274 | 275 | @staticmethod 276 | def retry_with_next_token_on_rate_limit_error(func): 277 | def wrapper(*args, **kwargs): 278 | collector = args[0] 279 | old_token = collector.connection.token 280 | while True: 281 | try: 282 | try: 283 | if kwargs['force_retry_token'] is True: 284 | print('Forced retry with token.') 285 | return func(*args, **kwargs) 286 | except KeyError: 287 | pass 288 | try: 289 | if collector.token_blacklist[old_token] <= time.time(): 290 | print(f'Token starting with {old_token[:4]} should work again.') 291 | return func(*args, **kwargs) 292 | else: 293 | print(f'Token starting with {old_token[:4]} not ready yet.') 294 | collector.connection.next_token() 295 | time.sleep(10) 296 | continue 297 | except KeyError: 298 | print(f'Token starting with {old_token[:4]} not tried yet. Trying.') 299 | return func(*args, **kwargs) 300 | except tweepy.RateLimitError: 301 | collector.token_blacklist[old_token] = time.time() + 150 302 | print(f'Token starting with {old_token[:4]} hit rate limit.') 303 | print("Retrying with next available token.") 304 | print(f"Blacklisted until {collector.token_blacklist[old_token]}") 305 | collector.connection.next_token() 306 | continue 307 | break 308 | return wrapper 309 | 310 | @Decorators.retry_with_next_token_on_rate_limit_error 311 | def check_API_calls_and_update_if_necessary(self, endpoint, check_calls=True): 312 | """Checks for an endpoint how many calls are left (optional), gets the reset time 313 | and updates token if necessary. 
314 | 315 | If called with check_calls = False, 316 | it will assume that the actual token calls for the specified endpoint are depleted 317 | and return None for remaining calls 318 | 319 | Args: 320 | endpoint (str): API endpoint, e.g. '/friends/ids' 321 | check_calls (boolean): Default True 322 | Returns: 323 | if check_calls=True: 324 | remaining_calls (int) 325 | else: 326 | None 327 | """ 328 | 329 | def try_remaining_calls_except_invalid_token(): 330 | try: 331 | remaining_calls = self.connection.remaining_calls(endpoint=endpoint) 332 | except tweepy.error.TweepError as invalid_error: 333 | if "'code': 89" in invalid_error.reason: 334 | print(f"Token starting with {self.connection.token[:5]} seems to have expired or\ 335 | it has been revoked.") 336 | print(invalid_error) 337 | self.connection.next_token() 338 | remaining_calls = self.connection.remaining_calls(endpoint=endpoint) 339 | else: 340 | raise invalid_error 341 | print("REMAINING CALLS FOR {} WITH TOKEN STARTING WITH {}: ".format( 342 | endpoint, self.connection.token[:4]), remaining_calls) 343 | return remaining_calls 344 | 345 | if check_calls is True: 346 | self.connection.calls_dict[endpoint] = try_remaining_calls_except_invalid_token() 347 | 348 | reset_time = self.connection.reset_time(endpoint=endpoint) 349 | 350 | self.connection.reset_time_dict[endpoint] = time.time() + reset_time 351 | 352 | while self.connection.calls_dict[endpoint] == 0: 353 | stdout.write("Attempt with next available token.\n") 354 | 355 | self.connection.next_token() 356 | 357 | try: 358 | next_reset_at = self.connection.reset_time_dict[endpoint] 359 | if time.time() >= next_reset_at: 360 | self.connection.calls_dict[endpoint] = \ 361 | self.connection.remaining_calls(endpoint=endpoint) 362 | else: 363 | time.sleep(10) 364 | continue 365 | except KeyError: 366 | self.connection.calls_dict[endpoint] = \ 367 | try_remaining_calls_except_invalid_token() 368 | reset_time = self.connection.reset_time(endpoint=endpoint) 369 
| self.connection.reset_time_dict[endpoint] = time.time() + reset_time 370 | 371 | print("REMAINING CALLS FOR {} WITH TOKEN STARTING WITH {}: ".format( 372 | endpoint, self.connection.token[:4]), self.connection.calls_dict[endpoint]) 373 | print(f"{time.strftime('%c')}: new reset of token {self.connection.token[:4]} for \ 374 | {endpoint} in {int(self.connection.reset_time_dict[endpoint] - time.time())} seconds.") 375 | 376 | return self.connection.calls_dict[endpoint] 377 | 378 | else: 379 | self.connection.calls_dict[endpoint] = 0 380 | 381 | if endpoint not in self.connection.reset_time_dict \ 382 | or self.connection.reset_time_dict[endpoint] <= time.time(): 383 | reset_time = self.connection.reset_time(endpoint=endpoint) 384 | self.connection.reset_time_dict[endpoint] = time.time() + reset_time 385 | print("REMAINING CALLS FOR {} WITH TOKEN STARTING WITH {}: ".format( 386 | endpoint, self.connection.token[:4]), self.connection.calls_dict[endpoint]) 387 | print(f"{time.strftime('%c')}: new reset of token {self.connection.token[:4]} for \ 388 | {endpoint} in {int(self.connection.reset_time_dict[endpoint] - time.time())} seconds.") 389 | 390 | while (endpoint in self.connection.reset_time_dict and 391 | self.connection.reset_time_dict[endpoint] >= time.time() and 392 | self.connection.calls_dict[endpoint] == 0): 393 | self.connection.next_token() 394 | time.sleep(1) 395 | 396 | return None 397 | 398 | def get_friend_list(self, twitter_id=None, follower=False): 399 | """Gets the friend list of an account. 400 | 401 | Args: 402 | twitter_id (int): Twitter Id of account, 403 | if None defaults to seed account of Collector object. 404 | 405 | Returns: 406 | list with friends of user. 
407 | """ 408 | 409 | if twitter_id is None: 410 | twitter_id = self.seed 411 | 412 | result = [] 413 | 414 | cursor = -1 415 | following_page = 0 416 | while self.following_pages_limit == 0 or following_page < self.following_pages_limit: 417 | while True: 418 | try: 419 | if follower is False: 420 | page = self.connection.api.friends_ids(user_id=twitter_id, cursor=cursor) 421 | self.connection.calls_dict['/friends/ids'] = 1 422 | else: 423 | page = self.connection.api.followers_ids(user_id=twitter_id, cursor=cursor) 424 | self.connection.calls_dict['/followers/ids'] = 1 425 | break 426 | except tweepy.RateLimitError: 427 | if follower is False: 428 | self.check_API_calls_and_update_if_necessary(endpoint='/friends/ids', 429 | check_calls=False) 430 | else: 431 | self.check_API_calls_and_update_if_necessary(endpoint='/followers/ids', 432 | check_calls=False) 433 | 434 | if len(page[0]) > 0: 435 | result += page[0] 436 | else: 437 | break 438 | cursor = page[1][1] 439 | 440 | following_page += 1 441 | 442 | return result 443 | 444 | def get_details(self, friends): 445 | """Collects details from friends of an account. 446 | 447 | Args: 448 | friends (list of int): list of Twitter user ids 449 | 450 | Returns: 451 | list of Tweepy user objects 452 | """ 453 | 454 | i = 0 455 | 456 | user_details = [] 457 | 458 | while i < len(friends): 459 | 460 | if i + 100 <= len(friends): 461 | j = i + 100 462 | else: 463 | j = len(friends) 464 | 465 | while True: 466 | try: 467 | try: 468 | user_details += self.connection.api.lookup_users(user_ids=friends[i:j], 469 | tweet_mode='extended') 470 | except tweepy.error.TweepError as e: 471 | if "No user matches for specified terms." 
in e.reason: 472 | stdout.write(f"No user matches for {friends[i:j]}") 473 | stdout.flush() 474 | else: 475 | raise e 476 | self.connection.calls_dict['/users/lookup'] = 1 477 | break 478 | except tweepy.RateLimitError: 479 | self.check_API_calls_and_update_if_necessary(endpoint='/users/lookup', 480 | check_calls=False) 481 | 482 | i += 100 483 | 484 | return user_details 485 | 486 | @staticmethod 487 | def make_friend_df(friends_details, select=["id", "followers_count", "status_lang", 488 | "created_at", "statuses_count"], 489 | provide_jsons: bool = False, replace_nonetype: bool = True, 490 | nonetype: dict = {'date': '1970-01-01', 491 | 'num': -1, 492 | 'str': '-1', 493 | 'bool': -1}): 494 | """Transforms list of user details to pandas.DataFrame 495 | 496 | Args: 497 | friends_details (list of Tweepy user objects) 498 | select (list of str): columns to keep in DataFrame 499 | provide_jsons (boolean): If true, will treat friends_details as list of jsons. This 500 | allows creating a user details dataframe without having to 501 | download the details first. Note that the jsons must have the 502 | same format as the _json attribute of a user node of the 503 | Twitter API. 504 | replace_nonetype (boolean): Whether or not to replace values in the user_details that 505 | are None. Setting this to False is experimental, since code 506 | to avoid errors resulting from it has not yet been 507 | implemented. By default, missing dates will be replaced by 508 | 1970/01/01, missing numericals by -1, missing strs by 509 | '-1', and missing booleans by -1. 510 | Use the 'nonetype' param to change the default. 511 | nonetype (dict): Contains the defaults for nonetype replacement (see docs for 512 | 'replace_nonetype' param). 
513 | {'date': 'yyyy-mm-dd', 'num': int, 'str': 'str', 'bool': int} 514 | 515 | Returns: 516 | pandas.DataFrame with these columns or selected as by `select`: 517 | ['contributors_enabled', 518 | 'created_at', 519 | 'default_profile', 520 | 'default_profile_image', 521 | 'description', 522 | 'entities_description_urls', 523 | 'entities_url_urls', 524 | 'favourites_count', 525 | 'follow_request_sent', 526 | 'followers_count', 527 | 'following', 528 | 'friends_count', 529 | 'geo_enabled', 530 | 'has_extended_profile', 531 | 'id', 532 | 'id_str', 533 | 'is_translation_enabled', 534 | 'is_translator', 535 | 'lang', 536 | 'listed_count', 537 | 'location', 538 | 'name', 539 | 'needs_phone_verification', 540 | 'notifications', 541 | 'profile_background_color', 542 | 'profile_background_image_url', 543 | 'profile_background_image_url_https', 544 | 'profile_background_tile', 545 | 'profile_banner_url', 546 | 'profile_image_url', 547 | 'profile_image_url_https', 548 | 'profile_link_color', 549 | 'profile_sidebar_border_color', 550 | 'profile_sidebar_fill_color', 551 | 'profile_text_color', 552 | 'profile_use_background_image', 553 | 'protected', 554 | 'screen_name', 555 | 'status_contributors', 556 | 'status_coordinates', 557 | 'status_coordinates_coordinates', 558 | 'status_coordinates_type', 559 | 'status_created_at', 560 | 'status_entities_hashtags', 561 | 'status_entities_media', 562 | 'status_entities_symbols', 563 | 'status_entities_urls', 564 | 'status_entities_user_mentions', 565 | 'status_extended_entities_media', 566 | 'status_favorite_count', 567 | 'status_favorited', 568 | 'status_geo', 569 | 'status_geo_coordinates', 570 | 'status_geo_type', 571 | 'status_id', 572 | 'status_id_str', 573 | 'status_in_reply_to_screen_name', 574 | 'status_in_reply_to_status_id', 575 | 'status_in_reply_to_status_id_str', 576 | 'status_in_reply_to_user_id', 577 | 'status_in_reply_to_user_id_str', 578 | 'status_is_quote_status', 579 | 'status_lang', 580 | 'status_place', 581 | 
'status_place_bounding_box_coordinates', 582 | 'status_place_bounding_box_type', 583 | 'status_place_contained_within', 584 | 'status_place_country', 585 | 'status_place_country_code', 586 | 'status_place_full_name', 587 | 'status_place_id', 588 | 'status_place_name', 589 | 'status_place_place_type', 590 | 'status_place_url', 591 | 'status_possibly_sensitive', 592 | 'status_quoted_status_id', 593 | 'status_quoted_status_id_str', 594 | 'status_retweet_count', 595 | 'status_retweeted', 596 | 'status_retweeted_status_contributors', 597 | 'status_retweeted_status_coordinates', 598 | 'status_retweeted_status_created_at', 599 | 'status_retweeted_status_entities_hashtags', 600 | 'status_retweeted_status_entities_media', 601 | 'status_retweeted_status_entities_symbols', 602 | 'status_retweeted_status_entities_urls', 603 | 'status_retweeted_status_entities_user_mentions', 604 | 'status_retweeted_status_extended_entities_media', 605 | 'status_retweeted_status_favorite_count', 606 | 'status_retweeted_status_favorited', 607 | 'status_retweeted_status_geo', 608 | 'status_retweeted_status_id', 609 | 'status_retweeted_status_id_str', 610 | 'status_retweeted_status_in_reply_to_screen_name', 611 | 'status_retweeted_status_in_reply_to_status_id', 612 | 'status_retweeted_status_in_reply_to_status_id_str', 613 | 'status_retweeted_status_in_reply_to_user_id', 614 | 'status_retweeted_status_in_reply_to_user_id_str', 615 | 'status_retweeted_status_is_quote_status', 616 | 'status_retweeted_status_lang', 617 | 'status_retweeted_status_place', 618 | 'status_retweeted_status_possibly_sensitive', 619 | 'status_retweeted_status_quoted_status_id', 620 | 'status_retweeted_status_quoted_status_id_str', 621 | 'status_retweeted_status_retweet_count', 622 | 'status_retweeted_status_retweeted', 623 | 'status_retweeted_status_source', 624 | 'status_retweeted_status_full_text', 625 | 'status_retweeted_status_truncated', 626 | 'status_source', 627 | 'status_full_text', 628 | 'status_truncated', 629 | 
'statuses_count', 630 | 'suspended', 631 | 'time_zone', 632 | 'translator_type', 633 | 'url', 634 | 'verified', 635 | 'utc_offset'] 636 | """ 637 | 638 | if not provide_jsons: 639 | json_list_raw = [friend._json for friend in friends_details] 640 | else: 641 | json_list_raw = friends_details 642 | json_list = [] 643 | dtypes = {key: value for (key, value) in friends_details_dtypes.items() if key in select} 644 | for j in json_list_raw: 645 | flat = flatten_json(j, sep="_", columns=select, nonetype=nonetype) 646 | # Drop keys of the user_details json that are not in select 647 | newflat = {key: value for (key, value) in flat.items() if key in select} 648 | json_list.append(newflat) 649 | 650 | df = pd.json_normalize(json_list) 651 | 652 | for var in select: 653 | if var not in df.columns: 654 | if dtypes[var] == np.datetime64: 655 | df[var] = pd.to_datetime(nonetype["date"]) 656 | elif dtypes[var] == np.int64: 657 | df[var] = nonetype["num"] 658 | elif dtypes[var] == str: 659 | df[var] = nonetype["str"] 660 | elif dtypes[var] == np.int8: 661 | df[var] = nonetype["bool"] 662 | else: 663 | df[var] = np.nan 664 | else: 665 | if dtypes[var] == np.datetime64: 666 | df[var] = df[var].fillna(pd.to_datetime(nonetype["date"])) 667 | elif dtypes[var] == np.int64: 668 | df[var] = df[var].fillna(nonetype["num"]) 669 | elif dtypes[var] == str: 670 | df[var] = df[var].fillna(nonetype["str"]) 671 | elif dtypes[var] == np.int8: 672 | df[var] = df[var].fillna(nonetype["bool"]) 673 | df[var] = df[var].astype(dtypes[var]) 674 | 675 | df.sort_index(axis=1, inplace=True) 676 | return df 677 | 678 | def check_follows(self, source, target): 679 | """Checks via the Twitter API whether `source` account follows `target` account. 
680 | 681 | Args: 682 | source (int): user id 683 | target (int): user id 684 | Returns: 685 | - `True` if `source` follows `target` 686 | - `False` if `source` does not follow `target` 687 | """ 688 | 689 | # TODO: check remaining API calls 690 | 691 | friendship = self.connection.api.show_friendship( 692 | source_id=source, target_id=target) 693 | 694 | following = friendship[0].following 695 | 696 | return following 697 | 698 | 699 | class Coordinator(object): 700 | """Selects a queue of seeds and coordinates the collection with collectors 701 | and a queue of tokens. 702 | """ 703 | 704 | def __init__(self, seeds=2, token_file_name="tokens.csv", seed_list=None, 705 | following_pages_limit=0): 706 | 707 | # Get seeds from seeds.csv 708 | self.seed_pool = FileImport().read_seed_file() 709 | 710 | # Create seed_list if none is given by sampling from the seed_pool 711 | if seed_list is None: 712 | 713 | self.number_of_seeds = seeds 714 | try: 715 | self.seeds = self.seed_pool.sample(n=self.number_of_seeds) 716 | except ValueError: # seed pool too small 717 | stderr.write("WARNING: Seed pool smaller than number of seeds.\n") 718 | self.seeds = self.seed_pool.sample(n=self.number_of_seeds, replace=True) 719 | 720 | self.seeds = self.seeds[0].values 721 | else: 722 | self.number_of_seeds = len(seed_list) 723 | self.seeds = seed_list 724 | 725 | self.seed_queue = mp.Queue() 726 | 727 | for seed in self.seeds: 728 | self.seed_queue.put(seed) 729 | 730 | # Get authorized user tokens for app from tokens.csv 731 | self.tokens = FileImport().read_token_file(token_file_name) 732 | 733 | # and put them in a queue 734 | self.token_queue = mp.Queue() 735 | 736 | for token, secret in self.tokens.values: 737 | self.token_queue.put((token, secret, {}, {})) 738 | 739 | # Initialize DataBaseHandler for DB communication 740 | self.dbh = DataBaseHandler() 741 | self.following_pages_limit = following_pages_limit 742 | 743 | def bootstrap_seed_pool(self, after_timestamp=0): 744 | 
"""Adds all collected user details, i.e. friends with the desired properties 745 | (e.g. language) of previously found seeds to the seed pool. 746 | 747 | Args: 748 | after_timestamp (int): filter for friends added after this timestamp. Default: 0 749 | Returns: 750 | None 751 | """ 752 | 753 | seed_pool_size = len(self.seed_pool) 754 | stdout.write("Bootstrapping seeds.\n") 755 | stdout.write(f"Old size: {seed_pool_size}. Adding after {after_timestamp} ") 756 | stdout.flush() 757 | 758 | query = f"SELECT id FROM user_details WHERE UNIX_TIMESTAMP(timestamp) >= {after_timestamp}" 759 | 760 | more_seeds = pd.read_sql(query, self.dbh.engine) 761 | more_seeds.columns = [0] # rename from id to 0 for proper append 762 | self.seed_pool = self.seed_pool.merge(more_seeds, how='outer', on=[0]) 763 | 764 | seed_pool_size = len(self.seed_pool) 765 | stdout.write(f"New size: {seed_pool_size}\n") 766 | stdout.flush() 767 | 768 | def lookup_accounts_friend_details(self, account_id, db_connection=None, select="*"): 769 | """Looks up and retrieves details from friends of `account_id` via database. 770 | 771 | Args: 772 | account_id (int) 773 | db_connection (database connection/engine object) 774 | select (str): comma separated list of required fields, defaults to all available ("*") 775 | Returns: 776 | None, if no friends found. 777 | Otherwise DataFrame with all details. Might be empty if language filter is on. 
778 | """ 779 | 780 | if db_connection is None: 781 | db_connection = self.dbh.engine 782 | 783 | query = f"SELECT target from friends WHERE source = {account_id} AND burned = 0" 784 | friends = pd.read_sql(query, db_connection) 785 | 786 | if len(friends) == 0: 787 | return None 788 | else: 789 | friends = friends['target'].values 790 | friends = tuple(friends) 791 | if len(friends) == 1: 792 | friends = str(friends).replace(',', '') 793 | 794 | query = f"SELECT {select} from user_details WHERE id IN {friends}" 795 | friend_detail = pd.read_sql(query, db_connection) 796 | 797 | return friend_detail 798 | 799 | def choose_random_new_seed(self, msg, connection): 800 | new_seed = self.seed_pool.sample(n=1) 801 | new_seed = new_seed[0].values[0] 802 | 803 | if msg is not None: 804 | stdout.write(msg + "\n") 805 | stdout.flush() 806 | 807 | self.token_queue.put( 808 | (connection.token, connection.secret, 809 | connection.reset_time_dict, connection.calls_dict)) 810 | 811 | self.seed_queue.put(new_seed) 812 | 813 | return new_seed 814 | 815 | def write_user_details(self, user_details): 816 | """Writes pandas.DataFrame `user_details` to MySQL table 'user_details' 817 | """ 818 | 819 | try: 820 | user_details.to_sql('user_details', if_exists='append', 821 | index=False, con=self.dbh.engine) 822 | 823 | except IntegrityError: # duplicate id (primary key) 824 | temp_tbl_name = self.dbh.make_temp_tbl() 825 | user_details.to_sql(temp_tbl_name, if_exists="append", index=False, 826 | con=self.dbh.engine) 827 | query = "REPLACE INTO user_details SELECT * FROM {};".format( 828 | temp_tbl_name) 829 | self.dbh.engine.execute(query) 830 | self.dbh.engine.execute("DROP TABLE " + temp_tbl_name + ";") 831 | 832 | @retry_x_times(10) 833 | def work_through_seed_get_next_seed(self, seed, select=[], status_lang=None, 834 | connection=None, fail=False, **kwargs): 835 | """Takes a seed and determines the next seed and saves all details collected to db. 
836 | 837 | Args: 838 | seed (int) 839 | select (list of str): fields to save to database, defaults to all 840 | status_lang (str): Twitter language code for language of last status to filter for, 841 | defaults to None 842 | connection (collector.Connection object) 843 | Returns: 844 | seed (int) 845 | """ 846 | 847 | # For testing raise of errors while multithreading 848 | if fail is True: 849 | raise TestException 850 | 851 | if 'fail_hidden' in kwargs and kwargs['fail_hidden'] is True: 852 | raise TestException 853 | 854 | language_check_condition = ( 855 | status_lang is not None and 856 | 'language_threshold' in kwargs and 857 | kwargs['language_threshold'] > 0 858 | ) 859 | 860 | keyword_condition = ('keywords' in kwargs and 861 | kwargs['keywords'] is not None and 862 | len(kwargs['keywords']) > 0) 863 | 864 | if connection is None: 865 | connection = Connection(token_queue=self.token_queue) 866 | 867 | friends_details = None 868 | if 'restart' in kwargs and kwargs['restart'] is True: 869 | print("No db lookup after restart allowed, accessing Twitter API.") 870 | else: 871 | try: 872 | friends_details = self.lookup_accounts_friend_details( 873 | seed, self.dbh.engine) 874 | 875 | except ProgrammingError: 876 | 877 | print("""Accessing db for friends_details failed. Maybe database does not exist yet. 878 | Accessing Twitter API.""") 879 | 880 | if friends_details is None: 881 | if 'restart' in kwargs and kwargs['restart'] is True: 882 | pass 883 | elif language_check_condition or keyword_condition: 884 | check_exists_query = f""" 885 | SELECT EXISTS( 886 | SELECT source FROM result 887 | WHERE source={seed} 888 | ) 889 | """ 890 | seed_depleted = self.dbh.engine.execute(check_exists_query).scalar() 891 | 892 | if seed_depleted == 1: 893 | new_seed = self.choose_random_new_seed( 894 | f'Seed {seed} is depleted. No friends meet conditions. 
Random new seed.', 895 | connection) 896 | 897 | return new_seed 898 | 899 | collector = Collector(connection, seed, 900 | following_pages_limit=self.following_pages_limit) 901 | 902 | try: 903 | friend_list = collector.get_friend_list() 904 | if 'bootstrap' in kwargs and kwargs['bootstrap'] is True: 905 | follower_list = collector.get_friend_list(follower=True) 906 | except tweepy.error.TweepError as e: # if account is protected 907 | if "Not authorized." in e.reason: 908 | 909 | new_seed = self.choose_random_new_seed( 910 | "Account {} protected, selecting random seed.".format(seed), connection) 911 | 912 | return new_seed 913 | 914 | elif "does not exist" in e.reason: 915 | 916 | new_seed = self.choose_random_new_seed( 917 | f"Account {seed} does not exist. Selecting random seed.", connection) 918 | 919 | return new_seed 920 | 921 | else: 922 | raise e 923 | 924 | if friend_list == []: # if account follows nobody 925 | 926 | new_seed = self.choose_random_new_seed( 927 | "No friends or unburned connections left, selecting random seed.", connection) 928 | 929 | return new_seed 930 | 931 | self.dbh.write_friends(seed, friend_list) 932 | 933 | friends_details = collector.get_details(friend_list) 934 | select = list(set(select + ["id", "followers_count", 935 | "status_lang", "created_at", "statuses_count"])) 936 | friends_details = Collector.make_friend_df(friends_details, select) 937 | 938 | if 'bootstrap' in kwargs and kwargs['bootstrap'] is True: 939 | follower_details = collector.get_details(follower_list) 940 | follower_details = Collector.make_friend_df(follower_details, select) 941 | 942 | if status_lang is not None: 943 | 944 | if type(status_lang) is str: 945 | status_lang = [status_lang] 946 | friends_details = friends_details[friends_details['status_lang'].isin(status_lang)] 947 | 948 | if 'bootstrap' in kwargs and kwargs['bootstrap'] is True: 949 | follower_details = follower_details[follower_details['status_lang'].isin( 950 | status_lang)] 951 | 952 | if 
len(friends_details) == 0: 953 | 954 | new_seed = self.choose_random_new_seed( 955 | f"No friends found with language '{status_lang}', selecting random seed.", 956 | connection) 957 | 958 | return new_seed 959 | 960 | self.write_user_details(friends_details) 961 | 962 | if 'bootstrap' in kwargs and kwargs['bootstrap'] is True: 963 | self.write_user_details(follower_details) 964 | 965 | if status_lang is not None and len(friends_details) == 0: 966 | 967 | new_seed = self.seed_pool.sample(n=1) 968 | new_seed = new_seed[0].values[0] 969 | 970 | stdout.write( 971 | "No user details for friends with last status language '{}' found in db.\n".format( 972 | status_lang)) 973 | stdout.flush() 974 | 975 | self.token_queue.put( 976 | (connection.token, connection.secret, 977 | connection.reset_time_dict, connection.calls_dict)) 978 | 979 | self.seed_queue.put(new_seed) 980 | 981 | return new_seed 982 | 983 | if 'restart' in kwargs and kwargs['restart'] is True: 984 | # lookup just in case we had them already 985 | friends_details_db = self.lookup_accounts_friend_details( 986 | seed, self.dbh.engine) 987 | if friends_details_db is not None and len(friends_details_db) > 0: 988 | friends_details = friends_details_db 989 | 990 | double_burned = True 991 | 992 | while double_burned is True: 993 | max_follower_count = friends_details['followers_count'].max() 994 | 995 | new_seed = friends_details[ 996 | friends_details['followers_count'] == max_follower_count]['id'].values[0] 997 | 998 | while language_check_condition or keyword_condition: 999 | # RETRIEVE AND TEST MORE TWEETS FOR LANGUAGE OR KEYWORDS 1000 | try: 1001 | latest_tweets = get_latest_tweets(new_seed, connection, 1002 | fields=['lang', 'full_text']) 1003 | except tweepy.error.TweepError as e: # if account is protected 1004 | if "Not authorized." 
in e.reason: 1005 | new_seed = self.choose_random_new_seed( 1006 | f"Account {new_seed} protected, selecting random seed.", connection) 1007 | 1008 | return new_seed 1009 | elif "does not exist" in e.reason: 1010 | new_seed = self.choose_random_new_seed( 1011 | f"Account {new_seed} does not exist. Selecting random seed.", connection) 1012 | 1013 | return new_seed 1014 | else: 1015 | raise e 1016 | 1017 | threshold_met = True # True by default; set to False if not met 1018 | keyword_met = True 1019 | 1020 | if language_check_condition: 1021 | language_fractions = get_fraction_of_tweets_in_language(latest_tweets) 1022 | 1023 | threshold_met = any(kwargs['language_threshold'] <= fraction 1024 | for fraction in language_fractions.values()) 1025 | 1026 | if keyword_condition: 1027 | keyword_met = any(latest_tweets['full_text'].str.contains(keyword, 1028 | case=False).any() 1029 | for keyword in kwargs['keywords']) 1030 | 1031 | # THEN REMOVE FROM friends_details DATAFRAME, SEED POOL, 1032 | # AND DATABASE IF FALSE POSITIVE 1033 | # ACCORDING TO THRESHOLD OR KEYWORD 1034 | 1035 | if threshold_met and keyword_met: 1036 | break 1037 | else: 1038 | friends_details = friends_details[friends_details['id'] != new_seed] 1039 | 1040 | print( 1041 | f'seed pool size before removing non-matching seed: {len(self.seed_pool)}') 1042 | self.seed_pool = self.seed_pool[self.seed_pool[0] != new_seed] 1043 | print( 1044 | f'seed pool size after removing non-matching seed: {len(self.seed_pool)}') 1045 | 1046 | # query = f"DELETE from user_details WHERE id = {new_seed}" 1047 | # self.dbh.engine.execute(query) 1048 | 1049 | query = f"DELETE from friends WHERE target = {new_seed}" 1050 | self.dbh.engine.execute(query) 1051 | 1052 | # AND REPEAT THE CHECK 1053 | try: 1054 | new_seed = friends_details[friends_details['followers_count'] == 1055 | max_follower_count]['id'].values[0] 1056 | except IndexError: # no more friends 1057 | new_seed = self.choose_random_new_seed( 1058 | f'{seed}: 
No friends meet set conditions. Selecting random.', 1059 | connection) 1060 | 1061 | return new_seed 1062 | 1063 | check_exists_query = """ 1064 | SELECT EXISTS( 1065 | SELECT * FROM friends 1066 | WHERE source={source} 1067 | ) 1068 | """.format(source=new_seed) 1069 | node_exists_as_source = self.dbh.engine.execute(check_exists_query).scalar() 1070 | 1071 | if node_exists_as_source == 1: 1072 | check_follow_query = """ 1073 | SELECT EXISTS( 1074 | SELECT * FROM friends 1075 | WHERE source={source} and target={target} 1076 | ) 1077 | """.format(source=new_seed, target=seed) 1078 | 1079 | follows = self.dbh.engine.execute(check_follow_query).scalar() 1080 | 1081 | elif node_exists_as_source == 0: 1082 | # check on Twitter 1083 | 1084 | # FIXTHIS: dirty workaround because of wacky test 1085 | if connection == "fail": 1086 | connection = Connection() 1087 | 1088 | try: 1089 | collector 1090 | except NameError: 1091 | collector = Collector(connection, seed) 1092 | 1093 | try: 1094 | follows = int(collector.check_follows(source=new_seed, target=seed)) 1095 | except tweepy.TweepError: 1096 | print(f"Follow back undetermined. 
User {new_seed} not available") 1097 | follows = 0 1098 | 1099 | if follows == 0: 1100 | 1101 | insert_query = f""" 1102 | INSERT INTO result (source, target) 1103 | VALUES ({seed}, {new_seed}) 1104 | ON DUPLICATE KEY UPDATE source = source 1105 | """ 1106 | 1107 | self.dbh.engine.execute(insert_query) 1108 | 1109 | print('\nno follow back: added ({seed})-->({new_seed})'.format( 1110 | seed=seed, new_seed=new_seed 1111 | )) 1112 | 1113 | if follows == 1: 1114 | 1115 | insert_query = f""" 1116 | INSERT INTO result (source, target) 1117 | VALUES 1118 | ({seed}, {new_seed}), 1119 | ({new_seed}, {seed}) 1120 | ON DUPLICATE KEY UPDATE source = source 1121 | """ 1122 | 1123 | self.dbh.engine.execute(insert_query) 1124 | 1125 | print('\nfollow back: added ({seed})<-->({new_seed})'.format( 1126 | seed=seed, new_seed=new_seed 1127 | )) 1128 | 1129 | update_query = """ 1130 | UPDATE friends 1131 | SET burned=1 1132 | WHERE source={source} AND target={target} AND burned = 0 1133 | """.format(source=seed, target=new_seed) 1134 | 1135 | update_result = self.dbh.engine.execute(update_query) 1136 | 1137 | if update_result.rowcount == 0: 1138 | print(f"Connection ({seed})-->({new_seed}) was burned already.") 1139 | friends_details = self.lookup_accounts_friend_details( 1140 | seed, self.dbh.engine) 1141 | 1142 | if friends_details is None or len(friends_details) == 0: 1143 | new_seed = self.choose_random_new_seed( 1144 | f"No friends or unburned connections left for {seed}, selecting random.", 1145 | connection) 1146 | 1147 | return new_seed 1148 | 1149 | else: 1150 | print(f"burned ({seed})-->({new_seed})") 1151 | double_burned = False 1152 | 1153 | self.token_queue.put( 1154 | (connection.token, connection.secret, 1155 | connection.reset_time_dict, connection.calls_dict)) 1156 | 1157 | self.seed_queue.put(new_seed) 1158 | 1159 | return new_seed 1160 | 1161 | def start_collectors(self, number_of_seeds=None, select=[], status_lang=None, fail=False, 1162 | fail_hidden=False, 
restart=False, retries=10, bootstrap=False, 1163 | latest_start_time=0, language_threshold=0, keywords=[]): 1164 | """Starts `number_of_seeds` collector threads, each 1165 | collecting the next seed for one seed taken from `self.seed_queue` 1166 | and putting it back into `self.seed_queue`. 1167 | 1168 | Args: 1169 | number_of_seeds (int): Defaults to `self.number_of_seeds` 1170 | select (list of strings): fields to save to user_details table in database 1171 | status_lang (str): language code of the latest tweet's language to select 1172 | Returns: 1173 | list of mp.(dummy.)Process 1174 | """ 1175 | 1176 | if bootstrap is True: 1177 | 1178 | if restart is True: 1179 | latest_start_time = 0 1180 | 1181 | self.bootstrap_seed_pool(after_timestamp=latest_start_time) 1182 | 1183 | if number_of_seeds is None: 1184 | number_of_seeds = self.number_of_seeds 1185 | 1186 | processes = [] 1187 | seed_list = [] 1188 | 1189 | print("number of seeds: ", number_of_seeds) 1190 | 1191 | for i in range(number_of_seeds): 1192 | seed = self.seed_queue.get() 1193 | seed_list += [seed] 1194 | print("seed ", i, ": ", seed) 1195 | processes.append(MyProcess(target=self.work_through_seed_get_next_seed, 1196 | kwargs={'seed': seed, 1197 | 'select': select, 1198 | 'status_lang': status_lang, 1199 | 'fail': fail, 1200 | 'fail_hidden': fail_hidden, 1201 | 'restart': restart, 1202 | 'retries': retries, 1203 | 'language_threshold': language_threshold, 1204 | 'bootstrap': bootstrap, 1205 | 'keywords': keywords}, 1206 | name=str(seed))) 1207 | 1208 | latest_seeds = pd.DataFrame(seed_list) 1209 | 1210 | latest_seeds.to_csv('latest_seeds.csv', index=False, header=False) 1211 | 1212 | for p in processes: 1213 | p.start() 1214 | print(f"Thread {p.name} started.") 1215 | 1216 | return processes 1217 | -------------------------------------------------------------------------------- /config_template.yml: -------------------------------------------------------------------------------- 1 | # In the following config file, 
please fill the fields as you need them. 2 | # Do not use quotes, just plain text: e.g.: 3 | # sql: 4 | # dbtype: sqlite 5 | # etc. 6 | 7 | # ================== Database Information ===================== 8 | sql: 9 | dbtype: mysql 10 | host: # if dbtype = mysql, provide host 11 | user: # if dbtype = mysql, provide user 12 | passwd: # if dbtype = mysql, provide password 13 | dbname: # provide a name for the database. 14 | 15 | 16 | # ================== Twitter User Details ===================== 17 | # If you wish to save certain twitter user details, please just add the SQL data 18 | # type you wish to save it as in the SQL database (recommended types are indicated 19 | # in parentheses). If you do not wish to save a certain detail, just leave it empty 20 | # like so: 21 | # twitter_user_details: 22 | # contributors_enabled: SMALLINT 23 | # created_at: 24 | # This will save the detail "contributors_enabled" as boolean / smallint into the 25 | # database but it will not save "created_at" at all. 
26 | 27 | twitter_user_details: 28 | contributors_enabled: # SMALLINT 29 | created_at: DATETIME 30 | default_profile: # SMALLINT 31 | default_profile_image: # SMALLINT 32 | description: # TEXT (contains a dict) 33 | entities_description_urls: # TEXT 34 | entities_url_urls: # TEXT (contains a dict) 35 | favourites_count: # BIGINT 36 | follow_request_sent: # SMALLINT 37 | followers_count: BIGINT 38 | following: # SMALLINT 39 | friends_count: # BIGINT 40 | geo_enabled: # SMALLINT 41 | has_extended_profile: # SMALLINT 42 | id: BIGINT PRIMARY KEY 43 | id_str: # VARCHAR(30) 44 | is_translation_enabled: # SMALLINT 45 | is_translator: # SMALLINT 46 | lang: # VARCHAR(10) 47 | listed_count: # BIGINT 48 | location: # TEXT 49 | name: # VARCHAR (50) 50 | needs_phone_verification: #SMALLINT 51 | notifications: # SMALLINT 52 | profile_background_color: # CHAR(6) (is a Hex Color Code) 53 | profile_background_image_url: # TEXT 54 | profile_background_image_url_https: # TEXT 55 | profile_background_tile: # SMALLINT 56 | profile_banner_url: # TEXT 57 | profile_image_url: # TEXT 58 | profile_image_url_https: # TEXT 59 | profile_link_color: # CHAR(6) (is a Hex Color Code) 60 | profile_sidebar_border_color: # CHAR(6) (is a Hex Color Code) 61 | profile_sidebar_fill_color: # CHAR(6) (is a Hex Color Code) 62 | profile_text_color: # CHAR(6) (is a Hex Color Code) 63 | profile_use_background_image: # SMALLINT 64 | protected: # SMALLINT 65 | screen_name: # VARCHAR(50) 66 | status_contributors: # TEXT (Rarely available) 67 | status_coordinates: # TEXT (contains a dict) 68 | status_coordinates_coordinates: # TEXT (Rarely available) 69 | status_coordinates_type: # TEXT (Rarely available) 70 | status_created_at: # DATETIME 71 | status_entities_hashtags: # TEXT (contains a dict) 72 | status_entities_media: # TEXT (contains a dict) 73 | status_entities_symbols: # TEXT (contains a dict) # DE FACTO ALWAYS EMPTY 74 | status_entities_urls: # TEXT (contains a dict) 75 | status_entities_user_mentions: # 
TEXT (contains a dict) 76 | status_extended_entities_media: # TEXT (contains a dict) 77 | status_favorite_count: # INT 78 | status_favorited: # SMALLINT 79 | status_geo: # TEXT (contains a dict) 80 | status_geo_coordinates: # TEXT (Rarely available) 81 | status_geo_type: # TEXT (Rarely available) 82 | status_id: # BIGINT 83 | status_id_str: # VARCHAR(50) 84 | status_in_reply_to_screen_name: # VARCHAR(50) 85 | status_in_reply_to_status_id: # BIGINT 86 | status_in_reply_to_status_id_str: # VARCHAR(50) 87 | status_in_reply_to_user_id: # BIGINT 88 | status_in_reply_to_user_id_str: # VARCHAR(30) 89 | status_is_quote_status: # SMALLINT 90 | status_lang: VARCHAR(10) 91 | status_place: # TEXT (contains a dict) 92 | status_place_bounding_box_coordinates: # TEXT (Rarely available) 93 | status_place_bounding_box_type: # TEXT (Rarely available) 94 | status_place_contained_within: # TEXT (Rarely available) 95 | status_place_country: # TEXT (Rarely available) 96 | status_place_country_code: # TEXT (Rarely available) 97 | status_place_full_name: # TEXT (Rarely available) 98 | status_place_id: # TEXT (Rarely available) 99 | status_place_name: # TEXT (Rarely available) 100 | status_place_place_type: # TEXT (Rarely available) 101 | status_place_url: # TEXT (Rarely available) 102 | status_possibly_sensitive: # SMALLINT 103 | status_quoted_status_id: # BIGINT 104 | status_quoted_status_id_str: # VARCHAR(50) 105 | status_retweet_count: # INT 106 | status_retweeted: # SMALLINT 107 | status_retweeted_status_contributors: # TEXT (Rarely available) 108 | status_retweeted_status_coordinates: # TEXT (contains a dict) 109 | status_retweeted_status_created_at: # DATETIME 110 | status_retweeted_status_entities_hashtags: # TEXT (contains a dict) 111 | status_retweeted_status_entities_media: # TEXT (contains a dict) 112 | status_retweeted_status_entities_symbols: # TEXT (Rarely available) 113 | status_retweeted_status_entities_urls: # TEXT (contains a dict) 114 | 
status_retweeted_status_entities_user_mentions: # TEXT (contains a dict) 115 | status_retweeted_status_extended_entities_media: # TEXT (contains a dict) 116 | status_retweeted_status_favorite_count: # INT 117 | status_retweeted_status_favorited: # SMALLINT 118 | status_retweeted_status_geo: # TEXT (contains a dict) 119 | status_retweeted_status_id: # BIGINT 120 | status_retweeted_status_id_str: # VARCHAR(50) 121 | status_retweeted_status_in_reply_to_screen_name: # VARCHAR(30) 122 | status_retweeted_status_in_reply_to_status_id: # BIGINT 123 | status_retweeted_status_in_reply_to_status_id_str: # VARCHAR(50) 124 | status_retweeted_status_in_reply_to_user_id: # BIGINT 125 | status_retweeted_status_in_reply_to_user_id_str: # VARCHAR(30) 126 | status_retweeted_status_is_quote_status: # SMALLINT 127 | status_retweeted_status_lang: # VARCHAR(10) 128 | status_retweeted_status_place: # TEXT (contains a dict) 129 | status_retweeted_status_possibly_sensitive: # SMALLINT 130 | status_retweeted_status_quoted_status_id: # BIGINT 131 | status_retweeted_status_quoted_status_id_str: # VARCHAR(50) 132 | status_retweeted_status_retweet_count: # INT 133 | status_retweeted_status_retweeted: # SMALLINT 134 | status_retweeted_status_source: # TEXT 135 | status_retweeted_status_full_text: # TEXT 136 | status_retweeted_status_truncated: # SMALLINT 137 | status_source: # TEXT 138 | status_full_text: # TEXT 139 | status_truncated: # SMALLINT 140 | statuses_count: BIGINT 141 | suspended: # SMALLINT 142 | time_zone: # TEXT (Rarely available) 143 | translator_type: # VARCHAR(50) 144 | url: # TEXT 145 | verified: # BOOLEAN 146 | utc_offset: # TEXT (Rarely available) 147 | 148 | 149 | # ================== Notification Emails ===================== 150 | 151 | notifications: 152 | email_to_notify: # user@example.com 153 | # mailgun details 154 | # (find them under the respective domain name here: https://mailgun.com/app/domains) 155 | mailgun_default_smtp_login: 156 | mailgun_api_base_url: 157 | 
mailgun_api_key: 158 | -------------------------------------------------------------------------------- /create_dense_result.sql: -------------------------------------------------------------------------------- 1 | CREATE VIEW `dense_result` AS 2 | SELECT DISTINCT source, target FROM friends WHERE source IN 3 | (SELECT DISTINCT T.id 4 | FROM 5 | (SELECT 6 | result.source AS id 7 | FROM 8 | result UNION SELECT 9 | result.target AS id 10 | FROM 11 | result) T) 12 | AND target IN 13 | (SELECT DISTINCT T.id 14 | FROM 15 | (SELECT 16 | result.source AS id 17 | FROM 18 | result UNION SELECT 19 | result.target AS id 20 | FROM 21 | result) T) 22 | -------------------------------------------------------------------------------- /create_node_view.sql: -------------------------------------------------------------------------------- 1 | CREATE VIEW `nodes` AS 2 | SELECT 3 | `user_details`.`id` AS `id`, 4 | `user_details`.`status_lang` AS `status_lang`, 5 | `user_details`.`screen_name` AS `screen_name`, 6 | `user_details`.`name` AS `name`, 7 | `user_details`.`location` AS `location`, 8 | `user_details`.`description` AS `description`, 9 | `user_details`.`created_at` AS `created_at`, 10 | `user_details`.`favourites_count` AS `favourites_count`, 11 | `user_details`.`followers_count` AS `followers_count`, 12 | `user_details`.`friends_count` AS `friends_count`, 13 | `user_details`.`listed_count` AS `listed_count`, 14 | `user_details`.`protected` AS `protected`, 15 | `user_details`.`statuses_count` AS `statuses_count`, 16 | `user_details`.`status_created_at` AS `status_created_at`, 17 | `user_details`.`timestamp` AS `timestamp`, 18 | `user_details`.`verified` AS `verified` 19 | FROM 20 | `user_details` 21 | WHERE 22 | `user_details`.`id` IN (SELECT 23 | `T`.`id` 24 | FROM 25 | (SELECT 26 | `result`.`source` AS `id` 27 | FROM 28 | `result` UNION SELECT 29 | `result`.`target` AS `id` 30 | FROM 31 | `result`) `T`) 
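The `dense_result` view in create_dense_result.sql keeps every edge from `friends` whose source and target both occur somewhere in `result`. A minimal sketch of that logic follows, using Python's built-in `sqlite3` with toy data. This is not part of the project, which targets MySQL: the in-memory database, the sample IDs, and the helper name `dense_edges` are all assumptions made for illustration, and the query uses a common table expression instead of repeating the subquery as the view does.

```python
import sqlite3


def dense_edges(friends, result):
    """Return the friends edges whose endpoints both occur in result.

    Hypothetical helper for illustration only; it mirrors the logic of the
    dense_result view on a throwaway in-memory SQLite database.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE friends (source BIGINT, target BIGINT)")
    con.execute("CREATE TABLE result (source BIGINT, target BIGINT)")
    con.executemany("INSERT INTO friends VALUES (?, ?)", friends)
    con.executemany("INSERT INTO result VALUES (?, ?)", result)

    # Collect every node ID that appears in result (as source or target),
    # then keep only the friends edges that connect two of those nodes.
    query = """
        WITH nodes AS (
            SELECT source AS id FROM result
            UNION
            SELECT target AS id FROM result
        )
        SELECT DISTINCT source, target
        FROM friends
        WHERE source IN (SELECT id FROM nodes)
          AND target IN (SELECT id FROM nodes)
        ORDER BY source, target
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    # Toy data: the walk sampled the edges (1, 2) and (2, 3).
    friends = [(1, 2), (2, 1), (2, 3), (3, 4), (1, 3)]
    result = [(1, 2), (2, 3)]
    print(dense_edges(friends, result))
```

In this toy run the edge `(3, 4)` is dropped because node 4 never appears in `result`, while `(1, 3)` and `(2, 1)` are kept even though the walk itself never sampled them. That is the sense in which the view is "dense".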
-------------------------------------------------------------------------------- /database_handler.py: -------------------------------------------------------------------------------- 1 | import sqlite3 as lite 2 | import uuid 3 | from sqlite3 import Error 4 | 5 | import pandas as pd 6 | from sqlalchemy import create_engine 7 | from sqlalchemy.exc import OperationalError 8 | 9 | from setup import Config 10 | 11 | 12 | class DataBaseHandler(): 13 | def __init__(self, config_path: str = "config.yml", config_dict: dict = None, 14 | create_all: bool = True): 15 | """Initializes class by either connecting to an existing database 16 | or by creating a new database. Database settings depend on config.yml 17 | 18 | Args: 19 | config_file (str): Path to configuration file. Defaults to "config.yml" 20 | config_dict (dict): Dictionary containing the config information (in case 21 | the dictionary shall be directly passed instead of read 22 | out of a configuration file). 23 | create_all (bool): If set to false, will not attempt to create the friends, 24 | result, and user_details tables. 25 | Returns: 26 | Nothing 27 | """ 28 | 29 | # Prepare user_details configured in config.yml for user_details table creation 30 | self.config = Config(config_path, config_dict) 31 | user_details_list = [] 32 | if "twitter_user_details" in self.config.config: 33 | for detail, sqldatatype in self.config.config["twitter_user_details"].items(): 34 | if sqldatatype is not None: 35 | user_details_list.append(detail + " " + sqldatatype) 36 | else: 37 | print("""Key "twitter_user_details" could not be found in config.yml. Will not create 38 | a user_details table.""") 39 | 40 | # Table creation for SQLITE database type. 
41 | # Note and TODO: the collector does not support sqlite (yet) 42 | if self.config.dbtype.lower() == "sqlite": 43 | try: 44 | self.engine = lite.connect(self.config.dbname + ".db") 45 | print("Connected to " + self.config.dbname + "!") 46 | except Error as e: 47 | raise e 48 | if create_all: 49 | try: 50 | create_friends_table_sql = """CREATE TABLE IF NOT EXISTS friends ( 51 | source BIGINT NOT NULL, 52 | target BIGINT NOT NULL, 53 | burned TINYINT NOT NULL, 54 | timestamp DATETIME DEFAULT CURRENT_TIMESTAMP 55 | );""" 56 | create_friends_index_sql_1 = "CREATE INDEX iFSource ON friends(source);" 57 | create_friends_index_sql_2 = "CREATE INDEX iFTimestamp ON friends(timestamp);" 58 | create_results_table_sql = """CREATE TABLE IF NOT EXISTS result ( 59 | source BIGINT NOT NULL, 60 | target BIGINT NOT NULL, 61 | timestamp DATETIME DEFAULT CURRENT_TIMESTAMP 62 | );""" 63 | create_results_index_sql_1 = "CREATE INDEX iRSource ON result(source);" 64 | create_results_index_sql_2 = "CREATE INDEX iRTimestamp ON result(timestamp);" 65 | c = self.engine.cursor() 66 | c.execute(create_friends_table_sql) 67 | c.execute(create_friends_index_sql_1) 68 | c.execute(create_friends_index_sql_2) 69 | c.execute(create_results_table_sql) 70 | c.execute(create_results_index_sql_1) 71 | c.execute(create_results_index_sql_2) 72 | if user_details_list != []: 73 | create_user_details_sql = """ 74 | CREATE TABLE IF NOT EXISTS user_details 75 | (""" + ", ".join(user_details_list) + """, 76 | timestamp DATETIME DEFAULT CURRENT_TIMESTAMP);""" 77 | create_ud_index = "CREATE INDEX iUTimestamp ON user_details(timestamp)" 78 | c.execute(create_user_details_sql) 79 | c.execute(create_ud_index) 80 | else: 81 | # TODO: Make this a minimal user_details table? 82 | print("""No user_details configured in config.yml. 
Will not create a 83 | user_details table.""") 84 | except Error as e: 85 | print(e) 86 | 87 | # Table creation for mysql database type 88 | elif self.config.dbtype.lower() == "mysql": 89 | try: 90 | self.engine = create_engine( 91 | f'mysql+pymysql://{self.config.dbuser}:' 92 | f'{self.config.dbpwd}@{self.config.dbhost}/{self.config.dbname}' 93 | ) 94 | print('Connected to database "' + self.config.dbname + '" via mySQL!') 95 | except OperationalError as e: 96 | raise e 97 | if create_all: 98 | try: 99 | create_friends_table_sql = """CREATE TABLE IF NOT EXISTS friends ( 100 | source BIGINT NOT NULL, 101 | target BIGINT NOT NULL, 102 | burned TINYINT NOT NULL, 103 | timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP 104 | ON UPDATE CURRENT_TIMESTAMP, 105 | UNIQUE INDEX fedge (source, target), 106 | INDEX(timestamp) 107 | );""" 108 | create_results_table_sql = """CREATE TABLE IF NOT EXISTS result ( 109 | source BIGINT NOT NULL, 110 | target BIGINT NOT NULL, 111 | timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 112 | UNIQUE INDEX redge (source, target), 113 | INDEX(timestamp) 114 | );""" 115 | self.engine.execute(create_friends_table_sql) 116 | self.engine.execute(create_results_table_sql) 117 | if user_details_list != []: 118 | create_user_details_sql = """ 119 | CREATE TABLE IF NOT EXISTS user_details 120 | (""" + ", ".join(user_details_list) + """, timestamp TIMESTAMP 121 | DEFAULT CURRENT_TIMESTAMP, 122 | INDEX(timestamp));""" 123 | self.engine.execute(create_user_details_sql) 124 | else: 125 | print("""No user_details configured in config.yml. Will not create a 126 | user_details table.""") 127 | except OperationalError as e: 128 | raise e 129 | 130 | def make_temp_tbl(self, type: str = "user_details"): 131 | """Creates a new temporary table with a random name consisting of a temp_ prefix 132 | and a uid. The structure of the table depends on the chosen type param. 
The 133 | table's structure will be a copy of an existing table, for example, a temporary 134 | user_details table will have the same columns and attributes (Keys, constraints, etc.) 135 | as the user_details table. 136 | 137 | Args: 138 | type (str): The table that the temporary table is going to simulate. 139 | Possible values are ["friends", "result", "user_details"] 140 | Returns: 141 | The name of the temporary table. 142 | """ 143 | uid = uuid.uuid4() 144 | temp_tbl_name = "temp_" + str(uid).replace('-', '_') 145 | 146 | if self.config.dbtype.lower() == "mysql": 147 | create_temp_tbl_sql = f"CREATE TABLE {temp_tbl_name} LIKE {type};" 148 | elif self.config.dbtype.lower() == "sqlite": 149 | create_temp_tbl_sql = f"CREATE TABLE {temp_tbl_name} AS SELECT * FROM {type} WHERE 0" 150 | self.engine.execute(create_temp_tbl_sql) 151 | return temp_tbl_name 152 | 153 | def write_friends(self, seed, friendlist): 154 | """Writes the database entries for one user and their friends in format user, friends. 155 | Note that the database is appended by the new entries, and that no entries will be deleted 156 | by this method. 
157 | 158 | Args: 159 | seed (str): single Twitter ID 160 | friendlist (list of str): Twitter IDs of seed's friends 161 | Returns: 162 | Nothing 163 | """ 164 | temp_tbl_name = self.make_temp_tbl(type="friends") 165 | 166 | friends_df = pd.DataFrame({'target': friendlist}) 167 | friends_df['source'] = seed 168 | friends_df['burned'] = 0 169 | friends_df.to_sql(name=temp_tbl_name, con=self.engine, if_exists="replace", index=False) 170 | 171 | if self.config.dbtype.lower() == "mysql": 172 | insert_query = f""" 173 | INSERT INTO friends (source, target, burned) 174 | SELECT source, target, burned 175 | FROM {temp_tbl_name} 176 | ON DUPLICATE KEY UPDATE 177 | source = {temp_tbl_name}.source 178 | """ 179 | elif self.config.dbtype.lower() == "sqlite": 180 | insert_query = f""" 181 | INSERT OR IGNORE INTO friends (source, target, burned) 182 | SELECT source, target, burned 183 | FROM {temp_tbl_name} 184 | """ 185 | 186 | self.engine.execute(insert_query) 187 | self.engine.execute(f"DROP TABLE {temp_tbl_name}") 188 | -------------------------------------------------------------------------------- /empty_keys.json: -------------------------------------------------------------------------------- 1 | {"consumer_token": "", "consumer_secret": ""} 2 | -------------------------------------------------------------------------------- /exceptions.py: -------------------------------------------------------------------------------- 1 | class TestException(Exception): 2 | pass 3 | -------------------------------------------------------------------------------- /functional_test.py: -------------------------------------------------------------------------------- 1 | # functional test for network collector 2 | import argparse 3 | from datetime import datetime 4 | import os 5 | import shutil 6 | import sys 7 | import unittest 8 | import warnings 9 | from exceptions import TestException 10 | from subprocess import PIPE, STDOUT, CalledProcessError, Popen, check_output 11 | 12 | import 
pandas as pd 13 | import yaml 14 | from sqlalchemy.exc import InternalError 15 | 16 | import test_helpers 17 | from collector import Coordinator 18 | from database_handler import DataBaseHandler 19 | from start import main_loop 20 | 21 | parser = argparse.ArgumentParser(description='SparseTwitter FunctionalTestSuite') 22 | parser.add_argument('-w', '--show_resource_warnings', 23 | help='If set, will show possible resource warnings from the requests package.', 24 | required=False, 25 | action='store_true') 26 | parser.add_argument('unittest_args', nargs='*') 27 | 28 | args = parser.parse_args() 29 | show_warnings = args.show_resource_warnings 30 | sys.argv[1:] = args.unittest_args 31 | 32 | mysql_cfg = test_helpers.config_dict_user_details_dtypes_mysql 33 | 34 | 35 | def setUpModule(): 36 | if not show_warnings: 37 | warnings.filterwarnings(action="ignore", 38 | message="unclosed", 39 | category=ResourceWarning) 40 | if os.path.isfile("latest_seeds.csv"): 41 | os.rename("latest_seeds.csv", 42 | "{}_latest_seeds.csv".format(datetime.now().isoformat().replace(":", "-"))) 43 | 44 | 45 | class FirstUseTest(unittest.TestCase): 46 | 47 | """Functional test for first use of the program.""" 48 | 49 | @classmethod 50 | def setUpClass(cls): 51 | os.rename("seeds.csv", "seeds.csv.bak") 52 | if os.path.exists("latest_seeds.csv"): 53 | os.rename("latest_seeds.csv", "latest_seeds.csv.bak") 54 | 55 | @classmethod 56 | def tearDownClass(cls): 57 | if os.path.exists("seeds.csv"): 58 | os.remove("seeds.csv") 59 | os.rename("seeds.csv.bak", "seeds.csv") 60 | if os.path.exists("latest_seeds.csv.bak"): 61 | os.rename("latest_seeds.csv.bak", "latest_seeds.csv") 62 | 63 | def setUp(self): 64 | if os.path.isfile("config.yml"): 65 | os.rename("config.yml", "config.yml.bak") 66 | 67 | def tearDown(self): 68 | if os.path.isfile("config.yml.bak"): 69 | os.replace("config.yml.bak", "config.yml") 70 | if os.path.isfile("seeds.csv"): 71 | os.remove("seeds.csv") 72 | 73 | dbh = 
DataBaseHandler(config_dict=mysql_cfg, create_all=False) 74 | 75 | try: 76 | dbh.engine.execute("DROP TABLE friends") 77 | except InternalError: 78 | pass 79 | try: 80 | dbh.engine.execute("DROP TABLE user_details") 81 | except InternalError: 82 | pass 83 | try: 84 | dbh.engine.execute("DROP TABLE result") 85 | except InternalError: 86 | pass 87 | 88 | def test_starts_and_checks_for_necessary_input_seeds_missing(self): 89 | if os.path.isfile("seeds.csv"): 90 | os.remove("seeds.csv") 91 | 92 | with open("config.yml", "w") as f: 93 | yaml.dump(mysql_cfg, f, default_flow_style=False) 94 | 95 | # User starts program with `start.py` 96 | try: 97 | response = str(check_output('python start.py', stderr=STDOUT, 98 | shell=True), encoding="ascii") 99 | 100 | # ... and encounters an error because the seeds.csv is missing. 101 | except CalledProcessError as e: 102 | response = str(e.output) 103 | self.assertIn('"seeds.csv" could not be found', response) 104 | 105 | def test_starts_and_checks_for_necessary_input_seeds_empty(self): 106 | # User starts program with `start.py` 107 | shutil.copyfile("seeds_empty.csv", "seeds.csv") 108 | 109 | with open("config.yml", "w") as f: 110 | yaml.dump(mysql_cfg, f, default_flow_style=False) 111 | 112 | try: 113 | response = str(check_output('python start.py', stderr=STDOUT, 114 | shell=True), encoding="ascii") 115 | 116 | # ... and encounters an error because the seeds.csv is empty. 117 | except CalledProcessError as e: 118 | response = str(e.output) 119 | self.assertIn('"seeds.csv" is empty', response) 120 | 121 | def test_starts_and_checks_for_necessary_input_config_missing(self): 122 | # user starts program with `start.py` 123 | if not os.path.exists("seeds.csv"): 124 | shutil.copyfile("seeds.csv.bak", "seeds.csv") 125 | try: 126 | response = str(check_output('python start.py', stderr=STDOUT, 127 | shell=True), encoding="ascii") 128 | 129 | # ... 
and encounters an error because: 130 | except CalledProcessError as e: 131 | response = str(e.output) 132 | # ... the config.yml is missing. Ergo the user creates a new one using make_config.py 133 | self.assertIn("provide a config.yml", response) 134 | if "provide a config.yml" in response: 135 | # Does make_config.py not make a new config.yml when entered "n"? 136 | p = Popen("python make_config.py", stdout=PIPE, stderr=PIPE, stdin=PIPE, 137 | shell=True) 138 | p.communicate("n\n".encode()) 139 | self.assertFalse(os.path.isfile("config.yml")) 140 | 141 | # Does make_config.py open a dialogue asking to open the new config.yaml? 142 | p = Popen("python make_config.py", stdout=PIPE, stderr=PIPE, stdin=PIPE, 143 | shell=True) 144 | p.communicate("y\n".encode()) 145 | 146 | self.assertTrue(os.path.exists("config.yml")) 147 | 148 | with open("config.yml", "w") as f: 149 | yaml.dump(mysql_cfg, f, default_flow_style=False) 150 | 151 | DataBaseHandler().engine.execute("DROP TABLES friends, user_details, result;") 152 | 153 | def test_starting_collectors_and_writing_to_db(self): 154 | 155 | shutil.copyfile("seeds_test.csv", "seeds.csv") 156 | 157 | with open("config.yml", "w") as f: 158 | yaml.dump(mysql_cfg, f, default_flow_style=False) 159 | 160 | try: 161 | response = str(check_output('python start.py -n 2 -t -p 1', 162 | stderr=STDOUT, shell=True)) 163 | print(response) 164 | except CalledProcessError as e: 165 | response = str(e.output) 166 | print(response) 167 | raise e 168 | 169 | dbh = DataBaseHandler() 170 | 171 | result = pd.read_sql("result", dbh.engine) 172 | 173 | self.assertLessEqual(len(result), 8) 174 | 175 | self.assertNotIn(True, result.duplicated().values) 176 | 177 | dbh.engine.execute("DROP TABLE friends, user_details, result;") 178 | 179 | def test_restarts_after_exception(self): 180 | 181 | shutil.copyfile("two_seeds.csv", "seeds.csv") 182 | 183 | with open("config.yml", "w") as f: 184 | yaml.dump(mysql_cfg, f, default_flow_style=False) 185 | 186 | 
with self.assertRaises(TestException): 187 | main_loop(Coordinator(), test_fail=True) 188 | 189 | p = Popen("python start.py -n 2 -t -f -p 1", stdout=PIPE, stderr=PIPE, stdin=PIPE, 190 | shell=True) 191 | 192 | stdout, stderr = p.communicate() 193 | 194 | self.assertIn("Retrying", stdout.decode('utf-8')) # tries to restart 195 | self.assertIn("Sent notification to", stdout.decode('utf-8')) 196 | 197 | latest_seeds = set(pd.read_csv("latest_seeds.csv", header=None)[0].values) 198 | seeds = set(pd.read_csv('seeds.csv', header=None)[0].values) 199 | 200 | self.assertEqual(latest_seeds, seeds) 201 | 202 | q = Popen("python start.py -t --restart -p 1", stdout=PIPE, stderr=PIPE, stdin=PIPE, 203 | shell=True) 204 | 205 | stdout, stderr = q.communicate() 206 | 207 | self.assertIn("Restarting with latest seeds:", stdout.decode('utf-8'), 208 | msg=f"{stdout.decode('utf-8')}\n{stderr.decode('utf-8')}") 209 | 210 | latest_seeds = set(pd.read_csv("latest_seeds.csv", header=None)[0].values) 211 | 212 | self.assertNotEqual(latest_seeds, seeds) 213 | 214 | DataBaseHandler().engine.execute("DROP TABLE friends, user_details, result;") 215 | 216 | def test_collects_only_requested_number_of_pages_of_friends(self): 217 | 218 | shutil.copyfile("seed_with_lots_of_friends.csv", "seeds.csv") 219 | 220 | with open("config.yml", "w") as f: 221 | yaml.dump(mysql_cfg, f, default_flow_style=False) 222 | 223 | try: 224 | response = str(check_output('python start.py -n 1 -t -p 1', 225 | stderr=STDOUT, shell=True)) 226 | print(response) 227 | except CalledProcessError as e: 228 | response = str(e.output) 229 | print(response) 230 | raise e 231 | 232 | dbh = DataBaseHandler() 233 | 234 | result = pd.read_sql("SELECT COUNT(*) FROM friends WHERE source = 2343198944", dbh.engine) 235 | 236 | result = result['COUNT(*)'][0] 237 | 238 | self.assertLessEqual(result, 5000) 239 | self.assertGreater(result, 4000) 240 | 241 | dbh.engine.execute("DROP TABLE friends, user_details, result;") 242 | 243 | 244 | if 
__name__ == '__main__': 245 | unittest.main() 246 | -------------------------------------------------------------------------------- /helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | friends_details_dtypes = { 5 | "contributors_enabled": np.int8, 6 | "created_at": np.datetime64, 7 | "default_profile": np.int8, 8 | "default_profile_image": np.int8, 9 | "description": str, 10 | "entities_description_urls": str, 11 | "entities_url_urls": str, 12 | "favourites_count": np.int64, 13 | "follow_request_sent": np.int8, 14 | "followers_count": np.int64, 15 | "following": np.int8, 16 | "friends_count": np.int64, 17 | "geo_enabled": np.int8, 18 | "has_extended_profile": np.int8, 19 | "id": np.int64, 20 | "id_str": str, 21 | "is_translation_enabled": np.int8, 22 | "is_translator": np.int8, 23 | "lang": str, 24 | "listed_count": np.int64, 25 | "location": str, 26 | "name": str, 27 | "needs_phone_verification": np.int8, 28 | "notifications": np.int8, 29 | "profile_background_color": str, 30 | "profile_background_image_url": str, 31 | "profile_background_image_url_https": str, 32 | "profile_background_tile": np.int8, 33 | "profile_banner_url": str, 34 | "profile_image_url": str, 35 | "profile_image_url_https": str, 36 | "profile_link_color": str, 37 | "profile_sidebar_border_color": str, 38 | "profile_sidebar_fill_color": str, 39 | "profile_text_color": str, 40 | "profile_use_background_image": np.int8, 41 | "protected": np.int8, 42 | "screen_name": str, 43 | "status_contributors": str, 44 | "status_coordinates": str, 45 | "status_coordinates_coordinates": str, 46 | "status_coordinates_type": str, 47 | "status_created_at": np.datetime64, 48 | "status_entities_hashtags": str, 49 | "status_entities_media": str, 50 | "status_entities_symbols": str, 51 | "status_entities_urls": str, 52 | "status_entities_user_mentions": str, 53 | "status_extended_entities_media": str, 54 | "status_favorite_count": np.int64, 
55 | "status_favorited": np.int8, 56 | "status_geo": str, 57 | "status_geo_coordinates": str, 58 | "status_geo_type": str, 59 | "status_id": np.int64, 60 | "status_id_str": str, 61 | "status_in_reply_to_screen_name": str, 62 | "status_in_reply_to_status_id": np.int64, 63 | "status_in_reply_to_status_id_str": str, 64 | "status_in_reply_to_user_id": np.int64, 65 | "status_in_reply_to_user_id_str": str, 66 | "status_is_quote_status": np.int8, 67 | "status_lang": str, 68 | "status_place": str, 69 | "status_place_bounding_box_coordinates": str, 70 | "status_place_bounding_box_type": str, 71 | "status_place_contained_within": str, 72 | "status_place_country": str, 73 | "status_place_country_code": str, 74 | "status_place_full_name": str, 75 | "status_place_id": str, 76 | "status_place_name": str, 77 | "status_place_place_type": str, 78 | "status_place_url": str, 79 | "status_possibly_sensitive": np.int8, 80 | "status_quoted_status_id": np.int64, 81 | "status_quoted_status_id_str": str, 82 | "status_retweet_count": np.int64, 83 | "status_retweeted": np.int8, 84 | "status_retweeted_status_contributors": str, 85 | "status_retweeted_status_coordinates": str, 86 | "status_retweeted_status_created_at": np.datetime64, 87 | "status_retweeted_status_entities_hashtags": str, 88 | "status_retweeted_status_entities_media": str, 89 | "status_retweeted_status_entities_symbols": str, 90 | "status_retweeted_status_entities_urls": str, 91 | "status_retweeted_status_entities_user_mentions": str, 92 | "status_retweeted_status_extended_entities_media": str, 93 | "status_retweeted_status_favorite_count": np.int64, 94 | "status_retweeted_status_favorited": np.int8, 95 | "status_retweeted_status_geo": str, 96 | "status_retweeted_status_id": np.int64, 97 | "status_retweeted_status_id_str": str, 98 | "status_retweeted_status_in_reply_to_screen_name": str, 99 | "status_retweeted_status_in_reply_to_status_id": np.int64, 100 | "status_retweeted_status_in_reply_to_status_id_str": str, 101 | 
"status_retweeted_status_in_reply_to_user_id": np.int64, 102 | "status_retweeted_status_in_reply_to_user_id_str": str, 103 | "status_retweeted_status_is_quote_status": np.int8, 104 | "status_retweeted_status_lang": str, 105 | "status_retweeted_status_place": str, 106 | "status_retweeted_status_possibly_sensitive": np.int8, 107 | "status_retweeted_status_quoted_status_id": np.int64, 108 | "status_retweeted_status_quoted_status_id_str": str, 109 | "status_retweeted_status_retweet_count": np.int64, 110 | "status_retweeted_status_retweeted": np.int8, 111 | "status_retweeted_status_source": str, 112 | "status_retweeted_status_full_text": str, 113 | "status_retweeted_status_truncated": np.int8, 114 | "status_source": str, 115 | "status_full_text": str, 116 | "status_truncated": np.int8, 117 | "statuses_count": np.int64, 118 | "suspended": np.int8, 119 | "time_zone": str, 120 | "translator_type": str, 121 | "url": str, 122 | "verified": np.int8, 123 | "utc_offset": str 124 | } 125 | -------------------------------------------------------------------------------- /make_config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import sys 4 | from shutil import copyfile 5 | 6 | 7 | # creates a new empty config file and opens it 8 | def make_config(): 9 | copyfile('config_template.yml', 'config.yml') 10 | 11 | 12 | if __name__ == '__main__': 13 | i = 0 14 | while True: 15 | if i == 0: 16 | answer = input('''This program will create a new config.yml.\n 17 | After running it, you will be asked with which program to\n 18 | open the new file. Please choose your standard text editor.\n 19 | Do you wish to create a new config.yml now? (y/n): ''') 20 | else: 21 | answer = input('''Sorry, I did not get your input. Do you wish to create \n 22 | a new config.yml now? 
Please answer y for yes or n for no: ''') 23 | if answer == "n": 24 | break 25 | elif answer == "y": 26 | make_config() 27 | if sys.platform.startswith('darwin'): 28 | subprocess.call(('open', "config.yml")) 29 | elif os.name == 'nt': # For Windows 30 | os.startfile("config.yml") 31 | elif os.name == 'posix': # For Linux and other POSIX systems (macOS is handled above) 32 | subprocess.call(('xdg-open', "config.yml")) 33 | break 34 | else: 35 | i = 1 36 | pass 37 | -------------------------------------------------------------------------------- /make_test_tweet_jsons.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | from collector import Connection, Collector 5 | from tweepy.error import TweepError 6 | 7 | parser = argparse.ArgumentParser(description='SparseTwitter user_details Downloader') 8 | parser.add_argument('-s', '--seed', 9 | help='''Provide a seed (=Twitter user ID). Its friends' details will 10 | be downloaded.''', 11 | required=False, 12 | type=int, 13 | default=1670174994) 14 | 15 | # Setup the Collector 16 | seed = parser.parse_args().seed # Swap seeds with another Twitter User ID if you like 17 | if seed == 1670174994: 18 | print("No seed given. 
Using default seed " + str(seed) + ".") 19 | else: 20 | print("Downloading and saving friends' details of user " + str(seed) + ".") 21 | con = Connection() 22 | collector = Collector(con, seed) 23 | 24 | # Get the friends and details of the specified seed 25 | try: 26 | friends = collector.get_friend_list() 27 | except TweepError as e: 28 | if "'code': 34" in e.reason: 29 | raise TweepError("The seed you have given is not a valid Twitter user ID") 30 | else: 31 | raise # re-raise any other API error; otherwise `friends` would be undefined below 32 | friends_details = collector.get_details(friends) 33 | 34 | # Check for the relevant directory 35 | if not os.path.isdir(os.path.join("tests", "tweet_jsons")): 36 | os.mkdir(os.path.join("tests", "tweet_jsons")) 37 | 38 | # Write details in json files 39 | ct = 1 40 | for friend_details in friends_details: 41 | with open(os.path.join("tests", "tweet_jsons", "user_" + str(ct) + ".json"), "w") as f: 42 | json.dump(friend_details._json, f) 43 | ct += 1 44 | -------------------------------------------------------------------------------- /passwords_template.py: -------------------------------------------------------------------------------- 1 | sparsetwittermysqlpw = "" # MySQL Database Password 2 | 3 | # Details for Mailgun 4 | email_to_notify = "" 5 | mailgun_default_smtp_login = "" 6 | mailgun_api_base_url = "" 7 | mailgun_api_key = "" 8 | -------------------------------------------------------------------------------- /seed_with_lots_of_friends.csv: -------------------------------------------------------------------------------- 1 | 2343198944 -------------------------------------------------------------------------------- /seeds.csv: -------------------------------------------------------------------------------- 1 | 1160813580 2 | 387286700 3 | 1650198780 4 | 3024495627 5 | 2578020576 6 | 1615651381 7 | 1474424516 8 | 2369064876 9 | 2356366477 10 | 3472134496 11 | 747694110 12 | 2823918498 13 | 3208550097 14 | 272320418 15 | 829238504 16 | 1313754673 17 | 1058601290 18 | 4745661921 19 | 4441230569 20 | 4318242023 21 
| 1063591824 22 | 3128010721 23 | 1127335022 24 | 1864730334 25 | 3002360276 26 | 4471778609 27 | 130920197 28 | 349283534 29 | 3856362255 30 | 2252523257 31 | 1444461152 32 | 4077417616 33 | 2262505775 34 | 3024991169 35 | 3147623579 36 | 3874002016 37 | 3671970441 38 | 1726736760 39 | 3814280783 40 | 2914680869 41 | 1058357965 42 | 3372766829 43 | 2538995806 44 | 2262632932 45 | 2885623492 46 | 3418218262 47 | 2218086659 48 | 335823104 49 | 1053116809 50 | 3401112971 51 | 3432820259 52 | 209303905 53 | 1258184006 54 | 796851890 55 | 1006285662 56 | 220695831 57 | 637379118 58 | 4277813061 59 | 1880484644 60 | 2743755429 61 | 2383032568 62 | 3973389993 63 | 4259114561 64 | 718765261 65 | 1726474417 66 | 2956222413 67 | 462397876 68 | 3296344923 69 | 884449957 70 | 3137787919 71 | 2335721112 72 | 3027412138 73 | 498901234 74 | 2781553545 75 | 351156021 76 | 2782879317 77 | 4135423463 78 | 3456702916 79 | 2390022283 80 | 4051757151 81 | 203013862 82 | 2376494468 83 | 3290275384 84 | 164215794 85 | 2505322224 86 | 490618665 87 | 941416256 88 | 368341331 89 | 2905473880 90 | 3170869461 91 | 1423283714 92 | 348531344 93 | 4094440878 94 | 2808133216 95 | 2572066829 96 | 913997334 97 | 2857548747 98 | 755882400 99 | 402905400 100 | 863806069 101 | 2859522701 102 | 2798373465 103 | 2842997865 104 | 704562018 105 | 1645850383 106 | 1338687164 107 | 154246966 108 | 2576520419 109 | 2762898040 110 | 629324495 111 | 399662062 112 | 2209566454 113 | 2513383812 114 | 2920605401 115 | 14654198 116 | 452151022 117 | 4689724754 118 | 1323367015 119 | 3502661722 120 | 582949817 121 | 3199922521 122 | 2247300875 123 | 4162142890 124 | 927925566 125 | 239562669 126 | 3368720859 127 | 1433137598 128 | 289021193 129 | 4758403947 130 | 2746563946 131 | 140023188 132 | 588559865 133 | 498903566 134 | 2391508352 135 | 585492814 136 | 269555197 137 | 3401234464 138 | 2322539054 139 | 881006096 140 | 382580898 141 | 485927303 142 | 115634822 143 | 2980339960 144 | 2623612919 145 | 939869593 
146 | 4762405155 147 | 3791004803 148 | 3771853829 149 | 1134613718 150 | 81878858 151 | 968468935 152 | 1127549936 153 | 2807449437 154 | 889759956 155 | 83843595 156 | 2187803476 157 | 3207500698 158 | 2604254442 159 | 2172587015 160 | 347856926 161 | 2477017303 162 | 877446548 163 | 438437315 164 | 2919834849 165 | 487816821 166 | 2759768812 167 | 518411707 168 | 2955674980 169 | 1324474825 170 | 86313958 171 | 2956580308 172 | 1355366048 173 | 3204190670 174 | 265279495 175 | 495344962 176 | 2428003101 177 | 198588886 178 | 171061254 179 | 339112582 180 | 2699374110 181 | 2635454777 182 | 1074460009 183 | 612562461 184 | 115398574 185 | 446242313 186 | 2992813781 187 | 2365475859 188 | 609966621 189 | 528335444 190 | 2572785724 191 | 288124721 192 | 3000036442 193 | 3588679635 194 | 2359625452 195 | 2177278912 196 | 3050287239 197 | 98723220 198 | 4872795622 199 | 2841209625 200 | 4003824741 201 | -------------------------------------------------------------------------------- /seeds_empty.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/FlxVctr/RADICES/94eb8d91663e58fcc20d1cddbba35b4f51378b0b/seeds_empty.csv -------------------------------------------------------------------------------- /seeds_template.csv: -------------------------------------------------------------------------------- 1 | 1160813580 2 | 387286700 3 | 1650198780 4 | 3024495627 5 | 2578020576 6 | 1615651381 7 | 1474424516 8 | 2369064876 9 | 2356366477 10 | 3472134496 11 | 747694110 12 | 2823918498 13 | 3208550097 14 | 272320418 15 | 829238504 16 | 1313754673 17 | 1058601290 18 | 4745661921 19 | 4441230569 20 | 4318242023 21 | 1063591824 22 | 3128010721 23 | 1127335022 24 | 1864730334 25 | 3002360276 26 | 4471778609 27 | 130920197 28 | 349283534 29 | 3856362255 30 | 2252523257 31 | 1444461152 32 | 4077417616 33 | 2262505775 34 | 3024991169 35 | 3147623579 36 | 3874002016 37 | 3671970441 38 | 1726736760 39 | 3814280783 40 | 
2914680869 41 | 1058357965 42 | 3372766829 43 | 2538995806 44 | 2262632932 45 | 2885623492 46 | 3418218262 47 | 2218086659 48 | 335823104 49 | 1053116809 50 | 3401112971 51 | 3432820259 52 | 209303905 53 | 1258184006 54 | 796851890 55 | 1006285662 56 | 220695831 57 | 637379118 58 | 4277813061 59 | 1880484644 60 | 2743755429 61 | 2383032568 62 | 3973389993 63 | 4259114561 64 | 718765261 65 | 1726474417 66 | 2956222413 67 | 462397876 68 | 3296344923 69 | 884449957 70 | 3137787919 71 | 2335721112 72 | 3027412138 73 | 498901234 74 | 2781553545 75 | 351156021 76 | 2782879317 77 | 4135423463 78 | 3456702916 79 | 2390022283 80 | 4051757151 81 | 203013862 82 | 2376494468 83 | 3290275384 84 | 164215794 85 | 2505322224 86 | 490618665 87 | 941416256 88 | 368341331 89 | 2905473880 90 | 3170869461 91 | 1423283714 92 | 348531344 93 | 4094440878 94 | 2808133216 95 | 2572066829 96 | 913997334 97 | 2857548747 98 | 755882400 99 | 402905400 100 | 863806069 101 | 2859522701 102 | 2798373465 103 | 2842997865 104 | 704562018 105 | 1645850383 106 | 1338687164 107 | 154246966 108 | 2576520419 109 | 2762898040 110 | 629324495 111 | 399662062 112 | 2209566454 113 | 2513383812 114 | 2920605401 115 | 14654198 116 | 452151022 117 | 4689724754 118 | 1323367015 119 | 3502661722 120 | 582949817 121 | 3199922521 122 | 2247300875 123 | 4162142890 124 | 927925566 125 | 239562669 126 | 3368720859 127 | 1433137598 128 | 289021193 129 | 4758403947 130 | 2746563946 131 | 140023188 132 | 588559865 133 | 498903566 134 | 2391508352 135 | 585492814 136 | 269555197 137 | 3401234464 138 | 2322539054 139 | 881006096 140 | 382580898 141 | 485927303 142 | 115634822 143 | 2980339960 144 | 2623612919 145 | 939869593 146 | 4762405155 147 | 3791004803 148 | 3771853829 149 | 1134613718 150 | 81878858 151 | 968468935 152 | 1127549936 153 | 2807449437 154 | 889759956 155 | 83843595 156 | 2187803476 157 | 3207500698 158 | 2604254442 159 | 2172587015 160 | 347856926 161 | 2477017303 162 | 877446548 163 | 438437315 164 | 
2919834849 165 | 487816821 166 | 2759768812 167 | 518411707 168 | 2955674980 169 | 1324474825 170 | 86313958 171 | 2956580308 172 | 1355366048 173 | 3204190670 174 | 265279495 175 | 495344962 176 | 2428003101 177 | 198588886 178 | 171061254 179 | 339112582 180 | 2699374110 181 | 2635454777 182 | 1074460009 183 | 612562461 184 | 115398574 185 | 446242313 186 | 2992813781 187 | 2365475859 188 | 609966621 189 | 528335444 190 | 2572785724 191 | 288124721 192 | 3000036442 193 | 3588679635 194 | 2359625452 195 | 2177278912 196 | 3050287239 197 | 98723220 198 | 4872795622 199 | 2841209625 200 | 4003824741 201 | -------------------------------------------------------------------------------- /seeds_test.csv: -------------------------------------------------------------------------------- 1 | 769041127899009024 2 | 83662933 3 | 2673261430 4 | 1016830345889701888 5 | 19830856 6 | 1032678888269336576 7 | 896102591498776578 8 | 36684736 9 | 476935296 10 | 596500117 11 | 967089795456491522 12 | 980251691193851904 13 | 2649412098 14 | 2328075054 15 | 2320475550 16 | 2364445033 17 | 932738354294161408 18 | 4820804277 19 | 39356720 20 | 190564700 21 | 806177097265901568 22 | 595484668 23 | 35257717 24 | 54184375 25 | 26716249 26 | 593041510 27 | 951028652770291712 28 | 1040403366 29 | 263020833 30 | 5715752 31 | 26783152 32 | 114508061 33 | 15243812 34 | 5734902 35 | 1560587828 36 | 84550656 37 | 66823494 38 | 14553288 39 | 31087080 40 | 19108766 41 | 19767324 42 | 11013962 43 | 36970167 44 | 988320878 45 | 16100710 46 | 14389093 47 | 1855503205 48 | 30569817 49 | 214272214 50 | 2836842186 51 | 976057131400155136 52 | 1348771484 53 | 368379922 54 | 20463983 55 | 21494378 56 | 727875626330378240 57 | 3076796608 58 | 201490244 59 | 335181321 60 | 3306930087 61 | 1423906682 62 | 57925222 63 | 2308103545 64 | 4901370903 65 | 16310263 66 | 2317351705 67 | 316389142 68 | 1464120432 69 | 453030125 70 | 1470232105 71 | 3046696618 72 | 65607107 73 | 47375691 74 | 5483202 75 | 103973912 76 
| 4307446636 77 | 3379461005 78 | 379940359 79 | 95551818 80 | 47478759 81 | 1372037468 82 | 33434994 83 | 736541087716630532 84 | 126046965 85 | 1337785291 86 | 78837869 87 | 88577174 88 | 936011619804442624 89 | 2764750474 90 | 2419140080 91 | 2828212668 92 | 398087684 93 | 2460368252 94 | 68400053 95 | 22525751 96 | 303048766 97 | 2962250933 98 | 381187236 99 | 5994452 100 | 16689804 101 | 18812572 102 | 817386 103 | 2251623492 104 | 209811713 105 | 1582853809 106 | 4398626122 107 | 1654188770 108 | 1451773004 109 | 316327930 110 | 23962323 111 | 717313 112 | 34743251 113 | 14647570 114 | 13298072 115 | 11348282 116 | 5988062 117 | 14677919 118 | 259771124 119 | 17248121 120 | 14075928 121 | 19658826 122 | 3459051 123 | 14361155 124 | 27830610 125 | 20596281 126 | 14499829 127 | 33933259 128 | 14700316 129 | 18213483 130 | 69231187 131 | 14159148 132 | 1289362482 133 | 38451030 134 | 3898598357 135 | 77436536 136 | 95731075 137 | 607311335 138 | 712598138901643264 139 | 79749280 140 | 16366472 141 | 16680571 142 | 846165733285347328 143 | 702824612095078400 144 | 56605612 145 | 701504334341734400 146 | 52401341 147 | 19002346 148 | 824241575706427392 149 | 26000689 150 | 8143682 151 | 2329921 152 | 37213193 153 | 27047369 154 | 3092629269 155 | 237670274 156 | 545156588 157 | 2233154425 158 | 187484412 159 | 818876014390603776 160 | 16333712 161 | 470106608 162 | 15518560 163 | 3027041582 164 | 12 165 | 50393960 166 | 13 167 | 82443473 168 | 27596259 169 | 12819112 170 | 17534607 171 | 314378129 172 | 8271262 173 | 361173087 174 | 426550185 175 | 21088417 176 | 708113 177 | 2259993758 178 | 264097819 179 | 288078578 180 | 288889970 181 | 277961081 182 | 123431491 183 | 22595510 184 | 388983706 185 | 4157723231 186 | 394063385 187 | 2979097128 188 | 5605712 189 | 17675072 190 | 14684110 191 | 253603966 192 | 243896198 193 | 3889782193 194 | 18622869 195 | 16017475 196 | 2347049341 197 | 16955870 198 | 2303751216 199 | 16753692 200 | 1374321499 201 | 13493122 202 
| 2479087773 203 | 830152026453577728 204 | 375502494 205 | 783648878495145984 206 | 80750540 207 | 900247404917673984 208 | 46377338 209 | 46220856 210 | 75498935 211 | 40383352 212 | 402875257 213 | 261665936 214 | 14767515 215 | 116766613 216 | 2164132428 217 | 47571889 218 | 14773649 219 | 11178902 220 | 16228337 221 | 817442499551756288 222 | 4816 223 | 14706139 224 | 402995665 225 | 765104120915234816 226 | 899902687 227 | 44586323 228 | 47881149 229 | 2977580734 230 | 1377379782 231 | 337321190 232 | 2287331420 233 | 1112553272 234 | 2771848270 235 | 114485232 236 | 17112843 237 | 27731964 238 | 358619703 239 | 94035021 240 | 93654149 241 | 36346494 242 | 822215673812119553 243 | 2341772972 244 | 2284174986 245 | 44196397 246 | 335671713 247 | 3094222966 248 | 142791254 249 | 120269446 250 | 48477652 251 | 3697013177 252 | 19854920 253 | 563311832 254 | 2891858896 255 | 267944170 256 | 50762276 257 | 1539343122 258 | 2228815292 259 | 807095 260 | 822710883935780864 261 | 2254182852 262 | 1595615893 263 | 3536736623 264 | 2775564497 265 | 229910053 266 | 816567039577956352 267 | 3431605294 268 | 15806521 269 | 11663272 270 | 425267565 271 | 4876022234 272 | 723099338629500928 273 | 4353357197 274 | 46769303 275 | 2472108787 276 | 58841829 277 | 823367015830323201 278 | 16426657 279 | 1320161 280 | 571603577 281 | 24362471 282 | 917910307 283 | 2179529843 284 | 183501560 285 | 153185635 286 | 14140695 287 | 15008456 288 | 775184441090027520 289 | 3197921 290 | 875861664 291 | 2398002414 292 | 3606164057 293 | 1605720368 294 | 3356531254 295 | 2244994945 296 | 786491 297 | 495430242 298 | 3306527433 299 | 2493407430 300 | 22097835 301 | 69448996 302 | 53549590 303 | 2903217440 304 | 2198690413 305 | 125695429 306 | 1398876810 307 | 207534677 308 | 202181866 309 | 123085951 310 | 121564533 311 | 716375773800697857 312 | 370389714 313 | 434857136 314 | 434236851 315 | 161646584 316 | 35708293 317 | 111600563 318 | 3017844172 319 | 4507964833 320 | 1198594368 321 
| 2237808535 322 | 14504859 323 | 4848860457 324 | 3588277825 325 | 1902773580 326 | 3075333501 327 | 1927923236 328 | 77712285 329 | 15008596 330 | 1493423918 331 | 415906140 332 | 426509606 333 | 304928205 334 | 56341402 335 | 15160540 336 | 582161546 337 | 16629994 338 | 47654804 339 | 15210983 340 | 6753702 341 | 14191640 342 | 6480152 343 | 238955118 344 | 2390331060 345 | 25053097 346 | 2323501513 347 | 1173685621 348 | 88943180 349 | 1200763760 350 | 603877056 351 | 88060423 352 | 783942472019931136 353 | 384650167 354 | 39428536 355 | 101770339 356 | 47438273 357 | 20778563 358 | 31384530 359 | 14893345 360 | 1526228120 361 | 1288375663 362 | 62827149 363 | 1290351 364 | 273214252 365 | 1025580128 366 | 492327798 367 | 17324052 368 | 740238495952736256 369 | 716348353915686912 370 | 2828562864 371 | 273199520 372 | 23268862 373 | 1699604941 374 | 101422142 375 | 1340011 376 | 961374817 377 | 20747796 378 | 1237373550 379 | 14892927 380 | 2248057626 381 | 1653010081 382 | 775583414510497792 383 | 491895975 384 | 380749300 385 | 1412345191 386 | 3190556852 387 | 25073877 388 | 2881661080 389 | 793926 390 | 14453232 391 | 54425158 392 | 30849103 393 | 763914232786145280 394 | 917965284 395 | 24741685 396 | 2803191 397 | 74286565 398 | 125400714 399 | 18720595 400 | 259453542 401 | 68677937 402 | 2341085688 403 | 27792033 404 | 3099501 405 | 2248872301 406 | 1618939716 407 | 89087163 408 | 2191981764 409 | 14414286 410 | 25298569 411 | 172532664 412 | 66468564 413 | 717583922595364865 414 | 12229772 415 | 4255361 416 | 76980293 417 | 142415193 418 | 1551184494 419 | 90524645 420 | 274626857 421 | 16669898 422 | 14824990 423 | 249765062 424 | 2312333412 425 | 754651485980459009 426 | 3353357193 427 | 86044873 428 | 335595328 429 | 14077832 430 | 91068586 431 | 133562519 432 | 1439685422 433 | 21558856 434 | 350700354 435 | 31543267 436 | 34247967 437 | 228055082 438 | 19530582 439 | 15140589 440 | 4727373682 441 | 2498331294 442 | 20499665 443 | 14300508 444 | 
607480698 445 | 14396027 446 | 420954778 447 | 736285300729733120 448 | 137116288 449 | 86390214 450 | 57571700 451 | 4923437464 452 | 352510636 453 | 566812856 454 | 91804096 455 | 1273755780 456 | 2443972061 457 | 2902651666 458 | 2786173214 459 | 106611045 460 | 51734793 461 | 14659447 462 | 15668978 463 | 67566869 464 | 170303282 465 | 76115927 466 | 1336552399 467 | 29248745 468 | 47087325 469 | 87308833 470 | 57140128 471 | 19304187 472 | 14299624 473 | 838999716 474 | 66421943 475 | 18247347 476 | 375499570 477 | 49976458 478 | 847412334 479 | 133893663 480 | 854581958 481 | 56510427 482 | 16330659 483 | 78270570 484 | 1547101849 485 | 3245142196 486 | 3119988399 487 | 1252144297 488 | 461379146 489 | 16403155 490 | 2859039287 491 | 26291578 492 | 20639941 493 | 50478950 494 | 3549743295 495 | 3674645536 496 | 2576431 497 | 2608166073 498 | 1884191208 499 | 19968025 500 | 274430113 501 | 2470512481 502 | 712749669260984320 503 | 78688499 504 | 3500269756 505 | 2301638324 506 | 4316593949 507 | 633 508 | 2182812798 509 | 600029236 510 | 17184309 511 | 287816744 512 | 334107188 513 | 522324711 514 | 719302825512034304 515 | 131935861 516 | 17682232 517 | 6635422 518 | 3406855121 519 | 1922183119 520 | 19278408 521 | 2402013658 522 | 2904886151 523 | 22036511 524 | 3356462770 525 | 1010950338 526 | 47428795 527 | 70459168 528 | 430306392 529 | 2810902381 530 | 1257535058 531 | 20555437 532 | 1160751655 533 | 21650075 534 | 12808972 535 | 154355992 536 | 2413352126 537 | 48289662 538 | 610659001 539 | 14247082 540 | 17682362 541 | 551687308 542 | 10451462 543 | 288605100 544 | 15806978 545 | 529830882 546 | 918151616 547 | 120813008 548 | 2251684044 549 | 279412211 550 | 216776631 551 | 1339835893 552 | 3389944744 553 | 295713773 554 | 1606242174 555 | 92227227 556 | 4228587674 557 | 16477634 558 | 224495471 559 | 2436389418 560 | 4023357917 561 | 763440894 562 | 3066551830 563 | 2916305152 564 | 5943942 565 | 67915432 566 | 19662154 567 | 5974412 568 | 22888086 
569 | 2846840261 570 | 50605029 571 | 1101341 572 | 822215679726100480 573 | 1536791610 574 | 3317084775 575 | 3525878536 576 | 272730913 577 | 3318180745 578 | 1101399690 579 | 2262722558 580 | 20280065 581 | 13334762 582 | 63873759 583 | 45391039 584 | 607946179 585 | 105590714 586 | 2213117090 587 | 1152122496 588 | 56365668 589 | 1891806212 590 | 51002583 591 | 464564528 592 | 11156392 593 | 3297270913 594 | 2734713482 595 | 19054387 596 | 13566872 597 | 206717989 598 | 169426475 599 | 157981564 600 | 196994616 601 | 124690469 602 | 33838201 603 | 3355051751 604 | 2837694282 605 | 78442404 606 | 846905922 607 | 420091534 608 | 16895951 609 | 75641981 610 | 11125942 611 | 251535413 612 | 1320998672 613 | 3023310881 614 | 15903326 615 | 1262059087 616 | 111030963 617 | 22755223 618 | 3305381595 619 | 500704345 620 | 35147037 621 | 190583610 622 | 2367431 623 | 98873137 624 | 596316544 625 | 3297801310 626 | 204297410 627 | 3196230661 628 | 1912617043 629 | 186910006 630 | 1946165882 631 | 3154287162 632 | 15765018 633 | 312189104 634 | 1490902753 635 | 1652884033 636 | 897858440 637 | 1549209552 638 | 466083800 639 | 22749856 640 | 46770438 641 | 39973389 642 | 91982547 643 | 2834511 644 | 1338019026 645 | 65078480 646 | 6822912 647 | 1621528116 648 | 1975955528 649 | 442678902 650 | 537122882 651 | 56505125 652 | 19330099 653 | 39223985 654 | 2202920821 655 | 2710319475 656 | 40148479 657 | 2718774725 658 | 41312733 659 | 370338250 660 | 95463894 661 | 78934252 662 | 34950452 663 | 887523506 664 | 372313088 665 | 17803524 666 | 19534629 667 | 242582424 668 | 2546210232 669 | 2176822777 670 | 1044919213 671 | 19488465 672 | 1689053928 673 | 1295719904 674 | 1335048846 675 | 19072286 676 | 3052069771 677 | 2595064682 678 | 30599331 679 | 69220946 680 | 719013 681 | 2846058853 682 | 14839109 683 | 2176358690 684 | 46471465 685 | 70105352 686 | 111574040 687 | 1530889117 688 | 26974240 689 | 124434709 690 | 814881 691 | 15247023 692 | 138012644 693 | 17015268 694 | 
2970420477 695 | 144078209 696 | 1051171 697 | 106241842 698 | 18202677 699 | 22461446 700 | 1132776031 701 | 3010639332 702 | 487905834 703 | 14645160 704 | 246532728 705 | 263127148 706 | 180505807 707 | 204832963 708 | 25411758 709 | 17672825 710 | 322506796 711 | 2496529670 712 | 7190852 713 | 15861468 714 | 502531529 715 | 614884262 716 | 13088992 717 | 627987302 718 | 531788806 719 | 664883 720 | 560864000 721 | 2832189284 722 | 16213124 723 | 130649891 724 | 87532773 725 | 6253282 726 | 376825877 727 | 6844292 728 | 22465460 729 | 18313165 730 | 371823539 731 | 94787173 732 | 234785663 733 | 950802878 734 | 15181803 735 | 1536552278 736 | 78807415 737 | 22467617 738 | 13521812 739 | 428982236 740 | 1958326711 741 | 9626642 742 | 838464523 743 | 15865878 744 | 2496667400 745 | 2270729301 746 | 218984871 747 | 1105839134 748 | 2315178300 749 | 521368028 750 | 2317524115 751 | 5769472 752 | 2231228510 753 | 253536357 754 | 611986351 755 | 949268624 756 | 2181826142 757 | 376267732 758 | 2154538213 759 | 1719900512 760 | 1430400949 761 | 18780641 762 | 19923474 763 | 321629144 764 | 39021657 765 | 17896763 766 | 17462723 767 | 1397451631 768 | 312748934 769 | 18490018 770 | 235921307 771 | 557558765 772 | 16076032 773 | 577289927 774 | 141915585 775 | 234296047 776 | 14854155 777 | 17971083 778 | 14850399 779 | 77719559 780 | 1140451 781 | 42256930 782 | 14235098 783 | 281581384 784 | 105102178 785 | 275686563 786 | 18972334 787 | 15392033 788 | 97426684 789 | 15742288 790 | 21749206 791 | 21695922 792 | 11060982 793 | 407347022 794 | 177507079 795 | 79456653 796 | 17791180 797 | 97310140 798 | 775591994 799 | 22014811 800 | 10246582 801 | 582728848 802 | 711803 803 | 235696644 804 | 401922298 805 | 621403074 806 | 19704259 807 | 95255169 808 | 460533115 809 | 82383210 810 | 857813509 811 | 471872425 812 | 15006174 813 | 15028848 814 | 52696465 815 | 16527404 816 | 21704174 817 | 20915367 818 | 17389991 819 | 144950663 820 | 19180320 821 | 40517090 822 | 
423364278 823 | 14640559 824 | 37712201 825 | 459298312 826 | 552895403 827 | 16867474 828 | 21440857 829 | 15211500 830 | 15775749 831 | 18239435 832 | 123578864 833 | 9793082 834 | 463700716 835 | 14176770 836 | 17803569 837 | 116193595 838 | 19242665 839 | 451586190 840 | 9993772 841 | 545825855 842 | 18942977 843 | 132660415 844 | 18588430 845 | 220000321 846 | 1741221 847 | 834544453 848 | 74448435 849 | 39982325 850 | 44576546 851 | 16475657 852 | 358428263 853 | 410955689 854 | 11979772 855 | 97237270 856 | 10253232 857 | 82971362 858 | 45370474 859 | 19004783 860 | 312705486 861 | 21257115 862 | 791083760 863 | 45163174 864 | 13068822 865 | 756741314 866 | 105554801 867 | 84111950 868 | 5582742 869 | 224407693 870 | 15473958 871 | 15998669 872 | 17782233 873 | 16745492 874 | 90596312 875 | 20597868 876 | 14497517 877 | 49411886 878 | 726912924 879 | 509064363 880 | 633793808 881 | 10876852 882 | 182055312 883 | 18466967 884 | 462775400 885 | 481456646 886 | 21209555 887 | 613657071 888 | 17492631 889 | 439281506 890 | 1217801 891 | 19759617 892 | 22036165 893 | 9773092 894 | 566298895 895 | 241067950 896 | 533076506 897 | 117051643 898 | 56118316 899 | 40227879 900 | 338891131 901 | 550843940 902 | 72814114 903 | 14764243 904 | 364488011 905 | 211544109 906 | 273468342 907 | 463122035 908 | 76700084 909 | 57350105 910 | 359919054 911 | 480266532 912 | 12421082 913 | 129026706 914 | 475910101 915 | 5746452 916 | 15129265 917 | 89593573 918 | 16569660 919 | 307009093 920 | 71277496 921 | 326776224 922 | 52687314 923 | 40283581 924 | 225235528 925 | 40227292 926 | 31030273 927 | 26015858 928 | 438532851 929 | 389946451 930 | 161723960 931 | 268464220 932 | 121431534 933 | 203781901 934 | 208149256 935 | 387942987 936 | 383797311 937 | 411560362 938 | 53901362 939 | 18295116 940 | 387750600 941 | 389532648 942 | 389393606 943 | 393577559 944 | 389298698 945 | 359959410 946 | 392675870 947 | 390160050 948 | 19299909 949 | 18687936 950 | 390945097 951 | 390622471 
952 | 237651617 953 | 19726654 954 | 35142791 955 | 190730941 956 | 48272332 957 | 19187671 958 | 114504124 959 | 337299792 960 | 14208791 961 | 316516471 962 | 49713024 963 | 98220617 964 | 16838771 965 | 258428285 966 | 2960221 967 | 173155110 968 | 34872011 969 | 326117963 970 | 19016843 971 | 225421876 972 | 225712501 973 | 16490214 974 | 234343491 975 | 57136767 976 | 225663702 977 | 323242396 978 | 1344951 979 | 38464103 980 | 27159398 981 | 263096190 982 | 917131 983 | 275508478 984 | 265296096 985 | 125348034 986 | 14324983 987 | 257124818 988 | 24867956 989 | 24870628 990 | 24277906 991 | 256400038 992 | 14613514 993 | 151058995 994 | 2836581 995 | 245735979 996 | 69248730 997 | 128463002 998 | 185611778 999 | 11435642 1000 | 9334352 1001 | 15040978 1002 | 89785537 1003 | 92546663 1004 | 101472891 1005 | 53112669 1006 | 1708841 1007 | 211089576 1008 | 21650481 1009 | 49394939 1010 | 1007251 1011 | 46090688 1012 | 14116588 1013 | 19280591 1014 | 102734360 1015 | 18216581 1016 | 201891140 1017 | 7014182 1018 | 21507142 1019 | 86277392 1020 | 14247715 1021 | 1727601 1022 | 21743933 1023 | 14868991 1024 | 16986371 1025 | 136879221 1026 | 18044580 1027 | 17995778 1028 | 14420009 1029 | 16589206 1030 | 12509262 1031 | 1234801 1032 | 38888944 1033 | 15494684 1034 | 14670903 1035 | 37439097 1036 | 123247764 1037 | 106164576 1038 | 68511500 1039 | 15072786 1040 | 94834350 1041 | 47971303 1042 | 47718696 1043 | 18761526 1044 | 91797748 1045 | 10049712 1046 | 18863815 1047 | 15318271 1048 | 31812497 1049 | 20862865 1050 | 77777801 1051 | 19903466 1052 | 5876652 1053 | 783214 1054 | 813286 1055 | 14454247 1056 | 20536157 1057 | 52407579 1058 | 9655032 1059 | 1879351 1060 | 21657529 1061 | 55836889 1062 | 56624987 1063 | 639643 1064 | 37930051 1065 | 9294012 1066 | 15234407 1067 | 27434239 1068 | 18477798 1069 | 14341194 1070 | 20187015 1071 | 42603105 1072 | 23759730 1073 | 14742504 1074 | 17773851 1075 | 
-------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import json 2 | from json import JSONDecodeError 3 | import pandas as pd 4 | from requests import post 5 | import yaml 6 | 7 | 8 | # General class for necessary file imports 9 | class FileImport(): 10 | def read_app_key_file(self, filename: str = "keys.json") -> tuple: 11 | """Reads file with consumer key and consumer secret (JSON) 12 | 13 | Args: 14 | filename (str, optional): Defaults to "keys.json" 15 | 16 | Returns: 17 | Tuple with two strings: (1) being the Twitter consumer token and (2) being the 18 | Twitter consumer secret 19 | """ 20 | 21 | # TODO: change return to dictionary 22 | 23 | try: 24 | with open(filename, "r") as f: 25 | self.key_file = json.load(f) 26 | except FileNotFoundError: 27 | raise FileNotFoundError('"keys.json" could not be found') 28 | except JSONDecodeError as e: 29 | print("Bad JSON file. Please check that 'keys.json' is formatted\ 30 | correctly and that it is not empty") 31 | raise e 32 | if "consumer_token" not in self.key_file or "consumer_secret" not in self.key_file: 33 | raise KeyError('''"keys.json" does not contain the dictionary keys 34 | "consumer_token" and/or "consumer_secret"''') 35 | 36 | if type(self.key_file["consumer_secret"]) is not str or type( 37 | self.key_file["consumer_token"]) is not str: 38 | raise TypeError("Consumer secret is type " + 39 | str(type(self.key_file["consumer_secret"])) + 40 | " and consumer token is type " + str(type( 41 | self.key_file["consumer_token"])) + '''. Both 42 | must be of type str. 
''') 43 | 44 | return (self.key_file["consumer_token"], self.key_file["consumer_secret"]) 45 | 46 | def read_seed_file(self, filename: str = "seeds.csv") -> pd.DataFrame: 47 | """Reads file with specified seeds to start from (csv) 48 | 49 | Args: 50 | filename (str, optional): Defaults to "seeds.csv" 51 | 52 | Returns: 53 | A single-column pandas DataFrame with one Twitter ID (seed) in each row. 54 | """ 55 | try: 56 | with open(filename, "r") as f: 57 | self.seeds = pd.read_csv(f, header=None) 58 | except FileNotFoundError: 59 | raise FileNotFoundError('"seeds.csv" could not be found') 60 | except pd.errors.EmptyDataError as e: 61 | print('"seeds.csv" is empty!') 62 | raise e 63 | return self.seeds 64 | 65 | def read_token_file(self, filename="tokens.csv"): 66 | """Reads file with authorized user tokens (csv). 67 | 68 | Args: 69 | filename (str, optional): Defaults to "tokens.csv" 70 | 71 | Returns: 72 | pandas.DataFrame: With columns `token` and `secret`, one line per user 73 | """ 74 | return pd.read_csv(filename) 75 | 76 | 77 | # Configuration class. Reads out all information from a given config.yml 78 | class Config(): 79 | """Class that handles the SQL and Twitter user details configuration. 80 | 81 | Attributes: 82 | config_file (str): Path to configuration file 83 | config_dict (dict): Dictionary containing the config information (in case 84 | the dictionary shall be directly passed instead of read 85 | out of a configuration file). 
86 | """ 87 | config_template = "config_template.yml" 88 | 89 | # Initializes class using config.yml 90 | def __init__(self, config_file="config.yml", config_dict: dict = None): 91 | if config_dict is not None: 92 | self.config = config_dict 93 | else: 94 | self.config_path = config_file 95 | try: 96 | with open(self.config_path, 'r') as f: 97 | self.config = yaml.safe_load(f) 98 | except FileNotFoundError: 99 | raise FileNotFoundError('Could not find "' + self.config_path + '''".\n 100 | Please run "python3 make_config.py" or provide a config.yml''') 101 | 102 | # Check if mailgun notifications should be used 103 | if "notifications" not in self.config: 104 | self.use_notifications = False 105 | else: 106 | self.notif_config = self.config["notifications"] 107 | notif_config_items = [value for (key, value) in self.notif_config.items()] 108 | single_items = list(set(notif_config_items)) 109 | if len(single_items) == 1 and single_items[0] is None: 110 | self.use_notifications = False 111 | elif None in single_items: 112 | missing = [key for (key, value) in self.notif_config.items() if value is None] 113 | raise ValueError(f"""You have not filled all required fields for the notifications 114 | configuration! Fields missing are {missing}""") 115 | else: 116 | self.use_notifications = True 117 | 118 | # Check for necessary database information. If none is provided, 119 | # fall back to a default sqlite configuration 120 | if "sql" not in self.config: 121 | print("Config file " + config_file + """ does not contain key 'sql'! 122 | Will use default sqlite configuration.""") 123 | self.config["sql"] = dict(dbtype="sqlite", 124 | dbname="new_database") 125 | self.sql_config = self.config["sql"] 126 | 127 | # No db type given in Config 128 | if self.sql_config["dbtype"] is None: 129 | print('''Parameter dbtype not set in the "config.yml". 
Will create 130 | an sqlite database.''') 131 | self.dbtype = "sqlite" 132 | else: 133 | self.dbtype = self.sql_config["dbtype"].strip() 134 | 135 | # DB type is mysql - checking for all parameters 136 | if self.dbtype == "mysql": 137 | try: 138 | self.dbhost = str(self.sql_config["host"]) 139 | self.dbuser = str(self.sql_config["user"]) 140 | self.dbpwd = str(self.sql_config["passwd"]) 141 | if self.dbhost == '': 142 | raise ValueError("dbhost parameter is empty") 143 | if self.dbuser == '': 144 | raise ValueError("dbuser parameter is empty") 145 | if self.dbpwd == '': 146 | raise ValueError("passwd parameter is empty") 147 | except KeyError as e: 148 | raise e 149 | elif self.dbtype == "sqlite": 150 | self.dbhost = None 151 | self.dbuser = None 152 | self.dbpwd = None 153 | else: 154 | raise ValueError('''dbtype parameter is neither "sqlite" nor 155 | "mysql". Please adjust the "config.yml" ''') 156 | 157 | # Set db name 158 | if self.sql_config["dbname"] is not None: 159 | self.dbname = self.sql_config["dbname"] 160 | else: 161 | print('''Parameter "dbname" is missing. New database will have the name 162 | "new_database".''') 163 | self.dbname = "new_database" 164 | 165 | # Function to send mail if notifications are turned on in config.yml 166 | # TODO: finalize this function 167 | def send_mail(self, message_dict): 168 | '''Sends an email via Mailgun. 169 | Args: 170 | message_dict (dict): 171 | { 172 | "subject": "your_subject" 173 | "text": "message" 174 | } 175 | config (dict): 176 | { 177 | "mailgun_api_base_url": "link to mailgun_api_base_url" 178 | "mailgun_api_key": "your mailgun_api_key" 179 | "mailgun_default_smtp_login": "your mailgun_default_smtp_login" 180 | "email_to_notify": "the email_to_notify" 181 | } 182 | Returns: 183 | The response of the requests.post call to the Mailgun API. 
184 | ''' 185 | 186 | api_base_url = self.notif_config["mailgun_api_base_url"] + '/messages' 187 | auth = ('api', self.notif_config["mailgun_api_key"]) 188 | 189 | data = { 190 | "from": f"SparseTwitter <{self.notif_config['mailgun_default_smtp_login']}>", 191 | "to": self.notif_config["email_to_notify"] 192 | } 193 | 194 | data.update(message_dict) 195 | 196 | return post(api_base_url, auth=auth, data=data) 197 | 198 | # TODO: Add mailgun config 199 | -------------------------------------------------------------------------------- /setup_server.sh: -------------------------------------------------------------------------------- 1 | wget https://raw.githubusercontent.com/chetankapoor/swap/master/swap.sh -O swap.sh 2 | sudo sh swap.sh 2G 3 | sudo apt-get update 4 | sudo apt-get install mysql-server 5 | mysql_secure_installation 6 | sudo mysql -u root -p 7 | create database sparsetwitter; 8 | create database sparsetwitter_live; 9 | set global validate_password_special_char_count = 0; 10 | create user 'sparsetwitter'@'localhost' identified by 'password'; 11 | GRANT ALL PRIVILEGES ON *.* TO 'sparsetwitter'@'localhost'; 12 | create user 'sparsetwitter_remote'@'%' identified by 'password'; 13 | GRANT ALL PRIVILEGES ON sparsetwitter.* TO 'sparsetwitter_remote'@'%'; 14 | GRANT ALL PRIVILEGES ON sparsetwitter_live.* TO 'sparsetwitter_remote'@'%'; 15 | # follow this guide: https://medium.com/@haotangio/how-to-properly-setup-mysql-5-7-for-production-on-ubuntu-16-04-dd4088286016 16 | exit 17 | git clone https://github.com/FlxVctr/SparseTwitter.git 18 | sudo apt install python-pip 19 | pip install --user pipenv 20 | sudo apt upgrade 21 | sudo reboot now 22 | # ssh back into machine 23 | cd SparseTwitter 24 | screen 25 | curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash 26 | # add stuff to bashrc as prompted by script 27 | source ../.bashrc 28 | pyenv update 29 | sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \ 
30 | libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \ 31 | xz-utils tk-dev libffi-dev liblzma-dev 32 | sudo reboot now 33 | # ssh back into machine 34 | screen 35 | cd SparseTwitter 36 | pipenv install 37 | pipenv shell 38 | # follow readme in SparseTwitter 39 | python tests/tests.py -s 40 | python functional_test.py -------------------------------------------------------------------------------- /start.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from datetime import datetime 3 | import os 4 | import time 5 | import traceback 6 | from shutil import copyfile 7 | from sys import stderr, stdout 8 | 9 | import pandas as pd 10 | 11 | from collector import Coordinator 12 | from setup import Config 13 | 14 | 15 | def main_loop(coordinator, select=[], status_lang=None, test_fail=False, restart=False, 16 | bootstrap=False, language_threshold=0, keywords=[]): 17 | 18 | try: 19 | latest_start_time = pd.read_sql_table('timetable', coordinator.dbh.engine) 20 | latest_start_time = latest_start_time['latest_start_time'][0] 21 | except ValueError: 22 | latest_start_time = 0 23 | 24 | if restart is True: 25 | 26 | update_query = f""" 27 | UPDATE friends 28 | SET burned=0 29 | WHERE UNIX_TIMESTAMP(timestamp) > {latest_start_time} 30 | """ 31 | coordinator.dbh.engine.execute(update_query) 32 | 33 | start_time = time.time() 34 | 35 | pd.DataFrame({'latest_start_time': [start_time]}).to_sql('timetable', coordinator.dbh.engine, 36 | if_exists='replace') 37 | 38 | collectors = coordinator.start_collectors(select=select, 39 | status_lang=status_lang, 40 | fail=test_fail, 41 | restart=restart, 42 | retries=4, 43 | latest_start_time=latest_start_time, 44 | bootstrap=bootstrap, 45 | language_threshold=language_threshold, 46 | keywords=keywords) 47 | 48 | stdout.write("\nstarting {} collectors\n".format(len(collectors))) 49 | stdout.write(f"\nKeywords: {keywords}\n") 50 | stdout.flush() 51 | 52 | 
i = 0 53 | timeout = 7200 54 | 55 | for instance in collectors: 56 | instance.join(timeout=timeout) 57 | if instance.is_alive(): 58 | raise RuntimeError(f"Thread {instance.name} took longer than {timeout} seconds \ 59 | to finish.") 60 | if instance.err is not None: 61 | raise instance.err 62 | i += 1 63 | stdout.write(f"Thread {instance.name} joined. {i} collector(s) finished\n") 64 | stdout.flush() 65 | 66 | 67 | if __name__ == "__main__": 68 | 69 | # Back up latest_seeds.csv if it exists 70 | if os.path.isfile("latest_seeds.csv"): 71 | copyfile("latest_seeds.csv", 72 | "{}_latest_seeds.csv".format(datetime.now().isoformat().replace(":", "-"))) 73 | 74 | # Get arguments from commandline 75 | parser = argparse.ArgumentParser() 76 | parser.add_argument('-n', '--seeds', type=int, help="specify number of seeds", default=10) 77 | parser.add_argument('-l', '--language', nargs="+", 78 | help="specify language codes of last status by users to gather") 79 | parser.add_argument('-lt', '--lthreshold', type=float, 80 | help="fraction threshold (0 to 1) of last 200 tweets by an account that \ 81 | must have chosen languages detected (leads to fewer false positives but \ 82 | also more false negatives)", default=0) 83 | parser.add_argument('-k', '--keywords', nargs="+", 84 | help="specify keywords contained in last 200 tweets by users to gather") 85 | parser.add_argument('-r', '--restart', 86 | help="restart with latest seeds in latest_seeds.csv", action="store_true") 87 | parser.add_argument('-p', '--following_pages_limit', type=int, 88 | help='''Define limit for maximum number of recent followings to retrieve per \ 89 | account to determine most followed friend. 90 | 1 page has a maximum of 5000 followings. 91 | Lower values speed up collection. 
Default: 0 (unlimited)''', default=0) 92 | parser.add_argument('-b', '--bootstrap', help="at every step, add a seed's friends and followers \ 93 | to the seed pool from which accounts are chosen randomly if walkers are at an impasse", 94 | action="store_true") 95 | parser.add_argument('-t', '--test', help="dev only: test for 2 loops only", 96 | action="store_true") 97 | parser.add_argument('-f', '--fail', help="dev only: test unexpected exception", 98 | action="store_true") 99 | 100 | args = parser.parse_args() 101 | 102 | config = Config() 103 | 104 | user_details_list = [] 105 | for detail, sqldatatype in config.config["twitter_user_details"].items(): 106 | if sqldatatype is not None: 107 | user_details_list.append(detail) 108 | 109 | if args.restart: 110 | latest_seeds_df = pd.read_csv('latest_seeds.csv', header=None)[0] 111 | latest_seeds = list(latest_seeds_df.values) 112 | coordinator = Coordinator(seed_list=latest_seeds, 113 | following_pages_limit=args.following_pages_limit) 114 | print("Restarting with latest seeds:\n") 115 | print(latest_seeds_df) 116 | else: 117 | coordinator = Coordinator(seeds=args.seeds, 118 | following_pages_limit=args.following_pages_limit) 119 | 120 | k = 0 121 | restart_counter = 0 122 | 123 | while True: 124 | 125 | if args.test: 126 | k += 1 127 | if k == 2: 128 | args.fail = False 129 | if k == 3: 130 | break 131 | stdout.write("\nTEST RUN {}\n".format(k)) 132 | stdout.flush() 133 | 134 | try: 135 | if args.restart is True and restart_counter == 0: 136 | 137 | main_loop(coordinator, select=user_details_list, 138 | status_lang=args.language, test_fail=args.fail, restart=True, 139 | bootstrap=args.bootstrap, language_threshold=args.lthreshold, 140 | keywords=args.keywords) 141 | restart_counter += 1 142 | else: 143 | main_loop(coordinator, select=user_details_list, 144 | status_lang=args.language, test_fail=args.fail, bootstrap=args.bootstrap, 145 | language_threshold=args.lthreshold, keywords=args.keywords) 146 | except 
Exception: 147 | stdout.write("Encountered unexpected exception:\n") 148 | traceback.print_exc() 149 | try: 150 | if config.use_notifications is True: 151 | response = config.send_mail({ 152 | "subject": "Unexpected Error", 153 | "text": 154 | f"Unexpected Error encountered.\n{traceback.format_exc()}" 155 | } 156 | ) 157 | assert '200' in str(response) 158 | stdout.write(f"Sent notification to {config.notif_config['email_to_notify']}") 159 | stdout.flush() 160 | except Exception: 161 | stderr.write('Could not send error-mail: \n') 162 | traceback.print_exc(file=stderr) 163 | stdout.write("Retrying in 5 seconds.") 164 | stdout.flush() 165 | latest_seeds = list(pd.read_csv('latest_seeds.csv', header=None)[0].values) 166 | coordinator = Coordinator(seed_list=latest_seeds, 167 | following_pages_limit=args.following_pages_limit) 168 | args.restart = True 169 | restart_counter = 0 170 | time.sleep(5) 171 | -------------------------------------------------------------------------------- /test_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import passwords 5 | 6 | config_dict_notifications = { 7 | "email_to_notify": passwords.email_to_notify, 8 | "mailgun_default_smtp_login": passwords.mailgun_default_smtp_login, 9 | "mailgun_api_base_url": passwords.mailgun_api_base_url, 10 | "mailgun_api_key": passwords.mailgun_api_key 11 | } 12 | 13 | config_dict_twitter_details = { 14 | "twitter_user_details": { 15 | "contributors_enabled": None, 16 | "created_at": "DATETIME", 17 | "default_profile": None, 18 | "default_profile_image": None, 19 | "description": None, 20 | "entities_description_urls": None, 21 | "entities_url_urls": None, 22 | "favourites_count": None, 23 | "follow_request_sent": None, 24 | "followers_count": "INT", 25 | "following": None, 26 | "friends_count": None, 27 | "geo_enabled": None, 28 | "has_extended_profile": None, 29 | "id": "BIGINT PRIMARY KEY", 30 | 
"id_str": None, 31 | "is_translation_enabled": None, 32 | "is_translator": None, 33 | "lang": None, 34 | "listed_count": None, 35 | "location": None, 36 | "name": None, 37 | "needs_phone_verification": None, 38 | "notifications": None, 39 | "profile_background_color": None, 40 | "profile_background_image_url": None, 41 | "profile_background_image_url_https": None, 42 | "profile_background_tile": None, 43 | "profile_banner_url": None, 44 | "profile_image_url": None, 45 | "profile_image_url_https": None, 46 | "profile_link_color": None, 47 | "profile_sidebar_border_color": None, 48 | "profile_sidebar_fill_color": None, 49 | "profile_text_color": None, 50 | "profile_use_background_image": None, 51 | "protected": None, 52 | "screen_name": None, 53 | "status_contributors": None, 54 | "status_coordinates": None, 55 | "status_coordinates_coordinates": None, 56 | "status_coordinates_type": None, 57 | "status_created_at": None, 58 | "status_entities_hashtags": None, 59 | "status_entities_media": None, 60 | "status_entities_symbols": None, 61 | "status_entities_urls": None, 62 | "status_entities_user_mentions": None, 63 | "status_extended_entities_media": None, 64 | "status_favorite_count": None, 65 | "status_favorited": None, 66 | "status_geo": None, 67 | "status_geo_coordinates": None, 68 | "status_geo_type": None, 69 | "status_id": None, 70 | "status_id_str": None, 71 | "status_in_reply_to_screen_name": None, 72 | "status_in_reply_to_status_id": None, 73 | "status_in_reply_to_status_id_str": None, 74 | "status_in_reply_to_user_id": None, 75 | "status_in_reply_to_user_id_str": None, 76 | "status_is_quote_status": None, 77 | "status_lang": "VARCHAR(20)", 78 | "status_place": None, 79 | "status_place_bounding_box_coordinates": None, 80 | "status_place_bounding_box_type": None, 81 | "status_place_contained_within": None, 82 | "status_place_country": None, 83 | "status_place_country_code": None, 84 | "status_place_full_name": None, 85 | "status_place_id": None, 86 | 
"status_place_name": None, 87 | "status_place_place_type": None, 88 | "status_place_url": None, 89 | "status_possibly_sensitive": None, 90 | "status_quoted_status_id": None, 91 | "status_quoted_status_id_str": None, 92 | "status_retweet_count": None, 93 | "status_retweeted": None, 94 | "status_retweeted_status_contributors": None, 95 | "status_retweeted_status_coordinates": None, 96 | "status_retweeted_status_created_at": None, 97 | "status_retweeted_status_entities_hashtags": None, 98 | "status_retweeted_status_entities_media": None, 99 | "status_retweeted_status_entities_symbols": None, 100 | "status_retweeted_status_entities_urls": None, 101 | "status_retweeted_status_entities_user_mentions": None, 102 | "status_retweeted_status_extended_entities_media": None, 103 | "status_retweeted_status_favorite_count": None, 104 | "status_retweeted_status_favorited": None, 105 | "status_retweeted_status_geo": None, 106 | "status_retweeted_status_id": None, 107 | "status_retweeted_status_id_str": None, 108 | "status_retweeted_status_in_reply_to_screen_name": None, 109 | "status_retweeted_status_in_reply_to_status_id": None, 110 | "status_retweeted_status_in_reply_to_status_id_str": None, 111 | "status_retweeted_status_in_reply_to_user_id": None, 112 | "status_retweeted_status_in_reply_to_user_id_str": None, 113 | "status_retweeted_status_is_quote_status": None, 114 | "status_retweeted_status_lang": None, 115 | "status_retweeted_status_place": None, 116 | "status_retweeted_status_possibly_sensitive": None, 117 | "status_retweeted_status_quoted_status_id": None, 118 | "status_retweeted_status_quoted_status_id_str": None, 119 | "status_retweeted_status_retweet_count": None, 120 | "status_retweeted_status_retweeted": None, 121 | "status_retweeted_status_source": None, 122 | "status_retweeted_status_full_text": None, 123 | "status_retweeted_status_truncated": None, 124 | "status_source": None, 125 | "status_full_text": None, 126 | "status_truncated": None, 127 | "statuses_count": 
"BIGINT", 128 | "suspended": None, 129 | "time_zone": None, 130 | "translator_type": None, 131 | "url": None, 132 | "verified": None, 133 | "utc_offset": None 134 | } 135 | } 136 | 137 | config_dict_sqlite = { 138 | "sql": { 139 | "dbtype": "sqlite", 140 | "host": None, 141 | "user": None, 142 | "passwd": None, 143 | "dbname": "test_db" 144 | }, 145 | "twitter_user_details": config_dict_twitter_details["twitter_user_details"], 146 | "notifications": config_dict_notifications 147 | } 148 | 149 | config_dict_mysql = { 150 | "sql": { 151 | "dbtype": "mysql", 152 | "host": "127.0.0.1", 153 | "user": "sparsetwitter", 154 | "passwd": passwords.sparsetwittermysqlpw, 155 | "dbname": "sparsetwitter" 156 | }, 157 | "twitter_user_details": config_dict_twitter_details["twitter_user_details"], 158 | "notifications": config_dict_notifications 159 | } 160 | 161 | # TODO: REMOVE FIELDS THAT NEVER OCCUR? 162 | config_dict_user_details_dtypes_sqlite = { 163 | "sql": { 164 | "dbtype": "sqlite", 165 | "host": None, 166 | "user": None, 167 | "passwd": None, 168 | "dbname": "test_db" 169 | }, 170 | "twitter_user_details": { 171 | "contributors_enabled": "SMALLINT", 172 | "created_at": "DATETIME", 173 | "default_profile": "SMALLINT", 174 | "default_profile_image": "SMALLINT", 175 | "description": "TEXT", 176 | "entities_description_urls": "TEXT", 177 | "entities_url_urls": "TEXT", 178 | "favourites_count": "BIGINT", 179 | "follow_request_sent": "SMALLINT", 180 | "followers_count": "BIGINT", 181 | "following": "SMALLINT", 182 | "friends_count": "BIGINT", 183 | "geo_enabled": "SMALLINT", 184 | "has_extended_profile": "SMALLINT", 185 | "id": "BIGINT PRIMARY KEY", 186 | "id_str": "VARCHAR(30)", 187 | "is_translation_enabled": "SMALLINT", 188 | "is_translator": "SMALLINT", 189 | "lang": "VARCHAR(30)", 190 | "listed_count": "BIGINT", 191 | "location": "TEXT", 192 | "name": "VARCHAR(50)", 193 | "needs_phone_verification": "SMALLINT", 194 | "notifications": "SMALLINT", 195 | 
"profile_background_color": "CHAR(6)", 196 | "profile_background_image_url": "TEXT", 197 | "profile_background_image_url_https": "TEXT", 198 | "profile_background_tile": "SMALLINT", 199 | "profile_banner_url": "TEXT", 200 | "profile_image_url": "TEXT", 201 | "profile_image_url_https": "TEXT", 202 | "profile_link_color": "CHAR(6)", 203 | "profile_sidebar_border_color": "CHAR(6)", 204 | "profile_sidebar_fill_color": "CHAR(6)", 205 | "profile_text_color": "CHAR(6)", 206 | "profile_use_background_image": "SMALLINT", 207 | "protected": "SMALLINT", 208 | "screen_name": "VARCHAR(50)", 209 | "status_contributors": "TEXT", # TODO: SEE IF THIS IS NEEDED 210 | "status_coordinates": "TEXT", 211 | "status_coordinates_coordinates": "TEXT", # TODO: SEE IF THIS IS NEEDED 212 | "status_coordinates_type": "TEXT", # TODO: SEE IF THIS IS NEEDED 213 | "status_created_at": "DATETIME", 214 | "status_entities_hashtags": "TEXT", 215 | "status_entities_media": "TEXT", 216 | "status_entities_symbols": "TEXT", 217 | "status_entities_urls": "TEXT", 218 | "status_entities_user_mentions": "TEXT", 219 | "status_extended_entities_media": "TEXT", 220 | "status_favorite_count": "INT", 221 | "status_favorited": "SMALLINT", 222 | "status_geo": "TEXT", 223 | "status_geo_coordinates": "TEXT", # TODO: SEE IF THIS IS NEEDED 224 | "status_geo_type": "TEXT", # TODO: SEE IF THIS IS NEEDED 225 | "status_id": "BIGINT", 226 | "status_id_str": "VARCHAR(50)", 227 | "status_in_reply_to_screen_name": "VARCHAR(30)", 228 | "status_in_reply_to_status_id": "BIGINT", 229 | "status_in_reply_to_status_id_str": "VARCHAR(50)", 230 | "status_in_reply_to_user_id": "BIGINT", 231 | "status_in_reply_to_user_id_str": "VARCHAR(30)", 232 | "status_is_quote_status": "SMALLINT", 233 | "status_lang": "VARCHAR(30)", 234 | "status_place": "TEXT", 235 | "status_place_bounding_box_coordinates": "TEXT", # TODO: SEE IF THIS IS NEEDED 236 | "status_place_bounding_box_type": "TEXT", # TODO: SEE IF THIS IS NEEDED 237 | 
"status_place_contained_within": "TEXT", # TODO: SEE IF THIS IS NEEDED 238 | "status_place_country": "TEXT", # TODO: SEE IF THIS IS NEEDED 239 | "status_place_country_code": "TEXT", # TODO: SEE IF THIS IS NEEDED 240 | "status_place_full_name": "TEXT", # TODO: SEE IF THIS IS NEEDED 241 | "status_place_id": "TEXT", # TODO: SEE IF THIS IS NEEDED 242 | "status_place_name": "TEXT", # TODO: SEE IF THIS IS NEEDED 243 | "status_place_place_type": "TEXT", # TODO: SEE IF THIS IS NEEDED 244 | "status_place_url": "TEXT", # TODO: SEE IF THIS IS NEEDED 245 | "status_possibly_sensitive": "SMALLINT", 246 | "status_quoted_status_id": "BIGINT", 247 | "status_quoted_status_id_str": "VARCHAR(50)", 248 | "status_retweet_count": "INT", 249 | "status_retweeted": "SMALLINT", 250 | "status_retweeted_status_contributors": "TEXT", # TODO: SEE IF THIS IS NEEDED 251 | "status_retweeted_status_coordinates": "TEXT", 252 | "status_retweeted_status_created_at": "DATETIME", 253 | "status_retweeted_status_entities_hashtags": "TEXT", 254 | "status_retweeted_status_entities_media": "TEXT", 255 | "status_retweeted_status_entities_symbols": "TEXT", # TODO: SEE IF THIS IS NEEDED 256 | "status_retweeted_status_entities_urls": "TEXT", 257 | "status_retweeted_status_entities_user_mentions": "TEXT", 258 | "status_retweeted_status_extended_entities_media": "TEXT", 259 | "status_retweeted_status_favorite_count": "INT", 260 | "status_retweeted_status_favorited": "SMALLINT", 261 | "status_retweeted_status_geo": "TEXT", 262 | "status_retweeted_status_id": "BIGINT", 263 | "status_retweeted_status_id_str": "VARCHAR(50)", 264 | "status_retweeted_status_in_reply_to_screen_name": "VARCHAR(30)", 265 | "status_retweeted_status_in_reply_to_status_id": "BIGINT", 266 | "status_retweeted_status_in_reply_to_status_id_str": "VARCHAR(50)", 267 | "status_retweeted_status_in_reply_to_user_id": "BIGINT", 268 | "status_retweeted_status_in_reply_to_user_id_str": "VARCHAR(30)", 269 | "status_retweeted_status_is_quote_status": 
"SMALLINT", 270 | "status_retweeted_status_lang": "VARCHAR(30)", 271 | "status_retweeted_status_place": "TEXT", 272 | "status_retweeted_status_possibly_sensitive": "SMALLINT", 273 | "status_retweeted_status_quoted_status_id": "BIGINT", 274 | "status_retweeted_status_quoted_status_id_str": "VARCHAR(50)", 275 | "status_retweeted_status_retweet_count": "INT", 276 | "status_retweeted_status_retweeted": "SMALLINT", 277 | "status_retweeted_status_source": "TEXT", 278 | "status_retweeted_status_full_text": "TEXT", 279 | "status_retweeted_status_truncated": "SMALLINT", 280 | "status_source": "TEXT", 281 | "status_full_text": "TEXT", 282 | "status_truncated": "SMALLINT", 283 | "statuses_count": "BIGINT", 284 | "suspended": "SMALLINT", 285 | "time_zone": "TEXT", # TODO: SEE IF THIS IS NEEDED 286 | "translator_type": "VARCHAR(30)", 287 | "url": "TEXT", 288 | "verified": "SMALLINT", 289 | "utc_offset": "TEXT" # TODO: SEE IF THIS IS NEEDED 290 | }, 291 | "notifications": config_dict_notifications 292 | } 293 | 294 | config_dict_user_details_dtypes_mysql = { 295 | "sql": { 296 | "dbtype": "mysql", 297 | "host": "127.0.0.1", 298 | "user": "sparsetwitter", 299 | "passwd": passwords.sparsetwittermysqlpw, 300 | "dbname": "sparsetwitter" 301 | }, 302 | "twitter_user_details": config_dict_user_details_dtypes_sqlite["twitter_user_details"], 303 | "notifications": config_dict_notifications 304 | } 305 | 306 | friends_details_pddf_dtypes = { 307 | "contributors_enabled": bool, 308 | "created_at": pd.Timestamp, 309 | "default_profile": bool, 310 | "default_profile_image": bool, 311 | "description": str, 312 | "entities_description_urls": str, 313 | "entities_url_urls": str, 314 | "favourites_count": np.int64, 315 | "follow_request_sent": bool, 316 | "followers_count": np.int64, 317 | "following": bool, 318 | "friends_count": np.int64, 319 | "geo_enabled": bool, 320 | "has_extended_profile": bool, 321 | "id": np.int64, 322 | "id_str": str, 323 | "is_translation_enabled": bool, 324 | 
"is_translator": bool, 325 | "lang": str, 326 | "listed_count": np.int64, 327 | "location": str, 328 | "name": str, 329 | "needs_phone_verification": bool, 330 | "notifications": bool, 331 | "profile_background_color": str, 332 | "profile_background_image_url": str, 333 | "profile_background_image_url_https": str, 334 | "profile_background_tile": bool, 335 | "profile_banner_url": str, 336 | "profile_image_url": str, 337 | "profile_image_url_https": str, 338 | "profile_link_color": str, 339 | "profile_sidebar_border_color": str, 340 | "profile_sidebar_fill_color": str, 341 | "profile_text_color": str, 342 | "profile_use_background_image": bool, 343 | "protected": bool, 344 | "screen_name": str, 345 | "status_contributors": str, 346 | "status_coordinates": str, 347 | "status_coordinates_coordinates": str, 348 | "status_coordinates_type": str, 349 | "status_created_at": pd.Timestamp, 350 | "status_entities_hashtags": str, 351 | "status_entities_media": str, 352 | "status_entities_symbols": str, 353 | "status_entities_urls": str, 354 | "status_entities_user_mentions": str, 355 | "status_extended_entities_media": str, 356 | "status_favorite_count": np.int64, 357 | "status_favorited": bool, 358 | "status_geo": str, 359 | "status_geo_coordinates": str, 360 | "status_geo_type": str, 361 | "status_id": np.int64, 362 | "status_id_str": str, 363 | "status_in_reply_to_screen_name": str, 364 | "status_in_reply_to_status_id": np.int64, 365 | "status_in_reply_to_status_id_str": str, 366 | "status_in_reply_to_user_id": np.int64, 367 | "status_in_reply_to_user_id_str": str, 368 | "status_is_quote_status": bool, 369 | "status_lang": str, 370 | "status_place": str, 371 | "status_place_bounding_box_coordinates": str, 372 | "status_place_bounding_box_type": str, 373 | "status_place_contained_within": str, 374 | "status_place_country": str, 375 | "status_place_country_code": str, 376 | "status_place_full_name": str, 377 | "status_place_id": str, 378 | "status_place_name": str, 379 | 
"status_place_place_type": str, 380 | "status_place_url": str, 381 | "status_possibly_sensitive": bool, 382 | "status_quoted_status_id": np.int64, 383 | "status_quoted_status_id_str": str, 384 | "status_retweet_count": np.int64, 385 | "status_retweeted": bool, 386 | "status_retweeted_status_contributors": str, 387 | "status_retweeted_status_coordinates": str, 388 | "status_retweeted_status_created_at": pd.Timestamp, 389 | "status_retweeted_status_entities_hashtags": str, 390 | "status_retweeted_status_entities_media": str, 391 | "status_retweeted_status_entities_symbols": str, 392 | "status_retweeted_status_entities_urls": str, 393 | "status_retweeted_status_entities_user_mentions": str, 394 | "status_retweeted_status_extended_entities_media": str, 395 | "status_retweeted_status_favorite_count": np.int64, 396 | "status_retweeted_status_favorited": bool, 397 | "status_retweeted_status_geo": str, 398 | "status_retweeted_status_id": np.int64, 399 | "status_retweeted_status_id_str": str, 400 | "status_retweeted_status_in_reply_to_screen_name": str, 401 | "status_retweeted_status_in_reply_to_status_id": np.int64, 402 | "status_retweeted_status_in_reply_to_status_id_str": str, 403 | "status_retweeted_status_in_reply_to_user_id": np.int64, 404 | "status_retweeted_status_in_reply_to_user_id_str": str, 405 | "status_retweeted_status_is_quote_status": bool, 406 | "status_retweeted_status_lang": str, 407 | "status_retweeted_status_place": str, 408 | "status_retweeted_status_possibly_sensitive": bool, 409 | "status_retweeted_status_quoted_status_id": np.int64, 410 | "status_retweeted_status_quoted_status_id_str": str, 411 | "status_retweeted_status_retweet_count": np.int64, 412 | "status_retweeted_status_retweeted": bool, 413 | "status_retweeted_status_source": str, 414 | "status_retweeted_status_full_text": str, 415 | "status_retweeted_status_truncated": bool, 416 | "status_source": str, 417 | "status_full_text": str, 418 | "status_truncated": bool, 419 | "statuses_count": 
np.int64, 420 | "suspended": bool, 421 | "time_zone": str, 422 | "translator_type": str, 423 | "url": str, 424 | "verified": bool, 425 | "utc_offset": str 426 | } 427 | -------------------------------------------------------------------------------- /test_run.sh: -------------------------------------------------------------------------------- 1 | cp test_config.yml config.yml 2 | cp user_ids_de.csv seeds.csv 3 | python start.py -n 200 -l de -p 1 4 | -------------------------------------------------------------------------------- /tests/config_test_empty.yml: -------------------------------------------------------------------------------- 1 | # In the following config file, please fill in the fields as you need them. 2 | # Do not use quotes, just plain text, e.g.: 3 | # sql: 4 | # dbtype: sqlite 5 | # etc. 6 | 7 | # ================== Database Information ===================== 8 | sql: 9 | dbtype: # mysql 10 | host: # if dbtype = mysql, provide host 11 | user: # if dbtype = mysql, provide user 12 | passwd: # if dbtype = mysql, provide password 13 | dbname: # provide a name for the database. 14 | 15 | 16 | # ================== Twitter User Details ===================== 17 | # If you wish to save certain twitter user details, please just add the SQL data 18 | # type you wish to save it as in the SQL database (recommended types are indicated 19 | # in parentheses). If you do not wish to save a certain detail, just leave it empty 20 | # like so: 21 | # twitter_user_details: 22 | # contributors_enabled: SMALLINT 23 | # created_at: 24 | # This will save the detail "contributors_enabled" as a boolean-like SMALLINT into the 25 | # database but it will not save "created_at" at all. 
26 | 27 | twitter_user_details: 28 | contributors_enabled: # SMALLINT 29 | created_at: # DATETIME 30 | default_profile: # SMALLINT 31 | default_profile_image: # SMALLINT 32 | description: # TEXT (contains a dict) 33 | entities_description_urls: # TEXT 34 | entities_url_urls: # TEXT (contains a dict) 35 | favourites_count: # BIGINT 36 | follow_request_sent: # SMALLINT 37 | followers_count: # BIGINT 38 | following: # SMALLINT 39 | friends_count: # BIGINT 40 | geo_enabled: # SMALLINT 41 | has_extended_profile: # SMALLINT 42 | id: # BIGINT PRIMARY KEY 43 | id_str: # VARCHAR(30) 44 | is_translation_enabled: # SMALLINT 45 | is_translator: # SMALLINT 46 | lang: # VARCHAR(10) 47 | listed_count: # BIGINT 48 | location: # TEXT 49 | name: # VARCHAR (50) 50 | needs_phone_verification: #SMALLINT 51 | notifications: # SMALLINT 52 | profile_background_color: # CHAR(6) (is a Hex Color Code) 53 | profile_background_image_url: # TEXT 54 | profile_background_image_url_https: # TEXT 55 | profile_background_tile: # SMALLINT 56 | profile_banner_url: # TEXT 57 | profile_image_url: # TEXT 58 | profile_image_url_https: # TEXT 59 | profile_link_color: # CHAR(6) (is a Hex Color Code) 60 | profile_sidebar_border_color: # CHAR(6) (is a Hex Color Code) 61 | profile_sidebar_fill_color: # CHAR(6) (is a Hex Color Code) 62 | profile_text_color: # CHAR(6) (is a Hex Color Code) 63 | profile_use_background_image: # SMALLINT 64 | protected: # SMALLINT 65 | screen_name: # VARCHAR(50) 66 | status_contributors: # TEXT (Rarely available) 67 | status_coordinates: # TEXT (contains a dict) 68 | status_coordinates_coordinates: # TEXT (Rarely available) 69 | status_coordinates_type: # TEXT (Rarely available) 70 | status_created_at: # DATETIME 71 | status_entities_hashtags: # TEXT (contains a dict) 72 | status_entities_media: # TEXT (contains a dict) 73 | status_entities_symbols: # TEXT (contains a dict) # DE FACTO ALWAYS EMPTY 74 | status_entities_urls: # TEXT (contains a dict) 75 | 
  status_entities_user_mentions: # TEXT (contains a dict)
  status_extended_entities_media: # TEXT (contains a dict)
  status_favorite_count: # INT
  status_favorited: # SMALLINT
  status_geo: # TEXT (contains a dict)
  status_geo_coordinates: # TEXT (Rarely available)
  status_geo_type: # TEXT (Rarely available)
  status_id: # BIGINT
  status_id_str: # VARCHAR(50)
  status_in_reply_to_screen_name: # VARCHAR(50)
  status_in_reply_to_status_id: # BIGINT
  status_in_reply_to_status_id_str: # VARCHAR(50)
  status_in_reply_to_user_id: # BIGINT
  status_in_reply_to_user_id_str: # VARCHAR(30)
  status_is_quote_status: # SMALLINT
  status_lang: # VARCHAR(10)
  status_place: # TEXT (contains a dict)
  status_place_bounding_box_coordinates: # TEXT (Rarely available)
  status_place_bounding_box_type: # TEXT (Rarely available)
  status_place_contained_within: # TEXT (Rarely available)
  status_place_country: # TEXT (Rarely available)
  status_place_country_code: # TEXT (Rarely available)
  status_place_full_name: # TEXT (Rarely available)
  status_place_id: # TEXT (Rarely available)
  status_place_name: # TEXT (Rarely available)
  status_place_place_type: # TEXT (Rarely available)
  status_place_url: # TEXT (Rarely available)
  status_possibly_sensitive: # SMALLINT
  status_quoted_status_id: # BIGINT
  status_quoted_status_id_str: # VARCHAR(50)
  status_retweet_count: # INT
  status_retweeted: # SMALLINT
  status_retweeted_status_contributors: # TEXT (Rarely available)
  status_retweeted_status_coordinates: # TEXT (contains a dict)
  status_retweeted_status_created_at: # DATETIME
  status_retweeted_status_entities_hashtags: # TEXT (contains a dict)
  status_retweeted_status_entities_media: # TEXT (contains a dict)
  status_retweeted_status_entities_symbols: # TEXT (Rarely available)
  status_retweeted_status_entities_urls: # TEXT (contains a dict)
  status_retweeted_status_entities_user_mentions: # TEXT (contains a dict)
  status_retweeted_status_extended_entities_media: # TEXT (contains a dict)
  status_retweeted_status_favorite_count: # INT
  status_retweeted_status_favorited: # SMALLINT
  status_retweeted_status_geo: # TEXT (contains a dict)
  status_retweeted_status_id: # BIGINT
  status_retweeted_status_id_str: # VARCHAR(50)
  status_retweeted_status_in_reply_to_screen_name: # VARCHAR(30)
  status_retweeted_status_in_reply_to_status_id: # BIGINT
  status_retweeted_status_in_reply_to_status_id_str: # VARCHAR(50)
  status_retweeted_status_in_reply_to_user_id: # BIGINT
  status_retweeted_status_in_reply_to_user_id_str: # VARCHAR(30)
  status_retweeted_status_is_quote_status: # SMALLINT
  status_retweeted_status_lang: # VARCHAR(10)
  status_retweeted_status_place: # TEXT (contains a dict)
  status_retweeted_status_possibly_sensitive: # SMALLINT
  status_retweeted_status_quoted_status_id: # BIGINT
  status_retweeted_status_quoted_status_id_str: # VARCHAR(50)
  status_retweeted_status_retweet_count: # INT
  status_retweeted_status_retweeted: # SMALLINT
  status_retweeted_status_source: # TEXT
  status_retweeted_status_full_text: # TEXT
  status_retweeted_status_truncated: # SMALLINT
  status_source: # TEXT
  status_full_text: # TEXT
  status_truncated: # SMALLINT
  statuses_count: # BIGINT
  suspended: # SMALLINT
  time_zone: # TEXT (Rarely available)
  translator_type: # VARCHAR(50)
  url: # TEXT
  verified: # BOOLEAN
  utc_offset: # TEXT (Rarely available)


# ================== Notification Emails =====================

notifications:
  email_to_notify: # user@example.com
  # mailgun details
  # (find them under the respective domain name here: https://mailgun.com/app/domains)
  mailgun_default_smtp_login:
  mailgun_api_base_url:
  mailgun_api_key:

--------------------------------------------------------------------------------
/twauth.py:
--------------------------------------------------------------------------------
from __future__ import unicode_literals

import csv
import os
import webbrowser

import tweepy as tp

from setup import FileImport


class OAuthorizer():
    """Run the interactive OAuth flow and append the user's tokens to 'tokens.csv'."""

    def __init__(self):
        # Consumer key and secret of the app, read from the key file.
        ctoken, csecret = FileImport().read_app_key_file()
        auth = tp.OAuthHandler(ctoken, csecret)

        try:
            redirect_url = auth.get_authorization_url()
        except tp.TweepError as e:
            if '"code":32' in e.reason:
                raise tp.TweepError(
                    "Failed to get the request token. Perhaps the Consumer Key "
                    "and/or secret in your 'keys.json' is incorrect?")
            raise

        # Let the user authorize the app in the browser, then ask for the verifier.
        webbrowser.open(redirect_url)
        token = auth.request_token["oauth_token"]
        verifier = input("Please enter Verifier Code: ")
        auth.request_token = {'oauth_token': token,
                              'oauth_token_secret': verifier}
        try:
            auth.get_access_token(verifier)
        except tp.TweepError as e:
            if "Invalid oauth_verifier parameter" in e.reason:
                raise tp.TweepError(
                    "Failed to get access token! Perhaps the "
                    "verifier you've entered is wrong.")
            raise

        # Write the header row only when the file is created, then append the tokens.
        # The 'with' block closes the file, so no explicit f.close() is needed.
        write_header = not os.path.isfile('tokens.csv')
        with open('tokens.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            if write_header:
                writer.writerow(["token", "secret"])
            writer.writerow([auth.access_token, auth.access_token_secret])


if __name__ == "__main__":
    OAuthorizer()

--------------------------------------------------------------------------------
/two_seeds.csv:
--------------------------------------------------------------------------------
83662933
36476777

--------------------------------------------------------------------------------
/wrong_tokens.csv:
--------------------------------------------------------------------------------
token,secret
asdfasdf,asdf
--------------------------------------------------------------------------------
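
The token files above (`tokens.csv` as written by `twauth.py`, and `wrong_tokens.csv`) share a `token,secret` header row. A minimal sketch of reading such a file back might look like the following. `read_tokens` is a hypothetical helper introduced here for illustration only; it is not part of the repository, and the project's own loading code may work differently.

```python
import csv


def read_tokens(path="tokens.csv"):
    """Return a list of (token, secret) pairs from a CSV file with a header row.

    Hypothetical helper: assumes the 'token,secret' header that twauth.py writes.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [(row["token"], row["secret"]) for row in reader]
```

Reading by column name rather than by position keeps the sketch working even if further columns are added to the file later.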