├── .gitignore ├── README.md ├── img ├── CC_predictions.png ├── CC_rule_set.png ├── FFC_rule_set.png ├── SOS_rule_set.png ├── class_A_classifier_evaluation.png ├── class_A_features.png ├── classifier_A_vs_C.png ├── collected_data_stats.png ├── confusion_matrix.png ├── crawling_cost.png ├── evasion_features.png ├── feature_test.png ├── features_classifiers_evaluation.png ├── features_cost.png ├── features_evaluation.png ├── features_random_forest.png ├── reduced_overfitting.png └── rules_evaluation.png ├── report └── report.tex ├── research ├── efficient_detection_fake_twitter_followers.pdf ├── project_description.pdf ├── stringhini_Detecting spammer on social network.pdf └── stringhini_Detecting spammer on social network_paper.pdf └── src ├── cachedata.py ├── classifiers.py ├── csv_processor.py ├── features.py ├── generateBAS.py ├── main.py ├── metrics.py ├── tweets.py ├── url_finder.py └── users.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | 103 | # data 104 | data/* 105 | 106 | # latex 107 | report/*.aux 108 | report/*.pdf 109 | report/*.synctex.gz 110 | report/*.toc -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BigDataProject3 2 | Automatic detection of fake twitter followers 3 | 4 | # Research paper summary 5 | 6 | ## The baseline datasets 7 | - 9M accounts 8 | - 3M tweets 9 | 10 | ### Fake project 11 | Twitter account created to be followed by real accounts to collect data. 12 | Referred as TFP 13 | 14 | ### #Elezioni2013 dataset 15 | Named E13, data mining of the twitter accounts involved in the #elezioni2013 and after discarding officially involved accounts and sampling the lefts ones, they manually checked the remaining accounts(1488). 16 | From this work resulted 1481 human accounts labeled. 
17 | 18 | ### Baseline dataset of human accounts 19 | So TFP and E13 form the starting set of human accounts, "HUM". 20 | 21 | ### Baseline dataset of fake followers 22 | About 3000 fake accounts were bought: 23 | 1169 FSF (fast followers) 24 | 1337 INT (intertwitter) 25 | 845 TWT (1000 bought from twittertechnology, but 155 were banned right away) 26 | 27 | This dataset is clearly illustrative and not exhaustive of all possible fake accounts. 28 | 29 | ![alt text](./img/collected_data_stats.png) 30 | 31 | ### Baseline dataset 32 | Studies have shown that the class distribution of a classification dataset can affect the classification. 33 | 34 | Twitter has stated that spam/fake accounts should amount to less than 5% of MAU (monthly active users). This is not applicable to our problem, because MAU cannot be assimilated to our dataset, and an account that buys fake followers will have an abnormal distribution of fake/real accounts. 35 | --> the <5% figure can't be transferred to the fake followers of a single account. 36 | 37 | They decided to go for a balanced distribution, after training the classifier with proportions ranging from 5%-95% (100 HUM - 1900 FAK) to 95%-5% (1900 HUM - 100 FAK) and comparing the resulting accuracy with cross-validation. 38 | 39 | 40 | To obtain a balanced dataset, we randomly undersampled the total set of fake accounts (i.e., 3351) to match the size of the HUM dataset of verified human accounts. Thus, we built a baseline dataset of 1950 fake followers, labeled FAK. The final baseline dataset for this work includes both the HUM dataset and the FAK dataset for a total of 3900 Twitter accounts. This balanced dataset is labeled BAS in the remainder of the paper and has been exploited for all the experiments described in this work (where not otherwise specified). Table 1 shows the number of accounts, tweets and relationships contained in the datasets described in this section. 41 | 42 | ## Classifiers used for fake detection 43 | They assessed the effectiveness of 3 proposed procedures by trying them on their dataset. Depending on their effectiveness, they will later be used as features to fit the classifiers. 44 | 45 | ### Followers of political candidates 46 | Tested on the followers of Obama, Romney and Italian politicians. The algorithm is based on public features of the accounts: it assigns human and bot scores and classifies an account by considering the gap between the two summed scores. The algorithm assigns a human point for each feature in the "feature table" that the account satisfies. 47 | ![alt text](./img/CC_rule_set.png) 48 | On the other hand, an account receives a bot point for each feature it does not meet, and 2 points if it only tweets via the API. 49 | (the specifics of each feature can be read in the paper) 50 | 51 | ### Stateofsearch.com 52 | This website proposed the following rule set: 53 | ![alt text](./img/SOS_rule_set.png) 54 | 55 | This rule set doesn't focus on the account itself but on the tweets it emits. The rules looking for similarities are applied over the whole dataset. 56 | Important: because temporal data isn't available and because of Twitter's API limitations, rules 6 & 7 were not applied. 57 | 58 | ### Socialbakers’ FakeFollowerCheck 59 | A fakeness classification tool based on 8 criteria: 60 | ![alt text](./img/FFC_rule_set.png) 61 | 62 | ### Evaluation methodology 63 | The 3 methods were tested on our human dataset and on the fake followers. 
We used the confusion matrix as the standard indication of accuracy: 64 | REMINDER: 65 | - True Positive (TP): the number of fake followers recognized by the rule as fake followers; 66 | - True Negative (TN): the number of human followers recognized by the rule as human followers; 67 | - False Positive (FP): the number of human followers recognized by the rule as fake followers; 68 | - False Negative (FN): the number of fake followers recognized by the rule as human followers. 69 | ![alt text](./img/confusion_matrix.png) 70 | 71 | Using the following metrics: 72 | - Accuracy: the proportion of predicted true results (both true positives and true negatives) in the population, that is $$\frac{TP+TN}{TP+TN+FP+FN}$$ 73 | - Precision: the proportion of predicted positive cases that are indeed real positives, that is $$\frac{TP}{TP+FP}$$ 74 | - Recall (also called Sensitivity): the proportion of real positive cases that are indeed predicted positive, that is $$\frac{TP}{TP+FN}$$ 75 | - F-Measure: the harmonic mean of precision and recall, namely $$\frac{2\cdot precision\cdot recall}{precision+recall}$$ 76 | - Matthews Correlation Coefficient (MCC from now on) [37]: an estimator of the correlation between the predicted class and the real class of the samples, defined as: 77 | $$\frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$$ 78 | 79 | In addition, they also measured two other metrics: 80 | - Information Gain (Igain): the information gain considers a more general dependence, leveraging probability densities. It is a measure of the informativeness of a feature with respect to the predicted class. 81 | - Pearson Correlation Coefficient (PCC): the Pearson correlation coefficient can detect linear dependencies between a feature and the target class. It is a measure of the strength of the linear relationship between two random variables X and Y. 82 | (A small Python sketch of these confusion-matrix metrics is given just before the Discussion of the results section.) 83 | 84 | 85 | ### Evaluation of the CC algorithm 86 | ![alt text](./img/CC_predictions.png) 87 | Not very good at detecting bots, but it does a decent job with humans. 88 | ### Individual rules evaluation 89 | Here they analyzed the effectiveness of each individual rule. 90 | ![alt text](./img/rules_evaluation.png) 91 | 92 | ## Fake detection based on features 93 | Classification using 2 sets of features extracted from spam accounts. 94 | Important: the features were extracted from spammers but are used here for fake followers. 95 | To extract these features, they used classifiers producing glass-box (white-box) and black-box models. 96 | ### Spammer detection in social networks 97 | Use of Random Forest, which yields a classification but also a ranking of features: 98 | ![alt text](./img/features_random_forest.png) 99 | 100 | ### Evolving Twitter spammer detection 101 | Since spammers change their behavior to avoid detection, here is a set of features that still detects them even when they use evasion techniques: 102 | ![alt text](./img/evasion_features.png) 103 | 104 | ### Evaluation of these features 105 | Evaluation of the single features: 106 | ![alt text](./img/features_evaluation.png) 107 | 108 | Evaluation of the features used with classifiers: 109 | ![alt text](./img/features_classifiers_evaluation.png) 110 | The results are very good: the classification accuracy is very high for all the classifiers. 111 | The feature-based classifiers are far more accurate than the CC algorithm at predicting and detecting fake followers. 
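For reference, here is a minimal Python sketch (with illustrative names, not code from the paper) of the evaluation metrics defined above, computed directly from the confusion-matrix counts:

```python
import math

def evaluation_metrics(tp, tn, fp, fn):
    # tp, tn, fp, fn: confusion-matrix counts as defined above
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)),
    }
```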
112 | 113 | ### Discussion of the results 114 | By analysing the classifiers we extracted the most effective features: 115 | - for Decision Trees, the most effective features are the ones close to the root; 116 | - Decorate, AdaBoost and Random Forest are based on Decision Trees, but they are compositions of many trees and are therefore harder to analyse. 117 | 118 | #### Differences between fake followers and spammers 119 | The URL ratio is higher for fake followers (72%) and only 14% for humans. 120 | The API ratio is higher for spammers than for humans. For 78% of fake followers it is lower than 0.0001. 121 | The average neighbors' tweets feature is lower for spammers than for fake followers. 122 | 123 | Fake followers appear to be more passive than spammers, and they do not make use of automated mechanisms. 124 | 125 | #### Overfitting 126 | A usual problem in classification is a model that fits the training dataset too closely and does not generalize to new data. To avoid overfitting, the best idea is to keep the classifier simple. 127 | For decision tree algorithms, reducing the number of nodes and the height of the tree helps. 128 | For trees, a common practice is to adopt an aggressive pruning strategy -> using reduced-error pruning with small test sets. This results in simpler trees (fewer features) while maintaining good performance. 129 | ![alt text](./img/reduced_overfitting.png) 130 | 131 | #### Bidirectional link ratio 132 | This is the feature with the highest information gain. To test its importance in telling fake accounts from humans, we retrain the classifiers excluding this feature and compare them with the classifiers trained with the bidirectional link ratio feature. 133 | 134 | From the previous table we realize that the feature isn't essential, but it is highly effective. 135 | 136 | ## An efficient and lightweight classifier 137 | ![alt text](./img/feature_test.png) 138 | As we can see, classifiers based on feature sets perform better than rule sets. To further improve the classifiers, we analyse their cost. 139 | 140 | ### Crawling cost 141 | We divide the data to crawl into 3 categories: 142 | - profile (Class A) 143 | - timeline (Class B) 144 | - relationship (Class C) 145 | 146 | These categories are directly related to the amount of data that needs to be downloaded for a category of features. We compare the amount of data to be downloaded for each category (best and worst case scenarios -> the best case is 1 API call and the worst case is the biggest account possible). 147 | We also take into account the maximum number of calls allowed by the Twitter API, which defines our maximum call threshold. 148 | Parameters of the table: 149 | - $f$ : number of followers of the target account; 150 | - $t_i$ : number of tweets of the $i$-th follower of the target account; 151 | - $\phi_i$ : number of friends of the $i$-th follower of the target account; 152 | - $f_i$ : number of followers of the $i$-th follower of the target account. 153 | ![alt text](./img/crawling_cost.png) 154 | 155 | Important: an API call downloads all the information available for the account, so the data for the different classes is obtained concurrently. 156 | 157 | ### Class A classifier 158 | 159 | ![alt text](./img/classifier_A_vs_C.png) 160 | 161 | A classifier belongs to the class of its most expensive feature. 162 | Here we examine the results obtained with the cheapest classifiers, those working only with Class A features. 163 | 164 | The classifiers are tested on 2 feature sets: the Class A features and all the features. 
165 | We can see some classifiers improving and others slightly dropping in performance. 166 | 167 | ### Validation of the Class A classifier 168 | Two experiments: 169 | - Using our baseline dataset as training 170 | - Using Obama's followers as training 171 | 172 | ![alt text](./img/class_A_classifier_evaluation.png) 173 | 174 | For each of these experiments we tested the classifiers with these testing datasets: 175 | - human accounts 176 | - 1401 fake followers not included in the BAS dataset 177 | 178 | For this validation we can see notable differences between the approaches. 179 | We can also see the the random sample was more correctly labeled then the Obama's sample meaning the Obama's dataset introduces previously unknown features from the training sets. 180 | 181 | ### Assessing Class A features 182 | 183 | To assess the importance of the features used in Class A features we used an information fusion-based sensitivity analysis. 184 | Information fusion is a technique aimed at leveraging the predictive power of several different models in order to achieve a combined prediction accuracy which is better than the predictions of the single models. 185 | Sensitivity analysis, instead, aims at assessing the relative importance of the 186 | different features used to build a classification model. 187 | 188 | By combining them we can estimate the importance of certain features used in different classifiers with a common classification task. 189 | 190 | To do so we have to retrain the classifiers of the 8 class A classifiers with our baseline dataset and remove one feature at a time. 191 | Each of the trained classifiers is then tested with our test dataset. 192 | -------------------------------------------------------------------------------- /img/CC_predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/CC_predictions.png -------------------------------------------------------------------------------- /img/CC_rule_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/CC_rule_set.png -------------------------------------------------------------------------------- /img/FFC_rule_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/FFC_rule_set.png -------------------------------------------------------------------------------- /img/SOS_rule_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/SOS_rule_set.png -------------------------------------------------------------------------------- /img/class_A_classifier_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/class_A_classifier_evaluation.png -------------------------------------------------------------------------------- /img/class_A_features.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/class_A_features.png -------------------------------------------------------------------------------- /img/classifier_A_vs_C.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/classifier_A_vs_C.png -------------------------------------------------------------------------------- /img/collected_data_stats.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/collected_data_stats.png -------------------------------------------------------------------------------- /img/confusion_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/confusion_matrix.png -------------------------------------------------------------------------------- /img/crawling_cost.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/crawling_cost.png -------------------------------------------------------------------------------- /img/evasion_features.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/evasion_features.png -------------------------------------------------------------------------------- /img/feature_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/feature_test.png -------------------------------------------------------------------------------- /img/features_classifiers_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_classifiers_evaluation.png -------------------------------------------------------------------------------- /img/features_cost.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_cost.png -------------------------------------------------------------------------------- /img/features_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_evaluation.png -------------------------------------------------------------------------------- /img/features_random_forest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_random_forest.png -------------------------------------------------------------------------------- /img/reduced_overfitting.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/reduced_overfitting.png -------------------------------------------------------------------------------- /img/rules_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/rules_evaluation.png -------------------------------------------------------------------------------- /report/report.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt]{article} 2 | \usepackage[utf8]{inputenc} 3 | \usepackage{listings} 4 | \usepackage{graphicx} 5 | \usepackage{fontenc} 6 | \usepackage{listings} 7 | \usepackage{multicol} 8 | \usepackage{verbatim} 9 | 10 | \title{UNamur\\ 11 | ICYBM201 Big Data and Computer Security : Fame for sale, efficient detection of fake Twitter 12 | followers} 13 | 14 | \author{TIO NOGUERAS Gérard, NYAKI Loïc} 15 | 16 | \begin{document} 17 | \maketitle 18 | 19 | \newpage 20 | \tableofcontents 21 | \newpage 22 | 23 | \section{Introduction} 24 | 25 | \paragraph{} 26 | The objective of this project is to partially reproduce the results presented in a 2015 paper titled \textit{Fame for sale: effective detection of fake twitter followers}, by Cresci \textit{et al.}\\ 27 | 28 | In the paper, various classification rules and features proposed by academics, \textit{technology bloggers} and companies specialized in fake twitter users detection are enumerated and assessed. Rule-based classification is shown to perform more poorly than feature-based classification. Therefore the authors decided to abandon rule-based classification in favor of the 19 features responsible for the most information gain while having the smallest calculation cost.\\ 29 | 30 | The resulting classifier trained on that 19-features set shows an accuracy of up to 97.5\% on randomly sampled account, while being solely based on data readily available on the profile of the twitter users, which ensures that the data processing is fast and lightweight. 31 | 32 | \subsection{Objectives} 33 | In this project, we focus on reproducing the final results of Cresci \textit{et al.}, by implementing various classifiers based on the feature set presented in that paper.\\ 34 | 35 | First we describe the 19 features that will be part of the feature set. Then, we introduce the datasets that have been provided, and which contain \textit{human} or \textit{bot} data. Later we describe how the features will be extracted from these datasets as well as our planned procedure for training classifier. Each type of classifier is then described along with their parameters, and results are presented for each classifier. Finally, we summarize the results argue on the difference between the results presented in Cresci \textit{et al.}, and the results we managed to obtain. 36 | 37 | 38 | \section{Rule sets and Features set} 39 | In the paper, five rule sets and features set are analyzed. They come from academic research, as well as technology bloggers and companies specialized in fake tweeter users detection.\\ 40 | 41 | To avoid simply using all the features mentionned in other papers without any preprocessing, the authors decided to measure each feature individually and asses it's effectiveness thanks to machine learning metrics. 
After doing, they removed the features they considered not to have a sufficient score and removed the features they considered impossible to obtain. (ie. time related features that would involve monitoring of thousands of accounts.) 42 | \\\\ 43 | We have decided that we will present the features that they kept and that we will se to train our classifiers.\\ 44 | We have devided the features by paper and in each section we have separated the features related to the profile(class A), the timeline(class B) and relationships (class C). 45 | 46 | \subsection{Camisani-Calzolari rule set} 47 | In this paper the features had as objective to focus on human accounts, but they can still be used to detect bots. \\ 48 | We start with the class A section, for example a couple of straightforward ones: has name, has image, has address, has biography, belongs to a list. 49 | Pretty simple most human users will fill this type of information.\\ 50 | The next ones assess that the account has been active and tries to find attributes that will correlate with humans: followers $\geq$ 30, tweets $\geq$ 50, URL 51 | in profile, 2 $*$ followers $\geq$ friends.\\ 52 | For the class B of this paper where we try to find information in the tweets of the user that can be relatable to humans. Some patterns that have been noticed that humans follow: geo-localized, is favorite, uses punctuation, uses hashtag. Another pattern that they noticed is that humans don't use the twitter api that much but rather a lot of different items available to us: uses iPhone, uses Android, uses Foursquare, uses Instagram, uses Twitter.com and using multiple clients. This last ones are some additional ones found to be done by humans: userID in tweet, 53 | tweets with URLs and retweet $\geq$ 1. 54 | 55 | The rule set contained the following classes of rules: 56 | 57 | \begin{itemize} 58 | \item Camisani-Calzolari: The Camisani-Calzolari rule set is described as follows: 22 class A rules 59 | \item State of search:They propose 7 rules, with 5 of class A, and 2 of class B 60 | \item SocialBakers: The rule set of social baker is composed of 9 rules, with 6 of class A and 3 of class B. 61 | \item Stringhini et al.: This paper proposes 5 features with 3 class A features, and 2 class B features 62 | \item Yang et al.:This paper proposes 9 features. Two of class A, 3 of class B, and 4 of class C. 63 | \end{itemize} 64 | 65 | \subsection{State of search} 66 | This ones are really general ones for bot detection by state of search website. 67 | For the class A: bot in biography, friends/followers == 100 and duplicate profile pictures.\\ 68 | For class B: same sentence to many accounts, tweet from API. 69 | This features are actually extracted from spammer bots. 70 | 71 | \subsection{Socialbakers} 72 | We start finding a trend with some of the ratios for our class A: friends/followers $\geq$ 50, friends $\geq$ 100. Followed by some features resembling the ones in Camisani: default image after 2 months, no bio, no location, 0 tweets.\\ 73 | For the Class B, we're only looking at ratios and tweets' similarities: tweets spam phrases, same tweet$\geq$ 3, retweets $\geq$ 90\%, tweet-links $\geq$ 90\%. 74 | \subsection{Stringhini et al.} 75 | The last 2 papers already have really strong features that we are going to use and see if we can improve the score they obtained by adding extra features. 
76 | Here you can see some of the best ones for class A: number of friends, number of tweets, $friends/(followers^2)$.\\ 77 | For class B we only have a couple: tweet similarity and URL ratio. 78 | This paper also tracked spam bots, whose main characteristics are listed below. 79 | 80 | \begin{itemize} 81 | \item spambots do not have thousands of friends; 82 | \item spambots have sent less than 20 tweets; 83 | \item the content of the spambots' tweets exhibits "message similarity"; 84 | \item spambots have a high ratio of tweets containing URLs; 85 | \item spambots have a high ratio between the total number of tweets from friends and the square of their total followers (a lower ratio means a legitimate user). 86 | \end{itemize} 87 | 88 | 89 | \subsection{Yang et al.} 90 | This last paper focuses on the API for its class B features: API ratio, API URL ratio, API tweet similarity. It also has some interesting class A features: age and following rate. The following rate is obvious: not many people follow bots. The age is smarter: spam bots tend to get banned or to be used only once, therefore their age will usually be lower than that of real accounts. 91 | 92 | \subsection{Rules that were not implemented} 93 | The following features were not implemented, as knowing whether a tweet came from an API was far from obvious: \\ 94 | 95 | \begin{itemize} 96 | \item{get\_api\_url\_ratio(): returns the ratio between the number of tweets containing a URL and the number of tweets sent from an API.} 97 | \item{get\_api\_tweet\_similarity(): supposedly returns a value representing the similarity between tweets sent from an API.} 98 | \end{itemize} 99 | 100 | We tried using the \textit{grep} command with the keyword "API" on the tweets.csv files but only detected a few results that seemed unrelated. 101 | 102 | \section{Data extraction} 103 | To generate the final feature set, we extracted the class A features (data easily obtained through the user profile) as well as the class C features (all the features, ranging from the easiest to obtain to the hardest ones, requiring many computations).\\ 104 | 105 | The result was a feature dataset ready to be used for training classifiers. 106 | 107 | \subsection{Available data} 108 | The researchers responsible for the paper Fame for sale made their 5 basic datasets available for future researchers.\\ 109 | 110 | Since we are trying to replicate their work to practice our skills and to understand the data and the proper way to analyse it more deeply, we follow their BAS dataset creation procedure. 111 | 112 | \subsection{Table 1 creation} 113 | In this section we create the base dataset that we will use throughout the project. 114 | From the 5 available datasets, we are going to create 1 final dataset: 115 | the BAS dataset, constituted of 1950 human twitter accounts and 1950 fake accounts.\\ The human accounts are simply the union of the human datasets we had available.\\ 116 | For the fake accounts, we randomly undersampled the 3 available datasets to obtain the same number of accounts as the human ones.\\ 117 | After undersampling the users, we used the ids of these users to collect the rest of their data in the other files.\\\\ 118 | There are small differences due to the randomness of the user selection: 119 | we first encountered a significant drop in the number of tweets, averaging 92k, which is far from the 118k reported in the paper. 
We later realized that our parser had an issue with tweets containing commas and was dropping them because of that error.\\ 120 | 121 | \begin{tabular}{lccccc} 122 | dataset & accounts & tweets & followers & friends & total relationships \\ 123 | \hline 124 | TFP & 469 & 563,693 & 258,494 & 241,710 & 500,204 \\ 125 | E13 & 1481 & 2,068,037 & 1,526,944 & 667,225 & 2,194,169 \\ 126 | FSF & 1169 & 22,910 & 11,893 & 253,026 & 264,919 \\ 127 | INT & 1337 & 58,925 & 23,173 & 517,485 & 540,658 \\ 128 | TWT & 845 & 114,192 & 28,588 & 729,839 & 758,427 \\ 129 | \hline 130 | HUM & 1950 & 2,631,730 & 1,785,438 & 908,935 & 2,694,373 \\ 131 | FAK & 1950 & 107,031 & 35,404 & 873,494 & 908,898 \\ 132 | \hline 133 | BAS & 3900 & 2,738,761 & 1,820,842 & 1,782,429 & 3,603,271 \\ 134 | \end{tabular} 135 | 136 | 137 | \section{Data Pre-Processing} 138 | \paragraph{} 139 | The datasets that we received were composed of several directories, each containing the following files: \textit{users.csv, friends.csv, followers.csv and tweets.csv}. These are regular csv files containing text, numerical values and NaN values.\\ 140 | 141 | NaN values are bothersome as they are their own type and can cause problems when they get mixed with strings or numbers. Therefore the first thing to do was to replace every NaN instance by something less troublesome. We decided that an empty string would be a good solution. This was done in a single command, when opening the CSV file: 142 | 143 | \begin{lstlisting} 144 | pd.read_csv(totalPath, encoding='latin-1').fillna('') 145 | \end{lstlisting} 146 | 147 | \section{Feature set generation} 148 | Based on \textit{Cresci et al.}, 3 classes of features were identified: classes \textit{A, B} and \textit{C}. Class \textit{A} features are features whose data can be obtained directly from the user profile, while class \textit{B} features require a simple computation, and class \textit{C} features require heavier computations.\\ 149 | 150 | We implemented the corresponding functions and used them to extract 2 feature sets out of the data: a \textit{class A} feature set and a \textit{class C} feature set, which encompasses the features of every class (\textit{A, B} and \textit{C}). 151 | 152 | \subsection{Data labeling} 153 | For each initial dataset (E13, FAK, FSF, HUM, INT, TFP, TWT) the label of the users is known. The TFP and HUM datasets only contain real users, while the other datasets contain fake users.\\ 154 | 155 | Therefore, when generating the class A and class C feature sets, we add the label 'human' to the feature sets based on TFP and HUM, and the label 'bot' to the others. These labels are actually numbers in the feature sets: 1 for bots, and 0 for humans. 156 | 157 | \section{Process Optimization} 158 | The amount of data that we had to manipulate was by no means huge or overwhelming, but it was still big enough so that, if we were not careful, some optimization issues could arise.\\ 159 | 160 | Developing in Python, we used the popular pandas library and its DataFrame object to manipulate our data and perform computations. Soon, however, the program was plagued by an abnormally slow execution time.\\ 161 | 162 | The issue came from manually looping over DataFrame objects, either by using a \textit{for} loop, or by using the DataFrame.iterrows() function. This led to processing times that could reach between 10 and 20 seconds per user record. 
Considering that we had several thousand users, another solution had to be found.\\\\ 163 | This part of the project was definitely the most time-consuming and involved a lot of small tricks to achieve the expected result for certain features.\\ 164 | As explained above, after a first implementation of certain features we realized that the extraction was extremely slow, so we decided to cut the test set down to 10 users to ensure a functional extractor without waiting 30 minutes after each execution. This already improved our testing time.\\ 165 | After that, we realized that we were getting very long execution times for the class B features. Because the class B features are related to tweets, and a user can have thousands of tweets, some features involved a lot of looping. 166 | 167 | \subsection{DataFrame.apply() and lambda functions} 168 | We found out that there are several efficient ways to loop over a DataFrame. The easiest is through lambda functions, which are automatically executed on every row of the DataFrame without requiring manual management of the iterative process.\\ 169 | 170 | The result was efficient, and the processing time per user came down from 10-20s to about 0.3s. This gain in performance was good, but with 3900 users to process we were still talking about 1300s, which is a really long processing time. To improve that, we researched optimisation techniques with pandas.Series. Using built-in functions and conditions on the Series, we kept improving our processing time. For some of our features we obtained huge improvements, allowing us to reduce most of the class B features to about 0.05s (except for the HUM dataset, because of the high amount of data to check).\\\\ 171 | 172 | Because the built-in functions are limited and conditional selection on a Series is limited for some of the features, we had to be creative. Here are some examples we want to showcase: 173 | 174 | \begin{verbatim} 175 | def is_favorite(userRow, tweetsDF): 176 |     tweets = cache.get_user_tweets(int(userRow['id']), tweetsDF) 177 |     fav = tweets['favorite_count'] != 0 178 |     return int(not tweets[fav].empty) 179 | \end{verbatim} 180 | Here we recover a dataframe containing only the tweets of the user. We create a boolean mask, a Series of True and False values marking the tweets whose favorite count is different from 0. Indexing the dataframe with it keeps only the matching tweets; by checking tweets[fav].empty we verify whether at least one tweet matched. 181 | \\ A last example: 182 | \begin{verbatim} 183 | def uses_foursquare(userRow, tweetsDF): 184 |     all_tweets = cache.get_user_tweets(int(userRow['id']), tweetsDF) 185 |     return "foursquare" in all_tweets['source'].str.cat() 186 | \end{verbatim} 187 | Here, instead of looping through the tweet sources, appending them to a string and finally checking whether the searched string is in it, we adopt another approach: isolate the Series from the dataframe and concatenate its strings using the built-in str.cat() function. It achieves the same thing but reduces the processing time drastically.\\\\ 188 | 189 | Many other features have been approached in unconventional ways to achieve better results. \\ 190 | 191 | Unfortunately, we only found out too late about a technique called vectorization, which would arguably perform even faster. This is not a problem, since we look forward to trying it out in the future and achieving the most optimized result possible. 
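For completeness, here is a minimal sketch of how the lambda-based extraction described above can be wired together (the function and dataframe names are illustrative, not the exact code of our extractor):

\begin{verbatim}
import pandas as pd

def extract_class_B_features(usersDF, tweetsDF):
    # apply() invokes each feature function once per user row,
    # without an explicit Python-level loop
    features = pd.DataFrame(index=usersDF.index)
    features['is_favorite'] = usersDF.apply(
        lambda row: is_favorite(row, tweetsDF), axis=1)
    features['uses_foursquare'] = usersDF.apply(
        lambda row: uses_foursquare(row, tweetsDF), axis=1)
    return features
\end{verbatim}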
192 | 193 | \section{Classifiers} 194 | In \textit{Cresci et al.} the following 8 classifiers are proposed: Random Forest, Decorate, Decision Tree, Adaptive Boosting, Bayesian Network, k-Nearest Neighbors, Logistic Regression and Support Vector Machine.\\ 195 | 196 | For the sake of simplicity, we decided to focus on the following classifiers: Random Forest, Decision Tree, Adaptive Boosting, k-Nearest Neighbors, Logistic Regression and Support Vector Machine.\\ 197 | 198 | \subsection{Results} 199 | After training our classifiers on our generated datasets, we obtain the following results: 200 | 201 | \begin{tabular}{lcccccc} 202 | \hline 203 | \textbf{Algorithm} & \textbf{Accuracy} & \textbf{Precision} & \textbf{Recall} & \textbf{F-M} & \textbf{MCC} & \textbf{AUC}\\ 204 | \hline 205 | \textit{Class A}\\ 206 | 207 | J48 & 1 & 1 & 1 & 1 & 1 & 1 \\ 208 | LR & 0.997 & 0.997 & 0.997 & 0.997 & 0.994 & 0.997 \\ 209 | AB & 1 & 1 & 1 & 1 & 1 & 1 \\ 210 | SVM & 0.678 & 0.999 & 0.356 & 0.525 & 0.464 & 0.78\\ 211 | RF & 0.999 & 1 & 0.999 & 0.999 & 0.999 & 0.999 \\ 212 | kNN & 0.944 & 0.937 & 0.951 & 0.944 & 0.888 & 0.944\\ 213 | 214 | \hline 215 | \end{tabular} 216 | 217 | \paragraph{Accuracy} 218 | Accuracy measures the percentage of samples correctly identified in both classes (human, bot). In our case, the accuracy is very high, except for the Support Vector Machine, for a reason that we could not explain. 219 | 220 | \paragraph{Precision} 221 | A high precision indicates that the samples that were classified as bots actually are bots. Here, the precision is very high for every classifier. The high precision of the SVM combined with its low accuracy indicates that when the classifier labels a sample as a bot, it is very confident and is almost always right. But the classifier stays silent on many samples that are bots and fails to identify them as such. 222 | 223 | \paragraph{Recall} 224 | The recall captures the relevant samples that have been missed by the classifier. As expected, based on the accuracy and precision, the SVM classifier scores low, meaning that when it sees a bot, it often doesn't recognize it as such. The rest of the classifiers perform very well. 225 | 226 | \paragraph{F-M} 227 | The F-Measure gives a global assessment of the quality of the classifier. It is high for all classifiers, except for the SVM. 228 | 229 | \paragraph{MCC} 230 | Similar to the F-M, the MCC tries to give a single value that represents the quality of the classifier. All classifiers score well except for the SVM. 231 | \paragraph{AUC} 232 | 233 | As the table shows, the AUC is close to 1 for every classifier except the SVM (0.78), which is consistent with the other metrics. 234 | Due to an execution error caused by a bad value, the results for class C couldn't be displayed. 235 | 236 | \section{Conclusion} 237 | In this project, we first familiarized ourselves with the various datasets provided, before starting to build features that emulate the 5 studies described in Cresci \textit{et al}. \\ 238 | Based on those features and on the Class A and Class C feature sets provided by Cresci \textit{et al.}, we generated, for each of the 6 datasets, a pair of Class A and Class C feature datasets in which every row was labeled as either 'human' or 'bot'.\\ 239 | Once these features were generated and labeled, they were used to train various \textit{classifiers}, in the hope of being able to distinguish fake users from legitimate users. 
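\paragraph{Note on metric computation}
For reference, here is a minimal sketch (assuming scikit-learn; not necessarily the exact code of our metrics module) of how the scores reported in the results table can be computed from the 10-fold cross-validated predictions returned by our classify() function:

\begin{verbatim}
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def score_predictions(labels, pred):
    # labels: true classes (0 = human, 1 = bot), pred: predicted classes
    return {'Accuracy':  accuracy_score(labels, pred),
            'Precision': precision_score(labels, pred),
            'Recall':    recall_score(labels, pred),
            'F-M':       f1_score(labels, pred),
            'MCC':       matthews_corrcoef(labels, pred),
            'AUC':       roc_auc_score(labels, pred)}
\end{verbatim}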
241 | 242 | \bibliography{bibliography} 243 | \bibliographystyle{plain} 244 | \end{document} -------------------------------------------------------------------------------- /research/efficient_detection_fake_twitter_followers.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/efficient_detection_fake_twitter_followers.pdf -------------------------------------------------------------------------------- /research/project_description.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/project_description.pdf -------------------------------------------------------------------------------- /research/stringhini_Detecting spammer on social network.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/stringhini_Detecting spammer on social network.pdf -------------------------------------------------------------------------------- /research/stringhini_Detecting spammer on social network_paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/stringhini_Detecting spammer on social network_paper.pdf -------------------------------------------------------------------------------- /src/cachedata.py: -------------------------------------------------------------------------------- 1 | import tweets as tw 2 | import users as us 3 | 4 | 5 | _user_tweets = {} 6 | _user_friends = {} 7 | _user_followers = {} 8 | 9 | _user_tweets_count = {} 10 | _user_friends_count = {} 11 | _user_followers_count = {} 12 | 13 | 14 | def get_user_tweets(userID, tweetsDF): 15 | tweets = None 16 | try: 17 | tweets = _user_tweets[userID] 18 | 19 | except KeyError: 20 | _user_tweets[userID] = tw.get_tweets_dataframe_user(userID, tweetsDF) 21 | 22 | finally: 23 | tweets = _user_tweets[userID] 24 | 25 | return tweets 26 | 27 | def get_tweets_count(userID, tweetsDF): 28 | tweetCount = 0 29 | 30 | try: 31 | tweetCount = _user_tweets_count[userID] 32 | 33 | except KeyError: 34 | _user_tweets_count[userID] = get_user_tweets(userID,tweetsDF).shape[0] 35 | 36 | finally: 37 | tweetCount = _user_tweets_count[userID] 38 | 39 | return tweetCount 40 | 41 | def get_user_friends(userID, friendsDF): 42 | friends = None 43 | 44 | try: 45 | friends = _user_friends[userID] 46 | 47 | except KeyError: 48 | _user_friends[userID] = us.get_friends_ids(userID, friendsDF) 49 | 50 | finally: 51 | friends = _user_friends[userID] 52 | 53 | return friends 54 | 55 | def get_friends_count(userID, friendsDF): 56 | friendsCount = 0 57 | 58 | try: 59 | userCount = _user_friends_count[userID] 60 | 61 | except KeyError: 62 | _user_friends_count[userID] = len(get_user_friends(userID,tweetsDF)) 63 | 64 | finally: 65 | friendsCount = _user_friends_count[userID] 66 | 67 | return friendsCount 68 | 69 | def get_user_followers(userID, followersDF): 70 | followers = None 71 | 72 | try: 73 | friends = _user_friends[userID] 74 | 75 | except KeyError: 76 | _user_followers[userID] = us.get_followers_ids(userID, followersDF) 77 | 78 | finally: 79 | followers = _user_followers[userID] 80 | 81 | return followers 82 | 83 | def 
get_followers_count(userID, followersDF): 84 | followersCount = 0 85 | 86 | try: 87 | followersCount = _user_followers_count[userID] 88 | 89 | except KeyError: 90 | _user_followers_count[userID] = len(get_user_followers(userID,tweetsDF).tolist()) 91 | 92 | finally: 93 | followersCount = _user_followers_count[userID] 94 | 95 | return followersCount -------------------------------------------------------------------------------- /src/classifiers.py: -------------------------------------------------------------------------------- 1 | # References for this section 2 | # 3 | # Classifiers used to evaluate the features 4 | # Decorate (D), Adaptive Boost (AB), Random Forest (RF), 5 | # Decision Tree (J48), Bayesian Network (BN), k-Nearest Neighbors (kNN), 6 | # Multinomial Ridge Logistic Regression (LR) and a Support Vector 7 | # Machine (SVM). 8 | # 9 | # D, RF, J48 and BN 10 | # ----------------- 11 | # http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.414.5888&rep=rep1&type=pdf 12 | # -> (http://puu.sh/yVTDB/41ec806328.png) 13 | # Nothing special mentioned concerning the parameters for these 14 | # classifiers. 15 | # 16 | # http://www.cse.iitd.ernet.in/~siy117527/sil765/readings/Detecting%20spammer%20on%20social%20network.pdf 17 | # Again no information about the RF configuration 18 | # 19 | # SVM 20 | # --- 21 | # Our SVM classifier exploits a Radial Basis Function (RBF) kernel 22 | # and has been trained using libSVM as the machine learning algorithm 23 | # The cost and gamma parameters have been optimized via a grid search 24 | # algorithm. 25 | # 26 | # kNN 27 | # --- 28 | # k parameter of the kNN classifier and the ridge penalizing parameter 29 | # of the LR model have been optimized via a cross validation parameter 30 | # selection algorithm. 
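#
# For reference, a minimal sketch (hypothetical parameter grid, not the
# exact one used) of how the C/gamma grid search mentioned above could
# be run with scikit-learn:
#
#   from sklearn.model_selection import GridSearchCV
#   from sklearn import svm
#   param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
#   search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=10)
#   search.fit(features, labels)
#   clf = svm.SVC(kernel='rbf', C=search.best_params_['C'],
#                 gamma=search.best_params_['gamma'])
#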
31 | from sklearn.model_selection import cross_val_score 32 | from sklearn.model_selection import cross_val_predict 33 | from sklearn.ensemble import RandomForestClassifier 34 | from sklearn.tree import DecisionTreeClassifier 35 | from sklearn.ensemble import AdaBoostClassifier 36 | from sklearn.linear_model import LogisticRegression 37 | #from pomegranate import * 38 | from sklearn.neighbors import KNeighborsClassifier 39 | from sklearn import svm 40 | 41 | 42 | def random_forest(): 43 | # http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html 44 | # 45 | # RandomForestClassifier(n_estimators=10, criterion=’gini’, 46 | # max_depth=None, min_samples_split=2, min_samples_leaf=1, 47 | # min_weight_fraction_leaf=0.0, max_features=’auto’, 48 | # max_leaf_nodes=None, min_impurity_decrease=0.0, 49 | # min_impurity_split=None, bootstrap=True, oob_score=False, 50 | # n_jobs=1, random_state=None, verbose=0, warm_start=False, 51 | # class_weight=None) 52 | # 53 | clf = RandomForestClassifier() 54 | # clf.fit(X, y) 55 | # clf.predict([[0, 0, 0, 0]])) 56 | return clf 57 | 58 | def decorate(): 59 | # Install https://github.com/fracpete/python-weka-wrapper3 60 | # 61 | # examples: https://github.com/fracpete/python-weka-wrapper3-examples/blob/master/src/wekaexamples/classifiers/classifiers.py 62 | # pass 63 | # load a dataset 64 | loader = Loader("weka.core.converters.ArffLoader") 65 | 66 | #evaluation 67 | evaluation = Evaluation(train_features) 68 | 69 | 70 | def decision_tree(): 71 | # http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier 72 | # clf = DecisionTreeClassifier(random_state=0) 73 | # 74 | # DecisionTreeClassifier(criterion=’gini’, splitter=’best’, 75 | # max_depth=None, min_samples_split=2, min_samples_leaf=1, 76 | # min_weight_fraction_leaf=0.0, max_features=None, 77 | # random_state=None, max_leaf_nodes=None, 78 | # min_impurity_decrease=0.0, min_impurity_split=None, 79 | # class_weight=None, presort=False) 80 | # J48 81 | clf = DecisionTreeClassifier() 82 | #clf = train_decision_tree(train_features,train_labels) 83 | 84 | #result = cls.score(test_features, test_labels) 85 | 86 | return clf 87 | 88 | #def train_decision_tree(train_features, train_labels): 89 | # clf = tree.DecisionTreeClassifier() 90 | # clf = clf.fit(train_features, train_labels) 91 | 92 | 93 | def adaptive_boost(): 94 | # http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html 95 | # 96 | # AdaBoostClassifier(base_estimator=None, n_estimators=50, 97 | # learning_rate=1.0, algorithm=’SAMME.R’, random_state=None) 98 | # 99 | # iris = load_iris() 100 | # clf = AdaBoostClassifier(n_estimators=100) 101 | # scores = cross_val_score(clf, iris.data, iris.target) 102 | # scores.mean() 103 | clf = AdaBoostClassifier() 104 | #clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.1) 105 | return clf 106 | 107 | 108 | def bayesian_network(): 109 | # http://pomegranate.readthedocs.io/en/latest/BayesianNetwork.html 110 | # Careful with dependencies 111 | # 112 | model = BayesianNetwork.from_samples(train_features, algorithm='exact') 113 | model.predict(test_features) 114 | 115 | def k_nearest_neighbors(): 116 | # http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html 117 | # 118 | #clf = KNeighborsClassifier(n_neighbors=3) 119 | clf = KNeighborsClassifier() 120 | # clf.fit(X, y) 121 | # print(clf.predict([[1.1]])) 122 | # 123 | # (n_neighbors=5, 
weights=’uniform’, algorithm=’auto’, leaf_size=30, 124 | # p=2, metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs) 125 | return clf 126 | 127 | def logistic_regression(): 128 | # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html 129 | # 130 | # LogisticRegression(penalty=’l2’, dual=False, tol=0.0001, C=1.0, 131 | # fit_intercept=True, intercept_scaling=1, class_weight=None, 132 | # random_state=None, solver=’liblinear’, max_iter=100, 133 | # multi_class=’ovr’, verbose=0, warm_start=False, n_jobs=1) 134 | # 135 | clf = LogisticRegression() 136 | # clf.fit(X, y) 137 | # clf.predict(X) 138 | # 139 | return clf 140 | 141 | def support_vector_machine(): 142 | clf = svm.SVC(kernel='rbf',C=10.0,gamma='auto') 143 | #clf.fit(train_features, train_labels) 144 | # clf.predict([[2., 2.]]) 145 | # 146 | # Algo: libSVM (SVC) 147 | # kernel: rbf 148 | # C : optimized 149 | # gamma : optimized 150 | # http://scikit-learn.org/stable/modules/grid_search.html 151 | # http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV 152 | # http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html 153 | # 154 | # SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, 155 | # decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', 156 | # max_iter=-1, probability=False, random_state=None, shrinking=True, 157 | # tol=0.001, verbose=False) 158 | return clf 159 | 160 | def classify(features, labels): 161 | 162 | # CROSS VALIDATION: cross_val_score(clf, test_features, test_labels, cv=10) 163 | # or 164 | # from sklearn.model_selection import cross_val_predict 165 | # pred = cross_val_predict(clf, test_features, test_labels) 166 | 167 | # dico with the classifiers 168 | classifiers_dict = {} 169 | # dico with the predictions made with the 170 | predictions_dict = {} 171 | 172 | classifiers_dict['RF'] = random_forest() 173 | #classifiers_dict['D'] = decorate() 174 | classifiers_dict['J48'] = decision_tree() 175 | classifiers_dict['AB'] = adaptive_boost() 176 | #classifiers_dict['BN'] = bayesian_network() 177 | classifiers_dict['kNN'] = k_nearest_neighbors() 178 | classifiers_dict['LR'] = logistic_regression() 179 | classifiers_dict['SVM'] = support_vector_machine() 180 | 181 | 182 | for key, value in classifiers_dict.items(): 183 | try: 184 | predictions_dict[key] = cross_val_predict(value, features, labels, cv=10) 185 | except ValueError: 186 | predictions_dict[key] = None 187 | return predictions_dict 188 | -------------------------------------------------------------------------------- /src/csv_processor.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from pathlib import Path 4 | import pandas as pd 5 | import numpy as np 6 | import sklearn 7 | import csv 8 | import random 9 | import linecache 10 | import matplotlib.pyplot as plt 11 | import time 12 | 13 | def get_dataframe(filename): 14 | df = None 15 | 16 | try: 17 | df = pd.read_csv(filename) 18 | 19 | except: 20 | print("Error while loading file : "+filename) 21 | 22 | return df 23 | 24 | def HUM_creator(e13_folder,tfp_folder,hum_folder): 25 | """ 26 | Very simple compilation of the HUM datasets merging E13 and TFP. 
27 | E13 count: 1481 28 | TFP count: 469 29 | final accounts: 1950 30 | """ 31 | files = ["users.csv","followers.csv","friends.csv","tweets.csv"] 32 | for file in files: 33 | f1 = open(e13_folder+file,"r", errors="ignore") 34 | f2 = open(tfp_folder+file,"r", errors="ignore") 35 | f3 = open(hum_folder+file,"w", errors="ignore") 36 | for lines in f1.readlines(): 37 | f3.write(lines) 38 | 39 | for lines in f2.readlines()[1:]: 40 | f3.write(lines) 41 | f1.close() 42 | f2.close() 43 | f3.close() 44 | 45 | def FAK_creator(fsf_folder,int_folder,twt_folder,fak_folder): 46 | """ 47 | From the dataset of fake accounts, we undersample the dataset randomly to 48 | reach the same amout as the HUM dataset. 49 | FSF count: 1169 50 | INT count: 1337 51 | TWT count: 845 52 | fake accounts count: 3351 53 | final accounts needed: 1950 54 | percent randomly selected from the fake sample: 0.5819 55 | """ 56 | file_merging(fsf_folder,int_folder,twt_folder,fak_folder,0.5819) 57 | 58 | def file_merging(folder1,folder2,folder3,result_folder,percent): 59 | # 3 folders with fake accounts 60 | # 61 | # In each folder we start by randomly undersampling the users 62 | # Then we use the ids collected for the users and fetch the corresponding data 63 | # in the other files. 64 | folder_list = [folder1,folder2,folder3] 65 | files = ["users.csv","followers.csv","friends.csv","tweets.csv"] 66 | 67 | # flag to check if the header has already been added. 68 | # [users","followers","friends","tweets"] 69 | # 0 -> no header 70 | # 1 -> header already added 71 | head_flags = [0,0,0,0] 72 | 73 | for folder in folder_list: 74 | # List for the user's ids randomly picked (reset for each folder) 75 | list_of_ids = [] 76 | 77 | for file in files: 78 | 79 | #with open(folder+file, newline='', encoding='utf-32') as fp: 80 | # reader = csv.reader(fp, dialect='excel') 81 | # for line in reader: 82 | # print(str(line).encode("utf-8")) 83 | #fp.close() 84 | 85 | 86 | f = open(folder+file,"r", errors="ignore") 87 | r = open(result_folder+file,"a",errors="ignore") 88 | 89 | lines = f.readlines() 90 | 91 | len_file = len(lines) 92 | 93 | # choosing the number of lines we'll keep 94 | number_lines = int(percent*len_file) 95 | 96 | # This creates a random sub-sample of the lines of the files 97 | # 98 | # This works to select the percent of users we are interested but 99 | # for the other files we have to select the corresponding data. 100 | 101 | print("file: "+folder+file) 102 | 103 | if file == "users.csv": 104 | if head_flags[0] == 0: 105 | r.write(lines[0]) 106 | head_flags[0] = 1 107 | 108 | for line in random.sample(range(1,len_file), number_lines): 109 | #print("list_of_ids: "+str((lines[line].split('","')[0].strip('"')))) 110 | list_of_ids.append(int(lines[line].split('","')[0].strip('"'))) 111 | r.write(lines[line]) 112 | #print("liste: "+str(list_of_ids)) 113 | #time.sleep(1) 114 | # no need for conditional check, just make sure that the flag is at 1 after header added 115 | 116 | # So after selecting the percent of users we want, we have to fetch the ids 117 | # and collect the corresponding data in the other files. 
118 | # for tweets.csv the id is the [3] element 119 | # for friends.csv the id is the [0] element 120 | # and for followers.csv the id is the [1] element 121 | 122 | elif file == "followers.csv": 123 | # header check 124 | if head_flags[1] == 0: 125 | r.write(lines[0]) 126 | head_flags[1] = 1 127 | # id check 128 | for line in lines[1:]: 129 | if int(line.split(",")[1].replace('"','')) in list_of_ids: 130 | r.write(line) 131 | 132 | elif file == "friends.csv": 133 | if head_flags[2] == 0: 134 | r.write(lines[0]) 135 | head_flags[2] = 1 136 | for line in lines[1:]: 137 | if int(line.split(",")[0].strip('"')) in list_of_ids: 138 | r.write(line) 139 | 140 | elif file == "tweets.csv": 141 | if head_flags[3] == 0: 142 | r.write(lines[0]) 143 | head_flags[3] = 1 144 | for line in lines[1:]: 145 | line = line.replace(',,',',"",') 146 | if int(line.split('","')[4]) in list_of_ids: 147 | r.write(line) 148 | f.close() 149 | r.close() 150 | 151 | if(__name__ == "__main__"): 152 | HUM_creator("../data/E13/","../data/TFP/","../data/HUM/") 153 | #FAK_creator("../data/FSF/","../data/INT/","../data/TWT/","../data/FAK/") 154 | #BAS_creator("../data/HUM/","../data/FAK/","../data/BAS/") -------------------------------------------------------------------------------- /src/features.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sys 3 | from datetime import datetime 4 | from dateutil.parser import parse 5 | import itertools 6 | import string 7 | import cachedata as cache 8 | from tweets import * 9 | from url_finder import * 10 | from users import * 11 | import math 12 | from time import * 13 | from difflib import SequenceMatcher 14 | 15 | ########### 16 | # Class A # 17 | ########### 18 | 19 | # Camisani-Calzolari 20 | HAS_NAME = 'has_name' 21 | HAS_IMAGE = 'has_image' 22 | HAS_ADDRESS = 'has_address' 23 | HAS_BIO = 'has_bio' 24 | HAS_30_FOLLOWERS ='has_30_followers' 25 | BELONGS_TO_A_LIST = 'belongs_to_a_list' 26 | HAS_50_TWEETS = 'has_50_tweets' 27 | URL_IN_PROFILE = 'url_in_profile' 28 | FOLLOWERS_TO_FRIENDS_RATIO_OVER_2 = 'followers_to_friends_ration_over_2' 29 | 30 | # State of search 31 | BOT_IN_BIO = 'bot_in_bio' 32 | FRIENDS_TO_FOLLOWERS_RATIO_IS_100 = 'friends_to_followers_ratio_is_100' 33 | DUPLICATE_PROFILE_PICTURE = 'duplicate_profile_picture' 34 | 35 | # Socialbakers 36 | HAS_x50_FOLLOWERS = 'has_x50_followers' 37 | HAS_DEFAULT_IMAGE = 'has_default_image' 38 | HAS_NO_BIO = 'has_no_bio' 39 | HAS_NO_LOCATION = 'has_no_location' 40 | HAS_100_FRIENDS = 'has_100_friends' 41 | HAS_NO_TWEETS = 'has_no_tweets' 42 | 43 | # Stringhini et al. 44 | NUMBER_OF_FRIENDS = 'number_of_friends' 45 | NUMBER_OF_FRIENDS_TWEETS = 'number_of_friends_tweets' 46 | NUMBER_OF_TWEETS_SENT = 'number_of_tweets_sent' 47 | FRIENDS_TO_FOLLOWERS_RATIO = 'friends_to_followers_ratio' 48 | 49 | # Yang et al. 
50 | ACCOUNT_AGE = 'account_age' 51 | FOLLOWING_RATE = 'following_rate' 52 | 53 | ########### 54 | # Class B # 55 | ########### 56 | 57 | # Camisani-Calzolari 58 | GEOLOCALIZED = 'geolocalized' 59 | IS_FAVORITE = 'is_favorite' 60 | USES_PUNCTUATION = 'uses_punctuation' 61 | USES_HASHTAG = 'uses_hashtag' 62 | USES_IPHONE = 'uses_iphone' 63 | USES_ANDROID = 'uses_android' 64 | USES_FOURSQUARE = 'uses_foursquare' 65 | USES_INSTAGRAM = 'uses_instagram' 66 | USES_TWITTERDOTCOM = 'uses_twitterdotcom' 67 | USERID_IN_TWEET = 'userid_in_tweet' 68 | TWEETS_WITH_URL = 'tweets_with_url' 69 | RETWEET_OVER_1 = 'retweet_over_1' 70 | USES_DIFFERENT_CLIENTS = 'uses_different_clients' 71 | 72 | #State of Search 73 | DUPLICATE_SENTENCES_ACROSS_TWEETS = 'duplicate_sentences_across_tweets' 74 | API_TWEETS = 'api_tweets' 75 | 76 | # Socialbakers 77 | HAS_DUPLICATE_TWEETS = 'has_duplicate_tweets' 78 | HIGH_RETWEET_RATIO = 'high_retweet_ratio' 79 | HIGH_TWEET_LINK_RATIO = 'high_tweet_link_ratio' 80 | 81 | 82 | # Stringhini et al. 83 | TWEET_SIMILARITY = 'tweet_similarity' 84 | URL_RATIO = 'url_ratio' 85 | UNIQUE_FRIENDS_NAME_RATIO = 'unique_friends_name' 86 | 87 | # Yang et al. 88 | API_RATIO = 'api_ratio' 89 | API_URL_RATIO = 'api_url_ratio' 90 | API_TWEET_SIMILARITY = 'api_tweet_similarity' 91 | 92 | 93 | 94 | ########### 95 | # Class C # 96 | ########### 97 | 98 | # Yang et al. 99 | BILINK_RATIO = 'bi-link_ratio' 100 | AVERAGE_NEIGHBORS_FOLLOWERS = 'average_neighbors_followers' 101 | AVERAGE_NEIGHBORS_TWEETS = 'average_neighbors_tweets' 102 | FOLLOWINGS_TO_MEDIAN_NEIGHBORS_FOLLOWERS = 'followings_to_median_neighbors_followers' 103 | 104 | #********************************* 105 | # Features set names * 106 | #********************************* 107 | CAMISANI = 'camisani' #'Camisani-Calzolari' 108 | STATEOFSEARCH = 'stateofsearch' #'State of search' 109 | SOCIALBAKERS = 'socialbakers' #'SocialBakers' 110 | STRINGHINI = 'stringhini' #'Stringhini et al' 111 | YANG = 'yang' #'Yang et al' 112 | CLASS_A = 'A' 113 | CLASS_C = 'C' 114 | 115 | def get_features(featureSetName, dataframes): 116 | features = {} 117 | 118 | if(featureSetName == CAMISANI): 119 | features = get_camisani_features(dataframes) 120 | 121 | elif(featureSetName == STATEOFSEARCH): 122 | features = get_state_of_search_features(dataframes) 123 | 124 | elif(featureSetName == SOCIALBAKERS): 125 | features = get_socialbakers_features(dataframes) 126 | 127 | elif(featureSetName == STRINGHINI): 128 | features = get_stringhini_features(dataframes) 129 | 130 | elif(featureSetName == YANG): 131 | features = get_yang_features(dataframes) 132 | 133 | elif(featureSetName == CLASS_A): 134 | features = get_class_A_features(dataframes) 135 | 136 | elif(featureSetName == CLASS_C): 137 | features = get_class_C_features(dataframes) 138 | 139 | else: 140 | print("Error Unknown feature set specified : "+featureSetName) 141 | 142 | return features 143 | 144 | def get_class_A_features(dataframes): 145 | usersDF = dataframes['users'] 146 | tweetsDF = dataframes['tweets'] 147 | 148 | features = [] 149 | 150 | LIMIT = 10 151 | limit = 1 152 | 153 | for index, row in usersDF.iterrows(): 154 | #timelog("[{}] User {}".format(index,row['id'])) 155 | features.append(get_single_class_A_features(row,usersDF, tweetsDF)) 156 | 157 | #Temporary code, for test purpose 158 | ''' 159 | if(limit > LIMIT): 160 | break 161 | else: 162 | limit = limit +1 163 | ''' 164 | 165 | return features 166 | 167 | 168 | def get_single_class_A_features(userRow, usersDF,tweetsDF): 169 | #Class A features 
= profile-based features 170 | userID = userRow['id'] 171 | 172 | #Camisani class A 173 | features = {} 174 | features[HAS_NAME] = has_name(userRow) 175 | features[HAS_IMAGE] = has_image(userRow) 176 | features[HAS_ADDRESS] = has_address(userRow) 177 | features[HAS_BIO] = has_bio(userRow) 178 | features[HAS_30_FOLLOWERS] = has_30_followers(userRow) 179 | features[BELONGS_TO_A_LIST] = belongs_to_a_list(userRow) 180 | features[HAS_50_TWEETS] = has_50_tweets(userRow) 181 | features[URL_IN_PROFILE] = url_in_profile(userRow) 182 | features[FOLLOWERS_TO_FRIENDS_RATIO_OVER_2] = followers_to_friends_ration_over_2(userRow) 183 | 184 | #State of search class A 185 | features[BOT_IN_BIO] = bot_in_bio(userRow) 186 | features[FRIENDS_TO_FOLLOWERS_RATIO_IS_100] = friends_to_followers_ratio_is_100(userRow) 187 | features[DUPLICATE_PROFILE_PICTURE] = duplicate_profile_picture(userRow,usersDF) 188 | 189 | #Social bakers class A 190 | features[HAS_x50_FOLLOWERS] = has_x50_followers(userRow) 191 | features[HAS_DEFAULT_IMAGE] = has_default_image(userRow) 192 | #features[HAS_NO_BIO] = has_no_bio(userRow) #Same as has_bio from camisani 193 | features[HAS_NO_LOCATION] = has_no_location(userRow) 194 | features[HAS_100_FRIENDS] = has_100_friends(userRow) 195 | features[HAS_NO_TWEETS] = has_no_tweets(userID, tweetsDF) 196 | 197 | #Stringhini class A 198 | features[NUMBER_OF_FRIENDS] = get_friends_count(userRow) 199 | features[FRIENDS_TO_FOLLOWERS_RATIO] = get_stringhini_friends_to_followers_ratio(userRow) 200 | 201 | #Yang class A 202 | features[ACCOUNT_AGE] = get_account_age(userRow) 203 | 204 | return features 205 | 206 | def get_class_C_features(dataframes): 207 | usersDF = dataframes['users'] 208 | tweetsDF = dataframes['tweets'] 209 | friendsDF = dataframes['friends'] 210 | followersDF = dataframes['followers'] 211 | 212 | features = [] 213 | 214 | LIMIT = 5 215 | t0 = time() 216 | for index, row in usersDF.iterrows(): 217 | #timelog("[{}] User {}".format(index,row['id'])) 218 | features.append(get_single_class_C_features(row,usersDF, friendsDF,followersDF,tweetsDF)) 219 | 220 | #Temporary code, for test purpose 221 | ''' 222 | if(index > LIMIT): 223 | break 224 | ''' 225 | 226 | 227 | return features 228 | 229 | def get_single_class_C_features(userRow,usersDF, friendsDF,followersDF,tweetsDF): 230 | #Class C features = all features 231 | userID = userRow['id'] 232 | features = get_single_class_A_features(userRow, usersDF,tweetsDF) 233 | 234 | #Camisani Class B 235 | features[GEOLOCALIZED] = geolocalized(userRow,tweetsDF) 236 | features[IS_FAVORITE] = is_favorite(userRow,tweetsDF) 237 | features[USES_PUNCTUATION] = uses_punctuation(userRow,tweetsDF) 238 | features[USES_HASHTAG] = uses_hashtag(userRow,tweetsDF) 239 | features[USES_IPHONE] = uses_iphone(userRow,tweetsDF) 240 | features[USES_ANDROID] = uses_android(userRow,tweetsDF) 241 | features[USES_FOURSQUARE] = uses_foursquare(userRow,tweetsDF) 242 | features[USES_INSTAGRAM] = uses_instagram(userRow,tweetsDF) 243 | features[USES_TWITTERDOTCOM] = uses_twitterdotcom(userRow,tweetsDF) 244 | features[USERID_IN_TWEET] = userid_in_tweet(userRow,tweetsDF) 245 | features[TWEETS_WITH_URL] = tweets_with_url(userRow,tweetsDF) 246 | features[RETWEET_OVER_1] = retweet_over_1(userRow,tweetsDF) 247 | features[USES_DIFFERENT_CLIENTS] = uses_different_clients(userRow,tweetsDF) 248 | 249 | #State of search Class B 250 | features[DUPLICATE_SENTENCES_ACROSS_TWEETS] = duplicate_sentences_across_tweets(userRow,tweetsDF) 251 | features[API_TWEETS] = api_tweets(userRow,tweetsDF) 252 
| 253 | #Social Bakers class B 254 | features[HAS_DUPLICATE_TWEETS] = has_duplicate_tweets(userID,tweetsDF,3) 255 | features[HIGH_RETWEET_RATIO] = has_retweet_ratio(userID,tweetsDF,0.9) 256 | features[HIGH_TWEET_LINK_RATIO] = has_tweet_links_ratio(userID, tweetsDF,0.9) 257 | 258 | #Stringhini class B 259 | features[NUMBER_OF_TWEETS_SENT] = cache.get_tweets_count(userID,tweetsDF) 260 | #features[TWEET_SIMILARITY] = get_tweet_similarity(userRow,tweetsDF) #comment calculer? 261 | features[URL_RATIO] = get_url_ratio(userID, tweetsDF) 262 | features[UNIQUE_FRIENDS_NAME_RATIO] = get_unique_friends_name_ratio(userID,usersDF,friendsDF) 263 | 264 | #Yang class B (Comment calculer API?) 265 | #features[API_RATIO] = get_api_ratio(userID, tweetsDF) 266 | #features[API_URL_RATIO] = get_api_url_ratio(userRow) 267 | #features[API_TWEET_SIMILARITY] = get_api_tweet_similarity(userRow) 268 | 269 | 270 | #Yang class C 271 | #features[BILINK_RATIO] = get_bilink_ratio(userRow, friendsDF, followersDF) 272 | features[AVERAGE_NEIGHBORS_FOLLOWERS] = get_average_neighbors_followers(userID,friendsDF,usersDF) 273 | features[AVERAGE_NEIGHBORS_TWEETS] = get_average_neighbors_tweets(userID, usersDF,friendsDF, tweetsDF) 274 | #features[FOLLOWINGS_TO_MEDIAN_NEIGHBORS_FOLLOWERS] = get_followings_to_median(userRow) 275 | 276 | return features 277 | 278 | def get_camisani_features(dataframes): 279 | camisaniFeatures = [] 280 | 281 | usersDF = dataframes['users'] 282 | tweetsDF = dataframes['tweets'] 283 | 284 | LIMIT = 10 285 | limit = 1 286 | 287 | for index, row in usersDF.iterrows(): 288 | camisaniFeatures.append(get_single_user_camisani_features(row,tweetsDF)) 289 | 290 | #Temporary code, for test purpose 291 | if(limit > LIMIT): 292 | break 293 | else: 294 | limit = limit +1 295 | 296 | return camisaniFeatures 297 | 298 | def get_single_user_camisani_features(userRow, tweetsDF): 299 | ''' 300 | Class A : has name, has image, has address, has bio, followers >= 30, belongs to a list, 301 | tweets >= 50, URL in profile, 2xfollowers >= friends 302 | 303 | Class B : geo, is favorite, uses punctuation, uses #, uses iphone, uses android, uses foursquare, 304 | uses twitter.com, userId in tweet, retweet >= 1, uses different clients 305 | ''' 306 | 307 | features = {} 308 | 309 | # class A 310 | t0 = time() 311 | t2 = time() 312 | features[HAS_NAME] = has_name(userRow) 313 | print ("class A.1 camisani:", round(time()-t2, 3), "s") 314 | t3 = time() 315 | features[HAS_IMAGE] = has_image(userRow) 316 | print ("class A.2 camisani:", round(time()-t3, 3), "s") 317 | t4 = time() 318 | features[HAS_ADDRESS] = has_address(userRow) 319 | print ("class A.3 camisani:", round(time()-t4, 3), "s") 320 | t5 = time() 321 | features[HAS_BIO] = has_bio(userRow) 322 | print ("class A.4 camisani:", round(time()-t5, 3), "s") 323 | t6 = time() 324 | features[HAS_30_FOLLOWERS] = has_30_followers(userRow) 325 | print ("class A.5 camisani:", round(time()-t6, 3), "s") 326 | t7 = time() 327 | features[BELONGS_TO_A_LIST] = belongs_to_a_list(userRow) 328 | print ("class A.6 camisani:", round(time()-t7, 3), "s") 329 | t8 = time() 330 | features[HAS_50_TWEETS] = has_50_tweets(userRow) 331 | print ("class A.7 camisani:", round(time()-t8, 3), "s") 332 | t9 = time() 333 | features[URL_IN_PROFILE] = url_in_profile(userRow) 334 | print ("class A.8 camisani:", round(time()-t9, 3), "s") 335 | t10 = time() 336 | features[FOLLOWERS_TO_FRIENDS_RATIO_OVER_2] = followers_to_friends_ration_over_2(userRow) 337 | print ("class A.9 camisani:", round(time()-t10, 3), "s") 338 | 
print ("class A camisani:", round(time()-t2, 3), "s") 339 | 340 | # class B 341 | t1 = time() 342 | t10 = time() 343 | features[GEOLOCALIZED] = geolocalized(userRow,tweetsDF) 344 | print ("class B.1 camisani:", round(time()-t10, 3), "s") 345 | t11 = time() 346 | features[IS_FAVORITE] = is_favorite(userRow,tweetsDF) 347 | print ("class B.2 camisani:", round(time()-t11, 3), "s") 348 | t12 = time() 349 | features[USES_PUNCTUATION] = uses_punctuation(userRow,tweetsDF) 350 | print ("class B.3 camisani:", round(time()-t12, 3), "s") 351 | t13 = time() 352 | features[USES_HASHTAG] = uses_hashtag(userRow,tweetsDF) 353 | print ("class B.4 camisani:", round(time()-t13, 3), "s") 354 | t14 = time() 355 | features[USES_IPHONE] = uses_iphone(userRow,tweetsDF) 356 | print ("class B.5 camisani:", round(time()-t14, 3), "s") 357 | t15 = time() 358 | features[USES_ANDROID] = uses_android(userRow,tweetsDF) 359 | print ("class B.6 camisani:", round(time()-t15, 3), "s") 360 | t16 = time() 361 | features[USES_FOURSQUARE] = uses_foursquare(userRow,tweetsDF) 362 | print ("class B.7 camisani:", round(time()-t16, 3), "s") 363 | t17 = time() 364 | features[USES_INSTAGRAM] = uses_instagram(userRow,tweetsDF) 365 | print ("class B.8 camisani:", round(time()-t17, 3), "s") 366 | t18 = time() 367 | features[USES_TWITTERDOTCOM] = uses_twitterdotcom(userRow,tweetsDF) 368 | print ("class B.9 camisani:", round(time()-t18, 3), "s") 369 | t19 = time() 370 | features[USERID_IN_TWEET] = userid_in_tweet(userRow,tweetsDF) 371 | print ("class B.10 camisani:", round(time()-t19, 3), "s") 372 | t20 = time() 373 | features[TWEETS_WITH_URL] = tweets_with_url(userRow,tweetsDF) 374 | print ("class B.11 camisani:", round(time()-t20, 3), "s") 375 | t21 = time() 376 | features[RETWEET_OVER_1] = retweet_over_1(userRow,tweetsDF) 377 | print ("class B.12 camisani:", round(time()-t21, 3), "s") 378 | t22 = time() 379 | features[USES_DIFFERENT_CLIENTS] = uses_different_clients(userRow,tweetsDF) 380 | print ("class B.13 camisani:", round(time()-t22, 3), "s") 381 | print ("class B camisani:", round(time()-t1, 3), "s") 382 | print ("camisani:", round(time()-t0, 3), "s") 383 | return features 384 | 385 | def get_state_of_search_features(dataframes): 386 | stateofsearchFeatures = [] 387 | 388 | usersDF = dataframes['users'] 389 | tweetsDF = dataframes['tweets'] 390 | 391 | LIMIT = 10 392 | limit = 1 393 | 394 | for index, row in usersDF.iterrows(): 395 | stateofsearchFeatures.append(get_single_user_state_of_search_features(row,usersDF,tweetsDF)) 396 | 397 | #Temporary code, for test purpose 398 | if(limit > LIMIT): 399 | break 400 | else: 401 | limit = limit +1 402 | 403 | return stateofsearchFeatures 404 | 405 | 406 | def get_single_user_state_of_search_features(userRow, usersDF, tweetsDF): 407 | ''' 408 | Class A : bot in biography, friends/followers > 100, duplicate profile pictures 409 | 410 | Class B : same sentence to many accounts, tweet from API 411 | ''' 412 | 413 | features = {} 414 | 415 | # class A 416 | t0 = time() 417 | t1 = time() 418 | features[BOT_IN_BIO] = bot_in_bio(userRow) 419 | print ("class A.1 stateofsearch:", round(time()-t1, 3), "s") 420 | t2 = time() 421 | features[FRIENDS_TO_FOLLOWERS_RATIO_IS_100] = friends_to_followers_ratio_is_100(userRow) 422 | print ("class A.2 stateofsearch:", round(time()-t2, 3), "s") 423 | t3 = time() 424 | features[DUPLICATE_PROFILE_PICTURE] = duplicate_profile_picture(userRow,usersDF) 425 | print ("class A.3 stateofsearch:", round(time()-t3, 3), "s") 426 | 427 | # class B 428 | t4 = time() 429 | 
features[DUPLICATE_SENTENCES_ACROSS_TWEETS] = duplicate_sentences_across_tweets(userRow,tweetsDF) 430 | print ("class B.1 stateofsearch:", round(time()-t4, 3), "s") 431 | t5 = time() 432 | features[API_TWEETS] = api_tweets(userRow,tweetsDF) 433 | print ("class B.2 stateofsearch:", round(time()-t5, 3), "s") 434 | print ("stateofsearch:", round(time()-t0, 3), "s") 435 | 436 | return features 437 | 438 | def get_socialbakers_features(dataframes): 439 | socialbakersFeatures = [] 440 | 441 | usersDF = dataframes['users'] 442 | tweetsDF = dataframes['tweets'] 443 | 444 | LIMIT = 100 445 | limit = 1 446 | 447 | for index, row in usersDF.iterrows(): 448 | socialbakersFeatures.append(get_single_user_socialbakers_features(row,tweetsDF)) 449 | 450 | #Temporary code, for test purpose 451 | ''' 452 | if(limit > LIMIT): 453 | break 454 | else: 455 | limit = limit +1 456 | ''' 457 | 458 | return socialbakersFeatures 459 | 460 | def get_single_user_socialbakers_features(userRow, tweetsDF): 461 | ''' 462 | Class A : followers ≥ 50, default image after 2 463 | months, no bio, no location, friends ≥100, 0 tweets 464 | 465 | Class B : tweets spam phrases, same tweet ≥ 3, retweets ≥ 90%, 466 | tweet-links ≥ 90% 467 | ''' 468 | userID = userRow['id'] 469 | 470 | features = {} 471 | 472 | #Class A 473 | 474 | #TODO : ce n'est pas 50 followers, c'est un ratio de 50:1 entre friends et followers 475 | features[HAS_x50_FOLLOWERS] = has_x50_followers(userRow) 476 | features[HAS_DEFAULT_IMAGE] = has_default_image(userRow) 477 | features[HAS_NO_BIO] = has_no_bio(userRow) 478 | features[HAS_NO_LOCATION] = has_no_location(userRow) 479 | features[HAS_100_FRIENDS] = has_100_friends(userRow) 480 | features[HAS_NO_TWEETS] = has_no_tweets(userID, tweetsDF) 481 | 482 | #Class B 483 | features[HAS_DUPLICATE_TWEETS] = has_duplicate_tweets(userID,tweetsDF,3) 484 | features[HIGH_RETWEET_RATIO] = has_retweet_ratio(userID,tweetsDF,0.9) 485 | features[HIGH_TWEET_LINK_RATIO] = has_tweet_links_ratio(userID, tweetsDF,0.9) 486 | 487 | return features 488 | 489 | def get_stringhini_features(dataframes): 490 | ''' 491 | 492 | ''' 493 | stringhiniFeatures = [] 494 | 495 | usersDF = dataframes['users'] 496 | friendsDF = dataframes['friends'] 497 | tweetsDF = dataframes['tweets'] 498 | 499 | LIMIT = 5 500 | limit = 1 501 | 502 | for index, row in usersDF.iterrows(): 503 | #timelog("User "+str(limit)) 504 | stringhiniFeatures.append(get_single_user_stringhini_features(row,usersDF, friendsDF,tweetsDF)) 505 | 506 | #Temporary code, for test purpose 507 | ''' 508 | if(limit > LIMIT): 509 | break 510 | else: 511 | limit = limit +1 512 | ''' 513 | 514 | return stringhiniFeatures 515 | 516 | def get_single_user_stringhini_features(userRow, usersDF,friendsDF, tweetsDF): 517 | ''' 518 | Class A : number of friends, number of friends tweets, friends/(followersˆ2) 519 | 520 | Class B : tweet similarity, URL ratio 521 | ''' 522 | ''' 523 | usersDF.set_index('id', inplace=True) 524 | tweetsDF.set_index('user_id',inplace=True) 525 | friendsDF.set_index('source_id', inplace=True) 526 | ''' 527 | userID = userRow['id'] 528 | 529 | features = {} 530 | 531 | # Class A 532 | timelog("Start stringhini class A") 533 | features[NUMBER_OF_FRIENDS] = get_friends_count(userRow) 534 | features[FRIENDS_TO_FOLLOWERS_RATIO] = get_stringhini_friends_to_followers_ratio(userRow) 535 | timelog("Start class B") 536 | # Class B 537 | features[NUMBER_OF_TWEETS_SENT] = count_user_tweets(userID,tweetsDF) 538 | timelog("count user tweets Ok") 539 | #features[TWEET_SIMILARITY] = 
get_tweet_similarity(userRow,tweetsDF) #comment calculer? 540 | features[URL_RATIO] = get_url_ratio(userID, tweetsDF) 541 | timelog("url ratio Ok") 542 | features[UNIQUE_FRIENDS_NAME_RATIO] = get_unique_friends_name_ratio(userID,usersDF,friendsDF) 543 | timelog("unique friends Ok") 544 | timelog("End class B") 545 | return features 546 | 547 | def get_yang_features(dataframes): 548 | yangFeatures = [] 549 | 550 | usersDF = dataframes['users'] 551 | friendsDF = dataframes['friends'] 552 | followersDF = dataframes['followers'] 553 | tweetsDF = dataframes['tweets'] 554 | 555 | LIMIT = 10 556 | limit = 1 557 | 558 | for index, row in usersDF.iterrows(): 559 | yangFeatures.append(get_single_user_yang_features(row, usersDF,friendsDF,followersDF,tweetsDF)) 560 | 561 | #timelog("User "+str(limit)) 562 | 563 | #Temporary code, for test purpose 564 | ''' 565 | if(limit > LIMIT): 566 | break 567 | else: 568 | limit = limit +1 569 | ''' 570 | 571 | return yangFeatures 572 | 573 | 574 | def get_single_user_yang_features(userRow, usersDF, friendsDF, followersDF,tweetsDF): 575 | ''' 576 | class A : age, following rate 577 | 578 | Class B : API ratio, API URL ratio, API tweet similarity 579 | 580 | Class C: bi-link ratio, average 581 | neighbors’ followers, average 582 | neighbors’ tweets, followings 583 | to median neighbor’s followers 584 | ''' 585 | userID = userRow['id'] 586 | 587 | features = {} 588 | #timelog("Start processing") 589 | # Class A features 590 | features[ACCOUNT_AGE] = get_account_age(userRow) 591 | #features[FOLLOWING_RATE] = get_following_rate(userRow) # What is following rate? 592 | #timelog("\tClass A finished") 593 | # Class B features 594 | #features[API_RATIO] = get_api_ratio(userRow) 595 | #features[API_URL_RATIO] = get_api_url_ratio(userRow) 596 | #features[API_TWEET_SIMILARITY] = get_api_tweet_similarity(userRow) 597 | 598 | # Class C features 599 | #features[BILINK_RATIO] = get_bilink_ratio(userRow, friendsDF, followersDF) 600 | features[AVERAGE_NEIGHBORS_FOLLOWERS] = get_average_neighbors_followers(userID,friendsDF,usersDF) 601 | features[AVERAGE_NEIGHBORS_TWEETS] = get_average_neighbors_tweets(userID, usersDF,friendsDF, tweetsDF) 602 | #features[FOLLOWINGS_TO_MEDIAN_NEIGHBORS_FOLLOWERS] = get_followings_to_median(userRow) 603 | #timelog("\tClass C finished") 604 | return features 605 | 606 | def get_dataframes(featureSetName, datasetDirectory): 607 | ''' 608 | - BaseFilesDirectory is the base directory of the dataset we are using : 609 | E13,FAK,FSF,HUM,INT,TFP,TWT. 610 | - featureSetName is the name of the feature set for which we want to load the dataframes. 611 | 612 | The function returns the proper required dataframes for the featureSetName specified. 
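    Example call (paths are illustrative, see the __main__ block below):
        dataframes = get_dataframes('yang', 'data/E13')
        # -> {'users': DataFrame, 'tweets': DataFrame, 'friends': DataFrame, 'followers': DataFrame}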
613 | ''' 614 | fileNames = {} 615 | dataframes = {} 616 | 617 | if(featureSetName == CAMISANI): 618 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 619 | elif(featureSetName == STATEOFSEARCH): 620 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 621 | 622 | elif(featureSetName == SOCIALBAKERS): 623 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 624 | 625 | elif(featureSetName == STRINGHINI): 626 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv','friends' : 'friends.csv'} 627 | 628 | elif(featureSetName == YANG): 629 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv','friends': 'friends.csv' 630 | ,'followers': 'followers.csv'} 631 | 632 | elif(featureSetName == CLASS_A): 633 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 634 | 635 | elif(featureSetName == CLASS_C): 636 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv','friends': 'friends.csv' 637 | ,'followers': 'followers.csv'} 638 | 639 | # We load the dataframes from the files specified above, in a dataframe dictionary 640 | for key, filename in fileNames.items(): 641 | 642 | totalPath = datasetDirectory + '/'+ filename 643 | #print(totalPath) 644 | timelog("Loading "+ totalPath) 645 | 646 | try: 647 | dataframes[key] = pd.read_csv(totalPath, encoding='latin-1').fillna('') 648 | #dataframes[key] = pd.read_csv(totalPath).fillna('') 649 | except Exception as e: 650 | print("Error while reading file "+totalPath) 651 | print(e) 652 | return dataframes 653 | 654 | ''' 655 | TODO: create functions that retrieve each individual feature, below 656 | ''' 657 | 658 | # Class A features 659 | 660 | def has_name(userRow): 661 | res = (not userRow['name'] == "") 662 | 663 | if(res): 664 | return 1 665 | else: 666 | return 0 667 | 668 | def has_image(userRow): 669 | res = (not userRow['profile_image_url'] == "") 670 | 671 | if(res): 672 | return 1 673 | else: 674 | return 0 675 | 676 | def has_address(userRow): 677 | res = (not userRow['location'] == "") 678 | 679 | if(res): 680 | return 1 681 | else: 682 | return 0 683 | 684 | def has_bio(userRow): 685 | res = (not userRow['description'] == "") 686 | 687 | if(res): 688 | return 1 689 | else: 690 | return 0 691 | 692 | def has_30_followers(userRow): 693 | res = int(userRow['followers_count']) >= 30 694 | 695 | if(res): 696 | return 1 697 | else: 698 | return 0 699 | 700 | def belongs_to_a_list(userRow): 701 | res = int(userRow['listed_count']) > 0 702 | 703 | if(res): 704 | return 1 705 | else: 706 | return 0 707 | 708 | def has_50_tweets(userRow): 709 | res = userRow['statuses_count'] > 50 710 | 711 | if(res): 712 | return 1 713 | else: 714 | return 0 715 | 716 | def url_in_profile(userRow): 717 | res = 0 718 | if isinstance(userRow['description'], str) and has_url(userRow['description']): 719 | res = 1 720 | return res 721 | 722 | def followers_to_friends_ration_over_2(userRow): 723 | friendsCount = int(userRow['friends_count']) 724 | 725 | if(friendsCount == 0): 726 | friendsCount = 0.001 727 | 728 | res = int(userRow['followers_count'])/friendsCount > 2 729 | 730 | if(res): 731 | return 1 732 | else: 733 | return 0 734 | 735 | def bot_in_bio(userRow): 736 | # https://stackoverflow.com/questions/11144389/find-all-upper-lower-and-mixed-case-combinations-of-a-string 737 | bot_list = map(''.join, itertools.product(*((c.upper(), c.lower()) for c in 'bot'))) 738 | 739 | res = 0 740 | #timelog("start") 741 | if isinstance(userRow['description'],str): 742 | for bot_combination in bot_list: 743 | if 
bot_combination in userRow['description']: 744 | res = 1 745 | #timelog("end") 746 | 747 | return res 748 | 749 | def friends_to_followers_ratio_is_100(userRow): 750 | threshold = 100 751 | 752 | res = get_friends_to_followers_ratio(userRow) >= threshold 753 | 754 | if(res): 755 | return 1 756 | else: 757 | return 0 758 | 759 | def duplicate_profile_picture(userRow,usersDF): 760 | ''' 761 | This functions checks if the name of the profile picture link is duplicated. 762 | ''' 763 | clean = lambda x: x.split('/')[-1] 764 | image = userRow['profile_image_url'].split('/')[-1] 765 | img_column = usersDF['profile_image_url'].apply(clean) 766 | 767 | res = not image in img_column.unique() 768 | 769 | if(res): 770 | return 1 771 | else: 772 | return 0 773 | 774 | def get_account_age(userRow): 775 | # Date format : Thu Apr 06 15:24:15 +0000 2017 776 | creation_date = userRow['created_at'] 777 | 778 | account_creation = datetime.strptime(creation_date,'%a %b %d %H:%M:%S %z %Y') 779 | #Two lines below, to prevent error : can't subtract offset-naive and offset-aware datetimes 780 | # Solution from https://stackoverflow.com/questions/796008/cant-subtract-offset-naive-and-offset-aware-datetimes 781 | timezone = account_creation.tzinfo 782 | today = datetime.now(timezone) 783 | 784 | return (today - account_creation).days 785 | 786 | def get_following_rate(userRow): 787 | ''' 788 | following rate: this metric reflects the speed at which an 789 | accounts follows other accounts. Spammers usually feature high 790 | values of this rate. 791 | ''' 792 | return int(userRow['friends_count'])/int(userRow['followers_count']) 793 | 794 | def get_friends_count(userRow): 795 | return int(userRow['friends_count']) 796 | 797 | 798 | def get_friends_tweet_count(userRow,friendsDF,usersDF): 799 | userID = userRow['id'] 800 | 801 | friends_id_list = get_friends_ids(userID,friendsDF) 802 | friends_count = len(friends_id_list) 803 | 804 | tweetUsers = usersDF[usersDF['id'].isin(friends_id_list)] 805 | 806 | userTweetCount = lambda user: count_user_tweets(user['id'],friendsDF) 807 | res = tweetUsers.apply(userTweetCount).sum() 808 | 809 | return res 810 | 811 | 812 | def get_friends_to_followers_ratio(userRow): 813 | res = 0 814 | if int(userRow['followers_count']) != 0: 815 | res = int(userRow['friends_count'])/int(userRow['followers_count']) 816 | return res 817 | 818 | def get_stringhini_friends_to_followers_ratio(userRow): 819 | followers = int(userRow['followers_count']) 820 | 821 | if(followers > 0): 822 | return int(userRow['friends_count'])/(followers*followers) 823 | else: 824 | return int(userRow['friends_count'])/0.01 825 | 826 | def has_x50_followers(userRow): 827 | followers = int(userRow['followers_count']) 828 | friends = int(userRow['friends_count']) 829 | 830 | if(followers ==0): 831 | followers = 0.001 832 | 833 | res = (friends/followers)>= 50 834 | 835 | if(res): 836 | return 1 837 | else: 838 | return 0 839 | 840 | def has_default_image(userRow): 841 | res = userRow['default_profile_image'] 842 | 843 | if(res): 844 | return 1 845 | else: 846 | return 0 847 | 848 | def has_no_bio(userRow): 849 | if(not userRow['description']): 850 | return 1 851 | else: 852 | return 0 853 | 854 | def has_no_location(userRow): 855 | if(not userRow['location']): 856 | return 1 857 | else: 858 | return 0 859 | 860 | def has_100_friends(userRow): 861 | res = get_friends_count(userRow) >= 100 862 | 863 | if(res): 864 | return 1 865 | else: 866 | return 0 867 | 868 | def has_no_tweets(userID, tweetsDF): 869 | res = not 
has_tweets(userID, tweetsDF) 870 | 871 | if(res): 872 | return 1 873 | else: 874 | return 0 875 | 876 | 877 | # Class B features 878 | def geolocalized(userRow,tweetsDF): 879 | tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 880 | geo = tweets['geo'] != "" 881 | res = not tweets[geo].empty 882 | 883 | if(res): 884 | return 1 885 | else: 886 | return 0 887 | 888 | def is_favorite(userRow,tweetsDF): 889 | #tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 890 | tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 891 | fav = tweets['favorite_count'] != 0 892 | return int(not tweets[fav].empty) 893 | 894 | def uses_punctuation(userRow,tweetsDF): 895 | # https://mail.python.org/pipermail/tutor/2001-October/009454.html 896 | bio_and_timeline = "" 897 | if isinstance(userRow['description'], str): 898 | bio_and_timeline += userRow['description'] 899 | bio_and_timeline += get_tweets_strings(int(userRow['id']),tweetsDF) 900 | 901 | res = 0 902 | 903 | for letter in bio_and_timeline: 904 | if letter in string.punctuation: 905 | res = 1 906 | return res 907 | 908 | def uses_hashtag(userRow,tweetsDF): 909 | res = '#' in get_tweets_strings(int(userRow['id']),tweetsDF) 910 | 911 | if(res): 912 | return 1 913 | else: 914 | return 0 915 | 916 | def uses_iphone(userRow,tweetsDF): 917 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 918 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 919 | res = "Iphone" in all_tweets['source'].str.cat() 920 | 921 | if(res): 922 | return 1 923 | else: 924 | return 0 925 | 926 | def uses_android(userRow,tweetsDF): 927 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 928 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 929 | res = "Android" in all_tweets['source'].str.cat() 930 | 931 | if(res): 932 | return 1 933 | else: 934 | return 0 935 | 936 | def uses_foursquare(userRow,tweetsDF): 937 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 938 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 939 | res = "foursquare" in all_tweets['source'].str.cat() 940 | 941 | if(res): 942 | return 1 943 | else: 944 | return 0 945 | 946 | def uses_instagram(userRow,tweetsDF): 947 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 948 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 949 | res = "Instagram" in all_tweets['source'].str.cat() 950 | 951 | if(res): 952 | return 1 953 | else: 954 | return 0 955 | 956 | def uses_twitterdotcom(userRow,tweetsDF): 957 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 958 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 959 | res = "web" in all_tweets['source'].str.cat() 960 | 961 | if(res): 962 | return 1 963 | else: 964 | return 0 965 | 966 | def userid_in_tweet(userRow,tweetsDF): 967 | res = str(userRow['id']) in get_tweets_strings(userRow['id'],tweetsDF) 968 | 969 | if(res): 970 | return 1 971 | else: 972 | return 0 973 | 974 | def tweets_with_url(userRow,tweetsDF): 975 | return has_url(get_tweets_strings(userRow['id'],tweetsDF)) 976 | 977 | def retweet_over_1(userRow,tweetsDF): 978 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 979 | # Any checks if the conditions happens once in the Series. 
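    # e.g. retweet_count values [0, 3, 1] give [False, True, False] for "> 1", so .any() is True and the feature is 1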
980 | return int((all_tweets['retweet_count'] > 1).any()) 981 | 982 | def uses_different_clients(userRow,tweetsDF): 983 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 984 | res = len(all_tweets['source'].unique()) > 1 985 | 986 | if(res): 987 | return 1 988 | else: 989 | return 0 990 | 991 | # https://stackoverflow.com/questions/17388213/find-the-similarity-percent-between-two-strings 992 | def similar(a, b): 993 | return SequenceMatcher(None, a, b).ratio() 994 | 995 | 996 | def duplicate_sentences_across_tweets(userRow,tweetsDF): 997 | res = False 998 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 999 | counter = 0 1000 | for index, tweet in all_tweets.iterrows(): 1001 | other_counter = 0 1002 | for other_index, other_tweet in all_tweets.iterrows(): 1003 | if tweet['text'] == other_tweet['text'] and index != other_index : 1004 | # For more real similarity check 1005 | # if similar(tweet['text'],other_tweet['text']) > 0.7 and index != other_index : 1006 | res = True 1007 | break 1008 | if other_counter == 20: 1009 | break 1010 | other_counter+=1 1011 | if counter == 20: 1012 | break 1013 | counter+=1 1014 | 1015 | return int(res) 1016 | 1017 | def api_tweets(userRow,tweetsDF): 1018 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 1019 | res = not "twitter.com" in all_tweets['source'].str.cat() 1020 | 1021 | if(res): 1022 | return 1 1023 | else: 1024 | return 0 1025 | 1026 | def get_api_ratio(userID, tweetsDF): 1027 | # tweets sent from api over total number of tweets 1028 | tweets = get_tweets_dataframe_user(userID, tweetsDF) 1029 | tweetsCount = len(tweets.tolist()) 1030 | tweets_api = get_api_tweets_count(userID, tweets) 1031 | 1032 | if(userTweets > 0): 1033 | return tweets_api/userTweets 1034 | else: 1035 | return 0 1036 | 1037 | def get_api_url_ratio(userRow): 1038 | return 0 1039 | 1040 | def get_api_tweet_similarity(userRow): 1041 | return 0 1042 | 1043 | def get_tweet_similarity(userID,tweetsDF): 1044 | return 0 1045 | 1046 | def get_url_ratio(userID, tweetsDF): 1047 | '''ratio of tweets with a url''' 1048 | return get_tweets_with_url_ratio(userID,tweetsDF) 1049 | 1050 | def get_unique_friends_name_ratio(userID,usersDF,friendsDF): 1051 | friends_with_name = get_friends_with_initialized_name(userID, usersDF,friendsDF) 1052 | unique_names_count = count_unique_names(friends_with_name) 1053 | 1054 | #avoid division by zero, so we return a big number 1055 | if(unique_names_count == 0): 1056 | unique_names_count = 0.001 1057 | 1058 | return len(friends_with_name)/unique_names_count 1059 | 1060 | def has_duplicate_tweets(userID, tweetsDF,duplicate_threshold): 1061 | tweetsUser = get_tweets_dataframe_user(userID,tweetsDF) 1062 | 1063 | uniqueTweets = tweetsUser['text'].unique() 1064 | 1065 | userTweets = tweetsUser.shape[0] 1066 | uniqueCount = len(uniqueTweets.tolist()) 1067 | 1068 | if(uniqueCount < userTweets): 1069 | return 1 1070 | else: 1071 | return 0 1072 | 1073 | def has_retweet_ratio(userID,tweetsDF, ratio_threshold): 1074 | ''' 1075 | Returns true if the ratio calculated is superior or equal to the ratio_threshold. 1076 | Returns false otherwise. 1077 | ''' 1078 | if(retweet_ratio(userID,tweetsDF)>= ratio_threshold): 1079 | return 1 1080 | else: 1081 | return 0 1082 | 1083 | def has_tweet_links_ratio(userID, tweetsDF, ratio_threshold): 1084 | ''' 1085 | Returns true if the ratio calculated is superior or equal to the ratio_threshold. 1086 | Returns false otherwise. 
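    Example: with ratio_threshold = 0.9, a user whose tweets contain a URL 95% of the
    time gets 1, while a user at 50% gets 0.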
1087 |     '''
1088 | 
1089 |     ratio = get_url_ratio(userID, tweetsDF)
1090 | 
1091 |     res = ratio >= ratio_threshold
1092 | 
1093 |     return int(res)
1094 | 
1095 | 
1096 | # Class C features
1097 | def get_bilink_ratio(userRow, friendsDF, followersDF):
1098 |     userID = userRow['id']
1099 |     friends_count = int(userRow['friends_count'])
1100 |     # A bi-directional link is when two accounts follow each other
1101 |     friends = get_friends_ids(userID,friendsDF)
1102 | 
1103 |     followersSeries = get_follower_ids(userID,followersDF)
1104 | 
1105 |     # Followers that the user follows back
1106 |     bilinkList = followersSeries.isin(friends)
1107 | 
1108 |     bilink_count = bilinkList.sum()
1109 |     #print("===== User ID = "+str(userID))
1110 |     #print("[Bilink count : {}][Official Friends count : {}][Official Followers count : {}], [followers actually found : {} ]".format(bilink_count,friends_count,userRow['followers_count'], len(followersSeries)))
1111 |     if(friends_count == 0):
1112 |         return 0
1113 |     return bilink_count/friends_count
1114 | 
1115 | def get_average_neighbors_followers(userID,friendsDF,usersDF):
1116 |     #Average number of followers of the friends of the user.
1117 |     friends = get_friends_ids(userID, friendsDF)
1118 | 
1119 |     return get_avg_neighbors_followers(friends,usersDF)
1120 | 
1121 | def get_average_neighbors_tweets(userID, userDF, friendsDF, tweetsDF):
1122 |     #Average number of tweets of the friends of the user.
1123 |     friends = get_friends_ids(userID, friendsDF)
1124 |     return get_avg_friends_tweets(friends,tweetsDF)
1125 | 
1126 | 
1127 | def get_followings_to_median(userRow):
1128 |     '''
1129 |     Ratio between the number of friends and the median number of followers of its friends.
1130 |     '''
1131 |     return 0
1132 | 
1133 | def timelog(message):
1134 |     print(datetime.now().strftime('%H:%M:%S')+' '+message)
1135 | '''
1136 | To use (prototype), run from the root directory.
1137 | Command example : python3 src/features.py "data/E13/" yang
1138 | '''
1139 | if(__name__ == "__main__"):
1140 |     directory = sys.argv[1]
1141 |     featureSetName = sys.argv[2]
1142 | 
1143 |     dataframes = get_dataframes(featureSetName, directory)
1144 | 
1145 |     timelog("Dataframes loaded")
1146 |     #print("=========== Getting features ===========")
1147 |     #print(time())
1148 |     t0 = time()
1149 |     features = pd.DataFrame(get_features(featureSetName, dataframes))
1150 |     tf = time() - t0
1151 |     #print("========== Elapsed time ==========")
1152 |     #print(tf)
1153 |     #timelog("Features : ")
1154 |     print(features)
1155 | 
-------------------------------------------------------------------------------- /src/generateBAS.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import features
3 | import sys
4 | import os
5 | from time import time
6 | 
7 | '''
8 | The goal of this module is to generate a list of features based on a
9 | base directory (ex: E13, FAK, etc) and the name of the feature set (class A, class C).
10 | 
11 | The generated list is the baseline feature set and will be created in the directory
12 | specified in parameters.
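Usage sketch (dataset paths are examples):
    python3 src/generateBAS.py data/HUM A human
    python3 src/generateBAS.py data/FAK C bot
The optional third argument labels every row ('human' -> 0, 'bot' -> 1).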
Ex: E13/baseline.csv 13 | ''' 14 | 15 | def get_BAS_dataset(hum_path,fak_path): 16 | filename_A = "features_A.csv" 17 | filename_C = "features_C.csv" 18 | 19 | hum_df_A = pd.read_csv(hum_path+filename_A, encoding='latin-1',sep='\t') 20 | fak_df_A = pd.read_csv(fak_path+filename_A, encoding='latin-1',sep='\t') 21 | hum_df_C = pd.read_csv(hum_path+filename_C, encoding='latin-1',sep='\t') 22 | fak_df_C = pd.read_csv(fak_path+filename_C, encoding='latin-1',sep='\t') 23 | 24 | frames_A = [hum_df_A,fak_df_A] 25 | frames_C = [hum_df_C,fak_df_C] 26 | 27 | df_A = pd.concat(frames_A) 28 | df_C = pd.concat(frames_C) 29 | 30 | # randomize features, before cross_validation 31 | df_A = df_A.sample(frac=1).reset_index(drop=True) 32 | df_C = df_C.sample(frac=1).reset_index(drop=True) 33 | 34 | labels_A = df_A['label'].tolist() 35 | labels_C = df_C['label'].tolist() 36 | return labels_A,labels_C,df_A,df_C 37 | 38 | 39 | def save_features_in_file(path, featuresDF, featureSetName): 40 | 41 | filename = path + "/features_"+featureSetName+".csv" 42 | print("Current directory :"+os.getcwd()) 43 | print("Filename : "+filename) 44 | featuresDF.to_csv(filename,sep='\t') 45 | 46 | def labelize_features(featuresDF, label): 47 | labelVal = 0 48 | 49 | if(label == 'human'): 50 | labelVal = 0 51 | elif(label == 'bot'): 52 | labelVal = 1 53 | 54 | featuresDF['label'] = labelVal 55 | 56 | 57 | if(__name__ == "__main__"): 58 | # Command example. Load class A : python3 src/main.py data/E13 A 59 | # Command example. Load class C : python3 src/main.py data/E13 C 60 | 61 | #Get the dataset name (E13, FAK, FSF,HUM,etc) 62 | argsLen = len(sys.argv) 63 | 64 | datasetDirectory = sys.argv[1] 65 | featureSetName = sys.argv[2] 66 | 67 | 68 | 69 | dataframes = features.get_dataframes(featureSetName,datasetDirectory) 70 | 71 | print("=========== Getting features ===========") 72 | print(time()) 73 | t0 = time() 74 | 75 | featuresData = features.get_features(featureSetName, dataframes) 76 | 77 | featuresData = pd.DataFrame(featuresData) 78 | 79 | if(argsLen > 3): 80 | label = sys.argv[3] 81 | labelize_features(featuresData,label) 82 | 83 | tf = time() - t0 84 | print("========== Elapsed time ==========") 85 | print(tf) 86 | save_features_in_file(datasetDirectory, featuresData,featureSetName) 87 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | from classifiers import * 2 | from features import * 3 | from metrics import * 4 | import sys 5 | from generateBAS import * 6 | 7 | def main(): 8 | 9 | 10 | # data frame with our dataset 11 | labels_A, labels_C, df_features_class_A, df_features_class_C = get_BAS_dataset("data/HUM/","data/FAK/") 12 | # preprocessing of the data to capture the features we want 13 | #df_features_class_A = get_class_A_features(df) 14 | #df_features_class_C = get_class_C_features(df) 15 | 16 | # classifiers (training + prediction) 17 | class_A_classifiers_predictions = classify(df_features_class_A,labels_A) 18 | class_C_classifiers_predictions = classify(df_features_class_C,labels_C) 19 | 20 | # Metrics on the result predictions from the classifiers 21 | results_class_A_classifiers = metrics(labels_A, class_A_classifiers_predictions) 22 | results_class_C_classifiers = metrics(labels_C, class_C_classifiers_predictions) 23 | 24 | # publish and save results 25 | print("Metrics of the Class A classifiers(profile)") 26 | for key,value in results_class_A_classifiers.items(): 27 | 
print("Algorithm: "+key) 28 | print("Accuracy: "+ str(round(value[0],3))+ " Precision: "+str(round(value[1],3))) 29 | print("Recall: "+ str(round(value[2],3))+ " F-M: "+str(round(value[3],3))) 30 | print("MCC: "+ str(round(value[4],3))+ " AUC: "+str(round(value[5],3))) 31 | print() 32 | #publish(results_class_A_classifiers) 33 | print("===================== oOo ========================") 34 | print("Metrics of the Class C classifiers(all features)") 35 | for key,value in results_class_C_classifiers.items(): 36 | print("Algorithm: "+key) 37 | print("Accuracy: "+ str(round(value[0],3))+ " Precision: "+str(round(value[1],3))) 38 | print("Recall: "+ str(round(value[2],3))+ " F-M: "+str(round(value[3],3))) 39 | print("MCC: "+ str(round(value[4],3))+ " AUC: "+str(round(value[5],3))) 40 | print() 41 | #publish(results_class_C_classifiers) 42 | 43 | 44 | if(__name__ == "__main__"): 45 | # Command example. Load class A : python3 src/main.py data/E13 A 46 | # Command example. Load class C : python3 src/main.py data/E13 C 47 | 48 | #Get the dataset name (E13, FAK, FSF,HUM,etc) 49 | #baseDataset = sys.argv[1] 50 | #featureSetName = sys.argv[2] 51 | 52 | #dataframes = features.get_dataframes(baseDataset_A, featureSetName) 53 | 54 | #main(baseDataset) 55 | main() 56 | 57 | -------------------------------------------------------------------------------- /src/metrics.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import confusion_matrix 2 | from sklearn.metrics import accuracy_score 3 | from sklearn.metrics import precision_score 4 | from sklearn.metrics import recall_score 5 | from sklearn.metrics import f1_score 6 | from sklearn.metrics import matthews_corrcoef 7 | from sklearn.metrics import roc_auc_score 8 | def metrics(labels,pred_dict): 9 | 10 | results_dict = {} 11 | # Confusion matrix 12 | # labels : array, shape = [n_samples] 13 | # Ground truth (correct) target values. 14 | # pred : array, shape = [n_samples] 15 | # Estimated targets as returned by a classifier. 16 | 17 | for key,pred in pred_dict.items(): 18 | #try: 19 | #tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel() 20 | #np.set_printoptions(precision=2) 21 | try: 22 | # Accuracy 23 | acc = accuracy_score(labels, pred) 24 | except ValueError: 25 | acc = 0 26 | try: 27 | # Precision 28 | pres = precision_score(labels, pred, average='binary') 29 | except ValueError: 30 | pres = 0 31 | try: 32 | # Recall 33 | rec = recall_score(labels, pred, average='binary') 34 | except ValueError: 35 | rec = 0 36 | try: 37 | # F-measure 38 | f1 = f1_score(labels, pred, average='binary') 39 | except ValueError: 40 | f1 = 0 41 | try: 42 | # MCC 43 | mcc = matthews_corrcoef(labels, pred) 44 | except ValueError: 45 | mcc = 0 46 | try: 47 | # AUC 48 | auc_score = roc_auc_score(labels, pred) 49 | except ValueError: 50 | auc_score = 0 51 | # 52 | # Coud be this one must be rechecked with paper 53 | # from sklearn.metrics import roc_auc_score 54 | # roc_auc_score(labels, y_scores) 55 | results_dict[key] = [acc,pres,rec,f1,mcc,auc_score] 56 | #except ValueError: 57 | # pass 58 | return results_dict -------------------------------------------------------------------------------- /src/tweets.py: -------------------------------------------------------------------------------- 1 | from url_finder import * 2 | import pandas as pd 3 | import numpy as np 4 | from time import * 5 | ''' 6 | In this module, we process tweets, iterate and extract data from them. 
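The helpers below expect a pandas DataFrame loaded from tweets.csv with at least
the columns 'user_id', 'text' and 'retweet_count' (see get_dataframes in features.py).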
7 | '''
8 | def has_tweets(userID,tweetsDF):
9 |     # True if the user has at least one tweet in the dataset
10 |     return not tweetsDF[tweetsDF['user_id'] == userID].empty
11 | 
12 | def get_tweets_count(userID,tweetsDF):
13 |     '''
14 |     Counts the number of tweets posted by a given user.
15 |     '''
16 |     userTweets = get_tweets_user(userID,tweetsDF)
17 | 
18 |     return len(userTweets)
19 | 
20 | def get_tweets_user(userID,tweetsDF):
21 |     '''
22 |     This function returns the list of tweets belonging to a particular userID
23 |     '''
24 |     return tweetsDF['text'][tweetsDF['user_id'] == userID].tolist()
25 | 
26 | def get_tweets_dataframe_user(userID,tweetsDF):
27 |     '''
28 |     This function returns the tweets belonging to a particular userID
29 |     '''
30 |     #t0 = time()
31 |     user_id = tweetsDF['user_id'] == userID
32 |     res = tweetsDF[user_id]
33 |     #print ("tweet user:", round(time()-t0, 3), "s")
34 |     return res
35 | 
36 | def count_user_tweets(userID, tweetsDF):
37 |     return len(get_tweets_user(userID,tweetsDF))
38 | 
39 | 
40 | def get_avg_friends_tweets(friendsIDlist,tweetsDF):
41 |     friends_count = len(friendsIDlist)
42 |     total_tweet_count = 0
43 |     friendsIDlist = pd.Series(friendsIDlist)
44 | 
45 |     total_tweet_count = tweetsDF[tweetsDF['user_id'].isin(friendsIDlist)].shape[0]
46 |     if friends_count == 0:
47 |         friends_count = 0.001
48 |     return total_tweet_count/friends_count
49 | 
50 | 
51 | def get_tweets_strings(userID,tweetsDF):
52 |     '''
53 |     Searches for a user's tweets and concatenates all the text from
54 |     the tweets into one string
55 |     '''
56 |     tweets_user = get_tweets_dataframe_user(userID,tweetsDF)
57 |     res = tweets_user['text'].str.cat()
58 |     return res
59 | 
60 | 
61 | def get_tweets_with_url_ratio(userID, tweetsDF):
62 |     '''
63 |     Ratio of the user's tweets that contain at least one URL.
64 |     '''
65 |     user_tweets = get_tweets_dataframe_user(userID, tweetsDF)
66 |     user_tweets_count = user_tweets.shape[0]
67 | 
68 |     if(user_tweets_count == 0):
69 |         return 0
70 | 
71 |     tweets_with_url = user_tweets['text'].apply(lambda tweet: 1 if tweet_contains_url(tweet) else 0).sum()
72 | 
73 |     return float(tweets_with_url)/float(user_tweets_count)
74 | 
75 | 
76 | def tweet_contains_url(tweet):
77 |     return has_url(tweet)
78 | 
79 | def get_api_tweets_count(userID, tweetsDF):
80 |     '''
81 |     Counts the tweets of a given user whose text mentions the API.
82 |     '''
83 |     user_tweets = get_tweets_dataframe_user(userID, tweetsDF)
84 |     api_tweets = user_tweets[user_tweets['text'].apply(lambda tweet: tweet.count("API") > 0)]
85 | 
86 |     return api_tweets.shape[0]
87 | 
88 | def retweet_ratio(userID, tweetsDF):
89 |     tweets = tweetsDF[tweetsDF['user_id'] == userID]
90 | 
91 |     tweetsCount = tweets.shape[0]
92 | 
93 |     retweets = tweets[tweets['retweet_count'] > 0].shape[0]
94 | 
95 |     if(tweetsCount == 0):
96 |         #print("-----------------Retweet 0")
97 |         return 0
98 | 
99 |     else:
100 |         res = retweets/tweetsCount
101 |         #print("-----------------Retweet : "+str(res))
102 |         return res
-------------------------------------------------------------------------------- /src/url_finder.py: --------------------------------------------------------------------------------
1 | '''
2 | This script handles the search of a URL in a given text.
3 | It can also return the urls in a given text 4 | ''' 5 | import re 6 | 7 | #Taken from : https://gist.github.com/gruber/8891611 (gruber/Liberal Regex Pattern for Web URLs) 8 | #URL_REGEX = re.compile('''(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?