├── .gitignore ├── README.md ├── img ├── CC_predictions.png ├── CC_rule_set.png ├── FFC_rule_set.png ├── SOS_rule_set.png ├── class_A_classifier_evaluation.png ├── class_A_features.png ├── classifier_A_vs_C.png ├── collected_data_stats.png ├── confusion_matrix.png ├── crawling_cost.png ├── evasion_features.png ├── feature_test.png ├── features_classifiers_evaluation.png ├── features_cost.png ├── features_evaluation.png ├── features_random_forest.png ├── reduced_overfitting.png └── rules_evaluation.png ├── report └── report.tex ├── research ├── efficient_detection_fake_twitter_followers.pdf ├── project_description.pdf ├── stringhini_Detecting spammer on social network.pdf └── stringhini_Detecting spammer on social network_paper.pdf └── src ├── cachedata.py ├── classifiers.py ├── csv_processor.py ├── features.py ├── generateBAS.py ├── main.py ├── metrics.py ├── tweets.py ├── url_finder.py └── users.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | 103 | # data 104 | data/* 105 | 106 | # latex 107 | report/*.aux 108 | report/*.pdf 109 | report/*.synctex.gz 110 | report/*.toc -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BigDataProject3 2 | Automatic detection of fake twitter followers 3 | 4 | # Research paper summary 5 | 6 | ## The baseline datasets 7 | - 9M accounts 8 | - 3M tweets 9 | 10 | ### Fake project 11 | Twitter account created to be followed by real accounts to collect data. 12 | Referred as TFP 13 | 14 | ### #Elezioni2013 dataset 15 | Named E13, data mining of the twitter accounts involved in the #elezioni2013 and after discarding officially involved accounts and sampling the lefts ones, they manually checked the remaining accounts(1488). 16 | From this work resulted 1481 human accounts labeled. 
17 | 18 | ### Baseline dataset of human accounts 19 | So TFP and E13 form the starting set of human accounts, "HUM". 20 | 21 | ### Baseline dataset of fake followers 22 | About 3000 fake accounts were bought: 23 | 1169 FSF (fast followers) 24 | 1337 INT (intertwitter) 25 | 845 TWT (1000 bought from twittertechnology, but 155 were banned right away) 26 | 27 | This dataset is clearly illustrative and not exhaustive of all possible fake accounts. 28 | 29 | ![alt text](./img/collected_data_stats.png) 30 | 31 | ### Baseline dataset 32 | Studies have shown that the class distribution of a classification dataset can affect the classification. 33 | 34 | Twitter has stated that spam/fake accounts should amount to less than 5% of MAU (monthly active users). This is not applicable to our problem, because MAU cannot be assimilated to our dataset, and an account that buys fake followers will have an abnormal distribution of fake/real accounts. 35 | --> the <5% figure can't be transferred to the fake followers of a single account. 36 | 37 | They decided to go for a balanced distribution, after training the classifier with proportions ranging from 5%-95% (100 HUM - 1900 FAK) to 95%-5% (1900 HUM - 100 FAK) and comparing the resulting accuracy with cross-validation. 38 | 39 | 40 | To obtain a balanced dataset, we randomly undersampled the total set of fake accounts (i.e., 3351) to match the size of the HUM dataset of verified human accounts. Thus, we built a baseline dataset of 1950 fake followers, labeled FAK. The final baseline dataset for this work includes both the HUM dataset and the FAK dataset for a total of 3900 Twitter accounts. This balanced dataset is labeled BAS in the remainder of the paper and has been exploited for all the experiments described in this work (where not otherwise specified). Table 1 shows the number of accounts, tweets and relationships contained in the datasets described in this section. 41 | 42 | ## Classifiers used for fake detection 43 | They assessed the effectiveness of 3 proposed procedures by trying them on their dataset. Depending on their effectiveness, they will later be used as features to fit the classifiers. 44 | 45 | ### Followers of political candidates 46 | Tested on the followers of Obama, Romney and Italian politicians. The algorithm is based on public features of the accounts: it assigns human and bot scores and classifies an account by considering the gap between the two summed scores. The algorithm assigns a human point for each feature in the "feature table" that the account satisfies. 47 | ![alt text](./img/CC_rule_set.png) 48 | On the other hand, an account receives a bot point for each feature it does not meet, and 2 points if it only tweets via the API. 49 | (the specifics of each feature can be read in the paper) 50 | 51 | ### Stateofsearch.com 52 | This website proposed the following rule set: 53 | ![alt text](./img/SOS_rule_set.png) 54 | 55 | This rule set doesn't focus on the account itself but on the tweets it emits. The rules looking for similarities are applied over the whole dataset. 56 | Important: because temporal data isn't available and because of Twitter's API limitations, rules 6 & 7 were not applied. 57 | 58 | ### Socialbakers’ FakeFollowerCheck 59 | A fakeness classification tool based on 8 criteria: 60 | ![alt text](./img/FFC_rule_set.png) 61 | 62 | ### Evaluation methodology 63 | The 3 methods were tested on our human dataset and on the fake followers. 
We used the confusion matrix as the standard indication of accuracy: 64 | REMINDER: 65 | - True Positive (TP): the number of fake followers recognized by the rule as fake followers; 66 | - True Negative (TN): the number of human followers recognized by the rule as human followers; 67 | - False Positive (FP): the number of human followers recognized by the rule as fake followers; 68 | - False Negative (FN): the number of fake followers recognized by the rule as human followers. 69 | ![alt text](./img/confusion_matrix.png) 70 | 71 | Using the following metrics: 72 | - Accuracy: the proportion of predicted true results (both true positives and true negatives) in the population, that is $$\frac{TP+TN}{TP+TN+FP+FN}$$ 73 | - Precision: the proportion of predicted positive cases that are indeed real positives, that is $$\frac{TP}{TP+FP}$$ 74 | - Recall (also called Sensitivity): the proportion of real positive cases that are indeed predicted positive, that is $$\frac{TP}{TP+FN}$$ 75 | - F-Measure: the harmonic mean of precision and recall, namely $$\frac{2\cdot precision\cdot recall}{precision+recall}$$ 76 | - Matthews Correlation Coefficient (MCC from now on) [37]: an estimator of the correlation between the predicted class and the real class of the samples, defined as: 77 | $$\frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$$ 78 | 79 | In addition, they also measured two other metrics: 80 | - Information Gain (Igain): the information gain considers a more general dependence, leveraging probability densities. It is a measure of the informativeness of a feature with respect to the predicted class. 81 | - Pearson Correlation Coefficient (PCC): the Pearson correlation coefficient can detect linear dependencies between a feature and the target class. It is a measure of the strength of the linear relationship between two random variables X and Y. 82 | (A small Python sketch of these confusion-matrix metrics is given just before the Discussion of the results section.) 83 | 84 | 85 | ### Evaluation of the CC algorithm 86 | ![alt text](./img/CC_predictions.png) 87 | Not very good at detecting bots, but it does a decent job with humans. 88 | ### Individual rules evaluation 89 | Here they analyzed the effectiveness of each individual rule. 90 | ![alt text](./img/rules_evaluation.png) 91 | 92 | ## Fake detection based on features 93 | Classification using 2 sets of features extracted from spam accounts. 94 | Important: the features were extracted from spammers but are used here for fake followers. 95 | To extract these features, they used classifiers producing glass-box (white-box) and black-box models. 96 | ### Spammer detection in social networks 97 | Use of Random Forest, which yields a classification but also a ranking of features: 98 | ![alt text](./img/features_random_forest.png) 99 | 100 | ### Evolving Twitter spammer detection 101 | Since spammers change their behavior to avoid detection, here is a set of features that still detects them even when they use evasion techniques: 102 | ![alt text](./img/evasion_features.png) 103 | 104 | ### Evaluation of these features 105 | Evaluation of the single features: 106 | ![alt text](./img/features_evaluation.png) 107 | 108 | Evaluation of the features used with classifiers: 109 | ![alt text](./img/features_classifiers_evaluation.png) 110 | The results are very good: the classification accuracy is very high for all the classifiers. 111 | The feature-based classifiers are far more accurate than the CC algorithm at predicting and detecting fake followers. 
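For reference, here is a minimal Python sketch (with illustrative names, not code from the paper) of the evaluation metrics defined above, computed directly from the confusion-matrix counts:

```python
import math

def evaluation_metrics(tp, tn, fp, fn):
    # tp, tn, fp, fn: confusion-matrix counts as defined above
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)),
    }
```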
112 | 113 | ### Discussion of the results 114 | By analysing the classifiers we extracted the most effective features: 115 | - for Decision Trees, the most effective features are the ones close to the root; 116 | - Decorate, AdaBoost and Random Forest are based on Decision Trees, but they are compositions of many trees and are therefore harder to analyse. 117 | 118 | #### Differences between fake followers and spammers 119 | The URL ratio is higher for fake followers (72%) and only 14% for humans. 120 | The API ratio is higher for spammers than for humans. For 78% of fake followers it is lower than 0.0001. 121 | The average neighbors' tweets feature is lower for spammers than for fake followers. 122 | 123 | Fake followers appear to be more passive than spammers, and they do not make use of automated mechanisms. 124 | 125 | #### Overfitting 126 | A usual problem in classification is a model that fits the training dataset too closely and does not generalize to new data. To avoid overfitting, the best idea is to keep the classifier simple. 127 | For decision tree algorithms, reducing the number of nodes and the height of the tree helps. 128 | For trees, a common practice is to adopt an aggressive pruning strategy -> using reduced-error pruning with small test sets. This results in simpler trees (fewer features) while maintaining good performance. 129 | ![alt text](./img/reduced_overfitting.png) 130 | 131 | #### Bidirectional link ratio 132 | This is the feature with the highest information gain. To test its importance in telling fake accounts from humans, we retrain the classifiers excluding this feature and compare them with the classifiers trained with the bidirectional link ratio feature. 133 | 134 | From the previous table we realize that the feature isn't essential, but it is highly effective. 135 | 136 | ## An efficient and lightweight classifier 137 | ![alt text](./img/feature_test.png) 138 | As we can see, classifiers based on feature sets perform better than rule sets. To further improve the classifiers, we analyse their cost. 139 | 140 | ### Crawling cost 141 | We divide the data to crawl into 3 categories: 142 | - profile (Class A) 143 | - timeline (Class B) 144 | - relationship (Class C) 145 | 146 | These categories are directly related to the amount of data that needs to be downloaded for a category of features. We compare the amount of data to be downloaded for each category (best and worst case scenarios -> the best case is 1 API call and the worst case is the biggest account possible). 147 | We also take into account the maximum number of calls allowed by the Twitter API, which defines our maximum call threshold. 148 | Parameters of the table: 149 | - $f$ : number of followers of the target account; 150 | - $t_i$ : number of tweets of the $i$-th follower of the target account; 151 | - $\phi_i$ : number of friends of the $i$-th follower of the target account; 152 | - $f_i$ : number of followers of the $i$-th follower of the target account. 153 | ![alt text](./img/crawling_cost.png) 154 | 155 | Important: an API call downloads all the information available for the account, so the data for the different classes is obtained concurrently. 156 | 157 | ### Class A classifier 158 | 159 | ![alt text](./img/classifier_A_vs_C.png) 160 | 161 | A classifier belongs to the class of its most expensive feature. 162 | Here we examine the results obtained with the cheapest classifiers, those working only with Class A features. 163 | 164 | The classifiers are tested on 2 feature sets: the Class A features and all the features. 
165 | We can see some classifiers improving and others slightly dropping in performance. 166 | 167 | ### Validation of the Class A classifier 168 | Two experiments: 169 | - Using our baseline dataset as training 170 | - Using Obama's followers as training 171 | 172 | ![alt text](./img/class_A_classifier_evaluation.png) 173 | 174 | For each of these experiments we tested the classifiers with these testing datasets: 175 | - human accounts 176 | - 1401 fake followers not included in the BAS dataset 177 | 178 | For this validation we can see notable differences between the approaches. 179 | We can also see the the random sample was more correctly labeled then the Obama's sample meaning the Obama's dataset introduces previously unknown features from the training sets. 180 | 181 | ### Assessing Class A features 182 | 183 | To assess the importance of the features used in Class A features we used an information fusion-based sensitivity analysis. 184 | Information fusion is a technique aimed at leveraging the predictive power of several different models in order to achieve a combined prediction accuracy which is better than the predictions of the single models. 185 | Sensitivity analysis, instead, aims at assessing the relative importance of the 186 | different features used to build a classification model. 187 | 188 | By combining them we can estimate the importance of certain features used in different classifiers with a common classification task. 189 | 190 | To do so we have to retrain the classifiers of the 8 class A classifiers with our baseline dataset and remove one feature at a time. 191 | Each of the trained classifiers is then tested with our test dataset. 192 | -------------------------------------------------------------------------------- /img/CC_predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/CC_predictions.png -------------------------------------------------------------------------------- /img/CC_rule_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/CC_rule_set.png -------------------------------------------------------------------------------- /img/FFC_rule_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/FFC_rule_set.png -------------------------------------------------------------------------------- /img/SOS_rule_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/SOS_rule_set.png -------------------------------------------------------------------------------- /img/class_A_classifier_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/class_A_classifier_evaluation.png -------------------------------------------------------------------------------- /img/class_A_features.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/class_A_features.png -------------------------------------------------------------------------------- /img/classifier_A_vs_C.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/classifier_A_vs_C.png -------------------------------------------------------------------------------- /img/collected_data_stats.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/collected_data_stats.png -------------------------------------------------------------------------------- /img/confusion_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/confusion_matrix.png -------------------------------------------------------------------------------- /img/crawling_cost.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/crawling_cost.png -------------------------------------------------------------------------------- /img/evasion_features.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/evasion_features.png -------------------------------------------------------------------------------- /img/feature_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/feature_test.png -------------------------------------------------------------------------------- /img/features_classifiers_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_classifiers_evaluation.png -------------------------------------------------------------------------------- /img/features_cost.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_cost.png -------------------------------------------------------------------------------- /img/features_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_evaluation.png -------------------------------------------------------------------------------- /img/features_random_forest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/features_random_forest.png -------------------------------------------------------------------------------- /img/reduced_overfitting.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/reduced_overfitting.png -------------------------------------------------------------------------------- /img/rules_evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/img/rules_evaluation.png -------------------------------------------------------------------------------- /report/report.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt]{article} 2 | \usepackage[utf8]{inputenc} 3 | \usepackage{listings} 4 | \usepackage{graphicx} 5 | \usepackage{fontenc} 6 | \usepackage{listings} 7 | \usepackage{multicol} 8 | \usepackage{verbatim} 9 | 10 | \title{UNamur\\ 11 | ICYBM201 Big Data and Computer Security : Fame for sale, efficient detection of fake Twitter 12 | followers} 13 | 14 | \author{TIO NOGUERAS Gérard, NYAKI Loïc} 15 | 16 | \begin{document} 17 | \maketitle 18 | 19 | \newpage 20 | \tableofcontents 21 | \newpage 22 | 23 | \section{Introduction} 24 | 25 | \paragraph{} 26 | The objective of this project is to partially reproduce the results presented in a 2015 paper titled \textit{Fame for sale: effective detection of fake twitter followers}, by Cresci \textit{et al.}\\ 27 | 28 | In the paper, various classification rules and features proposed by academics, \textit{technology bloggers} and companies specialized in fake twitter users detection are enumerated and assessed. Rule-based classification is shown to perform more poorly than feature-based classification. Therefore the authors decided to abandon rule-based classification in favor of the 19 features responsible for the most information gain while having the smallest calculation cost.\\ 29 | 30 | The resulting classifier trained on that 19-features set shows an accuracy of up to 97.5\% on randomly sampled account, while being solely based on data readily available on the profile of the twitter users, which ensures that the data processing is fast and lightweight. 31 | 32 | \subsection{Objectives} 33 | In this project, we focus on reproducing the final results of Cresci \textit{et al.}, by implementing various classifiers based on the feature set presented in that paper.\\ 34 | 35 | First we describe the 19 features that will be part of the feature set. Then, we introduce the datasets that have been provided, and which contain \textit{human} or \textit{bot} data. Later we describe how the features will be extracted from these datasets as well as our planned procedure for training classifier. Each type of classifier is then described along with their parameters, and results are presented for each classifier. Finally, we summarize the results argue on the difference between the results presented in Cresci \textit{et al.}, and the results we managed to obtain. 36 | 37 | 38 | \section{Rule sets and Features set} 39 | In the paper, five rule sets and features set are analyzed. They come from academic research, as well as technology bloggers and companies specialized in fake tweeter users detection.\\ 40 | 41 | To avoid simply using all the features mentionned in other papers without any preprocessing, the authors decided to measure each feature individually and asses it's effectiveness thanks to machine learning metrics. 
After doing, they removed the features they considered not to have a sufficient score and removed the features they considered impossible to obtain. (ie. time related features that would involve monitoring of thousands of accounts.) 42 | \\\\ 43 | We have decided that we will present the features that they kept and that we will se to train our classifiers.\\ 44 | We have devided the features by paper and in each section we have separated the features related to the profile(class A), the timeline(class B) and relationships (class C). 45 | 46 | \subsection{Camisani-Calzolari rule set} 47 | In this paper the features had as objective to focus on human accounts, but they can still be used to detect bots. \\ 48 | We start with the class A section, for example a couple of straightforward ones: has name, has image, has address, has biography, belongs to a list. 49 | Pretty simple most human users will fill this type of information.\\ 50 | The next ones assess that the account has been active and tries to find attributes that will correlate with humans: followers $\geq$ 30, tweets $\geq$ 50, URL 51 | in profile, 2 $*$ followers $\geq$ friends.\\ 52 | For the class B of this paper where we try to find information in the tweets of the user that can be relatable to humans. Some patterns that have been noticed that humans follow: geo-localized, is favorite, uses punctuation, uses hashtag. Another pattern that they noticed is that humans don't use the twitter api that much but rather a lot of different items available to us: uses iPhone, uses Android, uses Foursquare, uses Instagram, uses Twitter.com and using multiple clients. This last ones are some additional ones found to be done by humans: userID in tweet, 53 | tweets with URLs and retweet $\geq$ 1. 54 | 55 | The rule set contained the following classes of rules: 56 | 57 | \begin{itemize} 58 | \item Camisani-Calzolari: The Camisani-Calzolari rule set is described as follows: 22 class A rules 59 | \item State of search:They propose 7 rules, with 5 of class A, and 2 of class B 60 | \item SocialBakers: The rule set of social baker is composed of 9 rules, with 6 of class A and 3 of class B. 61 | \item Stringhini et al.: This paper proposes 5 features with 3 class A features, and 2 class B features 62 | \item Yang et al.:This paper proposes 9 features. Two of class A, 3 of class B, and 4 of class C. 63 | \end{itemize} 64 | 65 | \subsection{State of search} 66 | This ones are really general ones for bot detection by state of search website. 67 | For the class A: bot in biography, friends/followers == 100 and duplicate profile pictures.\\ 68 | For class B: same sentence to many accounts, tweet from API. 69 | This features are actually extracted from spammer bots. 70 | 71 | \subsection{Socialbakers} 72 | We start finding a trend with some of the ratios for our class A: friends/followers $\geq$ 50, friends $\geq$ 100. Followed by some features resembling the ones in Camisani: default image after 2 months, no bio, no location, 0 tweets.\\ 73 | For the Class B, we're only looking at ratios and tweets' similarities: tweets spam phrases, same tweet$\geq$ 3, retweets $\geq$ 90\%, tweet-links $\geq$ 90\%. 74 | \subsection{Stringhini et al.} 75 | The last 2 papers already have really strong features that we are going to use and see if we can improve the score they obtained by adding extra features. 
76 | Here you can see some of the best ones for class A: number of friends, number of tweets, $friends/(followers^2)$.\\ 77 | For class B we only have a couple: tweet similarity and URL ratio. 78 | This paper also tracked spam bots, whose main characteristics are listed below. 79 | 80 | \begin{itemize} 81 | \item spambots do not have thousands of friends; 82 | \item spambots have sent less than 20 tweets; 83 | \item the content of the spambots' tweets exhibits "message similarity"; 84 | \item spambots have a high ratio of tweets containing URLs; 85 | \item spambots have a high ratio between the total number of tweets from friends and the square of their total followers (a lower ratio means a legitimate user). 86 | \end{itemize} 87 | 88 | 89 | \subsection{Yang et al.} 90 | This last paper focuses on the API for its class B features: API ratio, API URL ratio, API tweet similarity. It also has some interesting class A features: age and following rate. The following rate is obvious: not many people follow bots. The age is smarter: spam bots tend to get banned or to be used only once, therefore their age will usually be lower than that of real accounts. 91 | 92 | \subsection{Rules that were not implemented} 93 | The following features were not implemented, as knowing whether a tweet came from an API was far from obvious: \\ 94 | 95 | \begin{itemize} 96 | \item{get\_api\_url\_ratio(): returns the ratio between the number of tweets containing a URL and the number of tweets sent from an API.} 97 | \item{get\_api\_tweet\_similarity(): supposedly returns a value representing the similarity between tweets sent from an API.} 98 | \end{itemize} 99 | 100 | We tried using the \textit{grep} command with the keyword "API" on the tweets.csv files but only detected a few results that seemed unrelated. 101 | 102 | \section{Data extraction} 103 | To generate the final feature set, we extracted the class A features (data easily obtained through the user profile) as well as the class C features (all the features, ranging from the easiest to obtain to the hardest ones, requiring many computations).\\ 104 | 105 | The result was a feature dataset ready to be used for training classifiers. 106 | 107 | \subsection{Available data} 108 | The researchers responsible for the paper Fame for sale made their 5 basic datasets available for future researchers.\\ 109 | 110 | Since we are trying to replicate their work to practice our skills and to understand the data and the proper way to analyse it more deeply, we follow their BAS dataset creation procedure. 111 | 112 | \subsection{Table 1 creation} 113 | In this section we create the base dataset that we will use throughout the project. 114 | From the 5 available datasets, we are going to create 1 final dataset: 115 | the BAS dataset, constituted of 1950 human twitter accounts and 1950 fake accounts.\\ The human accounts are simply the union of the human datasets we had available.\\ 116 | For the fake accounts, we randomly undersampled the 3 available datasets to obtain the same number of accounts as the human ones.\\ 117 | After undersampling the users, we used the ids of these users to collect the rest of their data in the other files.\\\\ 118 | There are small differences due to the randomness of the user selection: 119 | we first encountered a significant drop in the number of tweets, averaging 92k, which is far from the 118k reported in the paper. 
We later realized that our parser had an issue with tweets containing commas and was dropping them because of that error.\\ 120 | 121 | \begin{tabular}{lccccc} 122 | dataset & accounts & tweets & followers & friends & total relationships \\ 123 | \hline 124 | TFP & 469 & 563,693 & 258,494 & 241,710 & 500,204 \\ 125 | E13 & 1481 & 2,068,037 & 1,526,944 & 667,225 & 2,194,169 \\ 126 | FSF & 1169 & 22,910 & 11,893 & 253,026 & 264,919 \\ 127 | INT & 1337 & 58,925 & 23,173 & 517,485 & 540,658 \\ 128 | TWT & 845 & 114,192 & 28,588 & 729,839 & 758,427 \\ 129 | \hline 130 | HUM & 1950 & 2,631,730 & 1,785,438 & 908,935 & 2,694,373 \\ 131 | FAK & 1950 & 107,031 & 35,404 & 873,494 & 908,898 \\ 132 | \hline 133 | BAS & 3900 & 2,738,761 & 1,820,842 & 1,782,429 & 3,603,271 \\ 134 | \end{tabular} 135 | 136 | 137 | \section{Data Pre-Processing} 138 | \paragraph{} 139 | The datasets that we received were composed of several directories, each containing the following files: \textit{users.csv, friends.csv, followers.csv and tweets.csv}. These are regular csv files containing text, numerical values and NaN values.\\ 140 | 141 | NaN values are bothersome as they are their own type and can cause problems when they get mixed with strings or numbers. Therefore the first thing to do was to replace every NaN instance by something less troublesome. We decided that an empty string would be a good solution. This was done in a single command, when opening the CSV file: 142 | 143 | \begin{lstlisting} 144 | pd.read_csv(totalPath, encoding='latin-1').fillna('') 145 | \end{lstlisting} 146 | 147 | \section{Feature set generation} 148 | Based on \textit{Cresci et al.}, 3 classes of features were identified: classes \textit{A, B} and \textit{C}. Class \textit{A} features are features whose data can be obtained directly from the user profile, while class \textit{B} features require a simple computation, and class \textit{C} features require heavier computations.\\ 149 | 150 | We implemented the corresponding functions and used them to extract 2 feature sets out of the data: a \textit{class A} feature set and a \textit{class C} feature set, which encompasses the features of every class (\textit{A, B} and \textit{C}). 151 | 152 | \subsection{Data labeling} 153 | For each initial dataset (E13, FAK, FSF, HUM, INT, TFP, TWT) the label of the users is known. The TFP and HUM datasets only contain real users, while the other datasets contain fake users.\\ 154 | 155 | Therefore, when generating the class A and class C feature sets, we add the label 'human' to the feature sets based on TFP and HUM, and the label 'bot' to the others. These labels are actually numbers in the feature sets: 1 for bots, and 0 for humans. 156 | 157 | \section{Process Optimization} 158 | The amount of data that we had to manipulate was by no means huge or overwhelming, but it was still big enough so that, if we were not careful, some optimization issues could arise.\\ 159 | 160 | Developing in Python, we used the popular pandas library and its DataFrame object to manipulate our data and perform computations. Soon, however, the program was plagued by an abnormally slow execution time.\\ 161 | 162 | The issue came from manually looping over DataFrame objects, either by using a \textit{for} loop, or by using the DataFrame.iterrows() function. This led to processing times that could reach between 10 and 20 seconds per user record. 
Considering that we had several thousand users, another solution had to be found.\\\\ 163 | This part of the project was definitely the most time-consuming and involved a lot of small tricks to achieve the expected result for certain features.\\ 164 | As explained above, after a first implementation of certain features we realized that the extraction was extremely slow, so we decided to cut the test set down to 10 users to ensure a functional extractor without waiting 30 minutes after each execution. This already improved our testing time.\\ 165 | After that, we realized that we were getting very long execution times for the class B features. Because the class B features are related to tweets, and a user can have thousands of tweets, some features involved a lot of looping. 166 | 167 | \subsection{DataFrame.apply() and lambda functions} 168 | We found out that there are several efficient ways to loop over a DataFrame. The easiest is through lambda functions, which are automatically executed on every row of the DataFrame without requiring manual management of the iterative process.\\ 169 | 170 | The result was efficient, and the processing time per user came down from 10-20s to about 0.3s. This gain in performance was good, but with 3900 users to process we were still talking about 1300s, which is a really long processing time. To improve that, we researched optimisation techniques with pandas.Series. Using built-in functions and conditions on the Series, we kept improving our processing time. For some of our features we obtained huge improvements, allowing us to reduce most of the class B features to about 0.05s (except for the HUM dataset, because of the high amount of data to check).\\\\ 171 | 172 | Because the built-in functions are limited and conditional selection on a Series is limited for some of the features, we had to be creative. Here are some examples we want to showcase: 173 | 174 | \begin{verbatim} 175 | def is_favorite(userRow, tweetsDF): 176 |     tweets = cache.get_user_tweets(int(userRow['id']), tweetsDF) 177 |     fav = tweets['favorite_count'] != 0 178 |     return int(not tweets[fav].empty) 179 | \end{verbatim} 180 | Here we recover a dataframe containing only the tweets of the user. We create a boolean mask, a Series of True and False values marking the tweets whose favorite count is different from 0. Indexing the dataframe with it keeps only the matching tweets; by checking tweets[fav].empty we verify whether at least one tweet matched. 181 | \\ A last example: 182 | \begin{verbatim} 183 | def uses_foursquare(userRow, tweetsDF): 184 |     all_tweets = cache.get_user_tweets(int(userRow['id']), tweetsDF) 185 |     return "foursquare" in all_tweets['source'].str.cat() 186 | \end{verbatim} 187 | Here, instead of looping through the tweet sources, appending them to a string and finally checking whether the searched string is in it, we adopt another approach: isolate the Series from the dataframe and concatenate its strings using the built-in str.cat() function. It achieves the same thing but reduces the processing time drastically.\\\\ 188 | 189 | Many other features have been approached in unconventional ways to achieve better results. \\ 190 | 191 | Unfortunately, we only found out too late about a technique called vectorization, which would arguably perform even faster. This is not a problem, since we look forward to trying it out in the future and achieving the most optimized result possible. 
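For completeness, here is a minimal sketch of how the lambda-based extraction described above can be wired together (the function and dataframe names are illustrative, not the exact code of our extractor):

\begin{verbatim}
import pandas as pd

def extract_class_B_features(usersDF, tweetsDF):
    # apply() invokes each feature function once per user row,
    # without an explicit Python-level loop
    features = pd.DataFrame(index=usersDF.index)
    features['is_favorite'] = usersDF.apply(
        lambda row: is_favorite(row, tweetsDF), axis=1)
    features['uses_foursquare'] = usersDF.apply(
        lambda row: uses_foursquare(row, tweetsDF), axis=1)
    return features
\end{verbatim}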
192 | 193 | \section{Classifiers} 194 | In \textit{Cresci et al.} the following 8 classifiers are proposed: Random Forest, Decorate, Decision Tree, Adaptive Boosting, Bayesian Network, k-Nearest Neighbors, Logistic Regression and Support Vector Machine.\\ 195 | 196 | For the sake of simplicity, we decided to focus on the following classifiers: Random Forest, Decision Tree, Adaptive Boosting, k-Nearest Neighbors, Logistic Regression and Support Vector Machine.\\ 197 | 198 | \subsection{Results} 199 | After training our classifiers on our generated datasets, we obtain the following results: 200 | 201 | \begin{tabular}{lcccccc} 202 | \hline 203 | \textbf{Algorithm} & \textbf{Accuracy} & \textbf{Precision} & \textbf{Recall} & \textbf{F-M} & \textbf{MCC} & \textbf{AUC}\\ 204 | \hline 205 | \textit{Class A}\\ 206 | 207 | J48 & 1 & 1 & 1 & 1 & 1 & 1 \\ 208 | LR & 0.997 & 0.997 & 0.997 & 0.997 & 0.994 & 0.997 \\ 209 | AB & 1 & 1 & 1 & 1 & 1 & 1 \\ 210 | SVM & 0.678 & 0.999 & 0.356 & 0.525 & 0.464 & 0.78\\ 211 | RF & 0.999 & 1 & 0.999 & 0.999 & 0.999 & 0.999 \\ 212 | kNN & 0.944 & 0.937 & 0.951 & 0.944 & 0.888 & 0.944\\ 213 | 214 | \hline 215 | \end{tabular} 216 | 217 | \paragraph{Accuracy} 218 | Accuracy measures the percentage of samples correctly identified in both classes (human, bot). In our case, the accuracy is very high, except for the Support Vector Machine, for a reason that we could not explain. 219 | 220 | \paragraph{Precision} 221 | A high precision indicates that the samples that were classified as bots actually are bots. Here, the precision is very high for every classifier. The high precision of the SVM combined with its low accuracy indicates that when the classifier labels a sample as a bot, it is very confident and is almost always right. But the classifier stays silent on many samples that are bots and fails to identify them as such. 222 | 223 | \paragraph{Recall} 224 | The recall captures the relevant samples that have been missed by the classifier. As expected, based on the accuracy and precision, the SVM classifier scores low, meaning that when it sees a bot, it often doesn't recognize it as such. The rest of the classifiers perform very well. 225 | 226 | \paragraph{F-M} 227 | The F-Measure gives a global assessment of the quality of the classifier. It is high for all classifiers, except for the SVM. 228 | 229 | \paragraph{MCC} 230 | Similar to the F-M, the MCC tries to give a single value that represents the quality of the classifier. All classifiers score well except for the SVM. 231 | \paragraph{AUC} 232 | 233 | As the table shows, the AUC is close to 1 for every classifier except the SVM (0.78), which is consistent with the other metrics. 234 | Due to an execution error caused by a bad value, the results for class C couldn't be displayed. 235 | 236 | \section{Conclusion} 237 | In this project, we first familiarized ourselves with the various datasets provided, before starting to build features that emulate the 5 studies described in Cresci \textit{et al}. \\ 238 | Based on those features and on the Class A and Class C feature sets provided by Cresci \textit{et al.}, we generated, for each of the 6 datasets, a pair of Class A and Class C feature datasets in which every row was labeled as either 'human' or 'bot'.\\ 239 | Once these features were generated and labeled, they were used to train various \textit{classifiers}, in the hope of being able to distinguish fake users from legitimate users. 
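\paragraph{Note on metric computation}
For reference, here is a minimal sketch (assuming scikit-learn; not necessarily the exact code of our metrics module) of how the scores reported in the results table can be computed from the 10-fold cross-validated predictions returned by our classify() function:

\begin{verbatim}
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def score_predictions(labels, pred):
    # labels: true classes (0 = human, 1 = bot), pred: predicted classes
    return {'Accuracy':  accuracy_score(labels, pred),
            'Precision': precision_score(labels, pred),
            'Recall':    recall_score(labels, pred),
            'F-M':       f1_score(labels, pred),
            'MCC':       matthews_corrcoef(labels, pred),
            'AUC':       roc_auc_score(labels, pred)}
\end{verbatim}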
241 | 242 | \bibliography{bibliography} 243 | \bibliographystyle{plain} 244 | \end{document} -------------------------------------------------------------------------------- /research/efficient_detection_fake_twitter_followers.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/efficient_detection_fake_twitter_followers.pdf -------------------------------------------------------------------------------- /research/project_description.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/project_description.pdf -------------------------------------------------------------------------------- /research/stringhini_Detecting spammer on social network.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/stringhini_Detecting spammer on social network.pdf -------------------------------------------------------------------------------- /research/stringhini_Detecting spammer on social network_paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lnyaki/BigDataProject3/9812868e98646e34cfe7e216d31d0eab2f480ae6/research/stringhini_Detecting spammer on social network_paper.pdf -------------------------------------------------------------------------------- /src/cachedata.py: -------------------------------------------------------------------------------- 1 | import tweets as tw 2 | import users as us 3 | 4 | 5 | _user_tweets = {} 6 | _user_friends = {} 7 | _user_followers = {} 8 | 9 | _user_tweets_count = {} 10 | _user_friends_count = {} 11 | _user_followers_count = {} 12 | 13 | 14 | def get_user_tweets(userID, tweetsDF): 15 | tweets = None 16 | try: 17 | tweets = _user_tweets[userID] 18 | 19 | except KeyError: 20 | _user_tweets[userID] = tw.get_tweets_dataframe_user(userID, tweetsDF) 21 | 22 | finally: 23 | tweets = _user_tweets[userID] 24 | 25 | return tweets 26 | 27 | def get_tweets_count(userID, tweetsDF): 28 | tweetCount = 0 29 | 30 | try: 31 | tweetCount = _user_tweets_count[userID] 32 | 33 | except KeyError: 34 | _user_tweets_count[userID] = get_user_tweets(userID,tweetsDF).shape[0] 35 | 36 | finally: 37 | tweetCount = _user_tweets_count[userID] 38 | 39 | return tweetCount 40 | 41 | def get_user_friends(userID, friendsDF): 42 | friends = None 43 | 44 | try: 45 | friends = _user_friends[userID] 46 | 47 | except KeyError: 48 | _user_friends[userID] = us.get_friends_ids(userID, friendsDF) 49 | 50 | finally: 51 | friends = _user_friends[userID] 52 | 53 | return friends 54 | 55 | def get_friends_count(userID, friendsDF): 56 | friendsCount = 0 57 | 58 | try: 59 | userCount = _user_friends_count[userID] 60 | 61 | except KeyError: 62 | _user_friends_count[userID] = len(get_user_friends(userID,tweetsDF)) 63 | 64 | finally: 65 | friendsCount = _user_friends_count[userID] 66 | 67 | return friendsCount 68 | 69 | def get_user_followers(userID, followersDF): 70 | followers = None 71 | 72 | try: 73 | friends = _user_friends[userID] 74 | 75 | except KeyError: 76 | _user_followers[userID] = us.get_followers_ids(userID, followersDF) 77 | 78 | finally: 79 | followers = _user_followers[userID] 80 | 81 | return followers 82 | 83 | def 
get_followers_count(userID, followersDF): 84 | followersCount = 0 85 | 86 | try: 87 | followersCount = _user_followers_count[userID] 88 | 89 | except KeyError: 90 | _user_followers_count[userID] = len(get_user_followers(userID,tweetsDF).tolist()) 91 | 92 | finally: 93 | followersCount = _user_followers_count[userID] 94 | 95 | return followersCount -------------------------------------------------------------------------------- /src/classifiers.py: -------------------------------------------------------------------------------- 1 | # References for this section 2 | # 3 | # Classifiers used to evaluate the features 4 | # Decorate (D), Adaptive Boost (AB), Random Forest (RF), 5 | # Decision Tree (J48), Bayesian Network (BN), k-Nearest Neighbors (kNN), 6 | # Multinomial Ridge Logistic Regression (LR) and a Support Vector 7 | # Machine (SVM). 8 | # 9 | # D, RF, J48 and BN 10 | # ----------------- 11 | # http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.414.5888&rep=rep1&type=pdf 12 | # -> (http://puu.sh/yVTDB/41ec806328.png) 13 | # Nothing special mentioned concerning the parameters for these 14 | # classifiers. 15 | # 16 | # http://www.cse.iitd.ernet.in/~siy117527/sil765/readings/Detecting%20spammer%20on%20social%20network.pdf 17 | # Again no information about the RF configuration 18 | # 19 | # SVM 20 | # --- 21 | # Our SVM classifier exploits a Radial Basis Function (RBF) kernel 22 | # and has been trained using libSVM as the machine learning algorithm 23 | # The cost and gamma parameters have been optimized via a grid search 24 | # algorithm. 25 | # 26 | # kNN 27 | # --- 28 | # k parameter of the kNN classifier and the ridge penalizing parameter 29 | # of the LR model have been optimized via a cross validation parameter 30 | # selection algorithm. 
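#
# For reference, a minimal sketch (hypothetical parameter grid, not the
# exact one used) of how the C/gamma grid search mentioned above could
# be run with scikit-learn:
#
#   from sklearn.model_selection import GridSearchCV
#   from sklearn import svm
#   param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
#   search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=10)
#   search.fit(features, labels)
#   clf = svm.SVC(kernel='rbf', C=search.best_params_['C'],
#                 gamma=search.best_params_['gamma'])
#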
31 | from sklearn.model_selection import cross_val_score 32 | from sklearn.model_selection import cross_val_predict 33 | from sklearn.ensemble import RandomForestClassifier 34 | from sklearn.tree import DecisionTreeClassifier 35 | from sklearn.ensemble import AdaBoostClassifier 36 | from sklearn.linear_model import LogisticRegression 37 | #from pomegranate import * 38 | from sklearn.neighbors import KNeighborsClassifier 39 | from sklearn import svm 40 | 41 | 42 | def random_forest(): 43 | # http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html 44 | # 45 | # RandomForestClassifier(n_estimators=10, criterion=’gini’, 46 | # max_depth=None, min_samples_split=2, min_samples_leaf=1, 47 | # min_weight_fraction_leaf=0.0, max_features=’auto’, 48 | # max_leaf_nodes=None, min_impurity_decrease=0.0, 49 | # min_impurity_split=None, bootstrap=True, oob_score=False, 50 | # n_jobs=1, random_state=None, verbose=0, warm_start=False, 51 | # class_weight=None) 52 | # 53 | clf = RandomForestClassifier() 54 | # clf.fit(X, y) 55 | # clf.predict([[0, 0, 0, 0]])) 56 | return clf 57 | 58 | def decorate(): 59 | # Install https://github.com/fracpete/python-weka-wrapper3 60 | # 61 | # examples: https://github.com/fracpete/python-weka-wrapper3-examples/blob/master/src/wekaexamples/classifiers/classifiers.py 62 | # pass 63 | # load a dataset 64 | loader = Loader("weka.core.converters.ArffLoader") 65 | 66 | #evaluation 67 | evaluation = Evaluation(train_features) 68 | 69 | 70 | def decision_tree(): 71 | # http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier 72 | # clf = DecisionTreeClassifier(random_state=0) 73 | # 74 | # DecisionTreeClassifier(criterion=’gini’, splitter=’best’, 75 | # max_depth=None, min_samples_split=2, min_samples_leaf=1, 76 | # min_weight_fraction_leaf=0.0, max_features=None, 77 | # random_state=None, max_leaf_nodes=None, 78 | # min_impurity_decrease=0.0, min_impurity_split=None, 79 | # class_weight=None, presort=False) 80 | # J48 81 | clf = DecisionTreeClassifier() 82 | #clf = train_decision_tree(train_features,train_labels) 83 | 84 | #result = cls.score(test_features, test_labels) 85 | 86 | return clf 87 | 88 | #def train_decision_tree(train_features, train_labels): 89 | # clf = tree.DecisionTreeClassifier() 90 | # clf = clf.fit(train_features, train_labels) 91 | 92 | 93 | def adaptive_boost(): 94 | # http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html 95 | # 96 | # AdaBoostClassifier(base_estimator=None, n_estimators=50, 97 | # learning_rate=1.0, algorithm=’SAMME.R’, random_state=None) 98 | # 99 | # iris = load_iris() 100 | # clf = AdaBoostClassifier(n_estimators=100) 101 | # scores = cross_val_score(clf, iris.data, iris.target) 102 | # scores.mean() 103 | clf = AdaBoostClassifier() 104 | #clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.1) 105 | return clf 106 | 107 | 108 | def bayesian_network(): 109 | # http://pomegranate.readthedocs.io/en/latest/BayesianNetwork.html 110 | # Careful with dependencies 111 | # 112 | model = BayesianNetwork.from_samples(train_features, algorithm='exact') 113 | model.predict(test_features) 114 | 115 | def k_nearest_neighbors(): 116 | # http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html 117 | # 118 | #clf = KNeighborsClassifier(n_neighbors=3) 119 | clf = KNeighborsClassifier() 120 | # clf.fit(X, y) 121 | # print(clf.predict([[1.1]])) 122 | # 123 | # (n_neighbors=5, 
weights=’uniform’, algorithm=’auto’, leaf_size=30, 124 | # p=2, metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs) 125 | return clf 126 | 127 | def logistic_regression(): 128 | # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html 129 | # 130 | # LogisticRegression(penalty=’l2’, dual=False, tol=0.0001, C=1.0, 131 | # fit_intercept=True, intercept_scaling=1, class_weight=None, 132 | # random_state=None, solver=’liblinear’, max_iter=100, 133 | # multi_class=’ovr’, verbose=0, warm_start=False, n_jobs=1) 134 | # 135 | clf = LogisticRegression() 136 | # clf.fit(X, y) 137 | # clf.predict(X) 138 | # 139 | return clf 140 | 141 | def support_vector_machine(): 142 | clf = svm.SVC(kernel='rbf',C=10.0,gamma='auto') 143 | #clf.fit(train_features, train_labels) 144 | # clf.predict([[2., 2.]]) 145 | # 146 | # Algo: libSVM (SVC) 147 | # kernel: rbf 148 | # C : optimized 149 | # gamma : optimized 150 | # http://scikit-learn.org/stable/modules/grid_search.html 151 | # http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV 152 | # http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html 153 | # 154 | # SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, 155 | # decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', 156 | # max_iter=-1, probability=False, random_state=None, shrinking=True, 157 | # tol=0.001, verbose=False) 158 | return clf 159 | 160 | def classify(features, labels): 161 | 162 | # CROSS VALIDATION: cross_val_score(clf, test_features, test_labels, cv=10) 163 | # or 164 | # from sklearn.model_selection import cross_val_predict 165 | # pred = cross_val_predict(clf, test_features, test_labels) 166 | 167 | # dico with the classifiers 168 | classifiers_dict = {} 169 | # dico with the predictions made with the 170 | predictions_dict = {} 171 | 172 | classifiers_dict['RF'] = random_forest() 173 | #classifiers_dict['D'] = decorate() 174 | classifiers_dict['J48'] = decision_tree() 175 | classifiers_dict['AB'] = adaptive_boost() 176 | #classifiers_dict['BN'] = bayesian_network() 177 | classifiers_dict['kNN'] = k_nearest_neighbors() 178 | classifiers_dict['LR'] = logistic_regression() 179 | classifiers_dict['SVM'] = support_vector_machine() 180 | 181 | 182 | for key, value in classifiers_dict.items(): 183 | try: 184 | predictions_dict[key] = cross_val_predict(value, features, labels, cv=10) 185 | except ValueError: 186 | predictions_dict[key] = None 187 | return predictions_dict 188 | -------------------------------------------------------------------------------- /src/csv_processor.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from pathlib import Path 4 | import pandas as pd 5 | import numpy as np 6 | import sklearn 7 | import csv 8 | import random 9 | import linecache 10 | import matplotlib.pyplot as plt 11 | import time 12 | 13 | def get_dataframe(filename): 14 | df = None 15 | 16 | try: 17 | df = pd.read_csv(filename) 18 | 19 | except: 20 | print("Error while loading file : "+filename) 21 | 22 | return df 23 | 24 | def HUM_creator(e13_folder,tfp_folder,hum_folder): 25 | """ 26 | Very simple compilation of the HUM datasets merging E13 and TFP. 
27 | E13 count: 1481 28 | TFP count: 469 29 | final accounts: 1950 30 | """ 31 | files = ["users.csv","followers.csv","friends.csv","tweets.csv"] 32 | for file in files: 33 | f1 = open(e13_folder+file,"r", errors="ignore") 34 | f2 = open(tfp_folder+file,"r", errors="ignore") 35 | f3 = open(hum_folder+file,"w", errors="ignore") 36 | for lines in f1.readlines(): 37 | f3.write(lines) 38 | 39 | for lines in f2.readlines()[1:]: 40 | f3.write(lines) 41 | f1.close() 42 | f2.close() 43 | f3.close() 44 | 45 | def FAK_creator(fsf_folder,int_folder,twt_folder,fak_folder): 46 | """ 47 | From the dataset of fake accounts, we undersample the dataset randomly to 48 | reach the same amout as the HUM dataset. 49 | FSF count: 1169 50 | INT count: 1337 51 | TWT count: 845 52 | fake accounts count: 3351 53 | final accounts needed: 1950 54 | percent randomly selected from the fake sample: 0.5819 55 | """ 56 | file_merging(fsf_folder,int_folder,twt_folder,fak_folder,0.5819) 57 | 58 | def file_merging(folder1,folder2,folder3,result_folder,percent): 59 | # 3 folders with fake accounts 60 | # 61 | # In each folder we start by randomly undersampling the users 62 | # Then we use the ids collected for the users and fetch the corresponding data 63 | # in the other files. 64 | folder_list = [folder1,folder2,folder3] 65 | files = ["users.csv","followers.csv","friends.csv","tweets.csv"] 66 | 67 | # flag to check if the header has already been added. 68 | # [users","followers","friends","tweets"] 69 | # 0 -> no header 70 | # 1 -> header already added 71 | head_flags = [0,0,0,0] 72 | 73 | for folder in folder_list: 74 | # List for the user's ids randomly picked (reset for each folder) 75 | list_of_ids = [] 76 | 77 | for file in files: 78 | 79 | #with open(folder+file, newline='', encoding='utf-32') as fp: 80 | # reader = csv.reader(fp, dialect='excel') 81 | # for line in reader: 82 | # print(str(line).encode("utf-8")) 83 | #fp.close() 84 | 85 | 86 | f = open(folder+file,"r", errors="ignore") 87 | r = open(result_folder+file,"a",errors="ignore") 88 | 89 | lines = f.readlines() 90 | 91 | len_file = len(lines) 92 | 93 | # choosing the number of lines we'll keep 94 | number_lines = int(percent*len_file) 95 | 96 | # This creates a random sub-sample of the lines of the files 97 | # 98 | # This works to select the percent of users we are interested but 99 | # for the other files we have to select the corresponding data. 100 | 101 | print("file: "+folder+file) 102 | 103 | if file == "users.csv": 104 | if head_flags[0] == 0: 105 | r.write(lines[0]) 106 | head_flags[0] = 1 107 | 108 | for line in random.sample(range(1,len_file), number_lines): 109 | #print("list_of_ids: "+str((lines[line].split('","')[0].strip('"')))) 110 | list_of_ids.append(int(lines[line].split('","')[0].strip('"'))) 111 | r.write(lines[line]) 112 | #print("liste: "+str(list_of_ids)) 113 | #time.sleep(1) 114 | # no need for conditional check, just make sure that the flag is at 1 after header added 115 | 116 | # So after selecting the percent of users we want, we have to fetch the ids 117 | # and collect the corresponding data in the other files. 
118 | # for tweets.csv the id is the [3] element 119 | # for friends.csv the id is the [0] element 120 | # and for followers.csv the id is the [1] element 121 | 122 | elif file == "followers.csv": 123 | # header check 124 | if head_flags[1] == 0: 125 | r.write(lines[0]) 126 | head_flags[1] = 1 127 | # id check 128 | for line in lines[1:]: 129 | if int(line.split(",")[1].replace('"','')) in list_of_ids: 130 | r.write(line) 131 | 132 | elif file == "friends.csv": 133 | if head_flags[2] == 0: 134 | r.write(lines[0]) 135 | head_flags[2] = 1 136 | for line in lines[1:]: 137 | if int(line.split(",")[0].strip('"')) in list_of_ids: 138 | r.write(line) 139 | 140 | elif file == "tweets.csv": 141 | if head_flags[3] == 0: 142 | r.write(lines[0]) 143 | head_flags[3] = 1 144 | for line in lines[1:]: 145 | line = line.replace(',,',',"",') 146 | if int(line.split('","')[4]) in list_of_ids: 147 | r.write(line) 148 | f.close() 149 | r.close() 150 | 151 | if(__name__ == "__main__"): 152 | HUM_creator("../data/E13/","../data/TFP/","../data/HUM/") 153 | #FAK_creator("../data/FSF/","../data/INT/","../data/TWT/","../data/FAK/") 154 | #BAS_creator("../data/HUM/","../data/FAK/","../data/BAS/") -------------------------------------------------------------------------------- /src/features.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sys 3 | from datetime import datetime 4 | from dateutil.parser import parse 5 | import itertools 6 | import string 7 | import cachedata as cache 8 | from tweets import * 9 | from url_finder import * 10 | from users import * 11 | import math 12 | from time import * 13 | from difflib import SequenceMatcher 14 | 15 | ########### 16 | # Class A # 17 | ########### 18 | 19 | # Camisani-Calzolari 20 | HAS_NAME = 'has_name' 21 | HAS_IMAGE = 'has_image' 22 | HAS_ADDRESS = 'has_address' 23 | HAS_BIO = 'has_bio' 24 | HAS_30_FOLLOWERS ='has_30_followers' 25 | BELONGS_TO_A_LIST = 'belongs_to_a_list' 26 | HAS_50_TWEETS = 'has_50_tweets' 27 | URL_IN_PROFILE = 'url_in_profile' 28 | FOLLOWERS_TO_FRIENDS_RATIO_OVER_2 = 'followers_to_friends_ration_over_2' 29 | 30 | # State of search 31 | BOT_IN_BIO = 'bot_in_bio' 32 | FRIENDS_TO_FOLLOWERS_RATIO_IS_100 = 'friends_to_followers_ratio_is_100' 33 | DUPLICATE_PROFILE_PICTURE = 'duplicate_profile_picture' 34 | 35 | # Socialbakers 36 | HAS_x50_FOLLOWERS = 'has_x50_followers' 37 | HAS_DEFAULT_IMAGE = 'has_default_image' 38 | HAS_NO_BIO = 'has_no_bio' 39 | HAS_NO_LOCATION = 'has_no_location' 40 | HAS_100_FRIENDS = 'has_100_friends' 41 | HAS_NO_TWEETS = 'has_no_tweets' 42 | 43 | # Stringhini et al. 44 | NUMBER_OF_FRIENDS = 'number_of_friends' 45 | NUMBER_OF_FRIENDS_TWEETS = 'number_of_friends_tweets' 46 | NUMBER_OF_TWEETS_SENT = 'number_of_tweets_sent' 47 | FRIENDS_TO_FOLLOWERS_RATIO = 'friends_to_followers_ratio' 48 | 49 | # Yang et al. 
50 | ACCOUNT_AGE = 'account_age' 51 | FOLLOWING_RATE = 'following_rate' 52 | 53 | ########### 54 | # Class B # 55 | ########### 56 | 57 | # Camisani-Calzolari 58 | GEOLOCALIZED = 'geolocalized' 59 | IS_FAVORITE = 'is_favorite' 60 | USES_PUNCTUATION = 'uses_punctuation' 61 | USES_HASHTAG = 'uses_hashtag' 62 | USES_IPHONE = 'uses_iphone' 63 | USES_ANDROID = 'uses_android' 64 | USES_FOURSQUARE = 'uses_foursquare' 65 | USES_INSTAGRAM = 'uses_instagram' 66 | USES_TWITTERDOTCOM = 'uses_twitterdotcom' 67 | USERID_IN_TWEET = 'userid_in_tweet' 68 | TWEETS_WITH_URL = 'tweets_with_url' 69 | RETWEET_OVER_1 = 'retweet_over_1' 70 | USES_DIFFERENT_CLIENTS = 'uses_different_clients' 71 | 72 | #State of Search 73 | DUPLICATE_SENTENCES_ACROSS_TWEETS = 'duplicate_sentences_across_tweets' 74 | API_TWEETS = 'api_tweets' 75 | 76 | # Socialbakers 77 | HAS_DUPLICATE_TWEETS = 'has_duplicate_tweets' 78 | HIGH_RETWEET_RATIO = 'high_retweet_ratio' 79 | HIGH_TWEET_LINK_RATIO = 'high_tweet_link_ratio' 80 | 81 | 82 | # Stringhini et al. 83 | TWEET_SIMILARITY = 'tweet_similarity' 84 | URL_RATIO = 'url_ratio' 85 | UNIQUE_FRIENDS_NAME_RATIO = 'unique_friends_name' 86 | 87 | # Yang et al. 88 | API_RATIO = 'api_ratio' 89 | API_URL_RATIO = 'api_url_ratio' 90 | API_TWEET_SIMILARITY = 'api_tweet_similarity' 91 | 92 | 93 | 94 | ########### 95 | # Class C # 96 | ########### 97 | 98 | # Yang et al. 99 | BILINK_RATIO = 'bi-link_ratio' 100 | AVERAGE_NEIGHBORS_FOLLOWERS = 'average_neighbors_followers' 101 | AVERAGE_NEIGHBORS_TWEETS = 'average_neighbors_tweets' 102 | FOLLOWINGS_TO_MEDIAN_NEIGHBORS_FOLLOWERS = 'followings_to_median_neighbors_followers' 103 | 104 | #********************************* 105 | # Features set names * 106 | #********************************* 107 | CAMISANI = 'camisani' #'Camisani-Calzolari' 108 | STATEOFSEARCH = 'stateofsearch' #'State of search' 109 | SOCIALBAKERS = 'socialbakers' #'SocialBakers' 110 | STRINGHINI = 'stringhini' #'Stringhini et al' 111 | YANG = 'yang' #'Yang et al' 112 | CLASS_A = 'A' 113 | CLASS_C = 'C' 114 | 115 | def get_features(featureSetName, dataframes): 116 | features = {} 117 | 118 | if(featureSetName == CAMISANI): 119 | features = get_camisani_features(dataframes) 120 | 121 | elif(featureSetName == STATEOFSEARCH): 122 | features = get_state_of_search_features(dataframes) 123 | 124 | elif(featureSetName == SOCIALBAKERS): 125 | features = get_socialbakers_features(dataframes) 126 | 127 | elif(featureSetName == STRINGHINI): 128 | features = get_stringhini_features(dataframes) 129 | 130 | elif(featureSetName == YANG): 131 | features = get_yang_features(dataframes) 132 | 133 | elif(featureSetName == CLASS_A): 134 | features = get_class_A_features(dataframes) 135 | 136 | elif(featureSetName == CLASS_C): 137 | features = get_class_C_features(dataframes) 138 | 139 | else: 140 | print("Error Unknown feature set specified : "+featureSetName) 141 | 142 | return features 143 | 144 | def get_class_A_features(dataframes): 145 | usersDF = dataframes['users'] 146 | tweetsDF = dataframes['tweets'] 147 | 148 | features = [] 149 | 150 | LIMIT = 10 151 | limit = 1 152 | 153 | for index, row in usersDF.iterrows(): 154 | #timelog("[{}] User {}".format(index,row['id'])) 155 | features.append(get_single_class_A_features(row,usersDF, tweetsDF)) 156 | 157 | #Temporary code, for test purpose 158 | ''' 159 | if(limit > LIMIT): 160 | break 161 | else: 162 | limit = limit +1 163 | ''' 164 | 165 | return features 166 | 167 | 168 | def get_single_class_A_features(userRow, usersDF,tweetsDF): 169 | #Class A features 
= profile-based features 170 | userID = userRow['id'] 171 | 172 | #Camisani class A 173 | features = {} 174 | features[HAS_NAME] = has_name(userRow) 175 | features[HAS_IMAGE] = has_image(userRow) 176 | features[HAS_ADDRESS] = has_address(userRow) 177 | features[HAS_BIO] = has_bio(userRow) 178 | features[HAS_30_FOLLOWERS] = has_30_followers(userRow) 179 | features[BELONGS_TO_A_LIST] = belongs_to_a_list(userRow) 180 | features[HAS_50_TWEETS] = has_50_tweets(userRow) 181 | features[URL_IN_PROFILE] = url_in_profile(userRow) 182 | features[FOLLOWERS_TO_FRIENDS_RATIO_OVER_2] = followers_to_friends_ration_over_2(userRow) 183 | 184 | #State of search class A 185 | features[BOT_IN_BIO] = bot_in_bio(userRow) 186 | features[FRIENDS_TO_FOLLOWERS_RATIO_IS_100] = friends_to_followers_ratio_is_100(userRow) 187 | features[DUPLICATE_PROFILE_PICTURE] = duplicate_profile_picture(userRow,usersDF) 188 | 189 | #Social bakers class A 190 | features[HAS_x50_FOLLOWERS] = has_x50_followers(userRow) 191 | features[HAS_DEFAULT_IMAGE] = has_default_image(userRow) 192 | #features[HAS_NO_BIO] = has_no_bio(userRow) #Same as has_bio from camisani 193 | features[HAS_NO_LOCATION] = has_no_location(userRow) 194 | features[HAS_100_FRIENDS] = has_100_friends(userRow) 195 | features[HAS_NO_TWEETS] = has_no_tweets(userID, tweetsDF) 196 | 197 | #Stringhini class A 198 | features[NUMBER_OF_FRIENDS] = get_friends_count(userRow) 199 | features[FRIENDS_TO_FOLLOWERS_RATIO] = get_stringhini_friends_to_followers_ratio(userRow) 200 | 201 | #Yang class A 202 | features[ACCOUNT_AGE] = get_account_age(userRow) 203 | 204 | return features 205 | 206 | def get_class_C_features(dataframes): 207 | usersDF = dataframes['users'] 208 | tweetsDF = dataframes['tweets'] 209 | friendsDF = dataframes['friends'] 210 | followersDF = dataframes['followers'] 211 | 212 | features = [] 213 | 214 | LIMIT = 5 215 | t0 = time() 216 | for index, row in usersDF.iterrows(): 217 | #timelog("[{}] User {}".format(index,row['id'])) 218 | features.append(get_single_class_C_features(row,usersDF, friendsDF,followersDF,tweetsDF)) 219 | 220 | #Temporary code, for test purpose 221 | ''' 222 | if(index > LIMIT): 223 | break 224 | ''' 225 | 226 | 227 | return features 228 | 229 | def get_single_class_C_features(userRow,usersDF, friendsDF,followersDF,tweetsDF): 230 | #Class C features = all features 231 | userID = userRow['id'] 232 | features = get_single_class_A_features(userRow, usersDF,tweetsDF) 233 | 234 | #Camisani Class B 235 | features[GEOLOCALIZED] = geolocalized(userRow,tweetsDF) 236 | features[IS_FAVORITE] = is_favorite(userRow,tweetsDF) 237 | features[USES_PUNCTUATION] = uses_punctuation(userRow,tweetsDF) 238 | features[USES_HASHTAG] = uses_hashtag(userRow,tweetsDF) 239 | features[USES_IPHONE] = uses_iphone(userRow,tweetsDF) 240 | features[USES_ANDROID] = uses_android(userRow,tweetsDF) 241 | features[USES_FOURSQUARE] = uses_foursquare(userRow,tweetsDF) 242 | features[USES_INSTAGRAM] = uses_instagram(userRow,tweetsDF) 243 | features[USES_TWITTERDOTCOM] = uses_twitterdotcom(userRow,tweetsDF) 244 | features[USERID_IN_TWEET] = userid_in_tweet(userRow,tweetsDF) 245 | features[TWEETS_WITH_URL] = tweets_with_url(userRow,tweetsDF) 246 | features[RETWEET_OVER_1] = retweet_over_1(userRow,tweetsDF) 247 | features[USES_DIFFERENT_CLIENTS] = uses_different_clients(userRow,tweetsDF) 248 | 249 | #State of search Class B 250 | features[DUPLICATE_SENTENCES_ACROSS_TWEETS] = duplicate_sentences_across_tweets(userRow,tweetsDF) 251 | features[API_TWEETS] = api_tweets(userRow,tweetsDF) 252 
| 253 | #Social Bakers class B 254 | features[HAS_DUPLICATE_TWEETS] = has_duplicate_tweets(userID,tweetsDF,3) 255 | features[HIGH_RETWEET_RATIO] = has_retweet_ratio(userID,tweetsDF,0.9) 256 | features[HIGH_TWEET_LINK_RATIO] = has_tweet_links_ratio(userID, tweetsDF,0.9) 257 | 258 | #Stringhini class B 259 | features[NUMBER_OF_TWEETS_SENT] = cache.get_tweets_count(userID,tweetsDF) 260 | #features[TWEET_SIMILARITY] = get_tweet_similarity(userRow,tweetsDF) #comment calculer? 261 | features[URL_RATIO] = get_url_ratio(userID, tweetsDF) 262 | features[UNIQUE_FRIENDS_NAME_RATIO] = get_unique_friends_name_ratio(userID,usersDF,friendsDF) 263 | 264 | #Yang class B (Comment calculer API?) 265 | #features[API_RATIO] = get_api_ratio(userID, tweetsDF) 266 | #features[API_URL_RATIO] = get_api_url_ratio(userRow) 267 | #features[API_TWEET_SIMILARITY] = get_api_tweet_similarity(userRow) 268 | 269 | 270 | #Yang class C 271 | #features[BILINK_RATIO] = get_bilink_ratio(userRow, friendsDF, followersDF) 272 | features[AVERAGE_NEIGHBORS_FOLLOWERS] = get_average_neighbors_followers(userID,friendsDF,usersDF) 273 | features[AVERAGE_NEIGHBORS_TWEETS] = get_average_neighbors_tweets(userID, usersDF,friendsDF, tweetsDF) 274 | #features[FOLLOWINGS_TO_MEDIAN_NEIGHBORS_FOLLOWERS] = get_followings_to_median(userRow) 275 | 276 | return features 277 | 278 | def get_camisani_features(dataframes): 279 | camisaniFeatures = [] 280 | 281 | usersDF = dataframes['users'] 282 | tweetsDF = dataframes['tweets'] 283 | 284 | LIMIT = 10 285 | limit = 1 286 | 287 | for index, row in usersDF.iterrows(): 288 | camisaniFeatures.append(get_single_user_camisani_features(row,tweetsDF)) 289 | 290 | #Temporary code, for test purpose 291 | if(limit > LIMIT): 292 | break 293 | else: 294 | limit = limit +1 295 | 296 | return camisaniFeatures 297 | 298 | def get_single_user_camisani_features(userRow, tweetsDF): 299 | ''' 300 | Class A : has name, has image, has address, has bio, followers >= 30, belongs to a list, 301 | tweets >= 50, URL in profile, 2xfollowers >= friends 302 | 303 | Class B : geo, is favorite, uses punctuation, uses #, uses iphone, uses android, uses foursquare, 304 | uses twitter.com, userId in tweet, retweet >= 1, uses different clients 305 | ''' 306 | 307 | features = {} 308 | 309 | # class A 310 | t0 = time() 311 | t2 = time() 312 | features[HAS_NAME] = has_name(userRow) 313 | print ("class A.1 camisani:", round(time()-t2, 3), "s") 314 | t3 = time() 315 | features[HAS_IMAGE] = has_image(userRow) 316 | print ("class A.2 camisani:", round(time()-t3, 3), "s") 317 | t4 = time() 318 | features[HAS_ADDRESS] = has_address(userRow) 319 | print ("class A.3 camisani:", round(time()-t4, 3), "s") 320 | t5 = time() 321 | features[HAS_BIO] = has_bio(userRow) 322 | print ("class A.4 camisani:", round(time()-t5, 3), "s") 323 | t6 = time() 324 | features[HAS_30_FOLLOWERS] = has_30_followers(userRow) 325 | print ("class A.5 camisani:", round(time()-t6, 3), "s") 326 | t7 = time() 327 | features[BELONGS_TO_A_LIST] = belongs_to_a_list(userRow) 328 | print ("class A.6 camisani:", round(time()-t7, 3), "s") 329 | t8 = time() 330 | features[HAS_50_TWEETS] = has_50_tweets(userRow) 331 | print ("class A.7 camisani:", round(time()-t8, 3), "s") 332 | t9 = time() 333 | features[URL_IN_PROFILE] = url_in_profile(userRow) 334 | print ("class A.8 camisani:", round(time()-t9, 3), "s") 335 | t10 = time() 336 | features[FOLLOWERS_TO_FRIENDS_RATIO_OVER_2] = followers_to_friends_ration_over_2(userRow) 337 | print ("class A.9 camisani:", round(time()-t10, 3), "s") 338 | 
print ("class A camisani:", round(time()-t2, 3), "s") 339 | 340 | # class B 341 | t1 = time() 342 | t10 = time() 343 | features[GEOLOCALIZED] = geolocalized(userRow,tweetsDF) 344 | print ("class B.1 camisani:", round(time()-t10, 3), "s") 345 | t11 = time() 346 | features[IS_FAVORITE] = is_favorite(userRow,tweetsDF) 347 | print ("class B.2 camisani:", round(time()-t11, 3), "s") 348 | t12 = time() 349 | features[USES_PUNCTUATION] = uses_punctuation(userRow,tweetsDF) 350 | print ("class B.3 camisani:", round(time()-t12, 3), "s") 351 | t13 = time() 352 | features[USES_HASHTAG] = uses_hashtag(userRow,tweetsDF) 353 | print ("class B.4 camisani:", round(time()-t13, 3), "s") 354 | t14 = time() 355 | features[USES_IPHONE] = uses_iphone(userRow,tweetsDF) 356 | print ("class B.5 camisani:", round(time()-t14, 3), "s") 357 | t15 = time() 358 | features[USES_ANDROID] = uses_android(userRow,tweetsDF) 359 | print ("class B.6 camisani:", round(time()-t15, 3), "s") 360 | t16 = time() 361 | features[USES_FOURSQUARE] = uses_foursquare(userRow,tweetsDF) 362 | print ("class B.7 camisani:", round(time()-t16, 3), "s") 363 | t17 = time() 364 | features[USES_INSTAGRAM] = uses_instagram(userRow,tweetsDF) 365 | print ("class B.8 camisani:", round(time()-t17, 3), "s") 366 | t18 = time() 367 | features[USES_TWITTERDOTCOM] = uses_twitterdotcom(userRow,tweetsDF) 368 | print ("class B.9 camisani:", round(time()-t18, 3), "s") 369 | t19 = time() 370 | features[USERID_IN_TWEET] = userid_in_tweet(userRow,tweetsDF) 371 | print ("class B.10 camisani:", round(time()-t19, 3), "s") 372 | t20 = time() 373 | features[TWEETS_WITH_URL] = tweets_with_url(userRow,tweetsDF) 374 | print ("class B.11 camisani:", round(time()-t20, 3), "s") 375 | t21 = time() 376 | features[RETWEET_OVER_1] = retweet_over_1(userRow,tweetsDF) 377 | print ("class B.12 camisani:", round(time()-t21, 3), "s") 378 | t22 = time() 379 | features[USES_DIFFERENT_CLIENTS] = uses_different_clients(userRow,tweetsDF) 380 | print ("class B.13 camisani:", round(time()-t22, 3), "s") 381 | print ("class B camisani:", round(time()-t1, 3), "s") 382 | print ("camisani:", round(time()-t0, 3), "s") 383 | return features 384 | 385 | def get_state_of_search_features(dataframes): 386 | stateofsearchFeatures = [] 387 | 388 | usersDF = dataframes['users'] 389 | tweetsDF = dataframes['tweets'] 390 | 391 | LIMIT = 10 392 | limit = 1 393 | 394 | for index, row in usersDF.iterrows(): 395 | stateofsearchFeatures.append(get_single_user_state_of_search_features(row,usersDF,tweetsDF)) 396 | 397 | #Temporary code, for test purpose 398 | if(limit > LIMIT): 399 | break 400 | else: 401 | limit = limit +1 402 | 403 | return stateofsearchFeatures 404 | 405 | 406 | def get_single_user_state_of_search_features(userRow, usersDF, tweetsDF): 407 | ''' 408 | Class A : bot in biography, friends/followers > 100, duplicate profile pictures 409 | 410 | Class B : same sentence to many accounts, tweet from API 411 | ''' 412 | 413 | features = {} 414 | 415 | # class A 416 | t0 = time() 417 | t1 = time() 418 | features[BOT_IN_BIO] = bot_in_bio(userRow) 419 | print ("class A.1 stateofsearch:", round(time()-t1, 3), "s") 420 | t2 = time() 421 | features[FRIENDS_TO_FOLLOWERS_RATIO_IS_100] = friends_to_followers_ratio_is_100(userRow) 422 | print ("class A.2 stateofsearch:", round(time()-t2, 3), "s") 423 | t3 = time() 424 | features[DUPLICATE_PROFILE_PICTURE] = duplicate_profile_picture(userRow,usersDF) 425 | print ("class A.3 stateofsearch:", round(time()-t3, 3), "s") 426 | 427 | # class B 428 | t4 = time() 429 | 
features[DUPLICATE_SENTENCES_ACROSS_TWEETS] = duplicate_sentences_across_tweets(userRow,tweetsDF) 430 | print ("class B.1 stateofsearch:", round(time()-t4, 3), "s") 431 | t5 = time() 432 | features[API_TWEETS] = api_tweets(userRow,tweetsDF) 433 | print ("class B.2 stateofsearch:", round(time()-t5, 3), "s") 434 | print ("stateofsearch:", round(time()-t0, 3), "s") 435 | 436 | return features 437 | 438 | def get_socialbakers_features(dataframes): 439 | socialbakersFeatures = [] 440 | 441 | usersDF = dataframes['users'] 442 | tweetsDF = dataframes['tweets'] 443 | 444 | LIMIT = 100 445 | limit = 1 446 | 447 | for index, row in usersDF.iterrows(): 448 | socialbakersFeatures.append(get_single_user_socialbakers_features(row,tweetsDF)) 449 | 450 | #Temporary code, for test purpose 451 | ''' 452 | if(limit > LIMIT): 453 | break 454 | else: 455 | limit = limit +1 456 | ''' 457 | 458 | return socialbakersFeatures 459 | 460 | def get_single_user_socialbakers_features(userRow, tweetsDF): 461 | ''' 462 | Class A : followers ≥ 50, default image after 2 463 | months, no bio, no location, friends ≥100, 0 tweets 464 | 465 | Class B : tweets spam phrases, same tweet ≥ 3, retweets ≥ 90%, 466 | tweet-links ≥ 90% 467 | ''' 468 | userID = userRow['id'] 469 | 470 | features = {} 471 | 472 | #Class A 473 | 474 | #TODO : ce n'est pas 50 followers, c'est un ratio de 50:1 entre friends et followers 475 | features[HAS_x50_FOLLOWERS] = has_x50_followers(userRow) 476 | features[HAS_DEFAULT_IMAGE] = has_default_image(userRow) 477 | features[HAS_NO_BIO] = has_no_bio(userRow) 478 | features[HAS_NO_LOCATION] = has_no_location(userRow) 479 | features[HAS_100_FRIENDS] = has_100_friends(userRow) 480 | features[HAS_NO_TWEETS] = has_no_tweets(userID, tweetsDF) 481 | 482 | #Class B 483 | features[HAS_DUPLICATE_TWEETS] = has_duplicate_tweets(userID,tweetsDF,3) 484 | features[HIGH_RETWEET_RATIO] = has_retweet_ratio(userID,tweetsDF,0.9) 485 | features[HIGH_TWEET_LINK_RATIO] = has_tweet_links_ratio(userID, tweetsDF,0.9) 486 | 487 | return features 488 | 489 | def get_stringhini_features(dataframes): 490 | ''' 491 | 492 | ''' 493 | stringhiniFeatures = [] 494 | 495 | usersDF = dataframes['users'] 496 | friendsDF = dataframes['friends'] 497 | tweetsDF = dataframes['tweets'] 498 | 499 | LIMIT = 5 500 | limit = 1 501 | 502 | for index, row in usersDF.iterrows(): 503 | #timelog("User "+str(limit)) 504 | stringhiniFeatures.append(get_single_user_stringhini_features(row,usersDF, friendsDF,tweetsDF)) 505 | 506 | #Temporary code, for test purpose 507 | ''' 508 | if(limit > LIMIT): 509 | break 510 | else: 511 | limit = limit +1 512 | ''' 513 | 514 | return stringhiniFeatures 515 | 516 | def get_single_user_stringhini_features(userRow, usersDF,friendsDF, tweetsDF): 517 | ''' 518 | Class A : number of friends, number of friends tweets, friends/(followersˆ2) 519 | 520 | Class B : tweet similarity, URL ratio 521 | ''' 522 | ''' 523 | usersDF.set_index('id', inplace=True) 524 | tweetsDF.set_index('user_id',inplace=True) 525 | friendsDF.set_index('source_id', inplace=True) 526 | ''' 527 | userID = userRow['id'] 528 | 529 | features = {} 530 | 531 | # Class A 532 | timelog("Start stringhini class A") 533 | features[NUMBER_OF_FRIENDS] = get_friends_count(userRow) 534 | features[FRIENDS_TO_FOLLOWERS_RATIO] = get_stringhini_friends_to_followers_ratio(userRow) 535 | timelog("Start class B") 536 | # Class B 537 | features[NUMBER_OF_TWEETS_SENT] = count_user_tweets(userID,tweetsDF) 538 | timelog("count user tweets Ok") 539 | #features[TWEET_SIMILARITY] = 
get_tweet_similarity(userRow,tweetsDF) #comment calculer? 540 | features[URL_RATIO] = get_url_ratio(userID, tweetsDF) 541 | timelog("url ratio Ok") 542 | features[UNIQUE_FRIENDS_NAME_RATIO] = get_unique_friends_name_ratio(userID,usersDF,friendsDF) 543 | timelog("unique friends Ok") 544 | timelog("End class B") 545 | return features 546 | 547 | def get_yang_features(dataframes): 548 | yangFeatures = [] 549 | 550 | usersDF = dataframes['users'] 551 | friendsDF = dataframes['friends'] 552 | followersDF = dataframes['followers'] 553 | tweetsDF = dataframes['tweets'] 554 | 555 | LIMIT = 10 556 | limit = 1 557 | 558 | for index, row in usersDF.iterrows(): 559 | yangFeatures.append(get_single_user_yang_features(row, usersDF,friendsDF,followersDF,tweetsDF)) 560 | 561 | #timelog("User "+str(limit)) 562 | 563 | #Temporary code, for test purpose 564 | ''' 565 | if(limit > LIMIT): 566 | break 567 | else: 568 | limit = limit +1 569 | ''' 570 | 571 | return yangFeatures 572 | 573 | 574 | def get_single_user_yang_features(userRow, usersDF, friendsDF, followersDF,tweetsDF): 575 | ''' 576 | class A : age, following rate 577 | 578 | Class B : API ratio, API URL ratio, API tweet similarity 579 | 580 | Class C: bi-link ratio, average 581 | neighbors’ followers, average 582 | neighbors’ tweets, followings 583 | to median neighbor’s followers 584 | ''' 585 | userID = userRow['id'] 586 | 587 | features = {} 588 | #timelog("Start processing") 589 | # Class A features 590 | features[ACCOUNT_AGE] = get_account_age(userRow) 591 | #features[FOLLOWING_RATE] = get_following_rate(userRow) # What is following rate? 592 | #timelog("\tClass A finished") 593 | # Class B features 594 | #features[API_RATIO] = get_api_ratio(userRow) 595 | #features[API_URL_RATIO] = get_api_url_ratio(userRow) 596 | #features[API_TWEET_SIMILARITY] = get_api_tweet_similarity(userRow) 597 | 598 | # Class C features 599 | #features[BILINK_RATIO] = get_bilink_ratio(userRow, friendsDF, followersDF) 600 | features[AVERAGE_NEIGHBORS_FOLLOWERS] = get_average_neighbors_followers(userID,friendsDF,usersDF) 601 | features[AVERAGE_NEIGHBORS_TWEETS] = get_average_neighbors_tweets(userID, usersDF,friendsDF, tweetsDF) 602 | #features[FOLLOWINGS_TO_MEDIAN_NEIGHBORS_FOLLOWERS] = get_followings_to_median(userRow) 603 | #timelog("\tClass C finished") 604 | return features 605 | 606 | def get_dataframes(featureSetName, datasetDirectory): 607 | ''' 608 | - BaseFilesDirectory is the base directory of the dataset we are using : 609 | E13,FAK,FSF,HUM,INT,TFP,TWT. 610 | - featureSetName is the name of the feature set for which we want to load the dataframes. 611 | 612 | The function returns the proper required dataframes for the featureSetName specified. 
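    Example call (paths are illustrative, see the __main__ block below):
        dataframes = get_dataframes('yang', 'data/E13')
        # -> {'users': DataFrame, 'tweets': DataFrame, 'friends': DataFrame, 'followers': DataFrame}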
613 | ''' 614 | fileNames = {} 615 | dataframes = {} 616 | 617 | if(featureSetName == CAMISANI): 618 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 619 | elif(featureSetName == STATEOFSEARCH): 620 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 621 | 622 | elif(featureSetName == SOCIALBAKERS): 623 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 624 | 625 | elif(featureSetName == STRINGHINI): 626 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv','friends' : 'friends.csv'} 627 | 628 | elif(featureSetName == YANG): 629 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv','friends': 'friends.csv' 630 | ,'followers': 'followers.csv'} 631 | 632 | elif(featureSetName == CLASS_A): 633 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv'} 634 | 635 | elif(featureSetName == CLASS_C): 636 | fileNames = {'users' : 'users.csv', 'tweets' : 'tweets.csv','friends': 'friends.csv' 637 | ,'followers': 'followers.csv'} 638 | 639 | # We load the dataframes from the files specified above, in a dataframe dictionary 640 | for key, filename in fileNames.items(): 641 | 642 | totalPath = datasetDirectory + '/'+ filename 643 | #print(totalPath) 644 | timelog("Loading "+ totalPath) 645 | 646 | try: 647 | dataframes[key] = pd.read_csv(totalPath, encoding='latin-1').fillna('') 648 | #dataframes[key] = pd.read_csv(totalPath).fillna('') 649 | except Exception as e: 650 | print("Error while reading file "+totalPath) 651 | print(e) 652 | return dataframes 653 | 654 | ''' 655 | TODO: create functions that retrieve each individual feature, below 656 | ''' 657 | 658 | # Class A features 659 | 660 | def has_name(userRow): 661 | res = (not userRow['name'] == "") 662 | 663 | if(res): 664 | return 1 665 | else: 666 | return 0 667 | 668 | def has_image(userRow): 669 | res = (not userRow['profile_image_url'] == "") 670 | 671 | if(res): 672 | return 1 673 | else: 674 | return 0 675 | 676 | def has_address(userRow): 677 | res = (not userRow['location'] == "") 678 | 679 | if(res): 680 | return 1 681 | else: 682 | return 0 683 | 684 | def has_bio(userRow): 685 | res = (not userRow['description'] == "") 686 | 687 | if(res): 688 | return 1 689 | else: 690 | return 0 691 | 692 | def has_30_followers(userRow): 693 | res = int(userRow['followers_count']) >= 30 694 | 695 | if(res): 696 | return 1 697 | else: 698 | return 0 699 | 700 | def belongs_to_a_list(userRow): 701 | res = int(userRow['listed_count']) > 0 702 | 703 | if(res): 704 | return 1 705 | else: 706 | return 0 707 | 708 | def has_50_tweets(userRow): 709 | res = userRow['statuses_count'] > 50 710 | 711 | if(res): 712 | return 1 713 | else: 714 | return 0 715 | 716 | def url_in_profile(userRow): 717 | res = 0 718 | if isinstance(userRow['description'], str) and has_url(userRow['description']): 719 | res = 1 720 | return res 721 | 722 | def followers_to_friends_ration_over_2(userRow): 723 | friendsCount = int(userRow['friends_count']) 724 | 725 | if(friendsCount == 0): 726 | friendsCount = 0.001 727 | 728 | res = int(userRow['followers_count'])/friendsCount > 2 729 | 730 | if(res): 731 | return 1 732 | else: 733 | return 0 734 | 735 | def bot_in_bio(userRow): 736 | # https://stackoverflow.com/questions/11144389/find-all-upper-lower-and-mixed-case-combinations-of-a-string 737 | bot_list = map(''.join, itertools.product(*((c.upper(), c.lower()) for c in 'bot'))) 738 | 739 | res = 0 740 | #timelog("start") 741 | if isinstance(userRow['description'],str): 742 | for bot_combination in bot_list: 743 | if 
bot_combination in userRow['description']: 744 | res = 1 745 | #timelog("end") 746 | 747 | return res 748 | 749 | def friends_to_followers_ratio_is_100(userRow): 750 | threshold = 100 751 | 752 | res = get_friends_to_followers_ratio(userRow) >= threshold 753 | 754 | if(res): 755 | return 1 756 | else: 757 | return 0 758 | 759 | def duplicate_profile_picture(userRow,usersDF): 760 | ''' 761 | This functions checks if the name of the profile picture link is duplicated. 762 | ''' 763 | clean = lambda x: x.split('/')[-1] 764 | image = userRow['profile_image_url'].split('/')[-1] 765 | img_column = usersDF['profile_image_url'].apply(clean) 766 | 767 | res = not image in img_column.unique() 768 | 769 | if(res): 770 | return 1 771 | else: 772 | return 0 773 | 774 | def get_account_age(userRow): 775 | # Date format : Thu Apr 06 15:24:15 +0000 2017 776 | creation_date = userRow['created_at'] 777 | 778 | account_creation = datetime.strptime(creation_date,'%a %b %d %H:%M:%S %z %Y') 779 | #Two lines below, to prevent error : can't subtract offset-naive and offset-aware datetimes 780 | # Solution from https://stackoverflow.com/questions/796008/cant-subtract-offset-naive-and-offset-aware-datetimes 781 | timezone = account_creation.tzinfo 782 | today = datetime.now(timezone) 783 | 784 | return (today - account_creation).days 785 | 786 | def get_following_rate(userRow): 787 | ''' 788 | following rate: this metric reflects the speed at which an 789 | accounts follows other accounts. Spammers usually feature high 790 | values of this rate. 791 | ''' 792 | return int(userRow['friends_count'])/int(userRow['followers_count']) 793 | 794 | def get_friends_count(userRow): 795 | return int(userRow['friends_count']) 796 | 797 | 798 | def get_friends_tweet_count(userRow,friendsDF,usersDF): 799 | userID = userRow['id'] 800 | 801 | friends_id_list = get_friends_ids(userID,friendsDF) 802 | friends_count = len(friends_id_list) 803 | 804 | tweetUsers = usersDF[usersDF['id'].isin(friends_id_list)] 805 | 806 | userTweetCount = lambda user: count_user_tweets(user['id'],friendsDF) 807 | res = tweetUsers.apply(userTweetCount).sum() 808 | 809 | return res 810 | 811 | 812 | def get_friends_to_followers_ratio(userRow): 813 | res = 0 814 | if int(userRow['followers_count']) != 0: 815 | res = int(userRow['friends_count'])/int(userRow['followers_count']) 816 | return res 817 | 818 | def get_stringhini_friends_to_followers_ratio(userRow): 819 | followers = int(userRow['followers_count']) 820 | 821 | if(followers > 0): 822 | return int(userRow['friends_count'])/(followers*followers) 823 | else: 824 | return int(userRow['friends_count'])/0.01 825 | 826 | def has_x50_followers(userRow): 827 | followers = int(userRow['followers_count']) 828 | friends = int(userRow['friends_count']) 829 | 830 | if(followers ==0): 831 | followers = 0.001 832 | 833 | res = (friends/followers)>= 50 834 | 835 | if(res): 836 | return 1 837 | else: 838 | return 0 839 | 840 | def has_default_image(userRow): 841 | res = userRow['default_profile_image'] 842 | 843 | if(res): 844 | return 1 845 | else: 846 | return 0 847 | 848 | def has_no_bio(userRow): 849 | if(not userRow['description']): 850 | return 1 851 | else: 852 | return 0 853 | 854 | def has_no_location(userRow): 855 | if(not userRow['location']): 856 | return 1 857 | else: 858 | return 0 859 | 860 | def has_100_friends(userRow): 861 | res = get_friends_count(userRow) >= 100 862 | 863 | if(res): 864 | return 1 865 | else: 866 | return 0 867 | 868 | def has_no_tweets(userID, tweetsDF): 869 | res = not 
has_tweets(userID, tweetsDF) 870 | 871 | if(res): 872 | return 1 873 | else: 874 | return 0 875 | 876 | 877 | # Class B features 878 | def geolocalized(userRow,tweetsDF): 879 | tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 880 | geo = tweets['geo'] != "" 881 | res = not tweets[geo].empty 882 | 883 | if(res): 884 | return 1 885 | else: 886 | return 0 887 | 888 | def is_favorite(userRow,tweetsDF): 889 | #tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 890 | tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 891 | fav = tweets['favorite_count'] != 0 892 | return int(not tweets[fav].empty) 893 | 894 | def uses_punctuation(userRow,tweetsDF): 895 | # https://mail.python.org/pipermail/tutor/2001-October/009454.html 896 | bio_and_timeline = "" 897 | if isinstance(userRow['description'], str): 898 | bio_and_timeline += userRow['description'] 899 | bio_and_timeline += get_tweets_strings(int(userRow['id']),tweetsDF) 900 | 901 | res = 0 902 | 903 | for letter in bio_and_timeline: 904 | if letter in string.punctuation: 905 | res = 1 906 | return res 907 | 908 | def uses_hashtag(userRow,tweetsDF): 909 | res = '#' in get_tweets_strings(int(userRow['id']),tweetsDF) 910 | 911 | if(res): 912 | return 1 913 | else: 914 | return 0 915 | 916 | def uses_iphone(userRow,tweetsDF): 917 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 918 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 919 | res = "Iphone" in all_tweets['source'].str.cat() 920 | 921 | if(res): 922 | return 1 923 | else: 924 | return 0 925 | 926 | def uses_android(userRow,tweetsDF): 927 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 928 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 929 | res = "Android" in all_tweets['source'].str.cat() 930 | 931 | if(res): 932 | return 1 933 | else: 934 | return 0 935 | 936 | def uses_foursquare(userRow,tweetsDF): 937 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 938 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 939 | res = "foursquare" in all_tweets['source'].str.cat() 940 | 941 | if(res): 942 | return 1 943 | else: 944 | return 0 945 | 946 | def uses_instagram(userRow,tweetsDF): 947 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 948 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 949 | res = "Instagram" in all_tweets['source'].str.cat() 950 | 951 | if(res): 952 | return 1 953 | else: 954 | return 0 955 | 956 | def uses_twitterdotcom(userRow,tweetsDF): 957 | #all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 958 | all_tweets = cache.get_user_tweets(int(userRow['id']),tweetsDF) 959 | res = "web" in all_tweets['source'].str.cat() 960 | 961 | if(res): 962 | return 1 963 | else: 964 | return 0 965 | 966 | def userid_in_tweet(userRow,tweetsDF): 967 | res = str(userRow['id']) in get_tweets_strings(userRow['id'],tweetsDF) 968 | 969 | if(res): 970 | return 1 971 | else: 972 | return 0 973 | 974 | def tweets_with_url(userRow,tweetsDF): 975 | return has_url(get_tweets_strings(userRow['id'],tweetsDF)) 976 | 977 | def retweet_over_1(userRow,tweetsDF): 978 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 979 | # Any checks if the conditions happens once in the Series. 
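    # e.g. retweet_count values [0, 3, 1] give [False, True, False] for "> 1", so .any() is True and the feature is 1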
980 | return int((all_tweets['retweet_count'] > 1).any()) 981 | 982 | def uses_different_clients(userRow,tweetsDF): 983 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 984 | res = len(all_tweets['source'].unique()) > 1 985 | 986 | if(res): 987 | return 1 988 | else: 989 | return 0 990 | 991 | # https://stackoverflow.com/questions/17388213/find-the-similarity-percent-between-two-strings 992 | def similar(a, b): 993 | return SequenceMatcher(None, a, b).ratio() 994 | 995 | 996 | def duplicate_sentences_across_tweets(userRow,tweetsDF): 997 | res = False 998 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 999 | counter = 0 1000 | for index, tweet in all_tweets.iterrows(): 1001 | other_counter = 0 1002 | for other_index, other_tweet in all_tweets.iterrows(): 1003 | if tweet['text'] == other_tweet['text'] and index != other_index : 1004 | # For more real similarity check 1005 | # if similar(tweet['text'],other_tweet['text']) > 0.7 and index != other_index : 1006 | res = True 1007 | break 1008 | if other_counter == 20: 1009 | break 1010 | other_counter+=1 1011 | if counter == 20: 1012 | break 1013 | counter+=1 1014 | 1015 | return int(res) 1016 | 1017 | def api_tweets(userRow,tweetsDF): 1018 | all_tweets = get_tweets_dataframe_user(int(userRow['id']),tweetsDF) 1019 | res = not "twitter.com" in all_tweets['source'].str.cat() 1020 | 1021 | if(res): 1022 | return 1 1023 | else: 1024 | return 0 1025 | 1026 | def get_api_ratio(userID, tweetsDF): 1027 | # tweets sent from api over total number of tweets 1028 | tweets = get_tweets_dataframe_user(userID, tweetsDF) 1029 | tweetsCount = len(tweets.tolist()) 1030 | tweets_api = get_api_tweets_count(userID, tweets) 1031 | 1032 | if(userTweets > 0): 1033 | return tweets_api/userTweets 1034 | else: 1035 | return 0 1036 | 1037 | def get_api_url_ratio(userRow): 1038 | return 0 1039 | 1040 | def get_api_tweet_similarity(userRow): 1041 | return 0 1042 | 1043 | def get_tweet_similarity(userID,tweetsDF): 1044 | return 0 1045 | 1046 | def get_url_ratio(userID, tweetsDF): 1047 | '''ratio of tweets with a url''' 1048 | return get_tweets_with_url_ratio(userID,tweetsDF) 1049 | 1050 | def get_unique_friends_name_ratio(userID,usersDF,friendsDF): 1051 | friends_with_name = get_friends_with_initialized_name(userID, usersDF,friendsDF) 1052 | unique_names_count = count_unique_names(friends_with_name) 1053 | 1054 | #avoid division by zero, so we return a big number 1055 | if(unique_names_count == 0): 1056 | unique_names_count = 0.001 1057 | 1058 | return len(friends_with_name)/unique_names_count 1059 | 1060 | def has_duplicate_tweets(userID, tweetsDF,duplicate_threshold): 1061 | tweetsUser = get_tweets_dataframe_user(userID,tweetsDF) 1062 | 1063 | uniqueTweets = tweetsUser['text'].unique() 1064 | 1065 | userTweets = tweetsUser.shape[0] 1066 | uniqueCount = len(uniqueTweets.tolist()) 1067 | 1068 | if(uniqueCount < userTweets): 1069 | return 1 1070 | else: 1071 | return 0 1072 | 1073 | def has_retweet_ratio(userID,tweetsDF, ratio_threshold): 1074 | ''' 1075 | Returns true if the ratio calculated is superior or equal to the ratio_threshold. 1076 | Returns false otherwise. 1077 | ''' 1078 | if(retweet_ratio(userID,tweetsDF)>= ratio_threshold): 1079 | return 1 1080 | else: 1081 | return 0 1082 | 1083 | def has_tweet_links_ratio(userID, tweetsDF, ratio_threshold): 1084 | ''' 1085 | Returns true if the ratio calculated is superior or equal to the ratio_threshold. 1086 | Returns false otherwise. 
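    Example: with ratio_threshold = 0.9, a user whose tweets contain a URL 95% of the
    time gets 1, while a user at 50% gets 0.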
1087 |     '''
1088 | 
1089 |     ratio = get_url_ratio(userID, tweetsDF)
1090 | 
1091 |     res = ratio >= ratio_threshold
1092 | 
1093 |     return int(res)
1094 | 
1095 | 
1096 | # Class C features
1097 | def get_bilink_ratio(userRow, friendsDF, followersDF):
1098 |     userID = userRow['id']
1099 |     friends_count = int(userRow['friends_count'])
1100 |     # A bi-directional link is when two accounts follow each other
1101 |     friends = get_friends_ids(userID,friendsDF)
1102 | 
1103 |     followersSeries = get_follower_ids(userID,followersDF)
1104 | 
1105 |     # Followers that the user follows back
1106 |     bilinkList = followersSeries.isin(friends)
1107 | 
1108 |     bilink_count = bilinkList.sum()
1109 |     #print("===== User ID = "+str(userID))
1110 |     #print("[Bilink count : {}][Official Friends count : {}][Official Followers count : {}], [followers actually found : {} ]".format(bilink_count,friends_count,userRow['followers_count'], len(followersSeries)))
1111 |     if(friends_count == 0):
1112 |         return 0
1113 |     return bilink_count/friends_count
1114 | 
1115 | def get_average_neighbors_followers(userID,friendsDF,usersDF):
1116 |     #Average number of followers of the friends of the user.
1117 |     friends = get_friends_ids(userID, friendsDF)
1118 | 
1119 |     return get_avg_neighbors_followers(friends,usersDF)
1120 | 
1121 | def get_average_neighbors_tweets(userID, userDF, friendsDF, tweetsDF):
1122 |     #Average number of tweets of the friends of the user.
1123 |     friends = get_friends_ids(userID, friendsDF)
1124 |     return get_avg_friends_tweets(friends,tweetsDF)
1125 | 
1126 | 
1127 | def get_followings_to_median(userRow):
1128 |     '''
1129 |     Ratio between the number of friends and the median number of followers of its friends.
1130 |     '''
1131 |     return 0
1132 | 
1133 | def timelog(message):
1134 |     print(datetime.now().strftime('%H:%M:%S')+' '+message)
1135 | '''
1136 | To use (prototype), run from the root directory.
1137 | Command example : python3 src/features.py "data/E13/" yang
1138 | '''
1139 | if(__name__ == "__main__"):
1140 |     directory = sys.argv[1]
1141 |     featureSetName = sys.argv[2]
1142 | 
1143 |     dataframes = get_dataframes(featureSetName, directory)
1144 | 
1145 |     timelog("Dataframes loaded")
1146 |     #print("=========== Getting features ===========")
1147 |     #print(time())
1148 |     t0 = time()
1149 |     features = pd.DataFrame(get_features(featureSetName, dataframes))
1150 |     tf = time() - t0
1151 |     #print("========== Elapsed time ==========")
1152 |     #print(tf)
1153 |     #timelog("Features : ")
1154 |     print(features)
1155 | 
-------------------------------------------------------------------------------- /src/generateBAS.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import features
3 | import sys
4 | import os
5 | from time import time
6 | 
7 | '''
8 | The goal of this module is to generate a list of features based on a
9 | base directory (ex: E13, FAK, etc) and the name of the feature set (class A, class C).
10 | 
11 | The generated list is the baseline feature set and will be created in the directory
12 | specified in parameters.
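Usage sketch (dataset paths are examples):
    python3 src/generateBAS.py data/HUM A human
    python3 src/generateBAS.py data/FAK C bot
The optional third argument labels every row ('human' -> 0, 'bot' -> 1).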
Ex: E13/baseline.csv 13 | ''' 14 | 15 | def get_BAS_dataset(hum_path,fak_path): 16 | filename_A = "features_A.csv" 17 | filename_C = "features_C.csv" 18 | 19 | hum_df_A = pd.read_csv(hum_path+filename_A, encoding='latin-1',sep='\t') 20 | fak_df_A = pd.read_csv(fak_path+filename_A, encoding='latin-1',sep='\t') 21 | hum_df_C = pd.read_csv(hum_path+filename_C, encoding='latin-1',sep='\t') 22 | fak_df_C = pd.read_csv(fak_path+filename_C, encoding='latin-1',sep='\t') 23 | 24 | frames_A = [hum_df_A,fak_df_A] 25 | frames_C = [hum_df_C,fak_df_C] 26 | 27 | df_A = pd.concat(frames_A) 28 | df_C = pd.concat(frames_C) 29 | 30 | # randomize features, before cross_validation 31 | df_A = df_A.sample(frac=1).reset_index(drop=True) 32 | df_C = df_C.sample(frac=1).reset_index(drop=True) 33 | 34 | labels_A = df_A['label'].tolist() 35 | labels_C = df_C['label'].tolist() 36 | return labels_A,labels_C,df_A,df_C 37 | 38 | 39 | def save_features_in_file(path, featuresDF, featureSetName): 40 | 41 | filename = path + "/features_"+featureSetName+".csv" 42 | print("Current directory :"+os.getcwd()) 43 | print("Filename : "+filename) 44 | featuresDF.to_csv(filename,sep='\t') 45 | 46 | def labelize_features(featuresDF, label): 47 | labelVal = 0 48 | 49 | if(label == 'human'): 50 | labelVal = 0 51 | elif(label == 'bot'): 52 | labelVal = 1 53 | 54 | featuresDF['label'] = labelVal 55 | 56 | 57 | if(__name__ == "__main__"): 58 | # Command example. Load class A : python3 src/main.py data/E13 A 59 | # Command example. Load class C : python3 src/main.py data/E13 C 60 | 61 | #Get the dataset name (E13, FAK, FSF,HUM,etc) 62 | argsLen = len(sys.argv) 63 | 64 | datasetDirectory = sys.argv[1] 65 | featureSetName = sys.argv[2] 66 | 67 | 68 | 69 | dataframes = features.get_dataframes(featureSetName,datasetDirectory) 70 | 71 | print("=========== Getting features ===========") 72 | print(time()) 73 | t0 = time() 74 | 75 | featuresData = features.get_features(featureSetName, dataframes) 76 | 77 | featuresData = pd.DataFrame(featuresData) 78 | 79 | if(argsLen > 3): 80 | label = sys.argv[3] 81 | labelize_features(featuresData,label) 82 | 83 | tf = time() - t0 84 | print("========== Elapsed time ==========") 85 | print(tf) 86 | save_features_in_file(datasetDirectory, featuresData,featureSetName) 87 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | from classifiers import * 2 | from features import * 3 | from metrics import * 4 | import sys 5 | from generateBAS import * 6 | 7 | def main(): 8 | 9 | 10 | # data frame with our dataset 11 | labels_A, labels_C, df_features_class_A, df_features_class_C = get_BAS_dataset("data/HUM/","data/FAK/") 12 | # preprocessing of the data to capture the features we want 13 | #df_features_class_A = get_class_A_features(df) 14 | #df_features_class_C = get_class_C_features(df) 15 | 16 | # classifiers (training + prediction) 17 | class_A_classifiers_predictions = classify(df_features_class_A,labels_A) 18 | class_C_classifiers_predictions = classify(df_features_class_C,labels_C) 19 | 20 | # Metrics on the result predictions from the classifiers 21 | results_class_A_classifiers = metrics(labels_A, class_A_classifiers_predictions) 22 | results_class_C_classifiers = metrics(labels_C, class_C_classifiers_predictions) 23 | 24 | # publish and save results 25 | print("Metrics of the Class A classifiers(profile)") 26 | for key,value in results_class_A_classifiers.items(): 27 | 
print("Algorithm: "+key) 28 | print("Accuracy: "+ str(round(value[0],3))+ " Precision: "+str(round(value[1],3))) 29 | print("Recall: "+ str(round(value[2],3))+ " F-M: "+str(round(value[3],3))) 30 | print("MCC: "+ str(round(value[4],3))+ " AUC: "+str(round(value[5],3))) 31 | print() 32 | #publish(results_class_A_classifiers) 33 | print("===================== oOo ========================") 34 | print("Metrics of the Class C classifiers(all features)") 35 | for key,value in results_class_C_classifiers.items(): 36 | print("Algorithm: "+key) 37 | print("Accuracy: "+ str(round(value[0],3))+ " Precision: "+str(round(value[1],3))) 38 | print("Recall: "+ str(round(value[2],3))+ " F-M: "+str(round(value[3],3))) 39 | print("MCC: "+ str(round(value[4],3))+ " AUC: "+str(round(value[5],3))) 40 | print() 41 | #publish(results_class_C_classifiers) 42 | 43 | 44 | if(__name__ == "__main__"): 45 | # Command example. Load class A : python3 src/main.py data/E13 A 46 | # Command example. Load class C : python3 src/main.py data/E13 C 47 | 48 | #Get the dataset name (E13, FAK, FSF,HUM,etc) 49 | #baseDataset = sys.argv[1] 50 | #featureSetName = sys.argv[2] 51 | 52 | #dataframes = features.get_dataframes(baseDataset_A, featureSetName) 53 | 54 | #main(baseDataset) 55 | main() 56 | 57 | -------------------------------------------------------------------------------- /src/metrics.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import confusion_matrix 2 | from sklearn.metrics import accuracy_score 3 | from sklearn.metrics import precision_score 4 | from sklearn.metrics import recall_score 5 | from sklearn.metrics import f1_score 6 | from sklearn.metrics import matthews_corrcoef 7 | from sklearn.metrics import roc_auc_score 8 | def metrics(labels,pred_dict): 9 | 10 | results_dict = {} 11 | # Confusion matrix 12 | # labels : array, shape = [n_samples] 13 | # Ground truth (correct) target values. 14 | # pred : array, shape = [n_samples] 15 | # Estimated targets as returned by a classifier. 16 | 17 | for key,pred in pred_dict.items(): 18 | #try: 19 | #tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel() 20 | #np.set_printoptions(precision=2) 21 | try: 22 | # Accuracy 23 | acc = accuracy_score(labels, pred) 24 | except ValueError: 25 | acc = 0 26 | try: 27 | # Precision 28 | pres = precision_score(labels, pred, average='binary') 29 | except ValueError: 30 | pres = 0 31 | try: 32 | # Recall 33 | rec = recall_score(labels, pred, average='binary') 34 | except ValueError: 35 | rec = 0 36 | try: 37 | # F-measure 38 | f1 = f1_score(labels, pred, average='binary') 39 | except ValueError: 40 | f1 = 0 41 | try: 42 | # MCC 43 | mcc = matthews_corrcoef(labels, pred) 44 | except ValueError: 45 | mcc = 0 46 | try: 47 | # AUC 48 | auc_score = roc_auc_score(labels, pred) 49 | except ValueError: 50 | auc_score = 0 51 | # 52 | # Coud be this one must be rechecked with paper 53 | # from sklearn.metrics import roc_auc_score 54 | # roc_auc_score(labels, y_scores) 55 | results_dict[key] = [acc,pres,rec,f1,mcc,auc_score] 56 | #except ValueError: 57 | # pass 58 | return results_dict -------------------------------------------------------------------------------- /src/tweets.py: -------------------------------------------------------------------------------- 1 | from url_finder import * 2 | import pandas as pd 3 | import numpy as np 4 | from time import * 5 | ''' 6 | In this module, we process tweets, iterate and extract data from them. 
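The helpers below expect a pandas DataFrame loaded from tweets.csv with at least
the columns 'user_id', 'text' and 'retweet_count' (see get_dataframes in features.py).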
7 | '''
8 | def has_tweets(userID,tweetsDF):
9 |     # True if the user has at least one tweet in the dataset
10 |     return not tweetsDF[tweetsDF['user_id'] == userID].empty
11 | 
12 | def get_tweets_count(userID,tweetsDF):
13 |     '''
14 |     Counts the number of tweets posted by a given user.
15 |     '''
16 |     userTweets = get_tweets_user(userID,tweetsDF)
17 | 
18 |     return len(userTweets)
19 | 
20 | def get_tweets_user(userID,tweetsDF):
21 |     '''
22 |     This function returns the list of tweets belonging to a particular userID
23 |     '''
24 |     return tweetsDF['text'][tweetsDF['user_id'] == userID].tolist()
25 | 
26 | def get_tweets_dataframe_user(userID,tweetsDF):
27 |     '''
28 |     This function returns the tweets belonging to a particular userID
29 |     '''
30 |     #t0 = time()
31 |     user_id = tweetsDF['user_id'] == userID
32 |     res = tweetsDF[user_id]
33 |     #print ("tweet user:", round(time()-t0, 3), "s")
34 |     return res
35 | 
36 | def count_user_tweets(userID, tweetsDF):
37 |     return len(get_tweets_user(userID,tweetsDF))
38 | 
39 | 
40 | def get_avg_friends_tweets(friendsIDlist,tweetsDF):
41 |     friends_count = len(friendsIDlist)
42 |     total_tweet_count = 0
43 |     friendsIDlist = pd.Series(friendsIDlist)
44 | 
45 |     total_tweet_count = tweetsDF[tweetsDF['user_id'].isin(friendsIDlist)].shape[0]
46 |     if friends_count == 0:
47 |         friends_count = 0.001
48 |     return total_tweet_count/friends_count
49 | 
50 | 
51 | def get_tweets_strings(userID,tweetsDF):
52 |     '''
53 |     Searches for a user's tweets and concatenates all the text from
54 |     the tweets into one string
55 |     '''
56 |     tweets_user = get_tweets_dataframe_user(userID,tweetsDF)
57 |     res = tweets_user['text'].str.cat()
58 |     return res
59 | 
60 | 
61 | def get_tweets_with_url_ratio(userID, tweetsDF):
62 |     '''
63 |     Ratio of the user's tweets that contain at least one URL.
64 |     '''
65 |     user_tweets = get_tweets_dataframe_user(userID, tweetsDF)
66 |     user_tweets_count = user_tweets.shape[0]
67 | 
68 |     if(user_tweets_count == 0):
69 |         return 0
70 | 
71 |     tweets_with_url = user_tweets['text'].apply(lambda tweet: 1 if tweet_contains_url(tweet) else 0).sum()
72 | 
73 |     return float(tweets_with_url)/float(user_tweets_count)
74 | 
75 | 
76 | def tweet_contains_url(tweet):
77 |     return has_url(tweet)
78 | 
79 | def get_api_tweets_count(userID, tweetsDF):
80 |     '''
81 |     Counts the tweets of a given user whose text mentions the API.
82 |     '''
83 |     user_tweets = get_tweets_dataframe_user(userID, tweetsDF)
84 |     api_tweets = user_tweets[user_tweets['text'].apply(lambda tweet: tweet.count("API") > 0)]
85 | 
86 |     return api_tweets.shape[0]
87 | 
88 | def retweet_ratio(userID, tweetsDF):
89 |     tweets = tweetsDF[tweetsDF['user_id'] == userID]
90 | 
91 |     tweetsCount = tweets.shape[0]
92 | 
93 |     retweets = tweets[tweets['retweet_count'] > 0].shape[0]
94 | 
95 |     if(tweetsCount == 0):
96 |         #print("-----------------Retweet 0")
97 |         return 0
98 | 
99 |     else:
100 |         res = retweets/tweetsCount
101 |         #print("-----------------Retweet : "+str(res))
102 |         return res
-------------------------------------------------------------------------------- /src/url_finder.py: --------------------------------------------------------------------------------
1 | '''
2 | This script handles the search of a URL in a given text.
3 | It can also return the urls in a given text 4 | ''' 5 | import re 6 | 7 | #Taken from : https://gist.github.com/gruber/8891611 (gruber/Liberal Regex Pattern for Web URLs) 8 | #URL_REGEX = re.compile('''(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?