├── NAMESPACE ├── .Rbuildignore ├── figures ├── dw-nom.png ├── face_validity.png └── model_performance.png ├── data ├── ideo_tweets.rda └── pol_tweets.rda ├── .gitignore ├── man ├── complete_setup.Rd ├── texts_to_vectors.Rd ├── train_test_split.Rd ├── tweets_to_df.Rd ├── prepare_glove_embeddings.Rd ├── prepare_w2v_embeddings.Rd ├── scrape_tweets.Rd ├── evaluate.Rd ├── train_lstm.Rd └── predict_ideology.Rd ├── deepIdeology.Rproj ├── DESCRIPTION ├── R ├── scrape_tweets.R ├── predict.R ├── word_embeddings.R └── train_models.R └── README.md /NAMESPACE: -------------------------------------------------------------------------------- 1 | exportPattern("^[[:alpha:]]+") 2 | -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^data-raw$ -------------------------------------------------------------------------------- /figures/dw-nom.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/figures/dw-nom.png -------------------------------------------------------------------------------- /data/ideo_tweets.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/data/ideo_tweets.rda -------------------------------------------------------------------------------- /data/pol_tweets.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/data/pol_tweets.rda -------------------------------------------------------------------------------- /figures/face_validity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/figures/face_validity.png -------------------------------------------------------------------------------- /figures/model_performance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/figures/model_performance.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | embeddings 6 | glove.twitter.27B 7 | tokenizers 8 | models -------------------------------------------------------------------------------- /man/complete_setup.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/scrape_tweets.R 3 | \name{complete_setup} 4 | \alias{complete_setup} 5 | \title{complete_setup} 6 | \usage{ 7 | complete_setup() 8 | } 9 | \description{ 10 | This function should be called after package installation to properly set up dependencies and create the file caching system. 
11 | } 12 | -------------------------------------------------------------------------------- /deepIdeology.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /man/texts_to_vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{texts_to_vectors} 4 | \alias{texts_to_vectors} 5 | \title{texts_to_vectors} 6 | \usage{ 7 | texts_to_vectors(texts, tokenizer) 8 | } 9 | \arguments{ 10 | \item{texts}{Character vector of raw text data} 11 | 12 | \item{tokenizer}{Pre-fit keras tokenizer} 13 | } 14 | \value{ 15 | matrix of vectorized texts 16 | } 17 | \description{ 18 | Helper function to vectorize text data 19 | } 20 | -------------------------------------------------------------------------------- /man/train_test_split.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{train_test_split} 4 | \alias{train_test_split} 5 | \title{train_test_split} 6 | \usage{ 7 | train_test_split(X, y, test_size = 0.2) 8 | } 9 | \arguments{ 10 | \item{X}{data.frame or matrix of data} 11 | 12 | \item{y}{Labels (optional).} 13 | 14 | \item{test_size}{Proportion of samples to set aside for testing.} 15 | } 16 | \value{ 17 | List of X_train, X_test, y_train, y_test 18 | } 19 | \description{ 20 | Helper function to split data into training and testing sets. 
21 | } 22 | -------------------------------------------------------------------------------- /man/tweets_to_df.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/scrape_tweets.R 3 | \name{tweets_to_df} 4 | \alias{tweets_to_df} 5 | \title{tweets_to_df} 6 | \usage{ 7 | tweets_to_df(tweet_dir, keep_retweets = FALSE) 8 | } 9 | \arguments{ 10 | \item{tweet_dir}{Directory where scraped Tweets are stored} 11 | 12 | \item{keep_retweets}{If FALSE (the default), retweets are discarded.} 13 | } 14 | \value{ 15 | data.frame of Tweets with metadata 16 | } 17 | \description{ 18 | This function takes a directory of JSON files containing scraped Tweets and returns a data.frame. 19 | } 20 | \examples{ 21 | tweet_df <- tweets_to_df("data/scraped_tweets", keep_retweets = FALSE) 22 | } 23 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: deepIdeology 2 | Type: Package 3 | Title: Scale Ideological Slant of Twitter Posts 4 | Version: 0.1.0 5 | Author: Alex Gottlieb 6 | Maintainer: Alex Gottlieb 7 | Description: This package allows users to identify the ideological leanings of Twitter posts with benchmark accuracy 8 | using a Long Short-Term Memory recurrent neural network model trained on a data set of Tweets labeled through 9 | the Amazon Mechanical Turk crowd-sourcing platform. The best-performing models are able to classify Tweets as 10 | liberal- or conservative-leaning with 86.90% accuracy and are able to capture both directionality and degree of 11 | slant. 12 | Depends: 13 | R (>= 3.4.4), 14 | dplyr, 15 | keras 16 | License: MIT 17 | Encoding: UTF-8 18 | LazyData: true 19 | RoxygenNote: 6.1.1.9000 20 | Imports: 21 | purrr 22 | -------------------------------------------------------------------------------- /man/prepare_glove_embeddings.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word_embeddings.R 3 | \name{prepare_glove_embeddings} 4 | \alias{prepare_glove_embeddings} 5 | \title{prepare_glove_embeddings} 6 | \usage{ 7 | prepare_glove_embeddings(embedding_dim, tokenizer) 8 | } 9 | \arguments{ 10 | \item{embedding_dim}{Dimensionality of word embeddings. Options are 25, 50, 100, 200.} 11 | 12 | \item{tokenizer}{Pre-fit keras text tokenizer.} 13 | } 14 | \description{ 15 | This function prepares an embedding matrix containing the words in the training data set from pre-trained GloVe embeddings. 16 | } 17 | \details{ 18 | For more information on the GloVe embedding algorithm, visit https://nlp.stanford.edu/projects/glove/. 19 | } 20 | \note{ 21 | The GloVe embeddings are 1.3G zipped and 3.8G unzipped. 
22 | 23 | Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_glove_\{embedding_dim\}.rda" 24 | } 25 | -------------------------------------------------------------------------------- /man/prepare_w2v_embeddings.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word_embeddings.R 3 | \name{prepare_w2v_embeddings} 4 | \alias{prepare_w2v_embeddings} 5 | \title{prepare_w2v_embeddings} 6 | \usage{ 7 | prepare_w2v_embeddings(texts, embedding_dim, tokenizer) 8 | } 9 | \arguments{ 10 | \item{texts}{Character vector of raw text from training data.} 11 | 12 | \item{embedding_dim}{Dimensionality of word embeddings. Options are 25, 50, 100, 200.} 13 | 14 | \item{tokenizer}{Pre-fit keras text tokenizer.} 15 | } 16 | \description{ 17 | This function trains a word2vec model to create custom word embeddings from the training data set. 18 | } 19 | \details{ 20 | For a good introduction to the word2vec model, see Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013). 21 | } 22 | \note{ 23 | Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_w2v_\{embedding_dim\}.rda" 24 | } 25 | -------------------------------------------------------------------------------- /man/scrape_tweets.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/scrape_tweets.R 3 | \name{scrape_tweets} 4 | \alias{scrape_tweets} 5 | \title{scrape_tweets} 6 | \usage{ 7 | scrape_tweets(screen_names = NULL, ids = NULL, tweets_per_user, 8 | credentials_dir, out_dir) 9 | } 10 | \arguments{ 11 | \item{screen_names}{Character vector of screen names of Twitter users.} 12 | 13 | \item{ids}{Character or integer vector of IDs of Twitter users. Use either (but not both) of these two arguments.} 14 | 15 | \item{tweets_per_user}{Number of tweets to scrape for each user.} 16 | 17 | \item{credentials_dir}{Directory with Twitter OAuth tokens.} 18 | 19 | \item{out_dir}{Name of directory to store scraped Tweets.} 20 | } 21 | \description{ 22 | This function scrapes the most recent n Tweets of a list of Twitter users. 23 | } 24 | \examples{ 25 | data("tweets") 26 | users <- unique(tweets$screen_name) 27 | scrape_tweets(screen_names = users, tweets_per_user = 200, credentials_dir = "credentials", out_dir = "data/scraped_tweets") 28 | } 29 | -------------------------------------------------------------------------------- /man/evaluate.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{evaluate} 4 | \alias{evaluate} 5 | \title{evaluate} 6 | \usage{ 7 | evaluate(model_path, X_test, y_test) 8 | } 9 | \arguments{ 10 | \item{model_path}{Path to HDF5 file containing model. Should be of the form "models/\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5"} 11 | 12 | \item{X_test}{data.frame or matrix of vectorized Tweets} 13 | 14 | \item{y_test}{Labels for testing data. 0 for liberal, 1 for conservative.} 15 | } 16 | \value{ 17 | List of performance metrics. Currently, a confusion matrix, overall prediction accuracy, precision, recall, and F1 score are returned. 18 | } 19 | \description{ 20 | This function evaluates the performance of a trained model. 
21 | } 22 | \examples{ 23 | data("ideo_tweets") 24 | ideo_tokenizer <- text_tokenizer(num_words=20000) 25 | ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 26 | texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 27 | labels <- ideo_tweets$ideo_cat 28 | 29 | train_test <- train_test_split(texts, labels) 30 | 31 | evaluate("models/bi-lstm_w2v_25d.h5", train_test$X_test, train_test$y_test) 32 | } 33 | -------------------------------------------------------------------------------- /man/train_lstm.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{train_lstm} 4 | \alias{train_lstm} 5 | \title{train_lstm} 6 | \usage{ 7 | train_lstm(X_train, y_train, embeddings = "w2v", embedding_dim = 25, 8 | bidirectional = FALSE, convolutional = FALSE) 9 | } 10 | \arguments{ 11 | \item{X_train}{data.frame or matrix of vectorized Tweets} 12 | 13 | \item{y_train}{Labels for training data. 0 for liberal, 1 for conservative.} 14 | 15 | \item{embeddings}{Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization).} 16 | 17 | \item{embedding_dim}{Length of word embeddings to use. Options are 25, 50, 100, or 200.} 18 | 19 | \item{bidirectional}{Optionally train on text sequences in reverse as well as forwards.} 20 | 21 | \item{convolutional}{Optionally apply convolutional filter to text sequences. Can only be used when bidirectional = TRUE.} 22 | } 23 | \description{ 24 | This function trains the LSTM model to identify the ideological slant of Tweets. 25 | } 26 | \note{ 27 | Models are automatically saved in HDF5 format to a sub-folder of the root-directory called "models". File format is "\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5". 28 | } 29 | \examples{ 30 | # train a Bi-LSTM network using GloVe embeddings 31 | data("ideo_tweets") 32 | ideo_tokenizer <- text_tokenizer(num_words=20000) 33 | ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 34 | texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 35 | labels <- ideo_tweets$ideo_cat 36 | 37 | train_test <- train_test_split(texts, labels) 38 | X_train <- train_test$X_train 39 | y_train <- train_test$y_train 40 | train_lstm(X_train, y_train, embeddings="glove", bidirectional=TRUE) 41 | } 42 | -------------------------------------------------------------------------------- /R/scrape_tweets.R: -------------------------------------------------------------------------------- 1 | #' complete_setup 2 | #' 3 | #' This function should be called after package installation to properly set up dependencies and create the file caching system. 4 | #' @export 5 | complete_setup <- function() { 6 | library(keras) 7 | install_keras(tensorflow = "1.9") 8 | 9 | library(devtools) 10 | install_version("rmongodb", version = "1.8.0", repos = "http://cran.us.r-project.org") 11 | install_github("SMAPPNYU/smappR") 12 | 13 | if (!dir.exists("~/.deepIdeology")) { 14 | dir.create("~/.deepIdeology") 15 | } 16 | } 17 | 18 | #' scrape_tweets 19 | #' 20 | #' This function scrapes the most recent n Tweets of a list of Twitter users. 21 | #' @param screen_names Character vector of screen names of Twitter users. 22 | #' @param ids Character or integer vector of IDs of Twitter users. Use either (but not both) of these two arguments. 23 | #' @param tweets_per_user Number of tweets to scrape for each user. 
24 | #' @param credentials_dir Directory with Twitter OAuth tokens. 25 | #' @param out_dir Name of directory to store scraped Tweets. 26 | #' @export 27 | #' @examples 28 | #' data("tweets") 29 | #' users <- unique(tweets$screen_name) 30 | #' scrape_tweets(screen_names = users, tweets_per_user = 200, credentials_dir = "credentials", out_dir = "data/scraped_tweets") 31 | scrape_tweets <- function(screen_names = NULL, ids = NULL, tweets_per_user, credentials_dir, out_dir) { 32 | if (!dir.exists(out_dir)) { 33 | dir.create(out_dir) 34 | } 35 | 36 | scrape_func <- function(x) { 37 | fname <- file.path(out_dir, paste0(x,'_tweets.json')) 38 | tryCatch(smappR::getTimeline(fname, 39 | oauth_folder = credentials_dir, 40 | screen_name = x, 41 | n = tweets_per_user), 42 | error = function(e) NA) 43 | } 44 | 45 | if (!is.null(screen_names)){ 46 | lapply(screen_names, scrape_func) 47 | } else { 48 | lapply(ids, scrape_func) 49 | } 50 | } 51 | 52 | #' tweets_to_df 53 | #' 54 | #' This function takes a directory of JSON files containing scraped Tweets and returns a data.frame. 55 | #' @param tweet_dir Directory where scraped Tweets are stored 56 | #' @param keep_retweets If FALSE (the default), retweets are discarded. 57 | #' @return data.frame of Tweets with metadata 58 | #' @export 59 | #' @examples 60 | #' tweet_df <- tweets_to_df("data/scraped_tweets", keep_retweets = FALSE) 61 | tweets_to_df <- function(tweet_dir, keep_retweets=FALSE) { 62 | files <- list.files(tweet_dir) 63 | tweets <- lapply(files, 64 | function(x) { 65 | tryCatch(parseTweets(file.path(tweet_dir, x), legacy=TRUE), 66 | error=function(e) NA) 67 | } 68 | ) 69 | tweets <- do.call("rbind",tweets) 70 | tweets$tweet_url <- sprintf("https://twitter.com/%s/status/%s", tweets$screen_name, tweets$id_str) 71 | 72 | if (!keep_retweets) { 73 | tweets <- tweets[!grepl("^RT", tweets$text),] # anchored so only true retweets are dropped 74 | } 75 | 76 | return(tweets) 77 | } 78 | 79 | -------------------------------------------------------------------------------- /man/predict_ideology.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/predict.R 3 | \name{predict_ideology} 4 | \alias{predict_ideology} 5 | \title{predict_ideology} 6 | \usage{ 7 | predict_ideology(tweets, model = "BiLSTM", embeddings = "w2v", 8 | embedding_dim = 25, filter_political_tweets = FALSE) 9 | } 10 | \arguments{ 11 | \item{tweets}{Character vector of Tweets.} 12 | 13 | \item{model}{Neural network architecture to use. Options are "LSTM", "BiLSTM", or "C-BiLSTM".} 14 | 15 | \item{embeddings}{Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization).} 16 | 17 | \item{embedding_dim}{Length of word embeddings to use. Options are 25, 50, 100, or 200.} 18 | 19 | \item{filter_political_tweets}{If Tweet collection may contain non-political Tweets, optionally filter them out before ideological scaling.} 20 | } 21 | \value{ 22 | Vector of float values between 0 and 1, where values closer to 0 indicate liberal ideological slant, values closer to 1 indicate conservative ideological slant, and values near 0.5 indicate a lack of ideological leaning. Non-political Tweets return an NA value. 23 | } 24 | \description{ 25 | This function allows you to scale the ideological slant of Twitter posts. 26 | } 27 | \details{ 28 | The data set on which the models are trained is roughly 75 percent Tweets from "elite" users (e.g. 
politicians, media outlets, think tanks, etc.), with the remaining 25 percent coming from "mass" users. In validating the models, it became apparent that they were much more capable of identifying slant from the former group, which in many ways presents an idealized scenario of clearly- (and often forcefully-) articulated ideological leanings along with (mostly) consistent grammar and spelling. Predictions of "mass" Tweets were largely clustered around the middle of the spectrum, not because they were necessarily more moderate, but because the models could not make a confident prediction either way. Accordingly, researchers should use caution when using this package to scale Tweets from groups other than political elites. 29 | 30 | The Tweets used to train the models were scraped and labeled in early 2018. The ideological spectrum is, of course, not a static entity, and where particular issues and actors fall on that spectrum can shift over time. Additionally, new issues and actors have emerged on the political scene since this data was collected, so stances on more recent topics (e.g. Brett Kavanaugh or the Green New Deal) that might provide a great deal of information to a political observer about someone's leanings would not provide any additional information to the model. 31 | } 32 | \examples{ 33 | tweets <- c("Make no mistake- the President of the United States is actively sabotaging the health insurance of millions of Americans with this action.", 34 | "This MLK Day, 50 years after his death, we honor Dr. King's legacy. He lived for the causes of justice and equality, and opened the door of opportunity for millions of Americans. America is a better; freer nation because of it.", 35 | "I’m disappointed in Senate Democrats for shutting down the government. #SchumerShutdown") 36 | preds <- predict_ideology(tweets, model="BiLSTM", embeddings="w2v") 37 | } 38 | -------------------------------------------------------------------------------- /R/predict.R: -------------------------------------------------------------------------------- 1 | #' predict_ideology 2 | #' 3 | #' This function allows you to scale the ideological slant of Twitter posts. 4 | #' @param tweets Character vector of Tweets. 5 | #' @param model Neural network architecture to use. Options are "LSTM", "BiLSTM", or "C-BiLSTM". 6 | #' @param embeddings Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization). 7 | #' @param embedding_dim Length of word embeddings to use. Options are 25, 50, 100, or 200. 8 | #' @param filter_political_tweets If Tweet collection may contain non-political Tweets, optionally filter them out before ideological scaling. 9 | #' @return Vector of float values between 0 and 1, where values closer to 0 indicate liberal ideological slant, values closer to 1 indicate conservative ideological slant, and values near 0.5 indicate a lack of ideological leaning. Non-political Tweets return an NA value. 10 | #' @export 11 | #' @details The data set on which the models are trained is roughly 75 percent Tweets from "elite" users (e.g. politicians, media outlets, think tanks, etc.), with the remaining 25 percent coming from "mass" users. In validating the models, it became apparent that they were much more capable of identifying slant from the former group, which in many ways presents an idealized scenario of clearly- (and often forcefully-) articulated ideological leanings along with (mostly) consistent grammar and spelling. 
Predictions of "mass" Tweets were largely clustered around the middle of the spectrum, not because they were necessarily more moderate, but because the models could not make a confident prediction either way. Accordingly, researchers should use caution when using this package to scale Tweets from groups other than political elites. 12 | #' @details The Tweets used to train the models were scraped and labeled in early 2018. The ideological spectrum is, of course, not a static entity, and where particular issues and actors fall on that spectrum can shift over time. Additionally, new issues and actors have emerged on the political scene since this data was collected, so stances on more recent topics (e.g. Brett Kavanaugh or the Green New Deal) that might provide a great deal of information to a political observer about someone's leanings would not provide any additional information to the model. 13 | #' @examples 14 | #' tweets <- c("Make no mistake- the President of the United States is actively sabotaging the health insurance of millions of Americans with this action.", 15 | #' "This MLK Day, 50 years after his death, we honor Dr. King's legacy. He lived for the causes of justice and equality, and opened the door of opportunity for millions of Americans. America is a better; freer nation because of it.", 16 | #' "I’m disappointed in Senate Democrats for shutting down the government. #SchumerShutdown") 17 | #' preds <- predict_ideology(tweets, model="BiLSTM", embeddings="w2v") 18 | 19 | predict_ideology <- function(tweets, model="BiLSTM", embeddings="w2v", embedding_dim=25, filter_political_tweets=FALSE) { 20 | stopifnot(model %in% list("LSTM", "BiLSTM", "C-BiLSTM")) 21 | 22 | cwd <- getwd() 23 | setwd("~/.deepIdeology/") 24 | # if Tweet collection contains non-political tweets, filter out before scaling ideology 25 | if (filter_political_tweets) { 26 | if (!file.exists("models/politics_classifier.h5")) { 27 | print("No pre-trained politics classifier exists. Training model now. This may take a moment.") 28 | prepare_politics_classifier() 29 | } 30 | 31 | pol_model <- keras::load_model_hdf5("models/politics_classifier.h5") 32 | pol_tokenizer <- keras::load_text_tokenizer("tokenizers/pol_tweet_tokenizer") 33 | pol_ind <- which(as.vector(pol_model %>% keras::predict_classes(texts_to_vectors(tweets, pol_tokenizer))) == 1) 34 | print(sprintf("%i political Tweets identified out of %i total Tweets", length(pol_ind), length(tweets))) 35 | } else { 36 | pol_ind <- seq_along(tweets) 37 | } 38 | 39 | # load fit tokenizer, convert raw text to sequences 40 | if (!file.exists("tokenizers/ideo_tweet_tokenizer")) { 41 | data("ideo_tweets") 42 | tokenizer <- keras::text_tokenizer(num_words = 20000) 43 | tokenizer <- keras::fit_text_tokenizer(tokenizer, ideo_tweets$text) 44 | if (!dir.exists("tokenizers")) { 45 | dir.create("tokenizers") 46 | } 47 | keras::save_text_tokenizer(tokenizer, "tokenizers/ideo_tweet_tokenizer") 48 | } 49 | 50 | tokenizer <- keras::load_text_tokenizer("tokenizers/ideo_tweet_tokenizer") 51 | 52 | # load desired model 53 | model_name_map <- list("LSTM" = "lstm", "BiLSTM" = "bi-lstm", "C-BiLSTM" = "c-bi-lstm") 54 | model_fname <- sprintf("models/%s_%s_%sd.h5", model_name_map[[model]], embeddings, embedding_dim) 55 | 56 | if (!file.exists(model_fname)) { 57 | print("No pre-trained model with that configuration exists. Training model now. 
This may take a moment.") 58 | data("ideo_tweets") 59 | text_vecs <- texts_to_vectors(ideo_tweets$text, tokenizer) 60 | labels <- ideo_tweets$ideo_cat 61 | data <- train_test_split(text_vecs, labels) 62 | if (model == "BiLSTM") { 63 | bidirectional = TRUE 64 | convolutional = FALSE 65 | } else if (model == "C-BiLSTM") { 66 | bidirectional = TRUE 67 | convolutional = TRUE 68 | } else { 69 | bidirectional = FALSE 70 | convolutional = FALSE 71 | } 72 | train_lstm(data$X_train, data$y_train, embeddings = embeddings, embedding_dim = embedding_dim, 73 | bidirectional = bidirectional, convolutional = convolutional) 74 | } 75 | 76 | model <- keras::load_model_hdf5(model_fname) 77 | 78 | text_vecs <- texts_to_vectors(tweets, tokenizer) 79 | # generate predictions on new text 80 | preds <- model %>% 81 | keras::predict_proba(text_vecs) 82 | 83 | preds <- preds[, 1] 84 | preds[!seq_along(preds) %in% pol_ind] <- NA # non-political Tweets are returned as NA 85 | setwd(cwd) 86 | return(preds) 87 | } 88 | 89 | -------------------------------------------------------------------------------- /R/word_embeddings.R: -------------------------------------------------------------------------------- 1 | #' prepare_glove_embeddings 2 | #' 3 | #' This function prepares an embedding matrix containing the words in the training data set from pre-trained GloVe embeddings. 4 | #' @param embedding_dim Dimensionality of word embeddings. Options are 25, 50, 100, 200. 5 | #' @param tokenizer Pre-fit keras text tokenizer. 6 | #' @export 7 | #' @details For more information on the GloVe embedding algorithm, visit https://nlp.stanford.edu/projects/glove/. 8 | #' @note The GloVe embeddings are 1.3G zipped and 3.8G unzipped. 9 | #' @note Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_glove_\{embedding_dim\}.rda" 10 | prepare_glove_embeddings <- function(embedding_dim, tokenizer) { 11 | stopifnot(embedding_dim %in% list(25, 50, 100, 200)) 12 | 13 | cwd <- getwd() 14 | setwd("~/.deepIdeology/") 15 | if (!dir.exists("glove.twitter.27B")) { 16 | download <- utils::menu(c("Yes", "No"), title="Cannot find pre-trained GloVe embeddings. Would you like to download now (1.3G)?") 17 | if (download != 1) stop("The pre-trained GloVe embeddings are required to proceed.") 18 | dir.create("glove.twitter.27B") 19 | download.file("http://nlp.stanford.edu/data/glove.twitter.27B.zip", "glove.twitter.27B/glove.twitter.27B.zip") 20 | unzip("glove.twitter.27B/glove.twitter.27B.zip", exdir="glove.twitter.27B") 21 | file.remove("glove.twitter.27B/glove.twitter.27B.zip") 22 | } 23 | embeddings_file <- sprintf("glove.twitter.27B/glove.twitter.27B.%sd.txt", embedding_dim) 24 | word_index <- tokenizer$word_index 25 | embeddings_index <- new.env(parent = emptyenv()) 26 | lines <- readLines(embeddings_file) 27 | for (line in lines) { 28 | values <- strsplit(line, ' ', fixed = TRUE)[[1]] 29 | word <- values[[1]] 30 | coefs <- as.numeric(values[-1]) 31 | embeddings_index[[word]] <- coefs 32 | } 33 | 34 | embedding_matrix <- matrix(0L, nrow = length(word_index)+1, ncol = embedding_dim) 35 | for (word in names(word_index)) { 36 | index <- word_index[[word]] 37 | if (index > length(word_index)) 38 | next 39 | embedding_vector <- embeddings_index[[word]] 40 | if (!is.null(embedding_vector)) { 41 | # words not found in embedding index will be all-zeros. 
42 | embedding_matrix[index + 1,] <- embedding_vector # shift by one: row 1 holds the padding index 0 43 | } 44 | } 45 | 46 | out_file <- sprintf("embeddings/tweet_glove_%sd.rda", embedding_dim) 47 | if (!dir.exists("embeddings")) { 48 | dir.create("embeddings") 49 | } 50 | save(embedding_matrix, file = out_file) 51 | setwd(cwd) 52 | } 53 | 54 | 55 | #' prepare_w2v_embeddings 56 | #' 57 | #' This function trains a word2vec model to create custom word embeddings from the training data set. 58 | #' @param texts Character vector of raw text from training data. 59 | #' @param embedding_dim Dimensionality of word embeddings. Options are 25, 50, 100, 200. 60 | #' @param tokenizer Pre-fit keras text tokenizer. 61 | #' @export 62 | #' @details For a good introduction to the word2vec model, see Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013). 63 | #' @note Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_w2v_\{embedding_dim\}.rda" 64 | 65 | prepare_w2v_embeddings <- function(texts, embedding_dim, tokenizer) { 66 | 67 | cwd <- getwd() 68 | setwd("~/.deepIdeology/") 69 | 70 | skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) { 71 | gen <- keras::texts_to_sequences_generator(tokenizer, sample(text)) 72 | function() { 73 | skip <- keras::generator_next(gen) %>% 74 | keras::skipgrams( 75 | vocabulary_size = tokenizer$num_words, 76 | window_size = window_size, 77 | negative_samples = negative_samples 78 | ) 79 | x <- purrr::transpose(skip$couples) %>% purrr::map(. %>% unlist %>% as.matrix(ncol = 1)) 80 | y <- skip$labels %>% as.matrix(ncol = 1) 81 | list(x, y) 82 | } 83 | } 84 | 85 | skip_window <- 5 # How many words to consider left and right. 86 | num_sampled <- 1 # Number of negative examples to sample for each word. 
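# The layers below implement the skip-gram objective: a target word and a context word are each looked up in the single shared "embedding" layer, and a sigmoid over their dot product is trained to separate true (target, context) pairs from the negative samples drawn by keras::skipgrams() in the generator above. After training, the weights of that shared layer are extracted as the word-embedding matrix.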
87 | 88 | input_target <- keras::layer_input(shape = 1) 89 | input_context <- keras::layer_input(shape = 1) 90 | 91 | embedding <- keras::layer_embedding( 92 | input_dim = tokenizer$num_words + 1, 93 | output_dim = embedding_dim, 94 | input_length = 1, 95 | name = "embedding" 96 | ) 97 | 98 | target_vector <- input_target %>% 99 | embedding() %>% 100 | keras::layer_flatten() 101 | 102 | context_vector <- input_context %>% 103 | embedding() %>% 104 | keras::layer_flatten() 105 | 106 | dot_product <- keras::layer_dot(list(target_vector, context_vector), axes = 1) 107 | output <- keras::layer_dense(dot_product, units = 1, activation = "sigmoid") 108 | 109 | model <- keras::keras_model(list(input_target, input_context), output) 110 | model %>% keras::compile(loss = "binary_crossentropy", optimizer = "adam") 111 | 112 | if (!dir.exists("models")) dir.create("models") # the checkpoint callback below writes into models/ 113 | model %>% keras::fit_generator(skipgrams_generator(texts, 114 | tokenizer, 115 | skip_window, 116 | num_sampled), 117 | steps_per_epoch=10000, 118 | epochs=10, 119 | callbacks = list(keras::callback_model_checkpoint(sprintf("models/w2v_%sd.h5", embedding_dim), 120 | monitor = "loss", 121 | save_best_only = TRUE), 122 | keras::callback_early_stopping(monitor = "loss", patience=2)) 123 | ) 124 | 125 | model <- keras::load_model_hdf5(sprintf("models/w2v_%sd.h5", embedding_dim)) 126 | embedding_matrix <- keras::get_weights(model)[[1]] 127 | words <- dplyr::data_frame(word=names(tokenizer$word_index), 128 | id=as.integer(unlist(tokenizer$word_index))) 129 | words <- words %>% dplyr::filter(id <= tokenizer$num_words) %>% dplyr::arrange(id) 130 | row.names(embedding_matrix) <- c("UNK",words$word) 131 | 132 | out_file <- sprintf("embeddings/tweet_w2v_%sd.rda", embedding_dim) 133 | 134 | if (!dir.exists("embeddings")) { 135 | dir.create("embeddings") 136 | } 137 | save(embedding_matrix,file=out_file) 138 | setwd(cwd) 139 | } 140 | -------------------------------------------------------------------------------- /R/train_models.R: -------------------------------------------------------------------------------- 1 | prepare_politics_classifier <- function() { 2 | data("pol_tweets") 3 | cwd <- getwd() 4 | setwd("~/.deepIdeology") 5 | if (file.exists("tokenizers/pol_tweet_tokenizer")) { 6 | tokenizer <- keras::load_text_tokenizer("tokenizers/pol_tweet_tokenizer") 7 | } else { 8 | tokenizer <- keras::text_tokenizer() 9 | tokenizer <- keras::fit_text_tokenizer(tokenizer, pol_tweets$Input.text) 10 | if (!dir.exists("tokenizers")) { 11 | dir.create("tokenizers") 12 | } 13 | 14 | keras::save_text_tokenizer(tokenizer, filename = "tokenizers/pol_tweet_tokenizer") 15 | } 16 | 17 | texts <- texts_to_vectors(pol_tweets$Input.text, tokenizer) 18 | labels <- pol_tweets$pol 19 | word_index <- tokenizer$word_index 20 | data <- train_test_split(texts, labels) 21 | 22 | lstm <- keras::keras_model_sequential() 23 | lstm %>% 24 | keras::layer_embedding(input_dim = length(word_index)+1, output_dim=64) %>% 25 | keras::layer_lstm(units=64, dropout=0.5, recurrent_dropout=0.3) %>% 26 | keras::layer_dense(units=16, activation='relu') %>% 27 | keras::layer_dropout(0.5) %>% 28 | keras::layer_dense(units=1, activation='sigmoid') 29 | 30 | lstm %>% keras::compile(loss='binary_crossentropy',optimizer='adam',metrics=c('accuracy')) 31 | 32 | if (!dir.exists("models")) { 33 | dir.create("models") 34 | } 35 | lstm %>% keras::fit( 36 | data$X_train, data$y_train, 37 | batch_size=64, 38 | epochs=100, 39 | validation_split=0.2, 40 | callbacks = 
list(keras::callback_model_checkpoint("models/politics_classifier.h5", 41 | monitor = "val_loss", 42 | save_best_only = TRUE), 43 | keras::callback_early_stopping(monitor = "val_loss", patience=3)) 44 | ) 45 | 46 | setwd(cwd) 47 | } 48 | 49 | #' train_lstm 50 | #' 51 | #' This function trains the LSTM model to identify the ideological slant of Tweets. 52 | #' @param X_train data.frame or matrix of vectorized Tweets 53 | #' @param y_train Labels for training data. 0 for liberal, 1 for conservative. 54 | #' @param embeddings Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization). 55 | #' @param embedding_dim Length of word embeddings to use. Options are 25, 50, 100, or 200. 56 | #' @param bidirectional Optionally train on text sequences in reverse as well as forwards. 57 | #' @param convolutional Optionally apply convolutional filter to text sequences. Can only be used when bidirectional = TRUE. 58 | #' @export 59 | #' @note Models are automatically saved in HDF5 format to a sub-folder of the root-directory called "models". File format is "\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5". 60 | #' @examples 61 | #' # train a Bi-LSTM network using GloVe embeddings 62 | #' data("ideo_tweets") 63 | #' ideo_tokenizer <- text_tokenizer(num_words=20000) 64 | #' ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 65 | #' texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 66 | #' labels <- ideo_tweets$ideo_cat 67 | #' 68 | #' train_test <- train_test_split(texts, labels) 69 | #' X_train <- train_test$X_train 70 | #' y_train <- train_test$y_train 71 | #' train_lstm(X_train, y_train, embeddings="glove", bidirectional=TRUE) 72 | train_lstm <- function(X_train, y_train, embeddings = "w2v", embedding_dim = 25, bidirectional = FALSE, convolutional = FALSE) { 73 | stopifnot(embedding_dim %in% list(25, 50, 100, 200)) 74 | stopifnot(embeddings %in% list("random", "w2v", "glove")) 75 | if (convolutional && !bidirectional) stop("convolutional = TRUE can only be used when bidirectional = TRUE") 76 | cwd <- getwd() 77 | setwd("~/.deepIdeology/") 78 | 79 | out_fname <- sprintf("lstm_%sd.h5", embedding_dim) 80 | 81 | model <- keras::keras_model_sequential() 82 | if (embeddings != "random") { 83 | embedding_fname <- sprintf("embeddings/tweet_%s_%sd.rda", embeddings, embedding_dim) 84 | 85 | if (!file.exists(embedding_fname)) { 86 | print(sprintf("Embedding file does not exist. Preparing %s-dimensional %s embeddings. 
This may take a moment", embedding_dim, embeddings)) 87 | tokenizer <- keras::load_text_tokenizer("tokenizers/ideo_tweet_tokenizer") 88 | if (embeddings == "glove") { 89 | prepare_glove_embeddings(embedding_dim, tokenizer) 90 | } else { 91 | data("ideo_tweets") 92 | prepare_w2v_embeddings(ideo_tweets$text, embedding_dim, tokenizer) 93 | } 94 | } 95 | embedding_matrix <- get(load(embedding_fname)) 96 | 97 | model %>% 98 | keras::layer_embedding(input_dim = dim(embedding_matrix)[1], output_dim=embedding_dim, 99 | weights = list(embedding_matrix)) 100 | out_fname <- sprintf("lstm_%s_%sd.h5", embeddings, embedding_dim) 101 | } else { 102 | model %>% 103 | keras::layer_embedding(input_dim = 20000+1, output_dim=embedding_dim) 104 | out_fname <- sprintf("lstm_random_%sd.h5", embedding_dim) # matches the file name predict_ideology() expects 105 | } 106 | if (convolutional) { 107 | model %>% 108 | keras::layer_conv_1d(filters=64, 109 | kernel_size = 3, 110 | padding = 'valid', 111 | activation = 'relu', 112 | strides=1) %>% 113 | keras::layer_max_pooling_1d(pool_size = 2) 114 | out_fname <- sprintf("c-bi-%s", out_fname) 115 | } 116 | if (bidirectional) { 117 | model %>% 118 | keras::bidirectional(keras::layer_lstm(units=64, dropout=0.3, recurrent_dropout=0.3)) 119 | if (!convolutional) out_fname <- sprintf("bi-%s", out_fname) 120 | } else { 121 | model %>% 122 | keras::layer_lstm(units=64, dropout=0.3, recurrent_dropout=0.3) 123 | } 124 | 125 | model %>% 126 | keras::layer_dense(units=16, activation='relu') %>% 127 | keras::layer_dropout(0.5) %>% 128 | keras::layer_dense(units=1, activation='sigmoid') 129 | 130 | model %>% keras::compile(loss='binary_crossentropy',optimizer='adam',metrics=c('accuracy')) 131 | 132 | if (!dir.exists("models")) { 133 | dir.create("models") 134 | } 135 | model %>% keras::fit( 136 | X_train, y_train, 137 | batch_size=64, 138 | epochs=50, 139 | validation_split=0.2, 140 | callbacks = list(keras::callback_model_checkpoint(sprintf("models/%s", out_fname), 141 | monitor = "val_loss", 142 | save_best_only = TRUE), 143 | keras::callback_early_stopping(monitor = "val_loss", patience=3)) 144 | ) 145 | 146 | setwd(cwd) 147 | } 148 | 149 | #' evaluate 150 | #' 151 | #' This function evaluates the performance of a trained model. 152 | #' @param model_path Path to HDF5 file containing model. Should be of the form "models/\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5" 153 | #' @param X_test data.frame or matrix of vectorized Tweets 154 | #' @param y_test Labels for testing data. 0 for liberal, 1 for conservative. 155 | #' @export 156 | #' @return List of performance metrics. Currently, a confusion matrix, overall prediction accuracy, precision, recall, and F1 score are returned. 
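#' @details Metrics are derived from the confusion matrix: per-class precision is diag / colsums, per-class recall is diag / rowsums, and F1 is their harmonic mean, 2 * precision * recall / (precision + recall).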
157 | #' @examples 158 | #' data("ideo_tweets") 159 | #' ideo_tokenizer <- text_tokenizer(num_words=20000) 160 | #' ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 161 | #' texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 162 | #' labels <- ideo_tweets$ideo_cat 163 | #' 164 | #' train_test <- train_test_split(texts, labels) 165 | #' 166 | #' evaluate("models/bi-lstm_w2v_25d.h5", train_test$X_test, train_test$y_test) 167 | evaluate <- function(model_path, X_test, y_test) { 168 | model <- keras::load_model_hdf5(model_path) 169 | preds <- model %>% 170 | keras::predict_classes(X_test) 171 | 172 | res <- list() 173 | cm <- as.matrix(table(Actual = y_test, Predicted = preds)) 174 | res[["Confusion Matrix"]] <- cm 175 | 176 | n <- sum(cm) # number of instances 177 | nc <- nrow(cm) # number of classes 178 | diag <- diag(cm) # number of correctly classified instances per class 179 | rowsums <- apply(cm, 1, sum) # number of instances per class 180 | colsums <- apply(cm, 2, sum) # number of predictions per class 181 | p <- rowsums / n # distribution of instances over the actual classes 182 | q <- colsums / n # distribution of instances over the predicted classes 183 | accuracy <- sum(diag) / n 184 | res[["Accuracy"]] <- accuracy 185 | 186 | precision <- diag / colsums 187 | recall <- diag / rowsums 188 | f1 <- 2 * precision * recall / (precision + recall) 189 | res[["Precision/Recall"]] <- data.frame(precision, recall, f1) 190 | 191 | return(res) 192 | } 193 | 194 | #' train_test_split 195 | #' 196 | #' Helper function to split data into training and testing sets. 197 | #' @param X data.frame or matrix of data 198 | #' @param y Labels (optional). 199 | #' @param test_size Proportion of samples to set aside for testing. 200 | #' @export 201 | #' @return List of X_train, X_test, y_train, y_test 202 | train_test_split <- function(X, y, test_size=0.2) { 203 | n_train <- floor((1-test_size)*nrow(X)) 204 | train_ind <- sample(nrow(X),n_train) 205 | return(list(X_train=X[train_ind,], X_test=X[-train_ind,], y_train=y[train_ind], y_test=y[-train_ind])) 206 | } 207 | 208 | #' texts_to_vectors 209 | #' 210 | #' Helper function to vectorize text data 211 | #' @param texts Character vector of raw text data 212 | #' @param tokenizer Pre-fit keras tokenizer 213 | #' @export 214 | #' @return matrix of vectorized texts 215 | texts_to_vectors <- function(texts, tokenizer){ 216 | sequences <- keras::texts_to_sequences(tokenizer, texts) 217 | vecs <- keras::pad_sequences(sequences) 218 | return(vecs) 219 | } 220 | 221 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # deepIdeology: Scale ideological slant of Tweets 2 | 3 | This package allows users to identify the ideological leanings of Twitter posts with benchmark accuracy 4 | using a Long Short-Term Memory recurrent neural network model trained on a data set of Tweets labeled through 5 | the Amazon Mechanical Turk crowd-sourcing platform. The best-performing model is able to classify Tweets as 6 | liberal- or conservative-leaning with 86.90% accuracy and can capture both directionality and degree of 7 | slant (see "Performance" and "Validation" below). This package was developed for a study of preference falsification on social media entitled "Private Partisan, 8 | Public Moderate: Preference Falsification on Twitter" (Gottlieb, 2018). Contact the maintainer to request access. 
9 | 10 | ## Installation and setup 11 | 12 | To install the latest version of `deepIdeology` from GitHub, run the following: 13 | ```{r} 14 | devtools::install_github("alex-gottlieb/deepIdeology") 15 | library(deepIdeology) 16 | ``` 17 | Following successful installation, run the `complete_setup()` command, which will finish installing the `keras` module and set up the file caching 18 | on which the package relies. 19 | 20 | ## predict_ideology 21 | 22 | `predict_ideology` is the core function of this package. It allows the user to scale the ideological slant of Tweets from 0 to 1, with values close 23 | to 0 indicating a strong liberal slant, values close to 1 indicating a strong conservative slant, and values around 0.5 indicating ideologically moderate or neutral content. The first time this function is called, it will likely take upwards of an hour to run, as the model will need to be trained and the word embeddings, which are required to vectorize the raw text of the Tweets, will need to be either downloaded (GloVe) or learned by another neural network 24 | model (word2vec). Once a particular model or embedding configuration has been used, though, the files will be cached, allowing near-instantaneous 25 | evaluations in the future. 26 | 27 | The parameters of the function deserve a little further explanation. The best-performing parameters are set as the defaults of the function, so readers not interested in the technical details can skip this section. 28 | * `model` allows the user to select the particular neural network architecture used in the slant classifier: 29 | + `LSTM` stands for Long Short-Term Memory Network, a type of recurrent neural network that can learn contextual information in text, or related bits of information separated by a wide spatial or temporal gap (Hochreiter and Schmidhuber 1997). 30 | + `BiLSTM` is a Bidirectional LSTM (Graves and Schmidhuber 2005). In this architecture, two LSTM units are trained: one on the text as-is, and one on a reverse copy of the input sequence, which allows the network to place a given word in the context of what comes both before and after it. 31 | + `C-BiLSTM` is a Convolutional Bidirectional LSTM (Xi-Lian, Wei, and Teng-Jiao 2017), which can learn target-related context and semantic representations simultaneously. 32 | * `embeddings` determines which type of word embedding is used. Word embeddings are a means of transforming raw text into *d*-dimensional numeric vectors that a machine can understand. A straightforward primer on word embeddings and common models can be found [here](https://machinelearningmastery.com/what-are-word-embeddings/). 33 | + `GloVe` is a count-based model, which learns embeddings through dimensionality reduction on the co-occurrence count matrix of a corpus (Pennington, Socher, and Manning 2014). The networks in this package use GloVe embeddings calculated from a corpus of 2 billion Tweets. More information can be found [here](https://nlp.stanford.edu/projects/glove/). 34 | + `w2v` or word2vec is a predictive model, which means it learns the embeddings that minimize the loss of predicting each word given its context words and their vector representation (Mikolov et al. 2013). If word2vec embeddings are chosen, a separate neural network model will be trained to learn the word embeddings. 35 | + `random` uses an embedding layer with a random initialization, which is then learned in the course of the regular model training. 
* `embedding_dim` is the dimensionality of the vector space into which each word is projected. In general, higher-dimensional embeddings can capture more semantic subtleties, but also require more training data to discover those nuances. For the sake of making functions more generalizable, options are restricted to 25, 50, 100, and 200. 37 | * `filter_political_tweets` gives users the option to remove Tweets that are non-political in nature before slant-scaling if there is the possibility that such Tweets are contained in the data set. This is done using a separate classifier also trained on Tweets labeled as "political" or "not political" through Amazon Mechanical Turk. 38 | 39 | A toy example: 40 | ```{r} 41 | tweets <- c("Republicans are moving full steam ahead on their #GOPTaxScam, which lays the groundwork for them to gut Social Security and Medicare. I urge my Senate colleagues to vote No!", 42 | "This MLK Day, 50 years after his death, we honor Dr. King's legacy. He lived for the causes of justice and equality, and opened the door of opportunity for millions of Americans. America is a better, freer nation because of it.", 43 | "I’m disappointed in Senate Democrats for shutting down the government. #SchumerShutdown") 44 | 45 | predict_ideology(tweets) 46 | ``` 47 | 48 | #### Caveats 49 | The data set on which the models are trained is roughly 75 percent Tweets from "elite" users (e.g. politicians, media outlets, think tanks, etc.), with the remaining 25 percent coming from "mass" users. In validating the models, it became apparent that they were much more capable of identifying slant from the former group, which in many ways presents an idealized scenario of clearly- (and often forcefully-) articulated ideological leanings along with (mostly) consistent grammar and spelling. Predictions of "mass" Tweets were largely clustered around the middle of the spectrum, not because they were necessarily more moderate, but because the models could not make a confident prediction either way. Accordingly, researchers should use caution when using this package to scale Tweets from groups other than political elites. 50 | 51 | Additionally, the Tweets used to train the models were scraped and labeled in early 2018. The ideological spectrum is, of course, not a static entity, and where particular issues and actors fall on that spectrum can shift over time. Additionally, new issues and actors have emerged on the political scene since this data was collected, so stances on more recent topics (e.g. Brett Kavanaugh or the Green New Deal) that might provide a great deal of information to a political observer about someone's leanings would not provide any additional information to the model. 52 | 53 | Both of these issues can be addressed with the continued augmentation of the training data set with labeled examples, so if anyone is interested in continuing this work, please be in touch! 54 | 55 | ## Performance 56 | 57 | The predictive accuracy of various model/embedding combinations on a set-aside testing data set is shown below in Table 1. 58 | 59 | ![Table 1: Ideology classifier performance](figures/model_performance.png) 60 | 61 | Note that a full hyperparameter optimization was not performed for any of the neural network models, so these numbers can be considered lower-bound estimates of the predictive power of the respective model configurations, and comparisons between models should be taken with a grain of salt. 
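For anyone wishing to reproduce a row of Table 1, a benchmarking run looks roughly like the following (a minimal sketch: it assumes the default 20% hold-out split and that the trained model has been cached under `~/.deepIdeology/models/` as described above):
```{r}
library(keras)
library(deepIdeology)

# vectorize the labeled Tweets with the same tokenizer configuration the models use
data("ideo_tweets")
tokenizer <- text_tokenizer(num_words = 20000)
tokenizer <- fit_text_tokenizer(tokenizer, ideo_tweets$text)
texts <- texts_to_vectors(ideo_tweets$text, tokenizer)
labels <- ideo_tweets$ideo_cat

# hold out 20% of the data, train a Bi-LSTM on the rest, and score the held-out set
split <- train_test_split(texts, labels, test_size = 0.2)
train_lstm(split$X_train, split$y_train, embeddings = "w2v", embedding_dim = 25, bidirectional = TRUE)
evaluate("~/.deepIdeology/models/bi-lstm_w2v_25d.h5", split$X_test, split$y_test)
```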
62 | 63 | ## Validation 64 | 65 | A number of tests were performed to validate the quality of predictions. Table 2 shows a simple face validity test in which one Tweet from the "elite" user pool is sampled from each decile of prediction (i.e. [0, 0.1), [0.1, 0.2),...,[0.9, 1]). 66 | 67 | ![Table 2: Examples of predicted probabilities for randomly selected Tweets. Values closer to 0 mean the model is highly confident that the Tweet is liberal, while values close to 1 indicate confidence in conservativeness.](figures/face_validity.png) 68 | 69 | Based on this random sample, it would appear that the model captures both directionality and degree of slant, with Tweets predicted around 0.5 betraying little ideological leaning and getting progressively more liberal and conservative as values approach 0 and 1, respectively. The model even appears capable of scaling Tweets for which a high level of political knowledge, long-term memory, and careful reading of tone would be required to understand the slant. For example, the message “Chairman Grassley's job is to hold hearings on Judge Garland. He doesn't need to poll his colleagues. He just needs to do his job!” has a predicted probability of around 0.15. To understand that this Tweet is liberal in nature, one would have to know that Chuck Grassley is a conservative Republican senator, to remember that “Chairman Grassley” is the antecedent of the pronoun “he”, which comes much later in the Tweet, and to understand that the tone of the sentences beginning with “he” is highly disapproving. 70 | 71 | As a more rigorous test, estimates were validated against existing and widely-accepted measures of ideological preferences. Figure 1 shows the correlation between the mean predicted ideology of 200 Tweets for members of Congress and the first dimension of their DW-NOMINATE score, which is based on the roll-call voting patterns of members of Congress (Poole and Rosenthal 1997). 72 | 73 | ![Figure 1: Comparing Twitter-based estimates of ideological preferences of legislators to the first dimension of their DW-NOMINATE scores.](figures/dw-nom.png) 74 | 75 | The correlation between Twitter-based estimates and the first dimension of the DW-NOMINATE scores for the 114th Congress is 0.947 for the House of Representatives and 0.937 for the Senate, values roughly equal to the Twitter-based Bayesian ideal points calculated by Barbera (2015) as well as those derived from Facebook "like"-ing data by Bond and Messing (2015). --------------------------------------------------------------------------------