├── NAMESPACE ├── .Rbuildignore ├── figures ├── dw-nom.png ├── face_validity.png └── model_performance.png ├── data ├── ideo_tweets.rda └── pol_tweets.rda ├── .gitignore ├── man ├── complete_setup.Rd ├── texts_to_vectors.Rd ├── train_test_split.Rd ├── tweets_to_df.Rd ├── prepare_glove_embeddings.Rd ├── prepare_w2v_embeddings.Rd ├── scrape_tweets.Rd ├── evaluate.Rd ├── train_lstm.Rd └── predict_ideology.Rd ├── deepIdeology.Rproj ├── DESCRIPTION ├── R ├── scrape_tweets.R ├── predict.R ├── word_embeddings.R └── train_models.R └── README.md /NAMESPACE: -------------------------------------------------------------------------------- 1 | exportPattern("^[[:alpha:]]+") 2 | -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^data-raw$ -------------------------------------------------------------------------------- /figures/dw-nom.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/figures/dw-nom.png -------------------------------------------------------------------------------- /data/ideo_tweets.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/data/ideo_tweets.rda -------------------------------------------------------------------------------- /data/pol_tweets.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/data/pol_tweets.rda -------------------------------------------------------------------------------- /figures/face_validity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/figures/face_validity.png -------------------------------------------------------------------------------- /figures/model_performance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alex-gottlieb/deepIdeology/HEAD/figures/model_performance.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | embeddings 6 | glove.twitter.27B 7 | tokenizers 8 | models -------------------------------------------------------------------------------- /man/complete_setup.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/scrape_tweets.R 3 | \name{complete_setup} 4 | \alias{complete_setup} 5 | \title{complete_setup} 6 | \usage{ 7 | complete_setup() 8 | } 9 | \description{ 10 | This function should be called after package installation to properly set up dependencies and create the file caching system. 
11 | } 12 | -------------------------------------------------------------------------------- /deepIdeology.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /man/texts_to_vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{texts_to_vectors} 4 | \alias{texts_to_vectors} 5 | \title{texts_to_vectors} 6 | \usage{ 7 | texts_to_vectors(texts, tokenizer) 8 | } 9 | \arguments{ 10 | \item{texts}{Character vector of raw text data} 11 | 12 | \item{tokenizer}{Pre-fit keras tokenizer} 13 | } 14 | \value{ 15 | matrix of vectorized texts 16 | } 17 | \description{ 18 | Helper function to vectorize text data 19 | } 20 | -------------------------------------------------------------------------------- /man/train_test_split.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{train_test_split} 4 | \alias{train_test_split} 5 | \title{train_test_split} 6 | \usage{ 7 | train_test_split(X, y, test_size = 0.2) 8 | } 9 | \arguments{ 10 | \item{X}{data.frame or matrix of data} 11 | 12 | \item{y}{Labels (optional).} 13 | 14 | \item{test_size}{Proportion of samples to set aside for testing.} 15 | } 16 | \value{ 17 | List of X_train, X_test, y_train, y_test 18 | } 19 | \description{ 20 | Helper function to split data into training and testing sets. 
21 | } 22 | -------------------------------------------------------------------------------- /man/tweets_to_df.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/scrape_tweets.R 3 | \name{tweets_to_df} 4 | \alias{tweets_to_df} 5 | \title{tweets_to_df} 6 | \usage{ 7 | tweets_to_df(tweet_dir, keep_retweets = FALSE) 8 | } 9 | \arguments{ 10 | \item{tweet_dir}{Directory where scraped Tweets are stored} 11 | 12 | \item{keep_retweets}{If FALSE (the default), retweets are discarded.} 13 | } 14 | \value{ 15 | data.frame of Tweets with metadata 16 | } 17 | \description{ 18 | This function takes a directory of JSON files containing scraped Tweets and returns a data.frame. 19 | } 20 | \examples{ 21 | tweet_df <- tweets_to_df("data/scraped_tweets", keep_retweets = FALSE) 22 | } 23 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: deepIdeology 2 | Type: Package 3 | Title: Scale Ideological Slant of Twitter Posts 4 | Version: 0.1.0 5 | Author: Alex Gottlieb 6 | Maintainer: Alex Gottlieb 7 | Description: This package allows users to identify the ideological leanings of Twitter posts with benchmark accuracy 8 | using a Long Short-Term Memory recurrent neural network model trained on a data set of Tweets labeled through 9 | the Amazon Mechanical Turk crowd-sourcing platform. The best-performing models are able to classify Tweets as 10 | liberal- or conservative-leaning with 86.90% accuracy and are able to capture both directionality and degree of 11 | slant. 12 | Depends: 13 | R (>= 3.4.4), 14 | dplyr, 15 | keras 16 | License: MIT 17 | Encoding: UTF-8 18 | LazyData: true 19 | RoxygenNote: 6.1.1.9000 20 | Imports: 21 | purrr 22 | -------------------------------------------------------------------------------- /man/prepare_glove_embeddings.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word_embeddings.R 3 | \name{prepare_glove_embeddings} 4 | \alias{prepare_glove_embeddings} 5 | \title{prepare_glove_embeddings} 6 | \usage{ 7 | prepare_glove_embeddings(embedding_dim, tokenizer) 8 | } 9 | \arguments{ 10 | \item{embedding_dim}{Dimensionality of word embeddings. Options are 25, 50, 100, 200.} 11 | 12 | \item{tokenizer}{Pre-fit keras text tokenizer.} 13 | } 14 | \description{ 15 | This function prepares an embedding matrix containing the words in the training data set from pre-trained GloVe embeddings. 16 | } 17 | \details{ 18 | For more information on the GloVe embedding algorithm, visit https://nlp.stanford.edu/projects/glove/. 19 | } 20 | \note{ 21 | The GloVe embeddings are 1.3G zipped and 3.8G unzipped. 
22 | 23 | Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_glove_\{embedding_dim\}.rda" 24 | } 25 | -------------------------------------------------------------------------------- /man/prepare_w2v_embeddings.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word_embeddings.R 3 | \name{prepare_w2v_embeddings} 4 | \alias{prepare_w2v_embeddings} 5 | \title{prepare_w2v_embeddings} 6 | \usage{ 7 | prepare_w2v_embeddings(texts, embedding_dim, tokenizer) 8 | } 9 | \arguments{ 10 | \item{texts}{Character vector of raw text from training data.} 11 | 12 | \item{embedding_dim}{Dimensionality of word embeddings. Options are 25, 50, 100, 200.} 13 | 14 | \item{tokenizer}{Pre-fit keras text tokenizer.} 15 | } 16 | \description{ 17 | This function trains a word2vec model to create custom word embeddings from the training data set. 18 | } 19 | \details{ 20 | For a good introduction to the word2vec model, see Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013). 21 | } 22 | \note{ 23 | Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_w2v_\{embedding_dim\}.rda" 24 | } 25 | -------------------------------------------------------------------------------- /man/scrape_tweets.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/scrape_tweets.R 3 | \name{scrape_tweets} 4 | \alias{scrape_tweets} 5 | \title{scrape_tweets} 6 | \usage{ 7 | scrape_tweets(screen_names = NULL, ids = NULL, tweets_per_user, 8 | credentials_dir, out_dir) 9 | } 10 | \arguments{ 11 | \item{screen_names}{Character vector of screen names of Twitter users.} 12 | 13 | \item{ids}{Character or integer vector of IDs of Twitter users. Use either (but not both) of these two arguments.} 14 | 15 | \item{tweets_per_user}{Number of tweets to scrape for each user.} 16 | 17 | \item{credentials_dir}{Directory with Twitter OAuth tokens.} 18 | 19 | \item{out_dir}{Name of directory to store scraped Tweets.} 20 | } 21 | \description{ 22 | This function scrapes the most recent n Tweets of a list of Twitter users. 23 | } 24 | \examples{ 25 | data("tweets") 26 | users <- unique(tweets$screen_name) 27 | scrape_tweets(screen_names = users, tweets_per_user = 200, credentials_dir = "credentials", out_dir = "data/scraped_tweets") 28 | } 29 | -------------------------------------------------------------------------------- /man/evaluate.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{evaluate} 4 | \alias{evaluate} 5 | \title{evaluate} 6 | \usage{ 7 | evaluate(model_path, X_test, y_test) 8 | } 9 | \arguments{ 10 | \item{model_path}{Path to HDF5 file containing model. Should be of the form "models/\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5"} 11 | 12 | \item{X_test}{data.frame or matrix of vectorized Tweets} 13 | 14 | \item{y_test}{Labels for testing data. 0 for liberal, 1 for conservative.} 15 | } 16 | \value{ 17 | List of performance metrics. Currently, a confusion matrix, overall prediction accuracy, precision, recall, and F1 score are returned. 18 | } 19 | \description{ 20 | This function evaluates the performance of a trained model. 
21 | } 22 | \examples{ 23 | data("ideo_tweets") 24 | ideo_tokenizer <- text_tokenizer(num_words=20000) 25 | ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 26 | texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 27 | labels <- ideo_tweets$ideo_cat 28 | 29 | train_test <- train_test_split(texts, labels) 30 | 31 | evaluate("models/bi-lstm_w2v_25d.h5", train_test$X_test, train_test$y_test) 32 | } 33 | -------------------------------------------------------------------------------- /man/train_lstm.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/train_models.R 3 | \name{train_lstm} 4 | \alias{train_lstm} 5 | \title{train_lstm} 6 | \usage{ 7 | train_lstm(X_train, y_train, embeddings = "w2v", embedding_dim = 25, 8 | bidirectional = FALSE, convolutional = FALSE) 9 | } 10 | \arguments{ 11 | \item{X_train}{data.frame or matrix of vectorized Tweets} 12 | 13 | \item{y_train}{Labels for training data. 0 for liberal, 1 for conservative.} 14 | 15 | \item{embeddings}{Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization).} 16 | 17 | \item{embedding_dim}{Length of word embeddings to use. Options are 25, 50, 100, or 200.} 18 | 19 | \item{bidirectional}{Optionally train on text sequences in reverse as well as forwards.} 20 | 21 | \item{convolutional}{Optionally apply convolutional filter to text sequences. Can only be used when bidirectional = TRUE.} 22 | } 23 | \description{ 24 | This function trains the LSTM model to identify the ideological slant of Tweets. 25 | } 26 | \note{ 27 | Models are automatically saved in HDF5 format to a sub-folder of the root-directory called "models". File format is "\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5". 28 | } 29 | \examples{ 30 | # train a Bi-LSTM network using GloVe embeddings 31 | data("ideo_tweets") 32 | ideo_tokenizer <- text_tokenizer(num_words=20000) 33 | ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 34 | texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 35 | labels <- ideo_tweets$ideo_cat 36 | 37 | train_test <- train_test_split(texts, labels) 38 | X_train <- train_test$X_train 39 | y_train <- train_test$y_train 40 | train_lstm(X_train, y_train, embeddings="glove", bidirectional=TRUE) 41 | } 42 | -------------------------------------------------------------------------------- /R/scrape_tweets.R: -------------------------------------------------------------------------------- 1 | #' complete_setup 2 | #' 3 | #' This function should be called after package installation to properly set up dependencies and create the file caching system. 4 | #' @export 5 | complete_setup <- function() { 6 | library(keras) 7 | install_keras(tensorflow = "1.9") 8 | 9 | library(devtools) 10 | install_version("rmongodb", version = "1.8.0", repos = "http://cran.us.r-project.org") 11 | install_github("SMAPPNYU/smappR") 12 | 13 | if (!dir.exists("~/.deepIdeology")) { 14 | dir.create("~/.deepIdeology") 15 | } 16 | } 17 | 18 | #' scrape_tweets 19 | #' 20 | #' This function scrapes the most recent n Tweets of a list of Twitter users. 21 | #' @param screen_names Character vector of screen names of Twitter users. 22 | #' @param ids Character or integer vector of IDs of Twitter users. Use either (but not both) of these two arguments. 23 | #' @param tweets_per_user Number of tweets to scrape for each user. 
24 | #' @param credentials_dir Directory with Twitter OAuth tokens. 25 | #' @param out_dir Name of directory to store scraped Tweets. 26 | #' @export 27 | #' @examples 28 | #' data("tweets") 29 | #' users <- unique(tweets$screen_name) 30 | #' scrape_tweets(screen_names = users, tweets_per_user = 200, credentials_dir = "credentials", out_dir = "data/scraped_tweets") 31 | scrape_tweets <- function(screen_names = NULL, ids = NULL, tweets_per_user, credentials_dir, out_dir) { 32 | if (!dir.exists(out_dir)) { 33 | dir.create(out_dir) 34 | } 35 | 36 | scrape_func <- function(x) { 37 | fname <- file.path(out_dir, paste0(x,'_tweets.json')) 38 | tryCatch(smappR::getTimeline(fname, 39 | oauth_folder = credentials_dir, 40 | screen_name = x, 41 | n = tweets_per_user), 42 | error = function(e) NA) 43 | } 44 | 45 | if (!is.null(screen_names)){ 46 | lapply(screen_names, scrape_func) 47 | } else { 48 | lapply(ids, scrape_func) 49 | } 50 | } 51 | 52 | #' tweets_to_df 53 | #' 54 | #' This function takes a directory of JSON files containing scraped Tweets and returns a data.frame. 55 | #' @param tweet_dir Directory where scraped Tweets are stored 56 | #' @param keep_retweets If FALSE (the default), retweets are discarded. 57 | #' @return data.frame of Tweets with metadata 58 | #' @export 59 | #' @examples 60 | #' tweet_df <- tweets_to_df("data/scraped_tweets", keep_retweets = FALSE) 61 | tweets_to_df <- function(tweet_dir, keep_retweets=FALSE) { 62 | files <- list.files(tweet_dir) 63 | tweets <- lapply(files, 64 | function(x) { 65 | tryCatch(parseTweets(file.path(tweet_dir, x), legacy=TRUE), 66 | error=function(e) NA) 67 | } 68 | ) 69 | tweets <- do.call("rbind",tweets) 70 | tweets$tweet_url <- sprintf("https://twitter.com/%s/status/%s", tweets$screen_name, tweets$id_str) 71 | 72 | if (!keep_retweets) { 73 | tweets <- tweets[!grepl("^RT", tweets$text),] # anchored so only true retweets are dropped 74 | } 75 | 76 | return(tweets) 77 | } 78 | 79 | -------------------------------------------------------------------------------- /man/predict_ideology.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/predict.R 3 | \name{predict_ideology} 4 | \alias{predict_ideology} 5 | \title{predict_ideology} 6 | \usage{ 7 | predict_ideology(tweets, model = "BiLSTM", embeddings = "w2v", 8 | embedding_dim = 25, filter_political_tweets = FALSE) 9 | } 10 | \arguments{ 11 | \item{tweets}{Character vector of Tweets.} 12 | 13 | \item{model}{Neural network architecture to use. Options are "LSTM", "BiLSTM", or "C-BiLSTM".} 14 | 15 | \item{embeddings}{Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization).} 16 | 17 | \item{embedding_dim}{Length of word embeddings to use. Options are 25, 50, 100, or 200.} 18 | 19 | \item{filter_political_tweets}{If Tweet collection may contain non-political Tweets, optionally filter them out before ideological scaling.} 20 | } 21 | \value{ 22 | Vector of float values between 0 and 1, where values closer to 0 indicate liberal ideological slant, values closer to 1 indicate conservative ideological slant, and values near 0.5 indicate a lack of ideological leaning. Non-political Tweets return an NA value. 23 | } 24 | \description{ 25 | This function allows you to scale the ideological slant of Twitter posts. 26 | } 27 | \details{ 28 | The data set on which the models are trained is roughly 75 percent Tweets from "elite" users (e.g. 
politicians, media outlets, think tanks, etc.), with the remaining 25 percent coming from "mass" users. In validating the models, it became apparent that they were much more capable of identifying slant from the former group, which in many ways presents an idealized scenario of clearly- (and often forcefully-) articulated ideological leanings along with (mostly) consistent grammar and spelling. Predictions of "mass" Tweets were largely clustered around the middle of the spectrum, not because they were necessarily more moderate, but because the models could not make a confident prediction either way. Accordingly, researchers should use caution when using this package to scale Tweets from groups other than political elites. 29 | 30 | The Tweets used to train the models were scraped and labeled in early 2018. The ideological spectrum is, of course, not a static entity, and where particular issues and actors fall on that spectrum can shift over time. Additionally, new issues and actors have emerged on the political scene since this data was collected, so stances on more recent topics (e.g. Brett Kavanaugh or the Green New Deal) that might provide a great deal of information to a political observer about someone's leanings would not provide any additional information to the model. 31 | } 32 | \examples{ 33 | tweets <- c("Make no mistake- the President of the United States is actively sabotaging the health insurance of millions of Americans with this action.", 34 | "This MLK Day, 50 years after his death, we honor Dr. King's legacy. He lived for the causes of justice and equality, and opened the door of opportunity for millions of Americans. America is a better; freer nation because of it.", 35 | "I’m disappointed in Senate Democrats for shutting down the government. #SchumerShutdown") 36 | preds <- predict_ideology(tweets, model="BiLSTM", embeddings="w2v") 37 | } 38 | -------------------------------------------------------------------------------- /R/predict.R: -------------------------------------------------------------------------------- 1 | #' predict_ideology 2 | #' 3 | #' This function allows you to scale the ideological slant of Twitter posts. 4 | #' @param tweets Character vector of Tweets. 5 | #' @param model Neural network architecture to use. Options are "LSTM", "BiLSTM", or "C-BiLSTM". 6 | #' @param embeddings Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization). 7 | #' @param embedding_dim Length of word embeddings to use. Options are 25, 50, 100, or 200. 8 | #' @param filter_political_tweets If Tweet collection may contain non-political Tweets, optionally filter them out before ideological scaling. 9 | #' @return Vector of float values between 0 and 1, where values closer to 0 indicate liberal ideological slant, values closer to 1 indicate conservative ideological slant, and values near 0.5 indicate a lack of ideological leaning. Non-political Tweets return an NA value. 10 | #' @export 11 | #' @details The data set on which the models are trained is roughly 75 percent Tweets from "elite" users (e.g. politicians, media outlets, think tanks, etc.), with the remaining 25 percent coming from "mass" users. In validating the models, it became apparent that they were much more capable of identifying slant from the former group, which in many ways presents an idealized scenario of clearly- (and often forcefully-) articulated ideological leanings along with (mostly) consistent grammar and spelling. 
Predictions of "mass" Tweets were largely clustered around the middle of the spectrum, not because they were necessarily more moderate, but because the models could not make a confident prediction either way. Accordingly, researchers should use caution when using this package to scale Tweets from groups other than political elites. 12 | #' @details The Tweets used to train the models were scraped and labeled in early 2018. The ideological spectrum is, of course, not a static entity, and where particular issues and actors fall on that spectrum can shift over time. Additionally, new issues and actors have emerged on the political scene since this data was collected, so stances on more recent topics (e.g. Brett Kavanaugh or the Green New Deal) that might provide a great deal of information to a political observer about someone's leanings would not provide any additional information to the model. 13 | #' @examples 14 | #' tweets <- c("Make no mistake- the President of the United States is actively sabotaging the health insurance of millions of Americans with this action.", 15 | #' "This MLK Day, 50 years after his death, we honor Dr. King's legacy. He lived for the causes of justice and equality, and opened the door of opportunity for millions of Americans. America is a better; freer nation because of it.", 16 | #' "I’m disappointed in Senate Democrats for shutting down the government. #SchumerShutdown") 17 | #' preds <- predict_ideology(tweets, model="BiLSTM", embeddings="w2v") 18 | 19 | predict_ideology <- function(tweets, model="BiLSTM", embeddings="w2v", embedding_dim=25, filter_political_tweets=FALSE) { 20 | stopifnot(model %in% list("LSTM", "BiLSTM", "C-BiLSTM")) 21 | 22 | cwd <- getwd() 23 | setwd("~/.deepIdeology/") 24 | # if Tweet collection contains non-political tweets, filter out before scaling ideology 25 | if (filter_political_tweets) { 26 | if (!file.exists("models/politics_classifier.h5")) { 27 | print("No pre-trained politics classifier exists. Training model now. This may take a moment.") 28 | prepare_politics_classifier() 29 | } 30 | 31 | pol_model <- keras::load_model_hdf5("models/politics_classifier.h5") 32 | pol_tokenizer <- keras::load_text_tokenizer("tokenizers/pol_tweet_tokenizer") 33 | pol_ind <- which(as.vector(pol_model %>% keras::predict_classes(texts_to_vectors(tweets, pol_tokenizer))) == 1) 34 | print(sprintf("%i political Tweets identified out of %i total Tweets", length(pol_ind), length(tweets))) 35 | } else { 36 | pol_ind <- seq_along(tweets) 37 | } 38 | 39 | # load fit tokenizer, convert raw text to sequences 40 | if (!file.exists("tokenizers/ideo_tweet_tokenizer")) { 41 | data("ideo_tweets") 42 | tokenizer <- keras::text_tokenizer(num_words = 20000) 43 | tokenizer <- keras::fit_text_tokenizer(tokenizer, ideo_tweets$text) 44 | if (!dir.exists("tokenizers")) { 45 | dir.create("tokenizers") 46 | } 47 | keras::save_text_tokenizer(tokenizer, "tokenizers/ideo_tweet_tokenizer") 48 | } 49 | 50 | tokenizer <- keras::load_text_tokenizer("tokenizers/ideo_tweet_tokenizer") 51 | 52 | # load desired model 53 | model_name_map <- list("LSTM" = "lstm", "BiLSTM" = "bi-lstm", "C-BiLSTM" = "c-bi-lstm") 54 | model_fname <- sprintf("models/%s_%s_%sd.h5", model_name_map[[model]], embeddings, embedding_dim) 55 | 56 | if (!file.exists(model_fname)) { 57 | print("No pre-trained model with that configuration exists. Training model now. 
This may take a moment.") 58 | data("ideo_tweets") 59 | text_vecs <- texts_to_vectors(ideo_tweets$text, tokenizer) 60 | labels <- ideo_tweets$ideo_cat 61 | data <- train_test_split(text_vecs, labels) 62 | if (model == "BiLSTM") { 63 | bidirectional = TRUE 64 | convolutional = FALSE 65 | } else if (model == "C-BiLSTM") { 66 | bidirectional = TRUE 67 | convolutional = TRUE 68 | } else { 69 | bidirectional = FALSE 70 | convolutional = FALSE 71 | } 72 | train_lstm(data$X_train, data$y_train, embeddings = embeddings, embedding_dim = embedding_dim, 73 | bidirectional = bidirectional, convolutional = convolutional) 74 | } 75 | 76 | model <- keras::load_model_hdf5(model_fname) 77 | 78 | text_vecs <- texts_to_vectors(tweets, tokenizer) 79 | # generate predictions on new text 80 | preds <- model %>% 81 | keras::predict_proba(text_vecs) 82 | 83 | preds <- preds[, 1] 84 | preds[!seq_along(preds) %in% pol_ind] <- NA # non-political Tweets are returned as NA 85 | setwd(cwd) 86 | return(preds) 87 | } 88 | 89 | -------------------------------------------------------------------------------- /R/word_embeddings.R: -------------------------------------------------------------------------------- 1 | #' prepare_glove_embeddings 2 | #' 3 | #' This function prepares an embedding matrix containing the words in the training data set from pre-trained GloVe embeddings. 4 | #' @param embedding_dim Dimensionality of word embeddings. Options are 25, 50, 100, 200. 5 | #' @param tokenizer Pre-fit keras text tokenizer. 6 | #' @export 7 | #' @details For more information on the GloVe embedding algorithm, visit https://nlp.stanford.edu/projects/glove/. 8 | #' @note The GloVe embeddings are 1.3G zipped and 3.8G unzipped. 9 | #' @note Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_glove_\{embedding_dim\}.rda" 10 | prepare_glove_embeddings <- function(embedding_dim, tokenizer) { 11 | stopifnot(embedding_dim %in% list(25, 50, 100, 200)) 12 | 13 | cwd <- getwd() 14 | setwd("~/.deepIdeology/") 15 | if (!dir.exists("glove.twitter.27B")) { 16 | download <- utils::menu(c("Yes", "No"), title="Cannot find pre-trained GloVe embeddings. Would you like to download now (1.3G)?") 17 | if (download != 1) stop("The pre-trained GloVe embeddings are required to proceed.") 18 | dir.create("glove.twitter.27B") 19 | download.file("http://nlp.stanford.edu/data/glove.twitter.27B.zip", "glove.twitter.27B/glove.twitter.27B.zip") 20 | unzip("glove.twitter.27B/glove.twitter.27B.zip", exdir="glove.twitter.27B") 21 | file.remove("glove.twitter.27B/glove.twitter.27B.zip") 22 | } 23 | embeddings_file <- sprintf("glove.twitter.27B/glove.twitter.27B.%sd.txt", embedding_dim) 24 | word_index <- tokenizer$word_index 25 | embeddings_index <- new.env(parent = emptyenv()) 26 | lines <- readLines(embeddings_file) 27 | for (line in lines) { 28 | values <- strsplit(line, ' ', fixed = TRUE)[[1]] 29 | word <- values[[1]] 30 | coefs <- as.numeric(values[-1]) 31 | embeddings_index[[word]] <- coefs 32 | } 33 | 34 | embedding_matrix <- matrix(0L, nrow = length(word_index)+1, ncol = embedding_dim) 35 | for (word in names(word_index)) { 36 | index <- word_index[[word]] 37 | if (index > length(word_index)) 38 | next 39 | embedding_vector <- embeddings_index[[word]] 40 | if (!is.null(embedding_vector)) { 41 | # words not found in embedding index will be all-zeros. 
42 | embedding_matrix[index + 1,] <- embedding_vector # shift by one: row 1 holds the padding index 0 43 | } 44 | } 45 | 46 | out_file <- sprintf("embeddings/tweet_glove_%sd.rda", embedding_dim) 47 | if (!dir.exists("embeddings")) { 48 | dir.create("embeddings") 49 | } 50 | save(embedding_matrix, file = out_file) 51 | setwd(cwd) 52 | } 53 | 54 | 55 | #' prepare_w2v_embeddings 56 | #' 57 | #' This function trains a word2vec model to create custom word embeddings from the training data set. 58 | #' @param texts Character vector of raw text from training data. 59 | #' @param embedding_dim Dimensionality of word embeddings. Options are 25, 50, 100, 200. 60 | #' @param tokenizer Pre-fit keras text tokenizer. 61 | #' @export 62 | #' @details For a good introduction to the word2vec model, see Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013). 63 | #' @note Embeddings are saved as Rdata to a folder called embeddings with the file format "tweet_w2v_\{embedding_dim\}.rda" 64 | 65 | prepare_w2v_embeddings <- function(texts, embedding_dim, tokenizer) { 66 | 67 | cwd <- getwd() 68 | setwd("~/.deepIdeology/") 69 | 70 | skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) { 71 | gen <- keras::texts_to_sequences_generator(tokenizer, sample(text)) 72 | function() { 73 | skip <- keras::generator_next(gen) %>% 74 | keras::skipgrams( 75 | vocabulary_size = tokenizer$num_words, 76 | window_size = window_size, 77 | negative_samples = negative_samples 78 | ) 79 | x <- purrr::transpose(skip$couples) %>% purrr::map(. %>% unlist %>% as.matrix(ncol = 1)) 80 | y <- skip$labels %>% as.matrix(ncol = 1) 81 | list(x, y) 82 | } 83 | } 84 | 85 | skip_window <- 5 # How many words to consider left and right. 86 | num_sampled <- 1 # Number of negative examples to sample for each word. 
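# The layers below implement the skip-gram objective: a target word and a context word are each looked up in the single shared "embedding" layer, and a sigmoid over their dot product is trained to separate true (target, context) pairs from the negative samples drawn by keras::skipgrams() in the generator above. After training, the weights of that shared layer are extracted as the word-embedding matrix.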
87 | 88 | input_target <- keras::layer_input(shape = 1) 89 | input_context <- keras::layer_input(shape = 1) 90 | 91 | embedding <- keras::layer_embedding( 92 | input_dim = tokenizer$num_words + 1, 93 | output_dim = embedding_dim, 94 | input_length = 1, 95 | name = "embedding" 96 | ) 97 | 98 | target_vector <- input_target %>% 99 | embedding() %>% 100 | keras::layer_flatten() 101 | 102 | context_vector <- input_context %>% 103 | embedding() %>% 104 | keras::layer_flatten() 105 | 106 | dot_product <- keras::layer_dot(list(target_vector, context_vector), axes = 1) 107 | output <- keras::layer_dense(dot_product, units = 1, activation = "sigmoid") 108 | 109 | model <- keras::keras_model(list(input_target, input_context), output) 110 | model %>% keras::compile(loss = "binary_crossentropy", optimizer = "adam") 111 | 112 | if (!dir.exists("models")) dir.create("models") # the checkpoint callback below writes into models/ 113 | model %>% keras::fit_generator(skipgrams_generator(texts, 114 | tokenizer, 115 | skip_window, 116 | num_sampled), 117 | steps_per_epoch=10000, 118 | epochs=10, 119 | callbacks = list(keras::callback_model_checkpoint(sprintf("models/w2v_%sd.h5", embedding_dim), 120 | monitor = "loss", 121 | save_best_only = TRUE), 122 | keras::callback_early_stopping(monitor = "loss", patience=2)) 123 | ) 124 | 125 | model <- keras::load_model_hdf5(sprintf("models/w2v_%sd.h5", embedding_dim)) 126 | embedding_matrix <- keras::get_weights(model)[[1]] 127 | words <- dplyr::data_frame(word=names(tokenizer$word_index), 128 | id=as.integer(unlist(tokenizer$word_index))) 129 | words <- words %>% dplyr::filter(id <= tokenizer$num_words) %>% dplyr::arrange(id) 130 | row.names(embedding_matrix) <- c("UNK",words$word) 131 | 132 | out_file <- sprintf("embeddings/tweet_w2v_%sd.rda", embedding_dim) 133 | 134 | if (!dir.exists("embeddings")) { 135 | dir.create("embeddings") 136 | } 137 | save(embedding_matrix,file=out_file) 138 | setwd(cwd) 139 | } 140 | -------------------------------------------------------------------------------- /R/train_models.R: -------------------------------------------------------------------------------- 1 | prepare_politics_classifier <- function() { 2 | data("pol_tweets") 3 | cwd <- getwd() 4 | setwd("~/.deepIdeology") 5 | if (file.exists("tokenizers/pol_tweet_tokenizer")) { 6 | tokenizer <- keras::load_text_tokenizer("tokenizers/pol_tweet_tokenizer") 7 | } else { 8 | tokenizer <- keras::text_tokenizer() 9 | tokenizer <- keras::fit_text_tokenizer(tokenizer, pol_tweets$Input.text) 10 | if (!dir.exists("tokenizers")) { 11 | dir.create("tokenizers") 12 | } 13 | 14 | keras::save_text_tokenizer(tokenizer, filename = "tokenizers/pol_tweet_tokenizer") 15 | } 16 | 17 | texts <- texts_to_vectors(pol_tweets$Input.text, tokenizer) 18 | labels <- pol_tweets$pol 19 | word_index <- tokenizer$word_index 20 | data <- train_test_split(texts, labels) 21 | 22 | lstm <- keras::keras_model_sequential() 23 | lstm %>% 24 | keras::layer_embedding(input_dim = length(word_index)+1, output_dim=64) %>% 25 | keras::layer_lstm(units=64, dropout=0.5, recurrent_dropout=0.3) %>% 26 | keras::layer_dense(units=16, activation='relu') %>% 27 | keras::layer_dropout(0.5) %>% 28 | keras::layer_dense(units=1, activation='sigmoid') 29 | 30 | lstm %>% keras::compile(loss='binary_crossentropy',optimizer='adam',metrics=c('accuracy')) 31 | 32 | if (!dir.exists("models")) { 33 | dir.create("models") 34 | } 35 | lstm %>% keras::fit( 36 | data$X_train, data$y_train, 37 | batch_size=64, 38 | epochs=100, 39 | validation_split=0.2, 40 | callbacks = 
list(keras::callback_model_checkpoint("models/politics_classifier.h5", 41 | monitor = "val_loss", 42 | save_best_only = TRUE), 43 | keras::callback_early_stopping(monitor = "val_loss", patience=3)) 44 | ) 45 | 46 | setwd(cwd) 47 | } 48 | 49 | #' train_lstm 50 | #' 51 | #' This function trains the LSTM model to identify the ideological slant of Tweets. 52 | #' @param X_train data.frame or matrix of vectorized Tweets 53 | #' @param y_train Labels for training data. 0 for liberal, 1 for conservative. 54 | #' @param embeddings Type of word embedding algorithm to use. Options are "w2v" (word2vec), "glove", or "random" (random initialization). 55 | #' @param embedding_dim Length of word embeddings to use. Options are 25, 50, 100, or 200. 56 | #' @param bidirectional Optionally train on text sequences in reverse as well as forwards. 57 | #' @param convolutional Optionally apply convolutional filter to text sequences. Can only be used when bidirectional = TRUE. 58 | #' @export 59 | #' @note Models are automatically saved in HDF5 format to a sub-folder of the root-directory called "models". File format is "\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5". 60 | #' @examples 61 | #' # train a Bi-LSTM network using GloVe embeddings 62 | #' data("ideo_tweets") 63 | #' ideo_tokenizer <- text_tokenizer(num_words=20000) 64 | #' ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 65 | #' texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 66 | #' labels <- ideo_tweets$ideo_cat 67 | #' 68 | #' train_test <- train_test_split(texts, labels) 69 | #' X_train <- train_test$X_train 70 | #' y_train <- train_test$y_train 71 | #' train_lstm(X_train, y_train, embeddings="glove", bidirectional=TRUE) 72 | train_lstm <- function(X_train, y_train, embeddings = "w2v", embedding_dim = 25, bidirectional = FALSE, convolutional = FALSE) { 73 | stopifnot(embedding_dim %in% list(25, 50, 100, 200)) 74 | stopifnot(embeddings %in% list("random", "w2v", "glove")) 75 | if (convolutional && !bidirectional) stop("convolutional = TRUE can only be used when bidirectional = TRUE") 76 | cwd <- getwd() 77 | setwd("~/.deepIdeology/") 78 | 79 | out_fname <- sprintf("lstm_%sd.h5", embedding_dim) 80 | 81 | model <- keras::keras_model_sequential() 82 | if (embeddings != "random") { 83 | embedding_fname <- sprintf("embeddings/tweet_%s_%sd.rda", embeddings, embedding_dim) 84 | 85 | if (!file.exists(embedding_fname)) { 86 | print(sprintf("Embedding file does not exist. Preparing %s-dimensional %s embeddings. 
This may take a moment", embedding_dim, embeddings)) 87 | tokenizer <- keras::load_text_tokenizer("tokenizers/ideo_tweet_tokenizer") 88 | if (embeddings == "glove") { 89 | prepare_glove_embeddings(embedding_dim, tokenizer) 90 | } else { 91 | data("ideo_tweets") 92 | prepare_w2v_embeddings(ideo_tweets$text, embedding_dim, tokenizer) 93 | } 94 | } 95 | embedding_matrix <- get(load(embedding_fname)) 96 | 97 | model %>% 98 | keras::layer_embedding(input_dim = dim(embedding_matrix)[1], output_dim=embedding_dim, 99 | weights = list(embedding_matrix)) 100 | out_fname <- sprintf("lstm_%s_%sd.h5", embeddings, embedding_dim) 101 | } else { 102 | model %>% 103 | keras::layer_embedding(input_dim = 20000+1, output_dim=embedding_dim) 104 | out_fname <- sprintf("lstm_random_%sd.h5", embedding_dim) # matches the file name predict_ideology() expects 105 | } 106 | if (convolutional) { 107 | model %>% 108 | keras::layer_conv_1d(filters=64, 109 | kernel_size = 3, 110 | padding = 'valid', 111 | activation = 'relu', 112 | strides=1) %>% 113 | keras::layer_max_pooling_1d(pool_size = 2) 114 | out_fname <- sprintf("c-bi-%s", out_fname) 115 | } 116 | if (bidirectional) { 117 | model %>% 118 | keras::bidirectional(keras::layer_lstm(units=64, dropout=0.3, recurrent_dropout=0.3)) 119 | if (!convolutional) out_fname <- sprintf("bi-%s", out_fname) 120 | } else { 121 | model %>% 122 | keras::layer_lstm(units=64, dropout=0.3, recurrent_dropout=0.3) 123 | } 124 | 125 | model %>% 126 | keras::layer_dense(units=16, activation='relu') %>% 127 | keras::layer_dropout(0.5) %>% 128 | keras::layer_dense(units=1, activation='sigmoid') 129 | 130 | model %>% keras::compile(loss='binary_crossentropy',optimizer='adam',metrics=c('accuracy')) 131 | 132 | if (!dir.exists("models")) { 133 | dir.create("models") 134 | } 135 | model %>% keras::fit( 136 | X_train, y_train, 137 | batch_size=64, 138 | epochs=50, 139 | validation_split=0.2, 140 | callbacks = list(keras::callback_model_checkpoint(sprintf("models/%s", out_fname), 141 | monitor = "val_loss", 142 | save_best_only = TRUE), 143 | keras::callback_early_stopping(monitor = "val_loss", patience=3)) 144 | ) 145 | 146 | setwd(cwd) 147 | } 148 | 149 | #' evaluate 150 | #' 151 | #' This function evaluates the performance of a trained model. 152 | #' @param model_path Path to HDF5 file containing model. Should be of the form "models/\{model type\}_\{embedding type\}_\{embedding dimensionality\}d.h5" 153 | #' @param X_test data.frame or matrix of vectorized Tweets 154 | #' @param y_test Labels for testing data. 0 for liberal, 1 for conservative. 155 | #' @export 156 | #' @return List of performance metrics. Currently, a confusion matrix, overall prediction accuracy, precision, recall, and F1 score are returned. 
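#' @details Metrics are derived from the confusion matrix: per-class precision is diag / colsums, per-class recall is diag / rowsums, and F1 is their harmonic mean, 2 * precision * recall / (precision + recall).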
157 | #' @examples 158 | #' data("ideo_tweets") 159 | #' ideo_tokenizer <- text_tokenizer(num_words=20000) 160 | #' ideo_tokenizer <- fit_text_tokenizer(ideo_tokenizer, ideo_tweets$text) 161 | #' texts <- texts_to_vectors(ideo_tweets$text, ideo_tokenizer) 162 | #' labels <- ideo_tweets$ideo_cat 163 | #' 164 | #' train_test <- train_test_split(texts, labels) 165 | #' 166 | #' evaluate("models/bi-lstm_w2v_25d.h5", train_test$X_test, train_test$y_test) 167 | evaluate <- function(model_path, X_test, y_test) { 168 | model <- keras::load_model_hdf5(model_path) 169 | preds <- model %>% 170 | keras::predict_classes(X_test) 171 | 172 | res <- list() 173 | cm <- as.matrix(table(Actual = y_test, Predicted = preds)) 174 | res[["Confusion Matrix"]] <- cm 175 | 176 | n <- sum(cm) # number of instances 177 | nc <- nrow(cm) # number of classes 178 | diag <- diag(cm) # number of correctly classified instances per class 179 | rowsums <- apply(cm, 1, sum) # number of instances per class 180 | colsums <- apply(cm, 2, sum) # number of predictions per class 181 | p <- rowsums / n # distribution of instances over the actual classes 182 | q <- colsums / n # distribution of instances over the predicted classes 183 | accuracy <- sum(diag) / n 184 | res[["Accuracy"]] <- accuracy 185 | 186 | precision <- diag / colsums 187 | recall <- diag / rowsums 188 | f1 <- 2 * precision * recall / (precision + recall) 189 | res[["Precision/Recall"]] <- data.frame(precision, recall, f1) 190 | 191 | return(res) 192 | } 193 | 194 | #' train_test_split 195 | #' 196 | #' Helper function to split data into training and testing sets. 197 | #' @param X data.frame or matrix of data 198 | #' @param y Labels (optional). 199 | #' @param test_size Proportion of samples to set aside for testing. 200 | #' @export 201 | #' @return List of X_train, X_test, y_train, y_test 202 | train_test_split <- function(X, y, test_size=0.2) { 203 | n_train <- floor((1-test_size)*nrow(X)) 204 | train_ind <- sample(nrow(X),n_train) 205 | return(list(X_train=X[train_ind,], X_test=X[-train_ind,], y_train=y[train_ind], y_test=y[-train_ind])) 206 | } 207 | 208 | #' texts_to_vectors 209 | #' 210 | #' Helper function to vectorize text data 211 | #' @param texts Character vector of raw text data 212 | #' @param tokenizer Pre-fit keras tokenizer 213 | #' @export 214 | #' @return matrix of vectorized texts 215 | texts_to_vectors <- function(texts, tokenizer){ 216 | sequences <- keras::texts_to_sequences(tokenizer, texts) 217 | vecs <- keras::pad_sequences(sequences) 218 | return(vecs) 219 | } 220 | 221 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # deepIdeology: Scale ideological slant of Tweets 2 | 3 | This package allows users to identify the ideological leanings of Twitter posts with benchmark accuracy 4 | using a Long Short-Term Memory recurrent neural network model trained on a data set of Tweets labeled through 5 | the Amazon Mechanical Turk crowd-sourcing platform. The best-performing model is able to classify Tweets as 6 | liberal- or conservative-leaning with 86.90% accuracy and can capture both directionality and degree of 7 | slant (see "Performance" and "Validation" below). This package was developed for a study of preference falsification on social media entitled "Private Partisan, 8 | Public Moderate: Preference Falsification on Twitter" (Gottlieb, 2018). Contact the maintainer to request access. 
9 | 10 | ## Installation and setup 11 | 12 | To install the latest version of `deepIdeology` from GitHub, run the following: 13 | ```{r} 14 | devtools::install_github("alex-gottlieb/deepIdeology") 15 | library(deepIdeology) 16 | ``` 17 | Following successful installation, run the `complete_setup()` command, which will finish installing the `keras` module and set up the file caching 18 | on which the package relies. 19 | 20 | ## predict_ideology 21 | 22 | `predict_ideology` is the core function of this package. It allows the user to scale the ideological slant of Tweets from 0 to 1, with values close 23 | to 0 indicating a strong liberal slant, values close to 1 indicating a strong conservative slant, and values around 0.5 indicating ideologically moderate or neutral content. The first time this function is called, it will likely take upwards of an hour to run, as the model will need to be trained and the word embeddings, which are required to vectorize the raw text of the Tweets, will need to be either downloaded (GloVe) or learned by another neural network 24 | model (word2vec). Once a particular model or embedding configuration has been used, though, the files will be cached, allowing near-instantaneous 25 | evaluations in the future. 26 | 27 | The parameters of the function deserve a little further explanation. The best-performing parameters are set as the defaults of the function, so readers not interested in the technical details can skip this section. 28 | * `model` allows the user to select the particular neural network architecture used in the slant classifier: 29 | + `LSTM` stands for Long Short-Term Memory Network, a type of recurrent neural network that can learn contextual information in text, or related bits of information separated by a wide spatial or temporal gap (Hochreiter and Schmidhuber 1997). 30 | + `BiLSTM` is a Bidirectional LSTM (Graves and Schmidhuber 2005). In this architecture, two LSTM units are trained: one on the text as-is, and one on a reverse copy of the input sequence, which allows the network to place a given word in the context of what comes both before and after it. 31 | + `C-BiLSTM` is a Convolutional Bidirectional LSTM (Xi-Lian, Wei, and Teng-Jiao 2017), which can learn target-related context and semantic representations simultaneously. 32 | * `embeddings` determines which type of word embedding is used. Word embeddings are a means of transforming raw text into *d*-dimensional numeric vectors that a machine can understand. A straightforward primer on word embeddings and common models can be found [here](https://machinelearningmastery.com/what-are-word-embeddings/). 33 | + `GloVe` is a count-based model, which learns embeddings through dimensionality reduction on the co-occurrence count matrix of a corpus (Pennington, Socher, and Manning 2014). The networks in this package use GloVe embeddings calculated from a corpus of 2 billion Tweets. More information can be found [here](https://nlp.stanford.edu/projects/glove/). 34 | + `w2v` or word2vec is a predictive model, which means it learns the embeddings that minimize the loss of predicting each word given its context words and their vector representation (Mikolov et al. 2013). If word2vec embeddings are chosen, a separate neural network model will be trained to learn the word embeddings. 35 | + `random` uses an embedding layer with a random initialization, which is then learned in the course of the regular model training. 
* `embedding_dim` is the dimensionality of the vector space into which each word is projected. In general, higher-dimensional embeddings can capture more semantic subtleties, but also require more training data to discover those nuances. For the sake of making functions more generalizable, options are restricted to 25, 50, 100, and 200. 37 | * `filter_political_tweets` gives users the option to remove Tweets that are non-political in nature before slant-scaling if there is the possibility that such Tweets are contained in the data set. This is done using a separate classifier also trained on Tweets labeled as "political" or "not political" through Amazon Mechanical Turk. 38 | 39 | A toy example: 40 | ```{r} 41 | tweets <- c("Republicans are moving full steam ahead on their #GOPTaxScam, which lays the groundwork for them to gut Social Security and Medicare. I urge my Senate colleagues to vote No!", 42 | "This MLK Day, 50 years after his death, we honor Dr. King's legacy. He lived for the causes of justice and equality, and opened the door of opportunity for millions of Americans. America is a better, freer nation because of it.", 43 | "I’m disappointed in Senate Democrats for shutting down the government. #SchumerShutdown") 44 | 45 | predict_ideology(tweets) 46 | ``` 47 | 48 | #### Caveats 49 | The data set on which the models are trained is roughly 75 percent Tweets from "elite" users (e.g. politicians, media outlets, think tanks, etc.), with the remaining 25 percent coming from "mass" users. In validating the models, it became apparent that they were much more capable of identifying slant from the former group, which in many ways presents an idealized scenario of clearly- (and often forcefully-) articulated ideological leanings along with (mostly) consistent grammar and spelling. Predictions of "mass" Tweets were largely clustered around the middle of the spectrum, not because they were necessarily more moderate, but because the models could not make a confident prediction either way. Accordingly, researchers should use caution when using this package to scale Tweets from groups other than political elites. 50 | 51 | Additionally, the Tweets used to train the models were scraped and labeled in early 2018. The ideological spectrum is, of course, not a static entity, and where particular issues and actors fall on that spectrum can shift over time. Additionally, new issues and actors have emerged on the political scene since this data was collected, so stances on more recent topics (e.g. Brett Kavanaugh or the Green New Deal) that might provide a great deal of information to a political observer about someone's leanings would not provide any additional information to the model. 52 | 53 | Both of these issues can be addressed with the continued augmentation of the training data set with labeled examples, so if anyone is interested in continuing this work, please be in touch! 54 | 55 | ## Performance 56 | 57 | The predictive accuracy of various model/embedding combinations on a set-aside testing data set is shown below in Table 1. 58 | 59 | ![Table 1: Ideology classifier performance](figures/model_performance.png) 60 | 61 | Note that a full hyperparameter optimization was not performed for any of the neural network models, so these numbers can be considered lower-bound estimates of the predictive power of the respective model configurations, and comparisons between models should be taken with a grain of salt. 
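For anyone wishing to reproduce a row of Table 1, a benchmarking run looks roughly like the following (a minimal sketch: it assumes the default 20% hold-out split and that the trained model has been cached under `~/.deepIdeology/models/` as described above):
```{r}
library(keras)
library(deepIdeology)

# vectorize the labeled Tweets with the same tokenizer configuration the models use
data("ideo_tweets")
tokenizer <- text_tokenizer(num_words = 20000)
tokenizer <- fit_text_tokenizer(tokenizer, ideo_tweets$text)
texts <- texts_to_vectors(ideo_tweets$text, tokenizer)
labels <- ideo_tweets$ideo_cat

# hold out 20% of the data, train a Bi-LSTM on the rest, and score the held-out set
split <- train_test_split(texts, labels, test_size = 0.2)
train_lstm(split$X_train, split$y_train, embeddings = "w2v", embedding_dim = 25, bidirectional = TRUE)
evaluate("~/.deepIdeology/models/bi-lstm_w2v_25d.h5", split$X_test, split$y_test)
```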
62 | 63 | ## Validation 64 | 65 | A number of tests were performed to validate the quality of predictions. Table 2 shows a simple face validity test in which one Tweet from the "elite" user pool is sampled from each decile of prediction (i.e. [0, 0.1), [0.1, 0.2),...,[0.9, 1]). 66 | 67 | ![Table 2: Examples of predicted probabilities for randomly selected Tweets. Values closer to 0 mean the model is highly confident that the Tweet is liberal, while values close to 1 indicate confidence in conservativeness.](figures/face_validity.png) 68 | 69 | Based on this random sample, it would appear that the model captures both directionality and degree of slant, with Tweets predicted around 0.5 betraying little ideological leaning and getting progressively more liberal and conservative as values approach 0 and 1, respectively. The model even appears capable of scaling Tweets for which a high level of political knowledge, long-term memory, and careful reading of tone would be required to understand the slant. For example, the message “Chairman Grassley's job is to hold hearings on Judge Garland. He doesn't need to poll his colleagues. He just needs to do his job!” has a predicted probability of around 0.15. To understand that this Tweet is liberal in nature, one would have to know that Chuck Grassley is a conservative Republican senator, to remember that “Chairman Grassley” is the antecedent of the pronoun “he”, which comes much later in the Tweet, and to understand that the tone of the sentences beginning with “he” is highly disapproving. 70 | 71 | As a more rigorous test, estimates were validated against existing and widely-accepted measures of ideological preferences. Figure 1 shows the correlation between the mean predicted ideology of 200 Tweets for members of Congress and the first dimension of their DW-NOMINATE score, which is based on the roll-call voting patterns of members of Congress (Poole and Rosenthal 1997). 72 | 73 | ![Figure 1: Comparing Twitter-based estimates of ideological preferences of legislators to the first dimension of their DW-NOMINATE scores.](figures/dw-nom.png) 74 | 75 | The correlation between Twitter-based estimates and the first dimension of the DW-NOMINATE scores for the 114th Congress is 0.947 for the House of Representatives and 0.937 for the Senate, values roughly equal to the Twitter-based Bayesian ideal points calculated by Barbera (2015) as well as those derived from Facebook "like"-ing data by Bond and Messing (2015). --------------------------------------------------------------------------------