├── .gitattributes ├── .gitignore ├── README-fig ├── questions_per_day-1.png ├── questions_per_week-1.png └── tags_per_year-1.png ├── README.Rmd ├── README.html ├── README.md ├── question_tags.csv.gz ├── questions.csv.gz ├── setup-data.R ├── stacklite.Rproj └── status.yml /.gitattributes: -------------------------------------------------------------------------------- 1 | *.csv.gz filter=lfs diff=lfs merge=lfs -text 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | README-cache 2 | .Rproj.user 3 | -------------------------------------------------------------------------------- /README-fig/questions_per_day-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgrtwo/StackLite/537da93f4b3ba606732e561ce02ceaba6db1ab98/README-fig/questions_per_day-1.png -------------------------------------------------------------------------------- /README-fig/questions_per_week-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgrtwo/StackLite/537da93f4b3ba606732e561ce02ceaba6db1ab98/README-fig/questions_per_week-1.png -------------------------------------------------------------------------------- /README-fig/tags_per_year-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dgrtwo/StackLite/537da93f4b3ba606732e561ce02ceaba6db1ab98/README-fig/tags_per_year-1.png -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | ```{r setup, include=FALSE} 2 | knitr::opts_chunk$set(echo = TRUE) 3 | ``` 4 | 5 | ```{r echo = FALSE} 6 | library(knitr) 7 | opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE, 8 | cache.path = 
"README-cache/", 9 | fig.path = "README-fig/") 10 | ``` 11 | 12 | ## StackLite: A simple dataset of Stack Overflow questions and tags 13 | 14 | This repository shares a dataset about Stack Overflow questions. For each question, it includes: 15 | 16 | * Question ID 17 | * Creation date 18 | * Closed date, if applicable 19 | * Deletion date, if applicable 20 | * Score 21 | * Owner user ID 22 | * Number of answers 23 | * Tags 24 | 25 | This dataset is ideal for answering questions such as: 26 | 27 | * The increase or decrease in questions in each tag over time 28 | * Correlations among tags on questions 29 | * Which tags tend to get higher or lower scores 30 | * Which tags tend to be asked on weekends vs weekdays 31 | * Rates of question closure or deletion over time 32 | * The speed at which questions are closed or deleted 33 | 34 | This is all public data within the [Stack Exchange Data Dump](https://archive.org/details/stackexchange), which is much more comprehensive (including question and answer text), but also requires much more computational overhead to download and process. This dataset is designed to be easy to read in and start analyzing. Similarly, this data can be examined within the [Stack Exchange Data Explorer](https://data.stackexchange.com/), but this offers analysts the chance to work with it locally using their tool of choice. 35 | 36 | ### Status 37 | 38 | ```{r load_data, echo = FALSE} 39 | library(dplyr) 40 | library(readr) 41 | library(yaml) 42 | 43 | questions <- read_csv("questions.csv.gz", progress = FALSE) 44 | question_tags <- read_csv("question_tags.csv.gz", progress = FALSE) 45 | ``` 46 | 47 | This dataset was extracted from the Stack Overflow database at `r yaml.load_file("status.yml")$retrieved_time` UTC and contains questions up to **`r yaml.load_file("status.yml")$max_date`**. This includes `r sum(is.na(questions$DeletionDate))` non-deleted questions, and `r sum(!is.na(questions$DeletionDate))` deleted ones. 
(The script for downloading the data can be found in [setup-data.R](setup-data.R), though it can be run only by Stack Overflow employees with database access). 48 | 49 | ### Examples in R 50 | 51 | The dataset is provided as csv.gz files, which means you can use almost any language or statistical tool to process it. But here I'll share some examples of an analysis in R. 52 | 53 | The question data and the question-tag pairings are stored separately. You can read in the dataset with: 54 | 55 | ```{r eval = FALSE} 56 | library(readr) 57 | library(dplyr) 58 | 59 | questions <- read_csv("questions.csv.gz") 60 | question_tags <- read_csv("question_tags.csv.gz") 61 | ``` 62 | 63 | ```{r} 64 | questions 65 | question_tags 66 | ``` 67 | 68 | As one example, you could find the most popular tags: 69 | 70 | ```{r question_tags_count, dependson = "load_data"} 71 | question_tags %>% 72 | count(Tag, sort = TRUE) 73 | ``` 74 | 75 | Or plot the number of questions asked per week: 76 | 77 | ```{r questions_per_week, dependson = "load_data"} 78 | library(ggplot2) 79 | library(lubridate) 80 | 81 | questions %>% 82 | count(Week = round_date(CreationDate, "week")) %>% 83 | ggplot(aes(Week, n)) + 84 | geom_line() 85 | ``` 86 | 87 | Or you could compare the growth of particular tags over time: 88 | 89 | ```{r tags_per_year, dependson = "load_data"} 90 | library(lubridate) 91 | 92 | tags <- c("c#", "javascript", "python", "r") 93 | 94 | q_per_year <- questions %>% 95 | count(Year = year(CreationDate)) %>% 96 | rename(YearTotal = n) 97 | 98 | tags_per_year <- question_tags %>% 99 | filter(Tag %in% tags) %>% 100 | inner_join(questions) %>% 101 | count(Year = year(CreationDate), Tag) %>% 102 | inner_join(q_per_year) 103 | 104 | ggplot(tags_per_year, aes(Year, n / YearTotal, color = Tag)) + 105 | geom_line() + 106 | scale_y_continuous(labels = scales::percent_format()) + 107 | ylab("% of Stack Overflow questions with this tag") 108 | ``` 109 | 
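As one more sketch in the same spirit (it is not part of the original analysis), you could look at how quickly questions get closed, one of the questions listed at the top. The example below assumes only the `CreationDate` and `ClosedDate` columns shown above; `questions_toy` and its values are made up so the snippet runs without downloading the data, and you would substitute the real `questions` table in practice.

```r
library(dplyr)

# Toy stand-in for `questions` (made-up values, for illustration only).
# With the real data: questions <- readr::read_csv("questions.csv.gz")
questions_toy <- tibble(
  Id = 1:4,
  CreationDate = as.POSIXct(
    c("2016-01-02 10:00:00", "2016-01-04 09:30:00",
      "2016-01-09 12:00:00", "2016-01-05 08:00:00"),
    tz = "UTC"
  ),
  ClosedDate = as.POSIXct(
    c("2016-01-02 12:00:00", NA, "2016-01-10 12:00:00", NA),
    tz = "UTC"
  )
)

# Keep only closed questions, then measure the gap between creation and closure
hours_to_close <- questions_toy %>%
  filter(!is.na(ClosedDate)) %>%
  mutate(HoursToClose = as.numeric(difftime(ClosedDate, CreationDate, units = "hours"))) %>%
  summarize(MedianHoursToClose = median(HoursToClose))

hours_to_close
```

The median is used here rather than the mean because closure times are likely to be heavily skewed; that choice is a suggestion, not something the dataset prescribes.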
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | ## StackLite: A simple dataset of Stack Overflow questions and tags 6 | 7 | This repository shares a dataset about Stack Overflow questions. For each question, it includes: 8 | 9 | * Question ID 10 | * Creation date 11 | * Closed date, if applicable 12 | * Deletion date, if applicable 13 | * Score 14 | * Owner user ID 15 | * Number of answers 16 | * Tags 17 | 18 | This dataset is ideal for answering questions such as: 19 | 20 | * The increase or decrease in questions in each tag over time 21 | * Correlations among tags on questions 22 | * Which tags tend to get higher or lower scores 23 | * Which tags tend to be asked on weekends vs weekdays 24 | * Rates of question closure or deletion over time 25 | * The speed at which questions are closed or deleted 26 | 27 | This is all public data within the [Stack Exchange Data Dump](https://archive.org/details/stackexchange), which is much more comprehensive (including question and answer text), but also requires much more computational overhead to download and process. This dataset is designed to be easy to read in and start analyzing. Similarly, this data can be examined within the [Stack Exchange Data Explorer](https://data.stackexchange.com/), but this offers analysts the chance to work with it locally using their tool of choice. 28 | 29 | ### Status 30 | 31 | 32 | 33 | This dataset was extracted from the Stack Overflow database at 2017-04-06 16:39:26 UTC and contains questions up to **2017-04-05**. This includes 13629741 non-deleted questions, and 4133745 deleted ones. (The script for downloading the data can be found in [setup-data.R](setup-data.R), though it can be run only by Stack Overflow employees with database access). 
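The two counts above are derived from the `DeletionDate` column: a question is treated as deleted when that field is not missing. A minimal sketch of the tally follows, using a small made-up table (`questions_toy`, a hypothetical name with invented values) in place of the real `questions` data.

```r
library(dplyr)

# Toy stand-in for the real `questions` table (made-up values).
# With the real data: questions <- readr::read_csv("questions.csv.gz")
questions_toy <- tibble(
  Id = c(101, 102, 103, 104),
  DeletionDate = as.POSIXct(
    c("2016-03-01 08:00:00", NA, NA, "2016-05-20 17:45:00"),
    tz = "UTC"
  )
)

# A missing DeletionDate means the question has not been deleted
status_counts <- questions_toy %>%
  summarize(
    NonDeleted = sum(is.na(DeletionDate)),
    Deleted = sum(!is.na(DeletionDate))
  )

status_counts
```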
34 | 35 | ### Examples in R 36 | 37 | The dataset is provided as csv.gz files, which means you can use almost any language or statistical tool to process it. But here I'll share some examples of an analysis in R. 38 | 39 | The question data and the question-tag pairings are stored separately. You can read in the dataset with: 40 | 41 | 42 | ```r 43 | library(readr) 44 | library(dplyr) 45 | 46 | questions <- read_csv("questions.csv.gz") 47 | question_tags <- read_csv("question_tags.csv.gz") 48 | ``` 49 | 50 | 51 | ```r 52 | questions 53 | ``` 54 | 55 | ``` 56 | ## # A tibble: 17,763,486 × 7 57 | ## Id CreationDate ClosedDate DeletionDate Score 58 | ## 59 | ## 1 1 2008-07-31 21:26:37 2011-03-28 00:53:47 1 60 | ## 2 4 2008-07-31 21:42:52 472 61 | ## 3 6 2008-07-31 22:08:08 210 62 | ## 4 8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40 42 63 | ## 5 9 2008-07-31 23:40:59 1452 64 | ## 6 11 2008-07-31 23:55:37 1154 65 | ## 7 13 2008-08-01 00:42:38 464 66 | ## 8 14 2008-08-01 00:59:11 296 67 | ## 9 16 2008-08-01 04:59:33 84 68 | ## 10 17 2008-08-01 05:09:55 119 69 | ## # ... with 17,763,476 more rows, and 2 more variables: OwnerUserId , 70 | ## # AnswerCount 71 | ``` 72 | 73 | ```r 74 | question_tags 75 | ``` 76 | 77 | ``` 78 | ## # A tibble: 52,224,835 × 2 79 | ## Id Tag 80 | ## 81 | ## 1 1 data 82 | ## 2 4 c# 83 | ## 3 4 winforms 84 | ## 4 4 type-conversion 85 | ## 5 4 decimal 86 | ## 6 4 opacity 87 | ## 7 6 html 88 | ## 8 6 css 89 | ## 9 6 css3 90 | ## 10 6 internet-explorer-7 91 | ## # ... 
with 52,224,825 more rows 92 | ``` 93 | 94 | As one example, you could find the most popular tags: 95 | 96 | 97 | ```r 98 | question_tags %>% 99 | count(Tag, sort = TRUE) 100 | ``` 101 | 102 | ``` 103 | ## # A tibble: 59,140 × 2 104 | ## Tag n 105 | ## 106 | ## 1 javascript 1712324 107 | ## 2 java 1614786 108 | ## 3 php 1406127 109 | ## 4 c# 1356681 110 | ## 5 android 1327680 111 | ## 6 jquery 1035978 112 | ## 7 python 898647 113 | ## 8 html 804340 114 | ## 9 ios 652484 115 | ## 10 c++ 645197 116 | ## # ... with 59,130 more rows 117 | ``` 118 | 119 | Or plot the number of questions asked per week: 120 | 121 | 122 | ```r 123 | library(ggplot2) 124 | library(lubridate) 125 | 126 | questions %>% 127 | count(Week = round_date(CreationDate, "week")) %>% 128 | ggplot(aes(Week, n)) + 129 | geom_line() 130 | ``` 131 | 132 | ![plot of chunk questions_per_week](README-fig/questions_per_week-1.png) 133 | 134 | Or you could compare the growth of particular tags over time: 135 | 136 | 137 | ```r 138 | library(lubridate) 139 | 140 | tags <- c("c#", "javascript", "python", "r") 141 | 142 | q_per_year <- questions %>% 143 | count(Year = year(CreationDate)) %>% 144 | rename(YearTotal = n) 145 | 146 | tags_per_year <- question_tags %>% 147 | filter(Tag %in% tags) %>% 148 | inner_join(questions) %>% 149 | count(Year = year(CreationDate), Tag) %>% 150 | inner_join(q_per_year) 151 | 152 | ggplot(tags_per_year, aes(Year, n / YearTotal, color = Tag)) + 153 | geom_line() + 154 | scale_y_continuous(labels = scales::percent_format()) + 155 | ylab("% of Stack Overflow questions with this tag") 156 | ``` 157 | 158 | ![plot of chunk tags_per_year](README-fig/tags_per_year-1.png) 159 | -------------------------------------------------------------------------------- /question_tags.csv.gz: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:6c681a91555e398bc5c67f9d70e06b3b5de23fe6605310f0a2013295fc52d5d0 3 
| size 251822155 4 | -------------------------------------------------------------------------------- /questions.csv.gz: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:b573243588b32c1c408b0c66aaaf40a76b5ea785d21728299fd2f482515b68d0 3 | size 216921225 4 | -------------------------------------------------------------------------------- /setup-data.R: -------------------------------------------------------------------------------- 1 | # This script cannot be run except by Stack Overflow employees with access to the internal 2 | # sqlstackr package and the Stack Exchange databases. 3 | 4 | # It is provided for transparency 5 | 6 | library(sqlstackr) 7 | library(dplyr) 8 | library(yaml) 9 | library(tidyr) 10 | library(stringr) 11 | library(lubridate) 12 | library(readr) 13 | 14 | # we convert dates to a diff and back because it happens to 15 | # be much faster on large queries with the RSQLServer driver 16 | 17 | date_diff <- function(x) as.POSIXct(x, tz = "UTC", origin = "1970-1-1") 18 | 19 | query <- "select Id, 20 | datediff(second, '1970-1-1', CreationDate) as CreationDate, 21 | datediff(second, '1970-1-1', ClosedDate) as ClosedDate, 22 | datediff(second, '1970-1-1', DeletionDate) as DeletionDate, 23 | Tags, Score, OwnerUserId, AnswerCount 24 | from Posts 25 | where PostTypeId = 1" 26 | 27 | retrieved_time <- with_tz(Sys.time(), "UTC") 28 | 29 | questions_raw <- query_StackOverflow(query, collect = TRUE) %>% 30 | mutate_each(funs(date_diff), CreationDate, ClosedDate, DeletionDate) 31 | 32 | # filter only for yesterday, and blank out owner user ID on deleted questions 33 | max_date <- max(as.Date(questions_raw$CreationDate)) - 1 34 | 35 | questions <- questions_raw %>% 36 | filter(as.Date(CreationDate) <= max_date) %>% 37 | arrange(Id) %>% 38 | mutate(OwnerUserId = ifelse(is.na(DeletionDate), OwnerUserId, NA)) 39 | 40 | # write status.yml 41 | s <- list(retrieved_time = 
as.character(retrieved_time), 42 | max_date = as.character(max_date), 43 | number_questions = nrow(questions)) 44 | write(as.yaml(s), file = "status.yml") 45 | 46 | # turn question tags into one row per question-tag pair 47 | question_tags <- questions %>% 48 | select(Id, Tags) %>% 49 | unnest(Tags = str_split(Tags, "\\|")) %>% 50 | filter(Tags != "") %>% 51 | rename(Tag = Tags) 52 | 53 | questions$Tags <- NULL 54 | 55 | # remove existing gz files 56 | unlink("questions.csv.gz") 57 | unlink("question_tags.csv.gz") 58 | 59 | write_csv(questions, "questions.csv") 60 | system("gzip questions.csv") 61 | 62 | write_csv(question_tags, "question_tags.csv") 63 | system("gzip question_tags.csv") 64 | 65 | # knit the README 66 | library(knitr) 67 | knit("README.Rmd") 68 | -------------------------------------------------------------------------------- /stacklite.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /status.yml: -------------------------------------------------------------------------------- 1 | retrieved_time: 2016-10-13 18:09:48 2 | max_date: '2016-10-12' 3 | number_questions: 16238301 4 | 5 | --------------------------------------------------------------------------------
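The step in setup-data.R that turns the `Tags` column into one row per question-tag pair can be sketched on toy data. This is an illustration under assumptions, not the script itself: `questions_toy` and its values are made up, the "|"-delimited tag format is inferred from the split pattern in the script, and `tidyr::separate_rows()` is swapped in for the script's `unnest(Tags = str_split(...))` idiom.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw rows with "|"-delimited tags (format inferred, values made up)
questions_toy <- tibble(
  Id = c(4, 6),
  Tags = c("c#|winforms|decimal", "html|css")
)

# One row per question-tag pair; drop any empty strings left by the split
question_tags_toy <- questions_toy %>%
  separate_rows(Tags, sep = "\\|") %>%
  filter(Tags != "") %>%
  rename(Tag = Tags)

question_tags_toy
```

The result has the same two-column shape (`Id`, `Tag`) as `question_tags.csv.gz`, which is why the tag data can be joined back to `questions` by `Id`.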