├── .gitattributes
├── .gitignore
├── README-fig
    ├── questions_per_day-1.png
    ├── questions_per_week-1.png
    └── tags_per_year-1.png
├── README.Rmd
├── README.html
├── README.md
├── question_tags.csv.gz
├── questions.csv.gz
├── setup-data.R
├── stacklite.Rproj
└── status.yml


/.gitattributes:
--------------------------------------------------------------------------------
1 | *.csv.gz filter=lfs diff=lfs merge=lfs -text
2 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | README-cache
2 | .Rproj.user
3 | 


--------------------------------------------------------------------------------
/README-fig/questions_per_day-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgrtwo/StackLite/537da93f4b3ba606732e561ce02ceaba6db1ab98/README-fig/questions_per_day-1.png


--------------------------------------------------------------------------------
/README-fig/questions_per_week-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgrtwo/StackLite/537da93f4b3ba606732e561ce02ceaba6db1ab98/README-fig/questions_per_week-1.png


--------------------------------------------------------------------------------
/README-fig/tags_per_year-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgrtwo/StackLite/537da93f4b3ba606732e561ce02ceaba6db1ab98/README-fig/tags_per_year-1.png


--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
  1 | ```{r setup, include=FALSE}
  2 | knitr::opts_chunk$set(echo = TRUE)
  3 | ```
  4 | 
  5 | ```{r echo = FALSE}
  6 | library(knitr)
  7 | opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE,
  8 |                cache.path = "README-cache/",
  9 |                fig.path = "README-fig/")
 10 | ```
 11 | 
 12 | ## StackLite: A simple dataset of Stack Overflow questions and tags
 13 | 
 14 | This repository shares a dataset about Stack Overflow questions. For each question, it includes:
 15 | 
 16 | * Question ID
 17 | * Creation date
 18 | * Closed date, if applicable
 19 | * Deletion date, if applicable
 20 | * Score
 21 | * Owner user ID
 22 | * Number of answers
 23 | * Tags
 24 | 
 25 | This dataset is well suited to exploring topics such as:
 26 | 
 27 | * The increase or decrease in questions in each tag over time
 28 | * Correlations among tags on questions
 29 | * Which tags tend to get higher or lower scores
 30 | * Which tags tend to be asked on weekends vs weekdays
 31 | * Rates of question closure or deletion over time
 32 | * The speed at which questions are closed or deleted
 33 | 
 34 | All of this is public data from the [Stack Exchange Data Dump](https://archive.org/details/stackexchange), which is much more comprehensive (it includes question and answer text) but also takes much more computation to download and process. This dataset is designed to be easy to read in and start analyzing. The same data can also be queried in the [Stack Exchange Data Explorer](https://data.stackexchange.com/), but this repository lets analysts work with it locally using their tool of choice.
 35 | 
 36 | ### Status
 37 | 
 38 | ```{r load_data, echo = FALSE}
 39 | library(dplyr)
 40 | library(readr)
 41 | library(yaml)
 42 | 
 43 | questions <- read_csv("questions.csv.gz", progress = FALSE)
 44 | question_tags <- read_csv("question_tags.csv.gz", progress = FALSE)
 45 | ```
 46 | 
 47 | This dataset was extracted from the Stack Overflow database at `r yaml.load_file("status.yml")$retrieved_time` UTC and contains questions up to **`r yaml.load_file("status.yml")$max_date`**. This includes `r sum(is.na(questions$DeletionDate))` non-deleted questions and `r sum(!is.na(questions$DeletionDate))` deleted ones. (The script for downloading the data is in [setup-data.R](setup-data.R), though it can be run only by Stack Overflow employees with database access.)
 48 | 
 49 | ### Examples in R
 50 | 
 51 | The dataset is provided as csv.gz files, which means you can use almost any language or statistical tool to process it. But here I'll share some examples of an analysis in R.
 52 | 
 53 | The question data and the question-tag pairings are stored separately. You can read in the dataset with:
 54 | 
 55 | ```{r eval = FALSE}
 56 | library(readr)
 57 | library(dplyr)
 58 | 
 59 | questions <- read_csv("questions.csv.gz")
 60 | question_tags <- read_csv("question_tags.csv.gz")
 61 | ```
 62 | 
 63 | ```{r}
 64 | questions
 65 | question_tags
 66 | ```
 67 | 
 68 | As one example, you could find the most popular tags:
 69 | 
 70 | ```{r question_tags_count, dependson = "load_data"}
 71 | question_tags %>%
 72 |   count(Tag, sort = TRUE)
 73 | ```
 74 | 
 75 | Or plot the number of questions asked per week:
 76 | 
 77 | ```{r questions_per_week, dependson = "load_data"}
 78 | library(ggplot2)
 79 | library(lubridate)
 80 | 
 81 | questions %>%
 82 |   count(Week = round_date(CreationDate, "week")) %>%
 83 |   ggplot(aes(Week, n)) +
 84 |   geom_line()
 85 | ```
 86 | 
 87 | Or you could compare the growth of particular tags over time:
 88 | 
 89 | ```{r tags_per_year, dependson = "load_data"}
 90 | library(lubridate)
 91 | 
 92 | tags <- c("c#", "javascript", "python", "r")
 93 | 
 94 | q_per_year <- questions %>%
 95 |   count(Year = year(CreationDate)) %>%
 96 |   rename(YearTotal = n)
 97 | 
 98 | tags_per_year <- question_tags %>%
 99 |   filter(Tag %in% tags) %>%
100 |   inner_join(questions, by = "Id") %>%
101 |   count(Year = year(CreationDate), Tag) %>%
102 |   inner_join(q_per_year, by = "Year")
103 | 
104 | ggplot(tags_per_year, aes(Year, n / YearTotal, color = Tag)) +
105 |   geom_line() +
106 |   scale_y_continuous(labels = scales::percent_format()) +
107 |   ylab("% of Stack Overflow questions with this tag")
108 | ```
109 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | 
  3 | 
  4 | 
  5 | ## StackLite: A simple dataset of Stack Overflow questions and tags
  6 | 
  7 | This repository shares a dataset about Stack Overflow questions. For each question, it includes:
  8 | 
  9 | * Question ID
 10 | * Creation date
 11 | * Closed date, if applicable
 12 | * Deletion date, if applicable
 13 | * Score
 14 | * Owner user ID
 15 | * Number of answers
 16 | * Tags
 17 | 
 18 | This dataset is well suited to exploring topics such as:
 19 | 
 20 | * The increase or decrease in questions in each tag over time
 21 | * Correlations among tags on questions
 22 | * Which tags tend to get higher or lower scores
 23 | * Which tags tend to be asked on weekends vs weekdays
 24 | * Rates of question closure or deletion over time
 25 | * The speed at which questions are closed or deleted
 26 | 
 27 | All of this is public data from the [Stack Exchange Data Dump](https://archive.org/details/stackexchange), which is much more comprehensive (it includes question and answer text) but also takes much more computation to download and process. This dataset is designed to be easy to read in and start analyzing. The same data can also be queried in the [Stack Exchange Data Explorer](https://data.stackexchange.com/), but this repository lets analysts work with it locally using their tool of choice.
 28 | 
 29 | ### Status
 30 | 
 31 | 
 32 | 
 33 | This dataset was extracted from the Stack Overflow database at 2017-04-06 16:39:26 UTC and contains questions up to **2017-04-05**. This includes 13629741 non-deleted questions and 4133745 deleted ones. (The script for downloading the data is in [setup-data.R](setup-data.R), though it can be run only by Stack Overflow employees with database access.)
 34 | 
 35 | ### Examples in R
 36 | 
 37 | The dataset is provided as csv.gz files, which means you can use almost any language or statistical tool to process it. But here I'll share some examples of an analysis in R.
 38 | 
 39 | The question data and the question-tag pairings are stored separately. You can read in the dataset with:
 40 | 
 41 | 
 42 | ```r
 43 | library(readr)
 44 | library(dplyr)
 45 | 
 46 | questions <- read_csv("questions.csv.gz")
 47 | question_tags <- read_csv("question_tags.csv.gz")
 48 | ```
 49 | 
 50 | 
 51 | ```r
 52 | questions
 53 | ```
 54 | 
 55 | ```
 56 | ## # A tibble: 17,763,486 × 7
 57 | ##       Id        CreationDate          ClosedDate        DeletionDate Score
 58 | ##    <int>              <dttm>              <dttm>              <dttm> <int>
 59 | ## 1      1 2008-07-31 21:26:37                <NA> 2011-03-28 00:53:47     1
 60 | ## 2      4 2008-07-31 21:42:52                <NA>                <NA>   472
 61 | ## 3      6 2008-07-31 22:08:08                <NA>                <NA>   210
 62 | ## 4      8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40    42
 63 | ## 5      9 2008-07-31 23:40:59                <NA>                <NA>  1452
 64 | ## 6     11 2008-07-31 23:55:37                <NA>                <NA>  1154
 65 | ## 7     13 2008-08-01 00:42:38                <NA>                <NA>   464
 66 | ## 8     14 2008-08-01 00:59:11                <NA>                <NA>   296
 67 | ## 9     16 2008-08-01 04:59:33                <NA>                <NA>    84
 68 | ## 10    17 2008-08-01 05:09:55                <NA>                <NA>   119
 69 | ## # ... with 17,763,476 more rows, and 2 more variables: OwnerUserId <int>,
 70 | ## #   AnswerCount <int>
 71 | ```
 72 | 
 73 | ```r
 74 | question_tags
 75 | ```
 76 | 
 77 | ```
 78 | ## # A tibble: 52,224,835 × 2
 79 | ##       Id                 Tag
 80 | ##    <int>               <chr>
 81 | ## 1      1                data
 82 | ## 2      4                  c#
 83 | ## 3      4            winforms
 84 | ## 4      4     type-conversion
 85 | ## 5      4             decimal
 86 | ## 6      4             opacity
 87 | ## 7      6                html
 88 | ## 8      6                 css
 89 | ## 9      6                css3
 90 | ## 10     6 internet-explorer-7
 91 | ## # ... with 52,224,825 more rows
 92 | ```
 93 | 
 94 | As one example, you could find the most popular tags:
 95 | 
 96 | 
 97 | ```r
 98 | question_tags %>%
 99 |   count(Tag, sort = TRUE)
100 | ```
101 | 
102 | ```
103 | ## # A tibble: 59,140 × 2
104 | ##           Tag       n
105 | ##         <chr>   <int>
106 | ## 1  javascript 1712324
107 | ## 2        java 1614786
108 | ## 3         php 1406127
109 | ## 4          c# 1356681
110 | ## 5     android 1327680
111 | ## 6      jquery 1035978
112 | ## 7      python  898647
113 | ## 8        html  804340
114 | ## 9         ios  652484
115 | ## 10        c++  645197
116 | ## # ... with 59,130 more rows
117 | ```
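
The same join pattern can address which tags tend to get higher or lower scores. The following is a minimal sketch on made-up toy tables that only mimic the column names of the real files; the names `tag_scores`, `Questions`, and `MeanScore`, and the minimum of two questions per tag, are illustrative assumptions rather than part of the dataset.

```r
library(dplyr)

# Toy stand-ins for the real tables (same column names, invented values)
questions <- tibble(
  Id = c(1L, 2L, 3L, 4L),
  Score = c(10L, 0L, 4L, -2L)
)
question_tags <- tibble(
  Id = c(1L, 1L, 2L, 3L, 4L),
  Tag = c("r", "ggplot2", "r", "python", "python")
)

# Attach each question's score to its tags, then average per tag,
# keeping only tags that appear on at least two questions
tag_scores <- question_tags %>%
  inner_join(questions, by = "Id") %>%
  group_by(Tag) %>%
  summarize(Questions = n(), MeanScore = mean(Score)) %>%
  filter(Questions >= 2) %>%
  arrange(desc(MeanScore))

tag_scores
```

On the full dataset a much higher minimum count would probably be sensible, since the mean score of a rarely used tag is noisy.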
118 | 
119 | Or plot the number of questions asked per week:
120 | 
121 | 
122 | ```r
123 | library(ggplot2)
124 | library(lubridate)
125 | 
126 | questions %>%
127 |   count(Week = round_date(CreationDate, "week")) %>%
128 |   ggplot(aes(Week, n)) +
129 |   geom_line()
130 | ```
131 | 
132 | ![plot of chunk questions_per_week](README-fig/questions_per_week-1.png)
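
Creation dates can also be used to compare weekend and weekday activity per tag. This is a rough sketch on invented toy data with the same column names as the real tables; it assumes lubridate's default week start, under which `wday()` returns 1 for Sunday and 7 for Saturday, and the names `weekend_share` and `WeekendShare` are illustrative.

```r
library(dplyr)
library(lubridate)

# Toy stand-ins: 2017-04-01 is a Saturday, 2017-04-02 a Sunday
questions <- tibble(
  Id = 1:4,
  CreationDate = ymd_hms(c("2017-04-01 10:00:00", "2017-04-02 11:00:00",
                           "2017-04-03 12:00:00", "2017-04-04 13:00:00"))
)
question_tags <- tibble(
  Id = c(1L, 2L, 3L, 3L, 4L),
  Tag = c("r", "r", "r", "python", "python")
)

# Flag weekend questions (1 = Sunday, 7 = Saturday under the default
# week start), then compute the share of weekend questions per tag
weekend_share <- question_tags %>%
  inner_join(questions, by = "Id") %>%
  mutate(Weekend = wday(CreationDate) %in% c(1, 7)) %>%
  group_by(Tag) %>%
  summarize(Questions = n(), WeekendShare = mean(Weekend)) %>%
  arrange(desc(WeekendShare))

weekend_share
```

Note that the timestamps are in UTC, so "weekend" here means a weekend in UTC rather than in the asker's local time.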
133 | 
134 | Or you could compare the growth of particular tags over time:
135 | 
136 | 
137 | ```r
138 | library(lubridate)
139 | 
140 | tags <- c("c#", "javascript", "python", "r")
141 | 
142 | q_per_year <- questions %>%
143 |   count(Year = year(CreationDate)) %>%
144 |   rename(YearTotal = n)
145 | 
146 | tags_per_year <- question_tags %>%
147 |   filter(Tag %in% tags) %>%
148 |   inner_join(questions, by = "Id") %>%
149 |   count(Year = year(CreationDate), Tag) %>%
150 |   inner_join(q_per_year, by = "Year")
151 | 
152 | ggplot(tags_per_year, aes(Year, n / YearTotal, color = Tag)) +
153 |   geom_line() +
154 |   scale_y_continuous(labels = scales::percent_format()) +
155 |   ylab("% of Stack Overflow questions with this tag")
156 | ```
157 | 
158 | ![plot of chunk tags_per_year](README-fig/tags_per_year-1.png)
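
The `ClosedDate` column makes it possible to look at how often and how quickly questions are closed. Below is a small sketch on invented toy rows with the same column names as `questions.csv.gz`; `closure_stats`, `ClosedShare`, and `MedianHoursToClose` are hypothetical names chosen for illustration.

```r
library(dplyr)

# Toy stand-in: the second question was never closed (ClosedDate is NA)
questions <- tibble(
  Id = 1:3,
  CreationDate = as.POSIXct(c("2017-01-01 00:00:00", "2017-01-01 00:00:00",
                              "2017-01-02 00:00:00"), tz = "UTC"),
  ClosedDate = as.POSIXct(c("2017-01-01 02:00:00", NA,
                            "2017-01-02 06:00:00"), tz = "UTC")
)

# Hours from creation to closure (NA for open questions), summarized as
# the share of closed questions and the median time to closure
closure_stats <- questions %>%
  mutate(HoursToClose = as.numeric(difftime(ClosedDate, CreationDate,
                                            units = "hours"))) %>%
  summarize(ClosedShare = mean(!is.na(ClosedDate)),
            MedianHoursToClose = median(HoursToClose, na.rm = TRUE))

closure_stats
```

The same approach should carry over to `DeletionDate` for deletion rates and speed.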
159 | 


--------------------------------------------------------------------------------
/question_tags.csv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:6c681a91555e398bc5c67f9d70e06b3b5de23fe6605310f0a2013295fc52d5d0
3 | size 251822155
4 | 


--------------------------------------------------------------------------------
/questions.csv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:b573243588b32c1c408b0c66aaaf40a76b5ea785d21728299fd2f482515b68d0
3 | size 216921225
4 | 


--------------------------------------------------------------------------------
/setup-data.R:
--------------------------------------------------------------------------------
 1 | # This script cannot be run except by Stack Overflow employees with access to the internal
 2 | # sqlstackr package and the Stack Exchange databases.
 3 | 
 4 | # It is provided for transparency.
 5 | 
 6 | library(sqlstackr)
 7 | library(dplyr)
 8 | library(yaml)
 9 | library(tidyr)
10 | library(stringr)
11 | library(lubridate)
12 | library(readr)
13 | 
14 | # the query returns each date as seconds since 1970-01-01 and date_diff converts it back to
15 | # POSIXct, because this happens to be much faster on large queries with the RSQLServer driver
16 | 
17 | date_diff <- function(x) as.POSIXct(x, tz = "UTC", origin = "1970-1-1")
18 | 
19 | query <- "select Id,
20 |             datediff(second, '1970-1-1', CreationDate) as CreationDate,
21 |             datediff(second, '1970-1-1', ClosedDate) as ClosedDate,
22 |             datediff(second, '1970-1-1', DeletionDate) as DeletionDate,
23 |             Tags, Score, OwnerUserId, AnswerCount
24 |         from Posts
25 |         where PostTypeId = 1"
26 | 
27 | retrieved_time <- with_tz(Sys.time(), "UTC")
28 | 
29 | questions_raw <- query_StackOverflow(query, collect = TRUE) %>%
30 |   mutate_each(funs(date_diff), CreationDate, ClosedDate, DeletionDate)
31 | 
32 | # keep only questions created through the day before the latest creation date (the last complete day), and blank out owner user ID on deleted questions
33 | max_date <- max(as.Date(questions_raw$CreationDate)) - 1
34 | 
35 | questions <- questions_raw %>%
36 |   filter(as.Date(CreationDate) <= max_date) %>%
37 |   arrange(Id) %>%
38 |   mutate(OwnerUserId = ifelse(is.na(DeletionDate), OwnerUserId, NA))
39 | 
40 | # write status.yml
41 | s <- list(retrieved_time = as.character(retrieved_time),
42 |           max_date = as.character(max_date),
43 |           number_questions = nrow(questions))
44 | write(as.yaml(s), file = "status.yml")
45 | 
46 | # turn question tags into one row per question-tag pair
47 | question_tags <- questions %>%
48 |   select(Id, Tags) %>%
49 |   unnest(Tags = str_split(Tags, "\\|")) %>%
50 |   filter(Tags != "") %>%
51 |   rename(Tag = Tags)
52 | 
53 | questions$Tags <- NULL
54 | 
55 | # remove existing gz files
56 | unlink("questions.csv.gz")
57 | unlink("question_tags.csv.gz")
58 | 
59 | write_csv(questions, "questions.csv")
60 | system("gzip questions.csv")
61 | 
62 | write_csv(question_tags, "question_tags.csv")
63 | system("gzip question_tags.csv")
64 | 
65 | # knit the README
66 | library(knitr)
67 | knit("README.Rmd")
68 | 


--------------------------------------------------------------------------------
/stacklite.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: knitr
13 | LaTeX: pdfLaTeX
14 | 


--------------------------------------------------------------------------------
/status.yml:
--------------------------------------------------------------------------------
1 | retrieved_time: 2016-10-13 18:09:48
2 | max_date: '2016-10-12'
3 | number_questions: 16238301
4 | 
5 | 


--------------------------------------------------------------------------------