├── .gitignore
├── README.md
├── analysis.Rmd
├── betaface.R
├── eigencoder.Rproj
└── results.Rds
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | .DS_Store
6 | analysis_cache
7 | analysis_files
8 | analysis.html
9 | ms_key.txt
10 | cache/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## EigenCoder
2 |
3 | You'll need to register an Azure key for the [Face API](https://www.projectoxford.ai/face). Save that key in a file named `ms_key.txt` in this directory so it can be loaded and used for the API calls.
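
A minimal sketch of loading that key defensively. The helper name `read_ms_key` is hypothetical and not part of this repo (`betaface.R` simply calls `readLines("ms_key.txt")`); the sketch only assumes the key sits on the first line of the file.

```r
# Hypothetical helper: read the key from the first line of the file,
# trim stray whitespace, and fail loudly if the file is missing or empty.
read_ms_key <- function(path = "ms_key.txt") {
  if (!file.exists(path)) {
    stop("Key file not found: ", path)
  }
  key <- trimws(readLines(path, n = 1, warn = FALSE))
  if (length(key) == 0 || !nzchar(key)) {
    stop("Key file is empty: ", path)
  }
  key
}

# Demo with a throwaway file and a placeholder value (not a real key).
tmp <- tempfile()
writeLines("  not-a-real-key  ", tmp)
key <- read_ms_key(tmp)
print(key)
```

Since `ms_key.txt` is listed in `.gitignore`, the key stays out of version control.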
--------------------------------------------------------------------------------
/analysis.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "EigenCoders"
3 | author: "Jeff Allen"
4 | date: "March 8, 2016"
5 | output:
6 | html_document:
7 | keep_md: yes
8 | ---
9 |
10 | There are a lot of stereotypes in the programming community. "Swift is used by a bunch of bearded hipsters." "C++ is for old people." "No one likes coding in Java." Well, it turns out that some of these might be true.
11 |
12 | ## Approach
13 |
14 | [GitHub](https://github.com) is likely the most popular open-source hosting platform in use today. GitHub has many open-source repositories for code written in a wide variety of languages. They also provide ["trending" pages](https://github.com/trending/go?since=monthly) which show you repositories in a given language that are especially popular right now.
15 |
16 | These trending pages also include the profile pictures of some of the most prolific committers on each project, meaning that we can easily get a few dozen profile pictures of some of the busiest contributors to popular projects for any given language.
17 |
18 | We also have access to a neat resource from Microsoft's Project Oxford called the "[Face API](https://www.projectoxford.ai/face)" which, among other things, can detect a face in a given image and estimate some properties about the face. Is the subject smiling? What gender and age are they? Do they have facial hair?
19 |
20 | Combined, we can get an estimate of some interesting properties about the profile pictures of programmers who are working in various languages. Let's get started!
21 |
22 | It should be noted that this is super non-scientific. Who knows how accurate the Face API is or how accurately a user's GitHub profile picture maps to any aspect of their personality/identity. It's also unclear whether the most prolific contributors to popular repositories accurately represent a community. Also, small sample sizes. Etc., etc.
23 |
24 | All code used in this post is available here: [https://github.com/trestletech/eigencoder](https://github.com/trestletech/eigencoder)
25 |
26 | ## Data
27 |
28 | ```{r gathering, cache=TRUE,message=FALSE, warning=FALSE, echo=FALSE}
29 | library(rvest)
30 |
31 | languages <- c("ruby", "r", "javascript", "java", "html", "go", "cpp", "c", "python", "php", "perl", "swift", "csharp")
32 |
33 | source("betaface.R")
34 |
35 | #TODO: lookup name from GitHub to help detect gender
36 |
37 | get_profiles <- function(lang){
38 | ht <- read_html(paste0("https://github.com/trending/", lang, "?since=monthly"))
39 | contribs <- ht %>%
40 | html_nodes(".repo-list-item .repo-list-meta .avatar")
41 |
42 | authors <- contribs %>% html_attr("alt")
43 | authors <- gsub("^@", "", authors)
44 | imgs <- contribs %>% html_attr("src")
45 | imgs <- gsub("s=40", "s=420", imgs)
46 |
47 | names(imgs) <- authors
48 | imgs
49 | }
50 |
51 | results <- list()
52 | for (langInd in 1:length(languages)){
53 | lang <- languages[langInd]
54 | message("Processing ", lang, "...")
55 | prof <- get_profiles(lang)
56 |
57 | df <- data.frame(
58 | username=character(0),
59 | smile=numeric(0),
60 | gender=character(0),
61 | age=numeric(0),
62 | mustache=numeric(0),
63 | beard=numeric(0),
64 | sideburns=numeric(0)
65 | )
66 | for (i in seq_along(prof)){
67 | tryCatch({
68 | attrs <- cached_msface(prof[[i]], ms_key)
69 | df <- rbind(df, data.frame(
70 | username=names(prof)[i],
71 | smile=attrs$smile,
72 | gender=attrs$gender,
73 | age=attrs$age,
74 | mustache=attrs$facialHair$moustache,
75 | beard=attrs$facialHair$beard,
76 | sideburns=attrs$facialHair$sideburns
77 | ))
78 | }, error = function(e){
79 | df <- rbind(df, data.frame( # NOTE: this assigns a df local to the handler, so failed lookups are dropped from the outer df
80 | username=names(prof)[i],
81 | smile=NA,
82 | gender=NA,
83 | age=NA,
84 | mustache=NA,
85 | beard=NA,
86 | sideburns=NA))
87 | })
88 | }
89 | results[[langInd]] <- df
90 | saveRDS(results, "results.Rds")
91 | }
92 | ```
93 |
94 | ```{r unique, echo=FALSE}
95 | # We can exclude redundant usernames in each programming language.
96 | names(results) <- languages
97 | saveRDS(results, "results.Rds")
98 |
99 | # Filter to only include unique users
100 | # TODO: do this before calling the Face API, to avoid redundant lookups
101 | results <- lapply(results, function(x){
102 | unique(x)
103 | })
104 | saveRDS(results, "unique-results.Rds")
105 | ```
106 |
107 | GitHub lists 25 repositories on its trending page and shows the top 5 committers for each. Some projects don't have 5 contributors, so fewer are shown. We remove duplicated usernames then send each of these profile pictures (up to 125 per language) to be analyzed by the Face API. Of course, not all pictures have (detectable) faces in them.
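
A minimal sketch of dropping duplicated usernames before any Face API call, which is what the TODO in the chunk above asks for. The helper name `dedupe_profiles` and the example URLs are made up for illustration; the input is assumed to have the shape `get_profiles()` returns, a character vector of image URLs named by username.

```r
# Hypothetical helper: keep the first avatar URL seen for each username.
dedupe_profiles <- function(imgs) {
  imgs[!duplicated(names(imgs))]
}

# Made-up input: the same committer shows up under two trending repositories.
prof <- c(
  alice = "https://example.com/alice.png",
  bob   = "https://example.com/bob.png",
  alice = "https://example.com/alice.png"
)

unique_prof <- dedupe_profiles(prof)
print(names(unique_prof))
```

Applied to the output of `get_profiles(lang)` before the lookup loop, this would send each committer's picture at most once per language.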
108 |
109 | In total, we get the following:
110 |
111 | ```{r, echo=FALSE, results='asis'}
112 | df <- data.frame(Lang=languages, FacesDetected=numeric(length(languages)))
113 | for (i in 1:length(languages)){
114 | df[i,2] <- nrow(results[[i]])
115 | }
116 |
117 | knitr::kable(df)
118 | ```
119 |
120 |
121 | ## Gender
122 |
123 | One of the properties returned by the Face API is a prediction of the gender of the subject of the photo. The results are pretty discouraging for anyone who's not a chauvinist...
124 |
125 | ```{r gender, warning=FALSE, echo=FALSE, message=FALSE}
126 | gender <- (sapply(results, function(x){list(male=sum(x["gender"] == "male"), female=sum(x["gender"] == "female"))}))
127 | # Weird list format
128 | gender <- matrix(unlist(gender), nrow=2)
129 | rownames(gender) <- c("Male", "Female")
130 | colnames(gender) <- languages
131 |
132 | counts <- apply(gender, 2, sum)
133 |
134 | library(dplyr)
135 | library(tidyr)
136 | genderRatio <- as.data.frame(t(gender/rep(counts, each=2)))
137 | genderRatio$language <- rownames(genderRatio)
138 | genderRatio <- gather(genderRatio, gender, ratio, Male, Female)
139 |
140 | #Sort
141 | genderRatio <- genderRatio %>% arrange(desc(gender),desc(ratio))
142 |
143 | library(plotly)
144 | plot_ly(genderRatio, x=language, y=ratio, type="bar", color=gender) %>%
145 | layout(barmode="stack")
146 | ```
147 |
148 | ## Age
149 |
150 | Age is an interesting trend. Some people assume that "old-school" languages are only used by old people and that new, trendy languages are used by hipsters. It turns out that's not always true; Java, for instance, has the *lowest* median age.
151 |
152 | ```{r age, warning=FALSE, echo=FALSE}
153 | listToDF <- function(lst, name="val"){
154 | df <- NULL
155 | for (i in 1:length(lst)){
156 | li <- lst[[i]]
157 | f <- data.frame(language=rep(names(lst)[i], length(li)),
158 | val=li, stringsAsFactors = FALSE)
159 | df <- bind_rows(df, f)
160 | }
161 | colnames(df)[2] <- name
162 | df
163 | }
164 |
165 | listToDensity <- function(lst, title="", xlab="x", ...){
166 | dens <- lapply(lst, function(li){
167 | breaks <- seq(from=min(li), to=max(li), length.out=6)
168 | cu <- cut(li, breaks, right=FALSE, labels=breaks[-1])
169 |
170 | # Prepend 0
171 | ar <- c(0, table(cu))
172 | names(ar)[1] <- "0"
173 |
174 | ar
175 | })
176 | df <- data.frame(
177 | x = unlist(lapply(dens, function(x){as.numeric(names(x))})),
178 | y = unlist(lapply(dens, as.numeric)),
179 | lang = rep(names(dens), each = length(dens[[1]]))
180 | )
181 |
182 | plot_ly(df, x=x, y=y, color=lang) %>%
183 | layout(title=title, xaxis=list(title=xlab), yaxis=list(title="Count"), autosize=FALSE, ...)
184 | }
185 |
186 | listToBar <- function(lst, fun, title="", ylab="Average"){
187 | avg <- sapply(lst, fun)
188 | df <- as.data.frame(avg)
189 | df$language <- rownames(df)
190 |
191 | df <- df %>% arrange(desc(avg))
192 |
193 | plot_ly(df, x=language, y=avg, type="bar") %>%
194 | layout(title=title, yaxis=list(title=ylab), autosize=TRUE)
195 | }
196 |
197 | ages <- lapply(results, "[[", "age")
198 |
199 | library(shiny)
200 | tabsetPanel(
201 | tabPanel("Barplot", print(listToBar(ages, median, "Median Age of Programmers by Language", "Median Age"))),
202 | tabPanel("Density", print(listToDensity(ages, "Age of Programmers by Language", "Age")))
203 | )
204 | ```
205 |
206 | ## Smiles
207 |
208 | Every programmer has a language that makes them miserable. So miserable, perhaps, that you can't even muster a smile for your GitHub profile picture.
209 |
210 | The Face API returns a score from 0 to 1 approximating the amount that you're smiling. Programmers using certain languages seem happier than others. Maybe R programmers are just smiling about the [crazy market for data scientists](https://www.oreilly.com/ideas/2015-data-science-salary-survey) in this economy...
211 |
212 | ```{r smiles, warning=FALSE, echo=FALSE}
213 | smiles <- lapply(results, "[[", "smile")
214 |
215 | tabsetPanel(
216 | tabPanel("Barplot", print(listToBar(smiles, mean, "Mean Smiliness of Programmers by Language", "Mean Smile Score"))),
217 | tabPanel("Density", print(listToDensity(smiles, "Smiliness of Programmers by Language", "Smile Score")))
218 | )
219 | ```
220 |
221 | ## Facial Hair
222 |
223 | If you've been coding for any length of time, you've met at least one mustached fellow riding a fixie and wearing skinny jeans who won't stop talking about Swift. Turns out that's a real stereotype.
224 |
225 | I did not normalize for gender here.
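
For anyone who wants to try normalizing, here is a rough sketch that averages the combined facial-hair score over faces the API predicted as male only. The helper name `male_facial_hair` and the sample rows are made up; the columns mirror the per-language data frames built earlier.

```r
# Made-up rows in the same shape as one element of `results`.
lang_df <- data.frame(
  username  = c("a", "b", "c", "d"),
  gender    = c("male", "female", "male", "male"),
  mustache  = c(0.4, 0.0, 0.2, 0.0),
  beard     = c(0.6, 0.0, 0.1, 0.0),
  sideburns = c(0.2, 0.0, 0.3, 0.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean of mustache + beard + sideburns,
# restricted to rows whose predicted gender is "male".
male_facial_hair <- function(df) {
  males <- df[!is.na(df$gender) & df$gender == "male", ]
  mean(males$mustache + males$beard + males$sideburns)
}

score <- male_facial_hair(lang_df)
print(score)
```

A language with no faces predicted as male would come back as `NaN`, so it would need to be filtered out before plotting.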
226 |
227 | ```{r hair, warning=FALSE, echo=FALSE}
228 | mustache <- lapply(results, "[[", "mustache")
229 | beard <- lapply(results, "[[", "beard")
230 | sideburns <- lapply(results, "[[", "sideburns")
231 |
232 | # Facial Hair
233 | facialHair <- list()
234 | for (i in 1:length(mustache)){
235 | facialHair[[i]] <- mustache[[i]] + beard[[i]] + sideburns[[i]]
236 | }
237 | names(facialHair) <- languages
238 |
239 | tabsetPanel(
240 | tabPanel("Barplot", print(listToBar(facialHair, mean, "Average Facial Hair by Language", "Facial Hair"))),
241 | tabPanel("Density", print(listToDensity(facialHair, "Facial Hair by Language", "Facial Hair")))
242 | )
243 | ```
244 |
245 | Or we can look at each facial hair property returned by the Face API individually.
246 |
247 | ```{r, warning=FALSE, echo=FALSE}
248 | tabsetPanel(
249 | tabPanel("Mustache", print(listToDensity(mustache, "Mustaches by Language", "Mustaches"))),
250 | tabPanel("Beard", print(listToDensity(beard, "Beards by Language", "Beards"))),
251 | tabPanel("Sideburns", print(listToDensity(sideburns, "Sideburns by Language", "Sideburns")))
252 | )
253 | ```
254 |
255 | ## Conclusion
256 |
257 | I guess you should grow a mustache if you want to contribute to a successful C++ project?
258 |
259 | In reality, you'd need a much larger sample size and some guarantees around the accuracy of the Face API before you could have any real confidence about the conclusions you draw from this data.
260 |
--------------------------------------------------------------------------------
/betaface.R:
--------------------------------------------------------------------------------
1 | library(httr)
2 |
3 | img <- "https://avatars2.githubusercontent.com/u/1356007?v=3&s=420"
4 |
5 | # The free Betaface API key
6 | betaface <- function(img, key="d45fd466-51e2-4701-8da8-04351c872236",
7 | secret="171e8465-f548-401d-b63b-caf0dc28df5f"){
8 | res <- POST("http://betafaceapi.com/service_json.svc/UploadNewImage_Url", body=list(
9 | api_key = key,
10 | api_secret = secret,
11 | detection_flags = "bestface classifiers",
12 | image_url = img
13 | ))
14 |
15 | stop_for_status(res)
16 |
17 | uid <- content(res)$img_uid
18 |
19 | res <- POST("http://betafaceapi.com/service_json.svc/GetImageInfo", body=list(
20 | api_key = key,
21 | api_secret = secret,
22 | img_uid = uid
23 | ))
24 |
25 | stop_for_status(res)
26 |
27 | cont <- content(res)
28 | if (is.null(cont$faces) || length(cont$faces) == 0){
29 | stop("No face found")
30 | }
31 |
32 | face <- cont$faces[[1]]
33 | do.call(rbind.data.frame, face$tags)
34 | }
35 |
36 | library(digest)
37 | cached_msface <- function(img, key, cache="./cache"){
38 | if (!dir.exists(cache)){
39 | dir.create(cache)
40 | }
41 |
42 | # Lookup img hash in cache
43 | hash <- digest::sha1(img)
44 |
45 | if (file.exists(file.path(cache, hash))){
46 | obj <- readRDS(file.path(cache, hash))
47 | message("Cache hit: ", img)
48 | if (is.null(obj)){
49 | stop("Original error")
50 | }
51 | return(obj)
52 | }
53 |
54 | message("Not in cache: ", img, " ", hash)
55 | tryCatch({
56 | f <- msface(img, key)
57 | saveRDS(f, file.path(cache, hash))
58 | return(f)
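59 | }, error=function(e){
60 | saveRDS(NULL, file.path(cache, hash)) # remember the failure so we don't call the API again
61 | stop(e) # re-raise so a first-time failure behaves like a cached failure
62 | })
63 | }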
64 | # Microsoft Faces API key
65 | ms_key <- readLines("ms_key.txt")
66 | msface <- function(img, key){
67 | url <- "https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=0&returnFaceLandmarks=0&returnFaceAttributes=age,gender,facialHair,smile"
68 | res <- POST(url, add_headers("Ocp-Apim-Subscription-Key"=key), body=list(url=img), encode="json")
69 |
70 | while (res$status_code == 429){
71 | # Rate limit. Sleep and try again
72 | message("Hit rate limit. Sleeping for 60 seconds")
73 | Sys.sleep(60)
74 | res <- POST(url, add_headers("Ocp-Apim-Subscription-Key"=key), body=list(url=img), encode="json")
75 | }
76 |
77 | stop_for_status(res)
78 |
79 | cont <- content(res)
80 | if (length(cont) < 1 || is.null(cont[[1]]$faceAttributes) || length(cont[[1]]$faceAttributes) == 0){
81 | stop("No face attributes")
82 | }
83 |
84 | cont[[1]]$faceAttributes # return the attributes list for the first detected face
85 | }
--------------------------------------------------------------------------------
/eigencoder.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------
/results.Rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trestletech/eigencoder/9519ba0e26390259c2d79cba27e6ce8314d8dcba/results.Rds
--------------------------------------------------------------------------------