├── .gitignore ├── README.md ├── analysis.Rmd ├── betaface.R ├── eigencoder.Rproj └── results.Rds /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | .DS_Store 6 | analysis_cache 7 | analysis_files 8 | analysis.html 9 | ms_key.txt 10 | cache/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## EigenCoder 2 | 3 | You'll need to register an Azure key for the [Face API](https://www.projectoxford.ai/face). You can then save that key in a file in this directory named `ms_key.txt` so it can be loaded and used for the API calls. -------------------------------------------------------------------------------- /analysis.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "EigenCoders" 3 | author: "Jeff Allen" 4 | date: "March 8, 2016" 5 | output: 6 | html_document: 7 | keep_md: yes 8 | --- 9 | 10 | There are a lot of stereotypes in the programming community. "Swift is used by a bunch of bearded hipsters." "C++ is for old people." "No one likes coding in Java." Well, it turns out that some of these might be true. 11 | 12 | ## Approach 13 | 14 | [GitHub](https://github.com) is likely the most popular open-source hosting platform in use today. GitHub has many open-source repositories for code written in a wide variety of languages. They also provide ["trending" pages](https://github.com/trending/go?since=monthly) which show you repositories in a particular language that are particularly popular right now. 15 | 16 | These popular repositories also include the profile pictures of some of the most prolific committers on these projects, meaning that we can easily get a few dozen profile pictures of some of the busiest contributors to popular projects for any given language. 
17 | 18 | We also have access to a neat resource from Microsoft's Project Oxford called the "[Face API](https://www.projectoxford.ai/face)" which, among other things, can detect a face in a given image and estimate some properties about the face. Is the subject smiling? What gender and age are they? Do they have facial hair? 19 | 20 | Combined, we can get an estimate of some interesting properties about the profile pictures of programmers who are working in various languages. Let's get started! 21 | 22 |
It should be noted that this is super non-scientific. Who knows how accurate the Face API is or how accurately a user's GitHub profile picture maps to any aspect of their personality/identity. It's also unclear whether the most prolific contributors to popular repositories accurately represent a community. Also, small sample sizes. Etc., etc.
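To put the small-sample caveat in perspective, here is a minimal sketch of how wide the uncertainty around a single proportion can be at this scale. The counts below are made up for illustration and are not taken from `results.Rds`; `share_ci` is a hypothetical helper, and the exact binomial interval is just one reasonable choice.

```r
# Hypothetical helper: exact (Clopper-Pearson) confidence interval
# for a share of `hits` out of `n` detected faces.
share_ci <- function(hits, n, level = 0.95) {
  ci <- binom.test(hits, n, conf.level = level)$conf.int
  c(estimate = hits / n, lower = ci[1], upper = ci[2])
}

# Made-up counts for illustration only: 6 of 80 faces in one language
# classified as "female" by the API.
example <- share_ci(6, 80)
print(round(example, 3))
```

With at most 125 pictures per language, and fewer once undetected faces drop out, intervals like this one are wide enough that the rankings between languages should be read loosely.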
23 | 24 | All code used in this post is available here: [https://github.com/trestletech/eigencoder](https://github.com/trestletech/eigencoder) 25 | 26 | ## Data 27 | 28 | ```{r gathering, cache=TRUE,message=FALSE, warning=FALSE, echo=FALSE} 29 | library(rvest) 30 | 31 | languages <- c("ruby", "r", "javascript", "java", "html", "go", "cpp", "c", "python", "php", "perl", "swift", "csharp") 32 | 33 | source("betaface.R") 34 | 35 | #TODO: lookup name from GitHub to help detect gender 36 | 37 | get_profiles <- function(lang){ 38 | ht <- read_html(paste0("https://github.com/trending/", lang, "?since=monthly")) 39 | contribs <- ht %>% 40 | html_nodes(".repo-list-item .repo-list-meta .avatar") 41 | 42 | authors <- contribs %>% html_attr("alt") 43 | authors <- gsub("^@", "", authors) 44 | imgs <- contribs %>% html_attr("src") 45 | imgs <- gsub("s=40", "s=420", imgs) 46 | 47 | names(imgs) <- authors 48 | imgs 49 | } 50 | 51 | results <- list() 52 | for (langInd in 1:length(languages)){ 53 | lang <- languages[langInd] 54 | message("Processing ", lang, "...") 55 | prof <- get_profiles(lang) 56 | 57 | df <- data.frame( 58 | username=character(0), 59 | smile=numeric(0), 60 | gender=character(0), 61 | age=numeric(0), 62 | mustache=numeric(0), 63 | beard=numeric(0), 64 | sideburns=numeric(0) 65 | ) 66 | for (i in 1:length(prof)){ 67 | tryCatch({ 68 | attrs <- cached_msface(prof[[i]], ms_key) 69 | df <- rbind(df, data.frame( 70 | username=names(prof)[i], 71 | smile=attrs$smile, 72 | gender=attrs$gender, 73 | age=attrs$age, 74 | mustache=attrs$facialHair$moustache, 75 | beard=attrs$facialHair$beard, 76 | sideburns=attrs$facialHair$sideburns 77 | )) 78 | }, error = function(e){ 79 | df <- rbind(df, data.frame( 80 | username=names(prof)[i], 81 | smile=NA, 82 | gender=NA, 83 | age=NA, 84 | mustache=NA, 85 | beard=NA, 86 | sideburns=NA)) 87 | }) 88 | } 89 | results[[langInd]] <- df 90 | saveRDS(results, "results.Rds") 91 | } 92 | ``` 93 | 94 | ```{r unique, echo=FALSE} 95 | # We can 
exclude redundant usernames in each programming language. 96 | names(results) <- languages 97 | saveRDS(results, "results.Rds") 98 | 99 | # Filter to only include unique users 100 | # TODO: do this before looking up their face API redundantly 101 | results <- lapply(results, function(x){ 102 | unique(x) 103 | }) 104 | saveRDS(results, "unique-results.Rds") 105 | ``` 106 | 107 | GitHub lists 25 repositories on its trending page and shows the top 5 committers for each. Some projects don't have 5 contributors, so fewer are shown. We send each of these profile pictures (up to 125 per language) to be analyzed by the Face API, then remove duplicated usernames within each language. Of course, not all pictures have (detectable) faces in them. 108 | 109 | In total, we get the following: 110 | 111 | ```{r, echo=FALSE, results='asis'} 112 | df <- data.frame(Lang=languages, FacesDetected=numeric(length(languages))) 113 | for (i in 1:length(languages)){ 114 | df[i,2] <- nrow(results[[i]]) 115 | } 116 | 117 | knitr::kable(df) 118 | ``` 119 | 120 | 121 | ## Gender 122 | 123 | One of the properties returned by the Face API is a prediction of the gender of the subject of the photo. The results are pretty discouraging for anyone who's not a chauvinist... 
124 | 125 | ```{r gender, warning=FALSE, echo=FALSE, message=FALSE} 126 | gender <- (sapply(results, function(x){list(male=sum(x["gender"] == "male"), female=sum(x["gender"] == "female"))})) 127 | # Weird list format 128 | gender <- matrix(unlist(gender), nrow=2) 129 | rownames(gender) <- c("Male", "Female") 130 | colnames(gender) <- languages 131 | 132 | counts <- apply(gender, 2, sum) 133 | 134 | library(dplyr) 135 | library(tidyr) 136 | genderRatio <- as.data.frame(t(gender/rep(counts, each=2))) 137 | genderRatio$language <- rownames(genderRatio) 138 | genderRatio <- gather(genderRatio, gender, ratio, Male, Female) 139 | 140 | #Sort 141 | genderRatio <- genderRatio %>% arrange(desc(gender),desc(ratio)) 142 | 143 | library(plotly) 144 | plot_ly(genderRatio, x=language, y=ratio, type="bar", color=gender) %>% 145 | layout(barmode="stack") 146 | ``` 147 | 148 | ## Age 149 | 150 | Age is an interesting trend. Some people assume that "old-school" languages are only used by old people and that new, trendy languages are used by hipsters. It turns out that's not always true; Java, for instance, has the *lowest* median age. 
151 | 152 | ```{r age, warning=FALSE, echo=FALSE} 153 | listToDF <- function(lst, name="val"){ 154 | df <- NULL 155 | for (i in 1:length(lst)){ 156 | li <- lst[[i]] 157 | f <- data.frame(language=rep(names(lst)[i], length(li)), 158 | val=li, stringsAsFactors = FALSE) 159 | df <- bind_rows(df, f) 160 | } 161 | colnames(df)[2] <- name 162 | df 163 | } 164 | 165 | listToDensity <- function(lst, title="", xlab="x", ...){ 166 | dens <- lapply(lst, function(li){ 167 | breaks <- seq(from=min(li), to=max(li), length.out=6) 168 | cu <- cut(li, breaks, right=FALSE, labels=breaks[-1]) 169 | 170 | # Prepend 0 171 | ar <- c(0, table(cu)) 172 | names(ar)[1] <- "0" 173 | 174 | ar 175 | }) 176 | df <- data.frame( 177 | x = unlist(lapply(dens, function(x){as.numeric(names(x))})), 178 | y = unlist(lapply(dens, as.numeric)), 179 | lang = rep(names(dens), each = length(dens[[1]])) 180 | ) 181 | 182 | plot_ly(df, x=x, y=y, color=lang) %>% 183 | layout(title=title, xaxis=list(title=xlab), yaxis=list(title="Ratio"), autosize=FALSE, ...) 184 | } 185 | 186 | listToBar <- function(lst, fun, title="", ylab="Average"){ 187 | avg <- sapply(lst, fun) 188 | df <- as.data.frame(avg) 189 | df$language <- rownames(df) 190 | 191 | df <- df %>% arrange(desc(avg)) 192 | 193 | plot_ly(df, x=language, y=avg, type="bar") %>% 194 | layout(title=title, yaxis=list(title=ylab), autosize=TRUE) 195 | } 196 | 197 | ages <- lapply(results, "[[", "age") 198 | 199 | library(shiny) 200 | tabsetPanel( 201 | tabPanel("Barplot", print(listToBar(ages, median, "Median Age of Programmers by Language", "Median Age"))), 202 | tabPanel("Density", print(listToDensity(ages, "Age of Programmers by Language", "Age"))) 203 | ) 204 | ``` 205 | 206 | ## Smiles 207 | 208 | Every programmer has a language that makes them miserable. So miserable, perhaps, that you can't even muster a smile for your GitHub profile picture. 209 | 210 | The Face API returns a score from 0 to 1 approximating the amount that you're smiling. 
Programmers using certain languages seem happier than others. Maybe R programmers are just smiling about the [crazy market for data scientists](https://www.oreilly.com/ideas/2015-data-science-salary-survey) in this economy... 211 | 212 | ```{r smiles, warning=FALSE, echo=FALSE} 213 | smiles <- lapply(results, "[[", "smile") 214 | 215 | tabsetPanel( 216 | tabPanel("Barplot", print(listToBar(smiles, mean, "Mean Smiliness of Programmers by Language", "Mean Smile Score"))), 217 | tabPanel("Density", print(listToDensity(smiles, "Smiliness of Programmers by Language", "Smile Score"))) 218 | ) 219 | ``` 220 | 221 | ## Facial Hair 222 | 223 | If you've been coding for any length of time, you've met at least one mustached fellow riding a fixie and wearing skinny jeans who won't stop talking about Swift. Turns out that's a real stereotype. 224 | 225 | I did not normalize for gender here. 226 | 227 | ```{r hair, warning=FALSE, echo=FALSE} 228 | mustache <- lapply(results, "[[", "mustache") 229 | beard <- lapply(results, "[[", "beard") 230 | sideburns <- lapply(results, "[[", "sideburns") 231 | 232 | # Facial Hair 233 | facialHair <- list() 234 | for (i in 1:length(mustache)){ 235 | facialHair[[i]] <- mustache[[i]] + beard[[i]] + sideburns[[i]] 236 | } 237 | names(facialHair) <- languages 238 | 239 | tabsetPanel( 240 | tabPanel("Barplot", print(listToBar(facialHair, mean, "Average Facial Hair by Language", "Facial Hair"))), 241 | tabPanel("Density", print(listToDensity(facialHair, "Facial Hair by Language", "Facial Hair"))) 242 | ) 243 | ``` 244 | 245 | Or we can look at each facial hair property returned by the Face API individually. 
246 | 247 | ```{r, warning=FALSE, echo=FALSE} 248 | tabsetPanel( 249 | tabPanel("Mustache", print(listToDensity(mustache, "Mustaches by Language", "Mustaches"))), 250 | tabPanel("Beard", print(listToDensity(beard, "Beards by Language", "Beards"))), 251 | tabPanel("Sideburns", print(listToDensity(sideburns, "Sideburns by Language", "Sideburns"))) 252 | ) 253 | ``` 254 | 255 | ## Conclusion 256 | 257 | I guess you should grow a mustache if you want to contribute to a successful C++ project? 258 | 259 | In reality, you'd need a much larger sample size and some guarantees around the accuracy of the Face API before you could have any real confidence about the conclusions you draw from this data. 260 | -------------------------------------------------------------------------------- /betaface.R: -------------------------------------------------------------------------------- 1 | library(httr) 2 | 3 | img <- "https://avatars2.githubusercontent.com/u/1356007?v=3&s=420" 4 | 5 | # The free Betaface API key 6 | betaface <- function(img, key="d45fd466-51e2-4701-8da8-04351c872236", 7 | secret="171e8465-f548-401d-b63b-caf0dc28df5f"){ 8 | res <- POST("http://betafaceapi.com/service_json.svc/UploadNewImage_Url", body=list( 9 | api_key = key, 10 | api_secret = secret, 11 | detection_flags = "bestface classifiers", 12 | image_url = img 13 | )) 14 | 15 | stop_for_status(res) 16 | 17 | uid <- content(res)$img_uid 18 | 19 | res <- POST("http://betafaceapi.com/service_json.svc/GetImageInfo", body=list( 20 | api_key = key, 21 | api_secret = secret, 22 | img_uid = uid 23 | )) 24 | 25 | stop_for_status(res) 26 | 27 | cont <- content(res) 28 | if (is.null(cont$faces) || length(cont$faces) == 0){ 29 | stop("No face found") 30 | } 31 | 32 | face <- cont$faces[[1]] 33 | do.call(rbind.data.frame, face$tags) 34 | } 35 | 36 | library(digest) 37 | cached_msface <- function(img, key, cache="./cache"){ 38 | if (!dir.exists(cache)){ 39 | dir.create(cache) 40 | } 41 | 42 | # Lookup img hash in cache 43 | hash 
<- digest::sha1(img) 44 | 45 | if (file.exists(file.path(cache, hash))){ 46 | obj <- readRDS(file.path(cache, hash)) 47 | message("Cache hit: ", img) 48 | if (is.null(obj)){ 49 | stop("Original error") 50 | } 51 | return(obj) 52 | } 53 | 54 | message("Not in cache: ", img, " ", hash) 55 | tryCatch({ 56 | f <- msface(img, key) 57 | saveRDS(f, file.path(cache, hash)) 58 | return(f) 59 | }, error=function(e){ 60 | saveRDS(NULL, file.path(cache, hash)) 61 | }) 62 | } 63 | 64 | # Microsoft Faces API key 65 | ms_key <- readLines("ms_key.txt") 66 | msface <- function(img, key){ 67 | url <- "https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=0&returnFaceLandmarks=0&returnFaceAttributes=age,gender,facialHair,smile" 68 | res <- POST(url, add_headers("Ocp-Apim-Subscription-Key"=key), body=list(url=img), encode="json") 69 | 70 | while (res$status_code == 429){ 71 | # Rate limit. Sleep and try again 72 | message("Hit rate limit. Sleeping for 60 seconds") 73 | Sys.sleep(60) 74 | res <- POST(url, add_headers("Ocp-Apim-Subscription-Key"=key), body=list(url=img), encode="json") 75 | } 76 | 77 | stop_for_status(res) 78 | 79 | cont <- content(res) 80 | if (length(cont) < 1 || is.null(cont[[1]]$faceAttributes) || length(cont[[1]]$faceAttributes) == 0){ 81 | stop("No face attributes") 82 | } 83 | 84 | att <- cont[[1]]$faceAttributes 85 | } -------------------------------------------------------------------------------- /eigencoder.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /results.Rds: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/trestletech/eigencoder/9519ba0e26390259c2d79cba27e6ce8314d8dcba/results.Rds --------------------------------------------------------------------------------