├── .gitignore
├── LICENSE.md
├── README.md
├── data
│   ├── dictionary.csv
│   ├── ipedscolumns.json
│   └── ipedsfiles.json
├── requirements.txt
└── scripts
    ├── downloadData.py
    ├── getColumnNames.py
    ├── getData.R
    ├── makeDictionary.py
    └── scraper.py

/.gitignore:
--------------------------------------------------------------------------------
raw/
dict/
.Rproj.user

--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
Copyright 2017 The Urban Institute

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# IPEDS scraper

Download data from the IPEDS [complete data files](http://nces.ed.gov/ipeds/datacenter/DataFiles.aspx).

For each year, IPEDS splits its data into many files - up to several dozen. Each dataset is saved as a .csv and compressed into a .zip (Stata .zip files are also available). For some years, revised datasets are available; these are included in the same .zip file. When a revised version exists, scripts/downloadData.py deletes the unrevised file and keeps only the final (revised) one.

Each data file has a corresponding dictionary .zip, which contains an .xls, .xlsx, or .html dictionary. According to NCES, there is no comprehensive dictionary available.

Beware: variable names frequently change between years. In other cases, the variable name stays the same but the value levels change (e.g. 1, 2, 3 in 2000 and 5, 10, 15, 20 in 2001). I don't have a good answer for comparing years beyond checking the data dictionaries - see the dictionary lookup sketch under "Assemble a master dictionary" below. If you have a better answer, please share!


## Functions
### Scrape list of available files
Assembles [data/ipedsfiles.json](data/ipedsfiles.json) with info on all available complete data files from IPEDS (year, survey, title, data file .zip URL, dictionary file .zip URL).
```bash
python3 scripts/scraper.py
```

### Assemble a master dictionary
Downloads and extracts the dictionary files for the given years listed in [data/ipedsfiles.json](data/ipedsfiles.json), then compiles the .xls and .xlsx dictionaries into [data/dictionary.csv](data/dictionary.csv).
* Note: pre-2009 dictionaries are saved as .html files and are not parsed here.
```bash
python3 scripts/makeDictionary.py STARTYEAR STOPYEAR
```
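Once the master dictionary exists, you can trace a variable across years (2009 and later only) without opening each Excel dictionary. A minimal sketch in Python, assuming you have already run makeDictionary.py for the years you care about; "hloffer" is just an example variable:
```python
# Print the year, dictionary file, and variable title for one variable.
# A year that is missing from the output is a hint the variable was renamed or dropped.
import csv

varname = "hloffer"  # example IPEDS variable

with open("data/dictionary.csv") as f:
    for row in csv.DictReader(f):
        if row["varname"].lower() == varname:
            print(row["year"], row["dictname"], row["vartitle"])
```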
### Download files
Downloads the data files listed in [data/ipedsfiles.json](data/ipedsfiles.json) for a given range of years.
```bash
python3 scripts/downloadData.py STARTYEAR STOPYEAR
```

### Get column names
Gets the column names from the downloaded files for a given range of years and saves them to a json.
```bash
python3 scripts/getColumnNames.py STARTYEAR STOPYEAR
```
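The resulting [data/ipedscolumns.json](data/ipedscolumns.json) is a list of entries with `year`, `name`, `path`, and `columns` fields, so you can quickly check which files actually contain a variable in a given year. A minimal sketch, assuming the download and column-name steps above have already been run; "stabbr" is just an example column:
```python
# List every downloaded file whose header row contains a given column.
import json

var = "stabbr"  # example column, matched case-insensitively

with open("data/ipedscolumns.json") as f:
    entries = json.load(f)

for entry in entries:
    cols = [c.lower() for c in (entry["columns"] or [])]
    if var in cols:
        print(entry["year"], entry["name"], entry["path"])
```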
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4
lxml
selenium
xlrd

--------------------------------------------------------------------------------
/scripts/downloadData.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Download all IPEDS Complete Data Files for a given set of years
Extract and keep the final/revised versions
Hannah Recht, 04-04-16
"""

from urllib.request import urlopen
import json
import zipfile
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("start", help="start year",
                    type=int)
parser.add_argument("stop", help="stop year",
                    type=int)
args = parser.parse_args()

# Import json of available files, created in scraper.py
with open('data/ipedsfiles.json') as fp:
    allfiles = json.load(fp)

# Download all the data in the given years
def downloadData(start, stop):
    print("*****************************")
    print("Downloading data")
    print("*****************************")
    for i in range(start, stop):
        print("Downloading " + str(i) + " data files")
        # Make directory for the raw files - one per year
        if not os.path.exists('raw/' + str(i) + '/'):
            os.makedirs('raw/' + str(i) + '/')
        # Download all the files in the json
        for f in allfiles:
            if(f['year'] == i):
                # URL to download
                url = f['dataurl']
                # dataset file name (XXXX.zip)
                urlname = url.split("http://nces.ed.gov/ipeds/datacenter/data/", 1)[1]
                rd = urlopen(url)
                saveurl = "raw/" + str(i) + '/' + urlname
                # Save the zip file
                with open(saveurl, "wb") as p:
                    p.write(rd.read())

                # Unzip it
                zip_ref = zipfile.ZipFile(saveurl, 'r')
                zip_ref.extractall("raw/" + str(i) + '/')
                zip_ref.close()

                # Remove the zip file
                os.remove("raw/" + str(i) + '/' + urlname)

# Some datasets have been revised over time, so they'll download XXXX.csv and XXXX_rv.csv
# We only want the revised version
def removeDups(start, stop):
    print("*****************************")
    print("Removing duplicates")
    print("*****************************")
    for i in range(start, stop):
        print("Removing " + str(i) + " duplicates")
        files = os.listdir('raw/' + str(i) + '/')
        # See how many files are in each year
        # print([i, len(files)])
        for file in files:
            # file name minus '.csv'
            name = file[:-4]
            # If the file name ends in _rv, keep that one and delete the other (no _rv)
            if(name[-3:] == '_rv'):
                # print(name)
                unrevised = name[:-3]
                if(os.path.exists('raw/' + str(i) + '/' + unrevised + '.csv')):
                    os.remove('raw/' + str(i) + '/' + unrevised + '.csv')
                    print('Removed ' + unrevised)
                # else:
                #     print('no match ' + unrevised)

downloadData(args.start, args.stop)
removeDups(args.start, args.stop)

--------------------------------------------------------------------------------
/scripts/getColumnNames.py:
--------------------------------------------------------------------------------
# Get the column names in each downloaded CSV and save them to data/ipedscolumns.json

import json
import os
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("start", help="start year",
                    type=int)
parser.add_argument("stop", help="stop year",
                    type=int)
args = parser.parse_args()

dataVariables = list()

def listVars(start, stop):
    print("*****************************")
    print("Getting column names")
    print("*****************************")
    for i in range(start, stop):
        print("Getting " + str(i) + " column names")
        files = os.listdir('raw/' + str(i) + '/')
        for file in files:
            if file.endswith('.csv'):
                # print(file)

                entry = dict()
                entry['year'] = i

                # file name minus '.csv'
                name = file[:-4]
                # If the file name ends in _rv, strip the _rv for the name field
                if(name[-3:] == '_rv'):
                    name = name[:-3]

                entry['name'] = name
                entry['path'] = 'raw/' + str(i) + '/' + file
                with open('raw/' + str(i) + '/' + file, 'r') as c:
                    d_reader = csv.DictReader(c)
                    entry['columns'] = d_reader.fieldnames
                dataVariables.append(entry)
    # Export to json
    with open('data/ipedscolumns.json', 'w') as fp:
        json.dump(dataVariables, fp)

listVars(args.start, args.stop)

--------------------------------------------------------------------------------
/scripts/getData.R:
--------------------------------------------------------------------------------
# Functions to get data from the IPEDS csvs into R, format it, and join it into one long data frame

library("jsonlite")
library("dplyr")
library("stringr")
library("openxlsx")

ipedspath <- "/Users/hrecht/Documents/ipeds-scraper/"
allfiles <- fromJSON(paste(ipedspath, "data/ipedsfiles.json", sep=""))
datacols <- fromJSON(paste(ipedspath, "data/ipedscolumns.json", sep=""))

# IPEDS dictionary
dictionary <- read.csv(paste(ipedspath, "data/dictionary.csv", sep=""), stringsAsFactors = F)

# Join colnames to file info, remove FLAGS datasets, use 1990+ only
ipeds <- left_join(datacols, allfiles, by = c("name", "year"))
ipeds <- ipeds %>% filter(!grepl("flags", name)) %>%
  filter(year >= 1990)

# There are a few duplicates in the way that IPEDS lists its files - remove them
ipeds <- ipeds[!duplicated(ipeds[, "path"]), ]

# Search for one or more variables, return a list of the files that contain them
searchVars <- function(vars) {
  # Filter the full IPEDS metadata to just the files containing your vars
  dt <- ipeds %>% filter(grepl(paste(vars, collapse='|'), columns, ignore.case = T))
  datalist <- split(dt, dt$name)
  return(datalist)
}

# Read in the datasets containing the var(s) and select the necessary columns
getData <- function(datalist, vars, keepallvars) {
  allvars <- tolower(c(vars, "unitid", "year"))
  for (i in seq_along(datalist)) {
    # Construct path to CSV
    csvpath <- datalist[[i]]$path
    fullpath <- paste(ipedspath, csvpath, sep="")
    name <- datalist[[i]]$name

    print(paste("Reading in ", fullpath, sep = ""))

    # Read the CSV - some IPEDS CSVs are malformed, containing extra commas at the end of all rows but the header
    # Need to handle these. Permanent solution - send the list of malformed files to NCES. This is a known issue.
    row1 <- readLines(fullpath, n = 1)
    csvnames <- unlist(strsplit(row1, ','))
    d <- read.table(fullpath, header = F, stringsAsFactors = F, sep=",", skip = 1, na.strings=c("", ".", "NA"))
    if (length(csvnames) == ncol(d)) {
      colnames(d) <- csvnames
    } else if (length(csvnames) == ncol(d) - 1) {
      colnames(d) <- c(csvnames, "xxx")
      print("Malformed CSV - extra column without header. Handled by R function but note for NCES.")
    } else {
      print("Malformed CSV - unknown column length mismatch error. Note for NCES")
      print(fullpath)
    }

    #d <- read.csv(fullpath, header=T, stringsAsFactors = F, na.strings=c("",".","NA"))
    # Give it a year variable
    d$year <- datalist[[i]]$year
    # All lowercase colnames
    colnames(d) <- tolower(colnames(d))

    # OPEID is sometimes integer, sometimes character - coerce to character
    if("opeid" %in% colnames(d))
    {
      d$opeid <- as.character(d$opeid)
    }
    if("f2a20" %in% colnames(d))
    {
      d$f2a20 <- as.character(d$f2a20)
    }
    # unitid sometimes has type issues
    d$unitid <- as.character(d$unitid)
    # Select just the needed vars
    if(keepallvars == FALSE) {
      selects <- intersect(colnames(d), allvars)
      d <- d %>% select(one_of(selects))
    } else {
      d <- d %>% select(-starts_with("x"))
    }
    assign(name, d, envir = .GlobalEnv)
  }
}

# Bind rows to make one long data frame
makeDataset <- function(vars) {
  dt <- ipeds %>% filter(grepl(paste(vars, collapse='|'), columns, ignore.case = T))
  ipeds_list <- lapply(dt$name, get)
  ipedsdata <- bind_rows(ipeds_list)
  ipedsdata <- ipedsdata %>% arrange(year, unitid)
  # unitid back to numeric
  ipedsdata$unitid <- as.numeric(ipedsdata$unitid)
  return(ipedsdata)
}

# If desired (usually the case), do all the things: search, read the datasets, bind them
returnData <- function(myvars, keepallvars = FALSE) {
  dl <- searchVars(myvars)
  getData(dl, myvars, keepallvars)
  makeDataset(myvars)
}
rm(allfiles, datacols)

# Example - some institutional characteristics
instvars <- c("fips", "stabbr", "instnm", "sector", "pset4flg", "instcat", "ccbasic", "control", "deggrant", "opeflag", "opeind", "opeid", "carnegie", "hloffer")
institutions <- returnData(instvars)

--------------------------------------------------------------------------------
/scripts/makeDictionary.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Download IPEDS dictionaries and make a master csv dictionary
Note: pre-2009 dictionaries are awfully-formatted HTML and are not parsed here.
"""

from urllib.request import urlopen
import json
import zipfile
import os
import xlrd
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("start", help="start year",
                    type=int)
parser.add_argument("stop", help="stop year",
                    type=int)
args = parser.parse_args()

# Import json of available files, created in scraper.py
with open('data/ipedsfiles.json') as fp:
    allfiles = json.load(fp)

# Make directory for the raw files
if not os.path.exists('raw/dictionary/'):
    os.makedirs('raw/dictionary/')

# The pre-2009 dictionaries are HTML. Fun! Actually misery! 2009+ are a mix of .xls and .xlsx and a few .html
# Downloading the pre-2009 dictionary zips will get you a bunch of html files

def downloadDicts(start, stop):
    print("*****************************")
    print("Downloading dictionaries")
    print("*****************************")
    for i in range(start, stop):
        print("Downloading " + str(i) + " dictionaries")
        # Make directory for the raw files - one per year
        if not os.path.exists('dict/' + str(i) + '/'):
            os.makedirs('dict/' + str(i) + '/')
        # Download all the files in the json
        for f in allfiles:
            if(f['year'] == i):
                # URL to download
                url = f['dicturl']
                # dictionary file name (XXXX.zip)
                urlname = url.split("http://nces.ed.gov/ipeds/datacenter/data/", 1)[1]
                rd = urlopen(url)
                saveurl = "dict/" + str(i) + '/' + urlname
                # Save the zip file
                with open(saveurl, "wb") as p:
                    p.write(rd.read())

                # Unzip it
                zip_ref = zipfile.ZipFile(saveurl, 'r')
                zip_ref.extractall("dict/" + str(i) + '/')
                zip_ref.close()

                # Remove the zip file
                os.remove("dict/" + str(i) + '/' + urlname)

# For the Excel dictionaries, compile the varlist tabs
def makeMasterDict(start, stop):
    print("*****************************")
    print("Assembling master dictionary")
    print("*****************************")
    # Set up the dictionary CSV with a header row
    with open('data/dictionary.csv', 'w') as f:
        c = csv.writer(f)
        c.writerow(['year', 'dictname', 'dictfile', 'varnumber', 'varname', 'datatype', 'fieldwidth', 'format', 'imputationvar', 'vartitle'])

    # For each Excel dictionary, take the contents and file name and add to the master dictionary csv
    for i in range(start, stop):
        for file in os.listdir('dict/' + str(i) + '/'):
            if file.endswith((".xls", ".xlsx")):
                print("Adding " + str(i) + " " + file + " to dictionary")
                dictname = file.split(".", 1)[0]
                rowstart = [i, dictname, file]
                workbook = xlrd.open_workbook('dict/' + str(i) + '/' + file, on_demand = True)
                worksheet = workbook.sheet_by_name('varlist')
                with open('data/dictionary.csv', 'a') as f:
                    c = csv.writer(f)
                    for r in range(2, worksheet.nrows):
                        varrow = worksheet.row_values(r)
                        row = rowstart + varrow
                        c.writerow(row)

downloadDicts(args.start, args.stop)
makeMasterDict(args.start, args.stop)

--------------------------------------------------------------------------------
/scripts/scraper.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Scrape IPEDS http://nces.ed.gov/ipeds/datacenter/DataFiles.aspx
Hannah Recht, 03-24-16
"""

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import json

driver = webdriver.Firefox()

# Directory url for downloads
dirurl = "http://nces.ed.gov/ipeds/datacenter/"

files = list()

def scrapetable():
    # Scrape table of datasets
    content = driver.page_source
    soup = BeautifulSoup(''.join(content), "lxml")
    table = soup.find("table", { "id" : "contentPlaceHolder_tblResult" })
    # Get info and URLs for data zip and dictionary zip
    for row in table.find_all('tr')[1:]:
        entry = dict()
        tds = row.find_all('td')
        entry['year'] = int(tds[0].text)
        entry['survey'] = tds[1].text
        entry['title'] = tds[2].text
        entry['dataurl'] = dirurl + tds[3].a.get('href')
        entry['dicturl'] = dirurl + tds[6].a.get('href')
        # File name minus 'data/' and '.zip'
        entry['name'] = (tds[3].a.get('href')[5:-4]).lower()
        files.append(entry)

# There is no direct link to the complete data files view. Need to press some buttons.
# If the site changes this will probably all break yay

# Complete data files entry point
driver.get('http://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7')

# Press continue
driver.find_element_by_xpath("//input[@id='ImageButton1' and @title='Continue']").click()

# Make a list of all the available years
select = Select(driver.find_element_by_id('contentPlaceHolder_ddlYears'))
years = list()
for option in select.options:
    years.append(option.get_attribute('value'))

# Get info on all the available datasets per year, save
def chooseyear(year):
    # Choose year from dropdown
    select.select_by_value(year)
    # Continue to list of datasets
    driver.find_element_by_xpath("//input[@id='contentPlaceHolder_ibtnContinue']").click()
    # Scrape the table of available datasets, add to 'files'
    scrapetable()

# -1 = All years
chooseyear('-1')

# Export to json
with open('data/ipedsfiles.json', 'w') as fp:
    json.dump(files, fp)

--------------------------------------------------------------------------------