├── .gitignore
├── LICENSE.md
├── README.md
├── data
│   ├── dictionary.csv
│   ├── ipedscolumns.json
│   └── ipedsfiles.json
├── requirements.txt
└── scripts
    ├── downloadData.py
    ├── getColumnNames.py
    ├── getData.R
    ├── makeDictionary.py
    └── scraper.py

/.gitignore:
--------------------------------------------------------------------------------
raw/
dict/
.Rproj.user

--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
Copyright 2017 The Urban Institute

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# IPEDS scraper

Download data from the IPEDS [complete data files](http://nces.ed.gov/ipeds/datacenter/DataFiles.aspx).

For each year, IPEDS splits its data into many files - up to several dozen. Each dataset is saved as a .csv and compressed into a .zip (Stata .zip files are also available). For some years, revised datasets are available; these are included in the same .zip file. When a revised version exists, scripts/downloadData.py deletes the unrevised file and keeps only the final (revised) one.

Each data file has a corresponding dictionary .zip, which contains an .xls, .xlsx, or .html dictionary. According to NCES, there is no comprehensive dictionary available.

Beware: variable names frequently change between years. In other cases, the variable name stays the same but the value levels change (e.g. 1, 2, 3 in 2000 and 5, 10, 15, 20 in 2001). I don't have a good answer for comparing years beyond checking the data dictionaries - see the dictionary lookup sketch under "Assemble a master dictionary" below. If you have a better answer, please share!


## Functions
### Scrape list of available files
Assembles [data/ipedsfiles.json](data/ipedsfiles.json) with info on all available complete data files from IPEDS (year, survey, title, data file .zip URL, dictionary file .zip URL).
```bash
python3 scripts/scraper.py
```

### Assemble a master dictionary
Downloads and extracts the dictionary files for the given years listed in [data/ipedsfiles.json](data/ipedsfiles.json), then compiles the .xls and .xlsx dictionaries into [data/dictionary.csv](data/dictionary.csv).
* Note: pre-2009 dictionaries are saved as .html files and are not parsed here.
```bash
python3 scripts/makeDictionary.py STARTYEAR STOPYEAR
```
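Once the master dictionary exists, you can trace a variable across years (2009 and later only) without opening each Excel dictionary. A minimal sketch in Python, assuming you have already run makeDictionary.py for the years you care about; "hloffer" is just an example variable:
```python
# Print the year, dictionary file, and variable title for one variable.
# A year that is missing from the output is a hint the variable was renamed or dropped.
import csv

varname = "hloffer"  # example IPEDS variable

with open("data/dictionary.csv") as f:
    for row in csv.DictReader(f):
        if row["varname"].lower() == varname:
            print(row["year"], row["dictname"], row["vartitle"])
```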
### Download files
Downloads the data files listed in [data/ipedsfiles.json](data/ipedsfiles.json) for a given range of years.
```bash
python3 scripts/downloadData.py STARTYEAR STOPYEAR
```

### Get column names
Gets the column names from the downloaded files for a given range of years and saves them to a json.
```bash
python3 scripts/getColumnNames.py STARTYEAR STOPYEAR
```
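The resulting [data/ipedscolumns.json](data/ipedscolumns.json) is a list of entries with `year`, `name`, `path`, and `columns` fields, so you can quickly check which files actually contain a variable in a given year. A minimal sketch, assuming the download and column-name steps above have already been run; "stabbr" is just an example column:
```python
# List every downloaded file whose header row contains a given column.
import json

var = "stabbr"  # example column, matched case-insensitively

with open("data/ipedscolumns.json") as f:
    entries = json.load(f)

for entry in entries:
    cols = [c.lower() for c in (entry["columns"] or [])]
    if var in cols:
        print(entry["year"], entry["name"], entry["path"])
```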
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4
lxml
selenium
xlrd

--------------------------------------------------------------------------------
/scripts/downloadData.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Download all IPEDS Complete Data Files for a given set of years
Extract and keep the final/revised versions
Hannah Recht, 04-04-16
"""

from urllib.request import urlopen
import json
import zipfile
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("start", help="start year",
                    type=int)
parser.add_argument("stop", help="stop year",
                    type=int)
args = parser.parse_args()

# Import json of available files, created in scraper.py
with open('data/ipedsfiles.json') as fp:
    allfiles = json.load(fp)

# Download all the data in the given years
def downloadData(start, stop):
    print("*****************************")
    print("Downloading data")
    print("*****************************")
    for i in range(start, stop):
        print("Downloading " + str(i) + " data files")
        # Make directory for the raw files - one per year
        if not os.path.exists('raw/' + str(i) + '/'):
            os.makedirs('raw/' + str(i) + '/')
        # Download all the files in the json
        for f in allfiles:
            if(f['year'] == i):
                # URL to download
                url = f['dataurl']
                # dataset file name (XXXX.zip)
                urlname = url.split("http://nces.ed.gov/ipeds/datacenter/data/", 1)[1]
                rd = urlopen(url)
                saveurl = "raw/" + str(i) + '/' + urlname
                # Save the zip file
                with open(saveurl, "wb") as p:
                    p.write(rd.read())

                # Unzip it
                zip_ref = zipfile.ZipFile(saveurl, 'r')
                zip_ref.extractall("raw/" + str(i) + '/')
                zip_ref.close()

                # Remove the zip file
                os.remove("raw/" + str(i) + '/' + urlname)

# Some datasets have been revised over time, so they'll download XXXX.csv and XXXX_rv.csv
# We only want the revised version
def removeDups(start, stop):
    print("*****************************")
    print("Removing duplicates")
    print("*****************************")
    for i in range(start, stop):
        print("Removing " + str(i) + " duplicates")
        files = os.listdir('raw/' + str(i) + '/')
        # See how many files are in each year
        # print([i, len(files)])
        for file in files:
            # file name minus '.csv'
            name = file[:-4]
            # If the file name ends in _rv, keep that one and delete the other (no _rv)
            if(name[-3:] == '_rv'):
                # print(name)
                unrevised = name[:-3]
                if(os.path.exists('raw/' + str(i) + '/' + unrevised + '.csv')):
                    os.remove('raw/' + str(i) + '/' + unrevised + '.csv')
                    print('Removed ' + unrevised)
                # else:
                #     print('no match ' + unrevised)

downloadData(args.start, args.stop)
removeDups(args.start, args.stop)

--------------------------------------------------------------------------------
/scripts/getColumnNames.py:
--------------------------------------------------------------------------------
# Get the column names in each downloaded CSV and save them to data/ipedscolumns.json

import json
import os
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("start", help="start year",
                    type=int)
parser.add_argument("stop", help="stop year",
                    type=int)
args = parser.parse_args()

dataVariables = list()

def listVars(start, stop):
    print("*****************************")
    print("Getting column names")
    print("*****************************")
    for i in range(start, stop):
        print("Getting " + str(i) + " column names")
        files = os.listdir('raw/' + str(i) + '/')
        for file in files:
            if file.endswith('.csv'):
                # print(file)

                entry = dict()
                entry['year'] = i

                # file name minus '.csv'
                name = file[:-4]
                # If the file name ends in _rv, strip the _rv for the name field
                if(name[-3:] == '_rv'):
                    name = name[:-3]

                entry['name'] = name
                entry['path'] = 'raw/' + str(i) + '/' + file
                with open('raw/' + str(i) + '/' + file, 'r') as c:
                    d_reader = csv.DictReader(c)
                    entry['columns'] = d_reader.fieldnames
                dataVariables.append(entry)
    # Export to json
    with open('data/ipedscolumns.json', 'w') as fp:
        json.dump(dataVariables, fp)

listVars(args.start, args.stop)

--------------------------------------------------------------------------------
/scripts/getData.R:
--------------------------------------------------------------------------------
# Functions to get data from the IPEDS csvs into R, format it, and join it into one long data frame

library("jsonlite")
library("dplyr")
library("stringr")
library("openxlsx")

ipedspath <- "/Users/hrecht/Documents/ipeds-scraper/"
allfiles <- fromJSON(paste(ipedspath, "data/ipedsfiles.json", sep=""))
datacols <- fromJSON(paste(ipedspath, "data/ipedscolumns.json", sep=""))

# IPEDS dictionary
dictionary <- read.csv(paste(ipedspath, "data/dictionary.csv", sep=""), stringsAsFactors = F)

# Join colnames to file info, remove FLAGS datasets, use 1990+ only
ipeds <- left_join(datacols, allfiles, by = c("name", "year"))
ipeds <- ipeds %>% filter(!grepl("flags", name)) %>%
  filter(year >= 1990)

# There are a few duplicates in the way that IPEDS lists its files - remove them
ipeds <- ipeds[!duplicated(ipeds[, "path"]), ]

# Search for one or more variables, return a list of the files that contain them
searchVars <- function(vars) {
  # Filter the full IPEDS metadata to just the files containing your vars
  dt <- ipeds %>% filter(grepl(paste(vars, collapse='|'), columns, ignore.case = T))
  datalist <- split(dt, dt$name)
  return(datalist)
}

# Read in the datasets containing the var(s) and select the necessary columns
getData <- function(datalist, vars, keepallvars) {
  allvars <- tolower(c(vars, "unitid", "year"))
  for (i in seq_along(datalist)) {
    # Construct path to CSV
    csvpath <- datalist[[i]]$path
    fullpath <- paste(ipedspath, csvpath, sep="")
    name <- datalist[[i]]$name

    print(paste("Reading in ", fullpath, sep = ""))

    # Read the CSV - some IPEDS CSVs are malformed, containing extra commas at the end of all rows but the header
    # Need to handle these. Permanent solution - send the list of malformed files to NCES. This is a known issue.
    row1 <- readLines(fullpath, n = 1)
    csvnames <- unlist(strsplit(row1, ','))
    d <- read.table(fullpath, header = F, stringsAsFactors = F, sep=",", skip = 1, na.strings=c("", ".", "NA"))
    if (length(csvnames) == ncol(d)) {
      colnames(d) <- csvnames
    } else if (length(csvnames) == ncol(d) - 1) {
      colnames(d) <- c(csvnames, "xxx")
      print("Malformed CSV - extra column without header. Handled by R function but note for NCES.")
    } else {
      print("Malformed CSV - unknown column length mismatch error. Note for NCES")
      print(fullpath)
    }

    #d <- read.csv(fullpath, header=T, stringsAsFactors = F, na.strings=c("",".","NA"))
    # Give it a year variable
    d$year <- datalist[[i]]$year
    # All lowercase colnames
    colnames(d) <- tolower(colnames(d))

    # OPEID is sometimes integer, sometimes character - coerce to character
    if("opeid" %in% colnames(d))
    {
      d$opeid <- as.character(d$opeid)
    }
    if("f2a20" %in% colnames(d))
    {
      d$f2a20 <- as.character(d$f2a20)
    }
    # unitid sometimes has type issues
    d$unitid <- as.character(d$unitid)
    # Select just the needed vars
    if(keepallvars == FALSE) {
      selects <- intersect(colnames(d), allvars)
      d <- d %>% select(one_of(selects))
    } else {
      d <- d %>% select(-starts_with("x"))
    }
    assign(name, d, envir = .GlobalEnv)
  }
}

# Bind rows to make one long data frame
makeDataset <- function(vars) {
  dt <- ipeds %>% filter(grepl(paste(vars, collapse='|'), columns, ignore.case = T))
  ipeds_list <- lapply(dt$name, get)
  ipedsdata <- bind_rows(ipeds_list)
  ipedsdata <- ipedsdata %>% arrange(year, unitid)
  # unitid back to numeric
  ipedsdata$unitid <- as.numeric(ipedsdata$unitid)
  return(ipedsdata)
}

# If desired (usually the case), do all the things: search, read the datasets, bind them
returnData <- function(myvars, keepallvars = FALSE) {
  dl <- searchVars(myvars)
  getData(dl, myvars, keepallvars)
  makeDataset(myvars)
}
rm(allfiles, datacols)

# Example - some institutional characteristics
instvars <- c("fips", "stabbr", "instnm", "sector", "pset4flg", "instcat", "ccbasic", "control", "deggrant", "opeflag", "opeind", "opeid", "carnegie", "hloffer")
institutions <- returnData(instvars)

--------------------------------------------------------------------------------
/scripts/makeDictionary.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Download IPEDS dictionaries and make a master csv dictionary
Note: pre-2009 dictionaries are awfully-formatted HTML and are not parsed here.
"""

from urllib.request import urlopen
import json
import zipfile
import os
import xlrd
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("start", help="start year",
                    type=int)
parser.add_argument("stop", help="stop year",
                    type=int)
args = parser.parse_args()

# Import json of available files, created in scraper.py
with open('data/ipedsfiles.json') as fp:
    allfiles = json.load(fp)

# Make directory for the raw files
if not os.path.exists('raw/dictionary/'):
    os.makedirs('raw/dictionary/')

# The pre-2009 dictionaries are HTML. Fun! Actually misery! 2009+ are a mix of .xls and .xlsx and a few .html
# Downloading the pre-2009 dictionary zips will get you a bunch of html files

def downloadDicts(start, stop):
    print("*****************************")
    print("Downloading dictionaries")
    print("*****************************")
    for i in range(start, stop):
        print("Downloading " + str(i) + " dictionaries")
        # Make directory for the raw files - one per year
        if not os.path.exists('dict/' + str(i) + '/'):
            os.makedirs('dict/' + str(i) + '/')
        # Download all the files in the json
        for f in allfiles:
            if(f['year'] == i):
                # URL to download
                url = f['dicturl']
                # dictionary file name (XXXX.zip)
                urlname = url.split("http://nces.ed.gov/ipeds/datacenter/data/", 1)[1]
                rd = urlopen(url)
                saveurl = "dict/" + str(i) + '/' + urlname
                # Save the zip file
                with open(saveurl, "wb") as p:
                    p.write(rd.read())

                # Unzip it
                zip_ref = zipfile.ZipFile(saveurl, 'r')
                zip_ref.extractall("dict/" + str(i) + '/')
                zip_ref.close()

                # Remove the zip file
                os.remove("dict/" + str(i) + '/' + urlname)

# For the Excel dictionaries, compile the varlist tabs
def makeMasterDict(start, stop):
    print("*****************************")
    print("Assembling master dictionary")
    print("*****************************")
    # Set up the dictionary CSV with a header row
    with open('data/dictionary.csv', 'w') as f:
        c = csv.writer(f)
        c.writerow(['year', 'dictname', 'dictfile', 'varnumber', 'varname', 'datatype', 'fieldwidth', 'format', 'imputationvar', 'vartitle'])

    # For each Excel dictionary, take the contents and file name and add to the master dictionary csv
    for i in range(start, stop):
        for file in os.listdir('dict/' + str(i) + '/'):
            if file.endswith((".xls", ".xlsx")):
                print("Adding " + str(i) + " " + file + " to dictionary")
                dictname = file.split(".", 1)[0]
                rowstart = [i, dictname, file]
                workbook = xlrd.open_workbook('dict/' + str(i) + '/' + file, on_demand = True)
                worksheet = workbook.sheet_by_name('varlist')
                with open('data/dictionary.csv', 'a') as f:
                    c = csv.writer(f)
                    for r in range(2, worksheet.nrows):
                        varrow = worksheet.row_values(r)
                        row = rowstart + varrow
                        c.writerow(row)

downloadDicts(args.start, args.stop)
makeMasterDict(args.start, args.stop)

--------------------------------------------------------------------------------
/scripts/scraper.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Scrape IPEDS http://nces.ed.gov/ipeds/datacenter/DataFiles.aspx
Hannah Recht, 03-24-16
"""

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import json

driver = webdriver.Firefox()

# Directory url for downloads
dirurl = "http://nces.ed.gov/ipeds/datacenter/"

files = list()

def scrapetable():
    # Scrape table of datasets
    content = driver.page_source
    soup = BeautifulSoup(''.join(content), "lxml")
    table = soup.find("table", { "id" : "contentPlaceHolder_tblResult" })
    # Get info and URLs for data zip and dictionary zip
    for row in table.find_all('tr')[1:]:
        entry = dict()
        tds = row.find_all('td')
        entry['year'] = int(tds[0].text)
        entry['survey'] = tds[1].text
        entry['title'] = tds[2].text
        entry['dataurl'] = dirurl + tds[3].a.get('href')
        entry['dicturl'] = dirurl + tds[6].a.get('href')
        # File name minus 'data/' and '.zip'
        entry['name'] = (tds[3].a.get('href')[5:-4]).lower()
        files.append(entry)

# There is no direct link to the complete data files view. Need to press some buttons.
# If the site changes this will probably all break yay

# Complete data files entry point
driver.get('http://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7')

# Press continue
driver.find_element_by_xpath("//input[@id='ImageButton1' and @title='Continue']").click()

# Make a list of all the available years
select = Select(driver.find_element_by_id('contentPlaceHolder_ddlYears'))
years = list()
for option in select.options:
    years.append(option.get_attribute('value'))

# Get info on all the available datasets per year, save
def chooseyear(year):
    # Choose year from dropdown
    select.select_by_value(year)
    # Continue to list of datasets
    driver.find_element_by_xpath("//input[@id='contentPlaceHolder_ibtnContinue']").click()
    # Scrape the table of available datasets, add to 'files'
    scrapetable()

# -1 = All years
chooseyear('-1')

# Export to json
with open('data/ipedsfiles.json', 'w') as fp:
    json.dump(files, fp)

--------------------------------------------------------------------------------