├── data
│   └── README.md
├── 1-intro-R
│   ├── data-link.txt
│   ├── Lecture1.pdf
│   ├── solutions.zip
│   ├── Assignment.R
│   ├── README.md
│   ├── CEOmissing.csv
│   ├── CEOcomp.csv
│   ├── 1-5.R
│   ├── 1-3.R
│   ├── 1-4.R
│   ├── 1-2.R
│   ├── .Rapp.history
│   └── 1-1.R
├── 4-graphs
│   ├── Networks.pdf
│   ├── code
│   │   ├── exercise1_start.R
│   │   ├── section5.R
│   │   ├── exercise3_complete.R
│   │   ├── exercise4_complete.R
│   │   ├── exercise1_complete.R
│   │   ├── exercise2_complete.R
│   │   ├── exercise5_complete.R
│   │   ├── section3.R
│   │   ├── section1.R
│   │   ├── section2.R
│   │   └── section4.R
│   └── README.md
├── 5-simulation
│   ├── simjulia_slides.ppt
│   ├── preassignment.jl
│   ├── simjulia_examples
│   │   ├── bank_01.jl
│   │   ├── bank_01 (complete).jl
│   │   ├── bank_06.jl
│   │   ├── bank_08.jl
│   │   ├── bank_06 (complete).jl
│   │   ├── bank_08 (complete).jl
│   │   ├── bank_11.jl
│   │   └── bank_11 (complete).jl
│   ├── README.md
│   └── distributed.jl
├── 2-intermediate-R
│   ├── FirstHalfSlides.pdf
│   ├── SecondHalf slides.pdf
│   ├── extractTop20.R
│   ├── Carrier Names
│   ├── SecondHalf_solutions.R
│   ├── README.md
│   ├── SecondHalf.R
│   ├── FirstHalf.R
│   └── prcp_pretty.csv
├── 8-project
│   ├── Class 8 Column Generation.pdf
│   ├── README.md
│   ├── Historical_Route.csv
│   └── Flight_Alaska.csv
├── 3-visualization
│   ├── IAPvisualization2015.pptx
│   ├── README.md
│   └── pollData.csv
├── README.md
├── 7-adv-optimization
│   ├── README.md
│   └── Callbacks.ipynb
└── 6-nonlinear-opt
    ├── README.md
    ├── Nonlinear-JuMP.ipynb
    ├── IJulia intro.ipynb
    ├── Nonlinear-DCP.ipynb
    └── Nonlinear-DualNumbers.ipynb
/data/README.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /1-intro-R/data-link.txt: -------------------------------------------------------------------------------- 1 | http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_2014_9.zip -------------------------------------------------------------------------------- /4-graphs/Networks.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/4-graphs/Networks.pdf -------------------------------------------------------------------------------- /1-intro-R/Lecture1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/1-intro-R/Lecture1.pdf -------------------------------------------------------------------------------- /1-intro-R/solutions.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/1-intro-R/solutions.zip -------------------------------------------------------------------------------- /5-simulation/simjulia_slides.ppt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/5-simulation/simjulia_slides.ppt -------------------------------------------------------------------------------- /2-intermediate-R/FirstHalfSlides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/2-intermediate-R/FirstHalfSlides.pdf -------------------------------------------------------------------------------- /2-intermediate-R/SecondHalf slides.pdf: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/2-intermediate-R/SecondHalf slides.pdf -------------------------------------------------------------------------------- /8-project/Class 8 Column Generation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/8-project/Class 8 Column Generation.pdf -------------------------------------------------------------------------------- /3-visualization/IAPvisualization2015.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/joehuchette/OR-software-tools-2015/HEAD/3-visualization/IAPvisualization2015.pptx -------------------------------------------------------------------------------- /2-intermediate-R/extractTop20.R: -------------------------------------------------------------------------------- 1 | top20 = c("ATL","LAX","ORD","DFW","DEN","JFK","SFO","CLT","LAS","PHX","MIA","IAH","EWR","MCO","SEA","MSP","DTW","BOS","PHL","LGA") -------------------------------------------------------------------------------- /5-simulation/preassignment.jl: -------------------------------------------------------------------------------- 1 | Pkg.add("SimJulia") 2 | using SimJulia 3 | include(Pkg.dir("SimJulia") * "/test/example_1.jl") 4 | 5 | addprocs(2) 6 | 7 | @parallel for i in 1:2 8 | println("Hello from core $(myid())") 9 | end -------------------------------------------------------------------------------- /1-intro-R/Assignment.R: -------------------------------------------------------------------------------- 1 | # IAP 2015 2 | # 15.S60 Software Tools for Operations Research 3 | # Lecture 1: Introduction to R 4 | 5 | # Pre-assignment 6 | 7 | library(stats) 8 | lm_test <- lm(mpg ~ hp + cyl + wt + gear, data = mtcars) 9 | print(summary(lm_test)) 10 | 11 | -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_01.jl: -------------------------------------------------------------------------------- 1 | using SimJulia 2 | 3 | # Model components 4 | 5 | function visit(customer::Process, time_in_bank::Float64) 6 | 7 | end 8 | 9 | # Experiment data 10 | 11 | end_time = 100.0 12 | time_in_bank = 10.0 13 | 14 | # Model/Experiment 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /2-intermediate-R/Carrier Names: -------------------------------------------------------------------------------- 1 | 9E - Endeavor 2 | AA - American 3 | AS - Alaska 4 | B6 - JetBlue 5 | DL - Delta 6 | EV - ExpressJet 7 | F9 - Frontier 8 | FL - AirTran 9 | HA - Hawaiian 10 | MQ - Envoy 11 | OO - SkyWest 12 | UA - United 13 | US - US Airways 14 | VX - Virgin America 15 | WN - Southwest 16 | YV - Mesa 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MIT 15.S60 2015 2 | 3 | ## Software Tools for Operations Research 4 | 5 | ### Schedule 6 | * [Introduction to R] 7 | * [Intermediate R] 8 | * [Visualization in R] 9 | * [Graphs] 10 | * [Simulations] 11 | * [Nonlinear Optimization] 12 | * [Advanced Optimization] 13 | * [Project] 14 | 15 | ### Assignments 16 | Assignments should be submitted online via Stellar.
17 | 18 | -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_01 (complete).jl: -------------------------------------------------------------------------------- 1 | using SimJulia 2 | 3 | # Model components 4 | 5 | function visit(customer::Process, time_in_bank::Float64) 6 | println("$(now(customer)) $customer Here I am") 7 | hold(customer, time_in_bank) # stay in the bank 8 | println("$(now(customer)) $customer I must leave") 9 | end 10 | 11 | # Experiment data 12 | 13 | end_time = 100.0 14 | time_in_bank = 10.0 15 | 16 | # Model/Experiment 17 | 18 | sim = Simulation(uint(16)) # define environment 19 | c = Process(sim, "Ben") # define process 20 | activate(c, 5.0, visit, time_in_bank) # add process method 21 | run(sim, end_time) 22 | -------------------------------------------------------------------------------- /4-graphs/code/exercise1_start.R: -------------------------------------------------------------------------------- 1 | # Split the data by the carrier; this creates a list. 2 | spl <- split(dat, dat$Carrier) 3 | 4 | # Using lapply, we will call a function on each subset of dat that 5 | # builds a graph using the exact same split-apply-combine code we used 6 | # before. 7 | carrier.graphs <- lapply(spl, function(dat) { 8 | # Compute "edges" by splitting on Origin/Dest pairs, computing a 1-row 9 | # data frame for each, and then combining with do.call and rbind. 10 | 11 | # Compute "vertices" by splitting on Origin, computing a 1-row data 12 | # frame for each, and then combining with do.call and rbind. 13 | 14 | # Compute and return a graph g using graph.data.frame() 15 | }) 16 | 17 | -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_06.jl: -------------------------------------------------------------------------------- 1 | using Distributions 2 | using SimJulia 3 | 4 | # Model components 5 | 6 | # process method for customers 7 | function visit(customer::Process, time_in_bank::Float64) 8 | @printf("%7.4f %s: Here I am\n", now(customer), customer) 9 | hold(customer, time_in_bank) 10 | @printf("%7.4f %s: I must leave\n", now(customer), customer) 11 | end 12 | 13 | # process method for source 14 | function generate(source::Process, number::Int64, mean_time_between_arrivals::Float64) 15 | 16 | end 17 | 18 | # Experiment data 19 | 20 | num_customer = 5 21 | end_time = 400.0 22 | mean_time_between_arrivals = 10.0 23 | theseed = 99999 24 | srand(theseed) 25 | 26 | # Model/Experiment 27 | 28 | sim = Simulation(uint(16)) 29 | # define source here 30 | 31 | run(sim, end_time) 32 | -------------------------------------------------------------------------------- /1-intro-R/README.md: -------------------------------------------------------------------------------- 1 | ## Introduction to R Pre-Assignment 2 | 3 | ## Installation Instructions 4 | 5 | Please download and install R from [this webpage](http://cran.us.r-project.org). 6 | 7 | Once there, select your operating system: 8 | 9 | - For Windows users, select "Install R for the first time" then "Download R 3.1.2 for Windows" 10 | 11 | - For Mac users, select "R-3.1.2.pkg" 12 | 13 | ## Assignment 14 | 15 | Copy and paste the following lines of code to the R Console: 16 | 17 | ``` 18 | library(stats) 19 | lm_test <- lm(mpg ~ hp + cyl + wt + gear, data = mtcars) 20 | summary(lm_test) 21 | ``` 22 | 23 | Press Enter and copy the output to a .txt file.
24 | 25 | The first two lines of your output should look like: 26 | 27 | ``` 28 | Call: 29 | lm(formula = mpg ~ hp + cyl + wt + gear, data = mtcars) 30 | ``` 31 | 32 | ## Questions? 33 | Please e-mail jkung@mit.edu. 34 | 35 | -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_08.jl: -------------------------------------------------------------------------------- 1 | using Distributions 2 | using SimJulia 3 | 4 | # Model components 5 | 6 | function visit(customer::Process, time_in_bank::Float64, clerk::Resource) 7 | 8 | end 9 | 10 | function generate(source::Process, number::Int64, mean_time_between_arrivals::Float64, mean_time_in_bank::Float64, clerk::Resource) 11 | d_tba = Exponential(mean_time_between_arrivals) 12 | d_tib = Exponential(mean_time_in_bank) 13 | # generate customers 14 | end 15 | 16 | # Experiment data 17 | 18 | max_number = 5 19 | max_time = 400.0 20 | mean_time_between_arrivals = 10.0 21 | mean_time_in_bank = 12.0 22 | theseed = 99999 23 | srand(theseed) 24 | 25 | # Model/Experiment 26 | 27 | sim = Simulation(uint(16)) 28 | # create resource "k" 29 | s = Process(sim, "Source") 30 | activate(s, 0.0, generate, max_number, mean_time_between_arrivals, mean_time_in_bank, k) 31 | run(sim, max_time) 32 | -------------------------------------------------------------------------------- /5-simulation/README.md: -------------------------------------------------------------------------------- 1 | # Simulation and Distributed Computing 2 | 3 | This class introduces SimJulia, a discrete-event simulation library for Julia, and shows how to run simulations (and other computations) in parallel on a machine with more than one processing core. 4 | 5 | ## Pre-assignment: 6 | 7 | ### Update Code Repo 8 | Update your ORC software repository (`git pull`). 9 | 10 | ### Download Julia 11 | If you don't have Julia already installed, check out Ian and Miles' instructions at http://www.juliaopt.org/install.pdf. IJulia is recommended, but is not strictly necessary. 12 | 13 | ### Test Julia 14 | Run the file "preassignment.jl" in the 5-simulation/ folder and submit the output in a .txt file to Stellar. 15 | 16 | This class assumes a working knowledge of Julia. If you're not familiar with it, check out http://docs.julialang.org/en/release-0.3/manual/getting-started/ and http://learnxinyminutes.com/docs/julia/.
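For a taste of where the class is headed, here is a minimal sketch (not part of the pre-assignment) of the basic pattern for running independent simulation replications in parallel with `pmap`. The function `one_rep` is a hypothetical stand-in for a real simulation run:

```jl
addprocs(2)   # start two worker processes

@everywhere function one_rep(seed)
    srand(seed)       # seed this replication so runs are reproducible
    sum(rand(10^6))   # stand-in for the output of one simulation run
end

results = pmap(one_rep, 1:8)   # farm the eight replications out to the workers
println(mean(results))
```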
17 | -------------------------------------------------------------------------------- /1-intro-R/CEOmissing.csv: -------------------------------------------------------------------------------- 1 | CompanyNumber,TotalCompensation,Years,ChangeStockPrice,ChangeCompanySales,MBA 1,1530,7,48,89,1 2,NA,6,35,19,1 3,602,3,9,24,0 4,1170,6,37,8,NA 5,1086,NA,34,28,0 6,2536,9,NA,-16,1 7,300,2,-17,-17,NA 8,NA,2,-15,-67,1 9,250,0,-52,49,0 10,2413,10,109,-27,1 11,2707,NA,44,26,1 12,341,1,28,-7,0 13,734,4,NA,-7,NA 14,NA,8,16,NA,0 15,743,4,11,50,1 16,898,7,-21,-20,1 17,498,4,16,-24,0 18,NA,2,-10,64,0 19,1388,4,8,-58,1 20,898,5,28,-73,1 21,408,4,13,31,1 22,1091,NA,34,66,0 23,1550,7,NA,-4,1 24,NA,5,26,55,0 25,1462,7,46,10,1 26,1456,7,46,NA,1 27,1984,8,63,28,1 28,NA,10,12,-36,0 29,2021,7,48,72,1 30,2871,8,7,5,1 31,245,NA,-58,-16,1 32,3217,11,NA,51,1 33,1315,7,42,-7,0 34,NA,9,55,122,NA 35,260,0,-54,-41,1 36,250,NA,-17,-35,0 37,718,5,23,19,1 38,1593,8,NA,76,NA 39,1905,8,67,-48,1 40,NA,5,21,64,1 41,2253,7,46,104,1 42,254,0,-41,99,0 43,1883,NA,60,NA,1 44,1501,5,10,20,1 45,NA,0,-17,-18,0 46,NA,11,NA,27,1 47,NA,6,40,41,1 48,1897,8,-24,-41,NA 49,1157,5,21,87,1 50,246,3,1,-34,0 -------------------------------------------------------------------------------- /1-intro-R/CEOcomp.csv: -------------------------------------------------------------------------------- 1 | CompanyNumber,TotalCompensation,Years,ChangeStockPrice,ChangeCompanySales,MBA 1,1530,7,48,89,1 2,1117,6,35,19,1 3,602,3,9,24,0 4,1170,6,37,8,1 5,1086,6,34,28,0 6,2536,9,81,-16,1 7,300,2,-17,-17,0 8,670,2,-15,-67,1 9,250,0,-52,49,0 10,2413,10,109,-27,1 11,2707,7,44,26,1 12,341,1,28,-7,0 13,734,4,10,-7,0 14,2368,8,16,-4,0 15,743,4,11,50,1 16,898,7,-21,-20,1 17,498,4,16,-24,0 18,250,2,-10,64,0 19,1388,4,8,-58,1 20,898,5,28,-73,1 21,408,4,13,31,1 22,1091,6,34,66,0 23,1550,7,49,-4,1 24,832,5,26,55,0 25,1462,7,46,10,1 26,1456,7,46,-5,1 27,1984,8,63,28,1 28,1493,10,12,-36,0 29,2021,7,48,72,1 30,2871,8,7,5,1 31,245,0,-58,-16,1 32,3217,11,102,51,1 33,1315,7,42,-7,0 34,1730,9,55,122,1 35,260,0,-54,-41,1 36,250,2,-17,-35,0 37,718,5,23,19,1 38,1593,8,66,76,1 39,1905,8,67,-48,1 40,2283,5,21,64,1 41,2253,7,46,104,1 42,254,0,-41,99,0 43,1883,8,60,-12,1 44,1501,5,10,20,1 45,386,0,-17,-18,0 46,2181,11,37,27,1 47,1766,6,40,41,1 48,1897,8,-24,-41,1 49,1157,5,21,87,1 50,246,3,1,-34,0 -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_06 (complete).jl: -------------------------------------------------------------------------------- 1 | using Distributions 2 | using SimJulia 3 | 4 | # Model components 5 | 6 | # process method for customers 7 | function visit(customer::Process, time_in_bank::Float64) 8 | @printf("%7.4f %s: Here I am\n", now(customer), customer) 9 | hold(customer, time_in_bank) 10 | @printf("%7.4f %s: I must leave\n", now(customer), customer) 11 | end 12 | 13 | # process method for source 14 | function generate(source::Process, number::Int64, mean_time_between_arrivals::Float64) 15 | d = Exponential(mean_time_between_arrivals) 16 | for i = 1:number 17 | c = Process(simulation(source), @sprintf("Customer%02d", i)) 18 | activate(c, now(source), visit, 12.0) 19 | t = rand(d) # sample inter-arrival time "t" 20 | hold(source, t) # suspend source for "t" time units 21 | end 22 | end 23 | 24 | # Experiment data 25 | 26 | num_customer = 5 27 | end_time = 400.0 28 | mean_time_between_arrivals = 10.0 29 | theseed = 99999 30 | srand(theseed) 31 | 32 | # Model/Experiment 33 | 34 | sim = Simulation(uint(16)) 35 
| s = Process(sim, "Source") 36 | activate(s, 0.0, generate, num_customer, mean_time_between_arrivals) 37 | run(sim, end_time) 38 | -------------------------------------------------------------------------------- /4-graphs/README.md: -------------------------------------------------------------------------------- 1 | ## Networks in R Pre-Assignment 2 | 3 | ## Git update 4 | 5 | Please update your git repository to get the latest version of everything. 6 | 7 | ## Data setup 8 | 9 | If you already have the file On_Time_On_Time_Performance_2014_9.csv (the September 2014 airline flight network), that is great. Simply copy it into the folder 4-graphs in the git repository. 10 | 11 | If you don't already have this file, download http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_2014_9.zip and unzip it, saving On_Time_On_Time_Performance_2014_9.csv to the folder 4-graphs in the git repository. 12 | 13 | ## Assignment 14 | 15 | First, start R and set your working directory to the 4-graphs folder of the git repository. 16 | 17 | To verify your data is downloaded and located properly, please run the following in R (note that the data will take a little while to load): 18 | 19 | ``` 20 | dat <- read.csv("On_Time_On_Time_Performance_2014_9.csv", stringsAsFactors=FALSE) 21 | nrow(dat) 22 | length(unique(dat$Origin)) 23 | ``` 24 | 25 | Next, install the igraph package in R and run some simple commands: 26 | 27 | ``` 28 | install.packages("igraph") 29 | library(igraph) 30 | set.seed(144) 31 | max(betweenness(erdos.renyi.game(100, 0.5))) 32 | ``` 33 | 34 | Please submit the output of these two R snippets (3 total lines of output) in a .txt file on Stellar. 35 | 36 | ## Questions? 37 | Please email John Silberholz (josilber@mit.edu). -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_08 (complete).jl: -------------------------------------------------------------------------------- 1 | using Distributions 2 | using SimJulia 3 | 4 | # Model components 5 | 6 | function visit(customer::Process, time_in_bank::Float64, clerk::Resource) 7 | arrive = now(customer) 8 | @printf("%8.3f %s: Here I am\n", arrive, customer) 9 | request(customer, clerk) # waiting for the server 10 | wait = now(customer) - arrive 11 | @printf("%8.3f %s: Waited %6.3f\n", now(customer), customer, wait) 12 | hold(customer, time_in_bank) # using the server 13 | release(customer, clerk) # finish service 14 | @printf("%8.3f %s: Finished\n", now(customer), customer) 15 | end 16 | 17 | function generate(source::Process, number::Int64, mean_time_between_arrivals::Float64, mean_time_in_bank::Float64, clerk::Resource) 18 | d_tba = Exponential(mean_time_between_arrivals) 19 | d_tib = Exponential(mean_time_in_bank) 20 | for i = 1:number 21 | c = Process(simulation(source), @sprintf("Customer%02d", i)) 22 | tib = rand(d_tib) 23 | activate(c, now(source), visit, tib, clerk) 24 | tba = rand(d_tba) 25 | hold(source, tba) 26 | end 27 | end 28 | 29 | # Experiment data 30 | 31 | max_number = 5 32 | max_time = 400.0 33 | mean_time_between_arrivals = 10.0 34 | mean_time_in_bank = 12.0 35 | theseed = 99999 36 | srand(theseed) 37 | 38 | # Model/Experiment 39 | 40 | sim = Simulation(uint(16)) 41 | k = Resource(sim, "Counter", uint(1), false) 42 | s = Process(sim, "Source") 43 | activate(s, 0.0, generate, max_number, mean_time_between_arrivals, mean_time_in_bank, k) 44 | run(sim, max_time) 45 | --------------------------------------------------------------------------------
/5-simulation/simjulia_examples/bank_11.jl: -------------------------------------------------------------------------------- 1 | using Distributions 2 | using SimJulia 3 | 4 | # Model components 5 | 6 | function visit(customer::Process, time_in_bank::Float64, clerk::Resource) 7 | arrive = now(customer) 8 | @printf("%8.3f %s: Here I am\n", arrive, customer) 9 | request(customer, clerk) # waiting for the server 10 | wait = now(customer) - arrive 11 | @printf("%8.3f %s: Waited %6.3f\n", now(customer), customer, wait) 12 | hold(customer, time_in_bank) # using the server 13 | release(customer, clerk) # finish service 14 | @printf("%8.3f %s: Finished\n", now(customer), customer) 15 | end 16 | 17 | function generate(source::Process, number::Int64, mean_time_between_arrivals::Float64, mean_time_in_bank::Float64, clerk::Resource) 18 | d_tba = Exponential(mean_time_between_arrivals) 19 | d_tib = Exponential(mean_time_in_bank) 20 | for i = 1:number 21 | c = Process(simulation(source), @sprintf("Customer%02d", i)) 22 | tib = rand(d_tib) 23 | activate(c, now(source), visit, tib, clerk) 24 | tba = rand(d_tba) 25 | hold(source, tba) 26 | end 27 | end 28 | 29 | # Experiment data 30 | 31 | max_number = 5 32 | max_time = 400.0 33 | mean_time_between_arrivals = 10.0 34 | mean_time_in_bank = 12.0 35 | theseed = 99999 36 | srand(theseed) 37 | 38 | # Model/Experiment 39 | 40 | sim = Simulation(uint(16)) 41 | k = Resource(sim, "Counter", uint(1), false) # set "false" to "true" 42 | s = Process(sim, "Source") 43 | activate(s, 0.0, generate, max_number, mean_time_between_arrivals, mean_time_in_bank, k) 44 | run(sim, max_time) 45 | 46 | # Print result 47 | -------------------------------------------------------------------------------- /2-intermediate-R/SecondHalf_solutions.R: -------------------------------------------------------------------------------- 1 | ### 2 | #Joins assignment - solutions 3 | ### 4 | 5 | # 1) Join airport latitudes to the flight data. What was the largest change in latitude for any flight? 6 | flights = merge(flights,latlong[,1:2],by.x="Origin",by.y="locationID") 7 | #let's take a look at the data frame now 8 | #see that the column we've just merged in is called "Latitude" 9 | #but since we merged on origin, it's really the origin latitude. 10 | #So we rename it: 11 | names(flights)[match("Latitude",names(flights))]="Origin.Lat" 12 | #same for destination latitude 13 | flights = merge(flights,latlong[,1:2],by.x="Dest",by.y="locationID") 14 | names(flights)[match("Latitude",names(flights))]="Dest.Lat" 15 | flights$DiffLat = flights$Dest.Lat - flights$Origin.Lat 16 | biggest.lat.change = max(abs(flights$DiffLat)) 17 | # 2) (optional) Find a flight (may not be unique) which experienced this largest change in latitude. 18 | # Hint: use the order() function to sort a data frame 19 | flights = flights[order(abs(flights$DiffLat),decreasing=TRUE),] 20 | flights[1,] 21 | # 3) (optional) Re-do the jet stream example using latitudes instead of longitudes. 22 | # Is there a relationship between change in latitude and flight speed? 23 | plot(flights$DiffLat, flights$Speed,pch=".") 24 | lat.effect = cor(flights$DiffLat, flights$Speed) 25 | 26 | 27 | ### 28 | #Optional joins assignment 29 | ### 30 | #Is there a relationship between airport latitude and average delay ratio? 
31 | airport.info = merge(airport.info, latlong[,1:2],by.x="Airport",by.y="locationID") 32 | plot(airport.info$Latitude,airport.info$Avg.delay.ratio) 33 | cor(airport.info$Latitude,airport.info$Avg.delay.ratio) 34 | -------------------------------------------------------------------------------- /5-simulation/simjulia_examples/bank_11 (complete).jl: -------------------------------------------------------------------------------- 1 | using Distributions 2 | using SimJulia 3 | 4 | # Model components 5 | 6 | function visit(customer::Process, time_in_bank::Float64, clerk::Resource) 7 | arrive = now(customer) 8 | @printf("%8.3f %s: Here I am\n", arrive, customer) 9 | request(customer, clerk) # waiting for the server 10 | wait = now(customer) - arrive 11 | @printf("%8.3f %s: Waited %6.3f\n", now(customer), customer, wait) 12 | hold(customer, time_in_bank) # using the server 13 | release(customer, clerk) # finish service 14 | @printf("%8.3f %s: Finished\n", now(customer), customer) 15 | end 16 | 17 | function generate(source::Process, number::Int64, mean_time_between_arrivals::Float64, mean_time_in_bank::Float64, clerk::Resource) 18 | d_tba = Exponential(mean_time_between_arrivals) 19 | d_tib = Exponential(mean_time_in_bank) 20 | for i = 1:number 21 | c = Process(simulation(source), @sprintf("Customer%02d", i)) 22 | tib = rand(d_tib) 23 | activate(c, now(source), visit, tib, clerk) 24 | tba = rand(d_tba) 25 | hold(source, tba) 26 | end 27 | end 28 | 29 | # Experiment data 30 | 31 | max_number = 5 32 | max_time = 400.0 33 | mean_time_between_arrivals = 10.0 34 | mean_time_in_bank = 12.0 35 | theseed = 99999 36 | srand(theseed) 37 | 38 | # Model/Experiment 39 | 40 | sim = Simulation(uint(16)) 41 | k = Resource(sim, "Counter", uint(1), true) # set "monitored=true" 42 | s = Process(sim, "Source") 43 | activate(s, 0.0, generate, max_number, mean_time_between_arrivals, mean_time_in_bank, k) 44 | run(sim, max_time) 45 | 46 | # Print result 47 | println("TimeAverage no. waiting: $(time_average(wait_monitor(k)))") 48 | println("TimeAverage no. in service: $(time_average(activity_monitor(k)))") 49 | -------------------------------------------------------------------------------- /4-graphs/code/section5.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Section 5 -- Community Detection 3 | ################################################################## 4 | 5 | # One of the many modularity-maximizing algorithms is spinglass.community 6 | comm <- spinglass.community(g) 7 | comm 8 | str(comm) 9 | table(comm$membership) 10 | 11 | # Great. We'll want to plot our communities so let's actually do this 12 | # again, limiting to the continental US. We'll take this code from the 13 | # plotting bonus question and modify it for our needs, coloring airports 14 | # based on their community. 15 | g2 <- induced.subgraph(g, V(g)$Lat >= 15 & V(g)$Lat <= 50 & V(g)$Lon >= -130 & V(g)$Lon <= -60 & V(g)$Country == "United States") 16 | comm2 <- spinglass.community(g2) 17 | comm2 18 | 19 | # Let's get some spiffy colors for our nodes -- we'll get a palette 20 | # with 5 colors from RColorBrewer, which has carefully selected palettes 21 | # where all the colors look good together. 22 | library(RColorBrewer) 23 | display.brewer.all() 24 | colors <- brewer.pal(5, "Set1") 25 | colors 26 | 27 | # Now we can actually plot our image and check it out. We'll index within 28 | # the colors vector when we set vertex.color.
We can see the benefit of 29 | # having vertex metadata instead of storing it outside the graph -- if 30 | # we hadn't stored Lon, Lat, and NumFlights as metadata we would have 31 | # needed to subset each for our continental US plot. 32 | png("section5.png", width=960, height=480) 33 | plot(g2, layout=cbind(V(g2)$Lon, V(g2)$Lat), edge.arrow.mode=0, vertex.label=NA, vertex.size=3, edge.color=ifelse(E(g2)$NumFlights >= 100, "black", NA), vertex.color=colors[comm2$membership], asp=0.5) 34 | dev.off() 35 | -------------------------------------------------------------------------------- /1-intro-R/1-5.R: -------------------------------------------------------------------------------- 1 | # IAP 2015 2 | # 15.S60 Software Tools for Operations Research 3 | # Lecture 1: Introduction to R 4 | 5 | # Script file 1-5.R 6 | # In this script file, we cover SVMs 7 | 8 | ############################# 9 | ## SUPPORT VECTOR MACHINES ## 10 | ############################# 11 | 12 | # Install and load new package 13 | install.packages("e1071") 14 | library(e1071) 15 | 16 | # Build SVM model for iris data set (since SVM is 17 | # easier to visualize with smaller datasets with 18 | # continuous attributes) 19 | 20 | # First, we want to subset the dataset to only 21 | # keep two attributes (so we can easily visualize 22 | # the model) 23 | 24 | IrisDataSVM = subset(iris, select = Petal.Length:Species) 25 | 26 | # SVM model - linear kernel 27 | IrisSVM = svm(Species ~ Petal.Length + Petal.Width, data = IrisDataSVM, kernel = "linear") 28 | 29 | # Plot the model 30 | plot(IrisSVM, data = IrisDataSVM) 31 | 32 | # Color of the data points indicates 33 | # the true class; background color indicates 34 | # prediction; X indicates a support vector 35 | 36 | # SVM model - polynomial kernel 37 | IrisSVM = svm(Species ~ Petal.Length + Petal.Width, data = IrisDataSVM, kernel = "polynomial", degree = 3) 38 | plot(IrisSVM, data = IrisDataSVM) 39 | 40 | # degree = degree of polynomial used. Different 41 | # values will often give very different results. 42 | 43 | #SVM model - radial basis kernel 44 | IrisSVM = svm(Species ~ Petal.Length + Petal.Width, data = IrisDataSVM, kernel = "radial", gamma = 10) 45 | plot(IrisSVM, data=IrisDataSVM) 46 | 47 | # gamma controls how well the model will 48 | # fit the data. Larger gamma will fit the data 49 | # more exactly. Try gamma = 100 and gamma = 0.1 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /4-graphs/code/exercise3_complete.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Exercise 3 -- Regression models over edges 3 | ################################################################## 4 | 5 | # Use linear regression to predict the proportion of delayed departures 6 | # and arrivals for each edge. Predict using the number of flights on that 7 | # edge, the edge betweenness, the degree of the departure and arrival 8 | # airports on the edge, and the PageRank of the departure and arrival 9 | # airports on the edge. Check for multicollinearity between the network 10 | # metrics. 
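# (A note on the code below: get.edges(g, E(g)) returns a two-column matrix
# with one row per edge giving the endpoint vertex indices, so indexing
# degree(g) or page.rank(g)$vector by its first or second column looks up a
# per-vertex metric for the departure or arrival airport of every edge.)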
11 | 12 | g 13 | 14 | emetrics <- data.frame(LateDep=E(g)$LateDep, 15 | LateArr=E(g)$LateArr, 16 | NumFlights=E(g)$NumFlights, 17 | EdgeBetweenness=edge.betweenness(g), 18 | DepDegree=degree(g)[get.edges(g, E(g))[,1]], 19 | ArrDegree=degree(g)[get.edges(g, E(g))[,2]], 20 | DepPageRank=page.rank(g)$vector[get.edges(g, E(g))[,1]], 21 | ArrPageRank=page.rank(g)$vector[get.edges(g, E(g))[,2]]) 22 | 23 | head(emetrics) 24 | 25 | summary(lm(LateDep~NumFlights+EdgeBetweenness+DepDegree+ArrDegree+DepPageRank+ArrPageRank, data=emetrics)) 26 | summary(lm(LateArr~NumFlights+EdgeBetweenness+DepDegree+ArrDegree+DepPageRank+ArrPageRank, data=emetrics)) 27 | 28 | cor(emetrics) 29 | # Looks like we have some missing data in LateArr, so let's use na.omit: 30 | cor(na.omit(emetrics)) 31 | # Better be careful interpreting coefficients! 32 | 33 | # Bonus: one airport has relatively low degree (<= 50) but relatively 34 | # high betweenness centrality (>= 5000). Plot these two metrics against 35 | # each other to observe the outlier. What is the airport and why does 36 | # it have this property? Hint: you can access neighbors with ?neighbors. 37 | 38 | plot(degree(g), betweenness(g)) 39 | which(degree(g) <= 50 & betweenness(g) >= 5000) 40 | airports[airports$IATA == "ANC",] 41 | neighbors(g, "ANC") 42 | V(g)$name[neighbors(g, "ANC")] 43 | degree(g)[neighbors(g, "ANC")] 44 | -------------------------------------------------------------------------------- /7-adv-optimization/README.md: -------------------------------------------------------------------------------- 1 | # Mixed-integer optimization 2 | 3 | ## Preassignment 4 | 5 | For this class, we will be using the Gurobi mixed-integer programming solver. 6 | 7 | ### Installing Gurobi 8 | Gurobi is commercial software, but they have a very permissive (and free!) academic license. If you have an older version of Gurobi (>= 5.5) on your computer, that should be fine. 9 | 10 | 1. Go to www.gurobi.com 11 | 2. Create an account, and request an academic license. 12 | 3. Download the installer for Gurobi 6.0 13 | 4. Install Gurobi, accepting default options. Remember where it installed to! 14 | 5. Go back to the website and navigate to the page for your academic license. You'll be given a command with a big code in it, e.g. grbgetkey aaaaa-bbbb 15 | 6. In a terminal, navigate to the ``gurobi600/<os>/bin`` folder, where ``<os>`` is the name of your operating system. 16 | 7. Copy-and-paste the command from the website into the command prompt---you need to be on campus for this to work! 17 | 18 | 19 | ### Install the Gurobi interface in Julia 20 | 21 | Installing this is easy using the Julia package manager: 22 | ```jl 23 | julia> Pkg.add("Gurobi") 24 | ``` 25 | 26 | If you don't have an academic email or cannot get access to Gurobi for another reason, you should be able to follow along with the open-source solver GLPK for much of the class. To install, simply do 27 | ```jl 28 | julia> Pkg.add("GLPKMathProgInterface") 29 | ``` 30 | 31 | ## Solving a simple MIP 32 | How about a simple knapsack problem? Enter the following JuMP code and submit all the output to Stellar. 33 | 34 | ```jl 35 | using JuMP, Gurobi 36 | m = Model(solver=GurobiSolver(Presolve=0)) # turn presolve off to make it a little more interesting 37 | N = 100 38 | @defVar(m, x[1:N], Bin) 39 | @addConstraint(m, dot(rand(N), x) <= 5) 40 | @setObjective(m, Max, dot(rand(N), x)) 41 | solve(m) 42 | ``` 43 | 44 | ## Questions?
45 | Email huchette@mit.edu 46 | -------------------------------------------------------------------------------- /6-nonlinear-opt/README.md: -------------------------------------------------------------------------------- 1 | # Nonlinear optimization 2 | 3 | This class covers topics in nonlinear optimization. Code will be posted before the start of the class. 4 | 5 | ## Pre-assignment: 6 | 7 | ### Install Julia and IJulia 8 | IJulia is required for this class. See the instructions at http://www.juliaopt.org/install.pdf. Alternatively, you may use [JuliaBox](https://juliabox.org/) to complete the assignment and follow along with the class if there's any trouble with a local installation. (Troubleshooting note: if Julia is working but IJulia is not, try running ``Pkg.build("IJulia")`` and check for reported errors.) 9 | 10 | ### Install packages 11 | We will use the following packages: 12 | - JuMP 13 | - Optim 14 | - Ipopt 15 | - Convex 16 | - Distributions 17 | - PyPlot 18 | - Gadfly 19 | - Interact 20 | - ECOS 21 | 22 | First run ``Pkg.update()`` to update the package database, then install each one with ``Pkg.add("xxx")`` where ``xxx`` is the package name. 23 | 24 | ### Test the installation 25 | 26 | In a blank IJulia notebook, paste the following code into a cell: 27 | 28 | ```julia 29 | import Convex 30 | x = Convex.Variable(Convex.Positive()) 31 | Convex.solve!(Convex.minimize(x)) 32 | Convex.evaluate(x) 33 | ``` 34 | 35 | and run it by pressing shift-enter. The result should be some iteration output from ECOS and then a small value that's very close to zero. 36 | 37 | In the next cell, paste and run the following code: 38 | 39 | ```julia 40 | import JuMP 41 | m = JuMP.Model() 42 | @JuMP.defVar(m, x >= 0) 43 | @JuMP.setNLObjective(m, Min, x) 44 | JuMP.solve(m) 45 | JuMP.getValue(x) 46 | ``` 47 | 48 | You should see some output from Ipopt and then the result which should be a number that's exactly or very close to zero. 49 | 50 | (Note that we use ``import JuMP`` instead of ``using JuMP`` because there are some clashes in the names used by Convex.jl and JuMP.) 51 | 52 | Now go to ``File -> Download as -> IPython Notebook (.ipynb)`` and save the notebook file to your computer. Submit this file to Stellar. 53 | -------------------------------------------------------------------------------- /4-graphs/code/exercise4_complete.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Exercise 4 -- Bond percolation 3 | ################################################################## 4 | # Perform uniform random bond percolation, randomly retaining proportion 5 | # phi of edges (hint: ?subgraph.edges). As before, test a range of phi 6 | # values and compute the normalized size of the largest component. 
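# (How the dense one-liner below works: sample(ecount(g), phi*ecount(g))
# draws a random phi fraction of the edge ids, subgraph.edges() keeps only
# those edges, clusters()$csize lists the resulting component sizes, and
# c(0, ...) guards against an empty subgraph; replicate() repeats the
# experiment reps times and mean() averages the largest-component sizes.)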
7 | random.bond.percolation <- function(g, phi, reps) { 8 | mean(replicate(reps, max(c(0, clusters(subgraph.edges(g, eids=sample(ecount(g), phi*ecount(g))))$csize)))) / vcount(g) 9 | } 10 | rb.perc <- data.frame(phi=phis, perc=sapply(phis, random.bond.percolation, g=g, reps=100)) 11 | plot(rb.perc) 12 | 13 | # Perform targeted bond percolation, comparing the following strategies: 14 | # 1) Remove edges with largest minimum degree of endpoints (hint: ?pmin) 15 | # 2) Remove edges with largest edge betweenness 16 | targeted.bond.percolation1 <- function(g, phi) { 17 | ordering <- order(pmin(degree(g)[get.edges(g, E(g))[,1]], degree(g)[get.edges(g, E(g))[,2]])) 18 | max(c(0, clusters(subgraph.edges(g, head(ordering, phi*ecount(g))))$csize)) / vcount(g) 19 | } 20 | tb.perc1 <- data.frame(phi=phis, perc=sapply(phis, targeted.bond.percolation1, g=g)) 21 | plot(tb.perc1) 22 | 23 | targeted.bond.percolation2 <- function(g, phi) { 24 | ordering <- order(edge.betweenness(g)) 25 | max(c(0, clusters(subgraph.edges(g, head(ordering, phi*ecount(g))))$csize)) / vcount(g) 26 | } 27 | tb.perc2 <- data.frame(phi=phis, perc=sapply(phis, targeted.bond.percolation2, g=g)) 28 | plot(tb.perc2) 29 | 30 | # Compare targeted site percolation of the Delta (DL) and Southwest (WN) 31 | # networks. 32 | ts.delta <- data.frame(phi=phis, carrier="Delta", perc=sapply(phis, targeted.site.percolation, g=carrier.graphs$DL)) 33 | ts.sw <- data.frame(phi=phis, carrier="Southwest", perc=sapply(phis, targeted.site.percolation, g=carrier.graphs$WN)) 34 | ggplot(rbind(ts.delta, ts.sw), aes(x=phi, y=perc, group=carrier, color=carrier)) + geom_line() 35 | -------------------------------------------------------------------------------- /4-graphs/code/exercise1_complete.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Exercise 1 -- Carrier-Specific Flight Networks 3 | ################################################################## 4 | # 5 | # We have computed the network for all airlines combined, but we might 6 | # be interested in the network for each separate carrier. Create a list 7 | # of graphs for each carrier, which can be constructed by limiting the 8 | # set of all flights to just those from that carrier and then 9 | # constructing the graph in the same way that we constructed the full 10 | # graph. The carrier can be found in the Carrier variable. 
11 | 12 | spl <- split(dat, dat$Carrier) 13 | carrier.graphs <- lapply(spl, function(dat) { 14 | e.spl <- split(dat, paste(dat$Origin, dat$Dest)) 15 | e.spl2 <- lapply(e.spl, function(x) { 16 | data.frame(Origin = x$Origin[1], 17 | Dest = x$Dest[1], 18 | NumFlights = nrow(x), 19 | LateDep = mean(x$DepDel15, na.rm=T), 20 | LateArr = mean(x$ArrDel15, na.rm=T), 21 | TaxiOut = mean(x$TaxiOut, na.rm=T), 22 | TaxiIn = mean(x$TaxiIn, na.rm=T)) 23 | }) 24 | edges <- do.call(rbind, e.spl2) 25 | vertices <- do.call(rbind, lapply(split(dat, dat$Origin), function(x) { 26 | data.frame(Origin = x$Origin[1], 27 | NumFlights = nrow(x), 28 | LateDep = mean(x$DepDel15, na.rm=T), 29 | LateArr = mean(x$ArrDel15, na.rm=T), 30 | TaxiOut = mean(x$TaxiOut, na.rm=T), 31 | TaxiIn = mean(x$TaxiIn, na.rm=T)) 32 | })) 33 | g <- graph.data.frame(edges, TRUE, vertices) 34 | return(g) 35 | }) 36 | 37 | # Now we can look at qualitative differences between the carriers 38 | carriers <- do.call(rbind, lapply(carrier.graphs, function(g) { 39 | data.frame(airports=vcount(g), density=graph.density(g)) 40 | })) 41 | carriers$name <- names(carrier.graphs) 42 | carriers 43 | 44 | # We can plot to get a better sense of this data 45 | library(ggplot2) 46 | ggplot(carriers, aes(x=airports, y=density, label=name)) + geom_text() 47 | -------------------------------------------------------------------------------- /2-intermediate-R/README.md: -------------------------------------------------------------------------------- 1 | ## Intermediate R Pre-Assignment 2 | 3 | __Note that ``sqldf`` requires a relatively recent version of R (at least 3.1.0). Make sure your version is up-to-date.__ 4 | 5 | 1. Download On_Time_On_Time_Performance_2013_12.zip from the BTS TranStats site (the same source as the link in 1-intro-R/data-link.txt). 6 | 2. Extract the CSV file to your Intermediate R directory. 7 | 3. Fire up R, change your working directory to the Intermediate R directory, and run the following (could take a few minutes): 8 | 9 | -------------------------- 10 | 11 | ```R 12 | flights.raw = read.csv("On_Time_On_Time_Performance_2013_12.csv") 13 | 14 | keep = c("DayofMonth","DayOfWeek","FlightDate","Carrier","TailNum","FlightNum","Origin","OriginCityName","OriginStateFips","OriginStateName","Dest","DestCityName","DestStateFips","DestStateName","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups","DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15","ArrivalDelayGroups","ArrTimeBlk", "Cancelled","CancellationCode","CRSElapsedTime","ActualElapsedTime","AirTime","Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay") 15 | 16 | flights = flights.raw[,keep] 17 | 18 | write.csv(flights,"flights.csv") 19 | 20 | install.packages("sqldf") 21 | 22 | library(sqldf) 23 | 24 | flights.bos = sqldf("select * from 'flights' where Origin='BOS'") 25 | ``` 26 | -------------------------- 27 | 28 | __Question 1:__ What is the most common day of the week for departures in the full data set? 29 | 30 | __Question 2:__ What is the least common day of the week for departures from Boston? 31 | 32 | Hint: use the table() function. 33 | 34 | When you're done, you can delete the file On_Time_On_Time_Performance_2013_12.csv. Keep the csv file written during the homework. 35 | 36 | ## Questions?
37 | 38 | Please email efields@mit.edu 39 | 40 | The completed flights.csv is too big for the github repo but can be downloaded from https://dl.dropboxusercontent.com/u/1877897/flights.csv -------------------------------------------------------------------------------- /4-graphs/code/exercise2_complete.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Exercise 2 -- Manipulating visual properties 3 | ################################################################## 4 | # 5 | # 1) Plot the Delta Airlines network (IATA code DL) with the node size 6 | # scaled by the square root of the number of flights from an airport. 7 | # Color the Atlanta airport (ATL) red and other airports black. 8 | # B1) Plot the full network with nodes positioned based on their 9 | # latitude/longitude instead of using a layout algorithm. Adjust 10 | # edge.color to only plot edges with 100 or more flights, and mark 11 | # the top five airports by volume (ATL, ORD, DFW, DEN, LAX) as red 12 | # and the others as light gray. 13 | # B2) Replicate B1, limiting to the continental United States. You can 14 | # do this by limiting the longitude range to [-130, -60], limiting 15 | # the latitude range to [15, 50], and limiting the country to 16 | # "United States". Plot with a 2:1 width:height ratio and the 17 | # appropriate asp value for plot. Hint: ?induced.subgraph. 18 | 19 | png("exercise2_1.png") 20 | dl <- carrier.graphs$DL 21 | plot(dl, layout=layout.lgl(dl), edge.arrow.mode=0, vertex.label=NA, vertex.size=sqrt(V(dl)$NumFlights)/5, vertex.color=ifelse(V(dl)$name == "ATL", "red", "black")) 22 | dev.off() 23 | 24 | png("exercise2_b1.png") 25 | plot(g, layout=cbind(V(g)$Lon, V(g)$Lat), edge.arrow.mode=0, vertex.label=NA, vertex.size=3, edge.color=ifelse(E(g)$NumFlights >= 100, "black", NA), vertex.color=ifelse(V(g)$name %in% c("ATL", "ORD", "DFW", "DEN", "LAX"), "red", "lightgray")) 26 | dev.off() 27 | 28 | png("exercise2_b2.png", width=960, height=480) 29 | g2 <- induced.subgraph(g, V(g)$Lat >= 15 & V(g)$Lat <= 50 & V(g)$Lon >= -130 & V(g)$Lon <= -60 & V(g)$Country == "United States") 30 | plot(g2, layout=cbind(V(g2)$Lon, V(g2)$Lat), edge.arrow.mode=0, vertex.label=NA, vertex.size=3, edge.color=ifelse(E(g2)$NumFlights >= 100, "black", NA), vertex.color=ifelse(V(g2)$name %in% c("ATL", "ORD", "DFW", "DEN", "LAX"), "red", "lightgray"), asp=0.5) 31 | dev.off() 32 | -------------------------------------------------------------------------------- /3-visualization/README.md: -------------------------------------------------------------------------------- 1 | ## Visualization in R 2 | 3 | ### Prerequisites and Class Info: 4 | 5 | This module builds on the Machine Learning in R and Data Wrangling classes given in the first week. You should be comfortable writing R code to run linear regression, logistic regression, and clustering algorithms, which were all taught in Machine Learning in R. You should also be comfortable using the table command, the apply family of functions (tapply, lapply, apply), the merge command, the split-apply-combine framework, and creating your own functions. These were taught in Data Wrangling. Please review all these concepts before class on Tuesday, especially if you are new to R. 6 | 7 | The material covered will be very similar to last year. However, you're welcome to repeat it if you like! Some datasets, examples, and in-class problems will be different.
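For a quick self-check of the split-apply-combine framework before class, here is a minimal sketch on a hypothetical data frame `df` with a grouping column `g` and a numeric column `x` (placeholder names, not one of the class datasets):

```
spl <- split(df, df$g)                     # split: one piece per group
res <- lapply(spl, function(d) mean(d$x))  # apply: summarize each piece
do.call(rbind, res)                        # combine: stack the results
tapply(df$x, df$g, mean)                   # the same computation in one call
```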
8 | 9 | ### Git Update: 10 | 11 | Please update your git repository so that you have the most recent class materials. 12 | 13 | ### Data: 14 | 15 | You will need the "flights.csv" dataset that you created in your pre-class assignment for Module 2, Data Wrangling. You will also need the "airports.csv" dataset which is available in the data directory of the github repository. Please make sure you have both of these ready to go. 16 | 17 | ### Installation Instructions: 18 | 19 | Please run the following commands in an R console: 20 | 21 | ``` 22 | install.packages("ggplot2") 23 | install.packages("maps") 24 | install.packages("ggmap") 25 | install.packages("mapproj") 26 | ``` 27 | 28 | ### Assignment: 29 | 30 | Run the following code. After each plot is produced, save it, and finally submit a document on Stellar containing the three plots. 31 | 32 | ``` 33 | library(ggplot2) 34 | ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() 35 | ``` 36 | ``` 37 | library(maps) 38 | italy = map_data("italy") 39 | ggplot(italy, aes(x = long, y = lat, group = group)) + geom_polygon() 40 | ``` 41 | ``` 42 | library(ggmap) 43 | MIT = get_map(location = "Massachusetts Institute of Technology", zoom = 15) 44 | ggmap(MIT) 45 | ``` 46 | 47 | ### Questions? 48 | 49 | Please email Angie King (aking10@mit.edu). 50 | -------------------------------------------------------------------------------- /4-graphs/code/exercise5_complete.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Exercise 5 -- Adding communities to prediction models 3 | ################################################################## 4 | 5 | # Add the departure and arrival community to the regression for edge 6 | # outcomes from Section 3. Compute communities for the whole graph (not 7 | # just the continental U.S.) and model as two factor variables (hint: 8 | # ?as.factor). Remember you can start with code from 9 | # code/exercise3_complete.R. 10 | 11 | emetrics <- data.frame(LateDep=E(g)$LateDep, 12 | DepCommunity=as.factor(comm$membership[get.edges(g, E(g))[,1]]), 13 | ArrCommunity=as.factor(comm$membership[get.edges(g, E(g))[,2]]), 14 | LateArr=E(g)$LateArr, 15 | NumFlights=E(g)$NumFlights, 16 | EdgeBetweenness=edge.betweenness(g), 17 | DepDegree=degree(g)[get.edges(g, E(g))[,1]], 18 | ArrDegree=degree(g)[get.edges(g, E(g))[,2]], 19 | DepPageRank=page.rank(g)$vector[get.edges(g, E(g))[,1]], 20 | ArrPageRank=page.rank(g)$vector[get.edges(g, E(g))[,2]]) 21 | summary(lm(LateDep~DepCommunity+ArrCommunity+NumFlights+EdgeBetweenness+DepDegree+ArrDegree+DepPageRank+ArrPageRank, data=emetrics)) 22 | summary(lm(LateArr~DepCommunity+ArrCommunity+NumFlights+EdgeBetweenness+DepDegree+ArrDegree+DepPageRank+ArrPageRank, data=emetrics)) 23 | 24 | # Bonus: perform targeted bond percolation using communities. Compute an 25 | # indicator for whether each edge bridges communities and order the 26 | # removal priority first by this indicator, then by edge betweenness. 27 | # Compare to the two targeted strategies from Exercise 4, Bonus 1. 
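# (How the removal priority below is encoded: edges whose endpoints lie in
# different communities get a +10000 penalty added to their betweenness, so
# order() sorts them to the end of `ordering`; keeping only the first phi
# fraction of edge ids therefore removes the community-bridging edges first.)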
28 | 29 | comm1 <- comm$membership[get.edges(g, E(g))[,1]] 30 | comm2 <- comm$membership[get.edges(g, E(g))[,2]] 31 | ordering <- order(edge.betweenness(g) + 10000 * (comm1 != comm2)) 32 | targeted.bond.percolation3 <- function(g, phi) { 33 | max(c(0, clusters(subgraph.edges(g, ordering[1:(phi*ecount(g))]))$csize)) / vcount(g) 34 | } 35 | tb.perc3 <- data.frame(phi=phis, perc=sapply(phis, targeted.bond.percolation3, g=g)) 36 | tb.perc2$type <- "2" 37 | tb.perc3$type <- "3" 38 | tb.perc.compare <- rbind(tb.perc2, tb.perc3) 39 | ggplot(tb.perc.compare, aes(x=phi, y=perc, group=type, color=type)) + geom_line() -------------------------------------------------------------------------------- /8-project/README.md: -------------------------------------------------------------------------------- 1 | # Column Generation 2 | 3 | This class will cover column-wise modeling and the column generation solution technique. Code will be posted before the start of the class. 4 | 5 | ## Preassignment 6 | 7 | 8 | ### Install Julia, IJulia and JuMP 9 | 10 | Please see the preassignment for [module 6, nonlinear optimization](https://github.com/joehuchette/OR-software-tools-2015/tree/master/6-nonlinear-opt). 11 | 12 | ### Install Gurobi and Gurobi Interface in Julia 13 | 14 | Please see the preassignment for [module 7, mixed-integer optimization](https://github.com/joehuchette/OR-software-tools-2015/blob/master/7-adv-optimization/README.md). 15 | 16 | ### Install the [Graphs](https://github.com/JuliaLang/Graphs.jl) package in Julia 17 | 18 | Enter the following in the Julia console: 19 | ```jl 20 | julia> Pkg.add("Graphs") 21 | ``` 22 | 23 | ## 1. Solving a shortest path problem 24 | Enter the following Julia code and submit the output to Stellar. 25 | 26 | ```jl 27 | using Graphs 28 | 29 | # construct a graph and the edge distance vector 30 | 31 | g = simple_inclist(5) 32 | 33 | inputs = [ # each element is (u, v, dist) 34 | (1, 2, 10.), 35 | (1, 3, 5.), 36 | (2, 3, 2.), 37 | (3, 2, 3.), 38 | (2, 4, 1.), 39 | (3, 5, 2.), 40 | (4, 5, 4.), 41 | (5, 4, 6.), 42 | (5, 1, 7.), 43 | (3, 4, 9.) ] 44 | 45 | ne = length(inputs) 46 | dists = zeros(ne) 47 | 48 | for i = 1 : ne 49 | a = inputs[i] 50 | add_edge!(g, a[1], a[2]) # add edge 51 | dists[i] = a[3] # set distance 52 | end 53 | 54 | r = dijkstra_shortest_paths(g, dists, 1) 55 | 56 | r.parents 57 | ``` 58 | 59 | ## 2. Column-wise modeling in JuMP 60 | 61 | Enter the following JuMP code and submit the output to Stellar. 62 | ```jl 63 | using JuMP, Gurobi 64 | 65 | m = Model(solver=GurobiSolver()) 66 | @defVar(m, 0 <= x <= 1) 67 | @defVar(m, 0 <= y <= 1) 68 | @setObjective(m, Max, 5x + 1y) 69 | @addConstraint(m, con, x + y <= 1) 70 | solve(m) # x = 1, y = 0 71 | @defVar(m, 0 <= z <= 1, objective = 10.0, inconstraints = [con], coefficients = [1.0]) 72 | # The constraint is now x + y + z <= 1 73 | # The objective is now 5x + 1y + 10z 74 | solve(m) # z = 1 75 | ``` 76 | 77 | ## Questions?
78 | Email chiwei@mit.edu 79 | -------------------------------------------------------------------------------- /3-visualization/pollData.csv: -------------------------------------------------------------------------------- 1 | State,Year,SurveyUSA,DiffCount,Republican Alabama,2004,18,5,1 Alabama,2008,25,5,1 Alaska,2004,21,1,1 Alaska,2008,18,6,1 Arizona,2004,15,8,1 Arizona,2008,3,9,1 Arizona,2012,5,4,1 Arkansas,2004,5,8,1 Arkansas,2008,7,5,1 Arkansas,2012,21,2,1 California,2004,-11,-8,0 California,2008,-24,-5,0 California,2012,-14,-6,0 Colorado,2004,3,9,1 Colorado,2008,-1,-15,0 Colorado,2012,-2,-5,0 Connecticut,2004,-33,-3,0 Connecticut,2008,-16,-4,0 Connecticut,2012,-13,-8,0 Delaware,2004,-16,-2,0 Delaware,2008,-30,-4,0 Florida,2004,1,0,1 Florida,2008,-3,-13,0 Florida,2012,0,6,0 Georgia,2004,12,4,1 Georgia,2008,7,9,1 Georgia,2012,8,4,1 Hawaii,2004,4,2,0 Hawaii,2008,-24,-1,0 Hawaii,2012,-24,-2,0 Idaho,2004,22,1,1 Idaho,2008,30,1,1 Idaho,2012,24,1,1 Illinois,2004,-12,-5,0 Illinois,2008,-33,-5,0 Illinois,2012,-16,-5,0 Indiana,2004,19,3,1 Indiana,2008,0,2,0 Indiana,2012,18,3,1 Iowa,2004,-3,5,1 Iowa,2008,-15,-8,0 Iowa,2012,-2,-2,0 Kansas,2004,23,3,1 Kansas,2008,21,2,1 Kansas,2012,9,1,1 Kentucky,2004,21,3,1 Kentucky,2008,16,5,1 Kentucky,2012,14,1,1 Louisiana,2004,7,5,1 Louisiana,2008,21,2,1 Louisiana,2012,21,2,1 Maine,2004,-8,-6,0 Maine,2008,-15,-6,0 Maine,2012,-7,-6,0 Maryland,2004,-11,-6,0 Maryland,2008,-29,-1,0 Maryland,2012,-29,-4,0 Massachusetts,2004,-29,-2,0 Massachusetts,2008,-17,-4,0 Massachusetts,2012,-30,-8,0 Michigan,2004,-29,-2,0 Michigan,2008,-11,-11,0 Michigan,2012,-11,-10,0 Minnesota,2004,-1,-7,0 Minnesota,2008,-3,-14,0 Minnesota,2012,-11,-5,0 Mississippi,2004,25,1,1 Mississippi,2008,7,4,1 Mississippi,2012,8,1,1 Missouri,2004,5,8,1 Missouri,2008,0,4,1 Missouri,2012,7,8,1 Montana,2004,21,3,1 Montana,2008,8,4,1 Montana,2012,12,5,1 Nebraska,2004,30,2,1 Nebraska,2008,22,1,1 Nebraska,2012,25,2,1 Nevada,2004,8,9,1 Nevada,2008,0,-9,0 Nevada,2012,-4,-10,0 New Hampshire,2004,-1,-5,0 New Hampshire,2008,-11,-14,0 New Hampshire,2012,-11,-8,0 New Jersey,2004,-12,-8,0 New Jersey,2008,-10,-9,0 New Jersey,2012,-14,-9,0 New Mexico,2004,0,2,1 New Mexico,2008,-7,-6,0 New Mexico,2012,-7,-5,0 New York,2004,-18,-6,0 New York,2008,-33,-5,0 New York,2012,-29,-5,0 North Carolina,2004,8,7,1 North Carolina,2008,1,-5,0 North Carolina,2012,5,3,1 North Dakota,2004,25,2,1 North Dakota,2008,5,0,1 North Dakota,2012,21,4,1 Ohio,2004,2,3,1 Ohio,2008,-2,-16,0 Ohio,2012,-5,-16,0 Oklahoma,2004,30,4,1 Oklahoma,2008,24,2,1 Oklahoma,2012,24,1,1 Oregon,2004,-3,-8,0 Oregon,2008,-19,-9,0 Oregon,2012,-7,-4,0 Pennsylvania,2004,-1,-12,0 Pennsylvania,2008,-9,-19,0 Pennsylvania,2012,0,-13,0 Rhode Island,2004,-13,-2,0 Rhode Island,2008,-29,-1,0 Rhode Island,2012,-17,-2,0 South Carolina,2004,18,4,1 South Carolina,2008,8,5,1 South Carolina,2012,23,1,1 South Dakota,2004,8,4,1 South Dakota,2008,7,4,1 South Dakota,2012,16,1,1 Tennessee,2004,18,7,1 Tennessee,2008,8,5,1 Tennessee,2012,24,1,1 Texas,2004,22,2,1 Texas,2008,7,5,1 Texas,2012,12,4,1 Utah,2004,18,3,1 Utah,2008,30,3,1 Utah,2012,22,1,1 Vermont,2004,-16,-2,0 Vermont,2008,-24,-2,0 Virginia,2004,4,5,1 Virginia,2008,-4,-18,0 Virginia,2012,-2,-4,0 Washington,2004,-4,-10,0 Washington,2008,-16,-6,0 Washington,2012,-14,-8,0 West Virginia,2004,5,6,1 West Virginia,2008,15,11,1 West Virginia,2012,19,1,1 Wisconsin,2004,-5,1,0 Wisconsin,2008,-16,-12,0 Wisconsin,2012,-4,-8,0 Wyoming,2004,30,1,1 Wyoming,2008,21,3,1 
-------------------------------------------------------------------------------- /4-graphs/code/section3.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Section 3 -- Network Metrics 3 | ################################################################## 4 | 5 | # Let's start out by computing some global network metrics. 6 | graph.density(g) 7 | reciprocity(g) 8 | assortativity.degree(g) 9 | 10 | # Now let's look at the distribution of some of the vertex and edge 11 | # metrics. 12 | hist(degree(g)) 13 | head(sort(degree(g), decreasing=TRUE)) 14 | hist(closeness(g)) 15 | head(sort(closeness(g), decreasing=TRUE)) 16 | hist(betweenness(g)) 17 | table(betweenness(g) == 0) 18 | head(sort(betweenness(g), decreasing=TRUE)) 19 | page.rank(g) 20 | hist(page.rank(g)$vector) 21 | head(sort(page.rank(g)$vector, decreasing=TRUE)) 22 | hist(transitivity(g, "local")) 23 | head(sort(transitivity(g, "local"), decreasing=TRUE)) 24 | 25 | # transitivity() doesn't return a named vector, so we'll need to do a bit 26 | # more work to figure out the airports with the largest transitivity. 27 | # sort() returns the largest transitivities, but we will instead use 28 | # order(), which returns the indices of the nodes with the largest 29 | # transitivities. 30 | head(order(transitivity(g, "local"), decreasing=TRUE)) 31 | transitivity(g, "local")[93] 32 | transitivity(g, "local")[265] 33 | 34 | # We can use the indices from order() to look up node names or degrees. 35 | V(g)$name[head(order(transitivity(g, "local"), decreasing=TRUE))] 36 | degree(g)[head(order(transitivity(g, "local"), decreasing=TRUE))] 37 | 38 | # Edge betweenness is one of the most important edge metrics 39 | hist(edge.betweenness(g)) 40 | 41 | # One really common thing to do with vertex or edge metrics is to add 42 | # them to a regression model that predicts some feature of the vertices 43 | # or edges. The igraph network metric functions return vectors containing 44 | # the metric so we can build a data frame with all the metrics we need 45 | # as well as our outcome data that we've stored as vertex and edge 46 | # metadata. 47 | 48 | # We'll try to predict two outcomes for vertices -- the prop. of late 49 | # departures and the taxi out time. We'll include two metrics that capture 50 | # the volume of traffic at the airport -- the total number of flights and 51 | # the degree of the airport in the network. We'll also use closeness 52 | # centrality, which is how close this airport is to all others. We might 53 | # hypothesize that airports with high volume or near the center of the 54 | # network are overloaded and have more delays or that they have invested 55 | # in robust systems/procedures and will have fewer delays. 56 | 57 | # Let's remind ourselves of our vertex attributes 58 | g 59 | 60 | # Now we can build the data frame 61 | metrics <- data.frame(Origin=V(g)$name, 62 | LateDep=V(g)$LateDep, 63 | TaxiOut=V(g)$TaxiOut, 64 | NumFlights=V(g)$NumFlights, 65 | degree=degree(g), 66 | closeness=closeness(g)) 67 | 68 | head(metrics) 69 | 70 | # Now we can build our models; we'll use simple linear regression but 71 | # clearly any regression model you learned in Module 1 could be used. 
72 | summary(lm(LateDep~NumFlights+degree+closeness, data=metrics)) 73 | summary(lm(TaxiOut~NumFlights+degree+closeness, data=metrics)) 74 | -------------------------------------------------------------------------------- /4-graphs/code/section1.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Section 1 -- Data Wrangling to Construct Networks in R 3 | ################################################################## 4 | 5 | # Let's start by loading in our data. This could take a bit of time. 6 | # We use stringsAsFactors=FALSE because it helps us avoid factor 7 | # levels with no data when we subset our data. 8 | dat <- read.csv("On_Time_On_Time_Performance_2014_9.csv", 9 | stringsAsFactors=FALSE) 10 | head(dat) 11 | 12 | # To get the edge information, we'll split into all unique Origin -> Dest 13 | # pairs; using paste() is a convenient way to build a key out of two or 14 | # more variables when splitting data with the split() function. 15 | e.spl <- split(dat, paste(dat$Origin, dat$Dest)) 16 | 17 | # In addition to the origin and destination of an edge, we can store 18 | # the number of flights for this pairing, the proportion of late 19 | # departures and arrivals, and the average taxi out and in times. 20 | # For the "apply" step of our split-apply-combine paradigm 21 | e.spl2 <- lapply(e.spl, function(x) { 22 | data.frame(Origin = x$Origin[1], 23 | Dest = x$Dest[1], 24 | NumFlights = nrow(x), 25 | LateDep = mean(x$DepDel15, na.rm=T), 26 | LateArr = mean(x$ArrDel15, na.rm=T), 27 | TaxiOut = mean(x$TaxiOut, na.rm=T), 28 | TaxiIn = mean(x$TaxiIn, na.rm=T)) 29 | }) 30 | 31 | # As usual, we'll use do.call() with rbind() for the "combine" step. 32 | edges <- do.call(rbind, e.spl2) 33 | 34 | # We can put the whole split-apply-combine into a single line of code when 35 | # computing the vertex information, which limits the number of variables 36 | # we have floating around. 37 | vertices <- do.call(rbind, lapply(split(dat, dat$Origin), function(x) { 38 | data.frame(Origin = x$Origin[1], 39 | NumFlights = nrow(x), 40 | LateDep = mean(x$DepDel15, na.rm=T), 41 | LateArr = mean(x$ArrDel15, na.rm=T), 42 | TaxiOut = mean(x$TaxiOut, na.rm=T), 43 | TaxiIn = mean(x$TaxiIn, na.rm=T)) 44 | })) 45 | 46 | # Let's also load in the locations of the airports by merging with our 47 | # dataset of airport locations, making sure we didn't lose any 48 | # airports in the process of the merge. 49 | airports <- read.csv("../data/airports.csv", stringsAsFactors=FALSE) 50 | head(airports) 51 | dim(vertices) 52 | vertices <- merge(vertices, airports, by.x="Origin", by.y="IATA") 53 | dim(vertices) 54 | 55 | # Now we can construct our graph with graph.data.frame() from igraph. 56 | library(igraph) 57 | g <- graph.data.frame(edges, TRUE, vertices) 58 | 59 | # The first line says we have a directed graph (D) with named vertices (N). 60 | # The attributes list shows all vertex and edge attributes. The first 61 | # entry in ()'s is whether vertex (v) or edge (e) attribute, and the second 62 | # is the type of attribute: character (c) or numeric (n). 
63 | g 64 | 65 | # Easy to access vertex and edge sequences and metadata 66 | head(V(g)) 67 | head(V(g)$Lat) 68 | head(E(g)) 69 | head(E(g)$LateDep) 70 | 71 | # Let's compute some basic properties of the network (more metrics coming 72 | # later in the module) 73 | ecount(g) 74 | vcount(g) 75 | graph.density(g) 76 | -------------------------------------------------------------------------------- /1-intro-R/1-3.R: -------------------------------------------------------------------------------- 1 | # IAP 2015 2 | # 15.S60 Software Tools for Operations Research 3 | # Lecture 1: Introduction to R 4 | 5 | # Script file 1-3.R 6 | # In this script file, we cover CART and random forest 7 | 8 | ################################################ 9 | ## CLASSIFICATION AND REGRESSION TREES (CART) ## 10 | ################################################ 11 | 12 | # First install package rpart and load the library 13 | install.packages("rpart") 14 | library(rpart) 15 | 16 | # Build a CART model 17 | Titanic.CART = rpart(Survived ~ Class + Age + Sex, data = TitanicTrain, method = "class", control = rpart.control(minbucket = 10)) 18 | 19 | # Plot the tree. For all trees, if the conditional at the 20 | # top is true, go to the left. 21 | plot(Titanic.CART) 22 | text(Titanic.CART, pretty = 0) 23 | 24 | # Make predictions on the test set 25 | Titanic.CARTpredTest = predict(Titanic.CART, newdata = TitanicTest, type = "class") 26 | 27 | # Create the confusion matrix 28 | CARTpredTable <- table(TitanicTest$Survived, Titanic.CARTpredTest) 29 | CARTpredTable 30 | 31 | # Calculate accuracy 32 | sum(diag(CARTpredTable))/nrow(TitanicTest) 33 | 34 | 35 | # We can also use CART for continuous outcomes 36 | CEOcomp.CART = rpart(TotalCompensation ~ Years + ChangeStockPrice + ChangeCompanySales + MBA, data = CEOcomp, method = "anova", control = rpart.control(minsplit = 5)) 37 | 38 | # Create a vector of predictions 39 | predict(CEOcomp.CART) 40 | CEOcomp$TotalCompensation 41 | 42 | ################### 43 | ## RANDOM FOREST ## 44 | ################### 45 | 46 | # Install package randomForest and load the library 47 | install.packages("randomForest") 48 | library(randomForest) 49 | 50 | # Build a random forest model for the Titanic dataset 51 | Titanic.forest = randomForest(Survived ~ Class + Age + Sex, data = TitanicTrain, nodesize = 10, ntree = 200) 52 | 53 | # Warning message! - random forest needs to predict a factor 54 | str(TitanicTrain$Survived) 55 | TitanicTrain$Survived <- factor(TitanicTrain$Survived) 56 | TitanicTest$Survived <- factor(TitanicTest$Survived) 57 | 58 | # Let's try again! 59 | Titanic.forest = randomForest(Survived ~ Class + Age + Sex, data = TitanicTrain, nodesize = 10, ntree = 200) 60 | 61 | # Make predictions on the test set 62 | 63 | Titanic.forestPred = predict(Titanic.forest, newdata = TitanicTest) 64 | forest.table <- table(TitanicTest$Survived, Titanic.forestPred) 65 | forest.table 66 | 67 | # Check accuracy 68 | sum(diag(forest.table))/nrow(TitanicTest) 69 | 70 | ################ 71 | ## ASSIGNMENT ## 72 | ################ 73 | 74 | # Let's compare the performance of CART and random 75 | # forest on the LettersBinary dataset 76 | 77 | # 1) Build a CART model on the training data. Set the 78 | # minbucket parameter to 25. Then test it on the 79 | # testing set, create a confusion matrix, and determine 80 | # the accuracy.
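# A minimal solution sketch for 1), hedged rather than official (the course
# ships its answers in solutions.zip). It assumes LettersTrain and LettersTest
# were created by splitting LettersBinary.csv with sample.split() as in the
# 1-2.R assignment; those object names are illustrative, not from the course
# materials. Letter ~ . is shorthand for the letters.formula helper defined
# just below.
Letters.CART = rpart(Letter ~ ., data = LettersTrain, method = "class", control = rpart.control(minbucket = 25))
Letters.CARTpredTest = predict(Letters.CART, newdata = LettersTest, type = "class")
Letters.CARTtable = table(LettersTest$Letter, Letters.CARTpredTest)
Letters.CARTtable
sum(diag(Letters.CARTtable))/nrow(LettersTest)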
81 | 82 | letters.formula <- formula(Letter ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16) 83 | 84 | 85 | 86 | 87 | 88 | # 2) Do the same as above for random forest. Use nodesize 89 | # = 25 and ntree = 200. 90 | 91 | 92 | 93 | 94 | 95 | # EXTRA ASSIGNMENT: 96 | 97 | # *1) Try different ways of controlling the tree growth. Look 98 | # at the rpart.control help page. Try giving your model 99 | # values for cp or maxdepth. 100 | 101 | # *2) Try different values of ntree in your randomForest 102 | # model. Try setting it to a very low number, and a 103 | # very high number. How do the prediction results 104 | # compare? 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | -------------------------------------------------------------------------------- /1-intro-R/1-4.R: -------------------------------------------------------------------------------- 1 | # IAP 2015 2 | # 15.S60 Software Tools for Operations Research 3 | # Lecture 1: Introduction to R 4 | 5 | # Script file 1-4.R 6 | # In this script file, we cover hierarchical and 7 | # k-means clustering 8 | 9 | ############################# 10 | 11 | # R has many built-in datasets -- 12 | # let's take a look at what they have 13 | data() 14 | 15 | # Load the iris set and learn about it 16 | data(iris) 17 | ?iris 18 | str(iris) 19 | 20 | ############################# 21 | ## HIERARCHICAL CLUSTERING ## 22 | ############################# 23 | 24 | # Since species is not a number, we can't 25 | # compute a distance, so we need to exclude 26 | # the last column 27 | IrisDist = dist(iris[1:4], method = "euclidean") 28 | 29 | # Alternative methods include "maximum" and 30 | # "manhattan" (different distance metrics) 31 | 32 | # Compute the hierarchical clusters. We use 33 | # method = "ward.D" to minimize the distance between 34 | # the clusters and the variance within each 35 | # of the clusters 36 | IrisHC = hclust(IrisDist, method = "ward.D") 37 | 38 | # Plot a dendrogram 39 | plot(IrisHC) 40 | 41 | # This diagram will help us decide how many 42 | # clusters are appropriate for this problem. 43 | # The height of the vertical lines represents 44 | # the distance between the points that were 45 | # combined into clusters. The record numbers 46 | # are listed along the bottom (usually hard to 47 | # see). The taller the lines, the more likely 48 | # it is that clusters should be separate. Two 49 | # or three clusters would be appropriate here. 50 | 51 | # Plot rectangles around the clusters to aid 52 | # in visualization 53 | rect.hclust(IrisHC, k = 3, border = "red") 54 | 55 | # Now, split the data into these three clusters 56 | IrisHCGroups = cutree(IrisHC, k = 3) 57 | 58 | # IrisHCGroups is now a vector assigning each 59 | # data point to a cluster 60 | 61 | # Use a table to look at the properties of each 62 | # of the clusters. 63 | table(iris$Species, IrisHCGroups) 64 | tapply(iris$Petal.Length, IrisHCGroups, mean) 65 | 66 | # Using tapply for the means of each of the 67 | # attributes will give us the centroids of the 68 | # clusters. 69 | 70 | ######################## 71 | ## K-MEANS CLUSTERING ## 72 | ######################## 73 | 74 | # K-means clustering requires that we have 75 | # an initial guess as to how many clusters 76 | # there are.
We will initialize it to 3 in this 77 | # case, but if we didn't know, we could always 78 | # try multiple values and experiment 79 | 80 | # Run a k-means cluster with 3 clusters and 81 | # 100 iterations (centroids recomputed and points 82 | # reassigned each time) 83 | IrisKMC = kmeans(iris[1:4], centers = 3, iter.max = 100) 84 | str(IrisKMC) 85 | 86 | # Create a vector with the group numbers 87 | IrisKMCGroups = IrisKMC$cluster 88 | 89 | # Check out the properties of the clusters 90 | # using table 91 | table(iris$Species, IrisKMCGroups) 92 | 93 | # Try improving with more iterations! 94 | IrisKMC = kmeans(iris[1:4], centers = 3, iter.max = 10000) 95 | IrisKMCGroups = IrisKMC$cluster 96 | table(iris$Species, IrisKMCGroups) 97 | 98 | # Look at the locations of the centroids 99 | IrisKMC$centers 100 | 101 | ################ 102 | ## ASSIGNMENT ## 103 | ################ 104 | 105 | # 1a) Cluster the LettersBinary dataset using 106 | # hierarchical clustering. Don't forget to 107 | # leave out the "Letter" attribute when 108 | # computing the distance matrix! (Since 109 | # this dataset is larger, it may take 110 | # a bit longer to compute) 111 | 112 | 113 | 114 | 115 | # b) Plot the dendrogram and use it to 116 | # decide how many clusters to select. 117 | 118 | 119 | 120 | 121 | # c) Make a table comparing the "Letter" 122 | # attribute with the HC assignment 123 | 124 | 125 | 126 | 127 | # 2) Do the same using k-means clustering. 128 | # How well do you think clustering performs 129 | # on this dataset? 130 | 131 | 132 | 133 | # Clustering doesn't seem to do too well here. 134 | 135 | 136 | # EXTRA ASSIGNMENT 137 | 138 | # An additional parameter in the K-Means 139 | # algorithm is the number of random starts to 140 | # use. This is controlled with the parameter 141 | # nstart in the function kmeans. Try different 142 | # values for nstart. Does it improve the 143 | # algorithm? 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | -------------------------------------------------------------------------------- /4-graphs/code/section2.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Section 2 -- Network Visualization 3 | ################################################################## 4 | 5 | # Let's start out by seeing what exactly is returned when we run a 6 | # graph layout algorithm. Of course, we have longitude/latitude information 7 | # for airports, so we're doing this more as an exercise in looking at 8 | # graph layout algorithms. Later in the section we'll lay out nodes based on 9 | # geography. 10 | layout1 <- layout.fruchterman.reingold(g) 11 | dim(layout1) 12 | head(layout1) 13 | 14 | # It's just a set of 2-d points, one for each vertex. We could get a 15 | # higher-dimensional layout with the "dim" parameter. Force-directed 16 | # layouts are typically optimized from a random starting location, so 17 | # we would expect a different layout if we ran it again (this is one of 18 | # the complaints people have with these sorts of layouts). We could use 19 | # set.seed() to ensure the same value for multiple runs of the algorithm. 20 | layout1 <- layout.fruchterman.reingold(g) 21 | head(layout1) 22 | 23 | # We can plot with our selected layout with the plot() function.
24 | plot(g, layout=layout1) # Can cancel with escape key 25 | 26 | # It takes a long time to plot the graph to the R display, so we can 27 | # instead plot it to a file and then open the file. 28 | png("plot1.png") 29 | plot(g, layout=layout1) 30 | dev.off() 31 | 32 | # Most first attempts at plotting a graph look pretty bad. We need to 33 | # do the following: 34 | # 1) Remove the vertex names 35 | # 2) Make the vertices smaller 36 | # 3) Remove the arrowheads (almost all edges will be bidirectional) 37 | # We'll need to look at ?igraph.plotting to figure out how to do this 38 | ?igraph.plotting 39 | png("plot2.png") 40 | plot(g, layout=layout1, vertex.size=3, edge.arrow.mode=0, vertex.label=NA) 41 | dev.off() 42 | 43 | # So far we set all the plotting properties vertex.size, vertex.label, 44 | # and edge.arrow.mode to single values, meaning that value applied for 45 | # all vertices/edges. We can also set values dynamically based on 46 | # vertex/edge metadata, providing one value for each node or edge. 47 | # First, let's use a color gradient based on metadata. We'll make vertices 48 | # darker gray if they have more volume and lighter gray if they have less. 49 | 50 | # colorRamp returns a function that will convert values between 0 51 | # and 1 into colors between our color endpoints. It returns a matrix 52 | # whose three columns are red, green, and blue; we can convert this into 53 | # a vector with the rgb() function. 54 | grad.fxn <- colorRamp(c("lightgray", "black")) 55 | grad.fxn 56 | grad.fxn(c(0, .2, .5, 1)) 57 | rgb(grad.fxn(c(0, .2, .5, 1)), max=255) 58 | color.mat <- grad.fxn(V(g)$NumFlights / max(V(g)$NumFlights)) 59 | head(color.mat) 60 | dim(color.mat) 61 | vertex.colors <- rgb(color.mat, max=255) 62 | head(vertex.colors) 63 | length(vertex.colors) 64 | 65 | png("plot3.png") 66 | plot(g, layout=layout.lgl(g), vertex.size=3, edge.arrow.mode=0, vertex.label=NA, vertex.color=vertex.colors) 67 | dev.off() 68 | 69 | # One difficulty with plotting graphs is the sheer mass of edges. One 70 | # approach would be to remove low-volume edges or diminish their width 71 | # (we'll do this in a bit); another is to change color and transparency to 72 | # draw attention to important edges. Here, we'll make edges red if at least 73 | # 50% of departures on this link are late and transparent light gray 74 | # otherwise. 75 | 76 | # A convenient way to specify colors is with hexadecimal (we just saw this 77 | # when outputting vertex.colors). A standard color would be something like 78 | # #00FF80, which means hexadecimal 00 (0) for red, FF (255) for green, and 79 | # 80 (128) for blue. Because transparency is not specified it is assumed to 80 | # be non-transparent. If we add a pair of hexadecimal digits at the end 81 | # they represent the transparency proportion. #00FF80FF is non-transparent, 82 | # #00FF8080 is partially transparent, and #00FF8000 is fully transparent 83 | # aka invisible. Our light gray color will be #EEEEEE22, which is mostly 84 | # transparent. 85 | 86 | # We now want different colors conditional on the value of E(g)$LateDep. 87 | # This is typically done with the ifelse() function.
88 | head(E(g)$LateDep) 89 | head(ifelse(E(g)$LateDep >= 0.5, "red", "#EEEEEE22")) 90 | edge.colors <- ifelse(E(g)$LateDep >= 0.5, "red", "#EEEEEE22") 91 | table(edge.colors) 92 | png("plot4.png") 93 | plot(g, layout=layout.lgl(g), vertex.size=3, edge.arrow.mode=0, vertex.label=NA, vertex.color=vertex.colors, edge.color=edge.colors) 94 | dev.off() 95 | -------------------------------------------------------------------------------- /4-graphs/code/section4.R: -------------------------------------------------------------------------------- 1 | ################################################################## 2 | # Section 4 -- Network Resilience 3 | ################################################################## 4 | 5 | # Any theories about the behavior of uniform random site percolation and 6 | # targeted site percolation in our network? 7 | 8 | # First we'll compute a random sample of a proportion phi of the nodes 9 | phi <- 0.8 10 | vcount(g) 11 | sample(vcount(g), phi*vcount(g)) 12 | 13 | # We can compute subgraphs of a network in which we only keep the 14 | # indicated nodes and edges connected to them with the induced.subgraph() 15 | # function. 16 | induced.subgraph(g, sample(vcount(g), phi*vcount(g))) 17 | 18 | # We want to compute the size of the biggest cluster, so let's first use 19 | # the clusters() function to get all the cluster memberships. 20 | clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g)))) 21 | 22 | # We can access the "csize" element of the list and compute its maximum 23 | max(clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize) 24 | 25 | # A problem with this is when we delete all the vertices. Then csize will 26 | # be blank, causing a warning with our code. 27 | phi <- 0 28 | max(clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize) 29 | 30 | # Let's fix it by adding 0 to csize. This will make max return 0 when 31 | # there are no vertices and return the maximum component size when there 32 | # are vertices. 33 | max(c(0, clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize)) 34 | phi <- 0.8 35 | max(c(0, clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize)) 36 | 37 | # Because this is random we want to replicate the computation and take 38 | # the average across the replications, which we can do with replicate() 39 | # and mean(). You'll see more sophisticated simulation in the simulation 40 | # module. 41 | reps <- 100 42 | replicate(reps, max(c(0, clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize))) 43 | mean(replicate(reps, max(c(0, clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize)))) 44 | 45 | # Let's normalize by the original size of the graph 46 | mean(replicate(reps, max(c(0, clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize)))) / vcount(g) 47 | 48 | # Finally, let's make a function with our code. 49 | random.site.percolation <- function(g, phi, reps) { 50 | mean(replicate(reps, max(c(0, clusters(induced.subgraph(g, sample(vcount(g), phi*vcount(g))))$csize)))) / vcount(g) 51 | } 52 | 53 | # Now we can build a data frame that contains the size of the giant 54 | # component after random site percolation with different phi values. 55 | # We'll sample a grid from 0 to 1 and use sapply to run for each. 
56 | phis <- seq(0, 1, .01) 57 | rs.perc <- data.frame(phi=phis, perc=sapply(phis, random.site.percolation, g=g, reps=100)) 58 | head(rs.perc) 59 | 60 | # Now we can plot our results along with a line indicating the maximum 61 | # possible size of the giant component, which would be achieved if g were 62 | # a complete graph. 63 | plot(rs.perc) 64 | abline(0, 1) 65 | 66 | # Now we want to model an adversarial situation in which the nodes with 67 | # the highest degree are removed first. We'll do this by taking the degree 68 | # ordering of the nodes in the original graph g and using it throughout, 69 | # though another approach would be to recompute the degrees each time you 70 | # remove the highest-degree node. The first step is to sort the 71 | # nodes in the network by degree using the order() function. This returns 72 | # indices in the vertex list, sorted by degree. 73 | order(degree(g)) 74 | 75 | # This will have ordered in increasing order, so we can check that the 76 | # last few indices are airports we recognize: 77 | degree(g)[20] 78 | degree(g)[221] 79 | 80 | # We want to keep phi proportion of the airports, limiting to the ones 81 | # with smallest degree. We can get this by taking the first phi 82 | # proportion of the ordered vertices 83 | head(order(degree(g)), phi*vcount(g)) 84 | 85 | # As before we can compute the normalized size of the giant component. 86 | # There's no need for replication because we didn't use any random 87 | # selection in the procedure. 88 | max(c(0, clusters(induced.subgraph(g, head(order(degree(g)), phi*vcount(g))))$csize)) / vcount(g) 89 | 90 | # Finally we can create our function that does the targeted percolation. 91 | targeted.site.percolation <- function(g, phi) { 92 | max(c(0, clusters(induced.subgraph(g, head(order(degree(g)), phi*vcount(g))))$csize)) / vcount(g) 93 | } 94 | 95 | # As before we can compute the targeted rates and plot the survivability. 96 | ts.perc <- data.frame(phi=phis, perc=sapply(phis, targeted.site.percolation, g=g)) 97 | head(ts.perc) 98 | 99 | plot(ts.perc) 100 | abline(0, 1) 101 | -------------------------------------------------------------------------------- /6-nonlinear-opt/Nonlinear-JuMP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language": "Julia", 4 | "name": "", 5 | "signature": "sha256:eb67d64c267bde1c8acbbd3549e149ea7a8e8566f70baa255126dc8a172db3b0" 6 | }, 7 | "nbformat": 3, 8 | "nbformat_minor": 0, 9 | "worksheets": [ 10 | { 11 | "cells": [ 12 | { 13 | "cell_type": "heading", 14 | "level": 2, 15 | "metadata": {}, 16 | "source": [ 17 | "Nonlinear Optimization" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "Consider the unconstrained minimization problem\n", 25 | "$$\n", 26 | "\\min_{x > 0} x^2 - \\log(x)\n", 27 | "$$\n", 28 | "The objective function is strictly convex (why?), so from high school calculus we find the minimizer when\n", 29 | "$$\n", 30 | "0 = \\frac{d}{dx} [x^2 - \\log(x)] = 2x - \\frac{1}{x}\n", 31 | "$$\n", 32 | "$$\n", 33 | "\\rightarrow x = \\frac{1}{\\sqrt{2}}\n", 34 | "$$" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | ">**\\[Exercise\\]**: Plot it\n", 42 | "\n", 43 | "> Plot the function $x^2-\\log(x)$, for $x$ between 0 and 3. 
You may use ``Gadfly`` or ``PyPlot``.\n", 44 | "\n", 45 | "> _Be careful not to take the log of zero_" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "#### Let's see how to formulate this problem in JuMP" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "collapsed": false, 58 | "input": [ 59 | "using JuMP\n", 60 | "using Ipopt" 61 | ], 62 | "language": "python", 63 | "metadata": {}, 64 | "outputs": [] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "collapsed": false, 69 | "input": [ 70 | "m = Model()\n", 71 | "@defVar(m, x >= 0, start = 1) # provide an initial starting point, we don't want to start at zero!\n", 72 | "@setNLObjective(m, Min, x^2 - log(x))\n", 73 | "status = solve(m)" 74 | ], 75 | "language": "python", 76 | "metadata": {}, 77 | "outputs": [] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "collapsed": false, 82 | "input": [ 83 | "getValue(x)" 84 | ], 85 | "language": "python", 86 | "metadata": {}, 87 | "outputs": [] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "collapsed": false, 92 | "input": [ 93 | "abs(getValue(x)-1/sqrt(2))" 94 | ], 95 | "language": "python", 96 | "metadata": {}, 97 | "outputs": [] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Pretty accurate!" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### Now for some constrained optimization\n", 111 | "\n", 112 | "We will add the constraint $x \\geq c$. When $c \\leq \\frac{1}{\\sqrt{2}}$, this constraint has no effect. Otherwise the optimal solution is $c$." 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "collapsed": false, 118 | "input": [ 119 | "using Interact" 120 | ], 121 | "language": "python", 122 | "metadata": {}, 123 | "outputs": [] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "collapsed": false, 128 | "input": [ 129 | "@manipulate for c in 0.1:0.01:2.0\n", 130 | " m = Model(solver=IpoptSolver(print_level=0))\n", 131 | " @defVar(m, x >= 0, start = 1)\n", 132 | " @setNLObjective(m, Min, x^2 - log(x))\n", 133 | " @addConstraint(m, x >= c)\n", 134 | " status = solve(m)\n", 135 | " round(getValue(x),2)\n", 136 | "end" 137 | ], 138 | "language": "python", 139 | "metadata": {}, 140 | "outputs": [] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### Differences with linear/quadratic JuMP:\n", 147 | "- Use ``@setNLObjective`` and ``@addNLConstraint`` instead of ``@setObjective`` and ``@addConstraint``.\n", 148 | "- Important to set a starting value for each variable.\n", 149 | "- Different [solvers](http://jump.readthedocs.org/en/release-0.7/installation.html#getting-solvers):\n", 150 | " - [Ipopt](https://github.com/JuliaOpt/Ipopt.jl) is open source, widely used\n", 151 | " - [KNITRO](https://github.com/JuliaOpt/KNITRO.jl) commercial, general nonlinear\n", 152 | " - [Mosek](https://github.com/JuliaOpt/Mosek.jl) commercial, convex problems only\n", 153 | "- Currently working on expanding support for mixed-integer nonlinear (MINLP) solvers" 154 | ] 155 | } 156 | ], 157 | "metadata": {} 158 | } 159 | ] 160 | } -------------------------------------------------------------------------------- /2-intermediate-R/SecondHalf.R: -------------------------------------------------------------------------------- 1 | #make sure your working directory contains flights_condensed.csv 2 | #first we'll read in the flight data 3 | flights = read.csv("flights_condensed.csv") 4 | 5 | #for our purposes, we want to limit 
ourselves to flights between the top 20 airports 6 | #this makes the data set smaller (examples run faster) 7 | top20 = c("ATL","LAX","ORD","DFW","DEN","JFK","SFO","CLT","LAS","PHX","MIA","IAH","EWR","MCO","SEA","MSP","DTW","BOS","PHL","LGA") 8 | flights = subset(flights, Origin %in% top20 & Dest %in% top20) #%in% is like is.element 9 | 10 | ### 11 | #joins 12 | ### 13 | 14 | #We're going to join some location data to the flights data so we can try to see the jet stream 15 | #to do this, we need to know the change in longitude of each flight 16 | #first we load up the airport location data 17 | latlong = read.csv("Airport_Codes_mapped_to_Latitude_Longitude_in_the_United_States.csv",header=TRUE) 18 | longitudes = latlong[,c(1,3)] #we only need longitudes 19 | 20 | #now we'll do the actual join 21 | #in base R, this is done using the merge() function 22 | flights = merge(flights,longitudes,by.x="Origin",by.y="locationID") 23 | #let's take a look at the data frame now 24 | #see that the column we've just merged in is called "Longitude" 25 | #but since we merged on origin, it's really the origin longitude. 26 | #So we rename it: 27 | names(flights)[match("Longitude",names(flights))]="Origin.Long" 28 | #same for destination longitude 29 | flights = merge(flights,longitudes,by.x="Dest",by.y="locationID") 30 | names(flights)[match("Longitude",names(flights))]="Dest.Long" 31 | 32 | #we'll now compute flight speeds and changes in longitude 33 | flights$Speed = flights$Distance / flights$AirTime 34 | summary(flights$Speed) #uhoh 35 | #some flights have no speed (perhaps they never made it off the ground) 36 | flights = subset(flights,AirTime>0) 37 | flights$DiffLong = flights$Dest.Long - flights$Origin.Long 38 | 39 | #can we see the jet stream in action? 40 | plot(flights$DiffLong, flights$Speed,pch=".") 41 | js.effect = cor(flights$DiffLong, flights$Speed) 42 | 43 | ### 44 | #Joins assignment 45 | ### 46 | 47 | # 1) Join airport latitudes to the flight data. What was the largest change in latitude for any flight? 48 | # 2) (optional) Find a flight (may not be unique) which experienced this largest change in latitude. 49 | # Hint: use the order() function to sort a data frame 50 | # 3) (optional) Re-do the jet stream example using latitudes instead of longitudes. 51 | # Is there a relationship between change in latitude and flight speed? 
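# A hedged sketch for question 1, mirroring the longitude example above. It
# assumes the latitude values sit in column 2 of latlong (column 1 is
# locationID and column 3 is Longitude, as used above) and that the merged-in
# column is named "Latitude"; check names(latlong) before relying on this.
latitudes = latlong[,c(1,2)]
flights = merge(flights, latitudes, by.x="Origin", by.y="locationID")
names(flights)[match("Latitude", names(flights))] = "Origin.Lat"
flights = merge(flights, latitudes, by.x="Dest", by.y="locationID")
names(flights)[match("Latitude", names(flights))] = "Dest.Lat"
flights$DiffLat = flights$Dest.Lat - flights$Origin.Lat
max(abs(flights$DiffLat))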
52 | 53 | ### 54 | #Joins with split-apply-combine 55 | ### 56 | 57 | #Here we do a more complicated joins example 58 | #the join is on multiple columns 59 | #and the analysis uses split-apply-combine 60 | #our goal is to find, for each airport, the average weather delay per .1 mm precipitation 61 | 62 | #read data to be joined 63 | weather = read.csv("prcp_pretty.csv") 64 | 65 | #merge in precipitation data 66 | #rows must match on day of month AND airport 67 | flights = merge(flights,weather,by.x=c("Origin","DayofMonth"),by.y=c("Airport","DayOfMonth")) 68 | 69 | #for this analysis, we only want entries with a number for weather delay (no NA) 70 | #we also limit to days with precipitation 71 | flights.rain = subset(flights, !is.na(WeatherDelay) & prcp>0) 72 | flights.rain$DelayRatio = flights.rain$WeatherDelay / flights.rain$prcp 73 | 74 | #split apply combine to find average weather delay per .1 mm of precipitation 75 | #first we split 76 | #we must discard unused factors or we will get empty data frames for airports not in the top 20 77 | flights.rain$Origin = factor(flights.rain$Origin) 78 | flights.rain.split = split(flights.rain,flights.rain$Origin) 79 | 80 | #define a function 81 | process.airport = function(df){ 82 | airport.name = df$Origin[1] 83 | avg.ratio = mean(df$DelayRatio) 84 | return(data.frame(Airport=airport.name, Avg.delay.ratio=avg.ratio)) 85 | } 86 | 87 | flights.rain.split = lapply(flights.rain.split,process.airport) 88 | airport.info = do.call(rbind,flights.rain.split) 89 | 90 | #let's order the resulting data frame 91 | airport.info = airport.info[order(airport.info$Avg.delay.ratio),] 92 | 93 | 94 | ### 95 | #Second joins assignment 96 | ### 97 | #Is there a relationship between airport latitude and average delay ratio? 98 | 99 | 100 | 101 | 102 | ### 103 | #sqldf 104 | ### 105 | #side by side examples of: 106 | #subsetting 107 | flights.bos = subset(flights, Dest=="BOS") 108 | flights.bos = sqldf("select * from flights where Dest='BOS'") 109 | #subset and keep only selected columns 110 | flights.fast = subset(flights, Speed>mean(flights$Speed))[,c("Origin","Dest")] 111 | flights.fast = sqldf("select Origin, Dest from flights where Speed>(select avg(Speed) from flights)") 112 | #inner join - note differences in columns returned 113 | A = airport.info[,1:2] #discard location data 114 | airport.info = sqldf("select * from A inner join latlong where A.Airport = latlong.locationID") 115 | airport.info = merge(A,latlong,by.x="Airport",by.y="locationID") 116 | -------------------------------------------------------------------------------- /1-intro-R/1-2.R: -------------------------------------------------------------------------------- 1 | # IAP 2015 2 | # 15.S60 Software Tools for Operations Research 3 | # Lecture 1: Introduction to R 4 | 5 | # Script file 1-2.R 6 | # In this script file, we cover linear regression 7 | # and logistic regression. 8 | 9 | ####################### 10 | ## LINEAR REGRESSION ## 11 | ####################### 12 | 13 | # Load CEOcomp dataset if you haven't already 14 | CEOcomp = read.csv(file = "CEOcomp.csv", header = TRUE) 15 | 16 | # Use lm to create a linear regression model 17 | CEO.linReg <- lm(TotalCompensation ~ Years + ChangeStockPrice + ChangeCompanySales + MBA, data = CEOcomp) 18 | 19 | # First argument is the formula, second argument 20 | # is the data.
Notice that you don't need $ here 21 | # since we are specifying the dataset in the function call 22 | 23 | # Use summary to take a look at the model 24 | summary(CEO.linReg) 25 | 26 | # Which variables are significant predictors of 27 | # TotalCompensation at the p = .05 level? 28 | 29 | # Check out some other useful outputs of a 30 | # linear regression 31 | CEO.linReg$coefficients 32 | CEO.linReg$residuals 33 | confint(CEO.linReg, level = 0.95) 34 | 35 | # We can also compute correlation between variables 36 | cor(CEOcomp$TotalCompensation, CEOcomp$Years) 37 | 38 | # Or create a correlation table (note: all columns 39 | # must be numeric to compute correlation of 40 | # the entire dataset) 41 | cor(CEOcomp) 42 | 43 | # We can also get more data on pairwise correlation: 44 | cor.test(CEOcomp$TotalCompensation, CEOcomp$Years) 45 | 46 | ################################################ 47 | ## SPLITTING DATA INTO TRAINING AND TEST SETS ## 48 | ################################################ 49 | 50 | # Load the dataset of interest 51 | TitanicPassengers = read.csv("TitanicPassengers.csv") 52 | str(TitanicPassengers) 53 | 54 | # We first need to install a package to help 55 | # us split the data. Note that this only 56 | # needs to be done once per machine! 57 | install.packages("caTools") 58 | 59 | # Now load the library. This needs to be done 60 | # every time you wish to use the library. 61 | library(caTools) 62 | 63 | 64 | # Now split the dataset into training and testing 65 | split <- sample.split(TitanicPassengers$Survived, SplitRatio = 0.6) 66 | TitanicTrain <- TitanicPassengers[split, ] 67 | TitanicTest <- TitanicPassengers[!split, ] 68 | 69 | ######################### 70 | ## LOGISTIC REGRESSION ## 71 | ######################### 72 | 73 | # Run a logistic regression using a generalized linear model 74 | Titanic.logReg = glm(Survived ~ Class + Age + Sex, data = TitanicTrain, family = binomial) 75 | summary(Titanic.logReg) 76 | 77 | # Compute predicted probabilities on training data 78 | Titanic.logPred = predict(Titanic.logReg, type = "response") 79 | 80 | # Build a classification table to check accuracy on 81 | # training set. Note that due to randomness of split, 82 | # classification matrices may be slightly different 83 | table(TitanicTrain$Survived, round(Titanic.logPred)) 84 | 85 | # We now do the same for the test set 86 | Titanic.logPredTest = predict(Titanic.logReg, newdata = TitanicTest, type = "response") 87 | test.table <- table(TitanicTest$Survived, round(Titanic.logPredTest)) 88 | test.table 89 | 90 | # Compute percentage correct (overall accuracy) 91 | sum(diag(test.table))/nrow(TitanicTest) 92 | 93 | ################ 94 | ## ASSIGNMENT ## 95 | ################ 96 | 97 | # 1a) Load the dataset LettersBinary.csv and check its structure. 98 | 99 | 100 | 101 | # Doesn't make much sense, huh? Each observation 102 | # in this dataset is a capital letter H or R, in one 103 | # of a variety of fonts, and distorted in various 104 | # ways. The attributes x1 ... x16 are all properties 105 | # of the resultant transformation. In this 106 | # assignment, we wish to see if these attributes 107 | # can be useful predictors of what the original 108 | # letter was. 109 | 110 | # b) Split the dataset into training and test sets 111 | # such that the training set is comprised of 60% 112 | # of the original data. 113 | 114 | 115 | 116 | 117 | # c) Build a logistic regression model to predict 118 | # the letter based on the attributes.
Then create a 119 | # classification matrix and determine the 120 | # accuracy of the model on the test set. 121 | 122 | letters.formula <- formula(Letter ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16) 123 | 124 | 125 | # You can use letters.formula in place of 126 | # typing the formula 127 | 128 | 129 | 130 | 131 | 132 | # EXTRA ASSIGNMENT: For linear regression, there are 133 | # several tests that should be done to make sure a model 134 | # is valid. We already did one of them (computed the 135 | # correlations). Here we will go through the others. 136 | 137 | # *1) Plot the residuals to see if they are normally 138 | # distributed (testing normality of the error 139 | # distribution): 140 | 141 | 142 | # *2) Plot the observed vs. predicted values to see if 143 | # they are symmetrically distributed around a diagonal 144 | # line (testing the linear relationship between the 145 | # dependent and independent variables) 146 | 147 | 148 | # *3) Plot the residuals as a function of the predicted 149 | # values (testing for heteroscedasticity) 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | -------------------------------------------------------------------------------- /5-simulation/distributed.jl: -------------------------------------------------------------------------------- 1 | # Start up multiple processors 2 | # addprocs(8) 3 | 4 | # can also start up via command line, i.e., julia -p 4 5 | 6 | # View the running worker processors 7 | workers() 8 | 9 | # Run a simple job on a worker 10 | ref = @spawn rand() 11 | 12 | # ref contains a reference to the data: 13 | # -- ref.where contains proc id of where the data is stored 14 | # -- ref.whence contains the master proc's id 15 | # -- ref.id is a unique ID 16 | 17 | # To see the result locally, run fetch: 18 | fetch(ref) 19 | 20 | # If we want to specify the proc the code runs on 21 | ref = @spawnat 3 rand() 22 | 23 | # Suppose we define our own function 24 | function estimatePi(n) 25 | count = 0; 26 | for i in 1:n 27 | if rand()^2 + rand()^2 < 1 28 | count += 1 29 | end 30 | end 31 | return count 32 | end 33 | 34 | 35 | # Works fine locally 36 | n = 1000 37 | piEst = 4 * estimatePi(n)/n 38 | println("Pi is approximately $piEst") 39 | 40 | # What happens here? 41 | # @spawnat 2 estimatePi(1000) 42 | 43 | # To run code on all workers, use @everywhere 44 | @everywhere function estimatePi(n) 45 | count = 0; 46 | for i in 1:n 47 | if rand()^2 + rand()^2 < 1 48 | count += 1 49 | end 50 | end 51 | return count 52 | end 53 | 54 | # Now it works 55 | n = 1000 56 | piEst = 4/n * remotecall_fetch(2,estimatePi,n) # spawn f on proc 2 and fetch results 57 | 58 | # Assignment: Write a function that runs the simulation in bank_11.jl and returns how long it
Run this function on a different core 60 | 61 | # Hint: After the simulation is run, sim.time contains the time of the last scheduled event 62 | 63 | 64 | 65 | 66 | 67 | # Using all cores 68 | 69 | # We want each processor to run some simulations, and then return its results 70 | 71 | # Could do it manually: 72 | 73 | nCpus = length(workers()) 74 | totalSims = 8 * 10^7 75 | sims_per_cpu = div(totalSims,nCpus) # integer arithmetic 76 | 77 | results = cell(nCpus) 78 | for i in 1:nCpus 79 | results[i] = @spawnat i estimatePi(sims_per_cpu) 80 | end 81 | for i in 1:length(results) 82 | results[i] = fetch(results[i]); 83 | end 84 | total = 0 85 | for i in 1:nCpus 86 | total += results[i] 87 | end 88 | piEst = 4 * total / totalSims 89 | println("Pi is approximately $piEst") 90 | 91 | # Julia also has a built-in method to help us 92 | help("map") # like apply in R 93 | 94 | input = sims_per_cpu * ones(nCpus) 95 | results = map(estimatePi,input) 96 | 97 | help("pmap") 98 | results = pmap(estimatePi,input) 99 | 100 | total = 0 101 | for i in 1:nCpus 102 | total += results[i] 103 | end 104 | piEst = 4 * total / totalSims 105 | println("Pi is approximately $piEst") 106 | 107 | function benchmark(n) 108 | input = n * ones(nCpus) 109 | @time map(estimatePi,input) 110 | @time pmap(estimatePi,input) 111 | return 112 | end 113 | 114 | benchmark(10) 115 | benchmark(10^2) 116 | benchmark(10^7) 117 | 118 | # Assignment 2: Use pmap to run bank_11.jl in parallel to estimate the mean time to process all the customers 119 | # Hint: define a function that takes the random seed as the input and returns the time 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | # When doesn't pmap scale well? 129 | 130 | # pmap must send input to each proc, and get output <- lots of communication 131 | # Okay if few, big jobs, but not good for many small jobs 132 | 133 | # Solution: MapReduce 134 | # Send "batches" to each proc (Map) 135 | # Each proc runs batch and creates batch summary (reduce) 136 | # Each proc returns batch summary 137 | # Master proc compiles summary files from batch summaries (reduce) 138 | 139 | # Syntax: 140 | # @parallel [reducer] for ... 141 | # [code] 142 | # end 143 | 144 | # Adds a whole bunch of random numbers together 145 | @parallel (+) for i in 1:10^8 146 | rand() 147 | end 148 | 149 | # Easier than our pmap example above 150 | count = @parallel (+) for i in 1:totalSims 151 | estimatePi(1) 152 | end 153 | 154 | piEst = 4 * count / totalSims 155 | println("Pi is approximately $piEst") 156 | 157 | ## How to write a custom reducer 158 | 159 | # Reducer takes in two arguments of the same type, and returns that same type 160 | # e.g. the "+" method takes in two real numbers and returns a real number 161 | 162 | # Suppose we want to numerically estimate the mean and standard error of our distribution 163 | # E[x] = 1/n sum x_i 164 | # E[x^2] = 1/n sum x_i^2 165 | # var[x] = E[x^2] - E^2[x] 166 | 167 | @everywhere type Results 168 | estimate 169 | estimateSq 170 | end 171 | @everywhere Results(x) = Results(x,x^2) 172 | 173 | # And then modify our simulate function to return Results... 174 | @everywhere function runSims(n) 175 | count = estimatePi(n) 176 | piEst = 4 * count / n 177 | return Results(piEst) 178 | end 179 | 180 | # And now we can write our reducer 181 | 182 | @everywhere function myReduce(a::Results,b::Results) 183 | return Results(a.estimate + b.estimate, 184 | a.estimateSq + b.estimateSq) 185 | end 186 | 187 | # Now we can do our MapReduce!
188 | n = 10^3 189 | results = @parallel myReduce for i in 1:n 190 | runSims(1000) 191 | end 192 | 193 | # And now to process our results 194 | function process(results::Results, n) 195 | mean = results.estimate / n 196 | stdev = sqrt(results.estimateSq / n - mean^2) 197 | 198 | println("Grand mean: $mean") 199 | println("Std Error: $stdev") 200 | end 201 | 202 | process(results,n) 203 | 204 | # Assignment: Write your own MapReduce implementation to calculate the mean and standard deviation 205 | # of the time to process all the customers in bank_11.jl 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | -------------------------------------------------------------------------------- /8-project/Historical_Route.csv: -------------------------------------------------------------------------------- 1 | FLIGHT_ID,FLIGHT_NUMBER,ORIGIN,DESTINATION,TAIL_NUMBER,SCH_DEP,SCH_ARR 2 | 4,AS583,LAX,SFO,N703AS,19461060,19461143 3 | 49,AS391,SEA,GEG,N705AS,19461360,19461419 4 | 63,AS366,GEG,SEA,N705AS,19461460,19461521 5 | 123,AS141,SEA,FAI,N705AS,19461887,19461939 6 | 142,AS140,FAI,ANC,N705AS,19462000,19462053 7 | 18,AS423,SNA,SEA,N708AS,19461045,19461234 8 | 44,AS334,SEA,OAK,N708AS,19461274,19461400 9 | 66,AS357,OAK,SEA,N708AS,19461436,19461555 10 | 107,AS424,SEA,SNA,N708AS,19461666,19461828 11 | 138,AS497,SNA,SEA,N708AS,19461863,19462034 12 | 31,AS550,SEA,SAN,N713AS,19461180,19461343 13 | 67,AS575,SAN,SEA,N713AS,19461383,19461559 14 | 5,AS362,PDX,SJC,N754AS,19461040,19461152 15 | 25,AS313,SJC,PDX,N754AS,19461190,19461297 16 | 56,AS716,PDX,PHX,N754AS,19461337,19461484 17 | 91,AS707,PHX,PDX,N754AS,19461542,19461714 18 | 115,AS310,PDX,SFO,N754AS,19461757,19461867 19 | 139,AS325,SFO,SEA,N754AS,19461916,19462039 20 | 24,AS723,PHX,SEA,N755AS,19461083,19461271 21 | 53,AS630,SEA,LAS,N755AS,19461312,19461460 22 | 16,AS081,SEA,ANC,N760AS,19461000,19461226 23 | 30,AS081,ANC,FAI,N760AS,19461280,19461340 24 | 51,AS082,FAI,ANC,N760AS,19461374,19461440 25 | 89,AS082,ANC,SEA,N760AS,19461484,19461697 26 | 109,AS498,SEA,SFO,N760AS,19461721,19461851 27 | 128,AS498,SFO,PSP,N760AS,19461888,19461970 28 | 37,AS152,ANC,OME,N762AS,19461259,19461359 29 | 52,AS152,OME,OTZ,N762AS,19461399,19461441 30 | 70,AS152,OTZ,ANC,N762AS,19461481,19461573 31 | 85,AS032,ANC,ADQ,N762AS,19461618,19461678 32 | 101,AS033,ADQ,ANC,N762AS,19461718,19461772 33 | 119,AS045,ANC,BET,N762AS,19461820,19461896 34 | 133,AS046,BET,ANC,N762AS,19461936,19462003 35 | 14,AS143,ANC,FAI,N763AS,19461150,19461206 36 | 29,AS143,FAI,BRW,N763AS,19461246,19461330 37 | 50,AS143,BRW,SCC,N763AS,19461375,19461430 38 | 69,AS143,SCC,ANC,N763AS,19461470,19461572 39 | 93,AS146,ANC,SCC,N763AS,19461618,19461721 40 | 106,AS146,SCC,BRW,N763AS,19461762,19461814 41 | 125,AS146,BRW,FAI,N763AS,19461860,19461940 42 | 137,AS146,FAI,ANC,N763AS,19461980,19462033 43 | 1,AS073,SIT,JNU,N764AS,19461060,19461104 44 | 20,AS073,JNU,ANC,N764AS,19461144,19461251 45 | 88,AS384,ANC,SFO,N764AS,19461571,19461695 46 | 140,AS710,SFO,PHX,N764AS,19461895,19462042 47 | 10,AS236,PDX,SFO,N767AS,19461045,19461159 48 | 134,AS324,SFO,SJC,N767AS,19461883,19462009 49 | 3,AS041,ANC,BET,N768AS,19461065,19461143 50 | 21,AS042,BET,ANC,N768AS,19461184,19461257 51 | 9,AS531,SJC,SEA,N769AS,19461030,19461158 52 | 47,AS496,SEA,SNA,N769AS,19461248,19461408 53 | 74,AS421,SNA,PDX,N769AS,19461444,19461592 54 | 100,AS352,PDX,SNA,N769AS,19461630,19461768 55 | 117,AS321,SNA,OAK,N769AS,19461805,19461888 56 | 141,AS321,OAK,SEA,N769AS,19461925,19462043 57 | 33,AS812,SEA,DFW,N771AS,19461120,19461348 58 |
86,AS813,DFW,SEA,N771AS,19461403,19461678 59 | 112,AS470,SEA,OAK,N771AS,19461739,19461861 60 | 129,AS470,OAK,SNA,N771AS,19461896,19461980 61 | 81,AS363,SEA,GEG,N772AS,19461600,19461655 62 | 97,AS361,GEG,SEA,N772AS,19461695,19461761 63 | 127,AS682,SEA,LAS,N772AS,19461815,19461956 64 | 146,AS693,LAS,SEA,N772AS,19461998,19462152 65 | 6,AS399,PSP,SFO,N773AS,19461060,19461153 66 | 28,AS399,SFO,SEA,N773AS,19461188,19461319 67 | 8,AS642,SEA,LAS,N774AS,19461010,19461158 68 | 36,AS663,LAS,SEA,N774AS,19461195,19461356 69 | 64,AS428,SEA,SJC,N774AS,19461405,19461535 70 | 83,AS379,SJC,PDX,N774AS,19461570,19461673 71 | 105,AS368,PDX,SMF,N774AS,19461710,19461798 72 | 122,AS0389,SMF,PDX,N774AS,19461833,19461925 73 | 17,AS349,LGB,SEA,N775AS,19461045,19461230 74 | 121,AS069,SEA,KTN,N775AS,19461787,19461912 75 | 135,AS069,KTN,JNU,N775AS,19461952,19462012 76 | 22,AS065,SEA,KTN,N776AS,19461127,19461260 77 | 32,AS065,KTN,WRG,N776AS,19461300,19461347 78 | 48,AS065,WRG,PSG,N776AS,19461382,19461411 79 | 59,AS065,PSG,JNU,N776AS,19461452,19461495 80 | 80,AS065,JNU,ANC,N776AS,19461540,19461645 81 | 120,AS096,ANC,SEA,N776AS,19461695,19461911 82 | 145,AS376,SEA,GEG,N776AS,19462020,19462079 83 | 65,AS598,SEA,LAX,N778AS,19461390,19461550 84 | 95,AS485,LAX,SEA,N778AS,19461585,19461746 85 | 132,AS720,SEA,PHX,N778AS,19461832,19461994 86 | 79,AS494,SEA,OAK,N779AS,19461507,19461631 87 | 104,AS359,OAK,SEA,N779AS,19461671,19461791 88 | 11,AS529,LAX,SEA,N780AS,19461015,19461189 89 | 38,AS402,SEA,BOI,N780AS,19461290,19461369 90 | 58,AS381,BOI,SEA,N780AS,19461404,19461495 91 | 87,AS067,SEA,KTN,N780AS,19461563,19461687 92 | 102,AS067,KTN,SIT,N780AS,19461727,19461784 93 | 114,AS067,SIT,JNU,N780AS,19461824,19461866 94 | 136,AS067,JNU,ANC,N780AS,19461907,19462013 95 | 35,AS097,SEA,ANC,N783AS,19461120,19461353 96 | 61,AS064,ANC,JNU,N783AS,19461403,19461510 97 | 75,AS064,JNU,PSG,N783AS,19461550,19461596 98 | 82,AS064,PSG,WRG,N783AS,19461636,19461663 99 | 94,AS064,WRG,KTN,N783AS,19461700,19461738 100 | 118,AS064,KTN,SEA,N783AS,19461778,19461893 101 | 43,AS690,SFO,PSP,N785AS,19461300,19461383 102 | 62,AS543,PSP,SFO,N785AS,19461423,19461518 103 | 84,AS543,SFO,SEA,N785AS,19461553,19461676 104 | 116,AS374,SEA,ONT,N785AS,19461730,19461880 105 | 144,AS445,ONT,SEA,N785AS,19461922,19462079 106 | 23,AS526,SEA,PSP,N786AS,19461110,19461266 107 | 55,AS447,PSP,SEA,N786AS,19461307,19461473 108 | 15,AS719,PHX,PDX,N788AS,19461035,19461211 109 | 42,AS606,PDX,LAS,N788AS,19461251,19461382 110 | 68,AS697,LAS,PDX,N788AS,19461420,19461565 111 | 90,AS418,PDX,SJC,N788AS,19461607,19461714 112 | 111,AS301,SJC,PDX,N788AS,19461750,19461855 113 | 60,AS354,SEA,SFO,N791AS,19461379,19461507 114 | 78,AS354,SFO,PSP,N791AS,19461542,19461625 115 | 96,AS685,PSP,SFO,N791AS,19461665,19461760 116 | 39,AS411,LAX,PDX,N792AS,19461230,19461374 117 | 72,AS462,PDX,LAX,N792AS,19461444,19461581 118 | 98,AS409,LAX,PDX,N792AS,19461616,19461761 119 | 130,AS608,PDX,LAS,N792AS,19461858,19461985 120 | 147,AS695,LAS,PDX,N792AS,19462020,19462159 121 | 2,AS062,FAI,ANC,N793AS,19461060,19461130 122 | 26,AS062,ANC,JNU,N793AS,19461190,19461298 123 | 41,AS062,JNU,SIT,N793AS,19461338,19461382 124 | 54,AS062,SIT,KTN,N793AS,19461422,19461471 125 | 77,AS062,KTN,SEA,N793AS,19461511,19461621 126 | 113,AS468,SEA,LAX,N793AS,19461704,19461863 127 | 7,AS151,ANC,OTZ,N794AS,19461060,19461157 128 | 19,AS151,OTZ,OME,N794AS,19461197,19461242 129 | 40,AS151,OME,ANC,N794AS,19461282,19461375 130 | 76,AS043,ANC,BET,N794AS,19461522,19461601 131 | 92,AS044,BET,ANC,N794AS,19461642,19461715 132 | 
110,AS153,ANC,OTZ,N794AS,19461760,19461854 133 | 124,AS153,OTZ,OME,N794AS,19461894,19461939 134 | 143,AS153,OME,ANC,N794AS,19461980,19462066 135 | 108,AS115,SEA,ANC,N795AS,19461616,19461844 136 | 131,AS070,ANC,JNU,N795AS,19461890,19461991 137 | 13,AS060,JNU,KTN,N796AS,19461132,19461195 138 | 34,AS060,KTN,SEA,N796AS,19461235,19461353 139 | 99,AS431,SEA,SNA,N796AS,19461600,19461767 140 | 27,AS061,SEA,JNU,N797AS,19461148,19461308 141 | 45,AS061,JNU,YAK,N797AS,19461348,19461407 142 | 57,AS061,YAK,CDV,N797AS,19461443,19461493 143 | 71,AS061,CDV,ANC,N797AS,19461528,19461576 144 | 126,AS128,ANC,FAI,N797AS,19461896,19461948 145 | 12,AS766,PDX,PHX,N799AS,19461030,19461190 146 | 46,AS751,PHX,PDX,N799AS,19461228,19461407 147 | 73,AS602,PDX,LAS,N799AS,19461457,19461584 148 | 103,AS0641,LAS,SEA,N799AS,19461621,19461785 149 | -------------------------------------------------------------------------------- /8-project/Flight_Alaska.csv: -------------------------------------------------------------------------------- 1 | FLIGHT_ID,FLIGHT_NUMBER,ORIGIN,DESTINATION,TAIL_NUMBER,SCH_DEP,SCH_ARR,ARRIVAL_DELAY 2 | 1,AS073,SIT,JNU,N764AS,19461060,19461104,0 3 | 2,AS062,FAI,ANC,N793AS,19461060,19461130,0 4 | 3,AS041,ANC,BET,N768AS,19461065,19461143,0 5 | 4,AS583,LAX,SFO,N703AS,19461060,19461143,16 6 | 5,AS362,PDX,SJC,N754AS,19461040,19461152,0 7 | 6,AS399,PSP,SFO,N773AS,19461060,19461153,143 8 | 7,AS151,ANC,OTZ,N794AS,19461060,19461157,0 9 | 8,AS642,SEA,LAS,N774AS,19461010,19461158,0 10 | 9,AS531,SJC,SEA,N769AS,19461030,19461158,0 11 | 10,AS236,PDX,SFO,N767AS,19461045,19461159,3 12 | 11,AS529,LAX,SEA,N780AS,19461015,19461189,4 13 | 12,AS766,PDX,PHX,N799AS,19461030,19461190,0 14 | 13,AS060,JNU,KTN,N796AS,19461132,19461195,0 15 | 14,AS143,ANC,FAI,N763AS,19461150,19461206,0 16 | 15,AS719,PHX,PDX,N788AS,19461035,19461211,0 17 | 16,AS081,SEA,ANC,N760AS,19461000,19461226,123 18 | 17,AS349,LGB,SEA,N775AS,19461045,19461230,0 19 | 18,AS423,SNA,SEA,N708AS,19461045,19461234,0 20 | 19,AS151,OTZ,OME,N794AS,19461197,19461242,0 21 | 20,AS073,JNU,ANC,N764AS,19461144,19461251,231 22 | 21,AS042,BET,ANC,N768AS,19461184,19461257,0 23 | 22,AS065,SEA,KTN,N776AS,19461127,19461260,0 24 | 23,AS526,SEA,PSP,N786AS,19461110,19461266,0 25 | 24,AS723,PHX,SEA,N755AS,19461083,19461271,6 26 | 25,AS313,SJC,PDX,N754AS,19461190,19461297,0 27 | 26,AS062,ANC,JNU,N793AS,19461190,19461298,0 28 | 27,AS061,SEA,JNU,N797AS,19461148,19461308,0 29 | 28,AS399,SFO,SEA,N773AS,19461188,19461319,0 30 | 29,AS143,FAI,BRW,N763AS,19461246,19461330,0 31 | 30,AS081,ANC,FAI,N760AS,19461280,19461340,0 32 | 31,AS550,SEA,SAN,N713AS,19461180,19461343,0 33 | 32,AS065,KTN,WRG,N776AS,19461300,19461347,54 34 | 33,AS812,SEA,DFW,N771AS,19461120,19461348,0 35 | 34,AS060,KTN,SEA,N796AS,19461235,19461353,0 36 | 35,AS097,SEA,ANC,N783AS,19461120,19461353,0 37 | 36,AS663,LAS,SEA,N774AS,19461195,19461356,33 38 | 37,AS152,ANC,OME,N762AS,19461259,19461359,0 39 | 38,AS402,SEA,BOI,N780AS,19461290,19461369,13 40 | 39,AS411,LAX,PDX,N792AS,19461230,19461374,25 41 | 40,AS151,OME,ANC,N794AS,19461282,19461375,0 42 | 41,AS062,JNU,SIT,N793AS,19461338,19461382,9 43 | 42,AS606,PDX,LAS,N788AS,19461251,19461382,3 44 | 43,AS690,SFO,PSP,N785AS,19461300,19461383,0 45 | 44,AS334,SEA,OAK,N708AS,19461274,19461400,0 46 | 45,AS061,JNU,YAK,N797AS,19461348,19461407,5 47 | 46,AS751,PHX,PDX,N799AS,19461228,19461407,0 48 | 47,AS496,SEA,SNA,N769AS,19461248,19461408,0 49 | 48,AS065,WRG,PSG,N776AS,19461382,19461411,32 50 | 49,AS391,SEA,GEG,N705AS,19461360,19461419,6 51 | 50,AS143,BRW,SCC,N763AS,19461375,19461430,0 52 | 
51,AS082,FAI,ANC,N760AS,19461374,19461440,0 53 | 52,AS152,OME,OTZ,N762AS,19461399,19461441,0 54 | 53,AS630,SEA,LAS,N755AS,19461312,19461460,3 55 | 54,AS062,SIT,KTN,N793AS,19461422,19461471,0 56 | 55,AS447,PSP,SEA,N786AS,19461307,19461473,0 57 | 56,AS716,PDX,PHX,N754AS,19461337,19461484,0 58 | 57,AS061,YAK,CDV,N797AS,19461443,19461493,0 59 | 58,AS381,BOI,SEA,N780AS,19461404,19461495,23 60 | 59,AS065,PSG,JNU,N776AS,19461452,19461495,26 61 | 60,AS354,SEA,SFO,N791AS,19461379,19461507,0 62 | 61,AS064,ANC,JNU,N783AS,19461403,19461510,0 63 | 62,AS543,PSP,SFO,N785AS,19461423,19461518,0 64 | 63,AS366,GEG,SEA,N705AS,19461460,19461521,9 65 | 64,AS428,SEA,SJC,N774AS,19461405,19461535,27 66 | 65,AS598,SEA,LAX,N778AS,19461390,19461550,48 67 | 66,AS357,OAK,SEA,N708AS,19461436,19461555,0 68 | 67,AS575,SAN,SEA,N713AS,19461383,19461559,1 69 | 68,AS697,LAS,PDX,N788AS,19461420,19461565,10 70 | 69,AS143,SCC,ANC,N763AS,19461470,19461572,0 71 | 70,AS152,OTZ,ANC,N762AS,19461481,19461573,0 72 | 71,AS061,CDV,ANC,N797AS,19461528,19461576,0 73 | 72,AS462,PDX,LAX,N792AS,19461444,19461581,3 74 | 73,AS602,PDX,LAS,N799AS,19461457,19461584,0 75 | 74,AS421,SNA,PDX,N769AS,19461444,19461592,0 76 | 75,AS064,JNU,PSG,N783AS,19461550,19461596,16 77 | 76,AS043,ANC,BET,N794AS,19461522,19461601,0 78 | 77,AS062,KTN,SEA,N793AS,19461511,19461621,0 79 | 78,AS354,SFO,PSP,N791AS,19461542,19461625,0 80 | 79,AS494,SEA,OAK,N779AS,19461507,19461631,45 81 | 80,AS065,JNU,ANC,N776AS,19461540,19461645,20 82 | 81,AS363,SEA,GEG,N772AS,19461600,19461655,18 83 | 82,AS064,PSG,WRG,N783AS,19461636,19461663,0 84 | 83,AS379,SJC,PDX,N774AS,19461570,19461673,53 85 | 84,AS543,SFO,SEA,N785AS,19461553,19461676,0 86 | 85,AS032,ANC,ADQ,N762AS,19461618,19461678,17 87 | 86,AS813,DFW,SEA,N771AS,19461403,19461678,0 88 | 87,AS067,SEA,KTN,N780AS,19461563,19461687,23 89 | 88,AS384,ANC,SFO,N764AS,19461571,19461695,0 90 | 89,AS082,ANC,SEA,N760AS,19461484,19461697,6 91 | 90,AS418,PDX,SJC,N788AS,19461607,19461714,0 92 | 91,AS707,PHX,PDX,N754AS,19461542,19461714,1 93 | 92,AS044,BET,ANC,N794AS,19461642,19461715,0 94 | 93,AS146,ANC,SCC,N763AS,19461618,19461721,3 95 | 94,AS064,WRG,KTN,N783AS,19461700,19461738,0 96 | 95,AS485,LAX,SEA,N778AS,19461585,19461746,77 97 | 96,AS685,PSP,SFO,N791AS,19461665,19461760,0 98 | 97,AS361,GEG,SEA,N772AS,19461695,19461761,20 99 | 98,AS409,LAX,PDX,N792AS,19461616,19461761,21 100 | 99,AS431,SEA,SNA,N796AS,19461600,19461767,32 101 | 100,AS352,PDX,SNA,N769AS,19461630,19461768,0 102 | 101,AS033,ADQ,ANC,N762AS,19461718,19461772,9 103 | 102,AS067,KTN,SIT,N780AS,19461727,19461784,15 104 | 103,AS0641,LAS,SEA,N799AS,19461621,19461785,47 105 | 104,AS359,OAK,SEA,N779AS,19461671,19461791,55 106 | 105,AS368,PDX,SMF,N774AS,19461710,19461798,41 107 | 106,AS146,SCC,BRW,N763AS,19461762,19461814,0 108 | 107,AS424,SEA,SNA,N708AS,19461666,19461828,0 109 | 108,AS115,SEA,ANC,N795AS,19461616,19461844,0 110 | 109,AS498,SEA,SFO,N760AS,19461721,19461851,65 111 | 110,AS153,ANC,OTZ,N794AS,19461760,19461854,0 112 | 111,AS301,SJC,PDX,N788AS,19461750,19461855,0 113 | 112,AS470,SEA,OAK,N771AS,19461739,19461861,7 114 | 113,AS468,SEA,LAX,N793AS,19461704,19461863,13 115 | 114,AS067,SIT,JNU,N780AS,19461824,19461866,85 116 | 115,AS310,PDX,SFO,N754AS,19461757,19461867,0 117 | 116,AS374,SEA,ONT,N785AS,19461730,19461880,3 118 | 117,AS321,SNA,OAK,N769AS,19461805,19461888,29 119 | 118,AS064,KTN,SEA,N783AS,19461778,19461893,0 120 | 119,AS045,ANC,BET,N762AS,19461820,19461896,12 121 | 120,AS096,ANC,SEA,N776AS,19461695,19461911,22 122 | 121,AS069,SEA,KTN,N775AS,19461787,19461912,8 123 | 
122,AS0389,SMF,PDX,N774AS,19461833,19461925,61 124 | 123,AS141,SEA,FAI,N705AS,19461887,19461939,0 125 | 124,AS153,OTZ,OME,N794AS,19461894,19461939,0 126 | 125,AS146,BRW,FAI,N763AS,19461860,19461940,0 127 | 126,AS128,ANC,FAI,N797AS,19461896,19461948,11 128 | 127,AS682,SEA,LAS,N772AS,19461815,19461956,0 129 | 128,AS498,SFO,PSP,N760AS,19461888,19461970,52 130 | 129,AS470,OAK,SNA,N771AS,19461896,19461980,0 131 | 130,AS608,PDX,LAS,N792AS,19461858,19461985,0 132 | 131,AS070,ANC,JNU,N795AS,19461890,19461991,0 133 | 132,AS720,SEA,PHX,N778AS,19461832,19461994,67 134 | 133,AS046,BET,ANC,N762AS,19461936,19462003,4 135 | 134,AS324,SFO,SJC,N767AS,19461883,19462009,34 136 | 135,AS069,KTN,JNU,N775AS,19461952,19462012,1 137 | 136,AS067,JNU,ANC,N780AS,19461907,19462013,91 138 | 137,AS146,FAI,ANC,N763AS,19461980,19462033,0 139 | 138,AS497,SNA,SEA,N708AS,19461863,19462034,15 140 | 139,AS325,SFO,SEA,N754AS,19461916,19462039,12 141 | 140,AS710,SFO,PHX,N764AS,19461895,19462042,0 142 | 141,AS321,OAK,SEA,N769AS,19461925,19462043,16 143 | 142,AS140,FAI,ANC,N705AS,19462000,19462053,0 144 | 143,AS153,OME,ANC,N794AS,19461980,19462066,0 145 | 144,AS445,ONT,SEA,N785AS,19461922,19462079,102 146 | 145,AS376,SEA,GEG,N776AS,19462020,19462079,3 147 | 146,AS693,LAS,SEA,N772AS,19461998,19462152,36 148 | 147,AS695,LAS,PDX,N792AS,19462020,19462159,43 149 | -------------------------------------------------------------------------------- /2-intermediate-R/FirstHalf.R: -------------------------------------------------------------------------------- 1 | # Intermediate R: Data Wrangling 2 | 3 | ################################## 4 | # Section 1: Load data frame 5 | 6 | # First, load datasets. It's often more convenient to just keep strings as 7 | # strings, so we pass stringsAsFactors=FALSE. 8 | flights = read.csv("flights.csv", stringsAsFactors=FALSE) 9 | 10 | # Let's familiarize ourselves a bit with the data 11 | str(flights) 12 | 13 | ################################### 14 | # Section 2: tapply/table with built-in commands 15 | 16 | # We're going to be doing a lot of tapply, so let's make sure we remember how 17 | # to use it. 18 | # [[Pretty picture of how tapply() works, in slides]] 19 | 20 | # Let's look at the ArrDelayMinutes column 21 | summary(flights$ArrDelayMinutes) 22 | 23 | # Why the NAs? 24 | table(flights$Cancelled,is.na(flights$ArrDelayMinutes)) 25 | 26 | # To ask questions about delays, we need to exclude the NAs 27 | flightsFlown = subset(flights, !is.na(flights$ArrDelayMinutes)) 28 | 29 | # There are some huge outliers 30 | hist(flightsFlown$ArrDelayMinutes) 31 | flightsFlown = subset(flightsFlown, flightsFlown$ArrDelayMinutes < 1000) 32 | 33 | # What is the average arrival delay by day of month? 34 | tapply(flightsFlown$ArrDelayMinutes, flightsFlown$DayofMonth, mean) 35 | 36 | # What is the average arrival delay by airline? 37 | # What about standard deviation of arrival delay by airline? 38 | tapply(flightsFlown$ArrDelayMinutes, flightsFlown$Carrier, mean) 39 | tapply(flightsFlown$ArrDelayMinutes, flightsFlown$Carrier, sd) 40 | 41 | #################################### 42 | # Assignment 1 (Section 2): tapply/table with built-in commands 43 | 44 | # What is the average departure delay by weekday (not counting early 45 | # departures)? 46 | tapply(flightsFlown$DepDelayMinutes, flightsFlown$DayOfWeek, mean) 47 | 48 | # What is the maximum taxi-in time by airport (using 'Dest' column)? 49 | # Hint: R has a 'max' function.
50 | tapply(flightsFlown$TaxiIn, flightsFlown$Dest, max) 51 | 52 | # Extra question: What is the proportion of cancelled flights by airline? 53 | # Hint: The average of TRUE/FALSE values is the proportion that are TRUE. 54 | # Which airlines have the highest and lowest proportions of cancelled flights? 55 | sort(tapply(flights$Cancelled, flights$Carrier, mean)) 56 | 57 | ######################################### 58 | # Section 3: tapply with user-defined functions 59 | 60 | # Often we need to write our own functions to answer specific questions we have 61 | # about the data. We will write a function that finds the most common origin 62 | # airport over a data set of flights. 63 | 64 | # Let's look at how frequently each origin appears in the data set 65 | tab = sort(table(flights$Origin)) 66 | 67 | # Reminder: names function 68 | names(tab) 69 | 70 | # Writing a function that returns the most common origin given a data set 71 | most.common = function(x) { 72 | tab = sort(table(x), decreasing = TRUE) 73 | common.origin = names(tab)[1] 74 | return(common.origin) 75 | } 76 | 77 | # Apply most.common to each carrier using tapply 78 | tapply(flights$Origin,flights$Carrier,most.common) 79 | 80 | # ######################### 81 | # Assignment 2 (Section 3) 82 | 83 | # One simple way to measure the “skew level” of a distribution is 84 | # to subtract the median from the mean. Write a function that calculates 85 | # this measure of skew for arrival delays (ArrDelayMinutes) and use 86 | # tapply to calculate it for each carrier. 87 | # Hint: use the 'median' function. 88 | 89 | shift = function(x) { 90 | mean(x) - median(x) 91 | } 92 | 93 | tapply(flightsFlown$ArrDelayMinutes, flightsFlown$Carrier, shift) 94 | 95 | # Extra: What is the most common Origin-Destination pair for 96 | # each carrier? (Hint: use the paste() function. What would you 97 | # give as the first argument for tapply?) 98 | 99 | tapply(paste(flights$Origin,flights$Dest), flights$Carrier, most.common) 100 | 101 | ########################## 102 | # Section 4: Split-apply-combine 103 | 104 | # We want to create a new data frame with delay information about each origin 105 | # airport. Some of the data (about 150,000 entries) has information about 106 | # causes of the delays. We'll take one more subset of the data to exclude 107 | # all entries without delay type information. 108 | 109 | # Which entries to delete? 110 | summary(flights$WeatherDelay) 111 | summary(flights$CarrierDelay) 112 | flightsDelayInfo = subset(flights, !is.na(flights$WeatherDelay)) 113 | 114 | # Is the total of the delay columns equal to departure or arrival delay? 115 | summary(flightsDelayInfo$LateAircraftDelay + flightsDelayInfo$NASDelay + 116 | flightsDelayInfo$CarrierDelay + flightsDelayInfo$WeatherDelay + 117 | flightsDelayInfo$SecurityDelay == flightsDelayInfo$ArrDelayMinutes) 118 | 119 | # [[Picture of split-apply-combine; split breaks large df into smaller ones, 120 | # lapply converts small data frames into 1-row data frames; do.call(rbind) 121 | # combines them into a single data frame.]] 122 | 123 | # Let's first split by origin.
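# (Before splitting the real data, a toy look at what split() returns: it
# breaks its first argument into a list with one piece per level of the second.)
split(1:6, c("a","b","a","b","a","b"))   # $a: 1 3 5   $b: 2 4 6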
124 | spl = split(flightsDelayInfo, flightsDelayInfo$Origin) 125 | str(spl[[1]]) 126 | str(spl[[2]]) 127 | # spl is a list of data frames 128 | 129 | # Re-writing the delay proportion function and expanding to include more delay categories: 130 | delay.prop.df = function(x) { 131 | prop.carrier = sum(x$CarrierDelay)/sum(x$ArrDelayMinutes) 132 | prop.weather = sum(x$WeatherDelay)/sum(x$ArrDelayMinutes) 133 | prop.nas = sum(x$NASDelay)/sum(x$ArrDelayMinutes) 134 | prop.security = sum(x$SecurityDelay)/sum(x$ArrDelayMinutes) 135 | prop.late = sum(x$LateAircraftDelay)/sum(x$ArrDelayMinutes) 136 | return(data.frame(Origin = x$Origin[1], prop.carrier = prop.carrier, prop.weather = prop.weather, prop.nas = prop.nas, prop.security = prop.security, prop.late = prop.late)) 137 | } 138 | 139 | #Testing on a few split up data frames 140 | delay.prop.df(spl[[1]]) 141 | c(sum(spl[[1]]$CarrierDelay)/sum(spl[[1]]$ArrDelayMinutes),sum(spl[[1]]$WeatherDelay)/sum(spl[[1]]$ArrDelayMinutes),sum(spl[[1]]$NASDelay)/sum(spl[[1]]$ArrDelayMinutes),sum(spl[[1]]$SecurityDelay)/sum(spl[[1]]$ArrDelayMinutes),sum(spl[[1]]$LateAircraftDelay)/sum(spl[[1]]$ArrDelayMinutes)) 142 | 143 | # Use lapply (apply a function to a list) to convert elements of spl to 1-row summary 144 | # data frames. 145 | spl2 = lapply(spl, delay.prop.df) 146 | spl2[[1]] 147 | spl2[[2]] 148 | 149 | # Last step is to combine everything together. We could manually combine with 150 | # rbind: 151 | rbind(spl2[[1]], spl2[[2]], spl2[[3]]) 152 | 153 | # do.call is a nifty function that passes all of the elements of its second 154 | # argument to its first argument, which is a function 155 | flights.delay.info = do.call(rbind, spl2) 156 | head(flights.delay.info) 157 | 158 | # What are the airports with the highest proportion of weather delays? 159 | flights.delay.info[order(flights.delay.info$prop.weather, decreasing=TRUE),] 160 | 161 | # How about carrier delays? 162 | flights.delay.info[order(flights.delay.info$prop.carrier, decreasing=TRUE),] 163 | 164 | ########################## 165 | # Assignment 3 (Section 4): Split-apply-combine 166 | 167 | # From the flightsFlown data frame, create a data frame called carrier.info, where each row corresponds 168 | # to one carrier (airline).
Include the following variables in your new data frame: 169 | # - carrier: The carrier code 170 | # - mean.arr.delay: Average arrival delay time (using ArrDelayMinutes) 171 | # - longest.delay: Longest flight delay for the month 172 | # - most.common.origin: most common origin for the carrier 173 | 174 | spl = split(flightsFlown, flightsFlown$Carrier) 175 | 176 | process.carrier = function(x) { 177 | carrier = x$Carrier[1] 178 | mean.arr.delay = mean(x$ArrDelayMinutes) 179 | longest.delay = max(x$ArrDelayMinutes) 180 | most.common.origin = most.common(x$Origin) 181 | return(data.frame(carrier,mean.arr.delay,longest.delay,most.common.origin)) 182 | } 183 | 184 | spl2 = lapply(spl, process.carrier) 185 | carrier.info = do.call(rbind, spl2) 186 | -------------------------------------------------------------------------------- /2-intermediate-R/prcp_pretty.csv: -------------------------------------------------------------------------------- 1 | "Airport","DayOfMonth","prcp" 2 | "ATL",1,0 3 | "ATL",2,43 4 | "ATL",3,135 5 | "ATL",4,3 6 | "ATL",5,10 7 | "ATL",6,46 8 | "ATL",7,51 9 | "ATL",8,33 10 | "ATL",9,254 11 | "ATL",10,46 12 | "ATL",11,0 13 | "ATL",12,0 14 | "ATL",13,0 15 | "ATL",14,137 16 | "ATL",15,3 17 | "ATL",16,0 18 | "ATL",17,0 19 | "ATL",18,0 20 | "ATL",19,0 21 | "ATL",20,0 22 | "ATL",21,0 23 | "ATL",22,792 24 | "ATL",23,20 25 | "ATL",24,0 26 | "ATL",25,0 27 | "ATL",26,0 28 | "ATL",27,0 29 | "ATL",28,358 30 | "ATL",29,51 31 | "ATL",30,0 32 | "ATL",31,0 33 | "BOS",1,69 34 | "BOS",2,0 35 | "BOS",3,0 36 | "BOS",4,0 37 | "BOS",5,3 38 | "BOS",6,76 39 | "BOS",7,23 40 | "BOS",8,0 41 | "BOS",9,109 42 | "BOS",10,18 43 | "BOS",11,0 44 | "BOS",12,0 45 | "BOS",13,0 46 | "BOS",14,56 47 | "BOS",15,163 48 | "BOS",16,0 49 | "BOS",17,135 50 | "BOS",18,0 51 | "BOS",19,0 52 | "BOS",20,0 53 | "BOS",21,0 54 | "BOS",22,3 55 | "BOS",23,173 56 | "BOS",24,0 57 | "BOS",25,0 58 | "BOS",26,13 59 | "BOS",27,0 60 | "BOS",28,0 61 | "BOS",29,335 62 | "BOS",30,0 63 | "BOS",31,0 64 | "CLT",1,0 65 | "CLT",2,0 66 | "CLT",3,3 67 | "CLT",4,5 68 | "CLT",5,94 69 | "CLT",6,0 70 | "CLT",7,10 71 | "CLT",8,10 72 | "CLT",9,79 73 | "CLT",10,36 74 | "CLT",11,0 75 | "CLT",12,0 76 | "CLT",13,0 77 | "CLT",14,251 78 | "CLT",15,3 79 | "CLT",16,0 80 | "CLT",17,0 81 | "CLT",18,0 82 | "CLT",19,0 83 | "CLT",20,0 84 | "CLT",21,0 85 | "CLT",22,366 86 | "CLT",23,490 87 | "CLT",24,0 88 | "CLT",25,0 89 | "CLT",26,0 90 | "CLT",27,0 91 | "CLT",28,3 92 | "CLT",29,462 93 | "CLT",30,0 94 | "CLT",31,0 95 | "DEN",1,0 96 | "DEN",2,0 97 | "DEN",3,3 98 | "DEN",4,28 99 | "DEN",5,0 100 | "DEN",6,0 101 | "DEN",7,0 102 | "DEN",8,5 103 | "DEN",9,0 104 | "DEN",10,0 105 | "DEN",11,0 106 | "DEN",12,0 107 | "DEN",13,0 108 | "DEN",14,0 109 | "DEN",15,0 110 | "DEN",16,0 111 | "DEN",17,0 112 | "DEN",18,0 113 | "DEN",19,0 114 | "DEN",20,0 115 | "DEN",21,8 116 | "DEN",22,5 117 | "DEN",23,5 118 | "DEN",24,0 119 | "DEN",25,0 120 | "DEN",26,0 121 | "DEN",27,0 122 | "DEN",28,10 123 | "DEN",29,0 124 | "DEN",30,0 125 | "DEN",31,0 126 | "DFW",1,0 127 | "DFW",2,0 128 | "DFW",3,0 129 | "DFW",4,0 130 | "DFW",5,102 131 | "DFW",6,216 132 | "DFW",7,0 133 | "DFW",8,0 134 | "DFW",9,0 135 | "DFW",10,0 136 | "DFW",11,0 137 | "DFW",12,0 138 | "DFW",13,3 139 | "DFW",14,0 140 | "DFW",15,0 141 | "DFW",16,0 142 | "DFW",17,0 143 | "DFW",18,0 144 | "DFW",19,0 145 | "DFW",20,5 146 | "DFW",21,376 147 | "DFW",22,0 148 | "DFW",23,0 149 | "DFW",24,0 150 | "DFW",25,0 151 | "DFW",26,0 152 | "DFW",27,0 153 | "DFW",28,0 154 | "DFW",29,0 155 | "DFW",30,0 156 | "DFW",31,0 157 | "DTW",1,0 158 | 
"DTW",2,0 159 | "DTW",3,23 160 | "DTW",4,0 161 | "DTW",5,0 162 | "DTW",6,0 163 | "DTW",7,0 164 | "DTW",8,8 165 | "DTW",9,10 166 | "DTW",10,0 167 | "DTW",11,3 168 | "DTW",12,0 169 | "DTW",13,0 170 | "DTW",14,122 171 | "DTW",15,8 172 | "DTW",16,20 173 | "DTW",17,8 174 | "DTW",18,0 175 | "DTW",19,0 176 | "DTW",20,86 177 | "DTW",21,224 178 | "DTW",22,64 179 | "DTW",23,5 180 | "DTW",24,0 181 | "DTW",25,3 182 | "DTW",26,13 183 | "DTW",27,0 184 | "DTW",28,0 185 | "DTW",29,0 186 | "DTW",30,0 187 | "DTW",31,20 188 | "EWR",1,0 189 | "EWR",2,0 190 | "EWR",3,0 191 | "EWR",4,0 192 | "EWR",5,3 193 | "EWR",6,203 194 | "EWR",7,38 195 | "EWR",8,25 196 | "EWR",9,86 197 | "EWR",10,69 198 | "EWR",11,0 199 | "EWR",12,0 200 | "EWR",13,0 201 | "EWR",14,122 202 | "EWR",15,104 203 | "EWR",16,0 204 | "EWR",17,46 205 | "EWR",18,0 206 | "EWR",19,0 207 | "EWR",20,0 208 | "EWR",21,0 209 | "EWR",22,8 210 | "EWR",23,135 211 | "EWR",24,0 212 | "EWR",25,0 213 | "EWR",26,0 214 | "EWR",27,0 215 | "EWR",28,0 216 | "EWR",29,335 217 | "EWR",30,0 218 | "EWR",31,0 219 | "IAH",1,0 220 | "IAH",2,0 221 | "IAH",3,0 222 | "IAH",4,0 223 | "IAH",5,0 224 | "IAH",6,3 225 | "IAH",7,0 226 | "IAH",8,0 227 | "IAH",9,18 228 | "IAH",10,0 229 | "IAH",11,0 230 | "IAH",12,0 231 | "IAH",13,15 232 | "IAH",14,0 233 | "IAH",15,0 234 | "IAH",16,0 235 | "IAH",17,0 236 | "IAH",18,0 237 | "IAH",19,8 238 | "IAH",20,0 239 | "IAH",21,376 240 | "IAH",22,3 241 | "IAH",23,0 242 | "IAH",24,0 243 | "IAH",25,0 244 | "IAH",26,0 245 | "IAH",27,0 246 | "IAH",28,0 247 | "IAH",29,0 248 | "IAH",30,0 249 | "IAH",31,0 250 | "JFK",1,0 251 | "JFK",2,0 252 | "JFK",3,0 253 | "JFK",4,0 254 | "JFK",5,3 255 | "JFK",6,145 256 | "JFK",7,43 257 | "JFK",8,8 258 | "JFK",9,64 259 | "JFK",10,71 260 | "JFK",11,0 261 | "JFK",12,0 262 | "JFK",13,0 263 | "JFK",14,145 264 | "JFK",15,208 265 | "JFK",16,0 266 | "JFK",17,33 267 | "JFK",18,0 268 | "JFK",19,0 269 | "JFK",20,0 270 | "JFK",21,0 271 | "JFK",22,0 272 | "JFK",23,124 273 | "JFK",24,0 274 | "JFK",25,0 275 | "JFK",26,0 276 | "JFK",27,0 277 | "JFK",28,0 278 | "JFK",29,300 279 | "JFK",30,0 280 | "JFK",31,0 281 | "LAS",1,0 282 | "LAS",2,0 283 | "LAS",3,8 284 | "LAS",4,5 285 | "LAS",5,0 286 | "LAS",6,0 287 | "LAS",7,0 288 | "LAS",8,0 289 | "LAS",9,0 290 | "LAS",10,0 291 | "LAS",11,0 292 | "LAS",12,0 293 | "LAS",13,0 294 | "LAS",14,0 295 | "LAS",15,0 296 | "LAS",16,0 297 | "LAS",17,0 298 | "LAS",18,0 299 | "LAS",19,0 300 | "LAS",20,0 301 | "LAS",21,0 302 | "LAS",22,0 303 | "LAS",23,0 304 | "LAS",24,0 305 | "LAS",25,0 306 | "LAS",26,0 307 | "LAS",27,0 308 | "LAS",28,0 309 | "LAS",29,0 310 | "LAS",30,0 311 | "LAS",31,0 312 | "LAX",1,0 313 | "LAX",2,0 314 | "LAX",3,0 315 | "LAX",4,0 316 | "LAX",5,0 317 | "LAX",6,0 318 | "LAX",7,66 319 | "LAX",8,0 320 | "LAX",9,0 321 | "LAX",10,0 322 | "LAX",11,0 323 | "LAX",12,0 324 | "LAX",13,0 325 | "LAX",14,0 326 | "LAX",15,0 327 | "LAX",16,0 328 | "LAX",17,0 329 | "LAX",18,0 330 | "LAX",19,10 331 | "LAX",20,0 332 | "LAX",21,0 333 | "LAX",22,0 334 | "LAX",23,0 335 | "LAX",24,0 336 | "LAX",25,0 337 | "LAX",26,0 338 | "LAX",27,0 339 | "LAX",28,0 340 | "LAX",29,0 341 | "LAX",30,0 342 | "LAX",31,0 343 | "LGA",1,0 344 | "LGA",2,0 345 | "LGA",3,0 346 | "LGA",4,0 347 | "LGA",5,3 348 | "LGA",6,185 349 | "LGA",7,33 350 | "LGA",8,13 351 | "LGA",9,53 352 | "LGA",10,56 353 | "LGA",11,0 354 | "LGA",12,0 355 | "LGA",13,0 356 | "LGA",14,112 357 | "LGA",15,196 358 | "LGA",16,0 359 | "LGA",17,41 360 | "LGA",18,0 361 | "LGA",19,0 362 | "LGA",20,0 363 | "LGA",21,0 364 | "LGA",22,3 365 | "LGA",23,135 366 | "LGA",24,0 367 | 
"LGA",25,0 368 | "LGA",26,0 369 | "LGA",27,0 370 | "LGA",28,0 371 | "LGA",29,305 372 | "LGA",30,0 373 | "LGA",31,0 374 | "MCO",1,0 375 | "MCO",2,0 376 | "MCO",3,0 377 | "MCO",4,0 378 | "MCO",5,0 379 | "MCO",6,0 380 | "MCO",7,0 381 | "MCO",8,0 382 | "MCO",9,0 383 | "MCO",10,0 384 | "MCO",11,0 385 | "MCO",12,0 386 | "MCO",13,0 387 | "MCO",14,3 388 | "MCO",15,46 389 | "MCO",16,0 390 | "MCO",17,0 391 | "MCO",18,0 392 | "MCO",19,0 393 | "MCO",20,0 394 | "MCO",21,0 395 | "MCO",22,0 396 | "MCO",23,0 397 | "MCO",24,0 398 | "MCO",25,0 399 | "MCO",26,0 400 | "MCO",27,3 401 | "MCO",28,3 402 | "MCO",29,15 403 | "MCO",30,0 404 | "MCO",31,0 405 | "MIA",1,53 406 | "MIA",2,0 407 | "MIA",3,0 408 | "MIA",4,0 409 | "MIA",5,0 410 | "MIA",6,0 411 | "MIA",7,33 412 | "MIA",8,15 413 | "MIA",9,0 414 | "MIA",10,0 415 | "MIA",11,10 416 | "MIA",12,0 417 | "MIA",13,3 418 | "MIA",14,3 419 | "MIA",15,0 420 | "MIA",16,0 421 | "MIA",17,0 422 | "MIA",18,0 423 | "MIA",19,0 424 | "MIA",20,0 425 | "MIA",21,0 426 | "MIA",22,0 427 | "MIA",23,0 428 | "MIA",24,0 429 | "MIA",25,23 430 | "MIA",26,952 431 | "MIA",27,64 432 | "MIA",28,30 433 | "MIA",29,0 434 | "MIA",30,0 435 | "MIA",31,0 436 | "MSP",1,0 437 | "MSP",2,30 438 | "MSP",3,20 439 | "MSP",4,132 440 | "MSP",5,0 441 | "MSP",6,0 442 | "MSP",7,0 443 | "MSP",8,18 444 | "MSP",9,0 445 | "MSP",10,15 446 | "MSP",11,0 447 | "MSP",12,0 448 | "MSP",13,5 449 | "MSP",14,15 450 | "MSP",15,0 451 | "MSP",16,25 452 | "MSP",17,0 453 | "MSP",18,0 454 | "MSP",19,8 455 | "MSP",20,5 456 | "MSP",21,0 457 | "MSP",22,8 458 | "MSP",23,0 459 | "MSP",24,58 460 | "MSP",25,5 461 | "MSP",26,3 462 | "MSP",27,0 463 | "MSP",28,0 464 | "MSP",29,0 465 | "MSP",30,23 466 | "MSP",31,0 467 | "ORD",1,0 468 | "ORD",2,18 469 | "ORD",3,5 470 | "ORD",4,5 471 | "ORD",5,0 472 | "ORD",6,0 473 | "ORD",7,0 474 | "ORD",8,61 475 | "ORD",9,0 476 | "ORD",10,0 477 | "ORD",11,13 478 | "ORD",12,0 479 | "ORD",13,3 480 | "ORD",14,58 481 | "ORD",15,0 482 | "ORD",16,13 483 | "ORD",17,3 484 | "ORD",18,0 485 | "ORD",19,15 486 | "ORD",20,79 487 | "ORD",21,61 488 | "ORD",22,51 489 | "ORD",23,0 490 | "ORD",24,3 491 | "ORD",25,18 492 | "ORD",26,0 493 | "ORD",27,0 494 | "ORD",28,0 495 | "ORD",29,3 496 | "ORD",30,10 497 | "ORD",31,76 498 | "PHL",1,0 499 | "PHL",2,0 500 | "PHL",3,0 501 | "PHL",4,0 502 | "PHL",5,0 503 | "PHL",6,196 504 | "PHL",7,18 505 | "PHL",8,145 506 | "PHL",9,249 507 | "PHL",10,69 508 | "PHL",11,0 509 | "PHL",12,0 510 | "PHL",13,0 511 | "PHL",14,178 512 | "PHL",15,18 513 | "PHL",16,0 514 | "PHL",17,13 515 | "PHL",18,0 516 | "PHL",19,0 517 | "PHL",20,0 518 | "PHL",21,0 519 | "PHL",22,10 520 | "PHL",23,124 521 | "PHL",24,0 522 | "PHL",25,0 523 | "PHL",26,0 524 | "PHL",27,0 525 | "PHL",28,0 526 | "PHL",29,302 527 | "PHL",30,0 528 | "PHL",31,0 529 | "PHX",1,0 530 | "PHX",2,0 531 | "PHX",3,0 532 | "PHX",4,0 533 | "PHX",5,0 534 | "PHX",6,0 535 | "PHX",7,0 536 | "PHX",8,0 537 | "PHX",9,0 538 | "PHX",10,0 539 | "PHX",11,0 540 | "PHX",12,0 541 | "PHX",13,0 542 | "PHX",14,0 543 | "PHX",15,0 544 | "PHX",16,0 545 | "PHX",17,0 546 | "PHX",18,0 547 | "PHX",19,41 548 | "PHX",20,58 549 | "PHX",21,0 550 | "PHX",22,0 551 | "PHX",23,0 552 | "PHX",24,0 553 | "PHX",25,0 554 | "PHX",26,0 555 | "PHX",27,0 556 | "PHX",28,0 557 | "PHX",29,0 558 | "PHX",30,0 559 | "PHX",31,0 560 | "SEA",1,30 561 | "SEA",2,46 562 | "SEA",3,0 563 | "SEA",4,0 564 | "SEA",5,0 565 | "SEA",6,0 566 | "SEA",7,0 567 | "SEA",8,0 568 | "SEA",9,0 569 | "SEA",10,0 570 | "SEA",11,0 571 | "SEA",12,69 572 | "SEA",13,5 573 | "SEA",14,0 574 | "SEA",15,13 575 | "SEA",16,3 576 | 
"SEA",17,0 577 | "SEA",18,13 578 | "SEA",19,0 579 | "SEA",20,56 580 | "SEA",21,56 581 | "SEA",22,107 582 | "SEA",23,15 583 | "SEA",24,0 584 | "SEA",25,0 585 | "SEA",26,0 586 | "SEA",27,3 587 | "SEA",28,0 588 | "SEA",29,0 589 | "SEA",30,3 590 | "SEA",31,5 591 | "SFO",1,0 592 | "SFO",2,0 593 | "SFO",3,0 594 | "SFO",4,0 595 | "SFO",5,0 596 | "SFO",6,74 597 | "SFO",7,15 598 | "SFO",8,0 599 | "SFO",9,0 600 | "SFO",10,0 601 | "SFO",11,0 602 | "SFO",12,0 603 | "SFO",13,0 604 | "SFO",14,0 605 | "SFO",15,0 606 | "SFO",16,0 607 | "SFO",17,0 608 | "SFO",18,0 609 | "SFO",19,0 610 | "SFO",20,0 611 | "SFO",21,0 612 | "SFO",22,0 613 | "SFO",23,0 614 | "SFO",24,0 615 | "SFO",25,0 616 | "SFO",26,0 617 | "SFO",27,0 618 | "SFO",28,0 619 | "SFO",29,0 620 | "SFO",30,0 621 | "SFO",31,0 622 | -------------------------------------------------------------------------------- /1-intro-R/.Rapp.history: -------------------------------------------------------------------------------- 1 | par(mfrow = c(2,2)) 2 | plot(anscombe$x1, anscombe$y1) 3 | abline(a1) 4 | plot(anscombe$x1, anscombe$y1)# 5 | abline(a1)# 6 | # 7 | plot(anscombe$x2, anscombe$y2)# 8 | abline(a2)# 9 | # 10 | plot(anscombe$x3, anscombe$y3)# 11 | abline(a3)# 12 | # 13 | plot(anscombe$x4, anscombe$y4)# 14 | abline(a4) 15 | par(mfrow = c(2,2))# 16 | plot(anscombe$x1, anscombe$y1)# 17 | abline(a1)# 18 | # 19 | plot(anscombe$x2, anscombe$y2)# 20 | abline(a2)# 21 | # 22 | plot(anscombe$x3, anscombe$y3)# 23 | abline(a3)# 24 | # 25 | plot(anscombe$x4, anscombe$y4)# 26 | abline(a4) 27 | ggplot(data = anscombe, aes(x = x1, y = y1)) 28 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point 29 | library(ggplot2)# 30 | library(maps)# 31 | library(ggmap)# 32 | data(anscombe)# 33 | str(anscombe) 34 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point 35 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point() 36 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_line() 37 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_line(color = "blue", size = 3, shape = 17) 38 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point(color = "blue", size = 3, shape = 17) 39 | ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point(color = "blue", size = 3, shape = 17) + ggtitle("Anscombe #1") 40 | pdf("MyPlot.pdf") 41 | ggsave() 42 | anscombe_plot = ggplot(data = anscombe, aes(x = x1, y = y1)) + geom_point(color = "blue", size = 3, shape = 17) + ggtitle("Anscombe #1") 43 | print(anscombe_plot) 44 | data(iris) 45 | str(iris) 46 | iris_plot = ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) + geom_point(color = "red", size = 3, shape = 16) + ggtitle("Sepal Length vs. Petal Length") 47 | print(iris_plot) 48 | iris_plot 49 | ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) + geom_point(color = "red", size = 3, shape = 16) + ggtitle("Sepal Length vs. 
Petal Length") 50 | library(stats) 51 | library(stats)# 52 | lm_test <- lm(mpg ~ hp + cyl + wt + gear, data = mtcars)# 53 | summary(lm_test) 54 | source("Assignment.R") 55 | getwd() 56 | setwd("~/Desktop/OR-software-tools-2015/1-intro-R/") 57 | source("Assignment.R") 58 | 3^(6-4) 59 | 22/7 60 | 16^(1/4) 61 | 6*9 == 62 | 42 63 | 6*9 == 64 | 54 65 | sqrt(2) 66 | abs(-2) 67 | sin(pi/2) 68 | cos(0) 69 | exp(-1) 70 | (1 - 1/100)^100 71 | log(exp(1)) 72 | help(log) 73 | ?log 74 | x <- 2^3 75 | y = 6 76 | x 77 | y 78 | print(x) 79 | print(y) 80 | ls() 81 | z <- seq(1:10) 82 | z <- 1:10 #this also works 83 | z 84 | z[5] 85 | sum(z) 86 | double_z <- z^2 87 | double_z 88 | airports = c("BOS", "JFK", "ORD", "SFO", "ATL") 89 | capacities = c(20, 45, 50, 35, 55) 90 | cbind(airports, capacities) 91 | df1 = data.frame(airports, capacities) 92 | df1 93 | class(airports) 94 | str(airports) 95 | class(capacities) 96 | str(capacities) 97 | class(df.runways) 98 | str(df.runways) 99 | df.runways = rbind(df1, df2) 100 | df1 = data.frame(airports, capacities) 101 | capacities = c(3, 2, 5, 1, 3) 102 | df2 = data.frame(airports, capacities) 103 | airports = c("BOS", "JFK", "ORD", "SFO", "ATL")# 104 | capacities = c(20, 45, 50, 35, 55)# 105 | # 106 | # Place vectors together as a matrix using bind# 107 | # 108 | # bind together as columns# 109 | cbind(airports, capacities)# 110 | # 111 | # bind together as rows# 112 | rbind(airports, capacities)# 113 | # 114 | # Create a data frame# 115 | df1 = data.frame(airports, capacities)# 116 | # 117 | # Add additional runways# 118 | capacities = c(3, 2, 5, 1, 3)# 119 | # 120 | # Create another data frame# 121 | df2 = data.frame(airports, capacities)# 122 | # 123 | # Append rows of the second data frame to those of the first# 124 | df.runways = rbind(df1, df2) 125 | class(df.runways) 126 | str(df.runways) 127 | df.runways 128 | df.runways$locations 129 | df.runways$airports 130 | summary(df.runways) 131 | summary(df.runways$airports) 132 | df.runways$airports 133 | summary(df.runways) 134 | summary(df.runways$airports) 135 | runwaysBOS = subset(df.runways, locations=="BOS") 136 | runwaysBOS = subset(df.runways, airports=="BOS") 137 | runwaysBOS 138 | runwaysBOS = df.runways[c(1,6), ] 139 | str(runwaysBOS) 140 | runwaysBOS$airports = factor(runwaysBOS$airports) 141 | str(runwaysBOS) 142 | sum(runwaysBOS$capacities) 143 | airports = c("BOS", "JFK", "ORD", "SFO", "ATL") 144 | capacities = c(20, 45, 50, 35, 55) 145 | cbind(airports, capacities) 146 | rbind(airports, capacities) 147 | df1 = data.frame(airports, capacities) 148 | capacities = c(3, 2, 5, 1, 3) 149 | df2 = data.frame(airports, capacities) 150 | df.runways = rbind(df1, df2) 151 | class(airports) 152 | str(airports) 153 | class(capacities) 154 | str(capacities) 155 | class(df.runways) 156 | str(df.runways) 157 | df.runways 158 | df.runways$airports 159 | summary(df.runways) 160 | summary(df.runways$airports) 161 | runwaysBOS = subset(df.runways, airports=="BOS") 162 | runwaysBOS 163 | runwaysBOS = df.runways[c(1,6), ] 164 | str(runwaysBOS) 165 | runwaysBOS$airports = factor(runwaysBOS$airports) 166 | str(runwaysBOS) 167 | sum(runwaysBOS$capacities) 168 | CEOcomp = read.csv(file = "CEOcomp.csv", header = TRUE) 169 | str(CEOcomp) 170 | names(CEOcomp) 171 | CEOcomp$Years 172 | CEOcomp$MBA 173 | attach(CEOcomp) 174 | Years 175 | MBA 176 | detach(CEOcomp) 177 | mean(CEOcomp$Years) 178 | sd(CEOcomp$Years) 179 | summary(CEOcomp$Years) 180 | plot(CEOcomp$Years, CEOcomp$TotalCompensation) 181 | plot(CEOcomp$Years, 
CEOcomp$TotalCompensation, main="Total Compensation by Year", xlab = "Years of Experience", ylab = "Total Compensation (thousand USD)") 182 | plot(CEOcomp$Years, CEOcomp$TotalCompensation) 183 | plot(CEOcomp$Years, CEOcomp$TotalCompensation, main="Total Compensation by Year", xlab = "Years of Experience", ylab = "Total Compensation (thousand USD)") 184 | tapply(CEOcomp$TotalCompensation, CEOcomp$MBA, mean) 185 | table(CEOcomp$Year, CEOcomp$MBA) 186 | CEOmissing = read.csv("CEOmissing.csv") 187 | summary(CEOmissing) 188 | str(CEOmissing) 189 | 5 == NA 190 | NA == NA 191 | is.na(5) 192 | is.na(NA) 193 | CEOnomissing = subset(CEOmissing, !is.na(TotalCompensation) & !is.na(Years) & !is.na(ChangeStockPrice) & !is.na(ChangeCompanySales) & !is.na(MBA)) 194 | summary(CEOnomissing) 195 | str(CEOnomissing) 196 | CEOomitmissing = na.omit(CEOmissing) 197 | summary(CEOomitmissing) 198 | str(CEOomitmissing) 199 | save.image("eg.RData") 200 | save(CEOcomp, file = "CEOcomp.RData") 201 | ?seq 202 | seq(from = 2, to = 20, by = 2) 203 | seq(2, 20, 2) 204 | 2*(1:10) 205 | hist(CEOcomp$Years) 206 | hist(CEOcomp$Years, main = "Years of Experience", xlab= "Years", ylab = "freq") 207 | otp = read.csv("~/Desktop/otp.csv") 208 | str(otp) 209 | summary(otp$Origin) 210 | summary(otp$Origin) 211 | summary(otp$Origin)[1:10] 212 | names(summary(otp$Origin))[1:10] 213 | names(summary(otp$Dest))[1:10] 214 | topten = names(summary(otp$Dest))[1:10] 215 | truncated = subset(otp, is.element(otp$Dest, topten) & is.element(otp$Origin, topten)) 216 | table(truncated$Origin, truncated$Dest) 217 | truncated$Origin = factor(truncated$Origin) 218 | truncated$Dest = factor(truncated$Dest) 219 | table(truncated$Origin, truncated$Dest) 220 | LB = read.csv("LettersBinary.csv") 221 | CEOcomp = read.csv(file = "CEOcomp.csv", header = TRUE) 222 | CEO.linReg <- lm(TotalCompensation ~ Years + ChangeStockPrice + ChangeCompanySales + MBA, data = CEOcomp) 223 | summary(CEO.linReg) 224 | CEO.linReg$coefficients 225 | CEO.linReg$residuals 226 | confint(CEO.linReg, level = 0.95) 227 | cor(CEOcomp$TotalCompensation, CEOcomp$Years) 228 | cor(CEOcomp) 229 | cor.test(CEOcomp$TotalCompensation, CEOcomp$Years) 230 | TitanicPassengers = read.csv("TitanicPassengers.csv") 231 | str(TitanicPassengers) 232 | library(caTools) 233 | split <- sample.split(TitanicPassengers$Survived, SplitRatio = 0.6) 234 | split 235 | TitanicTrain <- TitanicPassengers[split, ] 236 | TitanicTest <- TitanicPassengers[!split, ] 237 | Titanic.logReg = glm(Survived ~ Class + Age + Sex, data = TitanicTrain, family = binomial) 238 | summary(Titanic.logReg) 239 | Titanic.logPred = predict(Titanic.logReg, type = "response") 240 | split = sample.split(LB, SplitRatio = 0.6) 241 | LB.train = LB[split, ] 242 | LB.test = LB[!split, ] 243 | str(LB.train) 244 | str(LB.test) 245 | letters.formula <- formula(Letter ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16) 246 | log.LB = glm(letters.formula, data = LB.train, family = binomial) 247 | predict(log.LB, newdata = LB.test, type = "response") 248 | str(LB.test) 249 | table(LB.test$Letter, round(predict(log.LB, newdata = LB.test, type= "response"))) 250 | library(rpart) 251 | Titanic.CART = rpart(Survived ~ Class + Age + Sex, data = TitanicTrain, method = "class", control = rpart.control(minbucket = 10)) 252 | Titanic.CART = rpart(Survived ~ Class + Age + Sex, data = TitanicTrain, method = "class", control = rpart.control(minbucket = 10)) 253 | plot(Titanic.CART) 254 | text(Titanic.CART, pretty = 0) 255 | 
Titanic.CARTpredTest = predict(Titanic.CART, newdata = TitanicTest, type = "class") 256 | CARTpredTable <- table(TitanicTest$Survived, Titanic.CARTpredTest) 257 | CARTpredTable 258 | sum(diag(CARTpredTable))/nrow(TitanicTest) 259 | CEOcomp.CART = rpart(TotalCompensation ~ Years + ChangeStockPrice + ChangeCompanySales + MBA, data = CEOcomp, method = "anova", control = rpart.control(minsplit = 5)) 260 | predict(CEOcomp.CART) 261 | CEOcomp$TotalCompensation 262 | library(randomForest) 263 | install.packages("randomForest") 264 | library(randomForest) 265 | Titanic.forest = randomForest(Survived ~ Class + Age + Sex, data = TitanicTrain, nodesize = 10, ntree = 200) 266 | str(TitanicTrain$Survived) 267 | TitanicTrain$Survived <- factor(TitanicTrain$Survived) 268 | TitanicTest$Survived <- factor(TitanicTest$Survived) 269 | Titanic.forest = randomForest(Survived ~ Class + Age + Sex, data = TitanicTrain, nodesize = 10, ntree = 200) 270 | forest.table <- table(TitanicTest$Survived, Titanic.forestPred) 271 | Titanic.forest = randomForest(Survived ~ Class + Age + Sex, data = TitanicTrain, nodesize = 10, ntree = 200) 272 | Titanic.forestPred = predict(Titanic.forest, newdata = TitanicTest) 273 | forest.table <- table(TitanicTest$Survived, Titanic.forestPred) 274 | forest.table 275 | sum(diag(forest.table))/nrow(TitanicTest) 276 | ?randomForest 277 | data() 278 | data(iris) 279 | str(iris) 280 | IrisDist = dist(iris[1:4], method = "euclidean") 281 | IrisHC = hclust(IrisDist, method = "ward") 282 | IrisHC = hclust(IrisDist, method = "ward.D") 283 | plot(IrisHC) 284 | IrisDist 285 | rect.hclust(IrisHC, k = 3, border = "red") 286 | plot(IrisHC) 287 | rect.hclust(IrisHC, k = 3, border = "red") 288 | IrisHCGroups = cutree(IrisHC, k = 3) 289 | table(iris$Species, IrisHCGroups) 290 | tapply(iris$Petal.Length, IrisHCGroups, mean) 291 | IrisKMC = kmeans(iris[1:4], centers = 3, iter.max = 100) 292 | str(IrisKMC) 293 | IrisKMCGroups = IrisKMC$cluster 294 | table(iris$Species, IrisKMCGroups) 295 | IrisKMC = kmeans(iris[1:4], centers = 3, iter.max = 10000) 296 | IrisKMCGroups = IrisKMC$cluster 297 | table(iris$Species, IrisKMCGroups) 298 | IrisKMC$centers 299 | -------------------------------------------------------------------------------- /1-intro-R/1-1.R: -------------------------------------------------------------------------------- 1 | # IAP 2015 2 | # 15.S60 Software Tools for Operations Research 3 | # Lecture 1: Introduction to R 4 | 5 | # Script file 1-1.R 6 | # In this script file, we cover the basics of using R. 7 | 8 | ################################################### 9 | ## RUNNING R AT THE COMMAND LINE, SCRIPTING, AND ## 10 | ## SETTING THE WORKING DIRECTORY ## 11 | ################################################### 12 | 13 | # Using the R console (command line): 14 | # - You can type directly into the R console (at '>') and 15 | # execute by pressing Enter 16 | # - Previous lines can be accessed using the up and down arrows 17 | # - Tabs can be used for auto-completion 18 | # - Incomplete commands will be further prompted by '+' 19 | 20 | # Using R scripts in conjunction with the console: 21 | # - We are currently in a script ("1-1.R") 22 | # - Individual lines (or multiple) in this script can be executed 23 | # by placing the cursor on the line (or selecting) and typing 24 | # Ctrl + r on PC or Cmd + Enter on Mac 25 | # - An entire script file can also be run by Edit -> Run All on PC 26 | # or Edit -> Source on Mac or typing the following: 27 | source("Assignment.R") 28 | 29 | # Oops! 
We need to set our working directory. 30 | # Check the current directory (also in the upper part of console) 31 | getwd() 32 | 33 | # Set your directory path here! Where did you save the folder? 34 | setwd("~/Desktop/") 35 | 36 | # Alternatively, you can do File -> Change dir... on PC or 37 | # Misc -> Change Working Directory... on Mac 38 | 39 | ################################################ 40 | ## BASICS: CALCULATIONS, FUNCTIONS, VARIABLES ## 41 | ################################################ 42 | 43 | # You can use R as a calculator. E.g.: 44 | 3^(6-4) 45 | 22/7 46 | 16^(1/4) 47 | 48 | 6*9 == 49 | 50 | # What happened with that last one? Check the R console! 51 | # Let's see if it's equal to 42... 52 | 53 | # Use the arrow keys to recall the command and check to see 54 | # if 54 will give you the answer you expect. 55 | 56 | # Other useful functions: 57 | sqrt(2) 58 | abs(-2) 59 | 60 | sin(pi/2) 61 | cos(0) 62 | 63 | exp(-1) 64 | (1 - 1/100)^100 65 | 66 | log(exp(1)) 67 | 68 | # The help function can explain certain functions 69 | # What if we forgot if log was base 10 or natural log? 70 | help(log) 71 | ?log 72 | 73 | # You can save values, calculations, or function outputs to variables 74 | # with either <- or = 75 | x <- 2^3 76 | y = 6 77 | 78 | # Use just the variable name to display the output 79 | x 80 | y 81 | 82 | # Note! If you run a script using source(""), output will be 83 | # suppressed, unless you use the print function 84 | print(x) 85 | print(y) 86 | 87 | # Rules for variable names 88 | # - Can include letters, numbers 89 | # - Can have periods, underscores 90 | # - CANNOT begin with a number 91 | # - Case-sensitive 92 | # - CANNOT use spaces 93 | 94 | # Use the ls() function to see what variables are available 95 | ls() 96 | 97 | ######################################## 98 | ## VECTORS, MATRICES, AND DATA FRAMES ## 99 | ######################################## 100 | 101 | # Create a vector of numbers from 1 through 10, access an index, 102 | # and sum all of them 103 | z <- seq(1:10) 104 | z <- 1:10 #this also works 105 | z[5] 106 | sum(z) 107 | double_z <- z^2 108 | 109 | # Create vectors of airports and capacities 110 | airports = c("BOS", "JFK", "ORD", "SFO", "ATL") 111 | capacities = c(20, 45, 50, 35, 55) 112 | 113 | # Place vectors together as a matrix using bind 114 | 115 | # bind together as columns 116 | cbind(airports, capacities) 117 | 118 | # bind together as rows 119 | rbind(airports, capacities) 120 | 121 | # Create a data frame 122 | df1 = data.frame(airports, capacities) 123 | 124 | # Add additional runways 125 | capacities = c(3, 2, 5, 1, 3) 126 | 127 | # Create another data frame 128 | df2 = data.frame(airports, capacities) 129 | 130 | # Append rows of the second data frame to those of the first 131 | df.runways = rbind(df1, df2) 132 | 133 | # Check out the class and structure of various variables 134 | class(airports) 135 | str(airports) 136 | 137 | class(capacities) 138 | str(capacities) 139 | 140 | class(df.runways) 141 | str(df.runways) 142 | # Notice that there are 5 different values for airports. 
These 143 | # fall under different "categories" or "factors" 144 | 145 | df.runways 146 | 147 | # Use data.frame$col to extract the column col from a data frame 148 | df.runways$airports 149 | 150 | # The summary function can often give you useful information 151 | summary(df.runways) 152 | summary(df.runways$airports) 153 | 154 | # Use the subset function to extract rows of interest from 155 | # a data frame (first argument is the data frame, second 156 | # argument is the criterion on which to select) 157 | runwaysBOS = subset(df.runways, airports=="BOS") 158 | runwaysBOS 159 | 160 | # Alternatively, since we know that rows 1 and 6 correspond 161 | # to BOS, we can extract runwaysBOS from df.runways as follows: 162 | runwaysBOS = df.runways[c(1,6), ] 163 | 164 | str(runwaysBOS) 165 | # Notice that even though we used subset and runwaysBOS only 166 | # has one factor level for the airports column, the str function 167 | # still tells us that there are 5 levels. We can fix this using the 168 | # factor function. 169 | 170 | runwaysBOS$airports = factor(runwaysBOS$airports) 171 | str(runwaysBOS) 172 | 173 | # Find the total runway capacity in Boston 174 | sum(runwaysBOS$capacities) 175 | 176 | ############################ 177 | ## WORKING WITH CSV FILES ## 178 | ############################ 179 | 180 | # Load csv files using read.csv 181 | # header = TRUE is usually ASSUMED, so not strictly necessary 182 | CEOcomp = read.csv(file = "CEOcomp.csv", header = TRUE) 183 | 184 | # Use str to look at variable names 185 | str(CEOcomp) 186 | 187 | # Use names() to extract column names 188 | names(CEOcomp) 189 | 190 | # Use the $ command to look at specific variables 191 | CEOcomp$Years 192 | CEOcomp$MBA 193 | 194 | # If you only have one dataset, you can attach the name of the 195 | # data frame. This isn't generally recommended practice, though! 196 | attach(CEOcomp) 197 | Years 198 | MBA 199 | detach(CEOcomp) 200 | 201 | #################################################### 202 | ## BASIC STATISTICS, PLOTTING, AND SUMMARY TABLES ## 203 | #################################################### 204 | 205 | # Calculate the mean, standard deviation, and other statistics 206 | mean(CEOcomp$Years) 207 | sd(CEOcomp$Years) 208 | summary(CEOcomp$Years) 209 | 210 | # Plot compensation versus years of experience 211 | plot(CEOcomp$Years, CEOcomp$TotalCompensation) 212 | 213 | # Plot with a title, x- and y-axis labels 214 | plot(CEOcomp$Years, CEOcomp$TotalCompensation, main="Total Compensation by Year", xlab = "Years of Experience", ylab = "Total Compensation (thousand USD)") 215 | 216 | # For other plots and information about the graphics package 217 | library(help = "graphics") 218 | 219 | # Create a table to summarize the data 220 | # Here, we look at mean CEO compensation, based on whether or not 221 | # the CEO has an MBA 222 | tapply(CEOcomp$TotalCompensation, CEOcomp$MBA, mean) 223 | 224 | # We can also create a table to look at counts 225 | table(CEOcomp$Year, CEOcomp$MBA) 226 | 227 | # In our dataset, how many CEOs have 7 years of experience and 228 | # an MBA? 229 | 230 | ############################### 231 | ## DEALING WITH MISSING DATA ## 232 | ############################### 233 | 234 | # Often in real datasets we encounter missing data. For instance, 235 | # in a survey, not all respondents might answer all questions. Here, 236 | # we will just remove any rows with any missing data (e.g., removing 237 | # respondents who did not answer all questions). 
More sophisticated 238 | # methods for dealing with missing data exist, but we will not go 239 | # into detail here. 240 | 241 | # Load the CEOmissing dataset. This is just the previous dataset 242 | # with some entries missing. 243 | CEOmissing = read.csv("CEOmissing.csv") 244 | 245 | # Use the summary function to see how much missing data there is. 246 | summary(CEOmissing) 247 | str(CEOmissing) 248 | 249 | # Let's remove all of the rows where there is an entry missing. (The entry is NA) 250 | # First note that we cannot use '==' to check if an element is an NA 251 | 5 == NA 252 | NA == NA 253 | 254 | # Instead, we use the is.na() function. 255 | is.na(5) 256 | is.na(NA) 257 | 258 | # Now let's only select rows where all of the data is present 259 | CEOnomissing = subset(CEOmissing, !is.na(TotalCompensation) & !is.na(Years) & !is.na(ChangeStockPrice) & !is.na(ChangeCompanySales) & !is.na(MBA)) 260 | summary(CEOnomissing) 261 | str(CEOnomissing) 262 | 263 | # Alternatively, we could use the na.omit function 264 | CEOomitmissing = na.omit(CEOmissing) 265 | summary(CEOomitmissing) 266 | str(CEOomitmissing) 267 | 268 | ################################ 269 | ## UNDERSTANDING R WORKSPACES ## 270 | ################################ 271 | 272 | # You may save an entire workspace, including variables using the 273 | # following command (alternatively, you can use the Workspace tab 274 | # in the menu bar): 275 | save.image("eg.RData") 276 | 277 | # To load, you can run the following: 278 | load("eg.RData") 279 | 280 | # You should save the image if you are working on a large project 281 | # and are taking a pause from working on it. This way, when you 282 | # come back to R, you can just load the workspace and continue 283 | # as before 284 | 285 | # You can also save individual variables as follows: 286 | save(CEOcomp, file = "CEOcomp.RData") 287 | 288 | # This is useful when the variable is given the result of 289 | # a computation that takes a lot of time (e.g., loading 290 | # very large data sets, result of running multiple SVMs, etc.) 291 | 292 | ################# 293 | ## ASSIGNMENTS ## 294 | ################# 295 | 296 | # 1a) Use the help function on seq to assign the variable 'evens' 297 | # to be the even numbers from 2 through 20, inclusive. 298 | 299 | 300 | 301 | # b) Propose an alternative way to get 'evens' to be the even 302 | # numbers from 2 through 20, inclusive, with perhaps more 303 | # than one command. Write down the commands. 304 | 305 | 306 | 307 | 308 | 309 | ## 310 | # 2a) Try out a few other basic statistics and graphing functions 311 | 312 | min(CEOcomp$Years) 313 | median(CEOcomp$Years) 314 | max(CEOcomp$Years) 315 | 316 | sum(CEOcomp$MBA) 317 | 318 | hist(CEOcomp$Years) 319 | boxplot(CEOcomp$Years) 320 | 321 | # b) Edit the histogram plot above to ensure that it has a title 322 | # and that the x-axis is labeled properly 323 | 324 | ## 325 | # 3) Use the tapply() function on df.runways to obtain a table 326 | # detailing the total capacity at each airport (Hint: use the sum() function) 327 | 328 | 329 | ## 330 | # 4a) Load the on-time performance dataset "otp.csv" 331 | 332 | 333 | 334 | 335 | # b) Take a look at the structure of the on-time performance dataset. This 336 | # dataset gives the on-time performance of airplanes in September of 2014. 337 | 338 | 339 | 340 | 341 | # c) Find the airport with the most departing flights during this time period. 
342 | # (Use the Origin column) 343 | 344 | 345 | 346 | # d**) Determine the ten airports that have the highest number of departing 347 | # and arriving flights. Use the "Origin" and "Dest" columns. Create a table 348 | # that contains the number of flights between these top ten airports. 349 | # (Hint: some of the following functions might be useful -- 350 | # summary, table, subset, factor, names, is.element, sort) 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | -------------------------------------------------------------------------------- /6-nonlinear-opt/IJulia intro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language": "Julia", 4 | "name": "", 5 | "signature": "sha256:c37017d552407ab6a927b8378934d1402d8da140a5d659517c98c5f30f8dd1a5" 6 | }, 7 | "nbformat": 3, 8 | "nbformat_minor": 0, 9 | "worksheets": [ 10 | { 11 | "cells": [ 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Introduction to IJulia" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## Navigating IJulia notebooks\n", 24 | "_Click `Help -> User Interface Tour`_\n", 25 | "\n", 26 | "**Think of the notebook as a document that can interact with your computer.** The document relies only on a modern browser for rendering. When you connect the document to a Julia kernel and terminal instance on a computer, however, the document can send any command to the computer and show any output (text or graphics). \n", 27 | "\n", 28 | "* Each notebook is composed of cells\n", 29 | "* Two modes:\n", 30 | " * Command Mode for creating or deleting cells, saving or renaming the notebook, and other application-level functions\n", 31 | " * Edit Mode for manipulating text in individual cells\n", 32 | "* Create a cell by:\n", 33 | " * Clicking `Insert -> Insert Cell`\n", 34 | " * Pressing `a` or `b` in Command Mode\n", 35 | " * Pressing `Alt+Enter` in Edit Mode\n", 36 | "* Delete a cell by:\n", 37 | " * Clicking `Edit -> Delete Cell`\n", 38 | " * Pressing `dd`\n", 39 | "* Execute a cell by:\n", 40 | " * Clicking `Cell -> Run`\n", 41 | " * Pressing `Ctrl+Enter`\n", 42 | "\n", 43 | "Other functions:\n", 44 | "* Undo last text edit with `Ctrl+z` in Edit Mode\n", 45 | "* Undo last cell manipulation with `z` in Command Mode\n", 46 | "* Save notebook with `Ctrl+s` in Edit Mode\n", 47 | "* Save notebook with `s` in Command Mode\n", 48 | "\n", 49 | "Though notebooks rely on your browser to work, they do not require an internet connection. The only online tool that is consistently used is MathJax (for math rendering).\n", 50 | "\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "### Get comfortable with the notebook\n", 58 | "Notebooks are designed to not be fragile. If you try to close a notebook with unsaved changes, the browser will warn you.\n", 59 | "\n", 60 | "Try the following exercises:" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | ">**\\[Exercise\\]**: Close/open\n", 68 | "\n", 69 | ">1. Save the notebook\n", 70 | ">2. Copy the address\n", 71 | ">3. Close the tab\n", 72 | ">4. Paste the address into a new tab (or re-open the last closed tab with `Ctrl+Shift+T` on Chrome)\n", 73 | "\n", 74 | ">_The document is still there, and the Julia kernel is still alive! 
Nothing is lost._" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | ">**\\[Exercise\\]**: Zoom\n", 82 | "\n", 83 | ">Try changing the magnification of the web page (`Ctrl+, Ctrl-` on Chrome).\n", 84 | "\n", 85 | ">_Text and math scale well (so do graphics if you use an SVG or PDF backend)._" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | ">**\\[Exercise\\]**: MathJax\n", 93 | ">1. Create a new cell.\n", 94 | ">2. Type an opening \\$, your favorite mathematical expression, and a closing \\$.\n", 95 | ">3. Run the cell to render the $\\LaTeX$ expression.\n", 96 | ">4. Right-click the rendered expression." 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Navigating Julia" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "Use the ``?name`` syntax to access the documentation for Julia functions" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "collapsed": false, 116 | "input": [ 117 | "?print" 118 | ], 119 | "language": "python", 120 | "metadata": {}, 121 | "outputs": [] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "collapsed": false, 126 | "input": [ 127 | "?sum" 128 | ], 129 | "language": "python", 130 | "metadata": {}, 131 | "outputs": [] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "The ``methods`` function lists all of the different implementations of a function depending on the input types.\n", 138 | "Click on the link to see the Julia source code." 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "collapsed": false, 144 | "input": [ 145 | "methods(lufact)" 146 | ], 147 | "language": "python", 148 | "metadata": {}, 149 | "outputs": [] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "The ``methodswith`` function lists all of the different functions which may be applied to a given type." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "collapsed": false, 161 | "input": [ 162 | "methodswith(Complex)" 163 | ], 164 | "language": "python", 165 | "metadata": {}, 166 | "outputs": [] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "Use tab completion to search for function names.\n", 173 | "Try ``eig`` for eigenvalues, ``read`` for file input" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "collapsed": false, 179 | "input": [ 180 | "eig" 181 | ], 182 | "language": "python", 183 | "metadata": {}, 184 | "outputs": [] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "collapsed": false, 189 | "input": [ 190 | "read" 191 | ], 192 | "language": "python", 193 | "metadata": {}, 194 | "outputs": [] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "## Plotting\n", 201 | "There are several Julia plotting packages. \n", 202 | "\n", 203 | "* [PyPlot.jl][4] is a Julia interface to Matplotlib, and should feel familiar to both MATLAB and Python users.\n", 204 | "* [Winston][3] and [Gadfly][1] are written entirely in Julia. 
Winston is for general-purpose 2D plotting, and Gadfly (inspired by ggplot2) concentrates on statistical graphics.\n", 205 | "* [Plotly supports Julia][2].\n", 206 | "\n", 207 | "[1]: https://github.com/dcjones/Gadfly.jl\n", 208 | "[2]: https://plot.ly/julia/\n", 209 | "[3]: https://github.com/nolta/Winston.jl\n", 210 | "[4]: https://github.com/stevengj/PyPlot.jl" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "collapsed": false, 216 | "input": [ 217 | "using PyPlot\n", 218 | "\n", 219 | "# Example from PyPlot documentation:\n", 220 | "x = linspace(0,2*pi,1000)\n", 221 | "y = sin(3*x + 4*cos(2*x))\n", 222 | "plot(x, y, color=\"red\", \n", 223 | " linewidth=2.0, \n", 224 | " linestyle=\"--\")\n", 225 | "title(\"A sinusoidally modulated sinusoid\");" 226 | ], 227 | "language": "python", 228 | "metadata": {}, 229 | "outputs": [] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "## Interactivity\n", 236 | "\n", 237 | "The [Interact](https://github.com/JuliaLang/Interact.jl) package enables interactivity in IJulia through the ``@manipulate`` macro." 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "collapsed": false, 243 | "input": [ 244 | "using Interact\n", 245 | "@manipulate for x in 0:0.01:\u03c0\n", 246 | " sin(x)\n", 247 | "end" 248 | ], 249 | "language": "python", 250 | "metadata": {}, 251 | "outputs": [] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "You can have multiple manipulators with continuous or discrete choices:" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "collapsed": false, 263 | "input": [ 264 | "@manipulate for x in 0:0.01:\u03c0, f in [:sin, :cos]\n", 265 | " if f == :sin\n", 266 | " sin(x)\n", 267 | " else\n", 268 | " cos(x)\n", 269 | " end\n", 270 | "end" 271 | ], 272 | "language": "python", 273 | "metadata": {}, 274 | "outputs": [] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "**Note**: only the final value is updated" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "collapsed": false, 286 | "input": [ 287 | "@manipulate for x in 0:0.01:\u03c0\n", 288 | " println(\"My input was $x\")\n", 289 | " sin(x)\n", 290 | "end" 291 | ], 292 | "language": "python", 293 | "metadata": {}, 294 | "outputs": [] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "You can embed a plot inside ``@manipulate`` for interactive visualizations." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "collapsed": false, 306 | "input": [ 307 | "f = figure()\n", 308 | "@manipulate for z in 0:0.01:1; withfig(f) do\n", 309 | " x = linspace(0,2\u03c0,1000)\n", 310 | " y = z*sin(x)\n", 311 | " ylim(-1,1)\n", 312 | " xlim(0,2\u03c0)\n", 313 | " plot(x, y, color=\"blue\", \n", 314 | " linewidth=2.0, \n", 315 | " linestyle=\"-\")\n", 316 | " end\n", 317 | "end" 318 | ], 319 | "language": "python", 320 | "metadata": {}, 321 | "outputs": [] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "Here's the same using the ``Gadfly`` package instead of ``PyPlot``." 
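,
    "\n",
    "(One difference worth noting: Gadfly's ``plot`` returns a plot object rather than drawing into a shared figure, so the ``withfig`` bookkeeping used above for PyPlot is unnecessary -- the ``@manipulate`` block can simply return the plot.)"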
328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "collapsed": false, 333 | "input": [ 334 | "using Gadfly" 335 | ], 336 | "language": "python", 337 | "metadata": {}, 338 | "outputs": [] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "collapsed": false, 343 | "input": [ 344 | "@manipulate for z in 0:0.01:1\n", 345 | " x = linspace(0,2\u03c0,1000)\n", 346 | " y = z*sin(x)\n", 347 | " Gadfly.plot(x=x,y=y, Geom.line, Scale.y_continuous(minvalue=-1, maxvalue=1), Scale.x_continuous(minvalue=0, maxvalue=2\u03c0))\n", 348 | "end" 349 | ], 350 | "language": "python", 351 | "metadata": {}, 352 | "outputs": [] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | ">**\\[Exercise\\]**: Gaussian density\n", 359 | "\n", 360 | "> Plot the Gaussian density $\\frac{1}{\\sigma \\sqrt{2\\pi} } e^{ -\\frac{(x-\\mu)^2}{2\\sigma^2} }$ with manipulators for both the mean $\\mu$ and standard deviation $\\sigma$\n" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "## Sharing notebooks" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "Notebooks are self contained and standalone (unless they explicitly ``include()`` other Julia code). You can email them to friends, professors, and even post them online in viewable read-only form.\n", 375 | "\n", 376 | "_Click `File -> Download as -> IPython Notebook (.ipynb)` to save a copy of the notebook._" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | ">**\\[Exercise\\]**: Share your code\n", 384 | "\n", 385 | "> 1. Create a new notebook with some text, figures, and code.\n", 386 | "> 2. Save it to your desktop.\n", 387 | "> 3. Open the .ipynb file with a text editor, select all the text and copy it to the clipboard.\n", 388 | "> 4. Open [gist.github.com](https://gist.github.com/) and paste the text.\n", 389 | "> 5. Name the file foo.ipynb and click \"Create public Gist\".\n", 390 | "> 6. On the next page, click on \"Raw\".\n", 391 | "> 7. Copy the URL to the clipboard and open a new tab with [nbviewer](http://nbviewer.ipython.org/).\n", 392 | "> 8. Paste the URL to the raw ipynb into the text box and click \"Go!\"\n", 393 | "\n", 394 | "> _You now have an emailable link to share your notebook. An installation of IJulia is not required to view it!_\n", 395 | "\n", 396 | "The ``http://nbviewer.ipython.org/urls/..`` link is permanent so long as the original source (gist) exists." 
397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "-------\n", 404 | "\n", 405 | "Some content in this notebook was adapted from materials by [Jonas Kersulis](https://github.com/kersulis/IJulia-WPS)" 406 | ] 407 | } 408 | ], 409 | "metadata": {} 410 | } 411 | ] 412 | } -------------------------------------------------------------------------------- /6-nonlinear-opt/Nonlinear-DCP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language": "Julia", 4 | "name": "", 5 | "signature": "sha256:27ed86b6429b1263a82aa23b801efdfdbf55ce8097de482f33eda074de9519dc" 6 | }, 7 | "nbformat": 3, 8 | "nbformat_minor": 0, 9 | "worksheets": [ 10 | { 11 | "cells": [ 12 | { 13 | "cell_type": "heading", 14 | "level": 2, 15 | "metadata": {}, 16 | "source": [ 17 | "Convex optimization" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "So far we've been thinking about general nonlinear optimization problems of the form\n", 25 | "\n", 26 | "\\begin{align}\n", 27 | "\\min \\quad&f(x)\\\\\n", 28 | "\\text{s.t.} \\quad& g(x) = 0, \\\\\n", 29 | "& h(x) \\leq 0.\n", 30 | "\\end{align}\n", 31 | "\n", 32 | "and derivative-based methods to solve them.\n", 33 | "\n", 34 | "A special class of nonlinear optimization problems are *convex* optimization problems where $f$ and $h$ are convex and $g$ is affine. Under some additional regularity assumptions, much of the duality theory from linear programming can be extended to convex optimization, and there exist efficient (polynomial-time) algorithms to solve these problems. With few exceptions, if your problem is convex, you can expect to be able to solve it efficiently." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Detecting convexity\n", 42 | "\n", 43 | "A function $f: \\mathbb{R}^n \\to \\mathbb{R}$ is convex iff $f(\\theta x + (1-\\theta)y) \\leq \\theta f(x) + (1-\\theta)f(y), \\forall x,y \\in \\mathbb{R}^n \\text{ and } \\theta \\in [0,1]$.\n", 44 | "\n", 45 | "Given an arbitrary function $f$, detecting if $f$ is convex is [NP-Hard](http://web.mit.edu/~a_a_a/Public/Publications/convexity_nphard.pdf). So how do we know if a problem is convex?\n", 46 | "\n", 47 | "A reasonable approach is to make sure that a model is built-up in a manner that lets us prove convexity by using a calculus of convex analysis; this is **Disciplined Convex Programming** (DCP).\n", 48 | "\n", 49 | "We start with operations that are known to be convex:\n", 50 | "- Norms (why?)\n", 51 | "- $\\exp(\\cdot)$\n", 52 | "- $-\\log(\\cdot)$\n", 53 | "- $x^p$ for $p \\geq 1$ and $x \\geq 0$.\n", 54 | "- $1/x$ for $x > 0$\n", 55 | "- $x^2$\n", 56 | "- ...\n", 57 | "\n", 58 | "Then add composition rules, e.g., $f(g(\\cdot))$ is convex when $f$ is convex and\n", 59 | "- $g$ is linear or affine\n", 60 | "- $f$ is monotonic increasing and $g$ is convex\n", 61 | "\n", 62 | "Also, $f_1+f_2$ and $\\max\\{f_1,f_2\\}$ are convex when $f_1$ and $f_2$ are convex.\n", 63 | "\n", 64 | "So our previous example of $x^2 - \\log(x)$ is convex by these rules, because it is the sum of convex functions. So is $\\max\\{e^x,1/x\\}$ ([plot](http://www.wolframalpha.com/input/?i=max%28exp%28x%29%2C1%2Fx%29+for+x+%3E+0)).\n", 65 | "\n", 66 | "Note that these rules are *sufficient* but not *necessary* to prove convexity. 
\n", 67 | "\n", 68 | "There are a lot of existing materials on DCP which we won't try to reproduce here. Let's head over to http://dcp.stanford.edu/." 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | ">**\\[Exercise\\]**: DCP Quiz\n", 76 | "\n", 77 | "> Play the [DCP quiz](http://dcp.stanford.edu/quiz). Turn up the difficulty to hard for extra fun!" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Solving \"DCP-compliant\" problems\n", 85 | "\n", 86 | "DCP rules are useful not just for proving convexity, but also for *solving* the problems.\n", 87 | "\n", 88 | "For example, we (should) know that the following problem\n", 89 | "\\begin{align}\n", 90 | "\\min \\quad& {||}x||_1\\\\\n", 91 | "\\text{s.t.} \\quad& Ax = b, \\\\\n", 92 | "& x \\geq 0,\n", 93 | "\\end{align}\n", 94 | "\n", 95 | "where $||x||_1 = \\sum_i |x_i|$ can be solved by using linear programming.\n", 96 | "\n", 97 | "Just introduce auxiliary variables $z$ and solve\n", 98 | "\\begin{align}\n", 99 | "\\min \\quad& \\sum_i z_i\\\\\n", 100 | "\\text{s.t.} \\quad&z_i \\geq x_i, \\forall i\\\\\n", 101 | "& z_i \\geq -x_i, \\forall i\\\\\n", 102 | "& Ax = b, \\\\\n", 103 | "& x \\geq 0,\n", 104 | "\\end{align}\n", 105 | "\n", 106 | "Similarly\n", 107 | "\\begin{align}\n", 108 | "\\min \\quad& {||}x||_\\infty\\\\\n", 109 | "\\text{s.t.} \\quad& Ax = b, \\\\\n", 110 | "& x \\geq 0,\n", 111 | "\\end{align}\n", 112 | "\n", 113 | "where $||x||_\\infty = \\max\\{|x_1|,\\cdots,|x_n|\\}$ can be formulated as\n", 114 | "\n", 115 | "\\begin{align}\n", 116 | "\\min \\quad& z\\\\\n", 117 | "\\text{s.t.} \\quad&z \\geq x_i, \\forall i\\\\\n", 118 | "& z \\geq -x_i, \\forall i\\\\\n", 119 | "& Ax = b, \\\\\n", 120 | "& x \\geq 0,\n", 121 | "\\end{align}\n", 122 | "\n", 123 | "(What do we do when $||\\cdot||_1$ and $||\\cdot||_\\infty$ appear in convex constraints?)\n", 124 | "\n", 125 | "Given these results, we might say that $||\\cdot||_1$ and $||\\cdot||_\\infty$ are *LP-representable*, in a sense that can be made rigorous.\n", 126 | "\n", 127 | "What about $||\\cdot||_2$? It's SOCP (second-order conic programming) representable, since\n", 128 | "$$\n", 129 | "||x||_2 \\leq t\n", 130 | "$$\n", 131 | "is precisely a second-order conic constraint that's already supported by Gurobi, CPLEX, MOSEK, ECOS, SCS, ...\n", 132 | "\n", 133 | "What about $1/x$? It's also SOCP representable since\n", 134 | "$$\n", 135 | "1/x \\leq t\n", 136 | "$$\n", 137 | "iff\n", 138 | "$$\n", 139 | "||(2,x-t)||_2 \\leq x+t.\n", 140 | "$$\n", 141 | "\n", 142 | "It turns out that [A LOT](http://docs.mosek.com/generic/modeling-letter.pdf) of common convex functions are SOCP-representable.\n", 143 | "\n", 144 | "Once we know how to represent basic operations using LPs or SOCPs, we can easily compose them. For example, we would represent\n", 145 | "\n", 146 | "\\begin{align}\n", 147 | "\\min \\quad& \\max\\{||Cx-d||,1/x_1\\}\\\\\n", 148 | "\\text{s.t.} \\quad& Ax = b, \\\\\n", 149 | "& x \\geq 0,\n", 150 | "\\end{align}\n", 151 | "\n", 152 | "as\n", 153 | "\n", 154 | "\\begin{align}\n", 155 | "\\min \\quad& t\\\\\n", 156 | "\\text{s.t.} \\quad& t \\geq z_1 \\\\\n", 157 | "&t \\geq z_2\\\\\n", 158 | "&{||}Cx-d|| \\leq z_1\\\\\n", 159 | "&{||}(2,x_1-z_2)|| \\leq x_1+z_2\\\\\n", 160 | "& Ax = b, \\\\\n", 161 | "& x \\geq 0,\n", 162 | "\\end{align}\n", 163 | "\n", 164 | "and hand the problem off to Gurobi as an SOCP." 
165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "### DCP in summary\n", 172 | "\n", 173 | "- Represent the model in a way that makes it easy to use DCP rules to prove convexity.\n", 174 | "- Break down the individual pieces into parts that are representable using LP, SOCP, semidefinite programming, or exponential cones\n", 175 | "- Use composition rules to *automatically* generate a complete formulation that can be given to existing solvers\n", 176 | "- Note that derivatives aren't used anywhere!\n", 177 | "\n", 178 | "The first implementation of DCP was [CVX](http://cvxr.com/cvx/) in MATLAB. More recently, it's been implemented in [cvxpy](https://github.com/cvxgrp/cvxpy) and [Convex.jl](https://github.com/JuliaOpt/Convex.jl)." 179 | ] 180 | }, 181 | { 182 | "cell_type": "heading", 183 | "level": 2, 184 | "metadata": {}, 185 | "source": [ 186 | "Support Vector Machines (SVM)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "[Support vector machines](http://en.wikipedia.org/wiki/Support_vector_machine) are a popular model in machine learning for classification. We'll use this example to illustrate the basic use of Convex.jl.\n", 194 | "\n", 195 | "The basic problem is that we are given a set of N points $x_1,x_2,\\ldots, x_N \\in \\mathbb{R}^n$ and labels $y_1, y_2, \\ldots, y_N \\in \\{-1,+1\\}$. We want to find a hyperplane of the form $w^Tx-b = 0$ that *separates* the two classes, i.e. $w^Tx_i - b \\geq 1$ when $y_i = +1$ and $w^Tx_i - b \\leq -1$ when $y_i = -1$. This condition can be written as $y_i(w^Tx_i - b) \\geq 1, \\forall\\, i$.\n", 196 | "\n", 197 | "Such a hyperplane will not exist in general if the data overlap, so instead we'll just try to minimize violations of the constraint $y_i(w^Tx_i - b) \\geq 1, \\forall\\, i$ by adding a penalty when it is violated. The optimization problem can be stated as\n", 198 | "$$\n", 199 | "\\min_{w,b} \\sum_{i=1}^N \\left[\\max\\{0, 1 - y_i(w^Tx_i - b)\\}\\right] + \\gamma ||w||_2^2\n", 200 | "$$\n", 201 | "Note that we penalize the norm of $w$ in order to guarantee a unique solution.\n", 202 | "\n", 203 | "Now let's write our own SVM solver!" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "collapsed": false, 209 | "input": [ 210 | "using Distributions\n", 211 | "using PyPlot\n", 212 | "using Convex\n", 213 | "using ECOS" 214 | ], 215 | "language": "python", 216 | "metadata": {}, 217 | "outputs": [] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "collapsed": false, 222 | "input": [ 223 | "# Function to generate some random test data\n", 224 | "function gen_data(N)\n", 225 | " # for +1 data, symmetric multivariate normal with center at (1,2)\n", 226 | " pos = rand(MvNormal([1.0,2.0],1.0),N)\n", 227 | " # for -1 data, symmetric multivariate normal with center at (-1,1)\n", 228 | " neg = rand(MvNormal([-1.0,1.0],1.0),N)\n", 229 | " x = [pos neg]\n", 230 | " y = [fill(+1,N),fill(-1,N)]\n", 231 | " return x,y\n", 232 | "end" 233 | ], 234 | "language": "python", 235 | "metadata": {}, 236 | "outputs": [] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "Let's see what the data look like." 
243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "collapsed": false, 248 | "input": [ 249 | "x,y = gen_data(100)\n", 250 | "plot(x[1,1:100], x[2,1:100], \"ro\", x[1,101:200], x[2,101:200], \"bo\");" 251 | ], 252 | "language": "python", 253 | "metadata": {}, 254 | "outputs": [] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Now we translate the optimization problem into Convex.jl form." 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "collapsed": false, 266 | "input": [ 267 | "const \u03b3 = 0.005\n", 268 | "function svm_convex(x,y)\n", 269 | " n = size(x,1) # problem dimension\n", 270 | " N = size(x,2) # number of points\n", 271 | " w = Variable(n)\n", 272 | " b = Variable()\n", 273 | " \n", 274 | " problem = minimize( \u03b3*sum_squares(w) + sum(max(1-y.*(x'*w-b),0)))\n", 275 | " solve!(problem, ECOSSolver())\n", 276 | " return evaluate(w), evaluate(b)\n", 277 | "end" 278 | ], 279 | "language": "python", 280 | "metadata": {}, 281 | "outputs": [] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "And the solution?" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "collapsed": false, 293 | "input": [ 294 | "N = 1000\n", 295 | "x,y = gen_data(N)\n", 296 | "\n", 297 | "plot(x[1,1:N], x[2,1:N], \"ro\", x[1,(N+1):2N], x[2,(N+1):2N], \"bo\");\n", 298 | "w,b = svm_convex(x,y)\n", 299 | "\n", 300 | "@show w,b\n", 301 | "\n", 302 | "xmin, xmax = xlim()\n", 303 | "ymin, ymax = ylim()\n", 304 | "y1 = (1+b-w[1]*xmin)/w[2]\n", 305 | "y2 = (1+b-w[1]*xmax)/w[2]\n", 306 | "plot([xmin,xmax], [y1,y2], \"k-\");\n", 307 | "y1 = (-1+b-w[1]*xmin)/w[2]\n", 308 | "y2 = (-1+b-w[1]*xmax)/w[2]\n", 309 | "plot([xmin,xmax], [y1,y2], \"k-\");\n", 310 | "y1 = (b-w[1]*xmin)/w[2]\n", 311 | "y2 = (b-w[1]*xmax)/w[2]\n", 312 | "ylim(ymin,ymax)\n", 313 | "plot([xmin,xmax], [y1,y2], \"k-\");" 314 | ], 315 | "language": "python", 316 | "metadata": {}, 317 | "outputs": [] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | ">**\\[Exercise\\]**: Sensitivity\n", 324 | "\n", 325 | "> Increase the separation between the positive and negative data by modifying the means in ``gen_data``. How does the solution change?\n" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | ">**\\[Exercise\\]**: JuMP version\n", 333 | "\n", 334 | "> Translate the Convex.jl model into a JuMP model with linear constraints and a quadratic objective. For example, ``sum_squares(w)`` becomes ``sum{w[i]^2,i=1:n}``. Hint: the formulation is given on Wikipedia. (You may want to use ``IpoptSolver`` since ``ECOSSolver`` supports second-order conic constraints but won't directly accept quadratic objectives.)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "### Discussion\n", 342 | "\n", 343 | "- Convex.jl vs. JuMP\n", 344 | "- Derivative-based nonlinear vs. 
automatic transformation to LP/SOCP/conic form" 345 | ] 346 | } 347 | ], 348 | "metadata": {} 349 | } 350 | ] 351 | } -------------------------------------------------------------------------------- /6-nonlinear-opt/Nonlinear-DualNumbers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language": "Julia", 4 | "name": "", 5 | "signature": "sha256:7da62c4657919dcad2e91d6a7b71dd3c79a52667ac61d8a9cf06a1c2ce7230c2" 6 | }, 7 | "nbformat": 3, 8 | "nbformat_minor": 0, 9 | "worksheets": [ 10 | { 11 | "cells": [ 12 | { 13 | "cell_type": "heading", 14 | "level": 1, 15 | "metadata": {}, 16 | "source": [ 17 | "Computing derivatives for nonlinear optimization: Forward mode automatic differentiation" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "Consider a general constrained nonlinear optimization problem:\n", 25 | "$$\n", 26 | "\\begin{align}\n", 27 | "\\min \\quad&f(x)\\\\\n", 28 | "\\text{s.t.} \\quad& g(x) = 0, \\\\\n", 29 | "& h(x) \\leq 0.\n", 30 | "\\end{align}\n", 31 | "$$\n", 32 | "where $f : \\mathbb{R}^n \\to \\mathbb{R}, g : \\mathbb{R}^n \\to \\mathbb{R}^r$, and $h: \\mathbb{R}^n \\to \\mathbb{R}^s$.\n", 33 | "\n", 34 | "When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution, otherwise typically we can only ask for a locally optimal solution.\n", 35 | "\n", 36 | "What approaches can we use to solve this?\n", 37 | " - When $r=0$ and $s = 0$ (unconstrained), and $f$ differentiable, the most classical approach is [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent), along with fancier methods like [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method) and quasi-Newton methods like [BFGS](http://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm).\n", 38 | " - When $f$ differentiable and $g$ and $h$ linear, [gradient projection](http://neos-guide.org/content/gradient-projection-methods)\n", 39 | " - When $f$, $g$, and $h$ differentiable, [sequential quadratic programming](http://www.neos-guide.org/content/sequential-quadratic-programming)\n", 40 | " - When $f$, $g$, and $h$ twice differentiable, [interior-point methods](http://en.wikipedia.org/wiki/Interior_point_method)\n", 41 | " - When derivatives \"not available\", [derivative-free optimization](http://rd.springer.com/article/10.1007/s10898-012-9951-y)\n", 42 | " \n", 43 | "This is not meant to be an exhaustive list; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## How are derivatives computed?\n", 51 | "\n", 52 | "- Hand-written by applying chain rule\n", 53 | "- Finite difference approximation $\\frac{\\partial f}{\\partial x_i} = \\lim_{h\\to 0} \\frac{f(x+h e_i)-f(x)}{h}$\n", 54 | "- **Automatic differentiation**\n", 55 | " - Idea: Automatically transform the code that computes a function into code that computes its derivatives" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Dual Numbers\n", 63 | "\n", 64 | "Consider numbers of the form $x + y\\epsilon$ with $x,y \\in \\mathbb{R}$. We *define* $\\epsilon^2 = 0$, so\n", 65 | "$$\n", 66 | "(x_1 + y_1\\epsilon)(x_2+y_2\\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\\epsilon.\n", 67 | "$$\n", 68 | "These are called the *dual numbers*. 
Think of $\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).\n", 69 | "\n", 70 | "If we are given an infinitely differentiable function in Taylor expanded form\n", 71 | "$$\n", 72 | "f(x) = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a)^k\n", 73 | "$$\n", 74 | "it follows that \n", 75 | "$$\n", 76 | "f(x+y\\epsilon) = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a+y\\epsilon)^k = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a)^k + y\\epsilon\\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!}\\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\\epsilon\n", 77 | "$$\n", 78 | "\n", 79 | "Let's unpack what's going on here. We started with a function $f : \\mathbb{R} \\to \\mathbb{R}$. Dual numbers are *not* real numbers, so it doesn't even make sense to ask for the value $f(x+y\\epsilon)$ given $x+y\\epsilon \\in \\mathbb{D}$ (the set of dual numbers). But we plugged the dual number into the Taylor expansion anyway, and by using the algebra rule $\\epsilon^2 = 0$ we found that $f(x+y\\epsilon)$ must be equal to $f(x) + yf'(x)\\epsilon$ if we use the Taylor expansion as the definition of $f : \\mathbb{D} \\to \\mathbb{D}$." 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "Alternatively, for any once differentiable function $f : \\mathbb{R} \\to \\mathbb{R}$, we can *define* its extension to the dual numbers as\n", 87 | "$$\n", 88 | "f(x+y\\epsilon) = f(x) + yf'(x)\\epsilon.\n", 89 | "$$\n", 90 | "This is essentially equivalent to the previous definition.\n", 91 | "\n", 92 | "Let's verify a very basic property, the chain rule, using this definition.\n", 93 | "\n", 94 | "Suppose $h(x) = f(g(x))$. Then,\n", 95 | "$$\n", 96 | "h(x+y\\epsilon) = f(g(x+y\\epsilon)) = f(g(x) + yg'(x)\\epsilon) = f(g(x)) + yg'(x)f'(g(x))\\epsilon = h(x) + yh'(x)\\epsilon.\n", 97 | "$$\n", 98 | "\n", 99 | "Maybe that's not too surprising, but it's actually quite a useful observation." 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "### Implementation\n", 107 | "\n", 108 | "Dual numbers are implemented in the [DualNumbers](https://github.com/JuliaDiff/DualNumbers.jl) package in Julia." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "collapsed": false, 114 | "input": [ 115 | "using DualNumbers" 116 | ], 117 | "language": "python", 118 | "metadata": {}, 119 | "outputs": [] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "You construct $x + y\\epsilon$ with ``Dual(x,y)``. 
The real and epsilon components are accessed as ``real(d)`` and ``epsilon(d)``:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "collapsed": false, 131 | "input": [ 132 | "d = Dual(2.0,1.0)" 133 | ], 134 | "language": "python", 135 | "metadata": {}, 136 | "outputs": [] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "collapsed": false, 141 | "input": [ 142 | "typeof(d)" 143 | ], 144 | "language": "python", 145 | "metadata": {}, 146 | "outputs": [] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "collapsed": false, 151 | "input": [ 152 | "real(d)" 153 | ], 154 | "language": "python", 155 | "metadata": {}, 156 | "outputs": [] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "collapsed": false, 161 | "input": [ 162 | "epsilon(d)" 163 | ], 164 | "language": "python", 165 | "metadata": {}, 166 | "outputs": [] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "How is addition of dual numbers defined?" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "collapsed": false, 178 | "input": [ 179 | "@which d+Dual(3.0,4.0)" 180 | ], 181 | "language": "python", 182 | "metadata": {}, 183 | "outputs": [] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "Clicking on the link, we'll see:\n", 190 | "```julia\n", 191 | "+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))\n", 192 | "```" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "Multiplication?" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "collapsed": false, 205 | "input": [ 206 | "Dual(2.0,2.0)*Dual(3.0,4.0)" 207 | ], 208 | "language": "python", 209 | "metadata": {}, 210 | "outputs": [] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "collapsed": false, 215 | "input": [ 216 | "@which Dual(2.0,2.0)*Dual(3.0,4.0)" 217 | ], 218 | "language": "python", 219 | "metadata": {}, 220 | "outputs": [] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "The code is:\n", 227 | "```julia\n", 228 | "*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))\n", 229 | "```" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "Basic univariate functions?" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "collapsed": false, 242 | "input": [ 243 | "log(Dual(2.0,1.0))" 244 | ], 245 | "language": "python", 246 | "metadata": {}, 247 | "outputs": [] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "collapsed": false, 252 | "input": [ 253 | "1/2.0" 254 | ], 255 | "language": "python", 256 | "metadata": {}, 257 | "outputs": [] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "How is this implemented?" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "collapsed": false, 269 | "input": [ 270 | "@code_lowered log(Dual(2.0,1.0))" 271 | ], 272 | "language": "python", 273 | "metadata": {}, 274 | "outputs": [] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "Trig functions?" 
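, "\n", "Before peeking, the chain rule tells us what to expect; a hedged guess at the definitions (the actual ``DualNumbers`` source may differ in details):\n", "\n", "```julia\n", "sin(z::Dual) = dual(sin(real(z)),  epsilon(z)*cos(real(z)))\n", "cos(z::Dual) = dual(cos(real(z)), -epsilon(z)*sin(real(z)))\n", "```"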
281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "collapsed": false, 286 | "input": [ 287 | "@code_lowered sin(Dual(2.0,1.0))" 288 | ], 289 | "language": "python", 290 | "metadata": {}, 291 | "outputs": [] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "## Computing derivatives of functions" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "We can define a function in Julia as:" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "collapsed": false, 310 | "input": [ 311 | "f(x) = x^2 - log(x)\n", 312 | "# Or equivalently\n", 313 | "function f(x)\n", 314 | " return x^2 - log(x)\n", 315 | "end" 316 | ], 317 | "language": "python", 318 | "metadata": {}, 319 | "outputs": [] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | ">**\\[Exercise\\]**: Differentiate it!\n", 326 | "\n", 327 | "> 1. Evaluate $f$ at $1 + \\epsilon$. What are $f(1)$ and $f'(1)$?\n", 328 | "> 2. Evaluate $f$ at $\\frac{1}{\\sqrt{2}} + \\epsilon$. What are $f(\\frac{1}{\\sqrt{2}})$ and $f'(\\frac{1}{\\sqrt{2}})$?\n", 329 | "> 3. Define a new function ``fprime`` which returns the derivative of ``f`` by using ``DualNumbers``.\n", 330 | "> 4. Use the finite difference formula $$\n", 331 | "f'(x) \\approx \\frac{f(x+h)-f(x)}{h}\n", 332 | "$$\n", 333 | "to evaluate $f'(\\frac{1}{\\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using ``@manipulate``, plots, or both!" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### How general is it?\n", 341 | "\n", 342 | "Recall [Newton's iterative method](http://en.wikipedia.org/wiki/Newton%27s_method) for finding zeros:\n", 343 | "$$\n", 344 | "x \\leftarrow x - \\frac{f(x)}{f'(x)}\n", 345 | "$$\n", 346 | "until $f(x) \\approx 0$." 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "Let's use this method to compute $\\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$.\n", 354 | "So $f'(z) = 2z$, and we can implement the algorithm as:" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "collapsed": false, 360 | "input": [ 361 | "function squareroot(x)\n", 362 | " z = x # Initial starting point\n", 363 | " while abs(z*z - x) > 1e-13\n", 364 | " z = z - (z*z-x)/(2z)\n", 365 | " end\n", 366 | " return z\n", 367 | "end" 368 | ], 369 | "language": "python", 370 | "metadata": {}, 371 | "outputs": [] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "collapsed": false, 376 | "input": [ 377 | "squareroot(100)" 378 | ], 379 | "language": "python", 380 | "metadata": {}, 381 | "outputs": [] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "Can we differentiate this code? 
**Yes!**" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "collapsed": false, 393 | "input": [ 394 | "d = squareroot(Dual(100.0,1.0))" 395 | ], 396 | "language": "python", 397 | "metadata": {}, 398 | "outputs": [] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "collapsed": false, 403 | "input": [ 404 | "epsilon(d) # Computed derivative" 405 | ], 406 | "language": "python", 407 | "metadata": {}, 408 | "outputs": [] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "collapsed": false, 413 | "input": [ 414 | "1/(2*sqrt(100)) # The exact derivative" 415 | ], 416 | "language": "python", 417 | "metadata": {}, 418 | "outputs": [] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "collapsed": false, 423 | "input": [ 424 | "abs(epsilon(d)-1/(2*sqrt(100)))" 425 | ], 426 | "language": "python", 427 | "metadata": {}, 428 | "outputs": [] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "### Multivariate functions?\n", 435 | "\n", 436 | "Dual numbers can be used to compute the gradient of a function $f: \\mathbb{R}^n \\to \\mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is [implemented](https://github.com/JuliaOpt/Optim.jl/blob/583907676b5b99cdb2d4cba37f6026a3fe620a49/src/autodiff.jl) in [Optim](https://github.com/JuliaOpt/Optim.jl) with the ``autodiff=true`` keyword." 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "collapsed": false, 442 | "input": [ 443 | "using Optim\n", 444 | "rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2\n", 445 | "optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)" 446 | ], 447 | "language": "python", 448 | "metadata": {}, 449 | "outputs": [] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "When $n$ is large, there's an alternative procedure called [reverse-mode automatic differentiation](http://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation) which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in [ReverseDiffSparse](https://github.com/mlubin/ReverseDiffSparse.jl))." 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "## Conclusions\n", 463 | "\n", 464 | "- We can compute numerically exact derivatives of any differentiable function that is implemented as a sequence of basic operations.\n", 465 | "- In Julia it's very easy to use dual numbers for this!\n", 466 | "- Reconsider when derivatives are \"not available.\"\n", 467 | "\n", 468 | "This was just an introduction to one technique from the area of automatic differentiation. For more references, see [autodiff.org](http://www.autodiff.org/?module=Introduction&submenu=Selected%20Books)." 
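, "\n", "As a minimal sketch tying it together (essentially a solution to the ``fprime`` exercise above, so consider it a spoiler):\n", "\n", "```julia\n", "using DualNumbers\n", "f(x) = x^2 - log(x)\n", "fprime(x) = epsilon(f(Dual(x, 1.0)))   # derivative = epsilon component at seed 1.0\n", "fprime(1.0)   # returns 1.0, matching f'(x) = 2x - 1/x at x = 1\n", "```"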
469 | ] 470 | } 471 | ], 472 | "metadata": {} 473 | } 474 | ] 475 | } -------------------------------------------------------------------------------- /7-adv-optimization/Callbacks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language": "Julia", 4 | "name": "", 5 | "signature": "sha256:65da469e9272a8bf11bfe753e257d8533eb871dfe5aa21aab4e553fa067b5292" 6 | }, 7 | "nbformat": 3, 8 | "nbformat_minor": 0, 9 | "worksheets": [ 10 | { 11 | "cells": [ 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Callbacks in Integer Programming\n", 17 | "\n", 18 | "As we discussed in the beginning, MIP solvers are complicated combinations of many techniques: cutting planes, heuristics, branching rules, etc.\n", 19 | "\n", 20 | "Some solvers allow you to customize aspects of the solve process in a deeper way than just setting options for these parameters. You can provide code to be run when certain events happen, and the solver **calls back** to these functions to ask what action(s) should be taken. Why might you want to do this?\n", 21 | "\n", 22 | "* The solver is struggling to find an integer solution. You know an efficient way to take a fractional solution and convert it to a good, if not optimal, integer solution. You can put this algorithm inside a **heuristic callback** that is called whenever a new fractional solution is found.\n", 23 | "* You have done an analysis of the structure of your MIP and have realized that you can find constraints that will cut off fractional solutions so that your LP relaxation is closer to integer points. You can write this as a **cut callback**.\n", 24 | "\n", 25 | "The particular example we will look at today is in some ways even more critical than these two types, because it enables whole types of problems to be solved that would be very difficult otherwise. In particular, consider a problem that has a very large number of constraints, **most of which will not be binding at the optimal solution**. This suggests that we probably don't need all those constraints to be provided explicitly to the solver - instead, we can provide them implicitly with a **lazy constraint/cut callback**.\n", 26 | "\n", 27 | "*On board: flow chart*" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Application: Robust Portfolio Optimization\n", 35 | "\n", 36 | "Portfolio optimization is the problem of constructing a portfolio of assets to maximize returns, but usually with some consideration towards the risk of the portfolio. If we maximize return, we will usually also have the highest chance of losing money. On the other hand, there is often a (very) low risk option that has minimal returns (e.g. US government bonds). 
We seek to construct optimization models that let us explore this spectrum of options.\n", 37 | "\n", 38 | "The \"stochastic programming\" approach would estimate a probability distribution from data for each asset we are considering purchasing, and then we can do things like\n", 39 | "\n", 40 | "- minimize $StdDev[Profit]$, subject to $E[Profit] \\geq P_{min}$\n", 41 | "- maximize $E[Profit]$, subject to $StdDev[Profit] \\leq S_{max}$\n", 42 | "\n", 43 | "**Robust optimization** is an alternative method that, instead of saying that the uncertain returns of the assets come from probability distributions, says the returns are drawn from a bounded set of outcomes: an **uncertainty set**.\n", 44 | "\n", 45 | "### Setting up the Problem\n", 46 | "\n", 47 | "We will consider the following robust portfolio problem.\n", 48 | "\n", 49 | "- Let $0 \\leq x_i \\leq 1$ be the share of our money we put into asset $i$.\n", 50 | " - We need the additional constraint then that $\\mathbf{e}^\\prime \\mathbf{x} = 1$\n", 51 | " - We'll also impose a restriction that we can use no more than a quarter of the assets available.\n", 52 | " - Let $y_i \\in \\{0,1\\}$, $y_i = 1 \\iff x_i > 0$, and $\\mathbf{e}^\\prime \\mathbf{y} \\leq \\frac{N}{4}$\n", 53 | "\n", 54 | "- Let $p_i$ be the uncertain profit for asset $i$, with $\\mathbf{p}\\in U$, where...\n", 55 | "\n", 56 | "- $U$ is our uncertainty set. By varying the size and shape of the uncertainty set $U$ we can trade off between expected return and the worst-case return. We will assume we have (as data)\n", 57 | " - $\\bar{p}_i$, the expected return of each asset\n", 58 | " - $\\sigma_i$, the standard deviation of return for each asset\n", 59 | "\n", 60 | "We will use the **ellipsoidal uncertainty set**\n", 61 | "\n", 62 | "$$ U^\\Gamma = \\left\\{ \\mathbf{p} \\mid p_i = \\bar{p}_i + \\sigma_i d_i, \\|\\mathbf{d}\\|\\leq \\Gamma \\right\\}$$\n", 63 | "\n", 64 | "*on board: diagram*\n", 65 | "\n", 66 | "So we can write out our problem now as\n", 67 | "\n", 68 | "$$\n", 69 | "\\max_{z, \\mathbf{x}\\geq \\mathbf{0}} z \\quad \\text{subject to}\\\\\n", 70 | "z \\leq \\mathbf{p}^\\prime \\mathbf{x} \\quad \\forall \\mathbf{p} \\in U \\\\\n", 71 | "\\mathbf{e}^\\prime \\mathbf{x} = 1 \\\\\n", 72 | "y_i \\geq x_i \\\\\n", 73 | "\\mathbf{e}^\\prime \\mathbf{y} \\leq \\frac{N}{4}\n", 74 | "$$\n", 75 | "\n", 76 | "The problem with the first constraint is that it is actually an **infinite** number of constraints - one for every possible value of $\\mathbf{p}$. We conjecture though that only a small number of them are needed to get a solution that \"mostly\" satisfies that constraint. We'll add them lazily using **lazy constraints** in JuMP with Gurobi. \n", 77 | "\n", 78 | "Whenever Gurobi finds a new integer-feasible solution $\\left( \\mathbf{x}^\\ast, \\mathbf{y}^\\ast, z^\\ast \\right)$, we will try to generate a new constraint. We do that by solving an **embedded** optimization problem:\n", 79 | "\n", 80 | "$$CUT(\\mathbf{x}^\\ast) = {\\arg \\min}_{\\mathbf{p}\\in U} \\mathbf{p}^\\prime \\mathbf{x}^\\ast$$\n", 81 | "\n", 82 | "*on board: diagram*\n", 83 | "\n", 84 | "We'll only add this new constraint if it would be violated by the current solution by more than a tolerance. Today we'll actually solve this embedded problem using Gurobi, but as an **exercise** you can solve it in closed form - see how much of an improvement in solve times you get!" 
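, "\n", "(For reference, a hedged sketch of that closed-form separation - it is just the formula derived in the exercise at the end of this notebook, with a minus sign because the cut problem *minimizes* over the ellipsoid; ``worst_case_p`` is a hypothetical helper name, not part of the lecture code.)\n", "\n", "```julia\n", "# worst-case p for a fixed allocation xval, given data p\u0304, \u03c3 and budget \u0393\n", "function worst_case_p(xval, p\u0304, \u03c3, \u0393)\n", "    scaled = \u03c3 .* xval\n", "    return p\u0304 - \u0393 * (\u03c3.^2 .* xval) / norm(scaled)\n", "end\n", "```"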
85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "collapsed": false, 90 | "input": [ 91 | "using JuMP, Gurobi\n", 92 | "\n", 93 | "# Generate data\n", 94 | "n = 20\n", 95 | "p\u0304 = [1.15 + i*0.05/150 for i in 1:n]\n", 96 | "\u03c3 = [0.05/450*\u221a(2*i*n*(n+1)) for i in 1:n]\n", 97 | "\n", 98 | "function solve_portfolio()\n", 99 | " port = Model(solver=GurobiSolver())\n", 100 | " \n", 101 | " @defVar(port, z \u2264 maximum(p\u0304))\n", 102 | " @setObjective(port, Max, z)\n", 103 | " @defVar(port, 0 \u2264 x[1:n] \u2264 1)\n", 104 | " @addConstraint(port, sum(x) == 1)\n", 105 | " \n", 106 | " @defVar(port, y[1:n], Bin)\n", 107 | " for i in 1:n\n", 108 | " @addConstraint(port, y[i] \u2265 x[i])\n", 109 | " end\n", 110 | " @addConstraint(port, sum(y) \u2264 div(n,4))\n", 111 | " \n", 112 | " # Link z to x\n", 113 | " function portobj(cb)\n", 114 | " # Get values of z and x\n", 115 | " zval = getValue(z)\n", 116 | " xval = getValue(x)[:]\n", 117 | " \n", 118 | " # Find most pessimistic value of p'x\n", 119 | " # over all p in the uncertainty set\n", 120 | " rob = Model(solver=GurobiSolver(OutputFlag=0))\n", 121 | " @defVar(rob, p[i=1:n])\n", 122 | " @defVar(rob, d[i=1:n])\n", 123 | " @setObjective(rob, Min, dot(xval,p))\n", 124 | " \u0393 = sqrt(10)\n", 125 | " @addConstraint(rob, sum{d[i]^2,i=1:n} \u2264 \u0393^2)\n", 126 | " for i in 1:n\n", 127 | " @addConstraint(rob, p[i] == p\u0304[i] + \u03c3[i]*d[i])\n", 128 | " end\n", 129 | " solve(rob)\n", 130 | " worst_z = getObjectiveValue(rob)\n", 131 | " @show (zval, worst_z)\n", 132 | " worst_p = getValue(p)[:]\n", 133 | " \n", 134 | " # Is this worst_p going to change the objective\n", 135 | " # because worst_z is worse than the current z?\n", 136 | " if worst_z < zval - 1e-2\n", 137 | " # Yep, we've made things worse!\n", 138 | " # Gurobi should try to find a better portfolio now\n", 139 | " @addLazyConstraint(cb, z \u2264 dot(worst_p,x))\n", 140 | " end\n", 141 | " end\n", 142 | " setLazyCallback(port, portobj)\n", 143 | " \n", 144 | " solve(port)\n", 145 | " \n", 146 | " return getValue(x)[:]\n", 147 | "end\n", 148 | "\n", 149 | "solve_portfolio()" 150 | ], 151 | "language": "python", 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "output_type": "stream", 156 | "stream": "stdout", 157 | "text": [ 158 | "Optimize a model with 22 rows, 41 columns and 80 nonzeros\n", 159 | "(zval,worst_z) => (1.1566666666666665,1.1119444893019914)" 160 | ] 161 | }, 162 | { 163 | "output_type": "stream", 164 | "stream": "stdout", 165 | "text": [ 166 | "\n", 167 | "Presolve time: 0.00s\n", 168 | "Presolved: 22 rows, 41 columns, 80 nonzeros\n", 169 | "Variable types: 21 continuous, 20 integer (20 binary)\n", 170 | "(zval,worst_z) => (1.1566666666666665,1.1111246725600388)\n", 171 | "\n", 172 | "Root relaxation: objective 1.156333e+00, 6 iterations, 0.00 seconds\n", 173 | "(zval,worst_z) => (1.1563333333333332,1.1119444893019914)\n", 174 | "(zval,worst_z) => (1.1559999999999997,1.1127950730240155)\n", 175 | "(zval,worst_z) => (1.1556666666666666,1.1136790262540286)\n", 176 | "(zval,worst_z) => (1.1553333333333333,1.1145993405294434)\n", 177 | "(zval,worst_z) => (1.1549999999999998,1.1155594867652385)\n", 178 | "(zval,worst_z) => (1.1546666666666665,1.1165635147016044)\n", 179 | "(zval,worst_z) => (1.154333333333333,1.1176162164470733)\n", 180 | "(zval,worst_z) => (1.154,1.1187233395982223)\n", 181 | "(zval,worst_z) => (1.1536666666666666,1.1198918413411922)\n", 182 | "(zval,worst_z) => (1.153333333333333,1.1211303081530426)\n", 183 | 
"(zval,worst_z) => (1.1529999999999998,1.1224495373588008)\n", 184 | "(zval,worst_z) => (1.1526666666666663,1.1238634270548151)\n", 185 | "(zval,worst_z) => (1.1523333333333332,1.1253903875468847)\n", 186 | "(zval,worst_z) => (1.152,1.1270557049613767)\n", 187 | "\n", 188 | " Nodes | Current Node | Objective Bounds | Work\n", 189 | " Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time\n", 190 | "\n", 191 | " 0 0 1.15174 0 11 - 1.15174 - - 0s\n", 192 | "(zval,worst_z) => (1.1470227328301037,1.1396039397398352)\n", 193 | "H 0 0 1.1470227 1.15174 0.41% - 0s\n", 194 | " 0 0 1.15174 0 11 1.14702 1.15174 0.41% - 0s\n", 195 | " 0 0 1.15174 0 11 1.14702 1.15174 0.41% - 0s\n", 196 | "(zval,worst_z) => (1.1516666666666666,1.1288956656137035)\n", 197 | "(zval,worst_z) => (1.1513333333333569,1.1309663308247797)\n", 198 | "(zval,worst_z) => (1.151000000000038,1.133361658942341)\n", 199 | "(zval,worst_z) => (1.1470300811632392,1.1395542704160788)\n", 200 | "* 44 2 10 1.1470301 1.15168 0.41% 3.8 0s\n", 201 | "(zval,worst_z) => (1.147047478737521,1.139475366906823)\n", 202 | "* 68 5 10 1.1470475 1.15167 0.40% 3.6 0s\n", 203 | "(zval,worst_z) => (1.1470682530994947,1.1404573657558754)\n", 204 | "* 94 15 9 1.1470683 1.15167 0.40% 3.6 0s\n", 205 | "(zval,worst_z) => (1.1506666666666323,1.1362650225868365)\n", 206 | "(zval,worst_z) => (1.1471399470276338,1.1397880697939513)\n", 207 | "* 180 2 12 1.1471399 1.15166 0.39% 3.6 0s\n", 208 | "(zval,worst_z) => (1.1471438839894306,1.140552062533531)\n", 209 | "* 206 5 12 1.1471439 1.15166 0.39% 3.5 0s\n", 210 | "(zval,worst_z) => (1.1471459914930557,1.1407606097364476)\n", 211 | "* 323 47 11 1.1471460 1.15166 0.39% 3.5 0s\n", 212 | "(zval,worst_z) => (1.1471684136484515,1.141024846239318)\n", 213 | "* 340 58 11 1.1471684 1.15166 0.39% 3.5 0s\n", 214 | "(zval,worst_z) => " 215 | ] 216 | }, 217 | { 218 | "output_type": "stream", 219 | "stream": "stdout", 220 | "text": [ 221 | "(1.147173218445699,1.1404116811811653)\n", 222 | "H 394 68 1.1471732 1.15166 0.39% 3.5 0s\n", 223 | "(zval,worst_z) => (1.1471839603170648,1.141472852545052)\n", 224 | "* 527 118 11 1.1471840 1.15166 0.39% 3.5 0s\n", 225 | "(zval,worst_z) => (1.1503333333333334,1.140149865710194)\n", 226 | "(zval,worst_z) => (1.147273659552563,1.14110125789888)\n", 227 | "* 544 20 14 1.1472737 1.15161 0.38% 3.5 0s\n", 228 | "(zval,worst_z) => (1.147288763313607,1.1410423565132275)\n", 229 | "* 569 22 14 1.1472888 1.15161 0.38% 3.5 0s\n", 230 | "(zval,worst_z) => (1.147316766144706,1.141406256858547)\n", 231 | "* 594 22 13 1.1473168 1.15161 0.37% 3.5 0s\n", 232 | "(zval,worst_z) => (1.1473939997497258,1.1423996455925112)\n", 233 | "* 629 26 12 1.1473940 1.15161 0.37% 3.5 0s\n", 234 | "(zval,worst_z) => (1.1476045644702562,1.1416959456972795)\n", 235 | "* 1328 137 16 1.1476046 1.15161 0.35% 3.4 0s\n", 236 | "(zval,worst_z) => (1.1477106005691295,1.1428023401764515)\n", 237 | "* 1407 137 15 1.1477106 1.15152 0.33% 3.3 0s\n", 238 | "(zval,worst_z) => (1.1477847196402158,1.1433278504310909)\n", 239 | "* 3060 394 30 1.1477847 1.15141 0.32% 2.7 0s\n", 240 | "(zval,worst_z) => " 241 | ] 242 | }, 243 | { 244 | "output_type": "stream", 245 | "stream": "stdout", 246 | "text": [ 247 | "(1.147798868429544,1.1432980916291928)\n", 248 | "H 5386 979 1.1477989 1.15141 0.31% 2.3 0s\n", 249 | "(zval,worst_z) => (1.1478152209575423,1.143279716904396)\n", 250 | "* 7213 1220 30 1.1478152 1.15127 0.30% 2.1 0s\n", 251 | "(zval,worst_z) => (1.14785301792211,1.1426888719571178)\n", 252 | "*12139 1794 30 1.1478530 1.15070 
0.25% 1.9 0s\n", 253 | "(zval,worst_z) => " 254 | ] 255 | }, 256 | { 257 | "output_type": "stream", 258 | "stream": "stdout", 259 | "text": [ 260 | "(1.1478601755755777,1.1432619688471866)\n", 261 | "*15951 1738 30 1.1478602 1.15043 0.22% 1.9 0s\n", 262 | "(zval,worst_z) => (1.1478937108549316,1.1426547380907053)\n", 263 | "H20488 1199 1.1478937 1.15017 0.20% 1.8 0s\n", 264 | "(zval,worst_z) => " 265 | ] 266 | }, 267 | { 268 | "output_type": "stream", 269 | "stream": "stdout", 270 | "text": [ 271 | "(1.1479043342581046,1.142650419192153)\n", 272 | "*28539 40 30 1.1479043 1.14956 0.14% 1.7 1s\n", 273 | "\n", 274 | "Explored 29166 nodes (49935 simplex iterations) in 1.00 seconds\n", 275 | "Thread count was 8 (of 8 available processors)\n", 276 | "\n", 277 | "Optimal solution found (tolerance 1.00e-04)\n", 278 | "Best objective 1.147904334258e+00, best bound 1.147904334258e+00, gap 0.0%\n" 279 | ] 280 | }, 281 | { 282 | "metadata": {}, 283 | "output_type": "pyout", 284 | "prompt_number": 1, 285 | "text": [ 286 | "20-element Array{Float64,1}:\n", 287 | " 0.417392 \n", 288 | " 0.29514 \n", 289 | " 0.0 \n", 290 | " 0.0 \n", 291 | " 0.0 \n", 292 | " -2.35922e-15\n", 293 | " 0.0 \n", 294 | " 0.0 \n", 295 | " 0.0 \n", 296 | " 0.0 \n", 297 | " 0.0 \n", 298 | " 0.0 \n", 299 | " 0.0 \n", 300 | " 0.0 \n", 301 | " 0.0 \n", 302 | " 0.0 \n", 303 | " 0.0 \n", 304 | " 0.09838 \n", 305 | " 0.0957561 \n", 306 | " 0.0933315 " 307 | ] 308 | } 309 | ], 310 | "prompt_number": 1 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "### Exercise: Replace inner model with closed form expression\n", 317 | "\n", 318 | "The cutting plane problem was:\n", 319 | "\n", 320 | "$${\\min}_{\\mathbf{p}\\in U} \\mathbf{p}^\\prime \\mathbf{x}^\\ast$$\n", 321 | "\n", 322 | "$$ U^\\Gamma = \\left\\{ \\mathbf{p} \\mid p_i = \\bar{p}_i + \\sigma_i d_i, \\|\\mathbf{d}\\|\\leq \\Gamma \\right\\}$$\n", 323 | "\n", 324 | "Let's do a little rearrangement, so instead it is\n", 325 | "\n", 326 | "$$ U^\\Gamma = \\left\\{ \\mathbf{p} \\mid \\sqrt{\\sum_{i=1}^n \\left( \\frac{p_i - \\bar{p}_i}{\\sigma_i} \\right)^2} \\leq \\Gamma \\right\\}$$\n", 327 | "\n", 328 | "So the problem is minimizing a linear function over an ellipsoid, which if you go through the KKT conditions you'll find has a nice closed form solution (the minus sign reflects that we take the worst case):\n", 329 | "\n", 330 | "$$ p^\\ast_i = \\bar{p}_i - \\frac{\\Gamma}{\\| diag(\\sigma) \\mathbf{x}^\\ast \\|} \\sigma^2_i x^\\ast_i$$" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "\n", 338 | "## Application: Travelling Salesman\n", 339 | "\n", 340 | "The most famous application of this is the **Travelling Salesman Problem**. TSP is the problem of finding a tour of shortest length that visits all the nodes in a graph. The decision variables in the MIP formulation correspond to whether we use an arc or not. If there are $N$ nodes, we have $N^2$ variables. We will need $N$ constraints to make sure that each city is visited once. However, if you solve this you will find it is not sufficient:\n", 341 | "\n", 342 | "![subtours](http://i.imgur.com/rX9EYAr.png)\n", 343 | "\n", 344 | "To make sure these subtours don't occur, we need **subtour elimination constraints**. Unfortunately, there are $2^N$ possible subtour elimination constraints, which grows very very fast. The solution is to only add these constraints **lazily**: whenever the MIP solver finds an integer solution, we check for subtours. 
If we find them, we return a new constraint that will \"break\" the subtours. We then keep solving from there and repeat until no subtours are found. In practice we need far fewer than $2^N$ constraints, which is why it is possible to solve TSPs with 1000s of variables to optimality.\n", 345 | "\n" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "collapsed": false, 351 | "input": [], 352 | "language": "python", 353 | "metadata": {}, 354 | "outputs": [] 355 | } 356 | ], 357 | "metadata": {} 358 | } 359 | ] 360 | } --------------------------------------------------------------------------------