├── 0_Prerequisites.jmd ├── BasicDataAndPlots.jmd ├── COVID-monitoring.jmd ├── DiscussionBasicDataAndPlots.jmd ├── GMTMaps.jmd ├── Manifest.toml ├── Project.toml ├── README.md ├── RegressionDiscontinuityQuestion.ipynb ├── RegressionDiscontinuityQuestion.jmd ├── build.jl ├── data │   └── longevity.csv └── requirements.txt /0_Prerequisites.jmd: -------------------------------------------------------------------------------- 1 | # Prerequisites 2 | 3 | This document simply lets you add packages that are appropriate for 4 | data analysis and visualizations, and then precompile all of them so 5 | that you don't have long load times later when you want to use the 6 | packages. This script could take **quite a long time** and requires 7 | you to have an internet connection to download the packages. 8 | 9 | ```julia;exec=false 10 | 11 | using Pkg; 12 | Pkg.add("IJulia") 13 | Pkg.add("Queryverse") 14 | Pkg.add("DataFrames") 15 | Pkg.add("HTTP") 16 | Pkg.add("Plots") 17 | Pkg.add(PackageSpec(name="Gadfly",rev="master")) # track the master branch 18 | Pkg.add("Distributions") 19 | Pkg.add("Random") 20 | Pkg.add("GLM") 21 | Pkg.add("Optim") 22 | Pkg.add("BlackBoxOptim") 23 | Pkg.add(PackageSpec(name="Turing",rev="master")) # currently has bug 24 | Pkg.add("Stan") 25 | Pkg.add("Statistics") 26 | Pkg.add("SQLite") 27 | Pkg.add("Weave") 28 | Pkg.update() 29 | Pkg.precompile() 30 | 31 | ``` 32 | -------------------------------------------------------------------------------- /BasicDataAndPlots.jmd: -------------------------------------------------------------------------------- 1 | # Acquiring some data and plotting it 2 | 3 | ## A simple data analysis tutorial using Julia 4 | ### By: Daniel Lakeland 5 | ### Lakeland Applied Sciences LLC 6 | 7 | 8 | So you want to answer some questions you have about the 9 | world... Suppose you'd like to see how the population of several 10 | states has changed over time. We'll rely on CSV files published by the 11 | Census.
The goal of this tutorial will be to simply walk you through 12 | acquiring the data, processing it into a form where it's easy to 13 | analyze, and plotting some aspects of the data to help you see what 14 | happened. 15 | 16 | In the companion Discussion we'll talk about why the code looks like 17 | it does and what other options there are. 18 | 19 | ## Getting the data 20 | 21 | We'll rely on the [Queryverse](https://www.queryverse.org/) 22 | meta-package which will provide us with ways to read CSV files and to 23 | pipe datasets through filtering and mutation operations. To find out 24 | more you can 25 | [watch a great introduction](https://youtu.be/OFPNph-WxLM?t=73) by the 26 | author. And we'll use 27 | [DataFrames](https://juliadata.github.io/DataFrames.jl/stable/), which 28 | will let us create in-memory tabular data objects. We will also use 29 | [Gadfly](http://gadflyjl.org/stable/) which is a plotting package 30 | particularly oriented towards statistical data analysis plots. We 31 | won't cover the Queryverse built-in plotting system VegaLite in this 32 | tutorial, but might provide some separate tutorials. 33 | 34 | Let's get started by loading up the required packages, and grabbing 35 | some data: 36 | 37 | 38 | ```julia 39 | using Queryverse,DataFrames; 40 | 41 | cenfile="https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv" 42 | 43 | df = DataFrame(load(cenfile)); 44 | display(first(df,5)) 45 | 46 | ``` 47 | 48 | ## Reshaping the data to long form 49 | 50 | Of course, the shape of this data is all wrong. There are 164 columns, 51 | and each column has information on a different variable in a different 52 | year! This data format is convenient for people wishing to click some 53 | buttons by hand and graph some data in Excel, because it places 54 | related data items together in the dataset and they can select those 55 | related items with a mouse.
It is, however, horrible for someone 56 | wishing to **program** a computer to do analysis in bulk. 57 | 58 | This is a common data problem. To solve it we will want to reshape the 59 | data using some 60 | [DataFrame functions](https://juliadata.github.io/DataFrames.jl/stable/man/reshaping_and_pivoting/). In 61 | particular `stack` will take a dataset and "stack" it into a long 62 | form, with one row for each column mentioned. In this case we want to 63 | stack all the columns except the ones that identify the row, which we 64 | exclude by wrapping them in `Not`. 65 | 66 | ## Let's Stack It 67 | 68 | We'll stack the data frame, and then subset to the columns we might 69 | care about to avoid a lot of extra junk on our screen, and select the 70 | rows that represent statewide totals (where `STNAME == CTYNAME`), and 71 | we'll add a `year` column for later use. 72 | 73 | ```julia 74 | 75 | dflong = stack(df,Not([:SUMLEV,:REGION,:DIVISION,:STATE,:COUNTY,:STNAME,:CTYNAME ])) |> @filter(_.STNAME == _.CTYNAME) |> @mutate(year=-1) |> DataFrame; 76 | 77 | select!(dflong,[:STNAME,:year,:variable, :value,:DIVISION,:REGION]) 78 | 79 | 80 | display(first(dflong,5)) 81 | 82 | ``` 83 | 84 | Since the table is now "long form" we have one row per county per 85 | measurement. We can therefore select out just the ones that have 86 | population measurements. Let's take a look at which measurements those 87 | are. 88 | 89 | ```julia 90 | 91 | unique(dflong[:,:variable]) 92 | 93 | ``` 94 | 95 | ## Munging the data 96 | 97 | 98 | Clearly POPESTIMATE variables are the ones we want, but the year has 99 | been encoded into the symbol name because there was one column per 100 | year... What we want is to split this column into one containing 101 | POPESTIMATE and one containing the year. 102 | 103 | Since we're going to want to edit the variable names in the variable 104 | column, it's not convenient for them to be symbols.
Let's convert them 104 | to String and in the process, also strip off the year and put it into 105 | a numerical `year` column. 106 | 107 | We notice that the census always put the year pasted on to the end of 108 | the symbol name (when appropriate). But it's not for every 109 | symbol. We'll strip off the last 4 characters, and try to convert it 110 | to an integer using `tryparse`. 111 | 112 | ```julia 113 | 114 | display(first(dflong,5)) 115 | dflong2 = DataFrame(dflong |> @mutate(year = something(tryparse(Int,string(_.variable)[end-3:end]),-1), variable=String(_.variable)) |> @mutate(variable = _.year > 0 ? _.variable[1:end-4] : _.variable) ) 116 | display(first(dflong2,5)) 117 | 118 | ``` 119 | 120 | There are lots of variables, so let's just select out the ones that 121 | look like POPESTIMATE. 122 | 123 | ```julia 124 | display(unique(dflong2.variable) ) 125 | 126 | pop = dflong2 |> @filter(_.variable == "POPESTIMATE") |>DataFrame 127 | display(first(pop,5)) 128 | 129 | ``` 130 | 131 | # Visualizing Aspects of Population Data 132 | 133 | 134 | Now we have a DataFrame called `pop` that we can examine. Let's find 135 | out what is going on in the population of these states over 136 | time. We'll need to make some plots using Gadfly. 137 | 138 | ```julia 139 | 140 | using Gadfly 141 | 142 | set_default_plot_size(20cm,10cm); 143 | 144 | display(plot(pop,x=:year,y=:value,Geom.line,color=:STNAME)) 145 | display(plot(pop,x=:value,Geom.density(bandwidth=1e6))) 146 | 147 | ``` 148 | 149 | Normally Jupyter only displays the last thing in a cell, so we 150 | explicitly display each graph. 151 | 152 | These plots get us some basic information, but they have all sorts of 153 | problems. For example the color key for the 50 states takes up more of 154 | the plot than the plot does. And there are too many lines to really 155 | tell what's going on. 
And the density plot for population is pretty 156 | interesting, but it has no labels and the units are not very 157 | convenient. Let's fix the density plot first. 158 | 159 | ```julia 160 | 161 | display(plot(DataFrame(pop |> @mutate(stpop=_.value/1e6)), 162 | x=:stpop,Geom.density(bandwidth=1), 163 | Guide.title("Distribution of State Populations"), 164 | Guide.xlabel("Population (Millions of People)"))) 165 | 166 | 167 | ``` 168 | 169 | Of course, this is the density for all observations across all the 170 | years. Let's look at how the distribution of populations changed 171 | between say 2015 and 2019: 172 | 173 | ```julia 174 | set_default_plot_size(20cm,10cm) 175 | display(hstack(plot(DataFrame(pop |> @mutate(stpop=_.value/1e6) |> @filter(_.year == 2015)), 176 | x=:stpop,Geom.density(bandwidth=1), 177 | Guide.title("Distribution of State Populations 2015"), 178 | Guide.xlabel("Population (Millions of People)")), 179 | plot(DataFrame(pop |> @mutate(stpop=_.value/1e6) |> @filter(_.year == 2019)), 180 | x=:stpop,Geom.density(bandwidth=1), 181 | Guide.title("Distribution of State Populations 2019"), 182 | Guide.xlabel("Population (Millions of People)")))) 183 | 184 | ``` 185 | 186 | We can see that the distribution is relatively stable as might be 187 | expected since tens of millions of people don't tend to all move 188 | between states every few years. Let's select all the states with more 189 | than 10 Million people, and look at how they trended in time, and 190 | compare to states with less than 2 Million people. 
191 | 192 | ```julia 193 | bigstates = unique(pop[pop.value .> 10e6,:STNAME]) 194 | 195 | bigdata = DataFrame(pop |> @filter(_.STNAME in bigstates) |> @mutate(stpop=_.value/1e6)) 196 | bigplot = plot(bigdata, x=:year,y=:stpop, Geom.line,color=:STNAME,Geom.point,Guide.title("Population of Large States")) 197 | 198 | smallstates = unique(pop[pop.value .< 2e6,:STNAME]) 199 | 200 | smalldata = DataFrame(pop |> @filter(_.STNAME in smallstates) |> @mutate(stpop=_.value/1e6)) 201 | 202 | smallplot = plot(smalldata, x=:year,y=:stpop,Geom.line,color=:STNAME,Geom.point,Guide.title("Population of Small States")) 203 | 204 | set_default_plot_size(9inch,4inch) 205 | hstack(bigplot,smallplot) 206 | 207 | ``` 208 | 209 | So far so good. We notice that Idaho has been growing nonlinearly 210 | through time. Let's fit a quadratic to its growth curve and then we'll 211 | call it a day. 212 | 213 | ```julia 214 | using GLM 215 | iddata = smalldata |> @filter(_.STNAME == "Idaho") |> DataFrame 216 | idgrowth = lm(@formula(stpop ~ (year-2015)+(year-2015)^2),iddata) 217 | 218 | display(coef(idgrowth)) 219 | 220 | preds = DataFrame(predict(idgrowth,DataFrame(year=[2020,2021]),interval=:prediction)) 221 | preds.State = ["Idaho","Idaho"] 222 | preds.year = [2020,2021] 223 | preds 224 | 225 | 226 | ``` 227 | 228 | Who knows how many people will be in Idaho in 2021, perhaps COVID-19 229 | will cause an even bigger influx than the recent trend as people leave 230 | CA. But in any case, now we know what we'd expect if the trend 231 | continued as in the past. Extrapolating using a quadratic can be 232 | problematic, but this isn't too far outside the range of plausible. If 233 | we want to do more modeling, we should really learn some Bayesian 234 | methods 😉. 235 | 236 | 237 | That's it for now. If you have some interest in what we did, and why 238 | we did it, you can read the Discussion document. 
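As a postscript, the fit-and-extrapolate pattern above can be reproduced end to end on synthetic data (made-up numbers; the quadratic term is precomputed as a column here to keep the formula simple):

```julia
using GLM, DataFrames

d = DataFrame(year = 2010:2019)
d.c  = d.year .- 2015                 # centered year, as in the Idaho fit
d.c2 = d.c .^ 2
d.stpop = 1.6 .+ 0.02 .* d.c .+ 0.003 .* d.c2   # a noiseless quadratic "population"

fit = lm(@formula(stpop ~ c + c2), d)

# extrapolate to 2020 and 2021 (c = 5 and 6) with a prediction interval
predict(fit, DataFrame(c=[5,6], c2=[25,36]), interval=:prediction)
```

With noiseless data the fitted coefficients should recover 1.6, 0.02, and 0.003 almost exactly, which is a handy sanity check before trusting a fit on real data.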
239 | -------------------------------------------------------------------------------- /COVID-monitoring.jmd: -------------------------------------------------------------------------------- 1 | # Monitoring The Progress of the COVID-19 Pandemic 2 | 3 | ## A Julia Data Analysis Tutorial 4 | ## By Daniel Lakeland 5 | ## Lakeland Applied Sciences LLC 6 | 7 | 8 | As a mathematical modeler and data analyst many of my friends have 9 | asked me questions about what is going on with the COVID pandemic. On 10 | my blog I've posted some graphs as PDFs with updates every few weeks, 11 | but it's more convenient to give my friends and family an executable 12 | Jupyter notebook where they can update to the latest data any time 13 | they want. Let's get started by grabbing the daily data for all states 14 | from the [Covid Tracking Project](https://covidtracking.com/) 15 | 16 | We'll use the 17 | [Vegalite graphics library](https://vega.github.io/vega-lite/docs/) 18 | this time with the 19 | [@vlplot](https://www.queryverse.org/VegaLite.jl/stable/userguide/vlplotmacro/) 20 | macro. 
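A minimal `@vlplot` call, to show the shape of what's coming (toy data; VegaLite comes along with Queryverse):

```julia
using VegaLite, DataFrames

toy = DataFrame(day=1:10, cases=[1,2,4,8,12,20,25,31,40,52])

# data |> @vlplot(mark, encodings...) is the basic pattern
toy |> @vlplot(:line, x=:day, y=:cases, width=300, title="A toy epidemic curve")
```

Layered plots, which we'll use below, are built by adding several `@vlplot` specifications together.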
21 | 22 | ```julia 23 | using Queryverse, CSVFiles 24 | ``` 25 | 26 | 27 | ```julia 28 | usdata = "https://covidtracking.com/api/v1/states/daily.csv" 29 | 30 | 31 | uspopurl = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv" 32 | 33 | statecodesurl ="https://www2.census.gov/geo/docs/reference/state.txt?#" 34 | 35 | uspop = load(uspopurl) |> @filter(_.COUNTY == 0) |> @select(:STATE,:STNAME,:POPESTIMATE2019) |> DataFrame 36 | 37 | 38 | statecodes = load(File(format"CSV",statecodesurl), delim='|') |> DataFrame 39 | 40 | 41 | 42 | dat = DataFrame(load(usdata)) 43 | 44 | 45 | dat = join(dat,statecodes, on = :state => :STUSAB,kind=:left) 46 | 47 | dat = join(dat,uspop, on = :STATE_NAME => :STNAME, kind=:left, makeunique=true) 48 | 49 | 50 | display(first(dat,5)) 51 | display(first(uspop)) 52 | 53 | ``` 54 | 55 | # Understanding the Data 56 | 57 | The Covid Tracking Project aggregates data from all the states on 58 | various important measures. For the moment let's focus on daily 59 | positive tests, the ratio of total tests to positive tests, the number 60 | of hospitalized patients, and the number of deaths per day. 61 | 62 | Let's figure out which columns those correspond to: 63 | 64 | ```julia 65 | 66 | println(names(dat)) 67 | 68 | using Dates 69 | 70 | function convdate(d::Int) 71 | return(Date(div(d,10000),div(mod(d,10000),100),mod(d,100))); 72 | end 73 | 74 | dat2 = dat |> @mutate(thedate=convdate(_.date)) |> DataFrame 75 | 76 | allstates = unique(dat.state); 77 | 78 | ``` 79 | 80 | Let's create a function that plots percentage positive, positive per 81 | day, and deaths per day for a given state...
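First, a quick sanity check of the `convdate` helper defined above; it peels year, month, and day out of an integer like `20200731` using integer division (standard-library `Dates` only):

```julia
using Dates

# same definition as above: yyyymmdd as an Int -> Date
convdate(d::Int) = Date(div(d,10000), div(mod(d,10000),100), mod(d,100))

convdate(20200731)   # Date(2020, 7, 31)
```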
86 | 87 | ```julia 88 | 89 | function plotstate(df,state) 90 | dfstate = df |> @filter(_.state == state) |> 91 | @mutate(testpct=_.positiveIncrease/(_.totalTestResultsIncrease+.1)) |> 92 | @select(:thedate,:state,:testpct,:positiveIncrease,:deathIncrease)|>DataFrame 93 | 94 | testing = @vlplot(width=300,layer=[],title="Testing in $state") + 95 | @vlplot(:point,x=:thedate,y={:testpct,axis={title="Percentage Positive"}}) + 96 | @vlplot(transform=[{loess=:testpct,on=:thedate}], 97 | mark=:line,x=:thedate,y=:testpct) 98 | 99 | cases = @vlplot(width=300,layer=[],title="Cases in $state") + 100 | @vlplot(mark={:point,filled=true}, 101 | x=:thedate,y={:positiveIncrease,axis={title="Cases Per Day"}})+ 102 | @vlplot(transform=[{loess=:positiveIncrease,on=:thedate,bandwidth=.2}], 103 | mark=:line,x=:thedate,y=:positiveIncrease) 104 | 105 | deaths = @vlplot(width=300,layer=[],title="Deaths in $state") + 106 | @vlplot(:point,x=:thedate,y={:deathIncrease,axis={title="Deaths Per Day"}}) + 107 | @vlplot(transform=[{loess=:deathIncrease,on=:thedate,bandwidth=.2}], 108 | mark=:line,x=:thedate,y=:deathIncrease) 109 | 110 | return(dfstate |> hcat(testing,cases,deaths)) 111 | 112 | end 113 | 114 | #test output 115 | #plotstate(dat2,"CA") 116 | 117 | ``` 118 | 119 | # Plotting All The States: 120 | 121 | 122 | 123 | ```julia 124 | 125 | for i in allstates 126 | display(plotstate(dat2,i)) 127 | end 128 | 129 | 130 | ``` 131 | 132 | # Hospitalization: 133 | 134 | Hospitalization data is reported rather incompletely in the 135 | covidtracking data. We'll filter out the missing values, and then 136 | graph states for which it's available. Note that for Queryverse 137 | @filter we must filter on isna rather than ismissing. 138 | 139 | ```julia 140 | 141 | 142 | hdat = dat2 |> @filter(! 
isna(_.hospitalizedCurrently)) |> DataFrame 143 | display(first(hdat,10)) 144 | 145 | for i in allstates 146 | 147 | hdat |> @filter(_.state == i) |> @mutate(hosppc = _.hospitalizedCurrently / _.POPESTIMATE2019 * 1e3) |> @vlplot(layer=[]) + @vlplot(:point, x=:thedate, y=:hosppc,title="$i Hospitalization/1000 people") + @vlplot(transform=[{loess=:hosppc,on=:thedate,bandwidth=.2}],mark=:line,x=:thedate,y=:hosppc) |> display 148 | 149 | end 150 | 151 | ``` 152 | 153 | 154 | 155 | # Mortality Data: 156 | 157 | Although the CDC in general doesn't have up-to-date mortality data 158 | available, they have made an effort to create a variety of datasets 159 | for COVID. They're required not to release overly specific 160 | information: they can only publish aggregated groups containing more 161 | than 10 people. Ideally you'd like to know, for every day and every 162 | county, exactly how many people in each sex and age category 163 | died... but this kind of data is not released, as it quickly becomes 164 | individually identifiable. 165 | 166 | The most useful datasets are: 167 | 168 | 1. COVID-19 [Case Surveillance Public Use Data](https://catalog.data.gov/dataset/covid-19-case-surveillance-public-use-data) 169 | 1. CSV at: https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD 170 | 2. [Provisional Death Counts by Sex,Age,State](https://catalog.data.gov/dataset/provisional-covid-19-death-counts-by-sex-age-and-state-fb69a) 171 | 1. CSV at: https://data.cdc.gov/api/views/9bhg-hcku/rows.csv?accessType=DOWNLOAD 172 | 3. [By Place of Death and State](https://catalog.data.gov/dataset/provisional-covid-19-death-counts-by-place-of-death-and-state-936c6) 173 | 4. [By Week and State](https://catalog.data.gov/dataset/provisional-covid-19-death-counts-by-week-ending-date-and-state) 174 | 1. CSV at: https://data.cdc.gov/api/views/r8kw-7aab/rows.csv?accessType=DOWNLOAD 175 | 5.
[By Sex Age and Week](https://catalog.data.gov/dataset/provisional-covid-19-death-counts-by-sex-age-and-week) 176 | 1. CSV at: https://data.cdc.gov/api/views/vsak-wrfu/rows.csv?accessType=DOWNLOAD 177 | 6. [Death counts by county](https://catalog.data.gov/dataset/provisional-covid-19-death-counts-in-the-united-states-by-county) 178 | 1. CSV at: https://data.cdc.gov/api/views/kn79-hsxy/rows.csv?accessType=DOWNLOAD 179 | 7. [Weekly death counts by state and cause](https://catalog.data.gov/dataset/weekly-counts-of-deaths-by-state-and-select-causes-2019-2020) 180 | 1. CSV at: https://data.cdc.gov/api/views/muzy-jte6/rows.csv?accessType=DOWNLOAD 181 | 182 | 183 | 184 | 185 | ```julia 186 | 187 | 188 | using HTTP 189 | function grabifsmallerolder(url,filename,size,time) 190 | s = stat(filename) # zeroed-out struct when the file doesn't exist 191 | if(s.size < size || s.mtime < time) 192 | isfile(filename) && chmod(filename,0o644) # only chmod an existing file 193 | io = open(filename,"w") 194 | try 195 | HTTP.get(url,response_stream=io) 196 | finally 197 | close(io) # always close, even when the download fails 198 | end 199 | end 200 | end 201 | 202 | covdeaths = "https://data.cdc.gov/api/views/9bhg-hcku/rows.csv?accessType=DOWNLOAD" 203 | grabifsmallerolder(covdeaths,"covSexAgeState.csv",1e6,time()-3600*24); 204 | 205 | 206 | 207 | ``` 208 | 209 | It's useful to ask: which age groups are losing the most expected life 210 | years? To answer this we can use the life expectancy tables provided 211 | by the CDC. To make this analysis simple, we'll use tables for the 212 | whole population against the data for both sexes.
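The idea is simple: life years lost in an age group are the deaths in that group times the remaining life expectancy at that age. A toy version of the calculation (made-up numbers, not CDC data):

```julia
# hypothetical deaths by age-group midpoint, and remaining life expectancy at that age
deaths = Dict(30.0 => 100, 60.0 => 1000, 85.0 => 5000)
expect = Dict(30.0 => 50.0, 60.0 => 23.0, 85.0 => 6.5)

lyl = Dict(age => deaths[age] * expect[age] for age in keys(deaths))

# "nominal lifetimes lost", counting 80 years as one lifetime
sum(values(lyl)) / 80   # 756.25
```

The real version below does exactly this, but pulls the death counts from the CDC file and the expectancies from the CDC life table.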
213 | 214 | 215 | 216 | ```julia; 217 | 218 | covdf = DataFrame(load("covSexAgeState.csv")) 219 | 220 | agedict = Dict("Under 1 year" => 0.5, "1-4 years" => 2.5, "5-14 years" => 10.0, "15-24 years" => 20.0, "0-17 years" => 17.0/2, "18-29 years" => (18+28)/2.0, 221 | "25-34 years" => 30.0, "30-49 years" => (30+49)/2.0, "35-44 years" => 40.0, 222 | "45-54 years" => 50.0, "50-64 years" => (50+64)/2.0, "55-64 years" => 60.0, "65-74 years" => 70.0 , 223 | "75-84 years" => 80.0, "85 years and over" => 90.0, "All ages"=>nothing, "All Ages" => nothing) 224 | 225 | rename!(covdf,[:date,:startwk,:endwk,:state,:sex,:agegrp,:coviddeaths,:totdeaths,:pneumdeaths,:pneumandcovdeaths,:infldeaths,:pnuminfcovdeaths,:footnote]) 226 | 227 | covdf.agenum = map(x -> agedict[x],covdf.agegrp) 228 | 229 | 230 | covallagedf = covdf[map(x -> x in ["All ages","All Ages"], covdf.agegrp ),:] 231 | 232 | filter!(x -> !(x.agegrp in ["All ages","All Ages"]),covdf) # filter! passes whole rows, so test the agegrp field 233 | 234 | deathsplot = covdf |> @filter(_.state == "United States" && _.sex != "Unknown") |> 235 | ( @vlplot(layer=[]) + 236 | @vlplot(:point,x={"agenum:q",title="Age"},y= {"coviddeaths:q",scale={domain=[0,100000]}}, 237 | color="sex:n",width=600,title="Total Coronavirus Deaths By Age") + 238 | @vlplot(:line,x="agenum:q",y="coviddeaths:q",color="sex:n")) 239 | 240 | lifetable = "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/NVSR/68_07/Table01.xlsx" 241 | 242 | grabifsmallerolder(lifetable,"alllife.xlsx",100,time()-3600*12); # re-download only if older than 12 hours 243 | 244 | ltd = DataFrame(load("alllife.xlsx","Table 1!A4:G104")) 245 | rename!(ltd,[:agegrp,:qx,:nsurv,:ndie,:pylived,:pylivedabov,:expect]) 246 | 247 | ltd.age = 1:100 248 | 249 | lyldf = DataFrame(covdf |> @filter(_.sex == "All" && _.agenum != nothing) |> 250 | @mutate(lyl = _.coviddeaths * ltd[Int(round(_.agenum +.1)),:expect])) 251 | 252 | 253 | 254 | lylplot = lyldf |>@mutate(ltl=_.lyl/80)|> @vlplot(:bar,x={"agenum:q",title="Age"},y={"ltl:q",title="Nominal Lifetimes Lost"},width=600, 255 |
title="Nominal Lifetimes Lost By Age (1 lifetime = 80 years)") 256 | display([deathsplot; lylplot]); 257 | 258 | 259 | ``` 260 | 261 | The CDC, through Data.gov, has made available a de-identified 262 | individualized case database of COVID cases: 263 | 264 | ```julia 265 | 266 | covidcases="https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD" 267 | 268 | grabifsmallerolder(covidcases,"covidcasespub.csv",1000000,time()-3600*24); 269 | 270 | covcasedf = DataFrame(load("covidcasespub.csv")) 271 | 272 | ``` 273 | 274 | Manipulating data is sometimes easier with a well-developed query 275 | language. Fortunately, the SQLite library is an 276 | excellent tool for manipulating large datasets in a self-contained 277 | database file, without all the complexity of a database management 278 | system like MariaDB/MySQL or PostgreSQL. We can access SQLite from 279 | Julia. Access is somewhat slower from Julia than from R at the 280 | moment, but still fast enough for our purposes. Hopefully in the 281 | future it will be even better.
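As a self-contained illustration of the round trip we're about to do (toy table, and an in-memory database rather than a file):

```julia
using SQLite, DataFrames

db = SQLite.DB()    # no filename: an in-memory database

SQLite.load!(DataFrame(a=[1,2,2,3], b=["w","x","y","z"]), db, "toy")

# SQL groups and counts; the result materializes back into a DataFrame
DBInterface.execute(db, "select a, count(*) as n from toy group by a") |> DataFrame
```

The result has one row per distinct `a`, which is the same group-and-count shape we'll use on the case database below.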
282 | 283 | ```julia 284 | 285 | using SQLite 286 | 287 | db = SQLite.DB("coviddb.db") 288 | 289 | SQLite.drop!(db,"CovidCases";ifexists=true); 290 | SQLite.load!(covcasedf,db,"CovidCases";ifnotexists=true); 291 | ``` 292 | 293 | Let's grab all deaths by day, aggregated by age group: 294 | 295 | ```julia 296 | byagedeaths = DBInterface.execute(db,"select cdc_report_dt,count(*) as N,age_group,death_yn from CovidCases group by cdc_report_dt,age_group,death_yn") |> DataFrame 297 | 298 | first(byagedeaths,10) 299 | 300 | ``` 301 | 302 | Now let's write a small function that graphs deaths per day for a given age group, and apply it to the 40s through 70s age bands: 303 | 304 | 305 | ```julia 306 | 307 | function plotdeaths(df,agegroup) 308 | df |> @filter(_.age_group == agegroup && _.death_yn == "Yes") |> 309 | @vlplot(layer=[],title="US-wide Deaths per day (age $agegroup)",width=600) + 310 | @vlplot(:point,x=:cdc_report_dt,y={:N,scale={domain=[0,1000]}}) + 311 | @vlplot(:line,x=:cdc_report_dt,y=:N,transform=[{loess=:N, on=:cdc_report_dt,bandwidth=.2}]) 312 | end 313 | 314 | 315 | plot40 = byagedeaths |> x->plotdeaths(x,"40 - 49 Years") 316 | plot50 = byagedeaths |> x->plotdeaths(x,"50 - 59 Years") 317 | plot60 = byagedeaths |> x->plotdeaths(x,"60 - 69 Years") 318 | plot70 = byagedeaths |> x->plotdeaths(x,"70 - 79 Years") 319 | 320 | map(display,[plot40,plot50,plot60,plot70]); 321 | 322 | ``` 323 | -------------------------------------------------------------------------------- /DiscussionBasicDataAndPlots.jmd: -------------------------------------------------------------------------------- 1 | # Discussion of "Acquiring some data and plotting it" 2 | 3 | The Tutorials are designed to be very straightforward, so you 4 | can see what we're doing and learn by doing. The Discussion, by contrast, is 5 | all about **putting what we did in a broader perspective** and **showing you 6 | possible alternatives**, or where things could go wrong. Those are all
Those are all 7 | just distractions during the initial learning phase, but are important 8 | perspectives to have once you're past the initial comfort zone, so you 9 | can make better decisions about how to do things. **In the tutorial we 10 | try to select one good way to accomplish the task and stick to it.** In 11 | the discussion we might show you 3 or 4 other ways. 12 | 13 | First off, we needed to get the CSV file from the Census. The first 14 | thing that will generally go wrong is that **you don't have the 15 | slightest clue where to look for the dang file**. When I searched for 16 | ["census population csv"](https://www.startpage.com/do/dsearch?query=census+population+csv&cat=web&pl=ext-ff&language=english&extVersion=1.3.0) 17 | on StartPage (an anonymous proxy for Google searches) it led to 18 | [State Population Totals](https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html) 19 | which at the bottom has a link to 20 | [Datasets](https://www2.census.gov/programs-surveys/popest/datasets/) 21 | which unhelpfully plops you into a directory tree by decade... so 22 | selecting 2010-2019 and then counties and then totals led to 23 | [our final csv file](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv). 24 | 25 | Since we want states, **it might have made more sense to go to the 26 | "states" directory**, but if you had done that, you'd have found one 27 | file with just the 2019 data, whereas **the county directory has the 28 | full time series, and it has summary data for each state**. Such is the 29 | life of a data analyst. 30 | 31 | Once we have a file we want to analyze, we need to get the data into 32 | Julia to analyze it. There are several issues: 33 | 34 | 1. Do you want to download the file via http/https and store it 35 | locally, or just grab the contents from the web and read it into memory? 36 | 2. Which package to do you use to read it into memory? 37 | 3. 
How do you handle the data in memory? 38 | 39 | 40 | ## Choice of Queryverse for tabular data 41 | 42 | 43 | I think the [Queryverse package set](https://www.queryverse.org/) is well thought out from a 44 | computational perspective and well maintained by 45 | [David Anthoff](https://www.david-anthoff.com/). He is committed to 46 | avoiding breaking changes as much as possible, and has a 47 | great [video tutorial](https://www.youtube.com/watch?v=OFPNph-WxLM) on 48 | using Queryverse. But there are a few other useful packages to know 49 | about which we will discuss below. 50 | 51 | 52 | ## Downloading Files to Disk 53 | 54 | If you want to download files to disk, rather than just process them 55 | in memory, you should know about the 56 | [HTTP package](https://juliaweb.github.io/HTTP.jl/stable/). Here is an 57 | example of getting a file from a URL and putting it into a file on 58 | disk called "co-est2019-alldata.csv" which matches the original name. 59 | 60 | ```julia 61 | using HTTP 62 | 63 | outfile = open("co-est2019-alldata.csv","w") 64 | HTTP.get("https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv",response_stream=outfile) 65 | close(outfile) 66 | 67 | ``` 68 | 69 | In this code, we first open our output file for writing, and then hand 70 | it to HTTP.get() as the `response_stream=` keyword argument. We have 71 | to write `HTTP.get` explicitly because an unqualified `get` would 72 | refer to a different function. The `HTTP.get` function will stream the response 73 | to the file for us... we then close the file to ensure it's written 74 | properly to disk. 75 | 76 | If you do this a lot, you might write a utility function 77 | specifically to do it all for you.
78 | 79 | ```julia 80 | function http_get(url,outfile) 81 | out = open(outfile,"w"); 82 | try 83 | HTTP.get(url,response_stream=out) 84 | catch err 85 | println("there was an error $err while saving $url to file $outfile") 86 | finally 87 | close(out) 88 | end 89 | end 90 | ``` 91 | 92 | Here we 93 | [introduce a little error handling](https://docs.julialang.org/en/v1/manual/control-flow/#Exception-Handling-1) 94 | as well. The error handling ensures we are alerted to any problems 95 | with the network connection or disk io... and the `finally` clause ensures that we always close the 96 | file. 97 | 98 | 99 | ## Alternative CSV Handling 100 | 101 | If you're looking to read CSV-type files, there are a number of 102 | alternatives to the Queryverse-provided CSVFiles 103 | package. Queryverse provides packages to handle CSV, 104 | [Apache Feather](https://arrow.apache.org/docs/python/feather.html), 105 | Excel, SPSS, Stata, SAS, and Parquet files. They are 106 | [reasonably fast](https://www.queryverse.org/benchmarks/), with 1M 107 | rows of 20 columns of mixed data taking 2 seconds or so; the 108 | interfaces are uniform across file formats; and the whole ecosystem 109 | works well within itself and has excellent generic interfaces to 110 | other packages as well. This makes it hard to recommend alternatives, 111 | but it is worth mentioning the 112 | [CSV.jl](https://juliadata.github.io/CSV.jl/stable/) package, which 113 | can be used to easily read a CSV file into a DataFrame as follows: 114 | 115 | ```julia;exec=false 116 | 117 | using CSV, DataFrames; 118 | 119 | mydf = CSV.read("myfile.csv", DataFrame) # recent CSV.jl versions take the sink type as a second argument 120 | 121 | ``` 122 | 123 | A convenience function `download` in Julia uses external tools like 124 | curl or wget as available to grab files from urls... but watch out if 125 | you're running on a machine where those external tools aren't 126 | available.
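You can check for those tools up front with `Sys.which`, which returns the full path of an executable found on the PATH, or `nothing` (standard library only):

```julia
# probe the PATH for the external tools `download` relies on
havecurl = Sys.which("curl") !== nothing
havewget = Sys.which("wget") !== nothing

havecurl || havewget || @warn "neither curl nor wget found; `download` may fail"
```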
126 | 127 | ``` 128 | help?> download 129 | search: download 130 | 131 | download(url::AbstractString, [localfile::AbstractString]) 132 | 133 | Download a file from the given url, optionally renaming it to the given 134 | local file name. If no filename is given this will download into a 135 | randomly-named file in your temp directory. Note that this function 136 | relies on the availability of external tools such as curl, wget or fetch 137 | to download the file and is provided for convenience. For production use 138 | or situations in which more options are needed, please use a package 139 | that provides the desired functionality instead. 140 | 141 | Returns the filename of the downloaded file. 142 | ``` 143 | 144 | So we could do: 145 | 146 | ```julia 147 | 148 | using CSV, DataFrames 149 | 150 | mydf = CSV.read(download("https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv"), DataFrame) 151 | 152 | 153 | ``` 154 | 155 | Note that CSVFiles/Queryverse doesn't seem to have these external 156 | requirements and will read from a URL with its built-in `load` 157 | function. 158 | 159 | ## Queryverse queries, syntactic sugar, and Data Frames 160 | 161 | The de-facto standard for in-memory tabular data is the 162 | [DataFrames](https://juliadata.github.io/DataFrames.jl/stable/) 163 | package. We use it extensively here. When we pipe a DataFrame through 164 | Queryverse queries such as `df |> @filter(...)`, the queries 165 | work with 166 | [iterators](https://docs.julialang.org/en/v1/manual/interfaces/) over 167 | [named tuples](https://docs.julialang.org/en/v1/manual/types/#Named-Tuple-Types-1). Basically 168 | a named tuple is the same as 169 | [E.F. Codd's](https://en.wikipedia.org/wiki/Edgar_F._Codd) notion of a 170 | [Relational Model](https://en.wikipedia.org/wiki/Relational_model) of 171 | "rows" or "tuples" of data grouped together into a "relation" or 172 | "table". This makes the interface extremely general purpose. However
However 173 | if you want to be able to grab particular values rather than interate 174 | over the entire table, you'll need to construct a DataFrame to hold 175 | the relation represented by the Query. At the end of the chain of 176 | operations if we want a DataFrame we need to pipe into the DataFrame 177 | constructor. 178 | 179 | ```julia;exec=false 180 | newdf = mydf |> @filter(...) |> @mutate(...) |> DataFrame 181 | 182 | ## or what's the same 183 | 184 | newdf = DataFrame(mydf |> @filter(...) |> @mutate(...) ) 185 | 186 | ``` 187 | 188 | The Pipe operator `|>` is a syntactical shorthand for prefix notation 189 | for calling a function with the first argument coming from the left 190 | hand side of the pipe. So `a |> foo(b,c)` is equivalent to 191 | `foo(a,b,c)`. 192 | 193 | Speaking of syntax, the notation `@foo` denotes a 194 | ["macro"](https://docs.julialang.org/en/v1/manual/metaprogramming/), 195 | that is a function which receives as its arguments the parsed 196 | representation of the language tree and which can then modify the 197 | language itself to generate / expand some syntactic elements into code 198 | to be run. Generating code is known as "metaprogramming" because 199 | you're writing programs that write programs. So presumably 200 | `@filter(_.a == 1)` **generates some code** that iterates through the 201 | named tuples coming in on the left hand side and throws away those for 202 | which the value of the field `a` is not equal to 1. Macros are mostly 203 | useful because they allow you to create specialized sub-languages 204 | within Julia, a feature not found in most languages other than LISP 205 | dialects (some people, including myself, consider Julia a LISP 206 | dialect). 207 | 208 | ## Splitting The What and the When 209 | 210 | In addition to the "wide form" we also need to undo the combination of 211 | the year with the variable name. 212 | 213 | When we split out the year we used the `tryparse` function. 
If it 214 | can't parse a number, it returns `nothing`, which is a special value 215 | meaning nothing... We always want an Integer though, so we call 216 | `something(...)`, which returns the first argument that isn't 217 | `nothing`, so that when we get nothing, something will give us -1 218 | instead, which is unambiguously not a real year for this dataset. This 219 | keeps our column type as Int (and avoids problems later). When I tried 220 | this without using `something()` to ensure an integer was returned, 221 | the ENTIRE table became the special "container" type 222 | `DataValue{Any}`. This made it impossible to convert the Symbols 223 | to Strings. 224 | 225 | When we call the `stack` function, it creates a dataset in which 226 | each variable is stored as a 227 | ["symbol"](https://docs.julialang.org/en/v1/manual/metaprogramming/#Symbols-1). You 228 | can think of a "symbol" as a single structure in memory that 229 | represents a "name". In computer science this is called an 230 | ["interned string"](https://en.wikipedia.org/wiki/String_interning), a 231 | kind of Platonic ideal of the string... there can be only one, it's 232 | immutable, and it can be mapped one to one with a particular location 233 | in memory. This makes working with symbols very easy when it comes to 234 | things like checking to see if two symbol variables are the 235 | same... you can just compare their position in the global "symbol 236 | table" or something similar, whereas to compare two strings we must 237 | compare each character in the string to see if they match.
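As a quick aside, the `tryparse`/`something` combination used in the year-splitting step can be tried in isolation on toy inputs:

```julia
# tryparse returns the parsed value, or nothing on failure,
# rather than throwing an error
tryparse(Int, "2019")        # 2019
tryparse(Int, "POPESTIMATE") # nothing

# something returns its first non-nothing argument, so -1 acts as an
# unambiguous "no year here" sentinel, keeping the column type Int
something(tryparse(Int, "POPESTIMATE2019"[end-3:end]), -1) # 2019
something(tryparse(Int, "CENSUS"[end-3:end]), -1)          # -1
```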
238 | 239 | ```julia 240 | using BenchmarkTools 241 | 242 | a = "thequickbrownfoxjumpedoverthelazydog"; b="thequickbrownfoxjumpedoverthelazydog"; 243 | asym = :thequickbrownfoxjumpedoverthelazydog; bsym = :thequickbrownfoxjumpedoverthelazydog; 244 | 245 | @btime $a === $b;       ## interpolate globals with $ so we time the comparison itself 246 | @btime $asym === $bsym; 247 | 248 | 249 | ``` 250 | 251 | The check `a===b` has to actually check all the characters in the 252 | string, so long strings will take longer, whereas the symbol is just 253 | converted to a kind of marker in the symbol table... To check if two 254 | variables have the same symbol in them, we can immediately just check 255 | to see if they point to the same object, regardless of how long the 256 | string associated to the symbol name is. 257 | 258 | But a symbol is just an atomic thing; it isn't a string whose third 259 | character you can extract, for example. So when we want to split out the 260 | last 4 characters, we want to work with a string. Hence, we converted 261 | it, using this bit of code: 262 | 263 | `dflong |> @mutate(year = something(tryparse(Int,string(_.variable)[end-3:end]),-1), variable=String(_.variable)) |> @mutate(variable = _.year > 0 ? _.variable[1:end-4] : _.variable)` 264 | 265 | 266 | Let's unparse that a little. We first mutated the table to include the 267 | year, and changed the "variable" by constructing a String from the 268 | symbol. Then for rows where year was a positive number, we took all 269 | but the last 4 characters, otherwise we took the whole string. 270 | 271 | Note the conditional 272 | [ternary operator](https://docs.julialang.org/en/v1/manual/control-flow/). It 273 | has the form `a ? b : c` and means the same as `if a b else c end`. As 274 | far as I know this ternary operator comes originally from C. 275 | 276 | 277 | 278 | # Visualizing Aspects of Population Data 279 | 280 | 281 | There are approximately 4 actively maintained Julia plotting 282 | libraries: 283 | 284 | 1. Plots 285 | 2. Gadfly 286 | 3.
Vega/VegaLite 287 | 4. Makie 288 | 289 | There may be others as well, but these are the best known. 290 | 291 | We chose Gadfly in this Tutorial because it offers a "Grammar Of 292 | Graphics" style of plot specification. This is a good style for use in 293 | exploring data because it lets you compose different components 294 | together in a reasonable and understandable way. If you come from a 295 | place where you know Matlab or Python you may be more comfortable with 296 | Plots or Makie. 297 | 298 | The VegaLite library is part of Queryverse and is also a Grammar Of 299 | Graphics influenced library, however its graphical specifications 300 | are written in 301 | [JSON](https://en.wikipedia.org/wiki/JSON) and it's a bit involved to 302 | learn that whole system. Possibly worth it, but maybe not for a first 303 | Tutorial. 304 | 305 | The Gadfly function `set_default_plot_size(w,h)` obviously sets the 306 | default sizes. It takes two arguments, a width and a height, which 307 | should be expressed as elements of the type `Measures.Length`. The 308 | variable `cm` is a constant global and the expression `20cm` is really 309 | the same as `20*cm`, since Julia interprets a numeric literal juxtaposed 310 | with a variable as multiplication. There is also a constant global `inch`. 311 | 312 | 313 | ## Fitting basic models: 314 | 315 | For our first example of model fitting we brought in the GLM 316 | library. This lets you fit relatively common simple models, and uses a 317 | formula syntax that is very similar to the one used by the lm or glm 318 | functions in R. 319 | 320 | 321 | ```julia;exec=false 322 | using GLM 323 | 324 | idgrowth = lm(@formula(stpop ~ (year-2015)+(year-2015)^2), smalldata) 325 | 326 | display(coef(idgrowth)) 327 | 328 | predict(idgrowth,DataFrame(year=[2020,2021]),interval=:prediction) 329 | 330 | ``` 331 | 332 | The GLM package stands for "Generalized Linear Models" and fits models 333 | by essentially the maximum likelihood method.
This means it gives 334 | point estimates, and uses the curvature of the likelihood function to 335 | estimate the standard errors of the parameters. 336 | 337 | Since these tutorials are written by an opinionated Bayesian, you can 338 | take the confidence intervals spit out by GLM as more or less 339 | approximately equal to a high probability density interval under a 340 | broad prior distribution. That's not always a very reasonable thing to 341 | do. In our next tutorial perhaps we will build some explicit Bayesian 342 | models! 343 | 344 | -------------------------------------------------------------------------------- /GMTMaps.jmd: -------------------------------------------------------------------------------- 1 | # Visualizing Spatial Data with GMT 2 | 3 | GMT is the "Generic Mapping Tools", a powerful toolset for geospatial visualizations. With great power comes a great learning curve. 4 | The goal here is to make it possible for someone who has a data analysis background but not much specific knowledge of geospatial software to 5 | make 2D maps that show their data, and simultaneously to learn something about the framework, what it is capable of, and how to 6 | use it. 7 | 8 | # Problem 1: a Choropleth map of US Census Data 9 | 10 | One of the most common tasks in basic geospatial data analysis is to create a map in which regions of the world are colored based on 11 | the value of some statistic pertaining to the region. We will start with a very simple map of population for each county in the US.
11 | 12 | The data for this can be found from the Census Bureau at https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv 13 | 14 | ```julia 15 | using CSV,DataFrames,GMT,Printf,DataFramesMeta,Downloads 16 | 17 | countypopurl = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv" 18 | 19 | Downloads.download(countypopurl,"data/countypops.csv") 20 | 21 | cpopall = CSV.read("data/countypops.csv",DataFrame) 22 | 23 | cpop = cpopall[:,[:STATE,:COUNTY,:STNAME,:CTYNAME,:POPESTIMATE2020]] 24 | 25 | ``` 26 | 27 | Now we have a large number of different estimates for each county in the US. For simplicity let's use the POPESTIMATE2020 variable. The COUNTY 28 | variable is the FIPS county code for the county. 29 | 30 | In order to build a Choropleth map, we will need to have a file that defines the spatial boundaries of each county! GMT is a generic mapping tool: 31 | it will read various geospatial data formats which can be used to define the regions to be plotted. The Census Bureau has a file which defines 32 | the boundaries of the counties at 3 different resolutions. These are in "shapefile" format. The files and other related shapefiles are available 33 | [at the Census website](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html). We will use the medium 34 | resolution version of the county shapefiles (the "5m" in the filename, meaning 1:5,000,000 scale). 35 | 36 | ```julia{eval=false} 37 | countyshapeurl = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip" 38 | 39 | download(countyshapeurl,"data/cb_2018_us_county_5m.zip") 40 | cd("data") 41 | run(`unzip cb_2018_us_county_5m.zip`) 42 | cd("..") 43 | 44 | ``` 45 | 46 | Inside the zip file are a number of files; the one ending in ".shp" is the shapefile itself.
We can read this using GMT's function 47 | `gmtread` 48 | 49 | ```julia 50 | counties = gmtread("data/cb_2018_us_county_5m.shp") 51 | ``` 52 | 53 | Now, what kind of thing is "counties"? 54 | 55 | ```julia 56 | typeof(counties) 57 | ``` 58 | 59 | It's a Vector{GMTdataset}. Basically GMT treats the shape file as a collection of shapes; each shape gets its own GMTdataset, which is a structure: 60 | 61 | ```{eval=false} 62 | search: GMTdataset 63 | 64 | No documentation found. 65 | 66 | Summary 67 | ≡≡≡≡≡≡≡≡≡ 68 | 69 | mutable struct GMTdataset{T<:Real, N} 70 | 71 | Fields 72 | ≡≡≡≡≡≡≡≡ 73 | 74 | data :: Array{T<:Real, N} 75 | ds_bbox :: Vector{Float64} 76 | bbox :: Vector{Float64} 77 | attrib :: Dict{String, String} 78 | colnames :: Vector{String} 79 | text :: Vector{String} 80 | 81 | geom :: Int64 82 | 83 | Supertype Hierarchy 84 | ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡ 85 | 86 | GMTdataset{T<:Real, N} <: AbstractArray{T<:Real, N} <: Any 87 | ``` 88 | 89 | In addition to "data", which is latitude and longitude values in this case, there are a variety of metadata fields, one of which is "header". 90 | 91 | Let's look at the header of the first county: 92 | ```julia 93 | counties[1].header 94 | counties[1].attrib 95 | ``` 96 | 97 | This shows that the header is a comma-separated values string. We want to "join" our data to these counties such that the county FIPS identifier 98 | is the same. The FIPS identifier is a unique numerical code assigned to each county (the names are not necessarily unique). 99 | 100 | In this case, we have the value 39 being the FIPS code for the state of Ohio, and 071 being the county FIPS code for Highland County. 101 | 102 | So we'd like to join field 2 of the header to the COUNTY column of our population data. 103 | However our population data represents the COUNTY field as a number, not a zero-padded string.
Let's fix that then do the join 104 | 105 | 106 | ```julia 107 | 108 | cpop.COUNTY = Printf.format.(Ref(Printf.Format("%03d")),cpop.COUNTY) 109 | cpop.STATE = Printf.format.(Ref(Printf.Format("%02d")),cpop.STATE) 110 | 111 | cpop = cpop[cpop.COUNTY .!= "000",:] ## eliminate county "000" which is "the whole state" for each state 112 | 113 | dfc = DataFrame(STATE = map(x->x.attrib["STATEFP"],counties),COUNTY=map(x -> x.attrib["COUNTYFP"],counties),ORDER=1:length(counties)) 114 | 115 | 116 | joineddata = @chain leftjoin(dfc,cpop,on= [:STATE,:COUNTY],makeunique=true) begin 117 | @orderby(:ORDER) 118 | end 119 | 120 | cptvallog = makecpt(range=(log(1000),log(11e6)),C=:plasma) 121 | 122 | #imshow(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),proj=:guess,colorbar=true) 123 | 124 | ``` 125 | 126 | There are missing values of POPESTIMATE2020 so let's replace them with NaN so we at least still get the polygon drawn. GMT will draw NaN 127 | as a special color. 128 | 129 | ```julia 130 | joineddata.POPESTIMATE2020 = replace(joineddata.POPESTIMATE2020,missing => NaN) 131 | 132 | 133 | 134 | GMT.plot(counties,level=log.(1.0 .+ joineddata.POPESTIMATE2020),cmap=cptvallog,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), 135 | proj=:guess,colorbar=true,figname="choroplethlog.png",title="Continental US County log(Population)") 136 | ``` 137 | ![Choropleth of log(population) at county level](choroplethlog.png) 138 | 139 | 140 | Let's see what it looks like if we don't take the logarithm... 
141 | 142 | 143 | ```julia 144 | cptval = makecpt(range=(0,11_000_000),C=:plasma) 145 | 146 | GMT.plot(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), 147 | proj=:guess,colorbar=true,figname="choroplethnolog.png",title="Continental US County Population") 148 | ``` 149 | 150 | ![Continental Population](choroplethnolog.png) 151 | 152 | It may be interesting to work further with the polygons that represent the counties in some additional ways. For example we could calculate the population 153 | density by dividing the population by the area of the polygon. We can calculate the area using the function gmtspatial. 154 | 155 | ```julia 156 | 157 | measures = gmtspatial(counties, area="k")[1] # to get area in km^2 158 | print(typeof(measures)) 159 | length(measures) 160 | print(measures) 161 | ``` 162 | 163 | But some counties consist of several polygons (some along the coast may include islands, for example), so to get the total area we 164 | need to group by county identifier and sum all the areas.
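On a toy table, that group-and-sum step looks like this (a sketch with made-up polygon areas, using DataFramesMeta's `@chain`):

```julia
using DataFrames, DataFramesMeta

# Two polygons belong to county "003": a mainland piece and an island
polys = DataFrame(COUNTY = ["001", "003", "003"],
                  area   = [100.0, 40.0, 2.5])

totals = @chain polys begin
    groupby(:COUNTY)
    @combine(:totarea = sum(:area))
end
# totals has one row per county: ("001", 100.0) and ("003", 42.5)
```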
165 | 166 | ```julia 167 | countyareas = @chain dfc begin 168 | @transform(:area = measures[:,3]) 169 | groupby([:STATE,:COUNTY]) 170 | @combine(:totarea = sum(:area)) 171 | @select(:STATE,:COUNTY,:totarea) 172 | end 173 | 174 | 175 | cpop = @chain leftjoin(cpop,countyareas,on = [:STATE, :COUNTY]) begin 176 | @transform(:density = :POPESTIMATE2020 ./ :totarea) 177 | @orderby(:STATE,:COUNTY) 178 | end 179 | 180 | print(cpop[1:10,:]) 181 | 182 | joineddata = @chain leftjoin(dfc,cpop,on = [:STATE,:COUNTY]) begin 183 | @orderby(:ORDER) 184 | end 185 | 186 | 187 | 188 | ``` 189 | 190 | ```julia 191 | denscpt = makecpt(range=(0,11e6/(100*100)),C=:plasma) 192 | 193 | 194 | GMT.plot(counties,level=Vector{Float64}(replace(joineddata.density,missing=>NaN)),cmap=denscpt,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), 195 | proj=:guess,colorbar=true,figname="choroplethdens.png",title="Continental US County Pop Density (1/km^2)") 196 | 197 | ``` 198 | 199 | ![Continental US Density](choroplethdens.png) 200 | 201 | 202 | ## Smallest Area Containing ~50% of the population 203 | 204 | Our approach to this question will be to sort the counties by density in decreasing order, cumsum the populations... and then create an indicator 205 | variable for whether the cumsum is less than 50% of the total, then plot this indicator variable. 
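The cumulative-sum trick is easy to check on made-up numbers before applying it to the real data:

```julia
# Populations already sorted by decreasing density
pops = [40, 30, 20, 10]
csum = cumsum(pops)            # [40, 70, 90, 100]
inset = csum .< sum(pops) / 2  # [true, false, false, false]
```

Only the densest county falls inside the "smallest 50%" set here, because adding the second county would already push the cumulative total past half.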
206 | 207 | ```julia 208 | 209 | totpop = sum(cpop[cpop.COUNTY .!= "000",:POPESTIMATE2020]) ## ignore county 0, that's the "whole state" 210 | print(totpop) 211 | 212 | areads = @chain cpop begin 213 | @subset(:COUNTY .!= "000") 214 | @orderby(-:density) 215 | @transform( :csum = cumsum(:POPESTIMATE2020)) 216 | @transform( :inset = :csum .< totpop/2.0,:in80set = :csum .< 0.8 * totpop) 217 | end 218 | 219 | joineddata = @chain leftjoin(dfc,areads,on=[:STATE,:COUNTY]) begin 220 | @orderby(:ORDER) 221 | end 222 | 223 | insetcmap = makecpt(range=(0,1),C=:plasma) 224 | 225 | GMT.plot(counties,level=Float64.(replace(joineddata.inset,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), 226 | proj=:guess,colorbar=true,figname="choroplethhpdset.png",title="Continental US smallest 50% set ") 227 | 228 | ``` 229 | 230 | ![high probability density set 50%](choroplethhpdset.png) 231 | 232 | 233 | ## Smallest Area Containing ~80% of the population 234 | 235 | ```julia 236 | GMT.plot(counties,level=Float64.(replace(joineddata.in80set,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), 237 | proj=:guess,colorbar=true,figname="choroplethhpd80set.png",title="Continental US smallest 80% set ") 238 | 239 | ``` 240 | 241 | ![high probability density set 80%](choroplethhpd80set.png) 242 | 243 | 244 | 245 | -------------------------------------------------------------------------------- /Manifest.toml: -------------------------------------------------------------------------------- 1 | # This file is machine-generated - editing it directly is not advised 2 | 3 | julia_version = "1.7.1" 4 | manifest_format = "2.0" 5 | 6 | [[deps.ArgTools]] 7 | uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f" 8 | 9 | [[deps.Arrow]] 10 | deps = ["CategoricalArrays", "Dates"] 11 | git-tree-sha1 = "c86df6ed41b3bd192d663e5e0e7cac0d11fd4375" 12 | uuid = "69666777-d1a9-59fb-9406-91d4454c9d45" 13 | version = "0.2.4" 14 | 15 | [[deps.Artifacts]] 16 | uuid =
"56f22d72-fd6d-98f1-02f0-08ddc0907c33" 17 | 18 | [[deps.Base64]] 19 | uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" 20 | 21 | [[deps.BinaryProvider]] 22 | deps = ["Libdl", "Logging", "SHA"] 23 | git-tree-sha1 = "ecdec412a9abc8db54c0efc5548c64dfce072058" 24 | uuid = "b99e7846-7c00-51b0-8f62-c81ae34c0232" 25 | version = "0.5.10" 26 | 27 | [[deps.CEnum]] 28 | git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9" 29 | uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82" 30 | version = "0.4.1" 31 | 32 | [[deps.CSV]] 33 | deps = ["CodecZlib", "Dates", "FilePathsBase", "InlineStrings", "Mmap", "Parsers", "PooledArrays", "SentinelArrays", "Tables", "Unicode", "WeakRefStrings"] 34 | git-tree-sha1 = "9519274b50500b8029973d241d32cfbf0b127d97" 35 | uuid = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" 36 | version = "0.10.2" 37 | 38 | [[deps.CSVFiles]] 39 | deps = ["CodecZlib", "DataValues", "FileIO", "HTTP", "IterableTables", "IteratorInterfaceExtensions", "TableShowUtils", "TableTraits", "TableTraitsUtils", "TextParse"] 40 | git-tree-sha1 = "d4dd66b73d3c811daa67587980bf45a179d16983" 41 | uuid = "5d742f6a-9f54-50ce-8119-2520741973ca" 42 | version = "1.0.1" 43 | 44 | [[deps.CategoricalArrays]] 45 | deps = ["DataAPI", "Future", "JSON", "Missings", "Printf", "Statistics", "StructTypes", "Unicode"] 46 | git-tree-sha1 = "2ac27f59196a68070e132b25713f9a5bbc5fa0d2" 47 | uuid = "324d7699-5711-5eae-9e2f-1d82baa6b597" 48 | version = "0.8.3" 49 | 50 | [[deps.Chain]] 51 | git-tree-sha1 = "339237319ef4712e6e5df7758d0bccddf5c237d9" 52 | uuid = "8be319e6-bccf-4806-a6f7-6fae938471bc" 53 | version = "0.4.10" 54 | 55 | [[deps.ChainRulesCore]] 56 | deps = ["Compat", "LinearAlgebra", "SparseArrays"] 57 | git-tree-sha1 = "f9982ef575e19b0e5c7a98c6e75ee496c0f73a93" 58 | uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" 59 | version = "1.12.0" 60 | 61 | [[deps.ChangesOfVariables]] 62 | deps = ["ChainRulesCore", "LinearAlgebra", "Test"] 63 | git-tree-sha1 = "bf98fa45a0a4cee295de98d4c1462be26345b9a1" 64 | uuid = 
"9e997f8a-9a97-42d5-a9f1-ce6bfc15e2c0" 65 | version = "0.1.2" 66 | 67 | [[deps.CodecZlib]] 68 | deps = ["TranscodingStreams", "Zlib_jll"] 69 | git-tree-sha1 = "ded953804d019afa9a3f98981d99b33e3db7b6da" 70 | uuid = "944b1d66-785c-5afd-91f1-9de20f533193" 71 | version = "0.7.0" 72 | 73 | [[deps.CodecZstd]] 74 | deps = ["CEnum", "TranscodingStreams", "Zstd_jll"] 75 | git-tree-sha1 = "849470b337d0fa8449c21061de922386f32949d9" 76 | uuid = "6b39b394-51ab-5f42-8807-6242bab2b4c2" 77 | version = "0.7.2" 78 | 79 | [[deps.Compat]] 80 | deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"] 81 | git-tree-sha1 = "44c37b4636bc54afac5c574d2d02b625349d6582" 82 | uuid = "34da2185-b29b-5c13-b0c7-acf172513d20" 83 | version = "3.41.0" 84 | 85 | [[deps.CompilerSupportLibraries_jll]] 86 | deps = ["Artifacts", "Libdl"] 87 | uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae" 88 | 89 | [[deps.Conda]] 90 | deps = ["Downloads", "JSON", "VersionParsing"] 91 | git-tree-sha1 = "6cdc8832ba11c7695f494c9d9a1c31e90959ce0f" 92 | uuid = "8f4d0f93-b110-5947-807f-2305c1781a2d" 93 | version = "1.6.0" 94 | 95 | [[deps.ConstructionBase]] 96 | deps = ["LinearAlgebra"] 97 | git-tree-sha1 = "f74e9d5388b8620b4cee35d4c5a618dd4dc547f4" 98 | uuid = "187b0558-2788-49d3-abe0-74a17ed4e7c9" 99 | version = "1.3.0" 100 | 101 | [[deps.Crayons]] 102 | git-tree-sha1 = "249fe38abf76d48563e2f4556bebd215aa317e15" 103 | uuid = "a8cc5b0e-0ffa-5ad4-8c14-923d3ee1735f" 104 | version = "4.1.1" 105 | 106 | [[deps.DBInterface]] 107 | git-tree-sha1 = "9b0dc525a052b9269ccc5f7f04d5b3639c65bca5" 108 | uuid = "a10d1c49-ce27-4219-8d33-6db1a4562965" 109 | version = "2.5.0" 110 | 111 | [[deps.DataAPI]] 112 | git-tree-sha1 = "cc70b17275652eb47bc9e5f81635981f13cea5c8" 113 | uuid = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a" 114 | 
version = "1.9.0" 115 | 116 | [[deps.DataFrames]] 117 | deps = ["Compat", "DataAPI", "Future", "InvertedIndices", "IteratorInterfaceExtensions", "LinearAlgebra", "Markdown", "Missings", "PooledArrays", "PrettyTables", "Printf", "REPL", "Reexport", "SortingAlgorithms", "Statistics", "TableTraits", "Tables", "Unicode"] 118 | git-tree-sha1 = "ae02104e835f219b8930c7664b8012c93475c340" 119 | uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" 120 | version = "1.3.2" 121 | 122 | [[deps.DataFramesMeta]] 123 | deps = ["Chain", "DataFrames", "MacroTools", "OrderedCollections", "Reexport"] 124 | git-tree-sha1 = "ab4768d2cc6ab000cd0cec78e8e1ea6b03c7c3e2" 125 | uuid = "1313f7d8-7da2-5740-9ea0-a2ca25f37964" 126 | version = "0.10.0" 127 | 128 | [[deps.DataStructures]] 129 | deps = ["Compat", "InteractiveUtils", "OrderedCollections"] 130 | git-tree-sha1 = "3daef5523dd2e769dad2365274f760ff5f282c7d" 131 | uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8" 132 | version = "0.18.11" 133 | 134 | [[deps.DataTables]] 135 | deps = ["DataValues", "ReadOnlyArrays", "TableShowUtils", "TableTraitsUtils"] 136 | git-tree-sha1 = "9b069372a767fc6142feecc8e6d737d1b1de4711" 137 | uuid = "743a1d0a-8ebc-4f23-814b-50d006366bc6" 138 | version = "0.1.0" 139 | 140 | [[deps.DataValueInterfaces]] 141 | git-tree-sha1 = "bfc1187b79289637fa0ef6d4436ebdfe6905cbd6" 142 | uuid = "e2d170a0-9d28-54be-80f0-106bbe20a464" 143 | version = "1.0.0" 144 | 145 | [[deps.DataValues]] 146 | deps = ["DataValueInterfaces", "Dates"] 147 | git-tree-sha1 = "d88a19299eba280a6d062e135a43f00323ae70bf" 148 | uuid = "e7dc6d0d-1eca-5fa6-8ad6-5aecde8b7ea5" 149 | version = "0.4.13" 150 | 151 | [[deps.DataVoyager]] 152 | deps = ["DataValues", "Electron", "FilePaths", "IterableTables", "IteratorInterfaceExtensions", "JSON", "TableTraits", "Test", "URIParser", "VegaLite"] 153 | git-tree-sha1 = "159f1d3f07225a59dd4edb8ad15e607fefac9543" 154 | uuid = "5721bf48-af8e-5845-8445-c9e18126e773" 155 | version = "1.0.2" 156 | 157 | [[deps.Dates]] 158 | deps = 
["Printf"] 159 | uuid = "ade2ca70-3891-5945-98fb-dc099432e06a" 160 | 161 | [[deps.DelimitedFiles]] 162 | deps = ["Mmap"] 163 | uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab" 164 | 165 | [[deps.DensityInterface]] 166 | deps = ["InverseFunctions", "Test"] 167 | git-tree-sha1 = "80c3e8639e3353e5d2912fb3a1916b8455e2494b" 168 | uuid = "b429d917-457f-4dbc-8f4c-0cc954292b1d" 169 | version = "0.4.0" 170 | 171 | [[deps.Distributed]] 172 | deps = ["Random", "Serialization", "Sockets"] 173 | uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b" 174 | 175 | [[deps.Distributions]] 176 | deps = ["ChainRulesCore", "DensityInterface", "FillArrays", "LinearAlgebra", "PDMats", "Printf", "QuadGK", "Random", "SparseArrays", "SpecialFunctions", "Statistics", "StatsBase", "StatsFuns", "Test"] 177 | git-tree-sha1 = "2e97190dfd4382499a4ac349e8d316491c9db341" 178 | uuid = "31c24e10-a181-5473-b8eb-7969acd0382f" 179 | version = "0.25.46" 180 | 181 | [[deps.DocStringExtensions]] 182 | deps = ["LibGit2"] 183 | git-tree-sha1 = "b19534d1895d702889b219c382a6e18010797f0b" 184 | uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae" 185 | version = "0.8.6" 186 | 187 | [[deps.DoubleFloats]] 188 | deps = ["GenericLinearAlgebra", "LinearAlgebra", "Polynomials", "Printf", "Quadmath", "Random", "Requires", "SpecialFunctions"] 189 | git-tree-sha1 = "70858638bb1b9acb83bc0a29fdb449891a71af84" 190 | uuid = "497a8b3b-efae-58df-a0af-a86822472b78" 191 | version = "1.1.26" 192 | 193 | [[deps.Downloads]] 194 | deps = ["ArgTools", "LibCURL", "NetworkOptions"] 195 | uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6" 196 | 197 | [[deps.Electron]] 198 | deps = ["Base64", "FilePaths", "JSON", "Pkg", "Sockets", "URIParser", "UUIDs"] 199 | git-tree-sha1 = "a53025d3eabe23659065b3c5bba7b4ffb1327aa0" 200 | uuid = "a1bb12fb-d4d1-54b4-b10a-ee7951ef7ad3" 201 | version = "3.1.2" 202 | 203 | [[deps.ExcelFiles]] 204 | deps = ["DataValues", "Dates", "ExcelReaders", "FileIO", "IterableTables", "IteratorInterfaceExtensions", "Printf", "PyCall", 
"TableShowUtils", "TableTraits", "TableTraitsUtils", "XLSX"] 205 | git-tree-sha1 = "f3e5f4279d77b74bf6aef2b53562f771cc5a0474" 206 | uuid = "89b67f3b-d1aa-5f6f-9ca4-282e8d98620d" 207 | version = "1.0.0" 208 | 209 | [[deps.ExcelReaders]] 210 | deps = ["DataValues", "Dates", "PyCall", "Test"] 211 | git-tree-sha1 = "6f9db420dd362bd5bcea3a0f6dabf8bda587fec3" 212 | uuid = "c04bee98-12a5-510c-87df-2a230cb6e075" 213 | version = "0.11.0" 214 | 215 | [[deps.ExprTools]] 216 | git-tree-sha1 = "56559bbef6ca5ea0c0818fa5c90320398a6fbf8d" 217 | uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04" 218 | version = "0.1.8" 219 | 220 | [[deps.EzXML]] 221 | deps = ["Printf", "XML2_jll"] 222 | git-tree-sha1 = "0fa3b52a04a4e210aeb1626def9c90df3ae65268" 223 | uuid = "8f5d6c58-4d21-5cfd-889c-e3ad7ee6a615" 224 | version = "1.1.0" 225 | 226 | [[deps.FeatherFiles]] 227 | deps = ["Arrow", "DataValues", "FeatherLib", "FileIO", "IterableTables", "IteratorInterfaceExtensions", "TableShowUtils", "TableTraits", "TableTraitsUtils", "Test"] 228 | git-tree-sha1 = "a2f2b57b23be259d7839bebae2b8f7bba4851a9b" 229 | uuid = "b675d258-116a-5741-b937-b79f054b0542" 230 | version = "0.8.1" 231 | 232 | [[deps.FeatherLib]] 233 | deps = ["Arrow", "CategoricalArrays", "Dates", "FlatBuffers", "Mmap", "Random"] 234 | git-tree-sha1 = "a3d0c5ca2f08bc8fae4394775f371f8e032149ab" 235 | uuid = "409f5150-fb84-534f-94db-80d1e10f57e1" 236 | version = "0.2.0" 237 | 238 | [[deps.FileIO]] 239 | deps = ["Pkg", "Requires", "UUIDs"] 240 | git-tree-sha1 = "67551df041955cc6ee2ed098718c8fcd7fc7aebe" 241 | uuid = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549" 242 | version = "1.12.0" 243 | 244 | [[deps.FilePaths]] 245 | deps = ["FilePathsBase", "MacroTools", "Reexport", "Requires"] 246 | git-tree-sha1 = "919d9412dbf53a2e6fe74af62a73ceed0bce0629" 247 | uuid = "8fc22ac5-c921-52a6-82fd-178b2807b824" 248 | version = "0.8.3" 249 | 250 | [[deps.FilePathsBase]] 251 | deps = ["Compat", "Dates", "Mmap", "Printf", "Test", "UUIDs"] 252 | git-tree-sha1 = 
"04d13bfa8ef11720c24e4d840c0033d145537df7" 253 | uuid = "48062228-2e41-5def-b9a4-89aafe57970f" 254 | version = "0.9.17" 255 | 256 | [[deps.FillArrays]] 257 | deps = ["LinearAlgebra", "Random", "SparseArrays", "Statistics"] 258 | git-tree-sha1 = "8756f9935b7ccc9064c6eef0bff0ad643df733a3" 259 | uuid = "1a297f60-69ca-5386-bcde-b61e274b549b" 260 | version = "0.12.7" 261 | 262 | [[deps.FlatBuffers]] 263 | deps = ["Parameters", "Test"] 264 | git-tree-sha1 = "8582924ac52011d08da9cf1e67f13a71dbbc2594" 265 | uuid = "53afe959-3a16-52fa-a8da-cf864710bae9" 266 | version = "0.5.4" 267 | 268 | [[deps.Formatting]] 269 | deps = ["Printf"] 270 | git-tree-sha1 = "8339d61043228fdd3eb658d86c926cb282ae72a8" 271 | uuid = "59287772-0a20-5a39-b81b-1366585eb4c0" 272 | version = "0.4.2" 273 | 274 | [[deps.Future]] 275 | deps = ["Random"] 276 | uuid = "9fa8497b-333b-5362-9e8d-4d0656e87820" 277 | 278 | [[deps.GLM]] 279 | deps = ["Distributions", "LinearAlgebra", "Printf", "Reexport", "SparseArrays", "SpecialFunctions", "Statistics", "StatsBase", "StatsFuns", "StatsModels"] 280 | git-tree-sha1 = "fb764dacfa30f948d52a6a4269ae293a479bbc62" 281 | uuid = "38e38edf-8417-5370-95a0-9cbb8c7f171a" 282 | version = "1.6.1" 283 | 284 | [[deps.GMT]] 285 | deps = ["Conda", "Dates", "Pkg", "Printf", "Statistics"] 286 | git-tree-sha1 = "d1068159f18828ec1831efd34f1880f84a792b98" 287 | uuid = "5752ebe1-31b9-557e-87aa-f909b540aa54" 288 | version = "0.40.1" 289 | 290 | [[deps.GenericLinearAlgebra]] 291 | deps = ["LinearAlgebra", "Printf", "Random"] 292 | git-tree-sha1 = "ac44f4f51ffee9ff1ea50bd3fbb5677ea568d33d" 293 | uuid = "14197337-ba66-59df-a3e3-ca00e7dcff7a" 294 | version = "0.2.7" 295 | 296 | [[deps.HTTP]] 297 | deps = ["Base64", "Dates", "IniFile", "Logging", "MbedTLS", "NetworkOptions", "Sockets", "URIs"] 298 | git-tree-sha1 = "0fa77022fe4b511826b39c894c90daf5fce3334a" 299 | uuid = "cd3eb016-35fb-5094-929b-558a96fad6f3" 300 | version = "0.9.17" 301 | 302 | [[deps.Highlights]] 303 | deps = 
["DocStringExtensions", "InteractiveUtils", "REPL"] 304 | git-tree-sha1 = "f823a2d04fb233d52812c8024a6d46d9581904a4" 305 | uuid = "eafb193a-b7ab-5a9e-9068-77385905fa72" 306 | version = "0.4.5" 307 | 308 | [[deps.IniFile]] 309 | deps = ["Test"] 310 | git-tree-sha1 = "098e4d2c533924c921f9f9847274f2ad89e018b8" 311 | uuid = "83e8ac13-25f8-5344-8a64-a9f2b223428f" 312 | version = "0.5.0" 313 | 314 | [[deps.InlineStrings]] 315 | deps = ["Parsers"] 316 | git-tree-sha1 = "61feba885fac3a407465726d0c330b3055df897f" 317 | uuid = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48" 318 | version = "1.1.2" 319 | 320 | [[deps.InteractiveUtils]] 321 | deps = ["Markdown"] 322 | uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" 323 | 324 | [[deps.Intervals]] 325 | deps = ["Dates", "Printf", "RecipesBase", "Serialization", "TimeZones"] 326 | git-tree-sha1 = "323a38ed1952d30586d0fe03412cde9399d3618b" 327 | uuid = "d8418881-c3e1-53bb-8760-2df7ec849ed5" 328 | version = "1.5.0" 329 | 330 | [[deps.InverseFunctions]] 331 | deps = ["Test"] 332 | git-tree-sha1 = "a7254c0acd8e62f1ac75ad24d5db43f5f19f3c65" 333 | uuid = "3587e190-3f89-42d0-90ee-14403ec27112" 334 | version = "0.1.2" 335 | 336 | [[deps.InvertedIndices]] 337 | git-tree-sha1 = "bee5f1ef5bf65df56bdd2e40447590b272a5471f" 338 | uuid = "41ab1584-1d38-5bbf-9106-f11c6c58b48f" 339 | version = "1.1.0" 340 | 341 | [[deps.IrrationalConstants]] 342 | git-tree-sha1 = "7fd44fd4ff43fc60815f8e764c0f352b83c49151" 343 | uuid = "92d709cd-6900-40b7-9082-c6be49f344b6" 344 | version = "0.1.1" 345 | 346 | [[deps.IterableTables]] 347 | deps = ["DataValues", "IteratorInterfaceExtensions", "Requires", "TableTraits", "TableTraitsUtils"] 348 | git-tree-sha1 = "70300b876b2cebde43ebc0df42bc8c94a144e1b4" 349 | uuid = "1c8ee90f-4401-5389-894e-7a04a3dc0f4d" 350 | version = "1.0.0" 351 | 352 | [[deps.IteratorInterfaceExtensions]] 353 | git-tree-sha1 = "a3f24677c21f5bbe9d2a714f95dcd58337fb2856" 354 | uuid = "82899510-4779-5014-852e-03e436cf321d" 355 | version = "1.0.0" 356 | 357 | 
[[deps.JLLWrappers]] 358 | deps = ["Preferences"] 359 | git-tree-sha1 = "abc9885a7ca2052a736a600f7fa66209f96506e1" 360 | uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210" 361 | version = "1.4.1" 362 | 363 | [[deps.JSON]] 364 | deps = ["Dates", "Mmap", "Parsers", "Unicode"] 365 | git-tree-sha1 = "8076680b162ada2a031f707ac7b4953e30667a37" 366 | uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6" 367 | version = "0.21.2" 368 | 369 | [[deps.JSONSchema]] 370 | deps = ["HTTP", "JSON", "URIs"] 371 | git-tree-sha1 = "2f49f7f86762a0fbbeef84912265a1ae61c4ef80" 372 | uuid = "7d188eb4-7ad8-530c-ae41-71a32a6d4692" 373 | version = "0.3.4" 374 | 375 | [[deps.LazyArtifacts]] 376 | deps = ["Artifacts", "Pkg"] 377 | uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3" 378 | 379 | [[deps.LibCURL]] 380 | deps = ["LibCURL_jll", "MozillaCACerts_jll"] 381 | uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21" 382 | 383 | [[deps.LibCURL_jll]] 384 | deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"] 385 | uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0" 386 | 387 | [[deps.LibGit2]] 388 | deps = ["Base64", "NetworkOptions", "Printf", "SHA"] 389 | uuid = "76f85450-5226-5b5a-8eaa-529ad045b433" 390 | 391 | [[deps.LibSSH2_jll]] 392 | deps = ["Artifacts", "Libdl", "MbedTLS_jll"] 393 | uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8" 394 | 395 | [[deps.Libdl]] 396 | uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb" 397 | 398 | [[deps.Libiconv_jll]] 399 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 400 | git-tree-sha1 = "42b62845d70a619f063a7da093d995ec8e15e778" 401 | uuid = "94ce4f54-9a6c-5748-9c1c-f9c7231a4531" 402 | version = "1.16.1+1" 403 | 404 | [[deps.LinearAlgebra]] 405 | deps = ["Libdl", "libblastrampoline_jll"] 406 | uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" 407 | 408 | [[deps.LogExpFunctions]] 409 | deps = ["ChainRulesCore", "ChangesOfVariables", "DocStringExtensions", "InverseFunctions", "IrrationalConstants", "LinearAlgebra"] 410 | git-tree-sha1 = 
"e5718a00af0ab9756305a0392832c8952c7426c1" 411 | uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688" 412 | version = "0.3.6" 413 | 414 | [[deps.Logging]] 415 | uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" 416 | 417 | [[deps.MacroTools]] 418 | deps = ["Markdown", "Random"] 419 | git-tree-sha1 = "3d3e902b31198a27340d0bf00d6ac452866021cf" 420 | uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09" 421 | version = "0.5.9" 422 | 423 | [[deps.Markdown]] 424 | deps = ["Base64"] 425 | uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" 426 | 427 | [[deps.MbedTLS]] 428 | deps = ["Dates", "MbedTLS_jll", "Random", "Sockets"] 429 | git-tree-sha1 = "1c38e51c3d08ef2278062ebceade0e46cefc96fe" 430 | uuid = "739be429-bea8-5141-9913-cc70e7f3736d" 431 | version = "1.0.3" 432 | 433 | [[deps.MbedTLS_jll]] 434 | deps = ["Artifacts", "Libdl"] 435 | uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1" 436 | 437 | [[deps.MemPool]] 438 | deps = ["DataStructures", "Distributed", "Mmap", "Random", "Serialization", "Sockets", "Test"] 439 | git-tree-sha1 = "d52799152697059353a8eac1000d32ba8d92aa25" 440 | uuid = "f9f48841-c794-520a-933b-121f7ba6ed94" 441 | version = "0.2.0" 442 | 443 | [[deps.Missings]] 444 | deps = ["DataAPI"] 445 | git-tree-sha1 = "f8c673ccc215eb50fcadb285f522420e29e69e1c" 446 | uuid = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28" 447 | version = "0.4.5" 448 | 449 | [[deps.Mmap]] 450 | uuid = "a63ad114-7e13-5084-954f-fe012c677804" 451 | 452 | [[deps.Mocking]] 453 | deps = ["Compat", "ExprTools"] 454 | git-tree-sha1 = "29714d0a7a8083bba8427a4fbfb00a540c681ce7" 455 | uuid = "78c3b35d-d492-501b-9361-3d52fe80e533" 456 | version = "0.7.3" 457 | 458 | [[deps.MozillaCACerts_jll]] 459 | uuid = "14a3606d-f60d-562e-9121-12d972cd8159" 460 | 461 | [[deps.Mustache]] 462 | deps = ["Printf", "Tables"] 463 | git-tree-sha1 = "21d7a05c3b94bcf45af67beccab4f2a1f4a3c30a" 464 | uuid = "ffc61752-8dc7-55ee-8c37-f3e9cdd09e70" 465 | version = "1.0.12" 466 | 467 | [[deps.MutableArithmetics]] 468 | deps = ["LinearAlgebra", 
"SparseArrays", "Test"] 469 | git-tree-sha1 = "73deac2cbae0820f43971fad6c08f6c4f2784ff2" 470 | uuid = "d8a4904e-b15c-11e9-3269-09a3773c0cb0" 471 | version = "0.3.2" 472 | 473 | [[deps.NetworkOptions]] 474 | uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908" 475 | 476 | [[deps.NodeJS]] 477 | deps = ["Pkg"] 478 | git-tree-sha1 = "905224bbdd4b555c69bb964514cfa387616f0d3a" 479 | uuid = "2bd173c7-0d6d-553b-b6af-13a54713934c" 480 | version = "1.3.0" 481 | 482 | [[deps.Nullables]] 483 | git-tree-sha1 = "8f87854cc8f3685a60689d8edecaa29d2251979b" 484 | uuid = "4d1e1d77-625e-5b40-9113-a560ec7a8ecd" 485 | version = "1.0.0" 486 | 487 | [[deps.OpenBLAS_jll]] 488 | deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"] 489 | uuid = "4536629a-c528-5b80-bd46-f80d51c5b363" 490 | 491 | [[deps.OpenLibm_jll]] 492 | deps = ["Artifacts", "Libdl"] 493 | uuid = "05823500-19ac-5b8b-9628-191a04bc5112" 494 | 495 | [[deps.OpenSpecFun_jll]] 496 | deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"] 497 | git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1" 498 | uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e" 499 | version = "0.5.5+0" 500 | 501 | [[deps.OrderedCollections]] 502 | git-tree-sha1 = "85f8e6578bf1f9ee0d11e7bb1b1456435479d47c" 503 | uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d" 504 | version = "1.4.1" 505 | 506 | [[deps.PDMats]] 507 | deps = ["LinearAlgebra", "SparseArrays", "SuiteSparse"] 508 | git-tree-sha1 = "ee26b350276c51697c9c2d88a072b339f9f03d73" 509 | uuid = "90014a1f-27ba-587c-ab20-58faa44d9150" 510 | version = "0.11.5" 511 | 512 | [[deps.Parameters]] 513 | deps = ["OrderedCollections", "UnPack"] 514 | git-tree-sha1 = "34c0e9ad262e5f7fc75b10a9952ca7692cfc5fbe" 515 | uuid = "d96e819e-fc66-5662-9728-84c9c7592b0a" 516 | version = "0.12.3" 517 | 518 | [[deps.Parquet]] 519 | deps = ["CodecZlib", "CodecZstd", "Dates", "MemPool", "ProtoBuf", "Snappy", "Thrift"] 520 | git-tree-sha1 = "3dc3ed38c932f5e00d75a5af354438c6b80d973d" 521 | uuid = 
"626c502c-15b0-58ad-a749-f091afb673ae" 522 | version = "0.4.0" 523 | 524 | [[deps.ParquetFiles]] 525 | deps = ["DataValues", "FileIO", "IterableTables", "IteratorInterfaceExtensions", "Parquet", "TableShowUtils", "TableTraits", "Test"] 526 | git-tree-sha1 = "7b4414214f41e2ae7844ea827bfd4ec7ae71e749" 527 | uuid = "46a55296-af5a-53b0-aaa0-97023b66127f" 528 | version = "0.2.0" 529 | 530 | [[deps.Parsers]] 531 | deps = ["Dates"] 532 | git-tree-sha1 = "0b5cfbb704034b5b4c1869e36634438a047df065" 533 | uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0" 534 | version = "2.2.1" 535 | 536 | [[deps.Pkg]] 537 | deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"] 538 | uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f" 539 | 540 | [[deps.Polynomials]] 541 | deps = ["Intervals", "LinearAlgebra", "MutableArithmetics", "RecipesBase"] 542 | git-tree-sha1 = "f184bc53e9add8c737e50fa82885bc3f7d70f628" 543 | uuid = "f27b6e38-b328-58d1-80ce-0feddd5e7a45" 544 | version = "2.0.24" 545 | 546 | [[deps.PooledArrays]] 547 | deps = ["DataAPI", "Future"] 548 | git-tree-sha1 = "db3a23166af8aebf4db5ef87ac5b00d36eb771e2" 549 | uuid = "2dfb63ee-cc39-5dd5-95bd-886bf059d720" 550 | version = "1.4.0" 551 | 552 | [[deps.Preferences]] 553 | deps = ["TOML"] 554 | git-tree-sha1 = "2cf929d64681236a2e074ffafb8d568733d2e6af" 555 | uuid = "21216c6a-2e73-6563-6e65-726566657250" 556 | version = "1.2.3" 557 | 558 | [[deps.PrettyTables]] 559 | deps = ["Crayons", "Formatting", "Markdown", "Reexport", "Tables"] 560 | git-tree-sha1 = "dfb54c4e414caa595a1f2ed759b160f5a3ddcba5" 561 | uuid = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d" 562 | version = "1.3.1" 563 | 564 | [[deps.Printf]] 565 | deps = ["Unicode"] 566 | uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7" 567 | 568 | [[deps.ProtoBuf]] 569 | git-tree-sha1 = "51b74991da46594fb411a715e7e092bef50b99ff" 570 | uuid = "3349acd9-ac6a-5e09-bcdb-63829b23a429" 571 | 
version = "0.8.0" 572 | 573 | [[deps.PyCall]] 574 | deps = ["Conda", "Dates", "Libdl", "LinearAlgebra", "MacroTools", "Serialization", "VersionParsing"] 575 | git-tree-sha1 = "71fd4022ecd0c6d20180e23ff1b3e05a143959c2" 576 | uuid = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0" 577 | version = "1.93.0" 578 | 579 | [[deps.QuadGK]] 580 | deps = ["DataStructures", "LinearAlgebra"] 581 | git-tree-sha1 = "78aadffb3efd2155af139781b8a8df1ef279ea39" 582 | uuid = "1fd47b50-473d-5c70-9696-f719f8f3bcdc" 583 | version = "2.4.2" 584 | 585 | [[deps.Quadmath]] 586 | deps = ["Printf", "Random", "Requires"] 587 | git-tree-sha1 = "5a8f74af8eae654086a1d058b4ec94ff192e3de0" 588 | uuid = "be4d8f0f-7fa4-5f49-b795-2f01399ab2dd" 589 | version = "0.5.5" 590 | 591 | [[deps.Query]] 592 | deps = ["DataValues", "IterableTables", "MacroTools", "QueryOperators", "Statistics"] 593 | git-tree-sha1 = "a66aa7ca6f5c29f0e303ccef5c8bd55067df9bbe" 594 | uuid = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1" 595 | version = "1.0.0" 596 | 597 | [[deps.QueryOperators]] 598 | deps = ["DataStructures", "DataValues", "IteratorInterfaceExtensions", "TableShowUtils"] 599 | git-tree-sha1 = "911c64c204e7ecabfd1872eb93c49b4e7c701f02" 600 | uuid = "2aef5ad7-51ca-5a8f-8e88-e75cf067b44b" 601 | version = "0.9.3" 602 | 603 | [[deps.Queryverse]] 604 | deps = ["CSVFiles", "DataFrames", "DataTables", "DataValues", "DataVoyager", "ExcelFiles", "FeatherFiles", "FileIO", "IterableTables", "ParquetFiles", "Query", "Reexport", "StatFiles", "VegaLite"] 605 | git-tree-sha1 = "c9654374d9c5bd053c3f286b4c41a0f2b3fe161e" 606 | uuid = "612083be-0b0f-5412-89c1-4e7c75506a58" 607 | version = "0.7.0" 608 | 609 | [[deps.REPL]] 610 | deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"] 611 | uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb" 612 | 613 | [[deps.Random]] 614 | deps = ["SHA", "Serialization"] 615 | uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" 616 | 617 | [[deps.ReadOnlyArrays]] 618 | deps = ["SparseArrays", "Test"] 619 | git-tree-sha1 = 
"65f17072a35c2be7ac8941aeeae489013212e71f" 620 | uuid = "988b38a3-91fc-5605-94a2-ee2116b3bd83" 621 | version = "0.1.1" 622 | 623 | [[deps.ReadStat]] 624 | deps = ["DataValues", "Dates", "ReadStat_jll"] 625 | git-tree-sha1 = "f8652515b68572d3362ee38e32245249413fb2d7" 626 | uuid = "d71aba96-b539-5138-91ee-935c3ee1374c" 627 | version = "1.1.1" 628 | 629 | [[deps.ReadStat_jll]] 630 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Libiconv_jll", "Pkg", "Zlib_jll"] 631 | git-tree-sha1 = "afd287b1031406b3ec5d835a60b388ceb041bb63" 632 | uuid = "a4dc8951-f1cc-5499-9034-9ec1c3e64557" 633 | version = "1.1.5+0" 634 | 635 | [[deps.RecipesBase]] 636 | git-tree-sha1 = "6bf3f380ff52ce0832ddd3a2a7b9538ed1bcca7d" 637 | uuid = "3cdcf5f2-1ef4-517c-9805-6587b60abb01" 638 | version = "1.2.1" 639 | 640 | [[deps.Reexport]] 641 | git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b" 642 | uuid = "189a3867-3050-52da-a836-e630ba90ab69" 643 | version = "1.2.2" 644 | 645 | [[deps.RelocatableFolders]] 646 | deps = ["SHA", "Scratch"] 647 | git-tree-sha1 = "cdbd3b1338c72ce29d9584fdbe9e9b70eeb5adca" 648 | uuid = "05181044-ff0b-4ac5-8273-598c1e38db00" 649 | version = "0.1.3" 650 | 651 | [[deps.Requires]] 652 | deps = ["UUIDs"] 653 | git-tree-sha1 = "838a3a4188e2ded87a4f9f184b4b0d78a1e91cb7" 654 | uuid = "ae029012-a4dd-5104-9daa-d747884805df" 655 | version = "1.3.0" 656 | 657 | [[deps.Rmath]] 658 | deps = ["Random", "Rmath_jll"] 659 | git-tree-sha1 = "bf3188feca147ce108c76ad82c2792c57abe7b1f" 660 | uuid = "79098fc4-a85e-5d69-aa6a-4863f24498fa" 661 | version = "0.7.0" 662 | 663 | [[deps.Rmath_jll]] 664 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 665 | git-tree-sha1 = "68db32dff12bb6127bac73c209881191bf0efbb7" 666 | uuid = "f50d1b31-88e8-58de-be2c-1cc44531875f" 667 | version = "0.3.0+0" 668 | 669 | [[deps.SHA]] 670 | uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" 671 | 672 | [[deps.SQLite]] 673 | deps = ["BinaryProvider", "DBInterface", "Dates", "Libdl", "Random", "SQLite_jll", 
"Serialization", "Tables", "Test", "WeakRefStrings"] 674 | git-tree-sha1 = "8e14d9b200b975e93a0ae0e5d17dea1c262690ee" 675 | uuid = "0aa819cd-b072-5ff4-a722-6bc24af294d9" 676 | version = "1.4.0" 677 | 678 | [[deps.SQLite_jll]] 679 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg", "Zlib_jll"] 680 | git-tree-sha1 = "cca82caa0b6bf7f0bc977e69063c0cf5d7da36e5" 681 | uuid = "76ed43ae-9a5d-5a62-8c75-30186b810ce8" 682 | version = "3.37.0+0" 683 | 684 | [[deps.Scratch]] 685 | deps = ["Dates"] 686 | git-tree-sha1 = "0b4b7f1393cff97c33891da2a0bf69c6ed241fda" 687 | uuid = "6c6a2e73-6563-6170-7368-637461726353" 688 | version = "1.1.0" 689 | 690 | [[deps.SentinelArrays]] 691 | deps = ["Dates", "Random"] 692 | git-tree-sha1 = "15dfe6b103c2a993be24404124b8791a09460983" 693 | uuid = "91c51154-3ec4-41a3-a24f-3f23e20d615c" 694 | version = "1.3.11" 695 | 696 | [[deps.Serialization]] 697 | uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" 698 | 699 | [[deps.Setfield]] 700 | deps = ["ConstructionBase", "Future", "MacroTools", "Requires"] 701 | git-tree-sha1 = "fca29e68c5062722b5b4435594c3d1ba557072a3" 702 | uuid = "efcf1570-3423-57d1-acb7-fd33fddbac46" 703 | version = "0.7.1" 704 | 705 | [[deps.SharedArrays]] 706 | deps = ["Distributed", "Mmap", "Random", "Serialization"] 707 | uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383" 708 | 709 | [[deps.ShiftedArrays]] 710 | git-tree-sha1 = "22395afdcf37d6709a5a0766cc4a5ca52cb85ea0" 711 | uuid = "1277b4bf-5013-50f5-be3d-901d8477a67a" 712 | version = "1.0.0" 713 | 714 | [[deps.Snappy]] 715 | deps = ["BinaryProvider", "Libdl", "Random", "Test"] 716 | git-tree-sha1 = "25620a91907972a05863941d6028791c2613888e" 717 | uuid = "59d4ed8c-697a-5b28-a4c7-fe95c22820f9" 718 | version = "0.3.0" 719 | 720 | [[deps.Sockets]] 721 | uuid = "6462fe0b-24de-5631-8697-dd941f90decc" 722 | 723 | [[deps.SortingAlgorithms]] 724 | deps = ["DataStructures"] 725 | git-tree-sha1 = "b3363d7460f7d098ca0912c69b082f75625d7508" 726 | uuid = "a2af1166-a08f-5f64-846c-94a0d3cef48c" 727 
| version = "1.0.1" 728 | 729 | [[deps.SparseArrays]] 730 | deps = ["LinearAlgebra", "Random"] 731 | uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" 732 | 733 | [[deps.SpecialFunctions]] 734 | deps = ["ChainRulesCore", "IrrationalConstants", "LogExpFunctions", "OpenLibm_jll", "OpenSpecFun_jll"] 735 | git-tree-sha1 = "a4116accb1c84f0a8e1b9932d873654942b2364b" 736 | uuid = "276daf66-3868-5448-9aa4-cd146d93841b" 737 | version = "2.1.1" 738 | 739 | [[deps.StatFiles]] 740 | deps = ["DataValues", "FileIO", "IterableTables", "IteratorInterfaceExtensions", "ReadStat", "TableShowUtils", "TableTraits", "TableTraitsUtils", "Test"] 741 | git-tree-sha1 = "28466ea10caec61c476a262172319d2edf248187" 742 | uuid = "1463e38c-9381-5320-bcd4-4134955f093a" 743 | version = "0.8.0" 744 | 745 | [[deps.Statistics]] 746 | deps = ["LinearAlgebra", "SparseArrays"] 747 | uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" 748 | 749 | [[deps.StatsAPI]] 750 | git-tree-sha1 = "d88665adc9bcf45903013af0982e2fd05ae3d0a6" 751 | uuid = "82ae8749-77ed-4fe6-ae5f-f523153014b0" 752 | version = "1.2.0" 753 | 754 | [[deps.StatsBase]] 755 | deps = ["DataAPI", "DataStructures", "LinearAlgebra", "LogExpFunctions", "Missings", "Printf", "Random", "SortingAlgorithms", "SparseArrays", "Statistics", "StatsAPI"] 756 | git-tree-sha1 = "51383f2d367eb3b444c961d485c565e4c0cf4ba0" 757 | uuid = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91" 758 | version = "0.33.14" 759 | 760 | [[deps.StatsFuns]] 761 | deps = ["ChainRulesCore", "InverseFunctions", "IrrationalConstants", "LogExpFunctions", "Reexport", "Rmath", "SpecialFunctions"] 762 | git-tree-sha1 = "f35e1879a71cca95f4826a14cdbf0b9e253ed918" 763 | uuid = "4c63d2b9-4356-54db-8cca-17b64c39e42c" 764 | version = "0.9.15" 765 | 766 | [[deps.StatsModels]] 767 | deps = ["DataAPI", "DataStructures", "LinearAlgebra", "Printf", "REPL", "ShiftedArrays", "SparseArrays", "StatsBase", "StatsFuns", "Tables"] 768 | git-tree-sha1 = "677488c295051568b0b79a77a8c44aa86e78b359" 769 | uuid = 
"3eaba693-59b7-5ba5-a881-562e759f1c8d" 770 | version = "0.6.28" 771 | 772 | [[deps.StringEncodings]] 773 | deps = ["Libiconv_jll"] 774 | git-tree-sha1 = "50ccd5ddb00d19392577902f0079267a72c5ab04" 775 | uuid = "69024149-9ee7-55f6-a4c4-859efe599b68" 776 | version = "0.3.5" 777 | 778 | [[deps.StructTypes]] 779 | deps = ["Dates", "UUIDs"] 780 | git-tree-sha1 = "d24a825a95a6d98c385001212dc9020d609f2d4f" 781 | uuid = "856f2bd8-1eba-4b0a-8007-ebc267875bd4" 782 | version = "1.8.1" 783 | 784 | [[deps.SuiteSparse]] 785 | deps = ["Libdl", "LinearAlgebra", "Serialization", "SparseArrays"] 786 | uuid = "4607b0f0-06f3-5cda-b6b1-a6196a1729e9" 787 | 788 | [[deps.TOML]] 789 | deps = ["Dates"] 790 | uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76" 791 | 792 | [[deps.TableShowUtils]] 793 | deps = ["DataValues", "Dates", "JSON", "Markdown", "Test"] 794 | git-tree-sha1 = "14c54e1e96431fb87f0d2f5983f090f1b9d06457" 795 | uuid = "5e66a065-1f0a-5976-b372-e0b8c017ca10" 796 | version = "0.2.5" 797 | 798 | [[deps.TableTraits]] 799 | deps = ["IteratorInterfaceExtensions"] 800 | git-tree-sha1 = "c06b2f539df1c6efa794486abfb6ed2022561a39" 801 | uuid = "3783bdb8-4a98-5b6b-af9a-565f29a5fe9c" 802 | version = "1.0.1" 803 | 804 | [[deps.TableTraitsUtils]] 805 | deps = ["DataValues", "IteratorInterfaceExtensions", "Missings", "TableTraits"] 806 | git-tree-sha1 = "78fecfe140d7abb480b53a44f3f85b6aa373c293" 807 | uuid = "382cd787-c1b6-5bf2-a167-d5b971a19bda" 808 | version = "1.0.2" 809 | 810 | [[deps.Tables]] 811 | deps = ["DataAPI", "DataValueInterfaces", "IteratorInterfaceExtensions", "LinearAlgebra", "TableTraits", "Test"] 812 | git-tree-sha1 = "bb1064c9a84c52e277f1096cf41434b675cd368b" 813 | uuid = "bd369af6-aec1-5ad0-b16a-f7cc5008161c" 814 | version = "1.6.1" 815 | 816 | [[deps.Tar]] 817 | deps = ["ArgTools", "SHA"] 818 | uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e" 819 | 820 | [[deps.Test]] 821 | deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] 822 | uuid = 
"8dfed614-e22c-5e08-85e1-65c5234f0b40" 823 | 824 | [[deps.TextParse]] 825 | deps = ["CodecZlib", "DataStructures", "Dates", "DoubleFloats", "Mmap", "Nullables", "WeakRefStrings"] 826 | git-tree-sha1 = "eb1f4fb185c8644faa2d18d14c72f2c24412415f" 827 | uuid = "e0df1984-e451-5cb5-8b61-797a481e67e3" 828 | version = "1.0.2" 829 | 830 | [[deps.Thrift]] 831 | deps = ["BinaryProvider", "Distributed", "Sockets"] 832 | git-tree-sha1 = "c3dd01c6067985a77fef761839203838ac12825b" 833 | uuid = "8d9c9c80-f77e-5080-9541-c6f69d204e22" 834 | version = "0.6.2" 835 | 836 | [[deps.TimeZones]] 837 | deps = ["Dates", "Downloads", "InlineStrings", "LazyArtifacts", "Mocking", "Printf", "RecipesBase", "Serialization", "Unicode"] 838 | git-tree-sha1 = "0f1017f68dc25f1a0cb99f4988f78fe4f2e7955f" 839 | uuid = "f269a46b-ccf7-5d73-abea-4c690281aa53" 840 | version = "1.7.1" 841 | 842 | [[deps.TranscodingStreams]] 843 | deps = ["Random", "Test"] 844 | git-tree-sha1 = "216b95ea110b5972db65aa90f88d8d89dcb8851c" 845 | uuid = "3bb67fe8-82b1-5028-8e26-92a6c54297fa" 846 | version = "0.9.6" 847 | 848 | [[deps.URIParser]] 849 | deps = ["Unicode"] 850 | git-tree-sha1 = "53a9f49546b8d2dd2e688d216421d050c9a31d0d" 851 | uuid = "30578b45-9adc-5946-b283-645ec420af67" 852 | version = "0.4.1" 853 | 854 | [[deps.URIs]] 855 | git-tree-sha1 = "97bbe755a53fe859669cd907f2d96aee8d2c1355" 856 | uuid = "5c2747f8-b7ea-4ff2-ba2e-563bfd36b1d4" 857 | version = "1.3.0" 858 | 859 | [[deps.UUIDs]] 860 | deps = ["Random", "SHA"] 861 | uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4" 862 | 863 | [[deps.UnPack]] 864 | git-tree-sha1 = "387c1f73762231e86e0c9c5443ce3b4a0a9a0c2b" 865 | uuid = "3a884ed6-31ef-47d7-9d2a-63182c4928ed" 866 | version = "1.0.2" 867 | 868 | [[deps.Unicode]] 869 | uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5" 870 | 871 | [[deps.Vega]] 872 | deps = ["DataStructures", "DataValues", "Dates", "FileIO", "FilePaths", "IteratorInterfaceExtensions", "JSON", "JSONSchema", "MacroTools", "NodeJS", "Pkg", "REPL", "Random", 
"Setfield", "TableTraits", "TableTraitsUtils", "URIParser"] 873 | git-tree-sha1 = "43f83d3119a868874d18da6bca0f4b5b6aae53f7" 874 | uuid = "239c3e63-733f-47ad-beb7-a12fde22c578" 875 | version = "2.3.0" 876 | 877 | [[deps.VegaLite]] 878 | deps = ["Base64", "DataStructures", "DataValues", "Dates", "FileIO", "FilePaths", "IteratorInterfaceExtensions", "JSON", "MacroTools", "NodeJS", "Pkg", "REPL", "Random", "TableTraits", "TableTraitsUtils", "URIParser", "Vega"] 879 | git-tree-sha1 = "3e23f28af36da21bfb4acef08b144f92ad205660" 880 | uuid = "112f6efa-9a02-5b7d-90c0-432ed331239a" 881 | version = "2.6.0" 882 | 883 | [[deps.VersionParsing]] 884 | git-tree-sha1 = "58d6e80b4ee071f5efd07fda82cb9fbe17200868" 885 | uuid = "81def892-9a0e-5fdd-b105-ffc91e053289" 886 | version = "1.3.0" 887 | 888 | [[deps.WeakRefStrings]] 889 | deps = ["DataAPI", "InlineStrings", "Parsers"] 890 | git-tree-sha1 = "c69f9da3ff2f4f02e811c3323c22e5dfcb584cfa" 891 | uuid = "ea10d353-3f73-51f8-a26c-33c1cb351aa5" 892 | version = "1.4.1" 893 | 894 | [[deps.Weave]] 895 | deps = ["Base64", "Dates", "Highlights", "JSON", "Markdown", "Mustache", "Pkg", "Printf", "REPL", "RelocatableFolders", "Requires", "Serialization", "YAML"] 896 | git-tree-sha1 = "d62575dcea5aeb2bfdfe3b382d145b65975b5265" 897 | uuid = "44d3d7a6-8a23-5bf8-98c5-b353f8df5ec9" 898 | version = "0.10.10" 899 | 900 | [[deps.XLSX]] 901 | deps = ["Dates", "EzXML", "Printf", "Tables", "ZipFile"] 902 | git-tree-sha1 = "96d05d01d6657583a22410e3ba416c75c72d6e1d" 903 | uuid = "fdbf4ff8-1666-58a4-91e7-1b58723a45e0" 904 | version = "0.7.8" 905 | 906 | [[deps.XML2_jll]] 907 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Libiconv_jll", "Pkg", "Zlib_jll"] 908 | git-tree-sha1 = "1acf5bdf07aa0907e0a37d3718bb88d4b687b74a" 909 | uuid = "02c8fc9c-b97f-50b9-bbe4-9be30ff0a78a" 910 | version = "2.9.12+0" 911 | 912 | [[deps.YAML]] 913 | deps = ["Base64", "Dates", "Printf", "StringEncodings"] 914 | git-tree-sha1 = "3c6e8b9f5cdaaa21340f841653942e1a6b6561e5" 915 | uuid = 
"ddb6d928-2868-570f-bddf-ab3f9cf99eb6" 916 | version = "0.4.7" 917 | 918 | [[deps.ZipFile]] 919 | deps = ["Libdl", "Printf", "Zlib_jll"] 920 | git-tree-sha1 = "3593e69e469d2111389a9bd06bac1f3d730ac6de" 921 | uuid = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea" 922 | version = "0.9.4" 923 | 924 | [[deps.Zlib_jll]] 925 | deps = ["Libdl"] 926 | uuid = "83775a58-1f1d-513f-b197-d71354ab007a" 927 | 928 | [[deps.Zstd_jll]] 929 | deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"] 930 | git-tree-sha1 = "cc4bf3fdde8b7e3e9fa0351bdeedba1cf3b7f6e6" 931 | uuid = "3161d3a3-bdf6-5164-811a-617609db77b4" 932 | version = "1.5.0+0" 933 | 934 | [[deps.libblastrampoline_jll]] 935 | deps = ["Artifacts", "Libdl", "OpenBLAS_jll"] 936 | uuid = "8e850b90-86db-534c-a0d3-1478176c7d93" 937 | 938 | [[deps.nghttp2_jll]] 939 | deps = ["Artifacts", "Libdl"] 940 | uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d" 941 | 942 | [[deps.p7zip_jll]] 943 | deps = ["Artifacts", "Libdl"] 944 | uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0" 945 | -------------------------------------------------------------------------------- /Project.toml: -------------------------------------------------------------------------------- 1 | name = "JuliaDataTutorials" 2 | uuid = "33cdf1e7-ff7b-4b8e-aca7-7f04e2106471" 3 | authors = ["Daniel Lakeland "] 4 | version = "0.1.0" 5 | 6 | [deps] 7 | CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" 8 | DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" 9 | DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964" 10 | Dates = "ade2ca70-3891-5945-98fb-dc099432e06a" 11 | GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a" 12 | GMT = "5752ebe1-31b9-557e-87aa-f909b540aa54" 13 | HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3" 14 | Queryverse = "612083be-0b0f-5412-89c1-4e7c75506a58" 15 | SQLite = "0aa819cd-b072-5ff4-a722-6bc24af294d9" 16 | Weave = "44d3d7a6-8a23-5bf8-98c5-b353f8df5ec9" 17 | -------------------------------------------------------------------------------- /README.md: 
-------------------------------------------------------------------------------- 1 | # JuliaDataTutorials 2 | Tutorials For Data Analysis in Julia 3 | 4 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dlakelan/JuliaDataTutorials/master) 5 | 6 | 7 | -------------------------------------------------------------------------------- /RegressionDiscontinuityQuestion.jmd: -------------------------------------------------------------------------------- 1 | # Another Regression Discontinuity Confusion 2 | 3 | Andrew Gelman gave an 4 | [example of a Regression Discontinuity Analysis](https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/) 5 | in which the original authors found evidence that their fits across the 6 | two regions x < 0 and x > 0 were significantly different, and 7 | concluded that losing a governor's election cut 5-10 years off your 8 | life. 9 | 10 | The relevant data was 11 | [available here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IBKYRX) 12 | but I've downloaded it already and added it to the data directory as 13 | "longevity.csv". 14 | 15 | Here's the plan. We're going to discuss some mathematical concepts (for 16 | example, what it means to have a discontinuity), and then we're 17 | going to try to identify discontinuities and fast transitions in the 18 | given data, as well as in synthetic data where we know an appropriate 19 | signal is present. 20 | 21 | Let's start by loading the data and graphing the raw data: 22 | 23 | ```julia 24 | 25 | using Queryverse, Optim 26 | 27 | longdf = load("data/longevity.csv") |> DataFrame 28 | 29 | longdf |> @vlplot({:point,opacity=.1},x=:margin_pct_1,y=:living_day_post,width=700,height=700) 30 | 31 | ``` 32 | 33 | It would be hard to believe you could be elected *after* you died, or 34 | that you could live many more than 70 years after being elected.
Also 35 | it seems some elections had a 100% win margin... those are 36 | suspicious. So let's just filter that stuff out for ease of 37 | exposition. None of them seems to affect the question at hand. 38 | 39 | At the same time we'll convert the time-scale to years. 40 | 41 | 42 | ```julia 43 | 44 | daypyr = 365.2422 45 | 46 | cleandf = filter(x -> !ismissing(x.living_day_post) && !ismissing(x.margin_pct_1) && x.living_day_post > 0 && x.living_day_post < 70*daypyr && x.margin_pct_1 < 95,longdf) 47 | 48 | cleandf.yrpost = cleandf.living_day_post/daypyr 49 | 50 | 51 | cleandf |> @vlplot({:point,opacity=.1},x=:margin_pct_1,y=:yrpost,width=700,height=700) 52 | 53 | ``` 54 | 55 | ## The practical concept of discontinuity... 56 | 57 | All of the following functions are mathematically continuous: 58 | 59 | ```julia 60 | """ 61 | sigmoid(x,a,b) 62 | 63 | A rising sigmoid function whose transition occurs at a with scale b, 64 | evaluated in a numerically stable way so that we only ever 65 | exponentiate negative numbers. This is a generalized "inverse_logit" 66 | type function. 67 | 68 | """ 69 | function sigmoid(x,a,b) 70 | s = (x-a)/b; 71 | if(s > 0) 72 | return(1.0/(1.0+exp(-s))) 73 | else 74 | return(exp(s)/(1.0+exp(s))) 75 | end 76 | end 77 | 78 | 79 | 80 | 81 | contdf = DataFrame(x=collect(-1.0:.01:1.0)) 82 | contdf.y1 = map( x-> sigmoid(x,0.0,1.0),contdf.x) 83 | contdf.y2 = map( x-> sigmoid(x,0.0,.05),contdf.x) 84 | contdf.y3 = map( x-> sigmoid(x,0.0,.0001),contdf.x) 85 | 86 | contdf |> @vlplot(layers=[],width=700,title="Three continuous functions approaching a discontinuous one") + @vlplot({:line,color="blue"},x=:x,y=:y1) + 87 | @vlplot({:line,color="orange"},x=:x,y=:y2) + @vlplot({:line,color="red"},x=:x,y=:y3) 88 | 89 | ``` 90 | 91 | Practically speaking, however, on the scale where measurements in x 92 | resolve down to .01, the red function is in essence discontinuous 93 | because its value changes extremely rapidly across the least possible 94 | change in x.
In other words, for a data analyst working with real-world 95 | data where all measurements have some discrete, finite 96 | resolution, a discontinuous function can be modeled very effectively by a 97 | rapidly changing continuous function. 98 | 99 | Once we have this insight, then as an analyst we can stop thinking 100 | about whether a function is discontinuous, and instead think about 101 | whether it changes rapidly in a local region or not. In fact, we can 102 | work with a flexible functional form capable of representing slowly 103 | changing or rapidly changing functions, and see if our data shows 104 | evidence of rapid change. 105 | 106 | There are many ways to represent nonlinear functions. One commonly 107 | available tool is the LOESS fit. LOESS in essence fits a line or low-order 108 | polynomial (say quadratic or cubic) to a subset of the data 109 | centered on a point of interest. It does this for many points of 110 | interest, and forms a nonlinear function by connecting the value of 111 | the function at these different points of interest. 112 | 113 | How many points are involved in each local fit is known as the 114 | "bandwidth". In the VegaLite specification this bandwidth is a number 115 | between 0 and 1 representing the fraction of the total data set that 116 | is in use at any given fit point. 117 | 118 | Any flexible family of spline-type functions that is capable of 119 | detecting a rapid change in living_day_post near margin_pct_1 ~ 0 will 120 | do for detecting whether winning an election has an effect on 121 | longevity. We'll fit LOESS curves with 3 different bandwidths (0.2, 122 | 0.1, and 0.05) graphically, using the built-in LOESS function in 123 | VegaLite plotting. As the bandwidth is decreased towards 0, the fit 124 | depends only on the closest few data points, and hence can change 125 | rapidly from one place to another when the data differs in one region 126 | or another.
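
To make the bandwidth idea concrete, here is a minimal toy sketch of what a LOESS-style smoother does. This is an illustration of ours, not the algorithm VegaLite actually runs (which uses a more refined fitting procedure); the name `loess_point` and the demo data are made up for this sketch:

```julia
# Toy local-linear smoother: for each evaluation point x0, fit a
# tricube-weighted straight line to the fraction `bandwidth` of the
# data nearest x0, and report that line's value at x0.
function loess_point(x, y, x0, bandwidth)
    n = length(x)
    k = max(3, round(Int, bandwidth * n))      # points in the local fit
    idx = sortperm(abs.(x .- x0))[1:k]         # the k nearest points
    xs, ys = x[idx], y[idx]
    h = maximum(abs.(xs .- x0))                # local window radius
    w = (1 .- min.(abs.(xs .- x0) ./ h, 1.0) .^ 3) .^ 3  # tricube weights
    X = [ones(k) (xs .- x0)]                   # design for y ≈ a + b*(x - x0)
    a, b = (X' * (w .* X)) \ (X' * (w .* ys))  # weighted least squares
    return a                                   # the local line's value at x0
end

# Demo on noisy data around the sigmoid defined above:
xgrid = collect(-1.0:0.01:1.0)
ynoisy = [sigmoid(x, 0.0, 0.05) for x in xgrid] .+ 0.05 .* randn(length(xgrid))
smoothed = [loess_point(xgrid, ynoisy, x0, 0.2) for x0 in xgrid]
```

Shrinking `bandwidth` makes each local line depend on fewer points, which is exactly why small-bandwidth fits wiggle more.
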
127 | 128 | ```julia 129 | 130 | cleandf |> @vlplot(layer=[],width=700,height=400,title="LOESS of years lived post election against win margin in percentage points\nBandwidths = 0.2, 0.1, 0.05") + @vlplot({:point,opacity=.1},x=:margin_pct_1,y=:yrpost) + 131 | @vlplot({:line,color="blue",opacity=.5},transform=[{loess=:yrpost,on=:margin_pct_1,bandwidth=.2}],x=:margin_pct_1,y=:yrpost)+ 132 | @vlplot({:line,color="orange",opacity=.5},transform=[{loess=:yrpost,on=:margin_pct_1,bandwidth=.1}],x=:margin_pct_1,y=:yrpost)+ 133 | @vlplot({:line,color="red",opacity=.5},transform=[{loess=:yrpost,on=:margin_pct_1,bandwidth=.05}],x=:margin_pct_1,y=:yrpost) 134 | 135 | 136 | ``` 137 | 138 | What this shows is: nothing. As the LOESS fit uses less and less data 139 | due to the shrinking bandwidth, we find not a 140 | discontinuous jump in the function but simply small-scale 141 | oscillations up and down, consistent with an increasingly noisy 142 | fit. The blue curve, which smooths through the noise, appears more 143 | reasonable as an estimate of what to expect than the red curve, which 144 | changes rapidly up and down. Any causal model which somehow explains the 145 | longevity of candidates by saying that winning by 0.1 percentage 146 | points does nothing to your longevity, winning by 0.2 points 147 | adds 4 years relative to 0.1, and winning by 0.3 148 | points subtracts 2 years relative to 0.1 is going to have a 149 | lot of explaining to do. On the other hand, the explanation "this 150 | estimate is noisy and there is no evidence of any meaningful effect" 151 | works just fine. 152 | 153 | Of course this is the raw data. If we believe the referenced analysis, 154 | it just so happens that the people who won the election must have been 155 | much sicker, so that by winning the election, their lifespan was 156 | extended 5 to 10 years, thereby exactly canceling out the effect 157 | of their being much sicker.
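
Since `Optim` was loaded at the top, we can also make "is there a rapid transition?" quantitative by fitting the flexible sigmoid family from earlier to the cleaned data by least squares and inspecting the fitted scale: a genuine near-discontinuity at zero would drive the scale b toward 0 together with a sizeable jump c1. This is a sketch of ours, not part of the original analysis; the additive model form and the starting values are assumptions:

```julia
# Least-squares fit of yrpost ≈ c0 + c1*sigmoid(margin, a, b), reusing
# the sigmoid() defined earlier. We optimize log(b) so the scale stays positive.
xs = collect(skipmissing(cleandf.margin_pct_1))
ys = collect(skipmissing(cleandf.yrpost))

function sse(θ)
    a, logb, c0, c1 = θ
    b = exp(logb)
    return sum((ys[i] - (c0 + c1 * sigmoid(xs[i], a, b)))^2 for i in eachindex(ys))
end

# Start at: transition at 0, scale 1, baseline ≈ a plausible mean, no jump.
res = optimize(sse, [0.0, 0.0, 25.0, 0.0], NelderMead())
abest, logbbest, c0best, c1best = Optim.minimizer(res)
println("transition ≈ $abest, scale ≈ $(exp(logbbest)), jump ≈ $c1best years")
```

If the fitted jump comes out small relative to the spread of the data, or the fitted scale comes out comparable to the whole range of win margins, the sigmoid family agrees with the LOESS picture: no sharp transition worth believing in.
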
158 | 159 | 🤔 160 | 161 | What does it look like when there's a real signal? Let's put a few 162 | signals in place. First we'll just reuse the x values so that the 163 | distribution of x values is the same as our analysis. Then we'll 164 | create pure Normally distributed noise with an appropriate 165 | scale... and add in a signal, which will be plotted in blue. The red 166 | LOESS line will be our estimate. 167 | 168 | ```julia 169 | using Distributions, Random 170 | 171 | 172 | n = nrow(cleandf) 173 | Random.seed!(123) 174 | sdy = std(cleandf.living_day_post/daypyr) 175 | 176 | 177 | #xvals = rand(Normal(meanx,sdx),n) 178 | xvals = cleandf.margin_pct_1 179 | ynoise = rand(Normal(0,sdy),n) 180 | ysig = [if(xvals[i] > 0) 5.0 else 0.0 end for i in eachindex(ynoise)] 181 | 182 | 183 | function plotsig(x,y,n,lab) 184 | spec = DataFrame(x=x,y=y+n,s=y) |> @vlplot(layers=[],width=700,height=500,title=lab)+ 185 | @vlplot({:point,opacity=.1},x=:x,y=:y) + 186 | @vlplot({:line,opacity=1,color="blue"},x=:x,y=:s)+ 187 | @vlplot({:line,color="red"},transform=[{loess=:y,on=:x,bandwidth=.05}],x=:x,y=:y); 188 | return(spec); 189 | end 190 | 191 | 192 | p0 = plotsig(xvals,zeros(n),ynoise,"No signal, pure noise"); 193 | display(p0) 194 | 195 | 196 | p1 = plotsig(xvals,ysig,ynoise,"A reliably identified step-up signal"); 197 | display(p1); 198 | 199 | ysig2 = [3*xvals[i]*exp(-(xvals[i]/5)^2) for i in eachindex(ynoise)] 200 | 201 | p2=plotsig(xvals,ysig2,ynoise,"A reliably identified continuous wavelet"); 202 | display(p2) 203 | 204 | ysig3 = [ if(xvals[i] > 0) 3*xvals[i]*exp(-(xvals[i]/5)^2) else 0.0 end for i in eachindex(ynoise)] 205 | 206 | p3 = plotsig(xvals,ysig3,ynoise,"A reliably identified continuous wavelet x+ only"); 207 | display(p3) 208 | 209 | ``` 210 | 211 | The LOESS methodology using 5% of the data as bandwidth and a similar 212 | number of raw data points clearly identifies the signal in each 213 | case. 
214 | 215 | We conclude that if there were a signal in the raw data from this 216 | study, it would be fairly strongly visible in the LOESS with 5% 217 | bandwidth rather than oscillating wildly around zero on short "time 218 | scales" (in this case vote percentage scale). The reason we can 219 | conclude this is in essence that a true signal couldn't oscillate 220 | rapidly, as once we transitioned across the x=0 boundary, there's 221 | nothing really causally different between people who won by 0.1 222 | percentage points, and people who won by say 0.15 or 0.2 percentage 223 | points. 224 | 225 | -------------------------------------------------------------------------------- /build.jl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/julia 2 | 3 | using Weave 4 | 5 | docs = [ #"0_Prerequisites.jmd", ## don't build this as it installs packages 6 | "BasicDataAndPlots.jmd", 7 | "DiscussionBasicDataAndPlots.jmd", 8 | "COVID-monitoring.jmd" 9 | ]; 10 | 11 | for i in docs 12 | notebook(i); 13 | end 14 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | xlrd==1.1.* 2 | --------------------------------------------------------------------------------