├── License.md
├── app
│   ├── PaperToDigital.pdf
│   └── PaperToDigital.tex
├── ds1
│   ├── .gitignore
│   ├── ds1_present_web.pdf
│   └── ds1_web.tex
├── ds2
│   ├── ds2_present_web.pdf
│   ├── ds2_web.tex
│   └── scraping_assignment_web.txt
├── ds3
│   ├── ds3_present_web.pdf
│   └── ds3_web.tex
├── ds4
│   ├── ds4_present_web.pdf
│   └── ds4_web.tex
├── ds6
│   ├── kmeans.pdf
│   └── kmeans.tex
├── graphs
│   └── ggplot2.md
└── readme.md
/License.md:
--------------------------------------------------------------------------------
1 | [Creative Commons License](http://creativecommons.org/licenses/by-sa/4.0/)
2 | Data Science: Some Basics by [Gaurav Sood](https://github.com/soodoku/data-science) is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
3 | Based on a work at [https://github.com/soodoku/data-science](https://github.com/soodoku/data-science).
4 | Permissions beyond the scope of this license may be available at [http://gsood.com](http://gsood.com).
5 |
--------------------------------------------------------------------------------
/app/PaperToDigital.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soodoku/data-science/0af39968f028b6c507945ebdb38bea8771f49303/app/PaperToDigital.pdf
--------------------------------------------------------------------------------
/app/PaperToDigital.tex:
--------------------------------------------------------------------------------
1 | \documentclass[compress, black]{beamer}
2 | \setbeamercolor{normal text}{fg=black}
3 | \beamertemplatesolidbackgroundcolor{white}
4 | \usecolortheme[named=black]{structure}
5 | \usepackage{caption}
6 | \captionsetup{labelformat=empty}
7 | \setbeamertemplate{navigation symbols}{}
8 | %\usefonttheme{structurebold}
9 | \usepackage[scaled]{helvet}
10 | \renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif
11 | \usepackage[T1]{fontenc}
12 | \usepackage{setspace}
13 | %\usepackage{beamerthemesplit}
14 | \usepackage{graphics}
15 | \usepackage{hyperref}
16 | \usepackage{graphicx}
17 | \usepackage{verbatim}
18 | \usepackage{amssymb}
19 | \usepackage{wrapfig}
20 | \usefonttheme[onlymath]{serif}
21 | \usepackage{cmbright}
22 |
23 | \def\labelitemi{\textemdash}
24 | \setbeamertemplate{frametitle}{
25 | \begin{centering}
26 | \vskip15pt
27 | \insertframetitle
28 | \par
29 | \end{centering}
30 | }
31 | \title[DS]{From Paper to Digital}
32 | \author[Sood]{Gaurav~Sood}
33 | \large
34 | \date[2015]{Spring 2015}
35 | \subject{LearnDS}
36 | \begin{document}
37 | \newcommand{\multilineR}[1]{\begin{tabular}[b]{@{}r@{}}#1\end{tabular}}
38 | \newcommand{\multilineL}[1]{\begin{tabular}[b]{@{}l@{}}#1\end{tabular}}
39 | \newcommand{\multilineC}[1]{\begin{tabular}[b]{@{}c@{}}#1\end{tabular}}
40 |
41 | \newenvironment{large_enum}{
42 | \Large
43 | \begin{itemize}
44 | \setlength{\itemsep}{7pt}
45 | \setlength{\parskip}{0pt}
46 | \setlength{\parsep}{0pt}
47 | }{\end{itemize}}
48 |
49 | \begin{comment}
50 |
51 | setwd(paste0(basedir, "github/data-science/app/"))
52 | tools::texi2dvi("PaperToDigital.tex", pdf=TRUE,clean=TRUE)
53 | setwd(basedir)
54 |
55 | \end{comment}
56 |
57 | \frame
58 | {
59 | \titlepage
60 | }
61 |
62 | \begin{frame}
63 | \frametitle{}
64 | \only<1>{\Large When we think about paper \ldots}
65 | \only<2>{\Large We think about \alert{government offices}}
66 | \only<3>{\centering \scalebox{0.46}{\includegraphics{img/files1.jpg}}}
67 | \only<4>{\centering \scalebox{0.26}{\includegraphics{img/files2.jpg}}}
68 | \only<5>{\centering \scalebox{0.30}{\includegraphics{img/files3.jpg}}}
69 | \only<6>{\Large But paper based storage of information is common}
70 | \only<7>{\Large Libraries and Archives}
71 | \only<8>{\Large Health records}
72 | \only<9>{\Large Receipts \ldots}
73 | \only<10>{\Large Small Businesses}
74 | \only<11>{\Large And it isn't going away (quickly).}
75 | \end{frame}
76 |
77 | \begin{frame}
78 | \frametitle{The Dead Tree Format}
79 | \begin{large_enum}
80 | \item[--]<2-> Accessible only on location
81 | \item[--]<3-> Typically needs help of another human, who may in turn want \alert{money}
82 | \item[--]<4-> Hard to copy, distribute
83 | \item[--]<5-> Flammable
84 | \item[--]<6-> Time consuming to find stuff \\ \normalsize \pause \pause \pause \pause \pause \pause
85 | Google returns results for an average search query in about 0.2 seconds
86 | \item[--]<8-> Hard to analyze, summarize stored information
87 | \item[--]<9-> Hard to track performance, identify anomalous transactions, identify patterns ...
88 | \end{large_enum}
89 | \end{frame}
90 |
91 | \begin{frame}
92 | \frametitle{Solved Problem?}
93 | \begin{large_enum}
94 | \item[--]<2-> Lots of software:
95 | \begin{enumerate}
96 | \item[--]<3->Adobe Professional
97 | \item[--]<4->Abbyy FineReader
98 | \item[--]<5->Tesseract
99 | \end{enumerate}
100 | \item[--]<6->But ...
101 | \begin{enumerate}
102 | \item[--]<7->Still can't handle complex layouts, languages other than English, etc.\\
103 | \only<8>{\normalsize
104 | \begin{quote}``I found that even native OCR software such as \ldots the Abbyy Fine Reader \alert{proved utterly incapable of extracting words from scanned images of the texts}, even when those scanned images were of high quality.''\end{quote}}
105 | \item[--]<9->No information on how well you do (\alert{Quality Metrics}).
106 | \item[--]<10->Not scalable
107 | \end{enumerate}
108 | \end{large_enum}
109 | \end{frame}
110 |
111 | \begin{frame}
112 | \frametitle{How to Convert Squiggles to Bits?}
113 | \begin{large_enum}
114 | \item[--]<2-> Take images of paper
115 | \item[--]<3-> Within images, find where \alert{relevant} text is located
116 | \item[--]<4-> Find out how the text is laid out
117 | \item[--]<5-> Recognize the characters
118 | \end{large_enum}
119 | \end{frame}
120 |
121 | \begin{frame}
122 | \frametitle{Thus Performance Depends on...}
123 | \begin{large_enum}
124 | \item[--]<1->Quality of the scan: spine, contrast etc.
125 | \only<2>{\scalebox{0.6}{\includegraphics{ScannedBook3.png}}}
126 | \item[--]<3->Complexity of the layout
127 | \only<4>{\scalebox{0.35}{\includegraphics{ScannedBook2.png}}}
128 | \only<5>{\scalebox{0.6}{\includegraphics{ScannedBook4.png}}}
129 | \item[--]<6->Font
130 | \item[--]<7->Language
131 | \item[--]<8->Hardware and Software (duh!)
132 | \end{large_enum}
133 | \end{frame}
134 |
135 | \begin{frame}
136 | \frametitle{OCR}
137 | \begin{large_enum}
138 | \item[--]<1->Make images
139 | \item[--]<2->Detect Text \\ \pause
140 | \only<3->{\scalebox{1}{\includegraphics{TextArea.png}}}
141 | \item[--]<4->Segment ``Characters''\\ \pause \pause
142 | \only<5->{\scalebox{1}{\includegraphics{CharacterBoxes.png}}}
143 | \item[--]<6->Classify ``Characters''\\ \pause
144 | \only<7->{\scalebox{1}{\includegraphics{recognize2.png}}}
145 | \end{large_enum}
146 | \end{frame}
147 |
148 | \begin{frame}
149 | \frametitle{Mechanics}
150 | \begin{large_enum}
151 | \item[--]<1->Detect Text
152 | \begin{enumerate}
153 | \item[--]<2-> Supervised Learning
154 | \item[--]<3-> Blobs with text, Blobs without
155 | \item[--]<4-> But size of a blob is an issue
156 | \end{enumerate}
157 | \item[--]<5->Character Segmentation
158 | \begin{enumerate}
159 | \item[--]<6-> Supervised Learning
160 | \item[--]<7-> Letters (and \alert{Ligatures}) versus Splits
161 | \end{enumerate}
162 | \item[--]<8->Classify Characters (and Ligatures)
163 | \begin{enumerate}
164 | \item[--]<9-> Supervised Learning
165 | \item[--]<10-> A versus B versus C...
166 | \end{enumerate}
167 | \end{large_enum}
168 | \end{frame}
169 |
170 | \begin{frame}
171 | \frametitle{Supervised Learning}
172 | \begin{large_enum}
173 | \item[--]<1->Classified (training) data
174 | \item[--]<2->Estimate a model\\ \pause \normalsize
175 | \only<3>{$\mathrm{logit}\,[p(\mathrm{spam})] = \alpha + f'\beta$, where $f$ is a vector of word frequencies.\\}
176 | \only<4>{Predict class (e.g. Blobs with or without text) using features (pixel by pixel rgb)\\
177 | Use cross-validation to tune the parameters}
178 | \item[--]<5->Predict classes of unseen data (groups of pixels)
179 | \end{large_enum}
180 | \end{frame}
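\begin{frame}[fragile]
\frametitle{Supervised Learning: A Sketch}
A minimal sketch of the workflow on the previous slide, using scikit-learn; the feature matrix and labels are random stand-ins for real classified data.
\small
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: rows = blobs, columns = pixel features;
# y = 1 if a blob contains text, 0 otherwise.
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, 200)

clf = LogisticRegression()

# Cross-validation to gauge out-of-sample accuracy
# (and, in practice, to tune parameters such as C).
scores = cross_val_score(clf, X, y, cv=5)
print scores.mean()

# Predict classes of unseen data.
clf.fit(X, y)
print clf.predict(np.random.rand(5, 50))
\end{verbatim}
\end{frame}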
181 |
182 | \begin{frame}
183 | \frametitle{Paper to Digital Pipeline}
184 | \begin{large_enum}
185 | \item[--]<1-> Take images of paper
186 | \item[--]<1-> Within images, find where \alert{relevant} text is located
187 | \item[--]<1-> Find out how the text is laid out
188 | \item[--]<1-> Recognize the characters
189 | \item[--]<2-> \alert{Every step is error prone}
190 | \end{large_enum}
191 | \end{frame}
192 |
193 | \begin{frame}
194 | \only<1->{\Large Optimize all steps w.r.t.\ the final error rate.}
195 | \only<2->{\Large How to deal with the errors that remain?}
196 | \end{frame}
197 | \begin{frame}
198 | \frametitle{How to Fix Errors}
199 | \begin{large_enum}
200 | \item[--]<1->How confident are you that...
201 | \begin{enumerate}
202 | \item[--]<2-> An area has \alert{relevant} text
203 | \item[--]<3-> Split is correct
204 | \item[--]<4-> Right character (or ligature) is recognized
205 | \end{enumerate}
206 | \item[--]<5-> Flag low confidence areas, splits, characters...
207 | \item[--]<6-> Get humans to identify the correct classes
208 | \item[--]<7-> Use that knowledge to fix other errors
209 | \end{large_enum}
210 | \end{frame}
211 |
212 | \begin{frame}
213 | \frametitle{Fixing Character Recognition Errors}
214 | \begin{large_enum}
215 | \item[--]<1-> Search and Replace
216 | \item[--]<2-> OCR makes certain kinds of errors (| is mistaken for an I)
217 | \item[--]<3-> Compare against a corpus (dictionary) and replace
218 | \item[--]<4-> But replace with what?
219 | \item[--]<5-> standd -> strand, stand, stood, or sand?
220 | \end{large_enum}
221 | \end{frame}
222 |
223 | \begin{frame}
224 | \frametitle{Edit Distance}
225 | \begin{large_enum}
226 | \item[--]<1->How similar are two strings?
227 | \item[--]<2->Typically refers to minimum edit distance
228 | \item[--]<3->Minimum number of editing operations (Insertion, Deletion, Substitution) to convert one string to another.
229 | \item[--]<4->Levenshtein Distance, substitution cost = 2
230 | \item[--]<5->You can implement this at word level so Microsoft Corp. is 1 away from Microsoft.
231 | \end{large_enum}
232 | \end{frame}
233 |
234 | \begin{frame}
235 | \frametitle{Supervised Learning}
236 | \begin{large_enum}
237 | \item[--]<1->But edit distance isn't context aware. Use surrounding words.
238 | \item[--]<2->How likely is a certain word within a phrase?
239 | \item[--]<3->$\sim$ Contemporary spelling correction algorithms
240 | \item[--]<4->A bigram model of language: given previous word, probability of next word
241 | \item[--]<5->But good training data is paramount.
242 | \end{large_enum}
243 | \end{frame}
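\begin{frame}[fragile]
\frametitle{Bigram Model: A Sketch}
A minimal sketch of the bigram idea on the previous slide, in plain Python; the corpus is a made-up stand-in for real training data.
\small
\begin{verbatim}
from collections import Counter

corpus = "we stand on the strand and we stand in the sand".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# P(next | previous) = count(previous, next) / count(previous)
def p_next(prev, nxt):
    return bigrams[(prev, nxt)] / float(unigrams[prev])

# Given the previous word, rank candidate corrections for 'standd'.
for cand in ['strand', 'stand', 'sand']:
    print cand, p_next('we', cand)
\end{verbatim}
\end{frame}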
244 |
245 | \begin{frame}
246 | \frametitle{Supervised Learning}
247 | \begin{large_enum}
248 | \item[--]<1->Training data are `similar data' (found via topic models) and data from human computation
249 | \item[--]<2->Estimate a model based on similar data
250 | \item[--]<3->Use stochastic gradient descent to continue to tweak parameters based on human computation
251 | \item[--]<4->Parallelize human computation; prioritize data for costlier errors (the most duplicated low-confidence strings, since recognition errors are correlated)
252 | \item[--]<5->Calculate the error rate against a human-labeled random sample
253 | \end{large_enum}
254 | \end{frame}
255 |
256 | \end{document}
257 |
--------------------------------------------------------------------------------
/ds1/.gitignore:
--------------------------------------------------------------------------------
1 | /img/*
--------------------------------------------------------------------------------
/ds1/ds1_present_web.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soodoku/data-science/0af39968f028b6c507945ebdb38bea8771f49303/ds1/ds1_present_web.pdf
--------------------------------------------------------------------------------
/ds1/ds1_web.tex:
--------------------------------------------------------------------------------
1 | \documentclass[compress, black]{beamer}
2 | \setbeamercolor{normal text}{fg=black}
3 | \beamertemplatesolidbackgroundcolor{white}
4 | \usecolortheme[named=black]{structure}
5 | \usepackage{caption}
6 | \captionsetup{labelformat=empty}
7 | \setbeamertemplate{navigation symbols}{}
8 | %\usefonttheme{structurebold}
9 | \usepackage[scaled]{helvet}
10 | \renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif
11 | \usepackage[T1]{fontenc}
12 | \usepackage{setspace}
13 | %\usepackage{beamerthemesplit}
14 | \usepackage{graphics}
15 | \usepackage{hyperref}
16 | \usepackage{graphicx}
17 | \usepackage{verbatim}
18 | \usepackage{amssymb}
19 | \usepackage{wrapfig}
20 | \usefonttheme[onlymath]{serif}
21 | \usepackage{cmbright}
22 | \usepackage[normalem]{ulem}
23 | \def\labelitemi{\textemdash}
24 | \setbeamertemplate{frametitle}{
25 | \begin{centering}
26 | \vskip15pt
27 | \insertframetitle
28 | \par
29 | \end{centering}
30 | }
31 | \title[DS]{Data Science}
32 | \author[Sood]{Gaurav~Sood}
33 | \large
34 | \date[2015]{Spring 2015}
35 | \subject{LearnDS}
36 | \begin{document}
37 | \newcommand{\multilineR}[1]{\begin{tabular}[b]{@{}r@{}}#1\end{tabular}}
38 | \newcommand{\multilineL}[1]{\begin{tabular}[b]{@{}l@{}}#1\end{tabular}}
39 | \newcommand{\multilineC}[1]{\begin{tabular}[b]{@{}c@{}}#1\end{tabular}}
40 |
41 | \newenvironment{large_enum}{
42 | \Large
43 | \begin{itemize}
44 | \setlength{\itemsep}{7pt}
45 | \setlength{\parskip}{0pt}
46 | \setlength{\parsep}{0pt}
47 | }{\end{itemize}}
48 |
49 | \begin{comment}
50 |
51 | setwd(paste0(basedir, "github/data-science/ds1/"))
52 | tools::texi2dvi("ds1_web.tex", pdf=TRUE,clean=TRUE)
53 | setwd(basedir)
54 |
55 | \end{comment}
56 | \frame
57 | {
58 | \titlepage
59 | }
60 |
61 | \frame{
62 | \frametitle{Big Data}
63 | \Large
64 | \only<2>{\indent Lots of hype recently.\\\vspace{10mm}} \only<3>{\scalebox{0.7}{\includegraphics{img/bigdata.png}}} \only<4>{\indent But where's the cheese?\\\vspace{10mm}}
65 | }
66 | \frame{
67 | \center
68 | \Large Some examples
69 | }
70 | \frame{
71 | \frametitle{Fishing Out Fishy Figures}
72 | \only<1>{
73 | \begin{figure}[p]
74 | \centering \includegraphics[scale=0.5]{img/economist.png}
75 | \end{figure}
76 | \small ``The CPINu's August inflation figure of \alert{1.3\%} is less than half the \alert{2.65\%} of the CPI Congreso, a compilation of private estimates
77 | gathered by opposition members of Congress.'' (\href{http://www.economist.com/blogs/americasview/2014/09/statistics-argentina}{Economist})}
78 | \only<2>{
79 | \begin{figure}[p]
80 | \centering \includegraphics[scale=0.5]{img/ArgentinaPriceIndex.png}
81 | \end{figure}
82 | \small Source: \href{http://www.mit.edu/~afc/papers/Cavallo-Argentina-JME.pdf}{Online vs Official Price Indexes: Measuring Argentina's Inflation By Alberto Cavallo}}
83 | }
84 |
85 | \frame{
86 | \frametitle{Suicide Prevention in the Army}
87 | \Large
88 | \only<2>{``In 2012, more soldiers committed suicide than died while fighting in Afghanistan: 349 suicides compared to 295 combat deaths.''}
89 | \only<3>{\begin{figure}[p]
90 | \centering
91 | \includegraphics[scale=0.5]{img/ArmySuicides.png}\\
92 | \end{figure}
93 | }
94 | \only<4>{``Research has repeatedly shown that doctors are not accurate in predicting who is at risk of suicide.''}
95 | \only<5>{``The soldiers with the \alert{highest 5 percent of risk scores} committed over \alert{half of all suicides} in the period covered --- at an extraordinary rate of about 3,824 suicides per 100,000 person-years.''\\\vspace{5mm}
96 | \small \href{http://fivethirtyeight.com/features/the-army-is-building-an-algorithm-to-prevent-suicide/}{538 Article}\\
97 | \href{http://www.ncbi.nlm.nih.gov/pubmed/25390793}{STARRS paper}}
98 | }
99 | \frame{
100 | \frametitle{Reducing Crime}
101 | \Large
102 | \only<1>{Minority Report}
103 | \only<2>{Predictive, `CompStat', `HotSpot' Policing}
104 | \only<3>{\href{http://www.predpol.com/}{PredPol}: Predictive Policing\\\normalsize
105 | LAPD, Atlanta PD \\
106 | Based off earthquake prediction algorithm}
107 | \only<4>{``During a four-month trial in Kent, \alert{8.5\%} of all street crime occurred within PredPol's pink boxes, with plenty more next door to them; predictions from police analysts scored only \alert{5\%}. An earlier trial in Los Angeles saw the machine score \alert{6\%} compared with human analysts' \alert{3\%}.''\\\vspace{5mm}
108 | \small \href{http://www.economist.com/news/briefing/21582042-it-getting-easier-foresee-wrongdoing-and-spot-likely-wrongdoers-dont-even-think-about-it}{Economist}}
109 | \only<5>{\includegraphics[scale=0.45]{img/PredictivePolicing.jpg}}
110 | }
111 | \frame{
112 | \frametitle{Web Search}
113 | \only<2>{\begin{figure}[p]
114 | \centering\scalebox{0.4}{\includegraphics{img/yahoo.jpg}}
115 | \end{figure}
116 | }
117 | \only<3-7>{
118 | \begin{large_enum}
119 | \item<3->[--]Human Curation, Ad-hoc automation
120 | \item<4->[--]Google crawls over 20 billion URLs a day (Sullivan 2012).
121 | \item<5->[--]Google answers 100 billion search queries a month (Sullivan 2012).
122 | \item<6->[--] ``\ldots a typical search returns results in less than 0.2 seconds'' (Google)
123 | \item<7->[--]Page Rank
124 | \end{large_enum}
125 | }
126 | }
127 | \frame{
128 | \frametitle{Side-effects of Drugs}
129 | \Large
130 | \only<2>{``Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market.''}
131 | \only<3>{FDA collects this information from ``physicians, pharmacists, patients, and drug companies'' but these reports ``are incomplete and biased''}
132 | \only<4>{``\alert{paroxetine} and \alert{pravastatin}, whose interaction was \alert{reported to cause hyperglycemia after the time period of the online logs} used in the analysis''}
133 | \only<5>{\begin{figure}[p]
134 | \centering
135 | \includegraphics[scale=0.60]{img/jama.png}\\
136 | \end{figure}
137 | \small \href{http://www.ncbi.nlm.nih.gov/pubmed/23467469}{Web-scale Pharmacovigilance: Listening to Signals from the Crowd.} By White et al.}
138 | }
139 | \frame{
140 | \frametitle{Flu Season}
141 | \Large
142 | \only<1>{How many got the sniffles?}
143 | \only<2>{How many got the sniffles in the past month?}
144 | \only<3>{
145 | \begin{figure}[p]
146 | \centering
147 | \includegraphics[scale=0.60]{img/GoogleFlu09Current.png}\\
148 | \end{figure}
149 | \small \href{http://www.google.org/flutrends/us/}{Google Flu Trends}
150 | }
151 | \only<4>{Google Flu is sick.}
152 | \only<5>{\begin{figure}[p]
153 | \centering
154 | \includegraphics[scale=0.45]{img/GoogleFlu.png}\\
155 | \end{figure}
156 | \small \href{http://bit.ly/1KMwZ0Y}{The Parable of Google Flu: Traps in Big Data Analysis.} By Lazer et al.
157 | }
158 | }
159 |
160 | \frame{
161 | \frametitle{Spam or Ham}
162 | \only<1>{
163 | \begin{figure}[p]
164 | \centering
165 | \includegraphics[scale=0.75]{img/spam.jpg}\\
166 | \end{figure}
167 | }
168 | \only<2-4>{
169 | \begin{large_enum}
170 | \item[-]<2->``According to Commtouch's Internet Threats Trend Report for the first quarter of 2013, an average of \alert{97.4 billion} spam e-mails and \alert{973 million} malware e-mails were sent worldwide each day in Q1 2013 (h/t Softpedia).''
171 | \item[-]<3->Spam Filter
172 | \item[-]<4->$\mathrm{logit}\,[p(\mathrm{spam})] = \alpha + f'\beta$, where $f$ is a vector of word frequencies.
173 | \end{large_enum}
174 | }
175 | }
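\begin{frame}[fragile]
\frametitle{A Toy Spam Filter}
A minimal sketch of the logit model on the previous slide, using scikit-learn; the e-mails and labels are made up.
\small
\begin{verbatim}
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

mails = ["win cash now", "meeting at noon",
         "cash prize waiting", "lunch tomorrow?"]
spam = [1, 0, 1, 0]

# f: word frequencies for each e-mail.
vec = CountVectorizer()
X = vec.fit_transform(mails)

# logit[p(spam)] = alpha + f'beta
clf = LogisticRegression().fit(X, spam)
print clf.predict(vec.transform(["free cash"]))
\end{verbatim}
\end{frame}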
176 | \frame{
177 | \frametitle{Vote}
178 | \only<1-2>{
179 | \begin{figure}[!htb]
180 | \centering
181 | \includegraphics[scale=0.5]{img/61million.png}
182 | \end{figure}
183 | }
184 | \Large
185 | \only<2>{\alert{0.39\% direct effect}, and 0.01\% to 0.1\% indirect effect.}
186 | }
187 | \frame{
188 | \frametitle{Vote for Obama}
189 | \begin{large_enum}
190 | \item[--]<1->{Obama 2012 Campaign}
191 | \item[--]<2->{Highly customized messaging: Soccer Moms, NASCAR dads \ldots}
192 | \item[--]<3->{\href{http://citoresearch.com/print/1813}{Used SQL database Vertica for `speed-of-thought' analyses.}}
193 | \end{large_enum}
194 | }
195 | \frame{
196 | \center
197 | \Large What do we mean by big data?
198 | }
199 | \frame{
200 | \frametitle{What do we mean by big data?}
201 | \begin{large_enum}
202 | \item[--]<1->Big in rows (size \textit{n})\\
203 | Big in columns (dimensions \textit{p})
204 | \item[--]<2->Hard to extract value from
205 | \item[--]<3-> `Big data' is high \textbf{volume}, high \textbf{velocity} and high \textbf{variety} information assets that demand cost-effective, innovative forms of information processing[.]\\\normalsize\indent
206 | Gartner, Inc.'s ``3Vs'' definition.
207 | \end{large_enum}
208 | }
209 | \frame{
210 | \frametitle{Sources of (Big) Data}
211 | \begin{large_enum}
212 | \item[--]<1->Data as the by-product of other activities\\
213 | \begin{enumerate}
214 | \item[--]<2-> Click trail, clicks before a purchase
215 | \item[--]<3-> Moving your body (\href{http://www.fitbit.com/}{Fitbit})
216 | \item[--]<4-> Moving yourself without moving your body (\href{http://www.progressive.com/auto/snapshot/?version=default}{Snapshot})
217 | \item[--]<5-> Data were always being generated. They just weren't being captured.
218 | \begin{enumerate}
219 | \item[--]<6->Cheaper, smaller sensors help.
220 | \item[--]<7->So does cheaper storage. (1950 $\sim$ \$10,000/MB. 2015 $\lll$ \$.0001/MB)
221 | \end{enumerate}
222 | \end{enumerate}
223 | \item[--]<8-> Data as the primary goal of activities\\\normalsize
224 | Telescopes, Genetic sequencers, 61 million person experiments \ldots
225 | \end{large_enum}
226 | }
227 | \frame{
228 | \frametitle{How big are the data?}
229 | \begin{large_enum}
230 | \item[--]<2->Web\\\normalsize
231 | 20 billion webpages, each 20 KB = 400 TB \\ \pause
232 | Say all stored on a single disk.\\ \pause
233 | Read speed $\sim$ 50MB/sec.\\ \pause
234 | 92 days to read from disk to memory.
235 | \item[--]<6->Astronomy \\\normalsize
236 | Apache Point Telescope $\sim$ 200 GB/night.\\
237 | Large Synoptic Survey Telescope: 3 billion pixel camera \\
238 | $\sim$ 30TB/night. In 10 years $\sim$ 60 PB
239 | \item[--]<7->Life Sciences\\\normalsize
240 | High Throughput sequencer $\sim$ 1 TB/day
241 | \item[--]<8->CIA \\\normalsize \pause \pause
242 | REDACTED
243 | \end{large_enum}
244 | }
245 |
246 | \frame{
247 | \center
248 | \Large Implications for Statistics, Computation.
249 | }
250 | \frame{
251 | \frametitle{Implications for Statistics}
252 | \begin{large_enum}
253 | \item[--]<2->Little data, Big data\\ \pause \pause \normalsize
254 | Sampling still matters
255 | \item[--]<4->Everything is significant (The Starry Night)\\\normalsize \pause \pause
256 | \begin{equation}\text{False discovery Proportion} = \frac{\text{\# of FP}}{\text{\# of Sig. Results}}\end{equation}\\
257 | Benjamini and Hochberg (1995) (FDR), cost of false discovery, Familywise error rate (Bonferroni)
258 | \item[--]<6->Inverting a matrix\\\normalsize \pause \pause
259 | (Stochastic) Gradient Descent (Ascent), BFGS, \ldots
260 | \item[--]<8->Causal inference\\\normalsize \pause \pause
261 | Large $p$ may help\\ \pause
262 | Passive observation as things change arbitrarily may help
263 | \end{large_enum}
264 | }
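\begin{frame}[fragile]
\frametitle{Benjamini--Hochberg: A Sketch}
A minimal sketch of the FDR idea on the previous slide, in plain Python; the p-values are made up.
\small
\begin{verbatim}
# Benjamini-Hochberg at FDR level q: sort the p-values,
# find the largest k with p_(k) <= k*q/m, reject 1..k.
pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.74]
q = 0.05
m = len(pvals)

ranked = sorted(enumerate(pvals), key=lambda kv: kv[1])
k_max = 0
for rank, (i, p) in enumerate(ranked, start=1):
    if p <= rank * q / m:
        k_max = rank

rejected = [i for i, p in ranked[:k_max]]
print rejected  # indices of the discoveries
\end{verbatim}
\end{frame}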
265 |
266 | \frame{
267 | \frametitle{Implications for Computation}
268 | \begin{large_enum}
269 | \item[--]<1->Conventional Understanding of what is computationally tractable: Polynomial time algorithm ($N^k$)
270 | \item[--]<2->Now it is $(N^k)/m$, where $m$ is the number of computers.
271 | \item[--]<3->For really big data: N*log(N)\\\normalsize
272 | Traversing a binary tree, sort and search N log(N)\\
273 | Streaming application
274 | \end{large_enum}
275 | }
276 | \frame{
277 | \center
278 | \Large MapReduce and PageRank
279 | }
280 | \frame{
281 | \Large
282 | 20 billion webpages, each 20 KB = 400 TB \\
283 | Say all data stored on a single disk. Read speed $\sim$ 50MB/sec.\\
284 | 92 days to read from disk to memory.
285 | }
286 | \frame{
287 | \frametitle{Solution, Problem}
288 | \begin{large_enum}
289 | \item[--]<1->Parallelize: with 1,000 computers, just $\sim$ 2 hours
290 | \item[--]<2->But \ldots
291 | \begin{enumerate}
292 | \item[--]<3->Nodes can fail.\\
293 | Say single node fails once every 1000 days.\\
294 | But 1000 failures per day if 1 million servers.
295 | \item[--]<4->Network bandwidth $\sim$ 1GBps.\\
296 | If you have 10 TB of data -- moving it takes about 1 day.
297 | \item[--]<5->Distributed programming can be very very hard.
298 | \end{enumerate}
299 | \end{large_enum}
300 | }
301 | \frame{
302 | \frametitle{Solution: MapReduce}
303 | \begin{large_enum}
304 | \item[--]<1->Store data redundantly
305 | \begin{enumerate}
306 | \item[--]<2->Distributed File Systems, e.g. GFS, HDFS
307 | \item[--]<3->Typical usage pattern: Data rarely updated. Read often. Updated through appends.
308 | \item[--]<4->Implementation:\\
309 | Data kept in chunks, machines called `chunk servers'\\
310 | Chunks replicated. Typically 3x. One in a completely separate rack.\\
311 | Master node (GFS)/Name Node (HDFS) tracks metadata
312 | \end{enumerate}
313 | \item[--]<5->Minimize data movement
314 | \item[--]<6->Simple programming model
315 | \end{large_enum}
316 | }
317 | \frame{
318 | \frametitle{More About MapReduce}
319 | \Large
320 | \begin{large_enum}
321 | \item[--]<1->Name comes from 2004 paper \href{http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf}{MapReduce: Simplified Data Processing on Large Clusters} by Dean and Ghemawat.
322 | \item[--]<2->Implementation - Hadoop (via Yahoo)/Apache
323 | \end{large_enum}
324 | }
325 | \frame{
326 | \frametitle{MapReduce Example}
327 | \begin{large_enum}
328 | \item[--]<1->Count each distinct word in a huge document, e.g. urls
329 | \begin{enumerate}
330 | \item[--]<2->Map function produces key, value pairs\\
331 | Word frequency of every word in each document
332 | \item[--]<3->Send different words to different computers
333 | \item[--]<4->Combine
334 | \end{enumerate}
335 | \end{large_enum}
336 | }
337 | \begin{frame}[fragile]
338 | \begin{verbatim}
339 | map(String input_key, String input_value):
340 | // input_key: document name
341 | // input_value: document contents
342 | for each word w in input_value:
343 |     EmitIntermediate(w, "1");
344 |
345 |
346 | reduce(String output_key, Iterator intermediate_values):
347 | // output_key: a word
348 | // output_values: a list of counts
349 | int result = 0;
350 | for each v in intermediate_values:
351 | result += ParseInt(v);
352 | Emit(AsString(result));
353 | \end{verbatim}
354 | \end{frame}
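\begin{frame}[fragile]
\frametitle{Word Count in Python: A Sketch}
The same logic as the pseudocode on the previous slide, as a minimal single-machine sketch; on a cluster, the shuffle step is done by the framework.
\small
\begin{verbatim}
from collections import defaultdict

docs = {'d1': 'the cat sat', 'd2': 'the cat ran'}

# Map: emit (word, 1) for every word in every document.
pairs = []
for name, contents in docs.items():
    for w in contents.split():
        pairs.append((w, 1))

# Shuffle: group values by key.
grouped = defaultdict(list)
for w, c in pairs:
    grouped[w].append(c)

# Reduce: sum the counts for each word.
for w, counts in grouped.items():
    print w, sum(counts)
\end{verbatim}
\end{frame}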
355 | \frame{
356 | \frametitle{PageRank}
357 | \begin{large_enum}
358 | \item[--]<1->\href{http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf}{The PageRank Citation Ranking: Bringing Order to the Web} By Page et al.
359 | \item[--]<2->\href{http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf}{Among the top 10 data mining algorithms}
360 | \item[--]<3->Search
361 | \begin{enumerate}
362 | \item[-]<4->Searching for similarity
363 | \item[-]<5->Works well when you look for a document on your computer.\\ Any small trusted corpus.
364 | \item[-]<6->Web - lots of matches.
365 | \item[-]<7->Web - lots of false positives. Some of it malware peddling sites.
366 | \end{enumerate}
367 | \end{large_enum}
368 | }
369 | \frame{
370 | \frametitle{PageRank}
371 | \begin{large_enum}
372 | \item[--]<1->How to order conditional on similarity?
373 | \begin{enumerate}
374 | \item[-]<2->One can rank by popularity if you have complete web traffic information.
375 | \item[-]<3->But those data are hard to get.
376 | \item[-]<4->Or you can do ad hoc automation and human curation.
377 | \item[-]<5->But costly to implement. And don't scale.
378 | \end{enumerate}
379 | \item[--]<6->Innovation: see Internet as a graph\\ \normalsize
380 | Nodes = webpages, Edges = hyperlinks \\
381 | Ranks based on linking patterns alone.
382 | \item[--]<7->Currency is in-links.
383 | \end{large_enum}
384 | }
385 | \frame{
386 | \frametitle{PageRank}
387 | \Large
388 | \begin{large_enum}
389 | \item[--]<1->Each in-link is a vote.
390 | \item[--]<2->But each vote is not equal.
391 | \item[--]<3->In-links from more important pages count for more
392 | \item[--]<4->Value of each vote:
393 | \begin{enumerate}
394 | \item[-]<5->Proportional to importance of source page
395 | \item[-]<6->Inversely proportional to number of outgoing links on the source page
396 | \item[-]<7->Say page $i$ has importance $r_i$, and $n$ outgoing links, each vote = $r_i/n$
397 | \end{enumerate}
398 | \end{large_enum}
399 | }
400 |
401 | \frame{
402 | \frametitle{Page Rank Example}
403 | \begin{figure}[p]
404 | \centering
405 | \includegraphics[scale=0.10]{img/PageRankExample.png}
406 | \end{figure}
407 | Source: Wikipedia
408 | }
409 | \frame{
410 | \frametitle{PageRank}
411 | \Large
412 | \only<1->{Say page $i$ gets links from pages $j$ ($n=5$, $r_j$) and $k$ ($n=2$, $r_k$)}\\
413 | \only<2->{\begin{align*} r_{i} = \frac{r_{j}}{5} + \frac{r_{k}}{2} \end{align*}}
414 | \only<3->{Page $j$ will have its own out-links, and each carries a value $r_j/d_j$, where $d_j$ is the number of outgoing links on page $j$}
415 | \only<4->{\begin{align*} r_i = \sum_{j \rightarrow i} \frac{r_j}{d_j}\end{align*} where the sum runs over all pages $j$ pointing to $i$ (sketch on the next slide).}
416 | }
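\begin{frame}[fragile]
\frametitle{PageRank by Power Iteration: A Sketch}
A minimal sketch of the rank equation on the previous slide, on a tiny made-up graph (no damping factor, to stay close to the formula).
\small
\begin{verbatim}
# links[j] = pages that j points to
links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
pages = list(links)

# Start with equal ranks, then iterate r_i = sum_j r_j / d_j.
r = dict((p, 1.0 / len(pages)) for p in pages)
for _ in range(50):
    new_r = dict((p, 0.0) for p in pages)
    for j, outs in links.items():
        for i in outs:
            new_r[i] += r[j] / len(outs)
    r = new_r

print r  # converges to a = 0.4, b = 0.2, c = 0.4
\end{verbatim}
\end{frame}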
417 | \frame{
418 | \center
419 | \Large Class
420 | }
421 | \frame{
422 | \frametitle{Data Science}
423 | \begin{figure}[p]
424 | \centering
425 | \includegraphics[scale=0.50]{img/DataScience.png}\\
426 | \end{figure}
427 | \small \href{http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram}{Data Science Venn Diagram.} By Drew Conway
428 | }
429 | \frame{
430 | \frametitle{Course Outline}
431 | \begin{large_enum}
432 | \item[--]<1->Get your own (big) data\\\normalsize
433 | Scrape it, clean it\\
434 | Basics of webscraping\\
435 | Basics of regular expressions
436 | \item[--]<2->Manage your (big) data\\\normalsize
437 | Store it, organize it, use it\\
438 | Relational Databases, SQL (SQLite)\\
439 | Other databases
440 | \item[--]<3->Analyze your (big) data\\\normalsize
441 | Cross-validation\\
442 | (Not) Maximally Likely\\
443 | Numerical Optimization
444 | \end{large_enum}
445 | }
446 | \frame{
447 | \frametitle{Prerequisites}
448 | \begin{large_enum}
449 | \item[--]<1->Basic but important
450 | \item[--]<2->Statistics:
451 | \begin{enumerate}
452 | \item[--]<1->Probability theory, some combinatorics
453 | \item[--]<2->Linear Regression and other standard estimation techniques
454 | \end{enumerate}
455 | \item[--]<3->Computation:
456 | \begin{enumerate}
457 | \item[--]Have written a loop
458 | \item[--]Have written a function
459 | \end{enumerate}
460 | \end{large_enum}
461 | }
462 | \frame{
463 | \frametitle{Software and Programming}
464 | \begin{large_enum}
465 | \item[--]<1->Open Source\\\normalsize
466 | License fees add up if you are running software on thousands of machines
467 | \item[--]<2->R
468 | \begin{enumerate}
469 | \item[--]<3->{\tt RStudio} IDE for R; makes it easier to code.
470 | \item[--]<4->{\tt R Markdown} for documentation.
471 | \end{enumerate}
472 | \item[--]<6->Python \\\normalsize
473 | \begin{enumerate}
474 | \item[--]<7->Academic Version {\tt Enthought Python Distribution}
475 | \item[--]<8->Eclipse, PyCharm, PyDev, Aptana Studio etc.
476 | \end{enumerate}
477 | \item[--]<9->SQLite
478 | \end{large_enum}
479 | }
480 |
481 |
482 | \end{document}
--------------------------------------------------------------------------------
/ds2/ds2_present_web.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soodoku/data-science/0af39968f028b6c507945ebdb38bea8771f49303/ds2/ds2_present_web.pdf
--------------------------------------------------------------------------------
/ds2/ds2_web.tex:
--------------------------------------------------------------------------------
1 | \documentclass[compress, black]{beamer}
2 | \setbeamercolor{normal text}{fg=black}
3 | \beamertemplatesolidbackgroundcolor{white}
4 | \usecolortheme[named=black]{structure}
5 | \usepackage{caption}
6 | \captionsetup{labelformat=empty}
7 | \setbeamertemplate{navigation symbols}{}
8 | %\usefonttheme{structurebold}
9 | \usepackage[scaled]{helvet}
10 | \renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif
11 | \usepackage[T1]{fontenc}
12 | \usepackage{setspace}
13 | %\usepackage{beamerthemesplit}
14 | \usepackage{graphics}
15 | \usepackage{hyperref}
16 | \usepackage{graphicx}
17 | \usepackage{verbatim}
18 | \usepackage{amssymb}
19 | \usepackage{wrapfig}
20 | \usefonttheme[onlymath]{serif}
21 | \usepackage{cmbright}
22 |
23 | \def\labelitemi{\textemdash}
24 | \setbeamertemplate{frametitle}{
25 | \begin{centering}
26 | \vskip15pt
27 | \insertframetitle
28 | \par
29 | \end{centering}
30 | }
31 | \title[DS]{Get, Clean Data}
32 | \author[Sood]{Gaurav~Sood}
33 | \large
34 | \date[2015]{Spring 2015}
35 | \subject{LearnDS}
36 | \begin{document}
37 | \newcommand{\multilineR}[1]{\begin{tabular}[b]{@{}r@{}}#1\end{tabular}}
38 | \newcommand{\multilineL}[1]{\begin{tabular}[b]{@{}l@{}}#1\end{tabular}}
39 | \newcommand{\multilineC}[1]{\begin{tabular}[b]{@{}c@{}}#1\end{tabular}}
40 |
41 | \newenvironment{large_enum}{
42 | \Large
43 | \begin{itemize}
44 | \setlength{\itemsep}{7pt}
45 | \setlength{\parskip}{0pt}
46 | \setlength{\parsep}{0pt}
47 | }{\end{itemize}}
48 |
49 | \begin{comment}
50 |
51 | setwd(paste0(basedir, "github/data-science/ds2/"))
52 | tools::texi2dvi("ds2_web.tex", pdf=TRUE,clean=TRUE)
53 | setwd(basedir)
54 |
55 | \end{comment}
56 |
57 | \frame
58 | {
59 | \titlepage
60 | }
61 |
62 | \frame{
63 | \frametitle{Data, Data, Everywhere}
64 | \begin{large_enum}
65 | \item[--]<2->The Famous Five:\\\normalsize
66 | Aural, Visual, Somatic, Gustatory, Olfactory
67 | \item[--]<3->The Social Famous Five:\\\normalsize
68 | What people (like to) hear, see, sense, smell, taste, \ldots
69 | \item[--]<4->Manifest Data:\\\normalsize
70 | Likes, Ratings, Reviews, Comments, Views, Searches \ldots
71 | \item[--]<5->Data about data:\\\normalsize
72 | Location of a tweet, photo, who called whom, \ldots
73 | \item[--]<6->Social data:\\\normalsize
74 | Friend graph, followers, who retweeted, liked,\ldots
75 | \item[--]<7->Data about structure:\\\normalsize\vspace{-.4\baselineskip} Layout of the site, In/out links, \ldots
76 | \end{large_enum}
77 | }
78 |
79 | \frame{
80 | \frametitle{Collecting Digital Data}
81 | \begin{large_enum}
82 | \item[--]<1->Proprietary Data collections\\\normalsize
83 | Lexis-Nexis, comScore \ldots
84 | \item[--]<2->APIs \\\normalsize
85 | Facebook, \href{http://developer.nytimes.com/docs}{NY Times}, Twitter, Google, FourSquare, \href{http://dfr.jstor.org}{JSTOR}, Zillow \ldots
86 | \item[--]<3->Bulk Downloads \\\normalsize
87 | Wikipedia, data.gov, IMDB, Million Song Database, Google n-grams \ldots
88 | \item[--]<4->Scraping
89 | \item[--]<5->Custom Apps\\\normalsize Build custom apps to observe behavior, get (pay) people to download these apps
90 | \end{large_enum}
91 | }
92 |
93 | \frame{
94 | \frametitle{Scraping}
95 | \begin{large_enum}
96 | \item[--]<1->To analyze data, we typically need structure.\\\normalsize
97 | For instance, same number of rows for each column.
98 | \item[--]<2->But found data often come with only human-readable structure.
99 | \item[--]<3->One option: copy and paste, or type, into a dataset.
100 | \item[--]<4->But that is error prone, and not scalable.
101 | \item[--]<5->\alert{Idea:} Find the less accessible structure, automate based on it.
102 | \end{large_enum}
103 | }
104 |
105 | \frame{
106 | \frametitle{Collecting Found Digital Data}
107 | \begin{large_enum}
108 | \item[-]<1->Software
109 | \begin{enumerate}
110 | \item[-]<2->R - Not the best but will do.
111 | \item[-]<3->Python, Ruby, Perl, Java, \ldots
112 | \item[-]<4->30 Digits, 80 Legs, Grepsr \ldots
113 | \end{enumerate}
114 | \item[-]<5->Some things to keep in mind
115 | \begin{enumerate}
116 | \item[-]<6->Check if there is an API, or if data are available for download
117 | \item[-]<7->Play Nice: \\\pause \pause \pause \pause \pause \pause \pause
118 | - Scraper may be disallowed in `robots.txt' \\ \pause
119 | - Build lag between requests. \alert{Make lags random.}\\\pause
120 | - Scrape during off-peak hours
121 | \end{enumerate}
122 | \end{large_enum}
123 |
124 | }
125 |
126 | \begin{frame}
127 | \frametitle{Paper}
128 | \only<1>{\scalebox{0.35}{\includegraphics{ScannedBook.png}}}
129 | \only<2->{
130 | \begin{large_enum}
131 | \item[-]<2-> Create digital images of paper
132 | \item[-]<3-> Identify colored pixels as characters (OCR)
133 | \item[-]<4-> Software
134 | \begin{enumerate}
135 | \item[-]<5->Adobe Pro., etc.
136 | \item[-]<6->Best in class commercial: Abbyy FineReader \\
137 | Now has an API
138 | \item[-]<7->Best in class open-source: Tesseract
139 | \end{enumerate}
140 | \item[-]<8->Scrape off recognized characters: pyPdf etc.
141 | \item[-]<9->Post-processing
142 | \end{large_enum}
143 | }
144 | \end{frame}
145 |
146 | \begin{frame}
147 | \frametitle{Pictures, Audio, and Video}
148 | \begin{large_enum}
149 | \item[-]<1->Audio (or Video with audio) to text: Dragon Dictates, Google transcription
150 | \item[-]<2->Pictures: recognize color, faces
151 | \item[-]<3->Objects in images: \href{http://clarifai.com}{Clarifai}
152 | \item[-]<4->Scrape closed-captions
153 | \end{large_enum}
154 | \end{frame}
155 |
156 | \begin{frame}
157 | \frametitle{Get Others to Work}
158 | \begin{large_enum}
159 | \item[-]<1->Human Computing
160 | \item[-]<2->Amazon.com's Mechanical Turk
161 | \begin{enumerate}
162 | \item[-]<3-> Create Human Intelligence Tasks (HITs)
163 | \item[-]<4-> \href{https://www.mturk.com/mturk/findhits?match=false}{Surveys, transcription, translation, \ldots}
164 | \item[-]<5-> You assess the work and pay out
165 | \end{enumerate}
166 | \item[-]<6->oDesk, Elance, impact sourcing, run your own ads \ldots
167 | \item[-]<7->\href{http://www.google.com/insights/consumersurveys/home}{Google} -- surveys as payment for content
168 | \end{large_enum}
169 |
170 | \end{frame}
171 |
172 |
173 | \begin{frame}[fragile]
174 | \frametitle{Scraping one HTML page in Python}
175 |
176 | Shakespeare's Twelfth Night\\
177 | Using \href{http://www.crummy.com/software/BeautifulSoup/}{Beautiful Soup}
178 | \small
179 | \begin{enumerate}
180 | \item[]<2->\begin{verbatim} from BeautifulSoup import BeautifulSoup \end{verbatim}
181 | \item[]<3->\begin{verbatim} from urllib import urlopen \end{verbatim}
182 | \item[]<3->
183 | \item[]<4->\begin{verbatim} url = urlopen('http://bit.ly/1D7wKcH').read()\end{verbatim}
184 | \item[]<5->\begin{verbatim} soup = BeautifulSoup(url)\end{verbatim}
185 | \item[]<6->\begin{verbatim} text = soup.p.contents\end{verbatim}
186 | \item[]<7->\begin{verbatim} print text\end{verbatim}
187 | \end{enumerate}
188 | \end{frame}
189 |
190 | \begin{frame}[fragile]
191 | \frametitle{Getting text from one pdf in Python}
192 |
193 | A Political Ad\\
194 | Using \href{http://pybrary.net/pyPdf/}{PyPdf}
195 | \small
196 | \begin{enumerate}
197 | \item[]<1->\begin{verbatim} import pyPdf \end{verbatim}
198 | \item[]<2->
199 | \item[]<2->\begin{verbatim} pdf = pyPdf.PdfFileReader(file('path to pdf', 'rb'))\end{verbatim}
200 | \item[]<3->\begin{verbatim} content = pdf.getPage(0).extractText()\end{verbatim}
201 | \item[]<4->\begin{verbatim} print content\end{verbatim}
202 | \end{enumerate}
203 | \end{frame}
204 |
205 | \begin{frame}[fragile]
206 | \frametitle{Scraping many urls/files to structured data}
207 | \begin{large_enum}
208 | \item[-]<1->Loop, exploiting structure of the urls/file paths\\\normalsize \pause
209 | e.g. \href{http://search.espncricinfo.com/ci/content/match/search.html?search=odi;all=1;page=1}{ESPN URL}
210 | \item[-]<3->Handle errors, if files or urls don't open, what do you do?
211 | \item[-]<4->To harvest structured data, exploit structure within text
212 | \item[-]<5->Trigger words, html tags, \ldots
213 | \end{large_enum}
214 | \end{frame}
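\begin{frame}[fragile]
\frametitle{Looping Over URLs: A Sketch}
A minimal sketch of such a loop, with error handling and random lags; the URL pattern is made up for illustration.
\small
\begin{verbatim}
import time, random
from urllib import urlopen

for page in range(1, 6):
    url = 'http://example.com/results?page=%d' % page
    try:
        html = urlopen(url).read()
        # ... parse html, append rows to the dataset ...
    except Exception, e:
        print 'Cannot open: %s with error: %s' % (url, str(e))
        continue
    # Play nice: random lag between requests.
    time.sleep(random.uniform(2, 5))
\end{verbatim}
\end{frame}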
215 |
216 | \begin{frame}[fragile]
217 | \frametitle{Exception(al) Handling}
218 | \begin{enumerate}
219 | \item[]<1->\begin{verbatim}try: \end{verbatim}
220 | \item[]<1->\begin{verbatim} pdf = pyPdf.PdfFileReader(file(pdfFile, 'rb')) \end{verbatim}
221 | \item[]<2->\begin{verbatim}except Exception, e:\end{verbatim}
222 | \item[]<2->\begin{verbatim} return 'Cannot Open: %s with error: %s' %
223 | (pdfFile, str(e))\end{verbatim}
224 | \end{enumerate}
225 | \end{frame}
226 |
227 | \begin{frame}[fragile]
228 | \frametitle{Inside the page}
229 | \begin{enumerate}
230 | \item[-]<1->Chrome Developer Tools
231 | \item[-]<2->Quick Tour of HTML
232 | \begin{enumerate}
233 | \item[-]<3->Tags begin with < and end with >
234 | \item[-]<4->Tags usually occur in pairs. Some don't (e.g., {\tt <img>}). And they can be nested.
235 | \item[-]<5->\href{https://developer.mozilla.org/en-US/docs/Web/HTML/Element}{Mozilla HTML elements}
236 | \item[-]<6->{\tt <p>} is for a paragraph
237 | \item[-]<7->{\tt <a>} is for a link
238 | \item[-]<8->{\tt <ol>}, {\tt <ul>} are for ordered, unordered lists; {\tt <li>} is a bullet
239 | \item[-]<9->Tags can have attributes.
240 | \item[-]<10->DOM, hierarchical, parent, child:
241 | \begin{verbatim}
242 | <html>
243 |   <body>
244 |     <p>A paragraph with <a href="...">a link</a>.</p>
245 |   </body>
246 | </html>
247 | \end{verbatim}
248 | \end{enumerate}
249 | \end{enumerate}
250 | \end{frame}
251 |
252 | \begin{frame}[fragile]
253 | \frametitle{Find Things}
254 | \begin{enumerate}
255 | \item[]<1->Navigate by HTML tags: \begin{verbatim} soup.title, soup.body, soup.body.contents \end{verbatim}
256 | \item[]<2->Search HTML tags: \begin{verbatim} soup.find_all('a'), soup.find(id="nav1") \end{verbatim}
257 | \item[]<3->
258 | \item[]<3->So to get all the urls in a page:
259 | \item[]<4->\begin{verbatim} for link in soup.find_all('a'): \end{verbatim}
260 | \item[]<4->\begin{verbatim} print(link.get('href')) \end{verbatim}
261 | \item[]<4->
262 | \item[]<5->\href{http://www.crummy.com/software/BeautifulSoup/bs4/doc/}{Beautiful Soup Documentation}
263 | \end{enumerate}
264 | \end{frame}
265 |
266 | \frame{
267 | \frametitle{Data Munging}
268 | \Large
269 | ``Data scientists, according to interviews and expert estimates, spend from \alert{50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data}, before it can be explored for useful information.''\\\vspace{5em}
270 |
271 | \small \href{http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html}{New York Times: For BigData Scientists, `Janitor Work' Is Key Hurdle to Insights}
272 | }
273 | \frame{
274 | \frametitle{Data Munging}
275 | \Large
276 | ``In our experience, the tasks of \alert{exploratory data mining and data cleaning constitute 80\% of the effort} that determines 80\% of the value of the ultimate data.''\\\vspace{5em}
277 |
278 | \small Dasu and Johnson, Exploratory Data Mining and Data Cleaning
279 | }
280 |
281 | \frame{
282 | \frametitle{Regular (or Rational) Expressions}
283 | \begin{large_enum}
284 | \item[-]<1-> Formal language for specifying text strings
285 | \item[-]<2-> Stephen Kleene, `inventor' of regular expressions.
286 | \item[-]<3-> Henry Spencer, behind the {\tt regex} library.
287 | \item[-]<4-> Descend from {\it finite automata} theory.
288 | \item[-]<5-> Matching
289 | \end{large_enum}
290 | }
291 |
292 | \begin{frame}
293 | \frametitle{The most basic regular expression}
294 | \begin{large_enum}
295 | \item[-]<1->String literal
296 | \item[-]<2->\href{http://regexpal.com}{RegexPal.com}
297 | \item[-]<2->Say you are searching for the word apple -- it may appear with an uppercase or lowercase first character, or as a plural
298 | \end{large_enum}
299 | \end{frame}
300 |
301 | \begin{frame}[fragile]
302 | \frametitle{Disjunction}
303 | \begin{large_enum}
304 | \item[-]<1->Disjunction, Character classes
305 | \begin{enumerate}
306 | \item[-]<2-> \begin{verbatim} [] \end{verbatim}
307 | \item[-]<3-> \begin{verbatim} [aA]pple matches apple and Apple \end{verbatim}
308 | \item[-]<4-> \begin{verbatim} [0123456789] matches any digit \end{verbatim}
309 | \end{enumerate}
310 | \item[-]<5->Ranges
311 | \begin{enumerate}
312 | \item[-]<6-> \begin{verbatim} [0-9] matches any digit \end{verbatim}
313 | \item[-]<7-> \begin{verbatim} [a-z], [[:lower:]] matches any lowercase \end{verbatim}
314 | \item[-]<8-> \begin{verbatim} [a-zA-Z], [[:alpha:]] matches any letter \end{verbatim}
315 | \item[-]<9-> \begin{verbatim} [a-e1-9] matches a through e, or 1 through 9 \end{verbatim}
316 | \item[-]<10-> Hyphen only has a special meaning between characters. Placed first, it is literal. \begin{verbatim} [-123] \end{verbatim}
317 | \end{enumerate}
318 | \end{large_enum}
319 | \end{frame}
320 |
321 | \begin{frame}[fragile]
322 | \frametitle{Disjunction Contd..}
323 | \begin{large_enum}
324 | \item[-]<1->Negation in Disjunction\\\normalsize
325 | \begin{enumerate}
326 | \item[-]<2-> \begin{verbatim} ^ right after the square bracket means a negation \end{verbatim}
327 | \item[-]<3-> \begin{verbatim} [^A-Z] \end{verbatim}
328 | \item[-]<4-> \begin{verbatim} [^Aa] means neither a capital A nor a lowercase a \end{verbatim}
329 | \item[-]<5-> \begin{verbatim} [^e^] means not an e, and not ^ \end{verbatim}
330 | \end{enumerate}
331 | \item[-]<6->Disjunction for longer strings\\\normalsize
332 | \begin{enumerate}
333 | \item[-]<7-> \begin{verbatim} pipe: | \end{verbatim}
334 | \item[-]<8-> \begin{verbatim} a|b|c = [abc] \end{verbatim}
335 | \item[-]<9-> \begin{verbatim} apple|pie \end{verbatim}
336 | \item[-]<10-> \begin{verbatim} [aA]pple|[aA]nd \end{verbatim}
337 | \end{enumerate}
338 | \end{large_enum}
339 | \end{frame}
340 | \begin{frame}[fragile]
341 | \frametitle{Special characters}
342 | \begin{large_enum}
343 | \item[-]<1->? - previous character is optional: colou?r - color, colour
344 | \item[-]<2-> . matches any character\\
345 | e.g. beg.n matches begun, begin, began
346 | \item[-]<3->Kleene Operators - named after Stephen Kleene
347 | \begin{enumerate}
348 | \item[-]<4->* matches 0 or more of the previous character\\
349 | e.g. oo*h will match oh, ooh, oooh, etc.\\
350 | (abc)* will match abc, abcabc, etc. (and the empty string)
351 | \item[-]<5->+ matches 1 or more of the previous character\\
352 | e.g. o+h will match oh, ooh, oooh, etc.
353 | \end{enumerate}
354 | \end{large_enum}
355 | \end{frame}
356 |
357 | \begin{frame}[fragile]
358 | \frametitle{Repetition Ranges}
359 | \begin{enumerate}
360 | \item[-]<1-> Specific ranges can also be specified
361 | \item[-]<2-> \small \begin{verbatim} { } to specify range for the immediately preceding regex \end{verbatim}
362 | \item[-]<3-> \begin{verbatim} {n} means exactly n occurrences \end{verbatim}
363 | \item[-]<4-> \begin{verbatim} {n,} means at least n occurrences \end{verbatim}
364 | \item[-]<5-> \begin{verbatim} {n,m} means at least n and no more than m occurrences \end{verbatim}
365 | \item[-]<6-> Example: \begin{verbatim} .{0,} = .* \end{verbatim}
366 | \end{enumerate}
367 | \end{frame}
368 |
369 | \begin{frame}[fragile]
370 | \begin{large_enum}
371 | \frametitle{More Regex}
372 | \item[-]<1->Anchors
373 | \begin{enumerate}
374 | \item[-]<2-> {\tt \^{}} matches the beginning of the line\\
375 | e.g. \begin{verbatim} ^[A-Z] matches a capital letter at the start of a line.\end{verbatim}
376 | \item[-]<3-> \$ matches the end of the line.
377 | \end{enumerate}
378 | \item[-]<4-> \begin{verbatim} \. means a period \end{verbatim}
379 | \item[-]<5-> Example: look for the word `the'
380 | \begin{enumerate}
381 | \item[-]<6->missed capitalization: [tT]he
382 | \item[-]<7->make pattern more precise: \\
383 | \begin{verbatim} [tT]he[^A-Za-z], ^[tT]he[^A-Za-z] \end{verbatim}
384 | \end{enumerate}
385 | \end{large_enum}
386 | \end{frame}
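\begin{frame}[fragile]
\frametitle{Regex in Python: A Sketch}
A minimal sketch of the patterns on the previous slides, using Python's {\tt re} module; the sample strings are made up.
\small
\begin{verbatim}
import re

text = 'The cat ate the apple, Theology aside.'

# [tT]he followed by a non-letter, as on the previous slide:
print re.findall(r'[tT]he[^A-Za-z]', text)
# -> ['The ', 'the ']   ('Theology' is not matched)

# Optional character:
print re.findall(r'colou?r', 'color, colour')
# -> ['color', 'colour']
\end{verbatim}
\end{frame}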
387 |
388 | \frame{
389 | \frametitle{False Positive and Negatives}
390 | \begin{large_enum}
391 | \item[-]<1->False positives or Type 1 errors - matching things we shouldn't match
392 | \item[-]<2->False negatives or Type 2 errors - not matching things we should match
393 | \item[-]<3->Cost attached to false negative and positive
394 | \item[-]<4->Provide some metrics by comparing against good data for a small sample
395 | \end{large_enum}
396 | }
397 |
398 | \frame{
399 | \frametitle{Edit Distance}
400 | \begin{large_enum}
401 | \item[-]<1->pwned -> owned or pawned?
402 | \item[-]<2->standd -> strand, stand, stood, or sand?
403 | \item[-]<3->How similar are two strings?
404 | \item[-]<4->Applications
405 | \begin{enumerate}
406 | \item[-]<5->Spell Correction
407 | \item[-]<6->Also comes up in computational biology
408 | \item[-]<7->Machine translation
409 | \item[-]<8->Information extraction
410 | \item[-]<9->Speech recognition
411 | \end{enumerate}
412 | \end{large_enum}
413 | }
414 | \frame{
415 | \frametitle{Edit Distance}
416 | \begin{large_enum}
417 | \item[-]<1->Typically refers to minimum edit distance
418 | \item[-]<2->Minimum number of editing operations to convert one string to another
419 | \begin{enumerate}
420 | \item[-]<3->Insertion
421 | \item[-]<3->Deletion
422 | \item[-]<3->Substitution
423 | \end{enumerate}
424 | \item[-]<3->e.g. two strings: intention, execution
425 | \begin{enumerate}
426 | \item[-]<4->align the two strings, character by character
427 | \item[-]<5->d (delete), s (substitute), s, i(nsert), s
428 | \item[-]<6->if each operation costs 1, edit distance = 5
429 | \item[-]<7->if substitution costs 2 (Levenshtein distance), distance = 8
430 | \end{enumerate}
431 | \item[-]<8->You can implement this at word level so Microsoft Corp. is 1 away from Microsoft.
432 | \end{large_enum}
433 | }
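\begin{frame}[fragile]
\frametitle{Minimum Edit Distance: A Sketch}
A minimal dynamic-programming sketch of the distance just described, with substitution cost set to 2 (Levenshtein).
\small
\begin{verbatim}
def edit_distance(a, b, sub_cost=2):
    # d[i][j] = cost of turning a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(b) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] +
                          (0 if same else sub_cost))
    return d[len(a)][len(b)]

print edit_distance('intention', 'execution')  # 8
\end{verbatim}
\end{frame}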
434 |
435 | \frame{
436 |
437 | \center \Huge Text Processing
438 | }
439 |
440 | \frame{
441 | \frametitle{Text as Data}
442 | \begin{large_enum}
443 | \item[-]<1->Bag of words assumption\\\normalsize
444 | Lose word order
445 | \item[-]<2->Remove stop words:\\\normalsize
446 | If, and, but, who, what, the, they, their, a, or, \ldots\\
447 | \alert{Be careful: one person's stopword is another's key term.}
448 | \item[-]<3->(Same) Word: Stemming and Lemmatization\\\normalsize
449 | Taxing, taxes, taxation, taxable $\leadsto$ tax
450 | \item[-]<4->Remove rare words\\\normalsize
451 | $\sim$ 0.5\% to 15\%, depending on application\\
452 | \item[-]<5->Convert to lowercase, drop numbers, punctuation, etc.
453 | \end{large_enum}
454 | }
455 |
456 | \begin{frame}[fragile]
457 | \frametitle{How?}
458 | Using the Natural Language Toolkit ({\tt nltk})
459 | \begin{itemize}
460 | \item[-]<2->\textbf{Lowercase}:
461 | \begin{verbatim}text = text.lower() \end{verbatim}
462 | \item[-]<3->\textbf{Remove stop words}:
463 | \item[]<4->\begin{verbatim}swords = stopwords.words('english') \end{verbatim}
464 | \item[]<5->\begin{verbatim}words = wordpunct_tokenize(text) \end{verbatim}
465 | \item[]<6->\begin{verbatim}words = [w for w in words if w not in swords] \end{verbatim}
466 | \item[]<7->\begin{verbatim}text = ' '.join(words) \end{verbatim}
467 | \item[-]<8->\textbf{Stemming}:
468 | \item[]<9->\begin{verbatim}st = EnglishStemmer() \end{verbatim}
469 | \item[]<10->\begin{verbatim}words = wordpunct_tokenize(text) \end{verbatim}
470 | \item[]<11->\begin{verbatim}words = [st.stem(w) for w in words] \end{verbatim}
471 | \item[]<12->\begin{verbatim}text = ' '.join(words) \end{verbatim}
472 | \end{itemize}
473 | \end{frame}
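\begin{frame}[fragile]
\frametitle{How? (Imports)}
The snippets on the previous slide assume these {\tt nltk} imports, and that the stopwords corpus has been downloaded (nltk.download()).
\small
\begin{verbatim}
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
from nltk.stem.snowball import EnglishStemmer
\end{verbatim}
\end{frame}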
474 | \begin{frame}[fragile]
475 | \frametitle{To Matrices}
476 | \begin{itemize}
477 | \item[-]<2->n-grams
478 | \begin{verbatim}
479 | from nltk import bigrams, trigrams, ngrams, word_tokenize
480 | text = word_tokenize(text)
481 | text_bi = bigrams(text)
482 | \end{verbatim}
483 | \end{itemize}
484 | \end{frame}
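\begin{frame}[fragile]
\frametitle{To Matrices: A Sketch}
One way to go from n-grams to a document-term matrix, sketched with scikit-learn; the documents are made up.
\small
\begin{verbatim}
from sklearn.feature_extraction.text import CountVectorizer

docs = ['taxes on capital', 'capital gains taxes']

# Unigrams and bigrams, one row per document.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

print vec.get_feature_names()
print X.toarray()
\end{verbatim}
\end{frame}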
485 |
486 | \end{document}
487 |
--------------------------------------------------------------------------------
/ds2/scraping_assignment_web.txt:
--------------------------------------------------------------------------------
1 | Do one of the following two projects:
2 |
3 | A. Build a database of NYC laws
4 | 1. Legistar database for NYC: http://legistar.council.nyc.gov/Legislation.aspx
5 | 2. Here's one way to do it: leave the search box empty, use 'all years' and 'all types' from the dropdown menu. When you get the results, click on 'show' and change it to 'show all.' You should get 31,850 records. Now iterate through the results, following each file # link. Each row in the final dataset will represent a unique "file number" (first column of results).
6 |
7 | We want to build a flat file (e.g. csv) that includes all fields that you can get about the bill. So for instance, you would want to get the text of the legislation where available. And 'legislation details' where available.
8 |
9 | B. Coverage of budget deficits in the NY Times
10 | We would like to know how news coverage of budget deficits has changed over time.
11 | 1. Use the NY Times API (http://developer.nytimes.com/docs/read/article_search_api_v2) to create a monthly time series of coverage of budget deficit issues. To do so, you must come up with a list of keywords. Choose keywords carefully.
12 | 2. Plot the time series, and interpret the results.
13 |
14 | Languages:
15 | R or Python
16 |
17 | What to deliver:
18 | Link to github repo.
19 | a. A well documented script
20 | b. A readme file in markdown
21 | c. Data (in csv)
22 | d. Output (if asked for) (graph, tables)
--------------------------------------------------------------------------------
/ds3/ds3_present_web.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soodoku/data-science/0af39968f028b6c507945ebdb38bea8771f49303/ds3/ds3_present_web.pdf
--------------------------------------------------------------------------------
/ds3/ds3_web.tex:
--------------------------------------------------------------------------------
1 | \documentclass[compress, black]{beamer}
2 | \setbeamercolor{normal text}{fg=black}
3 | \beamertemplatesolidbackgroundcolor{white}
4 | \usecolortheme[named=black]{structure}
5 | \usepackage{caption}
6 | \captionsetup{labelformat=empty}
7 | \setbeamertemplate{navigation symbols}{}
8 | %\usefonttheme{structurebold}
9 | \usepackage[scaled]{helvet}
10 | \renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif
11 | \usepackage[T1]{fontenc}
12 | \usepackage{setspace}
13 | %\usepackage{beamerthemesplit}
14 | \usepackage{graphics}
15 | \usepackage{hyperref}
16 | \usepackage{graphicx}
17 | \usepackage{verbatim}
18 | \usepackage{amssymb}
19 | \usepackage{wrapfig}
20 | \usefonttheme[onlymath]{serif}
21 | \usepackage{cmbright}
22 |
23 | \def\labelitemi{\textemdash}
24 | \setbeamertemplate{frametitle}{
25 | \begin{centering}
26 | \vskip15pt
27 | \insertframetitle
28 | \par
29 | \end{centering}
30 | }
31 | \title[DS]{Databases}
32 | \author[Sood]{Gaurav~Sood}
33 | \large
34 | \date[2015]{Spring 2015}
35 | \subject{LearnDS}
36 | \begin{document}
37 | \newcommand{\multilineR}[1]{\begin{tabular}[b]{@{}r@{}}#1\end{tabular}}
38 | \newcommand{\multilineL}[1]{\begin{tabular}[b]{@{}l@{}}#1\end{tabular}}
39 | \newcommand{\multilineC}[1]{\begin{tabular}[b]{@{}c@{}}#1\end{tabular}}
40 |
41 | \newenvironment{large_enum}{
42 | \Large
43 | \begin{itemize}
44 | \setlength{\itemsep}{7pt}
45 | \setlength{\parskip}{0pt}
46 | \setlength{\parsep}{0pt}
47 | }{\end{itemize}}
48 |
49 | \begin{comment}
50 |
51 | setwd(paste0(basedir, "github/data-science/ds3/"))
52 | tools::texi2dvi("ds3_web.tex", pdf=TRUE,clean=TRUE)
53 | setwd(basedir)
54 |
55 | \end{comment}
56 | \frame
57 | {
58 | \titlepage
59 | }
60 |
61 | \frame{
62 | \frametitle{Data}
63 | \begin{large_enum}
64 | \item[--]<2->Bits on non-volatile storage e.g. magnetic media, SSD
65 | \item[--]<3->Not just bits on hard disk, but organized bits (a data model)
66 | \begin{enumerate}
67 | \item[-]<4-> Nested folders $\sim$ tree
68 | \item[-]<5-> Rows and columns $\sim$ table
69 | \end{enumerate}
70 | \end{large_enum}
71 | }
72 |
73 | \frame{
74 | \frametitle{Data Model}
75 | \begin{large_enum}
76 | \item[--]<1->Structure
77 | \begin{enumerate}
78 | \item[--]<2-> Rows and columns (Relational Database)
79 | \item[--]<3-> Key-value pairs (NoSQL systems)
80 | \item[--]<4-> Sequence of bytes (Files)
81 | \end{enumerate}
82 | \item[--]<5->Constraints
83 | \begin{enumerate}
84 | \item[--]<6-> All columns must be of same length
85 | \item[--]<7-> Each value in this column has to be X
86 | \item[--]<8-> Child cannot have two parents (tree) (before Gmail Labels)
87 | \end{enumerate}
88 | \item[--]<9->(Efficiently Supported) Operations
89 | \begin{enumerate}
90 | \item[--]<10-> Look up value of key x.
91 | \item[--]<11-> Find me rows where column `lastname' is Jordan.
92 | \item[--]<12-> Get next $n$ bytes.
93 | \end{enumerate}
94 | \end{large_enum}
95 | }
96 |
97 | \frame{
98 | \frametitle{Questions to consider}
99 | \begin{large_enum}
100 | \item[--]<1-> What are the features of the data? How do we plan to use the data?
101 | \item[--]<2-> How often are the data updated?
102 | \item[--]<3-> What queries are efficiently supported?
103 | \item[--]<4-> How are the data stored on disk -- the physical data model?
104 | \item[--]<5-> How are data organized internally -- the abstract data model?
105 | \end{large_enum}
106 | }
107 |
108 | \frame{
109 | \frametitle{What are databases?}
110 | \begin{large_enum}
111 | \item[--]<1-> Organized collection of data.
112 | \item[--]<2-> Organized to afford efficient retrieval (aim can vary).
113 | \item[--]<3-> Data + Metadata about structure and organization.
114 | \item[--]<4-> Data that are self-describing $\sim$ have a schema.
115 | \item[--]<5-> Generally accessed through a Database Management System.
116 | \end{large_enum}
117 | }
118 |
119 | \frame{
120 | \frametitle{What do Database Systems (DBMS) do?}
121 | \begin{large_enum}
122 | \item[--]<1-> Provide \alert{efficient}, \alert{reliable}, \alert{convenient} and \alert{safe}, \alert{multi-user} storage of and access to \alert{massive} amounts of data.
123 | \begin{enumerate}
124 | \item[--]<2-> \textbf{Massive:} Terabytes. Handle data that reside outside memory.
125 | \item[--]<3-> \textbf{Safe:} Robust to power outages, node failures etc.
126 | \item[--]<4-> \textbf{Multi-user:} Concurrency control; finer grained than one user per file (file-system concurrency) or one user per data item.
127 | \item[--]<5-> \textbf{Convenient:} High level query languages; declarative: what you want, not describe algorithm
128 | \item[--]<6-> \textbf{Efficient:} Performance, performance, performance.
129 | \item[--]<7-> \textbf{Reliable:} 99.99999\% up time.
130 | \end{enumerate}
131 | \end{large_enum}
132 | }
133 |
134 | \frame{
135 | \frametitle{Key concepts}
136 | \begin{large_enum}
137 | \item[--]<1->Data model\\\normalsize
138 | XML, relational model, etc.
139 | \item[--]<2->Schema versus data distinction\\\normalsize
140 | Types versus variables\\
141 | Schema (types, structure), actual data\\
142 | \item[--]<3->Data definition language (DDL)\\\normalsize
143 | For setting up schema
144 | \item[--]<4->Data manipulation or query language (DML)\\\normalsize
145 | For querying and modifying the database
146 | \end{large_enum}
147 | }
148 |
149 | \frame{
150 |
151 | \Large Some Data Models
152 | }
153 | \frame{
154 | \frametitle{XML}
155 | \begin{large_enum}
156 | \item[--]<1->Extensible Markup Language\\\normalsize
157 | Similar to HTML\\
158 | Tags describe data rather than how to format
159 | \item[--]<2->Standard for data exchange and representation
160 | \item[--]<3->XML
161 | \begin{enumerate}
162 | \item[--]<4-> Tagged elements
163 | \item[--]<5-> Nesting of elements
164 | \item[--]<6-> Attributes
165 | \item[--]<7-> Not all elements have attributes
166 | \end{enumerate}
167 | \end{large_enum}
168 | }
169 | \begin{frame}
170 | \frametitle<1>{Plain Text}
171 | \frametitle<2>{XML}
172 | \only<1>{
173 | {\tt
174 | June 5, 2006 \\
175 | Floor Statement of Senator Barack Obama Federal Marriage Amendment \\
176 | I agree with most Americans, with Democrats and Republicans, with Vice President Cheney, with over 2,000 religious leaders of all different beliefs, that decisions about marriage, as they always have, should be left to the states.
177 | }
178 | }
179 |
180 | \only<2>{
181 | {\tt
182 | $<$DOC$>$\\
183 | $<$DOCNO$>$obama-2006-06-05-01$<\//$DOCNO$>$\\
184 | $<$TEXT$>$\\
185 | I agree with most Americans, with Democrats and Republicans, with Vice President Cheney, with over 2,000 religious leaders of all different beliefs, that decisions about marriage, as they always have, should be left to the states.\\
186 | $<\//$TEXT$>$\\
187 | $<\//$DOC$>$
188 | }
189 | }
190 |
191 | \end{frame}
192 |
193 | \begin{frame}
194 | \frametitle{JSON}
195 |
196 | {\tt
197 | \{u'category': u'Government official', u'username': u'RepJohnLewis', u'about': u'Official Congressional page for Rep. John Lewis', ... u'name': u'John Lewis', u'hometown': u'Troy, AL', ... u'website': u'johnlewis.house.gov', u'phone': u'(404) 659 - 0116', u'birthday': u'02/21/1940', u'likes': 62974, ...\}
198 | }
199 |
200 | \end{frame}
201 |
202 | \frame{
203 | \frametitle{Relational Model}
204 | \begin{large_enum}
205 | \item[--]<1->RDBMS-- Codd 1970
206 | \item[--]<2->Database: set of named relations (or tables)
207 | \item[--]<3->Each relation has a set of named attributes (or columns)
208 | \item[--]<4->Each attribute (column) has a type (or domain)\\\normalsize
209 | There is also the concept of an enumerated domain
210 | \item[--]<5->Each row (= tuple)
211 | \item[--]<6->Schema-- structure, name, type of attributes
212 | \item[--]<7->\alert{Everything is a table}
213 | \end{large_enum}
214 | }
215 |
216 | \frame{
217 | \frametitle{Relational Model}
218 | \begin{large_enum}
219 | \item[--]<1->Relationships are implicit; no pointers
220 | \begin{enumerate}
221 | \item[--]<2-> Physical Data independence
222 | \item[--]<3-> Compare to Network Data Model etc.
223 | \end{enumerate}
224 | \item[--]<4->Example:
225 | \begin{enumerate}
226 | \item[--]<5-> \lbrack course, Student id \rbrack, \lbrack student id, student name\rbrack
227 | \item[--]<6-> Just a shared id
228 | \item[--]<7-> Can be bad for performance: to get all student names for a course, look up all the ids
229 | \item[--]<8-> Hierarchical can be quicker if all students in one course
230 | \item[--]<9-> But to look up all courses student X has taken, just as good (bad)
231 | \end{enumerate}
232 | \end{large_enum}
233 | }
234 |
235 | \frame{
236 | \frametitle{Relational Vs. XML}
237 | \begin{table}[h]
238 | \begin{tabular}{l|l|l}
239 | & Relational & XML \\
240 | Structure & Tables & Hierarchical \\
241 | Schema & Fixed in advance & Flexible, self-describing\\
242 | Queries & Easy & Querying not as easy\\
243 | Ordering & Unordered & Implied order (document) \\
244 | \end{tabular}
245 | \end{table}
246 | }
247 |
248 | \frame{
249 | \frametitle{Algebra of Tables}
250 | \begin{large_enum}
251 | \item[--]<1->In algebra, there is a concept of algebraic optimization -- rules to simplify calculations
252 | \item[--]<2->SQL uses relational algebra
253 | \item[--]<3->All RDBMS optimize relational algebra `calculations'
254 | \item[--]<4->Cost-based optimization
255 | \item[--]<5->\alert{We can leave this optimization to the program}
256 | \end{large_enum}
257 | }
258 | \frame{
259 | \frametitle{Algebra of Tables}
260 | \begin{large_enum}
261 | \item[--]<1->Basic Relational Algebra
262 | \begin{enumerate}
263 | \item[--]<2->Union, Intersection, Difference\\
264 | \item[--]<3->Selection, $\sigma$\\
265 | \item[--]<4->Projection, $\Pi$\\
266 | \item[--]<5->Join, $\bowtie$
267 | \end{enumerate}
268 | \item[--]<6->Extended Relational Algebra:
269 | \begin{enumerate}
270 | \item[--]<7->Duplicate elimination, $d$\\
271 | \item[--]<8->Grouping and aggregation, $g$\\
272 | \item[--]<9->Sorting, $t$
273 | \end{enumerate}
274 | \item[--]<10->Sets Vs Bags
275 | \begin{enumerate}
276 | \item[--]<11->Sets: \{a,b,c\} (RA, Papers)
277 | \item[--]<12->Bags: \{a,a,b,c\} (SQL)
278 | \end{enumerate}
279 | \end{large_enum}
280 | }
281 |
282 | \frame{
283 | \frametitle{Select Rows}
284 | \begin{large_enum}
285 | \item[--]<1->Tuples (rows) that satisfy a particular criterion
286 | \item[--]<1->$\sigma_\text{condition} \text{Relation}$\\
287 | Pick Certain Rows
288 | \item[--]<2->$\sigma_{GPA > 3} \text{Student}$\\
289 | $\sigma_{GPA > 3 \wedge LastName=='Smith'} \text{Student}$\\
290 | \item[--]<3->In SQL, selection is done with `Where' (and projection with `Select')
291 | \end{large_enum}
292 | }
293 |
294 | \frame{
295 | \frametitle{Project}
296 | \begin{large_enum}
297 | \item[--]<1->$\Pi_\text{A1, A2} \text{Relation}$\\
298 | Pick Certain Columns
299 | \item[--]<2->$\Pi_\text{SSN, GPA} \text{Student}$\\
300 | \item[--]<3->Both Select and Project\\
301 | $\Pi_\text{SSN, GPA} (\sigma_{GPA > 3} \text{Student})$\\
302 | $\Pi_\text{A1, A2} \text{Expression}$\\
303 | $\sigma_\text{condition} \text{Expression}$\\
304 | \end{large_enum}
305 | }
306 | \frame{
307 | \frametitle{SQL}
308 | Select Col1, Col2 (or *)\\
309 | FROM R1\\
310 | Where Condition
311 | }
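\begin{comment}
A minimal sketch of selection (WHERE) and projection (SELECT), run from R
via DBI/RSQLite (assumed installed); the student table and its columns are
illustrative, not from the deck.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "student",
             data.frame(ssn = 1:3,
                        lastname = c("Jordan", "Smith", "Lee"),
                        gpa = c(3.5, 2.9, 3.8)))
# sigma_{gpa > 3} Student, projected onto (ssn, gpa)
dbGetQuery(con, "SELECT ssn, gpa FROM student WHERE gpa > 3")
dbDisconnect(con)
\end{comment}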
312 | \frame{
313 | \frametitle{Cross-Product}
314 | \begin{large_enum}
315 | \item[--]<1->Cartesian Product
316 | \item[--]<2->R1 X R2; Every tuple in R1 with each tuple in R2
317 | \item[--]<3->Size of output: |R1| * |R2|
318 | \item[--]<4->SELECT *\\
319 | FROM R1 CROSS JOIN R2;
320 | \item[--]<5->SELECT *\\
321 | FROM R1, R2;
322 | \end{large_enum}
323 | }
324 |
325 | \frame{
326 | \frametitle{Natural Join}
327 | \begin{large_enum}
328 | \item[--]<1->Means an equi-join on the shared attributes
329 | \item[--]<2->For every record in R1, find a tuple in R2 that satisfies a particular condition
330 | \item[--]<3->Select * \\
331 | From R1, R2\\
332 | Where R1.A = R2.A
333 | \item[--]<4->Or, do the cross-product and then filter
334 | \item[--]<5->Select *\\
335 | From R1 Join R2\\
336 | On R1.A = R2.B\\
337 | \end{large_enum}
338 | }
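\begin{comment}
A minimal sketch of an equi-join, again via DBI/RSQLite (assumed
installed); table and column names are illustrative.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "enrollment", data.frame(course = c("DS1", "DS1", "DS3"),
                                           sid = c(1, 2, 1)))
dbWriteTable(con, "student", data.frame(sid = 1:2,
                                        name = c("Ada", "Boole")))
# All student names for each course: join on the shared id
dbGetQuery(con, "SELECT e.course, s.name
                 FROM enrollment e JOIN student s ON e.sid = s.sid")
dbDisconnect(con)
\end{comment}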
339 | \frame{
340 | \frametitle{Theta Join}
341 | \begin{large_enum}
342 | \item[--]<1->Join without the equality condition
343 | \item[--]<2->Equi-Join is a special case of theta join
344 | \item[--]<3->Find ALL weather stations within 5 miles of centroid of a zip code
345 | \item[--]<4->Select Distinct WStations\\
346 | From Wstations h, ZipCode s\\
347 | Where distance(h.location, s.location) < 5\\
348 | \item[--]<4->`distance' is a user-defined function
349 | \item[--]<5->Band Joins or Range Joins
350 | \end{large_enum}
351 | }
352 | \frame{
353 | \frametitle{Outer Join}
354 | \begin{large_enum}
355 | \item[--]<1->Outer Join
356 | \item[--]<2->Keep all tuples in R1; pad unmatched attributes with NULLs
357 | \item[--]<3->Left outer join\\
358 | Select *\\
359 | From R1 LEFT OUTER Join R2\\
360 | On R1.A = R2.B
361 | \item[--]<4->Right outer join
362 | \item[--]<5->Full outer join
363 | \end{large_enum}
364 | }
365 |
366 | \frame{
367 | \frametitle{Union}
368 | \begin{large_enum}
369 | \item[--]<1->R1 U R2\\
370 | Removes duplicates by default
371 | \item[--]<2->Select * FROM R1\\
372 | UNION\\
373 | Select * FROM R2
374 | \item[--]<3->If you want to keep duplicates, use `UNION ALL'
375 | \end{large_enum}
376 | }
377 | \frame{
378 | \frametitle{Difference}
379 | \begin{large_enum}
380 | \item[--]<1->R1 - R2
381 | \item[--]<2->Select * From R1\\
382 | Except\\
383 | Select * From R2
384 | \item[--]<3->Looks up all tuples in R1 and takes out tuples that it also sees in R2\\\normalsize
385 | R2 can have other tuples\\
386 | Everything in R1, remove things that also appear in R2
387 | \end{large_enum}
388 | }
389 | \frame{
390 | \frametitle{Intersection}
391 | \begin{large_enum}
392 | \item[--]<1->Not a fundamental operator
393 | \item[--]<2->R1 Intersect R2 = R1 - (R1 - R2)
394 | \item[--]<3->Can also express it as join
395 | \end{large_enum}
396 | }
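\begin{comment}
A minimal sketch of the set operators; SQLite supports UNION, UNION ALL,
EXCEPT, and INTERSECT directly. DBI/RSQLite assumed installed; the toy
tables are made up.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "r1", data.frame(a = c(1, 2, 2, 3)))
dbWriteTable(con, "r2", data.frame(a = c(2, 3, 4)))
dbGetQuery(con, "SELECT a FROM r1 UNION SELECT a FROM r2")      # dupes removed
dbGetQuery(con, "SELECT a FROM r1 UNION ALL SELECT a FROM r2")  # dupes kept
dbGetQuery(con, "SELECT a FROM r1 EXCEPT SELECT a FROM r2")     # difference
dbGetQuery(con, "SELECT a FROM r1 INTERSECT SELECT a FROM r2")  # intersection
dbDisconnect(con)
\end{comment}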
397 |
398 | \frame{
399 | \frametitle{(Virtual) Views}
400 | \begin{large_enum}
401 | \item[--]<1->Defining and Using (Virtual) Views\\\normalsize
402 | Three-level vision of database\\
403 | Physical layer, Conceptual layer (abstraction of the data, relations), Logical layer (further abstraction)\\
404 | Real applications use lots and lots of views
405 |
406 | \item[--]<2->Benefit of Views\\\normalsize
407 | Hide some data from some users (Authorization etc.)\\
408 | Make certain queries easier, more natural\\
409 | Modularity of database access
410 |
411 | \item[--]<3->Views\\\normalsize
412 | Query over relations\\
413 | View V = ViewQuery(R1, R2,... RN)\\
414 | Schema of V is schema of query result
415 | \end{large_enum}
416 | }
417 | \frame{
418 | \frametitle{Query over views}
419 | \begin{large_enum}
420 | \item[--]<1->V can be thought of as a table
421 | \item[--]<2->Q is actually rewritten in terms of R1, R2, ....RN
422 | \item[--]<3->R1 can also be view
423 | \item[--]<4->CREATE View VName (A1, ..., AN) As Query
424 | \item[--]<5->Just define views and use them (system optimizes in terms of `base tables')
425 | \item[--]<6->View contents not stored
426 | \item[--]<7->A convenient view is a join
427 | \end{large_enum}
428 | }
429 |
430 | \frame{
431 | \frametitle{Modifying Views}
432 | \begin{large_enum}
433 | \item[--]<1->We cannot insert, update, delete as V isn't stored
434 | \item[--]<2->To alter V, we have to modify the base tables
435 | \end{large_enum}
436 | }
437 | \frame{
438 | \frametitle{Materialized Views}
439 | \begin{large_enum}
440 | \item[--]<1->Creates a physical table
441 | \item[--]<2->But \ldots
442 | \begin{enumerate}
443 | \item[--]<3->Tables can be very large
444 | \item[--]<4->What happens when modifications happen to the base tables
445 | \item[--]<5->Modification to base data can invalidate the view
446 | \end{enumerate}
447 | \end{large_enum}
448 | }
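\begin{comment}
A minimal sketch of a (virtual) view: defined once as a query, then used
like a table; the DBMS rewrites queries over it in terms of the base table.
DBI/RSQLite assumed installed; names are illustrative.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "student", data.frame(ssn = 1:3, gpa = c(3.5, 2.9, 3.8)))
dbExecute(con, "CREATE VIEW honor_roll AS
                SELECT ssn, gpa FROM student WHERE gpa > 3")
dbGetQuery(con, "SELECT * FROM honor_roll")  # no view contents are stored
dbDisconnect(con)
\end{comment}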
449 |
450 | \frame{
451 | \frametitle{SQLite: Data Types}
452 | \begin{large_enum}
453 | \item[--]<1->NULL -- The value is a NULL value
454 | \item[--]<2->INTEGER -- a signed integer
455 | \item[--]<3->REAL -- a floating point value
456 | \item[--]<4->TEXT -- a text string
457 | \item[--]<5->BLOB -- a blob of data
458 | \end{large_enum}
459 | }
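\begin{comment}
A minimal sketch of the five SQLite storage classes in a table definition;
DBI/RSQLite assumed installed, names illustrative.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE student (
                  ssn      INTEGER,  -- signed integer
                  gpa      REAL,     -- floating point
                  lastname TEXT,     -- text string
                  photo    BLOB      -- raw bytes; any column may hold NULL
                )")
dbDisconnect(con)
\end{comment}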
460 |
461 | \end{document}
462 |
--------------------------------------------------------------------------------
/ds4/ds4_present_web.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soodoku/data-science/0af39968f028b6c507945ebdb38bea8771f49303/ds4/ds4_present_web.pdf
--------------------------------------------------------------------------------
/ds4/ds4_web.tex:
--------------------------------------------------------------------------------
1 | \documentclass[compress, black]{beamer}
2 | \setbeamercolor{normal text}{fg=black}
3 | \beamertemplatesolidbackgroundcolor{white}
4 | \usecolortheme[named=black]{structure}
5 | \usepackage{caption}
6 | \captionsetup{labelformat=empty}
7 | \setbeamertemplate{navigation symbols}{}
8 | %\usefonttheme{structurebold}
9 | \usepackage[scaled]{helvet}
10 | \renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif
11 | \usepackage[T1]{fontenc}
12 | \usepackage{setspace}
13 | %\usepackage{beamerthemesplit}
14 | \usepackage{graphics}
15 | \usepackage{hyperref}
16 | \usepackage{graphicx}
17 | \usepackage{verbatim}
18 | \usepackage{amssymb}
19 | \usepackage{wrapfig}
20 | \usefonttheme[onlymath]{serif}
21 | \usepackage{cmbright}
22 |
23 | \def\labelitemi{\textemdash}
24 | \setbeamertemplate{frametitle}{
25 | \begin{centering}
26 | \vskip15pt
27 | \insertframetitle
28 | \par
29 | \end{centering}
30 | }
31 | \title[DS]{Introduction to Statistical Learning}
32 | \author[Sood]{Gaurav~Sood}
33 | \large
34 | \date[2015]{Spring 2015}
35 | \subject{LearnDS}
36 | \begin{document}
37 | \newcommand{\multilineR}[1]{\begin{tabular}[b]{@{}r@{}}#1\end{tabular}}
38 | \newcommand{\multilineL}[1]{\begin{tabular}[b]{@{}l@{}}#1\end{tabular}}
39 | \newcommand{\multilineC}[1]{\begin{tabular}[b]{@{}c@{}}#1\end{tabular}}
40 |
41 | \newenvironment{large_enum}{
42 | \Large
43 | \begin{itemize}
44 | \setlength{\itemsep}{7pt}
45 | \setlength{\parskip}{0pt}
46 | \setlength{\parsep}{0pt}
47 | }{\end{itemize}}
48 |
49 | \begin{comment}
50 |
51 | setwd(paste0(basedir, "github/data-science/ds4/"))
52 | tools::texi2dvi("ds4_web.tex", pdf=TRUE,clean=TRUE)
53 | setwd(basedir)
54 |
55 | \end{comment}
56 | \frame
57 | {
58 | \titlepage
59 | }
60 |
61 | \frame{
62 | \frametitle{Two paradigms of learning}
63 | \begin{large_enum}
64 | \item[--]<2->Supervised Learning
65 | \begin{enumerate}
66 | \item[--]<3->When getting labels (predictions) is expensive
67 | \item[--]<4->Get labels for a small set of data
68 | \item[--]<5->Estimate relationship between $Y$ and $X$
69 | \item[--]<6->Predict labels of unseen data
70 | \item[--]<7->Labels and cost function \emph{supervise} \alert{dimension reduction}
71 | \end{enumerate}
72 | \item[--]<8->Unsupervised Learning
73 | \begin{enumerate}
74 | \item[--]<9->Find vectors similar to each other, maximize differences across
75 | \item[--]<10->Find rows similar to each other, maximize differences across
76 | \end{enumerate}
77 | \end{large_enum}
78 | }
79 |
80 | \frame{
81 | \frametitle{How to learn from data?}
82 | \begin{large_enum}
83 | \item[--]<1->$Y = f(X) + \epsilon$
84 | \item[--]<2->How do we estimate $f(X)$?
85 | \item[--]<3->If similar $x$, similar $y$
86 | \item[--]<4->Function: value of $y$ same as that of the nearest neighbor
87 | \item[--]<5->Question and a concern
88 | \begin{enumerate}
89 | \item[--]<6->What do we mean by nearest?
90 | \item[--]<7->Wouldn't it depend on what $x$ are observed?
91 | \end{enumerate}
92 | \end{large_enum}
93 | }
94 |
95 | \frame{
96 | \frametitle{What do we mean by nearest?}
97 | \begin{large_enum}
98 | \item[-]<2-> Euclidean distance: If $p$ and $q$ are two $n$ dimensional vectors \\\normalsize
99 | $d_e(p,q) = \sqrt{\sum\limits_{i=1}^n (q_i - p_i)^2}$
100 | \item[-]<3-> Some issues:
101 | \begin{itemize}
102 | \item[-]<4-> Not all features on the same scale
103 | \item[-]<5-> Features may be correlated with each other
104 | \end{itemize}
105 | \item[-]<6-> Mahalanobis distance between two vectors:\\\normalsize
106 | Say S is the covariance matrix\\
107 | $d_m(\vec{p},\vec{q}) = \sqrt{(\vec{p}- \vec{q})'S^{-1}(\vec{p} - \vec{q})}$
108 | \item[-]<7-> For Boolean, Jaccard distance:\\\normalsize
109 | $d_j(p,q) = \frac{\lvert p \cup q \rvert - \lvert p \cap q \rvert}{\lvert p \cup q \rvert} $ \\\pause
110 | Applications: Recommender systems, finding similar documents etc.
111 | \end{large_enum}
112 | }
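\begin{comment}
A minimal sketch of the three distances in base R; the vectors are made up.
Note that mahalanobis() returns the squared distance.

p <- c(1, 200); q <- c(3, 180)
sqrt(sum((q - p)^2))                     # Euclidean distance
X <- matrix(rnorm(200), ncol = 2)        # data to estimate the covariance S
sqrt(mahalanobis(p, center = q, cov = cov(X)))  # Mahalanobis distance
a <- c(TRUE, TRUE, FALSE, TRUE)
b <- c(TRUE, FALSE, FALSE, TRUE)
1 - sum(a & b) / sum(a | b)              # Jaccard distance for Booleans
\end{comment}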
113 |
114 | \frame{
115 | \frametitle{Learning from data}
116 | \begin{large_enum}
117 | \item[--]<1->$Y = f(X) + \epsilon$
118 | \item[--]<2->Problem formulated as a regression function \pause \pause
119 | \begin{enumerate}
120 | \item[--]<4->$E(Y|X)$\\
121 | \item[--]<5->$f(x) = E(Y|X=x)$
122 | \end{enumerate}
123 | \item[--]<6->Nearest neighbour averaging\\\normalsize
124 | $\hat{f}(x) = E[Y|X \in N(x)]$
125 | \item[--]<7->Great when small $p$, large $N$
126 | \end{large_enum}
127 | }
128 |
129 | \frame{
130 | \frametitle{Curse of Dimensionality}
131 | \begin{large_enum}
132 | \item[--]<1->$\hat{f}(x) = E[Y|X \in N(x)]$
133 | \item[--]<2->Define neighborhood too tightly, nothing in there.
134 | \item[--]<3->If we expand it, nearest neighbors can be far away
135 | \item[--]<4->How do we solve the problem?
136 | \item[--]<5->One way: parametric and structural models\\
137 | $f(x) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$
138 | \end{large_enum}
139 | }
140 | \frame{
141 | \frametitle{How to expand the linear model?}
142 | \begin{large_enum}
143 | \item[-]<1-> Simple polynomial transformations
144 | \item[-]<2-> Interactions
145 | \item[-]<3-> Step functions. E.g. transform education into $k$ dummy variables.
146 | \item[-]<4-> More complicated basis functions:
147 | \begin{itemize}
148 | \item[-]<5-> Piecewise polynomial: takes differential polynomial for different regions (split by knots)
149 | \item[-]<6-> Add constraint that there are no abrupt changes across regions.
150 | \item[-]<7-> Add constraint that first and second derivatives are the same. (For cubic splines.)
151 | \item[-]<8-> Add boundary constraint: linear before 1st knot or after last knot. (Natural spline.)
152 | \end{itemize}
153 | \end{large_enum}
154 | }
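\begin{comment}
A minimal sketch of the basis expansions above, fit with lm() on the
built-in mtcars data (an illustrative choice, not from the deck).

library(splines)
fit_poly <- lm(mpg ~ poly(hp, 3), data = mtcars)     # polynomial terms
fit_int  <- lm(mpg ~ wt * hp, data = mtcars)         # interaction
fit_step <- lm(mpg ~ cut(hp, 4), data = mtcars)      # step function (k dummies)
fit_ns   <- lm(mpg ~ ns(hp, df = 5), data = mtcars)  # natural cubic spline
\end{comment}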
155 |
156 | \frame{
157 | \frametitle{Error}
158 | \begin{large_enum}
159 | \item[--]<2->Reducible error:\\\normalsize
160 | $Var(\hat{f}(x)) + [\text{Bias}(\hat{f}(x))]^2$, where Bias$(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$
161 | \item[--]<3->Bias-Variance tradeoff
162 | \item[--]<4->As flexibility increases, $Var(\hat{f})$ increases\\\normalsize
163 | Model goes after each wrinkle in the training data
164 | \item[--]<5->Say we have estimated the ideal function $\hat{f}(X)$
165 | \item[--]<6->Ideal w.r.t.\ a loss function, e.g., average squared error: $(Y - \hat{Y})^2$
166 | \item[--]<7->Ideal still leaves some error (irreducible error):
167 | \begin{enumerate}
168 | \item[--]<7->$\epsilon = Y - \hat{f}(x)$
169 | \item[--]<8->$E[(Y - \hat{f}(x))^2 | X = x] = [f(x) - \hat{f}(x)]^2 + Var(\epsilon)$
170 | \end{enumerate}
171 | \end{large_enum}
172 | }
173 |
174 | \frame{
175 | \frametitle{Evaluating Models}
176 | \begin{large_enum}
177 | \item[--]<2->Deviance
178 | \begin{itemize}
179 | \item[--]<2-> Deviance $\propto -$Log-Likelihood
180 | \item[--]<3-> Likelihood: $p(y_1|x_1) \times p(y_2|x_2) \times \ldots \times p(y_n|x_n)$
181 | \item[--]<4-> $\hat{\beta}$ maximize Likelihood (or minimize Deviance)
182 | \item[--]<5-> $R^2 = 1 - \frac{\text{Deviance of Fitted Model}}{\text{Deviance of Null Model}}$
183 | \end{itemize}
184 | \item[--]<6->AIC
185 | \begin{itemize}
186 | \item[--]<7-> $\text{Deviance} + 2*\text{df}$
187 | \item[--]<8-> Out-of-sample Deviance $-$ In-sample Deviance $\sim$ $2*\text{df}$
188 | \item[--]<9-> AIC $\sim$ Out of sample Deviance
189 | \item[--]<10-> AIC overfits in high dimensions (df $\sim$ n).
190 | \item[--]<11-> AICc = $\text{Deviance} + 2*\text{df}*\frac{n}{n - df- 1}$
191 | \item[--]<12-> BIC = $\text{Deviance} + \text{df}*log(n)$
192 | \item[--]<12-> BIC underfits when $n$ is large.
193 | \end{itemize}
194 | \end{large_enum}
195 | }
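\begin{comment}
A minimal sketch of deviance, AIC, and BIC on a logistic regression fit to
the built-in mtcars data (illustrative). For a 0/1 glm the deviance equals
-2 log-likelihood, so AIC = deviance + 2*df here.

fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
deviance(fit)   # -2 * logLik(fit) for 0/1 outcomes
AIC(fit)        # deviance + 2 * 3 (intercept + two slopes)
BIC(fit)        # deviance + 3 * log(nrow(mtcars))
\end{comment}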
196 |
197 | \frame{
198 | \frametitle{Evaluating Classification Models}
199 | \begin{table}[!htb]
200 | \small
201 | \centering
202 | \begin{tabular}{|cc|cc|}
203 | \hline
204 | & & \multicolumn{2}{c|}{\textbf{Observed}} \\
205 | & & true & false\\[0.5ex]
206 | \cline{2-4}
207 | \textbf{Predicted} & true & true positive & false positive\\[0.5ex]
208 | & false & false negative & true negative\\[0.5ex]
209 | \hline
210 | \end{tabular}
211 | \end{table}
212 |
213 | \begin{large_enum}
214 | \item[--]<2->Confusion matrix, $c^2 - c$ total possible errors
215 | \item[--]<3->Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
216 | \item[--]<4->Error Rate: $\frac{FP + FN}{TP + TN + FP + FN}$
217 | \item[--]<5->Sensitivity, TPR: $\frac{TP}{TP + FN}$
218 | \item[--]<6->Specificity, TNR: $\frac{TN}{FP + TN}$
219 | \item[--]<7->BER: $1 - \frac{1}{2}(TPR + TNR)$
220 | \end{large_enum}
221 | }
222 |
223 | \frame{
224 | \frametitle{Evaluating Classification Models}
225 | \begin{large_enum}
226 | \item[--]<1->ROC: TPR Vs. FPR
227 | \item[--]<2->Precision: fraction of retrieved instances that are relevant\\\normalsize
228 | $\frac{TP}{TP + FP}$
229 | \item[--]<3->Recall: fraction of relevant instances that are retrieved\\\normalsize
230 | $\frac{TP}{TP + FN}$
231 | \item[--]<4->$F_1$: $2\frac{\text{precision*recall}}{\text{precision + recall}}$
232 | \item[--]<5->$F_\beta$: $(1 + \beta^2) \frac{\text{precision*recall}}{\beta^2\text{precision + recall}}$
233 | \end{large_enum}
234 | }
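\begin{comment}
A minimal sketch computing these metrics from a confusion matrix in base R;
the predicted and observed labels are made up.

obs  <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
pred <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
tab  <- table(pred, obs)
TP <- tab["1", "1"]; FP <- tab["1", "0"]
FN <- tab["0", "1"]; TN <- tab["0", "0"]
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # same as sensitivity/TPR
F1 <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = F1)
\end{comment}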
235 |
236 | \frame{
237 | \frametitle{Out of Sample Error}
238 | \begin{large_enum}
239 | \item[--]<1->Another way to assess model error
240 | \item[--]<2->$R^2$ always increases with more covariates.
241 | \item[--]<3->Or: As model complexity increases, training error goes down.
242 | \item[--]<4->But out of sample error goes down and then up.
243 | \item[--]<5->Out-of-sample predictions can be worse than just predicting $\bar{y}$ (negative out-of-sample $R^2$).
244 | \item[--]<6->Use out-of-sample error to prevent overfitting
245 | \item[--]<7->Net prediction error on test set can vary a lot.
246 | \end{large_enum}
247 | }
248 | \frame{
249 |
250 | \center{\Large A Clarification}
251 | }
252 | \frame{
253 | \frametitle{Error in thinking about errors}
254 | \begin{large_enum}
255 | \item[--]<2->In the sciences, `data mining' is a dirty phrase
256 | \item[--]<3->Jealousy?
257 | \item[--]<4->Evokes concerns about false positives \ldots
258 | \item[--]<5->But `mining' is by definition `the extraction of valuable [stuff]'
259 | \item[--]<6->So -- is more data worse?
260 | \item[--]<7->Not quite
261 | \item[--]<8->Larger $n$ allows for more precise estimation of relationship
262 | \end{large_enum}
263 | }
264 |
265 | \frame{
266 | \frametitle{False Positives}
267 | \begin{large_enum}
268 | \item[--]<1->Significance testing:
269 | \begin{enumerate}
270 | \item[--]<2->Say .05, 5\% false positive rate
271 | \item[--]<3->Assuming independence, 1 in 20 results is a false positive
272 | \item[--]<4->Say 100 vars, 5 true positives, all sig.; 5\% of the other 95 $\sim$ 5 false positives. So a 50\% \alert{false discovery rate}.
273 | \end{enumerate}
274 | \item[--]<5->Fixes:
275 | \begin{enumerate}
276 | \item[--]<6->Familywise error rate (Bonferroni)
277 | \item[--]<7->Optimization can be done w.r.t.\ the costs of false positives and negatives \\
278 | e.g. Increase cut-off marks in exams, \href{http://www.cancer.gov/cancertopics/pdq/screening/breast/healthprofessional/page8}{Breast Cancer}
279 | \item[--]<8->False Discovery Rate
280 | \end{enumerate}
281 | \end{large_enum}
282 | }
283 |
284 | \frame{
285 | \frametitle{False Discovery Rate}
286 | \begin{large_enum}
287 | \item[]<1->\begin{equation*}\text{False Discovery Proportion} = \frac{\text{\# of FP}}{\text{\# of Sig. Results}}\end{equation*}
288 | \item[--]<2->Can't be known but we can produce cutoffs so $E(FDP) < q$
289 | \item[--]<3->Benjamini and Hochberg (1995):
290 | \begin{enumerate}
291 | \item[--]<4->Rank the $n$ $p$-values, smallest to largest, $p_1$ \ldots $p_n$.
292 | \item[--]<5->$p$-value cut-off = max ($p_k: p_k \leq \frac{qk}{n}$)
293 | \item[--]<6->All $p$-values below that cut-off are declared significant
294 | \item[--]<7->Caveat: Assumes independence
295 | \end{enumerate}
296 | \end{large_enum}
297 | }
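\begin{comment}
A minimal sketch of the multiplicity fixes in base R: p.adjust() implements
both Bonferroni and Benjamini-Hochberg. The simulated p-values (5 real
signals among 100 tests) are made up for illustration.

set.seed(31)
p <- c(runif(5, 0, 0.001), runif(95))           # 5 true positives, 95 nulls
sum(p < 0.05)                                   # naive testing at .05
sum(p.adjust(p, method = "bonferroni") < 0.05)  # FWER control
sum(p.adjust(p, method = "BH") < 0.05)          # FDR control at q = .05
\end{comment}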
298 |
299 | \frame{
300 | \frametitle{Other Ways of Being Wrong}
301 | \begin{large_enum}
302 | \item[--]<1->Sampling
303 | \item[--]<2->Changing data generating process over time
304 | \item[--]<3->Confounding variables (`data leakage')
305 | \item[--]<4->Coding and computational errors
306 | \end{large_enum}
307 | }
308 | \end{document}
309 |
--------------------------------------------------------------------------------
/ds6/kmeans.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soodoku/data-science/0af39968f028b6c507945ebdb38bea8771f49303/ds6/kmeans.pdf
--------------------------------------------------------------------------------
/ds6/kmeans.tex:
--------------------------------------------------------------------------------
1 | \documentclass[compress]{beamer}
2 | \setbeamercolor{normal text}{fg=black}
3 | \beamertemplatesolidbackgroundcolor{white}
4 | \setbeamercovered{transparent, still covered={\opaqueness<1->{0}}, again covered={\opaqueness<1->{30}}}
5 | \usecolortheme[named=black]{structure}
6 | \definecolor{links}{HTML}{98AFC7}
7 | \hypersetup{colorlinks,linkcolor=,urlcolor=links}
8 | \usepackage{caption}
9 | \captionsetup{labelformat=empty}
10 | \setbeamertemplate{navigation symbols}{}
11 | %\usefonttheme{structurebold}
12 |
13 | \usepackage[scaled]{helvet}
14 | \renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif
15 | \usepackage[T1]{fontenc}
16 | \usepackage{setspace}
17 | %\usepackage{beamerthemesplit}
18 | \usepackage{graphics}
19 | \usepackage{Sweave}
20 |
21 |
22 | \usepackage{hyperref}
23 | \usepackage{graphicx}
24 | \usepackage{verbatim}
25 | \usepackage{amssymb}
26 | \usepackage{wrapfig}
27 | \def\labelitemi{\textemdash}
28 | \setbeamertemplate{frametitle}{
29 | \begin{centering}
30 | \vskip15pt
31 | \insertframetitle
32 | \par
33 | \end{centering}
34 | }
35 | \title[DS]{\scalebox{.20}{\includegraphics{specialk.png}}\\ $K$-means Clustering}
36 | \author[Sood]{gaurav~sood\\\href{http://www.gsood.com}{http://gsood.com}\\
37 | \href{https://twitter.com/soodoku}{twitter} \textbf{|} \href{http://github.com/soodoku}{github}}
38 | \large
39 | \date[2015]{\today}
40 | \subject{LearnDS}
41 | \begin{document}
42 | \newcommand{\multilineR}[1]{\begin{tabular}[b]{@{}r@{}}#1\end{tabular}}
43 | \newcommand{\multilineL}[1]{\begin{tabular}[b]{@{}l@{}}#1\end{tabular}}
44 | \newcommand{\multilineC}[1]{\begin{tabular}[b]{@{}c@{}}#1\end{tabular}}
45 |
46 | \newenvironment{large_enum}{
47 | \Large
48 | \begin{itemize}
49 | \setlength{\itemsep}{7pt}
50 | \setlength{\parskip}{0pt}
51 | \setlength{\parsep}{0pt}
52 | }{\end{itemize}}
53 |
54 | \begin{comment}
55 |
56 | setwd(paste0(githubdir, "data-science/ds6/"))
57 | tools::texi2dvi("kmeans.tex", pdf=TRUE,clean=TRUE)
58 | setwd(basedir)
59 |
60 | \end{comment}
61 | \frame
62 | {
63 | \titlepage
64 | }
65 |
66 | \frame{
67 | \frametitle{Unsupervised Learning}
68 | \begin{large_enum}
69 | \item[-]<2> Everything is dimension reduction
70 | \item[-]<3> In supervised learning, labels supervise dimension reduction\\
71 | \item[-]<4> For instance, regression is about finding a low dimensional representation of $Y$
72 | \item[-]<5> Supervised Learning $\sim$ Given Apples and Oranges, learn traits of Apples Vs. Oranges
73 | \item[-]<6> Unsupervised Learning $\sim$ Given a bunch of spherical fruits, optimally describe the types of fruits
74 | \end{large_enum}
75 | }
76 |
77 | \frame{
78 | \frametitle{Ways to Think About Unsupervised Learning}
79 | \begin{large_enum}
80 | \item[-]<2>Learning the probability model of the data $p(x_n|x_1,...,x_{n-1})$
81 | \item[-]<3>\textbf{Applications:} Outlier detection, Data compression
82 | \item[-]<4>Find rows similar to each other, groups of rows dissimilar to each other
83 | \item[-]<5>Find columns similar to each other, groups of columns dissimilar to each other
84 | \item[-]<6>\textbf{Applications:} Group movies by ratings, Segment shoppers
85 | \end{large_enum}
86 | }
87 |
88 | \frame{
89 | \frametitle{Solutions}
90 | \begin{large_enum}
91 | \item[-]<1-3>Two kinds of methods:
92 | \begin{enumerate}
93 | \item[-]<2->Principal components analysis
94 | \item[-]<3->Clustering
95 | \end{enumerate}
96 | \item[-]<4>Clustering looks to partition data into similar subgroups
97 | \item[-]<5-7>Two popular methods:
98 | \begin{enumerate}
99 | \item[-]<6-> Hierarchical clustering (computationally expensive)
100 | \item[-]<7> $k$-means clustering (pre-specify $k$)
101 | \end{enumerate}
102 | \end{large_enum}
103 | }
104 |
105 | \frame{
106 | \only<1>{\scriptsize{Source: \href{http://research.microsoft.com/en-us/um/people/cmbishop/prml/}{Pattern Recognition and Machine Learning}}}
107 |
108 | \only<1>{\center{\includegraphics{kmeans-ex.png}}}
109 | }
110 |
111 | \frame{
112 | \frametitle{$k$-Means Clustering}
113 | \begin{large_enum}
114 | \item[-]<1-3>$k$-means: Assume that we must split data into $k$ clusters
115 | \begin{enumerate}
116 | \item[-]<2-5>Each observation belongs to one cluster
117 | \item[-]<3-5>No observation belongs to more than one cluster
118 | \end{enumerate}
119 | \item[-]<4>Find partitioning that minimizes within cluster variation summed over all $k$
120 | clusters
121 | \item[-]<5>Euclidean distance between observations, sum it over all observations\\\normalsize
122 | \begin{equation}
123 | \min_{C_1,\ldots,C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i^{'} \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i^{'}j})^2
124 | \end{equation}
125 | \end{large_enum}
126 | }
127 |
128 | \frame{
129 | \frametitle{$k$-Means (Lloyd's) Algorithm}
130 | \begin{large_enum}
131 | \item[-]<2>Randomly assign observations to 1 of $k$ clusters
132 | \item[-]<3-5>Iterate:
133 | \begin{enumerate}
134 | \item[-]<4-> For each of the $k$ clusters, compute the centroid
135 | \item[-]<5-> Assign each observation to cluster whose centroid is closest
136 | \end{enumerate}
137 | \end{large_enum}
138 | \vspace{5cm}
139 | }
140 | \frame{
141 | \only<1-6>{\scriptsize{Source: James et al. 2015}}
142 |
143 | \only<1>{\center{\includegraphics{kmeanspic1.png}}}\pause
144 | \only<2>{\center{\includegraphics{kmeanspic2.png}}}\pause
145 | \only<3>{\center{\includegraphics{kmeanspic3.png}}}\pause
146 | \only<4>{\center{\includegraphics{kmeanspic4.png}}}\pause
147 | \only<5>{\center{\includegraphics{kmeanspic5.png}}}\pause
148 | \only<6>{\center{\includegraphics{kmeanspic.png}}}
149 |
150 | }
151 |
152 | \frame{
153 | \frametitle{$k$-Means Algorithm}
154 | \begin{large_enum}
155 | \item[-]<0>Randomly assign observations to 1 of $k$ clusters
156 | \item[-]<0>Iterate:
157 | \begin{enumerate}
158 | \item[-]<1-> For each of the $k$ clusters, compute the centroid
159 | \item[-]<1-> Assign each observation to cluster whose centroid is closest
160 | \end{enumerate}
161 | \item[-]<2>Why does it work?
162 | \item[-]<3>It doesn't. Local minima possible.
163 | \item[-]<4->Initialization:
164 | \begin{enumerate}
165 | \item[-]<5->Forgy: Randomly choose $k$ observations and set them as centroids.
166 | \item[-]<6->Random Partition: Assign each observation randomly to one of the clusters.
167 | \item[-]<7->Run an alternate clustering algorithm on a small sample and use the clusters as initial centroids
168 | \item[-]<8> Pick dispersed points as centroids. For e.g. $k$-means++ and variations of it.
169 | \end{enumerate}
170 | \end{large_enum}
171 | }
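\begin{comment}
A minimal sketch of Lloyd's algorithm via base R's kmeans(); nstart restarts
from several random initializations to reduce the risk of a bad local
minimum. The two Gaussian blobs are simulated for illustration.

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(X, centers = 2, nstart = 20)
fit$centers        # estimated centroids
fit$tot.withinss   # within-cluster variation being minimized
\end{comment}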
172 |
173 | \frame{
174 | \frametitle{Distance between clusters}
175 | \begin{large_enum}
176 | \item[-]<1>Complete Linkage\\\normalsize
177 | Farthest distance between points in clusters
178 | \item[-]<2>Single\\\normalsize
179 | Closest pair
180 | \item[-]<3>Average\\\normalsize
181 | All pairs, and then take the average
182 | \item[-]<4>Centroid\\\normalsize
183 | Has problems called inversions\\
184 | Used in Genomics
185 | \item[-]<5>Complete and Average most commonly used
186 | \end{large_enum}
187 | }
188 |
189 | \frame{
190 | \frametitle{Practical Issues}
191 | \begin{large_enum}
192 | \item[-]<1-4>Choice of Similarity Measure
193 | \begin{enumerate}
194 | \item[-]<2->Scaling Matters
195 | \item[-]<3->Jaccard --- can be approximated quickly by minhashing and locality-sensitive hashing (LSH)
196 | \item[-]<4->Correlation based measures (+/- may matter)
197 | \end{enumerate}
198 | \item[-]<5>High dimensional data. Solutions e.g. DANN
199 | \item[-]<6-8>Choosing $k$:
200 | \begin{enumerate}
201 | \item[-]<7->Calculate average distance to centroid for multiple $k$
202 | \item[-]<8>Plot them, look for the \emph{knee}\\
203 | \visible<8>{\scalebox{0.5}{\includegraphics{knee.jpg}}}
204 | \end{enumerate}
205 | \end{large_enum}
206 | }
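\begin{comment}
A minimal sketch of the knee/elbow heuristic: plot total within-cluster
variation against k and look for the bend. Data simulated as in the
earlier sketch.

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
\end{comment}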
207 | \frame{
208 | \frametitle{Analyst chooses $k$, contd.}
209 | \begin{large_enum}
210 | \item[-]<1-5>Calinski-Harabasz (CH) Index:
211 | \begin{enumerate}
212 | \item[-]<2->Between Cluster, $B = \sum_{k=1}^{K} n_k \lVert \bar{X}_k - \bar{X} \rVert^2$
213 | \item[-]<3->Within Cluster, $W = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert X_i - \bar{X}_k \rVert^2$
214 | \item[-]<4->Maximize Between Cluster Variation, Minimize Within Cluster Variation
215 | \item[-]<5->$\text{CH(K)} = \frac{B(K)}{(K-1)}\frac{n - K}{W(K)}$
216 | \end{enumerate}
217 |
218 | \item[-]<6-8>Gap Statistic (Tibshirani):
219 | \begin{enumerate}
220 | \item[-]<6->Compare observed $W(K)$ to $W_{\text{unif}}(K)$
221 | \item[-]<7->$\text{GAP}(K) = \text{log} W(K) - \text{log} W_{\text{unif}}(K)$
222 | \item[-]<8->Calculate $W_{\text{unif}}(K)$ by simulation.
223 | \end{enumerate}
224 | \end{large_enum}
225 | }
226 | \frame{
227 | \frametitle{Running Time}
228 | \begin{large_enum}
229 | \item[-]<1> $O(kn)$ for each iteration.
230 | \item[-]<2> But total iterations can be a lot, and not bounded.
231 | \item[-]<3> But in practice, polynomial running time.
232 | \item[-]<4-6> Big (Long) Data Solutions:
233 | \begin{enumerate}
234 | \item[-]<5->Bradley-Fayyad-Reina (BFR)
235 | \item[-]<6->CURE
236 | \end{enumerate}
237 | \item[-]<7-8>BFR
238 | \begin{enumerate}
239 | \item[-]<7->Assumes clusters are normally distributed around a centroid in Euclidean space.
240 | \item[-]<8->Exploits that to quantify the likelihood that a point belongs to a cluster
241 | \end{enumerate}
242 | \end{large_enum}
243 | }
244 | \end{document}
245 |
--------------------------------------------------------------------------------
/graphs/ggplot2.md:
--------------------------------------------------------------------------------
1 | Efficient summary of the ggplot2 book
2 | ======================================
3 |
4 | TOC:
5 | --------
6 | 1. [Introduction](#chapter-1-introduction)
7 | 2. [qplot](#chapter-2-qplot)
8 | 3. [Mastering the Grammar](#chapter-3-mastering-the-grammar)
9 | 4. [Build a Plot Layer by Layer](#chapter-4-build-a-plot-layer-by-layer)
10 | 5. [Toolbox](#chapter-5-toolbox)
11 | 6. [Scales, Axes, and Legends](#chapter-6-scales-axes-and-legends)
12 | 7. [Positioning](#chapter-7-positioning)
13 | 8. [Polishing Your Plots for Publication](#chapter-8-polishing-your-plots-for-publication)
14 | 9. [Manipulating Data](#chapter-9-manipulating-data)
15 | 10. [Reducing Duplication](#chapter-10-reducing-duplication)
16 |
17 | Chapter 1: Introduction
18 | -------------------------
19 |
20 | **Note**: To learn more about grammar of graphics, see [Wilkinson 2005](http://www.amazon.com/The-Grammar-Graphics-Statistics-Computing/dp/0387245448), [Wickham 2009 (pdf)](http://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)
21 |
22 | **What is a statistical graphic?**
23 | - Mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars)
24 |
25 | **Basic components of such a mapping:**
26 | - data: stuff we want to visualize, mapping describes how variables are mapped to aesthetic attributes
27 | - geom: represents what you see on the plot: points, lines, polygons, etc.
28 | - scale: map values in data space to values in aesthetic space (color, size, shape), draws legends, axes
29 | - coord: data coordinates mapped to the plane of the graphic
30 | - facet: how to break data into subsets
31 |
32 | **How ggplot fits into other R graphics:**
33 | - **Base graphics**:
34 | - Pen and paper model: can only draw on top of the plot, cannot modify or delete existing content
35 | - No user accessible representation of graphics apart from appearance on screen
36 | - Includes tools for drawing primitives and entire plots
37 | - **Grid**:
38 | - Developed by Paul Murrell (1998 dissertation)
39 | - Grid gobs (graphical objects) can be represented independently of the plot and modified later
40 | - System of viewports (each containing own coord. system) allows for complex graphics
41 | - Draws primitives but no way to produce statistical graphics
42 | - **Lattice**:
43 | - Sarkar 2008a
44 | - Implements trellis of Cleveland (1985, 1993)
45 | - Produces conditioned plots
46 | - Lacks formal model
47 |
48 | Chapter 2: qplot
49 | ------------------
50 |
51 | Basic plotting function.
52 |
53 | #### Scatterplot + Smoothers
54 | ```{r }
55 |
56 | # Assume x and y are quant. variables
57 | qplot(x,y, data=data)
58 |
59 | qplot(x, y, # x or y can be a complex variable (a*b*c)
60 | data=data,
61 | color=variable, # you can also do color=I("red") which isn't same as mapping
62 | shape=variable,
63 | alpha= I(1/10), # alpha for overplotting
64 | geom = c("point", "smooth")) # geom="path" for assessing relation w/ time etc., paths
65 |
66 | "
67 | Smoothers:
68 | - method="loess", span =.2
69 | - method = "gam", formula = y ~ s(x), formula = y ~ s(x, bs = "cs")
70 | - method = "lm"
71 | - method = "lm", formula = y ~ ns(x, 5)
72 | - method = "rlm" # robust linear model
73 | "
74 |
75 | qplot(x, y,
76 | data=data,
77 | geom=c("smooth"),
78 | method="loess",
79 | span=.2)
80 | ```
81 |
82 | #### Boxplot and jittered points
83 | ```{r }
84 |
85 | # Scatter plot for quant vars.
86 | # To prevent overplotting - jitter, and alpha
87 | qplot(categorical, quantitative, data=data, geom="jitter", alpha = I(1/10))
88 |
89 | # for boxplots, geom="boxplot"
90 | ```
91 |
92 | #### Histogram and Density
93 | ```{r }
94 | # 1d geoms: geom=histogram, freqpoly, density, bar
95 |
96 | qplot(x, data=data, geom="histogram", # or geom = "density"
97 | binwidth=.1,
98 | fill=color, # color is a variable to which you want to map color
99 | xlim=c(a,b))
100 |
101 | # weighted bar chart
102 | qplot(x, data=data, geom="bar", weight=variable)
103 |
104 | ```
105 |
106 | #### Timeseries, paths
107 | ```{r }
108 | qplot(date, y, data=data, geom="line")
109 | ```
110 |
111 | #### Faceting
112 | ```{r }
113 | qplot(x, data=data, facets = category ~ ., geom="histogram", binwidth=.1)
114 |
115 | ```
116 | #### Other options
117 | ```{r }
118 | qplot(x, y, data=data,
119 | xlab = "x axis label",
120 | ylab = "y axis label",
121 | main = "plot title",
122 | xlim = c(xmin, xmax),
123 | log = "xy") # logs x and y variables
124 |
125 | ```
126 |
127 | Chapter 3: Mastering the Grammar
128 | -----------------------------------
129 |
130 | #### Basic components of ggplot2:
131 | * Geoms: describe type of plot
132 | - scatterplot: geom = point
133 | - bubblechart: geom = point, size = mapped to a variable
134 | - barchart: geom = bar etc.
135 | * scaling: convert data units to pixels/spacing/colors
136 | * coord: use coordinate system for position (coord)
137 | * position adjustment
138 | * faceting
139 | * End product: new dataset that records this information (x, y, color, size, shape)
140 |
141 | #### Layer
142 | * layer = data (and mappings to it), stat, geom, position adjustment
143 | * Plot can have multiple layers, with diff. datasets
144 |
145 | #### Plot object:
146 | * a list with data, mapping, layers, scales, coordinates, and facet. and options (to store plot-specific theme options)
147 | * to render to screen: print()
148 | * to render to disk: ggsave()
149 | * to describe structure: summary()
150 | * to save cached copy to disk: save() (which can be brought back up using load())
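
A quick sketch of those calls on a throwaway plot (`mtcars` here is an illustrative choice, not from the book):
```{r }
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
summary(p)                 # describe structure
print(p)                   # render to screen
ggsave("mpg.pdf", p)       # render to disk
save(p, file = "p.rdata")  # cached copy; restore with load("p.rdata")
```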
151 |
152 | Chapter 4: Build a Plot Layer by Layer
153 | ---------------------------------------
154 |
155 | ggplot always takes data as a data.frame. Start with mapping basic aesthetics to data, and then add layers to it:
156 |
157 | ```{r }
158 | p <- ggplot(data, aes(x, y, color=z)) # nothing would be displayed as no layers yet
159 | p + layer(geom = "point") # use + to add layers, this uses the mapping given and default values for stat and position adj.
160 | ```
161 |
162 | #### Layer:
163 | ```{r }
164 | layer(geom, geom_params, stat, stat_params, data, mapping, position)
165 | layer(geom = "bar",
166 | geom_params=list(fill= "red"),
167 | stat="bin",
168 | stat_params = list(binwidth=2))
169 | ```
170 |
171 | #### Shortcuts
172 | * Specifying layer is verbose so we use shortcuts exploiting the fact that
173 | - every geom is associated w/ default stat
174 | - every stat w/ default geom
175 | * All shortcut functions start either with geom_ or stat_
176 |
177 | ```{r }
178 | geom_histogram(binwidth=X, fill="red")
179 | geom_point()
180 | geom_smooth()
181 | ```
182 |
183 | #### Each basic component in detail
184 |
185 | **Data**
186 | - Must be a data.frame
187 | - If you want to produce the same plot for different data frames
188 | - to see same plot with new data:
189 | ```{r }
190 | p <- ggplot(data, aes(x,y)) + geom_point()
191 | p %+% new_data
192 | ```
193 |
194 | **Aesthetic Mappings**
195 | - aes function
196 | - the mappings can be modified later by doing + aes()
197 | - do summary(p) to see how the object changes
198 | - you can change the y-axis by doing `p + geom_point(aes(y=new_y))`
199 | - **Setting Versus Mapping**
200 | - color = "red" versus color = var
201 |
202 | **Grouping**
203 | - break by a categorical variable for instance
204 |
205 | **Geoms**
206 | - each geom has a set of aesthetics it needs and a larger set it understands.
207 | - For instance, point needs x and y and understands color, size, shape
208 | - Bar geom needs height (ymax) and understands width, border color, fill color
209 | - every geom has a default stat
210 | - list of all geoms:
211 | - abline, area, bar,
212 | - blank (draws nothing),
213 | - boxplot
214 | - contour (3d contours in 2d), crossbar (hollow bar with middle indicated by a line)
215 | - density, density_2d (2d density)
216 | - errorbar
217 | - histogram, hline
218 | - interval, jitter, line (connect obs.)
219 | - linerange
220 | - path (connect obs. in original order)
221 | - point, pointrange (vertical line, with point in the middle)
222 | - polygon
223 | - ribbon, rug
224 | - segment, smooth, step (connect obs. in stairs)
225 | - text, tile, vline
226 |
227 | **Stat**
228 | - stat takes dataset as input and outputs a dataset
229 | - stat_bin produces: count, density, center of the bin (x)
230 | - list of all stats:
231 | - bin, boxplot, contour, density, density_2d, function, identity, qq,
232 | - quantile, smooth, spoke (converts angle and radius to xend and yend),
233 | - sum, summary, unique
234 |
235 | **Position Adjustments**
236 | - dodge, fill, identity, jitter, stack
237 |
238 | **Growth Trajectory tip**
239 | - Plots of longitudinal data are called spaghetti plots
240 | - Plot estimated trajectories + actual data and can't see the issues
241 | - Plot residuals
242 |
243 | Chapter 5: Toolbox
244 | -----------------------------------
245 |
246 | #### Three purposes of a layer:
247 | 1. Display the data
248 | 2. Display stat summaries
249 | 3. Display metadata
250 |
251 | #### Basic plot types (shortcuts) (all these understand size and color aesthetic):
252 | * geom_area(), geom_bar()
253 | * geom_line() (also understands linetype), geom_path() (same as geom_line but lines connected in order)
254 | * geom_point(), geom_polygon() (filled path, each vertex = separate row)
255 | * geom_text() (also has hjust, vjust)
256 | * geom_tile()
257 | * geom_histogram()
258 | * geom_freqpoly
259 | * geom_boxplot()
260 | * geom_density()
261 | * geom_jitter() (crude way of looking at categorical data, prevents overplotting)
262 |
263 | #### Dealing with Overplotting
264 | * Make points smaller (shape =".") or using hollow glyphs (shape =1)
265 | * alpha
266 | * jitter
267 | ```{r }
268 | p <- ggplot(data, aes(x, y))
269 | j <- position_jitter(width=.5)
270 | p + geom_jitter(position = j, colour = alpha("black", 1/10))
271 | ```
272 |
273 | * binning
274 | - but breaking in squares can produce artifacts so use hexagon (geom_hexagon)
275 | - stat_bin2d
276 | - stat_binhex(bins = 10)
277 | - stat_binhex(binwidth=c(.02, 200))
278 |
279 | * stat summaries
280 |
281 | #### Maps
282 | * add map borders using borders() function
283 | ```{r }
284 | library(maps)
285 | data(us.cities)
286 | ggplot(data, aes(long, lat)) + borders("state", size=.5)
287 |
288 | "
289 | choropleth maps:
290 | start by merging map data with whatever else
291 | reorder the merged
292 | geom = polygon
293 | "
294 | ```
295 | #### Revealing Uncertainty
296 |
297 | Geoms that can do intervals:
298 | * geom_ribbon -> geom_smooth
299 | * geom_errorbar -> geom_crossbar
300 | * geom_linerange -> geom_pointrange
301 |
302 | ```{r }
303 | library(effects)
304 | a <- as.data.frame(effect(...))
305 | ```
306 |
307 | #### Stat Summaries
308 | * stat_summary()
309 | * individual summary funcs like fun.y, fun.ymin, fun.ymax - take functions that take a vector and return single numeric
310 | ```{r }
311 | funs <- function(x) mean(x, trim=.5)
312 | stat_summary(fun.y = funs, geom="point")
313 | ```
314 | * more complex data func = fun.data (return named vector as output)
315 | ```{r }
316 | stat_summary(fun.data=exoticfunc, geom="ribbon")
317 | ```
318 |
319 | #### Annotating a Plot
320 | * Key thing: Annotations are extra data
321 | ```{r }
322 | geom_vline(), geom_hline # use it to highlight certain points on say a timeline
323 | geom_rect() to highlight interesting rect regions (takes xmin, xmax, ymin, ymax)
324 | geom_text(aes(x, y, label="caption")) # x, y are coords, label can be a single label, list of labels
325 | ```
326 | * Arrows
327 | - geom_line, geom_path, and geom_segment all have an arrow param
328 | - use arrow() function to tweak the arrow. arrow takes angle, length, ends, type
329 |
330 | Chapter 6: Scales, Axes, and Legends
331 | -----------------------------------
332 |
333 | **Scales control mapping from data to aesthetics. They also provide viewers a guide to map back from aesthetics to data.**
334 |
335 | #### How Scales work:
336 | * Input can be discrete or continuous.
337 | * Process of mapping domain to range:
338 | - transformation: only for continuous. log etc.
339 | - training: domain is learned. complicated when graph has multiple layers
340 | - mapping: function that translates domain to range (which we knew before)
341 |
342 | #### Usage
343 | ```{r }
344 | plot <- qplot(quant, quant, data)
345 | plot + aes(x=categorical) # doesn't work as scale of x is still cont.
346 | plot + aes(x=categorical) + scale_x_discrete()
347 | ```
348 |
349 | To modify default_scale:
350 | * Must construct a new scale and add with +
351 | * Scale constructors
352 | - start with scale_
353 | - followed by aesthetic (color_, shape_, x_)
354 | - followed by name of scale (gradient, hue, manual)
355 | - for e.g. scale_color_hue
356 | * Scale for each aesthetic
357 | - Colour and fill:
358 | - discrete: brewer, grey, hue, identity, manual (e.g. scale_color_brewer())
359 | - cont: gradient, gradient2, gradientn
360 | - position:
361 | - discrete
362 | - continuous, date
363 | - shape:
364 | - discrete only: shape, identity, manual
365 | - line type:
366 | - discrete only: linetype, identity, manual
367 | - size:
368 | - discrete: identity, manual
369 | - cont: size
370 | * Types of Scales
371 | - Position Scales: for axes
372 | * Color Scales: map to colors
373 | * Manual Scales: map manually to size, shape, color, line type etc.
374 | * Identity Scale
375 |
376 | #### Common Arguments
377 | * name
378 | - label for axis/legend
379 | - use \n for a new line
380 | - math expressions using plotmath
381 | - common shortcuts: xlab, ylab, labs()
382 |
383 | * limits
384 | - fixes limits
385 |
386 | * breaks, labels
387 | - breaks control which vals appear on axis/legend
388 | - labels - labels to appear on those breakpoints
389 |
390 | * formatter
391 | - if no labels specified, this is called.
392 | - useful formatters: comma, percent, dollar, scientific, and abbreviate (for categorical scales)
393 |
394 | #### Position Scales
395 |
396 | * Shortcuts
397 | - xlim(10, 20) # produces cont. scale
398 | - xlim("a", "b", "c")
399 | * If you don't want graph to extend beyond range of data: expand = c(0,0)
400 | * Continuous Scales:
401 | - scale_x_continuous
402 | - takes trans argument to transform: pow10, probit, recip (x^-1), reverse, sqrt, log, exp, identity, etc.
403 | - scale_x_log10() = scale_x_continuous(trans = "log10")
404 | * Date and Time
405 | - takes as.Date or as.POSIXct
406 | - takes three things: major (breaks), minor (breaks), format (formatting of labels)
407 | - e.g. major = "2 weeks", format = "%d/%m/%y" etc.
408 |
409 | #### Color
410 | * Read more about colour at http://www.handprint.com/HP/WCL/color7.html
411 | * hcl color space: hue (color), chroma (purity of color), luminance (lightness of color)
412 | * continuous:
413 | - scale_colour_gradient() # 2 color gradient
414 | - scale_color_gradient2() # 3 color gradient
415 | - scale_color_gradientn() # custom n gradient
416 | - scale_fill_gradient() etc. also...
417 | - RColorBrewer::display.brewer.all, can do fill = vore (where vore is from color brewer)
418 | * discrete:
419 | - scale_linetype
420 | - scale_shape
421 | - scale_size_discrete
422 | - scale_shape_manual
423 | ```{r }
424 | ggplot(data, aes(x)) +
425 | geom_line(aes(y = y + 5, color = "above")) +
426 | geom_line(aes(y = y -5, color = "below")) +
427 | scale_color_manual("Legend Label", c("below" = "blue", "above" ="red"))
428 | ```
429 | * scale_identity: data are colours
430 |
431 | #### Legends and Axes
432 | * Axis: axis label, tick mark, tick label
433 | * Legend: legend title, key, key label
434 | * Scale name controls axis label, legend title
435 | * breaks and labels (see above)
436 | * for visual appearance, see axis.*, legend.*
437 | * internal grid: panel.grid.major/minor
438 | * legend.position, legend.justification
439 |
440 | Chapter 7: Positioning
441 | -----------------------------------
442 | #### Faceting:
443 |
444 | **facet_grid: 2d grid of panels**
445 | * facet_grid(. ~ .) means neither rows nor columns are faceted
446 | * facet_grid(. ~ var) produces single row with multiple cols
447 |
448 | **facet_wrap: 1d ribbon of panels wrapped into 2d**
449 |
450 | **Controlling Scales**
451 | * scales="fixed", x and y scales are fixed across all panels
452 | * scales = "free", x and y vary across panels
453 | * scales = "free_x" only x varies, y is fixed
454 |
455 | **Dodging vs. Faceting**
456 | * Can be pretty similar except for the labeling.
457 | ```{r }
458 | qplot(xaxis, data=data, geom="bar", fill=diff_color_bars_same_x, position ="dodge")
459 | # similar to
460 | qplot(diff_color_bars_same_x, data=data, geom = "bar", fill=diff_color_bars_same_x) +
461 | facet_grid(. ~ xaxis)
462 | ```
463 | * To facet by continuous vars, categorize it: cut_interval(x, n=10) etc.
464 |
465 | **Coordinate Systems**
466 | * Diff coordinate systems built in:
467 | * coord_flip: x is y now
468 | * coord_equal: equal scale coords
469 | * coord_trans: transform/log etc.
470 | * coord_map
471 | * coord_polar
472 | ```{r }
473 | plot + coord_flip()
474 | plot + coord_trans(x="pow10")
475 | plot + coord_cartesian(xlim=c(0,20))
476 | # coord_polar takes theta - position variable that is mapped to angle
477 | plot + coord_polar(theta="y") # default is x (creates pie chart from a stacked bar chart)
478 | ```
479 |
480 | Chapter 8: Polishing Your Plots for Publication
481 | ------------------------------------------------
482 | **Themes**
483 | * Themes control non data elements of plot
484 | * Themes control: font (title, axis, tick labels, strips, legend labels, legend key labels), color (ticks, grid lines, backgrounds (panel, plot, strip, legend))
485 | * Built in themes
486 | * theme_gray() (Based on advice by Tufte, Brewer, Carr etc.)
487 | * theme_bw()
488 | ```
489 | previous_theme <- theme_set(theme_bw())
490 | plot + previous_theme
491 | theme_set(previous_theme) # permanently restores previous theme
492 | ```
493 | * Theme elements
494 | * theme_text() for labels, headings. Control font family, color, face, size, hjust, vjust, angle, lineheight
495 | * theme_line(), theme_segment() take color, size, and linetype. e.g. opts(panel.grid.major = theme_line(color="red"))
496 | * theme_rect() for background etc. and takes color, size, and linetype (changes the border)
497 | * theme_blank() draws nothing. e.g. opts(panel.grid.major = theme_blank())
498 | * you can alter old theme with theme_update
499 | * Save
500 | * ggsave()
501 | * `pdf(); qplot(); dev.off()`
502 | * Subplots
503 | * `viewport(x, y, width, height)`
504 | * `pdf(); vp <- viewport(); small_graph; print(big_graph, vp=vp); dev.off()`
505 | * Grids
506 | * grid.layout()
507 | * grid.newpage()
508 | * pushViewport(viewport(layout = grid.layout(2,2)))
509 |
510 | Chapter 9: Manipulating Data
511 | -----------------------------------
512 | * ddply(.data, .variables, .fun, ...)
513 | * .data: dataset to break up
514 | * .variables: grouping var
515 | * .fun: summary function to apply
516 | * colwise() to apply a function to each column of data
517 | * colwise(function)(data)
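
A minimal sketch of both helpers, assuming the plyr package and using `mtcars` as illustrative data:
```{r }
library(plyr)
ddply(mtcars, .(cyl), summarise, mean_mpg = mean(mpg))  # per-group summary
colwise(mean)(mtcars)                                   # mean of every column
```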
518 |
519 | Chapter 10: Reducing Duplication
520 | -----------------------------------
521 | * last plot can be accessed via last_plot()
522 | * you can save plot templates
523 | ```{r }
524 | xquiet <- scale_x_continuous("", breaks=NA)
525 | yquiet <- scale_y_continuous("", breaks=NA)
526 | quiet <- c(xquiet, yquiet)
527 | plot + quiet
528 | ```
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
1 | Data Science: Some Basics
2 | ==========================
3 |
4 | 1. Introduction to Data Science ([presentation](ds1/ds1_present_web.pdf), [tex](ds1/ds1_web.tex))
5 | * What can Big Data do for you?
6 | * What is Big Data?
7 | * Implications for Statistics and Computation
8 | * What is Data Science?
9 | * Prerequisites
10 |
11 | 2. Get your own (Big) Data ([presentation](ds2/ds2_present_web.pdf), [tex](ds2/ds2_web.tex))
12 | * Scrape web pages and pdfs. ([Scripts](https://github.com/soodoku/python-workshop))
13 | * Image to Text ([Python Script using Tesseract](https://github.com/soodoku/image-to-text))
14 | * Image to Text in R using the [Abbyy FineReader Cloud OCR](https://github.com/soodoku/abbyyR)
15 | * Image to Text in R using the [Captricity API](https://github.com/soodoku/captr)
16 | * Web Scraping/API Applications:
17 | - [Get Data on Journalists](https://github.com/soodoku/get-journalist-data)
18 | - [Get Weather Data](https://github.com/soodoku/get-weather-data)
19 | - [Get Cricket Data](https://github.com/soodoku/get-cricket-data)
20 | - [Get Congressional Speech Data](https://gist.github.com/soodoku/85d79275c5880f67b4cf)
21 | - [Track FB Likes, Twitter Followers, Youtube Views](https://github.com/soodoku/likes-followers-views)
22 | - [Track Civil Rights Coverage in NY Times using NYT API](https://github.com/soodoku/nyt-civil-rights)
23 | * [Get Social Networking Data](https://github.com/pablobarbera/social-media-workshop)
24 | * Regular Expressions
25 | * Pre-process text data
26 | * [Assignment](ds2/scraping_assignment_web.txt)
27 |
28 | 3. Databases and SQL ([presentation](ds3/ds3_present_web.pdf), [tex](ds3/ds3_web.tex))
29 | * What are databases?
30 | * Relational Model
31 | * Relational Algebra
32 | * Basic SQL
33 | * Views
34 |
35 | 4a. [Introduction to Introduction to Statistical Learning](https://github.com/soodoku/ds)
36 |
37 | 4b. Introduction to Statistical Learning ([presentation](ds4/ds4_present_web.pdf), [tex](ds4/ds4_web.tex))
38 | * How to learn from data?
39 | * Nearest Neighbors
40 | * When you don't have good neighbors
41 | * Assessing model fit
42 | * Clarification about Big Data
43 |
44 | 5. Supervised Methods
45 |
46 | 6. Unsupervised Methods
47 | * PCA, CA
48 | * k-means ([presentation](ds6/kmeans.pdf), [tex](ds6/kmeans.tex))
49 |
50 | 7. Presenting Analyses
51 | * [ggplot2 in brief](graphs/ggplot2.md)
52 | * Examples of ggplot in action:
53 | - NYT Civil Rights Coverage ([R code](https://github.com/soodoku/nyt-civil-rights/blob/master/plot.R), [Graph](https://github.com/soodoku/nyt-civil-rights/blob/master/nyt_aa.pdf))
54 | - Military Experience of UK Prime Ministers ([R code](https://github.com/soodoku/military-experience/blob/master/mil_plots.R), [Graph](https://github.com/soodoku/military-experience/blob/master/ukmil.pdf))
55 | - [Suggestions for writing](http://gbytes.gsood.com/on-writing/)
56 |
57 | 8. Some Applications
58 | * From paper to digital ([presentation](app/PaperToDigital.pdf), [tex](app/PaperToDigital.tex))
59 | * Text as Data
60 | - [Sentiment Analysis](https://gist.github.com/soodoku/22e4cff2eb6a05be3c0d)
61 | - [Model Relationship Between Words and Ideology](https://github.com/soodoku/speech-learn)
62 | - [Basic Text Classifier](https://gist.github.com/soodoku/e34dbe0219b0f00a74d5)
63 |
64 | Suggested Books
65 | --------------------
66 |
67 | [The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition](http://www.amazon.com/The-Elements-Statistical-Learning-Prediction/dp/0387848576)
68 | By Trevor Hastie, Robert Tibshirani, Jerome Friedman
69 | ISBN: 0387848576
70 |
71 | [Python Programming: An Introduction to Computer Science](http://www.amazon.com/Python-Programming-Introduction-Computer-Science/dp/1887902996)
72 | By John Zelle
73 | ISBN: 1590282418
74 |
75 | [ggplot2: Elegant Graphics for Data Analysis (Use R!)](http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403)
76 | By Hadley Wickham
77 | ISBN: 0387981403
78 |
79 | License
80 | --------------------
81 | Released under the [Creative Commons License](License.md).
82 |
--------------------------------------------------------------------------------