├── TopicNotes ├── ComputerVision │ ├── .#CreatingFilterCannyEdgeDetector.ipynb │ └── .ipynb_checkpoints │ │ ├── Untitled-checkpoint.ipynb │ │ ├── CornerDetector-checkpoint.ipynb │ │ ├── PizzaBlueScreen-checkpoint.ipynb │ │ ├── CannyEdgeDetector-checkpoint.ipynb │ │ └── HistogramOfOrientedGradients-checkpoint.ipynb ├── BayesNets │ ├── Bayes Nets Subtitles.zip │ └── subtitles │ │ ├── 9 - 34 - Two Test Cancer 2 V4 - lang_en_vs51.srt │ │ ├── 26 - General Bayes Net Solution - lang_en_vs51.srt │ │ ├── 28 - General Bayes Net 2 Solution - lang_en_vs51.srt │ │ ├── 30 - General Bayes Net 3 Solution - lang_en_vs51.srt │ │ ├── 31 - Value Of A Network - lang_en_vs51.srt │ │ ├── 27 - General Bayes Net 2 - lang_en_vs51.srt │ │ ├── 5 - Bayes Network Solution - lang_en_vs51.srt │ │ ├── 29 - General Bayes Net 3 - lang_en_vs51.srt │ │ ├── 32 - D Separation - lang_en_vs51.srt │ │ ├── 13 - Conditional Independence2 - lang_en_vs51.srt │ │ ├── 21 - 46 - Explaining Away 2 V3 - lang_en_vs1.srt │ │ ├── 15 - Absolute And Conditional - lang_en_vs51.srt │ │ ├── 2 - Challenge Question - lang_en_vs52.srt │ │ ├── 12 - Conditional Independence - lang_en_vs50.srt │ │ ├── 18 - Confounding Cause Solution - lang_en_vs51.srt │ │ ├── 34 - 60 - D Separation 2 - lang_en_vs1.srt │ │ ├── 33 - D Separation Solution - lang_en_vs51.srt │ │ ├── 16 - Absolute And Conditional Solution - lang_en_vs51.srt │ │ ├── 10 - 35 - Two Test Cancer 2 Solution V4 - lang_en_vs51.srt │ │ ├── 1 - Introduction - lang_en_vs51.srt │ │ ├── 7 - 32 - Two Test Cancer V6 - lang_en_vs51.srt │ │ ├── 20 - 45 - Explaining Away Solution V3 - lang_en_vs1.srt │ │ ├── 24 - 49 - Explaining Away 3 Solution V3 - lang_en_vs1.srt │ │ ├── 23 - 48 - Explaining Away 3 V4 - lang_en_vs1.srt │ │ ├── 25 - Conditional Dependence - lang_en_vs51.srt │ │ ├── 36 - 63 - D Separation 3 Solution - lang_en_vs1.srt │ │ ├── 35 - D Separation 2 Solution - lang_en_vs51.srt │ │ ├── 19 - 44 - Explaining Away V5 - lang_en_vs1.srt │ │ ├── 8 - 33 - Two Test Cancer Solution V4 - lang_en_vs51.srt │ │ ├── 4 - 29 - Bayes Network Merged FINAL - lang_en_vs51.srt │ │ ├── 17 - Confounding Cause - lang_en_vs51.srt │ │ ├── 22 - 47 - Explaining Away 2 Solution V3 - lang_en_vs1.srt │ │ ├── 14 - Conditional Independence2 Solution - lang_en_vs51.srt │ │ ├── 11 - Conditional Independence - lang_en_vs51.srt │ │ ├── 6 - 31 - Computing Bayes Rules Merged FINAL - lang_en_vs51.srt │ │ └── 3 - Challenge Question Solution - lang_en_vs52.srt ├── reinforcementLearning │ └── Notes.rst ├── KnowledgeBasedAI │ └── Notes.rst ├── MachineLearning │ ├── UnsupervisedLearning.rst │ └── SupervisedNotes.rst ├── LSTM │ └── CourseNotes.rst ├── GeneralAdverserialNetworks │ └── courseNotes.rst ├── Planning │ └── PlanningMDP.rst ├── Logic │ └── courseNotes.rst ├── CapsNets │ └── capsnets.rst ├── Hyperparameters │ └── CourseNotes.rst ├── Search │ └── CourseNotes.rst ├── InferenceBayesNets │ └── CourseNotes.rst ├── Introduction_to_AI │ └── CourseNotes.rst └── RecurrentNeuralNetworks │ └── courseNotes.rst ├── MathNotes ├── Probability.rst ├── trigonemetry.rst ├── Precalculus.rst ├── IntroProjectiveGeometry.rst ├── N-Dimensional-Geometry.md.html └── DifferentialIntegralCalculus.rst ├── LICENSE └── BookNotes ├── ElementsOfProbabilityAndStatistics.rst └── ComputerGraphicsPrinciplesPracticeInC.rst /TopicNotes/ComputerVision/.#CreatingFilterCannyEdgeDetector.ipynb: -------------------------------------------------------------------------------- 1 | kaan@mb-Precision-7510.5748:1533826696 
-------------------------------------------------------------------------------- /TopicNotes/BayesNets/Bayes Nets Subtitles.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/D-K-E/ai-notes/master/TopicNotes/BayesNets/Bayes Nets Subtitles.zip -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/CornerDetector-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/PizzaBlueScreen-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/CannyEdgeDetector-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/HistogramOfOrientedGradients-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/9 - 34 - Two Test Cancer 2 V4 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,380 --> 00:00:02,678 3 | Calculate for 4 | me the probability of cancer, 5 | 6 | 2 7 | 00:00:02,678 --> 00:00:08,590 8 | given that you have received one positive 9 | and one negative test result. 10 | 11 | 3 12 | 00:00:08,590 --> 00:00:10,450 13 | Please write your number into this box. 14 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/26 - General Bayes Net Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,540 --> 00:00:02,090 3 | And the answer is 13. 4 | 5 | 2 6 | 00:00:02,090 --> 00:00:06,500 7 | 1 over here, 2 over here, 8 | and 4 over here. 9 | 10 | 3 11 | 00:00:06,500 --> 00:00:11,081 12 | Simply speaking, 13 | any variable that has K 14 | 15 | 4 16 | 00:00:11,081 --> 00:00:15,245 17 | inputs requires 2 to 18 | the K such variables. 19 | 20 | 5 21 | 00:00:15,245 --> 00:00:18,469 22 | So in total we have 1, 9, 13.
23 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/28 - General Bayes Net 2 Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,012 --> 00:00:06,640 3 | And the answer is 19, so one here, 4 | one here, one here, two here, two here. 5 | 6 | 2 7 | 00:00:06,640 --> 00:00:10,537 8 | Two arrows pointing to G which makes for 9 | four and 10 | 11 | 3 12 | 00:00:10,537 --> 00:00:15,033 13 | three arrows pointing to D, 14 | two to the three is eight. 15 | 16 | 4 17 | 00:00:15,033 --> 00:00:16,799 18 | So you get 1, 2, 3, 8, 2, 2, 19 | 4; if you add those up, it's 19. 20 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/30 - General Bayes Net 3 Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,225 --> 00:00:03,125 3 | To answer this question, 4 | let us add up these numbers. 5 | 6 | 2 7 | 00:00:03,125 --> 00:00:08,109 8 | [INAUDIBLE] is 1, 1, 1, 9 | this is one incoming arc, so 10 | 11 | 3 12 | 00:00:08,109 --> 00:00:11,600 13 | it's 2, two incoming arcs make 4, 14 | 15 | 4 16 | 00:00:11,600 --> 00:00:16,129 17 | one incoming arc is 2, 2 equals 4. 18 | 19 | 5 20 | 00:00:16,129 --> 00:00:23,954 21 | Four incoming arcs make 16, 22 | we add all the red numbers, we get 47. 23 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/31 - Value Of A Network - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,510 --> 00:00:04,570 3 | So it takes 47 numerical 4 | probabilities to specify 5 | 6 | 2 7 | 00:00:05,570 --> 00:00:10,840 8 | the joint compared to 65,000 if 9 | we didn't have the structure. 10 | 11 | 3 12 | 00:00:10,840 --> 00:00:15,440 13 | I think this example really illustrates 14 | the advantage of compact Bayes network 15 | 16 | 4 17 | 00:00:15,440 --> 00:00:16,750 18 | representations 19 | 20 | 5 21 | 00:00:16,750 --> 00:00:18,720 22 | over unstructured joint representations. 23 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/27 - General Bayes Net 2 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,430 --> 00:00:01,740 3 | Here's another quiz. 4 | 5 | 2 6 | 00:00:01,740 --> 00:00:06,810 7 | How many parameters do we need to 8 | specify the joint distribution for 9 | 10 | 3 11 | 00:00:06,810 --> 00:00:11,510 12 | this Bayes net over here, where A, B, and 13 | C point into D. 14 | 15 | 4 16 | 00:00:11,510 --> 00:00:12,850 17 | D points into E and F. 18 | 19 | 5 20 | 00:00:12,850 --> 00:00:15,320 21 | And F and C also point into G? 22 | 23 | 6 24 | 00:00:15,320 --> 00:00:17,290 25 | Please write your answer into this box. 26 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/5 - Bayes Network Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,090 --> 00:00:02,240 3 | The answer is 3. 4 | 5 | 2 6 | 00:00:02,240 --> 00:00:08,666 7 | It takes one parameter to specify 8 | P(A) from which we derive P(-A). 9 | 10 | 3 11 | 00:00:08,666 --> 00:00:14,946 12 | It takes two parameters to 13 | specify P(B | A) and P(B | -A).
14 | 15 | 4 16 | 00:00:14,946 --> 00:00:21,396 17 | From which we can derive P(-B | A) and 18 | P(-B | -A). 19 | 20 | 5 21 | 00:00:21,396 --> 00:00:24,214 22 | So it's a total of 3 parameters for 23 | this Bayes' Network. 24 | 25 | 6 26 | 00:00:24,214 --> 00:00:29,849 27 | [BLANK_AUDIO] 28 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/29 - General Bayes Net 3 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,610 --> 00:00:04,590 3 | And here's our car network which we 4 | discussed at the very beginning of this 5 | 6 | 2 7 | 00:00:04,590 --> 00:00:05,230 8 | unit. 9 | 10 | 3 11 | 00:00:05,230 --> 00:00:09,776 12 | How many parameters do we 13 | need to specify this network? 14 | 15 | 4 16 | 00:00:09,776 --> 00:00:14,508 17 | Remember, there are 16 18 | total variables and 19 | 20 | 5 21 | 00:00:14,508 --> 00:00:22,891 22 | the naive joint over those 16 will 23 | be 2 to the 16 minus 1, which is 65,535. 24 | 25 | 6 26 | 00:00:22,891 --> 00:00:23,880 27 | Please write your answer 28 | into this box over here. 29 | -------------------------------------------------------------------------------- /MathNotes/Probability.rst: -------------------------------------------------------------------------------- 1 | ############### 2 | Probability 3 | ############### 4 | 5 | Bayes rule: 6 | 7 | - P(A|B) = (P(B|A)·P(A)) / P(B) 8 | 9 | - P(A|B): Posterior 10 | - P(B|A): Likelihood 11 | - P(A): Prior 12 | - P(B): Marginal Likelihood 13 | 14 | - P(B): :math:`{\sum_a{P(B|A=a) \cdot P(A=a)}}` 15 | Worked example, from the Bayes Nets challenge question (g = graduated, o1 and o2 = offers from companies 1 and 2): 16 | 0.50 = P(g|o1)·P(o1) /0.9 17 | 0.45 = P(g|o1)·P(o1) 18 | 19 | 0.05 = P(~g|o1)·P(o1) / 0.1 20 | 0.005 = P(~g|o1)·P(o1) 21 | 22 | 0.455 = P(o1)(P(~g|o1) + P(g|o1)) 23 | 0.455 = P(o1) 24 | 25 | 0.15 = P(g|o2)·P(o2) /0.9 26 | 0.6 = P(g|o2)·P(o2) 27 | 28 | 0.25 = (P(~g|o2)·P(o2)) / 0.1 29 | 0.025 = P(~g|o2)·P(o2) 30 | 31 | 0.625 = P(o2)(P(~g|o2) + P(g|o2)) 32 | 0.625 = P(o2) 33 | 34 | 35 | P(o2|o1) = (P(o1|o2)·P(o2)) / P(o1) 36 | 37 | P(o1) = P(o1|o2)·P(o2) 38 | 39 | 0.455 = P(o1|o2)·0.625 40 | 0.728 = P(o1|o2) 41 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/32 - D Separation - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,570 --> 00:00:04,360 3 | The next concept I'd like to 4 | teach you is called D-Separation. 5 | 6 | 2 7 | 00:00:04,360 --> 00:00:08,680 8 | And let me start the discussion 9 | of this concept by a quiz. 10 | 11 | 3 12 | 00:00:08,680 --> 00:00:10,410 13 | We have here a Bayes network and 14 | 15 | 4 16 | 00:00:10,410 --> 00:00:14,160 17 | then we will ask conditional 18 | independence questions. 19 | 20 | 5 21 | 00:00:14,160 --> 00:00:19,302 22 | Is C independent of A, 23 | please tell me yes or no. 24 | 25 | 6 26 | 00:00:19,302 --> 00:00:24,478 27 | Is C independent of A given B, 28 | is C independent of D, 29 | 30 | 7 31 | 00:00:24,478 --> 00:00:30,736 32 | is C independent of D given A, and 33 | is E independent of C given D.
34 | 35 | 8 36 | 00:00:30,736 --> 00:00:32,589 37 | [BLANK_AUDIO] 38 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/13 - Conditional Independence2 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,810 --> 00:00:05,100 3 | So let me draw the cancer 4 | example again with two tests. 5 | 6 | 2 7 | 00:00:05,100 --> 00:00:07,797 8 | Here's my cancer variable, and 9 | 10 | 3 11 | 00:00:07,797 --> 00:00:12,530 12 | there's two conditionally independent tests, 13 | T1 and T2. 14 | 15 | 4 16 | 00:00:12,530 --> 00:00:17,020 17 | And as before let me assume that 18 | the probability of cancer is 0.01. 19 | 20 | 5 21 | 00:00:17,020 --> 00:00:22,360 22 | What I want you to compute for 23 | me is the probability 24 | 25 | 6 26 | 00:00:22,360 --> 00:00:30,010 27 | of the second test to be positive if we 28 | know that the first test was positive. 29 | 30 | 7 31 | 00:00:30,010 --> 00:00:33,321 32 | So write this into the following box. 33 | 34 | 8 35 | 00:00:33,321 --> 00:00:36,019 36 | [BLANK_AUDIO] 37 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/21 - 46 - Explaining Away 2 V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,630 --> 00:00:05,419 3 | Now, to understand the explain away 4 | effect, you have to compare this to 5 | 6 | 2 7 | 00:00:05,419 --> 00:00:09,266 8 | the probability of a raise 9 | given that we're just happy and 10 | 11 | 3 12 | 00:00:09,266 --> 00:00:12,188 13 | we don't know anything 14 | about the weather. 15 | 16 | 4 17 | 00:00:12,188 --> 00:00:14,077 18 | So let's do that exercise next. 19 | 20 | 5 21 | 00:00:14,077 --> 00:00:18,915 22 | So my next quiz is what's 23 | the probability of a raise given that 24 | 25 | 6 26 | 00:00:18,915 --> 00:00:23,953 27 | all I know is that I'm happy and 28 | I don't know about the weather? 29 | 30 | 7 31 | 00:00:23,953 --> 00:00:28,282 32 | Now this happens to be, once again, 33 | a pretty complicated question. 34 | 35 | 8 36 | 00:00:28,282 --> 00:00:28,980 37 | So take your time. 38 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/15 - Absolute And Conditional - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,236 --> 00:00:04,557 3 | So now we've learned about independence 4 | and the corresponding Bayes network has 5 | 6 | 2 7 | 00:00:04,557 --> 00:00:07,130 8 | two nodes, 9 | they're just not connected at all. 10 | 11 | 3 12 | 00:00:07,130 --> 00:00:09,379 13 | And we learned about 14 | conditional independence, 15 | 16 | 4 17 | 00:00:09,379 --> 00:00:12,008 18 | in which case we have a Bayes 19 | network that looks like this. 20 | 21 | 5 22 | 00:00:12,008 --> 00:00:16,896 23 | Now I would like to know whether 24 | absolute independence implies 25 | 26 | 6 27 | 00:00:16,896 --> 00:00:20,410 28 | conditional independence, true or false? 29 | 30 | 7 31 | 00:00:20,410 --> 00:00:24,012 32 | And I'd also like to know whether 33 | conditional independence implies 34 | 35 | 8 36 | 00:00:24,012 --> 00:00:25,386 37 | absolute independence. 38 | 39 | 9 40 | 00:00:25,386 --> 00:00:27,190 41 | Again, true or false?
42 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 DKE 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/2 - Challenge Question - lang_en_vs52.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,320 --> 00:00:03,020 3 | Here is a challenge question to give 4 | a taste of how we are going to 5 | 6 | 2 7 | 00:00:03,020 --> 00:00:07,460 8 | construct and use Bayesian networks to 9 | make inferences about new situations. 10 | 11 | 3 12 | 00:00:07,460 --> 00:00:10,720 13 | The small net given here models part 14 | of the situation where a student 15 | 16 | 4 17 | 00:00:10,720 --> 00:00:14,220 18 | was about to graduate and 19 | applied for jobs at two companies. 20 | 21 | 5 22 | 00:00:14,220 --> 00:00:16,690 23 | G represents that the student, 24 | indeed, graduated. 25 | 26 | 6 27 | 00:00:16,690 --> 00:00:20,550 28 | O1 and O2 represent that the student 29 | received a job offer from company 1 30 | 31 | 7 32 | 00:00:20,550 --> 00:00:22,430 33 | and company 2, respectively. 34 | 35 | 8 36 | 00:00:22,430 --> 00:00:25,385 37 | Calculate the probability that the 38 | student received an offer from company 39 | 40 | 9 41 | 00:00:25,385 --> 00:00:28,505 42 | 2, given that we know that she 43 | received an offer from company 1. 44 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/12 - Conditional Independence - lang_en_vs50.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,000 --> 00:00:03,241 3 | And the correct answer is no. 4 | 5 | 2 6 | 00:00:03,241 --> 00:00:07,899 7 | Intuitively, getting a positive 8 | test result for cancer 9 | 10 | 3 11 | 00:00:07,899 --> 00:00:12,479 12 | gives us information about 13 | whether you have cancer or not. 14 | 15 | 4 16 | 00:00:12,479 --> 00:00:14,780 17 | So if you get a positive test result, 18 | 19 | 5 20 | 00:00:14,780 --> 00:00:18,277 21 | you're going to raise 22 | the probability of having cancer 23 | 24 | 6 25 | 00:00:18,277 --> 00:00:19,811 26 | relative to the prior probability.
27 | 28 | 7 29 | 00:00:19,811 --> 00:00:22,694 30 | With that increased probability, 31 | 32 | 8 33 | 00:00:22,694 --> 00:00:28,170 34 | we will predict that another test will, 35 | with a higher likelihood, 36 | 37 | 9 38 | 00:00:28,170 --> 00:00:33,951 39 | give us a positive response than if 40 | we hadn't taken the previous test. 41 | 42 | 10 43 | 00:00:33,951 --> 00:00:35,401 44 | This is really important to understand. 45 | 46 | 11 47 | 00:00:35,401 --> 00:00:41,410 48 | So to make sure we understand it, let me make 49 | you calculate those probabilities. 50 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/18 - Confounding Cause Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,610 --> 00:00:02,885 3 | The answer is surprisingly simple. 4 | 5 | 2 6 | 00:00:02,885 --> 00:00:08,240 7 | It is 0.01, but 8 | how do you know this so fast? 9 | 10 | 3 11 | 00:00:08,240 --> 00:00:12,359 12 | Well, if we look at this 13 | Bayes network, both the sunniness and 14 | 15 | 4 16 | 00:00:12,359 --> 00:00:17,080 17 | the question whether I got 18 | a raise impact my happiness. 19 | 20 | 5 21 | 00:00:17,080 --> 00:00:22,240 22 | But since I don't know anything about 23 | the happiness, there's no way that 24 | 25 | 6 26 | 00:00:22,240 --> 00:00:28,260 27 | just the weather might impact 28 | whether I get a raise or not. 29 | 30 | 7 31 | 00:00:28,260 --> 00:00:32,549 32 | In fact, 33 | it might independently be sunny and 34 | 35 | 8 36 | 00:00:32,549 --> 00:00:35,230 37 | I might independently get a raise at work. 38 | 39 | 9 40 | 00:00:35,230 --> 00:00:41,350 41 | There's no mechanism by which 42 | these two things would co-occur. 43 | 44 | 10 45 | 00:00:41,350 --> 00:00:45,240 46 | Therefore the probability of 47 | a raise given it is sunny, 48 | 49 | 11 50 | 00:00:45,240 --> 00:00:50,499 51 | is just the same as the probability of 52 | a raise given any weather, which is 0.01. 53 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/34 - 60 - D Separation 2 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:01,650 --> 00:00:04,205 3 | In this specific example, 4 | 5 | 2 6 | 00:00:04,205 --> 00:00:06,900 7 | the rule applies very, very simply. 8 | 9 | 3 10 | 00:00:06,900 --> 00:00:08,980 11 | Any two variables are independent, 12 | 13 | 4 14 | 00:00:08,980 --> 00:00:12,050 15 | if they're not linked by just unknown variables. 16 | 17 | 5 18 | 00:00:12,050 --> 00:00:14,540 19 | So for example, given knowledge of B, 20 | 21 | 6 22 | 00:00:14,540 --> 00:00:16,820 23 | everything downstream of B 24 | 25 | 7 26 | 00:00:16,820 --> 00:00:22,860 27 | becomes independent of anything upstream of B. E is now independent of C, 28 | 29 | 8 30 | 00:00:22,860 --> 00:00:28,761 31 | conditioned on B; however, knowledge of B does not render A and E independent. 32 | 33 | 9 34 | 00:00:28,761 --> 00:00:30,405 35 | In this graph over here, 36 | 37 | 10 38 | 00:00:30,405 --> 00:00:32,330 39 | A and B connect to C, 40 | 41 | 11 42 | 00:00:32,330 --> 00:00:37,100 43 | and C connects to D and to E. So let me ask you, 44 | 45 | 12 46 | 00:00:37,100 --> 00:00:41,395 47 | is A independent of E? 48 | 49 | 13 50 | 00:00:41,395 --> 00:00:43,435 51 | A independent of E given B? A independent of E given C? 52 | 53 | 14 54 | 00:00:43,435 --> 00:00:44,810 55 | A independent of B?
56 | 57 | 15 58 | 00:00:44,810 --> 00:00:48,000 59 | And A independent of B given C? 60 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/33 - D Separation Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,440 --> 00:00:03,719 3 | So C is not independent of A. 4 | 5 | 2 6 | 00:00:03,719 --> 00:00:09,063 7 | In fact, A influences C by virtue of B. 8 | 9 | 3 10 | 00:00:09,063 --> 00:00:13,388 11 | But if you know B, 12 | then A becomes independent of C, 13 | 14 | 4 15 | 00:00:13,388 --> 00:00:17,022 16 | which means the only 17 | determinant of C is B. 18 | 19 | 5 20 | 00:00:17,022 --> 00:00:17,919 21 | If you know B for 22 | 23 | 6 24 | 00:00:17,919 --> 00:00:21,857 25 | sure, then knowledge of A won't 26 | really tell you anything about C. 27 | 28 | 7 29 | 00:00:21,857 --> 00:00:27,606 30 | C is also not independent of D, just 31 | the same way C is not independent of A. 32 | 33 | 8 34 | 00:00:27,606 --> 00:00:31,410 35 | If I learn something about D, 36 | I can infer more about C. 37 | 38 | 9 39 | 00:00:31,410 --> 00:00:36,980 40 | But if I do know A, then it's hard to 41 | imagine how knowledge of D would help 42 | 43 | 10 44 | 00:00:36,980 --> 00:00:42,030 45 | me with C because I can't learn anything 46 | more about A than knowing A already. 47 | 48 | 11 49 | 00:00:42,030 --> 00:00:44,950 50 | Therefore, given A, 51 | C and D are independent. 52 | 53 | 12 54 | 00:00:44,950 --> 00:00:48,470 55 | The same is true for E and C. 56 | 57 | 13 58 | 00:00:48,470 --> 00:00:51,670 59 | If you know D then E and 60 | C become independent. 61 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/16 - Absolute And Conditional Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,320 --> 00:00:02,887 3 | And the answer is, 4 | both of them are false. 5 | 6 | 2 7 | 00:00:02,887 --> 00:00:07,090 8 | We already saw that conditional 9 | independence, as shown over here, 10 | 11 | 3 12 | 00:00:07,090 --> 00:00:09,703 13 | doesn't give us absolute independence. 14 | 15 | 4 16 | 00:00:09,703 --> 00:00:12,749 17 | So, for example, this is test 18 | number one and test number two, 19 | 20 | 5 21 | 00:00:12,749 --> 00:00:14,482 22 | you might or might not have cancer. 23 | 24 | 6 25 | 00:00:14,482 --> 00:00:18,226 26 | Our first test gives us information 27 | about whether you have cancer or not. 28 | 29 | 7 30 | 00:00:18,226 --> 00:00:21,809 31 | And as a result, 32 | we've changed our prior probability for 33 | 34 | 8 35 | 00:00:21,809 --> 00:00:24,410 36 | the second test to come in positive. 37 | 38 | 9 39 | 00:00:24,410 --> 00:00:30,240 40 | That means conditional independence 41 | does not imply absolute independence, 42 | 43 | 10 44 | 00:00:30,240 --> 00:00:32,759 45 | which renders this assumption, 46 | here, false. 47 | 48 | 11 49 | 00:00:32,759 --> 00:00:37,330 50 | And it also turns out that if 51 | you have absolute independence, 52 | 53 | 12 54 | 00:00:37,330 --> 00:00:40,040 55 | things might not be 56 | conditionally independent for 57 | 58 | 13 59 | 00:00:40,040 --> 00:00:45,060 60 | reasons that I can't quite explain so 61 | far but that we will learn about next.
62 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/10 - 35 - Two Test Cancer 2 Solution V4 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,550 --> 00:00:05,630 3 | We try the same trick as before, 4 | where we use the exact same prior. 5 | 6 | 2 7 | 00:00:05,630 --> 00:00:10,658 8 | Our first plus gives 9 | the following factors, 0.9, 0.2. 10 | 11 | 3 12 | 00:00:10,658 --> 00:00:14,558 13 | But our minus gives us 14 | the probability 0.1 for 15 | 16 | 4 17 | 00:00:14,558 --> 00:00:18,179 18 | a negative test result 19 | in the event of cancer. 20 | 21 | 5 22 | 00:00:18,179 --> 00:00:25,170 23 | And a 0.8 for a negative 24 | result in the event of not having cancer. 25 | 26 | 6 27 | 00:00:25,170 --> 00:00:28,760 28 | You multiply those together, 29 | you get a non-normalized probability. 30 | 31 | 7 32 | 00:00:28,760 --> 00:00:33,573 33 | And if you now normalize by 34 | the sum of those two things to 35 | 36 | 8 37 | 00:00:33,573 --> 00:00:36,920 38 | turn the statement into a probability, 39 | 40 | 9 41 | 00:00:36,920 --> 00:00:42,578 42 | you get 0.0009 over the sum of 43 | those two things over here. 44 | 45 | 10 46 | 00:00:42,578 --> 00:00:47,498 47 | And this is 0.0056 for the chance 48 | 49 | 11 50 | 00:00:47,498 --> 00:00:52,568 51 | of having cancer, and 0.9943 for 52 | 53 | 12 54 | 00:00:52,568 --> 00:00:57,650 55 | the chance of being cancer-free. 56 | 57 | 13 58 | 00:00:57,650 --> 00:00:59,720 59 | And this adds up approximately to 1, 60 | 61 | 14 62 | 00:00:59,720 --> 00:01:01,620 63 | and therefore it's 64 | a probability distribution. 65 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/1 - Introduction - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,270 --> 00:00:02,940 3 | Sebastian, why are Bayes Nets important? 4 | 5 | 2 6 | 00:00:02,940 --> 00:00:05,781 7 | >> They're one of the most amazing 8 | accomplishments of recent artificial 9 | 10 | 3 11 | 00:00:05,781 --> 00:00:06,910 12 | intelligence. 13 | 14 | 4 15 | 00:00:06,910 --> 00:00:10,510 16 | They take the idea of uncertainty and 17 | probability and 18 | 19 | 5 20 | 00:00:10,510 --> 00:00:12,900 21 | marry it with efficient structures. 22 | 23 | 6 24 | 00:00:12,900 --> 00:00:16,155 25 | So you don't have a big mumble jumble 26 | but you can really understand, 27 | 28 | 7 29 | 00:00:16,155 --> 00:00:20,014 30 | like what uncertain variable 31 | influences another uncertain variable. 32 | 33 | 8 34 | 00:00:20,014 --> 00:00:25,580 35 | A theory of the world using just 36 | Bayes Networks is really amazing. 37 | 38 | 9 39 | 00:00:25,580 --> 00:00:28,870 40 | >> Well, I've been teaching AI for 41 | many years, I've found that listening to 42 | 43 | 10 44 | 00:00:28,870 --> 00:00:32,590 45 | Sebastian's lessons has really given 46 | me a better intuition for Bayes Nets. 47 | 48 | 11 49 | 00:00:33,690 --> 00:00:37,130 50 | As you go through these lessons, 51 | make sure to understand how to construct 52 | 53 | 12 54 | 00:00:37,130 --> 00:00:39,080 55 | a Bayes Net and 56 | how to do inference with them. 57 | 58 | 13 59 | 00:00:39,080 --> 00:00:43,810 60 | One of my favorite algorithms 61 | is Monte Carlo Markov Chain. 62 | 63 | 14 64 | 00:00:43,810 --> 00:00:44,480 65 | Have fun with the lesson.
66 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/7 - 32 - Two Test Cancer V6 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,290 --> 00:00:03,460 3 | The reason why I gave you all this 4 | is because I wanted to apply it now 5 | 6 | 2 7 | 00:00:03,460 --> 00:00:08,530 8 | to a slightly more complicated problem, 9 | which is the two test cancer example. 10 | 11 | 3 12 | 00:00:08,530 --> 00:00:14,310 13 | In this example, we, again, 14 | might have our unobservable cancer c, 15 | 16 | 4 17 | 00:00:14,310 --> 00:00:18,540 18 | but now we're running two tests, 19 | test one and test two. 20 | 21 | 5 22 | 00:00:18,540 --> 00:00:24,074 23 | As before, 24 | the prior probability of cancer is 0.01. 25 | 26 | 6 27 | 00:00:24,074 --> 00:00:29,893 28 | The probability of receiving a positive 29 | test result for either test is 0.9. 30 | 31 | 7 32 | 00:00:29,893 --> 00:00:36,388 33 | Probability of getting a negative 34 | result, given being cancer-free, is 0.8. 35 | 36 | 8 37 | 00:00:36,388 --> 00:00:40,863 38 | And from those we were able to 39 | compute all the other probabilities. 40 | 41 | 9 42 | 00:00:40,863 --> 00:00:42,788 43 | I'm just going to write 44 | them down over here. 45 | 46 | 10 47 | 00:00:42,788 --> 00:00:45,425 48 | So take a moment to just write my notes. 49 | 50 | 11 51 | 00:00:45,425 --> 00:00:49,407 52 | Now let's assume both of my 53 | tests come back positive. 54 | 55 | 12 56 | 00:00:49,407 --> 00:00:55,263 57 | So T1 = +, and T2 = +. 58 | 59 | 13 60 | 00:00:55,263 --> 00:01:02,630 61 | What's the probability of cancer now, 62 | written in short form P(C|++)? 63 | 64 | 14 65 | 00:01:02,630 --> 00:01:06,080 66 | I want you to tell me what that is. 67 | 68 | 15 69 | 00:01:06,080 --> 00:01:07,620 70 | And this is a non-trivial question. 71 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/20 - 45 - Explaining Away Solution V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,730 --> 00:00:06,488 3 | The answer is approximately 0.0142. 4 | 5 | 2 6 | 00:00:06,488 --> 00:00:10,266 7 | And as an exercise in expanding 8 | this term using Bayes' rule, 9 | 10 | 3 11 | 00:00:10,266 --> 00:00:14,120 12 | using total probability, 13 | which I will just do for you. 14 | 15 | 4 16 | 00:00:14,120 --> 00:00:20,945 17 | Using Bayes' rule, 18 | you can transform this into P of H, 19 | 20 | 5 21 | 00:00:20,945 --> 00:00:26,764 22 | given R, S times P of R 23 | given S over P of H given S. 24 | 25 | 6 26 | 00:00:26,764 --> 00:00:30,488 27 | You observe the 28 | independence of R and S, 29 | 30 | 7 31 | 00:00:30,488 --> 00:00:33,540 32 | to simplify this to just P of R. 33 | 34 | 8 35 | 00:00:33,540 --> 00:00:38,867 36 | And the denominator is expanded 37 | by folding in R and not R. 38 | 39 | 9 40 | 00:00:38,867 --> 00:00:44,822 41 | You have P(H | R,S) times 42 | P(R) plus P(H | -R,S) 43 | 44 | 10 45 | 00:00:44,822 --> 00:00:49,870 46 | times P(-R), which is total probability.
47 | 48 | 11 49 | 00:00:49,870 --> 00:00:54,757 50 | You can now read off the numbers 51 | from the tables over here, 52 | 53 | 12 54 | 00:00:54,757 --> 00:00:59,946 55 | which gives us 1 times 0.01 56 | divided by this expression, which is 57 | 58 | 13 59 | 00:00:59,946 --> 00:01:06,058 60 | the same as the expression over here, 61 | 0.01 plus this thing over here. 62 | 63 | 14 64 | 00:01:06,058 --> 00:01:10,522 65 | Which we can find over here 66 | to be 0.7 times this one over 67 | 68 | 15 69 | 00:01:10,522 --> 00:01:15,097 70 | here, which is one minus 71 | the value over here, 0.99. 72 | 73 | 16 74 | 00:01:15,097 --> 00:01:19,752 75 | Which gives us approximately 0.0142. 76 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/24 - 49 - Explaining Away 3 Solution V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,610 --> 00:00:05,080 3 | Now the answer follows the exact 4 | same scheme as before, 5 | 6 | 2 7 | 00:00:05,080 --> 00:00:07,800 8 | with S being replaced by not S. 9 | 10 | 3 11 | 00:00:07,800 --> 00:00:12,910 12 | So this should be an easier question for 13 | you to answer, P of R given H, 14 | 15 | 4 16 | 00:00:12,910 --> 00:00:17,830 17 | and not S, can be inverted by 18 | Bayes' rule to be as follows. 19 | 20 | 5 21 | 00:00:17,830 --> 00:00:20,290 22 | Once we apply Bayes' rule, 23 | as indicated over here, 24 | 25 | 6 26 | 00:00:20,290 --> 00:00:23,700 27 | where we sort H to the left side and 28 | R to the right side, 29 | 30 | 7 31 | 00:00:23,700 --> 00:00:29,670 32 | you can observe that this value over 33 | here can be readily found in the table. 34 | 35 | 8 36 | 00:00:29,670 --> 00:00:32,320 37 | It's actually the 0.9 over there. 38 | 39 | 9 40 | 00:00:32,320 --> 00:00:36,631 41 | This value over here, 42 | the raise, is independent in the 43 | 44 | 10 45 | 00:00:37,825 --> 00:00:40,380 46 | Bayes network, so it's just 0.01. 47 | 48 | 11 49 | 00:00:40,380 --> 00:00:46,870 50 | And as before, we apply total 51 | probability, the expression over here. 52 | 53 | 12 54 | 00:00:46,870 --> 00:00:49,729 55 | And we obtain this 56 | quotient over here, 57 | 58 | 13 59 | 00:00:49,729 --> 00:00:52,598 60 | that these two expressions are the same. 61 | 62 | 14 63 | 00:00:52,598 --> 00:00:54,680 64 | P of H given not S, not R, 65 | 66 | 15 67 | 00:00:54,680 --> 00:00:57,013 68 | is the value over here. 69 | 70 | 16 71 | 00:00:57,013 --> 00:01:00,485 72 | And the 0.99 is the complement 73 | of the probability of R, 74 | 75 | 17 76 | 00:01:00,485 --> 00:01:06,143 77 | taken from over here and 78 | that ends up being 0.0833. 79 | 80 | 18 81 | 00:01:06,143 --> 00:01:08,785 82 | This will be the right answer. 83 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/23 - 48 - Explaining Away 3 V4 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,500 --> 00:00:02,630 3 | And if you got this right, 4 | 5 | 2 6 | 00:00:02,630 --> 00:00:06,470 7 | I will be deeply impressed by 8 | the fact you got this right. 9 | 10 | 3 11 | 00:00:06,470 --> 00:00:10,810 12 | My happiness is well explained 13 | by the fact that it's sunny. 14 | 15 | 4 16 | 00:00:10,810 --> 00:00:13,770 17 | So if someone observes you to 18 | be happy and asks the question, 19 | 20 | 5 21 | 00:00:13,770 --> 00:00:16,740 22 | is this because Sebastian 23 | got a raise at work?
24 | 25 | 6 26 | 00:00:16,740 --> 00:00:21,300 27 | Well, if you know it's sunny, then 28 | this is a fairly good explanation for 29 | me being happy. 30 | 31 | 8 32 | 00:00:23,230 --> 00:00:25,830 33 | You don't have to assume I got a raise. 34 | 35 | 9 36 | 00:00:25,830 --> 00:00:28,508 37 | If you don't know about the weather, 38 | 39 | 10 40 | 00:00:28,508 --> 00:00:33,791 41 | then obviously the chances are higher 42 | that the raise caused my happiness. 43 | 44 | 11 45 | 00:00:33,791 --> 00:00:39,274 46 | And therefore this number 47 | goes up from 0.014 to 0.018. 48 | 49 | 12 50 | 00:00:39,274 --> 00:00:44,276 51 | Let me ask you one final 52 | question in this next quiz which 53 | 54 | 13 55 | 00:00:44,276 --> 00:00:50,822 56 | is the probability of a raise given 57 | that I look happy and it's not sunny. 58 | 59 | 14 60 | 00:00:50,822 --> 00:00:55,398 61 | This is the most extreme case for 62 | making a raise likely because 63 | 64 | 15 65 | 00:00:55,398 --> 00:01:00,640 66 | I am a happy guy and it's definitely 67 | not caused by the weather. 68 | 69 | 16 70 | 00:01:00,640 --> 00:01:04,300 71 | So it could be just random or 72 | it could be caused by the raise. 73 | 74 | 17 75 | 00:01:04,300 --> 00:01:08,260 76 | So please calculate this number for 77 | me and enter it into this box. 78 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/25 - Conditional Dependence - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,370 --> 00:00:03,262 3 | It's really interesting to compare 4 | this to the situation over here. 5 | 6 | 2 7 | 00:00:03,262 --> 00:00:07,387 8 | In both cases, I'm happy, 9 | as shown over here. 10 | 11 | 3 12 | 00:00:07,387 --> 00:00:12,052 13 | And I asked the same question, 14 | which is whether I got a raise at work, 15 | 16 | 4 17 | 00:00:12,052 --> 00:00:13,270 18 | this R over here. 19 | 20 | 5 21 | 00:00:13,270 --> 00:00:16,638 22 | But in one case, 23 | I observe that the weather is sunny and 24 | 25 | 6 26 | 00:00:16,638 --> 00:00:18,230 27 | in the other it isn't. 28 | 29 | 7 30 | 00:00:18,230 --> 00:00:21,557 31 | And look what it does to my probability 32 | of having received a raise. 33 | 34 | 8 35 | 00:00:21,557 --> 00:00:26,055 36 | The sunniness perfectly 37 | well explains my happiness. 38 | 39 | 9 40 | 00:00:26,055 --> 00:00:31,127 41 | And my probability of 42 | having received a raise 43 | 44 | 10 45 | 00:00:31,127 --> 00:00:36,340 46 | ends up being a mere 1.4%, or 0.014. 47 | 48 | 11 49 | 00:00:36,340 --> 00:00:40,874 50 | However, if my wife observes it to 51 | be not sunny, then it is much more 52 | 53 | 12 54 | 00:00:40,874 --> 00:00:45,344 55 | likely that the cause of my happiness 56 | is related to a raise at work. 57 | 58 | 13 59 | 00:00:45,344 --> 00:00:47,988 60 | And now the probability is 8.3%, 61 | 62 | 14 63 | 00:00:47,988 --> 00:00:52,043 64 | which is significantly 65 | higher than the 1.4% before. 66 | 67 | 15 68 | 00:00:52,043 --> 00:00:56,268 69 | Now, this is a Bayes' 70 | network in which S and 71 | 72 | 16 73 | 00:00:56,268 --> 00:01:01,993 74 | R are independent, but 75 | H adds a dependence between S and R. 76 | 77 | 17 78 | 00:01:01,993 --> 00:01:06,520 79 | Let me talk about this in a little 80 | bit more detail on the next paper.
81 | 82 | 18 83 | 00:01:06,520 --> 00:01:06,980 84 | So here's a 85 | -------------------------------------------------------------------------------- /MathNotes/trigonemetry.rst: -------------------------------------------------------------------------------- 1 | ############### 2 | Trigonometry 3 | ############### 4 | 5 | Notes from: 6 | 7 | Loney, Sidney Luxton. Plane Trigonometry. Cambridge, 1893. 8 | Klaf, Albert. Trigonometry Refresher. Reprint. New York, 2005. 9 | 10 | Chapter 1: Measurement of angles 11 | ================================== 12 | 13 | Angles are measured in terms of the right angle: 14 | 15 | - **Degree** is a unit of the right angle, which is divided into 90 equal parts. 16 | 17 | - **Minute** is a unit of the degree, which is divided into 60 equal parts. 18 | 19 | - **Second** is a unit of the minute, which is divided into 60 equal parts. 20 | 21 | So :math:`46°23'39''` reads as 46 degrees 23 minutes 39 seconds. 22 | 23 | This system is called the **Sexagesimal** system of measurement. This system is 24 | widely used in practical applications of trigonometry. 25 | 26 | There is also another, **Centesimal**, system which divides the right angle into 100 27 | parts, each called a **grade**; each grade is divided into 100 parts called 28 | minutes, and each minute is divided into 100 parts called seconds. 29 | 30 | There is also a third system, **Circular** measurement. This system is used in 31 | higher branches of mathematics. 32 | 33 | It is preferred in higher branches of mathematics because there is no 34 | arbitrary number such as 90 or 360 in its definition. 35 | 36 | Let's take any circle centered on point O, and let's pick two points A and P on the 37 | perimeter of this circle such that the arc AP is equal in length to the radius. The 38 | angle AOP is then called a radian. How do we obtain and ensure the constancy of this angle? 39 | 40 | Let's suppose that there are two circles with a common center O. 41 | 42 | Let's suppose n lines going from O to the outer circle. n lines intersect the 43 | outer circle at points *A, B, C, D, E, F, ...*. These lines intersect the inner 44 | circle at points *a, b, c, d, e, f, ...*. 45 | 46 | Let's join point *a* to *b*, *b* to *c* etc, as well as *A* to *B*, etc. 47 | 48 | Since the angle BOA is the same for sides *ab* and *AB*, and :math:`aO = bO` 49 | and :math:`AO = BO`, the triangles *aOb* and *AOB* are similar, so :math:`ab/AB = aO/AO`. 50 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/36 - 63 - D Separation 3 Solution - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:01,740 --> 00:00:05,200 3 | And the answer is yes. 4 | 5 | 2 6 | 00:00:05,200 --> 00:00:09,130 7 | F is independent of A. Well, defined by our rules of d-separation is 8 | 9 | 3 10 | 00:00:09,130 --> 00:00:13,405 11 | that F is dependent on D and A is dependent on D. But, 12 | 13 | 4 14 | 00:00:13,405 --> 00:00:18,285 15 | if you don't know D, you can't derive any dependency between A and F at all. 16 | 17 | 5 18 | 00:00:18,285 --> 00:00:20,485 19 | Now, if you do know D, 20 | 21 | 6 22 | 00:00:20,485 --> 00:00:22,360 23 | then F and A become dependent. 24 | 25 | 7 26 | 00:00:22,360 --> 00:00:27,580 27 | And the reason is B and E are dependent given D. And we can 28 | 29 | 8 30 | 00:00:27,580 --> 00:00:29,690 31 | transform this back into the dependency of A 32 | 33 | 9 34 | 00:00:29,690 --> 00:00:33,745 35 | and F because B and A are dependent, and E and F are dependent.
36 | 37 | 10 38 | 00:00:33,745 --> 00:00:37,690 39 | There's an active path between A and F, 40 | 41 | 11 42 | 00:00:37,690 --> 00:00:39,535 43 | which goes across here, and here, 44 | 45 | 12 46 | 00:00:39,535 --> 00:00:41,640 47 | because D is known. 48 | 49 | 13 50 | 00:00:41,640 --> 00:00:44,735 51 | Now if you know G, the same thing is true because G gives us knowledge about 52 | 53 | 14 54 | 00:00:44,735 --> 00:00:48,010 55 | D and D can be applied back to this path over here. 56 | 57 | 15 58 | 00:00:48,010 --> 00:00:49,090 59 | However, if you know H, 60 | 61 | 16 62 | 00:00:49,090 --> 00:00:50,225 63 | that's not the case. 64 | 65 | 17 66 | 00:00:50,225 --> 00:00:52,310 67 | So, H might tell us something about G, 68 | 69 | 18 70 | 00:00:52,310 --> 00:00:55,585 71 | but it doesn't tell us anything about D, and therefore, 72 | 73 | 19 74 | 00:00:55,585 --> 00:00:58,450 75 | we have no reason to close the path 76 | 77 | 20 78 | 00:00:58,450 --> 00:01:02,643 79 | between A and F. The path between A and F is still passive, 80 | 81 | 21 82 | 00:01:02,643 --> 00:01:05,070 83 | even though we have knowledge of H. 84 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/35 - D Separation 2 Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,410 --> 00:00:02,560 3 | And the answer to this one 4 | is really interesting. 5 | 6 | 2 7 | 00:00:02,560 --> 00:00:06,572 8 | A is clearly not independent of 9 | E because through C we can see 10 | 11 | 3 12 | 00:00:06,572 --> 00:00:08,105 13 | an influence of A to E. 14 | 15 | 4 16 | 00:00:08,105 --> 00:00:10,690 17 | Given B that doesn't change. 18 | 19 | 5 20 | 00:00:10,690 --> 00:00:14,710 21 | A still influences C 22 | despite the fact we know B. 23 | 24 | 6 25 | 00:00:14,710 --> 00:00:17,260 26 | However if we know C, 27 | the influence is cut off. 28 | 29 | 7 30 | 00:00:17,260 --> 00:00:21,500 31 | There is no way A can 32 | influence E if we know C. 33 | 34 | 8 35 | 00:00:21,500 --> 00:00:23,544 36 | A is clearly independent of B. 37 | 38 | 9 39 | 00:00:23,544 --> 00:00:27,830 40 | They're different entry variables, 41 | they have no incoming arcs. 42 | 43 | 10 44 | 00:00:27,830 --> 00:00:32,750 45 | >> But here is the caveat, 46 | given C, A and B become dependent. 47 | 48 | 11 49 | 00:00:32,750 --> 00:00:36,031 50 | >> So whereas initially A and 51 | B were independent, 52 | 53 | 12 54 | 00:00:36,031 --> 00:00:38,579 55 | if you are given C, they become dependent. 56 | 57 | 13 58 | 00:00:38,579 --> 00:00:43,040 59 | And the reason why they become 60 | dependent we've studied before is 61 | 62 | 14 63 | 00:00:43,040 --> 00:00:45,340 64 | the explaining-away effect. 65 | 66 | 15 67 | 00:00:45,340 --> 00:00:48,170 68 | If we know for example C to be true, 69 | 70 | 16 71 | 00:00:48,170 --> 00:00:53,550 72 | then knowledge of A will substantially 73 | affect what we believe about B. 74 | 75 | 17 76 | 00:00:53,550 --> 00:00:56,870 77 | If there are two joint causes for C and 78 | 79 | 18 80 | 00:00:56,870 --> 00:01:00,550 81 | we happen to know A is true, 82 | we will discredit cause B. 83 | 84 | 19 85 | 00:01:00,550 --> 00:01:05,324 86 | If we happen to know A is false we'll 87 | increase our belief for the cause B. 88 | 89 | 20 90 | 00:01:05,324 --> 00:01:09,366 91 | This was an effect we studied 92 | extensively in the happiness example I 93 | 94 | 21 95 | 00:01:09,366 --> 00:01:10,780 96 | gave you before.
97 | 98 | 22 99 | 00:01:10,780 --> 00:01:15,710 100 | The interesting thing here is you're 101 | facing a situation where knowledge of 102 | 103 | 23 104 | 00:01:15,710 --> 00:01:21,780 105 | a variable C renders previously 106 | independent variables dependent. 107 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/19 - 44 - Explaining Away V5 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,380 --> 00:00:03,762 3 | And let me talk about 4 | a really interesting and 5 | 6 | 2 7 | 00:00:03,762 --> 00:00:09,374 8 | special instance of Bayes Net Reasoning 9 | which is called explaining away. 10 | 11 | 3 12 | 00:00:09,374 --> 00:00:12,915 13 | And I'll first give you 14 | the intuitive answer, and 15 | 16 | 4 17 | 00:00:12,915 --> 00:00:15,962 18 | then I wish you to 19 | compute probabilities for 20 | 21 | 5 22 | 00:00:15,962 --> 00:00:20,590 23 | me that manifest the explain away 24 | effect in a Bayes net of this type. 25 | 26 | 6 27 | 00:00:20,590 --> 00:00:25,513 28 | Explaining away means that, 29 | if we know that we are happy, 30 | 31 | 7 32 | 00:00:25,513 --> 00:00:31,178 33 | then sunny weather can explain 34 | away the cause of happiness. 35 | 36 | 8 37 | 00:00:31,178 --> 00:00:33,100 38 | If I then also know that it's sunny, 39 | 40 | 9 41 | 00:00:33,100 --> 00:00:37,460 42 | it becomes less likely 43 | that I received a raise. 44 | 45 | 10 46 | 00:00:37,460 --> 00:00:40,760 47 | So let me put this differently, suppose 48 | I'm a happy guy on a specific day, and 49 | 50 | 11 51 | 00:00:40,760 --> 00:00:45,020 52 | my wife asks me, Sebastian, 53 | why are you so happy? 54 | 55 | 12 56 | 00:00:45,020 --> 00:00:47,920 57 | Is it sunny, or did you get a raise? 58 | 59 | 13 60 | 00:00:47,920 --> 00:00:50,504 61 | If she then looks outside and 62 | sees it's sunny, 63 | 64 | 14 65 | 00:00:50,504 --> 00:00:55,105 66 | then she might explain to herself, well, 67 | Sebastian is happy because it's sunny. 68 | 69 | 15 70 | 00:00:55,105 --> 00:00:59,038 71 | It makes it effectively, 72 | less likely that he got a raise, 73 | 74 | 16 75 | 00:00:59,038 --> 00:01:03,910 76 | because I could already explain 77 | this happiness by it being sunny. 78 | 79 | 17 80 | 00:01:03,910 --> 00:01:09,083 81 | If she looks outside and it's rainy, 82 | it makes it more likely I got a raise, 83 | 84 | 18 85 | 00:01:09,083 --> 00:01:13,117 86 | because the weather can't 87 | really explain my happiness. 88 | 89 | 19 90 | 00:01:13,117 --> 00:01:18,260 91 | In other words, if you see a certain 92 | effect that could be caused 93 | 94 | 20 95 | 00:01:18,260 --> 00:01:22,935 96 | by multiple causes, 97 | seeing one of those causes can explain 98 | 99 | 21 100 | 00:01:22,935 --> 00:01:27,439 101 | away any other potential cause 102 | of this effect over here. 103 | 104 | 22 105 | 00:01:27,439 --> 00:01:32,254 106 | So let me put this in numbers, and 107 | ask you the challenging question of 108 | 109 | 23 110 | 00:01:32,254 --> 00:01:37,410 111 | what's the probability of the raise 112 | given I'm happy, and it's sunny?
113 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/8 - 33 - Two Test Cancer Solution V4 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,420 --> 00:00:06,093 3 | So the correct answer is 0.1698 4 | 5 | 2 6 | 00:00:06,093 --> 00:00:10,526 7 | approximately. 8 | 9 | 3 10 | 00:00:10,526 --> 00:00:15,370 11 | And to compute this, 12 | I used the trick I've shown you before. 13 | 14 | 4 15 | 00:00:15,370 --> 00:00:20,551 16 | Let me write down the running count for 17 | cancer, and 18 | 19 | 5 20 | 00:00:20,551 --> 00:00:28,033 21 | for not cancer, as I integrate 22 | the various multiplications in Bayes' rule. 23 | 24 | 6 25 | 00:00:28,033 --> 00:00:34,937 26 | My prior for cancer was 0.01, 27 | and for non-cancer was 0.99. 28 | 29 | 7 30 | 00:00:34,937 --> 00:00:36,948 31 | Then I get my first +, and 32 | 33 | 8 34 | 00:00:36,948 --> 00:00:41,710 35 | the probability of a plus given 36 | that we have cancer is 0.9. 37 | 38 | 9 39 | 00:00:41,710 --> 00:00:46,593 40 | And the same for non-cancer is 0.2. 41 | 42 | 10 43 | 00:00:46,593 --> 00:00:49,987 44 | So according to 45 | the non-normalized Bayes' rule, 46 | 47 | 11 48 | 00:00:49,987 --> 00:00:54,623 49 | I now multiply these two things 50 | together to get my non-normalized 51 | 52 | 12 53 | 00:00:54,623 --> 00:00:58,680 54 | probability of 55 | having cancer given the plus. 56 | 57 | 13 58 | 00:00:58,680 --> 00:01:01,639 59 | Since multiplication is commutative, 60 | 61 | 14 62 | 00:01:01,639 --> 00:01:07,126 63 | I can do the same thing again with 64 | my second test result, 0.9, 0.2. 65 | 66 | 15 67 | 00:01:07,126 --> 00:01:11,867 68 | And I multiply all of these three things 69 | together to get my non-normalized 70 | probability, P prime, 71 | to be the following. 72 | 73 | 16 74 | 00:01:11,867 --> 00:01:14,911 75 | 0.0081 if you multiply 76 | those things together. 77 | 78 | 17 79 | 00:01:14,911 --> 00:01:18,479 80 | 0.0081 if you multiply 81 | those things together. 82 | 83 | 18 84 | 00:01:19,490 --> 00:01:25,539 85 | And 0.0396 if you multiply 86 | these guys together. 87 | 88 | 19 89 | 00:01:25,539 --> 00:01:27,560 90 | And these are not a probability. 91 | 92 | 20 93 | 00:01:27,560 --> 00:01:31,315 94 | If we add those for 95 | the two complementary, 96 | 97 | 21 98 | 00:01:31,315 --> 00:01:35,603 99 | with cancer, non-cancer, I get 0.0477. 100 | 101 | 22 102 | 00:01:35,603 --> 00:01:37,974 103 | However, I now divide. 104 | 105 | 23 106 | 00:01:37,974 --> 00:01:42,413 107 | That is, I normalize those non-normalized 108 | probabilities over 109 | 110 | 24 111 | 00:01:42,413 --> 00:01:44,920 112 | here by this factor over here. 113 | 114 | 25 115 | 00:01:44,920 --> 00:01:47,410 116 | I actually get the correct 117 | posterior probability. 118 | 119 | 26 120 | 00:01:47,410 --> 00:01:50,056 121 | P(C) given ++. 122 | 123 | 27 124 | 00:01:50,056 --> 00:01:55,873 125 | And they look as follows, 126 | approximately 0.1698 and 127 | 128 | 28 129 | 00:01:55,873 --> 00:01:59,021 130 | approximately 0.8301.
129 | -------------------------------------------------------------------------------- /TopicNotes/reinforcementLearning/Notes.rst: -------------------------------------------------------------------------------- 1 | ########################## 2 | Reinforcement Learning 3 | ########################## 4 | 5 | Forms of Learning: 6 | 7 | - Supervised Learning: (x_1, y_1), (x_2, y_2): y = f(x) 8 | Ex: Speech recognition 9 | - Unsupervised Learning: x_1, x_2, x_3, ... P(X=x_1), we search for a probability distribution, or a cluster 10 | Ex: Clustering, light emissions coming from space. 11 | - Reinforcement Learning: state, action, state, action, ... . There are some rewards associated with some of these states. Rewards are just scalar numbers, positive, negative etc. We search for an optimal policy that tells us what to do in any given state. 12 | Ex: Elevator controller 13 | 14 | MDP Review - Markov Decision Process 15 | ===================================== 16 | 17 | MDPs consist of a set of states, a set of actions available in a state, a transition function which gives a result state, and a reward function. 18 | 19 | Formally, :math:`s {\in} S`, 20 | :math:`a {\in} Actions(s)`, 21 | :math:`P(s'|s,a)`, 22 | also an initial state :math:`s_0`, 23 | sometimes a reward function that is general over the triplet 24 | :math:`R(s,a,s')` or sometimes we only talk about 25 | the result state :math:`R(s')` 26 | 27 | To solve an MDP, we try to find a policy that maximizes the discounted total reward. 28 | 29 | Reinforcement Learning 30 | ======================== 31 | 32 | Reinforcement Learning comes into play when we don't know the reward function or even the transition model of the world. 33 | When we don't know these we can't solve the MDP; however, with reinforcement learning, 34 | we can find these, either by interacting with the world, or you can learn substitutes that tell you as much as you 35 | need to know, so that you never have to actually compute the reward or the transition model. 36 | 37 | Several choices exist for modeling: 38 | 39 | +---------------------+--------+-----------+---------+ 40 | | agent | know | learn | Utility | 41 | +=====================+========+===========+=========+ 42 | | utility based agent | P | R ? -> U | U | 43 | +---------------------+--------+-----------+---------+ 44 | | Q Learning | ? | Q(s,a) | Q | 45 | +---------------------+--------+-----------+---------+ 46 | | reflex agent | ? | Policy(s) | Policy | 47 | +---------------------+--------+-----------+---------+ 48 | 49 | Q(s,a): is like a utility function evaluating state-action pairs. We don't know the transition model and reward function. 50 | Reflex agent: is just about the policy; it is called a reflex agent because it is pure stimulus-response. 51 | 52 | Passive Reinforcement Learning: 53 | Passive means the agent has a fixed policy and executes that. During that process it learns about R and/or P, the transition model. 54 | 55 | Active Reinforcement Learning 56 | Active implies that we change the policy as we learn about the world. 57 | 58 | Passive Temporal Difference Learning 59 | ------------------------------------- 60 | 61 | This needs a table of utilities for each state; we also keep track of how many times we visited each state (a minimal sketch of the update follows below).
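A minimal sketch of the passive TD update just described, assuming a decaying learning rate alpha = 1/N(s) and a discount of 0.9; both are illustrative choices, not values given in these notes:

.. code:: python

    # Passive TD learning: follow the fixed policy and update the utility
    # table U from each observed transition (s, reward, s_next).
    from collections import defaultdict

    GAMMA = 0.9              # discount factor (assumed value)
    U = {}                   # table of utilities, starts out blank
    N = defaultdict(int)     # table of visit counts, starts at zero

    def td_update(s, reward, s_next):
        """One temporal-difference update along the fixed policy's run."""
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        N[s] += 1
        alpha = 1.0 / N[s]   # decaying learning rate, one common schedule
        U[s] += alpha * (reward + GAMMA * U[s_next] - U[s])

With this schedule, frequently visited states receive ever smaller updates, so the estimates settle toward the utilities of the fixed policy.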
62 | Table of utilities should start out as blank, and the table of number of visits should start out with zeros 63 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/4 - 29 - Bayes Network Merged FINAL - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,500 --> 00:00:02,390 3 | So we want to draw Bayes rule rapidly. 4 | 5 | 2 6 | 00:00:02,390 --> 00:00:06,830 7 | We have a situation where we 8 | have an internal variable A, 9 | 10 | 3 11 | 00:00:06,830 --> 00:00:10,670 12 | like whether or not we have cancer. 13 | 14 | 4 15 | 00:00:10,670 --> 00:00:12,520 16 | But we can't sense A, 17 | 18 | 5 19 | 00:00:12,520 --> 00:00:17,180 20 | instead we have a second variable 21 | called B, which is our test. 22 | 23 | 6 24 | 00:00:17,180 --> 00:00:19,330 25 | And B is observable, but A isn't. 26 | 27 | 7 28 | 00:00:19,330 --> 00:00:24,090 29 | This is a classical 30 | example of a Base network. 31 | 32 | 8 33 | 00:00:24,090 --> 00:00:28,130 34 | The base network is composed 35 | of two variables A and B. 36 | 37 | 9 38 | 00:00:28,130 --> 00:00:32,210 39 | We know the prior probability for A and 40 | 41 | 10 42 | 00:00:32,210 --> 00:00:36,070 43 | we know the conditional 44 | A causes B whether or 45 | 46 | 11 47 | 00:00:36,070 --> 00:00:39,730 48 | not the cancer causes the test 49 | result to be positive or not. 50 | 51 | 12 52 | 00:00:39,730 --> 00:00:41,590 53 | Although with some randomness involved. 54 | 55 | 13 56 | 00:00:41,590 --> 00:00:47,750 57 | So we know about the probability of 58 | B given the different values for A. 59 | 60 | 14 61 | 00:00:47,750 --> 00:00:52,160 62 | And what we care about in this specific 63 | instance is called diagnostic reasoning. 64 | 65 | 15 66 | 00:00:52,160 --> 00:00:56,480 67 | Which is inverse of 68 | the causal reasoning. 69 | 70 | 16 71 | 00:00:56,480 --> 00:01:04,290 72 | Probability of A given B, or similarly, 73 | probability of A given off B. 74 | 75 | 17 76 | 00:01:04,290 --> 00:01:09,540 77 | This is our very first Bayes Network and 78 | the graphical representation 79 | 80 | 18 81 | 00:01:09,540 --> 00:01:13,700 82 | of drawn two variables, A and 83 | B, connected with an arrow. 84 | 85 | 19 86 | 00:01:13,700 --> 00:01:20,450 87 | Because from A to B is a graphical 88 | representation of distribution of two 89 | 90 | 20 91 | 00:01:20,450 --> 00:01:26,740 92 | variables that specified the structure 93 | of a and this is a probability. 94 | 95 | 21 96 | 00:01:26,740 --> 00:01:29,380 97 | And has a conditional 98 | probability which is normal. 99 | 100 | 22 101 | 00:01:29,380 --> 00:01:32,220 102 | Now I do have a quick quiz for you. 103 | 104 | 23 105 | 00:01:32,220 --> 00:01:36,690 106 | How many parameters does it 107 | take to specify the entire 108 | 109 | 24 110 | 00:01:36,690 --> 00:01:41,440 111 | joint probability between A and 112 | B or the entire Bayes Network? 113 | 114 | 25 115 | 00:01:41,440 --> 00:01:43,440 116 | I'm not looking for 117 | a structural parameters. 118 | 119 | 26 120 | 00:01:44,790 --> 00:01:47,380 121 | We look to the graph over here, 122 | just looking for 123 | 124 | 27 125 | 00:01:47,380 --> 00:01:50,220 126 | the numerical parameters of 127 | the underlying probabilities. 
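A minimal sketch (my addition, not part of the transcript): for this two-variable network, three numbers — P(A), P(B|A), and P(B|not A) — already determine the full joint distribution. The values below are illustrative, borrowed from the cancer-test example in these notes:

```Python
p_a = 0.01             # P(A): prior, e.g. P(cancer); illustrative value
p_b_given_a = 0.9      # P(B | A); illustrative value
p_b_given_not_a = 0.2  # P(B | not A); illustrative value

joint = {}
for a in (True, False):
    pa = p_a if a else 1 - p_a
    pb = p_b_given_a if a else p_b_given_not_a
    for b in (True, False):
        joint[(a, b)] = pa * (pb if b else 1 - pb)

# the 3 parameters fix all 4 joint entries, and the entries sum to 1
assert abs(sum(joint.values()) - 1.0) < 1e-12
```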
128 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/17 - Confounding Cause - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,310 --> 00:00:04,680 3 | For my next example, I will study 4 | a different type of a Bayes network. 5 | 6 | 2 7 | 00:00:04,680 --> 00:00:07,838 8 | Before we've seen networks 9 | of the following type, 10 | 11 | 3 12 | 00:00:07,838 --> 00:00:12,370 13 | where a single, hidden cause 14 | caused two different measurements. 15 | 16 | 4 17 | 00:00:12,370 --> 00:00:15,600 18 | I now want to study a network that 19 | looks just like the opposite. 20 | 21 | 5 22 | 00:00:15,600 --> 00:00:19,110 23 | We have two independent, 24 | hidden causes, but 25 | 26 | 6 27 | 00:00:19,110 --> 00:00:24,060 28 | they get confounded within a single, 29 | observational variable. 30 | 31 | 7 32 | 00:00:24,060 --> 00:00:28,120 33 | I would like to use 34 | the example of happiness. 35 | 36 | 8 37 | 00:00:28,120 --> 00:00:32,080 38 | Suppose I can be happy or unhappy. 39 | 40 | 9 41 | 00:00:32,080 --> 00:00:36,157 42 | What makes me happy is 43 | when the weather is sunny, 44 | 45 | 10 46 | 00:00:36,157 --> 00:00:41,040 47 | or if I get a raise at my job, 48 | which means I make more money. 49 | 50 | 11 51 | 00:00:41,040 --> 00:00:45,601 52 | So let's call this sunny, let's call 53 | this a raise and call this happiness. 54 | 55 | 12 56 | 00:00:45,601 --> 00:00:52,093 57 | Perhaps, the probability 58 | of it being sunny is 0.7. 59 | 60 | 13 61 | 00:00:52,093 --> 00:00:57,279 62 | The probability of a raise is 0.01. 63 | 64 | 14 65 | 00:00:57,279 --> 00:01:03,290 66 | And I will tell you if the probability 67 | of being happy is governed as false. 68 | 69 | 15 70 | 00:01:03,290 --> 00:01:08,812 71 | The probability of being happy given 72 | that both of these things occur, 73 | 74 | 16 75 | 00:01:08,812 --> 00:01:11,722 76 | I got a raise and it's sunny, is 1. 77 | 78 | 17 79 | 00:01:11,722 --> 00:01:16,688 80 | The probability of being happy 81 | given that it is not sunny and 82 | 83 | 18 84 | 00:01:16,688 --> 00:01:19,376 85 | I still got a raise, is 0.9. 86 | 87 | 19 88 | 00:01:19,376 --> 00:01:23,592 89 | The probability of being happy 90 | given that it is sunny, but 91 | 92 | 20 93 | 00:01:23,592 --> 00:01:26,185 94 | I didn't give a raise, is 0.7. 95 | 96 | 21 97 | 00:01:26,185 --> 00:01:28,855 98 | And the probability of being happy, 99 | 100 | 22 101 | 00:01:28,855 --> 00:01:33,774 102 | given that it is neither sunny nor 103 | did I get their raise, is 0.1. 104 | 105 | 23 106 | 00:01:33,774 --> 00:01:39,380 107 | This is a perfectly fine specification 108 | of a probability distribution 109 | 110 | 24 111 | 00:01:39,380 --> 00:01:44,613 112 | where two causes affect the variable 113 | down here, the happiness. 114 | 115 | 25 116 | 00:01:44,613 --> 00:01:48,223 117 | So I'd like you to calculate for 118 | me the following questions. 119 | 120 | 26 121 | 00:01:48,223 --> 00:01:54,950 122 | The probability of a raise given that 123 | it is sunny, according to this model. 124 | 125 | 27 126 | 00:01:54,950 --> 00:01:56,900 127 | Please enter your answer over here. 
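A minimal sketch (my addition, not part of the transcript) that enumerates the joint distribution of this happiness network, using the numbers given above, to evaluate the quiz query P(Raise | Sunny):

```Python
p_s, p_r = 0.7, 0.01  # P(sunny), P(raise), from the transcript
p_h = {(True, True): 1.0, (False, True): 0.9,
       (True, False): 0.7, (False, False): 0.1}  # P(happy | S, R)

def joint(s, r, h):
    # the two hidden causes are independent, so their joint is a product
    p = (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r)
    ph = p_h[(s, r)]
    return p * (ph if h else 1 - ph)

num = sum(joint(True, True, h) for h in (True, False))   # P(S, R)
den = sum(joint(True, r, h) for r in (True, False)
          for h in (True, False))                        # P(S)
print(num / den)  # 0.01: since S and R are independent, P(R | S) = P(R)
```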
128 | 
--------------------------------------------------------------------------------
/MathNotes/Precalculus.rst:
--------------------------------------------------------------------------------
1 | ##################
2 | Precalculus notes
3 | ##################
4 | 
5 | Arithmetic Operators
6 | ####################
7 | 
8 | Addition
9 | =========
10 | 
11 | Mostly used with the infix operator :math:`+`. When two or more disjoint sets
12 | are combined, the number of elements in the resulting set is represented with
13 | addition.
14 | 
15 | Properties
16 | ----------
17 | 
18 | Commutative: a + b = b + a, meaning the order of the operands does not change the result
19 | 
20 | Associative: a + (b + c) = (a + b) + c, meaning the precedence of operations
21 | does not change the final result
22 | 
23 | 0 as identity element: a + 0 = a, meaning that 0 is a number which does not
24 | change the identity of the element.
25 | 
26 | Units: in the operation a + b we assume that a and b have the same units, for
27 | example 4m + 5m = 9m but 4m + 5m^2 is not possible.
28 | 
29 | Subtraction
30 | ============
31 | 
32 | It is usually designated with the infix operator :math:`-`.
33 | It represents going back some steps on the number line.
34 | 
35 | Properties
36 | -----------
37 | 
38 | Anticommutative: a - b = -(b - a), meaning that reversing the operands negates the
39 | result
40 | 
41 | Non-associative: a - (b - c) \not = (a - b) - c, the precedence of operations
42 | contributes to the result
43 | 
44 | 
45 | Multiplication
46 | ==============
47 | 
48 | Usually designated by the infix operator :math:`×` or
49 | :math:`\cdot`. It can be thought of as repeated addition.
50 | 
51 | Properties
52 | ----------
53 | 
54 | Commutative: a × b = b × a
55 | 
56 | Associative: a × (b × c) = (a × b) × c
57 | 
58 | Distributive: a × (b + c) = a × b + a × c
59 | 
60 | Identity element: a × 1 = a
61 | 
62 | Special case of 0: a × 0 = 0
63 | 
64 | Negation: -1 × a = -a, that is, multiplication with -1 results in the additive
65 | inverse
66 | 
67 | Inverse element: every number except 0 has a multiplicative inverse so that
68 | a × 1/a = 1
69 | 
70 | Order preservation: multiplication by a positive number conserves order, by a
71 | negative number reverses order:
72 | a × b < a × c if b < c
73 | -a × b > -a × c if b < c
74 | 
75 | Division
76 | ========
77 | 
78 | Usually designated by the infix operator :math:`/`, and :math:`÷`. It can be
79 | considered the process of calculating how many times a number is contained in
80 | another one.
81 | 
82 | Properties
83 | -----------
84 | 
85 | Distributive: (a+b) / c = a/c + b/c
86 | 
87 | 
88 | Exponentiation
89 | ===============
90 | 
91 | It is usually designated with a base and power part as in :math:`b^n` where b
92 | is the base and n is the power. It represents repeated multiplication of the base, like
93 | :math:`b^n = b × b × ... n times ... × b`
94 | 
95 | Properties
96 | ----------
97 | 
98 | b^1 = b
99 | b^{n+1} = b^n × b
100 | b^{n+m} = b^n × b^m
101 | b^0 = 1
102 | b^{-n} = 1 / b^n
103 | (b^m)^n = b^{m×n}
104 | (b×c)^n = b^n × c^n
105 | b^{u/v} = (b^u)^{1/v} = :math:`\sqrt[v]{b^u}` vth root of b^u
106 | 
107 | roots are thus treated as taking an exponent with a fraction
108 | 
109 | 
110 | Logarithms
111 | ===========
112 | 
113 | The logarithm is the inverse of the exponential function.
114 | For b^y = x, log_b(x) = y: basically I have the base and the resulting value and I
115 | try to find the power.
116 | 117 | Properties 118 | ----------- 119 | 120 | log_b(a × c) = log_b(a) + log_b(c) 121 | log_b(x / y ) = log_b(x) - log_b(y) 122 | log_b(x^p) = p log_b(x) 123 | log_b(\sqrt[p]{x}) = log_b(x) / p 124 | log_b(x) = log_k(x) / log_k(b) 125 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/22 - 47 - Explaining Away 2 Solution V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,600 --> 00:00:02,700 3 | So this is a difficult question. 4 | 5 | 2 6 | 00:00:02,700 --> 00:00:05,700 7 | Let me compute an auxiliary label, 8 | 9 | 3 10 | 00:00:05,700 --> 00:00:10,580 11 | which is P of Happiness, that one is 12 | 13 | 4 14 | 00:00:10,580 --> 00:00:15,600 15 | expanded by looking at the different 16 | conditions that can make us happy. 17 | 18 | 5 19 | 00:00:15,600 --> 00:00:19,315 20 | P of Happiness given S and 21 | R, times P of S and 22 | 23 | 6 24 | 00:00:19,315 --> 00:00:23,443 25 | R, which is of course 26 | the product of those two, 27 | 28 | 7 29 | 00:00:23,443 --> 00:00:27,175 30 | because independent plus P of Happiness. 31 | 32 | 8 33 | 00:00:27,175 --> 00:00:32,935 34 | Given not S R, probability of 35 | not S R + P of H given S and 36 | 37 | 9 38 | 00:00:32,935 --> 00:00:37,031 39 | not R times the probability 40 | of P of S and 41 | 42 | 10 43 | 00:00:37,031 --> 00:00:42,289 44 | not R plus the last case P 45 | of H given not S and not R. 46 | 47 | 11 48 | 00:00:42,289 --> 00:00:44,438 49 | So this just looks at the happiness and 50 | 51 | 12 52 | 00:00:44,438 --> 00:00:48,630 53 | all four combinations of the variables 54 | that could lead to happiness. 55 | 56 | 13 57 | 00:00:48,630 --> 00:00:51,660 58 | And you can plug those straight in, 59 | this one over here is 1. 60 | 61 | 14 62 | 00:00:51,660 --> 00:00:56,607 63 | And this one over here 64 | is the product of S and 65 | 66 | 15 67 | 00:00:56,607 --> 00:01:00,599 68 | R which is 0.7 times 0.01. 69 | 70 | 16 71 | 00:01:00,599 --> 00:01:06,078 72 | And as you plug all of those in, 73 | you get as a result, 74 | 75 | 17 76 | 00:01:06,078 --> 00:01:09,796 77 | 0.5245, that's P(H). 78 | 79 | 18 80 | 00:01:09,796 --> 00:01:14,582 81 | Just take some time and do the math by 82 | going through these different cases 83 | 84 | 19 85 | 00:01:14,582 --> 00:01:17,495 86 | using probability and 87 | you get this result. 88 | 89 | 20 90 | 00:01:17,495 --> 00:01:24,636 91 | Now armed with this number, 92 | the rest now becomes easy which is, 93 | 94 | 21 95 | 00:01:24,636 --> 00:01:29,132 96 | we can use base rule 97 | to turn this around, 98 | 99 | 22 100 | 00:01:29,132 --> 00:01:33,114 101 | P(H given R) P(R) over P(H). 102 | 103 | 23 104 | 00:01:33,114 --> 00:01:37,719 105 | P(R) we know from over here, 106 | the probability of a race is 0.01. 107 | 108 | 24 109 | 00:01:37,719 --> 00:01:41,303 110 | So the only thing we need to 111 | compute now is P(H given R). 112 | 113 | 25 114 | 00:01:41,303 --> 00:01:45,548 115 | And again we applied sort of probability 116 | let me just do this over here. 117 | 118 | 26 119 | 00:01:45,548 --> 00:01:49,912 120 | We can factor P(H given 121 | R) as P(H given R,S) for 122 | 123 | 27 124 | 00:01:49,912 --> 00:01:54,684 125 | sunny times probability of 126 | sunny plus P(H given R) and 127 | 128 | 28 129 | 00:01:54,684 --> 00:01:58,657 130 | not sunny times 131 | the probability of not sunny. 
132 | 
133 | 29
134 | 00:01:58,657 --> 00:02:03,883
135 | And if you plug in the numbers for
136 | this you get 1
137 | 
138 | 30
139 | 00:02:03,883 --> 00:02:11,255
140 | times 0.7 + 0.9 times 0.3
141 | that happens to be 0.97.
142 | 
143 | 31
144 | 00:02:11,255 --> 00:02:18,701
145 | So if you now plug this all back
146 | into this equation over here,
147 | 
148 | 32
149 | 00:02:18,701 --> 00:02:24,833
150 | we get 0.97 times 0.01 / 0.5245.
151 | 
152 | 33
153 | 00:02:24,833 --> 00:02:31,994
154 | This gives us approximately,
155 | as a correct answer, 0.0185.
--------------------------------------------------------------------------------
/TopicNotes/KnowledgeBasedAI/Notes.rst:
--------------------------------------------------------------------------------
1 | ###########################
2 | Knowledge Based AI Systems
3 | ###########################
4 | 
5 | Five Fundamental Problems in AI
6 | ================================
7 | 
8 | 1. Intelligent Agents have limited resources:
9 | - But most AI problems are computationally intractable.
10 | 
11 | - How then can we make AI agents give us real time performance on AI
12 | problems?
13 | 
14 | 2. Computation is local, but problems have global constraints:
15 | 
16 | - How then can we make AI agents address global problems using local
17 | computations (object recognition with limited data, for example)?
18 | 
19 | 3. Computational logic is mainly deductive, but many AI problems are abductive
20 | or inductive in nature.
21 | 
22 | - How can we get AI agents to address abductive or inductive problems?
23 | 
24 | 4. The world is dynamic, but knowledge is limited.
25 | 
26 | - An AI agent always has to begin with what it already knows; how then can an AI
27 | agent address a new problem?
28 | 
29 | 5. Learning, problem solving, reasoning are complex, but explanation and
30 | justification add to the complexity:
31 | 
32 | - How then can we get an AI agent to justify or explain its decisions?
33 | 
34 | Characteristics of AI Problems
35 | =================================
36 | 
37 | 1. Knowledge often arrives incrementally
38 | 
39 | 2. Problems exhibit recurring patterns
40 | 
41 | 3. Problems have multiple levels of granularity
42 | 
43 | 4. Many problems are computationally intractable
44 | 
45 | 5. The world is dynamic, but the knowledge is static
46 | 
47 | 6. The world is open ended, but the knowledge is limited
48 | 
49 | Characteristics of AI Agents
50 | =============================
51 | 
52 | 1. AI agents have limited computing powers
53 | 
54 | 2. AI agents have limited sensors
55 | 
56 | 3. AI agents have limited attention
57 | 
58 | 4. Computational logic is deductive, but problems are inductive or abductive.
59 | 
60 | 5. An AI agent's knowledge is limited with respect to the world
61 | 
62 | AI system overview
63 | -------------------
64 | 
65 | Input: Perception - data coming from sensors
66 | Output: Action
67 | 
68 | Operations:
69 | 
70 | - Meta cognition
71 | - Deliberation
72 | - Reaction
73 | 
74 | Metacognition <--> Deliberation
75 | 
76 | Reaction <--> Deliberation
77 | 
78 | Deliberation has three components:
79 | 
80 | - Reasoning:
81 | 
82 | - Learning:
83 | 
84 | - Gets the right answer and stores the right answer somewhere
85 | 
86 | - If it gets the wrong answer, then once it obtains the right one,
87 | it stores the right answer in its place
88 | 
89 | - Memory: stores what is learned
90 | 
91 | All three components interact with each other.
92 | 
93 | 4 Schools of Thought in AI
94 | --------------------------
95 | 
96 | Think of an xy axis. At the top of the y axis there is thinking,
97 | at the bottom of the y axis there is acting.
98 | On the right of the x axis there is human-like; on the left of the x axis there is
99 | optimal.
100 | 
101 | What distinguishes them?
102 | 
103 | If we think about optimal AI agents, we are necessarily talking about an AI
104 | agent that is good at one thing; if we think about a human-like AI agent, we
105 | talk about agents that are above mediocre at most things.
106 | 
107 | We can then classify AI agents according to these axes.
108 | 
109 | AI agents that think optimally:
110 | - Machine learning problems are mostly treated by this type of AI
111 | 
112 | AI agents that think like humans:
113 | - Semantic Web
114 | 
115 | AI agents that act optimally:
116 | - Airplane autopilots
117 | 
118 | AI agents that act like humans:
119 | - Improvisational robots: that can perhaps dance to the music you play
120 | 
121 | 
122 | 
--------------------------------------------------------------------------------
/TopicNotes/MachineLearning/UnsupervisedLearning.rst:
--------------------------------------------------------------------------------
1 | ################################
2 | Machine Learning - Unsupervised
3 | ################################
4 | 
5 | We try to guess the structure from the data.
6 | 
7 | For example you have a straight line in a coordinate system.
8 | You can say that the space's dimensionality is equal to 2,
9 | whereas the line can be represented in 1 dimension.
10 | 
11 | One of the basic applications of unsupervised learning
12 | is to represent higher dimensionality structures, like images, in lower
13 | dimensions like histograms.
14 | 
15 | We learn about clustering and dimension reduction.
16 | 
17 | Some of the terminology used in unsupervised learning.
18 | We assume that the data is IID, that is, independent and identically
19 | distributed: drawn independently from the same distribution.
20 | 
21 | Unsupervised learning seeks to recover the underlying density of the
22 | probability distribution that generated the data.
23 | 
24 | This is called *density estimation*; the following two are versions of it:
25 | 
26 | - Clustering
27 | - Dimensionality reduction
28 | 
29 | Clustering
30 | -------------
31 | 
32 | Two algorithms are pretty common in clustering:
33 | 
34 | - k-means
35 | - expectation maximisation
36 | 
37 | Problems with k-means:
38 | 
39 | - need to know k
40 | - local minima
41 | - high dimensionality
42 | - lack of mathematical basis.
43 | 
44 | Expectation maximisation is a generalisation of k-means; it uses actual
45 | probability distributions to describe what we are doing.
46 | 
47 | Gaussian Learning
48 | -------------------
49 | 
50 | Fitting gaussians to data, or gaussian learning, in which we shall be given some
51 | data points and wonder what is the best gaussian fitting the data.
52 | 
53 | .. image:: Gaussians.png
54 | 
55 | The formula actually represents a probability distribution.
56 | 
57 | Let's explain the multivariate one, since it works for the single dimension case as well.
58 | 
59 | The formula is: :math:`(2{\pi})^{-\frac{N}{2}} |S|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (x-m)^T S^{-1} (x-m)\right)`
60 | 
61 | N is the number of dimensions of the data.
62 | 
63 | m is the mu, that is, the average of the samples.
64 | 
65 | x is the probing point, that is, our data point.
66 | 
67 | S is the covariance matrix, that is, the matrix that shows
68 | how far the data points spread away from the mean, which cuts through
69 | the peak of the gaussian.
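A minimal NumPy sketch of this density (my addition; it assumes ``m`` is the
mean vector, ``S`` the covariance matrix, and ``|S|`` its determinant)::

    import numpy as np

    def gaussian_density(x, m, S):
        """Multivariate gaussian density evaluated at the point x."""
        N = len(m)
        diff = x - m
        norm = (2 * np.pi) ** (-N / 2) * np.linalg.det(S) ** (-0.5)
        return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff))

    # usage: a standard 2-d gaussian evaluated at its mean
    print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.159 = 1/(2*pi)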
70 | 
71 | This term normalises the squared error by the covariance matrix
72 | 
73 | How to choose K in Expectation Maximization and K-means:
74 | 
75 | - Guess initial K
76 | - Run EM
77 | - Remove unnecessary clusters
78 | - Create new random clusters
79 | - Repeat from the second step
80 | 
81 | Dimensionality Reduction
82 | -------------------------
83 | 
84 | Linear Dimensionality reduction:
85 | 
86 | The idea is that we are given data points and we seek a linear subspace onto which to remap the data.
87 | 
88 | 1. Fit a gaussian
89 | 2. Calculate the eigenvalues and eigenvectors of the gaussian
90 | 3. Pick the eigenvectors with maximum eigenvalues
91 | 4. Project the data onto the subspace of the eigenvectors you chose.
92 | 
93 | Spectral Clustering
94 | ---------------------
95 | 
96 | Affinity to a group of data points defines the cluster of a data point, not its absolute position
97 | 
98 | The affinity matrix is essential to spectral clustering:
99 | It is a matrix in which each data point is graphed relative to other data points
100 | Affinity is measured by the quadratic distance between points. High affinity means small quadratic distance.
101 | This is a rank deficient matrix and it is easy to identify with Principal Component Analysis.
102 | PCA analyses the vectors that are similar in an approximate rank deficient matrix.
103 | 
104 | Dimensionality = Number of Large Eigenvalues
--------------------------------------------------------------------------------
/TopicNotes/BayesNets/subtitles/14 - Conditional Independence2 Solution - lang_en_vs51.srt:
--------------------------------------------------------------------------------
1 | 1
2 | 00:00:00,540 --> 00:00:04,210
3 | So, for this one,
4 | we are going to apply total probability.
5 | 
6 | 2
7 | 00:00:04,210 --> 00:00:10,170
8 | This thing over here is the same as
9 | probability of test 2 to be positive,
10 | 
11 | 3
12 | 00:00:10,170 --> 00:00:13,640
13 | which I am going to abbreviate
14 | with a + 2 over here.
15 | 
16 | 4
17 | 00:00:13,640 --> 00:00:16,859
18 | Condition on test 1 being positive and
19 | 
20 | 5
21 | 00:00:16,859 --> 00:00:22,321
22 | me having cancer, times probability
23 | of me having cancer given
24 | 
25 | 6
26 | 00:00:22,321 --> 00:00:26,615
27 | positive plus probability
28 | of test 2 being positive
29 | 
30 | 7
31 | 00:00:26,615 --> 00:00:31,296
32 | condition on positive and
33 | me not having cancer,
34 | 
35 | 8
36 | 00:00:31,296 --> 00:00:38,090
37 | times probability of me not having
38 | cancer, given that test 1 is positive.
39 | 
40 | 9
41 | 00:00:38,090 --> 00:00:41,420
42 | That's the same as
43 | the total probability.
44 | 
45 | 10
46 | 00:00:41,420 --> 00:00:44,620
47 | But now everything is
48 | conditioned on + 1.
49 | 
50 | 11
51 | 00:00:44,620 --> 00:00:46,530
52 | Take a moment to verify this.
53 | 
54 | 12
55 | 00:00:48,370 --> 00:00:50,200
56 | Now here you can plug in the numbers.
57 | 
58 | 13
59 | 00:00:50,200 --> 00:00:57,664
60 | We already calculated this one before,
61 | which is approximately 0.043.
62 | 
63 | 14
64 | 00:00:57,664 --> 00:01:02,076
65 | And this one over here is 1 minus this,
66 | 
67 | 15
68 | 00:01:02,076 --> 00:01:06,365
69 | which is 0.957 approximately.
70 | 
71 | 16
72 | 00:01:06,365 --> 00:01:12,077
73 | This now exploits conditional
74 | independence, which means that
75 | 
76 | 17
77 | 00:01:12,077 --> 00:01:17,670
78 | knowledge of the first test gives me no
79 | more information about the second test.
80 | 81 | 18 82 | 00:01:17,670 --> 00:01:21,750 83 | It only gives me information if C was 84 | unknown as was the case over here. 85 | 86 | 19 87 | 00:01:21,750 --> 00:01:26,070 88 | So we can rewrite this thing over 89 | here as follows P of plus 2, 90 | 91 | 20 92 | 00:01:26,070 --> 00:01:31,820 93 | given the cancer, I can drop the plus 94 | 1 and the same is true over here. 95 | 96 | 21 97 | 00:01:31,820 --> 00:01:34,191 98 | This is exploiting my 99 | conditional independence. 100 | 101 | 22 102 | 00:01:34,191 --> 00:01:37,702 103 | I knew that P of plus 1 or 104 | 105 | 23 106 | 00:01:37,702 --> 00:01:45,020 107 | plus 2 condition on C is 108 | the same as P of plus 2. 109 | 110 | 24 111 | 00:01:45,020 --> 00:01:47,400 112 | Condition of C and test 1. 113 | 114 | 25 115 | 00:01:47,400 --> 00:01:51,765 116 | I can now read those off my table here, 117 | 118 | 26 119 | 00:01:51,765 --> 00:01:57,361 120 | there's 0.9 times 0.043 plus 0.2, 121 | 122 | 27 123 | 00:01:57,361 --> 00:02:03,775 124 | which is 1 minus 0.8 over here, 125 | times 0.957, 126 | 127 | 28 128 | 00:02:03,775 --> 00:02:09,112 129 | which gives me approximately 0.2301. 130 | 131 | 29 132 | 00:02:09,112 --> 00:02:14,610 133 | So that says if my first 134 | test comes in positive, 135 | 136 | 30 137 | 00:02:14,610 --> 00:02:20,720 138 | I expect my second test to be 139 | positive with the probably 0.2301. 140 | 141 | 31 142 | 00:02:20,720 --> 00:02:25,446 143 | That's an increase probability to the 144 | default probability which we calculated 145 | 146 | 32 147 | 00:02:25,446 --> 00:02:28,960 148 | before, which is 149 | probability of any test. 150 | 151 | 33 152 | 00:02:28,960 --> 00:02:32,970 153 | Test 2 coming as positive 154 | before was to normalize or 155 | 156 | 34 157 | 00:02:32,970 --> 00:02:37,680 158 | Bayes rule which was 0.207. 159 | 160 | 35 161 | 00:02:37,680 --> 00:02:42,482 162 | So my first test has a 20% 163 | chance of coming in positive, 164 | 165 | 36 166 | 00:02:42,482 --> 00:02:47,676 167 | my second test after seeing 168 | a positive test has now an increased 169 | 170 | 37 171 | 00:02:47,676 --> 00:02:51,993 172 | probability of about 23% 173 | of coming in positive. 174 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/11 - Conditional Independence - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,530 --> 00:00:03,327 3 | I want to use a few 4 | words of terminology. 5 | 6 | 2 7 | 00:00:03,327 --> 00:00:08,348 8 | This again is a Bayes network 9 | of which the hidden variable 10 | 11 | 3 12 | 00:00:08,348 --> 00:00:13,382 13 | C causes the still stochastic 14 | test outcomes T1 and T2. 15 | 16 | 4 17 | 00:00:13,382 --> 00:00:18,254 18 | And what's really important is that 19 | we assume not just that T1 and 20 | 21 | 5 22 | 00:00:18,254 --> 00:00:20,870 23 | T2 are identically distributed. 24 | 25 | 6 26 | 00:00:20,870 --> 00:00:25,520 27 | It is the same 0.9 for 28 | test one that I used for test two. 29 | 30 | 7 31 | 00:00:25,520 --> 00:00:30,060 32 | But we also assume that they 33 | are conditionally independent. 34 | 35 | 8 36 | 00:00:30,060 --> 00:00:36,110 37 | We assumed that if God told us 38 | whether we actually had cancer or not. 
39 | 40 | 9 41 | 00:00:36,110 --> 00:00:41,433 42 | If you would ask with certainty 43 | the breakup of variable C that knowing 44 | 45 | 10 46 | 00:00:41,433 --> 00:00:46,408 47 | anything about T1 would not help 48 | us make a statement about T2. 49 | 50 | 11 51 | 00:00:46,408 --> 00:00:52,135 52 | But differently we assumed that 53 | the probability of T2 given C and 54 | 55 | 12 56 | 00:00:52,135 --> 00:00:56,416 57 | T1 is the same as 58 | the probability of T2 given C. 59 | 60 | 13 61 | 00:00:56,416 --> 00:01:01,781 62 | This is called conditional 63 | independence which is given 64 | 65 | 14 66 | 00:01:01,781 --> 00:01:06,800 67 | the value of the cancer variable C, 68 | if you do this for 69 | 70 | 15 71 | 00:01:06,800 --> 00:01:11,760 72 | fact than T2 would be independent of T1. 73 | 74 | 16 75 | 00:01:11,760 --> 00:01:15,810 76 | It's conditionally dependant because 77 | independence only holds true 78 | 79 | 17 80 | 00:01:15,810 --> 00:01:18,160 81 | if we actually know C. 82 | 83 | 18 84 | 00:01:18,160 --> 00:01:20,160 85 | And it comes out of 86 | this diagram over here. 87 | 88 | 19 89 | 00:01:20,160 --> 00:01:26,160 90 | If we look at this diagram, 91 | if we knew the variable C over 92 | 93 | 20 94 | 00:01:26,160 --> 00:01:31,150 95 | here, then C separately causes T1 and 96 | T2. 97 | 98 | 21 99 | 00:01:31,150 --> 00:01:35,757 100 | So as a result, 101 | if we know C whatever over here, 102 | 103 | 22 104 | 00:01:35,757 --> 00:01:41,356 105 | it's kind of cut off, categorically 106 | from what happens over here. 107 | 108 | 23 109 | 00:01:41,356 --> 00:01:47,110 110 | That causes these two variables 111 | to be conditionally independent. 112 | 113 | 24 114 | 00:01:47,110 --> 00:01:50,627 115 | So conditional independence is 116 | a really big thing for Bayes network. 117 | 118 | 25 119 | 00:01:50,627 --> 00:01:54,838 120 | Here's a Bayes network 121 | where A causes B and C. 122 | 123 | 26 124 | 00:01:54,838 --> 00:01:58,750 125 | And for Bayes network of this structure, 126 | 127 | 27 128 | 00:01:58,750 --> 00:02:03,540 129 | we know that given A, 130 | B and C are independent. 131 | 132 | 28 133 | 00:02:03,540 --> 00:02:08,389 134 | It's written as B condition 135 | independent of C given A. 136 | 137 | 29 138 | 00:02:08,389 --> 00:02:12,811 139 | So here's a question, suppose we have 140 | conditional independence between B and 141 | 142 | 30 143 | 00:02:12,811 --> 00:02:13,950 144 | C given A. 145 | 146 | 31 147 | 00:02:13,950 --> 00:02:20,360 148 | Would then imply, and that's my 149 | question, that B and C are independent? 150 | 151 | 32 152 | 00:02:20,360 --> 00:02:21,620 153 | So, suppose we don't know A. 154 | 155 | 33 156 | 00:02:21,620 --> 00:02:25,072 157 | We don't know whether we have cancer, 158 | for example. 159 | 160 | 34 161 | 00:02:25,072 --> 00:02:30,189 162 | Would that mean that the test results, 163 | individually, yet still independent 164 | 165 | 35 166 | 00:02:30,189 --> 00:02:34,542 167 | of each other, even if we don't 168 | know about the cancer situation? 169 | 170 | 36 171 | 00:02:34,542 --> 00:02:35,565 172 | Please answer yes or no. 
173 | 174 | 37 175 | 00:02:35,565 --> 00:02:45,009 176 | [BLANK_AUDIO] 177 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/6 - 31 - Computing Bayes Rules Merged FINAL - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:01,480 --> 00:00:04,220 3 | So we just encountered our 4 | very first Bayes network and 5 | 6 | 2 7 | 00:00:04,220 --> 00:00:06,670 8 | did a number of 9 | interesting calculations. 10 | 11 | 3 12 | 00:00:06,670 --> 00:00:11,200 13 | Let's now talk about Bayes Rule and 14 | look into more complex base networks. 15 | 16 | 4 17 | 00:00:11,200 --> 00:00:13,140 18 | I want to look at Bayes Rule again and 19 | 20 | 5 21 | 00:00:13,140 --> 00:00:15,820 22 | make an observation that 23 | is being non-trivial. 24 | 25 | 6 26 | 00:00:15,820 --> 00:00:21,120 27 | Here is Bayes Rule, and 28 | in practice what we find is 29 | 30 | 7 31 | 00:00:21,120 --> 00:00:25,620 32 | this term here is relatively easy 33 | to compute, it's just a product. 34 | 35 | 8 36 | 00:00:25,620 --> 00:00:28,530 37 | Whereas this term is 38 | really hard to compute. 39 | 40 | 9 41 | 00:00:28,530 --> 00:00:33,890 42 | However, this term over here does not 43 | depend on what we assume for variable A. 44 | 45 | 10 46 | 00:00:33,890 --> 00:00:35,500 47 | It's just a function of B. 48 | 49 | 11 50 | 00:00:35,500 --> 00:00:36,360 51 | So suppose for 52 | 53 | 12 54 | 00:00:36,360 --> 00:00:41,100 55 | a moment we also care about the 56 | complementary range of not A given B. 57 | 58 | 13 59 | 00:00:41,100 --> 00:00:43,800 60 | For which Bayes will unfold as follows. 61 | 62 | 14 63 | 00:00:43,800 --> 00:00:47,640 64 | Then we find that the normalizer 65 | P of B is identical 66 | 67 | 15 68 | 00:00:47,640 --> 00:00:51,220 69 | whether we assume A on the left side or 70 | not A on the left side. 71 | 72 | 16 73 | 00:00:51,220 --> 00:00:59,290 74 | We also know that P of A given B 75 | plus P of not A given B must be one. 76 | 77 | 17 78 | 00:00:59,290 --> 00:01:01,580 79 | Because these are two 80 | complimentary events. 81 | 82 | 18 83 | 00:01:01,580 --> 00:01:05,720 84 | It allows us to compute Bayes 85 | rule very differently by 86 | 87 | 19 88 | 00:01:05,720 --> 00:01:08,010 89 | basically ignoring the normalizer. 90 | 91 | 20 92 | 00:01:08,010 --> 00:01:09,890 93 | So here's how it goes. 94 | 95 | 21 96 | 00:01:09,890 --> 00:01:11,928 97 | You compute P of A given B and 98 | 99 | 22 100 | 00:01:11,928 --> 00:01:17,261 101 | you're going to call this prime 102 | because it's not a real probability. 103 | 104 | 23 105 | 00:01:17,261 --> 00:01:20,180 106 | It will be just P of B 107 | given A times P of A, 108 | 109 | 24 110 | 00:01:20,180 --> 00:01:25,600 111 | which is the normalizer to the 112 | denominator of the expression over here. 113 | 114 | 25 115 | 00:01:25,600 --> 00:01:29,770 116 | We do the same thing not A. 117 | 118 | 26 119 | 00:01:29,770 --> 00:01:33,627 120 | So in both cases, 121 | we compute the posterior probability not 122 | 123 | 27 124 | 00:01:33,627 --> 00:01:36,390 125 | normalized while omitting 126 | the normalizer B. 127 | 128 | 28 129 | 00:01:36,390 --> 00:01:40,570 130 | And then we can recover 131 | the original probabilities 132 | 133 | 29 134 | 00:01:40,570 --> 00:01:44,540 135 | by normalizing based on 136 | those values over here. 
137 | 138 | 30 139 | 00:01:44,540 --> 00:01:48,470 140 | So the probability of A given B, 141 | the actual probability, 142 | 143 | 31 144 | 00:01:48,470 --> 00:01:53,520 145 | is a normalizer eta times this 146 | not normalized form over here. 147 | 148 | 32 149 | 00:01:53,520 --> 00:01:57,034 150 | The same is true for 151 | the negation of A, over here. 152 | 153 | 33 154 | 00:01:57,034 --> 00:02:02,783 155 | And eta is just the normalizer 156 | results by adding these two values 157 | 158 | 34 159 | 00:02:02,783 --> 00:02:09,190 160 | over here together as shown over 161 | here and dividing them by one. 162 | 163 | 35 164 | 00:02:09,190 --> 00:02:10,860 165 | So take a look at this for a moment. 166 | 167 | 36 168 | 00:02:11,940 --> 00:02:16,710 169 | What we've done is we deferred the 170 | calculation of the normalizer over here. 171 | 172 | 37 173 | 00:02:16,710 --> 00:02:19,920 174 | By computing pseudo probabilities 175 | that are non-nominalized. 176 | 177 | 38 178 | 00:02:19,920 --> 00:02:21,550 179 | This made the calculation much easier. 180 | 181 | 39 182 | 00:02:21,550 --> 00:02:26,353 183 | And returned everything we just 184 | folded back in the normalizer based 185 | 186 | 40 187 | 00:02:26,353 --> 00:02:31,164 188 | on the resulting pseudo probabilities 189 | and get the correct answer. 190 | 191 | 41 192 | 00:02:31,164 --> 00:02:40,189 193 | [BLANK_AUDIO] 194 | -------------------------------------------------------------------------------- /TopicNotes/LSTM/CourseNotes.rst: -------------------------------------------------------------------------------- 1 | ##################### 2 | Introduction to LSTM 3 | ##################### 4 | 5 | This stands for Long Short Term Memory Networks, and are quite useful when our 6 | neural network needs to switch between remembering recent things, and things 7 | from long time ago. 8 | 9 | In small summary RNNs work as follows: 10 | 11 | 1. Memory comes in 12 | 2. Merges with the current event 13 | 3. Merged memory goes out. 14 | 15 | In small summary LSTMs work as follows: 16 | 17 | 1. Long term memory comes in 18 | 2. Short term memory comes in 19 | 3. Both merges with the current event. 20 | 4. Merged long term memory goes out. 21 | 5. Merged short term memory goes out. 22 | 23 | 24 | Technically speaking LSTM architecture has several gates: 25 | 26 | - Learn Gate 27 | - Forget Gate 28 | - Use Gate 29 | - Remember Gate 30 | 31 | Here is how they work: 32 | 33 | - Long term memory passes through the forget gate, where it shrinks by removing 34 | unnecessary information 35 | - Short term memory passes through the learn gate with the current event 36 | - Both the information in the learn gate and the forget gate, passes through 37 | remember gate. This creates the new long term memory 38 | - Both the information in the learn gate and the forget gate, passes through 39 | use gate. This creates the new short term memory. 40 | 41 | Learn Gate 42 | ---------- 43 | 44 | Learn gate combines the current event, the picture we are trying to identify, and 45 | the short term memory, that is the memory of the pictures we have just seen. 46 | It also forgets a little of it, keeping the important part. 47 | 48 | Mathematically it works as the following: 49 | 50 | Short Term Memory: STM 51 | Event: E 52 | Hyperbolic tangent function: :math:`tanh = {\frac{e^x - e^{-x}}{e^x + e^{-x}}}` 53 | New Information: N 54 | Ignore factor: i 55 | [STM_{t-1}, E_t]: combining/concatenating two vectors. 
56 | sigmoid function: :math:`{\sigma}(x) = {\frac{e^x}{e^x+1}}` 57 | 58 | Combination works as follows: 59 | 60 | :math:`N_t = tanh(W_n[STM_{t-1},E_t] + b_n)` 61 | 62 | Ignoring works as follows: 63 | 64 | :math:`N_t {\times} i_t` 65 | 66 | In order to calculate the ignore factor we create another small neural network 67 | with sigmoid function: 68 | 69 | :math:`i_t = {\sigma}(W_i[STM_{t-1},E_t] + b_i)` 70 | 71 | 72 | Forget Gate 73 | ----------- 74 | 75 | Forget gate simply forgets least relevant parts of the long term memory 76 | 77 | The idea is the same as ignore factor of the learn gate. 78 | This time we multiply the long term memory with the forget factor which 79 | is calculated using the short term memory, and the event. 80 | Mathematically it works as follows: 81 | 82 | Long Term Memory: LTM 83 | Short Term Memory: STM 84 | Forget Factor: f 85 | Event: E 86 | sigmoid function: :math:`{\sigma}` 87 | 88 | Forget part is as follows: 89 | 90 | :math:`LTM_{t-1} {\times} f_t` 91 | 92 | And the forget factor is calculated as follows: 93 | 94 | :math:`f_t = {\sigma}(W_f[STM_{t-1}, E_t] + b_f)` 95 | 96 | Remember Gate 97 | ------------- 98 | 99 | Remember gate takes the output of the forget gate, and the learn gate and it 100 | combines them together, in order to create the new long term memory 101 | 102 | The combination is a simple addition. 103 | It works as follows: 104 | 105 | Long Term Memory: LTM 106 | Forget Factor: f 107 | New Information: N 108 | Ignore factor: i 109 | 110 | :math:`LTM_t = (LTM_{t-1} {\times} f_t) + (N_t {\times} i_t)` 111 | 112 | Use Gate 113 | -------- 114 | 115 | Use gate creates the short term memory by combining the output of the forget 116 | gate and the previous short term memory and the current event. 117 | 118 | Mathematically it works as follows: 119 | 120 | Long Term Memory: LTM 121 | Forget Factor: f 122 | Short Term Memory: STM 123 | Event: E 124 | Hyperbolic tangent function: :math:`tanh = {\frac{e^x - e^{-x}}{e^x + e^{-x}}}` 125 | New Information: N 126 | Ignore factor: i 127 | [STM_{t-1}, E_t]: combining/concatenating two vectors. 128 | sigmoid function: :math:`{\sigma}(x) = {\frac{e^x}{e^x+1}}` 129 | 130 | It applies the following to the output of the forget gate: 131 | 132 | Output of the forget gate: :math:`LTM_{t-1} {\times} f_t)` 133 | 134 | :math:`U_t = tanh(W_u(LTM_{t-1} {\times} f_t) + b_u)` 135 | 136 | It applies the following to the short term memory and the current event: 137 | 138 | :math:`V_t = {\sigma}(W_v[STM_{t-1}, E_t] + b_v)` 139 | 140 | It combines both :math:`U_t and V_t` in order to obtain the new short term 141 | memory as follows: 142 | 143 | :math:`STM_t = U_t {\times} V_t` 144 | 145 | Other Architectures 146 | ------------------- 147 | 148 | Other variants of LSTM architectures are Gated Recurrent Units, and 149 | Peephole Connections 150 | -------------------------------------------------------------------------------- /TopicNotes/GeneralAdverserialNetworks/courseNotes.rst: -------------------------------------------------------------------------------- 1 | ############################################# 2 | Introduction to General Adverserial Networks 3 | ############################################# 4 | 5 | What can you do with GANs ? 6 | =========================== 7 | 8 | GANs are used for generating realistic data. Most of the applications of gans so 9 | far have been on images 10 | 11 | StackGAN takes a description of a bird and generates an image of a bird 12 | based on that description. 
In that context the GAN is drawing a sample from the probability distribution of all
14 | the images that match the description
15 | 
16 | Pix2Pix
17 | is a GAN used by Adobe and Berkeley: as the user draws crude sketches, the GAN
18 | transforms them into more realistic images.
19 | 
20 | They can be used for image to image translation, for example blueprints for a
21 | building can be transformed into photos of the finished building
22 | 
23 | CycleGAN,
24 | used by Berkeley, is especially good at transforming images through
25 | unsupervised learning
26 | 
27 | GANs can be used to create realistic simulated training sets or environments
28 | in which other machine learning models train, like reinforcement learning
29 | agents
30 | 
31 | How do GANs work?
32 | =================
33 | 
34 | A Generative Adversarial model is like other generative models.
35 | 
36 | GANs are a kind of generative model that lets us generate a whole image in
37 | parallel. Along with other generative models, GANs use a differentiable function
38 | represented by a neural network as a generator network.
39 | 
40 | The generator network takes random noise as the input, then runs that noise
41 | through the differentiable function to transform and reshape the noise to
42 | have recognizable structure.
43 | The output of the generator network is a realistic image; the choice of the
44 | random noise determines what comes out of the network
45 | 
46 | Running the generator network with many different input noise values produces
47 | many different realistic output images.
48 | The goal is for these images to be fair samples from the distribution over real
49 | data
50 | 
51 | Of course the generator net does not start out producing realistic images; it needs to
52 | be trained. However, the training process is quite different from what we have seen so
53 | far in supervised learning models
54 | 
55 | For most generative models we simply adjust the parameters to maximize the
56 | probability that the model would generate the training set.
57 | But it is very difficult to compute this probability, so most generative models
58 | get around that with some kind of approximation.
59 | GANs use a second network called the discriminator in order to guide the
60 | generator; the discriminator is a regular classifier.
61 | During the training process the discriminator is shown real images half of the
62 | time and fake images from the generator the other half.
63 | 
64 | The discriminator is trained to output the probability that the input is real,
65 | so it tries to assign a probability close to 1 to real images and a probability
66 | close to 0 to fake images.
67 | Meanwhile the generator tries to produce images that would get a probability
68 | close to 1 from the discriminator.
69 | 
70 | Over time the generator is forced to produce more realistic outputs in order to
71 | fool the discriminator.
72 | 
73 | Games and Equilibria
74 | --------------------
75 | 
76 | The inspiration for GANs comes from game theory.
77 | Basically the generator and the discriminator are competing against each other,
78 | where each agent can choose from some set of actions and the choice of actions
79 | determines a well defined payoff for each player.
80 | A state of equilibrium in game theory is where neither player can improve their
81 | payoff by changing their strategy, assuming the other player's strategy stays the
82 | same
83 | 
84 | In GANs, we have two agents, each with their own costs. They work with cost
85 | functions, where you try to minimize the cost. In GANs you try to find a point
86 | that minimises the cost functions of both agents
87 | 
88 | For example, for the discriminator a local maximum occurs when the discriminator
89 | accurately estimates the probability that the input is real rather than fake.
90 | This probability is given by the ratio between the data density at the input and
91 | the sum of both the data density and the model density induced by the generator
92 | at the input. We can think of this ratio as measuring how much probability mass
93 | in an area comes from the data rather than the generator.
94 | 
95 | Deep Convolutional GANs
96 | ------------------------
97 | 
98 | The main difference is the use of convolutional networks for the generator and
99 | discriminator.
100 | 
101 | Transposed convolutional layers do the exact opposite of convolutional layers.
102 | They take something narrow, and make it larger and wider.
--------------------------------------------------------------------------------
/MathNotes/IntroProjectiveGeometry.rst:
--------------------------------------------------------------------------------
1 | ###################################
2 | Introduction to Projective Geometry
3 | ###################################
4 | 
5 | Projective geometry models the imaging process of the camera, that is, the
6 | transformation of that which is in 3d to 2d. This is an important capacity
7 | because several properties which hold in Euclidean geometry do not
8 | apply to projective geometry. For example, in projective geometry, parallel
9 | lines can intersect, due to perspective.
10 | Projective transformations conserve type (points stay points, lines stay lines,
11 | etc.), incidence (whether a point lies on a line or not), and a measure called
12 | cross ratio.
13 | 
14 | Geometry Relations
15 | 
16 | +----------------------------+-----------+------------+--------+------------+
17 | |                            | Euclidean | similarity | affine | projective |
18 | +----------------------------+-----------+------------+--------+------------+
19 | | Transformations                                                           |
20 | +----------------------------+-----------+------------+--------+------------+
21 | | rotation                   | X         | X          | X      | X          |
22 | +----------------------------+-----------+------------+--------+------------+
23 | | translation                | X         | X          | X      | X          |
24 | +----------------------------+-----------+------------+--------+------------+
25 | | uniform scaling            |           | X          | X      | X          |
26 | +----------------------------+-----------+------------+--------+------------+
27 | | nonuniform scaling         |           |            | X      | X          |
28 | +----------------------------+-----------+------------+--------+------------+
29 | | shear                      |           |            | X      | X          |
30 | +----------------------------+-----------+------------+--------+------------+
31 | | perspective projection     |           |            |        | X          |
32 | +----------------------------+-----------+------------+--------+------------+
33 | | composition of projections |           |            |        | X          |
34 | +----------------------------+-----------+------------+--------+------------+
35 | | Invariants                                                                |
36 | +----------------------------+-----------+------------+--------+------------+
37 | | length                     | X         |            |        |            |
38 | +----------------------------+-----------+------------+--------+------------+
39 | | angle                      | X         | X          |        |            |
40 | +----------------------------+-----------+------------+--------+------------+
41 | | ratio of lengths           | X         | X          |        |            |
42 | +----------------------------+-----------+------------+--------+------------+
43 | | parallelism                | X         | X          | X      |            |
44 | +----------------------------+-----------+------------+--------+------------+
45 | | incidence                  | X         | X          | X      | X          |
46 | +----------------------------+-----------+------------+--------+------------+
47 | | cross ratio                | X         | X          | X      | X          |
48 | +----------------------------+-----------+------------+--------+------------+
49 | 
50 | 
51 | Concepts
52 | =========
53 | 
54 | 
55 | Homogeneous coordinates
56 | -----------------------
57 | 
58 | We have a point (x,y) in the Euclidean plane. To represent this point in the projective
59 | plane, we simply add a third coordinate of 1 at the end: (x,y,1).
60 | Scaling is unimportant, so the point (x, y, 1) is also called the augmented vector.
61 | This means we can simply represent a point in the projective plane as (ax, ay, a).
62 | Since scaling is unimportant these points are called homogeneous points.
63 | "a" is the scale factor that can not be 0. In order to find the Euclidean point
64 | from a projective point we simply divide the projective point by its last
65 | dimension:
66 | 
67 | - Euclidean : (x,y) -> Projective (Gx,Gy, G) -> Euclidean (Gx/G, Gy/G, G/G)
68 | 
69 | When the last element G is 0, the homogeneous point is called a point at infinity,
70 | or an ideal point. These points do not have an equivalent representation in the Euclidean plane
71 | 
72 | 
73 | How do we represent a line in projective space:
74 | 
75 | - We start with the simple formula ax + by + c = 0
76 | - Essentially this is the dot product of 2 vectors
77 | :math:`v_1= [x, y, 1]` and :math:`v_2=[a, b, c]`
78 | - Notice that :math:`v_1` is the augmented vector and
79 | :math:`v_2` is simply another homogeneous point
80 | - So a line in 2d projective space is essentially the dot product of
81 | an augmented vector with a homogeneous point, set equal to 0
82 | 
83 | We can find where two lines intersect on the 2d projective plane
84 | by taking the cross product of the two line vectors
85 | 
86 | Ex.
87 | :math:`v_1 = [2,6,2]` and :math:`v_2 = [4,8,5]`
88 | they intersect at the point :math:`\tilde{x}`.
89 | The condition is that the dot product of :math:`\overline{x}` with :math:`v_1`
90 | or with :math:`v_2` needs to be equal to 0.
91 | Notice the relation between the augmented vector and the homogeneous point
92 | :math:`\overline{x} = {\frac{\tilde{x}}{w}}`
93 | 
94 | This is also the same as :math:`v_1 {\times} v_2 = \tilde{x}`
--------------------------------------------------------------------------------
/MathNotes/N-Dimensional-Geometry.md.html:
--------------------------------------------------------------------------------
1 | # N-Dimensional Geometry
2 | 
3 | Notes from Murty, K. (2001) Computational and Algorithmic Linear Algebra and
4 | n-Dimensional Geometry. Ann Arbor. Chapter 3
5 | 
6 | Let $S \subset R^n$ and $x \in R^n$, the translation of S to x is the set of
7 | points: $ \{ x + y | y \in S \} $
8 | 
9 | Parametric representation of a point in $S \subset R^n$ is a vector of
10 | functions, such as:
11 | $x = (x_1, x_2, \dots, x_n)^T
12 | = (f_1(a_1, \dots, a_r), \dots, f_n(a_1, \dots, a_r) ) $
13 | 
14 | With no constraints the points in S can take any real number. We can also
15 | introduce constraints on the functions to determine the points in S, so something
16 | like the following is possible:
17 | 
18 | $S = \{
19 | (f_1(a_1, \dots, a_r), \dots, f_n(a_1, \dots, a_r) )_j |
20 | j \in \{1,\dots, n \},
21 | a \in C(a) \}$
22 | where $C(a)$ represents a subset of $R^n$ that satisfies a set of constraints.
23 | 
24 | A line in N dimensions is:
25 | $ x = a + \alpha b$ where $a, b \in R^n$ and $b \not = 0$, meaning that at least
26 | one component of the vector b must be a nonzero value.
27 | Notice that it has a single parameter $\alpha$ which can take all real values
28 | from $R$, that is, it is a scalar value that gets multiplied with each
29 | component of the vector b.
30 | 
31 | In order to check whether a point is on a line or not:
32 | 
33 | ```Python
34 | 
35 | def in_line(a: List[float],
36 |             b: List[float],
37 |             alpha: float,
38 |             x: List[float]) -> bool:
39 |     """
40 |     Checks if the point x is on the line (a + b * alpha)
41 |     """
42 |     check = True
43 |     for i in range(len(a)):
44 |         if (a[i] + b[i] * alpha) != x[i]:
45 |             check = False
46 |     return check
47 | ```
48 | Check whether there is a line between two points:
49 | 
50 | ```Python
51 | 
52 | def have_a_line(a: List[float], c: List[float]) -> bool:
53 |     """
54 |     We check if a line might exist between these two points
55 |     and try to obtain the parametric representation (a + b * alpha = c),
56 |     since b = c - a at alpha = 1
57 |     """
58 |     if a == c:
59 |         return False
60 |     b = [c[i] - a[i] for i in range(len(a))]
61 |     return any(b[i] != 0 for i in range(len(b)))  # b must have a nonzero component
62 | ```
63 | 
64 | Two parametric representations of lines $U$ and $V$ in $R^n$ correspond to the
65 | same line if the following conditions hold true:
66 | 
67 | - $U = \{ x = a + \alpha * b | \alpha \in R, a, b \in R^n, b \not = 0 \}$
68 | - $V = \{ x = c + \beta * d | \beta \in R, c, d \in R^n, d \not = 0 \}$
69 | 
70 | - $d$ is a scalar multiple of $b$
71 | - $a - c$ is a scalar multiple of $b$
72 | 
73 | ```Python
74 | 
75 | def check_parametric_same_line(a: List[float], alpha: float,
76 |                                b: List[float],
77 |                                c: List[float], beta: float,
78 |                                d: List[float]) -> bool:
79 |     """
80 |     Two parametric representations of lines $U$ and $V$ in $R^n$ correspond to
81 |     the same line if the following conditions hold true:
82 |     $d$ is a scalar multiple of $b$
83 |     $a - c$ is a scalar multiple of $b$
84 |     """
85 |     def is_scalar_multiple(u: List[float], v: List[float]) -> bool:
86 |         # u is a scalar multiple of v if the ratio u[i] / v[i] is the same
87 |         # wherever v[i] != 0, and u[i] == 0 wherever v[i] == 0
88 |         prev = None
89 |         for i in range(len(v)):
90 |             if v[i] == 0:
91 |                 if u[i] != 0:
92 |                     return False
93 |             elif prev is None:
94 |                 prev = u[i] / v[i]
95 |             elif u[i] / v[i] != prev:
96 |                 return False
97 |         return True
98 | 
99 |     # check if 'd' is a scalar multiple of 'b'
100 |     if not is_scalar_multiple(d, b):
101 |         return False
102 | 
103 |     # check if 'a - c' is a scalar multiple of 'b'
104 |     diff = [a[i] - c[i] for i in range(len(a))]
105 |     return is_scalar_multiple(diff, b)
106 | ```
107 | 
108 | Half lines and rays in N dimensions:
109 | 
110 | The parametric representation of a half line is
111 | $\{ x = a + \alpha b | \alpha \ge k \}$ where
112 | 
113 | - $b \not = 0$,
114 | - $a,b \in R^n$
115 | - k is a constant value.
116 | 
117 | Notice that this includes a set of points that is a subset of the line's
118 | representation, which does not impose a condition on $\alpha$.
119 | In this context:
120 | 
121 | - $a$ is the starting point of the half-line
122 | - $b$ is the direction vector of the half-line
123 | 
124 | Using this representation a half line can also be characterized as a function,
125 | such as,
126 | $ x = x(\alpha)
127 | = \{ x_1(\alpha), x_2(\alpha), \dots, x_n(\alpha) | \alpha \ge 0 \} $
128 | where $ x(\alpha) = a + \alpha b $
129 | when we say a half line begins at $a$, we assume that $\alpha = 0$ and that
130 | $b$ is the direction vector.
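As a small, hypothetical helper in the same style as the snippets above (it assumes `from typing import List`, which the earlier functions also rely on), membership on a half line can be checked like this:

```Python

def on_half_line(a: List[float],
                 b: List[float],
                 x: List[float]) -> bool:
    """
    Sketch (not from the book): checks whether x lies on the half line
    { a + alpha * b | alpha >= 0 } by solving for alpha on a nonzero
    component of b and verifying it on the remaining components.
    """
    k = next((i for i in range(len(b)) if b[i] != 0), None)
    if k is None:
        return False  # b = 0 does not define a half line
    alpha = (x[k] - a[k]) / b[k]
    if alpha < 0:
        return False  # x is on the line but behind the starting point a
    return all(a[i] + alpha * b[i] == x[i] for i in range(len(a)))
```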
131 | 132 | 133 | A ray is a half line that starts at 0, meaning that 134 | The parametric representation of a ray is: 135 | 136 | $\{ x = a + \alpha b | \alpha \ge k \}$ where 137 | 138 | - $b \not = 0$ 139 | - $b \in R^n$ 140 | - $a = \{ 0, 0, \dots, 0 \} \in R^n $ 141 | - $k$ is a constant value 142 | 143 | 144 | For every point that is not equal to all zeros, that is 145 | $p = \{p_0, p_1, \dots, p_i, \dots \} \in R^n$ where $p_i \not = 0$ 146 | there is a unique ray which can be represented as: 147 | $\{x = \alpha p : \alpha \ge 0 \}$ $p$ here is called the direction of the 148 | ray. Notice that the same term corresponds to direction vector in the half 149 | line equation. 150 | 151 | Let's us consider the following half line X: 152 | - $ \{ x = a + \alpha b | \alpha \ge k \} $ 153 | 154 | This half line begins at $a$ meaning that $\alpha = 0$ 155 | 156 | Let's us consider the following ray Y: 157 | - $\{ x = a + \beta b | \beta \ge k \}$, where $a = \{0, 0, \dots, 0\} \in R^n$ 158 | 159 | In this case X is parallel to Y. The point $x$ on half line that begins in $a$ 160 | is translated by $\alpha / \beta$ amount in the direction of $b$. 161 | 162 | p. 285 let $a, b \in R^n$ 163 | -------------------------------------------------------------------------------- /BookNotes/ElementsOfProbabilityAndStatistics.rst: -------------------------------------------------------------------------------- 1 | ############################################################## 2 | Elements of Probability and Statistics by Biagini & Campanino 3 | ############################################################## 4 | 5 | Reference: 6 | Biagini, Francesca, and Massimo Campanino, Elements of Probability and Statistics: An Introduction to Probability with de Finetti’s Approach and to Bayesian Statistics (Switzerland, 2016) 7 | 8 | Random Numbers 9 | =============== 10 | 11 | Introduction 12 | -------------- 13 | 14 | Numbers can be used to represent values. However the catch is that these 15 | values are not necessarily known. For example, you toss a coin, you can 16 | represent the result as number but you don't know the result yet, so its 17 | content is not known yet. 18 | 19 | However even if you don't know the exact value yet, you know *possible values* 20 | that can be the result, that is you know the *set of possible values* for the 21 | given element. 22 | 23 | These are called random numbers and they are most of the time represented with 24 | a capital letter. For example let the sides of a dice is A. We would write 25 | :math:`I(A) = {1,2,3,4,5,6}`: 26 | 27 | - :math:`I()` denotes the set of possible values 28 | 29 | - :math:`I(A)` is *upper bounded* if :math:`I(A) < + \infty` 30 | - :math:`I(A)` is *lower bounded* if :math:`I(A) > - \infty` 31 | - :math:`I(A)` is *bounded* if :math:`I(A) > - \infty, I(A) < + \infty` 32 | 33 | Two random numbers A and B are logically independent if 34 | 35 | - :math:`I(A,B) = I(A) \times I(B)`: the product being a cartesian product. 36 | 37 | Invalid Example: 38 | 39 | - I have a pack of balls, containing 4 balls, with 4 different colors. I 40 | pick a ball and pick another ball without putting back the first ball. 41 | 42 | Valid Example: 43 | 44 | - I have a pack of balls, containing 4 balls, with 4 different colors. If I 45 | pick a ball and I put the ball back and then I pick another ball. 

The following operations are possible for random numbers:

1. :math:`A \lor B = \max(A, B)`
2. :math:`A \land B = \min(A, B)`
3. :math:`A \land B = AB` (for events, as defined below)
4. :math:`\sim B = 1 - B`

The first two operations are distributive, associative and commutative. That
is they are:

- Distributive:
  :math:`A \lor (B \land C) = (A \lor B) \land (A \lor C)`
  :math:`A \land (B \lor C) = (A \land B) \lor (A \land C)`

- Associative:
  :math:`A \lor (B \lor C) = (A \lor B) \lor C`
  :math:`A \land (B \land C) = (A \land B) \land C`

- Commutative:
  :math:`A \lor B = B \lor A`
  :math:`B \land A = A \land B`

Plus they have the following properties:

- :math:`\sim \sim B = B`
- :math:`\sim (A \land B) = \sim A \lor \sim B`
- :math:`\sim (A \lor B) = \sim A \land \sim B`

Events
-------

Events are particular random numbers. They can have two values, thus they are
boolean in nature: :math:`I(E) = \{0, 1\}`. They have either happened, or they
have not happened.

- The operation :math:`\land` is called the *logical product*.
- The operation :math:`\lor` is called the *logical sum*.

In the case of events:

- :math:`E_1 \lor E_2 = E_1 + E_2 - (E_1 \land E_2)`
- :math:`E_1 \land E_2 = E_1 \times E_2`

Notice that since events are random *numbers*, there is nothing weird about
adding or subtracting them. The fact that we don't know their value does not
change the fact that they can be added or subtracted.

The complementary of an event E is :math:`\sim E = 1 - E`

The following also holds for events, due to the fact that they are random
numbers:

.. math::

   (E_1 \lor E_2)^{\sim} = \sim E_1 \land \sim E_2
   = (1 - E_1) \times (1 - E_2)
   = 1 - E_1 - E_2 + (E_1 \times E_2)
   = 1 - (E_1 + E_2 - E_1 \times E_2)
   = 1 - (E_1 \lor E_2)

As you can see this is coherent with the complementary rule given above.

The difference of events is the following:

- :math:`E_1 \setminus E_2 = E_1 - (E_1 \times E_2)`

Symmetric difference of events:

- :math:`E_1 \triangle E_2 = (E_1 \setminus E_2) \lor (E_2 \setminus E_1) = (E_1 + E_2) \bmod 2`

If the value of the random number, the event, is equal to 1, we say that the
event happens; when it is equal to 0, we say that it does not happen.

The logical sum of two events happens (is true) if at least one of the events
happens.

The logical product of two events happens (is true) if both events happen.

The complementary event of an event E happens only if E does not happen.

The subset relation :math:`E_1 \subset E_2` implies that if E1 happens, so
does E2.

The operator :math:`\vdash` marks the truth value of the proposition as true.
It is thus a logical operator.
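
Because events are just 0/1 numbers, these identities can be checked
mechanically; a minimal sketch (my own helper, not from the book):

.. code:: python

    from itertools import product

    def check_event_identities():
        """Verify the event identities over all 0/1 assignments."""
        ok = True
        for e1, e2 in product((0, 1), repeat=2):
            lor = max(e1, e2)    # logical sum
            land = min(e1, e2)   # logical product
            ok &= lor == e1 + e2 - e1 * e2
            ok &= land == e1 * e2
            ok &= 1 - lor == (1 - e1) * (1 - e2)  # De Morgan
        return ok

    print(check_event_identities())  # True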
The following relations apply to events:

- *incompatibility*: :math:`\vdash E_1 \land E_2 = 0`, meaning that if one
  event happens the other cannot happen.

- *exhaustivity*: :math:`E_1 + E_2 + \dots + E_n \ge 1`.

- *partition*: :math:`E_1 + E_2 + \dots + E_n = 1`; the events are said to
  form a partition if they are exhaustive and pairwise incompatible.

page 6 given n events constituent
--------------------------------------------------------------------------------
/TopicNotes/BayesNets/subtitles/3 - Challenge Question Solution - lang_en_vs52.srt:
--------------------------------------------------------------------------------
1
00:00:00,280 --> 00:00:04,282
The answer is A, 0.7445.

2
00:00:04,282 --> 00:00:08,106
Even though the Bayes net is simple the
answer will require a bit more work than

3
00:00:08,106 --> 00:00:10,280
many of our challenge questions.

4
00:00:10,280 --> 00:00:13,660
To be clear we'll be using capital
letters to indicate our variables.

5
00:00:13,660 --> 00:00:17,010
We're using lowercase letters to
indicate when that variable is true,

6
00:00:17,010 --> 00:00:20,510
or a 'not' in front of it to
indicate when it's not true.

7
00:00:20,510 --> 00:00:23,070
To solve this problem we have to
determine the ratio of times o2

8
00:00:23,070 --> 00:00:26,900
is true given o1 is true,
to all situations where just o1 is true.

9
00:00:27,920 --> 00:00:32,590
In other words, we sum over all possible
graduation situations, where o2 and

10
00:00:32,590 --> 00:00:34,440
o1 are true for the numerator.

11
00:00:34,440 --> 00:00:37,470
Which is simply,
the probability of o1 being true,

12
00:00:37,470 --> 00:00:40,500
o2 being true and graduation being true.

13
00:00:40,500 --> 00:00:45,620
Plus the probability of o1 being true, o2
being true and graduation being false.

14
00:00:45,620 --> 00:00:48,399
And we're going to
normalize by this number,

15
00:00:48,399 --> 00:00:50,972
plus the situation where o2 is negative.

16
00:00:50,972 --> 00:00:54,348
And that is the probability of o1, not o2,

17
00:00:54,348 --> 00:00:58,800
and g, plus the probability of o1,
not o2, and not g.

18
00:01:00,130 --> 00:01:04,354
Looking at the first part of the
numerator, probability of o1, o2, and

19
00:01:04,354 --> 00:01:07,170
g, that is simply
the probability of o1 given g,

20
00:01:07,170 --> 00:01:10,945
times the probability of o2 given g,
times the probability of g.

21
00:01:10,945 --> 00:01:13,790
And we can just read this off
the Bayes network over here.

22
00:01:14,790 --> 00:01:17,739
The first number,
the probability of o1 given g is 0.05.

23
00:01:18,890 --> 00:01:22,584
Probability of o2 given g is just 0.75.

24
00:01:22,584 --> 00:01:26,094
And the probability of g is just 0.9.

25
00:01:26,094 --> 00:01:29,868
We multiply that out and it's 0.3375.

26
00:01:29,868 --> 00:01:32,540
Next we'll consider the second
part of the numerator,

27
00:01:32,540 --> 00:01:35,520
the probability of o1, o2, and not g.

28
00:01:35,520 --> 00:01:38,559
Well that's simply the probability
of o1 given not g,

29
00:01:38,559 --> 00:01:42,119
the probability of o2 given not g,
and the probability of not g.

30
00:01:42,119 --> 00:01:44,772
And again we'll just read off our chart.

31
00:01:44,772 --> 00:01:49,733
It's 0.05 for
a probability of o1 given not g,

32
00:01:49,733 --> 00:01:54,941
0.25, which is a probability
of o2 given not g.

33
00:01:54,941 --> 00:01:59,239
And then the probability of not g,
which is just the complement of this,

34
00:01:59,239 --> 00:02:00,305
which is 0.1.

35
00:02:00,305 --> 00:02:03,412
That comes out to be 0.00125.

36
00:02:03,412 --> 00:02:06,550
Now we'll consider the last
two parts of the equation.

37
00:02:06,550 --> 00:02:08,990
We already have these
two from our numerator.

38
00:02:08,990 --> 00:02:12,469
So now we need to look at
probability of o1, not o2, and g,

39
00:02:12,469 --> 00:02:17,181
which is just equal to probability of o1
given g, probability of not o2 given g,

40
00:02:17,181 --> 00:02:18,510
and probability of g.

41
00:02:18,510 --> 00:02:21,550
Again, we can just read these
numbers off our Bayes net.

42
00:02:21,550 --> 00:02:24,010
It comes up to be 0.1125.

43
00:02:24,010 --> 00:02:28,150
The final one, the probability
of o1, not o2 and not g.

44
00:02:28,150 --> 00:02:33,706
Again we just use the same equations and
we get 0.00375.

45
00:02:33,706 --> 00:02:36,632
Now we can take all these
numbers we have just calculated,

46
00:02:36,632 --> 00:02:41,000
put it into our equation, and
do the substitution and get our answer.

47
00:02:41,000 --> 00:02:44,064
Doing the simple math
we end up with 0.7445,

48
00:02:44,064 --> 00:02:46,068
which is what we were hoping for.

49
00:02:46,068 --> 00:02:50,231
We calculated this answer by summing up
the results for all relevant situations.

50
00:02:50,231 --> 00:02:52,247
But we can also do
inference by sampling,

51
00:02:52,247 --> 00:02:54,720
which can handle much bigger networks.

52
00:02:54,720 --> 00:02:55,580
Please be on the lookout for

53
00:02:55,580 --> 00:02:58,120
all the different ways we can
do inference in Bayes networks.
--------------------------------------------------------------------------------
/TopicNotes/Planning/PlanningMDP.rst:
--------------------------------------------------------------------------------
##############################
Planning Under Uncertainty
##############################

We have seen different types of environments:

+------------+---------------+------------+
|            | Deterministic | Stochastic |
+============+===============+============+
| Fully      | A*, DFS, BFS, | MDP        |
| Observable | UniformCS     |            |
+------------+---------------+------------+
| Partially  |               | POMDP      |
| Observable |               |            |
+------------+---------------+------------+

Remember:

Stochastic: we don't know for sure the result of our actions.

Deterministic: the outcome of an action is always the same and predictable.

Fully Observable: every state is visible from the current state, so you only
need momentary sensory input to make a decision.

Partially Observable: you also need a memory to make a decision.

MDPs (Markov Decision Processes)
================================

Markov Decision Processes are composed of the following:

- States: :math:`\{s_1, s_2, s_3, \dots, s_n\}`
- Actions: :math:`\{a_1, a_2, a_3, \dots, a_n\}`
- State Transition Matrix: :math:`T(s, a, s') = P(s'|a, s)`:
  the matrix gives the probability of moving to another state
  given the action and the current state.
- A reward function associated to states, used to designate a goal.
- A policy: an action associated to each state, leading towards the goal.
- The planning problem is about finding a good and efficient policy.

The cost function in MDPs can be defined as follows:

Reward(s) = {Goal reward (a large positive value),
Avoid rewards (large negative values),
Step cost (a small negative value)}

The objective of an MDP can be defined as follows:

:math:`objective = \max E\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]`

- t: time value
- :math:`\gamma`: discount factor, a value between 0 and 1: :math:`0 < \gamma < 1`
- :math:`R_t`: the result of the cost function at time t.

The objective is to maximise the expected sum of the discounted rewards.
The discount factor decays future rewards relative to more immediate rewards;
it is an alternative way to specify a step cost, and it is basically there to
give an incentive to reach the goal as fast as possible.
The mathematically convenient thing about the discount factor is that it
keeps the sum bounded. With the discount factor it is easy to show the
following bound:

:math:`\sum_{t=0}^{\infty} \gamma^t R_t \le \frac{1}{1-\gamma} \times R_{max}`

:math:`R_{max}` is in most cases the goal reward.

Value Iteration
----------------

Once we define a cost function, which in turn helps us define an objective,
we are ready to assign a value to each state.

The value in this case is:

:math:`V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, s_0 = s\right]`

The expression looks complex but it is simple to decompose: it says that the
value of a state is the expected sum of rewards we obtain by executing the
policy pi, given that the initial state of the policy is the current state.
So basically :math:`V^{\pi}` says we are operating under the policy pi, and
the same goes for :math:`E_{\pi}`.
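
A tiny sketch (my own example, not from the course) of the discounted sum
that the objective maximizes:

.. code-block:: python

    def discounted_return(rewards, gamma):
        """Sum of gamma^t * R_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # three steps of cost -3, then the goal reward of +100
    print(discounted_return([-3, -3, -3, 100], gamma=0.9))  # ~64.77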
The value function is a potential function that leads from the goal location
all the way into the state space, so that hill climbing on this potential
function leads you on the shortest path to the goal.

Here is an example of its application:

+----+---+---+---+------+
|    | 1 | 2 | 3 | 4    |
+====+===+===+===+======+
| a  | 0 | 0 | 0 | +100 |
+----+---+---+---+------+
| b  | 0 | X | 0 | -100 |
+----+---+---+---+------+
| c  | 0 | 0 | 0 | 0    |
+----+---+---+---+------+

Now we ask the question:

Is 0 a good value for field a3?

The answer is no, because we can compute a better one, and the reason is
evident in the value function.

For a state space in which:

Reward(s) = {Goal reward: +100
Avoid rewards: -100
Step cost: -3}

With the transition model:

- going east: 0.8 chance
- going south: 0.1 chance
- staying in the same state: 0.1 chance

V(a3, E) = 0.8 * 100 - 3 = 77

E: east, denoting going east. (The south and stay outcomes contribute 0 here,
since those neighbouring values are still 0.)

We see thus that for the current state 0 is not a good value; 77 is a better
value. If we iterate like this until the value function converges, we obtain
the following matrix:

+----+----+----+----+------+
|    | 1  | 2  | 3  | 4    |
+====+====+====+====+======+
| a  | 85 | 89 | 93 | +100 |
+----+----+----+----+------+
| b  | 81 | X  | 68 | -100 |
+----+----+----+----+------+
| c  | 77 | 73 | 70 | 47   |
+----+----+----+----+------+

The algorithm is called value iteration.
From AIMA-python:

.. code-block:: python

    def value_iteration_instru(mdp, iterations=20):
        U_over_time = []  # list of state-value dictionaries, one per iteration
        U1 = {s: 0 for s in mdp.states}  # state value dictionary
        R, T, gamma = mdp.R, mdp.T, mdp.gamma
        # R is the reward function; T is the transition model, giving
        # P(next_state | current_state, current_action)
        for _ in range(iterations):
            U = U1.copy()
            for s in mdp.states:
                # by maxing over every action of s we are guaranteed
                # to follow the best policy
                U1[s] = R(s) + gamma * max([sum([p * U[s1] for (p, s1) in T(s, a)])
                                            for a in mdp.actions(s)])
            U_over_time.append(U)
        return U_over_time
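
A hypothetical usage sketch for the function above; the ``SimpleMDP`` class
is my own stand-in for the AIMA ``MDP`` interface (``states``, ``actions``,
``R``, ``T``, ``gamma``):

.. code-block:: python

    class SimpleMDP:
        """A two-state toy MDP matching the interface used above."""
        def __init__(self):
            self.states = ['s0', 'goal']
            self.gamma = 0.9

        def actions(self, s):
            return ['go']

        def R(self, s):
            return 100 if s == 'goal' else -3

        def T(self, s, a):
            # list of (probability, next_state) pairs
            return [(0.8, 'goal'), (0.2, s)] if s == 's0' else [(1.0, s)]

    snapshots = value_iteration_instru(SimpleMDP(), iterations=50)
    print(snapshots[-1])  # state values after (near) convergence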

POMDP (Partially Observable Markov Decision Processes)
=======================================================

POMDPs address the problem of optimal exploration versus exploitation: there
are information-gathering actions as well as goal-driven actions.

--------------------------------------------------------------------------------
/TopicNotes/Logic/courseNotes.rst:
--------------------------------------------------------------------------------
#######################
`Logic`_
#######################

Expert systems:

- Supply


Future of planning algorithms?

- Learning from examples
- Transfer learning
  - We learn in one area and apply it to another.
- Interactive planning:
  - Human - machine teams solve problems together:
    * What is the best interface?


Understand the following algorithms:

Resolution algorithm:

It is an elegant way of inferring new knowledge from a knowledge base.

Graphplan

Value iteration for Markov decision processes.

Propositional logic
---------------------

Propositional symbols such as
B, E, A, M, J

corresponding to:

B
  Burglary occurring

E
  Earthquake occurring

A
  Alarm occurring

M
  Mary calling

J
  John calling

These can be either true or false, or unknown.

We can make logical sentences by combining them with logical operators. Ex:

We can say that:

- Alarm is true when:
  * Burglary occurring is true
  * Earthquake occurring is true

The equivalent in propositional logic would be :math:`E \lor B \Rightarrow A`

We can say that when the alarm occurs, John and Mary would call:
:math:`A \Rightarrow (J \land M)`

We can say John calls if and only if Mary calls:
:math:`J \Leftrightarrow M`

We can say John calls if and only if Mary does not call:
:math:`J \Leftrightarrow \sim M`

A propositional sentence is either true or false with respect to a model of
the world.

A model is just a set of true or false values for all the propositional
symbols. Ex.
model = { B:True, A:False, J:True, ... }

We can define the truth of a sentence in terms of the truth of the symbols
with respect to the models, using truth tables.

Truth tables
----------------

Truth tables list all the possibilities for the propositional symbols:

| p     | q     | not p | p ^ q | p v q | p --> q | p <--> q |
|-------|-------|-------|-------|-------|---------|----------|
| false | false | true  | false | false | true    | true     |
| false | true  | true  | false | true  | true    | false    |
| true  | false | false | false | true  | false   | false    |
| true  | true  | false | true  | true  | true    | true     |

A **valid sentence** is **true in every possible model**, for every possible
combination of values of the propositional symbols.
A **satisfiable sentence** is one that is true in some model.
An **unsatisfiable sentence** is false in every possible model.
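
A small sketch (my own, with hypothetical sentence functions) that classifies
a sentence by enumerating every model:

.. code-block:: python

    from itertools import product

    def classify(sentence, symbols):
        """Return 'valid', 'satisfiable' or 'unsatisfiable'."""
        results = []
        for values in product((False, True), repeat=len(symbols)):
            model = dict(zip(symbols, values))
            results.append(sentence(model))
        if all(results):
            return 'valid'
        if any(results):
            return 'satisfiable'
        return 'unsatisfiable'

    # P v ~P is valid, P ^ ~P is unsatisfiable
    print(classify(lambda m: m['P'] or not m['P'], ['P']))
    print(classify(lambda m: m['P'] and not m['P'], ['P']))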

Limitations of propositional logic
-----------------------------------

- It can handle only true or false values:
  - There is no way to handle uncertainty as in probability theory.
  - There is no way to model objects with continuous properties like size,
    colour, etc.
- There are no shortcuts; to make a statement about a space you need to
  invoke every element of that space:
  - To say that all the rooms in the house are clean, you would need to
    invoke all the rooms in the house and relate them to each other with
    respect to their current state.


First Order Logic
=====================

We shall compare three approaches that model our world, namely:

- Propositional logic
- First order logic
- Probability theory

We shall compare their ontological commitment with respect to our world. We
shall also compare the beliefs the agents might have within these
approaches, that is, their epistemological commitment.

The ontological commitment of first order logic is that there are:

- relations between things in the world
- objects
- functions on those objects

in the world.

The epistemological commitment of first order logic is that agents can believe:

- True
- False
- Unknown

states of those things in the world.

The ontological commitment of propositional logic is that there are:

- facts

in the world.

The epistemological commitment of propositional logic is that agents can believe:

- True
- False
- Unknown

states of those facts in the world.

The ontological commitment of probability theory is that there are:

- facts

in the world.

The epistemological commitment of probability theory is that agents can believe:

- degrees of belief :math:`B \in [0, 1]`

in those facts in the world.

Problem solving used an **atomic representation** of the world, in which each
state is indivisible: you can only check whether it is the same as another
state, or whether it is the goal state or not.

In propositional logic and probability theory we use a **factored
representation** of the world, that is, we divide the world into a set of
facts that are true or false.

The most complex type of representation is called a **structured
representation**, in which an individual state is not just a set of values
for variables; it can include relationships between objects, a branching
structure, etc. This is what we see in traditional programming languages and
databases, and it comes with first order logic.


First Order Logic
====================

We have a set of:

- objects
- constants that can refer to objects
- functions mapping objects to objects
- relations

Syntax in first order logic
-----------------------------

We have sentences that describe facts that are true or false.

Atomic sentences are the predicates corresponding to relations:

- ABOVE(A,B)
- Vowel(A,B)
- A = B

We can also use the operators from propositional logic to combine sentences
of first order logic.

There are also quantifiers:

- :math:`{\forall}x` means for all x.
- :math:`{\exists}y` means there exists a y.

--------------------------------------------------------------------------------
/TopicNotes/CapsNets/capsnets.rst:
--------------------------------------------------------------------------------
#########
CapsNet
#########

Intuition
===========

A drawback of CNNs is that they cannot model spatial relationships between
features.

We can summarize this by saying that CNNs work fundamentally on 2D objects,
and thus model 2D information. What's important is to model 3D
relationships, that is, the position of features with respect to each other.

The thing is, for an object this relative positioning is an invariant.
For example your viewpoint can change, but that does not change the body
proportions of the object you are looking at.

CNNs lose this information due to the use of max pooling.

Let's recap how CNNs work.

CNNs have three types of layers:

- Convolution layers
- Pooling layers
- Fully connected (dense) layers

Convolution layers

Basically a convolution layer is constructed by a convolutional window/frame,
which is like a mask that passes over the image horizontally and vertically.
Each placement defines a node, and the resulting nodes are grouped in a layer
called a convolutional layer.

The value of a convolutional layer node is computed as in any linear
regression: we multiply the pixel values with the weights and sum everything
up, including a bias term. Then we apply an activation function like relu.
The weights are usually represented inside the convolution window.

Pooling layers

They take convolutional layers as input and reduce their dimensionality,
either by taking the maximum value in each filter, which keeps the number of
filters the same while reducing height and width, or by averaging whole
feature maps.

Fully connected layers

They take all the inputs from the previous layer and apply a linear
transformation to them.

The problem for CNNs is:
the internal data representation of a convolutional neural network does not
take into account important spatial hierarchies between simple and complex
objects.

What does the brain do?
From the visual information received by the eyes, the brain builds a
hierarchical representation of the world around us and tries to match it
with already learned patterns and relationships stored in the brain. This is
how recognition happens.

What do capsules do?
They incorporate relative relationships between objects, represented
numerically as a 4D pose matrix.

From Hinton et al.:

Instead of aiming for viewpoint invariance in the activities of “neurons” that
use a single scalar output to summarize the activities of a local pool of
replicated feature detectors, artificial neural networks should use local
“capsules” that perform some quite complicated internal computations on their
inputs and then encapsulate the results of these computations into a small
vector of highly informative outputs. Each capsule learns to recognize an
implicitly defined visual entity over a limited domain of viewing conditions
and deformations and it outputs both the probability that the entity is
present within its limited domain and a set of “instantiation parameters” that
may include the precise pose, lighting and deformation of the visual entity
relative to an implicitly defined canonical version of that entity. When the
capsule is working properly, the probability of the visual entity being
present is locally invariant — it does not change as the entity moves over the
manifold of possible appearances within the limited domain covered by the
capsule. The instantiation parameters, however, are “equivariant” — as the
viewing conditions change and the entity moves over the appearance manifold,
the instantiation parameters change by a corresponding amount because they are
representing the intrinsic coordinates of the entity on the appearance
manifold.

Capsules encode the probability of detection of a feature as the length of
their output vector, and the state of the detected feature as the direction
in which that vector points ("instantiation parameters"). So when the
detected feature moves around the image, or its state somehow changes, the
probability stays the same (the length of the vector does not change), but
its orientation changes.
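
A minimal sketch (my own illustration) of this encoding: the length of the
output vector is read as a probability, the unit direction as the feature's
state (a nonzero vector is assumed):

.. code:: python

    import math

    def capsule_reading(v):
        """Split a nonzero capsule output into (probability, state)."""
        length = math.sqrt(sum(x * x for x in v))
        direction = [x / length for x in v]
        return length, direction

    # two outputs with the same length but different orientations:
    # same detection probability, different feature state
    print(capsule_reading([0.6, 0.0]))
    print(capsule_reading([0.0, 0.6]))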

Inner working of the Capsules
------------------------------

It is easier to understand how capsules differ from neurons by remembering
what neurons do.

Neurons:

- take a scalar input
- multiply it with a weight
- add a bias to it

- Then we apply a function to map the resulting scalar value to a range of
  probabilities.

A neuron usually takes several scalar inputs: it multiplies each with a
weight, adds a bias to each, and so on.

Capsules:

- Take vectors as input.
- Multiply them with a weight matrix that encodes a spatial relation between
  a lower level feature (nose, mouth) and a higher level feature (face).
  For example, suppose the first weight matrix encodes the relation between
  a nose and a face: the nose is ten times smaller than a face, and its
  direction is the same as the face's, since they lie in the same plane.
  Once we multiply the input with this weight matrix, the output is a
  prediction of the position of the face. So from 3 input vectors (eyes,
  nose, mouth) we obtain 3 predictions about the position of a face (see the
  sketch after this list). Evidently, if the 3 predictions point to the same
  position and state of a face, then the probability of having a face there
  is quite high.

- Scalar weighting of the input vectors:
  Say we have 2 higher level capsules. How does a lower level capsule decide
  which of the two should receive its output? For example, should the nose
  capsule send its output to the human-face capsule or to the dog-face
  capsule? Given that the higher level capsules already receive inputs that
  are close to each other, the problem becomes one of clustering: does the
  output coming from the nose capsule form a cluster with the inputs of the
  human face or of the dog face? This is decided by the scalar weighting.
  The main idea is that scalar multiplication changes the length of a vector
  but not its direction; this is the essence, and it is explained further
  below.

- We sum the weighted output vectors. This is essentially the same as in a
  neuron, where we combine the weighted inputs.

- We squash the resulting vector towards a unit vector. The squashing
  function has two parts, unit scaling and additional squashing. The formula
  is the following:
  :math:`v_i = \frac{\|s_i\|^2}{1 + \|s_i\|^2} \times \frac{s_i}{\|s_i\|}`
  The squashing keeps the direction of the vector while mapping its length
  into the range [0, 1), so the length can be read as a probability.
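
A sketch of the prediction step referenced in the list above (hypothetical
shapes and values; in a real CapsNet the matrix W is learned):

.. code:: python

    def predict_higher_capsule(W, u):
        """u_hat = W * u: a lower capsule's prediction for a higher one."""
        return [sum(W[r][c] * u[c] for c in range(len(u)))
                for r in range(len(W))]

    u_nose = [0.2, 0.8]      # pose vector from the 'nose' capsule
    W_nose_face = [[10, 0],  # 'a face is ten times bigger than a nose'
                   [0, 10]]
    print(predict_higher_capsule(W_nose_face, u_nose))  # [2.0, 8.0]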

Dynamic Routing: Scalar weighting of Input vectors
---------------------------------------------------

The algorithm is the following:

.. code:: python

    import math

    def squashVector(s_j):
        """Squash a vector's length into the range 0 - 1,
        keeping its direction."""
        norm = math.sqrt(sum(x * x for x in s_j))
        if norm == 0:
            return [0.0 for _ in s_j]
        scale = (norm ** 2) / (1 + norm ** 2)
        return [scale * (x / norm) for x in s_j]

    def dotProduct(u, v):
        return sum(a * b for a, b in zip(u, v))

    def softmax(xs):
        exps = [math.exp(x) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    def dynamicRouting(r, u_hat):
        """
        Dynamic routing, following the procedure in
        https://arxiv.org/pdf/1710.09829.pdf:

        for all capsule i in layer l and capsule j in layer (l+1):
            b_ij <- 0
        for r iterations do:
            for all capsule i in layer l: c_i <- softmax(b_i)
            for all capsule j in layer (l+1): s_j <- sum_i(c_ij * u_hat_ij)
            for all capsule j in layer (l+1): v_j <- squash(s_j)
            for all i, j: b_ij <- b_ij + dotProduct(u_hat_ij, v_j)
        return v

        Parameters
        ------------
        r: number of routing iterations
        u_hat: u_hat[i][j] is the prediction vector computed by capsule i
               of the current layer for capsule j of the next layer; the
               two layers are represented implicitly by these indices.
        """
        n_lower = len(u_hat)
        n_upper = len(u_hat[0])
        dim = len(u_hat[0][0])
        # routing logits, initialised to zero (uniform priors)
        b = [[0.0] * n_upper for _ in range(n_lower)]
        v = []
        for _ in range(r):
            # coupling coefficients: softmax of b over the next layer
            c = [softmax(row) for row in b]
            v = []
            for j in range(n_upper):
                s_j = [sum(c[i][j] * u_hat[i][j][k]
                           for i in range(n_lower))
                       for k in range(dim)]
                v.append(squashVector(s_j))
            # agreement update
            for i in range(n_lower):
                for j in range(n_upper):
                    b[i][j] += dotProduct(u_hat[i][j], v[j])
        return v

--------------------------------------------------------------------------------
/TopicNotes/Hyperparameters/CourseNotes.rst:
--------------------------------------------------------------------------------
#########################
Intro to Hyperparameters
#########################

A hyperparameter is a value we set before applying a learning algorithm to a
dataset.

The challenge is that there is no magic way to work with them: no trick or
principle explains universally why a certain hyperparameter value is better
than another. What works depends on the task at hand.

In general hyperparameters fall into two categories:

- Optimizer hyperparameters
- Model hyperparameters

Optimizer Hyperparameters
~~~~~~~~~~~~~~~~~~~~~~~~~~

They relate to the optimization of the training process.
These include:

- learning rate
- mini batch size
- number of training iterations or epochs

Model Hyperparameters
~~~~~~~~~~~~~~~~~~~~~~

These concern the structure of the model:

- the number of layers and hidden units
- model specific parameters


Learning Rate
--------------

It is the single most important hyperparameter, and one should always make
sure that it has been tuned.

If you have normalized the inputs to your model, then a good starting point
is

0.01

A good list of candidate learning rates is the following:

- 0.1
- 0.01
- 0.001
- 0.0001
- 0.00001
- 0.000001

If you try one and the model does not train, you try the others and see how
the model responds.

The learning rate determines the step size of the descent during gradient
descent.
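
A one-line illustration (my own) of how the learning rate scales each update:

.. code-block:: python

    def gradient_step(w, grad, lr):
        """One gradient descent update: the step size is lr * |grad|."""
        return [wi - lr * gi for wi, gi in zip(w, grad)]

    w = [1.0, -2.0]
    grad = [0.5, -1.5]
    print(gradient_step(w, grad, lr=0.01))  # small, safe step
    print(gradient_step(w, grad, lr=1.0))   # large, possibly divergent step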

Here are the possible cases for the learning rate:

- Validation error: decreasing during training,
  due to a reasonable learning rate [✓]
- Validation error: decreasing during training,
  due to a large learning rate [✓]
- Validation error: decreasing too slowly during training,
  due to a too small learning rate [x]
- Validation error: increasing during training,
  due to a too large learning rate [x]

Here is a good technique to deal with these problems:

- Learning rate decay:

  - We decrease the learning rate, e.g. by half every 5 epochs, and see how
    well the model trains.

Minibatch size
---------------

It is the middle way between stochastic training and batch training. In
batch training we send everything at once; in stochastic training we send
the examples one by one.

Increasing it increases the computational requirements; decreasing it slows
down the training process. Good numbers are:

2, 4, 8, 16, 32, 64, 128, 256

One should think about adjusting the learning rate when one changes the mini
batch size.

Number of iterations
---------------------

The metric we should focus on is the validation error. Ideally we allow a
number of iterations large enough for the validation error to keep
decreasing, and stop early when the validation error stops decreasing.

However, one should be flexible in the definition of the stopping trigger,
because the validation error moves back and forth even when it is on a
downward trend.
In most cases, we stop the training if the validation error has not
decreased in the last 10 - 20 steps.
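
A minimal early-stopping sketch of that trigger (my own, with a patience of
10 validation steps):

.. code-block:: python

    def should_stop(val_errors, patience=10):
        """Stop if the error has not improved in the last `patience` steps."""
        if len(val_errors) <= patience:
            return False
        best_before = min(val_errors[:-patience])
        return min(val_errors[-patience:]) >= best_before

    errors = [1.0, 0.8, 0.7, 0.69] + [0.70] * 10
    print(should_stop(errors))  # True: no improvement in the last 10 steps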

Number of Hidden Units
----------------------

The number and architecture of the hidden units are the main measures of a
model's learning capacity.

More capacity might lead to overfitting;
less capacity might lead to underfitting.

Dealing with this can be done by using regularization techniques like:

- L2 regularization
- Dropout

Tips:

- For the first hidden layer:
  - It should be larger than the number of inputs.
- Number of layers:
  - 3 layers will often perform better than 2, but going deeper than 3
    rarely helps.
  - This is not true for CNNs, where depths of 10+ layers give better
    results.

RNN Hyperparameters
--------------------

There are two main parameters:

- Cell type: LSTM, or gated recurrent unit (GRU)
- Depth of the model: how many layers we need

One thing is for sure: LSTMs and GRUs perform better than the vanilla RNN.
The discussion of LSTM versus GRU is not resolved; they depend on the task.

The embedding size for word models should be around 50 - 200.

LSTM vs GRU

"These results clearly indicate the advantages of the gating units over the more
traditional recurrent units. Convergence is often faster, and the final
solutions tend to be better. However, our results are not conclusive in
comparing the LSTM and the GRU, which suggests that the choice of the type of
gated recurrent unit may depend heavily on the dataset and corresponding task."

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

"The GRU outperformed the LSTM on all tasks with the exception of language
modelling"

An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz,
Wojciech Zaremba, Ilya Sutskever

"Our consistent finding is that depth of at least two is beneficial. However,
between two and three layers our results are mixed. Additionally, the results
are mixed between the LSTM and the GRU, but both significantly outperform the
RNN."

Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin
Johnson, Li Fei-Fei

"Which of these variants is best? Do the differences matter? Greff, et al.
(2015) do a nice comparison of popular variants, finding that they’re all about
the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN
architectures, finding some that worked better than LSTMs on certain tasks."

Understanding LSTM Networks by Chris Olah

"In our [Neural Machine Translation] experiments, LSTM cells consistently
outperformed GRU cells. Since the computational bottleneck in our architecture
is the softmax operation we did not observe large difference in training speed
between LSTM and GRU cells. Somewhat to our surprise, we found that the vanilla
decoder is unable to learn nearly as well as the gated variant."

Massive Exploration of Neural Machine Translation Architectures by Denny Britz,
Anna Goldie, Minh-Thang Luong, Quoc Le


Example RNN Architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~

+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Application                            | Cell | Layers  | Size          | Vocabulary                | Embedding | Learning Rate |
|                                        |      |         |               |                           | Size      |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Speech Recognition (large vocabulary)  | LSTM | 5, 7    | 600, 1000     | 82K, 500K                 |           |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Speech Recognition                     | LSTM | 1, 3, 5 | 250           |                           |           | 0.001         |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Machine Translation (seq2seq)          | LSTM | 4       | 1000          | Source: 160K, Target: 80K | 1,000     | --            |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Image Captioning                       | LSTM |         | 512           |                           | 512       | (fixed)       |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Image Generation                       | LSTM |         | 256, 400, 800 |                           |           |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Question Answering                     | LSTM | 2       | 500           |                           | 300       |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Text Summarization                     | GRU  |         | 200           | Source: 119K,             | 100       | 0.001         |
|                                        |      |         |               | Target: 68K               |           |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
--------------------------------------------------------------------------------
/TopicNotes/Search/CourseNotes.rst:
--------------------------------------------------------------------------------
###################
Search
###################

AI is about what to do when you don't know what to do.
Regular programming is when you know what to do and you make the machine do
it by writing instructions.

=========================
Definition of a Problem
=========================

What is a problem:

- Initial state: :math:`S_0`
- Actions(s): a function that takes a state as input and returns the set of
  possible actions that the agent can execute:
  * returns: :math:`\{a_1, a_2, a_3, {\dots} \}`
- Result: another function, which takes an action and a state as input and
  gives another state as output:
  * So :math:`R(s, a_i) {\to} {\overline{s}}`
- GoalTest: a boolean function which says whether we are at the goal state
  or not.
- PathCost: another function which takes a sequence of states and actions as
  its argument and calculates a cost value for the given sequence:
  * :math:`PC(s_i \xrightarrow{a_j} s_{i+1} \xrightarrow{a_{j+1}} \dots) \to n, \; n \in \mathbb{R}, \; i \in \{0,1,\dots\}, \; j \in \{1,2,\dots\}`
  * Most of the time the cost of a path is the sum of the costs of the
    individual steps, which are calculated with the StepCost function.
- StepCost: another function that takes a state, an action, and the
  resulting state of the action as input and calculates the step cost of
  that action:
  * :math:`SC(s, a_i, {\overline{s}}) {\to} n`
- The last two are evaluation functions; StepCost is the quick one, and
  PathCost is the heavy one.

---------------
States
---------------

At every point of the search we want to separate the state space into three
parts:

- Frontier
- Explored
- Unexplored

Two types of graphs intersect with each other:

- the state space graph
- the search tree graph

The state space graph is usually a given; the search tree graph is
calculated. During the calculation process:

- the states that are already calculated are the explored states,
- the states that are being calculated form the frontier,
- the states that are not yet calculated are the unexplored states.

This results in the following general tree search function.
.. code-block:: python

    def TreeSearch(problem):
        # the frontier starts as the single path containing only
        # the initial state
        frontier = [[problem.initial_state]]
        while True:
            if len(frontier) == 0:
                # no path left to extend: there can be no solution
                return False
            # 'choice' stands for a family of strategies (breadth first,
            # cheapest first, ...); it removes and returns one path from
            # the frontier
            path = choice(frontier)
            state = path[-1]  # the state at the end of the path
            if state == problem.goal:
                # the state is the goal, so we found our path
                return path
            # we are not at the goal state, thus we should extend the path
            for a in problem.actions(state):
                # add the path, the action, and the result of the action,
                # that is, the new state resulting from the action
                frontier.append(path + [a, problem.result(state, a)])

Now let's see our options for the "choice" function:

- Breadth first search, or shortest-path-first search:
  * It chooses the shortest path that has not been considered yet.
    + The measure for the shortest path is the number of steps between the
      end state and the origin state.
    + For example, if between s1 and s5 there are 2 steps, between s3 and s5
      there are 102 steps, and between s2 and s5 there are 4 steps, breadth
      first search would evaluate s1, then s2, then s3, before meeting a
      condition like "all the branches associated with the frontier state
      are in the explored states list", etc.


.. code-block:: python

    def GraphSearch(problem):
        # paths are lists of states and actions; the frontier holds the
        # candidate paths, explored the states that were already expanded
        frontier = [[problem.initial_state]]
        explored = set()
        while True:
            if len(frontier) == 0:
                return False
            path = choice(frontier)  # removes one path, strategy-dependent
            state = path[-1]
            explored.add(state)
            if state == problem.goal:
                return path
            for a in problem.actions(state):
                s2 = problem.result(state, a)
                # only add states that are in neither frontier nor explored
                if s2 not in explored and all(s2 != p[-1] for p in frontier):
                    frontier.append(path + [a, s2])

.. code-block:: python

    # Breadth first search



- Uniform cost search, or cheapest first search:
  * Finds the path with the **cheapest total cost**.
  * How it works:
    + We start in the start state.
    + We pop the empty path from the frontier into explored.
    + We add the paths leading out of that state.
  * It is not a directed search, it is just an expansion; we should expect
    to cover half of the search space on average.

--------------------------
Use cases for Searches
--------------------------

In the case of a large binary tree:

Breadth first search and cheapest first search:

- need on the order of :math:`2^n` space

Depth first search:

- needs on the order of :math:`n` space

Breadth first and cheapest first always find their goal, because they
evaluate the states level by level, whereas depth first search can go down
an infinite path, because it evaluates the tree in the vertical direction.

------------------
A* search
------------------

A better name would be best-estimated-total-path-cost search.

To get faster results than uniform cost search, we need to add more
knowledge. A good piece of knowledge would be an estimate of the distance
from a state to the goal.

A* search works as follows:

- :math:`min(f = g + h)`:
  * We pick the path with the minimum value of the function "f".
  * The function "f" is defined as the sum of the results of the functions
    "g" and "h".
  * The function "g" is the path cost; it takes the path as its argument.
  * The function "h" is evaluated at the final state of the path and gives
    the estimated distance from that state to the goal.
  * The value of the function "h" should never be larger than the true cost
    to the goal.

When A* ends, it returns a path p with estimated cost C.
And it turns out that C is also the actual cost, because at the goal state
the estimated distance to the goal is 0; thus C, which was supposed to
represent the estimated cost, represents the actual cost, since
C = h(s) + g(p) and h(s) = 0.
All the other paths in the frontier have a bigger estimated cost than C,
because the frontier is explored in cheapest first order.
The path p, then, must have a cost that is less than the true cost of any
other path in the frontier.
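
A compact sketch (my own, not from the course) of A* with a priority queue
ordered by f = g + h:

.. code-block:: python

    import heapq
    from itertools import count

    def a_star(start, goal, neighbors, h):
        """neighbors(s) yields (step_cost, next_state); h is admissible."""
        tie = count()  # breaks ties so states are never compared directly
        frontier = [(h(start), 0, next(tie), start, [start])]
        explored = set()
        while frontier:
            f, g, _, state, path = heapq.heappop(frontier)
            if state == goal:
                return path
            if state in explored:
                continue
            explored.add(state)
            for cost, s2 in neighbors(state):
                heapq.heappush(frontier, (g + cost + h(s2), g + cost,
                                          next(tie), s2, path + [s2]))
        return False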

======================================
`How to derive heuristics functions`_
======================================

Heuristics can be derived from the formal description of a problem.
For example:

Description
Block can move from A to B if A is adjacent to B, and B is blank

Derived Heuristic 1
Block can move from A to B if A is adjacent to B. Heuristic: B - A == 1

Derived Heuristic 2
Block can move from A to B. Heuristic: Misplaced Blocks.

This is called generating a relaxed problem.

============================================
`When Searches Work`_
============================================

1. The problem domain must be fully observable.
2. The problem domain must be known: we have to know the set of actions
   available to us.
3. The problem domain must be discrete: we have to have a finite number of
   actions to choose from.
4. The problem domain must be deterministic: we have to know the result of
   taking an action.
5. The domain must be static, that is, there must be nothing else that
   changes the domain besides our own actions.

====================================
`How to implement the algorithms`_
====================================

In the implementation we talk about nodes. A node has 4 fields:

* the state at the end of the path
* the action it took to get to that state
* the total cost
* the parent, a pointer to another node

This linking of nodes one to another creates a sequence, a "path".
The word path is used as an abstract idea; a node is its representation in
computer memory. Nodes are organized into sequence structures such as
lists, on top of which the frontier and the explored list are built; a
minimal node sketch follows.
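
A minimal sketch (my own) of the node structure described above:

.. code-block:: python

    class Node:
        """A search-tree node: state, action, total cost, parent."""
        def __init__(self, state, action=None, cost=0, parent=None):
            self.state = state
            self.action = action   # action that led to this state
            self.cost = cost       # total path cost g(n)
            self.parent = parent   # pointer to the previous node

        def path(self):
            """Follow parent pointers back to recover the path."""
            node, states = self, []
            while node:
                states.append(node.state)
                node = node.parent
            return list(reversed(states))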

Frontier
----------

- Operations
  * Remove best item, add a new one.
  * Membership test
- Implementation
  * Priority queue
- Representation
  * Set
- Build
  * Hash tables
  * Tree

Explored
------------

- Operations
  * Add new members
  * Check for membership
- Representation
  * Single set
- Build
  * Hash tables
  * Tree
--------------------------------------------------------------------------------
/BookNotes/ComputerGraphicsPrinciplesPracticeInC.rst:
--------------------------------------------------------------------------------
############################################
Computer Graphics: Principles and Practice
############################################

Notes from:

Foley, James, Andries van Dam, Steven Feiner, and John Hughes, Computer Graphics: Principles and Practice, 2nd edn (Paris, 1996)

Chapter 7: Essential Mathematics
================================

**v**: the vector v
**M**: the matrix M
except:

**R**: is the set of real numbers
**C**: is the set of complex numbers
:math:`\mathbf{R}^{+}`: is the set of positive real numbers
:math:`\mathbf{R}^{+}_0`: is the set of non-negative real numbers
*i*: is the indexing variable, as in :math:`x_i`

The cartesian product of two sets A and B:

:math:`A \times B = \{(a,b) : a \in A, b \in B\}`

.. code:: python

    def cartProd(A, B):
        "Cartesian product of two lists"
        result = []
        for a in A:
            for b in B:
                result.append((a, b))
        return result

:math:`[a, b] = \{x : a \le x \le b\}`

.. code:: python

    def checkClosedInt(x, lower_limit, upper_limit):
        "Check that x is in the closed interval [lower_limit, upper_limit]"
        return lower_limit <= x <= upper_limit

:math:`[a, b) = \{x : a \le x < b\}`
:math:`(a, b] = \{x : a < x \le b\}`

.. code:: python

    def checkHalfOpenIntUpper(x, lower_limit, upper_limit):
        "Check that x is in the half open interval [lower_limit, upper_limit)"
        return lower_limit <= x < upper_limit

    def checkHalfOpenIntLower(x, lower_limit, upper_limit):
        "Check that x is in the half open interval (lower_limit, upper_limit]"
        return lower_limit < x <= upper_limit

:math:`[a, \infty) = \{x : a \le x\}`
:math:`(-\infty, b] = \{x : x \le b\}`

.. code:: python

    def checkLowerLimitWithInf(x, lower_limit):
        "Check that x is in [lower_limit, infinity)"
        return x >= lower_limit

    def checkUpperLimitWithInf(x, upper_limit):
        "Check that x is in (-infinity, upper_limit]"
        return x <= upper_limit

Function:

:math:`f: \mathbf{R} \to \mathbf{R}: x \to x^2` reads as:

.. code:: python

    def f(x: float) -> float:
        return x**2


the function f which maps a value x from the **domain** R
to a value :math:`x^2` in the **codomain** R.

Surjective function: a function whose values cover the entire codomain.
Ex:

:math:`g: \mathbf{R} \to \mathbf{R}^{+}_0: x \to x^2`

The values :math:`x^2` cover the entire codomain :math:`\mathbf{R}^{+}_0`,
so g is surjective whereas f is not.

Injective function: a function between a domain and a codomain in which no
two values of the domain can map to the same value of the codomain, meaning
that if h(a) = h(b) then a = b.
Ex:

:math:`h: \mathbf{R}^{+}_0 \to \mathbf{R}^{+}_0: x \to x^2`

A function that is both injective and surjective, like h, has an inverse
function h' which maps the codomain of h to the domain of h; it simply
undoes what h does.

A function that is both injective and surjective is called bijective.

.. code:: python

    def checkFuncAndInverse(fn, fnInv, domain: set, codomain: set):
        "Check that fn and fnInv are mutual inverses over domain and codomain"
        result = []
        for x in domain:
            # fnInv must undo fn on the domain
            result.append(fnInv(fn(x)) == x)

        for y in codomain:
            # fn must undo fnInv on the codomain
            result.append(fn(fnInv(y)) == y)

        return all(result)


Coordinates
------------

Coordinates are real numbers that are associated to geometric entities such
as points, lines, etc.

Geometric properties do not change, whereas numerical properties can. For
example, that a point lies on a line is a geometric property, true
irrespective of the coordinate system.

Chapter 16: Illumination Models from Ed. 2
==========================================

Ambient Light
--------------

The simplest illumination equation:

:math:`I = k_i`

where I is the resulting intensity, a pixel value for example, and
:math:`k_i` is the intrinsic intensity coefficient of the object. Notice
that no light source direction is included in this equation.

Let's suppose another model, where there is an ambient light illuminating
all the objects at a constant intensity. This ambient light is the result of
diffuse interreflection, where multiple surfaces reflect light, creating a
non-directional source of light. This transforms our equation into:

:math:`I = I_a k_a`

where I is the resulting intensity, :math:`I_a` is the intensity of the
ambient light, and :math:`k_a` is the coefficient of ambient reflection, a
material property of the surface that reflects the ambient light.

Diffuse reflection
--------------------

Ambient light assumes that the object is illuminated uniformly from all
angles; now let us consider illuminating objects from a point light source.
In the latter case the object would be bright on one side but not
necessarily on the other, and its brightness would change according to the
position of the point light source.

Lambertian reflection
++++++++++++++++++++++

Diffuse reflection is a property whereby an object with a dull, matte
surface, like chalk, is equally bright from all angles, because such
surfaces reflect light equally in all directions.

A surface's brightness depends on the angle between two vectors:

- L, the light source direction: where the light source is situated.
- N, the normal of the surface: a vector perpendicular to the surface.

This works as follows:

Let us suppose that the incoming ray of light has a very small area
:math:`{\delta}A`.

If the incoming beam came from a direction perpendicular to the surface,
the illuminated area would have been equal to :math:`{\delta}A`.

If we change the angle of approach of the incoming beam to an angle between
0 and 90 degrees, then :math:`{\delta}A` becomes a side of a right angled
triangle whose hypotenuse lies on the surface of the object. The angle
:math:`{\theta}` is then the angle between the hypotenuse and
:math:`{\delta}A`.

The cosine of the angle theta is directly proportional to the amount of
light reflected to the viewer, which is quite logical if you think about it:
if you position yourself right behind the light source, the brightest spot
on the object is directly opposite you.

The equation that models this relationship is:

:math:`I = I_p k_d \cos{\theta}`

where :math:`I_p` is the light source intensity and :math:`k_d` is the
diffuse reflection coefficient of the material, between 0 and 1. The angle
theta must be between 0 and 90 degrees.

This also means that we are treating the surfaces as self-occluding, that
is, light cast from behind does not illuminate the front.
When we need to light objects that are not self-occluding, we use
:math:`|\cos{\theta}|`, which amounts to inverting their surface normals;
we thus treat them as if they were illuminated from both sides.

If both N and L are normalized, we can rewrite :math:`\cos{\theta}` as the
dot product of N and L.

The equation :math:`I = I_p k_d (N \cdot L)` tends to make objects look
harsh, so to obtain something more realistic we add an ambient light term:

:math:`I = I_a k_a + I_p k_d (N \cdot L)`

Light-source attenuation
-------------------------

The equation above cannot distinguish two overlapping surfaces of identical
material in an image. In order to do that we need to add another factor to
the equation, the light source attenuation factor :math:`f_{att}`.
The equation thus becomes:

:math:`I = I_a k_a + f_{att} I_p k_d (N \cdot L)`

A useful example of :math:`f_{att}` is:

:math:`f_{att} = \min\left(\frac{1}{c_1 + c_2 d_L + c_3 d_L^2}, 1\right)`

where c1, c2, c3 are constants provided by the user, and :math:`d_L` is the
distance between the light source and the illuminated surface.

Colored light and surfaces
---------------------------

Colored light is treated per channel, that is, we write one equation for
each component of the color model. We represent an object's *diffuse color*
by :math:`O_d`, O for object, d for diffuse. In order to mark the component
that is represented we add its initial. Ex: :math:`O_{dR}` for red, G for
green, B for blue in the RGB color space. The above equation for red would
be:

:math:`I_R = I_{aR}k_{a}O_{dR} + f_{att}I_{pR}k_{d}O_{dR} (N \cdot L)`
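
A small sketch (my own, not from the book) of the per-channel diffuse
equation above:

.. code:: python

    def diffuse_intensity(I_a, k_a, O_d, f_att, I_p, k_d, N, L):
        """I = I_a*k_a*O_d + f_att*I_p*k_d*O_d*(N . L), one channel."""
        # clamp the dot product at 0: the self-occluding case
        n_dot_l = max(0.0, sum(n * l for n, l in zip(N, L)))
        return I_a * k_a * O_d + f_att * I_p * k_d * O_d * n_dot_l

    # red channel, light from straight above a horizontal surface
    print(diffuse_intensity(0.2, 0.3, 1.0, 1.0, 1.0, 0.7,
                            N=(0.0, 0.0, 1.0), L=(0.0, 0.0, 1.0)))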
The above equation for red would be: 240 | :math:`I_R = I_{aR}k_{a}O_{dR} + f_{att}I_{pR}k_{d}O_{dR} (N \cdot L)` 241 | 242 | Rather than indicating the channel name in the equation for all the channels, 243 | we simply use the variable :math:`\lambda` in order to replace the channel 244 | initial, thus the equation becomes: 245 | :math:`I_{\lambda} = I_{a\lambda}k_{a}O_{d\lambda} + 246 | f_{att}I_{p\lambda}k_{d}O_{d\lambda} (N \cdot L)` 247 | 248 | Specular Reflection 249 | -------------------- 250 | 251 | Specular reflection is easy to observe. Illuminate an apple, you would see 252 | that a particular part of the apple is highlighted with the color of the light 253 | source. This is specular reflection, you change your viewpoint you should 254 | notice that the highlighted area would also change. 255 | 256 | Phong illumination model is explaining that: 257 | 258 | .. math:: 259 | 260 | I_{\lambda} = I_{a\lambda} k_a O_{d\lambda} 261 | + f_{att} I_{p\lambda} [k_{d}O_{d\lambda} (N \cdot L) 262 | + k_{s}O_{s\lambda} (R \cdot V)^n] 263 | 264 | Where O_s is object's specular color, k_s and n are user provided. 265 | I_{p\lambda} is the primary component of the light source 266 | An alternative approach is to use the halfway vector H where: 267 | :math:`H = ( L + V ) / (|L + V|)`. Computationally this is cheaper. 268 | Because once we postulate that the viewer and the light source are at 269 | infinity. 270 | The H becomes a constant 1. 271 | But the specular component n, does not produce the same result as before due 272 | to the change in angles 273 | -------------------------------------------------------------------------------- /TopicNotes/MachineLearning/SupervisedNotes.rst: -------------------------------------------------------------------------------- 1 | ################################ 2 | Machine Learning - Supervised 3 | ################################ 4 | 5 | Bayes Networks are there for us to reason. 6 | Machine Learning help us to find the bayes network model. 7 | 8 | Machine Learning Taxanomy 9 | ============================== 10 | 11 | Distinguishing elements in machine learning: 12 | 13 | What do we learn ?: 14 | 15 | - Parameters 16 | - Structure 17 | - Hidden Concepts 18 | 19 | From what do we learn ?: 20 | 21 | - Supervised, with target labels 22 | - Unsupervised, without target labels but use replacement principals 23 | - Reinforcement learning, when an agent learns from feedback with physical environment 24 | like "Bravo!", "Ouf, that hurts" 25 | 26 | What for ? 27 | 28 | - Prediction 29 | - Diagnosis 30 | - Summarisation 31 | 32 | How to learn ? 33 | 34 | - Passive, learning agent doesn't interfere with what is being learnt 35 | - Active, learning agent has impact on the data being learnt 36 | - Online, learning occurs while the data is being generated 37 | - Offline, learning occurs after the data is generated 38 | 39 | Outputs ? 40 | 41 | - Classification: binary, or discreet number of class 42 | - Regression: continous 43 | 44 | Details ? 45 | 46 | - Generative: seeks to model the data as generally as possible 47 | - Discriminitive: distinguish data 48 | 49 | Supervised Learning 50 | ===================== 51 | 52 | Data structure is of the following: 53 | 54 | {x_1, x_2, x_3, x_4, ..., x_n -> y_1} 55 | {a_32, a_33, a_34, ..., a_n -> y_2} 56 | 57 | our job is to find the f() in which f(x_m) = y_m 58 | 59 | Occam's razor: 60 | Everything else being equal, choose the less complex hypothesis. 
61 | 62 | Trade off within the machine learning is that: 63 | 64 | good fit <--> low complexity 65 | 66 | Meaning that good fits for the data are usually high in complexity. 67 | 68 | Spam Detection 69 | ----------------- 70 | 71 | In spam detection we got an email, we try to classify it, either as SPAM or HAM, 72 | ham means it is worth passing to another person as a person mail. 73 | 74 | 1 technique that is being used is bag of words in which we represent each word 75 | as a frequency: 76 | Ex: 77 | 78 | hello my name is Soma, yes Soma is my name. 79 | 80 | 81 | In a dictionary they are represented as the following: 82 | 83 | hello my name is Soma yes 84 | 1 2 2 2 2 1 85 | 86 | Bayes Networks 87 | --------------- 88 | 89 | What we had been doing in spam detection is building a bayes network, where we 90 | build up the parameters of the bayes network through the supervised learning by 91 | maximum likelihood estimator based on training data. 92 | 93 | Small refresher: 94 | 95 | If we have structure like: 96 | 97 | Spam 98 | / \ 99 | w1 w2 100 | 101 | how many parameters we would need given that the dictionary has 12 words. 102 | The answer is 25 103 | Why ? 104 | For nodes that have incoming edges we would need two times the number of edges 105 | coming into the node, for nodes that are not recieving any edges, you would need 106 | only one parameter. 107 | 108 | Message = Sport 109 | 110 | P(Spam | Message) = P(Message|Spam) P(Spam) / P(Message) = 0,1666...7 111 | P(Spam) = 3/8 112 | P(Message|Spam) = 1/9 113 | P(Message) = 1/4 114 | 115 | Message = Today is Secret 116 | P(Spam | Message) = P(Message|Spam) P(Spam) / P(Message) = ? 117 | 118 | P(Message|Spam) = P("Today is Secret"|Spam) = P(Spam) * P("Secret" under mails labeled as spam) * P("Today" under mails labeled as spam) * P("is" under mails labeled as spam) 119 | 120 | P(Message) = P(Message|Spam)P(Spam) + P(Message|~Spam)P(~Spam) 121 | 122 | Since today is not among the mails labeled as spam, it is 0, thus the probability of our mail being spam is 0. 123 | 124 | This is a good example of overfitting, because we can not have a word determining the entire mail's label. 125 | 126 | Maximum likelihood Estimate is basically, P(x) = count(x) / N 127 | that is probability of x is equal to count of x in the class over all the data in the class. 128 | 129 | Laplace Smoothing 130 | ------------------- 131 | 132 | Laplace smoothing is like maximum likelihood estimator, but adds a variable k for smoothing. 133 | 134 | Basically L(k,x) = count(x) + k / N + k|x| 135 | 136 | This means that Probability(x) in laplace smoothing calculation, is equal to the number of the occurences of x in the given class, 137 | N is total number of occurances in the given class, if no class is given it is equal to the total number of occurances in the data set, k is the smoother, |x| is the number of tokens in the data set 138 | 139 | Ex: 140 | Total number of tokens: 12 141 | 142 | total number of occurences: 24 143 | 144 | Spam class have: 9 occurences 145 | 146 | Ham class have: 15 occurences 147 | 148 | if k is 1, 149 | 150 | What is the probability of token with 0 occurences in Spam Class given the spam class ? 151 | 152 | P(token|spam) = 0 + 1 / 9 + 12 = 1 / 21 153 | 154 | Message = Today is Secret 155 | P(Spam | Message) = P(Message|Spam) P(Spam) / P(Message) = ? 
156 | 157 | P(Message|Spam) = P("Today is Secret"|Spam) = P(Spam) * P("Secret" under mails labeled as spam) * P("Today" under mails labeled as spam) * P("is" under mails labeled as spam) 158 | 159 | P(Spam) = 2/5 160 | P(Secret|Spam) = 4/ 9+12 161 | P(Today|Spam) = 1/ 9+12 162 | P(is|Spam) = 2/ 9+12 163 | x 164 | \-------------------------- 165 | 16 / 46305 166 | 0,000345 167 | 168 | P(~Spam) = 3/5 169 | P(Secret|~Spam) = 2/ 15+12 170 | P(Today|~Spam) = 3/ 15+12 171 | P(is|~Spam) = 2/ 15+12 172 | 173 | 36 / 98415 174 | 0,000365 175 | 176 | P(Message) = P(Message|Spam)P(Spam) + P(Message|~Spam)P(~Spam) = 0,00071 177 | 178 | This is basically naive bayes model which is a **generative** modal. 179 | 180 | You can come up with other criteria for spam filtering, for example: 181 | 182 | - Known spamming id? 183 | - Have you emailed to this person before 184 | - Have another 1000 people received the same message 185 | - Email header consistent 186 | - Written in all capitals 187 | - Do inline urls point to where they say 188 | - Are you adressed by your correct name. 189 | 190 | Hand written digit recognition 191 | -------------------------------- 192 | 193 | Applying naive bayes for hand written digit recognition. 194 | 195 | When you are applying naive bayes, the input can be: 196 | 197 | - Pixel vectors 198 | 199 | * Let's say we have a input vector of 16 x 16 you have to convolve the input vector with a gaussian variable, that is you take each pixel with neighbouring pixels for smoothing. This method is called input smoothing. 200 | 201 | - But naive bayes is not a good method for this. 202 | 203 | Overfitting prevention 204 | ------------------------ 205 | 206 | Remember the Occam's razor: 207 | 208 | Cross validation is the key for overcoming the overfitting problem. 209 | 210 | A typical division for the training data would be: 211 | 212 | +-------------------------------------------------------------+ 213 | | Training data | 214 | +=====+===========================+===========================+ 215 | | %80 | Used for Training | Figure out the parameters | 216 | +-----+---------------------------+---------------------------+ 217 | | %10 | Used for Cross Validation | Find best smoothing param | 218 | +-----+---------------------------+---------------------------+ 219 | | %10 | Used for Testing | Verify the performence | 220 | +-----+---------------------------+---------------------------+ 221 | 222 | Figuring out of parameters is very much like finding out the probabilities of the bayes network. 223 | 224 | Finding the best smoothing parameter, K, happens as the following. 225 | 226 | You train the model with different K values and measure their performance in Cross Validation data. 227 | Then you maximise over all K, to get the optimum k value. 228 | 229 | Then you touch only **once** to test data, in order to verify your performance. 
230 | 231 | Regression Problems 232 | ===================== 233 | 234 | Data structure is of the following: 235 | 236 | {x_1, x_2, x_3, x_4, ..., x_n -> y_1} 237 | {a_32, a_33, a_34, ..., a_n -> y_2} 238 | 239 | our job is to find the f() in which f(x_m) = y_m 240 | 241 | in linear regressions though the function f() has a particular form: 242 | 243 | f() = w_1·x + w_0 244 | 245 | Loss Function 246 | ----------------- 247 | 248 | Loss function can differ: 249 | 250 | Here is how quadratic loss function is calculated: 251 | 252 | :math:`w_0 = {\frac{1}{M}{\sum}y_i - \frac{w_1}{M}{\sum}x_i}` 253 | 254 | :math:`w_1 = {\frac{M{\sum}x_iy_i - {\sum}x_i{\sum}y_i}{M{\sum}x^{2}_i - ({\sum}x_i)^2}}` 255 | 256 | M, is the number of training examples. 257 | 258 | 259 | Problems with linear regression 260 | -------------------------------- 261 | 262 | Linear regressions are very susceptible to outliers. 263 | Outliers can change significantly the minimizing quadratic loss function. 264 | They don't generalise well to data that is not really linear, like curves etc. 265 | Some of the assumptions about the linear data might be wrong for such cases, for example 266 | as the x goes to infinity in a linear data, the graph would give the impression that the 267 | y also goes to infinity, whereas in reality that might not be the case. 268 | 269 | Normalisation, or complexity control is also an issue to consider in linear regression. 270 | Most of the time people use either L1 or L2 regularisation. 271 | 272 | Perceptron 273 | ----------- 274 | 275 | Perceptron algorithm works with a linear seperator. 276 | That is a linear equation which seperates positives from the negatives. 277 | 278 | f(x) = {1 if w_1 x+w_0 >=0, 279 | 0 if w_1 x+ w_0 < 0} 280 | 281 | Perceptron only convergence if the data is linearly seperable. 282 | 283 | The iteration of the perceptron is of the following and it is very close to gradient descent. 284 | 285 | Starts with a random guess for w_0, and w_1 286 | 287 | w^m_i <-- w^m-1_i + a (y_j - f(x_j)) 288 | 289 | here m is the iteration index. 290 | y_i, is the target label, 291 | f(x_j) is the calculated target label. 292 | a is the learning rate. 293 | w^m-1_i is the weight calculated from the previous cycle 294 | y_i - f(x_i) is our error. 295 | 296 | Linear Methods Summary 297 | ---------------------- 298 | 299 | We talked about: 300 | 301 | - Regression vs Classification 302 | - Exact Solution vs Iterative Solutions 303 | - Smoothing 304 | - Non linear problems with linear methods, like SVMs, etc. 305 | 306 | Parametric methods: 307 | 308 | - Number of parameters are independent of the training size. 309 | 310 | Non-parametric methods: 311 | 312 | - Number of parameters are dependent of the training size, and can grow with the data size 313 | 314 | K-Nearest Neighbor 315 | -------------------- 316 | 317 | Simple thing, already studied in the intro to machine learning so, naive baysen.tex 318 | 319 | Problems of KNN: 320 | 321 | - Very large data size: 322 | + Solution: KDD, instead of representing the data as a list, we represent it as a tree 323 | 324 | - Very large feature space: 325 | + Solution: After 2-3 dimensions, don't use knn, do tensor calculations. 
326 | -------------------------------------------------------------------------------- /TopicNotes/InferenceBayesNets/CourseNotes.rst: -------------------------------------------------------------------------------- 1 | ######################### 2 | Inference in Bayes Nets 3 | ######################### 4 | 5 | The point of this unit is to be able to answer probabilistic questions using 6 | bayes nets. 7 | 8 | Instead of input and output, in probabilistic inference we call the givens, 9 | evidence, and the results, query. 10 | The variables we know the values of are the evidence, and the variables we want 11 | to find out are the query. 12 | Anything that is neither evidence, nor query is a hidden variable. 13 | 14 | The output is not a number, but rather a probabilistic distribution. 15 | So the answer is going to be a complete joint probability distribution over 16 | query variables. 17 | It is called the posterior distribution given the evidence. 18 | It is expressed as the following: 19 | 20 | P(Q_1, Q_2, ... | E_1=e_1, E_2=e_2, ...), this is the computation we want to 21 | come up with. 22 | 23 | Another reasonable question would be to ask which combination of values would 24 | yield highest probability, so something like this: 25 | 26 | argmax_q P(Q_1=q_1, Q_2=q_2, ... | E_1=e_1, E_2=e_2, ... ) 27 | 28 | Which q values are maxiable given the evidence values is question we can ask. 29 | 30 | The direction of computation is irrelevant, that is we can have e_1, e_2 and 31 | compute q_1, etc. or have q_1 and compute all the rest, etc. 32 | This is to notice that the direction of computation in ordinary programming 33 | languages are in one way, from input to output, this is not the case for bayes 34 | nets inference. 35 | 36 | Exact Inference 37 | ================= 38 | 39 | Enumeration 40 | ---------------- 41 | 42 | Enumeration is a method to calculate the probabilities in a given bayes net, it 43 | simply enumerates all the possibilities and adds them up and thus comes up with 44 | an answer. 45 | 46 | Here is how it works: 47 | 48 | Example bayes net graph: 49 | 50 | B E 51 | .\ ./ 52 | A 53 | ./ .\ 54 | J M 55 | 56 | The problem: 57 | P(+b|+j,+m) = ? 58 | It reads as if b is true, j is true, m is true, given that the events j, and m 59 | are already happened, what is the probability of the event b would occur. 60 | 61 | Enumeration works as follows: 62 | 63 | First we express the conditional probabilities as unconditional probabilities. 64 | So, 65 | 66 | P(+b|+j,+m) = P(+b,+j,+m)/P(+j,+m) 67 | 68 | Then we enumerate the atomic probabilities and calculate the sum of their 69 | products. 70 | P(+b|+j,+m) the probability of these terms can be determined by enumerating all 71 | possible values of the hidden variables. In this case there are two E and A. 72 | So we shall sum over those variables for all values of E and A. 73 | 74 | Formally: 75 | 76 | .. math:: 77 | 78 | {\sum_{e}}{\sum_{a}} P(+b,+j,+m,e,a) 79 | 80 | "a" and "e" represent here the possible value that E, and A can have in given 81 | net. 82 | 83 | We transform this equation to a product of 5 variables by expressing each of the 84 | nodes in relation to their parent nodes in the net. 85 | 86 | that is: :math:`{\sum_{e}}{\sum_{a}} P(+b) P(+j|a) P(+m|a) P(e) P(a|+b,e)` 87 | 88 | The product section can be named as a function, f(e,a) 89 | We look up the values from the table of the bayes net, which is created during 90 | the creation of the net. 
91 | 92 | This enumeration technique works for small networks, bigger networks would 93 | increase the number of rows to lookup. 94 | 95 | So we need to find a way to speed up the enumeration. 96 | 97 | Pull out terms 98 | ---------------- 99 | 100 | We can pull terms out from the enumeration, for example 101 | 102 | :math:`{\sum_{e}}{\sum_{a}} P(+b) P(+j|a) P(+m|a) P(e) P(a|+b,e)` 103 | 104 | In the above equation the P(+b) is the same throughout the summation 105 | thus it doesn't need to be added to the summation everytime, we can 106 | put it in front of the summation. P(e) is also independent of "a", 107 | so it can also come before the summation of the terms related to a. 108 | 109 | But this won't decrease the amount that much. 110 | 111 | Maximize independence 112 | ----------------------- 113 | 114 | This is a technique for efficient inference. 115 | The structure of the bayes net determines how efficient it is to do 116 | inference on it. 117 | 118 | 119 | Causal Direction 120 | --------------------- 121 | 122 | The important thing is that Bayes nets tend to be the most compact, 123 | and thus the easier to do inference on when they are written 124 | in the causal direction, that is, when the networks flow from the causes to 125 | directions. 126 | 127 | Variable Elimination 128 | ----------------------- 129 | 130 | This is a technique with several steps. 131 | We try to reduce the network to the smaller networks. 132 | 133 | First step is joining factors. 134 | 135 | Joining Factors: 136 | 137 | - A factor is a multi dimensional matrix, it is a table of properties for a 138 | given node. 139 | - We choose two or more of the factors 140 | - We will combine them, and it will represent the joint probability of all the 141 | variables in that factor. 142 | 143 | For example if we have the tables for P(r) and P(t|r), by multiplying the values 144 | in the tables, we can have 145 | P(r,t) 146 | 147 | After this we do the second step, elimination or summing out, or marginalisation. 148 | 149 | Elimination: 150 | 151 | - We choose a variable that is **absent** in the next table. 152 | - We sum out the choosed variable based on the variable that is **present** in 153 | the next table. 154 | 155 | Example operation: 156 | 157 | 158 | +--------+-------------+ 159 | | P(T,L) | | 160 | +========+====+========+ 161 | | T | L | Values | 162 | +--------+----+--------+ 163 | | +t | +l | 0.051 | 164 | +========+====+========+ 165 | | +t | -l | 0.119 | 166 | +--------+----+--------+ 167 | | -t | +l | 0.083 | 168 | +========+====+========+ 169 | | -t | -l | 0.747 | 170 | +--------+----+--------+ 171 | 172 | P(+l) = 0,051 + 0,083 = 0,134 173 | P(-l) = 0,119 + 0,747 = 0,866 174 | 175 | Approximate Inference 176 | ======================= 177 | 178 | We use sampling. 179 | That is we stimulate the events in the nets. 180 | The more samples we have, the closer we approach to exact inference situation. 181 | 182 | The sampling have the following advantages: 183 | 184 | - We know a procedure for coming up with an at least approximate value for the 185 | joint probability distribution 186 | - If we don't know what the conditional probability tables are, but we could 187 | simulate the process, we could still proceed with the sampling 188 | We couldn't with exact inference. 189 | 190 | Sampling method is *consistent*, that is if we have infinite number of samples 191 | we can calculate the true joint probability. 
192 | 193 | For calculating conditional probabilities, we look at the samples that interest 194 | us. 195 | For example, 196 | If we have sample like P(+C,-S,+r,-l), and if we are looking for P(-C|+r), since 197 | our sampling is based on P(+C), we reject at, 198 | and keep the ones with P(-C) in it. This proceedure is called 199 | *rejection sampling*. 200 | 201 | The problem with rejection sampling is that if the evidence is unlikely, you end 202 | up rejecting a lot of samples. 203 | 204 | To fix that we use a technique called, *likelihood weighting* 205 | 206 | Likelihood Weighting 207 | ---------------------- 208 | 209 | This is a technique in which we fix the samples. 210 | For example, let's say when the burglary occurs, alarm goes of, so we have 211 | structure like B -> A. 212 | For a question like what is the probability of burglary occuring in the event 213 | that alarm goes off, P(B|+a) ? 214 | Following would happen. 215 | Since the burglaries are infrequent, during the sampling process, we would 216 | generate mostly P(-b, -a), and reject them 217 | because we are looking for P(B|+a) 218 | So we say, during the sampling process, that the value of "a" is positive from 219 | the start. 220 | This way we get to keep every sample, but this method is **inconsistent**, that 221 | is by having infinitely many samples obtained this way, we can not obtain the 222 | true joint probabilities. 223 | 224 | However that can be fixed by adjusting the weighting of the samples on the 225 | outcome. 226 | 227 | Example Calculation: 228 | 229 | Let's say we are trying to calculate P(R|+s,+c) 230 | 231 | for the following network 232 | 233 | Cloudy 234 | / \ 235 | Sprinkler Rain 236 | \ / 237 | Wet Grass 238 | 239 | We have 4 tables: 240 | 241 | +------+-----+------+ 242 | |P(S|C)| | 243 | +======+=====+======+ 244 | | +c | +s | 0,1 | 245 | +------+-----+------+ 246 | | | -s | 0,9 | 247 | +------+-----+------+ 248 | | -c | +s | 0,5 | 249 | +------+-----+------+ 250 | | | -s | 0,5 | 251 | +------+-----+------+ 252 | 253 | 254 | +------+-----+------+ 255 | |P(R|C)| | 256 | +======+=====+======+ 257 | | +c | +r | 0,8 | 258 | +------+-----+------+ 259 | | | -r | 0,2 | 260 | +------+-----+------+ 261 | | -c | +r | 0,2 | 262 | +------+-----+------+ 263 | | | -r | 0,8 | 264 | +------+-----+------+ 265 | 266 | +------+-----+ 267 | | P(C) | | 268 | +======+=====+ 269 | | +c | 0,5 | 270 | +------+-----+ 271 | | -c | 0,5 | 272 | +------+-----+ 273 | 274 | 275 | +------+-----+------+------+ 276 | |P(W|R,S) | 277 | +======+=====+======+======+ 278 | | +s | +r | +w | 0,99 | 279 | +------+-----+------+------+ 280 | | | | -w | 0,01 | 281 | +------+-----+------+------+ 282 | | | -r | +w | 0,90 | 283 | +------+-----+------+------+ 284 | | | | -w | 0,10 | 285 | +------+-----+------+------+ 286 | | -s | +r | +w | 0,90 | 287 | +------+-----+------+------+ 288 | | | | -w | 0,10 | 289 | +------+-----+------+------+ 290 | | | -r | +w | 0,01 | 291 | +------+-----+------+------+ 292 | | | | -w | 0,99 | 293 | +------+-----+------+------+ 294 | 295 | P(R|+s,+w) 296 | 297 | Let's say we have the following sample: 298 | 299 | P(+c, +s, +r, +w) what is its weight ? 
300 | 301 | Since the "s" and the "w" are constrained by the problem, 302 | we apply the product of their probabilities as weights, 303 | (P(+s|+r)=0,1 * P(+w|+r,+s)=0,99) = 0,099 304 | 305 | 306 | Gibbs Sampling 307 | ------------------ 308 | This technique uses a method called Markov Chain Monte Carlo (MCMC): 309 | we resample one variable at a time conditioned on all the other. 310 | The variables are dependent, and the technique is consistent. 311 | 312 | 313 | Monty hall problem 314 | ------------------ 315 | 316 | Three doors, one with goat, empty, car. 317 | These are boolean. 318 | Goat is given as + 319 | What is the 320 | There is one selected, and the goat is given thus, my selection is not goat. 321 | 322 | P(+c)|P(-c), P(+e)|P(-e), P(+g)|P(-g) 323 | 1/3 | 2/3, 1/3 | 2/3 1/3 | 2/3 324 | 325 | 326 | P(C,E,G) = 1/27 327 | P(+c,+e,+g) = 1/27 328 | P(+c,-e,+g) = 2/27 329 | P(-c,+e,+g) = 2/27 330 | P(-c,-e,+g) = 4/27 331 | P(+c,-e,-g) = 4/27 332 | P(+c,+e,-g) = 2/27 333 | P(-c,+e,-g) = 4/27 334 | P(+c,-e,-g) = 4/27 335 | P(-c,-e,-g) = 8/27 336 | 337 | P(+g), + P(C,E|-g), + P(C) = 1 338 | given, empty door, selection 339 | 2/3 1/3 340 | 341 | -------------------------------------------------------------------------------- /TopicNotes/Introduction_to_AI/CourseNotes.rst: -------------------------------------------------------------------------------- 1 | Introduction to ai 2 | ##################### 3 | 4 | What makes an ai is the capacity to react to changes. 5 | 6 | Edges embody the rules of action, whereas the nodes embody the states for a game playing agent. 7 | 8 | Intelligence should be defined within the context of a task. 9 | 10 | An agent interracts with the environment by sensing its properties. 11 | Actions change the state of environment. 12 | 13 | 14 | Types of Ai problems: 15 | ********************* 16 | 17 | AI problems are classified according to the properties of the environment states. 18 | 19 | Environment States: 20 | 21 | - It can be fully observable like a chess board. 22 | - It can be partially observable like "yazlıkçı okeyi" where you can see the boards of the oponents. 23 | - It can be deterministic 24 | - You know for sure the results of each action. 25 | - It can be stochastic: 26 | - You don't know for sure the results of each action. 27 | - It can be discrete: 28 | - There is a finite number of states the environment can be in. 29 | - It can be continous: 30 | - The number of possible states of the environment is infinite. 31 | - It can be benign: 32 | - The agent is the only one that is taking actions that intentionally effects its goal. 33 | - It can be adverserial: 34 | - There can be other agents that can take actions to defeat its goal. 35 | 36 | 37 | Intelligent agent: 38 | 39 | It is one that takes actions in order to maximise expected utilty given a desired goal. 40 | This is difficult to measure, so we use bounded optimality. 41 | 42 | 43 | 44 | Alpha-beta pruning can provide performance improvements when applied to expected max game trees that have finite limits on the values the evaluation function returns. 45 | 46 | Isolation game: 47 | You move on the board, diagonaly or columnwise or row-wise, you can't move into occupied or previously occupied squares, the goal is to be the last player moving. 48 | 49 | Mini-max Algorithm: 50 | We assume that the opponent tries to minimze our scores, and we always try to maximise our scores. 51 | Triangles pointing up, is the turns where we try to maximise our score. 
52 | Triangle pointing down, marks the turns where the oppononet is trying to minimse our scores. 53 | 54 | The algorithm is as follows: 55 | 56 | - For each max node, pick the maximum value among its child nodes, if there is at least one plus one child, the computer can always pick that to win. Otherwise it can never win. At each mid node we chose the minimum value to represent the opponent. 57 | 58 | 59 | Supplemented Reading: 60 | Norvig, Peter, and Stuart Russell, Artificial Intelligence: A Modern Approach, 3rd edn (New York, 2010), sections 5.1-2. 61 | 62 | Summary: 63 | 64 | Games are for the most part competitive environments, that is agent's interests are in conflict with each other. 65 | This gives rise to adversarial search problems. 66 | The difference between the mathematical game theory and its application in AI: 67 | 68 | For mathematical game theory, every multi agent environment in which the agents act upon each other is *significant* is considered as a game environment, they are labeled as economies. 69 | In AI, the most common games are zero-sum games of perfect information (chess, go, etc), that is: "this means deterministic, fully observable environments in which two agents act alternately and in which the utility values at the end of the game are always equal and opposite." 70 | 71 | The formal definition of a game: 72 | 73 | :math:`S_0` 74 | The initial state, which specifies how the game is set up at the start. 75 | 76 | PLAYER (s) 77 | Defines which player has the move in a state. 78 | 79 | ACTIONS(s) 80 | Returns the set of legal moves in a state. 81 | 82 | RESULT(s, a) 83 | The transition model, which defines the result of a move. 84 | 85 | TERMINAL - TEST (s) 86 | A terminal test, which is true when the game is over and false 87 | otherwise. States where the game has ended are called terminal states. 88 | 89 | UTILITY (s, p): 90 | A utility function (also called an objective function or payoff function), defines the final numeric value for a game that ends in terminal state s for a player p. In chess, the outcome is a win, loss, or draw, with values +1, 0, or 12 . Some games have a wider variety of possible outcomes; the payoffs in backgammon range from 0 to +192. 91 | 92 | A **zero-sum** game is (confusingly) defined as one where the total payoff to all players is the same for every instance of the game. 93 | Chess is zero-sum because every game has payoff of either 0 + 1, 1 + 0 or 12 + 12 . 94 | *Constant-sum* would have been a better term, but zero-sum is traditional and makes sense if you imagine each player is charged an entry fee of :math:`{\frac{1}{2}}` 95 | 96 | 97 | The relationship between the iterative deepening and the visited tree nodes is the following: 98 | 99 | The number of nodes visited during the iterative deepening is the cumulative sum of all the visited nodes up to that level. 100 | 101 | Visited tree node: 1 102 | Iterative tree node: 1 103 | Visited tree node: 4 104 | Iterative tree node: 5 105 | Visited tree node: 13 106 | Iterative tree node: 18 107 | Visited tree node: 40 108 | Iterative tree node: 58 109 | Visited tree node: 121 110 | Iterative tree node: 179 111 | 112 | The general formula for the iterative deepening is: 113 | 114 | .. math:: 115 | 116 | b = K 117 | 118 | n = {\frac{K^{d+1} - 1}{K - 1}} 119 | 120 | b 121 | The branching factor. Gives how much the tree would branch out at each level. 122 | 123 | n 124 | The number of nodes in the tree. 125 | 126 | d 127 | The level indicator. Indicates the level of the tree. 
128 | 129 | 130 | Evaluation function in the case of isolation game: 131 | 132 | With an evaluation function like simply counting the number of moves our computer player 133 | has available at a given node, the player would select branches in the mini-max tree that 134 | lead to spaces where player has most options. 135 | 136 | Good evaluation function for isolation is 137 | number_of_my_moves - number_of_opponent_moves 138 | 139 | It promotes the boards in which i have more moves than my opponents, and penalises the boards in which the opponent has more moves. 140 | 141 | .. code-block:: python 142 | 143 | # Here is a pseudo-code for 144 | # Minimax algorithm with alphabeta pruning 145 | minimax(root) = max(min(3, 12, 8), min(2, x, y), min(14, 5, 2)) 146 | minimax(root) = max(3, min(2,x,y), 2) 147 | z = min(2,x,y) 148 | # Then z <= 2 149 | minimax(root) = max(3,z,2) 150 | minimax(root) = 3 151 | 152 | 153 | Expectimax alpha-beta pruning 154 | ***************************** 155 | 156 | This expectimax algorithm works in the case where we don't really know what the outcome will be. 157 | That is we have several different outcomes with different probabilities. 158 | Expectimax uses the minimax tree which represents all the possible outcomes in the game. 159 | Each branch has a probability, and to propagate a value to a superior node, we take the values and multiply them with the probabilities 160 | assigned to their branches, then depending on the type of branch, that is min or max branch 161 | 162 | An Example Tree: 163 | 164 | Max: 165 | node 1: ; node 2: ; node 3: 166 | 167 | Min: 168 | 169 | Node1B1 Probability: 0.1 170 | Node1B2 Probability: 0.5 171 | Node1B3 Probability: 0.4 172 | 173 | Node2B1 Probability: 0.5 174 | Node2B2 Probability: 0.5 175 | 176 | Node3B1 Probability: 0.5 177 | Node3B2 Probability: 0.1 178 | Node3B3 Probability: 0.4 179 | 180 | Node1B1Leaf1: -4 181 | Node1B1Leaf2: -5 182 | Node1B2Leaf1: 5 183 | Node1B2Leaf2: 6 184 | Node1B3Leaf1: 8 185 | Node1B3Leaf1: 10 186 | 187 | Node2B1Leaf1: 0 188 | Node2B1Leaf2: 10 189 | Node2B2Leaf1: 10 190 | Node2B2Leaf2: 5 191 | 192 | Node3B1Leaf1: -7 193 | Node3B1Leaf2: -3 194 | Node3B2Leaf1: 9 195 | Node3B2Leaf2: 10 196 | Node3B3Leaf1: 2 197 | Node3B3Leaf1: 5 198 | 199 | An Example Calculation without pruning: 200 | 201 | Min: 202 | 203 | Node1: :math:`-5{\times}0.1 + 0.5{\times}5 + 8{\times}0.4 =5.2` 204 | 205 | Node2: :math:`0{\times}0.5 + 5{\times}0.5 = 2.5` 206 | 207 | Node3: :math:`-7{\times}0.5 + 9{\times}0.1 + 2{\times}0.4 = 1.8` 208 | 209 | 210 | Max: 211 | 212 | Node1: 5.2 213 | 214 | An Example Calculation with Pruning: 215 | 216 | Min: 217 | 218 | Node1: :math:`-5{\times}0.1 + 0.5{\times}5 + 8{\times}0.4 =5.2` 219 | 220 | Node2: if Node2B1Leaf1 is 0, then we can prune all the other branches in the middle branch. 221 | Why ? 222 | 223 | Because we know that the values range from -10 to 10, and that the top branch is 5.2, meaning that we assume 224 | Node2 is either equal to or less than 5.2, and the probability distribution is 0.5. This means that even if all 225 | of the rest of the branches yield 10, the final outcome for the branch would be less than or equal to 5. Thus 226 | they would not change the value of the top branch. 227 | 228 | Then 229 | Node2: <= 5 230 | 231 | Node3: if Node3B1Leaf1 is -7 then we can safely ignore everything else. Because as in the above branch, even if 232 | every branch would yield 10, their sum would not be greater than 5.2. 
233 | 234 | Then 235 | Node3: <= 1.5 236 | 237 | 1.5 comes from the fact that we evaluate as if the other branches yielded 10, and multiply it with the probabilities associated with the 238 | branches and then sum everything up. 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | Terms: 248 | ****** 249 | 250 | Heuristic 251 | Some additional piece of information - a rule, constraint, function - that informs an otherwise brute-force algorithm to act in a more optimal manner 252 | 253 | Pruning 254 | Pruning the search tree is to evaluate certain branches of the search tree by easy to check conditions. It allows us to ignore portions of the search tree that make no difference to the final choice 255 | 256 | Mini-Max Algorithm 257 | You are trying to maximize your chances of winning on your turn, and your opponent is trying to minimize your chances of winning on his turn. What we do is we map every possible move in the game, and map them to min or max states. Then depending on the type of the state we either take the maximum value indicated in the child nodes or the minimum value indicated in the child notes. 258 | 259 | Depth-Limited Search 260 | Going as far as we can to safely meet our deadline. 261 | 262 | Horizon Effect 263 | When it is obvious to human player that the game will be decided in the next move, but the depth of search doesn't let the computer player to see this happening. This means that some critical moves are not noticed by the machine. These critical moves occur because of the shape of the environment, that is because of a property of environment. Our agent nodes how to prioritise the nodes, but that doesn't mean that it is aware of how the nodes are situated in relation to each other. 264 | 265 | Evaluation Function 266 | It evaluates the possible moves so that the minimax driven agent can choose a move. 267 | 268 | Alpha-Beta Prunning 269 | Alpha-beta pruning works as follows. If a tree is on the max turn, upon viewing the first branch, we know that if a value is to be propagated to the superior branch other than the first one, it has to be bigger than the value of the first branch, thus we can ignore the branches that comes up with a lower or equal value. If a tree is on the min turn, upon viewing the first branch, we know that if a value is to be propagated to the superior branch other than the first branch value, then it has to be smaller than the value of the first branch, thus we can ignore the branches that comes up with a superior or equal value. 270 | α 271 | the value of the best (i.e., highest-value) choice we have found so far at any choice point along the path for MAX. 272 | β 273 | the value of the best (i.e., lowest-value) choice we have found so far at any choice point along the path for MIN. 274 | 275 | Immediate Pruning 276 | Immediate pruning happens when a branch fulfills the necessay conditions to be designated as the node's value immediately. For example, let's say the condition is that the sum of the numbers inside a node has to be equal to 10. If the first branch is 10 we don't need to look at the other values. 277 | 278 | -------------------------------------------------------------------------------- /MathNotes/DifferentialIntegralCalculus.rst: -------------------------------------------------------------------------------- 1 | ##################### 2 | Differential Calculus 3 | ##################### 4 | 5 | Inifinite Numeric Sequence 6 | =========================== 7 | 8 | Tarasov, L. V. Calculus: Basic Concepts for High Schools. 
Moscow; Chicago: Mir Publishers ; [Distributed by] Imported Publications, 1982. 9 | 10 | An infinite numeric sequence exists if every natural number (position number) 11 | is unambiguously placed in correspondance with a definite number (an element 12 | of a sequence) by a specific rule. 13 | For example: 14 | 15 | - 3 5 7 9 11 16 | 17 | - 1 2 3 4 5 where the law is x + 2 so that 18 | 19 | - x1 + 2 = 5 = x2, x2 + 2 = 7 = x3, the rule is than x_n = x_{n-1} + 2 20 | 21 | Let's look at a different sequence: 22 | 23 | 1, 1/2, 3, 1/4, 5, 1/6, ... 24 | 25 | The rule is for y_n = { if n = 2k - 1, then n, else if n = 2k then 1/n } 26 | 27 | We can also express this as y_n = a_n \times n + b_n \times 1/n, and say 28 | a_n = 0 if n = 2k and 1 if n = 2k - 1 and b_n = 0 if n = 2k - 1 and 1 for 29 | n = 2k 30 | 31 | A sequence is *nondecreasing* if 32 | y_1 <= y_2 <= y_3 <= ... 33 | 34 | A sequence is *nonincreasing* if 35 | y_1 >= y_2 >= y_3 >= ... 36 | 37 | A nonincreasing or nondecreasing sequences are *monotonic* sequences 38 | 39 | A sequence is bounded if there are two numbers A and B which encloses 40 | all the terms of a sequence so that 41 | A <= y_n <= B 42 | 43 | If one of the numbers is absent the sequence is unbounded. 44 | 45 | There is a somewhat puzzling case for bounded infinite sequences that are 46 | monotonic. For example, 0, 0.1, 0.11, 0.111, 0.111 and so on whose lower bound 47 | and upper bound is 0 and 1. However the problem is that the condition that our 48 | incrementation is decreasing does not impose the upper bound. 49 | 50 | The calculus, in the sense we use in modern mathematics, starts with the 51 | notion that there can be monotonic sequences that are bounded. This is ensured 52 | with the concept of *limit*. If a sequence is monotonic and bounded it 53 | necessarily has a limit. 54 | 55 | 56 | Limits 57 | ======= 58 | 59 | Here is the definition of a limit: 60 | 61 | The number a is said to be the limit of sequence y_n if for any positive 62 | number \lambda there is a real number N such that for all n > N the following 63 | inequality holds: 64 | 65 | |y_n - a| < \lambda 66 | 67 | I have a sequence y with a as a limit. When I subtract the limit from any 68 | value of the sequence, the absolute value of this operation should be smaller 69 | than the number \lambda. When n is bigger than N, the difference between the 70 | limit number and the sequence element is smaller than the lambda. 71 | 72 | Let's examine this definition. First of all for any positive \lambda there is a 73 | number N. There is also another number "n". When "n" is bigger than N there is 74 | an inequality. Now \lambda is any positive number. There are then infinite values 75 | that are possible for \lambda. Now the length of the sequence y_n is also infinite 76 | thus we have an infinite number of possible "n" values as well. 77 | 78 | 79 | What are limits ? 80 | ------------------- 81 | 82 | 83 | Differentials 84 | ============== 85 | 86 | Slope of the line. 87 | What is the difference between the current instant and the instant before 88 | 89 | 90 | Formal definition: 91 | 92 | - :math:`f^{(1)}(p) = \lim_{h \to 0} \frac{f(p+h) - f(p)}{h}` 93 | 94 | Simple rules: 95 | 96 | Power rule: 97 | ----------- 98 | take the exponent, move it down, 99 | multiply it with the coefficient—it’s the new coefficient. 100 | Now take one from the exponent. 101 | This is the new exponent. 
102 | (This is for integer coefficient polynomials, as you can see, this is painfully limited) 103 | 2x^7, we have the derivative 14x^6. 104 | Simple enough? Another exercise: x^2. 105 | The derivative is 2x. 106 | 107 | Addition and Subtraction 108 | ------------------------- 109 | 110 | The derivative of f(x)+g(x) is just f'(x)+g'(x). That is we sum the derivatives. 111 | Changing +g(x) to -g(x) gives us the subtraction rule. 112 | 113 | 114 | Product Rule 115 | ------------- 116 | 117 | The derivative of f(x)*G(x) is f'(x)*G(x)+f(x)*G'(x). 118 | That is derivative of f(x) times the G(x) plus 119 | f(x) times the derivative of G(x). 120 | 121 | Here is its proof: 122 | 123 | .. math:: 124 | 125 | p(x) = f(x) \times g(x) 126 | p'(x) = \lim_{h \to 0} \frac{p(x+h) - p(x)}{h} 127 | v = f(x) \times g(x+h) 128 | p'(x) = \lim_{h \to 0} \frac{p(x+h) - p(x) + v - v}{h} 129 | p'(x) = \lim_{h \to 0} \frac{(f(x+h) \times g(x+h)) - (f(x) \times g(x)) + v - v}{h} 130 | p'(x) = \lim_{h \to 0} \frac{(f(x+h) \times g(x+h)) - v + v - (f(x) \times g(x))}{h} 131 | p'(x) = \lim_{h \to 0} \frac{(f(x+h) \times g(x+h)) - (f(x) \times g(x+h)) + v - (f(x) \times g(x))}{h} 132 | p'(x) = \lim_{h \to 0} \frac{ ((f(x+h) - f(x)) \times g(x+h)) + v - (f(x) \times g(x))}{h} 133 | p'(x) = \lim_{h \to 0} \frac{ ((f(x+h) - f(x)) \times g(x+h)) + f(x) \times (g(x+h) - g(x))}{h} 134 | p'(x) = f'(x) \times \lim_{h \to 0} g(x+h)) + f(x) \times g'(x) 135 | p'(x) = f'(x) \times g(x) + f(x) \times g'(x) 136 | 137 | We replace :math:`\lim_{h \to 0} g(x+h)` with :math:`g(x)` since *h* is very 138 | very close to 0, and we know that since *g(x)* is a differentiable function it 139 | must be continuous. 140 | 141 | Division Rule 142 | -------------- 143 | 144 | The derivative of f(x)/g(x) is [f'(x)g(x)-f(x)g'(x)]/g(x)^2. 145 | That is the derivative of f(x) times the g(x) minus 146 | the f(x) times the derivative of g(x) divided by the g(x) squared 147 | 148 | Chain Rule 149 | ----------- 150 | 151 | The derivative of f(g(x)) is 152 | f'(g(x))*g'(x). 153 | 154 | Or, it is equal to the derivative of the outer function 155 | evaluated at the inner functions times the derivative of the inner function. 156 | 157 | 158 | Integrals 159 | ========== 160 | 161 | The area under the curve. 162 | 163 | Riemanns sum and Approximating a Definite Integral 164 | --------------------------------------------------- 165 | 166 | The formula 167 | :math:`Sum={{\sum}^{n}_{i=1} f(x^{*}_i)(x_i - x_{i-1}}` 168 | 169 | Essentially it means, I am trying to sum infinitely small bits 170 | 171 | Here is the catch: 172 | I am trying to approximate the given area by using n number of rectangles. 173 | Let's try to approximate the area under the function f(t) by using 5 rectangles 174 | 175 | :math:`{\int}f(t)dt {\approx} height_1 {\times}width +h_2w +h_3w+h_4w+h_5w` 176 | 177 | Instead of using 5 rectangles we can use n rectangles to have a better/worse 178 | approximation 179 | 180 | :math:`{\int}f(t)dt {\approx} height_1 {\times}width +h_2w +h_3w+h_4w+...+h_{n}w` 181 | 182 | Let's factor out the width since its constant 183 | 184 | $\int_i^a$ 185 | 186 | :math:`{\int}f(t)dt {\approx} w{\times}(h_1 +h_2 +h_3+h_4+...+h_{n})` 187 | 188 | We can express the sum in the parantheses with the sum notation as well 189 | 190 | :math:`{\int}f(t)dt {\approx} w{\times}{\sum^{n}_{i=1}}(h_i)` 191 | 192 | The value of the width of the rectangle is simply the difference between range 193 | of the function distributed to the range of the approximation. 
194 | So if the integral is taken over the interval [a,b] as in :math:`{\int}_{a}^{b}` 195 | Then the width for the n number of rectangles would be: 196 | :math:`width={\frac{b-a}{n}}` 197 | 198 | $\sum_i$ 199 | 200 | The height is simply the according y value of the x with respect to f(x). 201 | That is the meaning behind the notation :math:`f(x_{i}^{*}` 202 | 203 | So final form of the equation is the following 204 | 205 | .. math:: 206 | 207 | `{\int}_{a}^{b}f(t)dt{\approx}{\frac{b-a}{n}}{\times}{\sum^{n}_{i=1}}f(x_{i}^{*}` 208 | 209 | Now the part f(t) of the integral side should be rather obvious, 210 | the height of the rectangle, and we have seen that the a and b are 211 | related to the range of the function, then 212 | 213 | what's up with dx ? 214 | Well simply put dx is what happens when delta(x), that is x_i - x_{i-1}, approaches 215 | to the 0. So dx is the difference when the difference between any point in x axes, 216 | in the range of f(x) becomes very very very very very very very close to 0 217 | 218 | Now, when you have a quantity whose value is virtually zero, there's not much 219 | you can do with it. 2+dx is pretty much, well, 2. Or to take another example, 220 | 2/dx blows up to infinity. 221 | But there are two circumstances under which terms involving dx can yield a 222 | finite number. One is when you divide two differentials; for instance, 2dx/dx=2, 223 | and dy/dx can be just about anything. 224 | 225 | 226 | Line Integrals 227 | --------------- 228 | 229 | Sum of infinitely small areas under the curve within the range of f(x,y) 230 | 231 | This is multivariate calclulus and it is a slight generalization of what we had 232 | seen above in the definite integrals 233 | 234 | Now a normal integral is: 235 | 236 | - :math:`{\int}_{a}^{b}f(t)dt` where 237 | 238 | - :math:`\int` means sum 239 | - a is the lower range 240 | - b is the upper range 241 | - dt is the difference between t_i and t_{i-} when it is infinitely small 242 | 243 | A line integral is: 244 | 245 | - :math:`{\int}_{a}^{b}f(x,y)ds` 246 | 247 | Now let's see how we arrive to this: 248 | 249 | We have a function k(x) which is defined on a coordinate plane xy. 250 | The function maps the value of x to a value of y in the coordinate plane 251 | 252 | Now f(x,y) does the exact same thing in form. It takes the value of x and y 253 | and maps it to another value in third dimension let's say z for example. 254 | f(x,y) = z 255 | This means that we have now a third dimension z, to which our function f(x,y) 256 | maps to, so our plane now has three axis xyz 257 | 258 | What about ds ? It is actually the same as saying dz, that is the difference 259 | between z_i and z_{i-1} as it approaches to zero 260 | 261 | How does all this relate to our k(x) ? 262 | 263 | This is the tricky part 264 | 265 | Now let's say c(x) = y and g(y) = x 266 | then when x=t, y=c(t), and y=t, x=g(t) 267 | 268 | So given that a <= t <= b 269 | f(x,y) can be written as f(g(t), c(t)) 270 | 271 | So we can rewrite our line integral as follows: 272 | 273 | - :math:`{\int}_{a}^{b}f(g(t),c(t))ds` 274 | 275 | Now ds can actually be expressed in forms of dy and dx. 276 | Because simply put infinitely small change in the curve k(x) is going to result from 277 | infinitely small change in x direction and infinitely small change in y direction. 278 | Notice that all three measures are distance measures. 
279 | Let's break it down this way: 280 | dx = x_i - x_q 281 | dy = k(x_i) - k(x_q) 282 | ds = (x_i, k(x_i)) - (x_q, k(x_q)) 283 | 284 | Now the distance between two points are calculated with pythagoras theorem 285 | :math:`\sqrt{a^2 + b^2}` 286 | 287 | We plug in our points to pythagoras theorem 288 | 289 | :math:`\sqrt{(x_i - x_q)^2 + (k(x_i) - k(x_q))^2}` 290 | 291 | Based on the above mentioned equivalency this simply transforms to 292 | 293 | :math:`\sqrt{(dx)^2 + (dy)^2}` 294 | 295 | Then we can rewrite our line integral as follows 296 | 297 | - :math:`{\int}_{a}^{b}f(g(t),c(t)){\times}{\sqrt{(dx)^2 + (dy)^2}}` 298 | 299 | Now the problem is our point functions are all defined in t but our ds is expressed 300 | in dx and dy, how do we transform it 301 | 302 | Well let's suppose we multiplied the ds with dt/dt which 1, since we divide to equal 303 | quantities, so: 304 | 305 | - :math:`ds={\sqrt{(dx)^2 + (dy)^2}}{\times}{\frac{dt}{dt}}` 306 | 307 | If we reformulate the expression a bit 308 | 309 | - :math:`ds={\sqrt{ {\frac{1}{(dt)^2}} {\times} ((dx)^2 + (dy)^2) } }{\times}{{dt}}` 310 | 311 | - We simply put the 1/dt of the dt/dt expression inside the square root 312 | 313 | I continue this line of progress and distribute the 1/dt over the variables 314 | 315 | - :math:`ds={\sqrt{ {\frac{1}{(dt)^2}}{\times}(dx)^2 + {\frac{1}{(dt)^2}}{\times}(dy)^2) } }{\times}{{dt}}` 316 | 317 | The expression inside can be simplified as the following: 318 | 319 | - :math:`ds={\sqrt{ {\frac{(dx)^2}{(dt)^2}} + {\frac{(dy)^2}{(dt)^2}} } }{\times}{{dt}}` 320 | 321 | And by using simple properties of multiplication on fractions we can have the following: 322 | 323 | - :math:`ds={\sqrt{ ({\frac{dx}{dt}})^2 + ({\frac{dy}{dt}})^2 } }{\times}{{dt}}` 324 | 325 | Now notice that dy/dt and dx/dt are actually derivatives of c(t) and g(t) respectively 326 | that is they are c'(t)=dy/dt and g'(t)=dx/dt 327 | So the final form of our equation would be: 328 | 329 | - :math:`ds={\sqrt{ (g'(t))^2 + (c'(t))^2 } }{\times}{{dt}}` 330 | 331 | -------------------------------------------------------------------------------- /TopicNotes/RecurrentNeuralNetworks/courseNotes.rst: -------------------------------------------------------------------------------- 1 | ########################## 2 | Recurrent Neural Networks 3 | ########################## 4 | 5 | When to use RNNs ? 6 | 7 | RNNs are good with time series. 8 | Predicting stock market prices, 9 | speech recognition, etc. 10 | 11 | CNNs are good for images and videos, RNNs are good for sequential data analysis 12 | like financial time series. 13 | 14 | RNNs are used for speech recognition and machine translation. 15 | 16 | RNNs deal with ordered sequences 17 | 18 | It can be used to autocomplete stuff. Like you have a phrase: 19 | 20 | "My cat is the" and then the trained algorithm can fill the phrase 21 | as "My cat is the **best**" 22 | 23 | In this case, 24 | the input is the ordered sequence of words or characters 25 | the output is ordered sequence of characters. 26 | 27 | Machine translation: 28 | the input is ordered sequence of words in language X 29 | the output is ordered sequence of words in language Y 30 | 31 | Vanilla supervised learners and structured input 32 | ------------------------------------------------- 33 | 34 | Vanilla supervised learnes, like feedforward networks take generic input and 35 | spit out a structured input. 36 | They are basically function estimators. 37 | Their key problem is that they do not consider the input structure at all. 
38 | 39 | Now CNNs were a change to that since they detected the patterns, that is they 40 | take into consideration the spatial corelation between the pixels in an image. 41 | 42 | RNNs are designed to exploit ordered sequential structures. Basically time 43 | series are a good subject for these. 44 | 45 | RNN Notations 46 | -------------- 47 | 48 | What is ordered sequence ? 49 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 50 | 51 | Ordered sequence is a list of values ordered by an index. 52 | Index can be a list of numbers, or timestamps, or anything really. 53 | 54 | How to model ordered sequences ? 55 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 56 | 57 | Most ordered sequences are a product of some underlying process or processes, 58 | but these underlying processes can be blackboxes, so we have no model who 59 | explains the data. 60 | 61 | In this case we can model the process recursively, that is we can use past 62 | values to predict the future ones, meaning we can try to determine how 63 | future values mathematically dependent on the preceeding ones. 64 | 65 | What are seeds ? 66 | ~~~~~~~~~~~~~~~~~ 67 | 68 | Let's say we are dealing with the set of odd numbers: 69 | 70 | :math:`[1,3,5,7, ..., ]` 71 | 72 | First value is :math:`S_1=1`. 73 | That is it is a given. 74 | However all the other values can be generated from the first one by adding 2. 75 | This means that odd numbers set is a recursive sequence. 76 | That is the rest of the sequence can be generated by a model from one or more 77 | original values. The original values need to be given. 78 | 79 | These initialized/given values in the case of a recursive sequence, are called 80 | seeds 81 | 82 | What is the order of a recursive sequence ? 83 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 84 | 85 | The order of a recursive sequence is the number of prior elements it uses to 86 | produce future ones 87 | 88 | For example, the odd number set has the order of 1. Fibonacci sequence has the 89 | order of 2. 90 | 91 | Unfolded view 92 | ~~~~~~~~~~~~~~ 93 | 94 | The odd numbers set can actually be described by using the following function. 95 | 96 | :math:`f(s) = s + 2` 97 | 98 | or as follows: 99 | 100 | s_1 --f--> s_2 --f--> s_3 --f--> s_4 --f--> ... 101 | 102 | The name unfolded view applies both of these representations, where we show 103 | explicitely how each element of the sequence is generated. 104 | 105 | Folded view 106 | ~~~~~~~~~~~~ 107 | 108 | The odd numbers set can actually be described by using the following notation. 109 | 110 | :math:`s_t = f(s_{t-1})` 111 | 112 | Important note: A recursive sequence simply applies the same function to its 113 | output to generate the new input 114 | 115 | How to drive recursive sequence 116 | --------------------------------- 117 | 118 | Let's try to model how much money we have at the end of each month: 119 | 120 | I need to know: 121 | 122 | - The current value of my savings = h_1 123 | - Savings of each month would be = h_t for the month t 124 | 125 | What drives my account ? 126 | 127 | - Rent we pay, car lease, food shopping etc, drive it down. 128 | - Paycheck for the job, small investments etc, drive it up. 129 | 130 | - s_t = is my loss or income during the month, so a sum of downers and uppers 131 | 132 | Simple model for monthly savings would be: 133 | 134 | - h_1 = 0 135 | - h_2 = h_1 + s_1 136 | 137 | The folded view of the model would be: 138 | 139 | - h_t = h_{t-1} + s_{t-1} 140 | - where t = 2,3, ... 
141 | 142 | Driver sequence is what moves everything, here it corresponds to monthly income 143 | or loss. 144 | The hidden sequence is that which is generated by the driver sequence, here 145 | corresponds to h_t 146 | 147 | Injecting recursivity to supervised learners 148 | ---------------------------------------------- 149 | 150 | Suppose we have a set [1,3,5,7,9,11,13,15], we only have this set and nothing 151 | else. How do we arrive at the function f(s_t) = s_{t-1}+2 152 | We can't use sheer guessing. We need to find a function that generates the 153 | sequence. 154 | 155 | We can use a parametrized function and learn weights by fitting: 156 | 157 | s_1 = 1 158 | s_2 = g(s_1), where g(s_t) = w_0 + w_1{\times}s_{t-1} 159 | s_3 = g(s_2) 160 | 161 | Now we don't know the weights for w_0 and w_1. We simply implement least squares 162 | cost function to find that. 163 | 164 | For example: 165 | 166 | least square for s_2 is (s_2 - g(s_1))^2 167 | least square for s_3 is (s_3 - g(s_2))^2, etc. 168 | 169 | We then try to minimize the sum of these least squares. 170 | 171 | Formally :math:`min({\sum_{t}^{N}(s_t - (w_0 + w_1{\times}s_{t-1}))^2})` 172 | 173 | We are simply performing regression here. 174 | 175 | Windowing a sequence 176 | -------------------- 177 | 178 | During the above formula we have looked at pairs. For example 179 | (s_1,s_2), (s_2, s_3), etc. The reason we chose pairs was related to 180 | the formula we adopted, g(s_t) = w_0 + w_1{\times}s_{t-1}. These pairs 181 | are called windows, that is the parameters involved in a given iteration 182 | of the formula of the recursive sequence, establishes the window of that 183 | iteration. 184 | 185 | Keras and fitting 186 | ------------------ 187 | 188 | The above given model can be defined as follows in Keras: 189 | 190 | .. code-block:: python3 191 | 192 | model = Sequential() 193 | layer = Dense(1, input_dim=1, activation="linear") 194 | model.add(layer) 195 | model.compile(loss="mean_squared_error", optimizer="adam") 196 | model.fit(x,y, epocha=3000, batch_size=3, 197 | callbacks=callbacks_list, verbose=0) 198 | 199 | The point here is the following: 200 | 201 | Given an ordered sequence, we can make a recursive approximation to it by first 202 | making a random guess about the architecture of its recursive formula, then 203 | tuning the parameters of that architecture optimally using the sequence itself. 204 | 205 | 206 | We could have used the model in the traditional manner: Use a training set, 207 | validation set and test set, etc. So we can make predictions based on the 208 | training set. 209 | 210 | Thus we can actually use these networks as generative models. 211 | 212 | General Notes Recursivity in Supervised Learning 213 | ------------------------------------------------ 214 | 215 | There might be several architectures that model the recursive sequence which 216 | generated from a single recursive function. 217 | 218 | The sequence itself might not be recursive but we might find a recursive 219 | function that fits the sequence at hand, that is we can find the best 220 | recursive approximation to the given sequence. 221 | 222 | 223 | Why RNN instead Feedforward neural networks 224 | -------------------------------------------- 225 | 226 | The cost function we are using in the feed forward neural networks is Least 227 | Squares and from the perspective of probablilities this means that, we assume 228 | that the probability distribution is identically distributed among the 229 | independent elements of the set. 
General Notes on Recursivity in Supervised Learning
----------------------------------------------------

There might be several architectures that model a recursive sequence
generated by a single recursive function.

The sequence itself might not be recursive, but we might still find a
recursive function that fits the sequence at hand; that is, we can find the
best recursive approximation to the given sequence.

Why RNNs instead of feedforward neural networks
------------------------------------------------

The cost function we use in feedforward neural networks is least squares,
and from the perspective of probabilities this means we assume the elements
of the set are independent and identically distributed.

However, this contradicts the fact that we are assuming these elements are
part of an ordered sequence, even a recursive one: we act as if they exhibit
no structure in their distribution over the set.

Basically, if we were dealing with the odd numbers, we would be assuming that
3 and 5 are independent of each other, which is misleading, because if I
changed 3, the values of all the following elements in the sequence would
also change.

Feedforward networks structure their data as points with no edges between
them; however, we *know* that there are edges. That is the whole point of
referring to them as recursive sequences.

Simply put, feedforward neural networks use recursive approximation:

- They try to model dependency, hence the recursiveness.
- But when they tune the parameters, they treat the elements as independent,
  since the least squares assumption says they are.

Basic RNN derivation
=====================

With feedforward neural networks we modelled the recursivity but lost the
dependence while tuning the parameters.

RNNs are there to enforce the dependency between the levels. The way to do
this is to ensure that each level ingests its predecessor functionally. It
does so by taking two arguments: the sequence element and the hidden state.

Here is an unfolded view of a simple RNN architecture:

.. math::

    s_1 &\approx \hat{s}_1 = \alpha \\
    s_2 &\approx \hat{s}_2 = g(\hat{s}_1, s_1) \\
    s_3 &\approx \hat{s}_3 = g(\hat{s}_2, s_2) \\
    s_4 &\approx \hat{s}_4 = g(\hat{s}_3, s_3) \\
        &\vdots \\
    s_t &\approx \hat{s}_t = g(\hat{s}_{t-1}, s_{t-1})

Notation:

- :math:`s_t`, where :math:`t = 1, 2, 3, \dots`, are the elements of our
  recursive sequence.
- :math:`\hat{s}_t`, where :math:`t = 1, 2, 3, \dots`, are the hidden states
  driven by the sequence elements; they are the ones calculated from the
  sequence elements.
- :math:`\alpha` is the seed value.

The dependency is much more explicit here, since each hidden state depends
functionally on the preceding state.

Modifying the least squares loss
---------------------------------

The idea stays the same: we simply plug our functionally dependent levels
into the least squares loss.

For example:

- the squared error for :math:`s_2` is :math:`(s_2 - g(\hat{s}_1, s_1))^2`
- the squared error for :math:`s_3` is :math:`(s_3 - g(\hat{s}_2, s_2))^2`,
  etc.

In most cases we also wrap the output of :math:`g` in a linear combination as
the second term of the subtraction. This is heavily used in the case of RNNs,
but there is no formal justification for it. It modifies the loss function in
the following way:

.. math::

    \min\left(
    \sum_{t=2}^{N} \left(s_t - (w_0 + w_1 g(\hat{s}_{t-1}, s_{t-1}))\right)^2
    \right)
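Before the Keras version, here is a small sketch of how the unfolded
recursion and this loss could be computed by hand; the update :math:`g` and
its weights are made-up placeholders, not fitted values.

.. code-block:: python3

    import numpy as np

    def g(h_prev, s_prev, wh=0.5, ws=0.5, b=1.0):
        # hypothetical update mixing the hidden state and the sequence
        # element; wh, ws, b are placeholder weights, not fitted values
        return wh * h_prev + ws * s_prev + b

    seq = np.array([1, 3, 5, 7, 9], dtype=float)
    h = seq[0]      # the seed alpha for the hidden sequence
    loss = 0.0
    for t in range(1, len(seq)):
        h = g(h, seq[t - 1])      # \hat{s}_t = g(\hat{s}_{t-1}, s_{t-1})
        loss += (seq[t] - h) ** 2
    print(loss)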
Keras and RNN
---------------

The above model can be defined as follows in Keras:

.. code-block:: python3

    from keras.models import Sequential
    from keras.layers import Dense, SimpleRNN

    model = Sequential()
    # 3 hidden units; windows of 2 time steps with 1 feature each
    model.add(SimpleRNN(3, input_shape=(2, 1), activation="relu"))
    model.add(Dense(1))
    # "adam" is assumed here; any gradient-based optimizer would do
    model.compile(loss="mean_squared_error", optimizer="adam")

RNN and memory
----------------

Every level of an RNN contains the complete history of every sequence value
that precedes it. This is why RNNs have memory.

Technical notes
----------------

- RNNs need large datasets to function well.
- The vanishing gradient problem applies to RNNs as well.
- There are different level architectures, for example Long Short-Term
  Memory (LSTM).
- There are variations on stochastic gradient descent.
- A longer sequence means a deeper network, and a larger window as well.

Implementing a Character-wise RNN
----------------------------------

A character-wise RNN means that the network learns one character at a time
and generates new characters one character at a time.

Basically, the network takes a character as input and outputs a probability
distribution over the likely next character.

Sequence Batching
-----------------

We are working on sequences of data; by taking a sequence and splitting it
into multiple shorter sequences, we can take advantage of matrix operations
to make training more efficient.

- batch size: the number of shorter sequences we are using
- batch length: the length of the sequences we feed into the network
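A sketch of one common way to build such batches follows; wrapping the last
target column around is a simplification often used in character-level RNN
examples, and the helper name is hypothetical.

.. code-block:: python3

    import numpy as np

    def make_batches(seq, batch_size, n_steps):
        """Split seq into batch_size rows, then yield n_steps-wide windows.

        batch_size is the number of shorter sequences; n_steps is the
        batch length in the sense above.
        """
        # keep only what fits evenly into batch_size * n_steps chunks
        n_chunks = len(seq) // (batch_size * n_steps)
        seq = np.asarray(seq)[: n_chunks * batch_size * n_steps]
        rows = seq.reshape(batch_size, -1)
        for i in range(0, rows.shape[1], n_steps):
            x = rows[:, i:i + n_steps]
            # targets are the inputs shifted one step; wrapping the last
            # column around is a common simplification
            y = np.roll(x, -1, axis=1)
            yield x, y

    for x, y in make_batches(np.arange(20), batch_size=2, n_steps=5):
        print(x.shape)  # (2, 5)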