├── TopicNotes ├── ComputerVision │ ├── .#CreatingFilterCannyEdgeDetector.ipynb │ └── .ipynb_checkpoints │ │ ├── Untitled-checkpoint.ipynb │ │ ├── CornerDetector-checkpoint.ipynb │ │ ├── PizzaBlueScreen-checkpoint.ipynb │ │ ├── CannyEdgeDetector-checkpoint.ipynb │ │ └── HistogramOfOrientedGradients-checkpoint.ipynb ├── BayesNets │ ├── Bayes Nets Subtitles.zip │ └── subtitles │ │ ├── 9 - 34 - Two Test Cancer 2 V4 - lang_en_vs51.srt │ │ ├── 26 - General Bayes Net Solution - lang_en_vs51.srt │ │ ├── 28 - General Bayes Net 2 Solution - lang_en_vs51.srt │ │ ├── 30 - General Bayes Net 3 Solution - lang_en_vs51.srt │ │ ├── 31 - Value Of A Network - lang_en_vs51.srt │ │ ├── 27 - General Bayes Net 2 - lang_en_vs51.srt │ │ ├── 5 - Bayes Network Solution - lang_en_vs51.srt │ │ ├── 29 - General Bayes Net 3 - lang_en_vs51.srt │ │ ├── 32 - D Separation - lang_en_vs51.srt │ │ ├── 13 - Conditional Independence2 - lang_en_vs51.srt │ │ ├── 21 - 46 - Explaining Away 2 V3 - lang_en_vs1.srt │ │ ├── 15 - Absolute And Conditional - lang_en_vs51.srt │ │ ├── 2 - Challenge Question - lang_en_vs52.srt │ │ ├── 12 - Conditional Independence - lang_en_vs50.srt │ │ ├── 18 - Confounding Cause Solution - lang_en_vs51.srt │ │ ├── 34 - 60 - D Separation 2 - lang_en_vs1.srt │ │ ├── 33 - D Separation Solution - lang_en_vs51.srt │ │ ├── 16 - Absolute And Conditional Solution - lang_en_vs51.srt │ │ ├── 10 - 35 - Two Test Cancer 2 Solution V4 - lang_en_vs51.srt │ │ ├── 1 - Introduction - lang_en_vs51.srt │ │ ├── 7 - 32 - Two Test Cancer V6 - lang_en_vs51.srt │ │ ├── 20 - 45 - Explaining Away Solution V3 - lang_en_vs1.srt │ │ ├── 24 - 49 - Explaining Away 3 Solution V3 - lang_en_vs1.srt │ │ ├── 23 - 48 - Explaining Away 3 V4 - lang_en_vs1.srt │ │ ├── 25 - Conditional Dependence - lang_en_vs51.srt │ │ ├── 36 - 63 - D Separation 3 Solution - lang_en_vs1.srt │ │ ├── 35 - D Separation 2 Solution - lang_en_vs51.srt │ │ ├── 19 - 44 - Explaining Away V5 - lang_en_vs1.srt │ │ ├── 8 - 33 - Two Test Cancer Solution V4 - lang_en_vs51.srt │ │ ├── 4 - 29 - Bayes Network Merged FINAL - lang_en_vs51.srt │ │ ├── 17 - Confounding Cause - lang_en_vs51.srt │ │ ├── 22 - 47 - Explaining Away 2 Solution V3 - lang_en_vs1.srt │ │ ├── 14 - Conditional Independence2 Solution - lang_en_vs51.srt │ │ ├── 11 - Conditional Independence - lang_en_vs51.srt │ │ ├── 6 - 31 - Computing Bayes Rules Merged FINAL - lang_en_vs51.srt │ │ └── 3 - Challenge Question Solution - lang_en_vs52.srt ├── reinforcementLearning │ └── Notes.rst ├── KnowledgeBasedAI │ └── Notes.rst ├── MachineLearning │ ├── UnsupervisedLearning.rst │ └── SupervisedNotes.rst ├── LSTM │ └── CourseNotes.rst ├── GeneralAdverserialNetworks │ └── courseNotes.rst ├── Planning │ └── PlanningMDP.rst ├── Logic │ └── courseNotes.rst ├── CapsNets │ └── capsnets.rst ├── Hyperparameters │ └── CourseNotes.rst ├── Search │ └── CourseNotes.rst ├── InferenceBayesNets │ └── CourseNotes.rst ├── Introduction_to_AI │ └── CourseNotes.rst └── RecurrentNeuralNetworks │ └── courseNotes.rst ├── MathNotes ├── Probability.rst ├── trigonemetry.rst ├── Precalculus.rst ├── IntroProjectiveGeometry.rst ├── N-Dimensional-Geometry.md.html └── DifferentialIntegralCalculus.rst ├── LICENSE └── BookNotes ├── ElementsOfProbabilityAndStatistics.rst └── ComputerGraphicsPrinciplesPracticeInC.rst /TopicNotes/ComputerVision/.#CreatingFilterCannyEdgeDetector.ipynb: -------------------------------------------------------------------------------- 1 | kaan@mb-Precision-7510.5748:1533826696 
-------------------------------------------------------------------------------- /TopicNotes/BayesNets/Bayes Nets Subtitles.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/D-K-E/ai-notes/master/TopicNotes/BayesNets/Bayes Nets Subtitles.zip -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/CornerDetector-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/PizzaBlueScreen-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/CannyEdgeDetector-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/ComputerVision/.ipynb_checkpoints/HistogramOfOrientedGradients-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/9 - 34 - Two Test Cancer 2 V4 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,380 --> 00:00:02,678 3 | Calculate for 4 | me the probability of cancer, 5 | 6 | 2 7 | 00:00:02,678 --> 00:00:08,590 8 | given that you have received one positive 9 | and one negative test result. 10 | 11 | 3 12 | 00:00:08,590 --> 00:00:10,450 13 | Please write your number into this box. 14 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/26 - General Bayes Net Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,540 --> 00:00:02,090 3 | And the answer is 13. 4 | 5 | 2 6 | 00:00:02,090 --> 00:00:06,500 7 | 1 over here, 2 over here, 8 | and 4 over here. 9 | 10 | 3 11 | 00:00:06,500 --> 00:00:11,081 12 | Simply speaking, 13 | any variable that has K 14 | 15 | 4 16 | 00:00:11,081 --> 00:00:15,245 17 | inputs requires 2 to 18 | the K such variables. 19 | 20 | 5 21 | 00:00:15,245 --> 00:00:18,469 22 | So in total we have 1, 9, 13.
23 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/28 - General Bayes Net 2 Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,012 --> 00:00:06,640 3 | And the answer is 19, so one here, 4 | one here, one here, two here, two here. 5 | 6 | 2 7 | 00:00:06,640 --> 00:00:10,537 8 | Two arrows pointing to G which makes for 9 | four and 10 | 11 | 3 12 | 00:00:10,537 --> 00:00:15,033 13 | three arrows pointing to D, 14 | two to the three is eight. 15 | 16 | 4 17 | 00:00:15,033 --> 00:00:16,799 18 | So you get 1, 2, 3, 8, 2, 2, 19 | 4; if you add those up, it's 19. 20 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/30 - General Bayes Net 3 Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,225 --> 00:00:03,125 3 | To answer this question, 4 | let us add up these numbers. 5 | 6 | 2 7 | 00:00:03,125 --> 00:00:08,109 8 | [INAUDIBLE] is 1, 1, 1, 9 | this is one incoming arc, so 10 | 11 | 3 12 | 00:00:08,109 --> 00:00:11,600 13 | it's 2, two incoming arcs make 4, 14 | 15 | 4 16 | 00:00:11,600 --> 00:00:16,129 17 | one incoming arc is 2, 2 equals 4. 18 | 19 | 5 20 | 00:00:16,129 --> 00:00:23,954 21 | Four incoming arcs make 16, 22 | we add all the red numbers, we get 47. 23 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/31 - Value Of A Network - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,510 --> 00:00:04,570 3 | So it takes 47 numerical 4 | probabilities to specify 5 | 6 | 2 7 | 00:00:05,570 --> 00:00:10,840 8 | the joint compared to 65,000 if 9 | we didn't have the structure. 10 | 11 | 3 12 | 00:00:10,840 --> 00:00:15,440 13 | I think this example really illustrates 14 | the advantage of compact Bayes network 15 | 16 | 4 17 | 00:00:15,440 --> 00:00:16,750 18 | representations 19 | 20 | 5 21 | 00:00:16,750 --> 00:00:18,720 22 | over unstructured joint representations. 23 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/27 - General Bayes Net 2 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,430 --> 00:00:01,740 3 | Here's another quiz. 4 | 5 | 2 6 | 00:00:01,740 --> 00:00:06,810 7 | How many parameters do we need to 8 | specify the joint distribution for 9 | 10 | 3 11 | 00:00:06,810 --> 00:00:11,510 12 | this Bayes net over here, where A, B, and 13 | C point into D. 14 | 15 | 4 16 | 00:00:11,510 --> 00:00:12,850 17 | D points into E and F. 18 | 19 | 5 20 | 00:00:12,850 --> 00:00:15,320 21 | And F and C also point into G? 22 | 23 | 6 24 | 00:00:15,320 --> 00:00:17,290 25 | Please write your answer into this box. 26 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/5 - Bayes Network Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,090 --> 00:00:02,240 3 | The answer is 3. 4 | 5 | 2 6 | 00:00:02,240 --> 00:00:08,666 7 | It takes one parameter to specify 8 | P(A) from which we derive P(-A). 9 | 10 | 3 11 | 00:00:08,666 --> 00:00:14,946 12 | It takes two parameters to 13 | specify P(B | A) and P(B | -A).
14 | 15 | 4 16 | 00:00:14,946 --> 00:00:21,396 17 | From which we can derive P(-B | A) and 18 | P(-B | -A). 19 | 20 | 5 21 | 00:00:21,396 --> 00:00:24,214 22 | So it's a total of 3 parameters for 23 | this Bayes' Network. 24 | 25 | 6 26 | 00:00:24,214 --> 00:00:29,849 27 | [BLANK_AUDIO] 28 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/29 - General Bayes Net 3 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,610 --> 00:00:04,590 3 | And here's our car network which we 4 | discussed at the very beginning of this 5 | 6 | 2 7 | 00:00:04,590 --> 00:00:05,230 8 | unit. 9 | 10 | 3 11 | 00:00:05,230 --> 00:00:09,776 12 | How many parameters do we 13 | need to specify this network? 14 | 15 | 4 16 | 00:00:09,776 --> 00:00:14,508 17 | Remember, there are 16 18 | total variables and 19 | 20 | 5 21 | 00:00:14,508 --> 00:00:22,891 22 | the naive joint over those 16 will 23 | be 2 to the 16 minus 1, which is 65,535. 24 | 25 | 6 26 | 00:00:22,891 --> 00:00:23,880 27 | Please write your answer 28 | into this box over here. 29 | -------------------------------------------------------------------------------- /MathNotes/Probability.rst: -------------------------------------------------------------------------------- 1 | ############### 2 | Probability 3 | ############### 4 | 5 | Bayes rule: 6 | 7 | - P(A|B) = (P(B|A)·P(A)) / P(B) 8 | 9 | - P(A|B): Posterior 10 | - P(B|A): Likelihood 11 | - P(A): Prior 12 | - P(B): Marginal Likelihood 13 | 14 | - P(B): :math:`{\sum_a{P(B|A=a) \cdot P(A=a)}}` 15 | Worked example, from the Bayes Nets challenge question (g = graduated, o1 and o2 = offers from companies 1 and 2): 16 | 0.50 = P(g|o1)·P(o1) /0.9 17 | 0.45 = P(g|o1)·P(o1) 18 | 19 | 0.05 = P(~g|o1)·P(o1) / 0.1 20 | 0.005 = P(~g|o1)·P(o1) 21 | 22 | 0.455 = P(o1)(P(~g|o1) + P(g|o1)) 23 | 0.455 = P(o1) 24 | 25 | 0.15 = P(g|o2)·P(o2) /0.9 26 | 0.6 = P(g|o2)·P(o2) 27 | 28 | 0.25 = (P(~g|o2)·P(o2)) / 0.1 29 | 0.025 = P(~g|o2)·P(o2) 30 | 31 | 0.625 = P(o2)(P(~g|o2) + P(g|o2)) 32 | 0.625 = P(o2) 33 | 34 | 35 | P(o2|o1) = (P(o1|o2)·P(o2)) / P(o1) 36 | 37 | P(o1) = P(o1|o2)·P(o2) 38 | 39 | 0.455 = P(o1|o2)·0.625 40 | 0.728 = P(o1|o2) 41 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/32 - D Separation - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,570 --> 00:00:04,360 3 | The next concept I'd like to 4 | teach you is called D-Separation. 5 | 6 | 2 7 | 00:00:04,360 --> 00:00:08,680 8 | And let me start the discussion 9 | of this concept by a quiz. 10 | 11 | 3 12 | 00:00:08,680 --> 00:00:10,410 13 | We have here a Bayes network and 14 | 15 | 4 16 | 00:00:10,410 --> 00:00:14,160 17 | then we will ask conditional 18 | independence questions. 19 | 20 | 5 21 | 00:00:14,160 --> 00:00:19,302 22 | Is C independent of A, 23 | please tell me yes or no. 24 | 25 | 6 26 | 00:00:19,302 --> 00:00:24,478 27 | Is C independent of A given B, 28 | is C independent of D, 29 | 30 | 7 31 | 00:00:24,478 --> 00:00:30,736 32 | is C independent of D given A, and 33 | is E independent of C given D.
34 | 35 | 8 36 | 00:00:30,736 --> 00:00:32,589 37 | [BLANK_AUDIO] 38 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/13 - Conditional Independence2 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,810 --> 00:00:05,100 3 | So let me draw the cancer 4 | example again with two tests. 5 | 6 | 2 7 | 00:00:05,100 --> 00:00:07,797 8 | Here's my cancer variable, and 9 | 10 | 3 11 | 00:00:07,797 --> 00:00:12,530 12 | there's two conditionally independent tests, 13 | T1 and T2. 14 | 15 | 4 16 | 00:00:12,530 --> 00:00:17,020 17 | And as before let me assume that 18 | the probability of cancer is 0.01. 19 | 20 | 5 21 | 00:00:17,020 --> 00:00:22,360 22 | What I want you to compute for 23 | me is the probability 24 | 25 | 6 26 | 00:00:22,360 --> 00:00:30,010 27 | of the second test to be positive if we 28 | know that the first test was positive. 29 | 30 | 7 31 | 00:00:30,010 --> 00:00:33,321 32 | So write this into the following box. 33 | 34 | 8 35 | 00:00:33,321 --> 00:00:36,019 36 | [BLANK_AUDIO] 37 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/21 - 46 - Explaining Away 2 V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,630 --> 00:00:05,419 3 | Now, to understand the explain away 4 | effect, you have to compare this to 5 | 6 | 2 7 | 00:00:05,419 --> 00:00:09,266 8 | the probability of a raise 9 | given that we're just happy and 10 | 11 | 3 12 | 00:00:09,266 --> 00:00:12,188 13 | we don't know anything 14 | about the weather. 15 | 16 | 4 17 | 00:00:12,188 --> 00:00:14,077 18 | So let's do that exercise next. 19 | 20 | 5 21 | 00:00:14,077 --> 00:00:18,915 22 | So my next quiz is what's 23 | the probability of a raise given that 24 | 25 | 6 26 | 00:00:18,915 --> 00:00:23,953 27 | all I know is that I'm happy and 28 | I don't know about the weather? 29 | 30 | 7 31 | 00:00:23,953 --> 00:00:28,282 32 | Now this happens to be, once again, 33 | a pretty complicated question. 34 | 35 | 8 36 | 00:00:28,282 --> 00:00:28,980 37 | So take your time. 38 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/15 - Absolute And Conditional - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,236 --> 00:00:04,557 3 | So now we've learned about independence 4 | and the corresponding Bayes network has 5 | 6 | 2 7 | 00:00:04,557 --> 00:00:07,130 8 | two nodes, 9 | they're just not connected at all. 10 | 11 | 3 12 | 00:00:07,130 --> 00:00:09,379 13 | And we learned about 14 | conditional independence, 15 | 16 | 4 17 | 00:00:09,379 --> 00:00:12,008 18 | in which case we have a Bayes 19 | network that looks like this. 20 | 21 | 5 22 | 00:00:12,008 --> 00:00:16,896 23 | Now I would like to know whether 24 | absolute independence implies 25 | 26 | 6 27 | 00:00:16,896 --> 00:00:20,410 28 | conditional independence, true or false? 29 | 30 | 7 31 | 00:00:20,410 --> 00:00:24,012 32 | And I'd also like to know whether 33 | conditional independence implies 34 | 35 | 8 36 | 00:00:24,012 --> 00:00:25,386 37 | absolute independence. 38 | 39 | 9 40 | 00:00:25,386 --> 00:00:27,190 41 | Again, true or false?
42 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 DKE 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/2 - Challenge Question - lang_en_vs52.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,320 --> 00:00:03,020 3 | Here is a challenge question to give 4 | a taste of how we are going to 5 | 6 | 2 7 | 00:00:03,020 --> 00:00:07,460 8 | construct and use Bayesian networks to 9 | make inferences about new situations. 10 | 11 | 3 12 | 00:00:07,460 --> 00:00:10,720 13 | The small net given here models part 14 | of the situation where a student 15 | 16 | 4 17 | 00:00:10,720 --> 00:00:14,220 18 | was about to graduate and 19 | applied for jobs at two companies. 20 | 21 | 5 22 | 00:00:14,220 --> 00:00:16,690 23 | G represents that the student, 24 | indeed, graduated. 25 | 26 | 6 27 | 00:00:16,690 --> 00:00:20,550 28 | O1 and O2 represent that the student 29 | received a job offer from company 1 30 | 31 | 7 32 | 00:00:20,550 --> 00:00:22,430 33 | and company 2, respectively. 34 | 35 | 8 36 | 00:00:22,430 --> 00:00:25,385 37 | Calculate the probability that the 38 | student received an offer from company 39 | 40 | 9 41 | 00:00:25,385 --> 00:00:28,505 42 | 2, given that we know that she 43 | received an offer from company 1. 44 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/12 - Conditional Independence - lang_en_vs50.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,000 --> 00:00:03,241 3 | And the correct answer is no. 4 | 5 | 2 6 | 00:00:03,241 --> 00:00:07,899 7 | Intuitively, getting a positive 8 | test result for cancer 9 | 10 | 3 11 | 00:00:07,899 --> 00:00:12,479 12 | gives us information about 13 | whether you have cancer or not. 14 | 15 | 4 16 | 00:00:12,479 --> 00:00:14,780 17 | So if you get a positive test result, 18 | 19 | 5 20 | 00:00:14,780 --> 00:00:18,277 21 | you're going to raise 22 | the probability of having cancer 23 | 24 | 6 25 | 00:00:18,277 --> 00:00:19,811 26 | relative to the prior probability.
27 | 28 | 7 29 | 00:00:19,811 --> 00:00:22,694 30 | With that increased probability, 31 | 32 | 8 33 | 00:00:22,694 --> 00:00:28,170 34 | we will predict that another test will, 35 | with a higher likelihood, 36 | 37 | 9 38 | 00:00:28,170 --> 00:00:33,951 39 | give us a positive response than if 40 | we hadn't taken the previous test. 41 | 42 | 10 43 | 00:00:33,951 --> 00:00:35,401 44 | This is really important to understand. 45 | 46 | 11 47 | 00:00:35,401 --> 00:00:41,410 48 | So to make sure we understand it, let me make 49 | you calculate those probabilities. 50 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/18 - Confounding Cause Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,610 --> 00:00:02,885 3 | The answer is surprisingly simple. 4 | 5 | 2 6 | 00:00:02,885 --> 00:00:08,240 7 | It is 0.01, but 8 | how do you know this so fast? 9 | 10 | 3 11 | 00:00:08,240 --> 00:00:12,359 12 | Well, if we look at this 13 | Bayes network, both the sunniness and 14 | 15 | 4 16 | 00:00:12,359 --> 00:00:17,080 17 | the question whether I got 18 | a raise impact my happiness. 19 | 20 | 5 21 | 00:00:17,080 --> 00:00:22,240 22 | But since I don't know anything about 23 | the happiness, there's no way that 24 | 25 | 6 26 | 00:00:22,240 --> 00:00:28,260 27 | just the weather might impact 28 | whether I get a raise or not. 29 | 30 | 7 31 | 00:00:28,260 --> 00:00:32,549 32 | In fact, 33 | it might independently be sunny and 34 | 35 | 8 36 | 00:00:32,549 --> 00:00:35,230 37 | I might independently get a raise at work. 38 | 39 | 9 40 | 00:00:35,230 --> 00:00:41,350 41 | There's no mechanism by which 42 | these two things would co-occur. 43 | 44 | 10 45 | 00:00:41,350 --> 00:00:45,240 46 | Therefore the probability of 47 | a raise given it is sunny, 48 | 49 | 11 50 | 00:00:45,240 --> 00:00:50,499 51 | is just the same as the probability of 52 | a raise given any weather, which is 0.01. 53 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/34 - 60 - D Separation 2 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:01,650 --> 00:00:04,205 3 | In this specific example, 4 | 5 | 2 6 | 00:00:04,205 --> 00:00:06,900 7 | the rule applies very, very simply. 8 | 9 | 3 10 | 00:00:06,900 --> 00:00:08,980 11 | Any two variables are independent, 12 | 13 | 4 14 | 00:00:08,980 --> 00:00:12,050 15 | if they're not linked by just unknown variables. 16 | 17 | 5 18 | 00:00:12,050 --> 00:00:14,540 19 | So for example, given knowledge of B, 20 | 21 | 6 22 | 00:00:14,540 --> 00:00:16,820 23 | everything downstream of B 24 | 25 | 7 26 | 00:00:16,820 --> 00:00:22,860 27 | becomes independent of anything upstream of B. E is now independent of C, 28 | 29 | 8 30 | 00:00:22,860 --> 00:00:28,761 31 | conditioned on B; however, knowledge of B does not render A and E independent. 32 | 33 | 9 34 | 00:00:28,761 --> 00:00:30,405 35 | In this graph over here, 36 | 37 | 10 38 | 00:00:30,405 --> 00:00:32,330 39 | A and B connect to C, 40 | 41 | 11 42 | 00:00:32,330 --> 00:00:37,100 43 | and C connects to D and to E. So let me ask you, 44 | 45 | 12 46 | 00:00:37,100 --> 00:00:41,395 47 | is A independent of E? 48 | 49 | 13 50 | 00:00:41,395 --> 00:00:43,435 51 | A independent of E given B? A independent of E given C? 52 | 53 | 14 54 | 00:00:43,435 --> 00:00:44,810 55 | A independent of B?
56 | 57 | 15 58 | 00:00:44,810 --> 00:00:48,000 59 | And A independent of B given C? 60 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/33 - D Separation Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,440 --> 00:00:03,719 3 | So C is not independent of A. 4 | 5 | 2 6 | 00:00:03,719 --> 00:00:09,063 7 | In fact, A influences C by virtue of B. 8 | 9 | 3 10 | 00:00:09,063 --> 00:00:13,388 11 | But if you know B, 12 | then A becomes independent of C, 13 | 14 | 4 15 | 00:00:13,388 --> 00:00:17,022 16 | which means the only 17 | determinant of C is B. 18 | 19 | 5 20 | 00:00:17,022 --> 00:00:17,919 21 | If you know B for 22 | 23 | 6 24 | 00:00:17,919 --> 00:00:21,857 25 | sure, then knowledge of A won't 26 | really tell you anything about C. 27 | 28 | 7 29 | 00:00:21,857 --> 00:00:27,606 30 | C is also not independent of D, just 31 | the same way C is not independent of A. 32 | 33 | 8 34 | 00:00:27,606 --> 00:00:31,410 35 | If I learn something about D, 36 | I can infer more about C. 37 | 38 | 9 39 | 00:00:31,410 --> 00:00:36,980 40 | But if I do know A, then it's hard to 41 | imagine how knowledge of D would help 42 | 43 | 10 44 | 00:00:36,980 --> 00:00:42,030 45 | me with C because I can't learn anything 46 | more about A than knowing A already. 47 | 48 | 11 49 | 00:00:42,030 --> 00:00:44,950 50 | Therefore, given A, 51 | C and D are independent. 52 | 53 | 12 54 | 00:00:44,950 --> 00:00:48,470 55 | The same is true for E and C. 56 | 57 | 13 58 | 00:00:48,470 --> 00:00:51,670 59 | If you know D then E and 60 | C become independent. 61 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/16 - Absolute And Conditional Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,320 --> 00:00:02,887 3 | And the answer is, 4 | both of them are false. 5 | 6 | 2 7 | 00:00:02,887 --> 00:00:07,090 8 | We already saw that conditional 9 | independence, as shown over here, 10 | 11 | 3 12 | 00:00:07,090 --> 00:00:09,703 13 | doesn't give us absolute independence. 14 | 15 | 4 16 | 00:00:09,703 --> 00:00:12,749 17 | So, for example, this is test 18 | number one and test number two, 19 | 20 | 5 21 | 00:00:12,749 --> 00:00:14,482 22 | you might or might not have cancer. 23 | 24 | 6 25 | 00:00:14,482 --> 00:00:18,226 26 | Our first test gives us information 27 | about whether you have cancer or not. 28 | 29 | 7 30 | 00:00:18,226 --> 00:00:21,809 31 | And as a result, 32 | we've changed our prior probability for 33 | 34 | 8 35 | 00:00:21,809 --> 00:00:24,410 36 | the second test to come in positive. 37 | 38 | 9 39 | 00:00:24,410 --> 00:00:30,240 40 | That means conditional independence 41 | does not imply absolute independence, 42 | 43 | 10 44 | 00:00:30,240 --> 00:00:32,759 45 | which renders this assumption, 46 | here, false. 47 | 48 | 11 49 | 00:00:32,759 --> 00:00:37,330 50 | And it also turns out that if 51 | you have absolute independence, 52 | 53 | 12 54 | 00:00:37,330 --> 00:00:40,040 55 | things might not be 56 | conditionally independent for 57 | 58 | 13 59 | 00:00:40,040 --> 00:00:45,060 60 | reasons that I can't quite explain so 61 | far but that we will learn about next.
62 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/10 - 35 - Two Test Cancer 2 Solution V4 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,550 --> 00:00:05,630 3 | We try the same trick as before, 4 | where we use the exact same prior. 5 | 6 | 2 7 | 00:00:05,630 --> 00:00:10,658 8 | Our first plus gives 9 | the following factors, 0.9, 0.2. 10 | 11 | 3 12 | 00:00:10,658 --> 00:00:14,558 13 | But our minus gives us 14 | the probability 0.1 for 15 | 16 | 4 17 | 00:00:14,558 --> 00:00:18,179 18 | a negative test result 19 | in the event of cancer. 20 | 21 | 5 22 | 00:00:18,179 --> 00:00:25,170 23 | And a 0.8 for a negative 24 | result in the event of not having cancer. 25 | 26 | 6 27 | 00:00:25,170 --> 00:00:28,760 28 | You multiply those together, 29 | you get a non-normalized probability. 30 | 31 | 7 32 | 00:00:28,760 --> 00:00:33,573 33 | And if you now normalize by 34 | the sum of those two things to 35 | 36 | 8 37 | 00:00:33,573 --> 00:00:36,920 38 | turn the statement into a probability, 39 | 40 | 9 41 | 00:00:36,920 --> 00:00:42,578 42 | you get 0.0009 over the sum of 43 | those two things over here. 44 | 45 | 10 46 | 00:00:42,578 --> 00:00:47,498 47 | And this is 0.0056 for the chance 48 | 49 | 11 50 | 00:00:47,498 --> 00:00:52,568 51 | of having cancer, and 0.9943 for 52 | 53 | 12 54 | 00:00:52,568 --> 00:00:57,650 55 | the chance of being cancer-free. 56 | 57 | 13 58 | 00:00:57,650 --> 00:00:59,720 59 | And this adds up approximately to 1, 60 | 61 | 14 62 | 00:00:59,720 --> 00:01:01,620 63 | and therefore it's 64 | a probability distribution. 65 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/1 - Introduction - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,270 --> 00:00:02,940 3 | Sebastian, why are Bayes Nets important? 4 | 5 | 2 6 | 00:00:02,940 --> 00:00:05,781 7 | >> They're one of the most amazing 8 | accomplishments of recent artificial 9 | 10 | 3 11 | 00:00:05,781 --> 00:00:06,910 12 | intelligence. 13 | 14 | 4 15 | 00:00:06,910 --> 00:00:10,510 16 | They take the idea of uncertainty and 17 | probability and 18 | 19 | 5 20 | 00:00:10,510 --> 00:00:12,900 21 | marry it with efficient structures. 22 | 23 | 6 24 | 00:00:12,900 --> 00:00:16,155 25 | So you don't have a big mumble jumble 26 | but you can really understand, 27 | 28 | 7 29 | 00:00:16,155 --> 00:00:20,014 30 | like what uncertain variable 31 | influences another uncertain variable. 32 | 33 | 8 34 | 00:00:20,014 --> 00:00:25,580 35 | A theory of the world using just 36 | Bayes Networks is really amazing. 37 | 38 | 9 39 | 00:00:25,580 --> 00:00:28,870 40 | >> Well, I've been teaching AI for 41 | many years, I've found that listening to 42 | 43 | 10 44 | 00:00:28,870 --> 00:00:32,590 45 | Sebastian's lessons has really given 46 | me a better intuition for Bayes Nets. 47 | 48 | 11 49 | 00:00:33,690 --> 00:00:37,130 50 | As you go through these lessons, 51 | make sure to understand how to construct 52 | 53 | 12 54 | 00:00:37,130 --> 00:00:39,080 55 | a Bayes Net and 56 | how to do inference with them. 57 | 58 | 13 59 | 00:00:39,080 --> 00:00:43,810 60 | One of my favorite algorithms 61 | is Monte Carlo Markov Chain. 62 | 63 | 14 64 | 00:00:43,810 --> 00:00:44,480 65 | Have fun with the lesson.
66 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/7 - 32 - Two Test Cancer V6 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,290 --> 00:00:03,460 3 | The reason why I gave you all this 4 | is because I wanted to apply it now 5 | 6 | 2 7 | 00:00:03,460 --> 00:00:08,530 8 | to a slightly more complicated problem, 9 | which is the two test cancer example. 10 | 11 | 3 12 | 00:00:08,530 --> 00:00:14,310 13 | In this example, we, again, 14 | might have our unobservable cancer c, 15 | 16 | 4 17 | 00:00:14,310 --> 00:00:18,540 18 | but now we're running two tests, 19 | test one and test two. 20 | 21 | 5 22 | 00:00:18,540 --> 00:00:24,074 23 | As before, 24 | the prior probability of cancer is 0.01. 25 | 26 | 6 27 | 00:00:24,074 --> 00:00:29,893 28 | The probability of receiving a positive 29 | test result for either test is 0.9. 30 | 31 | 7 32 | 00:00:29,893 --> 00:00:36,388 33 | Probability of getting a negative 34 | result, given being cancer-free, is 0.8. 35 | 36 | 8 37 | 00:00:36,388 --> 00:00:40,863 38 | And from those we were able to 39 | compute all the other probabilities. 40 | 41 | 9 42 | 00:00:40,863 --> 00:00:42,788 43 | I'm just going to write 44 | them down over here. 45 | 46 | 10 47 | 00:00:42,788 --> 00:00:45,425 48 | So take a moment to just write my notes. 49 | 50 | 11 51 | 00:00:45,425 --> 00:00:49,407 52 | Now let's assume both of my 53 | tests come back positive. 54 | 55 | 12 56 | 00:00:49,407 --> 00:00:55,263 57 | So T1 = +, and T2 = +. 58 | 59 | 13 60 | 00:00:55,263 --> 00:01:02,630 61 | What's the probability of cancer now, 62 | written in short form P(C|++)? 63 | 64 | 14 65 | 00:01:02,630 --> 00:01:06,080 66 | I want you to tell me what that is. 67 | 68 | 15 69 | 00:01:06,080 --> 00:01:07,620 70 | And this is a non-trivial question. 71 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/20 - 45 - Explaining Away Solution V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,730 --> 00:00:06,488 3 | The answer is approximately 0.0142. 4 | 5 | 2 6 | 00:00:06,488 --> 00:00:10,266 7 | And as an exercise in expanding 8 | this term using Bayes' rule, 9 | 10 | 3 11 | 00:00:10,266 --> 00:00:14,120 12 | using total probability, 13 | which I will just do for you. 14 | 15 | 4 16 | 00:00:14,120 --> 00:00:20,945 17 | Using Bayes' rule, 18 | you can transform this into P of H, 19 | 20 | 5 21 | 00:00:20,945 --> 00:00:26,764 22 | given R, S times P of R 23 | given S over P of H given S. 24 | 25 | 6 26 | 00:00:26,764 --> 00:00:30,488 27 | You observe the 28 | independence of R and S, 29 | 30 | 7 31 | 00:00:30,488 --> 00:00:33,540 32 | to simplify this to just P of R. 33 | 34 | 8 35 | 00:00:33,540 --> 00:00:38,867 36 | And the denominator is expanded 37 | by folding in R and not R. 38 | 39 | 9 40 | 00:00:38,867 --> 00:00:44,822 41 | You have P(H | R,S) times 42 | P(R) plus P(H | -R,S) 43 | 44 | 10 45 | 00:00:44,822 --> 00:00:49,870 46 | times P(-R), which is total probability.
47 | 48 | 11 49 | 00:00:49,870 --> 00:00:54,757 50 | You can now read off the numbers 51 | from the tables over here, 52 | 53 | 12 54 | 00:00:54,757 --> 00:00:59,946 55 | which gives us 1 times 0.01 56 | divided by this expression, which is 57 | 58 | 13 59 | 00:00:59,946 --> 00:01:06,058 60 | the same as the expression over here, 61 | 0.01 plus this thing over here. 62 | 63 | 14 64 | 00:01:06,058 --> 00:01:10,522 65 | Which we can find over here 66 | to be 0.7 times this one over 67 | 68 | 15 69 | 00:01:10,522 --> 00:01:15,097 70 | here, which is one minus 71 | the value over here, 0.99. 72 | 73 | 16 74 | 00:01:15,097 --> 00:01:19,752 75 | Which gives us approximately 0.0142. 76 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/24 - 49 - Explaining Away 3 Solution V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,610 --> 00:00:05,080 3 | Now the answer follows the exact 4 | same scheme as before, 5 | 6 | 2 7 | 00:00:05,080 --> 00:00:07,800 8 | with S being replaced by not S. 9 | 10 | 3 11 | 00:00:07,800 --> 00:00:12,910 12 | So this should be an easier question for 13 | you to answer, P of R given H, 14 | 15 | 4 16 | 00:00:12,910 --> 00:00:17,830 17 | and not S, can be inverted by 18 | Bayes' rule to be as follows. 19 | 20 | 5 21 | 00:00:17,830 --> 00:00:20,290 22 | Once we apply Bayes' rule, 23 | as indicated over here, 24 | 25 | 6 26 | 00:00:20,290 --> 00:00:23,700 27 | where we sort H to the left side and 28 | R to the right side, 29 | 30 | 7 31 | 00:00:23,700 --> 00:00:29,670 32 | you can observe that this value over 33 | here can be readily found in the table. 34 | 35 | 8 36 | 00:00:29,670 --> 00:00:32,320 37 | It's actually the 0.9 over there. 38 | 39 | 9 40 | 00:00:32,320 --> 00:00:36,631 41 | This value over here, 42 | the raise, is independent in the 43 | 44 | 10 45 | 00:00:37,825 --> 00:00:40,380 46 | Bayes network, so it's just 0.01. 47 | 48 | 11 49 | 00:00:40,380 --> 00:00:46,870 50 | And as before, we apply total 51 | probability, the expression over here. 52 | 53 | 12 54 | 00:00:46,870 --> 00:00:49,729 55 | And we obtain this 56 | quotient over here, 57 | 58 | 13 59 | 00:00:49,729 --> 00:00:52,598 60 | that these two expressions are the same. 61 | 62 | 14 63 | 00:00:52,598 --> 00:00:54,680 64 | P of H given not S, not R, 65 | 66 | 15 67 | 00:00:54,680 --> 00:00:57,013 68 | is the value over here. 69 | 70 | 16 71 | 00:00:57,013 --> 00:01:00,485 72 | And the 0.99 is the complement 73 | of the probability of R, 74 | 75 | 17 76 | 00:01:00,485 --> 00:01:06,143 77 | taken from over here and 78 | that ends up being 0.0833. 79 | 80 | 18 81 | 00:01:06,143 --> 00:01:08,785 82 | This will be the right answer. 83 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/23 - 48 - Explaining Away 3 V4 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,500 --> 00:00:02,630 3 | And if you got this right, 4 | 5 | 2 6 | 00:00:02,630 --> 00:00:06,470 7 | I will be deeply impressed by 8 | the fact you got this right. 9 | 10 | 3 11 | 00:00:06,470 --> 00:00:10,810 12 | My happiness is well explained 13 | by the fact that it's sunny. 14 | 15 | 4 16 | 00:00:10,810 --> 00:00:13,770 17 | So if someone observes you to 18 | be happy and asks the question, 19 | 20 | 5 21 | 00:00:13,770 --> 00:00:16,740 22 | is this because Sebastian 23 | got a raise at work?
24 | 25 | 6 26 | 00:00:16,740 --> 00:00:21,300 27 | Well, if you know it's sunny, then 28 | this is a fairly good explanation for 29 | me being happy. 30 | 31 | 8 32 | 00:00:23,230 --> 00:00:25,830 33 | You don't have to assume I got a raise. 34 | 35 | 9 36 | 00:00:25,830 --> 00:00:28,508 37 | If you don't know about the weather, 38 | 39 | 10 40 | 00:00:28,508 --> 00:00:33,791 41 | then obviously the chances are higher 42 | that the raise caused my happiness. 43 | 44 | 11 45 | 00:00:33,791 --> 00:00:39,274 46 | And therefore this number 47 | goes up from 0.014 to 0.018. 48 | 49 | 12 50 | 00:00:39,274 --> 00:00:44,276 51 | Let me ask you one final 52 | question in this next quiz which 53 | 54 | 13 55 | 00:00:44,276 --> 00:00:50,822 56 | is the probability of a raise given 57 | that I look happy and it's not sunny. 58 | 59 | 14 60 | 00:00:50,822 --> 00:00:55,398 61 | This is the most extreme case for 62 | making a raise likely because 63 | 64 | 15 65 | 00:00:55,398 --> 00:01:00,640 66 | I am a happy guy and it's definitely 67 | not caused by the weather. 68 | 69 | 16 70 | 00:01:00,640 --> 00:01:04,300 71 | So it could be just random or 72 | it could be caused by the raise. 73 | 74 | 17 75 | 00:01:04,300 --> 00:01:08,260 76 | So please calculate this number for 77 | me and enter it into this box. 78 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/25 - Conditional Dependence - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,370 --> 00:00:03,262 3 | It's really interesting to compare 4 | this to the situation over here. 5 | 6 | 2 7 | 00:00:03,262 --> 00:00:07,387 8 | In both cases, I'm happy, 9 | as shown over here. 10 | 11 | 3 12 | 00:00:07,387 --> 00:00:12,052 13 | And I asked the same question, 14 | which is whether I got a raise at work, 15 | 16 | 4 17 | 00:00:12,052 --> 00:00:13,270 18 | this R over here. 19 | 20 | 5 21 | 00:00:13,270 --> 00:00:16,638 22 | But in one case, 23 | I observe that the weather is sunny and 24 | 25 | 6 26 | 00:00:16,638 --> 00:00:18,230 27 | in the other it isn't. 28 | 29 | 7 30 | 00:00:18,230 --> 00:00:21,557 31 | And look what it does to my probability 32 | of having received a raise. 33 | 34 | 8 35 | 00:00:21,557 --> 00:00:26,055 36 | The sunniness perfectly 37 | well explains my happiness. 38 | 39 | 9 40 | 00:00:26,055 --> 00:00:31,127 41 | And my probability of 42 | having received a raise 43 | 44 | 10 45 | 00:00:31,127 --> 00:00:36,340 46 | ends up being a mere 1.4%, or 0.014. 47 | 48 | 11 49 | 00:00:36,340 --> 00:00:40,874 50 | However, if my wife observes it to 51 | be not sunny, then it is much more 52 | 53 | 12 54 | 00:00:40,874 --> 00:00:45,344 55 | likely that the cause of my happiness 56 | is related to a raise at work. 57 | 58 | 13 59 | 00:00:45,344 --> 00:00:47,988 60 | And now the probability is 8.3%, 61 | 62 | 14 63 | 00:00:47,988 --> 00:00:52,043 64 | which is significantly 65 | higher than the 1.4% before. 66 | 67 | 15 68 | 00:00:52,043 --> 00:00:56,268 69 | Now, this is a Bayes' 70 | network in which S and 71 | 72 | 16 73 | 00:00:56,268 --> 00:01:01,993 74 | R are independent, but 75 | H adds a dependence between S and R. 76 | 77 | 17 78 | 00:01:01,993 --> 00:01:06,520 79 | Let me talk about this in a little 80 | bit more detail on the next paper.
81 | 82 | 18 83 | 00:01:06,520 --> 00:01:06,980 84 | So here's a 85 | -------------------------------------------------------------------------------- /MathNotes/trigonemetry.rst: -------------------------------------------------------------------------------- 1 | ############### 2 | Trigonometry 3 | ############### 4 | 5 | Notes from: 6 | 7 | Loney, Sidney Luxton. Plane Trigonometry. Cambridge, 1893. 8 | Klaf, Albert. Trigonometry Refresher. Reprint. New York, 2005. 9 | 10 | Chapter 1: Measurement of angles 11 | ================================== 12 | 13 | Angles are measured in terms of the right angle: 14 | 15 | - **Degree** is a unit of the right angle, which is divided into 90 equal parts. 16 | 17 | - **Minute** is a unit of the degree, which is divided into 60 equal parts. 18 | 19 | - **Second** is a unit of the minute, which is divided into 60 equal parts. 20 | 21 | So :math:`46°23'39''` reads as 46 degrees 23 minutes 39 seconds. 22 | 23 | This system is called the **Sexagesimal** system of measurement. This system is 24 | widely used in practical applications of trigonometry. 25 | 26 | There is also another, **Centesimal**, system which divides the right angle into 100 27 | parts, each called a **grade**; each grade is divided into 100 parts called 28 | minutes, and each minute is divided into 100 parts called seconds. 29 | 30 | There is also a third system, **Circular** measurement. This system is used in 31 | higher branches of mathematics. 32 | 33 | It is preferred in higher branches of mathematics because there is no 34 | arbitrary number such as 90 or 360 in its definition. 35 | 36 | Let's take any circle centered on point O, and let's pick two points A and P on the 37 | perimeter of this circle such that the arc AP is equal in length to the radius. The 38 | angle AOP is then called a radian. How do we obtain and ensure the constancy of this angle? 39 | 40 | Let's suppose that there are two circles with a common center O. 41 | 42 | Let's suppose n lines going from O to the outer circle. n lines intersect the 43 | outer circle at points *A, B, C, D, E, F, ...*. These lines intersect the inner 44 | circle at points *a, b, c, d, e, f, ...*. 45 | 46 | Let's join point *a* to *b*, *b* to *c* etc, as well as *A* to *B*, etc. 47 | 48 | Since the angle BOA is the same for sides *ab* and *AB*, and :math:`aO = bO` 49 | and :math:`AO = BO`, the triangles *aOb* and *AOB* are similar, so :math:`ab/AB = aO/AO`. 50 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/36 - 63 - D Separation 3 Solution - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:01,740 --> 00:00:05,200 3 | And the answer is yes. 4 | 5 | 2 6 | 00:00:05,200 --> 00:00:09,130 7 | F is independent of A. Well, defined by our rules of d-separation is 8 | 9 | 3 10 | 00:00:09,130 --> 00:00:13,405 11 | that F is dependent on D and A is dependent on D. But, 12 | 13 | 4 14 | 00:00:13,405 --> 00:00:18,285 15 | if you don't know D, you can't derive any dependency between A and F at all. 16 | 17 | 5 18 | 00:00:18,285 --> 00:00:20,485 19 | Now, if you do know D, 20 | 21 | 6 22 | 00:00:20,485 --> 00:00:22,360 23 | then F and A become dependent. 24 | 25 | 7 26 | 00:00:22,360 --> 00:00:27,580 27 | And the reason is B and E are dependent given D. And we can 28 | 29 | 8 30 | 00:00:27,580 --> 00:00:29,690 31 | transform this back into the dependency of A 32 | 33 | 9 34 | 00:00:29,690 --> 00:00:33,745 35 | and F because B and A are dependent, and E and F are dependent.
36 | 37 | 10 38 | 00:00:33,745 --> 00:00:37,690 39 | There's an active path between A and F, 40 | 41 | 11 42 | 00:00:37,690 --> 00:00:39,535 43 | which goes across here, and here, 44 | 45 | 12 46 | 00:00:39,535 --> 00:00:41,640 47 | because D is known. 48 | 49 | 13 50 | 00:00:41,640 --> 00:00:44,735 51 | Now if you know G, the same thing is true because G gives us knowledge about 52 | 53 | 14 54 | 00:00:44,735 --> 00:00:48,010 55 | D and D can be applied back to this path over here. 56 | 57 | 15 58 | 00:00:48,010 --> 00:00:49,090 59 | However, if you know H, 60 | 61 | 16 62 | 00:00:49,090 --> 00:00:50,225 63 | that's not the case. 64 | 65 | 17 66 | 00:00:50,225 --> 00:00:52,310 67 | So, H might tell us something about G, 68 | 69 | 18 70 | 00:00:52,310 --> 00:00:55,585 71 | but it doesn't tell us anything about D, and therefore, 72 | 73 | 19 74 | 00:00:55,585 --> 00:00:58,450 75 | we have no reason to close the path 76 | 77 | 20 78 | 00:00:58,450 --> 00:01:02,643 79 | between A and F. The path between A and F is still passive, 80 | 81 | 21 82 | 00:01:02,643 --> 00:01:05,070 83 | even though we have knowledge of H. 84 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/35 - D Separation 2 Solution - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,410 --> 00:00:02,560 3 | And the answer to this one 4 | is really interesting. 5 | 6 | 2 7 | 00:00:02,560 --> 00:00:06,572 8 | A is clearly not independent of 9 | E because through C we can see 10 | 11 | 3 12 | 00:00:06,572 --> 00:00:08,105 13 | an influence of A to E. 14 | 15 | 4 16 | 00:00:08,105 --> 00:00:10,690 17 | Given B that doesn't change. 18 | 19 | 5 20 | 00:00:10,690 --> 00:00:14,710 21 | A still influences C 22 | despite the fact we know B. 23 | 24 | 6 25 | 00:00:14,710 --> 00:00:17,260 26 | However if we know C, 27 | the influence is cut off. 28 | 29 | 7 30 | 00:00:17,260 --> 00:00:21,500 31 | There is no way A can 32 | influence E if we know C. 33 | 34 | 8 35 | 00:00:21,500 --> 00:00:23,544 36 | A is clearly independent of B. 37 | 38 | 9 39 | 00:00:23,544 --> 00:00:27,830 40 | They're different entry variables, 41 | they have no incoming arcs. 42 | 43 | 10 44 | 00:00:27,830 --> 00:00:32,750 45 | >> But here is the caveat, 46 | given C, A and B become dependent. 47 | 48 | 11 49 | 00:00:32,750 --> 00:00:36,031 50 | >> So whereas initially A and 51 | B were independent, 52 | 53 | 12 54 | 00:00:36,031 --> 00:00:38,579 55 | if you are given C, they become dependent. 56 | 57 | 13 58 | 00:00:38,579 --> 00:00:43,040 59 | And the reason why they become 60 | dependent we've studied before is 61 | 62 | 14 63 | 00:00:43,040 --> 00:00:45,340 64 | the explaining-away effect. 65 | 66 | 15 67 | 00:00:45,340 --> 00:00:48,170 68 | If we know for example C to be true, 69 | 70 | 16 71 | 00:00:48,170 --> 00:00:53,550 72 | then knowledge of A will substantially 73 | affect what we believe about B. 74 | 75 | 17 76 | 00:00:53,550 --> 00:00:56,870 77 | If there are two joint causes for C and 78 | 79 | 18 80 | 00:00:56,870 --> 00:01:00,550 81 | we happen to know A is true, 82 | we will discredit cause B. 83 | 84 | 19 85 | 00:01:00,550 --> 00:01:05,324 86 | If we happen to know A is false we'll 87 | increase our belief for the cause B. 88 | 89 | 20 90 | 00:01:05,324 --> 00:01:09,366 91 | This was an effect we studied 92 | extensively in the happiness example I 93 | 94 | 21 95 | 00:01:09,366 --> 00:01:10,780 96 | gave you before.
97 | 98 | 22 99 | 00:01:10,780 --> 00:01:15,710 100 | The interesting thing here is you're 101 | facing a situation where knowledge of 102 | 103 | 23 104 | 00:01:15,710 --> 00:01:21,780 105 | a variable C renders previously 106 | independent variables dependent. 107 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/19 - 44 - Explaining Away V5 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,380 --> 00:00:03,762 3 | And let me talk about 4 | a really interesting and 5 | 6 | 2 7 | 00:00:03,762 --> 00:00:09,374 8 | special instance of Bayes Net Reasoning 9 | which is called explaining away. 10 | 11 | 3 12 | 00:00:09,374 --> 00:00:12,915 13 | And I'll first give you 14 | the intuitive answer, and 15 | 16 | 4 17 | 00:00:12,915 --> 00:00:15,962 18 | then I wish you to 19 | compute probabilities for 20 | 21 | 5 22 | 00:00:15,962 --> 00:00:20,590 23 | me that manifest the explain away 24 | effect in a Bayes net of this type. 25 | 26 | 6 27 | 00:00:20,590 --> 00:00:25,513 28 | Explaining away means that, 29 | if we know that we are happy, 30 | 31 | 7 32 | 00:00:25,513 --> 00:00:31,178 33 | then sunny weather can explain 34 | away the cause of happiness. 35 | 36 | 8 37 | 00:00:31,178 --> 00:00:33,100 38 | If I then also know that it's sunny, 39 | 40 | 9 41 | 00:00:33,100 --> 00:00:37,460 42 | it becomes less likely 43 | that I received a raise. 44 | 45 | 10 46 | 00:00:37,460 --> 00:00:40,760 47 | So let me put this differently, suppose 48 | I'm a happy guy on a specific day, and 49 | 50 | 11 51 | 00:00:40,760 --> 00:00:45,020 52 | my wife asks me, Sebastian, 53 | why are you so happy? 54 | 55 | 12 56 | 00:00:45,020 --> 00:00:47,920 57 | Is it sunny, or did you get a raise? 58 | 59 | 13 60 | 00:00:47,920 --> 00:00:50,504 61 | If she then looks outside and 62 | sees it's sunny, 63 | 64 | 14 65 | 00:00:50,504 --> 00:00:55,105 66 | then she might explain to herself, well, 67 | Sebastian is happy because it's sunny. 68 | 69 | 15 70 | 00:00:55,105 --> 00:00:59,038 71 | It makes it effectively, 72 | less likely that he got a raise, 73 | 74 | 16 75 | 00:00:59,038 --> 00:01:03,910 76 | because I could already explain 77 | this happiness by it being sunny. 78 | 79 | 17 80 | 00:01:03,910 --> 00:01:09,083 81 | If she looks outside and it's rainy, 82 | it makes it more likely I got a raise, 83 | 84 | 18 85 | 00:01:09,083 --> 00:01:13,117 86 | because the weather can't 87 | really explain my happiness. 88 | 89 | 19 90 | 00:01:13,117 --> 00:01:18,260 91 | In other words, if you see a certain 92 | effect that could be caused 93 | 94 | 20 95 | 00:01:18,260 --> 00:01:22,935 96 | by multiple causes, 97 | seeing one of those causes can explain 98 | 99 | 21 100 | 00:01:22,935 --> 00:01:27,439 101 | away any other potential cause 102 | of this effect over here. 103 | 104 | 22 105 | 00:01:27,439 --> 00:01:32,254 106 | So let me put this in numbers, and 107 | ask you the challenging question of 108 | 109 | 23 110 | 00:01:32,254 --> 00:01:37,410 111 | what's the probability of the raise 112 | given I'm happy, and it's sunny?
113 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/8 - 33 - Two Test Cancer Solution V4 - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,420 --> 00:00:06,093 3 | So the correct answer is 0.1698 4 | 5 | 2 6 | 00:00:06,093 --> 00:00:10,526 7 | approximately. 8 | 9 | 3 10 | 00:00:10,526 --> 00:00:15,370 11 | And to compute this, 12 | I used the trick I've shown you before. 13 | 14 | 4 15 | 00:00:15,370 --> 00:00:20,551 16 | Let me write down the running count for 17 | cancer, and 18 | 19 | 5 20 | 00:00:20,551 --> 00:00:28,033 21 | for not cancer, as I integrate 22 | the various multiplications in Bayes' rule. 23 | 24 | 6 25 | 00:00:28,033 --> 00:00:34,937 26 | My prior for cancer was 0.01, 27 | and for non-cancer was 0.99. 28 | 29 | 7 30 | 00:00:34,937 --> 00:00:36,948 31 | Then I get my first +, and 32 | 33 | 8 34 | 00:00:36,948 --> 00:00:41,710 35 | the probability of a plus given 36 | that we have cancer is 0.9. 37 | 38 | 9 39 | 00:00:41,710 --> 00:00:46,593 40 | And the same for non-cancer is 0.2. 41 | 42 | 10 43 | 00:00:46,593 --> 00:00:49,987 44 | So according to 45 | the non-normalized Bayes' rule, 46 | 47 | 11 48 | 00:00:49,987 --> 00:00:54,623 49 | I now multiply these two things 50 | together to get my non-normalized 51 | 52 | 12 53 | 00:00:54,623 --> 00:00:58,680 54 | probability of 55 | having cancer given the plus. 56 | 57 | 13 58 | 00:00:58,680 --> 00:01:01,639 59 | Since multiplication is commutative, 60 | 61 | 14 62 | 00:01:01,639 --> 00:01:07,126 63 | I can do the same thing again with 64 | my second test result, 0.9, 0.2. 65 | 66 | 15 67 | 00:01:07,126 --> 00:01:11,867 68 | And I multiply all of these three things 69 | together to get my non-normalized 70 | probability, P prime, 71 | to be the following. 72 | 73 | 16 74 | 00:01:11,867 --> 00:01:14,911 75 | 0.0081 if you multiply 76 | those things together. 77 | 78 | 17 79 | 00:01:14,911 --> 00:01:18,479 80 | 0.0081 if you multiply 81 | those things together. 82 | 83 | 18 84 | 00:01:19,490 --> 00:01:25,539 85 | And 0.0396 if you multiply 86 | these guys together. 87 | 88 | 19 89 | 00:01:25,539 --> 00:01:27,560 90 | And these are not a probability. 91 | 92 | 20 93 | 00:01:27,560 --> 00:01:31,315 94 | If we add those for 95 | the two complementary, 96 | 97 | 21 98 | 00:01:31,315 --> 00:01:35,603 99 | with cancer, non-cancer, I get 0.0477. 100 | 101 | 22 102 | 00:01:35,603 --> 00:01:37,974 103 | However, I now divide. 104 | 105 | 23 106 | 00:01:37,974 --> 00:01:42,413 107 | That is, I normalize those non-normalized 108 | probabilities over 109 | 110 | 24 111 | 00:01:42,413 --> 00:01:44,920 112 | here by this factor over here. 113 | 114 | 25 115 | 00:01:44,920 --> 00:01:47,410 116 | I actually get the correct 117 | posterior probability. 118 | 119 | 26 120 | 00:01:47,410 --> 00:01:50,056 121 | P(C) given ++. 122 | 123 | 27 124 | 00:01:50,056 --> 00:01:55,873 125 | And they look as follows, 126 | approximately 0.1698 and 127 | 128 | 28 129 | 00:01:55,873 --> 00:01:59,021 130 | approximately 0.8301.
129 | -------------------------------------------------------------------------------- /TopicNotes/reinforcementLearning/Notes.rst: -------------------------------------------------------------------------------- 1 | ########################## 2 | Reinforcement Learning 3 | ########################## 4 | 5 | Forms of Learning: 6 | 7 | - Supervised Learning: (x_1, y_1), (x_2, y_2): y = f(x) 8 | Ex: Speech recognition 9 | - Unsupervised Learning: x_1, x_2, x_3, ... P(X=x_1), we search for a probability distribution, or a cluster 10 | Ex: Clustering, light emissions coming from space. 11 | - Reinforcement Learning: state, action, state, action, ... . There are some rewards associated with some of these states. Rewards are just scalar numbers, positive, negative etc. We search for an optimal policy that tells us what to do in any given state. 12 | Ex: Elevator controller 13 | 14 | MDP Review - Markov Decision Process 15 | ===================================== 16 | 17 | MDPs consist of a set of states, a set of actions available in a state, a transition function which gives a result state, and a reward function. 18 | 19 | Formally, :math:`s {\in} S`, 20 | :math:`a {\in} Actions(s)`, 21 | :math:`P(s'|s,a)`, 22 | also an initial state :math:`s_0`, 23 | sometimes a reward function that is general over the triplet 24 | :math:`R(s,a,s')` or sometimes we only talk about 25 | the result state :math:`R(s')` 26 | 27 | To solve an MDP, we try to find a policy that maximizes the discounted total reward. 28 | 29 | Reinforcement Learning 30 | ======================== 31 | 32 | Reinforcement Learning comes into play when we don't know the reward function or even the transition model of the world. 33 | When we don't know these we can't solve the MDP; however, with reinforcement learning, 34 | we can find these, either by interacting with the world, or you can learn substitutes that tell you as much as you 35 | need to know, so that you never have to actually compute the reward or the transition model. 36 | 37 | Several choices exist for modeling: 38 | 39 | +---------------------+--------+-----------+---------+ 40 | | agent | know | learn | Utility | 41 | +=====================+========+===========+=========+ 42 | | utility based agent | P | R ? -> U | U | 43 | +---------------------+--------+-----------+---------+ 44 | | Q Learning | ? | Q(s,a) | Q | 45 | +---------------------+--------+-----------+---------+ 46 | | reflex agent | ? | Policy(s) | Policy | 47 | +---------------------+--------+-----------+---------+ 48 | 49 | Q(s,a): is like a utility function evaluating state-action pairs. We don't know the transition model and reward function. 50 | Reflex agent: is just about the policy; it is called a reflex agent because it is pure stimulus-response. 51 | 52 | Passive Reinforcement Learning: 53 | Passive means the agent has a fixed policy and executes that. During that process it learns about R and/or P, the transition model. 54 | 55 | Active Reinforcement Learning 56 | Active implies that we change the policy as we learn about the world. 57 | 58 | Passive Temporal Difference Learning 59 | ------------------------------------- 60 | 61 | This needs a table of utilities for each state; we also keep track of how many times we visited each state (a minimal sketch of the update follows below).
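A minimal sketch of the passive TD update just described, assuming a decaying learning rate alpha = 1/N(s) and a discount of 0.9; both are illustrative choices, not values given in these notes:

.. code:: python

    # Passive TD learning: follow the fixed policy and update the utility
    # table U from each observed transition (s, reward, s_next).
    from collections import defaultdict

    GAMMA = 0.9              # discount factor (assumed value)
    U = {}                   # table of utilities, starts out blank
    N = defaultdict(int)     # table of visit counts, starts at zero

    def td_update(s, reward, s_next):
        """One temporal-difference update along the fixed policy's run."""
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        N[s] += 1
        alpha = 1.0 / N[s]   # decaying learning rate, one common schedule
        U[s] += alpha * (reward + GAMMA * U[s_next] - U[s])

With this schedule, frequently visited states receive ever smaller updates, so the estimates settle toward the utilities of the fixed policy.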
62 | Table of utilities should start out as blank, and the table of number of visits should start out with zeros 63 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/4 - 29 - Bayes Network Merged FINAL - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,500 --> 00:00:02,390 3 | So we want to draw Bayes rule rapidly. 4 | 5 | 2 6 | 00:00:02,390 --> 00:00:06,830 7 | We have a situation where we 8 | have an internal variable A, 9 | 10 | 3 11 | 00:00:06,830 --> 00:00:10,670 12 | like whether or not we have cancer. 13 | 14 | 4 15 | 00:00:10,670 --> 00:00:12,520 16 | But we can't sense A, 17 | 18 | 5 19 | 00:00:12,520 --> 00:00:17,180 20 | instead we have a second variable 21 | called B, which is our test. 22 | 23 | 6 24 | 00:00:17,180 --> 00:00:19,330 25 | And B is observable, but A isn't. 26 | 27 | 7 28 | 00:00:19,330 --> 00:00:24,090 29 | This is a classical 30 | example of a Base network. 31 | 32 | 8 33 | 00:00:24,090 --> 00:00:28,130 34 | The base network is composed 35 | of two variables A and B. 36 | 37 | 9 38 | 00:00:28,130 --> 00:00:32,210 39 | We know the prior probability for A and 40 | 41 | 10 42 | 00:00:32,210 --> 00:00:36,070 43 | we know the conditional 44 | A causes B whether or 45 | 46 | 11 47 | 00:00:36,070 --> 00:00:39,730 48 | not the cancer causes the test 49 | result to be positive or not. 50 | 51 | 12 52 | 00:00:39,730 --> 00:00:41,590 53 | Although with some randomness involved. 54 | 55 | 13 56 | 00:00:41,590 --> 00:00:47,750 57 | So we know about the probability of 58 | B given the different values for A. 59 | 60 | 14 61 | 00:00:47,750 --> 00:00:52,160 62 | And what we care about in this specific 63 | instance is called diagnostic reasoning. 64 | 65 | 15 66 | 00:00:52,160 --> 00:00:56,480 67 | Which is inverse of 68 | the causal reasoning. 69 | 70 | 16 71 | 00:00:56,480 --> 00:01:04,290 72 | Probability of A given B, or similarly, 73 | probability of A given off B. 74 | 75 | 17 76 | 00:01:04,290 --> 00:01:09,540 77 | This is our very first Bayes Network and 78 | the graphical representation 79 | 80 | 18 81 | 00:01:09,540 --> 00:01:13,700 82 | of drawn two variables, A and 83 | B, connected with an arrow. 84 | 85 | 19 86 | 00:01:13,700 --> 00:01:20,450 87 | Because from A to B is a graphical 88 | representation of distribution of two 89 | 90 | 20 91 | 00:01:20,450 --> 00:01:26,740 92 | variables that specified the structure 93 | of a and this is a probability. 94 | 95 | 21 96 | 00:01:26,740 --> 00:01:29,380 97 | And has a conditional 98 | probability which is normal. 99 | 100 | 22 101 | 00:01:29,380 --> 00:01:32,220 102 | Now I do have a quick quiz for you. 103 | 104 | 23 105 | 00:01:32,220 --> 00:01:36,690 106 | How many parameters does it 107 | take to specify the entire 108 | 109 | 24 110 | 00:01:36,690 --> 00:01:41,440 111 | joint probability between A and 112 | B or the entire Bayes Network? 113 | 114 | 25 115 | 00:01:41,440 --> 00:01:43,440 116 | I'm not looking for 117 | a structural parameters. 118 | 119 | 26 120 | 00:01:44,790 --> 00:01:47,380 121 | We look to the graph over here, 122 | just looking for 123 | 124 | 27 125 | 00:01:47,380 --> 00:01:50,220 126 | the numerical parameters of 127 | the underlying probabilities. 
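A minimal sketch (my addition, not part of the transcript): for this two-variable network, three numbers — P(A), P(B|A), and P(B|not A) — already determine the full joint distribution. The values below are illustrative, borrowed from the cancer-test example in these notes:

```Python
p_a = 0.01             # P(A): prior, e.g. P(cancer); illustrative value
p_b_given_a = 0.9      # P(B | A); illustrative value
p_b_given_not_a = 0.2  # P(B | not A); illustrative value

joint = {}
for a in (True, False):
    pa = p_a if a else 1 - p_a
    pb = p_b_given_a if a else p_b_given_not_a
    for b in (True, False):
        joint[(a, b)] = pa * (pb if b else 1 - pb)

# the 3 parameters fix all 4 joint entries, and the entries sum to 1
assert abs(sum(joint.values()) - 1.0) < 1e-12
```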
128 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/17 - Confounding Cause - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,310 --> 00:00:04,680 3 | For my next example, I will study 4 | a different type of a Bayes network. 5 | 6 | 2 7 | 00:00:04,680 --> 00:00:07,838 8 | Before we've seen networks 9 | of the following type, 10 | 11 | 3 12 | 00:00:07,838 --> 00:00:12,370 13 | where a single, hidden cause 14 | caused two different measurements. 15 | 16 | 4 17 | 00:00:12,370 --> 00:00:15,600 18 | I now want to study a network that 19 | looks just like the opposite. 20 | 21 | 5 22 | 00:00:15,600 --> 00:00:19,110 23 | We have two independent, 24 | hidden causes, but 25 | 26 | 6 27 | 00:00:19,110 --> 00:00:24,060 28 | they get confounded within a single, 29 | observational variable. 30 | 31 | 7 32 | 00:00:24,060 --> 00:00:28,120 33 | I would like to use 34 | the example of happiness. 35 | 36 | 8 37 | 00:00:28,120 --> 00:00:32,080 38 | Suppose I can be happy or unhappy. 39 | 40 | 9 41 | 00:00:32,080 --> 00:00:36,157 42 | What makes me happy is 43 | when the weather is sunny, 44 | 45 | 10 46 | 00:00:36,157 --> 00:00:41,040 47 | or if I get a raise at my job, 48 | which means I make more money. 49 | 50 | 11 51 | 00:00:41,040 --> 00:00:45,601 52 | So let's call this sunny, let's call 53 | this a raise and call this happiness. 54 | 55 | 12 56 | 00:00:45,601 --> 00:00:52,093 57 | Perhaps, the probability 58 | of it being sunny is 0.7. 59 | 60 | 13 61 | 00:00:52,093 --> 00:00:57,279 62 | The probability of a raise is 0.01. 63 | 64 | 14 65 | 00:00:57,279 --> 00:01:03,290 66 | And I will tell you if the probability 67 | of being happy is governed as false. 68 | 69 | 15 70 | 00:01:03,290 --> 00:01:08,812 71 | The probability of being happy given 72 | that both of these things occur, 73 | 74 | 16 75 | 00:01:08,812 --> 00:01:11,722 76 | I got a raise and it's sunny, is 1. 77 | 78 | 17 79 | 00:01:11,722 --> 00:01:16,688 80 | The probability of being happy 81 | given that it is not sunny and 82 | 83 | 18 84 | 00:01:16,688 --> 00:01:19,376 85 | I still got a raise, is 0.9. 86 | 87 | 19 88 | 00:01:19,376 --> 00:01:23,592 89 | The probability of being happy 90 | given that it is sunny, but 91 | 92 | 20 93 | 00:01:23,592 --> 00:01:26,185 94 | I didn't give a raise, is 0.7. 95 | 96 | 21 97 | 00:01:26,185 --> 00:01:28,855 98 | And the probability of being happy, 99 | 100 | 22 101 | 00:01:28,855 --> 00:01:33,774 102 | given that it is neither sunny nor 103 | did I get their raise, is 0.1. 104 | 105 | 23 106 | 00:01:33,774 --> 00:01:39,380 107 | This is a perfectly fine specification 108 | of a probability distribution 109 | 110 | 24 111 | 00:01:39,380 --> 00:01:44,613 112 | where two causes affect the variable 113 | down here, the happiness. 114 | 115 | 25 116 | 00:01:44,613 --> 00:01:48,223 117 | So I'd like you to calculate for 118 | me the following questions. 119 | 120 | 26 121 | 00:01:48,223 --> 00:01:54,950 122 | The probability of a raise given that 123 | it is sunny, according to this model. 124 | 125 | 27 126 | 00:01:54,950 --> 00:01:56,900 127 | Please enter your answer over here. 
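A minimal sketch (my addition, not part of the transcript) that enumerates the joint distribution of this happiness network, using the numbers given above, to evaluate the quiz query P(Raise | Sunny):

```Python
p_s, p_r = 0.7, 0.01  # P(sunny), P(raise), from the transcript
p_h = {(True, True): 1.0, (False, True): 0.9,
       (True, False): 0.7, (False, False): 0.1}  # P(happy | S, R)

def joint(s, r, h):
    # the two hidden causes are independent, so their joint is a product
    p = (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r)
    ph = p_h[(s, r)]
    return p * (ph if h else 1 - ph)

num = sum(joint(True, True, h) for h in (True, False))   # P(S, R)
den = sum(joint(True, r, h) for r in (True, False)
          for h in (True, False))                        # P(S)
print(num / den)  # 0.01: since S and R are independent, P(R | S) = P(R)
```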
128 | 
--------------------------------------------------------------------------------
/MathNotes/Precalculus.rst:
--------------------------------------------------------------------------------
1 | ##################
2 | Precalculus notes
3 | ##################
4 | 
5 | Arithmetic Operators
6 | ####################
7 | 
8 | Addition
9 | =========
10 | 
11 | Mostly used with the infix operator :math:`+`. When two or more disjoint sets
12 | are combined, the number of elements in the resulting set is represented with
13 | addition.
14 | 
15 | Properties
16 | ----------
17 | 
18 | Commutative: a + b = b + a, meaning the order of the operands does not change the result
19 | 
20 | Associative: a + (b + c) = (a + b) + c, meaning the precedence of operations
21 | does not change the final result
22 | 
23 | 0 as identity element: a + 0 = a, meaning that 0 is a number which does not
24 | change the identity of the element.
25 | 
26 | Units: in the operation a + b we assume that a and b have the same units, for
27 | example 4m + 5m = 9m but 4m + 5m^2 is not possible.
28 | 
29 | Subtraction
30 | ============
31 | 
32 | It is usually designated with the infix operator :math:`-`.
33 | It represents going back some steps on the number line.
34 | 
35 | Properties
36 | -----------
37 | 
38 | Anticommutative: a - b = -(b - a), meaning that reversing the operands negates the
39 | result
40 | 
41 | Non-associative: a - (b - c) \not = (a - b) - c, the precedence of operations
42 | contributes to the result
43 | 
44 | 
45 | Multiplication
46 | ==============
47 | 
48 | Usually designated by the infix operator :math:`×` or
49 | :math:`\cdot`. It can be thought of as repeated addition.
50 | 
51 | Properties
52 | ----------
53 | 
54 | Commutative: a × b = b × a
55 | 
56 | Associative: a × (b × c) = (a × b) × c
57 | 
58 | Distributive: a × (b + c) = a × b + a × c
59 | 
60 | Identity element: a × 1 = a
61 | 
62 | Special case of 0: a × 0 = 0
63 | 
64 | Negation: -1 × a = -a, that is, multiplication with -1 results in the additive
65 | inverse
66 | 
67 | Inverse element: every number except 0 has a multiplicative inverse so that
68 | a × 1/a = 1
69 | 
70 | Order preservation: multiplication by a positive number conserves order, by a
71 | negative number reverses order:
72 | a × b < a × c if b < c
73 | -a × b > -a × c if b < c
74 | 
75 | Division
76 | ========
77 | 
78 | Usually designated by the infix operator :math:`/`, and :math:`÷`. It can be
79 | considered the process of calculating how many times a number is contained in
80 | another one.
81 | 
82 | Properties
83 | -----------
84 | 
85 | Distributive: (a+b) / c = a/c + b/c
86 | 
87 | 
88 | Exponentiation
89 | ===============
90 | 
91 | It is usually designated with a base and power part as in :math:`b^n` where b
92 | is the base and n is the power. It represents repeated multiplication of the base, like
93 | :math:`b^n = b × b × ... n times ... × b`
94 | 
95 | Properties
96 | ----------
97 | 
98 | b^1 = b
99 | b^{n+1} = b^n × b
100 | b^{n+m} = b^n × b^m
101 | b^0 = 1
102 | b^{-n} = 1 / b^n
103 | (b^m)^n = b^{m×n}
104 | (b×c)^n = b^n × c^n
105 | b^{u/v} = (b^u)^{1/v} = :math:`\sqrt[v]{b^u}` vth root of b^u
106 | 
107 | roots are thus treated as taking an exponent with a fraction
108 | 
109 | 
110 | Logarithms
111 | ===========
112 | 
113 | The logarithm is the inverse of the exponential function.
114 | For b^y = x, log_b(x) = y: basically I have the base and the resulting value and I
115 | try to find the power.
116 | 117 | Properties 118 | ----------- 119 | 120 | log_b(a × c) = log_b(a) + log_b(c) 121 | log_b(x / y ) = log_b(x) - log_b(y) 122 | log_b(x^p) = p log_b(x) 123 | log_b(\sqrt[p]{x}) = log_b(x) / p 124 | log_b(x) = log_k(x) / log_k(b) 125 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/22 - 47 - Explaining Away 2 Solution V3 - lang_en_vs1.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,600 --> 00:00:02,700 3 | So this is a difficult question. 4 | 5 | 2 6 | 00:00:02,700 --> 00:00:05,700 7 | Let me compute an auxiliary label, 8 | 9 | 3 10 | 00:00:05,700 --> 00:00:10,580 11 | which is P of Happiness, that one is 12 | 13 | 4 14 | 00:00:10,580 --> 00:00:15,600 15 | expanded by looking at the different 16 | conditions that can make us happy. 17 | 18 | 5 19 | 00:00:15,600 --> 00:00:19,315 20 | P of Happiness given S and 21 | R, times P of S and 22 | 23 | 6 24 | 00:00:19,315 --> 00:00:23,443 25 | R, which is of course 26 | the product of those two, 27 | 28 | 7 29 | 00:00:23,443 --> 00:00:27,175 30 | because independent plus P of Happiness. 31 | 32 | 8 33 | 00:00:27,175 --> 00:00:32,935 34 | Given not S R, probability of 35 | not S R + P of H given S and 36 | 37 | 9 38 | 00:00:32,935 --> 00:00:37,031 39 | not R times the probability 40 | of P of S and 41 | 42 | 10 43 | 00:00:37,031 --> 00:00:42,289 44 | not R plus the last case P 45 | of H given not S and not R. 46 | 47 | 11 48 | 00:00:42,289 --> 00:00:44,438 49 | So this just looks at the happiness and 50 | 51 | 12 52 | 00:00:44,438 --> 00:00:48,630 53 | all four combinations of the variables 54 | that could lead to happiness. 55 | 56 | 13 57 | 00:00:48,630 --> 00:00:51,660 58 | And you can plug those straight in, 59 | this one over here is 1. 60 | 61 | 14 62 | 00:00:51,660 --> 00:00:56,607 63 | And this one over here 64 | is the product of S and 65 | 66 | 15 67 | 00:00:56,607 --> 00:01:00,599 68 | R which is 0.7 times 0.01. 69 | 70 | 16 71 | 00:01:00,599 --> 00:01:06,078 72 | And as you plug all of those in, 73 | you get as a result, 74 | 75 | 17 76 | 00:01:06,078 --> 00:01:09,796 77 | 0.5245, that's P(H). 78 | 79 | 18 80 | 00:01:09,796 --> 00:01:14,582 81 | Just take some time and do the math by 82 | going through these different cases 83 | 84 | 19 85 | 00:01:14,582 --> 00:01:17,495 86 | using probability and 87 | you get this result. 88 | 89 | 20 90 | 00:01:17,495 --> 00:01:24,636 91 | Now armed with this number, 92 | the rest now becomes easy which is, 93 | 94 | 21 95 | 00:01:24,636 --> 00:01:29,132 96 | we can use base rule 97 | to turn this around, 98 | 99 | 22 100 | 00:01:29,132 --> 00:01:33,114 101 | P(H given R) P(R) over P(H). 102 | 103 | 23 104 | 00:01:33,114 --> 00:01:37,719 105 | P(R) we know from over here, 106 | the probability of a race is 0.01. 107 | 108 | 24 109 | 00:01:37,719 --> 00:01:41,303 110 | So the only thing we need to 111 | compute now is P(H given R). 112 | 113 | 25 114 | 00:01:41,303 --> 00:01:45,548 115 | And again we applied sort of probability 116 | let me just do this over here. 117 | 118 | 26 119 | 00:01:45,548 --> 00:01:49,912 120 | We can factor P(H given 121 | R) as P(H given R,S) for 122 | 123 | 27 124 | 00:01:49,912 --> 00:01:54,684 125 | sunny times probability of 126 | sunny plus P(H given R) and 127 | 128 | 28 129 | 00:01:54,684 --> 00:01:58,657 130 | not sunny times 131 | the probability of not sunny. 
132 | 
133 | 29
134 | 00:01:58,657 --> 00:02:03,883
135 | And if you plug in the numbers for
136 | this you get 1
137 | 
138 | 30
139 | 00:02:03,883 --> 00:02:11,255
140 | times 0.7 + 0.9 times 0.3
141 | that happens to be 0.97.
142 | 
143 | 31
144 | 00:02:11,255 --> 00:02:18,701
145 | So if you now plug this all back
146 | into this equation over here,
147 | 
148 | 32
149 | 00:02:18,701 --> 00:02:24,833
150 | we get 0.97 times 0.01 / 0.5245.
151 | 
152 | 33
153 | 00:02:24,833 --> 00:02:31,994
154 | This gives us approximately,
155 | as a correct answer, 0.0185.
--------------------------------------------------------------------------------
/TopicNotes/KnowledgeBasedAI/Notes.rst:
--------------------------------------------------------------------------------
1 | ###########################
2 | Knowledge Based AI Systems
3 | ###########################
4 | 
5 | Five Fundamental Problems in AI
6 | ================================
7 | 
8 | 1. Intelligent Agents have limited resources:
9 | - But most AI problems are computationally intractable.
10 | 
11 | - How then can we make AI agents give us real time performance on AI
12 | problems?
13 | 
14 | 2. Computation is local, but problems have global constraints:
15 | 
16 | - How then can we make AI agents address global problems using local
17 | computations (object recognition with limited data, for example)?
18 | 
19 | 3. Computational logic is mainly deductive, but many AI problems are abductive
20 | or inductive in nature.
21 | 
22 | - How can we get AI agents to address abductive or inductive problems?
23 | 
24 | 4. The world is dynamic, but knowledge is limited.
25 | 
26 | - An AI agent always has to begin with what it already knows; how then can an AI
27 | agent address a new problem?
28 | 
29 | 5. Learning, problem solving, reasoning are complex, but explanation and
30 | justification add to the complexity:
31 | 
32 | - How then can we get an AI agent to justify or explain its decisions?
33 | 
34 | Characteristics of AI Problems
35 | =================================
36 | 
37 | 1. Knowledge often arrives incrementally
38 | 
39 | 2. Problems exhibit recurring patterns
40 | 
41 | 3. Problems have multiple levels of granularity
42 | 
43 | 4. Many problems are computationally intractable
44 | 
45 | 5. The world is dynamic, but the knowledge is static
46 | 
47 | 6. The world is open ended, but the knowledge is limited
48 | 
49 | Characteristics of AI Agents
50 | =============================
51 | 
52 | 1. AI agents have limited computing powers
53 | 
54 | 2. AI agents have limited sensors
55 | 
56 | 3. AI agents have limited attention
57 | 
58 | 4. Computational logic is deductive, but problems are inductive or abductive.
59 | 
60 | 5. An AI agent's knowledge is limited with respect to the world
61 | 
62 | AI system overview
63 | -------------------
64 | 
65 | Input: Perception - data coming from sensors
66 | Output: Action
67 | 
68 | Operations:
69 | 
70 | - Meta cognition
71 | - Deliberation
72 | - Reaction
73 | 
74 | Metacognition <--> Deliberation
75 | 
76 | Reaction <--> Deliberation
77 | 
78 | Deliberation has three components:
79 | 
80 | - Reasoning:
81 | 
82 | - Learning:
83 | 
84 | - Gets the right answer and stores the right answer somewhere
85 | 
86 | - If it gets the wrong answer, then once it obtains the right one,
87 | it stores the right answer in its place
88 | 
89 | - Memory: stores what is learned
90 | 
91 | All three components interact with each other.
92 | 
93 | 4 Schools of Thought in AI
94 | --------------------------
95 | 
96 | Think of an xy axis. At the top of the y axis there is thinking,
97 | at the bottom of the y axis there is acting.
98 | On the right of the x axis there is human-like; on the left of the x axis there is
99 | optimal.
100 | 
101 | What distinguishes them?
102 | 
103 | If we think about optimal AI agents, we are necessarily talking about an AI
104 | agent that is good at one thing; if we think about a human-like AI agent, we
105 | talk about agents that are above mediocre at most things.
106 | 
107 | We can then classify AI agents according to these axes.
108 | 
109 | AI agents that think optimally:
110 | - Machine learning problems are mostly treated by this type of AI
111 | 
112 | AI agents that think like humans:
113 | - Semantic Web
114 | 
115 | AI agents that act optimally:
116 | - Airplane autopilots
117 | 
118 | AI agents that act like humans:
119 | - Improvisational robots: that can perhaps dance to the music you play
120 | 
121 | 
122 | 
--------------------------------------------------------------------------------
/TopicNotes/MachineLearning/UnsupervisedLearning.rst:
--------------------------------------------------------------------------------
1 | ################################
2 | Machine Learning - Unsupervised
3 | ################################
4 | 
5 | We try to guess the structure from the data.
6 | 
7 | For example you have a straight line in a coordinate system.
8 | You can say that the space's dimensionality is equal to 2,
9 | whereas the line can be represented in 1 dimension.
10 | 
11 | One of the basic applications of unsupervised learning
12 | is to represent higher dimensionality structures, like images, in lower
13 | dimensions like histograms.
14 | 
15 | We learn about clustering and dimension reduction.
16 | 
17 | Some of the terminology used in unsupervised learning.
18 | We assume that the data is IID, that is, independent and identically
19 | distributed: drawn independently from the same distribution.
20 | 
21 | Unsupervised learning seeks to recover the underlying density of the
22 | probability distribution that generated the data.
23 | 
24 | This is called *density estimation*; the following two are versions of it:
25 | 
26 | - Clustering
27 | - Dimensionality reduction
28 | 
29 | Clustering
30 | -------------
31 | 
32 | Two algorithms are pretty common in clustering:
33 | 
34 | - k-means
35 | - expectation maximisation
36 | 
37 | Problems with k-means:
38 | 
39 | - need to know k
40 | - local minima
41 | - high dimensionality
42 | - lack of mathematical basis.
43 | 
44 | Expectation maximisation is a generalisation of k-means; it uses actual
45 | probability distributions to describe what we are doing.
46 | 
47 | Gaussian Learning
48 | -------------------
49 | 
50 | Fitting gaussians to data, or gaussian learning, in which we shall be given some
51 | data points and wonder what is the best gaussian fitting the data.
52 | 
53 | .. image:: Gaussians.png
54 | 
55 | The formula actually represents a probability distribution.
56 | 
57 | Let's explain the multivariate one, since it works for the single dimension case as well.
58 | 
59 | The formula is: :math:`(2{\pi})^{-\frac{N}{2}} |S|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (x-m)^T S^{-1} (x-m)\right)`
60 | 
61 | N is the number of dimensions of the data.
62 | 
63 | m is the mu, that is, the average of the samples.
64 | 
65 | x is the probing point, that is, our data point.
66 | 
67 | S is the covariance matrix, that is, the matrix that shows
68 | how far the data points spread away from the mean, which cuts through
69 | the peak of the gaussian.
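A minimal NumPy sketch of this density (my addition; it assumes ``m`` is the
mean vector, ``S`` the covariance matrix, and ``|S|`` its determinant)::

    import numpy as np

    def gaussian_density(x, m, S):
        """Multivariate gaussian density evaluated at the point x."""
        N = len(m)
        diff = x - m
        norm = (2 * np.pi) ** (-N / 2) * np.linalg.det(S) ** (-0.5)
        return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff))

    # usage: a standard 2-d gaussian evaluated at its mean
    print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.159 = 1/(2*pi)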
70 | 
71 | This term normalises the squared error by the covariance matrix
72 | 
73 | How to choose K in Expectation Maximization and K-means:
74 | 
75 | - Guess initial K
76 | - Run EM
77 | - Remove unnecessary clusters
78 | - Create new random clusters
79 | - Repeat from the second step
80 | 
81 | Dimensionality Reduction
82 | -------------------------
83 | 
84 | Linear Dimensionality reduction:
85 | 
86 | The idea is that we are given data points and we seek a linear subspace onto which to remap the data.
87 | 
88 | 1. Fit a gaussian
89 | 2. Calculate the eigenvalues and eigenvectors of the gaussian
90 | 3. Pick the eigenvectors with maximum eigenvalues
91 | 4. Project the data onto the subspace of the eigenvectors you chose.
92 | 
93 | Spectral Clustering
94 | ---------------------
95 | 
96 | Affinity to a group of data points defines the cluster of a data point, not its absolute position
97 | 
98 | The affinity matrix is essential to spectral clustering:
99 | It is a matrix in which each data point is graphed relative to other data points
100 | Affinity is measured by the quadratic distance between points. High affinity means small quadratic distance.
101 | This is a rank deficient matrix and it is easy to identify with Principal Component Analysis.
102 | PCA analyses the vectors that are similar in an approximate rank deficient matrix.
103 | 
104 | Dimensionality = Number of Large Eigenvalues
--------------------------------------------------------------------------------
/TopicNotes/BayesNets/subtitles/14 - Conditional Independence2 Solution - lang_en_vs51.srt:
--------------------------------------------------------------------------------
1 | 1
2 | 00:00:00,540 --> 00:00:04,210
3 | So, for this one,
4 | we are going to apply total probability.
5 | 
6 | 2
7 | 00:00:04,210 --> 00:00:10,170
8 | This thing over here is the same as
9 | probability of test 2 to be positive,
10 | 
11 | 3
12 | 00:00:10,170 --> 00:00:13,640
13 | which I am going to abbreviate
14 | with a + 2 over here.
15 | 
16 | 4
17 | 00:00:13,640 --> 00:00:16,859
18 | Condition on test 1 being positive and
19 | 
20 | 5
21 | 00:00:16,859 --> 00:00:22,321
22 | me having cancer, times probability
23 | of me having cancer given
24 | 
25 | 6
26 | 00:00:22,321 --> 00:00:26,615
27 | positive plus probability
28 | of test 2 being positive
29 | 
30 | 7
31 | 00:00:26,615 --> 00:00:31,296
32 | condition on positive and
33 | me not having cancer,
34 | 
35 | 8
36 | 00:00:31,296 --> 00:00:38,090
37 | times probability of me not having
38 | cancer, given that test 1 is positive.
39 | 
40 | 9
41 | 00:00:38,090 --> 00:00:41,420
42 | That's the same as
43 | the total probability.
44 | 
45 | 10
46 | 00:00:41,420 --> 00:00:44,620
47 | But now everything is
48 | conditioned on + 1.
49 | 
50 | 11
51 | 00:00:44,620 --> 00:00:46,530
52 | Take a moment to verify this.
53 | 
54 | 12
55 | 00:00:48,370 --> 00:00:50,200
56 | Now here you can plug in the numbers.
57 | 
58 | 13
59 | 00:00:50,200 --> 00:00:57,664
60 | We already calculated this one before,
61 | which is approximately 0.043.
62 | 
63 | 14
64 | 00:00:57,664 --> 00:01:02,076
65 | And this one over here is 1 minus this,
66 | 
67 | 15
68 | 00:01:02,076 --> 00:01:06,365
69 | which is 0.957 approximately.
70 | 
71 | 16
72 | 00:01:06,365 --> 00:01:12,077
73 | This now exploits conditional
74 | independence, which means that
75 | 
76 | 17
77 | 00:01:12,077 --> 00:01:17,670
78 | knowledge of the first test gives me no
79 | more information about the second test.
80 | 81 | 18 82 | 00:01:17,670 --> 00:01:21,750 83 | It only gives me information if C was 84 | unknown as was the case over here. 85 | 86 | 19 87 | 00:01:21,750 --> 00:01:26,070 88 | So we can rewrite this thing over 89 | here as follows P of plus 2, 90 | 91 | 20 92 | 00:01:26,070 --> 00:01:31,820 93 | given the cancer, I can drop the plus 94 | 1 and the same is true over here. 95 | 96 | 21 97 | 00:01:31,820 --> 00:01:34,191 98 | This is exploiting my 99 | conditional independence. 100 | 101 | 22 102 | 00:01:34,191 --> 00:01:37,702 103 | I knew that P of plus 1 or 104 | 105 | 23 106 | 00:01:37,702 --> 00:01:45,020 107 | plus 2 condition on C is 108 | the same as P of plus 2. 109 | 110 | 24 111 | 00:01:45,020 --> 00:01:47,400 112 | Condition of C and test 1. 113 | 114 | 25 115 | 00:01:47,400 --> 00:01:51,765 116 | I can now read those off my table here, 117 | 118 | 26 119 | 00:01:51,765 --> 00:01:57,361 120 | there's 0.9 times 0.043 plus 0.2, 121 | 122 | 27 123 | 00:01:57,361 --> 00:02:03,775 124 | which is 1 minus 0.8 over here, 125 | times 0.957, 126 | 127 | 28 128 | 00:02:03,775 --> 00:02:09,112 129 | which gives me approximately 0.2301. 130 | 131 | 29 132 | 00:02:09,112 --> 00:02:14,610 133 | So that says if my first 134 | test comes in positive, 135 | 136 | 30 137 | 00:02:14,610 --> 00:02:20,720 138 | I expect my second test to be 139 | positive with the probably 0.2301. 140 | 141 | 31 142 | 00:02:20,720 --> 00:02:25,446 143 | That's an increase probability to the 144 | default probability which we calculated 145 | 146 | 32 147 | 00:02:25,446 --> 00:02:28,960 148 | before, which is 149 | probability of any test. 150 | 151 | 33 152 | 00:02:28,960 --> 00:02:32,970 153 | Test 2 coming as positive 154 | before was to normalize or 155 | 156 | 34 157 | 00:02:32,970 --> 00:02:37,680 158 | Bayes rule which was 0.207. 159 | 160 | 35 161 | 00:02:37,680 --> 00:02:42,482 162 | So my first test has a 20% 163 | chance of coming in positive, 164 | 165 | 36 166 | 00:02:42,482 --> 00:02:47,676 167 | my second test after seeing 168 | a positive test has now an increased 169 | 170 | 37 171 | 00:02:47,676 --> 00:02:51,993 172 | probability of about 23% 173 | of coming in positive. 174 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/11 - Conditional Independence - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:00,530 --> 00:00:03,327 3 | I want to use a few 4 | words of terminology. 5 | 6 | 2 7 | 00:00:03,327 --> 00:00:08,348 8 | This again is a Bayes network 9 | of which the hidden variable 10 | 11 | 3 12 | 00:00:08,348 --> 00:00:13,382 13 | C causes the still stochastic 14 | test outcomes T1 and T2. 15 | 16 | 4 17 | 00:00:13,382 --> 00:00:18,254 18 | And what's really important is that 19 | we assume not just that T1 and 20 | 21 | 5 22 | 00:00:18,254 --> 00:00:20,870 23 | T2 are identically distributed. 24 | 25 | 6 26 | 00:00:20,870 --> 00:00:25,520 27 | It is the same 0.9 for 28 | test one that I used for test two. 29 | 30 | 7 31 | 00:00:25,520 --> 00:00:30,060 32 | But we also assume that they 33 | are conditionally independent. 34 | 35 | 8 36 | 00:00:30,060 --> 00:00:36,110 37 | We assumed that if God told us 38 | whether we actually had cancer or not. 
39 | 40 | 9 41 | 00:00:36,110 --> 00:00:41,433 42 | If you would ask with certainty 43 | the breakup of variable C that knowing 44 | 45 | 10 46 | 00:00:41,433 --> 00:00:46,408 47 | anything about T1 would not help 48 | us make a statement about T2. 49 | 50 | 11 51 | 00:00:46,408 --> 00:00:52,135 52 | But differently we assumed that 53 | the probability of T2 given C and 54 | 55 | 12 56 | 00:00:52,135 --> 00:00:56,416 57 | T1 is the same as 58 | the probability of T2 given C. 59 | 60 | 13 61 | 00:00:56,416 --> 00:01:01,781 62 | This is called conditional 63 | independence which is given 64 | 65 | 14 66 | 00:01:01,781 --> 00:01:06,800 67 | the value of the cancer variable C, 68 | if you do this for 69 | 70 | 15 71 | 00:01:06,800 --> 00:01:11,760 72 | fact than T2 would be independent of T1. 73 | 74 | 16 75 | 00:01:11,760 --> 00:01:15,810 76 | It's conditionally dependant because 77 | independence only holds true 78 | 79 | 17 80 | 00:01:15,810 --> 00:01:18,160 81 | if we actually know C. 82 | 83 | 18 84 | 00:01:18,160 --> 00:01:20,160 85 | And it comes out of 86 | this diagram over here. 87 | 88 | 19 89 | 00:01:20,160 --> 00:01:26,160 90 | If we look at this diagram, 91 | if we knew the variable C over 92 | 93 | 20 94 | 00:01:26,160 --> 00:01:31,150 95 | here, then C separately causes T1 and 96 | T2. 97 | 98 | 21 99 | 00:01:31,150 --> 00:01:35,757 100 | So as a result, 101 | if we know C whatever over here, 102 | 103 | 22 104 | 00:01:35,757 --> 00:01:41,356 105 | it's kind of cut off, categorically 106 | from what happens over here. 107 | 108 | 23 109 | 00:01:41,356 --> 00:01:47,110 110 | That causes these two variables 111 | to be conditionally independent. 112 | 113 | 24 114 | 00:01:47,110 --> 00:01:50,627 115 | So conditional independence is 116 | a really big thing for Bayes network. 117 | 118 | 25 119 | 00:01:50,627 --> 00:01:54,838 120 | Here's a Bayes network 121 | where A causes B and C. 122 | 123 | 26 124 | 00:01:54,838 --> 00:01:58,750 125 | And for Bayes network of this structure, 126 | 127 | 27 128 | 00:01:58,750 --> 00:02:03,540 129 | we know that given A, 130 | B and C are independent. 131 | 132 | 28 133 | 00:02:03,540 --> 00:02:08,389 134 | It's written as B condition 135 | independent of C given A. 136 | 137 | 29 138 | 00:02:08,389 --> 00:02:12,811 139 | So here's a question, suppose we have 140 | conditional independence between B and 141 | 142 | 30 143 | 00:02:12,811 --> 00:02:13,950 144 | C given A. 145 | 146 | 31 147 | 00:02:13,950 --> 00:02:20,360 148 | Would then imply, and that's my 149 | question, that B and C are independent? 150 | 151 | 32 152 | 00:02:20,360 --> 00:02:21,620 153 | So, suppose we don't know A. 154 | 155 | 33 156 | 00:02:21,620 --> 00:02:25,072 157 | We don't know whether we have cancer, 158 | for example. 159 | 160 | 34 161 | 00:02:25,072 --> 00:02:30,189 162 | Would that mean that the test results, 163 | individually, yet still independent 164 | 165 | 35 166 | 00:02:30,189 --> 00:02:34,542 167 | of each other, even if we don't 168 | know about the cancer situation? 169 | 170 | 36 171 | 00:02:34,542 --> 00:02:35,565 172 | Please answer yes or no. 
173 | 174 | 37 175 | 00:02:35,565 --> 00:02:45,009 176 | [BLANK_AUDIO] 177 | -------------------------------------------------------------------------------- /TopicNotes/BayesNets/subtitles/6 - 31 - Computing Bayes Rules Merged FINAL - lang_en_vs51.srt: -------------------------------------------------------------------------------- 1 | 1 2 | 00:00:01,480 --> 00:00:04,220 3 | So we just encountered our 4 | very first Bayes network and 5 | 6 | 2 7 | 00:00:04,220 --> 00:00:06,670 8 | did a number of 9 | interesting calculations. 10 | 11 | 3 12 | 00:00:06,670 --> 00:00:11,200 13 | Let's now talk about Bayes Rule and 14 | look into more complex base networks. 15 | 16 | 4 17 | 00:00:11,200 --> 00:00:13,140 18 | I want to look at Bayes Rule again and 19 | 20 | 5 21 | 00:00:13,140 --> 00:00:15,820 22 | make an observation that 23 | is being non-trivial. 24 | 25 | 6 26 | 00:00:15,820 --> 00:00:21,120 27 | Here is Bayes Rule, and 28 | in practice what we find is 29 | 30 | 7 31 | 00:00:21,120 --> 00:00:25,620 32 | this term here is relatively easy 33 | to compute, it's just a product. 34 | 35 | 8 36 | 00:00:25,620 --> 00:00:28,530 37 | Whereas this term is 38 | really hard to compute. 39 | 40 | 9 41 | 00:00:28,530 --> 00:00:33,890 42 | However, this term over here does not 43 | depend on what we assume for variable A. 44 | 45 | 10 46 | 00:00:33,890 --> 00:00:35,500 47 | It's just a function of B. 48 | 49 | 11 50 | 00:00:35,500 --> 00:00:36,360 51 | So suppose for 52 | 53 | 12 54 | 00:00:36,360 --> 00:00:41,100 55 | a moment we also care about the 56 | complementary range of not A given B. 57 | 58 | 13 59 | 00:00:41,100 --> 00:00:43,800 60 | For which Bayes will unfold as follows. 61 | 62 | 14 63 | 00:00:43,800 --> 00:00:47,640 64 | Then we find that the normalizer 65 | P of B is identical 66 | 67 | 15 68 | 00:00:47,640 --> 00:00:51,220 69 | whether we assume A on the left side or 70 | not A on the left side. 71 | 72 | 16 73 | 00:00:51,220 --> 00:00:59,290 74 | We also know that P of A given B 75 | plus P of not A given B must be one. 76 | 77 | 17 78 | 00:00:59,290 --> 00:01:01,580 79 | Because these are two 80 | complimentary events. 81 | 82 | 18 83 | 00:01:01,580 --> 00:01:05,720 84 | It allows us to compute Bayes 85 | rule very differently by 86 | 87 | 19 88 | 00:01:05,720 --> 00:01:08,010 89 | basically ignoring the normalizer. 90 | 91 | 20 92 | 00:01:08,010 --> 00:01:09,890 93 | So here's how it goes. 94 | 95 | 21 96 | 00:01:09,890 --> 00:01:11,928 97 | You compute P of A given B and 98 | 99 | 22 100 | 00:01:11,928 --> 00:01:17,261 101 | you're going to call this prime 102 | because it's not a real probability. 103 | 104 | 23 105 | 00:01:17,261 --> 00:01:20,180 106 | It will be just P of B 107 | given A times P of A, 108 | 109 | 24 110 | 00:01:20,180 --> 00:01:25,600 111 | which is the normalizer to the 112 | denominator of the expression over here. 113 | 114 | 25 115 | 00:01:25,600 --> 00:01:29,770 116 | We do the same thing not A. 117 | 118 | 26 119 | 00:01:29,770 --> 00:01:33,627 120 | So in both cases, 121 | we compute the posterior probability not 122 | 123 | 27 124 | 00:01:33,627 --> 00:01:36,390 125 | normalized while omitting 126 | the normalizer B. 127 | 128 | 28 129 | 00:01:36,390 --> 00:01:40,570 130 | And then we can recover 131 | the original probabilities 132 | 133 | 29 134 | 00:01:40,570 --> 00:01:44,540 135 | by normalizing based on 136 | those values over here. 
137 | 138 | 30 139 | 00:01:44,540 --> 00:01:48,470 140 | So the probability of A given B, 141 | the actual probability, 142 | 143 | 31 144 | 00:01:48,470 --> 00:01:53,520 145 | is a normalizer eta times this 146 | not normalized form over here. 147 | 148 | 32 149 | 00:01:53,520 --> 00:01:57,034 150 | The same is true for 151 | the negation of A, over here. 152 | 153 | 33 154 | 00:01:57,034 --> 00:02:02,783 155 | And eta is just the normalizer 156 | results by adding these two values 157 | 158 | 34 159 | 00:02:02,783 --> 00:02:09,190 160 | over here together as shown over 161 | here and dividing them by one. 162 | 163 | 35 164 | 00:02:09,190 --> 00:02:10,860 165 | So take a look at this for a moment. 166 | 167 | 36 168 | 00:02:11,940 --> 00:02:16,710 169 | What we've done is we deferred the 170 | calculation of the normalizer over here. 171 | 172 | 37 173 | 00:02:16,710 --> 00:02:19,920 174 | By computing pseudo probabilities 175 | that are non-nominalized. 176 | 177 | 38 178 | 00:02:19,920 --> 00:02:21,550 179 | This made the calculation much easier. 180 | 181 | 39 182 | 00:02:21,550 --> 00:02:26,353 183 | And returned everything we just 184 | folded back in the normalizer based 185 | 186 | 40 187 | 00:02:26,353 --> 00:02:31,164 188 | on the resulting pseudo probabilities 189 | and get the correct answer. 190 | 191 | 41 192 | 00:02:31,164 --> 00:02:40,189 193 | [BLANK_AUDIO] 194 | -------------------------------------------------------------------------------- /TopicNotes/LSTM/CourseNotes.rst: -------------------------------------------------------------------------------- 1 | ##################### 2 | Introduction to LSTM 3 | ##################### 4 | 5 | This stands for Long Short Term Memory Networks, and are quite useful when our 6 | neural network needs to switch between remembering recent things, and things 7 | from long time ago. 8 | 9 | In small summary RNNs work as follows: 10 | 11 | 1. Memory comes in 12 | 2. Merges with the current event 13 | 3. Merged memory goes out. 14 | 15 | In small summary LSTMs work as follows: 16 | 17 | 1. Long term memory comes in 18 | 2. Short term memory comes in 19 | 3. Both merges with the current event. 20 | 4. Merged long term memory goes out. 21 | 5. Merged short term memory goes out. 22 | 23 | 24 | Technically speaking LSTM architecture has several gates: 25 | 26 | - Learn Gate 27 | - Forget Gate 28 | - Use Gate 29 | - Remember Gate 30 | 31 | Here is how they work: 32 | 33 | - Long term memory passes through the forget gate, where it shrinks by removing 34 | unnecessary information 35 | - Short term memory passes through the learn gate with the current event 36 | - Both the information in the learn gate and the forget gate, passes through 37 | remember gate. This creates the new long term memory 38 | - Both the information in the learn gate and the forget gate, passes through 39 | use gate. This creates the new short term memory. 40 | 41 | Learn Gate 42 | ---------- 43 | 44 | Learn gate combines the current event, the picture we are trying to identify, and 45 | the short term memory, that is the memory of the pictures we have just seen. 46 | It also forgets a little of it, keeping the important part. 47 | 48 | Mathematically it works as the following: 49 | 50 | Short Term Memory: STM 51 | Event: E 52 | Hyperbolic tangent function: :math:`tanh = {\frac{e^x - e^{-x}}{e^x + e^{-x}}}` 53 | New Information: N 54 | Ignore factor: i 55 | [STM_{t-1}, E_t]: combining/concatenating two vectors. 
56 | sigmoid function: :math:`{\sigma}(x) = {\frac{e^x}{e^x+1}}` 57 | 58 | Combination works as follows: 59 | 60 | :math:`N_t = tanh(W_n[STM_{t-1},E_t] + b_n)` 61 | 62 | Ignoring works as follows: 63 | 64 | :math:`N_t {\times} i_t` 65 | 66 | In order to calculate the ignore factor we create another small neural network 67 | with sigmoid function: 68 | 69 | :math:`i_t = {\sigma}(W_i[STM_{t-1},E_t] + b_i)` 70 | 71 | 72 | Forget Gate 73 | ----------- 74 | 75 | Forget gate simply forgets least relevant parts of the long term memory 76 | 77 | The idea is the same as ignore factor of the learn gate. 78 | This time we multiply the long term memory with the forget factor which 79 | is calculated using the short term memory, and the event. 80 | Mathematically it works as follows: 81 | 82 | Long Term Memory: LTM 83 | Short Term Memory: STM 84 | Forget Factor: f 85 | Event: E 86 | sigmoid function: :math:`{\sigma}` 87 | 88 | Forget part is as follows: 89 | 90 | :math:`LTM_{t-1} {\times} f_t` 91 | 92 | And the forget factor is calculated as follows: 93 | 94 | :math:`f_t = {\sigma}(W_f[STM_{t-1}, E_t] + b_f)` 95 | 96 | Remember Gate 97 | ------------- 98 | 99 | Remember gate takes the output of the forget gate, and the learn gate and it 100 | combines them together, in order to create the new long term memory 101 | 102 | The combination is a simple addition. 103 | It works as follows: 104 | 105 | Long Term Memory: LTM 106 | Forget Factor: f 107 | New Information: N 108 | Ignore factor: i 109 | 110 | :math:`LTM_t = (LTM_{t-1} {\times} f_t) + (N_t {\times} i_t)` 111 | 112 | Use Gate 113 | -------- 114 | 115 | Use gate creates the short term memory by combining the output of the forget 116 | gate and the previous short term memory and the current event. 117 | 118 | Mathematically it works as follows: 119 | 120 | Long Term Memory: LTM 121 | Forget Factor: f 122 | Short Term Memory: STM 123 | Event: E 124 | Hyperbolic tangent function: :math:`tanh = {\frac{e^x - e^{-x}}{e^x + e^{-x}}}` 125 | New Information: N 126 | Ignore factor: i 127 | [STM_{t-1}, E_t]: combining/concatenating two vectors. 128 | sigmoid function: :math:`{\sigma}(x) = {\frac{e^x}{e^x+1}}` 129 | 130 | It applies the following to the output of the forget gate: 131 | 132 | Output of the forget gate: :math:`LTM_{t-1} {\times} f_t)` 133 | 134 | :math:`U_t = tanh(W_u(LTM_{t-1} {\times} f_t) + b_u)` 135 | 136 | It applies the following to the short term memory and the current event: 137 | 138 | :math:`V_t = {\sigma}(W_v[STM_{t-1}, E_t] + b_v)` 139 | 140 | It combines both :math:`U_t and V_t` in order to obtain the new short term 141 | memory as follows: 142 | 143 | :math:`STM_t = U_t {\times} V_t` 144 | 145 | Other Architectures 146 | ------------------- 147 | 148 | Other variants of LSTM architectures are Gated Recurrent Units, and 149 | Peephole Connections 150 | -------------------------------------------------------------------------------- /TopicNotes/GeneralAdverserialNetworks/courseNotes.rst: -------------------------------------------------------------------------------- 1 | ############################################# 2 | Introduction to General Adverserial Networks 3 | ############################################# 4 | 5 | What can you do with GANs ? 6 | =========================== 7 | 8 | GANs are used for generating realistic data. Most of the applications of gans so 9 | far have been on images 10 | 11 | StackGAN takes a description of a bird and generates an image of a bird 12 | based on that description. 
In that context the GAN is drawing a sample from the probability distribution of all
14 | the images that match the description
15 | 
16 | Pix2Pix
17 | is a GAN used by Adobe and Berkeley: as the user draws crude sketches, the GAN
18 | transforms them into more realistic images.
19 | 
20 | They can be used for image to image translation, for example blueprints for a
21 | building can be transformed into photos of the finished building
22 | 
23 | CycleGAN,
24 | used by Berkeley, is especially good at transforming images through
25 | unsupervised learning
26 | 
27 | GANs can be used to create realistic simulated training sets or environments
28 | in which other machine learning models train, like reinforcement learning
29 | agents
30 | 
31 | How do GANs work?
32 | =================
33 | 
34 | A Generative Adversarial model is like other generative models.
35 | 
36 | GANs are a kind of generative model that lets us generate a whole image in
37 | parallel. Along with other generative models, GANs use a differentiable function
38 | represented by a neural network as a generator network.
39 | 
40 | The generator network takes random noise as the input, then runs that noise
41 | through the differentiable function to transform and reshape the noise to
42 | have recognizable structure.
43 | The output of the generator network is a realistic image; the choice of the
44 | random noise determines what comes out of the network
45 | 
46 | Running the generator network with many different input noise values produces
47 | many different realistic output images.
48 | The goal is for these images to be fair samples from the distribution over real
49 | data
50 | 
51 | Of course the generator net does not start out producing realistic images; it needs to
52 | be trained. However, the training process is quite different from what we have seen so
53 | far in supervised learning models
54 | 
55 | For most generative models we simply adjust the parameters to maximize the
56 | probability that the model would generate the training set.
57 | But it is very difficult to compute this probability, so most generative models
58 | get around that with some kind of approximation.
59 | GANs use a second network called the discriminator in order to guide the
60 | generator; the discriminator is a regular classifier.
61 | During the training process the discriminator is shown real images half of the
62 | time and fake images from the generator the other half.
63 | 
64 | The discriminator is trained to output the probability that the input is real,
65 | so it tries to assign a probability close to 1 to real images and a probability
66 | close to 0 to fake images.
67 | Meanwhile the generator tries to produce images that would get a probability
68 | close to 1 from the discriminator.
69 | 
70 | Over time the generator is forced to produce more realistic outputs in order to
71 | fool the discriminator.
72 | 
73 | Games and Equilibria
74 | --------------------
75 | 
76 | The inspiration for GANs comes from game theory.
77 | Basically the generator and the discriminator are competing against each other,
78 | where each agent can choose from some set of actions and the choice of actions
79 | determines a well defined payoff for each player.
80 | A state of equilibrium in game theory is where neither player can improve their
81 | payoff by changing their strategy, assuming the other player's strategy stays the
82 | same
83 | 
84 | In GANs, we have two agents, each with their own costs. They work with cost
85 | functions, where you try to minimize the cost. In GANs you try to find a point
86 | that minimises the cost functions of both agents
87 | 
88 | For example, for the discriminator a local maximum occurs when the discriminator
89 | accurately estimates the probability that the input is real rather than fake.
90 | This probability is given by the ratio between the data density at the input and
91 | the sum of both the data density and the model density induced by the generator
92 | at the input. We can think of this ratio as measuring how much probability mass
93 | in an area comes from the data rather than the generator.
94 | 
95 | Deep Convolutional GANs
96 | ------------------------
97 | 
98 | The main difference is the use of convolutional networks for the generator and
99 | discriminator.
100 | 
101 | Transposed convolutional layers do the exact opposite of convolutional layers.
102 | They take something narrow, and make it larger and wider.
--------------------------------------------------------------------------------
/MathNotes/IntroProjectiveGeometry.rst:
--------------------------------------------------------------------------------
1 | ###################################
2 | Introduction to Projective Geometry
3 | ###################################
4 | 
5 | Projective geometry models the imaging process of the camera, that is, the
6 | transformation of that which is in 3d to 2d. This is an important capacity
7 | because several properties which hold in Euclidean geometry do not
8 | apply to projective geometry. For example, in projective geometry, parallel
9 | lines can intersect, due to perspective.
10 | Projective transformations conserve type (points stay points, lines stay lines,
11 | etc.), incidence (whether a point lies on a line or not), and a measure called
12 | cross ratio.
13 | 
14 | Geometry Relations
15 | 
16 | +----------------------------+-----------+------------+--------+------------+
17 | |                            | Euclidean | similarity | affine | projective |
18 | +----------------------------+-----------+------------+--------+------------+
19 | | Transformations                                                           |
20 | +----------------------------+-----------+------------+--------+------------+
21 | | rotation                   | X         | X          | X      | X          |
22 | +----------------------------+-----------+------------+--------+------------+
23 | | translation                | X         | X          | X      | X          |
24 | +----------------------------+-----------+------------+--------+------------+
25 | | uniform scaling            |           | X          | X      | X          |
26 | +----------------------------+-----------+------------+--------+------------+
27 | | nonuniform scaling         |           |            | X      | X          |
28 | +----------------------------+-----------+------------+--------+------------+
29 | | shear                      |           |            | X      | X          |
30 | +----------------------------+-----------+------------+--------+------------+
31 | | perspective projection     |           |            |        | X          |
32 | +----------------------------+-----------+------------+--------+------------+
33 | | composition of projections |           |            |        | X          |
34 | +----------------------------+-----------+------------+--------+------------+
35 | | Invariants                                                                |
36 | +----------------------------+-----------+------------+--------+------------+
37 | | length                     | X         |            |        |            |
38 | +----------------------------+-----------+------------+--------+------------+
39 | | angle                      | X         | X          |        |            |
40 | +----------------------------+-----------+------------+--------+------------+
41 | | ratio of lengths           | X         | X          |        |            |
42 | +----------------------------+-----------+------------+--------+------------+
43 | | parallelism                | X         | X          | X      |            |
44 | +----------------------------+-----------+------------+--------+------------+
45 | | incidence                  | X         | X          | X      | X          |
46 | +----------------------------+-----------+------------+--------+------------+
47 | | cross ratio                | X         | X          | X      | X          |
48 | +----------------------------+-----------+------------+--------+------------+
49 | 
50 | 
51 | Concepts
52 | =========
53 | 
54 | 
55 | Homogeneous coordinates
56 | -----------------------
57 | 
58 | We have a point (x,y) in the Euclidean plane. To represent this point in the projective
59 | plane, we simply add a third coordinate of 1 at the end: (x,y,1).
60 | Scaling is unimportant, so the point (x, y, 1) is also called the augmented vector.
61 | This means we can simply represent a point in the projective plane as (ax, ay, a).
62 | Since scaling is unimportant these points are called homogeneous points.
63 | "a" is the scale factor that can not be 0. In order to find the Euclidean point
64 | from a projective point we simply divide the projective point by its last
65 | dimension:
66 | 
67 | - Euclidean : (x,y) -> Projective (Gx,Gy, G) -> Euclidean (Gx/G, Gy/G, G/G)
68 | 
69 | When the last element G is 0, the homogeneous point is called a point at infinity,
70 | or an ideal point. These points do not have an equivalent representation in the Euclidean plane
71 | 
72 | 
73 | How do we represent a line in projective space:
74 | 
75 | - We start with the simple formula ax + by + c = 0
76 | - Essentially this is the dot product of 2 vectors
77 | :math:`v_1= [x, y, 1]` and :math:`v_2=[a, b, c]`
78 | - Notice that :math:`v_1` is the augmented vector and
79 | :math:`v_2` is simply another homogeneous point
80 | - So a line in 2d projective space is essentially the dot product of
81 | an augmented vector with a homogeneous point, set equal to 0
82 | 
83 | We can find where two lines intersect on the 2d projective plane
84 | by taking the cross product of the two line vectors
85 | 
86 | Ex.
87 | :math:`v_1 = [2,6,2]` and :math:`v_2 = [4,8,5]`
88 | they intersect at the point :math:`\tilde{x}`.
89 | The condition is that the dot product of :math:`\overline{x}` with :math:`v_1`
90 | or with :math:`v_2` needs to be equal to 0.
91 | Notice the relation between the augmented vector and the homogeneous point
92 | :math:`\overline{x} = {\frac{\tilde{x}}{w}}`
93 | 
94 | This is also the same as :math:`v_1 {\times} v_2 = \tilde{x}`
--------------------------------------------------------------------------------
/MathNotes/N-Dimensional-Geometry.md.html:
--------------------------------------------------------------------------------
1 | # N-Dimensional Geometry
2 | 
3 | Notes from Murty, K. (2001) Computational and Algorithmic Linear Algebra and
4 | n-Dimensional Geometry. Ann Arbor. Chapter 3
5 | 
6 | Let $S \subset R^n$ and $x \in R^n$, the translation of S to x is the set of
7 | points: $ \{ x + y | y \in S \} $
8 | 
9 | Parametric representation of a point in $S \subset R^n$ is a vector of
10 | functions, such as:
11 | $x = (x_1, x_2, \dots, x_n)^T
12 | = (f_1(a_1, \dots, a_r), \dots, f_n(a_1, \dots, a_r) ) $
13 | 
14 | With no constraints the points in S can take any real number. We can also
15 | introduce constraints on the functions to determine the points in S, so something
16 | like the following is possible:
17 | 
18 | $S = \{
19 | (f_1(a_1, \dots, a_r), \dots, f_n(a_1, \dots, a_r) )_j |
20 | j \in \{1,\dots, n \},
21 | a \in C(a) \}$
22 | where $C(a)$ represents a subset of $R^n$ that satisfies a set of constraints.
23 | 
24 | A line in N dimensions is:
25 | $ x = a + \alpha b$ where $a, b \in R^n$ and $b \not = 0$, meaning that at least
26 | one component of the vector b must be a nonzero value.
27 | Notice that it has a single parameter $\alpha$ which can take all real values
28 | from $R$, that is, it is a scalar value that gets multiplied with each
29 | component of the vector b.
30 | 
31 | In order to check whether a point is on a line or not:
32 | 
33 | ```Python
34 | 
35 | def in_line(a: List[float],
36 |             b: List[float],
37 |             alpha: float,
38 |             x: List[float]) -> bool:
39 |     """
40 |     Checks if the point x is on the line (a + b * alpha)
41 |     """
42 |     check = True
43 |     for i in range(len(a)):
44 |         if (a[i] + b[i] * alpha) != x[i]:
45 |             check = False
46 |     return check
47 | ```
48 | Check whether there is a line between two points:
49 | 
50 | ```Python
51 | 
52 | def have_a_line(a: List[float], c: List[float]) -> bool:
53 |     """
54 |     We check if a line might exist between these two points
55 |     and try to obtain the parametric representation (a + b * alpha = c),
56 |     since b = c - a at alpha = 1
57 |     """
58 |     if a == c:
59 |         return False
60 |     b = [c[i] - a[i] for i in range(len(a))]
61 |     return any(b[i] != 0 for i in range(len(b)))  # b must have a nonzero component
62 | ```
63 | 
64 | Two parametric representations of lines $U$ and $V$ in $R^n$ correspond to the
65 | same line if the following conditions hold true:
66 | 
67 | - $U = \{ x = a + \alpha * b | \alpha \in R, a, b \in R^n, b \not = 0 \}$
68 | - $V = \{ x = c + \beta * d | \beta \in R, c, d \in R^n, d \not = 0 \}$
69 | 
70 | - $d$ is a scalar multiple of $b$
71 | - $a - c$ is a scalar multiple of $b$
72 | 
73 | ```Python
74 | 
75 | def check_parametric_same_line(a: List[float], alpha: float,
76 |                                b: List[float],
77 |                                c: List[float], beta: float,
78 |                                d: List[float]) -> bool:
79 |     """
80 |     Two parametric representations of lines $U$ and $V$ in $R^n$ correspond to
81 |     the same line if the following conditions hold true:
82 |     $d$ is a scalar multiple of $b$
83 |     $a - c$ is a scalar multiple of $b$
84 |     """
85 |     def is_scalar_multiple(u: List[float], v: List[float]) -> bool:
86 |         # u is a scalar multiple of v if the ratio u[i] / v[i] is the same
87 |         # wherever v[i] != 0, and u[i] == 0 wherever v[i] == 0
88 |         prev = None
89 |         for i in range(len(v)):
90 |             if v[i] == 0:
91 |                 if u[i] != 0:
92 |                     return False
93 |             elif prev is None:
94 |                 prev = u[i] / v[i]
95 |             elif u[i] / v[i] != prev:
96 |                 return False
97 |         return True
98 | 
99 |     # check if 'd' is a scalar multiple of 'b'
100 |     if not is_scalar_multiple(d, b):
101 |         return False
102 | 
103 |     # check if 'a - c' is a scalar multiple of 'b'
104 |     diff = [a[i] - c[i] for i in range(len(a))]
105 |     return is_scalar_multiple(diff, b)
106 | ```
107 | 
108 | Half lines and rays in N dimensions:
109 | 
110 | The parametric representation of a half line is
111 | $\{ x = a + \alpha b | \alpha \ge k \}$ where
112 | 
113 | - $b \not = 0$,
114 | - $a,b \in R^n$
115 | - k is a constant value.
116 | 
117 | Notice that this includes a set of points that is a subset of the line's
118 | representation, which does not impose a condition on $\alpha$.
119 | In this context:
120 | 
121 | - $a$ is the starting point of the half-line
122 | - $b$ is the direction vector of the half-line
123 | 
124 | Using this representation a half line can also be characterized as a function,
125 | such as,
126 | $ x = x(\alpha)
127 | = \{ x_1(\alpha), x_2(\alpha), \dots, x_n(\alpha) | \alpha \ge 0 \} $
128 | where $ x(\alpha) = a + \alpha b $
129 | when we say a half line begins at $a$, we assume that $\alpha = 0$ and that
130 | $b$ is the direction vector.
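As a small, hypothetical helper in the same style as the snippets above (it assumes `from typing import List`, which the earlier functions also rely on), membership on a half line can be checked like this:

```Python

def on_half_line(a: List[float],
                 b: List[float],
                 x: List[float]) -> bool:
    """
    Sketch (not from the book): checks whether x lies on the half line
    { a + alpha * b | alpha >= 0 } by solving for alpha on a nonzero
    component of b and verifying it on the remaining components.
    """
    k = next((i for i in range(len(b)) if b[i] != 0), None)
    if k is None:
        return False  # b = 0 does not define a half line
    alpha = (x[k] - a[k]) / b[k]
    if alpha < 0:
        return False  # x is on the line but behind the starting point a
    return all(a[i] + alpha * b[i] == x[i] for i in range(len(a)))
```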
131 | 132 | 133 | A ray is a half line that starts at 0, meaning that 134 | The parametric representation of a ray is: 135 | 136 | $\{ x = a + \alpha b | \alpha \ge k \}$ where 137 | 138 | - $b \not = 0$ 139 | - $b \in R^n$ 140 | - $a = \{ 0, 0, \dots, 0 \} \in R^n $ 141 | - $k$ is a constant value 142 | 143 | 144 | For every point that is not equal to all zeros, that is 145 | $p = \{p_0, p_1, \dots, p_i, \dots \} \in R^n$ where $p_i \not = 0$ 146 | there is a unique ray which can be represented as: 147 | $\{x = \alpha p : \alpha \ge 0 \}$ $p$ here is called the direction of the 148 | ray. Notice that the same term corresponds to direction vector in the half 149 | line equation. 150 | 151 | Let's us consider the following half line X: 152 | - $ \{ x = a + \alpha b | \alpha \ge k \} $ 153 | 154 | This half line begins at $a$ meaning that $\alpha = 0$ 155 | 156 | Let's us consider the following ray Y: 157 | - $\{ x = a + \beta b | \beta \ge k \}$, where $a = \{0, 0, \dots, 0\} \in R^n$ 158 | 159 | In this case X is parallel to Y. The point $x$ on half line that begins in $a$ 160 | is translated by $\alpha / \beta$ amount in the direction of $b$. 161 | 162 | p. 285 let $a, b \in R^n$ 163 | -------------------------------------------------------------------------------- /BookNotes/ElementsOfProbabilityAndStatistics.rst: -------------------------------------------------------------------------------- 1 | ############################################################## 2 | Elements of Probability and Statistics by Biagini & Campanino 3 | ############################################################## 4 | 5 | Reference: 6 | Biagini, Francesca, and Massimo Campanino, Elements of Probability and Statistics: An Introduction to Probability with de Finetti’s Approach and to Bayesian Statistics (Switzerland, 2016) 7 | 8 | Random Numbers 9 | =============== 10 | 11 | Introduction 12 | -------------- 13 | 14 | Numbers can be used to represent values. However the catch is that these 15 | values are not necessarily known. For example, you toss a coin, you can 16 | represent the result as number but you don't know the result yet, so its 17 | content is not known yet. 18 | 19 | However even if you don't know the exact value yet, you know *possible values* 20 | that can be the result, that is you know the *set of possible values* for the 21 | given element. 22 | 23 | These are called random numbers and they are most of the time represented with 24 | a capital letter. For example let the sides of a dice is A. We would write 25 | :math:`I(A) = {1,2,3,4,5,6}`: 26 | 27 | - :math:`I()` denotes the set of possible values 28 | 29 | - :math:`I(A)` is *upper bounded* if :math:`I(A) < + \infty` 30 | - :math:`I(A)` is *lower bounded* if :math:`I(A) > - \infty` 31 | - :math:`I(A)` is *bounded* if :math:`I(A) > - \infty, I(A) < + \infty` 32 | 33 | Two random numbers A and B are logically independent if 34 | 35 | - :math:`I(A,B) = I(A) \times I(B)`: the product being a cartesian product. 36 | 37 | Invalid Example: 38 | 39 | - I have a pack of balls, containing 4 balls, with 4 different colors. I 40 | pick a ball and pick another ball without putting back the first ball. 41 | 42 | Valid Example: 43 | 44 | - I have a pack of balls, containing 4 balls, with 4 different colors. If I 45 | pick a ball and I put the ball back and then I pick another ball. 

The following operations are possible for random numbers:

1. :math:`A \lor B = \max(A, B)`
2. :math:`A \land B = \min(A, B)`
3. :math:`A \land B = AB` (for events, as defined below)
4. :math:`\sim B = 1 - B`

The first two operations are distributive, associative and commutative. That
is they are:

- Distributive:
  :math:`A \lor (B \land C) = (A \lor B) \land (A \lor C)`
  :math:`A \land (B \lor C) = (A \land B) \lor (A \land C)`

- Associative:
  :math:`A \lor (B \lor C) = (A \lor B) \lor C`
  :math:`A \land (B \land C) = (A \land B) \land C`

- Commutative:
  :math:`A \lor B = B \lor A`
  :math:`B \land A = A \land B`

Plus they have the following properties:

- :math:`\sim \sim B = B`
- :math:`\sim (A \land B) = \sim A \lor \sim B`
- :math:`\sim (A \lor B) = \sim A \land \sim B`

Events
-------

Events are particular random numbers. They can have two values, thus they are
boolean in nature: :math:`I(E) = \{0, 1\}`. They have either happened, or they
have not happened.

- The operation :math:`\land` is called the *logical product*.
- The operation :math:`\lor` is called the *logical sum*.

In the case of events:

- :math:`E_1 \lor E_2 = E_1 + E_2 - (E_1 \land E_2)`
- :math:`E_1 \land E_2 = E_1 \times E_2`

Notice that since events are random *numbers*, there is nothing weird about
adding or subtracting them. The fact that we don't know their value does not
change the fact that they can be added or subtracted.

The complementary of an event E is :math:`\sim E = 1 - E`

The following also holds for events, due to the fact that they are random
numbers:

.. math::

   (E_1 \lor E_2)^{\sim} = \sim E_1 \land \sim E_2
   = (1 - E_1) \times (1 - E_2)
   = 1 - E_1 - E_2 + (E_1 \times E_2)
   = 1 - (E_1 + E_2 - E_1 \times E_2)
   = 1 - (E_1 \lor E_2)

As you can see this is coherent with the complementary rule given above.

The difference of events is the following:

- :math:`E_1 \setminus E_2 = E_1 - (E_1 \times E_2)`

Symmetric difference of events:

- :math:`E_1 \triangle E_2 = (E_1 \setminus E_2) \lor (E_2 \setminus E_1) = (E_1 + E_2) \bmod 2`

If the value of the random number, the event, is equal to 1, we say that the
event happens; when it is equal to 0, we say that it does not happen.

The logical sum of two events happens (is true) if at least one of the events
happens.

The logical product of two events happens (is true) if both events happen.

The complementary event of an event E happens only if E does not happen.

The subset relation :math:`E_1 \subset E_2` implies that if E1 happens, so
does E2.

The operator :math:`\vdash` marks the truth value of the proposition as true.
It is thus a logical operator.
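
Because events are just 0/1 numbers, these identities can be checked
mechanically; a minimal sketch (my own helper, not from the book):

.. code:: python

    from itertools import product

    def check_event_identities():
        """Verify the event identities over all 0/1 assignments."""
        ok = True
        for e1, e2 in product((0, 1), repeat=2):
            lor = max(e1, e2)    # logical sum
            land = min(e1, e2)   # logical product
            ok &= lor == e1 + e2 - e1 * e2
            ok &= land == e1 * e2
            ok &= 1 - lor == (1 - e1) * (1 - e2)  # De Morgan
        return ok

    print(check_event_identities())  # True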
The following relations apply to events:

- *incompatibility*: :math:`\vdash E_1 \land E_2 = 0`, meaning that if one
  event happens the other cannot happen.

- *exhaustivity*: :math:`E_1 + E_2 + \dots + E_n \ge 1`.

- *partition*: :math:`E_1 + E_2 + \dots + E_n = 1`; the events are said to
  form a partition if they are exhaustive and pairwise incompatible.

page 6 given n events constituent
--------------------------------------------------------------------------------
/TopicNotes/BayesNets/subtitles/3 - Challenge Question Solution - lang_en_vs52.srt:
--------------------------------------------------------------------------------
1
00:00:00,280 --> 00:00:04,282
The answer is A, 0.7445.

2
00:00:04,282 --> 00:00:08,106
Even though the Bayes net is simple the
answer will require a bit more work than

3
00:00:08,106 --> 00:00:10,280
many of our challenge questions.

4
00:00:10,280 --> 00:00:13,660
To be clear we'll be using capital
letters to indicate our variables.

5
00:00:13,660 --> 00:00:17,010
We're using lowercase letters to
indicate when that variable is true,

6
00:00:17,010 --> 00:00:20,510
or a 'not' in front of it to
indicate when it's not true.

7
00:00:20,510 --> 00:00:23,070
To solve this problem we have to
determine the ratio of times o2

8
00:00:23,070 --> 00:00:26,900
is true given o1 is true,
to all situations where just o1 is true.

9
00:00:27,920 --> 00:00:32,590
In other words, we sum over all possible
graduation situations, where o2 and

10
00:00:32,590 --> 00:00:34,440
o1 are true for the numerator.

11
00:00:34,440 --> 00:00:37,470
Which is simply,
the probability of o1 being true,

12
00:00:37,470 --> 00:00:40,500
o2 being true and graduation being true.

13
00:00:40,500 --> 00:00:45,620
Plus the probability of o1 being true, o2
being true and graduation being false.

14
00:00:45,620 --> 00:00:48,399
And we're going to
normalize by this number,

15
00:00:48,399 --> 00:00:50,972
plus the situation where o2 is negative.

16
00:00:50,972 --> 00:00:54,348
And that is the probability of o1, not o2,

17
00:00:54,348 --> 00:00:58,800
and g, plus the probability of o1,
not o2, and not g.

18
00:01:00,130 --> 00:01:04,354
Looking at the first part of the
numerator, probability of o1, o2, and

19
00:01:04,354 --> 00:01:07,170
g, that is simply
the probability of o1 given g,

20
00:01:07,170 --> 00:01:10,945
times the probability of o2 given g,
times the probability of g.

21
00:01:10,945 --> 00:01:13,790
And we can just read this off
the Bayes network over here.

22
00:01:14,790 --> 00:01:17,739
The first number,
the probability of o1 given g is 0.05.

23
00:01:18,890 --> 00:01:22,584
Probability of o2 given g is just 0.75.

24
00:01:22,584 --> 00:01:26,094
And the probability of g is just 0.9.

25
00:01:26,094 --> 00:01:29,868
We multiply that out and it's 0.3375.

26
00:01:29,868 --> 00:01:32,540
Next we'll consider the second
part of the numerator,

27
00:01:32,540 --> 00:01:35,520
the probability of o1, o2, and not g.

28
00:01:35,520 --> 00:01:38,559
Well that's simply the probability
of o1 given not g,

29
00:01:38,559 --> 00:01:42,119
the probability of o2 given not g,
and the probability of not g.

30
00:01:42,119 --> 00:01:44,772
And again we'll just read off our chart.

31
00:01:44,772 --> 00:01:49,733
It's 0.05 for
a probability of o1 given not g,

32
00:01:49,733 --> 00:01:54,941
0.25, which is a probability
of o2 given not g.

33
00:01:54,941 --> 00:01:59,239
And then the probability of not g,
which is just the complement of this,

34
00:01:59,239 --> 00:02:00,305
which is 0.1.

35
00:02:00,305 --> 00:02:03,412
That comes out to be 0.00125.

36
00:02:03,412 --> 00:02:06,550
Now we'll consider the last
two parts of the equation.

37
00:02:06,550 --> 00:02:08,990
We already have these
two from our numerator.

38
00:02:08,990 --> 00:02:12,469
So now we need to look at
probability of o1, not o2, and g,

39
00:02:12,469 --> 00:02:17,181
which is just equal to probability of o1
given g, probability of not o2 given g,

40
00:02:17,181 --> 00:02:18,510
and probability of g.

41
00:02:18,510 --> 00:02:21,550
Again, we can just read these
numbers off our Bayes net.

42
00:02:21,550 --> 00:02:24,010
It comes up to be 0.1125.

43
00:02:24,010 --> 00:02:28,150
The final one, the probability
of o1, not o2 and not g.

44
00:02:28,150 --> 00:02:33,706
Again we just use the same equations and
we get 0.00375.

45
00:02:33,706 --> 00:02:36,632
Now we can take all these
numbers we have just calculated,

46
00:02:36,632 --> 00:02:41,000
put it into our equation, and
do the substitution and get our answer.

47
00:02:41,000 --> 00:02:44,064
Doing the simple math
we end up with 0.7445,

48
00:02:44,064 --> 00:02:46,068
which is what we were hoping for.

49
00:02:46,068 --> 00:02:50,231
We calculated this answer by summing up
the results for all relevant situations.

50
00:02:50,231 --> 00:02:52,247
But we can also do
inference by sampling,

51
00:02:52,247 --> 00:02:54,720
which can handle much bigger networks.

52
00:02:54,720 --> 00:02:55,580
Please be on the lookout for

53
00:02:55,580 --> 00:02:58,120
all the different ways we can
do inference in Bayes networks.
--------------------------------------------------------------------------------
/TopicNotes/Planning/PlanningMDP.rst:
--------------------------------------------------------------------------------
##############################
Planning Under Uncertainty
##############################

We have seen different types of environments:

+------------+---------------+------------+
|            | Deterministic | Stochastic |
+============+===============+============+
| Fully      | A*, DFS, BFS, | MDP        |
| Observable | UniformCS     |            |
+------------+---------------+------------+
| Partially  |               | POMDP      |
| Observable |               |            |
+------------+---------------+------------+

Remember:

Stochastic: we don't know for sure the result of our actions.

Deterministic: the outcome of an action is always the same and predictable.

Fully Observable: every state is visible from the current state, so you only
need momentary sensory input to make a decision.

Partially Observable: you also need a memory to make a decision.

MDPs (Markov Decision Processes)
================================

Markov Decision Processes are composed of the following:

- States: :math:`\{s_1, s_2, s_3, \dots, s_n\}`
- Actions: :math:`\{a_1, a_2, a_3, \dots, a_n\}`
- State Transition Matrix: :math:`T(s, a, s') = P(s'|a, s)`:
  the matrix gives the probability of moving to another state
  given the action and the current state.
- A reward function associated to states, used to designate a goal.
- A policy: an action associated to each state, leading towards the goal.
- The planning problem is about finding a good and efficient policy.

The cost function in MDPs can be defined as follows:

Reward(s) = {Goal reward (a large positive value),
Avoid rewards (large negative values),
Step cost (a small negative value)}

The objective of an MDP can be defined as follows:

:math:`objective = \max E\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]`

- t: time value
- :math:`\gamma`: discount factor, a value between 0 and 1: :math:`0 < \gamma < 1`
- :math:`R_t`: the result of the cost function at time t.

The objective is to maximise the expected sum of the discounted rewards.
The discount factor decays future rewards relative to more immediate rewards;
it is an alternative way to specify a step cost, and it is basically there to
give an incentive to reach the goal as fast as possible.
The mathematically convenient thing about the discount factor is that it
keeps the sum bounded. With the discount factor it is easy to show the
following bound:

:math:`\sum_{t=0}^{\infty} \gamma^t R_t \le \frac{1}{1-\gamma} \times R_{max}`

:math:`R_{max}` is in most cases the goal reward.

Value Iteration
----------------

Once we define a cost function, which in turn helps us define an objective,
we are ready to assign a value to each state.

The value in this case is:

:math:`V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, s_0 = s\right]`

The expression looks complex but it is simple to decompose: it says that the
value of a state is the expected sum of rewards we obtain by executing the
policy pi, given that the initial state of the policy is the current state.
So basically :math:`V^{\pi}` says we are operating under the policy pi, and
the same goes for :math:`E_{\pi}`.
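
A tiny sketch (my own example, not from the course) of the discounted sum
that the objective maximizes:

.. code-block:: python

    def discounted_return(rewards, gamma):
        """Sum of gamma^t * R_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # three steps of cost -3, then the goal reward of +100
    print(discounted_return([-3, -3, -3, 100], gamma=0.9))  # ~64.77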
The value function is a potential function that leads from the goal location
all the way into the state space, so that hill climbing on this potential
function leads you on the shortest path to the goal.

Here is an example of its application:

+----+---+---+---+------+
|    | 1 | 2 | 3 | 4    |
+====+===+===+===+======+
| a  | 0 | 0 | 0 | +100 |
+----+---+---+---+------+
| b  | 0 | X | 0 | -100 |
+----+---+---+---+------+
| c  | 0 | 0 | 0 | 0    |
+----+---+---+---+------+

Now we ask the question:

Is 0 a good value for field a3?

The answer is no, because we can compute a better one, and the reason is
evident in the value function.

For a state space in which:

Reward(s) = {Goal reward: +100
Avoid rewards: -100
Step cost: -3}

With the transition model:

- going east: 0.8 chance
- going south: 0.1 chance
- staying in the same state: 0.1 chance

V(a3, E) = 0.8 * 100 - 3 = 77

E: east, denoting going east. (The south and stay outcomes contribute 0 here,
since those neighbouring values are still 0.)

We see thus that for the current state 0 is not a good value; 77 is a better
value. If we iterate like this until the value function converges, we obtain
the following matrix:

+----+----+----+----+------+
|    | 1  | 2  | 3  | 4    |
+====+====+====+====+======+
| a  | 85 | 89 | 93 | +100 |
+----+----+----+----+------+
| b  | 81 | X  | 68 | -100 |
+----+----+----+----+------+
| c  | 77 | 73 | 70 | 47   |
+----+----+----+----+------+

The algorithm is called value iteration.
From AIMA-python:

.. code-block:: python

    def value_iteration_instru(mdp, iterations=20):
        U_over_time = []  # list of state-value dictionaries, one per iteration
        U1 = {s: 0 for s in mdp.states}  # state value dictionary
        R, T, gamma = mdp.R, mdp.T, mdp.gamma
        # R is the reward function; T is the transition model, giving
        # P(next_state | current_state, current_action)
        for _ in range(iterations):
            U = U1.copy()
            for s in mdp.states:
                # by maxing over every action of s we are guaranteed
                # to follow the best policy
                U1[s] = R(s) + gamma * max([sum([p * U[s1] for (p, s1) in T(s, a)])
                                            for a in mdp.actions(s)])
            U_over_time.append(U)
        return U_over_time
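
A hypothetical usage sketch for the function above; the ``SimpleMDP`` class
is my own stand-in for the AIMA ``MDP`` interface (``states``, ``actions``,
``R``, ``T``, ``gamma``):

.. code-block:: python

    class SimpleMDP:
        """A two-state toy MDP matching the interface used above."""
        def __init__(self):
            self.states = ['s0', 'goal']
            self.gamma = 0.9

        def actions(self, s):
            return ['go']

        def R(self, s):
            return 100 if s == 'goal' else -3

        def T(self, s, a):
            # list of (probability, next_state) pairs
            return [(0.8, 'goal'), (0.2, s)] if s == 's0' else [(1.0, s)]

    snapshots = value_iteration_instru(SimpleMDP(), iterations=50)
    print(snapshots[-1])  # state values after (near) convergence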

POMDP (Partially Observable Markov Decision Processes)
=======================================================

POMDPs address the problem of optimal exploration versus exploitation: there
are information-gathering actions as well as goal-driven actions.

--------------------------------------------------------------------------------
/TopicNotes/Logic/courseNotes.rst:
--------------------------------------------------------------------------------
#######################
`Logic`_
#######################

Expert systems:

- Supply


Future of planning algorithms?

- Learning from examples
- Transfer learning
  - We learn in one area and apply it to another.
- Interactive planning:
  - Human - machine teams solve problems together:
    * What is the best interface?


Understand the following algorithms:

Resolution algorithm:

It is an elegant way of inferring new knowledge from a knowledge base.

Graphplan

Value iteration for Markov decision processes.

Propositional logic
---------------------

Propositional symbols such as
B, E, A, M, J

corresponding to:

B
  Burglary occurring

E
  Earthquake occurring

A
  Alarm occurring

M
  Mary calling

J
  John calling

These can be either true or false, or unknown.

We can make logical sentences by combining them with logical operators. Ex:

We can say that:

- Alarm is true when:
  * Burglary occurring is true
  * Earthquake occurring is true

The equivalent in propositional logic would be :math:`E \lor B \Rightarrow A`

We can say that when the alarm occurs, John and Mary would call:
:math:`A \Rightarrow (J \land M)`

We can say John calls if and only if Mary calls:
:math:`J \Leftrightarrow M`

We can say John calls if and only if Mary does not call:
:math:`J \Leftrightarrow \sim M`

A propositional sentence is either true or false with respect to a model of
the world.

A model is just a set of true or false values for all the propositional
symbols. Ex.
model = { B:True, A:False, J:True, ... }

We can define the truth of a sentence in terms of the truth of the symbols
with respect to the models, using truth tables.

Truth tables
----------------

Truth tables list all the possibilities for the propositional symbols:

| p     | q     | not p | p ^ q | p v q | p --> q | p <--> q |
|-------|-------|-------|-------|-------|---------|----------|
| false | false | true  | false | false | true    | true     |
| false | true  | true  | false | true  | true    | false    |
| true  | false | false | false | true  | false   | false    |
| true  | true  | false | true  | true  | true    | true     |

A **valid sentence** is **true in every possible model**, for every possible
combination of values of the propositional symbols.
A **satisfiable sentence** is one that is true in some model.
An **unsatisfiable sentence** is false in every possible model.
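
A small sketch (my own, with hypothetical sentence functions) that classifies
a sentence by enumerating every model:

.. code-block:: python

    from itertools import product

    def classify(sentence, symbols):
        """Return 'valid', 'satisfiable' or 'unsatisfiable'."""
        results = []
        for values in product((False, True), repeat=len(symbols)):
            model = dict(zip(symbols, values))
            results.append(sentence(model))
        if all(results):
            return 'valid'
        if any(results):
            return 'satisfiable'
        return 'unsatisfiable'

    # P v ~P is valid, P ^ ~P is unsatisfiable
    print(classify(lambda m: m['P'] or not m['P'], ['P']))
    print(classify(lambda m: m['P'] and not m['P'], ['P']))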

Limitations of propositional logic
-----------------------------------

- It can handle only true or false values:
  - There is no way to handle uncertainty as in probability theory.
  - There is no way to model objects with continuous properties like size,
    colour, etc.
- There are no shortcuts; to make a statement about a space you need to
  invoke every element of that space:
  - To say that all the rooms in the house are clean, you would need to
    invoke all the rooms in the house and relate them to each other with
    respect to their current state.


First Order Logic
=====================

We shall compare three approaches that model our world, namely:

- Propositional logic
- First order logic
- Probability theory

We shall compare their ontological commitment with respect to our world. We
shall also compare the beliefs the agents might have within these
approaches, that is, their epistemological commitment.

The ontological commitment of first order logic is that there are:

- relations between things in the world
- objects
- functions on those objects

in the world.

The epistemological commitment of first order logic is that agents can believe:

- True
- False
- Unknown

states of those things in the world.

The ontological commitment of propositional logic is that there are:

- facts

in the world.

The epistemological commitment of propositional logic is that agents can believe:

- True
- False
- Unknown

states of those facts in the world.

The ontological commitment of probability theory is that there are:

- facts

in the world.

The epistemological commitment of probability theory is that agents can believe:

- degrees of belief :math:`B \in [0, 1]`

in those facts in the world.

Problem solving used an **atomic representation** of the world, in which each
state is indivisible: you can only check whether it is the same as another
state, or whether it is the goal state or not.

In propositional logic and probability theory we use a **factored
representation** of the world, that is, we divide the world into a set of
facts that are true or false.

The most complex type of representation is called a **structured
representation**, in which an individual state is not just a set of values
for variables; it can include relationships between objects, a branching
structure, etc. This is what we see in traditional programming languages and
databases, and it comes with first order logic.


First Order Logic
====================

We have a set of:

- objects
- constants that can refer to objects
- functions mapping objects to objects
- relations

Syntax in first order logic
-----------------------------

We have sentences that describe facts that are true or false.

Atomic sentences are the predicates corresponding to relations:

- ABOVE(A,B)
- Vowel(A,B)
- A = B

We can also use the operators from propositional logic to combine sentences
of first order logic.

There are also quantifiers:

- :math:`{\forall}x` means for all x.
- :math:`{\exists}y` means there exists a y.

--------------------------------------------------------------------------------
/TopicNotes/CapsNets/capsnets.rst:
--------------------------------------------------------------------------------
#########
CapsNet
#########

Intuition
===========

A drawback of CNNs is that they cannot model spatial relationships between
features.

We can summarize this by saying that CNNs work fundamentally on 2D objects,
and thus model 2D information. What's important is to model 3D
relationships, that is, the position of features with respect to each other.

The thing is, for an object this relative positioning is an invariant.
For example your viewpoint can change, but that does not change the body
proportions of the object you are looking at.

CNNs lose this information due to the use of max pooling.

Let's recap how CNNs work.

CNNs have three types of layers:

- Convolution layers
- Pooling layers
- Fully connected (dense) layers

Convolution layers

Basically a convolution layer is constructed by a convolutional window/frame,
which is like a mask that passes over the image horizontally and vertically.
Each placement defines a node, and the resulting nodes are grouped in a layer
called a convolutional layer.

The value of a convolutional layer node is computed as in any linear
regression: we multiply the pixel values with the weights and sum everything
up, including a bias term. Then we apply an activation function like relu.
The weights are usually represented inside the convolution window.

Pooling layers

They take convolutional layers as input and reduce their dimensionality,
either by taking the maximum value in each filter, which keeps the number of
filters the same while reducing height and width, or by averaging whole
feature maps.

Fully connected layers

They take all the inputs from the previous layer and apply a linear
transformation to them.

The problem for CNNs is:
the internal data representation of a convolutional neural network does not
take into account important spatial hierarchies between simple and complex
objects.

What does the brain do?
From the visual information received by the eyes, the brain builds a
hierarchical representation of the world around us and tries to match it
with already learned patterns and relationships stored in the brain. This is
how recognition happens.

What do capsules do?
They incorporate relative relationships between objects, represented
numerically as a 4D pose matrix.

From Hinton et al.:

Instead of aiming for viewpoint invariance in the activities of “neurons” that
use a single scalar output to summarize the activities of a local pool of
replicated feature detectors, artificial neural networks should use local
“capsules” that perform some quite complicated internal computations on their
inputs and then encapsulate the results of these computations into a small
vector of highly informative outputs. Each capsule learns to recognize an
implicitly defined visual entity over a limited domain of viewing conditions
and deformations and it outputs both the probability that the entity is
present within its limited domain and a set of “instantiation parameters” that
may include the precise pose, lighting and deformation of the visual entity
relative to an implicitly defined canonical version of that entity. When the
capsule is working properly, the probability of the visual entity being
present is locally invariant — it does not change as the entity moves over the
manifold of possible appearances within the limited domain covered by the
capsule. The instantiation parameters, however, are “equivariant” — as the
viewing conditions change and the entity moves over the appearance manifold,
the instantiation parameters change by a corresponding amount because they are
representing the intrinsic coordinates of the entity on the appearance
manifold.

Capsules encode the probability of detection of a feature as the length of
their output vector, and the state of the detected feature as the direction
in which that vector points ("instantiation parameters"). So when the
detected feature moves around the image, or its state somehow changes, the
probability stays the same (the length of the vector does not change), but
its orientation changes.
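
A minimal sketch (my own illustration) of this encoding: the length of the
output vector is read as a probability, the unit direction as the feature's
state (a nonzero vector is assumed):

.. code:: python

    import math

    def capsule_reading(v):
        """Split a nonzero capsule output into (probability, state)."""
        length = math.sqrt(sum(x * x for x in v))
        direction = [x / length for x in v]
        return length, direction

    # two outputs with the same length but different orientations:
    # same detection probability, different feature state
    print(capsule_reading([0.6, 0.0]))
    print(capsule_reading([0.0, 0.6]))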

Inner working of the Capsules
------------------------------

It is easier to understand how capsules differ from neurons by remembering
what neurons do.

Neurons:

- take a scalar input
- multiply it with a weight
- add a bias to it

- Then we apply a function to map the resulting scalar value to a range of
  probabilities.

A neuron usually takes several scalar inputs: it multiplies each with a
weight, adds a bias to each, and so on.

Capsules:

- Take vectors as input.
- Multiply them with a weight matrix that encodes a spatial relation between
  a lower level feature (nose, mouth) and a higher level feature (face).
  For example, suppose the first weight matrix encodes the relation between
  a nose and a face: the nose is ten times smaller than a face, and its
  direction is the same as the face's, since they lie in the same plane.
  Once we multiply the input with this weight matrix, the output is a
  prediction of the position of the face. So from 3 input vectors (eyes,
  nose, mouth) we obtain 3 predictions about the position of a face (see the
  sketch after this list). Evidently, if the 3 predictions point to the same
  position and state of a face, then the probability of having a face there
  is quite high.

- Scalar weighting of the input vectors:
  Say we have 2 higher level capsules. How does a lower level capsule decide
  which of the two should receive its output? For example, should the nose
  capsule send its output to the human-face capsule or to the dog-face
  capsule? Given that the higher level capsules already receive inputs that
  are close to each other, the problem becomes one of clustering: does the
  output coming from the nose capsule form a cluster with the inputs of the
  human face or of the dog face? This is decided by the scalar weighting.
  The main idea is that scalar multiplication changes the length of a vector
  but not its direction; this is the essence, and it is explained further
  below.

- We sum the weighted output vectors. This is essentially the same as in a
  neuron, where we combine the weighted inputs.

- We squash the resulting vector towards a unit vector. The squashing
  function has two parts, unit scaling and additional squashing. The formula
  is the following:
  :math:`v_i = \frac{\|s_i\|^2}{1 + \|s_i\|^2} \times \frac{s_i}{\|s_i\|}`
  The squashing keeps the direction of the vector while mapping its length
  into the range [0, 1), so the length can be read as a probability.
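
A sketch of the prediction step referenced in the list above (hypothetical
shapes and values; in a real CapsNet the matrix W is learned):

.. code:: python

    def predict_higher_capsule(W, u):
        """u_hat = W * u: a lower capsule's prediction for a higher one."""
        return [sum(W[r][c] * u[c] for c in range(len(u)))
                for r in range(len(W))]

    u_nose = [0.2, 0.8]      # pose vector from the 'nose' capsule
    W_nose_face = [[10, 0],  # 'a face is ten times bigger than a nose'
                   [0, 10]]
    print(predict_higher_capsule(W_nose_face, u_nose))  # [2.0, 8.0]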

Dynamic Routing: Scalar weighting of Input vectors
---------------------------------------------------

The algorithm is the following:

.. code:: python

    import math

    def squashVector(s_j):
        """Squash a vector's length into the range 0 - 1,
        keeping its direction."""
        norm = math.sqrt(sum(x * x for x in s_j))
        if norm == 0:
            return [0.0 for _ in s_j]
        scale = (norm ** 2) / (1 + norm ** 2)
        return [scale * (x / norm) for x in s_j]

    def dotProduct(u, v):
        return sum(a * b for a, b in zip(u, v))

    def softmax(xs):
        exps = [math.exp(x) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    def dynamicRouting(r, u_hat):
        """
        Dynamic routing, following the procedure in
        https://arxiv.org/pdf/1710.09829.pdf:

        for all capsule i in layer l and capsule j in layer (l+1):
            b_ij <- 0
        for r iterations do:
            for all capsule i in layer l: c_i <- softmax(b_i)
            for all capsule j in layer (l+1): s_j <- sum_i(c_ij * u_hat_ij)
            for all capsule j in layer (l+1): v_j <- squash(s_j)
            for all i, j: b_ij <- b_ij + dotProduct(u_hat_ij, v_j)
        return v

        Parameters
        ------------
        r: number of routing iterations
        u_hat: u_hat[i][j] is the prediction vector computed by capsule i
               of the current layer for capsule j of the next layer; the
               two layers are represented implicitly by these indices.
        """
        n_lower = len(u_hat)
        n_upper = len(u_hat[0])
        dim = len(u_hat[0][0])
        # routing logits, initialised to zero (uniform priors)
        b = [[0.0] * n_upper for _ in range(n_lower)]
        v = []
        for _ in range(r):
            # coupling coefficients: softmax of b over the next layer
            c = [softmax(row) for row in b]
            v = []
            for j in range(n_upper):
                s_j = [sum(c[i][j] * u_hat[i][j][k]
                           for i in range(n_lower))
                       for k in range(dim)]
                v.append(squashVector(s_j))
            # agreement update
            for i in range(n_lower):
                for j in range(n_upper):
                    b[i][j] += dotProduct(u_hat[i][j], v[j])
        return v

--------------------------------------------------------------------------------
/TopicNotes/Hyperparameters/CourseNotes.rst:
--------------------------------------------------------------------------------
#########################
Intro to Hyperparameters
#########################

A hyperparameter is a value we set before applying a learning algorithm to a
dataset.

The challenge is that there is no magic way to work with them: no trick or
principle explains universally why a certain hyperparameter value is better
than another. What works depends on the task at hand.

In general hyperparameters fall into two categories:

- Optimizer hyperparameters
- Model hyperparameters

Optimizer Hyperparameters
~~~~~~~~~~~~~~~~~~~~~~~~~~

They relate to the optimization of the training process.
These include:

- learning rate
- mini batch size
- number of training iterations or epochs

Model Hyperparameters
~~~~~~~~~~~~~~~~~~~~~~

These concern the structure of the model:

- the number of layers and hidden units
- model specific parameters


Learning Rate
--------------

It is the single most important hyperparameter, and one should always make
sure that it has been tuned.

If you have normalized the inputs to your model, then a good starting point
is

0.01

A good list of candidate learning rates is the following:

- 0.1
- 0.01
- 0.001
- 0.0001
- 0.00001
- 0.000001

If you try one and the model does not train, you try the others and see how
the model responds.

The learning rate determines the step size of the descent during gradient
descent.
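
A one-line illustration (my own) of how the learning rate scales each update:

.. code-block:: python

    def gradient_step(w, grad, lr):
        """One gradient descent update: the step size is lr * |grad|."""
        return [wi - lr * gi for wi, gi in zip(w, grad)]

    w = [1.0, -2.0]
    grad = [0.5, -1.5]
    print(gradient_step(w, grad, lr=0.01))  # small, safe step
    print(gradient_step(w, grad, lr=1.0))   # large, possibly divergent step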

Here are the possible cases for the learning rate:

- Validation error: decreasing during training,
  due to a reasonable learning rate [✓]
- Validation error: decreasing during training,
  due to a large learning rate [✓]
- Validation error: decreasing too slowly during training,
  due to a too small learning rate [x]
- Validation error: increasing during training,
  due to a too large learning rate [x]

Here is a good technique to deal with these problems:

- Learning rate decay:

  - We decrease the learning rate, e.g. by half every 5 epochs, and see how
    well the model trains.

Minibatch size
---------------

It is the middle way between stochastic training and batch training. In
batch training we send everything at once; in stochastic training we send
the examples one by one.

Increasing it increases the computational requirements; decreasing it slows
down the training process. Good numbers are:

2, 4, 8, 16, 32, 64, 128, 256

One should think about adjusting the learning rate when one changes the mini
batch size.

Number of iterations
---------------------

The metric we should focus on is the validation error. Ideally we allow a
number of iterations large enough for the validation error to keep
decreasing, and stop early when the validation error stops decreasing.

However, one should be flexible in the definition of the stopping trigger,
because the validation error moves back and forth even when it is on a
downward trend.
In most cases, we stop the training if the validation error has not
decreased in the last 10 - 20 steps.
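
A minimal early-stopping sketch of that trigger (my own, with a patience of
10 validation steps):

.. code-block:: python

    def should_stop(val_errors, patience=10):
        """Stop if the error has not improved in the last `patience` steps."""
        if len(val_errors) <= patience:
            return False
        best_before = min(val_errors[:-patience])
        return min(val_errors[-patience:]) >= best_before

    errors = [1.0, 0.8, 0.7, 0.69] + [0.70] * 10
    print(should_stop(errors))  # True: no improvement in the last 10 steps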

Number of Hidden Units
----------------------

The number and architecture of the hidden units are the main measures of a
model's learning capacity.

More capacity might lead to overfitting;
less capacity might lead to underfitting.

Dealing with this can be done by using regularization techniques like:

- L2 regularization
- Dropout

Tips:

- For the first hidden layer:
  - It should be larger than the number of inputs.
- Number of layers:
  - 3 layers will often perform better than 2, but going deeper than 3
    rarely helps.
  - This is not true for CNNs, where depths of 10+ layers give better
    results.

RNN Hyperparameters
--------------------

There are two main parameters:

- Cell type: LSTM, or gated recurrent unit (GRU)
- Depth of the model: how many layers we need

One thing is for sure: LSTMs and GRUs perform better than the vanilla RNN.
The discussion of LSTM versus GRU is not resolved; they depend on the task.

The embedding size for word models should be around 50 - 200.

LSTM vs GRU

"These results clearly indicate the advantages of the gating units over the more
traditional recurrent units. Convergence is often faster, and the final
solutions tend to be better. However, our results are not conclusive in
comparing the LSTM and the GRU, which suggests that the choice of the type of
gated recurrent unit may depend heavily on the dataset and corresponding task."

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

"The GRU outperformed the LSTM on all tasks with the exception of language
modelling"

An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz,
Wojciech Zaremba, Ilya Sutskever

"Our consistent finding is that depth of at least two is beneficial. However,
between two and three layers our results are mixed. Additionally, the results
are mixed between the LSTM and the GRU, but both significantly outperform the
RNN."

Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin
Johnson, Li Fei-Fei

"Which of these variants is best? Do the differences matter? Greff, et al.
(2015) do a nice comparison of popular variants, finding that they’re all about
the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN
architectures, finding some that worked better than LSTMs on certain tasks."

Understanding LSTM Networks by Chris Olah

"In our [Neural Machine Translation] experiments, LSTM cells consistently
outperformed GRU cells. Since the computational bottleneck in our architecture
is the softmax operation we did not observe large difference in training speed
between LSTM and GRU cells. Somewhat to our surprise, we found that the vanilla
decoder is unable to learn nearly as well as the gated variant."

Massive Exploration of Neural Machine Translation Architectures by Denny Britz,
Anna Goldie, Minh-Thang Luong, Quoc Le


Example RNN Architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~

+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Application                            | Cell | Layers  | Size          | Vocabulary                | Embedding | Learning Rate |
|                                        |      |         |               |                           | Size      |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Speech Recognition (large vocabulary)  | LSTM | 5, 7    | 600, 1000     | 82K, 500K                 |           |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Speech Recognition                     | LSTM | 1, 3, 5 | 250           |                           |           | 0.001         |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Machine Translation (seq2seq)          | LSTM | 4       | 1000          | Source: 160K, Target: 80K | 1,000     | --            |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Image Captioning                       | LSTM |         | 512           |                           | 512       | (fixed)       |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Image Generation                       | LSTM |         | 256, 400, 800 |                           |           |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Question Answering                     | LSTM | 2       | 500           |                           | 300       |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
| Text Summarization                     | GRU  |         | 200           | Source: 119K,             | 100       | 0.001         |
|                                        |      |         |               | Target: 68K               |           |               |
+----------------------------------------+------+---------+---------------+---------------------------+-----------+---------------+
--------------------------------------------------------------------------------
/TopicNotes/Search/CourseNotes.rst:
--------------------------------------------------------------------------------
###################
Search
###################

AI is about what to do when you don't know what to do.
Regular programming is when you know what to do and you make the machine do
it by writing instructions.

=========================
Definition of a Problem
=========================

What is a problem:

- Initial state: :math:`S_0`
- Actions(s): a function that takes a state as input and returns the set of
  possible actions that the agent can execute:
  * returns: :math:`\{a_1, a_2, a_3, {\dots} \}`
- Result: another function, which takes an action and a state as input and
  gives another state as output:
  * So :math:`R(s, a_i) {\to} {\overline{s}}`
- GoalTest: a boolean function which says whether we are at the goal state
  or not.
- PathCost: another function which takes a sequence of states and actions as
  its argument and calculates a cost value for the given sequence:
  * :math:`PC(s_i \xrightarrow{a_j} s_{i+1} \xrightarrow{a_{j+1}} \dots) \to n, \; n \in \mathbb{R}, \; i \in \{0,1,\dots\}, \; j \in \{1,2,\dots\}`
  * Most of the time the cost of a path is the sum of the costs of the
    individual steps, which are calculated with the StepCost function.
- StepCost: another function that takes a state, an action, and the
  resulting state of the action as input and calculates the step cost of
  that action:
  * :math:`SC(s, a_i, {\overline{s}}) {\to} n`
- The last two are evaluation functions; StepCost is the quick one, and
  PathCost is the heavy one.

---------------
States
---------------

At every point of the search we want to separate the state space into three
parts:

- Frontier
- Explored
- Unexplored

Two types of graphs intersect with each other:

- the state space graph
- the search tree graph

The state space graph is usually a given; the search tree graph is
calculated. During the calculation process:

- the states that are already calculated are the explored states,
- the states that are being calculated form the frontier,
- the states that are not yet calculated are the unexplored states.

This results in the following general tree search function.
.. code-block:: python

    def TreeSearch(problem):
        # the frontier starts as the single path containing only
        # the initial state
        frontier = [[problem.initial_state]]
        while True:
            if len(frontier) == 0:
                # no path left to extend: there can be no solution
                return False
            # 'choice' stands for a family of strategies (breadth first,
            # cheapest first, ...); it removes and returns one path from
            # the frontier
            path = choice(frontier)
            state = path[-1]  # the state at the end of the path
            if state == problem.goal:
                # the state is the goal, so we found our path
                return path
            # we are not at the goal state, thus we should extend the path
            for a in problem.actions(state):
                # add the path, the action, and the result of the action,
                # that is, the new state resulting from the action
                frontier.append(path + [a, problem.result(state, a)])

Now let's see our options for the "choice" function:

- Breadth first search, or shortest-path-first search:
  * It chooses the shortest path that has not been considered yet.
    + The measure for the shortest path is the number of steps between the
      end state and the origin state.
    + For example, if between s1 and s5 there are 2 steps, between s3 and s5
      there are 102 steps, and between s2 and s5 there are 4 steps, breadth
      first search would evaluate s1, then s2, then s3, before meeting a
      condition like "all the branches associated with the frontier state
      are in the explored states list", etc.


.. code-block:: python

    def GraphSearch(problem):
        # paths are lists of states and actions; the frontier holds the
        # candidate paths, explored the states that were already expanded
        frontier = [[problem.initial_state]]
        explored = set()
        while True:
            if len(frontier) == 0:
                return False
            path = choice(frontier)  # removes one path, strategy-dependent
            state = path[-1]
            explored.add(state)
            if state == problem.goal:
                return path
            for a in problem.actions(state):
                s2 = problem.result(state, a)
                # only add states that are in neither frontier nor explored
                if s2 not in explored and all(s2 != p[-1] for p in frontier):
                    frontier.append(path + [a, s2])

.. code-block:: python

    # Breadth first search



- Uniform cost search, or cheapest first search:
  * Finds the path with the **cheapest total cost**.
  * How it works:
    + We start in the start state.
    + We pop the empty path from the frontier into explored.
    + We add the paths leading out of that state.
  * It is not a directed search, it is just an expansion; we should expect
    to cover half of the search space on average.

--------------------------
Use cases for Searches
--------------------------

In the case of a large binary tree:

Breadth first search and cheapest first search:

- need on the order of :math:`2^n` space

Depth first search:

- needs on the order of :math:`n` space

Breadth first and cheapest first always find their goal, because they
evaluate the states level by level, whereas depth first search can go down
an infinite path, because it evaluates the tree in the vertical direction.

------------------
A* search
------------------

A better name would be best-estimated-total-path-cost search.

To get faster results than uniform cost search, we need to add more
knowledge. A good piece of knowledge would be an estimate of the distance
from a state to the goal.

A* search works as follows:

- :math:`min(f = g + h)`:
  * We pick the path with the minimum value of the function "f".
  * The function "f" is defined as the sum of the results of the functions
    "g" and "h".
  * The function "g" is the path cost; it takes the path as its argument.
  * The function "h" is evaluated at the final state of the path and gives
    the estimated distance from that state to the goal.
  * The value of the function "h" should never be larger than the true cost
    to the goal.

When A* ends, it returns a path p with estimated cost C.
And it turns out that C is also the actual cost, because at the goal state
the estimated distance to the goal is 0; thus C, which was supposed to
represent the estimated cost, represents the actual cost, since
C = h(s) + g(p) and h(s) = 0.
All the other paths in the frontier have a bigger estimated cost than C,
because the frontier is explored in cheapest first order.
The path p, then, must have a cost that is less than the true cost of any
other path in the frontier.
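
A compact sketch (my own, not from the course) of A* with a priority queue
ordered by f = g + h:

.. code-block:: python

    import heapq
    from itertools import count

    def a_star(start, goal, neighbors, h):
        """neighbors(s) yields (step_cost, next_state); h is admissible."""
        tie = count()  # breaks ties so states are never compared directly
        frontier = [(h(start), 0, next(tie), start, [start])]
        explored = set()
        while frontier:
            f, g, _, state, path = heapq.heappop(frontier)
            if state == goal:
                return path
            if state in explored:
                continue
            explored.add(state)
            for cost, s2 in neighbors(state):
                heapq.heappush(frontier, (g + cost + h(s2), g + cost,
                                          next(tie), s2, path + [s2]))
        return False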

======================================
`How to derive heuristics functions`_
======================================

Heuristics can be derived from the formal description of a problem.
For example:

Description
Block can move from A to B if A is adjacent to B, and B is blank

Derived Heuristic 1
Block can move from A to B if A is adjacent to B. Heuristic: B - A == 1

Derived Heuristic 2
Block can move from A to B. Heuristic: Misplaced Blocks.

This is called generating a relaxed problem.

============================================
`When Searches Work`_
============================================

1. The problem domain must be fully observable.
2. The problem domain must be known: we have to know the set of actions
   available to us.
3. The problem domain must be discrete: we have to have a finite number of
   actions to choose from.
4. The problem domain must be deterministic: we have to know the result of
   taking an action.
5. The domain must be static, that is, there must be nothing else that
   changes the domain besides our own actions.

====================================
`How to implement the algorithms`_
====================================

In the implementation we talk about nodes. A node has 4 fields:

* the state at the end of the path
* the action it took to get to that state
* the total cost
* the parent, a pointer to another node

This linking of nodes one to another creates a sequence, a "path".
The word path is used as an abstract idea; a node is its representation in
computer memory. Nodes are organized into sequence structures such as
lists, on top of which the frontier and the explored list are built; a
minimal node sketch follows.
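
A minimal sketch (my own) of the node structure described above:

.. code-block:: python

    class Node:
        """A search-tree node: state, action, total cost, parent."""
        def __init__(self, state, action=None, cost=0, parent=None):
            self.state = state
            self.action = action   # action that led to this state
            self.cost = cost       # total path cost g(n)
            self.parent = parent   # pointer to the previous node

        def path(self):
            """Follow parent pointers back to recover the path."""
            node, states = self, []
            while node:
                states.append(node.state)
                node = node.parent
            return list(reversed(states))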

Frontier
----------

- Operations
  * Remove best item, add a new one.
  * Membership test
- Implementation
  * Priority queue
- Representation
  * Set
- Build
  * Hash tables
  * Tree

Explored
------------

- Operations
  * Add new members
  * Check for membership
- Representation
  * Single set
- Build
  * Hash tables
  * Tree
--------------------------------------------------------------------------------
/BookNotes/ComputerGraphicsPrinciplesPracticeInC.rst:
--------------------------------------------------------------------------------
############################################
Computer Graphics: Principles and Practice
############################################

Notes from:

Foley, James, Andries van Dam, Steven Feiner, and John Hughes, Computer Graphics: Principles and Practice, 2nd edn (Paris, 1996)

Chapter 7: Essential Mathematics
================================

**v**: the vector v
**M**: the matrix M
except:

**R**: is the set of real numbers
**C**: is the set of complex numbers
:math:`\mathbf{R}^{+}`: is the set of positive real numbers
:math:`\mathbf{R}^{+}_0`: is the set of non-negative real numbers
*i*: is the indexing variable, as in :math:`x_i`

The cartesian product of two sets A and B:

:math:`A \times B = \{(a,b) : a \in A, b \in B\}`

.. code:: python

    def cartProd(A, B):
        "Cartesian product of two lists"
        result = []
        for a in A:
            for b in B:
                result.append((a, b))
        return result

:math:`[a, b] = \{x : a \le x \le b\}`

.. code:: python

    def checkClosedInt(x, lower_limit, upper_limit):
        "Check that x is in the closed interval [lower_limit, upper_limit]"
        return lower_limit <= x <= upper_limit

:math:`[a, b) = \{x : a \le x < b\}`
:math:`(a, b] = \{x : a < x \le b\}`

.. code:: python

    def checkHalfOpenIntUpper(x, lower_limit, upper_limit):
        "Check that x is in the half open interval [lower_limit, upper_limit)"
        return lower_limit <= x < upper_limit

    def checkHalfOpenIntLower(x, lower_limit, upper_limit):
        "Check that x is in the half open interval (lower_limit, upper_limit]"
        return lower_limit < x <= upper_limit

:math:`[a, \infty) = \{x : a \le x\}`
:math:`(-\infty, b] = \{x : x \le b\}`

.. code:: python

    def checkLowerLimitWithInf(x, lower_limit):
        "Check that x is in [lower_limit, infinity)"
        return x >= lower_limit

    def checkUpperLimitWithInf(x, upper_limit):
        "Check that x is in (-infinity, upper_limit]"
        return x <= upper_limit

Function:

:math:`f: \mathbf{R} \to \mathbf{R}: x \to x^2` reads as:

.. code:: python

    def f(x: float) -> float:
        return x**2


the function f which maps a value x from the **domain** R
to a value :math:`x^2` in the **codomain** R.

Surjective function: a function whose values cover the entire codomain.
Ex:

:math:`g: \mathbf{R} \to \mathbf{R}^{+}_0: x \to x^2`

The values :math:`x^2` cover the entire codomain :math:`\mathbf{R}^{+}_0`,
so g is surjective whereas f is not.

Injective function: a function between a domain and a codomain in which no
two values of the domain can map to the same value of the codomain, meaning
that if h(a) = h(b) then a = b.
Ex:

:math:`h: \mathbf{R}^{+}_0 \to \mathbf{R}^{+}_0: x \to x^2`

A function that is both injective and surjective, like h, has an inverse
function h' which maps the codomain of h to the domain of h; it simply
undoes what h does.

A function that is both injective and surjective is called bijective.

.. code:: python

    def checkFuncAndInverse(fn, fnInv, domain: set, codomain: set):
        "Check that fn and fnInv are mutual inverses over domain and codomain"
        result = []
        for x in domain:
            # fnInv must undo fn on the domain
            result.append(fnInv(fn(x)) == x)

        for y in codomain:
            # fn must undo fnInv on the codomain
            result.append(fn(fnInv(y)) == y)

        return all(result)


Coordinates
------------

Coordinates are real numbers that are associated to geometric entities such
as points, lines, etc.

Geometric properties do not change, whereas numerical properties can. For
example, that a point lies on a line is a geometric property, true
irrespective of the coordinate system.

Chapter 16: Illumination Models from Ed. 2
==========================================

Ambient Light
--------------

The simplest illumination equation:

:math:`I = k_i`

where I is the resulting intensity, a pixel value for example, and
:math:`k_i` is the intrinsic intensity coefficient of the object. Notice
that no light source direction is included in this equation.

Let's suppose another model, where there is an ambient light illuminating
all the objects at a constant intensity. This ambient light is the result of
diffuse interreflection, where multiple surfaces reflect light, creating a
non-directional source of light. This transforms our equation into:

:math:`I = I_a k_a`

where I is the resulting intensity, :math:`I_a` is the intensity of the
ambient light, and :math:`k_a` is the coefficient of ambient reflection, a
material property of the surface that reflects the ambient light.

Diffuse reflection
--------------------

Ambient light assumes that the object is illuminated uniformly from all
angles; now let us consider illuminating objects from a point light source.
In the latter case the object would be bright on one side but not
necessarily on the other, and its brightness would change according to the
position of the point light source.

Lambertian reflection
++++++++++++++++++++++

Diffuse reflection is a property whereby an object with a dull, matte
surface, like chalk, is equally bright from all angles, because such
surfaces reflect light equally in all directions.

A surface's brightness depends on the angle between two vectors:

- L, the light source direction: where the light source is situated.
- N, the normal of the surface: a vector perpendicular to the surface.

This works as follows:

Let us suppose that the incoming ray of light has a very small area
:math:`{\delta}A`.

If the incoming beam came from a direction perpendicular to the surface,
the illuminated area would have been equal to :math:`{\delta}A`.

If we change the angle of approach of the incoming beam to an angle between
0 and 90 degrees, then :math:`{\delta}A` becomes a side of a right angled
triangle whose hypotenuse lies on the surface of the object. The angle
:math:`{\theta}` is then the angle between the hypotenuse and
:math:`{\delta}A`.

The cosine of the angle theta is directly proportional to the amount of
light reflected to the viewer, which is quite logical if you think about it:
if you position yourself right behind the light source, the brightest spot
on the object is directly opposite you.

The equation that models this relationship is:

:math:`I = I_p k_d \cos{\theta}`

where :math:`I_p` is the light source intensity and :math:`k_d` is the
diffuse reflection coefficient of the material, between 0 and 1. The angle
theta must be between 0 and 90 degrees.

This also means that we are treating the surfaces as self-occluding, that
is, light cast from behind does not illuminate the front.
When we need to light objects that are not self-occluding, we use
:math:`|\cos{\theta}|`, which amounts to inverting their surface normals;
we thus treat them as if they were illuminated from both sides.

If both N and L are normalized, we can rewrite :math:`\cos{\theta}` as the
dot product of N and L.

The equation :math:`I = I_p k_d (N \cdot L)` tends to make objects look
harsh, so to obtain something more realistic we add an ambient light term:

:math:`I = I_a k_a + I_p k_d (N \cdot L)`

Light-source attenuation
-------------------------

The equation above cannot distinguish two overlapping surfaces of identical
material in an image. In order to do that we need to add another factor to
the equation, the light source attenuation factor :math:`f_{att}`.
The equation thus becomes:

:math:`I = I_a k_a + f_{att} I_p k_d (N \cdot L)`

A useful example of :math:`f_{att}` is:

:math:`f_{att} = \min\left(\frac{1}{c_1 + c_2 d_L + c_3 d_L^2}, 1\right)`

where c1, c2, c3 are constants provided by the user, and :math:`d_L` is the
distance between the light source and the illuminated surface.

Colored light and surfaces
---------------------------

Colored light is treated per channel, that is, we write one equation for
each component of the color model. We represent an object's *diffuse color*
by :math:`O_d`, O for object, d for diffuse. In order to mark the component
that is represented we add its initial. Ex: :math:`O_{dR}` for red, G for
green, B for blue in the RGB color space. The above equation for red would
be:

:math:`I_R = I_{aR}k_{a}O_{dR} + f_{att}I_{pR}k_{d}O_{dR} (N \cdot L)`
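
A small sketch (my own, not from the book) of the per-channel diffuse
equation above:

.. code:: python

    def diffuse_intensity(I_a, k_a, O_d, f_att, I_p, k_d, N, L):
        """I = I_a*k_a*O_d + f_att*I_p*k_d*O_d*(N . L), one channel."""
        # clamp the dot product at 0: the self-occluding case
        n_dot_l = max(0.0, sum(n * l for n, l in zip(N, L)))
        return I_a * k_a * O_d + f_att * I_p * k_d * O_d * n_dot_l

    # red channel, light from straight above a horizontal surface
    print(diffuse_intensity(0.2, 0.3, 1.0, 1.0, 1.0, 0.7,
                            N=(0.0, 0.0, 1.0), L=(0.0, 0.0, 1.0)))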
The above equation for red would be: 240 | :math:`I_R = I_{aR}k_{a}O_{dR} + f_{att}I_{pR}k_{d}O_{dR} (N \cdot L)` 241 | 242 | Rather than indicating the channel name in the equation for all the channels, 243 | we simply use the variable :math:`\lambda` in order to replace the channel 244 | initial, thus the equation becomes: 245 | :math:`I_{\lambda} = I_{a\lambda}k_{a}O_{d\lambda} + 246 | f_{att}I_{p\lambda}k_{d}O_{d\lambda} (N \cdot L)` 247 | 248 | Specular Reflection 249 | -------------------- 250 | 251 | Specular reflection is easy to observe. Illuminate an apple, you would see 252 | that a particular part of the apple is highlighted with the color of the light 253 | source. This is specular reflection, you change your viewpoint you should 254 | notice that the highlighted area would also change. 255 | 256 | Phong illumination model is explaining that: 257 | 258 | .. math:: 259 | 260 | I_{\lambda} = I_{a\lambda} k_a O_{d\lambda} 261 | + f_{att} I_{p\lambda} [k_{d}O_{d\lambda} (N \cdot L) 262 | + k_{s}O_{s\lambda} (R \cdot V)^n] 263 | 264 | Where O_s is object's specular color, k_s and n are user provided. 265 | I_{p\lambda} is the primary component of the light source 266 | An alternative approach is to use the halfway vector H where: 267 | :math:`H = ( L + V ) / (|L + V|)`. Computationally this is cheaper. 268 | Because once we postulate that the viewer and the light source are at 269 | infinity. 270 | The H becomes a constant 1. 271 | But the specular component n, does not produce the same result as before due 272 | to the change in angles 273 | -------------------------------------------------------------------------------- /TopicNotes/MachineLearning/SupervisedNotes.rst: -------------------------------------------------------------------------------- 1 | ################################ 2 | Machine Learning - Supervised 3 | ################################ 4 | 5 | Bayes Networks are there for us to reason. 6 | Machine Learning help us to find the bayes network model. 7 | 8 | Machine Learning Taxanomy 9 | ============================== 10 | 11 | Distinguishing elements in machine learning: 12 | 13 | What do we learn ?: 14 | 15 | - Parameters 16 | - Structure 17 | - Hidden Concepts 18 | 19 | From what do we learn ?: 20 | 21 | - Supervised, with target labels 22 | - Unsupervised, without target labels but use replacement principals 23 | - Reinforcement learning, when an agent learns from feedback with physical environment 24 | like "Bravo!", "Ouf, that hurts" 25 | 26 | What for ? 27 | 28 | - Prediction 29 | - Diagnosis 30 | - Summarisation 31 | 32 | How to learn ? 33 | 34 | - Passive, learning agent doesn't interfere with what is being learnt 35 | - Active, learning agent has impact on the data being learnt 36 | - Online, learning occurs while the data is being generated 37 | - Offline, learning occurs after the data is generated 38 | 39 | Outputs ? 40 | 41 | - Classification: binary, or discreet number of class 42 | - Regression: continous 43 | 44 | Details ? 45 | 46 | - Generative: seeks to model the data as generally as possible 47 | - Discriminitive: distinguish data 48 | 49 | Supervised Learning 50 | ===================== 51 | 52 | Data structure is of the following: 53 | 54 | {x_1, x_2, x_3, x_4, ..., x_n -> y_1} 55 | {a_32, a_33, a_34, ..., a_n -> y_2} 56 | 57 | our job is to find the f() in which f(x_m) = y_m 58 | 59 | Occam's razor: 60 | Everything else being equal, choose the less complex hypothesis. 
61 | 62 | Trade off within the machine learning is that: 63 | 64 | good fit <--> low complexity 65 | 66 | Meaning that good fits for the data are usually high in complexity. 67 | 68 | Spam Detection 69 | ----------------- 70 | 71 | In spam detection we got an email, we try to classify it, either as SPAM or HAM, 72 | ham means it is worth passing to another person as a person mail. 73 | 74 | 1 technique that is being used is bag of words in which we represent each word 75 | as a frequency: 76 | Ex: 77 | 78 | hello my name is Soma, yes Soma is my name. 79 | 80 | 81 | In a dictionary they are represented as the following: 82 | 83 | hello my name is Soma yes 84 | 1 2 2 2 2 1 85 | 86 | Bayes Networks 87 | --------------- 88 | 89 | What we had been doing in spam detection is building a bayes network, where we 90 | build up the parameters of the bayes network through the supervised learning by 91 | maximum likelihood estimator based on training data. 92 | 93 | Small refresher: 94 | 95 | If we have structure like: 96 | 97 | Spam 98 | / \ 99 | w1 w2 100 | 101 | how many parameters we would need given that the dictionary has 12 words. 102 | The answer is 25 103 | Why ? 104 | For nodes that have incoming edges we would need two times the number of edges 105 | coming into the node, for nodes that are not recieving any edges, you would need 106 | only one parameter. 107 | 108 | Message = Sport 109 | 110 | P(Spam | Message) = P(Message|Spam) P(Spam) / P(Message) = 0,1666...7 111 | P(Spam) = 3/8 112 | P(Message|Spam) = 1/9 113 | P(Message) = 1/4 114 | 115 | Message = Today is Secret 116 | P(Spam | Message) = P(Message|Spam) P(Spam) / P(Message) = ? 117 | 118 | P(Message|Spam) = P("Today is Secret"|Spam) = P(Spam) * P("Secret" under mails labeled as spam) * P("Today" under mails labeled as spam) * P("is" under mails labeled as spam) 119 | 120 | P(Message) = P(Message|Spam)P(Spam) + P(Message|~Spam)P(~Spam) 121 | 122 | Since today is not among the mails labeled as spam, it is 0, thus the probability of our mail being spam is 0. 123 | 124 | This is a good example of overfitting, because we can not have a word determining the entire mail's label. 125 | 126 | Maximum likelihood Estimate is basically, P(x) = count(x) / N 127 | that is probability of x is equal to count of x in the class over all the data in the class. 128 | 129 | Laplace Smoothing 130 | ------------------- 131 | 132 | Laplace smoothing is like maximum likelihood estimator, but adds a variable k for smoothing. 133 | 134 | Basically L(k,x) = count(x) + k / N + k|x| 135 | 136 | This means that Probability(x) in laplace smoothing calculation, is equal to the number of the occurences of x in the given class, 137 | N is total number of occurances in the given class, if no class is given it is equal to the total number of occurances in the data set, k is the smoother, |x| is the number of tokens in the data set 138 | 139 | Ex: 140 | Total number of tokens: 12 141 | 142 | total number of occurences: 24 143 | 144 | Spam class have: 9 occurences 145 | 146 | Ham class have: 15 occurences 147 | 148 | if k is 1, 149 | 150 | What is the probability of token with 0 occurences in Spam Class given the spam class ? 151 | 152 | P(token|spam) = 0 + 1 / 9 + 12 = 1 / 21 153 | 154 | Message = Today is Secret 155 | P(Spam | Message) = P(Message|Spam) P(Spam) / P(Message) = ? 
156 | 157 | P(Message|Spam) = P("Today is Secret"|Spam) = P(Spam) * P("Secret" under mails labeled as spam) * P("Today" under mails labeled as spam) * P("is" under mails labeled as spam) 158 | 159 | P(Spam) = 2/5 160 | P(Secret|Spam) = 4/ 9+12 161 | P(Today|Spam) = 1/ 9+12 162 | P(is|Spam) = 2/ 9+12 163 | x 164 | \-------------------------- 165 | 16 / 46305 166 | 0,000345 167 | 168 | P(~Spam) = 3/5 169 | P(Secret|~Spam) = 2/ 15+12 170 | P(Today|~Spam) = 3/ 15+12 171 | P(is|~Spam) = 2/ 15+12 172 | 173 | 36 / 98415 174 | 0,000365 175 | 176 | P(Message) = P(Message|Spam)P(Spam) + P(Message|~Spam)P(~Spam) = 0,00071 177 | 178 | This is basically naive bayes model which is a **generative** modal. 179 | 180 | You can come up with other criteria for spam filtering, for example: 181 | 182 | - Known spamming id? 183 | - Have you emailed to this person before 184 | - Have another 1000 people received the same message 185 | - Email header consistent 186 | - Written in all capitals 187 | - Do inline urls point to where they say 188 | - Are you adressed by your correct name. 189 | 190 | Hand written digit recognition 191 | -------------------------------- 192 | 193 | Applying naive bayes for hand written digit recognition. 194 | 195 | When you are applying naive bayes, the input can be: 196 | 197 | - Pixel vectors 198 | 199 | * Let's say we have a input vector of 16 x 16 you have to convolve the input vector with a gaussian variable, that is you take each pixel with neighbouring pixels for smoothing. This method is called input smoothing. 200 | 201 | - But naive bayes is not a good method for this. 202 | 203 | Overfitting prevention 204 | ------------------------ 205 | 206 | Remember the Occam's razor: 207 | 208 | Cross validation is the key for overcoming the overfitting problem. 209 | 210 | A typical division for the training data would be: 211 | 212 | +-------------------------------------------------------------+ 213 | | Training data | 214 | +=====+===========================+===========================+ 215 | | %80 | Used for Training | Figure out the parameters | 216 | +-----+---------------------------+---------------------------+ 217 | | %10 | Used for Cross Validation | Find best smoothing param | 218 | +-----+---------------------------+---------------------------+ 219 | | %10 | Used for Testing | Verify the performence | 220 | +-----+---------------------------+---------------------------+ 221 | 222 | Figuring out of parameters is very much like finding out the probabilities of the bayes network. 223 | 224 | Finding the best smoothing parameter, K, happens as the following. 225 | 226 | You train the model with different K values and measure their performance in Cross Validation data. 227 | Then you maximise over all K, to get the optimum k value. 228 | 229 | Then you touch only **once** to test data, in order to verify your performance. 
230 | 231 | Regression Problems 232 | ===================== 233 | 234 | Data structure is of the following: 235 | 236 | {x_1, x_2, x_3, x_4, ..., x_n -> y_1} 237 | {a_32, a_33, a_34, ..., a_n -> y_2} 238 | 239 | our job is to find the f() in which f(x_m) = y_m 240 | 241 | in linear regressions though the function f() has a particular form: 242 | 243 | f() = w_1·x + w_0 244 | 245 | Loss Function 246 | ----------------- 247 | 248 | Loss function can differ: 249 | 250 | Here is how quadratic loss function is calculated: 251 | 252 | :math:`w_0 = {\frac{1}{M}{\sum}y_i - \frac{w_1}{M}{\sum}x_i}` 253 | 254 | :math:`w_1 = {\frac{M{\sum}x_iy_i - {\sum}x_i{\sum}y_i}{M{\sum}x^{2}_i - ({\sum}x_i)^2}}` 255 | 256 | M, is the number of training examples. 257 | 258 | 259 | Problems with linear regression 260 | -------------------------------- 261 | 262 | Linear regressions are very susceptible to outliers. 263 | Outliers can change significantly the minimizing quadratic loss function. 264 | They don't generalise well to data that is not really linear, like curves etc. 265 | Some of the assumptions about the linear data might be wrong for such cases, for example 266 | as the x goes to infinity in a linear data, the graph would give the impression that the 267 | y also goes to infinity, whereas in reality that might not be the case. 268 | 269 | Normalisation, or complexity control is also an issue to consider in linear regression. 270 | Most of the time people use either L1 or L2 regularisation. 271 | 272 | Perceptron 273 | ----------- 274 | 275 | Perceptron algorithm works with a linear seperator. 276 | That is a linear equation which seperates positives from the negatives. 277 | 278 | f(x) = {1 if w_1 x+w_0 >=0, 279 | 0 if w_1 x+ w_0 < 0} 280 | 281 | Perceptron only convergence if the data is linearly seperable. 282 | 283 | The iteration of the perceptron is of the following and it is very close to gradient descent. 284 | 285 | Starts with a random guess for w_0, and w_1 286 | 287 | w^m_i <-- w^m-1_i + a (y_j - f(x_j)) 288 | 289 | here m is the iteration index. 290 | y_i, is the target label, 291 | f(x_j) is the calculated target label. 292 | a is the learning rate. 293 | w^m-1_i is the weight calculated from the previous cycle 294 | y_i - f(x_i) is our error. 295 | 296 | Linear Methods Summary 297 | ---------------------- 298 | 299 | We talked about: 300 | 301 | - Regression vs Classification 302 | - Exact Solution vs Iterative Solutions 303 | - Smoothing 304 | - Non linear problems with linear methods, like SVMs, etc. 305 | 306 | Parametric methods: 307 | 308 | - Number of parameters are independent of the training size. 309 | 310 | Non-parametric methods: 311 | 312 | - Number of parameters are dependent of the training size, and can grow with the data size 313 | 314 | K-Nearest Neighbor 315 | -------------------- 316 | 317 | Simple thing, already studied in the intro to machine learning so, naive baysen.tex 318 | 319 | Problems of KNN: 320 | 321 | - Very large data size: 322 | + Solution: KDD, instead of representing the data as a list, we represent it as a tree 323 | 324 | - Very large feature space: 325 | + Solution: After 2-3 dimensions, don't use knn, do tensor calculations. 
326 | -------------------------------------------------------------------------------- /TopicNotes/InferenceBayesNets/CourseNotes.rst: -------------------------------------------------------------------------------- 1 | ######################### 2 | Inference in Bayes Nets 3 | ######################### 4 | 5 | The point of this unit is to be able to answer probabilistic questions using 6 | bayes nets. 7 | 8 | Instead of input and output, in probabilistic inference we call the givens, 9 | evidence, and the results, query. 10 | The variables we know the values of are the evidence, and the variables we want 11 | to find out are the query. 12 | Anything that is neither evidence, nor query is a hidden variable. 13 | 14 | The output is not a number, but rather a probabilistic distribution. 15 | So the answer is going to be a complete joint probability distribution over 16 | query variables. 17 | It is called the posterior distribution given the evidence. 18 | It is expressed as the following: 19 | 20 | P(Q_1, Q_2, ... | E_1=e_1, E_2=e_2, ...), this is the computation we want to 21 | come up with. 22 | 23 | Another reasonable question would be to ask which combination of values would 24 | yield highest probability, so something like this: 25 | 26 | argmax_q P(Q_1=q_1, Q_2=q_2, ... | E_1=e_1, E_2=e_2, ... ) 27 | 28 | Which q values are maxiable given the evidence values is question we can ask. 29 | 30 | The direction of computation is irrelevant, that is we can have e_1, e_2 and 31 | compute q_1, etc. or have q_1 and compute all the rest, etc. 32 | This is to notice that the direction of computation in ordinary programming 33 | languages are in one way, from input to output, this is not the case for bayes 34 | nets inference. 35 | 36 | Exact Inference 37 | ================= 38 | 39 | Enumeration 40 | ---------------- 41 | 42 | Enumeration is a method to calculate the probabilities in a given bayes net, it 43 | simply enumerates all the possibilities and adds them up and thus comes up with 44 | an answer. 45 | 46 | Here is how it works: 47 | 48 | Example bayes net graph: 49 | 50 | B E 51 | .\ ./ 52 | A 53 | ./ .\ 54 | J M 55 | 56 | The problem: 57 | P(+b|+j,+m) = ? 58 | It reads as if b is true, j is true, m is true, given that the events j, and m 59 | are already happened, what is the probability of the event b would occur. 60 | 61 | Enumeration works as follows: 62 | 63 | First we express the conditional probabilities as unconditional probabilities. 64 | So, 65 | 66 | P(+b|+j,+m) = P(+b,+j,+m)/P(+j,+m) 67 | 68 | Then we enumerate the atomic probabilities and calculate the sum of their 69 | products. 70 | P(+b|+j,+m) the probability of these terms can be determined by enumerating all 71 | possible values of the hidden variables. In this case there are two E and A. 72 | So we shall sum over those variables for all values of E and A. 73 | 74 | Formally: 75 | 76 | .. math:: 77 | 78 | {\sum_{e}}{\sum_{a}} P(+b,+j,+m,e,a) 79 | 80 | "a" and "e" represent here the possible value that E, and A can have in given 81 | net. 82 | 83 | We transform this equation to a product of 5 variables by expressing each of the 84 | nodes in relation to their parent nodes in the net. 85 | 86 | that is: :math:`{\sum_{e}}{\sum_{a}} P(+b) P(+j|a) P(+m|a) P(e) P(a|+b,e)` 87 | 88 | The product section can be named as a function, f(e,a) 89 | We look up the values from the table of the bayes net, which is created during 90 | the creation of the net. 
91 | 92 | This enumeration technique works for small networks, bigger networks would 93 | increase the number of rows to lookup. 94 | 95 | So we need to find a way to speed up the enumeration. 96 | 97 | Pull out terms 98 | ---------------- 99 | 100 | We can pull terms out from the enumeration, for example 101 | 102 | :math:`{\sum_{e}}{\sum_{a}} P(+b) P(+j|a) P(+m|a) P(e) P(a|+b,e)` 103 | 104 | In the above equation the P(+b) is the same throughout the summation 105 | thus it doesn't need to be added to the summation everytime, we can 106 | put it in front of the summation. P(e) is also independent of "a", 107 | so it can also come before the summation of the terms related to a. 108 | 109 | But this won't decrease the amount that much. 110 | 111 | Maximize independence 112 | ----------------------- 113 | 114 | This is a technique for efficient inference. 115 | The structure of the bayes net determines how efficient it is to do 116 | inference on it. 117 | 118 | 119 | Causal Direction 120 | --------------------- 121 | 122 | The important thing is that Bayes nets tend to be the most compact, 123 | and thus the easier to do inference on when they are written 124 | in the causal direction, that is, when the networks flow from the causes to 125 | directions. 126 | 127 | Variable Elimination 128 | ----------------------- 129 | 130 | This is a technique with several steps. 131 | We try to reduce the network to the smaller networks. 132 | 133 | First step is joining factors. 134 | 135 | Joining Factors: 136 | 137 | - A factor is a multi dimensional matrix, it is a table of properties for a 138 | given node. 139 | - We choose two or more of the factors 140 | - We will combine them, and it will represent the joint probability of all the 141 | variables in that factor. 142 | 143 | For example if we have the tables for P(r) and P(t|r), by multiplying the values 144 | in the tables, we can have 145 | P(r,t) 146 | 147 | After this we do the second step, elimination or summing out, or marginalisation. 148 | 149 | Elimination: 150 | 151 | - We choose a variable that is **absent** in the next table. 152 | - We sum out the choosed variable based on the variable that is **present** in 153 | the next table. 154 | 155 | Example operation: 156 | 157 | 158 | +--------+-------------+ 159 | | P(T,L) | | 160 | +========+====+========+ 161 | | T | L | Values | 162 | +--------+----+--------+ 163 | | +t | +l | 0.051 | 164 | +========+====+========+ 165 | | +t | -l | 0.119 | 166 | +--------+----+--------+ 167 | | -t | +l | 0.083 | 168 | +========+====+========+ 169 | | -t | -l | 0.747 | 170 | +--------+----+--------+ 171 | 172 | P(+l) = 0,051 + 0,083 = 0,134 173 | P(-l) = 0,119 + 0,747 = 0,866 174 | 175 | Approximate Inference 176 | ======================= 177 | 178 | We use sampling. 179 | That is we stimulate the events in the nets. 180 | The more samples we have, the closer we approach to exact inference situation. 181 | 182 | The sampling have the following advantages: 183 | 184 | - We know a procedure for coming up with an at least approximate value for the 185 | joint probability distribution 186 | - If we don't know what the conditional probability tables are, but we could 187 | simulate the process, we could still proceed with the sampling 188 | We couldn't with exact inference. 189 | 190 | Sampling method is *consistent*, that is if we have infinite number of samples 191 | we can calculate the true joint probability. 
192 | 193 | For calculating conditional probabilities, we look at the samples that interest 194 | us. 195 | For example, 196 | If we have sample like P(+C,-S,+r,-l), and if we are looking for P(-C|+r), since 197 | our sampling is based on P(+C), we reject at, 198 | and keep the ones with P(-C) in it. This proceedure is called 199 | *rejection sampling*. 200 | 201 | The problem with rejection sampling is that if the evidence is unlikely, you end 202 | up rejecting a lot of samples. 203 | 204 | To fix that we use a technique called, *likelihood weighting* 205 | 206 | Likelihood Weighting 207 | ---------------------- 208 | 209 | This is a technique in which we fix the samples. 210 | For example, let's say when the burglary occurs, alarm goes of, so we have 211 | structure like B -> A. 212 | For a question like what is the probability of burglary occuring in the event 213 | that alarm goes off, P(B|+a) ? 214 | Following would happen. 215 | Since the burglaries are infrequent, during the sampling process, we would 216 | generate mostly P(-b, -a), and reject them 217 | because we are looking for P(B|+a) 218 | So we say, during the sampling process, that the value of "a" is positive from 219 | the start. 220 | This way we get to keep every sample, but this method is **inconsistent**, that 221 | is by having infinitely many samples obtained this way, we can not obtain the 222 | true joint probabilities. 223 | 224 | However that can be fixed by adjusting the weighting of the samples on the 225 | outcome. 226 | 227 | Example Calculation: 228 | 229 | Let's say we are trying to calculate P(R|+s,+c) 230 | 231 | for the following network 232 | 233 | Cloudy 234 | / \ 235 | Sprinkler Rain 236 | \ / 237 | Wet Grass 238 | 239 | We have 4 tables: 240 | 241 | +------+-----+------+ 242 | |P(S|C)| | 243 | +======+=====+======+ 244 | | +c | +s | 0,1 | 245 | +------+-----+------+ 246 | | | -s | 0,9 | 247 | +------+-----+------+ 248 | | -c | +s | 0,5 | 249 | +------+-----+------+ 250 | | | -s | 0,5 | 251 | +------+-----+------+ 252 | 253 | 254 | +------+-----+------+ 255 | |P(R|C)| | 256 | +======+=====+======+ 257 | | +c | +r | 0,8 | 258 | +------+-----+------+ 259 | | | -r | 0,2 | 260 | +------+-----+------+ 261 | | -c | +r | 0,2 | 262 | +------+-----+------+ 263 | | | -r | 0,8 | 264 | +------+-----+------+ 265 | 266 | +------+-----+ 267 | | P(C) | | 268 | +======+=====+ 269 | | +c | 0,5 | 270 | +------+-----+ 271 | | -c | 0,5 | 272 | +------+-----+ 273 | 274 | 275 | +------+-----+------+------+ 276 | |P(W|R,S) | 277 | +======+=====+======+======+ 278 | | +s | +r | +w | 0,99 | 279 | +------+-----+------+------+ 280 | | | | -w | 0,01 | 281 | +------+-----+------+------+ 282 | | | -r | +w | 0,90 | 283 | +------+-----+------+------+ 284 | | | | -w | 0,10 | 285 | +------+-----+------+------+ 286 | | -s | +r | +w | 0,90 | 287 | +------+-----+------+------+ 288 | | | | -w | 0,10 | 289 | +------+-----+------+------+ 290 | | | -r | +w | 0,01 | 291 | +------+-----+------+------+ 292 | | | | -w | 0,99 | 293 | +------+-----+------+------+ 294 | 295 | P(R|+s,+w) 296 | 297 | Let's say we have the following sample: 298 | 299 | P(+c, +s, +r, +w) what is its weight ? 
300 | 301 | Since the "s" and the "w" are constrained by the problem, 302 | we apply the product of their probabilities as weights, 303 | (P(+s|+r)=0,1 * P(+w|+r,+s)=0,99) = 0,099 304 | 305 | 306 | Gibbs Sampling 307 | ------------------ 308 | This technique uses a method called Markov Chain Monte Carlo (MCMC): 309 | we resample one variable at a time conditioned on all the other. 310 | The variables are dependent, and the technique is consistent. 311 | 312 | 313 | Monty hall problem 314 | ------------------ 315 | 316 | Three doors, one with goat, empty, car. 317 | These are boolean. 318 | Goat is given as + 319 | What is the 320 | There is one selected, and the goat is given thus, my selection is not goat. 321 | 322 | P(+c)|P(-c), P(+e)|P(-e), P(+g)|P(-g) 323 | 1/3 | 2/3, 1/3 | 2/3 1/3 | 2/3 324 | 325 | 326 | P(C,E,G) = 1/27 327 | P(+c,+e,+g) = 1/27 328 | P(+c,-e,+g) = 2/27 329 | P(-c,+e,+g) = 2/27 330 | P(-c,-e,+g) = 4/27 331 | P(+c,-e,-g) = 4/27 332 | P(+c,+e,-g) = 2/27 333 | P(-c,+e,-g) = 4/27 334 | P(+c,-e,-g) = 4/27 335 | P(-c,-e,-g) = 8/27 336 | 337 | P(+g), + P(C,E|-g), + P(C) = 1 338 | given, empty door, selection 339 | 2/3 1/3 340 | 341 | -------------------------------------------------------------------------------- /TopicNotes/Introduction_to_AI/CourseNotes.rst: -------------------------------------------------------------------------------- 1 | Introduction to ai 2 | ##################### 3 | 4 | What makes an ai is the capacity to react to changes. 5 | 6 | Edges embody the rules of action, whereas the nodes embody the states for a game playing agent. 7 | 8 | Intelligence should be defined within the context of a task. 9 | 10 | An agent interracts with the environment by sensing its properties. 11 | Actions change the state of environment. 12 | 13 | 14 | Types of Ai problems: 15 | ********************* 16 | 17 | AI problems are classified according to the properties of the environment states. 18 | 19 | Environment States: 20 | 21 | - It can be fully observable like a chess board. 22 | - It can be partially observable like "yazlıkçı okeyi" where you can see the boards of the oponents. 23 | - It can be deterministic 24 | - You know for sure the results of each action. 25 | - It can be stochastic: 26 | - You don't know for sure the results of each action. 27 | - It can be discrete: 28 | - There is a finite number of states the environment can be in. 29 | - It can be continous: 30 | - The number of possible states of the environment is infinite. 31 | - It can be benign: 32 | - The agent is the only one that is taking actions that intentionally effects its goal. 33 | - It can be adverserial: 34 | - There can be other agents that can take actions to defeat its goal. 35 | 36 | 37 | Intelligent agent: 38 | 39 | It is one that takes actions in order to maximise expected utilty given a desired goal. 40 | This is difficult to measure, so we use bounded optimality. 41 | 42 | 43 | 44 | Alpha-beta pruning can provide performance improvements when applied to expected max game trees that have finite limits on the values the evaluation function returns. 45 | 46 | Isolation game: 47 | You move on the board, diagonaly or columnwise or row-wise, you can't move into occupied or previously occupied squares, the goal is to be the last player moving. 48 | 49 | Mini-max Algorithm: 50 | We assume that the opponent tries to minimze our scores, and we always try to maximise our scores. 51 | Triangles pointing up, is the turns where we try to maximise our score. 
52 | Triangle pointing down, marks the turns where the oppononet is trying to minimse our scores. 53 | 54 | The algorithm is as follows: 55 | 56 | - For each max node, pick the maximum value among its child nodes, if there is at least one plus one child, the computer can always pick that to win. Otherwise it can never win. At each mid node we chose the minimum value to represent the opponent. 57 | 58 | 59 | Supplemented Reading: 60 | Norvig, Peter, and Stuart Russell, Artificial Intelligence: A Modern Approach, 3rd edn (New York, 2010), sections 5.1-2. 61 | 62 | Summary: 63 | 64 | Games are for the most part competitive environments, that is agent's interests are in conflict with each other. 65 | This gives rise to adversarial search problems. 66 | The difference between the mathematical game theory and its application in AI: 67 | 68 | For mathematical game theory, every multi agent environment in which the agents act upon each other is *significant* is considered as a game environment, they are labeled as economies. 69 | In AI, the most common games are zero-sum games of perfect information (chess, go, etc), that is: "this means deterministic, fully observable environments in which two agents act alternately and in which the utility values at the end of the game are always equal and opposite." 70 | 71 | The formal definition of a game: 72 | 73 | :math:`S_0` 74 | The initial state, which specifies how the game is set up at the start. 75 | 76 | PLAYER (s) 77 | Defines which player has the move in a state. 78 | 79 | ACTIONS(s) 80 | Returns the set of legal moves in a state. 81 | 82 | RESULT(s, a) 83 | The transition model, which defines the result of a move. 84 | 85 | TERMINAL - TEST (s) 86 | A terminal test, which is true when the game is over and false 87 | otherwise. States where the game has ended are called terminal states. 88 | 89 | UTILITY (s, p): 90 | A utility function (also called an objective function or payoff function), defines the final numeric value for a game that ends in terminal state s for a player p. In chess, the outcome is a win, loss, or draw, with values +1, 0, or 12 . Some games have a wider variety of possible outcomes; the payoffs in backgammon range from 0 to +192. 91 | 92 | A **zero-sum** game is (confusingly) defined as one where the total payoff to all players is the same for every instance of the game. 93 | Chess is zero-sum because every game has payoff of either 0 + 1, 1 + 0 or 12 + 12 . 94 | *Constant-sum* would have been a better term, but zero-sum is traditional and makes sense if you imagine each player is charged an entry fee of :math:`{\frac{1}{2}}` 95 | 96 | 97 | The relationship between the iterative deepening and the visited tree nodes is the following: 98 | 99 | The number of nodes visited during the iterative deepening is the cumulative sum of all the visited nodes up to that level. 100 | 101 | Visited tree node: 1 102 | Iterative tree node: 1 103 | Visited tree node: 4 104 | Iterative tree node: 5 105 | Visited tree node: 13 106 | Iterative tree node: 18 107 | Visited tree node: 40 108 | Iterative tree node: 58 109 | Visited tree node: 121 110 | Iterative tree node: 179 111 | 112 | The general formula for the iterative deepening is: 113 | 114 | .. math:: 115 | 116 | b = K 117 | 118 | n = {\frac{K^{d+1} - 1}{K - 1}} 119 | 120 | b 121 | The branching factor. Gives how much the tree would branch out at each level. 122 | 123 | n 124 | The number of nodes in the tree. 125 | 126 | d 127 | The level indicator. Indicates the level of the tree. 
128 | 129 | 130 | Evaluation function in the case of isolation game: 131 | 132 | With an evaluation function like simply counting the number of moves our computer player 133 | has available at a given node, the player would select branches in the mini-max tree that 134 | lead to spaces where player has most options. 135 | 136 | Good evaluation function for isolation is 137 | number_of_my_moves - number_of_opponent_moves 138 | 139 | It promotes the boards in which i have more moves than my opponents, and penalises the boards in which the opponent has more moves. 140 | 141 | .. code-block:: python 142 | 143 | # Here is a pseudo-code for 144 | # Minimax algorithm with alphabeta pruning 145 | minimax(root) = max(min(3, 12, 8), min(2, x, y), min(14, 5, 2)) 146 | minimax(root) = max(3, min(2,x,y), 2) 147 | z = min(2,x,y) 148 | # Then z <= 2 149 | minimax(root) = max(3,z,2) 150 | minimax(root) = 3 151 | 152 | 153 | Expectimax alpha-beta pruning 154 | ***************************** 155 | 156 | This expectimax algorithm works in the case where we don't really know what the outcome will be. 157 | That is we have several different outcomes with different probabilities. 158 | Expectimax uses the minimax tree which represents all the possible outcomes in the game. 159 | Each branch has a probability, and to propagate a value to a superior node, we take the values and multiply them with the probabilities 160 | assigned to their branches, then depending on the type of branch, that is min or max branch 161 | 162 | An Example Tree: 163 | 164 | Max: 165 | node 1: ; node 2: ; node 3: 166 | 167 | Min: 168 | 169 | Node1B1 Probability: 0.1 170 | Node1B2 Probability: 0.5 171 | Node1B3 Probability: 0.4 172 | 173 | Node2B1 Probability: 0.5 174 | Node2B2 Probability: 0.5 175 | 176 | Node3B1 Probability: 0.5 177 | Node3B2 Probability: 0.1 178 | Node3B3 Probability: 0.4 179 | 180 | Node1B1Leaf1: -4 181 | Node1B1Leaf2: -5 182 | Node1B2Leaf1: 5 183 | Node1B2Leaf2: 6 184 | Node1B3Leaf1: 8 185 | Node1B3Leaf1: 10 186 | 187 | Node2B1Leaf1: 0 188 | Node2B1Leaf2: 10 189 | Node2B2Leaf1: 10 190 | Node2B2Leaf2: 5 191 | 192 | Node3B1Leaf1: -7 193 | Node3B1Leaf2: -3 194 | Node3B2Leaf1: 9 195 | Node3B2Leaf2: 10 196 | Node3B3Leaf1: 2 197 | Node3B3Leaf1: 5 198 | 199 | An Example Calculation without pruning: 200 | 201 | Min: 202 | 203 | Node1: :math:`-5{\times}0.1 + 0.5{\times}5 + 8{\times}0.4 =5.2` 204 | 205 | Node2: :math:`0{\times}0.5 + 5{\times}0.5 = 2.5` 206 | 207 | Node3: :math:`-7{\times}0.5 + 9{\times}0.1 + 2{\times}0.4 = 1.8` 208 | 209 | 210 | Max: 211 | 212 | Node1: 5.2 213 | 214 | An Example Calculation with Pruning: 215 | 216 | Min: 217 | 218 | Node1: :math:`-5{\times}0.1 + 0.5{\times}5 + 8{\times}0.4 =5.2` 219 | 220 | Node2: if Node2B1Leaf1 is 0, then we can prune all the other branches in the middle branch. 221 | Why ? 222 | 223 | Because we know that the values range from -10 to 10, and that the top branch is 5.2, meaning that we assume 224 | Node2 is either equal to or less than 5.2, and the probability distribution is 0.5. This means that even if all 225 | of the rest of the branches yield 10, the final outcome for the branch would be less than or equal to 5. Thus 226 | they would not change the value of the top branch. 227 | 228 | Then 229 | Node2: <= 5 230 | 231 | Node3: if Node3B1Leaf1 is -7 then we can safely ignore everything else. Because as in the above branch, even if 232 | every branch would yield 10, their sum would not be greater than 5.2. 
233 | 234 | Then 235 | Node3: <= 1.5 236 | 237 | 1.5 comes from the fact that we evaluate as if the other branches yielded 10, and multiply it with the probabilities associated with the 238 | branches and then sum everything up. 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | Terms: 248 | ****** 249 | 250 | Heuristic 251 | Some additional piece of information - a rule, constraint, function - that informs an otherwise brute-force algorithm to act in a more optimal manner 252 | 253 | Pruning 254 | Pruning the search tree is to evaluate certain branches of the search tree by easy to check conditions. It allows us to ignore portions of the search tree that make no difference to the final choice 255 | 256 | Mini-Max Algorithm 257 | You are trying to maximize your chances of winning on your turn, and your opponent is trying to minimize your chances of winning on his turn. What we do is we map every possible move in the game, and map them to min or max states. Then depending on the type of the state we either take the maximum value indicated in the child nodes or the minimum value indicated in the child notes. 258 | 259 | Depth-Limited Search 260 | Going as far as we can to safely meet our deadline. 261 | 262 | Horizon Effect 263 | When it is obvious to human player that the game will be decided in the next move, but the depth of search doesn't let the computer player to see this happening. This means that some critical moves are not noticed by the machine. These critical moves occur because of the shape of the environment, that is because of a property of environment. Our agent nodes how to prioritise the nodes, but that doesn't mean that it is aware of how the nodes are situated in relation to each other. 264 | 265 | Evaluation Function 266 | It evaluates the possible moves so that the minimax driven agent can choose a move. 267 | 268 | Alpha-Beta Prunning 269 | Alpha-beta pruning works as follows. If a tree is on the max turn, upon viewing the first branch, we know that if a value is to be propagated to the superior branch other than the first one, it has to be bigger than the value of the first branch, thus we can ignore the branches that comes up with a lower or equal value. If a tree is on the min turn, upon viewing the first branch, we know that if a value is to be propagated to the superior branch other than the first branch value, then it has to be smaller than the value of the first branch, thus we can ignore the branches that comes up with a superior or equal value. 270 | α 271 | the value of the best (i.e., highest-value) choice we have found so far at any choice point along the path for MAX. 272 | β 273 | the value of the best (i.e., lowest-value) choice we have found so far at any choice point along the path for MIN. 274 | 275 | Immediate Pruning 276 | Immediate pruning happens when a branch fulfills the necessay conditions to be designated as the node's value immediately. For example, let's say the condition is that the sum of the numbers inside a node has to be equal to 10. If the first branch is 10 we don't need to look at the other values. 277 | 278 | -------------------------------------------------------------------------------- /MathNotes/DifferentialIntegralCalculus.rst: -------------------------------------------------------------------------------- 1 | ##################### 2 | Differential Calculus 3 | ##################### 4 | 5 | Inifinite Numeric Sequence 6 | =========================== 7 | 8 | Tarasov, L. V. Calculus: Basic Concepts for High Schools. 
Moscow; Chicago: Mir Publishers ; [Distributed by] Imported Publications, 1982. 9 | 10 | An infinite numeric sequence exists if every natural number (position number) 11 | is unambiguously placed in correspondance with a definite number (an element 12 | of a sequence) by a specific rule. 13 | For example: 14 | 15 | - 3 5 7 9 11 16 | 17 | - 1 2 3 4 5 where the law is x + 2 so that 18 | 19 | - x1 + 2 = 5 = x2, x2 + 2 = 7 = x3, the rule is than x_n = x_{n-1} + 2 20 | 21 | Let's look at a different sequence: 22 | 23 | 1, 1/2, 3, 1/4, 5, 1/6, ... 24 | 25 | The rule is for y_n = { if n = 2k - 1, then n, else if n = 2k then 1/n } 26 | 27 | We can also express this as y_n = a_n \times n + b_n \times 1/n, and say 28 | a_n = 0 if n = 2k and 1 if n = 2k - 1 and b_n = 0 if n = 2k - 1 and 1 for 29 | n = 2k 30 | 31 | A sequence is *nondecreasing* if 32 | y_1 <= y_2 <= y_3 <= ... 33 | 34 | A sequence is *nonincreasing* if 35 | y_1 >= y_2 >= y_3 >= ... 36 | 37 | A nonincreasing or nondecreasing sequences are *monotonic* sequences 38 | 39 | A sequence is bounded if there are two numbers A and B which encloses 40 | all the terms of a sequence so that 41 | A <= y_n <= B 42 | 43 | If one of the numbers is absent the sequence is unbounded. 44 | 45 | There is a somewhat puzzling case for bounded infinite sequences that are 46 | monotonic. For example, 0, 0.1, 0.11, 0.111, 0.111 and so on whose lower bound 47 | and upper bound is 0 and 1. However the problem is that the condition that our 48 | incrementation is decreasing does not impose the upper bound. 49 | 50 | The calculus, in the sense we use in modern mathematics, starts with the 51 | notion that there can be monotonic sequences that are bounded. This is ensured 52 | with the concept of *limit*. If a sequence is monotonic and bounded it 53 | necessarily has a limit. 54 | 55 | 56 | Limits 57 | ======= 58 | 59 | Here is the definition of a limit: 60 | 61 | The number a is said to be the limit of sequence y_n if for any positive 62 | number \lambda there is a real number N such that for all n > N the following 63 | inequality holds: 64 | 65 | |y_n - a| < \lambda 66 | 67 | I have a sequence y with a as a limit. When I subtract the limit from any 68 | value of the sequence, the absolute value of this operation should be smaller 69 | than the number \lambda. When n is bigger than N, the difference between the 70 | limit number and the sequence element is smaller than the lambda. 71 | 72 | Let's examine this definition. First of all for any positive \lambda there is a 73 | number N. There is also another number "n". When "n" is bigger than N there is 74 | an inequality. Now \lambda is any positive number. There are then infinite values 75 | that are possible for \lambda. Now the length of the sequence y_n is also infinite 76 | thus we have an infinite number of possible "n" values as well. 77 | 78 | 79 | What are limits ? 80 | ------------------- 81 | 82 | 83 | Differentials 84 | ============== 85 | 86 | Slope of the line. 87 | What is the difference between the current instant and the instant before 88 | 89 | 90 | Formal definition: 91 | 92 | - :math:`f^{(1)}(p) = \lim_{h \to 0} \frac{f(p+h) - f(p)}{h}` 93 | 94 | Simple rules: 95 | 96 | Power rule: 97 | ----------- 98 | take the exponent, move it down, 99 | multiply it with the coefficient—it’s the new coefficient. 100 | Now take one from the exponent. 101 | This is the new exponent. 
102 | (This is for integer coefficient polynomials, as you can see, this is painfully limited) 103 | 2x^7, we have the derivative 14x^6. 104 | Simple enough? Another exercise: x^2. 105 | The derivative is 2x. 106 | 107 | Addition and Subtraction 108 | ------------------------- 109 | 110 | The derivative of f(x)+g(x) is just f'(x)+g'(x). That is we sum the derivatives. 111 | Changing +g(x) to -g(x) gives us the subtraction rule. 112 | 113 | 114 | Product Rule 115 | ------------- 116 | 117 | The derivative of f(x)*G(x) is f'(x)*G(x)+f(x)*G'(x). 118 | That is derivative of f(x) times the G(x) plus 119 | f(x) times the derivative of G(x). 120 | 121 | Here is its proof: 122 | 123 | .. math:: 124 | 125 | p(x) = f(x) \times g(x) 126 | p'(x) = \lim_{h \to 0} \frac{p(x+h) - p(x)}{h} 127 | v = f(x) \times g(x+h) 128 | p'(x) = \lim_{h \to 0} \frac{p(x+h) - p(x) + v - v}{h} 129 | p'(x) = \lim_{h \to 0} \frac{(f(x+h) \times g(x+h)) - (f(x) \times g(x)) + v - v}{h} 130 | p'(x) = \lim_{h \to 0} \frac{(f(x+h) \times g(x+h)) - v + v - (f(x) \times g(x))}{h} 131 | p'(x) = \lim_{h \to 0} \frac{(f(x+h) \times g(x+h)) - (f(x) \times g(x+h)) + v - (f(x) \times g(x))}{h} 132 | p'(x) = \lim_{h \to 0} \frac{ ((f(x+h) - f(x)) \times g(x+h)) + v - (f(x) \times g(x))}{h} 133 | p'(x) = \lim_{h \to 0} \frac{ ((f(x+h) - f(x)) \times g(x+h)) + f(x) \times (g(x+h) - g(x))}{h} 134 | p'(x) = f'(x) \times \lim_{h \to 0} g(x+h)) + f(x) \times g'(x) 135 | p'(x) = f'(x) \times g(x) + f(x) \times g'(x) 136 | 137 | We replace :math:`\lim_{h \to 0} g(x+h)` with :math:`g(x)` since *h* is very 138 | very close to 0, and we know that since *g(x)* is a differentiable function it 139 | must be continuous. 140 | 141 | Division Rule 142 | -------------- 143 | 144 | The derivative of f(x)/g(x) is [f'(x)g(x)-f(x)g'(x)]/g(x)^2. 145 | That is the derivative of f(x) times the g(x) minus 146 | the f(x) times the derivative of g(x) divided by the g(x) squared 147 | 148 | Chain Rule 149 | ----------- 150 | 151 | The derivative of f(g(x)) is 152 | f'(g(x))*g'(x). 153 | 154 | Or, it is equal to the derivative of the outer function 155 | evaluated at the inner functions times the derivative of the inner function. 156 | 157 | 158 | Integrals 159 | ========== 160 | 161 | The area under the curve. 162 | 163 | Riemanns sum and Approximating a Definite Integral 164 | --------------------------------------------------- 165 | 166 | The formula 167 | :math:`Sum={{\sum}^{n}_{i=1} f(x^{*}_i)(x_i - x_{i-1}}` 168 | 169 | Essentially it means, I am trying to sum infinitely small bits 170 | 171 | Here is the catch: 172 | I am trying to approximate the given area by using n number of rectangles. 173 | Let's try to approximate the area under the function f(t) by using 5 rectangles 174 | 175 | :math:`{\int}f(t)dt {\approx} height_1 {\times}width +h_2w +h_3w+h_4w+h_5w` 176 | 177 | Instead of using 5 rectangles we can use n rectangles to have a better/worse 178 | approximation 179 | 180 | :math:`{\int}f(t)dt {\approx} height_1 {\times}width +h_2w +h_3w+h_4w+...+h_{n}w` 181 | 182 | Let's factor out the width since its constant 183 | 184 | $\int_i^a$ 185 | 186 | :math:`{\int}f(t)dt {\approx} w{\times}(h_1 +h_2 +h_3+h_4+...+h_{n})` 187 | 188 | We can express the sum in the parantheses with the sum notation as well 189 | 190 | :math:`{\int}f(t)dt {\approx} w{\times}{\sum^{n}_{i=1}}(h_i)` 191 | 192 | The value of the width of the rectangle is simply the difference between range 193 | of the function distributed to the range of the approximation. 
194 | So if the integral is taken over the interval [a,b] as in :math:`{\int}_{a}^{b}` 195 | Then the width for the n number of rectangles would be: 196 | :math:`width={\frac{b-a}{n}}` 197 | 198 | $\sum_i$ 199 | 200 | The height is simply the according y value of the x with respect to f(x). 201 | That is the meaning behind the notation :math:`f(x_{i}^{*}` 202 | 203 | So final form of the equation is the following 204 | 205 | .. math:: 206 | 207 | `{\int}_{a}^{b}f(t)dt{\approx}{\frac{b-a}{n}}{\times}{\sum^{n}_{i=1}}f(x_{i}^{*}` 208 | 209 | Now the part f(t) of the integral side should be rather obvious, 210 | the height of the rectangle, and we have seen that the a and b are 211 | related to the range of the function, then 212 | 213 | what's up with dx ? 214 | Well simply put dx is what happens when delta(x), that is x_i - x_{i-1}, approaches 215 | to the 0. So dx is the difference when the difference between any point in x axes, 216 | in the range of f(x) becomes very very very very very very very close to 0 217 | 218 | Now, when you have a quantity whose value is virtually zero, there's not much 219 | you can do with it. 2+dx is pretty much, well, 2. Or to take another example, 220 | 2/dx blows up to infinity. 221 | But there are two circumstances under which terms involving dx can yield a 222 | finite number. One is when you divide two differentials; for instance, 2dx/dx=2, 223 | and dy/dx can be just about anything. 224 | 225 | 226 | Line Integrals 227 | --------------- 228 | 229 | Sum of infinitely small areas under the curve within the range of f(x,y) 230 | 231 | This is multivariate calclulus and it is a slight generalization of what we had 232 | seen above in the definite integrals 233 | 234 | Now a normal integral is: 235 | 236 | - :math:`{\int}_{a}^{b}f(t)dt` where 237 | 238 | - :math:`\int` means sum 239 | - a is the lower range 240 | - b is the upper range 241 | - dt is the difference between t_i and t_{i-} when it is infinitely small 242 | 243 | A line integral is: 244 | 245 | - :math:`{\int}_{a}^{b}f(x,y)ds` 246 | 247 | Now let's see how we arrive to this: 248 | 249 | We have a function k(x) which is defined on a coordinate plane xy. 250 | The function maps the value of x to a value of y in the coordinate plane 251 | 252 | Now f(x,y) does the exact same thing in form. It takes the value of x and y 253 | and maps it to another value in third dimension let's say z for example. 254 | f(x,y) = z 255 | This means that we have now a third dimension z, to which our function f(x,y) 256 | maps to, so our plane now has three axis xyz 257 | 258 | What about ds ? It is actually the same as saying dz, that is the difference 259 | between z_i and z_{i-1} as it approaches to zero 260 | 261 | How does all this relate to our k(x) ? 262 | 263 | This is the tricky part 264 | 265 | Now let's say c(x) = y and g(y) = x 266 | then when x=t, y=c(t), and y=t, x=g(t) 267 | 268 | So given that a <= t <= b 269 | f(x,y) can be written as f(g(t), c(t)) 270 | 271 | So we can rewrite our line integral as follows: 272 | 273 | - :math:`{\int}_{a}^{b}f(g(t),c(t))ds` 274 | 275 | Now ds can actually be expressed in forms of dy and dx. 276 | Because simply put infinitely small change in the curve k(x) is going to result from 277 | infinitely small change in x direction and infinitely small change in y direction. 278 | Notice that all three measures are distance measures. 
279 | Let's break it down this way: 280 | dx = x_i - x_q 281 | dy = k(x_i) - k(x_q) 282 | ds = (x_i, k(x_i)) - (x_q, k(x_q)) 283 | 284 | Now the distance between two points are calculated with pythagoras theorem 285 | :math:`\sqrt{a^2 + b^2}` 286 | 287 | We plug in our points to pythagoras theorem 288 | 289 | :math:`\sqrt{(x_i - x_q)^2 + (k(x_i) - k(x_q))^2}` 290 | 291 | Based on the above mentioned equivalency this simply transforms to 292 | 293 | :math:`\sqrt{(dx)^2 + (dy)^2}` 294 | 295 | Then we can rewrite our line integral as follows 296 | 297 | - :math:`{\int}_{a}^{b}f(g(t),c(t)){\times}{\sqrt{(dx)^2 + (dy)^2}}` 298 | 299 | Now the problem is our point functions are all defined in t but our ds is expressed 300 | in dx and dy, how do we transform it 301 | 302 | Well let's suppose we multiplied the ds with dt/dt which 1, since we divide to equal 303 | quantities, so: 304 | 305 | - :math:`ds={\sqrt{(dx)^2 + (dy)^2}}{\times}{\frac{dt}{dt}}` 306 | 307 | If we reformulate the expression a bit 308 | 309 | - :math:`ds={\sqrt{ {\frac{1}{(dt)^2}} {\times} ((dx)^2 + (dy)^2) } }{\times}{{dt}}` 310 | 311 | - We simply put the 1/dt of the dt/dt expression inside the square root 312 | 313 | I continue this line of progress and distribute the 1/dt over the variables 314 | 315 | - :math:`ds={\sqrt{ {\frac{1}{(dt)^2}}{\times}(dx)^2 + {\frac{1}{(dt)^2}}{\times}(dy)^2) } }{\times}{{dt}}` 316 | 317 | The expression inside can be simplified as the following: 318 | 319 | - :math:`ds={\sqrt{ {\frac{(dx)^2}{(dt)^2}} + {\frac{(dy)^2}{(dt)^2}} } }{\times}{{dt}}` 320 | 321 | And by using simple properties of multiplication on fractions we can have the following: 322 | 323 | - :math:`ds={\sqrt{ ({\frac{dx}{dt}})^2 + ({\frac{dy}{dt}})^2 } }{\times}{{dt}}` 324 | 325 | Now notice that dy/dt and dx/dt are actually derivatives of c(t) and g(t) respectively 326 | that is they are c'(t)=dy/dt and g'(t)=dx/dt 327 | So the final form of our equation would be: 328 | 329 | - :math:`ds={\sqrt{ (g'(t))^2 + (c'(t))^2 } }{\times}{{dt}}` 330 | 331 | -------------------------------------------------------------------------------- /TopicNotes/RecurrentNeuralNetworks/courseNotes.rst: -------------------------------------------------------------------------------- 1 | ########################## 2 | Recurrent Neural Networks 3 | ########################## 4 | 5 | When to use RNNs ? 6 | 7 | RNNs are good with time series. 8 | Predicting stock market prices, 9 | speech recognition, etc. 10 | 11 | CNNs are good for images and videos, RNNs are good for sequential data analysis 12 | like financial time series. 13 | 14 | RNNs are used for speech recognition and machine translation. 15 | 16 | RNNs deal with ordered sequences 17 | 18 | It can be used to autocomplete stuff. Like you have a phrase: 19 | 20 | "My cat is the" and then the trained algorithm can fill the phrase 21 | as "My cat is the **best**" 22 | 23 | In this case, 24 | the input is the ordered sequence of words or characters 25 | the output is ordered sequence of characters. 26 | 27 | Machine translation: 28 | the input is ordered sequence of words in language X 29 | the output is ordered sequence of words in language Y 30 | 31 | Vanilla supervised learners and structured input 32 | ------------------------------------------------- 33 | 34 | Vanilla supervised learnes, like feedforward networks take generic input and 35 | spit out a structured input. 36 | They are basically function estimators. 37 | Their key problem is that they do not consider the input structure at all. 
38 | 39 | Now CNNs were a change to that since they detected the patterns, that is they 40 | take into consideration the spatial corelation between the pixels in an image. 41 | 42 | RNNs are designed to exploit ordered sequential structures. Basically time 43 | series are a good subject for these. 44 | 45 | RNN Notations 46 | -------------- 47 | 48 | What is ordered sequence ? 49 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 50 | 51 | Ordered sequence is a list of values ordered by an index. 52 | Index can be a list of numbers, or timestamps, or anything really. 53 | 54 | How to model ordered sequences ? 55 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 56 | 57 | Most ordered sequences are a product of some underlying process or processes, 58 | but these underlying processes can be blackboxes, so we have no model who 59 | explains the data. 60 | 61 | In this case we can model the process recursively, that is we can use past 62 | values to predict the future ones, meaning we can try to determine how 63 | future values mathematically dependent on the preceeding ones. 64 | 65 | What are seeds ? 66 | ~~~~~~~~~~~~~~~~~ 67 | 68 | Let's say we are dealing with the set of odd numbers: 69 | 70 | :math:`[1,3,5,7, ..., ]` 71 | 72 | First value is :math:`S_1=1`. 73 | That is it is a given. 74 | However all the other values can be generated from the first one by adding 2. 75 | This means that odd numbers set is a recursive sequence. 76 | That is the rest of the sequence can be generated by a model from one or more 77 | original values. The original values need to be given. 78 | 79 | These initialized/given values in the case of a recursive sequence, are called 80 | seeds 81 | 82 | What is the order of a recursive sequence ? 83 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 84 | 85 | The order of a recursive sequence is the number of prior elements it uses to 86 | produce future ones 87 | 88 | For example, the odd number set has the order of 1. Fibonacci sequence has the 89 | order of 2. 90 | 91 | Unfolded view 92 | ~~~~~~~~~~~~~~ 93 | 94 | The odd numbers set can actually be described by using the following function. 95 | 96 | :math:`f(s) = s + 2` 97 | 98 | or as follows: 99 | 100 | s_1 --f--> s_2 --f--> s_3 --f--> s_4 --f--> ... 101 | 102 | The name unfolded view applies both of these representations, where we show 103 | explicitely how each element of the sequence is generated. 104 | 105 | Folded view 106 | ~~~~~~~~~~~~ 107 | 108 | The odd numbers set can actually be described by using the following notation. 109 | 110 | :math:`s_t = f(s_{t-1})` 111 | 112 | Important note: A recursive sequence simply applies the same function to its 113 | output to generate the new input 114 | 115 | How to drive recursive sequence 116 | --------------------------------- 117 | 118 | Let's try to model how much money we have at the end of each month: 119 | 120 | I need to know: 121 | 122 | - The current value of my savings = h_1 123 | - Savings of each month would be = h_t for the month t 124 | 125 | What drives my account ? 126 | 127 | - Rent we pay, car lease, food shopping etc, drive it down. 128 | - Paycheck for the job, small investments etc, drive it up. 129 | 130 | - s_t = is my loss or income during the month, so a sum of downers and uppers 131 | 132 | Simple model for monthly savings would be: 133 | 134 | - h_1 = 0 135 | - h_2 = h_1 + s_1 136 | 137 | The folded view of the model would be: 138 | 139 | - h_t = h_{t-1} + s_{t-1} 140 | - where t = 2,3, ... 
141 | 142 | Driver sequence is what moves everything, here it corresponds to monthly income 143 | or loss. 144 | The hidden sequence is that which is generated by the driver sequence, here 145 | corresponds to h_t 146 | 147 | Injecting recursivity to supervised learners 148 | ---------------------------------------------- 149 | 150 | Suppose we have a set [1,3,5,7,9,11,13,15], we only have this set and nothing 151 | else. How do we arrive at the function f(s_t) = s_{t-1}+2 152 | We can't use sheer guessing. We need to find a function that generates the 153 | sequence. 154 | 155 | We can use a parametrized function and learn weights by fitting: 156 | 157 | s_1 = 1 158 | s_2 = g(s_1), where g(s_t) = w_0 + w_1{\times}s_{t-1} 159 | s_3 = g(s_2) 160 | 161 | Now we don't know the weights for w_0 and w_1. We simply implement least squares 162 | cost function to find that. 163 | 164 | For example: 165 | 166 | least square for s_2 is (s_2 - g(s_1))^2 167 | least square for s_3 is (s_3 - g(s_2))^2, etc. 168 | 169 | We then try to minimize the sum of these least squares. 170 | 171 | Formally :math:`min({\sum_{t}^{N}(s_t - (w_0 + w_1{\times}s_{t-1}))^2})` 172 | 173 | We are simply performing regression here. 174 | 175 | Windowing a sequence 176 | -------------------- 177 | 178 | During the above formula we have looked at pairs. For example 179 | (s_1,s_2), (s_2, s_3), etc. The reason we chose pairs was related to 180 | the formula we adopted, g(s_t) = w_0 + w_1{\times}s_{t-1}. These pairs 181 | are called windows, that is the parameters involved in a given iteration 182 | of the formula of the recursive sequence, establishes the window of that 183 | iteration. 184 | 185 | Keras and fitting 186 | ------------------ 187 | 188 | The above given model can be defined as follows in Keras: 189 | 190 | .. code-block:: python3 191 | 192 | model = Sequential() 193 | layer = Dense(1, input_dim=1, activation="linear") 194 | model.add(layer) 195 | model.compile(loss="mean_squared_error", optimizer="adam") 196 | model.fit(x,y, epocha=3000, batch_size=3, 197 | callbacks=callbacks_list, verbose=0) 198 | 199 | The point here is the following: 200 | 201 | Given an ordered sequence, we can make a recursive approximation to it by first 202 | making a random guess about the architecture of its recursive formula, then 203 | tuning the parameters of that architecture optimally using the sequence itself. 204 | 205 | 206 | We could have used the model in the traditional manner: Use a training set, 207 | validation set and test set, etc. So we can make predictions based on the 208 | training set. 209 | 210 | Thus we can actually use these networks as generative models. 211 | 212 | General Notes Recursivity in Supervised Learning 213 | ------------------------------------------------ 214 | 215 | There might be several architectures that model the recursive sequence which 216 | generated from a single recursive function. 217 | 218 | The sequence itself might not be recursive but we might find a recursive 219 | function that fits the sequence at hand, that is we can find the best 220 | recursive approximation to the given sequence. 221 | 222 | 223 | Why RNN instead Feedforward neural networks 224 | -------------------------------------------- 225 | 226 | The cost function we are using in the feed forward neural networks is Least 227 | Squares and from the perspective of probablilities this means that, we assume 228 | that the probability distribution is identically distributed among the 229 | independent elements of the set. 
General Notes on Recursivity in Supervised Learning
----------------------------------------------------

There might be several architectures that model a recursive sequence
generated by a single recursive function.

The sequence itself might not be recursive, but we might still find a
recursive function that fits the sequence at hand; that is, we can find the
best recursive approximation to the given sequence.

Why RNNs instead of feedforward neural networks
------------------------------------------------

The cost function we use in feedforward neural networks is least squares,
and from the perspective of probabilities this means we assume the elements
of the set are independent and identically distributed.

However, this contradicts the fact that we are assuming these elements are
part of an ordered sequence, even a recursive one: we act as if they exhibit
no structure in their distribution over the set.

Basically, if we were dealing with the odd numbers, we would be assuming that
3 and 5 are independent of each other, which is misleading, because if I
changed 3, the values of all the following elements in the sequence would
also change.

Feedforward networks structure their data as points with no edges between
them; however, we *know* that there are edges. That is the whole point of
referring to them as recursive sequences.

Simply put, feedforward neural networks use recursive approximation:

- They try to model dependency, hence the recursiveness.
- But when they tune the parameters, they treat the elements as independent,
  since the least squares assumption says they are.

Basic RNN derivation
=====================

With feedforward neural networks we modelled the recursivity but lost the
dependence while tuning the parameters.

RNNs are there to enforce the dependency between the levels. The way to do
this is to ensure that each level ingests its predecessor functionally. It
does so by taking two arguments: the sequence element and the hidden state.

Here is an unfolded view of a simple RNN architecture:

.. math::

    s_1 &\approx \hat{s}_1 = \alpha \\
    s_2 &\approx \hat{s}_2 = g(\hat{s}_1, s_1) \\
    s_3 &\approx \hat{s}_3 = g(\hat{s}_2, s_2) \\
    s_4 &\approx \hat{s}_4 = g(\hat{s}_3, s_3) \\
        &\vdots \\
    s_t &\approx \hat{s}_t = g(\hat{s}_{t-1}, s_{t-1})

Notation:

- :math:`s_t`, where :math:`t = 1, 2, 3, \dots`, are the elements of our
  recursive sequence.
- :math:`\hat{s}_t`, where :math:`t = 1, 2, 3, \dots`, are the hidden states
  driven by the sequence elements; they are the ones calculated from the
  sequence elements.
- :math:`\alpha` is the seed value.

The dependency is much more explicit here, since each hidden state depends
functionally on the preceding state.

Modifying the least squares loss
---------------------------------

The idea stays the same: we simply plug our functionally dependent levels
into the least squares loss.

For example:

- the squared error for :math:`s_2` is :math:`(s_2 - g(\hat{s}_1, s_1))^2`
- the squared error for :math:`s_3` is :math:`(s_3 - g(\hat{s}_2, s_2))^2`,
  etc.

In most cases we also wrap the output of :math:`g` in a linear combination as
the second term of the subtraction. This is heavily used in the case of RNNs,
but there is no formal justification for it. It modifies the loss function in
the following way:

.. math::

    \min\left(
    \sum_{t=2}^{N} \left(s_t - (w_0 + w_1 g(\hat{s}_{t-1}, s_{t-1}))\right)^2
    \right)
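Before the Keras version, here is a small sketch of how the unfolded
recursion and this loss could be computed by hand; the update :math:`g` and
its weights are made-up placeholders, not fitted values.

.. code-block:: python3

    import numpy as np

    def g(h_prev, s_prev, wh=0.5, ws=0.5, b=1.0):
        # hypothetical update mixing the hidden state and the sequence
        # element; wh, ws, b are placeholder weights, not fitted values
        return wh * h_prev + ws * s_prev + b

    seq = np.array([1, 3, 5, 7, 9], dtype=float)
    h = seq[0]      # the seed alpha for the hidden sequence
    loss = 0.0
    for t in range(1, len(seq)):
        h = g(h, seq[t - 1])      # \hat{s}_t = g(\hat{s}_{t-1}, s_{t-1})
        loss += (seq[t] - h) ** 2
    print(loss)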
Keras and RNN
---------------

The above model can be defined as follows in Keras:

.. code-block:: python3

    from keras.models import Sequential
    from keras.layers import Dense, SimpleRNN

    model = Sequential()
    # 3 hidden units; windows of 2 time steps with 1 feature each
    model.add(SimpleRNN(3, input_shape=(2, 1), activation="relu"))
    model.add(Dense(1))
    # "adam" is assumed here; any gradient-based optimizer would do
    model.compile(loss="mean_squared_error", optimizer="adam")

RNN and memory
----------------

Every level of an RNN contains the complete history of every sequence value
that precedes it. This is why RNNs have memory.

Technical notes
----------------

- RNNs need large datasets to function well.
- The vanishing gradient problem applies to RNNs as well.
- There are different level architectures, for example Long Short-Term
  Memory (LSTM).
- There are variations on stochastic gradient descent.
- A longer sequence means a deeper network, and a larger window as well.

Implementing a Character-wise RNN
----------------------------------

A character-wise RNN means that the network learns one character at a time
and generates new characters one character at a time.

Basically, the network takes a character as input and outputs a probability
distribution over the likely next character.

Sequence Batching
-----------------

We are working on sequences of data; by taking a sequence and splitting it
into multiple shorter sequences, we can take advantage of matrix operations
to make training more efficient.

- batch size: the number of shorter sequences we are using
- batch length: the length of the sequences we feed into the network
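A sketch of one common way to build such batches follows; wrapping the last
target column around is a simplification often used in character-level RNN
examples, and the helper name is hypothetical.

.. code-block:: python3

    import numpy as np

    def make_batches(seq, batch_size, n_steps):
        """Split seq into batch_size rows, then yield n_steps-wide windows.

        batch_size is the number of shorter sequences; n_steps is the
        batch length in the sense above.
        """
        # keep only what fits evenly into batch_size * n_steps chunks
        n_chunks = len(seq) // (batch_size * n_steps)
        seq = np.asarray(seq)[: n_chunks * batch_size * n_steps]
        rows = seq.reshape(batch_size, -1)
        for i in range(0, rows.shape[1], n_steps):
            x = rows[:, i:i + n_steps]
            # targets are the inputs shifted one step; wrapping the last
            # column around is a common simplification
            y = np.roll(x, -1, axis=1)
            yield x, y

    for x, y in make_batches(np.arange(20), batch_size=2, n_steps=5):
        print(x.shape)  # (2, 5)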