├── .gitignore
├── Differential_Equations
│   └── README.md
├── Functional_Programming
│   ├── Other_Notes
│   │   └── sbt_and_eclipse.txt
│   ├── README.md
│   └── week1
│       └── week1_notes.txt
├── Math_104_Berkeley
│   ├── README.md
│   └── kenneth_ross_notes.txt
├── Deep_Learning
│   ├── README.md
│   ├── dlbook_chapter06notes.txt
│   ├── dlbook_chapter02notes.txt
│   ├── dlbook_chapter20notes.txt
│   ├── dlbook_chapter17notes.txt
│   ├── dlbook_chapter03notes.txt
│   ├── dlbook_chapter09notes.txt
│   ├── dlbook_chapter04notes.txt
│   ├── dlbook_chapter08notes.txt
│   ├── dlbook_chapter16notes.txt
│   ├── dlbook_chapter14notes.txt
│   ├── dlbook_chapter11notes.txt
│   ├── dlbook_chapter07notes.txt
│   ├── dlbook_chapter12notes.txt
│   ├── dlbook_chapter05notes.txt
│   └── dlbook_chapter10notes.txt
├── How_People_Learn
│   ├── README.md
│   ├── Part_04_Future_Directions.txt
│   ├── Part_01_Intro.txt
│   ├── Part_03_Teachers_and_Teaching.txt
│   └── Part_02_Learners_and_Learning.txt
├── Random
│   ├── Ray_Notes.txt
│   └── AWS_Notes.txt
├── README.md
├── CS61C_Berkeley
│   ├── README.md
│   └── CS61C_Lectures.txt
└── Robots_and_Robotic_Manip
    ├── dVRK.text
    ├── Modern_Robotics_Mech_Plan_Control.txt
    ├── Fetch.text
    ├── HSR.text
    ├── Mathematical_Introduction_Robotic_Manipulation.txt
    └── ROS.text

/.gitignore:
--------------------------------------------------------------------------------
*.swp
*.DS_Store
--------------------------------------------------------------------------------
/Differential_Equations/README.md:
--------------------------------------------------------------------------------
# Differential Equations

...
--------------------------------------------------------------------------------
/Functional_Programming/Other_Notes/sbt_and_eclipse.txt:
--------------------------------------------------------------------------------
Wow, learning how to use this stuff is really annoying.
=( 2 | -------------------------------------------------------------------------------- /Math_104_Berkeley/README.md: -------------------------------------------------------------------------------- 1 | This is a real analysis review. 2 | 3 | Fortunately, the textbook is supposed to be easy to read. It is also freely 4 | available online. 5 | -------------------------------------------------------------------------------- /Deep_Learning/README.md: -------------------------------------------------------------------------------- 1 | I'm reading the Deep Learning book by Goodfellow et al. 2 | 3 | TODOs: 4 | 5 | - Chapter 13 6 | - Chapter 15 7 | - Chapter 18 8 | - Chapter 19 9 | - Chapter 20 (all of it!) 10 | 11 | -------------------------------------------------------------------------------- /How_People_Learn/README.md: -------------------------------------------------------------------------------- 1 | # How People Learn: Brain, Mind, Experience, and School: Expanded Edition 2 | 3 | From National Academies Press. Looks like it was published in 2000, so I wonder 4 | how much of it is up to date ... 5 | -------------------------------------------------------------------------------- /Random/Ray_Notes.txt: -------------------------------------------------------------------------------- 1 | I'm trying to learn how to use Ray. See: 2 | 3 | https://rise.cs.berkeley.edu/projects/ray/ 4 | 5 | for an overview of the project. (Unfortunately, it's hard to do a Google search 6 | on that, but I will manage.) 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Self_Study_Courses 2 | 3 | These will be public notes for courses that I'm self-studying. 
4 | 5 | Current TODO list: 6 | 7 | - Finish Goodfellow et al 8 | - Finish CS 61C self-studying 9 | - Study robotic manipulation 10 | -------------------------------------------------------------------------------- /How_People_Learn/Part_04_Future_Directions.txt: -------------------------------------------------------------------------------- 1 | Chapter 10: Conclusions 2 | Chapter 11: Next Research Steps 3 | 4 | Mostly, these two chapters wrap up the book. I'm most interested in how 5 | humans/children learn, not so much about practical public policy or how to use 6 | technology. 7 | 8 | The first parts of Chapter 10 would be good to review periodically. 9 | -------------------------------------------------------------------------------- /CS61C_Berkeley/README.md: -------------------------------------------------------------------------------- 1 | Doing this because of (a) need to review computer architecture and (b) practice with C language. 2 | 3 | Relevant links: 4 | 5 | - https://github.com/61c-teach 6 | - https://cs61c.org/ 7 | - https://cs61c.org/resources/exams 8 | 9 | Looks like Berkeley changed to this format recently. Some of the courses have webcasts, though they might not all be public. 10 | -------------------------------------------------------------------------------- /Functional_Programming/README.md: -------------------------------------------------------------------------------- 1 | This is the Coursera course on Functional Programming, taught by the person who 2 | created the Scala Programming Language. =) 3 | 4 | Link to course: [click here][1] 5 | 6 | It says it's from January 30 to March 9; the year isn't stated but I assume it's 7 | 2017, which means this could be the first Coursera course that I actually follow 8 | from start to finish in time. I hope. 
[1]:https://www.coursera.org/learn/progfun1/
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/dVRK.text:
--------------------------------------------------------------------------------
How to use the dVRK in the context of ROS. Reading the ROS tutorials helped
clarify why ROS can auto-complete and refer to files elsewhere on the computer:
the ROS path points to those directories. Also, the dVRK launch files involve
`.xml` files similar to those shown in the tutorials. Use `rosed` to edit a
file without having to search for its path.

Focus on the basic skeleton. How do we start?
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter06notes.txt:
--------------------------------------------------------------------------------
*************************************************
* NOTES ON CHAPTER 6: Deep Feedforward Networks *
*************************************************

This chapter *should* be review for me. Read through, but don't get bogged
down too much in backpropagation. By the way, these networks technically
include convolutional nets, but we don't cover those in detail until Chapter 9.

The first part (Section 6.1) starts off with the classic example of linear
models failing to solve XOR, while a simple two-layer ReLU network can do it.

Most neural networks are trained with maximum likelihood, so the cost function
is the negative log likelihood:

J(\theta) = - E_{x,y} [log p_\theta(y|x)]

This is **equivalently** described as the cross entropy between the model
distribution and the data distribution. Interesting.

There's some stuff about the cross entropy and viewing the neural network as a
functional. I should review these later if I have time.
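The Section 6.1 XOR construction is easy to verify numerically. A quick sketch
(the hand-picked weights below are the ones from the book's worked example, not
learned by training):

```python
import numpy as np

# XOR inputs and targets: a linear model cannot fit this, but a two-layer
# ReLU network can represent it exactly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0.0, 1.0, 1.0, 0.0])

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked parameters: hidden layer h = relu(X W + c), output y = h w.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])

y_hat = relu(X @ W + c) @ w
print(y_hat)  # [0. 1. 1. 0.]
```

The second hidden unit only fires when both inputs are on, and its -2 output
weight cancels the double-counting, which is exactly the nonlinearity a linear
model lacks.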
BTW, they say that cross entropy is preferable to MAE or MSE, due to getting
better gradient signals (Section 6.2.1).

Section 6.3 is about the choice of hidden units. I'm skimming this.

Section 6.5 is about backpropagation. I'm skimming this. It's looong.
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Modern_Robotics_Mech_Plan_Control.txt:
--------------------------------------------------------------------------------
Notes on the textbook:

    Modern Robotics: Mechanics, Planning, and Control, 2017
    Kevin M. Lynch and Frank C. Park

Homepage: http://hades.mech.northwestern.edu/index.php/Modern_Robotics

It looks very similar to Murray, Li, and Sastry's book.

**********************
* Chapter 1: Preview *
**********************

One way of categorizing robots:

- Open chain: all joints are "actuated," i.e., we can move them directly.
  Example: most industrial robotic arm manipulators.
- Closed chain: only some joints are actuated. Example: the Stewart-Gough
  platform (!!)

The following joints have one degree of freedom, for rotation and translation,
respectively:

- Revolute joints: these allow rotation about the joint axis.
- Prismatic joints: these allow linear translation along the joint axis.

One sense of "Degrees of Freedom" is the number of actuated joints. However, a
(potentially better) sense of DoF comes from the notion of **configuration
spaces**:

> A more abstract but equivalent definition of the degrees of freedom of a robot
> begins with the notion of its configuration space: a robot's configuration is
> a complete specification of the positions and orientations of each link of a
> robot, and its configuration space is the set of all possible configurations
> of the robot.
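A toy illustration of the configuration-space idea (my own example, not from
the book): for a planar 2R open-chain arm, the configuration is just the two
joint angles (theta1, theta2), so the arm has 2 DoF, and the pose of every
link -- including the end-effector -- follows from the configuration.

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=1.0, L2=1.0):
    """End-effector (x, y) of a planar two-link arm with revolute joints.

    (theta1, theta2) is a point in the arm's configuration space; link
    lengths L1, L2 are fixed parameters, not part of the configuration.
    """
    x = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    y = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return x, y

print(forward_kinematics(0.0, 0.0))          # arm stretched out: (2.0, 0.0)
print(forward_kinematics(np.pi / 2, 0.0))    # pointing straight up: ~(0.0, 2.0)
```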

**********************************
* Chapter 2: Configuration Space *
**********************************

TODO


*********************************
* Chapter 3: Rigid Body Motions *
*********************************

TODO


*********************************
* Chapter 4: Forward Kinematics *
*********************************

Studies the problem of: given a set of input joint values, find the position
and orientation of the reference frame attached to the end-effector. This is
easily done for an open-chain robot, and the default solution is the "Product
of Exponentials" (PoE) formula.
--------------------------------------------------------------------------------
/How_People_Learn/Part_01_Intro.txt:
--------------------------------------------------------------------------------
Part 1: Introduction


Chapter 1: Learning: From Speculation to Science


Very important:

- We need to stop teaching and testing based on factual knowledge, because the
  amount of facts to know is beyond what any one person can handle. The focus
  of teaching should be more on learning how to acquire and synthesize facts,
  to "pick things up" quickly, so to speak. That's not to say facts are
  unimportant. It's just that the bigger priority should be understanding the
  connections among the facts so that it is easier to transfer and generalize
  to novel scenarios. Experts are very good at synthesizing, connecting, and
  efficiently organizing their reservoirs of knowledge.

- Students start with lots of prior knowledge and are not simply "empty
  vessels" that teachers fill with knowledge. It's necessary to check whether
  their prior knowledge is inhibiting or misleading them when learning about
  various concepts.
Classic scenario: fish is fish, where a fish asks an 22 | amphibian what land-based animals are like, but simply imagines them as fish 23 | with legs, fish with udders, etc. Another example: teaching students the Earth 24 | is round when they think it's flat. 25 | 26 | Also important: 27 | 28 | - There should be a focus on improving students' understanding of their own 29 | ability. They should be able to tell when they need help. The ability to 30 | predict one's performance on a task is called "metacognition" (see Chapters 2 31 | and 3). 32 | 33 | - Don't do shallow coverage of every possible topic within reach, instead reduce 34 | the number of topics but go through a few in depth to practice deeper 35 | understanding. 36 | 37 | - And a bunch of more mundane, practical stuff: need to change incentives of 38 | teaching and standardized tests so that it's not fact-based yet is still fair, 39 | need to do the same for adult teaching, etc. 40 | 41 | - Don't just focus on the best talent, need to work for lots of students. Well, 42 | it is important to develop top students more than we do in the US, but it's 43 | also clear that we need to broaden the population who have access to quality 44 | education. 45 | 46 | Stuff I forgot to record after a first pass: 47 | 48 | - Don't ask which teaching technique is best because that's like asking which 49 | tool is best: it depends on the task and materials at hand. 50 | 51 | - Don't forget all those hours students spend _outside_ of school. There are so 52 | many overlooked opportunities there. I should know, from personal experience. 
53 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter02notes.txt: -------------------------------------------------------------------------------- 1 | ************************************** 2 | * NOTES ON CHAPTER 2: Linear Algebra * 3 | ************************************** 4 | 5 | This chapter was pure review for me, but some highlights and insights: 6 | 7 | - They talk about tensors but I'm kind of familiar with them already, mostly 8 | when I have to deal with numpy arrays that have at least three coordinate 9 | dimensions (or four, in some deep learning applications with images). 10 | 11 | - Columns of A can be thought of as different directions we're spanning out of 12 | the origin, and the components of x (as in the matrix-vector product Ax) 13 | indicate how far we move in those directions. 14 | 15 | - We say "orthogonal" matrices, but there's no terminology for matrices whose 16 | columns and/or rows are mutually orthogonal, but *not* orthonormal. 17 | 18 | - Don't forget **eigendecompositions**! They're very important. Interesting 19 | intuition: 20 | 21 | > [...] we can also decompose matrices in ways that show us information about 22 | > their functional properties that is not obvious from the representation of 23 | > the matrix as an array of elements. 24 | 25 | Eigendecomposition of matrix: A = V * diag(eig-vals) * V^{-1}, where V 26 | has columns which correspond to (right) eigenvectors of A. 27 | 28 | Not every matrix can be decomposed this way, but we're usually concerned with 29 | real symmetric A. In fact, in that case we can say even more: we can construct 30 | an *orthogonal* V so our V^{-1} turns into the easier-to-deal-with V^T matrix. 31 | 32 | - An alternative, and more generally applicable decomposition, is the SVD. (Why 33 | is it more general? Well, every real matrix has an SVD, including non-square 34 | ones, but non-square matrices have undefined eigendecompositions.) 
In their 35 | formulation, the inner matrix of singular values is rectangular in general 36 | (other books/references have *square* matrices, but the definitions are 37 | essentially equivalent). 38 | 39 | - Moore-Penrose pseudoinverse helps us (sometimes) solve linear equations for 40 | non-square matrices, in which case the "normal" inverse cannot be defined. Use 41 | the formula A^+ = V * D^+ * U^T for the pseudoinverse. When A is a fat matrix, 42 | the solution x = A^+ * y provides us with the minimum Euclidean norm solution 43 | (I must have forgotten this fact). 44 | 45 | - For the trace, don't forget about the **cyclic property**!!! 46 | 47 | - The chapter concludes with an example of **Principal Components Analysis**, 48 | i.e. how to apply lossy compression to a set of data points while losing as 49 | little information as possible. By "compression" we refer to shrinking points 50 | from R^m into R^n where n < m. This is necessarily lossy. To optimally encode 51 | a vector, use f(x) = D^Tx, which we determined from L2 norm minimization. The 52 | decoder is g(c) = Dc = DD^Tx which reconstructs an approximated version of the 53 | input from the compression. Then the next (and final) step is to find D. They 54 | do this by also using an L2 minimization. They provide some nice tips on how 55 | to write out optimization problems nicely and compactly. This is again review 56 | for me. 57 | 58 | Well, I'm pleased with this chapter. =) I should expand upon some of these 59 | concepts in personal blog posts, particularly that last part (the proof by 60 | induction). 
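Quick NumPy sanity checks for the decompositions above (my own sketch, not the
book's code; the PCA data matrix here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Real symmetric A: eigendecomposition with an *orthogonal* V, so V^{-1} = V^T.
A = rng.normal(size=(4, 4))
A = A + A.T
w, V = np.linalg.eigh(A)
assert np.allclose(A, V @ np.diag(w) @ V.T)

# Every real matrix has an SVD, even a non-square one.
B = rng.normal(size=(3, 5))
U, s, Vt = np.linalg.svd(B, full_matrices=False)
assert np.allclose(B, U @ np.diag(s) @ Vt)

# Fat matrix: x = pinv(B) @ y is the minimum Euclidean norm solution of Bx = y.
y = rng.normal(size=3)
x = np.linalg.pinv(B) @ y
assert np.allclose(B @ x, y)

# PCA as lossy compression: encode f(x) = D^T x, decode g(c) = D c, where D
# holds the top-n eigenvectors of X^T X. The top directions should beat the
# bottom directions at reconstructing X.
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
evals, evecs = np.linalg.eigh(X.T @ X)      # eigenvalues in ascending order
D_top, D_bot = evecs[:, -2:], evecs[:, :2]  # n = 2 directions each
err_top = np.linalg.norm(X - (X @ D_top) @ D_top.T)
err_bot = np.linalg.norm(X - (X @ D_bot) @ D_bot.T)
assert err_top < err_bot
```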
61 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter20notes.txt: -------------------------------------------------------------------------------- 1 | *********************************************** 2 | * NOTES ON CHAPTER 20: Deep Generative Models * 3 | *********************************************** 4 | 5 | This is a **long** chapter, and likely contains most of the stuff at the 6 | research frontiers, at least those that interest the authors (Generative 7 | Adversarial Networks lol). 8 | 9 | 10 | Section 20.10: Directed Generative Nets 11 | 12 | Both VAEs and GANs are part of this section, which refers to using directed 13 | graphical models to "generate" something, or basically mirror a probability 14 | distribution. The first two sections, "Sigmoid Belief Nets" and "Differentiable 15 | Generator Nets" seem markedly less important, though the latter at least makes 16 | the point that a generator should be differentiable. It also makes the important 17 | distinction between a generator directly generating samples x, OR generating a 18 | DISTRIBUTION, which we then sample from for x. If we directly generate discrete 19 | values, the generator is not differentiable, FYI. 20 | 21 | 22 | Section 20.10.3: Variational Autoencoders 23 | 24 | - Trained purely with gradient methods. 25 | 26 | - To *generate* a sample, need to first sample a code z which has relevant 27 | latent factors, and then run through a generator ("decoder") network which 28 | will give us a mean vector (or maybe a second output with the covariance). We 29 | then sample from that Gaussian. Yes, this makes sense. Generating z may just 30 | be done with our prior. 31 | 32 | - Ah, but during training, we have to make use of our *encoder* network, since 33 | otherwise the generator/decoder wouldn't work well. The encoder network's job 34 | is to produce a useful z. 
35 | 36 | - Training is done by maximizing that variational lower bound for each data x: 37 | 38 | L(q) <= log p_model(x) 39 | 40 | where q is the distribution of the encoder network. Essentially, the encoder 41 | network approximates an intractable integral! 42 | 43 | - Some downsides: VAEs output somewhat blurry images and do not fully utilize 44 | the latent code z. However, GANs seem to share that second problem. 45 | 46 | - VAEs have been extended in many ways, e.g. DRAW. I remember that paper when I 47 | read it half a semester ago, but that was before I had RNN intuition. 48 | 49 | - Advantage: the training process is basically training an autoencoder. Thus, it 50 | can learn a manifold structure since that's what autoencoders can do! 51 | 52 | 53 | Section 20.10.4: Generative Adversarial Networks 54 | 55 | Use this loss function formulation for the Generator: 56 | 57 | > In this best-performing formulation, the generator aims to increase the log 58 | > probability that the discriminator makes a mistake, rather than aiming to 59 | > decrease the log probability that the discriminator makes the correct 60 | > prediction. 61 | 62 | Yes, I tried this for my own work and have had better results with this 63 | technique. It seems to be more important to do this than to do one-sided label 64 | smoothing, batch normalization, etc., which makes sense as this was the rare 65 | "trick" that made it in the original 2014 NIPS paper. 66 | 67 | - Then Sections 20.10.5 through 20.10.10 go through more topics that I don't 68 | have time to learn. 69 | 70 | 71 | Section 20.14: Evaluating Generative Models 72 | 73 | Yeah, I had a feeling this would be here, because some of this is quite 74 | subjective, and it seems like we have to resort to hiring human workers in 75 | person or via Amazon Mechanical Turk. The authors make a good point that in 76 | object recognition (for instance) we can alter the input. 
Some networks 77 | downscale to 256x256, others to 227x227, etc., but with generative models, if 78 | you change the input, the task fundamentally changes, and thus we can't compare 79 | the two procedures. Oh, and they also point out differences in log p(x) if x is 80 | discrete r.v. or continuous, in which case the former maximizes at log 1 = 0 and 81 | the latter can be arbitrarily high since p(x) could theoretically approach 82 | infinity. 83 | -------------------------------------------------------------------------------- /How_People_Learn/Part_03_Teachers_and_Teaching.txt: -------------------------------------------------------------------------------- 1 | Part 3: Teachers and Teaching 2 | 3 | 4 | Chapter 6: Design of Learning Environments 5 | 6 | Very important: 7 | 8 | - Use learning-centered (actually, "learner centered") environments, a bit 9 | unclear to define but I think mostly about better understanding of students' 10 | prior knowledge. Again, see previous chapters about this. 11 | 12 | - Need some form of knowledge learning, so students need to learn something 13 | beyond just "learning how to learn". (Edit: not really the right way to define 14 | this but again not a clear definition, but mostly about how to make students 15 | knowledgeable, so that they can do effective transfer --- again, see previous 16 | chapters.) 17 | 18 | - Students need feedback (see "deliberate practice"), but not just the kind that 19 | come with grades and tests. Also, feedback is most effective when students can 20 | revise their thinking on the _current_ subject matter, not when they get a 21 | test but by the time they have it, they've moved on to newer concepts. 22 | 23 | - Must consider the community/culture aspect, which obviously affects learning. 24 | For instance, Anglo culture emphasizes talking and asking questions, but 25 | others might not (and this affects how teachers evaluate students). Also, 26 | seriously, when are we going to talk about multi-racials? 
Gaaaah, so 27 | disappointing. 28 | 29 | Also important: 30 | 31 | - A bunch of stuff on the merits of television (remember, this was 2000) but not 32 | really relevant for what I hope to get out of this book. Also a bunch of stuff 33 | on how to evaluate teachers for practical purposes. 34 | 35 | Stuff I didn't remember: 36 | 37 | - While some may say schools aren't working, the reality is that we're asking 38 | for way more out of students than in past eras. In the past, being literate 39 | could have simply meant being able to sign your name. Now we're getting to the 40 | point where we need students to interpret and compose potentially complicated 41 | written stuff. 42 | 43 | - Eh, a relevant quote: "Learning theory does not provide a simple recipe for 44 | designing effective learning environments; similarly, physics constrains but 45 | does not dictate how to build a bridge." 46 | 47 | 48 | Chapter 7: Effective Teaching Examples 49 | 50 | Very important: 51 | 52 | - History: focus not on facts but on analysis and understanding how to debate 53 | concepts. If you take students who know facts and historians who don't 54 | specialize in the same area, the students might actually do better on tests of 55 | factual knowledge, but won't be able to do any analysis. Effective teachers 56 | can promote debate, with careful monitoring of course. Interesting example: 57 | teacher asking students to put stuff in a time capsule, so they need to reason 58 | about important stuff. 59 | 60 | - Math: less focus on computation, more focus on problem solving skills. 61 | Analogies can help, e.g., modeling floors of a building to learn about 62 | negative numbers (negative floors = below ground level). Oh, also model-based 63 | stuff, where we apply math to building models of stuff (e.g., buildings). 64 | Could also clearly apply to physics. 65 | 66 | - Science: again, less on facts and more on analysis. 
Many students have 67 | intuition on stuff that's not correct in physics (e.g., forces and Newton's 68 | third law) so use live demos. Also recall earlier discussion about students 69 | not classifying problems correctly based on solution, but based on how they 70 | look (surface features). Students who are able to describe a problem 71 | "hierarchically" tend to do better --- though this is obviously vague. 72 | 73 | Also important: 74 | 75 | - Deliberate practice. Don't forget. 76 | 77 | - Effective teachers must know the subject matter AND be able to tell where 78 | students are likely to run into roadblocks. 79 | 80 | Stuff I didn't remember: 81 | 82 | - Practical stuff about instruction in large classes. 83 | 84 | 85 | Chapter 8: Teacher Learning 86 | (Not the most relevant chapter for me) 87 | 88 | There's a huge difference between education theory and practice, leads to 89 | teachers rejecting (or not really diving into) research, lots of turnover, 90 | susceptible to local politics, etc. It's best to have workshops and other 91 | meet-ups where teachers can practice and discuss teaching techniques, etc. 92 | 93 | 94 | Chapter 9: Technology to Support Learning 95 | (Not the most relevant chapter for me) 96 | 97 | Well this is kind of out of date, I suppose. Mostly, technology has tradeoffs 98 | but can be used to bring in new contexts/demos to the class, etc. Particularly 99 | useful if it can help provide repeated feedback (remember deliberate practice). 100 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter17notes.txt: -------------------------------------------------------------------------------- 1 | ******************************************** 2 | * NOTES ON CHAPTER 17: Monte Carlo Methods * 3 | ******************************************** 4 | 5 | I think this chapter will also be review, but I have forgotten a lot of this 6 | material. 
It might also help me for my other projects with BIDMach. 7 | 8 | Heh, Las Vegas algorithms ... we never talk about those in Deep Learning. I 9 | agree, we should stick with deterministic approximation algorithms or Monte 10 | Carlo methods. Right, the point here is we have something we want to know, such 11 | as the expected value of a function (which depends on the data). Use sampling to 12 | take the average of f(x_1), ..., f(x_n) to form our estimate of E_p[f(x)] for 13 | some base distribution p. We can compute our expected error via the Central 14 | Limit Theorem. (Which John Canny said is "the most abused theorem in all of 15 | statistics" but never mind ...) 16 | 17 | But what if we cannot even sample from our base distribution p in the first 18 | place. For the above, we needed to draw x_1, ..., x_n somehow! We now turn to 19 | our possible solutions: importance sampling and MCMC. (The latter includes Gibbs 20 | sampling, and maybe even contains some variants of importance sampling? Not 21 | totally sure.) 22 | 23 | Section 17.2, Importance Sampling. 24 | 25 | I see, we can turn Equation 17.9 into Equation 17.10 just by switching the 26 | distribution the x_i's are drawn from, and adding in the necessary functions. 27 | Yes, they have the same expected value ... and I can see why the variance would 28 | be different. They argue that the minimum variance is the q^* in Equation 17.13. 29 | Yeah ... that seems familiar. How do they derive that? If indeed f did not 30 | change signs, then p and f cancel and the variance turns into a constant. Yay! 31 | 32 | I'm not really getting much out of this section other than definitions. I'll 33 | mark a TODO for myself to look at the examples they give in other parts of the 34 | book; this chapter is not as self-contained as Chapter 16. 35 | 36 | Section 17.3, Markov Chain Monte Carlo (my favorite!). They refer the reader to 37 | Daphne's book for more details (which I've read before!). 
38 | 39 | MCMC methods use *Markov chains* to approximate the desired sample distribution 40 | (call it p_model). These are most convenient for energy based models, p \propto 41 | exp(-E(x)), because they require non-zero probabilities everywhere. They also 42 | assume that the energy-based models are for _undirected_ graphical models, so 43 | that it's difficult to compute conditional probabilities. 44 | 45 | Procedure: start with random x, keep sampling, after a suitable burn-in period, 46 | the samples will start to come from p_model. Use a transition distribution 47 | T(x'|x), or a "kernel" in some of the literature. 48 | 49 | They show the usual matrix update in Equation 17.20, only for discrete random 50 | variables. Here, v should be in the probability simplex of dimension d where d 51 | is the amount of values that x can take on. Remember, we're in discrete land 52 | here. 53 | 54 | Something new to me: the matrix "A" here is a "stochastic matrix" and over time, 55 | its eigenvalues will converge to one as the exponent increases, or they'll decay 56 | to zero. Interesting ... the Perron-Frobenius Theorem they refer to is from a 57 | 1907 paper (!!!). 58 | 59 | They say "DL practitioners typically use 100 parallel Markov chains." Having 60 | independent chains gives us more independence. Why haven't I been doing this ... 61 | 62 | Section 17.4, Gibbs Sampling (yay ...). 63 | 64 | Not much in this section, they just say that for Deep Learning, it's common to 65 | use these for energy-based models, such as RBMs, though we better do block Gibbs 66 | sampling. 67 | 68 | Other stuff: 69 | 70 | They point out that the main problem with MCMC methods in high dimensions is 71 | that they mix poorly; the samples are too correlated. It might get trapped in a 72 | posterior mode, but I'm curious: how much of a problem is that? For deep neural 73 | networks, the biggest problem is with saddle points. 
They argue that the MCMC 74 | methods will not be able to "traverse" regions in manifold space with high 75 | energy. Those result in essentially zero p(x) due to e^{-H(x)}. 76 | 77 | Oh, I see, now they talk about temperature to aid exploration. Yeah, I know 78 | about that! =) Finally, I can see a reference about temperature. Think of 79 | temperature as: 80 | 81 | p(x) \propto exp(-H(x)/T) 82 | 83 | Thus, when temperature is high, the value in the exponent increases to zero, so 84 | the distribution becomes more uniform. 85 | 86 | You know, if there was more research done with MCMC methods and Deep Learning, 87 | wouldn't this chapter have discussed them? There isn't much here, to be honest, 88 | and lots of the references are pre-2012. And also, for tempering, why not cite 89 | some of the references they have in my own work? 90 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter03notes.txt: -------------------------------------------------------------------------------- 1 | ********************************************************** 2 | * NOTES ON CHAPTER 3: Probability and Information Theory * 3 | ********************************************************** 4 | 5 | This chapter was almost pure review for me, but some highlights and insights: 6 | 7 | - The chapter starts with some philosophy and some notation. Nothing new, though 8 | their notation is at least better than those from other textbooks I've read. 9 | Then they talk about definitions, marginals, conditionals, etc. It might be 10 | worth using their definition of covariance rather than the one I intuitively 11 | think of. High covariances (absolute values) mean values change a lot and are 12 | also far from their respective means often. Another concept to review: 13 | independence is a stronger requirement than zero covariance. Know the 14 | definition of a covariance matrix w.r.t. a random vector x. 
15 | 16 | - Section 3.9: Common Probability Distributions, is pure review with the 17 | exception of the Dirac Distribution (to some extent), though they mention 18 | sometimes the need to use the inverse variance to increase efficiency, though 19 | I doubt this is used often. Do remember why we like Gaussians: (1) the CLT, 20 | and (2) out of all distributions with the same variance and which cover the 21 | real line, it has the highest entropy, which can be thought of as imposing the 22 | fewest prior assumptions possible. (If we didn't have these restrictions, we 23 | could pick the *uniform* distribution, so be careful about the assumptions.) 24 | Finally, for mixture distributions, don't forget that the canonical way is to 25 | first choose a distribution, and then generate a sample from that. It is NOT, 26 | first generate k samples from all k distributions in the mixture, and then 27 | take a linear combination of those proportional to the probability weight. I 28 | was confused by that a few years ago. The component identity of a mixture 29 | model is often viewed as a **latent variable**. 30 | 31 | - Know the **logistic** function (yes) and the **softplus** function (huh, a 32 | smoothed ReLU). 33 | 34 | - There is some brief **measure theory** here: 35 | 36 | > One of the key contributions of measure theory is to provide a 37 | > characterization of the set of sets that we can compute the probability of 38 | > without encountering paradoxes. In this book, we only integrate over sets 39 | > with relatively simple descriptions, so this aspect of measure theory never 40 | > becomes a relevant concern. For our purposes, measure theory is more useful 41 | > for describing theorems that apply to most points in R^n but do not apply to 42 | > some corner cases. 43 | 44 | - Oh, I like their example with deterministic functions of random variables. 
45 | I've seen this a few times in statistics, and the key with variable 46 | transformations like those is that we have to take into account how the 47 | transformation rescales space, which is where the derivative term and Jacobians appear. 48 | 49 | - Section 3.13: Information Theory. My favorite part is Figure 3.6. I should 50 | spend more time thinking about it. Also, good intuition: 51 | 52 | > A message saying "the sun rose this morning" is so uninformative as to be 53 | > unnecessary to send, but a message saying "there was a solar eclipse this 54 | > morning" is very informative. 55 | 56 | Information theory is about quantifying the "information" present in some 57 | signal. Use the **Shannon entropy** to quantify the uncertainty in a 58 | probability **distribution**: - E_x[log p(x)]. This is "differential entropy" 59 | if x is continuous. Low entropy means the random variable is closer to 60 | deterministic; high entropy means it's very random and uncertain. 61 | 62 | Note: in most information theory contexts, the log is base 2, so we refer to 63 | this as "bits." In machine learning, we use the natural logarithm, so we call 64 | them "nats." 65 | 66 | As usual, define the KL divergence. KL(P||Q) = E_P[log(P(x)/Q(x))]. For now, 67 | assume the first distribution, P, is what we're drawing expectations w.r.t. 68 | For discrete r.v.s: 69 | 70 | > [KL Divergence is] the extra amount of information [...] needed to send a 71 | > message containing symbols drawn from probability distribution P, when we 72 | > use a code that was designed to minimize the length of messages drawn from 73 | > probability distribution Q. 74 | 75 | - Note also the **cross entropy** quantity: - E_P[log Q(x)]. 76 | 77 | > Minimizing the cross-entropy with respect to Q is equivalent to minimizing 78 | > the KL divergence, because Q does not participate in the omitted term.
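The decomposition behind that quote is easy to verify numerically (the two distributions below are my own toy numbers): cross-entropy = entropy + KL, and the entropy term is the "omitted term" that does not involve Q.

```python
import math

P = [0.5, 0.25, 0.25]   # "ground truth" distribution (toy numbers)
Q = [0.4, 0.4, 0.2]     # model distribution

entropy = -sum(p * math.log(p) for p in P)                   # H(P), in nats
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q)
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))          # KL(P || Q)

# H(P, Q) = H(P) + KL(P || Q); since H(P) does not depend on Q,
# minimizing cross-entropy over Q is the same as minimizing KL(P || Q).
print(cross_entropy, entropy + kl)
```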
79 | 80 | This is why, if Q is our model, we can minimize the cross entropy and make our 81 | Q close to P, which is the ground-truth data distribution. 82 | 83 | - The chapter concludes with some basic graphical models material. 84 | 85 | I like this chapter. 86 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter09notes.txt: -------------------------------------------------------------------------------- 1 | ********************************************** 2 | * NOTES ON CHAPTER 9: Convolutional Networks * 3 | ********************************************** 4 | 5 | This chapter should be review for me, but I do want to get clarification about 6 | (a) visualizing gradients/filters and (b) the "deconvolution" or "transpose 7 | convolution" operator. To a lesser extent, I'm interested in (c) how to 8 | implement efficient convolutions. 9 | 10 | - There is some stuff about whether we care about kernel flipping or not. 11 | However, this seems to be very specific to the convolution formula, and I 12 | doubt I'm going to go into detail on that since I'm not implementing it. 13 | 14 | - Understand why convolutions are so important: (1) **sparse interactions**, (2) 15 | **parameter sharing**, and (3) **equivariant representations**. I know all of 16 | these, and to be clear on the last one, it's because we often want to 17 | represent the same shapes but in different locations in a grid. The book says 18 | "To say a function is equivariant means that if the input changes, the output 19 | changes in the same way" so maybe they're using a slightly different 20 | perspective. The first two together are mainly about the storage and 21 | efficiency improvements. The third doesn't apply to all transformations (for 22 | CNNs at least), but it definitely applies for translation. 23 | 24 | - In the pooling description (Section 9.3) the authors say non-linearities come 25 | **before** pooling and **after** convolutions.
Indeed, this matches the 26 | ordering of the CNNs we wrote in CS 294-129. Intuitively, we already do a 27 | maximum operator in the standard 2x2 max pool, so why apply a ReLU **after** 28 | that? The major advantage of pooling is to make the network **invariant to 29 | slight transformations**. It also helps to reduce data dimensionality, 30 | particularly if we also padded the convolutions (and so the convolution layers 31 | do *not* reduce data dimensionality, but can leave that job for the pooling). 32 | 33 | - Interesting perspective: Section 9.4 explains why convolutions and pooling can 34 | be viewed as an infinitely strong prior. I can see why (beforehand) since 35 | these strongly assume the input is some grid-like thing, as an image. (A weak 36 | prior has high entropy, like a uniform distribution or a Gaussian) Be careful: 37 | 38 | > If a task relies on preserving precise spatial information, then using 39 | > pooling on all features can increase the training error. 40 | 41 | (This is an example of how architectures need to be tweaked for the task.) 42 | 43 | - Huh, I've never heard of **unshared convolution** nor **tiled convolution**. 44 | Eh, I can look them up later, they're alternatives to convolution but 45 | certainly less important to know. 46 | 47 | - Ah ... how to compute the **nightmarish** gradient of a convolution operator? 48 | The gradient is actually another convolution, but it's hard to derive 49 | algebraically. Convolutions are just (sparse) matrix multiplication assuming 50 | we've flattened the input tensor. We did that for CS 231n to flatten the input 51 | to shape (N, d1*d2*...*dn). Given that matrix, we take its transpose and that 52 | gives us the gradient for the backpropagation step, at least in theory. Wait, 53 | Goodfellow has a report from 2010 which explains how to compute these 54 | gradients. Interesting, how did I not know about this? 
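A 1-D sketch of that matrix view (my own toy code; like most DL libraries it actually computes cross-correlation, i.e. no kernel flipping): the forward pass is multiplication by a sparse banded matrix, and the backward pass w.r.t. the input is multiplication by its transpose.

```python
def conv_matrix(kernel, n_in):
    # Build the banded matrix for a "valid" 1-D convolution (no flipping).
    n_out = n_in - len(kernel) + 1
    rows = []
    for i in range(n_out):
        row = [0.0] * n_in
        for j, k in enumerate(kernel):
            row[i + j] = k
        rows.append(row)
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

x = [1.0, 2.0, -1.0, 3.0]
k = [2.0, 0.5]
M = conv_matrix(k, len(x))
y = matvec(M, x)                       # forward pass: y = M x
print(y)

grad_y = [1.0, 1.0, 1.0]               # pretend upstream gradient dL/dy
grad_x = matvec(transpose(M), grad_y)  # backward pass: dL/dx = M^T dL/dy
print(grad_x)
```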
55 | 56 | - Something I didn't quite think of before, but it seems obvious: we can instead 57 | use **structured output** from a CNN that isn't a probability vector or 58 | distribution but some tensor that comes "earlier" in the net. This can give 59 | probabilities for each precise pixel in an image, for instance, if the tensor 60 | output is 3D and (i,j,k) means class i probability in coordinate (j,k). Yeah, 61 | overall there are quite a lot of options the user has in designing a CNN. This 62 | also enables the possibility of using recurrent CNNs, see Figure 9.17. 63 | 64 | - Section 9.8: **Efficient convolutions**. Unfortunately, there is only 65 | high-level discussion here, but I'm not sure I'd be able to understand the 66 | details anyway. They say: 67 | 68 | > Convolution is equivalent to converting both the input and the kernel to the 69 | > frequency domain using a Fourier transform, performing point-wise 70 | > multiplication of the two signals, and converting back to the time domain 71 | > using an inverse Fourier transform. For some problem sizes, this can be 72 | > faster than the naive implementation of discrete convolution. 73 | 74 | The last part of the chapter is about the neuro-scientific basis of CNNs. It's 75 | an easier read. 76 | 77 | Overall, I think this is a good chapter. Unfortunately, it didn't cover (a) or 78 | (b), the stuff I was wondering about earlier. =( OK, I think I understand how to 79 | visualize a weight filter, but maybe I should look back at that relevant CS 231n 80 | lecture. 
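The Fourier route quoted in Section 9.8 can be checked on a toy example (my own sketch, using a naive O(n^2) DFT for clarity; a real FFT would be O(n log n), and note the theorem holds for *circular* convolution):

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * m * k / n) for k in range(n)) / n
            for m in range(n)]

def circ_conv_fourier(x, h):
    # Point-wise multiply in the frequency domain, then transform back.
    y = idft([a * b for a, b in zip(dft(x), dft(h))])
    return [round(c.real, 9) for c in y]

def circ_conv_naive(x, h):
    n = len(x)
    return [sum(x[k] * h[(j - k) % n] for k in range(n)) for j in range(n)]

x = [1.0, 2.0, 3.0, 0.0]
h = [0.5, 0.25, 0.0, 0.0]
print(circ_conv_fourier(x, h))  # matches the direct circular convolution
print(circ_conv_naive(x, h))
```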
81 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter04notes.txt: -------------------------------------------------------------------------------- 1 | ***************************************** 2 | * NOTES ON CHAPTER 4: Numerical Methods * 3 | ***************************************** 4 | 5 | This brief chapter will probably contain more new material for me compared to 6 | chapters 2 and 3, but still be mostly review. Here are the highlights: 7 | 8 | - We must delicately handle implementations of the **softmax function** to 9 | be robust to numerical underflow and overflow. The book amusingly just tells 10 | us to rely on Deep Learning libraries, which have presumably handled all these 11 | details for us. 12 | 13 | - Don't forget about a matrix's **condition number**, which when we're dealing 14 | with a function f(x) = A^{-1}x, roughly tells us how "quickly" it perturbs, 15 | i.e. its sensitivity. Later, they point out: 16 | 17 | > The condition number of the Hessian at this point measures how much the 18 | > second derivatives differ from each other. When the Hessian has a poor 19 | > condition number, gradient descent performs poorly. This is because in one 20 | > direction, the derivative increases rapidly, while in another direction, it 21 | > increases slowly. 22 | 23 | - Review: the **directional derivative** of function f in direction u is the 24 | derivative of the function f(x + alpha*u) evaluated at alpha=0, i.e. the slope 25 | of f in direction u. 26 | 27 | - Review of Hessians, Jacobians, gradient descent, etc. The Hessian can be 28 | thought of as the Jacobian of the gradient (of a function from R^n to R). 
29 | Also, regarding rows/columns of the Jacobians, if the function f is from R^m 30 | to R^n, the Jacobian is n x m, so just remember the ordering (I doubt it is 31 | strict since this is just a representation that's convenient for us, and we 32 | could also take transposes if we wanted). In Deep Learning, the functions we 33 | encounter almost always have symmetric Hessians. I like Equation 4.9 as it 34 | emphasizes how gradient descent can sometimes overshoot the target and result 35 | in a *worse* value, if the second-order term dominates. 36 | 37 | - To generalize the second derivative test (tells us a maximum, minimum, or 38 | saddle point) in high dimensions, we need to analyze the eigenvalues of the 39 | Hessian, e.g.: 40 | 41 | > When the Hessian is positive definite (all its eigenvalues are positive), 42 | > the point is a local minimum. This can be seen by observing that the 43 | > directional second derivative in any direction must be positive, and making 44 | > reference to the univariate second derivative test. 45 | 46 | Likewise, the reverse is true when the Hessian is negative definite. Note that 47 | the Hessian is a function of x (vector in R^n), so different x will result in 48 | different Hessians. See Figure 4.5 for the quintessential example of a saddle 49 | point. 50 | 51 | BTW, why do the eigenvalues help us **at all**? How are they related to the 52 | second derivative test in one dimension? I think it's because the second-order 53 | Taylor series expansion involves a term d^THd, where d is some unit vector. 54 | This is the second term that's added into the Taylor series, so its values 55 | among different directions tells us the curvature. We also have an 56 | eigendecomposition of H, and that provides us the eigenvalues. 57 | 58 | - We have simple gradient descent, and then the second-order (i.e. expensive!) 59 | Newton's method. How do we **derive** the step size, e.g. if you're asked to 60 | do so in an interview? 
61 | 62 | - Write out f(x) using a second-order Taylor series expansion at x(0). 63 | 64 | - Then look at the second-order Taylor series and take the gradient w.r.t. x 65 | (not x(0)). 66 | 67 | - Solve for the best x, the critical point, and plug-n-chug. 68 | 69 | - At least, that seemed to work for me and I verified Newton's method. 70 | 71 | - In the context of Deep Learning, our functions are so complicated that we can 72 | rarely provide any theoretical guarantees. We can sometimes get headway by 73 | assuming Lipschitz functions, which tell us that small changes in the input 74 | have quantified small changes in the function output. 75 | 76 | - Convex optimization is a very successful research field, but we can only take 77 | lessons from it; we can't really use its algorithms, and the importance of 78 | convexity is diminished in deep learning. Constrained optimization may be 79 | slightly more important. These involve the KKT conditions and Lagrange 80 | multipliers, which at a high level try to design an unconstrained problem so 81 | that the solution can be transformed into one for the **constrained** problem. 82 | Brief comments on those: 83 | 84 | - We rewrite the loss function by adding terms corresponding to constraints 85 | h(x) = 0 and/or g(x) <= 0. 86 | 87 | - We have min_{x in S} f(x) as our original **constrained** minimization 88 | problem. However ... 89 | 90 | - min_x max_{lambda} max_{alpha >= 0} L(x, lambda, alpha) has the same set of 91 | solutions and optimal points! 92 | 93 | - (Some caveats here, have to consider infinity cases, etc., but this is the 94 | general idea. Any time a constraint is violated, the inner max over the 95 | multipliers makes the Lagrangian infinite, so the outer min over x must avoid those points!) 96 | 97 | For some reason, I never feel comfortable with Lagrangians. It might be worth 98 | going back and reviewing Stephen Boyd's book, but I think the book's treatment 99 | was pretty clear.
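The Newton-step derivation earlier in the chapter can be sanity-checked on a toy quadratic (my own function and numbers): since the second-order Taylor expansion of a quadratic is exact, a single step x1 = x0 - f'(x0)/f''(x0) must land exactly on the critical point.

```python
def f(x):
    # Toy quadratic with minimum at x = 1.5.
    return 2.0 * x * x - 6.0 * x + 1.0

def grad(x):
    return 4.0 * x - 6.0

def hess(x):
    return 4.0

# Newton step, derived by setting the gradient of the expansion
# f(x0) + f'(x0)(x - x0) + 0.5 f''(x0)(x - x0)^2 to zero and solving for x.
x0 = 10.0
x1 = x0 - grad(x0) / hess(x0)
print(x1)  # 1.5 -- one step lands on the minimum for a quadratic
```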
100 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter08notes.txt: -------------------------------------------------------------------------------- 1 | **************************************************** 2 | * NOTES ON CHAPTER 8: Optimization for Deep Models * 3 | **************************************************** 4 | 5 | This chapter should be review for me. 6 | 7 | Section 8.1: Learning vs. Pure Optimization 8 | 9 | The authors make a good point in that we really care about minimizing the cost 10 | function w.r.t. the **data generating distribution**, NOT the actual training 11 | data (i.e. generalization). The difference from pure optimization is that we would 12 | need to know the underlying data generating distribution to do that directly; in machine learning we only have 13 | the fixed training data, i.e. we minimize the **empirical risk**. However, this 14 | isn't used in its raw form: 15 | 16 | > These two problems mean that, in the context of deep learning, we rarely use 17 | > empirical risk minimization. Instead, we must use a slightly different 18 | > approach, in which the quantity that we actually optimize is even more 19 | > different from the quantity that we truly want to optimize. 20 | 21 | Also, as I know, ML algorithms typically stop not when they're at a true minimum 22 | but when we define them to stop, i.e. early stopping. =) 23 | 24 | Oh, note that second-order methods require larger batch sizes. In fact, Andrej 25 | Karpathy covered that briefly in Lecture 7 of CS 231n. This is because 26 | matrix-vector multiplication and taking inverses amplify errors in the original 27 | Hessian/gradient. 28 | 29 | I do this: 30 | 31 | > Fortunately, in practice it is usually sufficient to shuffle the order of the 32 | > dataset once and then store it in shuffled fashion.
This will impose a fixed 33 | > set of possible minibatches of consecutive examples that all models trained 34 | > thereafter will use, and each individual model will be forced to reuse this 35 | > ordering every time it passes through the training data. 36 | 37 | Section 8.2: Challenges in Neural Net Optimization 38 | 39 | > For many years, most practitioners believed that local minima were a common 40 | > problem plaguing neural network optimization. Today, that does not appear to 41 | > be the case. The problem remains an active area of research, but experts now 42 | > suspect that, for sufficiently large neural networks, most local minima have a 43 | > low cost function value, and that it is not important to find a true global 44 | > minimum rather than to find a point in parameter space that has low but not 45 | > minimal cost. 46 | 47 | To test whether we are at a local minimum, we can check the norm of the gradient. 48 | 49 | Section 8.3: Basic Algorithms 50 | 51 | These include SGD and its variants, the core of the chapter. I had better know 52 | these. I know SGD; for momentum, they say: 53 | 54 | > Momentum aims primarily to solve two problems: poor conditioning of the 55 | > Hessian matrix and variance in the stochastic gradient. 56 | 57 | and 58 | 59 | > We can think of the particle as being like a hockey puck sliding down an icy 60 | > surface. Whenever it descends a steep part of the surface, it gathers speed 61 | > and continues sliding in that direction until it begins to go uphill again. 62 | 63 | There's some math there that I probably don't need to memorize, but I should 64 | blog about it soon. They write it as a first-order differential equation since 65 | we have a separate velocity term. If we didn't have that, we would need a *second* 66 | order diff-eq. Also, I really have to review differential equations someday. 67 | 68 | Section 8.4: Parameter Initialization 69 | 70 | AKA break symmetry!
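A sketch of *why* symmetry must be broken (tiny 1-D network and all constants are my own toy choices): two hidden units initialized identically receive identical gradients at every step, so gradient descent can never make them different.

```python
import math

# Toy network: y = v1*tanh(w1*x) + v2*tanh(w2*x), loss L = 0.5*(y - t)^2.
# Both hidden units start with the same weights (the bad initialization).
w = [0.5, 0.5]          # hidden weights, identical
v = [0.3, 0.3]          # output weights, identical
x, t, lr = 1.2, 1.0, 0.1

for _ in range(50):
    h = [math.tanh(wi * x) for wi in w]
    y = sum(vi * hi for vi, hi in zip(v, h))
    err = y - t
    grad_v = [err * hi for hi in h]                              # dL/dv_i
    grad_w = [err * vi * (1 - hi * hi) * x for vi, hi in zip(v, h)]  # dL/dw_i
    v = [vi - lr * g for vi, g in zip(v, grad_v)]
    w = [wi - lr * g for wi, g in zip(w, grad_w)]

print(w, v)  # both pairs are still exactly equal after 50 updates
```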
71 | 72 | Surprisingly, they don't seem to mention Kaiming He's paper on weight 73 | initialization. I don't even see any discussion of fan-in and fan-out. 74 | 75 | Section 8.5: Algorithms with Adaptive Learning Rates 76 | 77 | Yes, the key is **adaptive** learning rates. AdaGrad, then RMSProp, then Adam: 78 | 79 | > The name "Adam" derives from the phrase "adaptive moments." In the context of 80 | > the earlier algorithms, it is perhaps best seen as a variant on the 81 | > combination of RMSProp and momentum with a few important distinctions. 82 | 83 | The distinctions have to do with estimates of moments and their biases. I'm 84 | quite confused on this, unfortunately. 85 | 86 | (Note: unlike what's suggested in CS 231n Lecture 7, the textbook 87 | actually has RMSProp with Nesterov's momentum in one of its algorithms.) 88 | 89 | Section 8.6: Approximate Second-Order Methods 90 | 91 | Newton's method is intractable, etc. etc. etc. Well, these can help: 92 | 93 | > Conjugate gradients is a method to efficiently avoid the calculation of the 94 | > inverse Hessian by iteratively descending conjugate directions. 95 | 96 | Also, know BFGS and L-BFGS. 97 | 98 | Section 8.7: Other Strategies 99 | 100 | Ah, **batch normalization**. 101 | 102 | > This means that the gradient will never propose an operation that acts simply 103 | > to increase the standard deviation or mean of $h_i$; the normalization 104 | > operations remove the effect of such an action and zero out its component in 105 | > the gradient. This was a major innovation of the batch normalization approach. 106 | 107 | and 108 | 109 | > Batch normalization reparametrizes the model to make some units always be 110 | > standardized by definition, deftly sidestepping both problems. 111 | 112 | Yeah, this idea of normalizing inputs is obvious, so we have to be clear on the 113 | actual contribution of batch normalization. 114 | 115 | There's some other stuff here about pre-training (yes that's important!)
but 116 | also check Chapter 15. Oh, and don't forget, we normally don't want to design 117 | new optimization algorithms, but instead to make the networks **easier to 118 | optimize**. 119 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter16notes.txt: -------------------------------------------------------------------------------- 1 | ************************************************************************** 2 | * NOTES ON CHAPTER 16: Structured Probabilistic Models for Deep Learning * 3 | ************************************************************************** 4 | 5 | I expect to know the majority of this chapter, because it's probably going to be 6 | like Michael I. Jordan's notes. "Structured Probabilistic Models" are graphical 7 | models! But the key is that this should help me better understand the current 8 | research frontiers of Deep Learning, and it's self-contained. Let's see what it 9 | has to offer ... 10 | 11 | Their "Alice and Bob" (and "Carol" ...) example has to do with running a relay, 12 | which is better than Michael I. Jordan's example of being abducted by aliens. 13 | 14 | I remember Markov Random Fields, yes, we need to define a normalizing constant 15 | Z, but (a) if we define our clique potentials awfully, Z won't exist, and (b) in 16 | deep learning, Z is usually intractable. 17 | 18 | I agree with their quote: 19 | 20 | > One key difference between directed modeling and undirected modeling is that 21 | > directed models are defined directly in terms of probability distributions 22 | > from the start, while undirected models are defined more loosely by \phi 23 | > functions that are then converted into probability distributions. This changes 24 | > the intuitions one must develop in order to work with these models. 
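The normalizing-constant issue mentioned above can be made concrete with brute force on a tiny model (the edges and J values below are my own toy choices); the 2^n-term sum is exactly what becomes intractable at deep learning scale:

```python
import itertools
import math

# A tiny undirected model over three binary variables with pairwise
# potentials phi(a, b) = exp(J * a * b) on two edges.
J = {(0, 1): 1.0, (1, 2): -0.5}  # edges (x0,x1) and (x1,x2)

def unnorm(state):
    # Unnormalized probability: product of clique potentials.
    return math.exp(sum(j * state[a] * state[b] for (a, b), j in J.items()))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnorm(s) for s in states)        # brute force over all 2^3 states
probs = {s: unnorm(s) / Z for s in states}
print(Z, sum(probs.values()))  # dividing by Z makes the probabilities sum to 1
```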
25 | 26 | When they go and talk about their example with x being binary and getting 27 | Pr(X_i = 1) being a sigmoid(b_i), you can get that by explicitly writing out the 28 | formula, then "rearranging" the sum so that terms independent of the current, 29 | rightmost sum get pushed left. Then you see that the numbers mean we get 30 | independence, and can split the fractions, etc. It brings back good memories of 31 | studying CS 188. 32 | 33 | Section 16.2.4 is on Energy-Based functions. John Canny would really like those! 34 | I think the easiest way for me to think of these is taking potentials of 35 | arbitrary functions and then using e^{-function}. AKA Boltzmann Machines. I like 36 | their discussion here; it is relatively elucidating. 37 | 38 | There is also review on what edges mean when describing graphical models. Again, 39 | this is all CS 188 stuff. For instance, remember that we can add more edges to a 40 | graphical model and still represent the same class of distributions (the edges 41 | can be unnecessary). 42 | 43 | One advantage for each type: 44 | 45 | - It is easier to sample from directed models (I agree). 46 | - It is easier to perform approximate inference on undirected models (I think I 47 | agree). 48 | 49 | Key fact: 50 | 51 | > Every probability distribution can be represented by either a directed model 52 | > or by an undirected model. 53 | 54 | Though there are some directed models for which no undirected model is 55 | equivalent to it. By "equivalent" here we mean in the precise set of 56 | independence assumptions it implies. 57 | 58 | And another key idea: 59 | 60 | > When we represent a probability distribution with a graph, we want to choose a 61 | > graph that implies as many independences as possible, without implying any 62 | > independences that do not actually exist. 63 | 64 | E.g. 
a loop of length 4 (with no chords inside) is an undirected graphical 65 | model, but we have to add an edge before adding orientations to the edges to 66 | "convert" it to as simple a directed graphical model as possible (that still 67 | implies as many (or as few?) assumptions). 68 | 69 | Section 16.3: sampling from graphical models. I agree, it's easy for directed 70 | models. They call it "ancestral sampling" whereas I've called it "forward 71 | sampling," I think from Daphne Koller. We have to modify it if we want to do 72 | more general sampling with conditioning, i.e. fixed variables. It's toughest if 73 | the variables are *descendants*. Ancestors are easier because we can fix them 74 | and just do P(x|parents(x)) as usual. For *undirected* models ... they mention 75 | Gibbs sampling. =) 76 | 77 | The next few sections are pretty short. They mention *structure learning*, i.e. 78 | learning the graphical model structure. That's a hard problem due to the 79 | super-exponential number of possibilities. However, it seems like structure 80 | learning --- as far as I can tell --- is no longer active? They also mention the 81 | importance of latent variables. Yes, that's a bit broad, but I agree. Just 82 | before the "real" Deep Learning part they talk about inference and approximate 83 | inference, which is something that I should know about well (but they just give 84 | a broad treatment, a bit unclear). 85 | 86 | Finally, the Deep Learning part that I wanted to read. 87 | 88 | After reading it, I just want to clarify: when people draw out a fully connected 89 | net, they usually write out nodes, edges, in layer format, etc. Is that 90 | correctly viewed as a *graphical model*? Or are those different design criteria? 91 | Also, I'm assuming that all the "latent variable" discussion is simply referring 92 | to the hidden layers (and their units)? I think that's the case after reading 93 | about why loopy belief propagation is "almost never" used in deep learning. 
(Oh, 94 | and by the way, I don't actually know loopy belief propagation ... and I just 95 | barely remember belief propagation.) I think it makes sense, in normal graphical 96 | models, we want the computational graph to be sparse to prevent high treewidth, 97 | but in deep learning, we do matrix multiplication which creates a lot of 98 | connectivity. So, matrix multiplication, not loopy belief propagation. 99 | 100 | They discuss *Restricted Boltzmann Machines* at the end. They say it is the 101 | "quintessential example" of using graphical models for deep learning. With only 102 | one hidden layer, it is not too deep (a.k.a. it looks like a normal graphical 103 | model) but it groups variables into layers, like deep learning. For now, let's 104 | only worry about the "canonical form" which is an energy-based model with a 105 | particular (negative) quadratic form plus linear terms. The inputs are (v,h). 106 | The names should be familiar: v=visible and h=hidden. Then it's like a complete 107 | bipartite graph with v on one side and h on the other. We can do Gibbs sampling 108 | on this (in fact, _block_ Gibbs sampling). 109 | 110 | Concluding point: 111 | 112 | > Overall, the RBM demonstrates the typical deep learning approach to graphical 113 | > models: representation learning accomplished via layers of latent variables, 114 | > combined with efficient interactions between layers parametrized by matrices. 115 | 116 | I've now read the chapter and feel pleased. Great job, authors! 117 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter14notes.txt: -------------------------------------------------------------------------------- 1 | ************************************* 2 | * NOTES ON CHAPTER 14: Autoencoders * 3 | ************************************* 4 | 5 | Let's review this and discuss with John Canny. 6 | 7 | The introduction is excellent, and matches with my intuition. 
I agree that an 8 | encoder is like doing dimension reduction, and it certainly seems like decoders 9 | (the reverse direction) can be used for generating things, hence they can be 10 | used within *generative* models. (A.K.A. VAEs!) 11 | 12 | They mention "recirculation" as a more biologically realistic (!!) alternative 13 | to backpropagation, but it is not used much. 14 | 15 | Think of AEs as optimizing this simple thing: 16 | 17 | min_{f,g} L(x, g(f(x))) 18 | 19 | where x is the whole dataset, and f and g are the encoder and decoder, 20 | respectively. 21 | 22 | We need to make sure the autoencoder is constrained somehow ("undercomplete") 23 | so that it isn't simply performing the identity function. Solutions: don't 24 | provide too much capacity to both (a) the hidden code and (b) either of the two 25 | networks, and *regularize* somehow. Also, don't just make things linear, because 26 | then it's doing nothing more than PCA. 27 | 28 | Confusing point: think of autoencoders as "approximating maximum likelihood 29 | training of a generative model that has latent variables." Why? 30 | 31 | - The prior is not over the "belief on our parameters before seeing data" but 32 | the hidden units (which are latent variables). Yes, this aspect makes sense. 33 | - I don't know what they mean by "the autoencoder as approximating this sum with 34 | a point estimate for just one highly likely value for h" but let's not 35 | over-worry about this. 36 | 37 | (This was in the discussion about sparse autoencoders, and it makes a little 38 | more sense to me after reading about VAEs. The point is that `h` is a latent 39 | variable.) 40 | 41 | Denoising Autoencoders: clever! =) Rather than using g(f(x)) in the loss 42 | function, use g(f(\tilde{x})) where \tilde{x} is perturbed! This is a creative 43 | way to avoid the autoencoder simply learning the identity function. 44 | 45 | One can also regularize by limiting the derivatives, i.e. a "contractive 46 | autoencoder."
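A sketch of the undercomplete idea (all sizes and constants are my own toy choices; being fully linear, this is exactly the PCA-like case mentioned above): 2-D inputs near a 1-D manifold, a 1-D code, and plain gradient descent on squared reconstruction error. The bottleneck rules out the identity function, yet reconstruction still gets good because the data really is (almost) one-dimensional.

```python
import random

rng = random.Random(0)
# Points near the line x1 = 2*x0, plus a little noise: a 1-D manifold in R^2.
data = [[t, 2.0 * t + rng.gauss(0, 0.05)]
        for t in [rng.uniform(-1, 1) for _ in range(200)]]

w = [0.1, 0.1]   # linear encoder:  h = w . x  (scalar code)
v = [0.1, 0.1]   # linear decoder:  r = h * v
lr = 0.05

def loss():
    total = 0.0
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        total += (v[0] * h - x[0]) ** 2 + (v[1] * h - x[1]) ** 2
    return total / len(data)

loss_before = loss()
for _ in range(300):
    gw, gv = [0.0, 0.0], [0.0, 0.0]
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        e = [v[0] * h - x[0], v[1] * h - x[1]]     # reconstruction error
        gh = 2.0 * (e[0] * v[0] + e[1] * v[1])     # dL/dh
        for i in range(2):
            gv[i] += 2.0 * e[i] * h / len(data)
            gw[i] += gh * x[i] / len(data)
    w = [w[i] - lr * gw[i] for i in range(2)]
    v = [v[i] - lr * gv[i] for i in range(2)]
loss_after = loss()
print(loss_before, loss_after)  # reconstruction error drops substantially
```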
47 | 48 | I've wondered about the exact size of autoencoders in use nowadays, since I 49 | haven't seen a figure before. The encoder and decoder are themselves each feed 50 | forward neural networks, so in general, it seems like each can be implemented 51 | with many layers (or just one). 52 | 53 | Stochastic Encoders and Decoders: not sure I got much out of this. However, I 54 | did get this: the decoder can be seen as optimizing log p(x|h), since it is 55 | given h and has to produce x (and x is known!). But the analogue for the encoder 56 | is more confusing, because we have log p(h|x) but we don't know h. This must be 57 | similar to other latent variables in graphical models. 58 | 59 | **Update**: after reading this again with more knowledge of how these work, 60 | I think I didn't get the point of the last section. The log p(x|h) is indeed 61 | what the decoder optimizes, though (1) it really optimizes the encoder as 62 | well when this is trained end-to-end since the encoder produces h, and (2) 63 | we have to provide the loss function, and (3) we can **also** add a 64 | distribution to the encoder, but I don't think this is actually needed to 65 | train the encoder portion. In the case of continuous-valued pixels, we 66 | should probably consider a Gaussian distribution for the loss, which means 67 | the autoencoder should try and get the mean/variance. In VAEs, we can take 68 | advantage of the Gaussian assumption to *sample* elements. 69 | 70 | Denoising autoencoders: OK, their computational graph (Figure 14.3) makes sense. 71 | (It doesn't really help me get a deep understanding, though.) They introduce a 72 | corruption function C(\tilde{x} | x), whose function is obvious. I was confused 73 | for a bit as to why we're assuming we know the x (I mean, in real life, we might 74 | be given *only* noisy stuff) but if we don't have the real x, we can't evaluate 75 | the loss function! It's just part of our training data. 
76 | 77 | Figure 14.4 makes sense intuitively. Corrupted stuff is off the manifold because 78 | if we take an average random sample, it'll be in some random space. But **real** 79 | samples are in a manifold. Unfortunately, some of the discussion here (e.g. 80 | connecting autoencoders with RBMs) just refers to reading papers. =( That's why 81 | I am reading this textbook, to *avoid* reading difficult-to-understand papers. 82 | There's also some discussion on estimating the score function, which I think I 83 | understand but haven't fully grokked. 84 | 85 | OK, back to more obvious stuff: 86 | 87 | > Denoising autoencoders are, in some sense, just MLPs trained to denoise. 88 | > However, the name "denoising autoencoder" refers to a model that is intended 89 | > not merely to learn to denoise its input but to learn a good internal 90 | > representation as a side effect of learning to denoise. 91 | 92 | Manifolds! (Section 14.6) Key reason why we think about this (emphasis mine): 93 | 94 | > Like many other machine learning algorithms, autoencoders exploit the idea 95 | > that data concentrates around a low-dimensional manifold or a small set of 96 | > such manifolds, as described in section 5.11.3. [...] Autoencoders take this 97 | > idea further and aim to **learn the structure of the manifold**. 98 | 99 | Additional thoughts: 100 | 101 | - Understand **tangent planes**: these describe the directions of allowed 102 | variation for a point x while still remaining on the low-dim manifold. See 103 | Figure 14.6 for an intuitive example with MNIST, showing points on this 104 | manifold and also the allowable directions. 105 | 106 | - Intuitively, autoencoders need to learn how to represent this variation along 107 | the manifold. However, they don't need to do this for points off the 108 | manifold. See Figure 14.7. The reconstruction is flat near the manifold 109 | points, i.e. the only area that matters.
True, it jumps up at several points, 110 | but those are well off the manifold. 111 | 112 | - There are other ways we can learn manifold structure, using non-Deep 113 | Learning techniques (see Figures 14.8 and 14.9), but I don't think these are 114 | as important to know now. 115 | 116 | Contractive Autoencoders (Section 14.7) introduce a regularizer to make the 117 | derivatives of f (as in, f(x) = h) small. 118 | 119 | What are applications of autoencoders? Definitely dimensionality reduction is 120 | one, and we can also think about information retrieval, the task of finding 121 | entries in a database that resemble a query entry. Why? Search is more efficient 122 | in lower-dimensional spaces. 123 | 124 | Overall, I actually think this chapter is among the weaker ones in the book. 125 | Looking through the CS 231n slides was a **lot** more helpful. Eh, not every 126 | chapter is perfect. 127 | -------------------------------------------------------------------------------- /Robots_and_Robotic_Manip/Fetch.text: -------------------------------------------------------------------------------- 1 | Notes on how to use the Fetch. 2 | 3 | ************ 4 | ** UPDATE ** 5 | ************ 6 | 7 | Here are some full steps: 8 | 9 | (0) Start the Fetch, ensure that it can move with the joystick controls. 10 | 11 | (1) Switch to fetch mode by calling `fetch_mode` on the command line. This will 12 | ensure that the `ROS_MASTER_URI` is the Fetch robot. 13 | 14 | (2) Be on the correct WiFi network. Then the master node (Fetch) is accessible. 15 | 16 | - Verify that `rostopic list` returns topics related to the Fetch. 17 | - Also verify that the teleop via keyboard script (via `rosrun ...`, see 18 | tutorials) is working, though sometimes even that doesn't work for me. 19 | 20 | (3) Then do whatever I need to do... for instance, simply running Ron's 21 | camera script (a single python file) works to continually see the Fetch's 22 | cameras. Finally!
23 | 24 | - Some python scripts might require a launch file to be running, such as the 25 | built-in disco.py and wave.py code. For these use `roslaunch [...] [...]`. 26 | 27 | 28 | TODO: figure out robot state? For Fetch-specific messages. 29 | 30 | 31 | ****************** 32 | ** Older notes: ** 33 | ****************** 34 | 35 | Note that `PS1` is an environment variable that we can import, but the real key 36 | thing is to set ROS_MASTER_URI, which will let us connect to the Fetch. This does 37 | not happen by default, so I must export it in each new window (for now). 38 | 39 | Then I think we should do `rosrun [package] [script]` where I code stuff in 40 | [script] inside some package. But are Ron and Michael doing it in a similar way? 41 | 42 | Recommended order for development (NOT WORKING): 43 | 44 | - Code the script within some package 45 | - Compile the package with `catkin_make` 46 | - In another terminal, set `ROS_MASTER_URI` appropriately 47 | - In that same terminal, `source ./devel/setup.bash` 48 | - Finally, again in same terminal `rosrun ...` and enjoy 49 | 50 | I know when I set `ROS_MASTER_URI` and run `rostopic list` I get all the 51 | appropriate Fetch-related topics ... so why am I not able to access them in my 52 | code when calling `rosrun ...`? 53 | 54 | (If I don't set `ROS_MASTER_URI` and instead have it as the default, then I do 55 | not get any topics, of course. Note that according to documentation, roslaunch 56 | will START roscore if it detects that one doesn't exist!) 57 | 58 | Is there a launch file that I can use? I'm confused because `rostopic echo 59 | [...]` for the topics means I can see the output ... 60 | 61 | 62 | *************************** 63 | * Tutorial: Visualization * 64 | *************************** 65 | 66 | 67 | 68 | ******************************* 69 | * Tutorial: Gazebo Simulation * 70 | ******************************* 71 | 72 | At least this is clear: 73 | 74 | > Never run the simulator on the robot.
Simulation requires that the ROS 75 | > parameter use_sim_time be set to true, which will cause the robot drivers to 76 | > stop working correctly. In addition, be sure to never start the simulator in a 77 | > terminal that has the ROS_MASTER_URI set to your robot for the same reasons. 78 | 79 | And it looks like I've installed the two packages necessary, 80 | `ros-indigo-fetch-gazebo` and `ros-indigo-fetch-gazebo-demo`. 81 | 82 | Run: `roslaunch fetch_gazebo simulation.launch` and the Gazebo simulator should 83 | show up! However, I've noticed if you exit and then try to run the simulator 84 | again, error messages may result. From looking things up online, this seems to be 85 | expected behavior. :-( Try CTRL+C in the same window to exit. I've been able to 86 | get `simulation.launch` to work fairly consistently, fortunately. 87 | 88 | For "Running the Mobile Manipulation Demo": 89 | 90 | The playground will get set up, just be patient. :-) It takes a few extra 91 | seconds due to a "namespace" error message, must be due to slow loading of 92 | data online. However, a playground _should_ eventually appear. 93 | 94 | Then the next part moves the Fetch throughout the Gazebo simulator. It's 95 | pretty cool. Doesn't work reliably, see GitHub issue I posted. 96 | 97 | I think this will be easier on a desktop since Gazebo also seems to be sensitive 98 | to the graphics card, though after this I fixed it so my laptop can access the 99 | separate GPU. 100 | 101 | How does the demo code work? Two commands: 102 | 103 | 1. roslaunch fetch_gazebo playground.launch 104 | 2. roslaunch fetch_gazebo_demo demo.launch 105 | 106 | Use `roscd [...]` to go to the package directory and look at `launch/` to find 107 | specific definitions. The first command runs the launch file with several 108 | readable arguments.
The second one is more interesting, launch looks like: 109 | 110 | ``` 111 | [launch XML not captured in these notes -- see demo.launch in fetch_gazebo_demo] 130 | ``` 131 | 132 | Four easy parts. What's odd, though, is that I can't find `demo.py` anywhere on 133 | my machine, but it's online at the repo: 134 | 135 | https://github.com/fetchrobotics/fetch_gazebo/blob/gazebo2/fetch_gazebo_demo/scripts/demo.py 136 | 137 | Might be another useful code reference as it's a clean stand-alone script, 138 | though with some MoveIt, etc., obviously. 139 | 140 | 141 | 142 | ************************** 143 | * Tutorial: Robot Teleop * 144 | ************************** 145 | 146 | This is pretty easy. 147 | 148 | 149 | 150 | ************************ 151 | * Tutorial: Navigation * 152 | ************************ 153 | ************************** 154 | * Tutorial: Manipulation * 155 | ************************** 156 | 157 | I ran both of these manipulation tutorials (hand-wavy thing and disco) and it 158 | works. I wasn't able to try out extensions. 159 | 160 | 161 | 162 | ************************ 163 | * Tutorial: Perception * 164 | ************************ 165 | 166 | Fetch exposes several "ROS topics" that we can subscribe to in order to obtain 167 | camera information. Unfortunately, I have yet to get callbacks to work ... 168 | 169 | 170 | 171 | ************************** 172 | * Tutorial: Auto-Docking * 173 | ************************** 174 | ************************* 175 | * Tutorial: Calibration * 176 | ************************* 177 | ********************************** 178 | * Tutorial: Programming-By-Demos * 179 | ********************************** 180 | -------------------------------------------------------------------------------- /Robots_and_Robotic_Manip/HSR.text: -------------------------------------------------------------------------------- 1 | Notes on how to use the HSR.
Use their Python interface (or we can do 2 | lower-level ROS stuff). Also, there's a built-in motion planner, so MoveIt! is 3 | not necessary. Ideally, we get a camera image, get the x and y values from the 4 | pixels, figure out z (the depth), and determine a rotation, and send it there. 5 | 6 | - Gazebo can be useful. 7 | - rviz is DEFINITELY helpful for debugging. Know it. 8 | - Calibration: ouch, unfortunately this will take a while and there are eight 9 | sensors to calibrate ... at minimum. The docs actually show a lot. I see a 10 | sensor (camera) on the hand as well. 11 | - Register positions, using the same image I see of black/white boxes, the 12 | "calibration marker jig". 13 | 14 | Monitor status: see 6.1 of the manuals. Setting up development PC/laptop, 15 | section 6.2. Not much else to write here. At least I can get rviz running with 16 | images. You need to hit the reset button and see the LEDs (not above the 17 | 'TOYOTA' text but everywhere else) turn yellow-ish. 18 | 19 | On my TODO list: 20 | 21 | - Figure out good test usage practices for rviz. 22 | - Get skeleton code set up for the HSR to: 23 | - process camera images 24 | - move based on those images (either base or gripper, or both) 25 | - Figure out a safe way to automatically move arms. 26 | 27 | 28 | 29 | ****************** 30 | * Moving the HSR * 31 | ****************** 32 | 33 | General idea with Python code, do something like: 34 | ``` 35 | self.robot = hsrb_interface.Robot() 36 | self.omni_base = self.robot.get('omni_base') 37 | self.whole_body = self.robot.get('whole_body') 38 | ``` 39 | where the `hsrb_interface` is code written by the Toyota HSR programmers, 40 | thankfully. That part is necessary for the robot to begin publishing stuff from 41 | its topics. 42 | 43 | Let's understand _base_ motion. 44 | 45 | 46 | Aerial view of the HSR. Assumes its head is facing north. 
47 | 48 | ^ 49 | | 50 | <--[hsr]--> 51 | | 52 | v 53 | 54 | Axes are: 55 | 56 | pos(x) for north, neg(x) for south. 57 | Also, (oddly) pos(y) for LEFT, neg(y) for right. 58 | 59 | I thought the `y` stuff would be the other way around, but I guess not. The 60 | z stuff stays fixed (obviously). These are based on the (x,y,z) I get from 61 | `omni_base.get_pose()`. The rotations are in quaternions. 62 | 63 | FYI: When the robot starts up, it has some (x,y,z) position which should 64 | be set at (0,0,0) based on the starting position. 65 | 66 | Errors: unfortunately if you query the `omni_base.get_pose()` again and 67 | again, the values are still going to vary by something like 1-3mm, so 68 | there's always some error. Same with the dVRK. 69 | 70 | Rotations: clockwise from aerial view, the `z` decreases. Counterclockwise, 71 | it increases. The other three values in the quaternion don't seem to change, 72 | x==y==0 and w==1. We're only rotating about one plane for the base so this 73 | is expected. TODO: understand quaternions well. 74 | 75 | 76 | To clarify the above, understand `go_rel`: 77 | 78 | ``` 79 | In [30]: omni_base.go_rel? 80 | Type: instancemethod 81 | String Form: <bound method ...> 82 | File: /opt/tmc/ros/indigo/lib/python2.7/dist-packages/hsrb_interface/mobile_base.py 83 | Definition: omni_base.go_rel(self, x=0.0, y=0.0, yaw=0.0, timeout=0.0) 84 | Docstring: 85 | Move base from current position. 86 | 87 | Args: 88 | x (float): X-axis position on ``robot`` frame [m] 89 | y (float): Y-axis position on ``robot`` frame [m] 90 | yaw (float): Yaw position on ``robot`` frame [rad] 91 | timeout (float): Timeout until movement finish [sec]. 92 | Default is 0.0 and wait forever. 93 | ``` 94 | 95 | Seems like indeed we should only control x and y, obviously. The interesting 96 | part is that `yaw` must represent the `z` in the quaternion, so rotations of the 97 | base imply changes in yaw only. 98 | 99 | 100 | Next, `whole_body` allows more control.
This is for the _end_effector_: 101 | 102 | ``` 103 | In [38]: whole_body.get_end_effector_pose? 104 | Type: instancemethod 105 | String Form: <bound method ...> 106 | File: /opt/tmc/ros/indigo/lib/python2.7/dist-packages/hsrb_interface/joint_group.py 107 | Definition: whole_body.get_end_effector_pose(self, ref_frame_id=None) 108 | Docstring: 109 | Get a pose of end effector based on robot frame. 110 | 111 | Returns: 112 | Tuple[Vector3, Quaternion] 113 | 114 | In [39]: whole_body.get_end_effector_pose() 115 | Out[39]: Pose(pos=Vector3(x=0.2963931913608169, y=0.07800193518379123, z=0.6786170137933408), ori=Quaternion(x=0.7173120598879523, y=-7.000511757597367e-05, z=0.6967520358527196, w=-6.613377471335618e-05)) 116 | ``` 117 | 118 | This is relative to the base frame. So when we move the HSR, without moving 119 | the end-effector, the x,y,z stuff remains the same, as expected. BUT since 120 | the base frame has some fixed "reference rotation" then rotating base means 121 | the y and w quaternion components change; the x and z stay the same. 122 | 123 | We can also see joint names and their limits. Use `whole_body.joint_state` 124 | to get full details. There are lots of `whole_body.move_to[...]` methods that 125 | make it really convenient for research code. 126 | 127 | An alternative is to explicitly assign to these by publishing to the 128 | associated ROS topics, which might be more generally applicable to the 129 | Fetch and other robots (well, we change the topics ...). 130 | 131 | 132 | Finally, for the gripper itself, use `gripper`. We can grasp it, so it's similar 133 | to the dVRK, and use negative values for tight stuff. :-) 134 | 135 | 136 | Other notes on moving the HSR: 137 | 138 | - It's possible to move in straight lines, arcs, etc. 139 | - Understand `tf` for resolving coordinate frames. TODO: later ... actually, 140 | might as well do this all in simulation (rviz) first to double check 141 | movements. 142 | - Also use rviz for visualizing coordinates.
RGB = xyz axes. 143 | - Common coordinates: `map` for the overall map, `base_footprint` for the 144 | base of the HSR, `hand_palm_link` for the robot's hand (end-effector I 145 | assume, or "tool frame"). 146 | - You can move both the base and arm together to get to a destination, can 147 | also weigh relative contribution. 148 | - Can move the hand based on force sensing, might be useful if we're running 149 | this automatically and need some environment feedback? 150 | - Avoid collisions by using the collision avoider they have. Looks really 151 | simple to use, they handle a lot for us. 152 | 153 | 154 | See Section 7.2.6 for more advanced coding, rather than using `ihsrb` which is 155 | like IPython. Oh, and later they actually have a YOLO tutorial. Nice! 156 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter11notes.txt: -------------------------------------------------------------------------------- 1 | ********************************************** 2 | * NOTES ON CHAPTER 11: Practical Methodology * 3 | ********************************************** 4 | 5 | This is sometimes neglected, but it shouldn't be! Their intro paragraph hits the 6 | core: 7 | 8 | > Successfully applying deep learning techniques requires more than just a good 9 | > knowledge of what algorithms exist and the principles that explain how they 10 | > work. A good machine learning practitioner also needs to know how to choose an 11 | > algorithm for a particular application and how to monitor and respond to 12 | > feedback obtained from experiments in order to improve a machine learning 13 | > system. 14 | 15 | Their running example is the Street View house number dataset and application, 16 | which is good for me since I only have minor knowledge of this material. The 17 | application is as follows: Cars photograph the buildings and address numbers, 18 | while a CNN recognizes the addresses based on photos. 
Then Google Maps can add 19 | the building to the correct location. 20 | 21 | Section 11.1: Performance Metrics 22 | 23 | Use precision and recall in the event that a binary classification shouldn't 24 | treat the two cases equally, e.g. with spam detection or diagnosing diseases. 25 | Precision is the fraction of flagged instances that are truly relevant, while 26 | recall is the fraction of truly relevant instances that get detected. A disease detector 27 | saying that everyone has the disease has perfect recall, but very small 28 | precision, equal to the actual fraction who have diseases. We can draw a PR 29 | curve, or use a scalar metric such as **F-scores** or **AUC**. 30 | 31 | Section 11.2: Default Baseline Models 32 | 33 | This depends on the problem setting. Copy over previous work if possible. 34 | 35 | Start small-scale at first, with regularization and **early stopping**. (I 36 | forgot to do this for one project before eventually adding it, and I'm glad I did.) 37 | 38 | Most of this should be obvious. 39 | 40 | Section 11.3: More Data? 41 | 42 | Regarding when to add more data, they suggest: 43 | 44 | > If the performance on the test set is also acceptable, then there is nothing 45 | > left to be done. If test set performance is much worse than training set 46 | > performance, then gathering more data is one of the most effective solutions. 47 | > [... after some regularization discussion ...] If you find that the gap 48 | > between train and test performance is still unacceptable even after tuning the 49 | > regularization hyperparameters, then gathering more data is advisable. 50 | 51 | Of course, in some domains such as medical applications, gathering data can be 52 | costly. Again, this is obvious. 53 | 54 | Section 11.4: Hyperparameters 55 | 56 | Do these manually or automatically. The manual version places special emphasis 57 | on finding a model with the right effective capacity for the problem at hand.
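Quick sketch of the "automatically" option (my own code, not the book's): random search draws each hyperparameter from its own marginal distribution, and scale-type hyperparameters like the learning rate get sampled uniformly over the *exponent* rather than the raw value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    """Draw one random-search configuration, one marginal per hyperparameter."""
    return {
        # Learning rate in (1e-5, 1e-1]: uniform over the exponent x of
        # 10**(-x), NOT uniform over the rate itself.
        "lr": 10.0 ** -rng.uniform(1, 5),
        "weight_decay": 10.0 ** -rng.uniform(2, 6),
        "num_layers": int(rng.integers(2, 6)),   # a discrete hyperparameter
    }

configs = [sample_config() for _ in range(1000)]
lrs = np.array([c["lr"] for c in configs])

# Each decade of the learning-rate range gets roughly a quarter of the draws.
print(np.mean((lrs >= 1e-3) & (lrs < 1e-2)))   # ~0.25
```

Grid search would instead take the Cartesian product of a few fixed values per hyperparameter; sampling independently like this is what keeps whole slices of trials from being wasted on hyperparameters that turn out not to matter.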
58 | 59 | As a function of a hyperparameter value, generalization curves often follow a 60 | U-shaped curve, with the optimal value somewhere in the middle. At the smaller 61 | end, we may have low capacity (and thus underfitting) and the other end may have 62 | high capacity (and thus overfitting). Whether the ends really mean low/high 63 | capacity depends on the hyperparameter: the total number of layers in a 64 | neural network is one example where the small end does mean low capacity. For 65 | weight decay, the curve might still be 66 | U-shaped, but the underfitting happens with high values and the overfitting 67 | with smaller values. 68 | 69 | Their main advice, and the one which agrees with my own experience, is that if 70 | there is ANY hyperparameter to tune, it should be the learning rate. Why? The 71 | effective capacity of the model is highest ... for a **correct** learning rate. 72 | Not when it's too large or too small. In general, the **training 73 | error** curve decreases as the learning rate gets high enough ... then once it's barely too 74 | high, it SHOOTS UP, due to taking too large steps during gradient updates. 75 | 76 | What happens if your training error is worse than expected? Your best bet is to 77 | increase capacity. Especially with Deep Learning, we should be able to overfit 78 | to most training datasets, so try without regularization techniques. 79 | 80 | If the test error is worse than training, then the reason (at least with Deep 81 | Learning models with high capacity) is most likely the generalization 82 | gap between train and test. Try regularization techniques. 83 | 84 | I **really like Table 11.1**, it outlines the effects of changing different 85 | hyperparameters. Study it well!
Though I think I understood all of them; the one 86 | that might be newest to me is weight decay, but fortunately I somewhat 87 | understand it after reading through OpenAI's Evolution Strategies code. 88 | 89 | OK, next, **automatic hyperparameter search**. This includes **grid search**, 90 | best when we have three or fewer hyperparameters and we can test all points in 91 | the Cartesian product of the set of values. **Random search** can be better, as 92 | I know from CS 294-129. See Figure 11.2 for a comparison of grid search and 93 | random search. 94 | 95 | Typically, grid search values are chosen based on a logarithmic scale, or 96 | "trying every order of magnitude." If the best values are on a boundary point, 97 | shift the grid search. Sometimes we have to do coarse-to-fine, as Andrej 98 | Karpathy puts it. Random search can be cheaper and often more effective. Here, 99 | we have a marginal probability distribution for each hyperparameter, which we 100 | sample from to get hyperparameters. (Be careful about non-uniform distributions 101 | if we want to sample from a logarithmic scale, e.g. for learning rates that are 102 | 10^{-x}, we would do a uniform distribution sample on x.) Random search is more 103 | effective when there are hyperparameters which do not strongly affect the 104 | performance metric, since grid search wastes exhaustive trials on those. 105 | 106 | The section concludes on Bayesian hyperparameter optimization, but the authors 107 | conclude that it isn't yet reliably helpful for Deep Learning. 108 | 109 | Section 11.5: Debugging 110 | 111 | This is hard. :( 112 | 113 | Their example of an especially challenging bug is if the bias gradient update is 114 | slightly off. Then the other weights might actually be able to compensate for 115 | the error, to some extent. This is why you need a finite difference check, as we 116 | did for CS 231n, or use TensorFlow.
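The finite-difference check, roughly as I remember it from CS 231n (my reconstruction, not code from the book): compare the analytic gradient against centered differences and look at the max relative error.

```python
import numpy as np

def grad_check(f, grad_f, w, eps=1e-5):
    """Max relative error between analytic gradient and centered differences."""
    g_analytic = grad_f(w)
    g_numeric = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g_numeric[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    num = np.abs(g_analytic - g_numeric)
    den = np.maximum(np.abs(g_analytic) + np.abs(g_numeric), 1e-12)
    return (num / den).max()

w = np.random.default_rng(0).normal(size=8)

# Correct pair: f(w) = 0.5 ||w||^2 has gradient exactly w.
err = grad_check(lambda v: 0.5 * (v ** 2).sum(), lambda v: v, w)
print(err)        # tiny (far below 1e-6): the analytic gradient checks out

# A "slightly off" gradient (like the book's bias-update bug) is caught:
err_buggy = grad_check(lambda v: 0.5 * (v ** 2).sum(), lambda v: 1.01 * v, w)
print(err_buggy)  # ~0.005, way above float noise
```

Rule of thumb from the course notes: relative error around 1e-7 is fine, around 1e-2 almost certainly means a bug.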
117 | 118 | Visualize the model in action, visualize the worst cases, **fit a tiny dataset** 119 | (which I do), etc. Also, monitor histograms of activations and gradients, which 120 | might help detect gradient saturation. 121 | 122 | Yeah, actually I *do* use a lot of these techniques, though maybe I should add 123 | those histograms as well? 124 | 125 | Oh, they say that the magnitude of parameter updates should be roughly 1% of the 126 | magnitude of the parameters themselves. In some recent work, I see 5% for this 127 | quantity. Maybe I should aim to get that reduced? 128 | 129 | Section 11.6: Example of Multi-Digit Recognition 130 | 131 | Looks interesting. Here, coverage was the metric to optimize while fixing 132 | accuracy to be 98%. (Thus, accuracy is more important.) They got a LOT of 133 | improvement simply by looking at the worst cases and seeing that there was 134 | unnecessary cropping. 135 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter07notes.txt: -------------------------------------------------------------------------------- 1 | ********************************************* 2 | * NOTES ON CHAPTER 7: Regularization for DL * 3 | ********************************************* 4 | 5 | Again, this will be mostly review. 6 | 7 | Section 7.1: Parameter Norm Penalties. 8 | 9 | One piece of intuition is that biases don't need to be regularized because each 10 | bias controls only one variable, whereas a weight controls the interaction 11 | between the two nodes its edge connects. 12 | 13 | Good review for me, look at the math in Section 7.1.1 about L2 regularization. 14 | Assuming a quadratic cost function, we can show that weight decay rescales the 15 | optimal weight vector along the **axes** defined by the **eigenvectors** of H, 16 | the Hessian. This is good linear algebra review. Understand Figure 7.1 as well! 17 | 18 | TODO: review the L1 regularization section.
I must have seen this before but I 19 | can't remember, and it'd be good to know. But the TL;DR is that L1 encourages 20 | more sparsity compared to L2, so certain features can be discarded. 21 | 22 | (Some of the next sections are quite short and I didn't take notes. One insight 23 | is that the definition of the Moore-Penrose pseudoinverse looks like a 24 | regularization formula, with weight decay!) 25 | 26 | Other regularization strategies: 27 | 28 | - Dataset Augmentation, useful for object recognition, but be careful not to, 29 | e.g., flip the images if we're doing optical character recognition, since the 30 | classes could be altered (a flipped 'b' becomes a 'd'). Be careful to augment 31 | *after* the train/test split, and, when comparing benchmarks, make sure the 32 | algorithms use the same augmentation. 33 | 34 | - Add noise directly to weights, sometimes seen in RNNs, or the targets, as in 35 | **label smoothing**. 36 | 37 | - Semi-Supervised Learning. Use both p(x) and p(x,y) to determine p(y|x). 38 | Example: PCA for the "unsupervised" projection to an "easier" space, and then 39 | a classifier built on top of that, so PCA is a pre-processing step. Yeah, 40 | makes some sense. 41 | 42 | - Multi-Task Learning. Think of this as different tasks having the same input 43 | but different output, **AND** having a common "intermediate" step, or latent 44 | factor. We need that last condition because otherwise we're not sharing 45 | parameters across tasks (i.e. across different targets). I haven't really done 46 | much work with multi-task learning, but I bet I will in the future! 47 | 48 | - Early Stopping. Ah yes, this sounds dumb but it works. Often, training error 49 | will continue decreasing and asymptote somewhere, but our validation error can 50 | decrease initially, but then **increase**. We want to stop and return the 51 | weights we had at the time just before the validation error began to increase.
52 | Huh, the authors even say it's the most popular form of regularization, I 53 | guess because it comes naturally to beginners. There are some slight costs to 54 | (a) testing on the validation set, and (b) storing weights periodically, but 55 | from my experience those are minor. They continue to elaborate that if we want 56 | to use the validation data for training too, we can run early stopping first, *then* 57 | retrain with all the data included. (This seems overkill to me.) They conclude early stopping by showing 58 | mathematically how it acts as a regularizer. 59 | 60 | - Parameter Tying and Parameter Sharing. These try to make certain parameters 61 | close to each other, so the regularizer could be || w(a) - w(b) ||_2 where 62 | w(a) and w(b) are weights in two different layers. However, I think the more 63 | popular view is to have them be **equal**, and hence have parameter 64 | **sharing** instead of tying, which has the added advantage of memory savings. 65 | This is precisely what happens in CNNs (and RNNs!). 66 | 67 | - Sparse Representations. Here, for some reason, we're focused on 68 | **representational sparsity**. This means our DATA is considered to have a new 69 | representation which is sparse. This is *not* the same as **parameter 70 | sparsity**, which the L1 regularization on the parameters would have enforced. 71 | This arises out of putting penalties on the activations in the NN. However, 72 | I'm not really sure I follow this and it doesn't seem to be as important as 73 | other techniques. 74 | 75 | - Bagging and Ensembles. Train several different models (independently), then 76 | have them vote. It works well when the models do not make the same test 77 | errors. We can quantify this mathematically by computing the expected error 78 | and expected squared error.
One way to do this is with bagging, which will 79 | sample k different **datasets**, formed by sampling the original data with 80 | replacement, so with high probability we'll get different datasets each time 81 | (with some data points repeated, of course, and others missing). 82 | 83 | - Dropout. This can be viewed as noise injection, FYI, **and** as a form of 84 | bagging and ensemble learning. Man, it's really clever. PS: remember how it 85 | works, we remove (non-output!) **units**, NOT the edges (though it could be 86 | done that way, I think). Edges are automatically removed when their units are 87 | removed. In code, of course, we just multiply by zero. Remember: 88 | 89 | > Each time we load an example into a minibatch, we randomly sample a 90 | > different binary mask to apply to all of the input and hidden units in the 91 | > network. The mask for each unit is sampled independently from all of the 92 | > others. The probability of sampling a mask value of one (causing a unit to 93 | > be included) is a hyperparameter fixed before training begins. It is not a 94 | > function of the current value of the model parameters or the input example. 95 | 96 | There is some discussion about how to predict or do inference with ensemble 97 | methods. The authors mention some obscure geometric mean trick, but 98 | fortunately, with dropout we can do one forward pass and scale by the dropout 99 | parameter. (Or we can avoid this by instead dividing by the keep probability 100 | during training -- "inverted dropout," as I know it.) 101 | 102 | This is actually **not** exact even in expectation, due to the 103 | non-linearities, but it works well in practice. 104 | 105 | Dropout goes beyond regularization interpretations: 106 | 107 | > [...] there is another view of dropout that goes further than this. Dropout 108 | > trains not just a bagged ensemble of models, but an ensemble of models that 109 | > share hidden units.
This means each hidden unit must be able to perform well 110 | > regardless of which other hidden units are in the model. 111 | 112 | It looks like we have redundancy, which is good. 113 | 114 | - Adversarial Training. You knew this was coming. :) We get those adversarial 115 | examples, and then use that to improve our classifier. See Goodfellow's papers 116 | for details. There are caveats, though, and I believe even with training on 117 | adversarial examples, such a model still has *new* adversarial examples. I 118 | might have to re-read those papers. Goodfellow showed that one cause for 119 | adversarial examples is excessive linearity. They can also be considered 120 | semi-supervised learning, which we talked about earlier in the chapter. 121 | 122 | - Tangent {Distance, Prop, Manifold Classifier}. These relate to our assumption 123 | that the essence of the data lie in lower-dimensional manifolds. The 124 | regularization here is that f(x) shouldn't change much as x moves along its 125 | manifold. I don't really think these are important for me to know right now, 126 | but I remember studying these a bit for the prelims. 127 | 128 | Whew, some of these were new actually, or at the very least I got a better 129 | understanding of them. Note that batch normalization (which might make dropout 130 | unnecessary) is discussed in the **next** chapter, not this one. 131 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter12notes.txt: -------------------------------------------------------------------------------- 1 | ************************************* 2 | * NOTES ON CHAPTER 12: Applications * 3 | ************************************* 4 | 5 | There's a LOT of them! Recall the 2016 publication date, so anything after that 6 | won't be here (e.g., the Transformer architecture, other DeepRL stuff?). 
7 | 8 | 12.1: Large-Scale Deep Learning 9 | 10 | Nice discussion about how the video game community spurred the development of 11 | graphics cards, and how the characteristics of graphics cards ended up being 12 | beneficial for the kind of computations used in deep learning. Actually, why? 13 | 14 | - We need to perform many operations in parallel (and these are often 15 | independent of each other, hence parallelization is easier). 16 | - Less 'branching' compared to the workload of a CPU. 17 | - GPUs have memory and data can be put on there, whereas the data is too large 18 | for most CPU caches. 19 | 20 | They got more popular after more general-purpose GPUs were available that 21 | could do stuff other than rendering, and NVIDIA's CUDA lets us implement those 22 | using a C-like language. But, it's very hard to write good CUDA code (not the 23 | same as writing good CPU code). Good news: once someone does it, we should 24 | refer to those libraries. 25 | 26 | - Data parallelism: easy for inference since we have models run on different 27 | machines. But for training, use Hogwild!. (We can alternatively increase the 28 | batch size for one machine, but we don't get the advantage of more frequent 29 | gradient updates versus Hogwild!.) 30 | - Model parallelism: each machine runs a different part of the model. (Huh, I 31 | don't think I'll do this, we'd need a super large network?) 32 | - Model compression: mentions Hinton's knowledge distillation. :-) 33 | 34 | We can do a lot with *dynamic structure*: this means we might use different 35 | components of the network for a given computation. For example, have a gated 36 | network which picks one of several expert networks to use for evaluation. 37 | (Results in soft or hard mixture of experts, depending on (as expected) whether 38 | the 'gater' outputs a soft weighting or a single hard weighting, like a one-hot 39 | vector of weights.) Even simpler: decision trees.
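The gating idea can be sketched in a few lines of numpy (my own toy illustration, not code from the book; the two "experts" and the gating matrix `W_gate` are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# Two made-up "experts" and a linear "gater", purely for illustration.
experts = [lambda v: 2.0 * v, lambda v: v ** 2]
W_gate = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])  # maps a 2-d input to one score per expert

def mixture(x, hard=True):
    weights = softmax(W_gate @ x)                  # soft weighting over experts
    outputs = np.array([f(x[0]) for f in experts])
    if hard:
        # Hard mixture: keep only the top-scoring expert. (A real system would
        # skip evaluating the rest; here everything runs for simplicity.)
        return outputs[np.argmax(weights)]
    return weights @ outputs                       # soft mixture: weighted sum

x = np.array([3.0, 1.0])
y_hard = mixture(x, hard=True)   # expert 0 wins: 2.0 * 3.0 = 6.0
y_soft = mixture(x, hard=False)  # close to 6.0, with a small expert-1 term
```

The dynamic-structure payoff only shows up in the hard case, where a real implementation would evaluate just the one chosen expert per input.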
40 | 41 | Efficient hardware implementations: doesn't discuss Tensor Processing Units 42 | (TPUs) but those came out after this book, I think. 43 | 44 | 12.2: Computer Vision 45 | 46 | Pre-processing: make sure it's consistent, doesn't have to be fancy. Often 47 | scaling to [-1,1] or [0,1] suffices. Heck they say there are CNNs that can 48 | dynamically adjust to take images of different sizes, but I find it easiest to 49 | always keep a fixed scale. 50 | 51 | Examples: *contrast normalization*, and *whitening*. I think contrast 52 | normalization is like the (X - np.mean(X)) / (np.std(X) + eps) that we've often 53 | done in computer vision tasks. Whitening is another story about *rescaling 54 | principal components to have equal variance*. 55 | 56 | Actually this is a short section. I'm surprised there wasn't an overview on 57 | classification, detection, segmentation, and other computer vision problems. 58 | It's mostly about how data is processed. See CS 231n for details on the actual 59 | tasks. 60 | 61 | 12.3: Speech Recognition (ASR with 'Automatic' in it) 62 | 63 | (Not a subsection of NLP, despite ASR as part of my NLP class at Berkeley) 64 | 65 | Find the most probable linguistic sequence y given input acoustic sequence X. 66 | I.e.: argmax_y P(y|X). Before 2012, state of the art systems used Hidden Markov 67 | Models and Gaussian Mixture Models. 68 | 69 | Use "TIMIT" for benchmarking, the MNIST of ASR so to speak. 70 | 71 | Not much detail here, unfortunately, besides that Restricted Boltzmann Machines 72 | (RBMs) were among the ingredients for the resurgence of Deep Learning in ASR. 73 | But now they are not used. :) I wonder if Transformers are used in ASR now? I 74 | haven't been following the literature and the section is too short for a proper 75 | treatment. 
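Back in 12.2, the two preprocessing examples can be made concrete; here's my own rough numpy sketch (not code from the book) of per-example contrast normalization and PCA whitening:

```python
import numpy as np

def contrast_normalize(X, eps=1e-8):
    # Per-example: subtract the mean, divide by the standard deviation.
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / (std + eps)

def pca_whiten(X, eps=1e-5):
    # Rescale principal components so each has (roughly) unit variance.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric eigendecomposition
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.3, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ A               # correlated fake data
Xw = pca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]                 # approximately the identity
```

The `eps` terms just guard against dividing by zero or by tiny eigenvalues.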
76 | 77 | 12.4: Natural Language Processing 78 | 79 | Largely based on *language models* and treating *words* as the distinct unit, 80 | and then modeling language as probability of a next word given an existing 81 | sequence of words. Know *n-grams*, modeling the conditional probability of a word 82 | based on the preceding n-1 words. Unigrams, bigrams, and trigrams use 1, 2, and 83 | 3 as n. 84 | 85 | - But recall my NLP class: hard to use raw counts for computing conditional 86 | probabilities, because many counts are zero. 87 | - Thus use smoothing. 88 | - But still many 'curse of dimensionality' challenges with classical n-gram 89 | models. 90 | 91 | Neural language models: these let us recognize two distinct words as similar, 92 | via word embeddings. I think they are suggesting 93 | getting word embeddings by predicting the context given the center word, or 94 | predicting the center word given context (like we did in 182/282A). But 95 | regardless, it's good to have embeddings, since instead of representing words 96 | as one-hot vectors, we use lower dimensional representations with Euclidean 97 | distance to get similarity. This is analogous to a CNN's hidden layer output 98 | giving us an image embedding. 99 | 100 | Issue with high-dimensional outputs: if our model needs to produce words (e.g., 101 | probability of next word given existing text) then naively a softmax over all V 102 | words in the vocabulary means we need a huge matrix to represent this 103 | operation and to train it (assuming naive cross-entropy loss). 104 | 105 | - Naive fix: use a 'short list' of most frequent words only. But that is 106 | counter to what we actually want! 107 | - Slightly better: *hierarchical softmax*. Now predict categories of words, and 108 | then predict more specific categories, etc. But performance of actual model 109 | often not that great, and hard to get the most likely word in a given 110 | context.
111 | - Importance sampling: the logic for this approach is that the gradient of the 112 | softmax can be broken up into the positive and negative phases (interesting 113 | intuition, I'd thought about it but was good to see them explicitly state 114 | it). The negative phase is an expectation, and we can use (biased) importance 115 | sampling. 116 | - Noise-contrastive estimation is another option, but see Chapter 18 for a 117 | fuller treatment. 118 | 119 | Interesting contrast with neural nets and n-grams: the latter are much faster 120 | for look-up operations with hash tables. 121 | 122 | Neural machine translation: recall the encoder-decoder architecture, where the 123 | encoder reads the sentence and produces a data structure called a "context" 124 | that contains "relevant information" somehow. Advantage of an RNN for 125 | encoders/decoders is that we can process variable-length sequences. 126 | 127 | They cite a paper by Jacob Devlin from 2014 who beat state of the art models by 128 | using an MLP. Heh, he would later be the first author on the 2018 BERT paper. 129 | 130 | They conclude with a brief discussion on some of the earlier attention models 131 | in Deep NLP. A lot more has happened since then! 132 | 133 | 12.5: Other Applications 134 | 135 | - Recommender systems and collaborative filtering. Actually this leads them to 136 | talk about contextual bandits, which as we know are an intermediate between 137 | the k-armed bandit case and the full RL problem. Why contextual bandits here? 138 | Because if recommender systems only give users the best item according to the 139 | model, there is no 'exploration' of other items that might be even better. 140 | 141 | Also, it's an intermediary because bandits = no state, basically. The normal 142 | RL problem means the action directly changes the next state. 143 | 144 | - Knowledge representation, reasoning, and question answering. Interesting 145 | topics, but for now not part of my direct research agenda.
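To make the 12.4 n-gram sparsity point concrete, here's a toy bigram model of my own (the corpus is made up): raw maximum-likelihood counts give unseen bigrams probability zero, while add-one (Laplace) smoothing does not.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))
V = len(vocab)  # vocabulary size, used by add-one smoothing

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    # Raw maximum-likelihood estimate: zero for any unseen bigram.
    return bigrams[(prev, w)] / unigrams[prev]

def p_laplace(w, prev):
    # Add-one smoothing: every bigram gets a nonzero probability.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

p_mle("sat", "cat")      # 0.5: "cat sat" seen once, "cat" seen twice
p_mle("mat", "cat")      # 0.0: "cat mat" never observed
p_laplace("mat", "cat")  # 0.125: small but nonzero
```

Real smoothing schemes (Kneser-Ney, etc.) are cleverer, but the zero-count problem they fix is exactly this one.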
146 | -------------------------------------------------------------------------------- /How_People_Learn/Part_02_Learners_and_Learning.txt: -------------------------------------------------------------------------------- 1 | Part 2: Learners and Learning 2 | 3 | 4 | Chapter 2: How Experts Differ from Novices 5 | 6 | Very important: 7 | 8 | - As implied in the previous chapter, what distinguishes experts from novices 9 | isn't necessarily factual knowledge (nor is it ability or intelligence), so much 10 | as it is about better connections among concepts, and the ability to 11 | "conditionalize" knowledge. This means being able to know what areas/concepts 12 | are needed for a specific task, rather than trying out everything. 13 | 14 | - (Related) Experts have more fluent knowledge retrieval, so they better know 15 | what applies to specific tasks. This means their memory is not taxed trying to 16 | figure out what would apply. Organization is more efficient; novices may 17 | retrieve knowledge in a slow, sequential manner. 18 | 19 | - Experts recognize (and are more sensitive to) meaningful patterns across many 20 | fields. Example: with chess, if you randomize the pieces, the experts don't 21 | really remember those locations any better than novices, but if the pieces are 22 | arranged as they might be in a real game situation, the expert can pick up 23 | patterns and remember the location of pieces far better than novices can. 24 | 25 | - Different styles of experts: "artisans" vs "virtuosos". The former are experts 26 | in one field but the latter are also experts and, moreover, have the desirable 27 | property of "active learning" so they are experts at learning about new 28 | things. This requires metacognition, as discussed in the first chapter. 29 | Educational programs need to be designed to encourage the development of 30 | virtuosos.
31 | 32 | Also important: 33 | 34 | - Cool example with physics: experts organize problems in a way that reflects 35 | deeper, fundamental ideas, whereas novices organize problems by whether they look 36 | similar (e.g., have the same drawings of triangles). 37 | 38 | - Being an expert at a subject is NOT the same as being an expert at teaching. 39 | An expert teacher will better understand when students might get stuck. Yeah, 40 | this is a widely agreed-upon fact. 41 | 42 | Stuff I didn't remember: 43 | 44 | :-) 45 | 46 | 47 | Chapter 3: Learning and Transfer 48 | 49 | Very important: 50 | 51 | - You could argue that the ultimate goal of teaching is better transfer 52 | learning, or how to efficiently use the knowledge from school and apply it to 53 | the real world. Also, the goal is not to immediately know how to do new tasks, 54 | but simply to increase the _speed_ at which these new tasks will be learned. 55 | Early performance attempts are less important since anyone is going to need 56 | some time to learn new stuff, so don't evaluate based on the first attempt; 57 | evaluate based on the length of the learning period. 58 | 59 | - All transfer learning (and learning itself, of course) starts from somewhere. 60 | Yeah, prior knowledge was emphasized in earlier chapters. Clearly, prior 61 | knowledge may help or hinder new learning. Examples: students incorrectly 62 | think that plants eat soil, that when they throw a ball in the air there is 63 | still "force from the hand pushing it" and so on. 64 | 65 | - For better transfer learning, we need to see the same concept in different 66 | contexts, so that we can understand the "abstract stuff" that is shared across 67 | tasks. That's better than remembering task-specific details (or "overly 68 | contextualized" knowledge in their jargon) that don't generalize.
69 | 70 | Also important: 71 | 72 | - Learning depends a lot on social background and culture, in addition to more 73 | factual, easy-to-define prior knowledge. Some cultures may discourage asking 74 | questions, for instance, which means if teachers expect to see questions, they 75 | might think a student is uninterested. There were also some differences noted 76 | among white versus black families (but no biracials, Asians, etc ... sigh). 77 | 78 | - Speed of learning depends on deliberate practice and feedback. :-) 79 | 80 | Stuff I didn't remember: 81 | 82 | - (A bit silly that I didn't record this, but oh well ...) All learning takes 83 | time. You simply can't be an expert without investing the time. And moving 84 | on to more advanced subjects without knowing the basics is not ideal. 85 | 86 | - Oh, another obvious thing I didn't quite record: don't forget about 87 | motivation. What factors (social, etc.) motivate students? That's very 88 | important for speed of learning. 89 | 90 | - Amount of transfer depends on overlap among concepts, well roughly speaking. 91 | Yeah, another generally obvious thing. 92 | 93 | 94 | Chapter 4: How Children Learn 95 | 96 | Very important: 97 | 98 | - Even the very young (as in, months-old infants) exhibit signs of learning and 99 | knowledge, which contrasts with very early research claims. We have better 100 | tools for experimentation and to measure infants, since (for obvious reasons) 101 | it's not that easy to test on them. TL;DR young children are active, 102 | competent agents. 103 | 104 | - Children also pick up language and can quickly tell if stuff seems natural or 105 | unnatural. On a related note, parents need to read to their children, though 106 | some of this can be "picture" books. 107 | 108 | - Zone of proximal development: the gap between current abilities, and the 109 | abilities one could have with extra teaching assistance. (Or more accurately, 110 | 'potential' ... see the text for details.)
It's the job of parents, 111 | caregivers, teachers, etc., to continue improving the students' skills so that 112 | this zone proceeds to the next natural stages. 113 | 114 | Also important: 115 | 116 | - Some cool stuff that infants know: they like to be consistent with numbers, so 117 | if they see groups of two, they relax, but if the next group has three things, then 118 | they'll be more alert and think something's different. Also, physics: infants 119 | somehow are able to tell that things will fall over without supports, and pay 120 | more attention to that (in rigorous experiments). 121 | 122 | - Children can naturally be interested in solving problems; it doesn't always 123 | have to be explicitly forced on them by a teacher. Also, lots of this depends on 124 | culture (again, this is obvious, but good to reiterate). 125 | 126 | Stuff I didn't remember: 127 | 128 | - "Privileged domains": physical and biological concepts, causality, number, and 129 | language. These are domains where infants show _positive_biases_ in learning, 130 | which makes sense from an evolutionary perspective. 131 | 132 | - Precise experimental techniques for detecting infant cues and preferences: 133 | non-nutritive sucking, habituation (i.e., infant "gets used to it" and stops 134 | responding to that cue), and visual expectation. 135 | 136 | - Infants can distinguish between animate and inanimate objects. Also, they're 137 | good at inferring from context. 138 | 139 | - There's a little bit about memory here, might be more in later chapters, but 140 | mostly about the strategy of clustering to improve memory performance. Also 141 | some discussion about how infants vs older children may have different memory 142 | strategies, and strategies get more effective with age (generally). 143 | 144 | 145 | Chapter 5: Mind and Brain 146 | 147 | Very important: 148 | 149 | - The mind is made up of neurons, with synapses and stuff (not going to get too 150 | technical here but you get the idea).
These synaptic connections can be 151 | created and destroyed, and there are generally two ways things can happen: when 152 | they're created in huge swarms and then also removed in equal amounts, kind of 153 | like sculpting (youth) or continual creation through learning by experience 154 | (lifetime). 155 | 156 | - Don't fall for some of the hype you see in popular claims. :-) 157 | 158 | - Some discussion over differences between deaf and hearing ways of learning; the 159 | implication was that areas of the brain can be reshaped through experience. 160 | Also, learning organizes/restructures the brain. 161 | 162 | Also important: 163 | 164 | - Context matters. Different parts of the brain are ready to learn at different 165 | times. 166 | 167 | Stuff I didn't remember: 168 | 169 | - Eh, hopefully got the main points. 170 | -------------------------------------------------------------------------------- /Functional_Programming/week1/week1_notes.txt: -------------------------------------------------------------------------------- 1 | *************** 2 | * Lecture 1.1 * 3 | *************** 4 | 5 | Primary objective: functional programming from first principles, not necessarily 6 | Scala but will learn the language. This is like learning a different programming 7 | paradigm. 8 | 9 | Scala: migration from C/Java to functional programming. Look at programming 10 | with "fresh eyes". Can integrate it with classical programming to give the best of 11 | both worlds. 12 | 13 | Three paradigms: 14 | 15 | - imperative (Java and C), understand via instructions for Von Neumann computers 16 | - functional (Scala, or maybe Haskell is a better example) 17 | - logic 18 | 19 | We want to **liberate** ourselves from John Von Neumann-style programming. John 20 | Backus argued for functional programming. So we must avoid conceptualizing 21 | instruction by instruction (or word by word) and move at a higher-level 22 | abstraction (?). Martin uses polynomial and string examples.
For a polynomial, 23 | you don't want to define a class and be able to suddenly change coefficients 24 | (stored in the polynomial class). That would be wrong for the theory of math 25 | which deals with things like (a+b)x = ax+bx, not just modifying a and b 26 | directly. 27 | 28 | This analogy has some flaws but I think things will be clearer for me later when 29 | I progress. 30 | 31 | Consequence of theory of functional programming: NO MUTATIONS. 32 | 33 | This seems restrictive (no mutable variables, assignments, loops, imperative 34 | control structures) but the focus is on functions, which is easier with 35 | functional programming. Functions here will be "first class citizens" as they 36 | can be defined anywhere, including INSIDE other functions. 37 | 38 | I might check out Martin's book but probably not, I have too much to do, I'll 39 | focus on the lectures. =) 40 | 41 | Martin says functional programming has grown in popularity due to exploiting 42 | parallelism for multi-core and cloud computing. Is that why John Canny chose to 43 | use Scala for BIDMach and BIDMat? And since this is getting so important, I 44 | really have to finish this Coursera course!!! 45 | 46 | *************** 47 | * Lecture 1.2 * 48 | *************** 49 | 50 | (Most of this stuff in the first half of this video is familiar to me.) 51 | 52 | Interactive shell = REPL, read eval print loop. Just do scala, as I know. But 53 | don't use that, just use `sbt console`. 54 | 55 | The "substitution model" is key: all it does is reduce expressions to values, 56 | and this can be applied to all expressions so long as they have no side effects. 57 | This is lambda calculus! Foundation for functional programming. In fact Alonzo 58 | Church showed that it can express all programs, i.e. Turing Complete. I remember 59 | this a little bit. 60 | 61 | Example: an expression like `c++` has a side effect, and cannot be expressed by 62 | the substitution model. That's why we don't have side effects in functional programming.
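The "no side effects => substitution works" point can be illustrated outside Scala too; here's a quick Python sketch of my own:

```python
def square(x):
    # Pure: no side effects, so square(3) can be replaced by 9 anywhere.
    return x * x

counter = 0

def bump():
    # Impure: each call changes external state, so two calls are NOT
    # interchangeable with one call used twice.
    global counter
    counter += 1
    return counter

# Substitution-safe: square(3) + square(3) really equals 2 * square(3).
assert square(3) + square(3) == 2 * square(3)

a = bump() + bump()  # 1 + 2 == 3
b = 2 * bump()       # 2 * 3 == 6, not equal to a: substitution fails
```

In the pure case an expression can be replaced by its value anywhere; with `bump()`, each occurrence matters, which is exactly what breaks the substitution model.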
63 | 64 | To "do" the substitution model by hand, we have to explicitly substitute values 65 | and simplify, following specific rules. We can do this call by value or call by 66 | name. They have trade-offs: former only evaluates function arguments once, 67 | latter means function arguments are not evaluated if parameter is unused 68 | throughout the evaluation. 69 | 70 | *************** 71 | * Lecture 1.3 * 72 | *************** 73 | 74 | This provides more comparisons of CBN vs CBV, particularly with regard to 75 | termination vs non-termination. 76 | 77 | Here's an important "theorem": if CBV terminates, then CBN also terminates, but 78 | *not* vice versa. 79 | 80 | Here's a simple example (pseudocode): 81 | 82 | first(x,y)=x 83 | 84 | first(1, loop) 85 | 86 | Here, CBN terminates because it ignores the loop. However, CBV gets into an 87 | infinite loop. 88 | 89 | Despite this example, Scala uses CBV, but we can enforce CBN using `=>` as they 90 | do in the next example, showing how CBV can "get around that" problem by 91 | treating `y` as a special CBN parameter. 92 | 93 | *************** 94 | * Lecture 1.4 * 95 | *************** 96 | 97 | Conditionals and value definitions, two more "syntax constructs." 98 | 99 | Standard if-else, but used for **expressions** not statements. What does this 100 | mean? I think it means we don't have to write a return statement. Actually 101 | that's a general rule for Scala! Generally, legal Java expression => legal in 102 | Scala. 103 | 104 | Also have reduction rules, etc., such as && and ||. BTW those short-circuit 105 | evaluation, so they don't test the second argument if the first one determines 106 | the answer. 107 | 108 | There's a nice connection with CBV or CBN parameters: **definitions** can be CBV 109 | or CBN. The `def` is by name, the `val` is by value. So `def` is re-evaluated 110 | upon each use, but `val` is evaluated at the point of its initialization. Oh, 111 | nice connection!
=) Note that this defines a loop, but whether it actually loops depends on how 112 | we use it: 113 | 114 | def loop: Boolean = loop 115 | 116 | For `def` we're OK (nothing runs until it's used), but with `val` we loop forever. 117 | 118 | Clever: 119 | 120 | def and(x:Boolean, y:Boolean) = if (x) y else false 121 | 122 | This is without using &&. 123 | 124 | *************** 125 | * Lecture 1.5 * 126 | *************** 127 | 128 | This is about defining square roots using Newton's method, so we have a 129 | non-trivial program. `def sqrt(x: Double): Double = { ... }`. He shows an 130 | example using Eclipse and its "session" functionality which is like a better 131 | version of the Scala command line (heh, like iPython is better than the Python 132 | interpreter). Use packages, even though it's not necessary here, because it 133 | keeps things ordered. 134 | 135 | Scala language note: explicit return types are not generally needed, but for 136 | *recursive* functions, we need them, since otherwise the compiler wouldn't be able to 137 | tell the return type. It's good practice to put the return type even if it's not 138 | needed. 139 | 140 | I see, I understand the code he wrote. Yes, it had problems with small/large 141 | numbers. I naively thought we should take logs and exponentials as needed, but 142 | in fact we only had to normalize our absolute difference so that the epsilon we 143 | chose, 0.001, is of the "appropriate value" rather than something too large or 144 | too small. 145 | 146 | *************** 147 | * Lecture 1.6 * 148 | *************** 149 | 150 | In the last lesson, we defined several methods separately, but we don't want the 151 | user to access any of them except for the `sqrt` function. So we can nest all the 152 | other function definitions **inside** an overall `sqrt` call. He used a *block*, 153 | nesting with curly braces. 154 | 155 | Visibility is what I would expect, i.e.
stuff defined in blocks is not 156 | visible to other blocks, and expressions outside blocks are visible inside them 157 | *so long as* they are not overshadowed (or "over-written") by something inside 158 | with the same name. Yes, pretty obvious. OH, and it makes the square root 159 | function cleaner since we don't have to re-define `x` as a parameter. 160 | 161 | Don't use semicolons unless we want more than one statement, as in: 162 | 163 | val y = x+1; y*y 164 | 165 | To deal with two-line operations, surround with parentheses or write the operator 166 | at the end of the *first* line. But in BIDMach, we don't do that, we just write 167 | long expressions on one line. =) 168 | 169 | *************** 170 | * Lecture 1.7 * 171 | *************** 172 | 173 | Time to wrap up the first week by talking about *tail recursion*. 174 | 175 | But before that, some substitution formalism. (I'm not sure why this is 176 | important.) Then we did re-writing steps with Euclid's gcd function and the 177 | classical (recursive) factorial function. 178 | 179 | Rule: if a function calls itself as its last action, the function's stack frame 180 | can be reused. This is *tail recursion*, i.e. iteration, and it's good because 181 | we can run this in constant space. With classic factorial, we had our last 182 | expression as n*factorial(n-1), meaning that the last action was not a call to 183 | our function; it was a more complicated expression with `n*` there. 184 | 185 | We can require that a function is tail-recursive by adding `@tailrec` on the 186 | line above the method definition. Interesting! 187 | 188 | The last part of the lecture was about designing a tail-recursive version of 189 | factorial. Fortunately, I was able to figure this out. =) 190 | 191 | OK week 1 lectures done. Let's do the assignment.
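The Lecture 1.7 exercise, sketched in Python for reference (the Scala version with `@tailrec` is analogous; note Python itself doesn't reuse stack frames, but the accumulator rewrite is the same idea):

```python
def factorial(n):
    # NOT tail-recursive: the last action is the multiplication `n * ...`,
    # so the stack frame can't be reused.
    return 1 if n == 0 else n * factorial(n - 1)

def factorial_tail(n, acc=1):
    # Tail-recursive: the recursive call is the last action; the running
    # product is threaded through the accumulator `acc`.
    return acc if n == 0 else factorial_tail(n - 1, acc * n)

factorial(5)       # 120
factorial_tail(5)  # 120
```

The accumulator is the standard trick: anything the non-tail version would do "after" the recursive call gets folded into an extra argument instead.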
192 | -------------------------------------------------------------------------------- /Math_104_Berkeley/kenneth_ross_notes.txt: -------------------------------------------------------------------------------- 1 | ******************************************************************************** 2 | * These are notes based on: 3 | * 4 | * Kenneth A. Ross 5 | * Elementary Analysis: The Theory of Calculus 6 | * Second Edition, 2013 7 | ******************************************************************************** 8 | 9 | 10 | ************* 11 | * CHAPTER 1 * 12 | ************* 13 | 14 | I skimmed this chapter and I should know just about everything from it. It 15 | includes: 16 | 17 | - Natural numbers 18 | 19 | - Simple induction 20 | 21 | - Rational numbers (also the definition of an "algebraic number") 22 | 23 | - The "Rational Zeros" theorem, which might be useful if I need to find 24 | candidates for solving certain polynomial equations. This can also be used to 25 | prove that sqrt(2) is not a rational number, and several other numbers, mostly 26 | by doing some brute-force cases for checking all possible solutions. It's a 27 | bit boring to do that! Note: this theorem only applies to finding *rational* 28 | zeros of polynomials with *integer* coefficients. For a more general rule, use 29 | "Newton's method" or the "secant method." 30 | 31 | - The set of real numbers. Now we're getting into real stuff here! We also have 32 | the triangle inequality, blah blah blah ... 33 | 34 | - The Completeness Axiom. This is the assertion that "\mathbb{R} has no gaps" 35 | and is the key factor which distinguishes \mathbb{R} from \mathbb{Q}. (It's 36 | discussed in Section 4.4.) Among other things, this section discusses: 37 | 38 | - The concepts of a minimum, maximum, and slightly more non-trivially, those 39 | of an _infimum_ (greatest lower bound) and _supremum_ (least upper bound).
For the latter two, 40 | I know clearly that sup S and inf S do not have to 41 | belong to S! Classic example: (a,b). I remember doing examples like these 42 | from MATH 305 at Williams College: basically, finding the infimums and 43 | supremums of sets. It's nothing too fancy. Man, I must have been a bad 44 | student back then! 45 | 46 | - The concepts of upper bounds, lower bounds, etc. 47 | 48 | - The completeness axiom (as I mentioned). This does _not_ hold for the 49 | rationals! 50 | 51 | Yeah, nothing too advanced here. I'm happy that at least this material is easy 52 | for me to understand and review. 53 | 54 | - The symbols +infinity and -infinity, which are useful but must be handled with 55 | care. Do not treat them as real numbers that can be plugged into theorems! 56 | Note that it is also discussed that for nonempty, _bounded_ subsets A and B of 57 | \mathbb{R}, sup(A + B) = sup A + sup B and the same relation for infimums. 58 | This might be useful in some statistics proofs if we are dealing with multiple 59 | sets. 60 | 61 | - Useful to define sup S = +infinity if S is not bounded above, etc. 62 | 63 | - The last section is a "Development of \mathbb{R}" and it's probably not that 64 | useful for me. 65 | 66 | 67 | ************* 68 | * CHAPTER 2 * 69 | ************* 70 | 71 | This is about sequences and is hugely critical to understanding the rest of the 72 | book, and for real analysis in general. 73 | 74 | Section 2.7 75 | 76 | - A sequence is just a function from an index to some value. 77 | 78 | - We formally define _limits_, _convergence_, and _divergence_. See the 79 | textbook. I won't belabor the point here. Side note: limits are unique (prove 80 | this by assuming two limits, then showing that |s-t| is less than any epsilon 81 | using the definitions and then the triangle inequality, hence s = t). Side note 2: 82 | oscillations (as in, (-1)^n) do not converge! 83 | 84 | Section 2.8 85 | 86 | - A discussion on proofs!
When proving limits, we should invoke the formal 87 | definition and, given an epsilon, find an N s.t. the definition of a limit holds. 88 | 89 | - There are several interesting examples. I did a few of them quickly. I don't 90 | think I will ever have to invoke these directly any time soon (I'm mostly 91 | reading this section so that the more important parts later are clearer to 92 | me). 93 | 94 | - Exercise 8.5 is interesting, the "squeeze lemma" and I remember Professor 95 | Mihai Stoiciu talking about this during office hours (heh, we never had office 96 | hours _in_ his office since there were so many people!). 97 | 98 | Section 2.9 99 | 100 | - Limit theorems for sequences. I can invoke these pretty easily. I will again 101 | be skimming the proofs. 102 | 103 | - Oof, there's a lot of them. Mostly they involve similar techniques such as 104 | working backwards and solving for the tightest bounds, so we get the lowest 105 | value N such that the statement: "when n > N we get |s_n - s| < epsilon" is 106 | true. We have to sometimes develop upper bounds, and often have to use epsilon 107 | times some constant so that the later algebra gets it equal to epsilon. I've 108 | seen this stuff many times. 109 | 110 | Section 2.10 111 | 112 | Monotone Sequences and Cauchy Sequences. These help us conclude convergence of 113 | sequences _without_ knowing limits in advance. 114 | 115 | - Monotone sequences are those which are always increasing or always decreasing. 116 | They _can_ converge, if the rate of increase (respectively, decrease) slows to 117 | zero, think of 1/x for x>0 as x grows large. 118 | 119 | - Important Theorem I (10.2 in the book): All bounded monotone sequences 120 | converge. 121 | 122 | - Proof: let u be the supremum of the bounded (say, increasing) sequence; then 123 | we just show lim s_n = u. We start by fixing an epsilon (as usual) then we have to find 124 | some N such that for all n > N, we get |s_n - u| < epsilon.
Well, (s_n) is 125 | increasing so we just need to find an N so that u-epsilon < s_N; then for all 126 | n > N, monotonicity gives u-epsilon < s_N <= s_n <= u, which proves the statement. Yay! The proof is 127 | short and elegant. Again, it just relies on proving the limit statement!! 128 | 129 | - There's a related theorem which shows that if the sequences are unbounded, 130 | then, well they converge to infinity or minus infinity. (This is assuming 131 | monotone, because otherwise you can have oscillations to infinity, which 132 | would mean something different I guess.) Thus, limits of monotone sequences 133 | always have meaning. 134 | 135 | - Important Theorem II (10.11 in the book): a sequence is a convergent sequence 136 | IFF it is a Cauchy sequence. 137 | 138 | - Proof: well, they did one direction earlier and it makes sense. The other 139 | direction also makes sense. In both cases we simply start with the 140 | definition and try to prove the property. They can be tricky to come up with. 141 | Mostly it's about making sense of sup-s and thinking of "stuff plus 142 | epsilon." 143 | 144 | - Uses Definition 10.8 which defines a _Cauchy_sequence_: a sequence has this 145 | property if for each epsilon > 0 there exists N such that (m,n) both greater 146 | than N implies |s_n - s_m| < epsilon. 147 | 148 | - Why is it useful? Because we can confirm that a sequence converges by 149 | verifying that it satisfies the Cauchy sequence property. We do not have to 150 | explicitly compute a limit in this case! 151 | 152 | - There's an interlude about discussions of decimals, but it's not likely to be 153 | much of concern to me. Don't forget about the geometric series convergence 154 | formula! The sum of r^n over n >= 0 is 1/(1-r), valid for |r| < 1. 155 | 156 | - There is also discussion on lim sup and lim inf. A sequence has a limit if and 157 | only if its `lim inf` and `lim sup` are equal.
Also, lim sup is NOT
  generally sup{s_n for all n}: as N grows large, the set of elements
  {s_n : n > N} we take the sup over gets smaller, hence the correct
  relationship is lim sup <= sup. Also, it's these lim inf and lim sup concepts
  which motivate the Cauchy sequence definition (see my notes above).

Section 2.11

Subsequences!!

- I know the definition, obviously. You can also view it as defined by a
  "selection function." This point of view is probably useful if you are trying
  to _extract_ "interesting" indices within the overall sequence.

- IMPORTANT: Theorem 11.2. This states three facts about subsequences.

(I don't quite follow?)

Section 2.12

TODO


*************
* CHAPTER 3 *
*************

TODO


*************
* CHAPTER 4 *
*************

TODO


*************
* CHAPTER 5 *
*************

TODO


*************
* CHAPTER 6 *
*************

TODO


*************
* CHAPTER 7 *
*************

TODO
-------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter05notes.txt: --------------------------------------------------------------------------------
***********************************************
* NOTES ON CHAPTER 5: Machine Learning Basics *
***********************************************

Again, I expect that this will be almost entirely review. Here is some stuff
which I didn't already have down cold:

- The chapter starts off with Tom Mitchell's famous definition of machine
  learning, and then it goes through examples of tasks, experiences, and
  performance metrics. There isn't a whole lot new here.
Maybe a good insight is
  to think of the tasks of (a) density estimation and (b) synthesis/sampling
  (e.g. with GANs) as the task of modeling densities explicitly (a) versus
  implicitly (b). Then for experiences, the key is to understand unsupervised
  vs. supervised learning, but the line between the categories is blurred, and I
  like their examples of how the problems can be converted to each other
  (Equations 5.1 and 5.2). Think of unsupervised as estimating p(x), supervised
  as estimating p(y|x), since we have our labels y in the latter case. They use
  linear regression as an example, and the "learning algorithm" consists of
  literally solving the normal equations. One step, no iterative updates!

- We can use statistical learning theory to tell us how algorithms generalize.
  It's easiest if we assume IID; then the train/test errors are equal under
  expectation **if we choose a random model**, i.e. random weights. In general,
  though, we optimize the training error, and **then** test, so the test error
  is at least as high as the training error. The two central factors
  contributing to under/over-fitting are (1) training error, (2) the gap between
  training and testing error. (This is covered again later in Chapter 11 on
  practical usage.) We can partially control under/over-fitting by controlling a
  model's **capacity**. E.g., for linear regression, add higher-order terms and
  capacity increases, but overfitting occurs with more parameters than examples.

- Quantifying model capacity with classical measures, such as VC dimension, is
  rarely used in Deep Learning.

- We can also think of **non-parametric** models as having arbitrarily high
  capacity. However, practical algorithms will rely on some form of constraints,
  e.g. nearest neighbors' complexity depends on the data.

- **Expected** generalization error can never increase as the amount of
  training data increases.
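The normal-equations "one step, no iterative updates" point above can be
checked in a couple lines of numpy (my own toy data, not from the book):

```python
import numpy as np

# Toy data with a bias column appended: exactly y = 3*x + 2.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])        # feature column, bias column
y = np.array([5.0, 8.0, 11.0])

# Normal equations: solve (X^T X) w = X^T y.  One linear solve, no iteration.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # [3. 2.]
```

Since the targets are noise-free and the problem is tiny, the solve recovers
the slope and intercept exactly.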
- Use **weight decay** (i.e. L2 regularization) to prefer lower magnitude weight
  vectors as solutions.

- With hyperparameters, don't tune them on the training data because that will
  cause preference towards overfitting. Tune on **validation sets**. If our data
  is too small, **use k-fold cross validation** to get better estimates of
  generalization error.

- With the bias/variance discussion, don't forget that the sample variance (for
  Gaussians) is actually **biased**; we need the n-1 correction for the
  **unbiased** version.

- Don't forget the difference between **variance** and **standard error** w.r.t.
  **an estimator**. Here, the standard error is the square root of the variance,
  and both are computed based on empirical data (which is why I don't think we
  call it "standard deviation"). They say:

  > Unfortunately, neither the square root of the sample variance nor the square
  > root of the unbiased estimator of the variance provide an unbiased estimate
  > of the standard deviation. Both approaches tend to underestimate the true
  > standard deviation, but are still used in practice. The square root of the
  > unbiased estimator of the variance is less of an underestimate. For large m,
  > the approximation is quite reasonable.

  We use standard error often when writing out confidence intervals.

  They argue that increasing model capacity (at least under MSE for computing
  generalization error) generally increases **variance** but decreases **bias**.
  The reason is that variance here is based on samples where the "samples" are
  in fact training data sets. (The training set **is** the random variable,
  according to their Equation 5.47 definition.) Thus, with a new sample of the
  training data, we'll get different results since the model overfits. But under
  **expectation** over all draws of training datasets, the bias is low.
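The "sample variance is biased" point above is easy to check numerically (my
own sketch; the n vs. n-1 divisors are numpy's `ddof=0` vs. `ddof=1`):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                    # small samples make the bias visible
# Draw many size-n samples from N(0, 2^2), so the true variance is 4.
samples = rng.normal(0.0, 2.0, size=(200000, n))

# Average each estimator over all the samples to approximate its expectation.
biased   = samples.var(axis=1, ddof=0).mean()   # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()   # divides by n-1

print(biased, unbiased)
# biased is roughly (n-1)/n * 4 = 3.2, while unbiased is roughly 4
```

With 200k replications the averages sit within a few hundredths of 3.2 and
4.0, matching the (n-1)/n shrinkage factor exactly in expectation.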
74 | 75 | - How did we **obtain** the estimators we just talked about? It's simple, MLE. 76 | And before reading Goodfellow's tutorial on GANs, I don't think I viewed MLE 77 | as minimizing a KL divergence. This is yet another reason why we like it. 78 | Another reason is, as I know from the AI prelims review, the MLE view of 79 | **conditional** log likelihood, where p(y|x) is modeled as a Gaussian, results 80 | in the same solution (obtained via maximizing likelihood) as the linear 81 | regression case with MSE loss. 82 | 83 | - Then the chapter talks about **Bayesian statistics**. To measure uncertainty 84 | of the estimator, the Frequentist approach uses the variance, but the Bayesian 85 | approach suggests to integrate instead. I also remember their example with 86 | Bayesian linear regression, we have to combine p(y|X,w)*p(w) but those are 87 | both exponentials and they multiply to result in another exponential which can 88 | be rearranged in the form of another Gaussian. If we want a single point 89 | estimate instead of a distribution, use **MAP estimates**. But why not just do 90 | the Frequentist MLE approach? Because MAP estimates retain *some* benefit of 91 | the Bayesian approach. That's the intuition, I guess. 92 | 93 | - Review: 94 | 95 | theta_MAP = argmax_theta p(theta|x) 96 | \propto argmax_theta p(theta)p(x|theta) 97 | = argmax_theta log p(theta) + log p(x|theta) 98 | 99 | (and for the MLE Gaussian, Frequentist case) 100 | 101 | theta_ML = argmax_\theta \prod_y p(y|x,theta) 102 | = argmax_\theta \sum_i \log p(y_i|x_i,\theta) // These are Gaussians 103 | 104 | - **Supervised Learning Algorithms**. The authors start by generalizing linear 105 | regression into logistic regression, as expected. Not much new here. With 106 | logistic regression, we no longer have a closed-form solution for the optimal 107 | weights, which is why gradient descent helps. 108 | 109 | - PS: Don't forget **SVMs**. 
I've forgotten some of it due to its lack of 110 | exposure in Deep Learning. The key innovation here is the kernel trick, of 111 | course (helps us model nonlinear x, and highly efficient). The SVM function 112 | is nonlinear w.r.t. the data, but it's **linear** w.r.t the coefficients 113 | \alpha. The \alpha here is mostly zeros, so as to reflect only points on the 114 | boundary close to the current sample of interest. 115 | 116 | - But note that SVMs and kernel machines in general struggle to generalize 117 | well, and Deep Learning is precisely designed to improve upon that. 118 | 119 | - Another common algorithm, **k-nearest neighbors**. In fact, there is not 120 | even a training or a learning stage for this (nonparametric) method. Yet 121 | another one, **decision trees**. 122 | 123 | - Note, p.144 missing a figure in my PDF version? TODO check. 124 | 125 | - **Unsupervised Learning Algorithms**. Examples: PCA and K-Means Clustering. 126 | PCA can be viewed as a data compression algorithm, or one which learns a 127 | "useful" representation of data (perhaps as "simple" as possible, to identify 128 | independent sources of variation which capture the essence of the data). This 129 | means using PCA to transform the data so that the covariance matrix of the 130 | transformed data is a diagonal matrix. PCA: 131 | 132 | > This ability of PCA to transform data into a representation where the 133 | > elements are mutually uncorrelated is a very important property of PCA. It 134 | > is a simple example of a representation that attempts to disentangle the 135 | > unknown factors of variation underlying the data. 136 | 137 | Then there's k-means, which learns a one-hot encoding for each sample. This is 138 | a bit extreme, though. The learning, of course, works like EM. 139 | 140 | - Stochastic Gradient Descent. The main workhorse of Deep Learning! 
It helps
  that our cost functions naturally decompose into a sum over training examples
  with per-sample loss (and taking the empirical mean of those, so it's an
  expectation!!!). Thus, take a minibatch sum of those terms. In fact, we can
  often converge to a good solution even without touching every element in the
  dataset (i.e. less than a single pass).

- Section 5.11, which focuses specifically on Deep Learning challenges. DL helps
  to deal with the curse of dimensionality (PS: nice visuals in Figure 5.9!).
  They also help with local constancy and smoothness, meaning that we want f(x)
  to be approximately f(x+eps). Most classical algorithms try to follow this
  implicit prior, but the problem is that it doesn't scale to larger datasets
  because it requires enough examples to observe the data space. With DL, we try
  and introduce dependencies among different regions, using a "composition of
  factors". See Chapters 6 and 15 for this. Oh yeah, this is the idea of DL with
  hierarchies of features ... I can see where this is going.

  The last bit here is about manifold learning. We use it informally in machine
  learning to indicate a set of points that are well-connected or associated
  with each other in a lower-dimensional space. With high dimensions, it's
  essential to assume that most points in R^n are invalid. The authors argue
  that this is the case in terms of images, sounds, and text. For instance,
  uniformly sampling points in image space results in static, and random
  words/letters mean gibberish instead of interesting sentences. It would be
  great if learning algorithms could *discover* these manifolds. In fact, GANs
  help us with that!

  (This is a bit hand-wavy; make sure to re-read this section if I want to
  refresh my memory.)
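To make the SGD point concrete, here's a minimal minibatch loop on least
squares (my own toy sketch, not from the book; the data is noise-free so the
loop should recover the true weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                    # noise-free targets

w = np.zeros(3)
lr, batch = 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    # Gradient of the minibatch *mean* of per-example squared errors --
    # an unbiased estimate of the full-dataset gradient.
    grad = (2.0 / batch) * Xb.T @ (Xb @ w - yb)
    w -= lr * grad

print(w)  # close to w_true
```

Note the loop touches roughly 500 * 32 = 16,000 example slots drawn from 1,000
points; convergence here doesn't require any careful pass structure at all.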
168 | -------------------------------------------------------------------------------- /Random/AWS_Notes.txt: -------------------------------------------------------------------------------- 1 | ----------------------- 2 | - AMAZON WEB SERVICES - 3 | ----------------------- 4 | 5 | **************** 6 | * May 11, 2017 * 7 | **************** 8 | 9 | I promise, I will learn how to use AWS so that I can finally run code in 10 | clusters instead of running pseudo-parallel code on my personal workstation. 11 | 12 | First, a few pointers, definitions, etc: 13 | 14 | - Be careful! Don't run code for no reasons. This uses up resources. It's not 15 | like my personal machine where I can pound it for no reason. Again, be 16 | careful. Also, be mindful of the location of the actual computing resources 17 | I'm using. 18 | 19 | - Amazon Web Services (AWS). It seems like I can use this just by using my 20 | normal Amazon account. It provides a number of services for cloud computing, 21 | which lets me use lots of computing power via the Internet, so long as we 22 | pay an amount commensurate with our usage level. See also: 23 | 24 | > Cloud computing provides a simple way to access servers, storage, databases 25 | > and a broad set of application services over the Internet. A Cloud services 26 | > platform such as Amazon Web Services owns and maintains the 27 | > network-connected hardware required for these application services, while 28 | > you provision and use what you need via a web application. 29 | 30 | (Cloud computing is really a marketing term ... don't put too much thought 31 | into it. Just think of it as a way for me to access lots of resources without 32 | having to buy them online, assemble my workstation, tell Berkeley to hook them 33 | up to the Internet, etc. I have one desktop that took me a while to set up; a 34 | server with many machines would take a lot longer to set up.) 35 | 36 | - Amazon Elastic Compute Cloud (EC2). 
These "EC2 Instances" are "virtual
  machines" that AWS provides, i.e. EC2 is a component of AWS. It seems to be
  an example of "Infrastructure as a Service" (IaaS).

- Amazon Machine Images (AMI). These are virtual machine images. I can use these
  to launch stuff within EC2. Don't forget to keep the key-pair! I think the
  point with cloud computing is that we can pick and choose which images match
  our desired specs and then "run them." To connect to these, use the good
  old-fashioned ssh. There are community-provided AMIs which I assume are from
  people/groups around the world who are letting us use their machines in
  exchange for payment. There are also marketplace AMIs, which are verified by
  AWS.

- Google Cloud. I don't think I need to use this? It seems to be an alternative
  to Amazon Web Services. Once I have a Google Cloud account, I can create
  Google Compute Engines (GCEs) to run code, and even use Jupyter Notebooks for
  those which I can access in my local browser. For GPUs, I need to send in
  special requests.

See the following for a comparison between these two:

http://cloudacademy.com/blog/google-cloud-vs-aws-a-comparison/

The AWS website has lots of tutorials. I will check those tomorrow.

Python libraries to know/learn:

- boto (or boto3?)
- redis
- multiprocessing
- click

I've only "used" multiprocessing before ... and it didn't work for me. Also,
click seems to be more for command line arguments instead of distributed
systems. It seems to be an alternative to argparse ... yeah, I better check that
out! It might take up the subject of my next blog post.


****************
* May 12, 2017 *
****************

I went through this 10-minute tutorial: "Launch a Linux Virtual Machine".
Highlights:

- After clicking "Launch Instance", I get to the familiar AMI page. Think of
  this as a place to choose my desired computer specs. (Note: to avoid
  confusion, this is what happens when we're at the AWS console; there is
  another "Launch Instance(s)" button that happens later, once I'm actually
  ready to do something.)

- The tutorial uses a "General Purpose Instance" which should probably be my
  default choice for applications, unless I have a pressing reason to use
  something else. It also automatically clicks the "free tier eligible" image.

- Wow, there is a LOT of stuff on the AWS Interface. Getting used to the GUI
  will take a while, but I at least know how to see my instances.

- I can connect to my instance using:

  ssh -i ~/.ssh/MyKeyPair.pem ec2-user@{IP_Address}

  The IP address can be found on the AWS interface. This puts me in the
  `/home/ec2-user` folder on an instance, and it looks like I'm the only user.
  Huh, that's interesting, I thought this was going to be a shared machine with
  loads of users. Looks like `python` is installed, but not `ipython`. Argh.

- I terminated the instance, and I got this message:

  Broadcast message from root@ip-[IP CENSORED]
  (unknown) at 16:55 ...

  The system is going down for power off NOW!
  Connection to [IP CENSORED] closed by remote host.
  Connection to [IP CENSORED] closed.

  Interesting ... if we did *not* terminate the instance (but it was idle) then
  we still get charged. I didn't get charged (I hope not ...).


Another potential resource:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

"Setting Up":

- I see, this is why I didn't need a password:

> AWS uses public-key cryptography to secure the login information for your
> instance.
> A Linux instance has no password; you use a key pair to log in to your
> instance securely. You specify the name of the key pair when you launch your
> instance, then provide the private key when you log in using SSH.

- There's some stuff about "Virtual Private Clouds" and "Security Groups," but
  I'm not sure I understand or if it's that important right now. Think of those
  as firewalls, maybe? Yeah, the EC2 console says security groups control access
  to the instance.


"Getting Started":

- This is basically the same as the 10-minute tutorial. They also tell us how to
  connect with a browser. That might be inconvenient, but maybe not, if we're
  running on 1000 machines. But how do we run code using this? There must be
  some command line?

- Oh, here's what they say about termination:

  > Terminating an instance effectively deletes it; you can't reconnect to an
  > instance after you've terminated it.

  I see. On the EC2 console, I can't seem to re-start that instance I created in
  that 10-minute tutorial. There is, however, a difference between STOPPING an
  instance versus TERMINATING an instance. The former lets me reuse the instance
  at some point later (and it doesn't charge me for the stopping period, though
  there IS a charge for storage ... look at their description about this).


For billing, see:

http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html

A few pointers:

- To see billing on the dashboard, click my name, and then the billing dashboard
  setting. It should be intuitive.

- Try to use the free tier to test things:

> You can test-drive some AWS services free of charge, within certain usage
> limits. AWS calls this the AWS Free Tier.
> The free tier is designed to give you hands-on experience with a range of
> AWS services at no charge. For example, you can explore AWS as a platform
> for your business by setting up a test website with a server, alarms, and
> database. You can also try out services for developers, such as AWS
> CodePipeline, AWS Data Pipeline, and AWS Device Farm.

- Actually, looks like I'm not on the free tier since I had made the account in
  November 2015 despite NOT EVER USING IT ...


For running on *clusters*, see:

http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html


****************
* May 28, 2017 *
****************

OK, I managed to finally make a new account, so I get the one-year free tier
award. Let's see how that works out for me. Now let me try Jonathan Ho's
Evolution Strategies code. How do we use Packer again?

Packer might be useful for running on clusters. This helps me create identical
machine images (i.e. AMIs) so that the nodes in a cluster are running and using
the same stuff/settings. It's installed on my station. Use `.json` files for
building images (be careful about expenses!). These are configuration files to
allow us to specify various settings about the image(s) we want to build. Run

`packer build XXX.json`

to build it. However, I think this requires two keys from AWS, which I can
obtain online. I think I can just make them for me personally.
They recommend 207 | creating keys separately for IAM users, but that seems to be more helpful for 208 | organizations with many users (kind of like computers with user accounts). 209 | 210 | NOTE: IAM = "Identity and Access Management." 211 | 212 | After running Packer's examples with my provided keys, I have a **snapshot**. It 213 | was a bit tricky to find. I had to search in the US-east region (N. Virginia), 214 | not the US-west region (N. California). Then click on "Snapshots" and I can see 215 | my AMI. This is **my** AMI, actually. So I'll get charged! 216 | 217 | In addition, assuming I'm in the right region, when I launch an instance, I can 218 | go to "My AMIs" and I will see that image right there. (It doesn't work if I'm 219 | using N. California, so the lesson is that one needs to be aware of what regions 220 | were used!) 221 | 222 | To be clear, what got created out of this configuration file was NOT an 223 | "Instance," but it seems to be either an "Image --> AMIs" or an "Elastic Block 224 | Store --> Snapshots." Strangely, I see something underneath both of those menu 225 | options ... I'm not sure what's the difference. They seem to be similar, except 226 | AMIs are, I assume, something that's representative of a full system, whereas 227 | the snapshots are backups of those ... yeah, it's not clear. Maybe check this: 228 | 229 | https://serverfault.com/questions/268719/amazon-ec2-terminology-ami-vs-ebs-vs-snapshot-vs-volume? 230 | 231 | Snapshots and Volumes should be subsets or types of EBSs, which themselves look 232 | like hard drives. Volumes are pieces and bits of EBSs, and Snapshots are 233 | captures (i.e. copies) of volumes at specific times. 234 | 235 | I *think* I have an idea of what an image means. I mean, with CS 231n, they 236 | provide an image with specialized GPU and Deep Learning stuff. That's with the 237 | "Community AMIs" of course. 
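For my own reference, a minimal Packer template for building an AMI looks
roughly like this (reconstructed from memory of Packer's intro example, so
treat the exact field names as unverified; the keys come in as variables and
`ami-XXXXXXXX` is a placeholder for a real source AMI ID):

```json
{
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": ""
  },
  "builders": [{
    "type": "amazon-ebs",
    "access_key": "{{user `aws_access_key`}}",
    "secret_key": "{{user `aws_secret_key`}}",
    "region": "us-east-1",
    "source_ami": "ami-XXXXXXXX",
    "instance_type": "t2.micro",
    "ssh_username": "ec2-user",
    "ami_name": "packer-example {{timestamp}}"
  }]
}
```

Running `packer build` on a file like this is what produced the AMI/snapshot
pair I was poking at in the console.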
238 | 239 | From Packer: 240 | 241 | > After running the above example, your AWS account now has an AMI associated 242 | > with it. AMIs are stored in S3 by Amazon, so unless you want to be charged 243 | > about $0.01 per month, you'll probably want to remove it. Remove the AMI by 244 | > first deregistering it on the AWS AMI management page. Next, delete the 245 | > associated snapshot on the AWS snapshot management page. 246 | 247 | I just did both of those. 248 | -------------------------------------------------------------------------------- /Deep_Learning/dlbook_chapter10notes.txt: -------------------------------------------------------------------------------- 1 | **************************************************************** 2 | * NOTES ON CHAPTER 10: Recurrent and Recursive Neural Networks * 3 | **************************************************************** 4 | 5 | I need to understand the parameter sharing and how RNNs (and their variants) can 6 | be "combined" into other areas. The parameter sharing is key, as it allows for 7 | *generalization*. CNNs share parameters with the weight filters across the 8 | grids; RNNs share parameters across timesteps. 9 | 10 | Quick note: I think they're using minibatch sizes of 1 to simplify all notation 11 | and exposition here. That's fine with me. Think of x as: 12 | 13 | [ x^1 x^2 ... x^T ] 14 | 15 | where superscripts indicate time. Note that each x^i itself could be a vector! 16 | 17 | Section 10.2, Recurrent Neural Networks 18 | 19 | It's important to understand the *computational graphs* involved with RNNs. I 20 | understand them as directed acyclic graphs, so how does this extend with 21 | recurrence? It's easier to think of them when we unroll (i.e. "unfold") the 22 | computational graphs. See Figure 10.2 as an example (I was able to get this 23 | without looking at the figure). They also use a more succinct "recurrent graph" 24 | representation. 
RNN Design Patterns, also kind of described in Andrej Karpathy's blog post:

- Producing an output at each time step, and having recurrent connections
  between hidden layers. This is Figure 10.3, which I correctly predicted in
  advance minus the loss and y stuff. They have losses for *each* time step.
  Note the three matrix multiplies that are there, with the *same* respective
  matrices repeated across time. Also, we're using the softmax, so assume the
  output is discrete at each time step, e.g. o(t) could be the categorical
  distribution over the 26 letters in the alphabet.

- Same as above, except recurrent connections are from outputs to hidden layers,
  so we still have three matrices but the "arrows" in the computational graph
  change. This is *less powerful*. Why?? Think: the former allows hidden to
  hidden, so the hidden stuff can be very rich. The latter only lets information
  flow from hidden to output to hidden, and the output (which is trained to
  match targets) may be less rich. That seems intuitive.

- Same as the first one (hidden to hidden connections) except we now have one
  output. That's useful for summarizing, such as if we're doing sequence
  classification.

Now develop the equations, e.g. f(b + Wh + Ux) where h is from the *previous*
time step and x is the *current* time step, and f is the *activation* function.
Yes, it's all familiar to me. They mention, though, that backpropagation is very
expensive. They call the naive way (applying it on the unrolled computational
graph) "backpropagation through time."

How to compute the gradient? They give us an example, thank goodness. Comments:

- Note that L = L(1) + L(2) + ... + L(\tau) so yes, dL/dL(t) = 1 for all t. Each
  L(t) is a negative log probability for that output at that time.
57 | 58 | - The next equation (10.18) also makes sense, here i is the component in the 59 | vector, so we're in the univariate case. 60 | 61 | - Equation 10.19 is good, keep in mind that here we have to be careful with the 62 | timestep. For other h(t), we need to add two gradients due to two incoming 63 | terms (because of two *outgoing* terms in the *forward* pass). Thus, the 64 | matrices V and W will be present in some form. 65 | 66 | - The next part about using dummy variables for t is slightly confusing but it 67 | should just mean that the total contribution for these parameters are based on 68 | their sum across each time. Yeah, looking at the book again it's just a 69 | notation issue to help us out. For all those gradients, we have a final sum 70 | over t, where each term in the sum is a matrix/vector of the same size as the 71 | variable we're taking the gradient w.r.t. 72 | 73 | PS: when reading this, don't be confused by the notation. Look at the "notation" 74 | chapter online. 75 | 76 | RNNs as directed graphical models? This section is about expressing them as 77 | well-defined directed graphical models, and there are a few subtleties. This is 78 | WITHOUT any inputs, BTW ... probably just for intuition? 79 | 80 | They go through an example predicting a sequence of scalars. With the naive 81 | unrolled (directed) graphical model, we're applying the chain rule of 82 | probability and so it's very inefficient. RNNs provide better (in many metrics, 83 | but particularly efficiency) ways to express such distributions with directed 84 | graphical models by introducing deterministic connections (remember, the hidden 85 | states are deterministic). 86 | 87 | With RNNs, parameter sharing is a huge advantage, but the downside is that 88 | optimizing is hard because we make a potentially strong assumption that at each 89 | time step, the distribution embedded in the RNN remains stationary. 
The last bit here to get it into a well-defined graphical model is to figure out
the length of the RNN. The book presents three options, all of which seem
obvious (though I'm ignoring lots of details, etc.).

The next subsection (10.2.4) after this is about the more realistic setting of
having x (input), so we're also modeling p(y|x). I think it's trying to stick
with the graphical model setting. Also, note that the second option in the list
of three things is what we did in CS 231n, Assignment 3, with the image
captioning portion. Actually, the first option would seem better, which
translates the input image to a vector as input to *all* hidden states, but
that's harder to implement.

I was quite confused about Figure 10.9, as to why we are considering the y(t)s
as inputs?? However, it seems like it's because we want to model p(y|x) and,
well, y is the ground truth. I'm just having trouble translating this to code,
or maybe that's not what I should be doing, and instead just think of it as a
graphical model? To think of it as code, I'd need the other case we had earlier
where the *output* or *hidden state* was the input to the hidden state, not the
actual target (which is to be compared with the output).

Section 10.3: Bidirectional RNNs

Bidirectional RNNs help us model the output y(t) when that output may also
*depend on future times* t+1, t+2, etc., such as with speech recognition where
we need to peek ahead a bit. Don't use a fixed window, though, they say:

> This allows the output units o(t) to compute a representation that depends on
> both the past and the future but is most sensitive to the input values around
> time t, without having to specify a fixed-size window around t.

Nice!
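A tiny numpy sketch of this (my own toy code with made-up sizes, not from the
book; each direction is just the usual h(t) = tanh(b + W h(t-1) + U x(t))
recurrence):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 4, 8, 6
xs = rng.normal(size=(T, n_in))        # one input sequence, minibatch of 1

def make_params():
    # Each direction gets its own U (input->hidden) and W (hidden->hidden).
    return (rng.normal(scale=0.1, size=(n_hid, n_in)),
            rng.normal(scale=0.1, size=(n_hid, n_hid)),
            np.zeros(n_hid))

def run_rnn(seq, U, W, b):
    # Plain tanh RNN: h(t) = tanh(b + W h(t-1) + U x(t)); same matrices
    # reused at every step (the parameter sharing again).
    h, hs = np.zeros(n_hid), []
    for x in seq:
        h = np.tanh(b + W @ h + U @ x)
        hs.append(h)
    return np.array(hs)

h_fwd = run_rnn(xs, *make_params())              # h_fwd[t] has seen x(1..t)
h_bwd = run_rnn(xs[::-1], *make_params())[::-1]  # h_bwd[t] has seen x(t..T)

# o(t) would be computed from the concatenation [h_fwd[t], h_bwd[t]], which
# sees the whole sequence but is most sensitive to inputs near time t.
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)
print(h_bi.shape)   # (T, 2 * n_hid) = (6, 16)
```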
122 | 123 | Section 10.4: Encoder-Decoder Sequence-to-Sequence Architectures 124 | 125 | Use these to avoid the restriction of fixed sequence sizes for the inputs x (or 126 | x(t)). This is their main benefit/innovation, the lengths n_x and n_y (see 127 | Figure 10.12 if confused on this notation) **can vary**; if the training 128 | data consists of a bunch of sequences that are of similar or different lengths, 129 | the RNN will learn to mirror that training data. Side note: the first relevant 130 | paper on this (from 2014) called it "Encoder-Decoder" while the second one 131 | called it "Sequence-to-Sequence". I skimmed that second one, from Sutskever et 132 | al, NIPS 2014 last year, though maybe I should re-read it. Both papers are 133 | highly-cited. 134 | 135 | Connection with Section 10.2.4: we have a fixed-sized context vector C (well, 136 | usually) coming out of the encoder. Well, C is input to the decoder, and this is 137 | *precisely* the vector-to-sequence RNN architecture we talked about in that 138 | sub-section! 139 | 140 | How can the encoder deal with varying sizes n_x? If you think about it, it's 141 | just applying the RNN update over and over again to produce a fixed hidden state 142 | of the same size. At time t, we have processed x(1),...,x(t), and have hidden 143 | state h(t). (We're ignoring the earlier hidden states for simplicity.) Then the 144 | next time t+1, let's say that's our last one. Then we pass in h(t+1). So there's 145 | no issue with getting different sized inputs, because all that matters is (a) 146 | that we can repeatedly apply the RNN update, which is a for loop over the input 147 | sequence, and (b) that we take a fixed sized input to the decoder, which we can 148 | do with our final hidden state! 
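The "for loop over the input sequence, then hand off a fixed-size context"
point can be sketched in numpy (my own toy code; sizes made up, no training,
and the decoder is kept trivial):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 8

# Encoder parameters, reused at every time step.
U = rng.normal(scale=0.1, size=(n_hid, n_in))
W = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)

def encode(xs):
    # Any input length n_x works: just keep applying the same update.
    # The final hidden state is the fixed-size context C.
    h = np.zeros(n_hid)
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)
    return h

# Different input lengths, same-sized context.
C_short = encode(rng.normal(size=(2, n_in)))   # n_x = 2
C_long  = encode(rng.normal(size=(9, n_in)))   # n_x = 9
print(C_short.shape, C_long.shape)             # both (8,)

# Decoder: another RNN seeded with C (here simply as its initial hidden
# state), unrolled for n_y steps; n_y need not equal n_x.
W_dec = rng.normal(scale=0.1, size=(n_hid, n_hid))
h, ys = C_long, []
for _ in range(4):                             # n_y = 4
    h = np.tanh(W_dec @ h)
    ys.append(h)
print(len(ys))                                 # 4
```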
149 | 
150 | Section 10.5: Deep Recurrent Neural Networks
151 | 
152 | In all likelihood, I will not be dealing with these, but it might be worth
153 | knowing how deep we can go with RNNs, just like how I learned about the very
154 | deep GoogLeNet and the **ultra** deep ResNet. When we talk about depth, we mean
155 | adding more layers (w.r.t. the unrolled graph perspective) to the three
156 | components: input to hidden, hidden to hidden, and/or hidden to output. This
157 | might make learning hard, so one option is to introduce skip connections like
158 | in ResNets (man, I'm glad I reviewed ResNets).
159 | 
160 | Section 10.6: Recursive Neural Networks
161 | 
162 | Recursive Neural Networks, which we **do not** abbreviate as RNN, are a
163 | generalization of RNNs with a different computational graph "flavor" that looks
164 | like a tree rather than a chain.
165 | 
166 | Section 10.7: Challenge of Long-Term Dependencies
167 | 
168 | Why is it hard? Here are some relevant quotes:
169 | 
170 | > The basic problem is that gradients propagated over many stages tend to either
171 | > vanish (most of the time) or explode (rarely, but with much damage to the
172 | > optimization). [...] the difficulty with long-term dependencies arises from
173 | > the exponentially smaller weights given to long-term interactions (involving
174 | > the multiplication of many Jacobians) compared to short-term ones. [...]
175 | > Recurrent networks involve the composition of the same function multiple
176 | > times, once per time step. These compositions can result in extremely
177 | > nonlinear behavior, as illustrated in figure 10.15.
178 | 
179 | This section describes the problem, and the subsequent sections (I assume 10.8
180 | through 10.12, judging from the LSTMs here) describe ways to solve it.
181 | 
182 | They present a simplified analysis with matrix eigendecomposition, where we
183 | assume no activations.
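A quick numerical illustration of that simplified no-activation analysis (my
own demo; I take the weight matrix to be diagonal so the eigenvalues are just
the made-up diagonal entries):

```python
# Linear recurrence h(t+1) = W h(t) with no activation. With W diagonal, each
# component just gets multiplied by its eigenvalue at every time step.
eigs = [1.1, 0.9]   # one eigenvalue above 1, one below (made up)
h = [1.0, 1.0]
for _ in range(100):
    h = [lam * hi for lam, hi in zip(eigs, h)]

# After 100 steps: 1.1^100 ~ 1.4e4 (explodes), 0.9^100 ~ 2.7e-5 (vanishes).
print(h)
```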
Then yes, gradients can explode if eigenvalues are
184 | greater than one or vanish if they are less than one (in magnitude). Andrej
185 | Karpathy said something similar in his Medium blog post (why does he bother with Medium?).
186 | 
187 | No free lunch:
188 | 
189 | > One may hope that the problem can be avoided simply by staying in a region of
190 | > parameter space where the gradients do not vanish or explode. Unfortunately,
191 | > in order to store memories in a way that is robust to small perturbations, the
192 | > RNN must enter a region of parameter space where gradients vanish (Bengio et
193 | > al., 1993, 1994).
194 | 
195 | It's a bit annoying that we are simplifying here by ignoring the activation
196 | functions, but I guess Bengio's old papers address activation functions?
197 | 
198 | Section 10.8: Echo State Networks
199 | 
200 | I skimmed this section. It's quite high-level and not that important to me.
201 | 
202 | Section 10.9: Leaky Units, Multiple Time Scales
203 | 
204 | I like this explanation:
205 | 
206 | > One way to deal with long-term dependencies is to design a model that operates
207 | > at multiple time scales, so that some parts of the model operate at
208 | > fine-grained time scales and can handle small details, while other parts
209 | > operate at coarse time scales and transfer information from the distant past
210 | > to the present more efficiently.
211 | 
212 | Oddly enough, they don't cite the ResNet paper?!?
213 | 
214 | They can add skip connections (i.e. adding edges to the RNN). Or they can remove
215 | edges from the RNN, which might have similar positive effects as skip
216 | connections.
217 | 
218 | Section 10.10: LSTMs (finally!), Gated Recurrent Unit RNNs
219 | 
220 | As of this writing (2016), these two RNNs are the most effective RNNs we have
221 | for practical applications involving sequences.
222 | 
223 | Gated Recurrent Unit (GRU):
224 | 
225 | - Main idea:
226 | 
227 | > [...]
gated RNNs are based on the idea of creating paths through time that 228 | > have derivatives that neither vanish nor explode. 229 | 230 | - The RNN needs to *learn* when to forget and discard the past (it can't 231 | remember everything, after all!). 232 | 233 | - Another quote: 234 | 235 | > The main difference with the LSTM is that a single gating unit 236 | > simultaneously controls the forgetting factor and the decision to update the 237 | > state unit. 238 | 239 | Long Short-Term Memory (LSTM): 240 | 241 | - See Figure 10.16 for the block diagram. It's still very confusing despite how 242 | I implemented it in CS 231n. I'm amazed that these work at all. 243 | 244 | - Like GRUs, LSTMs need to *learn* when to forget. 245 | 246 | - It uses self-loops to enable paths to flow for long durations. By flow, I mean 247 | not only the forward pass, but the *backward* pass. 248 | 249 | The authors' conclusion is to simply stick with GRUs or LSTMs. 250 | 251 | Section 10.11: Optimization for Long-Term Dependencies 252 | 253 | They talk about how to improve optimization, such as with second-order methods 254 | and clipping gradients. (Be careful, taking the average of a bunch of clipped 255 | gradients means gradients that were larger have their contributions removed; see 256 | the discussion in the textbook.) 257 | 258 | I wouldn't put too much stock into this, though, because the authors say: 259 | 260 | > This is part of a continuing theme in machine learning that it is often much 261 | > easier to design a model that is easy to optimize than it is to design a more 262 | > powerful optimization algorithm. 263 | 264 | In fact it seems like it's easier to train LSTMs using simple SGD rather than 265 | use a more complicated optimization algorithm. PS: is ADAM used with RNNs? 266 | 267 | Section 10.12: Explicit Memory 268 | 269 | Philosophical quote: 270 | 271 | > Neural networks excel at storing implicit knowledge. However, they struggle to 272 | > memorize facts. 
273 | 274 | This section introduces **Memory Networks** and **Neural Turing Machines**. 275 | 276 | For NTMs, note that: 277 | 278 | > It is difficult to optimize functions that produce exact, integer addresses. 279 | > To alleviate this problem, NTMs actually read to or write from many memory 280 | > cells simultaneously. To read, they take a weighted average of many cells. To 281 | > write, they modify multiple cells by different amounts 282 | 283 | Yeah, it's basically **soft attention**. 284 | 285 | Conclusion of the chapter: 286 | 287 | > Recurrent neural networks provide a way to extend deep learning to sequential 288 | > data. They are the last major tool in our deep learning toolbox. Our 289 | > discussion now moves to how to choose and use these tools and how to apply 290 | > them to real-world tasks. 291 | 292 | Whew! 293 | -------------------------------------------------------------------------------- /Robots_and_Robotic_Manip/Mathematical_Introduction_Robotic_Manipulation.txt: -------------------------------------------------------------------------------- 1 | Notes on the textbook: 2 | 3 | A Mathematical Introduction to Robotic Manipulation, 1994. 4 | Richard M. Murray and Zexiang Li and S. Shankar Sastry 5 | 6 | A bit old but still in use for Berkeley's courses. 7 | 8 | 9 | *************************** 10 | * Chapter 1: Introduction * 11 | *************************** 12 | 13 | Some history here ... not that relevant to me at this moment. I'd like to see a 14 | more modern take on this. 15 | 16 | But I do like this: 17 | 18 | > The vast majority of robots in operation today consist of six joints which are 19 | > either rotary (articulated) or sliding (prismatic), with a simple "end- 20 | > effector" for interacting with the workpieces. 21 | 22 | Yes, the dvrk has one "prismatic" joint out of seven (note, seven, not six...) 23 | and the others are rotary --- the dvrk guide actually says "revolute". And I 24 | obviously know the end-effectors by now. 
(Edit: "revolute" is clearly the better 25 | terminology... fortunately the book uses that later.) 26 | 27 | Then they talk about the book outline. Yeah, maybe I'll definitely take a look 28 | at Chapter 2 at a "leisurely pace" to better understand rigid body motion: 29 | 30 | > In this chapter, we present a geometric view to understanding translational 31 | > and rotational motion of a rigid body. While this is one of the most 32 | > ubiquitous topics encountered in textbooks on mechanics and robotics, it is 33 | > also perhaps one of the most frequently misunderstood. 34 | 35 | OK, fair enough. 36 | 37 | 38 | ******************************** 39 | * Chapter 2: Rigid Body Motion * 40 | ******************************** 41 | 42 | > In this chapter, we present a more modern treatment of the theory of screws 43 | > based on linear algebra and matrix groups. The fundamental tools are the use 44 | > of homogeneous coordinates to represent rigid motions and the matrix 45 | > exponential, which maps a twist into the corresponding screw motion. 46 | 47 | == Important facts == 48 | 49 | - Location (x, y, z). 50 | 51 | - Trajectory (x(t), y(t), z(t)) = p(t). 52 | 53 | - Rigid **body** satisfies || p(t) - q(t) || = || p(0) - q(0) || = constant. 54 | 55 | - Rigid body transformation: map from R^3 -> R^3 representing "rigid motion" 56 | (subtle point: cross product must be preserved). 57 | 58 | - Cartesian frame: specified with axes vectors x, y, z. These **must** be 59 | _orthogonal_ and with magnitude 1. I.e., _orthonormal_ vectors. Oh, and 60 | preserves z = x \times y to preserve the right-handedness of the system. 61 | 62 | - Know **rotation matrices**: orthogonal and has determinant 1 if right handed 63 | coordinate frame. 64 | 65 | - Figure 2.1 is helpful. **Every rotation** of that object corresponds to some 66 | rotation matrix (well, w.r.t. a fixed frame). 
And the rotation matrix even
67 | has a special form: we stack the coordinates of the principal axes (x,y,z)
68 | of the **body frame** of the object w.r.t. the "inertial frame."
69 | - Can also think of rotation matrices as transforming points from one frame to
70 | another. Draw a picture for their example; it's worth it.
71 | - Combine rotation matrices via matrix multiplication to form other rotations.
72 | 
73 | - SO(n) = "Special Orthogonal" group of (n,n) matrices, typically n=3 but
74 | sometimes n=2. These are a linear algebra "group" under matrix multiplication;
75 | definition is the same as the abstract algebra concept.
76 | 
77 | Related notation: so(n), with lowercase letters, is the space of n-by-n
78 | **skew symmetric** matrices, so A^T = -A.
79 | 
80 | - SE(n) = "Special Euclidean" group: R^n x SO(n). In the general case with n=3,
81 | we have six dimensions. This is the usual "position and rotation" that I'm
82 | familiar with; denote these as (p,R) where p is in R^3 and R is in SO(3).
83 | 
84 | == Other Major Points ==
85 | 
86 | - How to prove that something (e.g., a rotation) is a rigid body transformation?
87 | It's simple: show that the transformation preserves distance and orientation.
88 | Look at Definition 2.1 and literally just prove the two properties!
89 | 
90 | Don't forget to review the _cross_product_ between two vectors.
91 | 
92 | a x b = (a)^ b where (a)^ is the cross product matrix. We often use
93 | \hat{a}, which is what the book uses for exponential coordinates of
94 | rotation, with `e^{...}`.
95 | 
96 | And be careful about the distinction:
97 | 
98 | _points_ (typically written as p, q)
99 | _vectors_ (typically written as v, w)
100 | 
101 | For two points p, q \in O, the vector v \in R^3 is the _directed_ line
102 | segment going from p to q.
103 | 
104 | Conceptual difference: vectors have a _direction_ and a _magnitude_.
105 | 
106 | - To track motion of a rigid body, we just need to watch one point plus the
107 | rotation w.r.t.
that point. Hence, use a *configuration* which means we 108 | "attach" a coordinate frame to a point and track it w.r.t. a fixed frame. 109 | Don't forget what we mean by a configuration: something which can tell us 110 | "complete" (or "sufficient"?) information about something in some space. I 111 | remember that from CS 294-115. More precisely, that's SE(3). 112 | 113 | - "Exponential coordinates for rotation" derived from considering: given *axis* 114 | of rotation \omega, and the amount (i.e., angle through the axis) we rotate 115 | some arm (e.g., see Figure 2.2) can we derive the rotation matrix R? They were 116 | able to derive it by setting `R=e^{\hat{\omega} * \theta}` where 117 | `\hat{\omega}` is a matrix. That's where we get the exponential stuff. And for 118 | a closed-form implementation, look at **Rodrigues' formula**. I used it for CS 119 | 280. 120 | 121 | - This is known as "angular velocity" in physics. 122 | - We like this due to Euler's Theorem (2.6 in the book): _any_ orientation R 123 | in SO(3) is equivalent to a rotation about axis w in R^3 through an angle. 124 | 125 | - Theorem: **every rotation matrix** can be represented as the matrix 126 | exponential of some skew-symmetric matrix. 127 | 128 | BTW, in their notation, \hat{\omega} is a skew-symmetric 3x3 matrix. And 129 | they represent skew symmetric matrices as the product of a *unit* 130 | skew-symmetric matrix and a real number. 131 | 132 | - Another representation of rotations are the three **Euler Angles** which is 133 | what I'm most familiar with. AKA yaw, pitch, roll. The order of which axes we 134 | rotate about matters, since it can be represented as the product of three 135 | matrices. See Equation 2.20 for the formulas to derive yaw, pitch, and roll. 136 | Watch out for computing the correct quadrant for the arc-tan functions. 137 | 138 | - Downside: singularities. 
E.g., there are infinitely many representations of 139 | certain rotations, and it is a "fundamental topological fact" that 140 | singularities can't be eliminated in a 3-D representation of SO(3). I don't 141 | know why, but the authors argue that: 142 | 143 | > This situation is similar to that of attempting to find a global 144 | > coordinate chart on a sphere, which also fails. 145 | 146 | Hmm ... sounds intriguing. But I won't fret too much about this. 147 | 148 | == Rigid Motion in R^3 == 149 | 150 | (Now we're dealing with _translations_, in addition to rotations.) This is where 151 | the _SE(3)_ group appears. An element `(p,R) \in SE(3)` serves as: 152 | 153 | - A specification of the configuration of a rigid body. 154 | - A transformation taking the coordinates of a point from one frame to 155 | another. 156 | 157 | This is exactly analogous to the SO(3) case, where `R \in SO(3)` was either a 158 | rotation configuration or a rotation mapping. We can view it either way. :-) 159 | 160 | To make the linear algebra math easier to describe rigid transformations, use 161 | **homogeneous coordinates**. 162 | 163 | - Add 1 to the coordinates of a point, so now we're in R^4, and vectors are 164 | (well, effectively) in R^3 since their 4th component is always zero. 165 | - Now a RBT is one matmul on a vector, a linear ("affine") transformation. The 166 | last row is all zero except for a 1 at the lower right corner. 167 | - To compose several of these transformations, do more matmuls obviously. 168 | 169 | Must also know the exponential coordinates for rigid motion, so the SE analogue 170 | to the SO exponential of a skew symmetric matrix representing a rotation. 171 | 172 | - Once again, start from considering rotation about axis \omega 173 | - Then derive velocity of tip point via cross products 174 | - Then solve (integrate) differential equation to get exponential map 175 | - Main difference is the use of 4x4 matrices w/homogeneous-like 176 | representation. 
Also, we consider an extra ("offset"?) point q on \omega.
177 | 
178 | Define se(3):
179 | se(3) := { (u,\hat{omega}) s.t. u in R^3, \hat{omega} in so(3) }
180 | Elements of se(3) are _twists_; can also write them using 4x4 matrices using
181 | homogeneous coordinates, useful for the following proposition ...
182 | 
183 | Proposition 2.8: given \hat{ξ} \in se(3) and \theta \in R, the exponential of
184 | \hat{ξ}*\theta is an element of SE(3), the special Euclidean group ... think
185 | of it as the possible translations and rotations.
186 | 
187 | Proof technique:
188 | - Start w/4x4 matrix \hat{ξ} in se(3). Want to show: exp(\hat{ξ}*theta)
189 | in SE(3).
190 | - Prove by construction and obtain a formula for that exponential.
191 | - Split into cases, \omega = 0 versus \omega =/= 0.
192 | - For second (harder) case, relate to \hat{ξ-prime} and use properties of
193 | exponentials and cross products.
194 | - Use the _homogeneous_ representation of elements in SE(3). Normally, I
195 | think of (p,R) \in SE(3), but use the 4x4 _matrix_ with R and p in it.
196 | 
197 | Intuition: earlier we interpreted elements of SE(3) as transforming from one
198 | coordinate frame to another. Here, interpret it as mapping points from
199 | _initial_ coordinates to their coordinates _after_ the rigid motion is
200 | applied. Key difference from earlier is that the start and end are specified
201 | w.r.t. a _single_ coordinate frame. The book says:
202 | 
203 | > Thus, the exponential map for a twist gives the relative motion of a rigid
204 | > body. This interpretation of the exponential of a twist as a mapping from
205 | > initial to final configurations will be especially important as we study the
206 | > kinematics of robot mechanisms in the next chapter.
207 | 
208 | Important! _Every_ rigid transformation can be written as the exponential of
209 | some twist. BTW, I think the twist is only the \hat{ξ} part, and the `\theta
210 | \in R` part is multiplied later.
Not a big deal, just think of twists as the 4x4 211 | "\hat{ξ}" matrices in se(3). 212 | 213 | _Screws_ are a "geometric description" of twists and give us more intuition on 214 | them. More precisely: 215 | 216 | > Consider a rigid body motion which consists of rotation about an axis in space 217 | > through an angle of `\theta` radians, followed by translation along the same 218 | > axis by an amount `d` as shown in Figure 2.7a. We call such a motion a screw 219 | > motion, since it is reminiscent of the motion of a screw, in so far as a screw 220 | > rotates and translates about the same axis. 221 | 222 | - Characterizing a screw: define _pitch_, _axis_, and _magnitude_. 223 | - To compute RBT, draw a figure, determine end-point, and derive the rotation 224 | plus vector offset to get the usual 4x4 homogeneous matrix representation. 225 | - The RBT of a screw has an equivalence with the exponential of a twist 226 | `exp(\hat{ξ}*\theta)`. 227 | - It is possible to define a screw for every twist! 228 | 229 | Important theorem: 230 | 231 | > Theorem 2.11 (Chasles). Every rigid body motion can be realized by a rotation 232 | > about an axis combined with a translation parallel to that axis. 233 | 234 | Be careful about _relative_ motion, which is w.r.t. a SINGLE reference frame. To 235 | "switch" between frames, you need to do an extra matrix multiply with g_{ab} to 236 | map from B's coordinates to A. 237 | 238 | == Velocity of a Rigid Body == 239 | 240 | (This is probably not that relevant for me.) 241 | 242 | == Wrenches and Reciprocal Screws == 243 | 244 | (This is probably not that relevant for me.) 
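Before moving on to Chapter 3, a quick numerical check (my own plain-Python
sketch) of the exponential coordinates for rotation from earlier, via
Rodrigues' formula R = I + sin(θ)(ω)^ + (1-cos(θ))(ω)^2, assuming ω is a unit
vector:

```python
import math

def hat(w):
    """Skew-symmetric matrix (w)^ so that (w)^ v = w x v."""
    wx, wy, wz = w
    return [[0, -wz, wy],
            [wz, 0, -wx],
            [-wy, wx, 0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rodrigues(w, theta):
    """R = exp(\\hat{w} * theta) for a *unit* axis w (Rodrigues' formula)."""
    W = hat(w)
    W2 = matmul(W, W)
    I = [[float(i == j) for j in range(3)] for i in range(3)]
    s, c = math.sin(theta), math.cos(theta)
    return [[I[i][j] + s * W[i][j] + (1 - c) * W2[i][j] for j in range(3)]
            for i in range(3)]

# Rotate about the z-axis by 90 degrees: the x-axis should map to the y-axis.
R = rodrigues([0.0, 0.0, 1.0], math.pi / 2)
v = [R[i][0] for i in range(3)]  # image of (1,0,0) = first column of R
```

Seeing (1,0,0) land on (0,1,0) is a nice sanity check that the formula really
does produce the rotation matrix for the given axis/angle.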
245 | 
246 | 
247 | *************************************
248 | * Chapter 3: Manipulator Kinematics *
249 | *************************************
250 | 
251 | == Section 2: Forward Kinematics ==
252 | 
253 | To determine the configuration of the end-effector given information about the
254 | robot joints, we typically assume that the robot is composed of a set of
255 | "lower-pair joints".
256 | 
257 | - There are six common examples: prismatic, revolute, helical, cylindrical,
258 | planar, and spherical. The two most common are, of course, prismatic and
259 | revolute joints. (The 2017 book by Lynch & Park has figures of these,
260 | though they use "universal" instead of "planar".)
261 | - The reason why we like this assumption is that each of the joints
262 | **restricts the motion of adjacent links to a subgroup of SE(3)**, making it
263 | easier to analyze.
264 | 
265 | Example, with Figure 3.1, there are four joints, three revolute and one
266 | prismatic. The revolute joints are specified with one \theta for each since it
267 | can be thought of as a single circle about some axis (specified with the right
268 | handed coordinate system). In fact, the same holds for the prismatic joint with
269 | \theta being the displacement along the axis, so specifying these four scalar
270 | values is enough for us to define the configuration of that particular robot.
271 | The **joint space** is the Cartesian product of these individual joint angles.
272 | Equivalently, we can form the configuration space of the robot. It has four
273 | degrees of freedom (3+1=4 obviously) but this of course doesn't hold as a
274 | general rule as robots may have constraints on joints that restrict some DoFs.
275 | 
276 | Attach **two** coordinate frames:
277 | 
278 | - Base frame: attached to a point on the manipulator which is stationary with
279 | respect to the first link (at index 0).
280 | - Tool frame: attached to the end-effector of the robot, so that the tool frame
281 | moves when the joints of the robot move (seems logical).
282 | So when I query the dVRK, the positions are clearly in the base frame, since
283 | if they were in the tool frame, the positions would always be (0,0,0).
284 | 
285 | Forward kinematics: determine the function `g_st: Q -> SE(3)` that gives
286 | the configuration of the tool frame (w.r.t. the base frame). Q is the joint
287 | space of the manipulator, as I mention above.
288 | 
289 | Generic solution:
290 | 
291 | g_st(theta) = g_{s,l1}(theta_1) * ... * g_{l_{n-1},ln}(theta_n) * g_{ln,t}
292 | 
293 | Concatenate the transformations among **adjacent** link frames.
294 | 
295 | g_st, our final map, determines the _configuration_ of the _tool_ frame
296 | relative to _base_ frame. That's consistent with our subscript notation.
297 | Remember also that `g_{ij} \in SE(3)` can be thought of as `(p_{ij},R_{ij})`.
298 | 
299 | == Product of Exponentials ==
300 | 
301 | We can obtain a more "geometric description" using PoEs. (Not sure what
302 | precisely this means...)
303 | 
304 | Example/Figure 3.2 for an overview of two choices: using g_st(\theta) as
305 | previously discussed, or using PoEs in which
306 | 
307 | g_st(theta) = exp(hat{ξ}_1*theta_1) * exp(hat{ξ}_2*theta_2) * g_st(0)
308 | (g_st(0) = rigid body transformation from T to S)
309 | 
310 | Derive by thinking: "fix theta_1 and consider motion wrt theta_2. Then do
311 | motion wrt theta_1 and combine result". This is generalized:
312 | 
313 | > For each joint, construct a twist `ξ_i` which corresponds to the screw motion
314 | > for the i-th joint with all other joint angles held fixed at θ_j = 0.
315 | 
316 | Results in Equation 3.3 on pp.87, the PoEs, at last! (TODO: understand why the
317 | `ξ_i` have their particular form for revolute or prismatic cases.)
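To see the PoE recipe in action, here's my own toy example (NOT the book's
Figure 3.3): a planar 2R arm, done in SE(2) instead of SE(3) so the homogeneous
matrices are only 3x3. Link lengths and joint locations are made up. Each
factor is one joint's screw motion with the other joint frozen at zero, then we
right-multiply by g_st(0):

```python
import math

def rot_about(q, theta):
    """SE(2) homogeneous matrix for a revolute joint whose axis passes through
    point q: rotate the plane about q by theta, i.e. [[R, (I - R)q], [0, 1]]."""
    c, s = math.cos(theta), math.sin(theta)
    qx, qy = q
    return [[c, -s, qx - (c * qx - s * qy)],
            [s,  c, qy - (s * qx + c * qy)],
            [0,  0, 1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

L1, L2 = 1.0, 1.0          # link lengths (made up)
g_st0 = [[1, 0, L1 + L2],  # tool frame pose at the zero configuration
         [0, 1, 0],
         [0, 0, 1]]

def g_st(t1, t2):
    # Product of exponentials: joint 1's axis passes through the origin,
    # joint 2's axis passes through (L1, 0) at the zero configuration.
    e1 = rot_about((0.0, 0.0), t1)
    e2 = rot_about((L1, 0.0), t2)
    return matmul(matmul(e1, e2), g_st0)

g = g_st(math.pi / 2, -math.pi / 2)
# Tool position is the last column of g; it should agree with the standard
# 2R formula x = L1 cos(t1) + L2 cos(t1+t2), y = L1 sin(t1) + L2 sin(t1+t2).
```

For t1 = pi/2, t2 = -pi/2 both the PoE product and the direct formula put the
tool at (1, 1), which is a reassuring consistency check.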
318 | 319 | If we assume that's true, then kinematics for Figure 3.3 are easily derived (and 320 | by this we can get every component in the matrices) by starting from PoEs and 321 | substituting into the formula for exp(hat{ξ}_i*theta_i) for 1<=i<=4 that we can 322 | find from Equation (2.36), pp.42. 323 | -------------------------------------------------------------------------------- /Robots_and_Robotic_Manip/ROS.text: -------------------------------------------------------------------------------- 1 | How to use ROS. I'm using ROS Indigo, on Ubuntu 14.04. Hopefully the Fetch will 2 | be updated for 16.04 soon. 3 | 4 | 5 | *************************************************************** 6 | * Tutorial 1: Installing and Configuring Your ROS Environment * 7 | *************************************************************** 8 | 9 | Note the environment variables after installation: 10 | 11 | ``` 12 | $ printenv | grep ROS 13 | ROS_ROOT=/opt/ros/indigo/share/ros 14 | ROS_PACKAGE_PATH=/opt/ros/indigo/share:/opt/ros/indigo/stacks 15 | ROS_MASTER_URI=http://localhost:11311 16 | ROSLISP_PACKAGE_DIRECTORIES= 17 | ROS_DISTRO=indigo 18 | ROS_ETC_DIR=/opt/ros/indigo/etc/ros 19 | ``` 20 | 21 | In my `.bashrc` I have: 22 | 23 | ``` 24 | source /opt/ros/indigo/setup.bash 25 | alias fetch_mode='export ROS_MASTER_URI=http://fetch59.local:11311 export PS1="\[\033[41;1;37m\]\[\033[0m\]\w$ "' 26 | ``` 27 | 28 | where `fetch_mode` came from the HSR tutorials. 29 | 30 | Another important note regarding rosbuild and catkin. 31 | 32 | > Note: Throughout the tutorials you will see references to rosbuild and catkin. 33 | > These are the two available methods for organizing and building your ROS code. 34 | > rosbuild is not recommended or maintained anymore but kept for legacy. 
catkin 35 | > is the recommended way to organise your code, it uses more standard CMake 36 | > conventions and provides more flexibility especially for people wanting to 37 | > integrate external code bases or who want to release their software. For a 38 | > full break down visit catkin or rosbuild. 39 | 40 | I followed their directions to make the appropriate directories for a catkin 41 | workspace. But sourcing the bash scripts didn't seem to have any noticeable 42 | effect. I thought it'd do a python virtualenv thing? 43 | 44 | Beyond the scope of this, but catkin stuff is here: 45 | 46 | http://wiki.ros.org/catkin/conceptual_overview 47 | 48 | - A build system specifically for ROS. Others are `GNU make` and `CMake`. 49 | - Source code is organized into "packages" which have targets to build. 50 | - For information on how to build, we need "configuration files." With catkin 51 | (extension of CMake) that's in `CMakeLists.txt`. 52 | - `catkin` is the newer tool we should use, not `rosbuild` (older). 53 | 54 | 55 | ********************************************* 56 | * Tutorial 2: Navigating the ROS Filesystem * 57 | ********************************************* 58 | 59 | Use `package.xml` to store information about a specific package, such as 60 | dependencies, maintainer, etc. Know `rospack`, `roscd`, etc. We can prepend 61 | `ros` to some common Unix commands, do tab completion, etc. 62 | 63 | ``` 64 | daniel@daniel-ubuntu-mac:~$ rospack find roscpp 65 | /opt/ros/indigo/share/roscpp 66 | daniel@daniel-ubuntu-mac:~$ roscd roscpp 67 | daniel@daniel-ubuntu-mac:/opt/ros/indigo/share/roscpp$ 68 | ``` 69 | 70 | 71 | ************************************** 72 | * Tutorial 3: Creating a ROS Package * 73 | ************************************** 74 | 75 | Packages need: a manifest (package.xml) file, a catkin configuration file, and 76 | its own directory (easy). 
Since we already created `catkin_ws/src` earlier, put 77 | each of our custom packages as its own directory within `catkin_ws/src`. 78 | 79 | After running the package script, I have this within `~/catkin_ws/src`: 80 | 81 | ``` 82 | CMakeLists.txt -> /opt/ros/indigo/share/catkin/cmake/toplevel.cmake 83 | 84 | beginner_tutorials/ 85 | CMakeLists.txt 86 | include/ 87 | beginner_tutorials/ 88 | (empty) 89 | package.xml 90 | src/ 91 | (empty) 92 | ``` 93 | 94 | - Since the tutorial runs the script with `rospy`, `roscpp`, and `std_msgs`, 95 | those are listed as the package dependencies in `package.xml`. 96 | 97 | - When we run `catkin_make` over the entire workspace, it will say "traversing 98 | into beginner_tutorials". 99 | 100 | - First-order dependencies: 101 | ``` 102 | ~/catkin_ws$ rospack depends1 beginner_tutorials 103 | roscpp 104 | rospy 105 | std_msgs 106 | ``` 107 | 108 | - We can also list all the *indirect* dependencies. 109 | 110 | - Dependencies are in the following groups: 111 | > inbuild_depend (don't see this, I have build_depend, build_export_depend) 112 | > buildtool_depend (I have this) 113 | > exec_depend (I have this) 114 | > test_depend (I don't see this) 115 | (Maybe they re-named `build_depend` and `build_export_depend`?) 116 | 117 | - `build_depend` for compilation, `exec_depend` for runtime 118 | 119 | - Make sure I customize `package.xml`!! It's mostly "meta-data" so should be 120 | easier than customizing `CMakeLists.txt`. See conventions online. 121 | 122 | 123 | 124 | ************************************** 125 | * Tutorial 4: Building a ROS Package * 126 | ************************************** 127 | 128 | This discusses `catkin_make` which we previously ran. Note that using 129 | `catkin_make` we can build *all* the packages in our workspace, at least in the 130 | `src/` directory (we can change the target directory). 
Here's what I have in 131 | `catkin_ws/`: 132 | 133 | ``` 134 | build/ 135 | beginner_tutorials/ 136 | catkin/ 137 | catkin_generated/ 138 | CATKIN_IGNORE 139 | catkin_make.cache 140 | CMakeCache.txt 141 | CMakeFiles/ 142 | cmake_install.cmake 143 | CTestTestfile.cmake 144 | gtest/ 145 | Makefile 146 | test_results/ 147 | devel/ 148 | env.sh 149 | lib/ 150 | setup.bash 151 | setup.sh 152 | _setup_util.py 153 | setup.zsh 154 | share/ 155 | src/ 156 | beginner_tutorials/ 157 | CMakeLists.txt 158 | ``` 159 | 160 | The `cmake` and `make` commands go to `build` when they need to build packages. 161 | The executables and libraries go in `devel` *before* installing packages. 162 | 163 | We'd also run `catkin_make install` but this seems to be optional. 164 | 165 | BTW, I now understand why there seem to be so many packages located in that 166 | directory on our dVRK machine. Unfortunately, we don't seem to be using it. I 167 | wonder if the HSR or YuMi computers have a similar file system. 168 | 169 | 170 | 171 | *************************************** 172 | * Tutorial 5: Understanding ROS Nodes * 173 | *************************************** 174 | 175 | - Nodes: A node is an executable that uses ROS to communicate with other nodes. 176 | - That's it. Use these to subscribe/publish to topics. 177 | - To communicate, use a "ROS client library" which is rospy or roscpp. 178 | 179 | - Messages: ROS data type used when subscribing or publishing to a topic. 180 | - E.g. "geometry_msgs/Twist". For publisher/subscriber nodes to communicate 181 | they need to send/accept the same message type. 182 | 183 | - Topics: Nodes can publish messages to a topic as well as subscribe to a topic 184 | to receive messages. 185 | - Communication depends on these _messages_. 186 | 187 | - Master: Name service for ROS (i.e. helps nodes find each other) 188 | 189 | - rosout: ROS equivalent of stdout/stderr 190 | - It runs by default from running `roscore` as it collects debug messages. 
191 | 192 | - roscore: Master + rosout + parameter server (parameter server will be 193 | introduced later) 194 | - First thing we should run! Recall this is what we do for the dVRK. 195 | 196 | After `roscore`: 197 | 198 | ``` 199 | ~/catkin_ws$ roscore 200 | ... logging to 201 | /home/daniel/.ros/log/4a2cd14e-32cf-11e8-9512-7831c1b89008/roslaunch-daniel-ubuntu-mac-4867.log 202 | Checking log directory for disk usage. This may take awhile. 203 | Press Ctrl-C to interrupt 204 | Done checking log file disk usage. Usage is <1GB. 205 | 206 | started roslaunch server http://daniel-ubuntu-mac:33999/ 207 | ros_comm version 1.11.21 208 | 209 | SUMMARY 210 | ======== 211 | 212 | PARAMETERS 213 | * /rosdistro: indigo 214 | * /rosversion: 1.11.21 215 | 216 | NODES 217 | 218 | auto-starting new master 219 | process[master]: started with pid [4879] 220 | ROS_MASTER_URI=http://daniel-ubuntu-mac:11311/ 221 | 222 | setting /run_id to 4a2cd14e-32cf-11e8-9512-7831c1b89008 223 | process[rosout-1]: started with pid [4892] 224 | started core service [/rosout] 225 | ``` 226 | 227 | So `/rosout` will be listed when running `rosnode list` in a separate tab. Keep 228 | `roscore` running throughout the time we use ROS!! Use `rosnode info` to see (1) 229 | publishers, (2) subscribers, and (3) services. Also note `PARAMETERS` which must 230 | mean the parameter server. 231 | 232 | Use `rosrun` to run packages along with certain nodes within packages. I ran 233 | `turtlesim` and yes we get a new node and can re-name if needed. There appear to 234 | be two node options for this, one for the turtle and another for teleoperation. 235 | 236 | 237 | 238 | **************************************** 239 | * Tutorial 6: Understanding ROS Topics * 240 | **************************************** 241 | 242 | We run the turtlesim via teleoperation, and it works. 243 | 244 | - Nodes `turtlesim_node` and `turtle_teleop_key` within the `turtlesim` package 245 | communicate to each other via a ROS topic. 
246 |   - Communication within such topics depends on sending ROS _messages_. 247 | 248 | - The teleop node *publishes* key commands, while the sim node *subscribes*. 249 | 250 | - Use `rqt_graph` for visualizing node dependencies. This is very useful! 251 | 252 | - Use `rqt_plot` to plot certain node values that can be plotted (e.g., 253 |   x-position of turtle) but I don't think I'll be using this, I like matplotlib. 254 | 255 | Use `rostopic` to examine topics. For instance, if I run this and then move the 256 | turtle forward, I get: 257 | 258 | ``` 259 | ~/catkin_ws$ rostopic echo /turtle1/cmd_vel 260 | linear: 261 |   x: 2.0 262 |   y: 0.0 263 |   z: 0.0 264 | angular: 265 |   x: 0.0 266 |   y: 0.0 267 |   z: 0.0 268 | --- 269 | linear: 270 |   x: 2.0 271 |   y: 0.0 272 |   z: 0.0 273 | angular: 274 |   x: 0.0 275 |   y: 0.0 276 |   z: 0.0 277 | --- 278 | (and so on) 279 | ``` 280 | 281 | so the up key must mean increasing in the turtle's x direction. We can get a 282 | full picture of the publisher/subscriber situation: 283 | 284 | ``` 285 | ~/catkin_ws$ rostopic list -v 286 | 287 | Published topics: 288 |  * /turtle1/color_sensor [turtlesim/Color] 1 publisher 289 |  * /turtle1/cmd_vel [geometry_msgs/Twist] 1 publisher 290 |  * /rosout [rosgraph_msgs/Log] 4 publishers 291 |  * /rosout_agg [rosgraph_msgs/Log] 1 publisher 292 |  * /turtle1/pose [turtlesim/Pose] 1 publisher 293 | 294 | Subscribed topics: 295 |  * /turtle1/cmd_vel [geometry_msgs/Twist] 2 subscribers 296 |  * /rosout [rosgraph_msgs/Log] 1 subscriber 297 |  * /statistics [rosgraph_msgs/TopicStatistics] 1 subscriber 298 | ``` 299 | 300 | The type of `/turtle1/cmd_vel` is `geometry_msgs/Twist`, as shown above. Looks 301 | like it lists topics followed by message (well, the _type_ of the message). 302 | 303 | Use `rostopic pub [...]` to publish something. In the turtle example, this might 304 | mean commanding the turtle's velocity. 305 | 306 | So, there's rostopic `pub`, `list`, `echo`, `type`, etc.
Straightforward: 307 | 308 |     rostopic bw      display bandwidth used by topic 309 |     rostopic echo    print messages to screen 310 |     rostopic hz      display publishing rate of topic 311 |     rostopic list    print information about active topics 312 |     rostopic pub     publish data to topic 313 |     rostopic type    print topic type 314 | 315 | I don't really need `type` now as it's shown in `list` as seen above. The `hz` 316 | might be useful since (as I know with the dVRK) the camera images of the 317 | workspaces aren't updated instantaneously but with some delay, and that can 318 | affect policies which take the images as input. 319 | 320 | 321 | 322 | ********************************************************* 323 | * Tutorial 7: Understanding ROS Services and Parameters * 324 | ********************************************************* 325 | 326 | Recall we ran `rosnode info /rosout`, where the argument is whichever node we 327 | want information about. That provides us with three things. We sort 328 | of understand publications and subscriptions, but now what about _services_? 329 | 330 | - Another way for nodes to communicate with each other. 331 | - Nodes send _requests_, receive _responses_. (Common sense, right?)
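The request/response pattern can be sketched the same toy way (again plain Python, not the rospy API; the names here are made up):

```python
# Toy sketch of the ROS service pattern (NOT the rospy API): a service is a
# named handler; a client sends a request and gets a response back, unlike
# the fire-and-forget publish/subscribe model.

services = {}

def advertise_service(name, handler):
    services[name] = handler

def call_service(name, **request):
    return services[name](**request)   # request in, response out

# A hypothetical "add two ints"-style service, like the one in later tutorials:
advertise_service("/add_two_ints", lambda a, b: {"sum": a + b})
response = call_service("/add_two_ints", a=3, b=4)
print(response)  # {'sum': 7}
```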
332 | 333 | Like `rostopic`, `rosservice` has lots of command options: 334 | 335 |     rosservice list    print information about active services 336 |     rosservice call    call the service with the provided args 337 |     rosservice type    print service type 338 |     rosservice find    find services by service type 339 |     rosservice uri     print service ROSRPC uri 340 | 341 | For example, I see this with `list`: 342 | 343 | ``` 344 | :~/catkin_ws$ rosservice list 345 | /clear 346 | /kill 347 | /reset 348 | /rosout/get_loggers 349 | /rosout/set_logger_level 350 | /rostopic_8997_1522274470739/get_loggers 351 | /rostopic_8997_1522274470739/set_logger_level 352 | /rqt_gui_py_node_9061/get_loggers 353 | /rqt_gui_py_node_9061/set_logger_level 354 | /spawn 355 | /teleop_turtle/get_loggers 356 | /teleop_turtle/set_logger_level 357 | /turtle1/set_pen 358 | /turtle1/teleport_absolute 359 | /turtle1/teleport_relative 360 | /turtlesim/get_loggers 361 | /turtlesim/set_logger_level 362 | ``` 363 | 364 | We can run `rosservice call /clear`, which calls a service from the list above 365 | (this one takes no arguments). We choose `clear` so that the background is 366 | cleared (we no longer see the turtle's path). This is what I see from 367 | the window that originally started the `turtlesim` package. 368 | 369 | ``` 370 | :~/catkin_ws$ rosrun turtlesim turtlesim_node 371 | [ INFO] [1522273700.220832117]: Starting turtlesim with node name /turtlesim 372 | [ INFO] [1522273700.228355538]: Spawning turtle [turtle1] at x=[5.544445], y=[5.544445], theta=[0.000000] 373 | [ WARN] [1522273804.373982014]: Oh no! I hit the wall! (Clamping from [x=7.155886, y=-0.008128]) 374 | [ WARN] [1522273804.389975987]: Oh no! I hit the wall! (Clamping from [x=7.163082, y=-0.031181]) 375 | (omitted...) 376 | [ WARN] [1522276335.861971290]: Oh no! I hit the wall! (Clamping from [x=9.302450, y=11.089913]) 377 | [ WARN] [1522276335.877974885]: Oh no! I hit the wall!
(Clamping from [x=9.334450, y=11.088992]) 378 | [ INFO] [1522280291.029979359]: Clearing turtlesim. 379 | ``` 380 | 381 | We can also use the `/spawn` service to, well, spawn another turtle. 382 | 383 | We also have `rosparam`, the parameter analogue of `rosservice` (for 384 | services) and `rostopic` (for topics). We can list the parameters and adjust them, 385 | for instance by changing the background color. (However, it doesn't seem to 386 | actually change my color, even though I am clearly setting all the background 387 | colors to be 0 ... hmmm.) 388 | 389 | You can save current parameters for easy loading later. 390 | 391 | 392 | 393 | *********************************************** 394 | * Tutorial 8: Using rqt_console and roslaunch * 395 | *********************************************** 396 | 397 | rqt_console (not sure how useful) 398 | 399 | - Along with rqt_logger_level, lets us see a lot of information in GUIs. 400 | - If we ram the turtle into the wall, we can see the warning message. 401 |   - Assuming that WARN is within the current "verbosity" level... 402 |   - Logging prioritized with: Fatal, Error, Warn, Info, Debug. 403 | 404 | roslaunch (looks _very_ useful, call this each time we start using robots) 405 | 406 | - Note that `roscore` started a "roslaunch server". 407 | - Use this with a _launch file_ to start nodes in a more scalable way. 408 |   - `roslaunch [package] [filename.launch]` 409 |   - `roslaunch gscam endoscope.launch` 410 | - Good practice, put in the package: `~/catkin_ws/src/[...]/launch/[...]` 411 |   where the second [...] is the `.launch` file with tags. 412 | 413 | ``` 414 | <launch> 415 | 416 |   <group ns="turtlesim1"> 417 |     <node pkg="turtlesim" name="sim" type="turtlesim_node"/> 418 |   </group> 419 | 420 |   <group ns="turtlesim2"> 421 |     <node pkg="turtlesim" name="sim" type="turtlesim_node"/> 422 |   </group> 423 | 424 |   <node pkg="turtlesim" name="mimic" type="mimic"> 425 |     <remap from="input" to="turtlesim1/turtle1"/> 426 |     <remap from="output" to="turtlesim2/turtle1"/> 427 |   </node> 428 | 429 | </launch> 430 | ``` 431 | 432 | - Above example makes two groups (different names to avoid conflicts), each of 433 |   which uses a `turtlesim_node` node from the `turtlesim` package. 434 | 435 | - Also makes a new node with type "mimic".
So the `<node>` tag must 436 |   evidently let one make a new node, which can be assigned to a group if it's 437 |   nested within one. Causes the second turtle to mimic the first turtle! 438 | 439 | I see: when we run `roslaunch ...` we get this output: 440 | 441 | ``` 442 | daniel@daniel-ubuntu-mac:~/catkin_ws/src/beginner_tutorials/launch$ roslaunch beginner_tutorials turtlemimic.launch 443 | ... logging to /home/daniel/.ros/log/42096978-3383-11e8-9614-7831c1b89008/roslaunch-daniel-ubuntu-mac-4922.log 444 | Checking log directory for disk usage. This may take awhile. 445 | Press Ctrl-C to interrupt 446 | Done checking log file disk usage. Usage is <1GB. 447 | 448 | started roslaunch server http://daniel-ubuntu-mac:43721/ 449 | 450 | SUMMARY 451 | ======== 452 | 453 | PARAMETERS 454 |  * /rosdistro: indigo 455 |  * /rosversion: 1.11.21 456 | 457 | NODES 458 |   / 459 |     mimic (turtlesim/mimic) 460 |   /turtlesim1/ 461 |     sim (turtlesim/turtlesim_node) 462 |   /turtlesim2/ 463 |     sim (turtlesim/turtlesim_node) 464 | 465 | auto-starting new master 466 | process[master]: started with pid [4934] 467 | ROS_MASTER_URI=http://localhost:11311 468 | 469 | setting /run_id to 42096978-3383-11e8-9614-7831c1b89008 470 | process[rosout-1]: started with pid [4947] 471 | started core service [/rosout] 472 | process[turtlesim1/sim-2]: started with pid [4950] 473 | process[turtlesim2/sim-3]: started with pid [4959] 474 | process[mimic-4]: started with pid [4966] 475 | ``` 476 | 477 | so we get groups listed at the top level (turtlesim1, turtlesim2) along with the 478 | name of the node after it within the nested stuff. 479 | 480 | BTW: seems like roslaunch starts its own master server, so it is not necessary 481 | to have an existing "roscore" command in another tab.
See "auto-starting new 482 | master" above and also: 483 | 484 | https://answers.ros.org/question/217107/does-a-roslaunch-start-roscore-when-needed/ 485 | 486 | We can still get lots of relevant information: 487 | 488 | ``` 489 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rosnode list 490 | /mimic 491 | /rosout 492 | /turtlesim1/sim 493 | /turtlesim2/sim 494 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rostopic list 495 | /rosout 496 | /rosout_agg 497 | /turtlesim1/turtle1/cmd_vel 498 | /turtlesim1/turtle1/color_sensor 499 | /turtlesim1/turtle1/pose 500 | /turtlesim2/turtle1/cmd_vel 501 | /turtlesim2/turtle1/color_sensor 502 | /turtlesim2/turtle1/pose 503 | ``` 504 | 505 | Use `rqt_graph`, as discussed earlier, to understand the launch file. 506 | 507 | 508 | 509 | ************************************************ 510 | * Tutorial 9: Using rosed to edit files in ROS * 511 | ************************************************ 512 | 513 | A very short one: basically, use `rosed [package_name] [filename]` to edit files 514 | without having to type out full file paths. This would be useful for me since I got stuck on 515 | doing this in my early days of working with the dVRK. Fortunately this uses vim 516 | by default, so I should have no problem using it. 517 | 518 | 519 | 520 | ******************************************* 521 | * Tutorial 10: Creating a ROS msg and srv * 522 | ******************************************* 523 | 524 | - msg: simple text files that describe the fields of a ROS message. They 525 |   are used to generate source code for messages in different languages. 526 | - srv: describes a service, composed of two parts: a request and a response. 527 | 528 | These have their own syntax rules. See tutorial for details. We put them in 529 | `msg` and `srv` directories, and then we must ensure our `package.xml` file will 530 | know to compile and run custom messages, and also change `CMakeLists.txt`. 531 | There's a lot to do for the latter; see tutorial for lines to un-comment.
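For reference, the syntax really is just one typed field per line; the tutorial's `Num.msg` is a single field, and in a `.srv` file the `---` separates the request fields from the response fields:

```
# Num.msg -- one field per line: type, then name
int64 num

# AddTwoInts.srv -- request fields above the ---, response fields below
int64 a
int64 b
---
int64 sum
```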
532 | 533 | The tutorials use a simple `AddTwoInts` service. Details with `rossrv`: 534 | 535 | ``` 536 | :~/catkin_ws/src/beginner_tutorials$ rossrv show AddTwoInts 537 | [beginner_tutorials/AddTwoInts]: 538 | int64 a 539 | int64 b 540 | --- 541 | int64 sum 542 | 543 | [rospy_tutorials/AddTwoInts]: 544 | int64 a 545 | int64 b 546 | --- 547 | int64 sum 548 | ``` 549 | 550 | - It's located in two places, since this was created with `roscp`. 551 | - The actual _implementation_ of the "add two ints" is located elsewhere. 552 | - Run `catkin_make install` and watch it build successfully. Whew. 553 | 554 | The installation makes C++ header, Lisp, and Python files. For example: 555 | 556 |   /home/daniel/catkin_ws/install/lib/python2.7/dist-packages/beginner_tutorials/msg/_Num.py 557 | 558 | Again this is _not_ the code implementation (how could it read my mind?) but an 559 | automatically generated file with some known, common methods. Not yet sure what 560 | its purpose is ... 561 | 562 | 563 | 564 | **************************************************************** 565 | * Tutorial 11: Writing a Simple Publisher and Subscriber (C++) * 566 | **************************************************************** 567 | (Skipping) 568 | ******************************************************************* 569 | * Tutorial 12: Writing a Simple Publisher and Subscriber (Python) * 570 | ******************************************************************* 571 | 572 | After downloading their `talker.py` script, I have this in the package: 573 | 574 | ``` 575 | beginner_tutorials/ 576 |   CMakeLists.txt 577 |   package.xml 578 |   include/ 579 |     beginner_tutorials/ 580 |   launch/ 581 |     turtlemimic.launch 582 |   msg/ 583 |     Num.msg 584 |   scripts/ 585 |     talker.py 586 |   src/ 587 |   srv/ 588 |     AddTwoInts.srv 589 | ``` 590 | 591 | For the most part just read the tutorial, it goes line-by-line. Above, there is 592 | no node that "receives" the messages sent by the talker, so we write that.
It 593 | uses a very simple message type: 594 | 595 | ``` 596 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rosmsg show String 597 | [std_msgs/String]: 598 | string data 599 | ``` 600 | 601 | with just a `data` argument to fill. 602 | 603 | For classes, look at: 604 | 605 | http://docs.ros.org/indigo/api/rospy/html/rospy.topics.Publisher-class.html 606 | http://docs.ros.org/indigo/api/rospy/html/rospy.topics.Subscriber-class.html 607 | 608 | They only have one method each, "publish" and "unregister", respectively. 609 | 610 | 611 | 612 | ************************************************************** 613 | * Tutorial 13: Examining the Simple Publisher and Subscriber * 614 | ************************************************************** 615 | 616 | This is really short. Just run the code and see what we get. Make sure `roscore` 617 | is running in a separate tab, though. 618 | 619 | 620 | 621 | ********************************************************** 622 | * Tutorial 14: Writing a Simple Service and Client (C++) * 623 | ********************************************************** 624 | (Skipping) 625 | ************************************************************* 626 | * Tutorial 15: Writing a Simple Service and Client (Python) * 627 | ************************************************************* 628 | 629 | Makes the "service" that actually performs the addition. (It's not clear to me 630 | yet why we need this kind of structure.) And then the client. Again, straight 631 | from the tutorial. 632 | 633 | 634 | 635 | ******************************************************** 636 | * Tutorial 16: Examining the Simple Service and Client * 637 | ******************************************************** 638 | 639 | Yeah, I got it working.
640 | 641 | 642 | 643 | ************************************************ 644 | * Tutorial 17: Recording and playing back data * 645 | ************************************************ 646 | 647 | This is the rostopic status after starting this up: 648 | 649 | ``` 650 | daniel@daniel-ubuntu-mac:~/catkin_ws/devel$ rostopic list -v 651 | 652 | Published topics: 653 | * /turtle1/color_sensor [turtlesim/Color] 1 publisher 654 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 publisher 655 | * /rosout [rosgraph_msgs/Log] 2 publishers 656 | * /rosout_agg [rosgraph_msgs/Log] 1 publisher 657 | * /turtle1/pose [turtlesim/Pose] 1 publisher 658 | 659 | Subscribed topics: 660 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 subscriber 661 | * /rosout [rosgraph_msgs/Log] 1 subscriber 662 | ``` 663 | 664 | I get the rosbag which records the keypresses: 665 | 666 | ``` 667 | daniel@daniel-ubuntu-mac:~/bagfiles$ ls -lh 668 | total 512K 669 | -rw-rw-r-- 1 daniel daniel 511K Mar 29 16:16 2018-03-29-16-15-19.bag 670 | daniel@daniel-ubuntu-mac:~/bagfiles$ vim 2018-03-29-16-15-19.bag 671 | daniel@daniel-ubuntu-mac:~/bagfiles$ rosbag info 2018-03-29-16-15-19.bag 672 | path: 2018-03-29-16-15-19.bag 673 | version: 2.0 674 | duration: 58.6s 675 | start: Mar 29 2018 16:15:19.26 (1522365319.26) 676 | end: Mar 29 2018 16:16:17.84 (1522365377.84) 677 | size: 510.9 KB 678 | messages: 7321 679 | compression: none [1/1 chunks] 680 | types: geometry_msgs/Twist [9f195f881246fdfa2798d1d3eebca84a] 681 | rosgraph_msgs/Log [acffd30cd6b6de30f120938c17c593fb] 682 | turtlesim/Color [353891e354491c51aabe32df673fb446] 683 | turtlesim/Pose [863b248d5016ca62ea2e895ae5265cf9] 684 | topics: /rosout 4 msgs : rosgraph_msgs/Log (2 connections) 685 | /turtle1/cmd_vel 21 msgs : geometry_msgs/Twist 686 | /turtle1/color_sensor 3648 msgs : turtlesim/Color 687 | /turtle1/pose 3648 msgs : turtlesim/Pose 688 | ``` 689 | 690 | And I can replay my commands. 
691 | 692 | 693 | 694 | ******************************************** 695 | * Tutorial 18: Getting started with roswtf * 696 | ******************************************** 697 | 698 | Yeah this is just to check if the system is wrong, and looks like mine is OK. 699 | 700 | 701 | 702 | **************************************** 703 | * Tutorial 19: Navigating the ROS wiki * 704 | **************************************** 705 | 706 | Pretty simple, hopefully documentation won't be an issue. 707 | 708 | 709 | 710 | **************************** 711 | * Tutorial 20: Where Next? * 712 | **************************** 713 | 714 | Robotics work. :-) Look at our manuals, understand rviz, tf, and moveit. 715 | -------------------------------------------------------------------------------- /CS61C_Berkeley/CS61C_Lectures.txt: -------------------------------------------------------------------------------- 1 | CS 61C Lecture Review 2 | Fall 2017 Semester 3 | 4 | ********************************** 5 | * Lecture 1: Course Introduction * 6 | * Given: August 24, 2017 * 7 | ********************************** 8 | 9 | Lecture is about four things, well, three that matter to me: (1) machine 10 | structures, (2) great ideas (in architecture), and (3) how everything is just a 11 | number. 12 | 13 | 14 | Machine Structures 15 | 16 | C is the most popular programming language, followed by Python. Use C to 17 | write software for speed/performance, e.g. embedded systems. EDIT: nope! 18 | That was in F-2016. Now in F-2017, Python has taken over, probably due to 19 | Deep Learning. But C is still in second place. 20 | 21 | This class isn't about C programming, but C is a VERY important language to 22 | know in order to understand the important stuff: the **hardware-software 23 | interface**. It's closer to the hardware than Java or Python. 
24 | 25 | Things we'll learn on the software side: 26 | Parallel requests 27 | Parallel threads 28 | Parallel instructions 29 | Parallel data 30 | Hardware descriptions 31 | 32 | and the hardware side: 33 | Logic gates 34 | Main memory 35 | Cores 36 | Caches 37 | Instruction Units 38 | 39 | Looks like the "new version/face" of CS 61C is parallelism, as I should know 40 | from CS 267. Along with computers being on **mobile devices** and in many 41 | other areas, such as cars! So many things have computers and sensors in them 42 | nowadays, that it's mind-blowing. 43 | 44 | 45 | Great Ideas in Architecture 46 | 47 | Abstraction (Phil Guo's one-word description of CS) 48 | 49 | Anything can be represented as a number. But does this mean we WANT 50 | them to be like that? No, we want to program in a "high-level" like C 51 | so that we don't have to trudge through assembly language code. 52 | 53 | We follow this hierarchy: 54 | ==> C 55 | ==> compiler 56 | ==> assembly language (then machine language??) 57 | ==> machine interpretation (note, in F-2017 they're doing RISC-V, 58 | not MIPS, which I think was in S-2017 ...) 59 | ==> architecture implementation (the logic circuit diagram?) 60 | (I don't fully understand assembly/architecture parts) 61 | 62 | Moore's Law (is it still applicable?!?) 63 | 64 | Basic idea: every 2 years (sometimes I've seen it 1.5 years ...) the 65 | number of transistors per chip will double. Transistors are the basic 66 | source of computation in computers, they're the bits of electricity that 67 | turn into 0s and 1s. From Wikipedia: 68 | "A transistor is a semiconductor device used to amplify or switch 69 | electronic signals and electrical power. It is composed of 70 | semiconductor material usually with at least three terminals for 71 | connection to an external circuit. A voltage or current applied to 72 | one pair of the transistor's terminals controls the current through 73 | another pair of terminals. 
Because the controlled (output) power can 74 |         be higher than the controlling (input) power, a transistor can 75 |         amplify a signal", 76 |     and 77 |         "The transistor is the fundamental building block of modern 78 |         electronic devices, and is ubiquitous in modern electronic systems." 79 | 80 |     However, as one would imagine, if you try to pack more and more 81 |     transistors in a smaller area, it will be exponentially more costly, and 82 |     there will be issues with heat, as well as limits faced with the laws of 83 |     physics. 84 | 85 |     Update: the F-2017 edition (after the class break) brought up a graph 86 |     from David Patterson's textbook, showing that serial processor 87 |     performance was exponential up to the last decade, after which it 88 |     flat-lined. 89 | 90 |     - Thus, in the "glory days" you could write a program and expect newer 91 |       hardware to just be faster. But not anymore. If we tried to cram 92 |       things even further, we'd run into regimes like quantum computing, 93 |       where we don't know if things are really a 0 or a 1 anymore. Uh oh. 94 | 95 |     - Now companies (e.g. Apple, Tesla, Samsung, Google, Microsoft) are not 96 |       just buying general-purpose Intel chips, but building their own chips. 97 |       So it's an exciting time to be a computer architect. 98 | 99 | Principles of Locality (memory hierarchy and caches!!) 100 | 101 |     Jim Gray's storage latency analogy. I've seen this one before. It's 102 |     really nice. Everyone has a nice joke to play about caches. Main thing 103 |     to know is what is actually in the hierarchy: 104 |     - Registers 105 |     - On-chip cache 106 |     - On-board cache 107 |     - Main memory (i.e. RAM) 108 |     - Hard disk 109 |     - Tape and optical robot (not sure what this means) 110 |     Also see the pyramid in the notes. It makes sense: the stuff "closer" to 111 |     us in the hierarchy just listed above has to be smaller since there's 112 |     less room.
Thus, registers are cramped in a small space and are limited, 113 |     but there's much more room for memory on the hard disk. 114 | 115 |     It seems like we have three main caches: L1, L2, and L3. Not sure on the 116 |     difference between on-chip vs on-board cache, though. That might be 117 |     on-chip (as in on the CPU?) vs on the MOTHERboard. As I (finally!!) now 118 |     know from experience, the CPU chip goes in the motherboard in a very 119 |     specific spot. 120 | 121 | Parallelism (CS 267!!) 122 | 123 |     This is another thing we should do if possible. We can "fork" calls into 124 |     several "workers" and then "join" them together later. Professor Katz 125 |     mentions the laundry example. He can use the wash. Then the dryer. But 126 |     if he's using the dryer, there's no reason why someone can't use the 127 |     wash. So this is like stacking things together in a tree-fashion, might 128 |     be related to "tricks with trees" from CS 267. 129 | 130 |     Also: we'll learn how to do thread programming, using fork() to 131 |     split up computation into worker threads, and join() calls to 132 |     combine the result. 133 | 134 |     Caveat: Amdahl's law. It tries to predict speed-ups from parallelism. 135 |     The law states the obvious: if there are parts of an application which 136 |     cannot be parallelized, then we can't get "perfect" speedup, which 137 |     hypothetically would be a 2x speedup if we had 2x parallelism. 138 | 139 | Dependability via Redundancy (should be obvious!) 140 | 141 |     The larger our system, the more likely we have individual components 142 |     that fail. But when we program, we desperately want to make sure we can 143 |     focus on debugging what WE wrote, and NOT the underlying hardware (oh 144 |     God). 145 | 146 |     Easiest thing to do: take a majority vote; this helps to protect against 147 |     faulty machines. Prof Katz: this seems silly and expensive, but useful 148 |     if we have to send code in space or some other area where it's too 149 |     expensive to send repairmen.
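The majority-vote idea is tiny in code (a sketch of triple modular redundancy in plain Python, my own example rather than anything from lecture):

```python
# Triple modular redundancy sketch: run the computation on three "machines"
# and take the majority vote, so one faulty result gets outvoted.

from collections import Counter

def majority_vote(results):
    # most_common(1) returns the (value, count) pair with the highest count.
    value, _ = Counter(results).most_common(1)[0]
    return value

# Two good replicas and one faulty one: the fault is masked.
print(majority_vote([42, 42, 41]))  # 42
```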
150 | 151 | Redundant memory bits as well; these are Error Correcting Codes (ECCs). 152 | Can also do calculations involving the parity of a number (odd vs even) 153 | so we have a spare piece of memory which corrects the expected parity as 154 | needed. 155 | 156 | 157 | Then we switched speakers to Prof. Krste Asanović. 158 | 159 | Higher-level stuff: 160 | 161 | Moore's Law, etc., showed a new paradigm for computer architecture. See 162 | my earlier comments on Moore's Law. 163 | 164 | Then Deep Learning. Yes, I knew it! That's why Deep Learning needs 165 | computer architects, because it's now the hardware and not the algorithm 166 | (After all, we're still doing backpropagation). 167 | 168 | Google has developed a "Tensor Processing Unit" (TPU), a specialized 169 | engine for NN training. Interesting ... I saw Jeff Dean talking about 170 | this recently in his AMA. 171 | 172 | Microsoft has developed "Microsoft Brain Wave". Gah, so many new 173 | developments. 174 | 175 | RISC-V Instruction Set Architecture (ISA) 176 | 177 | In F-2017, they are switching to this from MIPS, which was used in 178 | previous iterations of the course. It was designed at Berkeley for 179 | research and education. 180 | 181 | ISA = the language of the processor, or how software is encoded to run 182 | on hardware. Example: think about how an "add" instruction would be 183 | written in bits. 184 | 185 | Why are we using it if it's open source? Because the cool people are 186 | adopting it. Starting now, NVIDIA is using RISC-V in their GPUs. And the 187 | previous popular set, MIPS, is not doing so well; the company that owns 188 | it is apparently up for sale? 189 | 190 | 191 | (Then we switched back to Prof. Katz, and had some stuff about class 192 | administration. Yeah, I won't post any homeworks publicly, they'll be private.) 193 | 194 | 195 | Everything is Just a Number 196 | 197 | Computers represent data as binary values. 198 | - The *bit* is the unit element, either 0 or 1. 
We're not doing quantum 199 | computing in this class, so we _know_ for certain if a bit is zero or 200 | one. 201 | - Then *bytes* are eight bits, can represent 2^8 = 256 different values. 202 | - A "word" is 4 bytes (i.e. 32 bits), has 2^32 different values, like Java 203 | integers. 204 | - Then there are 64-bit floating point numbers (and 32-bit as well), 205 | numpy can express both though the Theano library encourages 32-bit. 206 | - All of these are built up into longer and more complicated expressions! 207 | - In F-2017, we'll learn how RISC-V encodes computer programs into bits. 208 | 209 | Be sure to MEMORIZE how to convert: (binary <==> decimal). This is so 210 | important to have down cold. I'm definitely intuitively better at going in 211 | the ==> direction, just write the number then underneath, going in REVERSE 212 | direction, do 2^0, 2^1, etc., then multiply by 1s and 0s and add up. Other 213 | direction: keep successively dividing by two (rounding down) and keep track 214 | of parities. Collect (not sum!) the results together at the end. 215 | 216 | Unfortunately, there's also the hexadecimal notation. That's harder. Now 217 | there are 16 different units, not 2 or 10. It goes from 0 to 9 and then we 218 | note it as A=10, B=11, C=12, D=13, E=14, F=15. Obviously, I wrote the 219 | decimal numbers afterwards, could have easily done the binary version. 220 | - There are also octals, with 8 units of computation. 221 | - I'll avoid using these whenever possible. 222 | 223 | Make sure to be consistent with putting down "two", "ten", or "hex" as 224 | subscripts after the numbers. It will make it easier to track which is 225 | which. 226 | 227 | How to use these numbers in C? 228 | Use %d for decimal (I know this now!) 229 | Use %x for hexadecimal 230 | Use %o for octal 231 | Might also have to write numbers with 0x[...] and 0b[...] with 0x or 0b 232 | prefix to indicate which representation we're using. 
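These conversions are easy to sanity-check in Python, whose format codes mirror the C `%d`/`%x`/`%o` specifiers (a quick sketch of my own):

```python
# Converting between representations, mirroring the %d / %x / %o discussion.

n = 0b1101          # binary literal: 1*8 + 1*4 + 0*2 + 1*1
assert n == 13
assert int("1101", 2) == 13        # string in base 2 -> decimal
assert format(13, "b") == "1101"   # decimal -> binary digits
assert format(13, "x") == "d"      # hexadecimal (13 is 'd', since A=10..F=15)
assert format(13, "o") == "15"     # octal: 1*8 + 5

# The repeated-division algorithm from the notes: divide by two, collect
# the remainders (parities), then read them back in reverse order.
def to_binary(n):
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder is the next bit
        n //= 2
    return "".join(reversed(digits)) or "0"

print(to_binary(13))  # 1101
```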
233 | 234 | Beyond bytes, we have kilobytes, gigabytes, etc. Notice that marketing will 235 | assume we multiply by 1000, i.e. kilobytes are 1000 bytes. But in reality we 236 | "should" have 1024 bytes per kilobyte. Marketing can get away with not 237 | including that extra 24. Grrr. For the binary system, we use an extra "i", 238 | so it's KiByte, instead of KByte. And 1GB = 1000MB and 1GiB = 1024MiB. 239 | Watch out! 240 | 241 | 242 | ************************************** 243 | * Lecture 2: Numbers and C Language * 244 | * Given: August 29, 2017 * 245 | ************************************** 246 | 247 | Signed integer representation (Note: this material was originally in the first 248 | lecture in F-2016, but got bumped to the second lecture in F-2017 to make room 249 | for more discussion on why we need computer architects, and also Deep Learning.) 250 | 251 | We need to have negative numbers, so how to handle these? 252 | 253 |     First attempt: first digit (well, leading digit, so leftmost) represents 254 |     sign, remaining 7 (assuming 8 bits total) are for actual numerical 255 |     content, "magnitude". But that's bad --- at least for integers --- since 256 |     we have several special cases to consider, and our hardware performance 257 |     will suffer. 258 | 259 |     Better: two's complement. With 4 bits, we have 16 total bit patterns: 260 |     patterns 0 1 2 3 4 5 6 7 keep their unsigned values, while 261 |     patterns 8 9 10 11 12 13 14 15 represent -8 -7 -6 -5 -4 -3 -2 -1. 262 | 263 |     Thus, -3 in decimal maps to the bit pattern for 13. This allows us to keep 264 |     addition/subtraction rules for binary numbers consistent. Right, this is 265 |     StackOverflow: "Two's complement is a clever way of storing integers so that 266 |     common math problems are very simple to implement." In other words, the 267 |     hardware doesn't have to make any special rules. 268 | 269 |     But remember that these are just bits. Regardless of signed or unsigned, 270 |     it's bits (four, in this case) that the hardware sees.
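A quick Python sketch of the 4-bit mapping (my own helper, not from lecture):

```python
# Two's complement with 4 bits: the hardware sees an unsigned pattern 0..15;
# patterns 8..15 are *interpreted* as -8..-1 by subtracting 2^4.

BITS = 4

def as_signed(pattern):
    return pattern - 2**BITS if pattern >= 2**(BITS - 1) else pattern

assert as_signed(13) == -3            # -3 in decimal maps to pattern 13
assert [as_signed(p) for p in range(8, 16)] == [-8, -7, -6, -5, -4, -3, -2, -1]

# The same adder serves both interpretations: 3 + 11 = 14 unsigned,
# which read as signed is 3 + (-5) = -2.
assert as_signed((3 + 11) % 2**BITS) == -2
```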
271 | 272 | A good analogy with alarm clocks in the lecture, particularly because my 273 | alarm clock requires me to keep incrementing the time before it "starts 274 | over" at the current value. Thus, 3+11=14 in unsigned, but this is 275 | 3-5=-2 in two's complement. Fortunately, the "adder" doesn't care, it 276 | just does the addition the same way, and we interpret it under the 277 | assumption that it's two's complement. 278 | 279 | It's not a "sign+magnitude" representation, because the second part 280 | isn't a "magnitude". 281 | 282 | How to do negation in two's complement: INVERT the bits, then add one. 283 | Don't forget to add one. 284 | 285 | The most significant bit (MSB) also indicates the sign, as in our first 286 | representation, but doesn't have the drawback of painful math or a +0 and -0 287 | annoyance as in the signed integer representation. 288 | 289 | With two's complement, **if signs are different**, no overflow detection 290 | needed. This makes sense, you can't add a positive and a negative number and 291 | get something exceeding your range, that's like a shrinkage factor. 292 | 293 | Adding numbers of different bit widths: 294 | - Unsigned: simply pad zeros at the most significant bits. 295 | - Signed: **sign extension**, pad either all 0s or all 1s, depending on 296 | the current sign of the number. 297 | 298 | 299 | Break / This is Not on the Exam 300 | 301 | Prof. Asanović talked about Google's TPU. :-) My God, it's so impressive. It 302 | has an **internal** matrix multiply unit. Ironically, it's useless for 303 | everything **except** for matrix multiplies. Then he talked about the IBM 304 | Mainframe. 305 | 306 | 307 | C Primer 308 | 309 | Remember, we're not giving a tutorial on C, the class is about the 310 | hardware/software interface. 311 | 312 | Bla bla bla hello world. Use printf("") for printing. Don't forget \n 313 | newlines!! Think of System.out.print("") in Java (not the println version). 
Also don't forget semicolons. And `#include <stdio.h>`. They use `int
main(void)` whereas I use `int main()`, but there's no difference in C++ and
in C the difference is "questionable". I think it doesn't matter for what I
would use. But use int main(void) instead, to clearly specify that the
function doesn't take any arguments (according to StackOverflow).

Then compile and run using `gcc program.c ; ./a.out`.

Progression:
    [...].c --(compiler)--> [...].o --(linker)--> [a.out]
From source (i.e. text) files to "machine code object files" to the actual
executable file, which is what gets run. The linker pulls in library code if
we're using it (stdio.h is just the header; the C standard library itself is
what gets linked). And the linker links together the object files from the
several [...].c files that we wrote, since we should split up our C code
into several files to stay sane.

    There's *also* a "pre-processor" that runs before the compiler, which
    (1) converts each comment to a single space and (2) takes care of the
    logic for directives that start with #. "Macros" get expanded to
    replace their stuff inline, so for instance, if I look at the
    intermediate file output from Hello World, it could be very long. But
    that's OK, it's how C works. :-)

Different from interpreted languages, such as Python, which are run
"line-by-line".

More similar to Java, but Java converts to "byte code", which is an EXAMPLE
of an assembly language.

Advantages:
- Faster. This is why numpy uses a C/C++ "back end"; more on that later
  once I better understand it.
- Note that computers can only "run" machine code, the lowest-level
  instructions. Everything else is layer upon layer of abstraction on top
  of that. Compilation can get our C code to machine code in "one shot".

Disadvantages:
- Long time to compile.
- Need tools like "make" to avoid recompiling unchanged code. OK, maybe
  this isn't a real disadvantage, since we should be using make by
  default.
- Architecture- and operating-system-specific.

C Type Declarations

Examples:
    int a;
    float b;
    char c;
Like Java, we have to declare variables beforehand, and the type can't
change. (Usually, floats are 32 bits and doubles are 64 bits.)

Can do:
    float pi = 3.14; /* ok this is mathematically awful but w/e */
But probably better to have it as a constant:
    const float pi = 3.14;

For 'unsigned' stuff, just put that before the type, e.g. 'unsigned long'.

Enumerations:
    typedef enum {red, green, blue} Color;
We can then write, and call `switch` on:
    Color pants = green; /* to use one example ... */

AH, now it's clear: in Java we KNOW ints are 32 bits, but in C they could
be 16, 32, or 64 bits. Though on my system it's 32, and I think that makes
the most sense.
    To check, use sizeof(int) and print it. I get '4', which must mean the
    BYTE count.

No boolean data type! I learned this the hard way. (bool is a C++ thing;
C99 did add _Bool and <stdbool.h>, but classic C has none.)
0 is false, anything else is true (but I guess use 1 by convention).

Standard function definitions, like Java. But it looks like we don't need
to use 'public...' or 'public static...'.

Uninitialized variables: if you don't initialize them, they take on
whatever random value is in memory, i.e. garbage. Their for-loop example
prints different values of (uninitialized) x because they have another
function which messes around with the memory on the stack. I think if that
wasn't there, you would get the same "garbage" value for x. [Update: heh, a
student asked the same question. But the Prof. said we should not rely on
that. Which is fine, this was only a theoretical question.]
structs:
- Groups of variables
- Like Java classes, but no methods
- One-liner example syntax:
    typedef struct {int x, y;} Point;
- Then to create one:
    Point p = { 77, -8 };

Concluding Thoughts

NO CLASSES in C! You need C++ for that, according to my own experience, and
StackOverflow. For a while, C++ was known as "C with classes". But now it's
just bloated. In C, simulate some class functionality by using structs.
Thus, C shouldn't qualify as "object-oriented".

The other main programmatic difference from Java (the first one being no
classes) is that in C we have explicit pointers. Let's discuss that in the
next lecture.

There are additional differences in the compilation, obviously.


TODO BELOW ... (for F-2017)

****************************
* Lecture 3: Pointers      *
* Given: September 1, 2016 *
****************************

Pointers in C

Processor vs memory in a computer: two different components.
    The former has registers, the ALU, etc.
    The latter contains the various bytes that form the programs, data, etc.

Don't confuse a memory address with a value. It's like humans are the
'values' living in their homes as 'memory addresses'. A POINTER is a MEMORY
ADDRESS. When we write int a; and then a = -85;, the memory address is some
unknown integer and the value is -85.

Know the differences:
    int *x;     // variable x is an address to an int
    int y = 9;  // y is an int with value 9
    x = &y;     // assigns the *address of* y (almost certainly not 9) to x
    int z = *x; // assigns the *value* x points at (should be 9) to z
    *x = -7;    // assigns -7 to what x is pointing at

Interesting, I get x=1505581164 y=-7 z=9 as the printf output, so when we
assign the memory address of y to x and then modify what x is pointing at,
that will *also* modify the value of y.
Interesting ... and a bit of a pain to track.

Another thing: the type of x is 'int*', NOT 'int'. Watch out! It might be
helpful to visualize this the way CS 61C does with its charts. Can write
int* pi; or int *pi;; seems like the class does the latter. The latter is
unambiguous, especially for char *a,*b; vs char* a,b;: in the second
version that 'b' is NOT a pointer to a char.

    Use generic pointers (void *) for applications such as allocating or
    freeing memory, where the code may need to point to arbitrary stuff.

    Have pointers to structs as well, which is where we get the arrow
    syntax "->" that I've seen before.

    Another trick: *(&a) is just a, I believe.

One thing: if we do '*pa = 5', this is NOT assigning to 'pa' but rather to
'*pa'. It doesn't really make sense to assign directly to 'pa' unless we
know a memory address. Do we really want to gamble that '5' is indeed the
correct _memory_address_ and not a _value_?

Functions
    These have pointers too. For arguments:
        void foo(int x, int *p) { ... }
    To call it, use:
        foo(a, &b);
    where a and b are both ints. The 'b' will get "passed by reference",
    since the pointer is passed by value. So it's like Java. There are a
    ton of blogs about this online.

PS: I really like their four-column table approach, really helps.

Arrays in C (syntactic sugar for pointers, really)

Several ways to declare basic arrays:
    int a[5];          // five-integer array, but contents are garbage
    int b[] = {1,2,3}; // explicitly assign elements, not garbage =)

In a memory diagram: arrays form a contiguous block of memory, index 0 at
the bottom, then proceeding up we increment the indices.

The #1 way we can shoot ourselves in the foot: no array bounds checking.
So remember array sizes, e.g.
by using:
    const int ARRAY_SIZE = 10;
and then using that ARRAY_SIZE throughout the program. Don't repeat
yourself!

Helpful to also use the sizeof() operator to get the number of bytes. I use
this frequently. But we can't assume anything about the hardware, other
than sizeof(char) == 1. Don't assume: use sizeof(...) instead!

Pointer Arithmetic

PS: computers use byte addresses, so think of the memory for an int as
taking up four slots, because (at least in one example and on my machine) C
ints are 4 bytes.

I see, we can do stuff like:
    char c[] = {'a','b'};
    char *pc = c; // from webcast, also same as &(c[0])
so pc is now a char* type, and *pc is 'a'. If we do *pc++; then afterwards
*pc is 'b': the POINTER is incremented (postfix ++ binds to pc), not the
value it points at. Yeah, it's confusing; this time we actually want to
manipulate the address.

The array name is a pointer to the 0th element of the "array".
    char *pstr;
    char astr[];
are identical except that we can do pstr++ while we can't do astr++.
    ALSO: astr[2] == *(astr+2)

OH I see, when we do pc++ the compiler actually adds sizeof(...) and takes
care of that logic for us; it doesn't literally "add one". Thanks!

Bad style to interchange arrays and pointers.

For functions, you can define them in the following ways:
    foo(int array[], unsigned int size);
    foo(int *array, unsigned int size);

Be careful when doing sizeof(a) with 'a' an array, because that might
represent a pointer, which is usually 8 bytes on modern 64-bit machines.
But if you start with int a[10] and do sizeof(a) in the scope where it's
declared, you actually get 10*sizeof(int); the array only "decays" to a
pointer when passed to a function.
It makes no sense, and is also illegal, to do the following:
- Add two pointers
- Multiply two pointers
- Subtract a pointer from an integer
We CAN, however, compare pointers to NULL, for instance (in C it's the NULL
macro, not a 'null' keyword as in Java).

Pointers to pointers also exist. Oh no.

Strings and Main

C strings are "null-terminated character arrays":
    char s[] = "abc";
To find the length, iterate through the string and increment an index.
Detect the end of the string with the null character '\0'.

Don't forget the alternative way of writing main() with arguments:
    int main(int argc, char *argv[]) {...}
argv is an ARRAY of POINTERS (each of type char*) to the string arguments
from the command line. argc is simply the number of arguments.

    When we run ./a.out, the './a.out' part is argv[0]; other arguments
    after that go in later components, in order. It's similar to Python.

Concluding Remarks

Pointers are the same as (machine) memory addresses.
Except for void*, pointers know the type and size of the objects they point
to (is this why sizeof(a) for 'int a[10]' is known? Not sure).
Pointers are powerful, but dangerous without careful planning.


********************************
* Lecture 4: Memory Management *
* Given: September 6, 2016     *
********************************

TODO
--------------------------------------------------------------------------------