├── .gitignore
├── Differential_Equations
│   └── README.md
├── Functional_Programming
│   ├── Other_Notes
│   │   └── sbt_and_eclipse.txt
│   ├── README.md
│   └── week1
│       └── week1_notes.txt
├── Math_104_Berkeley
│   ├── README.md
│   └── kenneth_ross_notes.txt
├── Deep_Learning
│   ├── README.md
│   ├── dlbook_chapter06notes.txt
│   ├── dlbook_chapter02notes.txt
│   ├── dlbook_chapter20notes.txt
│   ├── dlbook_chapter17notes.txt
│   ├── dlbook_chapter03notes.txt
│   ├── dlbook_chapter09notes.txt
│   ├── dlbook_chapter04notes.txt
│   ├── dlbook_chapter08notes.txt
│   ├── dlbook_chapter16notes.txt
│   ├── dlbook_chapter14notes.txt
│   ├── dlbook_chapter11notes.txt
│   ├── dlbook_chapter07notes.txt
│   ├── dlbook_chapter12notes.txt
│   ├── dlbook_chapter05notes.txt
│   └── dlbook_chapter10notes.txt
├── How_People_Learn
│   ├── README.md
│   ├── Part_04_Future_Directions.txt
│   ├── Part_01_Intro.txt
│   ├── Part_03_Teachers_and_Teaching.txt
│   └── Part_02_Learners_and_Learning.txt
├── Random
│   ├── Ray_Notes.txt
│   └── AWS_Notes.txt
├── README.md
├── CS61C_Berkeley
│   ├── README.md
│   └── CS61C_Lectures.txt
└── Robots_and_Robotic_Manip
    ├── dVRK.text
    ├── Modern_Robotics_Mech_Plan_Control.txt
    ├── Fetch.text
    ├── HSR.text
    ├── Mathematical_Introduction_Robotic_Manipulation.txt
    └── ROS.text
/.gitignore:
--------------------------------------------------------------------------------
1 | *.swp
2 | *.DS_Store
3 |
--------------------------------------------------------------------------------
/Differential_Equations/README.md:
--------------------------------------------------------------------------------
1 | # Differential Equations
2 |
3 | ...
4 |
--------------------------------------------------------------------------------
/Functional_Programming/Other_Notes/sbt_and_eclipse.txt:
--------------------------------------------------------------------------------
1 | Wow, learning how to use this stuff is really annoying. =(
2 |
--------------------------------------------------------------------------------
/Math_104_Berkeley/README.md:
--------------------------------------------------------------------------------
1 | This is a real analysis review.
2 |
3 | Fortunately, the textbook is supposed to be easy to read. It is also freely
4 | available online.
5 |
--------------------------------------------------------------------------------
/Deep_Learning/README.md:
--------------------------------------------------------------------------------
1 | I'm reading the Deep Learning book by Goodfellow et al.
2 |
3 | TODOs:
4 |
5 | - Chapter 13
6 | - Chapter 15
7 | - Chapter 18
8 | - Chapter 19
9 | - Chapter 20 (all of it!)
10 |
11 |
--------------------------------------------------------------------------------
/How_People_Learn/README.md:
--------------------------------------------------------------------------------
1 | # How People Learn: Brain, Mind, Experience, and School: Expanded Edition
2 |
3 | From National Academies Press. Looks like it was published in 2000, so I wonder
4 | how much of it is up to date ...
5 |
--------------------------------------------------------------------------------
/Random/Ray_Notes.txt:
--------------------------------------------------------------------------------
1 | I'm trying to learn how to use Ray. See:
2 |
3 | https://rise.cs.berkeley.edu/projects/ray/
4 |
5 | for an overview of the project. (Unfortunately, it's hard to do a Google search
6 | on that, but I will manage.)
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Self_Study_Courses
2 |
3 | These will be public notes for courses that I'm self-studying.
4 |
5 | Current TODO list:
6 |
7 | - Finish Goodfellow et al
8 | - Finish CS 61C self-studying
9 | - Study robotic manipulation
10 |
--------------------------------------------------------------------------------
/How_People_Learn/Part_04_Future_Directions.txt:
--------------------------------------------------------------------------------
1 | Chapter 10: Conclusions
2 | Chapter 11: Next Research Steps
3 |
4 | Mostly, these two chapters wrap up the book. I'm most interested in how
5 | humans/children learn, not so much about practical public policy or how to use
6 | technology.
7 |
8 | The first parts of Chapter 10 would be good to review periodically.
9 |
--------------------------------------------------------------------------------
/CS61C_Berkeley/README.md:
--------------------------------------------------------------------------------
1 | Doing this because I (a) need to review computer architecture and (b) want practice with the C language.
2 |
3 | Relevant links:
4 |
5 | - https://github.com/61c-teach
6 | - https://cs61c.org/
7 | - https://cs61c.org/resources/exams
8 |
9 | Looks like Berkeley changed to this format recently. Some of the courses have webcasts, though they might not all be public.
10 |
--------------------------------------------------------------------------------
/Functional_Programming/README.md:
--------------------------------------------------------------------------------
1 | This is the Coursera course on Functional Programming, taught by the person who
2 | created the Scala Programming Language. =)
3 |
4 | Link to course: [click here][1]
5 |
6 | It says it's from January 30 to March 9; the year isn't stated but I assume it's
7 | 2017, which means this could be the first Coursera course that I actually follow
8 | from start to finish in time. I hope.
9 |
10 | [1]:https://www.coursera.org/learn/progfun1/
11 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/dVRK.text:
--------------------------------------------------------------------------------
1 | How to use the dVRK in the context of ROS. Reading the ROS tutorials helped
2 | clarify why ROS can auto-complete and refer to files elsewhere on the
3 | computer: the ROS path points to those directories. Also, the dVRK
4 | launch files involve `.xml` files similar to those shown in the tutorials. Use
5 | `rosed` to edit without having to search for a path.
6 |
7 | Focus on the basic skeleton. How do we start?
8 |
9 |
10 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter06notes.txt:
--------------------------------------------------------------------------------
1 | *************************************************
2 | * NOTES ON CHAPTER 6: Deep Feedforward Networks *
3 | *************************************************
4 |
5 | This chapter *should* be review for me. Read through, but don't get bogged
6 | down too much in backpropagation. By the way, these technically include
7 | convolutional nets, but we don't cover those in detail until Chapter 9.
8 |
9 | The first part (Section 6.1) starts off with the classic example of linear
10 | models failing to solve an XOR, but a simple ReLU two-layer network can do it.
11 |
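As a sanity check, here is a minimal numpy sketch of that construction, with
the hand-picked weights from Section 6.1 of the book (the code itself is mine):

    import numpy as np

    # Hand-picked weights from the book's XOR construction (Section 6.1).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])              # first-layer weights
    b1 = np.array([0.0, -1.0])               # first-layer biases
    w2 = np.array([1.0, -2.0])               # second-layer weights

    def xor_net(x):
        h = np.maximum(0.0, W1.T @ x + b1)   # ReLU hidden layer
        return w2 @ h                        # linear output layer

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(np.array(x, float)))   # prints 0, 1, 1, 0
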
12 | Most neural networks are trained with maximum likelihood, so the cost
13 | function is the negative log-likelihood:
14 |
15 | J(\theta) = - E_{x,y} [log p_\theta(y|x)]
16 |
17 | This is **equivalently** described as the cross entropy between the model
18 | distribution and the data distribution. Interesting.
19 |
20 | There's some stuff about the cross entropy and viewing the neural network as a
21 | functional. I should review these later if I have time. BTW, they say that cross
22 | entropy is preferable to MAE or MSE, due to getting better gradient signals
23 | (Section 6.2.1).
24 |
25 | Section 6.3 is about the choice of hidden units. I'm skimming this.
26 |
27 | Section 6.5 is about backpropagation. I'm skimming this. It's looong.
28 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Modern_Robotics_Mech_Plan_Control.txt:
--------------------------------------------------------------------------------
1 | Notes on the textbook:
2 |
3 | Modern Robotics: Mechanics, Planning, and Control, 2017
4 | Kevin M. Lynch and Frank C. Park
5 |
6 | Homepage: http://hades.mech.northwestern.edu/index.php/Modern_Robotics
7 |
8 | It looks very similar to Murray, Li, and Sastry's book.
9 |
10 | **********************
11 | * Chapter 1: Preview *
12 | **********************
13 |
14 | One way of categorizing robots:
15 |
16 | - Open chain: all joints are "actuated," i.e., we can move them. Example:
17 | most industrial robotic arm manipulators.
18 | - Closed chain: only some joints are "actuated." Example: Stewart-Gough
19 | Platform (!!)
20 |
21 | The following joints have one degree of freedom, for rotation and translation,
22 | respectively.
23 |
24 | - Revolute joints: these allow for rotation about the joint axis.
25 | - Prismatic joints: these allow for linear translation along the joint axis.
26 |
27 | Use "Degrees of Freedom" to specify the number of "actuated joints." However, a
28 | (potentially better) sense of DoF is the notion of **configuration spaces**:
29 |
30 | > A more abstract but equivalent definition of the degrees of freedom of a robot
31 | > begins with the notion of its configuration space: a robot's configuration is
32 | > a complete specification of the positions and orientations of each link of a
33 | > robot, and its configuration space is the set of all possible configurations
34 | > of the robot.
35 |
36 |
37 | **********************************
38 | * Chapter 2: Configuration Space *
39 | **********************************
40 |
41 | TODO
42 |
43 |
44 | *********************************
45 | * Chapter 3: Rigid Body Motions *
46 | *********************************
47 |
48 | TODO
49 |
50 |
51 | *********************************
52 | * Chapter 4: Forward Kinematics *
53 | *********************************
54 |
55 | Studies the problem of: given a set of input joint values, find the output
56 | position and orientation of the reference frame attached to the end-effector.
57 | This is easily done for an open-chain robot, and the default solution is the
58 | "Product of Exponentials" (PoE) formula.
59 |
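For my own reference, the space-frame PoE formula (standard form; I'm writing
it from memory, so double-check it against the book):

    T(theta) = e^{[S_1] theta_1} e^{[S_2] theta_2} ... e^{[S_n] theta_n} M

where M is the end-effector configuration at the zero (home) position and the
[S_i] are the matrix forms of the joints' screw axes in the fixed frame.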
--------------------------------------------------------------------------------
/How_People_Learn/Part_01_Intro.txt:
--------------------------------------------------------------------------------
1 | Part 1: Introduction
2 |
3 |
4 | Chapter 1: Learning: From Speculation to Science
5 |
6 |
7 | Very important:
8 |
9 | - We need to stop teaching and testing based on factual knowledge, because the
10 | amount of facts to know is beyond what any one person can handle. The focus of
11 | teaching should be more on learning how to acquire and synthesize facts to
12 | "pick things up" quickly, so to speak. That's not to say facts are
13 | unimportant. It's just that the bigger priority should be understanding the
14 | connections among the facts so that it is easier to transfer and generalize to
15 | novel scenarios. Experts are very good at synthesizing, connecting, and
16 | efficiently organizing their reservoirs of knowledge.
17 |
18 | - Students start with lots of prior knowledge and are not simply "empty vessels"
19 | that teachers fill with knowledge. It's necessary to check whether
20 | their prior knowledge is inhibiting or misleading them when learning about
21 | various concepts. Classic scenario: *Fish Is Fish*, where a fish asks an
22 | amphibian what land-based animals are like, but simply imagines them as fish
23 | with legs, fish with udders, etc. Another example: teaching students the Earth
24 | is round when they think it's flat.
25 |
26 | Also important:
27 |
28 | - There should be a focus on improving students' understanding of their own
29 | ability. They should be able to tell when they need help. The ability to
30 | predict one's performance on a task is called "metacognition" (see Chapters 2
31 | and 3).
32 |
33 | - Don't do shallow coverage of every possible topic within reach; instead,
34 | reduce the number of topics and go through a few in depth to practice deeper
35 | understanding.
36 |
37 | - And a bunch of more mundane, practical stuff: need to change incentives of
38 | teaching and standardized tests so that it's not fact-based yet is still fair,
39 | need to do the same for adult teaching, etc.
40 |
41 | - Don't just focus on the best talent, need to work for lots of students. Well,
42 | it is important to develop top students more than we do in the US, but it's
43 | also clear that we need to broaden the population who have access to quality
44 | education.
45 |
46 | Stuff I forgot to record after a first pass:
47 |
48 | - Don't ask which teaching technique is best because that's like asking which
49 | tool is best: it depends on the task and materials at hand.
50 |
51 | - Don't forget all those hours students spend _outside_ of school. There are so
52 | many overlooked opportunities there. I should know, from personal experience.
53 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter02notes.txt:
--------------------------------------------------------------------------------
1 | **************************************
2 | * NOTES ON CHAPTER 2: Linear Algebra *
3 | **************************************
4 |
5 | This chapter was pure review for me, but some highlights and insights:
6 |
7 | - They talk about tensors but I'm kind of familiar with them already, mostly
8 | when I have to deal with numpy arrays that have at least three coordinate
9 | dimensions (or four, in some deep learning applications with images).
10 |
11 | - Columns of A can be thought of as different directions we're spanning out of
12 | the origin, and the components of x (as in the matrix-vector product Ax)
13 | indicate how far we move in those directions.
14 |
15 | - We say "orthogonal" matrices, but there's no terminology for matrices whose
16 | columns and/or rows are mutually orthogonal, but *not* orthonormal.
17 |
18 | - Don't forget **eigendecompositions**! They're very important. Interesting
19 | intuition:
20 |
21 | > [...] we can also decompose matrices in ways that show us information about
22 | > their functional properties that is not obvious from the representation of
23 | > the matrix as an array of elements.
24 |
25 | Eigendecomposition of matrix: A = V * diag(eig-vals) * V^{-1}, where V
26 | has columns which correspond to (right) eigenvectors of A.
27 |
28 | Not every matrix can be decomposed this way, but we're usually concerned with
29 | real symmetric A. In fact, in that case we can say even more: we can construct
30 | an *orthogonal* V so our V^{-1} turns into the easier-to-deal-with V^T matrix.
31 |
32 | - An alternative, and more generally applicable decomposition, is the SVD. (Why
33 | is it more general? Well, every real matrix has an SVD, including non-square
34 | ones, but non-square matrices have undefined eigendecompositions.) In their
35 | formulation, the inner matrix of singular values is rectangular in general
36 | (other books/references have *square* matrices, but the definitions are
37 | essentially equivalent).
38 |
39 | - Moore-Penrose pseudoinverse helps us (sometimes) solve linear equations for
40 | non-square matrices, in which case the "normal" inverse cannot be defined. Use
41 | the formula A^+ = V * D^+ * U^T for the pseudoinverse. When A is a fat matrix,
42 | the solution x = A^+ * y provides us with the minimum Euclidean norm solution
43 | (I must have forgotten this fact).
44 |
45 | - For the trace, don't forget about the **cyclic property**!!!
46 |
47 | - The chapter concludes with an example of **Principal Components Analysis**,
48 | i.e. how to apply lossy compression to a set of data points while losing as
49 | little information as possible. By "compression" we refer to shrinking points
50 | from R^m into R^n where n < m. This is necessarily lossy. To optimally encode
51 | a vector, use f(x) = D^Tx, which we determined from L2 norm minimization. The
52 | decoder is g(c) = Dc = DD^Tx which reconstructs an approximated version of the
53 | input from the compression. Then the next (and final) step is to find D. They
54 | do this by also using an L2 minimization. They provide some nice tips on how
55 | to write out optimization problems nicely and compactly. This is again review
56 | for me.
57 |
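To make that PCA part concrete, here's a tiny numpy sketch of the
encode/decode pair (my own toy example, not the book's):

    import numpy as np

    # PCA as lossy compression: encode with f(x) = D^T x, decode with
    # g(c) = D c, where D's columns are the top-n right-singular vectors
    # of the centered data (equivalently, top eigenvectors of X^T X).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))        # 100 points in R^5
    X = X - X.mean(axis=0)               # center the data first
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    D = Vt[:2].T                         # (5, 2): compress R^5 -> R^2

    codes = X @ D                        # encode: rows are f(x) = D^T x
    X_rec = codes @ D.T                  # decode: g(c) = D c
    print(np.linalg.norm(X - X_rec))     # nonzero: compression is lossy
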
58 | Well, I'm pleased with this chapter. =) I should expand upon some of these
59 | concepts in personal blog posts, particularly that last part (the proof by
60 | induction).
61 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter20notes.txt:
--------------------------------------------------------------------------------
1 | ***********************************************
2 | * NOTES ON CHAPTER 20: Deep Generative Models *
3 | ***********************************************
4 |
5 | This is a **long** chapter, and likely contains most of the stuff at the
6 | research frontiers, at least those that interest the authors (Generative
7 | Adversarial Networks lol).
8 |
9 |
10 | Section 20.10: Directed Generative Nets
11 |
12 | Both VAEs and GANs are part of this section, which refers to using directed
13 | graphical models to "generate" something, or basically mirror a probability
14 | distribution. The first two sections, "Sigmoid Belief Nets" and "Differentiable
15 | Generator Nets" seem markedly less important, though the latter at least makes
16 | the point that a generator should be differentiable. It also makes the important
17 | distinction between a generator directly generating samples x, OR generating a
18 | DISTRIBUTION, which we then sample from for x. If we directly generate discrete
19 | values, the generator is not differentiable, FYI.
20 |
21 |
22 | Section 20.10.3: Variational Autoencoders
23 |
24 | - Trained purely with gradient methods.
25 |
26 | - To *generate* a sample, need to first sample a code z which has relevant
27 | latent factors, and then run through a generator ("decoder") network which
28 | will give us a mean vector (or maybe a second output with the covariance). We
29 | then sample from that Gaussian. Yes, this makes sense. Generating z may just
30 | be done with our prior.
31 |
32 | - Ah, but during training, we have to make use of our *encoder* network, since
33 | otherwise the generator/decoder wouldn't work well. The encoder network's job
34 | is to produce a useful z.
35 |
36 | - Training is done by maximizing that variational lower bound for each data x:
37 |
38 | L(q) <= log p_model(x)
39 |
40 | where q is the distribution of the encoder network. Essentially, the encoder
41 | network approximates an intractable integral!
42 |
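  (For my own reference, the standard way to unpack that bound, writing from
  memory rather than verbatim from the book:

    L(q) = E_{z ~ q(z|x)} [ log p_model(x|z) ] - D_KL( q(z|x) || p_model(z) )

  i.e., a reconstruction term plus a KL term keeping the encoder near the prior.)
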
43 | - Some downsides: VAEs output somewhat blurry images and do not fully utilize
44 | the latent code z. However, GANs seem to share that second problem.
45 |
46 | - VAEs have been extended in many ways, e.g. DRAW. I remember that paper when I
47 | read it half a semester ago, but that was before I had RNN intuition.
48 |
49 | - Advantage: the training process is basically training an autoencoder. Thus, it
50 | can learn a manifold structure since that's what autoencoders can do!
51 |
52 |
53 | Section 20.10.4: Generative Adversarial Networks
54 |
55 | Use this loss function formulation for the Generator:
56 |
57 | > In this best-performing formulation, the generator aims to increase the log
58 | > probability that the discriminator makes a mistake, rather than aiming to
59 | > decrease the log probability that the discriminator makes the correct
60 | > prediction.
61 |
62 | Yes, I tried this for my own work and have had better results with this
63 | technique. It seems to be more important to do this than to do one-sided label
64 | smoothing, batch normalization, etc., which makes sense as this was the rare
65 | "trick" that made it in the original 2014 NIPS paper.
66 |
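In symbols (my paraphrase of the standard formulation): rather than minimizing

    E_z[ log(1 - D(G(z))) ]

which saturates when the discriminator confidently rejects samples, the
generator maximizes

    E_z[ log D(G(z)) ]

the so-called non-saturating loss.
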
67 | - Then Sections 20.10.5 through 20.10.10 go through more topics that I don't
68 | have time to learn.
69 |
70 |
71 | Section 20.14: Evaluating Generative Models
72 |
73 | Yeah, I had a feeling this would be here, because some of this is quite
74 | subjective, and it seems like we have to resort to hiring human workers in
75 | person or via Amazon Mechanical Turk. The authors make a good point that in
76 | object recognition (for instance) we can alter the input. Some networks
77 | downscale to 256x256, others to 227x227, etc., but with generative models, if
78 | you change the input, the task fundamentally changes, and thus we can't compare
79 | the two procedures. Oh, and they also point out differences in log p(x) if x is
80 | discrete r.v. or continuous, in which case the former maximizes at log 1 = 0 and
81 | the latter can be arbitrarily high since p(x) could theoretically approach
82 | infinity.
83 |
--------------------------------------------------------------------------------
/How_People_Learn/Part_03_Teachers_and_Teaching.txt:
--------------------------------------------------------------------------------
1 | Part 3: Teachers and Teaching
2 |
3 |
4 | Chapter 6: Design of Learning Environments
5 |
6 | Very important:
7 |
8 | - Use learning-centered (actually, "learner centered") environments, a bit
9 | unclear to define but I think mostly about better understanding of students'
10 | prior knowledge. Again, see previous chapters about this.
11 |
12 | - Need some form of knowledge learning, so students need to learn something
13 | beyond just "learning how to learn". (Edit: not really the right way to define
14 | this but again not a clear definition, but mostly about how to make students
15 | knowledgeable, so that they can do effective transfer --- again, see previous
16 | chapters.)
17 |
18 | - Students need feedback (see "deliberate practice"), but not just the kind that
19 | comes with grades and tests. Also, feedback is most effective when students can
20 | revise their thinking on the _current_ subject matter, not when it arrives with
21 | a test graded after they've already moved on to newer concepts.
22 |
23 | - Must consider the community/culture aspect, which obviously affects learning.
24 | For instance, Anglo culture emphasizes talking and asking questions, but
25 | others might not (and this affects how teachers evaluate students). Also,
26 | seriously, when are we going to talk about multi-racials? Gaaaah, so
27 | disappointing.
28 |
29 | Also important:
30 |
31 | - A bunch of stuff on the merits of television (remember, this was 2000) but not
32 | really relevant for what I hope to get out of this book. Also a bunch of stuff
33 | on how to evaluate teachers for practical purposes.
34 |
35 | Stuff I didn't remember:
36 |
37 | - While some may say schools aren't working, the reality is that we're asking
38 | for way more out of students than in past eras. In the past, being literate
39 | could have simply meant being able to sign your name. Now we're getting to the
40 | point where we need students to interpret and compose potentially complicated
41 | written stuff.
42 |
43 | - Eh, a relevant quote: "Learning theory does not provide a simple recipe for
44 | designing effective learning environments; similarly, physics constrains but
45 | does not dictate how to build a bridge."
46 |
47 |
48 | Chapter 7: Effective Teaching Examples
49 |
50 | Very important:
51 |
52 | - History: focus not on facts but on analysis and understanding how to debate
53 | concepts. If you take students who know facts and historians who don't
54 | specialize in the same area, the students might actually do better on tests of
55 | factual knowledge, but won't be able to do any analysis. Effective teachers
56 | can promote debate, with careful monitoring of course. Interesting example:
57 | teacher asking students to put stuff in a time capsule, so they need to reason
58 | about important stuff.
59 |
60 | - Math: less focus on computation, more focus on problem solving skills.
61 | Analogies can help, e.g., modeling floors of a building to learn about
62 | negative numbers (negative floors = below ground level). Oh, also model-based
63 | stuff, where we apply math to building models of stuff (e.g., buildings).
64 | Could also clearly apply to physics.
65 |
66 | - Science: again, less on facts and more on analysis. Many students have
67 | intuition on stuff that's not correct in physics (e.g., forces and Newton's
68 | third law) so use live demos. Also recall earlier discussion about students
69 | not classifying problems correctly based on solution, but based on how they
70 | look (surface features). Students who are able to describe a problem
71 | "hierarchically" tend to do better --- though this is obviously vague.
72 |
73 | Also important:
74 |
75 | - Deliberate practice. Don't forget.
76 |
77 | - Effective teachers must know the subject matter AND be able to tell where
78 | students are likely to run into roadblocks.
79 |
80 | Stuff I didn't remember:
81 |
82 | - Practical stuff about instruction in large classes.
83 |
84 |
85 | Chapter 8: Teacher Learning
86 | (Not the most relevant chapter for me)
87 |
88 | There's a huge difference between education theory and practice, which leads to
89 | teachers rejecting (or not really diving into) research, lots of turnover,
90 | susceptible to local politics, etc. It's best to have workshops and other
91 | meet-ups where teachers can practice and discuss teaching techniques, etc.
92 |
93 |
94 | Chapter 9: Technology to Support Learning
95 | (Not the most relevant chapter for me)
96 |
97 | Well this is kind of out of date, I suppose. Mostly, technology has tradeoffs
98 | but can be used to bring in new contexts/demos to the class, etc. Particularly
99 | useful if it can help provide repeated feedback (remember deliberate practice).
100 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter17notes.txt:
--------------------------------------------------------------------------------
1 | ********************************************
2 | * NOTES ON CHAPTER 17: Monte Carlo Methods *
3 | ********************************************
4 |
5 | I think this chapter will also be review, but I have forgotten a lot of this
6 | material. It might also help me for my other projects with BIDMach.
7 |
8 | Heh, Las Vegas algorithms ... we never talk about those in Deep Learning. I
9 | agree, we should stick with deterministic approximation algorithms or Monte
10 | Carlo methods. Right, the point here is we have something we want to know, such
11 | as the expected value of a function (which depends on the data). Use sampling to
12 | take the average of f(x_1), ..., f(x_n) to form our estimate of E_p[f(x)] for
13 | some base distribution p. We can compute our expected error via the Central
14 | Limit Theorem. (Which John Canny said is "the most abused theorem in all of
15 | statistics" but never mind ...)
16 |
17 | But what if we cannot even sample from our base distribution p in the first
18 | place? For the above, we needed to draw x_1, ..., x_n somehow! We now turn to
19 | our possible solutions: importance sampling and MCMC. (The latter includes Gibbs
20 | sampling, and maybe even contains some variants of importance sampling? Not
21 | totally sure.)
22 |
23 | Section 17.2, Importance Sampling.
24 |
25 | I see, we can turn Equation 17.9 into Equation 17.10 just by switching the
26 | distribution the x_i's are drawn from, and adding in the necessary functions.
27 | Yes, they have the same expected value ... and I can see why the variance would
28 | be different. They argue that the minimum variance is the q^* in Equation 17.13.
29 | Yeah ... that seems familiar. How do they derive that? If indeed f did not
30 | change signs, then p and f cancel and the variance turns into a constant. Yay!
31 |
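A quick numpy sketch to convince myself of the basic identity (a toy example
of mine, not from the book):

    import numpy as np

    # Importance sampling: estimate E_p[f(x)] using samples from q,
    # reweighting each f(x_i) by p(x_i) / q(x_i).
    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2                     # E_p[f] = 1 for p = N(0, 1)
    x = rng.normal(0.0, 2.0, size=100_000)   # draw from q = N(0, 2)
    w = gauss_pdf(x, 0, 1) / gauss_pdf(x, 0, 2)   # importance weights
    print(np.mean(w * f(x)))                 # ~1.0, the true E_p[f(x)]
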
32 | I'm not really getting much out of this section other than definitions. I'll
33 | mark a TODO for myself to look at the examples they give in other parts of the
34 | book; this chapter is not as self-contained as Chapter 16.
35 |
36 | Section 17.3, Markov Chain Monte Carlo (my favorite!). They refer the reader to
37 | Daphne's book for more details (which I've read before!).
38 |
39 | MCMC methods use *Markov chains* to approximate the desired sample distribution
40 | (call it p_model). These are most convenient for energy-based models, p \propto
41 | exp(-E(x)), because those assign non-zero probability everywhere. They also
42 | assume that the energy-based models are for _undirected_ graphical models, where
43 | it's difficult to compute conditional probabilities.
44 |
45 | Procedure: start with random x, keep sampling, after a suitable burn-in period,
46 | the samples will start to come from p_model. Use a transition distribution
47 | T(x'|x), or a "kernel" in some of the literature.
48 |
49 | They show the usual matrix update in Equation 17.20, only for discrete random
50 | variables. Here, v should be in the probability simplex of dimension d where d
51 | is the number of values that x can take on. Remember, we're in discrete land
52 | here.
53 |
54 | Something new to me: the matrix "A" here is a "stochastic matrix": as we raise
55 | it to higher powers, its unit eigenvalues survive while the rest decay to
56 | zero. Interesting ... the Perron-Frobenius Theorem they refer to is from a
57 | 1907 paper (!!!).
58 |
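A tiny sketch of that convergence (my own two-state example):

    import numpy as np

    # Repeatedly applying a (column-)stochastic matrix A to a distribution
    # v converges to the stationary distribution: the eigenvalue-1 component
    # survives and the rest decay, per Perron-Frobenius.
    A = np.array([[0.9, 0.5],
                  [0.1, 0.5]])       # columns sum to 1
    v = np.array([1.0, 0.0])         # start anywhere in the simplex
    for _ in range(50):
        v = A @ v                    # the Equation 17.20-style update
    print(v)                         # ~[0.833, 0.167], the stationary dist
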
59 | They say "DL practitioners typically use 100 parallel Markov chains." Running
60 | parallel chains gives us more independent samples. Why haven't I been doing this ...
61 |
62 | Section 17.4, Gibbs Sampling (yay ...).
63 |
64 | Not much in this section, they just say that for Deep Learning, it's common to
65 | use these for energy-based models, such as RBMs, though we better do block Gibbs
66 | sampling.
67 |
68 | Other stuff:
69 |
70 | They point out that the main problem with MCMC methods in high dimensions is
71 | that they mix poorly; the samples are too correlated. It might get trapped in a
72 | posterior mode, but I'm curious: how much of a problem is that? For deep neural
73 | networks, the biggest problem is with saddle points. They argue that the MCMC
74 | methods will not be able to "traverse" regions in manifold space with high
75 | energy. Those result in essentially zero p(x) due to e^{-H(x)}.
76 |
77 | Oh, I see, now they talk about temperature to aid exploration. Yeah, I know
78 | about that! =) Finally, I can see a reference about temperature. Think of
79 | temperature as:
80 |
81 | p(x) \propto exp(-H(x)/T)
82 |
83 | Thus, when the temperature is high, the exponent -H(x)/T shrinks toward zero
84 | for every x, so the distribution becomes more uniform.
85 |
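Quick numerical check (mine):

    import numpy as np

    # p(x) \propto exp(-H(x)/T): high T flattens toward uniform,
    # low T concentrates mass on the lowest-energy state.
    H = np.array([1.0, 2.0, 5.0])        # energies of three states
    for T in (0.1, 1.0, 100.0):
        p = np.exp(-H / T)
        print(T, p / p.sum())            # T=0.1: ~one-hot; T=100: ~uniform
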
86 | You know, if there were more research done with MCMC methods and Deep Learning,
87 | wouldn't this chapter have discussed it? There isn't much here, to be honest,
88 | and lots of the references are pre-2012. Also, for tempering, why not cite
89 | some of the references that I cite in my own work?
90 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter03notes.txt:
--------------------------------------------------------------------------------
1 | **********************************************************
2 | * NOTES ON CHAPTER 3: Probability and Information Theory *
3 | **********************************************************
4 |
5 | This chapter was almost pure review for me, but some highlights and insights:
6 |
7 | - The chapter starts with some philosophy and some notation. Nothing new, though
8 | their notation is at least better than those from other textbooks I've read.
9 | Then they talk about definitions, marginals, conditionals, etc. It might be
10 | worth using their definition of covariance rather than the one I intuitively
11 | think of. High covariances (absolute values) mean values change a lot and are
12 | also far from their respective means often. Another concept to review:
13 | independence is a stronger requirement than zero covariance. Know the
14 | definition of a covariance matrix w.r.t. a random vector x.
15 |
16 | - Section 3.9: Common Probability Distributions, is pure review with the
17 | exception of the Dirac Distribution (to some extent). They mention sometimes
18 | needing to use the inverse variance for efficiency, but I doubt this is used
19 | often. Do remember why we like Gaussians: (1) the CLT,
20 | and (2) out of all distributions with the same variance and which cover the
21 | real line, it has the highest entropy, which can be thought of as imposing the
22 | fewest prior assumptions possible. (If we didn't have these restrictions, we
23 | could pick the *uniform* distribution, so be careful about the assumptions.)
24 | Finally, for mixture distributions, don't forget that the canonical way is to
25 | first choose a distribution, and then generate a sample from that. It is NOT,
26 | first generate k samples from all k distributions in the mixture, and then
27 | take a linear combination of those proportional to the probability weight. I
28 | was confused by that a few years ago. The component identity of a mixture
29 | model is often viewed as a **latent variable**.
30 |
31 | - Know the **logistic** function (yes) and the **softplus** function (huh, a
32 | smoothed ReLU).
33 |
34 | - There is some brief **measure theory** here:
35 |
36 | > One of the key contributions of measure theory is to provide a
37 | > characterization of the set of sets that we can compute the probability of
38 | > without encountering paradoxes. In this book, we only integrate over sets
39 | > with relatively simple descriptions, so this aspect of measure theory never
40 | > becomes a relevant concern. For our purposes, measure theory is more useful
41 | > for describing theorems that apply to most points in R^n but do not apply to
42 | > some corner cases.
43 |
44 | - Oh, I like their example with deterministic functions of random variables.
45 | I've seen this a few times in statistics, and the key with variable
46 | transformations like those is that we have to take into account different
47 | scales of functions, which is where the derivative term and Jacobians appear.
48 |
49 | - Section 3.13: Information Theory. My favorite part is Figure 3.6. I should
50 | spend more time thinking about it. Also, good intuition:
51 |
52 | > A message saying "the sun rose this morning" is so uninformative as to be
53 | > unnecessary to send, but a message saying "there was a solar eclipse this
54 | > morning" is very informative.
55 |
56 | Information theory is about quantifying the "information" present in some
57 | signal. Use the **Shannon entropy** to quantify the uncertainty in a
58 | probability **distribution**: - E_x[log p(x)]. This is "differential entropy"
59 | if x is continuous. Low entropy means the random variable is closer to
60 | deterministic, high entropy means it's very random and uncertain.
61 |
62 | Note: in most information theory contexts, the log is base 2, so we refer to
63 | this as "bits." In machine learning, we use the natural logarithm, so we call
64 | them "nats."
65 |
66 | As usual, define the KL divergence. KL(P||Q) = E_P[log(P(x)/Q(x))]. For now,
67 | assume the first distribution, P, is what we're drawing expectations w.r.t.
68 | For discrete r.v.s:
69 |
70 | > [KL Divergence is] the extra amount of information [...] needed to send a
71 | > message containing symbols drawn from probability distribution P, when we
72 | > use a code that was designed to minimize the length of messages drawn from
73 | > probability distribution Q.
74 |
75 | - Note also the **cross entropy** quantity: - E_P[log Q(x)].
76 |
77 | > Minimizing the cross-entropy with respect to Q is equivalent to minimizing
78 | > the KL divergence, because Q does not participate in the omitted term.
79 |
80 | This is why if Q is our model, we can minimize the cross entropy and make our
81 | Q close to P, which is the ground truth data distribution.
82 |
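A short numerical check of that equivalence (my own example); the omitted term
is H(P), which doesn't depend on Q:

    import numpy as np

    # For discrete P, Q: cross entropy H(P, Q) = H(P) + KL(P || Q),
    # so minimizing H(P, Q) over Q is minimizing KL(P || Q) over Q.
    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.5, 0.3, 0.2])
    H_P = -np.sum(P * np.log(P))          # Shannon entropy (in nats)
    KL = np.sum(P * np.log(P / Q))        # KL(P || Q)
    H_PQ = -np.sum(P * np.log(Q))         # cross entropy
    print(np.isclose(H_PQ, H_P + KL))     # True
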
83 | - The chapter concludes with some basic graphical models stuff.
84 |
85 | I like this chapter.
86 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter09notes.txt:
--------------------------------------------------------------------------------
1 | **********************************************
2 | * NOTES ON CHAPTER 9: Convolutional Networks *
3 | **********************************************
4 |
5 | This chapter should be review for me, but I do want to get clarification about
6 | (a) visualizing gradients/filters and (b) the "deconvolution" or "transpose
7 | convolution" operator. To a lesser extent, I'm interested in (c) how to
8 | implement efficient convolutions.
9 |
10 | - There is some stuff about whether we care about kernel flipping or not.
11 | However, this seems to be very specific about the convolution formula, and I
12 | doubt I'm going to go in detail on that since I'm not implementing them.
13 |
14 | - Understand why convolutions are so important: (1) **sparse interactions**, (2)
15 | **parameter sharing** and (3) **equivariant representations**. I know all of
16 | these, and to be clear on the last one, it's because we often want to
17 | represent the same shapes but in different locations in a grid. The book says
18 | "To say a function is equivariant means that if the input changes, the output
19 | changes in the same way" so maybe they're using a slightly different
20 | perspective. The first two together are mainly about the storage and
21 | efficiency improvements. The third doesn't apply to all transformations (for
22 | CNNs at least), but it definitely applies for translation.
23 |
24 | - In the pooling description (Section 9.3) the authors say non-linearities come
25 | **before** pooling and **after** convolutions. Indeed, this matches the
26 | ordering of the CNNs we wrote in CS 294-129. Intuitively, we already do a
27 | maximum operator in the standard 2x2 max pool, so why apply a ReLU **after**
28 | that? The major advantage of pooling is to make the network **invariant to
29 | slight transformations**. It also helps to reduce data dimensionality,
30 | particularly if we also padded the convolutions (and so the convolution layers
31 | do *not* reduce data dimensionality, but can leave that job for the pooling).
32 |
33 | - Interesting perspective: Section 9.4 explains why convolutions and pooling can
34 | be viewed as an infinitely strong prior. I can see why (beforehand) since
35 | these strongly assume the input is some grid-like thing, such as an image. (A
36 | weak prior has high entropy, like a uniform distribution or a Gaussian.) Be careful:
37 |
38 | > If a task relies on preserving precise spatial information, then using
39 | > pooling on all features can increase the training error.
40 |
41 | (This is an example of how architectures need to be tweaked for the task.)
42 |
43 | - Huh, I've never heard of **unshared convolution** nor **tiled convolution**.
44 | Eh, I can look them up later, they're alternatives to convolution but
45 | certainly less important to know.
46 |
47 | - Ah ... how to compute the **nightmarish** gradient of a convolution operator?
48 | The gradient is actually another convolution, but it's hard to derive
49 | algebraically. Convolutions are just (sparse) matrix multiplication assuming
50 | we've flattened the input tensor. We did that for CS 231n to flatten the input
51 | to shape (N, d1*d2*...*dn). Given that matrix, we take its transpose and that
52 | gives us the gradient for the backpropagation step, at least in theory. Wait,
53 | Goodfellow has a report from 2010 which explains how to compute these
54 | gradients. Interesting, how did I not know about this?
55 |
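To make the "transpose gives the gradient" point concrete, a 1-D sketch of
mine (note numpy's correlate is the no-flip "convolution" that DL libraries use):

    import numpy as np

    # A 1-D "valid" convolution written as a matrix multiply; the
    # backward pass w.r.t. the input is the transpose of that matrix.
    k = np.array([1.0, 2.0, 3.0])            # kernel, length 3
    n = 5                                     # input length; output length 3
    M = np.zeros((n - len(k) + 1, n))
    for i in range(M.shape[0]):
        M[i, i:i + len(k)] = k                # each row: shifted kernel

    x = np.arange(5.0)
    assert np.allclose(M @ x, np.correlate(x, k, mode='valid'))

    dy = np.ones(3)                           # upstream gradient dL/dy
    dx = M.T @ dy                             # dL/dx via the transpose
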
56 | - Something I didn't quite think of before, but it seems obvious: we can instead
57 | use **structured output** from a CNN that isn't a probability vector or
58 | distribution but some tensor that comes "earlier" in the net. This can give
59 | probabilities for each precise pixel in an image, for instance, if the tensor
60 | output is 3D and (i,j,k) means class i probability in coordinate (j,k). Yeah,
61 | overall there are quite a lot of options the user has in designing a CNN. This
62 | also enables the possibility of using recurrent CNNs, see Figure 9.17.
63 |
64 | - Section 9.8: **Efficient convolutions**. Unfortunately, there is only
65 | high-level discussion here, but I'm not sure I'd be able to understand the
66 | details anyway. They say:
67 |
68 | > Convolution is equivalent to converting both the input and the kernel to the
69 | > frequency domain using a Fourier transform, performing point-wise
70 | > multiplication of the two signals, and converting back to the time domain
71 | > using an inverse Fourier transform. For some problem sizes, this can be
72 | > faster than the naive implementation of discrete convolution.
73 |
74 | The last part of the chapter is about the neuro-scientific basis of CNNs. It's
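Quick check of that claim in numpy (my own toy example):

    import numpy as np

    # Convolution theorem: time-domain convolution equals point-wise
    # multiplication in the frequency domain (pad to the full length).
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([0.0, 1.0, 0.5])
    n = len(a) + len(b) - 1                   # full linear-conv length
    fft_conv = np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n)).real
    print(np.allclose(fft_conv, np.convolve(a, b)))   # True
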
75 | an easier read.
76 |
77 | Overall, I think this is a good chapter. Unfortunately, it didn't cover (a) or
78 | (b), the stuff I was wondering about earlier. =( OK, I think I understand how to
79 | visualize a weight filter, but maybe I should look back at that relevant CS 231n
80 | lecture.
81 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter04notes.txt:
--------------------------------------------------------------------------------
1 | *****************************************
2 | * NOTES ON CHAPTER 4: Numerical Methods *
3 | *****************************************
4 |
5 | This brief chapter will probably contain more new material for me compared to
6 | chapters 2 and 3, but still be mostly review. Here are the highlights:
7 |
8 | - We must delicately handle implementations of the **softmax function** to
9 | be robust to numerical underflow and overflow. The book amusingly just tells
10 | us to rely on Deep Learning libraries, which have presumably handled all these
11 | details for us.
12 |
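  The standard trick, for the record (a sketch of mine, not the book's code):

    import numpy as np

    # softmax(z) is invariant to subtracting max(z); doing so caps the
    # largest exponent at exp(0) = 1, avoiding overflow, and guarantees
    # the denominator is at least 1, avoiding division by underflowed 0.
    def softmax(z):
        shifted = z - np.max(z)
        e = np.exp(shifted)
        return e / e.sum()

    print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # no overflow
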
13 | - Don't forget about a matrix's **condition number**, which when we're dealing
14 | with a function f(x) = A^{-1}x, roughly tells us how "quickly" it perturbs,
15 | i.e. its sensitivity. Later, they point out:
16 |
17 | > The condition number of the Hessian at this point measures how much the
18 | > second derivatives differ from each other. When the Hessian has a poor
19 | > condition number, gradient descent performs poorly. This is because in one
20 | > direction, the derivative increases rapidly, while in another direction, it
21 | > increases slowly.
22 |
23 | - Review: the **directional derivative** of function f in direction u is the
24 | derivative of the function f(x + alpha*u) evaluated at alpha=0, i.e. the slope
25 | of f in direction u.
26 |
27 | - Review of Hessians, Jacobians, gradient descent, etc. The Hessian can be
28 | thought of as the Jacobian of the gradient (of a function from R^n to R).
29 | Also, regarding rows/columns of the Jacobians, if the function f is from R^m
30 | to R^n, the Jacobian is n x m, so just remember the ordering (I doubt it is
31 | strict since this is just a representation that's convenient for us, and we
32 | could also take transposes if we wanted). In Deep Learning, the functions we
33 | encounter almost always have symmetric Hessians. I like Equation 4.9 as it
34 | emphasizes how gradient descent can sometimes overshoot the target and result
35 | in a *worse* value, if the second-order term dominates.
36 |
37 | - To generalize the second derivative test (tells us a maximum, minimum, or
38 | saddle point) in high dimensions, we need to analyze the eigenvalues of the
39 | Hessian, e.g.:
40 |
41 | > When the Hessian is positive definite (all its eigenvalues are positive),
42 | > the point is a local minimum. This can be seen by observing that the
43 | > directional second derivative in any direction must be positive, and making
44 | > reference to the univariate second derivative test.
45 |
46 | Likewise, the reverse is true when the Hessian is negative definite. Note that
47 | the Hessian is a function of x (vector in R^n), so different x will result in
48 | different Hessians. See Figure 4.5 for the quintessential example of a saddle
49 | point.
50 |
51 | BTW, why do the eigenvalues help us **at all**? How are they related to the
52 | second derivative test in one dimension? I think it's because the second-order
53 | Taylor series expansion involves a term d^THd, where d is some unit vector.
54 | This is the second term that's added into the Taylor series, so its values
55 | among different directions tells us the curvature. We also have an
56 | eigendecomposition of H, and that provides us the eigenvalues.
57 |
58 | - We have simple gradient descent, and then the second-order (i.e. expensive!)
59 | Newton's method. How do we **derive** the step size, e.g. if you're asked to
60 | do so in an interview?
61 |
62 | - Write out f(x) using a second-order Taylor series expansion at x(0).
63 |
64 | - Then look at the second-order Taylor series and take the gradient w.r.t x
65 | (not x(0)).
66 |
67 | - Solve for the best x, the critical point, and plug-n-chug.
68 |
69 | - At least, that seemed to work for me and I verified Newton's method.
70 |
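  Writing out the result of that recipe (standard form; from memory, so
  double-check against the book): with gradient g and Hessian H at x^{(0)},

    f(x) ~ f(x^{(0)}) + (x - x^{(0)})^T g + (1/2)(x - x^{(0)})^T H (x - x^{(0)})

  and setting the gradient of the right-hand side to zero,
  g + H(x - x^{(0)}) = 0, gives the Newton step

    x^* = x^{(0)} - H^{-1} g
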
71 | - In the context of Deep Learning, our functions are so complicated that we can
72 | rarely provide any theoretical guarantees. We can sometimes get headway by
73 | assuming Lipschitz functions, which tell us that small changes in the input
74 | have quantified small changes in the function output.
75 |
76 | - Convex optimization is a very successful research field, but we can only take
77 | lessons from it; we can't really use its algorithms, and the importance of
78 | convexity is diminished in deep learning. Constrained optimization may be
79 | slightly more important. These involve the KKT conditions and Lagrange
80 | multipliers, which at a high level try to design an unconstrained problem so
81 | that the solution can be transformed into one for the **constrained** problem.
82 | Brief comments on those:
83 |
84 | - We rewrite the loss function by adding terms corresponding to constraints
85 | h(x) = 0 and/or g(x) <= 0.
86 |
87 | - We have min_{x in S} f(x) as our original **constrained** minimization
88 | problem. However ...
89 |
90 | - min_x max_{lambda} max_{alpha >= 0} L(x, lambda, alpha) has the same set of
91 | solutions and optimal points!
92 |
93 | - (Some caveats here, have to consider infinity cases, etc., but this is the
94 | general idea. Any time a constraint is violated, the minimum value of the
95 | Lagrangian w.r.t. x is ... infinity!)
96 |
97 | For some reason, I never feel comfortable with Lagrangians. It might be worth
98 | going back and reviewing Stephen Boyd's book, but I think this book's treatment
99 | was pretty clear.
100 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter08notes.txt:
--------------------------------------------------------------------------------
1 | ****************************************************
2 | * NOTES ON CHAPTER 8: Optimization for Deep Models *
3 | ****************************************************
4 |
5 | This chapter should be review for me.
6 |
7 | Section 8.1: Learning vs. Pure Optimization
8 |
9 | The authors make a good point in that we really care about minimizing the cost
10 | function w.r.t. the **data generating distribution**, NOT the actual training
11 | data (i.e. generalization). The difference with optimization is that we know the
12 | underlying data generating distribution, but in machine learning we only have
13 | the fixed training data, i.e. minimizing the **empirical risk**. However, this
14 | isn't used in its raw form:
15 |
16 | > These two problems mean that, in the context of deep learning, we rarely use
17 | > empirical risk minimization. Instead, we must use a slightly different
18 | > approach, in which the quantity that we actually optimize is even more
19 | > different from the quantity that we truly want to optimize.
20 |
21 | Also, as I know, ML algorithms typically stop not when they're at a true minimum
22 | but when we define them to stop, early stopping. =)
23 |
24 | Oh, note that second-order methods require larger batch sizes. In fact, Andrej
25 | Karpathy covered that briefly in Lecture 7 of CS 231n. This is because
26 | matrix-vector multiplication and taking inverses amplify errors in the original
27 | Hessian/gradient.
28 |
29 | I do this:
30 |
31 | > Fortunately, in practice it is usually sufficient to shuffle the order of the
32 | > dataset once and then store it in shuffled fashion. This will impose a fixed
33 | > set of possible minibatches of consecutive examples that all models trained
34 | > thereafter will use, and each individual model will be forced to reuse this
35 | > ordering every time it passes through the training data.
36 |
37 | Section 8.2: Challenges in Neural Net Optimization
38 |
39 | > For many years, most practitioners believed that local minima were a common
40 | > problem plaguing neural network optimization. Today, that does not appear to
41 | > be the case. The problem remains an active area of research, but experts now
42 | > suspect that, for sufficiently large neural networks, most local minima have a
43 | > low cost function value, and that it is not important to find a true global
44 | > minimum rather than to find a point in parameter space that has low but not
45 | > minimal cost.
46 |
47 | To test whether we are at a local minimum, we can check the norm of the gradient.
48 |
49 | Section 8.3: Basic Algorithms
50 |
51 | These include SGD and its variants, the core of the chapter. I better know
52 | these. I know SGD and for momentum, they say:
53 |
54 | > Momentum aims primarily to solve two problems: poor conditioning of the
55 | > Hessian matrix and variance in the stochastic gradient.
56 |
57 | and
58 |
59 | > We can think of the particle as being like a hockey puck sliding down an icy
60 | > surface. Whenever it descends a steep part of the surface, it gathers speed
61 | > and continues sliding in that direction until it begins to go uphill again.
62 |
63 | There's some math there that I probably don't need to memorize, but I should
64 | blog about it soon. They write it as a first-order differential equation since
65 | we have a separate velocity term. If we didn't have that, we need a *second*
66 | order diff-eq. Also, I really have to review differential equations someday.
67 |
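The update itself, as a sketch (my code, standard formulation):

    import numpy as np

    # Classical momentum: velocity v accumulates an exponentially
    # decaying sum of past gradients; alpha is the "friction" term.
    def sgd_momentum_step(theta, v, g, lr=0.01, alpha=0.9):
        v = alpha * v - lr * g           # update velocity
        return theta + v, v              # take the step

    # Toy quadratic f(theta) = 0.5 * ||theta||^2, so grad = theta.
    theta, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(200):
        theta, v = sgd_momentum_step(theta, v, g=theta)
    print(theta)                         # ~[0, 0]
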
68 | Section 8.4: Parameter Initialization
69 |
70 | AKA break symmetry!
71 |
72 | Surprisingly, they don't seem to mention Kaiming He's paper on weight
73 | initialization. I don't even see any discussion of fan-in and fan-out.
74 |
75 | Section 8.5: Algorithms with Adaptive Learning Rates
76 |
77 | Yes, the key is **adaptive** learning rates. AdaGrad, then RMSProp, then Adam:
78 |
79 | > The name "Adam" derives from the phrase "adaptive moments." In the context of
80 | > the earlier algorithms, it is perhaps best seen as a variant on the
81 | > combination of RMSProp and momentum with a few important distinctions.
82 |
83 | The distinctions have to do with estimates of moments and their biases. I'm
84 | quite confused on this, unfortunately.
85 |
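To de-confuse myself, the update as a sketch (my code; standard form from the
Adam paper):

    import numpy as np

    # Adam: decayed first/second moment estimates plus bias correction,
    # needed because m and v are initialized at zero and would otherwise
    # be biased toward zero early in training.
    def adam_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * g            # first moment (mean)
        v = b2 * v + (1 - b2) * g ** 2       # second moment (uncentered)
        m_hat = m / (1 - b1 ** t)            # bias-corrected moments;
        v_hat = v / (1 - b2 ** t)            # t starts at 1, not 0
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
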
86 | (Note: unlike what's suggested in CS 231n Lecture 7, in fact the textbook
87 | actually has RMSProp with Nesterov's in one of their algorithms.)
88 |
89 | Section 8.6: Approximate Second-Order Methods
90 |
91 | Newton's method is intractable, etc. etc. etc. Well, these can help:
92 |
93 | > Conjugate gradients is a method to efficiently avoid the calculation of the
94 | > inverse Hessian by iteratively descending conjugate directions.
95 |
96 | Also, know BFGS and L-BFGS.
97 |
98 | Section 8.7: Other Strategies
99 |
100 | Ah, **batch normalization**.
101 |
102 | > This means that the gradient will never propose an operation that acts simply
103 | > to increase the standard deviation or mean of $h_i$; the normalization
104 | > operations remove the effect of such an action and zero out its component in
105 | > the gradient. This was a major innovation of the batch normalization approach.
106 |
107 | and
108 |
109 | > Batch normalization reparametrizes the model to make some units always be
110 | > standardized by definition, deftly sidestepping both problems.
111 |
112 | Yeah, this idea of normalizing inputs is obvious, so we have to be clear on the
113 | actual contribution of batch normalization.
114 |
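The forward pass, to be concrete (a sketch of mine):

    import numpy as np

    # Batch norm: standardize each unit over the minibatch, then let the
    # learned gamma/beta set whatever mean/scale the model actually wants.
    def batchnorm_forward(H, gamma, beta, eps=1e-5):
        mu = H.mean(axis=0)                  # per-unit minibatch mean
        var = H.var(axis=0)                  # per-unit minibatch variance
        H_hat = (H - mu) / np.sqrt(var + eps)
        return gamma * H_hat + beta
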
115 | There's some other stuff here about pre-training (yes that's important!) but
116 | also check Chapter 15. Oh, and don't forget, we normally don't want to design
117 | new optimization algorithms, but instead to make the networks **easier to
118 | optimize**.
119 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter16notes.txt:
--------------------------------------------------------------------------------
1 | **************************************************************************
2 | * NOTES ON CHAPTER 16: Structured Probabilistic Models for Deep Learning *
3 | **************************************************************************
4 |
5 | I expect to know the majority of this chapter, because it's probably going to be
6 | like Michael I. Jordan's notes. "Structured Probabilistic Models" are graphical
7 | models! But the key is that this should help me better understand the current
8 | research frontiers of Deep Learning, and it's self-contained. Let's see what it
9 | has to offer ...
10 |
11 | Their "Alice and Bob" (and "Carol" ...) example has to do with running a relay,
12 | which is better than Michael I. Jordan's example of being abducted by aliens.
13 |
14 | I remember Markov Random Fields, yes, we need to define a normalizing constant
15 | Z, but (a) if we define our clique potentials awfully, Z won't exist, and (b) in
16 | deep learning, Z is usually intractable.
17 |
18 | I agree with their quote:
19 |
20 | > One key difference between directed modeling and undirected modeling is that
21 | > directed models are defined directly in terms of probability distributions
22 | > from the start, while undirected models are defined more loosely by \phi
23 | > functions that are then converted into probability distributions. This changes
24 | > the intuitions one must develop in order to work with these models.
25 |
26 | When they go and talk about their example with x being binary and getting
27 | Pr(X_i = 1) being a sigmoid(b_i), you can get that by explicitly writing out the
28 | formula, then "rearranging" the sum so that terms independent of the current,
29 | rightmost sum get pushed left. Then you see that the sums factorize, giving us
30 | independence, and we can split the fractions, etc. It brings back good memories of
31 | studying CS 188.
32 |
33 | Section 16.2.4 is on Energy-Based functions. John Canny would really like those!
34 | I think the easiest way for me to think of these is taking potentials of
35 | arbitrary functions and then using e^{-function}. AKA Boltzmann Machines. I like
36 | their discussion here; it is relatively elucidating.
37 |
38 | There is also review on what edges mean when describing graphical models. Again,
39 | this is all CS 188 stuff. For instance, remember that we can add more edges to a
40 | graphical model and still represent the same class of distributions (the edges
41 | can be unnecessary).
42 |
43 | One advantage for each type:
44 |
45 | - It is easier to sample from directed models (I agree).
46 | - It is easier to perform approximate inference on undirected models (I think I
47 | agree).
48 |
49 | Key fact:
50 |
51 | > Every probability distribution can be represented by either a directed model
52 | > or by an undirected model.
53 |
54 | Though there are some directed models for which no undirected model is
55 | equivalent to it. By "equivalent" here we mean in the precise set of
56 | independence assumptions it implies.
57 |
58 | And another key idea:
59 |
60 | > When we represent a probability distribution with a graph, we want to choose a
61 | > graph that implies as many independences as possible, without implying any
62 | > independences that do not actually exist.
63 |
64 | E.g. a loop of length 4 (with no chords inside) is an undirected graphical
65 | model, but we have to add an edge before adding orientations to the edges to
66 | "convert" it to as simple a directed graphical model as possible (that still
67 | implies as many (or as few?) assumptions).
68 |
69 | Section 16.3: sampling from graphical models. I agree, it's easy for directed
70 | models. They call it "ancestral sampling" whereas I've called it "forward
71 | sampling," I think from Daphne Koller. We have to modify it if we want to do
72 | more general sampling with conditioning, i.e. fixed variables. It's toughest if
73 | the variables are *descendants*. Ancestors are easier because we can fix them
74 | and just do P(x|parents(x)) as usual. For *undirected* models ... they mention
75 | Gibbs sampling. =)
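
To pin down ancestral sampling, here's a minimal Python sketch on a toy
two-node chain A -> B (my own example; the probabilities are made up):

```
import numpy as np

# Toy chain A -> B, both binary. Ancestral sampling: visit nodes in
# topological order, sampling each from p(node | parents).
p_a = 0.3                       # P(A = 1)
p_b_given_a = {0: 0.9, 1: 0.2}  # P(B = 1 | A = a)

def ancestral_sample(rng):
    a = int(rng.random() < p_a)
    b = int(rng.random() < p_b_given_a[a])
    return a, b

rng = np.random.default_rng(0)
samples = [ancestral_sample(rng) for _ in range(1000)]
```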
76 |
77 | The next few sections are pretty short. They mention *structure learning*, i.e.
78 | learning the graphical model structure. That's a hard problem due to the
79 | super-exponential number of possibilities. However, it seems like structure
80 | learning --- as far as I can tell --- is no longer an active research area? They also mention the
81 | importance of latent variables. Yes, that's a bit broad, but I agree. Just
82 | before the "real" Deep Learning part they talk about inference and approximate
83 | inference, which is something that I should know about well (but they just give
84 | a broad treatment, a bit unclear).
85 |
86 | Finally, the Deep Learning part that I wanted to read.
87 |
88 | After reading it, I just want to clarify: when people draw out a fully connected
89 | net, they usually write out nodes, edges, in layer format, etc. Is that
90 | correctly viewed as a *graphical model*? Or are those different design criteria?
91 | Also, I'm assuming that all the "latent variable" discussion is simply referring
92 | to the hidden layers (and their units)? I think that's the case after reading
93 | about why loopy belief propagation is "almost never" used in deep learning. (Oh,
94 | and by the way, I don't actually know loopy belief propagation ... and I just
95 | barely remember belief propagation.) I think it makes sense, in normal graphical
96 | models, we want the computational graph to be sparse to prevent high treewidth,
97 | but in deep learning, we do matrix multiplication which creates a lot of
98 | connectivity. So, matrix multiplication, not loopy belief propagation.
99 |
100 | They discuss *Restricted Boltzmann Machines* at the end. They say it is the
101 | "quintessential example" of using graphical models for deep learning. With only
102 | one hidden layer, it is not too deep (a.k.a. it looks like a normal graphical
103 | model) but it groups variables into layers, like deep learning. For now, let's
104 | only worry about the "canonical form" which is an energy-based model with a
105 | particular (negative) quadratic form plus linear terms. The inputs are (v,h).
106 | The names should be familiar: v=visible and h=hidden. Then it's like a complete
107 | bipartite graph with v on one side and h on the other. We can do Gibbs sampling
108 | on this (in fact, _block_ Gibbs sampling).
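
Here's a minimal numpy sketch of that canonical form and one block Gibbs
step (the sizes and parameter scales are arbitrary choices of mine):

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy RBM with energy E(v,h) = -b.v - c.h - v.W.h, binary v and h.
rng = np.random.default_rng(0)
nv, nh = 6, 4
W = 0.1 * rng.standard_normal((nv, nh))
b = np.zeros(nv)   # visible biases
c = np.zeros(nh)   # hidden biases

def block_gibbs_step(v):
    # The graph is bipartite, so all h_j are conditionally independent
    # given v (and vice versa): each layer is sampled in one block.
    h = (rng.random(nh) < sigmoid(c + v @ W)).astype(float)
    v = (rng.random(nv) < sigmoid(b + W @ h)).astype(float)
    return v, h

v = rng.integers(0, 2, nv).astype(float)
for _ in range(100):
    v, h = block_gibbs_step(v)
```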
109 |
110 | Concluding point:
111 |
112 | > Overall, the RBM demonstrates the typical deep learning approach to graphical
113 | > models: representation learning accomplished via layers of latent variables,
114 | > combined with efficient interactions between layers parametrized by matrices.
115 |
116 | I've now read the chapter and feel pleased. Great job, authors!
117 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter14notes.txt:
--------------------------------------------------------------------------------
1 | *************************************
2 | * NOTES ON CHAPTER 14: Autoencoders *
3 | *************************************
4 |
5 | Let's review this and discuss with John Canny.
6 |
7 | The introduction is excellent, and matches with my intuition. I agree that an
8 | encoder is like doing dimension reduction, and it certainly seems like decoders
9 | (the reverse direction) can be used for generating things, hence they can be
10 | used within *generative* models. (A.K.A. VAEs!)
11 |
12 | They mention "recirculation" as a more biologically realistic (!!) alternative
13 | to backpropagation, but it is not used much.
14 |
15 | Think of AEs as optimizing this simple thing:
16 |
17 | min_{f,g} L(x, g(f(x)))
18 |
19 | where x is the whole dataset, and f and g are the encoder and decoder,
20 | respectively.
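
To make the objective concrete, here's a tiny numpy sketch with linear f and
g and squared-error L (sizes and step size are arbitrary choices of mine; as
noted below, a purely linear autoencoder only recovers a PCA-like solution):

```
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))        # toy data: 100 points in R^10
We = 0.1 * rng.standard_normal((10, 3))   # encoder f, code size 3 < 10
Wd = 0.1 * rng.standard_normal((3, 10))   # decoder g

for _ in range(500):
    H = X @ We                 # h = f(x)
    err = H @ Wd - X           # g(f(x)) - x
    dXr = err / len(X)         # grad of (err**2).sum() / (2 * len(X))
    gWd = H.T @ dXr            # gradient w.r.t. decoder weights
    gWe = X.T @ (dXr @ Wd.T)   # gradient w.r.t. encoder weights
    Wd -= 0.1 * gWd
    We -= 0.1 * gWe
```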
21 |
22 | We need to make sure the autoencoder is constrained somehow ("undercomplete")
23 | so that it isn't simply performing the identity function. Solutions: don't
24 | provide too much capacity to both (a) the hidden code and (b) either of the two
25 | networks, and *regularize* somehow. Also, don't just make things linear, because
26 | then it's doing nothing more than PCA.
27 |
28 | Confusing point: think of autoencoders as "approximating maximum likelihood
29 | training of a generative model that has latent variables." Why?
30 |
31 | - The prior is not over the "belief on our parameters before seeing data" but
32 | the hidden units (which are latent variables). Yes, this aspect makes sense.
33 | - I don't know what they mean by "the autoencoder as approximating this sum with
34 | a point estimate for just one highly likely value for h" but let's not
35 | over-worry about this.
36 |
37 | (This was in the discussion about sparse autoencoders, and it makes a little
38 | more sense to me after reading about VAEs. The point is that `h` is a latent
39 | variable.)
40 |
41 | Denoising Autoencoders: clever! =) Rather than using g(f(x)) in the loss
42 | function, use g(f(\tilde{x})) where \tilde{x} is perturbed! This is a creative
43 | way to avoid the autoencoder simply learning the identity function.
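
In numpy-ish terms the change is one line (additive Gaussian noise is my
arbitrary choice of corruption C):

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                   # clean training input
x_tilde = x + 0.1 * rng.standard_normal(10)   # \tilde{x} ~ C(. | x)
# now train on L(x, g(f(x_tilde))): corrupted input, *clean* target
```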
44 |
45 | One can also regularize by limiting the derivatives, i.e. a "contractive
46 | autoencoder."
47 |
48 | I've wondered about the exact size of autoencoders in use nowadays, since I
49 | haven't seen a figure before. The encoder and decoder are themselves each feed
50 | forward neural networks, so in general, it seems like each can be implemented
51 | with many layers (or just one).
52 |
53 | Stochastic Encoders and Decoders: not sure I got much out of this. However, I
54 | did get this: the decoder can be seen as optimizing log p(x|h), since it is
55 | given h and has to produce x (and x is known!). But the analogue for the encoder
56 | is more confusing, because we have log p(h|x) but we don't know h. This must be
57 | similar to other latent variables in graphical models.
58 |
59 | **Update**: after reading this again with more knowledge of how these work,
60 | I think I didn't get the point of the last section. The log p(x|h) is indeed
61 | what the decoder optimizes, though (1) it really optimizes the encoder as
62 | well when this is trained end-to-end since the encoder produces h, and (2)
63 | we have to provide the loss function, and (3) we can **also** add a
64 | distribution to the encoder, but I don't think this is actually needed to
65 | train the encoder portion. In the case of continuous-valued pixels, we
66 | should probably consider a Gaussian distribution for the loss, which means
67 | the autoencoder should try and get the mean/variance. In VAEs, we can take
68 | advantage of the Gaussian assumption to *sample* elements.
69 |
70 | Denoising autoencoders: OK, their computational graph (Figure 14.3) makes sense.
71 | (It doesn't really help me get a deep understanding, though.) They introduce a
72 | corruption function C(\tilde{x} | x), whose function is obvious. I was confused
73 | for a bit as to why we're assuming we know the x (I mean, in real life, we might
74 | be given *only* noisy stuff) but if we don't have the real x, we can't evaluate
75 | the loss function! It's just part of our training data.
76 |
77 | Figure 14.4 makes sense intuitively. Corrupted stuff is off the manifold because
78 | if we take an average random sample, it'll be in some random space. But **real**
79 | samples are in a manifold. Unfortunately, some of the discussion here (e.g.
80 | connecting autoencoders with RBMs) just refers to reading papers. =( That's why
81 | I am reading this textbook, to *avoid* reading difficult-to-understand papers.
82 | There's also some discussion on estimating the score function, which I think I
83 | understand but haven't grokked it.
84 |
85 | OK, back to more obvious stuff:
86 |
87 | > Denoising autoencoders are, in some sense, just MLPs trained to denoise.
88 | > However, the name "denoising autoencoder" refers to a model that is intended
89 | > not merely to learn to denoise its input but to learn a good internal
90 | > representation as a side effect of learning to denoise.
91 |
92 | Manifolds! (Section 14.6) Key reason why we think about this (emphasis mine):
93 |
94 | > Like many other machine learning algorithms, autoencoders exploit the idea
95 | > that data concentrates around a low-dimensional manifold or a small set of
96 | > such manifolds, as described in section 5.11.3. [...] Autoencoders take this
97 | > idea further and aim to **learn the structure of the manifold**.
98 |
99 | Additional thoughts:
100 |
101 | - Understand **tangent planes**, these describe the direction of allowed
102 | variation for a point x while still remaining on the low-dim manifold. See
103 | Figure 14.6 for an intuitive example with MNIST, showing points on this
104 | manifold and also the allowable directions.
105 |
106 | - Intuitively, autoencoders need to learn how to represent this variation among
107 | the manifold. However, they don't need to do this for points off the
108 | manifold. See Figure 14.7. The reconstruction is flat near the manifold
109 | points, i.e. the only area that matters. True, it jumps up at several points,
110 | but those are well off the manifold.
111 |
112 | - There are other ways we can learn manifold structure, using non-Deep
113 | Learning techniques (see Figures 14.8 and 14.9), but I don't think these are
114 | as important to know now.
115 |
116 | Contractive Autoencoders (Section 14.7) introduce a regularizer to make the
117 | derivatives of f (as in, f(x) = h) small.
118 |
119 | What are applications of autoencoders? Definitely dimensionality reduction is
120 | one, and we can also think about information retrieval, the task of finding
121 | entries in a database that resemble a query entry. Why? Search is more efficient
122 | in lower-dimensional spaces.
123 |
124 | Overall, I actually think this chapter is among the weaker ones in the book.
125 | Looking through the CS 231n slides was a **lot** more helpful. Eh, not every
126 | chapter is perfect.
127 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Fetch.text:
--------------------------------------------------------------------------------
1 | Notes on how to use the Fetch.
2 |
3 | ************
4 | ** UPDATE **
5 | ************
6 |
7 | Here are some full steps:
8 |
9 | (0) Start the fetch, ensure that it can move with the joystick controls.
10 |
11 | (1) Switch to fetch mode by calling `fetch_mode` on the command line. This will
12 | ensure that the `ROS_MASTER_URI` is the Fetch robot.
13 |
14 | (2) Be on the correct WiFi network. Then the master node (Fetch) is accessible.
15 |
16 | - Verify that `rostopic list` returns topics related to the Fetch.
17 | - Also verify that the teleop via keyboard script (via `rosrun ...`, see
18 | tutorials) is working, though sometimes even that doesn't work for me.
19 |
20 | (3) Then do whatever I need to do... for instance, simply running Ron's
21 | camera script (a single python file) works to continually see the Fetch's
22 | cameras. Finally!
23 |
24 | - Some python scripts might require a launch file to be running, such as the
25 | built-in disco.py and wave.py code. For these use `roslaunch [...] [...]`.
26 |
27 |
28 | TODO: figure out robot state? For Fetch-specific messages.
29 |
30 |
31 | ******************
32 | ** Older notes: **
33 | ******************
34 |
35 | Note that `PS1` is an environment variable that we can import, but the real key
36 | thing is to set ROS_MASTER_URI, that will let us connect to the Fetch. This does
37 | not happen by default, so we must export it in each new window (for now).
38 |
39 | Then I think we should do `rosrun [package] [script]` where I code stuff in
40 | [script] inside some package. But are Ron and Michael doing it in a similar way?
41 |
42 | Recommended order for development (NOT WORKING):
43 |
44 | - Code the script within some package
45 | - Compile the package with `catkin_make`
46 | - Another terminal, set `ROS_MASTER_URI` appropriately
47 | - In that same terminal, `source ./devel/setup.bash`
48 | - Finally, again in same terminal `rosrun ...` and enjoy
49 |
50 | I know when I set `ROS_MASTER_URI` and run `rostopic list` I get all the
51 | appropriate Fetch-related topics ... so why am I not able to access them in my
52 | code when calling `rosrun ...`?
53 |
54 | (If I don't set `ROS_MASTER_URI` and instead have it as the default, then I do
55 | not get any topics, of course. Note that according to documentation, roslaunch
56 | will START roscore if it detects that one doesn't exist!)
57 |
58 | Is there a launch file that I can use? I'm confused because `rostopic echo
59 | [...]` for the topics means I can see the output ...
60 |
61 |
62 | ***************************
63 | * Tutorial: Visualization *
64 | ***************************
65 |
66 |
67 |
68 | *******************************
69 | * Tutorial: Gazebo Simulation *
70 | *******************************
71 |
72 | At least this is clear:
73 |
74 | > Never run the simulator on the robot. Simulation requires that the ROS
75 | > parameter use_sim_time be set to true, which will cause the robot drivers to
76 | > stop working correctly. In addition, be sure to never start the simulator in a
77 | > terminal that has the ROS_MASTER_URI set to your robot for the same reasons.
78 |
79 | And it looks like I've installed the two packages necessary,
80 | `ros-indigo-fetch-gazebo` and `ros-indigo-fetch-gazebo-demo`.
81 |
82 | Run: `roslaunch fetch_gazebo simulation.launch` and the Gazebo simulator should
83 | show up! However, I've noticed if you exit, then try and run the simulator
84 | again, error messages may result? From looking up things online, it seems to be
85 | expected behavior. :-( Try CTRL+C in the same window to exit. I've been able to
86 | get `simulation.launch` to work fairly consistently, fortunately.
87 |
88 | For "Running the Mobile Manipulation Demo":
89 |
90 | The playground will get set up, just be patient. :-) It takes a few extra
91 | seconds due to a "namespace" error message, must be due to slow loading of
92 | data online. However, a playground _should_ eventually appear.
93 |
94 | Then the next part moves the Fetch throughout the Gazebo simulator. It's
95 | pretty cool. Doesn't work reliably, see GitHub issue I posted.
96 |
97 | I think this will be easier on a desktop since Gazebo also seems to be sensitive
98 | to the graphics card, though after this I fixed it so my laptop can access the
99 | separate GPU.
100 |
101 | How does the demo code work? Two commands:
102 |
103 | 1. roslaunch fetch_gazebo playground.launch
104 | 2. roslaunch fetch_gazebo_demo demo.launch
105 |
106 | Use `roscd [...]` to go to the package directory and look at `launch/` to find
107 | specific definitions. The first command runs the launch file with several
108 | readable arguments. The second one is more interesting, launch looks like:
109 |
110 | ```
111 | (the 19-line launch-file listing didn't survive the copy-paste; see
112 | launch/demo.launch in the fetch_gazebo_demo package for the actual XML)
113 | ```
131 |
132 | Four easy parts. What's odd, though, is that I can't find `demo.py` anywhere on
133 | my machine, but it's online at the repo:
134 |
135 | https://github.com/fetchrobotics/fetch_gazebo/blob/gazebo2/fetch_gazebo_demo/scripts/demo.py
136 |
137 | Might be another useful code reference as it's a clean stand-alone script,
138 | though with some MoveIt, etc., obviously.
139 |
140 |
141 |
142 | **************************
143 | * Tutorial: Robot Teleop *
144 | **************************
145 |
146 | This is pretty easy.
147 |
148 |
149 |
150 | ************************
151 | * Tutorial: Navigation *
152 | ************************
153 | **************************
154 | * Tutorial: Manipulation *
155 | **************************
156 |
157 | I ran both of these manipulation tutorials (hand-wavy thing and disco) and it
158 | works. I wasn't able to try out extensions.
159 |
160 |
161 |
162 | ************************
163 | * Tutorial: Perception *
164 | ************************
165 |
166 | Fetch exposes several "ROS topics" that we can subscribe to in order to obtain
167 | camera information. Unfortunately, I have yet to get call-backs to work ...
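
For reference, a minimal rospy subscriber sketch (the topic name here is my
guess; check `rostopic list` for the real one):

```
import rospy
from sensor_msgs.msg import Image

def callback(msg):
    rospy.loginfo("got image: %d x %d", msg.width, msg.height)

rospy.init_node("camera_listener", anonymous=True)
rospy.Subscriber("/head_camera/rgb/image_raw", Image, callback)
rospy.spin()  # forgetting this is a classic reason callbacks never fire
```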
168 |
169 |
170 |
171 | **************************
172 | * Tutorial: Auto-Docking *
173 | **************************
174 | *************************
175 | * Tutorial: Calibration *
176 | *************************
177 | **********************************
178 | * Tutorial: Programming-By-Demos *
179 | **********************************
180 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/HSR.text:
--------------------------------------------------------------------------------
1 | Notes on how to use the HSR. Use their Python interface (or we can do
2 | lower-level ROS stuff). Also, there's a built-in motion planner, so MoveIt! is
3 | not necessary. Ideally, we get a camera image, get the x and y values from the
4 | pixels, figure out z (the depth), and determine a rotation, and send it there.
5 |
6 | - Gazebo can be useful.
7 | - rviz is DEFINITELY helpful for debugging. Know it.
8 | - Calibration: ouch, unfortunately this will take a while and there are eight
9 | sensors to calibrate ... at minimum. The docs actually show a lot. I see a
10 | sensor (camera) on the hand as well.
11 | - Register positions, using the same image I see of black/white boxes, the
12 | "calibration marker jig".
13 |
14 | Monitor status: see 6.1 of the manuals. Setting up development PC/laptop,
15 | section 6.2. Not much else to write here. At least I can get rviz running with
16 | images. You need to hit the reset button and see the LEDs (not above the
17 | 'TOYOTA' text but everywhere else) turn yellow-ish.
18 |
19 | On my TODO list:
20 |
21 | - Figure out good test usage practices for rviz.
22 | - Get skeleton code set up for the HSR to:
23 | - process camera images
24 | - move based on those images (either base or gripper, or both)
25 | - Figure out a safe way to automatically move arms.
26 |
27 |
28 |
29 | ******************
30 | * Moving the HSR *
31 | ******************
32 |
33 | General idea with Python code, do something like:
34 | ```
35 | self.robot = hsrb_interface.Robot()
36 | self.omni_base = self.robot.get('omni_base')
37 | self.whole_body = self.robot.get('whole_body')
38 | ```
39 | where the `hsrb_interface` is code written by the Toyota HSR programmers,
40 | thankfully. That part is necessary for the robot to begin publishing stuff from
41 | its topics.
42 |
43 | Let's understand _base_ motion.
44 |
45 |
46 | Aerial view of the HSR. Assumes its head is facing north.
47 |
48 | ^
49 | |
50 | <--[hsr]-->
51 | |
52 | v
53 |
54 | Axes are:
55 |
56 | pos(x) for north, neg(x) for south.
57 | Also, (oddly) pos(y) for LEFT, neg(y) for right.
58 |
59 | I guessed `y` would go the other way, but this is actually the standard ROS convention (REP 103: x forward, y left, z up). The
60 | z stuff stays fixed (obviously). These are based on the (x,y,z) I get from
61 | `omni_base.get_pose()`. The rotations are in quaternions.
62 |
63 | FYI: When the robot starts up, it has some (x,y,z) position which should
64 | be set at (0,0,0) based on the starting position.
65 |
66 | Errors: unfortunately if you query the `omni_base.get_pose()` again and
67 | again, the values are still going to vary by something like 1-3mm, so
68 | there's always some error. Same with the dVRK.
69 |
70 | Rotations: clockwise from aerial view, the `z` decreases. Counterclockwise,
71 | it increases. The other three values in the quaternion don't seem to change,
72 | x==y==0 and w==1. We're only rotating about one plane for the base so this
73 | is expected. TODO: understand quaternions well.
74 |
75 |
76 | To clarify the above, understand `go_rel`:
77 |
78 | ```
79 | In [30]: omni_base.go_rel?
80 | Type: instancemethod
81 | String Form: <bound method ...>
82 | File: /opt/tmc/ros/indigo/lib/python2.7/dist-packages/hsrb_interface/mobile_base.py
83 | Definition: omni_base.go_rel(self, x=0.0, y=0.0, yaw=0.0, timeout=0.0)
84 | Docstring:
85 | Move base from current position.
86 |
87 | Args:
88 | x (float): X-axis position on ``robot`` frame [m]
89 | y (float): Y-axis position on ``robot`` frame [m]
90 | yaw (float): Yaw position on ``robot`` frame [rad]
91 | timeout (float): Timeout until movement finish [sec].
92 | Default is 0.0 and wait forever.
93 | ```
94 |
95 | Seems like indeed we should only control x and y, obviously. The interesting
96 | part is that `yaw` must represent the `z` in the quaternion, so rotations of the
97 | base imply changes in yaw only.
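
For example (a quick sketch using the `omni_base` handle from the snippet
further above):

```
import math

# nudge 0.1 m forward, 0.05 m left, and rotate 90 degrees counterclockwise
omni_base.go_rel(x=0.1, y=0.05, yaw=math.pi / 2)
```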
98 |
99 |
100 | Next, `whole_body`, allows more control. This is for the _end_effector_:
101 |
102 | ```
103 | In [38]: whole_body.get_end_effector_pose?
104 | Type: instancemethod
105 | String Form: <bound method ...>
106 | File: /opt/tmc/ros/indigo/lib/python2.7/dist-packages/hsrb_interface/joint_group.py
107 | Definition: whole_body.get_end_effector_pose(self, ref_frame_id=None)
108 | Docstring:
109 | Get a pose of end effector based on robot frame.
110 |
111 | Returns:
112 | Tuple[Vector3, Quaternion]
113 |
114 | In [39]: whole_body.get_end_effector_pose()
115 | Out[39]: Pose(pos=Vector3(x=0.2963931913608169, y=0.07800193518379123, z=0.6786170137933408), ori=Quaternion(x=0.7173120598879523, y=-7.000511757597367e-05, z=0.6967520358527196, w=-6.613377471335618e-05))
116 | ```
117 |
118 | This is relative to the base frame. So when we move the HSR, without moving
119 | the end-effector, the x,y,z stuff remains the same, as expected. BUT since
120 | the base frame has some fixed "reference rotation" then rotating base means
121 | the y and w quaternion components change; the x and z stay the same.
122 |
123 | We can also see joint names and their limits. Use `whole_body.joint_state`
124 | to get full details. There's lots of `whole_body.move_to[...]` methods that
125 | make it really convenient for research code.
126 |
127 | An alternative is to explicitly assign to these by publishing to the
128 | associated ROS topics, which might be more generally applicable to the
129 | Fetch and other robots (well, we change the topics ...).
130 |
131 |
132 | Finally, for the gripper itself, use `gripper`. We can command grasps, similar
133 | to the dVRK, and pass negative values for a tighter grip. :-)
134 |
135 |
136 | Other notes on moving the HSR:
137 |
138 | - It's possible to move in straight lines, arcs, etc.
139 | - Understand `tf` for resolving coordinate frames. TODO: later ... actually,
140 | might as well do this all in simulation (rviz) first to double check
141 | movements.
142 | - Also use rviz for visualizing coordinates. RGB = xyz axes.
143 | - Common coordinates: `map` for the overall map, `base_footprint` for the
144 | base of the HSR, `hand_palm_link` for the robot's hand (end-effector I
145 | assume, or "tool frame").
146 | - You can move both the base and arm together to get to a destination, can
147 | also weigh relative contribution.
148 | - Can move the hand based on force sensing, might be useful if we're running
149 | this automatically and need some environment feedback?
150 | - Avoid collisions by using the collision avoider they have. Looks really
151 | simple to use, they handle a lot for us.
152 |
153 |
154 | See Section 7.2.6 for more advanced coding, rather than using `ihsrb` which is
155 | like IPython. Oh, and later they actually have a YOLO tutorial. Nice!
156 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter11notes.txt:
--------------------------------------------------------------------------------
1 | **********************************************
2 | * NOTES ON CHAPTER 11: Practical Methodology *
3 | **********************************************
4 |
5 | This is sometimes neglected, but it shouldn't be! Their intro paragraph hits the
6 | core:
7 |
8 | > Successfully applying deep learning techniques requires more than just a good
9 | > knowledge of what algorithms exist and the principles that explain how they
10 | > work. A good machine learning practitioner also needs to know how to choose an
11 | > algorithm for a particular application and how to monitor and respond to
12 | > feedback obtained from experiments in order to improve a machine learning
13 | > system.
14 |
15 | Their running example is the Street View house number dataset and application,
16 | which is good for me since I only have minor knowledge of this material. The
17 | application is as follows: Cars photograph the buildings and address numbers,
18 | while a CNN recognizes the addresses based on photos. Then Google Maps can add
19 | the building to the correct location.
20 |
21 | Section 11.1: Performance Metrics
22 |
23 | Use precision and recall in the event that a binary classification shouldn't
24 | treat the two cases equally, e.g. with spam detection or diagnosing diseases.
25 | Precision is the fraction of detections that are correct, TP/(TP+FP), while
26 | recall is the fraction of truly relevant instances detected, TP/(TP+FN). A disease detector
27 | saying that everyone has the disease has perfect recall, but very small
28 | precision, equal to the actual fraction who have diseases. We can draw a PR
29 | curve, or use a scalar metric such as **F-scores** or **AUC**.
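
A toy numpy check of those definitions, using the everyone-has-the-disease
detector:

```
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])   # 3 of 8 actually sick
y_pred = np.ones_like(y_true)                  # detector flags everyone

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)   # 3/8, the disease prevalence
recall = tp / (tp + fn)      # 3/3 = 1.0, "perfect" recall
f1 = 2 * precision * recall / (precision + recall)
```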
30 |
31 | Section 11.2: Default Baseline Models
32 |
33 | This depends on the problem setting. Copy over previous work if possible.
34 |
35 | Start small-scale at first, with regularization and **early stopping**. (I
36 | forgot to do this for one project before adding it, and I'm glad I did.)
37 |
38 | Most of this should be obvious.
39 |
40 | Section 11.3: More Data?
41 |
42 | Regarding when to add more data, they suggest:
43 |
44 | > If the performance on the test set is also acceptable, then there is nothing
45 | > left to be done. If test set performance is much worse than training set
46 | > performance, then gathering more data is one of the most effective solutions.
47 | > [... after some regularization discussion ...] If you find that the gap
48 | > between train and test performance is still unacceptable even after tuning the
49 | > regularization hyperparameters, then gathering more data is advisable.
50 |
51 | Of course, in some domains such as medical applications, gathering data can be
52 | costly. Again, this is obvious.
53 |
54 | Section 11.4: Hyperparameters
55 |
56 | Do these manually or automatically. The manual version places special emphasis
57 | on finding a model with the right effective capacity for the problem at hand.
58 |
59 | As a function of a hyperparameter value, generalization curves often follow a
60 | U-shaped curve, with the optimal value somewhere in the middle. At the smaller
61 | end, we may have low capacity (and thus underfitting) and the other end may have
62 | high capacity (and thus overfitting). Though that depends on the low/high
63 | capacity assumption. Maybe this hyperparameter graph would be based on the
64 | hyperparameter of the total number of layers in a neural network. This is just
65 | an example, though. For applying weight decay, the curve might still be
66 | U-shaped, but the underfitting happens with high values, the overfitting happens
67 | with smaller values.
68 |
69 | Their main advice, and the one which agrees with my own experience, is that if
70 | there is ANY hyperparameter to tune, it should be the learning rate. Why? The
71 | effective capacity of the model is highest ... for a **correct** learning rate.
72 | Not when it's too large or too small. In general, the **training error** curve
73 | decreases as the learning rate rises toward the sweet spot ... then once it's
74 | barely too high, it SHOOTS UP, due to taking too-large gradient steps.
75 |
76 | What happens if your training error is worse than expected? Your best bet is to
77 | increase capacity. Especially with Deep Learning, we should be able to overfit
78 | to most training datasets, so try without regularization techniques.
79 |
80 | If the test error is worse than training, then the reason (at least with Deep
81 | Learning models with high capacity) is most likely due to generalization
82 | difference between test vs train error. Try regularization techniques.
83 |
84 | I **really like Table 11.1**, it outlines the effects of changing different
85 | hyperparameters. Study it well! Though I think I understood all of them; the one
86 | that might be newest to me is weight decay, but fortunately I somewhat
87 | understand it after reading through OpenAI's Evolution Strategies code.
88 |
89 | OK, next, **automatic hyperparameter search**. This includes **grid search**,
90 | best when we have three or fewer hyperparameters and we can test all points in
91 | the Cartesian product of the set of values. **Random search** can be better, as
92 | I know from CS 294-129. See Figure 11.2 for a comparison of grid search and
93 | random search.
94 |
95 | Typically, grid search values are chosen based on a logarithmic scale, or
96 | "trying every order of magnitude." If the best values are on a boundary point,
97 | shift the grid search. Sometimes we have to do coarse-to-fine, as Andrej
98 | Karpathy puts it. Random search can be cheaper and often more effective. Here,
99 | we have a marginal probability distribution for each hyperparameter, which we
100 | sample from to get hyperparameters. (Be careful about non-uniform distributions
101 | if we want to sample from a logarithmic scale, e.g. for learning rates that are
102 | 10^{-x}, we would do a uniform distribution sample on x.) Random search is more
103 | effective when there are hyperparameters which do not strongly affect the
104 | performance metric, which are considered wasteful for grid search.
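
For instance, a minimal sketch of the learning-rate case:

```
import numpy as np

rng = np.random.default_rng(0)
# sample learning rates log-uniformly over [1e-5, 1e-1]
lrs = 10.0 ** rng.uniform(-5, -1, size=20)
# a plain uniform draw over [1e-5, 1e-1] would almost never propose
# values near 1e-5, hence the uniform draw on the *exponent*
```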
105 |
106 | The section concludes with Bayesian hyperparameter optimization, but the
107 | authors find that it isn't yet reliably helpful for Deep Learning.
108 |
109 | Section 11.5: Debugging
110 |
111 | This is hard. :(
112 |
113 | Their example of an especially challenging bug is if the bias gradient update is
114 | slightly off. Then the other weights might actually be able to compensate for
115 | the error, to some extent. This is why you need a finite difference check, as we
116 | did for CS 231n, or use TensorFlow.
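
A minimal sketch of such a check (central differences; the quadratic test
function is my own choice):

```
import numpy as np

def numeric_grad(f, w, eps=1e-6):
    # central finite differences: (f(w+eps) - f(w-eps)) / (2*eps) per coord
    g = np.zeros_like(w)
    for i in range(w.size):
        w[i] += eps; fp = f(w)
        w[i] -= 2 * eps; fm = f(w)
        w[i] += eps  # restore
        g[i] = (fp - fm) / (2 * eps)
    return g

# compare with the analytic gradient, e.g. for f(w) = ||w||^2 / 2 it's w:
w = np.random.default_rng(0).standard_normal(5)
assert np.allclose(numeric_grad(lambda v: 0.5 * v @ v, w), w, atol=1e-4)
```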
117 |
118 | Visualize the model in action, visualize the worst cases, **fit a tiny dataset**
119 | (which I do), etc. Also, monitor histograms of activations and gradients, which
120 | might help detect gradient saturation.
121 |
122 | Yeah, actually I *do* use a lot of these techniques, though maybe I should add
123 | those histograms somewhere?
124 |
125 | Oh, they say that the magnitude of parameter updates should be roughly 1% of the
126 | magnitude of the parameters themselves. In some recent work, I see 5% for this
127 | quantity. Maybe I should aim to get that reduced?
128 |
129 | Section 11.6: Example of Multi-Digit Recognition
130 |
131 | Looks interesting. Here, coverage was the metric to optimize while fixing
132 | accuracy to be 98%. (Thus, accuracy is more important.) They got a LOT of
133 | improvement simply by looking at the worst cases and seeing that there was
134 | unnecessary cropping.
135 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter07notes.txt:
--------------------------------------------------------------------------------
1 | *********************************************
2 | * NOTES ON CHAPTER 7: Regularization for DL *
3 | *********************************************
4 |
5 | Again, this will be mostly review.
6 |
7 | Section 7.1: Parameter Norm Penalties.
8 |
9 | One piece of intuition is that biases don't need to be regularized because each
10 | bias affects only a single unit, whereas each weight couples two units (the two
11 | endpoints of its edge), so weights contribute more to overfitting.
12 |
13 | Good review for me, look at the math in Section 7.1.1 about L2 regularization.
14 | Assuming a quadratic cost function, we can show that weight decay rescales the
15 | optimal weight vector along the **axes** defined by the **eigenvectors** of H,
16 | the Hessian. This is good linear algebra review. Understand Figure 7.1 as well!
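
For my own reference, the punchline of that derivation: writing
H = Q \Lambda Q^T, weight decay with coefficient \alpha rescales each
component of the unregularized optimum w* along H's eigenvectors:

    \tilde{w}_i = (\lambda_i / (\lambda_i + \alpha)) w*_i

so high-curvature directions (large \lambda_i) are barely shrunk, while
low-curvature directions get pushed toward zero.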
17 |
18 | TODO: review the L1 regularization section. I must have seen this before but I
19 | can't remember, and it'd be good to know. But the TL;DR is that L1 encourages
20 | more sparsity compared to L2, so certain features can be discarded.
21 |
22 | (Some of the next sections are quite short and I didn't take notes. One insight
23 | is that the definition of the Moore-Penrose pseudoinverse looks like a
24 | regularization formula, with weight decay!)
25 |
26 | Other regularization strategies:
27 |
28 | - Dataset Augmentation, useful for object recognition, but be careful not to,
29 | e.g. flip the images if we're doing optical character recognition, since the
30 | classes could be altered. Be careful to augment *after* the train/test split,
31 | and also that when comparing benchmarks, that algorithms use the same
32 | augmentation.
33 |
34 | - Add noise directly to weights, sometimes seen in RNNs, or the targets, as in
35 | **label smoothing**.
36 |
37 | - Semi-Supervised Learning. Use both p(x) and p(x,y) to determine p(y|x).
38 | Example: PCA for the "unsupervised" projection to an "easier" space, and then
39 | a classifier built on top of that, so PCA is a pre-processing step. Yeah,
40 | makes some sense.
41 |
42 | - Multi-Task Learning. Think of this as different tasks having the same input
43 | but different output, **AND** having a common "intermediate" step, or latent
44 | factor. We need that last condition because otherwise we're not sharing
45 | parameters across tasks (i.e. across different targets). I haven't really done
46 | much work with multi-task learning, but I bet I will in the future!
47 |
48 | - Early Stopping. Ah yes, this sounds dumb but it works. Often, training error
49 | will continue decreasing and asymptote somewhere, but our validation error can
50 | decrease initially, but then **increase**. We want to stop and return the
51 | weights we had at the time just before the validation error began to increase.
52 | Huh, the authors even say it's the most popular form of regularization, I
53 | guess because it comes naturally to beginners. There's some slight costs to
54 | (a) testing on the validation set, and (b) storing weights periodically, but
55 | from my experience those are minor. They continue to elaborate that if we want
56 | to use the validation set, we can do early stopping, *then* include all the
57 | data. (This seems overkill to me.) They conclude early stopping by showing
58 | mathematically how it acts as a regularizer.
59 |
60 | - Parameter Tying and Parameter Sharing. These try to make certain parameters
61 | close to each other, so the regularizer could be || w(a) - w(b) ||_2 where
62 | w(a) and w(b) are weights in two different layers. However, I think the more
63 | popular view is to have them be **equal**, and hence have parameter
64 | **sharing** instead of tying, which has the added advantage of memory savings.
65 | This is precisely what happens in CNNs (and RNNs!).
66 |
67 | - Sparse Representations. Here, for some reason, we're focused on
68 | **representational sparsity**. This means our DATA is considered to have a new
69 | representation which is sparse. This is *not* the same as **parameter
70 | sparsity**, which the L1 regularization on the parameters would have enforced.
71 | This arises out of putting penalties on the activations in the NN. However,
72 | I'm not really sure I follow this and it doesn't seem to be as important as
73 | other techniques.
74 |
75 | - Bagging and Ensembles. Train several different models (independently), then
76 | have them vote. It works well when the models do not make the same test
77 | errors. We can quantify this mathematically by computing the expected error
78 | and expected squared error. One way to do this is with bagging, which will
79 | sample k different **datasets**, formed by sampling with replacement the
80 | original data, so with high probability we'll get different datasets each time
81 | (with some data points repeated, of course, and others missing).
82 |
83 | - Dropout. This can be viewed as noise injection, FYI, **and** as a form of
84 | bagging and ensemble learning. Man, it's really clever. PS: remember how it
85 | works, we remove (non-output!) **units**, NOT the edges (though it could be
86 | done that way, I think). Edges are automatically removed when their units are
87 | removed. In code, of course, we just multiply by zero. Remember:
88 |
89 | > Each time we load an example into a minibatch, we randomly sample a
90 | > different binary mask to apply to all of the input and hidden units in the
91 | > network. The mask for each unit is sampled independently from all of the
92 | > others. The probability of sampling a mask value of one (causing a unit to
93 | > be included) is a hyperparameter fixed before training begins. It is not a
94 | > function of the current value of the model parameters or the input example.
95 |
96 | There is some discussion about how to predict or do inference with ensemble
97 | methods. The authors mention some obscure geometric mean trick, but
98 | fortunately, with dropout we can do one forward pass and scale by the dropout
99 | parameter. (Or we can divide by the keep probability during training instead,
100 | i.e. "inverted dropout"; see the sketch just after this list.)
101 |
102 | This is actually **not** exact even in expectation, due to the
103 | non-linearities, but it works well in practice.
104 |
105 | Dropout goes beyond regularization interpretations:
106 |
107 | > [...] there is another view of dropout that goes further than this. Dropout
108 | > trains not just a bagged ensemble of models, but an ensemble of models that
109 | > share hidden units. This means each hidden unit must be able to perform well
110 | > regardless of which other hidden units are in the model.
111 |
112 | It looks like we have redundancy, which is good.
113 |
114 | - Adversarial Training. You knew this was coming. :) We get those adversarial
115 | examples, and then use that to improve our classifier. See Goodfellow's papers
116 | for details. There are caveats, though, and I believe even with training on
117 | adversarial examples, such a model still has *new* adversarial examples. I
118 | might have to re-read those papers. Goodfellow showed that one cause for
119 | adversarial examples is excessive linearity. They can also be considered
120 | semi-supervised learning, which we talked about earlier in the chapter.
121 |
122 | - Tangent {Distance, Prop, Manifold Classifier}. These relate to our assumption
123 | that the essence of the data lie in lower-dimensional manifolds. The
124 | regularization here is that f(x) shouldn't change much as x moves along its
125 | manifold. I don't really think these are important for me to know right now,
126 | but I remember studying these a bit for the prelims.
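
Here is the inverted-dropout sketch promised in the dropout item above (the
keep probability is an arbitrary example value):

```
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
h = rng.standard_normal(100)   # some hidden-layer activations

# train time: zero out units, then rescale the survivors by 1/keep_prob
mask = (rng.random(h.shape) < keep_prob) / keep_prob
h_train = h * mask

# test time: use h unchanged; E[h_train] already matches h
```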
127 |
128 | Whew, some of these were new actually, or at the very least I got a better
129 | understanding of them. Note that batch normalization (which might make dropout
130 | unnecessary) is discussed in the **next** chapter, not this one.
131 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter12notes.txt:
--------------------------------------------------------------------------------
1 | *************************************
2 | * NOTES ON CHAPTER 12: Applications *
3 | *************************************
4 |
5 | There's a LOT of them! Recall the 2016 publication date, so anything after that
6 | won't be here (e.g., the Transformer architecture, other DeepRL stuff?).
7 |
8 | 12.1: Large-Scale Deep Learning
9 |
10 | Nice discussion about how the video game community spurred the development of
11 | graphics cards, and how the characteristics of graphics card ended up being
12 | beneficial for the kind of computations used in deep learning. Actually, why?
13 |
14 | - We need to perform many operations in parallel (and these are often
15 | independent of each other, hence parallelization is easier).
16 | - Less 'branching' compared to the workload of a CPU.
17 | - GPUs have memory and data can be put on there, whereas the data is too large
18 | for most CPU caches.
19 |
20 | They got more popular after more general-purpose GPUs were available that
21 | could do stuff other than rendering, and NVIDIA's CUDA lets us implement those
22 | using a C-like language. But, it's very hard to write good CUDA code (not the
23 | same as writing good CPU code). Good news: once someone does it, we should
24 | refer to those libraries.
25 |
26 | - Data parallelism: easy for inference since we have models run on different
27 | machines. But for training, use Hogwild!. (We can alternatively increase the
28 | batch size for one machine, but we don't get the advantage of more frequent
29 | gradient updates versus HogWild!.)
30 | - Model parallelism: each machine runs a different part of the model. (Huh, I
31 | don't think I'll do this, we'd need a super large network?)
32 | - Model compression: mentions Hinton's knowledge distillation. :-)
33 |
34 | We can do a lot with *dynamic structure*: this means we might use different
35 | components of the network for a given computation. For example, have a gated
36 | network which picks one of several expert networks to use for evaluation.
37 | (Results in soft or hard mixture of experts, depending on (as expected) whether
38 | the 'gater' outputs a soft weighting or a single hard weighting, like a one-hot
39 | vector of weights.) Even simpler: decision trees.
40 |
41 | Efficient hardware implementations: doesn't discuss Tensor Processing Units
42 | (TPUs) but those came out after this book, I think.
43 |
44 | 12.2: Computer Vision
45 |
46 | Pre-processing: make sure it's consistent, doesn't have to be fancy. Often
47 | scaling to [-1,1] or [0,1] suffices. Heck they say there are CNNs that can
48 | dynamically adjust to take images of different sizes, but I find it easiest to
49 | always keep a fixed scale.
50 |
51 | Examples: *contrast normalization*, and *whitening*. I think contrast
52 | normalization is like the (X - np.mean(X)) / (np.std(X) + eps) that we've often
53 | done in computer vision tasks. Whitening is another story about *rescaling
54 | principal components to have equal variance*.
55 |
56 | Actually this is a short section. I'm surprised there wasn't an overview on
57 | classification, detection, segmentation, and other computer vision problems.
58 | It's mostly about how data is processed. See CS 231n for details on the actual
59 | tasks.
60 |
61 | 12.3: Speech Recognition (ASR with 'Automatic' in it)
62 |
63 | (Not a subsection of NLP, despite ASR being part of my NLP class at Berkeley)
64 |
65 | Find the most probable linguistic sequence y given input acoustic sequence X.
66 | I.e.: argmax_y P(y|X). Before 2012, state of the art systems used Hidden Markov
67 | Models and Gaussian Mixture Models.
68 |
69 | Use "TIMIT" for benchmarking, the MNIST of ASR so to speak.
70 |
71 | Not much detail here, unfortunately, besides that Restricted Boltzmann Machines
72 | (RBMs) were among the ingredients for the resurgence of Deep Learning in ASR.
73 | But now they are not used. :) I wonder if Transformers are used in ASR now? I
74 | haven't been following the literature and the section is too short for a proper
75 | treatment.
76 |
77 | 12.4: Natural Language Processing
78 |
79 | Largely based on *language models* and treating *words* as the distinct unit,
80 | and then modeling language as probability of a next word given an existing
81 | sequence of words. Know *n-gram*, modeling conditional probability of a word
82 | based on the preceding n-1 words. Unigrams, bigrams, and trigrams use 1, 2, and
83 | 3 as n.
84 |
85 | - But recall my NLP class: hard to use raw counts for computing conditional
86 | probabilities, because many counts are zero.
87 | - Thus use smoothing (see the toy example just below).
88 | - But still many 'curse of dimensionality' challenges with classical n-gram
89 | models.
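
A toy illustration of the smoothing point above (add-one / Laplace
smoothing; the corpus is made up):

```
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)   # vocabulary size

def p_next(w, prev):
    # P(w | prev) with add-one smoothing: never zero, even if unseen
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

p_next("cat", "the")   # (2 + 1) / (3 + 6) = 1/3
```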
90 |
91 | Neural language models: allow us to say that two words are similar, but they
92 | are distinct, and they show word embeddings. I think they are suggesting
93 | getting word embeddings by predicting the context given the center word, or
94 | predicting the center word given context (like we did in 182/282A). But
95 | regardless, it's good to have embeddings, since instead of representing words
96 | as one hot vectors, we use lower dimensional representations with Euclidean
97 | distance to get similarity. This is analogous to a CNN's hidden layer output
98 | giving us an image embedding.
99 |
100 | Issue with high-dimensional outputs: if our model needs to produce words (e.g.,
101 | probability of next word given existing text) then naively a softmax over all V
102 | words in the vocabulary means we need a huge matrix to represent this
103 | operation and to train it (assuming naive cross-entropy loss).
104 |
105 | - Naive fix: use a 'short list' of most frequent words only. But that is
106 | counter to what we actually want!
107 | - Slightly better: *hierarchical softmax*. Now predict categories of words, and
108 | then predict more specific categories, etc. But performance of actual model
109 | often not that great, and hard to get the most likely word in a given
110 | context.
111 | - Importance sampling: the logic for this approach is that the gradient of the
112 | softmax can be broken up into the positive and negative phases (interesting
113 | intuition, I'd thought about it but was good to see them explicitly state
114 | it). The negative phase is an expectation, and we can use (biased) importance
115 | sampling.
116 | - Noise-contrastive estimation is another option, but see Chapter 18 for a
117 | fuller treatment.
118 |
119 | Interesting contrast with neural nets and n-grams: the latter are much faster
120 | for look-up operations with hash tables.
121 |
122 | Neural machine translation: recall the encoder-decoder architecture, where the
123 | encoder reads the sentence and produces a data structure called a "context"
124 | that contains "relevant information" somehow. Advantage of an RNN for
125 | encoders/decoders is that we can process variable-length sequences.
126 |
127 | They cite a paper by Jacob Devlin from 2014 who beat state of the art models by
128 | using a MLP. Heh, he would later be the first author on the 2018 BERT paper.
129 |
130 | They conclude with a brief discussion on some of the earlier attention models
131 | in Deep NLP. A lot more has happened since then!
132 |
133 | 12.5: Other Applications
134 |
135 | - Recommender systems and collaborative filtering. Actually this leads them to
136 | talk about contextual bandits, which as we know are an intermediate between
137 | the k-armed bandit case and the full RL problem. Why contextual bandits here?
138 | Because if recommender systems only give users the best item according to its
139 | model, there is no 'exploration' of other items that might be even better.
140 |
141 | Also, it's an intermediary because bandits = no state, basically. The normal
142 | RL problem means the action directly changes the next state.
143 |
144 | - Knowledge representation, reasoning, and question answering. Interesting
145 | topics, but for now not part of my direct research agenda.
146 |
--------------------------------------------------------------------------------
/How_People_Learn/Part_02_Learners_and_Learning.txt:
--------------------------------------------------------------------------------
1 | Part 2: Learners and Learning
2 |
3 |
4 | Chapter 2: How Experts Differ from Novices
5 |
6 | Very important:
7 |
8 | - As implied in the previous chapter, what distinguishes experts from novices
9 | isn't necessarily factual knowledge (nor is it ability or intelligence), more
10 | as it is about better connections among concepts, and the ability to
11 | "conditionalize" knowledge. This means being able to know what areas/concepts
12 | are needed for a specific task, rather than trying out everything.
13 |
14 | - (Related) Experts have more fluent knowledge retrieval, so they better know
15 | what applies to specific tasks. This means their memory is not taxed trying to
16 | figure out what would apply. Organization is more efficient; novices may
17 | retrieve knowledge in a slow, sequential manner.
18 |
19 | - Experts recognize (and are more sensitive to) meaningful patterns across many
20 | fields. Example: with chess, if you randomize the pieces, the experts don't
21 | really remember those locations any better than novices, but if the pieces are
22 | arranged as they might be in a real game situation, the expert can pick up
23 | patterns and remember the location of pieces far better than novices can.
24 |
25 | - Different styles of experts: "artisans" vs "virtuosos". The former are experts
26 | in one field but the latter are also experts and, moreover, have the desirable
27 | property of "active learning" so they are experts at learning about new
28 | things. This requires metacognition, as discussed in the first chapter.
29 | Educational programs need to be designed to encourage the development of
30 | virtuosos.
31 |
32 | Also important:
33 |
34 | - Cool example with physics: experts organize problems in a way that reflects
35 | deeper, fundamental ideas, whereas novices will organize problems if they look
36 | similar (e.g., have the same drawings of triangles).
37 |
38 | - Being an expert at a subject is NOT the same as being an expert at teaching.
39 | An expert teacher will better understand when students might get stuck. Yeah,
40 | this is a widely agreed-upon fact.
41 |
42 | Stuff I didn't remember:
43 |
44 | :-)
45 |
46 |
47 | Chapter 3: Learning and Transfer
48 |
49 | Very important:
50 |
51 | - You could argue that the ultimate goal of teaching is better transfer
52 | learning, or how to efficiently use the knowledge from school and apply it to
53 | the real world. Also, the goal is not to immediately know how to do new tasks,
54 | but simply to increase the _speed_ at which these new tasks will be learned.
55 | Early performance attempts are less important since anyone is going to need
56 | some time to learn new stuff, so don't evaluate based on the first time,
57 | evaluate based on the length of the learning period.
58 |
59 | - All transfer learning (and learning itself, of course) starts from somewhere.
60 | Yeah, prior knowledge was emphasized in earlier chapters. Clearly, prior
61 | knowledge may help or hinder new learning. Examples: students incorrectly
62 | think that plants eat soil, that when they throw a ball in the air there is
63 | still "force from the hand pushing it" and so on.
64 |
65 | - For better transfer learning, we need to see the same concept in different
66 | contexts, so that we can understand the "abstract stuff" that is shared across
67 | tasks. That's better than remembering task-specific details (or "overly
68 | contextualized" knowledge in their jargon) that don't generalize.
69 |
70 | Also important:
71 |
72 | - Learning depends a lot on social background and culture, in addition to more
73 | factual, easy-to-define prior knowledge. Some cultures may discourage asking
74 | questions, for instance, which means if teachers expect to see questions, they
75 | might think a student is uninterested. There were also some differences noted
76 | between white and black families (but no biracial, Asian, etc., families ... sigh).
77 |
78 | - Speed of learning depends on deliberate practice and feedback. :-)
79 |
80 | Stuff I didn't remember:
81 |
82 | - (A bit silly that I didn't record this, but oh well ...) All learning takes
83 | time. You simply can't be an expert without investing the time. And moving
84 | on to more advanced subjects without knowing the basics is not ideal.
85 |
86 | - Oh, another obvious thing I didn't quite record: don't forget about
87 | motivation. What factors (social, etc.) motivate students? That's very
88 | important for speed of learning.
89 |
90 | - Amount of transfer depends on overlap among concepts, well roughly speaking.
91 | Yeah, another generally obvious thing.
92 |
93 |
94 | Chapter 4: How Children Learn
95 |
96 | Very important:
97 |
98 | - Even the very young (as in, months-old infants) exhibit signs of learning and
99 | knowledge, which contrasts with very early research claims. We have better
100 | tools for experimentation and to measure infants, since (for obvious reasons)
101 | it's not that easy to test on them. TL;DR young children are active,
102 | competent agents.
103 |
104 | - Children also pick up language and can quickly tell if stuff seems natural or
105 | unnatural. On a related note, parents need to read to their children, though
106 | some of this can be "picture" books.
107 |
108 | - Zone of proximal development: the gap between current abilities, and the
109 | abilities one could have with extra teaching assistance. (Or more accurately,
110 | 'potential' ... see the text for details.) It's the job of parents,
111 | caregivers, teachers, etc., to continue improving the students' skills so that
112 | this zone proceeds to the next natural stages.
113 |
114 | Also important:
115 |
116 | - Some cool stuff that infants know: they track consistency in numbers, so if
117 | they see groups of two, they relax, but if the next group has three things,
118 | they'll be more alert and sense something's different. Also, physics: infants
119 | somehow are able to tell that things will fall over without supports, and pay
120 | more attention on that (in rigorous experiments).
121 |
122 | - Children can naturally be interested in solving problems, it doesn't always
123 | have to be explicitly forced upon by a teacher. Also, lots of this depends on
124 | culture (again, this is obvious, but good to reiterate).
125 |
126 | Stuff I didn't remember:
127 |
128 | - "Privileged domains": physical and biological concepts, causality, number, and
129 | language. These are domains where infants show _positive_biases_ in learning,
130 | which makes sense from an evolutionary perspective.
131 |
132 | - Precise experimental techniques for detecting infant cues and preferences:
133 | non-nutritive sucking, habituation (i.e., infant "gets used to it" and stops
134 | responding to that cue), and visual expectation.
135 |
136 | - Infants can distinguish between animate and inanimate objects. Also, they're
137 | good at inferring from context.
138 |
139 | - There's a little bit about memory here, might be more in later chapters, but
140 | mostly about the strategy of clustering to improve memory performance. Also
141 | some discussion about how infants vs older children may have different memory
142 | strategies, and strategies get more effective with age (generally).
143 |
144 |
145 | Chapter 5: Mind and Brain
146 |
147 | Very important:
148 |
149 | - The mind is made up of neurons, with synapses and stuff (not going to get too
150 | technical here but you get the idea). These synaptic connections can be
151 | created and destroyed, generally in two ways: in youth, connections are created
152 | in huge swarms and then pruned away in similar amounts, kind of like sculpting;
153 | across the lifetime, they're continually created through learning by
154 | experience.
155 |
156 | - Don't fall for some of the hype you see in popular claims. :-)
157 |
158 | - Some discussion over difference between deaf and hearing ways of learning, the
159 | implication was that areas of the brain can be learned through experience.
160 | Also, learning organizes/restructures the brain.
161 |
162 | Also important:
163 |
164 | - Context matters. Different parts of the brain are ready to learn at different
165 | times.
166 |
167 | Stuff I didn't remember:
168 |
169 | - Eh, hopefully got the main points.
170 |
--------------------------------------------------------------------------------
/Functional_Programming/week1/week1_notes.txt:
--------------------------------------------------------------------------------
1 | ***************
2 | * Lecture 1.1 *
3 | ***************
4 |
5 | Primary objective: functional programming from first principles, not necessarily
6 | Scala but will learn the language. This is like learning a different programming
7 | paradigm.
8 |
9 | Scala: migration from C/Java to functional programming. Look at programming
10 | with "fresh eyes". Can integrate it with classical programming to give both of
11 | best worlds.
12 |
13 | Three paradigms:
14 |
15 | - imperative (Java and C), understand via instructions for Von Neumann computers
16 | - functional (Scala, or maybe Haskell is a better example)
17 | - logic
18 |
19 | We want to **liberate** ourselves from John von Neumann-style programming. John
20 | Backus argued for functional programming. So we must avoid conceptualizing
21 | instruction by instruction (or word by word) and move to a higher level of
22 | abstraction (?). Martin uses polynomial and string examples. For a polynomial,
23 | you don't want to define a class and be able to suddenly change coefficients
24 | (stored in the polynomial class). That would be wrong for the theory of math
25 | which deals with things like (a+b)x = ax+bx, not just modifying a and b
26 | directly.
27 |
28 | This analogy has some flaws but I think things will be clearer for me later when
29 | I progress.
30 |
31 | Consequence of theory of functional programming: NO MUTATIONS.
32 |
33 | This seems restrictive (no mutable variables, assignments, loops, or imperative
34 | control structures) but the focus is on functions, which are easier to work
35 | with in functional programming. Functions here will be "first class citizens" as
36 | they can be defined anywhere, including INSIDE other functions.
37 |
38 | I might check out Martin's book but probably not, I have too much to do, I'll
39 | focus on the lectures. =)
40 |
41 | Martin says functional programming has grown in popularity due to exploiting
42 | parallelism for multi-core and cloud computing. Is that why John Canny chose to
43 | use Scala for BIDMach and BIDMat? And since this is getting so important, I
44 | really have to finish this Coursera course!!!
45 |
46 | ***************
47 | * Lecture 1.2 *
48 | ***************
49 |
50 | (Most of this stuff in the first half of this video is familiar to me.)
51 |
52 | Interactive shell = REPL, read eval print loop. Just run `scala`, as I know. But
53 | don't use that, just use `sbt console`.
54 |
55 | The "substitution model" is key: all it does is reduce expressions to values,
56 | and this can be applied to all expressions so long as they have no side effects.
57 | This is lambda calculus! Foundation for functional programming. In fact Alonzo
58 | Church showed that it can express all programs, i.e. Turing Complete. I remember
59 | this a little bit.
60 |
61 | Example: a C++-style expression with a side effect (e.g. `i++`) cannot be
62 | expressed by the substitution model. That's why we don't have side effects in
63 | functional programming.
63 |
64 | To "do" the substitution model by hand, we have to explicitly substitute values
65 | and simplify, following specific rules. We can do this call by value or call by
66 | name. They have trade-offs: former only evaluates function arguments once,
67 | latter means function arguments are not evaluated if parameter is unused
68 | throughout the evaluation.
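
Here's a tiny worked reduction of my own (using a hypothetical `square`) to make
the trade-off concrete:

    def square(x: Int): Int = x * x

    // Call by value: evaluate the argument once, then substitute.
    //   square(1 + 2)  ->  square(3)  ->  3 * 3  ->  9
    // Call by name: substitute the unevaluated expression; evaluate on each use.
    //   square(1 + 2)  ->  (1 + 2) * (1 + 2)  ->  3 * (1 + 2)  ->  3 * 3  ->  9

Both strategies reach 9, but CBV evaluated `1 + 2` once while CBN did it twice.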
69 |
70 | ***************
71 | * Lecture 1.3 *
72 | ***************
73 |
74 | This provides more comparisons of CBN vs CBV, particularly as regards
75 | termination vs. non-termination.
76 |
77 | Here's an important "theorem": if CBV terminates, then CBN also terminates, but
78 | *not* vice versa.
79 |
80 | Here's a simple example (pseudocode):
81 |
82 | first(x,y)=x
83 |
84 | first(1, loop)
85 |
86 | Here, CBN terminates because it ignores the loop. However, CBV gets in an
87 | infinite loop.
88 |
89 | Despite this example, Scala uses CBV, but we can force CBN for a given
90 | parameter using `=>`, as they do in the next example, which shows how a CBV
91 | language can "get around" this problem by treating `y` as a special CBN
92 | parameter.
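
In Scala, a sketch of that example (with the non-terminating `loop` spelled out):

    def loop: Int = loop                   // evaluating this never terminates

    def first(x: Int, y: => Int): Int = x  // `y` is a by-name (CBN) parameter

    first(1, loop)                         // returns 1; `loop` is never evaluated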
92 |
93 | ***************
94 | * Lecture 1.4 *
95 | ***************
96 |
97 | Conditionals and value definitions, two more "syntax constructs."
98 |
99 | Standard if-else, but used for **expressions** not statements. What does this
100 | mean? I think it means the if-else itself produces a value, so we don't have to
101 | write a return statement. Actually that's a general rule for Scala! Generally, a
102 | legal Java expression is a legal Scala expression.
103 |
104 | Also have reduction rules, etc., such as && and ||. BTW those short-circuit
105 | evaluation, so they don't test the second argument if the first one determines
106 | the answer.
107 |
108 | There's a nice connection with CBV or CBN parameters: **definitions** can be CBV
109 | or CBN. The `def` is by name, the `val` is by value. So `def` must be evaluated
110 | upon each use, but `val` is evaluated at the point of its initialization. Oh,
111 | nice connection! =) Note that this is a loop but with effects dependent on how
112 | we use it:
113 |
114 | def loop: Boolean = loop
115 |
116 | With `def` we're OK (the definition alone evaluates nothing), but with `val`,
117 | the right-hand side is evaluated immediately, so it loops forever.
117 |
118 | Clever:
119 |
120 | def and(x:Boolean, y:Boolean) = if (x) y else false
121 |
122 | This is without using &&.
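
With this signature `y` is still call-by-value, so `and(false, loop)` would hang.
A sketch of my own by-name variant (plus the analogous `or`) that recovers real
short-circuiting:

    def and(x: Boolean, y: => Boolean): Boolean = if (x) y else false
    def or(x: Boolean, y: => Boolean): Boolean  = if (x) true else y

    and(false, loop)   // false; `loop` is never evaluated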
123 |
124 | ***************
125 | * Lecture 1.5 *
126 | ***************
127 |
128 | This is about defining square roots using Newton's method, so we have a
129 | non-trivial program. `def sqrt(x: Double): Double = { ... }`. He shows an
130 | example using Eclipse and its "session" functionality which is like a better
131 | version of the Scala command line (heh, like iPython is better than the Python
132 | interpreter). Use packages, even though it's not necessary here, because it
133 | keeps things ordered.
134 |
135 | Scala language note: explicit return types are not generally needed, but for
136 | *recursive* functions, we need them otherwise the compiler wouldn't be able to
137 | tell the return type. It's good practice to put the return type even if it's not
138 | needed.
139 |
140 | I see, I understand the code he wrote. Yes, it had problems with small/large
141 | numbers. I naively thought we should take logs and exponentials as needed, but
142 | in fact we only had to normalize our absolute difference so that the epsilon we
143 | chose, 0.001, is of the "appropriate value" rather than something too large or
144 | too small.
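
A sketch of the lecture's program as I remember it (exact constants and names may
differ); the division by `x` in `isGoodEnough` is the normalization fix above:

    def isGoodEnough(guess: Double, x: Double): Boolean =
      math.abs(guess * guess - x) / x < 0.001    // relative, not absolute, error

    def improve(guess: Double, x: Double): Double =
      (guess + x / guess) / 2                    // Newton's method update

    def sqrtIter(guess: Double, x: Double): Double =
      if (isGoodEnough(guess, x)) guess
      else sqrtIter(improve(guess, x), x)

    def sqrt(x: Double): Double = sqrtIter(1.0, x)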
145 |
146 | ***************
147 | * Lecture 1.6 *
148 | ***************
149 |
150 | In the last lesson, we defined several methods separately, but we don't want the
151 | user to access any of them except for the `sqrt` function. So we can nest all
152 | the other function definitions **inside** the overall `sqrt` definition. He used
153 | a *block*, delimited with curly braces.
154 |
155 | Visibility is what I would expect, i.e. stuff defined in a block is not visible
156 | outside it, and definitions from outside a block are visible inside it *so long
157 | as* they are not shadowed (or "over-written") by something inside with the same
158 | name. Yes, pretty obvious. OH, and it makes the square root function cleaner
159 | since we don't have to re-pass `x` as a parameter.
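
The same sketch, nested (again my reconstruction): only `sqrt` is visible, and
the helpers can refer to `x` directly instead of taking it as a parameter:

    def sqrt(x: Double): Double = {
      def isGoodEnough(guess: Double): Boolean =
        math.abs(guess * guess - x) / x < 0.001

      def improve(guess: Double): Double =
        (guess + x / guess) / 2

      def sqrtIter(guess: Double): Double =
        if (isGoodEnough(guess)) guess
        else sqrtIter(improve(guess))

      sqrtIter(1.0)
    }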
160 |
161 | Don't use semicolons unless we want more than one statement, as in:
162 |
163 | val y = x+1; y*y
164 |
165 | To split an expression over two lines, surround it with parentheses or write the
166 | operator at the end of the *first* line. But in BIDMach, we don't do that, we
167 | just write long expressions on one line. =)
168 |
169 | ***************
170 | * Lecture 1.7 *
171 | ***************
172 |
173 | Time to wrap up the first week by talking about *tail recursion*.
174 |
175 | But before that, some substitution formalism. (I'm not sure why this is
176 | important.) Then we did re-writing steps with Euclid's gcd function and the
177 | classical (recursive) factorial function.
178 |
179 | Rule: if a function calls itself as its last action, the function's stack frame
180 | can be reused. This is *tail recursion*, i.e. iteration, and it's good because
181 | we can run this in constant space. With the classic factorial, the last
182 | expression was n*factorial(n-1), meaning that the last action was not the
183 | recursive call itself but a larger expression with the `n*` around it.
184 |
185 | We can require that a function is tail-recursive by adding the `@tailrec` in the
186 | line above the method definition. Interesting!
187 |
188 | The last part of the lecture was about designing a tail-recursive version of
189 | factorial. Fortunately, I was able to figure this out. =)
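
For reference, a sketch of the accumulator version I came up with (the official
solution may differ):

    import scala.annotation.tailrec

    def factorial(n: Int): Int = {
      @tailrec
      def go(n: Int, acc: Int): Int =
        if (n == 0) acc
        else go(n - 1, n * acc)   // the recursive call is the last action
      go(n, 1)
    }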
190 |
191 | OK week 1 lectures done. Let's do the assignment.
192 |
--------------------------------------------------------------------------------
/Math_104_Berkeley/kenneth_ross_notes.txt:
--------------------------------------------------------------------------------
1 | ********************************************************************************
2 | * These are notes based on:
3 | *
4 | * Kenneth A. Ross
5 | * Elementary Analysis: The Theory of Calculus
6 | * Second Edition, 2013
7 | ********************************************************************************
8 |
9 |
10 | *************
11 | * CHAPTER 1 *
12 | *************
13 |
14 | I skimmed this chapter and I should know just about everything from it. It
15 | includes:
16 |
17 | - Natural numbers
18 |
19 | - Simple induction
20 |
21 | - Rational numbers (also the definition of an "algebraic number")
22 |
23 | - The "Rational Zeros" theorem, which might be useful if I need to find
24 | candidates for solving certain polynomial equations. This can also be used to
25 | prove that sqrt(2) is not a rational number, and several other numbers, mostly
26 | by doing some brute-force cases for checking all possible solutions. It's a
27 | bit boring to do that! Note: this theorem only applies to finding *rational*
28 | zeros of polynomials with *integer* coefficients. For a more general rule, use
29 | "Newton's method" or the "secant method."
30 |
31 | - The set of real numbers. Now we're getting into real stuff here! We also have
32 | the triangle inequality, blah blah blah ...
33 |
34 | - The Completeness Axiom. This is the assertion that "\mathbb{R} has no gaps"
35 | and is the key factor which distinguishes \mathbb{R} from \mathbb{Q}. (It's
36 | discussed in Section 4.4.) Among other things, this section discusses:
37 |
38 |   - The concepts of a minimum, maximum, and slightly more non-trivially, those
39 |     of an _infimum_ (greatest lower bound) and _supremum_ (least upper bound).
40 | For the latter two, I know clearly that sup S and inf S do not have to
41 | belong to S! Classic example: (a,b). I remember doing examples like these
42 | from MATH 305 at Williams College: basically, finding the infimums and
43 | supremums of sets. It's nothing too fancy. Man, I must have been a bad
44 | student back then!
45 |
46 | - The concepts of upper bounds, lower bounds, etc.
47 |
48 | - The completeness axiom (as I mentioned). This does _not_ hold for the
49 | rationals!
50 |
51 | Yeah, nothing too advanced here. I'm happy that at least this material is easy
52 | for me to understand and review.
53 |
54 | - The symbols +infinity and -infinity, which are useful but must be handled with
55 | care. Do not treat them as real numbers that can be plugged into theorems!
56 |   Note that it is also discussed that for nonempty, _bounded_ subsets A and B
57 |   of \mathbb{R}, sup(A + B) = sup A + sup B and the same relation for infimums
58 |   (a short proof sketch is written out at the end of this chapter's notes).
59 |   This might be useful in some statistics proofs if we are dealing with
60 |   multiple sets.
60 |
61 | - Useful to define sup S = +infinity if S is not bounded above, etc.
62 |
63 | - The last section is a "Development of \mathbb{R}" and it's probably not that
64 | useful for me.
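
Proof sketch for the sup(A + B) fact above (standard argument, from memory): for
any a in A and b in B we have a + b <= sup A + sup B, so sup(A+B) <= sup A +
sup B. Conversely, given epsilon > 0, pick a in A with a > sup A - epsilon/2 and
b in B with b > sup B - epsilon/2; then a + b > sup A + sup B - epsilon. Since
epsilon was arbitrary, sup(A+B) >= sup A + sup B, giving equality.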
65 |
66 |
67 | *************
68 | * CHAPTER 2 *
69 | *************
70 |
71 | This is about sequences and is hugely critical to understanding the rest of the
72 | book, and for real analysis in general.
73 |
74 | Section 2.7
75 |
76 | - Sequences are just a function from an index to some value.
77 |
78 | - We formally define _limits_, _convergence_, and _divergence_. See the
79 | textbook. I won't belabor the point here. Side note: limits are unique (prove
80 | this by assuming two limits, then showing that |s-t| is less than epsilon
81 | using the definitions and then the triangle inequality). Side note 2:
82 | oscillations (as in, (-1)^n) do not converge!
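
Writing out the uniqueness argument (standard, from memory): suppose lim s_n = s
and lim s_n = t. Given epsilon > 0, choose N so that n > N gives both
|s_n - s| < epsilon/2 and |s_n - t| < epsilon/2. By the triangle inequality,
|s - t| <= |s - s_n| + |s_n - t| < epsilon. Since epsilon > 0 was arbitrary,
s = t.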
83 |
84 | Section 2.8
85 |
86 | - A discussion on proofs! When proving limits, we should invoke the formal
87 | definition and find n and epsilon s.t. the definition of a limit holds.
88 |
89 | - There are several interesting examples. I did a few of them quickly. I don't
90 | think I will ever have to invoke these directly any time soon (I'm mostly
91 | reading this section so that the more important parts later are clearer to
92 | me).
93 |
94 | - Exercise 8.5 is interesting, the "squeeze lemma" and I remember Professor
95 | Mihai Stoiciu talking about this during office hours (heh, we never had office
96 | hours _in_ his office since there were so many people!).
97 |
98 | Section 2.9
99 |
100 | - Limit theorems for sequences. I can invoke these pretty easily. I will again
101 | be skimming the proofs.
102 |
103 | - Oof, there's a lot of them. Mostly they involve similar techniques such as
104 | working backwards and solving for the tightest bounds, so we get the lowest
105 | value N such that the statement: "when n > N we get |s_n - s| < epsilon" is
106 | true. We have to sometimes develop upper bounds, and often have to use epsilon
107 | times some constant so that the later algebra gets it equal to epsilon. I've
108 | seen this stuff many times.
109 |
110 | Section 2.10
111 |
112 | Monotone Sequences and Cauchy Sequences. These help us conclude convergence of
113 | sequences _without_ knowing limits in advance.
114 |
115 | - Monotone sequences are those which are always increasing or always decreasing.
116 | They _can_ converge, if the rate of increase (respectively, decrease) slows to
117 | zero, think of 1/x for x>0 as x grows large.
118 |
119 | - Important Theorem I (10.2 in the book): All bounded monotone sequences
120 | converge.
121 |
122 |   - Proof: let u be the supremum of the bounded sequence, so then we just show
123 |     lim s_n = u. We start by fixing an epsilon (as usual), then we have to find
124 |     some N such that for all n > N, we get |s_n - u| < epsilon. Well, (s_n) is
125 |     increasing, so we just need an N with s_N > u - epsilon, and then that
126 |     automatically proves the statement (written out after this section's
127 |     notes). Yay! The proof is short and elegant. Again, it just relies on
128 |     proving the limit statement!!
128 |
129 |   - There's a related theorem which shows that if the sequence is unbounded,
130 |     then it diverges to +infinity or -infinity. (This is assuming monotone,
131 |     because otherwise you can have oscillations out to infinity, which would
132 |     mean something different I guess.) Thus, limits of monotone sequences
133 |     always have meaning.
134 |
135 | - Important Theorem II (10.11 in the book): a sequence is a convergent sequence
136 | IFF it is a Cauchy sequence.
137 |
138 |   - Proof: well, they did one direction earlier and it makes sense. The other
139 |     direction also makes sense. In both cases we simply start with the
140 |     definition and try to prove the property. These can be tricky to come up
141 |     with. Mostly it's about making sense of sup-s and thinking of "stuff plus
142 |     epsilon."
143 |
144 | - Uses Definition 10.8 which defines a _Cauchy_sequence_, a sequence has this
145 | property if for each epsilon > 0 there exists N such that (m,n) both greater
146 | than N implies |s_n - s_m| < epsilon.
147 |
148 | - Why is it useful? Because we can confirm that a sequence converges by
149 | verifying that it satisfies the Cauchy sequence property. We do not have to
150 | explicitly compute a limit in this case!
151 |
152 | - There's an interlude about discussions of decimals, but it's not likely to be
153 |   much of concern to me. Don't forget about the geometric series convergence
154 |   formula! For |r| < 1, the sum of a*r^k over k >= 0 is a/(1-r).
155 |
156 | - There is also discussion on lim sup and lim inf. A sequence has a limit if and
157 |   only if its `lim inf` and `lim sup` are equal. Also, lim sup is NOT generally
158 |   sup{s_n for all n}, because as N grows large, the tail set {s_n : n > N} whose
159 |   sup we take gets smaller, hence the correct relationship is lim sup <= sup.
160 |   Also, it's these lim inf and lim sup concepts which motivate the Cauchy
161 |   sequence definition (see my notes above).
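
The Theorem 10.2 proof, written out (my reconstruction of the standard argument):
let (s_n) be increasing and bounded, and set u = sup{s_n : n in N}. Fix
epsilon > 0. Since u - epsilon is not an upper bound, there is an N with
s_N > u - epsilon. For n > N, monotonicity gives u - epsilon < s_N <= s_n <= u,
so |s_n - u| < epsilon. Hence lim s_n = u.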
162 |
163 | Section 2.11
164 |
165 | Subsequences!!
166 |
167 | - I know the definition, obviously. You can also view it as defined by a
168 | "selection function." This point of view is probably useful if you are trying
169 | to _extract_ "interesting" indices within the overall sequence.
170 |
171 | - IMPORTANT: Theorem 11.2. This states three facts about subsequences.
172 |
173 | (I don't quite follow?)
174 |
175 | Section 2.12
176 |
177 | TODO
178 |
179 |
180 | *************
181 | * CHAPTER 3 *
182 | *************
183 |
184 | TODO
185 |
186 |
187 | *************
188 | * CHAPTER 4 *
189 | *************
190 |
191 | TODO
192 |
193 |
194 | *************
195 | * CHAPTER 5 *
196 | *************
197 |
198 | TODO
199 |
200 |
201 | *************
202 | * CHAPTER 6 *
203 | *************
204 |
205 | TODO
206 |
207 |
208 | *************
209 | * CHAPTER 7 *
210 | *************
211 |
212 | TODO
213 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter05notes.txt:
--------------------------------------------------------------------------------
1 | ***********************************************
2 | * NOTES ON CHAPTER 5: Machine Learning Basics *
3 | ***********************************************
4 |
5 | Again, I expect that this will be almost entirely review. Here are some stuff
6 | which I didn't already have down cold:
7 |
8 | - The chapter starts off with Tom Mitchell's famous definition of machine
9 | learning, and then it goes through examples of tasks, experiences, and
10 | performance metrics. There isn't a whole lot new here. Maybe a good insight is
11 | to think of the tasks of (a) density estimation and (b) synthesis/sampling
12 |   (e.g. with GANs) as the task of modeling densities explicitly (a) versus
13 |   implicitly (b). Then for experiences, the key is to understand unsupervised
14 | vs. supervised learning, but the line between the categories is blurred, and I
15 | like their examples of how the problems can be converted to each other
16 | (Equations 5.1 and 5.2). Think of unsupervised as estimating p(x), supervised
17 | as estimating p(y|x), since we have our labels y in the latter case. They use
18 | linear regression as an example, and the "learning algorithm" consists of
19 | literally solving the normal equations. One step, no iterative updates!
20 |
21 | - We can use statistical learning theory to tell us how algorithms generalize.
22 | It's easiest if we assume IID, then the train/test errors are equal under
23 | expectation **if we chose a random model**, i.e random weights. In general,
24 | though, we optimize the training error, and **then** test, so the test error
25 | is at least as high as training error. The two central factors contributing to
26 | under/over-fitting are (1) training error, (2) gap between training and
27 | testing error. (This is covered again later in Chapter 11 on practical usage.)
28 | We can partially control under/over-fitting by controlling a model's
29 | **capacity**. E.g., for linear regression, add higher order terms, and
30 | capacity increases, but overfitting occurs with more parameters than examples.
31 |
32 | - Quantifying model capacity with classical measures, such as VC dimension, is
33 | rarely used in Deep Learning.
34 |
35 | - We can also think of **non-parametric** models as having arbitrarily high
36 | capacity. However, practical algorithms will rely on some form of constraints,
37 | e.g. nearest neighbors' complexity depends on the data.
38 |
39 | - **Expected** generalization error can never increase as training data grows.
40 |
41 | - Use **weight decay** (i.e. L2 regularization) to prefer lower magnitude weight
42 | vectors as solutions.
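
  The criterion being minimized (as in the book, for linear regression; lambda
  controls the preference strength):

      J(w) = MSE_train + lambda * w^T w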
43 |
44 | - With hyperparameters, don't tune them on the training data because that will
45 | cause preference towards overfitting. Tune on **validation sets**. If our data
46 | is too small, **use k-fold cross validation** to get better estimates of
47 | generalization error.
48 |
49 | - With bias/variance discussion, don't forget that the sample variance (for
50 | Gaussians) is actually **biased**, we need the n-1 correction for the
51 | **unbiased** version.
52 |
53 | - Don't forget the difference between **variance** and **standard error** w.r.t.
54 | **an estimator**. Here, the standard error is the square root of the variance,
55 | and both are computed based on empirical data (which is why I don't think we
56 | call it "standard deviation"). They say:
57 |
58 | > Unfortunately, neither the square root of the sample variance nor the square
59 | > root of the unbiased estimator of the variance provide an unbiased estimate
60 | > of the standard deviation. Both approaches tend to underestimate the true
61 | > standard deviation, but are still used in practice. The square root of the
62 | > unbiased estimator of the variance is less of an underestimate. For large m,
63 | > the approximation is quite reasonable.
64 |
65 | We use standard error often when writing out confidence intervals.
66 |
67 | They argue that increasing model capacity (at least under MSE for computing
68 | generalization error) generally increases **variance** but decreases **bias**.
69 | The reason is that variance here is based on samples where the "samples" are
70 | in fact training data sets. (The training set **is** the random variable,
71 | according to their Equation 5.47 definition.) Thus, with a new sample of the
72 | training data, we'll get different results since the model overfits. But under
73 |   **expectation** over all draws of training datasets, the bias is low.
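
  The decomposition behind this trade-off (book notation, for an estimator
  \hat{theta}_m of theta under MSE):

      MSE = E[ (\hat{theta}_m - theta)^2 ]
          = Bias(\hat{theta}_m)^2 + Var(\hat{theta}_m)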
74 |
75 | - How did we **obtain** the estimators we just talked about? It's simple, MLE.
76 | And before reading Goodfellow's tutorial on GANs, I don't think I viewed MLE
77 | as minimizing a KL divergence. This is yet another reason why we like it.
78 | Another reason is, as I know from the AI prelims review, the MLE view of
79 | **conditional** log likelihood, where p(y|x) is modeled as a Gaussian, results
80 | in the same solution (obtained via maximizing likelihood) as the linear
81 | regression case with MSE loss.
82 |
83 | - Then the chapter talks about **Bayesian statistics**. To measure uncertainty
84 | of the estimator, the Frequentist approach uses the variance, but the Bayesian
85 | approach suggests to integrate instead. I also remember their example with
86 | Bayesian linear regression, we have to combine p(y|X,w)*p(w) but those are
87 | both exponentials and they multiply to result in another exponential which can
88 | be rearranged in the form of another Gaussian. If we want a single point
89 | estimate instead of a distribution, use **MAP estimates**. But why not just do
90 | the Frequentist MLE approach? Because MAP estimates retain *some* benefit of
91 | the Bayesian approach. That's the intuition, I guess.
92 |
93 | - Review:
94 |
95 |       theta_MAP = argmax_theta p(theta|x)
96 |                 = argmax_theta p(theta) p(x|theta)   // argmax unchanged by 1/p(x)
97 |                 = argmax_theta [ log p(theta) + log p(x|theta) ]
98 |
99 |   (and for the MLE Gaussian, Frequentist case)
100 |
101 |       theta_ML = argmax_theta \prod_i p(y_i|x_i, theta)
102 |                = argmax_theta \sum_i \log p(y_i|x_i, theta) // These are Gaussians
103 |
104 | - **Supervised Learning Algorithms**. The authors start by generalizing linear
105 | regression into logistic regression, as expected. Not much new here. With
106 | logistic regression, we no longer have a closed-form solution for the optimal
107 | weights, which is why gradient descent helps.
108 |
109 | - PS: Don't forget **SVMs**. I've forgotten some of it due to its lack of
110 | exposure in Deep Learning. The key innovation here is the kernel trick, of
111 |     course (helps us model functions nonlinear in x, efficiently). The SVM function
112 | is nonlinear w.r.t. the data, but it's **linear** w.r.t the coefficients
113 | \alpha. The \alpha here is mostly zeros, so as to reflect only points on the
114 | boundary close to the current sample of interest.
115 |
116 | - But note that SVMs and kernel machines in general struggle to generalize
117 | well, and Deep Learning is precisely designed to improve upon that.
118 |
119 | - Another common algorithm, **k-nearest neighbors**. In fact, there is not
120 | even a training or a learning stage for this (nonparametric) method. Yet
121 | another one, **decision trees**.
122 |
123 | - Note, p.144 missing a figure in my PDF version? TODO check.
124 |
125 | - **Unsupervised Learning Algorithms**. Examples: PCA and K-Means Clustering.
126 | PCA can be viewed as a data compression algorithm, or one which learns a
127 | "useful" representation of data (perhaps as "simple" as possible, to identify
128 | independent sources of variation which capture the essence of the data). This
129 | means using PCA to transform the data so that the covariance matrix of the
130 | transformed data is a diagonal matrix. PCA:
131 |
132 | > This ability of PCA to transform data into a representation where the
133 | > elements are mutually uncorrelated is a very important property of PCA. It
134 | > is a simple example of a representation that attempts to disentangle the
135 | > unknown factors of variation underlying the data.
136 |
137 | Then there's k-means, which learns a one-hot encoding for each sample. This is
138 | a bit extreme, though. The learning, of course, works like EM.
139 |
140 | - Stochastic Gradient Descent. The main workhorse of Deep Learning! It helps
141 | that our cost functions naturally decompose into a sum over training examples
142 | with per-sample loss (and taking the empirical mean of those, so it's an
143 | expectation!!!). Thus, take a minibatch sum of those terms. In fact, we can
144 | often converge to a good solution even without touching every element in the
145 | dataset (i.e. less than a single pass).
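
  The minibatch estimate (book-style notation; m' is the minibatch size and eps
  the learning rate):

      g = (1/m') * grad_theta \sum_{i=1}^{m'} L(x^(i), y^(i), theta)
      theta <- theta - eps * g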
146 |
147 | - Section 5.11, which focuses specifically on Deep Learning challenges. DL helps
148 | to deal with the curse of dimensionality (PS: nice visuals in Figure 5.9!).
149 | They also help with local constancy and smoothness, meaning that we want f(x)
150 | to be approximately f(x+eps). Most classical algorithms try to follow this
151 | implicit prior, but the problem is that it doesn't scale to larger datasets
152 | because it requires enough examples to observe the data space. With DL, we try
153 | and introduce dependencies among different regions, using a "composition of
154 | factors". See Chapters 6 and 15 for this. Oh yeah, this is the idea of DL with
155 | hierarchies of features ... I can see where this is going.
156 |
157 | The last bit here is about manifold learning. We use it informally in machine
158 | learning to indicate a set of points that are well-connected or associated
159 | with each other in a lower-dimensional space. With high dimensions, it's
160 | essential to assume that most points in R^n are invalid. The authors argue
161 | that this is the case in terms of images, sounds, and text. For instance,
162 | uniformly sampling points in image results in static, and random words/letters
163 | mean gibberish instead of interesting sentences. It would be great if learning
164 | algorithms could *discover* these manifolds. In fact, GANs help us with that!
165 |
166 |   (This is a bit hand-wavy, make sure to re-read this section if I want to
167 | refresh my memory.)
168 |
--------------------------------------------------------------------------------
/Random/AWS_Notes.txt:
--------------------------------------------------------------------------------
1 | -----------------------
2 | - AMAZON WEB SERVICES -
3 | -----------------------
4 |
5 | ****************
6 | * May 11, 2017 *
7 | ****************
8 |
9 | I promise, I will learn how to use AWS so that I can finally run code in
10 | clusters instead of running pseudo-parallel code on my personal workstation.
11 |
12 | First, a few pointers, definitions, etc:
13 |
14 | - Be careful! Don't run code for no reason. This uses up resources. It's not
15 | like my personal machine where I can pound it for no reason. Again, be
16 | careful. Also, be mindful of the location of the actual computing resources
17 | I'm using.
18 |
19 | - Amazon Web Services (AWS). It seems like I can use this just by using my
20 | normal Amazon account. It provides a number of services for cloud computing,
21 | which lets me use lots of computing power via the Internet, so long as we
22 | pay an amount commensurate with our usage level. See also:
23 |
24 | > Cloud computing provides a simple way to access servers, storage, databases
25 | > and a broad set of application services over the Internet. A Cloud services
26 | > platform such as Amazon Web Services owns and maintains the
27 | > network-connected hardware required for these application services, while
28 | > you provision and use what you need via a web application.
29 |
30 | (Cloud computing is really a marketing term ... don't put too much thought
31 | into it. Just think of it as a way for me to access lots of resources without
32 | having to buy them online, assemble my workstation, tell Berkeley to hook them
33 | up to the Internet, etc. I have one desktop that took me a while to set up; a
34 | server with many machines would take a lot longer to set up.)
35 |
36 | - Amazon Elastic Compute Cloud (EC2). These "EC2 Instances" are "virtual
37 | machines" that AWS provides, i.e. EC2 is a component of AWS. It seems to be
38 | an example of "Infrastructure as a Service" (IaaS).
39 |
40 | - Amazon Machine Images (AMI). These are virtual machine images. I can use these
41 |   to launch stuff within EC2. Don't forget to keep the key-pair! I think the
42 | point with cloud computing is that we can pick and choose which images match
43 | our desired specs and then "run them." To connect to these, use the good
44 | old-fashioned ssh. There are community-provided AMIs which I assume are from
45 | people/groups around the world who are letting us use their machines in
46 | exchange for payment. There are also marketplace AMIs, which are verified by
47 | AWS.
48 |
49 | - Google Cloud. I don't think I need to use this? It seems to be an alternative
50 | to Amazon Web Services. Once I have a Google Cloud account, I can create
51 | Google Compute Engines (GCEs) to run code, and even use Jupyter Notebooks for
52 | those which I can access in my local browser. For GPUs, I need to send in
53 | special requests.
54 |
55 | See the following for a comparison between these two:
56 |
57 | http://cloudacademy.com/blog/google-cloud-vs-aws-a-comparison/
58 |
59 | The AWS website has lots of tutorials. I will check those tomorrow.
60 |
61 | Python libraries to know/learn:
62 |
63 | - boto (or boto3?)
64 | - redis
65 | - multiprocessing
66 | - click
67 |
68 | I've only "used" multiprocessing before ... and it didn't work for me. Also,
69 | click seems to be more for command line arguments instead of distributed
70 | systems. It seems to be an alternative to argparse ... yeah, I better check that
71 | out! It might become the subject of my next blog post.
72 |
73 |
74 | ****************
75 | * May 12, 2017 *
76 | ****************
77 |
78 | I went through this 10-minute tutorial: "Launch a Linux Virtual Machine".
79 | Highlights:
80 |
81 | - After clicking "Launch Instance", I get to the familiar AMI page. Think of
82 | this as a place to choose my desired computer specs. (Note: to avoid
83 | confusion, this is what happens when we're at the AWS console; there is
84 | another "Launch Instance(s)" button that happens later, once I'm actually
85 | ready to do something.)
86 |
87 | - The tutorial uses a "General Purpose Instance" which should probably be my
88 | default choice for applications, unless I have a pressing reason to use
89 | something else. It also automatically clicks the "free tier eligible" image.
90 |
91 | - Wow, there is a LOT of stuff on the AWS Interface. Getting used to the GUI
92 | will take a while, but I at least know how to see my instances.
93 |
94 | - I can connect to my instance using:
95 |
96 | ssh -i ~/.ssh/MyKeyPair.pem ec2-user@{IP_Address}
97 |
98 | The IP address can be found on the AWS interface. This puts me in the
99 | `/home/ec2-user` folder on an instance, and it looks like I'm the only user.
100 | Huh, that's interesting, I thought this was going to be a shared machine with
101 | loads of users. Looks like `python` is installed, but not `ipython`. Argh.
102 |
103 | - I terminated the instance, and I got this message:
104 |
105 | Broadcast message from root@ip-[IP CENSORED]
106 | (unknown) at 16:55 ...
107 |
108 | The system is going down for power off NOW!
109 | Connection to [IP CENSORED] closed by remote host.
110 | Connection to [IP CENSORED] closed.
111 |
112 | Interesting ... if we did *not* terminate the instance (but it was idle) then
113 | we still get charged. I didn't get charged (I hope not ...).
114 |
115 |
116 | Another potential resource:
117 |
118 | http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
119 |
120 | "Setting Up":
121 |
122 | - I see, this is why I didn't need a password:
123 |
124 | > AWS uses public-key cryptography to secure the login information for your
125 | > instance. A Linux instance has no password; you use a key pair to log in to
126 | > your instance securely. You specify the name of the key pair when you launch
127 | > your instance, then provide the private key when you log in using SSH.
128 |
129 | - There's some stuff about "Virtual Private Clouds" and "Security Groups," but
130 | I'm not sure I understand or if it's that important right now. Think of those
131 | as firewalls, maybe? Yeah, the EC2 console says security groups control access
132 | to the instance.
133 |
134 |
135 | "Getting Started":
136 |
137 | - This is basically the same as the 10-minute tutorial. They also tell us how to
138 | connect with a browser. That might be inconvenient, but maybe not, if we're
139 | running on 1000 machines. But how do we run code using this? There must be
140 | some command line?
141 |
142 | - Oh, here's what they say about termination:
143 |
144 | > Terminating an instance effectively deletes it; you can't reconnect to an
145 | > instance after you've terminated it.
146 |
147 | I see. On the EC2 console, I can't seem to re-start that instance I created in
148 |   that 10-minute tutorial. There is, however, a difference between STOPPING an
149 |   instance and TERMINATING an instance. The former lets me reuse the instance
150 | at some point later (and it doesn't charge me for the stopping period, though
151 | there IS a charge for storage ... look at their description about this).
152 |
153 |
154 | For billing, see:
155 |
156 | http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html
157 |
158 | A few pointers:
159 |
160 | - To see billing on the dashboard, click my name, and then the billing dashboard
161 | setting. It should be intuitive.
162 |
163 | - Try to use the free tier to test things:
164 |
165 | > You can test-drive some AWS services free of charge, within certain usage
166 | > limits. AWS calls this the AWS Free Tier. The free tier is designed to give
167 | > you hands-on experience with a range of AWS services at no charge. For
168 | > example, you can explore AWS as a platform for your business by setting up a
169 | > test website with a server, alarms, and database. You can also try out
170 | > services for developers, such as AWS CodePipeline, AWS Data Pipeline, and
171 | > AWS Device Farm.
172 |
173 | - Actually, looks like I'm not on the free tier since I had made the account in
174 | November 2015 despite NOT EVER USING IT ...
175 |
176 |
177 | For running on *clusters*, see:
178 |
179 | http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html
180 |
187 |
188 | ****************
189 | * May 28, 2017 *
190 | ****************
191 |
192 | OK, I managed to finally make a new account, so I get the one-year free tier
193 | award. Let's see how that works out for me. Now let me try Jonathan Ho's
194 | Evolution Strategies code. How do we use Packer again?
195 |
197 | Packer might be useful for running on clusters. This helps me create identical
198 | machine images (i.e. AMIs) so that the nodes in a cluster are running and using
199 | the same stuff/settings. It's installed on my station. Use `.json` files for
200 | building images (be careful about expenses!). These are configuration files to
201 | allow us to specify various settings about the image(s) we want to build. Run
202 |
203 | `packer build XXX.json`
204 |
205 | to build it. However, I think this requires two keys from AWS, which I can
206 | obtain online. I think I can just make them for me personally. They recommend
207 | creating keys separately for IAM users, but that seems to be more helpful for
208 | organizations with many users (kind of like computers with user accounts).
209 |
210 | NOTE: IAM = "Identity and Access Management."
211 |
212 | After running Packer's examples with my provided keys, I have a **snapshot**. It
213 | was a bit tricky to find. I had to search in the US-east region (N. Virginia),
214 | not the US-west region (N. California). Then click on "Snapshots" and I can see
215 | my AMI. This is **my** AMI, actually. So I'll get charged!
216 |
217 | In addition, assuming I'm in the right region, when I launch an instance, I can
218 | go to "My AMIs" and I will see that image right there. (It doesn't work if I'm
219 | using N. California, so the lesson is that one needs to be aware of what regions
220 | were used!)
221 |
222 | To be clear, what got created out of this configuration file was NOT an
223 | "Instance," but it seems to be either an "Image --> AMIs" or an "Elastic Block
224 | Store --> Snapshots." Strangely, I see something underneath both of those menu
225 | options ... I'm not sure what the difference is. They seem to be similar, except
226 | AMIs are, I assume, something that's representative of a full system, whereas
227 | the snapshots are backups of those ... yeah, it's not clear. Maybe check this:
228 |
229 | https://serverfault.com/questions/268719/amazon-ec2-terminology-ami-vs-ebs-vs-snapshot-vs-volume?
230 |
231 | Snapshots and Volumes should be subsets or types of EBSs, which themselves look
232 | like hard drives. Volumes are pieces and bits of EBSs, and Snapshots are
233 | captures (i.e. copies) of volumes at specific times.
234 |
235 | I *think* I have an idea of what an image means. I mean, with CS 231n, they
236 | provide an image with specialized GPU and Deep Learning stuff. That's with the
237 | "Community AMIs" of course.
238 |
239 | From Packer:
240 |
241 | > After running the above example, your AWS account now has an AMI associated
242 | > with it. AMIs are stored in S3 by Amazon, so unless you want to be charged
243 | > about $0.01 per month, you'll probably want to remove it. Remove the AMI by
244 | > first deregistering it on the AWS AMI management page. Next, delete the
245 | > associated snapshot on the AWS snapshot management page.
246 |
247 | I just did both of those.
248 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter10notes.txt:
--------------------------------------------------------------------------------
1 | ****************************************************************
2 | * NOTES ON CHAPTER 10: Recurrent and Recursive Neural Networks *
3 | ****************************************************************
4 |
5 | I need to understand the parameter sharing and how RNNs (and their variants) can
6 | be "combined" into other areas. The parameter sharing is key, as it allows for
7 | *generalization*. CNNs share parameters with the weight filters across the
8 | grids; RNNs share parameters across timesteps.
9 |
10 | Quick note: I think they're using minibatch sizes of 1 to simplify all notation
11 | and exposition here. That's fine with me. Think of x as:
12 |
13 | [ x^1 x^2 ... x^T ]
14 |
15 | where superscripts indicate time. Note that each x^i itself could be a vector!
16 |
17 | Section 10.2, Recurrent Neural Networks
18 |
19 | It's important to understand the *computational graphs* involved with RNNs. I
20 | understand them as directed acyclic graphs, so how does this extend with
21 | recurrence? It's easier to think of them when we unroll (i.e. "unfold") the
22 | computational graphs. See Figure 10.2 as an example (I was able to get this
23 | without looking at the figure). They also use a more succinct "recurrent graph"
24 | representation.
25 |
26 | RNN Design Patterns, also kind of described in Andrej Karpathy's blog post:
27 |
28 | - Producing an output at each time step, and having recurrent connections
29 | between hidden layers. This is Figure 10.3, which I correctly predicted in
30 | advance minus the loss and y stuff. They have losses for *each* time step.
31 | Note the three matrix multiplies that are there, with the *same* respective
32 | matrices repeated across time. Also, we're using the softmax, so assume the
33 | output is discrete at each time step, e.g. o(t) could be the categorical
34 | distribution over the 26 letters in the alphabet.
35 |
36 | - Same as above, except recurrent connections are from outputs to hidden layers,
37 |   so we still have three matrices but the "arrows" in the computational graph
38 |   change. This is *less powerful*. Why?? Think: the former allows hidden to
39 |   hidden, so the hidden state can be very rich. The latter only lets information
40 |   flow hidden -> output -> hidden, so the output is all that gets carried
41 |   forward, and it may be less rich. That seems intuitive.
42 |
43 | - Same as the first one (hidden to hidden connections) except we now have one
44 | output. That's useful to summarize, such as if we're doing sequence
45 | classification.
46 |
47 | Now develop the equations, e.g. f(b + Wh + Ux) where h is from the *previous*
48 | time step and x is the *current* time step, and f is the *activation* function.
49 | Yes, it's all familiar to me. They mention, though, that backpropagation is very
50 | expensive. They call the naive way (applying it on the unrolled computational
51 | graph) as "backpropagation through time."
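
Concretely, for the first design pattern (tanh hidden units, softmax outputs;
this should match the book's equations, give or take notation):

    a(t) = b + W h(t-1) + U x(t)     // pre-activation
    h(t) = tanh(a(t))                // hidden state
    o(t) = c + V h(t)                // output pre-softmax
    yhat(t) = softmax(o(t))          // categorical distribution at time t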
52 |
53 | How to compute the gradient? They give us an example, thank goodness. Comments:
54 |
55 | - Note that L = L(1) + L(2) + ... + L(\tau) so yes, dL/dL(t) = 1 for all t. Each
56 | L(t) is a negative log probability for that output at that time.
57 |
58 | - The next equation (10.18) also makes sense, here i is the component in the
59 | vector, so we're in the univariate case.
60 |
61 | - Equation 10.19 is good, keep in mind that here we have to be careful with the
62 | timestep. For other h(t), we need to add two gradients due to two incoming
63 | terms (because of two *outgoing* terms in the *forward* pass). Thus, the
64 | matrices V and W will be present in some form.
65 |
66 | - The next part about using dummy variables for t is slightly confusing but it
67 | should just mean that the total contribution for these parameters are based on
68 | their sum across each time. Yeah, looking at the book again it's just a
69 | notation issue to help us out. For all those gradients, we have a final sum
70 | over t, where each term in the sum is a matrix/vector of the same size as the
71 | variable we're taking the gradient w.r.t.
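
In symbols, the bookkeeping above (my paraphrase, assuming tanh hidden units and
softmax + negative log likelihood outputs):

    dL/do(t)_i = yhat(t)_i - 1{i = y(t)}
    dL/dh(tau) = V^T dL/do(tau)                       // last timestep: one term
    dL/dh(t)   = V^T dL/do(t)                         // from o(t)
               + W^T diag(1 - h(t+1)^2) dL/dh(t+1)    // from h(t+1), for t < tau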
72 |
73 | PS: when reading this, don't be confused by the notation. Look at the "notation"
74 | chapter online.
75 |
76 | RNNs as directed graphical models? This section is about expressing them as
77 | well-defined directed graphical models, and there are a few subtleties. This is
78 | WITHOUT any inputs, BTW ... probably just for intuition?
79 |
80 | They go through an example predicting a sequence of scalars. With the naive
81 | unrolled (directed) graphical model, we're applying the chain rule of
82 | probability and so it's very inefficient. RNNs provide better (in many metrics,
83 | but particularly efficiency) ways to express such distributions with directed
84 | graphical models by introducing deterministic connections (remember, the hidden
85 | states are deterministic).
86 |
87 | With RNNs, parameter sharing is a huge advantage, but the downside is that
88 | optimizing is hard because we make a potentially strong assumption that at each
89 | time step, the distribution embedded in the RNN remains stationary.
90 |
91 | The last bit here to get it into a well-defined graphical model is to figure out
92 | the length of the RNN. The book presents three options, all of which seem
93 | obvious (though I'm ignoring lots of details, etc.).
94 |
95 | The next subsection (10.2.4) after this is about the more realistic setting of
96 | having x (input), so we're also modeling p(y|x). I think it's trying to stick
97 | with the graphical model setting. Also, note that the second option in the list
98 | of three things is what we did in CS 231n, Assignment 3, with the image
99 | captioning portion. Actually, the first option would seem better, which
100 | translates the input image to a vector as input to *all* hidden states, but
101 | that's harder to implement.
102 |
103 | I was quite confused about Figure 10.9, as to why we are considering the y(t)s
104 | as inputs?? However, it seems like it's because we want to model p(y|x) and,
105 | well, y is the ground truth. I'm just having trouble translating this to code,
106 | or maybe that's not what I should be doing, and instead just think of it as a
107 | graphical model? To think of it as code, I'd need the other case we had earlier
108 | where the *output* or *hidden state* was the input to the hidden state, not the
109 | actual target (which is to be compared with the output).
110 |
111 | Section 10.3: Bidirectional RNNs
112 |
113 | Bidirectional RNNs help us model the output y(t) when that output may also
114 | *depend on future times* t+1, t+2, etc., such as with speech recognition where
115 | we need to peek ahead a bit. Don't use a fixed window, though, they say:
116 |
117 | > This allows the output units o(t) to compute a representation that depends on
118 | > both the past and the future but is most sensitive to the input values around
119 | > time t, without having to specify a fixed-size window around t.
120 |
121 | Nice!
122 |
123 | Section 10.4: Encoder-Decoder Sequence-to-Sequence Architectures
124 |
125 | Use these to avoid the restriction of fixed sequence sizes for the inputs x (or
126 | x(t)). This is their main benefit/innovation, the lengths n_x and n_y (see
127 | Figure 10.12 if confused on this notation) **can vary**; if the training
128 | data consists of a bunch of sequences that are of similar or different lengths,
129 | the RNN will learn to mirror that training data. Side note: the first relevant
130 | paper on this (from 2014) called it "Encoder-Decoder" while the second one
131 | called it "Sequence-to-Sequence". I skimmed that second one, from Sutskever et
132 | al, NIPS 2014 last year, though maybe I should re-read it. Both papers are
133 | highly-cited.
134 |
135 | Connection with Section 10.2.4: we have a fixed-sized context vector C (well,
136 | usually) coming out of the encoder. Well, C is input to the decoder, and this is
137 | *precisely* the vector-to-sequence RNN architecture we talked about in that
138 | sub-section!
139 |
140 | How can the encoder deal with varying sizes n_x? If you think about it, it's
141 | just applying the RNN update over and over again to produce a fixed hidden state
142 | of the same size. At time t, we have processed x(1),...,x(t), and have hidden
143 | state h(t). (We're ignoring the earlier hidden states for simplicity.) Then at
144 | the next time t+1, say the last one, we get h(t+1) and pass that in. So there's
145 | no issue with getting different sized inputs, because all that matters is (a)
146 | that we can repeatedly apply the RNN update, which is a for loop over the input
147 | sequence, and (b) that we take a fixed sized input to the decoder, which we can
148 | do with our final hidden state!
149 |
150 | Section 10.5: Deep Recurrent Neural Networks
151 |
152 | In all likelihood, I will not be dealing with these, but it might be worth
153 | knowing how deep we can go with RNNs, just like how I learned about the very
154 | deep GoogLeNet and the **ultra** deep ResNet. When we talk about depth, we mean
155 | adding more layers (w.r.t. the unrolled graph perspective) to the three
156 | components: input to hidden, hidden to hidden, and/or hidden to output. This
157 | might make learning hard, so one option is to introduce skip connections like
158 | in ResNets (man, I'm glad I reviewed ResNets).
159 |
160 | Section 10.6: Recursive Neural Networks
161 |
162 | Recursive Neural Networks, which we **do not** abbreviate as RNN, are a
163 | generalization of RNNs with a different computational graph "flavor" that looks
164 | like a tree rather than a chain.
165 |
166 | Section 10.7: Challenge of Long-Term Dependencies
167 |
168 | Why is it hard? Here are some relevant quotes:
169 |
170 | > The basic problem is that gradients propagated over many stages tend to either
171 | > vanish (most of the time) or explode (rarely, but with much damage to the
172 | > optimization). [...] the difficulty with long-term dependencies arises from
173 | > the exponentially smaller weights given to long-term interactions (involving
174 | > the multiplication of many Jacobians) compared to short-term ones. [...]
175 | > Recurrent networks involve the composition of the same function multiple
176 | > times, once per time step. These compositions can result in extremely
177 | > nonlinear behavior, as illustrated in figure 10.15.
178 |
179 | This section describes the problem, and the subsequent sections (I assume 10.8
180 | through 10.12, judging from the LSTMs here) describe ways to solve it.
181 |
182 | They present a simplified analysis with matrix eigendecomposition, where we
183 | assume no activations. Then yes, gradients can explode if eigenvalue magnitudes
184 | are greater than one, or vanish if they are less than one. Andrej Karpathy said
185 | something similar in his medium blog post (why does he bother with medium?).
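
Roughly (under the book's simplification of repeated multiplication by the same
W with no nonlinearity, and W = Q Lambda Q^T with orthogonal Q):

    h(t) = W^t h(0) = Q Lambda^t Q^T h(0)

so the component of h(0) along each eigenvector is scaled by lambda_i^t:
|lambda_i| > 1 explodes, |lambda_i| < 1 vanishes.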
186 |
187 | No free lunch:
188 |
189 | > One may hope that the problem can be avoided simply by staying in a region of
190 | > parameter space where the gradients do not vanish or explode. Unfortunately,
191 | > in order to store memories in a way that is robust to small perturbations, the
192 | > RNN must enter a region of parameter space where gradients vanish (Bengio et
193 | > al., 1993, 1994).
194 |
195 | It's a bit annoying that we are simplifying here by ignoring the activation
196 | functions, but I guess Bengio's old papers address activation functions?
197 |
198 | Section 10.8: Echo State Networks
199 |
200 | I skimmed this section. It's quite high-level and not that important to me.
201 |
202 | Section 10.9: Leaky Units, Multiple Time Scales
203 |
204 | I like this explanation:
205 |
206 | > One way to deal with long-term dependencies is to design a model that operates
207 | > at multiple time scales, so that some parts of the model operate at
208 | > fine-grained time scales and can handle small details, while other parts
209 | > operate at coarse time scales and transfer information from the distant past
210 | > to the present more efficiently.
211 |
212 | Oddly enough, they don't cite the ResNet paper?!?
213 |
214 | They can add skip connections (i.e. adding edges to the RNN). Or they can remove
215 | edges from the RNN, which might have similar positive effects as skip
216 | connections.
217 |
218 | Section 10.10: LSTMs (finally!), Gated Recurrent Unit RNNs
219 |
220 | As of this writing (2016), these two RNNs are the most effective RNNs we have
221 | for practical applications involving sequences.
222 |
223 | Gated Recurrent Unit (GRU):
224 |
225 | - Main idea:
226 |
227 | > [...] gated RNNs are based on the idea of creating paths through time that
228 | > have derivatives that neither vanish nor explode.
229 |
230 | - The RNN needs to *learn* when to forget and discard the past (it can't
231 | remember everything, after all!).
232 |
233 | - Another quote:
234 |
235 | > The main difference with the LSTM is that a single gating unit
236 | > simultaneously controls the forgetting factor and the decision to update the
237 | > state unit.
238 |
239 | Long Short-Term Memory (LSTM):
240 |
241 | - See Figure 10.16 for the block diagram. It's still very confusing despite how
242 | I implemented it in CS 231n. I'm amazed that these work at all.
243 |
244 | - Like GRUs, LSTMs need to *learn* when to forget.
245 |
246 | - It uses self-loops to enable paths to flow for long durations. By flow, I mean
247 | not only the forward pass, but the *backward* pass.
248 |
249 | The authors' conclusion is to simply stick with GRUs or LSTMs.
250 |
251 | Section 10.11: Optimization for Long-Term Dependencies
252 |
253 | They talk about how to improve optimization, such as with second-order methods
254 | and clipping gradients. (Be careful, taking the average of a bunch of clipped
255 | gradients means gradients that were larger have their contributions removed; see
256 | the discussion in the textbook.)
257 |
258 | I wouldn't put too much stock into this, though, because the authors say:
259 |
260 | > This is part of a continuing theme in machine learning that it is often much
261 | > easier to design a model that is easy to optimize than it is to design a more
262 | > powerful optimization algorithm.
263 |
264 | In fact it seems like it's easier to train LSTMs using simple SGD rather than
265 | use a more complicated optimization algorithm. PS: is ADAM used with RNNs?
266 |
267 | Section 10.12: Explicit Memory
268 |
269 | Philosophical quote:
270 |
271 | > Neural networks excel at storing implicit knowledge. However, they struggle to
272 | > memorize facts.
273 |
274 | This section introduces **Memory Networks** and **Neural Turing Machines**.
275 |
276 | For NTMs, note that:
277 |
278 | > It is difficult to optimize functions that produce exact, integer addresses.
279 | > To alleviate this problem, NTMs actually read to or write from many memory
280 | > cells simultaneously. To read, they take a weighted average of many cells. To
281 | > write, they modify multiple cells by different amounts
282 |
283 | Yeah, it's basically **soft attention**.
284 |
285 | Conclusion of the chapter:
286 |
287 | > Recurrent neural networks provide a way to extend deep learning to sequential
288 | > data. They are the last major tool in our deep learning toolbox. Our
289 | > discussion now moves to how to choose and use these tools and how to apply
290 | > them to real-world tasks.
291 |
292 | Whew!
293 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Mathematical_Introduction_Robotic_Manipulation.txt:
--------------------------------------------------------------------------------
1 | Notes on the textbook:
2 |
3 | A Mathematical Introduction to Robotic Manipulation, 1994.
4 | Richard M. Murray and Zexiang Li and S. Shankar Sastry
5 |
6 | A bit old but still in use for Berkeley's courses.
7 |
8 |
9 | ***************************
10 | * Chapter 1: Introduction *
11 | ***************************
12 |
13 | Some history here ... not that relevant to me at this moment. I'd like to see a
14 | more modern take on this.
15 |
16 | But I do like this:
17 |
18 | > The vast majority of robots in operation today consist of six joints which are
19 | > either rotary (articulated) or sliding (prismatic), with a simple "end-
20 | > effector" for interacting with the workpieces.
21 |
22 | Yes, the dvrk has one "prismatic" joint out of seven (note, seven, not six...)
23 | and the others are rotary --- the dvrk guide actually says "revolute". And I
24 | obviously know the end-effectors by now. (Edit: "revolute" is clearly the better
25 | terminology... fortunately the book uses that later.)
26 |
27 | Then they talk about the book outline. Yeah, maybe I'll definitely take a look
28 | at Chapter 2 at a "leisurely pace" to better understand rigid body motion:
29 |
30 | > In this chapter, we present a geometric view to understanding translational
31 | > and rotational motion of a rigid body. While this is one of the most
32 | > ubiquitous topics encountered in textbooks on mechanics and robotics, it is
33 | > also perhaps one of the most frequently misunderstood.
34 |
35 | OK, fair enough.
36 |
37 |
38 | ********************************
39 | * Chapter 2: Rigid Body Motion *
40 | ********************************
41 |
42 | > In this chapter, we present a more modern treatment of the theory of screws
43 | > based on linear algebra and matrix groups. The fundamental tools are the use
44 | > of homogeneous coordinates to represent rigid motions and the matrix
45 | > exponential, which maps a twist into the corresponding screw motion.
46 |
47 | == Important facts ==
48 |
49 | - Location (x, y, z).
50 |
51 | - Trajectory (x(t), y(t), z(t)) = p(t).
52 |
53 | - Rigid **body** satisfies || p(t) - q(t) || = || p(0) - q(0) || = constant.
54 |
55 | - Rigid body transformation: map from R^3 -> R^3 representing "rigid motion"
56 | (subtle point: cross product must be preserved).
57 |
58 | - Cartesian frame: specified with axes vectors x, y, z. These **must** be
59 | _orthogonal_ and with magnitude 1. I.e., _orthonormal_ vectors. Oh, and
60 | preserves z = x \times y to preserve the right-handedness of the system.
61 |
62 | - Know **rotation matrices**: orthogonal and has determinant 1 if right handed
63 | coordinate frame.
64 |
65 | - Figure 2.1 is helpful. **Every rotation** of that object corresponds to some
66 | rotation matrix (well, w.r.t. a fixed frame). And the rotation matrix even
67 | has a special form: we stack the coordinates of the principal axes (x,y,z)
68 | of the **body frame** of the object w.r.t. the "inertial frame."
69 | - Can also think of rotation matrices as transforming points from one frame to
70 | another. Draw a picture for their example; it's worth it.
71 | - Combine rotation matrices via matrix multiplication to form other rotations.
72 |
73 | - SO(n) = "Special Orthogonal" group of (n,n) matrices, typically n=3 but
74 | sometimes n=2. These are a linear algebra "group" under matrix multiplication;
75 | definition is the same as the abstract algebra concept.
76 |
77 | Related notation: so(n), with lowercase letters, is the space of n-by-n
78 | **skew symmetric** matrices, so A^T = -A.
79 |
80 | - SE(n) = "Special Euclidean" group: R^n x SO(n). In the general case with n=3,
81 | we have six dimensions. This is the usual "position and rotation" that I'm
82 | familiar with; denote these as (p,R) where p is in R^3 and R is in SO(3).
83 |
84 | == Other Major Points ==
85 |
86 | - How to prove that something (e.g., a rotation) is a rigid body transformation?
87 | It's simple: show that the transformation preserves distance and orientation.
88 | Look at Definition 2.1 and literally just prove the two properties!
89 |
90 | Don't forget to review the _cross_product_ between two vectors.
91 |
92 | a x b = \hat{a} b, where \hat{a} is the cross-product (skew-symmetric)
93 | matrix of a. This is the same "hat" notation the book uses for the
94 | exponential coordinates of rotation, as in `e^{\hat{a} \theta}`.
95 |
96 | And be careful about the distinction:
97 |
98 | _points_ (typically written as p, q)
99 | _vectors_ (typically written as v, w)
100 |
101 | For two points p, q \in O, the vector v \in R^3 is the _directed_ line
102 | segment going from p to q.
103 |
104 | Conceptual difference: vectors have a _direction_ and a _magnitude_.
105 |
106 | - To track motion of a rigid body, we just need to watch one point plus the
107 | rotation w.r.t. that point. Hence, use a *configuration* which means we
108 | "attach" a coordinate frame to a point and track it w.r.t. a fixed frame.
109 | Don't forget what we mean by a configuration: something which can tell us
110 | "complete" (or "sufficient"?) information about something in some space. I
111 | remember that from CS 294-115. More precisely, that's SE(3).
112 |
113 | - "Exponential coordinates for rotation" are derived by considering: given an
114 | *axis* of rotation \omega and the amount (i.e., the angle about that axis) we
115 | rotate some arm (e.g., see Figure 2.2), can we derive the rotation matrix R?
116 | They derive it by setting `R = e^{\hat{\omega} * \theta}` where `\hat{\omega}`
117 | is a matrix. That's where the exponential stuff comes from. For a closed-form
118 | implementation, look at **Rodrigues' formula**, which I used for CS 280. (A
119 | numpy sketch follows at the end of this list.)
120 |
121 | - This is known as "angular velocity" in physics.
122 | - We like this due to Euler's Theorem (2.6 in the book): _any_ orientation R
123 | in SO(3) is equivalent to a rotation about axis w in R^3 through an angle.
124 |
125 | - Theorem: **every rotation matrix** can be represented as the matrix
126 | exponential of some skew-symmetric matrix.
127 |
128 | BTW, in their notation, \hat{\omega} is a skew-symmetric 3x3 matrix. And
129 | they represent skew symmetric matrices as the product of a *unit*
130 | skew-symmetric matrix and a real number.
131 |
132 | - Another representation of rotations are the three **Euler Angles** which is
133 | what I'm most familiar with. AKA yaw, pitch, roll. The order of which axes we
134 | rotate about matters, since it can be represented as the product of three
135 | matrices. See Equation 2.20 for the formulas to derive yaw, pitch, and roll.
136 | Watch out for computing the correct quadrant for the arc-tan functions.
137 |
138 | - Downside: singularities. E.g., there are infinitely many representations of
139 | certain rotations, and it is a "fundamental topological fact" that
140 | singularities can't be eliminated in a 3-D representation of SO(3). I don't
141 | know why, but the authors argue that:
142 |
143 | > This situation is similar to that of attempting to find a global
144 | > coordinate chart on a sphere, which also fails.
145 |
146 | Hmm ... sounds intriguing. But I won't fret too much about this.
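
As promised above, a minimal numpy sketch of Rodrigues' formula (my own code,
not the book's); it assumes `omega` is a unit vector:

    import numpy as np

    def hat(w):
        # Cross-product (skew-symmetric) matrix: hat(w) @ v == np.cross(w, v).
        return np.array([[    0, -w[2],  w[1]],
                         [ w[2],     0, -w[0]],
                         [-w[1],  w[0],     0]])

    def exp_so3(omega, theta):
        # R = e^{\hat{\omega} * \theta} via Rodrigues' formula (unit omega).
        W = hat(omega)
        return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)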
147 |
148 | == Rigid Motion in R^3 ==
149 |
150 | (Now we're dealing with _translations_, in addition to rotations.) This is where
151 | the _SE(3)_ group appears. An element `(p,R) \in SE(3)` serves as:
152 |
153 | - A specification of the configuration of a rigid body.
154 | - A transformation taking the coordinates of a point from one frame to
155 | another.
156 |
157 | This is exactly analogous to the SO(3) case, where `R \in SO(3)` was either a
158 | rotation configuration or a rotation mapping. We can view it either way. :-)
159 |
160 | To make the linear algebra math easier to describe rigid transformations, use
161 | **homogeneous coordinates**.
162 |
163 | - Add 1 to the coordinates of a point, so now we're in R^4, and vectors are
164 | (well, effectively) in R^3 since their 4th component is always zero.
165 | - Now an RBT is one matmul on a vector, an affine transformation. The last
166 | row is all zeros except for a 1 at the lower right corner.
167 | - To compose these transformations, do more matmuls (see the sketch below).
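
A sketch of that homogeneous representation (my own helper; same numpy setup
as the sketch above):

    import numpy as np

    def rbt(R, p):
        # 4x4 homogeneous matrix for (p, R) in SE(3); bottom row is [0,0,0,1].
        g = np.eye(4)
        g[:3, :3] = R
        g[:3, 3] = p
        return g

    # Composition is just matmul: g_ac = rbt(R_ab, p_ab) @ rbt(R_bc, p_bc),
    # and a point q in R^3 transforms as (g @ np.append(q, 1.0))[:3].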
168 |
169 | Must also know the exponential coordinates for rigid motion, so the SE analogue
170 | to the SO exponential of a skew symmetric matrix representing a rotation.
171 |
172 | - Once again, start from considering rotation about axis \omega
173 | - Then derive velocity of tip point via cross products
174 | - Then solve (integrate) differential equation to get exponential map
175 | - Main difference is the use of 4x4 matrices w/homogeneous-like
176 | representation. Also, we consider an extra ("offset"?) point q on \omega.
177 |
178 | Define se(3):
179 | se(3) := { (u,\hat{omega}) s.t. u in R^3, \hat{omega} in so(3) }
180 | Elements of se(3) are _twists_; Can also write them using 4x4 matrices using
181 | homogeneous coordinates, useful for the following proposition ...
182 |
183 | Proposition 2.8: given \hat{ξ} \in se(3) and \theta \in R, the exponential of
184 | \hat{ξ}*\theta is an element of SE(3), the special Euclidean group ... think
185 | of it as the possible translations and rotations.
186 |
187 | Proof technique:
188 | - Start w/4x4 matrix \hat{ξ} in se(3). Want to show: exp(\hat{ξ}*theta)
189 | in SE(3).
190 | - Prove by construction and obtain a formula for that exponential.
191 | - Split into cases, \omega = 0 versus \omega =/= 0.
192 | - For second (harder) case, relate to \hat{ξ-prime} and use properties of
193 | exponentials and cross products.
194 | - Use the _homogeneous_ representation of elements in SE(3). Normally, I
195 | think of (p,R) \in SE(3), but use the 4x4 _matrix_ with R and p in it.
196 |
197 | Intuition: earlier we interpreted elements of SE(3) as transforming from one
198 | coordinate frame to another. Here, interpret it as mapping points from
199 | _initial_ coordinates to their coordinates _after_ the rigid motion is
200 | applied. Key difference from earlier is that the start and end are specified
201 | w.r.t. a _single_ coordinate frame. The book says:
202 |
203 | > Thus, the exponential map for a twist gives the relative motion of a rigid
204 | > body. This interpretation of the exponential of a twist as a mapping from
205 | > initial to final configurations will be especially important as we study the
206 | > kinematics of robot mechanisms in the next chapter.
207 |
208 | Important! _Every_ rigid transformation can be written as the exponential of
209 | some twist. BTW, I think the twist is only the \hat{ξ} part, and the `\theta
210 | \in R` part is multiplied later. Not a big deal, just think of twists as the 4x4
211 | "\hat{ξ}" matrices in se(3).
212 |
213 | _Screws_ are a "geometric description" of twists and give us more intuition on
214 | them. More precisely:
215 |
216 | > Consider a rigid body motion which consists of rotation about an axis in space
217 | > through an angle of `\theta` radians, followed by translation along the same
218 | > axis by an amount `d` as shown in Figure 2.7a. We call such a motion a screw
219 | > motion, since it is reminiscent of the motion of a screw, in so far as a screw
220 | > rotates and translates about the same axis.
221 |
222 | - Characterizing a screw: define _pitch_, _axis_, and _magnitude_.
223 | - To compute RBT, draw a figure, determine end-point, and derive the rotation
224 | plus vector offset to get the usual 4x4 homogeneous matrix representation.
225 | - The RBT of a screw has an equivalence with the exponential of a twist
226 | `exp(\hat{ξ}*\theta)`.
227 | - It is possible to define a screw for every twist!
228 |
229 | Important theorem:
230 |
231 | > Theorem 2.11 (Chasles). Every rigid body motion can be realized by a rotation
232 | > about an axis combined with a translation parallel to that axis.
233 |
234 | Be careful about _relative_ motion, which is w.r.t. a SINGLE reference frame. To
235 | "switch" between frames, you need to do an extra matrix multiply with g_{ab} to
236 | map from B's coordinates to A.
237 |
238 | == Velocity of a Rigid Body ==
239 |
240 | (This is probably not that relevant for me.)
241 |
242 | == Wrenches and Reciprocal Screws ==
243 |
244 | (This is probably not that relevant for me.)
245 |
246 |
247 | *************************************
248 | * Chapter 3: Manipulator Kinematics *
249 | *************************************
250 |
251 | == Section 2: Forward Kinematics ==
252 |
253 | To determine the configuration of the end-effector given information about the
254 | robot joints, we typically assume that the robot is composed of a set of
255 | "lower-pair joints".
256 |
257 | - There are six common examples: prismatic, revolute, helical, cylindrical,
258 | planar, and spherical. The two most common are, of course, prismatic and
259 | revolute joints. (The 2017 book by Lynch & Park has figures of these,
260 | though they use "universal" instead of "planar".)
261 | - The reason why we like this assumption is that each of the joints
262 | **restricts the motion of adjacent links to a subgroup of SE(3)**, making it
263 | easier to analyze.
264 |
265 | Example: in Figure 3.1, there are four joints, three revolute and one
266 | prismatic. The revolute joints are specified with one \theta for each since it
267 | can be thought of as a single circle about some axis (specified with the right
268 | handed coordinate system). In fact, the same holds for the prismatic joint with
269 | \theta being the displacement along the axis, so specifying these four scalar
270 | values is enough for us to define the configuration of that particular robot.
271 | The **joint space** is the Cartesian product of these individual joint angles.
272 | Equivalently, we can form the configuration space of the robot. It has four
273 | degrees of freedom (3+1=4 obviously) but this of course doesn't hold as a
274 | general rule as robots may have constraints on joints that restrict some DoFs.
275 |
276 | Attach **two** coordinate frames:
277 |
278 | - Base frame: attached to a point on the manipulator which is stationary with
279 | respect to the first link (at index 0).
280 | - Tool frame: attached to the end-effector of the robot, so that the tool frame
281 | moves when the joints of the robot move (seems logical).
282 | So when I query the dVRK, the positions are clearly in the base frame, since
283 | if they were in the tool frame, the positions would always be (0,0,0).
284 |
285 | Forward kinematics: determine the function `g_st: Q -> SE(3)` that determines
286 | the configuration of the tool frame (w.r.t. the base frame). Q is the joint
287 | space of the manipulator, as I mention above.
288 |
289 | Generic solution:
290 |
291 | g_st(theta) = g_{s,l1}(theta_1) * ... * g_{l_{n-1},ln}(theta_n) * g_{ln,t}
292 |
293 | Concatenate the transformations among **adjacent** link frames.
294 |
295 | g_st, our final map, determines the _configuration_ of the _tool_ frame
296 | relative to _base_ frame. That's consistent with our subscript notation.
297 | Remember also that `g_{ij} \in SE(3)` can be thought as `(p_{ij},R_{ij})`.
298 |
299 | == Product of Exponentials ==
300 |
301 | We can obtain a more "geometric description" using PoEs. (Not sure what
302 | precisely this means...)
303 |
304 | Example/Figure 3.2 for an overview of two choices: using g_st(\theta) as
305 | previously discussed, or using PoEs in which
306 |
307 | g_st(theta) = exp(hat{ξ}_1*theta_1) * exp(hat{ξ}_2*theta_2) * g_st(0)
308 | (g_st(0) = rigid body transformation from T to S)
309 |
310 | Derive by thinking: "fix theta_1 and consider motion wrt theta_2. Then do
311 | motion wrt theta_1 and combine result". This is generalized:
312 |
313 | > For each joint, construct a twist `ξ_i` which corresponds to the screw motion
314 | > for the i-th joint with all other joint angles held fixed at `θ_j = 0`.
315 |
316 | Results in Equation 3.3 on pp.87, the PoEs, at last! (TODO: understand why the
317 | `ξ_i` have their particular form for revolute or prismatic cases.)
318 |
319 | If we assume that's true, then kinematics for Figure 3.3 are easily derived (and
320 | by this we can get every component in the matrices) by starting from PoEs and
321 | substituting into the formula for exp(hat{ξ}_i*theta_i) for 1<=i<=4 that we can
322 | find from Equation (2.36), pp.42.
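
To make this concrete, here's a rough numpy/scipy sketch of PoE forward
kinematics (my own helper names; I use `scipy.linalg.expm` as a stand-in for
closed-form formulas like Equation 2.36, and reuse `hat` from the Chapter 2
sketch above):

    import numpy as np
    from scipy.linalg import expm

    def twist_hat(v, w):
        # 4x4 homogeneous form of the twist xi = (v, w) in se(3).
        xi = np.zeros((4, 4))
        xi[:3, :3] = hat(w)
        xi[:3, 3] = v
        return xi

    def poe_fk(xi_hats, thetas, g_st0):
        # g_st(theta) = exp(xi_1 * theta_1) * ... * exp(xi_n * theta_n) * g_st(0)
        g = np.eye(4)
        for xi, th in zip(xi_hats, thetas):
            g = g @ expm(xi * th)
        return g @ g_st0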
323 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/ROS.text:
--------------------------------------------------------------------------------
1 | How to use ROS. I'm using ROS Indigo, on Ubuntu 14.04. Hopefully the Fetch will
2 | be updated for 16.04 soon.
3 |
4 |
5 | ***************************************************************
6 | * Tutorial 1: Installing and Configuring Your ROS Environment *
7 | ***************************************************************
8 |
9 | Note the environment variables after installation:
10 |
11 | ```
12 | $ printenv | grep ROS
13 | ROS_ROOT=/opt/ros/indigo/share/ros
14 | ROS_PACKAGE_PATH=/opt/ros/indigo/share:/opt/ros/indigo/stacks
15 | ROS_MASTER_URI=http://localhost:11311
16 | ROSLISP_PACKAGE_DIRECTORIES=
17 | ROS_DISTRO=indigo
18 | ROS_ETC_DIR=/opt/ros/indigo/etc/ros
19 | ```
20 |
21 | In my `.bashrc` I have:
22 |
23 | ```
24 | source /opt/ros/indigo/setup.bash
25 | alias fetch_mode='export ROS_MASTER_URI=http://fetch59.local:11311; export PS1="\[\033[41;1;37m\]\[\033[0m\]\w$ "'
26 | ```
27 |
28 | where `fetch_mode` came from the HSR tutorials.
29 |
30 | Another important note regarding rosbuild and catkin.
31 |
32 | > Note: Throughout the tutorials you will see references to rosbuild and catkin.
33 | > These are the two available methods for organizing and building your ROS code.
34 | > rosbuild is not recommended or maintained anymore but kept for legacy. catkin
35 | > is the recommended way to organise your code, it uses more standard CMake
36 | > conventions and provides more flexibility especially for people wanting to
37 | > integrate external code bases or who want to release their software. For a
38 | > full break down visit catkin or rosbuild.
39 |
40 | I followed their directions to make the appropriate directories for a catkin
41 | workspace. But sourcing the bash scripts didn't seem to have any noticeable
42 | effect. I thought it'd do a python virtualenv thing?
43 |
44 | Beyond the scope of this, but catkin stuff is here:
45 |
46 | http://wiki.ros.org/catkin/conceptual_overview
47 |
48 | - A build system specifically for ROS. Others are `GNU make` and `CMake`.
49 | - Source code is organized into "packages" which have targets to build.
50 | - For information on how to build, we need "configuration files." With catkin
51 | (extension of CMake) that's in `CMakeLists.txt`.
52 | - `catkin` is the newer tool we should use, not `rosbuild` (older).
53 |
54 |
55 | *********************************************
56 | * Tutorial 2: Navigating the ROS Filesystem *
57 | *********************************************
58 |
59 | Use `package.xml` to store information about a specific package, such as
60 | dependencies, maintainer, etc. Know `rospack`, `roscd`, etc. We can prepend
61 | `ros` to some common Unix commands, do tab completion, etc.
62 |
63 | ```
64 | daniel@daniel-ubuntu-mac:~$ rospack find roscpp
65 | /opt/ros/indigo/share/roscpp
66 | daniel@daniel-ubuntu-mac:~$ roscd roscpp
67 | daniel@daniel-ubuntu-mac:/opt/ros/indigo/share/roscpp$
68 | ```
69 |
70 |
71 | **************************************
72 | * Tutorial 3: Creating a ROS Package *
73 | **************************************
74 |
75 | Packages need: a manifest (package.xml) file, a catkin configuration file, and
76 | its own directory (easy). Since we already created `catkin_ws/src` earlier, put
77 | each of our custom packages as its own directory within `catkin_ws/src`.
78 |
79 | After running the package script, I have this within `~/catkin_ws/src`:
80 |
81 | ```
82 | CMakeLists.txt -> /opt/ros/indigo/share/catkin/cmake/toplevel.cmake
83 |
84 | beginner_tutorials/
85 | CMakeLists.txt
86 | include/
87 | beginner_tutorials/
88 | (empty)
89 | package.xml
90 | src/
91 | (empty)
92 | ```
93 |
94 | - Since the tutorial runs the script with `rospy`, `roscpp`, and `std_msgs`,
95 | those are listed as the package dependencies in `package.xml`.
96 |
97 | - When we run `catkin_make` over the entire workspace, it will say "traversing
98 | into beginner_tutorials".
99 |
100 | - First-order dependencies:
101 | ```
102 | ~/catkin_ws$ rospack depends1 beginner_tutorials
103 | roscpp
104 | rospy
105 | std_msgs
106 | ```
107 |
108 | - We can also list all the *indirect* dependencies.
109 |
110 | - Dependencies are in the following groups:
111 | > build_depend (the tutorial lists this; I have build_depend and build_export_depend)
112 | > buildtool_depend (I have this)
113 | > exec_depend (I have this)
114 | > test_depend (I don't see this)
115 | (Maybe they re-named `build_depend` and `build_export_depend`?)
116 |
117 | - `build_depend` for compilation, `exec_depend` for runtime
118 |
119 | - Make sure I customize `package.xml`!! It's mostly "meta-data" so should be
120 | easier than customizing `CMakeLists.txt`. See conventions online.
121 |
122 |
123 |
124 | **************************************
125 | * Tutorial 4: Building a ROS Package *
126 | **************************************
127 |
128 | This discusses `catkin_make` which we previously ran. Note that using
129 | `catkin_make` we can build *all* the packages in our workspace, at least in the
130 | `src/` directory (we can change the target directory). Here's what I have in
131 | `catkin_ws/`:
132 |
133 | ```
134 | build/
135 | beginner_tutorials/
136 | catkin/
137 | catkin_generated/
138 | CATKIN_IGNORE
139 | catkin_make.cache
140 | CMakeCache.txt
141 | CMakeFiles/
142 | cmake_install.cmake
143 | CTestTestfile.cmake
144 | gtest/
145 | Makefile
146 | test_results/
147 | devel/
148 | env.sh
149 | lib/
150 | setup.bash
151 | setup.sh
152 | _setup_util.py
153 | setup.zsh
154 | share/
155 | src/
156 | beginner_tutorials/
157 | CMakeLists.txt
158 | ```
159 |
160 | The `cmake` and `make` commands go to `build` when they need to build packages.
161 | The executables and libraries go in `devel` *before* installing packages.
162 |
163 | We'd also run `catkin_make install` but this seems to be optional.
164 |
165 | BTW, I now understand why there seem to be so many packages located in that
166 | directory on our dVRK machine. Unfortunately, we don't seem to be using it. I
167 | wonder if the HSR or YuMi computers have a similar file system.
168 |
169 |
170 |
171 | ***************************************
172 | * Tutorial 5: Understanding ROS Nodes *
173 | ***************************************
174 |
175 | - Nodes: A node is an executable that uses ROS to communicate with other nodes.
176 | - That's it. Use these to subscribe/publish to topics.
177 | - To communicate, use a "ROS client library" which is rospy or roscpp.
178 |
179 | - Messages: ROS data type used when subscribing or publishing to a topic.
180 | - E.g. "geometry_msgs/Twist". For publisher/subscriber nodes to communicate
181 | they need to send/accept the same message type.
182 |
183 | - Topics: Nodes can publish messages to a topic as well as subscribe to a topic
184 | to receive messages.
185 | - Communication depends on these _messages_.
186 |
187 | - Master: Name service for ROS (i.e. helps nodes find each other)
188 |
189 | - rosout: ROS equivalent of stdout/stderr
190 | - It runs by default from running `roscore` as it collects debug messages.
191 |
192 | - roscore: Master + rosout + parameter server (parameter server will be
193 | introduced later)
194 | - First thing we should run! Recall this is what we do for the dVRK.
195 |
196 | After `roscore`:
197 |
198 | ```
199 | ~/catkin_ws$ roscore
200 | ... logging to
201 | /home/daniel/.ros/log/4a2cd14e-32cf-11e8-9512-7831c1b89008/roslaunch-daniel-ubuntu-mac-4867.log
202 | Checking log directory for disk usage. This may take awhile.
203 | Press Ctrl-C to interrupt
204 | Done checking log file disk usage. Usage is <1GB.
205 |
206 | started roslaunch server http://daniel-ubuntu-mac:33999/
207 | ros_comm version 1.11.21
208 |
209 | SUMMARY
210 | ========
211 |
212 | PARAMETERS
213 | * /rosdistro: indigo
214 | * /rosversion: 1.11.21
215 |
216 | NODES
217 |
218 | auto-starting new master
219 | process[master]: started with pid [4879]
220 | ROS_MASTER_URI=http://daniel-ubuntu-mac:11311/
221 |
222 | setting /run_id to 4a2cd14e-32cf-11e8-9512-7831c1b89008
223 | process[rosout-1]: started with pid [4892]
224 | started core service [/rosout]
225 | ```
226 |
227 | So `/rosout` will be listed when running `rosnode list` in a separate tab. Keep
228 | `roscore` running throughout the time we use ROS!! Use `rosnode info` to see (1)
229 | publishers, (2) subscribers, and (3) services. Also note `PARAMETERS` which must
230 | mean the parameter server.
231 |
232 | Use `rosrun` to run packages along with certain nodes within packages. I ran
233 | `turtlesim` and yes we get a new node and can re-name if needed. There appear to
234 | be two node options for this, one for the turtle and another for teleoperation.
235 |
236 |
237 |
238 | ****************************************
239 | * Tutorial 6: Understanding ROS Topics *
240 | ****************************************
241 |
242 | We run the turtlesim via teleoperation, and it works.
243 |
244 | - Nodes `turtlesim_node` and `turtle_teleop_key` within the `turtlesim` package
245 | communicate to each other via a ROS topic.
246 | - Communication within such topics depends on sending ROS _messages_.
247 |
248 | - The teleop node *publishes* key commands, while the sim node *subscribes*.
249 |
250 | - Use `rqt_graph` for visualizing node dependencies. This is very useful!
251 |
252 | - Use `rqt_plot` to plot certain node values over time (e.g., the x-position
253 | of the turtle), but I don't think I'll be using this; I prefer matplotlib.
254 |
255 | Use `rostopic` to examine nodes. For instance, if I run this and then move the
256 | turtle forward, I get:
257 |
258 | ```
259 | ~/catkin_ws$ rostopic echo /turtle1/cmd_vel
260 | linear:
261 | x: 2.0
262 | y: 0.0
263 | z: 0.0
264 | angular:
265 | x: 0.0
266 | y: 0.0
267 | z: 0.0
268 | ---
269 | linear:
270 | x: 2.0
271 | y: 0.0
272 | z: 0.0
273 | angular:
274 | x: 0.0
275 | y: 0.0
276 | z: 0.0
277 | ---
278 | (and so on)
279 | ```
280 |
281 | so the up key must mean increasing in the turtle's x direction. We can get a
282 | full picture of the publisher/subscriber situation:
283 |
284 | ```
285 | ~/catkin_ws$ rostopic list -v
286 |
287 | Published topics:
288 | * /turtle1/color_sensor [turtlesim/Color] 1 publisher
289 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 publisher
290 | * /rosout [rosgraph_msgs/Log] 4 publishers
291 | * /rosout_agg [rosgraph_msgs/Log] 1 publisher
292 | * /turtle1/pose [turtlesim/Pose] 1 publisher
293 |
294 | Subscribed topics:
295 | * /turtle1/cmd_vel [geometry_msgs/Twist] 2 subscribers
296 | * /rosout [rosgraph_msgs/Log] 1 subscriber
297 | * /statistics [rosgraph_msgs/TopicStatistics] 1 subscriber
298 | ```
299 |
300 | The type of `/turtle1/cmd_vel` is `geometry_msgs/Twist`, as shown above. Looks
301 | like it lists topics followed by message (well, the _type_ of the message).
302 |
303 | Use `rostopic pub [...]` to publish something. In the turtle example, this might
304 | mean commanding the turtle's velocity.
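
For example, this is (if I copied it down right) the tutorial's command for
publishing a single velocity message:

    rostopic pub -1 /turtle1/cmd_vel geometry_msgs/Twist -- '[2.0, 0.0, 0.0]' '[0.0, 0.0, 1.8]'

The `-1` publishes one message and exits; the two arrays are the linear and
angular components of the Twist.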
305 |
306 | So, there's rostopic `pub`, `list`, `echo`, `type`, etc. Straightforward:
307 |
308 | rostopic bw display bandwidth used by topic
309 | rostopic echo print messages to screen
310 | rostopic hz display publishing rate of topic
311 | rostopic list print information about active topics
312 | rostopic pub publish data to topic
313 | rostopic type print topic type
314 |
315 | I don't really need `type` now as it's shown in `list` as seen above. The `hz`
316 | might be useful since (as I know with the dVRK) the camera images of the
317 | workspaces aren't updated instantaneously but with some delay, and that can
318 | affect policies which take the images as input.
319 |
320 |
321 |
322 | *********************************************************
323 | * Tutorial 7: Understanding ROS Services and Parameters *
324 | *********************************************************
325 |
326 | Recall that we can run `rosnode info /rosout` (or pass any other node as the
327 | argument) to get information about a node. That provides us with three things.
328 | We sort of understand publications and subscriptions, but now what about _services_?
329 |
330 | - Another way for nodes to communicate with each other.
331 | - Nodes send _requests_, receive _responses_. (Common sense, right?)
332 |
333 | Like `rostopic`, `rosservice` has lots of command options:
334 |
335 | rosservice list print information about active services
336 | rosservice call call the service with the provided args
337 | rosservice type print service type
338 | rosservice find find services by service type
339 | rosservice uri print service ROSRPC uri
340 |
341 | For example, I see this with `list`:
342 |
343 | ```
344 | :~/catkin_ws$ rosservice list
345 | /clear
346 | /kill
347 | /reset
348 | /rosout/get_loggers
349 | /rosout/set_logger_level
350 | /rostopic_8997_1522274470739/get_loggers
351 | /rostopic_8997_1522274470739/set_logger_level
352 | /rqt_gui_py_node_9061/get_loggers
353 | /rqt_gui_py_node_9061/set_logger_level
354 | /spawn
355 | /teleop_turtle/get_loggers
356 | /teleop_turtle/set_logger_level
357 | /turtle1/set_pen
358 | /turtle1/teleport_absolute
359 | /turtle1/teleport_relative
360 | /turtlesim/get_loggers
361 | /turtlesim/set_logger_level
362 | ```
363 |
364 | We can run `rosservice call /clear`, which calls one of the services in the
365 | list above (this one takes no arguments). We choose `/clear` so that the
366 | background is cleared (we no longer see the turtle's path). This is what I see
367 | from the window that originally started the `turtlesim` package:
368 |
369 | ```
370 | :~/catkin_ws$ rosrun turtlesim turtlesim_node
371 | [ INFO] [1522273700.220832117]: Starting turtlesim with node name /turtlesim
372 | [ INFO] [1522273700.228355538]: Spawning turtle [turtle1] at x=[5.544445], y=[5.544445], theta=[0.000000]
373 | [ WARN] [1522273804.373982014]: Oh no! I hit the wall! (Clamping from [x=7.155886, y=-0.008128])
374 | [ WARN] [1522273804.389975987]: Oh no! I hit the wall! (Clamping from [x=7.163082, y=-0.031181])
375 | (omitted...)
376 | [ WARN] [1522276335.861971290]: Oh no! I hit the wall! (Clamping from [x=9.302450, y=11.089913])
377 | [ WARN] [1522276335.877974885]: Oh no! I hit the wall! (Clamping from [x=9.334450, y=11.088992])
378 | [ INFO] [1522280291.029979359]: Clearing turtlesim.
379 | ```
380 |
381 | We can also use the `/spawn` service to, well, spawn another turtle.
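
From the tutorial (as far as I remember the argument order: x, y, theta, and an
optional name):

    rosservice call /spawn 2 2 0.2 ""

which returns the auto-generated name of the new turtle (turtle2).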
382 |
383 | We also have `rosparam`, which is the parameter analogue to `rosservice` for
384 | service, `rostopic` for topics, etc. We can list the parameters and adjust them,
385 | for instance by changing the background color. (However, it doesn't seem to
386 | actually change my color, even though I am clearly setting all the background
387 | colors to be 0 ... hmmm.)
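
(Hedged guess at what went wrong: if I remember the tutorial right, the
background only redraws after the `/clear` service is called, i.e.:

    rosparam set /background_r 150
    rosservice call /clear

so maybe I just skipped the `/clear` step.)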
388 |
389 | You can save current parameters for easy loading later.
390 |
391 |
392 |
393 | ***********************************************
394 | * Tutorial 8: Using rqt_console and roslaunch *
395 | ***********************************************
396 |
397 | rqt_console (not sure how useful)
398 |
399 | - Along with rqt_logger_level, lets us see a lot of information in GUIs.
400 | - If we ram the turtle in the wall, we can see the warning message.
401 | - Assuming that WARN is within the current "verbosity" level...
402 | - Logging prioritized with: Fatal, Error, Warn, Info, Debug.
403 |
404 | roslaunch (looks _very_ useful, call this each time we start using robots)
405 |
406 | - Note that `roscore` started a "roslaunch server".
407 | - Use this with a _launch_file_ to start nodes in a more scalable way.
408 | - `roslaunch [package] [filename.launch]`
409 | - `roslaunch gscam endoscope.launch`
410 | - Good practice, put in the package: `~/catkin_ws/src/[...]/launch/[...]`
411 | where the second [...] is the `.launch` file with tags.
412 |
413 | ```
414 | <launch>
415 |
416 |   <group ns="turtlesim1">
417 |     <node pkg="turtlesim" name="sim" type="turtlesim_node"/>
418 |   </group>
419 |
420 |   <group ns="turtlesim2">
421 |     <node pkg="turtlesim" name="sim" type="turtlesim_node"/>
422 |   </group>
423 |
424 |   <node pkg="turtlesim" name="mimic" type="mimic">
425 |     <remap from="input" to="turtlesim1/turtle1"/>
426 |     <remap from="output" to="turtlesim2/turtle1"/>
427 |   </node>
428 |
429 | </launch>
430 | ```
431 |
432 | - Above example makes two groups (different names to avoid conflicts), each of
433 | which use a `turtlesim_node` node from the `turtlesim` package.
434 |
435 | - Also makes a new node with type "mimic". So the `<node>` tag must
436 | obviously let one make a new node, which can be assigned to a group if it's
437 | nested within one. Causes the second turtle to mimic the first turtle!
438 |
439 | I see, when you run `roslaunch ...` we get this output:
440 |
441 | ```
442 | daniel@daniel-ubuntu-mac:~/catkin_ws/src/beginner_tutorials/launch$ roslaunch beginner_tutorials turtlemimic.launch
443 | ... logging to /home/daniel/.ros/log/42096978-3383-11e8-9614-7831c1b89008/roslaunch-daniel-ubuntu-mac-4922.log
444 | Checking log directory for disk usage. This may take awhile.
445 | Press Ctrl-C to interrupt
446 | Done checking log file disk usage. Usage is <1GB.
447 |
448 | started roslaunch server http://daniel-ubuntu-mac:43721/
449 |
450 | SUMMARY
451 | ========
452 |
453 | PARAMETERS
454 | * /rosdistro: indigo
455 | * /rosversion: 1.11.21
456 |
457 | NODES
458 | /
459 | mimic (turtlesim/mimic)
460 | /turtlesim1/
461 | sim (turtlesim/turtlesim_node)
462 | /turtlesim2/
463 | sim (turtlesim/turtlesim_node)
464 |
465 | auto-starting new master
466 | process[master]: started with pid [4934]
467 | ROS_MASTER_URI=http://localhost:11311
468 |
469 | setting /run_id to 42096978-3383-11e8-9614-7831c1b89008
470 | process[rosout-1]: started with pid [4947]
471 | started core service [/rosout]
472 | process[turtlesim1/sim-2]: started with pid [4950]
473 | process[turtlesim2/sim-3]: started with pid [4959]
474 | process[mimic-4]: started with pid [4966]
475 | ```
476 |
477 | so we get groups listed at the top level (turtlesim1, turtlesim2) along with the
478 | name of the node after it within the nested stuff.
479 |
480 | BTW: seems like roslaunch starts its own master server, so it is not necessary
481 | to have an existing "roscore" command in another tab. See "auto-starting new
482 | master" above and also:
483 |
484 | https://answers.ros.org/question/217107/does-a-roslaunch-start-roscore-when-needed/
485 |
486 | We can still get lots of relevant information:
487 |
488 | ```
489 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rosnode list
490 | /mimic
491 | /rosout
492 | /turtlesim1/sim
493 | /turtlesim2/sim
494 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rostopic list
495 | /rosout
496 | /rosout_agg
497 | /turtlesim1/turtle1/cmd_vel
498 | /turtlesim1/turtle1/color_sensor
499 | /turtlesim1/turtle1/pose
500 | /turtlesim2/turtle1/cmd_vel
501 | /turtlesim2/turtle1/color_sensor
502 | /turtlesim2/turtle1/pose
503 | ```
504 |
505 | Use `rqt_graph`, as discussed earlier, to understand the launch file.
506 |
507 |
508 |
509 | ************************************************
510 | * Tutorial 9: Using rosed to edit files in ROS *
511 | ************************************************
512 |
513 | A very short one: basically, use `rosed [package_name] [filename]` to edit
514 | files without having to type out their full paths. This would be useful for me
515 | since I got stuck on doing this in my early days of working with the dVRK.
516 | Fortunately it uses vim by default, so I should have no problem using it.
517 |
518 |
519 |
520 | *******************************************
521 | * Tutorial 10: Creating a ROS msg and srv *
522 | *******************************************
523 |
524 | - msg: simple text files that describe the fields of a ROS message. They
525 | are used to generate source code for messages in different languages.
526 | - srv: describes a service, composed of two parts: a request and a response.
527 |
528 | These have their own syntax rules. See tutorial for details. We put them in
529 | `msg` and `srv` directories, and then we must ensure our `package.xml` file will
530 | know to compile and run custom messages, and also change `CMakeLists.txt`.
531 | There's a lot to do for the latter; see tutorial for lines to un-comment.
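
For example, if I remember the tutorial right, `Num.msg` is just one field
(and the srv shown below illustrates the request/response split):

    int64 num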
532 |
533 | The tutorials use a simple `AddTwoInts` service. Details with `rossrv`:
534 |
535 | ```
536 | :~/catkin_ws/src/beginner_tutorials$ rossrv show AddTwoInts
537 | [beginner_tutorials/AddTwoInts]:
538 | int64 a
539 | int64 b
540 | ---
541 | int64 sum
542 |
543 | [rospy_tutorials/AddTwoInts]:
544 | int64 a
545 | int64 b
546 | ---
547 | int64 sum
548 | ```
549 |
550 | - It's located in two places, since this was created with `roscp`.
551 | - The actual _implementation_ of the "add two ints" is located elsewhere.
552 | - Run `catkin_make install` and watch it build successfully. Whew.
553 |
554 | The installation makes C++ (header), Lisp, and Python files. For example:
555 |
556 | /home/daniel/catkin_ws/install/lib/python2.7/dist-packages/beginner_tutorials/msg/_Num.py
557 |
558 | Again, this is _not_ the code implementation (how could it read my mind?) but an
559 | automatically generated file with some known, common methods. Not yet sure what
560 | its purpose is ...
561 |
562 |
563 |
564 | ****************************************************************
565 | * Tutorial 11: Writing a Simple Publisher and Subscriber (C++) *
566 | ****************************************************************
567 | (Skipping)
568 | *******************************************************************
569 | * Tutorial 12: Writing a Simple Publisher and Subscriber (Python) *
570 | *******************************************************************
571 |
572 | After downloading their `talker.py` script, I have this in the package:
573 |
574 | ```
575 | beginner_tutorials/
576 | CMakeLists.txt
577 | package.xml
578 | include/
579 | beginner_tutorials/
580 | launch/
581 | turtlemimic.launch
582 | msg/
583 | Num.msg
584 | scripts/
585 | talker.py
586 | src/
587 | srv/
588 | AddTwoInts.srv
589 | ```
590 |
591 | For the most part just read the tutorial, it goes line-by-line. Above, there is
592 | no node that "receives" the messages sent by the talker, so we write that. It
593 | uses a very simple message type:
594 |
595 | ```
596 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rosmsg show String
597 | [std_msgs/String]:
598 | string data
599 | ```
600 |
601 | with just a `data` argument to fill.
602 |
603 | For classes, look at:
604 |
605 | http://docs.ros.org/indigo/api/rospy/html/rospy.topics.Publisher-class.html
606 | http://docs.ros.org/indigo/api/rospy/html/rospy.topics.Subscriber-class.html
607 |
608 | They only have one method each, "publish" and "unregister", respectively.
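
The gist of the two scripts, roughly as in the tutorial (trimmed down a bit):

```python
#!/usr/bin/env python
# talker.py (trimmed): publishes std_msgs/String on the "chatter" topic.
import rospy
from std_msgs.msg import String

def talker():
    pub = rospy.Publisher('chatter', String, queue_size=10)
    rospy.init_node('talker', anonymous=True)
    rate = rospy.Rate(10)  # 10 Hz
    while not rospy.is_shutdown():
        pub.publish("hello world %s" % rospy.get_time())
        rate.sleep()

# listener.py (trimmed): subscribes to "chatter" and logs what it hears.
def callback(data):
    rospy.loginfo("I heard %s", data.data)

def listener():
    rospy.init_node('listener', anonymous=True)
    rospy.Subscriber('chatter', String, callback)
    rospy.spin()  # keep the node alive until shutdown
```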
609 |
610 |
611 |
612 | **************************************************************
613 | * Tutorial 13: Examining the Simple Publisher and Subscriber *
614 | **************************************************************
615 |
616 | This is really short. Just run the code and see what we get. Make sure `roscore`
617 | is running in a separate tab, though.
618 |
619 |
620 |
621 | **********************************************************
622 | * Tutorial 14: Writing a Simple Service and Client (C++) *
623 | **********************************************************
624 | (Skipping)
625 | *************************************************************
626 | * Tutorial 15: Writing a Simple Service and Client (Python) *
627 | *************************************************************
628 |
629 | Makes the "service" that actually performs the addition. (It's not clear to me
630 | yet why we need this kind of structure.) And then the client. Again, straight
631 | from the tutorial.
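
Roughly what the tutorial's server and client look like (trimmed):

```python
#!/usr/bin/env python
# Server: registers the add_two_ints service and blocks until shutdown.
import rospy
from beginner_tutorials.srv import AddTwoInts, AddTwoIntsResponse

def handle_add_two_ints(req):
    return AddTwoIntsResponse(req.a + req.b)

def add_two_ints_server():
    rospy.init_node('add_two_ints_server')
    rospy.Service('add_two_ints', AddTwoInts, handle_add_two_ints)
    rospy.spin()

# Client: blocks until the service exists, then calls it like a function.
def add_two_ints_client(x, y):
    rospy.wait_for_service('add_two_ints')
    add_two_ints = rospy.ServiceProxy('add_two_ints', AddTwoInts)
    return add_two_ints(x, y).sum
```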
632 |
633 |
634 |
635 | ********************************************************
636 | * Tutorial 16: Examining the Simple Service and Client *
637 | ********************************************************
638 |
639 | Yeah, I got it working.
640 |
641 |
642 |
643 | ************************************************
644 | * Tutorial 17: Recording and playing back data *
645 | ************************************************
646 |
647 | This is the rostopic status after starting this up:
648 |
649 | ```
650 | daniel@daniel-ubuntu-mac:~/catkin_ws/devel$ rostopic list -v
651 |
652 | Published topics:
653 | * /turtle1/color_sensor [turtlesim/Color] 1 publisher
654 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 publisher
655 | * /rosout [rosgraph_msgs/Log] 2 publishers
656 | * /rosout_agg [rosgraph_msgs/Log] 1 publisher
657 | * /turtle1/pose [turtlesim/Pose] 1 publisher
658 |
659 | Subscribed topics:
660 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 subscriber
661 | * /rosout [rosgraph_msgs/Log] 1 subscriber
662 | ```
663 |
664 | I get the rosbag which records the keypresses:
665 |
666 | ```
667 | daniel@daniel-ubuntu-mac:~/bagfiles$ ls -lh
668 | total 512K
669 | -rw-rw-r-- 1 daniel daniel 511K Mar 29 16:16 2018-03-29-16-15-19.bag
670 | daniel@daniel-ubuntu-mac:~/bagfiles$ vim 2018-03-29-16-15-19.bag
671 | daniel@daniel-ubuntu-mac:~/bagfiles$ rosbag info 2018-03-29-16-15-19.bag
672 | path: 2018-03-29-16-15-19.bag
673 | version: 2.0
674 | duration: 58.6s
675 | start: Mar 29 2018 16:15:19.26 (1522365319.26)
676 | end: Mar 29 2018 16:16:17.84 (1522365377.84)
677 | size: 510.9 KB
678 | messages: 7321
679 | compression: none [1/1 chunks]
680 | types: geometry_msgs/Twist [9f195f881246fdfa2798d1d3eebca84a]
681 | rosgraph_msgs/Log [acffd30cd6b6de30f120938c17c593fb]
682 | turtlesim/Color [353891e354491c51aabe32df673fb446]
683 | turtlesim/Pose [863b248d5016ca62ea2e895ae5265cf9]
684 | topics: /rosout 4 msgs : rosgraph_msgs/Log (2 connections)
685 | /turtle1/cmd_vel 21 msgs : geometry_msgs/Twist
686 | /turtle1/color_sensor 3648 msgs : turtlesim/Color
687 | /turtle1/pose 3648 msgs : turtlesim/Pose
688 | ```
689 |
690 | And I can replay my commands.
691 |
692 |
693 |
694 | ********************************************
695 | * Tutorial 18: Getting started with roswtf *
696 | ********************************************
697 |
698 | Yeah, this just checks whether anything in the ROS setup is misconfigured, and it looks like mine is OK.
699 |
700 |
701 |
702 | ****************************************
703 | * Tutorial 19: Navigating the ROS wiki *
704 | ****************************************
705 |
706 | Pretty simple, hopefully documentation won't be an issue.
707 |
708 |
709 |
710 | ****************************
711 | * Tutorial 20: Where Next? *
712 | ****************************
713 |
714 | Robotics work. :-) Look at our manuals, understand rviz, tf, and moveit.
715 |
--------------------------------------------------------------------------------
/CS61C_Berkeley/CS61C_Lectures.txt:
--------------------------------------------------------------------------------
1 | CS 61C Lecture Review
2 | Fall 2017 Semester
3 |
4 | **********************************
5 | * Lecture 1: Course Introduction *
6 | * Given: August 24, 2017 *
7 | **********************************
8 |
9 | Lecture is about four things, well, three that matter to me: (1) machine
10 | structures, (2) great ideas (in architecture), and (3) how everything is just a
11 | number.
12 |
13 |
14 | Machine Structures
15 |
16 | C is the most popular programming language, followed by Python. Use C to
17 | write software for speed/performance, e.g. embedded systems. EDIT: nope!
18 | That was in F-2016. Now in F-2017, Python has taken over, probably due to
19 | Deep Learning. But C is still in second place.
20 |
21 | This class isn't about C programming, but C is a VERY important language to
22 | know in order to understand the important stuff: the **hardware-software
23 | interface**. It's closer to the hardware than Java or Python.
24 |
25 | Things we'll learn on the software side:
26 | Parallel requests
27 | Parallel threads
28 | Parallel instructions
29 | Parallel data
30 | Hardware descriptions
31 |
32 | and the hardware side:
33 | Logic gates
34 | Main memory
35 | Cores
36 | Caches
37 | Instruction Units
38 |
39 | Looks like the "new version/face" of CS 61C is parallelism, as I should know
40 | from CS 267. Along with computers being on **mobile devices** and in many
41 | other areas, such as cars! So many things have computers and sensors in them
42 | nowadays, that it's mind-blowing.
43 |
44 |
45 | Great Ideas in Architecture
46 |
47 | Abstraction (Phil Guo's one-word description of CS)
48 |
49 | Anything can be represented as a number. But does this mean we WANT
50 | them to be like that? No, we want to program in a "high-level" like C
51 | so that we don't have to trudge through assembly language code.
52 |
53 | We follow this hierarchy:
54 | ==> C
55 | ==> compiler
56 | ==> assembly language (then machine language??)
57 | ==> machine interpretation (note, in F-2017 they're doing RISC-V,
58 | not MIPS, which I think was in S-2017 ...)
59 | ==> architecture implementation (the logic circuit diagram?)
60 | (I don't fully understand assembly/architecture parts)
61 |
62 | Moore's Law (is it still applicable?!?)
63 |
64 | Basic idea: every 2 years (sometimes I've seen it 1.5 years ...) the
65 | number of transistors per chip will double. Transistors are the basic
66 | source of computation in computers, they're the bits of electricity that
67 | turn into 0s and 1s. From Wikipedia:
68 | "A transistor is a semiconductor device used to amplify or switch
69 | electronic signals and electrical power. It is composed of
70 | semiconductor material usually with at least three terminals for
71 | connection to an external circuit. A voltage or current applied to
72 | one pair of the transistor's terminals controls the current through
73 | another pair of terminals. Because the controlled (output) power can
74 | be higher than the controlling (input) power, a transistor can
75 | amplify a signal",
76 | and
77 | "The transistor is the fundamental building block of modern
78 | electronic devices, and is ubiquitous in modern electronic systems."
79 |
80 | However, as one would imagine, if you try to pack more and more
81 | transistors in a smaller area, it will be exponentially more costly, and
82 | there will be issues with heat, as well as limits faced with the laws of
83 | physics.
84 |
85 | Update: the F-2017 edition (after the class break) brought up a graph
86 | from David Patterson's textbook, showing that serial processor
87 | performance was exponential up to the last decade, to which it
88 | flat-lined.
89 |
90 | - Thus, in the "glory days" you could write a program and expect newer
91 | hardware to just be faster. But not anymore. If we tried to cram
92 | things even further, we'd run into problems like quantum effects,
93 | where we don't know if things are really a 0 or a 1 anymore. Uh oh.
94 |
95 | - Now companies (e.g. Apple, Tesla, Samsung, Google, Microsoft) are not
96 | just buying general-purpose Intel chips, but building their own chips.
97 | So it's an exciting time to be a computer architect.
98 |
99 | Principles of Locality (memory hierarchy and caches!!)
100 |
101 | Jim Gray's storage latency analogy. I've seen this one before. It's
102 | really nice. Everyone has a nice joke to play about caches. Main thing
103 | to know is what is actually in the hierarchy:
104 | - Registers
105 | - On-chip cache
106 | - On-board cache
107 | - Main memory (i.e. RAM)
108 | - Hard disk
109 | - Tape and optical robot (not sure what this means)
110 | Also see the pyramid in the notes. It makes sense: the stuff "closer" to
111 | us in the hierarchy just listed above has to be smaller since there's
112 | less room. Thus, registers are cramped in a small space and are limited,
113 | but there's much more room for memory on the hard disk.
114 |
115 | It seems like we have three main caches: L1, L2, and L3. Not sure on the
116 | difference between on-chip vs on-board cache, though. That might be
117 | on-chip (as in on the CPU?) vs on the MOTHERboard. As I (finally!!) now
118 | know from experience, the CPU chip goes in the motherboard in a very
119 | specific spot.
120 |
121 | Parallelism (CS 267!!)
122 |
123 | This is another thing we should do if possible. We can "fork" calls into
124 | several "workers" and then "join" them together later. Professor Katz
125 | mentions the laundry example. He can use the wash. Then the dryer. But
126 | if he's using the dryer, there's no reason why someone can't use the
127 | wash. So this is like stacking things together in a tree-fashion, might
128 | be related to "tricks with trees" from CS 267.
129 |
130 | Also: we'll learn how to do thread programming, using fork() to
131 | split up computation into worker threads, and join() calls to
132 | combine the result.
133 |
134 | Caveat: Amdahl's law. It tries to predict speed-ups from parallelism.
135 | The law states the obvious: if there are parts of an application which
136 | cannot be parallelized, then we can't get "perfect" speedup, which
137 | hypothetically would be a 2x speedup if we had 2x parallelism.
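
For the record, the standard statement of the law (p is the parallelizable
fraction of the program, s is the speedup of that part):

    overall speedup = 1 / ((1 - p) + p / s)

E.g., if p = 0.5, even infinitely many workers cap the overall speedup at 2x.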
138 |
139 | Dependability via Redundancy (should be obvious!)
140 |
141 | The larger our system, the more likely we have individual components
142 | that fail. But when we program, we desperately want to make sure we can
143 | focus on debugging what WE wrote, and NOT the underlying hardware (oh
144 | God).
145 |
146 | Easiest thing to do: take majority vote, this helps to protect against
147 | faulty machines. Prof Katz: this seems silly and expensive, but useful
148 | if we have to send code in space or some other area where it's too
149 | expensive to send repairmen.
150 |
151 | Redundant memory bits as well; these are Error Correcting Codes (ECCs).
152 | Can also do calculations involving the parity of a number (odd vs even)
153 | so we have a spare piece of memory which corrects the expected parity as
154 | needed.
155 |
156 |
157 | Then we switched speakers to Prof. Krste Asanović.
158 |
159 | Higher-level stuff:
160 |
161 | Moore's Law, etc., showed a new paradigm for computer architecture. See
162 | my earlier comments on Moore's Law.
163 |
164 | Then Deep Learning. Yes, I knew it! That's why Deep Learning needs
165 | computer architects, because it's now the hardware and not the algorithm
166 | (After all, we're still doing backpropagation).
167 |
168 | Google has developed a "Tensor Processing Unit" (TPU), a specialized
169 | engine for NN training. Interesting ... I saw Jeff Dean talking about
170 | this recently in his AMA.
171 |
172 | Microsoft has developed "Project Brainwave". Gah, so many new
173 | developments.
174 |
175 | RISC-V Instruction Set Architecture (ISA)
176 |
177 | In F-2017, they are switching to this from MIPS, which was used in
178 | previous iterations of the course. It was designed at Berkeley for
179 | research and education.
180 |
181 | ISA = the language of the processor, or how software is encoded to run
182 | on hardware. Example: think about how an "add" instruction would be
183 | written in bits.
184 |
185 | Why are we using it if it's open source? Because the cool people are
186 | adopting it. Starting now, NVIDIA is using RISC-V in their GPUs. And the
187 | previous popular set, MIPS, is not doing so well; the company that owns
188 | it is apparently up for sale?
189 |
190 |
191 | (Then we switched back to Prof. Katz, and had some stuff about class
192 | administration. Yeah, I won't post any homeworks publicly, they'll be private.)
193 |
194 |
195 | Everything is Just a Number
196 |
197 | Computers represent data as binary values.
198 | - The *bit* is the unit element, either 0 or 1. We're not doing quantum
199 | computing in this class, so we _know_ for certain if a bit is zero or
200 | one.
201 | - Then *bytes* are eight bits, can represent 2^8 = 256 different values.
202 | - A "word" is 4 bytes (i.e. 32 bits), has 2^32 different values, like Java
203 | integers.
204 | - Then there are 64-bit floating point numbers (and 32-bit as well),
205 | numpy can express both though the Theano library encourages 32-bit.
206 | - All of these are built up into longer and more complicated expressions!
207 | - In F-2017, we'll learn how RISC-V encodes computer programs into bits.
208 |
209 | Be sure to MEMORIZE how to convert: (binary <==> decimal). This is so
210 | important to have down cold. I'm definitely intuitively better at going in
211 | the ==> direction, just write the number then underneath, going in REVERSE
212 | direction, do 2^0, 2^1, etc., then multiply by 1s and 0s and add up. Other
213 | direction: keep successively dividing by two (rounding down) and keep track
214 | of parities. Collect (not sum!) the results together at the end.
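
A tiny Python sketch of the decimal ==> binary direction, just to pin the
algorithm down (my own code):

    def to_binary(n):
        bits = []
        while n > 0:
            bits.append(n % 2)   # the parity at each step
            n //= 2              # successively divide by two, rounding down
        return bits[::-1]        # collect the results, in reverse

    to_binary(11)  # [1, 0, 1, 1], i.e. 1011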
215 |
216 | Unfortunately, there's also the hexadecimal notation. That's harder. Now
217 | there are 16 different units, not 2 or 10. It goes from 0 to 9 and then we
218 | note it as A=10, B=11, C=12, D=13, E=14, F=15. Obviously, I wrote the
219 | decimal numbers afterwards, could have easily done the binary version.
220 | - There are also octals, which are base 8 (digits 0 through 7).
221 | - I'll avoid using these whenever possible.
222 |
223 | Make sure to be consistent with putting down "two", "ten", or "hex" as
224 | subscripts after the numbers. It will make it easier to track which is
225 | which.
226 |
227 | How to use these numbers in C?
228 | Use %d for decimal (I know this now!)
229 | Use %x for hexadecimal
230 | Use %o for octal
231 | Might also have to write numbers with a 0x or 0b prefix to indicate
232 | which representation we're using.
233 |
234 | Beyond bytes, we have kilobytes, gigabytes, etc. Notice that marketing will
235 | assume we multiply by 1000, i.e. kilobytes are 1000 bytes. But in reality we
236 | "should" have 1024 bytes per kilobyte. Marketing can get away with not
237 | including that extra 24. Grrr. For the binary system, we use an extra "i",
238 | so it's KiByte, instead of KByte. And 1GB = 1000MB and 1GiB = 1024MiB.
239 | Watch out!
240 |
241 |
242 | **************************************
243 | * Lecture 2: Numbers and C Language *
244 | * Given: August 29, 2017 *
245 | **************************************
246 |
247 | Signed integer representation (Note: this material was originally in the first
248 | lecture in F-2016, but got bumped to the second lecture in F-2017 to make room
249 | for more discussion on why we need computer architects, and also Deep Learning.)
250 |
251 | We need to have negative numbers, so how to handle these?
252 |
253 | First attempt: first digit (well, leading digit, so leftmost) represents
254 | sign, remaining 7 (assuming 8 bits total) are for actual numerical
255 | content, "magnitude". But that's bad --- at least for integers --- since
256 | we have several special cases to consider, and our hardware performance
257 | will suffer.
258 |
259 | Better: two's complement. With 4 bits, have 16 total numbers:
260 | bits (as unsigned):   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
261 | two's complement:     0   1   2   3   4   5   6   7  -8  -7  -6  -5  -4  -3  -2  -1
262 |
263 | Thus, -3 in decimal maps to the same bits as unsigned 13 (1101). This keeps
264 | addition/subtraction rules for binary numbers consistent. Right, this is
265 | StackOverflow: "Two's complement is a clever way of storing integers so that
266 | common math problems are very simple to implement." In other words, the
267 | hardware doesn't have to make any special rules.
268 |
269 | But remember that these are just bits. Regardless of signed or unsigned,
270 | it's bits (four, in this case) that the hardware sees.
271 |
272 | A good analogy with alarm clocks in the lecture, particularly because my
273 | alarm clock requires me to keep incrementing the time before it "starts
274 | over" at the current value. Thus, 3+11=14 in unsigned, but this is
275 | 3-5=-2 in two's complement. Fortunately, the "adder" doesn't care, it
276 | just does the addition the same way, and we interpret it under the
277 | assumption that it's two's complement.
278 |
279 | It's not a "sign+magnitude" representation, because the second part
280 | isn't a "magnitude".
281 |
282 | How to do negation in two's complement: INVERT the bits, then add one.
283 | Don't forget to add one.
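
Quick sanity check in Python (my own; the & 0xF keeps it at 4 bits):

    def neg4(x):
        return (~x + 1) & 0xF  # invert the bits, then add one

    neg4(0b0011)  # 13, i.e. 0b1101, which is -3 in 4-bit two's complement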
284 |
285 | The most significant bit (MSB) also indicates the sign, as in our first
286 | representation, but without the drawback of painful math or the +0 and -0
287 | annoyance of the sign-and-magnitude representation.
288 |
289 | With two's complement, **if signs are different**, no overflow detection
290 | needed. This makes sense, you can't add a positive and a negative number and
291 | get something exceeding your range, that's like a shrinkage factor.
292 |
293 | Adding numbers of different bit widths:
294 | - Unsigned: simply pad zeros at the most significant bits.
295 | - Signed: **sign extension**, pad either all 0s or all 1s, depending on
296 | the current sign of the number.
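
A quick worked example: the 4-bit value 1101 (-3) sign-extends to 8 bits as
1111 1101 (still -3), whereas unsigned 1101 (13) zero-pads to 0000 1101.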
297 |
298 |
299 | Break / This is Not on the Exam
300 |
301 | Prof. Asanović talked about Google's TPU. :-) My God, it's so impressive. It
302 | has an **internal** matrix multiply unit. Ironically, it's useless for
303 | everything **except** for matrix multiplies. Then he talked about the IBM
304 | Mainframe.
305 |
306 |
307 | C Primer
308 |
309 | Remember, we're not giving a tutorial on C, the class is about the
310 | hardware/software interface.
311 |
312 | Bla bla bla hello world. Use printf("") for printing. Don't forget \n
313 | newlines!! Think of System.out.print("") in Java (not the println version).
314 | Also don't forget semicolons. And `#include <stdio.h>`. They use `int
315 | main(void)` whereas I use `int main()` but there's no difference in C++ and
316 | in C the difference is "questionable". I think it doesn't matter for what I
317 | would use. But use int main(void) instead, to clearly specify that the
318 | method doesn't take in any arguments (according to StackOverflow).
319 |
320 | Then compiling using `gcc program.c ; ./a.out`.
321 |
322 | Progression:
323 | [...].c --(compiler)--> [...].o --(linker)--> [a.out]
324 | From source (i.e. text) files to "machine code object files" (whatever those
325 | are) to actual executable files, which is what gets run. The linker pulls in
326 | library code if we're using it (e.g. the standard C library behind stdio.h;
327 | the header itself is handled by the pre-processor, not the linker). It also
328 | combines the .o files from all the [...].c files we wrote, since we should
    | split up our C code into several files to stay sane.
329 |
330 | There's *also* a "pre-processor" that runs before the compiler, which (1)
331 | converts comments to a single space and (2) handles the directives that
332 | start with #. Things like #include paste in the header text, and #define
333 | macros get expanded inline, so if I look at the intermediate file output
334 | from Hello World, it could be very long. But that's OK, it's how C works.
335 | :-)
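    |
    | (With gcc, running `gcc -E` on the hello-world file stops after the
    | pre-processor, so you can inspect that long intermediate output yourself.)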
336 |
337 | Different from interpreted languages, such as Python, which are run
338 | "line-by-line".
339 |
340 | More similar to Java, but Java converts to "byte code", which acts like an
341 | assembly language for the Java virtual machine.
342 |
343 | Advantages:
344 | - Faster. This is why numpy uses a C/C++ "back end"; more on that later
345 | once I better understand it.
346 | - Note that computers can only "run" machine code, or the lowest-level
347 | instructions that it can run. Everything else is one layer of
348 | abstraction upon abstraction. Compilation can get our C code to
349 | machine code in "one shot".
350 |
351 | Disadvantages:
352 | - Long time to compile.
353 | - Need tools like "make" to avoid compiling unchanged code. OK maybe
354 | this isn't a real disadvantage, since we should be using make by
355 | default.
356 | - Architecture- and operating-system-specific.
357 |
358 | C Type Declarations
359 |
360 | Examples:
361 | int a;
362 | float b;
363 | char c;
364 | Like Java, have to declare beforehand, and the type can't change.
365 | (Usually, floats are 32 bits and doubles are 64 bits.)
366 |
367 | Can do:
368 | float pi = 3.14; /* ok this is mathematically awful but w/e */
369 | But probably better to have it as a constant:
370 | const float pi = 3.14;
371 |
372 | For 'unsigned' stuff, just put that before the type, e.g. 'unsigned long'.
373 |
374 | Enumerations:
375 | typedef enum {red, green, blue} Color;
376 | We can then declare a variable of that type and `switch` on it:
377 | Color pants = green; /* to use one example ... */
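    |
    | A minimal sketch of that switch (mine, not the lecture's):
    |
    |     #include <stdio.h>
    |
    |     typedef enum {red, green, blue} Color;
    |
    |     int main(void) {
    |         Color pants = green;
    |         switch (pants) {
    |             case red:   printf("red\n");   break;
    |             case green: printf("green\n"); break;
    |             case blue:  printf("blue\n");  break;
    |         }
    |         return 0;
    |     }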
378 |
379 | AH, now it's clear: in Java we KNOW ints are 32 bits, but in C it could be
380 | 16, 32, or 64 bits. Though on my system it's 32, and I think that makes the
381 | most sense.
382 |     To check, print sizeof(int). I get '4', which is the size in BYTES,
383 |     so 4 bytes = 32 bits.
384 |
385 | No boolean data type! I learned this the hard way. (C++ has bool, and C99
386 | added _Bool via <stdbool.h>, but classic C has none.) 0 is false, anything
    | else is true (but by convention use 1 for true).
387 |
388 | Standard function definitions, like Java. But it looks like we don't need to
389 | use 'public...' or 'public static...'.
390 |
391 | Uninitialized variables: if you don't initialize them, they take on whatever
392 | value happens to be in memory, i.e. garbage. Their for-loop example prints
393 | different values of (uninitialized) x because another function messes around
394 | with the memory on the stack. I think if that weren't there, you would get
395 | the same "garbage" value for x. [Update: heh, a student asked the same
396 | question, but the Prof. said we should not rely on that. Which is fine, this
397 | was only a theoretical question.]
398 |
399 | structs:
400 | - Groups of variables
401 | - Like Java classes, but no methods
402 | - one-liner example syntax:
403 | typedef struct {int x, y;} Point;
404 |   - then to create one (full sketch below):
405 | Point p = { 77, -8 };
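    |
    | Putting the struct pieces together (my own sketch):
    |
    |     #include <stdio.h>
    |
    |     typedef struct {int x, y;} Point;
    |
    |     int main(void) {
    |         Point p = { 77, -8 };
    |         p.x = 3;                     /* fields accessed with '.' */
    |         printf("%d %d\n", p.x, p.y); /* prints: 3 -8 */
    |         return 0;
    |     }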
406 |
407 | Concluding Thoughts
408 |
409 | NO CLASSES in C! You need C++ for that, according to my own experience, and
410 | StackOverflow. For a while, C++ was known as "C with classes". But now it's
411 | just bloated. In C, simulate some class functionality by using structs.
412 | Thus, C shouldn't qualify as "object-oriented".
413 |
414 | The other main programmatic difference from Java (the first being no classes)
415 | is that in C we have explicit pointers. Let's discuss that in the next lecture.
416 |
417 | There are additional differences in the compilation, obviously.
418 |
419 |
420 | TODO BELOW ... (for F-2017)
421 |
422 | ****************************
423 | * Lecture 3: Pointers *
424 | * Given: September 1, 2016 *
425 | ****************************
426 |
427 | Pointers in C
428 |
429 | Processor vs Memory in computer, two different components.
430 | Former has registers, ALU, etc.
431 | Latter contains various bytes that form the programs, data, etc.
432 |
433 | Don't confuse memory address and a value. It's like humans are the 'values'
434 | living in their homes as 'memory addresses'. A POINTER is a MEMORY ADDRESS.
435 | When we say int a; then a = -85;, the memory address is some unknown
436 | integer and the value is -85.
437 |
438 | Know the differences:
439 |     int *x;     // x holds the address of an int
440 |     int y = 9;  // y is an int with value 9
441 |     x = &y;     // assigns the *address of* y (almost certainly not 9) to x
442 |     int z = *x; // assigns the *value pointed to by* x (here, 9) to z
443 |     *x = -7;    // assigns -7 to whatever x points at (namely, y)
444 |
445 | Interesting, I get x=1505581164 y=-7 z=9 as the printf output. When we store
446 | the address of y in x and then modify what x points at, that *also* modifies
447 | the value of y itself (they're the same memory). Interesting ... and a bit
448 | of a pain to track.
449 |
450 | Another thing: the type of x is 'int*', NOT 'int'. Watch out! It might be
451 | helpful to visualize this the way CS 61C does with its charts. Can write
452 | int* pi; or int *pi;; the class seems to use the latter. The latter is less
453 | ambiguous: compare char *a, *b; with char* a, b; -- in the second, that 'b'
454 | is NOT a pointer to a char, just a char.
455 |
456 | Use generic pointers (void *) for applications such as allocating or freeing
457 | memory, where the code may need to point to arbitrary stuff.
458 |
459 | Have pointers to structs as well, which is where we get the arrow syntax
460 | "->" that I've seen before.
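    |
    | For instance (my own sketch, reusing the Point struct from last lecture):
    |
    |     #include <stdio.h>
    |
    |     typedef struct {int x, y;} Point;
    |
    |     int main(void) {
    |         Point p = { 1, 2 };
    |         Point *pp = &p;
    |         pp->x = 5;           /* same as (*pp).x = 5 */
    |         printf("%d\n", p.x); /* prints: 5 */
    |         return 0;
    |     }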
461 |
462 | Another trick: *(&a) == a, I believe (take the address, then dereference it).
463 |
464 | One thing: if we do '*pa = 5', this is NOT assigning to 'pa' but rather to
465 | '*pa', the thing pa points at. It rarely makes sense to assign directly to
466 | 'pa' unless we actually know a memory address. Do we really want to gamble
467 | that '5' is a valid _memory_address_ rather than a _value_?
468 |
469 | Functions
470 |     These have pointers too. For arguments:
471 |         void foo(int x, int *p) { ... }
472 |     To call it, use:
473 |         foo(a, &b);
474 |     where a and b are both ints. C is always pass-by-value, but passing the
475 |     pointer's value effectively "passes b by reference". So it's like Java
476 |     with object references. There are a ton of blogs about this online.
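    |
    |     The classic illustration (my sketch, not from the lecture) is swap:
    |
    |         #include <stdio.h>
    |
    |         /* the caller's variables are modified through the pointers */
    |         void swap(int *a, int *b) {
    |             int tmp = *a;
    |             *a = *b;
    |             *b = tmp;
    |         }
    |
    |         int main(void) {
    |             int x = 1, y = 2;
    |             swap(&x, &y);
    |             printf("%d %d\n", x, y); /* prints: 2 1 */
    |             return 0;
    |         }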
477 |
478 | PS: I really like their four-column table approach, really helps
479 |
480 | Arrays in C (syntactic sugar for pointers, really)
481 |
482 | Several ways to declare basic arrays:
483 |     int a[5];          // five-integer array, but contents are garbage
484 |     int b[] = {1,2,3}; // explicitly assign elements, not garbage =)
485 |
486 | In the memory diagram: an array forms a contiguous block of memory, with
487 | index 0 at the bottom and indices increasing as we proceed up.
488 |
489 | #1 way we can shoot ourselves in the foot: no array bounds checking.
490 | So remember array sizes, e.g. by using:
491 | const int ARRAY_SIZE = 10;
492 | and then using that ARRAY_SIZE throughout the program. Don't repeat
493 | yourself!
494 |
495 | Helpful to also use sizeof() operator to get number of bytes. I use this
496 | frequently. But we can't assume anything about the hardware, other than
497 | sizeof(char) == 1. Don't assume: use sizeof(...) instead!
498 |
499 | Pointer Arithmetic
500 |
501 | PS: computers use byte addresses, so think of the memory for an int as
502 | taking up four slots, because (at least in one example and on my machine) C
503 | ints are 4 bytes.
504 |
505 | I see, we can do stuff like:
506 |     char c[] = {'a','b'};
507 |     char *pc = c; // from webcast, also same as &(c[0])
508 | so pc now has type char*, and *pc == 'a'. If we do *pc++; then *pc == 'b':
509 | the POINTER is incremented (postfix ++ binds to pc, not to *pc), not the
510 | value pointed to. Yeah, it's confusing; here we really want the address.
511 |
512 | The array name acts like a pointer to the 0th element of the array.
513 |     char *pstr;
514 |     char astr[];
515 | are nearly identical, except we can do pstr++ while we can't do astr++.
516 | ALSO: astr[2] == *(astr+2).
517 |
518 | OH I see, when we do pc++ the compiler actually adds sizeof(...) and takes
519 | care of that logic for us; it doesn't really "add one". Thanks!
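    |
    | A small sketch of my own to make that visible (addresses will vary):
    |
    |     #include <stdio.h>
    |
    |     int main(void) {
    |         int a[3] = {10, 20, 30};
    |         int *p = a;
    |         /* the two addresses differ by sizeof(int), typically 4 */
    |         printf("%p %p\n", (void *)p, (void *)(p + 1));
    |         printf("%d\n", *(p + 2)); /* prints: 30, same as a[2] */
    |         return 0;
    |     }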
520 |
521 | Bad style to interchange arrays and pointers.
522 |
523 | For functions, you can declare the array parameter in either of these ways:
524 |     void foo(int array[], unsigned int size);
525 |     void foo(int *array, unsigned int size);
526 |
527 | Be careful when doing sizeof(a) with 'a' an array: if the array was passed
528 | as a function parameter, it has decayed to a pointer, which is usually 8
529 | bytes on modern 64-bit machines. But on the original array variable, e.g.
530 | int a[10], sizeof(a) actually gives 10*sizeof(int). Weird, but consistent.
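    |
    | A sketch of my own showing the difference (sizes assume a 64-bit machine):
    |
    |     #include <stdio.h>
    |
    |     void foo(int arr[]) {
    |         /* 'arr' has decayed to int*, so this is pointer size (often 8) */
    |         printf("%zu\n", sizeof(arr));
    |     }
    |
    |     int main(void) {
    |         int a[10];
    |         printf("%zu\n", sizeof(a)); /* whole array: 10 * sizeof(int) */
    |         foo(a);
    |         return 0;
    |     }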
531 |
532 | These make no sense and are also illegal; don't do the following:
533 | - Add two pointers
534 | - Multiply two pointers
535 | - Subtract a pointer from an integer
536 | We CAN, however, compare pointers to NULL, for instance (in C it's the
537 | all-caps macro NULL, not a keyword).
538 |
539 | Pointers to pointers also exist. Oh no.
540 |
541 | Strings and Main
542 |
543 | C strings are "null-terminated character arrays":
544 |     char s[] = "abc"; /* double quotes; stored as 'a','b','c','\0' */
545 | To find the length, iterate through the string and increment an index,
546 | detecting the end of the string with the null character '\0'.
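    |
    | A minimal sketch of that loop (my_strlen is my own hypothetical name):
    |
    |     #include <stdio.h>
    |
    |     /* count characters until the '\0' terminator */
    |     int my_strlen(const char *s) {
    |         int n = 0;
    |         while (s[n] != '\0')
    |             n++;
    |         return n;
    |     }
    |
    |     int main(void) {
    |         char s[] = "abc";
    |         printf("%d\n", my_strlen(s)); /* prints: 3 */
    |         return 0;
    |     }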
547 |
548 | Don't forget the alternative way of writing main() with arguments:
549 |     int main(int argc, char *argv[]) {...}
550 | argv is a POINTER (of type char **) to an array of char* strings (the
551 | arguments from the command line). The argc is simply the number of
552 | arguments.
553 |
554 | When we run ./a.out, the './a.out' part is argv[0], other arguments
555 | after that go in later components, in order. It's similar to Python.
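    |
    |     A sketch of my own to see this (compile, then run ./a.out foo bar):
    |
    |         #include <stdio.h>
    |
    |         int main(int argc, char *argv[]) {
    |             for (int i = 0; i < argc; i++)
    |                 printf("argv[%d] = %s\n", i, argv[i]);
    |             return 0;
    |         }
    |
    |     It prints argv[0] = ./a.out, argv[1] = foo, argv[2] = bar.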
556 |
557 | Concluding Remarks
558 |
559 | Pointers are the same as (machine) memory addresses.
560 | Except for void*, pointers know the type (and hence size) of the objects
561 | they point to. (Relatedly, sizeof(a) for 'int a[10]' is known because the
    | array's length is part of its type at compile time.)
562 | Pointers are powerful, but dangerous without careful planning.
563 |
564 |
565 | ********************************
566 | * Lecture 4: Memory Management *
567 | * Given: September 6, 2016 *
568 | ********************************
569 |
570 | TODO
571 |
--------------------------------------------------------------------------------