├── .gitignore
├── Differential_Equations
│   └── README.md
├── Functional_Programming
│   ├── Other_Notes
│   │   └── sbt_and_eclipse.txt
│   ├── README.md
│   └── week1
│       └── week1_notes.txt
├── Math_104_Berkeley
│   ├── README.md
│   └── kenneth_ross_notes.txt
├── Deep_Learning
│   ├── README.md
│   ├── dlbook_chapter06notes.txt
│   ├── dlbook_chapter02notes.txt
│   ├── dlbook_chapter20notes.txt
│   ├── dlbook_chapter17notes.txt
│   ├── dlbook_chapter03notes.txt
│   ├── dlbook_chapter09notes.txt
│   ├── dlbook_chapter04notes.txt
│   ├── dlbook_chapter08notes.txt
│   ├── dlbook_chapter16notes.txt
│   ├── dlbook_chapter14notes.txt
│   ├── dlbook_chapter11notes.txt
│   ├── dlbook_chapter07notes.txt
│   ├── dlbook_chapter12notes.txt
│   ├── dlbook_chapter05notes.txt
│   └── dlbook_chapter10notes.txt
├── How_People_Learn
│   ├── README.md
│   ├── Part_04_Future_Directions.txt
│   ├── Part_01_Intro.txt
│   ├── Part_03_Teachers_and_Teaching.txt
│   └── Part_02_Learners_and_Learning.txt
├── Random
│   ├── Ray_Notes.txt
│   └── AWS_Notes.txt
├── README.md
├── CS61C_Berkeley
│   ├── README.md
│   └── CS61C_Lectures.txt
└── Robots_and_Robotic_Manip
    ├── dVRK.text
    ├── Modern_Robotics_Mech_Plan_Control.txt
    ├── Fetch.text
    ├── HSR.text
    ├── Mathematical_Introduction_Robotic_Manipulation.txt
    └── ROS.text
/.gitignore:
--------------------------------------------------------------------------------
1 | *.swp
2 | *.DS_Store
3 |
--------------------------------------------------------------------------------
/Differential_Equations/README.md:
--------------------------------------------------------------------------------
1 | # Differential Equations
2 |
3 | ...
4 |
--------------------------------------------------------------------------------
/Functional_Programming/Other_Notes/sbt_and_eclipse.txt:
--------------------------------------------------------------------------------
1 | Wow, learning how to use this stuff is really annoying. =(
2 |
--------------------------------------------------------------------------------
/Math_104_Berkeley/README.md:
--------------------------------------------------------------------------------
1 | This is a real analysis review.
2 |
3 | Fortunately, the textbook is supposed to be easy to read. It is also freely
4 | available online.
5 |
--------------------------------------------------------------------------------
/Deep_Learning/README.md:
--------------------------------------------------------------------------------
1 | I'm reading the Deep Learning book by Goodfellow et al.
2 |
3 | TODOs:
4 |
5 | - Chapter 13
6 | - Chapter 15
7 | - Chapter 18
8 | - Chapter 19
9 | - Chapter 20 (all of it!)
10 |
11 |
--------------------------------------------------------------------------------
/How_People_Learn/README.md:
--------------------------------------------------------------------------------
1 | # How People Learn: Brain, Mind, Experience, and School: Expanded Edition
2 |
3 | From National Academies Press. Looks like it was published in 2000, so I wonder
4 | how much of it is up to date ...
5 |
--------------------------------------------------------------------------------
/Random/Ray_Notes.txt:
--------------------------------------------------------------------------------
1 | I'm trying to learn how to use Ray. See:
2 |
3 | https://rise.cs.berkeley.edu/projects/ray/
4 |
5 | for an overview of the project. (Unfortunately, it's hard to do a Google search
6 | on that, but I will manage.)
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Self_Study_Courses
2 |
3 | These will be public notes for courses that I'm self-studying.
4 |
5 | Current TODO list:
6 |
7 | - Finish Goodfellow et al
8 | - Finish CS 61C self-studying
9 | - Study robotic manipulation
10 |
--------------------------------------------------------------------------------
/How_People_Learn/Part_04_Future_Directions.txt:
--------------------------------------------------------------------------------
1 | Chapter 10: Conclusions
2 | Chapter 11: Next Research Steps
3 |
4 | Mostly, these two chapters wrap up the book. I'm most interested in how
5 | humans/children learn, not so much about practical public policy or how to use
6 | technology.
7 |
8 | The first parts of Chapter 10 would be good to review periodically.
9 |
--------------------------------------------------------------------------------
/CS61C_Berkeley/README.md:
--------------------------------------------------------------------------------
1 | Doing this because I (a) need to review computer architecture and (b) want practice with the C language.
2 |
3 | Relevant links:
4 |
5 | - https://github.com/61c-teach
6 | - https://cs61c.org/
7 | - https://cs61c.org/resources/exams
8 |
9 | Looks like Berkeley changed to this format recently. Some of the courses have webcasts, though they might not all be public.
10 |
--------------------------------------------------------------------------------
/Functional_Programming/README.md:
--------------------------------------------------------------------------------
1 | This is the Coursera course on Functional Programming, taught by the person who
2 | created the Scala Programming Language. =)
3 |
4 | Link to course: [click here][1]
5 |
6 | It says it's from January 30 to March 9; the year isn't stated but I assume it's
7 | 2017, which means this could be the first Coursera course that I actually follow
8 | from start to finish in time. I hope.
9 |
10 | [1]:https://www.coursera.org/learn/progfun1/
11 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/dVRK.text:
--------------------------------------------------------------------------------
1 | How to use the dVRK in the context of ROS. Reading the ROS tutorials helped
2 | clarify why ROS can auto-complete and refer to files elsewhere on the
3 | computer: the ROS path points to those directories. Also, the dVRK
4 | launch files involve `.xml` files similar to those shown in the tutorials. Use
5 | `rosed` to edit without having to search for a path.
6 |
7 | Focus on the basic skeleton. How do we start?
8 |
9 |
10 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter06notes.txt:
--------------------------------------------------------------------------------
1 | *************************************************
2 | * NOTES ON CHAPTER 6: Deep Feedforward Networks *
3 | *************************************************
4 |
5 | This chapter *should* be review for me. Read through, but don't get bogged
6 | down too much in backpropagation. By the way, these technically include
7 | convolutional nets, but we don't cover those in detail until Chapter 9.
8 |
9 | The first part (Section 6.1) starts off with the classic example of linear
10 | models failing to solve an XOR, but a simple ReLU two-layer network can do it.
11 |
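As a sanity check, here is a minimal numpy sketch of that construction, with
the hand-picked weights from Section 6.1 of the book (the code itself is mine):

    import numpy as np

    # Hand-picked weights from the book's XOR construction (Section 6.1).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])              # first-layer weights
    b1 = np.array([0.0, -1.0])               # first-layer biases
    w2 = np.array([1.0, -2.0])               # second-layer weights

    def xor_net(x):
        h = np.maximum(0.0, W1.T @ x + b1)   # ReLU hidden layer
        return w2 @ h                        # linear output layer

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(np.array(x, float)))   # prints 0, 1, 1, 0
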
12 | Most neural networks are trained with maximum likelihood, so the cost
13 | function is the negative log-likelihood:
14 |
15 | J(\theta) = - E_{x,y} [log p_\theta(y|x)]
16 |
17 | This is **equivalently** described as the cross entropy between the model
18 | distribution and the data distribution. Interesting.
19 |
20 | There's some stuff about the cross entropy and viewing the neural network as a
21 | functional. I should review these later if I have time. BTW, they say that cross
22 | entropy is preferable to MAE or MSE, due to getting better gradient signals
23 | (Section 6.2.1).
24 |
25 | Section 6.3 is about the choice of hidden units. I'm skimming this.
26 |
27 | Section 6.5 is about backpropagation. I'm skimming this. It's looong.
28 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Modern_Robotics_Mech_Plan_Control.txt:
--------------------------------------------------------------------------------
1 | Notes on the textbook:
2 |
3 | Modern Robotics: Mechanics, Planning, and Control, 2017
4 | Kevin M. Lynch and Frank C. Park
5 |
6 | Homepage: http://hades.mech.northwestern.edu/index.php/Modern_Robotics
7 |
8 | It looks very similar to Murray, Li, and Sastry's book.
9 |
10 | **********************
11 | * Chapter 1: Preview *
12 | **********************
13 |
14 | One way of categorizing robots:
15 |
16 | - Open chain: all joints are "actuated," i.e., we can move them. Example:
17 | most industrial robotic arm manipulators.
18 | - Closed chain: only some joints are "actuated." Example: Stewart-Gough
19 | Platform (!!)
20 |
21 | The following joints have one degree of freedom, for rotation and translation,
22 | respectively.
23 |
24 | - Revolute joints: these allow for rotation about the joint axis.
25 | - Prismatic joints: these allow for linear translation along the joint axis.
26 |
27 | Use "Degrees of Freedom" to specify the number of "actuated joints." However, a
28 | (potentially better) sense of DoF is the notion of **configuration spaces**:
29 |
30 | > A more abstract but equivalent definition of the degrees of freedom of a robot
31 | > begins with the notion of its configuration space: a robot's configuration is
32 | > a complete specification of the positions and orientations of each link of a
33 | > robot, and its configuration space is the set of all possible configurations
34 | > of the robot.
35 |
36 |
37 | **********************************
38 | * Chapter 2: Configuration Space *
39 | **********************************
40 |
41 | TODO
42 |
43 |
44 | *********************************
45 | * Chapter 3: Rigid Body Motions *
46 | *********************************
47 |
48 | TODO
49 |
50 |
51 | *********************************
52 | * Chapter 4: Forward Kinematics *
53 | *********************************
54 |
55 | Studies the problem of: given a set of input joint values, find the output
56 | position and orientation of the reference frame attached to the end-effector.
57 | This is easily done for an open-chain robot, and the default solution is the
58 | "Product of Exponentials" (PoE) formula.
59 |
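For my own reference, the space-frame PoE formula (standard form; I'm writing
it from memory, so double-check it against the book):

    T(theta) = e^{[S_1] theta_1} e^{[S_2] theta_2} ... e^{[S_n] theta_n} M

where M is the end-effector configuration at the zero (home) position and the
[S_i] are the matrix forms of the joints' screw axes in the fixed frame.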
--------------------------------------------------------------------------------
/How_People_Learn/Part_01_Intro.txt:
--------------------------------------------------------------------------------
1 | Part 1: Introduction
2 |
3 |
4 | Chapter 1: Learning: From Speculation to Science
5 |
6 |
7 | Very important:
8 |
9 | - We need to stop teaching and testing based on factual knowledge, because the
10 | amount of facts to know is beyond what any one person can handle. The focus of
11 | teaching should be more on learning how to acquire and synthesize facts to
12 | "pick things up" quickly, so to speak. That's not to say facts are
13 | unimportant. It's just that the bigger priority should be understanding the
14 | connections among the facts so that it is easier to transfer and generalize to
15 | novel scenarios. Experts are very good at synthesizing, connecting, and
16 | efficiently organizing their reservoirs of knowledge.
17 |
18 | - Students start with lots of prior knowledge and are not simply "empty vessels"
19 | that teachers fill with knowledge. It's necessary to check whether
20 | their prior knowledge is inhibiting or misleading them when learning about
21 | various concepts. Classic scenario: *Fish Is Fish*, where a fish asks an
22 | amphibian what land-based animals are like, but simply imagines them as fish
23 | with legs, fish with udders, etc. Another example: teaching students the Earth
24 | is round when they think it's flat.
25 |
26 | Also important:
27 |
28 | - There should be a focus on improving students' understanding of their own
29 | ability. They should be able to tell when they need help. The ability to
30 | predict one's performance on a task is called "metacognition" (see Chapters 2
31 | and 3).
32 |
33 | - Don't do shallow coverage of every possible topic within reach; instead,
34 | reduce the number of topics and go through a few in depth to practice deeper
35 | understanding.
36 |
37 | - And a bunch of more mundane, practical stuff: need to change incentives of
38 | teaching and standardized tests so that it's not fact-based yet is still fair,
39 | need to do the same for adult teaching, etc.
40 |
41 | - Don't just focus on the best talent, need to work for lots of students. Well,
42 | it is important to develop top students more than we do in the US, but it's
43 | also clear that we need to broaden the population who have access to quality
44 | education.
45 |
46 | Stuff I forgot to record after a first pass:
47 |
48 | - Don't ask which teaching technique is best because that's like asking which
49 | tool is best: it depends on the task and materials at hand.
50 |
51 | - Don't forget all those hours students spend _outside_ of school. There are so
52 | many overlooked opportunities there. I should know, from personal experience.
53 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter02notes.txt:
--------------------------------------------------------------------------------
1 | **************************************
2 | * NOTES ON CHAPTER 2: Linear Algebra *
3 | **************************************
4 |
5 | This chapter was pure review for me, but some highlights and insights:
6 |
7 | - They talk about tensors but I'm kind of familiar with them already, mostly
8 | when I have to deal with numpy arrays that have at least three coordinate
9 | dimensions (or four, in some deep learning applications with images).
10 |
11 | - Columns of A can be thought of as different directions we're spanning out of
12 | the origin, and the components of x (as in the matrix-vector product Ax)
13 | indicate how far we move in those directions.
14 |
15 | - We say "orthogonal" matrices, but there's no terminology for matrices whose
16 | columns and/or rows are mutually orthogonal, but *not* orthonormal.
17 |
18 | - Don't forget **eigendecompositions**! They're very important. Interesting
19 | intuition:
20 |
21 | > [...] we can also decompose matrices in ways that show us information about
22 | > their functional properties that is not obvious from the representation of
23 | > the matrix as an array of elements.
24 |
25 | Eigendecomposition of matrix: A = V * diag(eig-vals) * V^{-1}, where V
26 | has columns which correspond to (right) eigenvectors of A.
27 |
28 | Not every matrix can be decomposed this way, but we're usually concerned with
29 | real symmetric A. In fact, in that case we can say even more: we can construct
30 | an *orthogonal* V so our V^{-1} turns into the easier-to-deal-with V^T matrix.
31 |
32 | - An alternative, and more generally applicable decomposition, is the SVD. (Why
33 | is it more general? Well, every real matrix has an SVD, including non-square
34 | ones, but non-square matrices have undefined eigendecompositions.) In their
35 | formulation, the inner matrix of singular values is rectangular in general
36 | (other books/references have *square* matrices, but the definitions are
37 | essentially equivalent).
38 |
39 | - Moore-Penrose pseudoinverse helps us (sometimes) solve linear equations for
40 | non-square matrices, in which case the "normal" inverse cannot be defined. Use
41 | the formula A^+ = V * D^+ * U^T for the pseudoinverse. When A is a fat matrix,
42 | the solution x = A^+ * y provides us with the minimum Euclidean norm solution
43 | (I must have forgotten this fact).
44 |
45 | - For the trace, don't forget about the **cyclic property**!!!
46 |
47 | - The chapter concludes with an example of **Principal Components Analysis**,
48 | i.e. how to apply lossy compression to a set of data points while losing as
49 | little information as possible. By "compression" we refer to shrinking points
50 | from R^m into R^n where n < m. This is necessarily lossy. To optimally encode
51 | a vector, use f(x) = D^Tx, which we determined from L2 norm minimization. The
52 | decoder is g(c) = Dc = DD^Tx which reconstructs an approximated version of the
53 | input from the compression. Then the next (and final) step is to find D. They
54 | do this by also using an L2 minimization. They provide some nice tips on how
55 | to write out optimization problems nicely and compactly. This is again review
56 | for me.
57 |
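To make that PCA part concrete, here's a tiny numpy sketch of the
encode/decode pair (my own toy example, not the book's):

    import numpy as np

    # PCA as lossy compression: encode with f(x) = D^T x, decode with
    # g(c) = D c, where D's columns are the top-n right-singular vectors
    # of the centered data (equivalently, top eigenvectors of X^T X).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))        # 100 points in R^5
    X = X - X.mean(axis=0)               # center the data first
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    D = Vt[:2].T                         # (5, 2): compress R^5 -> R^2

    codes = X @ D                        # encode: rows are f(x) = D^T x
    X_rec = codes @ D.T                  # decode: g(c) = D c
    print(np.linalg.norm(X - X_rec))     # nonzero: compression is lossy
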
58 | Well, I'm pleased with this chapter. =) I should expand upon some of these
59 | concepts in personal blog posts, particularly that last part (the proof by
60 | induction).
61 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter20notes.txt:
--------------------------------------------------------------------------------
1 | ***********************************************
2 | * NOTES ON CHAPTER 20: Deep Generative Models *
3 | ***********************************************
4 |
5 | This is a **long** chapter, and likely contains most of the stuff at the
6 | research frontiers, at least those that interest the authors (Generative
7 | Adversarial Networks lol).
8 |
9 |
10 | Section 20.10: Directed Generative Nets
11 |
12 | Both VAEs and GANs are part of this section, which refers to using directed
13 | graphical models to "generate" something, or basically mirror a probability
14 | distribution. The first two sections, "Sigmoid Belief Nets" and "Differentiable
15 | Generator Nets" seem markedly less important, though the latter at least makes
16 | the point that a generator should be differentiable. It also makes the important
17 | distinction between a generator directly generating samples x, OR generating a
18 | DISTRIBUTION, which we then sample from for x. If we directly generate discrete
19 | values, the generator is not differentiable, FYI.
20 |
21 |
22 | Section 20.10.3: Variational Autoencoders
23 |
24 | - Trained purely with gradient methods.
25 |
26 | - To *generate* a sample, need to first sample a code z which has relevant
27 | latent factors, and then run through a generator ("decoder") network which
28 | will give us a mean vector (or maybe a second output with the covariance). We
29 | then sample from that Gaussian. Yes, this makes sense. Generating z may just
30 | be done with our prior.
31 |
32 | - Ah, but during training, we have to make use of our *encoder* network, since
33 | otherwise the generator/decoder wouldn't work well. The encoder network's job
34 | is to produce a useful z.
35 |
36 | - Training is done by maximizing that variational lower bound for each data x:
37 |
38 | L(q) <= log p_model(x)
39 |
40 | where q is the distribution of the encoder network. Essentially, the encoder
41 | network approximates an intractable integral!
42 |
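  (For my own reference, the standard way to unpack that bound, writing from
  memory rather than verbatim from the book:

    L(q) = E_{z ~ q(z|x)} [ log p_model(x|z) ] - D_KL( q(z|x) || p_model(z) )

  i.e., a reconstruction term plus a KL term keeping the encoder near the prior.)
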
43 | - Some downsides: VAEs output somewhat blurry images and do not fully utilize
44 | the latent code z. However, GANs seem to share that second problem.
45 |
46 | - VAEs have been extended in many ways, e.g. DRAW. I remember that paper when I
47 | read it half a semester ago, but that was before I had RNN intuition.
48 |
49 | - Advantage: the training process is basically training an autoencoder. Thus, it
50 | can learn a manifold structure since that's what autoencoders can do!
51 |
52 |
53 | Section 20.10.4: Generative Adversarial Networks
54 |
55 | Use this loss function formulation for the Generator:
56 |
57 | > In this best-performing formulation, the generator aims to increase the log
58 | > probability that the discriminator makes a mistake, rather than aiming to
59 | > decrease the log probability that the discriminator makes the correct
60 | > prediction.
61 |
62 | Yes, I tried this for my own work and have had better results with this
63 | technique. It seems to be more important to do this than to do one-sided label
64 | smoothing, batch normalization, etc., which makes sense as this was the rare
65 | "trick" that made it in the original 2014 NIPS paper.
66 |
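In symbols (my paraphrase of the standard formulation): rather than minimizing

    E_z[ log(1 - D(G(z))) ]

which saturates when the discriminator confidently rejects samples, the
generator maximizes

    E_z[ log D(G(z)) ]

the so-called non-saturating loss.
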
67 | - Then Sections 20.10.5 through 20.10.10 go through more topics that I don't
68 | have time to learn.
69 |
70 |
71 | Section 20.14: Evaluating Generative Models
72 |
73 | Yeah, I had a feeling this would be here, because some of this is quite
74 | subjective, and it seems like we have to resort to hiring human workers in
75 | person or via Amazon Mechanical Turk. The authors make a good point that in
76 | object recognition (for instance) we can alter the input. Some networks
77 | downscale to 256x256, others to 227x227, etc., but with generative models, if
78 | you change the input, the task fundamentally changes, and thus we can't compare
79 | the two procedures. Oh, and they also point out differences in log p(x) if x is
80 | discrete r.v. or continuous, in which case the former maximizes at log 1 = 0 and
81 | the latter can be arbitrarily high since p(x) could theoretically approach
82 | infinity.
83 |
--------------------------------------------------------------------------------
/How_People_Learn/Part_03_Teachers_and_Teaching.txt:
--------------------------------------------------------------------------------
1 | Part 3: Teachers and Teaching
2 |
3 |
4 | Chapter 6: Design of Learning Environments
5 |
6 | Very important:
7 |
8 | - Use learning-centered (actually, "learner centered") environments, a bit
9 | unclear to define but I think mostly about better understanding of students'
10 | prior knowledge. Again, see previous chapters about this.
11 |
12 | - Need some form of knowledge learning, so students need to learn something
13 | beyond just "learning how to learn". (Edit: not really the right way to define
14 | this but again not a clear definition, but mostly about how to make students
15 | knowledgeable, so that they can do effective transfer --- again, see previous
16 | chapters.)
17 |
18 | - Students need feedback (see "deliberate practice"), but not just the kind that
19 | comes with grades and tests. Also, feedback is most effective when students can
20 | revise their thinking on the _current_ subject matter, not when it arrives with
21 | a test graded after they've already moved on to newer concepts.
22 |
23 | - Must consider the community/culture aspect, which obviously affects learning.
24 | For instance, Anglo culture emphasizes talking and asking questions, but
25 | others might not (and this affects how teachers evaluate students). Also,
26 | seriously, when are we going to talk about multi-racials? Gaaaah, so
27 | disappointing.
28 |
29 | Also important:
30 |
31 | - A bunch of stuff on the merits of television (remember, this was 2000) but not
32 | really relevant for what I hope to get out of this book. Also a bunch of stuff
33 | on how to evaluate teachers for practical purposes.
34 |
35 | Stuff I didn't remember:
36 |
37 | - While some may say schools aren't working, the reality is that we're asking
38 | for way more out of students than in past eras. In the past, being literate
39 | could have simply meant being able to sign your name. Now we're getting to the
40 | point where we need students to interpret and compose potentially complicated
41 | written stuff.
42 |
43 | - Eh, a relevant quote: "Learning theory does not provide a simple recipe for
44 | designing effective learning environments; similarly, physics constrains but
45 | does not dictate how to build a bridge."
46 |
47 |
48 | Chapter 7: Effective Teaching Examples
49 |
50 | Very important:
51 |
52 | - History: focus not on facts but on analysis and understanding how to debate
53 | concepts. If you take students who know facts and historians who don't
54 | specialize in the same area, the students might actually do better on tests of
55 | factual knowledge, but won't be able to do any analysis. Effective teachers
56 | can promote debate, with careful monitoring of course. Interesting example:
57 | teacher asking students to put stuff in a time capsule, so they need to reason
58 | about important stuff.
59 |
60 | - Math: less focus on computation, more focus on problem solving skills.
61 | Analogies can help, e.g., modeling floors of a building to learn about
62 | negative numbers (negative floors = below ground level). Oh, also model-based
63 | stuff, where we apply math to building models of stuff (e.g., buildings).
64 | Could also clearly apply to physics.
65 |
66 | - Science: again, less on facts and more on analysis. Many students have
67 | intuition on stuff that's not correct in physics (e.g., forces and Newton's
68 | third law) so use live demos. Also recall earlier discussion about students
69 | not classifying problems correctly based on solution, but based on how they
70 | look (surface features). Students who are able to describe a problem
71 | "hierarchically" tend to do better --- though this is obviously vague.
72 |
73 | Also important:
74 |
75 | - Deliberate practice. Don't forget.
76 |
77 | - Effective teachers must know the subject matter AND be able to tell where
78 | students are likely to run into roadblocks.
79 |
80 | Stuff I didn't remember:
81 |
82 | - Practical stuff about instruction in large classes.
83 |
84 |
85 | Chapter 8: Teacher Learning
86 | (Not the most relevant chapter for me)
87 |
88 | There's a huge difference between education theory and practice, which leads to
89 | teachers rejecting (or not really diving into) research, lots of turnover,
90 | susceptible to local politics, etc. It's best to have workshops and other
91 | meet-ups where teachers can practice and discuss teaching techniques, etc.
92 |
93 |
94 | Chapter 9: Technology to Support Learning
95 | (Not the most relevant chapter for me)
96 |
97 | Well this is kind of out of date, I suppose. Mostly, technology has tradeoffs
98 | but can be used to bring in new contexts/demos to the class, etc. Particularly
99 | useful if it can help provide repeated feedback (remember deliberate practice).
100 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter17notes.txt:
--------------------------------------------------------------------------------
1 | ********************************************
2 | * NOTES ON CHAPTER 17: Monte Carlo Methods *
3 | ********************************************
4 |
5 | I think this chapter will also be review, but I have forgotten a lot of this
6 | material. It might also help me for my other projects with BIDMach.
7 |
8 | Heh, Las Vegas algorithms ... we never talk about those in Deep Learning. I
9 | agree, we should stick with deterministic approximation algorithms or Monte
10 | Carlo methods. Right, the point here is we have something we want to know, such
11 | as the expected value of a function (which depends on the data). Use sampling to
12 | take the average of f(x_1), ..., f(x_n) to form our estimate of E_p[f(x)] for
13 | some base distribution p. We can compute our expected error via the Central
14 | Limit Theorem. (Which John Canny said is "the most abused theorem in all of
15 | statistics" but never mind ...)
16 |
17 | But what if we cannot even sample from our base distribution p in the first
18 | place? For the above, we needed to draw x_1, ..., x_n somehow! We now turn to
19 | our possible solutions: importance sampling and MCMC. (The latter includes Gibbs
20 | sampling, and maybe even contains some variants of importance sampling? Not
21 | totally sure.)
22 |
23 | Section 17.2, Importance Sampling.
24 |
25 | I see, we can turn Equation 17.9 into Equation 17.10 just by switching the
26 | distribution the x_i's are drawn from, and adding in the necessary functions.
27 | Yes, they have the same expected value ... and I can see why the variance would
28 | be different. They argue that the minimum variance is the q^* in Equation 17.13.
29 | Yeah ... that seems familiar. How do they derive that? If indeed f did not
30 | change signs, then p and f cancel and the variance turns into a constant. Yay!
31 |
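A quick numpy sketch to convince myself of the basic identity (a toy example
of mine, not from the book):

    import numpy as np

    # Importance sampling: estimate E_p[f(x)] using samples from q,
    # reweighting each f(x_i) by p(x_i) / q(x_i).
    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2                     # E_p[f] = 1 for p = N(0, 1)
    x = rng.normal(0.0, 2.0, size=100_000)   # draw from q = N(0, 2)
    w = gauss_pdf(x, 0, 1) / gauss_pdf(x, 0, 2)   # importance weights
    print(np.mean(w * f(x)))                 # ~1.0, the true E_p[f(x)]
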
32 | I'm not really getting much out of this section other than definitions. I'll
33 | mark a TODO for myself to look at the examples they give in other parts of the
34 | book; this chapter is not as self-contained as Chapter 16.
35 |
36 | Section 17.3, Markov Chain Monte Carlo (my favorite!). They refer the reader to
37 | Daphne's book for more details (which I've read before!).
38 |
39 | MCMC methods use *Markov chains* to approximate the desired sample distribution
40 | (call it p_model). These are most convenient for energy-based models, p \propto
41 | exp(-E(x)), because those assign non-zero probability everywhere. They also
42 | assume that the energy-based models are for _undirected_ graphical models, where
43 | it's difficult to compute conditional probabilities.
44 |
45 | Procedure: start with random x, keep sampling, after a suitable burn-in period,
46 | the samples will start to come from p_model. Use a transition distribution
47 | T(x'|x), or a "kernel" in some of the literature.
48 |
49 | They show the usual matrix update in Equation 17.20, only for discrete random
50 | variables. Here, v should be in the probability simplex of dimension d where d
51 | is the number of values that x can take on. Remember, we're in discrete land
52 | here.
53 |
54 | Something new to me: the matrix "A" here is a "stochastic matrix": as we raise
55 | it to higher powers, its unit eigenvalues survive while the rest decay to
56 | zero. Interesting ... the Perron-Frobenius Theorem they refer to is from a
57 | 1907 paper (!!!).
58 |
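A tiny sketch of that convergence (my own two-state example):

    import numpy as np

    # Repeatedly applying a (column-)stochastic matrix A to a distribution
    # v converges to the stationary distribution: the eigenvalue-1 component
    # survives and the rest decay, per Perron-Frobenius.
    A = np.array([[0.9, 0.5],
                  [0.1, 0.5]])       # columns sum to 1
    v = np.array([1.0, 0.0])         # start anywhere in the simplex
    for _ in range(50):
        v = A @ v                    # the Equation 17.20-style update
    print(v)                         # ~[0.833, 0.167], the stationary dist
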
59 | They say "DL practitioners typically use 100 parallel Markov chains." Running
60 | parallel chains gives us more independent samples. Why haven't I been doing this ...
61 |
62 | Section 17.4, Gibbs Sampling (yay ...).
63 |
64 | Not much in this section, they just say that for Deep Learning, it's common to
65 | use these for energy-based models, such as RBMs, though we better do block Gibbs
66 | sampling.
67 |
68 | Other stuff:
69 |
70 | They point out that the main problem with MCMC methods in high dimensions is
71 | that they mix poorly; the samples are too correlated. It might get trapped in a
72 | posterior mode, but I'm curious: how much of a problem is that? For deep neural
73 | networks, the biggest problem is with saddle points. They argue that the MCMC
74 | methods will not be able to "traverse" regions in manifold space with high
75 | energy. Those result in essentially zero p(x) due to e^{-H(x)}.
76 |
77 | Oh, I see, now they talk about temperature to aid exploration. Yeah, I know
78 | about that! =) Finally, I can see a reference about temperature. Think of
79 | temperature as:
80 |
81 | p(x) \propto exp(-H(x)/T)
82 |
83 | Thus, when the temperature is high, the exponent -H(x)/T shrinks toward zero
84 | for every x, so the distribution becomes more uniform.
85 |
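Quick numerical check (mine):

    import numpy as np

    # p(x) \propto exp(-H(x)/T): high T flattens toward uniform,
    # low T concentrates mass on the lowest-energy state.
    H = np.array([1.0, 2.0, 5.0])        # energies of three states
    for T in (0.1, 1.0, 100.0):
        p = np.exp(-H / T)
        print(T, p / p.sum())            # T=0.1: ~one-hot; T=100: ~uniform
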
86 | You know, if there were more research done with MCMC methods and Deep Learning,
87 | wouldn't this chapter have discussed it? There isn't much here, to be honest,
88 | and lots of the references are pre-2012. Also, for tempering, why not cite
89 | some of the references that I cite in my own work?
90 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter03notes.txt:
--------------------------------------------------------------------------------
1 | **********************************************************
2 | * NOTES ON CHAPTER 3: Probability and Information Theory *
3 | **********************************************************
4 |
5 | This chapter was almost pure review for me, but some highlights and insights:
6 |
7 | - The chapter starts with some philosophy and some notation. Nothing new, though
8 | their notation is at least better than those from other textbooks I've read.
9 | Then they talk about definitions, marginals, conditionals, etc. It might be
10 | worth using their definition of covariance rather than the one I intuitively
11 | think of. High covariances (absolute values) mean values change a lot and are
12 | also far from their respective means often. Another concept to review:
13 | independence is a stronger requirement than zero covariance. Know the
14 | definition of a covariance matrix w.r.t. a random vector x.
15 |
16 | - Section 3.9: Common Probability Distributions, is pure review with the
17 | exception of the Dirac Distribution (to some extent). They mention sometimes
18 | needing to use the inverse variance for efficiency, but I doubt this is used
19 | often. Do remember why we like Gaussians: (1) the CLT,
20 | and (2) out of all distributions with the same variance and which cover the
21 | real line, it has the highest entropy, which can be thought of as imposing the
22 | fewest prior assumptions possible. (If we didn't have these restrictions, we
23 | could pick the *uniform* distribution, so be careful about the assumptions.)
24 | Finally, for mixture distributions, don't forget that the canonical way is to
25 | first choose a distribution, and then generate a sample from that. It is NOT,
26 | first generate k samples from all k distributions in the mixture, and then
27 | take a linear combination of those proportional to the probability weight. I
28 | was confused by that a few years ago. The component identity of a mixture
29 | model is often viewed as a **latent variable**.
30 |
31 | - Know the **logistic** function (yes) and the **softplus** function (huh, a
32 | smoothed ReLU).
33 |
34 | - There is some brief **measure theory** here:
35 |
36 | > One of the key contributions of measure theory is to provide a
37 | > characterization of the set of sets that we can compute the probability of
38 | > without encountering paradoxes. In this book, we only integrate over sets
39 | > with relatively simple descriptions, so this aspect of measure theory never
40 | > becomes a relevant concern. For our purposes, measure theory is more useful
41 | > for describing theorems that apply to most points in R^n but do not apply to
42 | > some corner cases.
43 |
44 | - Oh, I like their example with deterministic functions of random variables.
45 | I've seen this a few times in statistics, and the key with variable
46 | transformations like those is that we have to take into account different
47 | scales of functions, which is where the derivative term and Jacobians appear.
48 |
49 | - Section 3.13: Information Theory. My favorite part is Figure 3.6. I should
50 | spend more time thinking about it. Also, good intuition:
51 |
52 | > A message saying "the sun rose this morning" is so uninformative as to be
53 | > unnecessary to send, but a message saying "there was a solar eclipse this
54 | > morning" is very informative.
55 |
56 | Information theory is about quantifying the "information" present in some
57 | signal. Use the **Shannon entropy** to quantify the uncertainty in a
58 | probability **distribution**: - E_x[log p(x)]. This is "differential entropy"
59 | if x is continuous. Low entropy means the random variable is closer to
60 | deterministic, high entropy means it's very random and uncertain.
61 |
62 | Note: in most information theory contexts, the log is base 2, so we refer to
63 | this as "bits." In machine learning, we use the natural logarithm, so we call
64 | them "nats."
65 |
66 | As usual, define the KL divergence. KL(P||Q) = E_P[log(P(x)/Q(x))]. For now,
67 | assume the first distribution, P, is what we're drawing expectations w.r.t.
68 | For discrete r.v.s:
69 |
70 | > [KL Divergence is] the extra amount of information [...] needed to send a
71 | > message containing symbols drawn from probability distribution P, when we
72 | > use a code that was designed to minimize the length of messages drawn from
73 | > probability distribution Q.
74 |
75 | - Note also the **cross entropy** quantity: - E_P[log Q(x)].
76 |
77 | > Minimizing the cross-entropy with respect to Q is equivalent to minimizing
78 | > the KL divergence, because Q does not participate in the omitted term.
79 |
80 | This is why if Q is our model, we can minimize the cross entropy and make our
81 | Q close to P, which is the ground truth data distribution.
82 |
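A short numerical check of that equivalence (my own example); the omitted term
is H(P), which doesn't depend on Q:

    import numpy as np

    # For discrete P, Q: cross entropy H(P, Q) = H(P) + KL(P || Q),
    # so minimizing H(P, Q) over Q is minimizing KL(P || Q) over Q.
    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.5, 0.3, 0.2])
    H_P = -np.sum(P * np.log(P))          # Shannon entropy (in nats)
    KL = np.sum(P * np.log(P / Q))        # KL(P || Q)
    H_PQ = -np.sum(P * np.log(Q))         # cross entropy
    print(np.isclose(H_PQ, H_P + KL))     # True
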
83 | - The chapter concludes with some basic graphical models stuff.
84 |
85 | I like this chapter.
86 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter09notes.txt:
--------------------------------------------------------------------------------
1 | **********************************************
2 | * NOTES ON CHAPTER 9: Convolutional Networks *
3 | **********************************************
4 |
5 | This chapter should be review for me, but I do want to get clarification about
6 | (a) visualizing gradients/filters and (b) the "deconvolution" or "transpose
7 | convolution" operator. To a lesser extent, I'm interested in (c) how to
8 | implement efficient convolutions.
9 |
10 | - There is some stuff about whether we care about kernel flipping or not.
11 | However, this seems to be very specific about the convolution formula, and I
12 | doubt I'm going to go in detail on that since I'm not implementing them.
13 |
14 | - Understand why convolutions are so important: (1) **sparse interactions**, (2)
15 | **parameter sharing** and (3) **equivariant representations**. I know all of
16 | these, and to be clear on the last one, it's because we often want to
17 | represent the same shapes but in different locations in a grid. The book says
18 | "To say a function is equivariant means that if the input changes, the output
19 | changes in the same way" so maybe they're using a slightly different
20 | perspective. The first two together are mainly about the storage and
21 | efficiency improvements. The third doesn't apply to all transformations (for
22 | CNNs at least), but it definitely applies for translation.
23 |
24 | - In the pooling description (Section 9.3) the authors say non-linearities come
25 | **before** pooling and **after** convolutions. Indeed, this matches the
26 | ordering of the CNNs we wrote in CS 294-129. Intuitively, we already do a
27 | maximum operator in the standard 2x2 max pool, so why apply a ReLU **after**
28 | that? The major advantage of pooling is to make the network **invariant to
29 | slight transformations**. It also helps to reduce data dimensionality,
30 | particularly if we also padded the convolutions (and so the convolution layers
31 | do *not* reduce data dimensionality, but can leave that job for the pooling).
32 |
33 | - Interesting perspective: Section 9.4 explains why convolutions and pooling can
34 | be viewed as an infinitely strong prior. I can see why (beforehand) since
35 | these strongly assume the input is some grid-like thing, such as an image. (A
36 | weak prior has high entropy, like a uniform distribution or a Gaussian.) Be careful:
37 |
38 | > If a task relies on preserving precise spatial information, then using
39 | > pooling on all features can increase the training error.
40 |
41 | (This is an example of how architectures need to be tweaked for the task.)
42 |
43 | - Huh, I've never heard of **unshared convolution** nor **tiled convolution**.
44 | Eh, I can look them up later, they're alternatives to convolution but
45 | certainly less important to know.
46 |
47 | - Ah ... how to compute the **nightmarish** gradient of a convolution operator?
48 | The gradient is actually another convolution, but it's hard to derive
49 | algebraically. Convolutions are just (sparse) matrix multiplication assuming
50 | we've flattened the input tensor. We did that for CS 231n to flatten the input
51 | to shape (N, d1*d2*...*dn). Given that matrix, we take its transpose and that
52 | gives us the gradient for the backpropagation step, at least in theory. Wait,
53 | Goodfellow has a report from 2010 which explains how to compute these
54 | gradients. Interesting, how did I not know about this?
55 |
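To make the "transpose gives the gradient" point concrete, a 1-D sketch of
mine (note numpy's correlate is the no-flip "convolution" that DL libraries use):

    import numpy as np

    # A 1-D "valid" convolution written as a matrix multiply; the
    # backward pass w.r.t. the input is the transpose of that matrix.
    k = np.array([1.0, 2.0, 3.0])            # kernel, length 3
    n = 5                                     # input length; output length 3
    M = np.zeros((n - len(k) + 1, n))
    for i in range(M.shape[0]):
        M[i, i:i + len(k)] = k                # each row: shifted kernel

    x = np.arange(5.0)
    assert np.allclose(M @ x, np.correlate(x, k, mode='valid'))

    dy = np.ones(3)                           # upstream gradient dL/dy
    dx = M.T @ dy                             # dL/dx via the transpose
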
56 | - Something I didn't quite think of before, but it seems obvious: we can instead
57 | use **structured output** from a CNN that isn't a probability vector or
58 | distribution but some tensor that comes "earlier" in the net. This can give
59 | probabilities for each precise pixel in an image, for instance, if the tensor
60 | output is 3D and (i,j,k) means class i probability in coordinate (j,k). Yeah,
61 | overall there are quite a lot of options the user has in designing a CNN. This
62 | also enables the possibility of using recurrent CNNs, see Figure 9.17.
63 |
64 | - Section 9.8: **Efficient convolutions**. Unfortunately, there is only
65 | high-level discussion here, but I'm not sure I'd be able to understand the
66 | details anyway. They say:
67 |
68 | > Convolution is equivalent to converting both the input and the kernel to the
69 | > frequency domain using a Fourier transform, performing point-wise
70 | > multiplication of the two signals, and converting back to the time domain
71 | > using an inverse Fourier transform. For some problem sizes, this can be
72 | > faster than the naive implementation of discrete convolution.
73 |
74 | The last part of the chapter is about the neuro-scientific basis of CNNs. It's
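Quick check of that claim in numpy (my own toy example):

    import numpy as np

    # Convolution theorem: time-domain convolution equals point-wise
    # multiplication in the frequency domain (pad to the full length).
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([0.0, 1.0, 0.5])
    n = len(a) + len(b) - 1                   # full linear-conv length
    fft_conv = np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n)).real
    print(np.allclose(fft_conv, np.convolve(a, b)))   # True
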
75 | an easier read.
76 |
77 | Overall, I think this is a good chapter. Unfortunately, it didn't cover (a) or
78 | (b), the stuff I was wondering about earlier. =( OK, I think I understand how to
79 | visualize a weight filter, but maybe I should look back at that relevant CS 231n
80 | lecture.
81 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter04notes.txt:
--------------------------------------------------------------------------------
1 | *****************************************
2 | * NOTES ON CHAPTER 4: Numerical Methods *
3 | *****************************************
4 |
5 | This brief chapter will probably contain more new material for me compared to
6 | chapters 2 and 3, but still be mostly review. Here are the highlights:
7 |
8 | - We must delicately handle implementations of the **softmax function** to
9 | be robust to numerical underflow and overflow. The book amusingly just tells
10 | us to rely on Deep Learning libraries, which have presumably handled all these
11 | details for us.
12 |
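  The standard trick, for the record (a sketch of mine, not the book's code):

    import numpy as np

    # softmax(z) is invariant to subtracting max(z); doing so caps the
    # largest exponent at exp(0) = 1, avoiding overflow, and guarantees
    # the denominator is at least 1, avoiding division by underflowed 0.
    def softmax(z):
        shifted = z - np.max(z)
        e = np.exp(shifted)
        return e / e.sum()

    print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # no overflow
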
13 | - Don't forget about a matrix's **condition number**, which when we're dealing
14 | with a function f(x) = A^{-1}x, roughly tells us how "quickly" it perturbs,
15 | i.e. its sensitivity. Later, they point out:
16 |
17 | > The condition number of the Hessian at this point measures how much the
18 | > second derivatives differ from each other. When the Hessian has a poor
19 | > condition number, gradient descent performs poorly. This is because in one
20 | > direction, the derivative increases rapidly, while in another direction, it
21 | > increases slowly.
22 |
23 | - Review: the **directional derivative** of function f in direction u is the
24 | derivative of the function f(x + alpha*u) evaluated at alpha=0, i.e. the slope
25 | of f in direction u.
26 |
27 | - Review of Hessians, Jacobians, gradient descent, etc. The Hessian can be
28 | thought of as the Jacobian of the gradient (of a function from R^n to R).
29 | Also, regarding rows/columns of the Jacobians, if the function f is from R^m
30 | to R^n, the Jacobian is n x m, so just remember the ordering (I doubt it is
31 | strict since this is just a representation that's convenient for us, and we
32 | could also take transposes if we wanted). In Deep Learning, the functions we
33 | encounter almost always have symmetric Hessians. I like Equation 4.9 as it
34 | emphasizes how gradient descent can sometimes overshoot the target and result
35 | in a *worse* value, if the second-order term dominates.
36 |
37 | - To generalize the second derivative test (tells us a maximum, minimum, or
38 | saddle point) in high dimensions, we need to analyze the eigenvalues of the
39 | Hessian, e.g.:
40 |
41 | > When the Hessian is positive definite (all its eigenvalues are positive),
42 | > the point is a local minimum. This can be seen by observing that the
43 | > directional second derivative in any direction must be positive, and making
44 | > reference to the univariate second derivative test.
45 |
46 | Likewise, the reverse is true when the Hessian is negative definite. Note that
47 | the Hessian is a function of x (vector in R^n), so different x will result in
48 | different Hessians. See Figure 4.5 for the quintessential example of a saddle
49 | point.
50 |
51 | BTW, why do the eigenvalues help us **at all**? How are they related to the
52 | second derivative test in one dimension? I think it's because the second-order
53 | Taylor series expansion involves a term d^THd, where d is some unit vector.
54 | This is the second term that's added into the Taylor series, so its values
55 | among different directions tells us the curvature. We also have an
56 | eigendecomposition of H, and that provides us the eigenvalues.
57 |
58 | - We have simple gradient descent, and then the second-order (i.e. expensive!)
59 | Newton's method. How do we **derive** the step size, e.g. if you're asked to
60 | do so in an interview?
61 |
62 | - Write out f(x) using a second-order Taylor series expansion at x(0).
63 |
64 | - Then look at the second-order Taylor series and take the gradient w.r.t x
65 | (not x(0)).
66 |
67 | - Solve for the best x, the critical point, and plug-n-chug.
68 |
69 | - At least, that seemed to work for me and I verified Newton's method.
70 |
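  Writing out the result of that recipe (standard form; from memory, so
  double-check against the book): with gradient g and Hessian H at x^{(0)},

    f(x) ~ f(x^{(0)}) + (x - x^{(0)})^T g + (1/2)(x - x^{(0)})^T H (x - x^{(0)})

  and setting the gradient of the right-hand side to zero,
  g + H(x - x^{(0)}) = 0, gives the Newton step

    x^* = x^{(0)} - H^{-1} g
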
71 | - In the context of Deep Learning, our functions are so complicated that we can
72 | rarely provide any theoretical guarantees. We can sometimes get headway by
73 | assuming Lipschitz functions, which tell us that small changes in the input
74 | have quantified small changes in the function output.
75 |
76 | - Convex optimization is a very successful research field, but we can only take
77 | lessons from it; we can't really use its algorithms, and the importance of
78 | convexity is diminished in deep learning. Constrained optimization may be
79 | slightly more important. These involve the KKT conditions and Lagrange
80 | multipliers, which at a high level try to design an unconstrained problem so
81 | that the solution can be transformed into one for the **constrained** problem.
82 | Brief comments on those:
83 |
84 | - We rewrite the loss function by adding terms corresponding to constraints
85 | h(x) = 0 and/or g(x) <= 0.
86 |
87 | - We have min_{x in S} f(x) as our original **constrained** minimization
88 | problem. However ...
89 |
90 | - min_x max_{lambda} max_{alpha >= 0} L(x, lambda, alpha) has the same set of
91 | solutions and optimal points!
92 |
93 | - (Some caveats here, have to consider infinity cases, etc., but this is the
94 | general idea. Any time a constraint is violated, the minimum value of the
95 | Lagrangian w.r.t. x is ... infinity!)
96 |
97 | For some reason, I never feel comfortable with Lagrangians. It might be worth
98 | going back and reviewing Stephen Boyd's book, but I think this book's treatment
99 | was pretty clear.
100 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter08notes.txt:
--------------------------------------------------------------------------------
1 | ****************************************************
2 | * NOTES ON CHAPTER 8: Optimization for Deep Models *
3 | ****************************************************
4 |
5 | This chapter should be review for me.
6 |
7 | Section 8.1: Learning vs. Pure Optimization
8 |
9 | The authors make a good point in that we really care about minimizing the cost
10 | function w.r.t. the **data generating distribution**, NOT the actual training
11 | data (i.e. generalization). The difference with optimization is that we know the
12 | underlying data generating distribution, but in machine learning we only have
13 | the fixed training data, i.e. minimizing the **empirical risk**. However, this
14 | isn't used in its raw form:
15 |
16 | > These two problems mean that, in the context of deep learning, we rarely use
17 | > empirical risk minimization. Instead, we must use a slightly different
18 | > approach, in which the quantity that we actually optimize is even more
19 | > different from the quantity that we truly want to optimize.
20 |
21 | Also, as I know, ML algorithms typically stop not when they're at a true minimum
22 | but when we define them to stop, early stopping. =)
23 |
24 | Oh, note that second-order methods require larger batch sizes. In fact, Andrej
25 | Karpathy covered that briefly in Lecture 7 of CS 231n. This is because
26 | matrix-vector multiplication and taking inverses amplify errors in the original
27 | Hessian/gradient.
28 |
29 | I do this:
30 |
31 | > Fortunately, in practice it is usually sufficient to shuffle the order of the
32 | > dataset once and then store it in shuffled fashion. This will impose a fixed
33 | > set of possible minibatches of consecutive examples that all models trained
34 | > thereafter will use, and each individual model will be forced to reuse this
35 | > ordering every time it passes through the training data.
36 |
37 | Section 8.2: Challenges in Neural Net Optimization
38 |
39 | > For many years, most practitioners believed that local minima were a common
40 | > problem plaguing neural network optimization. Today, that does not appear to
41 | > be the case. The problem remains an active area of research, but experts now
42 | > suspect that, for sufficiently large neural networks, most local minima have a
43 | > low cost function value, and that it is not important to find a true global
44 | > minimum rather than to find a point in parameter space that has low but not
45 | > minimal cost.
46 |
47 | To test whether we are at a local minimum, we can check the norm of the gradient.
48 |
49 | Section 8.3: Basic Algorithms
50 |
51 | These include SGD and its variants, the core of the chapter. I better know
52 | these. I know SGD and for momentum, they say:
53 |
54 | > Momentum aims primarily to solve two problems: poor conditioning of the
55 | > Hessian matrix and variance in the stochastic gradient.
56 |
57 | and
58 |
59 | > We can think of the particle as being like a hockey puck sliding down an icy
60 | > surface. Whenever it descends a steep part of the surface, it gathers speed
61 | > and continues sliding in that direction until it begins to go uphill again.
62 |
63 | There's some math there that I probably don't need to memorize, but I should
64 | blog about it soon. They write it as a first-order differential equation since
65 | we have a separate velocity term. If we didn't have that, we need a *second*
66 | order diff-eq. Also, I really have to review differential equations someday.
67 |
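The update itself, as a sketch (my code, standard formulation):

    import numpy as np

    # Classical momentum: velocity v accumulates an exponentially
    # decaying sum of past gradients; alpha is the "friction" term.
    def sgd_momentum_step(theta, v, g, lr=0.01, alpha=0.9):
        v = alpha * v - lr * g           # update velocity
        return theta + v, v              # take the step

    # Toy quadratic f(theta) = 0.5 * ||theta||^2, so grad = theta.
    theta, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(200):
        theta, v = sgd_momentum_step(theta, v, g=theta)
    print(theta)                         # ~[0, 0]
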
68 | Section 8.4: Parameter Initialization
69 |
70 | AKA break symmetry!
71 |
72 | Surprisingly, they don't seem to mention Kaiming He's paper on weight
73 | initialization. I don't even see any discussion of fan-in and fan-out.
74 |
75 | Section 8.5: Algorithms with Adaptive Learning Rates
76 |
77 | Yes, the key is **adaptive** learning rates. AdaGrad, then RMSProp, then Adam:
78 |
79 | > The name "Adam" derives from the phrase "adaptive moments." In the context of
80 | > the earlier algorithms, it is perhaps best seen as a variant on the
81 | > combination of RMSProp and momentum with a few important distinctions.
82 |
83 | The distinctions have to do with estimates of moments and their biases. I'm
84 | quite confused on this, unfortunately.
85 |
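To de-confuse myself, the update as a sketch (my code; standard form from the
Adam paper):

    import numpy as np

    # Adam: decayed first/second moment estimates plus bias correction,
    # needed because m and v are initialized at zero and would otherwise
    # be biased toward zero early in training.
    def adam_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * g            # first moment (mean)
        v = b2 * v + (1 - b2) * g ** 2       # second moment (uncentered)
        m_hat = m / (1 - b1 ** t)            # bias-corrected moments;
        v_hat = v / (1 - b2 ** t)            # t starts at 1, not 0
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
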
86 | (Note: unlike what's suggested in CS 231n Lecture 7, in fact the textbook
87 | actually has RMSProp with Nesterov's in one of their algorithms.)
88 |
89 | Section 8.6: Approximate Second-Order Methods
90 |
91 | Newton's method is intractable, etc. etc. etc. Well, these can help:
92 |
93 | > Conjugate gradients is a method to efficiently avoid the calculation of the
94 | > inverse Hessian by iteratively descending conjugate directions.
95 |
96 | Also, know BFGS and L-BFGS.
97 |
98 | Section 8.7: Other Strategies
99 |
100 | Ah, **batch normalization**.
101 |
102 | > This means that the gradient will never propose an operation that acts simply
103 | > to increase the standard deviation or mean of $h_i$; the normalization
104 | > operations remove the effect of such an action and zero out its component in
105 | > the gradient. This was a major innovation of the batch normalization approach.
106 |
107 | and
108 |
109 | > Batch normalization reparametrizes the model to make some units always be
110 | > standardized by definition, deftly sidestepping both problems.
111 |
112 | Yeah, this idea of normalizing inputs is obvious, so we have to be clear on the
113 | actual contribution of batch normalization.
114 |
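The forward pass, to be concrete (a sketch of mine):

    import numpy as np

    # Batch norm: standardize each unit over the minibatch, then let the
    # learned gamma/beta set whatever mean/scale the model actually wants.
    def batchnorm_forward(H, gamma, beta, eps=1e-5):
        mu = H.mean(axis=0)                  # per-unit minibatch mean
        var = H.var(axis=0)                  # per-unit minibatch variance
        H_hat = (H - mu) / np.sqrt(var + eps)
        return gamma * H_hat + beta
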
115 | There's some other stuff here about pre-training (yes that's important!) but
116 | also check Chapter 15. Oh, and don't forget, we normally don't want to design
117 | new optimization algorithms, but instead to make the networks **easier to
118 | optimize**.
119 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter16notes.txt:
--------------------------------------------------------------------------------
1 | **************************************************************************
2 | * NOTES ON CHAPTER 16: Structured Probabilistic Models for Deep Learning *
3 | **************************************************************************
4 |
5 | I expect to know the majority of this chapter, because it's probably going to be
6 | like Michael I. Jordan's notes. "Structured Probabilistic Models" are graphical
7 | models! But the key is that this should help me better understand the current
8 | research frontiers of Deep Learning, and it's self-contained. Let's see what it
9 | has to offer ...
10 |
11 | Their "Alice and Bob" (and "Carol" ...) example has to do with running a relay,
12 | which is better than Michael I. Jordan's example of being abducted by aliens.
13 |
14 | I remember Markov Random Fields, yes, we need to define a normalizing constant
15 | Z, but (a) if we define our clique potentials awfully, Z won't exist, and (b) in
16 | deep learning, Z is usually intractable.
17 |
18 | I agree with their quote:
19 |
20 | > One key difference between directed modeling and undirected modeling is that
21 | > directed models are defined directly in terms of probability distributions
22 | > from the start, while undirected models are defined more loosely by \phi
23 | > functions that are then converted into probability distributions. This changes
24 | > the intuitions one must develop in order to work with these models.
25 |
26 | When they go and talk about their example with x being binary and getting
27 | Pr(X_i = 1) being a sigmoid(b_i), you can get that by explicitly writing out the
28 | formula, then "rearranging" the sum so that terms independent of the current,
29 | rightmost sum get pushed left. Then you see that the sums factorize, giving us
30 | independence, and we can split the fractions, etc. It brings back good memories of
31 | studying CS 188.
32 |
33 | Section 16.2.4 is on Energy-Based functions. John Canny would really like those!
34 | I think the easiest way for me to think of these is taking potentials of
35 | arbitrary functions and then using e^{-function}. AKA Boltzmann Machines. I like
36 | their discussion here; it is relatively elucidating.
37 |
38 | There is also review on what edges mean when describing graphical models. Again,
39 | this is all CS 188 stuff. For instance, remember that we can add more edges to a
40 | graphical model and still represent the same class of distributions (the edges
41 | can be unnecessary).
42 |
43 | One advantage for each type:
44 |
45 | - It is easier to sample from directed models (I agree).
46 | - It is easier to perform approximate inference on undirected models (I think I
47 | agree).
48 |
49 | Key fact:
50 |
51 | > Every probability distribution can be represented by either a directed model
52 | > or by an undirected model.
53 |
54 | Though there are some directed models for which no undirected model is
55 | equivalent to it. By "equivalent" here we mean in the precise set of
56 | independence assumptions it implies.
57 |
58 | And another key idea:
59 |
60 | > When we represent a probability distribution with a graph, we want to choose a
61 | > graph that implies as many independences as possible, without implying any
62 | > independences that do not actually exist.
63 |
64 | E.g. a loop of length 4 (with no chords inside) is an undirected graphical
65 | model, but we have to add an edge before adding orientations to the edges to
66 | "convert" it to as simple a directed graphical model as possible (that still
67 | implies as many (or as few?) assumptions).
68 |
69 | Section 16.3: sampling from graphical models. I agree, it's easy for directed
70 | models. They call it "ancestral sampling" whereas I've called it "forward
71 | sampling," I think from Daphne Koller. We have to modify it if we want to do
72 | more general sampling with conditioning, i.e. fixed variables. It's toughest if
73 | the variables are *descendants*. Ancestors are easier because we can fix them
74 | and just do P(x|parents(x)) as usual. For *undirected* models ... they mention
75 | Gibbs sampling. =)
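
To pin down ancestral sampling, here's a minimal Python sketch on a toy
two-node chain A -> B (my own example; the probabilities are made up):

```
import numpy as np

# Toy chain A -> B, both binary. Ancestral sampling: visit nodes in
# topological order, sampling each from p(node | parents).
p_a = 0.3                       # P(A = 1)
p_b_given_a = {0: 0.9, 1: 0.2}  # P(B = 1 | A = a)

def ancestral_sample(rng):
    a = int(rng.random() < p_a)
    b = int(rng.random() < p_b_given_a[a])
    return a, b

rng = np.random.default_rng(0)
samples = [ancestral_sample(rng) for _ in range(1000)]
```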
76 |
77 | The next few sections are pretty short. They mention *structure learning*, i.e.
78 | learning the graphical model structure. That's a hard problem due to the
79 | super-exponential number of possibilities. However, it seems like structure
80 | learning --- as far as I can tell --- is no longer an active research area? They also mention the
81 | importance of latent variables. Yes, that's a bit broad, but I agree. Just
82 | before the "real" Deep Learning part they talk about inference and approximate
83 | inference, which is something that I should know about well (but they just give
84 | a broad treatment, a bit unclear).
85 |
86 | Finally, the Deep Learning part that I wanted to read.
87 |
88 | After reading it, I just want to clarify: when people draw out a fully connected
89 | net, they usually write out nodes, edges, in layer format, etc. Is that
90 | correctly viewed as a *graphical model*? Or are those different design criteria?
91 | Also, I'm assuming that all the "latent variable" discussion is simply referring
92 | to the hidden layers (and their units)? I think that's the case after reading
93 | about why loopy belief propagation is "almost never" used in deep learning. (Oh,
94 | and by the way, I don't actually know loopy belief propagation ... and I just
95 | barely remember belief propagation.) I think it makes sense, in normal graphical
96 | models, we want the computational graph to be sparse to prevent high treewidth,
97 | but in deep learning, we do matrix multiplication which creates a lot of
98 | connectivity. So, matrix multiplication, not loopy belief propagation.
99 |
100 | They discuss *Restricted Boltzmann Machines* at the end. They say it is the
101 | "quintessential example" of using graphical models for deep learning. With only
102 | one hidden layer, it is not too deep (a.k.a. it looks like a normal graphical
103 | model) but it groups variables into layers, like deep learning. For now, let's
104 | only worry about the "canonical form" which is an energy-based model with a
105 | particular (negative) quadratic form plus linear terms. The inputs are (v,h).
106 | The names should be familiar: v=visible and h=hidden. Then it's like a complete
107 | bipartite graph with v on one side and h on the other. We can do Gibbs sampling
108 | on this (in fact, _block_ Gibbs sampling).
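
Here's a minimal numpy sketch of that canonical form and one block Gibbs
step (the sizes and parameter scales are arbitrary choices of mine):

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy RBM with energy E(v,h) = -b.v - c.h - v.W.h, binary v and h.
rng = np.random.default_rng(0)
nv, nh = 6, 4
W = 0.1 * rng.standard_normal((nv, nh))
b = np.zeros(nv)   # visible biases
c = np.zeros(nh)   # hidden biases

def block_gibbs_step(v):
    # The graph is bipartite, so all h_j are conditionally independent
    # given v (and vice versa): each layer is sampled in one block.
    h = (rng.random(nh) < sigmoid(c + v @ W)).astype(float)
    v = (rng.random(nv) < sigmoid(b + W @ h)).astype(float)
    return v, h

v = rng.integers(0, 2, nv).astype(float)
for _ in range(100):
    v, h = block_gibbs_step(v)
```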
109 |
110 | Concluding point:
111 |
112 | > Overall, the RBM demonstrates the typical deep learning approach to graphical
113 | > models: representation learning accomplished via layers of latent variables,
114 | > combined with efficient interactions between layers parametrized by matrices.
115 |
116 | I've now read the chapter and feel pleased. Great job, authors!
117 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter14notes.txt:
--------------------------------------------------------------------------------
1 | *************************************
2 | * NOTES ON CHAPTER 14: Autoencoders *
3 | *************************************
4 |
5 | Let's review this and discuss with John Canny.
6 |
7 | The introduction is excellent, and matches with my intuition. I agree that an
8 | encoder is like doing dimension reduction, and it certainly seems like decoders
9 | (the reverse direction) can be used for generating things, hence they can be
10 | used within *generative* models. (A.K.A. VAEs!)
11 |
12 | They mention "recirculation" as a more biologically realistic (!!) alternative
13 | to backpropagation, but it is not used much.
14 |
15 | Think of AEs as optimizing this simple thing:
16 |
17 | min_{f,g} L(x, g(f(x)))
18 |
19 | where x is the whole dataset, and f and g are the encoder and decoder,
20 | respectively.
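
To make the objective concrete, here's a tiny numpy sketch with linear f and
g and squared-error L (sizes and step size are arbitrary choices of mine; as
noted below, a purely linear autoencoder only recovers a PCA-like solution):

```
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))        # toy data: 100 points in R^10
We = 0.1 * rng.standard_normal((10, 3))   # encoder f, code size 3 < 10
Wd = 0.1 * rng.standard_normal((3, 10))   # decoder g

for _ in range(500):
    H = X @ We                 # h = f(x)
    err = H @ Wd - X           # g(f(x)) - x
    dXr = err / len(X)         # grad of (err**2).sum() / (2 * len(X))
    gWd = H.T @ dXr            # gradient w.r.t. decoder weights
    gWe = X.T @ (dXr @ Wd.T)   # gradient w.r.t. encoder weights
    Wd -= 0.1 * gWd
    We -= 0.1 * gWe
```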
21 |
22 | We need to make sure the autoencoder is constrained somehow ("undercomplete")
23 | so that it isn't simply performing the identity function. Solutions: don't
24 | provide too much capacity to both (a) the hidden code and (b) either of the two
25 | networks, and *regularize* somehow. Also, don't just make things linear, because
26 | then it's doing nothing more than PCA.
27 |
28 | Confusing point: think of autoencoders as "approximating maximum likelihood
29 | training of a generative model that has latent variables." Why?
30 |
31 | - The prior is not over the "belief on our parameters before seeing data" but
32 | the hidden units (which are latent variables). Yes, this aspect makes sense.
33 | - I don't know what they mean by "the autoencoder as approximating this sum with
34 | a point estimate for just one highly likely value for h" but let's not
35 | over-worry about this.
36 |
37 | (This was in the discussion about sparse autoencoders, and it makes a little
38 | more sense to me after reading about VAEs. The point is that `h` is a latent
39 | variable.)
40 |
41 | Denoising Autoencoders: clever! =) Rather than using g(f(x)) in the loss
42 | function, use g(f(\tilde{x})) where \tilde{x} is perturbed! This is a creative
43 | way to avoid the autoencoder simply learning the identity function.
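
In numpy-ish terms the change is one line (additive Gaussian noise is my
arbitrary choice of corruption C):

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                   # clean training input
x_tilde = x + 0.1 * rng.standard_normal(10)   # \tilde{x} ~ C(. | x)
# now train on L(x, g(f(x_tilde))): corrupted input, *clean* target
```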
44 |
45 | One can also regularize by limiting the derivatives, i.e. a "contractive
46 | autoencoder."
47 |
48 | I've wondered about the exact size of autoencoders in use nowadays, since I
49 | haven't seen a figure before. The encoder and decoder are themselves each feed
50 | forward neural networks, so in general, it seems like each can be implemented
51 | with many layers (or just one).
52 |
53 | Stochastic Encoders and Decoders: not sure I got much out of this. However, I
54 | did get this: the decoder can be seen as optimizing log p(x|h), since it is
55 | given h and has to produce x (and x is known!). But the analogue for the encoder
56 | is more confusing, because we have log p(h|x) but we don't know h. This must be
57 | similar to other latent variables in graphical models.
58 |
59 | **Update**: after reading this again with more knowledge of how these work,
60 | I think I didn't get the point of the last section. The log p(x|h) is indeed
61 | what the decoder optimizes, though (1) it really optimizes the encoder as
62 | well when this is trained end-to-end since the encoder produces h, and (2)
63 | we have to provide the loss function, and (3) we can **also** add a
64 | distribution to the encoder, but I don't think this is actually needed to
65 | train the encoder portion. In the case of continuous-valued pixels, we
66 | should probably consider a Gaussian distribution for the loss, which means
67 | the autoencoder should try and get the mean/variance. In VAEs, we can take
68 | advantage of the Gaussian assumption to *sample* elements.
69 |
70 | Denoising autoencoders: OK, their computational graph (Figure 14.3) makes sense.
71 | (It doesn't really help me get a deep understanding, though.) They introduce a
72 | corruption function C(\tilde{x} | x), whose function is obvious. I was confused
73 | for a bit as to why we're assuming we know the x (I mean, in real life, we might
74 | be given *only* noisy stuff) but if we don't have the real x, we can't evaluate
75 | the loss function! It's just part of our training data.
76 |
77 | Figure 14.4 makes sense intuitively. Corrupted stuff is off the manifold because
78 | if we take an average random sample, it'll be in some random space. But **real**
79 | samples are in a manifold. Unfortunately, some of the discussion here (e.g.
80 | connecting autoencoders with RBMs) just refers to reading papers. =( That's why
81 | I am reading this textbook, to *avoid* reading difficult-to-understand papers.
82 | There's also some discussion on estimating the score function, which I think I
83 | understand but haven't grokked it.
84 |
85 | OK, back to more obvious stuff:
86 |
87 | > Denoising autoencoders are, in some sense, just MLPs trained to denoise.
88 | > However, the name "denoising autoencoder" refers to a model that is intended
89 | > not merely to learn to denoise its input but to learn a good internal
90 | > representation as a side effect of learning to denoise.
91 |
92 | Manifolds! (Section 14.6) Key reason why we think about this (emphasis mine):
93 |
94 | > Like many other machine learning algorithms, autoencoders exploit the idea
95 | > that data concentrates around a low-dimensional manifold or a small set of
96 | > such manifolds, as described in section 5.11.3. [...] Autoencoders take this
97 | > idea further and aim to **learn the structure of the manifold**.
98 |
99 | Additional thoughts:
100 |
101 | - Understand **tangent planes**, these describe the direction of allowed
102 | variation for a point x while still remaining on the low-dim manifold. See
103 | Figure 14.6 for an intuitive example with MNIST, showing points on this
104 | manifold and also the allowable directions.
105 |
106 | - Intuitively, autoencoders need to learn how to represent this variation among
107 | the manifold. However, they don't need to do this for points off the
108 | manifold. See Figure 14.7. The reconstruction is flat near the manifold
109 | points, i.e. the only area that matters. True, it jumps up at several points,
110 | but those are well off the manifold.
111 |
112 | - There are other ways we can learn manifold structure, using non-Deep
113 | Learning techniques (see Figures 14.8 and 14.9), but I don't think these are
114 | as important to know now.
115 |
116 | Contractive Autoencoders (Section 14.7) introduce a regularizer to make the
117 | derivatives of f (as in, f(x) = h) small.
118 |
119 | What are applications of autoencoders? Definitely dimensionality reduction is
120 | one, and we can also think about information retrieval, the task of finding
121 | entries in a database that resemble a query entry. Why? Search is more efficient
122 | in lower-dimensional spaces.
123 |
124 | Overall, I actually think this chapter is among the weaker ones in the book.
125 | Looking through the CS 231n slides was a **lot** more helpful. Eh, not every
126 | chapter is perfect.
127 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Fetch.text:
--------------------------------------------------------------------------------
1 | Notes on how to use the Fetch.
2 |
3 | ************
4 | ** UPDATE **
5 | ************
6 |
7 | Here are some full steps:
8 |
9 | (0) Start the fetch, ensure that it can move with the joystick controls.
10 |
11 | (1) Switch to fetch mode by calling `fetch_mode` on the command line. This will
12 | ensure that the `ROS_MASTER_URI` is the Fetch robot.
13 |
14 | (2) Be on the correct WiFi network. Then the master node (Fetch) is accessible.
15 |
16 | - Verify that `rostopic list` returns topics related to the Fetch.
17 | - Also verify that the teleop via keyboard script (via `rosrun ...`, see
18 | tutorials) is working, though sometimes even that doesn't work for me.
19 |
20 | (3) Then do whatever I need to do... for instance, simply running Ron's
21 | camera script (a single python file) works to continually see the Fetch's
22 | cameras. Finally!
23 |
24 | - Some python scripts might require a launch file to be running, such as the
25 | built-in disco.py and wave.py code. For these use `roslaunch [...] [...]`.
26 |
27 |
28 | TODO: figure out robot state? For Fetch-specific messages.
29 |
30 |
31 | ******************
32 | ** Older notes: **
33 | ******************
34 |
35 | Note that `PS1` is an environment variable that we can import, but the real key
36 | thing is to set ROS_MASTER_URI, that will let us connect to the Fetch. This does
37 | not happen by default, so we must export it in each new window (for now).
38 |
39 | Then I think we should do `rosrun [package] [script]` where I code stuff in
40 | [script] inside some package. But are Ron and Michael doing it in a similar way?
41 |
42 | Recommended order for development (NOT WORKING):
43 |
44 | - Code the script within some package
45 | - Compile the package with `catkin_make`
46 | - Another terminal, set `ROS_MASTER_URI` appropriately
47 | - In that same terminal, `source ./devel/setup.bash`
48 | - Finally, again in same terminal `rosrun ...` and enjoy
49 |
50 | I know when I set `ROS_MASTER_URI` and run `rostopic list` I get all the
51 | appropriate Fetch-related topics ... so why am I not able to access them in my
52 | code when calling `rosrun ...`?
53 |
54 | (If I don't set `ROS_MASTER_URI` and instead have it as the default, then I do
55 | not get any topics, of course. Note that according to documentation, roslaunch
56 | will START roscore if it detects that one doesn't exist!)
57 |
58 | Is there a launch file that I can use? I'm confused because `rostopic echo
59 | [...]` for the topics means I can see the output ...
60 |
61 |
62 | ***************************
63 | * Tutorial: Visualization *
64 | ***************************
65 |
66 |
67 |
68 | *******************************
69 | * Tutorial: Gazebo Simulation *
70 | *******************************
71 |
72 | At least this is clear:
73 |
74 | > Never run the simulator on the robot. Simulation requires that the ROS
75 | > parameter use_sim_time be set to true, which will cause the robot drivers to
76 | > stop working correctly. In addition, be sure to never start the simulator in a
77 | > terminal that has the ROS_MASTER_URI set to your robot for the same reasons.
78 |
79 | And it looks like I've installed the two packages necessary,
80 | `ros-indigo-fetch-gazebo` and `ros-indigo-fetch-gazebo-demo`.
81 |
82 | Run: `roslaunch fetch_gazebo simulation.launch` and the Gazebo simulator should
83 | show up! However, I've noticed if you exit, then try and run the simulator
84 | again, error messages may result? From looking up things online, it seems to be
85 | expected behavior. :-( Try CTRL+C in the same window to exit. I've been able to
86 | get `simulation.launch` to work fairly consistently, fortunately.
87 |
88 | For "Running the Mobile Manipulation Demo":
89 |
90 | The playground will get set up, just be patient. :-) It takes a few extra
91 | seconds due to a "namespace" error message, must be due to slow loading of
92 | data online. However, a playground _should_ eventually appear.
93 |
94 | Then the next part moves the Fetch throughout the Gazebo simulator. It's
95 | pretty cool. Doesn't work reliably, see GitHub issue I posted.
96 |
97 | I think this will be easier on a desktop since Gazebo also seems to be sensitive
98 | to the graphics card, though after this I fixed it so my laptop can access the
99 | separate GPU.
100 |
101 | How does the demo code work? Two commands:
102 |
103 | 1. roslaunch fetch_gazebo playground.launch
104 | 2. roslaunch fetch_gazebo_demo demo.launch
105 |
106 | Use `roscd [...]` to go to the package directory and look at `launch/` to find
107 | specific definitions. The first command runs the launch file with several
108 | readable arguments. The second one is more interesting, launch looks like:
109 |
110 | ```
111 | (the 19-line launch-file listing didn't survive the copy-paste; see
112 | launch/demo.launch in the fetch_gazebo_demo package for the actual XML)
113 | ```
131 |
132 | Four easy parts. What's odd, though, is that I can't find `demo.py` anywhere on
133 | my machine, but it's online at the repo:
134 |
135 | https://github.com/fetchrobotics/fetch_gazebo/blob/gazebo2/fetch_gazebo_demo/scripts/demo.py
136 |
137 | Might be another useful code reference as it's a clean stand-alone script,
138 | though with some MoveIt, etc., obviously.
139 |
140 |
141 |
142 | **************************
143 | * Tutorial: Robot Teleop *
144 | **************************
145 |
146 | This is pretty easy.
147 |
148 |
149 |
150 | ************************
151 | * Tutorial: Navigation *
152 | ************************
153 | **************************
154 | * Tutorial: Manipulation *
155 | **************************
156 |
157 | I ran both of these manipulation tutorials (hand-wavy thing and disco) and it
158 | works. I wasn't able to try out extensions.
159 |
160 |
161 |
162 | ************************
163 | * Tutorial: Perception *
164 | ************************
165 |
166 | Fetch exposes several "ROS topics" that we can subscribe to in order to obtain
167 | camera information. Unfortunately, I have yet to get call-backs to work ...
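
For reference, a minimal rospy subscriber sketch (the topic name here is my
guess; check `rostopic list` for the real one):

```
import rospy
from sensor_msgs.msg import Image

def callback(msg):
    rospy.loginfo("got image: %d x %d", msg.width, msg.height)

rospy.init_node("camera_listener", anonymous=True)
rospy.Subscriber("/head_camera/rgb/image_raw", Image, callback)
rospy.spin()  # forgetting this is a classic reason callbacks never fire
```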
168 |
169 |
170 |
171 | **************************
172 | * Tutorial: Auto-Docking *
173 | **************************
174 | *************************
175 | * Tutorial: Calibration *
176 | *************************
177 | **********************************
178 | * Tutorial: Programming-By-Demos *
179 | **********************************
180 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/HSR.text:
--------------------------------------------------------------------------------
1 | Notes on how to use the HSR. Use their Python interface (or we can do
2 | lower-level ROS stuff). Also, there's a built-in motion planner, so MoveIt! is
3 | not necessary. Ideally, we get a camera image, get the x and y values from the
4 | pixels, figure out z (the depth), and determine a rotation, and send it there.
5 |
6 | - Gazebo can be useful.
7 | - rviz is DEFINITELY helpful for debugging. Know it.
8 | - Calibration: ouch, unfortunately this will take a while and there are eight
9 | sensors to calibrate ... at minimum. The docs actually show a lot. I see a
10 | sensor (camera) on the hand as well.
11 | - Register positions, using the same image I see of black/white boxes, the
12 | "calibration marker jig".
13 |
14 | Monitor status: see 6.1 of the manuals. Setting up development PC/laptop,
15 | section 6.2. Not much else to write here. At least I can get rviz running with
16 | images. You need to hit the reset button and see the LEDs (not above the
17 | 'TOYOTA' text but everywhere else) turn yellow-ish.
18 |
19 | On my TODO list:
20 |
21 | - Figure out good test usage practices for rviz.
22 | - Get skeleton code set up for the HSR to:
23 | - process camera images
24 | - move based on those images (either base or gripper, or both)
25 | - Figure out a safe way to automatically move arms.
26 |
27 |
28 |
29 | ******************
30 | * Moving the HSR *
31 | ******************
32 |
33 | General idea with Python code, do something like:
34 | ```
35 | self.robot = hsrb_interface.Robot()
36 | self.omni_base = self.robot.get('omni_base')
37 | self.whole_body = self.robot.get('whole_body')
38 | ```
39 | where the `hsrb_interface` is code written by the Toyota HSR programmers,
40 | thankfully. That part is necessary for the robot to begin publishing stuff from
41 | its topics.
42 |
43 | Let's understand _base_ motion.
44 |
45 |
46 | Aerial view of the HSR. Assumes its head is facing north.
47 |
48 | ^
49 | |
50 | <--[hsr]-->
51 | |
52 | v
53 |
54 | Axes are:
55 |
56 | pos(x) for north, neg(x) for south.
57 | Also, (oddly) pos(y) for LEFT, neg(y) for right.
58 |
59 | I guessed `y` would go the other way, but this is actually the standard ROS convention (REP 103: x forward, y left, z up). The
60 | z stuff stays fixed (obviously). These are based on the (x,y,z) I get from
61 | `omni_base.get_pose()`. The rotations are in quaternions.
62 |
63 | FYI: When the robot starts up, it has some (x,y,z) position which should
64 | be set at (0,0,0) based on the starting position.
65 |
66 | Errors: unfortunately if you query the `omni_base.get_pose()` again and
67 | again, the values are still going to vary by something like 1-3mm, so
68 | there's always some error. Same with the dVRK.
69 |
70 | Rotations: clockwise from aerial view, the `z` decreases. Counterclockwise,
71 | it increases. The other three values in the quaternion don't seem to change,
72 | x==y==0 and w==1. We're only rotating about one plane for the base so this
73 | is expected. TODO: understand quaternions well.
74 |
75 |
76 | To clarify the above, understand `go_rel`:
77 |
78 | ```
79 | In [30]: omni_base.go_rel?
80 | Type: instancemethod
81 | String Form: <bound method ...>
82 | File: /opt/tmc/ros/indigo/lib/python2.7/dist-packages/hsrb_interface/mobile_base.py
83 | Definition: omni_base.go_rel(self, x=0.0, y=0.0, yaw=0.0, timeout=0.0)
84 | Docstring:
85 | Move base from current position.
86 |
87 | Args:
88 | x (float): X-axis position on ``robot`` frame [m]
89 | y (float): Y-axis position on ``robot`` frame [m]
90 | yaw (float): Yaw position on ``robot`` frame [rad]
91 | timeout (float): Timeout until movement finish [sec].
92 | Default is 0.0 and wait forever.
93 | ```
94 |
95 | Seems like indeed we should only control x and y, obviously. The interesting
96 | part is that `yaw` must represent the `z` in the quaternion, so rotations of the
97 | base imply changes in yaw only.
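
For example (a quick sketch using the `omni_base` handle from the snippet
further above):

```
import math

# nudge 0.1 m forward, 0.05 m left, and rotate 90 degrees counterclockwise
omni_base.go_rel(x=0.1, y=0.05, yaw=math.pi / 2)
```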
98 |
99 |
100 | Next, `whole_body`, allows more control. This is for the _end_effector_:
101 |
102 | ```
103 | In [38]: whole_body.get_end_effector_pose?
104 | Type: instancemethod
105 | String Form: <bound method ...>
106 | File: /opt/tmc/ros/indigo/lib/python2.7/dist-packages/hsrb_interface/joint_group.py
107 | Definition: whole_body.get_end_effector_pose(self, ref_frame_id=None)
108 | Docstring:
109 | Get a pose of end effector based on robot frame.
110 |
111 | Returns:
112 | Tuple[Vector3, Quaternion]
113 |
114 | In [39]: whole_body.get_end_effector_pose()
115 | Out[39]: Pose(pos=Vector3(x=0.2963931913608169, y=0.07800193518379123, z=0.6786170137933408), ori=Quaternion(x=0.7173120598879523, y=-7.000511757597367e-05, z=0.6967520358527196, w=-6.613377471335618e-05))
116 | ```
117 |
118 | This is relative to the base frame. So when we move the HSR, without moving
119 | the end-effector, the x,y,z stuff remains the same, as expected. BUT since
120 | the base frame has some fixed "reference rotation" then rotating base means
121 | the y and w quaternion components change; the x and z stay the same.
122 |
123 | We can also see joint names and their limits. Use `whole_body.joint_state`
124 | to get full details. There's lots of `whole_body.move_to[...]` methods that
125 | make it really convenient for research code.
126 |
127 | An alternative is to explicitly assign to these by publishing to the
128 | associated ROS topics, which might be more generally applicable to the
129 | Fetch and other robots (well, we change the topics ...).
130 |
131 |
132 | Finally, for the gripper itself, use `gripper`. We can command grasps, similar
133 | to the dVRK, and pass negative values for a tighter grip. :-)
134 |
135 |
136 | Other notes on moving the HSR:
137 |
138 | - It's possible to move in straight lines, arcs, etc.
139 | - Understand `tf` for resolving coordinate frames. TODO: later ... actually,
140 | might as well do this all in simulation (rviz) first to double check
141 | movements.
142 | - Also use rviz for visualizing coordinates. RGB = xyz axes.
143 | - Common coordinates: `map` for the overall map, `base_footprint` for the
144 | base of the HSR, `hand_palm_link` for the robot's hand (end-effector I
145 | assume, or "tool frame").
146 | - You can move both the base and arm together to get to a destination, can
147 | also weigh relative contribution.
148 | - Can move the hand based on force sensing, might be useful if we're running
149 | this automatically and need some environment feedback?
150 | - Avoid collisions by using the collision avoider they have. Looks really
151 | simple to use, they handle a lot for us.
152 |
153 |
154 | See Section 7.2.6 for more advanced coding, rather than using `ihsrb` which is
155 | like IPython. Oh, and later they actually have a YOLO tutorial. Nice!
156 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter11notes.txt:
--------------------------------------------------------------------------------
1 | **********************************************
2 | * NOTES ON CHAPTER 11: Practical Methodology *
3 | **********************************************
4 |
5 | This is sometimes neglected, but it shouldn't be! Their intro paragraph hits the
6 | core:
7 |
8 | > Successfully applying deep learning techniques requires more than just a good
9 | > knowledge of what algorithms exist and the principles that explain how they
10 | > work. A good machine learning practitioner also needs to know how to choose an
11 | > algorithm for a particular application and how to monitor and respond to
12 | > feedback obtained from experiments in order to improve a machine learning
13 | > system.
14 |
15 | Their running example is the Street View house number dataset and application,
16 | which is good for me since I only have minor knowledge of this material. The
17 | application is as follows: Cars photograph the buildings and address numbers,
18 | while a CNN recognizes the addresses based on photos. Then Google Maps can add
19 | the building to the correct location.
20 |
21 | Section 11.1: Performance Metrics
22 |
23 | Use precision and recall in the event that a binary classification shouldn't
24 | treat the two cases equally, e.g. with spam detection or diagnosing diseases.
25 | Precision is the fraction of detections that are correct, TP/(TP+FP), while
26 | recall is the fraction of truly relevant instances detected, TP/(TP+FN). A disease detector
27 | saying that everyone has the disease has perfect recall, but very small
28 | precision, equal to the actual fraction who have diseases. We can draw a PR
29 | curve, or use a scalar metric such as **F-scores** or **AUC**.
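
A toy numpy check of those definitions, using the everyone-has-the-disease
detector:

```
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])   # 3 of 8 actually sick
y_pred = np.ones_like(y_true)                  # detector flags everyone

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)   # 3/8, the disease prevalence
recall = tp / (tp + fn)      # 3/3 = 1.0, "perfect" recall
f1 = 2 * precision * recall / (precision + recall)
```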
30 |
31 | Section 11.2: Default Baseline Models
32 |
33 | This depends on the problem setting. Copy over previous work if possible.
34 |
35 | Start small-scale at first, with regularization and **early stopping**. (I
36 | forgot to do this for one project before adding it, and I'm glad I did.)
37 |
38 | Most of this should be obvious.
39 |
40 | Section 11.3: More Data?
41 |
42 | Regarding when to add more data, they suggest:
43 |
44 | > If the performance on the test set is also acceptable, then there is nothing
45 | > left to be done. If test set performance is much worse than training set
46 | > performance, then gathering more data is one of the most effective solutions.
47 | > [... after some regularization discussion ...] If you find that the gap
48 | > between train and test performance is still unacceptable even after tuning the
49 | > regularization hyperparameters, then gathering more data is advisable.
50 |
51 | Of course, in some domains such as medical applications, gathering data can be
52 | costly. Again, this is obvious.
53 |
54 | Section 11.4: Hyperparameters
55 |
56 | Do these manually or automatically. The manual version places special emphasis
57 | on finding a model with the right effective capacity for the problem at hand.
58 |
59 | As a function of a hyperparameter value, generalization curves often follow a
60 | U-shaped curve, with the optimal value somewhere in the middle. At the smaller
61 | end, we may have low capacity (and thus underfitting) and the other end may have
62 | high capacity (and thus overfitting). Though that depends on the low/high
63 | capacity assumption. Maybe this hyperparameter graph would be based on the
64 | hyperparameter of the total number of layers in a neural network. This is just
65 | an example, though. For applying weight decay, the curve might still be
66 | U-shaped, but the underfitting happens with high values, the overfitting happens
67 | with smaller values.
68 |
69 | Their main advice, and the one which agrees with my own experience, is that if
70 | there is ANY hyperparameter to tune, it should be the learning rate. Why? The
71 | effective capacity of the model is highest ... for a **correct** learning rate.
72 | Not when it's too large or too small. In general, the **training error** curve
73 | decreases as the learning rate rises toward the sweet spot ... then once it's
74 | barely too high, it SHOOTS UP, due to taking too-large gradient steps.
75 |
76 | What happens if your training error is worse than expected? Your best bet is to
77 | increase capacity. Especially with Deep Learning, we should be able to overfit
78 | to most training datasets, so try without regularization techniques.
79 |
80 | If the test error is worse than training, then the reason (at least with Deep
81 | Learning models with high capacity) is most likely due to generalization
82 | difference between test vs train error. Try regularization techniques.
83 |
84 | I **really like Table 11.1**, it outlines the effects of changing different
85 | hyperparameters. Study it well! Though I think I understood all of them; the one
86 | that might be newest to me is weight decay, but fortunately I somewhat
87 | understand it after reading through OpenAI's Evolution Strategies code.
88 |
89 | OK, next, **automatic hyperparameter search**. This includes **grid search**,
90 | best when we have three or fewer hyperparameters and we can test all points in
91 | the Cartesian product of the set of values. **Random search** can be better, as
92 | I know from CS 294-129. See Figure 11.2 for a comparison of grid search and
93 | random search.
94 |
95 | Typically, grid search values are chosen based on a logarithmic scale, or
96 | "trying every order of magnitude." If the best values are on a boundary point,
97 | shift the grid search. Sometimes we have to do coarse-to-fine, as Andrej
98 | Karpathy puts it. Random search can be cheaper and often more effective. Here,
99 | we have a marginal probability distribution for each hyperparameter, which we
100 | sample from to get hyperparameters. (Be careful about non-uniform distributions
101 | if we want to sample from a logarithmic scale, e.g. for learning rates that are
102 | 10^{-x}, we would do a uniform distribution sample on x.) Random search is more
103 | effective when there are hyperparameters which do not strongly affect the
104 | performance metric, which are considered wasteful for grid search.
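
For instance, a minimal sketch of the learning-rate case:

```
import numpy as np

rng = np.random.default_rng(0)
# sample learning rates log-uniformly over [1e-5, 1e-1]
lrs = 10.0 ** rng.uniform(-5, -1, size=20)
# a plain uniform draw over [1e-5, 1e-1] would almost never propose
# values near 1e-5, hence the uniform draw on the *exponent*
```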
105 |
106 | The section concludes with Bayesian hyperparameter optimization, but the
107 | authors find that it isn't yet reliably helpful for Deep Learning.
108 |
109 | Section 11.5: Debugging
110 |
111 | This is hard. :(
112 |
113 | Their example of an especially challenging bug is if the bias gradient update is
114 | slightly off. Then the other weights might actually be able to compensate for
115 | the error, to some extent. This is why you need a finite difference check, as we
116 | did for CS 231n, or use TensorFlow.
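
A minimal sketch of such a check (central differences; the quadratic test
function is my own choice):

```
import numpy as np

def numeric_grad(f, w, eps=1e-6):
    # central finite differences: (f(w+eps) - f(w-eps)) / (2*eps) per coord
    g = np.zeros_like(w)
    for i in range(w.size):
        w[i] += eps; fp = f(w)
        w[i] -= 2 * eps; fm = f(w)
        w[i] += eps  # restore
        g[i] = (fp - fm) / (2 * eps)
    return g

# compare with the analytic gradient, e.g. for f(w) = ||w||^2 / 2 it's w:
w = np.random.default_rng(0).standard_normal(5)
assert np.allclose(numeric_grad(lambda v: 0.5 * v @ v, w), w, atol=1e-4)
```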
117 |
118 | Visualize the model in action, visualize the worst cases, **fit a tiny dataset**
119 | (which I do), etc. Also, monitor histograms of activations and gradients, which
120 | might help detect gradient saturation.
121 |
122 | Yeah, actually I *do* use a lot of these techniques, though maybe I should add
123 | those histograms somewhere?
124 |
125 | Oh, they say that the magnitude of parameter updates should be roughly 1% of the
126 | magnitude of the parameters themselves. In some recent work, I see 5% for this
127 | quantity. Maybe I should aim to get that reduced?
128 |
129 | Section 11.6: Example of Multi-Digit Recognition
130 |
131 | Looks interesting. Here, coverage was the metric to optimize while fixing
132 | accuracy to be 98%. (Thus, accuracy is more important.) They got a LOT of
133 | improvement simply by looking at the worst cases and seeing that there was
134 | unnecessary cropping.
135 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter07notes.txt:
--------------------------------------------------------------------------------
1 | *********************************************
2 | * NOTES ON CHAPTER 7: Regularization for DL *
3 | *********************************************
4 |
5 | Again, this will be mostly review.
6 |
7 | Section 7.1: Parameter Norm Penalties.
8 |
9 | One piece of intuition is that biases don't need to be regularized because each
10 | bias affects only a single unit, whereas each weight couples two units (the two
11 | endpoints of its edge), so weights contribute more to overfitting.
12 |
13 | Good review for me, look at the math in Section 7.1.1 about L2 regularization.
14 | Assuming a quadratic cost function, we can show that weight decay rescales the
15 | optimal weight vector along the **axes** defined by the **eigenvectors** of H,
16 | the Hessian. This is good linear algebra review. Understand Figure 7.1 as well!
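
For my own reference, the punchline of that derivation: writing
H = Q \Lambda Q^T, weight decay with coefficient \alpha rescales each
component of the unregularized optimum w* along H's eigenvectors:

    \tilde{w}_i = (\lambda_i / (\lambda_i + \alpha)) w*_i

so high-curvature directions (large \lambda_i) are barely shrunk, while
low-curvature directions get pushed toward zero.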
17 |
18 | TODO: review the L1 regularization section. I must have seen this before but I
19 | can't remember, and it'd be good to know. But the TL;DR is that L1 encourages
20 | more sparsity compared to L2, so certain features can be discarded.
21 |
22 | (Some of the next sections are quite short and I didn't take notes. One insight
23 | is that the definition of the Moore-Penrose pseudoinverse looks like a
24 | regularization formula, with weight decay!)
25 |
26 | Other regularization strategies:
27 |
28 | - Dataset Augmentation, useful for object recognition, but be careful not to,
29 | e.g. flip the images if we're doing optical character recognition, since the
30 | classes could be altered. Be careful to augment *after* the train/test split,
31 | and also that when comparing benchmarks, that algorithms use the same
32 | augmentation.
33 |
34 | - Add noise directly to weights, sometimes seen in RNNs, or the targets, as in
35 | **label smoothing**.
36 |
37 | - Semi-Supervised Learning. Use both p(x) and p(x,y) to determine p(y|x).
38 | Example: PCA for the "unsupervised" projection to an "easier" space, and then
39 | a classifier built on top of that, so PCA is a pre-processing step. Yeah,
40 | makes some sense.
41 |
42 | - Multi-Task Learning. Think of this as different tasks having the same input
43 | but different output, **AND** having a common "intermediate" step, or latent
44 | factor. We need that last condition because otherwise we're not sharing
45 | parameters across tasks (i.e. across different targets). I haven't really done
46 | much work with multi-task learning, but I bet I will in the future!
47 |
48 | - Early Stopping. Ah yes, this sounds dumb but it works. Often, training error
49 | will continue decreasing and asymptote somewhere, but our validation error can
50 | decrease initially, but then **increase**. We want to stop and return the
51 | weights we had at the time just before the validation error began to increase.
52 | Huh, the authors even say it's the most popular form of regularization, I
53 | guess because it comes naturally to beginners. There's some slight costs to
54 | (a) testing on the validation set, and (b) storing weights periodically, but
55 | from my experience those are minor. They continue to elaborate that if we want
56 | to use the validation set, we can do early stopping, *then* include all the
57 | data. (This seems overkill to me.) They conclude early stopping by showing
58 | mathematically how it acts as a regularizer.
59 |
60 | - Parameter Tying and Parameter Sharing. These try to make certain parameters
61 | close to each other, so the regularizer could be || w(a) - w(b) ||_2 where
62 | w(a) and w(b) are weights in two different layers. However, I think the more
63 | popular view is to have them be **equal**, and hence have parameter
64 | **sharing** instead of tying, which has the added advantage of memory savings.
65 | This is precisely what happens in CNNs (and RNNs!).
66 |
67 | - Sparse Representations. Here, for some reason, we're focused on
68 | **representational sparsity**. This means our DATA is considered to have a new
69 | representation which is sparse. This is *not* the same as **parameter
70 | sparsity**, which the L1 regularization on the parameters would have enforced.
71 | This arises out of putting penalties on the activations in the NN. However,
72 | I'm not really sure I follow this and it doesn't seem to be as important as
73 | other techniques.
74 |
75 | - Bagging and Ensembles. Train several different models (independently), then
76 | have them vote. It works well when the models do not make the same test
77 | errors. We can quantify this mathematically by computing the expected error
78 | and expected squared error. One way to do this is with bagging, which will
79 | sample k different **datasets**, formed by sampling with replacement the
80 | original data, so with high probability we'll get different datasets each time
81 | (with some data points repeated, of course, and others missing).
82 |
83 | - Dropout. This can be viewed as noise injection, FYI, **and** as a form of
84 | bagging and ensemble learning. Man, it's really clever. PS: remember how it
85 | works, we remove (non-output!) **units**, NOT the edges (though it could be
86 | done that way, I think). Edges are automatically removed when their units are
87 | removed. In code, of course, we just multiply by zero. Remember:
88 |
89 | > Each time we load an example into a minibatch, we randomly sample a
90 | > different binary mask to apply to all of the input and hidden units in the
91 | > network. The mask for each unit is sampled independently from all of the
92 | > others. The probability of sampling a mask value of one (causing a unit to
93 | > be included) is a hyperparameter fixed before training begins. It is not a
94 | > function of the current value of the model parameters or the input example.
95 |
96 | There is some discussion about how to predict or do inference with ensemble
97 | methods. The authors mention some obscure geometric mean trick, but
98 | fortunately, with dropout we can do one forward pass and scale by the dropout
99 | parameter. (Or we can divide by the keep probability during training instead,
100 | i.e. "inverted dropout"; see the sketch just after this list.)
101 |
102 | This is actually **not** exact even in expectation, due to the
103 | non-linearities, but it works well in practice.
104 |
105 | Dropout goes beyond regularization interpretations:
106 |
107 | > [...] there is another view of dropout that goes further than this. Dropout
108 | > trains not just a bagged ensemble of models, but an ensemble of models that
109 | > share hidden units. This means each hidden unit must be able to perform well
110 | > regardless of which other hidden units are in the model.
111 |
112 | It looks like we have redundancy, which is good.
113 |
114 | - Adversarial Training. You knew this was coming. :) We get those adversarial
115 | examples, and then use that to improve our classifier. See Goodfellow's papers
116 | for details. There are caveats, though, and I believe even with training on
117 | adversarial examples, such a model still has *new* adversarial examples. I
118 | might have to re-read those papers. Goodfellow showed that one cause for
119 | adversarial examples is excessive linearity. They can also be considered
120 | semi-supervised learning, which we talked about earlier in the chapter.
121 |
122 | - Tangent {Distance, Prop, Manifold Classifier}. These relate to our assumption
123 | that the essence of the data lie in lower-dimensional manifolds. The
124 | regularization here is that f(x) shouldn't change much as x moves along its
125 | manifold. I don't really think these are important for me to know right now,
126 | but I remember studying these a bit for the prelims.
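
Here is the inverted-dropout sketch promised in the dropout item above (the
keep probability is an arbitrary example value):

```
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
h = rng.standard_normal(100)   # some hidden-layer activations

# train time: zero out units, then rescale the survivors by 1/keep_prob
mask = (rng.random(h.shape) < keep_prob) / keep_prob
h_train = h * mask

# test time: use h unchanged; E[h_train] already matches h
```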
127 |
128 | Whew, some of these were new actually, or at the very least I got a better
129 | understanding of them. Note that batch normalization (which might make dropout
130 | unnecessary) is discussed in the **next** chapter, not this one.
131 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter12notes.txt:
--------------------------------------------------------------------------------
1 | *************************************
2 | * NOTES ON CHAPTER 12: Applications *
3 | *************************************
4 |
5 | There's a LOT of them! Recall the 2016 publication date, so anything after that
6 | won't be here (e.g., the Transformer architecture, other DeepRL stuff?).
7 |
8 | 12.1: Large-Scale Deep Learning
9 |
10 | Nice discussion about how the video game community spurred the development of
11 | graphics cards, and how the characteristics of graphics card ended up being
12 | beneficial for the kind of computations used in deep learning. Actually, why?
13 |
14 | - We need to perform many operations in parallel (and these are often
15 | independent of each other, hence parallelization is easier).
16 | - Less 'branching' compared to the workload of a CPU.
17 | - GPUs have memory and data can be put on there, whereas the data is too large
18 | for most CPU caches.
19 |
20 | They got more popular after more general-purpose GPUs were available that
21 | could do stuff other than rendering, and NVIDIA's CUDA lets us implement those
22 | using a C-like language. But, it's very hard to write good CUDA code (not the
23 | same as writing good CPU code). Good news: once someone does it, we should
24 | refer to those libraries.
25 |
26 | - Data parallelism: easy for inference since we have models run on different
27 | machines. But for training, use Hogwild!. (We can alternatively increase the
28 | batch size for one machine, but we don't get the advantage of more frequent
29 | gradient updates versus HogWild!.)
30 | - Model parallelism: each machine runs a different part of the model. (Huh, I
31 | don't think I'll do this, we'd need a super large network?)
32 | - Model compression: mentions Hinton's knowledge distillation. :-)
33 |
34 | We can do a lot with *dynamic structure*: this means we might use different
35 | components of the network for a given computation. For example, have a gated
36 | network which picks one of several expert networks to use for evaluation.
37 | (Results in soft or hard mixture of experts, depending on (as expected) whether
38 | the 'gater' outputs a soft weighting or a single hard weighting, like a one-hot
39 | vector of weights.) Even simpler: decision trees.
40 |
41 | Efficient hardware implementations: doesn't discuss Tensor Processing Units
42 | (TPUs) but those came out after this book, I think.
43 |
44 | 12.2: Computer Vision
45 |
46 | Pre-processing: make sure it's consistent, doesn't have to be fancy. Often
47 | scaling to [-1,1] or [0,1] suffices. Heck they say there are CNNs that can
48 | dynamically adjust to take images of different sizes, but I find it easiest to
49 | always keep a fixed scale.
50 |
51 | Examples: *contrast normalization*, and *whitening*. I think contrast
52 | normalization is like the (X - np.mean(X)) / (np.std(X) + eps) that we've often
53 | done in computer vision tasks. Whitening is another story about *rescaling
54 | principal components to have equal variance*.
55 |
56 | Actually this is a short section. I'm surprised there wasn't an overview on
57 | classification, detection, segmentation, and other computer vision problems.
58 | It's mostly about how data is processed. See CS 231n for details on the actual
59 | tasks.
60 |
61 | 12.3: Speech Recognition (ASR with 'Automatic' in it)
62 |
63 | (Not a subsection of NLP, despite ASR being part of my NLP class at Berkeley)
64 |
65 | Find the most probable linguistic sequence y given input acoustic sequence X.
66 | I.e.: argmax_y P(y|X). Before 2012, state of the art systems used Hidden Markov
67 | Models and Gaussian Mixture Models.
68 |
69 | Use "TIMIT" for benchmarking, the MNIST of ASR so to speak.
70 |
71 | Not much detail here, unfortunately, besides that Restricted Boltzmann Machines
72 | (RBMs) were among the ingredients for the resurgence of Deep Learning in ASR.
73 | But now they are not used. :) I wonder if Transformers are used in ASR now? I
74 | haven't been following the literature and the section is too short for a proper
75 | treatment.
76 |
77 | 12.4: Natural Language Processing
78 |
79 | Largely based on *language models* and treating *words* as the distinct unit,
80 | and then modeling language as probability of a next word given an existing
81 | sequence of words. Know *n-gram*, modeling conditional probability of a word
82 | based on the preceding n-1 words. Unigrams, bigrams, and trigrams use 1, 2, and
83 | 3 as n.
84 |
85 | - But recall my NLP class: hard to use raw counts for computing conditional
86 | probabilities, because many counts are zero.
87 | - Thus use smoothing (see the toy example just below).
88 | - But still many 'curse of dimensionality' challenges with classical n-gram
89 | models.
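
A toy illustration of the smoothing point above (add-one / Laplace
smoothing; the corpus is made up):

```
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)   # vocabulary size

def p_next(w, prev):
    # P(w | prev) with add-one smoothing: never zero, even if unseen
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

p_next("cat", "the")   # (2 + 1) / (3 + 6) = 1/3
```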
90 |
91 | Neural language models: allow us to say that two words are similar, but they
92 | are distinct, and they show word embeddings. I think they are suggesting
93 | getting word embeddings by predicting the context given the center word, or
94 | predicting the center word given context (like we did in 182/282A). But
95 | regardless, it's good to have embeddings, since instead of representing words
96 | as one hot vectors, we use lower dimensional representations with Euclidean
97 | distance to get similarity. This is analogous to a CNN's hidden layer output
98 | giving us an image embedding.
99 |
100 | Issue with high-dimensional outputs: if our model needs to produce words (e.g.,
101 | probability of next word given existing text) then naively a softmax over all V
102 | words in the vocabulary means we need a huge matrix to represent this
103 | operation and to train it (assuming naive cross-entropy loss).
104 |
105 | - Naive fix: use a 'short list' of most frequent words only. But that is
106 | counter to what we actually want!
107 | - Slightly better: *hierarchical softmax*. Now predict categories of words, and
108 | then predict more specific categories, etc. But performance of actual model
109 | often not that great, and hard to get the most likely word in a given
110 | context.
111 | - Importance sampling: the logic for this approach is that the gradient of the
112 | softmax can be broken up into the positive and negative phases (interesting
113 | intuition, I'd thought about it but was good to see them explicitly state
114 | it). The negative phase is an expectation, and we can use (biased) importance
115 | sampling.
116 | - Noise-contrastive estimation is another option, but see Chapter 18 for a
117 | fuller treatment.
118 |
119 | Interesting contrast with neural nets and n-grams: the latter are much faster
120 | for look-up operations with hash tables.
121 |
122 | Neural machine translation: recall the encoder-decoder architecture, where the
123 | encoder reads the sentence and produces a data structure called a "context"
124 | that contains "relevant information" somehow. Advantage of an RNN for
125 | encoders/decoders is that we can process variable-length sequences.
126 |
127 | They cite a paper by Jacob Devlin from 2014 who beat state of the art models by
128 | using a MLP. Heh, he would later be the first author on the 2018 BERT paper.
129 |
130 | They conclude with a brief discussion on some of the earlier attention models
131 | in Deep NLP. A lot more has happened since then!
132 |
133 | 12.5: Other Applications
134 |
135 | - Recommender systems and collaborative filtering. Actually this leads them to
136 | talk about contextual bandits, which as we know are an intermediate between
137 | the k-armed bandit case and the full RL problem. Why contextual bandits here?
138 | Because if recommender systems only give users the best item according to its
139 | model, there is no 'exploration' of other items that might be even better.
140 |
141 | Also, it's an intermediary because bandits = no state, basically. The normal
142 | RL problem means the action directly changes the next state.
143 |
144 | - Knowledge representation, reasoning, and question answering. Interesting
145 | topics, but for now not part of my direct research agenda.
146 |
--------------------------------------------------------------------------------
/How_People_Learn/Part_02_Learners_and_Learning.txt:
--------------------------------------------------------------------------------
1 | Part 2: Learners and Learning
2 |
3 |
4 | Chapter 2: How Experts Differ from Novices
5 |
6 | Very important:
7 |
8 | - As implied in the previous chapter, what distinguishes experts from novices
9 | isn't necessarily factual knowledge (nor is it ability or intelligence), more
10 | as it is about better connections among concepts, and the ability to
11 | "conditionalize" knowledge. This means being able to know what areas/concepts
12 | are needed for a specific task, rather than trying out everything.
13 |
14 | - (Related) Experts have more fluent knowledge retrieval, so they better know
15 | what applies to specific tasks. This means their memory is not taxed trying to
16 | figure out what would apply. Organization is more efficient; novices may
17 | retrieve knowledge in a slow, sequential manner.
18 |
19 | - Experts recognize (and are more sensitive to) meaningful patterns across many
20 | fields. Example: with chess, if you randomize the pieces, the experts don't
21 | really remember those locations any better than novices, but if the pieces are
22 | arranged as they might be in a real game situation, the expert can pick up
23 | patterns and remember the location of pieces far better than novices can.
24 |
25 | - Different styles of experts: "artisans" vs "virtuosos". The former are experts
26 | in one field but the latter are also experts and, moreover, have the desirable
27 | property of "active learning" so they are experts at learning about new
28 | things. This requires metacognition, as discussed in the first chapter.
29 | Educational programs need to be designed to encourage the development of
30 | virtuosos.
31 |
32 | Also important:
33 |
34 | - Cool example with physics: experts organize problems in a way that reflects
35 | deeper, fundamental ideas, whereas novices will organize problems if they look
36 | similar (e.g., have the same drawings of triangles).
37 |
38 | - Being an expert at a subject is NOT the same as being an expert at teaching.
39 | An expert teacher will better understand when students might get stuck. Yeah,
40 | this is a widely agreed-upon fact.
41 |
42 | Stuff I didn't remember:
43 |
44 | :-)
45 |
46 |
47 | Chapter 3: Learning and Transfer
48 |
49 | Very important:
50 |
51 | - You could argue that the ultimate goal of teaching is better transfer
52 | learning, or how to efficiently use the knowledge from school and apply it to
53 | the real world. Also, the goal is not to immediately know how to do new tasks,
54 | but simply to increase the _speed_ at which these new tasks will be learned.
55 | Early performance attempts are less important since anyone is going to need
56 | some time to learn new stuff, so don't evaluate based on the first time,
57 | evaluate based on the length of the learning period.
58 |
59 | - All transfer learning (and learning itself, of course) starts from somewhere.
60 | Yeah, prior knowledge was emphasized in earlier chapters. Clearly, prior
61 | knowledge may help or hinder new learning. Examples: students incorrectly
62 | think that plants eat soil, that when they throw a ball in the air there is
63 | still "force from the hand pushing it" and so on.
64 |
65 | - For better transfer learning, we need to see the same concept in different
66 | contexts, so that we can understand the "abstract stuff" that is shared across
67 | tasks. That's better than remembering task-specific details (or "overly
68 | contextualized" knowledge in their jargon) that don't generalize.
69 |
70 | Also important:
71 |
72 | - Learning depends a lot on social background and culture, in addition to more
73 | factual, easy-to-define prior knowledge. Some cultures may discourage asking
74 | questions, for instance, which means if teachers expect to see questions, they
75 | might think a student is uninterested. There were also some differences noted
76 | between white and black families (but no biracial, Asian, etc., families ... sigh).
77 |
78 | - Speed of learning depends on deliberate practice and feedback. :-)
79 |
80 | Stuff I didn't remember:
81 |
82 | - (A bit silly that I didn't record this, but oh well ...) All learning takes
83 | time. You simply can't be an expert without investing the time. And moving
84 | on to more advanced subjects without knowing the basics is not ideal.
85 |
86 | - Oh, another obvious thing I didn't quite record: don't forget about
87 | motivation. What factors (social, etc.) motivate students? That's very
88 | important for speed of learning.
89 |
90 | - Amount of transfer depends on overlap among concepts, well roughly speaking.
91 | Yeah, another generally obvious thing.
92 |
93 |
94 | Chapter 4: How Children Learn
95 |
96 | Very important:
97 |
98 | - Even the very young (as in, months-old infants) exhibit signs of learning and
99 | knowledge, which contrasts with very early research claims. We have better
100 | tools for experimentation and to measure infants, since (for obvious reasons)
101 | it's not that easy to test on them. TL;DR young children are active,
102 | competent agents.
103 |
104 | - Children also pick up language and can quickly tell if stuff seems natural or
105 | unnatural. On a related note, parents need to read to their children, though
106 | some of this can be "picture" books.
107 |
108 | - Zone of proximal development: the gap between current abilities, and the
109 | abilities one could have with extra teaching assistance. (Or more accurately,
110 | 'potential' ... see the text for details.) It's the job of parents,
111 | caregivers, teachers, etc., to continue improving the students' skills so that
112 | this zone proceeds to the next natural stages.
113 |
114 | Also important:
115 |
116 | - Some cool stuff that infants know: they track consistency in numbers, so if
117 | they see groups of two, they relax, but if the next group has three things,
118 | they'll be more alert and sense something's different. Also, physics: infants
119 | somehow are able to tell that things will fall over without supports, and pay
120 | more attention on that (in rigorous experiments).
121 |
122 | - Children can naturally be interested in solving problems, it doesn't always
123 | have to be explicitly forced upon by a teacher. Also, lots of this depends on
124 | culture (again, this is obvious, but good to reiterate).
125 |
126 | Stuff I didn't remember:
127 |
128 | - "Privileged domains": physical and biological concepts, causality, number, and
129 | language. These are domains where infants show _positive_biases_ in learning,
130 | which makes sense from an evolutionary perspective.
131 |
132 | - Precise experimental techniques for detecting infant cues and preferences:
133 | non-nutritive sucking, habituation (i.e., infant "gets used to it" and stops
134 | responding to that cue), and visual expectation.
135 |
136 | - Infants can distinguish between animate and inanimate objects. Also, they're
137 | good at inferring from context.
138 |
139 | - There's a little bit about memory here, might be more in later chapters, but
140 | mostly about the strategy of clustering to improve memory performance. Also
141 | some discussion about how infants vs older children may have different memory
142 | strategies, and strategies get more effective with age (generally).
143 |
144 |
145 | Chapter 5: Mind and Brain
146 |
147 | Very important:
148 |
149 | - The mind is made up of neurons, with synapses and stuff (not going to get too
150 | technical here but you get the idea). These synaptic connections can be
151 | created and destroyed, generally in two ways: in youth, connections are created
152 | in huge swarms and then pruned away in similar amounts, kind of like sculpting;
153 | across the lifetime, they're continually created through learning by
154 | experience.
155 |
156 | - Don't fall for some of the hype you see in popular claims. :-)
157 |
158 | - Some discussion over difference between deaf and hearing ways of learning, the
159 | implication was that areas of the brain can be learned through experience.
160 | Also, learning organizes/restructures the brain.
161 |
162 | Also important:
163 |
164 | - Context matters. Different parts of the brain are ready to learn at different
165 | times.
166 |
167 | Stuff I didn't remember:
168 |
169 | - Eh, hopefully got the main points.
170 |
--------------------------------------------------------------------------------
/Functional_Programming/week1/week1_notes.txt:
--------------------------------------------------------------------------------
1 | ***************
2 | * Lecture 1.1 *
3 | ***************
4 |
5 | Primary objective: functional programming from first principles, not necessarily
6 | Scala but will learn the language. This is like learning a different programming
7 | paradigm.
8 |
9 | Scala: migration from C/Java to functional programming. Look at programming
10 | with "fresh eyes". Can integrate it with classical programming to give both of
11 | best worlds.
12 |
13 | Three paradigms:
14 |
15 | - imperative (Java and C), understand via instructions for Von Neumann computers
16 | - functional (Scala, or maybe Haskell is a better example)
17 | - logic
18 |
19 | We want to **liberate** ourselves from John von Neumann-style programming. John
20 | Backus argued for functional programming. So we must avoid conceptualizing
21 | instruction by instruction (or word by word) and move to a higher level of
22 | abstraction (?). Martin uses polynomial and string examples. For a polynomial,
23 | you don't want to define a class and be able to suddenly change coefficients
24 | (stored in the polynomial class). That would be wrong for the theory of math
25 | which deals with things like (a+b)x = ax+bx, not just modifying a and b
26 | directly.
27 |
28 | This analogy has some flaws but I think things will be clearer for me later when
29 | I progress.
30 |
31 | Consequence of theory of functional programming: NO MUTATIONS.
32 |
33 | This seems restrictive (no mutable variables, assignments, loops, or imperative
34 | control structures) but the focus is on functions, which are easier to work
35 | with in functional programming. Functions here will be "first class citizens" as
36 | they can be defined anywhere, including INSIDE other functions.
37 |
38 | I might check out Martin's book but probably not, I have too much to do, I'll
39 | focus on the lectures. =)
40 |
41 | Martin says functional programming has grown in popularity due to exploiting
42 | parallelism for multi-core and cloud computing. Is that why John Canny chose to
43 | use Scala for BIDMach and BIDMat? And since this is getting so important, I
44 | really have to finish this Coursera course!!!
45 |
46 | ***************
47 | * Lecture 1.2 *
48 | ***************
49 |
50 | (Most of this stuff in the first half of this video is familiar to me.)
51 |
52 | Interactive shell = REPL, read eval print loop. Just run `scala`, as I know. But
53 | don't use that, just use `sbt console`.
54 |
55 | The "substitution model" is key: all it does is reduce expressions to values,
56 | and this can be applied to all expressions so long as they have no side effects.
57 | This is lambda calculus! Foundation for functional programming. In fact Alonzo
58 | Church showed that it can express all programs, i.e. Turing Complete. I remember
59 | this a little bit.
60 |
61 | Example: a C++-style expression with a side effect (e.g. `i++`) cannot be
62 | expressed by the substitution model. That's why we don't have side effects in
63 | functional programming.
63 |
64 | To "do" the substitution model by hand, we have to explicitly substitute values
65 | and simplify, following specific rules. We can do this call by value or call by
66 | name. They have trade-offs: former only evaluates function arguments once,
67 | latter means function arguments are not evaluated if parameter is unused
68 | throughout the evaluation.
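
Here's a tiny worked reduction of my own (using a hypothetical `square`) to make
the trade-off concrete:

    def square(x: Int): Int = x * x

    // Call by value: evaluate the argument once, then substitute.
    //   square(1 + 2)  ->  square(3)  ->  3 * 3  ->  9
    // Call by name: substitute the unevaluated expression; evaluate on each use.
    //   square(1 + 2)  ->  (1 + 2) * (1 + 2)  ->  3 * (1 + 2)  ->  3 * 3  ->  9

Both strategies reach 9, but CBV evaluated `1 + 2` once while CBN did it twice.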
69 |
70 | ***************
71 | * Lecture 1.3 *
72 | ***************
73 |
74 | This provides more comparisons of CBN vs CBV, particularly as regards
75 | termination vs. non-termination.
76 |
77 | Here's an important "theorem": if CBV terminates, then CBN also terminates, but
78 | *not* vice versa.
79 |
80 | Here's a simple example (pseudocode):
81 |
82 | first(x,y)=x
83 |
84 | first(1, loop)
85 |
86 | Here, CBN terminates because it ignores the loop. However, CBV gets in an
87 | infinite loop.
88 |
89 | Despite this example, Scala uses CBV, but we can force CBN for a given
90 | parameter using `=>`, as they do in the next example, which shows how a CBV
91 | language can "get around" this problem by treating `y` as a special CBN
92 | parameter.
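
In Scala, a sketch of that example (with the non-terminating `loop` spelled out):

    def loop: Int = loop                   // evaluating this never terminates

    def first(x: Int, y: => Int): Int = x  // `y` is a by-name (CBN) parameter

    first(1, loop)                         // returns 1; `loop` is never evaluated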
92 |
93 | ***************
94 | * Lecture 1.4 *
95 | ***************
96 |
97 | Conditionals and value definitions, two more "syntax constructs."
98 |
99 | Standard if-else, but used for **expressions** not statements. What does this
100 | mean? I think it means the if-else itself produces a value, so we don't have to
101 | write a return statement. Actually that's a general rule for Scala! Generally, a
102 | legal Java expression is a legal Scala expression.
103 |
104 | Also have reduction rules, etc., such as && and ||. BTW those short-circuit
105 | evaluation, so they don't test the second argument if the first one determines
106 | the answer.
107 |
108 | There's a nice connection with CBV or CBN parameters: **definitions** can be CBV
109 | or CBN. The `def` is by name, the `val` is by value. So `def` must be evaluated
110 | upon each use, but `val` is evaluated at the point of its initialization. Oh,
111 | nice connection! =) Note that this is a loop but with effects dependent on how
112 | we use it:
113 |
114 | def loop: Boolean = loop
115 |
116 | With `def` we're OK (the definition alone evaluates nothing), but with `val`,
117 | the right-hand side is evaluated immediately, so it loops forever.
117 |
118 | Clever:
119 |
120 | def and(x:Boolean, y:Boolean) = if (x) y else false
121 |
122 | This is without using &&.
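
With this signature `y` is still call-by-value, so `and(false, loop)` would hang.
A sketch of my own by-name variant (plus the analogous `or`) that recovers real
short-circuiting:

    def and(x: Boolean, y: => Boolean): Boolean = if (x) y else false
    def or(x: Boolean, y: => Boolean): Boolean  = if (x) true else y

    and(false, loop)   // false; `loop` is never evaluated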
123 |
124 | ***************
125 | * Lecture 1.5 *
126 | ***************
127 |
128 | This is about defining square roots using Newton's method, so we have a
129 | non-trivial program. `def sqrt(x: Double): Double = { ... }`. He shows an
130 | example using Eclipse and its "session" functionality which is like a better
131 | version of the Scala command line (heh, like iPython is better than the Python
132 | interpreter). Use packages, even though it's not necessary here, because it
133 | keeps things ordered.
134 |
135 | Scala language note: explicit return types are not generally needed, but for
136 | *recursive* functions, we need them otherwise the compiler wouldn't be able to
137 | tell the return type. It's good practice to put the return type even if it's not
138 | needed.
139 |
140 | I see, I understand the code he wrote. Yes, it had problems with small/large
141 | numbers. I naively thought we should take logs and exponentials as needed, but
142 | in fact we only had to normalize our absolute difference so that the epsilon we
143 | chose, 0.001, is of the "appropriate value" rather than something too large or
144 | too small.
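
A sketch of the lecture's program as I remember it (exact constants and names may
differ); the division by `x` in `isGoodEnough` is the normalization fix above:

    def isGoodEnough(guess: Double, x: Double): Boolean =
      math.abs(guess * guess - x) / x < 0.001    // relative, not absolute, error

    def improve(guess: Double, x: Double): Double =
      (guess + x / guess) / 2                    // Newton's method update

    def sqrtIter(guess: Double, x: Double): Double =
      if (isGoodEnough(guess, x)) guess
      else sqrtIter(improve(guess, x), x)

    def sqrt(x: Double): Double = sqrtIter(1.0, x)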
145 |
146 | ***************
147 | * Lecture 1.6 *
148 | ***************
149 |
150 | In the last lesson, we defined several methods separately, but we don't want the
151 | user to access any of them except for the `sqrt` function. So we can nest all
152 | the other function definitions **inside** the overall `sqrt` definition. He used
153 | a *block*, delimited with curly braces.
154 |
155 | Visibility is what I would expect, i.e. stuff defined in a block is not visible
156 | outside it, and definitions from outside a block are visible inside it *so long
157 | as* they are not shadowed (or "over-written") by something inside with the same
158 | name. Yes, pretty obvious. OH, and it makes the square root function cleaner
159 | since we don't have to re-pass `x` as a parameter.
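
The same sketch, nested (again my reconstruction): only `sqrt` is visible, and
the helpers can refer to `x` directly instead of taking it as a parameter:

    def sqrt(x: Double): Double = {
      def isGoodEnough(guess: Double): Boolean =
        math.abs(guess * guess - x) / x < 0.001

      def improve(guess: Double): Double =
        (guess + x / guess) / 2

      def sqrtIter(guess: Double): Double =
        if (isGoodEnough(guess)) guess
        else sqrtIter(improve(guess))

      sqrtIter(1.0)
    }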
160 |
161 | Don't use semicolons unless we want more than one statement, as in:
162 |
163 | val y = x+1; y*y
164 |
165 | To split an expression over two lines, surround it with parentheses or write the
166 | operator at the end of the *first* line. But in BIDMach, we don't do that, we
167 | just write long expressions on one line. =)
168 |
169 | ***************
170 | * Lecture 1.7 *
171 | ***************
172 |
173 | Time to wrap up the first week by talking about *tail recursion*.
174 |
175 | But before that, some substitution formalism. (I'm not sure why this is
176 | important.) Then we did re-writing steps with Euclid's gcd function and the
177 | classical (recursive) factorial function.
178 |
179 | Rule: if a function calls itself as its last action, the function's stack frame
180 | can be reused. This is *tail recursion*, i.e. iteration, and it's good because
181 | we can run this in constant space. With the classic factorial, the last
182 | expression was n*factorial(n-1), meaning that the last action was not the
183 | recursive call itself but a larger expression with the `n*` around it.
184 |
185 | We can require that a function is tail-recursive by adding the `@tailrec` in the
186 | line above the method definition. Interesting!
187 |
188 | The last part of the lecture was about designing a tail-recursive version of
189 | factorial. Fortunately, I was able to figure this out. =)
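
For reference, a sketch of the accumulator version I came up with (the official
solution may differ):

    import scala.annotation.tailrec

    def factorial(n: Int): Int = {
      @tailrec
      def go(n: Int, acc: Int): Int =
        if (n == 0) acc
        else go(n - 1, n * acc)   // the recursive call is the last action
      go(n, 1)
    }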
190 |
191 | OK week 1 lectures done. Let's do the assignment.
192 |
--------------------------------------------------------------------------------
/Math_104_Berkeley/kenneth_ross_notes.txt:
--------------------------------------------------------------------------------
1 | ********************************************************************************
2 | * These are notes based on:
3 | *
4 | * Kenneth A. Ross
5 | * Elementary Analysis: The Theory of Calculus
6 | * Second Edition, 2013
7 | ********************************************************************************
8 |
9 |
10 | *************
11 | * CHAPTER 1 *
12 | *************
13 |
14 | I skimmed this chapter and I should know just about everything from it. It
15 | includes:
16 |
17 | - Natural numbers
18 |
19 | - Simple induction
20 |
21 | - Rational numbers (also the definition of an "algebraic number")
22 |
23 | - The "Rational Zeros" theorem, which might be useful if I need to find
24 | candidates for solving certain polynomial equations. This can also be used to
25 | prove that sqrt(2) is not a rational number, and several other numbers, mostly
26 | by doing some brute-force cases for checking all possible solutions. It's a
27 | bit boring to do that! Note: this theorem only applies to finding *rational*
28 | zeros of polynomials with *integer* coefficients. For a more general rule, use
29 | "Newton's method" or the "secant method."
30 |
31 | - The set of real numbers. Now we're getting into real stuff here! We also have
32 | the triangle inequality, blah blah blah ...
33 |
34 | - The Completeness Axiom. This is the assertion that "\mathbb{R} has no gaps"
35 | and is the key factor which distinguishes \mathbb{R} from \mathbb{Q}. (It's
36 | discussed in Section 4.4.) Among other things, this section discusses:
37 |
38 |   - The concepts of a minimum, maximum, and slightly more non-trivially, those
39 |     of an _infimum_ (greatest lower bound) and _supremum_ (least upper bound).
40 | For the latter two, I know clearly that sup S and inf S do not have to
41 | belong to S! Classic example: (a,b). I remember doing examples like these
42 | from MATH 305 at Williams College: basically, finding the infimums and
43 | supremums of sets. It's nothing too fancy. Man, I must have been a bad
44 | student back then!
45 |
46 | - The concepts of upper bounds, lower bounds, etc.
47 |
48 | - The completeness axiom (as I mentioned). This does _not_ hold for the
49 | rationals!
50 |
51 | Yeah, nothing too advanced here. I'm happy that at least this material is easy
52 | for me to understand and review.
53 |
54 | - The symbols +infinity and -infinity, which are useful but must be handled with
55 | care. Do not treat them as real numbers that can be plugged into theorems!
56 |   Note that it is also discussed that for nonempty, _bounded_ subsets A and B
57 |   of \mathbb{R}, sup(A + B) = sup A + sup B and the same relation for infimums
58 |   (a short proof sketch is written out at the end of this chapter's notes).
59 |   This might be useful in some statistics proofs if we are dealing with
60 |   multiple sets.
60 |
61 | - Useful to define sup S = +infinity if S is not bounded above, etc.
62 |
63 | - The last section is a "Development of \mathbb{R}" and it's probably not that
64 | useful for me.
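
Proof sketch for the sup(A + B) fact above (standard argument, from memory): for
any a in A and b in B we have a + b <= sup A + sup B, so sup(A+B) <= sup A +
sup B. Conversely, given epsilon > 0, pick a in A with a > sup A - epsilon/2 and
b in B with b > sup B - epsilon/2; then a + b > sup A + sup B - epsilon. Since
epsilon was arbitrary, sup(A+B) >= sup A + sup B, giving equality.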
65 |
66 |
67 | *************
68 | * CHAPTER 2 *
69 | *************
70 |
71 | This is about sequences and is hugely critical to understanding the rest of the
72 | book, and for real analysis in general.
73 |
74 | Section 2.7
75 |
76 | - Sequences are just a function from an index to some value.
77 |
78 | - We formally define _limits_, _convergence_, and _divergence_. See the
79 | textbook. I won't belabor the point here. Side note: limits are unique (prove
80 | this by assuming two limits, then showing that |s-t| is less than epsilon
81 | using the definitions and then the triangle inequality). Side note 2:
82 | oscillations (as in, (-1)^n) do not converge!
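
Writing out the uniqueness argument (standard, from memory): suppose lim s_n = s
and lim s_n = t. Given epsilon > 0, choose N so that n > N gives both
|s_n - s| < epsilon/2 and |s_n - t| < epsilon/2. By the triangle inequality,
|s - t| <= |s - s_n| + |s_n - t| < epsilon. Since epsilon > 0 was arbitrary,
s = t.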
83 |
84 | Section 2.8
85 |
86 | - A discussion on proofs! When proving limits, we should invoke the formal
87 | definition and find n and epsilon s.t. the definition of a limit holds.
88 |
89 | - There are several interesting examples. I did a few of them quickly. I don't
90 | think I will ever have to invoke these directly any time soon (I'm mostly
91 | reading this section so that the more important parts later are clearer to
92 | me).
93 |
94 | - Exercise 8.5 is interesting, the "squeeze lemma" and I remember Professor
95 | Mihai Stoiciu talking about this during office hours (heh, we never had office
96 | hours _in_ his office since there were so many people!).
97 |
98 | Section 2.9
99 |
100 | - Limit theorems for sequences. I can invoke these pretty easily. I will again
101 | be skimming the proofs.
102 |
103 | - Oof, there's a lot of them. Mostly they involve similar techniques such as
104 | working backwards and solving for the tightest bounds, so we get the lowest
105 | value N such that the statement: "when n > N we get |s_n - s| < epsilon" is
106 | true. We have to sometimes develop upper bounds, and often have to use epsilon
107 | times some constant so that the later algebra gets it equal to epsilon. I've
108 | seen this stuff many times.
109 |
110 | Section 2.10
111 |
112 | Monotone Sequences and Cauchy Sequences. These help us conclude convergence of
113 | sequences _without_ knowing limits in advance.
114 |
115 | - Monotone sequences are those which are always increasing or always decreasing.
116 | They _can_ converge, if the rate of increase (respectively, decrease) slows to
117 | zero, think of 1/x for x>0 as x grows large.
118 |
119 | - Important Theorem I (10.2 in the book): All bounded monotone sequences
120 | converge.
121 |
122 |   - Proof: let u be the supremum of the bounded sequence, so then we just show
123 |     lim s_n = u. We start by fixing an epsilon (as usual), then we have to find
124 |     some N such that for all n > N, we get |s_n - u| < epsilon. Well, (s_n) is
125 |     increasing, so we just need an N with s_N > u - epsilon, and then that
126 |     automatically proves the statement (written out after this section's
127 |     notes). Yay! The proof is short and elegant. Again, it just relies on
128 |     proving the limit statement!!
128 |
129 |   - There's a related theorem which shows that if the sequence is unbounded,
130 |     then it diverges to +infinity or -infinity. (This is assuming monotone,
131 |     because otherwise you can have oscillations out to infinity, which would
132 |     mean something different I guess.) Thus, limits of monotone sequences
133 |     always have meaning.
134 |
135 | - Important Theorem II (10.11 in the book): a sequence is a convergent sequence
136 | IFF it is a Cauchy sequence.
137 |
138 |   - Proof: well, they did one direction earlier and it makes sense. The other
139 |     direction also makes sense. In both cases we simply start with the
140 |     definition and try to prove the property. These can be tricky to come up
141 |     with. Mostly it's about making sense of sup-s and thinking of "stuff plus
142 |     epsilon."
143 |
144 | - Uses Definition 10.8 which defines a _Cauchy_sequence_, a sequence has this
145 | property if for each epsilon > 0 there exists N such that (m,n) both greater
146 | than N implies |s_n - s_m| < epsilon.
147 |
148 | - Why is it useful? Because we can confirm that a sequence converges by
149 | verifying that it satisfies the Cauchy sequence property. We do not have to
150 | explicitly compute a limit in this case!
151 |
152 | - There's an interlude about discussions of decimals, but it's not likely to be
153 |   much of concern to me. Don't forget about the geometric series convergence
154 |   formula! For |r| < 1, the sum of a*r^k over k >= 0 is a/(1-r).
155 |
156 | - There is also discussion on lim sup and lim inf. A sequence has a limit if and
157 |   only if its `lim inf` and `lim sup` are equal. Also, lim sup is NOT generally
158 |   sup{s_n for all n}, because as N grows large, the tail set {s_n : n > N} whose
159 |   sup we take gets smaller, hence the correct relationship is lim sup <= sup.
160 |   Also, it's these lim inf and lim sup concepts which motivate the Cauchy
161 |   sequence definition (see my notes above).
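
The Theorem 10.2 proof, written out (my reconstruction of the standard argument):
let (s_n) be increasing and bounded, and set u = sup{s_n : n in N}. Fix
epsilon > 0. Since u - epsilon is not an upper bound, there is an N with
s_N > u - epsilon. For n > N, monotonicity gives u - epsilon < s_N <= s_n <= u,
so |s_n - u| < epsilon. Hence lim s_n = u.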
162 |
163 | Section 2.11
164 |
165 | Subsequences!!
166 |
167 | - I know the definition, obviously. You can also view it as defined by a
168 | "selection function." This point of view is probably useful if you are trying
169 | to _extract_ "interesting" indices within the overall sequence.
170 |
171 | - IMPORTANT: Theorem 11.2. This states three facts about subsequences.
172 |
173 | (I don't quite follow?)
174 |
175 | Section 2.12
176 |
177 | TODO
178 |
179 |
180 | *************
181 | * CHAPTER 3 *
182 | *************
183 |
184 | TODO
185 |
186 |
187 | *************
188 | * CHAPTER 4 *
189 | *************
190 |
191 | TODO
192 |
193 |
194 | *************
195 | * CHAPTER 5 *
196 | *************
197 |
198 | TODO
199 |
200 |
201 | *************
202 | * CHAPTER 6 *
203 | *************
204 |
205 | TODO
206 |
207 |
208 | *************
209 | * CHAPTER 7 *
210 | *************
211 |
212 | TODO
213 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter05notes.txt:
--------------------------------------------------------------------------------
1 | ***********************************************
2 | * NOTES ON CHAPTER 5: Machine Learning Basics *
3 | ***********************************************
4 |
5 | Again, I expect that this will be almost entirely review. Here are some stuff
6 | which I didn't already have down cold:
7 |
8 | - The chapter starts off with Tom Mitchell's famous definition of machine
9 | learning, and then it goes through examples of tasks, experiences, and
10 | performance metrics. There isn't a whole lot new here. Maybe a good insight is
11 | to think of the tasks of (a) density estimation and (b) synthesis/sampling
12 |   (e.g. with GANs) as the task of modeling densities explicitly (a) versus
13 |   implicitly (b). Then for experiences, the key is to understand unsupervised
14 | vs. supervised learning, but the line between the categories is blurred, and I
15 | like their examples of how the problems can be converted to each other
16 | (Equations 5.1 and 5.2). Think of unsupervised as estimating p(x), supervised
17 | as estimating p(y|x), since we have our labels y in the latter case. They use
18 | linear regression as an example, and the "learning algorithm" consists of
19 | literally solving the normal equations. One step, no iterative updates!
20 |
21 | - We can use statistical learning theory to tell us how algorithms generalize.
22 | It's easiest if we assume IID, then the train/test errors are equal under
23 | expectation **if we chose a random model**, i.e random weights. In general,
24 | though, we optimize the training error, and **then** test, so the test error
25 | is at least as high as training error. The two central factors contributing to
26 | under/over-fitting are (1) training error, (2) gap between training and
27 | testing error. (This is covered again later in Chapter 11 on practical usage.)
28 | We can partially control under/over-fitting by controlling a model's
29 | **capacity**. E.g., for linear regression, add higher order terms, and
30 | capacity increases, but overfitting occurs with more parameters than examples.
31 |
32 | - Quantifying model capacity with classical measures, such as VC dimension, is
33 | rarely used in Deep Learning.
34 |
35 | - We can also think of **non-parametric** models as having arbitrarily high
36 | capacity. However, practical algorithms will rely on some form of constraints,
37 | e.g. nearest neighbors' complexity depends on the data.
38 |
39 | - **Expected** generalization error can never increase as training data grows.
40 |
41 | - Use **weight decay** (i.e. L2 regularization) to prefer lower magnitude weight
42 | vectors as solutions.
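
  The criterion being minimized (as in the book, for linear regression; lambda
  controls the preference strength):

      J(w) = MSE_train + lambda * w^T w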
43 |
44 | - With hyperparameters, don't tune them on the training data because that will
45 | cause preference towards overfitting. Tune on **validation sets**. If our data
46 | is too small, **use k-fold cross validation** to get better estimates of
47 | generalization error.
48 |
49 | - With bias/variance discussion, don't forget that the sample variance (for
50 | Gaussians) is actually **biased**, we need the n-1 correction for the
51 | **unbiased** version.
52 |
53 | - Don't forget the difference between **variance** and **standard error** w.r.t.
54 | **an estimator**. Here, the standard error is the square root of the variance,
55 | and both are computed based on empirical data (which is why I don't think we
56 | call it "standard deviation"). They say:
57 |
58 | > Unfortunately, neither the square root of the sample variance nor the square
59 | > root of the unbiased estimator of the variance provide an unbiased estimate
60 | > of the standard deviation. Both approaches tend to underestimate the true
61 | > standard deviation, but are still used in practice. The square root of the
62 | > unbiased estimator of the variance is less of an underestimate. For large m,
63 | > the approximation is quite reasonable.
64 |
65 | We use standard error often when writing out confidence intervals.
66 |
67 | They argue that increasing model capacity (at least under MSE for computing
68 | generalization error) generally increases **variance** but decreases **bias**.
69 | The reason is that variance here is based on samples where the "samples" are
70 | in fact training data sets. (The training set **is** the random variable,
71 | according to their Equation 5.47 definition.) Thus, with a new sample of the
72 | training data, we'll get different results since the model overfits. But under
73 |   **expectation** over all draws of training datasets, the bias is low.
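
  The decomposition behind this trade-off (book notation, for an estimator
  \hat{theta}_m of theta under MSE):

      MSE = E[ (\hat{theta}_m - theta)^2 ]
          = Bias(\hat{theta}_m)^2 + Var(\hat{theta}_m)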
74 |
75 | - How did we **obtain** the estimators we just talked about? It's simple, MLE.
76 | And before reading Goodfellow's tutorial on GANs, I don't think I viewed MLE
77 | as minimizing a KL divergence. This is yet another reason why we like it.
78 | Another reason is, as I know from the AI prelims review, the MLE view of
79 | **conditional** log likelihood, where p(y|x) is modeled as a Gaussian, results
80 | in the same solution (obtained via maximizing likelihood) as the linear
81 | regression case with MSE loss.
82 |
83 | - Then the chapter talks about **Bayesian statistics**. To measure uncertainty
84 | of the estimator, the Frequentist approach uses the variance, but the Bayesian
85 | approach suggests to integrate instead. I also remember their example with
86 | Bayesian linear regression, we have to combine p(y|X,w)*p(w) but those are
87 | both exponentials and they multiply to result in another exponential which can
88 | be rearranged in the form of another Gaussian. If we want a single point
89 | estimate instead of a distribution, use **MAP estimates**. But why not just do
90 | the Frequentist MLE approach? Because MAP estimates retain *some* benefit of
91 | the Bayesian approach. That's the intuition, I guess.
92 |
93 | - Review:
94 |
95 |       theta_MAP = argmax_theta p(theta|x)
96 |                 = argmax_theta p(theta) p(x|theta)   // argmax unchanged by 1/p(x)
97 |                 = argmax_theta [ log p(theta) + log p(x|theta) ]
98 |
99 |   (and for the MLE Gaussian, Frequentist case)
100 |
101 |       theta_ML = argmax_theta \prod_i p(y_i|x_i, theta)
102 |                = argmax_theta \sum_i \log p(y_i|x_i, theta) // These are Gaussians
103 |
104 | - **Supervised Learning Algorithms**. The authors start by generalizing linear
105 | regression into logistic regression, as expected. Not much new here. With
106 | logistic regression, we no longer have a closed-form solution for the optimal
107 | weights, which is why gradient descent helps.
108 |
109 | - PS: Don't forget **SVMs**. I've forgotten some of it due to its lack of
110 | exposure in Deep Learning. The key innovation here is the kernel trick, of
111 |     course (helps us model functions nonlinear in x, efficiently). The SVM function
112 | is nonlinear w.r.t. the data, but it's **linear** w.r.t the coefficients
113 | \alpha. The \alpha here is mostly zeros, so as to reflect only points on the
114 | boundary close to the current sample of interest.
115 |
116 | - But note that SVMs and kernel machines in general struggle to generalize
117 | well, and Deep Learning is precisely designed to improve upon that.
118 |
119 | - Another common algorithm, **k-nearest neighbors**. In fact, there is not
120 | even a training or a learning stage for this (nonparametric) method. Yet
121 | another one, **decision trees**.
122 |
123 | - Note, p.144 missing a figure in my PDF version? TODO check.
124 |
125 | - **Unsupervised Learning Algorithms**. Examples: PCA and K-Means Clustering.
126 | PCA can be viewed as a data compression algorithm, or one which learns a
127 | "useful" representation of data (perhaps as "simple" as possible, to identify
128 | independent sources of variation which capture the essence of the data). This
129 | means using PCA to transform the data so that the covariance matrix of the
130 | transformed data is a diagonal matrix. PCA:
131 |
132 | > This ability of PCA to transform data into a representation where the
133 | > elements are mutually uncorrelated is a very important property of PCA. It
134 | > is a simple example of a representation that attempts to disentangle the
135 | > unknown factors of variation underlying the data.
136 |
137 | Then there's k-means, which learns a one-hot encoding for each sample. This is
138 | a bit extreme, though. The learning, of course, works like EM.
139 |
140 | - Stochastic Gradient Descent. The main workhorse of Deep Learning! It helps
141 | that our cost functions naturally decompose into a sum over training examples
142 | with per-sample loss (and taking the empirical mean of those, so it's an
143 | expectation!!!). Thus, take a minibatch sum of those terms. In fact, we can
144 | often converge to a good solution even without touching every element in the
145 | dataset (i.e. less than a single pass).
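
  The minibatch estimate (book-style notation; m' is the minibatch size and eps
  the learning rate):

      g = (1/m') * grad_theta \sum_{i=1}^{m'} L(x^(i), y^(i), theta)
      theta <- theta - eps * g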
146 |
147 | - Section 5.11, which focuses specifically on Deep Learning challenges. DL helps
148 | to deal with the curse of dimensionality (PS: nice visuals in Figure 5.9!).
149 | They also help with local constancy and smoothness, meaning that we want f(x)
150 | to be approximately f(x+eps). Most classical algorithms try to follow this
151 | implicit prior, but the problem is that it doesn't scale to larger datasets
152 | because it requires enough examples to observe the data space. With DL, we try
153 | and introduce dependencies among different regions, using a "composition of
154 | factors". See Chapters 6 and 15 for this. Oh yeah, this is the idea of DL with
155 | hierarchies of features ... I can see where this is going.
156 |
157 | The last bit here is about manifold learning. We use it informally in machine
158 | learning to indicate a set of points that are well-connected or associated
159 | with each other in a lower-dimensional space. With high dimensions, it's
160 | essential to assume that most points in R^n are invalid. The authors argue
161 | that this is the case in terms of images, sounds, and text. For instance,
162 | uniformly sampling points in image results in static, and random words/letters
163 | mean gibberish instead of interesting sentences. It would be great if learning
164 | algorithms could *discover* these manifolds. In fact, GANs help us with that!
165 |
166 |   (This is a bit hand-wavy, make sure to re-read this section if I want to
167 | refresh my memory.)
168 |
--------------------------------------------------------------------------------
/Random/AWS_Notes.txt:
--------------------------------------------------------------------------------
1 | -----------------------
2 | - AMAZON WEB SERVICES -
3 | -----------------------
4 |
5 | ****************
6 | * May 11, 2017 *
7 | ****************
8 |
9 | I promise, I will learn how to use AWS so that I can finally run code in
10 | clusters instead of running pseudo-parallel code on my personal workstation.
11 |
12 | First, a few pointers, definitions, etc:
13 |
14 | - Be careful! Don't run code for no reason. This uses up resources. It's not
15 | like my personal machine where I can pound it for no reason. Again, be
16 | careful. Also, be mindful of the location of the actual computing resources
17 | I'm using.
18 |
19 | - Amazon Web Services (AWS). It seems like I can use this just by using my
20 | normal Amazon account. It provides a number of services for cloud computing,
21 | which lets me use lots of computing power via the Internet, so long as we
22 | pay an amount commensurate with our usage level. See also:
23 |
24 | > Cloud computing provides a simple way to access servers, storage, databases
25 | > and a broad set of application services over the Internet. A Cloud services
26 | > platform such as Amazon Web Services owns and maintains the
27 | > network-connected hardware required for these application services, while
28 | > you provision and use what you need via a web application.
29 |
30 | (Cloud computing is really a marketing term ... don't put too much thought
31 | into it. Just think of it as a way for me to access lots of resources without
32 | having to buy them online, assemble my workstation, tell Berkeley to hook them
33 | up to the Internet, etc. I have one desktop that took me a while to set up; a
34 | server with many machines would take a lot longer to set up.)
35 |
36 | - Amazon Elastic Compute Cloud (EC2). These "EC2 Instances" are "virtual
37 | machines" that AWS provides, i.e. EC2 is a component of AWS. It seems to be
38 | an example of "Infrastructure as a Service" (IaaS).
39 |
40 | - Amazon Machine Images (AMI). These are virtual machine images. I can use these
41 |   to launch stuff within EC2. Don't forget to keep the key-pair! I think the
42 | point with cloud computing is that we can pick and choose which images match
43 | our desired specs and then "run them." To connect to these, use the good
44 | old-fashioned ssh. There are community-provided AMIs which I assume are from
45 | people/groups around the world who are letting us use their machines in
46 | exchange for payment. There are also marketplace AMIs, which are verified by
47 | AWS.
48 |
49 | - Google Cloud. I don't think I need to use this? It seems to be an alternative
50 | to Amazon Web Services. Once I have a Google Cloud account, I can create
51 | Google Compute Engines (GCEs) to run code, and even use Jupyter Notebooks for
52 | those which I can access in my local browser. For GPUs, I need to send in
53 | special requests.
54 |
55 | See the following for a comparison between these two:
56 |
57 | http://cloudacademy.com/blog/google-cloud-vs-aws-a-comparison/
58 |
59 | The AWS website has lots of tutorials. I will check those tomorrow.
60 |
61 | Python libraries to know/learn:
62 |
63 | - boto (or boto3?)
64 | - redis
65 | - multiprocessing
66 | - click
67 |
68 | I've only "used" multiprocessing before ... and it didn't work for me. Also,
69 | click seems to be more for command line arguments instead of distributed
70 | systems. It seems to be an alternative to argparse ... yeah, I better check that
71 | out! It might become the subject of my next blog post.
72 |
73 |
74 | ****************
75 | * May 12, 2017 *
76 | ****************
77 |
78 | I went through this 10-minute tutorial: "Launch a Linux Virtual Machine".
79 | Highlights:
80 |
81 | - After clicking "Launch Instance", I get to the familiar AMI page. Think of
82 | this as a place to choose my desired computer specs. (Note: to avoid
83 | confusion, this is what happens when we're at the AWS console; there is
84 | another "Launch Instance(s)" button that happens later, once I'm actually
85 | ready to do something.)
86 |
87 | - The tutorial uses a "General Purpose Instance" which should probably be my
88 | default choice for applications, unless I have a pressing reason to use
89 | something else. It also automatically clicks the "free tier eligible" image.
90 |
91 | - Wow, there is a LOT of stuff on the AWS Interface. Getting used to the GUI
92 | will take a while, but I at least know how to see my instances.
93 |
94 | - I can connect to my instance using:
95 |
96 | ssh -i ~/.ssh/MyKeyPair.pem ec2-user@{IP_Address}
97 |
98 | The IP address can be found on the AWS interface. This puts me in the
99 | `/home/ec2-user` folder on an instance, and it looks like I'm the only user.
100 | Huh, that's interesting, I thought this was going to be a shared machine with
101 | loads of users. Looks like `python` is installed, but not `ipython`. Argh.
102 |
103 | - I terminated the instance, and I got this message:
104 |
105 | Broadcast message from root@ip-[IP CENSORED]
106 | (unknown) at 16:55 ...
107 |
108 | The system is going down for power off NOW!
109 | Connection to [IP CENSORED] closed by remote host.
110 | Connection to [IP CENSORED] closed.
111 |
112 | Interesting ... if we did *not* terminate the instance (but it was idle) then
113 | we still get charged. I didn't get charged (I hope not ...).
114 |
115 |
116 | Another potential resource:
117 |
118 | http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
119 |
120 | "Setting Up":
121 |
122 | - I see, this is why I didn't need a password:
123 |
124 | > AWS uses public-key cryptography to secure the login information for your
125 | > instance. A Linux instance has no password; you use a key pair to log in to
126 | > your instance securely. You specify the name of the key pair when you launch
127 | > your instance, then provide the private key when you log in using SSH.
128 |
129 | - There's some stuff about "Virtual Private Clouds" and "Security Groups," but
130 | I'm not sure I understand or if it's that important right now. Think of those
131 | as firewalls, maybe? Yeah, the EC2 console says security groups control access
132 | to the instance.
133 |
134 |
135 | "Getting Started":
136 |
137 | - This is basically the same as the 10-minute tutorial. They also tell us how to
138 | connect with a browser. That might be inconvenient, but maybe not, if we're
139 | running on 1000 machines. But how do we run code using this? There must be
140 | some command line?
141 |
142 | - Oh, here's what they say about termination:
143 |
144 | > Terminating an instance effectively deletes it; you can't reconnect to an
145 | > instance after you've terminated it.
146 |
147 | I see. On the EC2 console, I can't seem to re-start that instance I created in
148 |   that 10-minute tutorial. There is, however, a difference between STOPPING an
149 |   instance and TERMINATING an instance. The former lets me reuse the instance
150 | at some point later (and it doesn't charge me for the stopping period, though
151 | there IS a charge for storage ... look at their description about this).
152 |
153 |
154 | For billing, see:
155 |
156 | http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html
157 |
158 | A few pointers:
159 |
160 | - To see billing on the dashboard, click my name, and then the billing dashboard
161 | setting. It should be intuitive.
162 |
163 | - Try to use the free tier to test things:
164 |
165 | > You can test-drive some AWS services free of charge, within certain usage
166 | > limits. AWS calls this the AWS Free Tier. The free tier is designed to give
167 | > you hands-on experience with a range of AWS services at no charge. For
168 | > example, you can explore AWS as a platform for your business by setting up a
169 | > test website with a server, alarms, and database. You can also try out
170 | > services for developers, such as AWS CodePipeline, AWS Data Pipeline, and
171 | > AWS Device Farm.
172 |
173 | - Actually, looks like I'm not on the free tier since I had made the account in
174 | November 2015 despite NOT EVER USING IT ...
175 |
176 |
177 | For running on *clusters*, see:
178 |
179 | http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html
180 |
187 |
188 | ****************
189 | * May 28, 2017 *
190 | ****************
191 |
192 | OK, I managed to finally make a new account, so I get the one-year free tier
193 | award. Let's see how that works out for me. Now let me try Jonathan Ho's
194 | Evolution Strategies code. How do we use Packer again?
195 |
197 | Packer might be useful for running on clusters. This helps me create identical
198 | machine images (i.e. AMIs) so that the nodes in a cluster are running and using
199 | the same stuff/settings. It's installed on my station. Use `.json` files for
200 | building images (be careful about expenses!). These are configuration files to
201 | allow us to specify various settings about the image(s) we want to build. Run
202 |
203 | `packer build XXX.json`
204 |
205 | to build it. However, I think this requires two keys from AWS, which I can
206 | obtain online. I think I can just make them for me personally. They recommend
207 | creating keys separately for IAM users, but that seems to be more helpful for
208 | organizations with many users (kind of like computers with user accounts).
209 |
210 | NOTE: IAM = "Identity and Access Management."
211 |
212 | After running Packer's examples with my provided keys, I have a **snapshot**. It
213 | was a bit tricky to find. I had to search in the US-east region (N. Virginia),
214 | not the US-west region (N. California). Then click on "Snapshots" and I can see
215 | my AMI. This is **my** AMI, actually. So I'll get charged!
216 |
217 | In addition, assuming I'm in the right region, when I launch an instance, I can
218 | go to "My AMIs" and I will see that image right there. (It doesn't work if I'm
219 | using N. California, so the lesson is that one needs to be aware of what regions
220 | were used!)
221 |
222 | To be clear, what got created out of this configuration file was NOT an
223 | "Instance," but it seems to be either an "Image --> AMIs" or an "Elastic Block
224 | Store --> Snapshots." Strangely, I see something underneath both of those menu
225 | options ... I'm not sure what the difference is. They seem to be similar, except
226 | AMIs are, I assume, something that's representative of a full system, whereas
227 | the snapshots are backups of those ... yeah, it's not clear. Maybe check this:
228 |
229 | https://serverfault.com/questions/268719/amazon-ec2-terminology-ami-vs-ebs-vs-snapshot-vs-volume?
230 |
231 | Snapshots and Volumes should be subsets or types of EBSs, which themselves look
232 | like hard drives. Volumes are pieces and bits of EBSs, and Snapshots are
233 | captures (i.e. copies) of volumes at specific times.
234 |
235 | I *think* I have an idea of what an image means. I mean, with CS 231n, they
236 | provide an image with specialized GPU and Deep Learning stuff. That's with the
237 | "Community AMIs" of course.
238 |
239 | From Packer:
240 |
241 | > After running the above example, your AWS account now has an AMI associated
242 | > with it. AMIs are stored in S3 by Amazon, so unless you want to be charged
243 | > about $0.01 per month, you'll probably want to remove it. Remove the AMI by
244 | > first deregistering it on the AWS AMI management page. Next, delete the
245 | > associated snapshot on the AWS snapshot management page.
246 |
247 | I just did both of those.
248 |
--------------------------------------------------------------------------------
/Deep_Learning/dlbook_chapter10notes.txt:
--------------------------------------------------------------------------------
1 | ****************************************************************
2 | * NOTES ON CHAPTER 10: Recurrent and Recursive Neural Networks *
3 | ****************************************************************
4 |
5 | I need to understand the parameter sharing and how RNNs (and their variants) can
6 | be "combined" into other areas. The parameter sharing is key, as it allows for
7 | *generalization*. CNNs share parameters with the weight filters across the
8 | grids; RNNs share parameters across timesteps.
9 |
10 | Quick note: I think they're using minibatch sizes of 1 to simplify all notation
11 | and exposition here. That's fine with me. Think of x as:
12 |
13 | [ x^1 x^2 ... x^T ]
14 |
15 | where superscripts indicate time. Note that each x^i itself could be a vector!
16 |
17 | Section 10.2, Recurrent Neural Networks
18 |
19 | It's important to understand the *computational graphs* involved with RNNs. I
20 | understand them as directed acyclic graphs, so how does this extend with
21 | recurrence? It's easier to think of them when we unroll (i.e. "unfold") the
22 | computational graphs. See Figure 10.2 as an example (I was able to get this
23 | without looking at the figure). They also use a more succinct "recurrent graph"
24 | representation.
25 |
26 | RNN Design Patterns, also kind of described in Andrej Karpathy's blog post:
27 |
28 | - Producing an output at each time step, and having recurrent connections
29 | between hidden layers. This is Figure 10.3, which I correctly predicted in
30 | advance minus the loss and y stuff. They have losses for *each* time step.
31 | Note the three matrix multiplies that are there, with the *same* respective
32 | matrices repeated across time. Also, we're using the softmax, so assume the
33 | output is discrete at each time step, e.g. o(t) could be the categorical
34 | distribution over the 26 letters in the alphabet.
35 |
36 | - Same as above, except recurrent connections are from outputs to hidden layers,
37 |   so we still have three matrices but the "arrows" in the computational graph
38 |   change. This is *less powerful*. Why?? Think: the former allows hidden to
39 |   hidden, so the hidden state can be very rich. The latter only lets information
40 |   flow hidden -> output -> hidden, so the output is all that gets carried
41 |   forward, and it may be less rich. That seems intuitive.
42 |
43 | - Same as the first one (hidden to hidden connections) except we now have one
44 | output. That's useful to summarize, such as if we're doing sequence
45 | classification.
46 |
47 | Now develop the equations, e.g. f(b + Wh + Ux) where h is from the *previous*
48 | time step and x is the *current* time step, and f is the *activation* function.
49 | Yes, it's all familiar to me. They mention, though, that backpropagation is very
50 | expensive. They call the naive way (applying it on the unrolled computational
51 | graph) as "backpropagation through time."
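
Concretely, for the first design pattern (tanh hidden units, softmax outputs;
this should match the book's equations, give or take notation):

    a(t) = b + W h(t-1) + U x(t)     // pre-activation
    h(t) = tanh(a(t))                // hidden state
    o(t) = c + V h(t)                // output pre-softmax
    yhat(t) = softmax(o(t))          // categorical distribution at time t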
52 |
53 | How to compute the gradient? They give us an example, thank goodness. Comments:
54 |
55 | - Note that L = L(1) + L(2) + ... + L(\tau) so yes, dL/dL(t) = 1 for all t. Each
56 | L(t) is a negative log probability for that output at that time.
57 |
58 | - The next equation (10.18) also makes sense, here i is the component in the
59 | vector, so we're in the univariate case.
60 |
61 | - Equation 10.19 is good, keep in mind that here we have to be careful with the
62 | timestep. For other h(t), we need to add two gradients due to two incoming
63 | terms (because of two *outgoing* terms in the *forward* pass). Thus, the
64 | matrices V and W will be present in some form.
65 |
66 | - The next part about using dummy variables for t is slightly confusing but it
67 | should just mean that the total contribution for these parameters are based on
68 | their sum across each time. Yeah, looking at the book again it's just a
69 | notation issue to help us out. For all those gradients, we have a final sum
70 | over t, where each term in the sum is a matrix/vector of the same size as the
71 | variable we're taking the gradient w.r.t.
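
In symbols, the bookkeeping above (my paraphrase, assuming tanh hidden units and
softmax + negative log likelihood outputs):

    dL/do(t)_i = yhat(t)_i - 1{i = y(t)}
    dL/dh(tau) = V^T dL/do(tau)                       // last timestep: one term
    dL/dh(t)   = V^T dL/do(t)                         // from o(t)
               + W^T diag(1 - h(t+1)^2) dL/dh(t+1)    // from h(t+1), for t < tau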
72 |
73 | PS: when reading this, don't be confused by the notation. Look at the "notation"
74 | chapter online.
75 |
76 | RNNs as directed graphical models? This section is about expressing them as
77 | well-defined directed graphical models, and there are a few subtleties. This is
78 | WITHOUT any inputs, BTW ... probably just for intuition?
79 |
80 | They go through an example predicting a sequence of scalars. With the naive
81 | unrolled (directed) graphical model, we're applying the chain rule of
82 | probability and so it's very inefficient. RNNs provide better (in many metrics,
83 | but particularly efficiency) ways to express such distributions with directed
84 | graphical models by introducing deterministic connections (remember, the hidden
85 | states are deterministic).
86 |
87 | With RNNs, parameter sharing is a huge advantage, but the downside is that
88 | optimizing is hard because we make a potentially strong assumption that at each
89 | time step, the distribution embedded in the RNN remains stationary.
90 |
91 | The last bit here to get it into a well-defined graphical model is to figure out
92 | the length of the RNN. The book presents three options, all of which seem
93 | obvious (though I'm ignoring lots of details, etc.).
94 |
95 | The next subsection (10.2.4) after this is about the more realistic setting of
96 | having x (input), so we're also modeling p(y|x). I think it's trying to stick
97 | with the graphical model setting. Also, note that the second option in the list
98 | of three things is what we did in CS 231n, Assignment 3, with the image
99 | captioning portion. Actually, the first option would seem better, which
100 | translates the input image to a vector as input to *all* hidden states, but
101 | that's harder to implement.
102 |
103 | I was quite confused about Figure 10.9, as to why we are considering the y(t)s
104 | as inputs?? However, it seems like it's because we want to model p(y|x) and,
105 | well, y is the ground truth. I'm just having trouble translating this to code,
106 | or maybe that's not what I should be doing, and instead just think of it as a
107 | graphical model? To think of it as code, I'd need the other case we had earlier
108 | where the *output* or *hidden state* was the input to the hidden state, not the
109 | actual target (which is to be compared with the output).
110 |
111 | Section 10.3: Bidirectional RNNs
112 |
113 | Bidirectional RNNs help us model the output y(t) when that output may also
114 | *depend on future times* t+1, t+2, etc., such as with speech recognition where
115 | we need to peek ahead a bit. Don't use a fixed window, though, they say:
116 |
117 | > This allows the output units o(t) to compute a representation that depends on
118 | > both the past and the future but is most sensitive to the input values around
119 | > time t, without having to specify a fixed-size window around t.
120 |
121 | Nice!
122 |
123 | Section 10.4: Encoder-Decoder Sequence-to-Sequence Architectures
124 |
125 | Use these to avoid the restriction of fixed sequence sizes for the inputs x (or
126 | x(t)). This is their main benefit/innovation, the lengths n_x and n_y (see
127 | Figure 10.12 if confused on this notation) **can vary**; if the training
128 | data consists of a bunch of sequences that are of similar or different lengths,
129 | the RNN will learn to mirror that training data. Side note: the first relevant
130 | paper on this (from 2014) called it "Encoder-Decoder" while the second one
131 | called it "Sequence-to-Sequence". I skimmed that second one, from Sutskever et
132 | al, NIPS 2014 last year, though maybe I should re-read it. Both papers are
133 | highly-cited.
134 |
135 | Connection with Section 10.2.4: we have a fixed-sized context vector C (well,
136 | usually) coming out of the encoder. Well, C is input to the decoder, and this is
137 | *precisely* the vector-to-sequence RNN architecture we talked about in that
138 | sub-section!
139 |
140 | How can the encoder deal with varying sizes n_x? If you think about it, it's
141 | just applying the RNN update over and over again to produce a fixed hidden state
142 | of the same size. At time t, we have processed x(1),...,x(t), and have hidden
143 | state h(t). (We're ignoring the earlier hidden states for simplicity.) Then at
144 | the next time t+1, say the last one, we get h(t+1) and pass that in. So there's
145 | no issue with getting different sized inputs, because all that matters is (a)
146 | that we can repeatedly apply the RNN update, which is a for loop over the input
147 | sequence, and (b) that we take a fixed sized input to the decoder, which we can
148 | do with our final hidden state!
149 |
150 | Section 10.5: Deep Recurrent Neural Networks
151 |
152 | In all likelihood, I will not be dealing with these, but it might be worth
153 | knowing how deep we can go with RNNs, just like how I learned about the very
154 | deep GoogLeNet and the **ultra** deep ResNet. When we talk about depth, we mean
155 | adding more layers (w.r.t. the unrolled graph perspective) to the three
156 | components: input to hidden, hidden to hidden, and/or hidden to output. This
157 | might make learning hard, so one option is to introduce skip connections like
158 | in ResNets (man, I'm glad I reviewed ResNets).
159 |
160 | Section 10.6: Recursive Neural Networks
161 |
162 | Recursive Neural Networks, which we **do not** abbreviate as RNN, are a
163 | generalization of RNNs with a different computational graph "flavor" that looks
164 | like a tree rather than a chain.
165 |
166 | Section 10.7: Challenge of Long-Term Dependencies
167 |
168 | Why is it hard? Here are some relevant quotes:
169 |
170 | > The basic problem is that gradients propagated over many stages tend to either
171 | > vanish (most of the time) or explode (rarely, but with much damage to the
172 | > optimization). [...] the difficulty with long-term dependencies arises from
173 | > the exponentially smaller weights given to long-term interactions (involving
174 | > the multiplication of many Jacobians) compared to short-term ones. [...]
175 | > Recurrent networks involve the composition of the same function multiple
176 | > times, once per time step. These compositions can result in extremely
177 | > nonlinear behavior, as illustrated in figure 10.15.
178 |
179 | This section describes the problem, and the subsequent sections (I assume 10.8
180 | through 10.12, judging from the LSTMs here) describe ways to solve it.
181 |
182 | They present a simplified analysis with matrix eigendecomposition, where we
183 | assume no activations. Then yes, gradients can explode if eigenvalue magnitudes
184 | are greater than one, or vanish if they are less than one. Andrej Karpathy said
185 | something similar in his medium blog post (why does he bother with medium?).
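
Roughly (under the book's simplification of repeated multiplication by the same
W with no nonlinearity, and W = Q Lambda Q^T with orthogonal Q):

    h(t) = W^t h(0) = Q Lambda^t Q^T h(0)

so the component of h(0) along each eigenvector is scaled by lambda_i^t:
|lambda_i| > 1 explodes, |lambda_i| < 1 vanishes.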
186 |
187 | No free lunch:
188 |
189 | > One may hope that the problem can be avoided simply by staying in a region of
190 | > parameter space where the gradients do not vanish or explode. Unfortunately,
191 | > in order to store memories in a way that is robust to small perturbations, the
192 | > RNN must enter a region of parameter space where gradients vanish (Bengio et
193 | > al., 1993, 1994).
194 |
195 | It's a bit annoying that we are simplifying here by ignoring the activation
196 | functions, but I guess Bengio's old papers address activation functions?
197 |
198 | Section 10.8: Echo State Networks
199 |
200 | I skimmed this section. It's quite high-level and not that important to me.
201 |
202 | Section 10.9: Leaky Units, Multiple Time Scales
203 |
204 | I like this explanation:
205 |
206 | > One way to deal with long-term dependencies is to design a model that operates
207 | > at multiple time scales, so that some parts of the model operate at
208 | > fine-grained time scales and can handle small details, while other parts
209 | > operate at coarse time scales and transfer information from the distant past
210 | > to the present more efficiently.
211 |
212 | Oddly enough, they don't cite the ResNet paper?!?
213 |
214 | They can add skip connections (i.e. adding edges to the RNN). Or they can remove
215 | edges from the RNN, which might have similar positive effects as skip
216 | connections.
217 |
218 | Section 10.10: LSTMs (finally!), Gated Recurrent Unit RNNs
219 |
220 | As of this writing (2016), these two RNNs are the most effective RNNs we have
221 | for practical applications involving sequences.
222 |
223 | Gated Recurrent Unit (GRU):
224 |
225 | - Main idea:
226 |
227 | > [...] gated RNNs are based on the idea of creating paths through time that
228 | > have derivatives that neither vanish nor explode.
229 |
230 | - The RNN needs to *learn* when to forget and discard the past (it can't
231 | remember everything, after all!).
232 |
233 | - Another quote:
234 |
235 | > The main difference with the LSTM is that a single gating unit
236 | > simultaneously controls the forgetting factor and the decision to update the
237 | > state unit.
238 |
239 | Long Short-Term Memory (LSTM):
240 |
241 | - See Figure 10.16 for the block diagram. It's still very confusing despite how
242 | I implemented it in CS 231n. I'm amazed that these work at all.
243 |
244 | - Like GRUs, LSTMs need to *learn* when to forget.
245 |
246 | - It uses self-loops to enable paths to flow for long durations. By flow, I mean
247 | not only the forward pass, but the *backward* pass.
248 |
249 | The authors' conclusion is to simply stick with GRUs or LSTMs.
250 |
251 | Section 10.11: Optimization for Long-Term Dependencies
252 |
253 | They talk about how to improve optimization, such as with second-order methods
254 | and clipping gradients. (Be careful, taking the average of a bunch of clipped
255 | gradients means gradients that were larger have their contributions removed; see
256 | the discussion in the textbook.)
257 |
258 | I wouldn't put too much stock into this, though, because the authors say:
259 |
260 | > This is part of a continuing theme in machine learning that it is often much
261 | > easier to design a model that is easy to optimize than it is to design a more
262 | > powerful optimization algorithm.
263 |
264 | In fact it seems like it's easier to train LSTMs using simple SGD rather than
265 | use a more complicated optimization algorithm. PS: is ADAM used with RNNs?
266 |
267 | Section 10.12: Explicit Memory
268 |
269 | Philosophical quote:
270 |
271 | > Neural networks excel at storing implicit knowledge. However, they struggle to
272 | > memorize facts.
273 |
274 | This section introduces **Memory Networks** and **Neural Turing Machines**.
275 |
276 | For NTMs, note that:
277 |
278 | > It is difficult to optimize functions that produce exact, integer addresses.
279 | > To alleviate this problem, NTMs actually read to or write from many memory
280 | > cells simultaneously. To read, they take a weighted average of many cells. To
281 | > write, they modify multiple cells by different amounts
282 |
283 | Yeah, it's basically **soft attention**.
284 |
285 | Conclusion of the chapter:
286 |
287 | > Recurrent neural networks provide a way to extend deep learning to sequential
288 | > data. They are the last major tool in our deep learning toolbox. Our
289 | > discussion now moves to how to choose and use these tools and how to apply
290 | > them to real-world tasks.
291 |
292 | Whew!
293 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/Mathematical_Introduction_Robotic_Manipulation.txt:
--------------------------------------------------------------------------------
1 | Notes on the textbook:
2 |
3 | A Mathematical Introduction to Robotic Manipulation, 1994.
4 | Richard M. Murray and Zexiang Li and S. Shankar Sastry
5 |
6 | A bit old but still in use for Berkeley's courses.
7 |
8 |
9 | ***************************
10 | * Chapter 1: Introduction *
11 | ***************************
12 |
13 | Some history here ... not that relevant to me at this moment. I'd like to see a
14 | more modern take on this.
15 |
16 | But I do like this:
17 |
18 | > The vast majority of robots in operation today consist of six joints which are
19 | > either rotary (articulated) or sliding (prismatic), with a simple "end-
20 | > effector" for interacting with the workpieces.
21 |
22 | Yes, the dvrk has one "prismatic" joint out of seven (note, seven, not six...)
23 | and the others are rotary --- the dvrk guide actually says "revolute". And I
24 | obviously know the end-effectors by now. (Edit: "revolute" is clearly the better
25 | terminology... fortunately the book uses that later.)
26 |
27 | Then they talk about the book outline. Yeah, maybe I'll definitely take a look
28 | at Chapter 2 at a "leisurely pace" to better understand rigid body motion:
29 |
30 | > In this chapter, we present a geometric view to understanding translational
31 | > and rotational motion of a rigid body. While this is one of the most
32 | > ubiquitous topics encountered in textbooks on mechanics and robotics, it is
33 | > also perhaps one of the most frequently misunderstood.
34 |
35 | OK, fair enough.
36 |
37 |
38 | ********************************
39 | * Chapter 2: Rigid Body Motion *
40 | ********************************
41 |
42 | > In this chapter, we present a more modern treatment of the theory of screws
43 | > based on linear algebra and matrix groups. The fundamental tools are the use
44 | > of homogeneous coordinates to represent rigid motions and the matrix
45 | > exponential, which maps a twist into the corresponding screw motion.
46 |
47 | == Important facts ==
48 |
49 | - Location (x, y, z).
50 |
51 | - Trajectory (x(t), y(t), z(t)) = p(t).
52 |
53 | - Rigid **body** satisfies || p(t) - q(t) || = || p(0) - q(0) || = constant.
54 |
55 | - Rigid body transformation: map from R^3 -> R^3 representing "rigid motion"
56 | (subtle point: cross product must be preserved).
57 |
58 | - Cartesian frame: specified with axes vectors x, y, z. These **must** be
59 | _orthogonal_ and with magnitude 1. I.e., _orthonormal_ vectors. Oh, and
60 | preserves z = x \times y to preserve the right-handedness of the system.
61 |
62 | - Know **rotation matrices**: orthogonal and has determinant 1 if right handed
63 | coordinate frame.
64 |
65 | - Figure 2.1 is helpful. **Every rotation** of that object corresponds to some
66 | rotation matrix (well, w.r.t. a fixed frame). And the rotation matrix even
67 | has a special form: we stack the coordinates of the principal axes (x,y,z)
68 | of the **body frame** of the object w.r.t. the "inertial frame."
69 | - Can also think of rotation matrices as transforming points from one frame to
70 | another. Draw a picture for their example; it's worth it.
71 | - Combine rotation matrices via matrix multiplication to form other rotations.
72 |
73 | - SO(n) = "Special Orthogonal" group of (n,n) matrices, typically n=3 but
74 | sometimes n=2. These are a linear algebra "group" under matrix multiplication;
75 | definition is the same as the abstract algebra concept.
76 |
77 | Related notation: so(n), with lowercase letters, is the space of n-by-n
78 | **skew symmetric** matrices, so A^T = -A.
79 |
80 | - SE(n) = "Special Euclidean" group: R^n x SO(n). In the general case with n=3,
81 | we have six dimensions. This is the usual "position and rotation" that I'm
82 | familiar with; denote these as (p,R) where p is in R^3 and R is in SO(3).
83 |
84 | == Other Major Points ==
85 |
86 | - How to prove that something (e.g., a rotation) is a rigid body transformation?
87 | It's simple: show that the transformation preserves distance and orientation.
88 | Look at Definition 2.1 and literally just prove the two properties!
89 |
90 | Don't forget to review the _cross_product_ between two vectors.
91 |
92 | a x b = \hat{a} b, where \hat{a} is the cross-product (skew-symmetric)
93 | matrix of a. This is the same "hat" notation the book uses for the
94 | exponential coordinates of rotation, as in `e^{\hat{a} \theta}`.
95 |
96 | And be careful about the distinction:
97 |
98 | _points_ (typically written as p, q)
99 | _vectors_ (typically written as v, w)
100 |
101 | For two points p, q \in O, the vector v \in R^3 is the _directed_ line
102 | segment going from p to q.
103 |
104 | Conceptual difference: vectors have a _direction_ and a _magnitude_.
105 |
106 | - To track motion of a rigid body, we just need to watch one point plus the
107 | rotation w.r.t. that point. Hence, use a *configuration* which means we
108 | "attach" a coordinate frame to a point and track it w.r.t. a fixed frame.
109 | Don't forget what we mean by a configuration: something which can tell us
110 | "complete" (or "sufficient"?) information about something in some space. I
111 | remember that from CS 294-115. More precisely, that's SE(3).
112 |
113 | - "Exponential coordinates for rotation" are derived by considering: given an
114 | *axis* of rotation \omega and the amount (i.e., the angle about that axis) we
115 | rotate some arm (e.g., see Figure 2.2), can we derive the rotation matrix R?
116 | They derive it by setting `R = e^{\hat{\omega} * \theta}` where `\hat{\omega}`
117 | is a matrix. That's where the exponential stuff comes from. For a closed-form
118 | implementation, look at **Rodrigues' formula**, which I used for CS 280. (A
119 | numpy sketch follows at the end of this list.)
120 |
121 | - This is known as "angular velocity" in physics.
122 | - We like this due to Euler's Theorem (2.6 in the book): _any_ orientation R
123 | in SO(3) is equivalent to a rotation about axis w in R^3 through an angle.
124 |
125 | - Theorem: **every rotation matrix** can be represented as the matrix
126 | exponential of some skew-symmetric matrix.
127 |
128 | BTW, in their notation, \hat{\omega} is a skew-symmetric 3x3 matrix. And
129 | they represent skew symmetric matrices as the product of a *unit*
130 | skew-symmetric matrix and a real number.
131 |
132 | - Another representation of rotations are the three **Euler Angles** which is
133 | what I'm most familiar with. AKA yaw, pitch, roll. The order of which axes we
134 | rotate about matters, since it can be represented as the product of three
135 | matrices. See Equation 2.20 for the formulas to derive yaw, pitch, and roll.
136 | Watch out for computing the correct quadrant for the arc-tan functions.
137 |
138 | - Downside: singularities. E.g., there are infinitely many representations of
139 | certain rotations, and it is a "fundamental topological fact" that
140 | singularities can't be eliminated in a 3-D representation of SO(3). I don't
141 | know why, but the authors argue that:
142 |
143 | > This situation is similar to that of attempting to find a global
144 | > coordinate chart on a sphere, which also fails.
145 |
146 | Hmm ... sounds intriguing. But I won't fret too much about this.
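
As promised above, a minimal numpy sketch of Rodrigues' formula (my own code,
not the book's); it assumes `omega` is a unit vector:

    import numpy as np

    def hat(w):
        # Cross-product (skew-symmetric) matrix: hat(w) @ v == np.cross(w, v).
        return np.array([[    0, -w[2],  w[1]],
                         [ w[2],     0, -w[0]],
                         [-w[1],  w[0],     0]])

    def exp_so3(omega, theta):
        # R = e^{\hat{\omega} * \theta} via Rodrigues' formula (unit omega).
        W = hat(omega)
        return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)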
147 |
148 | == Rigid Motion in R^3 ==
149 |
150 | (Now we're dealing with _translations_, in addition to rotations.) This is where
151 | the _SE(3)_ group appears. An element `(p,R) \in SE(3)` serves as:
152 |
153 | - A specification of the configuration of a rigid body.
154 | - A transformation taking the coordinates of a point from one frame to
155 | another.
156 |
157 | This is exactly analogous to the SO(3) case, where `R \in SO(3)` was either a
158 | rotation configuration or a rotation mapping. We can view it either way. :-)
159 |
160 | To make the linear algebra math easier to describe rigid transformations, use
161 | **homogeneous coordinates**.
162 |
163 | - Add 1 to the coordinates of a point, so now we're in R^4, and vectors are
164 | (well, effectively) in R^3 since their 4th component is always zero.
165 | - Now an RBT is one matmul on a vector, an affine transformation. The last
166 | row is all zeros except for a 1 at the lower right corner.
167 | - To compose these transformations, do more matmuls (see the sketch below).
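
A sketch of that homogeneous representation (my own helper; same numpy setup
as the sketch above):

    import numpy as np

    def rbt(R, p):
        # 4x4 homogeneous matrix for (p, R) in SE(3); bottom row is [0,0,0,1].
        g = np.eye(4)
        g[:3, :3] = R
        g[:3, 3] = p
        return g

    # Composition is just matmul: g_ac = rbt(R_ab, p_ab) @ rbt(R_bc, p_bc),
    # and a point q in R^3 transforms as (g @ np.append(q, 1.0))[:3].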
168 |
169 | Must also know the exponential coordinates for rigid motion, so the SE analogue
170 | to the SO exponential of a skew symmetric matrix representing a rotation.
171 |
172 | - Once again, start from considering rotation about axis \omega
173 | - Then derive velocity of tip point via cross products
174 | - Then solve (integrate) differential equation to get exponential map
175 | - Main difference is the use of 4x4 matrices w/homogeneous-like
176 | representation. Also, we consider an extra ("offset"?) point q on \omega.
177 |
178 | Define se(3):
179 | se(3) := { (u,\hat{omega}) s.t. u in R^3, \hat{omega} in so(3) }
180 | Elements of se(3) are _twists_; Can also write them using 4x4 matrices using
181 | homogeneous coordinates, useful for the following proposition ...
182 |
183 | Proposition 2.8: given \hat{ξ} \in se(3) and \theta \in R, the exponential of
184 | \hat{ξ}*\theta is an element of SE(3), the special Euclidean group ... think
185 | of it as the possible translations and rotations.
186 |
187 | Proof technique:
188 | - Start w/4x4 matrix \hat{ξ} in se(3). Want to show: exp(\hat{ξ}*theta)
189 | in SE(3).
190 | - Prove by construction and obtain a formula for that exponential.
191 | - Split into cases, \omega = 0 versus \omega =/= 0.
192 | - For second (harder) case, relate to \hat{ξ-prime} and use properties of
193 | exponentials and cross products.
194 | - Use the _homogeneous_ representation of elements in SE(3). Normally, I
195 | think of (p,R) \in SE(3), but use the 4x4 _matrix_ with R and p in it.
196 |
197 | Intuition: earlier we interpreted elements of SE(3) as transforming from one
198 | coordinate frame to another. Here, interpret it as mapping points from
199 | _initial_ coordinates to their coordinates _after_ the rigid motion is
200 | applied. Key difference from earlier is that the start and end are specified
201 | w.r.t. a _single_ coordinate frame. The book says:
202 |
203 | > Thus, the exponential map for a twist gives the relative motion of a rigid
204 | > body. This interpretation of the exponential of a twist as a mapping from
205 | > initial to final configurations will be especially important as we study the
206 | > kinematics of robot mechanisms in the next chapter.
207 |
208 | Important! _Every_ rigid transformation can be written as the exponential of
209 | some twist. BTW, I think the twist is only the \hat{ξ} part, and the `\theta
210 | \in R` part is multiplied later. Not a big deal, just think of twists as the 4x4
211 | "\hat{ξ}" matrices in se(3).
212 |
213 | _Screws_ are a "geometric description" of twists and give us more intuition on
214 | them. More precisely:
215 |
216 | > Consider a rigid body motion which consists of rotation about an axis in space
217 | > through an angle of `\theta` radians, followed by translation along the same
218 | > axis by an amount `d` as shown in Figure 2.7a. We call such a motion a screw
219 | > motion, since it is reminiscent of the motion of a screw, in so far as a screw
220 | > rotates and translates about the same axis.
221 |
222 | - Characterizing a screw: define _pitch_, _axis_, and _magnitude_.
223 | - To compute RBT, draw a figure, determine end-point, and derive the rotation
224 | plus vector offset to get the usual 4x4 homogeneous matrix representation.
225 | - The RBT of a screw has an equivalence with the exponential of a twist
226 | `exp(\hat{ξ}*\theta)`.
227 | - It is possible to define a screw for every twist!
228 |
229 | Important theorem:
230 |
231 | > Theorem 2.11 (Chasles). Every rigid body motion can be realized by a rotation
232 | > about an axis combined with a translation parallel to that axis.
233 |
234 | Be careful about _relative_ motion, which is w.r.t. a SINGLE reference frame. To
235 | "switch" between frames, you need to do an extra matrix multiply with g_{ab} to
236 | map from B's coordinates to A.
237 |
238 | == Velocity of a Rigid Body ==
239 |
240 | (This is probably not that relevant for me.)
241 |
242 | == Wrenches and Reciprocal Screws ==
243 |
244 | (This is probably not that relevant for me.)
245 |
246 |
247 | *************************************
248 | * Chapter 3: Manipulator Kinematics *
249 | *************************************
250 |
251 | == Section 2: Forward Kinematics ==
252 |
253 | To determine the configuration of the end-effector given information about the
254 | robot joints, we typically assume that the robot is composed of a set of
255 | "lower-pair joints".
256 |
257 | - There are six common examples: prismatic, revolute, helical, cylindrical,
258 | planar, and spherical. The two most common are, of course, prismatic and
259 | revolute joints. (The 2017 book by Lynch & Park has figures of these,
260 | though they use "universal" instead of "planar".)
261 | - The reason why we like this assumption is that each of the joints
262 | **restricts the motion of adjacent links to a subgroup of SE(3)**, making it
263 | easier to analyze.
264 |
265 | Example: in Figure 3.1, there are four joints, three revolute and one
266 | prismatic. The revolute joints are specified with one \theta for each since it
267 | can be thought of as a single circle about some axis (specified with the right
268 | handed coordinate system). In fact, the same holds for the prismatic joint with
269 | \theta being the displacement along the axis, so specifying these four scalar
270 | values is enough for us to define the configuration of that particular robot.
271 | The **joint space** is the Cartesian product of these individual joint angles.
272 | Equivalently, we can form the configuration space of the robot. It has four
273 | degrees of freedom (3+1=4 obviously) but this of course doesn't hold as a
274 | general rule as robots may have constraints on joints that restrict some DoFs.
275 |
276 | Attach **two** coordinate frames:
277 |
278 | - Base frame: attached to a point on the manipulator which is stationary with
279 | respect to the first link (at index 0).
280 | - Tool frame: attached to the end-effector of the robot, so that the tool frame
281 | moves when the joints of the robot move (seems logical).
282 | So when I query the dVRK, the positions are clearly in the base frame, since
283 | if they were in the tool frame, the positions would always be (0,0,0).
284 |
285 | Forward kinematics: determine the function `g_st: Q -> SE(3)` that determines
286 | the configuration of the tool frame (w.r.t. the base frame). Q is the joint
287 | space of the manipulator, as I mention above.
288 |
289 | Generic solution:
290 |
291 | g_st(theta) = g_{s,l1}(theta_1) * ... * g_{l_{n-1},ln}(theta_n) * g_{ln,t}
292 |
293 | Concatenate the transformations among **adjacent** link frames.
294 |
295 | g_st, our final map, determines the _configuration_ of the _tool_ frame
296 | relative to _base_ frame. That's consistent with our subscript notation.
297 | Remember also that `g_{ij} \in SE(3)` can be thought as `(p_{ij},R_{ij})`.
298 |
299 | == Product of Exponentials ==
300 |
301 | We can obtain a more "geometric description" using PoEs. (Not sure what
302 | precisely this means...)
303 |
304 | Example/Figure 3.2 for an overview of two choices: using g_st(\theta) as
305 | previously discussed, or using PoEs in which
306 |
307 | g_st(theta) = exp(hat{ξ}_1*theta_1) * exp(hat{ξ}_2*theta_2) * g_st(0)
308 | (g_st(0) = rigid body transformation from T to S)
309 |
310 | Derive by thinking: "fix theta_1 and consider motion wrt theta_2. Then do
311 | motion wrt theta_1 and combine result". This is generalized:
312 |
313 | > For each joint, construct a twist `ξ_i` which corresponds to the screw motion
314 | > for the i-th joint with all other joint angles held fixed at `θ_j = 0`.
315 |
316 | Results in Equation 3.3 on pp.87, the PoEs, at last! (TODO: understand why the
317 | `ξ_i` have their particular form for revolute or prismatic cases.)
318 |
319 | If we assume that's true, then kinematics for Figure 3.3 are easily derived (and
320 | by this we can get every component in the matrices) by starting from PoEs and
321 | substituting into the formula for exp(hat{ξ}_i*theta_i) for 1<=i<=4 that we can
322 | find from Equation (2.36), pp.42.
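
To make this concrete, here's a rough numpy/scipy sketch of PoE forward
kinematics (my own helper names; I use `scipy.linalg.expm` as a stand-in for
closed-form formulas like Equation 2.36, and reuse `hat` from the Chapter 2
sketch above):

    import numpy as np
    from scipy.linalg import expm

    def twist_hat(v, w):
        # 4x4 homogeneous form of the twist xi = (v, w) in se(3).
        xi = np.zeros((4, 4))
        xi[:3, :3] = hat(w)
        xi[:3, 3] = v
        return xi

    def poe_fk(xi_hats, thetas, g_st0):
        # g_st(theta) = exp(xi_1 * theta_1) * ... * exp(xi_n * theta_n) * g_st(0)
        g = np.eye(4)
        for xi, th in zip(xi_hats, thetas):
            g = g @ expm(xi * th)
        return g @ g_st0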
323 |
--------------------------------------------------------------------------------
/Robots_and_Robotic_Manip/ROS.text:
--------------------------------------------------------------------------------
1 | How to use ROS. I'm using ROS Indigo, on Ubuntu 14.04. Hopefully the Fetch will
2 | be updated for 16.04 soon.
3 |
4 |
5 | ***************************************************************
6 | * Tutorial 1: Installing and Configuring Your ROS Environment *
7 | ***************************************************************
8 |
9 | Note the environment variables after installation:
10 |
11 | ```
12 | $ printenv | grep ROS
13 | ROS_ROOT=/opt/ros/indigo/share/ros
14 | ROS_PACKAGE_PATH=/opt/ros/indigo/share:/opt/ros/indigo/stacks
15 | ROS_MASTER_URI=http://localhost:11311
16 | ROSLISP_PACKAGE_DIRECTORIES=
17 | ROS_DISTRO=indigo
18 | ROS_ETC_DIR=/opt/ros/indigo/etc/ros
19 | ```
20 |
21 | In my `.bashrc` I have:
22 |
23 | ```
24 | source /opt/ros/indigo/setup.bash
25 | alias fetch_mode='export ROS_MASTER_URI=http://fetch59.local:11311; export PS1="\[\033[41;1;37m\]\[\033[0m\]\w$ "'
26 | ```
27 |
28 | where `fetch_mode` came from the HSR tutorials.
29 |
30 | Another important note regarding rosbuild and catkin.
31 |
32 | > Note: Throughout the tutorials you will see references to rosbuild and catkin.
33 | > These are the two available methods for organizing and building your ROS code.
34 | > rosbuild is not recommended or maintained anymore but kept for legacy. catkin
35 | > is the recommended way to organise your code, it uses more standard CMake
36 | > conventions and provides more flexibility especially for people wanting to
37 | > integrate external code bases or who want to release their software. For a
38 | > full break down visit catkin or rosbuild.
39 |
40 | I followed their directions to make the appropriate directories for a catkin
41 | workspace. But sourcing the bash scripts didn't seem to have any noticeable
42 | effect. I thought it'd do a python virtualenv thing?
43 |
44 | Beyond the scope of this, but catkin stuff is here:
45 |
46 | http://wiki.ros.org/catkin/conceptual_overview
47 |
48 | - A build system specifically for ROS. Others are `GNU make` and `CMake`.
49 | - Source code is organized into "packages" which have targets to build.
50 | - For information on how to build, we need "configuration files." With catkin
51 | (extension of CMake) that's in `CMakeLists.txt`.
52 | - `catkin` is the newer tool we should use, not `rosbuild` (older).
53 |
54 |
55 | *********************************************
56 | * Tutorial 2: Navigating the ROS Filesystem *
57 | *********************************************
58 |
59 | Use `package.xml` to store information about a specific package, such as
60 | dependencies, maintainer, etc. Know `rospack`, `roscd`, etc. We can prepend
61 | `ros` to some common Unix commands, do tab completion, etc.
62 |
63 | ```
64 | daniel@daniel-ubuntu-mac:~$ rospack find roscpp
65 | /opt/ros/indigo/share/roscpp
66 | daniel@daniel-ubuntu-mac:~$ roscd roscpp
67 | daniel@daniel-ubuntu-mac:/opt/ros/indigo/share/roscpp$
68 | ```
69 |
70 |
71 | **************************************
72 | * Tutorial 3: Creating a ROS Package *
73 | **************************************
74 |
75 | Packages need: a manifest (package.xml) file, a catkin configuration file, and
76 | its own directory (easy). Since we already created `catkin_ws/src` earlier, put
77 | each of our custom packages as its own directory within `catkin_ws/src`.
78 |
79 | After running the package script, I have this within `~/catkin_ws/src`:
80 |
81 | ```
82 | CMakeLists.txt -> /opt/ros/indigo/share/catkin/cmake/toplevel.cmake
83 |
84 | beginner_tutorials/
85 | CMakeLists.txt
86 | include/
87 | beginner_tutorials/
88 | (empty)
89 | package.xml
90 | src/
91 | (empty)
92 | ```
93 |
94 | - Since the tutorial runs the script with `rospy`, `roscpp`, and `std_msgs`,
95 | those are listed as the package dependencies in `package.xml`.
96 |
97 | - When we run `catkin_make` over the entire workspace, it will say "traversing
98 | into beginner_tutorials".
99 |
100 | - First-order dependencies:
101 | ```
102 | ~/catkin_ws$ rospack depends1 beginner_tutorials
103 | roscpp
104 | rospy
105 | std_msgs
106 | ```
107 |
108 | - We can also list all the *indirect* dependencies.
109 |
110 | - Dependencies are in the following groups:
111 | > build_depend (the tutorial lists this; I have build_depend and build_export_depend)
112 | > buildtool_depend (I have this)
113 | > exec_depend (I have this)
114 | > test_depend (I don't see this)
115 | (Maybe they re-named `build_depend` and `build_export_depend`?)
116 |
117 | - `build_depend` for compilation, `exec_depend` for runtime
118 |
119 | - Make sure I customize `package.xml`!! It's mostly "meta-data" so should be
120 | easier than customizing `CMakeLists.txt`. See conventions online.
121 |
122 |
123 |
124 | **************************************
125 | * Tutorial 4: Building a ROS Package *
126 | **************************************
127 |
128 | This discusses `catkin_make` which we previously ran. Note that using
129 | `catkin_make` we can build *all* the packages in our workspace, at least in the
130 | `src/` directory (we can change the target directory). Here's what I have in
131 | `catkin_ws/`:
132 |
133 | ```
134 | build/
135 | beginner_tutorials/
136 | catkin/
137 | catkin_generated/
138 | CATKIN_IGNORE
139 | catkin_make.cache
140 | CMakeCache.txt
141 | CMakeFiles/
142 | cmake_install.cmake
143 | CTestTestfile.cmake
144 | gtest/
145 | Makefile
146 | test_results/
147 | devel/
148 | env.sh
149 | lib/
150 | setup.bash
151 | setup.sh
152 | _setup_util.py
153 | setup.zsh
154 | share/
155 | src/
156 | beginner_tutorials/
157 | CMakeLists.txt
158 | ```
159 |
160 | The `cmake` and `make` commands go to `build` when they need to build packages.
161 | The executables and libraries go in `devel` *before* installing packages.
162 |
163 | We'd also run `catkin_make install` but this seems to be optional.
164 |
165 | BTW, I now understand why there seem to be so many packages located in that
166 | directory on our dVRK machine. Unfortunately, we don't seem to be using it. I
167 | wonder if the HSR or YuMi computers have a similar file system.
168 |
169 |
170 |
171 | ***************************************
172 | * Tutorial 5: Understanding ROS Nodes *
173 | ***************************************
174 |
175 | - Nodes: A node is an executable that uses ROS to communicate with other nodes.
176 | - That's it. Use these to subscribe/publish to topics.
177 | - To communicate, use a "ROS client library" which is rospy or roscpp.
178 |
179 | - Messages: ROS data type used when subscribing or publishing to a topic.
180 | - E.g. "geometry_msgs/Twist". For publisher/subscriber nodes to communicate
181 | they need to send/accept the same message type.
182 |
183 | - Topics: Nodes can publish messages to a topic as well as subscribe to a topic
184 | to receive messages.
185 | - Communication depends on these _messages_.
186 |
187 | - Master: Name service for ROS (i.e. helps nodes find each other)
188 |
189 | - rosout: ROS equivalent of stdout/stderr
190 | - It runs by default from running `roscore` as it collects debug messages.
191 |
192 | - roscore: Master + rosout + parameter server (parameter server will be
193 | introduced later)
194 | - First thing we should run! Recall this is what we do for the dVRK.
195 |
196 | After `roscore`:
197 |
198 | ```
199 | ~/catkin_ws$ roscore
200 | ... logging to
201 | /home/daniel/.ros/log/4a2cd14e-32cf-11e8-9512-7831c1b89008/roslaunch-daniel-ubuntu-mac-4867.log
202 | Checking log directory for disk usage. This may take awhile.
203 | Press Ctrl-C to interrupt
204 | Done checking log file disk usage. Usage is <1GB.
205 |
206 | started roslaunch server http://daniel-ubuntu-mac:33999/
207 | ros_comm version 1.11.21
208 |
209 | SUMMARY
210 | ========
211 |
212 | PARAMETERS
213 | * /rosdistro: indigo
214 | * /rosversion: 1.11.21
215 |
216 | NODES
217 |
218 | auto-starting new master
219 | process[master]: started with pid [4879]
220 | ROS_MASTER_URI=http://daniel-ubuntu-mac:11311/
221 |
222 | setting /run_id to 4a2cd14e-32cf-11e8-9512-7831c1b89008
223 | process[rosout-1]: started with pid [4892]
224 | started core service [/rosout]
225 | ```
226 |
227 | So `/rosout` will be listed when running `rosnode list` in a separate tab. Keep
228 | `roscore` running throughout the time we use ROS!! Use `rosnode info` to see (1)
229 | publishers, (2) subscribers, and (3) services. Also note `PARAMETERS` which must
230 | mean the parameter server.
231 |
232 | Use `rosrun` to run packages along with certain nodes within packages. I ran
233 | `turtlesim` and yes we get a new node and can re-name if needed. There appear to
234 | be two node options for this, one for the turtle and another for teleoperation.
235 |
236 |
237 |
238 | ****************************************
239 | * Tutorial 6: Understanding ROS Topics *
240 | ****************************************
241 |
242 | We run the turtlesim via teleoperation, and it works.
243 |
244 | - Nodes `turtlesim_node` and `turtle_teleop_key` within the `turtlesim` package
245 | communicate to each other via a ROS topic.
246 | - Communication within such topics depends on sending ROS _messages_.
247 |
248 | - The teleop node *publishes* key commands, while the sim node *subscribes*.
249 |
250 | - Use `rqt_graph` for visualizing node dependencies. This is very useful!
251 |
252 | - Use `rqt_plot` to plot certain node values over time (e.g., the x-position
253 | of the turtle), but I don't think I'll be using this; I prefer matplotlib.
254 |
255 | Use `rostopic` to examine nodes. For instance, if I run this and then move the
256 | turtle forward, I get:
257 |
258 | ```
259 | ~/catkin_ws$ rostopic echo /turtle1/cmd_vel
260 | linear:
261 | x: 2.0
262 | y: 0.0
263 | z: 0.0
264 | angular:
265 | x: 0.0
266 | y: 0.0
267 | z: 0.0
268 | ---
269 | linear:
270 | x: 2.0
271 | y: 0.0
272 | z: 0.0
273 | angular:
274 | x: 0.0
275 | y: 0.0
276 | z: 0.0
277 | ---
278 | (and so on)
279 | ```
280 |
281 | so the up key must mean increasing in the turtle's x direction. We can get a
282 | full picture of the publisher/subscriber situation:
283 |
284 | ```
285 | ~/catkin_ws$ rostopic list -v
286 |
287 | Published topics:
288 | * /turtle1/color_sensor [turtlesim/Color] 1 publisher
289 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 publisher
290 | * /rosout [rosgraph_msgs/Log] 4 publishers
291 | * /rosout_agg [rosgraph_msgs/Log] 1 publisher
292 | * /turtle1/pose [turtlesim/Pose] 1 publisher
293 |
294 | Subscribed topics:
295 | * /turtle1/cmd_vel [geometry_msgs/Twist] 2 subscribers
296 | * /rosout [rosgraph_msgs/Log] 1 subscriber
297 | * /statistics [rosgraph_msgs/TopicStatistics] 1 subscriber
298 | ```
299 |
300 | The type of `/turtle1/cmd_vel` is `geometry_msgs/Twist`, as shown above. Looks
301 | like it lists topics followed by message (well, the _type_ of the message).
302 |
303 | Use `rostopic pub [...]` to publish something. In the turtle example, this might
304 | mean commanding the turtle's velocity.
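
For example, this is (if I copied it down right) the tutorial's command for
publishing a single velocity message:

    rostopic pub -1 /turtle1/cmd_vel geometry_msgs/Twist -- '[2.0, 0.0, 0.0]' '[0.0, 0.0, 1.8]'

The `-1` publishes one message and exits; the two arrays are the linear and
angular components of the Twist.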
305 |
306 | So, there's rostopic `pub`, `list`, `echo`, `type`, etc. Straightforward:
307 |
308 | rostopic bw display bandwidth used by topic
309 | rostopic echo print messages to screen
310 | rostopic hz display publishing rate of topic
311 | rostopic list print information about active topics
312 | rostopic pub publish data to topic
313 | rostopic type print topic type
314 |
315 | I don't really need `type` now as it's shown in `list` as seen above. The `hz`
316 | might be useful since (as I know with the dVRK) the camera images of the
317 | workspaces aren't updated instantaneously but with some delay, and that can
318 | affect policies which take the images as input.
319 |
320 |
321 |
322 | *********************************************************
323 | * Tutorial 7: Understanding ROS Services and Parameters *
324 | *********************************************************
325 |
326 | Recall that we can run `rosnode info /rosout` (or pass any other node as the
327 | argument) to get information about a node. That provides us with three things.
328 | We sort of understand publications and subscriptions, but now what about _services_?
329 |
330 | - Another way for nodes to communicate with each other.
331 | - Nodes send _requests_, receive _responses_. (Common sense, right?)
332 |
333 | Like `rostopic`, `rosservice` has lots of command options:
334 |
335 | rosservice list print information about active services
336 | rosservice call call the service with the provided args
337 | rosservice type print service type
338 | rosservice find find services by service type
339 | rosservice uri print service ROSRPC uri
340 |
341 | For example, I see this with `list`:
342 |
343 | ```
344 | :~/catkin_ws$ rosservice list
345 | /clear
346 | /kill
347 | /reset
348 | /rosout/get_loggers
349 | /rosout/set_logger_level
350 | /rostopic_8997_1522274470739/get_loggers
351 | /rostopic_8997_1522274470739/set_logger_level
352 | /rqt_gui_py_node_9061/get_loggers
353 | /rqt_gui_py_node_9061/set_logger_level
354 | /spawn
355 | /teleop_turtle/get_loggers
356 | /teleop_turtle/set_logger_level
357 | /turtle1/set_pen
358 | /turtle1/teleport_absolute
359 | /turtle1/teleport_relative
360 | /turtlesim/get_loggers
361 | /turtlesim/set_logger_level
362 | ```
363 |
364 | We can run `rosservice call /clear`, which calls one of the services in the
365 | list above (this one takes no arguments). We choose `/clear` so that the
366 | background is cleared (we no longer see the turtle's path). This is what I see
367 | from the window that originally started the `turtlesim` package:
368 |
369 | ```
370 | :~/catkin_ws$ rosrun turtlesim turtlesim_node
371 | [ INFO] [1522273700.220832117]: Starting turtlesim with node name /turtlesim
372 | [ INFO] [1522273700.228355538]: Spawning turtle [turtle1] at x=[5.544445], y=[5.544445], theta=[0.000000]
373 | [ WARN] [1522273804.373982014]: Oh no! I hit the wall! (Clamping from [x=7.155886, y=-0.008128])
374 | [ WARN] [1522273804.389975987]: Oh no! I hit the wall! (Clamping from [x=7.163082, y=-0.031181])
375 | (omitted...)
376 | [ WARN] [1522276335.861971290]: Oh no! I hit the wall! (Clamping from [x=9.302450, y=11.089913])
377 | [ WARN] [1522276335.877974885]: Oh no! I hit the wall! (Clamping from [x=9.334450, y=11.088992])
378 | [ INFO] [1522280291.029979359]: Clearing turtlesim.
379 | ```
380 |
381 | We can also use the `/spawn` service to, well, spawn another turtle.
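
From the tutorial (as far as I remember the argument order: x, y, theta, and an
optional name):

    rosservice call /spawn 2 2 0.2 ""

which returns the auto-generated name of the new turtle (turtle2).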
382 |
383 | We also have `rosparam`, which is the parameter analogue to `rosservice` for
384 | service, `rostopic` for topics, etc. We can list the parameters and adjust them,
385 | for instance by changing the background color. (However, it doesn't seem to
386 | actually change my color, even though I am clearly setting all the background
387 | colors to be 0 ... hmmm.)
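
(Hedged guess at what went wrong: if I remember the tutorial right, the
background only redraws after the `/clear` service is called, i.e.:

    rosparam set /background_r 150
    rosservice call /clear

so maybe I just skipped the `/clear` step.)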
388 |
389 | You can save current parameters for easy loading later.
390 |
391 |
392 |
393 | ***********************************************
394 | * Tutorial 8: Using rqt_console and roslaunch *
395 | ***********************************************
396 |
397 | rqt_console (not sure how useful)
398 |
399 | - Along with rqt_logger_level, lets us see a lot of information in GUIs.
400 | - If we ram the turtle in the wall, we can see the warning message.
401 | - Assuming that WARN is within the current "verbosity" level...
402 | - Logging prioritized with: Fatal, Error, Warn, Info, Debug.
403 |
404 | roslaunch (looks _very_ useful, call this each time we start using robots)
405 |
406 | - Note that `roscore` started a "roslaunch server".
407 | - Use this with a _launch_file_ to start nodes in a more scalable way.
408 | - `roslaunch [package] [filename.launch]`
409 | - `roslaunch gscam endoscope.launch`
410 | - Good practice, put in the package: `~/catkin_ws/src/[...]/launch/[...]`
411 | where the second [...] is the `.launch` file with tags.
412 |
413 | ```
414 | <launch>
415 |
416 |   <group ns="turtlesim1">
417 |     <node pkg="turtlesim" name="sim" type="turtlesim_node"/>
418 |   </group>
419 |
420 |   <group ns="turtlesim2">
421 |     <node pkg="turtlesim" name="sim" type="turtlesim_node"/>
422 |   </group>
423 |
424 |   <node pkg="turtlesim" name="mimic" type="mimic">
425 |     <remap from="input" to="turtlesim1/turtle1"/>
426 |     <remap from="output" to="turtlesim2/turtle1"/>
427 |   </node>
428 |
429 | </launch>
430 | ```
431 |
432 | - Above example makes two groups (different names to avoid conflicts), each of
433 | which use a `turtlesim_node` node from the `turtlesim` package.
434 |
435 | - Also makes a new node with type "mimic". So the `<node>` tag must
436 | obviously let one make a new node, which can be assigned to a group if it's
437 | nested within one. Causes the second turtle to mimic the first turtle!
438 |
439 | I see, when you run `roslaunch ...` we get this output:
440 |
441 | ```
442 | daniel@daniel-ubuntu-mac:~/catkin_ws/src/beginner_tutorials/launch$ roslaunch beginner_tutorials turtlemimic.launch
443 | ... logging to /home/daniel/.ros/log/42096978-3383-11e8-9614-7831c1b89008/roslaunch-daniel-ubuntu-mac-4922.log
444 | Checking log directory for disk usage. This may take awhile.
445 | Press Ctrl-C to interrupt
446 | Done checking log file disk usage. Usage is <1GB.
447 |
448 | started roslaunch server http://daniel-ubuntu-mac:43721/
449 |
450 | SUMMARY
451 | ========
452 |
453 | PARAMETERS
454 | * /rosdistro: indigo
455 | * /rosversion: 1.11.21
456 |
457 | NODES
458 | /
459 | mimic (turtlesim/mimic)
460 | /turtlesim1/
461 | sim (turtlesim/turtlesim_node)
462 | /turtlesim2/
463 | sim (turtlesim/turtlesim_node)
464 |
465 | auto-starting new master
466 | process[master]: started with pid [4934]
467 | ROS_MASTER_URI=http://localhost:11311
468 |
469 | setting /run_id to 42096978-3383-11e8-9614-7831c1b89008
470 | process[rosout-1]: started with pid [4947]
471 | started core service [/rosout]
472 | process[turtlesim1/sim-2]: started with pid [4950]
473 | process[turtlesim2/sim-3]: started with pid [4959]
474 | process[mimic-4]: started with pid [4966]
475 | ```
476 |
477 | so we get groups listed at the top level (turtlesim1, turtlesim2) along with the
478 | name of the node after it within the nested stuff.
479 |
480 | BTW: seems like roslaunch starts its own master server, so it is not necessary
481 | to have an existing "roscore" command in another tab. See "auto-starting new
482 | master" above and also:
483 |
484 | https://answers.ros.org/question/217107/does-a-roslaunch-start-roscore-when-needed/
485 |
486 | We can still get lots of relevant information:
487 |
488 | ```
489 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rosnode list
490 | /mimic
491 | /rosout
492 | /turtlesim1/sim
493 | /turtlesim2/sim
494 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rostopic list
495 | /rosout
496 | /rosout_agg
497 | /turtlesim1/turtle1/cmd_vel
498 | /turtlesim1/turtle1/color_sensor
499 | /turtlesim1/turtle1/pose
500 | /turtlesim2/turtle1/cmd_vel
501 | /turtlesim2/turtle1/color_sensor
502 | /turtlesim2/turtle1/pose
503 | ```
504 |
505 | Use `rqt_graph`, as discussed earlier, to understand the launch file.
506 |
507 |
508 |
509 | ************************************************
510 | * Tutorial 9: Using rosed to edit files in ROS *
511 | ************************************************
512 |
513 | A very short one: basically, use `rosed [package_name] [filename]` to edit
514 | files without having to type out their full paths. This would be useful for me
515 | since I got stuck on doing this in my early days of working with the dVRK.
516 | Fortunately it uses vim by default, so I should have no problem using it.
517 |
518 |
519 |
520 | *******************************************
521 | * Tutorial 10: Creating a ROS msg and srv *
522 | *******************************************
523 |
524 | - msg: simple text files that describe the fields of a ROS message. They
525 | are used to generate source code for messages in different languages.
526 | - srv: describes a service, composed of two parts: a request and a response.
527 |
528 | These have their own syntax rules. See tutorial for details. We put them in
529 | `msg` and `srv` directories, and then we must ensure our `package.xml` file will
530 | know to compile and run custom messages, and also change `CMakeLists.txt`.
531 | There's a lot to do for the latter; see tutorial for lines to un-comment.
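
For example, if I remember the tutorial right, `Num.msg` is just one field
(and the srv shown below illustrates the request/response split):

    int64 num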
532 |
533 | The tutorials use a simple `AddTwoInts` service. Details with `rossrv`:
534 |
535 | ```
536 | :~/catkin_ws/src/beginner_tutorials$ rossrv show AddTwoInts
537 | [beginner_tutorials/AddTwoInts]:
538 | int64 a
539 | int64 b
540 | ---
541 | int64 sum
542 |
543 | [rospy_tutorials/AddTwoInts]:
544 | int64 a
545 | int64 b
546 | ---
547 | int64 sum
548 | ```
549 |
550 | - It's located in two places, since this was created with `roscp`.
551 | - The actual _implementation_ of the "add two ints" is located elsewhere.
552 | - Run `catkin_make install` and watch it build successfully. Whew.
553 |
554 | The installation makes C++ (header), Lisp, and Python files. For example:
555 |
556 | /home/daniel/catkin_ws/install/lib/python2.7/dist-packages/beginner_tutorials/msg/_Num.py
557 |
558 | Again, this is _not_ the code implementation (how could it read my mind?) but an
559 | automatically generated file with some known, common methods. Not yet sure what
560 | its purpose is ...
561 |
562 |
563 |
564 | ****************************************************************
565 | * Tutorial 11: Writing a Simple Publisher and Subscriber (C++) *
566 | ****************************************************************
567 | (Skipping)
568 | *******************************************************************
569 | * Tutorial 12: Writing a Simple Publisher and Subscriber (Python) *
570 | *******************************************************************
571 |
572 | After downloading their `talker.py` script, I have this in the package:
573 |
574 | ```
575 | beginner_tutorials/
576 | CMakeLists.txt
577 | package.xml
578 | include/
579 | beginner_tutorials/
580 | launch/
581 | turtlemimic.launch
582 | msg/
583 | Num.msg
584 | scripts/
585 | talker.py
586 | src/
587 | srv/
588 | AddTwoInts.srv
589 | ```
590 |
591 | For the most part just read the tutorial, it goes line-by-line. Above, there is
592 | no node that "receives" the messages sent by the talker, so we write that. It
593 | uses a very simple message type:
594 |
595 | ```
596 | daniel@daniel-ubuntu-mac:~/catkin_ws$ rosmsg show String
597 | [std_msgs/String]:
598 | string data
599 | ```
600 |
601 | with just a `data` argument to fill.
602 |
603 | For classes, look at:
604 |
605 | http://docs.ros.org/indigo/api/rospy/html/rospy.topics.Publisher-class.html
606 | http://docs.ros.org/indigo/api/rospy/html/rospy.topics.Subscriber-class.html
607 |
608 | They only have one method each, "publish" and "unregister", respectively.
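
The gist of the two scripts, roughly as in the tutorial (trimmed down a bit):

```python
#!/usr/bin/env python
# talker.py (trimmed): publishes std_msgs/String on the "chatter" topic.
import rospy
from std_msgs.msg import String

def talker():
    pub = rospy.Publisher('chatter', String, queue_size=10)
    rospy.init_node('talker', anonymous=True)
    rate = rospy.Rate(10)  # 10 Hz
    while not rospy.is_shutdown():
        pub.publish("hello world %s" % rospy.get_time())
        rate.sleep()

# listener.py (trimmed): subscribes to "chatter" and logs what it hears.
def callback(data):
    rospy.loginfo("I heard %s", data.data)

def listener():
    rospy.init_node('listener', anonymous=True)
    rospy.Subscriber('chatter', String, callback)
    rospy.spin()  # keep the node alive until shutdown
```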
609 |
610 |
611 |
612 | **************************************************************
613 | * Tutorial 13: Examining the Simple Publisher and Subscriber *
614 | **************************************************************
615 |
616 | This is really short. Just run the code and see what we get. Make sure `roscore`
617 | is running in a separate tab, though.
618 |
619 |
620 |
621 | **********************************************************
622 | * Tutorial 14: Writing a Simple Service and Client (C++) *
623 | **********************************************************
624 | (Skipping)
625 | *************************************************************
626 | * Tutorial 15: Writing a Simple Service and Client (Python) *
627 | *************************************************************
628 |
629 | Makes the "service" that actually performs the addition. (It's not clear to me
630 | yet why we need this kind of structure.) And then the client. Again, straight
631 | from the tutorial.
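
Roughly what the tutorial's server and client look like (trimmed):

```python
#!/usr/bin/env python
# Server: registers the add_two_ints service and blocks until shutdown.
import rospy
from beginner_tutorials.srv import AddTwoInts, AddTwoIntsResponse

def handle_add_two_ints(req):
    return AddTwoIntsResponse(req.a + req.b)

def add_two_ints_server():
    rospy.init_node('add_two_ints_server')
    rospy.Service('add_two_ints', AddTwoInts, handle_add_two_ints)
    rospy.spin()

# Client: blocks until the service exists, then calls it like a function.
def add_two_ints_client(x, y):
    rospy.wait_for_service('add_two_ints')
    add_two_ints = rospy.ServiceProxy('add_two_ints', AddTwoInts)
    return add_two_ints(x, y).sum
```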
632 |
633 |
634 |
635 | ********************************************************
636 | * Tutorial 16: Examining the Simple Service and Client *
637 | ********************************************************
638 |
639 | Yeah, I got it working.
640 |
641 |
642 |
643 | ************************************************
644 | * Tutorial 17: Recording and playing back data *
645 | ************************************************
646 |
647 | This is the rostopic status after starting this up:
648 |
649 | ```
650 | daniel@daniel-ubuntu-mac:~/catkin_ws/devel$ rostopic list -v
651 |
652 | Published topics:
653 | * /turtle1/color_sensor [turtlesim/Color] 1 publisher
654 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 publisher
655 | * /rosout [rosgraph_msgs/Log] 2 publishers
656 | * /rosout_agg [rosgraph_msgs/Log] 1 publisher
657 | * /turtle1/pose [turtlesim/Pose] 1 publisher
658 |
659 | Subscribed topics:
660 | * /turtle1/cmd_vel [geometry_msgs/Twist] 1 subscriber
661 | * /rosout [rosgraph_msgs/Log] 1 subscriber
662 | ```
663 |
664 | I get the rosbag which records the keypresses:
665 |
666 | ```
667 | daniel@daniel-ubuntu-mac:~/bagfiles$ ls -lh
668 | total 512K
669 | -rw-rw-r-- 1 daniel daniel 511K Mar 29 16:16 2018-03-29-16-15-19.bag
670 | daniel@daniel-ubuntu-mac:~/bagfiles$ vim 2018-03-29-16-15-19.bag
671 | daniel@daniel-ubuntu-mac:~/bagfiles$ rosbag info 2018-03-29-16-15-19.bag
672 | path: 2018-03-29-16-15-19.bag
673 | version: 2.0
674 | duration: 58.6s
675 | start: Mar 29 2018 16:15:19.26 (1522365319.26)
676 | end: Mar 29 2018 16:16:17.84 (1522365377.84)
677 | size: 510.9 KB
678 | messages: 7321
679 | compression: none [1/1 chunks]
680 | types: geometry_msgs/Twist [9f195f881246fdfa2798d1d3eebca84a]
681 | rosgraph_msgs/Log [acffd30cd6b6de30f120938c17c593fb]
682 | turtlesim/Color [353891e354491c51aabe32df673fb446]
683 | turtlesim/Pose [863b248d5016ca62ea2e895ae5265cf9]
684 | topics: /rosout 4 msgs : rosgraph_msgs/Log (2 connections)
685 | /turtle1/cmd_vel 21 msgs : geometry_msgs/Twist
686 | /turtle1/color_sensor 3648 msgs : turtlesim/Color
687 | /turtle1/pose 3648 msgs : turtlesim/Pose
688 | ```
689 |
690 | And I can replay my commands.
691 |
692 |
693 |
694 | ********************************************
695 | * Tutorial 18: Getting started with roswtf *
696 | ********************************************
697 |
698 | Yeah, this just checks whether anything in the ROS setup is misconfigured, and it looks like mine is OK.
699 |
700 |
701 |
702 | ****************************************
703 | * Tutorial 19: Navigating the ROS wiki *
704 | ****************************************
705 |
706 | Pretty simple, hopefully documentation won't be an issue.
707 |
708 |
709 |
710 | ****************************
711 | * Tutorial 20: Where Next? *
712 | ****************************
713 |
714 | Robotics work. :-) Look at our manuals, understand rviz, tf, and moveit.
715 |
--------------------------------------------------------------------------------
/CS61C_Berkeley/CS61C_Lectures.txt:
--------------------------------------------------------------------------------
1 | CS 61C Lecture Review
2 | Fall 2017 Semester
3 |
4 | **********************************
5 | * Lecture 1: Course Introduction *
6 | * Given: August 24, 2017 *
7 | **********************************
8 |
9 | Lecture is about four things, well, three that matter to me: (1) machine
10 | structures, (2) great ideas (in architecture), and (3) how everything is just a
11 | number.
12 |
13 |
14 | Machine Structures
15 |
16 | C is the most popular programming language, followed by Python. Use C to
17 | write software for speed/performance, e.g. embedded systems. EDIT: nope!
18 | That was in F-2016. Now in F-2017, Python has taken over, probably due to
19 | Deep Learning. But C is still in second place.
20 |
21 | This class isn't about C programming, but C is a VERY important language to
22 | know in order to understand the important stuff: the **hardware-software
23 | interface**. It's closer to the hardware than Java or Python.
24 |
25 | Things we'll learn on the software side:
26 | Parallel requests
27 | Parallel threads
28 | Parallel instructions
29 | Parallel data
30 | Hardware descriptions
31 |
32 | and the hardware side:
33 | Logic gates
34 | Main memory
35 | Cores
36 | Caches
37 | Instruction Units
38 |
39 | Looks like the "new version/face" of CS 61C is parallelism, as I should know
40 | from CS 267. Along with computers being on **mobile devices** and in many
41 | other areas, such as cars! So many things have computers and sensors in them
42 | nowadays, that it's mind-blowing.
43 |
44 |
45 | Great Ideas in Architecture
46 |
47 | Abstraction (Phil Guo's one-word description of CS)
48 |
49 | Anything can be represented as a number. But does this mean we WANT
50 | them to be like that? No, we want to program in a "high-level" like C
51 | so that we don't have to trudge through assembly language code.
52 |
53 | We follow this hierarchy:
54 | ==> C
55 | ==> compiler
56 | ==> assembly language (then machine language??)
57 | ==> machine interpretation (note, in F-2017 they're doing RISC-V,
58 | not MIPS, which I think was in S-2017 ...)
59 | ==> architecture implementation (the logic circuit diagram?)
60 | (I don't fully understand assembly/architecture parts)
61 |
62 | Moore's Law (is it still applicable?!?)
63 |
64 | Basic idea: every 2 years (sometimes I've seen it 1.5 years ...) the
65 | number of transistors per chip will double. Transistors are the basic
66 | source of computation in computers, they're the bits of electricity that
67 | turn into 0s and 1s. From Wikipedia:
68 | "A transistor is a semiconductor device used to amplify or switch
69 | electronic signals and electrical power. It is composed of
70 | semiconductor material usually with at least three terminals for
71 | connection to an external circuit. A voltage or current applied to
72 | one pair of the transistor's terminals controls the current through
73 | another pair of terminals. Because the controlled (output) power can
74 | be higher than the controlling (input) power, a transistor can
75 | amplify a signal",
76 | and
77 | "The transistor is the fundamental building block of modern
78 | electronic devices, and is ubiquitous in modern electronic systems."
79 |
80 | However, as one would imagine, if you try to pack more and more
81 | transistors in a smaller area, it will be exponentially more costly, and
82 | there will be issues with heat, as well as limits faced with the laws of
83 | physics.
84 |
85 | Update: the F-2017 edition (after the class break) brought up a graph
86 | from David Patterson's textbook, showing that serial processor
87 | performance was exponential up to the last decade, to which it
88 | flat-lined.
89 |
90 | - Thus, in the "glory days" you could write a program and expect newer
91 | hardware to just be faster. But not anymore. If we tried to cram
92 | things even further, we'd run into problems like quantum effects,
93 | where we don't know if things are really a 0 or a 1 anymore. Uh oh.
94 |
95 | - Now companies (e.g. Apple, Tesla, Samsung, Google, Microsoft) are not
96 | just buying general-purpose Intel chips, but building their own chips.
97 | So it's an exciting time to be a computer architect.
98 |
99 | Principles of Locality (memory hierarchy and caches!!)
100 |
101 | Jim Gray's storage latency analogy. I've seen this one before. It's
102 | really nice. Everyone has a nice joke to play about caches. Main thing
103 | to know is what is actually in the hierarchy:
104 | - Registers
105 | - On-chip cache
106 | - On-board cache
107 | - Main memory (i.e. RAM)
108 | - Hard disk
109 | - Tape and optical robot (not sure what this means)
110 | Also see the pyramid in the notes. It makes sense: the stuff "closer" to
111 | us in the hierarchy just listed above has to be smaller since there's
112 | less room. Thus, registers are cramped in a small space and are limited,
113 | but there's much more room for memory on the hard disk.
114 |
115 | It seems like we have three main caches: L1, L2, and L3. Not sure on the
116 | difference between on-chip vs on-board cache, though. That might be
117 | on-chip (as in on the CPU?) vs on the MOTHERboard. As I (finally!!) now
118 | know from experience, the CPU chip goes in the motherboard in a very
119 | specific spot.
120 |
121 | Parallelism (CS 267!!)
122 |
123 | This is another thing we should do if possible. We can "fork" calls into
124 | several "workers" and then "join" them together later. Professor Katz
125 | mentions the laundry example. He can use the wash. Then the dryer. But
126 | if he's using the dryer, there's no reason why someone can't use the
127 | wash. So this is like stacking things together in a tree-fashion, might
128 | be related to "tricks with trees" from CS 267.
129 |
130 | Also: we'll learn how to do thread programming, using fork() to
131 | split up computation into worker threads, and join() calls to
132 | combine the result.
133 |
134 | Caveat: Amdahl's law. It tries to predict speed-ups from parallelism.
135 | The law states the obvious: if there are parts of an application which
136 | cannot be parallelized, then we can't get "perfect" speedup, which
137 | hypothetically would be a 2x speedup if we had 2x parallelism.
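
For the record, the standard statement of the law (p is the parallelizable
fraction of the program, s is the speedup of that part):

    overall speedup = 1 / ((1 - p) + p / s)

E.g., if p = 0.5, even infinitely many workers cap the overall speedup at 2x.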
138 |
139 | Dependability via Redundancy (should be obvious!)
140 |
141 | The larger our system, the more likely we have individual components
142 | that fail. But when we program, we desperately want to make sure we can
143 | focus on debugging what WE wrote, and NOT the underlying hardware (oh
144 | God).
145 |
146 | Easiest thing to do: take majority vote, this helps to protect against
147 | faulty machines. Prof Katz: this seems silly and expensive, but useful
148 | if we have to send code in space or some other area where it's too
149 | expensive to send repairmen.
150 |
151 | Redundant memory bits as well; these are Error Correcting Codes (ECCs).
152 | Can also do calculations involving the parity of a number (odd vs even)
153 | so we have a spare piece of memory which corrects the expected parity as
154 | needed.
155 |
156 |
157 | Then we switched speakers to Prof. Krste Asanović.
158 |
159 | Higher-level stuff:
160 |
161 | Moore's Law, etc., showed a new paradigm for computer architecture. See
162 | my earlier comments on Moore's Law.
163 |
164 | Then Deep Learning. Yes, I knew it! That's why Deep Learning needs
165 | computer architects, because it's now the hardware and not the algorithm
166 | (After all, we're still doing backpropagation).
167 |
168 | Google has developed a "Tensor Processing Unit" (TPU), a specialized
169 | engine for NN training. Interesting ... I saw Jeff Dean talking about
170 | this recently in his AMA.
171 |
172 | Microsoft has developed "Project Brainwave". Gah, so many new
173 | developments.
174 |
175 | RISC-V Instruction Set Architecture (ISA)
176 |
177 | In F-2017, they are switching to this from MIPS, which was used in
178 | previous iterations of the course. It was designed at Berkeley for
179 | research and education.
180 |
181 | ISA = the language of the processor, or how software is encoded to run
182 | on hardware. Example: think about how an "add" instruction would be
183 | written in bits.
184 |
185 | Why are we using it if it's open source? Because the cool people are
186 | adopting it. Starting now, NVIDIA is using RISC-V in their GPUs. And the
187 | previous popular set, MIPS, is not doing so well; the company that owns
188 | it is apparently up for sale?
189 |
190 |
191 | (Then we switched back to Prof. Katz, and had some stuff about class
192 | administration. Yeah, I won't post any homeworks publicly, they'll be private.)
193 |
194 |
195 | Everything is Just a Number
196 |
197 | Computers represent data as binary values.
198 | - The *bit* is the unit element, either 0 or 1. We're not doing quantum
199 | computing in this class, so we _know_ for certain if a bit is zero or
200 | one.
201 | - Then *bytes* are eight bits, can represent 2^8 = 256 different values.
202 | - A "word" is 4 bytes (i.e. 32 bits), has 2^32 different values, like Java
203 | integers.
204 | - Then there are 64-bit floating point numbers (and 32-bit as well),
205 | numpy can express both though the Theano library encourages 32-bit.
206 | - All of these are built up into longer and more complicated expressions!
207 | - In F-2017, we'll learn how RISC-V encodes computer programs into bits.
208 |
209 | Be sure to MEMORIZE how to convert: (binary <==> decimal). This is so
210 | important to have down cold. I'm definitely intuitively better at going in
211 | the ==> direction, just write the number then underneath, going in REVERSE
212 | direction, do 2^0, 2^1, etc., then multiply by 1s and 0s and add up. Other
213 | direction: keep successively dividing by two (rounding down) and keep track
214 | of parities. Collect (not sum!) the results together at the end.
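
A tiny Python sketch of the decimal ==> binary direction, just to pin the
algorithm down (my own code):

    def to_binary(n):
        bits = []
        while n > 0:
            bits.append(n % 2)   # the parity at each step
            n //= 2              # successively divide by two, rounding down
        return bits[::-1]        # collect the results, in reverse

    to_binary(11)  # [1, 0, 1, 1], i.e. 1011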
215 |
216 | Unfortunately, there's also the hexadecimal notation. That's harder. Now
217 | there are 16 different units, not 2 or 10. It goes from 0 to 9 and then we
218 | note it as A=10, B=11, C=12, D=13, E=14, F=15. Obviously, I wrote the
219 | decimal numbers afterwards, could have easily done the binary version.
220 | - There are also octals, which are base 8 (digits 0 through 7).
221 | - I'll avoid using these whenever possible.
222 |
223 | Make sure to be consistent with putting down "two", "ten", or "hex" as
224 | subscripts after the numbers. It will make it easier to track which is
225 | which.
226 |
227 | How to use these numbers in C?
228 | Use %d for decimal (I know this now!)
229 | Use %x for hexadecimal
230 | Use %o for octal
231 | Might also have to write numbers with a 0x or 0b prefix to indicate
232 | which representation we're using.
233 |
234 | Beyond bytes, we have kilobytes, gigabytes, etc. Notice that marketing will
235 | assume we multiply by 1000, i.e. kilobytes are 1000 bytes. But in reality we
236 | "should" have 1024 bytes per kilobyte. Marketing can get away with not
237 | including that extra 24. Grrr. For the binary system, we use an extra "i",
238 | so it's KiByte, instead of KByte. And 1GB = 1000MB and 1GiB = 1024MiB.
239 | Watch out!
240 |
241 |
242 | **************************************
243 | * Lecture 2: Numbers and C Language *
244 | * Given: August 29, 2017 *
245 | **************************************
246 |
247 | Signed integer representation (Note: this material was originally in the first
248 | lecture in F-2016, but got bumped to the second lecture in F-2017 to make room
249 | for more discussion on why we need computer architects, and also Deep Learning.)
250 |
251 | We need to have negative numbers, so how to handle these?
252 |
253 | First attempt: first digit (well, leading digit, so leftmost) represents
254 | sign, remaining 7 (assuming 8 bits total) are for actual numerical
255 | content, "magnitude". But that's bad --- at least for integers --- since
256 | we have several special cases to consider, and our hardware performance
257 | will suffer.
258 |
259 | Better: two's complement. With 4 bits, have 16 total numbers:
260 | bits (as unsigned):   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
261 | two's complement:     0   1   2   3   4   5   6   7  -8  -7  -6  -5  -4  -3  -2  -1
262 |
263 | Thus, -3 in decimal maps to the same bits as unsigned 13 (1101). This keeps
264 | addition/subtraction rules for binary numbers consistent. Right, this is
265 | StackOverflow: "Two's complement is a clever way of storing integers so that
266 | common math problems are very simple to implement." In other words, the
267 | hardware doesn't have to make any special rules.
268 |
269 | But remember that these are just bits. Regardless of signed or unsigned,
270 | it's bits (four, in this case) that the hardware sees.
271 |
272 | A good analogy with alarm clocks in the lecture, particularly because my
273 | alarm clock requires me to keep incrementing the time before it "starts
274 | over" at the current value. Thus, 3+11=14 in unsigned, but this is
275 | 3-5=-2 in two's complement. Fortunately, the "adder" doesn't care, it
276 | just does the addition the same way, and we interpret it under the
277 | assumption that it's two's complement.
278 |
279 | It's not a "sign+magnitude" representation, because the second part
280 | isn't a "magnitude".
281 |
282 | How to do negation in two's complement: INVERT the bits, then add one.
283 | Don't forget to add one.
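
Quick sanity check in Python (my own; the & 0xF keeps it at 4 bits):

    def neg4(x):
        return (~x + 1) & 0xF  # invert the bits, then add one

    neg4(0b0011)  # 13, i.e. 0b1101, which is -3 in 4-bit two's complement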
284 |
285 | The most significant bit (MSB) also indicates the sign, as in our first
286 | representation, but without the drawback of painful math or the +0 and -0
287 | annoyance of the sign-and-magnitude representation.
288 |
289 | With two's complement, **if signs are different**, no overflow detection
290 | needed. This makes sense, you can't add a positive and a negative number and
291 | get something exceeding your range, that's like a shrinkage factor.
292 |
293 | Adding numbers of different bit widths:
294 | - Unsigned: simply pad zeros at the most significant bits.
295 | - Signed: **sign extension**, pad either all 0s or all 1s, depending on
296 | the current sign of the number.
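
A quick worked example: the 4-bit value 1101 (-3) sign-extends to 8 bits as
1111 1101 (still -3), whereas unsigned 1101 (13) zero-pads to 0000 1101.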
297 |
298 |
299 | Break / This is Not on the Exam
300 |
301 | Prof. Asanović talked about Google's TPU. :-) My God, it's so impressive. It
302 | has an **internal** matrix multiply unit. Ironically, it's useless for
303 | everything **except** for matrix multiplies. Then he talked about the IBM
304 | Mainframe.
305 |
306 |
307 | C Primer
308 |
309 | Remember, we're not giving a tutorial on C, the class is about the
310 | hardware/software interface.
311 |
312 | Bla bla bla hello world. Use printf("") for printing. Don't forget \n
313 | newlines!! Think of System.out.print("") in Java (not the println version).
314 | Also don't forget semicolons. And `#include <stdio.h>`. They use `int
315 | main(void)` whereas I use `int main()` but there's no difference in C++ and
316 | in C the difference is "questionable". I think it doesn't matter for what I
317 | would use. But use int main(void) instead, to clearly specify that the
318 | method doesn't take in any arguments (according to StackOverflow).
319 |
320 | Then compiling using `gcc program.c ; ./a.out`.
321 |
322 | Progression:
323 | [...].c --(compiler)--> [...].o --(linker)--> [a.out]
324 | From source (i.e. text) files to "machine code object files" (whatever those
325 | are) to actual executable files, which is what gets run. The linker pulls in
326 | library code if we're using it (e.g. the standard C library behind stdio.h;
327 | the header itself is handled by the pre-processor, not the linker). It also
328 | combines the .o files from all the [...].c files we wrote, since we should
    | split up our C code into several files to stay sane.
329 |
330 | There's *also* a "pre-processor" that runs before the compiler, which (1)
331 | converts comments to a single space and (2) handles the directives that
332 | start with #. Things like #include paste in the header text, and #define
333 | macros get expanded inline, so if I look at the intermediate file output
334 | from Hello World, it could be very long. But that's OK, it's how C works.
335 | :-)
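    |
    | (With gcc, running `gcc -E` on the hello-world file stops after the
    | pre-processor, so you can inspect that long intermediate output yourself.)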
336 |
337 | Different from interpreted languages, such as Python, which are run
338 | "line-by-line".
339 |
340 | More similar to Java, but Java converts to "byte code", which acts like an
341 | assembly language for the Java virtual machine.
342 |
343 | Advantages:
344 | - Faster. This is why numpy uses a C/C++ "back end"; more on that later
345 | once I better understand it.
346 | - Note that computers can only "run" machine code, or the lowest-level
347 | instructions that it can run. Everything else is one layer of
348 | abstraction upon abstraction. Compilation can get our C code to
349 | machine code in "one shot".
350 |
351 | Disadvantages:
352 | - Long time to compile.
353 | - Need tools like "make" to avoid compiling unchanged code. OK maybe
354 | this isn't a real disadvantage, since we should be using make by
355 | default.
356 | - Architecture- and operating-system-specific.
357 |
358 | C Type Declarations
359 |
360 | Examples:
361 | int a;
362 | float b;
363 | char c;
364 | Like Java, have to declare beforehand, and the type can't change.
365 | (Usually, floats are 32 bits and doubles are 64 bits.)
366 |
367 | Can do:
368 | float pi = 3.14; /* ok this is mathematically awful but w/e */
369 | But probably better to have it as a constant:
370 | const float pi = 3.14;
371 |
372 | For 'unsigned' stuff, just put that before the type, e.g. 'unsigned long'.
373 |
374 | Enumerations:
375 | typedef enum {red, green, blue} Color;
376 | We can then declare a variable of that type and `switch` on it:
377 | Color pants = green; /* to use one example ... */
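    |
    | A minimal sketch of that switch (mine, not the lecture's):
    |
    |     #include <stdio.h>
    |
    |     typedef enum {red, green, blue} Color;
    |
    |     int main(void) {
    |         Color pants = green;
    |         switch (pants) {
    |             case red:   printf("red\n");   break;
    |             case green: printf("green\n"); break;
    |             case blue:  printf("blue\n");  break;
    |         }
    |         return 0;
    |     }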
378 |
379 | AH, now it's clear: in Java we KNOW ints are 32 bits, but in C it could be
380 | 16, 32, or 64 bits. Though on my system it's 32, and I think that makes the
381 | most sense.
382 |     To check, print sizeof(int). I get '4', which is the size in BYTES,
383 |     so 4 bytes = 32 bits.
384 |
385 | No boolean data type! I learned this the hard way. (C++ has bool, and C99
386 | added _Bool via <stdbool.h>, but classic C has none.) 0 is false, anything
    | else is true (but by convention use 1 for true).
387 |
388 | Standard function definitions, like Java. But it looks like we don't need to
389 | use 'public...' or 'public static...'.
390 |
391 | Uninitialized variables: if you don't initialize them, they take on whatever
392 | value happens to be in memory, i.e. garbage. Their for-loop example prints
393 | different values of (uninitialized) x because another function messes around
394 | with the memory on the stack. I think if that weren't there, you would get
395 | the same "garbage" value for x. [Update: heh, a student asked the same
396 | question, but the Prof. said we should not rely on that. Which is fine, this
397 | was only a theoretical question.]
398 |
399 | structs:
400 | - Groups of variables
401 | - Like Java classes, but no methods
402 | - one-liner example syntax:
403 | typedef struct {int x, y;} Point;
404 |   - then to create one (full sketch below):
405 | Point p = { 77, -8 };
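    |
    | Putting the struct pieces together (my own sketch):
    |
    |     #include <stdio.h>
    |
    |     typedef struct {int x, y;} Point;
    |
    |     int main(void) {
    |         Point p = { 77, -8 };
    |         p.x = 3;                     /* fields accessed with '.' */
    |         printf("%d %d\n", p.x, p.y); /* prints: 3 -8 */
    |         return 0;
    |     }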
406 |
407 | Concluding Thoughts
408 |
409 | NO CLASSES in C! You need C++ for that, according to my own experience, and
410 | StackOverflow. For a while, C++ was known as "C with classes". But now it's
411 | just bloated. In C, simulate some class functionality by using structs.
412 | Thus, C shouldn't qualify as "object-oriented".
413 |
414 | The other main programmatic difference from Java (the first being no classes)
415 | is that in C we have explicit pointers. Let's discuss that in the next lecture.
416 |
417 | There are additional differences in the compilation, obviously.
418 |
419 |
420 | TODO BELOW ... (for F-2017)
421 |
422 | ****************************
423 | * Lecture 3: Pointers *
424 | * Given: September 1, 2016 *
425 | ****************************
426 |
427 | Pointers in C
428 |
429 | Processor vs Memory in computer, two different components.
430 | Former has registers, ALU, etc.
431 | Latter contains various bytes that form the programs, data, etc.
432 |
433 | Don't confuse memory address and a value. It's like humans are the 'values'
434 | living in their homes as 'memory addresses'. A POINTER is a MEMORY ADDRESS.
435 | When we say int a; then a = -85;, the memory address is some unknown
436 | integer and the value is -85.
437 |
438 | Know the differences:
439 |     int *x;     // x holds the address of an int
440 |     int y = 9;  // y is an int with value 9
441 |     x = &y;     // assigns the *address of* y (almost certainly not 9) to x
442 |     int z = *x; // assigns the *value pointed to by* x (here, 9) to z
443 |     *x = -7;    // assigns -7 to whatever x points at (namely, y)
444 |
445 | Interesting, I get x=1505581164 y=-7 z=9 as the printf output. When we store
446 | the address of y in x and then modify what x points at, that *also* modifies
447 | the value of y itself (they're the same memory). Interesting ... and a bit
448 | of a pain to track.
449 |
450 | Another thing: the type of x is 'int*', NOT 'int'. Watch out! It might be
451 | helpful to visualize this the way CS 61C does with its charts. Can write
452 | int* pi; or int *pi;; the class seems to use the latter. The latter is less
453 | ambiguous: compare char *a, *b; with char* a, b; -- in the second, that 'b'
454 | is NOT a pointer to a char, just a char.
455 |
456 | Use generic pointers (void *) for applications such as allocating or freeing
457 | memory, where the code may need to point to arbitrary stuff.
458 |
459 | Have pointers to structs as well, which is where we get the arrow syntax
460 | "->" that I've seen before.
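    |
    | For instance (my own sketch, reusing the Point struct from last lecture):
    |
    |     #include <stdio.h>
    |
    |     typedef struct {int x, y;} Point;
    |
    |     int main(void) {
    |         Point p = { 1, 2 };
    |         Point *pp = &p;
    |         pp->x = 5;           /* same as (*pp).x = 5 */
    |         printf("%d\n", p.x); /* prints: 5 */
    |         return 0;
    |     }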
461 |
462 | Another trick: *(&a) == a, I believe (take the address, then dereference it).
463 |
464 | One thing: if we do '*pa = 5', this is NOT assigning to 'pa' but rather to
465 | '*pa', the thing pa points at. It rarely makes sense to assign directly to
466 | 'pa' unless we actually know a memory address. Do we really want to gamble
467 | that '5' is a valid _memory_address_ rather than a _value_?
468 |
469 | Functions
470 |     These have pointers too. For arguments:
471 |         void foo(int x, int *p) { ... }
472 |     To call it, use:
473 |         foo(a, &b);
474 |     where a and b are both ints. C is always pass-by-value, but passing the
475 |     pointer's value effectively "passes b by reference". So it's like Java
476 |     with object references. There are a ton of blogs about this online.
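    |
    |     The classic illustration (my sketch, not from the lecture) is swap:
    |
    |         #include <stdio.h>
    |
    |         /* the caller's variables are modified through the pointers */
    |         void swap(int *a, int *b) {
    |             int tmp = *a;
    |             *a = *b;
    |             *b = tmp;
    |         }
    |
    |         int main(void) {
    |             int x = 1, y = 2;
    |             swap(&x, &y);
    |             printf("%d %d\n", x, y); /* prints: 2 1 */
    |             return 0;
    |         }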
477 |
478 | PS: I really like their four-column table approach, really helps
479 |
480 | Arrays in C (syntactic sugar for pointers, really)
481 |
482 | Several ways to declare basic arrays:
483 |     int a[5];          // five-integer array, but contents are garbage
484 |     int b[] = {1,2,3}; // explicitly assign elements, not garbage =)
485 |
486 | In the memory diagram: an array forms a contiguous block of memory, with
487 | index 0 at the bottom and indices increasing as we proceed up.
488 |
489 | #1 way we can shoot ourselves in the foot: no array bounds checking.
490 | So remember array sizes, e.g. by using:
491 | const int ARRAY_SIZE = 10;
492 | and then using that ARRAY_SIZE throughout the program. Don't repeat
493 | yourself!
494 |
495 | Helpful to also use sizeof() operator to get number of bytes. I use this
496 | frequently. But we can't assume anything about the hardware, other than
497 | sizeof(char) == 1. Don't assume: use sizeof(...) instead!
498 |
499 | Pointer Arithmetic
500 |
501 | PS: computers use byte addresses, so think of the memory for an int as
502 | taking up four slots, because (at least in one example and on my machine) C
503 | ints are 4 bytes.
504 |
505 | I see, we can do stuff like:
506 |     char c[] = {'a','b'};
507 |     char *pc = c; // from webcast, also same as &(c[0])
508 | so pc now has type char*, and *pc == 'a'. If we do *pc++; then *pc == 'b':
509 | the POINTER is incremented (postfix ++ binds to pc, not to *pc), not the
510 | value pointed to. Yeah, it's confusing; here we really want the address.
511 |
512 | The array name acts like a pointer to the 0th element of the array.
513 |     char *pstr;
514 |     char astr[];
515 | are nearly identical, except we can do pstr++ while we can't do astr++.
516 | ALSO: astr[2] == *(astr+2).
517 |
518 | OH I see, when we do pc++ the compiler actually adds sizeof(...) and takes
519 | care of that logic for us; it doesn't really "add one". Thanks!
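    |
    | A small sketch of my own to make that visible (addresses will vary):
    |
    |     #include <stdio.h>
    |
    |     int main(void) {
    |         int a[3] = {10, 20, 30};
    |         int *p = a;
    |         /* the two addresses differ by sizeof(int), typically 4 */
    |         printf("%p %p\n", (void *)p, (void *)(p + 1));
    |         printf("%d\n", *(p + 2)); /* prints: 30, same as a[2] */
    |         return 0;
    |     }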
520 |
521 | Bad style to interchange arrays and pointers.
522 |
523 | For functions, you can declare the array parameter in either of these ways:
524 |     void foo(int array[], unsigned int size);
525 |     void foo(int *array, unsigned int size);
526 |
527 | Be careful when doing sizeof(a) with 'a' an array: if the array was passed
528 | as a function parameter, it has decayed to a pointer, which is usually 8
529 | bytes on modern 64-bit machines. But on the original array variable, e.g.
530 | int a[10], sizeof(a) actually gives 10*sizeof(int). Weird, but consistent.
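    |
    | A sketch of my own showing the difference (sizes assume a 64-bit machine):
    |
    |     #include <stdio.h>
    |
    |     void foo(int arr[]) {
    |         /* 'arr' has decayed to int*, so this is pointer size (often 8) */
    |         printf("%zu\n", sizeof(arr));
    |     }
    |
    |     int main(void) {
    |         int a[10];
    |         printf("%zu\n", sizeof(a)); /* whole array: 10 * sizeof(int) */
    |         foo(a);
    |         return 0;
    |     }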
531 |
532 | These make no sense and are also illegal; don't do the following:
533 | - Add two pointers
534 | - Multiply two pointers
535 | - Subtract a pointer from an integer
536 | We CAN, however, compare pointers to NULL, for instance (in C it's the
537 | all-caps macro NULL, not a keyword).
538 |
539 | Pointers to pointers also exist. Oh no.
540 |
541 | Strings and Main
542 |
543 | C strings are "null-terminated character arrays":
544 |     char s[] = "abc"; /* double quotes; stored as 'a','b','c','\0' */
545 | To find the length, iterate through the string and increment an index,
546 | detecting the end of the string with the null character '\0'.
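    |
    | A minimal sketch of that loop (my_strlen is my own hypothetical name):
    |
    |     #include <stdio.h>
    |
    |     /* count characters until the '\0' terminator */
    |     int my_strlen(const char *s) {
    |         int n = 0;
    |         while (s[n] != '\0')
    |             n++;
    |         return n;
    |     }
    |
    |     int main(void) {
    |         char s[] = "abc";
    |         printf("%d\n", my_strlen(s)); /* prints: 3 */
    |         return 0;
    |     }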
547 |
548 | Don't forget the alternative way of writing main() with arguments:
549 |     int main(int argc, char *argv[]) {...}
550 | argv is a POINTER (of type char **) to an array of char* strings (the
551 | arguments from the command line). The argc is simply the number of
552 | arguments.
553 |
554 | When we run ./a.out, the './a.out' part is argv[0], other arguments
555 | after that go in later components, in order. It's similar to Python.
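    |
    |     A sketch of my own to see this (compile, then run ./a.out foo bar):
    |
    |         #include <stdio.h>
    |
    |         int main(int argc, char *argv[]) {
    |             for (int i = 0; i < argc; i++)
    |                 printf("argv[%d] = %s\n", i, argv[i]);
    |             return 0;
    |         }
    |
    |     It prints argv[0] = ./a.out, argv[1] = foo, argv[2] = bar.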
556 |
557 | Concluding Remarks
558 |
559 | Pointers are the same as (machine) memory addresses.
560 | Except for void*, pointers know the type (and hence size) of the objects
561 | they point to. (Relatedly, sizeof(a) for 'int a[10]' is known because the
    | array's length is part of its type at compile time.)
562 | Pointers are powerful, but dangerous without careful planning.
563 |
564 |
565 | ********************************
566 | * Lecture 4: Memory Management *
567 | * Given: September 6, 2016 *
568 | ********************************
569 |
570 | TODO
571 |
--------------------------------------------------------------------------------