├── lab3-hist.png ├── controlflow.png ├── README.md ├── lab4.md ├── lab2.md ├── lab3.md └── lab1.md /lab3-hist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MIT-DB-Class/course-info-2018/HEAD/lab3-hist.png -------------------------------------------------------------------------------- /controlflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MIT-DB-Class/course-info-2018/HEAD/controlflow.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | course-info 2 | =========== 3 | 4 | GitHub Repo for http://db.csail.mit.edu/6.830/ 5 | 6 | We will be using git, a source code control tool, for labs in 6.830. This will 7 | allow you to download the code for the labs, and also submit the labs in a 8 | standardized format that will streamline grading. 9 | 10 | You will also be able to use git to commit your progress on the labs as you go. 11 | 12 | Your course git repository will be hosted on GitHub. GitHub is a 13 | website that hosts git servers for thousands of open source projects. In 14 | our case, your code will be in a private repository that is visible only to you 15 | and course staff. 16 | 17 | This document describes what you need to do to get started with git, and how to 18 | download and upload 6.830/6.814 labs via GitHub. 
19 | 20 | ## Contents 21 | 22 | - [Learning Git](#learning-git) 23 | - [Setting up GitHub](#setting-up-github) 24 | - [Installing Git](#installing-git) 25 | - [Setting up Git](#setting-up-git) 26 | - [Getting Newly Released Labs](#getting-newly-released-labs) 27 | - [Submitting Your Labs](#submitting-your-labs) 28 | - [Word of Caution](#word-of-caution) 29 | - [Help!](#help) 30 | 31 | 32 | ## Learning Git 33 | 34 | Numerous guides on using Git are available. They range from being 35 | interactive to just text-based. Find one that works and experiment; making 36 | mistakes and fixing them is a great way to learn. Here is a link to resources 37 | that GitHub suggests: 38 | [https://help.github.com/articles/what-are-other-good-resources-for-learning-git-and-github][resources]. 39 | 40 | If you have no experience with git, you may find the following web-based 41 | tutorial helpful: [Try Git](https://try.github.io/levels/1/challenges/1). 42 | 43 | ## Setting Up GitHub 44 | 45 | Now that you have a basic understanding of Git, it's time to get started with GitHub. 46 | 47 | 0. Install git. (See below for suggestions). 48 | 49 | 1. If you don't already have an account, sign up for one here: [https://github.com/join][join]. 50 | 51 | If you filled out [this form](https://goo.gl/forms/FZPsfP5DQTTzffdC3), 52 | then you should now have a repository set up just for your lab solutions. This 53 | should be called `homework-solns-2018-` and located in the 54 | MIT-DB-Class organization. 55 | 56 | This is what you'll set up in the next section to allow you to write your 57 | lab answers and submit them. 58 | 59 | ### Installing git 60 | 61 | The instructions are tested on bash/linux environments. Installing git should be 62 | a simple `apt-get / yum / etc install`. 63 | 64 | Instructions for installing git on Linux, OSX, or Windows can be found at 65 | [GitBook: 66 | Installing](http://git-scm.com/book/en/Getting-Started-Installing-Git). 
67 | 68 | If you are using Eclipse, many versions come with git configured. The 69 | instructions will be slightly different than the command line instructions 70 | listed but will work for any OS. Detailed instructions can be found at [EGit 71 | User Guide](http://wiki.eclipse.org/EGit/User_Guide) or [EGit 72 | Tutorial](http://eclipsesource.com/blogs/tutorials/egit-tutorial). 73 | 74 | 75 | ## Setting Up Git 76 | 77 | You should have Git installed from the previous section. 78 | 79 | 1. The first thing we have to do is to clone the current lab repository by issuing the following commands on the command line: 80 | 81 | ```bash 82 | $ git clone git@github.com:MIT-DB-Class/simple-db-hw.git 83 | ``` 84 | 85 | If you get an error doing clone, most likely the cause is that you just 86 | haven't finished setting up your GitHub account. You just need to [setup an SSH 87 | key][ssh-key] to allow pushing and pulling over SSH. 88 | 89 | This will make a complete replica of the lab repository locally. Now we are 90 | going to change it to point to your personal repository that was created for you 91 | in the previous section. 92 | 93 | Change your working path to your newly cloned repository: 94 | 95 | ```bash 96 | $ cd simple-db-hw/ 97 | ``` 98 | 99 | 2. By default the remote called `origin` is set to the location that you cloned the repository from. You should see the following: 100 | 101 | ```bash 102 | $ git remote -v 103 | origin git@github.com:MIT-DB-Class/simple-db-hw.git (fetch) 104 | origin git@github.com:MIT-DB-Class/simple-db-hw.git (push) 105 | ``` 106 | 107 | We don't want that remote to be the origin. Instead, we want to change it to point to your repository. 
To do that, issue the following command: 108 | 109 | ```bash 110 | $ git remote rename origin upstream 111 | ``` 112 | 113 | And now you should see the following: 114 | 115 | ```bash 116 | $ git remote -v 117 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (fetch) 118 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (push) 119 | ``` 120 | 121 | 3. Lastly, we need to give your repository a new `origin`, since it is lacking one. Issue the following command, substituting your athena username: 122 | 123 | ```bash 124 | $ git remote add origin git@github.com:MIT-DB-Class/homework-solns-2018-.git 125 | ``` 126 | 127 | If you have an error that looks like the following: 128 | 129 | ``` 130 | Could not rename config section 'remote.[old name]' to 'remote.[new name]' 131 | ``` 132 | 133 | Or this error: 134 | 135 | ``` 136 | fatal: remote origin already exists. 137 | ``` 138 | 139 | This appears to happen to some users, depending on the version of Git they are using. To fix it, just issue the following command: 140 | 141 | ```bash 142 | $ git remote set-url origin git@github.com:MIT-DB-Class/homework-solns-2018-.git 143 | ``` 144 | 145 | This solution was found on [StackOverflow](http://stackoverflow.com/a/2432799) thanks to [Cassidy Williams](https://github.com/cassidoo). 146 | 147 | For reference, your final `git remote -v` should look like the following when it's set up correctly: 148 | 149 | 150 | ```bash 151 | $ git remote -v 152 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (fetch) 153 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (push) 154 | origin git@github.com:MIT-DB-Class/homework-solns-2018-.git (fetch) 155 | origin git@github.com:MIT-DB-Class/homework-solns-2018-.git (push) 156 | ``` 157 | 158 | 4. 
Let's test it out by pushing your master branch to GitHub: 159 | 160 | ```bash 161 | $ git push -u origin master 162 | ``` 163 | 164 | You should see something like the following: 165 | 166 | ``` 167 | Counting objects: 59, done. 168 | Delta compression using up to 4 threads. 169 | Compressing objects: 100% (53/53), done. 170 | Writing objects: 100% (59/59), 420.46 KiB | 0 bytes/s, done. 171 | Total 59 (delta 2), reused 59 (delta 2) 172 | remote: Resolving deltas: 100% (2/2), done. 173 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 174 | * [new branch] master -> master 175 | Branch master set up to track remote branch master from origin. 176 | ``` 177 | 178 | If you get an error when pushing, most likely the cause is that you just haven't finished setting up your GitHub account. You just need to [setup an SSH key][ssh-key] to allow pushing and pulling over SSH. 179 | 180 | 5. That last command was a bit special and only needs to be run the first time to set up the remote tracking branch. Now we should be able to just run `git push` without the arguments. Try it and you should get the following: 181 | 182 | ```bash 183 | $ git push 184 | Everything up-to-date 185 | ``` 186 | 187 | If you don't know Git that well, this probably seemed very arcane. Just keep 188 | using Git and you'll understand more and more. You aren't required to use 189 | commands like commit and push as you develop your labs, but will find them 190 | useful for debugging. We'll provide explicit instructions on how to use these 191 | commands to actually upload your final lab solution. 192 | 193 | ## Getting Newly Released Labs 194 | 195 | (You don't need to follow these instructions until Lab 1.) 196 | 197 | Pulling in newly released labs or previous lab solutions should be easy as long as you set up your repository based on the instructions in the last section. 198 | 199 | 1. 
All new labs and previous lab solutions will be posted to the [labs](https://github.com/MIT-DB-Class/simple-db-hw) repository in the class organization. 200 | 201 | Check it, as well as Piazza's announcements, periodically for updates on when new labs are released. 202 | 203 | 2. Once a lab is released, pull in the changes from within your simple-db-hw directory: 204 | 205 | ```bash 206 | $ git pull upstream master 207 | ``` 208 | 209 | **OR** if you wish to be more explicit, you can `fetch` first and then `merge`: 210 | 211 | ```bash 212 | $ git fetch upstream 213 | $ git merge upstream/master 214 | ``` 215 | Now push your master branch: 216 | ```bash 217 | $ git push origin master 218 | ``` 219 | 220 | 3. If you've followed the instructions in each lab, you should have no merge conflicts and everything should be peachy. 221 | 222 | ## Submitting Your Labs 223 | 224 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab#-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 225 | 226 | You need to explicitly add any other files you create, such as new *.java files. 227 | 228 | The criterion for your lab being submitted on time is that your code must be 229 | **tagged** and **pushed** by the date and time. This means that if one of the 230 | TAs or the instructor were to open up GitHub, they would be able to see your 231 | solutions on the GitHub web page. 232 | 233 | Just because your code has been committed on your local machine does not mean 234 | that it has been **submitted**; it needs to be on GitHub. 235 | 236 | There is a bash script `turnInLab1.sh` in the root level directory of 237 | simple-db-hw that commits your changes, deletes any prior tag for the current 238 | lab, tags the current commit, and pushes the tag to GitHub. 
If you are using 239 | Linux or Mac OSX, you should be able to run the following: 240 | 241 | ```bash 242 | $ ./turnInLab1.sh 243 | ``` 244 | 245 | You should see something like the following output: 246 | 247 | ```bash 248 | $ ./turnInLab1.sh 249 | error: tag 'lab1submit' not found. 250 | remote: warning: Deleting a non-existent ref. 251 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 252 | - [deleted] lab1submit 253 | [master 7a26701] Lab 1 254 | 1 file changed, 0 insertions(+), 0 deletions(-) 255 | create mode 100644 aaa 256 | Counting objects: 3, done. 257 | Delta compression using up to 4 threads. 258 | Compressing objects: 100% (3/3), done. 259 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 260 | Total 3 (delta 1), reused 0 (delta 0) 261 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 262 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 263 | 069856c..7a26701 master -> master 264 | * [new tag] lab1submit -> lab1submit 265 | ``` 266 | 267 | 268 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 1 as follows (*replace lab 1 with the correct lab ID for later labs*): 269 | 270 | 1. Look at your current repository status. 271 | 272 | ```bash 273 | $ git status 274 | ``` 275 | 276 | 2. Add and commit your code changes (if they aren't already added and committed). 277 | 278 | ```bash 279 | $ git commit -a -m 'Lab 1' 280 | ``` 281 | 282 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 283 | 284 | ```bash 285 | $ git tag -d lab1submit 286 | $ git push origin :refs/tags/lab1submit 287 | ``` 288 | 289 | 4. Tag your last commit as the lab to be graded (*again, update the lab ID for later labs*) 290 | ```bash 291 | $ git tag -a lab1submit -m 'submit lab 1' 292 | ``` 293 | 294 | 5. This is the most important part: **push** your solutions to GitHub. 
295 | 296 | ```bash 297 | $ git push origin master --tags 298 | ``` 299 | 300 | 6. The last thing that we strongly recommend you do is to go to the [MIT-DB-Class] organization page on GitHub to make sure that we can see your solutions. 301 | 302 | Just navigate to your repository and check that your latest commits are on GitHub. You should also be able to check 303 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab1submit` 304 | 305 | 306 | ## Word of Caution 307 | 308 | Git is a distributed version control system. This means everything operates offline until you run `git pull` or `git push`. This is a great feature. 309 | 310 | The bad thing is that you may forget to `git push` your changes. This is why we **strongly** suggest that you check GitHub to be sure that what you want us to see matches up with what you expect. 311 | 312 | ## Help! 313 | 314 | If at any point you need help with setting all this up, feel free to reach out to one of the TAs or the instructor. Their contact information can be found on the [course homepage](http://db.csail.mit.edu/6.830/). 315 | 316 | [join]: https://github.com/join 317 | [resources]: https://help.github.com/articles/what-are-other-good-resources-for-learning-git-and-github 318 | [ssh-key]: https://help.github.com/articles/generating-ssh-keys 319 | 320 | -------------------------------------------------------------------------------- /lab4.md: -------------------------------------------------------------------------------- 1 | # 6.830 Lab 4: SimpleDB Transactions 2 | 3 | **Assigned: Friday, October 26, 2018**
4 | **Due: Friday, November 9, 2018 11:59 PM EDT** 5 | 6 | 7 | In this lab, you will implement a simple locking-based 8 | transaction system in SimpleDB. You will need to add lock and 9 | unlock calls at the appropriate places in your code, as well as 10 | code to track the locks held by each transaction and grant 11 | locks to transactions as they are needed. 12 | 13 | 14 | The remainder of this document describes what is involved in 15 | adding transaction support and provides a basic outline of how 16 | you might add this support to your database. 17 | 18 | 19 | 20 | As with the previous lab, we recommend that you start as early as possible. 21 | Locking and transactions can be quite tricky to debug! 22 | 23 | ## 1. Getting started 24 | 25 | You should begin with the code you submitted for Lab 3 (if you did not 26 | submit code for Lab 3, or your solution didn't work properly, contact us to 27 | discuss options). Additionally, we are providing extra test cases 28 | for this lab that are not in the original code distribution you received. We reiterate 29 | that the unit tests we provide are to help guide your implementation along, 30 | but they are not intended to be comprehensive or to establish correctness. 31 | 32 | 33 | You will need to add these new files to your release. The easiest way 34 | to do this is to change to your project directory (probably called simple-db-hw) 35 | and pull from the master GitHub repository: 36 | 37 | ``` 38 | $ cd simple-db-hw 39 | $ git pull upstream master 40 | ``` 41 | 42 | 43 | ## 2. Transactions, Locking, and Concurrency Control 44 | 45 | Before starting, 46 | you should make sure you understand what a transaction is and how 47 | rigorous two-phase locking (which you will use to ensure isolation and 48 | atomicity of your transactions) works. 49 | 50 | In the remainder of this section, we briefly overview these concepts 51 | and discuss how they relate to SimpleDB. 52 | 53 | ### 2.1. 
Transactions 54 | 55 | A transaction is a group of database actions (e.g., inserts, deletes, 56 | and reads) that are executed *atomically*; that is, either all of 57 | the actions complete or none of them do, and it is not apparent to an 58 | outside observer of the database that these actions were not completed 59 | as a part of a single, indivisible action. 60 | 61 | ### 2.2. The ACID Properties 62 | 63 | To help you understand 64 | how transaction management works in SimpleDB, we briefly review how 65 | it ensures that the ACID properties are satisfied: 66 | 67 | * **Atomicity**: Rigorous two-phase locking and careful buffer management 68 | ensure atomicity. 69 | * **Consistency**: The database is transaction consistent by virtue of 70 | atomicity. Other consistency issues (e.g., key constraints) are 71 | not addressed in SimpleDB. 72 | * **Isolation**: Rigorous two-phase locking provides isolation. 73 | * **Durability**: A FORCE buffer management policy ensures 74 | durability (see Section 2.3 below). 75 | 76 | 77 | ### 2.3. Recovery and Buffer Management 78 | 79 | To simplify your job, we recommend that you implement a NO STEAL/FORCE 80 | buffer management policy. 81 | 82 | As we discussed in class, this means that: 83 | 84 | * You shouldn't evict dirty (updated) pages from the buffer pool if they 85 | are locked by an uncommitted transaction (this is NO STEAL). 86 | * On transaction commit, you should force dirty pages to disk (e.g., 87 | write the pages out) (this is FORCE). 88 | 89 | 90 | To further simplify your life, you may assume that SimpleDB will not crash 91 | while processing a `transactionComplete` command. Note that 92 | these three points mean that you do not need to implement log-based 93 | recovery in this lab, since you will never need to undo any work (you never evict 94 | dirty pages) and you will never need to redo any work (you force 95 | updates on commit and will not crash during commit processing). 96 | 97 | ### 2.4. 
Granting Locks 98 | 99 | You will need to add calls to SimpleDB (in `BufferPool`, 100 | for example) that allow a caller to request or release a (shared or 101 | exclusive) lock on a specific object on behalf of a specific 102 | transaction. 103 | 104 | 105 | 106 | We recommend locking at *page* granularity, though you should be able 107 | to implement locking at *tuple* granularity if you wish (please do not 108 | implement table-level locking). The rest of this document and our unit 109 | tests assume page-level locking. 110 | 111 | 112 | You will need to create data structures that keep track of which locks 113 | each transaction holds and that check to see if a lock should be granted 114 | to a transaction when it is requested. 115 | 116 | You will need to implement shared and exclusive locks; recall that these 117 | work as follows: 118 | 119 | * Before a transaction can read an object, it must have a shared lock on it. 120 | * Before a transaction can write an object, it must have an exclusive lock on it. 121 | * Multiple transactions can have a shared lock on an object. 122 | * Only one transaction may have an exclusive lock on an object. 123 | * If transaction *t* is the only transaction holding a shared lock on 124 | an object *o*, *t* may *upgrade* 125 | its lock on *o* to an exclusive lock. 126 | 127 | 128 | 129 | 130 | If a transaction requests a lock that it should not be granted, your code 131 | should *block*, waiting for that lock to become available (i.e., be 132 | released by another transaction running in a different thread). 133 | 134 | 135 | 136 | You need to be especially careful to avoid race conditions when 137 | writing the code that acquires locks -- think about how you will 138 | ensure that correct behavior results if two threads request the same 139 | lock at the same time (you may wish to read about 141 | Synchronization in Java). 
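To make the grant rules above concrete, here is a minimal sketch of the grant-decision logic. All class, field, and method names here are invented for illustration; the lab leaves the actual design, including how to block waiters and release locks, up to you.

```java
import java.util.*;

// Sketch of the shared/exclusive grant rules (names invented; a real
// lock manager would block waiters instead of returning false, and
// would also support releasing locks).
class LockTable {
    // pageId -> transactions currently holding a lock on that page
    private final Map<Integer, Set<Integer>> holders = new HashMap<>();
    // pageId -> true if the lock currently held on that page is exclusive
    private final Map<Integer, Boolean> exclusive = new HashMap<>();

    // Decide whether tid's request could be granted right now.
    synchronized boolean canGrant(int tid, int pageId, boolean wantExclusive) {
        Set<Integer> hs = holders.getOrDefault(pageId, Collections.emptySet());
        if (hs.isEmpty()) return true;                        // unlocked page
        if (hs.size() == 1 && hs.contains(tid)) return true;  // sole holder: re-grant or upgrade
        if (exclusive.getOrDefault(pageId, false)) return false; // held exclusively by another tx
        return !wantExclusive;                                // shared is compatible with shared
    }

    // Record a granted lock (callers would check canGrant first).
    synchronized void grant(int tid, int pageId, boolean wantExclusive) {
        holders.computeIfAbsent(pageId, k -> new HashSet<>()).add(tid);
        exclusive.put(pageId, wantExclusive);
    }
}
```

Note that the methods are `synchronized` so that two threads requesting the same lock at the same time serialize through the table, which is one way to avoid the race conditions mentioned above.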
142 | 143 | *** 144 | 145 | **Exercise 1.** 146 | 147 | Write the methods that acquire and release locks in BufferPool. Assuming 148 | you are using page-level locking, you will need to complete the following: 149 | 150 | * Modify getPage() to block and acquire the desired lock 151 | before returning a page. 152 | * Implement releasePage(). This method is primarily used 153 | for testing, and at the end of transactions. 154 | * Implement holdsLock() so that logic in Exercise 2 can 155 | determine whether a page is already locked by a transaction. 156 | 157 | 158 | 159 | You may find it helpful to define a class that is responsible for 160 | maintaining state about transactions and locks, but the design decision is up to 161 | you. 162 | 163 | 164 | You may need to implement the next exercise before your code passes 165 | the unit tests in LockingTest. 166 | 167 | *** 168 | 169 | 170 | ### 2.5. Lock Lifetime 171 | 172 | You will need to implement rigorous two-phase locking. This means that 173 | transactions should acquire the appropriate type of lock on any object 174 | before accessing that object and shouldn't release any locks until after 175 | the transaction commits. 176 | 177 | 178 | 179 | Fortunately, the SimpleDB design is such that it is possible to obtain locks on 180 | pages in `BufferPool.getPage()` before you read or modify them. 181 | So, rather than adding calls to locking routines in each of your operators, 182 | we recommend acquiring locks in `getPage()`. Depending on your 183 | implementation, it is possible that you may not have to acquire a lock 184 | anywhere else. It is up to you to verify this! 185 | 186 | 187 | 188 | You will need to acquire a *shared* lock on any page (or tuple) 189 | before you read it, and you will need to acquire an *exclusive* 190 | lock on any page (or tuple) before you write it. 
You will notice that 191 | we are already passing around `Permissions` objects in the 192 | BufferPool; these objects indicate the type of lock that the caller 193 | would like to have on the object being accessed (we have given you the 194 | code for the `Permissions` class.) 195 | 196 | Note that your implementation of `HeapFile.insertTuple()` 197 | and `HeapFile.deleteTuple()`, as well as the implementation 198 | of the iterator returned by `HeapFile.iterator()` should 199 | access pages using `BufferPool.getPage()`. Double check 200 | that these different uses of `getPage()` pass the 201 | correct permissions object (e.g., `Permissions.READ_WRITE` 202 | or `Permissions.READ_ONLY`). You may also wish to double 203 | check that your implementation of 204 | `BufferPool.insertTuple()` and 205 | `BufferPool.deleteTuple()` call `markDirty()` on 206 | any of the pages they access (you should have done this when you 207 | implemented this code in lab 2, but we did not test for this case.) 208 | 209 | 210 | 211 | After you have acquired locks, you will need to think about when to 212 | release them as well. It is clear that you should release all locks 213 | associated with a transaction after it has committed or aborted to ensure rigorous 2PL. 214 | However, it is 215 | possible for there to be other scenarios in which releasing a lock before 216 | a transaction ends might be useful. For instance, you may release a shared lock 217 | on a page after scanning it to find empty slots (as described below). 218 | 219 | 220 | 221 | *** 222 | 223 | **Exercise 2.** 224 | 225 | Ensure that you acquire and release locks throughout SimpleDB. Some (but 226 | not necessarily all) actions that you should verify work properly: 227 | 228 | * Reading tuples off of pages during a SeqScan (if you 229 | implemented locking in `BufferPool.getPage()`, this should work 230 | correctly as long as your `HeapFile.iterator()` uses 231 | `BufferPool.getPage()`.) 
232 | * Inserting and deleting tuples through BufferPool and HeapFile 233 | methods (if you 234 | implemented locking in `BufferPool.getPage()`, this should work 235 | correctly as long as `HeapFile.insertTuple()` and 236 | `HeapFile.deleteTuple()` use 237 | `BufferPool.getPage()`.) 238 | 239 | 240 | 241 | You will also want to think especially hard about acquiring and releasing 242 | locks in the following situations: 243 | 244 | 245 | * Adding a new page to a `HeapFile`. When do you physically 246 | write the page to disk? Are there race conditions with other transactions 247 | (on other threads) that might need special attention at the HeapFile level, 248 | regardless of page-level locking? 249 | * Looking for an empty slot into which you can insert tuples. 250 | Most implementations scan pages looking for an empty 251 | slot, and will need a READ_ONLY lock to do this. Surprisingly, however, 252 | if a transaction *t* finds no free slot on a page *p*, *t* may immediately release the lock on *p*. 253 | Although this apparently contradicts the rules of two-phase locking, it is OK because 254 | *t* did not use any data from the page, so that a concurrent transaction *t'* which updated 255 | *p* cannot possibly affect the answer or outcome of *t*. 256 | 257 | 258 | At this point, your code should pass the unit tests in 259 | LockingTest. 260 | 261 | *** 262 | 263 | ### 2.6. Implementing NO STEAL 264 | 265 | Modifications from a transaction are written to disk only after it 266 | commits. This means we can abort a transaction by discarding the dirty 267 | pages and rereading them from disk. Thus, we must not evict dirty 268 | pages. This policy is called NO STEAL. 269 | 270 | You will need to modify the evictPage method in BufferPool. 271 | In particular, it must never evict a dirty page. If your eviction policy prefers a dirty page 272 | for eviction, you will have to find a way to evict an alternative 273 | page. 
In the case where all pages in the buffer pool are dirty, you 274 | should throw a DbException. 275 | 276 | Note that, in general, evicting a clean page that is locked by a 277 | running transaction is OK when using NO STEAL, as long as your lock 278 | manager keeps information about evicted pages around, and as long as 279 | none of your operator implementations keep references to Page objects 280 | which have been evicted. 281 | 282 | 283 | *** 284 | 285 | **Exercise 3.** 286 | 287 | Implement the necessary logic for page eviction without evicting dirty pages 288 | in the evictPage method in BufferPool. 289 | 290 | *** 291 | 292 | 293 | ### 2.7. Transactions 294 | 295 | In SimpleDB, a `TransactionId` object is created at the 296 | beginning of each query. This object is passed to each of the operators 297 | involved in the query. When the query is complete, the 298 | `BufferPool` method `transactionComplete` is called. 299 | 300 | 301 | 302 | Calling this method either *commits* or *aborts* the 303 | transaction, specified by the parameter flag `commit`. At any point 304 | during its execution, an operator may throw a 305 | `TransactionAbortedException` exception, which indicates an 306 | internal error or deadlock has occurred. The test cases we have provided 307 | you with create the appropriate `TransactionId` objects, pass 308 | them to your operators in the appropriate way, and invoke 309 | `transactionComplete` when a query is finished. We have also 310 | implemented `TransactionId`. 311 | 312 | 313 | 314 | *** 315 | 316 | **Exercise 4.** 317 | 318 | Implement the `transactionComplete()` method in 319 | `BufferPool`. Note that there are two versions of 320 | transactionComplete, one which accepts an additional boolean **commit** argument, 321 | and one which does not. The version without the additional argument should 322 | always commit and so can simply be implemented by calling `transactionComplete(tid, true)`. 
323 | 324 | 325 | 326 | When you commit, you should flush dirty pages 327 | associated to the transaction to disk. When you abort, you should revert 328 | any changes made by the transaction by restoring the page to its on-disk 329 | state. 330 | 331 | 332 | 333 | 334 | Whether the transaction commits or aborts, you should also release any state the 335 | `BufferPool` keeps regarding 336 | the transaction, including releasing any locks that the transaction held. 337 | 338 | 339 | At this point, your code should pass the `TransactionTest` unit test and the 340 | `AbortEvictionTest` system test. You may find the `TransactionTest` system test 341 | illustrative, but it will likely fail until you complete the next exercise. 342 | 343 | 344 | *** 345 | 346 | 347 | 348 | ### 2.8. Deadlocks and Aborts 349 | 350 | It is possible for transactions in SimpleDB to deadlock (if you do not 351 | understand why, we recommend reading about deadlocks in Ramakrishnan & Gehrke). 352 | You will need to detect this situation and throw a 353 | `TransactionAbortedException`. 354 | 355 | 356 | 357 | There are many possible ways to detect deadlock. For example, you may 358 | implement a simple timeout policy that aborts a transaction if it has not 359 | completed after a given period of time. Alternately, you may implement 360 | cycle-detection in a dependency graph data structure. In this scheme, you 361 | would check for cycles in a dependency graph whenever you attempt to grant 362 | a new lock, and abort something if a cycle exists. 363 | 364 | 365 | 366 | After you have detected that a deadlock exists, you must decide how to 367 | improve the situation. Assume you have detected a deadlock while 368 | transaction *t* is waiting for a lock. If you're feeling 369 | homicidal, you might abort **all** transactions that *t* is 370 | waiting for; this may result in a large amount of work being undone, but 371 | you can guarantee that *t* will make progress. 
372 | Alternately, you may decide to abort *t* to give other 373 | transactions a chance to make progress. This means that the end-user will have 374 | to retry transaction *t*. 375 | 376 | 377 | 378 | *** 379 | 380 | **Exercise 5.** 381 | 382 | Implement deadlock detection and resolution in 383 | `src/simpledb/BufferPool.java`. Most likely, you will want to check 384 | for a deadlock whenever a transaction attempts to acquire a lock and finds another 385 | transaction is holding the lock (note that this by itself is not a deadlock, but may 386 | be symptomatic of one.) You have many design 387 | decisions for your deadlock resolution system, but it is not necessary to 388 | do something complicated. Please describe your choices in the lab writeup. 389 | 390 | 391 | 392 | You should ensure that your code aborts transactions properly when a 393 | deadlock occurs, by throwing a 394 | `TransactionAbortedException` exception. 395 | This exception will be caught by the code executing the transaction 396 | (e.g., `TransactionTest.java`), which should call 397 | `transactionComplete()` to cleanup after the transaction. 398 | You are not expected to automatically restart a transaction which 399 | fails due to a deadlock -- you can assume that higher level code 400 | will take care of this. 401 | 402 | 403 | 404 | We have provided some (not-so-unit) tests in 405 | `test/simpledb/DeadlockTest.java`. They are actually a 406 | bit involved, so they may take more than a few seconds to run (depending 407 | on your policy). If they seem to hang indefinitely, then you probably 408 | have an unresolved deadlock. These tests construct simple deadlock 409 | situations that your code should be able to escape. 410 | 411 | 412 | 413 | 414 | Note that there are two timing parameters near the top of 415 | `DeadLockTest.java`; these determine the frequency at which 416 | the test checks if locks have been acquired and the waiting time before 417 | an aborted transaction is restarted. 
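If you choose the dependency-graph approach from Exercise 5, the core cycle check might look like the following sketch: an edge *t* → *u* means transaction *t* is waiting for a lock held by *u*, and a request that would let the requester reach itself signals a deadlock. The class and method names are invented; integrating this with your lock manager is up to you.

```java
import java.util.*;

// Sketch of deadlock detection via a wait-for graph (names invented).
class WaitForGraph {
    private final Map<Integer, Set<Integer>> waitsFor = new HashMap<>();

    // Record that 'waiter' is blocked on a lock held by 'holder'.
    void addEdge(int waiter, int holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // Called when a transaction acquires its lock or aborts.
    void removeWaiter(int waiter) {
        waitsFor.remove(waiter);
    }

    // DFS from 'start': true if 'start' can reach itself, i.e. a cycle
    // (deadlock) involving the requesting transaction exists.
    boolean wouldDeadlock(int start) {
        Deque<Integer> stack = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : waitsFor.getOrDefault(t, Collections.emptySet())) {
                if (u == start) return true;   // cycle back to the requester
                if (seen.add(u)) stack.push(u);
            }
        }
        return false;
    }
}
```

A lock manager using this would run the check while holding its own monitor, before putting the requester to sleep, and throw `TransactionAbortedException` when the check returns true.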
You may observe different 418 | performance characteristics by tweaking these parameters if you use a 419 | timeout-based detection method. The tests will output 420 | `TransactionAbortedExceptions` corresponding to resolved 421 | deadlocks to the console. 422 | 423 | 424 | 425 | Your code should now pass the `TransactionTest` system test (which may also run for quite a long time). 426 | 427 | 428 | At this point, you should have a recoverable database, in the 429 | sense that if the database system crashes (at a point other than 430 | `transactionComplete()`) or if the user explicitly aborts a 431 | transaction, the effects of any running transaction will not be visible 432 | after the system restarts (or the transaction aborts.) You may wish to 433 | verify this by running some transactions and explicitly killing the 434 | database server. 435 | 436 | 437 | *** 438 | 439 | ### 2.9. Design alternatives 440 | 441 | During the course of this lab, we have identified three substantial design 442 | choices that you have to make: 443 | 444 | 445 | * Locking granularity: page-level versus tuple-level 446 | * Deadlock detection: timeouts versus dependency graphs 447 | * Deadlock resolution: aborting yourself versus aborting others 448 | 449 | 450 | *** 451 | 452 | **Bonus Exercise 6. (10% extra credit)** 453 | 454 | For one or more of these choices, implement both alternatives and 455 | briefly compare their performance characteristics in your writeup. 456 | 457 | 458 | 459 | You have now completed this lab. 460 | Good work! 461 | 462 | *** 463 | 464 | ## 3. Logistics 465 | 466 | You must submit your code (see below) as well as a short (2 pages, maximum) 467 | writeup describing your approach. This writeup should: 468 | 469 | 470 | 471 | * Describe any design decisions you made, including your deadlock detection 472 | policy, locking granularity, etc. 473 | 474 | * Discuss and justify any changes you made to the API. 475 | 476 | 477 | 478 | ### 3.1. 
Collaboration 479 | 480 | This lab should be manageable for a single person, but if you prefer 481 | to work with a partner, this is also OK. Larger groups are not allowed. 482 | Please indicate clearly who you worked with, if anyone, on your writeup. 483 | 484 | ### 3.2. Submitting your assignment 485 | 486 | 487 | 495 | 496 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab4-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 497 | 498 | 499 | You also need to explicitly add any other files you create, such as new *.java 500 | files. 501 | 502 | The criterion for your lab being submitted on time is that your code must be 503 | **tagged** and 504 | **pushed** by the date and time. This means that if one of the TAs or the 505 | instructor were to open up GitHub, they would be able to see your solutions on 506 | the GitHub web page. 507 | 508 | Just because your code has been committed on your local machine does not 509 | mean that it has been **submitted**; it needs to be on GitHub. 510 | 511 | There is a bash script `turnInLab4.sh` in the root level directory of simple-db-hw that commits 512 | your changes, deletes any prior tag 513 | for the current lab, tags the current commit, and pushes the tag 514 | to GitHub. If you are using Linux or Mac OSX, you should be able to run the following: 515 | 516 | ```bash 517 | $ ./turnInLab4.sh 518 | ``` 519 | You should see something like the following output: 520 | 521 | ```bash 522 | $ ./turnInLab4.sh 523 | error: tag 'lab4submit' not found. 524 | remote: warning: Deleting a non-existent ref. 525 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 526 | - [deleted] lab1submit 527 | [master 7a26701] Lab 4 528 | 1 file changed, 0 insertions(+), 0 deletions(-) 529 | create mode 100644 aaa 530 | Counting objects: 3, done. 
531 | Delta compression using up to 4 threads. 532 | Compressing objects: 100% (3/3), done. 533 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 534 | Total 3 (delta 1), reused 0 (delta 0) 535 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 536 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 537 | 069856c..7a26701 master -> master 538 | * [new tag] lab4submit -> lab4submit 539 | ``` 540 | 541 | 542 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 4 as follows: 543 | 544 | 1. Look at your current repository status. 545 | 546 | ```bash 547 | $ git status 548 | ``` 549 | 550 | 2. Add and commit your code changes (if they aren't already added and committed). 551 | 552 | ```bash 553 | $ git commit -a -m 'Lab 4' 554 | ``` 555 | 556 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 557 | 558 | ```bash 559 | $ git tag -d lab4submit 560 | $ git push origin :refs/tags/lab4submit 561 | ``` 562 | 563 | 4. Tag your last commit as the lab to be graded 564 | ```bash 565 | $ git tag -a lab4submit -m 'submit lab 4' 566 | ``` 567 | 568 | 5. This is the most important part: **push** your solutions to GitHub. 569 | 570 | ```bash 571 | $ git push origin master --tags 572 | ``` 573 | 574 | 6. The last thing that we strongly recommend you do is to go to the 575 | [MIT-DB-Class] organization page on GitHub to 576 | make sure that we can see your solutions. 577 | 578 | Just navigate to your repository and check that your latest commits are on 579 | GitHub. You should also be able to check 580 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab4submit` 581 | 582 | 583 | #### Word of Caution 584 | 585 | Git is a distributed version control system. This means everything operates 586 | offline until you run `git pull` or `git push`. This is a great feature. 
587 | 588 | The bad thing is that you may forget to `git push` your changes. This is why we strongly, **strongly** suggest that you check GitHub to be sure that what you want us to see matches up with what you expect. 589 | 590 | Just because your code has been committed on your local machine does not 591 | mean that it has been **submitted**; it needs to be on GitHub. 592 | 593 | 594 | 595 | ### 3.3. Submitting a bug 596 | Despite its friendly-sounding name, SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc. 597 | 598 | We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out 599 | yourself or send us a friendly email. 600 | 601 | Please submit (friendly!) bug reports to 6.830-staff@mit.edu. 603 | When you do, please try to include: 604 | 605 | 606 | * A description of the bug. 607 | 608 | * A .java file we can drop in the 609 | `test/simpledb` directory, compile, and run. 610 | 611 | * A .txt file with the data that reproduces the bug. We should be 612 | able to convert it to a .dat file using `HeapFileEncoder`. 613 | 614 | 615 | 616 | You can also post on the class page on Piazza if you feel you have run into a bug. 617 | 618 | 619 | ### 3.4 Grading 620 | 621 | 50% of your grade will be based on whether or not your code passes the 622 | system test suite we will run over it. These tests will be a superset 623 | of the tests we have provided. Before handing in your code, you should 624 | make sure it produces no errors (passes all of the tests) from both 625 | ant test and ant systemtest. 626 | 627 | 628 | 629 | **Important:** Before testing, we will replace your build.xml, 630 | HeapFileEncoder.java, and the entire contents of the 631 | test/ directory with our version of these files! This 632 | means you cannot change the format of .dat files! 
You should 633 | therefore be careful changing our APIs. This also means you need to test 634 | whether your code compiles with our test programs. In other words, we will 635 | pull your repo, replace the files mentioned above, compile it, and then 636 | grade it. It will look roughly like this: 637 | 638 | 639 | ``` 640 | $ git pull 641 | [replace build.xml, HeapFileEncoder.java and test] 642 | $ ant test 643 | $ ant systemtest 644 | [additional tests] 645 | ``` 646 | 647 | If any of these commands fail, we'll be unhappy, and, therefore, so will your grade. 648 | 649 | 650 | 651 | An additional 50% of your grade will be based on the quality of your 652 | writeup and our subjective evaluation of your code. 653 | 654 | 655 | 656 | We've had a lot of fun designing this assignment, and we hope you enjoy 657 | hacking on it! 658 | -------------------------------------------------------------------------------- /lab2.md: -------------------------------------------------------------------------------- 1 | # 6.830 Lab 2: SimpleDB Operators 2 | 3 | **Assigned: Friday, September 28, 2018**
4 | **Due: Wednesday, October 10, 2018 11:59 PM EDT** 5 | 6 | 7 | 13 | 14 | 15 | 16 | In this lab assignment, you will write a set of operators for SimpleDB 17 | to implement table modifications (e.g., insert and delete records), 18 | selections, joins, and aggregates. These will build on top of the 19 | foundation that you wrote in Lab 1 to provide you with a database 20 | system that can perform simple queries over multiple tables. 21 | 22 | 23 | 24 | Additionally, we ignored the issue of buffer pool management in Lab 1: we 25 | have not dealt with the problem that arises when we reference more pages 26 | than we can fit in memory over the lifetime of the database. 27 | In Lab 2, you will design an eviction policy to 28 | flush stale pages from the buffer pool. 29 | 30 | 31 | 32 | You do not need to implement transactions or locking in this lab. 33 | 34 | 35 | 36 | The remainder of this document gives some suggestions about how to start 37 | coding, describes a set of exercises to help you work through the lab, 38 | and discusses how to hand in your code. This lab requires you to 39 | write a fair amount of code, so we encourage you to **start early**! 40 | 41 | 42 | 43 | ## 1. Getting started 44 | 45 | You should begin with the code you submitted for Lab 1 (if you did not 46 | submit code for Lab 1, or your solution didn't work properly, contact us to 47 | discuss options). 48 | 49 | ### 1.3. Implementation hints 50 | 51 | 52 | As before, we **strongly encourage** you to read through this entire 53 | document to get a feel for the high-level design of SimpleDB before you 54 | write code. 55 | 56 | We suggest exercises throughout this document to guide your implementation, but 57 | you may find that a different order makes more sense for you. As before, 58 | we will grade your assignment by looking at your code and verifying that 59 | you have passed the tests for the ant targets `test` and 60 | `systemtest`. 
Note that the code only needs to pass the tests we indicate in this 61 | lab, not all of the unit and system tests. See Section 3.4 for a complete discussion of 62 | grading and a list of the tests you will need to pass. 63 | 64 | Here's a rough outline of one way you might proceed with your SimpleDB 65 | implementation; more details on the steps in this outline, including 66 | exercises, are given in Section 2 below. 67 | 68 | 69 | 70 | * Implement the operators `Filter` and `Join` and 71 | verify that their corresponding tests work. The Javadoc comments for 72 | these operators contain details about how they should work. We have given you implementations of 73 | `Project` and `OrderBy` which may help you 74 | understand how other operators work. 75 | 76 | * Implement `IntegerAggregator` and `StringAggregator`. Here, you will write the 77 | logic that actually computes an aggregate over a particular field across 78 | multiple groups in a sequence of input tuples. Use integer division for 79 | computing the average, since SimpleDB only supports integers. `StringAggregator` 80 | only needs to support the COUNT aggregate, since the other operations do not 81 | make sense for strings. 82 | 83 | * Implement the `Aggregate` operator. As with other 84 | operators, aggregates implement the `OpIterator` interface 85 | so that they can be placed in SimpleDB query plans. Note that the 86 | output of an `Aggregate` operator is an aggregate value of an 87 | entire group for each call to `next()`, and that the 88 | aggregate constructor takes the aggregation and grouping fields. 89 | 90 | * Implement the methods related to tuple insertion, deletion, and page 91 | eviction in `BufferPool`. You do not need to worry about 92 | transactions at this point. 93 | 94 | * Implement the `Insert` and `Delete` operators. 
95 | Like all operators, `Insert` and `Delete` implement 96 | `OpIterator`, accepting a stream of tuples to insert or delete 97 | and outputting a single tuple with an integer field that indicates the 98 | number of tuples inserted or deleted. These operators will need to call 99 | the appropriate methods in `BufferPool` that actually modify the 100 | pages on disk. Check that the tests for inserting and 101 | deleting tuples work properly. 102 | 103 | Note that SimpleDB does not implement any kind of consistency or integrity 104 | checking, so it is possible to insert duplicate records into a file and 105 | there is no way to enforce primary or foreign key constraints. 106 | 107 | 108 | 109 | At this point you should be able to pass the tests in the ant 110 | `systemtest` target, which is the goal of this lab. 111 | 112 | 113 | 114 | You'll also be able to use the provided SQL parser to run SQL 115 | queries against your database! See [Section 2.7](#parser) for a 116 | brief tutorial. 117 | 118 | 119 | 120 | 121 | Finally, you might notice that the iterators in this lab extend the 122 | `Operator` class instead of implementing the OpIterator 123 | interface. Because the implementation of next/hasNext 124 | is often repetitive, annoying, and error-prone, `Operator` 125 | implements this logic generically, and only requires that you implement 126 | a simpler readNext. Feel free to use this style of 127 | implementation, or just implement the `OpIterator` interface if you prefer. 128 | To implement the OpIterator interface, remove `extends Operator` 129 | from iterator classes, and in its place put `implements OpIterator`. 130 | 131 | 132 | 133 | ## 2. SimpleDB Architecture and Implementation Guide 134 | 135 | 136 | ### 2.1. Filter and Join 137 | 138 | Recall that SimpleDB OpIterator classes implement the operations of the 139 | relational algebra. 
You will now implement two operators that will enable 140 | you to perform queries that are slightly more interesting than a table 141 | scan. 142 | 143 | 144 | 145 | * *Filter*: This operator only returns tuples that satisfy 146 | a `Predicate` that is specified as part of its constructor. Hence, 147 | it filters out any tuples that do not match the predicate. 148 | 149 | * *Join*: This operator joins tuples from its two children according to 150 | a `JoinPredicate` that is passed in as part of its constructor. 151 | We only require a simple nested loops join, but you may explore more 152 | interesting join implementations. Describe your implementation in your lab 153 | writeup. 154 | 155 | 156 | 157 | **Exercise 1.** 158 | 159 | Implement the skeleton methods in: 160 | 161 | *** 162 | * src/simpledb/Predicate.java 163 | * src/simpledb/JoinPredicate.java 164 | * src/simpledb/Filter.java 165 | * src/simpledb/Join.java 166 | 167 | *** 168 | 169 | At this point, your code should pass the unit tests in 170 | PredicateTest, JoinPredicateTest, FilterTest, and JoinTest. Furthermore, 171 | you should be able to pass the system tests FilterTest and JoinTest. 172 | 173 | 174 | 175 | ### 2.2. Aggregates 176 | 177 | An additional SimpleDB operator implements basic SQL aggregates with a 178 | `GROUP BY` clause. You should implement the five SQL aggregates 179 | (`COUNT`, `SUM`, `AVG`, `MIN`, 180 | `MAX`) and support grouping. You only need to support aggregates 181 | over a single field, and grouping by a single field. 182 | 183 | 184 | 185 | In order to calculate aggregates, we use an `Aggregator` 186 | interface which merges a new tuple into the existing calculation of an 187 | aggregate. The `Aggregator` is told during construction what 188 | operation it should use for aggregation. Subsequently, the client code 189 | should call `Aggregator.mergeTupleIntoGroup()` for every tuple in the child 190 | iterator. 
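As a self-contained toy illustration of this merge-then-read pattern — using plain Java collections, with class and method names that are illustrative stand-ins rather than SimpleDB's actual skeleton — here is a grouped AVG that keeps a running (sum, count) per group and applies integer division at read time:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the Aggregator idea: merge tuples one at a time, keeping
// per-group running (sum, count) so AVG can be computed with the integer
// division that SimpleDB requires.
public class AvgAggregatorSketch {
    // group value -> {running sum, running count}
    private final Map<Integer, int[]> groups = new HashMap<>();

    void mergeTupleIntoGroup(int groupVal, int aggVal) {
        int[] sc = groups.computeIfAbsent(groupVal, k -> new int[2]);
        sc[0] += aggVal; // sum
        sc[1] += 1;      // count
    }

    int avg(int groupVal) {
        int[] sc = groups.get(groupVal);
        return sc[0] / sc[1]; // integer division, per the lab spec
    }

    public static void main(String[] args) {
        AvgAggregatorSketch a = new AvgAggregatorSketch();
        a.mergeTupleIntoGroup(1, 10);
        a.mergeTupleIntoGroup(1, 15);
        a.mergeTupleIntoGroup(2, 7);
        System.out.println(a.avg(1)); // 25 / 2 = 12 with integer division
        System.out.println(a.avg(2)); // 7
    }
}
```

The real `IntegerAggregator` follows the same shape, but merges `Tuple` objects and emits `(groupValue, aggregateValue)` pairs through an `OpIterator`.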
After all tuples have been merged, the client can retrieve an 191 | OpIterator of aggregation results. Each tuple in the result is a pair of 192 | the form `(groupValue, aggregateValue)`, unless the value 193 | of the group-by field was `Aggregator.NO_GROUPING`, in which 194 | case the result is a single tuple of the form `(aggregateValue)`. 195 | 196 | 197 | 198 | Note that this implementation requires space linear in the number of 199 | distinct groups. For the purposes of this lab, you do not need to worry 200 | about the situation where the number of groups exceeds available memory. 201 | 202 | 203 | 204 | **Exercise 2.** 205 | 206 | Implement the skeleton methods in: 207 | 208 | *** 209 | * src/simpledb/IntegerAggregator.java 210 | * src/simpledb/StringAggregator.java 211 | * src/simpledb/Aggregate.java 212 | 213 | *** 214 | 215 | At this point, your code should pass the unit tests 216 | IntegerAggregatorTest, StringAggregatorTest, and 217 | AggregateTest. Furthermore, you should be able to pass the AggregateTest system test. 218 | 219 | 220 | ### 2.3. HeapFile Mutability 221 | 222 | Now, we will begin to implement methods to support modifying tables. We 223 | begin at the level of individual pages and files. There are two main sets 224 | of operations: adding tuples and removing tuples. 225 | 226 | **Removing tuples:** To remove a tuple, you will need to implement 227 | `deleteTuple`. 228 | Tuples contain `RecordIDs` which allow you to find 229 | the page they reside on, so this should be as simple as locating the page 230 | a tuple belongs to and modifying the headers of the page appropriately. 231 | 232 | **Adding tuples:** The `insertTuple` method in 233 | `HeapFile.java` is responsible for adding a tuple to a heap 234 | file. To add a new tuple to a HeapFile, you will have to find a page with 235 | an empty slot. If no such pages exist in the HeapFile, you 236 | need to create a new page and append it to the physical file on disk. 
You will 237 | need to ensure that the RecordID in the tuple is updated correctly. 238 | 239 | **Exercise 3.** 240 | 241 | Implement the remaining skeleton methods in: 242 | 243 | *** 244 | * src/simpledb/HeapPage.java 245 | * src/simpledb/HeapFile.java
246 | (Note that you do not necessarily need to implement writePage at this point). 247 | 248 | *** 249 | 250 | 251 | 252 | To implement HeapPage, you will need to modify the header bitmap for 253 | methods such as insertTuple() and deleteTuple(). You may 254 | find that the getNumEmptySlots() and isSlotUsed() methods we asked you to 255 | implement in Lab 1 serve as useful abstractions. Note that there is a 256 | markSlotUsed method provided as an abstraction to modify the filled 257 | or cleared status of a tuple in the page header. 258 | 259 | 260 | Note that it is important that the HeapFile.insertTuple() 261 | and HeapFile.deleteTuple() methods access pages using 262 | the BufferPool.getPage() method; otherwise, your 263 | implementation of transactions in the next lab will not work 264 | properly. 265 | 266 | 267 | Implement the following skeleton methods in src/simpledb/BufferPool.java: 268 | 269 | *** 270 | * insertTuple() 271 | * deleteTuple() 272 | 273 | *** 274 | 275 | 276 | These methods should call the appropriate methods in the HeapFile that 277 | belong to the table being modified (this extra level of indirection is 278 | needed to support other types of files — like indices — in the 279 | future). 280 | 281 | 282 | 283 | At this point, your code should pass the unit tests in HeapPageWriteTest and 284 | HeapFileWriteTest, as well as BufferPoolWriteTest. 285 | 286 | 287 | 288 | 289 | ### 2.4. Insertion and deletion 290 | 291 | Now that you have written all of the HeapFile machinery to add and remove 292 | tuples, you will implement the `Insert` and `Delete` 293 | operators. 294 | 295 | 296 | 297 | For plans that implement `insert` and `delete` queries, 298 | the top-most operator is a special `Insert` or `Delete` 299 | operator that modifies the pages on disk. These operators return the number 300 | of affected tuples. This is implemented by returning a single tuple with one 301 | integer field, containing the count. 
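To make this "affected count" contract concrete, here is a self-contained toy sketch — the names `InsertSketch` and `fetchNext` are illustrative, not SimpleDB's actual classes — of an Insert-style operator that drains its child once, reports the count as its single result, and returns null thereafter:

```java
import java.util.Iterator;
import java.util.List;

// Toy sketch of the count-tuple contract: drain the child stream once,
// insert each tuple (via BufferPool in real SimpleDB), then emit a single
// integer count. Later calls return null, as with Operator.fetchNext().
public class InsertSketch {
    private final Iterator<Integer> child; // stand-in for a tuple stream
    private boolean done = false;

    InsertSketch(Iterator<Integer> child) { this.child = child; }

    Integer fetchNext() {
        if (done) return null;  // the count is reported exactly once
        int count = 0;
        while (child.hasNext()) {
            child.next();       // real code: BufferPool.insertTuple(...)
            count++;
        }
        done = true;
        return count;           // one "tuple" with one integer field
    }

    public static void main(String[] args) {
        InsertSketch op = new InsertSketch(List.of(10, 20, 30).iterator());
        System.out.println(op.fetchNext()); // 3
        System.out.println(op.fetchNext()); // null
    }
}
```

The bullets below spell out how the real `Insert` and `Delete` operators plug this behavior into `BufferPool`.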
302 | 303 | 304 | 305 | * *Insert*: This operator adds the tuples it reads from its child 306 | operator to the `tableid` specified in its constructor. It should 307 | use the `BufferPool.insertTuple()` method to do this. 308 | 309 | * *Delete*: This operator deletes the tuples it reads from its child 310 | operator from the `tableid` specified in its constructor. It 311 | should use the `BufferPool.deleteTuple()` method to do this. 312 | 313 | 314 | 315 | 316 | 317 | **Exercise 4.** 318 | 319 | Implement the skeleton methods in: 320 | 321 | *** 322 | * src/simpledb/Insert.java 323 | * src/simpledb/Delete.java 324 | 325 | *** 326 | 327 | At this point, your code should pass the unit tests in InsertTest. We 328 | have not provided unit tests for `Delete`. Furthermore, you 329 | should be able to pass the InsertTest and DeleteTest system tests. 330 | 331 | 332 | ### 2.5. Page eviction 333 | 334 | In Lab 1, we did not correctly observe the limit on the maximum number of pages 335 | in the buffer pool defined by the 336 | constructor argument `numPages`. Now, you will choose a page eviction 337 | policy and instrument any previous code that reads or creates pages to 338 | implement your policy. 339 | 340 | 341 | 342 | When more than numPages pages are in the buffer pool, one page should be 343 | evicted from the pool before the next is loaded. The choice of eviction 344 | policy is up to you; it is not necessary to do something sophisticated. 345 | Describe your policy in the lab writeup. 346 | 347 | 348 | 349 | Notice that `BufferPool` asks you to implement 350 | a `flushAllPages()` method. This is not something you would ever 351 | need in a real implementation of a buffer pool. However, we need this method 352 | for testing purposes. You should never call this method from any real code. 
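Since the choice of policy is left open, one easy option is LRU. As a self-contained toy sketch (the names are illustrative, and a real `evictPage()` would flush a dirty page to disk before dropping it), Java's `LinkedHashMap` in access-order mode already provides the bookkeeping:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU buffer pool: LinkedHashMap in access-order mode drops the
// least-recently-used page once the pool exceeds numPages.
public class LruPoolSketch {
    final LinkedHashMap<Integer, String> pool;

    LruPoolSketch(int numPages) {
        pool = new LinkedHashMap<>(numPages, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                // Hook where a real pool would flushPage(eldest) if dirty.
                return size() > numPages;
            }
        };
    }

    String getPage(int pid) {
        // Simulates reading the page from disk on a miss.
        return pool.computeIfAbsent(pid, p -> "page-" + p);
    }

    public static void main(String[] args) {
        LruPoolSketch bp = new LruPoolSketch(2);
        bp.getPage(1);
        bp.getPage(2);
        bp.getPage(1);                        // touch page 1; page 2 is now LRU
        bp.getPage(3);                        // exceeds capacity, evicts page 2
        System.out.println(bp.pool.keySet()); // [1, 3]
    }
}
```

A simpler "evict any clean page" or even random policy would also satisfy this lab, since the tests do not check for a particular policy.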
353 | 354 | Because of the way we have implemented ScanTest.cacheTest, you will 355 | need to ensure that your flushPage and flushAllPages methods 356 | do not evict pages from the buffer pool to properly pass 357 | this test. 358 | 359 | flushAllPages should call flushPage on all pages in the BufferPool, 360 | and flushPage should write any dirty page to disk and mark it as not 361 | dirty, while leaving it in the BufferPool. 362 | 363 | The only method which should remove pages from the buffer pool is 364 | evictPage, which should call flushPage on any dirty page it evicts. 365 | 366 | **Exercise 5.** 367 | 368 | Fill in the `flushPage()` method and additional helper 369 | methods to implement page eviction in: 370 | 371 | *** 372 | * src/simpledb/BufferPool.java 373 | 374 | *** 375 | 376 | 377 | 378 | If you did not implement `writePage()` in 379 | HeapFile.java above, you will also need to do that here. Finally, 380 | you should also implement `discardPage()` to remove a page from the 381 | buffer pool *without* flushing it to disk. We will not test `discardPage()` 382 | in this lab, but it will be necessary for future labs. 383 | 384 | 385 | At this point, your code should pass the EvictionTest system test. 386 | 387 | Since we will not 388 | be checking for any particular eviction policy, this test works by creating a 389 | BufferPool with 16 pages (NOTE: while DEFAULT_PAGES is 50, we are initializing the 390 | BufferPool with fewer!), scanning a file with many more than 16 pages, and seeing 391 | if the memory usage of the JVM increases by more than 5 MB. If you do not 392 | implement an eviction policy correctly, you will not evict enough pages, and will 393 | go over the size limitation, thus failing the test. 394 | 395 | 396 | You have now completed this lab. Good work! 397 | 398 | 399 | ### 2.6. Query walkthrough 400 | 401 | 402 | 403 | The following code implements a simple join query between two tables, each 404 | consisting of three columns of integers. 
(The files 405 | `some_data_file1.dat` and `some_data_file2.dat` are 406 | binary representations of the pages of these tables). This code is equivalent 407 | to the SQL statement: 408 | 409 | ```sql 410 | SELECT * 411 | FROM some_data_file1, some_data_file2 412 | WHERE some_data_file1.field1 = some_data_file2.field1 413 | AND some_data_file1.field0 > 1 414 | ``` 415 | 416 | For more extensive examples of query operations, you may find it helpful to 417 | browse the unit tests for joins, filters, and aggregates. 418 | 419 | ```java 420 | package simpledb; 421 | import java.io.*; 422 | 423 | public class jointest { 424 | 425 | public static void main(String[] argv) { 426 | // construct a 3-column table schema 427 | Type types[] = new Type[]{ Type.INT_TYPE, Type.INT_TYPE, Type.INT_TYPE }; 428 | String names[] = new String[]{ "field0", "field1", "field2" }; 429 | 430 | TupleDesc td = new TupleDesc(types, names); 431 | 432 | // create the tables, associate them with the data files 433 | // and tell the catalog about the schema of the tables. 
434 | HeapFile table1 = new HeapFile(new File("some_data_file1.dat"), td); 435 | Database.getCatalog().addTable(table1, "t1"); 436 | 437 | HeapFile table2 = new HeapFile(new File("some_data_file2.dat"), td); 438 | Database.getCatalog().addTable(table2, "t2"); 439 | 440 | // construct the query: we use two SeqScans, which spoonfeed 441 | // tuples via iterators into join 442 | TransactionId tid = new TransactionId(); 443 | 444 | SeqScan ss1 = new SeqScan(tid, table1.getId(), "t1"); 445 | SeqScan ss2 = new SeqScan(tid, table2.getId(), "t2"); 446 | 447 | // create a filter for the where condition 448 | Filter sf1 = new Filter( 449 | new Predicate(0, 450 | Predicate.Op.GREATER_THAN, new IntField(1)), ss1); 451 | 452 | JoinPredicate p = new JoinPredicate(1, Predicate.Op.EQUALS, 1); 453 | Join j = new Join(p, sf1, ss2); 454 | 455 | // and run it 456 | try { 457 | j.open(); 458 | while (j.hasNext()) { 459 | Tuple tup = j.next(); 460 | System.out.println(tup); 461 | } 462 | j.close(); 463 | Database.getBufferPool().transactionComplete(tid); 464 | 465 | } catch (Exception e) { 466 | e.printStackTrace(); 467 | } 468 | 469 | } 470 | 471 | } 472 | ``` 473 | 474 | 475 | 476 | Both tables have three integer fields. To express this, we create 477 | a `TupleDesc` object and pass it an array of `Type` 478 | objects indicating field types and `String` objects 479 | indicating field names. Once we have created this `TupleDesc`, we initialize 480 | two `HeapFile` objects representing the tables. Once we have 481 | created the tables, we add them to the Catalog. (If this were a database 482 | server that was already running, we would have this catalog information 483 | loaded; we need to load this only for the purposes of this test). 484 | 485 | 486 | 487 | Once we have finished initializing the database system, we create a query 488 | plan. 
Our plan consists of two `SeqScan` operators that scan 489 | the tuples from each file on disk, connected to a `Filter` 490 | operator on the first HeapFile, connected to a `Join` operator 491 | that joins the tuples in the tables according to the 492 | `JoinPredicate`. In general, these operators are instantiated 493 | with references to the appropriate table (in the case of SeqScan) or child 494 | operator (in the case of e.g., Join). The test program then repeatedly 495 | calls `next` on the `Join` operator, which in turn 496 | pulls tuples from its children. As tuples are output from the 497 | `Join`, they are printed out on the command line. 498 | 499 | 500 | ### 2.7. Query Parser 501 | 502 | We've provided you with a query parser for SimpleDB that you can use 503 | to write and run SQL queries against your database once you 504 | have completed the exercises in this lab. 505 | 506 | The first step is to create some data tables and a catalog. Suppose 507 | you have a file `data.txt` with the following contents: 508 | 509 | ``` 510 | 1,10 511 | 2,20 512 | 3,30 513 | 4,40 514 | 5,50 515 | 5,50 516 | ``` 517 | 518 | You can convert this into a SimpleDB table using the 519 | `convert` command (make sure to type ant first!): 520 | 521 | ``` 522 | java -jar dist/simpledb.jar convert data.txt 2 "int,int" 523 | ``` 524 | 525 | This creates a file `data.dat`. In addition to the table's 526 | raw data, the two additional parameters specify that each record has 527 | two fields and that their types are `int` and 528 | `int`. 529 | 530 | 531 | 532 | Next, create a catalog file, `catalog.txt`, 533 | with the following contents: 534 | 535 | ``` 536 | data (f1 int, f2 int) 537 | ``` 538 | 539 | This tells SimpleDB that there is one table, `data` (stored in 540 | `data.dat`) with two integer fields named `f1` 541 | and `f2`. 542 | 543 | Finally, invoke the parser. 544 | You must run java from the 545 | command line (ant doesn't work properly with interactive targets.) 
546 | From the `simpledb/` directory, type: 547 | 548 | ``` 549 | java -jar dist/simpledb.jar parser catalog.txt 550 | ``` 551 | 552 | You should see output like: 553 | 554 | ``` 555 | Added table : data with schema INT(f1), INT(f2), 556 | SimpleDB> 557 | ``` 558 | 559 | Finally, you can run a query: 560 | 561 | ``` 562 | SimpleDB> select d.f1, d.f2 from data d; 563 | Started a new transaction tid = 1221852405823 564 | ADDING TABLE d(data) TO tableMap 565 | TABLE HAS tupleDesc INT(d.f1), INT(d.f2), 566 | 1 10 567 | 2 20 568 | 3 30 569 | 4 40 570 | 5 50 571 | 5 50 572 | 573 | 6 rows. 574 | ---------------- 575 | 0.16 seconds 576 | 577 | SimpleDB> 578 | ``` 579 | 580 | The parser is relatively full featured (including support for SELECTs, 581 | INSERTs, DELETEs, and transactions), but does have some problems 582 | and does not necessarily report completely informative error 583 | messages. Here are some limitations to bear in mind: 584 | 585 | 586 | * You must preface every field name with its table name, even if 587 | the field name is unique (you can use table name aliases, as in the 588 | example above, but you cannot use the AS keyword.) 589 | 590 | * Nested queries are supported in the WHERE clause, but not the 591 | FROM clause. 592 | 593 | * No arithmetic expressions are supported (for example, you can't 594 | take the sum of two fields.) 595 | 596 | * At most one GROUP BY and one aggregate column are allowed. 597 | 598 | * Set-oriented operators like IN, UNION, and EXCEPT are not 599 | allowed. 600 | 601 | * Only AND expressions in the WHERE clause are allowed. 602 | 603 | * UPDATE expressions are not supported. 604 | 605 | * The string operator LIKE is allowed, but must be written out 606 | fully (that is, the Postgres tilde [~] shorthand is not allowed.) 607 | 608 | 609 | ## 3. Logistics 610 | 611 | You must submit your code (see below) as well as a short (2 pages, maximum) 612 | writeup describing your approach. 
This writeup should: 613 | 614 | 615 | 616 | * Describe any design decisions you made, including your choice of page 617 | eviction policy. If you used something other than a nested-loops join, 618 | describe the tradeoffs of the algorithm you chose. 619 | 620 | * Discuss and justify any changes you made to the API. 621 | 622 | * Describe any missing or incomplete elements of your code. 623 | 624 | * Describe how long you spent on the lab, and whether there was anything 625 | you found particularly difficult or confusing. 626 | 627 | 628 | 629 | ### 3.1. Collaboration 630 | 631 | This lab should be manageable for a single person, but if you prefer 632 | to work with a partner, this is also OK. Larger groups are not allowed. 633 | Please indicate clearly who you worked with, if anyone, on your individual 634 | writeup. 635 | 636 | ### 3.2. Submitting your assignment 637 | 638 | 639 | 640 | 649 | 650 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab2-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 651 | 652 | 653 | You also need to explicitly add any other files you create, such as new *.java 654 | files. 655 | 656 | The criterion for your lab being submitted on time is that your code must be 657 | **tagged** and 658 | **pushed** by the date and time. This means that if one of the TAs or the 659 | instructor were to open up GitHub, they would be able to see your solutions on 660 | the GitHub web page. 661 | 662 | Just because your code has been committed on your local machine does not 663 | mean that it has been **submitted**; it needs to be on GitHub. 
664 | 665 | There is a bash script `turnInLab2.sh` in the root level directory of simple-db-hw that commits 666 | your changes, deletes any prior tag 667 | for the current lab, tags the current commit, and pushes the tag 668 | to GitHub. If you are using Linux or Mac OSX, you should be able to run the following: 669 | 670 | ```bash 671 | $ ./turnInLab2.sh 672 | ``` 673 | You should see something like the following output: 674 | 675 | ```bash 676 | $ ./turnInLab2.sh 677 | error: tag 'lab2submit' not found. 678 | remote: warning: Deleting a non-existent ref. 679 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 680 | - [deleted] lab1submit 681 | [master 7a26701] Lab 2 682 | 1 file changed, 0 insertions(+), 0 deletions(-) 683 | create mode 100644 aaa 684 | Counting objects: 3, done. 685 | Delta compression using up to 4 threads. 686 | Compressing objects: 100% (3/3), done. 687 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 688 | Total 3 (delta 1), reused 0 (delta 0) 689 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 690 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 691 | 069856c..7a26701 master -> master 692 | * [new tag] lab2submit -> lab2submit 693 | ``` 694 | 695 | 696 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 2 as follows: 697 | 698 | 1. Look at your current repository status. 699 | 700 | ```bash 701 | $ git status 702 | ``` 703 | 704 | 2. Add and commit your code changes (if they aren't already added and committed). 705 | 706 | ```bash 707 | $ git commit -a -m 'Lab 2' 708 | ``` 709 | 710 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 711 | 712 | ```bash 713 | $ git tag -d lab2submit 714 | $ git push origin :refs/tags/lab2submit 715 | ``` 716 | 717 | 4. 
Tag your last commit as the lab to be graded 718 | ```bash 719 | $ git tag -a lab2submit -m 'submit lab 2' 720 | ``` 721 | 722 | 5. This is the most important part: **push** your solutions to GitHub. 723 | 724 | ```bash 725 | $ git push origin master --tags 726 | ``` 727 | 728 | 6. The last thing that we strongly recommend you do is to go to the 729 | [MIT-DB-Class] organization page on GitHub to 730 | make sure that we can see your solutions. 731 | 732 | Just navigate to your repository and check that your latest commits are on 733 | GitHub. You should also be able to check 734 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab2submit` 735 | 736 | 737 | #### Word of Caution 738 | 739 | Git is a distributed version control system. This means everything operates 740 | offline until you run `git pull` or `git push`. This is a great feature. 741 | 742 | The bad thing is that you may forget to `git push` your changes. This is why we 743 | strongly, **strongly** suggest that you check GitHub to be sure that what you 744 | want us to see matches up with what you expect. 745 | 746 | 747 | 748 | ### 3.3. Submitting a bug 749 | 750 | SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc. 751 | 752 | We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out 753 | yourself or send us a friendly email. 754 | 755 | Please submit (friendly!) bug reports to [6.830-staff@mit.edu](mailto:6.830-staff@mit.edu). 756 | When you do, please try to include: 757 | 758 | 759 | 760 | * A description of the bug. 761 | 762 | * A .java file we can drop in the 763 | `test/simpledb` directory, compile, and run. 764 | 765 | * A .txt file with the data that reproduces the bug. We should be 766 | able to convert it to a .dat file using `HeapFileEncoder`. 
767 | 768 | 769 | 770 | You can also post on the class page on Piazza if you feel you have run into a bug. 771 | 772 | 773 | ### 3.4 Grading 774 | 775 | 50% of your grade will be based on whether or not your code passes the 776 | system test suite we will run over it. These tests will be a superset 777 | of the tests we have provided. Before handing in your code, you should 778 | make sure it produces no errors (passes all of the tests) from both 779 | ant test and ant systemtest. 780 | 781 | 782 | 783 | **Important:** before testing, we will replace your build.xml, 784 | HeapFileEncoder.java, and the entire contents of the 785 | test/ directory with our version of these files! This 786 | means you cannot change the format of .dat files! You should 787 | therefore be careful changing our APIs. This also means you need to test 788 | whether your code compiles with our test programs. 789 | 790 | In other words, we will 791 | pull your repo, replace the files mentioned above, compile it, and then 792 | grade it. It will look roughly like this: 793 | 794 | ``` 795 | [replace build.xml, HeapFileEncoder.java, and test] 796 | $ git checkout -- build.xml src/java/simpledb/HeapFileEncoder.java test/ 797 | $ ant test 798 | $ ant systemtest 799 | [additional tests] 800 | ``` 801 | 802 | If any of these commands fail, we'll be unhappy, and, therefore, so will your grade. 803 | 804 | 805 | 806 | An additional 50% of your grade will be based on the quality of your 807 | writeup and our subjective evaluation of your code. 808 | 809 | 810 | 811 | We've had a lot of fun designing this assignment, and we hope you enjoy 812 | hacking on it! 813 | 814 | 815 | 817 | 819 | 821 | 823 | 825 | -------------------------------------------------------------------------------- /lab3.md: -------------------------------------------------------------------------------- 1 | # 6.830 Lab 3: Query Optimization 2 | 3 | **Assigned: Friday, October 12, 2018**
4 | **Due: Friday, October 26, 2018** 5 | 6 | 7 | In this lab, you will implement a query optimizer on top of SimpleDB. 8 | The main tasks include implementing a selectivity estimation framework 9 | and a cost-based optimizer. You have freedom as to exactly what you 10 | implement, but we recommend using something similar to the Selinger 11 | cost-based optimizer. 12 | 13 | 14 | The remainder of this document describes what is involved in 15 | adding optimizer support and provides a basic outline of how 16 | you might add this support to your database. 17 | 18 | 19 | 20 | As with the previous lab, we recommend that you start as early as possible. 21 | 22 | 23 | ## 1. Getting started 24 | 25 | You should begin with the code you submitted for Lab 2. (If you did not 26 | submit code for Lab 2, or your solution didn't work properly, contact us to 27 | discuss options.) 28 | 29 | 30 | We have provided you with extra test cases as well 31 | as source code files for this lab 32 | that are not in the original code distribution you received. We reiterate 33 | that the unit tests we provide are to help guide your implementation along, 34 | but they are not intended to be comprehensive or to establish correctness. 35 | 36 | You will need to add these new files to your release. The easiest way 37 | to do this is to change to your project directory (probably called simple-db-hw) 38 | and pull from the master GitHub repository: 39 | 40 | ``` 41 | $ cd simple-db-hw 42 | $ git pull upstream master 43 | ``` 44 | 45 | ### 1.1. Implementation hints 46 | We suggest exercises throughout this document to guide your implementation, but you may find that a different order makes more sense for you. As before, we will grade your assignment by looking at your code and verifying that you have passed the tests for the ant targets `test` and `systemtest`. See Section 3.4 for a complete discussion of grading and the tests you will need to pass.
47 | 48 | 49 | 50 | Here's a rough outline of one way you might proceed with this lab. More details on these steps are given in Section 2 below. 51 | 52 | * Implement the methods in the TableStats class that allow 53 | it to estimate selectivities of filters and cost of 54 | scans, using histograms (skeleton provided for the IntHistogram class) or some 55 | other form of statistics of your devising. 56 | * Implement the methods in the JoinOptimizer class that 57 | allow it to estimate the cost and selectivities of joins. 58 | * Write the orderJoins method in JoinOptimizer. This method must produce 59 | an optimal ordering for a series of joins (likely using the 60 | Selinger algorithm), given statistics computed in the previous two steps. 61 | 62 | 63 | ## 2. Optimizer outline 64 | 65 | Recall that the main idea of a cost-based optimizer is to: 66 | 67 | * Use statistics about tables to estimate "costs" of different 68 | query plans. Typically, the cost of a plan is related to the cardinalities of 69 | (number of tuples produced by) intermediate joins and selections, as well as the 70 | selectivity of filter and join predicates. 71 | * Use these statistics to order joins and selections in an 72 | optimal way, and to select the best implementation for join 73 | algorithms from amongst several alternatives. 74 | 75 | In this lab, you will implement code to perform both of these 76 | functions. 77 | 78 | The optimizer will be invoked from simpledb/Parser.java. You may wish 79 | to review the lab 2 parser exercise 80 | before starting this lab. Briefly, if you have a catalog file 81 | catalog.txt describing your tables, you can run the parser by 82 | typing: 83 | ``` 84 | java -jar dist/simpledb.jar parser catalog.txt 85 | ``` 86 | 87 | 88 | When the Parser is invoked, it will compute statistics over all of the 89 | tables (using statistics code you provide). 
When a query is issued, 90 | the parser 91 | will convert the query into a logical plan representation and then call 92 | your query optimizer to generate an optimal plan. 93 | 94 | ### 2.1 Overall Optimizer Structure 95 | Before getting started with the implementation, you need to understand the overall structure of the SimpleDB optimizer. 96 | The overall control flow of the SimpleDB modules of the parser and optimizer is 97 | shown in Figure 1. 98 | 99 |

100 | ![control flow](controlflow.png)
101 | Figure 1: Diagram illustrating classes, methods, and objects used in the parser 102 |

103 | 104 | 105 | The key at the bottom explains the symbols; you 106 | will implement the components with double-borders. The classes and 107 | methods will be explained in more detail in the text that follows (you may wish to refer back 108 | to this diagram), but 109 | the basic operation is as follows: 110 | 111 | 112 | 1. Parser.java constructs a set of table statistics (stored in the 113 | statsMap container) when it is initialized. It then waits for a 114 | query to be input, and calls the method parseQuery on that query. 115 | 2. parseQuery first constructs a LogicalPlan that 116 | represents the parsed query. parseQuery then calls the method physicalPlan on the 117 | LogicalPlan instance it has constructed. The physicalPlan method returns a DBIterator object that can be used to actually run 118 | the query. 119 | 120 | 121 | 122 | In the exercises to come, you will implement the methods that help 123 | physicalPlan devise an optimal plan. 124 | 125 | 126 | ### 2.2. Statistics Estimation 127 | Accurately estimating plan cost is quite tricky. In this lab, we will 128 | focus only on the cost of sequences of joins and base table accesses. We 129 | won't worry about access method selection (since we only have one 130 | access method, table scans) or the costs of additional operators (like 131 | aggregates). 132 | 133 | You are only required to consider left-deep plans for this lab. See 134 | Section 2.4 for a description of additional "bonus" optimizer features 135 | you might implement, including an approach for handling bushy plans. 136 | 137 | #### 2.2.1 Overall Plan Cost 138 | 139 | We will write join plans of the form `p=t1 join t2 join ... tn`, 140 | which signifies a left deep join where t1 is the left-most 141 | join (deepest in the tree). 142 | Given a plan like `p`, its cost 143 | can be expressed as: 144 | 145 | ``` 146 | scancost(t1) + scancost(t2) + joincost(t1 join t2) + 147 | scancost(t3) + joincost((t1 join t2) join t3) + 148 | ...
149 | ``` 150 | 151 | Here, `scancost(t1)` is the I/O cost of scanning table t1, 152 | `joincost(t1 join t2)` is the cost of joining t1 to t2. To 153 | make I/O and CPU cost comparable, typically a constant scaling factor 154 | is used, e.g.: 155 | 156 | ``` 157 | cost(predicate application) = 1 158 | cost(pageScan) = SCALING_FACTOR x cost(predicate application) 159 | ``` 160 | 161 | For this lab, you can ignore the effects of caching (e.g., assume that 162 | every access to a table incurs the full cost of a scan) -- again, this 163 | is something you may add as an optional bonus extension to your lab 164 | in Section 2.4. Therefore, `scancost(t1)` is simply the 165 | number of pages in `t1` times `SCALING_FACTOR`. 166 | 167 | #### 2.2.2 Join Cost 168 | 169 | When using nested loops joins, recall that the cost of a join between 170 | two tables t1 and t2 (where t1 is the outer) is 171 | simply: 172 | 173 | ``` 174 | joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) //IO cost 175 | + ntups(t1) x ntups(t2) //CPU cost 176 | ``` 177 | 178 | Here, `ntups(t1)` is the number of tuples in table t1. 179 | 180 | #### 2.2.3 Filter Selectivity 181 | 182 | `ntups` can be directly computed for a base table by 183 | scanning that table. Estimating `ntups` for a table with 184 | one or more selection predicates over it can be trickier -- 185 | this is the *filter selectivity estimation* problem. Here's one 186 | approach that you might use, based on computing a histogram over the 187 | values in the table: 188 | 189 | * Compute the minimum and maximum values for every attribute in the table (by scanning 190 | it once). 191 | * Construct a histogram for every attribute in the table. A simple 192 | approach is to use a fixed number of buckets *NumB*, 193 | with 194 | each bucket representing the number of records in a fixed range of the 195 | domain of the attribute of the histogram.
For example, if a field 196 | *f* ranges from 1 to 100, and there are 10 buckets, then bucket 1 might 197 | contain the count of the number of records between 1 and 10, bucket 198 | 2 a count of the number of records between 11 and 20, and so on. 199 | * Scan the table again, selecting out all of fields of all of the 200 | tuples and using them to populate the counts of the buckets 201 | in each histogram. 202 | * To estimate the selectivity of an equality expression, 203 | *f=const*, compute the bucket that contains value *const*. 204 | Suppose the width (range of values) of the bucket is *w*, the height (number of 205 | tuples) is *h*, 206 | and the number of tuples in the table is *ntups*. Then, assuming 207 | values are uniformly distributed throughout the bucket, the selectivity of 208 | the 209 | expression is roughly *(h / w) / ntups*, since *(h/w)* 210 | represents the expected number of tuples in the bin with value 211 | *const*. 212 | * To estimate the selectivity of a range expression *f>const*, 213 | compute the 214 | bucket *b* that *const* is in, with width *w_b* and height 215 | *h_b*. Then, *b* contains a fraction *b_f = h_b / ntups* of the 216 | total tuples. Assuming tuples are uniformly distributed throughout *b*, 217 | the fraction *b_part* of *b* that is *> const* is 218 | *(b_right - const) / w_b*, where *b_right* is the right endpoint of 219 | *b*'s bucket. Thus, bucket *b* contributes *(b_f x 220 | b_part)* selectivity to the predicate. In addition, buckets 221 | *b+1...NumB-1* contribute all of their 222 | selectivity (which can be computed using a formula similar to 223 | *b_f* above). Summing the selectivity contributions of all the 224 | buckets will yield the overall selectivity of the expression. 225 | Figure 2 illustrates this process. 226 | * Selectivity of expressions involving *less than* can be performed 227 | similar to the greater than case, looking at buckets down to 0. 228 | 229 |
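To make the bucket arithmetic above concrete, here is a stand-alone sketch of an equality and a greater-than estimator. This is *not* the provided IntHistogram skeleton -- the class and method names here are our own -- it simply transcribes the *(h / w) / ntups* and *(b_right - const) / w_b* formulas described above:

```java
// Stand-alone sketch of the bucket-based estimates above.
// All names (SketchHistogram, estimateEquals, ...) are illustrative, not SimpleDB's.
class SketchHistogram {
    private final int[] counts;  // h: tuples per bucket
    private final int min;
    private final double width;  // w: range of values covered by one bucket
    private int ntups = 0;

    SketchHistogram(int numBuckets, int min, int max) {
        this.counts = new int[numBuckets];
        this.min = min;
        this.width = (double) (max - min + 1) / numBuckets;
    }

    void addValue(int v) { counts[bucketOf(v)]++; ntups++; }

    private int bucketOf(int v) {
        return Math.min((int) ((v - min) / width), counts.length - 1);
    }

    // selectivity of f = const: roughly (h / w) / ntups
    double estimateEquals(int c) {
        if (c < min || c >= min + width * counts.length) return 0.0;
        return (counts[bucketOf(c)] / width) / ntups;
    }

    // selectivity of f > const: partial credit (b_right - const) / w for
    // const's own bucket, full credit for every bucket to its right
    double estimateGreaterThan(int c) {
        if (c < min) return 1.0;
        if (c >= min + width * counts.length - 1) return 0.0;
        int b = bucketOf(c);
        double bRight = min + width * (b + 1);  // right endpoint of bucket b
        double sel = (counts[b] / (double) ntups) * ((bRight - c) / width);
        for (int i = b + 1; i < counts.length; i++)
            sel += counts[i] / (double) ntups;
        return sel;
    }
}
```

For the real IntHistogram you will also need the remaining Predicate.Op cases (LESS_THAN, NOT_EQUALS, and so on), which follow from these two by symmetry and complement.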

230 | ![histogram selectivity estimation](lab3-hist.png)
231 | Figure 2: Diagram illustrating the histograms you will implement in this lab

233 | 234 | 235 | In the next two exercises, you will write code to perform selectivity estimation of 236 | joins and filters. 237 | 238 | *** 239 | **Exercise 1: IntHistogram.java** 240 | 241 | You will need to implement 242 | some way to record table statistics for selectivity estimation. We have 243 | provided a skeleton class, IntHistogram, that will do this. Our 244 | intent is that you calculate histograms using the bucket-based method described 245 | above, but you are free to use some other method so long as it provides 246 | reasonable selectivity estimates. 247 | 248 | 249 | We have provided a class StringHistogram that uses 250 | IntHistogram to compute selectivities for String 251 | predicates. You may modify StringHistogram if you want to 252 | implement a better estimator, though you should not need to in order to 253 | complete this lab. 254 | 255 | After completing this exercise, you should be able to pass the 256 | IntHistogramTest unit test (you are not required to pass this test if you 257 | choose not to implement histogram-based selectivity estimation). 258 | 259 | *** 260 | **Exercise 2: TableStats.java** 261 | 262 | The class TableStats contains methods that compute 263 | the number of tuples and pages in a table and that estimate the 264 | selectivity of predicates over the fields of that table. The 265 | query parser we have created creates one instance of TableStats per 266 | table, and passes these structures into your query optimizer (which 267 | you will need in later exercises). 268 | 269 | You should fill in the following methods and classes in TableStats: 270 | 271 | * Implement the TableStats constructor: 272 | Once you have 273 | implemented a method for tracking statistics such as histograms, you 274 | should implement the TableStats constructor, adding code 275 | to scan the table (possibly multiple times) to build the statistics 276 | you need.
277 | * Implement estimateSelectivity(int field, Predicate.Op op, 278 | Field constant): Using your statistics (e.g., an IntHistogram 279 | or StringHistogram depending on the type of the field), estimate 280 | the selectivity of predicate field op constant on the table. 281 | * Implement estimateScanCost(): This method estimates the 282 | cost of sequentially scanning the file, given that the cost to read 283 | a page is costPerPageIO. You can assume that there are no 284 | seeks and that no pages are in the buffer pool. This method may 285 | use costs or sizes you computed in the constructor. 286 | * Implement estimateTableCardinality(double 287 | selectivityFactor): This method returns the number of tuples 288 | in the relation, given that a predicate with selectivity 289 | selectivityFactor is applied. This method may 290 | use costs or sizes you computed in the constructor. 291 | 292 | You may wish to modify the constructor of TableStats.java to, for 293 | example, compute histograms over the fields as described above for 294 | purposes of selectivity estimation. 295 | 296 | After completing these tasks you should be able to pass the unit tests 297 | in TableStatsTest. 298 | *** 299 | 300 | #### 2.2.4 Join Cardinality 301 | 302 | Finally, observe that the cost for the join plan p above 303 | includes expressions of the form joincost((t1 join t2) join 304 | t3). To evaluate this expression, you need some way to estimate 305 | the size (ntups) of t1 join t2. This *join 306 | cardinality estimation* problem is harder than the filter selectivity 307 | estimation problem. In this lab, you aren't required to do anything 308 | fancy for this, though one of the optional exercises in Section 2.4 309 | includes a histogram-based method for join selectivity estimation.
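The scan and join cost formulas from Sections 2.2.1 and 2.2.2 can be sanity-checked numerically before you wire them into TableStats and JoinOptimizer. The following stand-alone sketch transcribes them directly; the class name, method names, and the SCALING_FACTOR value are illustrative, not SimpleDB's:

```java
// Numeric transcription of the cost formulas in Sections 2.2.1-2.2.2.
// All names and the scaling constant here are illustrative, not SimpleDB's.
class CostSketch {
    static final double SCALING_FACTOR = 100.0;  // cost of one page I/O vs. one predicate application

    // scancost(t) = numPages(t) x SCALING_FACTOR
    static double scanCost(int numPages) {
        return numPages * SCALING_FACTOR;
    }

    // joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2)  // I/O cost
    //                      + ntups(t1) x ntups(t2)                    // CPU cost
    static double nestedLoopsJoinCost(int pages1, long ntups1, int pages2, long ntups2) {
        return scanCost(pages1) + ntups1 * scanCost(pages2) + (double) ntups1 * ntups2;
    }
}
```

Note how the outer table's tuple count multiplies the inner table's full scan cost: with 10 and 5 pages and 1000 and 500 tuples, the join costs 1000 + 500000 + 500000 = 1001000 units, so putting the smaller-cardinality table on the outside usually pays off.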
310 | 311 | 312 | While implementing your simple solution, you should keep in mind the following: 313 | 314 | * For equality joins, when one of the attributes is a primary key, the number of tuples produced by the join cannot 315 | be larger than the cardinality of the non-primary key attribute. 316 | * For equality joins when there is no primary key, it's hard to say much about what the size of the output 317 | is -- it could be the size of the product of the cardinalities of the tables (if both tables have the 318 | same value for all tuples) -- or it could be 0. It's fine to make up a simple heuristic (say, 319 | the size of the larger of the two tables). 320 | * For range scans, it is similarly hard to say anything accurate about sizes. 321 | The size of the output should be proportional to 322 | the sizes of the inputs. It is fine to assume that a fixed fraction 323 | of the cross-product is emitted by range scans (say, 30%). In general, the cost of a range 324 | join should be larger than the cost of a non-primary key equality join of two tables 325 | of the same size. 326 | 327 | 328 | 329 | 330 | *** 331 | **Exercise 3: Join Cost Estimation** 332 | 333 | 334 | The class JoinOptimizer.java includes all of the methods 335 | for ordering and computing costs of joins. In this exercise, you 336 | will write the methods for estimating the selectivity and cost of 337 | a join, specifically: 338 | 339 | * Implement 340 | estimateJoinCost(LogicalJoinNode j, int card1, int card2, double 341 | cost1, double cost2): This method estimates the cost of 342 | join j, given that the left input is of cardinality card1, the 343 | right input of cardinality card2, that the cost to scan the left 344 | input is cost1, and that the cost to access the right input is 345 | cost2. You can assume the join is an NL join, and apply 346 | the formula mentioned earlier.
347 | * Implement estimateJoinCardinality(LogicalJoinNode j, int 348 | card1, int card2, boolean t1pkey, boolean t2pkey): This 349 | method estimates the number of tuples output by join j, given that 350 | the left input is size card1, the right input is size card2, and 351 | the flags t1pkey and t2pkey that indicate whether the left and 352 | right (respectively) field is unique (a primary key). 353 | 354 | After implementing these methods, you should be able to pass the unit 355 | tests estimateJoinCostTest and estimateJoinCardinality in JoinOptimizerTest.java. 356 | *** 357 | 358 | 359 | ### 2.3 Join Ordering 360 | 361 | Now that you have implemented methods for estimating costs, you will 362 | implement the Selinger optimizer. For these methods, joins are 363 | expressed as a list of join nodes (e.g., predicates over two tables) 364 | as opposed to a list of relations to join as described in class. 365 | 366 | An outline in pseudocode would be: 367 | 368 | ``` 369 | 1. j = set of join nodes 370 | 2. for (i in 1...|j|): 371 | 3. for s in {all length i subsets of j} 372 | 4. bestPlan = {} 373 | 5. for s' in {all length i-1 subsets of s} 374 | 6. subplan = optjoin(s') 375 | 7. plan = best way to join (s-s') to subplan 376 | 8. if (cost(plan) < cost(bestPlan)) 377 | 9. bestPlan = plan 378 | 10. optjoin(s) = bestPlan 379 | 11. return optjoin(j) 380 | ``` 381 | 382 | To help you implement this algorithm, we have provided several classes and methods to assist you. First, 383 | the method enumerateSubsets(Vector v, int size) in JoinOptimizer.java will return 384 | a set of all of the subsets of v of size size. This method is not particularly efficient; you can earn 385 | extra credit by implementing a more efficient enumerator.
386 | 387 | Second, we have provided the method: 388 | ```java 389 | private CostCard computeCostAndCardOfSubplan(HashMap stats, 390 | HashMap filterSelectivities, 391 | LogicalJoinNode joinToRemove, 392 | Set joinSet, 393 | double bestCostSoFar, 394 | PlanCache pc) 395 | ``` 396 | 397 | Given a subset of joins (joinSet), and a join to remove from 398 | this set (joinToRemove), this method computes the best way to 399 | join joinToRemove to joinSet - {joinToRemove}. It 400 | returns this best method in a CostCard object, which includes 401 | the cost, cardinality, and best join ordering (as a vector). 402 | computeCostAndCardOfSubplan may return null, if no plan can 403 | be found (because, for example, there is no left-deep join that is 404 | possible), or if the cost of all plans is greater than the 405 | bestCostSoFar argument. The method uses a cache of previous 406 | joins called pc (optjoin in the pseudocode above) to 407 | quickly lookup the fastest way to join joinSet - 408 | {joinToRemove}. The other arguments (stats and 409 | filterSelectivities) are passed into the orderJoins 410 | method that you must implement as a part of Exercise 4, and are 411 | explained below. This method essentially performs lines 6--8 of the 412 | pseudocode described earlier. 413 | 414 | Third, we have provided the method: 415 | ```java 416 | private void printJoins(Vector js, 417 | PlanCache pc, 418 | HashMap stats, 419 | HashMap selectivities) 420 | ``` 421 | 422 | This method can be used to display a graphical representation of a join plan (when the "explain" flag is set via 423 | the "-explain" option to the optimizer, for example). 424 | 425 | Fourth, we have provided a class PlanCache that can be used 426 | to cache the best way to join a subset of the joins considered so far 427 | in your implementation of Selinger (an instance of this class is 428 | needed to use computeCostAndCardOfSubplan). 
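To see how the pseudocode, the plan cache, and the per-step cost computation fit together, here is a self-contained toy version of the dynamic program. "Joins" are just the integers 0..n-1, stepCost() is an arbitrary stand-in for computeCostAndCardOfSubplan, and two HashMaps play the role of the PlanCache; none of these names come from SimpleDB:

```java
import java.util.*;

// Toy, self-contained version of the Selinger dynamic program above.
// Subsets are represented as bitmasks; stepCost() is a hypothetical
// stand-in for computeCostAndCardOfSubplan.
class SelingerSketch {
    // Hypothetical cost of appending join j to an already-ordered prefix,
    // where prefixMask has one bit set per join in the prefix.
    static double stepCost(int prefixMask, int j) {
        return Integer.bitCount(prefixMask) * 10.0 + j;
    }

    static List<Integer> orderJoins(int n) {
        Map<Integer, Double> bestCost = new HashMap<>();        // the "PlanCache"
        Map<Integer, List<Integer>> bestPlan = new HashMap<>();
        bestCost.put(0, 0.0);
        bestPlan.put(0, new ArrayList<>());

        for (int size = 1; size <= n; size++) {                 // line 2 of the pseudocode
            for (int s = 0; s < (1 << n); s++) {                // line 3: subsets of this size
                if (Integer.bitCount(s) != size) continue;
                for (int j = 0; j < n; j++) {                   // line 5: choose the join done last
                    if ((s & (1 << j)) == 0) continue;
                    int rest = s & ~(1 << j);                   // s - {j}, solved at the previous size
                    double cost = bestCost.get(rest) + stepCost(rest, j);
                    if (!bestCost.containsKey(s) || cost < bestCost.get(s)) {
                        bestCost.put(s, cost);                  // lines 8-10: cache the best plan
                        List<Integer> plan = new ArrayList<>(bestPlan.get(rest));
                        plan.add(j);
                        bestPlan.put(s, plan);
                    }
                }
            }
        }
        return bestPlan.get((1 << n) - 1);                      // line 11: best order of all joins
    }
}
```

In your actual orderJoins you would iterate with enumerateSubsets, call computeCostAndCardOfSubplan for each (subset, removed join) pair, and store results in the provided PlanCache, but the loop structure is the same.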
429 | 430 | *** 431 | **Exercise 4: Join Ordering** 432 | 433 | 434 | In JoinOptimizer.java, implement the method: 435 | ```java 436 | Vector orderJoins(HashMap stats, 437 | HashMap filterSelectivities, 438 | boolean explain) 439 | ``` 440 | 441 | This method should operate on the joins class member, 442 | returning a new Vector that specifies the order in which joins 443 | should be done. Item 0 of this vector indicates the left-most, 444 | bottom-most join in a left-deep plan. Adjacent joins in the 445 | returned vector should share at least one field to ensure the plan 446 | is left-deep. Here stats is an object that lets you find 447 | the TableStats for a given table name that appears in the 448 | FROM list of the query. filterSelectivities 449 | allows you to find the selectivity of any predicates over a table; 450 | it is guaranteed to have one entry per table name in the 451 | FROM list. Finally, explain specifies that you 452 | should output a representation of the join order for informational purposes. 453 | 454 | 455 | You may wish to use the helper methods and classes described above to assist 456 | in your implementation. Roughly, your implementation should follow 457 | the pseudocode above, looping through subset sizes, subsets, and 458 | sub-plans of subsets, calling computeCostAndCardOfSubplan and 459 | building a PlanCache object that stores the minimal-cost 460 | way to perform each subset join. 461 | 462 | After implementing this method, you should be able to pass all the unit tests in 463 | JoinOptimizerTest. You should also pass the system test 464 | QueryTest. 465 | *** 466 | 467 | 468 | ### 2.4 Extra Credit 469 | 470 | In this section, we describe several optional exercises that you may 471 | implement for extra credit. These are less well defined than the 472 | previous exercises but give you a chance to show off your mastery of 473 | query optimization!
474 | 475 | *** 476 | **Bonus Exercises.** Each of these bonuses is worth up to 5% extra credit: 477 | 478 | * *Add code to perform more advanced join cardinality estimation*. 479 | Rather than using simple heuristics to estimate join cardinality, 480 | devise a more sophisticated algorithm. 481 | * One option is to use joint histograms between 482 | every pair of attributes *a* and *b* in every pair of tables *t1* and *t2*. 483 | The idea is to create buckets of *a*, and for each bucket *A* of *a*, create a 484 | histogram of *b* values that co-occur with *a* values in *A*. 485 | * Another way to estimate the cardinality of a join is to assume that each value in the smaller table has a matching value in the larger table. Then the formula for the join selectivity would be: 1/(*Max*(*num-distinct*(t1, column1), *num-distinct*(t2, column2))). Here, column1 and column2 are the join attributes. The cardinality of the join is then the product of the cardinalities of *t1* and *t2* times the selectivity.
486 | * *Improved subset iterator*. Our implementation of 487 | enumerateSubsets is quite inefficient, because it creates 488 | a large number of Java objects on each invocation. A better 489 | approach would be to implement an iterator that, for example, 490 | returns a BitSet that specifies the elements in the 491 | joins vector that should be accessed on each iteration. 492 | In this bonus exercise, you would improve the performance of 493 | enumerateSubsets so that your system could perform query 494 | optimization on plans with 20 or more joins (currently such plans 495 | take minutes or hours to compute). 496 | * *A cost model that accounts for caching*. The methods to 497 | estimate scan and join cost do not account for caching in the 498 | buffer pool. You should extend the cost model to account for 499 | caching effects. This is tricky because multiple joins are 500 | running simultaneously due to the iterator model, and so it may be 501 | hard to predict how much memory each will have access to using the 502 | simple buffer pool we have implemented in previous labs. 503 | * *Improved join algorithms and algorithm selection*. Our 504 | current cost estimation and join operator selection algorithms 505 | (see instantiateJoin() in JoinOptimizer.java) 506 | only consider nested loops joins. Extend these methods to use one 507 | or more additional join algorithms (for example, some form of 508 | in-memory hashing using a HashMap). 509 | * *Bushy plans*. Improve the provided orderJoins() and other helper 510 | methods to generate bushy joins. Our query plan 511 | generation and visualization algorithms are perfectly capable of 512 | handling bushy plans; for example, if orderJoins() 513 | returns the vector (t1 join t2 ; t3 join t4 ; t2 join t3), this 514 | will correspond to a bushy plan with the (t2 join t3) node at the top. 515 | 516 | *** 517 | 518 | 519 | You have now completed this lab. 520 | Good work! 521 | 522 | ## 3.
Logistics 523 | You must submit your code (see below) as well as a short (2 pages, maximum) 524 | writeup describing your approach. This writeup should: 525 | 526 | * Describe any design decisions you made, including your methods for selectivity estimation 527 | and join ordering, as well as any of the bonus exercises you chose to implement and how 528 | you implemented them (for each bonus exercise you may submit up to 1 additional page). 529 | * Discuss and justify any changes you made to the API. 530 | * Describe any missing or incomplete elements of your code. 531 | * Describe how long you spent on the lab, and whether there was anything 532 | you found particularly difficult or confusing. 533 | 534 | ### 3.1. Collaboration 535 | This lab should be manageable for a single person, but if you prefer 536 | to work with a partner, this is also OK. Larger groups are not allowed. 537 | Please indicate clearly who you worked with, if anyone, on your writeup. 538 | 539 | ### 3.2. Submitting your assignment 540 | 541 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab3-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 542 | 543 | You also need to explicitly add any other files you create, such as new *.java 544 | files. 545 | 546 | The criterion for your lab being submitted on time is that your code must be 547 | **tagged** and 548 | **pushed** by the due date and time. This means that if one of the TAs or the 549 | instructor were to open up GitHub, they would be able to see your solutions on 550 | the GitHub web page. 551 | 552 | Just because your code has been committed on your local machine does not 553 | mean that it has been **submitted**; it needs to be on GitHub.
554 | 555 | There is a bash script `turnInLab3.sh` in the root level directory of simple-db-hw that commits 556 | your changes, deletes any prior tag 557 | for the current lab, tags the current commit, and pushes the tag 558 | to GitHub. If you are using Linux or Mac OSX, you should be able to run the following: 559 | 560 | ```bash 561 | $ ./turnInLab3.sh 562 | ``` 563 | You should see something like the following output: 564 | 565 | ```bash 566 | $ ./turnInLab3.sh 567 | error: tag 'lab3submit' not found. 568 | remote: warning: Deleting a non-existent ref. 569 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 570 | - [deleted] lab1submit 571 | [master 7a26701] Lab 3 572 | 1 file changed, 0 insertions(+), 0 deletions(-) 573 | create mode 100644 aaa 574 | Counting objects: 3, done. 575 | Delta compression using up to 4 threads. 576 | Compressing objects: 100% (3/3), done. 577 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 578 | Total 3 (delta 1), reused 0 (delta 0) 579 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 580 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 581 | 069856c..7a26701 master -> master 582 | * [new tag] lab3submit -> lab3submit 583 | ``` 584 | 585 | 586 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for Lab 3 as follows: 587 | 588 | 1. Look at your current repository status. 589 | 590 | ```bash 591 | $ git status 592 | ``` 593 | 594 | 2. Add and commit your code changes (if they aren't already added and committed). 595 | 596 | ```bash 597 | $ git commit -a -m 'Lab 3' 598 | ``` 599 | 600 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 601 | 602 | ```bash 603 | $ git tag -d lab3submit 604 | $ git push origin :refs/tags/lab3submit 605 | ``` 606 | 607 | 4.
Tag your last commit as the lab to be graded 608 | ```bash 609 | $ git tag -a lab3submit -m 'submit lab 3' 610 | ``` 611 | 612 | 5. This is the most important part: **push** your solutions to GitHub. 613 | 614 | ```bash 615 | $ git push origin master --tags 616 | ``` 617 | 618 | 6. The last thing that we strongly recommend you do is to go to the 619 | [MIT-DB-Class] organization page on GitHub to 620 | make sure that we can see your solutions. 621 | 622 | Just navigate to your repository and check that your latest commits are on 623 | GitHub. You should also be able to check 624 | `https://github.com/MIT-DB-Class/homework-solns-2018-(mit id)/tree/lab3submit` 625 | 626 | 627 | #### Word of Caution 628 | 629 | Git is a distributed version control system. This means everything operates 630 | offline until you run `git pull` or `git push`. This is a great feature. 631 | 632 | The bad thing is that you may forget to `git push` your changes. This is why we 633 | strongly, **strongly** suggest that you check GitHub to be sure that what you 634 | want us to see matches up with what you expect. 635 | 636 | 637 | 638 | 639 | ### 3.3. Submitting a bug 640 | 641 | SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc. 642 | 643 | We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out 644 | yourself or send us a friendly email. 645 | 646 | Please submit (friendly!) bug reports to 6.830-staff@mit.edu. 647 | When you do, please try to include: 648 | 649 | * A description of the bug. 650 | * A .java file we can drop in the 651 | `test/simpledb` directory, compile, and run. 652 | * A .txt file with the data that reproduces the bug. We should be 653 | able to convert it to a .dat file using `HeapFileEncoder`. 
654 | 
655 | You can also post on the class page on Piazza if you feel you have run into a bug.
656 | 
657 | 
658 | ### 3.4 Grading
659 | 50% of your grade will be based on whether or not your code passes the
660 | test suite we will run over it. These tests will be a superset
661 | of the tests we have provided. Before handing in your code, you should
662 | make sure it produces no errors (passes all of the tests) from both
663 | `ant test` and `ant systemtest`.
664 | 
665 | **Important:** before testing, we will replace your build.xml,
666 | HeapFileEncoder.java, BPlusTreeFileEncoder.java, and the entire contents of the
667 | test/ directory with our version of these files! This
668 | means you cannot change the format of .dat files! You should
669 | therefore be careful when changing our APIs. This also means you need to test
670 | whether your code compiles with our test programs. In other words, we will
671 | pull your repo, replace the files mentioned above, compile it, and then
672 | grade it. It will look roughly like this:
673 | 
674 | ```
675 | $ git pull
676 | [replace build.xml, HeapFileEncoder.java, BPlusTreeFileEncoder.java and test]
677 | $ ant test
678 | $ ant systemtest
679 | [additional tests]
680 | ```
681 | If any of these commands fail, we'll be unhappy, and, therefore, so will your grade.
682 | 
683 | 
684 | An additional 50% of your grade will be based on the quality of your
685 | writeup and our subjective evaluation of your code.
686 | 
687 | 
688 | We've had a lot of fun designing this assignment, and we hope you enjoy
689 | hacking on it!
690 | 
--------------------------------------------------------------------------------
/lab1.md:
--------------------------------------------------------------------------------
1 | # 6.830 Lab 1: SimpleDB
2 | 
3 | **Assigned: Mon, Sept 17**
4 | 
5 | **Due: Wed, Sept 26 11:59 PM EDT**
6 | 
7 | 
8 | 
14 | 
15 | In the lab assignments in 6.830 you will write a basic database management system called SimpleDB.
For this lab, you will focus on implementing the core modules required to access data stored on disk; in future labs, you will add support for various query processing operators, as well as transactions, locking, and concurrent queries.
16 | 
17 | SimpleDB is written in Java. We have provided you with a set of mostly unimplemented classes and interfaces. You will need to write the code for these classes. We will grade your code by running a set of system tests written using [JUnit](http://junit.sourceforge.net/). We have also provided a number of unit tests, which we will not use for grading but that you may find useful in verifying that your code works.
18 | 
19 | The remainder of this document describes the basic architecture of SimpleDB, gives some suggestions about how to start coding, and discusses how to hand in your lab.
20 | 
21 | We **strongly recommend** that you start as early as possible on this lab. It requires you to write a fair amount of code!
22 | 
23 | 
39 | 
40 | 
41 | ## 0. Environment Setup
42 | 
43 | **Start by downloading the code for lab 1 from the course GitHub repository by following the instructions [here](https://github.com/MIT-DB-Class/course-info-2018).**
44 | 
45 | These instructions are written for Athena or any other Unix-based platform (e.g., Linux, MacOS, etc.). Because the code is written in Java, it should work under Windows as well, although the directions in this document may not apply.
46 | 
47 | We have included [Section 1.2](#eclipse) on using the project with Eclipse.
48 | 
49 | 
50 | ## 1. Getting started
51 | 
52 | 
53 | SimpleDB uses the [Ant build tool](http://ant.apache.org/) to compile the code and run tests. Ant is similar to [make](http://www.gnu.org/software/make/manual/), but the build file is written in XML and is somewhat better suited to Java code. Most modern Linux distributions include Ant. Under Athena, it is included in the `sipb` locker, which you can get to by typing `add sipb` at the Athena prompt.
Note that on some versions of Athena you must also run `add -f java` to set the environment correctly for Java programs. See the [Athena documentation on using Java](http://web.mit.edu/acs/www/languages.html#Java) for more details.
54 | 
55 | To help you during development, we have provided a set of unit tests in addition to the end-to-end tests that we use for grading. These are by no means comprehensive, and you should not rely on them exclusively to verify the correctness of your project (put those 6.170 skills to use!).
56 | 
57 | To run the unit tests use the `test` build target:
58 | 
59 | ```
60 | $ cd [project-directory]
61 | $ # run all unit tests
62 | $ ant test
63 | $ # run a specific unit test
64 | $ ant runtest -Dtest=TupleTest
65 | ```
66 | 
67 | You should see output similar to:
68 | 
69 | ```
70 | build output...
71 | 
72 | test:
73 | [junit] Running simpledb.CatalogTest
74 | [junit] Testsuite: simpledb.CatalogTest
75 | [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.037 sec
76 | [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.037 sec
77 | 
78 | ... stack traces and error reports ...
79 | ```
80 | 
81 | The output above indicates that two errors occurred while running the tests; this is because the code we have given you doesn't yet work. As you complete parts of the lab, you will work towards passing additional unit tests.
82 | 
83 | If you wish to write new unit tests as you code, they should be added to the test/simpledb directory.
84 | 
85 | 

For more details about how to use Ant, see the [manual](http://ant.apache.org/manual/). The [Running Ant](http://ant.apache.org/manual/running.html) section provides details about using the `ant` command. However, the quick reference table below should be sufficient for working on the labs.
86 | 
87 | Command | Description
88 | --- | ---
89 | ant|Build the default target (for simpledb, this is dist).
90 | ant -projecthelp|List all the targets in `build.xml` with descriptions.
91 | ant dist|Compile the code in src and package it in `dist/simpledb.jar`.
92 | ant test|Compile and run all the unit tests.
93 | ant runtest -Dtest=testname|Run the unit test named `testname`.
94 | ant systemtest|Compile and run all the system tests.
95 | ant runsystest -Dtest=testname|Compile and run the system test named `testname`.
96 | 
97 | 
98 | If you are on Windows and don't want to run the Ant tests from the command line, you can also run them from Eclipse. Right-click build.xml; in the Targets tab you will see targets such as "runtest" and "runsystest". For example, selecting runtest is equivalent to running "ant runtest" from the command line. Arguments such as "-Dtest=testname" can be specified in the "Arguments" textbox of the "Main" tab. Note that you can also create a shortcut to runtest by copying build.xml, modifying its targets and arguments, and renaming it to, say, runtest_build.xml.
99 | 
100 | ### 1.1. Running end-to-end tests
101 | 
102 | We have also provided a set of end-to-end tests that will eventually be used for grading. These tests are structured as JUnit tests that live in the test/simpledb/systemtest directory. To run all the system tests, use the `systemtest` build target:
103 | 
104 | ```
105 | $ ant systemtest
106 | 
107 | ... build output ...
108 | 109 | [junit] Testcase: testSmall took 0.017 sec 110 | [junit] Caused an ERROR 111 | [junit] expected to find the following tuples: 112 | [junit] 19128 113 | [junit] 114 | [junit] java.lang.AssertionError: expected to find the following tuples: 115 | [junit] 19128 116 | [junit] 117 | [junit] at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:122) 118 | [junit] at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:83) 119 | [junit] at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:75) 120 | [junit] at simpledb.systemtest.ScanTest.validateScan(ScanTest.java:30) 121 | [junit] at simpledb.systemtest.ScanTest.testSmall(ScanTest.java:40) 122 | 123 | ... more error messages ... 124 | ``` 125 | 126 |

This indicates that this test failed, showing the stack trace where the error was detected. To debug, start by reading the source code where the error occurred. When the tests pass, you will see something like the following: 127 | 128 | ``` 129 | $ ant systemtest 130 | 131 | ... build output ... 132 | 133 | [junit] Testsuite: simpledb.systemtest.ScanTest 134 | [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 7.278 sec 135 | [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 7.278 sec 136 | [junit] 137 | [junit] Testcase: testSmall took 0.937 sec 138 | [junit] Testcase: testLarge took 5.276 sec 139 | [junit] Testcase: testRandom took 1.049 sec 140 | 141 | BUILD SUCCESSFUL 142 | Total time: 52 seconds 143 | ``` 144 | 145 | #### 1.1.1 Creating dummy tables 146 | 147 | It is likely you'll want to create your own tests and your own data tables to test your own implementation of SimpleDB. You can create any .txt file and convert it to a .dat file in SimpleDB's `HeapFile` format using the command: 148 | 149 | ``` 150 | $ java -jar dist/simpledb.jar convert file.txt N 151 | ``` 152 | 153 | where file.txt is the name of the file and N is the number of columns in the file. Notice that file.txt has to be in the following format: 154 | 155 | ``` 156 | int1,int2,...,intN 157 | int1,int2,...,intN 158 | int1,int2,...,intN 159 | int1,int2,...,intN 160 | ``` 161 | 162 | ...where each intN is a non-negative integer. 163 | 164 | To view the contents of a table, use the `print` command: 165 | 166 | ``` 167 | $ java -jar dist/simpledb.jar print file.dat N 168 | ``` 169 | 170 | where file.dat is the name of a table created with the convert command, and N is the number of columns in the file. 171 | 172 | 173 | 174 | ### 1.2. Working in Eclipse 175 | 176 | [Eclipse](http://www.eclipse.org) is a graphical software development environment that you might be more comfortable with working in. 
The instructions we provide were generated using Eclipse for Java Developers (not the enterprise edition) with Java 1.7.
177 | 
178 | **Setting the Lab Up in Eclipse**
179 | 
180 | * Once Eclipse is installed, start it, and note that the first screen asks you to select a location for your workspace (we will refer to this directory as $W). Select the directory containing your simple-db-hw repository.
181 | * In Eclipse, select File->New->Project->Java->Java Project, and push Next.
182 | * Enter "simple-db-hw" as the project name.
183 | * On the same screen that you entered the project name, select "Create project from existing source," and browse to $W/simple-db-hw.
184 | * Click finish, and you should be able to see "simple-db-hw" as a new project in the Project Explorer tab on the left-hand side of your screen. Opening this project reveals the directory structure discussed above - implementation code can be found in "src," and unit tests and system tests found in "test."
185 | 
186 | **Note:** This class assumes that you are using the official Oracle release of Java. This is the default on MacOS X, and for most Windows Eclipse installs; but many Linux distributions default to alternate Java runtimes (like OpenJDK). Please download the latest Java8 updates from [Oracle Website](http://www.oracle.com/technetwork/java/javase/downloads/index.html), and use that Java version. If you don't switch, you may see spurious test failures in some of the performance tests in later labs.
187 | 
188 | **Running Individual Unit and System Tests**
189 | 
190 | To run a unit test or system test (both are JUnit tests, and can be initialized the same way), go to the Package Explorer tab on the left side of your screen. Under the "simple-db-hw" project, open the "test" directory. Unit tests are found in the "simpledb" package, and system tests are found in the "simpledb.systemtests" package.
To run one of these tests, select the test (they are all called *Test.java - don't select TestUtil.java or SystemTestUtil.java), right click on it, select "Run As," and select "JUnit Test." This will bring up a JUnit tab, which will tell you the status of the individual tests within the JUnit test suite, and will show you exceptions and other errors that will help you debug problems. 191 | 192 | **Running Ant Build Targets** 193 | 194 | If you want to run commands such as "ant test" or "ant systemtest," right click on build.xml in the Package Explorer. Select "Run As," and then "Ant Build..." (note: select the option with the ellipsis (...), otherwise you won't be presented with a set of build targets to run). Then, in the "Targets" tab of the next screen, check off the targets you want to run (probably "dist" and one of "test" or "systemtest"). This should run the build targets and show you the results in Eclipse's console window. 195 | 196 | ### 1.3. Implementation hints 197 | 198 | Before beginning to write code, we **strongly encourage** you to read through this entire document to get a feel for the high-level design of SimpleDB. 199 | 200 |

201 | 202 | You will need to fill in any piece of code that is not implemented. It will be obvious where we think you should write code. You may need to add private methods and/or helper classes. You may change APIs, but make sure our [grading](#grading) tests still run and make sure to mention, explain, and defend your decisions in your writeup. 203 | 204 |

205 | 206 | In addition to the methods that you need to fill out for this lab, the class interfaces contain numerous methods that you need not implement until subsequent labs. These will either be indicated per class: 207 | 208 | ```java 209 | // Not necessary for lab1. 210 | public class Insert implements DbIterator { 211 | ``` 212 | 213 | or per method: 214 | 215 | ```Java 216 | public boolean deleteTuple(Tuple t) throws DbException { 217 | // some code goes here 218 | // not necessary for lab1 219 | return false; 220 | } 221 | ``` 222 | 223 | 224 | The code that you submit should compile without having to modify these methods. 225 | 226 |

227 | 
228 | We suggest exercises throughout this document to guide your implementation, but you may find that a different order makes more sense for you.
229 | 
230 | **Here's a rough outline of one way you might proceed with your SimpleDB implementation:**
231 | 
232 | ****
233 | * Implement the classes to manage tuples, namely Tuple, TupleDesc. We have already implemented Field, IntField, StringField, and Type for you. Since you only need to support integer and (fixed length) string fields and fixed length tuples, these are straightforward.
234 | * Implement the Catalog (this should be very simple).
235 | * Implement the BufferPool constructor and the getPage() method.
236 | * Implement the access methods, HeapPage and HeapFile, and associated ID classes. A good portion of these files has already been written for you.
237 | * Implement the operator SeqScan.
238 | * At this point, you should be able to pass the ScanTest system test, which is the goal for this lab.
239 | 
240 | ***
241 | 
242 | Section 2 below walks you through these implementation steps and the unit tests corresponding to each one in more detail.
243 | 
244 | ### 1.4. Transactions, locking, and recovery
245 | 
246 | As you look through the interfaces we have provided you, you will see a number of references to locking, transactions, and recovery. You do not need to support these features in this lab, but you should keep these parameters in the interfaces of your code because you will be implementing transactions and locking in a future lab. The test code we have provided you with generates a fake transaction ID that is passed into the operators of the query it runs; you should pass this transaction ID into other operators and the buffer pool.
247 | 
248 | ## 2. 
SimpleDB Architecture and Implementation Guide 249 | 250 | SimpleDB consists of: 251 | 252 | 253 | * Classes that represent fields, tuples, and tuple schemas; 254 | * Classes that apply predicates and conditions to tuples; 255 | * One or more access methods (e.g., heap files) that store relations on disk and provide a way to iterate through tuples of those relations; 256 | * A collection of operator classes (e.g., select, join, insert, delete, etc.) that process tuples; 257 | * A buffer pool that caches active tuples and pages in memory and handles concurrency control and transactions (neither of which you need to worry about for this lab); and, 258 | * A catalog that stores information about available tables and their schemas. 259 | 260 | 261 | SimpleDB does not include many things that you may think of as being a part of a "database." In particular, SimpleDB does not have: 262 | 263 | * (In this lab), a SQL front end or parser that allows you to type queries directly into SimpleDB. Instead, queries are built up by chaining a set of operators together into a hand-built query plan (see [Section 2.7](#query_walkthrough)). We will provide a simple parser for use in later labs. 264 | * Views. 265 | * Data types except integers and fixed length strings. 266 | * (In this lab) Query optimizer. 267 | * (In this lab) Indices. 268 | 269 |

270 | 271 | In the rest of this Section, we describe each of the main components of SimpleDB that you will need to implement in this lab. You should use the exercises in this discussion to guide your implementation. This document is by no means a complete specification for SimpleDB; you will need to make decisions about how to design and implement various parts of the system. Note that for Lab 1 you do not need to implement any operators (e.g., select, join, project) except sequential scan. You will add support for additional operators in future labs. 272 | 273 |

274 | 275 | ### 2.1. The Database Class 276 | 277 | The Database class provides access to a collection of static objects that are the global state of the database. In particular, this includes methods to access the catalog (the list of all the tables in the database), the buffer pool (the collection of database file pages that are currently resident in memory), and the log file. You will not need to worry about the log file in this lab. We have implemented the Database class for you. You should take a look at this file as you will need to access these objects. 278 | 279 | ### 2.2. Fields and Tuples 280 | 281 |

Tuples in SimpleDB are quite basic. They consist of a collection of `Field` objects, one per field in the `Tuple`. `Field` is an interface that different data types (e.g., integer, string) implement. `Tuple` objects are created by the underlying access methods (e.g., heap files, or B-trees), as described in the next section. Tuples also have a type (or schema), called a _tuple descriptor_, represented by a `TupleDesc` object. This object consists of a collection of `Type` objects, one per field in the tuple, each of which describes the type of the corresponding field.
282 | 
283 | ### Exercise 1
284 | 
285 | **Implement the skeleton methods in:**
286 | ***
287 | * src/simpledb/TupleDesc.java
288 | * src/simpledb/Tuple.java
289 | 
290 | ***
291 | 
292 | 
293 | At this point, your code should pass the unit tests TupleTest and TupleDescTest. However, modifyRecordId() should fail because you haven't implemented it yet.
294 | 
295 | ### 2.3. Catalog
296 | 
297 | The catalog (class `Catalog` in SimpleDB) consists of a list of the tables and schemas of the tables that are currently in the database. You will need to support the ability to add a new table, as well as to get information about a particular table. Associated with each table is a `TupleDesc` object that allows operators to determine the types and number of fields in a table.
298 | 
299 | The global catalog is a single instance of `Catalog` that is allocated for the entire SimpleDB process. The global catalog can be retrieved via the method `Database.getCatalog()`, and the same goes for the global buffer pool (using `Database.getBufferPool()`).
300 | 
301 | ### Exercise 2
302 | 
303 | **Implement the skeleton methods in:**
304 | ***
305 | * src/simpledb/Catalog.java
306 | 
307 | ***
308 | 
309 | At this point, your code should pass the unit tests in CatalogTest.
310 | 
311 | 
312 | ### 2.4. BufferPool
313 | 
314 | 

The buffer pool (class `BufferPool` in SimpleDB) is responsible for caching pages in memory that have been recently read from disk. All operators read and write pages from various files on disk through the buffer pool. It consists of a fixed number of pages, defined by the `numPages` parameter to the `BufferPool` constructor. For this lab, you only need to implement the constructor and the `BufferPool.getPage()` method used by the SeqScan operator. The BufferPool should store up to `numPages` pages. For this lab, if more than `numPages` requests are made for different pages, then instead of implementing an eviction policy, you may throw a DbException. In future labs you will be required to implement an eviction policy.
315 | 
316 | The `Database` class provides a static method, `Database.getBufferPool()`, that returns a reference to the single BufferPool instance for the entire SimpleDB process.
317 | 
318 | ### Exercise 3
319 | 
320 | **Implement the `getPage()` method in:**
321 | 
322 | ***
323 | * src/simpledb/BufferPool.java
324 | 
325 | ***
326 | 
327 | We have not provided unit tests for BufferPool. The functionality you implement will be tested in the implementation of HeapFile below. You should use the `DbFile.readPage` method to access pages of a DbFile.
328 | 
329 | 
330 | 
335 | 
336 | 
345 | 
346 | ### 2.5. HeapFile access method
347 | 
348 | Access methods provide a way to read or write data from disk that is arranged in a specific way. Common access methods include heap files (unsorted files of tuples) and B-trees; for this assignment, you will only implement a heap file access method, and we have written some of the code for you.
349 | 
350 | 

351 | 
352 | A `HeapFile` object is arranged into a set of pages, each of which consists of a fixed number of bytes for storing tuples (defined by the constant `BufferPool.DEFAULT_PAGE_SIZE`), including a header. In SimpleDB, there is one `HeapFile` object for each table in the database. Each page in a `HeapFile` is arranged as a set of slots, each of which can hold one tuple (tuples for a given table in SimpleDB are all of the same size). In addition to these slots, each page has a header that consists of a bitmap with one bit per tuple slot. If the bit corresponding to a particular tuple is 1, it indicates that the tuple is valid; if it is 0, the tuple is invalid (e.g., has been deleted or was never initialized.) Pages of `HeapFile` objects are of type `HeapPage`, which implements the `Page` interface. Pages are stored in the buffer pool but are read and written by the `HeapFile` class.
353 | 
354 | 

355 | 356 | SimpleDB stores heap files on disk in more or less the same format they are stored in memory. Each file consists of page data arranged consecutively on disk. Each page consists of one or more bytes representing the header, followed by the _page size_ bytes of actual page content. Each tuple requires _tuple size_ * 8 bits for its content and 1 bit for the header. Thus, the number of tuples that can fit in a single page is: 357 | 358 |

359 | 360 | ` 361 | _tuples per page_ = floor((_page size_ * 8) / (_tuple size_ * 8 + 1)) 362 | ` 363 | 364 |

365 | 
366 | Where _tuple size_ is the size of a tuple in the page in bytes. The idea here is that each tuple requires one additional bit of storage in the header. We compute the number of bits in a page (by multiplying page size by 8), and divide this quantity by the number of bits in a tuple (including this extra header bit) to get the number of tuples per page. The floor operation rounds down to the nearest integer number of tuples (we don't want to store partial tuples on a page!)
367 | 
368 | 

369 | 370 | Once we know the number of tuples per page, the number of bytes required to store the header is simply: 371 |

372 | 373 | ` 374 | headerBytes = ceiling(tupsPerPage/8) 375 | ` 376 | 377 |

378 | 379 | The ceiling operation rounds up to the nearest integer number of bytes (we never store less than a full byte of header information.) 380 | 381 |
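As a quick sanity check of the two formulas above, here is a minimal, self-contained sketch in Java. The 4096-byte page and 12-byte tuple are example values chosen for illustration, not prescribed by the lab, and `PageMath` is not part of the SimpleDB API:

```java
// Sketch of the page-layout arithmetic described above.
public class PageMath {
    static int tuplesPerPage(int pageSizeBytes, int tupleSizeBytes) {
        // Each tuple costs tupleSize * 8 bits of data plus 1 header bit;
        // integer division gives the floor.
        return (pageSizeBytes * 8) / (tupleSizeBytes * 8 + 1);
    }

    static int headerBytes(int tuplesPerPage) {
        // Round up to a whole byte of header.
        return (tuplesPerPage + 7) / 8;
    }

    public static void main(String[] args) {
        int tpp = tuplesPerPage(4096, 12);
        System.out.println("tuples per page = " + tpp);           // 337
        System.out.println("header bytes    = " + headerBytes(tpp)); // 43
    }
}
```

For a 4096-byte page of 12-byte tuples, floor(32768 / 97) = 337 tuples, needing ceiling(337/8) = 43 header bytes.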

382 | 
383 | The low (least significant) bits of each byte represent the status of the slots that are earlier in the file. Hence, the lowest bit of the first byte represents whether or not the first slot in the page is in use. The second lowest bit of the first byte represents whether or not the second slot in the page is in use, and so on. Also, note that the high-order bits of the last byte may not correspond to a slot that is actually in the file, since the number of slots may not be a multiple of 8. Also note that all Java virtual machines are [big-endian](http://en.wikipedia.org/wiki/Endianness).
384 | 
385 | 
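To make the bit layout concrete, here is a self-contained sketch of the slot test. The method name mirrors the `isSlotUsed` method you will write in HeapPage, but this standalone class is an illustration, not the SimpleDB API:

```java
// Checks the header bitmap convention described above: the low bit of
// header[0] is slot 0, the next bit is slot 1, and so on.
public class SlotBitmap {
    static boolean isSlotUsed(byte[] header, int i) {
        int b = header[i / 8] & 0xFF;      // mask off Java's sign extension
        return ((b >> (i % 8)) & 1) == 1;  // test the bit for slot i
    }

    public static void main(String[] args) {
        // 0x05 = binary 0000_0101: slots 0 and 2 used, slot 1 free.
        byte[] header = new byte[] { 0x05 };
        System.out.println(isSlotUsed(header, 0)); // true
        System.out.println(isSlotUsed(header, 1)); // false
        System.out.println(isSlotUsed(header, 2)); // true
    }
}
```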

386 | 387 | ### Exercise 4 388 | 389 | 390 | 391 | **Implement the skeleton methods in:** 392 | *** 393 | * src/simpledb/HeapPageId.java 394 | * src/simpledb/RecordID.java 395 | * src/simpledb/HeapPage.java 396 | 397 | *** 398 | 399 | 400 | Although you will not use them directly in Lab 1, we ask you to implement getNumEmptySlots() and isSlotUsed() in HeapPage. These require pushing around bits in the page header. You may find it helpful to look at the other methods that have been provided in HeapPage or in src/simpledb/HeapFileEncoder.java to understand the layout of pages. 401 | 402 | 403 | You will also need to implement an Iterator over the tuples in the page, which may involve an auxiliary class or data structure. 404 | 405 | At this point, your code should pass the unit tests in HeapPageIdTest, RecordIDTest, and HeapPageReadTest. 406 | 407 | 408 |

409 | 410 | After you have implemented HeapPage, you will write methods for HeapFile in this lab to calculate the number of pages in a file and to read a page from the file. You will then be able to fetch tuples from a file stored on disk. 411 | 412 | ### Exercise 5 413 | 414 | **Implement the skeleton methods in:** 415 | 416 | *** 417 | * src/simpledb/HeapFile.java 418 | 419 | *** 420 | 421 | To read a page from disk, you will first need to calculate the correct offset in the file. Hint: you will need random access to the file in order to read and write pages at arbitrary offsets. You should not call BufferPool methods when reading a page from disk. 422 | 423 |
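The offset arithmetic the hint refers to can be sketched with `java.io.RandomAccessFile`. The 4096-byte page size and the helper names below are assumptions for illustration, not the SimpleDB API:

```java
import java.io.*;

public class PageReader {
    static final int PAGE_SIZE = 4096; // example page size, not prescribed here

    // Page i lives at byte offset i * PAGE_SIZE in the file.
    static long pageOffset(int pgNo) {
        return (long) pgNo * PAGE_SIZE;
    }

    // Seek to the page's offset and read exactly PAGE_SIZE bytes.
    static byte[] readPage(File f, int pgNo) throws IOException {
        byte[] data = new byte[PAGE_SIZE];
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(pageOffset(pgNo));
            raf.readFully(data);
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Demo: a scratch file holding two zero-filled "pages".
        File f = File.createTempFile("heap", ".dat");
        f.deleteOnExit();
        byte[] bytes = new byte[2 * PAGE_SIZE];
        bytes[PAGE_SIZE] = 42; // first byte of page 1
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(bytes);
        }
        System.out.println(readPage(f, 1)[0]); // prints 42
    }
}
```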

424 | You will also need to implement the `HeapFile.iterator()` method, which should iterate through the tuples of each page in the HeapFile. The iterator must use the `BufferPool.getPage()` method to access pages in the `HeapFile`. This method loads the page into the buffer pool and will eventually be used (in a later lab) to implement locking-based concurrency control and recovery. Do not load the entire table into memory on the open() call -- this will cause an out of memory error for very large tables.
425 | 
426 | 
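This iterator is an instance of a generic pattern: flatten a sequence of per-page iterators into one tuple stream, fetching each page lazily only when the previous one is exhausted. Here is a self-contained sketch of the pattern; it is not the SimpleDB `DbFile.iterator()` interface:

```java
import java.util.*;

// Lazily flattens an iterator of "page" iterators, so that only one
// page's iterator is live at a time -- the shape of HeapFile.iterator().
public class FlattenIterator<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> pages;
    private Iterator<T> current = Collections.emptyIterator();

    public FlattenIterator(Iterator<Iterator<T>> pages) {
        this.pages = pages;
    }

    public boolean hasNext() {
        // Advance to the next non-empty page only when needed.
        while (!current.hasNext() && pages.hasNext())
            current = pages.next();
        return current.hasNext();
    }

    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> pages = Arrays.asList(
            Arrays.asList(1, 2).iterator(),
            Collections.<Integer>emptyIterator(),  // an empty "page"
            Arrays.asList(3).iterator());
        Iterator<Integer> it = new FlattenIterator<>(pages.iterator());
        while (it.hasNext()) System.out.print(it.next() + " "); // 1 2 3
    }
}
```

In the real HeapFile, "fetching the next page" would be a `BufferPool.getPage()` call instead of an in-memory list access.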

427 | 428 | At this point, your code should pass the unit tests in HeapFileReadTest. 429 | 430 | 431 | ### 2.6. Operators 432 | 433 | Operators are responsible for the actual execution of the query plan. They implement the operations of the relational algebra. In SimpleDB, operators are iterator based; each operator implements the `DbIterator` interface. 434 | 435 |

436 | 437 | Operators are connected together into a plan by passing lower-level operators into the constructors of higher-level operators, i.e., by 'chaining them together.' Special access method operators at the leaves of the plan are responsible for reading data from the disk (and hence do not have any operators below them). 438 | 439 |

440 | 441 | At the top of the plan, the program interacting with SimpleDB simply calls `getNext` on the root operator; this operator then calls `getNext` on its children, and so on, until these leaf operators are called. They fetch tuples from disk and pass them up the tree (as return arguments to `getNext`); tuples propagate up the plan in this way until they are output at the root or combined or rejected by another operator in the plan. 442 | 443 |
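This pull-based control flow can be sketched in a few lines. The `Op` interface below is a simplified stand-in for `DbIterator`, with tuples reduced to ints; it is an illustration only, not the SimpleDB API:

```java
import java.util.*;

// Pull-based (iterator-model) execution: the parent calls next() on its
// child, which calls next() on *its* child, down to the leaf scan.
interface Op { boolean hasNext(); int next(); }

// Leaf "access method": produces tuples from an in-memory array.
class Scan implements Op {
    private final int[] data; private int pos = 0;
    Scan(int[] data) { this.data = data; }
    public boolean hasNext() { return pos < data.length; }
    public int next() { return data[pos++]; }
}

// Parent operator: pulls from its child and passes some tuples through.
class Filter implements Op {
    private final Op child; private Integer buffered;
    Filter(Op child) { this.child = child; }
    public boolean hasNext() {
        while (buffered == null && child.hasNext()) {
            int t = child.next();
            if (t % 2 == 0) buffered = t;  // keep only even "tuples"
        }
        return buffered != null;
    }
    public int next() {
        if (!hasNext()) throw new NoSuchElementException();
        int t = buffered; buffered = null; return t;
    }
}

public class PlanDemo {
    public static void main(String[] args) {
        Op root = new Filter(new Scan(new int[]{1, 2, 3, 4}));
        while (root.hasNext()) System.out.println(root.next()); // 2, then 4
    }
}
```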

444 | 445 | 454 | 455 | For this lab, you will only need to implement one SimpleDB operator. 456 | 457 | ### Exercise 6. 458 | 459 | **Implement the skeleton methods in:** 460 | 461 | *** 462 | * src/simpledb/SeqScan.java 463 | 464 | *** 465 | This operator sequentially scans all of the tuples from the pages of the table specified by the `tableid` in the constructor. This operator should access tuples through the `DbFile.iterator()` method. 466 | 467 |

At this point, you should be able to complete the ScanTest system test. Good work! 468 | 469 | You will fill in other operators in subsequent labs. 470 | 471 | 472 | 473 | ### 2.7. A simple query 474 | 475 | The purpose of this section is to illustrate how these various components are connected together to process a simple query. 476 | 477 | Suppose you have a data file, "some_data_file.txt", with the following contents: 478 | ``` 479 | 1,1,1 480 | 2,2,2 481 | 3,4,4 482 | ``` 483 |

484 | You can convert this into a binary file that SimpleDB can query as follows: 485 |

486 | 
```
$ java -jar dist/simpledb.jar convert some_data_file.txt 3
```
487 | 

488 | Here, the argument "3" tells convert that the input has 3 columns.
489 | 

490 | The following code implements a simple selection query over this file. This code is equivalent to the SQL statement `SELECT * FROM some_data_file`. 491 | 492 | ``` 493 | package simpledb; 494 | import java.io.*; 495 | 496 | public class test { 497 | 498 | public static void main(String[] argv) { 499 | 500 | // construct a 3-column table schema 501 | Type types[] = new Type[]{ Type.INT_TYPE, Type.INT_TYPE, Type.INT_TYPE }; 502 | String names[] = new String[]{ "field0", "field1", "field2" }; 503 | TupleDesc descriptor = new TupleDesc(types, names); 504 | 505 | // create the table, associate it with some_data_file.dat 506 | // and tell the catalog about the schema of this table. 507 | HeapFile table1 = new HeapFile(new File("some_data_file.dat"), descriptor); 508 | Database.getCatalog().addTable(table1, "test"); 509 | 510 | // construct the query: we use a simple SeqScan, which spoonfeeds 511 | // tuples via its iterator. 512 | TransactionId tid = new TransactionId(); 513 | SeqScan f = new SeqScan(tid, table1.getId()); 514 | 515 | try { 516 | // and run it 517 | f.open(); 518 | while (f.hasNext()) { 519 | Tuple tup = f.next(); 520 | System.out.println(tup); 521 | } 522 | f.close(); 523 | Database.getBufferPool().transactionComplete(tid); 524 | } catch (Exception e) { 525 | System.out.println ("Exception : " + e); 526 | } 527 | } 528 | 529 | } 530 | ``` 531 | 532 | The table we create has three integer fields. To express this, we create a `TupleDesc` object and pass it an array of `Type` objects, and optionally an array of `String` field names. Once we have created this `TupleDesc`, we initialize a `HeapFile` object representing the table stored in `some_data_file.dat`. Once we have created the table, we add it to the catalog. If this were a database server that was already running, we would have this catalog information loaded. We need to load it explicitly to make this code self-contained. 
533 | 534 | Once we have finished initializing the database system, we create a query plan. Our plan consists only of the `SeqScan` operator that scans the tuples from disk. In general, these operators are instantiated with references to the appropriate table (in the case of `SeqScan`) or child operator (in the case of, e.g., `Filter`). The test program then repeatedly calls `hasNext` and `next` on the `SeqScan` operator. As tuples are output from the `SeqScan`, they are printed out on the command line. 535 | 536 | We **strongly recommend** you try this out as a fun end-to-end test that will help you get experience writing your own test programs for SimpleDB. You should create the file `test.java` in the `src/simpledb` directory with the code above, and place the `some_data_file.dat` file in the top-level directory. Then run: 537 | 538 | ``` 539 | ant 540 | java -classpath dist/simpledb.jar simpledb.test 541 | ``` 542 | 543 | Note that `ant` compiles `test.java` and generates a new jarfile that contains it. 544 | 545 | ## 3. Logistics 546 | 547 | You must submit your code (see below) as well as a short (2 pages, maximum) writeup describing your approach. This writeup should: 548 | 549 | * Describe any design decisions you made. These may be minimal for Lab 1. 550 | * Discuss and justify any changes you made to the API. 551 | * Describe any missing or incomplete elements of your code. 552 | * Describe how long you spent on the lab, and whether there was anything you found particularly difficult or confusing. 553 | 554 | ### 3.1. Collaboration 555 | 556 | This lab should be manageable for a single person, but if you prefer to work with a partner, this is also OK. Larger groups are not allowed. Please indicate clearly who you worked with, if anyone, on your individual writeup. 557 | 558 | ### 3.2. 
Submitting your assignment 559 | 564 | 565 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (11:59 PM on the due date). Place the write-up in a file called lab1-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 566 | 567 | You also need to explicitly add any other files you create, such as new *.java files. 568 | 569 | The criterion for your lab being submitted on time is that your code must be **tagged** and **pushed** by the due date and time. This means that if one of the TAs or the instructor were to open up GitHub, they would be able to see your solutions on the GitHub web page. 570 | 571 | Just because your code has been committed on your local machine does not mean that it has been **submitted**; it needs to be on GitHub. 572 | 573 | There is a bash script `turnInLab1.sh` in the root-level directory of simple-db-hw that commits your changes, deletes any prior tag for the current lab, tags the current commit, and pushes the branch and tag to GitHub. If you are using Linux or Mac OS X, you should be able to run the following: 574 | 575 | ```bash 576 | $ ./turnInLab1.sh 577 | ``` 578 | 579 | You should see something like the following output: 580 | 581 | ```bash 582 | $ ./turnInLab1.sh 583 | error: tag 'lab1submit' not found. 584 | remote: warning: Deleting a non-existent ref. 585 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 586 | - [deleted] lab1submit 587 | [master 7a26701] Lab 1 588 | 1 file changed, 0 insertions(+), 0 deletions(-) 589 | create mode 100644 aaa 590 | Counting objects: 3, done. 591 | Delta compression using up to 4 threads. 592 | Compressing objects: 100% (3/3), done. 593 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 594 | Total 3 (delta 1), reused 0 (delta 0) 595 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 
596 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 597 | 069856c..7a26701 master -> master 598 | * [new tag] lab1submit -> lab1submit 599 | ``` 600 | 601 | 602 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 1 as follows: 603 | 604 | 1. Look at your current repository status. 605 | 606 | ```bash 607 | $ git status 608 | ``` 609 | 610 | 2. Add and commit your code changes (if they aren't already added and committed). 611 | 612 | ```bash 613 | $ git commit -a -m 'Lab 1' 614 | ``` 615 | 616 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*). 617 | 618 | ```bash 619 | $ git tag -d lab1submit 620 | $ git push origin :refs/tags/lab1submit 621 | ``` 622 | 623 | 4. Tag your last commit as the lab to be graded. 624 | ```bash 625 | $ git tag -a lab1submit -m 'submit lab 1' 626 | ``` 627 | 628 | 5. This is the most important part: **push** your solutions to GitHub. 629 | 630 | ```bash 631 | $ git push origin master --tags 632 | ``` 633 | 634 | 6. The last thing that we strongly recommend you do is to go to the 635 | [MIT-DB-Class] organization page on GitHub to 636 | make sure that we can see your solutions. 637 | 638 | Just navigate to your repository and check that your latest commits are on 639 | GitHub. You should also be able to check 640 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab1submit` 641 | 642 | 643 | #### Word of Caution 644 | 645 | Git is a distributed version control system. This means everything operates offline until you run `git pull` or `git push`. This is a great feature. 646 | 647 | The bad thing is that you may forget to `git push` your changes. This is why we strongly, **strongly** suggest that you check GitHub to be sure that what you want us to see matches up with what you expect. 648 | 649 | 650 | 651 | 652 | ### 3.3. 
Submitting a bug 653 | 654 | Please submit (friendly!) bug reports to [6.830-staff@mit.edu](mailto:6.830-staff@mit.edu). When you do, please try to include: 655 | 656 | 657 | * A description of the bug. 658 | * A .java file we can drop in the test/simpledb directory, compile, and run. 659 | * A .txt file with the data that reproduces the bug. We should be able to convert it to a .dat file using HeapFileEncoder. 660 | 661 | If you are the first person to report a particular bug in the code, we will give you a candy bar! 662 | 663 | 664 | 665 | 666 | 667 | ### 3.4 Grading 668 | 669 |

75% of your grade will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both `ant test` and `ant systemtest`. 670 | 671 | **Important:** before testing, we will replace your `build.xml` and the entire contents of the `test` directory with our version of these files. This means you cannot change the format of .dat files! You should also be careful when changing our APIs. You should test that your code compiles against the unmodified tests. 672 | 673 | In other words, we will pull your repo, replace the files mentioned above, compile it, and then grade it. It will look roughly like this: 674 | 675 | ``` 676 | [replace build.xml and test] 677 | $ git checkout -- build.xml test/ 678 | $ ant test 679 | $ ant systemtest 680 | [additional tests] 681 | ``` 682 | 683 |

If any of these commands fail, we'll be unhappy, and, therefore, so will your grade. 684 | 685 | An additional 25% of your grade will be based on the quality of your writeup and our subjective evaluation of your code. 686 | 687 | We've had a lot of fun designing this assignment, and we hope you enjoy hacking on it! 688 | --------------------------------------------------------------------------------