├── lab3-hist.png ├── controlflow.png ├── README.md ├── lab4.md ├── lab2.md ├── lab3.md └── lab1.md /lab3-hist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MIT-DB-Class/course-info-2018/HEAD/lab3-hist.png -------------------------------------------------------------------------------- /controlflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MIT-DB-Class/course-info-2018/HEAD/controlflow.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | course-info 2 | =========== 3 | 4 | GitHub Repo for http://db.csail.mit.edu/6.830/ 5 | 6 | We will be using git, a source code control tool, for labs in 6.830. This will 7 | allow you to download the code for the labs, and also submit the labs in a 8 | standardized format that will streamline grading. 9 | 10 | You will also be able to use git to commit your progress on the labs as you go. 11 | 12 | Your course git repository will be hosted on GitHub. GitHub is a 13 | website that hosts git servers for thousands of open source projects. In 14 | our case, your code will be in a private repository that is visible only to you 15 | and course staff. 16 | 17 | This document describes what you need to do to get started with git, and how to 18 | download and upload 6.830/6.814 labs via GitHub. 
19 | 20 | ## Contents 21 | 22 | - [Learning Git](#learning-git) 23 | - [Setting up GitHub](#setting-up-github) 24 | - [Installing Git](#installing-git) 25 | - [Setting up Git](#setting-up-git) 26 | - [Getting Newly Released Labs](#getting-newly-released-labs) 27 | - [Submitting Your Labs](#submitting-your-labs) 28 | - [Word of Caution](#word-of-caution) 29 | - [Help!](#help) 30 | 31 | 32 | ## Learning Git 33 | 34 | Numerous guides on using Git are available. They range from being 35 | interactive to just text-based. Find one that works and experiment; making 36 | mistakes and fixing them is a great way to learn. Here is a link to resources 37 | that GitHub suggests: 38 | [https://help.github.com/articles/what-are-other-good-resources-for-learning-git-and-github][resources]. 39 | 40 | If you have no experience with git, you may find the following web-based 41 | tutorial helpful: [Try Git](https://try.github.io/levels/1/challenges/1). 42 | 43 | ## Setting Up GitHub 44 | 45 | Now that you have a basic understanding of Git, it's time to get started with GitHub. 46 | 47 | 0. Install git. (See below for suggestions). 48 | 49 | 1. If you don't already have an account, sign up for one here: [https://github.com/join][join]. 50 | 51 | If you filled out [this form](https://goo.gl/forms/FZPsfP5DQTTzffdC3), 52 | then you should now have a repository set up just for your lab solutions. This 53 | should be called `homework-solns-2018-` and located in the 54 | MIT-DB-Class organization. 55 | 56 | This is what you'll set up in the next section to allow you to write your 57 | lab answers and submit them. 58 | 59 | ### Installing git 60 | 61 | The instructions are tested on bash/linux environments. Installing git should be 62 | a simple `apt-get / yum / etc install`. 63 | 64 | Instructions for installing git on Linux, OSX, or Windows can be found at 65 | [GitBook: 66 | Installing](http://git-scm.com/book/en/Getting-Started-Installing-Git). 
67 | 68 | If you are using Eclipse, many versions come with git configured. The 69 | instructions will be slightly different than the command line instructions 70 | listed but will work for any OS. Detailed instructions can be found at [EGit 71 | User Guide](http://wiki.eclipse.org/EGit/User_Guide) or [EGit 72 | Tutorial](http://eclipsesource.com/blogs/tutorials/egit-tutorial). 73 | 74 | 75 | ## Setting Up Git 76 | 77 | You should have Git installed from the previous section. 78 | 79 | 1. The first thing we have to do is to clone the current lab repository by issuing the following commands on the command line: 80 | 81 | ```bash 82 | $ git clone git@github.com:MIT-DB-Class/simple-db-hw.git 83 | ``` 84 | 85 | If you get an error doing clone, most likely the cause is that you just 86 | haven't finished setting up your GitHub account. You just need to [setup an SSH 87 | key][ssh-key] to allow pushing and pulling over SSH. 88 | 89 | This will make a complete replica of the lab repository locally. Now we are 90 | going to change it to point to your personal repository that was created for you 91 | in the previous section. 92 | 93 | Change your working path to your newly cloned repository: 94 | 95 | ```bash 96 | $ cd simple-db-hw/ 97 | ``` 98 | 99 | 2. By default the remote called `origin` is set to the location that you cloned the repository from. You should see the following: 100 | 101 | ```bash 102 | $ git remote -v 103 | origin git@github.com:MIT-DB-Class/simple-db-hw.git (fetch) 104 | origin git@github.com:MIT-DB-Class/simple-db-hw.git (push) 105 | ``` 106 | 107 | We don't want that remote to be the origin. Instead, we want to change it to point to your repository. 
To do that, issue the following command: 108 | 109 | ```bash 110 | $ git remote rename origin upstream 111 | ``` 112 | 113 | And now you should see the following: 114 | 115 | ```bash 116 | $ git remote -v 117 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (fetch) 118 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (push) 119 | ``` 120 | 121 | 3. Lastly, we need to give your repository a new `origin`, since it is lacking one. Issue the following command, substituting your athena username: 122 | 123 | ```bash 124 | $ git remote add origin git@github.com:MIT-DB-Class/homework-solns-2018-.git 125 | ``` 126 | 127 | If you have an error that looks like the following: 128 | 129 | ``` 130 | Could not rename config section 'remote.[old name]' to 'remote.[new name]' 131 | ``` 132 | 133 | Or this error: 134 | 135 | ``` 136 | fatal: remote origin already exists. 137 | ``` 138 | 139 | This appears to happen to some users, depending on the version of Git they are using. To fix it, just issue the following command: 140 | 141 | ```bash 142 | $ git remote set-url origin git@github.com:MIT-DB-Class/homework-solns-2018-.git 143 | ``` 144 | 145 | This solution was found on [StackOverflow](http://stackoverflow.com/a/2432799) thanks to [Cassidy Williams](https://github.com/cassidoo). 146 | 147 | For reference, your final `git remote -v` should look like the following when it's set up correctly: 148 | 149 | 150 | ```bash 151 | $ git remote -v 152 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (fetch) 153 | upstream git@github.com:MIT-DB-Class/simple-db-hw.git (push) 154 | origin git@github.com:MIT-DB-Class/homework-solns-2018-.git (fetch) 155 | origin git@github.com:MIT-DB-Class/homework-solns-2018-.git (push) 156 | ``` 157 | 158 | 4. 
Let's test it out by pushing your master branch to GitHub: 159 | 160 | ```bash 161 | $ git push -u origin master 162 | ``` 163 | 164 | You should see something like the following: 165 | 166 | ``` 167 | Counting objects: 59, done. 168 | Delta compression using up to 4 threads. 169 | Compressing objects: 100% (53/53), done. 170 | Writing objects: 100% (59/59), 420.46 KiB | 0 bytes/s, done. 171 | Total 59 (delta 2), reused 59 (delta 2) 172 | remote: Resolving deltas: 100% (2/2), done. 173 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 174 | * [new branch] master -> master 175 | Branch master set up to track remote branch master from origin. 176 | ``` 177 | 178 | If you get an error when pushing, most likely the cause is that you just haven't finished setting up your GitHub account. You just need to [setup an SSH key][ssh-key] to allow pushing and pulling over SSH. 179 | 180 | 5. That last command was a bit special and only needs to be run the first time to set up the remote tracking branch. Now we should be able to just run `git push` without the arguments. Try it and you should get the following: 181 | 182 | ```bash 183 | $ git push 184 | Everything up-to-date 185 | ``` 186 | 187 | If you don't know Git that well, this probably seemed very arcane. Just keep 188 | using Git and you'll understand more and more. You aren't required to use 189 | commands like commit and push as you develop your labs, but will find them 190 | useful for debugging. We'll provide explicit instructions on how to use these 191 | commands to actually upload your final lab solution. 192 | 193 | ## Getting Newly Released Labs 194 | 195 | (You don't need to follow these instructions until Lab 1.) 196 | 197 | Pulling in newly released labs or previous lab solutions should be easy as long as you set up your repository based on the instructions in the last section. 198 | 199 | 1. 
All new labs and previous lab solutions will be posted to the [labs](https://github.com/MIT-DB-Class/simple-db-hw) repository in the class organization. 200 | 201 | Check it, as well as Piazza's announcements, periodically for updates on when new labs are released. 202 | 203 | 2. Once a lab is released, pull in the changes from within your simple-db-hw directory: 204 | 205 | ```bash 206 | $ git pull upstream master 207 | ``` 208 | 209 | **OR** if you wish to be more explicit, you can `fetch` first and then `merge`: 210 | 211 | ```bash 212 | $ git fetch upstream 213 | $ git merge upstream/master 214 | ``` 215 | Now push your master branch: 216 | ```bash 217 | $ git push origin master 218 | ``` 219 | 220 | 3. If you've followed the instructions in each lab, you should have no merge conflicts and everything should be peachy. 221 | 222 | ## Submitting Your Labs 223 | 224 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab#-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 225 | 226 | You need to explicitly add any other files you create, such as new *.java files. 227 | 228 | The criterion for your lab being submitted on time is that your code must be 229 | **tagged** and **pushed** by the date and time. This means that if one of the 230 | TAs or the instructor were to open up GitHub, they would be able to see your 231 | solutions on the GitHub web page. 232 | 233 | Just because your code has been committed on your local machine does not mean 234 | that it has been **submitted**; it needs to be on GitHub. 235 | 236 | There is a bash script `turnInLab1.sh` in the root level directory of 237 | simple-db-hw that commits your changes, deletes any prior tag for the current 238 | lab, tags the current commit, and pushes the tag to GitHub. 
If you are using 239 | Linux or Mac OSX, you should be able to run the following: 240 | 241 | ```bash 242 | $ ./turnInLab1.sh 243 | ``` 244 | 245 | You should see something like the following output: 246 | 247 | ```bash 248 | $ ./turnInLab1.sh 249 | error: tag 'lab1submit' not found. 250 | remote: warning: Deleting a non-existent ref. 251 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 252 | - [deleted] lab1submit 253 | [master 7a26701] Lab 1 254 | 1 file changed, 0 insertions(+), 0 deletions(-) 255 | create mode 100644 aaa 256 | Counting objects: 3, done. 257 | Delta compression using up to 4 threads. 258 | Compressing objects: 100% (3/3), done. 259 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 260 | Total 3 (delta 1), reused 0 (delta 0) 261 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 262 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 263 | 069856c..7a26701 master -> master 264 | * [new tag] lab1submit -> lab1submit 265 | ``` 266 | 267 | 268 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 1 as follows (*replace lab 1 with the correct lab ID for later labs*): 269 | 270 | 1. Look at your current repository status. 271 | 272 | ```bash 273 | $ git status 274 | ``` 275 | 276 | 2. Add and commit your code changes (if they aren't already added and committed). 277 | 278 | ```bash 279 | $ git commit -a -m 'Lab 1' 280 | ``` 281 | 282 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 283 | 284 | ```bash 285 | $ git tag -d lab1submit 286 | $ git push origin :refs/tags/lab1submit 287 | ``` 288 | 289 | 4. Tag your last commit as the lab to be graded (*again, update the lab ID for later labs*) 290 | ```bash 291 | $ git tag -a lab1submit -m 'submit lab 1' 292 | ``` 293 | 294 | 5. This is the most important part: **push** your solutions to GitHub. 
295 | 296 | ```bash 297 | $ git push origin master --tags 298 | ``` 299 | 300 | 6. The last thing that we strongly recommend you do is to go to the [MIT-DB-Class] organization page on GitHub to make sure that we can see your solutions. 301 | 302 | Just navigate to your repository and check that your latest commits are on GitHub. You should also be able to check 303 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab1submit` 304 | 305 | 306 | ## Word of Caution 307 | 308 | Git is a distributed version control system. This means everything operates offline until you run `git pull` or `git push`. This is a great feature. 309 | 310 | The bad thing is that you may forget to `git push` your changes. This is why we **strongly** suggest that you check GitHub to be sure that what you want us to see matches up with what you expect. 311 | 312 | ## Help! 313 | 314 | If at any point you need help with setting all this up, feel free to reach out to one of the TAs or the instructor. Their contact information can be found on the [course homepage](http://db.csail.mit.edu/6.830/). 315 | 316 | [join]: https://github.com/join 317 | [resources]: https://help.github.com/articles/what-are-other-good-resources-for-learning-git-and-github 318 | [ssh-key]: https://help.github.com/articles/generating-ssh-keys 319 | 320 | -------------------------------------------------------------------------------- /lab4.md: -------------------------------------------------------------------------------- 1 | # 6.830 Lab 4: SimpleDB Transactions 2 | 3 | **Assigned: Friday, October 26, 2018**
4 | **Due: Friday, November 9, 2018 11:59 PM EDT** 5 | 6 | 7 | In this lab, you will implement a simple locking-based 8 | transaction system in SimpleDB. You will need to add lock and 9 | unlock calls at the appropriate places in your code, as well as 10 | code to track the locks held by each transaction and grant 11 | locks to transactions as they are needed. 12 | 13 | 14 | The remainder of this document describes what is involved in 15 | adding transaction support and provides a basic outline of how 16 | you might add this support to your database. 17 | 18 | 19 | 20 | As with the previous lab, we recommend that you start as early as possible. 21 | Locking and transactions can be quite tricky to debug! 22 | 23 | ## 1. Getting started 24 | 25 | You should begin with the code you submitted for Lab 3 (if you did not 26 | submit code for Lab 3, or your solution didn't work properly, contact us to 27 | discuss options). Additionally, we are providing extra test cases 28 | for this lab that are not in the original code distribution you received. We reiterate 29 | that the unit tests we provide are to help guide your implementation along, 30 | but they are not intended to be comprehensive or to establish correctness. 31 | 32 | 33 | You will need to add these new files to your release. The easiest way 34 | to do this is to change to your project directory (probably called simple-db-hw) 35 | and pull from the master GitHub repository: 36 | 37 | ``` 38 | $ cd simple-db-hw 39 | $ git pull upstream master 40 | ``` 41 | 42 | 43 | ## 2. Transactions, Locking, and Concurrency Control 44 | 45 | Before starting, 46 | you should make sure you understand what a transaction is and how 47 | rigorous two-phase locking (which you will use to ensure isolation and 48 | atomicity of your transactions) works. 49 | 50 | In the remainder of this section, we briefly overview these concepts 51 | and discuss how they relate to SimpleDB. 52 | 53 | ### 2.1. 
Transactions 54 | 55 | A transaction is a group of database actions (e.g., inserts, deletes, 56 | and reads) that are executed *atomically*; that is, either all of 57 | the actions complete or none of them do, and it is not apparent to an 58 | outside observer of the database that these actions were not completed 59 | as a part of a single, indivisible action. 60 | 61 | ### 2.2. The ACID Properties 62 | 63 | To help you understand 64 | how transaction management works in SimpleDB, we briefly review how 65 | it ensures that the ACID properties are satisfied: 66 | 67 | * **Atomicity**: Rigorous two-phase locking and careful buffer management 68 | ensure atomicity. 69 | * **Consistency**: The database is transaction consistent by virtue of 70 | atomicity. Other consistency issues (e.g., key constraints) are 71 | not addressed in SimpleDB. 72 | * **Isolation**: Rigorous two-phase locking provides isolation. 73 | * **Durability**: A FORCE buffer management policy ensures 74 | durability (see Section 2.3 below). 75 | 76 | 77 | ### 2.3. Recovery and Buffer Management 78 | 79 | To simplify your job, we recommend that you implement a NO STEAL/FORCE 80 | buffer management policy. 81 | 82 | As we discussed in class, this means that: 83 | 84 | * You shouldn't evict dirty (updated) pages from the buffer pool if they 85 | are locked by an uncommitted transaction (this is NO STEAL). 86 | * On transaction commit, you should force dirty pages to disk (e.g., 87 | write the pages out) (this is FORCE). 88 | 89 | 90 | To further simplify your life, you may assume that SimpleDB will not crash 91 | while processing a `transactionComplete` command. Note that 92 | these three points mean that you do not need to implement log-based 93 | recovery in this lab, since you will never need to undo any work (you never evict 94 | dirty pages) and you will never need to redo any work (you force 95 | updates on commit and will not crash during commit processing). 96 | 97 | ### 2.4. 
Granting Locks 98 | 99 | You will need to add calls to SimpleDB (in `BufferPool`, 100 | for example) that allow a caller to request or release a (shared or 101 | exclusive) lock on a specific object on behalf of a specific 102 | transaction. 103 | 104 | 105 | 106 | We recommend locking at *page* granularity, though you should be able 107 | to implement locking at *tuple* granularity if you wish (please do not 108 | implement table-level locking). The rest of this document and our unit 109 | tests assume page-level locking. 110 | 111 | 112 | You will need to create data structures that keep track of which locks 113 | each transaction holds and that check to see if a lock should be granted 114 | to a transaction when it is requested. 115 | 116 | You will need to implement shared and exclusive locks; recall that these 117 | work as follows: 118 | 119 | * Before a transaction can read an object, it must have a shared lock on it. 120 | * Before a transaction can write an object, it must have an exclusive lock on it. 121 | * Multiple transactions can have a shared lock on an object. 122 | * Only one transaction may have an exclusive lock on an object. 123 | * If transaction *t* is the only transaction holding a shared lock on 124 | an object *o*, *t* may *upgrade* 125 | its lock on *o* to an exclusive lock. 126 | 127 | 128 | 129 | 130 | If a transaction requests a lock that it should not be granted, your code 131 | should *block*, waiting for that lock to become available (i.e., be 132 | released by another transaction running in a different thread). 133 | 134 | 135 | 136 | You need to be especially careful to avoid race conditions when 137 | writing the code that acquires locks -- think about how you will 138 | ensure that correct behavior results if two threads request the same 139 | lock at the same time (you may wish to read about 141 | Synchronization in Java). 
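To make the grant rules above concrete, here is a minimal sketch of the grant-decision logic. All class, field, and method names here are invented for illustration; the lab leaves the actual design, including how to block waiters and release locks, up to you.

```java
import java.util.*;

// Sketch of the shared/exclusive grant rules (names invented; a real
// lock manager would block waiters instead of returning false, and
// would also support releasing locks).
class LockTable {
    // pageId -> transactions currently holding a lock on that page
    private final Map<Integer, Set<Integer>> holders = new HashMap<>();
    // pageId -> true if the lock currently held on that page is exclusive
    private final Map<Integer, Boolean> exclusive = new HashMap<>();

    // Decide whether tid's request could be granted right now.
    synchronized boolean canGrant(int tid, int pageId, boolean wantExclusive) {
        Set<Integer> hs = holders.getOrDefault(pageId, Collections.emptySet());
        if (hs.isEmpty()) return true;                        // unlocked page
        if (hs.size() == 1 && hs.contains(tid)) return true;  // sole holder: re-grant or upgrade
        if (exclusive.getOrDefault(pageId, false)) return false; // held exclusively by another tx
        return !wantExclusive;                                // shared is compatible with shared
    }

    // Record a granted lock (callers would check canGrant first).
    synchronized void grant(int tid, int pageId, boolean wantExclusive) {
        holders.computeIfAbsent(pageId, k -> new HashSet<>()).add(tid);
        exclusive.put(pageId, wantExclusive);
    }
}
```

Note that the methods are `synchronized` so that two threads requesting the same lock at the same time serialize through the table, which is one way to avoid the race conditions mentioned above.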
142 | 143 | *** 144 | 145 | **Exercise 1.** 146 | 147 | Write the methods that acquire and release locks in BufferPool. Assuming 148 | you are using page-level locking, you will need to complete the following: 149 | 150 | * Modify getPage() to block and acquire the desired lock 151 | before returning a page. 152 | * Implement releasePage(). This method is primarily used 153 | for testing, and at the end of transactions. 154 | * Implement holdsLock() so that logic in Exercise 2 can 155 | determine whether a page is already locked by a transaction. 156 | 157 | 158 | 159 | You may find it helpful to define a class that is responsible for 160 | maintaining state about transactions and locks, but the design decision is up to 161 | you. 162 | 163 | 164 | You may need to implement the next exercise before your code passes 165 | the unit tests in LockingTest. 166 | 167 | *** 168 | 169 | 170 | ### 2.5. Lock Lifetime 171 | 172 | You will need to implement rigorous two-phase locking. This means that 173 | transactions should acquire the appropriate type of lock on any object 174 | before accessing that object and shouldn't release any locks until after 175 | the transaction commits. 176 | 177 | 178 | 179 | Fortunately, the SimpleDB design is such that it is possible to obtain locks on 180 | pages in `BufferPool.getPage()` before you read or modify them. 181 | So, rather than adding calls to locking routines in each of your operators, 182 | we recommend acquiring locks in `getPage()`. Depending on your 183 | implementation, it is possible that you may not have to acquire a lock 184 | anywhere else. It is up to you to verify this! 185 | 186 | 187 | 188 | You will need to acquire a *shared* lock on any page (or tuple) 189 | before you read it, and you will need to acquire an *exclusive* 190 | lock on any page (or tuple) before you write it. 
You will notice that 191 | we are already passing around `Permissions` objects in the 192 | BufferPool; these objects indicate the type of lock that the caller 193 | would like to have on the object being accessed (we have given you the 194 | code for the `Permissions` class.) 195 | 196 | Note that your implementation of `HeapFile.insertTuple()` 197 | and `HeapFile.deleteTuple()`, as well as the implementation 198 | of the iterator returned by `HeapFile.iterator()` should 199 | access pages using `BufferPool.getPage()`. Double check 200 | that these different uses of `getPage()` pass the 201 | correct permissions object (e.g., `Permissions.READ_WRITE` 202 | or `Permissions.READ_ONLY`). You may also wish to double 203 | check that your implementation of 204 | `BufferPool.insertTuple()` and 205 | `BufferPool.deleteTuple()` call `markDirty()` on 206 | any of the pages they access (you should have done this when you 207 | implemented this code in lab 2, but we did not test for this case.) 208 | 209 | 210 | 211 | After you have acquired locks, you will need to think about when to 212 | release them as well. It is clear that you should release all locks 213 | associated with a transaction after it has committed or aborted to ensure rigorous 2PL. 214 | However, it is 215 | possible for there to be other scenarios in which releasing a lock before 216 | a transaction ends might be useful. For instance, you may release a shared lock 217 | on a page after scanning it to find empty slots (as described below). 218 | 219 | 220 | 221 | *** 222 | 223 | **Exercise 2.** 224 | 225 | Ensure that you acquire and release locks throughout SimpleDB. Some (but 226 | not necessarily all) actions that you should verify work properly: 227 | 228 | * Reading tuples off of pages during a SeqScan (if you 229 | implemented locking in `BufferPool.getPage()`, this should work 230 | correctly as long as your `HeapFile.iterator()` uses 231 | `BufferPool.getPage()`.) 
232 | * Inserting and deleting tuples through BufferPool and HeapFile 233 | methods (if you 234 | implemented locking in `BufferPool.getPage()`, this should work 235 | correctly as long as `HeapFile.insertTuple()` and 236 | `HeapFile.deleteTuple()` use 237 | `BufferPool.getPage()`.) 238 | 239 | 240 | 241 | You will also want to think especially hard about acquiring and releasing 242 | locks in the following situations: 243 | 244 | 245 | * Adding a new page to a `HeapFile`. When do you physically 246 | write the page to disk? Are there race conditions with other transactions 247 | (on other threads) that might need special attention at the HeapFile level, 248 | regardless of page-level locking? 249 | * Looking for an empty slot into which you can insert tuples. 250 | Most implementations scan pages looking for an empty 251 | slot, and will need a READ_ONLY lock to do this. Surprisingly, however, 252 | if a transaction *t* finds no free slot on a page *p*, *t* may immediately release the lock on *p*. 253 | Although this apparently contradicts the rules of two-phase locking, it is OK because 254 | *t* did not use any data from the page, so that a concurrent transaction *t'* which updated 255 | *p* cannot possibly affect the answer or outcome of *t*. 256 | 257 | 258 | At this point, your code should pass the unit tests in 259 | LockingTest. 260 | 261 | *** 262 | 263 | ### 2.6. Implementing NO STEAL 264 | 265 | Modifications from a transaction are written to disk only after it 266 | commits. This means we can abort a transaction by discarding the dirty 267 | pages and rereading them from disk. Thus, we must not evict dirty 268 | pages. This policy is called NO STEAL. 269 | 270 | You will need to modify the evictPage method in BufferPool. 271 | In particular, it must never evict a dirty page. If your eviction policy prefers a dirty page 272 | for eviction, you will have to find a way to evict an alternative 273 | page. 
In the case where all pages in the buffer pool are dirty, you 274 | should throw a DbException. 275 | 276 | Note that, in general, evicting a clean page that is locked by a 277 | running transaction is OK when using NO STEAL, as long as your lock 278 | manager keeps information about evicted pages around, and as long as 279 | none of your operator implementations keep references to Page objects 280 | which have been evicted. 281 | 282 | 283 | *** 284 | 285 | **Exercise 3.** 286 | 287 | Implement the necessary logic for page eviction without evicting dirty pages 288 | in the evictPage method in BufferPool. 289 | 290 | *** 291 | 292 | 293 | ### 2.7. Transactions 294 | 295 | In SimpleDB, a `TransactionId` object is created at the 296 | beginning of each query. This object is passed to each of the operators 297 | involved in the query. When the query is complete, the 298 | `BufferPool` method `transactionComplete` is called. 299 | 300 | 301 | 302 | Calling this method either *commits* or *aborts* the 303 | transaction, specified by the parameter flag `commit`. At any point 304 | during its execution, an operator may throw a 305 | `TransactionAbortedException` exception, which indicates an 306 | internal error or deadlock has occurred. The test cases we have provided 307 | you with create the appropriate `TransactionId` objects, pass 308 | them to your operators in the appropriate way, and invoke 309 | `transactionComplete` when a query is finished. We have also 310 | implemented `TransactionId`. 311 | 312 | 313 | 314 | *** 315 | 316 | **Exercise 4.** 317 | 318 | Implement the `transactionComplete()` method in 319 | `BufferPool`. Note that there are two versions of 320 | transactionComplete, one which accepts an additional boolean **commit** argument, 321 | and one which does not. The version without the additional argument should 322 | always commit and so can simply be implemented by calling `transactionComplete(tid, true)`. 
323 | 324 | 325 | 326 | When you commit, you should flush dirty pages 327 | associated to the transaction to disk. When you abort, you should revert 328 | any changes made by the transaction by restoring the page to its on-disk 329 | state. 330 | 331 | 332 | 333 | 334 | Whether the transaction commits or aborts, you should also release any state the 335 | `BufferPool` keeps regarding 336 | the transaction, including releasing any locks that the transaction held. 337 | 338 | 339 | At this point, your code should pass the `TransactionTest` unit test and the 340 | `AbortEvictionTest` system test. You may find the `TransactionTest` system test 341 | illustrative, but it will likely fail until you complete the next exercise. 342 | 343 | 344 | *** 345 | 346 | 347 | 348 | ### 2.8. Deadlocks and Aborts 349 | 350 | It is possible for transactions in SimpleDB to deadlock (if you do not 351 | understand why, we recommend reading about deadlocks in Ramakrishnan & Gehrke). 352 | You will need to detect this situation and throw a 353 | `TransactionAbortedException`. 354 | 355 | 356 | 357 | There are many possible ways to detect deadlock. For example, you may 358 | implement a simple timeout policy that aborts a transaction if it has not 359 | completed after a given period of time. Alternately, you may implement 360 | cycle-detection in a dependency graph data structure. In this scheme, you 361 | would check for cycles in a dependency graph whenever you attempt to grant 362 | a new lock, and abort something if a cycle exists. 363 | 364 | 365 | 366 | After you have detected that a deadlock exists, you must decide how to 367 | improve the situation. Assume you have detected a deadlock while 368 | transaction *t* is waiting for a lock. If you're feeling 369 | homicidal, you might abort **all** transactions that *t* is 370 | waiting for; this may result in a large amount of work being undone, but 371 | you can guarantee that *t* will make progress. 
372 | Alternately, you may decide to abort *t* to give other 373 | transactions a chance to make progress. This means that the end-user will have 374 | to retry transaction *t*. 375 | 376 | 377 | 378 | *** 379 | 380 | **Exercise 5.** 381 | 382 | Implement deadlock detection and resolution in 383 | `src/simpledb/BufferPool.java`. Most likely, you will want to check 384 | for a deadlock whenever a transaction attempts to acquire a lock and finds another 385 | transaction is holding the lock (note that this by itself is not a deadlock, but may 386 | be symptomatic of one.) You have many design 387 | decisions for your deadlock resolution system, but it is not necessary to 388 | do something complicated. Please describe your choices in the lab writeup. 389 | 390 | 391 | 392 | You should ensure that your code aborts transactions properly when a 393 | deadlock occurs, by throwing a 394 | `TransactionAbortedException` exception. 395 | This exception will be caught by the code executing the transaction 396 | (e.g., `TransactionTest.java`), which should call 397 | `transactionComplete()` to cleanup after the transaction. 398 | You are not expected to automatically restart a transaction which 399 | fails due to a deadlock -- you can assume that higher level code 400 | will take care of this. 401 | 402 | 403 | 404 | We have provided some (not-so-unit) tests in 405 | `test/simpledb/DeadlockTest.java`. They are actually a 406 | bit involved, so they may take more than a few seconds to run (depending 407 | on your policy). If they seem to hang indefinitely, then you probably 408 | have an unresolved deadlock. These tests construct simple deadlock 409 | situations that your code should be able to escape. 410 | 411 | 412 | 413 | 414 | Note that there are two timing parameters near the top of 415 | `DeadLockTest.java`; these determine the frequency at which 416 | the test checks if locks have been acquired and the waiting time before 417 | an aborted transaction is restarted. 
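If you choose the dependency-graph approach from Exercise 5, the core cycle check might look like the following sketch: an edge *t* → *u* means transaction *t* is waiting for a lock held by *u*, and a request that would let the requester reach itself signals a deadlock. The class and method names are invented; integrating this with your lock manager is up to you.

```java
import java.util.*;

// Sketch of deadlock detection via a wait-for graph (names invented).
class WaitForGraph {
    private final Map<Integer, Set<Integer>> waitsFor = new HashMap<>();

    // Record that 'waiter' is blocked on a lock held by 'holder'.
    void addEdge(int waiter, int holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // Called when a transaction acquires its lock or aborts.
    void removeWaiter(int waiter) {
        waitsFor.remove(waiter);
    }

    // DFS from 'start': true if 'start' can reach itself, i.e. a cycle
    // (deadlock) involving the requesting transaction exists.
    boolean wouldDeadlock(int start) {
        Deque<Integer> stack = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : waitsFor.getOrDefault(t, Collections.emptySet())) {
                if (u == start) return true;   // cycle back to the requester
                if (seen.add(u)) stack.push(u);
            }
        }
        return false;
    }
}
```

A lock manager using this would run the check while holding its own monitor, before putting the requester to sleep, and throw `TransactionAbortedException` when the check returns true.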
You may observe different 418 | performance characteristics by tweaking these parameters if you use a 419 | timeout-based detection method. The tests will output 420 | `TransactionAbortedExceptions` corresponding to resolved 421 | deadlocks to the console. 422 | 423 | 424 | 425 | Your code should now pass the `TransactionTest` system test (which may also run for quite a long time). 426 | 427 | 428 | At this point, you should have a recoverable database, in the 429 | sense that if the database system crashes (at a point other than 430 | `transactionComplete()`) or if the user explicitly aborts a 431 | transaction, the effects of any running transaction will not be visible 432 | after the system restarts (or the transaction aborts.) You may wish to 433 | verify this by running some transactions and explicitly killing the 434 | database server. 435 | 436 | 437 | *** 438 | 439 | ### 2.9. Design alternatives 440 | 441 | During the course of this lab, we have identified three substantial design 442 | choices that you have to make: 443 | 444 | 445 | * Locking granularity: page-level versus tuple-level 446 | * Deadlock detection: timeouts versus dependency graphs 447 | * Deadlock resolution: aborting yourself versus aborting others 448 | 449 | 450 | *** 451 | 452 | **Bonus Exercise 6. (10% extra credit)** 453 | 454 | For one or more of these choices, implement both alternatives and 455 | briefly compare their performance characteristics in your writeup. 456 | 457 | 458 | 459 | You have now completed this lab. 460 | Good work! 461 | 462 | *** 463 | 464 | ## 3. Logistics 465 | 466 | You must submit your code (see below) as well as a short (2 pages, maximum) 467 | writeup describing your approach. This writeup should: 468 | 469 | 470 | 471 | * Describe any design decisions you made, including your deadlock detection 472 | policy, locking granularity, etc. 473 | 474 | * Discuss and justify any changes you made to the API. 475 | 476 | 477 | 478 | ### 3.1. 
Collaboration 479 | 480 | This lab should be manageable for a single person, but if you prefer 481 | to work with a partner, this is also OK. Larger groups are not allowed. 482 | Please indicate clearly who you worked with, if anyone, on your writeup. 483 | 484 | ### 3.2. Submitting your assignment 485 | 486 | 487 | 495 | 496 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab4-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 497 | 498 | 499 | You also need to explicitly add any other files you create, such as new *.java 500 | files. 501 | 502 | The criterion for your lab being submitted on time is that your code must be 503 | **tagged** and 504 | **pushed** by the date and time. This means that if one of the TAs or the 505 | instructor were to open up GitHub, they would be able to see your solutions on 506 | the GitHub web page. 507 | 508 | Just because your code has been committed on your local machine does not 509 | mean that it has been **submitted**; it needs to be on GitHub. 510 | 511 | There is a bash script `turnInLab4.sh` in the root level directory of simple-db-hw that commits 512 | your changes, deletes any prior tag 513 | for the current lab, tags the current commit, and pushes the tag 514 | to GitHub. If you are using Linux or Mac OSX, you should be able to run the following: 515 | 516 | ```bash 517 | $ ./turnInLab4.sh 518 | ``` 519 | You should see something like the following output: 520 | 521 | ```bash 522 | $ ./turnInLab4.sh 523 | error: tag 'lab4submit' not found. 524 | remote: warning: Deleting a non-existent ref. 525 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 526 | - [deleted] lab1submit 527 | [master 7a26701] Lab 4 528 | 1 file changed, 0 insertions(+), 0 deletions(-) 529 | create mode 100644 aaa 530 | Counting objects: 3, done. 
531 | Delta compression using up to 4 threads. 532 | Compressing objects: 100% (3/3), done. 533 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 534 | Total 3 (delta 1), reused 0 (delta 0) 535 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 536 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 537 | 069856c..7a26701 master -> master 538 | * [new tag] lab4submit -> lab4submit 539 | ``` 540 | 541 | 542 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 4 as follows: 543 | 544 | 1. Look at your current repository status. 545 | 546 | ```bash 547 | $ git status 548 | ``` 549 | 550 | 2. Add and commit your code changes (if they aren't already added and committed). 551 | 552 | ```bash 553 | $ git commit -a -m 'Lab 4' 554 | ``` 555 | 556 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 557 | 558 | ```bash 559 | $ git tag -d lab4submit 560 | $ git push origin :refs/tags/lab4submit 561 | ``` 562 | 563 | 4. Tag your last commit as the lab to be graded 564 | ```bash 565 | $ git tag -a lab4submit -m 'submit lab 4' 566 | ``` 567 | 568 | 5. This is the most important part: **push** your solutions to GitHub. 569 | 570 | ```bash 571 | $ git push origin master --tags 572 | ``` 573 | 574 | 6. The last thing that we strongly recommend you do is to go to the 575 | [MIT-DB-Class] organization page on GitHub to 576 | make sure that we can see your solutions. 577 | 578 | Just navigate to your repository and check that your latest commits are on 579 | GitHub. You should also be able to check 580 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab4submit` 581 | 582 | 583 | #### Word of Caution 584 | 585 | Git is a distributed version control system. This means everything operates 586 | offline until you run `git pull` or `git push`. This is a great feature. 
587 | 588 | The bad thing is that you may forget to `git push` your changes. This is why we strongly, **strongly** suggest that you check GitHub to be sure that what you want us to see matches up with what you expect. 589 | 590 | Just because your code has been committed on your local machine does not 591 | mean that it has been **submitted**; it needs to be on GitHub. 592 | 593 | 594 | 595 | ### 3.3. Submitting a bug 596 | Despite its friendly-sounding name, SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc. 597 | 598 | We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out 599 | yourself or send us a friendly email. 600 | 601 | Please submit (friendly!) bug reports to 6.830-staff@mit.edu. 603 | When you do, please try to include: 604 | 605 | 606 | * A description of the bug. 607 | 608 | * A .java file we can drop in the 609 | `test/simpledb` directory, compile, and run. 610 | 611 | * A .txt file with the data that reproduces the bug. We should be 612 | able to convert it to a .dat file using `HeapFileEncoder`. 613 | 614 | 615 | 616 | You can also post on the class page on Piazza if you feel you have run into a bug. 617 | 618 | 619 | ### 3.4 Grading 620 | 621 | 50% of your grade will be based on whether or not your code passes the 622 | system test suite we will run over it. These tests will be a superset 623 | of the tests we have provided. Before handing in your code, you should 624 | make sure it produces no errors (passes all of the tests) from both 625 | ant test and ant systemtest. 626 | 627 | 628 | 629 | **Important:** Before testing, we will replace your build.xml, 630 | HeapFileEncoder.java, and the entire contents of the 631 | test/ directory with our version of these files! This 632 | means you cannot change the format of .dat files! 
You should 633 | therefore be careful changing our APIs. This also means you need to test 634 | whether your code compiles with our test programs. In other words, we will 635 | pull your repo, replace the files mentioned above, compile it, and then 636 | grade it. It will look roughly like this: 637 | 638 | 639 | ``` 640 | $ git pull 641 | [replace build.xml, HeapFileEncoder.java and test] 642 | $ ant test 643 | $ ant systemtest 644 | [additional tests] 645 | ``` 646 | 647 | If any of these commands fail, we'll be unhappy, and, therefore, so will your grade. 648 | 649 | 650 | 651 | An additional 50% of your grade will be based on the quality of your 652 | writeup and our subjective evaluation of your code. 653 | 654 | 655 | 656 | We've had a lot of fun designing this assignment, and we hope you enjoy 657 | hacking on it! 658 | -------------------------------------------------------------------------------- /lab2.md: -------------------------------------------------------------------------------- 1 | # 6.830 Lab 2: SimpleDB Operators 2 | 3 | **Assigned: Friday, September 28, 2018**
4 | **Due: Wednesday, October 10, 2018 11:59 PM EDT** 5 | 6 | 7 | 13 | 14 | 15 | 16 | In this lab assignment, you will write a set of operators for SimpleDB 17 | to implement table modifications (e.g., insert and delete records), 18 | selections, joins, and aggregates. These will build on top of the 19 | foundation that you wrote in Lab 1 to provide you with a database 20 | system that can perform simple queries over multiple tables. 21 | 22 | 23 | 24 | Additionally, we ignored the issue of buffer pool management in Lab 1: we 25 | have not dealt with the problem that arises when we reference more pages 26 | than we can fit in memory over the lifetime of the database. 27 | In Lab 2, you will design an eviction policy to 28 | flush stale pages from the buffer pool. 29 | 30 | 31 | 32 | You do not need to implement transactions or locking in this lab. 33 | 34 | 35 | 36 | The remainder of this document gives some suggestions about how to start 37 | coding, describes a set of exercises to help you work through the lab, 38 | and discusses how to hand in your code. This lab requires you to 39 | write a fair amount of code, so we encourage you to **start early**! 40 | 41 | 42 | 43 | ## 1. Getting started 44 | 45 | You should begin with the code you submitted for Lab 1 (if you did not 46 | submit code for Lab 1, or your solution didn't work properly, contact us to 47 | discuss options). 48 | 49 | ### 1.3. Implementation hints 50 | 51 | 52 | As before, we **strongly encourage** you to read through this entire 53 | document to get a feel for the high-level design of SimpleDB before you 54 | write code. 55 | 56 | We suggest exercises throughout this document to guide your implementation, but 57 | you may find that a different order makes more sense for you. As before, 58 | we will grade your assignment by looking at your code and verifying that 59 | you have passed the tests for the ant targets `test` and 60 | `systemtest`. 
Note that the code only needs to pass the tests we indicate in this 61 | lab, not all of the unit and system tests. See Section 3.4 for a complete discussion of 62 | grading and a list of the tests you will need to pass. 63 | 64 | Here's a rough outline of one way you might proceed with your SimpleDB 65 | implementation; more details on the steps in this outline, including 66 | exercises, are given in Section 2 below. 67 | 68 | 69 | 70 | * Implement the operators `Filter` and `Join` and 71 | verify that their corresponding tests work. The Javadoc comments for 72 | these operators contain details about how they should work. We have given you implementations of 73 | `Project` and `OrderBy` which may help you 74 | understand how other operators work. 75 | 76 | * Implement `IntegerAggregator` and `StringAggregator`. Here, you will write the 77 | logic that actually computes an aggregate over a particular field across 78 | multiple groups in a sequence of input tuples. Use integer division for 79 | computing the average, since SimpleDB only supports integers. `StringAggregator` 80 | only needs to support the COUNT aggregate, since the other operations do not 81 | make sense for strings. 82 | 83 | * Implement the `Aggregate` operator. As with other 84 | operators, aggregates implement the `OpIterator` interface 85 | so that they can be placed in SimpleDB query plans. Note that the 86 | output of an `Aggregate` operator is an aggregate value of an 87 | entire group for each call to `next()`, and that the 88 | aggregate constructor takes the aggregation and grouping fields. 89 | 90 | * Implement the methods related to tuple insertion, deletion, and page 91 | eviction in `BufferPool`. You do not need to worry about 92 | transactions at this point. 93 | 94 | * Implement the `Insert` and `Delete` operators. 
95 | Like all operators, `Insert` and `Delete` implement 96 | `OpIterator`, accepting a stream of tuples to insert or delete 97 | and outputting a single tuple with an integer field that indicates the 98 | number of tuples inserted or deleted. These operators will need to call 99 | the appropriate methods in `BufferPool` that actually modify the 100 | pages on disk. Check that the tests for inserting and 101 | deleting tuples work properly. 102 | 103 | Note that SimpleDB does not implement any kind of consistency or integrity 104 | checking, so it is possible to insert duplicate records into a file and 105 | there is no way to enforce primary or foreign key constraints. 106 | 107 | 108 | 109 | At this point you should be able to pass the tests in the ant 110 | `systemtest` target, which is the goal of this lab. 111 | 112 | 113 | 114 | You'll also be able to use the provided SQL parser to run SQL 115 | queries against your database! See [Section 2.7](#parser) for a 116 | brief tutorial. 117 | 118 | 119 | 120 | 121 | Finally, you might notice that the iterators in this lab extend the 122 | `Operator` class instead of implementing the OpIterator 123 | interface. Because the implementation of next/hasNext 124 | is often repetitive, annoying, and error-prone, `Operator` 125 | implements this logic generically, and only requires that you implement 126 | a simpler readNext. Feel free to use this style of 127 | implementation, or just implement the `OpIterator` interface if you prefer. 128 | To implement the OpIterator interface, remove `extends Operator` 129 | from iterator classes, and in its place put `implements OpIterator`. 130 | 131 | 132 | 133 | ## 2. SimpleDB Architecture and Implementation Guide 134 | 135 | 136 | ### 2.1. Filter and Join 137 | 138 | Recall that SimpleDB OpIterator classes implement the operations of the 139 | relational algebra. 
You will now implement two operators that will enable 140 | you to perform queries that are slightly more interesting than a table 141 | scan. 142 | 143 | 144 | 145 | * *Filter*: This operator only returns tuples that satisfy 146 | a `Predicate` that is specified as part of its constructor. Hence, 147 | it filters out any tuples that do not match the predicate. 148 | 149 | * *Join*: This operator joins tuples from its two children according to 150 | a `JoinPredicate` that is passed in as part of its constructor. 151 | We only require a simple nested loops join, but you may explore more 152 | interesting join implementations. Describe your implementation in your lab 153 | writeup. 154 | 155 | 156 | 157 | **Exercise 1.** 158 | 159 | Implement the skeleton methods in: 160 | 161 | *** 162 | * src/simpledb/Predicate.java 163 | * src/simpledb/JoinPredicate.java 164 | * src/simpledb/Filter.java 165 | * src/simpledb/Join.java 166 | 167 | *** 168 | 169 | At this point, your code should pass the unit tests in 170 | PredicateTest, JoinPredicateTest, FilterTest, and JoinTest. Furthermore, 171 | you should be able to pass the system tests FilterTest and JoinTest. 172 | 173 | 174 | 175 | ### 2.2. Aggregates 176 | 177 | An additional SimpleDB operator implements basic SQL aggregates with a 178 | `GROUP BY` clause. You should implement the five SQL aggregates 179 | (`COUNT`, `SUM`, `AVG`, `MIN`, 180 | `MAX`) and support grouping. You only need to support aggregates 181 | over a single field, and grouping by a single field. 182 | 183 | 184 | 185 | In order to calculate aggregates, we use an `Aggregator` 186 | interface which merges a new tuple into the existing calculation of an 187 | aggregate. The `Aggregator` is told during construction what 188 | operation it should use for aggregation. Subsequently, the client code 189 | should call `Aggregator.mergeTupleIntoGroup()` for every tuple in the child 190 | iterator. 
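As a self-contained toy illustration of this merge-then-read pattern — using plain Java collections, with class and method names that are illustrative stand-ins rather than SimpleDB's actual skeleton — here is a grouped AVG that keeps a running (sum, count) per group and applies integer division at read time:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the Aggregator idea: merge tuples one at a time, keeping
// per-group running (sum, count) so AVG can be computed with the integer
// division that SimpleDB requires.
public class AvgAggregatorSketch {
    // group value -> {running sum, running count}
    private final Map<Integer, int[]> groups = new HashMap<>();

    void mergeTupleIntoGroup(int groupVal, int aggVal) {
        int[] sc = groups.computeIfAbsent(groupVal, k -> new int[2]);
        sc[0] += aggVal; // sum
        sc[1] += 1;      // count
    }

    int avg(int groupVal) {
        int[] sc = groups.get(groupVal);
        return sc[0] / sc[1]; // integer division, per the lab spec
    }

    public static void main(String[] args) {
        AvgAggregatorSketch a = new AvgAggregatorSketch();
        a.mergeTupleIntoGroup(1, 10);
        a.mergeTupleIntoGroup(1, 15);
        a.mergeTupleIntoGroup(2, 7);
        System.out.println(a.avg(1)); // 25 / 2 = 12 with integer division
        System.out.println(a.avg(2)); // 7
    }
}
```

The real `IntegerAggregator` follows the same shape, but merges `Tuple` objects and emits `(groupValue, aggregateValue)` pairs through an `OpIterator`.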
After all tuples have been merged, the client can retrieve an 191 | OpIterator of aggregation results. Each tuple in the result is a pair of 192 | the form `(groupValue, aggregateValue)`, unless the value 193 | of the group-by field was `Aggregator.NO_GROUPING`, in which 194 | case the result is a single tuple of the form `(aggregateValue)`. 195 | 196 | 197 | 198 | Note that this implementation requires space linear in the number of 199 | distinct groups. For the purposes of this lab, you do not need to worry 200 | about the situation where the number of groups exceeds available memory. 201 | 202 | 203 | 204 | **Exercise 2.** 205 | 206 | Implement the skeleton methods in: 207 | 208 | *** 209 | * src/simpledb/IntegerAggregator.java 210 | * src/simpledb/StringAggregator.java 211 | * src/simpledb/Aggregate.java 212 | 213 | *** 214 | 215 | At this point, your code should pass the unit tests 216 | IntegerAggregatorTest, StringAggregatorTest, and 217 | AggregateTest. Furthermore, you should be able to pass the AggregateTest system test. 218 | 219 | 220 | ### 2.3. HeapFile Mutability 221 | 222 | Now, we will begin to implement methods to support modifying tables. We 223 | begin at the level of individual pages and files. There are two main sets 224 | of operations: adding tuples and removing tuples. 225 | 226 | **Removing tuples:** To remove a tuple, you will need to implement 227 | `deleteTuple`. 228 | Tuples contain `RecordIDs` which allow you to find 229 | the page they reside on, so this should be as simple as locating the page 230 | a tuple belongs to and modifying the headers of the page appropriately. 231 | 232 | **Adding tuples:** The `insertTuple` method in 233 | `HeapFile.java` is responsible for adding a tuple to a heap 234 | file. To add a new tuple to a HeapFile, you will have to find a page with 235 | an empty slot. If no such pages exist in the HeapFile, you 236 | need to create a new page and append it to the physical file on disk. 
You will 237 | need to ensure that the RecordID in the tuple is updated correctly. 238 | 239 | **Exercise 3.** 240 | 241 | Implement the remaining skeleton methods in: 242 | 243 | *** 244 | * src/simpledb/HeapPage.java 245 | * src/simpledb/HeapFile.java
246 | (Note that you do not necessarily need to implement writePage at this point). 247 | 248 | *** 249 | 250 | 251 | 252 | To implement HeapPage, you will need to modify the header bitmap for 253 | methods such as insertTuple() and deleteTuple(). You may 254 | find that the getNumEmptySlots() and isSlotUsed() methods we asked you to 255 | implement in Lab 1 serve as useful abstractions. Note that there is a 256 | markSlotUsed method provided as an abstraction to modify the filled 257 | or cleared status of a tuple in the page header. 258 | 259 | 260 | Note that it is important that the HeapFile.insertTuple() 261 | and HeapFile.deleteTuple() methods access pages using 262 | the BufferPool.getPage() method; otherwise, your 263 | implementation of transactions in the next lab will not work 264 | properly. 265 | 266 | 267 | Implement the following skeleton methods in src/simpledb/BufferPool.java: 268 | 269 | *** 270 | * insertTuple() 271 | * deleteTuple() 272 | 273 | *** 274 | 275 | 276 | These methods should call the appropriate methods in the HeapFile that 277 | belong to the table being modified (this extra level of indirection is 278 | needed to support other types of files — like indices — in the 279 | future). 280 | 281 | 282 | 283 | At this point, your code should pass the unit tests in HeapPageWriteTest and 284 | HeapFileWriteTest, as well as BufferPoolWriteTest. 285 | 286 | 287 | 288 | 289 | ### 2.4. Insertion and deletion 290 | 291 | Now that you have written all of the HeapFile machinery to add and remove 292 | tuples, you will implement the `Insert` and `Delete` 293 | operators. 294 | 295 | 296 | 297 | For plans that implement `insert` and `delete` queries, 298 | the top-most operator is a special `Insert` or `Delete` 299 | operator that modifies the pages on disk. These operators return the number 300 | of affected tuples. This is implemented by returning a single tuple with one 301 | integer field, containing the count. 
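To make this "affected count" contract concrete, here is a self-contained toy sketch — the names `InsertSketch` and `fetchNext` are illustrative, not SimpleDB's actual classes — of an Insert-style operator that drains its child once, reports the count as its single result, and returns null thereafter:

```java
import java.util.Iterator;
import java.util.List;

// Toy sketch of the count-tuple contract: drain the child stream once,
// insert each tuple (via BufferPool in real SimpleDB), then emit a single
// integer count. Later calls return null, as with Operator.fetchNext().
public class InsertSketch {
    private final Iterator<Integer> child; // stand-in for a tuple stream
    private boolean done = false;

    InsertSketch(Iterator<Integer> child) { this.child = child; }

    Integer fetchNext() {
        if (done) return null;  // the count is reported exactly once
        int count = 0;
        while (child.hasNext()) {
            child.next();       // real code: BufferPool.insertTuple(...)
            count++;
        }
        done = true;
        return count;           // one "tuple" with one integer field
    }

    public static void main(String[] args) {
        InsertSketch op = new InsertSketch(List.of(10, 20, 30).iterator());
        System.out.println(op.fetchNext()); // 3
        System.out.println(op.fetchNext()); // null
    }
}
```

The bullets below spell out how the real `Insert` and `Delete` operators plug this behavior into `BufferPool`.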
302 | 303 | 304 | 305 | * *Insert*: This operator adds the tuples it reads from its child 306 | operator to the `tableid` specified in its constructor. It should 307 | use the `BufferPool.insertTuple()` method to do this. 308 | 309 | * *Delete*: This operator deletes the tuples it reads from its child 310 | operator from the `tableid` specified in its constructor. It 311 | should use the `BufferPool.deleteTuple()` method to do this. 312 | 313 | 314 | 315 | 316 | 317 | **Exercise 4.** 318 | 319 | Implement the skeleton methods in: 320 | 321 | *** 322 | * src/simpledb/Insert.java 323 | * src/simpledb/Delete.java 324 | 325 | *** 326 | 327 | At this point, your code should pass the unit tests in InsertTest. We 328 | have not provided unit tests for `Delete`. Furthermore, you 329 | should be able to pass the InsertTest and DeleteTest system tests. 330 | 331 | 332 | ### 2.5. Page eviction 333 | 334 | In Lab 1, we did not correctly observe the limit on the maximum number of pages 335 | in the buffer pool defined by the 336 | constructor argument `numPages`. Now, you will choose a page eviction 337 | policy and instrument any previous code that reads or creates pages to 338 | implement your policy. 339 | 340 | 341 | 342 | When more than numPages pages are in the buffer pool, one page should be 343 | evicted from the pool before the next is loaded. The choice of eviction 344 | policy is up to you; it is not necessary to do something sophisticated. 345 | Describe your policy in the lab writeup. 346 | 347 | 348 | 349 | Notice that `BufferPool` asks you to implement 350 | a `flushAllPages()` method. This is not something you would ever 351 | need in a real implementation of a buffer pool. However, we need this method 352 | for testing purposes. You should never call this method from any real code. 
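Since the choice of policy is left open, one easy option is LRU. As a self-contained toy sketch (the names are illustrative, and a real `evictPage()` would flush a dirty page to disk before dropping it), Java's `LinkedHashMap` in access-order mode already provides the bookkeeping:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU buffer pool: LinkedHashMap in access-order mode drops the
// least-recently-used page once the pool exceeds numPages.
public class LruPoolSketch {
    final LinkedHashMap<Integer, String> pool;

    LruPoolSketch(int numPages) {
        pool = new LinkedHashMap<>(numPages, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                // Hook where a real pool would flushPage(eldest) if dirty.
                return size() > numPages;
            }
        };
    }

    String getPage(int pid) {
        // Simulates reading the page from disk on a miss.
        return pool.computeIfAbsent(pid, p -> "page-" + p);
    }

    public static void main(String[] args) {
        LruPoolSketch bp = new LruPoolSketch(2);
        bp.getPage(1);
        bp.getPage(2);
        bp.getPage(1);                        // touch page 1; page 2 is now LRU
        bp.getPage(3);                        // exceeds capacity, evicts page 2
        System.out.println(bp.pool.keySet()); // [1, 3]
    }
}
```

A simpler "evict any clean page" or even random policy would also satisfy this lab, since the tests do not check for a particular policy.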
353 | 354 | Because of the way we have implemented ScanTest.cacheTest, you will 355 | need to ensure that your flushPage and flushAllPages methods 356 | do not evict pages from the buffer pool to properly pass 357 | this test. 358 | 359 | flushAllPages should call flushPage on all pages in the BufferPool, 360 | and flushPage should write any dirty page to disk and mark it as not 361 | dirty, while leaving it in the BufferPool. 362 | 363 | The only method which should remove pages from the buffer pool is 364 | evictPage, which should call flushPage on any dirty page it evicts. 365 | 366 | **Exercise 5.** 367 | 368 | Fill in the `flushPage()` method and additional helper 369 | methods to implement page eviction in: 370 | 371 | *** 372 | * src/simpledb/BufferPool.java 373 | 374 | *** 375 | 376 | 377 | 378 | If you did not implement `writePage()` in 379 | HeapFile.java above, you will also need to do that here. Finally, 380 | you should also implement `discardPage()` to remove a page from the 381 | buffer pool *without* flushing it to disk. We will not test `discardPage()` 382 | in this lab, but it will be necessary for future labs. 383 | 384 | 385 | At this point, your code should pass the EvictionTest system test. 386 | 387 | Since we will not 388 | be checking for any particular eviction policy, this test works by creating a 389 | BufferPool with 16 pages (NOTE: while DEFAULT_PAGES is 50, we are initializing the 390 | BufferPool with fewer!), scanning a file with many more than 16 pages, and seeing 391 | if the memory usage of the JVM increases by more than 5 MB. If you do not 392 | implement an eviction policy correctly, you will not evict enough pages, and will 393 | go over the size limitation, thus failing the test. 394 | 395 | 396 | You have now completed this lab. Good work! 397 | 398 | 399 | ### 2.6. Query walkthrough 400 | 401 | 402 | 403 | The following code implements a simple join query between two tables, each 404 | consisting of three columns of integers. 
(The files 405 | `some_data_file1.dat` and `some_data_file2.dat` are 406 | binary representations of the pages of these tables). This code is equivalent 407 | to the SQL statement: 408 | 409 | ```sql 410 | SELECT * 411 | FROM some_data_file1, some_data_file2 412 | WHERE some_data_file1.field1 = some_data_file2.field1 413 | AND some_data_file1.field0 > 1 414 | ``` 415 | 416 | For more extensive examples of query operations, you may find it helpful to 417 | browse the unit tests for joins, filters, and aggregates. 418 | 419 | ```java 420 | package simpledb; 421 | import java.io.*; 422 | 423 | public class jointest { 424 | 425 | public static void main(String[] argv) { 426 | // construct a 3-column table schema 427 | Type types[] = new Type[]{ Type.INT_TYPE, Type.INT_TYPE, Type.INT_TYPE }; 428 | String names[] = new String[]{ "field0", "field1", "field2" }; 429 | 430 | TupleDesc td = new TupleDesc(types, names); 431 | 432 | // create the tables, associate them with the data files 433 | // and tell the catalog about the schema of the tables. 
434 | HeapFile table1 = new HeapFile(new File("some_data_file1.dat"), td); 435 | Database.getCatalog().addTable(table1, "t1"); 436 | 437 | HeapFile table2 = new HeapFile(new File("some_data_file2.dat"), td); 438 | Database.getCatalog().addTable(table2, "t2"); 439 | 440 | // construct the query: we use two SeqScans, which spoonfeed 441 | // tuples via iterators into join 442 | TransactionId tid = new TransactionId(); 443 | 444 | SeqScan ss1 = new SeqScan(tid, table1.getId(), "t1"); 445 | SeqScan ss2 = new SeqScan(tid, table2.getId(), "t2"); 446 | 447 | // create a filter for the where condition 448 | Filter sf1 = new Filter( 449 | new Predicate(0, 450 | Predicate.Op.GREATER_THAN, new IntField(1)), ss1); 451 | 452 | JoinPredicate p = new JoinPredicate(1, Predicate.Op.EQUALS, 1); 453 | Join j = new Join(p, sf1, ss2); 454 | 455 | // and run it 456 | try { 457 | j.open(); 458 | while (j.hasNext()) { 459 | Tuple tup = j.next(); 460 | System.out.println(tup); 461 | } 462 | j.close(); 463 | Database.getBufferPool().transactionComplete(tid); 464 | 465 | } catch (Exception e) { 466 | e.printStackTrace(); 467 | } 468 | 469 | } 470 | 471 | } 472 | ``` 473 | 474 | 475 | 476 | Both tables have three integer fields. To express this, we create 477 | a `TupleDesc` object and pass it an array of `Type` 478 | objects indicating field types and `String` objects 479 | indicating field names. Once we have created this `TupleDesc`, we initialize 480 | two `HeapFile` objects representing the tables. Once we have 481 | created the tables, we add them to the Catalog. (If this were a database 482 | server that was already running, we would have this catalog information 483 | loaded; we need to load this only for the purposes of this test). 484 | 485 | 486 | 487 | Once we have finished initializing the database system, we create a query 488 | plan. 
Our plan consists of two `SeqScan` operators that scan 489 | the tuples from each file on disk, connected to a `Filter` 490 | operator on the first HeapFile, connected to a `Join` operator 491 | that joins the tuples in the tables according to the 492 | `JoinPredicate`. In general, these operators are instantiated 493 | with references to the appropriate table (in the case of SeqScan) or child 494 | operator (in the case of e.g., Join). The test program then repeatedly 495 | calls `next` on the `Join` operator, which in turn 496 | pulls tuples from its children. As tuples are output from the 497 | `Join`, they are printed out on the command line. 498 | 499 | 500 | ### 2.7. Query Parser 501 | 502 | We've provided you with a query parser for SimpleDB that you can use 503 | to write and run SQL queries against your database once you 504 | have completed the exercises in this lab. 505 | 506 | The first step is to create some data tables and a catalog. Suppose 507 | you have a file `data.txt` with the following contents: 508 | 509 | ``` 510 | 1,10 511 | 2,20 512 | 3,30 513 | 4,40 514 | 5,50 515 | 5,50 516 | ``` 517 | 518 | You can convert this into a SimpleDB table using the 519 | `convert` command (make sure to type ant first!): 520 | 521 | ``` 522 | java -jar dist/simpledb.jar convert data.txt 2 "int,int" 523 | ``` 524 | 525 | This creates a file `data.dat`. In addition to the table's 526 | raw data, the two additional parameters specify that each record has 527 | two fields and that their types are `int` and 528 | `int`. 529 | 530 | 531 | 532 | Next, create a catalog file, `catalog.txt`, 533 | with the following contents: 534 | 535 | ``` 536 | data (f1 int, f2 int) 537 | ``` 538 | 539 | This tells SimpleDB that there is one table, `data` (stored in 540 | `data.dat`) with two integer fields named `f1` 541 | and `f2`. 542 | 543 | Finally, invoke the parser. 544 | You must run java from the 545 | command line (ant doesn't work properly with interactive targets.) 
546 | From the `simpledb/` directory, type: 547 | 548 | ``` 549 | java -jar dist/simpledb.jar parser catalog.txt 550 | ``` 551 | 552 | You should see output like: 553 | 554 | ``` 555 | Added table : data with schema INT(f1), INT(f2), 556 | SimpleDB> 557 | ``` 558 | 559 | Finally, you can run a query: 560 | 561 | ``` 562 | SimpleDB> select d.f1, d.f2 from data d; 563 | Started a new transaction tid = 1221852405823 564 | ADDING TABLE d(data) TO tableMap 565 | TABLE HAS tupleDesc INT(d.f1), INT(d.f2), 566 | 1 10 567 | 2 20 568 | 3 30 569 | 4 40 570 | 5 50 571 | 5 50 572 | 573 | 6 rows. 574 | ---------------- 575 | 0.16 seconds 576 | 577 | SimpleDB> 578 | ``` 579 | 580 | The parser is relatively full featured (including support for SELECTs, 581 | INSERTs, DELETEs, and transactions), but does have some problems 582 | and does not necessarily report completely informative error 583 | messages. Here are some limitations to bear in mind: 584 | 585 | 586 | * You must preface every field name with its table name, even if 587 | the field name is unique (you can use table name aliases, as in the 588 | example above, but you cannot use the AS keyword.) 589 | 590 | * Nested queries are supported in the WHERE clause, but not the 591 | FROM clause. 592 | 593 | * No arithmetic expressions are supported (for example, you can't 594 | take the sum of two fields.) 595 | 596 | * At most one GROUP BY and one aggregate column are allowed. 597 | 598 | * Set-oriented operators like IN, UNION, and EXCEPT are not 599 | allowed. 600 | 601 | * Only AND expressions in the WHERE clause are allowed. 602 | 603 | * UPDATE expressions are not supported. 604 | 605 | * The string operator LIKE is allowed, but must be written out 606 | fully (that is, the Postgres tilde [~] shorthand is not allowed.) 607 | 608 | 609 | ## 3. Logistics 610 | 611 | You must submit your code (see below) as well as a short (2 pages, maximum) 612 | writeup describing your approach. 
This writeup should: 613 | 614 | 615 | 616 | * Describe any design decisions you made, including your choice of page 617 | eviction policy. If you used something other than a nested-loops join, 618 | describe the tradeoffs of the algorithm you chose. 619 | 620 | * Discuss and justify any changes you made to the API. 621 | 622 | * Describe any missing or incomplete elements of your code. 623 | 624 | * Describe how long you spent on the lab, and whether there was anything 625 | you found particularly difficult or confusing. 626 | 627 | 628 | 629 | ### 3.1. Collaboration 630 | 631 | This lab should be manageable for a single person, but if you prefer 632 | to work with a partner, this is also OK. Larger groups are not allowed. 633 | Please indicate clearly who you worked with, if anyone, on your individual 634 | writeup. 635 | 636 | ### 3.2. Submitting your assignment 637 | 638 | 639 | 640 | 649 | 650 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab2-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 651 | 652 | 653 | You also need to explicitly add any other files you create, such as new *.java 654 | files. 655 | 656 | The criterion for your lab being submitted on time is that your code must be 657 | **tagged** and 658 | **pushed** by the date and time. This means that if one of the TAs or the 659 | instructor were to open up GitHub, they would be able to see your solutions on 660 | the GitHub web page. 661 | 662 | Just because your code has been committed on your local machine does not 663 | mean that it has been **submitted**; it needs to be on GitHub. 
664 | 665 | There is a bash script `turnInLab2.sh` in the root level directory of simple-db-hw that commits 666 | your changes, deletes any prior tag 667 | for the current lab, tags the current commit, and pushes the tag 668 | to GitHub. If you are using Linux or Mac OSX, you should be able to run the following: 669 | 670 | ```bash 671 | $ ./turnInLab2.sh 672 | ``` 673 | You should see something like the following output: 674 | 675 | ```bash 676 | $ ./turnInLab2.sh 677 | error: tag 'lab2submit' not found. 678 | remote: warning: Deleting a non-existent ref. 679 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 680 | - [deleted] lab1submit 681 | [master 7a26701] Lab 2 682 | 1 file changed, 0 insertions(+), 0 deletions(-) 683 | create mode 100644 aaa 684 | Counting objects: 3, done. 685 | Delta compression using up to 4 threads. 686 | Compressing objects: 100% (3/3), done. 687 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 688 | Total 3 (delta 1), reused 0 (delta 0) 689 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 690 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 691 | 069856c..7a26701 master -> master 692 | * [new tag] lab2submit -> lab2submit 693 | ``` 694 | 695 | 696 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 2 as follows: 697 | 698 | 1. Look at your current repository status. 699 | 700 | ```bash 701 | $ git status 702 | ``` 703 | 704 | 2. Add and commit your code changes (if they aren't already added and committed). 705 | 706 | ```bash 707 | $ git commit -a -m 'Lab 2' 708 | ``` 709 | 710 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 711 | 712 | ```bash 713 | $ git tag -d lab2submit 714 | $ git push origin :refs/tags/lab2submit 715 | ``` 716 | 717 | 4. 
Tag your last commit as the lab to be graded 718 | ```bash 719 | $ git tag -a lab2submit -m 'submit lab 2' 720 | ``` 721 | 722 | 5. This is the most important part: **push** your solutions to GitHub. 723 | 724 | ```bash 725 | $ git push origin master --tags 726 | ``` 727 | 728 | 6. The last thing that we strongly recommend you do is to go to the 729 | [MIT-DB-Class] organization page on GitHub to 730 | make sure that we can see your solutions. 731 | 732 | Just navigate to your repository and check that your latest commits are on 733 | GitHub. You should also be able to check 734 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab2submit` 735 | 736 | 737 | #### Word of Caution 738 | 739 | Git is a distributed version control system. This means everything operates 740 | offline until you run `git pull` or `git push`. This is a great feature. 741 | 742 | The bad thing is that you may forget to `git push` your changes. This is why we 743 | strongly, **strongly** suggest that you check GitHub to be sure that what you 744 | want us to see matches up with what you expect. 745 | 746 | 747 | 748 | ### 3.3. Submitting a bug 749 | 750 | SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc. 751 | 752 | We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out 753 | yourself or send us a friendly email. 754 | 755 | Please submit (friendly!) bug reports to [6.830-staff@mit.edu](mailto:6.830-staff@mit.edu). 756 | When you do, please try to include: 757 | 758 | 759 | 760 | * A description of the bug. 761 | 762 | * A .java file we can drop in the 763 | `test/simpledb` directory, compile, and run. 764 | 765 | * A .txt file with the data that reproduces the bug. We should be 766 | able to convert it to a .dat file using `HeapFileEncoder`. 
767 | 768 | 769 | 770 | You can also post on the class page on Piazza if you feel you have run into a bug. 771 | 772 | 773 | ### 3.4 Grading 774 | 775 | 50% of your grade will be based on whether or not your code passes the 776 | system test suite we will run over it. These tests will be a superset 777 | of the tests we have provided. Before handing in your code, you should 778 | make sure it produces no errors (passes all of the tests) from both 779 | ant test and ant systemtest. 780 | 781 | 782 | 783 | **Important:** before testing, we will replace your build.xml, 784 | HeapFileEncoder.java, and the entire contents of the 785 | test/ directory with our version of these files! This 786 | means you cannot change the format of .dat files! You should 787 | therefore be careful changing our APIs. This also means you need to test 788 | whether your code compiles with our test programs. 789 | 790 | In other words, we will 791 | pull your repo, replace the files mentioned above, compile it, and then 792 | grade it. It will look roughly like this: 793 | 794 | ``` 795 | [replace build.xml, HeapFileEncoder.java, and test] 796 | $ git checkout -- build.xml src/java/simpledb/HeapFileEncoder.java test/ 797 | $ ant test 798 | $ ant systemtest 799 | [additional tests] 800 | ``` 801 | 802 | If any of these commands fail, we'll be unhappy, and, therefore, so will your grade. 803 | 804 | 805 | 806 | An additional 50% of your grade will be based on the quality of your 807 | writeup and our subjective evaluation of your code. 808 | 809 | 810 | 811 | We've had a lot of fun designing this assignment, and we hope you enjoy 812 | hacking on it! 813 | 814 | 815 | 817 | 819 | 821 | 823 | 825 | -------------------------------------------------------------------------------- /lab3.md: -------------------------------------------------------------------------------- 1 | # 6.830 Lab 3: Query Optimization 2 | 3 | **Assigned: Friday, October 12, 2018**
4 | **Due: Friday, October 26, 2018** 5 | 6 | 7 | In this lab, you will implement a query optimizer on top of SimpleDB. 8 | The main tasks include implementing a selectivity estimation framework 9 | and a cost-based optimizer. You have freedom as to exactly what you 10 | implement, but we recommend using something similar to the Selinger 11 | cost-based optimizer. 12 | 13 | 14 | The remainder of this document describes what is involved in 15 | adding optimizer support and provides a basic outline of how 16 | you might add this support to your database. 17 | 18 | 19 | 20 | As with the previous lab, we recommend that you start as early as possible. 21 | 22 | 23 | ## 1. Getting started 24 | 25 | You should begin with the code you submitted for Lab 2. (If you did not 26 | submit code for Lab 2, or your solution didn't work properly, contact us to 27 | discuss options.) 28 | 29 | 30 | We have provided you with extra test cases as well 31 | as source code files for this lab 32 | that are not in the original code distribution you received. We reiterate 33 | that the unit tests we provide are to help guide your implementation along, 34 | but they are not intended to be comprehensive or to establish correctness. 35 | 36 | You will need to add these new files to your release. The easiest way 37 | to do this is to change to your project directory (probably called simple-db-hw) 38 | and pull from the master GitHub repository: 39 | 40 | ``` 41 | $ cd simple-db-hw 42 | $ git pull upstream master 43 | ``` 44 | 45 | ### 1.1. Implementation hints 46 | We suggest exercises throughout this document to guide your implementation, but you may find that a different order makes more sense for you. As before, we will grade your assignment by looking at your code and verifying that you have passed the tests for the ant targets `test` and `systemtest`. See Section 3.4 for a complete discussion of grading and the tests you will need to pass.
47 | 48 | 49 | 50 | Here's a rough outline of one way you might proceed with this lab. More details on these steps are given in Section 2 below. 51 | 52 | * Implement the methods in the TableStats class that allow 53 | it to estimate selectivities of filters and cost of 54 | scans, using histograms (skeleton provided for the IntHistogram class) or some 55 | other form of statistics of your devising. 56 | * Implement the methods in the JoinOptimizer class that 57 | allow it to estimate the cost and selectivities of joins. 58 | * Write the orderJoins method in JoinOptimizer. This method must produce 59 | an optimal ordering for a series of joins (likely using the 60 | Selinger algorithm), given statistics computed in the previous two steps. 61 | 62 | 63 | ## 2. Optimizer outline 64 | 65 | Recall that the main idea of a cost-based optimizer is to: 66 | 67 | * Use statistics about tables to estimate "costs" of different 68 | query plans. Typically, the cost of a plan is related to the cardinalities of 69 | (number of tuples produced by) intermediate joins and selections, as well as the 70 | selectivity of filter and join predicates. 71 | * Use these statistics to order joins and selections in an 72 | optimal way, and to select the best implementation for join 73 | algorithms from amongst several alternatives. 74 | 75 | In this lab, you will implement code to perform both of these 76 | functions. 77 | 78 | The optimizer will be invoked from simpledb/Parser.java. You may wish 79 | to review the lab 2 parser exercise 80 | before starting this lab. Briefly, if you have a catalog file 81 | catalog.txt describing your tables, you can run the parser by 82 | typing: 83 | ``` 84 | java -jar dist/simpledb.jar parser catalog.txt 85 | ``` 86 | 87 | 88 | When the Parser is invoked, it will compute statistics over all of the 89 | tables (using statistics code you provide). 
When a query is issued, 90 | the parser 91 | will convert the query into a logical plan representation and then call 92 | your query optimizer to generate an optimal plan. 93 | 94 | ### 2.1 Overall Optimizer Structure 95 | Before getting started with the implementation, you need to understand the overall structure of the SimpleDB optimizer. 96 | The overall control flow of the SimpleDB modules of the parser and optimizer is 97 | shown in Figure 1. 98 | 99 |

100 | ![control flow](controlflow.png)
101 | Figure 1: Diagram illustrating classes, methods, and objects used in the parser 102 |

103 | 104 | 105 | The key at the bottom explains the symbols; you 106 | will implement the components with double-borders. The classes and 107 | methods will be explained in more detail in the text that follows (you may wish to refer back 108 | to this diagram), but 109 | the basic operation is as follows: 110 | 111 | 112 | 1. Parser.java constructs a set of table statistics (stored in the 113 | statsMap container) when it is initialized. It then waits for a 114 | query to be input, and calls the method parseQuery on that query. 115 | 2. parseQuery first constructs a LogicalPlan that 116 | represents the parsed query. parseQuery then calls the method physicalPlan on the 117 | LogicalPlan instance it has constructed. The physicalPlan method returns a DBIterator object that can be used to actually run 118 | the query. 119 | 120 | 121 | 122 | In the exercises to come, you will implement the methods that help 123 | physicalPlan devise an optimal plan. 124 | 125 | 126 | ### 2.2. Statistics Estimation 127 | Accurately estimating plan cost is quite tricky. In this lab, we will 128 | focus only on the cost of sequences of joins and base table accesses. We 129 | won't worry about access method selection (since we only have one 130 | access method, table scans) or the costs of additional operators (like 131 | aggregates). 132 | 133 | You are only required to consider left-deep plans for this lab. See 134 | Section 2.4 for a description of additional "bonus" optimizer features 135 | you might implement, including an approach for handling bushy plans. 136 | 137 | #### 2.2.1 Overall Plan Cost 138 | 139 | We will write join plans of the form `p=t1 join t2 join ... tn`, 140 | which signifies a left deep join where t1 is the left-most 141 | join (deepest in the tree). 142 | Given a plan like `p`, its cost 143 | can be expressed as: 144 | 145 | ``` 146 | scancost(t1) + scancost(t2) + joincost(t1 join t2) + 147 | scancost(t3) + joincost((t1 join t2) join t3) + 148 | ...
149 | ``` 150 | 151 | Here, `scancost(t1)` is the I/O cost of scanning table t1, 152 | `joincost(t1 join t2)` is the cost of joining t1 to t2. To 153 | make I/O and CPU cost comparable, typically a constant scaling factor 154 | is used, e.g.: 155 | 156 | ``` 157 | cost(predicate application) = 1 158 | cost(pageScan) = SCALING_FACTOR x cost(predicate application) 159 | ``` 160 | 161 | For this lab, you can ignore the effects of caching (e.g., assume that 162 | every access to a table incurs the full cost of a scan) -- again, this 163 | is something you may add as an optional bonus extension to your lab 164 | in Section 2.4. Therefore, `scancost(t1)` is simply the 165 | number of pages in `t1` times `SCALING_FACTOR`. 166 | 167 | #### 2.2.2 Join Cost 168 | 169 | When using nested loops joins, recall that the cost of a join between 170 | two tables t1 and t2 (where t1 is the outer) is 171 | simply: 172 | 173 | ``` 174 | joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) //IO cost 175 | + ntups(t1) x ntups(t2) //CPU cost 176 | ``` 177 | 178 | Here, `ntups(t1)` is the number of tuples in table t1. 179 | 180 | #### 2.2.3 Filter Selectivity 181 | 182 | `ntups` can be directly computed for a base table by 183 | scanning that table. Estimating `ntups` for a table with 184 | one or more selection predicates over it can be trickier -- 185 | this is the *filter selectivity estimation* problem. Here's one 186 | approach that you might use, based on computing a histogram over the 187 | values in the table: 188 | 189 | * Compute the minimum and maximum values for every attribute in the table (by scanning 190 | it once). 191 | * Construct a histogram for every attribute in the table. A simple 192 | approach is to use a fixed number of buckets *NumB*, 193 | with 194 | each bucket representing the number of records in a fixed range of the 195 | domain of the attribute of the histogram.
For example, if a field 196 | *f* ranges from 1 to 100, and there are 10 buckets, then bucket 1 might 197 | contain the count of the number of records between 1 and 10, bucket 198 | 2 a count of the number of records between 11 and 20, and so on. 199 | * Scan the table again, selecting out all of fields of all of the 200 | tuples and using them to populate the counts of the buckets 201 | in each histogram. 202 | * To estimate the selectivity of an equality expression, 203 | *f=const*, compute the bucket that contains value *const*. 204 | Suppose the width (range of values) of the bucket is *w*, the height (number of 205 | tuples) is *h*, 206 | and the number of tuples in the table is *ntups*. Then, assuming 207 | values are uniformly distributed throughout the bucket, the selectivity of 208 | the 209 | expression is roughly *(h / w) / ntups*, since *(h/w)* 210 | represents the expected number of tuples in the bin with value 211 | *const*. 212 | * To estimate the selectivity of a range expression *f>const*, 213 | compute the 214 | bucket *b* that *const* is in, with width *w_b* and height 215 | *h_b*. Then, *b* contains a fraction *b_f = h_b / ntups* of the 216 | total tuples. Assuming tuples are uniformly distributed throughout *b*, 217 | the fraction *b_part* of *b* that is *> const* is 218 | *(b_right - const) / w_b*, where *b_right* is the right endpoint of 219 | *b*'s bucket. Thus, bucket *b* contributes *(b_f x 220 | b_part)* selectivity to the predicate. In addition, buckets 221 | *b+1...NumB-1* contribute all of their 222 | selectivity (which can be computed using a formula similar to 223 | *b_f* above). Summing the selectivity contributions of all the 224 | buckets will yield the overall selectivity of the expression. 225 | Figure 2 illustrates this process. 226 | * Selectivity of expressions involving *less than* can be performed 227 | similar to the greater than case, looking at buckets down to 0. 228 | 229 |
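To make the bucket arithmetic above concrete, here is a stand-alone sketch of an equality and a greater-than estimator. This is *not* the provided IntHistogram skeleton -- the class and method names here are our own -- it simply transcribes the *(h / w) / ntups* and *(b_right - const) / w_b* formulas described above:

```java
// Stand-alone sketch of the bucket-based estimates above.
// All names (SketchHistogram, estimateEquals, ...) are illustrative, not SimpleDB's.
class SketchHistogram {
    private final int[] counts;  // h: tuples per bucket
    private final int min;
    private final double width;  // w: range of values covered by one bucket
    private int ntups = 0;

    SketchHistogram(int numBuckets, int min, int max) {
        this.counts = new int[numBuckets];
        this.min = min;
        this.width = (double) (max - min + 1) / numBuckets;
    }

    void addValue(int v) { counts[bucketOf(v)]++; ntups++; }

    private int bucketOf(int v) {
        return Math.min((int) ((v - min) / width), counts.length - 1);
    }

    // selectivity of f = const: roughly (h / w) / ntups
    double estimateEquals(int c) {
        if (c < min || c >= min + width * counts.length) return 0.0;
        return (counts[bucketOf(c)] / width) / ntups;
    }

    // selectivity of f > const: partial credit (b_right - const) / w for
    // const's own bucket, full credit for every bucket to its right
    double estimateGreaterThan(int c) {
        if (c < min) return 1.0;
        if (c >= min + width * counts.length - 1) return 0.0;
        int b = bucketOf(c);
        double bRight = min + width * (b + 1);  // right endpoint of bucket b
        double sel = (counts[b] / (double) ntups) * ((bRight - c) / width);
        for (int i = b + 1; i < counts.length; i++)
            sel += counts[i] / (double) ntups;
        return sel;
    }
}
```

For the real IntHistogram you will also need the remaining Predicate.Op cases (LESS_THAN, NOT_EQUALS, and so on), which follow from these two by symmetry and complement.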

230 | ![histogram selectivity estimation](lab3-hist.png)
231 | Figure 2: Diagram illustrating the histograms you will implement in this lab

233 | 234 | 235 | In the next two exercises, you will write code to perform selectivity estimation of 236 | joins and filters. 237 | 238 | *** 239 | **Exercise 1: IntHistogram.java** 240 | 241 | You will need to implement 242 | some way to record table statistics for selectivity estimation. We have 243 | provided a skeleton class, IntHistogram, that will do this. Our 244 | intent is that you calculate histograms using the bucket-based method described 245 | above, but you are free to use some other method so long as it provides 246 | reasonable selectivity estimates. 247 | 248 | 249 | We have provided a class StringHistogram that uses 250 | IntHistogram to compute selectivities for String 251 | predicates. You may modify StringHistogram if you want to 252 | implement a better estimator, though you should not need to in order to 253 | complete this lab. 254 | 255 | After completing this exercise, you should be able to pass the 256 | IntHistogramTest unit test (you are not required to pass this test if you 257 | choose not to implement histogram-based selectivity estimation). 258 | 259 | *** 260 | **Exercise 2: TableStats.java** 261 | 262 | The class TableStats contains methods that compute 263 | the number of tuples and pages in a table and that estimate the 264 | selectivity of predicates over the fields of that table. The 265 | query parser we have created creates one instance of TableStats per 266 | table, and passes these structures into your query optimizer (which 267 | you will need in later exercises). 268 | 269 | You should fill in the following methods and classes in TableStats: 270 | 271 | * Implement the TableStats constructor: 272 | Once you have 273 | implemented a method for tracking statistics such as histograms, you 274 | should implement the TableStats constructor, adding code 275 | to scan the table (possibly multiple times) to build the statistics 276 | you need.
277 | * Implement estimateSelectivity(int field, Predicate.Op op, 278 | Field constant): Using your statistics (e.g., an IntHistogram 279 | or StringHistogram depending on the type of the field), estimate 280 | the selectivity of predicate field op constant on the table. 281 | * Implement estimateScanCost(): This method estimates the 282 | cost of sequentially scanning the file, given that the cost to read 283 | a page is costPerPageIO. You can assume that there are no 284 | seeks and that no pages are in the buffer pool. This method may 285 | use costs or sizes you computed in the constructor. 286 | * Implement estimateTableCardinality(double 287 | selectivityFactor): This method returns the number of tuples 288 | in the relation, given that a predicate with selectivity 289 | selectivityFactor is applied. This method may 290 | use costs or sizes you computed in the constructor. 291 | 292 | You may wish to modify the constructor of TableStats.java to, for 293 | example, compute histograms over the fields as described above for 294 | purposes of selectivity estimation. 295 | 296 | After completing these tasks you should be able to pass the unit tests 297 | in TableStatsTest. 298 | *** 299 | 300 | #### 2.2.4 Join Cardinality 301 | 302 | Finally, observe that the cost for the join plan p above 303 | includes expressions of the form joincost((t1 join t2) join 304 | t3). To evaluate this expression, you need some way to estimate 305 | the size (ntups) of t1 join t2. This *join 306 | cardinality estimation* problem is harder than the filter selectivity 307 | estimation problem. In this lab, you aren't required to do anything 308 | fancy for this, though one of the optional exercises in Section 2.4 309 | includes a histogram-based method for join selectivity estimation.
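The scan and join cost formulas from Sections 2.2.1 and 2.2.2 can be sanity-checked numerically before you wire them into TableStats and JoinOptimizer. The following stand-alone sketch transcribes them directly; the class name, method names, and the SCALING_FACTOR value are illustrative, not SimpleDB's:

```java
// Numeric transcription of the cost formulas in Sections 2.2.1-2.2.2.
// All names and the scaling constant here are illustrative, not SimpleDB's.
class CostSketch {
    static final double SCALING_FACTOR = 100.0;  // cost of one page I/O vs. one predicate application

    // scancost(t) = numPages(t) x SCALING_FACTOR
    static double scanCost(int numPages) {
        return numPages * SCALING_FACTOR;
    }

    // joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2)  // I/O cost
    //                      + ntups(t1) x ntups(t2)                    // CPU cost
    static double nestedLoopsJoinCost(int pages1, long ntups1, int pages2, long ntups2) {
        return scanCost(pages1) + ntups1 * scanCost(pages2) + (double) ntups1 * ntups2;
    }
}
```

Note how the outer table's tuple count multiplies the inner table's full scan cost: with 10 and 5 pages and 1000 and 500 tuples, the join costs 1000 + 500000 + 500000 = 1001000 units, so putting the smaller-cardinality table on the outside usually pays off.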
310 | 311 | 312 | While implementing your simple solution, you should keep in mind the following: 313 | 314 | * For equality joins, when one of the attributes is a primary key, the number of tuples produced by the join cannot 315 | be larger than the cardinality of the non-primary key attribute. 316 | * For equality joins when there is no primary key, it's hard to say much about what the size of the output 317 | is -- it could be the size of the product of the cardinalities of the tables (if both tables have the 318 | same value for all tuples) -- or it could be 0. It's fine to make up a simple heuristic (say, 319 | the size of the larger of the two tables). 320 | * For range scans, it is similarly hard to say anything accurate about sizes. 321 | The size of the output should be proportional to 322 | the sizes of the inputs. It is fine to assume that a fixed fraction 323 | of the cross-product is emitted by range scans (say, 30%). In general, the cost of a range 324 | join should be larger than the cost of a non-primary key equality join of two tables 325 | of the same size. 326 | 327 | 328 | 329 | 330 | *** 331 | **Exercise 3: Join Cost Estimation** 332 | 333 | 334 | The class JoinOptimizer.java includes all of the methods 335 | for ordering and computing costs of joins. In this exercise, you 336 | will write the methods for estimating the selectivity and cost of 337 | a join, specifically: 338 | 339 | * Implement 340 | estimateJoinCost(LogicalJoinNode j, int card1, int card2, double 341 | cost1, double cost2): This method estimates the cost of 342 | join j, given that the left input is of cardinality card1, the 343 | right input of cardinality card2, that the cost to scan the left 344 | input is cost1, and that the cost to access the right input is 345 | cost2. You can assume the join is an NL join, and apply 346 | the formula mentioned earlier.
347 | * Implement estimateJoinCardinality(LogicalJoinNode j, int 348 | card1, int card2, boolean t1pkey, boolean t2pkey): This 349 | method estimates the number of tuples output by join j, given that 350 | the left input is size card1, the right input is size card2, and 351 | the flags t1pkey and t2pkey that indicate whether the left and 352 | right (respectively) field is unique (a primary key). 353 | 354 | After implementing these methods, you should be able to pass the unit 355 | tests estimateJoinCostTest and estimateJoinCardinality in JoinOptimizerTest.java. 356 | *** 357 | 358 | 359 | ### 2.3 Join Ordering 360 | 361 | Now that you have implemented methods for estimating costs, you will 362 | implement the Selinger optimizer. For these methods, joins are 363 | expressed as a list of join nodes (e.g., predicates over two tables) 364 | as opposed to a list of relations to join as described in class. 365 | 366 | An outline in pseudocode would be: 367 | 368 | ``` 369 | 1. j = set of join nodes 370 | 2. for (i in 1...|j|): 371 | 3. for s in {all length i subsets of j} 372 | 4. bestPlan = {} 373 | 5. for s' in {all length i-1 subsets of s} 374 | 6. subplan = optjoin(s') 375 | 7. plan = best way to join (s-s') to subplan 376 | 8. if (cost(plan) < cost(bestPlan)) 377 | 9. bestPlan = plan 378 | 10. optjoin(s) = bestPlan 379 | 11. return optjoin(j) 380 | ``` 381 | 382 | To help you implement this algorithm, we have provided several classes and methods to assist you. First, 383 | the method enumerateSubsets(Vector v, int size) in JoinOptimizer.java will return 384 | a set of all of the subsets of v of size size. This method is not particularly efficient; you can earn 385 | extra credit by implementing a more efficient enumerator.
386 | 387 | Second, we have provided the method: 388 | ```java 389 | private CostCard computeCostAndCardOfSubplan(HashMap stats, 390 | HashMap filterSelectivities, 391 | LogicalJoinNode joinToRemove, 392 | Set joinSet, 393 | double bestCostSoFar, 394 | PlanCache pc) 395 | ``` 396 | 397 | Given a subset of joins (joinSet), and a join to remove from 398 | this set (joinToRemove), this method computes the best way to 399 | join joinToRemove to joinSet - {joinToRemove}. It 400 | returns this best method in a CostCard object, which includes 401 | the cost, cardinality, and best join ordering (as a vector). 402 | computeCostAndCardOfSubplan may return null, if no plan can 403 | be found (because, for example, there is no left-deep join that is 404 | possible), or if the cost of all plans is greater than the 405 | bestCostSoFar argument. The method uses a cache of previous 406 | joins called pc (optjoin in the pseudocode above) to 407 | quickly lookup the fastest way to join joinSet - 408 | {joinToRemove}. The other arguments (stats and 409 | filterSelectivities) are passed into the orderJoins 410 | method that you must implement as a part of Exercise 4, and are 411 | explained below. This method essentially performs lines 6--8 of the 412 | pseudocode described earlier. 413 | 414 | Third, we have provided the method: 415 | ```java 416 | private void printJoins(Vector js, 417 | PlanCache pc, 418 | HashMap stats, 419 | HashMap selectivities) 420 | ``` 421 | 422 | This method can be used to display a graphical representation of a join plan (when the "explain" flag is set via 423 | the "-explain" option to the optimizer, for example). 424 | 425 | Fourth, we have provided a class PlanCache that can be used 426 | to cache the best way to join a subset of the joins considered so far 427 | in your implementation of Selinger (an instance of this class is 428 | needed to use computeCostAndCardOfSubplan). 
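To see how the pseudocode, the plan cache, and the per-step cost computation fit together, here is a self-contained toy version of the dynamic program. "Joins" are just the integers 0..n-1, stepCost() is an arbitrary stand-in for computeCostAndCardOfSubplan, and two HashMaps play the role of the PlanCache; none of these names come from SimpleDB:

```java
import java.util.*;

// Toy, self-contained version of the Selinger dynamic program above.
// Subsets are represented as bitmasks; stepCost() is a hypothetical
// stand-in for computeCostAndCardOfSubplan.
class SelingerSketch {
    // Hypothetical cost of appending join j to an already-ordered prefix,
    // where prefixMask has one bit set per join in the prefix.
    static double stepCost(int prefixMask, int j) {
        return Integer.bitCount(prefixMask) * 10.0 + j;
    }

    static List<Integer> orderJoins(int n) {
        Map<Integer, Double> bestCost = new HashMap<>();        // the "PlanCache"
        Map<Integer, List<Integer>> bestPlan = new HashMap<>();
        bestCost.put(0, 0.0);
        bestPlan.put(0, new ArrayList<>());

        for (int size = 1; size <= n; size++) {                 // line 2 of the pseudocode
            for (int s = 0; s < (1 << n); s++) {                // line 3: subsets of this size
                if (Integer.bitCount(s) != size) continue;
                for (int j = 0; j < n; j++) {                   // line 5: choose the join done last
                    if ((s & (1 << j)) == 0) continue;
                    int rest = s & ~(1 << j);                   // s - {j}, solved at the previous size
                    double cost = bestCost.get(rest) + stepCost(rest, j);
                    if (!bestCost.containsKey(s) || cost < bestCost.get(s)) {
                        bestCost.put(s, cost);                  // lines 8-10: cache the best plan
                        List<Integer> plan = new ArrayList<>(bestPlan.get(rest));
                        plan.add(j);
                        bestPlan.put(s, plan);
                    }
                }
            }
        }
        return bestPlan.get((1 << n) - 1);                      // line 11: best order of all joins
    }
}
```

In your actual orderJoins you would iterate with enumerateSubsets, call computeCostAndCardOfSubplan for each (subset, removed join) pair, and store results in the provided PlanCache, but the loop structure is the same.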
429 | 430 | *** 431 | **Exercise 4: Join Ordering** 432 | 433 | 434 | In JoinOptimizer.java, implement the method: 435 | ```java 436 | Vector orderJoins(HashMap stats, 437 | HashMap filterSelectivities, 438 | boolean explain) 439 | ``` 440 | 441 | This method should operate on the joins class member, 442 | returning a new Vector that specifies the order in which joins 443 | should be done. Item 0 of this vector indicates the left-most, 444 | bottom-most join in a left-deep plan. Adjacent joins in the 445 | returned vector should share at least one field to ensure the plan 446 | is left-deep. Here stats is an object that lets you find 447 | the TableStats for a given table name that appears in the 448 | FROM list of the query. filterSelectivities 449 | allows you to find the selectivity of any predicates over a table; 450 | it is guaranteed to have one entry per table name in the 451 | FROM list. Finally, explain specifies that you 452 | should output a representation of the join order for informational purposes. 453 | 454 | 455 | You may wish to use the helper methods and classes described above to assist 456 | in your implementation. Roughly, your implementation should follow 457 | the pseudocode above, looping through subset sizes, subsets, and 458 | sub-plans of subsets, calling computeCostAndCardOfSubplan and 459 | building a PlanCache object that stores the minimal-cost 460 | way to perform each subset join. 461 | 462 | After implementing this method, you should be able to pass all the unit tests in 463 | JoinOptimizerTest. You should also pass the system test 464 | QueryTest. 465 | *** 466 | 467 | 468 | ### 2.4 Extra Credit 469 | 470 | In this section, we describe several optional exercises that you may 471 | implement for extra credit. These are less well defined than the 472 | previous exercises but give you a chance to show off your mastery of 473 | query optimization!
474 | 475 | *** 476 | **Bonus Exercises.** Each of these bonuses is worth up to 5% extra credit: 477 | 478 | * *Add code to perform more advanced join cardinality estimation*. 479 | Rather than using simple heuristics to estimate join cardinality, 480 | devise a more sophisticated algorithm. 481 | * One option is to use joint histograms between 482 | every pair of attributes *a* and *b* in every pair of tables *t1* and *t2*. 483 | The idea is to create buckets of *a*, and for each bucket *A* of *a*, create a 484 | histogram of *b* values that co-occur with *a* values in *A*. 485 | * Another way to estimate the cardinality of a join is to assume that each value in the smaller table has a matching value in the larger table. Then the formula for the join selectivity would be: 1/(*Max*(*num-distinct*(t1, column1), *num-distinct*(t2, column2))). Here, column1 and column2 are the join attributes. The cardinality of the join is then the product of the cardinalities of *t1* and *t2* times the selectivity.
486 | * *Improved subset iterator*. Our implementation of 487 | enumerateSubsets is quite inefficient, because it creates 488 | a large number of Java objects on each invocation. A better 489 | approach would be to implement an iterator that, for example, 490 | returns a BitSet that specifies the elements in the 491 | joins vector that should be accessed on each iteration. 492 | In this bonus exercise, you would improve the performance of 493 | enumerateSubsets so that your system could perform query 494 | optimization on plans with 20 or more joins (currently such plans 495 | take minutes or hours to compute). 496 | * *A cost model that accounts for caching*. The methods to 497 | estimate scan and join cost do not account for caching in the 498 | buffer pool. You should extend the cost model to account for 499 | caching effects. This is tricky because multiple joins are 500 | running simultaneously due to the iterator model, and so it may be 501 | hard to predict how much memory each will have access to using the 502 | simple buffer pool we have implemented in previous labs. 503 | * *Improved join algorithms and algorithm selection*. Our 504 | current cost estimation and join operator selection algorithms 505 | (see instantiateJoin() in JoinOptimizer.java) 506 | only consider nested loops joins. Extend these methods to use one 507 | or more additional join algorithms (for example, some form of 508 | in-memory hashing using a HashMap). 509 | * *Bushy plans*. Improve the provided orderJoins() and other helper 510 | methods to generate bushy joins. Our query plan 511 | generation and visualization algorithms are perfectly capable of 512 | handling bushy plans; for example, if orderJoins() 513 | returns the vector (t1 join t2 ; t3 join t4 ; t2 join t3), this 514 | will correspond to a bushy plan with the (t2 join t3) node at the top. 515 | 516 | *** 517 | 518 | 519 | You have now completed this lab. 520 | Good work! 521 | 522 | ## 3.
Logistics 523 | You must submit your code (see below) as well as a short (2 pages, maximum) 524 | writeup describing your approach. This writeup should: 525 | 526 | * Describe any design decisions you made, including your methods for selectivity estimation 527 | and join ordering, as well as any of the bonus exercises you chose to implement and how 528 | you implemented them (for each bonus exercise you may submit up to 1 additional page). 529 | * Discuss and justify any changes you made to the API. 530 | * Describe any missing or incomplete elements of your code. 531 | * Describe how long you spent on the lab, and whether there was anything 532 | you found particularly difficult or confusing. 533 | 534 | ### 3.1. Collaboration 535 | This lab should be manageable for a single person, but if you prefer 536 | to work with a partner, this is also OK. Larger groups are not allowed. 537 | Please indicate clearly who you worked with, if anyone, on your writeup. 538 | 539 | ### 3.2. Submitting your assignment 540 | 541 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11:59 PM on the due date). Place the write-up in a file called lab3-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 542 | 543 | You also need to explicitly add any other files you create, such as new *.java 544 | files. 545 | 546 | The criterion for your lab being submitted on time is that your code must be 547 | **tagged** and 548 | **pushed** by the due date and time. This means that if one of the TAs or the 549 | instructor were to open up GitHub, they would be able to see your solutions on 550 | the GitHub web page. 551 | 552 | Just because your code has been committed on your local machine does not 553 | mean that it has been **submitted**; it needs to be on GitHub.
554 | 555 | There is a bash script `turnInLab3.sh` in the root level directory of simple-db-hw that commits 556 | your changes, deletes any prior tag 557 | for the current lab, tags the current commit, and pushes the tag 558 | to GitHub. If you are using Linux or Mac OSX, you should be able to run the following: 559 | 560 | ```bash 561 | $ ./turnInLab3.sh 562 | ``` 563 | You should see something like the following output: 564 | 565 | ```bash 566 | $ ./turnInLab3.sh 567 | error: tag 'lab3submit' not found. 568 | remote: warning: Deleting a non-existent ref. 569 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 570 | - [deleted] lab1submit 571 | [master 7a26701] Lab 3 572 | 1 file changed, 0 insertions(+), 0 deletions(-) 573 | create mode 100644 aaa 574 | Counting objects: 3, done. 575 | Delta compression using up to 4 threads. 576 | Compressing objects: 100% (3/3), done. 577 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 578 | Total 3 (delta 1), reused 0 (delta 0) 579 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 580 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 581 | 069856c..7a26701 master -> master 582 | * [new tag] lab3submit -> lab3submit 583 | ``` 584 | 585 | 586 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for Lab 3 as follows: 587 | 588 | 1. Look at your current repository status. 589 | 590 | ```bash 591 | $ git status 592 | ``` 593 | 594 | 2. Add and commit your code changes (if they aren't already added and committed). 595 | 596 | ```bash 597 | $ git commit -a -m 'Lab 3' 598 | ``` 599 | 600 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*) 601 | 602 | ```bash 603 | $ git tag -d lab3submit 604 | $ git push origin :refs/tags/lab3submit 605 | ``` 606 | 607 | 4.
Tag your last commit as the lab to be graded 608 | ```bash 609 | $ git tag -a lab3submit -m 'submit lab 3' 610 | ``` 611 | 612 | 5. This is the most important part: **push** your solutions to GitHub. 613 | 614 | ```bash 615 | $ git push origin master --tags 616 | ``` 617 | 618 | 6. The last thing that we strongly recommend you do is to go to the 619 | [MIT-DB-Class] organization page on GitHub to 620 | make sure that we can see your solutions. 621 | 622 | Just navigate to your repository and check that your latest commits are on 623 | GitHub. You should also be able to check 624 | `https://github.com/MIT-DB-Class/homework-solns-2018-(mit id)/tree/lab3submit` 625 | 626 | 627 | #### Word of Caution 628 | 629 | Git is a distributed version control system. This means everything operates 630 | offline until you run `git pull` or `git push`. This is a great feature. 631 | 632 | The bad thing is that you may forget to `git push` your changes. This is why we 633 | strongly, **strongly** suggest that you check GitHub to be sure that what you 634 | want us to see matches up with what you expect. 635 | 636 | 637 | 638 | 639 | ### 3.3. Submitting a bug 640 | 641 | SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc. 642 | 643 | We ask you, therefore, to do this lab with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out 644 | yourself or send us a friendly email. 645 | 646 | Please submit (friendly!) bug reports to 6.830-staff@mit.edu. 647 | When you do, please try to include: 648 | 649 | * A description of the bug. 650 | * A .java file we can drop in the 651 | `test/simpledb` directory, compile, and run. 652 | * A .txt file with the data that reproduces the bug. We should be 653 | able to convert it to a .dat file using `HeapFileEncoder`. 
654 | 
655 | You can also post on the class page on Piazza if you feel you have run into a bug.
656 | 
657 | 
658 | ### 3.4 Grading
659 | 50% of your grade will be based on whether or not your code passes the
660 | test suite we will run over it. These tests will be a superset
661 | of the tests we have provided. Before handing in your code, you should
662 | make sure it produces no errors (passes all of the tests) from both
663 | `ant test` and `ant systemtest`.
664 | 
665 | **Important:** before testing, we will replace your build.xml,
666 | HeapFileEncoder.java, BPlusTreeFileEncoder.java, and the entire contents of the
667 | test/ directory with our version of these files! This
668 | means you cannot change the format of .dat files! You should
669 | therefore be careful when changing our APIs. This also means you need to test
670 | whether your code compiles with our test programs. In other words, we will
671 | pull your repo, replace the files mentioned above, compile it, and then
672 | grade it. It will look roughly like this:
673 | 
674 | ```
675 | $ git pull
676 | [replace build.xml, HeapFileEncoder.java, BPlusTreeFileEncoder.java and test]
677 | $ ant test
678 | $ ant systemtest
679 | [additional tests]
680 | ```
681 | If any of these commands fail, we'll be unhappy, and, therefore, so will your grade.
682 | 
683 | 
684 | An additional 50% of your grade will be based on the quality of your
685 | writeup and our subjective evaluation of your code.
686 | 
687 | 
688 | We've had a lot of fun designing this assignment, and we hope you enjoy
689 | hacking on it!
690 | 
--------------------------------------------------------------------------------
/lab1.md:
--------------------------------------------------------------------------------
1 | # 6.830 Lab 1: SimpleDB
2 | 
3 | **Assigned: Mon, Sept 17**
4 | 
5 | **Due: Wed, Sept 26 11:59 PM EDT**
6 | 
7 | 
8 | 
14 | 
15 | In the lab assignments in 6.830 you will write a basic database management system called SimpleDB.
For this lab, you will focus on implementing the core modules required to access data stored on disk; in future labs, you will add support for various query processing operators, as well as transactions, locking, and concurrent queries.
16 | 
17 | SimpleDB is written in Java. We have provided you with a set of mostly unimplemented classes and interfaces. You will need to write the code for these classes. We will grade your code by running a set of system tests written using [JUnit](http://junit.sourceforge.net/). We have also provided a number of unit tests, which we will not use for grading but that you may find useful in verifying that your code works.
18 | 
19 | The remainder of this document describes the basic architecture of SimpleDB, gives some suggestions about how to start coding, and discusses how to hand in your lab.
20 | 
21 | We **strongly recommend** that you start as early as possible on this lab. It requires you to write a fair amount of code!
22 | 
23 | 
39 | 
40 | 
41 | ## 0. Environment Setup
42 | 
43 | **Start by downloading the code for lab 1 from the course GitHub repository by following the instructions [here](https://github.com/MIT-DB-Class/course-info-2018).**
44 | 
45 | These instructions are written for Athena or any other Unix-based platform (e.g., Linux, MacOS, etc.). Because the code is written in Java, it should work under Windows as well, although the directions in this document may not apply.
46 | 
47 | We have included [Section 1.2](#eclipse) on using the project with Eclipse.
48 | 
49 | 
50 | ## 1. Getting started
51 | 
52 | 
53 | SimpleDB uses the [Ant build tool](http://ant.apache.org/) to compile the code and run tests. Ant is similar to [make](http://www.gnu.org/software/make/manual/), but the build file is written in XML and is somewhat better suited to Java code. Most modern Linux distributions include Ant. Under Athena, it is included in the `sipb` locker, which you can get to by typing `add sipb` at the Athena prompt.
Note that on some versions of Athena you must also run `add -f java` to set the environment correctly for Java programs. See the [Athena documentation on using Java](http://web.mit.edu/acs/www/languages.html#Java) for more details.
54 | 
55 | To help you during development, we have provided a set of unit tests in addition to the end-to-end tests that we use for grading. These are by no means comprehensive, and you should not rely on them exclusively to verify the correctness of your project (put those 6.170 skills to use!).
56 | 
57 | To run the unit tests use the `test` build target:
58 | 
59 | ```
60 | $ cd [project-directory]
61 | $ # run all unit tests
62 | $ ant test
63 | $ # run a specific unit test
64 | $ ant runtest -Dtest=TupleTest
65 | ```
66 | 
67 | You should see output similar to:
68 | 
69 | ```
70 | build output...
71 | 
72 | test:
73 | [junit] Running simpledb.CatalogTest
74 | [junit] Testsuite: simpledb.CatalogTest
75 | [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.037 sec
76 | [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.037 sec
77 | 
78 | ... stack traces and error reports ...
79 | ```
80 | 
81 | The output above indicates that two errors occurred while running the tests; this is because the code we have given you doesn't yet work. As you complete parts of the lab, you will work towards passing additional unit tests.
82 | 
83 | If you wish to write new unit tests as you code, they should be added to the test/simpledb directory.
84 | 
85 | 

For more details about how to use Ant, see the [manual](http://ant.apache.org/manual/). The [Running Ant](http://ant.apache.org/manual/running.html) section provides details about using the `ant` command. However, the quick reference table below should be sufficient for working on the labs.
86 | 
87 | Command | Description
88 | --- | ---
89 | ant|Build the default target (for simpledb, this is dist).
90 | ant -projecthelp|List all the targets in `build.xml` with descriptions.
91 | ant dist|Compile the code in src and package it in `dist/simpledb.jar`.
92 | ant test|Compile and run all the unit tests.
93 | ant runtest -Dtest=testname|Run the unit test named `testname`.
94 | ant systemtest|Compile and run all the system tests.
95 | ant runsystest -Dtest=testname|Compile and run the system test named `testname`.
96 | 
97 | 
98 | If you are on Windows and don't want to run the Ant tests from the command line, you can also run them from Eclipse. Right-click build.xml; in the Targets tab you will see targets such as "runtest" and "runsystest". For example, selecting runtest is equivalent to running "ant runtest" from the command line. Arguments such as "-Dtest=testname" can be specified in the "Arguments" textbox of the "Main" tab. Note that you can also create a shortcut to runtest by copying build.xml, modifying its targets and arguments, and renaming it to, say, runtest_build.xml.
99 | 
100 | ### 1.1. Running end-to-end tests
101 | 
102 | We have also provided a set of end-to-end tests that will eventually be used for grading. These tests are structured as JUnit tests that live in the test/simpledb/systemtest directory. To run all the system tests, use the `systemtest` build target:
103 | 
104 | ```
105 | $ ant systemtest
106 | 
107 | ... build output ...
108 | 109 | [junit] Testcase: testSmall took 0.017 sec 110 | [junit] Caused an ERROR 111 | [junit] expected to find the following tuples: 112 | [junit] 19128 113 | [junit] 114 | [junit] java.lang.AssertionError: expected to find the following tuples: 115 | [junit] 19128 116 | [junit] 117 | [junit] at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:122) 118 | [junit] at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:83) 119 | [junit] at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:75) 120 | [junit] at simpledb.systemtest.ScanTest.validateScan(ScanTest.java:30) 121 | [junit] at simpledb.systemtest.ScanTest.testSmall(ScanTest.java:40) 122 | 123 | ... more error messages ... 124 | ``` 125 | 126 |

This indicates that this test failed, showing the stack trace where the error was detected. To debug, start by reading the source code where the error occurred. When the tests pass, you will see something like the following: 127 | 128 | ``` 129 | $ ant systemtest 130 | 131 | ... build output ... 132 | 133 | [junit] Testsuite: simpledb.systemtest.ScanTest 134 | [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 7.278 sec 135 | [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 7.278 sec 136 | [junit] 137 | [junit] Testcase: testSmall took 0.937 sec 138 | [junit] Testcase: testLarge took 5.276 sec 139 | [junit] Testcase: testRandom took 1.049 sec 140 | 141 | BUILD SUCCESSFUL 142 | Total time: 52 seconds 143 | ``` 144 | 145 | #### 1.1.1 Creating dummy tables 146 | 147 | It is likely you'll want to create your own tests and your own data tables to test your own implementation of SimpleDB. You can create any .txt file and convert it to a .dat file in SimpleDB's `HeapFile` format using the command: 148 | 149 | ``` 150 | $ java -jar dist/simpledb.jar convert file.txt N 151 | ``` 152 | 153 | where file.txt is the name of the file and N is the number of columns in the file. Notice that file.txt has to be in the following format: 154 | 155 | ``` 156 | int1,int2,...,intN 157 | int1,int2,...,intN 158 | int1,int2,...,intN 159 | int1,int2,...,intN 160 | ``` 161 | 162 | ...where each intN is a non-negative integer. 163 | 164 | To view the contents of a table, use the `print` command: 165 | 166 | ``` 167 | $ java -jar dist/simpledb.jar print file.dat N 168 | ``` 169 | 170 | where file.dat is the name of a table created with the convert command, and N is the number of columns in the file. 171 | 172 | 173 | 174 | ### 1.2. Working in Eclipse 175 | 176 | [Eclipse](http://www.eclipse.org) is a graphical software development environment that you might be more comfortable with working in. 
The instructions we provide were generated using Eclipse for Java Developers (not the enterprise edition) with Java 1.7.
177 | 
178 | **Setting the Lab Up in Eclipse**
179 | 
180 | * Once Eclipse is installed, start it, and note that the first screen asks you to select a location for your workspace (we will refer to this directory as $W). Select the directory containing your simple-db-hw repository.
181 | * In Eclipse, select File->New->Project->Java->Java Project, and push Next.
182 | * Enter "simple-db-hw" as the project name.
183 | * On the same screen that you entered the project name, select "Create project from existing source," and browse to $W/simple-db-hw.
184 | * Click finish, and you should be able to see "simple-db-hw" as a new project in the Project Explorer tab on the left-hand side of your screen. Opening this project reveals the directory structure discussed above - implementation code can be found in "src," and unit tests and system tests found in "test."
185 | 
186 | **Note:** This class assumes that you are using the official Oracle release of Java. This is the default on MacOS X, and for most Windows Eclipse installs; but many Linux distributions default to alternate Java runtimes (like OpenJDK). Please download the latest Java8 updates from [Oracle Website](http://www.oracle.com/technetwork/java/javase/downloads/index.html), and use that Java version. If you don't switch, you may see spurious test failures in some of the performance tests in later labs.
187 | 
188 | **Running Individual Unit and System Tests**
189 | 
190 | To run a unit test or system test (both are JUnit tests, and can be initialized the same way), go to the Package Explorer tab on the left side of your screen. Under the "simple-db-hw" project, open the "test" directory. Unit tests are found in the "simpledb" package, and system tests are found in the "simpledb.systemtests" package.
To run one of these tests, select the test (they are all called *Test.java - don't select TestUtil.java or SystemTestUtil.java), right click on it, select "Run As," and select "JUnit Test." This will bring up a JUnit tab, which will tell you the status of the individual tests within the JUnit test suite, and will show you exceptions and other errors that will help you debug problems. 191 | 192 | **Running Ant Build Targets** 193 | 194 | If you want to run commands such as "ant test" or "ant systemtest," right click on build.xml in the Package Explorer. Select "Run As," and then "Ant Build..." (note: select the option with the ellipsis (...), otherwise you won't be presented with a set of build targets to run). Then, in the "Targets" tab of the next screen, check off the targets you want to run (probably "dist" and one of "test" or "systemtest"). This should run the build targets and show you the results in Eclipse's console window. 195 | 196 | ### 1.3. Implementation hints 197 | 198 | Before beginning to write code, we **strongly encourage** you to read through this entire document to get a feel for the high-level design of SimpleDB. 199 | 200 |

201 | 202 | You will need to fill in any piece of code that is not implemented. It will be obvious where we think you should write code. You may need to add private methods and/or helper classes. You may change APIs, but make sure our [grading](#grading) tests still run and make sure to mention, explain, and defend your decisions in your writeup. 203 | 204 |

205 | 206 | In addition to the methods that you need to fill out for this lab, the class interfaces contain numerous methods that you need not implement until subsequent labs. These will either be indicated per class: 207 | 208 | ```java 209 | // Not necessary for lab1. 210 | public class Insert implements DbIterator { 211 | ``` 212 | 213 | or per method: 214 | 215 | ```Java 216 | public boolean deleteTuple(Tuple t) throws DbException { 217 | // some code goes here 218 | // not necessary for lab1 219 | return false; 220 | } 221 | ``` 222 | 223 | 224 | The code that you submit should compile without having to modify these methods. 225 | 226 |

227 | 
228 | We suggest exercises throughout this document to guide your implementation, but you may find that a different order makes more sense for you.
229 | 
230 | **Here's a rough outline of one way you might proceed with your SimpleDB implementation:**
231 | 
232 | ****
233 | * Implement the classes to manage tuples, namely Tuple, TupleDesc. We have already implemented Field, IntField, StringField, and Type for you. Since you only need to support integer and (fixed length) string fields and fixed length tuples, these are straightforward.
234 | * Implement the Catalog (this should be very simple).
235 | * Implement the BufferPool constructor and the getPage() method.
236 | * Implement the access methods, HeapPage and HeapFile, and associated ID classes. A good portion of these files has already been written for you.
237 | * Implement the operator SeqScan.
238 | * At this point, you should be able to pass the ScanTest system test, which is the goal for this lab.
239 | 
240 | ***
241 | 
242 | Section 2 below walks you through these implementation steps and the unit tests corresponding to each one in more detail.
243 | 
244 | ### 1.4. Transactions, locking, and recovery
245 | 
246 | As you look through the interfaces we have provided you, you will see a number of references to locking, transactions, and recovery. You do not need to support these features in this lab, but you should keep these parameters in the interfaces of your code because you will be implementing transactions and locking in a future lab. The test code we have provided you with generates a fake transaction ID that is passed into the operators of the query it runs; you should pass this transaction ID into other operators and the buffer pool.
247 | 
248 | ## 2. 
SimpleDB Architecture and Implementation Guide 249 | 250 | SimpleDB consists of: 251 | 252 | 253 | * Classes that represent fields, tuples, and tuple schemas; 254 | * Classes that apply predicates and conditions to tuples; 255 | * One or more access methods (e.g., heap files) that store relations on disk and provide a way to iterate through tuples of those relations; 256 | * A collection of operator classes (e.g., select, join, insert, delete, etc.) that process tuples; 257 | * A buffer pool that caches active tuples and pages in memory and handles concurrency control and transactions (neither of which you need to worry about for this lab); and, 258 | * A catalog that stores information about available tables and their schemas. 259 | 260 | 261 | SimpleDB does not include many things that you may think of as being a part of a "database." In particular, SimpleDB does not have: 262 | 263 | * (In this lab), a SQL front end or parser that allows you to type queries directly into SimpleDB. Instead, queries are built up by chaining a set of operators together into a hand-built query plan (see [Section 2.7](#query_walkthrough)). We will provide a simple parser for use in later labs. 264 | * Views. 265 | * Data types except integers and fixed length strings. 266 | * (In this lab) Query optimizer. 267 | * (In this lab) Indices. 268 | 269 |

270 | 271 | In the rest of this Section, we describe each of the main components of SimpleDB that you will need to implement in this lab. You should use the exercises in this discussion to guide your implementation. This document is by no means a complete specification for SimpleDB; you will need to make decisions about how to design and implement various parts of the system. Note that for Lab 1 you do not need to implement any operators (e.g., select, join, project) except sequential scan. You will add support for additional operators in future labs. 272 | 273 |

274 | 275 | ### 2.1. The Database Class 276 | 277 | The Database class provides access to a collection of static objects that are the global state of the database. In particular, this includes methods to access the catalog (the list of all the tables in the database), the buffer pool (the collection of database file pages that are currently resident in memory), and the log file. You will not need to worry about the log file in this lab. We have implemented the Database class for you. You should take a look at this file as you will need to access these objects. 278 | 279 | ### 2.2. Fields and Tuples 280 | 281 |

Tuples in SimpleDB are quite basic. They consist of a collection of `Field` objects, one per field in the `Tuple`. `Field` is an interface that different data types (e.g., integer, string) implement. `Tuple` objects are created by the underlying access methods (e.g., heap files, or B-trees), as described in the next section. Tuples also have a type (or schema), called a _tuple descriptor_, represented by a `TupleDesc` object. This object consists of a collection of `Type` objects, one per field in the tuple, each of which describes the type of the corresponding field.
282 | 
283 | ### Exercise 1
284 | 
285 | **Implement the skeleton methods in:**
286 | ***
287 | * src/simpledb/TupleDesc.java
288 | * src/simpledb/Tuple.java
289 | 
290 | ***
291 | 
292 | 
293 | At this point, your code should pass the unit tests TupleTest and TupleDescTest. However, modifyRecordId() should fail because you haven't implemented it yet.
294 | 
295 | ### 2.3. Catalog
296 | 
297 | The catalog (class `Catalog` in SimpleDB) consists of a list of the tables and schemas of the tables that are currently in the database. You will need to support the ability to add a new table, as well as to get information about a particular table. Associated with each table is a `TupleDesc` object that allows operators to determine the types and number of fields in a table.
298 | 
299 | The global catalog is a single instance of `Catalog` that is allocated for the entire SimpleDB process. The global catalog can be retrieved via the method `Database.getCatalog()`, and the same goes for the global buffer pool (using `Database.getBufferPool()`).
300 | 
301 | ### Exercise 2
302 | 
303 | **Implement the skeleton methods in:**
304 | ***
305 | * src/simpledb/Catalog.java
306 | 
307 | ***
308 | 
309 | At this point, your code should pass the unit tests in CatalogTest.
310 | 
311 | 
312 | ### 2.4. BufferPool
313 | 
314 | 

The buffer pool (class `BufferPool` in SimpleDB) is responsible for caching pages in memory that have been recently read from disk. All operators read and write pages from various files on disk through the buffer pool. It consists of a fixed number of pages, defined by the `numPages` parameter to the `BufferPool` constructor. For this lab, you only need to implement the constructor and the `BufferPool.getPage()` method used by the SeqScan operator. The BufferPool should store up to `numPages` pages. For this lab, if more than `numPages` requests are made for different pages, then instead of implementing an eviction policy, you may throw a DbException. In future labs you will be required to implement an eviction policy.
315 | 
316 | The `Database` class provides a static method, `Database.getBufferPool()`, that returns a reference to the single BufferPool instance for the entire SimpleDB process.
317 | 
318 | ### Exercise 3
319 | 
320 | **Implement the `getPage()` method in:**
321 | 
322 | ***
323 | * src/simpledb/BufferPool.java
324 | 
325 | ***
326 | 
327 | We have not provided unit tests for BufferPool. The functionality you implement will be tested in the implementation of HeapFile below. You should use the `DbFile.readPage` method to access pages of a DbFile.
328 | 
329 | 
330 | 
335 | 
336 | 
345 | 
346 | ### 2.5. HeapFile access method
347 | 
348 | Access methods provide a way to read or write data from disk that is arranged in a specific way. Common access methods include heap files (unsorted files of tuples) and B-trees; for this assignment, you will only implement a heap file access method, and we have written some of the code for you.
349 | 
350 | 

351 | 
352 | A `HeapFile` object is arranged into a set of pages, each of which consists of a fixed number of bytes for storing tuples (defined by the constant `BufferPool.DEFAULT_PAGE_SIZE`), including a header. In SimpleDB, there is one `HeapFile` object for each table in the database. Each page in a `HeapFile` is arranged as a set of slots, each of which can hold one tuple (tuples for a given table in SimpleDB are all of the same size). In addition to these slots, each page has a header that consists of a bitmap with one bit per tuple slot. If the bit corresponding to a particular tuple is 1, it indicates that the tuple is valid; if it is 0, the tuple is invalid (e.g., has been deleted or was never initialized.) Pages of `HeapFile` objects are of type `HeapPage`, which implements the `Page` interface. Pages are stored in the buffer pool but are read and written by the `HeapFile` class.
353 | 
354 | 

355 | 356 | SimpleDB stores heap files on disk in more or less the same format they are stored in memory. Each file consists of page data arranged consecutively on disk. Each page consists of one or more bytes representing the header, followed by the _page size_ bytes of actual page content. Each tuple requires _tuple size_ * 8 bits for its content and 1 bit for the header. Thus, the number of tuples that can fit in a single page is: 357 | 358 |

359 | 360 | ` 361 | _tuples per page_ = floor((_page size_ * 8) / (_tuple size_ * 8 + 1)) 362 | ` 363 | 364 |

365 | 
366 | Where _tuple size_ is the size of a tuple in the page in bytes. The idea here is that each tuple requires one additional bit of storage in the header. We compute the number of bits in a page (by multiplying page size by 8), and divide this quantity by the number of bits in a tuple (including this extra header bit) to get the number of tuples per page. The floor operation rounds down to the nearest integer number of tuples (we don't want to store partial tuples on a page!)
367 | 
368 | 

369 | 370 | Once we know the number of tuples per page, the number of bytes required to store the header is simply: 371 |

372 | 373 | ` 374 | headerBytes = ceiling(tupsPerPage/8) 375 | ` 376 | 377 |

378 | 379 | The ceiling operation rounds up to the nearest integer number of bytes (we never store less than a full byte of header information.) 380 | 381 |
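As a quick sanity check of the two formulas above, here is a minimal, self-contained sketch in Java. The 4096-byte page and 12-byte tuple are example values chosen for illustration, not prescribed by the lab, and `PageMath` is not part of the SimpleDB API:

```java
// Sketch of the page-layout arithmetic described above.
public class PageMath {
    static int tuplesPerPage(int pageSizeBytes, int tupleSizeBytes) {
        // Each tuple costs tupleSize * 8 bits of data plus 1 header bit;
        // integer division gives the floor.
        return (pageSizeBytes * 8) / (tupleSizeBytes * 8 + 1);
    }

    static int headerBytes(int tuplesPerPage) {
        // Round up to a whole byte of header.
        return (tuplesPerPage + 7) / 8;
    }

    public static void main(String[] args) {
        int tpp = tuplesPerPage(4096, 12);
        System.out.println("tuples per page = " + tpp);           // 337
        System.out.println("header bytes    = " + headerBytes(tpp)); // 43
    }
}
```

For a 4096-byte page of 12-byte tuples, floor(32768 / 97) = 337 tuples, needing ceiling(337/8) = 43 header bytes.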

382 | 
383 | The low (least significant) bits of each byte represent the status of the slots that are earlier in the file. Hence, the lowest bit of the first byte represents whether or not the first slot in the page is in use. The second lowest bit of the first byte represents whether or not the second slot in the page is in use, and so on. Also, note that the high-order bits of the last byte may not correspond to a slot that is actually in the file, since the number of slots may not be a multiple of 8. Also note that all Java virtual machines are [big-endian](http://en.wikipedia.org/wiki/Endianness).
384 | 
385 | 
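To make the bit layout concrete, here is a self-contained sketch of the slot test. The method name mirrors the `isSlotUsed` method you will write in HeapPage, but this standalone class is an illustration, not the SimpleDB API:

```java
// Checks the header bitmap convention described above: the low bit of
// header[0] is slot 0, the next bit is slot 1, and so on.
public class SlotBitmap {
    static boolean isSlotUsed(byte[] header, int i) {
        int b = header[i / 8] & 0xFF;      // mask off Java's sign extension
        return ((b >> (i % 8)) & 1) == 1;  // test the bit for slot i
    }

    public static void main(String[] args) {
        // 0x05 = binary 0000_0101: slots 0 and 2 used, slot 1 free.
        byte[] header = new byte[] { 0x05 };
        System.out.println(isSlotUsed(header, 0)); // true
        System.out.println(isSlotUsed(header, 1)); // false
        System.out.println(isSlotUsed(header, 2)); // true
    }
}
```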

386 | 387 | ### Exercise 4 388 | 389 | 390 | 391 | **Implement the skeleton methods in:** 392 | *** 393 | * src/simpledb/HeapPageId.java 394 | * src/simpledb/RecordID.java 395 | * src/simpledb/HeapPage.java 396 | 397 | *** 398 | 399 | 400 | Although you will not use them directly in Lab 1, we ask you to implement getNumEmptySlots() and isSlotUsed() in HeapPage. These require pushing around bits in the page header. You may find it helpful to look at the other methods that have been provided in HeapPage or in src/simpledb/HeapFileEncoder.java to understand the layout of pages. 401 | 402 | 403 | You will also need to implement an Iterator over the tuples in the page, which may involve an auxiliary class or data structure. 404 | 405 | At this point, your code should pass the unit tests in HeapPageIdTest, RecordIDTest, and HeapPageReadTest. 406 | 407 | 408 |

409 | 410 | After you have implemented HeapPage, you will write methods for HeapFile in this lab to calculate the number of pages in a file and to read a page from the file. You will then be able to fetch tuples from a file stored on disk. 411 | 412 | ### Exercise 5 413 | 414 | **Implement the skeleton methods in:** 415 | 416 | *** 417 | * src/simpledb/HeapFile.java 418 | 419 | *** 420 | 421 | To read a page from disk, you will first need to calculate the correct offset in the file. Hint: you will need random access to the file in order to read and write pages at arbitrary offsets. You should not call BufferPool methods when reading a page from disk. 422 | 423 |
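The offset arithmetic the hint refers to can be sketched with `java.io.RandomAccessFile`. The 4096-byte page size and the helper names below are assumptions for illustration, not the SimpleDB API:

```java
import java.io.*;

public class PageReader {
    static final int PAGE_SIZE = 4096; // example page size, not prescribed here

    // Page i lives at byte offset i * PAGE_SIZE in the file.
    static long pageOffset(int pgNo) {
        return (long) pgNo * PAGE_SIZE;
    }

    // Seek to the page's offset and read exactly PAGE_SIZE bytes.
    static byte[] readPage(File f, int pgNo) throws IOException {
        byte[] data = new byte[PAGE_SIZE];
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(pageOffset(pgNo));
            raf.readFully(data);
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Demo: a scratch file holding two zero-filled "pages".
        File f = File.createTempFile("heap", ".dat");
        f.deleteOnExit();
        byte[] bytes = new byte[2 * PAGE_SIZE];
        bytes[PAGE_SIZE] = 42; // first byte of page 1
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(bytes);
        }
        System.out.println(readPage(f, 1)[0]); // prints 42
    }
}
```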

424 | You will also need to implement the `HeapFile.iterator()` method, which should iterate through the tuples of each page in the HeapFile. The iterator must use the `BufferPool.getPage()` method to access pages in the `HeapFile`. This method loads the page into the buffer pool and will eventually be used (in a later lab) to implement locking-based concurrency control and recovery. Do not load the entire table into memory on the open() call -- this will cause an out of memory error for very large tables.
425 | 
426 | 
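This iterator is an instance of a generic pattern: flatten a sequence of per-page iterators into one tuple stream, fetching each page lazily only when the previous one is exhausted. Here is a self-contained sketch of the pattern; it is not the SimpleDB `DbFile.iterator()` interface:

```java
import java.util.*;

// Lazily flattens an iterator of "page" iterators, so that only one
// page's iterator is live at a time -- the shape of HeapFile.iterator().
public class FlattenIterator<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> pages;
    private Iterator<T> current = Collections.emptyIterator();

    public FlattenIterator(Iterator<Iterator<T>> pages) {
        this.pages = pages;
    }

    public boolean hasNext() {
        // Advance to the next non-empty page only when needed.
        while (!current.hasNext() && pages.hasNext())
            current = pages.next();
        return current.hasNext();
    }

    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> pages = Arrays.asList(
            Arrays.asList(1, 2).iterator(),
            Collections.<Integer>emptyIterator(),  // an empty "page"
            Arrays.asList(3).iterator());
        Iterator<Integer> it = new FlattenIterator<>(pages.iterator());
        while (it.hasNext()) System.out.print(it.next() + " "); // 1 2 3
    }
}
```

In the real HeapFile, "fetching the next page" would be a `BufferPool.getPage()` call instead of an in-memory list access.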

427 | 428 | At this point, your code should pass the unit tests in HeapFileReadTest. 429 | 430 | 431 | ### 2.6. Operators 432 | 433 | Operators are responsible for the actual execution of the query plan. They implement the operations of the relational algebra. In SimpleDB, operators are iterator based; each operator implements the `DbIterator` interface. 434 | 435 |

436 | 437 | Operators are connected together into a plan by passing lower-level operators into the constructors of higher-level operators, i.e., by 'chaining them together.' Special access method operators at the leaves of the plan are responsible for reading data from the disk (and hence do not have any operators below them). 438 | 439 |

440 | 441 | At the top of the plan, the program interacting with SimpleDB simply calls `getNext` on the root operator; this operator then calls `getNext` on its children, and so on, until these leaf operators are called. They fetch tuples from disk and pass them up the tree (as return arguments to `getNext`); tuples propagate up the plan in this way until they are output at the root or combined or rejected by another operator in the plan. 442 | 443 |
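This pull-based control flow can be sketched in a few lines. The `Op` interface below is a simplified stand-in for `DbIterator`, with tuples reduced to ints; it is an illustration only, not the SimpleDB API:

```java
import java.util.*;

// Pull-based (iterator-model) execution: the parent calls next() on its
// child, which calls next() on *its* child, down to the leaf scan.
interface Op { boolean hasNext(); int next(); }

// Leaf "access method": produces tuples from an in-memory array.
class Scan implements Op {
    private final int[] data; private int pos = 0;
    Scan(int[] data) { this.data = data; }
    public boolean hasNext() { return pos < data.length; }
    public int next() { return data[pos++]; }
}

// Parent operator: pulls from its child and passes some tuples through.
class Filter implements Op {
    private final Op child; private Integer buffered;
    Filter(Op child) { this.child = child; }
    public boolean hasNext() {
        while (buffered == null && child.hasNext()) {
            int t = child.next();
            if (t % 2 == 0) buffered = t;  // keep only even "tuples"
        }
        return buffered != null;
    }
    public int next() {
        if (!hasNext()) throw new NoSuchElementException();
        int t = buffered; buffered = null; return t;
    }
}

public class PlanDemo {
    public static void main(String[] args) {
        Op root = new Filter(new Scan(new int[]{1, 2, 3, 4}));
        while (root.hasNext()) System.out.println(root.next()); // 2, then 4
    }
}
```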

444 | 445 | 454 | 455 | For this lab, you will only need to implement one SimpleDB operator. 456 | 457 | ### Exercise 6. 458 | 459 | **Implement the skeleton methods in:** 460 | 461 | *** 462 | * src/simpledb/SeqScan.java 463 | 464 | *** 465 | This operator sequentially scans all of the tuples from the pages of the table specified by the `tableid` in the constructor. This operator should access tuples through the `DbFile.iterator()` method. 466 | 467 |

At this point, you should be able to complete the ScanTest system test. Good work! 468 | 469 | You will fill in other operators in subsequent labs. 470 | 471 | 472 | 473 | ### 2.7. A simple query 474 | 475 | The purpose of this section is to illustrate how these various components are connected together to process a simple query. 476 | 477 | Suppose you have a data file, "some_data_file.txt", with the following contents: 478 | ``` 479 | 1,1,1 480 | 2,2,2 481 | 3,4,4 482 | ``` 483 |

484 | You can convert this into a binary file that SimpleDB can query as follows: 485 |

486 | 
```
$ java -jar dist/simpledb.jar convert some_data_file.txt 3
```
487 | 

488 | Here, the argument "3" tells convert that the input has 3 columns.
489 | 

490 | The following code implements a simple selection query over this file. This code is equivalent to the SQL statement `SELECT * FROM some_data_file`. 491 | 492 | ``` 493 | package simpledb; 494 | import java.io.*; 495 | 496 | public class test { 497 | 498 | public static void main(String[] argv) { 499 | 500 | // construct a 3-column table schema 501 | Type types[] = new Type[]{ Type.INT_TYPE, Type.INT_TYPE, Type.INT_TYPE }; 502 | String names[] = new String[]{ "field0", "field1", "field2" }; 503 | TupleDesc descriptor = new TupleDesc(types, names); 504 | 505 | // create the table, associate it with some_data_file.dat 506 | // and tell the catalog about the schema of this table. 507 | HeapFile table1 = new HeapFile(new File("some_data_file.dat"), descriptor); 508 | Database.getCatalog().addTable(table1, "test"); 509 | 510 | // construct the query: we use a simple SeqScan, which spoonfeeds 511 | // tuples via its iterator. 512 | TransactionId tid = new TransactionId(); 513 | SeqScan f = new SeqScan(tid, table1.getId()); 514 | 515 | try { 516 | // and run it 517 | f.open(); 518 | while (f.hasNext()) { 519 | Tuple tup = f.next(); 520 | System.out.println(tup); 521 | } 522 | f.close(); 523 | Database.getBufferPool().transactionComplete(tid); 524 | } catch (Exception e) { 525 | System.out.println ("Exception : " + e); 526 | } 527 | } 528 | 529 | } 530 | ``` 531 | 532 | The table we create has three integer fields. To express this, we create a `TupleDesc` object and pass it an array of `Type` objects, and optionally an array of `String` field names. Once we have created this `TupleDesc`, we initialize a `HeapFile` object representing the table stored in `some_data_file.dat`. Once we have created the table, we add it to the catalog. If this were a database server that was already running, we would have this catalog information loaded. We need to load it explicitly to make this code self-contained. 
533 | 534 | Once we have finished initializing the database system, we create a query plan. Our plan consists only of the `SeqScan` operator that scans the tuples from disk. In general, these operators are instantiated with references to the appropriate table (in the case of `SeqScan`) or child operator (in the case of, e.g., `Filter`). The test program then repeatedly calls `hasNext` and `next` on the `SeqScan` operator. As tuples are output from the `SeqScan`, they are printed out on the command line. 535 | 536 | We **strongly recommend** you try this out as a fun end-to-end test that will help you get experience writing your own test programs for SimpleDB. You should create the file `test.java` in the `src/simpledb` directory with the code above, and place the `some_data_file.dat` file in the top-level directory. Then run: 537 | 538 | ``` 539 | ant 540 | java -classpath dist/simpledb.jar simpledb.test 541 | ``` 542 | 543 | Note that `ant` compiles `test.java` and generates a new jarfile that contains it. 544 | 545 | ## 3. Logistics 546 | 547 | You must submit your code (see below) as well as a short (2 pages, maximum) writeup describing your approach. This writeup should: 548 | 549 | * Describe any design decisions you made. These may be minimal for Lab 1. 550 | * Discuss and justify any changes you made to the API. 551 | * Describe any missing or incomplete elements of your code. 552 | * Describe how long you spent on the lab, and whether there was anything you found particularly difficult or confusing. 553 | 554 | ### 3.1. Collaboration 555 | 556 | This lab should be manageable for a single person, but if you prefer to work with a partner, this is also OK. Larger groups are not allowed. Please indicate clearly who you worked with, if anyone, on your individual writeup. 557 | 558 | ### 3.2. 
Submitting your assignment 559 | 564 | 565 | You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (11:59 PM on the due date). Place the write-up in a file called lab1-writeup.txt, which has been created for you in the top level of your simple-db-hw directory. 566 | 567 | You also need to explicitly add any other files you create, such as new *.java files. 568 | 569 | The criterion for your lab being submitted on time is that your code must be **tagged** and **pushed** by the due date and time. This means that if one of the TAs or the instructor were to open up GitHub, they would be able to see your solutions on the GitHub web page. 570 | 571 | Just because your code has been committed on your local machine does not mean that it has been **submitted**; it needs to be on GitHub. 572 | 573 | There is a bash script `turnInLab1.sh` in the root-level directory of simple-db-hw that commits your changes, deletes any prior tag for the current lab, tags the current commit, and pushes the branch and tag to GitHub. If you are using Linux or Mac OS X, you should be able to run the following: 574 | 575 | ```bash 576 | $ ./turnInLab1.sh 577 | ``` 578 | 579 | You should see something like the following output: 580 | 581 | ```bash 582 | $ ./turnInLab1.sh 583 | error: tag 'lab1submit' not found. 584 | remote: warning: Deleting a non-existent ref. 585 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 586 | - [deleted] lab1submit 587 | [master 7a26701] Lab 1 588 | 1 file changed, 0 insertions(+), 0 deletions(-) 589 | create mode 100644 aaa 590 | Counting objects: 3, done. 591 | Delta compression using up to 4 threads. 592 | Compressing objects: 100% (3/3), done. 593 | Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done. 594 | Total 3 (delta 1), reused 0 (delta 0) 595 | remote: Resolving deltas: 100% (1/1), completed with 1 local objects. 
596 | To git@github.com:MIT-DB-Class/homework-solns-2018-.git 597 | 069856c..7a26701 master -> master 598 | * [new tag] lab1submit -> lab1submit 599 | ``` 600 | 601 | 602 | If the above command worked for you, you can skip to item 6 below. If not, submit your solutions for lab 1 as follows: 603 | 604 | 1. Look at your current repository status. 605 | 606 | ```bash 607 | $ git status 608 | ``` 609 | 610 | 2. Add and commit your code changes (if they aren't already added and committed). 611 | 612 | ```bash 613 | $ git commit -a -m 'Lab 1' 614 | ``` 615 | 616 | 3. Delete any prior local and remote tag (*this will return an error if you have not tagged previously; this allows you to submit multiple times*). 617 | 618 | ```bash 619 | $ git tag -d lab1submit 620 | $ git push origin :refs/tags/lab1submit 621 | ``` 622 | 623 | 4. Tag your last commit as the lab to be graded. 624 | ```bash 625 | $ git tag -a lab1submit -m 'submit lab 1' 626 | ``` 627 | 628 | 5. This is the most important part: **push** your solutions to GitHub. 629 | 630 | ```bash 631 | $ git push origin master --tags 632 | ``` 633 | 634 | 6. The last thing that we strongly recommend you do is to go to the 635 | [MIT-DB-Class] organization page on GitHub to 636 | make sure that we can see your solutions. 637 | 638 | Just navigate to your repository and check that your latest commits are on 639 | GitHub. You should also be able to check 640 | `https://github.com/MIT-DB-Class/homework-solns-2018-/tree/lab1submit` 641 | 642 | 643 | #### Word of Caution 644 | 645 | Git is a distributed version control system. This means everything operates offline until you run `git pull` or `git push`. This is a great feature. 646 | 647 | The bad thing is that you may forget to `git push` your changes. This is why we strongly, **strongly** suggest that you check GitHub to be sure that what you want us to see matches up with what you expect. 648 | 649 | 650 | 651 | 652 | ### 3.3. 
Submitting a bug 653 | 654 | Please submit (friendly!) bug reports to [6.830-staff@mit.edu](mailto:6.830-staff@mit.edu). When you do, please try to include: 655 | 656 | 657 | * A description of the bug. 658 | * A .java file we can drop in the test/simpledb directory, compile, and run. 659 | * A .txt file with the data that reproduces the bug. We should be able to convert it to a .dat file using HeapFileEncoder. 660 | 661 | If you are the first person to report a particular bug in the code, we will give you a candy bar! 662 | 663 | 664 | 665 | 666 | 667 | ### 3.4 Grading 668 | 669 |

75% of your grade will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both `ant test` and `ant systemtest`. 670 | 671 | **Important:** before testing, we will replace your `build.xml` and the entire contents of the `test` directory with our version of these files. This means you cannot change the format of .dat files! You should also be careful when changing our APIs. You should test that your code compiles against the unmodified tests. 672 | 673 | In other words, we will pull your repo, replace the files mentioned above, compile it, and then grade it. It will look roughly like this: 674 | 675 | ``` 676 | [replace build.xml and test] 677 | $ git checkout -- build.xml test/ 678 | $ ant test 679 | $ ant systemtest 680 | [additional tests] 681 | ``` 682 | 683 |

If any of these commands fail, we'll be unhappy, and, therefore, so will your grade. 684 | 685 | An additional 25% of your grade will be based on the quality of your writeup and our subjective evaluation of your code. 686 | 687 | We've had a lot of fun designing this assignment, and we hope you enjoy hacking on it! 688 | --------------------------------------------------------------------------------