├── README.md
├── lab0
│   ├── .gitignore
│   ├── Makefile
│   ├── README-zh.md
│   ├── README.md
│   ├── casegen.go
│   ├── go.mod
│   ├── go.sum
│   ├── mapreduce.go
│   ├── report.md
│   ├── urltop10.go
│   ├── urltop10_example.go
│   ├── urltop10_test.go
│   └── utils.go
└── stmtflow
    ├── README.md
    ├── test_autocommit_read.t.sql
    ├── test_commit_nothing.t.sql
    ├── test_conflict.t.sql
    ├── test_in_txn_read.t.sql
    ├── test_snap_read.t.sql
    └── test_unique_index.t.sql

/README.md:
--------------------------------------------------------------------------------
# VLDB Summer School 2021 Course
--------------------------------------------------------------------------------
/lab0/.gitignore:
--------------------------------------------------------------------------------
.idea
.vscode
--------------------------------------------------------------------------------
/lab0/Makefile:
--------------------------------------------------------------------------------
.PHONY: all

all: test_example test_homework cleanup gendata

test_example:
	go test -v -run=TestExampleURLTop

test_homework:
	go test -v -run=TestURLTop

cleanup:
	go test -v -run=TestCleanData

gendata:
	go test -v -run=TestGenData
--------------------------------------------------------------------------------
/lab0/README-zh.md:
--------------------------------------------------------------------------------
# Map-Reduce Task

[English Ver](./README.md)

## 介绍

我们为 VLDB Summer School 2021 设置了一个 Map-Reduce 的小作业。通过完成本课程，你可以学习到一些 Golang 的基础知识并对分布式系统有一个基础的了解。本课程与 MIT 6.824 的 [lab1](https://pdos.csail.mit.edu/6.824/labs/lab-mr.html) 相似，适合作为学习分布式系统的第一步。

本课程提供了一个未完成的 Map-Reduce 框架，你需要完成这个框架，并且使用它从文件中找到出现频率最高的 10 个 URL。

## 课程

我们推荐 MIT 6.824 的第一课（介绍章节）来学习课程。MIT 6.824 对分布式系统进行了整体介绍并且详细地讲解了 Map-Reduce 框架。

- [Official] https://www.youtube.com/watch?v=cQP8WApzIQQ
- [Mirror] https://www.bilibili.com/video/BV1R7411t71W?p=1

Map-Reduce 是一个著名的分布式计算框架，除了 MIT 6.824，也可以参考网上的其他资料来帮助你学习。

对于 Golang 的初学者，我们推荐先通过 [Online Go tutorial](https://tour.golang.org/) 学习语言。本实验中需要几项重要的技能：

- 使用 goroutine、channel 和 WaitGroup
- Interface
- 读写文件

在 MIT 6.824 的课程指南中，还有很多学习建议，例如如何对复杂系统进行 Debug，推荐感兴趣的同学阅读。

- https://pdos.csail.mit.edu/6.824/labs/guidance.html

如果有任何问题，可以在 AskTUG 论坛上进行讨论。

- https://asktug.com/tags/vldbss

## 实验介绍

### 1. 完成 Map-Reduce 框架

#### 要求

Map-Reduce 的框架代码在 `mapreduce.go` 中，但是这个框架尚未完成。

map 和 reduce 的函数定义与 MIT 6.824 相同。

```go
type ReduceF func(key string, values []string) string
type MapF func(filename string, contents string) []KeyValue
```

在 `urltop10_example.go` 中有一个基于 Map-Reduce 框架实现的例子，能够找到出现频率最高的 10 个 URL。但是因为框架尚未完成，这个例子还无法运行，你需要完成框架，让 `urltop10_example.go` 中的函数正确运行。

#### TODO

- 在 `mapreduce.go` 的 `YOUR CODE HERE` 标记处补充代码，完成框架。

- 运行 `make test_example` 来测试给出的例子，使其通过测试。
### 2. 基于 Map-Reduce 框架编写 Map-Reduce 函数

#### 要求

在完成 Map-Reduce 的框架并通过测试之后，你需要使用所实现的框架，在 `urltop10.go` 中实现自己的 `MapF` 和 `ReduceF` 来完成本项课程。

#### TODO

- 完成 `urltop10.go` 中的 `URLTop10` 函数，函数逻辑可以参考 `urltop10_example.go` 中的 `ExampleURLTop10`。

- 运行 `make test_homework` 使自己编写的 Map-Reduce 函数通过测试。

### 帮助信息

可以使用 `make gendata` 生成测试数据。

所有的数据文件会在运行过程中被生成，可以使用 `make cleanup` 进行清理。

请以字典序输出结果，并且确保格式与生成的测试数据中的 `result` 文件一致，这样才能够通过测试。

不同的测试有**不同的数据分布**，这一点在实现过程中需要被考虑到。

### 提交作业

在完成所有作业后，你需要通过邮件的方式提交作业。

- 使用 [report.md](./report.md) 模板撰写报告，请勿修改模板或增加额外内容，尽量简洁精炼。
- 将代码打包，和 PDF 报告一起发送到 vldbss.2021@pingcap.com。
- 邮件标题格式为“学校-姓名 lab0 作业”。
--------------------------------------------------------------------------------
/lab0/README.md:
--------------------------------------------------------------------------------
# Map-Reduce Task

[中文版](./README-zh.md)

## Introduction

This is a Map-Reduce homework assigned before VLDB Summer School 2021. By completing this task, you will pick up some basic skills in Go and gain a basic understanding of distributed systems. This task is similar to [lab1](https://pdos.csail.mit.edu/6.824/labs/lab-mr.html) of MIT 6.824 and is a good starting point for studying distributed systems.

You are given an incomplete Map-Reduce framework; you should complete it and use it to extract the 10 most frequent URLs from data files.

## Course

We recommend the introduction chapter of MIT 6.824's course for this lab. The video gives an overview of distributed systems and dives into the Map-Reduce framework.

- [Official] https://www.youtube.com/watch?v=cQP8WApzIQQ
- [Mirror] https://www.bilibili.com/video/BV1R7411t71W?p=1

Map-Reduce is a well-known distributed computation framework; you can also find other videos and blog posts about it online.

For learning Go, we recommend the [Online Go tutorial](https://tour.golang.org/). If you are new to the language, it is a good idea to finish the tutorial before starting this lab. This lab relies on a few important skills:

- Using goroutines, channels, and WaitGroup
- Interfaces
- Reading and writing files

MIT 6.824's guidance page offers further advice, such as how to debug complex systems. Check it out if you are interested.

- https://pdos.csail.mit.edu/6.824/labs/guidance.html

Feel free to ask questions in the AskTUG forum.

- https://asktug.com/tags/vldbss

## Lab

### 1. Complete the Map-Reduce framework

#### Requirements

A simple Map-Reduce framework is defined in `mapreduce.go`; however, it is not complete yet.

The map and reduce functions are defined the same way as in MIT 6.824 lab 1.

```go
type ReduceF func(key string, values []string) string
type MapF func(filename string, contents string) []KeyValue
```
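To make the signatures concrete, here is a minimal word-count sketch that fits them. It is only an illustration, not part of the lab code; it assumes the lab's `KeyValue` type and the standard `fmt` and `strings` packages are available in the package.

```go
// wcMap is a hypothetical MapF: it emits one KeyValue per
// whitespace-separated word, using the word itself as the key.
func wcMap(filename string, contents string) []KeyValue {
	words := strings.Fields(contents)
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w})
	}
	return kvs
}

// wcReduce is the matching ReduceF: the framework groups values by
// key, so len(values) is the number of occurrences of the word.
func wcReduce(key string, values []string) string {
	return fmt.Sprintf("%s %d\n", key, len(values))
}
```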
There is an example in `urltop10_example.go` which extracts the 10 most frequent URLs. Since the framework is incomplete, the example cannot run yet. You need to complete the framework so that the functions in `urltop10_example.go` run correctly.

#### TODO

- Fill in your code in `mapreduce.go` below the `YOUR CODE HERE` comments.

- Run `make test_example` to test the given example and make it pass.

### 2. Write your own Map-Reduce functions based on the framework

#### Requirements

After your Map-Reduce framework passes the tests with the example, implement your own `MapF` and `ReduceF` in `urltop10.go` to accomplish this task.

#### TODO

- Complete the `URLTop10` function in `urltop10.go`. You can refer to `ExampleURLTop10` in `urltop10_example.go`.

- Run `make test_homework` to make your own Map-Reduce functions pass the tests.

### Helps

You can use `make gendata` to generate the test data in advance.

All data files are generated at runtime, and you can use `make cleanup` to clean up all test data.

Please output URLs in lexicographical order and ensure that your result has the same format as the `result` file in the generated test data, so that you can pass all tests.
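For reference, the generator in `casegen.go` writes the expected answer as `url: count` lines, ordered by count descending with ties broken lexicographically. With hypothetical counts, a `result` file looks like:

```
github.com/pingcap/tidb/issues/1: 120
github.com/pingcap/tidb/pull/7: 96
github.com/pingcap/tidb/3: 51
```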
Each test case has a **different data distribution**, and you should take this into account.

### Submit Your Work

After you are done with all the tasks above, you need to submit your homework by email.

- Write your report following [report.md](./report.md). Note that you may not edit the template or add further content. Your report should be concise and clear.
- Package your code and send it together with the PDF report to vldbss.2021@pingcap.com.
- The mail title should be "College-Name Lab0 Homework".
--------------------------------------------------------------------------------
/lab0/casegen.go:
--------------------------------------------------------------------------------
package main

import (
	"fmt"
	"math/rand"
	"path"
	"sort"
)

type DataSize int

const (
	KB = 1 << 10
	MB = 1 << 20
	GB = 1 << 30
)

func (d DataSize) String() string {
	if d < KB {
		return fmt.Sprintf("%dbyte", d)
	} else if d < MB {
		return fmt.Sprintf("%dKB", d/KB)
	} else if d < GB {
		return fmt.Sprintf("%dMB", d/MB)
	}
	return fmt.Sprintf("%dGB", d/GB)
}

// Case represents a test case.
type Case struct {
	MapFiles   []string // input files for map function
	ResultFile string   // expected result
}

// CaseGenF represents a test case generator function.
type CaseGenF func(dataFileDir string, totalDataSize, nMapFiles int) Case

// AllCaseGenFs returns all CaseGenFs used to test.
func AllCaseGenFs() []CaseGenF {
	var gs []CaseGenF
	gs = append(gs, genUniformCases()...)
	gs = append(gs, genPercentCases()...)
	gs = append(gs, CaseSingleURLPerFile)
	return gs
}

func genUniformCases() []CaseGenF {
	cardinalities := []int{1, 7, 200, 10000, 1000000}
	gs := make([]CaseGenF, 0, len(cardinalities))
	for i := range cardinalities {
		card := cardinalities[i]
		gs = append(gs, func(dataFileDir string, totalDataSize, nMapFiles int) Case {
			if FileOrDirExist(dataFileDir) {
				files := make([]string, 0, nMapFiles)
				for i := 0; i < nMapFiles; i++ {
					fpath := path.Join(dataFileDir, fmt.Sprintf("inputMapFile%d", i))
					files = append(files, fpath)
				}
				rpath := path.Join(dataFileDir, "result")
				return Case{
					MapFiles:   files,
					ResultFile: rpath,
				}
			}
			urls, avgLen := randomNURL(card)
			eachRecords := (totalDataSize / nMapFiles) / avgLen
			files := make([]string, 0, nMapFiles)
			urlCount := make(map[string]int, len(urls))
			for i := 0; i < nMapFiles; i++ {
				fpath := path.Join(dataFileDir, fmt.Sprintf("inputMapFile%d", i))
				files = append(files, fpath)
				f, buf := CreateFileAndBuf(fpath)
				for i := 0; i < eachRecords; i++ {
					str := urls[rand.Int()%len(urls)]
					urlCount[str]++
					WriteToBuf(buf, str, "\n")
				}
				SafeClose(f, buf)
			}

			rpath := path.Join(dataFileDir, "result")
			genResult(rpath, urlCount)
			return Case{
				MapFiles:   files,
				ResultFile: rpath,
			}
		})
	}
	return gs
}

func genPercentCases() []CaseGenF {
	ps := []struct {
		l int
		p []float64
	}{
		{11, []float64{0.9, 0.09, 0.009, 0.0009, 0.00009, 0.000009}},
		{10000, []float64{0.9, 0.09, 0.009, 0.0009, 0.00009, 0.000009}},
		{100000, []float64{0.9, 0.09, 0.009, 0.0009, 0.00009, 0.000009}},
		{10000, []float64{0.5, 0.4}},
		{10000, []float64{0.3, 0.3, 0.3}},
	}
	gs := make([]CaseGenF, 0, len(ps))
	for i := range ps {
		p := ps[i]
		gs = append(gs, func(dataFileDir string, totalDataSize, nMapFiles int) Case {
			if FileOrDirExist(dataFileDir) {
				files := make([]string, 0, nMapFiles)
				for i := 0; i < nMapFiles; i++ {
					fpath := path.Join(dataFileDir, fmt.Sprintf("inputMapFile%d", i))
					files = append(files, fpath)
				}
				rpath := path.Join(dataFileDir, "result")
				return Case{
					MapFiles:   files,
					ResultFile: rpath,
				}
			}

			// make up percents list
			percents := make([]float64, 0, p.l)
			percents = append(percents, p.p...)
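			// The explicit percentages in p.p must not exceed the total
			// probability mass, and their count must not exceed the target
			// cardinality p.l; the remaining mass is spread uniformly over
			// the other p.l-len(p.p) URLs.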
			var sum float64
			for _, p := range p.p {
				sum += p
			}
			if sum > 1 || len(p.p) > p.l {
				panic("invalid prefix")
			}
			x := (1 - sum) / float64(p.l-len(p.p))
			for i := 0; i < p.l-len(p.p); i++ {
				percents = append(percents, x)
			}

			// generate data
			urls, avgLen := randomNURL(len(percents))
			eachRecords := (totalDataSize / nMapFiles) / avgLen
			files := make([]string, 0, nMapFiles)
			urlCount := make(map[string]int, len(urls))

			// Build prefix sums of the percentages so that a uniform sample
			// in [0, 1) can be mapped to a URL index by binary search.
			accumulate := make([]float64, len(percents)+1)
			accumulate[0] = 0
			for i := range percents {
				accumulate[i+1] = accumulate[i] + percents[i]
			}

			for i := 0; i < nMapFiles; i++ {
				fpath := path.Join(dataFileDir, fmt.Sprintf("inputMapFile%d", i))
				files = append(files, fpath)
				f, buf := CreateFileAndBuf(fpath)
				for i := 0; i < eachRecords; i++ {
					x := rand.Float64()
					idx := sort.SearchFloat64s(accumulate, x)
					if idx != 0 {
						idx--
					}
					str := urls[idx]
					urlCount[str]++
					WriteToBuf(buf, str, "\n")
				}
				SafeClose(f, buf)
			}

			rpath := path.Join(dataFileDir, "result")
			genResult(rpath, urlCount)
			return Case{
				MapFiles:   files,
				ResultFile: rpath,
			}
		})
	}
	return gs
}

// CaseSingleURLPerFile generates a case where each input file contains a single distinct URL.
func CaseSingleURLPerFile(dataFileDir string, totalDataSize, nMapFiles int) Case {
	if FileOrDirExist(dataFileDir) {
		files := make([]string, 0, nMapFiles)
		for i := 0; i < nMapFiles; i++ {
			fpath := path.Join(dataFileDir, fmt.Sprintf("inputMapFile%d", i))
			files = append(files, fpath)
		}
		rpath := path.Join(dataFileDir, "result")
		return Case{
			MapFiles:   files,
			ResultFile: rpath,
		}
	}
	urls, avgLen := randomNURL(nMapFiles)
	eachRecords := (totalDataSize / nMapFiles) / avgLen
	files := make([]string, 0, nMapFiles)
	urlCount := make(map[string]int, len(urls))
	for i := 0; i < nMapFiles; i++ {
		fpath := path.Join(dataFileDir, fmt.Sprintf("inputMapFile%d", i))
		files = append(files, fpath)
		f, buf := CreateFileAndBuf(fpath)
		for j := 0; j < eachRecords; j++ {
			str := urls[i]
			urlCount[str]++
			WriteToBuf(buf, str, "\n")
		}
		SafeClose(f, buf)
	}

	rpath := path.Join(dataFileDir, "result")
	genResult(rpath, urlCount)
	return Case{
		MapFiles:   files,
		ResultFile: rpath,
	}
}

func genResult(rpath string, urlCount map[string]int) {
	us, cs := TopN(urlCount, 10)
	f, buf := CreateFileAndBuf(rpath)
	for i := range us {
		fmt.Fprintf(buf, "%s: %d\n", us[i], cs[i])
	}
	SafeClose(f, buf)
}

func randomNURL(n int) ([]string, int) {
	length := 0
	urls := make([]string, 0, n)
	for i := 0; i < n; i++ {
		url := wrapLikeURL(fmt.Sprintf("%d", i))
		length += len(url)
		urls = append(urls, url)
	}
	return urls, length / len(urls)
}

var urlPrefixes = []string{
	"github.com/pingcap/tidb/issues",
	"github.com/pingcap/tidb/pull",
	"github.com/pingcap/tidb",
}

func wrapLikeURL(suffix string) string {
	return path.Join(urlPrefixes[rand.Intn(len(urlPrefixes))], suffix)
}
--------------------------------------------------------------------------------
/lab0/go.mod:
--------------------------------------------------------------------------------
module talent

go 1.12
--------------------------------------------------------------------------------
/lab0/go.sum:
--------------------------------------------------------------------------------
github.com/pingcap/talent-plan v0.0.0-20190408125936-2f97dda786d6 h1:Kr1alXUfrJVBcLQb9tbrZGpInKkBhGLZsuMKNfesH1I=
--------------------------------------------------------------------------------
/lab0/mapreduce.go:
--------------------------------------------------------------------------------
package main

import (
	"bufio"
	"encoding/json"
	"hash/fnv"
	"io/ioutil"
	"log"
	"os"
	"path"
	"runtime"
	"strconv"
	"sync"
)

// KeyValue is a type used to hold the key/value pairs passed to the map and reduce functions.
type KeyValue struct {
	Key   string
	Value string
}

// ReduceF function from MIT 6.824 LAB1
type ReduceF func(key string, values []string) string

// MapF function from MIT 6.824 LAB1
type MapF func(filename string, contents string) []KeyValue

// jobPhase indicates whether a task is scheduled as a map or reduce task.
type jobPhase string

const (
	mapPhase    jobPhase = "mapPhase"
	reducePhase jobPhase = "reducePhase"
)

type task struct {
	dataDir    string
	jobName    string
	mapFile    string   // only for map, the input file
	phase      jobPhase // are we in mapPhase or reducePhase?
	taskNumber int      // this task's index in the current phase
	nMap       int      // number of map tasks
	nReduce    int      // number of reduce tasks
	mapF       MapF     // map function used in this job
	reduceF    ReduceF  // reduce function used in this job
	wg         sync.WaitGroup
}

// MRCluster represents a map-reduce cluster.
type MRCluster struct {
	nWorkers int
	wg       sync.WaitGroup
	taskCh   chan *task
	exit     chan struct{}
}

var singleton = &MRCluster{
	nWorkers: runtime.NumCPU(),
	taskCh:   make(chan *task),
	exit:     make(chan struct{}),
}

func init() {
	singleton.Start()
}

// GetMRCluster returns a reference to a MRCluster.
func GetMRCluster() *MRCluster {
	return singleton
}

// NWorkers returns how many workers there are in this cluster.
func (c *MRCluster) NWorkers() int { return c.nWorkers }

// Start starts this cluster.
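// It launches nWorkers goroutines, each consuming tasks from taskCh
// until the cluster is shut down via the exit channel.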
func (c *MRCluster) Start() {
	for i := 0; i < c.nWorkers; i++ {
		c.wg.Add(1)
		go c.worker()
	}
}

func (c *MRCluster) worker() {
	defer c.wg.Done()
	for {
		select {
		case t := <-c.taskCh:
			if t.phase == mapPhase {
				content, err := ioutil.ReadFile(t.mapFile)
				if err != nil {
					panic(err)
				}

				fs := make([]*os.File, t.nReduce)
				bs := make([]*bufio.Writer, t.nReduce)
				for i := range fs {
					rpath := reduceName(t.dataDir, t.jobName, t.taskNumber, i)
					fs[i], bs[i] = CreateFileAndBuf(rpath)
				}
				results := t.mapF(t.mapFile, string(content))
				for _, kv := range results {
					enc := json.NewEncoder(bs[ihash(kv.Key)%t.nReduce])
					if err := enc.Encode(&kv); err != nil {
						log.Fatalln(err)
					}
				}
				for i := range fs {
					SafeClose(fs[i], bs[i])
				}
			} else {
				// YOUR CODE HERE :)
				// hint: don't encode results returned by ReduceF, and just output
				// them into the destination file directly so that users can get
				// results formatted as what they want.
				panic("YOUR CODE HERE")
			}
			t.wg.Done()
		case <-c.exit:
			return
		}
	}
}

// Shutdown shuts down this cluster.
func (c *MRCluster) Shutdown() {
	close(c.exit)
	c.wg.Wait()
}

// Submit submits a job to this cluster.
func (c *MRCluster) Submit(jobName, dataDir string, mapF MapF, reduceF ReduceF, mapFiles []string, nReduce int) <-chan []string {
	notify := make(chan []string)
	go c.run(jobName, dataDir, mapF, reduceF, mapFiles, nReduce, notify)
	return notify
}

func (c *MRCluster) run(jobName, dataDir string, mapF MapF, reduceF ReduceF, mapFiles []string, nReduce int, notify chan<- []string) {
	// map phase
	nMap := len(mapFiles)
	tasks := make([]*task, 0, nMap)
	for i := 0; i < nMap; i++ {
		t := &task{
			dataDir:    dataDir,
			jobName:    jobName,
			mapFile:    mapFiles[i],
			phase:      mapPhase,
			taskNumber: i,
			nReduce:    nReduce,
			nMap:       nMap,
			mapF:       mapF,
		}
		t.wg.Add(1)
		tasks = append(tasks, t)
		go func() { c.taskCh <- t }()
	}
	for _, t := range tasks {
		t.wg.Wait()
	}
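
	// At this point every map task has finished: for each map task m and
	// reduce bucket r, the intermediate file reduceName(dataDir, jobName, m, r)
	// exists on disk, partitioned by ihash(key) % nReduce.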
	// reduce phase
	// YOUR CODE HERE :D
	panic("YOUR CODE HERE")
}

func ihash(s string) int {
	h := fnv.New32a()
	h.Write([]byte(s))
	return int(h.Sum32() & 0x7fffffff)
}

func reduceName(dataDir, jobName string, mapTask int, reduceTask int) string {
	return path.Join(dataDir, "mrtmp."+jobName+"-"+strconv.Itoa(mapTask)+"-"+strconv.Itoa(reduceTask))
}

func mergeName(dataDir, jobName string, reduceTask int) string {
	return path.Join(dataDir, "mrtmp."+jobName+"-res-"+strconv.Itoa(reduceTask))
}
--------------------------------------------------------------------------------
/lab0/report.md:
--------------------------------------------------------------------------------
# Lab0 实验报告

## 实验结果

### 1. 完成 Map-Reduce 框架

`make test_example` 的实验截图

### 2. 基于 Map-Reduce 框架编写 Map-Reduce 函数

`make test_homework` 的实验截图

## 实验总结

在这部分可以简单谈论自己在实验过程中遇到的困难、对 Map-Reduce 计算框架的理解，字数在 1000 字以内。
--------------------------------------------------------------------------------
/lab0/urltop10.go:
--------------------------------------------------------------------------------
package main

// URLTop10 .
func URLTop10(nWorkers int) RoundsArgs {
	// YOUR CODE HERE :)
	// And don't forget to document your idea.
	panic("YOUR CODE HERE")
	return nil
}
--------------------------------------------------------------------------------
/lab0/urltop10_example.go:
--------------------------------------------------------------------------------
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"strings"
)

// ExampleURLTop10 generates RoundsArgs for getting the 10 most frequent URLs.
// There are two rounds in this approach.
// The first round will do url count.
// The second will sort results generated in the first round and
// get the 10 most frequent URLs.
func ExampleURLTop10(nWorkers int) RoundsArgs {
	var args RoundsArgs
	// round 1: do url count
	args = append(args, RoundArgs{
		MapFunc:    ExampleURLCountMap,
		ReduceFunc: ExampleURLCountReduce,
		NReduce:    nWorkers,
	})
	// round 2: sort and get the 10 most frequent URLs
	args = append(args, RoundArgs{
		MapFunc:    ExampleURLTop10Map,
		ReduceFunc: ExampleURLTop10Reduce,
		NReduce:    1,
	})
	return args
}
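
// Note on the data flow between the two rounds: round 1 emits lines of
// the form "<url> <count>" (see ExampleURLCountReduce), and round 2 maps
// every line to a single key so that one reducer (NReduce: 1) sees all
// per-URL counts and can pick the global top 10.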

// ExampleURLCountMap is the map function in the first round
func ExampleURLCountMap(filename string, contents string) []KeyValue {
	lines := strings.Split(contents, "\n")
	kvs := make([]KeyValue, 0, len(lines))
	for _, l := range lines {
		l = strings.TrimSpace(l)
		if len(l) == 0 {
			continue
		}
		kvs = append(kvs, KeyValue{Key: l})
	}
	return kvs
}

// ExampleURLCountReduce is the reduce function in the first round
func ExampleURLCountReduce(key string, values []string) string {
	return fmt.Sprintf("%s %s\n", key, strconv.Itoa(len(values)))
}

// ExampleURLTop10Map is the map function in the second round
func ExampleURLTop10Map(filename string, contents string) []KeyValue {
	lines := strings.Split(contents, "\n")
	kvs := make([]KeyValue, 0, len(lines))
	for _, l := range lines {
		kvs = append(kvs, KeyValue{"", l})
	}
	return kvs
}

// ExampleURLTop10Reduce is the reduce function in the second round
func ExampleURLTop10Reduce(key string, values []string) string {
	cnts := make(map[string]int, len(values))
	for _, v := range values {
		v := strings.TrimSpace(v)
		if len(v) == 0 {
			continue
		}
		tmp := strings.Split(v, " ")
		n, err := strconv.Atoi(tmp[1])
		if err != nil {
			panic(err)
		}
		cnts[tmp[0]] = n
	}

	us, cs := TopN(cnts, 10)
	buf := new(bytes.Buffer)
	for i := range us {
		fmt.Fprintf(buf, "%s: %d\n", us[i], cs[i])
	}
	return buf.String()
}
--------------------------------------------------------------------------------
/lab0/urltop10_test.go:
--------------------------------------------------------------------------------
package main

import (
	"fmt"
	"log"
	"os"
	"path"
	"runtime"
	"testing"
	"time"
)

func testDataScale() ([]DataSize, []int) {
	dataSize := []DataSize{1 * MB, 10 * MB, 100 * MB, 500 * MB, 1 * GB}
	nMapFiles := []int{5, 10, 20, 40, 60}
	return dataSize, nMapFiles
}

const (
	dataDir = "/tmp/mr_homework"
)

func dataPrefix(i int, ds DataSize, nMap int) string {
	return path.Join(dataDir, fmt.Sprintf("case%d-%s-%d", i, ds, nMap))
}

func TestGenData(t *testing.T) {
	gens := AllCaseGenFs()
	dataSize, nMapFiles := testDataScale()
	for k := range dataSize {
		for i, gen := range gens {
			fmt.Printf("generate data file for case%d, dataSize=%v, nMap=%v\n", i, dataSize[k], nMapFiles[k])
			prefix := dataPrefix(i, dataSize[k], nMapFiles[k])
			gen(prefix, int(dataSize[k]), nMapFiles[k])
		}
	}
}

func TestCleanData(t *testing.T) {
	if err := os.RemoveAll(dataDir); err != nil {
		log.Fatal(err)
	}
}

func TestExampleURLTop(t *testing.T) {
	rounds := ExampleURLTop10(GetMRCluster().NWorkers())
	testURLTop(t, rounds)
}

func TestURLTop(t *testing.T) {
	rounds := URLTop10(GetMRCluster().NWorkers())
	testURLTop(t, rounds)
}

func testURLTop(t *testing.T, rounds RoundsArgs) {
	if len(rounds) == 0 {
		t.Fatalf("no rounds arguments, please finish your code")
	}
	mr := GetMRCluster()

	// run all cases
	gens := AllCaseGenFs()
	dataSize, nMapFiles := testDataScale()
	for k := range dataSize {
		for i, gen := range gens {
			// generate data
			prefix := dataPrefix(i, dataSize[k], nMapFiles[k])
			c := gen(prefix, int(dataSize[k]), nMapFiles[k])

			runtime.GC()

			// run map-reduce rounds
			begin := time.Now()
			inputFiles := c.MapFiles
			for idx, r := range rounds {
				jobName := fmt.Sprintf("Case%d-Round%d", i, idx)
				ch := mr.Submit(jobName, prefix, r.MapFunc, r.ReduceFunc, inputFiles, r.NReduce)
				inputFiles = <-ch
			}
			cost := time.Since(begin)

			// check result
			if len(inputFiles) != 1 {
				panic("the length of result file list should be 1")
			}
			result := inputFiles[0]

			if errMsg, ok := CheckFile(c.ResultFile, result); !ok {
				t.Fatalf("Case%d FAIL, dataSize=%v, nMapFiles=%v, cost=%v\n%v\n", i, dataSize[k], nMapFiles[k], cost, errMsg)
			} else {
				fmt.Printf("Case%d PASS, dataSize=%v, nMapFiles=%v, cost=%v\n", i, dataSize[k], nMapFiles[k], cost)
			}
		}
	}
}
--------------------------------------------------------------------------------
/lab0/utils.go:
--------------------------------------------------------------------------------
package main

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"os"
	"path"
	"sort"
	"strings"
)

// RoundArgs contains arguments used in a map-reduce round.
type RoundArgs struct {
	MapFunc    MapF
	ReduceFunc ReduceF
	NReduce    int
}

// RoundsArgs represents arguments used in multiple map-reduce rounds.
type RoundsArgs []RoundArgs

type urlCount struct {
	url string
	cnt int
}

// TopN returns the top n URLs in urlCntMap.
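// Results are ordered by count in descending order; ties are broken by
// URL in ascending lexicographical order, so the output is deterministic
// for a given input map.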
func TopN(urlCntMap map[string]int, n int) ([]string, []int) {
	ucs := make([]*urlCount, 0, len(urlCntMap))
	for k, v := range urlCntMap {
		ucs = append(ucs, &urlCount{k, v})
	}
	sort.Slice(ucs, func(i, j int) bool {
		if ucs[i].cnt == ucs[j].cnt {
			return ucs[i].url < ucs[j].url
		}
		return ucs[i].cnt > ucs[j].cnt
	})
	urls := make([]string, 0, n)
	cnts := make([]int, 0, n)
	for i, u := range ucs {
		if i == n {
			break
		}
		urls = append(urls, u.url)
		cnts = append(cnts, u.cnt)
	}
	return urls, cnts
}

// CheckFile checks whether the two files have the same content.
func CheckFile(expected, got string) (string, bool) {
	c1, err := ioutil.ReadFile(expected)
	if err != nil {
		panic(err)
	}
	c2, err := ioutil.ReadFile(got)
	if err != nil {
		panic(err)
	}
	s1 := strings.TrimSpace(string(c1))
	s2 := strings.TrimSpace(string(c2))
	if s1 == s2 {
		return "", true
	}

	errMsg := fmt.Sprintf("expected:\n%s\n, but got:\n%s\n", c1, c2)
	return errMsg, false
}

// CreateFileAndBuf opens or creates a specific file for writing.
func CreateFileAndBuf(fpath string) (*os.File, *bufio.Writer) {
	dir := path.Dir(fpath)
	os.MkdirAll(dir, 0777)
	f, err := os.OpenFile(fpath, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
	if err != nil {
		panic(err)
	}
	return f, bufio.NewWriterSize(f, 1<<20)
}

// OpenFileAndBuf opens a specific file for reading.
func OpenFileAndBuf(fpath string) (*os.File, *bufio.Reader) {
	f, err := os.OpenFile(fpath, os.O_RDONLY, 0666)
	if err != nil {
		panic(err)
	}
	return f, bufio.NewReader(f)
}

// WriteToBuf writes the given strings to this buffer.
func WriteToBuf(buf *bufio.Writer, strs ...string) {
	for _, str := range strs {
		if _, err := buf.WriteString(str); err != nil {
			panic(err)
		}
	}
}

// SafeClose flushes this buffer and closes this file.
func SafeClose(f *os.File, buf *bufio.Writer) {
	if buf != nil {
		if err := buf.Flush(); err != nil {
			panic(err)
		}
	}
	if err := f.Close(); err != nil {
		panic(err)
	}
}

// FileOrDirExist tests if this file or dir exists in a simple way.
func FileOrDirExist(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}
--------------------------------------------------------------------------------
/stmtflow/README.md:
--------------------------------------------------------------------------------
# The stmtflow test tool


## Introduction

The `stmtflow` test [tool](https://github.com/zyguan/tidb-test-util) is used to run a suite of **SQL tests** against a tinysql/tinykv cluster **concurrently**.
In the summer school project, we can use this tool to write test cases that span multiple connections or sessions and check the concurrent behavior of our transaction engine.


## How to use it

```
git clone https://github.com/zyguan/tidb-test-util.git
cd tidb-test-util
make build
```

After the build, there will be a `stmtflow` binary in the `bin` directory. Then we can write some test cases; each statement is prefixed with a comment naming the session that runs it (`init`, `t1`, `t2`, ...). For example, a test might look like the following.
```
/* init */ drop table if exists test;
/* init */ create table test (id int primary key, value int);
/* init */ insert into test (id, value) values (1, 10), (2, 20);

/* t1 */ begin;
/* t2 */ begin;
/* t1 */ delete from test where id = 1;
/* t2 */ select * from test where id = 1;
/* t1 */ commit;
/* t2 */ select * from test where id = 1; -- 1 -> 10
/* t2 */ commit;
```
Save these test SQL statements into a test file, say `hello.t.sql`. To let the `stmtflow` tool know about the running cluster, export its data source name as an environment variable.
```
export STMTFLOW_DSN='root:@tcp(127.0.0.1:4000)/test'
```
After that, the test can be run with the following command.
```
./stmtflow play hello.t.sql
```
The expected output of this case is
```
# hello.t.sql
/* init */ drop table if exists test;
-- init >> 0 rows affected
/* init */ create table test (id int primary key, value int);
-- init >> 0 rows affected
/* init */ insert into test (id, value) values (1, 10), (2, 20);
-- init >> 2 rows affected
/* t1 */ begin;
-- t1 >> 0 rows affected
/* t2 */ begin;
-- t2 >> 0 rows affected
/* t1 */ delete from test where id = 1;
-- t1 >> 1 rows affected
/* t2 */ select * from test where id = 1;
-- t2 >> +----+-------+
-- t2    | ID | VALUE |
-- t2    +----+-------+
-- t2    | 1  | 10    |
-- t2    +----+-------+
/* t1 */ commit;
-- t1 >> 0 rows affected
/* t2 */ select * from test where id = 1; -- 1 -> 10
-- t2 >> +----+-------+
-- t2    | ID | VALUE |
-- t2    +----+-------+
-- t2    | 1  | 10    |
-- t2    +----+-------+
/* t2 */ commit;
-- t2 >> 0 rows affected
```
If the results need to be recorded, add the `-w` flag to the `play` command.
```
./stmtflow play hello.t.sql -w
```

## Tests

The `stmtflow` directory contains some basic test cases ending with `.t.sql`, which can be used to verify the read/write behavior of our transaction engine. Try using the `stmtflow` tool to check that their output is as expected.
--------------------------------------------------------------------------------
/stmtflow/test_autocommit_read.t.sql:
--------------------------------------------------------------------------------
/* init */ drop table if exists t1;
/* init */ create table t1 (c1 int key, c2 int, c3 int, key k1(c3));
/* init */ insert into t1 values (1, 1, 1);

/* t1 */ select * from t1; -- the original snapshot value is returned.
/* t2 */ insert into t1 values(2, 2, 2);
/* t1 */ select * from t1; -- t1 should see the newly inserted value.
/* t2 */ begin;
/* t2 */ delete from t1;
/* t1 */ select * from t1; -- t1 should not see the delete.
/* t2 */ commit;
/* t1 */ select * from t1; -- t1 should see the delete.
--------------------------------------------------------------------------------
/stmtflow/test_commit_nothing.t.sql:
--------------------------------------------------------------------------------
/* init */ drop table if exists t1;
/* init */ create table t1 (c1 int key, c2 int, c3 int, key k1(c3));
/* init */ insert into t1 values (1, 1, 1);

/* t1 */ begin;
/* t2 */ begin;
/* t1 */ insert into t1 values(2, 2, 2);
/* t1 */ delete from t1 where c3 = 2; -- delete own write.
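# t1 inserted (2, 2, 2) and then deleted it, so t1's commit below is
# expected to write nothing (hence the test name).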
/* t1 */ commit;
/* t2 */ insert into t1 values(2, 3, 3);
/* t2 */ commit;
/* t2 */ select * from t1;
--------------------------------------------------------------------------------
/stmtflow/test_conflict.t.sql:
--------------------------------------------------------------------------------
/* init */ drop table if exists t1;
/* init */ create table t1 (c1 int key, c2 int, c3 int, key k1(c3));
/* init */ insert into t1 values (1, 1, 1);

# use t2 to test the duplicate entry check, use t3 to test the write conflict error.
/* t1 */ begin;
/* t2 */ begin;
/* t3 */ begin;
/* t1 */ delete from t1 where c3 = 1;
/* t1 */ commit;
/* t2 */ insert into t1 values (1, 1, 2); -- It's expected to report duplicate entry error.
/* t3 */ delete from t1 where c2 > 0;
/* t2 */ commit; -- The t2 commit will not write anything.
/* t3 */ commit; -- It's expected to report write conflict error.
--------------------------------------------------------------------------------
/stmtflow/test_in_txn_read.t.sql:
--------------------------------------------------------------------------------
/* init */ drop table if exists t1;
/* init */ create table t1 (c1 int key, c2 int, c3 int, key k1(c3));
/* init */ insert into t1 values (1, 1, 1);

/* t1 */ begin;
/* t1 */ insert into t1 values(2, 2, 2);
/* t1 */ select * from t1; -- should return the merged result from memory and the tinykv store.
/* t1 */ delete from t1 where c3 = 2; -- delete own write.
/* t1 */ select * from t1; -- should return the tinykv store results only.
/* t1 */ delete from t1;
/* t1 */ select * from t1; -- should return nothing.
/* t1 */ commit;
--------------------------------------------------------------------------------
/stmtflow/test_snap_read.t.sql:
--------------------------------------------------------------------------------
/* init */ drop table if exists t1;
/* init */ create table t1 (c1 int key, c2 int, c3 int, key k1(c3));
/* init */ insert into t1 values (1, 1, 1);

/* t1 */ begin;
/* t1 */ select * from t1; -- the original snapshot value is returned.
/* t2 */ begin;
/* t2 */ insert into t1 values(2, 2, 2);
/* t2 */ commit;
/* t1 */ select * from t1; -- t1 will not see the new value from t2.
/* t2 */ delete from t1;
/* t1 */ select * from t1; -- t1 will not see the delete operation from t2.
/* t1 */ insert into t1 values(3, 3, 3);
/* t1 */ select * from t1; -- t1 will see the snapshot value and its own write value.
/* t1 */ commit;
/* t1 */ select * from t1; -- t1 will see only the remaining value.
--------------------------------------------------------------------------------
/stmtflow/test_unique_index.t.sql:
--------------------------------------------------------------------------------
/* init */ drop table if exists t1;
/* init */ create table t1 (c1 int key, c2 int, c3 int, unique key uk1(c2), key k1(c3));
/* init */ insert into t1 values (1, 1, 1);

# use t2 to test the unique key check, use t3 to test the unique key write conflict.
/* t1 */ begin;
/* t2 */ begin;
/* t3 */ begin;
/* t1 */ delete from t1 where c3 = 1;
/* t1 */ commit;
/* t2 */ insert into t1 values (2, 1, 2); -- It's expected to report duplicate entry error.
/* t3 */ delete from t1 where c2 = 1;
/* t2 */ commit;
/* t3 */ commit; -- It's expected to report write conflict error.
--------------------------------------------------------------------------------