├── .gitignore
├── LICENSE
├── README.md
├── answer-keys-dplyr.R
├── answer-keys-pydt.ipynb
├── answer-keys-pypd.ipynb
├── answer-keys-rdt.ipynb
├── data
│   ├── stock-market-data.csv
│   └── stock-market-data.rds
├── dataset-and-questions.md
└── img
    └── answer-keys.png


/.gitignore:
--------------------------------------------------------------------------------
# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# Example code in package build process
*-Ex.R

# Output files from R CMD build
/*.tar.gz

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf

# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth

# knitr and R markdown default cache directories
/*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md

# Shiny token, see https://shiny.rstudio.com/articles/shinyapps.html
rsconnect/

# ignore Jupyter checkpoint
*.ipynb_checkpoints
answer-keys-py.ipynb

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Kun Ren

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Answer Keys to [Renkun](https://github.com/renkun-ken)'s R Data Practice
## Overview
These are the answer keys to [Renkun](https://github.com/renkun-ken)'s [50 R data exercises](https://github.com/renkun-ken/r-data-practice). The original 50 exercises are designed to help users build a solid skill set for data cleaning and manipulation. [Renkun](https://github.com/renkun-ken) didn't provide keys to these exercises, so here we present ours using the `data.table` package. We believe `data.table` is the BEST R tool for data manipulation. For more information about how amazing `data.table` is, please refer to its [GitHub page](https://github.com/Rdatatable/data.table).

The 50 exercises are built on a stock price dataset that includes variables like `symbol`, `date` and `price`. They cover common data manipulation tasks such as *compute the average for the largest N observations within each group* or *find the top N stocks with the largest price jump during the past M days*.

## Project Structure
- `dataset-and-questions.md`: Introduction to the dataset and question list (no answer keys included).

- `answer-keys.ipynb`: The questions and answers.

- `data`: the data folder
    - `stock-market-data.rds`: the dataset.
Please use `readRDS` to import the data.

# R Data Manipulation Exercises
> Update (Nov 8, 2023)
>
> **Added a Python pandas version of the solutions.**
>
> **There is also a partial set of solutions using Python datatable, but it is no longer updated because pydt development stalled.**

> Update (Aug 15, 2020)
>
> **Revised Exercises 2-4. Thanks to [yu-soong](https://github.com/yu-soong).**

> Update (July 12, 2020)
>
> **Simplified the `amount_change` formula in Exercises 11-15 and added 村长's explanation of the answer to Exercise 10. Thanks to [@zhjx19](https://github.com/zhjx19).**

> Update (Apr 30, 2020)
>
> **Revised Exercises 8, 10, 30-34, 36, 42, 59, 62, and 63.**

> Update (Apr 26, 2020)
>
> **Exercises 51-63 have been proofread.** Revised Ex-31 and Ex-33.

> Update (Apr 22, 2020)
>
> **Exercises 41-50 have been proofread.** Revised Ex-28; thanks to [StephenLee1029](https://github.com/StephenLee1029) for the question.

> Update (Feb 28, 2020)
>
> **Exercises 36-40 have been proofread.**

> Update (Feb 14, 2020)
>
> **Exercises 30-35 have been proofread.**

> Update (Feb 13, 2020)
>
> **Exercises 22-29 have been proofread.** Revised Ex-16.

> Update (Feb 11, 2020)
>
> Added a `dplyr` version of the solutions (`answer-keys-dplyr.R`), contributed by @香酥奶油饼. Thanks!

> Update (Jan 7, 2020)
>
> **Exercises 1-21 have been proofread.** Ex-6 may still be open to debate.

> Update (Dec 27, 2019, First Release)
>
> Some of the answers are not yet polished, and we are working on proofreading them. **Exercises 1-35 have been proofread and come with explanations.**

## Overview
This repo contains answers to [Renkun](https://github.com/renkun-ken)'s 50 R data manipulation exercises. The exercises are designed to help users master common data operations, such as *finding the N largest observations within each group*. They rely on a stock price dataset (included in this project) with variables such as date, stock symbol, and price.

[Renkun](https://github.com/renkun-ken) himself did not publish answers to the exercises, so we provide an implementation using `data.table`. We consider `data.table` the best R package for data manipulation. To learn more about what makes `data.table` remarkable, see its [GitHub page](https://github.com/Rdatatable/data.table).

**All exercises and explained answers are in the [answer-keys](answer-keys.ipynb) Jupyter Notebook.**

> For more R tips, subscribe to our WeChat official account: `大喵与村长的R进制`.

## Project Structure
- `dataset-and-questions.md`: Introduction to the dataset, plus the full list of 50 exercises (without answers).

- `answer-keys.ipynb`: Answers to the exercises. Each exercise includes the question, the answer code, and a preview of the result.

## Exercise Preview

1. Which stock symbols contain the digit 8?
2. How many stocks rise and how many fall each day?
3. How many stocks rise and how many fall each day on each exchange?
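4. Among CSI 300 constituents, how many stocks rise and how many fall each day?
5. How many stocks are there in each industry each day?
6. Is the industry with the most stocks always the same as the industry with the largest total trading amount?
7. How many stocks rise by more than 5% and how many fall by more than 5% each day?
8. Each day, what is the ratio of the total trading amount of the 10 biggest gainers to that of the 10 biggest losers?
9. How many stocks hit limit-up at the open and how many at the close each day? (Limit-up is defined here as a return above 1.5%.)
10. Each day, how many stocks hit limit-up and how many hit limit-down at the open at some point in the last 3 days?
11. What is the correlation between each stock's daily change rate in trading amount and its return?
12. What is the correlation between each industry's daily change rate in total trading amount and the industry return?
13. What is the correlation between the market's daily change rate in total trading amount and the market return?
14. What is the correlation between the market's daily change rate in total trading amount and the standard deviation of all stock returns?
15. What is the correlation between each industry's daily change rate in total trading amount and the standard deviation of stock returns within the industry?
16. Among the constituents of the SSE 50, CSI 300, and CSI 500 indexes, how many are Shanghai-listed and how many are Shenzhen-listed?
17. Among the constituents of the SSE 50, CSI 300, and CSI 500 indexes, what is the industry distribution?
18. What is the daily total trading amount of the constituents of the SSE 50, CSI 300, and CSI 500 indexes?
19. What is the historical volatility of the daily returns of the SSE 50, CSI 300, and CSI 500 indexes?
20. What is the correlation matrix of the daily returns of the SSE 50, CSI 300, and CSI 500 indexes?
21. What is the correlation matrix of the daily returns of the SSE 50, the CSI 300, and the CSI 300 excluding the SSE 50?
22. Each day, which 10 stocks have the largest weights in the CSI 300 index?
23. What is the ranking of industries by average daily number of stocks, from largest to smallest?
24. For each industry each day, which stock symbol has the largest trading amount?
25. For each industry each day, how many times larger is the largest trading amount than the smallest?
26. For each industry each day, which 5 stocks have the largest trading amount, and what is their total trading amount?
27. For each industry each day, what is the average return of the stocks whose trading amount exceeds the 80th percentile of trading amounts in that industry?
28. What is the correlation between the daily average return of the top 10% of stocks by trading amount and that of the bottom 10%?
29. Each day, which industries have an average trading amount above the market-wide average?
30. What is each stock's daily excess return over the market?
31. What is each stock's daily excess return over the market excluding the stock itself?
32. What is each stock's daily excess return over its industry?
33. What is each stock's daily excess return over its industry excluding the stock itself?
34. For each stock, what is the correlation between its daily excess return over the market and its daily excess return over its industry?
35. Each day, which industries have an average return above the market average return?
36. What is each industry's daily excess return over the market?
37. What is each industry's daily excess return over the market excluding that industry?
38. Each day, how many stocks have risen, and how many have fallen, on each of the last 3 consecutive trading days?
39. Each day, how many stocks have beaten the same-day market average return on each of the last 3 consecutive trading days?
40. Each day, how many stocks have beaten the same-day market average return on at least 4 of the last 5 trading days?
41. Each month, which stocks have a monthly return that exceeds the market's monthly return by more than 100%?
42. Each month, which stocks have a monthly return that exceeds their industry's monthly return by more than 100%?
43. Which 10 stocks have returns with the highest correlation to the market return?
44. What is the historical volatility of each industry's daily return? (Use the standard deviation of daily returns.)
45. What is the correlation matrix of daily returns across industries? Which two industries are the most correlated, and which the least?
46. What is the ranking of industries by the correlation of their returns with the market return, from high to low?
47. Each month, which industry has the largest drop in total trading amount relative to the previous month?
48. What is the maximum drawdown of each stock in the data? (Maximum drawdown is the largest decline from a high point to a low point.)
49. What is each stock's win rate? (The win rate is the probability that the daily return is positive.)
50. What is each stock's profit/loss ratio? (The profit/loss ratio is the absolute value of the ratio of the sum of positive returns to the sum of negative returns.)
51. What is the market's win rate? (The probability that the market return is positive.)
52. What is the market's profit/loss ratio? (The ratio of the cap-weighted positive returns to the cap-weighted negative returns of all stocks in the market.)
53. What is each industry's win rate?
54. 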
What is each industry's profit/loss ratio? (The industry profit/loss ratio is the ratio of the cap-weighted positive returns to the cap-weighted negative returns of the stocks in the industry.)
55. Is there any stock whose monthly trading amount exceeds its industry's total trading amount on some single day of that month?
56. Each day, how many stocks are added to and how many are removed from each industry?
57. Each day, what is the standard deviation of stock returns within each industry?
58. What is the correlation of the daily within-industry standard deviations of stock returns?
59. Compute the daily z-score of trading amount (subtract the mean, then divide by the standard deviation). What proportion of the next day's stock excess return can this indicator explain?
60. Regressing each stock's return on the CSI 300 and CSI 500 index returns yields an intercept and 2 betas. How are these two betas distributed?
61. Each day, what proportion of the 100 stocks with the largest gain from the open to the intraday high are also among the 100 stocks with the largest full-day gain (previous close to today's close)?
62. Each day, which stocks ranked in the day's top 100 by excess return over the market on each of the last three days?
63. Each day, which stocks ranked in the top 30% of their industry by excess return over the industry on each of the last three days?

## Recommended Learning Resources

* [Base R cheatsheet](http://github.com/rstudio/cheatsheets/raw/master/base-r.pdf)
* [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)
* [Regular Expressions](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)
* [Work with Strings cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)
* [data.table cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/datatable.pdf)

## Acknowledgements
[@frankzhangsyd](https://github.com/frankzhangsyd)

@香酥奶油饼 (WeChat)

[@zhjx19](https://github.com/zhjx19)

[yu-soong](https://github.com/yu-soong)

--------------------------------------------------------------------------------
/answer-keys-dplyr.R:
--------------------------------------------------------------------------------
library(data.table)
library(tidyverse)
library(microbenchmark)
stock_data <- readRDS("r-data-practice-master/data/stock-market-data.rds")
head(stock_data)


# 1. Which stock symbols contain the digit "8"? ---------------------------------------------------

library(stringr)
microbenchmark(
  stock_data$symbol[str_detect(stock_data$symbol, "8")] %>% unique(),
  stock_data[str_detect(symbol, "8"), unique(symbol)]
)

# 2. How many stocks rise and how many fall each day?
# ---------------------------------------------------------

microbenchmark(
  stock_data[,
             .(num = uniqueN(symbol)),
             keyby = .(date, UpDown = ifelse(pre_close > close, "Down", "Up"))],
  times = 5)
microbenchmark(
  stock_data %>%
    mutate(UpDown = ifelse(pre_close > close, "Down", "Up")) %>%
    group_by(date, UpDown) %>%
    summarise(num = n()),
  times = 5)

# 3. How many stocks rise and how many fall each day on each exchange? ----------------------------------------------------

microbenchmark(
  stock_data[,
             .(num = uniqueN(symbol)),
             keyby = .(date,
                       exchange = str_sub(symbol, start = -2, end = -1),
                       UpDown = ifelse(pre_close > close, "Down", "Up"))],
  times = 5)
microbenchmark(
  stock_data %>%
    mutate(exchange = str_sub(symbol, start = -2, end = -1), UpDown = ifelse(pre_close > close, "Down", "Up")) %>%
    group_by(date, exchange, UpDown) %>%  # also group by direction, matching the data.table version
    summarise(num = n()),
  times = 5)

# 4. Among CSI 300 constituents, how many stocks rise and how many fall each day? --------------------------------------------

microbenchmark(
  filter(stock_data, index_w300 > 0) %>%
    mutate(UpDown = ifelse(pre_close > close, "Down", "Up")) %>%
    group_by(date, UpDown) %>%
    summarise(num = n()),
  times = 5)
microbenchmark(
  stock_data[index_w300 > 0,
             .(num = uniqueN(symbol)),
             keyby = .(date, updown = ifelse(close - pre_close > 0, "UP", "DOWN"))
             ][1:5],
  times = 5)

# 5. How many stocks are there in each industry each day? -------------------------------------------------------

stock_data %>%
  group_by(date, industry) %>%
  summarise(num = n())
# `data` was undefined here (it resolves to the base function utils::data); use stock_data
stock_data[, .(stk_num = uniqueN(symbol)),
           keyby = .(date, industry)
           ][1:5]

# 6. Is the industry with the most stocks always the same as the industry with the largest total trading amount?
# -----------------------------------------

microbenchmark(
  stock_data %>%
    group_by(date, industry) %>%
    summarise(trd_amount = sum(amount), stk_num = uniqueN(symbol)) %>%
    group_by(date) %>%
    filter(rank(desc(trd_amount)) == 1 & rank(desc(stk_num)) == 1),
  times = 5)
microbenchmark(
  stock_data[, .(trd_amount = sum(amount), stk_num = uniqueN(symbol)), keyby = .(date, industry)
             ][, .SD[trd_amount == max(trd_amount) & stk_num == max(stk_num), .(industry)],
               keyby = .(date)
               ][1:5],
  times = 5)

# 7. How many stocks rise by more than 5% and how many fall by more than 5% each day? ----------------------------------------------

microbenchmark(
  stock_data %>%
    mutate(ret = (close - pre_close)/pre_close) %>%
    group_by(date) %>%
    summarise(high_up = sum(ret > 0.05), high_down = sum(ret < -0.05)),
  times = 5)
microbenchmark(
  stock_data[, ':='(ret = (close - pre_close)/pre_close),
             ][ret > 0.05 | ret < -0.05,
               .(symbol_amount = uniqueN(symbol)),
               keyby = .(date, updown = ifelse(ret > 0.05, "up5%+", "down5%+"))
               ][1:5],
  times = 5)

# 8. Each day, what is the ratio of the total trading amount of the 10 biggest gainers to that of the 10 biggest losers? ----------------------------------

microbenchmark(
  stock_data %>%
    mutate(ret = (close - pre_close)/pre_close) %>%
    group_by(date) %>%
    summarise(top10 = sum(amount[rank(desc(ret)) <= 10]),
              bottom10 = sum(amount[rank(ret) <= 10]),
              ratio = top10/bottom10),  # gainers over losers, as the question asks
  times = 5)
microbenchmark(
  # sort by descending return so that rows 1:10 are the biggest gainers;
  # (.N - 9):.N selects exactly the last 10 rows (the biggest losers)
  stock_data[, ':='(ret = (close - pre_close)/pre_close)
             ][order(date, -ret)
               ][, .(top10 = sum(amount[1:10]), bottom10 = sum(amount[(.N - 9):.N])),
                 keyby = date
                 ][, ':='(ratio = top10/bottom10)
                   ][1:15],
  times = 5)

# 9. How many stocks hit limit-up at the open and how many at the close each day?
# -----------------------------------------------

microbenchmark(
  stock_data %>%
    mutate(ret_open = open/pre_close - 1, ret_close = close/pre_close - 1) %>%
    group_by(date) %>%
    summarise(n_openlimit = sum(ret_open > 0.015),
              n_closelimit = sum(ret_close > 0.015)),
  times = 5)
microbenchmark(
  stock_data[, ':='(ret_open = open/pre_close - 1,
                    ret_close = close/pre_close - 1)
             ][, .(n_openlimit = sum(ret_open > 0.015),
                   n_closelimit = sum(ret_close > 0.015)),
               keyby = date
               ][1:5],
  times = 5)

# 10. Each day, how many stocks hit limit-up at the open in the last 3 days? --------------------------------------------

microbenchmark(
  {df10 <- stock_data %>%
    mutate(n_openlimit = ifelse(open/pre_close > 1.015, 1, 0)) %>%
    group_by(date) %>%
    summarise(num = sum(n_openlimit),
              n_openlimit_3d = NA)
  for (i in 4:length(unique(stock_data$date))) {
    df10$n_openlimit_3d[[i]] <- sum(df10$num[(i - 3):(i - 1)])
  }}, times = 5)

microbenchmark(
  {data_ex9 <- stock_data[, ':='(ret_open = open/pre_close - 1,
                                 ret_close = close/pre_close - 1)
                          ][, .(n_openlimit = sum(ret_open > 0.015),
                                n_closelimit = sum(ret_close > 0.015)),
                            keyby = date]
  data_ex9[,
           .(date,
             n_openlimit_3d = {
               l = vector()
               for (t in 4:.N) {
                 l[t] = sum(n_openlimit[(t - 3):(t - 1)])
               }
               l
             })
           ][1:5]
  }, times = 5)

# 11. What is the correlation between each stock's daily change rate in trading amount and its return?
# ----------------------------------------------

microbenchmark(
  stock_data[,
             .(amount_change = {
               a <- vector()
               for (i in 2:.N) {a[i] <- amount[i]/amount[i - 1] - 1}
               a},
               ret = close/pre_close - 1, symbol = symbol)
             ][is.finite(amount_change), na.omit(.SD)
               ][, cor(amount_change, ret)],
  times = 10)

microbenchmark(
  stock_data %>%
    mutate(amount_change = {
      a <- vector()
      for (i in 2:dim(stock_data)[1]) {a[i] <- amount[i]/amount[i - 1] - 1}
      a}, ret = close/pre_close - 1) %>%
    select(amount_change, ret) %>%
    filter(is.finite(amount_change) == T, is.na(ret) == F, is.na(amount_change) == F) %>%
    cor(),
  times = 10)

# 12. What is the correlation between each industry's daily change rate in total trading amount and the industry return? -----------------------------------------

df12 <- stock_data %>%
  mutate(ret = close/pre_close - 1) %>%
  group_by(industry, date) %>%
  summarise(ind_amount = sum(amount),
            ind_ret = weighted.mean(ret, w = capt)) %>%
  mutate(ind_amount_change = {
    a <- vector()
    for (i in 2:length(ind_ret)) {a[i] <- ind_amount[i]/ind_amount[i - 1] - 1}
    a})
cor(df12$ind_ret, df12$ind_amount_change, use = "complete.obs")

stock_data[, .(ind_amount = sum(amount), weight = capt/sum(capt), ret = close/pre_close - 1), keyby = .(date, industry)
           ][, .(ind_ret = sum(weight * ret), ind_amount = ind_amount), keyby = .(industry, date)
             ][, unique(.SD)
               ][, .(ind_amount_change = {
                 a <- vector()
                 for (i in 2:.N) {
                   a[i] <- ind_amount[i]/ind_amount[i - 1] - 1
                 }
                 a
               }, ind_ret = ind_ret), keyby = .(industry)
               ][!is.na(ind_amount_change), cor(ind_amount_change, ind_ret)]

# 13. What is the correlation between the market's daily change rate in total trading amount and the market return?
# --------------------------------------------

df13 <- stock_data %>%
  mutate(ret = close/pre_close - 1) %>%
  group_by(date) %>%
  summarise(mkt_amount = sum(amount),
            mkt_ret = weighted.mean(ret, w = capt)) %>%
  mutate(mkt_amount_change = {
    a <- vector()
    for (i in 2:length(date)) {a[i] <- mkt_amount[i]/mkt_amount[i - 1] - 1}
    a})
cor(df13$mkt_ret, df13$mkt_amount_change, use = "complete.obs")

# `data` was undefined here (it resolves to the base function utils::data); use stock_data
stock_data[, .(mkt_amount = sum(amount), weight = capt/sum(capt), ret = close/pre_close - 1), keyby = date
           ][, .(mkt_ret = sum(weight * ret), mkt_amount = mkt_amount), keyby = date
             ][, unique(.SD)
               ][, .(mkt_amount_change = {
                 a <- vector()
                 for (i in 2:.N) {a[i] <- mkt_amount[i]/mkt_amount[i - 1] - 1
                 }
                 a
               }, mkt_ret = mkt_ret)
               ][!is.na(mkt_amount_change), cor(mkt_amount_change, mkt_ret)]

# 14. What is the correlation between the market's daily change rate in total trading amount and the standard deviation of all stock returns? -------------------------------------

df14 <- stock_data %>%
  mutate(ret = close/pre_close - 1) %>%
  group_by(date) %>%
  summarise(mkt_amount = sum(amount),
            ret_sd = sd(ret)) %>%
  mutate(mkt_amount_change = {
    a <- vector()
    for (i in 2:length(date)) {a[i] <- mkt_amount[i]/mkt_amount[i - 1] - 1}
    a})
cor(df14$ret_sd, df14$mkt_amount_change, use = "complete.obs")

# same fix as above: stock_data instead of the undefined `data`
stock_data[, .(mkt_amount = sum(amount), ret = close/pre_close - 1, symbol = symbol), keyby = date
           ][, .(ret_sd = unique(sd(ret)), mkt_amount = unique(mkt_amount)), keyby = date
             ][, .(mkt_amount_change = {
               a <- vector()
               for (i in 2:.N) {a[i] <- mkt_amount[i]/mkt_amount[i - 1] - 1
               }
               a
             }, ret_sd = ret_sd)
             ][!is.na(mkt_amount_change), cor(mkt_amount_change, ret_sd)]

# 15. What is the correlation between each industry's daily change rate in total trading amount and the standard deviation of stock returns within the industry?
# -----------------------------------

df15 <- stock_data %>%
  mutate(ret = close/pre_close - 1) %>%
  group_by(industry, date) %>%
  summarise(ind_amount = sum(amount),
            ind_ret_sd = sd(ret)) %>%
  mutate(ind_amount_change = {
    a <- vector()
    for (i in 2:length(ind_ret_sd)) {a[i] <- ind_amount[i]/ind_amount[i - 1] - 1}
    a})
cor(df15$ind_ret_sd, df15$ind_amount_change, use = "complete.obs")

# `data` was undefined here (it resolves to the base function utils::data); use stock_data
stock_data[, .(ind_amount = sum(amount), ret = close/pre_close - 1, symbol = symbol), keyby = .(industry, date)
           ][, .(ind_ret_sd = unique(sd(ret)), ind_amount = unique(ind_amount)), keyby = .(industry, date)
             ][, .(ind_amount_change = {
               a <- vector()
               for (i in 2:.N) {a[i] <- ind_amount[i]/ind_amount[i - 1] - 1
               }
               a
             }, ind_ret_sd = ind_ret_sd), keyby = industry
             ][!is.na(ind_amount_change), cor(ind_ret_sd, ind_amount_change)]

# 16. Among the constituents of the SSE 50, CSI 300, and CSI 500 indexes, how many are Shanghai-listed and how many are Shenzhen-listed? -----------------------------------

stock_data %>%
  mutate(exchange = str_sub(symbol, start = -2, end = -1),
         index_w50 = ifelse(index_w50 > 0, 1, 0),
         index_w300 = ifelse(index_w300 > 0, 1, 0),
         index_w500 = ifelse(index_w500 > 0, 1, 0)) %>%
  group_by(date, exchange) %>%
  summarise(index50 = sum(index_w50),
            index300 = sum(index_w300),
            index500 = sum(index_w500))

stock_data[, melt(.SD, measure.vars = patterns("index"), variable.name = "index_name")
           ][value > 0, .(stkcd_amount = uniqueN(symbol)),
             by = .(index_name, type = ifelse(str_detect(symbol, "SH"), "SH", "SZ"))]

# 17. Among the constituents of the SSE 50, CSI 300, and CSI 500 indexes, what is the industry distribution?
# --------------------------------------

microbenchmark(
  stock_data %>%
    mutate(index_w50 = ifelse(index_w50 > 0, 1, 0),
           index_w300 = ifelse(index_w300 > 0, 1, 0),
           index_w500 = ifelse(index_w500 > 0, 1, 0)) %>%
    group_by(date, industry) %>%
    summarise(index50 = sum(index_w50),
              index300 = sum(index_w300),
              index500 = sum(index_w500)),
  times = 10)

microbenchmark(
  stock_data[, melt(.SD, measure.vars = patterns("index"), variable.name = "index_name")
             ][value > 0, .(stkcd_amount = uniqueN(symbol)), by = .(index_name, industry)
               ][1:5],
  times = 10)

# 18. What is the daily total trading amount of the constituents of the SSE 50, CSI 300, and CSI 500 indexes? -----------------------------------

stock_data %>%
  mutate(index_w50 = ifelse(index_w50 > 0, 1, 0),
         index_w300 = ifelse(index_w300 > 0, 1, 0),
         index_w500 = ifelse(index_w500 > 0, 1, 0)) %>%
  group_by(date) %>%
  summarise(index50 = sum(index_w50 * amount),
            index300 = sum(index_w300 * amount),
            index500 = sum(index_w500 * amount))

# `data` was undefined here (it resolves to the base function utils::data); use stock_data
stock_data[, melt(.SD, measure.vars = patterns("index"), variable.name = "index_name")
           ][value > 0, .(amount = sum(amount)), by = .(index_name, date)
             ][1:5]

# 19. What is the historical volatility of the daily returns of the SSE 50, CSI 300, and CSI 500 indexes? ------------------------------------

df19 <- stock_data %>%
  mutate(ret = close/pre_close - 1) %>%
  group_by(date) %>%
  summarise(index50 = sum(index_w50 * ret),
            index300 = sum(index_w300 * ret),
            index500 = sum(index_w500 * ret))
map_dbl(df19[,2:4], sd)

# same fix as above: stock_data instead of the undefined `data`
stock_data[, melt(.SD, measure.vars = patterns("index"), variable.name = "index_name")
           ][, .(index_ret = sum(value * (close/pre_close - 1))), by = .(date, index_name)
             ][, .(vol = sd(index_ret)), by = .(index_name)]

# 20. 
# What is the correlation matrix of the daily returns of the SSE 50, CSI 300, and CSI 500 indexes? -------------------------------------

cor(df19[2:4])

# `data` was undefined here (it resolves to the base function utils::data); use stock_data
stock_data[, .(index_w50_ret = sum(index_w50 * (close/pre_close - 1)),
               index_w300_ret = sum(index_w300 * (close/pre_close - 1)),
               index_w500_ret = sum(index_w500 * (close/pre_close - 1))), keyby = date
           ][, cor(.SD[, -1])]


# 21. What is the correlation matrix of the daily returns of the SSE 50, the CSI 300, and the CSI 300 excluding the SSE 50? -------------------------------

df21 <- stock_data %>%
  mutate(ret = close/pre_close - 1,
         index300_50_ret = ifelse(index_w50 > 0, 0, ret * index_w300)) %>%
  group_by(date) %>%
  summarise(index50 = sum(index_w50 * ret),
            index300 = sum(index_w300 * ret),
            index300_50 = sum(index300_50_ret))
cor(df21[2:4])

# 22. Each day, which 10 stocks have the largest weights in the CSI 300 index? -------------------------------------------

stock_data %>%
  arrange(date, desc(index_w300)) %>%  # descending, so the first 10 rows per day are the largest weights
  group_by(date) %>%
  summarise(Big10 = paste(symbol[1:10], collapse = " "))

stock_data[order(date, -index_w300), .(symbol = symbol[1:10]), by = .(date)
           ][1:5]

# 23. What is the ranking of industries by average daily number of stocks, from largest to smallest? ---------------------------------------------

microbenchmark(
  stock_data %>%
    group_by(date) %>%
    count(industry) %>%
    group_by(industry) %>%
    summarise(n = mean(n)) %>%
    arrange(desc(n)),
  times = 10)

microbenchmark(
  stock_data[, .(stkcd_amount = uniqueN(symbol)), keyby = .(date, industry)
             ][, .(stkcd_amount = mean(stkcd_amount)), by = industry  # average the daily counts per industry
               ][order(-stkcd_amount)
                 ],
  times = 10)

# 24. For each industry each day, which stock symbol has the largest trading amount? ----------------------------------------------

stock_data %>%
  arrange(desc(amount)) %>%
  group_by(date, industry) %>%
  summarise(No_1 = symbol[1])

stock_data[order(-amount), .(symbol = symbol[1]), keyby = .(date, industry)
           ][1:5]

# 25. For each industry each day, how many times larger is the largest trading amount than the smallest?
# -----------------------------------------------

stock_data %>%
  filter(amount > 0) %>%
  group_by(date, industry) %>%
  summarise(times = max(amount)/min(amount))

# filter first, then sort: `order(-amount) & amount > 0` would collapse to a logical vector and lose the ordering
stock_data[amount > 0
           ][order(-amount), .(times = amount[1]/amount[.N]),
             keyby = .(date, industry)
             ]

--------------------------------------------------------------------------------
/answer-keys-pydt.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Readme"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We also want to provide solutions using the Python version of `datatable`, and to compare them with the R version wherever possible.\n",
    "- As in the R version, we only print the **first 5 rows** of each result.\n",
    "- For other notes, see the readme section of *answer-keys-rdt.ipynb*.\n",
    "- Feel free to follow our WeChat official account **大喵与村长的R进制**.\n",
    "- If you have any questions, please open an issue on our GitHub page. Finally, thanks to renkun!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | "
|   | symbol | date | pre_close | open | high | low | close | volume | amount | adj_factor | capt | index_w50 | index_w300 | index_w500 | industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600000.SH | 20120104 | 8.49 | 8.54 | 8.56 | 8.39 | 8.41 | 34201379 | 2.9023e+08 | 6.65527 | 1.25501e+11 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 1 | 600000.SH | 20120105 | 8.41 | 8.47 | 8.82 | 8.47 | 8.65 | 132116203 | 1.14475e+09 | 6.65527 | 1.29082e+11 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 2 | 600000.SH | 20120106 | 8.65 | 8.63 | 8.78 | 8.62 | 8.71 | 61778687 | 5.37044e+08 | 6.65527 | 1.29977e+11 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 3 | 600000.SH | 20120109 | 8.71 | 8.72 | 8.99 | 8.68 | 8.95 | 80136249 | 7.1143e+08 | 6.65527 | 1.33559e+11 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 4 | 600000.SH | 20120110 | 8.95 | 8.95 | 9.1 | 8.88 | 9.07 | 72004632 | 6.47207e+08 | 6.65527 | 1.3535e+11 | 0.0464093 | 0.0212594 | 0 | BANKS |
\n", 50 | " \n", 53 | "
\n"
      ],
      "text/plain": [
       ""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import datatable as dt\n",
    "data = dt.fread(\"data/stock-market-data.csv\")\n",
    "data[:5,:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The dataset is \"panel data\": it contains multiple stocks (the cross-section), and each stock has several variables ordered by date (the time series).\n",
    "\n",
    "- The stock symbol `symbol` and the date `date` together form the key of the dataset, i.e. each unique combination of `symbol` and `date` identifies exactly one observation.\n",
    "\n",
    "- The whole dataset is sorted first by `symbol`, then by `date`.\n",
    "\n",
    "- Main variables:\n",
    " - `symbol`: stock symbol. Symbols ending in .SH are Shanghai-listed; symbols ending in .SZ are Shenzhen-listed\n",
    " - `date`: date\n",
    " - `pre_close`: previous close\n",
    " - `open`: opening price\n",
    " - `high`: intraday high\n",
    " - `low`: intraday low\n",
    " - `close`: closing price\n",
    " - `volume`: trading volume\n",
    " - `amount`: trading amount\n",
    " - `industry`: industry\n",
    " - `index_w50`: the stock's weight in the SSE 50 index\n",
    " - `index_w300`: the stock's weight in the CSI 300 index\n",
    " - `index_w500`: the stock's weight in the CSI 500 index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Answer Keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Which stock symbols contain the digit \"8\"?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | "
|   | symbol | date | pre_close | open | high | low | close | volume | amount | adj_factor | capt | index_w50 | index_w300 | index_w500 | industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600008.SH | 20120104 | 5.2 | 5.29 | 5.29 | 4.92 | 4.96 | 6685638 | 3.40151e+07 | 3.29223 | 1.0912e+10 | 0 | 0.00122996 | 0 | UTILITIE |
| 1 | 600008.SH | 20120105 | 4.96 | 4.97 | 4.98 | 4.78 | 4.85 | 5600697 | 2.73034e+07 | 3.29223 | 1.067e+10 | 0 | 0.00122996 | 0 | UTILITIE |
| 2 | 600008.SH | 20120106 | 4.85 | 4.86 | 5 | 4.78 | 4.98 | 3773453 | 1.8459e+07 | 3.29223 | 1.0956e+10 | 0 | 0.00122996 | 0 | UTILITIE |
| 3 | 600008.SH | 20120109 | 4.98 | 4.98 | 5.2 | 4.91 | 5.17 | 5749379 | 2.9119e+07 | 3.29223 | 1.1374e+10 | 0 | 0.00122996 | 0 | UTILITIE |
| 4 | 600008.SH | 20120110 | 5.17 | 5.17 | 5.32 | 5.13 | 5.28 | 8276808 | 4.33631e+07 | 3.29223 | 1.1616e+10 | 0 | 0.00122996 | 0 | UTILITIE |
\n", 132 | " \n", 135 | "
\n"
      ],
      "text/plain": [
       ""
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[dt.re.match(dt.f.symbol, \".*8.*\"),:][:5, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. How many stocks rise and how many fall each day?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | "
|   | date | updown_tag | symbol |
|---|---|---|---|
| 0 | 20120104 | down | 2007 |
| 1 | 20120104 | steady | 122 |
| 2 | 20120104 | up | 191 |
| 3 | 20120105 | down | 2071 |
| 4 | 20120105 | steady | 117 |
\n", 179 | " \n", 182 | "
\n"
      ],
      "text/plain": [
       ""
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[:, dt.update(updown_tag = dt.ifelse(dt.f.close > dt.f.pre_close, \"up\", dt.f.close < dt.f.pre_close, \"down\", \"steady\"))]\n",
    "data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_tag)][:5, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. How many stocks rise and how many fall each day on each exchange?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | "
|   | date | updown_tag | exch_tag | symbol |
|---|---|---|---|---|
| 0 | 20120104 | down | SH | 794 |
| 1 | 20120104 | down | SZ | 1213 |
| 2 | 20120104 | steady | SH | 42 |
| 3 | 20120104 | steady | SZ | 80 |
| 4 | 20120104 | up | SH | 85 |
\n", 227 | " \n", 230 | "
\n"
      ],
      "text/plain": [
       ""
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[:, dt.update(\n",
    " updown_tag = dt.ifelse(dt.f.close > dt.f.pre_close, \"up\", dt.f.close < dt.f.pre_close, \"down\", \"steady\"),\n",
    " exch_tag = dt.str.slice(dt.f.symbol, -2, None) \n",
    " )]\n",
    "data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_tag, dt.f.exch_tag)][:5, :]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Among CSI 300 constituents, how many stocks rise and how many fall each day?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | "
|   | date | updown_tag | symbol |
|---|---|---|---|
| 0 | 20120104 | down | 275 |
| 1 | 20120104 | steady | 5 |
| 2 | 20120104 | up | 20 |
| 3 | 20120105 | down | 242 |
| 4 | 20120105 | steady | 8 |
\n", 278 | " \n", 281 | "
\n"
      ],
      "text/plain": [
       ""
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[:, dt.update(updown_tag = dt.ifelse(dt.f.close > dt.f.pre_close, \"up\", dt.f.close < dt.f.pre_close, \"down\", \"steady\"))]\n",
    "data[dt.f.index_w300 > 0,:\n",
    " ][:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_tag)][:5, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. How many stocks are there in each industry each day?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | "
| | date | industry | symbol |
| --- | --- | --- | --- |
| 0 | 20120104 | AERODEF | 10 |
| 1 | 20120104 | AIRLINE | 12 |
| 2 | 20120104 | AUTO | 85 |
| 3 | 20120104 | BANKS | 16 |
| 4 | 20120104 | BEV | 30 |
\n", 327 | " \n", 330 | "
\n" 331 | ], 332 | "text/plain": [ 333 | "" 334 | ] 335 | }, 336 | "execution_count": 74, 337 | "metadata": {}, 338 | "output_type": "execute_result" 339 | } 340 | ], 341 | "source": [ 342 | "data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.industry)][:5, :]" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "## 6. Is the industry with the most stocks always the same as the industry with the largest total turnover?" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 105, 355 | "metadata": {}, 356 | "outputs": [ 357 | { 358 | "data": { 359 | "text/html": [ 360 | "
\n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | "
| | date | count |
| --- | --- | --- |
| 0 | 20120104 | 1 |
| 1 | 20120105 | 2 |
| 2 | 20120106 | 1 |
| 3 | 20120109 | 2 |
| 4 | 20120110 | 2 |
\n", 374 | " \n", 377 | "
\n" 378 | ], 379 | "text/plain": [ 380 | "" 381 | ] 382 | }, 383 | "execution_count": 105, 384 | "metadata": {}, 385 | "output_type": "execute_result" 386 | } 387 | ], 388 | "source": [ 389 | "data = data[:, {\n", 390 | " 'symbol_num' : dt.count(dt.f.symbol),\n", 391 | " 'amount_num' : dt.sum(dt.f.amount)\n", 392 | " }, dt.by(dt.f.date, dt.f.industry)\n", 393 | " ]\n", 394 | "data[:, dt.update(\n", 395 | " symbol_max = dt.max(dt.f.symbol_num),\n", 396 | " amount_max = dt.max(dt.f.amount_num)\n", 397 | " ), dt.by(dt.f.date)\n", 398 | " ]\n", 399 | "data[(dt.f.symbol_num == dt.f.symbol_max) | (dt.f.amount_num == dt.f.amount_max), :\n", 400 | " ][:, dt.count(), dt.by(dt.f.date)\n", 401 | " ][:5, :]" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "## 7. How many stocks rise by more than 5% and how many fall by more than 5% each day?" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 111, 414 | "metadata": {}, 415 | "outputs": [ 416 | { 417 | "data": { 418 | "text/html": [ 419 | "
\n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | "
| | date | updown_percent | symbol |
| --- | --- | --- | --- |
| 0 | 20120104 | down5p | 277 |
| 1 | 20120104 | up5p | 17 |
| 2 | 20120105 | down5p | 886 |
| 3 | 20120105 | up5p | 10 |
| 4 | 20120106 | down5p | 66 |
\n", 433 | " \n", 436 | "
\n" 437 | ], 438 | "text/plain": [ 439 | "" 440 | ] 441 | }, 442 | "execution_count": 111, 443 | "metadata": {}, 444 | "output_type": "execute_result" 445 | } 446 | ], 447 | "source": [ 448 | "data[:, dt.update(\n", 449 | " updown_percent = dt.ifelse(dt.f.close/dt.f.pre_close-1 > 0.05, \"up5p\", dt.f.close/dt.f.pre_close-1 < -0.05, \"down5p\", \"others\")\n", 450 | " )]\n", 451 | "data[dt.f.updown_percent != \"others\", :\n", 452 | " ][:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_percent)\n", 453 | " ][:5, :]" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "## 8. Each day, what is the ratio of the total turnover of the 10 biggest gainers to that of the 10 biggest losers?" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 181, 466 | "metadata": {}, 467 | "outputs": [ 468 | { 469 | "data": { 470 | "text/html": [ 471 | "
\n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | "
| | date | tag | ratio |
| --- | --- | --- | --- |
| 0 | 20120104 | bottom | 0.0057457 |
| 1 | 20120104 | top | 0.00830585 |
| 2 | 20120105 | bottom | 0.00435634 |
| 3 | 20120105 | top | 0.00750696 |
| 4 | 20120106 | bottom | 0.00932084 |
\n", 485 | " \n", 488 | "
\n" 489 | ], 490 | "text/plain": [ 491 | "" 492 | ] 493 | }, 494 | "execution_count": 181, 495 | "metadata": {}, 496 | "output_type": "execute_result" 497 | } 498 | ], 499 | "source": [ 500 | "# Compute each stock's return, then sort by date and by return (descending)\n", 501 | "data[:, dt.update(\n", 502 | " ret = dt.f.close/dt.f.pre_close - 1)]\n", 503 | "data = data[:, :, dt.sort(dt.f.date, -dt.f.ret)]\n", 504 | "\n", 505 | "# Compute the total turnover for each day\n", 506 | "data[:, dt.update(\n", 507 | " daily_amount = dt.sum(dt.f.amount)\n", 508 | " ), dt.by(dt.f.date)]\n", 509 | "\n", 510 | "# Sum the turnover of the 10 biggest gainers and the 10 biggest losers each day\n", 511 | "data = dt.rbind(data[:10, {\"amount\": dt.sum(dt.f.amount),\n", 512 | " \"daily_amount\": dt.f.daily_amount,\n", 513 | " \"tag\": \"top\"\n", 514 | " }, dt.by(dt.f.date)], \n", 515 | " data[-10:,{\"amount\": dt.sum(dt.f.amount),\n", 516 | " \"daily_amount\": dt.f.daily_amount,\n", 517 | " \"tag\": \"bottom\"\n", 518 | " }, dt.by(dt.f.date)])\n", 519 | "\n", 520 | "# Deduplicate, then compute each group's share of the daily turnover\n", 521 | "data = data[:, :, dt.sort(dt.f.date, dt.f.tag)]\n", 522 | "data = data[0, :, dt.by(dt.f.date, dt.f.tag)]\n", 523 | "data[:, {\"ratio\": dt.f.amount/dt.f.daily_amount}, dt.by(dt.f.date, dt.f.tag)][:5, :]" 524 | ] 525 | }, 526 | { 527 | "cell_type": "markdown", 528 | "metadata": {}, 529 | "source": [ 530 | "## 9. How many stocks hit limit-up at the open and at the close each day? (Limit-up is defined here as a return above 1.5%.)" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": 6, 536 | "metadata": {}, 537 | "outputs": [ 538 | { 539 | "data": { 540 | "text/html": [ 541 | "
\n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | "
| | date | up_tag | symbol |
| --- | --- | --- | --- |
| 0 | 20120104 | close_stop | 70 |
| 1 | 20120104 | open_stop | 325 |
| 2 | 20120105 | close_stop | 60 |
| 3 | 20120105 | open_stop | 27 |
| 4 | 20120106 | close_stop | 743 |
\n", 555 | " \n", 558 | "
\n" 559 | ], 560 | "text/plain": [ 561 | "" 562 | ] 563 | }, 564 | "execution_count": 6, 565 | "metadata": {}, 566 | "output_type": "execute_result" 567 | } 568 | ], 569 | "source": [ 570 | "# Tag stocks whose open or close return exceeds 1.5%\n", 571 | "data[:, dt.update(\n", 572 | " close_stop = dt.ifelse(dt.f.close/dt.f.pre_close - 1 > 0.015, \"close_stop\", \"others\"),\n", 573 | " open_stop = dt.ifelse(dt.f.open/dt.f.pre_close - 1 > 0.015, \"open_stop\", \"others\")\n", 574 | " )]\n", 575 | "\n", 576 | "data1 = data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.close_stop)\n", 577 | " ][dt.f.close_stop != \"others\", :]\n", 578 | "data1.names = {\"close_stop\": \"up_tag\"}\n", 579 | "\n", 580 | "data2 = data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.open_stop)\n", 581 | " ][dt.f.open_stop != \"others\", :]\n", 582 | "data2.names = {\"open_stop\": \"up_tag\"}\n", 583 | "\n", 584 | "data = dt.rbind(data1, data2)\n", 585 | "data = data[:, :, dt.sort(dt.f.date)]\n", 586 | "data[:5, :]" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": {}, 593 | "outputs": [], 594 | "source": [] 595 | } 596 | ], 597 | "metadata": { 598 | "interpreter": { 599 | "hash": "be7c5096a73329a2b1a0393d63cb94dd70da5883b6836fb3dfbdedd6f96e7abc" 600 | }, 601 | "kernelspec": { 602 | "display_name": "Python 3.8.6 64-bit", 603 | "language": "python", 604 | "name": "python3" 605 | }, 606 | "language_info": { 607 | "codemirror_mode": { 608 | "name": "ipython", 609 | "version": 3 610 | }, 611 | "file_extension": ".py", 612 | "mimetype": "text/x-python", 613 | "name": "python", 614 | "nbconvert_exporter": "python", 615 | "pygments_lexer": "ipython3", 616 | "version": "3.7.3" 617 | }, 618 | "orig_nbformat": 4 619 | }, 620 | "nbformat": 4, 621 | "nbformat_minor": 2 622 | } 623 | -------------------------------------------------------------------------------- /data/stock-market-data.rds: 
-------------------------------------------------------------------------------- 
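https://raw.githubusercontent.com/Ravin515/r-data-practice/b46ed24aaa01c84db727f2df200ee27dace38f9e/data/stock-market-data.rds
--------------------------------------------------------------------------------
/dataset-and-questions.md:
--------------------------------------------------------------------------------
# Stock Market Data

## Data Overview

| Column       | Description                       |
| ------------ | --------------------------------- |
| `symbol`     | Stock symbol                      |
| `date`       | Trading day                       |
| `pre_close`  | Previous close                    |
| `open`       | Open price                        |
| `high`       | High price                        |
| `low`        | Low price                         |
| `close`      | Close price                       |
| `volume`     | Trading volume                    |
| `amount`     | Turnover (traded value)           |
| `adj_factor` | Adjustment factor                 |
| `capt`       | Free-float market capitalization  |
| `index_w50`  | Weight in the SSE 50 index        |
| `index_w300` | Weight in the CSI 300 index       |
| `index_w500` | Weight in the CSI 500 index       |
| `industry`   | Industry classification           |

* Symbols (`symbol`) ending in `.SH` are Shanghai-listed stocks; those ending in `.SZ` are Shenzhen-listed stocks.
* A stock's daily return can be computed as `close / pre_close - 1`, where `pre_close` is the previous close adjusted for ex-rights and ex-dividend events, as available before the market opens that day.
* If a stock is not a constituent of an index, its weight in that index is 0.
* When computing market or industry indices or returns, use free-float market capitalization (`capt`) as the weight. The market return is the capitalization-weighted return of all stocks, and an industry return is the capitalization-weighted return of all stocks in that industry.
* Daily price and volume data cannot identify limit-up or limit-down status precisely, so a price change of 9.9% is used as an approximation.

## Data Import

```r
library(data.table)
data <- readRDS("data/stock-market-data.rds")
```

## Exercises

1. Which stocks have the digit 8 in their symbol?
2. How many stocks rise and how many fall each day?
3. How many stocks rise and how many fall on each exchange each day?
4. Among CSI 300 constituents, how many stocks rise and how many fall each day?
5. How many stocks are there in each industry each day?
6. Is the industry with the most stocks always the same as the industry with the largest total turnover?
7. How many stocks rise by more than 5% and how many fall by more than 5% each day?
8. Each day, what is the ratio of the total turnover of the 10 biggest gainers to that of the 10 biggest losers?
9. How many stocks hit limit-up at the open and at the close each day? (Limit-up is defined here as a return above 1.5%.)
10. For each day, how many stocks hit limit-up at the open, and how many hit limit-down at the open, at any point in the most recent 3 days?
11. How is each stock's daily rate of change in turnover correlated with its return?
12. How is each industry's daily rate of change in total turnover correlated with the industry return?
13. How is the market's daily rate of change in total turnover correlated with the market return?
14. How is the market's daily rate of change in total turnover correlated with the standard deviation of all stock returns?
15. How is each industry's daily rate of change in total turnover correlated with the standard deviation of stock returns within that industry?
16. Among the SSE 50, CSI 300, and CSI 500 constituents, how many are Shanghai-listed and how many are Shenzhen-listed?
17. Among the SSE 50, CSI 300, and CSI 500 constituents, what is the industry distribution?
18. Each day, what is the total turnover of the SSE 50, CSI 300, and CSI 500 constituents respectively?
19. What is the historical volatility of the daily returns of the SSE 50, CSI 300, and CSI 500 indices?
20. What is the correlation matrix of the daily returns of the SSE 50, CSI 300, and CSI 500 indices?
21. What is the correlation matrix of the daily returns of the SSE 50, the CSI 300, and the CSI 300 excluding SSE 50 constituents?
22. Each day, which 10 stocks have the largest weights in the CSI 300 index?
23. How do the industries rank, from largest to smallest, by average daily number of stocks?
24. In each industry each day, what is the symbol of the stock with the largest turnover?
25. In each industry each day, how many times the smallest turnover is the largest turnover?
26. In each industry each day, which 5 stocks have the largest turnover, and what is their combined turnover?
27. In each industry each day, what is the average return of the stocks whose turnover exceeds the 80th percentile of turnover in that industry?
28. What is the correlation between the daily average return of the top 10% of stocks by turnover and that of the bottom 10% of stocks by turnover?
29. Each day, which industries have an average turnover above the market-wide average turnover?
30. Each day, what is each stock's excess return over the market?
31. Each day, what is each stock's excess return over the market excluding the stock itself?
32. Each day, what is each stock's excess return over its industry?
33. Each day, what is each stock's excess return over its industry excluding the stock itself?
34. For each stock, how is its daily excess return over the market correlated with its daily excess return over its industry?
35. Each day, which industries have an average return above the market average return?
36. Each day, what is each industry's excess return over the market?
37. Each day, what is each industry's excess return over the market excluding that industry?
38. Each day, how many stocks have risen, and how many have fallen, on each of the last 3 consecutive trading days?
39. Each day, how many stocks have beaten the same-day market average return on each of the last 3 consecutive trading days?
40. Each day, how many stocks have beaten the same-day market average return on at least 4 of the last 5 trading days?
41. In each month, which stocks have a monthly return that exceeds the market's monthly return by 100% or more?
42. In each month, which stocks have a monthly return that exceeds their industry's monthly return by 100% or more?
43. Which 10 stocks have the highest correlation between their return and the market return?
44. What is the historical volatility of each industry's daily return? (Use the standard deviation of daily returns.)
45. What is the correlation matrix of the industries' daily returns? Which two industries are the most correlated, and which two the least?
46. How do the industries rank, from highest to lowest, by the correlation of their return with the market return?
47. In each month, which industry has the largest decline in total turnover relative to the previous month?
48. What is the maximum drawdown of each stock in the data? (Maximum drawdown is the largest decline from a high point to a low point.)
49. What is each stock's win rate? (Win rate is the probability that the daily return is positive.)
50. What is each stock's profit/loss ratio? (Profit/loss ratio is the absolute value of the ratio of the sum of positive returns to the sum of negative returns.)
51. What is the market's win rate? (The probability that the market return is positive.)
52. What is the market's profit/loss ratio? (The ratio of the capitalization-weighted positive returns to the capitalization-weighted negative returns of all stocks in the market.)
53. What is each industry's win rate?
54. What is each industry's profit/loss ratio? (The ratio of the capitalization-weighted positive returns to the capitalization-weighted negative returns of all stocks in the industry.)
55. Is there any stock whose monthly turnover exceeds its industry's total turnover on some single day of that month?
56. Each day, how many stocks enter and how many leave each industry?
57. Each day, what is the standard deviation of stock returns within each industry?
58. How are the daily within-industry standard deviations of stock returns correlated?
59. Each day, compute the z-score of turnover (subtract the mean, then divide by the standard deviation). What proportion of the next day's stock excess returns can this indicator explain?
60. Regressing each stock's return on the CSI 300 and CSI 500 index returns gives an intercept and 2 betas. How are these two betas distributed?
61. Each day, what proportion of the 100 stocks with the largest gain from the open to the day's high are also among the 100 stocks with the largest full-day gain (previous close to close)?
62. Each day, which stocks ranked in that day's top 100 by excess return over the market on each of the last three days?
63. Each day, which stocks ranked in the top 30% of their industry by excess return over the industry on each of the last three days?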
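
The data notes above define a stock's daily return as `close / pre_close - 1` and the market return as the `capt`-weighted average of stock returns. The following is a minimal sketch of both definitions in plain Python, not the repository's own solution. The rows, the symbols in them, and the helper names `daily_return` and `market_returns` are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict

# Made-up sample rows for illustration only: (symbol, date, pre_close, close, capt)
rows = [
    ("000001.SZ", 20120104, 10.0, 10.5, 200.0),
    ("600000.SH", 20120104, 20.0, 19.0, 100.0),
    ("600519.SH", 20120104, 50.0, 50.0, 100.0),
]


def daily_return(pre_close, close):
    """Daily return as defined in the data notes: close / pre_close - 1."""
    return close / pre_close - 1


def market_returns(rows):
    """Map each date to the capt-weighted average of that day's stock returns."""
    weighted_sum = defaultdict(float)  # sum of capt * return, per date
    total_capt = defaultdict(float)    # sum of capt, per date
    for _symbol, date, pre_close, close, capt in rows:
        weighted_sum[date] += capt * daily_return(pre_close, close)
        total_capt[date] += capt
    return {date: weighted_sum[date] / total_capt[date] for date in weighted_sum}


market = market_returns(rows)

# Excess return of each stock over the market on the same day.
excess = {
    (symbol, date): daily_return(pre_close, close) - market[date]
    for symbol, date, pre_close, close, _capt in rows
}
```

With these sample rows the weighted average is (200 × 0.05 + 100 × (−0.05) + 100 × 0) / 400 = 0.0125, so the first stock's excess return is 0.0375. The same grouping idea carries over to industry returns by keying on `(date, industry)` instead of `date`.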
--------------------------------------------------------------------------------
/img/answer-keys.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ravin515/r-data-practice/b46ed24aaa01c84db727f2df200ee27dace38f9e/img/answer-keys.png
--------------------------------------------------------------------------------