├── .gitignore ├── README.md ├── backup ├── ff3_residual_std.bak └── fill_mixin_old.py ├── bin ├── backtest.sh ├── evaluate.sh ├── install.sh ├── pipeline.sh ├── prepare_factor.sh ├── train.sh └── tushare.sh ├── conf └── config.sample.yml ├── jupyter ├── A股数据研究.ipynb ├── 完整流程.ipynb ├── 数据分析.ipynb ├── 数据研究.ipynb └── 未命名.ipynb ├── mlstock ├── __init__.py ├── const.py ├── data │ ├── __init__.py │ ├── data_filter.py │ ├── data_loader.py │ ├── datasource.py │ ├── stock_data.py │ └── stock_info.py ├── factors │ ├── README.md │ ├── __init__.py │ ├── alpha_beta.py │ ├── balance_sheet.py │ ├── cashflow.py │ ├── daily_indicator.py │ ├── factor.py │ ├── fama │ │ ├── README.md │ │ ├── analysis.py │ │ └── fama_model.py │ ├── ff3_residual_std.py │ ├── finance_indicator.py │ ├── income.py │ ├── kdj.py │ ├── macd.py │ ├── mixin │ │ ├── __init__.py │ │ ├── fill_mixin.py │ │ └── ttm_mixin.py │ ├── old │ │ ├── BM.py │ │ ├── README.md │ │ ├── ROA.py │ │ ├── ROE.py │ │ ├── __init__.py │ │ ├── assets_debt_ratio.py │ │ ├── clv.py │ │ ├── dividend_rate.py │ │ ├── ebitda.py │ │ ├── ep.py │ │ ├── market_value.py │ │ ├── momentum.py │ │ └── peg.py │ ├── psy.py │ ├── returns.py │ ├── rsi.py │ ├── stake_holder.py │ ├── std.py │ ├── turnover.py │ └── turnover_return.py ├── ml │ ├── README.md │ ├── __init__.py │ ├── backtest.py │ ├── backtests │ │ ├── __init__.py │ │ ├── backtest_backtrader.py │ │ ├── backtest_deliberate.py │ │ ├── backtest_simple.py │ │ ├── broker.py │ │ ├── metrics.py │ │ ├── ml_strategy.py │ │ └── timing.py │ ├── data │ │ ├── __init__.py │ │ ├── factor_conf.py │ │ └── factor_service.py │ ├── evaluate.py │ ├── prepare_factor.py │ ├── train.py │ └── trains │ │ ├── __init__.py │ │ ├── train_action.py │ │ ├── train_pct.py │ │ └── train_winloss.py ├── research │ ├── __init__.py │ ├── backtest_select_top_n.py │ ├── prepare_train_backtest_for_one_factor.py │ ├── test_adf_kpss.py │ └── train_backtest_for_each_factor.py └── utils │ ├── __init__.py │ ├── data_utils.py │ ├── db_utils.py 
│   ├── df_utils.py
│   ├── dynamic_loader.py
│   ├── industry_neutral.py
│   ├── multi_processor.py
│   └── utils.py
├── nohup.out
├── requirement.txt
├── setup.py
└── test
    ├── __init__.py
    ├── toy
    │   ├── __init__.py
    │   └── test_pandas.py
    └── tushare_debug.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | debug/
2 | conf/config.yml
3 | model/
4 | .venv/
5 | /data/
6 | .idea/
7 | __pycache__/
8 | *.pyc
9 | .ipynb_checkpoints/
10 | .DS_Store
11 | logs
12 | ._.DS_Store
13 | .DS_Store
14 | nohup.out
15 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Preface
2 |
3 | Thanks to the Bilibili creator [致敬大神](https://www.bilibili.com/video/BV1564y1b7PR); this project builds on her reproduction of Huatai Securities' financial engineering research reports.
4 |
5 | She has already published complete code, but I want to build on her work and turn it into a machine learning project that can run with a single command, so it runs more efficiently.
6 |
7 | By reorganizing, understanding, and debugging her code, I also hope to better understand and master the application of machine learning to quantitative investing.
8 |
9 | This project is planned to run for 2 to 3 months; stay tuned.
10 |
11 |
12 |
13 | # Running the Program
14 |
15 | ## 1. Install the packages
16 |
17 | ```
18 | brew install ta-lib
19 | pip install -r requirement.txt
20 | ```
21 | If the installation is interrupted, install the remaining packages manually one by one.
22 |
23 | ## 2. Prepare the data
24 |
25 | This requires a large amount of data. You can use the earlier [tushare download program](https://github.com/piginzoo/mfm_learner/tree/main/mfm_learner/utils/tushare_download)
26 | to download all the data from tushare into a local database:
27 |
28 | ```
29 | git clone https://github.com/piginzoo/mfm_learner.git
30 | cd mfm_learner
31 | python -m utils.tushare_download.updator
32 | ```
33 | The whole process takes about 3 to 4 hours and pulls data from 2008-01-01 to the present. Data before 2008 has little reference value because of the share structure reform, so it is not used.
34 |
35 | ## 3. Run the whole pipeline in one step
36 |
37 | Run pipeline.sh to complete factor preparation, training, evaluation, and backtesting:
38 |
39 | ```shell
40 | bin/pipeline.sh
41 | ```
42 |
43 | The whole process takes 40 minutes to 1 hour, and finally generates data/plot.jpg, which shows the backtest results.
44 |
45 | ## 4. 
Run each step separately
46 |
47 | ### 4.1 Factor preparation
48 |
49 | For parameter details, see [prepare_factor.py](mlstock/ml/prepare_factor.py)
50 |
51 | ```shell
52 | python -m mlstock.ml.prepare_factor -in -s 20080101 -e 20220901
53 | ```
54 | ### 4.2 Training
55 |
56 | For parameter details, see [train.py](mlstock/ml/train.py)
57 |
58 | ```shell
59 | python -m mlstock.ml.train --train all --data data/processed_industry_neutral_20080101_20220901_20220828152110.csv
60 | ```
61 |
62 | ### 4.3 Metrics evaluation
63 |
64 | For parameter details, see [evaluate.py](mlstock/ml/evaluate.py)
65 |
66 | ```shell
67 | python -m mlstock.ml.evaluate \
68 |     -s 20190101 -e 20220901 \
69 |     -mp pct_ridge_20220828224004.model \
70 |     -mw winloss_xgboost_20220828224019.model \
71 |     -d processed_industry_neutral_20080101_20220901_20220828152110.csv
72 | ```
73 |
74 | ### 4.4 Backtesting
75 |
76 | For parameter details, see [backtest.py](mlstock/ml/backtest.py)
77 |
78 | ```shell
79 | python -m mlstock.ml.backtest \
80 |     -s 20190101 -e 20220901 \
81 |     -mp pct_ridge_20220828224004.model \
82 |     -mw winloss_xgboost_20220828224019.model \
83 |     -d processed_industry_neutral_20080101_20220901_20220828152110.csv
84 | ```
85 |
86 | # Development Plan
87 |
88 | - [X] Build a simple end-to-end loop: a few features plus linear regression
89 | - [X] Complete most of the features, analyze their data distributions, handle outliers, and clean the features
90 | - [X] Try multiple models, tune the training, and compare results
91 | - [X] Add layered backtesting to check the models' actual investment performance
92 | - [ ] Practice model interpretability
93 | - [ ] Try deep models such as AlphaNet
94 | - [ ] Pick the best model and take it live
95 |
96 | # Development Log
97 |
98 | 7.28
99 | - To keep indicators such as MACD from producing NaN, data for some earlier dates is preloaded (set to 35 after observation); after loading, rows before start_date are filtered out
100 | - Filtered out stocks whose financial indicators (for example cash_in_subtotal_finance and cash_in_subtotal_invest) are all NaN: 4077 => 3263, about 20%
101 |
102 | 8.6
103 | - Implemented TTM and fill processing
104 | - Improved the handling of the financial statement tables: cashflow, balance_sheet, income, and fina_indicator, each mapped to the corresponding tushare table
105 | - Implemented idiosyncratic volatility, beta, alpha, and other indicators
106 |
107 | 8.13
108 | - Implemented the Fama-French based idiosyncratic volatility std
109 | - Implemented various kinds of outlier handling and filling
110 |
111 | 8.28
112 | - Implemented factor neutralization based on "market cap + industry"
113 | - Implemented ridge regression training code for returns
114 | - Implemented xgboost training code for up/down classification
115 | - Implemented evaluation metrics for the ridge regression and the xgboost classification
116 | - Implemented backtest analysis of returns, cumulative returns, and the benchmark
117 |
118 | 9.2
119 | - Checked every factor for look-ahead bias by training on and observing the factors one by one
120 | - Rewrote the Fama-French idiosyncratic volatility std; the previous version had a look-ahead bias problem
121 | - 
Completed the one-command run covering factors, training, metrics, and backtesting, while still supporting running each step separately, which is more flexible
122 | - Implemented several backtest business metrics
123 |
124 | 9.7
125 | - Implemented a new conservative backtest that buys at the next day's price and takes price limits into account
126 | - Implemented simple market timing based on the broad index; it turned out to perform worse
127 |
128 | # Data Processing Tricks
129 |
130 | Some special data handling details are worth recording:
131 | - First, stocks whose daily_basic fields 'total_mv', 'pe_ttm', 'ps_ttm', 'pb' are more than 80% missing were filtered out
132 | - Then, for the remaining stocks, those daily_basic fields were forward filled by date
133 | - The financial data (income, balance_sheet, cashflow, finance_indicator) is all divided by the total market cap total_mv to normalize it
134 | - Compared with 神仔, I did not do industry neutralization or PCA; not because it is hard, I just did not feel like doing it: it takes time, and I remember from one of her videos that she also skipped it later because it was slow; I will revisit this when I have time
--------------------------------------------------------------------------------
/backup/fill_mixin_old.py:
--------------------------------------------------------------------------------
1 | from mlstock.utils import utils
2 |
3 | import logging
4 |
5 | from mlstock.utils.utils import logging_time
6 |
7 | logger = logging.getLogger(__name__)
8 |
9 |
10 | class FillMixin:
11 |
12 |     @logging_time("间隔填充")
13 |     def fill(self, df_stocks, df_finance, finance_column_names):
14 |         """
15 |         将离散的单个的财务TTM信息,反向填充到周频的交易数据里去,
16 |         比如周频,日期为7.22(周五),那对应的财务数据是6.30日的,
17 |         所以要用6.30的TTM数据去填充到这个7.22日上去。
18 |
19 |         这个是配合TTM使用,TTM对公告日期进行了TTM,但只有公告日当天的,
20 |         但是,我们需要的是周频的那天的TTM,考虑到财报的滞后性,
21 |         某天的TTM值,'合理'的是他之前的最后一个公告日对应的TTM。
22 |         所以,才需要这个fill函数,来填充每一个周五的TTM。
23 |
24 |         :param df_stocks: df是原始的周频数据,以周五的日期为准
25 |         :param df_finance: 财务数据,只包含财务公告日期,要求之前已经做了ttm数据
26 |         """
27 |         # 按照公布日期倒序排(日期从新到旧),<=== '倒'序很重要,这样才好找到对应日的财务数据
28 |         df_finance = df_finance.sort_values('ann_date', ascending=False)
29 |         df = df_stocks.groupby(by=['ts_code', 'trade_date']).apply(self.handle_one_stock,
30 |                                                                    df_finance,
31 |                                                                    finance_column_names)
32 |         return df
33 |
34 |     # 这个ffill方法很酷,但是这里不适用,主要是因为我们基于df_stock的日期来作为基准,但是这个df_stock这里是df_weekly
35 |     # 只有周五的日期,所以,导致,无法和df_finance做join(merge),
36 |     # 之前别的地方可以这样,是因为,那个用的是市场交易日,每天都有,不会落下,这样df_finance的日子总是可以join上的
37 |     # def fill_one_stock(self, df_one_stock, finance_column_names):
38 |     #     if type(finance_column_names) != list: finance_column_names = [finance_column_names]
39 |     #     
df_one_stock = df_one_stock[['ts_code', 'trade_date'] + finance_column_names] 40 | # df_one_stock = df_one_stock.fillna(method='ffill') 41 | # return df_one_stock 42 | 43 | # @logging_time 44 | def handle_one_stock(self, df_stock, df_finance, finance_column_names): 45 | """ 46 | 处理一只股票得财务数据填充 47 | :param df: 一只股票的数据的周频的一期数据 48 | """ 49 | # 只是用股票代码和日期信息 50 | df_stock = df_stock[['ts_code', 'trade_date']] 51 | df_stock[finance_column_names] = self.__find_nearest_values(df_stock, df_finance, finance_column_names) 52 | return df_stock 53 | 54 | def __find_nearest_values(self, df_stock, df_finance, finance_column_names): 55 | """ 56 | 找到离我最旧最近的一天的财务数据作为我当天的财务数据, 57 | 举例:当前是8.15,找到离我最近的是7.31号发布的6.30号的半年数据,那么就用这个数据,作为我(8.15)的财务数据 58 | :param df_stock: 59 | :param df_finance: 60 | :param finance_column_names: 61 | :return: 62 | """ 63 | # 这只股票的代码 64 | ts_code = df_stock.iloc[0]['ts_code'] 65 | # 这只股票当前的日期 66 | trade_date = df_stock.iloc[0]['trade_date'] 67 | # 过滤出这只股票的财务数据 68 | df_finance = df_finance[df_finance['ts_code'] == ts_code] 69 | 70 | # 按照日子挨个查找,找到离我最旧最近的公布日,返回其公布日对应的财务数据 71 | """ 72 | trade_date ann_date 73 | 2020.7.08 2020.7.25 74 | --->2020.7.15 2020.7.10 75 | 2020.7.23 76 | 2020.8.1 77 | ... 
78 |         如同这个例子,要用2020.7.25的值去填充2020.8.1;用2020.7.10的值,去填充7.15和7.23的
79 |         算法实例:当前日是2020.7.15,我去遍历ann_date(降序),发现trade_date比ann_date大,就停下来,用这天的财务数据
80 |         所以,要求df_finance已经按ann_date降序排好(见fill中的sort_values)
81 |         """
82 |
83 |         for _, row in df_finance.iterrows():
84 |             if trade_date >= row['ann_date']:
85 |                 # import pdb;pdb.set_trace()
86 |                 # logger.debug("找到股票[%s]的周频日期[%s]的财务日为[%s]的填充数据", ts_code, trade_date, row['ann_date'])
87 |                 return row[finance_column_names]
88 |         return None
89 |
90 |
91 | # python -m mlstock.factors.mixin.fill_mixin
92 | if __name__ == '__main__':
93 |     utils.init_logger(file=False)
94 |
95 |     start_date = '20150703'
96 |     end_date = '20190826'
97 |     stocks = ['600000.SH', '002357.SZ', '000404.SZ', '600230.SH']
98 |     finance_column_names = ['basic_eps', 'diluted_eps']
99 |
100 |     import tushare as ts
101 |     import pandas as pd
102 |
103 |     pro = ts.pro_api()
104 |
105 |     df_finance_list = []
106 |     df_stock_list = []
107 |     for ts_code in stocks:
108 |         df_finance = pro.income(ts_code=ts_code, start_date=start_date, end_date=end_date,
109 |                                 fields='ts_code,ann_date,f_ann_date,end_date,report_type,comp_type,basic_eps,diluted_eps')
110 |         df_finance_list.append(df_finance)
111 |         df_stock = pro.weekly(ts_code=ts_code, start_date=start_date, end_date=end_date)
112 |         df_stock_list.append(df_stock)
113 |     df_finance = pd.concat(df_finance_list)
114 |     df_stock = pd.concat(df_stock_list)
115 |     logger.info("原始数据:\n%r\n%r", df_stock, df_finance)
116 |     df = FillMixin().fill(df_stock, df_finance, finance_column_names)
117 |     logger.info("填充完数据:\n%r", df)
118 |
--------------------------------------------------------------------------------
/bin/backtest.sh:
--------------------------------------------------------------------------------
1 | function elapse(){
2 |     duration=$SECONDS
3 |     echo "耗时 $(($duration / 60)) 分 $(($duration % 60)) 秒."
4 | }
5 |
6 | echo "准备开始回测..."
7 | 8 | DATA_FILE=data/`ls -1rt data/|grep factor|tail -n 1` 9 | PCT_MODEL_FILE=model/`ls -1rt model/|grep pct_ridge|tail -n 1` 10 | WINLOSS_MODEL_FILE=model/`ls -1rt model/|grep winloss|tail -n 1` 11 | 12 | echo " 使用最新的数据文件:$DATA_FILE" 13 | echo " 使用最新的收益模型:$PCT_MODEL_FILE" 14 | echo " 使用最新的涨跌模型:$WINLOSS_MODEL_FILE" 15 | 16 | SECONDS=0 17 | python -m mlstock.ml.backtest \ 18 | -s 20090101 -e 20190101 \ 19 | -mp $PCT_MODEL_FILE \ 20 | -mw $WINLOSS_MODEL_FILE \ 21 | -d $DATA_FILE #>./logs/console.backtest.log 2>&1 22 | echo "回测20090101~20190101结束" 23 | elapse 24 | 25 | SECONDS=0 26 | python -m mlstock.ml.backtest \ 27 | -s 20190101 -e 20220901 \ 28 | -mp $PCT_MODEL_FILE \ 29 | -mw $WINLOSS_MODEL_FILE \ 30 | -d $DATA_FILE #>./logs/console.backtest.log 2>&1 31 | echo "回测20190101~20220901结束" 32 | elapse -------------------------------------------------------------------------------- /bin/evaluate.sh: -------------------------------------------------------------------------------- 1 | function elapse(){ 2 | duration=$SECONDS 3 | echo "耗时 $(($duration / 60)) 分 $(($duration % 60)) 秒." 4 | } 5 | 6 | echo "准备开始指标评测..." 
7 | 8 | DATA_FILE=data/`ls -1rt data/|grep factor|tail -n 1` 9 | PCT_MODEL_FILE=model/`ls -1rt model/|grep pct_ridge|tail -n 1` 10 | WINLOSS_MODEL_FILE=model/`ls -1rt model/|grep winloss|tail -n 1` 11 | 12 | echo " 使用最新的数据文件:$DATA_FILE" 13 | echo " 使用最新的收益模型:$PCT_MODEL_FILE" 14 | echo " 使用最新的涨跌模型:$WINLOSS_MODEL_FILE" 15 | 16 | SECONDS=0 17 | python -m mlstock.ml.evaluate \ 18 | -s 20090101 -e 20190101 \ 19 | -mp $PCT_MODEL_FILE \ 20 | -mw $WINLOSS_MODEL_FILE \ 21 | -d $DATA_FILE #>./logs/console.evaluate.log 2>&1 22 | echo "20090101~20190101,指标评测结束" 23 | elapse 24 | 25 | SECONDS=0 26 | python -m mlstock.ml.evaluate \ 27 | -s 20190101 -e 20220901 \ 28 | -mp $PCT_MODEL_FILE \ 29 | -mw $WINLOSS_MODEL_FILE \ 30 | -d $DATA_FILE #>./logs/console.evaluate.log 2>&1 31 | echo "20190101~20220901,指标评测结束" 32 | elapse -------------------------------------------------------------------------------- /bin/install.sh: -------------------------------------------------------------------------------- 1 | python setup.py install && python setup.py clean -------------------------------------------------------------------------------- /bin/pipeline.sh: -------------------------------------------------------------------------------- 1 | # ----------------------------------------------------------------------------------------------- 2 | # 这是一个一站式的训练,我最开始是想做成一体式的,或者叫一键式的,即从因子数据生成、清晰、训练、评测、回测一口气都完成, 3 | # 后来发现,这样会很不灵活,比如我想调试、或者重新生成模型,都变得困难,而且,当跑全数据的时候,整个过程极其漫长, 4 | # 所以,就拆成了目前的几段分别做的:prepare_data.py, train.py, evaluate.py, backtest.py, 5 | # 然后用这个批处理,来串起来,实现我之前想的一键式训练至回测过程。 6 | # 运行: 7 | # bin/pipeline.sh 8 | # 调试模式:bin/pipeline.sh 50 # 仅仅使用50只股票的数据 9 | # 10 | # ----------------------------------------------------------------------------------------------- 11 | 12 | . bin/prepare_factor.sh $1 # $1 是训练数据的行数,调试时候使用,正式运行不用 13 | . bin/train.sh all 14 | . bin/evaluate.sh 15 | . 
bin/backtest.sh -------------------------------------------------------------------------------- /bin/prepare_factor.sh: -------------------------------------------------------------------------------- 1 | # prepare_factor.sh 100 (调试模式,只加载100只股票) 2 | function elapse(){ 3 | duration=$SECONDS 4 | echo "耗时 $(($duration / 60)) 分 $(($duration % 60)) 秒." 5 | } 6 | 7 | echo "准备开始因子生成..." 8 | SECONDS=0 9 | if [ "$1" != "" ] 10 | then 11 | echo "[ 调试模式 ]" 12 | python -m mlstock.ml.prepare_factor -in -s 20080101 -e 20220901 -n $1 # >./logs/console.data.log 2>&1 13 | else 14 | python -m mlstock.ml.prepare_factor -in -s 20080101 -e 20220901 # >./logs/console.data.log 2>&1 15 | fi 16 | DATA_FILE=data/`ls -1rt data/|grep .csv|tail -n 1` 17 | echo "因子生成结束,生成的因子文件为:$DATA_FILE" 18 | elapse -------------------------------------------------------------------------------- /bin/train.sh: -------------------------------------------------------------------------------- 1 | function elapse(){ 2 | duration=$SECONDS 3 | echo "耗时 $(($duration / 60)) 分 $(($duration % 60)) 秒." 4 | } 5 | 6 | if [ "$1" == "" ] 7 | then 8 | MODEL=all 9 | else 10 | MODEL=$1 11 | fi 12 | 13 | echo "准备开始训练..." 14 | 15 | DATA_FILE=data/`ls -1rt data/|grep factor|tail -n 1` 16 | 17 | echo " 使用最新的数据文件:$DATA_FILE" 18 | 19 | SECONDS=0 20 | python -m mlstock.ml.train \ 21 | -s 20090101 -e 20190101 \ 22 | -t $MODEL \ 23 | -d $DATA_FILE # >./logs/console.train.log 2>&1 24 | echo "训练结束" 25 | elapse -------------------------------------------------------------------------------- /bin/tushare.sh: -------------------------------------------------------------------------------- 1 | echo "快捷调试tushare的函数..." 
2 | python -m test.tushare_debug -------------------------------------------------------------------------------- /conf/config.sample.yml: -------------------------------------------------------------------------------- 1 | dateformat: '%Y%m%d' 2 | database: 3 | uid: 'root' 4 | pwd: '123456' 5 | db: 'tushare' 6 | host: '127.0.0.1' 7 | port: 3306 8 | 9 | -------------------------------------------------------------------------------- /jupyter/未命名.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "95657070", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [] 10 | } 11 | ], 12 | "metadata": { 13 | "kernelspec": { 14 | "display_name": "Python 3 (ipykernel)", 15 | "language": "python", 16 | "name": "python3" 17 | }, 18 | "language_info": { 19 | "codemirror_mode": { 20 | "name": "ipython", 21 | "version": 3 22 | }, 23 | "file_extension": ".py", 24 | "mimetype": "text/x-python", 25 | "name": "python", 26 | "nbconvert_exporter": "python", 27 | "pygments_lexer": "ipython3", 28 | "version": "3.8.10" 29 | }, 30 | "toc": { 31 | "base_numbering": 1, 32 | "nav_menu": {}, 33 | "number_sections": true, 34 | "sideBar": true, 35 | "skip_h1_title": false, 36 | "title_cell": "Table of Contents", 37 | "title_sidebar": "Contents", 38 | "toc_cell": false, 39 | "toc_position": {}, 40 | "toc_section_display": true, 41 | "toc_window_display": false 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 5 46 | } 47 | -------------------------------------------------------------------------------- /mlstock/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/__init__.py -------------------------------------------------------------------------------- /mlstock/const.py: 
-------------------------------------------------------------------------------- 1 | EALIEST_DATE='20080101' 2 | STOCK_IPO_YEARS = 1 # 至少上市1年的股票 3 | CONF_PATH = "conf/config.yml" 4 | RESERVED_PERIODS = 50 # 预留50周的数据,目前看到的需要最长预留的是MACD:35,但是中间有各种假期、节日啥的,所以,预留40不够,改到50了 5 | CODE_DATE = ['ts_code','trade_date'] # 定义一个最常用的取得数据集的 ts_code和 trade_date 的列名 6 | TARGET = ['target'] 7 | TRAIN_TEST_SPLIT_DATE = '20190101' # 用来分割Train和Test的日期 8 | BASELINE_INDEX_CODE = "000300.SH" # 用于计算对比用的基准指数代码,目前是沪深300 9 | TOP_30 = 30 10 | RISK_FREE_ANNUALLY_RETRUN = 0.03 # 在我国无风险收益率一般取值十年期国债收益,我查了一下有波动,取个大致的均值3% -------------------------------------------------------------------------------- /mlstock/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/data/__init__.py -------------------------------------------------------------------------------- /mlstock/data/data_filter.py: -------------------------------------------------------------------------------- 1 | """ 2 | 只保留非ST,主板,中小板,剔除(ST、创业板、科创板、北交所、B股) 3 | ``` 4 | - 证券交易所的地点:沪市是上海;深市是深圳;北交所是北京; 5 | - 板块不同:沪市只有主板与B股;深市有主板、中小板、创业板和B股。 6 | - 股票代码不同:沪市主板是60开头,B股是900开头;深市主板是000开头,中小板是002开头、创业板是300开头、B股是200开头 7 | - 新股上市首日不设涨跌幅限制(沪深两市股票上市首日不设涨跌幅,创业板、科创板新股上市前五个交易日不设涨跌幅)。 8 | - 科创板股票竞价交易实行价格涨跌幅限制,涨跌幅比例为20%。 首次公开发行上市的科创板股票,上市后的前5个交易日不设价格涨跌幅限制 9 | ``` 10 | """ 11 | from mlstock import const 12 | from mlstock.const import STOCK_IPO_YEARS 13 | from mlstock.data.datasource import DataSource 14 | from mlstock.utils import utils 15 | import logging 16 | import pandas as pd 17 | 18 | logger = logging.getLogger(__name__) 19 | 20 | 21 | def filter_BJ_Startup_B(df_stocks): 22 | df = df_stocks[(df_stocks.market == "主板") | (df_stocks.market == "中小板")] 23 | logger.info("从[%d]只过滤掉非主板、非中小板股票后,剩余[%d]只股票", len(df_stocks), len(df)) 24 | return df 25 | 26 | 27 | def filter_stocks(least_year=STOCK_IPO_YEARS): 28 | """ 29 | 
用于全市场过滤股票,默认的标准是:至少上市1年的 30 | 从stock_basic + daily_basic,合并,原因是daily_basic有市值信息 31 | max_circ_mv: 最大流动市值,单位是亿 32 | """ 33 | datasource = DataSource() 34 | df_stock_basic = datasource.stock_basic() 35 | 36 | total_amount = len(df_stock_basic) 37 | logger.info("加载[%d]只股票的基础信息(basic)", total_amount) 38 | 39 | df_stock_basic = filter_unlist(df_stock_basic, total_amount) 40 | 41 | df_stock_basic = filter_by_years(df_stock_basic, end_date=utils.today(), least_year=least_year) 42 | 43 | df_stock_basic = filter_ST(df_stock_basic) 44 | 45 | df_stock_basic = filter_BJ_Startup_B(df_stock_basic) 46 | 47 | return df_stock_basic 48 | 49 | 50 | def filter_ST(df_stocks): 51 | # 剔除ST股票 52 | total_amount = len(df_stocks) 53 | df_stocks = df_stocks[~df_stocks['name'].str.contains("ST")] 54 | logger.info("过滤掉ST的[%d]只股票后,剩余[%d]只股票", total_amount - len(df_stocks), len(df_stocks)) 55 | return df_stocks 56 | 57 | 58 | def filter_by_years(df_stocks, end_date, least_year): 59 | df_stocks['list_date'] = pd.to_datetime(df_stocks['list_date'], format='%Y%m%d') 60 | df_stocks['period'] = utils.str2date(end_date) - df_stocks['list_date'] 61 | df_stocks_more_years = df_stocks[df_stocks['period'] > pd.Timedelta(days=365 * least_year)] 62 | logger.info("过滤掉上市不到[%d]年[%d]只的股票,剩余[%d]只股票", least_year, len(df_stocks) - len(df_stocks_more_years), 63 | len(df_stocks_more_years)) 64 | return df_stocks_more_years 65 | 66 | 67 | def filter_unlist(df_stocks, total_amount): 68 | # print(df_stocks) 69 | df_stocks = df_stocks[df_stocks['list_status'] == 'L'] 70 | logger.info("过滤掉[%d]只不在市股票,剩余[%d]只股票", total_amount - len(df_stocks), len(df_stocks)) 71 | return df_stocks 72 | 73 | 74 | # python -m mlstock.data.data_filter 75 | if __name__ == '__main__': 76 | utils.init_logger(simple=True) 77 | print(filter_stocks()) 78 | -------------------------------------------------------------------------------- /mlstock/data/data_loader.py: -------------------------------------------------------------------------------- 1 | 
import logging
2 | import time
3 |
4 | import pandas as pd
5 | import numpy as np
6 | from mlstock import const
7 | from mlstock.data.stock_data import StockData
8 | from mlstock.utils import utils
9 | from mlstock.utils.utils import logging_time
10 |
11 | logger = logging.getLogger(__name__)
12 |
13 |
14 | def calculate_columns_missed_by_stock(df, columns):
15 |     """
16 |     按照股票代码,来算他的所有的特征中,NA的占比
17 |     返回的是一个包含了index为ts_code的NA占比值:
18 |     ts_code
19 |     000505.SZ    0.081690
20 |     000506.SZ    0.532946
21 |     000507.SZ    0.000000
22 |     000509.SZ    0.534615
23 |     000510.SZ    0.000000
24 |     """
25 |     assert 'ts_code' in df.columns
26 |
27 |     # 如果有na,count会不计数,利用这个特性来计算缺失比例
28 |     return (1 - df[columns].groupby('ts_code').apply(lambda d: d.count() / d.shape[0])).max(axis=1)
29 |
30 |
31 | @logging_time('加载日频、周频、基础数据')
32 | def load(datasource, stock_codes, start_date, end_date):
33 |     """从数据库加载数据,并做一些必要填充"""
34 |
35 |     # 多加载之前的数据,这样做是为了尽量不让技术指标,如MACD之类的出现NAN
36 |     original_start_date = start_date
37 |     start_date = utils.last_week(start_date, const.RESERVED_PERIODS)
38 |     logger.debug("开始加载 %s ~ %s 的股票数据(从真正开始日期%s预加载%d周)",
39 |                  start_date, end_date, original_start_date, const.RESERVED_PERIODS)
40 |
41 |     # 加日周频数据,虽然我们算的周频,但是有些地方需要日频数据
42 |     start_time = time.time()
43 |     df_daily_basic = __load(stock_codes, start_date, end_date, func=datasource.daily_basic)
44 |     # 把daily_basic中关键字段缺少比较多(>80%)的股票剔除掉
45 |     df_stock_nan_stat = calculate_columns_missed_by_stock(df_daily_basic,
46 |                                                           ['ts_code', 'trade_date', 'total_mv', 'pe_ttm', 'ps_ttm',
47 |                                                            'pb'])
48 |     nan_too_many_stocks = df_stock_nan_stat[df_stock_nan_stat > 0.8].index
49 |     if len(nan_too_many_stocks) > 0:
50 |         stock_codes = stock_codes[~stock_codes.isin(nan_too_many_stocks.tolist())]
51 |         # bugfix:要按ts_code列过滤行;直接对整个DataFrame用isin得到的是逐元素的布尔DataFrame,起不到剔除股票的作用
52 |         df_daily_basic = df_daily_basic[~df_daily_basic['ts_code'].isin(nan_too_many_stocks.tolist())]
53 |         logger.warning("由于daily_basic中的'total_mv','pe_ttm', 'ps_ttm', 'pb'缺失值超过80%%,导致%d只股票被剔除:%r",
54 |                        len(nan_too_many_stocks.tolist()),
55 |                        nan_too_many_stocks.tolist())
56 |     # 
把daily_basic的nan信息都fill上 56 | df_daily_basic = df_daily_basic.sort_values(['ts_code', 'trade_date']) 57 | df_daily_basic[['total_mv', 'pe_ttm', 'ps_ttm', 'pb']] = \ 58 | df_daily_basic.groupby('ts_code').ffill().bfill()[['total_mv', 'pe_ttm', 'ps_ttm', 'pb']] 59 | 60 | logger.info("加载[%d]只股票 %s~%s 的日频基础(basic)数据 %d 行,耗时%.0f秒", 61 | len(stock_codes), 62 | start_date, 63 | end_date, 64 | len(df_daily_basic), 65 | time.time() - start_time) 66 | 67 | # 加载周频数据 68 | df_weekly = __load(stock_codes, start_date, end_date, func=datasource.weekly) 69 | logger.info("加载[%d]只股票 %s~%s 的周频数据 %d 行,耗时%.0f秒", 70 | len(stock_codes), 71 | start_date, 72 | end_date, 73 | len(df_weekly), 74 | time.time() - start_time) 75 | 76 | # 加日频数据,虽然我们算的周频,但是有些地方需要日频数据 77 | start_time = time.time() 78 | df_daily = __load(stock_codes, start_date, end_date, func=datasource.daily) 79 | logger.info("加载[%d]只股票 %s~%s 的日频数据 %d 行,耗时%.0f秒", 80 | len(stock_codes), 81 | start_date, 82 | end_date, 83 | len(df_daily), 84 | time.time() - start_time) 85 | 86 | # 加上证指数的日频数据 87 | start_time = time.time() 88 | df_index_daily = __load(['000001.SH'], start_date, end_date, func=datasource.index_daily) 89 | logger.info("加载上证指数 %s~%s 的日频数据 %d 行,耗时%.0f秒", 90 | start_date, 91 | end_date, 92 | len(df_index_daily), 93 | time.time() - start_time) 94 | 95 | # 加上证指数的周频数据 96 | start_time = time.time() 97 | df_index_weekly = __load(['000001.SH'], start_date, end_date, func=datasource.index_weekly) 98 | logger.info("加载上证指数 %s~%s 的周频数据 %d 行,耗时%.0f秒", 99 | start_date, 100 | end_date, 101 | len(df_index_weekly), 102 | time.time() - start_time) 103 | 104 | # 加上交易日历数据 105 | df_calendar = datasource.trade_cal(start_date, end_date) 106 | 107 | stock_data = StockData() 108 | # 按照ts_code + trade_date,排序 109 | # 排序默认是ascending=True, 升序,从旧到新,比如日期是2008->2022, 110 | # 然后赋值到stock_data 111 | stock_data.df_daily = df_daily.sort_values(['ts_code', 'trade_date']) 112 | stock_data.df_weekly = df_weekly.sort_values(['ts_code', 'trade_date']) 113 | 
stock_data.df_daily_basic = df_daily_basic # 之前sort过了 114 | stock_data.df_index_weekly = df_index_weekly.sort_values(['ts_code', 'trade_date']) 115 | stock_data.df_index_daily = df_index_daily.sort_values(['ts_code', 'trade_date']) 116 | stock_data.df_calendar = df_calendar 117 | 118 | return stock_data 119 | 120 | 121 | def __load(stocks, start_date, end_date, func): 122 | data_list = [] 123 | for code in stocks: 124 | df = func(code, start_date, end_date) 125 | data_list.append(df) 126 | return pd.concat(data_list) 127 | -------------------------------------------------------------------------------- /mlstock/data/stock_data.py: -------------------------------------------------------------------------------- 1 | class StockData: 2 | df_daily = None 3 | df_weekly = None 4 | df_daily_basic = None 5 | df_limit = None -------------------------------------------------------------------------------- /mlstock/data/stock_info.py: -------------------------------------------------------------------------------- 1 | class StocksInfo: 2 | def __init__(self, stocks, start_date, end_date): 3 | self.stocks = stocks 4 | self.start_date = start_date 5 | self.end_date = end_date 6 | -------------------------------------------------------------------------------- /mlstock/factors/README.md: -------------------------------------------------------------------------------- 1 | # 因子说明 2 | 3 | 这里列出所有, 《华泰人工智能系列》使用的因子,神仔也是参考了这个list,不过她的有所增加,参考《Jupiter 0102_数据整理》。 4 | 5 | # 华泰人工智能系列使用的因子(48个特征) 6 | 7 | | 实现 | 大类因子 | 因子描述 | 因子方向 | 8 | |-------------------|------------|----------------------------|-------| 9 | | 估值 | EP | 净利润(TTM)/总市值 | 1 | 10 | | 估值 | EPcut | 扣除非经常性损益后净利润(TTM)/总市值 | 1 | 11 | | 估值 | BP | 净资产/总市值 | 1 | 12 | | 估值 | SP | 营业收入(TTM)/总市值 | 1 | 13 | | 估值 | NCFP | 净现金流(TTM)/总市值 | 1 | 14 | | 估值 | OCFP | 经营性现金流(TTM)/总市值 | 1 | 15 | | 估值 | DP | 近12 个月现金红利(按除息日计)/总市值 | 1 | 16 | | 估值 | G/PE | 净利润(TTM)同比增长率/PE_TTM | 1 | 17 | | 成长 | Sales_G_q | 营业收入(最新财报,YTD)同比增长率 | 1 | 18 | | 成长 | Prof 
it_G_q | 净利润(最新财报,YTD)同比增长率 | 1 | 19 | | 成长 | OCF_G_q | 经营性现金流(最新财报,YTD)同比增长率 | 1 | 20 | | 成长 | ROE_G_q ROE | (最新财报,YTD)同比增长率 | 1 | 21 | | 财务质量 | ROE_q ROE | (最新财报,YTD) | 1 | 22 | | 财务质量 | ROE_ttm ROE | (最新财报,TTM) | 1 | 23 | | 财务质量 | ROA_q ROA | (最新财报,YTD) | 1 | 24 | | 财务质量 | ROA_ttm ROA | (最新财报,TTM) | 1 | 25 | | 财务质量 | grossprofitmargin_q | 毛利率(最新财报,YTD) | 1 | 26 | | 财务质量 | grossprofitmargin_ttm | 毛利率(最新财报,TTM) | 1 | 27 | | 财务质量 | prof itmargin_q | 扣除非经常性损益后净利润率(最新财报,YTD) | 1 | 28 | | 财务质量 | prof itmargin_ttm | 扣除非经常性损益后净利润率(最新财报,TTM) | 1 | 29 | | 财务质量 | assetturnover_q | 资产周转率(最新财报,YTD) | 1 | 30 | | 财务质量 | assetturnover_ttm | 资产周转率(最新财报,TTM) | 1 | 31 | | 财务质量 | operationcashf lowratio_q | 经营性现金流/净利润(最新财报,YTD) | 1 | 32 | | 财务质量 | operationcashf lowratio_ttm | 经营性现金流/净利润(最新财报,TTM) | 1 | 33 | | 杠杆 | financial_leverage | 总资产/净资产 | -1 | 34 | | 杠杆 | debtequityratio | 非流动负债/净资产 | -1 | 35 | | 杠杆 | cashratio | 现金比率 | 1 | 36 | | 杠杆 | currentratio | 流动比率 | 1 | 37 | | 市值 | ln_capital | 总市值取对数 | -1 | 38 | | 动量反转 | HAlpha | 个股 60 个月收益与上证综指回归的截距项 | -1 | 39 | | 动量反转 | return_Nm | 个股最近N 个月收益率,N=1,3,6,12 | -1 | 40 | | 动量反转 | wgt_return_Nm | 个股最近N 个月内用每日换手率乘以每日收益率求算术平均值,N=1,3,6,12 | -1 | 41 | | 动量反转 | exp_w gt_return_Nm | 个股最近N 个月内用每日换手率乘以函数exp(-x_i/N/4)再乘以每日收益率求算术平均值,x_i 为该日距离截面日的交易日的个数,N=1,3,6,12 | -1 | 42 | | 波动率 | std_FF3factor_Nm | 特质波动率——个股最近N个月内用日频收益率对Fama French 三因子回归的残差的标准差,N=1,3,6,12 | -1 | 43 | | 波动率 | std_Nm | 个股最近N个月的日收益率序列标准差,N=1,3,6, 12 | -1 | 44 | | 股价 | ln_price | 股价取对数 | -1 | 45 | | beta | beta | 个股 60 个月收益与上证综指回归的beta | -1 | 46 | | 换手率 | turn_Nm | 个股最近N个月内日均换手率(剔除停牌、涨跌停的交易日),N=1,3,6,12 | -1 | 47 | | 换手率 | bias_turn_Nm | 个股最近N 个月内日均换手率除以最近2 年内日均换手率(剔除停牌、涨跌停的交易日)再减去1,N=1,3,6,12 | -1 | 48 | | 情绪 | rating_average | wind 评级的平均值 | 1 | 49 | | 情绪 | rating_change | wind 评级(上调家数-下调家数)/总数 | 1 | 50 | | 情绪 | rating_targetprice | wind 一致目标价/现价-1 | 1 | 51 | | 股东 | holder_avgpctchange | 户均持股比例的同比增长率 | 1 | 52 | | 技术 | MACD | 经典技术指标(释义可参考百度百科),长周期取30日,短周期取10 日,计算DEA 均线的周期(中周期)取15 
日 | -1 | 53 | | 技术 | DEA | | -1 | 54 | | 技术 | DIF | | -1 | 55 | | 技术 | RSI | 经典技术指标,周期取20 日 | -1 | 56 | | 技术 | PSY | 经典技术指标,周期取20 日 | -1 | 57 | | 技术 | BIAS | 经典技术指标,周期取20 日 | -1 | 58 | 59 | # 神仔整理的因子(96个特征) 60 | ``` 61 | ['ts_code', 'year_month', 'macd', 'dea', 'dif', 'rsi', 'psy', 'bias', 62 | 'close_hfq', 'close_hfq_log', 'pct_chg', 'industry', 'list_date', 63 | 'pct_chg_hs300', 'bias_turn_1m', 'bias_turn_3m', 'bias_turn_6m', 64 | 'bias_turn_12m', 'total_cur_assets', 'total_nca', 'total_assets', 65 | 'total_cur_liab', 'total_ncl', 'total_liab', 'c_fr_sale_sg', 66 | 'c_inf_fr_operate_a', 'c_paid_goods_s', 'c_paid_to_for_empl', 67 | 'st_cash_out_act', 'stot_inflows_inv_act', 'stot_out_inv_act', 68 | 'n_cashflow_inv_act', 'stot_cash_in_fnc_act', 'stot_cashout_fnc_act', 69 | 'n_cash_flows_fnc_act', 'trade_date', 'close_wfq', 'turnover_rate', 70 | 'turnover_rate_f', 'volume_ratio', 'pe', 'pe_ttm', 'pb', 'ps', 'ps_ttm', 71 | 'total_share', 'float_share', 'free_share', 'total_mv', 'circ_mv', 72 | 'total_mv_log', 'esg', 'exp_wgt_return_20d', 'exp_wgt_return_60d', 73 | 'exp_wgt_return_120d', 'exp_wgt_return_240d', 'ar_turn', 'ca_turn', 74 | 'fa_turn', 'assets_turn', 'current_ratio', 'quick_ratio', 75 | 'ocf_to_shortdebt', 'debt_to_eqt', 'tangibleasset_to_debt', 76 | 'profit_to_op', 'roa_yearly', 'tr_yoy', 'or_yoy', 'ebt_yoy', 'op_yoy', 77 | 'HAlpha', 'Beta', 'holder_chg', 'basic_eps', 'diluted_eps', 78 | 'total_revenue', 'total_cogs', 'operate_profit', 'non_oper_income', 79 | 'non_oper_exp', 'total_profit', 'n_income', 'return_3m', 'return_6m', 80 | 'return_12m', 'std_20d', 'std_60d', 'std_120d', 'std_240d', 81 | 'turnover_1m', 'turnover_3m', 'turnover_6m', 'turnover_12m', 82 | 'wgt_return_1m', 'wgt_return_3m', 'wgt_return_6m', 'wgt_return_12m'] 83 | ``` 84 | # 我们目前实现的特征 85 | 86 | ['return_1w', 'return_3w', 'return_6w', 'return_12w', 'turnover_return_1w', 'turnover_return_3w', 'turnover_return_6w', 'turnover_return_12w', 87 | 'std_1w', 'std_3w', 'std_6w', 'std_12w', 
'MACD', 'KDJ', 'PSY', 'RSI', 'total_current_assets', 'total_none_current_assets', 'total_assets', 'total_current_liabilities', 'total_none_current_liabilities', 'total_liabilities', 'basic_eps', 'diluted_eps', 'total_revenue', 'total_cogs', 'operate_profit', 'none_operate_income', 'none_operate_exp', 'total_profit', 'net_income', 'cash_sale_goods_service', 'cash_in_subtotal_operate', 'cash_paid_goods_service', 'cash_paid_employees', 'cash_out_subtotal_operate', 'cash_in_subtotal_invest', 'cash_out_subtotal_invest', 'cash_net_invest', 'cash_in_subtotal_finance', 'cash_out_subtotal_finance', 'cash_net_finance', 'account_receivable_turnover', 'current_assets_turnover', 'fixed_assets_turnover', 'total_assets_turnover', 'current_ratio', 'quick_ratio', 'operation_cashflow_to_current_liabilities', 'equity_ratio', 'tangible_assets_to_total_liability', 'total_profit_to_operate', 'annualized_net_return_of_total_assets', 'total_operate_revenue_yoy', 'operate_revenue_yoy', 'total_profit_yoy', 'operate_profit_yoy', 'operate_cashflow_yoy', 'net_profit_deduct_non_recurring_profit_loss', 'operate_cashflow_per_share', 'ROE_TTM', 'ROA_TTM', 'ROE_YOY', 'EP', 'SP', 'BP', 'std_ff3factor_1w', 'std_ff3factor_3w', 'std_ff3factor_6w', 'std_ff3factor_12w', 'alpha', 'beta'] -------------------------------------------------------------------------------- /mlstock/factors/__init__.py: -------------------------------------------------------------------------------- 1 | class I: 2 | def __init__(self, name, tushare_name, cname, category, ttm=None): 3 | self.name = name 4 | self.tushare_name = tushare_name 5 | self.cname = cname 6 | self.category = category 7 | self.ttm = ttm 8 | -------------------------------------------------------------------------------- /mlstock/factors/alpha_beta.py: -------------------------------------------------------------------------------- 1 | """ 2 | HAlpha 个股 60 个月收益与上证综指回归的截距项, 3 | 对我周频来说,就用60个周的上证综指回归的截距项, 4 | 5 | beta : 个股 60 个月收益与上证综指回归的beta 6 | 7 | 
回归使用的是CAPM公式: 8 | R_it - r_f = alpha_it + beta_it * (R_mt - r_f) + e_it 9 | 个股收益 - 无风险收益 = 个股alpha + 个股beta * (市场/上证综指收益 - 无风险收益) + 个股扰动项 10 | - i:是股票序号; 11 | - t:是一个周期,我们这里是周; 12 | - e:epsilon,扰动项 13 | 14 | 一般情况,是用N个周期的市场(上证综指)收益,和N个个股收益,一共是N对数据(R_it,R_mt),来回归alpha_it和beta_it的, 15 | 现在呢,华泰金工要求我们用60周的,实际上你可以想象是一个滑动时间窗口,60周,就是N=60, 16 | 用这60个周的这只股票的收益率,和60个周的市场(上证综指)收益率,回归出2个数:alpha和beta这两个数。 17 | 也就是你站在当期,向前倒腾60周,回归出来的。 18 | 19 | 然后,然后,你往下周移动一下,相当于时间窗口移动了一下,那么就又有新的60对数据(R_it,R_mt)出来了, 20 | 当然,其中59个和上一样的,但是,这个时候,你再回归,就又会得到一个alpha和beta, 21 | 22 | 这样,每一期,都要做这么一个回归,都得到一个beta,一个aphla,这个就是当期的华泰说的HAlpha和beta。 23 | 24 | 实现的时候,用了2个apply,每周五,都向前回溯60周,然后用这60周的数据回归alpha和beta 25 | 26 | """ 27 | import time 28 | 29 | from mlstock.factors.factor import SimpleFactor, ComplexMergeFactor 30 | from mlstock.utils import utils 31 | import numpy as np 32 | import logging 33 | 34 | logger = logging.getLogger(__name__) 35 | 36 | class AlphaBeta(ComplexMergeFactor): 37 | # 英文名 38 | @property 39 | def name(self): 40 | return ["alpha", "beta"] 41 | 42 | # 中文名 43 | @property 44 | def cname(self): 45 | return ["alpha", "beta"] 46 | 47 | def _handle_one_stock(self, df_stock_weekly): 48 | """ 49 | 处理"一只"股票, 50 | 用60周的滑动窗口的概念,来不断地算每天,向前60天的数据回归出来的α和β(CAPM的概念) 51 | 参考: https://www.jianshu.com/p/1eaf89990ce7 52 | apply返回多列的时候,只能通过axis=1 + result_type="expand",来处理, 53 | 且,必须是整行apply,不能单列apply(df_stock_weekly.trade_date.apply(...),这种axis=1就报错!) 54 | 这块搞了半天,靠! 
55 | """ 56 | return df_stock_weekly.apply(self._handle_60_weeks_OLS, 57 | df_stock_weekly=df_stock_weekly, 58 | axis=1, 59 | result_type="expand") 60 | 61 | def _handle_60_weeks_OLS(self, date, df_stock_weekly): 62 | """ 63 | 用一只股票的前60周收益,和,上证的前60周收益,做回归,得到alpha、beta 64 | :param date: 65 | :param df_stock_weekly: 66 | :return: 67 | """ 68 | 69 | # 取得当周的日期(周最后一天) 70 | date = date['trade_date'] 71 | # 从当前周向前回溯60周, 72 | df_recent_60 = df_stock_weekly[df_stock_weekly['trade_date'] <= date][-60:] 73 | # 太少的回归不出来 74 | if len(df_recent_60) < 2: return np.nan, np.nan 75 | 76 | X = df_recent_60['pct_chg'].values 77 | y = df_recent_60['pct_chg_index'].values 78 | 79 | # 用这60周的60个数据,做线性回归,X是个股的收益,y是指数收益,求出截距和系数,即alpha和beta 80 | assert len(X) > 0, f"回归alpha/beta的数据太少,应该60个,现在{len(X)}" 81 | params, _ = utils.OLS(X, y) 82 | 83 | alpha, beta = params[0], params[1] 84 | 85 | # if np.isnan(alpha): 86 | # import pdb;pdb.set_trace() 87 | # print(date,df_stock_weekly[df_stock_weekly['trade_date'] == date],y) 88 | # print("-"*40) 89 | return alpha, beta 90 | 91 | def calculate(self, stock_data): 92 | df_weekly = stock_data.df_weekly 93 | df_index_weekly = stock_data.df_index_weekly 94 | df_index_weekly = df_index_weekly.rename(columns={'pct_chg': 'pct_chg_index'}) 95 | df_index_weekly = df_index_weekly[['trade_date', 'pct_chg_index']] 96 | 97 | df_weekly = df_weekly.merge(df_index_weekly, on=['trade_date'], how='left') 98 | # 2022.8.10,bugfix,股票非周五导致weekly指数收益为NAN,导致其移动平均为NAN,导致大量数据缺失,因此需要drop掉这些异常数据 99 | original_length = len(df_weekly) 100 | df_weekly.dropna(subset=['pct_chg','pct_chg_index'], inplace=True) 101 | logger.debug("计算alpha/beta时剔除'pct_chg','pct_chg_index'中nan后,数据 %d=>%d 行", 102 | original_length,len(df_weekly)) 103 | 104 | # 先统一排一下序 105 | df_weekly = df_weekly.sort_values(['ts_code', 'trade_date']) 106 | 107 | df_weekly[['alpha', 'beta']] = df_weekly.groupby(['ts_code']).apply(self._handle_one_stock) 108 | return df_weekly[['ts_code', 'trade_date', 'alpha', 'beta']]
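The rolling regression above fits, for each week, stock excess returns against index returns over the trailing 60 weeks. Here is a minimal self-contained sketch of one such fit on synthetic data; `capm_alpha_beta` is a hypothetical helper, and it assumes (as the docstring's CAPM formula implies) an OLS with an intercept where the intercept is alpha and the slope is beta:

```python
import numpy as np

def capm_alpha_beta(stock_returns, index_returns):
    # OLS of stock = alpha + beta * index, with an explicit intercept column
    X = np.column_stack([np.ones(len(index_returns)), index_returns])
    params, *_ = np.linalg.lstsq(X, stock_returns, rcond=None)
    return params[0], params[1]  # (alpha, beta)

rng = np.random.default_rng(0)
index_ret = rng.normal(0.0, 0.02, 60)                           # 60 weekly index returns
stock_ret = 0.001 + 1.5 * index_ret + rng.normal(0, 0.005, 60)  # true alpha=0.001, beta=1.5
alpha, beta = capm_alpha_beta(stock_ret, index_ret)
print(alpha, beta)
```

With 60 observations the estimates land close to the true alpha and beta; in the real factor this fit is repeated once per week as the window slides forward.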
110 | 111 | # python -m mlstock.factors.alpha_beta 112 | if __name__ == '__main__': 113 | utils.init_logger(file=False) 114 | from mlstock.data import data_loader, data_filter 115 | from mlstock.data.datasource import DataSource 116 | from mlstock.data.stock_info import StocksInfo 117 | 118 | start_date = "20080101" 119 | end_date = "20220101" 120 | stocks = ['000017.SZ'] # '600000.SH', '002357.SZ', '000404.SZ', '600230.SH'] 121 | 122 | start_time = time.time() 123 | start_date = "20080101" 124 | end_date = "20220801" 125 | # stocks = ['000007.SZ','000010.SZ'] 126 | df_stock_basic = data_filter.filter_stocks() 127 | df_stock_basic = df_stock_basic.iloc[:50] 128 | stocks = df_stock_basic.ts_code 129 | 130 | datasource = DataSource() 131 | stocks_info = StocksInfo(stocks, start_date, end_date) 132 | df_stocks = data_loader.load(datasource, stocks, start_date, end_date) 133 | 134 | print(df_stocks.df_weekly.count()) 135 | factor_alpha_beta = AlphaBeta(datasource, stocks_info) 136 | df = factor_alpha_beta.calculate(df_stocks) 137 | df_na = df[df['beta'].isna()] 138 | print("Alpha Beta为空的行,%d 行" % len(df_na)) 139 | print(df_na) 140 | 141 | print("结果统计\n",df.count()) 142 | utils.time_elapse(start_time,"全部处理") 143 | -------------------------------------------------------------------------------- /mlstock/factors/balance_sheet.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from mlstock.data.stock_info import StocksInfo 4 | from mlstock.factors import I 5 | from mlstock.factors.factor import FinanceFactor 6 | from mlstock.factors.mixin.fill_mixin import FillMixin 7 | from mlstock.factors.mixin.ttm_mixin import TTMMixin 8 | from mlstock.utils import utils 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class BalanceSheet(FinanceFactor, FillMixin, TTMMixin): 14 | """ 15 | 资产负债表 16 | """ 17 | 18 | FIELDS_DEF = [ 19 | I('total_current_assets', 'total_cur_assets', '流动资产合计', '资产负债'), 20 | 
I('total_none_current_assets', 'total_nca', '非流动资产合计', '资产负债'), 21 | I('total_assets', 'total_assets', '资产总计', '资产负债'), 22 | I('total_current_liabilities', 'total_cur_liab', '流动负债合计', '资产负债'), 23 | I('total_none_current_liabilities', 'total_ncl', '非流动负债合计', '资产负债'), 24 | I('total_liabilities', 'total_liab', '负债合计', '资产负债')] 25 | 26 | @property 27 | def data_loader_func(self): 28 | return self.datasource.balance_sheet 29 | 30 | # python -m mlstock.factors.balance_sheet 31 | if __name__ == '__main__': 32 | utils.init_logger(file=False) 33 | 34 | start_date = '20090101' 35 | end_date = '20220801' 36 | stocks = ['600000.SH', '002357.SZ', '000404.SZ', '600230.SH'] 37 | 38 | BalanceSheet.test(stocks, start_date, end_date) 39 | -------------------------------------------------------------------------------- /mlstock/factors/cashflow.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from mlstock.data.stock_info import StocksInfo 4 | from mlstock.factors import I 5 | from mlstock.factors.factor import FinanceFactor 6 | from mlstock.factors.mixin.fill_mixin import FillMixin 7 | from mlstock.factors.mixin.ttm_mixin import TTMMixin 8 | from mlstock.utils import utils 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class CashFlow(FinanceFactor, FillMixin, TTMMixin): 14 | """ 15 | 现金流量表 16 | """ 17 | 18 | FIELDS_DEF = [ 19 | I('cash_sale_goods_service', 'c_fr_sale_sg', '销售商品、提供劳务收到的现金', '现金流量', ttm=True), 20 | I('cash_in_subtotal_operate', 'c_inf_fr_operate_a', '经营活动现金流入小计', '现金流量', ttm=True), 21 | I('cash_paid_goods_service', 'c_paid_goods_s', '购买商品、接受劳务支付的现金', '现金流量', ttm=True), 22 | I('cash_paid_employees', 'c_paid_to_for_empl', '支付给职工以及为职工支付的现金', '现金流量', ttm=True), 23 | I('cash_out_subtotal_operate', 'st_cash_out_act', '经营活动现金流出小计', '现金流量', ttm=True), 24 | I('cash_in_subtotal_invest', 'stot_inflows_inv_act', '投资活动现金流入小计', '现金流量', ttm=True), 25 | I('cash_out_subtotal_invest', 'stot_out_inv_act',
'投资活动现金流出小计', '现金流量', ttm=True), 26 | I('cash_net_invest', 'n_cashflow_inv_act', '投资活动产生的现金流量净额', '现金流量', ttm=True), 27 | I('cash_in_subtotal_finance', 'stot_cash_in_fnc_act', '筹资活动现金流入小计', '现金流量', ttm=True), 28 | I('cash_out_subtotal_finance', 'stot_cashout_fnc_act', '筹资活动现金流出小计', '现金流量', ttm=True), 29 | I('cash_net_finance', 'n_cash_flows_fnc_act', '筹资活动产生的现金流量净额', '现金流量''', ttm=True)] 30 | 31 | @property 32 | def data_loader_func(self): 33 | return self.datasource.cashflow 34 | 35 | 36 | # python -m mlstock.factors.cashflow 37 | if __name__ == '__main__': 38 | utils.init_logger(file=False) 39 | 40 | start_date = '20150703' 41 | end_date = '20190826' 42 | stocks = ['600000.SH', '002357.SZ', '000404.SZ', '600230.SH'] 43 | 44 | CashFlow.test(stocks, start_date, end_date) 45 | -------------------------------------------------------------------------------- /mlstock/factors/daily_indicator.py: -------------------------------------------------------------------------------- 1 | """ 2 | 盈利收益率 3 | """ 4 | import logging 5 | from mlstock.factors.factor import ComplexMergeFactor 6 | from mlstock.utils import utils 7 | import numpy as np 8 | 9 | logger = logging.getLogger(__name__) 10 | 11 | """ 12 | EP = 净利润(TTM)/总市值 13 | 盈利收益率 EP(Earn/Price) = 盈利/价格 14 | 其实,就是1/PE(市盈率), 15 | 这里,就说说PE,因为EP就是他的倒数: 16 | PE = PRICE / EARNING PER SHARE,指股票的本益比,也称为“利润收益率”。 17 | 本益比是某种股票普通股每股市价与每股盈利的比率,所以它也称为股价收益比率或市价盈利比率。 18 | - [基本知识解读 -- PE, PB, ROE,盈利收益率](https://xueqiu.com/4522742712/61623733) 19 | 20 | """ 21 | 22 | 23 | class DailyIndicator(ComplexMergeFactor): 24 | """ 25 | daily_basic 中提供了3个指标 26 | """ 27 | 28 | # 英文名 29 | @property 30 | def name(self): 31 | return ["total_market_value_log","EP", "SP", "BP"] 32 | 33 | # 中文名 34 | @property 35 | def cname(self): 36 | return ["总市值对数值","净收益(TTM)/总市值", "营业收入(TTM)/总市值", "净资产/总市值"] 37 | 38 | def calculate(self, stock_data): 39 | df_daily_basic = stock_data.df_daily_basic 40 | df_daily_basic = df_daily_basic.sort_values(['ts_code', 
'trade_date']) 41 | 42 | # 2022.8.16,这事直接搬到最开始的数据load里面去做了,很多地方都需要这个basic数据 43 | # 如果缺失,用这天之前的数据来填充(ffill) 44 | # 这样做不行,ts_code和trade_date两列丢了,pandas1.3.4也不行,只好逐个fill了 45 | # df_daily_basic = df_daily_basic.groupby(by=['ts_code']).fillna(method='ffill').reset_index() 46 | # df_daily_basic[['total_mv','pe_ttm', 'ps_ttm', 'pb']] = df_daily_basic.groupby('ts_code').ffill().bfill()[['total_mv','pe_ttm', 'ps_ttm', 'pb']] 47 | 48 | df_daily_basic[self.name[0]] = df_daily_basic.total_mv.apply(np.log) 49 | df_daily_basic[self.name[1]] = 1 / df_daily_basic['pe_ttm'] # EP = 1/PE(市盈率) 50 | df_daily_basic[self.name[2]] = 1 / df_daily_basic['ps_ttm'] # SP = 1/PS(市销率) 51 | df_daily_basic[self.name[3]] = 1 / df_daily_basic['pb'] # BP = 1/PB(市净率) 52 | return df_daily_basic[['ts_code', 'trade_date'] + self.name] 53 | 54 | 55 | # python -m mlstock.factors.daily_indicator 56 | if __name__ == '__main__': 57 | from mlstock.data import data_loader 58 | from mlstock.data.datasource import DataSource 59 | from mlstock.data.stock_info import StocksInfo 60 | 61 | utils.init_logger(file=False) 62 | 63 | start_date = "20180101" 64 | end_date = "20200101" 65 | stocks = ['000019.SZ', '000063.SZ', '000068.SZ', '000422.SZ'] 66 | datasource = DataSource() 67 | stocks_info = StocksInfo(stocks, start_date, end_date) 68 | stock_data = data_loader.load(datasource, stocks, start_date, end_date) 69 | df = DailyIndicator(datasource, stocks_info).calculate(stock_data) 70 | print(df) 71 | print("NA缺失比例", df.isna().sum() / len(df)) 72 | -------------------------------------------------------------------------------- /mlstock/factors/factor.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from abc import ABC, abstractmethod 3 | 4 | import pandas as pd 5 | from pandas import DataFrame 6 | 7 | from mlstock.data.stock_info import StocksInfo 8 | from mlstock.utils import utils 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class
Factor(ABC): 14 | """ 15 | 因子,也就是指标 16 | """ 17 | 18 | def __init__(self, datasource, stocks_info: StocksInfo): 19 | self.datasource = datasource 20 | self.stocks_info = stocks_info 21 | 22 | # 英文名 23 | @property 24 | def name(self): 25 | return "Unknown" 26 | 27 | # 中文名 28 | @property 29 | def cname(self): 30 | return "未定义" 31 | 32 | @abstractmethod 33 | def calculate(self, stock_data): 34 | raise NotImplementedError() 35 | 36 | @abstractmethod 37 | def merge(self, df_stocks, df_factor): 38 | raise NotImplementedError() 39 | 40 | def _rename_finance_column_names(self, df: DataFrame): 41 | """ 42 | - 把其他的需要的字段的名字改成一个full_name,缩写看不懂 43 | """ 44 | # 把tushare中的业务字段名改成full_name,便于理解 45 | df = self._rename(df, self.tushare_name, self.name) 46 | return df 47 | 48 | def _extract_fields(self, df: DataFrame): 49 | """ 50 | - 把ts_code,ann_date,和其他需要的字段,剥离出来 51 | """ 52 | # 注意,这里是trade_date,不是ann_date 53 | return df[['ts_code', 'trade_date'] + self.name] 54 | 55 | def _rename(self, df: DataFrame, _from: list, _to: list): 56 | name_pair = dict(zip(_from, _to)) 57 | return df.rename(columns=name_pair) 58 | 59 | 60 | class SimpleFactor(Factor): 61 | 62 | def merge(self, df_stocks, df_factor): 63 | """ 64 | 这个merge只是简单的把列合到一起,这里假设,df_factor包含了同df_stocks同样行数的数据, 65 | 且,包含的列都是纯粹的特征列(不包含ts_code,trade_date等附加列) 66 | :param df_stocks: 67 | :param df_factor: 68 | :return: 69 | """ 70 | assert len(df_stocks) == len(df_factor) 71 | # index不一样没法对着行做列赋值,尽管行数一样,所以先都重置int的索引 72 | df_stocks = df_stocks.reset_index(drop=True) 73 | df_factor = df_factor.reset_index(drop=True) 74 | 75 | if type(df_factor) == pd.Series: 76 | df_factor.name = self.name 77 | if type(df_factor) == pd.DataFrame: 78 | df_factor.columns = self.name 79 | 80 | return pd.concat([df_stocks, df_factor], axis=1) 81 | 82 | 83 | class ComplexMergeFactor(Factor): 84 | """合并数据和因子靠的是ts_code和trade_date做left join""" 85 | 86 | def merge(self, df_stocks, df_factor): 87 | # rename财务数据的公布日期ann_date=>trade_date 88 | df_factor = df_factor.rename(columns={'ann_date':
'trade_date'}) 89 | # 做数据合并 90 | df_stocks = df_stocks.merge(df_factor, on=['ts_code', 'trade_date'], how='left') 91 | return df_stocks 92 | 93 | 94 | # 95 | class FinanceFactor(ComplexMergeFactor): 96 | # 字段配置,是一个I(name, tushare_name, cname, category, ttm=None, normalize=False)的数组, 97 | # 子类需要重新定义自己的字段配置 98 | FIELDS_DEF = [] 99 | 100 | # 英文名 101 | @property 102 | def name(self): 103 | return [i.name for i in self.FIELDS_DEF] 104 | 105 | # 中文名 106 | @property 107 | def cname(self): 108 | return [i.cname for i in self.FIELDS_DEF] 109 | 110 | def _rename_finance_column_names(self, df: DataFrame): 111 | """ 112 | 把其他的需要的字段的名字改成一个full_name,缩写看不懂 113 | """ 114 | return self._rename(df, self.get_tushare_names(), self.get_names()) 115 | 116 | def _rename_to_cnames(self, df: DataFrame): 117 | """ 118 | 为了调试用,把e文=>中文列名 119 | """ 120 | return self._rename(df, self.get_names(), self.get_cnames()) 121 | 122 | def _numberic(self, df_finance): 123 | """ 124 | 由于tushare下载完后很多是nan列,或者,别错误定义成了text类型,这里要强制转成float 125 | self.name就是对应的财务指标列,是rename之后的 126 | """ 127 | df_finance[self.get_tushare_names()] = df_finance[self.get_tushare_names()].astype('float') 128 | return df_finance 129 | 130 | def get_tushare_names(self): 131 | return [_def.tushare_name for _def in self.FIELDS_DEF] 132 | 133 | def get_names(self): 134 | return [_def.name for _def in self.FIELDS_DEF] 135 | 136 | def get_cnames(self): 137 | return [_def.cname for _def in self.FIELDS_DEF] 138 | 139 | def get_name_pair(self): 140 | return [[_def.cname,_def.name] for _def in self.FIELDS_DEF] 141 | 142 | 143 | def get_ttm_fields(self): 144 | return [_def.name for _def in self.FIELDS_DEF if _def.ttm] 145 | 146 | @property 147 | def data_loader_func(self): 148 | raise NotImplementedError() 149 | 150 | def calculate(self, stock_data): 151 | df_weekly = stock_data.df_weekly 152 | """ 153 | 之所有要传入df_stocks,是因为要用他的日期,对每个日期进行TTM填充 154 | :param df_weekly: 股票周频数据 155 | :return: 156 | """ 157 | 158 | # 
由于财务数据,需要TTM,所以要溯源到1年前,所以要多加载前一年的数据 159 | start_date_last_year = utils.last_year(self.stocks_info.start_date) 160 | 161 | # 加载财务数据(通过self.data_loader_func) 162 | df_finance = self.data_loader_func(self.stocks_info.stocks, start_date_last_year, self.stocks_info.end_date) 163 | 164 | assert len(df_finance) > 0, f"因子数据{self}行数为0" 165 | 166 | # 把财务字段类型改成float,之前种种原因导致tushare下载下来的数据的列是text类型的,这纯粹是个patch 167 | df_finance = self._numberic(df_finance) 168 | 169 | # 把财务字段改成全名(tushare中的缩写很讨厌) 170 | df_finance = self._rename_finance_column_names(df_finance) 171 | 172 | # 做财务数据的TTM处理 173 | df_finance = self.ttm(df_finance, self.get_ttm_fields()) 174 | 175 | # 按照股票的周频日期,来生成对应的指标(填充周频对应的财务指标) 176 | df_finance = self.fill(df_weekly, df_finance, self.name) 177 | 178 | # 财务数据都除以总市值,进行归一化 179 | df_finance = self.normalize_by_market_value(df_finance, stock_data.df_daily_basic) 180 | 181 | # 只保留股票、日期和需要的特征列 182 | df_finance = self._extract_fields(df_finance) 183 | 184 | return df_finance 185 | 186 | def normalize_by_market_value(self, df_finance, df_daily_basic): 187 | """ 188 | 直白点,就是都除以市值,让大家标准都统一化 189 | :return: 190 | """ 191 | # df_daily_basic都fill提前过了,所以不用担心有na值 192 | df_finance = df_finance.merge(df_daily_basic[['ts_code', 'trade_date', 'total_mv']], 193 | on=['ts_code', 'trade_date'], how='left') 194 | df_finance[self.name] = df_finance[self.name].apply(lambda x: x / df_finance['total_mv']) 195 | return df_finance 196 | 197 | @classmethod 198 | def test(cls, stocks, start_date, end_date): 199 | """这个方法纯粹是为了测试""" 200 | from mlstock.data import data_loader 201 | from mlstock.data.datasource import DataSource 202 | 203 | datasource = DataSource() 204 | stocks_info = StocksInfo(stocks, start_date, end_date) 205 | df_stocks = data_loader.load(datasource, stocks, start_date, end_date) 206 | indicator_cls = cls(datasource, stocks_info) 207 | df_indicator = indicator_cls.calculate(df_stocks) 208 | logger.debug("财务指标因子:\n%r", df_indicator) 209 | logger.debug("-" * 80) 210 | # 把名字改成中文 
211 | # df_indicator = indicator_cls._rename_to_cnames(df_indicator) 212 | logger.debug("数据列统计:\n%r", df_indicator.groupby('ts_code').count()) 213 | logger.debug("-" * 80) 214 | logger.debug("NA列统计:\n%r", df_indicator.groupby('ts_code').apply(lambda df: df.isna().sum())) 215 | return df_indicator 216 | -------------------------------------------------------------------------------- /mlstock/factors/fama/README.md: -------------------------------------------------------------------------------- 1 | 尝试复现A股的fama-french的3因子模型实证 2 | 3 | 参考: 4 | - https://www.cxyzjd.com/article/nv_144/108891000 5 | - https://zhuanlan.zhihu.com/p/374959531 6 | - https://zhuanlan.zhihu.com/p/21449852 7 | - https://zhuanlan.zhihu.com/p/341902943 8 | 9 | -------------------------------------------------------------------------------- /mlstock/factors/fama/analysis.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Aug 6 08:16:05 2020 4 | 5 | @author: 12767 6 | """ 7 | 8 | import pandas as pd 9 | import tushare as ts 10 | 11 | pro = ts.pro_api('') 12 | import statsmodels.api as sm 13 | 14 | 15 | # 回归函数,返回α 16 | def cal_aaa(df_buff): 17 | df_buff['index-rf'] = df_buff['index'] - df_buff['rf'] 18 | df_buff['stock-rf'] = df_buff['pct_chg'] - df_buff['rf'] 19 | model = sm.OLS(df_buff['stock-rf'], sm.add_constant(df_buff[['index-rf', 'SMB', 'HML']].values)) 20 | result = model.fit() 21 | # print(result.params) 22 | print(result.summary()) 23 | return result.params[0] 24 | 25 | 26 | def analyze(): 27 | # %%获取一段时间内的历史交易日 28 | df_cal = pro.trade_cal(start_date='20190101', end_date='20200801') 29 | df_cal = df_cal.query('(exchange=="SSE") & (is_open==1)') # 筛选,清除非交易日 30 | Date = df_cal.cal_date.tolist() 31 | 32 | # 挑选出所需跨度的交易日 33 | month_trade_days = [] 34 | i0 = 0 35 | while i0 < len(Date) - 1: 36 | # if Date[i0][5]!=Date[i0+1][5]: 37 | month_trade_days.append(Date[i0]) 38 | i0 += 1 39 | 
month_trade_days.append(Date[-1]) 40 | 41 | # %%提取出无风险利率 42 | rf = pd.read_excel('RF.xlsx') 43 | month_rf = [] 44 | i0 = 0 45 | while (i0 < len(rf['Clsdt'])): 46 | if rf['Clsdt'][i0].replace('-', '') in month_trade_days: 47 | month_rf.append(rf['Nrrdaydt'][i0] / 100) 48 | i0 += 1 49 | # %% 50 | data_buff = pd.DataFrame() 51 | data_buff['trade_date'] = month_trade_days 52 | data_buff['rf'] = month_rf 53 | 54 | # 获取指数收益率信息 55 | index = pro.index_daily(ts_code='000002.SH', start_date='20190101', end_date='20200731') 56 | index = index.drop(['close', 'open', 'high', 'ts_code', 'low', 'change', 'pre_close', 'vol', 'amount'], axis=1) 57 | index = index.rename(columns={ 58 | 'pct_chg': 'index'}) 59 | index['index'] = index['index'] / 100 60 | data_buff = pd.merge(data_buff, index, on='trade_date', how='inner') 61 | 62 | # 提取另外两个因子序列 63 | two_factor = pd.read_excel('three_factor_model.xlsx') 64 | data_buff['SMB'] = two_factor['SMB'] 65 | data_buff['HML'] = two_factor['HML'] 66 | 67 | # %%遍历所有股票,计算每只股票的α 68 | # 获取所有股票的信息 69 | stock_information = pro.stock_basic(exchange='', list_status='L', 70 | fields='ts_code,symbol,name,area,industry,list_date') 71 | aerfa_list = [] 72 | # %%输出挑选出来的股票的详细回归信息 73 | aerfa_list = [] 74 | stock_list = ['300156.SZ', 75 | '300090.SZ', 76 | '600175.SH', 77 | '002220.SZ', 78 | '002370.SZ', 79 | '300677.SZ', 80 | '600685.SH', 81 | '600095.SH', 82 | '603069.SH', 83 | '601066.SH'] 84 | i0 = 0 85 | while (i0 < len(stock_list)): 86 | stock = stock_list[i0] 87 | df = pro.daily(ts_code=stock, start_date='20190101', end_date='20200731') 88 | df_buff = pd.merge(data_buff, df, on='trade_date', how='inner') 89 | df_buff = df_buff.drop(['close', 'open', 'high', 'low', 'pre_close', 'change', 'vol', 'amount'], axis=1) 90 | if len(df_buff['rf']) == 0: 91 | aerfa_list.append(99) 92 | else: 93 | aerfa_list.append(cal_aaa(df_buff)) 94 | print(stock) 95 | i0 += 1 96 | # %%********此处循环3000只股票 97 | i0 = 0 98 | while (i0 < len(stock_information['ts_code'])): 99 | 
stock = stock_information['ts_code'][i0] 100 | df = pro.daily(ts_code=stock, start_date='20190101', end_date='20200731') 101 | df_buff = pd.merge(data_buff, df, on='trade_date', how='inner') 102 | df_buff = df_buff.drop(['close', 'open', 'high', 'low', 'pre_close', 'change', 'vol', 'amount'], axis=1) 103 | if len(df_buff['rf']) == 0: 104 | aerfa_list.append(99) 105 | else: 106 | aerfa_list.append(cal_aaa(df_buff)) 107 | print(stock) 108 | i0 += 1 109 | 110 | # %%保存数据 111 | stock_information['aerfa'] = aerfa_list 112 | stock_information.to_excel('stock_information.xlsx') 113 | 114 | 115 | if __name__ == '__main__': 116 | analyze() 117 | -------------------------------------------------------------------------------- /mlstock/factors/fama/fama_model.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import pandas as pd 4 | 5 | from mlstock.utils.utils import logging_time 6 | 7 | logger = logging.getLogger(__name__) 8 | 9 | """ 10 | Fama-French 3 因子,是用3个因子(市值、SMB、HML)来解释收益率, 11 | 因此,这个处理类的: 12 | 输入是:2000只股票的各自收益率,市场收益率(用上证指数),还有市值和账面市值比的信息 13 | 输出是:3个因子(市值、SMB、HML),以及3因子拟合股票收益率后的残差 14 | 再大致说一下FF3的思路: 15 | 1、对于每一期,2000只股票(假设的),用他们构建SMB和HML,就是2数,市值因子就用上证收益率 16 | 2、然后有N期,就有(N,3)个数,这就是X,而y是每只股票的N期收益率(N,1) 17 | 3、对每只股票,做N期收益率 和 FF3(N,3)的回归 18 | 4、对每只股票,回归出截距项、3个因子的系数,这是4个系数,是常数, 19 | 还有一个长度为N的时间序列,即预测值(FF3拟合出来的)和真实值(股票N期每期的真实收益)的N期残差(N,1) 20 | """ 21 | 22 | 23 | # TODO: 我没有减去无风险收益,要么去读一下国债1年期收益周化,要么给一个固定的周化收益率,都可以,回头在改进 24 | 25 | 26 | # %%定义计算函数 27 | # @logging_time('fama-frech的因子计算(所有股票)') 28 | def calculate_smb_hml(df): 29 | """" 30 | 参考: 31 | - https://zhuanlan.zhihu.com/p/55071842 32 | - https://zhuanlan.zhihu.com/p/341902943 33 | - https://zhuanlan.zhihu.com/p/21449852 34 | R_i = a_i + b_i * R_M + s_i * E(SMB) + h_i E(HML) + e_i 35 | - R_i:是股票收益率 36 | - SMB:市值因子,用的就是市值信息, circ_mv 37 | SMB = (SL+SM+SH)/3 - (BL+BM+BH)/3 38 | - HML:账面市值比,B/M,1/pb (PB是市净率=总市值/净资产) 39 | HML = (BH+SH)/2 - (BL+SL)/2 40 | """ 41 | 
42 | # 划分大小市值公司 43 | median = df['circ_mv'].median() 44 | df['SB'] = df['circ_mv'].map(lambda x: 'B' if x >= median else 'S') 45 | 46 | # 求账面市值比:PB的倒数 47 | df['BM'] = 1 / df['pb'] 48 | # 划分高、中、低账面市值比公司 49 | border_down, border_up = df['BM'].quantile([0.3, 0.7]) 50 | df['HML'] = df['BM'].map(lambda x: 'H' if x >= border_up else 'M') 51 | df['HML'] = df.apply(lambda row: 'L' if row['BM'] <= border_down else row['HML'], axis=1) 52 | 53 | # 组合划分为6组 54 | df_SL = df.query('(SB=="S") & (HML=="L")') 55 | df_SM = df.query('(SB=="S") & (HML=="M")') 56 | df_SH = df.query('(SB=="S") & (HML=="H")') 57 | df_BL = df.query('(SB=="B") & (HML=="L")') 58 | df_BM = df.query('(SB=="B") & (HML=="M")') 59 | df_BH = df.query('(SB=="B") & (HML=="H")') 60 | 61 | """ 62 | # 计算各组收益率, pct_chg:涨跌幅 , circ_mv:流通市值(万元) 63 | # 以SL为例子:Small+Low 64 | # 小市值+低账面市值比,的一组,比如100只股票,把他们的当期收益"**按照市值加权**"后,汇总到一起 65 | # 每期,得到的SL是一个数, 66 | # 组内按市值赋权平均收益率 = sum(个股收益率 * 个股市值/组内总市值) 67 | """ 68 | R_SL = ((df_SL['pct_chg']) * (df_SL['circ_mv'] / df_SL['circ_mv'].sum())).sum() # 这种写法和下面的5种结果一样 69 | R_SM = (df_SM['pct_chg'] * df_SM['circ_mv']).sum() / df_SM['circ_mv'].sum() # 我只是测试一下是否一致, 70 | R_SH = (df_SH['pct_chg'] * df_SH['circ_mv']).sum() / df_SH['circ_mv'].sum() # 大约在千分之几,也对,我做的是每日的收益率 71 | R_BL = (df_BL['pct_chg'] * df_BL['circ_mv']).sum() / df_BL['circ_mv'].sum() 72 | R_BM = (df_BM['pct_chg'] * df_BM['circ_mv']).sum() / df_BM['circ_mv'].sum() 73 | R_BH = (df_BH['pct_chg'] * df_BH['circ_mv']).sum() / df_BH['circ_mv'].sum() 74 | 75 | # 计算SMB, HML并返回 76 | # 这个没啥好说的,即使按照Fama造的公式,得到了smb,smb是啥?是当期的一个数 77 | smb = (R_SL + R_SM + R_SH - R_BL - R_BM - R_BH) / 3 78 | hml = (R_SH + R_BH - R_SL - R_BL) / 2 79 | 80 | # 调试用 81 | # R_SL , R_SM , R_SH , R_BL , R_BM , R_BH 82 | # import numpy as np 83 | # if np.isnan(smb): 84 | # import pdb;pdb.set_trace() 85 | 86 | # 返回Series,是为了groupby.apply()可以返回多列,用数组不可以(如下) 87 | return pd.Series({'SMB': smb, 'HML': hml}) 88 | # return smb, hml # R_SL, R_SM, R_SH, R_BL, R_BM, R_BH 89 | 90 
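To make the 2x3 bucketing and value-weighted averaging above concrete, here is a self-contained toy example on a single period. The six stocks, their returns, float caps and P/B values are made up for illustration and are not project data:

```python
import pandas as pd

# Hypothetical one-period cross-section: return, float market cap (circ_mv), price-to-book (pb)
df = pd.DataFrame({
    'pct_chg': [0.02, -0.01, 0.03, 0.00, 0.01, -0.02],
    'circ_mv': [100.0, 120.0, 90.0, 5000.0, 6000.0, 4000.0],
    'pb':      [0.8, 2.0, 5.0, 0.9, 2.5, 6.0],
})

# Size split at the median market cap: Small vs Big
median = df['circ_mv'].median()
df['SB'] = ['B' if mv >= median else 'S' for mv in df['circ_mv']]

# Book-to-market = 1/PB, split at the 30%/70% quantiles: Low / Mid / High
df['BM_ratio'] = 1 / df['pb']
lo, hi = df['BM_ratio'].quantile([0.3, 0.7])
df['LMH'] = ['H' if bm >= hi else ('L' if bm <= lo else 'M') for bm in df['BM_ratio']]

def vw_return(g):
    # value-weighted mean return of one bucket
    return (g['pct_chg'] * g['circ_mv']).sum() / g['circ_mv'].sum()

r = {sb + lmh: vw_return(df[(df['SB'] == sb) & (df['LMH'] == lmh)])
     for sb in 'SB' for lmh in 'LMH'}

smb = (r['SL'] + r['SM'] + r['SH']) / 3 - (r['BL'] + r['BM'] + r['BH']) / 3
hml = (r['SH'] + r['BH']) / 2 - (r['SL'] + r['BL']) / 2
print(smb, hml)
```

Each of the six buckets here happens to contain exactly one stock, so the bucket return equals that stock's return, which makes the SMB/HML arithmetic easy to check by hand.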
| 91 | def calculate_factors(df_stocks, df_market, df_basic): 92 | """ 93 | 计算因子系数(不是因子),和,残差 94 | :param df_stocks: 股票的每期收益率 95 | :param df_market: 市场(上证指数)的每期收益率 96 | :param df_basic: 每只股票的每期的基本信息 97 | :return: 每只股票的4个系数(N,4),和拟合残差(N,M) N为股票数,M为周期数 98 | """ 99 | 100 | # 获得股票池 101 | logger.debug("FF3计算:%d行 股票数据", len(df_stocks)) 102 | 103 | # 获取该日期所有股票的基本面指标,里面有市值信息 104 | df_stocks = df_stocks.merge(df_basic[['ts_code', 'trade_date', 'circ_mv', 'pb']], # circ_mv:市值,pb:账面市值比 105 | on=['ts_code', 'trade_date'], how='left') 106 | 107 | """ 108 | 发现一个问题,就是有的股票周数据缺失,比如周一有,周二停牌了,一直到下周、下下周, 109 | 那么这个股票的在当期的数据就会缺失(目前我们用周五的数据作为本周的周频日),所以上述的股票就缺失周频数据了, 110 | """ 111 | df_smb_hml = df_stocks.groupby('trade_date').apply(calculate_smb_hml) 112 | df_smb_hml = df_smb_hml.reset_index() 113 | 114 | # 把市场因子(上证指数)加入进去 115 | df_market = df_market[['trade_date', 'pct_chg']] 116 | df_market = df_market.rename(columns={'pct_chg': 'R_M'}) 117 | df_market_smb_hml = df_smb_hml.merge(df_market, on=['trade_date'], how='left') 118 | 119 | # 发现一个问题,有的股票某周只有非周五的某天有数据(停牌啥的了),导致那周的数据,无法和别人对齐, 120 | # 就会出现很多单独的某天的截面数据,比如某个日子是周一,就只会它这一只,人家都是周五嘛,周频数据, 121 | # 这种情况下,就会导致计算出来的smb、hml出现nan,所以,需要把他们drop掉 122 | # drop掉之后,恰好是和每个周频的日子的数量对等的 123 | df_market_smb_hml.dropna(inplace=True) 124 | 125 | return df_market_smb_hml 126 | 127 | 128 | # python -m mlstock.factors.fama.fama_model 129 | if __name__ == '__main__': 130 | # 旧的调用方式(index_code="000905.SH"中证500股票池、stock_num、start_date、end_date) 131 | # 与calculate_factors现在的签名不符,需要自行准备df_stocks(个股周收益)、 132 | # df_market(指数收益)、df_basic(每日基本面)三个DataFrame,数据加载方式可参考ff3_residual_std.py的__main__ 133 | # calculate_factors(df_stocks, df_market, df_basic) 134 | pass 135 | -------------------------------------------------------------------------------- /mlstock/factors/ff3_residual_std.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | import logging 4 | import os.path 5 | import time 6 | 7 | import pandas as pd 8 | import statsmodels.formula.api as sm 9 | 10 | from mlstock.factors.factor import ComplexMergeFactor 11 | from mlstock.factors.fama
import fama_model 12 | from mlstock.utils import utils 13 | 14 | logger = logging.getLogger(__name__) 15 | 16 | """ 17 | 18 | 华泰金工:特质波动率——个股最近N个月内用日频收益率对 Fama-French 三因子回归的残差的标准差,N=1,3,6,12 19 | 20 | 21 | 22 | 我本来想用周频,后来发现算的是std标准差,用1个、3个、6个这样的数据数量根本不够算std啊, 23 | 所以,还是得切回到日频,当然,这样的话,计算量会大很多。 24 | 25 | 所谓"特质波动率": 就是源于一个现象"低特质波动的股票,未来预期收益更高"。 26 | 27 | 参考: 28 | - https://www.joinquant.com/view/community/detail/b27081ecc7bccfc7acc484f8a63e2459 29 | - https://www.joinquant.com/view/community/detail/1813dae5165ee3c5c81e2408d7fe576f 30 | - https://zhuanlan.zhihu.com/p/30158144 31 | - https://zhuanlan.zhihu.com/p/379585598 32 | - https://mp.weixin.qq.com/s/k_2ltrIQ7jkgAKhDc7Vo2A 33 | - https://blog.csdn.net/FightingBob/article/details/106791144 34 | - https://uqer.datayes.com/v3/community/share/58db552a6d08bb0051c52451 35 | 36 | 特质波动率(Idiosyncratic Volatility, IV)与预期收益率的负向关系既不符合经典资产定价理论, 37 | 也不符合基于不完全信息的定价理论,因此学术界称之为“特质波动率之谜”。 38 | 39 | 该因子虽在多头部分表现略逊于流通市值因子,但在多空方面表现明显强于流通市值因子, 40 | 说明特质波动率因子具有很好的选股区分能力,并且在空头部分有良好的风险警示作用。 41 | 42 | 基于CAPM的特质波动率 IVCAPM: 就是基于CAPM的残差的年化标准差来衡量。 43 | 基于Fama-French三因子模型的特质波动率 IVFF3: 就是在IVCAPM的基础上,再剔除市值因子和估值因子后的残差的年化标准差来衡量。 44 | 45 | ---- 46 | 关于实现: 47 | 48 | 特质波动率,可以有多种实现方法,可以是CAPM市场的残差,也可以是Fama-French的残差,这里,我用的是FF3的残差, 49 | 啥叫FF3的残差,就是,用Fama定义的模型,先去算因子收益,[参考](../../fama/factor.py)中, 50 | 使用股票池(比如中证500),去算出来的全市场的SMB,HML因子, 51 | 然后,就可以对某一只股票,比如"招商银行",对他进行回归:r_i = α_i + b1 * r_m_i + b2 * smb_i + b3 * hml_i + e_i 52 | 我们要的就是那个e_i,也就是这期里,无法被模型解释的'**特质**'。上式计算,用的是每天的数据,为何强调这点呢,是为了说明e_i的i,指的是每天。 53 | 那波动率呢? 54 | 就是计算你回测周期内的标准差 * sqrt(T),比如你回测周期是20天,那就是把招商银行这20天的特异残差求一个标准差,然后再乘以根号下20。 55 | 这个值,是这20天共同拥有的一个"特异波动率",对,这20天的因子暴露值,都一样,都是这个数!
56 | 我是这么理解的,也不知道对不对,这些文章云山雾罩地不说人话都。 57 | 58 | ---- 59 | 2022.8.31 60 | 发现使用了未来函数,重新修改此代码,但是修改后,速度非常慢: 61 | 62 | 计算完时间窗口0周的所有股票Fama-French回归残差的标准差:0行耗时: 0天0小时2分2秒354毫秒 63 | 50只,2017.1.1~2020.8.1,2分钟 64 | 2000只,2008.1.1~2020.8.1, 40x1.8x2 = 144分钟 = 2.5小时, 65 | 如果是1,3,6,12,需要2.5x4 = 10小时, 66 | 如果用多核,18核跑,也需要32分钟。 67 | 但是,没办法,之前的版本虽然快,但是是用了所有的时序数据做ff3的回归,是存在未来函数的,不能那样做 68 | 好吧,再说吧,先把代码跑通 69 | 70 | """ 71 | 72 | N = [1] # , 3, 6, 12] 73 | WEEK_TRADE_DAYS = 5 74 | 75 | 76 | class FF3ResidualStd(ComplexMergeFactor): 77 | 78 | @property 79 | def name(self): 80 | return [f"std_ff3factor_{i}w" for i in N] 81 | 82 | @property 83 | def cname(self): 84 | return [f'{i}周特异波动率' for i in N] 85 | 86 | def _calculate_one_stock_ff3_residual_std(self, df_one_stock_daily, df_fama, period): 87 | """ 88 | 计算特异性:讲人话,就是用fama计算后它(这只股票)的理论价格,和它的实际价格的偏差, 89 | 周频的每周,都可以依据当周的df_fama因子(市场、smb、hml)算出每一个股票的理论价格,这3个因子是所有的股票共享的哈,牢记, 90 | 所以,对一只股票而言,每周的df_fama因子(市场、smb、hml)不同,就导致用它算出的这只股票的特异性数值, 91 | 这样下来,这只股票的每周的特异性数值(用fama因子计算出来的残差),就是一个时间序列。 92 | 这个时间序列,就是我们所要的。 93 | 94 | "用横截面(每期)上的所有股票算出的smb/hml/rm,用其当做当期因子,然后对单只股票所有期,进行时间序列回归,可得系数权重标量,和每期残差" 95 | 96 | 参考:https://zhuanlan.zhihu.com/p/131533515 97 | :param df_one_stock_daily: 一只股票的日频数据(含pct_chg收益率),按时间序列排列 98 | :param df_fama: 按照时间序列,所有股票共享的fama-french的三个因子,每周3个值,N期 99 | :return: 返回的一个每只股票的特异性收益率的时间序列 100 | """ 101 | start_time = time.time() 102 | stock_code = df_one_stock_daily.name # 保留一下股票名称,因为下面的merge后,就会消失 103 | # 细节:1个时间截面上的所有股票,共享fama的3因子数值 104 | df_one_stock_daily = df_one_stock_daily.merge(df_fama, on=['trade_date'], how='left') 105 | 106 | def _calculate_residual_std(s): # s是series,带着index,用index反向查找rolling记录 107 | """ 108 | :param s: 一只股票的当日之前的N天的数据(rolling窗口) 109 | :return: 110 | """ 111 | df = df_one_stock_daily.loc[s.index] 112 | 113 | # 细节:这个是这只股票的所有期(我们是N周)N*5天的数据进行回归 114 | ols_result = sm.ols(formula='pct_chg ~ R_M + SMB + HML', data=df).fit() 115 | 116 | # 获得残差的标准差,注意,每个股票每周都会计算出来一个 117 | std = ols_result.resid.std() 118 | return std
119 | 120 | # rolling不支持多列,raw=False是为了返回seies,以便获得index 121 | df = df_one_stock_daily.pct_chg.rolling(window=period).apply(_calculate_residual_std, raw=False) 122 | utils.time_elapse(start_time, f"计算完股票[{stock_code}]的fama-french前 {period} 天的残差标准差") 123 | return df 124 | 125 | def calculate(self, stock_data): 126 | """ 127 | 计算是以天为最小单位,1、3、6、12周的窗口长度的收益率波动标准差 128 | :param stock_data: 129 | :return: 130 | """ 131 | 132 | df_daily = stock_data.df_daily 133 | df_index_daily = stock_data.df_index_daily 134 | df_daily_basic = stock_data.df_daily_basic 135 | start_time = time.time() 136 | df_fama = fama_model.calculate_factors(df_stocks=df_daily, df_market=df_index_daily, df_basic=df_daily_basic) 137 | utils.time_elapse(start_time, "计算完市场的Fama-Frech三因子数据") 138 | 139 | # 2.按照要求计算以1、3、6、12周的滑动窗口,计算出每期的的特异性波动的方差 140 | start_time = time.time() 141 | 142 | for i, n in enumerate(N): 143 | # 变成周频 144 | time_window = n * WEEK_TRADE_DAYS 145 | # 每只股票,针对"每周"数据,都要逐个计算其残差std 146 | df = df_daily[['ts_code', 'trade_date', 'pct_chg']].groupby('ts_code') \ 147 | .apply(self._calculate_one_stock_ff3_residual_std, 148 | df_fama=df_fama, 149 | period=time_window) 150 | df_daily[self.name[i]] = df.reset_index(drop=True) 151 | start_time = utils.time_elapse(start_time, 152 | f"计算完时间窗口{i}周的所有股票Fama-French回归残差的标准差:{len(df)}行", 153 | "debug") 154 | 155 | return df_daily[['ts_code', 'trade_date'] + self.name] 156 | 157 | 158 | # python -m mlstock.factors.ff3_residual_std 159 | if __name__ == '__main__': 160 | from mlstock.data import data_filter 161 | from mlstock.data import data_loader, data_filter 162 | from mlstock.data.datasource import DataSource 163 | from mlstock.data.stock_info import StocksInfo 164 | 165 | utils.init_logger(file=False) 166 | 167 | start_time = time.time() 168 | 169 | start_date = "20200101" 170 | end_date = "20220801" 171 | df_stock_basic = data_filter.filter_stocks() 172 | df_stock_basic = df_stock_basic.iloc[:50] 173 | datasource = DataSource() 174 | stocks_info = 
StocksInfo(df_stock_basic.ts_code, start_date, end_date)
175 |     df_stocks = data_loader.load(datasource, df_stock_basic.ts_code, start_date, end_date)
176 | 
177 |     factor_alpha_beta = FF3ResidualStd(datasource, stocks_info)
178 |     df = factor_alpha_beta.calculate(df_stocks)
179 | 
180 |     print("因子结果\n", df)
181 |     utils.time_elapse(start_time, "全部处理")
182 | 
183 | """
184 | Compute, for every stock, its daily residual against FF3.
185 | 
186 | First run the Fama-French three-factor regression:
187 | r_i = α_i + b1 * r_m_i + b2 * smb_i + b3 * hml_i + e_i
188 | Regress each stock separately; r_i, r_m_i, smb_i, hml_i are known. The i here indexes the stock, not time.
189 | And r_i is not a single day but the whole date range -- e.g. the regression data for 招商银行 runs from 2008 to 2021.
190 | The regression yields α_i, b1, b2, b3, e_i; we only need the residual e_i, which is itself multi-day -- this stock's residual on every day.
191 | References:
192 | - https://blog.csdn.net/CoderPai/article/details/82982146
193 | - https://zhuanlan.zhihu.com/p/261031713
194 | With ols from statsmodels.formula you can write a formula naming Y and the X_i directly by dataframe column name -- very neat, love it.
195 | Data:
196 | After the merge the data looks like:
197 | trade_date       R_M       SMB       HML       ts_code   pct_chg
198 | 2016-06-24   0.12321  0.165260  0.002198  0.085632  0.052
199 | 2016-06-27   0.2331   0.165537  0.003583  0.063299  0.01
200 | 2016-06-28   0.1234   0.135215  0.010403  0.059038  0.035
201 | ...
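The merge that produces a table like the one above can be sketched with plain pandas (toy values and a hypothetical stock code; the factor columns are shared by every stock on a given trade_date):

```python
import pandas as pd

# One stock's daily returns...
df_stock = pd.DataFrame({
    "trade_date": ["2016-06-24", "2016-06-27", "2016-06-28"],
    "ts_code": ["600000.SH"] * 3,          # hypothetical code
    "pct_chg": [0.052, 0.010, 0.035],
})
# ...and the market-wide FF3 factors, one row per date, shared by all stocks.
df_fama = pd.DataFrame({
    "trade_date": ["2016-06-24", "2016-06-27", "2016-06-28"],
    "R_M": [0.12321, 0.2331, 0.1234],
    "SMB": [0.165260, 0.165537, 0.135215],
    "HML": [0.002198, 0.003583, 0.010403],
})
# how='left' keeps every trading day of the stock and attaches that day's factors.
df = df_stock.merge(df_fama, on=["trade_date"], how="left")
print(df.columns.tolist())
```

After this merge, `pct_chg ~ R_M + SMB + HML` can be regressed directly on the columns of `df`.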
202 | Run the regression:
203 | r_i = α_i + b1 * r_m_i + b2 * smb_i + b3 * hml_i + e_i
204 | Regress over all of one stock's days; r_i, r_m_i, smb_i, hml_i are known, and the regression yields e_i (the residuals).
205 | """
206 | 
-------------------------------------------------------------------------------- /mlstock/factors/finance_indicator.py: --------------------------------------------------------------------------------
1 | from mlstock.data.stock_info import StocksInfo
2 | from mlstock.factors import I
3 | from mlstock.factors.factor import FinanceFactor
4 | 
5 | from mlstock.factors.mixin.fill_mixin import FillMixin
6 | from mlstock.factors.mixin.ttm_mixin import TTMMixin
7 | from mlstock.utils import utils
8 | import logging
9 | 
10 | logger = logging.getLogger(__name__)
11 | 
12 | 
13 | class FinanceIndicator(FinanceFactor, FillMixin, TTMMixin):
14 |     """
15 |     Indicators derived from the fina_indicator table
16 |     """
17 | 
18 |     FIELDS_DEF = [
19 |         # Shenzai commented these out, not sure why; keeping them here for now
20 |         # ('invturn_days', '存货周转天数')
21 |         # ('arturn_days', '应收账款周转天数')
22 |         # ('inv_turn', '存货周转率')
23 |         I('account_receivable_turnover', 'ar_turn', '应收账款周转率', '营运能力', ttm=True),
24 |         I('current_assets_turnover', 'ca_turn', '流动资产周转率', '营运能力'),
25 |         I('fixed_assets_turnover', 'fa_turn', '固定资产周转率', '营运能力'),
26 |         I('total_assets_turnover', 'assets_turn', '总资产周转率', '营运能力'),
27 |         # Short-term solvency (短期偿债能力)
28 |         I('current_ratio', 'current_ratio', '流动比率', '短期偿债能力'),
29 |         I('quick_ratio', 'quick_ratio', '速动比率', '短期偿债能力'),
30 |         I('operation_cashflow_to_current_liabilities', 'ocf_to_shortdebt', '经营活动产生的现金流量净额比流动负债', '短期偿债能力'),
31 |         # Long-term solvency (长期偿债能力)
32 |         I('equity_ratio', 'debt_to_eqt', '产权比率', '长期偿债能力'),
33 |         I('tangible_assets_to_total_liability', 'tangibleasset_to_debt', '有形资产比负债合计', '长期偿债能力'),
34 |         # ('ebit_to_interest','已获利息倍数(EBIT比利息费用)'),
35 |         # Profitability (盈利能力)
36 |         I('total_profit_to_operate', 'profit_to_op', '利润总额比营业收入', '盈利能力'),
37 |         # Going by tushare's Chinese name here; its English name is ambiguous and reads like an annualized return on assets.
38 |         # 总资产报酬率 is total remuneration over average total assets for a period; 总资产净利率 is net profit over average total assets.
39 |         # The "报酬" in 总资产报酬率 can be any flavor of "profit" -- net profit, EBIT, or total profit -- it is not fixed.
40 | 
I('annualized_net_return_of_total_assets', 'roa_yearly', '年化总资产净利率', '盈利能力'),
41 |         # Growth (发展能力)
42 |         I('total_operate_revenue_yoy', 'tr_yoy', '营业总收入同比增长率(%)', '发展能力'),
43 |         I('operate_revenue_yoy', 'or_yoy', '营业收入同比增长率(%)', '发展能力'),  # Huatai: YoY growth of operating revenue (latest report, YTD)
44 |         I('total_profit_yoy', 'ebt_yoy', '利润总额同比增长率(%)', '发展能力'),  # tushare calls it ebt; my guess is that abbreviates EBIT, but I go by the Chinese name
45 |         I('operate_profit_yoy', 'op_yoy', '营业利润同比增长率(%)', '发展能力'),
46 |         # this field is not among the default download columns; skip it for now
47 |         # I('net_profit_yoy', 'q_profit_yoy', '净利润同比增长率', '发展能力'),  # Huatai: YoY growth of net profit (latest report, YTD)
48 |         I('operate_cashflow_yoy', 'ocf_yoy', '经营活动产生的现金流量净额同比增长率', '发展能力'),  # Huatai: YoY growth of operating cash flow (latest report, YTD)
49 |         # Financial quality (财务质量)
50 |         I('net_profit_deduct_non_recurring_profit_loss', 'profit_dedt', '扣除非经常性损益后的净利润(扣非净利润)', '财务质量', ttm=True),
51 |         # Huatai actually wants "net profit after non-recurring items (TTM) / total market cap"
52 |         I('operate_cashflow_per_share', 'ocfps', '每股经营活动产生的现金流量净额', '财务质量', ttm=True),
53 |         # Huatai wants "operating cash flow (TTM) / total market cap"; this is used as a substitute
54 |         I('ROE_TTM', 'roe', '净资产收益率', '财务质量', ttm=True),  # ROE (latest report, TTM)
55 |         # I('','ROA_yoy',  -- tushare does not have this one
56 |         I('ROA_TTM', 'roa', '总资产报酬率', '财务质量', ttm=True),  # ROA (latest report, TTM)
57 |         I('ROE_YOY', 'roe_yoy', '净资产收益率(摊薄)同比增长率', '财务质量'),  # Huatai: YoY growth of ROE (latest report, YTD)
58 |     ]
59 | 
60 |     # Operating capability (营运能力)
61 |     @property
62 |     def data_loader_func(self):
63 |         return self.datasource.fina_indicator
64 | 
65 | 
66 | # python -m mlstock.factors.finance_indicator
67 | if __name__ == '__main__':
68 |     utils.init_logger(file=False)
69 | 
70 |     start_date = '20150703'
71 |     end_date = '20190826'
72 |     stocks = ['600000.SH', '002357.SZ', '000404.SZ', '600230.SH']
73 | 
74 |     FinanceIndicator.test(stocks, start_date, end_date)
75 | 
-------------------------------------------------------------------------------- /mlstock/factors/income.py: --------------------------------------------------------------------------------
1 | from mlstock.data.stock_info import StocksInfo
2 | from mlstock.factors import I
3 | from mlstock.factors.factor import Factor, FinanceFactor
4 | 
5 | from 
mlstock.factors.mixin.fill_mixin import FillMixin 6 | from mlstock.factors.mixin.ttm_mixin import TTMMixin 7 | from mlstock.utils import utils 8 | import logging 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class Income(FinanceFactor, FillMixin, TTMMixin): 14 | """ 15 | 利润表 16 | """ 17 | FIELDS_DEF = [ 18 | I('basic_eps', 'basic_eps', '基本每股收益', '利润表', ttm=True), 19 | I('diluted_eps', 'diluted_eps', '稀释每股收益', '利润表', ttm=True), 20 | I('total_revenue', 'total_revenue', '营业总收入', '利润表', ttm=True), 21 | I('total_cogs', 'total_cogs', '营业总成本', '利润表', ttm=True), 22 | I('operate_profit', 'operate_profit', '营业利润', '利润表', ttm=True), 23 | I('none_operate_income', 'non_oper_income', '营业外收入', '利润表', ttm=True), 24 | I('none_operate_exp', 'non_oper_exp', '营业外支出', '利润表', ttm=True), 25 | I('total_profit', 'total_profit', '利润总额', '利润表', ttm=True), 26 | I('net_income', 'n_income', '净利润(含少数股东损益)', '利润表', ttm=True)] 27 | 28 | """ 29 | tushare , is ttm, tushare title, huatai 30 | net_after_nr_lp_correct ttm 扣除非经常性损益后的净利润(更正前) | 估值/EPcut/扣除非经常性损益后净利润(TTM)/总市值 31 | """ 32 | 33 | @property 34 | def data_loader_func(self): 35 | return self.datasource.income 36 | 37 | 38 | # python -m mlstock.factors.income 39 | if __name__ == '__main__': 40 | utils.init_logger(file=False) 41 | 42 | start_date = '20080101' 43 | end_date = '20220826' 44 | stocks = ['600000.SH', '002357.SZ', '000404.SZ', '600230.SH'] 45 | stocks = ['000996.SZ', '002008.SZ', '002014.SZ', '002026.SZ', '002033.SZ', 46 | '002038.SZ', '002057.SZ', '600728.SH', '000682.SZ', '000697.SZ', 47 | '000707.SZ', '000718.SZ', '000722.SZ', '000731.SZ', '000737.SZ', 48 | '000766.SZ', '000778.SZ', '000792.SZ', '000798.SZ', '000812.SZ'] 49 | Income.test(stocks, start_date, end_date) 50 | -------------------------------------------------------------------------------- /mlstock/factors/kdj.py: -------------------------------------------------------------------------------- 1 | import talib 2 | import pandas as pd 3 | from 
mlstock.factors.factor import SimpleFactor
4 | 
5 | fastk_period = 9
6 | slowk_period = 3
7 | slowk_matype = 0
8 | slowd_period = 3
9 | slowd_matype = 0
10 | 
11 | 
12 | class KDJ(SimpleFactor):
13 |     # English name
14 |     @property
15 |     def name(self):
16 |         return "KDJ"
17 | 
18 |     # Chinese name
19 |     @property
20 |     def cname(self):
21 |         return "KDJ"
22 | 
23 |     def calculate(self, stock_data):
24 |         df_weekly = stock_data.df_weekly
25 |         K, D = talib.STOCH(
26 |             df_weekly.high,
27 |             df_weekly.low,
28 |             df_weekly.close,
29 |             fastk_period=fastk_period,
30 |             slowk_period=slowk_period,
31 |             slowk_matype=slowk_matype,
32 |             slowd_period=slowd_period,
33 |             slowd_matype=slowd_matype)
34 | 
35 |         # derive J: J = (3*K) - (2*D); keep df_weekly's index so the values stay aligned with the rows
36 |         J = pd.Series(3 * K - 2 * D, index=df_weekly.index)
37 |         return J
-------------------------------------------------------------------------------- /mlstock/factors/macd.py: --------------------------------------------------------------------------------
1 | import talib as ta
2 | from .factor import SimpleFactor
3 | 
4 | fastperiod = 12
5 | slowperiod = 26
6 | signalperiod = 9
7 | 
8 | class MACD(SimpleFactor):
9 |     """
10 |     1. Line 1: first get DIF = EMA12 - EMA26
11 |     2. Line 2: then get DEA = the 9-period weighted moving average of DIF
12 |     So 26 weeks + 9 weeks = 35 weeks are needed before a valid DEA value exists; preload 35 weeks, roughly 9 months of data
13 |     """
14 | 
15 |     # English name
16 |     @property
17 |     def name(self):
18 |         return "MACD"
19 | 
20 |     # Chinese name
21 |     @property
22 |     def cname(self):
23 |         return "MACD"
24 | 
25 |     def calculate(self, stock_data):
26 |         df_weekly = stock_data.df_weekly
27 |         return df_weekly.groupby('ts_code').close.apply(self.__macd)
28 | 
29 |     def __macd(self, x):
30 |         # ta.MACD returns (macd line, signal line, histogram), i.e. (DIF, DEA, MACD bar) in the Chinese convention
31 |         dif, dea, hist = ta.MACD(x,
32 |                                  fastperiod=fastperiod,
33 |                                  slowperiod=slowperiod,
34 |                                  signalperiod=signalperiod)
35 |         return dif
36 | 
37 | 
38 | # python -m mlstock.factors.macd
39 | if __name__ == '__main__':
40 |     from mlstock.data import data_loader
41 |     from mlstock.data.datasource import DataSource
42 |     from mlstock.data.stock_info import StocksInfo
43 |     from mlstock.utils import utils
44 |     import 
pandas; pandas.set_option('display.max_rows', 1000000)
45 |     utils.init_logger(file=False)
46 | 
47 |     start_date = "20180101"
48 |     end_date = "20200101"
49 |     stocks = ['000007.SZ']  # stray trailing space removed from the code
50 |     datasource = DataSource()
51 |     stocks_info = StocksInfo(stocks, start_date, end_date)
52 |     stock_data = data_loader.load(datasource, stocks, start_date, end_date)
53 |     df = MACD(datasource, stocks_info).calculate(stock_data)
54 |     df_weekly = stock_data.df_weekly
55 |     valid_len = len(df_weekly[df_weekly['trade_date'] > start_date])
56 |     df = df[-valid_len:]
57 |     print(df)
58 |     print("NA占比: %.2f%%" % (df.isna().sum() * 100 / len(df)))
59 | 
-------------------------------------------------------------------------------- /mlstock/factors/mixin/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/factors/mixin/__init__.py -------------------------------------------------------------------------------- /mlstock/factors/mixin/fill_mixin.py: --------------------------------------------------------------------------------
1 | from mlstock.utils import utils
2 | 
3 | import logging
4 | 
5 | from mlstock.utils.utils import logging_time
6 | 
7 | logger = logging.getLogger(__name__)
8 | 
9 | 
10 | class FillMixin:
11 | 
12 |     @logging_time("间隔填充")
13 |     def fill(self, df_stocks, df_finance, finance_column_names):
14 |         """
15 |         Back-fill the sparse, point-in-time financial TTM values onto the weekly trading data.
16 |         E.g. for the weekly bar dated 7.22 (a Friday), the matching financial data is dated 6.30,
17 |         so the 6.30 TTM value must be filled onto that 7.22 row.
18 | 
19 |         This works together with TTM: the TTM mixin computes TTM values, but only on announcement dates,
20 |         while we need a TTM value on each weekly date. Given reporting lag,
21 |         the 'reasonable' TTM for a given day is the one from the last announcement date before it.
22 |         That is why this fill function exists: to fill a TTM value onto every Friday.
23 | 
24 |         :param df_stocks: the original weekly data, keyed on Friday dates
25 |         :param df_finance: financial data, on announcement dates only; the TTM transform must already have been applied
26 |         """
27 |         if not isinstance(finance_column_names, list):
28 |             finance_column_names = [finance_column_names]
29 | 
30 |         # 
do the join; note: use an outer join so no date from either side (financial announcements, or stock trading data) is dropped
31 |         df_merge = df_stocks.merge(df_finance,
32 |                                    how="outer",
33 |                                    left_on=['ts_code', 'trade_date'],
34 |                                    right_on=['ts_code', 'ann_date'])
35 | 
36 |         # after the join the dates are out of order, so sort (ascending by default, old to new)
37 |         df_merge = df_merge.sort_values(['ts_code', 'trade_date'])
38 | 
39 |         # now fill the requested financial columns: ffill first, then bfill
40 |         # ffill fills downward, i.e. old dates fill newer ones; do both fills per stock so values never leak across ts_code boundaries
41 |         df_merge[finance_column_names] = df_merge.groupby('ts_code')[finance_column_names].transform(lambda s: s.ffill().bfill())
42 |         # use the close column to drop invalid rows (close is never NaN in trading rows, but is NaN on the outer-joined financial-date rows)
43 |         df_merge = df_merge[~df_merge['close'].isna()]
44 |         # keep only the needed columns
45 |         df = df_merge[['ts_code', 'trade_date'] + finance_column_names]
46 |         return df
47 | 
48 | 
49 | # python -m mlstock.factors.mixin.fill_mixin
50 | if __name__ == '__main__':
51 |     utils.init_logger(file=False)
52 | 
53 |     start_date = '20150703'
54 |     end_date = '20190826'
55 |     stocks = ['600000.SH', '002357.SZ', '000404.SZ', '600230.SH']
56 |     finance_column_names = ['basic_eps', 'diluted_eps']
57 | 
58 |     import tushare as ts
59 |     import pandas as pd
60 | 
61 |     pro = ts.pro_api()
62 | 
63 |     df_finance_list = []
64 |     df_stock_list = []
65 |     for ts_code in stocks:
66 |         df_finance = pro.income(ts_code=ts_code, start_date=start_date, end_date=end_date,
67 |                                 fields='ts_code,ann_date,f_ann_date,end_date,report_type,comp_type,basic_eps,diluted_eps')
68 |         df_finance_list.append(df_finance)
69 |         df_stock = pro.weekly(ts_code=ts_code, start_date=start_date, end_date=end_date)
70 |         df_stock_list.append(df_stock)
71 |     df_finance = pd.concat(df_finance_list)
72 |     df_stock = pd.concat(df_stock_list)
73 |     logger.info("原始数据:\n股票数据\n%r\n财务数据:%r", df_stock, df_finance)
74 |     df = FillMixin().fill(df_stock, df_finance, finance_column_names)
75 |     logger.info("填充完数据:\n%r", df)
76 | 
-------------------------------------------------------------------------------- /mlstock/factors/old/BM.py: 
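The outer-join-then-grouped-fill idea above can be sketched on toy data (hypothetical stock code and values):

```python
import pandas as pd

# Weekly bars (Fridays) for one stock...
weekly = pd.DataFrame({
    "ts_code": ["600000.SH"] * 3,
    "trade_date": ["20210402", "20210409", "20210416"],
    "close": [10.0, 10.5, 10.2],
})
# ...and one sparse financial record, dated by its announcement date.
finance = pd.DataFrame({
    "ts_code": ["600000.SH"],
    "ann_date": ["20210409"],
    "basic_eps": [0.52],
})
# outer join keeps every date from both sides, then per-stock ffill/bfill
# spreads the announced value onto the surrounding Fridays.
m = weekly.merge(finance, how="outer",
                 left_on=["ts_code", "trade_date"],
                 right_on=["ts_code", "ann_date"])
m = m.sort_values(["ts_code", "trade_date"])
m["basic_eps"] = m.groupby("ts_code")["basic_eps"].ffill().bfill()
m = m[~m["close"].isna()][["ts_code", "trade_date", "basic_eps"]]
print(m["basic_eps"].tolist())
```

Every Friday row ends up carrying the latest (or, before the first announcement, the earliest) published value, which is what `fill()` does at scale.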
--------------------------------------------------------------------------------
1 | """
2 | Book-to-market ratio: net assets / market cap
3 | 
4 | [Zhihu reference](https://www.zhihu.com/question/23906290/answer/123700275), computation:
5 | - 1. book-to-market (BM) = shareholders' equity / company market cap
6 | - 2. shareholders' equity (net assets) = total assets - total liabilities (net assets per share x shares outstanding)
7 | - 3. company market cap = shares outstanding x price per share
8 | - 4. BM = equity / market cap = (net assets per share x shares) / (shares x price) = net assets per share / price = B/P = the reciprocal of the price-to-book ratio
9 | """
10 | from mfm_learner.datasource import datasource_utils
11 | from mfm_learner.example.factors.factor import Factor
12 | 
13 | 
14 | class BMFactor(Factor):
15 |     """
16 |     Book-to-market: net assets / market cap,
17 |     i.e. the reciprocal of the price-to-book ratio (pb = market cap / net assets),
18 |     so just take 1 / pb
19 |     """
20 | 
21 |     def __init__(self):
22 |         super().__init__()
23 | 
24 |     def name(self):
25 |         return "bm"
26 | 
27 |     def cname(self):
28 |         return "账面市值比"
29 | 
30 |     def calculate(self, stock_codes, start_date, end_date):
31 |         df_basic = self.datasource.daily_basic(stock_codes, start_date, end_date)
32 |         df_basic = datasource_utils.reset_index(df_basic)
33 |         return 1 / df_basic['pb']  # pb is the price-to-book ratio; BM is 1/pb (the original used 'ps', which is price-to-sales, not P/B)
-------------------------------------------------------------------------------- /mlstock/factors/old/ROA.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/factors/old/ROA.py -------------------------------------------------------------------------------- /mlstock/factors/old/ROE.py: --------------------------------------------------------------------------------
1 | import logging
2 | 
3 | from mfm_learner.datasource import datasource_utils
4 | from mfm_learner.example import factor_utils
5 | from mfm_learner.example.factors.factor import Factor
6 | from mfm_learner.utils import utils
7 | 
8 | logger = logging.getLogger(__name__)
9 | 
10 | 
11 | class ROETTMFactor(Factor):
12 |     """
13 | 
14 |     ROE计算:https://baike.baidu.com/item/%E5%87%80%E8%B5%84%E4%BA%A7%E6%94%B6%E7%9B%8A%E7%8E%87
15 |     定义公式:
16 | 
1、净资产收益率=净利润*2/(本年期初净资产+本年期末净资产) 17 | 2、杜邦公式(常用) 18 | - 净资产收益率=销售净利率*资产周转率*杠杆比率 19 | - 净资产收益率 =净利润 /净资产 20 | -净资产收益率= (净利润 / 销售收入)*(销售收入/ 总资产)*(总资产/净资产) 21 | 22 | ROE是和分期相关的,所以年报中的ROE的分期是不同的,有按年、季度、半年等分的,所以无法统一比较, 23 | 解决办法,是都画为TTM,即从当日开始,向前回溯一整年,统一成期间为1整年的ROE_TTM(Trailing Twelve Month) 24 | 但是,由于财报的发布滞后,所以计算TTM的时候,需要考虑这种滞后性,后面会详细讨论: 25 | ------------------------------------------------------------------------ 26 | ts_code ann_date end_date roe 27 | 0 600000.SH 20211030 20210930 6.3866 28 | 1 600000.SH 20211030 20210930 6.3866 29 | 2 600000.SH 20210828 20210630 4.6233 30 | 3 600000.SH 20210430 20210331 2.8901 31 | 4 600000.SH 20210430 20210331 2.8901 32 | 5 600000.SH 20210327 20201231 9.7856 33 | 6 600000.SH 20201031 20200930 7.9413 34 | 35 | ROE是净资产收益率,ann_date是真实发布日,end_date是对应财报上的统计截止日, 36 | 所以,指标有滞后性,所以,以ann_date真实发布日为最合适, 37 | 为了可比性,要做ttm才可以,即向前滚动12个月, 38 | 但是,即使TTM了,还要考虑不同股票的对齐时间, 39 | 比如截面时间是4.15日: 40 | A股票:3.31号发布1季报,他的TTM,就是 roe_1季报 + roe_去年年报 - roe_去年1季报 41 | B股票:4.10号发布1季报,他的TTM,就是 roe_1季报 + roe_去年年报 - roe_去年1季报 42 | C股票:还没有发布1季报,但是他在3.31号发布了去年年报,所以,他只能用去年的年报数据了 43 | D股票:还没有发布1季报,也没发布去年年报,所以,他只能用去年的3季报(去年10.31日最晚发布) 44 | ------------------------------------------------------------------------ 45 | 因子的值,需要每天都要计算出来一个值,即每天都要有一个ROE_TTM, 46 | 所以,每天的ROE_TTM,一定是回溯到第一个可以找到的财报发表日,然后用那个发布日子,计算那之前的TTM, 47 | 举个例子, 48 | 我计算C股票的4.15日的ROE_TTM,就回溯到他是3.31号发布的财报,里面披露的是去年的年报的ROE 49 | 我计算B股票的4.15日的ROE_TTM,就回溯到他是4.10号发布的财报,里面披露的是今年1季报ROE+去年的年报ROE-去年1季报的ROE 50 | 这里有个小的问题,或者说关键点: 51 | 就是A股票和B股票,实际上比的不是同一个时间的东东, 52 | A股票是去年的12个月TTM的ROE, 53 | B股票则是去年1季度~今年1季度12个月的TTM, 54 | 他俩没有完全对齐,差了1个季度, 55 | 我自己觉得,可能这是"最优"的比较了把,毕竟,我是不能用"未来"数据的, 56 | 我站在4.15日,能看到的A股票的信息,就是去年他的ROE_TTM,虽然滞后了1个季度,但是总比没有强。 57 | 1个季度的之后,忍了。 58 | ------------------------------------------------------------------------ 59 | 这个确实是问题,我梳理一下问题,虽然解决不了,但是至少要自己门清: 60 | - 多只股票的ROE_TTM都会滞后,最多可能会之后3-4个月(比如4.30号才得到去年的年报) 61 | - 多只股票可能都无法对齐ROE_TTM,比如上例中A用的是当期的,而极端的D,用的居然是截止去年10.30号发布的9.30号的3季报的数据了 62 | """ 63 | 64 | def 
__init__(self): 65 | super().__init__() 66 | 67 | def name(self): 68 | return "roe_ttm" 69 | 70 | def calculate(self, stock_codes, start_date, end_date): 71 | 72 | start_date_2years_ago = utils.last_year(start_date, num=2) 73 | trade_dates = self.datasource.trade_cal(start_date, end_date) 74 | df_finance = self.datasource.fina_indicator(stock_codes, start_date_2years_ago, end_date) 75 | 76 | # TODO 懒得重新下载fina_indicator,临时trick一下 77 | df_finance['end_date'] = df_finance['end_date'].apply(str) 78 | 79 | assert len(df_finance) > 0 80 | df = factor_utils.handle_finance_ttm(stock_codes, 81 | df_finance, 82 | trade_dates, 83 | col_name_value='roe', 84 | col_name_finance_date='end_date') 85 | 86 | df = datasource_utils.reset_index(df) 87 | return df['roe_ttm'] 88 | 89 | 90 | class ROEYOYFactor(Factor): 91 | """ 92 | ROEYoY:ROE Year over Year,同比增长率 93 | 关于ROE计算:https://baike.baidu.com/item/%E5%87%80%E8%B5%84%E4%BA%A7%E6%94%B6%E7%9B%8A%E7%8E%87 94 | 95 | 遇到一个问题: 96 | SELECT ann_date,end_date,roe_yoy 97 | FROM tushare.fina_indicator 98 | where ts_code='600000.SH' and ann_date='20180428' 99 | ------------------------------------------------------ 100 | '20180428','20180331','-12.2098' 101 | '20180428','20171231','-11.6185' 102 | ------------------------------------------------------ 103 | 可以看出来,用code+datetime作为索引,可以得到2个roe_yoy,这是因为,一个是年报和去年同比,一个是季报和去年同比, 104 | 而且,都是今天发布的,所以,这里的我们要理解,yoy,同比,比的可能是同季度的,也可能是半年的,一年的, 105 | 如果严格的话,应该使用roe_ttm,然后去和去年的roe_ttm比较,这样最准,但是,这样处理就太复杂了, 106 | 所以,折中的还是用甲股票和乙股票,他们自己的当日年报中提供的yoy同比对比,比较吧。 107 | 这种折中,可能潜在一个问题,就是我可能用我的"季报同比",对比了,你的"年报同比",原因是我们同一日、相近日发布的不同scope的财报, 108 | 这是一个问题,最好的解决办法,还是我用我的ROE_TTM和我去年今日的ROE_TTM,做同比;然后,再和你的结果比,就是上面说的方法。 109 | 算了,还是用这个这种方法吧, 110 | 所以,上述的问题,是我自己(甲股票)同一天发了2份财报(年报和季报),这个时候我取哪个yoy同比结果呢, 111 | 我的解决之道是,随便,哪个都行 112 | 113 | """ 114 | 115 | def __init__(self): 116 | super().__init__() 117 | 118 | def name(self): 119 | return "roe_yoy" 120 | 121 | def cname(self): 122 | return "净资产收益率同比增长率(ROE变动)" 123 | 124 | def 
calculate(self, stock_codes, start_date, end_date): 125 | 126 | df = factor_utils.handle_finance_fill(self.datasource,stock_codes,start_date,end_date,'roe_yoy') 127 | 128 | assert len(df) > 0 129 | df = datasource_utils.reset_index(df) 130 | 131 | if not df.index.is_unique: 132 | old_len = len(df) 133 | df = df[~df.index.duplicated()] 134 | logger.warning("因子主键[日期+股票]重复,合计抛弃%d条", old_len - len(df)) 135 | 136 | assert df.columns == ['roe_yoy'] 137 | 138 | return df 139 | -------------------------------------------------------------------------------- /mlstock/factors/old/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/factors/old/__init__.py -------------------------------------------------------------------------------- /mlstock/factors/old/assets_debt_ratio.py: -------------------------------------------------------------------------------- 1 | from mfm_learner.datasource import datasource_utils 2 | from mfm_learner.example.factor_utils import handle_finance_fill 3 | from mfm_learner.example.factors.factor import Factor 4 | 5 | 6 | class AssetsDebtRateFactor(Factor): 7 | 8 | def __init__(self): 9 | super().__init__() 10 | 11 | def name(self): 12 | return "assets_debt_rate" 13 | 14 | def cname(self): 15 | return "资产负债率" 16 | 17 | def calculate(self, stock_codes, start_date, end_date): 18 | df = handle_finance_fill(self.datasource, 19 | stock_codes, 20 | start_date, 21 | end_date, 22 | finance_index_col_name_value='debt_to_assets') 23 | 24 | df = datasource_utils.reset_index(df) 25 | return df['debt_to_assets'] / 100 # 百分比是0~100,要变成小数 26 | -------------------------------------------------------------------------------- /mlstock/factors/old/clv.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | # 第一个因子: 4 | # clv: close location value, 5 | # ( (close-day_low) - 
(day_high - close) ) / (day_high - day_low) 6 | # 这玩意,是一个股票的,每天都有,是一个数, 7 | # 我们要从一堆的股票N只中,得到N个这个值,可以形成一个截面, 8 | # 用这个截面,我们可以拟合出β和α, 9 | # 然后经过T个周期(交易日),就可以有个T个β和α, 10 | # 因子长这个样子: 11 | # trade_date 12 | # 000001.XSHE 000002.XSHE 13 | # 2019-01-02 -0.768924 0.094851 。。。。。 14 | # 。。。。。。 15 | # 类比gpd、股指(市场收益率),这个是因子不是一个啊?而是多个股票N个啊?咋办? 16 | # 17 | # 参考:https://www.bilibili.com/read/cv13893224?spm_id_from=333.999.0.0 18 | import logging 19 | 20 | from mfm_learner.datasource import datasource_utils 21 | from mfm_learner.example.factors.factor import Factor 22 | 23 | logger = logging.getLogger(__name__) 24 | 25 | 26 | class CLVFactor(Factor): 27 | 28 | def __init__(self): 29 | super().__init__() 30 | 31 | def name(self): 32 | return "clv" 33 | 34 | def calculate(self, stock_codes, start_date, end_date, df_daily=None): 35 | if df_daily is None: 36 | df_daily = datasource_utils.load_daily_data(self.datasource, stock_codes, start_date, end_date) 37 | 38 | # 计算CLV因子 39 | df_daily['CLV'] = ((df_daily['close'] - df_daily['low']) - (df_daily['high'] - df_daily['close'])) / ( 40 | df_daily['high'] - df_daily['low']) 41 | # 处理出现一字涨跌停 42 | df_daily.loc[(df_daily['high'] == df_daily['low']) & (df_daily['open'] > df_daily['pre_close']), 'CLV'] = 1 43 | df_daily.loc[(df_daily['high'] == df_daily['low']) & (df_daily['open'] < df_daily['pre_close']), 'CLV'] = -1 44 | 45 | df_daily = datasource_utils.reset_index(df_daily) 46 | factors = df_daily['CLV'] 47 | logger.debug("一共加载%s~%s %d条 CLV 数据", start_date, end_date, len(factors)) 48 | 49 | return factors 50 | -------------------------------------------------------------------------------- /mlstock/factors/old/dividend_rate.py: -------------------------------------------------------------------------------- 1 | from mfm_learner.datasource import datasource_utils 2 | from mfm_learner.example.factor_utils import handle_finance_fill 3 | from mfm_learner.example.factors.factor import Factor 4 | 5 | 6 | class DividendRateFactor(Factor): 7 | 
""" 8 | DividendRat 9 | 股息率TTM=近12个月股息总额/当日总市值 10 | 11 | tushare样例数据: 12 | https://tushare.pro/document/2?doc_id=32 13 | - dv_ratio float 股息率 (%) 14 | - dv_ttm float 股息率(TTM)(%) 15 | -------------------------------------------- 16 | ts_code trade_date dv_ratio dv_ttm 17 | 0 600230.SH 20191231 3.4573 1.9361 18 | 1 600230.SH 20191230 3.4573 1.9361 19 | 2 600230.SH 20191227 3.4308 1.9212 20 | 3 600230.SH 20191226 3.3629 1.8832 21 | 4 600230.SH 20191225 3.5537 1.9900 22 | .. ... ... ... ... 23 | 482 600230.SH 20180108 0.2692 0.2692 24 | 483 600230.SH 20180105 0.2856 0.2856 25 | 484 600230.SH 20180104 0.2805 0.2805 26 | 485 600230.SH 20180103 0.2897 0.2897 27 | 486 600230.SH 20180102 0.3021 0.3021 28 | -------------------------------------------- 29 | 诡异之处,dv_ratio是股息率,dv_ttm是股息率TTM, 30 | TTM应该比直接的股息率要高,对吧? 31 | 我理解普通股息率应该是从年初到现在的分红/市值, 32 | 而TTM还包含了去年的分红呢,理应比普通的股息率要高, 33 | 可是,看tushare数据,恰恰是反的,困惑ing... 34 | 35 | TODO:目前,考虑还是直接用TTM数据了 36 | """ 37 | 38 | def __init__(self): 39 | super().__init__() 40 | 41 | def name(self): 42 | return "dividend_rate_ttm嗯,嗯。尴尬现" 43 | 44 | def cname(self): 45 | return "股息率" 46 | 47 | def calculate(self, stock_codes, start_date, end_date): 48 | df_basic = self.datasource.daily_basic(stock_codes, start_date, end_date) 49 | df_basic = datasource_utils.reset_index(df_basic) 50 | return df_basic['dv_ttm']/100 51 | 52 | -------------------------------------------------------------------------------- /mlstock/factors/old/ebitda.py: -------------------------------------------------------------------------------- 1 | """ 2 | https://baike.baidu.com/item/EBITDA/7810909 3 | 4 | 税息折旧及摊销前利润,简称EBITDA,是Earnings Before Interest, Taxes, Depreciation and Amortization的缩写, 5 | 即未计利息、税项、折旧及摊销前的利润。 6 | 7 | EBITDA受欢迎的最大原因之一是,EBITDA比营业利润显示更多的利润,公司可以通过吹捧EBITDA数据,把投资者在高额债务和巨大费用上面的注意力引开。 8 | 9 | EBITDA = EBIT【息税前利润】 - Taxation【税款】+ Depreciation & Amortization【折旧和摊销】 10 | - EBIT = 净利润+利息+所得税-公允价值变动收益-投资收益 11 | - 折旧和摊销 = 资产减值损失 12 | 13 | 
EBITDA is often compared with cash flow, because the gap between it and net income is exactly the two expense items that do not affect cash flow, i.e. depreciation and amortization.
14 | > Meaning: this thing only counts money that actually moves in the current period -- interest included -- and strips out the charges you only pretend to pay (amortization and depreciation)
15 | 
16 | After discussing with MJ, my understanding is:
17 | leave borrowing out of it, ignore taxes, and don't let depreciation/amortization muddy the water -- just look at the earning power (never mind how much was borrowed; look at the gross profit actually produced)
18 | after production and selling costs, of course, but before interest, tax and depreciation.
19 | ------------------------------------------------
20 | 
21 | I checked tushare; two places provide ebitda. One is
22 | pro.fina_indicator(ts_code='600000.SH',fields='ts_code,ann_date,end_date,ebit,ebitda')
23 | and the other is
24 | pro.income(ts_code='600000.SH',fields='ts_code,ann_date,end_date,ebit,ebitda')
25 | The income API returns real data; fina_indicator returns all None. Damn, tushare didn't clean this up properly.
26 | ------------------------------
27 | I also looked at JoinQuant: https://www.joinquant.com/help/api/help#factor_values:%E5%9F%BA%E7%A1%80%E5%9B%A0%E5%AD%90
28 | Its "basic factors" API has ebitda, complete and given per day -- great.
29 | Sadly it is too expensive: the API license is 6000+/year (tushare 200/year).
30 | Worse, tushare's income result and JoinQuant's basic-factor API give !!!different!!! numbers -- try it yourself if you don't believe it (JoinQuant can only be tested in its online lab).
31 | Whom do I trust?
32 | Short of reading the annual reports and computing "EBIT + depreciation + amortization = EBITDA" myself -- fine, if pushed hard enough, I'll do exactly that.
33 | ------------------------------
34 | For now, since tushare is all I can use, I go with the income API and ffill by trading day. Leaving a TODO: this indicator needs further work!
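The factor below feeds this income-report value through a TTM transform of cumulative (YTD) report figures. The arithmetic, per the convention quoted elsewhere in this repo (current YTD + last full year - same period last year), sketched with hypothetical numbers:

```python
# TTM from cumulative (YTD) quarterly report values, hypothetical figures:
# TTM = current-period YTD + last full year's total - same period last year (YTD).
def ttm(ytd_now: float, last_annual: float, ytd_same_period_last_year: float) -> float:
    return ytd_now + last_annual - ytd_same_period_last_year

# e.g. 2017H1 cumulative 60, FY2016 total 100, 2016H1 cumulative 45
# -> trailing twelve months = 60 + 100 - 45 = 115
print(ttm(60.0, 100.0, 45.0))
```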
35 | """ 36 | from mfm_learner.datasource import datasource_utils 37 | from mfm_learner.example import factor_utils 38 | from mfm_learner.example.factors.factor import Factor 39 | from mfm_learner.utils import utils 40 | 41 | 42 | class EBITDAFactor(Factor): 43 | """ 44 | """ 45 | 46 | def __init__(self): 47 | super().__init__() 48 | 49 | def name(self): 50 | return "ebitda_ttm" 51 | 52 | def calculate(self, stock_codes, start_date, end_date): 53 | start_date_1years_ago = utils.last_year(start_date, num=1) 54 | trade_dates = self.datasource.trade_cal(start_date, end_date) 55 | df_finance = self.datasource.income(stock_codes, start_date_1years_ago, end_date) 56 | 57 | # TODO 懒得重新下载fina_indicator,临时trick一下 58 | df_finance['end_date'] = df_finance['end_date'].apply(str) 59 | 60 | assert len(df_finance) > 0 61 | df = factor_utils.handle_finance_ttm(stock_codes, 62 | df_finance, 63 | trade_dates, 64 | col_name_value='ebitda', 65 | col_name_finance_date='end_date') 66 | 67 | df = datasource_utils.reset_index(df) 68 | return df['ebita_ttm'] 69 | -------------------------------------------------------------------------------- /mlstock/factors/old/ep.py: -------------------------------------------------------------------------------- 1 | """ 2 | 盈利收益率 3 | """ 4 | import logging 5 | import math 6 | 7 | import numpy as np 8 | 9 | from mfm_learner.datasource import datasource_utils 10 | from mfm_learner.example.factors.factor import Factor 11 | 12 | logger = logging.getLogger(__name__) 13 | 14 | """ 15 | 盈利收益率 EP(Earn/Price) = 盈利/价格 16 | 17 | 其实,就是1/PE(市盈率), 18 | 19 | 这里,就说说PE,因为EP就是他的倒数: 20 | 21 | PE = PRICE / EARNING PER SHARE,指股票的本益比,也称为“利润收益率”。 22 | 23 | 本益比是某种股票普通股每股市价与每股盈利的比率,所以它也称为股价收益比率或市价盈利比率。 24 | 25 | - [基本知识解读 -- PE, PB, ROE,盈利收益率](https://xueqiu.com/4522742712/61623733) 26 | 27 | """ 28 | 29 | class EPFactor(Factor): 30 | 31 | def __init__(self): 32 | super().__init__() 33 | 34 | def name(self): 35 | return "ep" 36 | 37 | def calculate(self, stock_codes, 
start_date, end_date, df_daily=None): 38 | df_basic = self.datasource.daily_basic(stock_codes, start_date, end_date) 39 | df_basic = datasource_utils.reset_index(df_basic) 40 | return 1 / df_basic['pe'] # pe是市赢率,盈利收益率 EP比是1/pe -------------------------------------------------------------------------------- /mlstock/factors/old/market_value.py: -------------------------------------------------------------------------------- 1 | """ 2 | 参考: 3 | https://zhuanlan.zhihu.com/p/161706770 4 | """ 5 | import logging 6 | 7 | import numpy as np 8 | 9 | from mfm_learner.datasource import datasource_utils 10 | from mfm_learner.example.factors.factor import Factor 11 | 12 | logger = logging.getLogger(__name__) 13 | 14 | """ 15 | # 规模因子 - 市值因子Market Value 16 | """ 17 | 18 | 19 | class MarketValueFactor(Factor): 20 | """ 21 | 市值因子LNAP,是公司股票市值的自然对数, 22 | """ 23 | 24 | def __init__(self): 25 | super().__init__() 26 | 27 | def name(self): 28 | return "mv" 29 | 30 | def cname(self): 31 | return "市值" 32 | 33 | def calculate(self, stock_codes, start_date, end_date): 34 | df_basic = self.datasource.daily_basic(stock_codes, start_date, end_date) 35 | df_basic = datasource_utils.reset_index(df_basic) 36 | df_basic['LNAP'] = np.log(df_basic['total_mv']) 37 | logger.debug("计算完%s~%s市值因子(LNAP),股票[%r], %d 条因子值", start_date, end_date, stock_codes, len(df_basic)) 38 | assert len(df_basic) > 0, df_basic 39 | return df_basic['LNAP'] 40 | 41 | 42 | class CirculationMarketValueFactor(Factor): 43 | """ 44 | 流动市值因子LNCAP,是公司股票流通市值的自然对数 45 | """ 46 | 47 | def __init__(self): 48 | super().__init__() 49 | 50 | def name(self): 51 | return "cmv" 52 | 53 | def cname(self): 54 | return "流通市值" 55 | 56 | def calculate(self, stock_codes, start_date, end_date): 57 | df_basic = self.datasource.daily_basic(stock_codes, start_date, end_date) 58 | df_basic = datasource_utils.reset_index(df_basic) 59 | df_basic['circ_mv'] = np.log(df_basic['circ_mv']) # 做个log处理,否则相差太大 60 | logger.debug("计算完%s~%s流通市值因子,股票[%r], %d 条因子值", 
start_date, end_date, stock_codes, len(df_basic)) 61 | assert len(df_basic) > 0, df_basic 62 | return df_basic['circ_mv'] 63 | -------------------------------------------------------------------------------- /mlstock/factors/old/momentum.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import numpy as np 4 | 5 | from mfm_learner.datasource import datasource_utils 6 | from mfm_learner.example.factors.factor import Factor 7 | from mfm_learner.utils import utils 8 | 9 | logger = logging.getLogger(__name__) 10 | period_window = 10 11 | 12 | """ 13 | 动量因子: 14 | 动量因子是指与股票的价格和交易量变化相关的因子,常见的动量因子:一个月动量、3个月动量等。 15 | 计算个股(或其他投资标的)过去N个时间窗口的收益回报: 16 | adj_close = (high + low + close)/3 17 | adj_return = adj_close_t - adj_close_{t-n} 18 | 来计算受盘中最高价和最低价的调整的调整收盘价动量,逻辑是,在日线的层面上收盘价表示市场主力资本对标的物的价值判断, 19 | 而最高价和最低价往往反应了市场投机者的情绪,同时合理考虑这样的多方情绪可以更好的衡量市场的动量变化。 20 | 21 | 参考: 22 | https://zhuanlan.zhihu.com/p/96888358 23 | https://zhuanlan.zhihu.com/p/379269953 24 | 25 | 说白了: 26 | 就是算当前,和,N天前的,收盘价close的差,也就是收益(**间隔Nt天的收益**) 27 | 但是,收盘价close,再优化一下,用(最高+最低+收盘)/3, 28 | 但是,波动值还是嫌太大,所以要取一下log。 29 | """ 30 | 31 | mapping = [ 32 | {'name': 'momentum_10d', 'cname':'10日动量', 'days': 10}, 33 | {'name': 'momentum_1m', 'cname':'1月动量','days': 20}, 34 | {'name': 'momentum_3m', 'cname':'3月动量','days': 60}, 35 | {'name': 'momentum_6m', 'cname':'6月动量','days': 120}, 36 | {'name': 'momentum_12m', 'cname':'12月动量','days': 252}, 37 | {'name': 'momentum_24m', 'cname':'24日动量','days': 480} 38 | ] 39 | 40 | 41 | class MomentumFactor(Factor): 42 | """ 43 | data['turnover_1m'] = data['turnover_rate'].rolling(window=20, min_periods=1).apply(func=np.nanmean) 44 | data['turnover_3m'] = data['turnover_rate'].rolling(window=60, min_periods=1).apply(func=np.nanmean) 45 | data['turnover_6m'] = data['turnover_rate'].rolling(window=120, min_periods=1).apply(func=np.nanmean) 46 | data['turnover_2y'] = data['turnover_rate'].rolling(window=480, 
min_periods=1).apply(func=np.nanmean) 47 | 48 | """ 49 | 50 | def __init__(self): 51 | super().__init__() 52 | 53 | def name(self): 54 | return [m['name'] for m in mapping] 55 | 56 | def cname(self): 57 | return [m['cname'] for m in mapping] 58 | 59 | def calculate(self, stock_codes, start_date, end_date, df_daily=None): 60 | """ 61 | 计算动量,动量,就是往前回溯period个周期,然后算收益, 62 | 由于股票的价格是一个随经济或标的本身经营情况有变化的变量。那么如果变量有指数增长趋势(exponential growth), 63 | 比如 GDP,股票价格,期货价格,则一般取对数,使得 lnGDP 变为线性增长趋势(linear growth), 64 | 为了防止有的价格高低,所以用log方法,更接近,参考:https://zhuanlan.zhihu.com/p/96888358 65 | 66 | 本来的动量用的是减法,这里,换成了除法,就是用来解决文中提到的规模不同的问题 67 | :param period_window: 68 | :param df: 69 | :return: 70 | """ 71 | results = [] 72 | for m in mapping: 73 | start_date_2years_ago = utils.last_year(start_date, num=2) 74 | df_daily = datasource_utils.load_daily_data(self.datasource, stock_codes, start_date_2years_ago, end_date) 75 | adj_close = (df_daily['close'] + df_daily['high'] + df_daily['low']) / 3 76 | df_daily[m['name']] = np.log(adj_close / adj_close.shift(m['days'])) # shift(1) 往后移,就变成上个月的了 77 | df_daily = datasource_utils.reset_index(df_daily) # 设置日期+code为索引 78 | df_factor = df_daily[m['name']] 79 | df_factor.dropna(inplace=True) 80 | results.append(df_factor) 81 | return results 82 | -------------------------------------------------------------------------------- /mlstock/factors/old/peg.py: -------------------------------------------------------------------------------- 1 | """ 2 | # 估值因子 - PEG 3 | 4 | 参考: 5 | - https://zhuanlan.zhihu.com/p/29144485 6 | - https://www.joinquant.com/help/api/help#factor_values:%E6%88%90%E9%95%BF%E5%9B%A0%E5%AD%90 7 | - https://www.joinquant.com/view/community/detail/087af3a4e27c600ed855cb0c1d0fdfed 8 | 在时间序列上,PEG因子的暴露度相对其他因子较为稳定,在近一年表现出较强的趋势性 9 | 市盈率相对盈利增长比率PEG = PE / (归母公司净利润(TTM)增长率 * 100) # 如果 PE 或 增长率为负,则为 nan 10 | 对应tushare:netprofit_yoy float Y 归属母公司股东的净利润同比增长率(%) 11 | 12 | 财务数据"归属母公司股东的净利润同比增长率(%)"的取得:https://tushare.pro/document/2?doc_id=79 13 | 输入参数 
14 | ts_code str Y TS股票代码,e.g. 600001.SH/000001.SZ 15 | ann_date str N 公告日期 16 | start_date str N 报告期开始日期 17 | end_date str N 报告期结束日期 18 | period str N 报告期(每个季度最后一天的日期,比如20171231表示年报) 19 | 输出参数 20 | ts_code str Y TS代码 21 | ann_date str Y 公告日期 22 | end_date str Y 报告期 23 | 这里的逻辑是,某一天,只能按照公告日期来当做"真正"的日期,毕竟,比如8.30号公告6.30号之前的信息, 24 | 假如今天是7.1日,所以只能用最近的一次,也就是最后一次8.30号的, 25 | 再往前倒,比如6.15号还发布过一次,那6.15号之后到6.30号之间的,就用6.15号的数据, 26 | 所以,报告期end_date其实没啥用,因为它是滞后的,外界是无法提前知道的。 27 | 这样处理也简单粗暴,不知道业界是怎么处理的?我感觉应该是很普遍的一个问题。 28 | 29 | shift(-1) 30 | current next 31 | 2021.1.1 32 | --------------------- 33 | 2021.1.1 2021.3.1 34 | 2021.3.1 2021.6.30 35 | 2021.6.30 2021.9.30 36 | 2021.9.30 37 | --------------------- 38 | 2021.1.1之前的,不应该用2021.1.1去填充,但是,没办法,无法获得再之前的数据,只好用它了 39 | 2021.9.30之后的,都用2021.9.30来填充 40 | 41 | 季报应该是累计数,为了可比性,所以应该做一个处理 42 | 研报上的指标用的都是TTM 43 | PE-TTM 也称之为 滚动市盈率 44 | TTM英文本意是Trailing Twelve Months,也就是过去12个月,非常好理解 45 | 比如当前2017年半年报刚发布完,那么过去12个月的净利润就是: 46 | 2017Q2 (2017年2季报累计值) + 2016Q4 (2016年4季度累计值) - 2016Q2 (2016年2季度累计值) 47 | """ 48 | import logging 49 | import math 50 | 51 | import numpy as np 52 | import pandas as pd 53 | 54 | from mfm_learner.datasource import datasource_utils 55 | from mfm_learner.example.factors.factor import Factor 56 | 57 | logger = logging.getLogger(__name__) 58 | 59 | 60 | class PEGFactor(Factor): 61 | 62 | def __init__(self): 63 | super().__init__() 64 | 65 | def name(self): 66 | return "peg" 67 | 68 | def calculate(self, stock_codes, start_date, end_date, df_daily=None): 69 | """ 70 | # 计算股票的PEG值 71 | # 输入:context(见API);stock_list为list类型,表示股票池 72 | # 输出:df_PEG为dataframe: index为股票代码,data为相应的PEG值 73 | """ 74 | df = self.load_stock_data(stock_codes, start_date, end_date) 75 | df['PEG'] = df['pe'] / df['netprofit_yoy'] 76 | df = datasource_utils.reset_index(df) 77 | return df['PEG'] 78 | 79 | def load_stock_data(self, stock_codes, start_date, end_date): 80 | df_merge = None 81 | for stock_code in stock_codes: 82 | # 基本数据,包含:PE 83 | df_basic = self.datasource.daily_basic(stock_code=stock_code, start_date=start_date, end_date=end_date) 84 | 85 | # 财务数据,包含:归母公司净利润(TTM)增长率 86 | df_finance = self.datasource.fina_indicator(stock_code=stock_code, start_date=start_date, end_date=end_date) 87 | 88 | df_finance = df_finance.sort_index(level='datetime', ascending=True) # 从早到晚排序 89 | 90 | df_finance['datetime_next'] = df_finance['datetime'].shift(-1) 91 | df_basic['netprofit_yoy'] = np.nan 92 | logger.debug("股票[%s] %s~%s 有%d条财务数据,但有%d条基础数据", 93 | stock_code, start_date, end_date, len(df_finance), len(df_basic)) 94 | 95 | for index, finance in df_finance.iterrows(): 96 | 97 | next_date = finance['datetime_next'] 98 | current_date = finance['datetime'] 99 | netprofit_yoy = finance['netprofit_yoy'] 100 | 101 | # 第一个区间,只能"2021.1.1之前的,不应该用2021.1.1去填充,但是,没办法,无法获得再之前的数据,只好用它了" 102 | if index == 0: 103 | # logger.debug("开始 -> %s , 过滤条数 %d", current_date, 104 | # len(df_basic.loc[(df_basic.datetime <= current_date)])) 105 | df_basic.loc[df_basic.datetime <= current_date, 'netprofit_yoy'] = netprofit_yoy 106 | 107 | # bugfix:nan需要单独判断,nan其实是一个float类型的值,即type(nan)==float 108 | if next_date is None or (type(next_date) == float and math.isnan(next_date)): 109 | df_basic.loc[df_basic.datetime > current_date, 'netprofit_yoy'] = netprofit_yoy 110 | # logger.debug("%s -> 结束 , 过滤条数 %d", current_date, 111 | # len(df_basic.loc[(df_basic.datetime > current_date)])) 112 | else: 113 | df_basic.loc[(df_basic.datetime > current_date) & 114 | (df_basic.datetime <= next_date), 'netprofit_yoy'] = netprofit_yoy 115 | # logger.debug("%s -> %s , 过滤条数 %d", current_date, next_date, len( 116 | # df_basic.loc[(df_basic.datetime > current_date) & (df_basic.datetime <= next_date)])) 117 | 118 | if df_merge is None: 119 | df_merge = df_basic 120 | else: 121 | df_merge = pd.concat([df_merge, df_basic]) # DataFrame.append已废弃,改用pd.concat 122 | logger.debug("一共加载%s~%s %d条 PEG 数据", 
start_date, end_date, len(df_merge)) 123 | 124 | return df_merge 125 | -------------------------------------------------------------------------------- /mlstock/factors/psy.py: -------------------------------------------------------------------------------- 1 | from mlstock.factors.factor import SimpleFactor 2 | 3 | N = [1, 3, 6, 12] 4 | 5 | 6 | class PSY(SimpleFactor): 7 | """ 8 | PSY: 心理线指标、大众指标,研究投资者心理波动的情绪指标。 9 | PSY = N天内上涨天数 / N * 100,N一般取12,最大不超过24,周线最长不超过26 10 | PSY大小反映市场是倾向于买方、还是卖方。 11 | """ 12 | 13 | # 英文名 14 | @property 15 | def name(self): 16 | return ['PSY_{}w'.format(i) for i in N] 17 | 18 | # 中文名 19 | @property 20 | def cname(self): 21 | return ['{}周心理线(PSY)'.format(i) for i in N] 22 | 23 | def calculate(self, stock_data): 24 | df_daily = stock_data.df_daily 25 | 26 | df_daily = df_daily.sort_values(['ts_code', 'trade_date']) # 默认是升序排列 27 | 28 | # 先根据收益率,转变成涨1,跌0(转成0/1便于求和),该列与周期无关,只需算一次 29 | df_daily['winloss'] = (df_daily.pct_chg > 0).astype(int) 30 | 31 | for i, n in enumerate(N): 32 | # 按股票来分组,然后每只股票按照N周滚动计算PSY 33 | # 完事后要做reset_index(level=0,drop=True),原因是groupby+rolling后得到的df的index是(ts_code,原id) 34 | # 所以要drop掉ts_code,即level=0,这样才可以直接赋值回去(按照id直接对回原dataframe) 35 | # 全是细节啊,我真是服了pandas库了 36 | df = df_daily.groupby('ts_code')['winloss'].rolling(window=5 * n). 
\ 37 | apply(lambda x: x.sum() / (5 * n)).reset_index(level=0, drop=True) 38 | df_daily[self.name[i]] = df 39 | return df_daily[self.name] 40 | 41 | 42 | # python -m mlstock.factors.psy 43 | if __name__ == '__main__': 44 | from mlstock.data import data_loader 45 | from mlstock.data.datasource import DataSource 46 | from mlstock.data.stock_info import StocksInfo 47 | from mlstock.utils import utils 48 | import pandas 49 | 50 | pandas.set_option('display.max_rows', 1000000) 51 | utils.init_logger(file=False) 52 | 53 | start_date = "20180101" 54 | end_date = "20200101" 55 | stocks = ['000401.SZ', '600000.SH', '002357.SZ'] 56 | datasource = DataSource() 57 | stocks_info = StocksInfo(stocks, start_date, end_date) 58 | stock_data = data_loader.load(datasource, stocks, start_date, end_date) 59 | 60 | df = PSY(datasource, stocks_info).calculate(stock_data) 61 | print(df) 62 | print("NA缺失比例", df.isna().sum() / len(df)) 63 | -------------------------------------------------------------------------------- /mlstock/factors/returns.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import numpy as np 4 | import pandas as pd 5 | 6 | from mlstock.factors.factor import SimpleFactor 7 | 8 | logger = logging.getLogger(__name__) 9 | 10 | """ 11 | 个股最近N周的收益率,N=1, 3, 6, 12 12 | 注意,是累计收益 13 | """ 14 | 15 | N = [1, 3, 6, 12] 16 | 17 | 18 | class Return(SimpleFactor): 19 | 20 | @property 21 | def name(self): 22 | return ['return_{}w'.format(i) for i in N] 23 | 24 | @property 25 | def cname(self): 26 | return ['{}周累计收益'.format(i) for i in N] 27 | 28 | def _calculate_return_N(self, x, period): 29 | """计算累计收益,所以用 **prod** 乘法""" 30 | return (1 + x).rolling(window=period).apply(np.prod, raw=True) - 1 31 | 32 | def calculate(self, stock_data): 33 | df_weekly = stock_data.df_weekly 34 | """ 35 | 计算N周累计收益,就是往前回溯period个周期 36 | """ 37 | results = [] 38 | for period in N: 39 | df_return = df_weekly.groupby('ts_code').pct_chg.apply(lambda x: self._calculate_return_N(x, period)) 40 | results.append(df_return) 41 | df = pd.concat(results, axis=1) # 按照列拼接(axis=1) 42 | return df 43 | 44 | # python -m mlstock.factors.returns 45 | if __name__ == '__main__': 46 | from mlstock.data import data_loader 47 | from mlstock.data.datasource import DataSource 48 | from mlstock.data.stock_info import StocksInfo 49 | from mlstock.utils import utils 50 | import pandas 51 | pandas.set_option('display.max_rows', 1000000) 52 | utils.init_logger(file=False) 53 | 54 | start_date = "20180101" 55 | end_date = "20200101" 56 | stocks = ['000401.SZ'] 57 | datasource = DataSource() 58 | stocks_info = StocksInfo(stocks, start_date, end_date) 59 | stock_data = data_loader.load(datasource, stocks, start_date, end_date) 60 | df = Return(datasource, stocks_info).calculate(stock_data) 61 | # df = df[df.trade_date>start_date] 62 | print(df) 63 | print("NA缺失比例", df.isna().sum() / len(df)) 64 | -------------------------------------------------------------------------------- /mlstock/factors/rsi.py: -------------------------------------------------------------------------------- 1 | import talib 2 | 3 | from mlstock.factors.factor import SimpleFactor 4 | 5 | PERIOD = 20 6 | 7 | 8 | class RSI(SimpleFactor): 9 | """ 10 | 相对强弱指标RSI是用以计测市场供需关系和买卖力道的方法及指标。 11 | 计算公式: 12 | N日RSI = A/(A+B)×100 13 | A=N日内收盘涨幅之和 14 | B=N日内收盘跌幅之和(取正值) 15 | 由上面算式可知RSI指标的技术含义,即以向上的力量与向下的力量进行比较, 16 | 若向上的力量较大,则计算出来的指标上升;若向下的力量较大,则指标下降,由此测算出市场走势的强弱。 17 | """ 18 | 19 | # 英文名 20 | @property 21 | def name(self): 22 | return "RSI" 23 | 24 | # 中文名 25 | @property 26 | def cname(self): 27 | return "相对强弱指标RSI" 28 | 29 | def calculate(self, stock_data): 30 | df_weekly = stock_data.df_weekly 31 | # 按股票分组计算,避免RSI的滚动窗口跨越不同的股票 32 | return df_weekly.groupby('ts_code').close.transform(lambda x: self.rsi(x, period=PERIOD)) 33 | 34 | # 20周期RSI 35 | def rsi(self, x, period=PERIOD): 36 | return talib.RSI(x, timeperiod=period) 37 | -------------------------------------------------------------------------------- /mlstock/factors/stake_holder.py: 
-------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from mlstock import const 4 | from mlstock.factors.factor import ComplexMergeFactor 5 | from mlstock.factors.mixin.fill_mixin import FillMixin 6 | from mlstock.utils import utils 7 | 8 | logger = logging.getLogger(__name__) 9 | 10 | 11 | class StakeHolder(ComplexMergeFactor, FillMixin): 12 | """ 13 | 股东人数变化率 14 | """ 15 | 16 | @property 17 | def name(self): 18 | return 'stake_holder_chg' 19 | 20 | @property 21 | def cname(self): 22 | return '股东人数变化率' 23 | 24 | def calculate(self, stock_data): 25 | df_weekly = stock_data.df_weekly 26 | 27 | # 因为需要回溯,所以用了同一回溯标准(类MACD指标) 28 | start_date = utils.last_week(self.stocks_info.start_date, const.RESERVED_PERIODS) 29 | 30 | df_stake_holder = self.datasource.stock_holder_number(self.stocks_info.stocks, 31 | start_date, 32 | self.stocks_info.end_date) 33 | df_stake_holder = df_stake_holder.sort_values(by='ann_date') 34 | df = self.fill(df_weekly, df_stake_holder, 'holder_num') 35 | # 按股票分组后再做环比,避免shift跨越到另一只股票 36 | df = df.sort_values(['ts_code', 'trade_date']) 37 | df[self.name] = df.groupby('ts_code').holder_num.pct_change() 38 | return df[['ts_code', 'trade_date', self.name]] 39 | 40 | 41 | # python -m mlstock.factors.stake_holder 42 | if __name__ == '__main__': 43 | from mlstock.data import data_loader 44 | from mlstock.data.datasource import DataSource 45 | from mlstock.data.stock_info import StocksInfo 46 | import pandas 47 | 48 | pandas.set_option('display.max_rows', 1000000) 49 | 50 | utils.init_logger(file=False) 51 | 52 | datasource = DataSource() 53 | start_date = "20180101" 54 | end_date = "20200101" 55 | stocks = ['000010.SZ', '000014.SZ'] 56 | stocks_info = StocksInfo(stocks, start_date, end_date) 57 | 58 | df_stocks = data_loader.load(datasource, stocks, start_date, end_date) 59 | 60 | df = StakeHolder(datasource, stocks_info).calculate(df_stocks) 61 | print(df) 62 | -------------------------------------------------------------------------------- /mlstock/factors/std.py: 
-------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from mlstock.factors.factor import SimpleFactor 4 | import pandas as pd 5 | 6 | logger = logging.getLogger(__name__) 7 | 8 | """ 9 | 波动率因子: 10 | https://zhuanlan.zhihu.com/p/30158144 11 | 波动率因子有很多,我这里用的是std,即标准差, 12 | 而算标准差,又要设置时间窗口 13 | """ 14 | 15 | mapping = [ 16 | {'name': 'std_1w', 'cname': '1周波动率', 'period': 1}, 17 | {'name': 'std_3w', 'cname': '3周波动率', 'period': 3}, 18 | {'name': 'std_6w', 'cname': '6周波动率', 'period': 6}, 19 | {'name': 'std_12w', 'cname': '12周波动率', 'period': 12} 20 | ] 21 | 22 | 23 | class Std(SimpleFactor): 24 | 25 | @property 26 | def name(self): 27 | return [m['name'] for m in mapping] 28 | 29 | @property 30 | def cname(self): 31 | return [m['cname'] for m in mapping] 32 | 33 | def calculate(self, stock_data): 34 | df_weekly = stock_data.df_weekly 35 | """ 36 | 计算波动率:往前回溯period个周期,求收益率的标准差 37 | """ 38 | df_weekly = df_weekly.sort_values(['ts_code', 'trade_date']) 39 | results = [] 40 | for m in mapping: 41 | # x 5,是近似于周的频度(5个交易日是一个周), 42 | # rolling是截止到今天为止之前的做rolling,不会有未来函数的问题 43 | # 按股票分组,避免rolling窗口跨越不同的股票 44 | df_std = df_weekly.groupby('ts_code').pct_chg.rolling(window=m['period'] * 5).std().reset_index(level=0, drop=True) 45 | results.append(df_std) 46 | df = pd.concat(results, axis=1) # 按照列拼接(axis=1) 47 | return df 48 | -------------------------------------------------------------------------------- /mlstock/factors/turnover.py: -------------------------------------------------------------------------------- 1 | """ 2 | 换手率因子: 3 | 4 | - https://zhuanlan.zhihu.com/p/37232850 5 | - https://crm.htsc.com.cn/doc/2017/10750101/6678c51c-a298-41ba-beb9-610ab793cf05.pdf 华泰~换手率类因子 6 | - https://uqer.datayes.com/v3/community/share/5afd527db3a1a1012acad84c 7 | 8 | 换手率因子是一类很重要的情绪类因子,反映了一支股票在一段时间内的流动性强弱,和持有者平均持有时间的长短。 9 | 一般来说换手率因子的大小和股票的收益为负向关系,即换手率越高的股票预期收益越低,换手率越低的股票预期收益越高。 10 | 11 | 四个构造出来的换手率类的因子(都是与股票的日均换手率相关): 12 | - turnover_Nm:个股最近N周的日均换手率,表现了个股N周内的流动性水平。N=1,3,6 13 | - turnover_bias_Nm:个股最近N周的日均换手率除以个股两年内日均换手率再减去1,代表了个股N周内流动性的乖离率。N=1,3,6 14 | - 
turnover_std_Nm:个股最近N周的日换手率的标准差,表现了个股N周内流动性水平的波动幅度。N=1,3,6 15 | - turnover_bias_std_Nm:个股最近N周的日换手率的标准差除以个股两年内日换手率的标准差再减去1,代表了个股N周内流动性的波动幅度的乖离率。N=1,3,6 16 | 17 | 这是4个因子哈,都是跟换手率相关的,他们之间具备共线性,是相关的,要用的时候,挑一个好的,或者,做因子正交化后再用。 18 | 19 | 市值中性化:换手率类因子与市值类因子存在一定程度的负相关性,我们对换手率因子首先进行市值中性化处理,从而消除了大市值对于换手率因子表现的影响。 20 | 21 | 知乎文章的结论:进行市值中性化处理之后,因子表现有明显提高。在本文的回测方法下,turnover_1m和turnover_std_1m因子表现较好。 22 | """ 23 | 24 | import logging 25 | 26 | from mlstock.factors.factor import SimpleFactor 27 | 28 | logger = logging.getLogger(__name__) 29 | 30 | N = [1, 3, 6, 12] 31 | 32 | 33 | class Turnover(SimpleFactor): 34 | """ 35 | 个股最近N个月内 日均换手率 剔除停牌 涨跌停的交易 36 | """ 37 | 38 | @property 39 | def name(self): 40 | return [f'turnover_{i}w' for i in N] + \ 41 | [f'turnover_bias_{i}w' for i in N] + \ 42 | [f'turnover_std_{i}w' for i in N] + \ 43 | [f'turnover_bias_std_{i}w' for i in N] 44 | 45 | @property 46 | def cname(self): 47 | return [f'{i}周换手率' for i in N] + \ 48 | [f'{i}周换手率偏差' for i in N] + \ 49 | [f'{i}周换手率标准差' for i in N] + \ 50 | [f'{i}周换手率标准差偏差' for i in N] 51 | 52 | def calculate(self, stock_data): 53 | df_daily_basic = stock_data.df_daily_basic 54 | 55 | """ 56 | # https://tushare.pro/document/2?doc_id=32 57 | code datetime turnover_rate_f circ_mv 58 | 0 600230.SH 20180726 4.5734 1.115326e+06 59 | 1 600237.SH 20180726 1.7703 2.336490e+05 60 | """ 61 | df_daily_basic = df_daily_basic[['ts_code', 'trade_date', 'turnover_rate_f', 'circ_mv']] 62 | df_daily_basic.columns = ['ts_code', 'trade_date', 'turnover_rate', 'circ_mv'] 63 | df_daily_basic = df_daily_basic.sort_values(['ts_code', 'trade_date']) 64 | 65 | datas = self.calculate_turnover_rate(df_daily_basic) 66 | 67 | return datas 68 | 69 | """ 70 | nanmean:忽略nan,不参与mean,例: 71 | >>> a = np.array([[1, np.nan], [3, 4]]) 72 | >>> np.nanmean(a) 73 | 2.6666666666666665 74 | >>> np.nanmean(a, axis=0) 75 | array([2., 4.]) 76 | >>> np.nanmean(a, axis=1) 77 | array([1., 3.5]) # may vary 78 | rolling: 79 | 
https://blog.csdn.net/maymay_/article/details/80241627 80 | # 定义因子计算逻辑 81 | - turnover_Nm:个股最近N周的日均换手率,表现了个股N周内的流动性水平。N=1,3,6 82 | - turnover_bias_Nm:个股最近N周的日均换手率除以个股两年内日均换手率再减去1,代表了个股N周内流动性的乖离率。N=1,3,6 83 | - turnover_std_Nm:个股最近N周的日换手率的标准差,表现了个股N周内流动性水平的波动幅度。N=1,3,6 84 | - turnover_bias_std_Nm:个股最近N周的日换手率的标准差除以个股两年内日换手率的标准差再减去1,代表了个股N周内流动性的波动幅度的乖离率。N=1,3,6 85 | """ 86 | 87 | def calculate_turnover_rate(self, df): 88 | """ 89 | 注意,df是日频数据,不是周频的,特别指明一下,所以要window=5*i(N周) 90 | 英文命名上有歧义,还是按照中文解释为主 91 | :param df: 92 | :return: 93 | """ 94 | 95 | # 24周内的均值和标准差值,这个不是指标,是用于计算指标用的中间值 96 | df['turnover_24w'] = df.groupby('ts_code')['turnover_rate'].rolling(window=5 * 24, 97 | min_periods=1).mean().values 98 | df['turnover_std_24w'] = df.groupby('ts_code')['turnover_rate'].rolling(window=5 * 24, 99 | min_periods=1).std().values 100 | 101 | # 1.N周的日换手率均值 102 | for i in N: 103 | df[f'turnover_{i}w'] = df.groupby('ts_code')['turnover_rate'].rolling(window=5 * i, 104 | min_periods=1).mean().values 105 | # 2.N周的日均换手率 / 两年内日均换手率 - 1,表示N周流动性的乖离率 106 | 107 | for i in N: 108 | df[f'turnover_bias_{i}w'] = df.groupby('ts_code').apply( 109 | lambda df_stock: df_stock[f'turnover_{i}w'] / df_stock.turnover_24w - 1).values 110 | 111 | # 3.N周的日换手率的标准差 112 | for i in N: 113 | df[f'turnover_std_{i}w'] = df.groupby('ts_code')['turnover_rate'].rolling(window=5 * i, 114 | min_periods=2).std().values 115 | 116 | # 4.N周的日换手率的标准差 / 两年内日换手率的标准差 - 1,表示N周波动幅度的乖离率 117 | for i in N: 118 | df[f'turnover_bias_std_{i}w'] = df.groupby('ts_code').apply( 119 | lambda df_stock: df_stock[f'turnover_std_{i}w'] / df_stock.turnover_std_24w - 1).values 120 | 121 | return df[self.name] 122 | 123 | 124 | # python -m mlstock.factors.turnover 125 | if __name__ == '__main__': 126 | from mlstock.data import data_loader 127 | from mlstock.data.datasource import DataSource 128 | from mlstock.data.stock_info import StocksInfo 129 | from mlstock.utils import utils 130 | import pandas 131 | 132 | 
pandas.set_option('display.max_rows', 1000000) 133 | utils.init_logger(file=False) 134 | 135 | start_date = "20180101" 136 | end_date = "20200101" 137 | stocks = ['000401.SZ','600000.SH', '002357.SZ', '000404.SZ', '600230.SH'] 138 | datasource = DataSource() 139 | stocks_info = StocksInfo(stocks, start_date, end_date) 140 | stock_data = data_loader.load(datasource, stocks, start_date, end_date) 141 | # 把基础信息merge到周频数据中 142 | df_stock_basic = datasource.stock_basic(stocks) 143 | df_weekly = stock_data.df_weekly.merge(df_stock_basic, on='ts_code', how='left') 144 | 145 | tr = Turnover(datasource, stocks_info) 146 | df_factors = tr.calculate(stock_data) 147 | df = tr.merge(df_weekly, df_factors) 148 | df = df[df.trade_date > start_date] 149 | print(df) 150 | print("NA缺失比例", df.isna().sum() / len(df)) -------------------------------------------------------------------------------- /mlstock/factors/turnover_return.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from mlstock.factors.factor import ComplexMergeFactor 4 | 5 | 6 | logger = logging.getLogger(__name__) 7 | 8 | N = [1, 3, 6, 12] 9 | 10 | 11 | class TurnoverReturn(ComplexMergeFactor): 12 | """ 13 | 14 | 华泰金工原始:`动量反转 wgt_return_Nm:个股最近N个月内用每日换手率乘以每日收益率求算术平均值,N=1,3,6,12` 15 | 我们这里就是N周, 16 | 注意,这里是"每日",所以我们需要加载每日,<--------- **每日** 17 | 18 | daily_basic是每日的数据,我们目前是每周的数据, 19 | 神仔的做法是: 20 | df['wgt_return_1m'] = df['close'] * df['pct_chg'] 21 | df = df.groupby(idx_f).wgt_return_1m.agg('mean') 22 | 23 | 我觉得没必要乘以close收盘价啊。 24 | """ 25 | 26 | @property 27 | def name(self): 28 | return [f'turnover_return_{i}w' for i in N] 29 | 30 | @property 31 | def cname(self): 32 | return [f'{i}周内每日换手率乘以每日收益率的算术平均值' for i in N] 33 | 34 | def calculate(self, stock_data): 35 | df_daily = stock_data.df_daily 36 | df_daily_basic = stock_data.df_daily_basic 37 | 38 | """ https://tushare.pro/document/2?doc_id=32 39 | code datetime turnover_rate_f 40 | 600230.SH 20180726 4.5734 
41 | """ 42 | df_daily_basic = df_daily_basic[['trade_date', 'ts_code', 'turnover_rate_f']] 43 | df_daily = df_daily[['trade_date', 'ts_code', 'pct_chg']] 44 | df = df_daily.merge(df_daily_basic, on=['ts_code', 'trade_date']) 45 | 46 | # `动量反转 wgt_return_Nm:个股最近N个月内用每日换手率乘以每日收益率求算术平均值,N=1,3,6,12` 47 | df['turnover_return'] = df['turnover_rate_f'] * df['pct_chg'] 48 | 49 | """ 50 | 2022.8.16 一个低级但是查了半天才找到的bug, 51 | 数据不做下面的排序,默认是按照trade_date+ts_code的顺序排序的, 52 | 会导致下面的赋值出现问题: 53 | df[f'turnover_return_{i}w'] = df.groupby('ts_code').turnover_return.rolling(i * 5).mean().values 54 | 改为 55 | df[f'turnover_return_{i}w'] = df.groupby('ts_code').turnover_return.rolling(i * 5).mean().reset_index(level=0, drop=True) 56 | 就没有问题,原因是后者是一个Series,有索引,可以和原有的序列对上, 57 | 而前者只是一个numpy array,与之前的 trade_date+ts_code 的顺序对齐的话,当然乱掉了 58 | 所以,只要这里做了排序,用numpy array还是Series,都无所谓嘞 59 | 这个bug查了我1天,唉,低级错误害死人 60 | """ 61 | df = df.sort_values(['ts_code', 'trade_date']) 62 | 63 | 64 | for i in N: 65 | # x5,按照每周交易日5天计算的 66 | # 我靠!隐藏的很深的一个bug,找各种写法会导致中间莫名其妙的出现nan,而且计算的也不对,改为后者就ok了 67 | # df[f'turnover_return_{i}w'] = df.groupby('ts_code').turnover_return.rolling(i * 5).mean().values 68 | df[f'turnover_return_{i}w'] = df.groupby('ts_code').turnover_return.rolling(i * 5).mean().reset_index(level=0, drop=True) 69 | 70 | # 返回ts_code和trade_date是为了和周频数据做join 71 | return df[['trade_date', 'ts_code'] + self.name] 72 | 73 | """ 74 | nanmean:忽略nan,不参与mean,例: 75 | >>> a = np.array([[1, np.nan], [3, 4]]) 76 | >>> np.nanmean(a) 77 | 2.6666666666666665 78 | >>> np.nanmean(a, axis=0) 79 | array([2., 4.]) 80 | >>> np.nanmean(a, axis=1) 81 | array([1., 3.5]) # may vary 82 | rolling: 83 | https://blog.csdn.net/maymay_/article/details/80241627 84 | # 定义因子计算逻辑 85 | - turnover_Nm:个股最近N个月的日均换手率,表现了个股N个月内的流动性水平。N=1,3,6 86 | - turnover_bias_Nm:个股最近N个月的日均换手率除以个股两年内日均换手率再减去1,代表了个股N个月内流动性的乖离率。N=1,3,6 87 | - turnover_std_Nm:个股最近N个月的日换手率的标准差,表现了个股N个月内流动性水平的波动幅度。N=1,3,6 88 | - 
turnover_bias_std_Nm:个股最近N个月的日换手率的标准差除以个股两年内日换手率的标准差再减去1,代表了个股N个月内流动性的波动幅度的乖离率。N=1,3,6 89 | """ 90 | # python -m mlstock.factors.turnover_return 91 | if __name__ == '__main__': 92 | from mlstock.data import data_loader 93 | from mlstock.data.datasource import DataSource 94 | from mlstock.data.stock_info import StocksInfo 95 | from mlstock.utils import utils 96 | import pandas 97 | pandas.set_option('display.max_rows', 1000000) 98 | utils.init_logger(file=False) 99 | 100 | start_date = "20080101" 101 | end_date = "20200101" 102 | stocks = ['000001.SZ'] 103 | datasource = DataSource() 104 | stocks_info = StocksInfo(stocks, start_date, end_date) 105 | stock_data = data_loader.load(datasource, stocks, start_date, end_date) 106 | # 把基础信息merge到周频数据中 107 | df_stock_basic = datasource.stock_basic(stocks) 108 | df_weekly = stock_data.df_weekly.merge(df_stock_basic, on='ts_code', how='left') 109 | 110 | tr = TurnoverReturn(datasource, stocks_info) 111 | df_factors = tr.calculate(stock_data) 112 | df = tr.merge(df_weekly,df_factors) 113 | df = df[df.trade_date>start_date] 114 | print(df) 115 | print("NA缺失比例", df.isna().sum() / len(df)) 116 | # print(df[(df.ts_code=='000401.SZ')&(df.trade_date=='20180629')]) 117 | -------------------------------------------------------------------------------- /mlstock/ml/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/ml/README.md -------------------------------------------------------------------------------- /mlstock/ml/__init__.py: -------------------------------------------------------------------------------- 1 | from mlstock.ml.data import factor_service 2 | 3 | import logging 4 | 5 | from mlstock.utils import utils 6 | 7 | logger = logging.getLogger(__name__) 8 | 9 | 10 | def load_and_filter_data(data_path, start_date, end_date): 11 | # 加载数据 12 | utils.check_file_path(data_path) 13 | 14 | 
df_data = factor_service.load_from_file(data_path) 15 | original_size = len(df_data) 16 | original_start_date = df_data.trade_date.min() 17 | original_end_date = df_data.trade_date.max() 18 | df_data = df_data[df_data.trade_date >= start_date] 19 | df_data = df_data[df_data.trade_date <= end_date] 20 | logger.debug("数据%s~%s %d行,过滤后=> %s~%s %d行", 21 | original_start_date, original_end_date, original_size, 22 | start_date, end_date, len(df_data)) 23 | return df_data 24 | -------------------------------------------------------------------------------- /mlstock/ml/backtest.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import time 4 | 5 | import joblib 6 | import matplotlib.pyplot as plt 7 | import matplotlib.ticker as ticker 8 | import numpy as np 9 | from pandas import DataFrame 10 | 11 | from mlstock.const import TOP_30 12 | from mlstock.data.datasource import DataSource 13 | from mlstock.ml import load_and_filter_data 14 | from mlstock.ml.backtests import backtest_simple, backtest_backtrader, backtest_deliberate 15 | from mlstock.ml.data import factor_conf 16 | from mlstock.ml.backtests.metrics import metrics 17 | from mlstock.utils import utils 18 | 19 | logger = logging.getLogger(__name__) 20 | 21 | 22 | def main(type, data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names): 23 | """ 24 | 回测 25 | :param data_path: 因子数据文件的路径 26 | :param start_date: 回测的开始日期 27 | :param end_date: 回测结束日期 28 | :param model_pct_path: 回测用的预测收益率的模型路径 29 | :param model_winloss_path: 回测用的,预测收益率的模型路径 30 | :param factor_names: 因子们的名称,用于过滤预测的X 31 | :return: 32 | """ 33 | if type == 'simple': 34 | return backtest_simple.main(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names) 35 | 36 | if type == 'deliberate': 37 | return backtest_deliberate.main(data_path, start_date, end_date, model_pct_path, model_winloss_path, 38 | factor_names) 39 | 40 | if type == 
'backtrader': 41 | return backtest_backtrader.main(data_path, start_date, end_date, model_pct_path, model_winloss_path, 42 | factor_names) 43 | 44 | raise ValueError(f"无效的backtest类型:{type}") 45 | 46 | 47 | """ 48 | python -m mlstock.ml.backtest \ 49 | -t deliberate \ 50 | -s 20190101 -e 20220901 \ 51 | -mp model/pct_ridge_20220902112320.model \ 52 | -mw model/winloss_xgboost_20220902112813.model \ 53 | -d data/factor_20080101_20220901_2954_1299032__industry_neutral_20220902112049.csv 54 | 55 | python -m mlstock.ml.backtest \ 56 | -t simple \ 57 | -s 20080101 -e 20190101 \ 58 | -mp model/pct_ridge_20220902112320.model \ 59 | -mw model/winloss_xgboost_20220902112813.model \ 60 | -d data/factor_20080101_20220901_2954_1299032__industry_neutral_20220902112049.csv 61 | 62 | """ 63 | if __name__ == '__main__': 64 | utils.init_logger(file=True) 65 | parser = argparse.ArgumentParser() 66 | 67 | # 数据相关的 68 | parser.add_argument('-t', '--type', type=str, default="simple", help="回测方式:simple|backtrader|deliberate") 69 | parser.add_argument('-s', '--start_date', type=str, default="20190101", help="开始日期") 70 | parser.add_argument('-e', '--end_date', type=str, default="20220901", help="结束日期") 71 | parser.add_argument('-d', '--data', type=str, default=None, help="数据文件") 72 | parser.add_argument('-mp', '--model_pct', type=str, default=None, help="收益率模型") 73 | parser.add_argument('-mw', '--model_winloss', type=str, default=None, help="涨跌模型") 74 | 75 | args = parser.parse_args() 76 | 77 | factor_names = factor_conf.get_factor_names() 78 | 79 | start_time = time.time() 80 | main( 81 | args.type, 82 | args.data, 83 | args.start_date, 84 | args.end_date, 85 | args.model_pct, 86 | args.model_winloss, 87 | factor_names) 88 | utils.time_elapse(start_time, "整个回测过程") 89 | -------------------------------------------------------------------------------- /mlstock/ml/backtests/__init__.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import time 3 | 4 | 
import joblib 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | from matplotlib import ticker 8 | from pandas import DataFrame 9 | 10 | from mlstock.const import TOP_30 11 | from mlstock.ml import load_and_filter_data 12 | from mlstock.utils import utils 13 | 14 | logger = logging.getLogger(__name__) 15 | 16 | 17 | def select_top_n(df, df_limit, top_n): 18 | """ 19 | :param df: 20 | :param df_limit: 21 | :return: 22 | """ 23 | 24 | # 先把所有预测为跌的全部过滤掉 25 | original_size = len(df) 26 | df = df[df.winloss_pred == 1] 27 | logger.debug("根据涨跌模型结果,过滤数据 %d=>%d", original_size, len(df)) 28 | 29 | # 剔除那些涨停的股票,参考:https://stackoverflow.com/questions/32652718/pandas-find-rows-which-dont-exist-in-another-dataframe-by-multiple-columns 30 | df_limit = df_limit[['trade_date', 'ts_code', 'limit']] 31 | original_size = len(df) 32 | df = df.merge(df_limit, on=['ts_code', 'trade_date'], how='left', indicator=True) 33 | df = df[df._merge == 'left_only'] 34 | logger.debug("根据涨跌停信息,过滤数据 %d=>%d", original_size, len(df)) 35 | 36 | # 先按照日期 + 下周预测收益,按照降序排 37 | df = df.sort_values(['trade_date', 'pct_pred'], ascending=False) 38 | 39 | # 按照日期分组,每组里面取前30,然后算收益率,作为组合资产的收益率 40 | # 注意!这里是下期收益"next_pct_chg"的均值,实际上是提前了一期(这个细节可以留意一下) 41 | # .reset_index(level=0, drop=True) 42 | df_selected_stocks = df.groupby('trade_date').apply(lambda grp: grp.nlargest(top_n, 'pct_pred')) 43 | logger.debug("按照预测收益率挑选出%d条股票信息", len(df_selected_stocks)) 44 | 45 | return df_selected_stocks.reset_index(drop=True) 46 | 47 | 48 | def predict(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names): 49 | """ 50 | 回测 51 | :param data_path: 因子数据文件的路径 52 | :param start_date: 回测的开始日期 53 | :param end_date: 回测结束日期 54 | :param model_pct_path: 回测用的预测收益率的模型路径 55 | :param model_winloss_path: 回测用的,预测收益率的模型路径 56 | :param factor_names: 因子们的名称,用于过滤预测的X 57 | :return: 58 | """ 59 | # 从csv因子数据文件中加载数据 60 | df_data = load_and_filter_data(data_path, start_date, end_date) 61 | 62 | # 加载模型;如果参数未提供,为None 63 | 
# 查看数据文件和模型文件路径是否正确 64 | if model_pct_path: utils.check_file_path(model_pct_path) 65 | if model_winloss_path: utils.check_file_path(model_winloss_path) 66 | model_pct = joblib.load(model_pct_path) if model_pct_path else None 67 | model_winloss = joblib.load(model_winloss_path) if model_winloss_path else None 68 | 69 | if model_pct: 70 | start_time = time.time() 71 | X = df_data[factor_names] 72 | df_data['pct_pred'] = model_pct.predict(X) 73 | utils.time_elapse(start_time, f"预测下期收益: {len(df_data)}行 ") 74 | 75 | if model_winloss: 76 | start_time = time.time() 77 | X = df_data[factor_names] 78 | df_data['winloss_pred'] = model_winloss.predict(X) 79 | utils.time_elapse(start_time, f"预测下期涨跌: {len(df_data)}行 ") 80 | return df_data 81 | 82 | 83 | # 画A股,参考:https://deepinout.com/matplotlib/matplotlib-axis/matplotlib-the-hidden-scale-mode-shows-the-three-y-axes.html 84 | # 创建第三条Y轴,把第三条Y轴的其它三边隐藏起来,只留下右边显示 85 | def make_patch_spines_invisible(ax): 86 | ax.set_frame_on(True) 87 | ax.patch.set_visible(False) 88 | for sp in ax.spines.values(): 89 | sp.set_visible(False) 90 | 91 | 92 | def plot(df,save_path): 93 | """ 94 | 1. 每期实际收益 95 | 2. 每期实际累计收益 96 | 3. 基准累计收益率 97 | 4. 
上证指数 98 | :param df: 99 | :return: 100 | """ 101 | plt.rcParams['font.sans-serif'] = ['SimHei'] 102 | plt.rcParams['axes.unicode_minus'] = False 103 | 104 | x = df.trade_date.values 105 | y1 = df.next_pct_chg.values 106 | y2 = df.cumulative_pct_chg.values 107 | y3 = df.cumulative_pct_chg_baseline.values 108 | y4 = df.index_close.values 109 | 110 | color_y1 = '#2A9CAD' 111 | color_y2 = "#FAB03D" 112 | color_y3 = "#D3D3D3" 113 | color_y4 = "#008000" 114 | 115 | title = '资产组合收益率及累积收益率' 116 | 117 | label_x = '周' 118 | label_y1 = '资产组合周收益率' 119 | label_y2 = '资产组合累积收益率' 120 | label_y3 = '基准累积收益率' 121 | label_y4 = '上证指数' 122 | 123 | fig, ax1 = plt.subplots(figsize=(10, 6), dpi=300) 124 | plt.xticks(rotation=60) 125 | ax2 = ax1.twinx() # 做镜像处理 126 | 127 | ax1.bar(x=x, height=y1, label=label_y1, color=color_y1, alpha=0.7) 128 | ax1.set_xlabel(label_x) # 设置x轴标题 129 | ax1.set_ylabel(label_y1) # 设置Y1轴标题 130 | ax1.grid(False) 131 | # 12周间隔,3个月相当于,为了让X轴稀疏一些,太密了,如果不做的话 132 | ax1.xaxis.set_major_locator(ticker.MultipleLocator(12)) 133 | 134 | ax2.plot(x, y2, color=color_y2, ms=10, label=label_y2) 135 | ax2.plot(x, y3, color=color_y3, ms=10, label=label_y3) 136 | ax2.set_ylabel(label_y2 + "/" + label_y3) # 设置Y2轴标题 137 | ax2.grid(False) 138 | ax2.xaxis.set_major_locator(ticker.MultipleLocator(12)) 139 | 140 | ax3 = ax1.twinx() 141 | ax3.spines["right"].set_position(("axes", 1.2)) 142 | make_patch_spines_invisible(ax3) 143 | ax3.spines["right"].set_visible(True) 144 | p3, = ax3.plot(x, y4, color=color_y4, ms=10, label=label_y4) 145 | ax3.set_ylabel(label_y4) 146 | ax3.grid(False) 147 | ax3.xaxis.set_major_locator(ticker.MultipleLocator(12)) 148 | ax3.yaxis.label.set_color(p3.get_color()) 149 | 150 | # 添加标签 151 | ax1.legend(loc='upper left') 152 | ax2.legend(loc='upper right') 153 | ax3.legend(loc='lower right') 154 | 155 | plt.title(title) # 添加标题 156 | plt.grid(axis="y") # 背景网格 157 | 158 | plt.savefig(save_path) 159 | plt.show() 160 | 
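The select_top_n logic above chains three operations: drop predicted losers, anti-join the limit-up list via merge(..., indicator=True), then keep the top-N predicted returns per trade date. A minimal sketch on made-up toy data (codes A~D, dates, and all prediction values are hypothetical):

```python
import pandas as pd

# Toy predictions: two trade dates, four stocks each (hypothetical values).
df = pd.DataFrame({
    'trade_date': ['20220401'] * 4 + ['20220408'] * 4,
    'ts_code': ['A', 'B', 'C', 'D'] * 2,
    'winloss_pred': [1, 1, 1, 0] * 2,
    'pct_pred': [0.05, 0.03, 0.08, 0.10, 0.02, 0.07, 0.01, 0.04],
})
# Stocks that hit the limit-up on a given date and cannot be bought.
df_limit = pd.DataFrame({'trade_date': ['20220401'], 'ts_code': ['C'], 'limit': ['U']})

# 1) keep only predicted winners
df = df[df.winloss_pred == 1]
# 2) anti-join: drop rows that also appear in df_limit
df = df.merge(df_limit, on=['ts_code', 'trade_date'], how='left', indicator=True)
df = df[df._merge == 'left_only']
# 3) per trade date, keep the top-N rows by predicted return
top_n = 2
df_selected = (df.groupby('trade_date', group_keys=False)
                 .apply(lambda g: g.nlargest(top_n, 'pct_pred'))
                 .reset_index(drop=True))
print(df_selected[['trade_date', 'ts_code']].values.tolist())
```

The `indicator=True` trick labels each row `'both'` or `'left_only'` after the left merge, which is what turns an ordinary merge into an anti-join; `nlargest` already sorts within each group, so the explicit date+prediction pre-sort in the original mainly fixes the output ordering.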
-------------------------------------------------------------------------------- /mlstock/ml/backtests/backtest_backtrader.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import os 4 | import time 5 | 6 | import backtrader as bt # 引入backtrader框架 7 | import backtrader.analyzers as bta # 添加分析函数 8 | import joblib 9 | from backtrader.feeds import PandasData 10 | from backtrader_plotting import Bokeh 11 | from pandas import DataFrame 12 | import numpy as np 13 | from mlstock.const import TOP_30 14 | from mlstock.data.datasource import DataSource 15 | from mlstock.ml import load_and_filter_data 16 | from mlstock.ml.backtests import select_top_n, predict 17 | from mlstock.ml.backtests.ml_strategy import MachineLearningStrategy 18 | from mlstock.ml.data import factor_conf 19 | from mlstock.utils import utils, df_utils 20 | from mlstock.utils.utils import AStockPlotScheme 21 | 22 | logger = logging.getLogger(__name__) 23 | 24 | # TODO 2022.9.3 suspend backtrader backtest version 25 | """ 26 | backtrader版本的回测没有完成,也不打算继续下去了,代码仅作保留。 27 | 原因是, 28 | 1、数据初始化得一个股票一个股票的加载(for: cerebro.adddata(dataname=df_stock,name=ts_code) ), 29 | 我已经有整个的df_daily了,还要挨个循环,真烦。 30 | 2、各项指标计算,我只能受限于backtrader analyzer,虽说也不错,但是总是不如自己写的顺手 31 | 3、整个过程其实也不是特别复杂,自己实现一遍,以后也方便调整 32 | 33 | 以后闲的没事的时候,可以把这个backtrader版本完成。 34 | """ 35 | 36 | 37 | def load_data_to_cerebro(cerebro, start_date, end_date, df): 38 | df = df.rename(columns={'vol': 'volume', 'ts_code': 'name', 'trade_date': 'datetime'}) # 列名准从backtrader的命名规范 39 | df['openinterest'] = 0 # backtrader需要这列,所以给他补上 40 | df['datetime'] = df_utils.to_datetime(df['datetime'], date_format="%Y%m%d") 41 | df = df.set_index('datetime') 42 | df = df.sort_index() 43 | d_start_date = utils.str2date(start_date) # 开始日期 44 | d_end_date = utils.str2date(end_date) # 结束日期 45 | data = PandasData(dataname=df, fromdate=d_start_date, todate=d_end_date, plot=True) 46 | cerebro.adddata(data) 47 | 
logger.debug("初始化股票%s~%s数据到脑波cerebro:%d 条", start_date, end_date, len(df))


def main(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names):
    """
    datetime    open  high   low    close  volume openi..
    2016-06-24  0.16  0.002  0.085  0.078  0.173  0.214
    2016-06-30  0.03  0.012  0.037  0.027  0.010  0.077
    """
    cerebro = bt.Cerebro()  # 初始化cerebro
    datasource = DataSource()

    df_data = predict(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names)
    # 列名是ts_code(原来误写成ts_codes),取去重后的股票代码列表来加载日线
    df_daily = datasource.daily(df_data.ts_code.unique().tolist(), start_date, end_date)
    df_limit = datasource.limit_list()
    df_selected_stocks = select_top_n(df_data, df_limit, TOP_30)

    # 加载股票数据到脑波
    load_data_to_cerebro(cerebro, start_date, end_date, df_daily)

    start_cash = 500000.0
    cerebro.broker.setcash(start_cash)
    cerebro.broker.setcommission(commission=0.002)
    cerebro.addstrategy(MachineLearningStrategy, df_data=df_selected_stocks)

    # 添加分析对象
    cerebro.addanalyzer(bta.SharpeRatio, _name="sharpe", timeframe=bt.TimeFrame.Days)  # 夏普指数
    cerebro.addanalyzer(bt.analyzers.DrawDown, _name='DW')  # 回撤分析
    cerebro.addanalyzer(bt.analyzers.Returns, _name='returns')
    cerebro.addanalyzer(bt.analyzers.Calmar, _name='calmar')  # 卡玛比率 - Calmar:超额收益➗最大回撤
    cerebro.addanalyzer(bt.analyzers.PeriodStats, _name='period_stats')
    cerebro.addanalyzer(bt.analyzers.AnnualReturn, _name='annual')
    cerebro.addanalyzer(bt.analyzers.PyFolio, _name='PyFolio')  # 加入PyFolio分析者,这个是为了做quantstats分析用
    cerebro.addanalyzer(bt.analyzers.TradeAnalyzer, _name='trade')

    # 打印
    logger.debug('回测期间:%r ~ %r , 初始资金: %r', start_date, end_date, start_cash)
    # 运行回测
    results = cerebro.run(optreturn=True)

    if not os.path.exists("debug/"): os.makedirs("debug/")
    file_name = f"debug/{utils.nowtime()}_{start_date}_{end_date}.html"

    b = Bokeh(filename=file_name, style='bar', plot_mode='single',
output_mode='save', scheme=AStockPlotScheme())
    cerebro.plot(b, style='candlestick', iplot=False)


"""
# 测试用(参数需与本文件main的签名对应:数据文件、起止日期、两个模型路径)
python -m mlstock.ml.backtests.backtest_backtrader \
    -s 20190101 -e 20220901 \
    -mp model/pct_ridge_20220828190251.model \
    -mw model/winloss_xgboost_20220828190259.model \
    -d data/processed_industry_neutral_20080101_20220901_20220828152110.csv
"""
if __name__ == '__main__':
    utils.init_logger()

    start_time = time.time()
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--start_date', type=str, default="20190101", help="开始日期")
    parser.add_argument('-e', '--end_date', type=str, default="20220901", help="结束日期")
    parser.add_argument('-d', '--data', type=str, default=None, help="数据文件")
    parser.add_argument('-mp', '--model_pct', type=str, default=None, help="收益率模型")
    parser.add_argument('-mw', '--model_winloss', type=str, default=None, help="涨跌模型")
    args = parser.parse_args()

    factor_names = factor_conf.get_factor_names()

    main(args.data,
         args.start_date,
         args.end_date,
         args.model_pct,
         args.model_winloss,
         factor_names)
    logger.debug("共耗时: %.0f 秒", time.time() - start_time)
--------------------------------------------------------------------------------
/mlstock/ml/backtests/backtest_deliberate.py:
--------------------------------------------------------------------------------
import logging

from mlstock.const import TOP_30
from mlstock.data.datasource import DataSource
from mlstock.ml.backtests import predict, select_top_n, plot, timing
from mlstock.ml.backtests.broker import Broker
from mlstock.ml.backtests.metrics import metrics

logger = logging.getLogger(__name__)


def load_datas(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names):
    datasource = DataSource()

    df_data = predict(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names)
    df_limit = datasource.limit_list()
    df_index = datasource.index_weekly('000001.SH', start_date, end_date)
    ts_codes = df_data.ts_code.unique().tolist()
    df_daily = datasource.daily(ts_codes, start_date, end_date, adjust='')
    df_daily = df_daily.sort_values(['trade_date', 'ts_code'])
    df_calendar = datasource.trade_cal(start_date, end_date)
    df_baseline = df_data[['trade_date', 'next_pct_chg_baseline']].drop_duplicates()
    return df_data, df_daily, df_index, df_baseline, df_limit, df_calendar


def run_broker(df_data, df_daily, df_index, df_baseline, df_limit, df_calendar,
               start_date, end_date, factor_names,
               top_n, df_timing):
    df_selected_stocks = select_top_n(df_data, df_limit, top_n)

    broker = Broker(df_selected_stocks, df_daily, df_calendar, conservative=False, df_timing=df_timing)
    broker.execute()
    df_portfolio = broker.df_values
    # sort_values不是原地操作,需要接住返回值
    df_portfolio = df_portfolio.sort_values('trade_date')

    # 只筛出来周频的市值来
    df_portfolio = df_baseline.merge(df_portfolio, how='left', on='trade_date')

    # 拼接上指数
    df_index = df_index[['trade_date',
'close']] 41 | df_index = df_index.rename(columns={'close': 'index_close'}) 42 | df_portfolio = df_portfolio.merge(df_index, how='left', on='trade_date') 43 | 44 | # 准备pct、next_pct_chg、和cumulative_xxxx 45 | df_portfolio = df_portfolio.sort_values('trade_date') 46 | df_portfolio['pct_chg'] = df_portfolio.total_value.pct_change() 47 | df_portfolio['next_pct_chg'] = df_portfolio.pct_chg.shift(-1) 48 | df_portfolio[['cumulative_pct_chg', 'cumulative_pct_chg_baseline']] = \ 49 | df_portfolio[['next_pct_chg', 'next_pct_chg_baseline']].apply(lambda x: (x + 1).cumprod() - 1) 50 | 51 | df_portfolio = df_portfolio[~df_portfolio.cumulative_pct_chg.isna()] 52 | 53 | save_path = 'data/plot_{}_{}_top{}.jpg'.format(start_date, end_date, top_n) 54 | plot(df_portfolio, save_path) 55 | 56 | # 计算各项指标 57 | logger.info("佣金总额:%.2f", broker.total_commission) 58 | metrics(df_portfolio) 59 | 60 | 61 | def main(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names): 62 | """ 63 | 先预测出所有的下周收益率、下周涨跌 => df_data, 64 | 然后选出每周的top30 => df_selected_stocks, 65 | 然后使用Broker,来遍历每天的交易,每周进行调仓,并,记录下每周的股票+现价合计价值 => df_portfolio 66 | 最后计算出next_pct_chg、cumulative_pct_chg,并画出plot,计算metrics 67 | """ 68 | df_data, df_daily, df_index, df_baseline, df_limit, df_calendar = \ 69 | load_datas(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names) 70 | 71 | """ 72 | # df_timing = timing.ma(df_index) 73 | # 加入择时模块,效果非常不好:,之前总收益还30%呢 74 | 投资时长 : 3年42月182周 75 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 累计收益 : -11.9% 76 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 累计基准收益 : 239.3% 77 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 年化收益率 : -1.3% 78 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 年化超额收益 : nan% 79 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 周波动率 : 1.3% 80 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 夏普比率 : 452.9% 81 | 2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 
P11460: 最大回撤  : -24.0%
    2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: 信息比率   : -10.6%
    2022-09-14 15:09:08,113 - DEBUG - metrics.py:131 P11460: PK基准胜率 : 42.1%
    """
    # 去掉择时了
    df_timing = None

    run_broker(df_data, df_daily, df_index, df_baseline, df_limit, df_calendar, start_date, end_date, factor_names,
               TOP_30, df_timing)
--------------------------------------------------------------------------------
/mlstock/ml/backtests/backtest_simple.py:
--------------------------------------------------------------------------------
import argparse
import logging

from mlstock.const import TOP_30
from mlstock.data.datasource import DataSource
from mlstock.ml.backtests import plot, predict, select_top_n
from mlstock.ml.backtests.metrics import metrics
from mlstock.ml.data import factor_conf
from mlstock.utils import utils

logger = logging.getLogger(__name__)


def main(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names):
    """
    回测
    :param data_path: 因子数据文件的路径
    :param start_date: 回测的开始日期
    :param end_date: 回测的结束日期
    :param model_pct_path: 回测用的,预测收益率的模型路径
    :param model_winloss_path: 回测用的,预测涨跌的模型路径
    :param factor_names: 因子们的名称,用于过滤预测的X
    :return:
    """

    datasource = DataSource()

    df_data = predict(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names)
    df_limit = datasource.limit_list()

    df_selected_stocks = select_top_n(df_data, df_limit, TOP_30)
    df_selected_stocks = df_selected_stocks.reset_index(drop=True)

    # 组合的收益率情况
    df_portfolio = df_selected_stocks.groupby('trade_date')[['next_pct_chg', 'next_pct_chg_baseline']].mean().reset_index()

    # 拼接上指数
    df_index = datasource.index_weekly('000001.SH', start_date, end_date)
    df_index = df_index[['trade_date', 'close']]
    df_index = df_index.rename(columns={'close':
'index_close'})
    df_portfolio = df_portfolio.merge(df_index, how='left', on='trade_date')
    df_portfolio.columns = ['trade_date', 'next_pct_chg', 'next_pct_chg_baseline', 'index_close']
    df_portfolio[['cumulative_pct_chg', 'cumulative_pct_chg_baseline']] = \
        df_portfolio[['next_pct_chg', 'next_pct_chg_baseline']].apply(lambda x: (x + 1).cumprod() - 1)

    # sort_values不是原地操作,需要接住返回值
    df_portfolio = df_portfolio.sort_values('trade_date')

    # 画出回测图
    plot(df_portfolio, f"data/plot_simple_{start_date}_{end_date}.jpg")

    # 计算各项指标
    metrics(df_portfolio)


"""
python -m mlstock.ml.backtest \
    -s 20190101 -e 20220901 \
    -mp model/pct_ridge_20220828190251.model \
    -mw model/winloss_xgboost_20220828190259.model \
    -d data/processed_industry_neutral_20080101_20220901_20220828152110.csv
"""
if __name__ == '__main__':
    utils.init_logger(file=True)
    parser = argparse.ArgumentParser()

    # 数据相关的
    parser.add_argument('-s', '--start_date', type=str, default="20190101", help="开始日期")
    parser.add_argument('-e', '--end_date', type=str, default="20220901", help="结束日期")
    parser.add_argument('-d', '--data', type=str, default=None, help="数据文件")
    parser.add_argument('-mp', '--model_pct', type=str, default=None, help="收益率模型")
    parser.add_argument('-mw', '--model_winloss', type=str, default=None, help="涨跌模型")

    args = parser.parse_args()

    factor_names = factor_conf.get_factor_names()

    main(args.data,
         args.start_date,
         args.end_date,
         args.model_pct,
         args.model_winloss,
         factor_names)
--------------------------------------------------------------------------------
/mlstock/ml/backtests/metrics.py:
--------------------------------------------------------------------------------
from datetime import datetime
import logging
import numpy as np
from empyrical import max_drawdown

from mlstock.const import RISK_FREE_ANNUALLY_RETRUN
from mlstock.utils import utils

logger = logging.getLogger(__name__)


def scope(df):
    start_date = datetime.strptime(df.trade_date.min(), "%Y%m%d")
    end_date = datetime.strptime(df.trade_date.max(), "%Y%m%d")
    years = end_date.year - start_date.year
    months = years * 12 + end_date.month - start_date.month
    weeks, _ = divmod((end_date - start_date).days, 7)
    return f"{years}年{months}月{weeks}周"


def annually_profit(df):
    """
    年化收益率
    A股每年约250个交易日、50个交易周,
    年化收益率 = (1 + 累计收益率) ^ (50 / 总周数) - 1
    """
    # 累计收益
    cumulative_return = df['cumulative_pct_chg'].iloc[-1] + 1
    total_weeks = len(df)
    return np.power(cumulative_return, 50 / total_weeks) - 1


def volatility(df):
    """波动率"""
    return df['next_pct_chg'].std()


def sharp_ratio(df):
    """
    夏普比率 = (收益均值 - 无风险收益率) / 收益标准差
    无风险收益率,在我国一般取十年期国债收益率;这里把年化无风险收益折算到周(除以50)
    """
    return (df['next_pct_chg'].mean() - RISK_FREE_ANNUALLY_RETRUN / 50) / df['next_pct_chg'].std()


def max_drawback(df):
    """最大回撤"""
    return max_drawdown(df.next_pct_chg)


def annually_active_return(df):
    """年化主动收益率"""
    cumulative_active_return = df['cumulative_active_pct_chg'].iloc[-1] + 1
    total_weeks = len(df)
    return np.power(cumulative_active_return, 50 / total_weeks) - 1


def active_return_max_drawback(df):
    """年化主动收益最大回撤"""
    pass


def annually_track_error(df):
    """年化跟踪误差"""


def information_ratio(df):
    """
    信息比率IR
    IR = IC的多周期均值 / IC的标准差,代表因子获取稳定Alpha的能力。当IR大于0.5时因子稳定获取超额收益能力较强。
    - https://www.zhihu.com/question/342944058
    - https://zhuanlan.zhihu.com/p/351462926
    讲人话:
就是主动收益的均值 / 主动收益的标准差
    """
    return df.active_pct_chg.mean() / df.active_pct_chg.std()


def win_rate(df):
    """胜率"""
    return (df['active_pct_chg'] > 0).sum() / len(df)


def metrics(df):
    """
    :param df:
        df[
            'trade_date',
            'next_pct_chg',                 # 当期收益率
            'next_pct_chg_baseline',        # 当期基准收益率
            'cumulative_pct_chg',           # 当期累计收益率
            'cumulative_pct_chg_baseline'   # 当期基准累计收益率
        ]
    :return:
    """
    if df is not None:
        df.to_csv("data/df_portfolio.csv", index=False)
    else:
        import pandas as pd
        df = pd.read_csv("data/df_portfolio.csv", header=0)
        df['trade_date'] = df['trade_date'].astype(str)

    assert 'next_pct_chg' in df.columns, "缺少列:next_pct_chg"
    assert 'next_pct_chg_baseline' in df.columns, '缺少列:next_pct_chg_baseline'
    assert 'cumulative_pct_chg' in df.columns, '缺少列:cumulative_pct_chg'
    assert 'cumulative_pct_chg_baseline' in df.columns, '缺少列:cumulative_pct_chg_baseline'

    # 每期超额收益率
    df['active_pct_chg'] = df['next_pct_chg'] - df['next_pct_chg_baseline']
    # 每期累计超额收益率
    df['cumulative_active_pct_chg'] = df['cumulative_pct_chg'] - df['cumulative_pct_chg_baseline']

    result = {}

    result['投资时长'] = scope(df)
    result['累计收益'] = df['cumulative_pct_chg'].iloc[-1]
    result['累计基准收益'] = df['cumulative_pct_chg_baseline'].iloc[-1]
    result['年化收益率'] = annually_profit(df)
    result['年化超额收益'] = annually_active_return(df)
    result['周波动率'] = volatility(df)
    result['夏普比率'] = sharp_ratio(df)
    result['最大回撤'] = max_drawback(df)
    result['信息比率'] = information_ratio(df)
    result['PK基准胜率'] = win_rate(df)

    logger.info("投资详细指标:")
    for k, v in result.items():
        if isinstance(v, np.floating) or isinstance(v, float):
            v = "{:.1f}%".format(v * 100)
        logger.debug("\t%s\t : %s", k, v)
    return result


# python -m mlstock.ml.backtests.metrics
if __name__ == '__main__':
    utils.init_logger(file=False)
    metrics(None)
--------------------------------------------------------------------------------
/mlstock/ml/backtests/ml_strategy.py:
--------------------------------------------------------------------------------
import logging
import math
from abc import abstractmethod
import backtrader as bt  # 引入backtrader框架
import numpy as np
from pandas import DataFrame

from mlstock.const import TOP_30
from mlstock.utils import utils

logger = logging.getLogger(__name__)


class MachineLearningStrategy(bt.Strategy):

    def __init__(self, df_selected_stocks):
        self.df_selected_stocks = df_selected_stocks
        self.weekly_dates = df_selected_stocks.trade_date
        self.rebalance_rates = []  # next()中会往里追加调仓率,需要先初始化

    # 记录交易执行情况(可省略,默认不输出结果)
    def notify_order(self, order):
        # logger.debug('订单状态:%r', order.Status[order.status])
        # print(order)

        # 如果order为submitted/accepted,返回空
        if order.status in [order.Submitted, order.Accepted]:
            # logger.debug('订单状态:%r', order.Status[order.status])
            return

        # 如果order为buy/sell executed,报告价格结果
        if order.status in [order.Completed]:
            if order.isbuy():
                logger.debug('成功买入: 股票[%s],价格[%.2f],成本[%.2f],手续费[%.2f]',
                             order.data._name,
                             order.executed.price,
                             order.executed.value,
                             order.executed.comm)

                self.buyprice = order.executed.price
                self.buycomm = order.executed.comm

            else:
                logger.debug('成功卖出: 股票[%s],价格[%.2f],成本[%.2f],手续费[%.2f]',
                             order.data._name,
                             order.executed.price,
                             order.executed.value,
                             order.executed.comm)

            self.bar_executed = len(self)

        # 如果指令取消/交易失败, 报告结果
        elif order.status in [order.Canceled, order.Margin, order.Rejected]:
            """
            Order.Created:订单已被创建;
            Order.Submitted:订单已被传递给经纪商 Broker;
            Order.Accepted:订单已被经纪商接收;
            Order.Partial:订单已被部分成交;
            Order.Complete:订单已成交;
Order.Rejected:订单已被经纪商拒绝;
            Order.Margin:执行该订单需要追加保证金,并且先前接受的订单已从系统中删除;
            Order.Cancelled (or Order.Canceled):确认订单已经被撤销;
            Order.Expired:订单已到期,其已经从系统中删除。
            """
            logger.debug('交易失败,股票[%s]订单状态:%r', order.data._name, order.Status[order.status])

        self.order = None

    # 这个是一只股票的一个完整交易的生命周期:开仓,持有,卖出
    def notify_trade(self, trade):
        if trade.isclosed:
            open_date = utils.date2str(bt.num2date(trade.dtopen))
            close_date = utils.date2str(bt.num2date(trade.dtclose))  # 注意是dtclose,不是dtopen
            logger.debug('策略收益:股票[%s], 毛收益 [%.2f], 净收益 [%.2f],交易开始日期[%s]~[%s]',
                         trade.data._name, trade.pnl, trade.pnlcomm,
                         open_date, close_date)

    def sell_out(self, stock_code):
        # 根据名字获得对应那只股票的数据
        stock_data = self.getdatabyname(stock_code)

        # size = self.getsizing(stock_data,isbuy=False)
        # self.sell(data=stock_data,exectype=bt.Order.Limit,size=size)
        size = self.getposition(stock_data, self.broker).size
        self.close(data=stock_data, exectype=bt.Order.Limit)

        logger.debug('[%s] 平仓股票 %s : 卖出%r股', self.current_date, stock_data._name, size)

    def next(self):

        # 如果不是周调仓日(周五),就忽略;trade_date是'YYYYMMDD'字符串,先统一成字符串再比较
        current_date = utils.date2str(self.datas[0].datetime.datetime(0))
        if current_date not in self.weekly_dates.values: return
        self.current_date = current_date

        logger.debug("第%d个交易日:%r ", len(self.data), current_date)

        # 选择本周的topN的股票(只取股票代码列表)
        selected_stocks = self.df_selected_stocks[self.df_selected_stocks.trade_date == current_date].ts_code.tolist()

        logger.debug("此次选中的股票为:%r", ",".join(selected_stocks))

        # 以往买入的标的,本次不在标的中,则先平仓
        # "常规下单函数主要有 3 个:买入 buy() 、卖出 sell()、平仓 close() "
        to_sell_stocks = set(self.trade_recorder.get_stocks()) - set(selected_stocks)

        # 计算调仓率
        if len(self.trade_recorder.get_stocks()) > 0:
            rebalance_rate = len(to_sell_stocks) / len(self.trade_recorder.get_stocks())
self.rebalance_rates.append(rebalance_rate) 112 | 113 | # 1. 清仓未在选择列表的股票 114 | logger.debug("[%s] 卖出股票:%r", self.current_date, to_sell_stocks) 115 | for sell_stock in to_sell_stocks: 116 | self.sell_out(sell_stock) 117 | 118 | logger.debug("[%s] 卖出%d只股票,剩余%d只持仓", 119 | self.current_date, len(to_sell_stocks), 120 | len(self.trade_recorder.get_stocks())) 121 | 122 | # 每只股票买入资金百分比,预留2%的资金以应付佣金和计算误差 123 | buy_percentage = (1 - 0.02) / len(selected_stocks) 124 | 125 | # 得到可以用来购买的金额 126 | buy_amount_per_stock = buy_percentage * self.broker.getcash() 127 | 128 | # 2. 买入选择的股票 129 | for buy_stock in selected_stocks: 130 | self._buy_in(buy_stock, buy_amount_per_stock) 131 | 132 | def _buy_in(self, stock_code, buy_amount): 133 | # 防止股票不在数据集中 134 | if stock_code not in self.getdatanames(): 135 | logger.warning("[%s] 股票[%s]不在数据集中", self.current_date, stock_code) 136 | return 137 | 138 | # 如果选中的股票在当前的持仓中,就忽略 139 | if stock_code in self.trade_recorder.get_stocks(): 140 | logger.debug("[%s] %s 在持仓中,不动", self.current_date, stock_code) 141 | return 142 | 143 | # 根据名字获得对应那只股票的数据 144 | stock_data = self.getdatabyname(stock_code) 145 | 146 | # 买,用当天的开盘价,假设的场景是,你决定一早就买了,就按当天的开盘价买 147 | open_price = stock_data.open[0] 148 | 149 | # TODO:按次日开盘价计算下单量,下单量是100(手)的整数倍 ???次日价格,还是,本日价格? 
        size = math.ceil(buy_amount / open_price)
        logger.debug("[%s] 购入股票[%s 股价%.2f] %d股,金额:%.2f", self.current_date, stock_code, open_price, size,
                     buy_amount)
        self.buy(data=stock_data, size=size, price=open_price, exectype=bt.Order.Limit)
--------------------------------------------------------------------------------
/mlstock/ml/backtests/timing.py:
--------------------------------------------------------------------------------
import numpy as np
import talib


def macd(df):
    # talib.MACD 返回 (macd, signal, hist) 三条序列
    macd_line, signal_line, hist = talib.MACD(df.close)
    return macd_line, signal_line, hist


def ma(df):
    df['SMA5'] = talib.SMA(np.array(df.close), timeperiod=5)
    df['SMA10'] = talib.SMA(np.array(df.close), timeperiod=10)
    df['transaction'] = df.SMA5 >= df.SMA10
    return df[['trade_date', 'transaction']]


def kdj(df):
    df['K'], df['D'] = talib.STOCH(df['high'].values,
                                   df['low'].values,
                                   df['close'].values,
                                   fastk_period=9,
                                   slowk_period=3,
                                   slowk_matype=0,
                                   slowd_period=3,
                                   slowd_matype=0)
    #### 处理数据,计算J值
    df['K'].fillna(0, inplace=True)
    df['D'].fillna(0, inplace=True)
    df['J'] = 3 * df['K'] - 2 * df['D']

    #### 计算金叉和死叉
    df['KDJ_金叉死叉'] = ''
    #### 令K>D为真值
    kdj_position = df['K'] > df['D']

    #### 当Ki>Di为真,Ki-1>Di-1为假时,为金叉
    df.loc[kdj_position[(kdj_position == True) & (kdj_position.shift() == False)].index, 'KDJ_金叉死叉'] = '金叉'

    #### 当Ki>Di为假,Ki-1>Di-1为真时,为死叉
    df.loc[kdj_position[(kdj_position == False) & (kdj_position.shift() == True)].index, 'KDJ_金叉死叉'] = '死叉'
    return df
--------------------------------------------------------------------------------
/mlstock/ml/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/ml/data/__init__.py
--------------------------------------------------------------------------------
/mlstock/ml/data/factor_conf.py:
--------------------------------------------------------------------------------
from
mlstock.factors.alpha_beta import AlphaBeta 2 | from mlstock.factors.daily_indicator import DailyIndicator 3 | from mlstock.factors.ff3_residual_std import FF3ResidualStd 4 | from mlstock.factors.finance_indicator import FinanceIndicator 5 | from mlstock.factors.kdj import KDJ 6 | from mlstock.factors.macd import MACD 7 | from mlstock.factors.psy import PSY 8 | from mlstock.factors.rsi import RSI 9 | from mlstock.factors.balance_sheet import BalanceSheet 10 | from mlstock.factors.cashflow import CashFlow 11 | from mlstock.factors.income import Income 12 | from mlstock.factors.stake_holder import StakeHolder 13 | from mlstock.factors.std import Std 14 | from mlstock.factors.returns import Return 15 | from mlstock.factors.turnover_return import TurnoverReturn 16 | 17 | """ 18 | 所有的因子配置,有的因子类只包含一个feature,有的因子类可能包含多个features。 19 | """ 20 | 21 | # 测试用 22 | # FACTORS = [TurnoverReturn, 23 | # Return, 24 | # Std, 25 | # MACD, 26 | # KDJ, 27 | # PSY, 28 | # RSI] 29 | 30 | # 正式 31 | FACTORS = [TurnoverReturn, 32 | Return, 33 | Std, 34 | MACD, 35 | KDJ, 36 | PSY, 37 | RSI, 38 | BalanceSheet, 39 | Income, 40 | CashFlow, 41 | FinanceIndicator, 42 | DailyIndicator, 43 | # FF3ResidualStd, # 暂时不用,太慢了,单独跑效果也一般 44 | AlphaBeta, 45 | StakeHolder] 46 | 47 | 48 | def get_factor_names(): 49 | """ 50 | 获得所有的因子名 51 | :return: 52 | """ 53 | 54 | names = [] 55 | for f in FACTORS: 56 | _names = f(None, None).name 57 | if type(_names) == list: 58 | names += _names 59 | else: 60 | names += [_names] 61 | return names 62 | 63 | 64 | def get_factor_class_by_name(name): 65 | """ 66 | 根据名字获得Factor Class 67 | :return: 68 | """ 69 | for f in FACTORS: 70 | _names = f(None, None).name 71 | if name in _names: return f 72 | raise ValueError(f"通过因子名称[{name}]无法找到因子类Class") 73 | -------------------------------------------------------------------------------- /mlstock/ml/evaluate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | 4 
| import joblib 5 | import numpy as np 6 | import pandas as pd 7 | from pandas import DataFrame 8 | from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score, precision_score, \ 9 | recall_score, f1_score 10 | 11 | from mlstock.ml import load_and_filter_data 12 | from mlstock.ml.data import factor_service, factor_conf 13 | from mlstock.ml.data.factor_service import extract_features 14 | from mlstock.utils import utils 15 | 16 | logger = logging.getLogger(__name__) 17 | 18 | 19 | def classification_metrics(df, model): 20 | """ 21 | https://ningshixian.github.io/2020/08/24/sklearn%E8%AF%84%E4%BC%B0%E6%8C%87%E6%A0%87/ 22 | """ 23 | 24 | X = extract_features(df) 25 | y = df.target.apply(lambda x: 1 if x > 0 else 0) 26 | 27 | y_pred = model.predict(X) 28 | metrics = {} 29 | metrics['accuracy'] = accuracy_score(y, y_pred) 30 | metrics['precision'] = precision_score(y, y_pred) 31 | metrics['recall'] = recall_score(y, y_pred) 32 | metrics['f1'] = f1_score(y, y_pred) 33 | 34 | logger.debug("计算出回归指标:%r", metrics) 35 | return metrics 36 | 37 | 38 | def regression_metrics(df, model): 39 | """ 40 | https://blog.csdn.net/u012735708/article/details/84337262 41 | """ 42 | metrics = {} 43 | 44 | df['y'] = df['target'] 45 | X = extract_features(df) 46 | 47 | df['y_pred'] = model.predict(X) 48 | 49 | metrics['corr'] = df[['y', 'y_pred']].corr().iloc[0, 1] # 测试标签y和预测y_pred相关性,到底准不准啊 50 | 51 | # 看一下rank的IC,不看值相关性,而是看排名的相关性 52 | df['y_rank'] = df.y.rank(ascending=False) # 并列的默认使用排名均值 53 | df['y_pred_rank'] = df.y_pred.rank(ascending=False) 54 | metrics['rank_corr'] = df[['y_rank', 'y_pred_rank']].corr().iloc[0, 1] 55 | 56 | metrics['RMSE'] = np.sqrt(mean_squared_error(df.y, df.y_pred)) 57 | 58 | metrics['MA'] = mean_absolute_error(df.y, df.y_pred) 59 | 60 | metrics['R2'] = r2_score(df.y, df.y_pred) 61 | 62 | logger.debug("计算出分类指标:%r", metrics) 63 | return metrics 64 | 65 | 66 | def factor_weights(model): 67 | """ 68 | 显示权重的影响 69 | :param model: 70 | 
:return: 71 | """ 72 | 73 | param_weights = model.coef_ 74 | param_names = factor_conf.get_factor_names() 75 | df = pd.DataFrame({'names': param_names, 'weights': param_weights}) 76 | df = df.reindex(df.weights.abs().sort_values().index) 77 | logger.info("参数和权重排序:") 78 | logger.info(df.to_string().replace('\n', '\n\t')) 79 | 80 | 81 | def main(args): 82 | # 查看数据文件和模型文件路径是否正确 83 | if args.model_pct: utils.check_file_path(args.model_pct) 84 | if args.model_winloss: utils.check_file_path(args.model_winloss) 85 | 86 | df_data = load_and_filter_data(args.data, args.start_date, args.end_date) 87 | 88 | 89 | # 加载模型;如果参数未提供,为None 90 | model_pct = joblib.load(args.model_pct) if args.model_pct else None 91 | model_winloss = joblib.load(args.model_winloss) if args.model_winloss else None 92 | 93 | 94 | if model_pct: 95 | factor_weights(model_pct) 96 | regression_metrics(df_data, model_pct) 97 | 98 | if model_winloss: 99 | classification_metrics(df_data, model_winloss) 100 | 101 | logger.info("原始数据统计:") 102 | logger.info("周收益平均值:%.2f%%", df_data.target.mean() * 100) 103 | logger.info("周收益标准差:%.2f%%", df_data.target.std() * 100) 104 | logger.info("周收益中位数:%.2f%%", df_data.target.median() * 100) 105 | logger.info("绝对值平均值:%.2f%%", df_data.target.abs().mean() * 100) 106 | logger.info("绝对值标准差:%.2f%%", df_data.target.abs().std() * 100) 107 | logger.info("绝对值中位数:%.2f%%", df_data.target.abs().median() * 100) 108 | 109 | 110 | """ 111 | python -m mlstock.ml.evaluate \ 112 | -s 20190101 -e 20220901 \ 113 | -mp model/pct_ridge_20220902112320.model \ 114 | -mw model/winloss_xgboost_20220902112813.model \ 115 | -d data/factor_20080101_20220901_2954_1299032__industry_neutral_20220902112049.csv 116 | """ 117 | if __name__ == '__main__': 118 | utils.init_logger(file=True) 119 | parser = argparse.ArgumentParser() 120 | 121 | # 数据相关的 122 | parser.add_argument('-s', '--start_date', type=str, default="20190101", help="开始日期") 123 | parser.add_argument('-e', '--end_date', type=str, 
default="20220901", help="结束日期")
    parser.add_argument('-d', '--data', type=str, default=None, help="数据文件")
    parser.add_argument('-mp', '--model_pct', type=str, default=None, help="收益率模型")
    parser.add_argument('-mw', '--model_winloss', type=str, default=None, help="涨跌模型")

    args = parser.parse_args()

    main(args)
--------------------------------------------------------------------------------
/mlstock/ml/prepare_factor.py:
--------------------------------------------------------------------------------
import argparse

from mlstock.ml.data import factor_service
from mlstock.ml.data.factor_conf import FACTORS
from mlstock.utils import utils


def main(args):
    start_date = args.start_date
    end_date = args.end_date
    num = args.num
    is_industry_neutral = args.industry_neutral

    # 重新计算所有因子,并保存成csv
    df_weekly, factor_names, csv_path = factor_service.calculate(FACTORS, start_date, end_date, num, is_industry_neutral)
    return df_weekly, factor_names


"""
python -m mlstock.ml.prepare_factor -n 50 -in -s 20080101 -e 20220901
"""
if __name__ == '__main__':
    utils.init_logger(file=True)

    parser = argparse.ArgumentParser()

    # 数据相关的
    parser.add_argument('-s', '--start_date', type=str, default="20090101", help="开始日期")
    parser.add_argument('-e', '--end_date', type=str, default="20220901", help="结束日期")
    parser.add_argument('-n', '--num', type=int, default=100000, help="股票数量,调试用")

    parser.add_argument('-in', '--industry_neutral', action='store_true', default=False, help="是否做行业中性处理")

    args = parser.parse_args()

    df_data, factor_names = main(args)
--------------------------------------------------------------------------------
/mlstock/ml/train.py:
--------------------------------------------------------------------------------
import argparse
import logging

from mlstock.ml import
load_and_filter_data
5 | from mlstock.ml.data import factor_conf
6 | from mlstock.ml.trains.train_pct import TrainPct
7 | from mlstock.ml.trains.train_winloss import TrainWinLoss
8 | from mlstock.utils import utils
9 |
10 | logger = logging.getLogger(__name__)
11 |
12 |
13 | def main(data_path, start_date, end_date, train_type, factor_names):
14 |     """
15 |     训练
16 |     :param data_path: 数据(因子)csv文件的路径
17 |     :param start_date: 训练数据的开始日期
18 |     :param end_date: 训练数据的结束日期
19 |     :param train_type: all|pct|winloss,方便单独训练
20 |     :param factor_names: 因子的名称,用于过滤出X
21 |     :return:
22 |     """
23 |     # 统一改成从csv文件加载数据:之前是每次先清洗、再直接用得到的dataframe,过程很慢,
24 |     # 现在先把清洗结果存成文件、再从文件加载,把过程拆开,方便做pipeline
25 |     df_data = load_and_filter_data(data_path, start_date, end_date)
26 |
27 |     # 收益率回归模型
28 |     train_pct = TrainPct(factor_names)
29 |
30 |     # 涨跌分类模型
31 |     train_winloss = TrainWinLoss(factor_names)
32 |
33 |     # 回归+分类
34 |     if train_type == 'all':
35 |         return [train_pct.train(df_data), train_winloss.train(df_data)]
36 |
37 |     # 仅回归
38 |     if train_type == 'pct':
39 |         return train_pct.train(df_data)
40 |
41 |     # 仅分类
42 |     if train_type == 'winloss':
43 |         return train_winloss.train(df_data)
44 |
45 |     raise ValueError(f"无法识别训练类型:{train_type}")
46 |
47 |
48 | """
49 | python -m mlstock.ml.train --train all --data data/
50 | """
51 | if __name__ == '__main__':
52 |     utils.init_logger(file=True, log_level=logging.DEBUG)
53 |
54 |     parser = argparse.ArgumentParser()
55 |
56 |     # 数据相关的
57 |     parser.add_argument('-s', '--start_date', type=str, default="20090101", help="开始日期")
58 |     parser.add_argument('-e', '--end_date', type=str, default="20190101", help="结束日期")
59 |     parser.add_argument('-n', '--num', type=int, default=100000, help="股票数量,调试用")
60 |     parser.add_argument('-d', '--data', type=str, help="预先加载的因子数据文件的路径,不再从头计算因子")
61 |     parser.add_argument('-in', '--industry_neutral', action='store_true', default=False, help="是否做行业中性处理")
62 |
63 |     # 训练相关的
64 |     parser.add_argument('-t', '--train', type=str, default="all",
help="all|pct|winloss : 训练所有|仅训练收益|仅训练涨跌")
65 |     args = parser.parse_args()
66 |     factor_names = factor_conf.get_factor_names()
67 |     logger.info("训练使用的特征 %d 个:%r", len(factor_names), factor_names)
68 |
69 |     main(args.data, args.start_date, args.end_date, args.train, factor_names)
70 |
-------------------------------------------------------------------------------- /mlstock/ml/trains/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/ml/trains/__init__.py -------------------------------------------------------------------------------- /mlstock/ml/trains/train_action.py: --------------------------------------------------------------------------------
1 | import logging
2 | import os.path
3 | import time
4 |
5 | import joblib
6 |
7 | from mlstock.const import TRAIN_TEST_SPLIT_DATE
8 | from mlstock.utils.utils import time_elapse
9 |
10 | logger = logging.getLogger(__name__)
11 |
12 |
13 | class TrainAction:
14 |
15 |     def __init__(self, factor_names):
16 |         self.factor_names = factor_names
17 |
18 |     def set_target(self, df_weekly):
19 |         raise NotImplementedError()
20 |
21 |     def _train(self, X_train, y_train):
22 |         raise NotImplementedError()
23 |
24 |     def train(self, df_weekly):
25 |         # 根据子类,来调整target(分类要变成0:1)
26 |         df_weekly = self.set_target(df_weekly)
27 |
28 |         # 按日期划分训练集和测试集:时间序列不能随机切分,否则会有前视偏差
29 |         # 2009.1~2022.8,165个月,Test比例0.3,大约是2019.1~2022.8,正好合适
30 |         df_train = df_weekly[df_weekly.trade_date < TRAIN_TEST_SPLIT_DATE]
31 |         df_test = df_weekly[df_weekly.trade_date >= TRAIN_TEST_SPLIT_DATE]
32 |         X_train = df_train[self.factor_names].values
33 |         y_train = df_train.target
34 |
35 |         # 训练
36 |         start_time = time.time()
37 |         logger.debug("开始训练:数据行数和特征数:%r", X_train.shape)
38 |         model = self._train(X_train, y_train)
39 |         model_path = self.save_model(model)
40 |         time_elapse(start_time, "⭐️ 训练完成")
41 |
42 |         return model_path
43 |
44 |     def
get_model_name(self):
45 |         raise NotImplementedError()
46 |
47 |     def save_model(self, model):
48 |         if not os.path.exists("./model"): os.mkdir("./model")
49 |         model_file_path = f"./model/{self.get_model_name()}"
50 |         joblib.dump(model, model_file_path)
51 |         logger.info("训练结果保存到:%s", model_file_path)
52 |         return model_file_path
53 |
-------------------------------------------------------------------------------- /mlstock/ml/trains/train_pct.py: --------------------------------------------------------------------------------
1 | import logging
2 | import time
3 |
4 | import matplotlib.pyplot as plt
5 | import numpy as np
6 | from sklearn.linear_model import Ridge
7 | from sklearn.model_selection import cross_val_score
8 |
9 | from mlstock.ml.trains.train_action import TrainAction
10 | from mlstock.utils import utils
11 | from mlstock.utils.utils import time_elapse
12 |
13 | logger = logging.getLogger(__name__)
14 |
15 |
16 | class TrainPct(TrainAction):
17 |
18 |     def get_model_name(self):
19 |         return f"pct_ridge_{utils.now()}.model"
20 |
21 |     def set_target(self, df_data):
22 |         return df_data  # 啥也不做
23 |
24 |     def _train(self, X_train, y_train):
25 |         """用岭回归预测下周收益"""
26 |         best_hyperparam = self.search_best_hyperparams(X_train, y_train)
27 |         ridge = Ridge(alpha=best_hyperparam)
28 |         ridge.fit(X_train, y_train)
29 |         return ridge
30 |
31 |     def search_best_hyperparams(self, X_train, y_train):
32 |         """
33 |         寻找最好的超参
34 |         :param X_train:
35 |         :param y_train:
36 |         :return:
37 |         """
38 |         # 做这个是为了人肉看一下岭回归的超参alpha的最优值是啥
39 |         # 其实没必要,因为后面还可以用GridSearch自动跑一下,做这个就是想直观地感受一下
40 |         results = {}
41 |         alpha_scope = np.arange(start=1, stop=200, step=10)
42 |         for i in alpha_scope:
43 |             # Lasso和Ridge分别对应L1和L2正则化:L1会把系数压缩到0(因此还有挑选特征的作用),而L2只会把系数压小,不会压到0
44 |             ridge = Ridge(alpha=i)
45 |             results[i] = cross_val_score(ridge, X_train, y_train, cv=10, scoring='neg_mean_squared_error').mean()
46 |
47 |         # 按照value排序:{1: 1, 2: 2, 3: 3} =>[(3, 3), (2, 2), (1, 1)]
48 |         sorted_results = sorted(results.items(),
key=lambda x: (x[1], x[0]), reverse=True)
49 |
50 |         logger.info("超参数/样本和预测的均方误差:%r", results)
51 |         logger.info("Best超参数为:%.0f, Best均方误差的均值:%.2f", sorted_results[0][0], sorted_results[0][1])
52 |
53 |         # 保存超参的图像
54 |         fig = plt.figure(figsize=(20, 5))
55 |         plt.title('Best Alpha')
56 |         plt.plot(alpha_scope, [results[i] for i in alpha_scope], c="red", label="alpha")
57 |         plt.legend()
58 |         fig.savefig("data/best_alpha.jpg")
59 |
60 |         # 自动搜索,暂时保留,上面的代码中,我手工搜索了,还画了图
61 |         # # 用grid search找最好的alpha:[200,205,...,500]
62 |         # # grid search的参数是alpha,岭回归就这样一个参数,用于约束参数的平方和
63 |         # # grid search的入参包括alpha的范围,K-Fold的折数(cv),还有岭回归评价的函数(负均方误差)
64 |         # grid_search = GridSearchCV(Ridge(),
65 |         #                            {'alpha': alpha_scope},
66 |         #                            cv=5,  # 5折(KFold值)
67 |         #                            scoring='neg_mean_squared_error')
68 |         # grid_search.fit(X_train, y_train)
69 |         # # model = grid_search.estimator.fit(X_train, y_train)
70 |         # logger.info("GridSearch最好的成绩:%.5f", grid_search.best_score_)
71 |         # # 得到的结果是495,确实和上面人肉跑是一样的结果
72 |         # logger.info("GridSearch最好的参数:%.5f", grid_search.best_estimator_.alpha)
73 |
74 |         best_hyperparam = sorted_results[0][0]
75 |         return best_hyperparam
76 |
-------------------------------------------------------------------------------- /mlstock/ml/trains/train_winloss.py: --------------------------------------------------------------------------------
1 | import logging
2 |
3 | from sklearn.model_selection import GridSearchCV
4 | from sklearn.preprocessing import LabelEncoder
5 | from xgboost import XGBClassifier
6 |
7 | from mlstock.ml.trains.train_action import TrainAction
8 | from mlstock.utils import utils
9 |
10 | logger = logging.getLogger(__name__)
11 |
12 | PARAMS_MODE = 'fix'  # fix | optimize
13 |
14 |
15 | class TrainWinLoss(TrainAction):
16 |
17 |     def get_model_name(self):
18 |         return f"winloss_xgboost_{utils.now()}.model"
19 |
20 |     def set_target(self, df_data):
21 |         df_data['target'] = df_data.target.apply(lambda x: 1 if x > 0 else 0)
22 |         logger.info("设置target为分类:0跌,1涨")
23 |         return df_data
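上面 set_target 把周收益率映射成 0/1 的涨跌标签;下面是这个映射逻辑的一个极简示意(不依赖pandas,收益率数值是假设的示例数据):

```python
def to_winloss_label(pct_chg):
    # 与 TrainWinLoss.set_target 相同的映射:涨(>0)记为1,跌或平记为0
    return 1 if pct_chg > 0 else 0

# 假设的几个周收益率(示例数据)
weekly_returns = [0.032, -0.015, 0.0, 0.071]
labels = [to_winloss_label(r) for r in weekly_returns]
print(labels)  # [1, 0, 0, 1]
```

注意0(平盘)被归到"跌"这一类,这是原映射 `1 if x > 0 else 0` 隐含的选择。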
24 |
25 |     def _train(self, X_train, y_train):
26 |         """
27 |         Xgboost来做输赢判断,参考:https://cloud.tencent.com/developer/article/1656126
28 |         :return:
29 |         """
30 |         # https://so.muouseo.com/qa/em6w1x8w20k8.html
31 |         le = LabelEncoder()
32 |         y_train = le.fit_transform(y_train)
33 |
34 |         if PARAMS_MODE == 'optimize':
35 |             # 创建xgb分类模型实例
36 |             model = XGBClassifier(nthread=1)
37 |
38 |             # 待搜索的参数列表空间
39 |             param_list = {"max_depth": [3, 5, 7, 9],
40 |                           "n_estimators": [10, 30, 50, 70, 90]}
41 |
42 |             # 创建网格搜索
43 |             grid_search = GridSearchCV(model,
44 |                                        param_grid=param_list,
45 |                                        cv=5,
46 |                                        verbose=10,
47 |                                        scoring='f1_weighted',  # TODO:f1???
48 |                                        n_jobs=15)  # 最多15个进程同时跑: 1个进程2G内存,15x2=30G内存使用,不能再多了
49 |             # 在训练数据上执行网格搜索
50 |             grid_search.fit(X_train, y_train)
51 |
52 |             # 输出搜索结果
53 |             logger.debug("GridSearch出最优参数:%r", grid_search.best_estimator_)
54 |
55 |             return grid_search.best_estimator_
56 |         else:
57 |             # 创建xgb分类模型实例
58 |             # 这组参数是由上面optimize模式的搜索结果得出的:时不时跑一次搜索,把最优结果抄到这里
59 |             model = XGBClassifier(nthread=1, max_depth=7, n_estimators=50)
60 |             model.fit(X_train, y_train)
61 |             return model
62 |
-------------------------------------------------------------------------------- /mlstock/research/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/research/__init__.py -------------------------------------------------------------------------------- /mlstock/research/backtest_select_top_n.py: --------------------------------------------------------------------------------
1 | """
2 |
3 | python -m mlstock.research.backtest_select_top_n \
4 |     -s 20190101 -e 20220901 \
5 |     -mp model/pct_ridge_20220902112320.model \
6 |     -mw model/winloss_xgboost_20220902112813.model \
7 |     -d data/factor_20080101_20220901_2954_1299032__industry_neutral_20220902112049.csv
8 | """
9 | import argparse
10 | import time
11 |
12 | from mlstock.ml.backtests.backtest_deliberate import
load_datas, run_broker 13 | from mlstock.ml.data import factor_conf 14 | from mlstock.utils import utils 15 | 16 | 17 | def main(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names): 18 | """ 19 | 先预测出所有的下周收益率、下周涨跌 => df_data, 20 | 然后选出每周的top30 => df_selected_stocks, 21 | 然后使用Broker,来遍历每天的交易,每周进行调仓,并,记录下每周的股票+现价合计价值 => df_portfolio 22 | 最后计算出next_pct_chg、cumulative_pct_chg,并画出plot,计算metrics 23 | """ 24 | df_data, df_daily, df_index, df_baseline, df_limit, df_calendar = \ 25 | load_datas(data_path, start_date, end_date, model_pct_path, model_winloss_path, factor_names) 26 | 27 | for top_n in [5, 10, 15, 20, 25, 30, 35, 40]: 28 | run_broker(df_data, df_daily, df_index, df_baseline, df_limit, df_calendar, start_date, end_date, factor_names, 29 | top_n) 30 | 31 | 32 | if __name__ == '__main__': 33 | utils.init_logger(file=True) 34 | parser = argparse.ArgumentParser() 35 | 36 | # 数据相关的 37 | # parser.add_argument('-t', '--type', type=str, default="simple|backtrader|deliberate") 38 | parser.add_argument('-s', '--start_date', type=str, default="20190101", help="开始日期") 39 | parser.add_argument('-e', '--end_date', type=str, default="20220901", help="结束日期") 40 | parser.add_argument('-d', '--data', type=str, default=None, help="数据文件") 41 | parser.add_argument('-mp', '--model_pct', type=str, default=None, help="收益率模型") 42 | parser.add_argument('-mw', '--model_winloss', type=str, default=None, help="涨跌模型") 43 | 44 | args = parser.parse_args() 45 | 46 | factor_names = factor_conf.get_factor_names() 47 | 48 | start_time = time.time() 49 | main( 50 | # args.type, 51 | args.data, 52 | args.start_date, 53 | args.end_date, 54 | args.model_pct, 55 | args.model_winloss, 56 | factor_names) 57 | utils.time_elapse(start_time, "整个回测过程") 58 | -------------------------------------------------------------------------------- /mlstock/research/prepare_train_backtest_for_one_factor.py: -------------------------------------------------------------------------------- 1 
| import argparse
2 |
3 | from mlstock.factors.daily_indicator import DailyIndicator
4 | from mlstock.ml.data import factor_conf, factor_service
5 | from mlstock.research import train_backtest_for_each_factor
6 | from mlstock.utils import utils
7 |
8 |
9 | def main(factor_name):
10 |     start_date = '20090101'
11 |     end_date = '20220901'
12 |
13 |     factor_class = factor_conf.get_factor_class_by_name(factor_name)
14 |     # DailyIndicator 始终需要,因为里面有市值log信息,用于市值中性化
15 |     _, _, csv_path = factor_service.calculate([factor_class, DailyIndicator],
16 |                                               start_date,
17 |                                               end_date,
18 |                                               num=10000000,
19 |                                               is_industry_neutral=True)
20 |     train_backtest_for_each_factor.main(csv_path, factor_name)
21 |
22 |
23 | """
24 | python -m mlstock.research.prepare_train_backtest_for_one_factor -f MACD
25 | """
26 | if __name__ == '__main__':
27 |     utils.init_logger(file=False)
28 |     parser = argparse.ArgumentParser()
29 |
30 |     # 数据相关的
31 |
32 |     parser.add_argument('-f', '--factor', type=str, default=None, help="因子名字")
33 |     args = parser.parse_args()
34 |
35 |     main(args.factor)
36 |
-------------------------------------------------------------------------------- /mlstock/research/test_adf_kpss.py: --------------------------------------------------------------------------------
1 | """
2 | 他们说RSI是个平稳序列,是么?我自己试试。
3 | 结论:
4 | 我测了10只左右,大约都在lags=20,也就是延迟在20步的时候,
5 | 开始进入1个方差的范围,也就是趋于平稳了。
6 | 另外,对随机性做了Q检验,发现P值始终是0,说明lag=40以内自相关性=0的概率是0,
7 | H0假设:这是个随机序列,这个事件应该大概率发生,如果小概率发生了,就会被推翻。
8 | 我看到的是,我自己的这个序列几乎p-value总是0,说明是小概率,说明原假设被推翻,所以不是一个随机序列。
9 | 引自我的假设检验笔记:https://book.piginzoo.com/quantitative/statistics/test.html
10 | "你假设了一个参数,然后,你用这个参数去算某一次事件的概率,如果这个概率小于0.05,那说明你的假设不靠谱啊,你的假设下,应该大概率发生才对;现在小概率发生了,说明你的假设不对啊。"
11 | 后来又测试了macd的hist,果然不是平稳的
12 | """
13 | import random
14 | import matplotlib
15 | import matplotlib.pyplot as plt
16 | import talib as ta
17 | from matplotlib.font_manager import FontProperties
18 | from statsmodels.graphics.tsaplots import plot_acf
19 | import sys
20 | import numpy
as np
21 | import tushare as ts
22 |
23 | matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # 设置中文字体
24 | matplotlib.rcParams['axes.unicode_minus'] = False  # 正常显示负号,解决负号'-'显示为方块的问题
25 |
26 | token = sys.argv[1]
27 | pro = ts.pro_api(token)
28 |
29 |
30 | # 接下来,还要看看是不是随机序列:https://mlln.cn/2017/10/26/python%E6%97%B6%E9%97%B4%E5%BA%8F%E5%88%97%E5%88%86%E6%9E%90-%E7%BA%AF%E9%9A%8F%E6%9C%BA%E6%80%A7%E6%A3%80%E9%AA%8C/
31 | def boxpierce_test(x, fig):  # 参数改名为fig:原名plt会遮住matplotlib.pyplot,容易误读
32 |     from statsmodels.sandbox.stats.diagnostic import acorr_ljungbox
33 |     result = acorr_ljungbox(x, boxpierce=True)
34 |     print("box pierce 和 box ljung统计量:")
35 |     print(result)
36 |
37 |     ax = fig.add_subplot(425)
38 |     ax.plot(result.lb_stat, label='lb_stat')
39 |     ax.set_ylabel('True-Q')
40 |     ax.plot(result.bp_stat, label='bp_stat')
41 |     ax.legend()
42 |
43 |     ax = fig.add_subplot(426)
44 |     ax.plot(result.lb_pvalue, label='lb_pvalue')
45 |     ax.set_ylabel('P')
46 |     ax.plot(result.bp_pvalue, label='bp_pvalue')
47 |     ax.legend()
48 |
49 |     ax = fig.add_subplot(427)
50 |     x = [random.randint(1, 200) for i in range(len(x))]
51 |     result = acorr_ljungbox(x, boxpierce=True)
52 |
53 |     ax.plot(result.lb_stat, label='lb_stat')
54 |     ax.set_ylabel('random-Q')
55 |     ax.plot(result.bp_stat, label='bp_stat')
56 |     ax.legend()
57 |
58 |     ax = fig.add_subplot(428)
59 |     ax.plot(result.lb_pvalue, label='lb_pvalue')
60 |     ax.set_ylabel('P')
61 |     ax.plot(result.bp_pvalue, label='bp_pvalue')
62 |     ax.legend()
63 |
64 |
65 | def test_kpss(df):
66 |     """
67 |     https://juejin.cn/post/7121729587079282696
68 |
69 |     例子:
70 |     Result of KPSS Test:
71 |     (0.8942571772584017, 0.01, 21, {'10%': 0.347, '5%': 0.463, '2.5%': 0.574, '1%': 0.739})
72 |     Test Statistic 0.894257
73 |     p-value 0.010000
74 |     Lag Used 21.000000
75 |     Critical Values (10%) 0.347000 <-- 临界值
76 |     Critical Values (5%) 0.463000 <-- 临界值
77 |     Critical Values (2.5%) 0.574000 <-- 临界值
78 |     Critical Values (1%) 0.739000 <-- 临界值
79 |     --------------------------------
80 |     T值是0.89,
81 |
p-value是1%,小概率(<0.05),小概率发生了,那说明你H0假设不对啊,推翻原假设,选择备择假设。 82 | http://book.piginzoo.com/quantitative/statistics/test.html 83 | "你假设了一个参数,然后,你用这个参数去算某一次事件的概率, 84 | 如果这个概率小于0.05,那说明你的假设不靠谱啊,你的假设下,应该大概率发生才对; 85 | 现在小概率发生了,说明你的假设不对啊。" 86 | 而KPSS的H0假设是平稳的,备择假设是不平稳,那么这个结果就是不平稳的。(KPSS检验的原假设和备择假设与ADF检验相反) 87 | KPSS是右侧单边检验,故T值(0.894) > 临界值(0.347|0.463|0.574|0.739),拒绝H0原假设(平稳的),即,序列不平稳。 88 | 89 | :param df: 90 | :return: 91 | """ 92 | import statsmodels.api as sm 93 | test = sm.tsa.stattools.kpss(df, regression='ct') 94 | print("KPSS平稳性检验结果:", test) 95 | print("说明:") 96 | print("\t原假设H0:时间序列是平稳的,备择假设H1:时间序列不是趋势静止的") 97 | print("\t如果p值小于显著性水平α=0.05,就拒绝无效假设,不是趋势静止的。") 98 | print("-" * 40) 99 | print("T统计量:", test[0]) 100 | print("p-value:", test[1]) 101 | print("lags延迟期数:", test[2]) 102 | print("置信区间下的临界T统计量:") 103 | print("\t 1% :", test[3]['1%']) 104 | print("\t 5% :", test[3]['5%']) 105 | print("\t 10% :", test[3]['10%']) 106 | print("检验结果(是否平稳):", 107 | test[1] > 0.05 and test[0] < test[3]['1%'] and test[0] < test[3]['5%'] and test[0] < test[3]['10%'] 108 | , "<====================") 109 | 110 | 111 | def test_adf(df): 112 | """ 113 | https://cloud.tencent.com/developer/article/1737142 114 | 115 | ADF的原假设H0:序列是不平稳的 116 | 如果检验得到的p-value概率值,很小(<0.05),说明很小的概率竟然发生了,说明原假设H0不对,那么备择假设H1:序列是平稳的就成立了。 117 | 118 | 例子: 119 | ADF平稳性检验结果: (-69.09149159218173, 0.0, 0, 4985, {'1%': -3.4316624715142177, '5%': -2.862119970102166, '10%': -2.5670787188546584}, 28666.784252148856) 120 | T统计量: -69.09149159218173 121 | p-value: 0.0 122 | lags延迟期数: 0 123 | 测试的次数: 4985 124 | 置信区间下的临界T统计量: 125 | 1% : -3.4316624715142177 126 | 5% : -2.862119970102166 127 | 10% : -2.5670787188546584 128 | ---------------------------------------- 129 | 130 | 131 | :param df: 132 | :return: 133 | """ 134 | from statsmodels.tsa.stattools import adfuller 135 | adftest = adfuller(df, autolag='AIC') # ADF检验 136 | print("ADF平稳性检验结果:", adftest) 137 | print("说明:") 138 | 
print("\tADF检验的原假设是存在单位根,统计值是小于1%水平下的数字就可以极显著的拒绝原假设,认为数据平稳") 139 | print("\tADF结果T统计量同时小于1%、5%、10%三个level的统计量,说明平稳") 140 | print("-" * 40) 141 | print("T统计量:", adftest[0]) 142 | print("p-value:", adftest[1]) 143 | print("lags延迟期数:", adftest[2]) 144 | print("测试的次数:", adftest[3]) 145 | print("置信区间下的临界T统计量:") 146 | print("\t 1% :", adftest[4]['1%']) 147 | print("\t 5% :", adftest[4]['5%']) 148 | print("\t 10% :", adftest[4]['10%']) 149 | print("检验结果(是否平稳):", 150 | adftest[0] < adftest[4]['1%'] and adftest[0] < adftest[4]['5%'] and adftest[0] < adftest[4]['10%'], 151 | "<====================") 152 | def test(code): 153 | df = pro.daily(stock_code=code) 154 | 155 | fig = plt.figure(figsize=(16, 6)) 156 | 157 | # 1. test rsi 158 | rsi = ta.RSI(df['close']) 159 | rsi.dropna(inplace=True) 160 | test_stock(rsi,'rsi',fig) 161 | 162 | # 2. test return 163 | df_pct = df['pct_chg'].dropna() 164 | test_stock(df_pct,'pct',fig) 165 | 166 | # 3. test return log 167 | df_log = df_pct.apply(np.log) 168 | df_log.replace([np.inf, -np.inf], np.nan, inplace=True) 169 | df_log.dropna(inplace=True) 170 | test_stock(df_log,'pct_log',fig) 171 | 172 | def test_stock(data,name,fig): 173 | 174 | print("\n\n") 175 | print("="*80) 176 | print(f" 检验 {name} !!!!") 177 | print("=" * 80) 178 | print("\n\n") 179 | 180 | print("\n\n================") 181 | print("** ADF平稳性检验 **") 182 | print("================") 183 | test_adf(data) 184 | 185 | print("\n\n================") 186 | print("** KPSS平稳性检验 **") 187 | print("================") 188 | test_kpss(data) 189 | 190 | print("\n\n================") 191 | print("随机性检验") 192 | print(f"** 画自相关图 data/平稳性随机性_{name}_{code}.jpg **") 193 | print("================") 194 | plt.clf() 195 | test_stationarity(data, fig,name) 196 | boxpierce_test(data, fig) 197 | plt.savefig(f"data/平稳性随机性_{name}_{code}.jpg") 198 | 199 | def test_stationarity(x, fig, name): 200 | """自相关图ACF 201 | https://blog.csdn.net/mfsdmlove/article/details/124769371 202 | """ 203 | 204 | font = 
FontProperties()
205 |
206 |     ax = fig.add_subplot(421)
207 |     ax.set_ylabel("return", fontproperties=font, fontsize=16)
208 |     ax.set_yticklabels([str(x * 100) + "0%" for x in ax.get_yticks()], fontproperties=font, fontsize=14)
209 |     ax.set_title(name, fontproperties=font, fontsize=16)
210 |     ax.grid()
211 |     plt.plot(x)
212 |
213 |     ax = fig.add_subplot(422)
214 |     plot_acf(x, ax=ax, lags=20)
215 |
216 |     from pandas.plotting import autocorrelation_plot
217 |     ax = fig.add_subplot(423)
218 |     autocorrelation_plot(x, ax=ax)
219 |
220 | """
221 | 参考:
222 | - https://blog.51cto.com/u_15671528/5524434
223 | """
224 |
225 | # python -m mlstock.research.test_adf_kpss xxxxxxxxxxxxxxxxxxxxxxx
226 | if __name__ == '__main__':
227 |     # if len(sys.argv) < 2:
228 |     #     print("格式:python -m mlstock.research.test_adf_kpss.py xxxxxxxxxxxxxxxxxxxxx(tushare的token)")
229 |     for code in ["600495.SH"]:  # , "600540.SH", "600819.SH", "600138.SH", "002357.SZ", "002119.SZ"]:
230 |         test(code)
231 |
-------------------------------------------------------------------------------- /mlstock/research/train_backtest_for_each_factor.py: --------------------------------------------------------------------------------
1 | import argparse
2 |
3 | from mlstock.ml import train, backtest
4 | from mlstock.ml.data import factor_conf
5 | from mlstock.utils import utils
6 |
7 | """
8 | 把因子逐个跑一遍,用其单独训练,单独跑回测,
9 | 这样来对比不同因子的效果。
10 | """
11 | def main(data_path, factor_name):
12 |     start_date = '20090101'
13 |     split_date = '20190101'
14 |     end_date = '20220901'
15 |
16 |     factor_names = factor_conf.get_factor_names()
17 |
18 |     if factor_name is not None:
19 |         if factor_name not in factor_names:
20 |             raise ValueError(f"因子名[{factor_name}]不正确,不在因子列表中")
21 |         pct_model_path = train.main(data_path, start_date, split_date, 'pct', [factor_name])
22 |         backtest.main(data_path, start_date, split_date, pct_model_path, None, [factor_name])
23 |         backtest.main(data_path, split_date, end_date, pct_model_path, None, [factor_name])
return
24 |
25 |
26 |     for factor_name in factor_names:
27 |         pct_model_path = train.main(data_path, start_date, split_date, 'pct', [factor_name])
28 |         backtest.main(data_path, start_date, split_date, pct_model_path, None, [factor_name])
29 |         backtest.main(data_path, split_date, end_date, pct_model_path, None, [factor_name])
30 |
31 |
32 | # python -m mlstock.research.train_backtest_for_each_factor -d data/factor_20080101_20220901_2954_1322891_20220829120341.csv
33 | if __name__ == '__main__':
34 |     utils.init_logger(file=False)
35 |     parser = argparse.ArgumentParser()
36 |
37 |     # 数据相关的
38 |     parser.add_argument('-d', '--data', type=str, default=None, help="数据文件")
39 |     parser.add_argument('-f', '--factor', type=str, default=None, help="因子名字")
40 |     args = parser.parse_args()
41 |
42 |     main(args.data, args.factor)
43 |
-------------------------------------------------------------------------------- /mlstock/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/mlstock/utils/__init__.py -------------------------------------------------------------------------------- /mlstock/utils/data_utils.py: --------------------------------------------------------------------------------
1 | import logging
2 | import math
3 | from datetime import datetime
4 |
5 | import backtrader
6 | import pandas as pd
7 | from backtrader.feeds import PandasData
8 |
9 | from mlstock.data.datasource import DataSource
10 | from mlstock.utils import utils
11 |
12 | logger = logging.getLogger(__name__)
13 |
14 | PNL_LIMIT = 0.098
15 | OPEN_GAP = 0.001
16 |
17 |
18 | class MyPandasData(PandasData):
19 |     def __init__(self, df):
20 |         super().__init__()
21 |         self.df = df
22 |
23 |
24 | def is_limit_up(data):
25 |     """
26 |     判断是否涨停:
27 |     1、比昨天的价格超过9.8%,不能为10%,太严格
28 |     2、open、high、low和close的差距不超过价格的0.1% <-- No!No!No!不能用这个,这个就是未来函数了,想想你不可能知道第二天的close的
29 |     对于买入的逻辑而言,你的做法应该是,
30 |
如果你发现一直是高开,且最低价格一直不下来,直到快收盘,你就放弃购买了。
31 |     那么,在回测的时候,复现这个逻辑,你的做法应该是,
32 |     - 你如果发现当天的开盘价已经是昨日的9.8%了(不用管最高价了)
33 |     - 最低价一直没和开盘价差距在0.1%(一直没拉开)
34 |     这样,你就应该放弃买入了,视为这次机会放弃了。
35 |     """
36 |     pnl = (data.open[0] - data.close[-1]) / data.close[-1]
37 |     gap = (data.open[0] - data.low[0]) / data.open[0]
38 |     if pnl >= PNL_LIMIT and gap <= OPEN_GAP:
39 |         logger.warning("今天是涨停日, 开盘涨幅[%.2f%%], 开盘和最低价相差比[%.2f%%]", pnl * 100, gap * 100)
40 |         return True
41 |     return False
42 |
43 |
44 | def is_limit_low(data):
45 |     """
46 |     判断是否跌停:
47 |     1、比昨天的价格超过-9.8%,不能为10%,太严格
48 |     对于卖出的逻辑而言,你的做法应该是:
49 |     如果你发现一直是低开,最高价格一直上不来,直到快收盘,你就卖不出去了
50 |     那么,在回测的时候,复现这个逻辑,你的做法应该是:
51 |     - 你如果发现当天的开盘价已经超过昨日的-9.8%了(不用管最低了)
52 |     - 最高价一直没和开盘价差距在0.1%(一直没拉开)
53 |     这样,你就应该放弃卖出了,视为这次机会放弃了。
54 |     """
55 |     pnl = (data.open[0] - data.close[-1]) / data.close[-1]
56 |     gap = (data.high[0] - data.open[0]) / data.open[0]
57 |     if pnl <= - PNL_LIMIT and gap <= OPEN_GAP:
58 |         logger.warning("今天是跌停日, 开盘跌幅[%.2f%%], 最高价和开盘相差比[%.2f%%]", pnl * 100, gap * 100)
59 |         return True
60 |     return False
61 |
62 |
63 | def get_trade_period(the_date, period, datasource):
64 |     """
65 |     返回某一天所在的周、月在交易日历中的开始和结束日期
66 |     比如,我传入的是 2022.2.15, 返回的是2022.2.2/2022.2.27(这2日是2月的开始和结束交易日)
67 |     datasource是传入的
68 |     the_date:格式是YYYYMMDD
69 |     period:W 或 M
70 |     """
71 |     the_date_str = the_date  # 先留住YYYYMMDD字符串,下面查交易日历时用
72 |     the_date = utils.str2date(the_date)
73 |
74 |     # 读取交易日期
75 |     df = datasource.trade_cal(exchange='SSE', start_date=the_date_str, end_date='20990101')
76 |     # 只保存日期列
77 |     df = pd.DataFrame(df, columns=['cal_date'])
78 |     # 转成日期型
79 |     df['cal_date'] = pd.to_datetime(df['cal_date'], format="%Y%m%d")
80 |     # 如果传入日期不是交易日,就不需要生成
81 |     if pd.Timestamp(the_date) not in df['cal_date'].unique(): return None, None
82 |
83 |     # 把日期列设成index(因为index才可以用to_period函数)
84 |     df = df[['cal_date']].set_index('cal_date')
85 |     # 按照周、月的分组,对index进行分组
86 |     df_group = df.groupby(df.index.to_period(period))
87 |     # 看传入日期落在哪一组里,即找到它所在的周或月
88 |     target_period = None
89 |     for period, dates in df_group:
90 |         if period.start_time <
pd.Timestamp(the_date) < period.end_time:
91 |             target_period = period
92 |     if target_period is None:
93 |         logger.warning("无法找到[%s]所在周期的开始、结束日期", the_date)
94 |         return None, None
95 |     return target_period.start_time, target_period.end_time
96 |
97 |
98 | def is_trade_day():
99 |     """
100 |     判断今天是不是交易日
101 |     :return:
102 |     """
103 |     datasource = DataSource()
104 |     trade_dates = list(datasource.trade_cal(start_date=utils.last_week(utils.today()), end_date=utils.today()))
105 |     if utils.today() in trade_dates:
106 |         return True
107 |     return False
108 |
109 |
110 | def next_trade_day(trade_date, df_calendar):
111 |     """
112 |     下一个交易日
113 |     :return:
114 |     """
115 |     index = df_calendar[df_calendar == trade_date].index[0] + 1
116 |     if index >= len(df_calendar): return None
117 |     return df_calendar[index]
118 |
119 |
120 | def is_trade_time():
121 |     FMT = '%H:%M:%S'
122 |     now = datetime.strftime(datetime.now(), FMT)
123 |     time_0930 = "09:30:00"
124 |     time_1130 = "11:30:00"
125 |     time_1300 = "13:00:00"
126 |     time_1500 = "15:00:00"
127 |     is_morning = time_0930 <= now <= time_1130
128 |     is_afternoon = time_1300 <= now <= time_1500
129 |     return is_morning or is_afternoon
130 |
131 |
132 | def calc_size(broker, cash, data, price):
133 |     """
134 |     用来计算可以购买的股数:
135 |     1、刨除手续费
136 |     2、要是100的整数倍
137 |     为了保守起见,用涨停价格来买,这样可能会少买一些。
138 |     之前我用当天的close价格来算size,如果不打富余,第二天价格上涨一些,都会导致购买失败。
139 |     """
140 |     commission = broker.getcommissioninfo(data)
141 |     commission = commission.p.commission
142 |     # 注释掉了,头寸交给外面的头寸分配器(CashDistribute)来做了
143 |     # cash = broker.get_cash()
144 |
145 |     # 按照一个保守价格来买入;用floor而不是ceil,避免算出的股数超出现金能买的量
146 |     size = math.floor(cash * (1 - commission) / price)
147 |
148 |     # 要是100的整数倍
149 |     size = (size // 100) * 100
150 |     return size
151 |
152 |
153 | class LongOnlySizer(backtrader.Sizer):
154 |     """
155 |     自定义Sizer,经实践,不靠谱:
156 |     它用的是下单当天的价格,而不是下一个交易日的价格,这个肯定不行。
157 |     --------------------------------------------
158 |     >[20170331] 信号出现:SMA5/10翻红,5/10多头排列,周MACD金叉,月KDJ金叉,买入
159 |     >[20170331]
尝试买入:挂限价单[46.23]单,当前现金[126961.77],买入股份[2700.00]份 160 | >[20170331] 计算买入股数: 价格[42.49],现金[126961.77],股数[2900.00] 161 | \__这个日期是错的,应该用20170405的价格 162 | >['20170405'] 交易失败,股票[000020.SZ]:'Margin',现金[126961.77] 163 | """ 164 | 165 | params = (('min_stake_unit', 100),) 166 | 167 | def _getsizing(self, comminfo, cash, data, isbuy): 168 | if isbuy: 169 | commission = comminfo.p.commission 170 | price = data.open[0] 171 | size = math.ceil(cash * (1 - commission) / price) 172 | size = (size // 100) * 100 173 | logger.debug("[%s] 计算买入股数: 价格[%.2f],现金[%.2f],股数[%.2f]", self.strategy._date(), price, cash, size) 174 | return size 175 | 176 | 177 | # python -m backtest.data_utils 178 | if __name__ == '__main__': 179 | print("今天是交易日么?", is_trade_day()) 180 | -------------------------------------------------------------------------------- /mlstock/utils/db_utils.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import time 3 | 4 | import sqlalchemy 5 | from pandas import Series 6 | from sqlalchemy import create_engine 7 | 8 | logger = logging.getLogger(__name__) 9 | 10 | EALIEST_DATE = '20080101' # 最早的数据日期 11 | 12 | 13 | def connect_db(conf): 14 | """ 15 | # https://stackoverflow.com/questions/8645250/how-to-close-sqlalchemy-connection-in-mysql: 16 | Engine is a factory for connections as well as a ** pool ** of connections, not the connection itself. 17 | When you say conn.close(), the connection is returned to the connection pool within the Engine, 18 | not actually closed. 
19 |     """
20 |
21 |     uid = conf['database']['uid']
22 |     pwd = conf['database']['pwd']
23 |     db = conf['database']['db']
24 |     host = conf['database']['host']
25 |     port = conf['database']['port']
26 |     engine = create_engine("mysql+pymysql://{}:{}@{}:{}/{}?charset={}".format(uid, pwd, host, port, db, 'utf8'))
27 |     # engine = create_engine('sqlite:///' + DB_FILE + '?check_same_thread=False', echo=echo) # 是否显示SQL:, echo=True)
28 |     return engine
29 |
30 |
31 | def is_table_exist(engine, name):
32 |     return sqlalchemy.inspect(engine).has_table(name)
33 |
34 |
35 | def is_table_index_exist(engine, name):
36 |     if not is_table_exist(engine, name):
37 |         return False
38 |
39 |     indices = sqlalchemy.inspect(engine).get_indexes(name)
40 |     return indices and len(indices) > 0
41 |
42 |
43 | def run_sql(engine, sql):
44 |     c = engine.connect()
45 |     sql = sqlalchemy.text(sql)  # 包一层text(),兼容新版SQLAlchemy的执行接口
46 |     result = c.execute(sql)
47 |     return result
48 |
49 |
50 | def list_to_sql_format(_list):
51 |     """
52 |     把list转成sql中in要求的格式
53 |     ['a','b','c'] => " 'a','b','c' "
54 |     """
55 |     if type(_list) == Series:
56 |         _list = list(_list)
57 |     elif type(_list) != list:
58 |         _list = [_list]
59 |
60 |     data = ["\'" + one + "\'" for one in _list]
61 |     return ','.join(data)
62 |
63 |
64 | def create_db_index(engine, table_name, df):
65 |     if is_table_index_exist(engine, table_name): return
66 |
67 |     # 创建索引,需要单独的sql处理
68 |     index_sql = None
69 |     if "ts_code" in df.columns and "trade_date" in df.columns:
70 |         index_sql = "create index {}_code_date on {} (ts_code,trade_date);".format(table_name, table_name)
71 |     if "ts_code" in df.columns and "ann_date" in df.columns:
72 |         index_sql = "create index {}_code_date on {} (ts_code,ann_date);".format(table_name, table_name)
73 |
74 |     if not index_sql: return
75 |
76 |     start_time = time.time()
77 |     engine.execute(index_sql)
78 |     logger.debug("在表[%s]上创建索引,耗时: %.2f %s", table_name, time.time() - start_time, index_sql)
79 |
--------------------------------------------------------------------------------
/mlstock/utils/df_utils.py: --------------------------------------------------------------------------------
1 | import logging
2 | import time
3 |
4 | import pandas as pd
5 | from pandas.api.types import is_datetime64_any_dtype as is_datetime
6 | from tqdm import tqdm
7 |
8 |
9 | logger = logging.getLogger(__name__)
10 |
11 |
12 | def reset_index(df, date_only=False, date_format=None):
13 |     """
14 |     把索引设置成[日期+股票代码]的复合索引
15 |     """
16 |
17 |     if date_format is None: date_format = CONF['dateformat']
18 |
19 |     assert 'datetime' in df.columns, df.columns
20 |     if date_only:
21 |         # 如果已经是日期类型了,无需再转
22 |         if not is_datetime(df['datetime']):
23 |             df['datetime'] = to_datetime(df['datetime'], date_format)
24 |         df = df.set_index('datetime')
25 |     else:
26 |         assert 'code' in df.columns, df.columns
27 |         df['datetime'] = to_datetime(df['datetime'], date_format)
28 |         df = df.set_index(['datetime', 'code'])
29 |     return df
30 |
31 |
32 | def to_datetime(series, date_format=None):
33 |     if date_format is None: date_format = CONF['dateformat']
34 |     return pd.to_datetime(series, format=date_format)  # 时间为日期格式,tushare是str
35 |
36 |
37 | def date2str(df, date_column):
38 |     df[date_column] = df[date_column].dt.strftime(CONF['dateformat'])
39 |     return df
40 |
41 |
42 | def load_daily_data(datasource, stock_codes, start_date, end_date):
43 |     df_merge = None  # 初始为None,第一只股票的数据直接赋值
44 |     # 每支股票
45 |     start_time = time.time()
46 |     pbar = tqdm(total=len(stock_codes))
47 |     for i, stock_code in enumerate(stock_codes):
48 |         # 得到日交易数据
49 |         data = datasource.daily(stock_code=stock_code, start_date=start_date, end_date=end_date)
50 |         if df_merge is None:
51 |             df_merge = data
52 |         else:
53 |             df_merge = df_merge.append(data)
54 |         pbar.update(1)  # update的参数是增量,不是当前下标
55 |     pbar.close()
56 |
57 |     logger.debug("一共加载 %s~%s %d 只股票,共计 %d 条日交易数据,耗时 %.2f 秒",
58 |                  start_date,
59 |                  end_date,
60 |                  len(stock_codes),
61 |                  len(df_merge),
62 |                  time.time() - start_time)
63 |     return df_merge
64 |
65 |
66 | def compile_industry(series_industry):
67 |     """
68 |
68 |     Map the industry column (Chinese text) to a unified industry code,
69 |     using Shenwan's 2014 edition (28 L1 / 104 L2 industries), not the 2021 edition: https://tushare.pro/document/2?doc_id=181
70 |     e.g. "家用电器" => '330000'
71 |     ----------
72 |     index_classify result:
73 |     index_code  industry_name  level  industry_code  is_pub  parent_code
74 |     801024.SI   采掘服务       L2     210400         None    210000
75 |     801035.SI   石油化工       L2     220100         None    220000
76 |     801033.SI   化学原料       L2     220200         None    220000
77 |     Note: tushare returns the Chinese industry name, not the Shenwan industry code, so we do the mapping ourselves.
78 |     ----------
79 |     df_industry: each stock's industry name in Chinese, 3 columns [code, date, industry]
80 |     """
81 |     df_industries = ds_factory.get().index_classify()
82 | 
83 |     def find_industry_code(chinese_name):
84 | 
85 | 
86 |         # find the rows whose [industry_name] equals the Chinese industry name
87 |         found_rows = df_industries.loc[df_industries['industry_name'] == chinese_name]
88 | 
89 |         for _, row in found_rows.iterrows():
90 |             r = __extract_industry_code(row)
91 |             if r: return r
92 | 
93 |         r = __get_possible(df_industries, chinese_name)
94 |         if r is None:
95 |             raise ValueError('Cannot find the industry code for [' + chinese_name + "]")  # the raise was missing
96 |         return r
97 | 
98 |     def __extract_industry_code(row):
99 |         if row.level == 'L1': return row.industry_code
100 |         if row.level == 'L2': return row.parent_code
101 |         if row.level == 'L3':  # assume the parent can always be found
102 |             assert len(df_industries.loc[df_industries['industry_code'] == row.parent_code]) > 0
103 |             return df_industries.loc[df_industries['industry_code'] == row.parent_code].iloc[0].parent_code
104 |         raise ValueError('Unexpected industry level: %s' % row.level)  # was `raise None`, which is invalid
105 | 
106 |     def __get_possible(df_industries, chinese_name):
107 |         from Levenshtein import distance
108 | 
109 |         df_industries['distance'] = df_industries['industry_name'].apply(
110 |             lambda x: distance(x, chinese_name)
111 |         )
112 |         first_row = df_industries.sort_values('distance').iloc[0]
113 |         code = __extract_industry_code(first_row)
114 | 
115 |         # logger.debug("industry fuzzy match: %s => %s:%s", chinese_name, first_row['industry_name'], code)
116 | 
117 |         return code
118 | 
119 |     # generate the Shenwan industry-code column from the Chinese-name column
120 |     return series_industry.apply(find_industry_code)
121 | 
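The `__get_possible` fallback above picks the classification row whose name is closest by edit distance. A self-contained sketch of the same idea, using stdlib `difflib` in place of the `Levenshtein` dependency, over a hypothetical three-row slice of the classification table:

```python
import difflib

# Hypothetical slice of the Shenwan table: name -> (level, industry_code, parent_code)
industries = {
    '家用电器': ('L1', '330000', None),
    '石油化工': ('L2', '220100', '220000'),
    '化学原料': ('L2', '220200', '220000'),
}


def extract_code(level, industry_code, parent_code):
    # L1 rows carry the unified code directly; L2 rows map up to their parent (L1) code
    return industry_code if level == 'L1' else parent_code


def find_industry_code(chinese_name):
    if chinese_name in industries:  # exact match first
        return extract_code(*industries[chinese_name])
    # fallback: closest name by string similarity (the project uses Levenshtein distance)
    best = max(industries,
               key=lambda name: difflib.SequenceMatcher(None, name, chinese_name).ratio())
    return extract_code(*industries[best])


print(find_industry_code('家用电器'))    # 330000
print(find_industry_code('石油化工业'))  # fuzzy match to 石油化工 -> 220000
```

The fuzzy fallback is what lets slightly different spellings of an industry name (a common tushare quirk) still resolve to a code instead of failing outright.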
122 | 
123 | def validate_trade_date(df, date_column=None, start_date=None, end_date=None):
124 |     """
125 |     Check which trading dates are missing from the DataFrame
126 |     :param start_date:
127 |     :param end_date:
128 |     :param df: the dataframe to check
129 |     :param date_column: the date column in the dataframe
130 |     :return:
131 |     """
132 | 
133 |     assert len(df) > 0, len(df)
134 | 
135 |     def __find_date_column(columns):
136 |         for col in columns:
137 |             if col in DATE_COLUMNS:
138 |                 return col
139 |         return None
140 | 
141 |     if not date_column:
142 |         date_column = __find_date_column(df.columns)
143 |         if not date_column:
144 |             raise ValueError("The dataframe contains no date column: " + str(df.columns))
145 | 
146 |     if not start_date or not end_date:  # default each bound independently; end_date alone used to stay None
147 |         series_sort = df[date_column].sort_values()
148 |         if not start_date: start_date = series_sort.iloc[0]
149 |         if not end_date: end_date = series_sort.iloc[-1]
150 | 
151 |     df_all_trade_dates = ds_factory.get().trade_cal(start_date, end_date)
152 |     df_miss_trade_dates = df_all_trade_dates[~df_all_trade_dates.isin(df[date_column])]
153 |     logger.debug("%s~%s: %d rows, expected %d trading days, %d missing (%.1f%%)",
154 |                  start_date,
155 |                  end_date,
156 |                  len(df),
157 |                  len(df_all_trade_dates),
158 |                  len(df_miss_trade_dates),
159 |                  len(df_miss_trade_dates) * 100 / len(df_all_trade_dates))
160 |     return df_miss_trade_dates, len(df_miss_trade_dates) / len(df_all_trade_dates)
161 | 
162 | 
163 | # python -m datasource.datasource_utils
164 | if __name__ == '__main__':
165 |     utils.init_logger()
166 |     df = ds_factory.get().daily('300152.SZ', '20180101', '20190101')
167 |     validate_trade_date(df)
168 | 
--------------------------------------------------------------------------------
/mlstock/utils/dynamic_loader.py:
--------------------------------------------------------------------------------
1 | import importlib
2 | import inspect
3 | import logging
4 | from pkgutil import walk_packages
5 | 
6 | from mlstock.utils import utils  # was `from utils import utils`, which does not match the package layout
7 | 
8 | logger = logging.getLogger(__name__)
9 | 
10 | 
11 | def create_factor_by_name(name, factor_dict):
12 |     for _, clazz in factor_dict.items():
13 |         factor = clazz()
14 |         factor_name = factor.name()
15 |         if type(factor_name) == list and name in factor_name: return factor
16 |         if factor_name == name: return factor
17 |     logger.warning("Cannot create a factor instance for name [%s]", name)
18 |     raise ValueError(f"Cannot create a factor instance for name {name}")
19 | 
20 | 
21 | # Convert the constructor arguments' types; currently only int is supported, extend as needed
22 | # NOTE: constructor arguments must carry a type annotation, e.g. name:int
23 | def convert_params(clazz, param_values):
24 |     # logger.debug("converting params %r for %r", param_values, clazz)
25 |     full_args = inspect.getfullargspec(clazz.__init__)
26 |     annotations = full_args.annotations
27 |     arg_names = full_args.args[1:]  # the first arg is self, skip it
28 |     new_params = []
29 |     for i, value in enumerate(param_values):
30 | 
31 |         arg_name = arg_names[i]
32 |         arg_type = annotations.get(arg_name, None)
33 |         if arg_type and value and arg_type == int:
34 |             logger.debug("converted arg %s value %s to int", arg_name, value)
35 |             value = int(value)
36 |         new_params.append(value)
37 | 
38 |     return new_params
39 | 
40 | 
41 | def dynamic_load_classes(module_name, parent_class):
42 |     classes = []
43 |     base_module = importlib.import_module(module_name)
44 | 
45 |     for _, name, is_pkg in walk_packages(base_module.__path__, prefix="{}.".format(module_name)):
46 |         if is_pkg: continue
47 | 
48 |         module = importlib.import_module(name)
49 | 
50 |         for member_name, obj in inspect.getmembers(module):  # renamed so it does not shadow the module `name` above
51 | 
52 |             if not inspect.isclass(obj): continue
53 |             if not issubclass(obj, parent_class): continue
54 |             if obj == parent_class: continue
55 |             classes.append(obj)
56 | 
57 |             # print(member_name, ",", obj)
58 | 
59 |     return classes
60 | 
61 | 
62 | def dynamic_instantiation(package_name, parent_class):
63 |     objs = {}
64 |     classes = dynamic_load_classes(package_name, parent_class)
65 |     for clazz in classes:
66 |         # obj = clazz()
67 |         # print(clazz.__name__)
68 |         objs[clazz.__name__] = clazz
69 |     return objs
70 | 
71 | 
72 | # python -m utils.dynamic_loader
73 | if __name__ == "__main__":
74 |     utils.init_logger()
75 |     from example.factors.factor import Factor
76 | 
77 |     class_dict = dynamic_instantiation("example.factors",
Factor)
78 |     logger.debug("All loaded classes: %r", class_dict)
79 | 
80 |     # obj = get_validator("KeyValueValidator(签发机关)", class_dict)
81 |     # assert obj.process("机关哈哈哈哈哈哈") is not None
82 |     #
83 |     # # test: class name misspelled
84 |     # obj = get_validator("AchiveNoValidator(档案编号)", class_dict)
85 |     # assert obj is None
86 |     #
87 |     # obj = get_validator("AchieveNoValidator(档案编号)", class_dict)
88 |     # assert obj is not None
89 |     #
90 |     # obj = get_validator("VehicleUseCharacterValidator()", class_dict)
91 |     # assert obj is not None
--------------------------------------------------------------------------------
/mlstock/utils/industry_neutral.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn import linear_model
3 | from sklearn.base import BaseEstimator, TransformerMixin
4 | 
5 | from mlstock.utils import utils
6 | 
7 | """
8 | Industry neutralization.
9 | Industry/market-cap neutralization:
10 | Regress the (missing-value-filled) factor exposures on industry dummies and log market cap,
11 | and take the residuals as the new factor exposures.
12 | Reference (preprocessing & neutralization): https://www.pianshen.com/article/9590863701/
13 | Here Factor_i is stock i's alpha factor,
14 | MktVal_i is stock i's total market cap,
15 | Industry_j,i is the industry dummy: 1 if stock i belongs to industry j, else 0; each stock belongs to exactly one industry.
16 | 
17 | About Estimators:
18 | # https://juejin.cn/post/6844903478788096007
19 | 
20 | An estimator (often just a classifier) mainly has two methods:
21 | fit(): train the algorithm and set its internal parameters; takes the training set and the labels.
22 | predict(): predict labels for a test set. Most scikit-learn estimators take and return numpy arrays or similar.
23 | 
24 | A transformer handles preprocessing and data transformation, with three methods:
25 | fit(): train the algorithm and set its internal parameters.
26 | transform(): transform the data.
27 | fit_transform(): fit and transform in one step.
28 | 
29 | Usage:
30 |     idst_neutral = IndustryNeutral()
31 |     idst_neutral = idst_neutral.fit(df_train, f_x, f_idst)
32 |     df_train = idst_neutral.transform(df_train, f_x, f_idst)
33 |     df_test = idst_neutral.transform(df_test, f_x, f_idst)
34 | """
35 | 
36 | 
37 | class IndustryMarketNeutral(BaseEstimator, TransformerMixin):
38 |     """
39 |     Industry neutralization is not mysterious: take a factor value, e.g. 'return on assets', as Y,
40 |     with market cap and industry one-hots as X; the residual of the regression against the original value is the neutralized factor.
41 | 
42 |     Each feature column is used as Y, with industry one-hots and log market cap (logged, as I understand it, to keep the scale small) as X, fitting one linear regression.
43 |     So every column gets its own model;
then each model predicts its factor's fitted value y_pred,
44 |     and the residual y - y_pred is the industry+market-cap neutralized factor.
45 |     """
46 | 
47 |     def __init__(self, factor_names, industry_name, market_value_name):
48 |         self.models = {}
49 |         self.factor_names = factor_names
50 |         self.industry_name = industry_name
51 |         self.market_value_name = market_value_name
52 | 
53 |     def _to_one_hot(self, series):
54 |         return pd.get_dummies(series, prefix='industry')  # the return was missing, so fit/transform got no dummies
55 | 
56 |     def _regression_fit(self, X, y):
57 |         # drop rows where the target is NaN
58 |         X = X[~y.isna()]
59 |         y = y[~y.isna()]
60 |         self.models[y.name] = linear_model.LinearRegression().fit(X, y)
61 |         return
62 | 
63 |     def _regression_pred(self, X, y):
64 |         pred = self.models[y.name].predict(X)
65 |         return y - pred
66 | 
67 |     def fit(self, df):
68 |         # turn the industry int into one-hots, then join on the market-cap column
69 |         assert self.market_value_name in df.columns, f"log market-cap column [{self.market_value_name}] is not in the dataframe"
70 |         X = pd.concat(
71 |             [df[self.market_value_name],
72 |              self._to_one_hot(df[self.industry_name])],
73 |             axis=1)
74 |         # fit one model per feature/factor column and store it in the dict
75 |         df[self.factor_names].apply(lambda y: self._regression_fit(X, y))
76 |         return self
77 | 
78 |     def transform(self, df):
79 |         # turn the industry int into one-hots, then join on the market-cap column
80 |         X = pd.concat(
81 |             [df[self.market_value_name],
82 |              self._to_one_hot(df[self.industry_name])],
83 |             axis=1)
84 |         df[self.factor_names] = df[self.factor_names].apply(lambda y: self._regression_pred(X, y))
85 |         return df
86 | 
87 | # python -m mlstock.utils.industry_neutral
88 | # if __name__ == '__main__':
89 | #     from mlstock.data.datasource import DataSource
90 | #     from mlstock.data.stock_info import StocksInfo
91 | #
92 | #     utils.init_logger(file=False)
93 | #
94 | #     start_date = "20080101"
95 | #     end_date = "20220801"
96 | #     datasource = DataSource()
97 | #     stock_data, ts_codes = load_stock_data(datasource, start_date, end_date, 20)
98 | #
99 | #     from mlstock.factors.macd import MACD
100 | #     factor = MACD(datasource, StocksInfo(ts_codes, start_date, end_date))
101 | #     df_factor = factor.calculate(stock_data)
102 | #     df_weekly = factor.merge(stock_data.df_weekly, df_factor)
103 | #     df_weekly = pd.merge(df_weekly, stock_data.df_daily_basic, how='left', on=['ts_code', 'trade_date'])
104 | #
105 | #     industry_market_neutral = IndustryMarketNeutral([factor.name],
106 | #                                                     market_value_name='total_mv_log',
107 | #                                                     industry_name='industry')
108 | #     industry_market_neutral.fit(df_weekly)
109 | #     df_weekly = industry_market_neutral.transform(df_weekly)
110 | #     print(df_weekly)
--------------------------------------------------------------------------------
/mlstock/utils/multi_processor.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import time
3 | from multiprocessing import Process
4 | 
5 | logger = logging.getLogger(__name__)
6 | 
7 | 
8 | # Split the work evenly across worker processes; the remainder goes to the FIRST chunks,
9 | # e.g. 10 => [4, 3, 3], 11 => [4, 4, 3]; the goal is to stay as balanced as possible
10 | def split(_list, n):
11 |     k, m = divmod(len(_list), n)
12 |     return (_list[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
13 | 
14 | 
15 | def execute(data, worker_num, function, **params):
16 |     """
17 |     Multi-process executor
18 |     :param data: the input data; a list, or a dict (converted to a list of [key, value] pairs)
19 |     :param worker_num: how many worker processes to start
20 |     :param function: the worker function; it receives its data chunk via the keyword argument 'stocks'
21 |     :return:
22 |     """
23 |     start = time.time()
24 |     producers = []
25 | 
26 |     # convert to list
27 |     if type(data) == dict:
28 |         data = [[k, v] for k, v in data.items()]
29 | 
30 |     assert type(data) == list, "invalid data type, must be list, you are: " + str(type(data))
31 | 
32 |     split_data_list = split(data, worker_num)
33 | 
34 |     for id, worker_data_list in enumerate(split_data_list):
35 |         params['stocks'] = worker_data_list
36 |         logger.debug("worker [#%d] params: %r", id, params)
37 |         p = Process(target=function, kwargs=params)
38 |         producers.append(p)
39 |         p.start()
40 |     for p in producers:
41 |         p.join()  # wait for all child processes before the main process continues
42 | 
43 |     all_seconds = time.time() - start
44 |     minutes = all_seconds // 60
45 |     seconds = all_seconds % 60
46 | 
47 |     logger.info("%d worker processes finished, processed %d rows in %d min %d sec",
48 |                 id + 1,
49 |                 len(data),
50 |                 minutes,
51 |                 seconds)
--------------------------------------------------------------------------------
/nohup.out:
--------------------------------------------------------------------------------
1 | 2022-08-26 20:24:26,894 - INFO - train_winloss.py:20 P5252: 设置target为分类:0跌,1涨
2 | 【调试模式】
3 | 开始初始化日志:file=True, simple=False
4 | 日志:创建控制台处理器
5 | 日志:创建文件处理器 ./logs/202208262024.log
6 | Fitting 5 folds for each of 20 candidates, totalling 100 fits
--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
1 | tqdm
2 | tushare
3 | async_generator
4 | jqdatasdk
5 | statsmodels
6 | pandas==1.3.4
7 | pyyaml
8 | scikit-learn
9 | backtrader==1.9.76.123
10 | backtrader_plotting
11 | pymysql
12 | sqlalchemy==1.4.0
13 | empyrical
14 | chinese_calendar==1.7.2
15 | apscheduler==3.5.0
16 | matplotlib
17 | # matplotlib==3.2.2
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | 
3 | from setuptools import setup, find_packages, Command
4 | 
5 | 
6 | class CleanCommand(Command):
7 |     """Custom clean command to tidy up the project root."""
8 |     user_options = []
9 | 
10 |     def initialize_options(self):
11 |         pass
12 | 
13 |     def finalize_options(self):
14 |         pass
15 | 
16 |     def run(self):
17 |         os.system('rm -vrf ./build ./dist ./*.pyc ./*.tgz ./*.egg-info')
18 | 
19 | 
20 | # Further down when you call setup()
21 | setup(
22 |     cmdclass={
23 |         'clean': CleanCommand,
24 |     },
25 |     name="mlstock",
26 |     version="1.0",
27 |     description="mlstock",
28 |     author="piginzoo",
29 |     author_email="piginzoo@qq.com",
30 |     url="https://github.com/piginzoo/mlstock.git",
31 |     license="LGPL",
32 |     packages=find_packages(where=".", exclude=('test', 'test.*', 'conf'),
include=('*',))
33 | )
34 | 
--------------------------------------------------------------------------------
/test/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/test/__init__.py
--------------------------------------------------------------------------------
/test/toy/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/piginzoo/mlstock/65ed1aa78448a5a77a17cea8fda295c7c7894087/test/toy/__init__.py
--------------------------------------------------------------------------------
/test/toy/test_pandas.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | 
4 | def make_dummy_data():
5 |     data1 = [
6 |         ['000001.SZ', '2016-06-21'] + np.random.rand(3).tolist(),
7 |         ['000001.SZ', '2016-06-22'] + np.random.rand(3).tolist(),
8 |         ['000001.SZ', '2016-06-23'] + np.random.rand(3).tolist(),
9 |         ['000001.SZ', '2016-06-24'] + np.random.rand(3).tolist(),
10 |         ['000001.SZ', '2016-06-27'] + np.random.rand(3).tolist(),
11 |         ['000001.SZ', '2016-06-28'] + np.random.rand(3).tolist(),
12 |         ['000002.SH', '2016-06-21'] + np.random.rand(3).tolist(),
13 |         ['000002.SH', '2016-06-22'] + np.random.rand(3).tolist(),
14 |         ['000002.SH', '2016-06-23'] + np.random.rand(3).tolist(),
15 |         ['000002.SH', '2016-06-24'] + np.random.rand(3).tolist(),
16 |         ['000002.SH', '2016-06-27'] + np.random.rand(3).tolist(),
17 |         ['000002.SH', '2016-06-28'] + np.random.rand(3).tolist(),
18 |         ['000002.SH', '2016-06-29'] + np.random.rand(3).tolist(),
19 |         ['000003.SH', '2016-06-18'] + np.random.rand(3).tolist(),
20 |         ['000003.SH', '2016-06-19'] + np.random.rand(3).tolist(),
21 |         ['000003.SH', '2016-06-20'] + np.random.rand(3).tolist(),
22 |         ['000003.SH', '2016-06-21'] + np.random.rand(3).tolist(),
23 |         ['000003.SH', '2016-06-22'] +
np.random.rand(3).tolist(),
24 |         ['000003.SH', '2016-06-23'] + np.random.rand(3).tolist(),
25 |         ['000003.SH', '2016-06-24'] + np.random.rand(3).tolist(),
26 |         ['000003.SH', '2016-06-27'] + np.random.rand(3).tolist(),
27 |         ['000003.SH', '2016-06-28'] + np.random.rand(3).tolist()
28 |     ]
29 |     data1 = pd.DataFrame(data1,
30 |                          columns=["ts_code", "datetime", "a1", "a2", "a3"])
31 |     data1['datetime'] = pd.to_datetime(data1['datetime'], format='%Y-%m-%d')  # convert to datetime; tushare dates are str
32 |     return data1
33 | 
34 | 
35 | df = make_dummy_data()
36 | 
37 | 
38 | def roll_func(s):
39 |     _df = df.loc[s.index][["a1", "a2", "a3"]]
40 |     # import pdb;pdb.set_trace()
41 |     # return _df.sum(axis=0).sum() / (_df.shape[0] * _df.shape[1])
42 |     return _df['a3'].iloc[0]
43 | 
44 | 
45 | df1 = df.groupby('ts_code').rolling(window=3).mean()
46 | print(df1)
47 | print("=" * 80)
48 | 
49 | df2 = df.groupby('ts_code').a1.rolling(window=3).apply(roll_func, raw=False)
50 | print(df)
51 | print("-" * 40)
52 | print(df2)
53 | 
54 | # python -m test.toy.test_pandas
55 | 
--------------------------------------------------------------------------------
/test/tushare_debug.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import tushare as ts
4 | 
5 | pro = ts.pro_api()
6 | df = pro.query('daily', ts_code='000001.SZ,000002.SZ,000003.SZ', start_date='20180101', end_date='20201231')
7 | 
8 | import pdb;pdb.set_trace()
--------------------------------------------------------------------------------
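The chunk-balancing rule in `multi_processor.split` is easy to verify in isolation; a quick sketch reproducing the generator shows that the remainder is handed to the earliest chunks and no item is lost:

```python
# Reproduction of multi_processor.split: divide a list into n chunks whose
# sizes differ by at most one; the remainder goes to the first chunks.
def split(_list, n):
    k, m = divmod(len(_list), n)
    return (_list[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))


print([len(c) for c in split(list(range(10)), 3)])  # [4, 3, 3]
print([len(c) for c in split(list(range(11)), 3)])  # [4, 4, 3]
```

Because `split` returns a generator of slices, `execute` above can enumerate it directly, launching one `Process` per chunk without materializing all chunks first.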