├── README.md
├── data
│   ├── preprocessed
│   │   ├── SortCols.csv
│   │   ├── finance.csv
│   │   └── stockReturns.csv
│   └── raw
│       ├── Balance sheet
│       │   ├── FS_Combas1.xlsx
│       │   ├── FS_Combas2.xlsx
│       │   ├── FS_Combas3.xlsx
│       │   └── balance sheet.csv
│       ├── Income statement
│       │   ├── FS_Comins1.xlsx
│       │   ├── FS_Comins2.xlsx
│       │   └── FS_Comins3.xlsx
│       ├── mktmnth
│       │   ├── TRD_Cnmont.xlsx
│       │   └── TRD_Cnmont1.xlsx
│       ├── rf
│       │   └── TRD_Nrrate.xlsx
│       ├── stockmnth
│       │   ├── TRD_Mnth0.xlsx
│       │   ├── TRD_Mnth1.xlsx
│       │   └── TRD_Mnth2.xlsx
│       └── 字段说明.txt
├── factors.py
├── preprocess.py
├── regression.py
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
## Factor processing
We use pandas and numpy for data preprocessing, and the statsmodels API for the regression analysis.

The project consists of three files, each of which is one module:

- preprocess.py handles data preprocessing

  1. First, the files in each raw-data subfolder are merged into a single table per folder.
  2. Next, the balance sheet and the income statement are merged and written to finance.csv.
  3. Next, the individual stock returns, the risk-free rate, and the market return are merged and written to stockReturns.csv.
  4. The BM, Size, Inv, and OP indicator table (SortCols.csv) is computed from finance.csv and stockReturns.csv. The two tables are merged only when an indicator needs columns from both; otherwise they are used separately.

- factors.py computes the Fama-French five factors

  We use the 2x3 sorts from the paper.

  The file contains a Grouping class, whose main job is to sort stocks into groups by BM, Size, Inv, and OP. Its getVMReturn function returns the value-weighted return of a given group; see the docstrings in the class for usage.

  Functions with the suffix _MethodOne each compute a single factor using the 2x3 sorts.

  The FF5 function merges the individual factors and writes two CSV files, 2016FF5.csv and 2017FF5.csv.

  - PortfolioExcessReturn forms portfolios on the given indicators and outputs each portfolio's value-weighted excess return (the dependent variable we chose).

    We sort on Inv-Size, OP-Size, and BM-Size to form 5x5 portfolios, which produces the 6 CSV files in the group_result folder.

## Usage
To run the program, first install the dependencies:
```commandline
pip install -r requirements.txt
```
Then run the following commands in order:
```commandline
python preprocess.py
python factors.py
python regression.py
```

## Project structure
```text
│  factors.py                  # computes the factors
│  preprocess.py               # preprocessing
│  regression.py               # regressions
│  requirements.txt            # dependencies
│
├─data
│  ├─preprocessed              # data after initial preprocessing
│  │      finance.csv          # merged balance sheet and income statement
│  │      SortCols.csv         # BM, Size, Inv, OP indicator table
│  │      stockReturns.csv     # merged returns table
│  │
│  └─raw                       # raw data
│      │  字段说明.txt
│      │
│      ├─Balance sheet
│      │      balance sheet.csv
│      │      FS_Combas1.xlsx
│      │      FS_Combas2.xlsx
│      │      FS_Combas3.xlsx
│      │
│      ├─Income statement
│      │      FS_Comins1.xlsx
│      │      FS_Comins2.xlsx
│      │      FS_Comins3.xlsx
│      │
│      ├─mktmnth
│      │      TRD_Cnmont.xlsx
│      │      TRD_Cnmont1.xlsx
│      │
│      ├─rf
│      │      TRD_Nrrate.xlsx
│      │
│      └─stockmnth
│              TRD_Mnth0.xlsx
│              TRD_Mnth1.xlsx
│              TRD_Mnth2.xlsx
│
├─factor_result                # 2x3 FF five-factor results (2016 and 2017 phases)
│      2016FF5.csv
│      2017FF5.csv
│
├─group_result                 # excess returns of the 5x5 portfolios
│      2016_BM_Size.csv
│      2016_Inv_Size.csv
│      2016_OP_Size.csv
│      2017_BM_Size.csv
│      2017_Inv_Size.csv
│      2017_OP_Size.csv
│
└─regression_result            # regression results
        regression_BM.csv
        regression_BM_pvalue.csv
        regression_Inv.csv
        regression_Inv_pvalue.csv
        regression_OP.csv
        regression_OP_pvalue.csv

```

--------------------------------------------------------------------------------
/data/raw/Balance sheet/FS_Combas1.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/Balance sheet/FS_Combas1.xlsx
--------------------------------------------------------------------------------
/data/raw/Balance sheet/FS_Combas2.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/Balance sheet/FS_Combas2.xlsx
--------------------------------------------------------------------------------
/data/raw/Balance sheet/FS_Combas3.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/Balance sheet/FS_Combas3.xlsx
--------------------------------------------------------------------------------
/data/raw/Income statement/FS_Comins1.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/Income statement/FS_Comins1.xlsx
--------------------------------------------------------------------------------
/data/raw/Income statement/FS_Comins2.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/Income statement/FS_Comins2.xlsx
--------------------------------------------------------------------------------
/data/raw/Income statement/FS_Comins3.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/Income statement/FS_Comins3.xlsx
--------------------------------------------------------------------------------
/data/raw/mktmnth/TRD_Cnmont.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/mktmnth/TRD_Cnmont.xlsx
--------------------------------------------------------------------------------
/data/raw/mktmnth/TRD_Cnmont1.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/mktmnth/TRD_Cnmont1.xlsx
--------------------------------------------------------------------------------
/data/raw/rf/TRD_Nrrate.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/rf/TRD_Nrrate.xlsx
--------------------------------------------------------------------------------
/data/raw/stockmnth/TRD_Mnth0.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/stockmnth/TRD_Mnth0.xlsx
--------------------------------------------------------------------------------
/data/raw/stockmnth/TRD_Mnth1.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/stockmnth/TRD_Mnth1.xlsx
--------------------------------------------------------------------------------
/data/raw/stockmnth/TRD_Mnth2.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/stockmnth/TRD_Mnth2.xlsx
--------------------------------------------------------------------------------
/data/raw/字段说明.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teal0range/FF5/75854dd622df74152d6e322ef2762569becdb6c9/data/raw/字段说明.txt
--------------------------------------------------------------------------------
/factors.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import os
from datetime import datetime
from numpy import vectorize

from preprocess import generateDate

data_path = 'data/preprocessed'
phase = {
    "2016": generateDate("2016-07-31", "2017-06-30"),
    "2017": generateDate("2017-07-31", "2018-01-31")
}

currentPhase = phase["2016"]


class Grouping:
    # Sorts stocks into groups and answers per-group return queries
    def __init__(self, t):
        """Shaped like [[('h',df0),('l',df1)],[('b',df0),('s',df1)]];
        stores the per-dimension groups before they are intersected
        """
        self.groups = []

        # DataFrames holding the intersected groups in an easy-to-query layout
        self.frame = pd.DataFrame()
        self.view = pd.DataFrame()

        # Whether the class is ready to be queried
        self.isPrepared = False

        # Phase
        self.t = t

        # BM Inv OP Size
        self.factorList = pd.read_csv(os.path.join(data_path, 'SortCols.csv'))
        self.factorList = self.factorList[self.factorList['phase'] == int(self.t)]
        self.factorList['MktSize'] = self.factorList['Size']

        # Returns file
        self.stockReturn = self.date2Phase(
            pd.read_csv(os.path.join(data_path, 'stockReturns.csv'))[['Stkcd', 'date', 'Mretwd', 'Msmvosd', 'Nrrmtdt']]
        )

        # Vectorize getVMReturn and getVMExcessReturn over the date argument
        self.getVMReturn = np.vectorize(self.getVMReturn, excluded=['section'])
        self.getVMExcessReturn = np.vectorize(self.getVMExcessReturn, excluded=['section'])

    def append(self, breakpoints, col_name):
        """
        Add a sorting dimension to the grouping
        :param breakpoints: breakpoints, e.g. [('H',0),('L',0.5)] labels the [0,0.5] quantile range H and [0.5,1] L
        :param col_name: name of the data column to sort on
        :return: None
        """
        self.groups.append(self.split(breakpoints, col_name))
        self.isPrepared = False

    def split(self, breakpoints, col_name):
        """
        Split factorList on one data column
        :param breakpoints: breakpoints, e.g. [('H',0),('L',0.5)] labels the [0,0.5] quantile range H and [0.5,1] L
        :param col_name: name of the data column to sort on
        :return: a list shaped like [('h',df0),('l',df1)]
        """
        df = self.factorList
        df = df[['Stkcd', 'phase', 'MktSize', col_name]].sort_values(by=[col_name])
        res = []
        for idx, _breakpoint in enumerate(breakpoints):
            upper = 1 if idx == len(breakpoints) - 1 else breakpoints[idx + 1][1]
            tag, ratio = _breakpoint
            res.append((tag, df.iloc[int(ratio * len(df)):int(upper * len(df))].drop([col_name], axis=1)))
        return res

    def prepare(self):
        """
        Prepare the class for queries
        :return: None
        """
        self.isPrepared = True
        self.intersections()
        for i in range(len(self.frame)):
            data = self.frame.iat[i, self.getAxis()]
            self.frame.iat[i, self.getAxis()] = pd.merge(data, self.stockReturn, on=['Stkcd', 'phase'])

    def intersections(self, axis=0, cols=None, df=None):
        """
        Recursively intersect the groups of every sorting dimension
        :param axis: index of the dimension currently being intersected
        :param cols: group tags chosen so far, keyed by axis
        :param df: stocks that belong to all groups chosen so far
        :return: None
        """
        if cols is None:
            cols = {}
        if axis == len(self.groups):
            cols['data'] = [df]
            self.frame = self.frame.append(pd.DataFrame(cols, index=[len(self.frame)]), sort=True)
            return

        for group in self.groups[axis]:
            if axis == 0:
                df = group[1]
            tmp = df
            if axis != 0:
                df = pd.merge(df, group[1], on=['Stkcd', 'phase', 'MktSize'])
            cols[axis] = [group[0]]
            self.intersections(axis + 1, cols, df)
            cols.pop(axis)
            df = tmp

    def getAxis(self):
        return len(self.groups)

    def __getitem__(self, item):
        if not self.isPrepared:
            self.prepare()
        res = self.frame
        for axis in range(self.getAxis()):
            res = res[res[axis] == item[axis]]
        return res.iat[0, self.getAxis()].reset_index().drop(['index'], axis=1)

    def getVMReturn(self, date, section):
        """
        Value-weighted return of one group; vectorized when the class is initialized
        :param date: month, formatted YYYY-MM
        :param section: group tags, one per sorting dimension
        :return: value-weighted return of the group in that month
        """
        df = self.__getitem__(section)
        df = df[df['date'] == date]
        return np.sum(df['Mretwd'] * df['MktSize']) / np.sum(df['MktSize'])

    def getVMExcessReturn(self, date, section):
        """
        Value-weighted excess return (over the risk-free rate) of one group;
        vectorized when the class is initialized
        :param date: month, formatted YYYY-MM
        :param section: group tags, one per sorting dimension
        :return: value-weighted excess return of the group in that month
        """
        df = self.__getitem__(section)
        df = df[df['date'] == date]
        return np.sum((df['Mretwd'] - df['Nrrmtdt']) * df['MktSize']) / np.sum(df['MktSize'])

    @staticmethod
    def date2Phase(df):
        """
        Map each date to its phase t, which runs from July of year t to June of year t+1
        :param df: data table with a 'date' column formatted YYYY-MM
        :return: the same table with a 'phase' column added
        """
        df['phase'] = df['date'].apply(lambda x: datetime.strptime(x, "%Y-%m").year
                                       if datetime.strptime(x, "%Y-%m").month >= 7
                                       else datetime.strptime(x, "%Y-%m").year - 1)
        return df


def Mktrf_MethodOne():
    df = pd.read_csv(os.path.join(data_path, 'stockReturns.csv'))
    df = df[['date', 'Cmretwdos', 'Nrrmtdt']].drop_duplicates().sort_values(by='date')
    df['Mktrf'] = df['Cmretwdos'] - df['Nrrmtdt']
    return df[['date', 'Mktrf']]


def SMB_MethodOne(t):
    # SMB from the Size x BM sorts
    g = Grouping(t)
    g.append([('S', 0), ('B', 0.5)], 'Size')
    g.append([('L', 0), ('N', 0.3), ('H', 0.7)], 'BM')
    SMB_BM = (g.getVMReturn(currentPhase, section=['S', 'L']) + g.getVMReturn(currentPhase, section=['S', 'N'])
              + g.getVMReturn(currentPhase, section=['S', 'H'])) / 3 - \
             (g.getVMReturn(currentPhase, section=['B', 'L']) + g.getVMReturn(currentPhase, section=['B', 'N'])
              + g.getVMReturn(currentPhase, section=['B', 'H'])) / 3

    # SMB from the Size x OP sorts
    g = Grouping(t)
    g.append([('S', 0), ('B', 0.5)], 'Size')
    g.append([('W', 0), ('N', 0.3), ('R', 0.7)], 'OP')
    SMB_OP = (g.getVMReturn(currentPhase, section=['S', 'W']) + g.getVMReturn(currentPhase, section=['S', 'N'])
              + g.getVMReturn(currentPhase, section=['S', 'R'])) / 3 - \
             (g.getVMReturn(currentPhase, section=['B', 'W']) + g.getVMReturn(currentPhase, section=['B', 'N'])
              + g.getVMReturn(currentPhase, section=['B', 'R'])) / 3

    # SMB from the Size x Inv sorts
    g = Grouping(t)
    g.append([('S', 0), ('B', 0.5)], 'Size')
    g.append([('C', 0), ('N', 0.3), ('A', 0.7)], 'Inv')
    SMB_Inv = (g.getVMReturn(currentPhase, section=['S', 'C']) + g.getVMReturn(currentPhase, section=['S', 'N'])
               + g.getVMReturn(currentPhase, section=['S', 'A'])) / 3 - \
              (g.getVMReturn(currentPhase, section=['B', 'C']) + g.getVMReturn(currentPhase, section=['B', 'N'])
               + g.getVMReturn(currentPhase, section=['B', 'A'])) / 3

    return pd.DataFrame({'date': currentPhase, 'SMB': (SMB_BM + SMB_OP + SMB_Inv) / 3})


def HML_MethodOne(t):
    g = Grouping(t)
    g.append([('S', 0), ('B', 0.5)], 'Size')
    g.append([('L', 0), ('N', 0.3), ('H', 0.7)], 'BM')
    HML = (g.getVMReturn(currentPhase, section=['S', 'H']) + g.getVMReturn(currentPhase, section=['B', 'H'])) / 2 - \
          (g.getVMReturn(currentPhase, section=['S', 'L']) + g.getVMReturn(currentPhase, section=['B', 'L'])) / 2
    return pd.DataFrame({'date': currentPhase, 'HML': HML})


def RMW_MethodOne(t):
    g = Grouping(t)
    g.append([('S', 0), ('B', 0.5)], 'Size')
    g.append([('W', 0), ('N', 0.3), ('R', 0.7)], 'OP')
    RMW = (g.getVMReturn(currentPhase, section=['S', 'R']) + g.getVMReturn(currentPhase, section=['B', 'R'])) / 2 - \
          (g.getVMReturn(currentPhase, section=['S', 'W']) + g.getVMReturn(currentPhase, section=['B', 'W'])) / 2

    return pd.DataFrame({'date': currentPhase, 'RMW': RMW})


def CMA_MethodOne(t):
    g = Grouping(t)
    g.append([('S', 0), ('B', 0.5)], 'Size')
    g.append([('C', 0), ('N', 0.3), ('A', 0.7)], 'Inv')
    CMA = (g.getVMReturn(currentPhase, section=['S', 'C']) + g.getVMReturn(currentPhase, section=['B', 'C'])) / 2 - \
          (g.getVMReturn(currentPhase, section=['S', 'A']) + g.getVMReturn(currentPhase, section=['B', 'A'])) / 2

    return pd.DataFrame({'date': currentPhase, 'CMA': CMA})


@vectorize
def FF5(t):
    # Write the FF5 factor table for phase t
    global currentPhase
    currentPhase = phase[t]
    res = pd.merge(Mktrf_MethodOne(),
                   pd.merge(SMB_MethodOne(t),
                            pd.merge(HML_MethodOne(t),
                                     pd.merge(RMW_MethodOne(t), CMA_MethodOne(t), on=['date']), on=['date']),
                            on=['date']), on=['date'])
    # The output folder is not part of the repository, so create it if needed
    os.makedirs("factor_result", exist_ok=True)
    res.to_csv(os.path.join("factor_result", t + "FF5.csv"), index=False)


def PortfolioExcessReturn(t, row_type, row_num, col_type='Size', col_num=5, csv=True):
    """
    Sort stocks on two indicators and compute the portfolio excess returns
    :param t: phase
    :param row_type: indicator used for the rows
    :param row_num: number of row groups
    :param col_type: indicator used for the columns (default Size)
    :param col_num: number of column groups (default 5)
    :param csv: whether to write the result to a CSV file
    :return: DataFrame with one row per month and portfolio
    """
    global currentPhase
    currentPhase = phase[t]
    g = Grouping(t)
    g.append([(i, i * 1 / row_num) for i in range(row_num)], row_type)
    g.append([(i, i * 1 / col_num) for i in range(col_num)], col_type)

    @vectorize
    def formatDataframe(row_count, col_count):
        return pd.DataFrame({"date": currentPhase,
                             "row_type": [row_type] * len(currentPhase),
                             "row_count": [row_count + 1] * len(currentPhase),
                             "col_type": [col_type] * len(currentPhase),
                             "col_count": [col_count + 1] * len(currentPhase),
                             "ExcessReturn": g.getVMExcessReturn(currentPhase, section=[row_count, col_count])}
                            , index=[i for i in range(len(currentPhase))])

    df = pd.DataFrame().append(
        list(formatDataframe(np.arange(0, row_num * col_num, 1) % row_num,
                             np.arange(0, row_num * col_num, 1) // row_num))
    )
    if csv:
        # The output folder is not part of the repository, so create it if needed
        os.makedirs("group_result", exist_ok=True)
        df.to_csv(os.path.join("group_result", t + "_" + row_type + "_" + col_type + ".csv"), index=False)
    return df


if __name__ == '__main__':
    FF5(['2016', '2017'])
    PortfolioExcessReturn('2016', 'OP', 5)
    PortfolioExcessReturn('2017', 'OP', 5)
    PortfolioExcessReturn('2016', 'Inv', 5)
    PortfolioExcessReturn('2017', 'Inv', 5)
    PortfolioExcessReturn('2016', 'BM', 5)
    PortfolioExcessReturn('2017', 'BM', 5)
--------------------------------------------------------------------------------
/preprocess.py:
--------------------------------------------------------------------------------
import pandas as pd
import os
from datetime import datetime, timedelta
import numpy as np

data_path = './data/preprocessed'


# Utils
def convertCode(code) -> str:
    """
    Convert a stock code to the standard 6-digit form
    :param code: stock code
    :return: standard 6-digit code
    """
    return "{:06d}".format(int(code))


def generateDate(start, end) -> list:
    """
    List the months between two dates as YYYY-MM strings
    :param start: first date, formatted YYYY-MM-DD
    :param end: last date, formatted YYYY-MM-DD
    :return: list of YYYY-MM strings
    """
    start = datetime.strptime(start, "%Y-%m-%d")
    end = datetime.strptime(end, "%Y-%m-%d")
    current = start
    res = []
    while current < end:
        res.append(current.strftime("%Y-%m"))
        # Day 28 plus 4 days always lands in the next month
        current = current.replace(day=28) + timedelta(days=4)
    return res


# IO
def excel2Df(directory: str, **kwargs) -> pd.DataFrame:
    """
    Merge all Excel files in a folder into one DataFrame
    :rtype: pd.DataFrame
    :param directory: path of the folder holding the xlsx files
    :return: the merged DataFrame
    """
    # Temporary list of the DataFrames read so far
    df_ls = []
    # Walk through every Excel file under the folder
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".xlsx"):
                df_ls.append(pd.read_excel(os.path.join(root, file), **kwargs))
    # Merge and return
    return pd.DataFrame().append(df_ls, sort=False)


# Filters
def dateFilter(df: pd.DataFrame, date_col: str, date_format: str = None) -> pd.DataFrame:
    """
    Keep only the rows dated June or December
    :param df: data table
    :param date_col: name of the date column
    :param date_format: date format
    :return: the filtered DataFrame
    """
    if date_format is None:
        # Default date format
        date_format = "%Y-%m-%d"
    return df[df[date_col].apply(lambda x: datetime.strptime(x, date_format).month in [6, 12])]


def MainBoardFilter(df: pd.DataFrame, code_col: str) -> pd.DataFrame:
    """
    Keep only A-share stocks, selected by stock-code prefix
    :param df: data table
    :param code_col: name of the stock-code column
    :return: the filtered data
    """
    return df[np.isin(df[code_col].apply(convertCode).str[:3], ['000', '600', '601', '603', '605'])]


def FinanceFrames():
    # Merge and preprocess the financial statements
    balanceSheet = excel2Df("./data/raw/Balance sheet", index_col=0)
    IncomeSheet = excel2Df("./data/raw/Income statement")
    balanceSheet = balanceSheet[balanceSheet['Typrep'] == 'A'].sort_values(by=['Stkcd', 'Accper'])
    IncomeSheet = IncomeSheet[IncomeSheet['Typrep'] == 'A'].sort_values(by=['Stkcd', 'Accper'])
    # Date filter and board filter
    balanceSheet = MainBoardFilter(dateFilter(balanceSheet, 'Accper'), 'Stkcd').drop(['Typrep'], axis=1)
    IncomeSheet = MainBoardFilter(dateFilter(IncomeSheet, 'Accper'), 'Stkcd').drop(['Typrep'], axis=1)
    pd.merge(balanceSheet, IncomeSheet, on=['Stkcd', 'Accper'], how='outer'). \
        to_csv("./data/preprocessed/finance.csv", index=False)


def StockReturnFrames():
    # Merge and preprocess
    # Market return and risk-free rate
    mktmnth = excel2Df("./data/raw/mktmnth")
    mktmnth = mktmnth[mktmnth['Markettype'] == 5][['Trdmnt', 'Cmretwdos']]
    rf = excel2Df("./data/raw/rf")[['Clsdt', 'Nrrmtdt']]
    rf['month'] = rf['Clsdt'].str[:7]
    # Keep the last observation of each month
    rf = rf.groupby('month').apply(lambda x: x.iloc[np.argmax(x['Clsdt'].values)])
    rf['Nrrmtdt'] = rf['Nrrmtdt'] / 100
    marketRf = pd.merge(mktmnth.rename(columns={"Trdmnt": "date"}),
                        rf[['month', 'Nrrmtdt']].rename(columns={"month": "date"}),
                        on='date', how='outer').sort_values(by=['date'])
    # Individual stock returns
    stockmnth = excel2Df("./data/raw/stockmnth")[['Stkcd', 'Trdmnt', 'Msmvosd', 'Mretwd', 'Markettype']]
    stockmnth = stockmnth[np.isin(stockmnth['Markettype'], [1, 4])]. \
        rename(columns={'Trdmnt': 'date'}).drop('Markettype', axis=1)
    pd.merge(stockmnth, marketRf, on='date').sort_values(by=['Stkcd', 'date']).dropna(). \
        to_csv("data/preprocessed/stockReturns.csv", index=False)


def extraFactors():
    finance = pd.read_csv(os.path.join(data_path, 'finance.csv')).rename(columns={'Accper': "date"})
    finance['date'] = finance['date'].str[:7]

    stockReturns = pd.read_csv(os.path.join(data_path, 'stockReturns.csv'))

    df = pd.merge(stockReturns, finance, on=['Stkcd', 'date'])
    # Size: market value at the end of June of year t
    Size = stockReturns.groupby(['Stkcd']).apply(lambda x: pd.DataFrame(
        {
            'phase': [2016, 2017],
            'Size': [
                x[x['date'] == '2016-06']['Msmvosd'].iat[0] if len(x[x['date'] == '2016-06']) > 0 else np.nan,
                x[x['date'] == '2017-06']['Msmvosd'].iat[0] if len(x[x['date'] == '2017-06']) > 0 else np.nan
            ]
        }).dropna()).reset_index().drop(['level_1'], axis=1)
    # B/M ratio: book equity over market value at the end of year t-1
    BM = df.groupby(['Stkcd']).apply(lambda x: pd.DataFrame({
        'phase': [2016, 2017],
        'BM': [
            x[x['date'] == '2015-12']['total_equity'].iat[0] / x[x['date'] == '2015-12']['Msmvosd'].iat[0]
            if len(x[x['date'] == '2015-12']) > 0 else np.nan,
            x[x['date'] == '2016-12']['total_equity'].iat[0] / x[x['date'] == '2016-12']['Msmvosd'].iat[0]
            if len(x[x['date'] == '2016-12']) > 0 else np.nan
        ]
    }).dropna()).reset_index().drop(['level_1'], axis=1)

    # Inv: growth of total assets from year t-2 to year t-1
    Inv = finance.groupby(['Stkcd']).apply(lambda x: pd.DataFrame({
        'phase': [2016, 2017],
        'Inv': [
            (x[x['date'] == '2015-12']['total_assets'].iat[0] - x[x['date'] == '2014-12']['total_assets'].iat[0])
            / x[x['date'] == '2014-12']['total_assets'].iat[0]
            if len(x[x['date'] == '2015-12']) > 0 and len(x[x['date'] == '2014-12']) > 0 else np.nan,
            (x[x['date'] == '2016-12']['total_assets'].iat[0] - x[x['date'] == '2015-12']['total_assets'].iat[0])
            / x[x['date'] == '2015-12']['total_assets'].iat[0]
            if len(x[x['date'] == '2016-12']) > 0 and len(x[x['date'] == '2015-12']) > 0 else np.nan
        ]
    }).dropna()).reset_index().drop(['level_1'], axis=1)

    # OP: operating profit over book equity at the end of year t-1
    OP = df.groupby(['Stkcd']).apply(lambda x: pd.DataFrame({
        'phase': [2016, 2017],
        'OP': [
            x[x['date'] == '2015-12']['operating profit'].iat[0] / x[x['date'] == '2015-12']['total_equity'].iat[0]
            if len(x[x['date'] == '2015-12']) > 0 else np.nan,
            x[x['date'] == '2016-12']['operating profit'].iat[0] / x[x['date'] == '2016-12']['total_equity'].iat[0]
            if len(x[x['date'] == '2016-12']) > 0 else np.nan
        ]
    }).dropna()).reset_index().drop(['level_1'], axis=1)

    pd.merge(pd.merge(BM, Inv), pd.merge(OP, Size), on=['Stkcd', 'phase']). \
        to_csv(os.path.join(data_path, "SortCols.csv"), index=False)


if __name__ == '__main__':
    FinanceFrames()
    StockReturnFrames()
    extraFactors()

--------------------------------------------------------------------------------
/regression.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import os
import statsmodels.api as sm

# Read and concatenate the FF5 factor data
data1 = pd.read_csv("factor_result/2016FF5.csv")
data2 = pd.read_csv("factor_result/2017FF5.csv")
data_ff5 = data1.append(data2)
data_ff5 = data_ff5.reset_index(drop=True)
# Read the portfolio excess returns
for fac in ['BM', 'Inv', 'OP']:
    data1 = pd.read_csv("group_result/2016_" + fac + "_Size.csv")
    data2 = pd.read_csv("group_result/2017_" + fac + "_Size.csv")
    if fac == 'BM':
        Size_BM = data1.append(data2)
    elif fac == 'Inv':
        Size_Inv = data1.append(data2)
    elif fac == 'OP':
        Size_OP = data1.append(data2)
# Five-factor regressions on the BM x Size portfolios.
# In the group_result files row_count is the group of the row indicator (BM, Inv or OP)
# and col_count is the Size group, so i labels the indicator and j labels Size.
regression_BM = pd.DataFrame(columns=['Size', 'BM', 'Const', 'Mktrf', 'SMB', 'HML', 'RMW', 'CMA', 'R-squared'])
regression_BM_pvalue = pd.DataFrame(columns=['Size', 'BM', 'Const_p', 'Mktrf_p', 'SMB_p', 'HML_p', 'RMW_p', 'CMA_p'])
regression_BM_tvalue = pd.DataFrame(columns=['Size', 'BM', 'Const_t', 'Mktrf_t', 'SMB_t', 'HML_t', 'RMW_t', 'CMA_t'])
x_ = Size_BM['row_count'].value_counts().shape[0]
y_ = Size_BM['col_count'].value_counts().shape[0]
for i in range(1, x_ + 1):
    for j in range(1, y_ + 1):
        X = data_ff5.iloc[0:19, 1:6]
        X = sm.add_constant(X)
        y = Size_BM[(Size_BM['row_count'] == i) & (Size_BM['col_count'] == j)].iloc[:, 5]
        y = y.reset_index(drop=True)
        model = sm.OLS(y, X)
        results = model.fit()
        regression_BM = regression_BM.append([{'Size': j, 'BM': i, 'Const': results.params[0],
                                               'Mktrf': results.params[1], 'SMB': results.params[2],
                                               'HML': results.params[3], 'RMW': results.params[4],
                                               'CMA': results.params[5], 'R-squared': results.rsquared}])
        regression_BM_pvalue = regression_BM_pvalue.append([{'Size': j, 'BM': i, 'Const_p': results.pvalues[0],
                                                             'Mktrf_p': results.pvalues[1], 'SMB_p': results.pvalues[2],
                                                             'HML_p': results.pvalues[3], 'RMW_p': results.pvalues[4],
                                                             'CMA_p': results.pvalues[5]}])
        regression_BM_tvalue = regression_BM_tvalue.append([{'Size': j, 'BM': i, 'Const_t': results.tvalues[0],
                                                             'Mktrf_t': results.tvalues[1], 'SMB_t': results.tvalues[2],
                                                             'HML_t': results.tvalues[3], 'RMW_t': results.tvalues[4],
                                                             'CMA_t': results.tvalues[5]}])
# Five-factor regressions on the Inv x Size portfolios
regression_Inv = pd.DataFrame(columns=['Size', 'Inv', 'Const', 'Mktrf', 'SMB', 'HML', 'RMW', 'CMA', 'R-squared'])
regression_Inv_pvalue = pd.DataFrame(columns=['Size', 'Inv', 'Const_p', 'Mktrf_p', 'SMB_p', 'HML_p', 'RMW_p', 'CMA_p'])
regression_Inv_tvalue = pd.DataFrame(columns=['Size', 'Inv', 'Const_t', 'Mktrf_t', 'SMB_t', 'HML_t', 'RMW_t', 'CMA_t'])
x_ = Size_Inv['row_count'].value_counts().shape[0]
y_ = Size_Inv['col_count'].value_counts().shape[0]
for i in range(1, x_ + 1):
    for j in range(1, y_ + 1):
        X = data_ff5.iloc[0:19, 1:6]
        X = sm.add_constant(X)
        y = Size_Inv[(Size_Inv['row_count'] == i) & (Size_Inv['col_count'] == j)].iloc[:, 5]
        y = y.reset_index(drop=True)
        model = sm.OLS(y, X)
        results = model.fit()
        regression_Inv = regression_Inv.append([{'Size': j, 'Inv': i, 'Const': results.params[0],
                                                 'Mktrf': results.params[1], 'SMB': results.params[2],
                                                 'HML': results.params[3], 'RMW': results.params[4],
                                                 'CMA': results.params[5], 'R-squared': results.rsquared}])
        regression_Inv_pvalue = regression_Inv_pvalue.append([{'Size': j, 'Inv': i, 'Const_p': results.pvalues[0],
                                                               'Mktrf_p': results.pvalues[1],
                                                               'SMB_p': results.pvalues[2], 'HML_p': results.pvalues[3],
                                                               'RMW_p': results.pvalues[4],
                                                               'CMA_p': results.pvalues[5]}])
        regression_Inv_tvalue = regression_Inv_tvalue.append([{'Size': j, 'Inv': i, 'Const_t': results.tvalues[0],
                                                               'Mktrf_t': results.tvalues[1], 'SMB_t': results.tvalues[2],
                                                               'HML_t': results.tvalues[3], 'RMW_t': results.tvalues[4],
                                                               'CMA_t': results.tvalues[5]}])
# Five-factor regressions on the OP x Size portfolios
regression_OP = pd.DataFrame(columns=['Size', 'OP', 'Const', 'Mktrf', 'SMB', 'HML', 'RMW', 'CMA', 'R-squared'])
regression_OP_pvalue = pd.DataFrame(columns=['Size', 'OP', 'Const_p', 'Mktrf_p', 'SMB_p', 'HML_p', 'RMW_p', 'CMA_p'])
regression_OP_tvalue = pd.DataFrame(columns=['Size', 'OP', 'Const_t', 'Mktrf_t', 'SMB_t', 'HML_t', 'RMW_t', 'CMA_t'])
x_ = Size_OP['row_count'].value_counts().shape[0]
y_ = Size_OP['col_count'].value_counts().shape[0]
for i in range(1, x_ + 1):
    for j in range(1, y_ + 1):
        X = data_ff5.iloc[0:19, 1:6]
        X = sm.add_constant(X)
        y = Size_OP[(Size_OP['row_count'] == i) & (Size_OP['col_count'] == j)].iloc[:, 5]
        y = y.reset_index(drop=True)
        model = sm.OLS(y, X)
        results = model.fit()
        regression_OP = regression_OP.append([{'Size': j, 'OP': i, 'Const': results.params[0],
                                               'Mktrf': results.params[1], 'SMB': results.params[2],
                                               'HML': results.params[3], 'RMW': results.params[4],
                                               'CMA': results.params[5], 'R-squared': results.rsquared}])
        regression_OP_pvalue = regression_OP_pvalue.append([{'Size': j, 'OP': i, 'Const_p': results.pvalues[0],
                                                             'Mktrf_p': results.pvalues[1], 'SMB_p': results.pvalues[2],
                                                             'HML_p': results.pvalues[3], 'RMW_p': results.pvalues[4],
                                                             'CMA_p': results.pvalues[5]}])
        regression_OP_tvalue = regression_OP_tvalue.append([{'Size': j, 'OP': i, 'Const_t': results.tvalues[0],
                                                             'Mktrf_t': results.tvalues[1], 'SMB_t': results.tvalues[2],
                                                             'HML_t': results.tvalues[3], 'RMW_t': results.tvalues[4],
                                                             'CMA_t': results.tvalues[5]}])

if not os.path.exists("regression_result/"):
    os.mkdir("regression_result/")
# Save the tables
regression_BM.to_csv('regression_result/regression_BM.csv', index=False)
regression_BM_pvalue.to_csv('regression_result/regression_BM_pvalue.csv', index=False)
regression_BM_tvalue.to_csv('regression_result/regression_BM_tvalue.csv', index=False)
regression_Inv.to_csv('regression_result/regression_Inv.csv', index=False)
regression_Inv_pvalue.to_csv('regression_result/regression_Inv_pvalue.csv', index=False)
regression_Inv_tvalue.to_csv('regression_result/regression_Inv_tvalue.csv', index=False)
regression_OP.to_csv('regression_result/regression_OP.csv', index=False)
regression_OP_pvalue.to_csv('regression_result/regression_OP_pvalue.csv', index=False)
regression_OP_tvalue.to_csv('regression_result/regression_OP_tvalue.csv', index=False)


# Regress each factor on the other four.
# The entry for the dependent variable itself is a placeholder 0, not an estimate.
regression_factors = pd.DataFrame(columns=['DepVar', 'Const', 'Mktrf', 'SMB', 'HML', 'RMW', 'CMA', 'R-squared'])
regression_factors_tvalue = pd.DataFrame(columns=['DepVar', 'Const_t', 'Mktrf_t', 'SMB_t', 'HML_t', 'RMW_t', 'CMA_t'])
regression_factors_pvalue = pd.DataFrame(columns=['DepVar', 'Const_p', 'Mktrf_p', 'SMB_p', 'HML_p', 'RMW_p', 'CMA_p'])
# Dependent variable: Mktrf
y = data_ff5.loc[:, ['Mktrf']]
X = data_ff5.loc[:, ['SMB', 'HML', 'RMW', 'CMA']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
regression_factors = regression_factors.append([{'DepVar': 'Mktrf', 'Const': results.params[0],
                                                 'Mktrf': 0, 'SMB': results.params[1],
                                                 'HML': results.params[2], 'RMW': results.params[3],
                                                 'CMA': results.params[4], 'R-squared': results.rsquared}])
regression_factors_tvalue = regression_factors_tvalue.append([{'DepVar': 'Mktrf', 'Const_t': results.tvalues[0],
                                                               'Mktrf_t': 0, 'SMB_t': results.tvalues[1],
                                                               'HML_t': results.tvalues[2], 'RMW_t': results.tvalues[3],
                                                               'CMA_t': results.tvalues[4]}])
regression_factors_pvalue = regression_factors_pvalue.append([{'DepVar': 'Mktrf', 'Const_p': results.pvalues[0],
                                                               'Mktrf_p': 0, 'SMB_p': results.pvalues[1],
                                                               'HML_p': results.pvalues[2], 'RMW_p': results.pvalues[3],
                                                               'CMA_p': results.pvalues[4]}])
# Dependent variable: SMB
y = data_ff5.loc[:, ['SMB']]
X = data_ff5.loc[:, ['Mktrf', 'HML', 'RMW', 'CMA']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
regression_factors = regression_factors.append([{'DepVar': 'SMB', 'Const': results.params[0],
                                                 'Mktrf': results.params[1], 'SMB': 0,
                                                 'HML': results.params[2], 'RMW': results.params[3],
                                                 'CMA': results.params[4], 'R-squared': results.rsquared}])
regression_factors_tvalue = regression_factors_tvalue.append([{'DepVar': 'SMB', 'Const_t': results.tvalues[0],
                                                               'Mktrf_t': results.tvalues[1], 'SMB_t': 0,
                                                               'HML_t': results.tvalues[2], 'RMW_t': results.tvalues[3],
                                                               'CMA_t': results.tvalues[4]}])
regression_factors_pvalue = regression_factors_pvalue.append([{'DepVar': 'SMB', 'Const_p': results.pvalues[0],
                                                               'Mktrf_p': results.pvalues[1], 'SMB_p': 0,
                                                               'HML_p': results.pvalues[2], 'RMW_p': results.pvalues[3],
                                                               'CMA_p': results.pvalues[4]}])
# Dependent variable: HML
y = data_ff5.loc[:, ['HML']]
X = data_ff5.loc[:, ['Mktrf', 'SMB', 'RMW', 'CMA']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
regression_factors = regression_factors.append([{'DepVar': 'HML', 'Const': results.params[0],
                                                 'Mktrf': results.params[1], 'SMB': results.params[2],
                                                 'HML': 0, 'RMW': results.params[3],
                                                 'CMA': results.params[4], 'R-squared': results.rsquared}])
regression_factors_tvalue = regression_factors_tvalue.append([{'DepVar': 'HML', 'Const_t': results.tvalues[0],
                                                               'Mktrf_t': results.tvalues[1], 'SMB_t': results.tvalues[2],
                                                               'HML_t': 0, 'RMW_t': results.tvalues[3],
                                                               'CMA_t': results.tvalues[4]}])
regression_factors_pvalue = regression_factors_pvalue.append([{'DepVar': 'HML', 'Const_p': results.pvalues[0],
                                                               'Mktrf_p': results.pvalues[1], 'SMB_p': results.pvalues[2],
                                                               'HML_p': 0, 'RMW_p': results.pvalues[3],
                                                               'CMA_p': results.pvalues[4]}])
# Dependent variable: RMW
y = data_ff5.loc[:, ['RMW']]
X = data_ff5.loc[:, ['Mktrf', 'SMB', 'HML', 'CMA']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
regression_factors = regression_factors.append([{'DepVar': 'RMW', 'Const': results.params[0],
                                                 'Mktrf': results.params[1], 'SMB': results.params[2],
                                                 'HML': results.params[3], 'RMW': 0,
                                                 'CMA': results.params[4], 'R-squared': results.rsquared}])
regression_factors_tvalue = regression_factors_tvalue.append([{'DepVar': 'RMW', 'Const_t': results.tvalues[0],
                                                               'Mktrf_t': results.tvalues[1], 'SMB_t': results.tvalues[2],
                                                               'HML_t': results.tvalues[3], 'RMW_t': 0,
                                                               'CMA_t': results.tvalues[4]}])
regression_factors_pvalue = regression_factors_pvalue.append([{'DepVar': 'RMW', 'Const_p': results.pvalues[0],
                                                               'Mktrf_p': results.pvalues[1], 'SMB_p': results.pvalues[2],
                                                               'HML_p': results.pvalues[3], 'RMW_p': 0,
                                                               'CMA_p': results.pvalues[4]}])
# Dependent variable: CMA
y = data_ff5.loc[:, ['CMA']]
X = data_ff5.loc[:, ['Mktrf', 'SMB', 'HML', 'RMW']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
regression_factors = regression_factors.append([{'DepVar': 'CMA', 'Const': results.params[0],
                                                 'Mktrf': results.params[1], 'SMB': results.params[2],
                                                 'HML': results.params[3], 'RMW': results.params[4],
                                                 'CMA': 0, 'R-squared': results.rsquared}])
regression_factors_tvalue = regression_factors_tvalue.append([{'DepVar': 'CMA', 'Const_t': results.tvalues[0],
                                                               'Mktrf_t': results.tvalues[1], 'SMB_t': results.tvalues[2],
                                                               'HML_t': results.tvalues[3], 'RMW_t': results.tvalues[4],
                                                               'CMA_t': 0}])
regression_factors_pvalue = regression_factors_pvalue.append([{'DepVar': 'CMA', 'Const_p': results.pvalues[0],
                                                               'Mktrf_p': results.pvalues[1], 'SMB_p': results.pvalues[2],
                                                               'HML_p': results.pvalues[3], 'RMW_p': results.pvalues[4],
                                                               'CMA_p': 0}])

regression_factors.to_csv('regression_result/regression_factors.csv', index=False)
regression_factors_tvalue.to_csv('regression_result/regression_factors_tvalue.csv', index=False)
regression_factors_pvalue.to_csv('regression_result/regression_factors_pvalue.csv', index=False)
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pandas==1.1.4
statsmodels==0.12.1
numpy==1.22.0

--------------------------------------------------------------------------------