├── code
│   ├── transformer_to_stock
│   ├── stock_main.py
│   ├── config.py
│   ├── data_utils.py
│   ├── lstm_model.py
│   └── dataprocess.py
├── data
│   └── readme
└── README.md
--------------------------------------------------------------------------------
/code/transformer_to_stock:
--------------------------------------------------------------------------------
# to be updated
--------------------------------------------------------------------------------
/data/readme:
--------------------------------------------------------------------------------
train: a merged file of 270 HS300 constituent stocks, 2017-2018; test: January-March 2019
--------------------------------------------------------------------------------
/code/stock_main.py:
--------------------------------------------------------------------------------
from lstm_model import *


train_lstm()  # train the model; hyper-parameters are set in Arg (config.py)
# fining_tune_train()  # fine-tune on new data
# test()  # evaluate on the test set
--------------------------------------------------------------------------------
/code/config.py:
--------------------------------------------------------------------------------
from data_utils import *


# ------------------------- configuration ------------------------- #
class Arg:
    def __init__(self):
        # training-set path
        self.train_dir = 'd:\\data\\train_mix-17-18.csv'
        # test-set path
        self.test_dir = 'd:\\data\\train_mix-1904.csv'
        # update-data path
        self.new_dir = 'd:\\data\\train_mix-19.csv'
        # path of the stock to predict
        self.predict_dir = 'D:\\data\\000001.csv'
        # model checkpoint directory
        self.train_model_dir = 'D:\\logfile\\new_logfile\\'
        # fine-tuned model directory
        self.fining_turn_model_dir = 'D:\\logfile\\new_logfile\\finet\\'
        # training summary (TensorBoard) directory
        self.train_graph_dir = 'D:\\logfile\\new_logfile\\graph\\train_270\\'
        # validation-loss summary directory
        self.val_graph_dir = 'D:\\logfile\\new_logfile\\graph\\val_270\\'
        # model names
        self.model_name = 'model-270-17-19'
        self.model_name_ft = 'model-ft-01-03'
        self.rnn_unit = 128  # number of hidden units
        self.input_size = 6  # input dimension (i.e. number of features)
        self.output_size = 6  # output dimension (i.e. number of classes)
        self.layer_num = 3  # number of hidden layers
        self.lr = 0.0006  # learning rate
        self.time_step = 20  # time steps per window
        self.epoch = 50  # training epochs
        self.epoch_fining = 30  # fine-tuning epochs
        # length of a single stock (all stocks in one dataset must be filled to equal length)
        self.stock_len = get_data_len('D:\\data\\399300_1904.csv')
        # length of a single stock after the update (equal length required as well)
        self.stock_len_new = get_data_len('D:\\data\\399300_190103.csv')
        self.batch_size = 1024  # batch size
        self.ratio = 0.8  # train/validation split ratio
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# hs300_stock_predict
This project predicts HS300 stocks and covers stock download, data cleaning, LSTM model training, testing, and real-time prediction.<br/>

## File structure
1. `data_utils.py` downloads, cleans, and merges stock data. It contains nine functions.

get_stock_data(code, date1, date2, filename, length=-1)<br/>
Downloads one stock's data and saves six features: open, high, close, low, volume, and percent change.<br/>
Because it relies on the tushare interface, only roughly the most recent two years of data can be downloaded. (A crawler interface for Sina/NetEase Finance data will be opened up later.)<br/>
It takes `5` parameters:<br/>
`code` is the stock code to download; for example, 000001 is Ping An Bank, so passing '000001' downloads Ping An Bank's data.<br/>
`date1` is the start date, formatted like "2019-01-03"; `date2` is the end date in the same format.<br/>
`filename` is the directory where the data is stored, e.g. "D:\data\".<br/>
`length` filters stocks by history length. The default -1 keeps everything; if you set it to e.g. 200, stocks with fewer than 200 trading days are discarded rather than saved.<br/>
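A download-loop sketch along the lines of the sample at the bottom of `dataprocess.py` (the codes and target directory here are placeholders):

```python
from data_utils import get_stock_data

# any HS300 constituent codes; keep only stocks with at least 20 trading days
for code in ['000001', '600000', '600036']:
    get_stock_data(code, '2019-04-01', '2019-04-30', 'd:\\data\\201904\\', 20)
```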

get_hs300_data(date1, date2, filename)<br/>
Downloads the HS300 index data; the parameters follow the same format as get_stock_data.<br/>
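Usage is the same, minus the code argument (dates and directory below are placeholders); the output file is always named 399300.csv:

```python
from data_utils import get_hs300_data

get_hs300_data('2017-01-01', '2018-12-31', 'd:\\data\\')  # writes d:\data\399300.csv
```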

update_stock_data(filename)<br/>
Appends rows to a local stock file, from the file's last recorded date up to today. `filename` is the path of a single stock file, e.g. "d:\data\000001.csv".<br/>
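To refresh every local stock file at once, a small sketch (the folder is a placeholder):

```python
import glob
from data_utils import update_stock_data

# appends rows from each file's last date up to today
for path in glob.glob('d:\\data\\201904\\*.csv'):
    update_stock_data(path)
```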

get_data_len(file_path)<br/>
Returns the number of trading days in a single stock file. `file_path` is the path of the stock file, e.g. "d:\data\000001.csv".<br/>

select_stock_data(file1, file2, date1, date2)<br/>
Filters files already on disk by date. `date1` is the start date and `date2` the end date; `file1` is the source folder and `file2` is the folder where the date-filtered files are written.<br/>
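For example (folders are placeholders):

```python
from data_utils import select_stock_data

# keep only rows from 2019-01-01 through 2019-03-31
select_stock_data('d:\\data\\raw\\', 'd:\\data\\201901-03\\', '2019-01-01', '2019-03-31')
```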

crop_stock(df, date)<br/>
Currently unused.<br/>

fill_stock_data(target, sfile, tfile)<br/>
Pads each stock's suspension gaps to the HS300 index's time span, filling every missing day with the stock's previous trading day. It processes every file in the chosen folder.<br/>
Note: if the stock is already suspended on the start date, that leading gap is not filled; a fix is planned.<br/>
`target` is the reference stock, normally the HS300 index file for the same period; `sfile` is the source folder and `tfile` is the folder for the filled files.<br/>

merge_stock_data(filename, tfile)<br/>
Concatenates multiple files into one in order, e.g. merging the 300 HS300 constituent files into a single file, which is convenient as model input.<br/>
`filename` is the path of the folder holding the files to merge; `tfile` is the folder for the merged file.<br/>
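The commented-out calls in `data_utils.py`'s `__main__` show the intended order: fill first, then merge. All paths below are placeholders:

```python
from data_utils import fill_stock_data, merge_stock_data

# 1) pad suspension gaps against the HS300 index for the same window
fill_stock_data('d:\\data\\399300.csv', 'd:\\data\\201904\\', 'd:\\data\\201904-fill\\')
# 2) concatenate the equal-length files into train_mix.csv
merge_stock_data('d:\\data\\201904-fill\\', 'd:\\data\\')
```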

quchong(file)<br/>
Drops duplicate rows; currently unused.<br/>


2. `dataprocess.py` handles training-data processing, normalization, and so on; every model input comes from this file's interfaces.

get_train_data(batch_size=args.batch_size, time_step=args.time_step)<br/>
Builds the training data. Both parameters default to values given by the config file. It returns six variables: `batch_index` (training batch boundaries), `val_index` (validation batch boundaries), `train_x_1` (training inputs), `train_y_1` (training labels), `val_x` (validation inputs), and `val_y` (validation labels).<br/>
Note: because the whole pipeline works on one file merged from many stocks, the sliding time-step windows must stay inside each stock's own span; that is why the per-stock length must be specified in the config file.<br/>
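The label is the next day's `p_change`, bucketed into six one-hot classes; this is the rule implemented by the `encode_label()` helper in `dataprocess.py`:

```python
def encode_label(p_change):
    """Six-class one-hot over percent change: [<=-2, (-2,-1], (-1,0], (0,1], (1,2], >2]."""
    if p_change > 2:
        return [0, 0, 0, 0, 0, 1]
    elif 1 < p_change <= 2:
        return [0, 0, 0, 0, 1, 0]
    elif 0 < p_change <= 1:
        return [0, 0, 0, 1, 0, 0]
    elif -1 < p_change <= 0:
        return [0, 0, 1, 0, 0, 0]
    elif -2 < p_change <= -1:
        return [0, 1, 0, 0, 0, 0]
    else:
        return [1, 0, 0, 0, 0, 0]
```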

get_test_data(time_step=args.time_step)<br/>
Builds the test data and returns two variables: `test_x` (test inputs) and `test_y` (test labels).<br/>

get_update_data(time_step=args.time_step)<br/>
Splices the update data onto the last time_step-1 rows of the historical data so the whole batch can be processed at once, e.g. joining January-March 2019 data with December 2018 data. Returns the spliced `train_x` and `train_y`.<br/>
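The splice is needed because the windows that label the first new days must borrow their prefix from history. A comment-style illustration with time_step = 3 (h* = historical rows, n* = new rows):

```python
# history ... h4, h5 | new: n1, n2, n3      (time_step = 3)
# boundary windows built by get_update_data:
#   [h4, h5, n1] -> label(n2)   # time_step-1 historical rows + 1 new row
#   [h5, n1, n2] -> label(n3)   # 1 historical row + 2 new rows
# after that, windows lying entirely inside the new data are added as usual
```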
get_predict_data(file_name)<br/>
Downloads the current real-time quote, splices it onto the existing data, and returns the model input x. It takes one parameter, `file_name`: the file of the single stock to predict.

3. `config.py` is the configuration file. Every hyper-parameter and path used by the interfaces is changed here and takes effect globally.

4. `lstm_model.py` holds the model: training, fine-tuning, testing, and prediction.

train_lstm(time_step=args.time_step, val=True)<br/>
Trains the model; `val` toggles validation and defaults to on. Its data comes from `get_train_data()`.<br/>

fining_tune_train(time_step=args.time_step)<br/>
Fine-tunes the model, e.g. continuing training on new data from the saved checkpoint, or transfer learning. Its data can come from `get_update_data()`. Keep the epoch count small to avoid overfitting.<br/>

test(time_step=args.time_step)<br/>
Measures accuracy and F1 on the test set; its data comes from `get_test_data()`.<br/>
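For reporting, `test()` takes the 6-class argmax and collapses it to three classes; equivalently (`collapse_to_3` is an illustrative name, `test()` inlines this logic):

```python
def collapse_to_3(cls6):
    # 0 = within +/-1% (classes 2, 3), 1 = down >=1% (classes 0, 1), 2 = up >=1% (classes 4, 5)
    if cls6 >= 4:
        return 2
    elif cls6 <= 1:
        return 1
    return 0
```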

predict(time_step=args.time_step)<br/>
Predicts the next day's closing-price movement; its data comes from `get_predict_data(args.predict_dir)`.<br/>
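The six classes printed by `predict()` correspond to next-day percent-change buckets:

| class | predicted next-day close |
| --- | --- |
| 0 | down 2% or more |
| 1 | down 1%-2% |
| 2 | down less than 1% |
| 3 | up less than 1% |
| 4 | up 1%-2% |
| 5 | up more than 2% |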


5. `stock_main.py` is the entry point; it can call all of the interfaces above to run the corresponding stage.<br/>
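A sketch of the intended workflow (run one stage at a time, as the commented-out lines in `stock_main.py` suggest; `predict()` lives in `lstm_model.py` and can be called the same way):

```python
from lstm_model import *

train_lstm()            # train; hyper-parameters come from Arg in config.py
# fining_tune_train()   # then fine-tune the saved model on newer data
# test()                # accuracy / F1 on the test file
# predict()             # next-day movement for the stock in args.predict_dir
```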

## Related paper
Multi-class Prediction of Stock Prices Based on LSTM (基于LSTM的股票价格的多分类预测)<br/>

Paper: https://www.hanspub.org/journal/PaperInformation.aspx?paperID=32542<br/>

--------------------------------------------------------------------------------
/code/data_utils.py:
--------------------------------------------------------------------------------
import tushare as ts
import pandas as pd
import os
import time
import glob


# ---------------------- download one stock's data ------------------- #
# code: stock code; date format: 2019-05-21; filename: the root directory for the data, e.g. D:\data\
# length filters by history length; the default -1 keeps everything, or pass e.g. 200 to skip stocks with fewer than 200 days
def get_stock_data(code, date1, date2, filename, length=-1):
    df = ts.get_hist_data(code, start=date1, end=date2)
    df1 = pd.DataFrame(df)
    df1 = df1[['open', 'high', 'close', 'low', 'volume', 'p_change']]
    df1 = df1.sort_values(by='date')
    print('%s days of data in total' % len(df1))
    if length == -1:
        path = code + '.csv'
        df1.to_csv(os.path.join(filename, path))
    else:
        if len(df1) >= length:
            path = code + '.csv'
            df1.to_csv(os.path.join(filename, path))


# ---------------------- download HS300 index data ------------------- #
# date1: start date; date2: end date; filename: output directory
def get_hs300_data(date1, date2, filename):
    df = ts.get_hist_data('399300', start=date1, end=date2)
    df1 = pd.DataFrame(df)
    df1 = df1[['open', 'high', 'close', 'low', 'volume', 'p_change']]
    df1 = df1.sort_values(by='date')
    print('%s days of data in total' % len(df1))
    df1.to_csv(os.path.join(filename, '399300.csv'))


# ------------------------ update stock data ------------------------ #
# appends rows from the file's last recorded date up to today
# filename: full file path, e.g. d:\data\000001.csv
def update_stock_data(filename):
    (filepath, tempfilename) = os.path.split(filename)
    (stock_code, extension) = os.path.splitext(tempfilename)
    with open(filename, 'r') as f:
        df = pd.read_csv(f)
    print('latest date on file for stock {}: {}'.format(stock_code, df.iloc[-1, 0]))
    data_now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
    print('updating up to: %s' % data_now)
    nf = ts.get_hist_data(stock_code, str(df.iloc[-1, 0]), data_now)
    if nf is None:  # nothing returned (e.g. non-trading day)
        print('no new data')
        return
    nf = nf.sort_values(by='date')
    nf = nf.iloc[1:]  # drop the first row: it duplicates the file's last date
    print('%s days of new data' % len(nf))
    nf = pd.DataFrame(nf)
    nf = nf[['open', 'high', 'close', 'low', 'volume', 'p_change']]
    nf.to_csv(filename, mode='a', header=False)


# ------------------------ stock length ----------------------- #
# helper function
def get_data_len(file_path):
    with open(file_path) as f:
        df = pd.read_csv(f)
        return len(df)


# -------------------------- date filtering ------------------------- #
# filters local files by date; date1 is the start date, date2 the end date
# file1: source folder; file2: folder for the date-filtered files
def select_stock_data(file1, file2, date1, date2):
    csv_list = glob.glob(file1 + '*.csv')
    print('found %s CSV files' % len(csv_list))
    file_list = []
    for i in csv_list:
        (filepath, filename) = os.path.split(i)
        file_list.append(filename)
    for i in file_list:
        with open(os.path.join(file1, i), 'r') as f:
            df1 = pd.read_csv(f, header=0)
        df1['date'] = pd.to_datetime(df1['date'])
        df1 = df1.set_index('date')
        df2 = df1[date1:date2]
        df2.to_csv(os.path.join(file2, i))


# currently unused
def crop_stock(df, date):
    start = df.loc[df['日期'] == date].index[0]
    return df[start:]


# -------------------------- suspension filling ------------------------- #
# pads each stock's suspended days to the HS300 index's calendar, using the stock's previous trading day
# target: reference stock (normally the HS300 index for the same period); sfile: source folder; tfile: output folder
def fill_stock_data(target, sfile, tfile):
    with open(target) as f:
        target_df = pd.read_csv(f)
    csv_list = glob.glob(sfile + '*.csv')
    print('found %s CSV files' % len(csv_list))
    i = 1
    for item in csv_list:
        print('processing file %s' % i)
        with open(item) as f1:
            df2 = pd.read_csv(f1)
        mix_data = pd.merge(target_df, df2, how='outer', on="date")
        mix_data = mix_data.fillna(method='pad')  # forward-fill: suspended days reuse the previous trading day
        d1 = mix_data[['date', 'open_y', 'high_y', 'close_y', 'low_y', 'volume_y', 'p_change_y']]
        d1.rename(columns={'open_y': 'open', 'high_y': 'high', 'close_y': 'close', 'low_y': 'low',
                           'volume_y': 'volume', 'p_change_y': 'p_change'}, inplace=True)
        (filepath, filename) = os.path.split(item)
        d1.to_csv(os.path.join(tfile, filename), index=False)
        i += 1


# -------------------------- file merging ------------------------- #
# concatenates multiple files into one, appending at the end
# filename: folder with the files to merge; tfile: folder for the merged file
def merge_stock_data(filename, tfile):
    csv_list = glob.glob(filename + '*.csv')
    print('found %s CSV files' % len(csv_list))
    with open(csv_list[0]) as f:
        df = pd.read_csv(f)
    for i in range(1, len(csv_list)):
        with open(csv_list[i], 'rb') as f1:
            df1 = pd.read_csv(f1)
        df = pd.concat([df, df1])
    df.to_csv(tfile + 'train_mix.csv', index=None)


# drops duplicate rows; currently unused
def quchong(file):
    with open(file) as f:
        df = pd.read_csv(f, header=0)
    datalist = df.drop_duplicates()
    datalist.to_csv(file)


if __name__ == '__main__':
    print(1)
    # fill_stock_data('d:\\data\\399300.csv', 'd:\\data\\201904\\', 'd:\\data\\201904-fill\\')
    # merge_stock_data('d:\\data\\201904-fill\\', 'd:\\data\\')
    get_stock_data('000001', '2019-05-01', '2019-05-31', 'd:\\data\\')
    # print(get_data_len('d:\\data\\000001.csv'))
--------------------------------------------------------------------------------
/code/lstm_model.py:
--------------------------------------------------------------------------------
from dataprocess import *
import tensorflow as tf
from sklearn.metrics import classification_report


# ------------------------ network variables ------------------------ #
# input/output layer weights and biases
weights = {
    'in': tf.Variable(tf.random_normal([args.input_size, args.rnn_unit])),
    'out': tf.Variable(tf.random_normal([args.rnn_unit, args.output_size]))
}
biases = {
    'in': tf.Variable(tf.constant(0.1, shape=[args.rnn_unit, ])),
    'out': tf.Variable(tf.constant(0.1, shape=[args.output_size, ]))  # one bias per output class
}


# ------------------------ network definition ------------------------ #
def lstm(X):
    batch_size = tf.shape(X)[0]
    time_step = tf.shape(X)[1]
    w_in = weights['in']
    b_in = biases['in']
    input = tf.reshape(X, [-1, args.input_size])  # flatten to 2-D for the input projection
    input_rnn = tf.matmul(input, w_in) + b_in
    input_rnn = tf.reshape(input_rnn, [-1, time_step, args.rnn_unit])  # back to 3-D as the LSTM cells' input
    # build one independent cell per layer (reusing a single cell object would share weights across layers)
    cells = [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicLSTMCell(args.rnn_unit))
             for _ in range(args.layer_num)]
    mlstm_cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)
    init_state = mlstm_cell.zero_state(batch_size, dtype=tf.float32)
    # output_rnn holds every step's output; final_states is the last cell's state
    output_rnn, final_states = tf.nn.dynamic_rnn(mlstm_cell, input_rnn, initial_state=init_state, dtype=tf.float32)
    output = output_rnn[:, -1, :]  # keep only the last time step
    output = tf.reshape(output, [-1, args.rnn_unit])  # input to the output layer
    w_out = weights['out']
    b_out = biases['out']
    pred = tf.matmul(output, w_out) + b_out
    return pred, final_states


# ----------------------- training ------------------------------------ #
# trains the model; val (validation) defaults to on
def train_lstm(time_step=args.time_step, val=True):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    Y = tf.placeholder(tf.float32, shape=[None, 1, args.output_size])
    batch_index, val_index, train_x, train_y, val_x, val_y = get_train_data()
    print('train_y:{}, val_y:{}'.format(np.shape(train_y), np.shape(val_y)))
    pred, _ = lstm(X)
    # loss; reshape the labels to [batch, output_size] so they match the logits
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=pred, labels=tf.reshape(Y, [-1, args.output_size])))
    train_op = tf.train.AdamOptimizer(args.lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=15)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tf.summary.scalar('loss', loss)
        merged_summary_op = tf.summary.merge_all()
        train_writer = tf.summary.FileWriter(args.train_graph_dir, sess.graph)
        valid_writer = tf.summary.FileWriter(args.val_graph_dir)
        min_loss = 50  # initial threshold for checkpointing
        for i in range(args.epoch):
            for j in range(len(batch_index) - 1):
                summary_str, _, loss_ = sess.run([merged_summary_op, train_op, loss],
                                                 feed_dict={X: train_x[batch_index[j]:batch_index[j + 1]],
                                                            Y: train_y[batch_index[j]:batch_index[j + 1]]})
            if val:
                for j in range(len(val_index) - 1):
                    valid_str, loss_val = sess.run([merged_summary_op, loss],
                                                   feed_dict={X: val_x[val_index[j]:val_index[j + 1]],
                                                              Y: val_y[val_index[j]:val_index[j + 1]]})
                if i % 10 == 0:
                    print("------------------------------------------------------")
                    print('epoch: {}, train_loss: {:.4f}, val_loss: {:.4f}'.format(i + 1, loss_, loss_val))
                train_writer.add_summary(summary_str, i)
                valid_writer.add_summary(valid_str, i)
                # keep the checkpoint with the lowest validation loss
                if loss_val < min_loss:
                    min_loss = loss_val
                    print("model saved:", saver.save(sess, args.train_model_dir + args.model_name))


# ----------------------- fine-tuning ------------------------- #
# fine-tunes the model, e.g. continuing training on new data or transfer learning
# the epoch count should stay small to avoid overfitting
def fining_tune_train(time_step=args.time_step):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    Y = tf.placeholder(tf.float32, shape=[None, 1, args.output_size])
    train_x, train_y = get_update_data()
    pred, _ = lstm(X)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=pred, labels=tf.reshape(Y, [-1, args.output_size])))
    train_op = tf.train.AdamOptimizer(args.lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=15)
    min_loss = 1000
    with tf.Session() as sess:
        # restore parameters
        sess.run(tf.global_variables_initializer())
        saver.restore(sess, args.train_model_dir + args.model_name)
        print("model loaded")
        for i in range(args.epoch_fining):
            if len(train_x) < 10000:
                _, loss_ = sess.run([train_op, loss], feed_dict={X: train_x, Y: train_y})
                if i % 10 == 0:
                    print("------------------------------------------------------")
                    print('epoch: {}, train_loss: {:.4f}'.format(i + 1, loss_))
                if loss_ < min_loss:
                    min_loss = loss_
                    print("model saved:", saver.save(sess, args.fining_turn_model_dir + args.model_name_ft))
            else:
                b_z = args.batch_size
                for j in range((len(train_x) + b_z - 1) // b_z):  # ceil division so the last partial batch is used
                    _, loss_ = sess.run([train_op, loss], feed_dict={X: train_x[b_z * j:b_z * (j + 1)],
                                                                     Y: train_y[b_z * j:b_z * (j + 1)]})
                if i % 10 == 0:
                    print("------------------------------------------------------")
                    print('epoch: {}, train_loss: {:.4f}'.format(i + 1, loss_))
                if loss_ < min_loss:
                    min_loss = loss_
                    print("model saved:", saver.save(sess, args.fining_turn_model_dir + args.model_name_ft))


# ----------------------------- testing ------------------------------ #
# measures test-set accuracy and F1
def test(time_step=args.time_step):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    test_x, test_y = get_test_data()
    print("--------- data loaded --------")
    pred, _ = lstm(X)
    pre_dict = []
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        # restore parameters
        saver.restore(sess, args.train_model_dir + args.model_name)
        print("---------- model loaded ----------")
        if len(test_x) < 15000:
            prob = sess.run(pred, feed_dict={X: test_x})
            pre_dict.extend(prob)
        else:
            for i in range((len(test_x) + args.batch_size - 1) // args.batch_size):
                prob = sess.run(pred, feed_dict={X: test_x[args.batch_size * i:args.batch_size * (i + 1)]})
                pre_dict.extend(prob)
        pre_dict = np.array(pre_dict)
        test_label = np.array(test_y)
        a1 = list(np.argmax(pre_dict, 1))    # predicted 6-class labels
        a2 = list(np.argmax(test_label, 1))  # true 6-class labels
        a1_1 = []
        a2_1 = []
        # collapse to 3 classes: 0 = within +/-1%, 1 = down >=1%, 2 = up >=1%
        for i in a1:
            if i >= 4:
                a1_1.append(2)
            elif i <= 1:
                a1_1.append(1)
            else:
                a1_1.append(0)
        for i in a2:
            if i >= 4:
                a2_1.append(2)
            elif i <= 1:
                a2_1.append(1)
            else:
                a2_1.append(0)
        # report on the collapsed labels; sklearn expects (y_true, y_pred)
        acc = np.mean(np.array(a1_1) == np.array(a2_1))
        print(classification_report(a2_1, a1_1))
        print(acc)


# ----------------------------- prediction ------------------------------ #
# predicts the next day's closing-price movement
def predict(time_step=args.time_step):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    pre_x, code = get_predict_data(args.predict_dir)
    print("--------- data loaded --------")
    pred, _ = lstm(X)
    pre_y = []
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        # restore the fine-tuned checkpoint (fining_tune_train saves under model_name_ft)
        saver.restore(sess, args.fining_turn_model_dir + args.model_name_ft)
        print("---------- model loaded ----------")
        if len(pre_x) < 15000:
            prob = sess.run(pred, feed_dict={X: pre_x})
            pre_y.extend(prob)
        else:
            for i in range((len(pre_x) + args.batch_size - 1) // args.batch_size):
                prob = sess.run(pred, feed_dict={X: pre_x[args.batch_size * i:args.batch_size * (i + 1)]})
                pre_y.extend(prob)
        pre_y = np.array(pre_y)
        a = list(np.argmax(pre_y, 1))
        if a[0] == 0:
            print("model {} predicts stock {} will close down 2% or more tomorrow".format(args.model_name, code))
        elif a[0] == 1:
            print("model {} predicts stock {} will close down 1%-2% tomorrow".format(args.model_name, code))
        elif a[0] == 2:
            print("model {} predicts stock {} will close down less than 1% tomorrow".format(args.model_name, code))
        elif a[0] == 3:
            print("model {} predicts stock {} will close up less than 1% tomorrow".format(args.model_name, code))
        elif a[0] == 4:
            print("model {} predicts stock {} will close up 1%-2% tomorrow".format(args.model_name, code))
        else:
            print("model {} predicts stock {} will close up more than 2% tomorrow".format(args.model_name, code))
--------------------------------------------------------------------------------
/code/dataprocess.py:
--------------------------------------------------------------------------------
from data_utils import *
import numpy as np
from config import Arg


args = Arg()


# stock files used by this module must be named <stock code>.csv, e.g. 000001.csv
# map a percent change to the six-class one-hot label used everywhere below
def encode_label(p_change):
    if p_change > 2:
        return [0, 0, 0, 0, 0, 1]
    elif 1 < p_change <= 2:
        return [0, 0, 0, 0, 1, 0]
    elif 0 < p_change <= 1:
        return [0, 0, 0, 1, 0, 0]
    elif -1 < p_change <= 0:
        return [0, 0, 1, 0, 0, 0]
    elif -2 < p_change <= -1:
        return [0, 1, 0, 0, 0, 0]
    else:
        return [1, 0, 0, 0, 0, 0]


# -------------------------- training-set processing --------------------- #
def get_train_data(batch_size=args.batch_size, time_step=args.time_step):
    ratio = args.ratio
    stock_len = args.stock_len
    len_index = []
    batch_index = []
    val_index = []
    with open(args.train_dir) as f:
        data_otrain = pd.read_csv(f)
    data_train = data_otrain.iloc[:, 1:].values
    print(len(data_train))
    label_train = data_otrain.iloc[:, -1].values  # label = the p_change column
    normalized_train_data = (data_train - np.mean(data_train, axis=0)) / np.std(data_train, axis=0)  # standardize
    train_x, train_y = [], []
    # stock boundaries inside the merged file (all stocks have equal length stock_len)
    for i in range(len(normalized_train_data) + 1):
        if i % stock_len == 0:
            len_index.append(i)
    # slide time-step windows inside each stock's own span
    for i in range(len(len_index) - 1):
        for k in range(len_index[i], len_index[i + 1] - time_step - 1):
            x = normalized_train_data[k:k + time_step, :6]
            y = label_train[k + time_step, np.newaxis]
            temp_data = [encode_label(j) for j in y]  # one-hot encode
            train_x.append(x.tolist())
            train_y.append(temp_data)
    train_len = int(len(train_x) * ratio)  # 8:2 train/validation split
    train_x_1, train_y_1 = train_x[:train_len], train_y[:train_len]  # training inputs and labels
    val_x, val_y = train_x[train_len:], train_y[train_len:]  # validation inputs and labels
    # batch boundaries
    for i in range(len(train_x_1)):
        if i % batch_size == 0:
            batch_index.append(i)
    for i in range(len(val_x)):
        if i % batch_size == 0:
            val_index.append(i)
    batch_index.append(len(train_x_1))
    val_index.append(len(val_x))
    print(batch_index)
    print(val_index)
    print(np.shape(train_x))
    return batch_index, val_index, train_x_1, train_y_1, val_x, val_y


# -------------------------- test-set processing --------------------- #
# the test data must not be shorter than time_step
def get_test_data(time_step=args.time_step):
    stock_len = args.stock_len
    with open(args.test_dir) as f:
        df = pd.read_csv(f)
    data_test = df.iloc[:, 1:].values
    label_test = df.iloc[:, -1].values
    batch_index = []
    normalized_test_data = (data_test - np.mean(data_test, axis=0)) / np.std(data_test, axis=0)  # standardize
    test_x, test_y = [], []
    for i in range(len(normalized_test_data) + 1):
        if i % stock_len == 0:
            batch_index.append(i)
    for i in range(len(batch_index) - 1):
        if stock_len > time_step + 1:
            for j in range(batch_index[i], batch_index[i + 1] - time_step - 1):
                x = normalized_test_data[j:j + time_step, :]
                y = label_test[j + time_step, np.newaxis]
                temp_data = [encode_label(k) for k in y]  # label encoding
                test_x.append(x.tolist())
                test_y.extend(temp_data)
        else:
            # stock too short for sliding windows: take a single window per stock
            for j in range(batch_index[i], batch_index[i] + 1):
                x = normalized_test_data[j:j + time_step, :]
                y = label_test[j + time_step, np.newaxis]
                temp_data = [encode_label(k) for k in y]  # label encoding
                test_x.append(x.tolist())
                test_y.extend(temp_data)

    print(batch_index)
    print(np.shape(test_x))
    return test_x, test_y


# ------------------------- batch processing of new stock data -------------------- #
# splices the new data onto the last time_step-1 rows of the history so the whole batch
# can be processed at once, e.g. joining Jan-Mar 2019 data with Dec 2018 data
def get_update_data(time_step=args.time_step):
    stock_len = args.stock_len
    new_len = args.stock_len_new
    with open(args.train_dir) as f:
        df = pd.read_csv(f)
    with open(args.new_dir) as nf:
        ndf = pd.read_csv(nf)
    data_train = df.iloc[:, 1:].values
    data_new = ndf.iloc[:, 1:].values
    # standardize, each dataset with its own statistics
    mean, std = np.mean(data_train, axis=0), np.std(data_train, axis=0)
    new_mean, new_std = np.mean(data_new, axis=0), np.std(data_new, axis=0)
    normalized_data_train = (data_train - mean) / std
    normalized_data_new = (data_new - new_mean) / new_std
    label_new = ndf.iloc[:, -1].values
    train_x, train_y = [], []
    batch_index = []
    new_index = []
    for i in range(len(data_train) + 1):
        if i % stock_len == 0:
            batch_index.append(i)
    for i in range(len(data_new) + 1):
        if i % new_len == 0:
            new_index.append(i)
    print(batch_index)
    print(new_index)
    # boundary windows: time_step-1 down to 1 historical rows plus the first new rows
    for i in range(1, len(batch_index)):
        count = time_step
        while count > 1:
            last_data = []
            last_data.extend(normalized_data_train[batch_index[i] - count + 1:batch_index[i]])
            last_data.extend(normalized_data_new[0:time_step - count + 1])
            y = label_new[time_step - count + 1, np.newaxis]
            temp_data = [encode_label(k) for k in y]
            train_x.append(last_data)
            train_y.append(temp_data)
            count -= 1
    # windows lying entirely inside the new data, added as usual
    for i in range(len(new_index) - 1):
        for j in range(new_index[i], new_index[i + 1] - time_step - 1):
            x = normalized_data_new[j:j + time_step, :]
            y = label_new[j + time_step, np.newaxis]
            temp_data = [encode_label(k) for k in y]  # label encoding
            train_x.append(x.tolist())
            train_y.append(temp_data)
    print(np.shape(train_x))
    return train_x, train_y


# -------------------------- same-day data update ---------------------- #
# downloads the real-time quote, splices it onto the stored history, and returns the input x
# it only fetches one day and does not write the source file; if the gap is more than one day,
# download the batch first and use get_update_data instead
# file_name: path of the stock to predict, e.g. 'D:\data\201904\000001.csv'
def get_predict_data(file_name):
    (filepath, temp_file_name) = os.path.split(file_name)
    (stock_code, extension) = os.path.splitext(temp_file_name)
    with open(file_name) as f:
        hist = pd.read_csv(f)
    data_now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
    hist_data = hist[-args.time_step + 1:]  # last time_step-1 historical rows
    real_data = ts.get_realtime_quotes(stock_code)
    real_data = real_data[['open', 'high', 'price', 'low', 'volume']]
    real_data.insert(0, 'date', data_now)
    # today's percent change relative to yesterday's close (column 3 is price/close in both frames)
    p_change = (float(real_data.iloc[-1, 3]) - float(hist_data.iloc[-1, 3])) / float(hist_data.iloc[-1, 3]) * 100
    real_data['p_change'] = p_change
    real_data.rename(index=str, columns={"price": "close"}, inplace=True)
    real_data[['open', 'high', 'close', 'low', 'volume', 'p_change']] = \
        real_data[['open', 'high', 'close', 'low', 'volume', 'p_change']].astype('float')
    hist_data = hist_data.append(real_data)
    print("--------- data update complete -----------")
    pre_data = hist_data.iloc[:, 1:].values
    x = (pre_data - np.mean(pre_data, axis=0)) / np.std(pre_data, axis=0)  # standardize
    x = [x.tolist()]
    print(np.shape(x))
    return x, stock_code


if __name__ == '__main__':
    print(args.stock_len)
    get_train_data()
    # get_stock_data('399300', '2019-04-01', '2019-04-30', 'd:\\data\\')
    # get_update_data()
    # sample usage of get_stock_data: download every stock in codelist (codelist is user-supplied)
    """
    codelist = []
    with open('d:\\data\\codelist.csv') as f:
        df = pd.read_csv(f, converters={'code': str})
        codelist.extend(df['code'])
    i = 1
    for code in codelist:
        print('processing stock %s' % i)
        i += 1
        get_stock_data(code, '2019-04-01', '2019-04-30', 'd:\\data\\201904\\', 20)
    """
--------------------------------------------------------------------------------