├── code
│   ├── transformer_to_stock
│   ├── stock_main.py
│   ├── config.py
│   ├── data_utils.py
│   ├── lstm_model.py
│   └── dataprocess.py
├── data
│   └── readme
└── README.md
/code/transformer_to_stock:
--------------------------------------------------------------------------------
# waiting for update

--------------------------------------------------------------------------------
/data/readme:
--------------------------------------------------------------------------------
train: a merged file of 270 HS300 (CSI 300) constituent stocks for 2017-2018; test: January-March 2019

--------------------------------------------------------------------------------
/code/stock_main.py:
--------------------------------------------------------------------------------
from lstm_model import *


train_lstm()  # train the model; hyperparameters are set in Arg (config.py)
# fining_tune_train()  # fine-tuning
# test()  # evaluate on the test set

--------------------------------------------------------------------------------
/code/config.py:
--------------------------------------------------------------------------------
from data_utils import *


# ------------------- parameter configuration ----------------- #
class Arg:
    def __init__(self):
        # training-set data path
        self.train_dir = 'd:\data\\train_mix-17-18.csv'
        # test-set data path
        self.test_dir = 'd:\data\\train_mix-1904.csv'
        # updated data path
        self.new_dir = 'd:\data\\train_mix-19.csv'
        # path of the data to predict
        self.predict_dir = 'D:\data\\000001.csv'
        # model save path
        self.train_model_dir = 'D:\logfile\\new_logfile\\'
        # fine-tuned model save path
        self.fining_turn_model_dir = 'D:\logfile\\new_logfile\\finet\\'
        # training-graph (TensorBoard) path
        self.train_graph_dir = 'D:\logfile\\new_logfile\graph\\train_270\\'
        # validation-loss graph path
        self.val_graph_dir = 'D:\logfile\\new_logfile\graph\\val_270\\'
        # model names
        self.model_name = 'model-270-17-19'
        self.model_name_ft = 'model-ft-01-03'
        self.rnn_unit = 128  # number of hidden units
        self.input_size = 6  # input dimension (i.e. number of features)
        self.output_size = 6  # output dimension (i.e. number of classes)
        self.layer_num = 3  # number of hidden layers
        self.lr = 0.0006  # learning rate
        self.time_step = 20  # time step (window length)
        self.epoch = 50  # number of training epochs
        self.epoch_fining = 30  # number of fine-tuning epochs
        # length of a single stock's series (all stocks in one dataset must be processed to equal length)
        self.stock_len = get_data_len('D:\data\\399300_1904.csv')
        # length of a single stock's series after the update (same equal-length requirement)
        self.stock_len_new = get_data_len('D:\data\\399300_190103.csv')
        self.batch_size = 1024  # batch_size
        self.ratio = 0.8  # train/validation split ratio

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# hs300_stock_predict
This project predicts HS300 (CSI 300) stocks. It covers stock downloading, data cleaning, LSTM model training and testing, and real-time prediction.

## File structure
1. data_utils.py: downloads, cleans, and merges stock data. The file contains nine functions; a usage sketch follows this list.
- `get_stock_data(code, date1, date2, filename, length=-1)`
  Downloads one stock's data and saves six features: open, high, close, low, volume, and percentage change.
  Because it uses the tushare interface, only roughly the last two years of data can be downloaded. (A crawler interface for Sina/NetEase Finance data will be opened up later.)
  There are five parameters:
  `code` is the stock code to download; e.g. 000001 is Ping An Bank, so passing '000001' downloads Ping An Bank's data.
  `date1` is the start date in a format like "2019-01-03"; `date2` is the end date in the same format.
  `filename` is the directory where the data is stored, e.g. "D:\data\"
  `length` filters stocks by series length. It defaults to -1, i.e. no filtering; if you specify a length such as 200, stocks with fewer than 200 days of data are discarded rather than saved.
- `get_hs300_data(date1, date2, filename)`
  Downloads the HS300 index data; the parameters follow the same format as get_stock_data.
- `update_stock_data(filename)`
  Updates a stock's data from the last date in the local file up to today. `filename` is the path of the single stock file, e.g. "d:\data\000001.csv"
- `get_data_len(file_path)`
  Returns the series length of a single stock. `file_path` is the stock's file path, e.g. "d:\data\000001.csv"
- `select_stock_data(file1, file2, date1, date2)`
  Filters files already on disk by date. `date1` is the start date, `date2` the end date, `file1` the source folder, and `file2` the folder where the filtered files are stored.
- `crop_stock(df, date)`
  Currently unused.
- `fill_stock_data(target, sfile, tfile)`
  Fills each stock's trading-halt days to match the length of the HS300 index, using the stock's previous trading day's data. The function processes every file in the chosen folder.
  Note: if the start date falls inside a halt, that stretch will not be filled; an update is planned.
  `target` is the reference stock, usually the HS300 index file for the same period; `sfile` is the source folder; `tfile` is the folder for the filled files.
- `merge_stock_data(filename, tfile)`
  Merges multiple files into one in order, e.g. merging the HS300's individual stock files into one master file, which is convenient as model input.
  `filename` is the folder to merge; `tfile` is the folder for the merged file.
- `quchong(file)`
  Currently unused.

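A minimal sketch of the full data-preparation pipeline assembled from these functions, following the commented examples in data_utils.py's `__main__`; the stock codes, dates, and folders below are placeholders to adapt:

```python
from data_utils import *

# 1. Download each constituent stock (keep only series with at least 200 days).
for code in ['000001', '600000', '600036']:
    get_stock_data(code, '2017-01-01', '2018-12-31', 'd:\\data\\raw\\', 200)

# 2. Download the HS300 index for the same period; it provides the reference
#    trading calendar used when filling halts.
get_hs300_data('2017-01-01', '2018-12-31', 'd:\\data\\')

# 3. Fill each stock's halt days with its previous trading day's row.
fill_stock_data('d:\\data\\399300.csv', 'd:\\data\\raw\\', 'd:\\data\\fill\\')

# 4. Merge the filled files into a single training file, d:\data\train_mix.csv.
merge_stock_data('d:\\data\\fill\\', 'd:\\data\\')
```
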
2. dataprocess.py: processes and normalizes the training data; all model inputs are provided by this file's interfaces. The six-class label encoding these functions share is sketched after this list.
- `get_train_data(batch_size=args.batch_size, time_step=args.time_step)`
  Processes the training data; the parameter defaults come from the config file. It returns six variables: `batch_index`, the batch offsets of the training set; `val_index`, the batch offsets of the validation set; `train_x_1`, the training inputs; `train_y_1`, the training labels; `val_x`, the validation inputs; and `val_y`, the validation labels.
  Note: because the whole pipeline processes a single file merged from many stocks, the sliding time-step windows must stay within each stock's own span; this is why the stock length must be specified in the config file.
- `get_test_data(time_step=args.time_step)`
  Processes the test data and returns two variables: `test_x`, the test inputs, and `test_y`, the test labels.
- `get_update_data(time_step=args.time_step)`
  Concatenates the new data with the last time_step-1 rows of the historical data so the whole batch can be processed at once, e.g. joining the January-March 2019 data with the December 2018 data. Returns the concatenated `train_x` and `train_y`.
- `get_predict_data(file_name)`
  Downloads real-time stock data, concatenates it with the existing data, and returns the input x. One parameter must be supplied: `file_name`, the file of the single stock to predict.

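For reference, the label built by all of these functions is the next day's `p_change` (in percent) bucketed into six one-hot classes. A self-contained sketch of that encoding, equivalent to the if/elif chains in dataprocess.py:

```python
import numpy as np

# Six classes over next-day p_change (percent), with right-closed buckets:
# 0: <= -2, 1: (-2,-1], 2: (-1,0], 3: (0,1], 4: (1,2], 5: > 2
def encode_p_change(j):
    cls = int(np.searchsorted([-2, -1, 0, 1, 2], j, side='left'))
    onehot = [0] * 6
    onehot[cls] = 1
    return onehot

assert encode_p_change(-1.0) == [0, 1, 0, 0, 0, 0]  # a 1% fall -> class 1
assert encode_p_change(2.5) == [0, 0, 0, 0, 0, 1]   # a >2% rise -> class 5
```
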
3. config.py: the configuration file. All hyperparameters and paths used by the interfaces are modified here and take effect globally.

4. lstm_model.py: the model, covering training, fine-tuning, testing, and prediction. A run sketch follows this list.
- `train_lstm(time_step=args.time_step, val=True)`
  Trains the model; `val` toggles validation and defaults to on. Its data comes from `get_train_data()`.
- `fining_tune_train(time_step=args.time_step)`
  Fine-tunes the model, e.g. continuing training on newly added data or transfer learning. Its data can come from `get_update_data()`.
- `test(time_step=args.time_step)`
  Measures accuracy and F1 on the test set; its data comes from `get_test_data()`.
- `predict(time_step=args.time_step)`
  Predicts the class of the next day's closing price; its data comes from `get_predict_data(args.predict_dir)`.

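These four entry points all build on the module-level TensorFlow graph in lstm_model.py, so they are meant to be run one at a time, which is exactly how stock_main.py uses them. A sketch of a full cycle, running each step as its own process by uncommenting one call per run:

```python
# stock_main.py -- uncomment exactly one entry point per run; each call builds
# the full graph and restores/saves checkpoints under the paths in config.py.
from lstm_model import *

train_lstm()            # run 1: train; best checkpoint -> args.train_model_dir
# fining_tune_train()   # run 2 (optional): fine-tune on data from get_update_data()
# test()                # run 3: accuracy / F1 report on args.test_dir
# predict()             # run 4: six-class forecast for tomorrow's close of args.predict_dir
```
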
5. stock_main.py: the main entry point.
It can call all of the function interfaces above to carry out the corresponding tasks.
## Related paper
《基于LSTM的股票价格的多分类预测》 (Multi-class prediction of stock prices based on LSTM)
Paper link: https://www.hanspub.org/journal/PaperInformation.aspx?paperID=32542

--------------------------------------------------------------------------------
/code/data_utils.py:
--------------------------------------------------------------------------------
import tushare as ts
import pandas as pd
import os
import time
import glob


# ---------------------- download one stock's data ------------------- #
# code: stock code; date format: 2019-05-21; filename: the root directory for the data, e.g. D:\data\
# length filters by series length; the default -1 means no filtering. Specify e.g. 200
# to skip saving stocks with fewer than 200 days of data.
def get_stock_data(code, date1, date2, filename, length=-1):
    df = ts.get_hist_data(code, start=date1, end=date2)
    df1 = pd.DataFrame(df)
    df1 = df1[['open', 'high', 'close', 'low', 'volume', 'p_change']]
    df1 = df1.sort_values(by='date')
    print('%s days of data in total' % len(df1))
    if length == -1:
        path = code + '.csv'
        df1.to_csv(os.path.join(filename, path))
    else:
        if len(df1) >= length:
            path = code + '.csv'
            df1.to_csv(os.path.join(filename, path))


# ---------------------- download HS300 index data ------------------- #
# date1 is the start date, date2 the end date, filename the storage directory
def get_hs300_data(date1, date2, filename):
    df = ts.get_hist_data('399300', start=date1, end=date2)
    df1 = pd.DataFrame(df)
    df1 = df1[['open', 'high', 'close', 'low', 'volume', 'p_change']]
    df1 = df1.sort_values(by='date')
    print('%s days of data in total' % len(df1))
    df1.to_csv(os.path.join(filename, '399300.csv'))


# ------------------------ update stock data ------------------------ #
# Updates a stock's data from the last date in the local file up to today.
# filename: the full file path, e.g. d:\data\000001.csv
def update_stock_data(filename):
    (filepath, tempfilename) = os.path.split(filename)
    (stock_code, extension) = os.path.splitext(tempfilename)
    f = open(filename, 'r')
    df = pd.read_csv(f)
    print('latest date in file for stock {}: {}'.format(stock_code, df.iloc[-1, 0]))
    data_now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
    print('updating up to: %s' % data_now)
    nf = ts.get_hist_data(stock_code, str(df.iloc[-1, 0]), data_now)
    nf = nf.sort_values(by='date')
    nf = nf.iloc[1:]  # drop the first row, which duplicates the file's last date
    print('%s days of data in total' % len(nf))
    nf = pd.DataFrame(nf)
    nf = nf[['open', 'high', 'close', 'low', 'volume', 'p_change']]
    nf.to_csv(filename, mode='a', header=False)
    f.close()


# ------------------------ get stock series length ----------------------- #
# helper function
def get_data_len(file_path):
    with open(file_path) as f:
        df = pd.read_csv(f)
        return len(df)


# -------------------------- date filtering ------------------------- #
# Filters files already on disk by date. date1 is the start date, date2 the end date;
# file1 is the source folder, file2 the folder for the filtered files.
def select_stock_data(file1, file2, date1, date2):
    csv_list = glob.glob(file1 + '*.csv')
    print('found %s CSV files' % len(csv_list))
    file_list = []
    for i in csv_list:
        (filepath, filename) = os.path.split(i)
        file_list.append(filename)
    for i in file_list:
        f = open(os.path.join(file1, i), 'r')
        df1 = pd.read_csv(f, header=0)
        df1['date'] = pd.to_datetime(df1['date'])
        df1 = df1.set_index('date')
        df2 = df1[date1:date2]
        df2.to_csv(os.path.join(file2, i))


def crop_stock(df, date):
    start = df.loc[df['日期'] == date].index[0]
    return df[start:]


# -------------------------- fill trading halts ------------------------- #
# Fills a stock's halt days to match the HS300 index, using the stock's previous trading day's row.
# target is the reference stock, sfile the source folder, tfile the destination folder.
def fill_stock_data(target, sfile, tfile):
    tf = open(target)
    tf = pd.read_csv(tf)
    csv_list = glob.glob(sfile + '*.csv')
    print('found %s CSV files' % len(csv_list))
    i = 1
    for item in csv_list:
        f1 = open(item)
        print('processing file %s' % i)
        df2 = pd.read_csv(f1)
        mix_data = pd.merge(tf, df2, how='outer', on="date")
        mix_data = mix_data.fillna(method='pad')  # forward-fill halt days
        d1 = mix_data[['date', 'open_y', 'high_y', 'close_y', 'low_y', 'volume_y', 'p_change_y']]
        d1.rename(columns={'open_y': 'open', 'high_y': 'high', 'close_y': 'close', 'low_y': 'low', 'volume_y': 'volume', 'p_change_y': 'p_change'}, inplace=True)
        (filepath, filename) = os.path.split(item)
        d1.to_csv(os.path.join(tfile, filename), index=False)
        i += 1


# -------------------------- merge files ------------------------- #
# Merges multiple files into one, appending each at the end.
# filename is the folder to merge, tfile the folder for the merged file.
def merge_stock_data(filename, tfile):
    csv_list = glob.glob(filename + '*.csv')
    print('found %s CSV files' % len(csv_list))
    f = open(csv_list[0])
    df = pd.read_csv(f)
    for i in range(1, len(csv_list)):
        f1 = open(csv_list[i], 'rb')
        df1 = pd.read_csv(f1)
        df = pd.concat([df, df1])
    df.to_csv(tfile+'train_mix.csv', index=None)


# deduplicate rows in a file (currently unused)
def quchong(file):
    f = open(file)
    df = pd.read_csv(f, header=0)
    datalist = df.drop_duplicates()
    datalist.to_csv(file)


if __name__ == '__main__':
    print(1)
    # fill_stock_data('d:\data\\399300.csv', 'd:\data\\201904\\', 'd:\data\\201904-fill\\')
    # merge_stock_data('d:\data\\201904-fill\\', 'd:\data\\')
    get_stock_data('000001', '2019-05-01', '2019-05-31', 'd:\data\\')
    # print(get_data_len('d:\data\\000001.csv'))

--------------------------------------------------------------------------------
/code/lstm_model.py:
--------------------------------------------------------------------------------
from dataprocess import *
import tensorflow as tf
from sklearn.metrics import classification_report


# —————————————————— network variables ——————————————————
# input-layer and output-layer weights and biases


weights = {
    'in': tf.Variable(tf.random_normal([args.input_size, args.rnn_unit])),
    'out': tf.Variable(tf.random_normal([args.rnn_unit, args.output_size]))
}
biases = {
    'in': tf.Variable(tf.constant(0.1, shape=[args.rnn_unit, ])),
    'out': tf.Variable(tf.constant(0.1, shape=[1, ]))
}


# —————————————————— network definition ——————————————————
def lstm(X):
    batch_size = tf.shape(X)[0]
    time_step = tf.shape(X)[1]
    w_in = weights['in']
    b_in = biases['in']
    inputs = tf.reshape(X, [-1, args.input_size])  # flatten to 2-D for the input projection
    input_rnn = tf.matmul(inputs, w_in)+b_in
    input_rnn = tf.reshape(input_rnn, [-1, time_step, args.rnn_unit])  # back to 3-D as the lstm cell input
    # build one dropout-wrapped cell per layer; reusing a single cell object
    # across layers triggers a variable-reuse error in TF >= 1.1
    cells = [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicLSTMCell(args.rnn_unit))
             for _ in range(args.layer_num)]
    mlstm_cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)
    init_state = mlstm_cell.zero_state(batch_size, dtype=tf.float32)
    # output_rnn holds the lstm output at every step; final_states is the last cell state
    output_rnn, final_states = tf.nn.dynamic_rnn(mlstm_cell, input_rnn, initial_state=init_state, dtype=tf.float32)
    output = output_rnn[:, -1, :]
    output = tf.reshape(output, [-1, args.rnn_unit])  # input to the output layer
    w_out = weights['out']
    b_out = biases['out']
    pred = tf.matmul(output, w_out)+b_out
    return pred, final_states


# ----------------------- train the model ------------------------------------ #
# val (validation) defaults to on
def train_lstm(time_step=args.time_step, val=True):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    Y = tf.placeholder(tf.float32, shape=[None, 1, args.output_size])
    batch_index, val_index, train_x, train_y, val_x, val_y = get_train_data()
    print('train_y:{}, val_y:{}'.format(np.shape(train_y), np.shape(val_y)))
    pred, _ = lstm(X)
    # loss function
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=Y))
    train_op = tf.train.AdamOptimizer(args.lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=15)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tf.summary.scalar('loss', loss)
        merged_summary_op = tf.summary.merge_all()
        train_writer = tf.summary.FileWriter(args.train_graph_dir, sess.graph)
        valid_writer = tf.summary.FileWriter(args.val_graph_dir)
        min_loss = 50
        # training loop
        for i in range(args.epoch):
            for j in range(len(batch_index)-1):
                summary_str, _, loss_ = sess.run([merged_summary_op, train_op, loss],
                                                 feed_dict={X: train_x[batch_index[j]:batch_index[j+1]],
                                                            Y: train_y[batch_index[j]:batch_index[j+1]]})
            if val:
                for j in range(len(val_index) - 1):
                    valid_str, loss_val = sess.run([merged_summary_op, loss],
                                                   feed_dict={X: val_x[val_index[j]:val_index[j+1]],
                                                              Y: val_y[val_index[j]:val_index[j+1]]})
                if i % 10 == 0:
                    print("------------------------------------------------------")
                    print('epoch: {}, train_loss: {:.4f}, val_loss: {:.4f}'.format(i+1, loss_, loss_val))
                train_writer.add_summary(summary_str, i)
                valid_writer.add_summary(valid_str, i)
                if loss_val < min_loss:
                    min_loss = loss_val
                    print("saving model:", saver.save(sess, args.train_model_dir+args.model_name))


# ----------------------- fine-tuning ------------------------- #
# Fine-tunes the model, e.g. continuing training on new data or transfer learning.
# Keep the number of epochs small to avoid overfitting.
def fining_tune_train(time_step=args.time_step):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    Y = tf.placeholder(tf.float32, shape=[None, 1, args.output_size])
    train_x, train_y = get_update_data()
    pred, _ = lstm(X)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=Y))
    train_op = tf.train.AdamOptimizer(args.lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=15)
    min_loss = 1000
    with tf.Session() as sess:
        # restore parameters
        sess.run(tf.global_variables_initializer())
        saver.restore(sess, args.train_model_dir+args.model_name)
        print("model loaded")
        for i in range(args.epoch_fining):
            if len(train_x) < 10000:
                _, loss_ = sess.run([train_op, loss], feed_dict={X: train_x, Y: train_y})
                if i % 10 == 0:
                    print("------------------------------------------------------")
                    print('epoch: {}, train_loss: {:.4f}'.format(i + 1, loss_))
                if loss_ < min_loss:
                    min_loss = loss_
                    print("saving model:", saver.save(sess, args.fining_turn_model_dir + args.model_name_ft))
            else:
                b_z = args.batch_size
                for j in range(len(train_x)//b_z+1):
                    _, loss_ = sess.run([train_op, loss], feed_dict={X: train_x[b_z*j:b_z*(j+1)],
                                                                     Y: train_y[b_z*j:b_z*(j+1)]})
                if i % 10 == 0:
                    print("------------------------------------------------------")
                    print('epoch: {}, train_loss: {:.4f}'.format(i + 1, loss_))
                if loss_ < min_loss:
                    min_loss = loss_
                    print("saving model:", saver.save(sess, args.fining_turn_model_dir + args.model_name_ft))


# ----------------------------- test the model ------------------------------ #
# Measures accuracy and F1 on the test set.
def test(time_step=args.time_step):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    test_x, test_y = get_test_data()
    print("--------- data loaded --------")
    pred, _ = lstm(X)
    pre_dict = []
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        # restore parameters
        saver.restore(sess, args.train_model_dir+args.model_name)
        print("---------- model loaded ----------")
        if len(test_x) < 15000:
            prob = sess.run(pred, feed_dict={X: test_x})
            pre_dict.extend(prob)
        else:
            for i in range(len(test_x)//args.batch_size+1):
                prob = sess.run(pred, feed_dict={X: test_x[args.batch_size*i:args.batch_size*(i+1)]})
                pre_dict.extend(prob)
        pre_dict = np.array(pre_dict)
        test_label = np.array(test_y)
        a1 = list(np.argmax(pre_dict, 1))
        print(a1)
        a2 = list(np.argmax(test_label, 1))
        a1_1 = []
        a2_1 = []
        # recode into 3 classes (0: within +-1%, 1: down 1% or more, 2: up more than 1%)
        for i in a1:
            if i >= 4:
                a1_1.append(2)
            elif i <= 1:
                a1_1.append(1)
            else:
                a1_1.append(0)
        for i in a2:
            if i >= 4:
                a2_1.append(2)
            elif i <= 1:
                a2_1.append(1)
            else:
                a2_1.append(0)
        correct_prediction = tf.equal(a1, a2)
        acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)).eval()
        print(classification_report(a2, a1))  # sklearn expects (y_true, y_pred)
        print(acc)


# ----------------------------- prediction ------------------------------ #
# Predicts the class of the next day's closing price.
def predict(time_step=args.time_step):
    X = tf.placeholder(tf.float32, shape=[None, time_step, args.input_size])
    pre_x, code = get_predict_data(args.predict_dir)
    print("--------- data loaded --------")
    pred, _ = lstm(X)
    pre_y = []
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        # restore the fine-tuned checkpoint (saved under model_name_ft)
        saver.restore(sess, args.fining_turn_model_dir + args.model_name_ft)
        print("---------- model loaded ----------")
        if len(pre_x) < 15000:
            prob = sess.run(pred, feed_dict={X: pre_x})
            pre_y.extend(prob)
        else:
            for i in range(len(pre_x) // args.batch_size + 1):
                prob = sess.run(pred, feed_dict={X: pre_x[args.batch_size * i:args.batch_size * (i + 1)]})
                pre_y.extend(prob)
        pre_y = np.array(pre_y)
        a = list(np.argmax(pre_y, 1))
        if a[0] == 0:
            print("model {} predicts stock {} will close down 2% or more tomorrow".format(args.model_name, code))
        elif a[0] == 1:
            print("model {} predicts stock {} will close down 1%-2% tomorrow".format(args.model_name, code))
        elif a[0] == 2:
            print("model {} predicts stock {} will close down less than 1% tomorrow".format(args.model_name, code))
        elif a[0] == 3:
            print("model {} predicts stock {} will close up less than 1% tomorrow".format(args.model_name, code))
        elif a[0] == 4:
            print("model {} predicts stock {} will close up 1%-2% tomorrow".format(args.model_name, code))
        else:
            print("model {} predicts stock {} will close up more than 2% tomorrow".format(args.model_name, code))

--------------------------------------------------------------------------------
/code/dataprocess.py:
--------------------------------------------------------------------------------
from data_utils import *
import numpy as np
from config import Arg


args = Arg()


# Stock files used by this code must be named <stock code>.csv, e.g. 000001.csv
# -------------------------- training-set processing --------------------- #
def get_train_data(batch_size=args.batch_size, time_step=args.time_step):
    ratio = args.ratio
    stock_len = args.stock_len
    len_index = []
    batch_index = []
    val_index = []
    train_dir = args.train_dir
    df = open(train_dir)
    data_otrain = pd.read_csv(df)
    data_train = data_otrain.iloc[:, 1:].values
    print(len(data_train))
    label_train = data_otrain.iloc[:, -1].values
    normalized_train_data = (data_train-np.mean(data_train, axis=0))/np.std(data_train, axis=0)  # standardize
    train_x, train_y = [], []  # training-set x and y
    for i in range(len(normalized_train_data) + 1):
        if i % stock_len == 0:
            len_index.append(i)
    for i in range(len(len_index) - 1):
        for k in range(len_index[i], len_index[i + 1] - time_step - 1):
            x = normalized_train_data[k:k + time_step, :6]
            y = label_train[k + time_step, np.newaxis]
            temp_data = []
            # one-hot encoding
            for j in y:
                if j > 2:
                    temp_data.append([0, 0, 0, 0, 0, 1])
                elif 1 < j <= 2:
                    temp_data.append([0, 0, 0, 0, 1, 0])
                elif 0 < j <= 1:
                    temp_data.append([0, 0, 0, 1, 0, 0])
                elif -1 < j <= 0:
                    temp_data.append([0, 0, 1, 0, 0, 0])
                elif -2 < j <= -1:
                    temp_data.append([0, 1, 0, 0, 0, 0])
                else:
                    temp_data.append([1, 0, 0, 0, 0, 0])
            train_x.append(x.tolist())
            train_y.append(temp_data)
    train_len = int(len(train_x) * ratio)  # 8:2 train/validation split
    train_x_1, train_y_1 = train_x[:train_len], train_y[:train_len]  # training x and labels
    val_x, val_y = train_x[train_len:], train_y[train_len:]  # validation x and labels
    # build batch offsets
    for i in range(len(train_x_1)):
        if i % batch_size == 0:
            batch_index.append(i)
    for i in range(len(val_x)):
        if i % batch_size == 0:
            val_index.append(i)
    batch_index.append(len(train_x_1))
    val_index.append(len(val_x))
    print(batch_index)
    print(val_index)
    print(np.shape(train_x))
    return batch_index, val_index, train_x_1, train_y_1, val_x, val_y


# -------------------------- test-set processing --------------------- #
# The test data must not be shorter than time_step.
def get_test_data(time_step=args.time_step):
    stock_len = args.stock_len
    test_dir = args.test_dir
    f = open(test_dir)
    df = pd.read_csv(f)
    data_test = df.iloc[:, 1:].values
    label_test = df.iloc[:, -1].values
    batch_index = []
    normalized_test_data = (data_test-np.mean(data_test, axis=0))/np.std(data_test, axis=0)  # standardize
    test_x, test_y = [], []
    for i in range(len(normalized_test_data) + 1):
        if i % stock_len == 0:
            batch_index.append(i)
    for i in range(len(batch_index)-1):
        if stock_len > time_step+1:
            for j in range(batch_index[i], batch_index[i + 1] - time_step - 1):
                x = normalized_test_data[j:j + time_step, :]
                y = label_test[j + time_step, np.newaxis]
                temp_data = []
                # label encoding
                for k in y:
                    if k > 2:
                        temp_data.append([0, 0, 0, 0, 0, 1])
                    elif 1 < k <= 2:
                        temp_data.append([0, 0, 0, 0, 1, 0])
                    elif 0 < k <= 1:
                        temp_data.append([0, 0, 0, 1, 0, 0])
                    elif -1 < k <= 0:
                        temp_data.append([0, 0, 1, 0, 0, 0])
                    elif -2 < k <= -1:
                        temp_data.append([0, 1, 0, 0, 0, 0])
                    else:
                        temp_data.append([1, 0, 0, 0, 0, 0])
                test_x.append(x.tolist())
                test_y.extend(temp_data)
        else:
            for j in range(batch_index[i], batch_index[i]+1):
                x = normalized_test_data[j:j + time_step, :]
                y = label_test[j + time_step, np.newaxis]
                temp_data = []
                # label encoding
                for k in y:
                    if k > 2:
                        temp_data.append([0, 0, 0, 0, 0, 1])
                    elif 1 < k <= 2:
                        temp_data.append([0, 0, 0, 0, 1, 0])
                    elif 0 < k <= 1:
                        temp_data.append([0, 0, 0, 1, 0, 0])
                    elif -1 < k <= 0:
                        temp_data.append([0, 0, 1, 0, 0, 0])
                    elif -2 < k <= -1:
                        temp_data.append([0, 1, 0, 0, 0, 0])
                    else:
                        temp_data.append([1, 0, 0, 0, 0, 0])
                test_x.append(x.tolist())
                test_y.extend(temp_data)

    print(batch_index)
    print(np.shape(test_x))
    return test_x, test_y


# ------------------------- batch processing of new stock data -------------------- #
# Concatenates the new data with the last time_step-1 rows of the historical data
# so the whole batch can be processed at once, e.g. joining 2019 Jan-Mar with Dec 2018.
def get_update_data(time_step=args.time_step):
    train_data_dir = args.train_dir
    new_data_dir = args.new_dir
    stock_len = args.stock_len
    new_len = args.stock_len_new
    f = open(train_data_dir)
    nf = open(new_data_dir)
    df = pd.read_csv(f)
    ndf = pd.read_csv(nf)
    data_train = df.iloc[:, 1:].values
    data_new = ndf.iloc[:, 1:].values
    # standardize
    mean, std = np.mean(data_train, axis=0), np.std(data_train, axis=0)
    new_mean, new_std = np.mean(data_new, axis=0), np.std(data_new, axis=0)
    normalized_data_train = (data_train - mean) / std
    normalized_data_new = (data_new - new_mean) / new_std
    label_new = ndf.iloc[:, -1].values
    train_x, train_y = [], []
    batch_index = []
    new_index = []
    for i in range(len(data_train)+1):
        if i % stock_len == 0:
            batch_index.append(i)
    for i in range(len(data_new)+1):
        if i % new_len == 0:
            new_index.append(i)
    # this part appends the windows that straddle the old and new data
    print(batch_index)
    print(new_index)
    for i in range(1, len(batch_index)):
        count = time_step
        while count > 1:
            last_data = []
            last_data.extend(normalized_data_train[batch_index[i] - count + 1:batch_index[i]])
            last_data.extend(normalized_data_new[0:time_step-count+1])
            y = label_new[time_step-count+1, np.newaxis]
            temp_data = []
            for k in y:
                if k > 2:
                    temp_data.append([0, 0, 0, 0, 0, 1])
                elif 1 < k <= 2:
                    temp_data.append([0, 0, 0, 0, 1, 0])
                elif 0 < k <= 1:
                    temp_data.append([0, 0, 0, 1, 0, 0])
                elif -1 < k <= 0:
                    temp_data.append([0, 0, 1, 0, 0, 0])
                elif -2 < k <= -1:
                    temp_data.append([0, 1, 0, 0, 0, 0])
                else:
                    temp_data.append([1, 0, 0, 0, 0, 0])
            train_x.append(last_data)
            train_y.append(temp_data)
            count -= 1
    # this part appends the windows that lie entirely within new_data
    for i in range(len(new_index)-1):
        for j in range(new_index[i], new_index[i + 1] - time_step - 1):
            x = normalized_data_new[j:j+time_step, :]
            y = label_new[j+time_step, np.newaxis]
            temp_data = []
            # label encoding
            for k in y:
                if k > 2:
                    temp_data.append([0, 0, 0, 0, 0, 1])
                elif 1 < k <= 2:
                    temp_data.append([0, 0, 0, 0, 1, 0])
                elif 0 < k <= 1:
                    temp_data.append([0, 0, 0, 1, 0, 0])
                elif -1 < k <= 0:
                    temp_data.append([0, 0, 1, 0, 0, 0])
                elif -2 < k <= -1:
                    temp_data.append([0, 1, 0, 0, 0, 0])
                else:
                    temp_data.append([1, 0, 0, 0, 0, 0])
            train_x.append(x.tolist())
            train_y.append(temp_data)
    print(np.shape(train_x))
    return train_x, train_y


# -------------------------- same-day data update ---------------------- #
# Downloads real-time stock data and concatenates it with the existing data to build x.
# It can only fetch one day's update and does not modify the source file; if the gap
# is more than one day, download the whole batch first and use get_update_data instead.
# file_name is the path of the stock to predict, e.g. 'D:\data\\201904\\000001.csv'
def get_predict_data(file_name):
    f = open(file_name)
    (filepath, temp_file_name) = os.path.split(file_name)
    (stock_code, extension) = os.path.splitext(temp_file_name)
    f = pd.read_csv(f)
    data_now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
    hist_data = f[-args.time_step+1:]
    real_data = ts.get_realtime_quotes(stock_code)
    real_data = real_data[['open', 'high', 'price', 'low', 'volume']]
    real_data.insert(0, 'date', data_now)
    # current price vs. previous close; p_change is stored in percent, so scale by 100
    p_change = (float(real_data.iloc[-1, 3]) - float(hist_data.iloc[-1, 3]))/float(hist_data.iloc[-1, 3]) * 100
    real_data['p_change'] = p_change
    real_data.rename(index=str, columns={"price": "close"}, inplace=True)
    real_data[['open', 'high', 'close', 'low', 'volume', 'p_change']] = \
        real_data[['open', 'high', 'close', 'low', 'volume', 'p_change']].astype('float')
    hist_data = hist_data.append(real_data)
    print("--------- data update complete -----------")
    pre_data = hist_data.iloc[:, 1:].values
    x = (pre_data - np.mean(pre_data, axis=0)) / np.std(pre_data, axis=0)  # standardize
    x = [x.tolist()]
    print(np.shape(x))
    return x, stock_code


if __name__ == '__main__':
    print(args.stock_len)
    get_train_data()
    # get_stock_data('399300', '2019-04-01', '2019-04-30', 'd:\data\\')
    # get_update_data()
    # usage example for get_stock_data(): downloads the stocks in codelist, which you can specify yourself
    """
    codelist = []
    with open('d:\data\codelist.csv') as f:
        df = pd.read_csv(f, converters={'code': str})
    codelist.extend(df['code'])
    i = 1
    for code in codelist:
        print('processing stock %s' % i)
        i += 1
        get_stock_data(code, '2019-04-01', '2019-04-30', 'd:\data\\201904\\', 20)
    """

--------------------------------------------------------------------------------