├── README.md
├── Use_Tsfresh_build_lasso_model.py
├── adfuller_test.py
├── comb_time_series.py
├── demand_forecasting.jpg
├── multi_prophet_model.py
├── pyspark_sale_forecast
    └── script
    │   ├── README
    │   ├── pyspark_ml_predict.py
    │   └── pyspark_ml_tune.py
├── sale_df.csv
├── sale_df.xlsx
├── sarimax_model.py
├── spark_lightgbm_jar
    └── spark_lgb_jar.rar
├── spark_prophet_demo.py
├── ts_sale.csv
├── ts_sale_example.py
├── ts_shape.py
├── 利用SARIMAX 进行销量预测.md
├── 数据之外的代码.md
├── 时间序列机器学习模型特征总结.md
├── 机器学习模型可解释性.md
├── 特征选择.md
└── 销量预测spark实战.md


/README.md:
--------------------------------------------------------------------------------
 1 | This repository collects my articles on time series data mining, together with the code that is suitable for publication.
 2 | 
 3 | 
 4 | 
 5 | ![](demand_forecasting.jpg)
 6 | 
 7 | The articles are also published on my CSDN blog:
 8 | https://blog.csdn.net/fitzgerald0
 9 | 
10 | Last updated: 2021-05-09
11 | 
12 | 
13 | 【1】Probabilistic programming with TensorFlow Probability: time series models
14 | 
15 | [Blog](https://blog.csdn.net/fitzgerald0/article/details/90274550)
16 | [Code](https://github.com/fitzgerald0/time_series_data_mining/blob/master/ts_sale_example.py)
17 | [Data](https://github.com/fitzgerald0/time_series_data_mining/blob/master/ts_sale.csv)
18 | 
19 | 
20 | 
21 | 【2】Time series feature extraction with the tsfresh library, combined with Lasso for modeling
22 | 
23 | [Blog](https://blog.csdn.net/fitzgerald0/article/details/90612781)
24 | 
25 | 
26 | 【3】Sales forecasting with SARIMAX
27 | 
28 | [Blog](https://blog.csdn.net/fitzgerald0/article/details/100823231)
29 | 
30 | [Code](https://github.com/fitzgerald0/time_series_data_mining/blob/master/sarimax_model.py)
31 | 
32 | [Data](https://github.com/fitzgerald0/time_series_data_mining/blob/master/sale_df.xlsx)
33 | 
34 | 【4】Summary of features for tree-based machine learning models on time series
35 | 
36 | [Blog](https://blog.csdn.net/fitzgerald0/article/details/104029842)
37 | 
38 | 【5】Forecasting multiple series with Prophet
39 | 
40 | [Multi-process Prophet code](https://github.com/fitzgerald0/time_series_data_mining/blob/master/multi_prophet_model.py)
41 | 
42 | 
43 | 
44 | 
45 | 【6】Sales forecasting with PySpark
46 | 
47 | [Blog](https://blog.csdn.net/fitzgerald0/article/details/106885568)
48 | 
49 | [Code](https://github.com/fitzgerald0/time_series_data_mining/tree/master/pyspark_sale_forecast/script)
50 | 
51 | 
52 | 
53 | 
54 | 【7】Forecasting with PySpark and Prophet
55 | 
56 | [Blog post](https://blog.csdn.net/fitzgerald0/article/details/106157008)
57 | 
58 | [Code](https://github.com/fitzgerald0/time_series_data_mining/blob/master/spark_prophet_demo.py)
59 | 
60 | 【8】k-Shape time series clustering (tslearn)
61 | 
62 | [Article](https://blog.csdn.net/fitzgerald0/article/details/108188588)
63 | 
64 | [Code](https://github.com/fitzgerald0/time_series_data_mining/blob/master/ts_shape.py)
65 | 
66 | 
67 | 
68 | 【9】Measuring the predictability of a time series
69 | 
70 | [Blog post](https://blog.csdn.net/fitzgerald0/article/details/108995724)
71 | 
72 | [Code](https://github.com/fitzgerald0/time_series_data_mining/blob/master/adfuller_test.py)
73 | 
74 | 
75 | 
76 | **The nine articles of the hands-on PySpark sales forecasting series:**
77 | 
78 | [1. Introduction to PySpark and DataFrames](https://blog.csdn.net/fitzgerald0/article/details/116455182)
79 | 
80 | [2. Descriptive statistics, distributional properties, and intrinsic properties of time series data in PySpark](https://blog.csdn.net/fitzgerald0/article/details/116090223)
81 | 
82 | [3. Missing value imputation and outlier handling](https://blog.csdn.net/fitzgerald0/article/details/116306783)
83 | 
84 | [4. Time series feature engineering](https://blog.csdn.net/fitzgerald0/article/details/116453472)
85 | 
86 | [5. Feature selection](https://blog.csdn.net/fitzgerald0/article/details/115876867)
87 | 
88 | [6. Simple forecasting models](https://blog.csdn.net/fitzgerald0/article/details/115918790)
89 | 
90 | [7. Linear regression and generalized linear models](https://blog.csdn.net/fitzgerald0/article/details/116451185)
91 | 
92 | [8. Hyperparameter tuning methods for machine learning](https://blog.csdn.net/fitzgerald0/article/details/116452338)
93 | 
94 | [9. Common loss functions and evaluation metrics in sales forecast modeling](https://blog.csdn.net/fitzgerald0/article/details/115471489)
95 | 
96 | 
97 | 
98 | 


--------------------------------------------------------------------------------
/Use_Tsfresh_build_lasso_model.py:
--------------------------------------------------------------------------------
  1 | import math
  2 | import numpy as np
  3 | import pandas as pd
  4 | 
  5 | import gc
  6 | import time
  7 | import calendar 
  8 | import datetime
  9 | import warnings
 10 | warnings.filterwarnings('ignore')
 11 | import statsmodels.api as sm
 12 | import matplotlib
 13 | import matplotlib.pyplot as plt
 14 | from sklearn.linear_model import Ridge,Lasso
 15 | from sklearn.preprocessing import StandardScaler
 16 | from sklearn.model_selection import GridSearchCV
 17 | 
 18 | import tsfresh
 19 | from tsfresh import extract_features, extract_relevant_features, select_features
 20 | # %matplotlib inline  (IPython magic: run it only inside a Jupyter notebook)
 21 | 
 22 | 
 23 | # data preprocessing
 24 | data= pd.read_excel("TOP1000_code_04.xlsx",dtype=str)
 25 | data=data[data['goods_code']!='0200011831']
 26 | data_ts=data[['goods_code','ym','sale']]
 27 | data_ts=data_ts[data_ts['ym']!='2016_12']
 28 | data_ts=data_ts[data_ts['ym']!='2019_03']
 29 | 
 30 | 
 31 | def split_ts(data):
 32 |     data['ym']=data['ym'].apply(lambda x: datetime.datetime.strptime(x,"%Y_%m"))
 33 |     data['sale']=data['sale'].astype(float)
 34 |     # split into training and test sets
 35 |     ts_test=data[(data['ym']<='2019-01-01')&(data['ym']>='2018-12-01')]
 36 |     ts_train=data[(data['ym']<'2018-12-01')]
 37 |     ts_test=ts_test.sort_values(by=['goods_code','ym'])
 38 |     ts_train=ts_train.sort_values(by=['goods_code','ym'])
 39 |     ts_test=ts_test.reset_index()
 40 |     ts_train=ts_train.reset_index()
 41 |     ts_test_1= pd.DataFrame(ts_test,columns=["goods_code", "sale","ym"])
 42 |     ts_test_1['sale']=ts_test_1['sale'].astype(float)
 43 |     ts_test_pro= tsfresh.extract_features(ts_test_1, column_id="goods_code",column_sort='ym')
 44 |     ts_train_1= pd.DataFrame(ts_train,columns=["goods_code", "sale","ym"])
 45 |     ts_train_1['sale']=ts_train_1['sale'].astype(float)
 46 |     ts_train_pro= tsfresh.extract_features(ts_train_1, column_id="goods_code",column_sort='ym')
 47 |     # release temporary variables
 48 |     del ts_test,ts_train,ts_test_1,ts_train_1
 49 |     gc.collect()
 50 |     return ts_test_pro,ts_train_pro
 51 | 
 52 | 
 53 | def data_clean(data):
 54 |     # keep only the columns that go into the model
 55 |     data=data[['goods_code','ym','sale']].copy()
 56 |     data=data.astype(str)
 57 |     data['sale']=data['sale'].astype(float)
 58 |     data.sort_values(by=['goods_code','ym'],inplace=True)
 59 |     data['goods_code']=data['goods_code'].astype(str)
 60 |     data_m1=pd.DataFrame(data['ym'].groupby(data['goods_code']).count())
 61 |     data_m1=data_m1.rename(index=str,columns={'ym':'long'})
 62 |     data_m1=data_m1.reset_index()
 63 |     # keep only series with at least 15 months of history
 64 |     data_m2=data_m1.merge(data,on='goods_code',how='right')
 65 |     data_m2=data_m2[data_m2['long']>=15]
 66 |     data_m2['ym']=data_m2['ym'].apply(lambda x: datetime.datetime.strptime(x,"%Y_%m"))
 67 |     data_m2['month'] = data_m2['ym'].dt.month
 68 |     data_m2['month_day']=data_m2['month'].map(lambda x : calendar.monthrange(2019,x)[1])
 69 |     data_m2['dis_month'] = data_m2['ym'].map(lambda x: (datetime.datetime(2019, 3, 8)-x).days//28)
 70 |     data_m2.sort_values(['goods_code','ym'],inplace=True)
 71 |     return data_m2
 72 | 
 73 | def train_model():
 74 |     start_time=time.time()
 75 |     data_inp=data_clean(df)
 76 |     pivot = data_inp.pivot(index='goods_code', columns='dis_month', values='sale')
 77 |     # rename the pivoted columns to sales_0, sales_1, ...
 78 |     col_name=[]
 79 |     for i in range(len(pivot.columns)):
 80 |         col_name.append('sales_'+str(i))
 81 |     pivot.columns=col_name
 82 |     pivot.fillna(0, inplace=True)
 83 |     sub=pivot.reset_index()
 84 |     test_features=['goods_code']
 85 |     trian_features = ['goods_code']
 86 |     # sales_1 and sales_2 are used as the test features
 87 |     for i in range(1,3):
 88 |         test_features.append('sales_' + str(i))
 89 |     # the 20 months before them (sales_3 to sales_22) are used as the training features
 90 |     for i in range(3,23):
 91 |         trian_features.append('sales_' + str(i))
 92 | 
 93 |     sub.fillna(0, inplace=True)
 94 |     sub.drop_duplicates(subset=['goods_code'],keep='first',inplace=True)
100 |     X_train = sub[trian_features]
101 |     y_train = sub[['sales_0', 'goods_code']]
102 |     X_test = sub[test_features]    
103 |     sales_type = 'sales_'
104 |     
105 |     # mean feature
106 |     X_train['mean_sale'] = X_train.apply(
107 |         lambda x: np.mean([x[sales_type+'3'], x[sales_type+'4'],x[sales_type+'5'], 
108 |                               x[sales_type+'6'], x[sales_type+'7'],x[sales_type+'8'], x[sales_type+'9'], 
109 |                            x[sales_type+'10'], x[sales_type+'11'],x[sales_type+'12'],x[sales_type+'13'], 
110 |                               x[sales_type+'14'],
111 |                            x[sales_type+'15'], x[sales_type+'16'], x[sales_type+'17'],x[sales_type+'18'],
112 |                            x[sales_type+'19'], x[sales_type+'20'], x[sales_type+'21'], x[sales_type+'22']]), axis=1)
113 |     
114 |     X_test['mean_sale'] = X_test.apply(
115 |         lambda x: np.mean([x[sales_type+'1'], x[sales_type+'2']]), axis=1)
116 |     train_mean=X_train['mean_sale']
117 |     test_mean=X_test['mean_sale']
118 |     train_mean=pd.Series(train_mean)
119 |     test_mean=pd.Series(test_mean)
120 |     
121 |     # median feature
122 |     X_train['median_sale'] = X_train.apply(
123 |         lambda x: np.median([ x[sales_type+'3'], x[sales_type+'4'],
124 |                       x[sales_type+'5'], x[sales_type+'6'], x[sales_type+'7'],x[sales_type+'8'], 
125 |                              x[sales_type+'9'], x[sales_type+'10'], x[sales_type+'11'],x[sales_type+'12'],
126 |                              x[sales_type+'13'], x[sales_type+'14'],x[sales_type+'15'], x[sales_type+'16'], 
127 |                              x[sales_type+'17'],x[sales_type+'18'], x[sales_type+'19'], x[sales_type+'20'],
128 |                              x[sales_type+'21'], x[sales_type+'22']]), axis=1)
129 |     X_test['median_sale'] = X_test.apply(
130 |         lambda x: np.median([x[sales_type+'1'], x[sales_type+'2']]), axis=1)
131 |     
132 |     # standard deviation feature
133 |     X_train['std_sale'] = X_train.apply(
134 |         lambda x: np.std([ x[sales_type+'3'], x[sales_type+'4'],x[sales_type+'5'], x[sales_type+'6'], 
135 |                           x[sales_type+'7'],x[sales_type+'8'], x[sales_type+'9'], x[sales_type+'10'], 
136 |                           x[sales_type+'11'],x[sales_type+'12'],x[sales_type+'13'], x[sales_type+'14'],
137 |                         x[sales_type+'15'], x[sales_type+'16'], x[sales_type+'17'],x[sales_type+'18'], 
138 |                         x[sales_type+'19'], x[sales_type+'20'], x[sales_type+'21'], x[sales_type+'22']]), axis=1)
139 |     X_test['std_sale'] = X_test.apply(
140 |         lambda x: np.std([x[sales_type+'1'], x[sales_type+'2']]), axis=1)
141 |     
142 |     train_median=X_train['median_sale']
143 |     test_median=X_test['median_sale']
144 | 
145 |     train_std=X_train['std_sale']
146 |     test_std=X_test['std_sale']
147 | 
148 |     X_train = sub[trian_features]
149 |     X_test = sub[test_features]
150 |     
151 |     formas_train=[train_mean,train_median,train_std]
152 |     formas_test=[test_mean,test_median,test_std]
153 |     train_inp=pd.concat(formas_train,axis=1)
154 |     test_inp=pd.concat(formas_test,axis=1)
155 |     
156 |     # residual feature
157 |     lr_Y=y_train['sales_0']
158 |     lr_train_x=train_inp
159 |     re_train= sm.OLS(lr_Y,lr_train_x).fit()
160 |     train_inp['resid']=re_train.resid
161 |     
162 |     lr_Y=y_train['sales_0']
163 |     lr_test_x=test_inp
164 |     re_test= sm.OLS(lr_Y,lr_test_x).fit()
165 |     test_inp['resid']=re_test.resid
166 |     
167 |     train_inp=pd.concat([y_train,train_inp],axis=1)
168 |     
169 |     ts_test_pro,ts_train_pro=split_ts(df)
170 |     
171 |     ts_train_=ts_train_pro.reset_index()
172 |     train_inp=pd.merge(train_inp,ts_train_,left_on='goods_code',right_on='id',how='left')
173 |     test_inp=pd.concat([y_train,test_inp],axis=1)
174 |     
175 |     ts_test_=ts_test_pro.reset_index()
176 |     test_inp=pd.merge(test_inp,ts_test_,left_on='goods_code',right_on='id',how='left')
177 |     train_inp.drop(['sales_0','goods_code'],axis=1,inplace=True)
178 |     test_inp.drop(['sales_0','goods_code'],axis=1,inplace=True)
179 |     
180 |     train_inp.fillna(0,inplace=True)
181 |     train_inp.replace(np.inf,0,inplace=True)
182 |     test_inp.replace(np.inf,0,inplace=True)
183 |     test_inp.fillna(0,inplace=True)
184 | 
185 |     #lasso
186 |     ss = StandardScaler()
187 |     train_inp_s= ss.fit_transform(train_inp) 
188 |     test_inp_s= ss.transform(test_inp)
189 |     alpha_ridge = [1e-4,1e-3,1e-2,0.1,1]
190 | 
191 |     coeffs = {}
192 |     for alpha in alpha_ridge:
193 |         r = Lasso(alpha=alpha, normalize=True, max_iter=1000000)
194 |         r = r.fit(train_inp_s, y_train['sales_0'])
195 | 
196 |     grid_search = GridSearchCV(Lasso(alpha=alpha, normalize=True), scoring='neg_mean_squared_error',
197 |                            param_grid={'alpha': alpha_ridge}, cv=5, n_jobs=-1)
198 |     grid_search.fit(train_inp_s, y_train['sales_0'])
199 |     
200 |     alpha = alpha_ridge
201 |     rmse = list(np.sqrt(-grid_search.cv_results_['mean_test_score']))
202 |     plt.figure(figsize=(6,5))
203 |     
204 |     lasso_cv = pd.Series(rmse, index = alpha)
205 |     lasso_cv.plot(title = "Validation - LASSO", logx=True)
206 |     plt.xlabel("alpha")
207 |     plt.ylabel("rmse")
208 |     plt.show()
209 |     
210 |     least_lasso=grid_search.best_params_['alpha']  # alpha with the best cross-validated score
211 |     lasso = Lasso(alpha=least_lasso,normalize=True)
212 |     model_lasso=lasso.fit(train_inp_s,y_train['sales_0'])
213 |     
214 |     print("lasso feature.......................")
215 |     lasso_coef = pd.Series(model_lasso.coef_,index = train_inp.columns)
216 |     lasso_coef=lasso_coef[lasso_coef!=0.0000]
217 |     lasso_coef=lasso_coef.astype(float)
218 |     print(".....lasso_coef..............")
219 | 
220 |     print(lasso_coef.sort_values(ascending=False).head(10))
221 |     print("R^2 (goodness of fit)")
222 |     
223 |     matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
224 |     imp_coef = pd.concat([lasso_coef.sort_values().head(5), 
225 |                      lasso_coef.sort_values().tail(5)])  # the 5 smallest and 5 largest coefficients
226 | 
227 |     imp_coef.plot(kind = "barh")
228 |     plt.title("Coefficients in the Lasso Model")
229 |     
230 |     print(lasso.score(train_inp_s,y_train['sales_0']))
231 |     
232 |     print(lasso.get_params())  
233 |     print('parameter info')
234 |     print(lasso.set_params(fit_intercept=False)) 
235 |     lasso_preds =model_lasso.predict(test_inp_s)
236 |     # scatter plot of predicted vs. true values
237 |     fig, ax = plt.subplots()
238 |     ax.scatter(y_train['sales_0'],lasso_preds)
239 |     ax.plot([y_train['sales_0'].min(), y_train['sales_0'].max()], [y_train['sales_0'].min(), y_train['sales_0'].max()], 'k--', lw=4)
240 |     ax.set_xlabel('y_true')
241 |     ax.set_ylabel('Pred')
242 |     plt.show()
243 |     y_pred=pd.DataFrame(lasso_preds,columns=['y_pred'])
244 |     
245 |     matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)
246 |     preds = pd.DataFrame({"preds":y_pred['y_pred'], "true":y_train['sales_0']}) 
247 |     preds["residuals"] = preds["true"] - preds["preds"]
248 |     
249 |     print("summary of predictions.....................")
250 |     preds=preds.astype(float)
251 |     print(preds.head())
252 |     print(preds.describe())
253 |     print(preds.shape)
254 |     preds.plot(x = "preds", y = "residuals",kind = "scatter")
255 |     plt.title("True and residuals")
256 |     plt.show()
257 |     
258 |     data_out=[y_train['goods_code'],y_train['sales_0'],y_pred]
259 |     result=pd.concat(data_out,axis=1)
260 |     # compute the absolute percentage error (MAPE) for each item
261 |     result['mape']=abs((result['sales_0']-result['y_pred'])/result['sales_0']*100)    
262 |     return result,lasso_coef
263 | 
264 | 
265 | if __name__ == '__main__':
266 |     df=data_ts
267 |     result_f,lasso_coef_f=train_model()
268 |     
269 |     del df
270 |     gc.collect()    
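
The pivot and row-statistics steps in `train_model` can be illustrated on a toy table. This is only a minimal sketch: the data frame `long_df` and its values are made up, and the row statistics are computed with `DataFrame.mean(axis=1)` and friends instead of the long `apply` lambdas used in the script. Note that `DataFrame.std` defaults to the sample standard deviation, so `ddof=0` is passed to match `np.std`.

```python
import pandas as pd

# Hypothetical long-format monthly sales: one row per (goods_code, month offset).
long_df = pd.DataFrame({
    "goods_code": ["A", "A", "A", "B", "B"],
    "dis_month": [1, 2, 3, 1, 3],          # months back from a reference date
    "sale": [10.0, 12.0, 9.0, 5.0, 7.0],
})

# One row per product, one column per month offset.
wide = long_df.pivot(index="goods_code", columns="dis_month", values="sale")
# Rename columns by position, as the script does: sales_0, sales_1, ...
wide.columns = ["sales_" + str(i) for i in range(len(wide.columns))]
wide = wide.fillna(0).reset_index()

# Row-wise statistics over a block of lag columns.
lag_cols = ["sales_1", "sales_2"]
wide["mean_sale"] = wide[lag_cols].mean(axis=1)
wide["median_sale"] = wide[lag_cols].median(axis=1)
wide["std_sale"] = wide[lag_cols].std(axis=1, ddof=0)  # ddof=0 matches np.std

print(wide)
```

Selecting the lag columns by name keeps the feature code short and makes it harder to miss a month when the window changes.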


--------------------------------------------------------------------------------
/adfuller_test.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | # @Time    : 2020/10/3 14:58
 3 | # @Author  : hjs
 4 | # @File    : adfuller_test.py
 5 | 
 6 | # Example: running the ADF (augmented Dickey-Fuller) test per group with a Spark pandas UDF
 7 | 
 8 | import pandas as pd
 9 | from statsmodels.tsa.stattools import adfuller
10 | from pyspark.sql import SparkSession
11 | from pyspark.sql.functions import pandas_udf, PandasUDFType
12 | from pyspark.sql.types import *
13 | 
14 | spark = SparkSession. \
15 |     builder. \
16 |     config("spark.sql.crossJoin.enabled", "true"). \
17 |     enableHiveSupport(). \
18 |     getOrCreate()
19 | 
20 | 
21 | df=spark.sql("select * from test.app_forecast_input_fix")
22 | 
23 | 
24 | schema = StructType([
25 |     StructField("store_id", StringType()),
26 |     StructField("is_adfuller", DoubleType())
27 | ])
28 | 
29 | @pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
30 | def adfuller_func(df):
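31 |     # sort by date so the test sees the observations in time order
32 |     df = df.sort_values(by=['date'],ascending=[True])
33 |     adfuller_result=adfuller(df['qty'],autolag='AIC')
34 |     # adfuller_result[1] is the p-value: below 0.05 the unit-root null is rejected and
35 |     # the series is treated as stationary (1.0), otherwise as non-stationary (0.0);
36 |     # floats are used to match the DoubleType declared in the schema
37 |     is_adfuller=1.0 if adfuller_result[1]<0.05 else 0.0
38 |     result=pd.DataFrame({'store_id':[df['store_id'].iloc[0]],'is_adfuller':[is_adfuller]})
39 |     return result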
40 | 
41 | adfuller_result = df.groupby(['store_id']).apply(adfuller_func)
42 | adfuller_result.printSchema()
43 | adfuller_result.createOrReplaceTempView('adfuller_result')
44 | spark.sql("""drop table if exists test.adfuller_result""")
45 | spark.sql("""create table test.adfuller_result as select * from adfuller_result""")
46 | 
47 | 
48 | 
49 | 


--------------------------------------------------------------------------------
/comb_time_series.py:
--------------------------------------------------------------------------------
  1 | 
  2 | # @Time    : 2022/2/21 21:18
  3 | # @Author  : huangjisheng
  4 | # @File    : comb_time_series.py
  5 | # @Software: PyCharm
  6 | # !/usr/bin/env python
  7 | # coding: utf-8
  8 | 
  9 | import numpy as np
 10 | import pandas as pd
 11 | from sklearn.linear_model import Lasso
 12 | from sklearn.linear_model import LassoCV
 13 | class Loss_func(object):
 14 |     """Point-forecast loss functions; call them on the class, e.g. Loss_func.mape(y, y_hat)."""
 15 |     @staticmethod
 16 |     def mae(y_true, y_pred):
 17 |         return np.mean(np.abs(y_true - y_pred))
 18 | 
 19 |     @staticmethod
 20 |     def mape(y_true, y_pred):
 21 |         return np.mean(np.abs((y_pred - y_true) / y_true)) * 100
 22 | 
 23 |     @staticmethod
 24 |     def smape(y_true, y_pred):
 25 |         return 2.0 * np.mean(np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true))) * 100
 26 | 
 27 | use_loss = Loss_func.mape
 28 | loss_name = getattr(use_loss, '__name__')
 29 | 
 30 | forecast_df = pd.read_excel('comb_forecast_result.xlsx')
 31 | 
 32 | """
 33 | features: the forecasts produced by the different models
 34 | The ground-truth column is assumed to be named y
 35 | """
 36 | 
 37 | features = ['auto_arima','holt_winters','sarimax','xgboost','prophet','ma']
 38 | 
 39 | def simple_avg_forecast(forecast_df):
 40 |     forecast_df['naive_comb_forecast'] = 0
 41 |     for fea in features:
 42 |         forecast_df['naive_comb_forecast'] += forecast_df[fea]
 43 |     forecast_df['naive_comb_forecast'] = forecast_df['naive_comb_forecast'] / len(features)
 44 |     return forecast_df
 45 | 
 46 | 
 47 | def weight_avg_forecast(forecast_df):
 48 |     forecast_df['{}_sum'.format(loss_name)] = 0
 49 |     forecast_df['{}_max'.format(loss_name)] = 0
 50 |     for fea in features:
 51 |         forecast_df["{}_{}".format(fea, loss_name)] = forecast_df.apply(lambda x: use_loss(x['y'], x[fea]), axis=1)
 52 |         forecast_df["{}_{}".format(fea, loss_name)] = forecast_df["{}_{}".format(fea, loss_name)].apply(
 53 |             lambda x: 0 if x <= 0 else x)
 54 | 
 55 |     for fea in features:
 56 |         forecast_df['{}_max'.format(loss_name)] = forecast_df.apply(
 57 |             lambda x: max(x['{}_max'.format(loss_name)], x["{}_{}".format(fea, loss_name)]), axis=1)
 58 | 
 59 |     for fea in features:
 60 |         forecast_df['{}_sum'.format(loss_name)] += forecast_df['{}_max'.format(loss_name)] - forecast_df[
 61 |             "{}_{}".format(fea, loss_name)]
 62 | 
 63 |     for fea in features:
 64 |         forecast_df["{}_weight_{}".format(fea, loss_name)] = (forecast_df['{}_max'.format(loss_name)] - forecast_df[
 65 |             "{}_{}".format(fea, loss_name)]) / forecast_df['{}_sum'.format(loss_name)]
 66 |     forecast_df['weight_avg_forecast'] = 0
 67 |     for fea in features:
 68 |         forecast_df['weight_avg_forecast'] += forecast_df["{}_weight_{}".format(fea, loss_name)] * forecast_df[fea]
 69 |     return forecast_df
 70 | 
 71 | 
 72 | def lasso_comb_forecast(forecast_df, target_col='y'):
 73 |     reg_data = forecast_df[features]
 74 |     # use a 1-D target so scikit-learn does not have to reshape a column vector
 75 |     reg_target = forecast_df[target_col]
 76 |     lassocv = LassoCV()
 77 |     lassocv.fit(reg_data, reg_target)
 78 |     alpha = lassocv.alpha_
 79 |     print('best alpha is : {}'.format(alpha))
 80 |     lasso = Lasso(alpha=alpha)
 81 |     lasso.fit(reg_data, reg_target)
 82 |     num_effect_coef = np.sum(lasso.coef_ != 0)
 83 |     print('number of coefficients: {}; non-zero coefficients: {}'.format(len(lasso.coef_), num_effect_coef))
 84 |     lasso_coefs = lasso.coef_
 85 |     # one row per model: its Lasso coefficient and its column name
 86 |     loss_coef_df = pd.DataFrame(list(zip(lasso_coefs, features)), columns=['coef', 'feature'])
 87 |     # build an expression such as "lasso_comb_forecast=a*0.3+b*0.7" for DataFrame.eval
 88 |     t = 'lasso_comb_forecast='
 89 |     for i in loss_coef_df['feature'].unique():
 90 |         coef = loss_coef_df[loss_coef_df['feature'] == i]['coef'].values[0]
 91 |         temp = str(i) + '*' + str(coef) + '+'
 92 |         t += temp
 93 |     forecast_df.eval(t[:-1], inplace=True)
 94 |     for fea in features:
 95 |         forecast_df['lasso_coef_{}'.format(fea)] = loss_coef_df[loss_coef_df['feature'] == fea]['coef'].values[0]
 96 |     return forecast_df
 97 | 
 98 | 
 99 | def corr_comb_forecast(forecast_df):
100 |     forecast_df_corr = forecast_df.corr()
101 |     df_corr = pd.DataFrame(forecast_df_corr['y'].sort_values(ascending=False)[1:])
102 | 
103 |     print(df_corr)
104 |     forecast_df_corr_re = forecast_df_corr.reset_index()
105 |     corr_select_fea = forecast_df_corr_re[forecast_df_corr_re['index'] == 'y']
106 |     corr_select_fea = corr_select_fea[features]
107 |     corr_select_fea[features] = abs(corr_select_fea[features])
108 |     corr_select_fea_min = corr_select_fea[features].iloc[0].min()
109 | 
110 |     t_sum = 0
111 |     for fea in features:
112 |         corr_select_fea['corr_norm_{}'.format(fea)] = corr_select_fea[fea] - corr_select_fea_min
113 |         print(corr_select_fea['corr_norm_{}'.format(fea)].values[0])
114 |         t_sum += corr_select_fea['corr_norm_{}'.format(fea)].values[0]
115 | 
116 |     for fea in features:
117 |         corr_select_fea['corr_norm_{}'.format(fea)] = corr_select_fea['corr_norm_{}'.format(fea)] / t_sum
118 |     forecast_df['corr_forecast'] = 0
119 |     for fea in features:
120 |         forecast_df['corr_forecast'] += corr_select_fea['corr_norm_{}'.format(fea)].values[0] * forecast_df[fea]
121 |         forecast_df['corr_norm_{}'.format(fea)] = corr_select_fea['corr_norm_{}'.format(fea)].values[0]
122 |     return forecast_df
123 | 
124 | 
125 | forecast_df = simple_avg_forecast(forecast_df)
126 | forecast_df = weight_avg_forecast(forecast_df)
127 | forecast_df = lasso_comb_forecast(forecast_df)
128 | forecast_df = corr_comb_forecast(forecast_df)
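
The weighting rule inside `weight_avg_forecast` gives each model a weight proportional to how far its error lies below the worst error, so the worst model gets weight 0 and the weights sum to 1. The following is a minimal NumPy sketch of that rule for a single row. The helper name `error_based_weights`, the sample numbers, and the equal-weight fallback for the case where all errors are identical are assumptions added for illustration (the script itself would divide by zero in that case).

```python
import numpy as np

def error_based_weights(errors):
    """Map per-model errors to weights: worst model gets 0, weights sum to 1."""
    errors = np.asarray(errors, dtype=float)
    gaps = errors.max() - errors        # a larger gap means a smaller error
    total = gaps.sum()
    if total == 0:
        # Assumed fallback: all models are equally bad, so weight them equally.
        return np.full(len(errors), 1.0 / len(errors))
    return gaps / total

# Hypothetical absolute percentage errors and forecasts of three models on one row.
mape_per_model = [10.0, 20.0, 30.0]
forecasts = np.array([105.0, 90.0, 130.0])

weights = error_based_weights(mape_per_model)
combined = float(np.dot(weights, forecasts))
print(weights, combined)
```

Because the errors are computed against the true value `y`, weights obtained this way are only meaningful for forecasting when they are estimated on a held-out period and then applied to later dates.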


--------------------------------------------------------------------------------
/demand_forecasting.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fitzgerald0/time_series_data_mining/22c55480988d893c7f545d1698b6797beac82e9e/demand_forecasting.jpg


--------------------------------------------------------------------------------
/multi_prophet_model.py:
--------------------------------------------------------------------------------
  1 | 
  2 | # -*- coding:utf-8 -*-
  3 | """
  4 | Name  : multi_prophet_model.py
  5 | Time  : 2020/4/28 14:11
  6 | Author : hjs
  7 | """
  8 | 
  9 | 
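 10 | """
 11 | Prophet forecasting for many series at once, for example 100,000 SKU series.
 12 | A personal framework that covers reading from the database, preprocessing, backtesting, and forecasting.
 13 | It backtests on the most recent 7 days and forecasts the next 28 days.
 14 | 
 15 | """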
 16 | import gc
 17 | from dateutil.relativedelta import relativedelta
 18 | from fbprophet import Prophet
 19 | import pandas as pd
 20 | import numpy as np
 21 | import time
 22 | import datetime
 23 | import os
 24 | from holiday_data import holiday_df  # user-defined holiday data
 25 | from joblib import Parallel, delayed
 26 | from multiprocessing import cpu_count
 27 | import warnings
 28 | warnings.simplefilter(action='ignore', category=FutureWarning)
 29 | warnings.filterwarnings('ignore')
 30 | # global setting: number of CPU cores to use (at least 1, since joblib rejects n_jobs=0)
 31 | use_cpu = max(1, cpu_count() // 4)
 32 | 
 33 | def sale_ds():
 34 |     
 35 |     sql = """select * from scmtemp.csh_scode1_29_dateset"""
 36 |     hive_conn = conn_hive()  # user-defined Hive connection helper, withheld for privacy reasons
 37 |     df = pd.read_sql(sql, hive_conn)
 38 |     df.columns = [col.lower().split('.')[-1] for col in df.columns]
 39 |     df.drop_duplicates(subset=['store_code', 'goods_code', 'ds'], inplace=True)
 40 |     df['store_code'] = df['store_code'].astype(str)
 41 |     df = df[df['store_code'] != 'None']
 42 |     df['store_sku'] = df['store_code'].astype(
 43 |         str) + '-' + df['goods_code'].astype(str)
 44 |     df.drop(columns=['store_code', 'goods_code'], inplace=True)
 45 |     df.rename(columns={'qty_fix': 'y'}, inplace=True)
 46 |     print('finish the data reading....')
 47 |     return df
 48 | 
 49 | def replace_fill(data, name):
 50 | 
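 51 |     """
 52 |     Fill gaps with the value from the previous week first, then handle extreme values, to keep the series complete and smooth.
 53 |     :param data: a single series
 54 |     :param name: the series name (store_sku)
 55 |     :return: the repaired series
 56 |     """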
 57 |     data['ds'] = pd.to_datetime(data['ds'], format='%Y-%m-%d')
 58 |     data = data.sort_values('ds')  # the shifts below assume one row per day in time order
 59 |     data['y'] = data['y'].astype(float)
 60 |     data.loc[data['y'] <= 0, 'y'] = np.nan
 61 |     # fill each gap from the same weekday one or two weeks earlier or later
 62 |     for lag in (7, -7, -14, 14):
 63 |         data['y'] = data['y'].fillna(data['y'].shift(lag))
 64 |     data['y'] = data['y'].interpolate(method='nearest')
 65 |     low = data[data['y'] > 0]['y'].quantile(0.10)
 66 |     high = data[data['y'] > 0]['y'].quantile(0.90)
 67 |     data.loc[data['y'] < low, 'y'] = np.nan
 68 |     data.loc[data['y'] > high, 'y'] = np.nan
 69 |     data['y'] = data['y'].fillna(data['y'].mean())
 70 |     data['store_sku'] = name
 71 |     return data
 72 | 
 73 | 
 74 | def multi_fill(data):
 75 |     start_time = time.time()
 76 |     data['store_sku'] = data['store_sku'].astype(str)
 77 |     data_grouped = data.groupby(data.store_sku)
 78 |     results = Parallel(
 79 |         n_jobs=use_cpu)(
 80 |         delayed(replace_fill)(
 81 |             group,
 82 |             name) for name,
 83 |         group in data_grouped)
 84 |     p_predict = pd.concat(results)
 85 |     end_time = time.time()
 86 |     del data
 87 |     gc.collect()
 88 |     print('read data end etl have use {} minutes'.format(
 89 |         round((end_time - start_time) / 60, 2)))
 90 |     return p_predict
 91 | 
 92 | 
 93 | def predict_cap(data, result, columns):
 94 |     """
 95 |     :param data: the cleaned input data
 96 |     :param result: the forecasts
 97 |     :param columns: name of the forecast column
 98 |     :return: the forecasts, with each series clipped to bounds derived from the input data
 99 |     """
100 |     data_list = set(result['store_sku'].unique())
101 |     data_df = data[data['store_sku'].isin(data_list)][['store_sku', 'y']]
102 |     for i in data_df['store_sku'].unique():
103 |         low = (1 + 0.1) * data_df[data_df['store_sku'] == i]['y'].min()
104 |         hight = (1 + 0.05) * data_df[data_df['store_sku'] == i]['y'].max()
105 |         result.loc[(result['store_sku'] == i) & (
106 |             result[columns] < low), columns] = low
107 |         result.loc[(result['store_sku'] == i) & (
108 |             result[columns] > hight), columns] = hight
109 |     return result
110 | 
111 | 
112 | def data_tranform(data):
113 |     """
114 |     :param data: data for all series
115 |     :return: the processed data, e.g. after setting a particular day to 0 and applying a log transform
116 |     """
117 |     data = data[['store_sku', 'ds', 'y']].copy()
118 |     data.drop_duplicates(subset=['store_sku', 'ds'], inplace=True)
119 |     data.sort_values(['store_sku', 'ds'], ascending=[
120 |                      True, True], inplace=True)
121 |     data['ds'] = data['ds'].astype(str)
122 |     data['ds'] = data['ds'].apply(
123 |         lambda x: datetime.datetime.strptime(
124 |             x, "%Y-%m-%d"))
125 |     # fill missing values from the same weekday in neighbouring weeks, within each series
126 |     for lag in (7, -7, -14):
127 |         data['y'] = data['y'].fillna(data.groupby('store_sku')['y'].shift(lag))
128 |     data['y'] = np.log1p(data['y'])
129 |     data = data.dropna(axis=0)
130 |     return data
131 | 
132 | 
133 | def prophet_train(data, name, holiday_df, model_type='test'):
134 |     # model_type 'test' runs a backtest; any other value forecasts future dates
135 |     model = Prophet(
136 |         daily_seasonality=False,
137 |         yearly_seasonality=True,
138 |         holidays=holiday_df,
139 |         holidays_prior_scale=10)
140 |     model.add_seasonality(
141 |         name='weekly',
142 |         period=7,
143 |         fourier_order=3,
144 |         prior_scale=0.10
145 |         # ,mode='additive'
146 |     )
147 |     if model_type == 'test':
148 |         data_train = data.iloc[:-7]  # hold out the last 7 days for the backtest
149 |         model.fit(data_train)
150 |         future = model.make_future_dataframe(periods=7, freq='d')
151 |     else:
152 |         model.fit(data)
153 |         future = model.make_future_dataframe(periods=7 * 4, freq='d')
154 |     forecast = model.predict(future)
155 |     forecast['store_sku'] = name
156 |     print('---the running id is: {0} ---'.format(name))
157 |     return forecast
158 | 
159 | 
160 | def multi_process(data, holiday_df, model_type):
161 |     data['store_sku'] = data['store_sku'].astype(str)
162 |     data_grouped = data.groupby(data.store_sku)
163 |     results = Parallel(
164 |         n_jobs=use_cpu)(
165 |         delayed(prophet_train)(
166 |             group,
167 |             name,
168 |             holiday_df,
169 |             model_type) for name,
170 |         group in data_grouped)
171 |     p_predict = pd.concat(results)
172 |     return p_predict
173 | 
174 | 
175 | def prophet_main(data, holiday_df_, model_type, true_time=False):
176 |     start = time.time()
177 |     if true_time is False:
178 |         true_time = datetime.datetime.now().strftime('%Y-%m-%d')
179 |     else:
180 |         true_time = datetime.datetime.strptime(true_time, "%Y-%m-%d")
181 |         true_time = str(
182 |             (true_time +
183 |              datetime.timedelta(
184 |                  days=7)).strftime('%Y-%m-%d'))
185 | 
186 |     df = data_tranform(data)
187 |     df['ds'] = pd.to_datetime(df['ds'], format='%Y-%m-%d')
188 |     df = df[df['ds'] < true_time]
189 |     df['ds'] = df['ds'].astype(str)
190 |     df['ds'] = pd.to_datetime(df['ds'])
191 |     holiday_df_['ds'] = pd.to_datetime(holiday_df_['ds'])
192 |     holiday_df_['ds'] = holiday_df_['ds'].astype(str)
193 |     holiday_df_['ds'] = holiday_df_['ds'].apply(
194 |         lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
195 |     # parallel
196 |     pro_back = multi_process(df, holiday_df_, model_type)
197 |     if model_type == 'test':
198 |         pro_back = pd.merge(
199 |             df, pro_back, on=[
200 |                 'store_sku', 'ds'], how='inner')
201 |     else:
202 |         print('this is forecast model!')
203 |     pro_back.rename(columns={'yhat': 'pro_pred'}, inplace=True)
204 |     pro_back['pro_pred'] = np.expm1(pro_back['pro_pred'])
205 |     # Cap outliers
206 |     pro_back.loc[pro_back['pro_pred'] < 0, 'pro_pred'] = 0
207 |     pro_back_adj = predict_cap(df, pro_back, 'pro_pred')
208 |     low = pro_back_adj['pro_pred'].quantile(0.05)
209 |     high = pro_back_adj['pro_pred'].quantile(0.95)
210 |     pro_back_adj.loc[pro_back_adj['pro_pred'] < low, 'pro_pred'] = low
211 |     pro_back_adj.loc[pro_back_adj['pro_pred'] > high, 'pro_pred'] = high
212 |     pro_back_adj['pro_pred'] = pro_back_adj['pro_pred'].round(2)
213 |     if model_type == 'test':
214 |         today_date = str(
215 |             (datetime.datetime.strptime(
216 |                 true_time,
217 |                 "%Y-%m-%d") -
218 |                 datetime.timedelta(
219 |                 days=7)).strftime('%Y-%m-%d'))
220 |         back_result = pro_back_adj[pro_back_adj['ds'] >= today_date][[
221 |             'store_sku', 'ds', 'pro_pred']].drop_duplicates(subset=['store_sku', 'ds'])
222 |     else:
223 |         result = pro_back_adj[pro_back_adj['ds'] >= true_time]
224 |         back_result = result[['store_sku', 'ds', 'pro_pred']
225 |                              ].drop_duplicates(subset=['store_sku', 'ds'])
226 |         print(
227 |             'prophet model use : {} minutes'.format(
228 |                 (time.time() - start) // 60))
229 |     return back_result
230 | 
231 | 
232 | if __name__ == '__main__':
233 |     sale_df = sale_ds()
234 |     sale_df['ds'] = pd.to_datetime(sale_df['ds'])
235 |     sale_df = sale_df[['store_sku', 'ds', 'y']]
236 |     # Limit the history length: exclude pandemic-period data; a long history is unnecessary, the most recent complete cycles are enough
237 |     start_day = (
238 |         sale_df['ds'].max() -
239 |         relativedelta(
240 |             days=63)).strftime('%Y-%m-%d')
241 |     sale_df = sale_df[sale_df['ds'] >= start_day][['store_sku', 'ds', 'y']]
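242 |     # Filter: the series has at least 14 observations and its total sales exceed 7;
243 |     # condition 1 guarantees two complete weekly cycles for the model;
244 |     # condition 2 excludes very sparse series such as 0,0,0,0,0,0,1,0,1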
245 |     sale_set = sale_df.groupby(
246 |         ['store_sku']).filter(
247 |         lambda x: len(x) >= 14 and np.sum(
248 |             x['y']) > 7)
249 |     print('min date is {},max date is {}'.format(
250 |         sale_set['ds'].min(), sale_set['ds'].max()))
251 |     sale_data = multi_fill(sale_set)
252 |     holiday_df_ = holiday_df()
253 |     # Backtest the most recent 7 days
254 |     model_type = 'test'
255 |     # Backtest start date; if False, the backtest starts 7 days before today
256 |     true_time = False
257 |     pro_mape = prophet_main(sale_data, holiday_df_, model_type, true_time)
258 |     pro_mape = pd.merge(
259 |         pro_mape, sale_set, on=[
260 |             'store_sku', 'ds'], how='inner')
261 |     pro_mape['mape'] = np.abs(
262 |         pro_mape['y'] - pro_mape['pro_pred']) / pro_mape['y'] * 100
263 |     print('mape------', pro_mape['mape'].mean())
264 |     pro_mape.to_excel('pro_mape_428.xlsx', index=False)
265 | 
266 |     # The code below forecasts the next 28 days
267 |     model_type = 'train'
268 |     prophet_forecast = prophet_main(
269 |          sale_data, holiday_df_, model_type, true_time)
270 |     prophet_forecast.to_excel('prophet_forecast_428.xlsx', index=False)
271 | 
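
The backtest above computes MAPE by dividing by `y`, which produces `inf` whenever an actual value is zero. The sketch below shows one possible way to guard against that, assuming it is acceptable to skip rows with zero actuals. `safe_mape` is a hypothetical helper and is not part of this repository.

```python
def safe_mape(y_true, y_pred):
    """Mean absolute percentage error over rows whose actual value is non-zero.

    Hypothetical helper for illustration; returns None when no row qualifies.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        return None
    return sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs) * 100


actual = [10.0, 0.0, 20.0]
predicted = [8.0, 3.0, 25.0]
# The zero actual is skipped; the remaining relative errors are 20% and 25%.
print(safe_mape(actual, predicted))
```

Skipping zero rows is only one convention; a weighted error such as total absolute error divided by total sales may suit sparse series better.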


--------------------------------------------------------------------------------
/pyspark_sale_forecast/script/README:
--------------------------------------------------------------------------------
1 | 
2 | Tips:
3 | 
4 | The database is fake.
5 | Opinions and suggestions are welcome.
6 | 
7 | 


--------------------------------------------------------------------------------
/pyspark_sale_forecast/script/pyspark_ml_predict.py:
--------------------------------------------------------------------------------
  1 | #-*- coding:utf-8 -*-
  2 | """
  3 | Name  : pyspark_ml_predict.py
  4 | Time  : 2020/6/20 17:36
  5 | Author : hjs
  6 | """
  7 | 
  8 | from pyspark.ml.feature import OneHotEncoder
  9 | from pyspark.ml.feature import StringIndexer
 10 | from pyspark.ml.regression import LinearRegression
 11 | from pyspark.ml.feature import VectorAssembler
 12 | from pyspark.sql.types import *
 13 | from pyspark.sql.functions import *
 14 | from pyspark.sql import SparkSession
 15 | import datetime
 16 | import warnings
 17 | warnings.simplefilter(action='ignore', category=FutureWarning)
 18 | warnings.filterwarnings('ignore')
 19 | spark = SparkSession. \
 20 |     Builder(). \
 21 |     config("spark.sql.crossJoin.enabled", "true"). \
 22 |     config("spark.sql.execution.arrow.enabled", "false"). \
 23 |     enableHiveSupport(). \
 24 |     getOrCreate()
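 25 | today = (datetime.datetime.today()).strftime('%Y-%m-%d')
 26 | prev28_day = (datetime.datetime.today() - datetime.timedelta(days=28)).strftime('%Y-%m-%d')  # assumed: start of the 28-day training window
 27 | after7_day = (datetime.datetime.today() + datetime.timedelta(days=7)).strftime('%Y-%m-%d')  # assumed: exclusive end of the 7-day forecast window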
 28 | df = spark.sql(f"""
 29 |     select store_code,goods_code,ds,qty
 30 | 	from xxx.temp_store_sale
 31 | 	where ds>='{prev28_day}' and ds<'{today}'
 32 |     union
 33 |     select s.store_code,s.goods_code,d.ds,0 as qty
 34 |     from
 35 |     (select stat_date as ds from xxx.dim_date where stat_date<'{after7_day}' and stat_date>='{today}') d
 36 |     join
 37 |     (select
 38 |     distinct
 39 |     store_code,goods_code
 40 |     from sxxx.temp_store_sale
 41 |     ) s on 1=1""")
 42 | 
 43 | # Read the best parameters
 44 | best_param_set=spark.sql("select regparam,fitIntercept, elasticNetParam from xxx.regression_model_best_param order by update_date desc,update_time desc limit 1 ")
 45 | 
 46 | reg_vec=best_param_set.select('regparam').collect()
 47 | reg_b= [row[0] for row in reg_vec][0]
 48 | reg_b=float(reg_b)
 49 | 
 50 | inter_vec =best_param_set.select('fitIntercept').collect()
 51 | inter_b = [row[0] for row in inter_vec][0]
 52 | # str or bool --> bool
 53 | if str(inter_b).lower()=='false':
 54 |     inter_b=False
 55 | else:
 56 |     inter_b=True
 57 | 
 58 | elastic_vec =best_param_set.select('elasticNetParam').collect()
 59 | elastic_b = [row[0] for row in elastic_vec][0]
 60 | elastic_b=float(elastic_b)
 61 | 
 62 | 
 63 | # Feature engineering
 64 | 
 65 | df=df.withColumn('dayofweek',dayofweek('ds'))
 66 | df = df.withColumn("dayofweek", df["dayofweek"].cast(StringType()))
 67 | 
 68 | # Month-end flag
 69 | df=df.withColumn('day',dayofmonth('ds'))
 70 | df = df.withColumn('day', df["day"].cast(StringType()))
 71 | df = df.withColumn('month_end',when(df['day'] <=25,0).otherwise(1))
 72 | 
 73 | # Day-of-week encoding: convert the day of week into 0/1 dummy variables
 74 | dayofweek_ind = StringIndexer(inputCol='dayofweek', outputCol='dayofweek_index')
 75 | dayofweek_ind_model = dayofweek_ind.fit(df)
 76 | dayofweek_ind_ = dayofweek_ind_model.transform(df)
 77 | onehotencoder = OneHotEncoder(inputCol='dayofweek_index', outputCol='dayofweek_Vec')
 78 | df = onehotencoder.transform(dayofweek_ind_)
 79 | 
 80 | 
 81 | inputCols=[
 82 | "dayofweek_Vec",
 83 | "month_end"]
 84 | 
 85 | assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
 86 | 
 87 | 
 88 | # Split the dataset manually with where
 89 | train_data=df.where(df['ds'] <today)
 90 | test_data=df.where(df['ds'] >=today)
 91 | 
 92 | train_mod01 = assembler.transform(train_data)
 93 | train_mod02 = train_mod01.selectExpr("features","qty as label")
 94 | 
 95 | test_mod01 = assembler.transform(test_data)
 96 | test_mod02 = test_mod01.select("store_code","goods_code","ds","features")
 97 | 
 98 | # build and train the model
 99 | lr = LinearRegression(maxIter=100,regParam=reg_b, fitIntercept=inter_b,elasticNetParam=elastic_b, solver="normal")
100 | model = lr.fit(train_mod02)
101 | 
102 | 
103 | # predict
104 | predictions = model.transform(test_mod02)
105 | print('print the schema')
106 | predictions.printSchema()
107 | predictions.select("store_code","goods_code","ds","prediction").show(5)
108 | print('predictions shape'+str(predictions.count()))
109 | 
110 | test_store_predict=predictions.select("store_code","goods_code","ds","prediction").createOrReplaceTempView('test_store_predict')
111 | spark.sql(f"""create table xxx.regression_test_store_predict as select * from test_store_predict""")
112 | 
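
The two features used by this script (day of week and a month-end flag) can be reproduced in plain Python to check the encoding by hand. The sketch below assumes Spark's `dayofweek` convention, where 1 is Sunday and 7 is Saturday, and the script's rule that days after the 25th count as month end. `calendar_features` is a hypothetical helper, not part of the repository.

```python
import datetime


def calendar_features(day):
    """Return (dayofweek, month_end) for a date, mirroring the Spark features.

    dayofweek follows Spark's convention: 1 = Sunday ... 7 = Saturday.
    month_end is 0 for days 1-25 and 1 afterwards, as in the script.
    """
    # isoweekday() is 1 = Monday ... 7 = Sunday; rotate it to Spark's numbering.
    dayofweek = day.isoweekday() % 7 + 1
    month_end = 0 if day.day <= 25 else 1
    return dayofweek, month_end


for d in (datetime.date(2020, 6, 21), datetime.date(2020, 6, 27)):
    print(d, calendar_features(d))
```

Checking a few known dates this way makes it easier to interpret the one-hot vector, since `StringIndexer` orders categories by frequency, not by weekday.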


--------------------------------------------------------------------------------
/pyspark_sale_forecast/script/pyspark_ml_tune.py:
--------------------------------------------------------------------------------
  1 | #-*- coding:utf-8 -*-
  2 | """
  3 | Name  : pyspark_ml_tune.py
  4 | Time  : 2020/6/20 17:26
  5 | Author : hjs
  6 | """
  7 | 
  8 | """
  9 | 
 10 | Tune the best parameters for the linear model
 11 | 
 12 | """
 13 | 
 14 | import pandas as pd
 15 | import datetime
 16 | from pyspark.sql import SparkSession
 17 | from pyspark.ml.feature import OneHotEncoder
 18 | from pyspark.ml.feature import StringIndexer
 19 | from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
 20 | from pyspark.ml.evaluation import RegressionEvaluator
 21 | from pyspark.ml.regression import LinearRegression
 22 | from pyspark.ml.feature import VectorAssembler
 23 | from pyspark.sql.functions import *
 24 | from pyspark.sql.types import *
 25 | 
 26 | import warnings
 27 | warnings.simplefilter(action='ignore', category=FutureWarning)
 28 | 
 29 | # Initialize Spark
 30 | spark = SparkSession. \
 31 |     Builder(). \
 32 |     config("spark.sql.crossJoin.enabled", "true"). \
 33 |     config("spark.sql.execution.arrow.enabled", "false"). \
 34 |     enableHiveSupport(). \
 35 |     getOrCreate()
 36 | 
 37 | today = (datetime.datetime.today()).strftime('%Y-%m-%d')
 38 | update_time = datetime.datetime.now().strftime('%H:%M:%S')
 39 | 
 40 | # Read the data with spark.sql
 41 | df = spark.sql("""
 42 | select store_code,goods_code,ds,qty as label
 43 | from xxx.temp_store_sale
 44 | where ds>='2020-05-22'
 45 | """)
 46 | 
 47 | # The data is now a Spark DataFrame and is manipulated with SQL-like operations
 48 | df = df.withColumn('dayofweek', dayofweek('ds'))
 49 | df = df.withColumn("dayofweek", df["dayofweek"].cast(StringType()))
 50 | # Month-end flag
 51 | df = df.withColumn('day', dayofmonth('ds'))
 52 | df = df.withColumn('day', df["day"].cast(StringType()))
 53 | df = df.withColumn('month_end', when(df['day'] <= 25, 0).otherwise(1))
 54 | 
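 55 | # Day-of-week encoding: convert the day of week into 0/1 dummy variables; Spark's dayofweek returns 1 (Sunday) through 7 (Saturday)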
 56 | dayofweek_ind = StringIndexer(inputCol='dayofweek', outputCol='dayofweek_index')
 57 | dayofweek_ind_model = dayofweek_ind.fit(df)
 58 | dayofweek_ind_ = dayofweek_ind_model.transform(df)
 59 | onehotencoder = OneHotEncoder(inputCol='dayofweek_index', outputCol='dayofweek_Vec')
 60 | df = onehotencoder.transform(dayofweek_ind_)
 61 | 
 62 | 
 63 | # The dayofweek_Vec column produced here is a vector
 64 | inputCols = [
 65 |     "dayofweek_Vec",
 66 |     "month_end"]
 67 | 
 68 | 
 69 | assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
 70 | # Split the dataset; this is a random split that ignores time order
 71 | train_data_1, test_data_1 = df.randomSplit([0.7, 0.3])
 72 | train_data=assembler.transform(train_data_1)
 73 | test_data = assembler.transform(test_data_1)
 74 | 
 75 | 
 76 | lr_params = ({'regParam': 0.00}, {'fitIntercept': True}, {'elasticNetParam': 0.5})
 77 | 
 78 | lr = LinearRegression(maxIter=100, regParam=lr_params[0]['regParam'], \
 79 |                       fitIntercept=lr_params[1]['fitIntercept'], \
 80 |                       elasticNetParam=lr_params[2]['elasticNetParam'])
 81 | 
 82 | lrParamGrid = ParamGridBuilder() \
 83 |     .addGrid(lr.regParam, [0.005, 0.01, 0.1, 0.5]) \
 84 |     .addGrid(lr.fitIntercept, [False, True]) \
 85 |     .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.5, 1.0]) \
 86 |     .build()
 87 | 
 88 | model = lr.fit(train_data)
 89 | pred = model.evaluate(test_data)
 90 | 
 91 | # Model evaluation before tuning
 92 | eval = RegressionEvaluator(
 93 |     labelCol="label",
 94 |     predictionCol="prediction",
 95 |     metricName="mae")
 96 | mae = eval.evaluate(pred.predictions, {eval.metricName: "mae"})
 97 | r2 = eval.evaluate(pred.predictions, {eval.metricName: "r2"})
 98 | 
 99 | # No prediction is needed this time, so it is commented out
100 | #predictions = model.transform(test_data)
101 | #predictions.printSchema()
102 | 
103 | cross_valid = CrossValidator(estimator=lr, estimatorParamMaps=lrParamGrid, evaluator=RegressionEvaluator(),
104 |                           numFolds=5)
105 | 
106 | cvModel = cross_valid.fit(train_data)
107 | 
108 | best_parameters = [(
109 |     [{key.name: paramValue} for key, paramValue in zip(params.keys(), params.values())], metric) \
110 |     for params, metric in zip(
111 |         cvModel.getEstimatorParamMaps(),
112 |         cvModel.avgMetrics)]
113 | 
114 | lr_best_params = sorted(best_parameters, key=lambda el: el[1])[0]  # the default metric is rmse, so the smallest average wins
115 | 
116 | 
117 | # Use pd.DataFrame to turn the key parameters above into structured data
118 | pd_best_params = pd.DataFrame({
119 |     'regParam':[lr_best_params[0][0]['regParam']],
120 |     'fitIntercept':[lr_best_params[0][1]['fitIntercept']],
121 |     'elasticNetParam':[lr_best_params[0][2]['elasticNetParam']]
122 | }
123 | )
124 | 
125 | pd_best_params['update_date'] = today
126 | pd_best_params['update_time'] = update_time
127 | pd_best_params['model_type'] = 'linear'
128 | 
129 | # Retrain the model with the best parameters
130 | lr = LinearRegression(maxIter=100, regParam=lr_best_params[0][0]['regParam'], \
131 |                       fitIntercept=lr_best_params[0][1]['fitIntercept'], \
132 |                       elasticNetParam=lr_best_params[0][2]['elasticNetParam'])
133 | 
134 | model = lr.fit(train_data)
135 | 
136 | pred = model.evaluate(test_data)
137 | 
138 | eval = RegressionEvaluator(
139 |     labelCol="label",
140 |     predictionCol="prediction",
141 |     metricName="mae")
142 | 
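143 | # Regression models have many evaluation metrics; R-squared checks the goodness of fit, while this time series forecast cares more about MAE, so both are reported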
144 | r2_tune = eval.evaluate(pred.predictions, {eval.metricName: "r2"})
145 | mae_tune = eval.evaluate(pred.predictions, {eval.metricName: "mae"})
146 | 
147 | pd_best_params['mae_value'] = str(mae_tune)
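148 | # Although mae_value is a str in pd_best_params, its type is inferred during the write,
149 | # so it shows up as a float when viewed in Hive
150 | # pd.DataFrame --> Spark DataFrame, then appended to a Hive table; the best parameters stored there are used by the prediction script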
151 | spark.createDataFrame(pd_best_params).write.mode("append").format('hive').saveAsTable(
152 |     'xxx.regression_model_best_param')
153 | 
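
When picking the winner from `cvModel.avgMetrics`, the sort direction has to match the metric: RMSE and MAE are better when smaller, R-squared when larger. The sketch below shows that selection on plain Python data. `pick_best` is a hypothetical helper; in PySpark the direction can be read from the evaluator's `isLargerBetter()` method.

```python
def pick_best(param_maps, avg_metrics, larger_is_better=False):
    """Return the (params, metric) pair with the best average metric.

    Hypothetical helper: param_maps and avg_metrics are parallel sequences,
    like getEstimatorParamMaps() and avgMetrics on a fitted CrossValidatorModel.
    """
    scored = list(zip(param_maps, avg_metrics))
    if not scored:
        raise ValueError("no parameter combinations were evaluated")
    chooser = max if larger_is_better else min
    return chooser(scored, key=lambda pair: pair[1])


grid = [{"regParam": 0.005}, {"regParam": 0.01}, {"regParam": 0.1}]
rmse = [3.2, 2.9, 3.0]
# RMSE is an error, so the smallest value identifies the best setting.
print(pick_best(grid, rmse))
```

Passing `larger_is_better=True` covers metrics such as R-squared without changing the rest of the code.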


--------------------------------------------------------------------------------
/sale_df.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fitzgerald0/time_series_data_mining/22c55480988d893c7f545d1698b6797beac82e9e/sale_df.xlsx


--------------------------------------------------------------------------------
/sarimax_model.py:
--------------------------------------------------------------------------------
  1 | #-*- coding:utf-8 -*-
  2 | """
  3 | Name  : sarimax_model.py
  4 | Time  : 2019/9/8 18:17
  5 | Author : hjs
  6 | """
  7 | 
  8 | import time
  9 | from itertools import product
 10 | import numpy as np
 11 | import pandas as pd
 12 | from joblib import Parallel, delayed
 13 | import warnings
 14 | 
 15 | warnings.filterwarnings('ignore')
 16 | from warnings import catch_warnings, filterwarnings
 17 | from statsmodels.tsa.statespace.sarimax import SARIMAX
 18 | 
 19 | 
 20 | # Take the data and parameters, return the model forecast
 21 | def model_forecast(history, config):
 22 |     order, sorder, trend = config
 23 |     model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False,
 24 |                     enforce_invertibility=False)
 25 |     model_fit = model.fit(disp=False)
 26 |     yhat = model_fit.predict(len(history), len(history))
 27 |     return yhat[0]
 28 | 
 29 | 
 30 | # Evaluation metric: MAPE
 31 | def mape(y_true, y_pred):
 32 |     y_true, y_pred = np.array(y_true), np.array(y_pred)
 33 |     return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
 34 | 
 35 | 
 36 | # Split into training and test sets
 37 | def train_test_split(data, n_test):
 38 |     return data[:-n_test], data[-n_test:]
 39 | 
 40 | 
 41 | # One-step rolling forward forecast
 42 | def forward_valid(data, n_test, cfg):
 43 |     predictions = list()
 44 |     train, test = train_test_split(data, n_test)
 45 |     history = [x for x in train]
 46 |     for i in range(len(test)):
 47 |         yhat = model_forecast(history, cfg)
 48 |         predictions.append(yhat)
 49 |         history.append(test[i])
 50 |     error = mape(test, predictions)
 51 |     return error
 52 | 
 53 | 
 54 | # Model evaluation
 55 | def score_model(data, n_test, cfg, debug=False):
 56 |     result = None
 57 |     key = str(cfg)
 58 |     if debug:
 59 |         result = forward_valid(data, n_test, cfg)
 60 |     else:
 61 |         try:
 62 |             with catch_warnings():
 63 |                 filterwarnings("ignore")
 64 |                 result = forward_valid(data, n_test, cfg)
 65 |         except Exception:
 66 |             result = None
 67 | 
 68 |     return (key, result)
 69 | 
 70 | 
 71 | # Grid search
 72 | def grid_search(data, cfg_list, n_test, parallel=True):
 73 |     scores = None
 74 |     if parallel:
 75 |         # Run in parallel across processes, using all CPU cores
 76 |         executor = Parallel(n_jobs=-1, backend='multiprocessing')
 77 |         tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
 78 |         scores = executor(tasks)
 79 | 
 80 |     else:
 81 |         scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
 82 |     scores = [r for r in scores if r[1] is not None]
 83 |     scores.sort(key=lambda x: x[1])
 84 |     return scores
 85 | 
 86 | 
 87 | # Generate the list of parameter configurations
 88 | def sarima_configs(seasonal=(0,)):
 89 |     p = d = q = [0, 1, 2]
 90 |     pdq = list(product(p, d, q))
 91 |     # one seasonal (P, D, Q, s) tuple per seasonal period s in `seasonal`
 92 |     seasonal_pdq = [(x[0], x[1], x[2], s) for s in seasonal for x in product(p, d, q)]
 93 |     t = ['n', 'c', 't', 'ct']
 94 |     return list(product(pdq, seasonal_pdq, t))
 95 | 
 96 | 
 97 | # Model training
 98 | def train_model(sale_df):
 99 |     from ast import literal_eval
100 | 
101 |     n_test = 3
102 |     rows = []
103 |     for i in sale_df['store_code'].unique():
104 |         data = sale_df[sale_df['store_code'] == i]['y'].tolist()
105 |         cfg_list = sarima_configs()
106 |         # scores is sorted by MAPE in ascending order, so scores[0] is the best (key, error) pair
107 |         scores = grid_search(data, cfg_list, n_test, parallel=True)
108 |         best_key, best_error = scores[0]
109 |         # Parse the key back into a config instead of indexing single characters,
110 |         # which would truncate the trend 'ct' to 'c' and break on multi-digit orders
111 |         (p, d, q), (P, D, Q, m), t = literal_eval(best_key)
112 |         # Column order matters: forecast_model reads these columns by position
113 |         rows.append({'store_code': i, 'map': best_error,
114 |                      'p': p, 'd': d, 'q': q,
115 |                      'P': P, 'D': D, 'Q': Q,
116 |                      'm': m, 't': t})
117 |     # Build the result once, after all stores have been searched
118 |     params_df = pd.DataFrame(
119 |         rows,
120 |         columns=['store_code', 'map', 'p', 'd', 'q', 'P', 'D', 'Q', 'm', 't'])
121 | 
122 |     return params_df
123 | 
124 | 
125 | # Model forecast
126 | def one_step_forecast(data, order, seasonal_order, t, h_fore):
127 |     predictions = list()
128 |     data = [i for i in data]
129 |     for i in range(h_fore):
130 |         model = SARIMAX(data, order=order, seasonal_order=seasonal_order, trend=t, enforce_stationarity=False,
131 |                         enforce_invertibility=False)
132 |         model_fit = model.fit(disp=False)
133 |         yhat = model_fit.predict(len(data), len(data))
134 |         data.append(yhat[0])
135 |         predictions.append(yhat[0])
136 |     return predictions
137 | 
138 | 
139 | def forecast_model(sale_df, params_df):
140 |     h_fore = 4
141 |     fore_list = []
142 |     model_id = []
143 |     for i in sale_df['store_code'].unique():
144 |         data = sale_df[sale_df['store_code'] == i]['y']
145 |         p = params_df[params_df['store_code'] == i].iloc[:, 2].values[0]
146 |         d = params_df[params_df['store_code'] == i].iloc[:, 3].values[0]
147 |         q = params_df[params_df['store_code'] == i].iloc[:, 4].values[0]
148 |         P = params_df[params_df['store_code'] == i].iloc[:, 5].values[0]
149 |         D = params_df[params_df['store_code'] == i].iloc[:, 6].values[0]
150 |         Q = params_df[params_df['store_code'] == i].iloc[:, 7].values[0]
151 |         m = params_df[params_df['store_code'] == i].iloc[:, 8].values[0]
152 |         t = params_df[params_df['store_code'] == i].iloc[:, 9].values[0]
153 |         order = (p, d, q)
154 |         seasonal_order = (P, D, Q, m)
155 |         all_fore = one_step_forecast(data, order, seasonal_order, t, h_fore)
156 |         fore_list.append(all_fore)
157 | 
158 |         # The commented lines below do multi-step forecasting; use them instead of one_step_forecast when rolling forecasts are not wanted
159 |         # model=SARIMAX(data, order=order,seasonal_order=seasonal_order,trend=t,enforce_stationarity=False,
160 |         #                                                enforce_invertibility=False)
161 |         # forecast_=model.fit(disp=-1).forecast(steps=h_fore)
162 |         # fore_list_flatten = [x for x in forecast_]
163 |         # fore_list.append(fore_list_flatten)
164 |         model_id.append(i)
165 |     df_forecast = pd.DataFrame({'store_code': model_id, 'fore': fore_list})
166 |     return df_forecast
167 | 
168 | if __name__ == '__main__':
169 |     start_time = time.time()
170 |     sale_df = pd.read_excel('/home/test01/store_forecast/sale_df.xlsx')
171 |     params_df = train_model(sale_df)
172 |     forecast_out = forecast_model(sale_df, params_df)
173 |     end_time = time.time()
174 |     use_time = (end_time - start_time) // 60
175 |     print('finish the process use', use_time, 'mins')
176 | 
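
The walk-forward scheme used by `forward_valid` can be tried without fitting SARIMAX at all by plugging in a trivial forecaster. The sketch below assumes a "last value" baseline in place of the real model. `walk_forward_mape` and `last_value` are hypothetical names used only for illustration.

```python
def walk_forward_mape(series, n_test, forecast_fn):
    """One-step walk-forward validation; returns MAPE in percent.

    forecast_fn receives the history so far and returns the next-step forecast.
    After each step the true value is appended to the history, as in forward_valid.
    n_test must be at least 1 and the test values must be non-zero.
    """
    history = list(series[:-n_test])
    test = list(series[-n_test:])
    errors = []
    for actual in test:
        predicted = forecast_fn(history)
        errors.append(abs((actual - predicted) / actual))
        history.append(actual)
    return sum(errors) / len(errors) * 100


def last_value(history):
    """Naive baseline: repeat the most recent observation."""
    return history[-1]


print(walk_forward_mape([10, 10, 10, 20, 20], n_test=2, forecast_fn=last_value))
```

A baseline like this gives a reference error: a SARIMAX configuration that cannot beat it on the same split is probably not worth keeping.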


--------------------------------------------------------------------------------
/spark_lightgbm_jar/spark_lgb_jar.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fitzgerald0/time_series_data_mining/22c55480988d893c7f545d1698b6797beac82e9e/spark_lightgbm_jar/spark_lgb_jar.rar


--------------------------------------------------------------------------------
/spark_prophet_demo.py:
--------------------------------------------------------------------------------
  1 | #-*- coding:utf-8 -*-
  2 | """
  3 | Name  : spark_prophet_demo.py
  4 | Time  : 2020/5/16 10:25
  5 | Author : hjs
  6 | """
  7 | 
  8 | 
  9 | import datetime
 10 | from dateutil.relativedelta import relativedelta
 11 | from fbprophet import Prophet
 12 | import pandas as pd
 13 | import numpy as np
 14 | import warnings
 15 | warnings.filterwarnings('ignore')
 16 | from pyspark.sql import SparkSession
 17 | from pyspark.sql.functions import pandas_udf, PandasUDFType
 18 | from pyspark.sql.types import *
 19 | 
 20 | spark = SparkSession. \
 21 |     Builder(). \
 22 |     config("spark.sql.crossJoin.enabled", "true"). \
 23 |     config("spark.sql.execution.arrow.enabled", "true"). \
 24 |     enableHiveSupport(). \
 25 |     getOrCreate()
 26 | 
 27 | 
 28 | def holiday_set():
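 29 |     """
 30 |     Update cycle: yearly
 31 |     Update method: manual
 32 |     Possible improvements:
 33 |     1. Check whether a holiday actually shows a holiday effect and configure it only if so; promotions can also be treated as holidays.
 34 |     2. If the date range is short, different holidays can be merged.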
 35 |     """
 36 |     new_year_day = pd.DataFrame({
 37 |         'holiday': 'new_year_day',
 38 |         'ds': pd.to_datetime(['2020-01-01', '2021-01-01']),
 39 |         'lower_window': -2,
 40 |         'upper_window': 1,
 41 |     })
 42 |     # Because of the pandemic, the 2020 Spring Festival holiday was unusually long
 43 |     spring_festival = pd.DataFrame({
 44 |         'holiday': 'spring_festival',
 45 |         'ds': pd.to_datetime(['2020-01-24']),
 46 |         'lower_window': -4,
 47 |         'upper_window': 14,
 48 |     })
 49 |     valentine_day = pd.DataFrame({
 50 |         'holiday': 'valentine_day',
 51 |         'ds': pd.to_datetime(['2019-02-14', '2020-02-14']),
 52 |         'lower_window': -1,
 53 |         'upper_window': 0,
 54 |     })
 55 |     tomb_sweeping = pd.DataFrame({
 56 |         'holiday': 'tomb_sweeping',
 57 |         'ds': pd.to_datetime(['2019-04-05', '2020-04-05']),
 58 |         'lower_window': 0,
 59 |         'upper_window': 1,
 60 |     })
 61 |     labour_day = pd.DataFrame({
 62 |         'holiday': 'labour_day',
 63 |         'ds': pd.to_datetime(['2019-05-01', '2020-05-01']),
 64 |         'lower_window': -1,
 65 |         'upper_window': 2,
 66 |     })
 67 |     children_day = pd.DataFrame({
 68 |         'holiday': 'children_day',
 69 |         'ds': pd.to_datetime(['2019-06-01', '2020-06-01']),
 70 |         'lower_window': -1,
 71 |         'upper_window': 0,
 72 |     })
 73 | 
 74 |     shopping_618 = pd.DataFrame({
 75 |         'holiday': 'shopping_618',
 76 |         'ds': pd.to_datetime(['2019-06-18', '2020-06-18']),
 77 |         'lower_window': 0,
 78 |         'upper_window': 1,
 79 |     })
 80 |     mid_autumn = pd.DataFrame({
 81 |         'holiday': 'mid_autumn',
 82 |         'ds': pd.to_datetime(['2019-09-13']),
 83 |         'lower_window': 0,
 84 |         'upper_window': 0,
 85 |     })
 86 |     national_day = pd.DataFrame({
 87 |         'holiday': 'national_day',
 88 |         'ds': pd.to_datetime(['2019-10-01', '2020-10-01']),
 89 |         'lower_window': -1,
 90 |         'upper_window': 6,
 91 |     })
 92 | 
 93 |     double_eleven = pd.DataFrame({
 94 |         'holiday': 'double_eleven',
 95 |         'ds': pd.to_datetime(['2019-11-11', '2020-11-11']),
 96 |         'lower_window': -1,
 97 |         'upper_window': 0,
 98 |     })
 99 |     year_sale = pd.DataFrame({
100 |         'holiday': 'year_sale',
101 |         'ds': pd.to_datetime(['2019-12-05', '2019-12-31']),
102 |         'lower_window': 0,
103 |         'upper_window': 1,
104 |     })
105 |     double_twelve = pd.DataFrame({
106 |         'holiday': 'double_twelve',
107 |         'ds': pd.to_datetime(['2019-12-12', '2020-12-12']),
108 |         'lower_window': 0,
109 |         'upper_window': 0,
110 |     })
111 | 
112 |     christmas_day = pd.DataFrame({
113 |         'holiday': 'christmas_day',
114 |         'ds': pd.to_datetime(['2019-12-25', '2020-12-25']),
115 |         'lower_window': -1,
116 |         'upper_window': 0,
117 |     })
118 | 
119 |     holidays_df = pd.concat(
120 |         (new_year_day,
121 |          spring_festival,
122 |          valentine_day,
123 |          tomb_sweeping,
124 |          labour_day,
125 |          children_day,
126 |          shopping_618,
127 |          mid_autumn,
128 |          national_day,
129 |          double_eleven,
130 |          year_sale,
131 |          double_twelve,
132 |          christmas_day))
133 | 
134 |     holidays_set = holidays_df[['ds', 'holiday',
135 |                                 'lower_window', 'upper_window']].reset_index(drop=True)
136 |     return holidays_set
137 | 
138 | 
139 | holiday_df = holiday_set()
140 | 
141 | 
142 | 
143 | def sale_ds(df):
144 |     df['ds'] = pd.to_datetime(df['ds'])
145 |     df = df[['store_sku', 'ds', 'y']]
146 |     # Limit the history length: a long history is unnecessary, the most recent complete cycles are enough
147 |     start_day = (
148 |             df['ds'].max() -
149 |             relativedelta(
150 |                 days=63)).strftime('%Y-%m-%d')
151 |     df = df[df['ds'] >= start_day][['store_sku', 'ds', 'y']]
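152 |     # Filter: the series has at least 14 observations and its total sales exceed 7;
153 |     # condition 1 guarantees two complete weekly cycles for the model;
154 |     # condition 2 excludes very sparse series such as 0,0,0,0,0,0,1,0,1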
155 |     sale_set = df.groupby(
156 |         ['store_sku']).filter(
157 |         lambda x: len(x) >= 14 and np.sum(
158 |             x['y']) > 7)
159 |     return sale_set
160 | 
161 | 
162 | def replace_fill(data):
163 |     """
164 |     First try to fill gaps with values from adjacent weeks, then cap extreme values,
165 |     so that the series stays complete and smooth.
166 |     :param data: a single series (one store_sku)
167 |     :return: the repaired series
168 |     """
169 |     data['ds'] = pd.to_datetime(data['ds'], format='%Y-%m-%d')
170 |     data['y'] = data['y'].astype(float)
171 |     data.loc[data['y'] <= 0, 'y'] = np.nan
172 |     # Fill each gap from the same weekday one or two weeks away (assumes one row per day, sorted by ds)
173 |     for lag in (7, -7, -14, 14):
174 |         data['y'] = data['y'].fillna(data['y'].shift(lag))
175 |     # Remaining gaps: nearest-neighbour interpolation (requires SciPy)
176 |     data['y'] = data['y'].interpolate(method='nearest')
177 |     low = data[data['y'] > 0]['y'].quantile(0.10)
178 |     high = data[data['y'] > 0]['y'].quantile(0.90)
179 |     data.loc[data['y'] < low, 'y'] = np.nan
180 |     data.loc[data['y'] > high, 'y'] = np.nan
181 |     data['y'] = data['y'].fillna(data['y'].mean())
182 |     data['y'] = np.log1p(data['y'])
183 |     return data
184 | 
185 | 
186 | def prophet_train(data):
187 |     model = Prophet(
188 |         daily_seasonality=False,
189 |         yearly_seasonality=False,
190 |         holidays=holiday_df,
191 |         holidays_prior_scale=10)
192 |     model.add_seasonality(
193 |         name='weekly',
194 |         period=7,
195 |         fourier_order=3,
196 |         prior_scale=0.10)
197 |     model.fit(data)
198 |     future = model.make_future_dataframe(periods=7, freq='d')
199 |     forecast = model.predict(future)
200 |     forecast['pro_pred'] = np.expm1(forecast['yhat'])
201 |     forecast_df = forecast[['ds', 'pro_pred']].copy()  # store_sku is added later in run_model
202 |     # Adjust the predictions; y was log1p-transformed, so the bounds are mapped back to the original scale
203 |     forecast_df.loc[forecast_df['pro_pred'] < 0, 'pro_pred'] = 0
204 |     low = (1 + 0.1) * np.expm1(data['y'].min())
205 |     high = min((1 + 0.05) * np.expm1(data['y'].max()), 10000)
206 |     forecast_df.loc[forecast_df['pro_pred'] < low, 'pro_pred'] = low
207 |     forecast_df.loc[forecast_df['pro_pred'] > high, 'pro_pred'] = high
208 |     return forecast_df
209 | 
210 | def prophet_main(data):
211 |     true_time = datetime.datetime.now().strftime('%Y-%m-%d')
212 |     data.dropna(inplace=True)
213 |     data['ds'] = pd.to_datetime(data['ds'])
214 |     data = data[data['ds'] < true_time]
215 |     data['ds'] = data['ds'].astype(str)
216 |     data['ds'] = pd.to_datetime(data['ds'])
217 |     # Replace outliers
218 |     data = replace_fill(data)
219 |     pro_back = prophet_train(data)
220 |     return pro_back
221 | 
222 | schema = StructType([
223 |     StructField("store_sku", StringType()),
224 |     StructField("ds", StringType()),
225 |     StructField("pro_pred", DoubleType())
226 | ])
227 | 
228 | @pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
229 | def run_model(data):
230 |     data['store_sku']=data['store_sku'].astype(str)
231 |     df = prophet_main(data)
232 |     store_sku = data['store_sku'].iloc[0]
233 |     df['store_sku']=store_sku
234 |     df['ds']=df['ds'].astype(str)
235 |     df['pro_pred']=df['pro_pred'].astype(float)
236 |     cols=['store_sku','ds','pro_pred']
237 |     return df[cols]
238 | 
239 | data = spark.sql(
240 |     """
241 |     select concat(store_code,'_',goods_code) as store_sku,qty_fix as y,ds
242 |     from scmtemp.csh_etl_predsku_store_sku_sale_fix_d""")
243 | data.createOrReplaceTempView('data')
244 | sale_predict = data.groupby(['store_sku']).apply(run_model)
245 | sale_predict.createOrReplaceTempView('test_read_data')
246 | # Save to the database
247 | spark.sql("drop table if exists scmtemp.tmp_hjs_store_sku_sale_prophet")
248 | spark.sql("create table scmtemp.tmp_hjs_store_sku_sale_prophet as select * from test_read_data")
249 | print('Forecast complete')
250 | 
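
The gap-filling idea in `replace_fill` (borrow the value from the same weekday in a nearby week) can be illustrated without pandas. The sketch below assumes a daily series stored as a list with `None` for missing days. `fill_from_lags` is a hypothetical helper that mirrors the lag order 7, -7, -14, 14 used in the script.

```python
def fill_from_lags(values, lags=(7, -7, -14, 14)):
    """Fill None entries from the same weekday in nearby weeks.

    A positive lag looks back (value from `lag` days earlier); a negative lag
    looks ahead. Lags are tried in order; entries that stay None are left
    for a later fallback such as interpolation or the mean.
    """
    filled = list(values)
    for lag in lags:
        snapshot = list(filled)  # read from the state before this pass
        for i, value in enumerate(snapshot):
            if value is not None:
                continue
            j = i - lag
            if 0 <= j < len(snapshot) and snapshot[j] is not None:
                filled[i] = snapshot[j]
    return filled


series = [float(i) for i in range(14)]
series[9] = None   # can be filled from 7 days earlier (index 2)
series[3] = None   # has no earlier week, so it is filled from index 10
print(fill_from_lags(series))
```

Trying the backward lag first keeps the fill causal whenever a previous week exists, and only falls back to future values at the start of the series.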


--------------------------------------------------------------------------------
/ts_sale.csv:
--------------------------------------------------------------------------------
 1 | sale,ym
 2 | 45.94,2017-01-01
 3 | 44.93,2017-02-01
 4 | 42.66,2017-03-01
 5 | 38.36,2017-04-01
 6 | 38.18,2017-05-01
 7 | 36.32,2017-06-01
 8 | 39.18,2017-07-01
 9 | 40.06,2017-08-01
10 | 38.16,2017-09-01
11 | 41.25,2017-10-01
12 | 39.49,2017-11-01
13 | 43.11,2017-12-01
14 | 45.0,2018-01-01
15 | 46.26,2018-02-01
16 | 48.4,2018-03-01
17 | 42.07,2018-04-01
18 | 42.93,2018-05-01
19 | 41.26,2018-06-01
20 | 42.43,2018-07-01
21 | 40.79,2018-08-01
22 | 39.34,2018-09-01
23 | 40.34,2018-10-01
24 | 38.54,2018-11-01
25 | 39.5,2018-12-01
26 | 39.19,2019-01-01
27 | 38.14,2019-02-01
28 | 


--------------------------------------------------------------------------------
/ts_sale_example.py:
--------------------------------------------------------------------------------
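  1 | # Probabilistic time-series modeling with TensorFlow Probability
  2 | 
  3 | # %matplotlib inline  (Jupyter magic; uncomment only when running in a notebook)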
  4 | import pandas as pd
  5 | import numpy as np
  6 | from matplotlib import pylab as plt
  7 | import seaborn as sns
  8 | import numpy as np
  9 | import tensorflow as tf
 10 | import tensorflow_probability as tfp
 11 | from tensorflow_probability import distributions as tfd
 12 | from tensorflow_probability import sts
 13 | 
 14 | 
 15 | # Plot the forecast against the true time series
 16 | 
 17 | def plot_forecast(x, y,
 18 |                   forecast_mean, forecast_scale, forecast_samples,
 19 |                   title, x_locator=None, x_formatter=None):
 20 |     """Plot a forecast distribution against the 'true' time series."""
 21 |     colors = sns.color_palette()
 22 |     c1, c2 = colors[0], colors[1]
 23 |     fig = plt.figure(figsize=(12, 6))
 24 |     ax = fig.add_subplot(1, 1, 1)
 25 | 
 26 |     num_steps = len(y)
 27 |     num_steps_forecast = forecast_mean.shape[-1]
 28 |     num_steps_train = num_steps - num_steps_forecast
 29 |     
 30 | 
 31 |     ax.plot(x, y, lw=2, color=c1, label='ground truth')
 32 |     forecast_steps = np.arange(
 33 |       x[num_steps_train],
 34 |       x[num_steps_train]+num_steps_forecast,
 35 |       dtype=x.dtype)
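 36 |     # Plot the sampled forecast trajectories (100 samples for the last three months)
 37 |     ax.plot(forecast_steps, forecast_samples.T, lw=1, color=c2, alpha=0.1)
 38 |     # Plot the forecast mean and a band of +/- 2 standard deviations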
 39 |     ax.plot(forecast_steps, forecast_mean, lw=2, ls='--', color='r',
 40 |            label='forecast')
 41 |     ax.fill_between(forecast_steps,
 42 |                    forecast_mean-2*forecast_scale,
 43 |                    forecast_mean+2*forecast_scale, color=c2, alpha=0.2)
 44 | 
 45 |     ymin, ymax = min(np.min(forecast_samples), np.min(y)), max(np.max(forecast_samples), np.max(y))
 46 |     yrange = ymax-ymin
 47 |     ax.set_ylim([ymin - yrange*0.1, ymax + yrange*0.1])
 48 |     ax.set_title("{}".format(title))
 49 |     ax.legend()
 50 |  
 51 |     if x_locator is not None:
 52 |         ax.xaxis.set_major_locator(x_locator)
 53 |         ax.xaxis.set_major_formatter(x_formatter)
 54 |         fig.autofmt_xdate()
 55 |         
 56 |     return fig, ax
 57 | 
 58 | # Build the model and its loss function
 59 | def cal_loss(training_data):
 60 |     
 61 |     # Reset the global default graph (TensorFlow 1.x API)
 62 |     tf.reset_default_graph()
 63 |     # Additive model: trend component
 64 |     trend = sts.LocalLinearTrend(observed_time_series=training_data)
 65 |     
 66 |     # Seasonal component
 67 |     seasonal = tfp.sts.Seasonal(
 68 |           num_seasons=12, observed_time_series=training_data)
 69 |     # The model is assembled with Sum rather than the usual fit(), because the
 70 |     # series follows an additive model: trend, seasonality, cycles and so on are summed.
 71 |     # The default prior distribution is normal
 72 |     ts_model = sts.Sum([trend, seasonal], observed_time_series=training_data)
 73 | 
 74 |     # Build the variational loss and the posteriors
 75 |     with tf.variable_scope('sts_elbo', reuse=tf.AUTO_REUSE):
 76 |         elbo_loss, variational_posteriors = tfp.sts.build_factored_variational_loss(
 77 |           ts_model,observed_time_series=training_data)
 78 |     
 79 |     return ts_model,elbo_loss,variational_posteriors
 80 | 
 81 | 
 82 | # Train the model and return the forecast distribution
 83 | def run(training_data):
 84 |     
 85 |     ts_model,elbo_loss,variational_posteriors=cal_loss(training_data)
 86 |     num_variational_steps = 401 
 87 |     num_variational_steps = int(num_variational_steps)
 88 | 
 89 |     # Train the model, using the ELBO as the variational-inference loss
 90 |     train_vi = tf.train.AdamOptimizer(0.1).minimize(elbo_loss)
 91 |     
 92 |     # Create a session through a context manager and evaluate the tensors
 93 |     with tf.Session() as sess:
 94 |         sess.run(tf.global_variables_initializer())
 95 |         for i in range(num_variational_steps):
 96 |             _, elbo_ = sess.run((train_vi, elbo_loss))
 97 |             # Log progress every 20 steps (must sit inside the loop)
 98 |             if i % 20 == 0:
 99 |                 print("step {} -ELBO {}".format(i, elbo_))
100 |         # Sample the posterior parameters
101 |         q_samples_ = sess.run({k: q.sample(3)
102 |                              for k, q in variational_posteriors.items()})
103 |         
104 |         
105 |         print("Variational-inference parameter summary:")
106 |         for param in ts_model.parameters:
107 |             print("{}: {} +- {}".format(param.name,
108 |                       np.mean(q_samples_[param.name], axis=0),
109 |                       np.std(q_samples_[param.name], axis=0)))
110 | 
111 |     data_t_dist = tfp.sts.forecast(ts_model,observed_time_series=training_data,\
112 |                                    parameter_samples=q_samples_,num_steps_forecast=num_forecast_steps)
113 |     return  data_t_dist
114 | 
115 | 
116 | # Forecast
117 | def forecast(training_data):
118 |     data_t_dist=run(training_data)
119 |     with tf.Session() as sess:
120 |         data_t_mean, data_t_scale, data_t_samples = sess.run(
121 |           (data_t_dist.mean()[..., 0],
122 |            data_t_dist.stddev()[..., 0],
123 |            data_t_dist.sample(num_samples)[..., 0]))
124 |         
125 |     return data_t_mean,data_t_scale, data_t_samples
126 | 
127 | 
128 | # Backtest: compute the MAPE on the held-out steps
129 | def get_mape(data_t,forecast_mean):
130 |     true_=data_t[-num_forecast_steps:]
131 |     true_=true_.iloc[:,-1]
132 |     true_=true_.reset_index()
133 |     forecast_df=pd.DataFrame(forecast_mean,columns=['forecast'])
134 |     mape_=pd.concat([pd.DataFrame(true_),forecast_df],axis=1)
135 |     mape_['mape']=abs(mape_.iloc[:,-2]-mape_.iloc[:,-1])/mape_.iloc[:,-2]*100
136 |     return mape_
137 | 
138 | 
139 | 
140 | if __name__ == '__main__':
141 |     
142 |     # Load the dataset
143 | 
144 |     data_t=pd.read_csv("../input/ts_sale.csv")
145 |     data_t=data_t[['sale','ym']]
146 |     data_t=data_t.set_index('ym')
147 |     #data_t.to_csv('/input/ts_sale')
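148 |     print('series length',len(data_t))
149 |     
150 |     # Hyperparameters
151 |     num_forecast_steps =3 # hold out the last three months as the forecast horizon, to compute the backtest MAPE
152 |     num_samples=100    # number of forecast samples to draw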
153 |     
154 |     training_data = data_t[:-num_forecast_steps]
155 |     data_dates=np.array(data_t.index,dtype='datetime64[M]')
156 | 
157 |     observed_time_series=training_data
158 |     data_t_mean,data_t_scale, data_t_samples=forecast(training_data)
159 |     
160 |     data_y=pd.Series(data_t['sale'])
161 |     fig, ax = plot_forecast(
162 |     data_dates, data_y,
163 |     data_t_mean,data_t_scale, data_t_samples,title="forecast")
164 |     ax.axvline(data_dates[-num_forecast_steps], linestyle="--")
165 |     ax.legend(loc="upper left")
166 |     ax.set_ylabel("sale")
167 |     ax.set_xlabel("year_month")
168 |     fig.autofmt_xdate()
169 |     
170 |     mape=get_mape(data_t,data_t_mean)
171 |     print(mape)
172 |     print('mape:',mape['mape'].mean())
173 | 


--------------------------------------------------------------------------------
/ts_shape.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | # @Time    : 2020/8/19 23:23
  3 | # @Author  : hjs
  4 | # @File    : ts_shape.py
  5 | # @Software : PyCharm
  6 | 
  7 | """
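  8 | A note up front: some readers reported that the provided dataset is large and slow to run locally,
  9 | so a remark further down shows how to compute on only a subset of the data.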
 10 | """
 11 | 
 12 | import numpy as np
 13 | import pandas as pd
 14 | from tslearn.clustering import KShape
 15 | from tslearn.preprocessing import TimeSeriesScalerMeanVariance
 16 | import matplotlib.pyplot as plt
 17 | plt.figure(figsize=(8, 4))
 18 | from tslearn.clustering import silhouette_score
 19 | 
 20 | 
 21 | """
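 22 | 1. Read and preprocess the data (pad the series so that every series has the same length)
 23 | 2. Compute the silhouette score and find the number of clusters k that maximizes it
 24 | 3. Use the best number of clusters to obtain a cluster label for each series
 25 | 4. Visualize: draw an elbow plot to check that the number of clusters is reasonable, and plot each cluster of series.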
 26 | """
 27 | 
 28 | class Plot_Cluster_Time_Series(object):
 29 |     def __init__(self,data,seed):
 30 |         self.data=data
 31 |         self.seed=seed
 32 | 
 33 |     def fill_na_ts(self):
 34 |         data=self.data
 35 |         df_store = data[['item_id']].drop_duplicates()
 36 |         max_ds = str(data['date'].max())[:10].replace('-', '')
 37 |         min_ds = str(data['date'].min())[:10].replace('-', '')
 38 |         print('min time is : {},max time is : {}'.format(min_ds, max_ds))
 39 |         time_index = pd.date_range(min_ds, max_ds, freq='D')
 40 |         time_index = pd.DataFrame(time_index)
 41 |         time_index.columns = ['ts_index']
 42 |         time_index['value'] = 1
 43 |         df_store['value'] = 1
 44 |         store_time_index = pd.merge(time_index, df_store, how='left', on='value')
 45 |         store_time_index.drop(columns='value', inplace=True)
 46 |         data['date'] = pd.to_datetime(data['date'])
 47 |         store_time_index['ts_index'] = pd.to_datetime(store_time_index['ts_index'])
 48 |         store_time_index.rename(columns={'ts_index': 'date'}, inplace=True)
 49 |         data_full = pd.merge(store_time_index, data, how='left', on=['date', 'item_id'])
 50 |         data_full['qty'] = data_full['qty'].fillna(0)
 51 |         data_full.fillna(0, inplace=True)
 52 |         return data_full
 53 | 
 54 |     def read_data(self):
 55 |         """
 56 |         :return: norm dataset and time series id
 57 |         """
 58 |         data = self.fill_na_ts()
 59 |         multi_ts = data.sort_values(by=['item_id', 'date'], ascending=[1, 1])[['item_id', 'qty']]
 60 |         int_numer=multi_ts.shape[0] // multi_ts['item_id'].nunique()
 61 |         multi_ts=multi_ts.groupby('item_id').filter(lambda x: x['item_id'].count() ==int_numer)
 62 |         data_array = np.array(multi_ts[['qty']]).reshape(multi_ts['item_id'].nunique(),multi_ts.shape[0] // multi_ts['item_id'].nunique())
 63 |         ts_norm = TimeSeriesScalerMeanVariance(mu=0.0, std=1.0).fit_transform(data_array)
 64 |         return ts_norm, multi_ts['item_id'].unique()
 65 | 
 66 |     def plot_elbow(self,data):
 67 |         """
 68 | 
 69 |         :param df:multi time series  type is np.array
 70 |         :return: elbow plot
 71 |         """
 72 |         distortions = []
 73 |         for i in range(2, 7):
 74 |             ks = KShape(n_clusters=i, n_init=5, verbose=True, random_state=self.seed)
 75 |             ks.fit(data)
 76 |             distortions.append(ks.inertia_)
 77 |         plt.plot(range(2, 7), distortions, marker='o')
 78 |         plt.xlabel('Number of clusters')
 79 |         plt.ylabel('Distortion Line')
 80 |         plt.show()
 81 | 
 82 | 
 83 |     def shape_score(self,data,labels,metric='dtw'):
 84 |         """
 85 | 
 86 |         :param df:
 87 |         :param labels:
 88 |         :param metric:
 89 |         :return:
 90 |         """
 91 |         score=silhouette_score(data,labels,metric)
 92 |         return score
 93 | 
 94 |     def cal_k_shape(self,data,num_cluster):
 95 |         """
 96 |         use best of cluster
 97 |         :param df: time series dataset
 98 |         :param num_cluster:
 99 |         :return:cluster label
100 |         """
101 |         ks = KShape(n_clusters=num_cluster, n_init=5, verbose=True, random_state=self.seed)
102 |         y_pred = ks.fit_predict(data)
103 |         return y_pred
104 | 
105 |     def plot_best_shape(self,data,num_cluster):
106 |         """
107 |         time series cluster plot
108 |         :param df:
109 |         :param num_cluster:
110 |         :return:
111 |         """
112 |         ks = KShape(n_clusters=num_cluster, n_init=5, verbose=True, random_state=self.seed)
113 |         y_pred = ks.fit_predict(data)
114 |         for yi in range(num_cluster):
115 |             for xx in data[y_pred == yi]:
116 |                 plt.plot(xx.ravel(), "k-", alpha=.3)
117 |             plt.plot(ks.cluster_centers_[yi].ravel(), "r-")
118 |             plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
119 |                      transform=plt.gca().transAxes)
120 |             plt.tight_layout()
121 |             plt.show()
122 | 
123 | 
124 | def main():
125 |     seed = 666
126 |     data = pd.read_csv('./sale_df.csv',parse_dates=['date'])
127 |     data = data[(data['date']>='2015-01-01')&(data['date']<'2015-02-01')]
128 |     
129 |     # Some readers reported that the dataset is large and slow to run locally
130 |     data=data[data['item_id'].isin(data['item_id'].unique()[:100])] # added on purpose: keep only a subset (the first 100 items)
131 |     data = data[['item_id', 'qty', 'date']]
132 |     print(data.head())
133 |     pcts=Plot_Cluster_Time_Series(data,seed)
134 | 
135 |     input_df, multi_id = pcts.read_data()
136 |     k_shape, k_score = [], []
137 |     for i in range(2, 7):
138 |         shape_pred = pcts.cal_k_shape(input_df,i)
139 |         score = pcts.shape_score(input_df,shape_pred)
140 |         k_score.append(score)
141 |         k_shape.append(i)
142 | 
143 |     dict_shape = dict(zip(k_shape, k_score))
144 |     best_shape = sorted(dict_shape.items(), key=lambda x: x[1], reverse=True)[0][0]
145 |     print('best_shape :',best_shape)
146 |     fin_label = pcts.cal_k_shape(input_df,best_shape)
147 | 
148 |     fin_cluster = pd.DataFrame({"id": multi_id, "cluster_label": fin_label})
149 |     pcts.plot_best_shape(input_df,best_shape)
150 |     pcts.plot_elbow(input_df)
151 |     return fin_cluster
152 | 
153 | if __name__ == '__main__':
154 |     fin_cluster = main()
155 |     # Write the clustering result to a local file
156 |     fin_cluster.to_excel('k_shape_result.xlsx',index=False)
157 | 
158 |    
159 | 
160 | 


--------------------------------------------------------------------------------
/利用SARIMAX 进行销量预测.md:
--------------------------------------------------------------------------------
  1 | 
  2 | 
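  3 | Sales Forecasting with SARIMAX
  4 | 
  5 | This article continues the sales-forecasting series started a month ago and covers forecasting with SARIMAX, a classical time-series algorithm.
  6 | 
  7 | It mainly uses the Python modules pandas, statsmodels and joblib, and runs a parallel grid search over many candidate models to find the parameters that minimize the evaluation metric, MAPE. Many kinds of models are available, from classical time series to machine learning and deep learning. Still, time-series analysis is a core part of econometrics with a mature and complete theoretical foundation, so it should be the first model family to try on data with temporal effects, and it often performs well. This article only explains how to use SARIMAX from statsmodels from a practical, code-first angle; it does not go into the formulas.
  8 | 
  9 | The walkthrough below starts from the code so that you can follow along hands-on.
 10 | 
 11 | SARIMAX extends the autoregressive integrated moving average model (ARIMA) with seasonality (S, Seasonal) and exogenous variables (X, eXogenous).
 12 | 
 13 | In other words, it adds periodic and seasonal terms on top of ARIMA, which makes it suitable for series with clear periodic or seasonal patterns.
 14 | 
 15 | 
 16 | 
 17 | Because there are many parameters, the table below briefly explains each one to avoid misuse or omissions.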
 18 | 
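 19 | | Parameter                   | Meaning                                                                                  | Required |
 20 | | :-------------------------- | ---------------------------------------------------------------------------------------- | :------- |
 21 | | **endog**                   | Observed (endogenous) variable y                                                         | Yes      |
 22 | | **exog**                    | Exogenous variables                                                                      | No       |
 23 | | **order**                   | Autoregressive, differencing and moving-average orders (p,d,q)                           | No       |
 24 | | **seasonal_order**          | Seasonal autoregressive, differencing and moving-average orders plus the period (P,D,Q,s) | No       |
 25 | | **trend**                   | Trend: c for a constant, t for linear, ct for constant plus linear                       | No       |
 26 | | **measurement_error**       | Whether the observed variable is assumed to be measured with error                       | No       |
 27 | | **time_varying_regression** | Whether the coefficients on the exogenous variables may vary over time                   | No       |
 28 | | **mle_regression**          | Whether the exog regression coefficients are estimated by maximum likelihood             | No       |
 29 | | **simple_differencing**     | Simple differencing: whether to use partially conditional maximum likelihood             | No       |
 30 | | **enforce_stationarity**    | Whether to enforce stationarity of the autoregressive part of the model                  | No       |
 31 | | **enforce_invertibility**   | Whether to enforce invertibility of the moving-average part of the model                 | No       |
 32 | | **hamilton_representation** | Whether to use the Hamilton representation                                               | No       |
 33 | | **concentrate_scale**       | Whether to concentrate the scale (error variance) out of the likelihood                  | No       |
 34 | | **trend_offset**            | Offset at which the time-trend values start                                              | No       |
 35 | | **use_exact_diffuse**       | Whether to use exact diffuse initialization for non-stationary states                    | No       |
 36 | | `**kwargs`                  | Additional keyword arguments, e.g. for the state-space matrices and the Kalman filter    | No       |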
 37 | 
 38 | 
 39 | 
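 40 | Notes on the parameters:
 41 | 
 42 | The table lists every parameter the function accepts. Most are optional: the algorithm runs with nothing more than endog (the observed variable). This is what makes SARIMAX so flexible, mainly in the following ways:
 43 | 
 44 | 1. If seasonal_order is not specified, or all seasonal parameters are 0, the model reduces to an ordinary ARIMA model.
 45 | 
 46 | 2. exog can be omitted when there are no exogenous factors, so SARIMAX alone covers most time-series analysis with Python's statsmodels today.
 47 | 
 48 | 3. Leave the other parameters alone unless necessary, because the function defaults work well in most cases.
 49 | 
 50 | 4. Initialization comes up several times in the table. The iterative algorithm used to fit the model needs a starting value, which it then refines; this value is normally chosen for you. When the optimizer gets stuck in a local optimum, changing the starting value can help. As with the previous point, leave it alone unless necessary.
 51 | 
 52 | 5. On the fitting algorithm: the data are generally assumed to be normally distributed, so the optimal parameters are found by maximum likelihood.
 53 | 
 54 | 6. enforce_stationarity and enforce_invertibility are usually set to False to keep the model flexible.
 55 | 
 56 | Overall, through different combinations of (p,d,q)(P,D,Q)m, the SARIMA model covers the ARIMA, ARMA, AR and MA models; the best model is then selected with a chosen evaluation criterion.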
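To make the nesting of these model families concrete, here is a minimal sketch that labels a configuration by the simplest family it belongs to. The helper name `describe_model` is hypothetical and only for illustration; it is not part of statsmodels, and the labels ignore the trend and exog terms.

```python
def describe_model(order, seasonal_order=(0, 0, 0, 0)):
    """Return the simplest model family implied by (p,d,q) and (P,D,Q,m).

    Hypothetical helper for illustration only; not a statsmodels function.
    """
    p, d, q = order
    P, D, Q, m = seasonal_order
    if any((P, D, Q)):
        return "SARIMA"   # at least one seasonal term is active
    if d > 0:
        return "ARIMA"    # differencing without seasonal terms
    if p > 0 and q > 0:
        return "ARMA"
    if p > 0:
        return "AR"
    if q > 0:
        return "MA"
    return "white noise"  # no AR, MA or differencing terms at all


print(describe_model((1, 0, 0)))                 # AR
print(describe_model((0, 0, 2)))                 # MA
print(describe_model((1, 0, 1)))                 # ARMA
print(describe_model((1, 1, 1)))                 # ARIMA
print(describe_model((1, 0, 1), (1, 0, 0, 12)))  # SARIMA
```

The same tuples can be passed as `order` and `seasonal_order` to SARIMAX, which is why a single grid search over them covers all of these families.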
 57 | 
 58 | 
 59 | 
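 60 | There are three approaches to model selection:
 61 | 
 62 | 1. Use the AIC information criterion
 63 | 
 64 | Models can be selected with the Akaike information criterion (AIC). The model is fitted by maximum likelihood, but a larger likelihood is not always better, because model complexity matters as well. AIC and BIC are therefore commonly used to rank models. The ranking is relative: among the candidate models, the one with the smallest AIC loses the least information in describing the real data.
 65 | 
 66 | AIC = -2 ln(*L*) + 2*k*
 67 | 
 68 | BIC = -2 ln(*L*) + ln(*n*) *k*
 69 | 
 70 | Here *L* is the maximized likelihood, *k* the number of estimated parameters and *n* the number of observations. With a very large sample, the AIC value is inflated by the sample size (samples above 1,000 are usually considered large).
 71 | 
 72 | As selection criteria, AIC and BIC compensate for the subjectivity of choosing orders from autocorrelation and partial autocorrelation plots.
 73 | 
 74 | The fitted SARIMAX results in statsmodels report AIC (alongside BIC), so it is readily available for comparing models.
 75 | 
 76 | 
 77 | 
 78 | 2. Use the Box-Jenkins modeling procedure
 79 | 
 80 | Alternatively, use the Box-Jenkins workflow shown below, which repeatedly checks whether the model assumptions hold and adjusts the parameters accordingly.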
 81 | 
 82 | ![img](https://img-blog.csdnimg.cn/20190220160453715.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE2MDUwNTYx,size_16,color_FFFFFF,t_70)
 83 | 
 84 | 
 85 | 
 86 | 
 87 | 
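 88 | 3. Use a custom evaluation criterion such as MAPE (mean absolute percentage error), which measures overall forecast accuracy:
 89 | $$
 90 | \mathrm{MAPE}=\frac{\sum_{i=1}^{n}\left(\frac{\left|y_{i}-e_{i}\right|}{y_{i}} \times 100\right)}{n}
 91 | $$
 92 | 
 93 | 
 94 | Here $y_i$ is the actual value and $e_i$ the forecast. MAPE is a relative measure of error, so comparisons are only meaningful between different models evaluated on the same data.
 95 | 
 96 | The three criteria above are best considered together, and different businesses may have different needs. AIC measures information loss from an information-theoretic angle. Box-Jenkins is the traditional statistical procedure built on layered assumptions; it works well for a single series, where you inspect plots and test the significance of the parameters. If what you care about is forecast accuracy, the best choice is to define a MAPE function yourself rather than blindly rely on a function's default settings. This deserves attention elsewhere too, including in XGBoost models, where the objective should not simply be left at RMSE.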
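The AIC and BIC formulas above can be checked numerically. The sketch below is a minimal illustration: the two candidate models, their log-likelihoods and the sample size are made-up numbers, not results from this article.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2 ln(L) + 2k, where log_likelihood is ln(L)."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2 ln(L) + ln(n) k."""
    return -2.0 * log_likelihood + math.log(n) * k


# Hypothetical candidates: (label, maximized log-likelihood, number of parameters)
candidates = [
    ("small", -120.0, 3),
    ("large", -118.5, 6),
]
n_obs = 100  # hypothetical number of observations

for label, llf, k in candidates:
    print(label, round(aic(llf, k), 2), round(bic(llf, k, n_obs), 2))

# The model with the smallest AIC is preferred among the candidates
best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
print("best by AIC:", best_by_aic)
```

In this made-up case the larger model fits slightly better but pays a bigger complexity penalty, so the smaller model wins under both criteria.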
 97 | 
 98 | Import the modules
 99 | 
100 | ```python
101 | import time
102 | from itertools  import product
103 | import numpy as np
104 | import pandas as pd
105 | from math import sqrt
106 | from joblib import Parallel,delayed
107 | import warnings
108 | warnings.filterwarnings('ignore')
109 | from warnings import catch_warnings,filterwarnings
110 | from statsmodels.tsa.statespace.sarimax import SARIMAX
111 | ```
112 | 
113 | 
114 | 
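115 | Define the model. Its key parameters are all listed above; because they matter, they were summarized and translated from the original documentation.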
116 | 
117 | ```python
118 | # Take the history and a config, return the model's forecast
119 | def model_forecast(history,config):
120 |     order, sorder, trend = config
121 |     model = SARIMAX(history, order=order, seasonal_order=sorder,trend=trend,enforce_stationarity=False, enforce_invertibility=False)
122 |     model_fit = model.fit(disp=False)
123 |     yhat = model_fit.predict(len(history), len(history))
124 |     return yhat[0]
125 | ```
126 | 
127 | 
128 | 
129 | ```python
130 | # Evaluation metric: MAPE
131 | def mape(y_true, y_pred):
132 |     y_true, y_pred = np.array(y_true), np.array(y_pred)
133 |     return np.mean(np.abs((y_true - y_pred) / y_true)) * 100    
134 | 
135 | # Split into training and test sets
136 | def train_test_split(data, n_test):
137 |     return data[:-n_test], data[-n_test:]
138 | ```
139 | 
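140 | We use one-step rolling (walk-forward) forecasting: each step forecasts a single value, the series is extended by one point, and the next value is forecast, rather than forecasting several values at once. In the validation function below, the observed test value is appended to the history after each step; in the final forecast, where no observation exists yet, the predicted value is appended instead. In our experience, rolling one-step forecasts beat multi-step forecasts in most cases, so rolling forecasting is worth trying.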
141 | 
142 | ```python
143 | # One-step walk-forward validation
144 | def forward_valid(data, n_test, cfg):
145 |     predictions = list()
146 |     train, test = train_test_split(data, n_test)
147 |     history = [x for x in train]
148 |     for i in range(len(test)):
149 |         yhat = model_forecast(history, cfg)
150 |         predictions.append(yhat)
151 |         history.append(test[i])
152 |     error = mape(test, predictions)
153 |     return error
154 | ```
155 | 
156 | 
157 | 
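158 | When the moving-average or autoregressive order is high, the maximum-likelihood computation can emit many warnings, typically about zero degrees of freedom or a non-positive-definite matrix, so warnings are suppressed here. AIC can also come out as NaN; numerical precision problems like this appear when solving high-order models. We therefore wrap the code that may fail in a Python try-except block so that the program is not interrupted. To see the warnings or to debug, set debug to True, but the program will most likely exit with an error and leave you puzzled. Preprocessing such as standardizing the data avoids some of these numerical problems and also speeds up the fit.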
159 | 
160 | ```python
161 | # Score one model configuration
162 | def score_model(data,n_test,cfg,debug=False):
163 |     result = None
164 |     key = str(cfg)
165 |     if debug:
166 |         result = forward_valid(data, n_test, cfg)
167 |     else:
168 |         try:
169 |             with catch_warnings():
170 |                 filterwarnings("ignore")
171 |                 result = forward_valid(data, n_test, cfg)
172 |         except Exception:
173 |             result = None  # the fit failed for this configuration; it is filtered out later
174 |             
175 |     return (key, result)
176 | ```
177 | 
178 | 
179 | 
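180 | Grid search
181 | 
182 | Grid search is very time-consuming: the number of configurations grows exponentially with the number of parameters. When resources allow, and especially now that PCs and servers have far more computing power, the usual approach is to trade resources for time and run the search in parallel across multiple processes, so that results arrive quickly.
183 | 
184 | We therefore use `Parallel` and `delayed` from the Joblib module to fit many models in parallel. Joblib is also a staple for speeding up **grid search** and **cross validation** in machine-learning tasks.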
185 | 
186 | ```python
187 | # Grid search
188 | def grid_search(data, cfg_list, n_test, parallel=True):
189 |     scores = None
190 |     if parallel:
191 |         # Run in parallel across processes, using all CPU cores
192 |         executor = Parallel(n_jobs=-1, backend='multiprocessing')
193 |         tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
194 |         scores = executor(tasks)
195 |         
196 |     else:
197 |         scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
198 |     scores = [r for r in scores if r[1] is not None]
199 |     scores.sort(key=lambda x: x[1])
200 |     return scores
201 | 
202 | # Build the list of parameter configurations
203 | def sarima_configs(seasonal=[0]):   
204 |     p = d = q = [0,1,2]
205 |     pdq = list(product(p, d, q))
206 |     s = 0
207 |     seasonal_pdq = [(x[0], x[1], x[2], s) for x in list(product(p, d, q))]
208 |     t=['n','c','t','ct']
209 |     return list(product(pdq,seasonal_pdq,t))
210 | ```
211 | 
212 | 
213 | 
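214 | One more note: `product` from the itertools module builds the Cartesian product of its input iterables, enumerating every combination and yielding each as a tuple. In the `sarima_configs` function above, p, d and q each take the values 0, 1 and 2: higher orders are uncommon and make the model very complex, so 0, 1 and 2 are usually enough. The seasonal period defaults to 0 because the data here are aggregated by week, and exploratory plots showed that neither 4 nor 12 is meaningful; fixing it at 0 saves computation by keeping the search from tuning that parameter. We could write nested loops ourselves, but built-in and mature modules are generally faster and cleaner than hand-written code. That is the idea behind not reinventing the wheel, unless your wheel is better or meets a new need. On performance: this article involves many loops, so if possible, use the built-in profile module to measure how long each function takes.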
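As a small sketch of both points, the snippet below rebuilds the same kind of grid with `product` and then times the construction with the standard-library profiler. The function name `build_configs` is hypothetical, and `cProfile` (the C implementation with the same interface as `profile`) is used here as an assumption about which profiler is most convenient.

```python
import cProfile
import io
import pstats
from itertools import product


def build_configs():
    """Enumerate (order, seasonal_order, trend) combinations, as sarima_configs does."""
    orders = [0, 1, 2]
    pdq = list(product(orders, orders, orders))       # 3 * 3 * 3 = 27 tuples
    seasonal_pdq = [(P, D, Q, 0) for P, D, Q in pdq]  # seasonal period fixed at 0
    trends = ['n', 'c', 't', 'ct']
    return list(product(pdq, seasonal_pdq, trends))


configs = build_configs()
print(len(configs))  # 27 * 27 * 4 = 2916 configurations
print(configs[0])    # ((0, 0, 0), (0, 0, 0, 0), 'n')

# Profile the function and collect the report as a string
profiler = cProfile.Profile()
profiler.enable()
build_configs()
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(5)
report = buffer.getvalue()
print(report)
```

With 2,916 configurations per series and one model fit per test point for each of them, the grid size makes it clear why the search is run in parallel.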
215 | 
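216 | Below is the training function. n_test = 3 means the last three values are held out for backtesting. My own use case is fixed, so it is written as a local variable inside the function; making it a global setting or a function parameter would of course be more flexible. The repeated list appends below are admittedly not very elegant. (Premature optimization is the root of all evil, so, emmm, it stays this way.)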
217 | 
218 | ```python
219 | # Model training
220 | def train_model(sale_df):
221 |     from ast import literal_eval  # parses the str(cfg) key back into a tuple
222 |     n_test = 3
223 |     p_b,d_b,q_b=[],[],[]
224 |     P_b,D_b,Q_b=[],[],[]
225 |     m_b,t_b=[],[]
226 |     model_id,error=[],[]
227 |     for store in sale_df['store_code'].unique():
228 |         data=list(sale_df[sale_df['store_code']==store]['y'])
229 |         cfg_list = sarima_configs()
230 |         best_key, best_error = grid_search(data,cfg_list,n_test,parallel=True)[0]  # sorted by MAPE, best first
231 |         (p, d, q), (P, D, Q, m), t = literal_eval(best_key)  # character indexing would cut trend 'ct' to 'c'
232 |         p_b.append(p)
233 |         d_b.append(d)
234 |         q_b.append(q)
235 |         P_b.append(P)
236 |         D_b.append(D)
237 |         Q_b.append(Q)
238 |         m_b.append(m)
239 |         t_b.append(t)
240 |         model_id.append(store)
241 |         error.append(best_error)
242 |     params_df=pd.DataFrame({'store_code': model_id, 'mape': error,'p':p_b,'d':d_b,'q':q_b,'P':P_b,'D':D_b,'Q':Q_b,'m':m_b,'t':t_b})
243 |     return params_df
244 | ```
245 | 
246 | 
247 | 
248 | The best parameters found in training are then passed on to produce a rolling forecast of four time points.
249 | 
250 | ```python
251 | # Forecast function: takes the data and parameters, returns the forecasts
252 | def one_step_forecast(data,order,seasonal_order,t,h_fore):
253 |     predictions=list()
254 |     data=[i for i in data]
255 |     for i in range(h_fore):
256 |         model = SARIMAX(data, order=order, seasonal_order=seasonal_order,trend=t,enforce_stationarity=False, enforce_invertibility=False)
257 |         model_fit = model.fit(disp=False)
258 |         yhat = model_fit.predict(len(data), len(data))
259 |         data.append(yhat[0])
260 |         predictions.append(yhat[0])
261 |     return predictions
262 | 
263 | 
264 | # Forecast multiple series in a for loop
265 | def forecast_model(sale_df,params_df):
266 |     h_fore=4
267 |     fore_list=[]
268 |     model_id=[]
269 |     for i in sale_df['store_code'].unique():
270 |         params_list=params_df[params_df['store_code']==i]
271 |         data=sale_df[sale_df['store_code']==i]['y']
272 |         p=params_df[params_df['store_code']==i].iloc[:,2].values[0]
273 |         d=params_df[params_df['store_code']==i].iloc[:,3].values[0]
274 |         q=params_df[params_df['store_code']==i].iloc[:,4].values[0]
275 |         P=params_df[params_df['store_code']==i].iloc[:,5].values[0]
276 |         D=params_df[params_df['store_code']==i].iloc[:,6].values[0]
277 |         Q=params_df[params_df['store_code']==i].iloc[:,7].values[0]
278 |         m=params_df[params_df['store_code']==i].iloc[:,8].values[0]
279 |         t=params_df[params_df['store_code']==i].iloc[:,9].values[0]
280 |         order=(p, d, q)
281 |         seasonal_order=(P,D,Q,m)
282 |         all_fore=one_step_forecast(data,order,seasonal_order,t,h_fore)
283 |         fore_list.append(all_fore)
284 |         
285 |         # Multi-step alternative: to skip rolling forecasts, use the lines below instead of calling one_step_forecast
286 |         #model=SARIMAX(data, order=order,seasonal_order=seasonal_order,trend=t,enforce_stationarity=False,
287 |         #                                                enforce_invertibility=False)
288 |         #forecast_=model.fit(disp=-1).forecast(steps=h_fore)
289 |         #fore_list_flatten = [x for x in forecast_]
290 |         #fore_list.append(fore_list_flatten)
291 |         model_id.append(i)
292 |     df_forecast = pd.DataFrame({'store_code': model_id, 'fore': fore_list})
293 |     return df_forecast
294 | 
295 | ```
296 | 
297 | Finally, the main function:
298 | 
299 | ```python
300 | if __name__ == '__main__':
301 |     start_time=time.time()
302 |     sale_df=pd.read_excel('/home/test01/store_forecast/sale_df.xlsx')
303 |     params_df=train_model(sale_df)
304 |     forecast_out=forecast_model(sale_df,params_df)
305 |     end_time=time.time()
306 |     use_time=(end_time-start_time)//60
307 |     print('finish the process use',use_time,'mins')
308 | ```
309 | 
310 | 
311 | 
312 | The results of this run are shown below, one series model per store.
313 | 
314 | | store_code | mape     | p    | d    | q    | P    | D    | Q    | m    | t    |
315 | | ---------- | -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
316 | | 1F         | 6.305378 | 1    | 0    | 2    | 2    | 0    | 1    | 0    | c    |
317 | | 62         | 1.889192 | 0    | 2    | 2    | 0    | 0    | 2    | 0    | t    |
318 | | CS         | 1.425515 | 1    | 2    | 2    | 0    | 0    | 0    | 0    | c    |
319 | | 2H         | 2.144674 | 0    | 1    | 2    | 1    | 0    | 2    | 0    | c    |
320 | | 32         | 5.289745 | 0    | 2    | 2    | 0    | 0    | 0    | 0    | t    |
321 | 
322 | 
323 | 
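324 | The five models have an overall MAPE of 3.4%, which is quite good. This article has not yet used the X in SARIMAX, the eXogenous factors: the MAPE already stands at 3.4% (about 96.6% accuracy) and time was limited, so external factors such as weather and foot traffic were not added.
325 | 
326 | This article takes a purely engineering view: it considers many parameter combinations, uses a parallel grid search with backtesting to find the parameters that minimize our chosen criterion, MAPE, and then uses those parameters to roll-forecast the next 4 time points. It is based on my own experience together with material I consulted, focused on a few key points. Corrections are welcome in the comments. The complete program and data will be placed on GitHub; because the original data belong to a company and cannot be published, a similar dataset from UCI is used instead.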
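The 3.4% figure can be reproduced from the table, assuming "overall MAPE" means the unweighted mean of the five per-store values (the article does not state the weighting, so this is an assumption):

```python
from statistics import mean

# Per-store MAPE values copied from the results table above
store_mape = {
    "1F": 6.305378,
    "62": 1.889192,
    "CS": 1.425515,
    "2H": 2.144674,
    "32": 5.289745,
}

# Unweighted mean across stores (assumed definition of the overall MAPE)
overall_mape = mean(store_mape.values())
print(round(overall_mape, 2))        # 3.41
print(round(100 - overall_mape, 1))  # 96.6, the accuracy quoted in the text
```

A sales-weighted average would give more influence to large stores and could differ from this figure.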
327 | 
328 | 


--------------------------------------------------------------------------------
/数据之外的代码.md:
--------------------------------------------------------------------------------
 1 | 
 2 | 
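 3 | # Code Beyond the Data
 4 | 
 5 | ## Background and introduction
 6 | 
 7 | On Friday a colleague gave a Knowledge Sharing session on agile development. I had also just been reading *Clean Code* by Robert C. Martin, one of the initiators of the Agile Manifesto, and had earlier got halfway through *Soft Skills: The Software Developer's Life Manual*. Standing on the shoulders of Uncle Bob and other predecessors, I would like to share a few ideas from the two books along with my own modest understanding and musings, as a kind of "soft" sharing. As I see it, an algorithm engineer is first of all an engineer, and in the information age improving one's coding ability to solve problems comes first.
 8 | 
 9 | 
10 | 
11 | ## 1. Is code still needed?
12 | 
13 | A while ago there was news that an AI able to write code had arrived [1], setting off a new wave of anxiety. Open-source frameworks are increasingly integrated: features that once required a lot of code now take a few lines that call an existing interface. Many algorithms are already packaged, so the barrier to becoming an algorithm engineer seems to keep falling. With mature BI tools, business users can even build complex algorithms by drag and drop, which is exactly what professional engineers have been, and still are, responsible for. Will this gradually erode part of the value of professionals until no one needs to write code?![微信图片_写代码人工智能](E:\2019工作\微信公众号\自投稿件\微信图片_写代码人工智能.png)
14 | 
15 | The author of *Clean Code* seems to have anticipated this doubt and answers it up front: code will never be generated automatically. Even humans, with all their intuition and creativity, cannot build successful systems from customers' vague feelings, nor translate ambiguous requirements into perfectly executable programs. The author is also one of the authors of the Agile Manifesto, whose ideas include "individuals and interactions over processes and tools" and "the most effective and efficient way to convey information is face-to-face conversation". These imply that only when business people and developers communicate fully and respond to change promptly can they build software that meets real needs and solves the problem at hand. That settles the existential question philosophically: code still has to be written [2].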
16 | 
17 | 
18 | 
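19 | ## 2. What is clean code?
20 | 
21 | Here are definitions of clean code from well-known practitioners, taken directly from the book (I adjusted a few sentences slightly while editing):
22 | 
23 | #### **1 Elegant and efficient, concise and disciplined**
24 | 
25 | *I like elegant and efficient code. The logic should be straightforward, so that bugs have nowhere to hide; it should be easy to maintain, with few dependencies; error handling should be complete and follow a layered strategy; performance should be close to optimal, so that nobody is tempted into unprincipled optimizations that create a mess. Clean code does one thing well.* -----Bjarne Stroustrup, inventor of C++
26 | 
27 | #### **2 Simple and direct, crisp**
28 | 
29 | *Clean code is simple and direct; clean code reads like well-written prose; clean code never hides the designer's intent, and is full of crisp abstractions and straightforward lines of control.* -----Grady Booch, author of Object-Oriented Analysis and Design with Applications
30 | 
31 | #### **3 Tested and clearly expressed**
32 | 
33 | *Clean code can be read and extended by developers other than its author. It has unit tests and acceptance tests. It uses meaningful names. It offers one way, not many, to do one thing. It has as few dependencies as possible, explicitly defined, and provides a clear and minimal API. Code should express its meaning literally, because, depending on the language, not all the necessary information can be expressed clearly in code alone.* ----Dave Thomas, founder of OTI
34 | 
35 | #### **4 Flawless, nothing to change**
36 | 
37 | *I could list all the qualities I notice in clean code, but one of them is fundamental. Clean code always looks as if it was written by someone who cares about it. There is almost nothing left to improve. The author thought of everything, and if you try to improve it you end up back where you started, admiring the code someone left you, code left by someone fully devoted to the craft.* ----Michael Feathers, author of Working Effectively with Legacy Code
38 | 
39 | #### **5 Tested, expressing the design**
40 | 
41 | Simple code, in order of importance:
42 | 
43 | - passes all the tests;
44 | - contains no duplication;
45 | - expresses all the design ideas in the system;
46 | - contains as few entities as possible, such as classes, methods and functions. ----Ron Jeffries, author of Extreme Programming Installed
47 | 
48 | #### **6 Just what you expected, made for the problem**
49 | 
50 | If every routine turns out to be pretty much what you expected, it is clean code. If the code makes the programming language look as if it was made for the problem, you can call it beautiful code. ----Ward Cunningham, inventor of Wiki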
51 | 
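54 | These six experienced professionals each give their own definition. Like **The Zen of Python**, the definitions are worth revisiting from time to time to review our own code and raise our professional standards. Note: the translated text of the book was lightly reworded for readability.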
55 | 
56 | 
57 | 
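58 | ## 3. How to focus on writing code
59 | 
60 | Having reviewed the definitions of clean code, let us turn to how to stay focused on writing code, drawing on *Soft Skills: The Software Developer's Life Manual*.
61 | 
62 | In 1930 John Maynard Keynes, the famous 20th-century economist, made a well-known prediction: by 2030, 15 hours of work a week would be enough, that is, 3 hours per working day. Given the "996.icu" topic that recently caused a huge stir online and was debated across the country, and looking at present-day Japan next door, reality is clearly nothing like what even someone as great as Keynes predicted. As productivity rises, working hours should fall, so why have many companies adopted the 996 schedule, or even expect people to be on call 24 hours a day, 7 days a week?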
63 | 
64 | ![微信图片_凯恩斯](E:\2019工作\微信公众号\自投稿件\微信图片_凯恩斯.png)
65 | 
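66 | One explanation is that more convenient tools create more room for us, and we tend to fill that room completely [3]. WeChat makes it easy to pick up the phone and chat at any time, so that not replying within 2 hours feels impolite. More and more apps compete for our attention, with exaggerated headlines and images designed to make us click. More temptations and easier communication scatter our attention and lower our efficiency, so that even with advanced tools we are less focused and less productive.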
67 | 
68 | ![](E:\2019工作\微信公众号\自投稿件\微信图片_996.png) 
69 | 
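70 | The author of *Soft Skills* describes his way of thinking as follows: first spend some time running through everything in your head, and only then reach the mental peak needed to finish the task [4]. The strong programmers around us do the same: they think it through first, then type quickly and get it right in one pass. Before starting a task, the author says, make sure you have done everything you can to shield yourself from interruptions, such as muting your phone and closing distracting browser windows. If you use Jupyter, the F11 shortcut switches to full screen and hides many open applications. You could even hang a "Busy, do not disturb" sign on your desk, though colleagues may not like it at first. A hot topic among programmers related to 996 is the mid-career crisis: we cannot stay on the front line writing code forever, so next we talk about what lies beyond code.
71 | 
72 | ## 4. What lies beyond code
73 | 
74 | As its subtitle suggests, *Soft Skills* covers many skills beyond code. I find them very practical and actionable: the book deals with concrete areas and gives reproducible guidance, which clearly sets it apart from feel-good essays and bland theory. I agree with many of its points, for example marketing yourself and building a brand that stands out. I used to blog for a while, recording what I was learning and summarizing some lessons, then stopped for various reasons. Six months later I found that several posts had passed ten thousand views, and that peers had liked them and said the posts had solved their problems. I was delighted, and surprised by the reach of the internet. Although self-promotion was not the original intent, such a body of work becomes a persuasive endorsement later on. Of course, hold yourself to a high standard along the way and only write what you have verified. Some job seekers, for instance, put their GitHub URL on their résumé, which is a simple form of self-marketing.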
75 | 
76 | ![微信图片_20190526214506](E:\2019工作\微信公众号\自投稿件\微信图片_20190526214506.png)
77 | 
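78 | The author suggests first deciding in which area to build your personal brand, since technology is a vast field with many specialties. He also answers the question of whether a full-stack developer or a specialist is more valuable: pick one area, dig deep, and become an expert. The book lists the following ways to market yourself, for reference:
79 | 
80 | - Blogging: set up your own blog and publish articles.
81 | - Sharing videos: upload your own talks and courses to video sites.
82 | - Magazine articles: submit articles to magazines in your field.
83 | - Books: write and publish a book.
84 | - Conferences: attend technical conferences and give talks.
85 | - Open source: contribute your code to open-source communities.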
86 | 
87 | 
88 | 
89 | 【1】https://openreview.net/pdf?id=ByldLrqlx
90 | 
91 | 【2】*Clean Code* (《代码整洁之道》)
92 | 
93 | 【3】https://github.com/ruanyf/weekly/blob/master/docs/issue-54.md
94 | 
95 | 【4】*Soft Skills: The Software Developer's Life Manual* (《软技能:代码之外的生存指南》)


--------------------------------------------------------------------------------
/时间序列机器学习模型特征总结.md:
--------------------------------------------------------------------------------
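  1 | Summary of time series features
  2 | 
  3 | A brief summary of features for time series, aimed mainly at tree-based machine learning models. Because the data is temporal, the features differ somewhat from those in ordinary machine learning tasks: we pay attention to time features, lag features, sliding-window features, and so on.
  4 | 
  5 | ### Feature 1: Time features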
  6 | 
  7 | import datetime
  8 | 
  9 | import pandas as pd
 10 | 
 11 | df['ds']=df['ds'].astype(str)
 12 | df['ds'] = df['ds'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
 13 | df['year']=df['ds'].dt.year
 14 | df['quarter']=df['ds'].dt.quarter
 15 | df['month']=df['ds'].dt.month
 16 | df['dayofweek']=df['ds'].dt.dayofweek
 17 | df['week']=df['ds'].dt.isocalendar().week  #Series.dt.week was removed in pandas 2.0
 18 | 
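 19 | And so on. For more details, see the following documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
 20 | Some time-of-day splits also have to be built by hand, for example dividing a day into morning, afternoon, and evening. This can be done with apply and a lambda function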
 21 | 
 22 | df['half']=df['hour'].apply(lambda x: 1 if int(x)<12 else 0)
 23 | 
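 24 | #Flag known anomalies: a year-end promotion caused a sales swing, so mark that day as an outlier
 25 | df['outlier']=df['ds'].apply(lambda x: 1 if x==pd.Timestamp('2018-12-31') else 0)
 26 | 
 27 | Similarly, you can flag whether a date falls in the Spring Festival month, whether it is a month end, and so on, labeling the time points that show obvious sales swings. These features are basic but very important.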
 28 | 
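The month-end and Spring Festival flags mentioned above can be sketched roughly as follows. This is only an illustration: the toy frame and the `spring_festival` lookup table are assumptions of mine, and because the festival follows the lunar calendar its dates have to be supplied by hand for each year.

```python
import pandas as pd

# Hypothetical daily frame; `ds` follows the column name used above.
df = pd.DataFrame({"ds": pd.date_range("2019-01-28", periods=10, freq="D")})

# The month-end flag comes straight from the datetime accessor.
df["is_month_end"] = df["ds"].dt.is_month_end.astype(int)

# Hand-maintained lookup (an assumption for this sketch): Spring Festival day per year.
spring_festival = {2019: pd.Timestamp("2019-02-05")}

def in_spring_festival_month(ts):
    """Return 1 if `ts` falls in the calendar month containing that year's Spring Festival."""
    festival = spring_festival.get(ts.year)
    return int(festival is not None and ts.month == festival.month)

df["is_spring_festival_month"] = df["ds"].apply(in_spring_festival_month)
```

Years missing from the lookup simply get 0, so the table needs to cover the whole date range of the real data.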
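 29 | ### Feature 2: Mean encoding of categorical variables
 30 | 
 31 | #For categorical variables, the following approaches are available
 32 | 
 33 | #Mean Encoding  
 34 | #Take store-level forecasting as an example
 35 | #Each store has a city label; converting it to 0/1 features with OneHotEncoder would create too many columns
 36 | #Instead we can use mean encoding and build a mean-of-target feature
 37 | city_mean=df.groupby(['city']).agg({'y': ['mean']})
 38 | city_mean.reset_index(inplace=True)
 39 | city_mean.columns = ['city','city_mean']
 40 | #Other attributes, such as province, can be handled the same way
 41 | df = pd.merge(df,city_mean,on='city',how='inner')
 42 | #A similar technique is count encoding,
 43 | #which strictly speaking belongs with the statistical features
 44 | #It can be implemented as
 45 | city_count=df.groupby(['city']).agg({'y': ['count']})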
 46 | 
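Since mean encoding is built from the target y, it is safer to learn the means on the training rows only and then map them onto later data. Below is a minimal sketch under that assumption; the `train`/`test` frames and the global-mean fallback for unseen cities are my own illustrative choices, not part of the original workflow.

```python
import pandas as pd

# Hypothetical train/test split; `city` and `y` follow the column names above.
train = pd.DataFrame({"city": ["A", "A", "B", "B"], "y": [10.0, 20.0, 30.0, 50.0]})
test = pd.DataFrame({"city": ["A", "B", "C"]})

# Learn the per-city means from the training rows only.
city_mean = train.groupby("city")["y"].mean().rename("city_mean").reset_index()
global_mean = train["y"].mean()

# A left join keeps every row; cities unseen in training fall back to the global mean.
train = train.merge(city_mean, on="city", how="left")
test = test.merge(city_mean, on="city", how="left")
test["city_mean"] = test["city_mean"].fillna(global_mean)
```

A left join is used here instead of an inner join so that rows whose city is missing from the lookup are kept rather than silently dropped.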
 47 | 
 48 | 
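 49 | ### Feature 3: Statistical features
 50 | 
 51 | For example mean, median, max, min, std
 52 | 
 53 | The choice of window needs care
 54 | 
 55 | For monthly data a window of 3 is suggested; for daily data with a weekly pattern, use 7. Sliding-window features at several granularities can also coexist: for example, alongside a window of 4 you can also use a window of 12, which corresponds to a quarter. You can of course try more options, which is exactly where the fun, and the pain, of hyperparameter tuning lies. A sliding window is also a way of smoothing the data. Above all, do not forget to sort the data into the correct order first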
 56 | 
 57 | df.sort_values(['store_code','ds'],ascending=[True,True],inplace=True)
 58 | 
 59 | f_min = lambda x: x.rolling(window=4, min_periods=1).min()
 60 | f_max = lambda x: x.rolling(window=4, min_periods=1).max()
 61 | f_mean = lambda x: x.rolling(window=4, min_periods=1).mean()
 62 | f_std = lambda x: x.rolling(window=4, min_periods=1).std()
 63 | f_median=lambda x: x.rolling(window=4, min_periods=1).median()
 64 | function_list = [f_min, f_max, f_mean, f_std,f_median]
 65 | function_name = ['min', 'max', 'mean', 'std','median']
 66 | for i in range(len(function_list)):
 67 |     df[('stat_%s' % function_name[i])] = df.sort_values('ds').groupby(['store_code'])['y'].transform(function_list[i])
 68 |     
 69 | 
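 70 | Note that the sum aggregation is not used here: since we already have mean, sum is unnecessary because it can be derived from mean (a variable that can be obtained directly from another feature is a redundant feature)
 71 | There are also kurtosis, skewness, and so on
 72 | There is a very well-known feature library for time series: tsfresh
 73 | https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html
 74 | It provides a large number of feature extractors. I mentioned it in an earlier article; interested readers can take a look.
 75 | If tsfresh can already generate so many features, why not simply call the package instead of spending so much effort implementing features one by one by hand?
 76 | There are two reasons. First, although computing resources are fairly plentiful today, when model accuracy is similar we would not want to spend two hours on 300 automatically generated features when 30 hand-built features take 3 minutes. Waiting time during repeated iteration is very expensive, so I prefer implementing features by hand and knowing exactly what goes in; otherwise it is no different from a black box.
 77 | Second, in many cases we still want an interpretable model. A small set of well-chosen features is very valuable for guiding operations staff, who need to know clearly what each feature means.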
 78 | 
 79 | 
 80 | 
 81 | ### Feature 4: Lagged historical features
 82 | 
 83 | 
 84 | 
 85 | for i in [1,2,4,8,12,24,52]:
 86 |     df["lag_{}".format(i)] = df.groupby('store_code')['y'].shift(i)
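 87 | #Other lag sets work too; these values were chosen for a weekly series.
 88 | 
 89 | This is a weekly sales forecast, so the history from one week ago, two weeks ago, four weeks ago (one month back), and 8 weeks ago (two months back) receives the most attention.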
 90 | 
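 91 | With the lag features generated above, year-over-year and period-over-period ratios are easy to compute.
 92 | 
 93 | Note that only lagged data may be used. Features must not be built from the current y value, otherwise information from the future leaks into the features
 94 | df['huanbi_1_2'] = df['last_1_sale'] / df['last_2_sale']
 95 | 
 96 | df['last_14_2_sale'] = (df['last_14_sale'] - df['last_2_sale']) / df['last_14_sale']
 97 | 
 98 | df['sale_uplift'] = (df['last_1_sale'] - df['last_2_sale']) / df['last_2_sale']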
 99 | 
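The same "lagged data only" rule applies to sliding-window statistics: a window that ends at the current row includes the y being predicted. One way to keep the window strictly in the past is to shift by one period inside each store before rolling. The sketch below uses a toy frame and the hypothetical column name `roll_mean_2`.

```python
import pandas as pd

# Hypothetical weekly sales for two stores.
df = pd.DataFrame({
    "store_code": ["s1"] * 4 + ["s2"] * 4,
    "ds": list(pd.date_range("2020-01-06", periods=4, freq="W-MON")) * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
df = df.sort_values(["store_code", "ds"]).reset_index(drop=True)

# Plain lag: the previous week's value within each store.
df["lag_1"] = df.groupby("store_code")["y"].shift(1)

# Shift first, then roll, so the window ends at t-1 and never sees the current y.
df["roll_mean_2"] = df.groupby("store_code")["y"].transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).mean()
)
```

The first row of every store has no history, so both columns are NaN there; tree libraries such as LightGBM or XGBoost can usually handle these missing values directly.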
100 | 
101 | 
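102 | ### Feature 5: Higher-order features
103 | 
104 | With more information available, several features can be combined. For example, given each store's time since opening, average revenue, sales variance, and so on, this information can be used for clustering.
105 | 
106 | The rationale: the cluster label is used as a feature. Stores in the same cluster should have similar curves, so data with similar characteristics receives the same feature value.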
107 | 
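108 | Some readers will ask: clustering is unsupervised, so the number of clusters is not known in advance. The suggestion here is to use an empirical value, or a clustering evaluation metric such as the silhouette coefficient, to obtain a reasonably reliable number of clusters.
109 | def store_cluster(data):  #requires numpy imported as np and sklearn.cluster.KMeans
110 |     data_ = data.drop_duplicates(subset=['store_code'])
111 |     data_un=data_[['store_code','shop_mean','open_length','std_7']].set_index(['store_code']).fillna(0)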
112 |     columns_to_normalize= ['shop_mean','open_length','std_7']
113 |     data_un[columns_to_normalize] = data_un[columns_to_normalize].apply(lambda x: (x - x.mean()) / np.std(x))
114 |     kmeans= KMeans(n_clusters=4).fit(data_un)
115 |     data_un['cluster_id']=kmeans.predict(data_un)
116 |     data_un=data_un.reset_index()
117 |     return data_un[['store_code','cluster_id']]
118 | 
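Choosing the number of clusters with the silhouette coefficient, as suggested above, might look roughly like this. The data is synthetic (three well-separated groups standing in for the standardized store features), and the candidate range 2 to 6 is an arbitrary choice of mine.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized store features: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the cluster count with the highest score.
best_k = max(scores, key=scores.get)
```

On real store data the peak is rarely this clear, so it is worth looking at the whole `scores` dictionary rather than only the maximum.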
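119 | Having mentioned curves and trends: we can also look at the fluctuations of multiple series, find the curves that fluctuate similarly, and give them the same cluster label. This is clustering based on the time series themselves.
120 | One example is DTW (Dynamic Time Warping)
121 | 
122 | If KMeans is a static clustering method based on Euclidean distance, clustering with DTW allows series to be stretched or compressed along the time axis when they are compared, so it captures similarity in shape. A Python library is available and can be installed directly with pip; see https://github.com/pierre-rouanet/dtw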
123 | 
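To make the idea concrete, here is a small hand-written DTW distance. It is a textbook dynamic-programming sketch of my own, not the API of the library linked above, and `dtw_distance` is a hypothetical name.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed moves: diagonal, down, right.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return float(cost[n, m])

base = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # same shape, played twice as slowly
different = [2, 1, 0, 1, 2]                  # mirrored shape

d_same_shape = dtw_distance(base, stretched)
d_other_shape = dtw_distance(base, different)
```

The stretched series has a DTW distance of zero from the original even though the two have different lengths, which a point-by-point Euclidean distance cannot express. The pairwise distances can then be fed to any clustering method that accepts a precomputed distance matrix.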
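124 | Beyond this, transform and filtering methods such as the Fourier transform or the Kalman filter can also be used. For the latter, for example: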
125 | 
126 | from pykalman import KalmanFilter
127 | 
128 | For data with many categories, embeddings can also be used
129 | 
130 | <https://www.tensorflow.org/tutorials/text/word_embeddings>
131 | 
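132 | or TF-IDF. In short, the higher-order features listed here either combine several features or use multiple series to measure similarity and reduce redundant information. For reasons of space this very important topic is not expanded here, but related material is easy to find from the keywords and references above.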
133 | 
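134 | ### Feature 6: External features
135 | 
136 | Many data science competitions do not provide external data themselves but encourage participants to obtain and use it, for example weather or public holidays. A feature such as temperature can be bucketed; see the pd.cut function.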
137 | 
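A short sketch of bucketing temperature with pd.cut follows. The temperatures, bin edges, and label names are made up for illustration and should be replaced by values that suit the actual climate and business.

```python
import pandas as pd

# Hypothetical daily temperatures in degrees Celsius.
weather = pd.DataFrame({"temperature": [-3.0, 8.5, 17.0, 26.0, 35.5]})

# Illustrative bin edges; each interval is closed on the right, e.g. (0, 10].
bins = [-50, 0, 10, 20, 30, 60]
labels = ["freezing", "cold", "mild", "warm", "hot"]
weather["temp_bucket"] = pd.cut(weather["temperature"], bins=bins, labels=labels)

# Tree models need numeric input, so keep the integer code of each bucket as well.
weather["temp_bucket_code"] = weather["temp_bucket"].cat.codes
```

Values outside the outermost edges become NaN in `temp_bucket` and -1 in the code column, so the edges should be wide enough to cover the data.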
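138 | These are the features I use most often day to day. This is only a brief summary of feature generation for tree models; in actual modeling you still need to adapt flexibly to the data. Topics such as features for the prophet model and data preprocessing are not covered here and may get a separate article if the chance arises.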
139 | 
140 | 


--------------------------------------------------------------------------------
/机器学习模型可解释性.md:
--------------------------------------------------------------------------------
 1 | Machine learning model interpretability
 2 | 
 3 | https://www.zhihu.com/question/48224234
 4 | 
 5 | 
 6 | 
 7 | 
 8 | 
 9 | 
10 | 
11 | shap
12 | 
13 | http://sofasofa.io/tutorials/shap_xgboost/
14 | 
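15 | Interpreting SHAP values: a SHAP value explains the effect of a feature taking a particular value by comparing against the prediction made when that feature takes a baseline value.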
16 | 
17 | 
18 | 
19 | 
20 | 
21 | Uses:
22 | 
23 | Feature importance,
24 | 
25 | Importance of each feature for a single sample
26 | 
27 | Feature interactions
28 | 
29 | 
30 | 
31 | 
32 | 
33 | **partial dependence**
34 | 
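35 | A partial dependence plot (PDP or PD plot) shows the marginal effect of a feature on the predictions of a machine learning model, that is, how a feature influences the prediction. It can show whether the relationship between the target and a feature is linear, monotonic, or more complex. For example, when applied to a linear regression model, a partial dependence plot always shows a linear relationship.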
36 | 
37 | 
38 | 
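39 | **PDP analysis steps:**
40 | 
41 | 1. Train an XGBoost model (suppose F1 … F4 are the features, Y is the target variable, and F1 is the most important feature).
42 | 2. We want to explore the direct relationship between Y and F1.
43 | 3. Replace the whole F1 column with the single value F1(A) and compute new predictions for all observations. Take the mean of those predictions (call this the baseline value).
44 | 4. Repeat step 3 for F1(B) … F1(E), that is, for every distinct value of feature F1.
45 | 5. The X axis of the PDP shows the different F1 values, and the Y axis shows how the mean prediction changes relative to the baseline as F1 varies.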
46 | 
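The steps above can be sketched by hand in a few lines. To keep the example self-contained, the "model" is a plain function standing in for a trained XGBoost model, and `partial_dependence` is a hypothetical helper name rather than any library's API.

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """Mean prediction when column `feature_idx` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value          # step 3: overwrite the whole column
        averages.append(predict(X_mod).mean()) # average over all observations
    return np.array(averages)

# Stand-in for a fitted model: any callable mapping a 2-D array to predictions works.
def predict(X):
    return 3.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.array([-1.0, 0.0, 1.0])              # the F1 values to probe (step 4)
pdp = partial_dependence(predict, X, 0, grid)
```

Because this stand-in model is linear in the first feature, the resulting curve is a straight line with slope 3, which matches the remark that a linear model always produces a linear PDP.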
47 | 
48 | 
49 | 
50 | 
51 | 
52 | 
53 | eli5
54 | 
55 | https://www.sohu.com/a/324334904_114877
56 | 
57 | 
58 | 
59 | # Permutation importance
60 | 
61 | **Permutation Importance**
62 | 
63 | 
64 | 
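65 | #### The idea behind PI
66 | 
67 | - Train a model using all the features.
68 | - Predict on the validation set and compute the score.
69 | - Randomly shuffle the values of one feature column in the validation set, predict again, and compute the score.
70 | - The difference between the two scores gives the effect of that feature, say x1, on the predictions.
71 | - Apply the same procedure to each feature column in turn to obtain the effect of every feature on the predictions.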
72 | 
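A hand-rolled version of this procedure is sketched below, again with a plain function standing in for the trained model. The helper names, the negative-MSE score, and the choice of 5 repeats are all illustrative assumptions.

```python
import numpy as np

def permutation_importance(predict, score, X_val, y_val, n_repeats=5, seed=0):
    """Average drop in score after shuffling each column of the validation set."""
    rng = np.random.default_rng(seed)
    baseline = score(y_val, predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one column only
            drops.append(baseline - score(y_val, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

def neg_mse(y_true, y_pred):
    """Higher is better, so a drop in this score means the feature mattered."""
    return -np.mean((y_true - y_pred) ** 2)

# Stand-in "model": uses column 0 heavily, column 1 slightly, and ignores column 2.
def predict(X):
    return 5.0 * X[:, 0] + 0.5 * X[:, 1]

rng = np.random.default_rng(1)
X_val = rng.normal(size=(500, 3))
y_val = predict(X_val)
imp = permutation_importance(predict, neg_mse, X_val, y_val)
```

Shuffling the ignored column changes nothing, so its importance is zero, while the heavily used column shows the largest drop. Repeating the shuffle a few times and averaging reduces the noise from any single permutation.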
73 | 
74 | 
75 | 


--------------------------------------------------------------------------------
/特征选择.md:
--------------------------------------------------------------------------------
  1 | Feature selection
  2 | 
  3 | Mutual information
  4 | 
  5 | ```python
  6 | 
  7 | from sklearn.feature_selection import SelectKBest, f_regression,mutual_info_regression
  8 | from sklearn.datasets import load_diabetes  # swapped in: load_boston was removed in scikit-learn 1.2
  9 | 
 10 | data = load_diabetes()
 11 | print('Data shape: ', data.data.shape)
 12 | 
 13 | selector = SelectKBest(mutual_info_regression, k=5)  # keep the 5 features with the highest mutual information
 14 | X_new = selector.fit_transform(data.data, data.target)
 15 | 
 16 | print('Filtered data shape:', X_new.shape)
 17 | print('Mutual information scores:', selector.scores_)
 18 | selector.get_support()
 19 | 
 20 | ```
 21 | 
 22 | 
 23 | 
 24 | Computing feature contributions with shap
 25 | 
 26 | https://blog.csdn.net/jin_tmac/article/details/106099218
 27 | 
 28 | https://blog.csdn.net/demm868/article/details/109523717
 29 | 
 30 | https://zhuanlan.zhihu.com/p/101352812?utm_source=qq
 31 | 
 32 | 
 33 | 
 34 | A good article on architecture 【马东】
 35 | 
 36 | https://zhuanlan.zhihu.com/p/96420594
 37 | 
 38 | 
 39 | 
 40 | Data ordering and training-set completeness in time series forecasting
 41 | 
 42 | 
 43 | 
 44 | Adversarial validation
 45 | 
 46 | https://blog.csdn.net/caoyuan666/article/details/106223344/
 47 | 
 48 | 
 49 | 
 50 | Various ways of splitting training and validation sets
 51 | 
 52 | https://blog.csdn.net/whybehere/article/details/108192957
 53 | 
 54 | 
 55 | 
 56 | shap
 57 | 
 58 | 
 59 | 
 60 | http://sofasofa.io/tutorials/shap_xgboost/
 61 | 
 62 | 
 63 | 
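 64 | 1.The contribution of each feature for a single sample (shap values), and whether it is positive or negative
 65 | 
 66 | 2.Analysis across the features as a whole
 67 | 
 68 | 3.Interactions between multiple variables
 69 | 
 70 | 4.Partial dependence plots (Partial Dependence Plot, pdpbox)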
 71 | 
 72 | 
 73 | 
 74 | 
 75 | 
 76 | PI: permutation importance
 77 | 
 78 | Permutation Importance
 79 | 
 80 | 
 81 | 
 82 | 
 83 | 
 84 | 
 85 | 
 86 |  SHAP values,
 87 | 
 88 | permutation importance,
 89 | 
 90 | the drop-and-relearn approach
 91 | 
 92 | 
 93 | 
 94 | The ELI5 library can compute Permutation Importance
 95 | 
 96 | https://wanpingdou.blog.csdn.net/article/details/106813825
 97 | 
 98 | 
 99 | 
100 | ```python
101 | import eli5
102 | from eli5.sklearn import PermutationImportance
103 | 
104 | perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
105 | eli5.show_weights(perm, feature_names = val_X.columns.tolist())
106 | ```
107 | 
108 | 
109 | 
110 | kaggle feature importance: featexp
111 | 
112 | https://blog.csdn.net/weixin_41814051/article/details/104300961
113 | 
114 | 
115 | 
116 | https://blog.csdn.net/Datawhale/article/details/103169719
117 | 
118 | 
119 | 
120 | Feature leakage and data leakage
121 | 
122 | Using future information (look-ahead leakage)
123 | 
124 | 
125 | 
126 | kaggle shake-up analysis
127 | 
128 | https://zhuanlan.zhihu.com/p/68381175
129 | 
130 | 
131 | 
132 | https://zhuanlan.zhihu.com/p/64473570
133 | 
134 | 
135 | 
136 | 
137 | 
138 | Categorical features in classification tasks
139 | 
140 | 
141 | 
142 | High-dimensional sparse features
143 | 
144 | 
145 | 
146 | 


--------------------------------------------------------------------------------
/销量预测spark实战.md:
--------------------------------------------------------------------------------
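  1 | # Chapter x: Sales forecasting with Spark in practice
  2 | 
  3 | **Chapter contents**
  4 | 
  5 | 1.Introduction to Spark.ML and DataFrame
  6 | 2.Feature engineering for sales forecasting
  7 | 3.Feature selection and hyperparameter tuning for sales forecasting
  8 | 4.Spark algorithm models for sales forecasting in practice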
  9 | 
 10 | 
 11 | 
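 12 | *This is the first article in the PySpark sales forecasting series. Later articles will walk through the PySpark sales forecasting workflow in detail using hands-on cases, covering feature engineering, feature selection, hyperparameter search, and forecasting algorithms.*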
 13 | 
 14 | 
 15 | 
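 16 | In retail sales forecasting, sales receipt data easily runs to tens of millions of rows. Analyzing or mining data at this scale on a single machine is very difficult, so we turn to a big-data workhorse: Spark.
 17 | 
 18 | As a fast, general-purpose distributed computing platform, Spark makes efficient use of memory and exposes high-level APIs that are translated into complex parallel programs, so users do not need to work at the low level.
 19 | 
 20 | Since most data mining and analysis practitioners are most familiar with Python, this chapter covers the Python flavor of Spark, PySpark. This section first introduces the necessary basics, such as DataFrame and the ML library. Later sections present Spark-based feature generation, feature selection, hyperparameter tuning, and machine learning algorithms for sales forecasting.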
 21 | 
 22 | 
 23 | 
 24 | ## **1.Introduction to Spark.DataFrame and Spark.ML**
 25 | 
 26 | 
 27 | 
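 28 | Since Spark 2.0, the Spark machine learning API has been the DataFrame-based Spark.ML, while the earlier RDD-based Spark.MLlib has entered maintenance mode and no longer receives new features. The DataFrame-based Spark.ML is a further layer of encapsulation on top of RDDs and is a more powerful and convenient machine learning API. If you are already used to Python machine learning libraries such as sklearn, ML will feel familiar.
 29 | 
 30 | We now introduce DataFrame and ML.
 31 | 
 32 | DataFrame belongs to the Spark SQL module and suits structured data, database tables, and dictionary-like data; reading data returns a DataFrame. Anyone familiar with Python's pandas library or with R will find it familiar, since Spark.DataFrame borrows from both. The main advantage of DataFrame is that the Spark engine optimizes its performance from the outset. Compared with Java or Scala, RDDs in Python are very slow: whenever a PySpark program runs on RDDs, the PySpark driver uses Py4j to launch a JVM with a JavaSparkContext, and PySpark distributes the data to Python subprocesses on multiple nodes, which incurs a lot of context switching and communication overhead between Python and the JVM. DataFrame exists precisely to optimize PySpark query performance.
 33 | 
 34 | Having covered where Spark.DataFrame comes from, we now turn to its common operations.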
 35 | 
 36 | ![img](https://mmbiz.qpic.cn/mmbiz_png/wufCEEo7jqpvGdhYMibALv6dicjuBqZAfam5icHErcAox8RE2ZhtpJFibS7TKZnyODqCTMQyY2BzeLSxicmoaSN0Y7w/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)
 37 | 
 38 | 
 39 | 
 40 | ### 1.1 Creating a Spark.DataFrame
 41 | 
 42 | (1) Using toDF (from an RDD)
 43 | 
 44 | ```python
 45 | from pyspark import SparkConf,SparkContext
 46 | from pyspark.sql import Row
 47 | conf = SparkConf().setMaster("local").setAppName("My App")
 48 | sc = SparkContext(conf = conf)
 49 | df = sc.parallelize([ \
 50 |     Row(name='Alice', age=5, height=80), \
 51 |     Row(name='Alice', age=5, height=80), \
 52 |     Row(name='Alice', age=10, height=80)]).toDF()
 53 | 
 54 | # Check the column data types
 55 | df.dtypes
 56 | #[('age', 'bigint'), ('height', 'bigint'), ('name', 'string')]
 57 | # Check the type of df
 58 | type(df)
 59 | #<class 'pyspark.sql.dataframe.DataFrame'>
 60 | ```
 61 | 
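 62 | A DataFrame can be viewed as a relational table on which SQL-like operations can be performed. Unlike creating a SQL table, where column types must be declared, the column types here are inferred automatically, which is one of its strengths.
 63 | 
 64 | (2) Reading a local file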
 65 | 
 66 | ```python 
 67 | from pyspark.sql import SparkSession
 68 | 
 69 | spark = SparkSession.builder \
 70 |     .master("local") \
 71 |     .appName("Test Create DataFrame") \
 72 |     .config("spark.some.config.option", "some-value") \
 73 |     .getOrCreate()
 74 | df = spark.read.csv('python/test_spark/ts_dataset.csv')
 75 | 
 76 | ```
 77 | 
 78 | parquet and json files can be read in the same way
 79 | 
 80 | ```python
 81 | df_parquet=spark.read.parquet('....')
 82 | 
 83 | df_json = spark.read.format('json').load('python/test_spark/ts_dataset.json')
 84 | 
 85 | ```
 86 | 
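 87 | Of the two approaches above, the first is the Spark 1.x style with RDD as the main API. The SparkSession in the second was introduced with Spark 2.x; it wraps SparkContext, SparkConf, sqlContext and so on, and is a higher-level entry point that gives users a single interface to Spark's features.
 88 | 
 89 | One point to stress: data read through a SparkSession is already a DataFrame, whereas the first approach still requires converting the RDD with toDF. If you are using Spark 2.x or later, the second approach is recommended.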
 90 | 
 91 | 
 92 | 
 93 | (3) Reading a Hive table
 94 | 
 95 | ```python
 96 | from pyspark.sql import SparkSession
 97 | spark = SparkSession. \
 98 |     builder. \
 99 |     config("spark.sql.crossJoin.enabled", "true"). \
100 |     config("spark.sql.execution.arrow.enabled", "true"). \
101 |     enableHiveSupport(). \
102 |     getOrCreate()
103 | df=spark.sql("""select regparam,fitIntercept, elasticNetParam from temp.model_best_param""")
104 | 
105 | ```
106 | 
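107 | This case is similar to reading local files above. Spark supports Hive out of the box and can directly access the storage formats that Hive supports. For context, Apache Hive is a common structured data source on Hadoop and supports tables on multiple storage systems, including HDFS. In practice we read data through spark.sql more often than any other way, and Spark SQL is one of Spark's core components, so it gets extra attention here. As with other Spark components, Spark.SQL has to be imported before use, but it does not depend on a large number of packages. To connect Spark.SQL to a deployed Hive, copy hive-site.xml into Spark's configuration directory; see other tutorials online for that part. In the code above, the call to enableHiveSupport is what enables Hive support on the SparkSession. For Spark 1.x, use the following instead.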
108 | 
109 | ```python
110 | from pyspark.sql import HiveContext
111 | hiveCtx=HiveContext(sc)
112 | data=hiveCtx.sql("select regparam,fitIntercept, elasticNetParam from temp.model_best_param")
113 | ```
114 | 
115 | 
116 | 
117 | 
118 | 
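119 | (4) Converting from a pandas.DataFrame
120 | 
121 | Anyone processing data in Python, especially structured data, will inevitably use pandas, so we often need to convert a pandas.DataFrame that has already been processed into a Spark.DataFrame. Fortunately Spark.DataFrame was designed with this in mind, so the conversion is simple.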
122 | 
123 | ```python
124 | import pandas as pd
125 | df = pd.read_csv('python/test_spark/ts_dataset.csv')
126 | # Convert pandas.DataFrame --> spark.DataFrame
127 | spark_df=spark.createDataFrame(df)
128 | # Convert spark.DataFrame --> pandas.DataFrame
129 | pd_df = spark_df.toPandas()
130 | ```
131 | 
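132 | Converting a Spark.DataFrame to a pandas.DataFrame as above collects all the data onto the driver, so it is not recommended for data larger than about 10 GB.
133 | 
134 | At the start of this section we said that Spark.DataFrame belongs to Spark.sql. As Spark's most important component, Spark.sql can read from and write to a wide range of structured data formats and sources, which is why we showed reading local json/csv files as well as database tables above. Spark also allows remote database access over JDBC via Thrift. Overall, Spark hides the complexity of distributed computing, and Spark SQL and DataFrame go a step further by hiding the complexity of data analysis behind a unified, concise API. In terms of development speed and performance, DataFrame + SQL is arguably the best choice for big data analysis.
135 | 
136 | ### 1.2 Spark.DataFrame operations
137 | 
138 | We have stressed that Spark.DataFrame can read flexibly from many data sources. Once the data is loaded, the next step is to process it, so below are some simple operations on data read into a DataFrame.
139 | 
140 | (1) Displaying a DataFrame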
141 | 
142 | ```python
143 | spark_df.show()
144 | ```
145 | 
146 | - Print the DataFrame's schema
147 | 
148 | ```python
149 | spark_df.printSchema()
150 | ```
151 | 
152 | - Show the first n rows
153 | 
154 | ```python
155 | spark_df.head(5)
156 | ```
157 | 
158 | - Show the row count and the column names
159 | 
160 | ```python
161 | spark_df.count()
162 | spark_df.columns
163 | ```
164 | 
165 | 
166 | 
167 | (2) Working with DataFrame columns
168 | 
169 | - Select columns
170 | 
171 | ```python
172 | ml_dataset=spark_df.select("features", "label")
173 | ```
174 | 
175 | - Add a column
176 | 
177 | ```python
178 | from pyspark.sql.functions import *
179 | # Note the *: it imports every function in sql.functions, which is where abs below comes from (it shadows Python's built-in abs)
180 | df2 = spark_df.withColumn("abs_age", abs(spark_df.age))
181 | ```
182 | 
183 | - Drop a column
184 | 
185 | ```python
186 | df3= spark_df.drop("age")
187 | ```
188 | 
189 | - Filter rows
190 | 
191 | ```python
192 | df4= spark_df.where(spark_df["age"]>20)
193 | ```
194 | 
195 | 
196 | 
197 | The above shows only a small sample of the most common DataFrame operations. For more detail, consult the official documentation or other references.
198 | 
199 | 
200 | 
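201 | ### 1.3 Introduction to Spark.ML
202 | 
203 | So far we have covered the data type most closely tied to Spark.ML machine learning, Spark.DataFrame, and its basic operations.
204 | 
205 | Just as we prepare data with pandas.DataFrame, we now look at how these cleaned ingredients are cooked into a dish: machine learning modeling.
206 | 
207 | ML has three main abstractions: the Transformer, the Estimator, and the Pipeline.
208 | 
209 | A transformer, as the name suggests, transforms a DataFrame. Common examples are the data-processing steps in spark.ml.feature, such as feature normalization, binning, dimensionality reduction, and OneHot encoding. Its `transform()` method turns one DataFrame into another.
210 | 
211 | An estimator is a machine learning algorithm, for prediction or classification for example, that is trained on a DataFrame and produces a model. It implements a fit() method to fit the model.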
212 | 
213 | ```python
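214 | from pyspark.ml.feature import MinMaxScaler, VectorAssembler
215 | # MinMaxScaler needs a vector input column, so assemble "age" into one first
216 | vec_df = VectorAssembler(inputCols=["age"], outputCol="age_vec").transform(spark_df)
217 | # Define the estimator, then fit it to obtain a model (which is itself a transformer)
218 | max_min_age = MinMaxScaler(inputCol="age_vec", outputCol="age_scaler").fit(vec_df)
219 | # Apply the transformation
220 | max_min_age_ = max_min_age.transform(vec_df)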
221 | ```
222 | 
223 | 
224 | 
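225 | Pipeline
226 | The pipeline concept is likewise influenced by Python's Scikit-Learn library. In PySpark ML a pipeline is the end-to-end process from transformation to estimation. To simplify machine learning and make it scalable, it uses a set of APIs to define and standardize the workflow, covering data loading, preprocessing, feature processing, feature selection, model fitting, model validation, model evaluation, and so on, all as computations on DataFrames. Other parts of Spark machine learning, such as feature generation, model training, model saving, dataset splitting, and hyperparameter tuning, will be covered in detail later with real cases. With the release of Spark 3.0, the latest ML guide is available at the link below.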
227 | 
228 | http://spark.apache.org/docs/latest/ml-guide.html
229 | 
230 | 
231 | 
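232 | A few related books I have on hand
233 | 
234 | 1.Spark快速大数据分析 (*Learning Spark*): this book is somewhat dated, focusing mainly on Spark 1.x with a little on Spark 2.x. If you want to learn, or have no choice but to use, the RDD-based APIs for data analysis, or want to go deeper into Spark internals and pick up some introductory functional programming in Scala, it is still a good choice: comprehensive and easy to follow. Douban rating 7.9
235 | 
236 | 2.PySpark实战指南 (*Learning PySpark*): any discussion of Spark data analysis in Python has to mention this book. It is not necessarily great, but there is currently nothing better on the market dedicated to Spark with Python. It goes from RDDs to mllib and the ml package, and its API walkthroughs show the process of doing Spark machine learning in Python. It does not go into the details of machine learning itself; overall it is more about demonstrating the workflow and API usage.
237 | 
238 | As for books on Spark or Hadoop in general, there are a great many on the market. This is not my specialty, so I will refrain from commenting on them.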
239 | 
240 | 


--------------------------------------------------------------------------------