├── .idea
│   ├── .gitignore
│   └── encodings.xml
├── README.md
├── __pycache__
│   ├── feature_en.cpython-37.pyc
│   └── pre_process.cpython-37.pyc
├── data
│   └── readme.txt
├── docs
│   ├── final term
│   │   ├── feat_en.md
│   │   ├── modeling.md
│   │   ├── preprocess.md
│   │   └── summary_gyg.md
│   ├── fist term
│   │   ├── 1. 需求分析.md
│   │   ├── 2.整体思路.md
│   │   ├── 3.分工合作.md
│   │   ├── 4.讨论记录.md
│   │   ├── assert
│   │   │   ├── image-20191203133449096.png
│   │   │   ├── image-20191203133502011.png
│   │   │   ├── image-20191203133525187.png
│   │   │   ├── 车流量建模-1575354584839.png
│   │   │   ├── 车流量建模-1575354602642.png
│   │   │   └── 车流量建模.png
│   │   ├── img
│   │   │   ├── Q1_数据集_特征工程.png
│   │   │   └── 车流量建模.png
│   │   ├── itf_PreProcessor.md
│   │   └── 建模思路_意见文档.md
│   └── summary
│       └── vvlj.md
├── eda.ipynb
├── evaluator.py
├── feature_en
│   ├── __init__.py
│   └── feature_en.py
├── model
│   ├── AP.py
│   ├── ARMA.py
│   ├── __pycache__
│   │   ├── AP.cpython-37.pyc
│   │   └── ARMA.cpython-37.pyc
│   └── traffic_flow.ipynb
├── pre_process
│   ├── __init__.py
│   └── pre_process.py
└── runex.py
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /workspace.xml
--------------------------------------------------------------------------------
/.idea/encodings.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # utfp
2 | Spatio-temporal prediction of urban traffic flow --- Shandong Data Application (Qingdao) Innovation & Entrepreneurship Competition. http://sdac.qingdao.gov.cn/common/cmptIndex.html
3 |
4 |
5 |
6 | #### Preprocessing notes
7 |
8 | The raw data carries no flow information; the preprocessing routine counts each checkpoint's vehicle flow per 5-minute window.
9 |
10 | For the preprocessing routine to work, the data directory must follow a fixed layout (create it **manually**). The project's data directory is not committed; its structure is shown below, where first holds the preliminary-round data and final holds the final-round data.
11 |
12 | + data
13 |
14 |   + first
15 |
16 |     + testCrossroadFlow
17 |     + testTaxiGPS
18 |
19 |     + trainCrossroadFlow
20 |     + trainTaxiGPS
21 |
22 |   + final
23 |
24 |     + test_user
25 |     + train
26 |
27 |   + submit
28 |
29 |     + 0_submit.csv  # sample submission, preliminary round
30 |     + 1_submit.csv  # sample submission, final round
31 |
32 |
33 |
34 | **Usage example**
35 |
36 | ```python
37 | # Because of relative paths, call this from the project root, e.g. in runex.py
38 | from pre_process.pre_process import PreProcessor
39 | term = 'final'  # preliminary round: first; final round: final
40 | process_num = 2  # number of worker processes
41 | PreProcessor(term).dump_buffer(process_num)
42 |
43 | # The resulting flow files:
44 | # preliminary round: ./data/0_flow_data.csv ['crossroadID', 'timestamp', 'flow']
45 | # final round: ./data/1_flow_data.csv ['crossroadID', 'direction', 'timestamp', 'flow']
46 | ```
47 |
48 |
49 |
50 | #### Documentation notes
51 |
52 | Discussion notes, issues and summaries from the competition live in the docs folder.
53 |
54 | + docs
55 |   + final term: final-round notes
56 |   + first term: preliminary-round notes
57 |   + summary: retrospectives
--------------------------------------------------------------------------------
/__pycache__/feature_en.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/__pycache__/feature_en.cpython-37.pyc
--------------------------------------------------------------------------------
/__pycache__/pre_process.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/__pycache__/pre_process.cpython-37.pyc
--------------------------------------------------------------------------------
/data/readme.txt:
--------------------------------------------------------------------------------
1 | This file only describes the composition of the data for this competition.
2 |
3 | I. The data comprises two categories:
4 |    1. Vehicle-passage records at Qingdao traffic checkpoints (primary data)
5 |    2. GPS records of Qingdao taxis (auxiliary data)
6 |
7 | II. Data period:
8 |    2019-08-01 to 2019-08-23
9 |
10 | III. Train/test split:
11 |    Split by date.
12 |    Training set: Aug 1 to Aug 19
13 |    Test set: Aug 20 to Aug 23
14 |
15 | IV. Folder names:
16 |    1. trainTaxiGPS: taxi GPS records for the training dates
17 |    2. testTaxiGPS: taxi GPS records for the test dates
18 |    3. trainCrossroadFlow: checkpoint passage data for the training dates
19 |       This folder additionally contains
20 |       road-network information: roadnet.csv
21 |       checkpoint-name information: crossroadName.csv
22 |    4. testCrossroadFlow: checkpoint passage data for the test dates, specific time slots only
23 |       This folder additionally contains
24 |       a sample submission file: submit_example.csv
25 |    5. 字段对应关系说明.xlsx: the original road-description information, for reference; roadnet.csv and crossroadName.csv were generated from this file.
26 |
--------------------------------------------------------------------------------
/docs/final term/feat_en.md:
--------------------------------------------------------------------------------
1 | #### Adjacency dataset
2 |
3 | For each crossroad at a given time, extract its flow (flow) and the mean flow of its adjacent crossroads (mean_flow),
4 |
5 | with flow as the target value.
6 |
7 | ```python
8 | crossroadID, timestamp, flow -> mean_flow, flow
9 | ```
10 |
11 | Missing values: drop.
12 |
13 | **Idea**: use a multi-level index plus a table pivot to obtain a table whose columns are crossroads, which makes indexing convenient.
14 |
15 | ```python
16 | crossroadID, direction, timestamp, flow ->
17 | crossroadID, timestamp, flow ->
18 |                     crossroadID1, crossroadID2, crossroadID3
19 | timestamp direction ->
20 | ```
21 |
22 | Approach 1: iterate over rows, building the new data from each crossroad's flow at **one time step** at a time.
23 |
24 | Approach 2: iterate over columns, using matrix operations to build the new data from **all** of a crossroad's flows at once (see the sketch below).
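25 |
26 | A minimal sketch of the multi-level index + pivot idea (hypothetical toy data; assumes a flow table with columns ['crossroadID', 'direction', 'timestamp', 'flow'] as produced by preprocessing):
27 |
28 | ```python
29 | import pandas as pd
30 |
31 | flow_data = pd.DataFrame({
32 |     'crossroadID': [1, 1, 2, 2],
33 |     'direction':   [1, 3, 1, 1],
34 |     'timestamp':   ['07:00', '07:00', '07:00', '07:05'],
35 |     'flow':        [10, 4, 7, 9],
36 | })
37 | # Two successive unstack() calls move direction and crossroadID into the columns,
38 | # giving one row per timestamp and one column per ('flow', direction, crossroadID).
39 | wide = flow_data.set_index(['timestamp', 'crossroadID', 'direction']).unstack().unstack()
40 | print(wide.columns.tolist())
41 | ```
42 |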
43 |
44 |
45 | Issues:
46 |
47 | 1. Some crossroads in the flow data never appear in the road network.
48 |
49 | 2. ```python
50 |    pd.DataFrame([[1, pd.np.nan, 1]]).mean(axis=1)
51 |    # NaN is treated as missing and excluded from the mean
52 |    ```
53 |
54 | #### Adjacency dataset (per direction)
55 | For each crossroad at a given time, extract its flow in each direction (flow) together with the per-direction flows of its adjacent crossroads (X).
56 |
57 | ```python
58 | crossroadID, timestamp, flow -> (road1, direction1), (road2, direction2), direction, flow
59 | ```
60 |
61 |
62 | Options:
63 |
64 | 1. pd.DataFrame.append
65 |
66 | 2. pd.concat
67 |
--------------------------------------------------------------------------------
/docs/final term/modeling.md:
--------------------------------------------------------------------------------
1 | [@__@]: "pass" marks places to fill in yourself
2 |
3 | ### I. Goal
4 |
5 | > Predict the Shandong traffic-flow data for specific crossroads, specific directions and specific time slots
6 |
7 |
8 |
9 |
10 |
11 | ### II. Prediction methods
12 |
13 | #### 2.1 Regression
14 |
15 | Build the dataset from the adjacency relations between crossroads; there are two variants. Variant one ignores the time dimension: a crossroad's flow at a given moment is the target $y_i$, and the **mean** flow of its adjacent crossroads is $x_i$. Variant two additionally uses the adjacent-crossroad means at neighbouring moments as new features, e.g. $x_{2i}$ for the current moment, $x_{1i}$ for the previous one and $x_{3i}$ for the next one.
16 |
17 | - X: the **mean** flow of the adjacent crossroads
18 | - y: the crossroad's flow at a given moment
19 |
20 | **Updated version**
21 |
22 | All y values below are shifted 30 minutes later! (See the sketch after this list.)
23 | 1. Predict one crossroad only: ridge regression with X = the per-direction flows of its adjacent crossroads plus this crossroad's direction, and y = this crossroad's per-direction flow. (To account for time-of-day effects, add the cluster label as a new feature.)
24 | 2. Predict one crossroad only, in two stages: ridge regression with X = the per-direction flows of its adjacent crossroads and y = this crossroad's total flow; then ridge regression with X = the total flow plus this crossroad's direction, and y = the per-direction flow.
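25 |
26 | A minimal sketch of variant 1 (hypothetical names; assumes train_df and test_df are built as in FeatureEn.extract_adjoin_by_col, i.e. adjacent-crossroad flow columns plus dummy-encoded direction columns, with 'y' holding the flow 30 minutes later):
27 |
28 | ```python
29 | from sklearn.linear_model import Ridge
30 |
31 | # every column except the raw direction label and the target is a feature
32 | feature_cols = [c for c in train_df.columns if c not in ('y', 'direction')]
33 | model = Ridge(alpha=1.0)
34 | model.fit(train_df[feature_cols].fillna(0), train_df['y'])
35 | pred = model.predict(test_df[feature_cols].fillna(0))
36 | ```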
37 |
38 | **Issues**
39 |
40 | > 1. Some crossroads appear in the test set while none of their adjacent crossroads appear in the training set; they need a second prediction pass after the other test crossroads have been predicted.
41 | > 2. Submit-data issues:
42 | >    1. The training set (days 1-20) has full days of data; the test days (20+) only have the first half of each hour.
43 | >    2. The submit crossroads only need predictions for the second half of each hour.
44 | >    3. Crossroad directions are only {1,3,5,7}, yet test crossroad 100300 has directions {1, 2, 5, 7}.
45 | >    4. 7:30 does not match 07:30 -> '7:30'.rjust(5, '0')
46 | > 3. How to tune the model? ---> by the submission score ----> for now, handling the key data and predicting is enough.
47 | > 4. Only about half of the training crossroads have adjacent nodes inside the training set, so the final result is poor.
48 |
49 | **References**
50 |
51 | > none
52 |
53 |
54 |
55 |
56 | #### 2.2 Algorithm 2
57 |
58 | pass
59 |
60 | + X
61 | + y
62 |
63 | > pass
64 |
65 | **References**
66 |
67 | > pass
68 |
69 |
--------------------------------------------------------------------------------
/docs/final term/preprocess.md:
--------------------------------------------------------------------------------
1 | ### Raw data
2 |
3 | + One **vehicle** passing one **checkpoint** at one **moment**
4 |
5 | ```python
6 | direction, laneID, timestamp, crossroadID, vehicleID
7 | ```
8 | + Training set, test set and submit file
9 |   + Training set: days 1-19, full-day data -- 07:00, 07:05, ..., 18:55
10 |   + Test set: days 22-25, only the first half of each hour -- 07:00, 07:05, ..., 07:25, 08:00, ..., 18:25
11 |   + Submit file: days 22-25, only the second half of each hour needs predicting -- 07:30, ..., 18:55
12 |
13 |
14 | ### Requirement
15 |
16 | Count each checkpoint's vehicle flow per 5-minute window.
17 |
18 | 1. Flow time series of a single checkpoint
19 | 2. Flow of all checkpoints in a single time slot
20 |
21 |
22 |
23 | ### Processing approaches
24 |
25 | 1. Pickle a dict into a buffer file
26 |
27 | ```python
28 | {'road': pd.Series(flow_lst, index=timestamp)}  # preliminary round
29 | {'road': {'direction': pd.Series(flow_lst, index=timestamp)}}  # final round
30 | ```
31 |
32 | 2. Write a table into a buffer file (see the sketch below)
33 |    1. Read the raw files
34 |    2. Count each checkpoint's flow per 5-minute window
35 |    3. Write to the file in batches
36 |
37 | ```python
38 | timestamp, crossroadID, flow  # preliminary round
39 | direction, timestamp, crossroadID, flow  # final round
40 | ```
41 |
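42 | A minimal sketch of approach 2, step 2 (assumes a raw record table with one row per vehicle passage and columns including 'timestamp' and 'crossroadID'; the path follows the layout used in pre_process.py):
43 |
44 | ```python
45 | import pandas as pd
46 |
47 | raw = pd.read_csv('./data/final/train/train_trafficFlow_09-01.csv')
48 | # floor each timestamp to its 5-minute window, then count the rows
49 | # per (crossroadID, window) pair to get the flow
50 | raw['timestamp'] = pd.to_datetime(raw['timestamp']).dt.floor('5min')
51 | flow = raw.groupby(['crossroadID', 'timestamp']).size().reset_index(name='flow')
52 | ```
53 |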
54 | ### Interface
55 | ```python
56 | # If this raises, first check the data directory layout,
57 | # then call PreProcessor('final').dump_buffer() to build the buffer
58 | flow_df = PreProcessor('final').get_timeflow()
59 | ```
--------------------------------------------------------------------------------
/docs/final term/summary_gyg.md:
--------------------------------------------------------------------------------
1 | #### Summary:
2 |
3 | ---
4 |
5 | The competition had a preliminary round and a final round.
6 |
7 | - Preliminary round
8 |
9 |   To my embarrassment, the early work was done mainly by Cheng and Le; the core idea was time series plus neighbour aggregation, and we qualified for the final round.
10 |
11 | - Final round
12 |
13 |   For the final round I built on Cheng's statistics to get the crossroad data, then tried an LSTM to predict the missing values, but the results were poor, presumably because the data is missing in whole blocks, which destabilises the model. Later I crawled the crossroads' GPS coordinates and used distance-weighted geographic interpolation to predict flow for crossroads with no history. Our final score was around 66 and our best ranking top 10, but only the top 3 advanced to the final, so that is where we stopped.
14 |
15 | ----
16 |
17 | Shortcomings:
18 |
19 | 1. Not capable enough for this kind of prediction with little prior art to draw on.
20 | 2. Not enough skill and experience in data-science competitions; more practice needed.
--------------------------------------------------------------------------------
/docs/fist term/1. 需求分析.md:
--------------------------------------------------------------------------------
1 | ### Prediction target
2 |
3 | Flow at multiple intersections over multiple time slots. Hence two prediction modes, which are two ways into the problem:
4 |
5 | 1. Fix the time slot, predict the flow at all intersections
6 | 2. Fix the intersection, predict the flow at all time slots
7 |
8 |
9 |
10 | ### Data
11 |
12 | 1. Flow (target): the number of vehicles passing a road intersection per unit time; judging from the submission example (submit_example), the unit should be **vehicles per five minutes**, i.e. count/unitTime
13 | 2. Intersection topology data (headNode-tailNode)
14 | 3. Checkpoint stream: time, intersection Id, lane Id, vehicle Id, direction
15 | 4. Vehicle trajectories: time, vehicle Id, longitude/latitude, speed, direction
16 |
17 | 
18 |
19 |
20 |
21 | ### Model
22 |
23 | First, checkpoint streams and vehicle trajectories are two kinds of data and should be treated as heterogeneous. As for models, the only thing that comes to mind is time-series models... the approach is still vague. The above is everything I have learned so far.
24 |
25 |
26 |
27 |
--------------------------------------------------------------------------------
/docs/fist term/2.整体思路.md:
--------------------------------------------------------------------------------
1 | #### Problem: short-term traffic flow prediction
2 |
3 | Two keywords: time series, spatial topology
4 |
5 |
6 |
7 | #### Approach
8 |
9 | Start from the time series and predict with machine-learning methods. Then bring in the spatial topology and domain knowledge (traffic engineering) for the data mining.
10 |
11 | For machine-learning methods, see Liu's survey, which covers multiple linear regression, historical-trend models, neural networks, time-series models, Kalman filtering and multi-model fusion. The time-series models in "Python数据挖掘与分析" are another starting point.
--------------------------------------------------------------------------------
/docs/fist term/3.分工合作.md:
--------------------------------------------------------------------------------
1 | ### Division of labour
2 |
3 | Our abilities are similar, so tasks are assigned by each member's available time: with little time, own a later stage; with more time, own an earlier stage.
4 |
5 |
6 |
7 | #### How to split the work
8 |
9 | From the modelling angle, the work splits into predicting **multiple time slots** at one intersection with a time-series model, and predicting **multiple intersections** at one time slot with some other model. (Combining the two comes later.)
10 |
11 | From the pipeline angle, the work splits into preprocessing, feature engineering, modelling, and evaluation with visualisation.
12 |
13 | The modelling split waits until everyone understands the modelling; the **pipeline split** can be settled now according to everyone's time.
14 |
15 |
16 |
17 | #### Thoughts
18 |
19 | Splitting by pipeline stage, each person owns exactly one module: preprocessing, feature engineering, modelling, or evaluation and visualisation. The modules are finally integrated into one complete project. For that we need to settle the initial **project skeleton** and its **interfaces**.
20 |
21 | The project skeleton I came up with:
22 |
23 | + pre_process.py: preprocessing
24 |   + class PreProcessor
25 | + feature_en.py: feature engineering
26 |   + class FeatureEn
27 | + model: models
28 |   + model1.py
29 |     + class Model1
30 |   + model2.py
31 |     + class Model2
32 | + evaluator.py: evaluation and visualisation
33 |   + class Evaluator
34 |   + class Draw
35 | + EX.py: for experiments
36 |
37 |
38 |
39 |
40 |
41 |
--------------------------------------------------------------------------------
/docs/fist term/4.讨论记录.md:
--------------------------------------------------------------------------------
1 | #### Meeting notes
2 |
3 | #### 1. Addendum
4 |
5 | First, to fill in what the meeting skipped: the current progress. Done so far:
6 |
7 | + Extraction of flow time series
8 |   + one intersection, multiple time slots
9 |   + one time slot, multiple intersections
10 | + ARMA model
11 |
12 | Second, each member's take on the project:
13 |
14 | + Feature engineering: everyone agrees new methods are needed
15 | + Modelling
16 |   + Weile/Jiongcheng: the ARMA approach has little room left to improve; the search for a new model can proceed **in parallel**
17 |   + Yanggan/Zhengting: whether to search for a new model should depend on the ARMA results
18 |
19 |
20 |
21 | #### 2. Content
22 |
23 | The project direction was settled earlier, so the only topic this time was a new feature-engineering method.
24 |
25 |
26 |
27 | #### 3. Next steps
28 |
29 | Seek common ground while allowing differences:
30 |
31 | 1. Common ground: continue the direction everyone agrees on, i.e. finding **feature-engineering** methods.
32 | 2. Differences: whether to look for a new modelling method is up to each member.
--------------------------------------------------------------------------------
/docs/fist term/assert/image-20191203133449096.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/image-20191203133449096.png
--------------------------------------------------------------------------------
/docs/fist term/assert/image-20191203133502011.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/image-20191203133502011.png
--------------------------------------------------------------------------------
/docs/fist term/assert/image-20191203133525187.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/image-20191203133525187.png
--------------------------------------------------------------------------------
/docs/fist term/assert/车流量建模-1575354584839.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/车流量建模-1575354584839.png
--------------------------------------------------------------------------------
/docs/fist term/assert/车流量建模-1575354602642.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/车流量建模-1575354602642.png
--------------------------------------------------------------------------------
/docs/fist term/assert/车流量建模.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/车流量建模.png
--------------------------------------------------------------------------------
/docs/fist term/img/Q1_数据集_特征工程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/img/Q1_数据集_特征工程.png
--------------------------------------------------------------------------------
/docs/fist term/img/车流量建模.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/img/车流量建模.png
--------------------------------------------------------------------------------
/docs/fist term/itf_PreProcessor.md:
--------------------------------------------------------------------------------
1 | Get flow data for a single intersection across all time slots
2 |
3 | ```python
4 | def get_roadFlow(self, i) -> object:
5 |     '''Get flow data for a single intersection across all time slots
6 |     :param i: day number
7 |     :return:
8 |         dfFlow: pd.DataFrame, raw vehicle-passage table
9 |         dFlow: {crossroadID: pd.Series}, flow time series
10 |     '''
11 |
12 | # Usage example
13 | from pre_process import PreProcessor
14 | prp = PreProcessor()  # data manager
15 | dfFlow, dFlow = prp.get_roadFlow(1)  # raw passage table, flow time series
16 | ```
17 |
18 | Buffer files:
19 | + flow_i: flow at every intersection on day i
20 | + roadFlowTotal_x: flow at intersection x across all time slots
--------------------------------------------------------------------------------
/docs/fist term/建模思路_意见文档.md:
--------------------------------------------------------------------------------
1 | ## Feature engineering
2 |
3 | ### Short-term traffic flow modelling based on spatio-temporal semantics
4 |
5 | The paper extends the statistical **correlation coefficient** with spatio-temporal semantics, introducing a **spatial weight matrix** and a **time delay** to express the spatio-temporal dependence between flow series. Using the spatio-temporal correlation coefficient as the selection criterion, it quickly picks the predictors related to the target point, and finally uses a support **vector machine** as the prediction tool, yielding a short-term traffic-flow prediction method based on spatio-temporal analysis.
6 |
7 | #### 1. Concepts
8 |
9 | Notation
10 |
11 | | Symbol | Meaning                                            |
12 | | ------ | -------------------------------------------------- |
13 | | S      | spatial units (intersections/segments), N of them  |
14 | | X      | spatio-temporal series (flow), n time intervals    |
15 |
16 |
17 |
18 | **Time delay $d$**
19 |
20 | As in the figure below, $X_i$ denotes the flow series of spatial unit $i$, with n time intervals. The **hyperparameter** time delay $d$ is the offset between intervals.
21 |
22 | 
23 |
24 | **Spatio-temporal weight matrix P**
25 |
26 | P denotes the spatial weight matrix of segment $i$, with $N = n$ spatial units and hyperparameter time delay $d = D$; the horizontal direction is the spatial dimension and the vertical one the temporal dimension.
27 | $$
28 | P =\left|
29 | \begin{array}{lcr}
30 | \rho_{i1}(0) & \rho_{i2}(0) & \ldots & \rho_{in}(0) \\
31 | \rho_{i1}(1) & \rho_{i2}(1) & \ldots & \rho_{in}(1) \\
32 | \ldots & \ldots & \ldots & \ldots \\
33 | \rho_{i1}(D) & \rho_{i2}(D) & \ldots & \rho_{in}(D) \\
34 | \end{array}
35 | \right|
36 | $$
37 | **Spatio-temporal correlation coefficient $\rho_{ij}$**
38 | $$
39 | \rho_{ij}(d) = w_{ij}\frac{\sum_{t=1}^{n-d}[x_i(t+d)-\bar x_i(n-d)][x_j(t)-\bar x_j(n-d)]}{\sqrt{\sum_{t=1}^{n-d}[x_i(t+d)-\bar x_i(n-d)]^2\sum_{t=1}^{n-d}[x_j(t)-\bar x_j(n-d)]^2}}
40 | $$
41 |
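42 | A minimal numpy sketch of $\rho_{ij}(d)$ (hypothetical names; w_ij is the spatial weight and x_i, x_j are flow series of equal length n):
43 |
44 | ```python
45 | import numpy as np
46 |
47 | def rho(x_i, x_j, d, w_ij=1.0):
48 |     # overlap the two series with lag d: x_i(t+d) against x_j(t), t = 1..n-d
49 |     a = np.asarray(x_i, dtype=float)[d:]
50 |     b = np.asarray(x_j, dtype=float)[:len(x_j) - d]
51 |     a, b = a - a.mean(), b - b.mean()
52 |     # weighted Pearson correlation over the n-d overlapping terms
53 |     return w_ij * (a @ b) / np.sqrt((a @ a) * (b @ b))
54 | ```
55 |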
56 |
57 | #### 2. Workflow
58 |
59 | The overall workflow has two main steps: first, feature engineering with the spatio-temporal analysis algorithm; second, min-max normalisation applied to the target.
60 |
61 |
--------------------------------------------------------------------------------
/docs/summary/vvlj.md:
--------------------------------------------------------------------------------
1 | [TOC]
2 |
3 |
4 |
5 | ### On collaboration
6 |
7 | Since git traces who changed which part of a file, several people maintaining the same file easily produces conflicts.
8 |
9 | + So when several people own different parts of the same feature (say preprocessing or feature engineering), split the work across files where possible, or agree on changes before pushing, to avoid conflicts.
10 | + Learn and use the more advanced git commands.
11 |
12 |
13 |
14 | ### Knowledge notes
15 |
16 | #### Time (datetime)
17 |
18 | + Timestamp arithmetic
19 |
20 | ```python
21 | # timedelta can be added to or subtracted from timestamps
22 | from datetime import timedelta
23 | timedelta(days=1, minutes=1, seconds=1)
24 | ```
25 |
26 | + String to timestamp
27 |
28 | ```python
29 | # str -> pandas._libs.tslibs.timestamps.Timestamp
30 | import pandas as pd
31 | pd.to_datetime('2019-09-25 18:00:00')
32 | ```
33 |
34 |
35 |
36 |
37 |
38 | #### DataFrame
39 |
40 | + Grouping
41 |
42 | ```python
43 | # name, group_df (group_df keeps the grouping columns)
44 | for name, group_df in df.groupby(['column', ]):
45 |     pass
46 |
47 | # name, group_index (the row index of each group)
48 | for name, group_index in df.groupby(['column', ]).groups.items():  # .groups is a dict
49 |     pass
50 |
51 | # group_df of one specific group (keeps the grouping columns)
52 | group_df = df.groupby(['column', ]).get_group(name)
53 | ```
54 |
55 | + apply: element-wise / row-wise iteration
56 |
57 | ```python
58 | # Series: the first argument x is the element; extra keyword arguments such as y are passed through
59 | series.apply(lambda x, y: x + y, y=1)
60 |
61 | # DataFrame: iterate over rows with axis=1; each row is passed in as a Series
62 | df[['column1', 'column2']].apply(lambda row: row['column1'] + row['column2'], axis=1)
63 | ```
64 |
65 | + Dropping duplicates
66 |
67 | ```python
68 | # subset: columns to deduplicate on; keep='last' keeps the last occurrence (default: the first)
69 | df.drop_duplicates(subset=['column', ], keep='last')
70 | ```
71 |
72 | + Indexing
73 |
74 | ```python
75 | # set a column as the row index
76 | df.set_index(['column', ])
77 | # move the row index back into a column
78 | df.reset_index()
79 | # move a row index level into the columns
80 | df.unstack()  # by default starts from the innermost index level
81 |
82 | # multi-level column index
83 | df['column1']['column2']  # index one level at a time, outer to inner
84 | df['column1', 'column2']  # index several levels at once
85 | # multi-level row index
86 | df.loc['index1'].loc['index2']  # index one level at a time, outer to inner
87 | df.loc[('index1', 'index2')]  # index several levels at once
88 | ```
89 |
90 | + Concatenating tables
91 |
92 | ```python
93 | # to add many rows to a DataFrame, one vertical concat beats appending row by row
94 | # axis=0: vertical concat; ignore_index: renumber the index
95 | pd.concat((df1, df2), axis=0, ignore_index=True)
96 | # axis=1: horizontal concat
97 | pd.concat((df1, df2), axis=1, ignore_index=True)
98 | ```
99 |
100 |
101 |
102 | #### Writing csv files
103 |
104 | ```python
105 | import csv
106 |
107 | with open(path, 'w', newline='') as f:
108 |     handler = csv.writer(f)
109 |     handler.writerow(row)  # one row as a list of values; use writerows(rows) for many rows
110 | ```
111 |
112 |
113 |
114 | #### String handling
115 |
116 | + Right-justify with zero padding
117 |
118 | ```python
119 | '7:30'.rjust(5, "0")
120 | ```
121 |
122 |
123 |
124 | ### Retrospective
125 |
126 | This project ran on and off for a long time, from November 2019 to April 2020, a full five months, though the time actually spent on it was much less; between the preliminary and final rounds there was a stretch of data preparation. It was our first real competition, and although the ranking was modest we gained a lot, including teamwork and the things to watch for when starting a data-mining project.
127 |
128 | First, teamwork. When a team works on one project, pay attention to each member's role and progress. We split roles along the data-mining pipeline: preprocessing, feature engineering and modelling. Each member need not track everyone else's progress in detail, but everyone must know where the project as a whole stands: does everyone understand the task requirements, have a basic picture of the data, know what the basic functions do? People are prone to inertia, so to collaborate efficiently you must actively check on others' progress. We also fell short on documentation and coding conventions, which made communication and record-keeping harder.
129 |
130 | Second, lessons from the data mining itself. Getting a clear picture of the data (missing values, field meanings and so on) matters a lot; because we skimped on this early, patching the models up later was painful. This traffic dataset had many kinds of missingness: missing flow values, missing adjacent crossroads, even whole crossroads absent from the test set. Those gaps invalidated several planned models, for example the regression model: use ridge regression to learn the relation between a crossroad and the per-direction flows of its adjacent crossroads, where a higher weight suggests two particular directed approaches are connected, then predict test-crossroad flow from the adjacency and the weights. Next time we must spend more time reading the data description and the competition requirements.
131 |
132 | Finally, after every project, think about what you learned and what was lacking, and write at least some of it down.
133 |
134 |
--------------------------------------------------------------------------------
/eda.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 4,
6 | "metadata": {
7 | "collapsed": true,
8 | "pycharm": {
9 | "name": "#%%\n",
10 | "is_executing": false
11 | }
12 | },
13 | "outputs": [
14 | {
15 | "name": "stdout",
16 | "text": [
17 | "count 608.000000\nmean 2.809211\nstd 1.059878\nmin 1.000000\n25% 2.000000\n50% 3.000000\n75% 4.000000\nmax 10.000000\ndtype: float64\n"
18 | ],
19 | "output_type": "stream"
20 | },
21 | {
22 | "data": {
23 | "text/plain": "",
24 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXMAAAD2CAYAAAAksGdNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAM9UlEQVR4nO3db2hd93nA8e8zuwFhNZmLg4bDqDGYwZjqNhFdDB5claQ0C2NdGDRgOrwWvK1he+M3LvUYhDFCWMJYIaHaXC/t1nruoFk2uyHd4DIPXKjNaF32h+2F0mGSjM6ZjUwouDx7oRNL15Et6erce6TH3w8Yn3uOdM5PP8tfH59zr25kJpKkre2nuh6AJGnjjLkkFWDMJakAYy5JBRhzSSpgexcH3bVrV+7Zs6eLQ7fm+vXr7Nixo+thbBrOxyDnY4lzMWgj83Hx4sUfZeb9K23rJOZ79uzhwoULXRy6Nf1+n16v1/UwNg3nY5DzscS5GLSR+YiI12+3zcssklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVEAnrwDdqvYcO3Nz+ej0DQ4vezxK8888PpbjSNq6PDOXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAJWjXlE3BcR34qI1yLimxFxT0SciIjzEXF82ce9Z50kaTzWcmZ+CHg+Mz8OvAk8CWzLzAPA3ojYFxFP3LpudEOWJN0qMnPtHxzxN8C9wJ9k5tmIeBKYAD4CvLp8XWaevOVzjwBHAKamph46depUW1/D2Fy6fPXm8tQEvPXOeI47/cB94znQBiwsLDA5Odn1MDYN52OJczFoI/MxOzt7MTNnVtq2fa07iYgDwE5gHrjcrL4CPAjsWGHdgMycA+YAZmZmstfrrfXQm8bhY2duLh+dvsFzl9Y8fRsyf6g3luNsRL/fZyv+mY6K87HEuRg0qvlY0w3QiPgA8EXgM8ACi2fjAJPNPlZaJ0kak7XcAL0H+Abw+cx8HbgIHGw272fxTH2ldZKkMVnLdYLPsnjZ5AsR8QXgJPDpiNgNPAY8DCRw7pZ1kqQxWTXmmfki8OLydRHxCvAo8GxmXm3W9W5dJ0kaj6Hu4GXm28Dp1dZJksbDG5WSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpALG8wO5tSF7lv0c9XGbf+bxzo4tae08M5ekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAtYU84iYiohzzfL2iPhhRPSbX9PN+hMRcT4ijo9ywJKk91o15hGxE3gJ2NGs+hDw9czsNb8uRcQTwLbMPADsjYh9oxuyJOlWkZl3/oCIe4EA/jYzexHxOeAp4DpwCfgt4Hng1cw8GxFPAhOZefKW/RwBjgBMTU09dOrUqda/mFG7dPnqzeWpCXjrnQ4HMybTD9y3po9bWFhgcnJyxKPZOpyPJc7FoI3Mx+zs7MXMnFlp2/bVPjkzrwFExLurvgs8kplvRMRXgF9m8az9crP9CvDgCvuZA+YAZmZmstfrre+r2AQOHztzc/no9A2eu7Tq9G1584d6a/q4fr/PVvwzHRXnY4lzMWhU8zFMjb6fmT9uli8A+4AFYKJZN4k3ViVprIaJ7lcjYn9EbAM+CXwPuAgcbLbvB+bbGZ4kaS2GOTN/Gvgai9fRX8nMf2iuq5+LiN3AY8DDLY5RkrSKNcc8M3vN7z9g8Rkty7ddi4ge8CjwbGZefc8OJEkj09odvMx8Gzjd1v4kSWvnjUpJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVED9t8rRhuxZ9u5Kd3J0+sbAOzFt1Pwzj7e2L+lu4Jm5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySClhTzCNiKiLOLXt8IiLOR8TxO62TJI3HqjGPiJ3AS8CO5vETwLbMPADsjYh9K60b5aAlSYPWcmb+E+BTwLXmcQ843Sy/Bhy8zTpJ0phsX+0DMvMaQES8u2oHcLlZvgI8eJt1AyLiCHAEYGpqin6/v4Fhd+Po9I2by1MTg4/vdm3Px1b8/lhuYWFhy38NbXEuBo1qPlaN+QoWgIlmeZLFs/uV1g3IzDlgDmBmZiZ7vd4Qh+7W4WNnbi4fnb7Bc5eGmb6a2p6P+UO91vbVhX6/z1b8Hh8F52LQqOZjmGezXGTpMsp+YP426yRJYzLMqdTLwLmI2A08BjwM5ArrJEljsuYz88zsNb9fY/GG53eA2cy8utK61kcqSbqtoS5yZubbLD175bbrJEnj4StAJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IB6455RGyPiB9GRL/5NR0RJyLifEQcH8UgJUl3NsyZ+YeAr2dmLzN7wD5gW2YeAPZGxL42ByhJWl1k5vo+IeJzwFPAdeAS8GPg7zPzbEQ8CUxk5skVPu8IcARgamrqoVOnTm107GN36fLVm8tTE/DWOx0OZpNpez6mH7ivvZ11YGFhgcnJya6HsSk4F4M2Mh+zs7MXM3NmpW3bh9jfd4FHMvONiPgK8DHgS822K8CDK31SZs4BcwAzMzPZ6/WGOHS3Dh87c3P56PQNnrs0zPTV1PZ8zB/qtbavLvT7fbbi9/goOBeDRjUfw1xm+X5mvtEsXwB2ARPN48kh9ylJ2oBhwvvViNgfEduAT7J4yeVgs20/MN/S2CRJazTM/4ufBr4GBPAK8DJwLiJ2A48BD7c3PEnSWqw75pn5Axaf0XJTRPSAR4FnM/PqSp8nSRqdVu5YZebbwOk29iVJWj9vVkpSAcZckgow5pJUwJZ71cueZS/ckSQt8sxckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQVsuR+Bq7tDlz/qeP6Zxzs7tjQsz8wlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaS
VIAxl6QCjLkkFWDMJamAVt9pKCJOAD8PnMnMP2xz39K4tPEuR0enb3C4w3dLWg/fWamG1mIeEU8A2zLzQER8OSL2ZeZ/trV/SWpLl29L+Bef2DGS/UZmtrOjiD8FXs3MsxHxJDCRmSeXbT8CHGke/hzwH60cuDu7gB91PYhNxPkY5HwscS4GbWQ+PpiZ96+0oc3LLDuAy83yFeDB5Rszcw6Ya/F4nYqIC5k50/U4NgvnY5DzscS5GDSq+WjzBugCMNEsT7a8b0nSHbQZ3IvAwWZ5PzDf4r4lSXfQ5mWWl4FzEbEbeAx4uMV9b0ZlLhm1xPkY5HwscS4GjWQ+WrsBChARO4FHgX/KzDdb27Ek6Y5ajbkkqRvepJSkAoz5OkXEfRHxrYh4LSK+GRH3dD2mrkXEVET8S9fj2Cwi4oWI+JWux9G1iNgZEWcj4kJEfKnr8XSp+TtybtnjExFxPiKOt3UMY75+h4DnM/PjwJvAJzoez2bwxyw9LfWuFhG/BPxMZv5d12PZBD4N/FXznOr3R8Rd+Vzz5l7iSyy+Fmfg1fLA3ojY18ZxjPk6ZeYLmfnt5uH9wP90OZ6uRcTHgOss/sN2V4uI9wF/BsxHxK92PZ5N4H+BX4iInwZ+FvjvjsfTlZ8AnwKuNY97wOlm+TWWntK9IcZ8SBFxANiZmd/peixdaS4x/T5wrOuxbBK/Afwr8Czw0Yj43Y7H07V/Bj4I/B7wbyy+Mvyuk5nXMvPqslW3vlp+qo3jGPMhRMQHgC8Cn+l6LB07BryQmf/X9UA2iY8Ac83Tcv8SmO14PF37A+C3M/Np4N+B3+x4PJvFSF4tb8zXqTkb/Qbw+cx8vevxdOwR4KmI6AMfjog/73g8XfsvYG+zPAPc7d8fO4HpiNgG/CLg86AXjeTV8j7PfJ0i4neAPwK+16x6MTP/usMhbQoR0c/MXtfj6FJEvB/4Mov/bX4f8OuZefnOn1VXRHwUOMnipZbzwK9l5kK3o+rOu39HIuJe4BzwjzSvlr/lMsxw+zfmkjReo3i1vDGXpAK8Zi5JBRhzSSrAmEtSAcZckgow5pJUwP8DmOcjTHmrYkoAAAAASUVORK5CYII=\n"
25 | },
26 | "metadata": {
27 | "needs_background": "light"
28 | },
29 | "output_type": "display_data"
30 | }
31 | ],
32 | "source": [
33 | "from feature_en.feature_en import FeatureEn\n",
34 | "from myfunc.matplot import *\n",
35 | "import pandas as pd\n",
36 | "\n",
37 | "fe = FeatureEn()\n",
38 | "\n",
39 | "# 查看卡口的度的分布特征\n",
40 | "d = pd.Series(list(len(v) for v in fe.adj_map.values()), index=fe.adj_map.keys())\n",
41 | "print(d.describe())\n",
42 | "d.hist()\n",
43 | "plt.show()"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "outputs": [],
50 | "source": [],
51 | "metadata": {
52 | "collapsed": false,
53 | "pycharm": {
54 | "name": "#%%\n"
55 | }
56 | }
57 | }
58 | ],
59 | "metadata": {
60 | "language_info": {
61 | "codemirror_mode": {
62 | "name": "ipython",
63 | "version": 2
64 | },
65 | "file_extension": ".py",
66 | "mimetype": "text/x-python",
67 | "name": "python",
68 | "nbconvert_exporter": "python",
69 | "pygments_lexer": "ipython2",
70 | "version": "2.7.6"
71 | },
72 | "kernelspec": {
73 | "name": "python3",
74 | "language": "python",
75 | "display_name": "Python 3"
76 | },
77 | "pycharm": {
78 | "stem_cell": {
79 | "cell_type": "raw",
80 | "source": [],
81 | "metadata": {
82 | "collapsed": false
83 | }
84 | }
85 | }
86 | },
87 | "nbformat": 4,
88 | "nbformat_minor": 0
89 | }
--------------------------------------------------------------------------------
/evaluator.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 |
3 | from pre_process.pre_process import PreProcessor
4 |
5 |
6 | class Evaluator:
7 |     pass
8 |
9 |
10 | class Drawer:
11 |     pass
12 |
13 |
14 | def plot_roadflow():
15 |     # ****** load the data ******
16 |     day = 3
17 |     prp = PreProcessor()  # data manager
18 |     dfFlow, dFlow = prp.get_roadflow_by_day(day)  # raw passage table, flow time series
19 |     # ****** plotting example ******
20 |     key = list(dFlow.keys())[0]
21 |     seFlow = dFlow[key]
22 |     seFlow.plot()
23 |     plt.title(f'Flow at crossroad {key} on day {day}')
24 |     plt.ylabel('flow / 5min')
25 |     plt.xlabel('time')
26 |     plt.show()
--------------------------------------------------------------------------------
/feature_en/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/feature_en/__init__.py
--------------------------------------------------------------------------------
/feature_en/feature_en.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from pre_process.pre_process import get_adj_map, PreProcessor
4 | import datetime
5 | from sklearn.metrics.pairwise import cosine_similarity
6 | from pre_process.pre_process import get_trainroad_adjoin, get_testroad_adjoin
7 |
8 |
9 | class FeatureEn:
10 |     def __init__(self, term='first'):
11 |         self.adj_map = get_adj_map()  # bind the object to the road network
12 |         self.prp = PreProcessor(term)  # data manager
13 |
14 |     def extract_relevancy(self, roadId, d, dFlow):
15 |         '''Extract flow-table correlations to use as a training set
16 |         :param roadId: crossroad
17 |         :param d: time delay
18 |         :param dFlow: flow table, {roadId: pd.Series}
19 |         :param adjMap:
20 |         :return: X, correlation matrix; each column is one sample (horizontal: space, vertical: time)
21 |         '''
22 |
23 |         # spatio-temporal correlation computation
24 |         lAdjNode = self.adj_map[roadId]
25 |         X = np.zeros((d, len(lAdjNode)))  # correlation matrix; note: one model is trained per crossroad
26 |         return X
27 |
28 |     def extract_adjoin_by_col(self):
29 |         '''Features: a crossroad's flow in a time slot; targets: its adjacent crossroads' flow. Builds train and test sets (column-wise iteration, fast)'''
30 |         if self.prp.term:  # final round
31 |             # load the dataset
32 |             flow_data = self.prp.load_buffer()  # training set
33 |             road_set = set(flow_data['crossroadID'])
34 |             # build the adjacency table
35 |             road_direction_dct = {}  # used to build the adjacency table
36 |             for road in road_set:
37 |                 road_direction_dct[road] = flow_data['direction'][flow_data['crossroadID'] == road].unique()
38 |             adj_map = {}  # {road: {('flow', direction, adjacent crossroadID)}}
39 |             for road in road_set:
40 |                 adjoin_set = set(self.adj_map[road]) & road_set
41 |                 adj_map[road] = set()
42 |                 for adjoin in adjoin_set:
43 |                     adj_map[road].update(('flow', dire, adjoin) for dire in road_direction_dct[adjoin])
44 |             flow_data.set_index(['timestamp', 'crossroadID', 'direction'], inplace=True)
45 |             flow_data = flow_data.unstack().unstack()  # rebuild the column index as ('flow', direction, crossroadID)
46 |             # flow_data.drop(columns=flow_data.columns ^ (flow_data.columns & adj_map.keys()), inplace=True)
47 |             # build the training set
48 |             train_index = flow_data.index < '2019-09-22 07:00:00'
49 |             train_flow = flow_data[train_index]  # split off the training part
50 |             train_x_index = train_flow.index[train_flow.index < '2019-09-21 18:30:00']  # candidate feature rows
51 |             train_y_index = (pd.to_datetime(train_x_index) + datetime.timedelta(minutes=30)
52 |                              ).map(lambda x: str(x)) & flow_data.index  # target rows, 30 minutes later
53 |             train_x_index &= (pd.to_datetime(train_y_index) - datetime.timedelta(minutes=30)
54 |                               ).map(lambda x: str(x))  # keep only feature rows whose target exists
55 |             train_flow_x = train_flow.loc[train_x_index]
56 |             train_flow_y = train_flow.loc[train_y_index]
57 |             # build the test-set indices
58 |             test_flow = flow_data[~train_index]
59 |             submit_data = self.prp.get_submit()
60 |             test_index_y = submit_data['timestamp'].unique()
61 |             test_index_x = (pd.to_datetime(test_index_y) - datetime.timedelta(minutes=30)
62 |                             ).map(lambda x: str(x)) & flow_data.index  # test feature rows
63 |             test_flow = test_flow.loc[test_index_x]  # test set
64 |             for road in road_set:
65 |                 adjoin_cols = list(adj_map[road])  # fixed column order for indexing
66 |                 if len(adjoin_cols):
67 |                     # build the training set (X, direction, flow) from the adjacency table
68 |                     train_df = pd.DataFrame()
69 |                     x_columns = list(i[1:] for i in adjoin_cols)  # new single-level column names, avoids indexing errors
70 |                     for dire in road_direction_dct[road]:  # first grow the df vertically
71 |                         train_df_next = train_flow_x[adjoin_cols]
72 |                         train_df_next.columns = x_columns
73 |                         train_df_next['direction'] = [dire] * len(train_df_next)
74 |                         train_df_next['y'] = train_flow_y[('flow', dire, road)].values
75 |                         train_df = pd.concat((train_df, train_df_next[train_df_next['y'].notna()]), axis=0)
76 |                     train_df = pd.concat((train_df, pd.get_dummies(train_df['direction'])), axis=1)  # then dummy-encode each direction into its own column
77 |                     # build the test set (X, direction, flow) from the adjacency table
78 |                     test_df = pd.DataFrame()
79 |                     for dire in road_direction_dct[road]:  # first grow the df vertically
80 |                         test_df_next = test_flow[adjoin_cols]
81 |                         test_df_next.columns = x_columns
82 |                         test_df_next.index = test_index_y
83 |                         test_df_next['direction'] = [dire] * len(test_df_next)
84 |                         test_df = pd.concat((test_df, test_df_next), axis=0)
85 |                     test_df = pd.concat((test_df, pd.get_dummies(test_df['direction'])), axis=1)  # then dummy-encode each direction into its own column
86 |                     # drop all-empty columns
87 |                     for df in (train_df, test_df):
88 |                         na_index = df.isna().sum(axis=0)
89 |                         for col in na_index[na_index == len(df)].index:
90 |                             train_df.drop(columns=col, inplace=True, errors='ignore')
91 |                             test_df.drop(columns=col, inplace=True, errors='ignore')
92 |                     yield road, train_df, test_df
93 |
94 |     def similarity_matrix(self):
95 |         '''
96 |         Compute the similarity matrix between crossroads
97 |         :return: similarity matrix
98 |         '''
99 |         matrix, index = self.prp.get_roadflow_alltheday()
100 |         cos = cosine_similarity(pd.DataFrame(np.array(matrix), index=index, columns=["1", "2", "3", "4", "5", "6", "7", "8"]))
101 |         return cos, index
102 |
103 |     def get_train_data(self):
104 |         '''
105 |         Build the training and test sets
106 |         :return: training and test sets
107 |         '''
108 |         global timelist
109 |         train = self.prp.load_train()
110 |         predMapping = get_testroad_adjoin(self.prp)
111 |         train_mapping = get_trainroad_adjoin(predMapping, self.adj_map)
112 |         # [[neighbour1 [per-direction flows per day], neighbour2, neighbour3, ...], []]
113 |         timelist = []
114 |         for i in range(1, 22):  # the complete list of timestamps
115 |             timelist.extend(pd.date_range(f'2019/09/{i} 07:00', f'2019/09/{i} 18:55', freq='5min').tolist())
116 |         # normalise the timestamps in train
117 |         train["timestamp"] = [pd.to_datetime(i, errors='coerce') for i in train["timestamp"].tolist()]
118 |         train["direction"] = [eval(i) for i in train["direction"].tolist()]
119 |         # assemble the data
120 |         train_x, train_y = [], []
121 |         for key in train_mapping.keys():
122 |             a = []
123 |             tdf = pd.DataFrame(timelist, columns=["timestamp"])  # timestamp df
124 |             tdf.to_csv("./data/tdf.csv")
125 |             for i in train_mapping[key][:]:  # adjacent crossroads
126 |                 result_ = get_something(i, train, tdf)
127 |                 if result_:
128 |                     a.append(result_)
129 |             if a:  # only keep crossroads that produced data
130 |                 train_x.append(a)  # training features [[[time1], [time2]], [], []]
131 |                 train_y.append(get_something(key, train, tdf))  # the crossroad itself is the target
132 |         text_save("x", train_x)
133 |         text_save("y", train_y)
134 |
135 |     def get_text_data(self):
136 |         train = self.prp.load_train()
137 |         predMapping = get_testroad_adjoin(self.prp)
138 |         test_x = []
139 |         # [[neighbour1 [per-direction flows per day], neighbour2, neighbour3, ...], []]
140 |         timelist = []
141 |         keylst = []
142 |         for i in range(1, 22):  # the complete list of timestamps
143 |             timelist.extend(pd.date_range(f'2019/09/{i} 07:00', f'2019/09/{i} 18:55', freq='5min').tolist())
144 |         # normalise the timestamps in train
145 |         train["timestamp"] = [pd.to_datetime(i, errors='coerce') for i in train["timestamp"].tolist()]
146 |         train["direction"] = [eval(i) for i in train["direction"].tolist()]
147 |         for key in predMapping.keys():
148 |             keylst.append(key)
149 |             a = []
150 |             tdf = pd.DataFrame(timelist, columns=["timestamp"])  # timestamp df
151 |             for i in list(predMapping[key])[:]:  # adjacent crossroads
152 |                 result_ = get_something(i, train, tdf)
153 |                 if result_:
154 |                     a.append(result_)
155 |             print(a)
156 |             if a:  # only keep crossroads that produced data
157 |                 test_x.append(a)  # test features [[[time1], [time2]], [], []]
158 |         text_save("test", test_x)
159 |         return keylst
160 |
161 |
162 | def text_save(flag, data):  # flag selects the output file; data is the nested list to write
163 |     if flag == "x":
164 |         filename = "./data/train_x.csv"
165 |         s = []
166 |         for i in range(len(data)):  # n crossroads
167 |             for j in range(len(data[i][0])):  # time slots
168 |                 print(len(data[i][0]))
169 |                 a = []
170 |                 for k in range(len(data[i])):  # n neighbours
171 |                     a.append(data[i][k][j])
172 |                 # a = str(a).replace("'", '').replace(',', '')  # strip quotes and commas, newline per row
173 |                 print(a)
174 |                 s.append(a)
175 |         pd.DataFrame(s).to_csv(filename)
176 |         print("train_x saved")
177 |     elif flag == "y":
178 |         filename = "./data/train_y.txt"
179 |         f = open(filename, "a")
180 |         s = []
181 |         for i in range(len(data)):  # n crossroads
182 |             for j in range(len(data[i])):  # all time slots
183 |                 s.append(data[i][j])
184 |         f.write(str(s))
185 |         f.close()
186 |         print("train_y saved")
187 |     else:
188 |         filename = "./data/test_x.csv"
189 |         s = []
190 |         for i in range(len(data)):  # n crossroads
191 |             for j in range(len(data[i][0])):  # time slots
192 |                 print(len(data[i][0]))
193 |                 a = []
194 |                 for k in range(len(data[i])):  # n neighbours
195 |                     a.append(data[i][k][j])
196 |                 # a = str(a).replace("'", '').replace(',', '')  # strip quotes and commas, newline per row
197 |                 print(a)
198 |                 s.append(a)
199 |         pd.DataFrame(s).to_csv(filename)
200 |         print("test saved")
201 |
202 |
203 | def get_something(i, train, tdf):
204 |     """Fill missing values with the mean and return data for all time slots
205 |     :param i: a crossroad
206 |     :param train: training set
207 |     :param tdf: timestamp table
208 |     :return: flow for each time slot, result
209 |     """
210 |     b = []
211 |     if train[train["crossroadID"] == i]["direction"].tolist():  # if non-empty, collect it
212 |         mean = np.array(train[train["crossroadID"] == i]["direction"].tolist()).mean(axis=0)
213 |         mean = [int(round(x)) for x in mean[:]]
214 |         result = pd.merge(tdf, train[train["crossroadID"] == i], on='timestamp', how="left").drop("crossroadID", axis=1)
215 |         for y in result.fillna(str(mean))["direction"].tolist():
216 |             if type(y) is str:
217 |                 y = eval(y)
218 |             b.append(y)
219 |     return b
220 |
--------------------------------------------------------------------------------
/model/AP.py:
--------------------------------------------------------------------------------
1 | from sklearn.cluster import AffinityPropagation
2 |
3 |
4 | class AP:
5 | def __init__(self, x):
6 | self.ap = AffinityPropagation()
7 | self.cluster_centers_indices = None
8 | self.labels = None
9 | self.x = x
10 |
11 | def fit(self):
12 | return self.ap.fit(self.x)
13 |
14 | def predict(self):
15 | self.cluster_centers_indices = self.fit().cluster_centers_indices_
16 | self.labels = self.ap.labels_
17 | return self.cluster_centers_indices, self.labels
18 |
19 |
20 | def ap_predict(x):
21 | ap = AP(x)
22 | cluster_centers_indices, labels = ap.predict()
23 | return cluster_centers_indices, labels
24 |
25 |
--------------------------------------------------------------------------------
/model/ARMA.py:
--------------------------------------------------------------------------------
1 | import warnings
2 | from datetime import timedelta
3 |
4 | import matplotlib.pyplot as plt
5 | import numpy as np
6 | import pandas as pd
7 | from statsmodels.tsa.arima_model import ARIMA
8 | from statsmodels.tsa.seasonal import seasonal_decompose
9 |
10 | plt.rcParams['font.sans-serif'] = ['Simhei']
11 | plt.rcParams['axes.unicode_minus'] = False
12 | warnings.filterwarnings('ignore')
13 |
14 |
15 | class ModeDecomp(object):
16 | def __init__(self, dataSet, test_data, test_size=24):
17 | data = dataSet.set_index('timestamp')
18 | data.index = pd.to_datetime(data.index)
19 | self.dataSet = data
20 | self.test_size = test_size
21 | self.train_size = len(self.dataSet)
22 | self.train = self.dataSet['flow']
23 | self.train = self._diff_smooth(self.train)
24 | self.test = test_data['flow']
25 |
26 |     # smooth the series
27 |     def _diff_smooth(self, dataSet):
28 |         dif = dataSet.diff()  # differenced series
29 |         td = dif.describe()
30 |
31 |         high = td['75%'] + 1.5 * (td['75%'] - td['25%'])  # upper threshold: 1.5 IQR above Q3
32 |         low = td['25%'] - 1.5 * (td['75%'] - td['25%'])  # lower threshold, likewise
33 |
34 |         # indices of points whose change exceeds the thresholds
35 |         forbid_index = dif[(dif > high) | (dif < low)].index
36 |         i = 0
37 |         while i < len(forbid_index) - 1:
38 |             n = 1  # how many consecutive points jump too much; usually just one
39 |             start = forbid_index[i]  # first index of the outlier run
40 |             while forbid_index[i + n] == start + timedelta(minutes=60*n):
41 |                 n += 1
42 |                 if (i + n) > len(forbid_index) - 1:
43 |                     break
44 |             i += n - 1
45 |             end = forbid_index[i]  # last index of the outlier run
46 |             # fill evenly with values interpolated between the neighbouring points
47 |             try:
48 |                 value = np.linspace(dataSet[start - timedelta(minutes=60)], dataSet[end + timedelta(minutes=60)], n)
49 |                 dataSet[start: end] = value
50 |             except:
51 |                 pass
52 |             i += 1
53 |         return dataSet
54 |
55 | def decomp(self, freq):
56 | decomposition = seasonal_decompose(self.train, freq=freq, two_sided=False)
57 | self.trend = decomposition.trend
58 | self.seasonal = decomposition.seasonal
59 | self.residual = decomposition.resid
60 | # decomposition.plot()
61 | # plt.show()
62 | d = self.residual.describe()
63 | delta = d['75%'] - d['25%']
64 | self.low_error, self.high_error = (d['25%'] - 1*delta, d['75%'] + 1*delta)
65 |
66 | def trend_model(self, order):
67 | self.trend.dropna(inplace=True)
68 | self.trend_model_ = ARIMA(self.trend, order).fit(disp=-1, method='css')
69 | # return self.trend_model_
70 |
71 | def predict_new(self):
72 | """
73 |         Predict new data
74 | :return:
75 | """
76 | n = self.test_size
77 | self.pred_time_index = pd.date_range(start=self.train.index[-1], periods=n+1, freq='5min')[1:]
78 | self.trend_pred = self.trend_model_.forecast(n)[0]
79 | pred_time_index = self.add_season()
80 | return pred_time_index
81 |
82 | def add_season(self):
83 | '''
84 |         Add the seasonal and residual components to the predicted trend
85 | '''
86 | self.train_season = self.seasonal[:self.train_size]
87 | values = []
88 | low_conf_values = []
89 | high_conf_values = []
90 |
91 | for i, t in enumerate(self.pred_time_index):
92 | trend_part = self.trend_pred[i]
93 |             # mean of the seasonal values at the same time of day
94 | season_part = self.train_season[
95 | self.train_season.index.time == t.time()
96 | ].mean()
97 |             # trend + season + error bounds
98 | predict = trend_part + season_part
99 | low_bound = trend_part + season_part + self.low_error
100 | high_bound = trend_part + season_part + self.high_error
101 |
102 | values.append(predict)
103 | low_conf_values.append(low_bound)
104 | high_conf_values.append(high_bound)
105 | self.final_pred = pd.Series(values, index=self.pred_time_index, name='predict')
106 | self.low_conf = pd.Series(low_conf_values, index=self.pred_time_index, name='low_conf')
107 | self.high_conf = pd.Series(high_conf_values, index=self.pred_time_index, name='high_conf')
108 | return self.pred_time_index
109 |
110 |
111 | def predict(X):
112 | dataSet = X[:-144]
113 | # input(len(dataSet))
114 | a = 144 * 4
115 | test_data = np.zeros(a)
116 | test_data = pd.DataFrame(test_data, columns=['flow'])
117 | data = pd.DataFrame(dataSet.values, columns=['flow'])
118 | data['timestamp'] = dataSet.index
119 | size = 144 * 4
120 | mode = ModeDecomp(data, test_data, test_size=size)
121 | mode.decomp(size)
122 | for lis in [[3, 1, 3], [1, 2, 3], [5, 2, 3], [1, 1, 2], [3, 1, 4], [0, 0, 1]]:
123 | try:
124 | mode.trend_model(order=(lis[0], lis[1], lis[2]))
125 | break
126 | except:
127 | continue
128 | # mode.trend_model(order=(0, 0, 1))
129 | pred_time_index = mode.predict_new()
130 | pred = mode.final_pred
131 | test = mode.test
132 | # insert_Operateefficient_predict(str(area), str(Date), str(paramster[0]), str(paramster[1]), str(paramster[2]))
133 | # plt.subplot(211)
134 | # plt.plot(mode.train)
135 | # plt.subplot(212)
136 | # test1 = np.array(test).tolist()
137 | # test = pd.Series(test1, index=pred_time_index, name='test')
138 | # pred.plot(color='salmon', label='Predict')
139 | # test.plot(color='steelblue', label='Original')
140 | # mode.low_conf.plot(color='grey', label='low')
141 | # mode.high_conf.plot(color='grey', label='high')
142 | # plt.legend(loc='right')
143 | # plt.tight_layout()
144 | # plt.show()
145 | # accessMode(test, pred)
146 |
147 | return pred
148 |
149 |
150 | def create_test_data():
151 |     test_data = pd.read_csv('data/testCrossroadFlow/submit_example.csv')
152 |     for i in range(len(test_data)):
153 |         retail_data = test_data.iloc[[i]]
154 |         date = retail_data['date'][i]
155 |         crossroadID = retail_data['crossroadID'][i]
156 |         timeBegin = retail_data['timeBegin'][i]
157 |         # look up the buffered per-crossroad prediction for that day
158 |         open_file = 'data/tmp/pred_{}_{}.csv'.format(date, crossroadID)
159 |         pred_data = pd.read_csv(open_file, header=0, index_col=0)
160 |         if len(timeBegin) == 4:
161 |             search_time = '2019-08-' + str(date) + " 0" + timeBegin + ":00"
162 |         elif len(timeBegin) == 5:
163 |             search_time = '2019-08-' + str(date) + " " + timeBegin + ":00"
164 |         pred_flow = pred_data.loc[pred_data['timestamp'] == search_time]['flow'].values[0]
165 |         test_data.loc[i, 'value'] = pred_flow
169 |
--------------------------------------------------------------------------------
/model/__pycache__/AP.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/model/__pycache__/AP.cpython-37.pyc
--------------------------------------------------------------------------------
/model/__pycache__/ARMA.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/model/__pycache__/ARMA.cpython-37.pyc
--------------------------------------------------------------------------------
/pre_process/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/pre_process/__init__.py
--------------------------------------------------------------------------------
/pre_process/pre_process.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import csv
3 | import numpy as np
4 | import datetime
5 | from datetime import timedelta
6 | from multiprocessing.pool import Pool
7 | from multiprocessing import freeze_support
8 | import tqdm
9 |
10 | "20、21、23天,卡口100306缺失"
11 | # 目录:训练集、测试集、缓存数据;如何设置根目录?
12 |
13 | path_dct = {'trainflow': ['./data/first/trainCrossroadFlow/train_trafficFlow_%d',
14 | './data/final/train/train_trafficFlow_09-%02d'],
15 | 'testflow': ['./data/first/testCrossroadFlow/test_trafficFlow_%d',
16 | './data/final/test_user/test_trafficFlow_09-%02d'],
17 | 'columns': [['crossroadID', 'timestamp', 'flow'], ['crossroadID', 'direction', 'timestamp', 'flow']],
18 | 'day_list': [list(range(3, 24)) + [1], range(1, 26)]}
19 |
20 |
21 | def round_minutes(t, n):
22 |     '''Floor a timestamp to the current n-minute window
23 |     :param t: timestamp
24 |     :param n: window length in minutes
25 |     :return:
26 |     '''
27 |     # floor the minutes
28 |     return t - timedelta(minutes=t.minute % n, seconds=t.second)
29 |
30 |
31 | class PreProcessor:
32 |     '''Preprocessing: from the raw data, count each crossroad's flow per 5-minute window for every time slot
33 |     dump_buffer: count crossroad flow
34 |     '''
35 |
36 |     def __init__(self, term='first'):
37 |         self.term = 0 if term == 'first' else 1
38 |         self.flow_path_lst = []
39 |         self.flow_data, self.time_flow = None, None  # flow data
40 |         self.train = None  # training data
41 | self.train_x = None
42 | self.train_y = None
43 | self.test_x = None
44 |
45 |     def load_data(self, i):
46 |         '''Load the csv table for day i'''
47 |         encoding = 'utf-8'
48 |         border = 22 if self.term else 20
49 |         if i < border:
50 |             dirpath = path_dct['trainflow'][self.term]
51 |         else:
52 |             dirpath = path_dct['testflow'][self.term]
53 |         columns = ['timestamp', 'crossroadID', 'vehicleID']
54 |         if self.term:
55 |             columns.append('direction')
56 |         return pd.read_csv(dirpath % i + '.csv', encoding=encoding)
57 |
58 |     def load_buffer(self):
59 |         if self.flow_data is None:
60 |             self.flow_data = pd.read_csv(f'./data/{self.term}_flow_data.csv')  # if this raises, check the data layout first, then call dump_buffer
61 |         return self.flow_data
62 |
63 |     def load_train(self):
64 |         if self.train is None:
65 |             self.train = pd.read_csv('./data/train.csv', names=['crossroadID', 'timestamp', 'direction'])
66 |         return self.train
67 |
68 |     def cal_flow(self, i):
69 |         '''Count the flow for day i'''
70 |         flow_dfx = self.load_data(i)
71 |         flow_dfx['timestamp'] = pd.to_datetime(flow_dfx['timestamp'])  # str to Timestamp
72 |         flow_dfx['timestamp'] = flow_dfx['timestamp'].apply(round_minutes, n=5)  # discretise time into 5-minute windows
73 |         flow_lst = []  # [[road, timestamp, flow]]
74 |         if not self.term:
75 |             for road, df in flow_dfx.groupby('crossroadID'):  # group by crossroad
76 |                 flow_lst.extend([road, g[0], len(g[1])] for g in df.groupby('timestamp'))
77 |             return flow_lst
78 |         else:
79 |             for keys, df in flow_dfx.groupby(['crossroadID', 'direction']):  # group by crossroad and direction
80 |                 flow_lst.extend([*keys, g[0], len(g[1])] for g in df.groupby('timestamp'))
81 |             return flow_lst
82 |
83 |     def dump_buffer(self, num=4):
84 |         '''Count crossroad flow and write it to csv, columns = ['crossroadID', 'timestamp', 'flow']
85 |         :param num: number of worker processes
86 |         :return:
87 |         '''
88 |         freeze_support()
89 |         pool = Pool(num)
90 |         with open(f'./data/{self.term}_flow_data.csv', 'w', newline='') as f:
91 |             handler = csv.writer(f)
92 |             handler.writerow(path_dct['columns'][self.term])
93 |             for flow_lst in pool.map(self.cal_flow, path_dct['day_list'][self.term]):
94 |                 handler.writerows(flow_lst)
95 |         # for i in path_dct['day_list'][self.term]:
96 |         #     handler.writerows(self.cal_flow(i))
97 |
98 |     def get_roadflow_alltheday(self):
99 |         '''Total per-direction flow of each crossroad in the training set, used for the similarity prediction
100 |         :return: matrix: per-direction flow of each crossroad
101 |         '''
102 |         flow_data = self.load_buffer()
103 |         matrix, index = [], []
104 |         a = [0, 0, 0, 0, 0, 0, 0, 0]
105 |         for keys, df in flow_data.groupby(['crossroadID', 'direction']):
106 |             if keys[0] in index:
107 |                 pass
108 |             else:  # a new crossroad
109 |                 a = [0, 0, 0, 0, 0, 0, 0, 0]
110 |                 index.append(keys[0])
111 |             a[int(keys[1]) - 1] = np.sum(df["flow"])
112 |             matrix.append(a)
113 |         df = pd.DataFrame({"index": index, "matrix": matrix}).drop_duplicates(subset=['index'], keep='last', inplace=False)
114 |         return df['matrix'].tolist(), df['index'].tolist()
115 |
116 |     def get_roadflow_by_road(self, roadid):
117 |         '''All flow data for a single crossroad
118 |         :param roadid: crossroad Id
119 |         :return:
120 |         '''
121 |         flow_data = self.load_buffer()
122 |         flow_data = flow_data.set_index('timestamp')
123 |         roadflow_df = flow_data[flow_data['crossroadID'] == roadid]
124 |         if self.term:
125 |             for dire, df in roadflow_df.groupby('direction'):
126 |                 yield dire, pd.Series(df['flow'])
127 |         else:
128 |             yield None, pd.Series(roadflow_df['flow'])
129 |
130 | def get_submit(self):
131 | submit = pd.read_csv(f'./data/submit/{self.term}_submit.csv')
132 | submit['timestamp'] = submit[['timeBegin', 'date']].apply(
133 | lambda x: f'2019-{x["date"]} {x["timeBegin"].rjust(5, "0")}:00', axis=1)
134 | return submit
135 |
136 | def roadid_nums(self):
137 |         '''Count the recorded crossroad ids for each day'''
138 | if self.term:
139 | day_list = list(range(1, 26))
140 | else:
141 | day_list = list(range(3, 24))
142 | data = []
143 | for d in day_list:
144 | ids = set(self.load_data(d)['crossroadID'])
145 | data.append((len(ids), ids))
146 | return data
147 |
148 | def fill_na(self):
149 | if self.term:
150 | flow_data = self.load_buffer()
151 | cur_day = datetime.datetime(2019, 9, 1, 7)
152 | unit_day = datetime.timedelta(days=1)
153 | five_minutes = datetime.timedelta(minutes=5)
154 | thirty_minutes = datetime.timedelta(minutes=30)
155 | train_ts = []
156 | for _ in range(21):
157 | cur_time = cur_day
158 | for _ in range(144):
159 | train_ts.append(str(cur_time))
160 | cur_time += five_minutes
161 | cur_day += unit_day
162 | test_ts = []
163 | for _ in range(4):
164 | cur_time = cur_day
165 | for i in range(1, 73):
166 | test_ts.append(str(cur_time))
167 | if i % 6 == 0:
168 | cur_time += thirty_minutes
169 | cur_time += five_minutes
170 | cur_day += unit_day
171 | ts_set_list = set(train_ts), set(test_ts)
172 |             # fill the missing (crossroad, direction, timestamp) combinations with zero flow
173 | flow_data_with_na = pd.DataFrame()
174 | for road, road_df in tqdm.tqdm(flow_data.groupby('crossroadID')):
175 | dire_lst = road_df['direction'].unique()
176 | data_list = []
177 | for ts_set in ts_set_list:
178 | for ts in ts_set ^ (ts_set & set(road_df['timestamp'])):
179 | # print(len(ts_set), len(set(road_df['timestamp'])), len(ts_set ^ (ts_set & set(road_df['timestamp']))))
180 | # return ts_set, set(road_df['timestamp'])
181 | for dire in dire_lst:
182 | data_list.append([road, dire, ts, 0])
183 | # flow_data.loc[cur_index] = [road, dire, ts, 0]
184 | flow_data_with_na = pd.concat(
185 | (flow_data_with_na, pd.DataFrame(data_list, columns=['crossroadID', 'direction', 'timestamp', 'flow']))
186 | , axis=0, ignore_index=True)
187 | flow_data_with_na = pd.concat((flow_data_with_na, flow_data), axis=0, ignore_index=True)
188 | flow_data_with_na.to_csv("./data/flow_data_with_na.csv", index=False) # 3783710; 4814894
189 | return flow_data_with_na
190 | # b = a[a.crossroadID == 100002]
191 | # b = b[b.direction == 1].timestamp.values
192 | # b.sort()
193 | # print(b[-200:])
194 |
195 |     # build the training-set table
196 |     def get_train_data(self):
197 |         flow_data = self.load_buffer()
198 |         # ['crossroadID', 'timestamp', [eight directions]]
199 |         flow_list = []
200 |         for keys, df in flow_data.groupby(['crossroadID', 'timestamp']):  # group by crossroad and time
201 |             a = [0, 0, 0, 0, 0, 0, 0, 0]
202 |             for index, row in df.iterrows():
203 |                 a[row[1] - 1] = row[-1]
204 |             flow_list.append([*keys, a])
205 |         pd.DataFrame(flow_list).to_csv("data/train.csv", encoding="utf-8")
206 |
207 | def load_traindata(self):
208 | self.train_x = pd.read_csv("./data/train_x.csv")
209 | self.train_y = open("./data/train_y.txt").read()
210 | self.test_x = pd.read_csv("./data/test_x.csv")
211 |         self.train_y = eval(self.train_y)  # a list
212 | self.train_x = self.changetype(self.train_x)
213 | self.test_x = self.changetype(self.test_x)
214 | return self.train_x, self.train_y, self.test_x
215 |
216 | def changetype(self, data):
217 | """
218 |         Convert the dataframe entries from str to list and fill in missing values
219 | :param data:
220 | :return:
221 | """
222 | total = []
223 | for row in data.fillna(str([0, 0, 0, 0, 0, 0, 0, 0])).iterrows():
224 | a = []
225 | for i in range(1, len(row[1])):
226 | a.append(eval(row[1][i]))
227 | total.append(a)
228 | return pd.DataFrame(total)
229 |
230 |
231 | # find an adjacent crossroad for each test crossroad
232 | def get_testroad_adjoin(prp):
233 |     # adjacency table, not deduplicated
234 |     mapping = get_adj_map()
235 |     # pick the first adjacent node
236 |     sPredRoad = set(prp.get_submit()['crossroadID'])  # crossroads to predict
237 |     predMapping = {}
238 |     # available: adjacent crossroads that appear only in the training set
239 |     available = set(prp.load_buffer()['crossroadID']) ^ (sPredRoad & set(prp.load_buffer()['crossroadID']))
240 |     for r in sPredRoad:  # each crossroad to predict
241 |         vs = mapping.get(r)  # its adjacent crossroads from the adjacency table
242 |         if vs is not None:
243 |             adj_set = set(vs)
244 |             bind = adj_set & available
245 |             if bind:
246 |                 predMapping[r] = bind.pop()  # keep a crossroad seen in the training set
247 |     rest = sPredRoad ^ predMapping.keys()  # crossroads with no trained neighbour
248 |     for r in rest:  # placeholder for missing values
249 |         predMapping[r] = None
250 |     return predMapping
251 |     # unreachable as written; kept as the intended neighbour-of-neighbour fallback
252 |     # return rest, predMapping
253 |     length = len(rest)
254 |     while True:
255 |         for r in rest:
256 |             adi = set(mapping.get(r, [])) & predMapping.keys()
257 |             if adi:
258 |                 predMapping[r] = predMapping[adi.pop()]  # temporarily reuse another test crossroad's source
259 |         rest = sPredRoad ^ predMapping.keys()
260 |         if length == len(rest):
261 |             break  # stop once nothing changes
262 |         length = len(rest)
263 |     candi = list(predMapping.values())[0]
264 |     for roadid in rest:
265 |         predMapping[roadid] = candi
266 |
267 |     return predMapping
268 |
269 |
270 | def get_trainroad_adjoin(premap, adj_mapping):
271 |     train_id = pd.read_csv('./data/train.csv', names=['crossroadID', 'timestamp', 'direction'])["crossroadID"].tolist()
272 |     train_map = {}
273 |     for key in adj_mapping.keys():
274 |         if key in train_id:
275 |             train_map[key] = list(set(adj_mapping[key]))
276 |     train_mapping = train_map.copy()
277 |     for key in train_map.keys():
278 |         if [x for x in train_map[key] if x in list(premap.keys())]:  # an adjacent crossroad itself needs predicting
279 |             try:
280 |                 train_mapping.pop(key)
281 |             except KeyError:
282 |                 continue
283 |     return train_mapping  # the usable training crossroads
284 |
285 |
286 | def get_adj_map():
287 | adj_map = {}
288 | net_df = pd.read_csv('data/first/trainCrossroadFlow/roadnet.csv')
289 | for h, t in net_df.values:
290 | if h in adj_map:
291 | adj_map[h].add(t)
292 | else:
293 | adj_map[h] = {t}
294 | if t in adj_map:
295 | adj_map[t].add(h)
296 | else:
297 | adj_map[t] = {h}
298 | return adj_map
299 |
300 |
301 | def get_testroad_adjoin_lr(prp):
302 |     # adjacency table, not deduplicated
303 |     mapping = get_adj_map()
304 |     # breadth-first search from each test crossroad for the nearest trained neighbour
305 |     sPredRoad = set(prp.get_submit()['crossroadID'])  # crossroads to predict
306 |     predMapping = {}
307 |     # available: adjacent crossroads that appear only in the training set
308 | available = {100097, 100354, 100355, 100227, 100359, 100360, 100105, 100237, 100117, 100118, 100375, 100377, 100378, 100252,
309 | 100381, 100382, 100388, 100389, 100134, 100007, 100264, 100137, 100145, 100222, 100152, 100153, 100283, 100284,
310 | 100157, 100030, 100031, 100158, 100160, 100161, 100291, 100036, 100295, 100045, 100303, 100176, 100306, 100051,
311 | 100052, 100181, 100056, 100057, 100058, 100319, 100578, 100452, 100453, 100326, 100327, 100331, 100332, 100077,
312 | 100205, 100208, 100209, 100211, 100213, 100472, 100094}
313 |     for r in sPredRoad:  # each crossroad to predict
314 |         queue = list(mapping.get(r, [])).copy()
315 |         while queue:
316 |             cur = queue.pop(0)
317 |             if cur in available:
318 |                 predMapping[r] = cur  # keep a crossroad seen in the training set
319 |                 break
320 |             else:
321 |                 queue.extend(mapping.get(cur, []))
322 |     rest = sPredRoad ^ predMapping.keys()  # crossroads never reached
323 |     for r in rest:  # placeholder for missing values
324 |         predMapping[r] = None
325 |     return predMapping
326 |
--------------------------------------------------------------------------------
/runex.py:
--------------------------------------------------------------------------------
1 | from pre_process.pre_process import PreProcessor, get_testroad_adjoin, pd, get_testroad_adjoin_lr
2 | import matplotlib.pyplot as plt
3 | from model.ARMA import predict
4 | from feature_en.feature_en import FeatureEn
5 | import tqdm
6 | from model.AP import ap_predict
7 |
8 | plt.rcParams['font.sans-serif'] = ['Simhei']
9 | plt.rcParams['axes.unicode_minus'] = False
10 |
11 |
12 | def arma_ex(term='first'):
13 |     prp = PreProcessor(term)  # data manager
14 | preMapping = get_testroad_adjoin(prp)
15 | submit_df = prp.get_submit()
16 | dire_dct = dict([road, list(set(df['direction']))] for (road, df) in submit_df.groupby('crossroadID'))
17 |     submit_index, day_list = [], range(22, 26)  # submission time index
18 | for day in day_list:
19 | submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
20 | predict_df = pd.DataFrame()
21 | for pre_id in tqdm.tqdm(preMapping.keys()):
22 | instand_id = preMapping[pre_id]
23 | dire_list = dire_dct[pre_id].copy()
24 | for dire, roadflow in prp.get_roadflow_by_road(instand_id):
25 | if not dire_list:
26 |                 break  # no directions left to fill
27 | try:
28 | pred_pre = predict(roadflow)
29 | except Exception as e:
30 | print(instand_id, '\t', e)
31 | pred_pre = pd.Series([0.1] * (144 * 4), index=submit_index)
32 | # pred_pre = pred_pre.dropna(axis=0, how="any")
33 | pred_pre.fillna(pred_pre.mean(), inplace=True)
34 | for i in range(len(pred_pre)):
35 | pred_pre.iloc[[i]] = int(pred_pre.iloc[[i]])
36 | pred = pd.DataFrame(pred_pre.values, columns=['value'])
37 | pred['timestamp'] = submit_index
38 | pred['date'] = pred['timestamp'].apply(lambda x: x.strftime('%d'))
39 | pred['timeBegin'] = pred['timestamp'].apply(lambda x: f'{x.hour}:{x.strftime("%M")}')
40 | pred['crossroadID'] = pre_id
41 | pred['min_time'] = pred['timestamp'].apply(lambda x: int(x.strftime('%M')))
42 | pred = pred[pred['min_time'] >= 30]
43 | pred.drop(['timestamp'], axis=1, inplace=True)
44 | order = ['date', 'crossroadID', 'timeBegin', 'value']
45 | pred = pred[order]
46 | if prp.term:
47 | pred['direction'] = dire_list.pop()
48 | predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
49 | while dire_list: # 方向不够用的情况
50 | pred['direction'] = dire_list.pop()
51 | predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
52 | submit_time_set = set(submit_df['timeBegin'])
53 | predict_df.set_index('timeBegin', inplace=True)
54 | return predict_df.loc[list(set(predict_df.index) & submit_time_set)].reset_index()[
55 | ['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
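
# Usage sketch (assumes the flow buffer for the chosen term has been dumped;
# the output path is hypothetical):
#     df = arma_ex('first')
#     df.to_csv('./data/arma_submit.csv', index=False)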


def ap(f):
    """
    Affinity Propagation clustering of checkpoints.
    :param f: a FeatureEn instance
    :return: id_type   -- cluster label of each checkpoint in the training set (dict)
             center_id -- cluster number -> exemplar (cluster centre) index
    """
    cos, index = f.similarity_matrix()
    cluster_centers_indices, labels = ap_predict(cos)
    id_type = dict(zip(index, labels))
    center_id = dict(zip([i + 1 for i in range(len(cluster_centers_indices))], cluster_centers_indices))
    print(center_id, id_type)
    return id_type, center_id
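
# model/AP.py is not reproduced in this file; a plausible minimal ap_predict
# built on scikit-learn (an assumption, not necessarily the project's actual code):
#     from sklearn.cluster import AffinityPropagation
#     def ap_predict(sim):
#         model = AffinityPropagation(affinity='precomputed').fit(sim)
#         return model.cluster_centers_indices_, model.labels_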


def regression_ex(term='final'):
    """Single-feature linear-regression baseline."""
    # NOTE: keylst is expanded (31 roads x 3024 time slots) but never used below.
    keylst = [
        100115, 100245, 100246, 100374, 100003, 100004, 100020, 100285, 100159, 100287, 100288, 100164, 100300, 100179,
        100053, 100183, 100315, 100061, 100193, 100066, 100457, 100343, 100217, 100434, 100249, 100316, 100329, 100019,
        100340, 100041, 100069
    ]
    keylst = [val for val in keylst for i in range(3024)]
    from sklearn.linear_model import LinearRegression
    prp = PreProcessor(term)  # data manager
    train_x, train_y, test_x = prp.load_traindata()
    # train the model on the first feature column only
    lr = LinearRegression()
    lr.fit(train_x.iloc[:, 0:1].values, train_y)
    test_y = lr.predict(test_x.iloc[:, 0:1].values)  # use the same single feature as in fit
    return test_y
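
# Usage sketch (requires load_traindata()'s buffers to exist; the return shape
# is an assumption):
#     y_pred = regression_ex('final')  # ndarray of predicted flows for test_x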


def regression_ex_vvlj(term='first'):
    """vvlj's variant; currently a verbatim copy of arma_ex (see above)."""
    prp = PreProcessor(term)  # data manager
    preMapping = get_testroad_adjoin(prp)
    submit_df = prp.get_submit()
    # directions required per crossroad in the submission
    dire_dct = {road: list(set(df['direction'])) for road, df in submit_df.groupby('crossroadID')}
    submit_index, day_list = [], range(22, 26)  # timestamps to predict
    for day in day_list:
        submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    predict_df = pd.DataFrame()
    for pre_id in tqdm.tqdm(preMapping.keys()):
        instand_id = preMapping[pre_id]
        dire_list = dire_dct[pre_id].copy()
        for dire, roadflow in prp.get_roadflow_by_road(instand_id):
            if not dire_list:
                break  # all required directions filled; ignore remaining series
            try:
                pred_pre = predict(roadflow)
            except Exception as e:
                print(instand_id, '\t', e)
                pred_pre = pd.Series([0.1] * (144 * 4), index=submit_index)
            # pred_pre = pred_pre.dropna(axis=0, how="any")
            pred_pre.fillna(pred_pre.mean(), inplace=True)
            pred_pre = pred_pre.astype(int)  # truncate flows to whole vehicles
            pred = pd.DataFrame(pred_pre.values, columns=['value'])
            pred['timestamp'] = submit_index
            pred['date'] = pred['timestamp'].apply(lambda x: x.strftime('%d'))
            pred['timeBegin'] = pred['timestamp'].apply(lambda x: f'{x.hour}:{x.strftime("%M")}')
            pred['crossroadID'] = pre_id
            pred['min_time'] = pred['timestamp'].apply(lambda x: int(x.strftime('%M')))
            pred = pred[pred['min_time'] >= 30]
            pred.drop(['timestamp'], axis=1, inplace=True)
            pred = pred[['date', 'crossroadID', 'timeBegin', 'value']]
            if prp.term:  # the final-term submission carries a direction column
                pred['direction'] = dire_list.pop()
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
        while dire_list:  # more directions required than observed series: reuse the last prediction
            pred['direction'] = dire_list.pop()
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
    submit_time_set = set(submit_df['timeBegin'])
    predict_df.set_index('timeBegin', inplace=True)
    return predict_df.loc[list(set(predict_df.index) & submit_time_set)].reset_index()[
        ['date', 'crossroadID', 'direction', 'timeBegin', 'value']]


def regression_many_x(term='final'):
    """Multi-feature linear regression: one model per trained road, mapped onto
    the submission via the BFS neighbour table from get_testroad_adjoin_lr."""
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    prp = PreProcessor(term)
    pred_map = get_testroad_adjoin_lr(prp)
    submit_df = prp.get_submit()
    submit_df.set_index(['timestamp', 'direction', 'crossroadID'], inplace=True)
    fe = FeatureEn(term)
    r2_rst = {}  # per-road hold-out fit quality, kept for inspection
    predict_dct = {}
    for road, train_data, test_data in fe.extract_adjoin_by_col():
        train_data = train_data.dropna(axis=0)
        test_data = test_data.dropna(axis=0)
        X, y = train_data.drop(columns=['y', 'direction']), train_data['y']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        r2_rst[road] = r2_score(y_test, lr.predict(X_test))
        test_data['flow'] = lr.predict(test_data.drop(columns='direction'))
        # format the result
        test_data = test_data.reset_index()
        test_data['crossroadID'] = road
        predict_dct[road] = test_data
    # transfer each trained road's predictions onto the submission rows of its neighbours
    for submit_road, train_road in pred_map.items():
        if train_road is not None:
            test_data = predict_dct[train_road].set_index(['index', 'direction'])
            s_index = submit_df.index & set(i + (submit_road,) for i in test_data.index)
            test_index = list(i[:2] for i in s_index)
            submit_df.loc[s_index, 'value'] = test_data.loc[test_index, 'flow']
    # rows still holding the 0.1 placeholder get the mean flow of their timestamp
    for index in submit_df[submit_df['value'] == 0.1].index:
        submit_df.loc[index, 'value'] = submit_df.loc[index[0], 'value'].mean()
    submit_df = submit_df.reset_index()[['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
    submit_df['value'] = submit_df['value'].apply(int)
    submit_df.to_csv('./data/lr_bfs.csv', index=False)
    return submit_df
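
# Usage sketch: fits one model per trained road, writes ./data/lr_bfs.csv and
# returns the finished submission frame.
#     submit = regression_many_x('final')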


def timestamp_fmt(roadflow):
    """Split a timestamped flow frame into submission date/timeBegin columns,
    keeping only rows in the second half of each hour (minute >= 30)."""
    roadflow['date'] = roadflow['timestamp'].apply(lambda x: '09-' + x.strftime('%d'))
    roadflow['timeBegin'] = roadflow['timestamp'].apply(lambda x: f'{x.hour}:{x.strftime("%M")}')
    roadflow['min_time'] = roadflow['timestamp'].apply(lambda x: int(x.strftime('%M')))
    pred = roadflow[roadflow['min_time'] >= 30].copy()  # .copy() avoids SettingWithCopyWarning
    pred.drop(['timestamp'], axis=1, inplace=True)
    return pred
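
# Worked example (hypothetical row): a timestamp of 2019-09-22 07:35:00 yields
# date='09-22', timeBegin='7:35' and min_time=35, so it passes the
# `min_time >= 30` filter; 2019-09-22 07:05:00 would be dropped.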


def result_fmt(term='first'):
    """Neighbour-copy baseline: reuse each trained neighbour's flows from one
    week earlier (09-15..09-18) as the prediction for 09-22..09-25."""
    # adjacency between checkpoints to predict and trained checkpoints
    prp = PreProcessor(term)  # data manager
    preMapping = get_testroad_adjoin(prp)
    submit_df = prp.get_submit()
    dire_dct = {road: list(set(df['direction'])) for road, df in submit_df.groupby('crossroadID')}
    submit_index, day_list = [], range(22, 26)  # timestamps to predict
    for day in day_list:
        submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    train_index, day_list = [], range(15, 19)  # matching timestamps one week earlier
    for day in day_list:
        train_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    train_index = list(str(i) for i in train_index)
    train_index_set = set(train_index)
    predict_df = pd.DataFrame(columns=['date', 'crossroadID', 'direction', 'timeBegin', 'value'])
    for pre_id in tqdm.tqdm(preMapping.keys()):
        instand_id = preMapping[pre_id]
        dire_list = dire_dct[pre_id].copy()
        if instand_id is None:  # no trained neighbour: fall back to the running mean
            roadflow = pd.DataFrame([predict_df['value'].mean()] * (144 * 4), columns=['value'])
            roadflow['timestamp'] = submit_index
            roadflow['crossroadID'] = pre_id
            pred = timestamp_fmt(roadflow)
        else:
            for dire, roadflow in prp.get_roadflow_by_road(instand_id):
                if not dire_list:
                    break  # all required directions filled; ignore remaining series
                roadflow = roadflow.loc[list(set(roadflow.index) & train_index_set)]
                for ts in set(roadflow.index) ^ train_index_set:
                    roadflow[ts] = roadflow.mean()  # mean-fill missing 5-min slots
                roadflow = pd.DataFrame(roadflow.values, columns=['value'])
                roadflow['timestamp'] = submit_index
                roadflow['crossroadID'] = pre_id
                pred = timestamp_fmt(roadflow)
                if prp.term:  # the final-term submission carries a direction column
                    pred['direction'] = dire_list.pop()
                predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
        while dire_list:  # more directions required than observed series: reuse the last prediction
            pred['direction'] = dire_list.pop()
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
    submit_time_set = set(submit_df['timeBegin'])
    predict_df.set_index('timeBegin', inplace=True)
    predict_df['value'].fillna(predict_df['value'].mean(), inplace=True)
    df = predict_df.loc[list(set(predict_df.index) & submit_time_set)].reset_index()[
        ['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
    df.to_csv('./data/random.csv', index=False)
    return df
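
# Usage sketch: copies each trained neighbour's 09-15..09-18 flows onto
# 09-22..09-25 and writes ./data/random.csv.
#     df = result_fmt('final')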


def arma_base(term='first'):
    """Fit the ARMA model on every trained checkpoint and dump the raw predictions."""
    prp = PreProcessor(term)  # data manager
    submit_df = prp.get_submit()
    submit_index, day_list = [], range(22, 26)  # timestamps to predict
    for day in day_list:
        submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    predict_df = pd.DataFrame()
    for instand_id in prp.load_buffer()['crossroadID'].unique():
        for dire, roadflow in prp.get_roadflow_by_road(instand_id):
            try:
                pred_pre = predict(roadflow)
            except Exception as e:
                print(instand_id, '\t', e)
                pred_pre = pd.Series([0.1] * (144 * 4), index=submit_index)
            pred_pre.fillna(pred_pre.mean(), inplace=True)
            pred_pre = pred_pre.astype(int)  # truncate flows to whole vehicles
            pred = pd.DataFrame(pred_pre.values, columns=['value'])
            pred['timestamp'] = submit_index
            pred['crossroadID'] = instand_id
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
    predict_df.to_csv('./data/arma_base.csv', index=False)
    return predict_df
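
# Usage sketch: one ARMA fit per checkpoint over the whole buffer; writes
# ./data/arma_base.csv.
#     df = arma_base('first')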


if __name__ == '__main__':
    term = 'final'  # preliminary round: 'first'; final round: 'final'
    prp = PreProcessor(term)
    prp.dump_buffer(2)  # load the data
    prp.fill_na()  # fill missing values
    # arma_base(term)  # time-series model
    # result_fmt(term)  # random (neighbour-copy) model
    # regression_many_x()  # regression model (multi-feature)
    # regression_ex(term)  # regression model (single-feature)
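
# Uncomment exactly one of the model lines above before executing this script.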
--------------------------------------------------------------------------------