├── .idea
│   ├── .gitignore
│   └── encodings.xml
├── README.md
├── __pycache__
│   ├── feature_en.cpython-37.pyc
│   └── pre_process.cpython-37.pyc
├── data
│   └── readme.txt
├── docs
│   ├── final term
│   │   ├── feat_en.md
│   │   ├── modeling.md
│   │   ├── preprocess.md
│   │   └── summary_gyg.md
│   ├── fist term
│   │   ├── 1. 需求分析.md
│   │   ├── 2.整体思路.md
│   │   ├── 3.分工合作.md
│   │   ├── 4.讨论记录.md
│   │   ├── assert
│   │   │   ├── image-20191203133449096.png
│   │   │   ├── image-20191203133502011.png
│   │   │   ├── image-20191203133525187.png
│   │   │   ├── 车流量建模-1575354584839.png
│   │   │   ├── 车流量建模-1575354602642.png
│   │   │   └── 车流量建模.png
│   │   ├── img
│   │   │   ├── Q1_数据集_特征工程.png
│   │   │   └── 车流量建模.png
│   │   ├── itf_PreProcessor.md
│   │   └── 建模思路_意见文档.md
│   └── summary
│       └── vvlj.md
├── eda.ipynb
├── evaluator.py
├── feature_en
│   ├── __init__.py
│   └── feature_en.py
├── model
│   ├── AP.py
│   ├── ARMA.py
│   ├── __pycache__
│   │   ├── AP.cpython-37.pyc
│   │   └── ARMA.cpython-37.pyc
│   └── traffic_flow.ipynb
├── pre_process
│   ├── __init__.py
│   └── pre_process.py
└── runex.py

/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /workspace.xml
--------------------------------------------------------------------------------
/.idea/encodings.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # utfp
2 | Spatio-temporal prediction of urban traffic flow, for the Shandong Data Application (Qingdao) Innovation & Entrepreneurship Competition. http://sdac.qingdao.gov.cn/common/cmptIndex.html
3 | 
4 | 
5 | 
6 | #### Preprocessing notes
7 | 
8 | The raw data carries no flow counts; the preprocessing routine counts the vehicles passing each checkpoint within every 5-minute window.
9 | 
10 | For the preprocessing routine to run, the data directory must follow a fixed layout (it has to be created **manually**). The project's data directory is not committed; its structure is shown below, where first holds the preliminary-round data and final holds the final-round data.
11 | 
12 | + data
13 | 
14 |   + first
15 | 
16 |     + testCrossroadFlow
17 |     + testTaxiGPS
18 | 
19 |     + trainCrossroadFlow
20 |     + trainTaxiGPS
21 | 
22 |   + final
23 | 
24 |     + test_user
25 |     + train
26 | 
27 |   + submit
28 | 
29 |     + 0_submit.csv  # preliminary-round submission example
30 |     + 1_submit.csv  # final-round submission example
31 | 
32 | 
33 | 
34 | **Usage example**
35 | 
36 | ```python
37 | # Because of relative paths, call this from the project root, e.g. in runex.py
38 | from pre_process.pre_process import PreProcessor
39 | term = 'final'  # preliminary round: first; final round: final
40 | process_num = 2  # number of processes
41 | PreProcessor(term).dump_buffer(process_num)
42 | 
43 | # The resulting flow files are
44 | # preliminary round: ./data/0_flow_data.csv ['crossroadID', 'timestamp', 'flow']
45 | # final round: ./data/1_flow_data.csv ['crossroadID', 'direction', 'timestamp', 'flow']
46 | ```
47 | 
48 | 
49 | 
50 | #### Documentation notes
51 | 
52 | Discussion notes, problems and summaries from the competition live in the docs folder.
53 | 
54 | + docs
55 |   + final term: final-round notes
56 |   + fist term: preliminary-round notes
57 |   + summary: summaries
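58 | 
59 | A quick sanity check on the generated file (a minimal sketch; it assumes dump_buffer has already produced ./data/1_flow_data.csv with the columns listed above):
60 | 
61 | ```python
62 | import pandas as pd
63 | 
64 | # columns: crossroadID, direction, timestamp, flow
65 | flow = pd.read_csv('./data/1_flow_data.csv', parse_dates=['timestamp'])
66 | 
67 | # one column per (crossroadID, direction), one row per 5-minute slot
68 | wide = flow.pivot_table(index='timestamp',
69 |                         columns=['crossroadID', 'direction'],
70 |                         values='flow')
71 | print(wide.shape)
72 | print(flow['flow'].describe())
73 | ```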
--------------------------------------------------------------------------------
/__pycache__/feature_en.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/__pycache__/feature_en.cpython-37.pyc
--------------------------------------------------------------------------------
/__pycache__/pre_process.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/__pycache__/pre_process.cpython-37.pyc
--------------------------------------------------------------------------------
/data/readme.txt:
--------------------------------------------------------------------------------
1 | This file only describes the composition of the competition data.
2 | 
3 | 1. The data contains two kinds of records:
4 |    1. Vehicle-passage records from Qingdao traffic checkpoints (primary data)
5 |    2. GPS records of Qingdao taxis (auxiliary data)
6 | 
7 | 2. Time span:
8 |    August 1, 2019 ~ August 23, 2019
9 | 
10 | 3. Train/test split:
11 |    Split by date.
12 |    Training set: August 1 ~ August 19
13 |    Test set: August 20 ~ August 23
14 | 
15 | 4. Folder names:
16 |    1. trainTaxiGPS: taxi GPS records for the training dates
17 |    2. testTaxiGPS: taxi GPS records for the test dates
18 |    3. trainCrossroadFlow: checkpoint vehicle-passage data for the training dates
19 |       This folder additionally contains
20 |       the road-network information roadnet.csv
21 |       the checkpoint name information crossroadName.csv
22 |    4. testCrossroadFlow: checkpoint vehicle-passage data for the test dates, restricted to specific time slots
23 |       This folder additionally contains
24 |       the sample submission file submit_example.csv
25 |    5. 字段对应关系说明.xlsx: the original traffic-road description, for reference. roadnet.csv and crossroadName.csv were generated from it.
--------------------------------------------------------------------------------
/docs/final term/feat_en.md:
--------------------------------------------------------------------------------
1 | #### Adjacency dataset
2 | 
3 | Extract a checkpoint's flow at a given moment (flow) as the target, with the mean flow of its adjacent checkpoints (mean_flow)
4 | 
5 | as the feature
6 | 
7 | ```python
8 | crossroadID, timestamp, flow -> mean_flow, flow
9 | ```
10 | 
11 | Missing values: drop;
12 | 
13 | **Idea**: use a multi-level index plus table transposition to get a table whose columns are checkpoints, which makes indexing easy
14 | 
15 | ```python
16 | crossroadID, direction, timestamp, flow ->
17 | crossroadID, timestamp, flow ->
18 |                      crossroadID1, crossroadID2, crossroadID3
19 | timestamp direction ->
20 | ```
21 | 
22 | Option 1: iterate by row, building the new data from each intersection's flow at **one moment** at a time
23 | 
24 | Option 2: iterate by column, using matrix operations to build the new data from **all** flows of an intersection at once
25 | 
26 | 
27 | 
28 | Problems:
29 | 
30 | 1. Some checkpoints in the flow data do not appear in the road network
31 | 
32 | 2. ```python
33 |    pd.DataFrame([[1, pd.np.nan, 1]]).mean(axis=1)
34 |    # NaN is treated as missing and excluded from the mean
35 |    ```
36 | 
37 | #### Adjacency dataset (per direction)
38 | Extract a checkpoint's per-direction flow at a given moment (flow) together with the per-direction flows of its adjacent checkpoints (X)
39 | 
40 | ```python
41 | crossroadID, timestamp, flow -> (road1, direction1), (road2, direction2), direction, flow
42 | ```
43 | 
44 | 
45 | Options:
46 | 
47 | 1. pd.DataFrame.append
48 | 
49 | 2. pd.concat
--------------------------------------------------------------------------------
/docs/final term/modeling.md:
--------------------------------------------------------------------------------
1 | [@__@]: "pass" marks content to be filled in later
2 | 
3 | ### 1. Goal
4 | 
5 | > Predict the traffic flow at specific checkpoints, in specific directions, over specific time slots
6 | 
7 | 
8 | 
9 | ### 2. Prediction methods
10 | 
11 | #### 2.1 Regression
12 | 
13 | Build the dataset from the adjacency between checkpoints; there are two variants. Variant one ignores the time dimension: a checkpoint's flow at one moment is the target $y_i$, and the **mean** flow of its adjacent checkpoints is $x_i$. Variant two also adds the adjacent-checkpoint means of neighbouring moments as features, e.g. the current moment as $x_{2i}$, the previous moment as $x_{1i}$ and the next moment as $x_{3i}$.
14 | 
15 | - X: the **mean** flow of the adjacent checkpoints
16 | - y: the checkpoint's flow at that moment
17 | 
18 | **Updated version** (a sketch follows at the end of this document)
19 | 
20 | All y values below are pushed back by 30 minutes!
21 | 1. Predict one intersection only: ridge regression with the per-direction flows of its adjacent checkpoints plus this checkpoint's direction as X, and this checkpoint's per-direction flow as y. (To capture time-of-day effects, add the clustering label as an extra feature.)
22 | 2. Predict one intersection only: ridge regression with the per-direction flows of its adjacent checkpoints as X and the checkpoint's total flow as y; then regress again with the total flow plus the checkpoint's direction as X and the per-direction flows as y.
23 | 
24 | **Problems**
25 | 
26 | > 1. Some checkpoints' adjacent checkpoints never appear in the training set yet do appear in the test set; they need a second prediction pass after the other test checkpoints have been predicted.
27 | > 2. Submit-data issues:
28 | >    1. The training set (days 1-20) has whole days of data; the test set (days 22-25) only has the first half of each hour
29 | >    2. The submit checkpoints only need the second half of each hour predicted
30 | >    3. Checkpoint directions are only {1,3,5,7}, but test checkpoint 100300 has directions {1, 2, 5, 7}
31 | >    4. 7:30 does not line up with 07:30 -> '7:30'.rjust(5, '0')
32 | > 3. How to tune the model? ---> from the submitted scores ----> for now just handle the key data and predict
33 | > 4. Only about half of the training checkpoints have their adjacent nodes inside the training set, so the final result was poor
34 | 
35 | **References**
36 | 
37 | > none
38 | 
39 | 
40 | 
41 | 
42 | 
43 | #### 2.2 Algorithm 2
44 | 
45 | pass
46 | 
47 | + X
48 | + y
49 | 
50 | > pass
51 | 
52 | **References**
53 | 
54 | > pass
55 | 
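56 | 
57 | A minimal sketch of scheme 1 of the updated version above (synthetic data only; the real feature table is built by FeatureEn.extract_adjoin_by_col, whose column names differ):
58 | 
59 | ```python
60 | import numpy as np
61 | import pandas as pd
62 | from sklearn.linear_model import Ridge
63 | 
64 | rng = np.random.default_rng(0)
65 | n = 500  # 5-minute slots
66 | 
67 | # stand-in for the per-direction flows of two adjacent checkpoints
68 | X = pd.DataFrame({'adj1_dir1': rng.poisson(20, n),
69 |                   'adj1_dir3': rng.poisson(15, n),
70 |                   'adj2_dir5': rng.poisson(25, n)})
71 | X['direction'] = rng.choice([1, 3, 5, 7], n)
72 | X = pd.get_dummies(X, columns=['direction'])  # one-hot this checkpoint's direction
73 | 
74 | # target: this checkpoint's flow pushed back 30 minutes (6 slots)
75 | y = pd.Series(rng.poisson(18, n), dtype=float).shift(-6).ffill()
76 | 
77 | model = Ridge(alpha=1.0).fit(X, y)
78 | print(model.score(X, y))
79 | ```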
--------------------------------------------------------------------------------
/docs/final term/preprocess.md:
--------------------------------------------------------------------------------
1 | ### Raw data
2 | 
3 | + a **vehicle** passing a **checkpoint** at a given **moment**
4 | 
5 | ```python
6 | direction, laneID, timestamp, crossroadID, vehicleID
7 | ```
8 | + training set, test set and submit file
9 |   + training set: days 1-19, whole days -- 07:00, 07:05, ..., 18:55
10 |   + test set: days 22-25, only the first half of each hour -- 07:00, 07:05, ..., 07:25, 08:00, ..., 18:25
11 |   + submit file: days 22-25, only the second half of each hour needs predicting -- 07:30, 07:55, ..., 18:55
12 | 
13 | 
14 | ### Requirement
15 | 
16 | Count each checkpoint's vehicle flow per 5-minute window.
17 | 
18 | 1. the flow time series of a single checkpoint
19 | 2. the flow of every checkpoint within a single time slot
20 | 
21 | 
22 | 
23 | ### Approach
24 | 
25 | 1. pickle a dict into a cache file
26 | 
27 | ```python
28 | {'road': pd.Series(flow_lst, index=timestamp)}  # preliminary round
29 | {'road': {'direction': pd.Series(flow_lst, index=timestamp)}}  # final round
30 | ```
31 | 
32 | 2. write a table into a cache file
33 |    1. read the raw files
34 |    2. count each checkpoint's flow per 5-minute window
35 |    3. write to file in batches
36 | 
37 | ```python
38 | timestamp, crossroadID, flow  # preliminary round
39 | direction, timestamp, crossroadID, flow  # final round
40 | ```
41 | 
42 | ### Interface
43 | ```python
44 | # on error, first check the data directory layout,
45 | # then call PreProcessor('final').dump_buffer() to cache the data
46 | flow_df = PreProcessor('final').get_timeflow()
47 | ```
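48 | 
49 | The counting step in miniature (a sketch, not the project code; the path follows the layout in pre_process.path_dct):
50 | 
51 | ```python
52 | import pandas as pd
53 | 
54 | raw = pd.read_csv('./data/final/train/train_trafficFlow_09-01.csv',
55 |                   parse_dates=['timestamp'])
56 | # floor each record to its 5-minute slot, then count vehicles
57 | # per (checkpoint, direction, slot)
58 | raw['timestamp'] = raw['timestamp'].dt.floor('5min')
59 | flow = (raw.groupby(['crossroadID', 'direction', 'timestamp'])
60 |            .size().rename('flow').reset_index())
61 | ```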
--------------------------------------------------------------------------------
/docs/final term/summary_gyg.md:
--------------------------------------------------------------------------------
1 | #### Summary:
2 | 
3 | ---
4 | 
5 | The competition had a preliminary round and a final round
6 | 
7 | - Preliminary round
8 | 
9 |   Embarrassingly, the preliminary round was mostly carried by Cheng and Le; the approach was time series plus neighbour aggregation, and we advanced to the final round.
10 | 
11 | - Final round
12 | 
13 |   For the final round I built on Cheng's statistics to get the checkpoint data, then tried an LSTM to predict the missing values, but it did not work well; presumably the data missing in whole blocks made the model unstable. Later I crawled the checkpoints' GPS coordinates and used geographic-distance weighting to predict the flow of checkpoints with no history. Our final score was about 66 and our best rank was top 10, but only the top 3 entered the finals, so we stopped there.
14 | 
15 | ----
16 | 
17 | Shortcomings:
18 | 
19 | 1. Not capable enough yet for this kind of prediction task with few references to draw on.
20 | 2. Not enough skill and experience with data-science competitions; more practice is needed.
--------------------------------------------------------------------------------
/docs/fist term/1. 需求分析.md:
--------------------------------------------------------------------------------
1 | ### Prediction target
2 | 
3 | Flow at multiple intersections over multiple time slots. There are therefore two ways to predict, two entry points into the problem:
4 | 
5 | 1. fix the time slot and predict the flow at every intersection
6 | 2. fix the intersection and predict the flow at every time slot
7 | 
8 | 
9 | 
10 | ### Data
11 | 
12 | 1. Flow (target): the number of vehicles passing a road intersection per unit time; judging from the submission example (submit_example) the unit should be **vehicles per five minutes**, defined as count/unitTime
13 | 2. Intersection topology data (headNode-tailNode)
14 | 3. Checkpoint stream: time, intersection Id, lane Id, vehicle Id, direction
15 | 4. Vehicle trajectories: time, vehicle Id, longitude/latitude, speed, direction
16 | 
17 | ![img](http://dc-anhui.obs.cn-east-2.myhwclouds.com/pkbigdata/master.other.img/f4b7ec31-8ddc-4ba0-b2ab-57e8e45c55d9.png)
18 | 
19 | 
20 | 
21 | ### Model
22 | 
23 | First, checkpoint streams and vehicle trajectories are two kinds of data and should count as heterogeneous data. As for models, all that comes to mind is time-series models... the idea is not yet clear. The above is everything I know so far.
--------------------------------------------------------------------------------
/docs/fist term/2.整体思路.md:
--------------------------------------------------------------------------------
1 | #### Problem: short-term traffic flow prediction
2 | 
3 | Two keywords: time series, spatial topology
4 | 
5 | 
6 | 
7 | #### Approach
8 | 
9 | Start from the time series and predict with machine-learning methods. Then bring in the spatial topology, combined with domain knowledge (traffic), for data mining.
10 | 
11 | For machine-learning methods, see Liu's survey, which covers multiple linear regression, historical-trend models, neural networks, time-series models, Kalman filtering and multi-model fusion. The time-series models in 《Python数据挖掘与分析》 are also a good starting point.
--------------------------------------------------------------------------------
/docs/fist term/3.分工合作.md:
--------------------------------------------------------------------------------
1 | ### Division of work
2 | 
3 | Abilities being equal, tasks are assigned by each member's available time: less time means taking the later stages, more time the earlier stages.
4 | 
5 | 
6 | 
7 | #### How to split
8 | 
9 | From the modelling angle, the tasks split into predicting **multiple time slots** of one intersection with a time-series model, and predicting **multiple intersections** at one time slot with some other model. (Combining the two comes later.)
10 | 
11 | From the process angle, the tasks split into preprocessing, feature engineering, modelling, and evaluation & visualisation.
12 | 
13 | The modelling split waits until everyone understands how to model. The **process split** can be fixed now according to everyone's time.
14 | 
15 | 
16 | 
17 | #### Idea
18 | 
19 | Split by process, exactly one person per module: preprocessing, feature engineering, modelling, evaluation & visualisation. The modules are finally merged into one complete project. For collaboration, the initial **project skeleton** and its **interfaces** must be fixed.
20 | 
21 | The project skeleton I came up with:
22 | 
23 | + pre_process.py: preprocessing
24 |   + class PreProcessor
25 | + feature_en.py: feature engineering
26 |   + class FeatureEn
27 | + model: models
28 |   + model1.py
29 |     + class Model1
30 |   + model2.py
31 |     + class Model2
32 | + evaluator.py: evaluation and visualisation
33 |   + class Evaluator
34 |   + class Draw
35 | + EX.py: for experiments
--------------------------------------------------------------------------------
/docs/fist term/4.讨论记录.md:
--------------------------------------------------------------------------------
1 | #### Discussion notes
2 | 
3 | #### 1. Additions
4 | 
5 | First, to add what the meeting skipped, namely the current progress. Done so far:
6 | 
7 | + can obtain vehicle-flow time-series data
8 |   + one intersection, multiple time slots
9 |   + one time slot, multiple intersections
10 | + ARMA model
11 | 
12 | Next, each member's view of the project:
13 | 
14 | + feature engineering: everyone agrees new methods are needed
15 | + models
16 |   + Weile/Jiongcheng: there is little room left to optimise the ARMA model; the search for new models can proceed **in parallel**
17 |   + Yanggan/Zhengting: whether to search for new models should depend on the ARMA results
18 | 
19 | 
20 | 
21 | #### 2. Content
22 | 
23 | The project approach was settled earlier, so this discussion covered only one new feature-engineering method.
24 | 
25 | 
26 | 
27 | #### 3. Next steps
28 | 
29 | Seek common ground while keeping differences
30 | 
31 | 1. Common ground: continue the direction everyone agrees on and look for **feature-engineering** methods.
2. Differences: whether to look for new modelling methods is up to each member.
--------------------------------------------------------------------------------
/docs/fist term/assert/image-20191203133449096.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/image-20191203133449096.png
--------------------------------------------------------------------------------
/docs/fist term/assert/image-20191203133502011.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/image-20191203133502011.png
--------------------------------------------------------------------------------
/docs/fist term/assert/image-20191203133525187.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/image-20191203133525187.png
--------------------------------------------------------------------------------
/docs/fist term/assert/车流量建模-1575354584839.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/车流量建模-1575354584839.png
--------------------------------------------------------------------------------
/docs/fist term/assert/车流量建模-1575354602642.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/车流量建模-1575354602642.png
--------------------------------------------------------------------------------
/docs/fist term/assert/车流量建模.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/assert/车流量建模.png
--------------------------------------------------------------------------------
/docs/fist term/img/Q1_数据集_特征工程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/img/Q1_数据集_特征工程.png
--------------------------------------------------------------------------------
/docs/fist term/img/车流量建模.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/docs/fist term/img/车流量建模.png
--------------------------------------------------------------------------------
/docs/fist term/itf_PreProcessor.md:
--------------------------------------------------------------------------------
1 | Get a single intersection's flow data for all time slots
2 | 
3 | ```python
4 | def get_roadFlow(self, i) -> object:
5 |     '''Get a single intersection's flow data for each time slot
6 |     :param i: day number
7 |     :return:
8 |         dfFlow: pd.DataFrame, the raw vehicle-flow table
9 |         dFlow: {crossroadID: pd.Series}, flow time series
10 |     '''
11 | 
12 | # usage example
13 | from pre_process import PreProcessor
14 | prp = PreProcessor()  # data manager
15 | dfFlow, dFlow = prp.get_roadFlow(1)  # raw flow table, flow time series
16 | ```
17 | 
18 | Cache file notes
19 | + flow_i: every intersection's flow on day i
20 | + roadFlowTotal_x: all time slots' flow for intersection x
--------------------------------------------------------------------------------
/docs/fist term/建模思路_意见文档.md:
--------------------------------------------------------------------------------
1 | ## Feature engineering
2 | 
3 | ### Short-term traffic flow modelling with spatio-temporal semantics
4 | 
5 | The paper extends the statistical **correlation coefficient** with spatio-temporal semantics, introducing a **spatial weight matrix** and a **time delay** to express the spatio-temporal correlation between flow series. Using the spatio-temporal correlation coefficient as the criterion, it quickly selects the predictors related to the prediction point, and finally uses a **support vector machine** as the prediction tool, giving a short-term traffic flow prediction method based on spatio-temporal analysis.
6 | 
7 | #### 1. Concepts
8 | 
9 | Notation
10 | 
11 | | Symbol | Meaning |
12 | | ------ | ------- |
13 | | S | spatial units (intersections/segments), N of them |
14 | | X | spatio-temporal attribute series (vehicle flow), n time slots |
15 | 
16 | 
17 | 
18 | **Time delay $d$**
19 | 
20 | As in the figure below, $X_i$ is the flow series of spatial unit $i$ over n time slots. The **hyper-parameter** time delay $d$ is the slot offset between two series.
21 | 
22 | ![image-20191203133525187](assert/image-20191203133525187.png)
23 | 
24 | **Spatio-temporal weight matrix P**
25 | 
26 | P is the spatial weight matrix of segment $i$, assuming the number of spatial units $N=n$ and maximum time delay $d=D$; columns run over space, rows over time.
27 | $$
28 | P =\left|
29 | \begin{array}{cccc}
30 | \rho_{i1}(0) & \rho_{i2}(0) & \ldots & \rho_{in}(0) \\
31 | \rho_{i1}(1) & \rho_{i2}(1) & \ldots & \rho_{in}(1) \\
32 | \ldots & \ldots & \ldots & \ldots \\
33 | \rho_{i1}(D) & \rho_{i2}(D) & \ldots & \rho_{in}(D) \\
34 | \end{array}
35 | \right|
36 | $$
37 | The **spatio-temporal correlation coefficient $\rho_{ij}(d)$** is computed as
38 | $$
39 | \rho_{ij}(d) = w_{ij}\frac{\sum_{t=1}^{n-d}[x_i(t+d)-\bar x_i(n-d)][x_j(t)-\bar x_j(n-d)]}{\sqrt{\sum_{t=1}^{n-d}[x_i(t+d)-\bar x_i(n-d)]^2\sum_{t=1}^{n-d}[x_j(t)-\bar x_j(n-d)]^2}}
40 | $$
41 | where $\bar x_i(n-d)$ and $\bar x_j(n-d)$ are the means over the $n-d$ overlapping terms and $w_{ij}$ is the spatial weight between units $i$ and $j$.
42 | 
43 | 
44 | #### 2. Pipeline
45 | 
46 | The overall pipeline has two main steps: spatio-temporal analysis for feature engineering, then min-max normalisation of the target.
47 | 
48 | (flowchart: 车流量建模.png)
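49 | 
50 | A direct numpy transcription of $\rho_{ij}(d)$ (a sketch; the spatial weight $w_{ij}$ is assumed given):
51 | 
52 | ```python
53 | import numpy as np
54 | 
55 | def rho(x_i, x_j, d, w_ij=1.0):
56 |     """Spatio-temporal correlation of series x_i and x_j at time delay d."""
57 |     xi, xj = np.asarray(x_i, float), np.asarray(x_j, float)
58 |     n = len(xi)
59 |     a = xi[d:]            # x_i(t+d), t = 1..n-d
60 |     b = xj[:n - d]        # x_j(t),   t = 1..n-d
61 |     a, b = a - a.mean(), b - b.mean()
62 |     denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
63 |     return w_ij * (a * b).sum() / denom if denom else 0.0
64 | 
65 | x = np.sin(np.linspace(0, 20, 200))
66 | print(rho(x, x, d=3))  # high autocorrelation at a small delay
67 | ```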
--------------------------------------------------------------------------------
/docs/summary/vvlj.md:
--------------------------------------------------------------------------------
1 | [TOC]
2 | 
3 | 
4 | 
5 | ### On collaboration
6 | 
7 | Since git traces who changed which part of a file, several people maintaining the same file easily produces conflicts.
8 | 
9 | + So when several people own different parts of one feature (say preprocessing or feature engineering), split the work across files where possible, or push only after reaching agreement, to avoid conflicts.
10 | + Use the advanced commands
11 | 
12 | 
13 | 
14 | ### Knowledge notes
15 | 
16 | #### Time (datetime)
17 | 
18 | + Timestamp arithmetic
19 | 
20 | ```python
21 | # timedelta can be used for time addition and subtraction
22 | from datetime import timedelta
23 | timedelta(days=1, minutes=1, seconds=1)
24 | ```
25 | 
26 | + String to timestamp
27 | 
28 | ```python
29 | # str -> pandas._libs.tslibs.timestamps.Timestamp
30 | import pandas as pd
31 | pd.to_datetime('2019-09-25 18:00:00')
32 | ```
33 | 
34 | 
35 | 
36 | 
37 | 
38 | #### DataFrame
39 | 
40 | + Grouping
41 | 
42 | ```python
43 | # name, group_df (includes the grouped columns)
44 | for name, group_df in df.groupby(['column', ]):
45 |     pass
46 | 
47 | # name, group_index (the row indices)
48 | for name, group_index in df.groupby(['column', ]).groups.items():  # .groups is a dict
49 |     pass
50 | 
51 | # group_df (includes the grouped columns)
52 | group_df = df.groupby(['column', ]).get_group(name)
53 | ```
54 | 
55 | + apply, iterating column-wise
56 | 
57 | ```python
58 | # Series: the first argument x is the element; extra arguments such as y may be passed
59 | series.apply(lambda x, y: x + y, y=1)
60 | 
61 | # DataFrame: x is the column1 element, y the column2 element; no extra arguments
62 | df['column1', 'column2'].apply(lambda x, y: x + y)
63 | ```
64 | 
65 | + Deduplication
66 | 
67 | ```python
68 | # subset: columns to deduplicate on; keep='last' keeps the last occurrence (default is the first)
69 | df.drop_duplicates(subset=['column', ], keep='last')
70 | ```
71 | 
72 | + Indexing
73 | 
74 | ```python
75 | # set a column as the index
76 | df.set_index(['column', ])
77 | # turn the index back into a column
78 | df.reset_index()
79 | # move a row-index level into the columns
80 | df.unstack()  # by default starts from the innermost level
81 | 
82 | # multi-level access, level by level
83 | df['column1']['column2']  # outside in
84 | df['column1', 'column2']  # one multi-level key
85 | # multi-level row access
86 | df.loc['index1'].loc['index2']  # left to right
87 | df.loc[('index1', 'index2')]  # one multi-level key
88 | ```
89 | 
90 | + Concatenating tables
91 | 
92 | ```python
93 | # appending many rows at once by concatenation is faster than adding single rows
94 | # axis=0: vertical concat; ignore_index: renumber the index
95 | pd.concat((df1, df2), axis=0, ignore_index=True)
96 | # axis=1: horizontal concat
97 | pd.concat((df1, df2), axis=1, ignore_index=True)
98 | ```
99 | 
100 | 
101 | 
102 | #### Writing csv files
103 | 
104 | ```python
105 | import csv
106 | with open(path, 'w', newline='') as f:
107 |     handler = csv.writer(f)
108 |     handler.writerows([[]])  # writerows takes a list of rows; writerow writes a single row
109 | ```
110 | 
111 | 
112 | 
113 | #### String handling
114 | 
115 | + Right-align and zero-pad
116 | 
117 | ```python
118 | '7:30'.rjust(5, "0")
119 | ```
120 | 
121 | 
122 | 
123 | ### Summary
124 | 
125 | This project dragged on from November 2019 to April 2020, five whole months in all, though far less time was actually spent on it. Between the preliminary and the final rounds there was a stretch of data preparation. It was our first proper competition; the rank was not high, but we gained a lot from it: team collaboration, and the things to watch for when starting a data-mining project.
126 | 
127 | First, collaboration. When a team works on one project together, mind each person's role and progress. We split work along the data-mining pipeline: preprocessing, feature engineering, modelling. Each member need not know everyone else's progress in detail, but must know where the project stands as a whole: does everyone understand the task requirements, have a basic picture of the data, know what the basic functions do. People are lazy by nature, so to collaborate efficiently you have to actively check on others' progress. Also, we never standardised our documents and code, which made communication and record-keeping harder.
128 | 
129 | Next, lessons on the data mining itself. Knowing the data's profile (missing values, field meanings and so on) matters enormously. Because we skimped on this early, patching the model afterwards was painful. This traffic data had many kinds of missingness: missing flows, missing adjacent checkpoints, even whole missing series in the test set. These invalidated several planned models, e.g. the regression model: use ridge regression to relate a checkpoint's flow to the per-direction flows of its adjacent checkpoints (a larger weight means two direction-specific approaches are more likely connected), then combine the adjacency and the weights to derive the test checkpoints' flows. Next competition we must spend more time reading the data description and the rules.
130 | 
131 | Finally, after every project, reflect on what was learned and what was lacking, and write at least some of it down.
", 24 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXMAAAD2CAYAAAAksGdNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAM9UlEQVR4nO3db2hd93nA8e8zuwFhNZmLg4bDqDGYwZjqNhFdDB5claQ0C2NdGDRgOrwWvK1he+M3LvUYhDFCWMJYIaHaXC/t1nruoFk2uyHd4DIPXKjNaF32h+2F0mGSjM6ZjUwouDx7oRNL15Et6erce6TH3w8Yn3uOdM5PP8tfH59zr25kJpKkre2nuh6AJGnjjLkkFWDMJakAYy5JBRhzSSpgexcH3bVrV+7Zs6eLQ7fm+vXr7Nixo+thbBrOxyDnY4lzMWgj83Hx4sUfZeb9K23rJOZ79uzhwoULXRy6Nf1+n16v1/UwNg3nY5DzscS5GLSR+YiI12+3zcssklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVEAnrwDdqvYcO3Nz+ej0DQ4vezxK8888PpbjSNq6PDOXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAJWjXlE3BcR34qI1yLimxFxT0SciIjzEXF82ce9Z50kaTzWcmZ+CHg+Mz8OvAk8CWzLzAPA3ojYFxFP3LpudEOWJN0qMnPtHxzxN8C9wJ9k5tmIeBKYAD4CvLp8XWaevOVzjwBHAKamph46depUW1/D2Fy6fPXm8tQEvPXOeI47/cB94znQBiwsLDA5Odn1MDYN52OJczFoI/MxOzt7MTNnVtq2fa07iYgDwE5gHrjcrL4CPAjsWGHdgMycA+YAZmZmstfrrfXQm8bhY2duLh+dvsFzl9Y8fRsyf6g3luNsRL/fZyv+mY6K87HEuRg0qvlY0w3QiPgA8EXgM8ACi2fjAJPNPlZaJ0kak7XcAL0H+Abw+cx8HbgIHGw272fxTH2ldZKkMVnLdYLPsnjZ5AsR8QXgJPDpiNgNPAY8DCRw7pZ1kqQxWTXmmfki8OLydRHxCvAo8GxmXm3W9W5dJ0kaj6Hu4GXm28Dp1dZJksbDG5WSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpALG8wO5tSF7lv0c9XGbf+bxzo4tae08M5ekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAtYU84iYiohzzfL2iPhhRPSbX9PN+hMRcT4ijo9ywJKk91o15hGxE3gJ2NGs+hDw9czsNb8uRcQTwLbMPADsjYh9oxuyJOlWkZl3/oCIe4EA/jYzexHxOeAp4DpwCfgt4Hng1cw8GxFPAhOZefKW/RwBjgBMTU09dOrUqda/mFG7dPnqzeWpCXjrnQ4HMybTD9y3po9bWFhgcnJyxKPZOpyPJc7FoI3Mx+zs7MXMnFlp2/bVPjkzrwFExLurvgs8kplvRMRXgF9m8az9crP9CvDgCvuZA+YAZmZmstfrre+r2AQOHztzc/no9A2eu7Tq9G1584d6a/q4fr/PVvwzHRXnY4lzMWhU8zFMjb6fmT9uli8A+4AFYKJZN4k3ViVprIaJ7lcjYn9EbAM+CXwPuAgcbLbvB+bbGZ4kaS2GOTN/Gvgai9fRX8nMf2iuq5+LiN3AY8DDLY5RkrSKNcc8M3vN7z9g8Rkty7ddi4ge8CjwbGZefc8OJEkj09odvMx8Gzjd1v4kSWvnjUpJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVED9t8rRhuxZ9u5Kd3J0+sbAOzFt1Pwzj7e2L+lu4Jm5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySClhTzCNiKiLOLXt8IiLOR8TxO62TJI3HqjGPiJ3AS8CO5vETwLbMPADsjYh9K60b5aAlSYPWcmb+E+BTwLXmcQ843Sy/Bhy8zTpJ0phsX+0DMvMaQES8u2oHcLlZvgI8eJt1AyLiCHAEYGpqin6/v4Fhd+Po9I2by1MTg4/vdm3Px1b8/lhuYWFhy38NbXEuBo1qPlaN+QoWgIlmeZLFs/uV1g3IzDlgDmBmZiZ7vd4Qh+7W4WNnbi4fnb7Bc5eGmb6a2p6P+UO91vbVhX6/z1b8Hh8F52LQqOZjmGezXGTpMsp+YP426yRJYzLMqdTLwLmI2A08BjwM5ArrJEljsuYz88zsNb9fY/GG53eA2cy8utK61kcqSbqtoS5yZubbLD175bbrJEnj4StAJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKMOaSVIAxl6QCjLkkFWDMJakAYy5JBRhzSSrAmEtSAcZckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IB6455RGyPiB9GRL/5NR0RJyLifEQcH8UgJUl3NsyZ+YeAr2dmLzN7wD5gW2YeAPZGxL42ByhJWl1k5vo+IeJzwFPAdeAS8GPg7zPzbEQ8CUxk5skVPu8IcARgamrqoVOnTm107GN36fLVm8tTE/DWOx0OZpNpez6mH7ivvZ11YGFhgcnJya6HsSk4F4M2Mh+zs7MXM3NmpW3bh9jfd4FHMvONiPgK8DHgS822K8CDK31SZs4BcwAzMzPZ6/WGOHS3Dh87c3P56PQNnrs0zPTV1PZ8zB/qtbavLvT7fbbi9/goOBeDRjUfw1xm+X5mvtEsXwB2ARPN48kh9ylJ2oBhwvvViNgfEduAT7J4yeVgs20/MN/S2CRJazTM/4ufBr4GBPAK8DJwLiJ2A48BD7c3PEnSWqw75pn5Axaf0XJTRPSAR4FnM/PqSp8nSRqdVu5YZebbwOk29iVJWj9vVkpSAcZckgow5pJUwJZ71cueZS/ckSQt8sxckgow5pJUgDGXpAKMuSQVYMwlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQVsuR+Bq7tDlz/qeP6Zxzs7tjQsz8wlqQBjLkkFGHNJKsCYS1IBxlySCjDmklSAMZekAoy5JBVgzCWpAGMuSQUYc0kqwJhLUgHGXJIKM
OaSVIAxl6QCjLkkFWDMJamAVt9pKCJOAD8PnMnMP2xz39K4tPEuR0enb3C4w3dLWg/fWamG1mIeEU8A2zLzQER8OSL2ZeZ/trV/SWpLl29L+Bef2DGS/UZmtrOjiD8FXs3MsxHxJDCRmSeXbT8CHGke/hzwH60cuDu7gB91PYhNxPkY5HwscS4GbWQ+PpiZ96+0oc3LLDuAy83yFeDB5Rszcw6Ya/F4nYqIC5k50/U4NgvnY5DzscS5GDSq+WjzBugCMNEsT7a8b0nSHbQZ3IvAwWZ5PzDf4r4lSXfQ5mWWl4FzEbEbeAx4uMV9b0ZlLhm1xPkY5HwscS4GjWQ+WrsBChARO4FHgX/KzDdb27Ek6Y5ajbkkqRvepJSkAoz5OkXEfRHxrYh4LSK+GRH3dD2mrkXEVET8S9fj2Cwi4oWI+JWux9G1iNgZEWcj4kJEfKnr8XSp+TtybtnjExFxPiKOt3UMY75+h4DnM/PjwJvAJzoez2bwxyw9LfWuFhG/BPxMZv5d12PZBD4N/FXznOr3R8Rd+Vzz5l7iSyy+Fmfg1fLA3ojY18ZxjPk6ZeYLmfnt5uH9wP90OZ6uRcTHgOss/sN2V4uI9wF/BsxHxK92PZ5N4H+BX4iInwZ+FvjvjsfTlZ8AnwKuNY97wOlm+TWWntK9IcZ8SBFxANiZmd/peixdaS4x/T5wrOuxbBK/Afwr8Czw0Yj43Y7H07V/Bj4I/B7wbyy+Mvyuk5nXMvPqslW3vlp+qo3jGPMhRMQHgC8Cn+l6LB07BryQmf/X9UA2iY8Ac83Tcv8SmO14PF37A+C3M/Np4N+B3+x4PJvFSF4tb8zXqTkb/Qbw+cx8vevxdOwR4KmI6AMfjog/73g8XfsvYG+zPAPc7d8fO4HpiNgG/CLg86AXjeTV8j7PfJ0i4neAPwK+16x6MTP/usMhbQoR0c/MXtfj6FJEvB/4Mov/bX4f8OuZefnOn1VXRHwUOMnipZbzwK9l5kK3o+rOu39HIuJe4BzwjzSvlr/lMsxw+zfmkjReo3i1vDGXpAK8Zi5JBRhzSSrAmEtSAcZckgow5pJUwP8DmOcjTHmrYkoAAAAASUVORK5CYII=\n" 25 | }, 26 | "metadata": { 27 | "needs_background": "light" 28 | }, 29 | "output_type": "display_data" 30 | } 31 | ], 32 | "source": [ 33 | "from feature_en.feature_en import FeatureEn\n", 34 | "from myfunc.matplot import *\n", 35 | "import pandas as pd\n", 36 | "\n", 37 | "fe = FeatureEn()\n", 38 | "\n", 39 | "# 查看卡口的度的分布特征\n", 40 | "d = pd.Series(list(len(v) for v in fe.adj_map.values()), index=fe.adj_map.keys())\n", 41 | "print(d.describe())\n", 42 | "d.hist()\n", 43 | "plt.show()" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "outputs": [], 50 | "source": [], 51 | "metadata": { 52 | "collapsed": false, 53 | "pycharm": { 54 | "name": "#%%\n" 55 | } 56 | } 57 | } 58 | ], 59 | "metadata": { 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 2 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython2", 70 | "version": "2.7.6" 71 | }, 72 | "kernelspec": { 73 | "name": "python3", 74 | "language": "python", 75 | "display_name": "Python 3" 76 | }, 77 | "pycharm": { 78 | "stem_cell": { 79 | "cell_type": "raw", 80 | "source": [], 81 | "metadata": { 82 | "collapsed": false 83 | } 84 | } 85 | } 86 | }, 87 | "nbformat": 4, 88 | "nbformat_minor": 0 89 | } -------------------------------------------------------------------------------- /evaluator.py: -------------------------------------------------------------------------------- 1 | class Evaluator: 2 | pass 3 | 4 | 5 | class Drawer: 6 | pass 7 | 8 | 9 | def plot_roadflow(): 10 | # ******载入数据****** 11 | day = 3 12 | prp = PreProcessor() # 数据管理器 13 | dfFlow, dFlow = prp.get_roadflow_by_day(day) # 原始车流数据表,车流量时序数据 14 | # *****绘图示例****** 15 | key = list(dFlow.keys())[0] 16 | seFolw = dFlow[key] 17 | seFolw.plot() 18 | plt.title(f'{day}号交通口{key}车流量时序图') 19 | plt.ylabel('车流量/5min') 20 | plt.xlabel('时间/t') 21 | plt.show() -------------------------------------------------------------------------------- /feature_en/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/feature_en/__init__.py -------------------------------------------------------------------------------- /feature_en/feature_en.py: 
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from pre_process.pre_process import get_adj_map, PreProcessor
4 | import datetime
5 | from sklearn.metrics.pairwise import cosine_similarity
6 | from pre_process.pre_process import get_trainroad_adjoin, get_testroad_adjoin
7 | 
8 | 
9 | class FeatureEn:
10 |     def __init__(self, term='first'):
11 |         self.adj_map = get_adj_map()  # bind the object to the road network
12 |         self.prp = PreProcessor(term)  # data manager
13 | 
14 |     def extract_relevancy(self, roadId, d, dFlow):
15 |         '''Extract flow-table correlations as a training set
16 |         :param roadId: intersection
17 |         :param d: time delay
18 |         :param dFlow: flow table, {roadId: pd.Series}
19 |         :param adjMap:
20 |         :return: X, correlation matrix; each column is a sample (space horizontally, time vertically)
21 |         '''
22 | 
23 |         # spatio-temporal correlation computation
24 |         lAdjNode = self.adj_map[roadId]
25 |         X = np.zeros((d, len(lAdjNode)))  # correlation matrix; open issue: one model is trained per intersection
26 |         return X
27 | 
28 |     def extract_adjoin_by_col(self):
29 |         '''Build train and test sets: the adjacent intersections' per-direction flows in a slot are the features, this intersection's flow 30 minutes later is the target (column-wise iteration, fast)'''
30 |         if self.prp.term:  # final-round case
31 |             # load the dataset
32 |             flow_data = self.prp.load_buffer()  # training set
33 |             road_set = set(flow_data['crossroadID'])
34 |             # build the adjacency table
35 |             road_direction_dct = {}  # used to build the adjacency table
36 |             for road in road_set:
37 |                 road_direction_dct[road] = flow_data['direction'][flow_data['crossroadID'] == road].unique()
38 |             adj_map = {}  # {road: {'adjoin': [('flow', 'roadID', 'direction')], 'self': []}}
39 |             for road in road_set:
40 |                 adjoin_set = set(self.adj_map[road]) & road_set
41 |                 adj_map[road] = set()
42 |                 for adjoin in adjoin_set:
43 |                     adj_map[road].update(('flow', dire, adjoin) for dire in road_direction_dct[adjoin])  # the adjacent checkpoint's per-direction flow columns
44 |             flow_data.set_index(['timestamp', 'crossroadID', 'direction'], inplace=True)
45 |             flow_data = flow_data.unstack().unstack()  # rebuild the column index
46 |             # flow_data.drop(columns=flow_data.columns ^ (flow_data.columns & adj_map.keys()), inplace=True)
47 |             # build the training set
48 |             train_index = flow_data.index < '2019-09-22 07:00:00'
49 |             train_flow = flow_data[train_index]  # split off the training set
50 |             train_x_index = train_flow.index[train_flow.index < '2019-09-21 18:30:00']  # training-set feature index
51 |             train_y_index = (pd.to_datetime(train_x_index) + datetime.timedelta(minutes=30)
52 |                              ).map(lambda x: str(x)) & flow_data.index  # training-set target index: features shifted +30 min, kept only where rows exist
53 |             train_x_index &= (pd.to_datetime(train_y_index) - datetime.timedelta(minutes=30)
54 |                               ).map(lambda x: str(x))  # keep only feature rows whose target row exists
55 |             train_flow_x = train_flow.loc[train_x_index]
56 |             train_flow_y = train_flow.loc[train_y_index]
57 |             # build the test-set index
58 |             test_flow = flow_data[~train_index]
59 |             submit_data = self.prp.get_submit()
60 |             test_index_y = submit_data['timestamp'].unique()
61 |             test_index_x = (pd.to_datetime(test_index_y) - datetime.timedelta(minutes=30)
62 |                             ).map(lambda x: str(x)) & flow_data.index  # test index
63 |             test_flow = test_flow.loc[test_index_x]  # test set
64 |             for road in road_set:
65 |                 adjoin_cols = adj_map[road]
66 |                 if len(adjoin_cols):
67 |                     # use the adjacency table to extract the training set (X, direction, flow)
68 |                     train_df = pd.DataFrame()
69 |                     x_columns = list(i[1:] for i in adjoin_cols)  # new single-level column names, avoids errors
70 |                     for dire in road_direction_dct[road]:  # first grow the df vertically
71 |                         train_df_next = train_flow_x[adjoin_cols]
72 |                         train_df_next.columns = x_columns
73 |                         train_df_next['direction'] = [dire] * len(train_df_next)
74 |                         train_df_next['y'] = train_flow_y[('flow', dire, road)].values
75 |                         train_df = pd.concat((train_df, train_df_next[train_df_next['y'].notna()]), axis=0)
76 |                     train_df = pd.concat((train_df, pd.get_dummies(train_df['direction'])), axis=1)  # then one-hot each direction into its own column
77 |                     # use the adjacency table to extract (X, direction, flow)
78 |                     test_df = pd.DataFrame()
79 |                     for dire in road_direction_dct[road]:  # first grow the df vertically
80 |                         test_df_next = test_flow[adjoin_cols]
81 |                         test_df_next.columns = x_columns
82 |                         test_df_next.index = test_index_y
83 |                         test_df_next['direction'] = [dire] * len(test_df_next)
84 |                         test_df = pd.concat((test_df, test_df_next), axis=0)
85 |                     test_df = pd.concat((test_df, pd.get_dummies(test_df['direction'])), axis=1)  # then one-hot each direction into its own column
86 |                     # drop empty columns
87 |                     for df in (train_df, test_df):
88 |                         na_index = df.isna().sum(axis=0)
89 |                         for col in na_index[na_index == len(df)].index:
90 |                             train_df.drop(columns=col, inplace=True)
91 |                             test_df.drop(columns=col, inplace=True)
92 |                     yield road, train_df, test_df
93 | 
94 |     def similarity_matrix(self):
95 |         '''
96 |         Compute the similarity matrix between checkpoints
97 |         :return: the similarity matrix
98 |         '''
99 |         matrix, index = self.prp.get_roadflow_alltheday()
100 |         cos = cosine_similarity(pd.DataFrame(np.array(matrix), index=index, columns=["1", "2", "3", "4", "5", "6", "7", "8"]))
101 |         return cos, index
102 | 
103 |     def get_train_data(self):
104 |         '''
105 |         Build the training and test sets
106 |         :return: training and test sets
107 |         '''
108 |         global timelist
109 |         train = self.prp.load_train()
110 |         predMapping, mapping = get_testroad_adjoin(self.prp)
111 |         train_mapping = get_trainroad_adjoin(predMapping, mapping)
112 |         # [[neighbour1 [per-direction flows for n days], neighbour2, neighbour3, ...], []]
113 |         timelist = []
114 |         for i in range(1, 22):  # the complete timestamp list
115 |             timelist.extend(pd.date_range(f'2019/09/{i} 07:00', f'2019/09/{i} 18:55', freq='5min').tolist())
116 |         # normalise the timestamps in train
117 |         train["timestamp"] = [pd.to_datetime(i, errors='coerce') for i in train["timestamp"].tolist()]
118 |         train["direction"] = [eval(i) for i in train["direction"].tolist()]
119 |         # assemble the data
120 |         train_x = []
121 |         train_y = []
122 |         for key in train_mapping.keys():
123 |             a = []
124 |             tdf = pd.DataFrame(timelist, columns=["timestamp"])  # timestamp df
125 |             tdf.to_csv("./data/tdf.csv")
126 |             for i in train_mapping[key][:]:  # adjacent checkpoints
127 |                 result_ = get_something(i, train, tdf)
128 |                 if result_:
129 |                     a.append(result_)
130 |             if a:  # does a contain anything?
131 |                 train_x.append(a)  # store in the training set [[[slot1], [slot2]], [], []]
132 |                 train_y.append(get_something(key, train, tdf))  # add key as well
133 |         text_save("x", train_x)  # NOTE: train_y is collected here but only train_x is saved
134 | 
135 |     def get_text_data(self):
136 |         train = self.prp.load_train()
137 |         predMapping, mapping = get_testroad_adjoin(self.prp)
138 |         test_x = []
139 |         # [[neighbour1 [per-direction flows for n days], neighbour2, neighbour3, ...], []]
140 |         timelist = []
141 |         keylst = []
142 |         for i in range(1, 22):  # the complete timestamp list
143 |             timelist.extend(pd.date_range(f'2019/09/{i} 07:00', f'2019/09/{i} 18:55', freq='5min').tolist())
144 |         # normalise the timestamps in train
145 |         train["timestamp"] = [pd.to_datetime(i, errors='coerce') for i in train["timestamp"].tolist()]
146 |         train["direction"] = [eval(i) for i in train["direction"].tolist()]
147 |         for key in predMapping.keys():
148 |             keylst.append(key)
149 |             a = []
150 |             tdf = pd.DataFrame(timelist, columns=["timestamp"])  # timestamp df
151 |             for i in list(predMapping[key])[:]:  # adjacent checkpoints
152 |                 result_ = get_something(i, train, tdf)
153 |                 if result_:
154 |                     a.append(result_)
155 |             print(a)
156 |             if a:  # does a contain anything?
157 |                 test_x.append(a)  # store in the test set [[[slot1], [slot2]], [], []]
158 |         text_save("test", test_x)
159 |         return keylst
160 | 
161 | 
162 | def text_save(flag, data):  # filename: path of the CSV file to write; data: the list to write
163 |     if flag == "x":
164 |         filename = "./data/train_x.csv"
165 |         s = []
166 |         for i in range(len(data)):  # n checkpoints
167 |             for j in range(len(data[i][0])):  # time slots
168 |                 print(len(data[i][0]))
169 |                 a = []
170 |                 for k in range(len(data[i])):  # n neighbours
171 |                     a.append(data[i][k][j])
172 |                 # a = str(a).replace("'", '').replace(',', '')  # strip quotes and commas, newline at line end
173 |                 print(a)
174 |                 s.append(a)
175 |         pd.DataFrame(s).to_csv(filename)
176 |         print("train_x saved")
177 |     elif flag == "y":
178 |         filename = "./data/train_y.txt"
179 |         f = open(filename, "a")
180 |         s = []
181 |         for i in range(len(data)):  # n checkpoints
182 |             for j in range(len(data[i])):  # all time slots
183 |                 s.append(data[i][j])
184 |         f.write(str(s))
185 |         f.close()
186 |         print("train_y saved")
187 |     else:
188 |         filename = "./data/test_x.csv"
189 |         s = []
190 |         for i in range(len(data)):  # n checkpoints
191 |             for j in range(len(data[i][0])):  # time slots
192 |                 print(len(data[i][0]))
193 |                 a = []
194 |                 for k in range(len(data[i])):  # n neighbours
195 |                     a.append(data[i][k][j])
196 |                 # a = str(a).replace("'", '').replace(',', '')  # strip quotes and commas, newline at line end
197 |                 print(a)
198 |                 s.append(a)
199 |         pd.DataFrame(s).to_csv(filename)
200 |         print("test saved")
201 | 
202 | 
203 | def get_something(i, train, tdf):
204 |     """ Fill missing values with the mean and return the data for every time slot
205 |     :param i: a checkpoint
206 |     :param train: the training set
207 |     :param tdf: the timestamp table
208 |     :return: flows for every time slot, result
209 |     """
210 |     b = []
211 |     if train[train["crossroadID"] == i]["direction"].tolist():  # if non-empty, store it in a
212 |         mean = np.array(train[train["crossroadID"] == i]["direction"].tolist()).mean(axis=0)
213 |         mean = [int(round(x)) for x in mean[:]]
214 |         result = pd.merge(tdf, train[train["crossroadID"] == i], on='timestamp', how="left").drop("crossroadID", axis=1)
215 |         for y in result.fillna(str(mean))["direction"].tolist():
216 |             if type(y) is str:
217 |                 y = eval(y)
218 |             b.append(y)
219 |     return b
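220 | 
221 | 
222 | # Minimal sketch of consuming extract_adjoin_by_col (illustration only; it
223 | # assumes ./data/1_flow_data.csv and the final-round submit file exist; see
224 | # also runex.regression_many_x).  Each yielded train_df holds the adjacent
225 | # checkpoints' per-direction flows, a one-hot 'direction' and the target 'y'.
226 | if __name__ == '__main__':
227 |     from sklearn.linear_model import Ridge
228 |     fe = FeatureEn('final')
229 |     for road, train_df, test_df in fe.extract_adjoin_by_col():
230 |         train_df = train_df.dropna(axis=0)
231 |         X, y = train_df.drop(columns=['y', 'direction']), train_df['y']
232 |         model = Ridge(alpha=1.0).fit(X, y)
233 |         print(road, round(model.score(X, y), 3))
234 |         break  # one checkpoint is enough for the demo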
--------------------------------------------------------------------------------
/model/AP.py:
--------------------------------------------------------------------------------
1 | from sklearn.cluster import AffinityPropagation
2 | 
3 | 
4 | class AP:
5 |     def __init__(self, x):
6 |         self.ap = AffinityPropagation()
7 |         self.cluster_centers_indices = None
8 |         self.labels = None
9 |         self.x = x
10 | 
11 |     def fit(self):
12 |         return self.ap.fit(self.x)
13 | 
14 |     def predict(self):
15 |         self.cluster_centers_indices = self.fit().cluster_centers_indices_
16 |         self.labels = self.ap.labels_
17 |         return self.cluster_centers_indices, self.labels
18 | 
19 | 
20 | def ap_predict(x):
21 |     ap = AP(x)
22 |     cluster_centers_indices, labels = ap.predict()
23 |     return cluster_centers_indices, labels
24 | 
--------------------------------------------------------------------------------
/model/ARMA.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | plt.rcParams['font.sans-serif'] = ['Simhei']
3 | plt.rcParams['axes.unicode_minus'] = False
4 | import warnings
5 | warnings.filterwarnings('ignore')
6 | from datetime import timedelta
7 | from statsmodels.tsa.arima_model import ARIMA
8 | import numpy as np
9 | import pandas as pd
10 | from statsmodels.tsa.seasonal import seasonal_decompose
11 | 
12 | 
13 | class ModeDecomp(object):
14 |     def __init__(self, dataSet, test_data, test_size=24):
15 |         data = dataSet.set_index('timestamp')
16 |         data.index = pd.to_datetime(data.index)
17 |         self.dataSet = data
18 |         self.test_size = test_size
19 |         self.train_size = len(self.dataSet)
20 |         self.train = self.dataSet['flow']
21 |         self.train = self._diff_smooth(self.train)
22 |         self.test = test_data['flow']
23 | 
24 |     # smooth the series
25 |     def _diff_smooth(self, dataSet):
26 |         dif = dataSet.diff()  # differenced series
27 |         td = dif.describe()
28 | 
29 |         high = td['75%'] + 1.5 * (td['75%'] - td['25%'])  # upper outlier threshold: 1.5 IQR above Q3
30 |         low = td['25%'] - 1.5 * (td['75%'] - td['25%'])  # lower outlier threshold, likewise
31 | 
32 |         # indices of the points whose change exceeds the thresholds
33 |         forbid_index = dif[(dif > high) | (dif < low)].index
34 |         i = 0
35 |         while i < len(forbid_index) - 1:
36 |             n = 1  # length of the run of outliers; mostly single points
37 |             start = forbid_index[i]  # start index of the outlier run
38 |             while forbid_index[i + n] == start + timedelta(minutes=60*n):
39 |                 n += 1
40 |                 if (i + n) > len(forbid_index) - 1:
41 |                     break
42 |             i += n - 1
43 |             end = forbid_index[i]  # end index of the outlier run
44 |             # fill evenly with values interpolated between the two neighbours
45 |             try:
46 |                 value = np.linspace(dataSet[start - timedelta(minutes=60)], dataSet[end + timedelta(minutes=60)], n)
47 |                 dataSet[start: end] = value
48 |             except:
49 |                 pass
50 |             i += 1
51 |         return dataSet
52 | 
53 |     def decomp(self, freq):
54 |         decomposition = seasonal_decompose(self.train, freq=freq, two_sided=False)
55 |         self.trend = decomposition.trend
56 |         self.seasonal = decomposition.seasonal
57 |         self.residual = decomposition.resid
58 |         # decomposition.plot()
59 |         # plt.show()
60 |         d = self.residual.describe()
61 |         delta = d['75%'] - d['25%']
62 |         self.low_error, self.high_error = (d['25%'] - 1*delta, d['75%'] + 1*delta)
63 | 
64 |     def trend_model(self, order):
65 |         self.trend.dropna(inplace=True)
66 |         self.trend_model_ = ARIMA(self.trend, order).fit(disp=-1, method='css')
67 |         # return self.trend_model_
68 | 
69 |     def predict_new(self):
70 |         """
71 |         Predict new data
72 |         :return:
73 |         """
74 |         n = self.test_size
75 |         self.pred_time_index = pd.date_range(start=self.train.index[-1], periods=n+1, freq='5min')[1:]
76 |         self.trend_pred = self.trend_model_.forecast(n)[0]
77 |         pred_time_index = self.add_season()
78 |         return pred_time_index
79 | 
80 |     def add_season(self):
81 |         '''
82 |         Add the seasonal component and the residual bounds to the predicted trend
83 |         '''
84 |         self.train_season = self.seasonal[:self.train_size]
85 |         values = []
86 |         low_conf_values = []
87 |         high_conf_values = []
88 | 
89 |         for i, t in enumerate(self.pred_time_index):
90 |             trend_part = self.trend_pred[i]
91 |             # mean of the records at the same time of day
92 |             season_part = self.train_season[
93 |                 self.train_season.index.time == t.time()
94 |             ].mean()
95 |             # trend + season + error bounds
96 |             predict = trend_part + season_part
97 |             low_bound = trend_part + season_part + self.low_error
98 |             high_bound = trend_part + season_part + self.high_error
99 | 
100 |             values.append(predict)
101 |             low_conf_values.append(low_bound)
102 |             high_conf_values.append(high_bound)
103 |         self.final_pred = pd.Series(values, index=self.pred_time_index, name='predict')
104 |         self.low_conf = pd.Series(low_conf_values, index=self.pred_time_index, name='low_conf')
105 |         self.high_conf = pd.Series(high_conf_values, index=self.pred_time_index, name='high_conf')
106 |         return self.pred_time_index
107 | 
108 | 
109 | def predict(X):
110 |     dataSet = X[:-144]
111 |     # input(len(dataSet))
112 |     a = 144 * 4
113 |     test_data = np.zeros(a)
114 |     test_data = pd.DataFrame(test_data, columns=['flow'])
115 |     data = pd.DataFrame(dataSet.values, columns=['flow'])
116 |     data['timestamp'] = dataSet.index
117 |     size = 144 * 4
118 |     mode = ModeDecomp(data, test_data, test_size=size)
119 |     mode.decomp(size)
120 |     for lis in [[3, 1, 3], [1, 2, 3], [5, 2, 3], [1, 1, 2], [3, 1, 4], [0, 0, 1]]:
121 |         try:
122 |             mode.trend_model(order=(lis[0], lis[1], lis[2]))
123 |             break
124 |         except:
125 |             continue
126 |     # mode.trend_model(order=(0, 0, 1))
127 |     pred_time_index = mode.predict_new()
128 |     pred = mode.final_pred
129 |     test = mode.test
130 |     # insert_Operateefficient_predict(str(area), str(Date), str(paramster[0]), str(paramster[1]), str(paramster[2]))
131 |     # plt.subplot(211)
132 |     # plt.plot(mode.train)
133 |     # plt.subplot(212)
134 |     # test1 = np.array(test).tolist()
135 |     # test = pd.Series(test1, index=pred_time_index, name='test')
136 |     # pred.plot(color='salmon', label='Predict')
137 |     # test.plot(color='steelblue', label='Original')
138 |     # mode.low_conf.plot(color='grey', label='low')
139 |     # mode.high_conf.plot(color='grey', label='high')
140 |     # plt.legend(loc='right')
141 |     # plt.tight_layout()
142 |     # plt.show()
143 |     # accessMode(test, pred)
144 | 
145 |     return pred
146 | 
147 | 
148 | def create_test_data():
149 |     test_data = pd.read_csv('data/testCrossroadFlow/submit_example.csv')
150 |     for i in range(len(test_data)):
151 |         retail_data = test_data.iloc[[i]]
152 |         date = retail_data['date'][i]
153 |         crossroadID = retail_data['crossroadID'][i]
154 |         timeBegin = retail_data['timeBegin'][i]
155 | 
156 |         open_file = 'data/tmp/pred_{}_{}.csv'.format(date, crossroadID)
157 |         pred_data = pd.read_csv(open_file, header=0, index_col=0)
158 |         if len(timeBegin) == 4:
159 |             search_time = '2019-08-' + str(date) + " 0" + timeBegin + ":00"
160 |         elif len(timeBegin) == 5:
161 |             search_time = '2019-08-' + str(date) + " " + timeBegin + ":00"
162 |         pred_flow = pred_data.loc[pred_data['timestamp'] == search_time]['flow'].values[0]
163 |         test_data.loc[i, 'value'] = pred_flow  # .loc assignment; iloc[[i]]['value'] = ... would write to a copy
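164 | 
165 | 
166 | # Minimal usage sketch for predict() (illustration only, synthetic data; the
167 | # real input is one checkpoint's 5-minute flow Series, see runex.arma_ex).
168 | # predict() drops the last day (144 slots) from X, decomposes the remainder
169 | # and returns a Series of 4 days (576 slots) of 5-minute forecasts.
170 | if __name__ == '__main__':
171 |     idx = pd.date_range('2019-08-01 07:00', periods=144 * 23, freq='5min')
172 |     daily = 20 + 10 * np.sin(np.arange(len(idx)) * 2 * np.pi / 144)  # fake daily cycle
173 |     flow = pd.Series(daily + np.random.rand(len(idx)), index=idx)
174 |     print(predict(flow).head())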
--------------------------------------------------------------------------------
/model/__pycache__/AP.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/model/__pycache__/AP.cpython-37.pyc
--------------------------------------------------------------------------------
/model/__pycache__/ARMA.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/model/__pycache__/ARMA.cpython-37.pyc
--------------------------------------------------------------------------------
/pre_process/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/QGDM2018/utfp/3883eee1e8e09e5b20b085236042108131695dbf/pre_process/__init__.py
--------------------------------------------------------------------------------
/pre_process/pre_process.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import csv
3 | import numpy as np
4 | import datetime
5 | from datetime import timedelta
6 | from multiprocessing.pool import Pool
7 | from multiprocessing import freeze_support
8 | import tqdm
9 | 
10 | "Days 20, 21 and 23: checkpoint 100306 is missing"
11 | # directories: training set, test set, cached data; how should the project root be set?
12 | 
13 | path_dct = {'trainflow': ['./data/first/trainCrossroadFlow/train_trafficFlow_%d',
14 |                           './data/final/train/train_trafficFlow_09-%02d'],
15 |             'testflow': ['./data/first/testCrossroadFlow/test_trafficFlow_%d',
16 |                          './data/final/test_user/test_trafficFlow_09-%02d'],
17 |             'columns': [['crossroadID', 'timestamp', 'flow'], ['crossroadID', 'direction', 'timestamp', 'flow']],
18 |             'day_list': [list(range(3, 24)) + [1], range(1, 26)]}
19 | 
20 | 
21 | def round_minutes(t, n):
22 |     '''Floor a timestamp to its time bucket
23 |     :param t: timestamp
24 |     :param n: bucket width in minutes
25 |     :return:
26 |     '''
27 |     # floor the minutes
28 |     return t - timedelta(minutes=t.minute % n, seconds=t.second)
29 | 
30 | 
31 | class PreProcessor:
32 |     '''Preprocessing: count each intersection's flow per time bucket from the raw data
33 |     dump_buffer: count the checkpoint flows
34 |     '''
35 | 
36 |     def __init__(self, term='first'):
37 |         self.term = 0 if term == 'first' else 1
38 |         self.flow_path_lst = []
39 |         self.flow_data, self.time_flow = None, None  # flow data
40 |         self.train = None  # training data
41 |         self.train_x = None
42 |         self.train_y = None
43 |         self.test_x = None
44 | 
45 |     def load_data(self, i):
46 |         '''Load a csv table from the data folders'''
47 |         encoding = 'utf-8'
48 |         border = 22 if self.term else 20
49 |         if i < border:
50 |             dirpath = path_dct['trainflow'][self.term]
51 |         else:
52 |             dirpath = path_dct['testflow'][self.term]
53 |         columns = ['timestamp', 'crossroadID', 'vehicleID']
54 |         if self.term:
55 |             columns.append('direction')
56 |         return pd.read_csv(dirpath % i + '.csv', encoding=encoding)
57 | 
58 |     def load_buffer(self):
59 |         if self.flow_data is None:
60 |             self.flow_data = pd.read_csv(f'./data/{self.term}_flow_data.csv')  # on error, check the data layout first, then call dump_buffer
61 |         return self.flow_data
62 | 
63 |     def load_train(self):
64 |         if self.train is None:
65 |             self.train = pd.read_csv(f'./data/train.csv', names=['crossroadID', 'timestamp', 'direction'])
66 |         return self.train
67 | 
68 |     def cal_flow(self, i):
69 |         '''Count the flow'''
70 |         flow_dfx = self.load_data(i)
71 |         flow_dfx['timestamp'] = pd.to_datetime(flow_dfx['timestamp'])  # str -> Timestamp
72 |         flow_dfx['timestamp'] = flow_dfx['timestamp'].apply(round_minutes, n=15)  # discretise time (n=15 here, although the docs describe 5-minute buckets)
73 |         flow_lst = []  # [[road, timestamp, flow]]
74 |         if not self.term:
75 |             for road, df in flow_dfx.groupby('crossroadID'):  # group by intersection
76 |                 flow_lst.extend([road, g[0], len(g[1])] for g in df.groupby('timestamp'))
77 |             return flow_lst
78 |         else:
79 |             for keys, df in flow_dfx.groupby(['crossroadID', 'direction']):  # group by intersection and direction
80 |                 flow_lst.extend([*keys, g[0], len(g[1])] for g in df.groupby('timestamp'))
81 |             return flow_lst
82 | 
83 |     def dump_buffer(self, num=4):
84 |         '''Count the checkpoint flows and write them to csv, columns = ['crossroadID', 'timestamp', 'flow']
85 |         :param num: number of processes
86 |         :return:
87 |         '''
88 |         freeze_support()
89 |         pool = Pool(num)
90 |         with open(f'./data/{self.term}_flow_data.csv', 'w', newline='') as f:
91 |             handler = csv.writer(f)
92 |             handler.writerow(path_dct['columns'][self.term])
93 |             for flow_lst in pool.map(self.cal_flow, path_dct['day_list'][self.term]):
94 |                 handler.writerows(flow_lst)
95 |             # for i in path_dct['day_list'][self.term]:
96 |             #     handler.writerows(self.cal_flow(i))
97 | 
98 |     def get_roadflow_alltheday(self):
99 |         '''Per-direction flow totals of every intersection in the training set, used for similarity
100 |         :return: matrix: per-direction flow of each checkpoint
101 |         '''
102 |         flow_data = self.load_buffer()
103 |         matrix, index = [], []
104 |         a = [0, 0, 0, 0, 0, 0, 0, 0]
105 |         for keys, df in flow_data.groupby(['crossroadID', 'direction']):
106 |             if keys[0] in index:
107 |                 pass
108 |             else:  # a new checkpoint appears
109 |                 a = [0, 0, 0, 0, 0, 0, 0, 0]
110 |                 index.append(keys[0])
111 |             a[int(keys[1]) - 1] = np.sum(df["flow"])
112 |             matrix.append(a)
113 |         df = pd.DataFrame({"index": index, "matrix": matrix}).drop_duplicates(subset=['index'], keep='last', inplace=False)
114 |         return df['matrix'].tolist(), df['index'].tolist()
115 | 
116 |     def get_roadflow_by_road(self, roadid):
117 |         '''All flow data of a single intersection
118 |         :param roadid: road Id
119 |         :return:
120 |         '''
121 |         flow_data = self.load_buffer()
122 |         flow_data = flow_data.set_index('timestamp')
123 |         roadflow_df = flow_data[flow_data['crossroadID'] == roadid]
124 |         if self.term:
125 |             for dire, df in roadflow_df.groupby('direction'):
126 |                 yield dire, pd.Series(df['flow'])
127 |         else:
128 |             yield None, pd.Series(roadflow_df['flow'])
129 | 
130 |     def get_submit(self):
131 |         submit = pd.read_csv(f'./data/submit/{self.term}_submit.csv')
132 |         submit['timestamp'] = submit[['timeBegin', 'date']].apply(
133 |             lambda x: f'2019-{x["date"]} {x["timeBegin"].rjust(5, "0")}:00', axis=1)
134 |         return submit
135 | 
136 |     def roadid_nums(self):
137 |         '''Number of recorded roadids on each day'''
138 |         if self.term:
139 |             day_list = list(range(1, 26))
140 |         else:
141 |             day_list = list(range(3, 24))
142 |         data = []
143 |         for d in day_list:
144 |             ids = set(self.load_data(d)['crossroadID'])
145 |             data.append((len(ids), ids))
146 |         return data
147 | 
148 |     def fill_na(self):
149 |         if self.term:
150 |             flow_data = self.load_buffer()
151 |             cur_day = datetime.datetime(2019, 9, 1, 7)
152 |             unit_day = datetime.timedelta(days=1)
153 |             five_minutes = datetime.timedelta(minutes=5)
154 |             thirty_minutes = datetime.timedelta(minutes=30)
155 |             train_ts = []
156 |             for _ in range(21):
157 |                 cur_time = cur_day
158 |                 for _ in range(144):
159 |                     train_ts.append(str(cur_time))
160 |                     cur_time += five_minutes
161 |                 cur_day += unit_day
162 |             test_ts = []
163 |             for _ in range(4):
164 |                 cur_time = cur_day
165 |                 for i in range(1, 73):
166 |                     test_ts.append(str(cur_time))
167 |                     if i % 6 == 0:
168 |                         cur_time += thirty_minutes
169 |                     cur_time += five_minutes
170 |                 cur_day += unit_day
171 |             ts_set_list = set(train_ts), set(test_ts)
172 |             # 32
173 |             flow_data_with_na = pd.DataFrame()
174 |             for road, road_df in tqdm.tqdm(flow_data.groupby('crossroadID')):
175 |                 dire_lst = road_df['direction'].unique()
176 |                 data_list = []
177 |                 for ts_set in ts_set_list:
178 |                     for ts in ts_set ^ (ts_set & set(road_df['timestamp'])):
179 |                         # print(len(ts_set), len(set(road_df['timestamp'])), len(ts_set ^ (ts_set & set(road_df['timestamp']))))
180 |                         # return ts_set, set(road_df['timestamp'])
181 |                         for dire in dire_lst:
182 |                             data_list.append([road, dire, ts, 0])
183 |                         # flow_data.loc[cur_index] = [road, dire, ts, 0]
184 |                 flow_data_with_na = pd.concat(
185 |                     (flow_data_with_na, pd.DataFrame(data_list, columns=['crossroadID', 'direction', 'timestamp', 'flow']))
186 |                     , axis=0, ignore_index=True)
187 |             flow_data_with_na = pd.concat((flow_data_with_na, flow_data), axis=0, ignore_index=True)
188 |             flow_data_with_na.to_csv("./data/flow_data_with_na.csv", index=False)  # 3783710; 4814894
189 |             return flow_data_with_na
190 |             # b = a[a.crossroadID == 100002]
191 |             # b = b[b.direction == 1].timestamp.values
192 |             # b.sort()
193 |             # print(b[-200:])
194 | 
195 |     # what the training set looks like
196 |     def get_train_data(self):
197 |         flow_data = self.load_buffer()
198 |         # ['crossroadID', 'timestamp', [eight directions]]
199 |         flow_list = []
200 |         for keys, df in flow_data.groupby(['crossroadID', 'timestamp']):  # group by checkpoint and time
201 |             a = [0, 0, 0, 0, 0, 0, 0, 0]
202 |             for index, row in df.iterrows():
203 |                 a[row[1] - 1] = row[-1]
204 |             flow_list.append([*keys, a])
205 |         pd.DataFrame(flow_list).to_csv("data/train.csv", encoding="utf-8")
206 | 
207 |     def load_traindata(self):
208 |         self.train_x = pd.read_csv("./data/train_x.csv")
209 |         self.train_y = open("./data/train_y.txt").read()
210 |         self.test_x = pd.read_csv("./data/test_x.csv")
211 |         self.train_y = eval(self.train_y)  # a list
212 |         self.train_x = self.changetype(self.train_x)
213 |         self.test_x = self.changetype(self.test_x)
214 |         return self.train_x, self.train_y, self.test_x
215 | 
216 |     def changetype(self, data):
217 |         """
218 |         Convert the dataframe cells from str to list and fill in the missing values
219 |         :param data:
220 |         :return:
221 |         """
222 |         total = []
223 |         for row in data.fillna(str([0, 0, 0, 0, 0, 0, 0, 0])).iterrows():
224 |             a = []
225 |             for i in range(1, len(row[1])):
226 |                 a.append(eval(row[1][i]))
227 |             total.append(a)
228 |         return pd.DataFrame(total)
229 | 
230 | 
231 | # get the adjacent checkpoints of the test checkpoints
232 | def get_testroad_adjoin(prp):
233 |     # adjacency table, not deduplicated
234 |     mapping = get_adj_map()
235 |     # take the first adjacent node
236 |     sPredRoad = set(prp.get_submit()['crossroadID'])  # intersections to predict
237 |     predMapping = {}
238 |     # available: adjacent checkpoints that appear only in the training set
239 |     available = set(prp.load_buffer()['crossroadID']) ^ (sPredRoad & set(prp.load_buffer()['crossroadID']))
240 |     for r in sPredRoad:  # every checkpoint to predict
241 |         vs = mapping.get(r)  # its adjacent checkpoints from the adjacency table
242 |         if vs is not None:
243 |             adj_set = set(vs)
244 |             bind = adj_set & available
245 |             if bind:
246 |                 predMapping[r] = bind.pop()  # keep a checkpoint that occurred in the training set
247 |     rest = sPredRoad ^ predMapping.keys()  # checkpoints absent from the training set: pick a neighbour arbitrarily
248 |     for r in rest:  # handle empty values
249 |         predMapping[r] = None
250 |     return predMapping
251 |     # return rest, predMapping  (unreachable below this point: kept from an earlier version)
252 |     # training-set data
253 |     length = len(rest)
254 |     while True:
255 |         for r in rest:
256 |             adi = set(mapping.get(r, [])) & predMapping.keys()
257 |             if adi:
258 |                 predMapping[r] = predMapping[adi.pop()]  # temporarily reuse a to-be-predicted intersection's data
259 |         rest = sPredRoad ^ predMapping.keys()
260 |         if length == len(rest):
261 |             break  # stop if nothing changed
262 |         length = len(rest)
263 |     candi = list(predMapping.values())[0]
264 |     for roadid in rest:
265 |         predMapping[roadid] = candi
266 | 
267 |     return predMapping
268 | 
269 | 
270 | def get_trainroad_adjoin(premap, map):
271 |     train_id = pd.read_csv(f'./data/train.csv', names=['crossroadID', 'timestamp', 'direction'])["crossroadID"].tolist()
272 |     train_map = {}
273 |     for key in map.keys():
274 |         if key in train_id:
275 |             train_map[key] = list(set(map[key]))
276 |     train_mapping = train_map.copy()
277 |     for key in train_map.keys():
278 |         if [x for x in train_map[key] if x in list(premap.keys())]:  # an adjacent node itself needs predicting
279 |             try:
280 |                 train_mapping.pop(key)
281 |             except IndexError as e:
282 |                 continue
283 |     return train_mapping  # the training-set checkpoints
284 | 
285 | 
286 | def get_adj_map():
287 |     adj_map = {}
288 |     net_df = pd.read_csv('data/first/trainCrossroadFlow/roadnet.csv')
289 |     for h, t in net_df.values:
290 |         if h in adj_map:
291 |             adj_map[h].add(t)
292 |         else:
293 |             adj_map[h] = {t}
294 |         if t in adj_map:
295 |             adj_map[t].add(h)
296 |         else:
297 |             adj_map[t] = {h}
298 |     return adj_map
299 | 
300 | 
301 | def get_testroad_adjoin_lr(prp):
302 |     # adjacency table, not deduplicated
303 |     mapping = get_adj_map()
304 |     # take the first adjacent node
305 |     sPredRoad = set(prp.get_submit()['crossroadID'])  # intersections to predict
306 |     predMapping = {}
307 |     # available: adjacent checkpoints that appear only in the training set
308 |     available = {100097, 100354, 100355, 100227, 100359, 100360, 100105, 100237, 100117, 100118, 100375, 100377, 100378, 100252,
309 |                  100381, 100382, 100388, 100389, 100134, 100007, 100264, 100137, 100145, 100222, 100152, 100153, 100283, 100284,
310 |                  100157, 100030, 100031, 100158, 100160, 100161, 100291, 100036, 100295, 100045, 100303, 100176, 100306, 100051,
311 |                  100052, 100181, 100056, 100057, 100058, 100319, 100578, 100452, 100453, 100326, 100327, 100331, 100332, 100077,
312 |                  100205, 100208, 100209, 100211, 100213, 100472, 100094}
313 |     for r in sPredRoad:  # every checkpoint to predict
314 |         queue = list(mapping.get(r, [])).copy()
315 |         seen = set(queue)  # guard against cycles in the road network
316 |         while queue:  # breadth-first search through the adjacency table
317 |             cur = queue.pop(0)
318 |             if cur in available:
319 |                 predMapping[r] = cur  # keep a checkpoint that occurred in the training set
320 |                 break
321 |             for v in mapping.get(cur, []):
322 |                 if v not in seen:
323 |                     seen.add(v)
324 |                     queue.append(v)
325 |     rest = sPredRoad ^ predMapping.keys()  # checkpoints with no reachable training neighbour
326 |     for r in rest:  # handle empty values
327 |         predMapping[r] = None
328 |     return predMapping
--------------------------------------------------------------------------------
/runex.py:
--------------------------------------------------------------------------------
1 | from pre_process.pre_process import PreProcessor, get_testroad_adjoin, pd, get_testroad_adjoin_lr
2 | import matplotlib.pyplot as plt
3 | from model.ARMA import predict
4 | from feature_en.feature_en import FeatureEn
5 | import tqdm
6 | from model.AP import ap_predict
7 | 
8 | plt.rcParams['font.sans-serif'] = ['Simhei']
9 | plt.rcParams['axes.unicode_minus'] = False
10 | 
11 | 
12 | def arma_ex(term='first'):
13 |     prp = PreProcessor(term)  # data manager
14 |     preMapping = get_testroad_adjoin(prp)
15 |     submit_df = prp.get_submit()
16 |     dire_dct = dict([road, list(set(df['direction']))] for (road, df) in submit_df.groupby('crossroadID'))
17 |     submit_index, day_list = [], range(22, 26)  # index
18 |     for day in day_list:
19 |         submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
20 |     predict_df = pd.DataFrame()
21 |     for pre_id in tqdm.tqdm(preMapping.keys()):
22 |         instand_id = preMapping[pre_id]
23 |         dire_list = dire_dct[pre_id].copy()
24 |         for dire, roadflow in prp.get_roadflow_by_road(instand_id):
25 |             if not dire_list:
26 |                 break  # if empty, stop taking more
27 |             try:
28 |                 pred_pre = predict(roadflow)
29 |             except Exception as e:
30 |                 print(instand_id, '\t', e)
31 |                 pred_pre = pd.Series([0.1] * (144 * 4), index=submit_index)
32 |             # pred_pre = pred_pre.dropna(axis=0, how="any")
33 |             pred_pre.fillna(pred_pre.mean(), inplace=True)
34 |             for i in range(len(pred_pre)):
35 |                 pred_pre.iloc[[i]] = int(pred_pre.iloc[[i]])
36 |             pred = pd.DataFrame(pred_pre.values, columns=['value'])
37 |             pred['timestamp'] = submit_index
38 |             pred['date'] = pred['timestamp'].apply(lambda x: x.strftime('%d'))
39 |             pred['timeBegin'] = pred['timestamp'].apply(lambda x: f'{x.hour}:{x.strftime("%M")}')
40 |             pred['crossroadID'] = pre_id
41 |             pred['min_time'] = pred['timestamp'].apply(lambda x: int(x.strftime('%M')))
42 |             pred = pred[pred['min_time'] >= 30]
43 |             pred.drop(['timestamp'], axis=1, inplace=True)
44 |             order = ['date', 'crossroadID', 'timeBegin', 'value']
45 |             pred = pred[order]
46 |             if prp.term:
47 |                 pred['direction'] = dire_list.pop()
48 |             predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
49 |         while dire_list:  # in case the directions ran short
50 |             pred['direction'] = dire_list.pop()
51 |             predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
52 |     submit_time_set = set(submit_df['timeBegin'])
53 |     predict_df.set_index('timeBegin', inplace=True)
54 |     return predict_df.loc[list(set(predict_df.index) & submit_time_set)].reset_index()[
55 |         ['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
56 | 
57 | 
58 | def ap(f):
59 |     """
60 |     Clustering
61 |     :param f: FeatureEn
62 |     :return: the cluster label of every training checkpoint (dict),
63 |     :center_id: cluster-centre info
64 |     """
65 |     cos, index = f.similarity_matrix()
66 |     cluster_centers_indices, labels = ap_predict(cos)

def regression_ex(term='final'):
    keylst = [
        100115, 100245, 100246, 100374, 100003, 100004, 100020, 100285, 100159, 100287, 100288, 100164, 100300, 100179,
        100053, 100183, 100315, 100061, 100193, 100066, 100457, 100343, 100217, 100434, 100249, 100316, 100329, 100019,
        100340, 100041, 100069
    ]
    keylst = [val for val in keylst for i in range(3024)]  # 3024 copies of each ID; currently unused here
    from sklearn.linear_model import LinearRegression
    prp = PreProcessor(term)  # data manager
    train_x, train_y, test_x = prp.load_traindata()
    # fit on the first feature column only, and predict on that same column
    lr = LinearRegression()
    lr.fit(train_x.iloc[:, 0:1].values, train_y)
    test_y = lr.predict(test_x.iloc[:, 0:1].values)  # must match the single training feature
    return test_y


def regression_ex_vvlj(term='first'):
    # NOTE: this pipeline is identical to arma_ex above
    prp = PreProcessor(term)  # data manager
    preMapping = get_testroad_adjoin(prp)
    submit_df = prp.get_submit()
    dire_dct = {road: list(set(df['direction'])) for road, df in submit_df.groupby('crossroadID')}
    submit_index, day_list = [], range(22, 26)  # submission timestamps
    for day in day_list:
        submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    predict_df = pd.DataFrame()
    for pre_id in tqdm.tqdm(preMapping.keys()):
        instand_id = preMapping[pre_id]  # trained stand-in crossroad
        dire_list = dire_dct[pre_id].copy()
        for dire, roadflow in prp.get_roadflow_by_road(instand_id):
            if not dire_list:
                break  # all required directions are filled
            try:
                pred_pre = predict(roadflow)
            except Exception as e:
                print(instand_id, '\t', e)
                pred_pre = pd.Series([0.1] * (144 * 4), index=submit_index)  # placeholder flow
            pred_pre = pred_pre.fillna(pred_pre.mean()).astype(int)
            pred = pd.DataFrame(pred_pre.values, columns=['value'])
            pred['timestamp'] = submit_index
            pred['date'] = pred['timestamp'].apply(lambda x: x.strftime('%d'))
            pred['timeBegin'] = pred['timestamp'].apply(lambda x: f'{x.hour}:{x.strftime("%M")}')
            pred['crossroadID'] = pre_id
            pred['min_time'] = pred['timestamp'].apply(lambda x: int(x.strftime('%M')))
            pred = pred[pred['min_time'] >= 30]  # only the second half of each hour is submitted
            pred = pred.drop(['timestamp'], axis=1)
            pred = pred[['date', 'crossroadID', 'timeBegin', 'value']]
            if prp.term:
                pred['direction'] = dire_list.pop()
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
        while dire_list:  # fewer trained directions than required: reuse the last frame
            pred['direction'] = dire_list.pop()
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
    submit_time_set = set(submit_df['timeBegin'])
    predict_df.set_index('timeBegin', inplace=True)
    return predict_df.loc[list(set(predict_df.index) & submit_time_set)].reset_index()[
        ['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
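
# --- Illustrative sketch (hedged; synthetic data only) --------------------------
# The regressions in this file share one idea: predict a crossroad's flow from
# the flows of its adjacent crossroads at the same timestamp. This self-contained
# sketch reproduces that shape on made-up numbers; the function name and all
# values are invented, and nothing here touches the competition data.
def adjoin_regression_sketch():
    import numpy as np
    from sklearn.linear_model import LinearRegression
    rng = np.random.default_rng(0)
    neighbour_flows = rng.poisson(30, size=(500, 3)).astype(float)  # flows of 3 hypothetical neighbours
    target_flow = neighbour_flows.mean(axis=1) + rng.normal(0, 2, 500)  # crossroad to predict
    lr = LinearRegression().fit(neighbour_flows, target_flow)
    return lr.score(neighbour_flows, target_flow)  # in-sample R^2, close to 1 by construction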

def regression_many_x(term='final'):
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    prp = PreProcessor(term)
    pred_map = get_testroad_adjoin_lr(prp)
    submit_df = prp.get_submit()
    submit_df.set_index(['timestamp', 'direction', 'crossroadID'], inplace=True)
    fe = FeatureEn(term)
    r2_rst = {}
    predict_dct = {}
    for road, train_data, test_data in fe.extract_adjoin_by_col():
        train_data = train_data.dropna(axis=0)
        test_data = test_data.dropna(axis=0)
        X, y = train_data.drop(columns=['y', 'direction']), train_data['y']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        r2_rst[road] = r2_score(y_test, lr.predict(X_test))  # held-out R^2 per crossroad
        test_data['flow'] = lr.predict(test_data.drop(columns='direction'))
        # formatting
        test_data = test_data.reset_index()
        test_data['crossroadID'] = road
        predict_dct[road] = test_data
    for submit_road, train_road in pred_map.items():
        if train_road is not None and train_road in predict_dct:
            test_data = predict_dct[train_road].set_index(['index', 'direction'])
            s_index = submit_df.index.intersection(set(i + (submit_road,) for i in test_data.index))
            test_index = [i[:2] for i in s_index]
            submit_df.loc[s_index, 'value'] = test_data.loc[test_index, 'flow'].values
    # rows still holding the 0.1 placeholder get the mean flow of their timestamp
    for index in submit_df[submit_df['value'] == 0.1].index:
        submit_df.loc[index, 'value'] = submit_df.loc[index[0], 'value'].mean()
    submit_df = submit_df.reset_index()[['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
    submit_df['value'] = submit_df['value'].astype(int)
    submit_df.to_csv('./data/lr_bfs.csv', index=False)
    return submit_df


def timestamp_fmt(roadflow):
    """Add submit-style date/timeBegin columns and keep only the second half of each hour."""
    roadflow['date'] = roadflow['timestamp'].apply(lambda x: '09-' + x.strftime('%d'))
    roadflow['timeBegin'] = roadflow['timestamp'].apply(lambda x: f'{x.hour}:{x.strftime("%M")}')
    roadflow['min_time'] = roadflow['timestamp'].apply(lambda x: int(x.strftime('%M')))
    pred = roadflow[roadflow['min_time'] >= 30].copy()  # avoid chained-assignment warnings
    return pred.drop(['timestamp'], axis=1)
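
# --- Illustrative sketch (hedged; toy input) ------------------------------------
# What timestamp_fmt produces: only rows in the second half of each hour survive
# (min_time >= 30), with submit-style date and timeBegin columns. The two rows
# below are invented for illustration.
def timestamp_fmt_demo():
    toy = pd.DataFrame({'value': [3, 7],
                        'timestamp': pd.to_datetime(['2019-09-22 07:25:00',
                                                     '2019-09-22 07:35:00'])})
    return timestamp_fmt(toy)  # one row left: date='09-22', timeBegin='7:35', min_time=35, value=7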

def result_fmt(term='first'):
    # map each crossroad to predict onto a trained neighbour
    prp = PreProcessor(term)  # data manager
    preMapping = get_testroad_adjoin(prp)
    submit_df = prp.get_submit()
    dire_dct = {road: list(set(df['direction'])) for road, df in submit_df.groupby('crossroadID')}
    submit_index, day_list = [], range(22, 26)  # submission timestamps
    for day in day_list:
        submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    train_index, day_list = [], range(15, 19)  # training timestamps reused as a naive baseline
    for day in day_list:
        train_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    train_index = [str(i) for i in train_index]
    train_index_set = set(train_index)
    predict_df = pd.DataFrame(columns=['date', 'crossroadID', 'direction', 'timeBegin', 'value'])
    for pre_id in tqdm.tqdm(preMapping.keys()):
        instand_id = preMapping[pre_id]
        dire_list = dire_dct[pre_id].copy()
        if instand_id is None:
            # no trained neighbour: fall back to the mean of the predictions so far
            roadflow = pd.DataFrame([predict_df['value'].mean()] * (144 * 4), columns=['value'])
            roadflow['timestamp'] = submit_index
            roadflow['crossroadID'] = pre_id
            pred = timestamp_fmt(roadflow)
        else:
            for dire, roadflow in prp.get_roadflow_by_road(instand_id):
                if not dire_list:
                    break  # all required directions are filled
                roadflow = roadflow.loc[list(set(roadflow.index) & train_index_set)]
                for ts in set(roadflow.index) ^ train_index_set:
                    roadflow[ts] = roadflow.mean()  # pad missing timestamps with the mean flow
                roadflow = pd.DataFrame(roadflow.values, columns=['value'])
                roadflow['timestamp'] = submit_index
                roadflow['crossroadID'] = pre_id
                pred = timestamp_fmt(roadflow)
                if prp.term:
                    pred['direction'] = dire_list.pop()
                predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
        while dire_list:  # fewer trained directions than required: reuse the last frame
            pred['direction'] = dire_list.pop()
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
    submit_time_set = set(submit_df['timeBegin'])
    predict_df.set_index('timeBegin', inplace=True)
    predict_df['value'].fillna(predict_df['value'].mean(), inplace=True)
    df = predict_df.loc[list(set(predict_df.index) & submit_time_set)].reset_index()[
        ['date', 'crossroadID', 'direction', 'timeBegin', 'value']]
    df.to_csv('./data/random.csv', index=False)
    return df


def arma_base(term='first'):
    prp = PreProcessor(term)  # data manager
    submit_df = prp.get_submit()
    submit_index, day_list = [], range(22, 26)  # submission timestamps
    for day in day_list:
        submit_index.extend(pd.date_range(start=f'2019-09-{day} 07:00:00', periods=144, freq='5min'))
    predict_df = pd.DataFrame()
    for instand_id in prp.load_buffer()['crossroadID'].unique():
        for dire, roadflow in prp.get_roadflow_by_road(instand_id):
            try:
                pred_pre = predict(roadflow)
            except Exception as e:
                print(instand_id, '\t', e)
                pred_pre = pd.Series([0.1] * (144 * 4), index=submit_index)  # placeholder flow
            pred_pre = pred_pre.fillna(pred_pre.mean()).astype(int)
            pred = pd.DataFrame(pred_pre.values, columns=['value'])
            pred['timestamp'] = submit_index
            pred['crossroadID'] = instand_id
            predict_df = pd.concat((predict_df, pred), axis=0, ignore_index=True)
    predict_df.to_csv('./data/aram_base.csv', index=False)
    return predict_df


if __name__ == '__main__':
    term = 'final'  # preliminary round: 'first'; final round: 'final'
    prp = PreProcessor(term)
    prp.dump_buffer(2)  # build the 5-minute flow buffer with 2 worker processes
    prp.fill_na()  # fill missing values
    # arma_base(term)       # time-series (ARMA) model
    # result_fmt(term)      # random/mean baseline
    # regression_many_x()   # regression over adjacent flows
    # regression_ex(term)   # regression model
--------------------------------------------------------------------------------