├── 電量迴歸預測與分群解析.pdf
├── README.md
├── 2_preprocess_to_ratio.ipynb
└── 3_feature-matrix-produce.ipynb

/電量迴歸預測與分群解析.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wantinghuang/SKlearn-kmeans-DTW-clustering-user-grouping/HEAD/電量迴歸預測與分群解析.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 2017 Smart Grid Big Data Analytics Competition

A project analyzing the power-usage data provided by Taiwan Power Company (Taipower) for the 2017 Smart Grid Big Data Analytics Competition.

The goal of this project is to extract useful information from the historical power-usage data combined with open data (e.g. weather data).
We use clustering to group users, and will later build machine-learning models (SVR, RNN, LSTM) to predict power usage.
This repository focuses on how to do the clustering; an illustrative sketch of the clustering step is given at the end of this README.

Low-voltage smart meter big data analysis and design competition: based on historical power-usage data, use numerical analysis and related methods to propose research of academic or practical value.

* This repository covers the clustering only; the files for predicting power usage are not included.

## Competition description

Using the one year of 15-minute low-voltage smart meter data provided by the competition, together with weather data and other open data of your choice, apply numerical or algorithmic analysis to study the power-usage behaviour of individual households or of all households, e.g. usage prediction, usage characteristics, comparison between regions, or energy-saving recommendations (not limited to these).

### Smart meter data description

Open data contents
1. Customer type: more than 1,000 low-voltage residential customers
   * Environment information: the region (county/city) the data belongs to and the tariff type (meter-rate lighting, etc.)
   * Real-time usage data: energy consumption (kWh) of the preceding 15 minutes
   * Historical usage data: all 15-minute energy consumption (kWh) of the previous year
   * Pricing information: Taipower's current seasonal rates and the simple residential/commercial time-of-use rate plan
2. Interface and format
   * Research track: one full year of historical usage data per customer, provided in CSV format.


## Authors

National Chiao Tung University: 林于翔、許睿之、楊凱期、黃揚、彭笙榕、蘇恆毅、黃婉婷、陳泓勳

## License

This project is licensed under the MIT License

Copyright (C) <2017>


Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
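
## Illustrative example: clustering the feature matrix

The clustering notebook itself is not part of this repository dump. The sketch below shows one way the pickled feature matrices produced by `3_feature-matrix-produce.ipynb` could be fed to scikit-learn's KMeans. It is a minimal sketch, not the authors' exact procedure: the file name `matrix_data48_Dali.pkl`, the choice of `k = 4`, and the assumption that each row is one user's 48-dimensional profile (24 hourly mean ratios plus 24 hourly slopes) are illustrative assumptions.

```python
# Illustrative only: cluster a pickled feature matrix with scikit-learn KMeans.
# Assumptions: 'matrix_data48_Dali.pkl' exists in the working directory and holds
# a list of per-user rows (24 hourly mean ratios + 24 hourly slopes); k = 4 is arbitrary.
import pickle

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

with open('matrix_data48_Dali.pkl', 'rb') as f:   # produced by 3_feature-matrix-produce.ipynb
    X = np.array(pickle.load(f))                  # expected shape: (n_users, 48)

X = np.nan_to_num(X)                              # guard against users with empty records

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print('cluster sizes:', np.bincount(labels))
print('silhouette score:', silhouette_score(X, labels))
```

In practice one would compare several values of `k` (e.g. by silhouette score or the elbow method) before settling on a grouping.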
--------------------------------------------------------------------------------
/2_preprocess_to_ratio.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import os\n",
    "import seaborn as sns\n",
    "dir_path = os.getcwd()\n",
    "filename_M = os.listdir('C:/Users/lab343/Desktop/LAB/WORK/智慧電表/User_Value/M')\n",
    "filename_T = os.listdir('C:/Users/lab343/Desktop/LAB/WORK/智慧電表/User_Value/T')\n",
    "%matplotlib inline\n",
    "\n",
    "per_date = pd.date_range('2016-01-01 00:15:00','2017-01-01 00:00:00',freq='15Min')\n",
    "per_month = ['2016-01','2016-02','2016-03','2016-04','2016-05','2016-06','2016-07','2016-08','2016-09','2016-10','2016-11',\n",
    "             '2016-12']\n",
    "per_hour = pd.date_range('2016-01-01 01:00:00','2017-01-01 00:00:00',freq='1H')\n",
    "per_week = pd.date_range('2016-01-01 00:15:00','2017-01-01 00:15:00',freq='7D')\n",
    "per_day = pd.date_range('2016-01-01','2016-12-31',freq='1D')\n",
    "\n",
    "week = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']\n",
    "\n",
    "\n",
    "# Taipower contract (tariff) codes and customer-type codes\n",
    "contract = {'1':'低壓表燈',\n",
    "            '4':'低壓綜合用電',\n",
    "            '7':'普通低壓電力',\n",
    "            'C':'低壓需量綜合',\n",
    "            'D':'低壓需量電力',\n",
    "            'F':'低壓表燈時間電價'}\n",
    "user = {'0':'公用路燈或自來水',\n",
    "        '1':'軍眷',\n",
    "        '5':'非營業用',\n",
    "        '6':'營業用'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "full_file = pd.read_csv('full_user.csv')\n",
    "full_file = full_file.filename.values.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def pre_ratio(userfile):\n",
    "    # A user's hourly file lives either under hour/M or hour/T\n",
    "    try:\n",
    "        test = pd.read_csv('C:/Users/lab343/Desktop/LAB/WORK/智慧電表/hour/M/'+userfile,index_col=0)\n",
    "    except FileNotFoundError:\n",
    "        test = pd.read_csv('C:/Users/lab343/Desktop/LAB/WORK/智慧電表/hour/T/'+userfile,index_col=0)\n",
    "    test.index = pd.to_datetime(test.index)\n",
    "    print(userfile, 'has missing values:', test.isnull().values.any())\n",
    "    # Daily total consumption (2016 is a leap year, hence 366 days of 24 hourly rows)\n",
    "    day_sum = []\n",
    "    j = 0\n",
    "    for i in range(366):\n",
    "        day_sum.append(test.iloc[j:j+24,1].sum())\n",
    "        j += 24\n",
    "    for i in per_day.astype(str):\n",
    "        test.loc[i,'day_sum'] = day_sum[per_day.astype(str).tolist().index(i)]\n",
    "    \n",
    "    test.day_sum = test.day_sum.shift(1)\n",
    "    test.iloc[0,4] = test.iloc[1,4]\n",
    "    # Ratio of each hour's consumption to the daily total\n",
    "    test['ratio_value'] = test.Value/test.day_sum\n",
    "    test.drop(['CustomerID','Value','day_sum','Week','holiday'],axis=1,inplace=True)\n",
    "    return test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for i in full_file:\n",
    "    data = pre_ratio(i)\n",
    "    data.to_csv('C:/Users/lab343/Desktop/LAB/WORK/智慧電表/ratiodata/'+i)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
/3_feature-matrix-produce.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Process the data into the features used for clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The results are saved as pkl files, so this notebook only needs to be run once"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "from datetime import datetime,timedelta,date,time\n",
    "import glob\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "column_names = ['DateTime', 'ratio_value']\n",
    "area = 'data/ratiodata/'\n",
    "dir_path = os.getcwd()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# note: newer pandas versions use sheet_name instead of sheetname\n",
    "region = pd.read_excel(('region_full.xlsx'), sheetname=None, header=0, skiprows=None)\n",
    "region_list = ['Yonghe','Xinyi', 'Penghu', 'Dali', 'Banciao','Quchi']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## users with zero values for the whole year (these caused the \"Mean of empty slice\" warning)\n",
    "# Yonghe\n",
    "r1 = ['T1411','T0457','T0031','T2834','T0639','T0473','T2915','T1298','T0441','T0603','T0474','T7018',\n",
    "      'T1267','T1264','T3265','T2879','T0980','T2914','T3072','T3522']\n",
    "# Xinyi\n",
    "r2 = ['M00686']\n",
    "# Banciao\n",
    "r3 = ['T4166','T6540','T4648','T4542','T6873','T5003','T4167','T5851']\n",
    "remove_list = [r1,r2,None,None,r3,None]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Read every user's ratio file into dt, keyed by the file name without the .csv extension\n",
    "dt = {}\n",
    "for dpath in glob.glob(os.path.join(dir_path,area)+'*.csv'):\n",
    "    d_name = os.path.splitext(os.path.basename(dpath))[0]\n",
    "    #print(d_name)\n",
    "    dt[d_name] = pd.read_csv(dpath)\n",
    "    dt[d_name].fillna(0,inplace = True)\n",
    "    dt[d_name].columns = column_names\n",
    "    dt[d_name]['DateTime'] = pd.to_datetime(dt[d_name]['DateTime'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Given a DataFrame, return all records that fall on weekdays (Mon-Fri)\n",
    "def get_weekday_DT(dataT):\n",
    "    return dataT[[0<=dtime.weekday()<=4 for dtime in dataT['DateTime']]]\n",
    "#get_weekday_DT(dt['M00001'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Keep only the weekday records for every user\n",
    "summer_dt = {}\n",
    "for k in dt.keys():\n",
    "    summer_dt[k] = get_weekday_DT(dt[k])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# all_times holds the 24 hourly datetime.time points: [01:00, 02:00, ... , 23:00, 00:00]\n",
    "all_times = [time(hour=1, minute=0, second=0)]\n",
    "while True:\n",
    "    next_time = datetime.combine(date.today(), all_times[-1]) + timedelta(hours=1)\n",
    "    if next_time.time() == all_times[0]:\n",
    "        break\n",
    "    all_times.append(next_time.time())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Collect every user's usage records by time of day.\n",
    "# summer_tdt is a dict, key: user name\n",
    "#   value: dict of that user's usage per time of day\n",
    "#     key: time of day, e.g. 01:00\n",
    "#     value: list of all of that user's ratio values at that time, e.g. [0.2, 0.3, 0.4]\n",
    "# diff is a dict, key: user name\n",
    "#   value: dict of that user's usage slope per time of day\n",
    "#     key: time of day, e.g. 01:00\n",
    "#     value: list of the slopes between that hour and the next hour\n",
    "\n",
    "\n",
    "summer_tdt = {}\n",
    "diff = {}\n",
    "for k in summer_dt.keys():\n",
    "    #print(k)\n",
    "    summer_tdt[k] = {}\n",
    "    diff[k] = {}\n",
    "    # start every time of day with an empty list for this user\n",
    "    for t in all_times:\n",
    "        summer_tdt[k][t] = []\n",
    "        diff[k][t]=[]\n",
    "    for i in range(len(summer_dt[k])):\n",
    "        if np.isnan(summer_dt[k]['ratio_value'].iloc[i]):\n",
    "            continue\n",
    "        summer_tdt[k][summer_dt[k]['DateTime'].iloc[i].time()].append(summer_dt[k]['ratio_value'].iloc[i])\n",
    "        if i < len(summer_dt[k])-1:\n",
    "            diff[k][summer_dt[k]['DateTime'].iloc[i].time()].append(summer_dt[k]['ratio_value'].iloc[i+1]-summer_dt[k]['ratio_value'].iloc[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Pickle summer_tdt; use a separate file handle so the dict itself is not overwritten\n",
    "A = summer_tdt\n",
    "with open('summer_tdt.pkl','wb') as tdt_file:\n",
    "    pickle.dump(A,tdt_file)\n",
    "print('done! finished storing summer_tdt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Average every user's usage at each time of day.\n",
    "# summer_avgdt is a dict, key: time of day\n",
    "#   value: list with each user's mean ratio value at that time\n",
    "# diff_avgdt is a dict, key: time of day\n",
    "#   value: list with each user's mean slope between that hour and the next\n",
    "\n",
    "summer_avgdt = {}\n",
    "diff_avgdt = {}\n",
    "for t in all_times:\n",
    "    summer_avgdt[t] = []\n",
    "    diff_avgdt[t] = []\n",
    "    for k in summer_dt.keys():\n",
    "        summer_avgdt[t].append(np.nanmean(summer_tdt[k][t]))\n",
    "        diff_avgdt[t].append(np.nanmean(diff[k][t]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# (unused helper) merge two dicts; dict.items() views cannot be concatenated with + in Python 3,\n",
    "# so convert them to lists first.  Note that summer_avgdt and diff_avgdt share the same\n",
    "# time-of-day keys, so merging them directly would overwrite the averages with the slopes.\n",
    "def union2(dict1, dict2):\n",
    "    return dict(list(dict1.items()) + list(dict2.items()))\n",
    "dicts=[summer_avgdt,diff_avgdt]\n",
    "dict(i for dct in dicts for i in dct.items())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save the feature matrices as pkl files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### feature_matrix48.pkl: per-user hourly usage means plus slopes, 48 features in total"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_io.BufferedWriter name='matrix_data_Dali.pkl'>\n",
      "done! finsh storing feature matirx\n"
     ]
    }
   ],
   "source": [
    "heka = 'Dali'  # region label used in the output file name (assumed: set this to the region being processed)\n",
    "x1 = [summer_avgdt[k] for k in summer_avgdt.keys()]  # 24 hourly means, each a list over users\n",
    "x2 = [diff_avgdt[k] for k in diff_avgdt.keys()]      # 24 hourly slopes, each a list over users\n",
    "x = x1 + x2\n",
    "A = [list(i) for i in zip(*x)]                       # transpose: one row of 48 features per user\n",
    "feature_matrix48 = open('matrix_data48_'+heka+'.pkl','wb')\n",
    "pickle.dump(A,feature_matrix48)\n",
    "print(feature_matrix48)\n",
    "feature_matrix48.close()\n",
    "print('done! finished storing feature matrix48')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### feature_matrix24.pkl: per-user hourly usage means only, 24 features in total"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "B = [list(i) for i in zip(*x1)]                      # transpose: one row of 24 features per user\n",
    "feature_matrix24 = open('matrix_data24_'+heka+'.pkl','wb')\n",
    "pickle.dump(B,feature_matrix24)\n",
    "print(feature_matrix24)\n",
    "feature_matrix24.close()\n",
    "print('done! finished storing feature matrix24')\n",
    "#B = pickle.load(open('matrix_data_'+heka+'.pkl','rb'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
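
Note: the repository name mentions a DTW (dynamic time warping) based k-means variant, but no DTW code appears in the files above. The sketch below is a generic, illustrative DTW distance between two 24-point daily load profiles; the function and the example profiles are assumptions for illustration, not code from this project.

```python
# Illustrative DTW distance between two 24-point daily load profiles.
# This is a textbook dynamic-programming implementation, not code from this repository.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Example: two hypothetical users whose daily profiles are shifted by one hour.
user_a = np.sin(np.linspace(0, 2 * np.pi, 24)) + 1
user_b = np.roll(user_a, 1)
print(dtw_distance(user_a, user_b))   # small: DTW tolerates the time shift
print(np.abs(user_a - user_b).sum())  # larger: point-wise distance does not
```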