├── README.md
└── Reduce_fastload.py

/README.md:
--------------------------------------------------------------------------------
# optimization-of-pandas-for-large-CSV
>Pandas is a widely used Python library for data manipulation and analysis. Importing data (e.g. with `pd.read_csv`) is an almost unavoidable first step of any analysis, but for large CSV files it can consume a lot of memory and loading time, which makes reloading the raw data during an analysis very inefficient.
>Dataquest.io published a tutorial on optimizing the memory footprint of pandas: simple data type conversions alone reduced the memory usage of a baseball dataset by nearly 90%, and the compressed storage formats pandas integrates can further speed up reading the data back in.


## 1. Method
[Blog post (in Chinese)](https://blog.csdn.net/wlx19970505/article/details/102920112)

## 2. Using the class

**Step 1, import**: `from Reduce_fastload import reduce_fastload`
**Step 2, instantiate**: `process = reduce_fastload('your path', use_HDF5=True/False, use_feather=True/False)` (set exactly one of the two flags to `True`)
**Step 3, optimize the memory usage of the raw data**: `process.reduce_data()`
**Step 4, reload the optimized data**: `process_data = process.reload_data()`

A complete sketch of these four steps is shown below.
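A minimal, runnable sketch of the workflow above, assuming `data/train.csv` is a placeholder for your own CSV file and the HDF5 backend is chosen:

```python
from Reduce_fastload import reduce_fastload

# Exactly one of use_HDF5 / use_feather may be True.
process = reduce_fastload('data/train.csv', use_HDF5=True, use_feather=False)

process.reduce_data()                  # downcasts dtypes, writes processed_data.h5
process_data = process.reload_data()   # fast reload from the HDF5 store
print(process_data.dtypes)             # inspect the optimized column types
```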
## Example

![Memory usage and load time before and after optimization](https://img-blog.csdnimg.cn/20191105195459124.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dseDE5OTcwNTA1,size_16,color_FFFFFF,t_70#pic_center)
As shown above, the original CSV occupies 616.95 MB in memory while the optimized version occupies only 173.9 MB, and compared with the 7.7 s that `pd.read_csv` takes on the raw file, reading the preprocessed file back in is considerably faster.


### References
[1] https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
[2] https://zhuanlan.zhihu.com/p/56541628
[3] https://blog.csdn.net/weiyongle1996/article/details/78498603
--------------------------------------------------------------------------------
/Reduce_fastload.py:
--------------------------------------------------------------------------------
# encoding: utf-8
"""
@author: wanglixiang
@contact: lixiangwang9705@gmail.com
"""


import time

import numpy as np
import pandas as pd


class reduce_fastload:
    def __init__(self, data_dir, use_HDF5=False, use_feather=False):
        """
        :param data_dir: path of the CSV file to optimize
        :param use_HDF5: store the optimized DataFrame as HDF5
        :param use_feather: store the optimized DataFrame as feather

        Exactly one of use_HDF5 / use_feather should be True.
        """
        self.data_dir = data_dir
        self.use_HDF5 = use_HDF5
        self.use_feather = use_feather
        self.is_reduce = 0

    def reduce_data(self):
        df = pd.read_csv(self.data_dir, parse_dates=True, keep_date_col=True)

        start_mem = df.memory_usage().sum() / 1024 ** 2
        print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

        # Downcasting adapted from:
        # https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
        for col in df.columns:
            col_type = df[col].dtype

            if col_type != object:
                c_min = df[col].min()
                c_max = df[col].max()
                if str(col_type)[:3] == 'int':
                    # Downcast integers to the narrowest type whose range fits the data.
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
                else:
                    # Downcast floats the same way; note that float16 reduces precision.
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        df[col] = df[col].astype(np.float64)
            else:
                # Object (string) columns are converted to pandas categoricals.
                df[col] = df[col].astype('category')

        end_mem = df.memory_usage().sum() / 1024 ** 2
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

        if self.use_HDF5 and not self.use_feather:
            # Store the optimized DataFrame in an HDFStore under a fixed key.
            data_store = pd.HDFStore('processed_data.h5')
            data_store.put('preprocessed_df', df, format='table')
            data_store.close()
            self.is_reduce = 1
        elif not self.use_HDF5 and self.use_feather:
            df.to_feather('processed_data.feather')
            self.is_reduce = 1
        else:
            print('Please choose exactly one storage format: set use_HDF5 or use_feather to True')

    def reload_data(self):
        if self.is_reduce == 0:
            print('You have not compressed the data yet')
            return None

        if self.use_HDF5 and not self.use_feather:
            time_start = time.time()
            store_data = pd.HDFStore('processed_data.h5')
            # Retrieve the stored DataFrame by its key.
            preprocessed_df = store_data['preprocessed_df']
            print('load time:', time.time() - time_start)
            store_data.close()
        elif not self.use_HDF5 and self.use_feather:
            time_start = time.time()
            preprocessed_df = pd.read_feather('processed_data.feather')
            print('load time:', time.time() - time_start)
        else:
            print('Please choose exactly one storage format: set use_HDF5 or use_feather to True')
            return None

        return preprocessed_df
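
# --- Usage sketch (illustrative, not part of the original module) ---
# 'your_data.csv' below is a hypothetical placeholder for a real CSV path.
if __name__ == '__main__':
    # Pick exactly one backend; feather is shown here, HDF5 works the same way.
    process = reduce_fastload('your_data.csv', use_HDF5=False, use_feather=True)
    process.reduce_data()                  # downcast dtypes, write processed_data.feather
    process_data = process.reload_data()   # fast reload from the compressed file
    print(process_data.head())
--------------------------------------------------------------------------------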