├── README.md
├── license
└── quantclean.py
/README.md:
--------------------------------------------------------------------------------
1 | # Quantclean 🧹
2 |
3 | "Make it cleaner, make it leaner"
4 |
5 | Already used by **several people working in the quant and finance industries**, Quantclean is the all-in-one tool that will help you to **reformat your dataset and clean it**.
6 |
7 | Quantclean is a program that **reformats** financial datasets to the **US Equity TradeBar** format (the QuantConnect format).
8 |
9 | We have all faced the problem of reformatting our data to a standard. Manual data cleaning is tedious and takes time. Quantclean is here to help and to make your life easier.
10 |
11 | It works well with data from Quandl, Algoseek, Alpha Vantage, yfinance, and many other sources.
12 |
13 | ## Installation
14 |
15 | ```
16 | pip install quantclean
17 | ```
18 |
19 | ## A few things you may want to know before getting started 🍉
20 |
21 | 1) If your dataset is missing an open, close, volume, high, low, or date column, quantclean creates a blank column for it. No problem!
22 |
23 | 2) The generated dataframe will look like this if you have a date column and a time column (or if both are in the same column):
24 |
25 | | Date| Open | High | Low | Close | Volume
26 | | ----------- | ---------- | --------- | ---------- | --------- | ---------
27 | | 20131001 09:00 | 6448000 | 6448000 | 6448000 | 6448000 | 90
28 |
29 | - Date - String date "YYYYMMDD HH:MM" in the timezone of the data format.
30 | - Open - Deci-cents Open Price for TradeBar.
31 | - High - Deci-cents High Price for TradeBar.
32 | - Low - Deci-cents Low Price for TradeBar.
33 | - Close - Deci-cents Close Price for TradeBar.
34 | - Volume - Number of shares traded in this TradeBar.
35 |
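The field descriptions above can be illustrated with a small conversion sketch. It assumes the scale factor of 10,000 that QuantConnect applies to US equity prices (so 644.80 becomes 6448000) and the "YYYYMMDD HH:MM" date string from the table. The helper names `to_deci_cents` and `to_tradebar_date` are hypothetical and are not part of quantclean; the `quantclean.py` source shown in this repository does not appear to scale prices itself, so a step like this would be applied separately if your prices are in dollars.

```python
from datetime import datetime
from decimal import Decimal

# Assumption: QuantConnect-style scaling, price in dollars multiplied by 10,000.
SCALE_FACTOR = 10_000


def to_deci_cents(price_in_dollars: str) -> int:
    """Scale a dollar price (given as text to avoid float rounding) to an integer."""
    return int(Decimal(price_in_dollars) * SCALE_FACTOR)


def to_tradebar_date(timestamp: datetime) -> str:
    """Format a timestamp as the 'YYYYMMDD HH:MM' string used in the table."""
    return timestamp.strftime("%Y%m%d %H:%M")


print(to_deci_cents("644.80"))
print(to_tradebar_date(datetime(2013, 10, 1, 9, 0)))
```

Passing the price as a string and using `Decimal` keeps the scaled value exact, which a plain `float` multiplication does not guarantee.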
36 |
37 | 3) You can also get something like this if you use the ```sweeper_dash``` function instead of ```sweeper```:
38 |
39 |
40 | | Date| Open | High | Low | Close | Volume
41 | | ----------- | ---------- | --------- | ---------- | --------- | ---------
42 | | **2013-10-01 09:00:00** | 6448000 | 6448000 | 6448000 | 6448000 | 90
43 |
44 |
45 | As you can see, the date format is YYYY-MM-DD rather than YYYYMMDD.
46 |
47 |
48 | 4) If you only have a date column (e.g. something like YYYY-MM-DD), it will look like this:
49 |
50 | | Date| Open | High | Low | Close | Volume
51 | | ----------- | ---------- | --------- | ---------- | --------- | ---------
52 | | 20131001 | 6448000 | 6448000 | 6448000 | 6448000 | 90
53 |
54 |
55 | You can also use the ```sweeper_dash``` function here.
56 |
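The difference between the two date styles comes down to separators: in the source shown in this repository, `sweeper` removes `-` and `/` from the date string, while `sweeper_dash` leaves the string as it is. A minimal sketch of that one step in plain Python follows; `compact_date` is a hypothetical name used only for illustration.

```python
import re


def compact_date(date_text: str) -> str:
    """Remove '-' and '/' separators, leaving any time part untouched."""
    return re.sub(r"[-/]", "", date_text)


print(compact_date("2013-10-01"))
print(compact_date("2013/10/01 09:00"))
```

Note that only the date separators are removed; colons in a time part such as `09:00` are kept.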
57 | ## How to use it? 🚀
58 |
59 | First, [here](https://colab.research.google.com/drive/1L6wRRl1l2UnPY50F3qp2cxTcIqC4dtgK?usp=sharing) is a notebook that gives you an example of how to use quantclean.
60 |
61 | Note: I took this data from Quandl. Your dataset doesn't have to look like this one; quantclean adapts to your dataset as well as it can.
62 |
63 | ```
64 | import pandas as pd
65 | from quantclean import sweeper
66 | df = pd.read_csv('AS-N100.csv')
67 | df
68 | ```
69 |
70 |
71 | ```
72 | _df = sweeper(df)
73 | _df
74 | ```
75 | Output:
76 |
77 |
78 |
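What `sweeper` does to the column names can be sketched without pandas. The snippet below mirrors the matching rule in the `quantclean.py` source from this repository (a column is renamed when its name contains a field name in all lowercase or all uppercase) and then fills any missing required field with an empty string. `normalize_name` and `fill_missing` are hypothetical helpers written for this illustration, not functions exported by quantclean.

```python
# Canonical names in the order the source checks them.
FIELDS = ["Open", "High", "Low", "Close", "Volume", "Date", "Time"]
REQUIRED = ["Date", "Open", "High", "Low", "Close", "Volume"]


def normalize_name(name: str) -> str:
    """Return the canonical field name if the raw name contains it in
    lowercase or uppercase; otherwise return the raw name unchanged."""
    for field in FIELDS:
        if field.lower() in name or field.upper() in name:
            return field
    return name


def fill_missing(row: dict) -> dict:
    """Keep only the required fields, using '' for any that are absent."""
    return {field: row.get(field, "") for field in REQUIRED}


raw_row = {"date": "2013-10-01", "open_price": 10.0, "HIGH": 11.0,
           "low": 9.5, "close": 10.5, "ticker": "ABC"}
renamed = {normalize_name(key): value for key, value in raw_row.items()}
print(fill_missing(renamed))
```

In this example the `ticker` column is dropped and `Volume` comes back as an empty string, which corresponds to the blank column described earlier. Mixed-case names such as `Adj Close` are left unchanged by this rule, just as in the source.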
79 | Now, you may not be happy with this date column, which is presented in the YYYYMMDD format, and may prefer YYYY-MM-DD.
80 |
81 | In that case, use ```sweeper_dash``` (imported from ```quantclean``` in the same way as ```sweeper```):
82 |
83 | ```
84 | df_dash = sweeper_dash(df)
85 | df_dash
86 | ```
87 |
88 | Output:
89 |
90 |
91 |
92 | ## Contribution
93 |
94 | If you have suggestions or improvements, don't hesitate to create an issue or make a pull request. Any help is welcome!
95 |
--------------------------------------------------------------------------------
/license:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Santosh Passoubady
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/quantclean.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from pandas_datareader import data as web
3 | import logging
4 |
5 | def sweeper(data):
6 |     for name in logging.Logger.manager.loggerDict.keys():
7 |         logging.getLogger(name).setLevel(logging.CRITICAL)
8 |
9 |     # Rename columns whose name contains a known field in lowercase or uppercase
10 |     data.columns = ['Open' if 'open' in x else 'Open' if 'OPEN' in x else x for x in data.columns]
11 |
12 |     data.columns = ['High' if 'high' in x else 'High' if 'HIGH' in x else x for x in data.columns]
13 |
14 |     data.columns = ['Low' if 'low' in x else 'Low' if 'LOW' in x else x for x in data.columns]
15 |
16 |     data.columns = ['Close' if 'close' in x else 'Close' if 'CLOSE' in x else x for x in data.columns]
17 |
18 |     data.columns = ['Volume' if 'volume' in x else 'Volume' if 'VOLUME' in x else x for x in data.columns]
19 |
20 |     data.columns = ['Date' if 'date' in x else 'Date' if 'DATE' in x else x for x in data.columns]
21 |
22 |     data.columns = ['Time' if 'time' in x else 'Time' if 'TIME' in x else x for x in data.columns]
23 |
24 |     if 'Date' in data.columns and 'Time' in data.columns:
25 |         data['Date'] = data['Date']+" "+data['Time']
26 |
27 |     elif 'Time' in data.columns and 'Date' not in data.columns:
28 |         data['Date'] = data['Time']
29 |
30 |     elif 'Date' in data.columns and 'Time' not in data.columns:
31 |         pass
32 |
33 |
34 |     try:
35 |         df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()
36 |         df.reset_index(drop=True, inplace=True)
37 |         df['Date'] = df['Date'].astype(str)
38 |         df['Date'] = df['Date'].str.replace(r'-|/', '', regex=True)
39 |
40 |         missing = data.isnull().sum().sum()
41 |         if missing >= 1:
42 |             print("The sweeper detected missing values")
43 |             print("1: No change")
44 |             print("2: Delete every row containing a missing value")
45 |             print("3: Delete only the rows where all six columns are missing")
46 |             answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
47 |             if answer == '1':
48 |                 pass
49 |             elif answer == '2':
50 |                 df = df.dropna()
51 |             elif answer == '3':
52 |                 df = df.dropna(how='all',
53 |                                subset=['Date','Open','High','Low', 'Close', 'Volume'])
54 |             else:
55 |                 print("not valid answer")
56 |
57 |         return df
58 |
59 |     except KeyError:
60 |
61 |         # At least one required column is missing: report it and add a blank one
62 |         print('Oops...seems like someone has cast a spell on your dataset')
63 |
64 |         print('Checking which column has been bewitched...')
65 |
66 |         cols = ['Date','Open','High','Low', 'Close', 'Volume']
67 |
68 |         for col in cols:
69 |             if col not in data.columns:
70 |                 print(col + " is invisible")
71 |                 data[col] = ""
72 |
73 |         print("The spell has been successfully broken!")
74 |
75 |         df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()
76 |         df.reset_index(drop=True, inplace=True)
77 |         df['Date'] = df['Date'].astype(str)
78 |         df['Date'] = df['Date'].str.replace(r'-|/', '', regex=True)
79 |
80 |         missing = data.isnull().sum().sum()
81 |         if missing >= 1:
82 |             print("The sweeper detected missing values")
83 |             print("1: No change")
84 |             print("2: Delete every row containing a missing value")
85 |             print("3: Delete only the rows where all six columns are missing")
86 |             answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
87 |             if answer == '1':
88 |                 pass
89 |             elif answer == '2':
90 |                 df = df.dropna()
91 |             elif answer == '3':
92 |                 df = df.dropna(how='all',
93 |                                subset=['Date','Open','High','Low', 'Close', 'Volume'])
94 |             else:
95 |                 print("not valid answer")
96 |
97 |         return df
98 |
99 | # ---------------------------------------------------------------------------------------------------------------
100 | def sweeper_dash(data):
101 |     for name in logging.Logger.manager.loggerDict.keys():
102 |         logging.getLogger(name).setLevel(logging.CRITICAL)
103 |
104 |     # Rename columns whose name contains a known field in lowercase or uppercase
105 |     data.columns = ['Open' if 'open' in x else 'Open' if 'OPEN' in x else x for x in data.columns]
106 |
107 |     data.columns = ['High' if 'high' in x else 'High' if 'HIGH' in x else x for x in data.columns]
108 |
109 |     data.columns = ['Low' if 'low' in x else 'Low' if 'LOW' in x else x for x in data.columns]
110 |
111 |     data.columns = ['Close' if 'close' in x else 'Close' if 'CLOSE' in x else x for x in data.columns]
112 |
113 |     data.columns = ['Volume' if 'volume' in x else 'Volume' if 'VOLUME' in x else x for x in data.columns]
114 |
115 |     data.columns = ['Date' if 'date' in x else 'Date' if 'DATE' in x else x for x in data.columns]
116 |
117 |     data.columns = ['Time' if 'time' in x else 'Time' if 'TIME' in x else x for x in data.columns]
118 |
119 |     if 'Date' in data.columns and 'Time' in data.columns:
120 |         data['Date'] = data['Date']+" "+data['Time']
121 |
122 |     elif 'Time' in data.columns and 'Date' not in data.columns:
123 |         data['Date'] = data['Time']
124 |
125 |     elif 'Date' in data.columns and 'Time' not in data.columns:
126 |         pass
127 |
128 |
129 |     try:
130 |         df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()
131 |         df.reset_index(drop=True, inplace=True)
132 |
133 |         df['Date'] = df['Date'].astype(str)
134 |
135 |         missing = data.isnull().sum().sum()
136 |         if missing >= 1:
137 |             print("The sweeper detected missing values")
138 |             print("1: No change")
139 |             print("2: Delete every row containing a missing value")
140 |             print("3: Delete only the rows where all six columns are missing")
141 |             answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
142 |             if answer == '1':
143 |                 pass
144 |             elif answer == '2':
145 |                 df = df.dropna()
146 |             elif answer == '3':
147 |                 df = df.dropna(how='all',
148 |                                subset=['Date','Open','High','Low', 'Close', 'Volume'])
149 |             else:
150 |                 print("not valid answer")
151 |
152 |         return df
153 |
154 |     except KeyError:
155 |
156 |         # At least one required column is missing: report it and add a blank one
157 |         print('Oops...seems like someone has cast a spell on your dataset')
158 |
159 |         print('Checking which column has been bewitched...')
160 |
161 |         cols = ['Date','Open','High','Low', 'Close', 'Volume']
162 |
163 |         for col in cols:
164 |             if col not in data.columns:
165 |                 print(col + " is invisible")
166 |                 data[col] = ""
167 |
168 |         print("The spell has been successfully broken!")
169 |
170 |         df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()
171 |         df.reset_index(drop=True, inplace=True)
172 |         df['Date'] = df['Date'].astype(str)
173 |
174 |         missing = data.isnull().sum().sum()
175 |         if missing >= 1:
176 |             print("The sweeper detected missing values")
177 |             print("1: No change")
178 |             print("2: Delete every row containing a missing value")
179 |             print("3: Delete only the rows where all six columns are missing")
180 |             answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
181 |             if answer == '1':
182 |                 pass
183 |             elif answer == '2':
184 |                 df = df.dropna()
185 |             elif answer == '3':
186 |                 df = df.dropna(how='all',
187 |                                subset=['Date','Open','High','Low', 'Close', 'Volume'])
188 |             else:
189 |                 print("not valid answer")
190 |
191 |         return df
192 |
--------------------------------------------------------------------------------