├── README.md
├── license
└── quantclean.py


/README.md:
--------------------------------------------------------------------------------
 1 | # Quantclean 🧹
 2 | 
 3 | <strong><em>"Make it cleaner, make it leaner"</em></strong>
 4 | 
 5 | Already used by **several people working in the quant and finance industries**, Quantclean is the all-in-one tool that will help you to **reformat your dataset and clean it**.
 6 | 
 7 | Quantclean is a program that **reformats** financial datasets to the **US Equity TradeBar** format (QuantConnect format).
 8 | 
 9 | We have all faced the problem of reformatting our data to a standard. Manual data cleaning is tedious and time-consuming. Quantclean is here to help and to make your life easier.
10 | 
11 | It works well with data from Quandl, Algoseek, Alpha Vantage, yfinance, and many other sources.
12 | 
13 | ## Installation 
14 | 
15 | ```
16 | pip install quantclean
17 | ```
18 | 
19 | ## A few things you may want to know before getting started 🍉
20 | 
21 | 1) If your dataset lacks an open, close, volume, high, low, or date column, quantclean creates a blank column for it. No problem!
22 | 
23 | 2) If you have a date column and a time column (or both in the same column), the generated dataframe will look like this:
24 | 
25 | | Date| Open | High | Low | Close | Volume
26 | | ----------- | ---------- | --------- | ---------- | --------- | ---------
27 | | 20131001 09:00 | 6448000  | 6448000 | 6448000 | 6448000 | 90
28 | 
29 |  - Date - String date "YYYYMMDD HH:MM" in the timezone of the data format.
30 |  - Open - Deci-cents Open Price for TradeBar.
31 |  - High - Deci-cents High Price for TradeBar.
32 |  - Low - Deci-cents Low Price for TradeBar.
33 |  - Close - Deci-cents Close Price for TradeBar.
34 |  - Volume - Number of shares traded in this TradeBar.
35 |  
36 | 
37 | 3) You can also get something like this if you use the ```sweeper_dash``` function instead of ```sweeper```:
38 | 
39 | 
40 | | Date| Open | High | Low | Close | Volume
41 | | ----------- | ---------- | --------- | ---------- | --------- | ---------
42 | | **2013-10-01 09:00:00** | 6448000  | 6448000 | 6448000 | 6448000 | 90
43 | 
44 | 
45 | As you can see, the date format is now YYYY-MM-DD rather than YYYYMMDD.
46 | 
47 | 
48 | 4) If you only have a date column (e.g. something like YYYY-MM-DD), it will look like this:
49 | 
50 | | Date| Open | High | Low | Close | Volume
51 | | ----------- | ---------- | --------- | ---------- | --------- | ---------
52 | | 20131001 | 6448000  | 6448000 | 6448000 | 6448000 | 90
53 | 
54 | 
55 | You can also use the ```sweeper_dash``` function here.
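
The difference between the two date styles above comes down to whether the `-` and `/` separators are stripped from the date string. Here is a minimal sketch of that step using only the standard library; `compact_date` is a hypothetical helper written for illustration and is not part of quantclean's API:

```python
import re


def compact_date(value: str) -> str:
    """Remove '-' and '/' separators from a date string (hypothetical helper)."""
    return re.sub(r"-|/", "", value)


print(compact_date("2013-10-01 09:00"))  # 20131001 09:00
print(compact_date("2013/10/01"))        # 20131001
```

The time part is left untouched because only `-` and `/` are removed, which matches the `YYYYMMDD HH:MM` layout shown in the tables above.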
56 | 
57 | ## How to use it? 🚀
58 | 
59 | First, [here](https://colab.research.google.com/drive/1L6wRRl1l2UnPY50F3qp2cxTcIqC4dtgK?usp=sharing) is a notebook that gives you an example of how to use quantclean.
60 | 
61 | <u>Note:</u> I took this data from Quandl. Your dataset doesn't have to look like this one; quantclean adapts to your dataset as well as possible.
62 | 
63 | ```
64 | import pandas as pd
65 | from quantclean import sweeper, sweeper_dash
66 | df = pd.read_csv('AS-N100.csv')
67 | df
68 | ```
69 | <img src="https://i.ibb.co/zVfYx5J/Capture.jpg"/>
70 | 
71 | ```
72 | _df = sweeper(df)
73 | _df
74 | ```
75 | Output: 
76 | 
77 | <img src="https://i.ibb.co/YdncjPz/Capture.jpg"/>
78 | 
79 | Now, you may not be happy with this date column, which is presented in the YYYYMMDD format, and may prefer YYYY-MM-DD.
80 | 
81 | In that case, do:
82 | 
83 | ```
84 | df_dash = sweeper_dash(df)
85 | df_dash
86 | ```
87 | 
88 | Output: 
89 | 
90 | <img src = "https://i.ibb.co/LNd5Kb9/Capture.jpg"/>
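
To give a rough idea of how column names are recognised, here is a sketch in plain Python modelled on the renaming logic in `quantclean.py`: a column is renamed when its name contains a keyword written entirely in lower case or entirely in upper case. `normalize_columns` and `CANONICAL` are hypothetical names used for illustration, not functions exported by the package:

```python
# Canonical column names, in the order quantclean.py checks them.
CANONICAL = ["Open", "High", "Low", "Close", "Volume", "Date", "Time"]


def normalize_columns(columns):
    """Rename columns that contain a known keyword (hypothetical helper).

    Only all-lower-case ('open') or all-upper-case ('OPEN') keywords match,
    so a mixed-case name such as 'Adj Close' is left unchanged.
    """
    renamed = []
    for name in columns:
        for target in CANONICAL:
            if target.lower() in name or target.upper() in name:
                name = target
                break
        renamed.append(name)
    return renamed


print(normalize_columns(["trade_date", "OPEN", "high", "low", "close", "VOLUME", "ticker"]))
```

Because the match is a substring test, it is worth checking the renamed columns on your own dataset before relying on the result.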
91 | 
92 | ## Contribution
93 | 
94 | If you have suggestions or improvements, don't hesitate to create an issue or make a pull request. Any help is welcome!
95 | 


--------------------------------------------------------------------------------
/license:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2021 Santosh Passoubady
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/quantclean.py:
--------------------------------------------------------------------------------
  1 | import pandas as pd
  2 | from pandas_datareader import data as web
  3 | import logging
  4 | 
  5 | def sweeper(data):
  6 |   for name in logging.Logger.manager.loggerDict.keys():
  7 |     logging.getLogger(name).setLevel(logging.CRITICAL)
  8 | 
  9 |   #non efficient, right?
 10 |   data.columns = ['Open' if 'open' in x else 'Open' if 'OPEN' in x else x for x in data.columns]
 11 | 
 12 |   data.columns = ['High' if 'high' in x else 'High' if 'HIGH' in x else x for x in data.columns]
 13 | 
 14 |   data.columns = ['Low' if 'low' in x else 'Low' if 'LOW' in x else x for x in data.columns]
 15 | 
 16 |   data.columns = ['Close' if 'close' in x else 'Close' if 'CLOSE' in x else x for x in data.columns]
 17 | 
 18 |   data.columns = ['Volume' if 'volume' in x else 'Volume' if 'VOLUME' in x else x for x in data.columns]
 19 | 
 20 |   data.columns = ['Date' if 'date' in x else 'Date' if 'DATE' in x else x for x in data.columns]
 21 | 
 22 |   data.columns = ['Time' if 'time' in x else 'Time' if 'TIME' in x else x for x in data.columns]
 23 | 
 24 |   if 'Date' in data.columns and 'Time' in data.columns:
 25 |     data['Date'] = data['Date']+" "+data['Time']
 26 | 
 27 |   elif 'Time' in data.columns and not 'Date' in data.columns:
 28 |     data['Date'] = data['Time']
 29 | 
 30 |   elif 'Date' in data.columns and not 'Time' in data.columns:
 31 |     pass
 32 | 
 33 | 
 34 |   try:
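 35 |     df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()  # copy so the assignments below don't touch a view of data
 36 |     df.reset_index(drop=True, inplace=True)
 37 |     df['Date'] = df['Date'].astype(str)
 38 |     df['Date'] = df['Date'].str.replace(r'-|/', '', regex=True)  # regex=True is needed on pandas >= 2.0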
 39 | 
 40 |     missing = data.isnull().sum().sum()
 41 |     if missing >=1:
 42 |       print("The sweeper detected missing values")
 43 |       print("1: No change")
 44 |       print("2: Delete every row that contains a missing value")
 45 |       print("3: Delete only the rows where every value is missing")
 46 |       answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
 47 |       if answer == '1':  # input() returns a string, so compare against strings
 48 |         pass
 49 |       elif answer == '2':
 50 |         df = df.dropna()
 51 |       elif answer == '3':
 52 |         df = df.dropna( how='all',
 53 |                     subset=['Date','Open','High','Low', 'Close', 'Volume'])
 54 |       else:
 55 |         print("not valid answer")
 56 | 
 57 |     return df
 58 | 
 59 |   except KeyError:
 60 | 
 61 |     #non sense messages
 62 |     print('Oupsi...seems like someone has cast a spell on your dataset')
 63 | 
 64 |     print('Checking which column has been bewitched...')
 65 | 
 66 |     cols = ['Date','Open','High','Low', 'Close', 'Volume']
 67 | 
 68 |     for col in cols:
 69 |       if col not in data.columns:
 70 |         print(col + " is invisible")
 71 |         data[col] = ""
 72 | 
 73 |     print("The spell has been successfully broken!")
 74 | 
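 75 |     df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()  # copy so the assignments below don't touch a view of data
 76 |     df.reset_index(drop=True, inplace=True)
 77 |     df['Date'] = df['Date'].astype(str)
 78 |     df['Date'] = df['Date'].str.replace(r'-|/', '', regex=True)  # regex=True is needed on pandas >= 2.0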
 79 | 
 80 |     missing = data.isnull().sum().sum()
 81 |     if missing >=1:
 82 |       print("The sweeper detected missing values")
 83 |       print("1: No change")
 84 |       print("2: Delete every row that contains a missing value")
 85 |       print("3: Delete only the rows where every value is missing")
 86 |       answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
 87 |       if answer == '1':  # input() returns a string, so compare against strings
 88 |         pass
 89 |       elif answer == '2':
 90 |         df = df.dropna()
 91 |       elif answer == '3':
 92 |         df = df.dropna( how='all',
 93 |                     subset=['Date','Open','High','Low', 'Close', 'Volume'])
 94 |       else:
 95 |         print("not valid answer")
 96 | 
 97 |     return df
 98 |   
 99 | # ---------------------------------------------------------------------------------------------------------------
100 | def sweeper_dash(data):
101 |   for name in logging.Logger.manager.loggerDict.keys():
102 |     logging.getLogger(name).setLevel(logging.CRITICAL)
103 | 
104 |   #non efficient, right?
105 |   data.columns = ['Open' if 'open' in x else 'Open' if 'OPEN' in x else x for x in data.columns]
106 | 
107 |   data.columns = ['High' if 'high' in x else 'High' if 'HIGH' in x else x for x in data.columns]
108 | 
109 |   data.columns = ['Low' if 'low' in x else 'Low' if 'LOW' in x else x for x in data.columns]
110 | 
111 |   data.columns = ['Close' if 'close' in x else 'Close' if 'CLOSE' in x else x for x in data.columns]
112 | 
113 |   data.columns = ['Volume' if 'volume' in x else 'Volume' if 'VOLUME' in x else x for x in data.columns]
114 | 
115 |   data.columns = ['Date' if 'date' in x else 'Date' if 'DATE' in x else x for x in data.columns]
116 | 
117 |   data.columns = ['Time' if 'time' in x else 'Time' if 'TIME' in x else x for x in data.columns]
118 | 
119 |   if 'Date' in data.columns and 'Time' in data.columns:
120 |     data['Date'] = data['Date']+" "+data['Time']
121 | 
122 |   elif 'Time' in data.columns and not 'Date' in data.columns:
123 |     data['Date'] = data['Time']
124 | 
125 |   elif 'Date' in data.columns and not 'Time' in data.columns:
126 |     pass
127 | 
128 | 
129 |   try:
130 |     df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()  # copy so the assignments below don't touch a view of data
131 |     df.reset_index(drop=True, inplace=True)
132 | 
133 |     df['Date'] = df['Date'].astype(str)
134 | 
135 |     missing = data.isnull().sum().sum()
136 |     if missing >=1:
137 |       print("The sweeper detected missing values")
138 |       print("1: No change")
139 |       print("2: Delete every row that contains a missing value")
140 |       print("3: Delete only the rows where every value is missing")
141 |       answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
142 |       if answer == '1':  # input() returns a string, so compare against strings
143 |         pass
144 |       elif answer == '2':
145 |         df = df.dropna()
146 |       elif answer == '3':
147 |         df = df.dropna( how='all',
148 |                     subset=['Date','Open','High','Low', 'Close', 'Volume'])
149 |       else:
150 |         print("not valid answer")
151 | 
152 |     return df
153 | 
154 |   except KeyError:
155 | 
156 |     #non sense messages
157 |     print('Oupsi...seems like someone has cast a spell on your dataset')
158 | 
159 |     print('Checking which column has been bewitched...')
160 | 
161 |     cols = ['Date','Open','High','Low', 'Close', 'Volume']
162 | 
163 |     for col in cols:
164 |       if col not in data.columns:
165 |         print(col + " is invisible")
166 |         data[col] = ""
167 | 
168 |     print("The spell has been successfully broken!")
169 | 
170 |     df = data[['Date','Open','High','Low', 'Close', 'Volume']].copy()  # copy so the assignments below don't touch a view of data
171 |     df.reset_index(drop=True, inplace=True)
172 |     df['Date'] = df['Date'].astype(str)
173 | 
174 |     missing = data.isnull().sum().sum()
175 |     if missing >=1:
176 |       print("The sweeper detected missing values")
178 |       print("1: No change")
179 |       print("2: Delete every row that contains a missing value")
180 |       print("3: Delete only the rows where every value is missing")
181 |       answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
182 |       if answer == '1':  # input() returns a string, so compare against strings
183 |         pass
184 |       elif answer == '2':
185 |         df = df.dropna()
186 |       elif answer == '3':
187 |         df = df.dropna( how='all',
188 |                     subset=['Date','Open','High','Low', 'Close', 'Volume'])
189 |       else:
190 |         print("not valid answer")
190 | 
191 |     return df
192 | 


--------------------------------------------------------------------------------