├── README.md
├── license
└── quantclean.py

/README.md:
--------------------------------------------------------------------------------
# Quantclean 🧹

"Make it cleaner, make it leaner"

Already used by **several people working in the quant and finance industries**, Quantclean is an all-in-one tool that helps you **reformat and clean your dataset**.

Quantclean is a program that **reformats** financial datasets to the **US Equity TradeBar** format (the QuantConnect format).

We have all faced the problem of reformatting our data to a standard. Manual data cleaning is tedious and takes time. Quantclean is here to make your life easier.

It works well with data from Quandl, Algoseek, Alpha Vantage, yfinance, and many other sources.

## Installation

```
pip install quantclean
```

## A few things you may want to know before getting started 🍉

1) If your dataset lacks an open, close, volume, high, low, or date column, quantclean will create a blank column for it. No problem!

2) The generated dataframe will look like this if you have both a date and a time column (or if both are in the same column):

| Date | Open | High | Low | Close | Volume |
| ----------- | ---------- | --------- | ---------- | --------- | --------- |
| 20131001 09:00 | 6448000 | 6448000 | 6448000 | 6448000 | 90 |

- Date - String date "YYYYMMDD HH:MM" in the timezone of the data format.
- Open - Deci-cents Open Price for TradeBar.
- High - Deci-cents High Price for TradeBar.
- Low - Deci-cents Low Price for TradeBar.
- Close - Deci-cents Close Price for TradeBar.
- Volume - Number of shares traded in this TradeBar.

3) You can also get something like this if you use the ```sweeper_dash``` function instead of ```sweeper```:

| Date | Open | High | Low | Close | Volume |
| ----------- | ---------- | --------- | ---------- | --------- | --------- |
| **2013-10-01 09:00:00** | 6448000 | 6448000 | 6448000 | 6448000 | 90 |

As you can see, the date format is now YYYY-MM-DD instead of YYYYMMDD.

4) If you only have a date column (e.g. something like YYYY-MM-DD), it will look like this:

| Date | Open | High | Low | Close | Volume |
| ----------- | ---------- | --------- | ---------- | --------- | --------- |
| 20131001 | 6448000 | 6448000 | 6448000 | 6448000 | 90 |

You can also use the ```sweeper_dash``` function here.

## How to use it? 🚀

First, [here](https://colab.research.google.com/drive/1L6wRRl1l2UnPY50F3qp2cxTcIqC4dtgK?usp=sharing) is a notebook that gives you an example of how to use quantclean.

Note: I took this data from Quandl. Your dataset doesn't necessarily have to look like this one; quantclean adapts to your dataset as well as it can.

```
import pandas as pd
from quantclean import sweeper, sweeper_dash

df = pd.read_csv('AS-N100.csv')
df
```

```
_df = sweeper(df)
_df
```
Output:

Now, you may not be happy with this date column, which is in the YYYYMMDD format, and may prefer YYYY-MM-DD.

In that case, do:

```
df_dash = sweeper_dash(df)
df_dash
```

Output:

## Contribution

If you have suggestions or improvements, don't hesitate to create an issue or make a pull request. Any help is welcome!
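
To make the TradeBar columns described earlier more concrete, here is a minimal sketch of how a single raw row maps to that layout. It is not part of quantclean: `to_deci_cents` and `to_tradebar_date` are hypothetical helper names, and the sketch assumes that one dollar equals 10,000 deci-cents (the convention QuantConnect's data format is understood to use) and that the raw timestamp looks like `YYYY-MM-DD HH:MM:SS`.

```python
from datetime import datetime
from decimal import Decimal

# Assumption: 1 dollar = 10,000 deci-cents.
DECI_CENTS_PER_DOLLAR = 10_000


def to_deci_cents(price):
    """Convert a dollar price (str, int, float or Decimal) to integer deci-cents."""
    # Going through str() avoids binary floating-point rounding surprises.
    return int(Decimal(str(price)) * DECI_CENTS_PER_DOLLAR)


def to_tradebar_date(timestamp, with_time=True):
    """Format a 'YYYY-MM-DD HH:MM:SS' string as 'YYYYMMDD HH:MM' or 'YYYYMMDD'."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y%m%d %H:%M" if with_time else "%Y%m%d")


# Hypothetical raw row, used only for illustration.
row = {"Date": "2013-10-01 09:00:00", "Open": "644.80", "Volume": 90}
print(to_tradebar_date(row["Date"]), to_deci_cents(row["Open"]), row["Volume"])
# prints: 20131001 09:00 6448000 90
```

Note that `quantclean.py` as listed in this repository only renames, selects, and reformats columns; it does not rescale prices, so a conversion like the one above would have to happen before or after calling `sweeper`.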

--------------------------------------------------------------------------------
/license:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Santosh Passoubady

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/quantclean.py:
--------------------------------------------------------------------------------
import pandas as pd
from pandas_datareader import data as web  # `web` is never referenced below
import logging

def sweeper(data):
    # Silence every logger that has been registered so far.
    for name in logging.Logger.manager.loggerDict.keys():
        logging.getLogger(name).setLevel(logging.CRITICAL)

    # non efficient, right?
    # Rename any column containing a lower-case or upper-case keyword
    # (e.g. 'open' or 'OPEN') to its canonical name, one keyword at a time.
    for name in ('Open', 'High', 'Low', 'Close', 'Volume', 'Date', 'Time'):
        data.columns = [
            name if name.lower() in x or name.upper() in x else x
            for x in data.columns
        ]

    # Merge separate Date and Time columns, or fall back to Time alone.
    if 'Date' in data.columns and 'Time' in data.columns:
        data['Date'] = data['Date'] + " " + data['Time']
    elif 'Time' in data.columns and 'Date' not in data.columns:
        data['Date'] = data['Time']

    cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']

    # Create a blank column for every expected column that is missing.
    missing_cols = [col for col in cols if col not in data.columns]
    if missing_cols:
        # nonsense messages
        print('Oupsi...seems like someone has cast a spell on your dataset')
        print('Checking which column has been bewitched...')
        for col in missing_cols:
            print(col + " is invisible")
            data[col] = ""
        print("The spell has been successfully broken!")

    # Work on a copy so the assignments below do not touch a view of `data`.
    df = data[cols].copy()
    df.reset_index(drop=True, inplace=True)
    df['Date'] = df['Date'].astype(str)
    # Strip '-' and '/' so dates read YYYYMMDD; the pattern is a regex.
    df['Date'] = df['Date'].str.replace(r'-|/', '', regex=True)

    missing = data.isnull().sum().sum()
    if missing >= 1:
        print("The sweeper detected missing values")
        print("1: No change")
        print("2: Delete every row that contains a missing value")
        print("3: Delete only the rows where all six columns are missing")
        # input() returns a string, so compare against "1", "2" and "3".
        answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
        if answer == "1":
            pass
        elif answer == "2":
            df = df.dropna()
        elif answer == "3":
            df = df.dropna(how='all', subset=cols)
        else:
            print("not a valid answer")

    return df


# ------------------------------------------------------------------------------
def sweeper_dash(data):
    # Silence every logger that has been registered so far.
    for name in logging.Logger.manager.loggerDict.keys():
        logging.getLogger(name).setLevel(logging.CRITICAL)

    # non efficient, right?
    # Rename any column containing a lower-case or upper-case keyword
    # (e.g. 'open' or 'OPEN') to its canonical name, one keyword at a time.
    for name in ('Open', 'High', 'Low', 'Close', 'Volume', 'Date', 'Time'):
        data.columns = [
            name if name.lower() in x or name.upper() in x else x
            for x in data.columns
        ]

    # Merge separate Date and Time columns, or fall back to Time alone.
    if 'Date' in data.columns and 'Time' in data.columns:
        data['Date'] = data['Date'] + " " + data['Time']
    elif 'Time' in data.columns and 'Date' not in data.columns:
        data['Date'] = data['Time']

    cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']

    # Create a blank column for every expected column that is missing.
    missing_cols = [col for col in cols if col not in data.columns]
    if missing_cols:
        # nonsense messages
        print('Oupsi...seems like someone has cast a spell on your dataset')
        print('Checking which column has been bewitched...')
        for col in missing_cols:
            print(col + " is invisible")
            data[col] = ""
        print("The spell has been successfully broken!")

    # Work on a copy so the assignments below do not touch a view of `data`.
    # Unlike sweeper, the '-' and '/' separators in the date are kept.
    df = data[cols].copy()
    df.reset_index(drop=True, inplace=True)
    df['Date'] = df['Date'].astype(str)

    missing = data.isnull().sum().sum()
    if missing >= 1:
        print("The sweeper detected missing values")
        print("1: No change")
        print("2: Delete every row that contains a missing value")
        print("3: Delete only the rows where all six columns are missing")
        # input() returns a string, so compare against "1", "2" and "3".
        answer = input("How do you want to deal with these missing values? (answer 1, 2 or 3) ").strip()
        if answer == "1":
            pass
        elif answer == "2":
            df = df.dropna()
        elif answer == "3":
            df = df.dropna(how='all', subset=cols)
        else:
            print("not a valid answer")

    return df

--------------------------------------------------------------------------------