├── .github
│   └── workflows
│       └── ci.yaml
├── .gitignore
├── Dockerfile.txt
├── LICENSE
├── MANIFEST.in
├── README.md
├── cover.jpg
├── egeaML
│   ├── __init__.py
│   ├── constants.py
│   ├── datareader.py
│   ├── egeaML.py
│   └── preprocessing.py
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests
    ├── __init__.py
    ├── data_ingestion
    │   ├── __init__.py
    │   ├── fixture.py
    │   ├── test_base.py
    │   └── test_financial_datareader.py
    └── preprocessing
        ├── __init__.py
        └── test_imputation.py
/.github/workflows/ci.yaml:
--------------------------------------------------------------------------------
1 | name: CI
2 |
3 | on: [push]
4 |
5 | jobs:
6 | build:
7 | runs-on: ubuntu-latest
8 | strategy:
9 | matrix:
10 | python-version: ["3.10"]
11 |
12 | steps:
13 | - uses: actions/checkout@v3
14 | - name: Set up Python ${{ matrix.python-version }}
15 | uses: actions/setup-python@v4
16 | with:
17 | python-version: ${{ matrix.python-version }}
18 | - name: Install dependencies
19 | run: |
20 | python -m pip install --upgrade pip
21 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
22 | - name: Lint with Ruff
23 | run: |
24 | pip install ruff
25 | ruff --format=github --target-version=py310 .
26 | continue-on-error: true
27 | - name: Test with pytest
28 | run: |
29 | coverage run -m pytest -v -s
30 | - name: Generate Coverage Report
31 | run: |
32 | coverage report -m
33 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Compiled source #
2 | ###################
3 | *.com
4 | *.class
5 | *.dll
6 | *.exe
7 | *.o
8 | *.so
9 |
10 | # Packages #
11 | ############
12 | # it's better to unpack these files and commit the raw source
13 | # git has its own built in compression methods
14 | *.7z
15 | *.dmg
16 | *.gz
17 | *.iso
18 | *.jar
19 | *.rar
20 | *.tar
21 | *.zip
22 |
23 | # Logs and databases #
24 | ######################
25 | *.log
26 | *.sql
27 | *.sqlite
28 |
29 | # OS generated files #
30 | ######################
31 | .DS_Store
32 | .DS_Store?
33 | ._*
34 | .Spotlight-V100
35 | .Trashes
36 | ehthumbs.db
37 | Thumbs.db
--------------------------------------------------------------------------------
/Dockerfile.txt:
--------------------------------------------------------------------------------
1 | FROM python:3.10
2 |
3 | WORKDIR /Applied_Machine_Learning_with_Python
4 | COPY requirements.txt /Applied_Machine_Learning_with_Python/
5 | RUN pip install -r requirements.txt
6 |
7 | CMD ["/bin/bash"]
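# Illustrative build/run commands (the "egeaml" image tag below is just an example, not defined in this repo):
#   docker build -f Dockerfile.txt -t egeaml .
#   docker run -it egeaml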
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 andreagiussani
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.md
2 | recursive-include egeaML/data *
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Helper functions for the book "Applied Machine Learning with Python"
2 |
3 | [PyPI package](https://pypi.python.org/pypi/egeaML)
4 | [Supported Python versions](https://pypi.python.org/pypi/egeaML)
5 | [CI status](https://github.com/andreagiussani/Applied_Machine_Learning_with_Python/actions?query=workflow%3ACI+event%3Apush+branch%3Amain)
6 |
7 |
8 |
9 |
10 |
11 |
12 | This repository contains the Supplementary Material for the book "Applied Machine Learning with Python", written by Andrea Giussani.
13 | You can find details about the book on the [BUP](https://bup.egeaonline.it) website.
14 | The book was written with the following specific versions of some popular libraries:
15 | - [scikit-learn](https://scikit-learn.org/stable/) version 1.2.2
16 | - [pandas](https://pandas.pydata.org) version 1.5.3
17 | - [xgboost](https://xgboost.readthedocs.io/en/latest/#) version 1.7.4
18 | - [gensim](https://radimrehurek.com/gensim/) version 3.8.1
19 | - [matplotlib](https://matplotlib.org) version 3.7.1
20 | - [seaborn](https://seaborn.pydata.org) version 0.9.0
21 |
22 | ## How to use the EgeaML Library
23 | The book provides a book-specific module, called **egeaML**.
24 | Be sure you have created a virtualenv. Then run
25 | ```bash
26 | pip install egeaML
27 | ```
28 | Once installed, you can load a structured, labelled dataset - such as the well-known Boston dataset -
29 | as a `pandas.DataFrame`, as follows:
30 |
31 | ```python
32 | from egeaML.datareader import DataReader
33 |
34 | raw_data = DataReader(
35 | filename='https://raw.githubusercontent.com/andreagiussani/datasets/master/egeaML/boston.csv',
36 | col_target='MEDV'
37 | )
38 | ```
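
Calling the reader returns the cleaned `pandas.DataFrame`; with `split_features_target=True` it returns the features and the target separately, and `split_train_test` wraps scikit-learn's `train_test_split`. A minimal sketch, based on the methods defined in `egeaML/datareader.py`:

```python
df = raw_data()  # full, cleaned DataFrame

X, y = raw_data(split_features_target=True)  # features and target
X_train, X_test, y_train, y_test = raw_data.split_train_test(test_size=0.3, random_seed=42)
```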
39 |
40 | Please note that the base code is evolving over time; if you want to stick to the print version of the book,
41 | be sure you install the `egeaML==0.2.3` version.
42 |
43 | ## How to develop the egeaML library
44 | Please clone this repo on your local machine:
45 | ```bash
46 | git clone https://github.com/andreagiussani/Applied_Machine_Learning_with_Python.git
47 | ```
48 | To install it into your local environment, I recommend creating a virtualenv and adding the necessary requirements by running these commands from your favourite terminal emulator:
49 | ```bash
50 | pip install -r requirements.txt
51 | pip install git+https://github.com/andreagiussani/Applied_Machine_Learning_with_Python.git
52 | ```
53 |
54 | If, instead, you use the Anaconda system (note that the package itself is still installed with `pip` inside the conda environment):
55 | ```bash
56 | conda install --file requirements.txt
57 | pip install git+https://github.com/andreagiussani/Applied_Machine_Learning_with_Python.git
58 | ```
59 | If you have Python3 already installed in your local environment, you can run:
60 | ```bash
61 | python3 -m pip install --upgrade pip
62 | python3 -m pip install git+https://github.com/andreagiussani/Applied_Machine_Learning_with_Python.git
63 | ```
64 |
65 | ### Unittest each method
66 | As a developer, you should unit test your contribution.
67 | To do so, you simply need to create a dedicated folder inside the `tests` subfolder (or possibly extend an existing one),
68 | and test that your method does exactly what you expect. Please look at the following example for inspiration:
69 |
70 | ```python
71 | import unittest
72 | import os
73 | import pandas as pd
74 |
75 | from egeaML.datareader import DataReader
76 |
77 |
78 | class DataIngestionTestCase(unittest.TestCase):
79 | URL_STRING_NAME = 'https://raw.githubusercontent.com/andreagiussani/datasets/master/egeaML'
80 | FILENAME_STRING_NAME = 'boston.csv'
81 |
82 | def setUp(self):
83 | self.col_target = 'MEDV'
84 | self.filename = os.path.join(self.URL_STRING_NAME, self.FILENAME_STRING_NAME)
85 | self.columns = [
86 | 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NX', 'RM', 'AGE',
87 | 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'
88 | ]
89 | self.raw_data = DataReader(filename=self.filename, col_target=self.col_target)
90 |
91 | def test__load_dataframe(self):
92 | df = self.raw_data()
93 | self.assertIsInstance(df, pd.DataFrame)
94 | self.assertEqual(df.shape[0], 506)
95 | self.assertEqual(df.shape[1], 14)
96 | ```
97 | The above unit test checks that the output is of type `pandas.DataFrame` and
98 | verifies that the output satisfies some expected characteristics.
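
To run the test suite locally you can mirror the CI job defined in `.github/workflows/ci.yaml`, for instance:

```bash
pip install -r requirements.txt
coverage run -m pytest -v -s
coverage report -m
```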
99 |
100 | ## Extra Stuff
101 | If you wish to use the `egeaML` library in a Jupyter notebook, you first need to install the jupyter library,
102 | and then run the following commands:
103 | ```bash
104 | pip install jupyter
105 | python3 -m ipykernel install --user --name=
106 | ```
107 | where the name is the one you have assigned to your local environment.
108 | You are now ready to use all the features of this helper!
109 |
110 | ## Submitting Errata
111 | If you have errata for the book, please submit them via the [BUP](https://bup.egeaonline.it) website. In case of possible mistakes within the book-specific module, you can submit a fix as a pull request to this repository.
112 |
113 | ## How to Cite this Book
114 |
115 | ```tex
116 | @book{giussani2020,
117 | TITLE="Applied Machine Learning with Python",
118 | AUTHOR="Andrea Giussani",
119 | YEAR="2020",
120 | PUBLISHER="Bocconi University Press"
121 | }
122 | ```
123 |
--------------------------------------------------------------------------------
/cover.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andreagiussani/Applied_Machine_Learning_with_Python/89966d54faf344cf90df55532bb1541f64461686/cover.jpg
--------------------------------------------------------------------------------
/egeaML/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/egeaML/constants.py:
--------------------------------------------------------------------------------
1 | UNNAMED_COLNAME = 'Unnamed'
2 | FILENAME_CONSTANT = 'filename'
3 | COL_TO_DROP_CONSTANT = 'col_to_drop'
4 | COL_TARGET_COLNAME = 'col_target'
5 |
--------------------------------------------------------------------------------
/egeaML/datareader.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | from calendar import monthrange
3 | from typing import Union
4 |
5 | import yfinance as yf
6 | import matplotlib.pyplot as plt
7 | import pandas as pd
8 | import seaborn as sns
9 |
10 | import requests
11 | from io import BytesIO
12 | from zipfile import ZipFile
13 |
14 | from concurrent.futures import ThreadPoolExecutor, as_completed
15 |
16 | from sklearn.model_selection import train_test_split
17 |
18 | from egeaML.constants import (
19 | UNNAMED_COLNAME,
20 | FILENAME_CONSTANT,
21 | COL_TARGET_COLNAME,
22 | COL_TO_DROP_CONSTANT,
23 | )
24 |
25 | import logging
26 | logging.basicConfig(level=logging.INFO)
27 |
28 |
29 | class DataReader:
30 | """
31 | This class is used to ingest data into the system before preprocessing.
32 | """
33 |
34 | def __init__(self, **args):
35 | """
36 | This module is used to ingest data into the system before preprocessing.
37 | """
38 | self.filename = args.get(FILENAME_CONSTANT)
39 | self.col_to_drop = args.get(COL_TO_DROP_CONSTANT)
40 | self.col_target = args.get(COL_TARGET_COLNAME)
41 |         self.X, self.y = None, None
42 |         self.df = None
43 |
44 | def __call__(self, split_features_target: bool = False):
45 | """
46 |         This function reads the .csv file and cleans it from unwanted columns.
47 | If split_features_target is set to True the function returns the set of features (explanatory variables)
48 | and the target variable
49 | Parameters
50 | ----------
51 | split_features_target: bool
52 |             Default value is False; if True, returns the set of features and the target variable.
53 | """
54 |         df = pd.read_csv(self.filename, index_col=False)
55 |         df = self.df = df.loc[:, ~df.columns.str.match(UNNAMED_COLNAME)]  # keep the cleaned frame for plotting utilities
56 | if split_features_target:
57 | self.y = df[self.col_target] # This returns a vector containing the target variable
58 | self.X = df.drop(self.col_target, axis=1) if self.col_to_drop is None else \
59 | df.drop([self.col_to_drop, self.col_target], axis=1)
60 | return self.X, self.y
61 | return df
62 |
63 | def split_train_test(self, test_size=0.3, random_seed=42):
64 | """
65 | This function splits the data into train and test set
66 | """
67 | X_train, X_test, y_train, y_test = train_test_split(
68 | self.X, self.y, test_size=test_size, random_state=random_seed
69 | )
70 | return X_train, X_test, y_train, y_test
71 |
72 | def plot_column_distribution(self, variable_name, title_plot, yticklabels):
73 | """
74 |         This is a graphical utility that plots the distribution of a variable
75 | """
76 | plt.figure(figsize=(8, 5))
77 | sns.set(font_scale=1.4)
78 | sns.heatmap(
79 | pd.DataFrame(self.df[variable_name].value_counts()),
80 | annot=True,
81 | fmt='g', cbar=False, cmap='Blues',
82 | annot_kws={"size": 20},
83 | yticklabels=yticklabels
84 | )
85 | plt.title(title_plot)
86 |
87 |
88 | class FinancialDataReader:
89 | # TODO: to be improved
90 |
91 | def __init__(self, stock_name, start_date, end_date):
92 | self.stock_name = stock_name
93 | self.start_date = start_date
94 | self.end_date = end_date
95 | self._validation_input()
96 |
97 | def _validation_input(self):
98 | if type(self.stock_name) is not str:
99 | raise ValueError('The stock name must be a string')
100 | if self.start_date > self.end_date:
101 | raise ValueError('The end date must be greater than the start date.')
102 |
103 | def __call__(self):
104 | df = yf.download(self.stock_name, start=self.start_date, end=self.end_date)
105 | return df
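
    # Illustrative usage (the ticker and dates below are just examples, mirroring
    # tests/data_ingestion/test_financial_datareader.py):
    #   reader = FinancialDataReader('AAPL', start_date='2023-01-02', end_date='2023-01-04')
    #   df = reader()  # OHLCV DataFrame downloaded via yfinance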
106 |
107 |
108 | class CryptoDataReader:
109 | """
110 | Parameters
111 | ----------
112 | crypto_name : string
113 | Cryptocurrency to download
114 | start_date: datetime, str
115 | Download start date string (YYYY-MM-DD) or _datetime.
116 | end_date: datetime, str
117 | Download end date string (YYYY-MM-DD) or _datetime.
118 | timeframe : str
119 | Valid timeframes: 1s,1m,3m,5m,15m,30m,1h,2h,4h,6h,8h,12h,1d
120 |
121 | Examples
122 | --------
123 | Using datetime objects:
124 |
125 | start_date = datetime.date(2022, 1, 1)
126 | end_date = datetime.date(2022, 12, 31)
127 |
128 | crypto = CryptoDataReader('BTCUSDT', start_date, end_date, '1d')
129 | data = crypto.get_data()
130 |
131 | Using date as string:
132 |
133 | start_date = '2022-06-30'
134 | end_date = '2023-03-31'
135 |
136 | crypto = CryptoDataReader('ADAUSDT', start_date, end_date, '1h')
137 | data = crypto.get_data()
138 | """
139 |
140 | def __init__(self, crypto_name, start_date, end_date, timeframe):
141 | self.crypto_name = crypto_name.upper()
142 | self.start_date = start_date
143 | self.end_date = end_date
144 | self.timeframe = timeframe
145 | self._validation_input()
146 |
147 | def _validation_input(self):
148 |
149 | valid_timeframe = ['1s', '1m', '3m', '5m', '15m', '30m', '1h', '2h', '4h', '6h', '8h', '12h', '1d']
150 | if self.timeframe not in valid_timeframe:
151 | raise ValueError(f'Timeframe := {self.timeframe} must be in ({", ".join(valid_timeframe)})')
152 |
153 | if isinstance(self.start_date, str) and isinstance(self.end_date, str):
154 | self.start_date = datetime.datetime.strptime(self.start_date, '%Y-%m-%d').date()
155 | self.end_date = datetime.datetime.strptime(self.end_date, '%Y-%m-%d').date()
156 |
157 | if self.start_date > self.end_date:
158 |             raise ValueError('The end date must be greater than the start date.')
159 |
160 | filepath = 'https://raw.githubusercontent.com/binance/binance-public-data/master/data/symbols.txt'
161 | tickers = pd.read_csv(filepath, header=None)
162 | if self.crypto_name not in tickers.values:
163 | raise ValueError(f'{self.crypto_name} is not a valid ticker. Check the available tickers at {filepath}.')
164 |
165 |     @staticmethod
166 |     def _check_connection() -> bool:
167 |         try:
168 |             requests.head('http://www.google.com', timeout=5)
169 |             return True
170 |         except requests.exceptions.RequestException:
171 |             return False
172 |
173 | def _get_url(self, date: datetime, type: str) -> str:
174 | """ Create the url from where download data """
175 | year, month, day = date.year, date.strftime('%m'), date.strftime('%d')
176 |
177 | URL = f"https://data.binance.vision/data/spot/" \
178 | f"{type}/klines/{self.crypto_name}/{self.timeframe}/{self.crypto_name}-{self.timeframe}-{year}-{month}"
179 | return URL + ".zip" if type == 'monthly' else URL + f"-{day}.zip"
180 |
181 | def _download_data(self, date: datetime, type: str) -> Union[pd.DataFrame, bool]:
182 |
183 | url = self._get_url(date, type=type)
184 | with requests.get(url) as response:
185 |
186 | if response.status_code == 404:
187 | return False
188 |
189 | else:
190 | zipfile = ZipFile(BytesIO(response.content))
191 | with zipfile.open(zipfile.namelist()[0]) as file_in:
192 | download = pd.read_csv(file_in,
193 | usecols=[0, 1, 2, 3, 4, 5, 8, 9],
194 | header=None,
195 | names=['Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Trades', 'Buy_volume'])
196 | return download
197 |
198 | @staticmethod
199 | def last_day_of_month(date: datetime) -> datetime:
200 | return date.replace(day=monthrange(date.year, date.month)[1])
201 |
202 | def _get_dates_to_download(self) -> list:
203 | adjusted_end_date = self.last_day_of_month(self.end_date)
204 | dates_monthly = [(date, 'monthly') for date in pd.date_range(self.start_date, adjusted_end_date, freq='M')]
205 | dates_daily = [(date, 'daily') for date in pd.date_range(datetime.date.today().replace(day=1), datetime.date.today(), freq='D')
206 | if adjusted_end_date >= datetime.date.today()]
207 |
208 | return dates_monthly + dates_daily
209 |
210 | def get_data(self) -> pd.DataFrame:
211 |
212 | if not self._check_connection():
213 | raise OSError('No connection available')
214 |
215 | data = pd.DataFrame(columns=['Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Trades', 'Buy_volume'])
216 |
217 | with ThreadPoolExecutor(max_workers=10) as exe:
218 |
219 | dates_to_download = self._get_dates_to_download()
220 | futures = [exe.submit(self._download_data, date, type) for date, type in dates_to_download]
221 |
222 | for future in as_completed(futures):
223 | output = future.result()
224 | if isinstance(output, pd.DataFrame):
225 | data = pd.concat([data, output], axis=0, join='inner')
226 |
227 | data.drop_duplicates(subset=['Time'], inplace=True)
228 | data.Time = pd.to_datetime(data.Time, unit='ms')
229 | data.set_index(keys='Time', inplace=True)
230 | data.sort_index(inplace=True)
231 | data = data.loc[self.start_date:datetime.datetime.combine(self.end_date, datetime.time(23, 59, 59))]
232 |
233 | if data.index.min().date() != self.start_date or data.index.max().date() != self.end_date:
234 | logging.info(f'Data for {self.crypto_name} is only available '
235 | f'from {data.index.min().date()} to {data.index.max().date()}')
236 |
237 | return data
238 |
--------------------------------------------------------------------------------
/egeaML/egeaML.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | from __future__ import print_function
3 | import re
4 | import os
5 | import wget
6 | import pandas as pd
7 | import numpy as np
8 | import xgboost as xgb
9 | import gensim
10 |
11 | import matplotlib.pyplot as plt
12 | import seaborn as sns
13 | from matplotlib.colors import ListedColormap
14 | plt.style.use('ggplot')
15 |
16 |
17 | from sklearn.linear_model import LogisticRegression
18 | from sklearn.svm import SVC
19 | from sklearn.decomposition import PCA
20 | from sklearn.tree import DecisionTreeClassifier
21 | from sklearn.model_selection import GridSearchCV, train_test_split
22 | from sklearn.neighbors import KNeighborsClassifier
23 | from sklearn.metrics import confusion_matrix, accuracy_score, precision_recall_curve
24 |
25 |
26 | from gensim.parsing.preprocessing import strip_punctuation
27 | from gensim.parsing.preprocessing import remove_stopwords
28 | from gensim.parsing.preprocessing import strip_multiple_whitespaces
29 | from gensim import models
30 | from gensim.models.doc2vec import TaggedDocument
31 |
32 |
33 | from nltk.tokenize import word_tokenize
34 | from nltk.stem.wordnet import WordNetLemmatizer
35 | from nltk.stem.snowball import SnowballStemmer
36 |
37 |
38 | class ModelFitting:
39 |
40 | def __init__(self, n):
41 | self.n = n
42 |
43 | self.lr = LogisticRegression()
44 | self.dt = DecisionTreeClassifier()
45 | self.svc = SVC()
46 |
47 | self.my_dict = dict()
48 | self.abb_list = list()
49 | self.names_list = list()
50 | self.raw_scikit_models = list()
51 | self.clfs = None
52 |
53 | def models_def(self, **kwargs):
54 | self.my_dict[kwargs['model_one']] = kwargs['abb1']
55 | self.my_dict[kwargs['model_two']] = kwargs['abb2']
56 | self.my_dict[kwargs['model_three']] = kwargs['abb3']
57 | return self.my_dict
58 |
59 | def get_models(self,**kwargs):
60 | my_dict = kwargs['models_dict']
61 | my_list = [kwargs['model_one'], kwargs['model_two'],
62 | kwargs['model_three']]
63 | for name,abb in my_dict.items():
64 | self.abb_list.append(abb)
65 | self.names_list.append(name)
66 | for i in my_list:
67 | self.raw_scikit_models.append(i+'()')
68 | scikit_models_list = [self.lr,self.dt,self.svc]
69 | self.clfs = list(zip(self.names_list,scikit_models_list))
70 | return self.clfs
71 |
72 | def fitting_models(self,models, X_train,y_train,X_test,y_test):
73 | for name,clf in models:
74 | clf_ = clf
75 | clf_.fit(X_train,y_train)
76 | y_pred = clf_.predict(X_test)
77 | score = format(accuracy_score(y_test,y_pred), '.4f')
78 | print("{} : {}".format(name,score))
79 |
80 |
81 | class DataUtils:
82 |
83 | def __init__(self):
84 | pass
85 |
86 | def download_data(self, foldername, urls, directory='./data' ):
87 | """
88 | This function download an online available dataset.
89 | The dataset is going to be saved on a folder in the current working directory
90 | Parameters
91 | ----------
92 | foldername: str
93 |             Folder name where the downloaded data is going to be saved.
94 | Example: 'IMDb'
95 |
96 |         urls: list
97 | List of urls
98 |
99 | directory: str
100 | Directory where the dataset will be stored
101 |
102 | Examples
103 | -------
104 |         > foldername = 'IMDb'
105 |         > download_data(foldername=foldername, urls=[url_imdb], directory='./data_folder')
106 |
107 | """
108 | print("\nDownloading data...")
109 | directory = os.path.expanduser(directory)
110 | try:
111 | if not os.path.exists(directory):
112 | os.makedirs(directory)
113 | except:
114 | print("The " + directory + " folder already exists!")
115 |
116 | dir_data = os.path.join(directory,foldername)
117 | try:
118 | if not os.path.exists(dir_data):
119 | os.makedirs(dir_data)
120 | except:
121 | print("The " + foldername + " folder already exists!")
122 |
123 | for dataurl in urls:
124 | file = dataurl.split("/")[-1]
125 | if os.path.exists(os.path.join(dir_data, file)):
126 | print(file, "already downloaded")
127 | else:
128 | wget.download(url=dataurl, out=dir_data)
129 | print("\nDownload Finished")
130 |
131 |
132 | class LossUtils:
133 |
134 | def __init__(self, data):
135 | self.data = data
136 |
137 | def huber_loss(self,c=3):
138 | return ((abs(self.data) < c) * 0.5 * self.data ** 2 + (abs(self.data) >= c) * -c * (0.5 * c - abs(self.data)))
139 |
140 | def logistic_loss(self):
141 | return np.log(1+np.exp(-self.data))
142 |
143 | def hinge_loss(self):
144 | return np.maximum(1 - self.data, 0)
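
    # Illustrative usage sketch (the value grid is an example; plot_loss is defined in PlottingUtils below):
    #   grid = np.linspace(-2, 2, 1000)
    #   losses = LossUtils(grid)
    #   PlottingUtils.plot_loss(grid, losses.logistic_loss(), losses.hinge_loss(),
    #                           'Logistic loss', 'Hinge loss', 'logistic', 'hinge')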
145 |
146 |
147 | class PlottingUtils:
148 |
149 | def plot_pca(X):
150 |
151 | def draw_vector(v0, v1, ax=None):
152 | ax = ax or plt.gca()
153 | arrowprops=dict(facecolor='black',
154 | arrowstyle='->',
155 | linewidth=2,
156 | shrinkA=0, shrinkB=0)
157 | ax.annotate('', v1, v0, arrowprops=arrowprops)
158 |
159 | pca = PCA(n_components=2, whiten=True)
160 | pca.fit(X)
161 | fig, ax = plt.subplots(1, 2, figsize=(16, 6))
162 | fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
163 | # plot data
164 | ax[0].scatter(X[:, 0], X[:, 1], alpha=0.2)
165 | for length, vector in zip(pca.explained_variance_,
166 | pca.components_):
167 | v = vector * 3 * np.sqrt(length)
168 | draw_vector(pca.mean_, pca.mean_ + v, ax=ax[0])
169 | ax[0].axis('equal');
170 | ax[0].set(xlabel='x', ylabel='y', title='input')
171 |
172 | # plot principal components
173 | X_pca = pca.transform(X)
174 | ax[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2)
175 | draw_vector([0, 0], [0, 3], ax=ax[1])
176 | draw_vector([0, 0], [3, 0], ax=ax[1])
177 | ax[1].axis('equal')
178 | ax[1].set(xlabel='component 1', ylabel='component 2',
179 | title='principal components',
180 | xlim=(-5, 5), ylim=(-3, 3.1))
181 |
182 |
183 | def plot_loss(data, model1, model2, model1_name, model2_name, model1_abb, model2_abb, xlim=None, ylim=None):
184 | plt.plot(data, model1, label=model1_name, linestyle='-')
185 | plt.plot(data, model2, label=model2_name, linestyle=':')
186 | plt.ylabel("Loss")
187 | plt.xlabel("Raw Model Output: $y - f(x)$")
188 | plt.legend([model1_abb, model2_abb])
189 | plt.xlim(xlim)
190 | plt.ylim(ylim)
191 | plt.show()
192 |
193 |
194 | class classification_plots:
195 |
196 | @staticmethod
197 | def training_class(X,y,test_size=0.3):
198 | """
199 |         This function plots a 2-dim training set,
200 | and each point is labelled by the class it belongs to.
201 | The arguments are as follows:
202 | - X: 2-dim set of features;
203 | - y: 1-dim target label;
204 | - test_size: equal to 0.3 by default.
205 |             It can take any number between 0 and 1
206 | """
207 | X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=test_size,\
208 | random_state=42)
209 | df = pd.DataFrame(dict(height=X_train.iloc[:,1], weight=X_train.iloc[:,0],\
210 | label=y_train))
211 | colors = {0:'red', 1:'blue'}
212 | fig, ax = plt.subplots()
213 | grouped = df.groupby('label')
214 | for key, group in grouped:
215 | group.plot(ax=ax, kind='scatter', x='height', y='weight', label=key, \
216 | color=colors[key],figsize=(8, 5))
217 | plt.legend(["Training Class Female", "Training Class Man"],fontsize=10)
218 | plt.show()
219 |
220 | @staticmethod
221 | def knn_class(X,y,test_size=0.3):
222 | """
223 |         This function fits a k-nearest neighbors classifier and provides the
224 |         visualization of the prediction results with respect to the target.
225 | """
226 | X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=test_size,random_state=42)
227 | df = pd.DataFrame(dict(height=X_train.iloc[:,1], weight=X_train.iloc[:,0],
228 | label=y_train))
229 | colors = {0:'red', 1:'blue'}
230 | fig, ax = plt.subplots()
231 | grouped = df.groupby('label')
232 | for key, group in grouped:
233 | group.plot(ax=ax, kind='scatter', x='weight', y='height', label=key,
234 | color=colors[key],figsize=(5, 5))
235 | clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
236 | y_pred = clf.predict(X_test)
237 | df_ = pd.DataFrame(dict(height=X_test.iloc[:,1], weight=X_test.iloc[:,0],
238 | label=y_pred))
239 | colors = {0:'orange', 1:'green'}
240 | grouped_ = df_.groupby('label')
241 | for key, group in grouped_:
242 | group.plot(ax=ax, kind='scatter', x='weight', y='height', label=key,
243 | color=colors[key],figsize=(5, 5))
244 | plt.xlabel('Height', fontsize=14)
245 | plt.ylabel('Weight', fontsize=14)
246 | plt.legend(["Training Female", "Training Man", "Test Pred Female", "Test Pred Man"]
247 | ,fontsize=10)
248 | plt.show()
249 |
250 | @staticmethod
251 | def plotting_prediction(X_train,X_test,y_train,y_test,nn):
252 | """
253 | This function plots the test set points labelled with the predicted value.
254 | The parameter nn stands for the Number of Neighbors
255 | """
256 | plt.style.use('ggplot')
257 | plt.figure(figsize=(5,5))
258 |
259 | clf = KNeighborsClassifier(n_neighbors=nn).fit(X_train, y_train)
260 | y_pred = clf.predict(X_test)
261 |
262 | colors = ['lime' if i else 'yellow' for i in y_test]
263 | ps = clf.predict_proba(X_test)[:,1]
264 | errs = ((ps < 0.5) & y_test) |((ps >= 0.5) & (1-y_test))
265 | plt.scatter(X_test.weight[errs], X_test.height[errs], facecolors='red', s=150)
266 | plt.scatter(X_test.weight, X_test.height,
267 | facecolors=colors, edgecolors='k', s=50, alpha=1)
268 | plt.xlabel('Height', fontsize=14)
269 | plt.ylabel('Weight', fontsize=14)
270 | plt.tight_layout()
271 |
272 |
273 | @staticmethod
274 | def confusion_matrix(y_test, y_pred, cmap=None, xticklabels=None, yticklabels=None):
275 | """
276 | This function generates a confusion matrix, which is used as a
277 | summary to evaluate a Classification predictor.
278 | The arguments are:
279 | - y_test: the true labels;
280 | - y_pred: the predicted labels;
281 | - cmap: it is the palette used to color the confusion matrix.
282 | The available options are:
283 | - cmap="YlGnBu"
284 | - cmap="Blues"
285 | - cmap="BuPu"
286 | - cmap="Greens"
287 | Please refer to the notebook available on the book repo
288 | Miscellaneous/setting_CMAP_argument_matplotlib.ipynb
289 | for further details.
290 | - xticklabels: list
291 | description of x-axis label;
292 | - yticklabels: list
293 | description of y-axis label
294 | """
295 | mat = confusion_matrix(y_test, y_pred)
296 | if not xticklabels:
297 | sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap=cmap, annot_kws={"size": 12})
298 | plt.xlabel('True label')
299 | plt.ylabel('Predicted label')
300 | plt.show()
301 | else:
302 | sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap=cmap, annot_kws={"size": 12},
303 | xticklabels=xticklabels, yticklabels=yticklabels)
304 | plt.xlabel('True label')
305 | plt.ylabel('Predicted label')
306 | plt.show()
307 |
308 | @staticmethod
309 | def plot_precision_recall(y_test, y_pred):
310 | """
311 | Precision/Recall Curve
312 | Parameters:
313 | - y_test: the true test labels
314 | - y_pred: the predicted labels
315 | """
316 | precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
317 | plt.figure(figsize=(8,6))
318 | plt.plot(recall, precision, lw=2, color='navy')
319 | plt.title('Precision-Recall curve on the Diabetes dataset')
320 | plt.ylabel('Precision')
321 | plt.xlabel('Recall')
322 |         plt.xlim([0.0, 1.05])
323 |         plt.ylim([0.0, 1.05])
324 | plt.show()
325 |
326 | @staticmethod
327 | def knn_boundaries(X_train,X_test,y_train,y_test,n_neighbors):
328 | """
329 | This Function provides the boundaries for a k-Neigh classifier
330 | """
331 | # Create color maps
332 | cmap_bold = ListedColormap(['#FF3333', '#3333FF'])
333 | cmap_light = ListedColormap(['#e6eff0', '#096b76'])# (['#FF9999', '#9999FF'])
334 | clf = KNeighborsClassifier(n_neighbors=n_neighbors)
335 | clf.fit(X_train, y_train)
336 |
337 | # Plot the decision boundary. For that, we will assign a color to each
338 | # point in the mesh [x_min, x_max]x[y_min, y_max].
339 | x_min, x_max = X_test.iloc[:, 0].min() - 1, X_test.iloc[:, 0].max() + 1
340 | y_min, y_max = X_test.iloc[:, 1].min() - 1, X_test.iloc[:, 1].max() + 1
341 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
342 | np.arange(y_min, y_max, 0.02))
343 | Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
344 |
345 | # Put the result into a color plot
346 | Z = Z.reshape(xx.shape)
347 | plt.figure(figsize=(8, 6))
348 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light, linewidths=40)
349 |
350 |
351 | # Plot also the training points
352 | plt.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=y_test,
353 | cmap=cmap_bold, linewidths=3)
354 | plt.xlim(xx.min(), xx.max())
355 | plt.ylim(yy.min(), yy.max())
356 | plt.title("Binary classification (k = %i)"
357 | % (n_neighbors))
358 | plt.show()
359 |
360 | # def scaling_plot():
361 | # import mglearn
362 | # mglearn.plots.plot_scaling()
363 |
364 | @staticmethod
365 | def plot_hist(data,features_name,target_name):
366 | data = pd.DataFrame(data, columns=features_name)
367 | plt.figure(figsize=(20, 16))
368 | features = list(data)
369 | for i, col in enumerate(features):
370 |             plt.subplot(3, (len(features) + 1) // 2, i + 1)
371 | x = data[col]
372 | plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
373 | plt.title(col)
374 | plt.xlabel(col)
375 | plt.ylabel(target_name)
376 |
377 | @staticmethod
378 | def plot_svc_decision_function(model, ax=None, plot_support=True):
379 | """Plot the decision function for a 2D SVC"""
380 | if ax is None:
381 | ax = plt.gca()
382 | xlim = ax.get_xlim()
383 | ylim = ax.get_ylim()
384 |
385 | # create grid to evaluate model
386 | x = np.linspace(xlim[0], xlim[1], 30)
387 | y = np.linspace(ylim[0], ylim[1], 30)
388 | Y, X = np.meshgrid(y, x)
389 | xy = np.vstack([X.ravel(), Y.ravel()]).T
390 | P = model.decision_function(xy).reshape(X.shape)
391 |
392 | # plot decision boundary and margins
393 | ax.contour(X, Y, P, colors='k',
394 | levels=[-1, 0, 1], alpha=0.5,
395 | linestyles=['--', '-', '--'])
396 |
397 | # plot support vectors
398 | if plot_support:
399 | ax.scatter(model.support_vectors_[:, 0],
400 | model.support_vectors_[:, 1],
401 | s=300, linewidth=1, facecolors='none')
402 | ax.set_xlim(xlim)
403 | ax.set_ylim(ylim)
404 |
405 | @staticmethod
406 | def plot_svc_regularization_effect(X,y,kernel,cmap):
407 | fig, ax = plt.subplots(1, 2, figsize=(16, 6))
408 | fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
409 | for ax, C in zip(ax, [100.0, 0.1]):
410 | model = SVC(kernel=kernel, C=C).fit(X, y)
411 | ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap=cmap)
412 | classification_plots.plot_svc_decision_function(model, ax)
413 | ax.scatter(model.support_vectors_[:, 0],model.support_vectors_[:, 1],
414 | s=300, lw=1, facecolors='none')
415 | ax.set_title('C = {0:.1f}'.format(C), size=14)
416 | plt.show()
417 |
418 |
419 | class EgeaMLXGBoost:
420 | def fitting(X,y,param_grid,n_jobs,cv):
421 | clf_xgb = xgb.XGBClassifier(n_jobs=n_jobs, objective="binary:logistic")
422 | clf = GridSearchCV(clf_xgb, param_grid=param_grid, verbose=1, cv=cv)
423 | model = clf.fit(X, y)
424 | return model
425 |
426 | def checking_overfitting(X_train,y_train,learning_rate, n_estimators):
427 | model__ = xgb.XGBClassifier()
428 | param_grid_ = dict(learning_rate=learning_rate,
429 | n_estimators=n_estimators)
430 | grid_search = GridSearchCV(model__, param_grid_,
431 | scoring="neg_log_loss",
432 | n_jobs=-1,
433 | cv=10)
434 | grid_result = grid_search.fit(X_train, y_train)
435 | print("Best Log Score: %f using %s" % (grid_result.best_score_,
436 | grid_result.best_params_))
437 | means = grid_result.cv_results_['mean_test_score']
438 | stds = grid_result.cv_results_['std_test_score']
439 | params = grid_result.cv_results_['params']
440 | scores = np.array(means).reshape(len(learning_rate),
441 | len(n_estimators))
442 | for i, value in enumerate(learning_rate):
443 | plt.plot(n_estimators, scores[i],label='learning_rate: ' + str(value))
444 | plt.legend()
445 | plt.xlabel('n_estimators')
446 | plt.ylabel('Log Loss')
447 | plt.savefig('n_estimators_vs_learning_rate.png')
448 |
449 |
450 | class EgeaNLP:
451 |
452 | @staticmethod
453 | def clean_text(text):
454 | new_string = []
455 | for word in gensim.utils.simple_preprocess(text):
456 | if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 2:
457 | stem_ = SnowballStemmer('english')
458 | lemma = WordNetLemmatizer()
459 | new = stem_.stem(lemma.lemmatize(word, pos='v'))
460 | new_string.append(new)
461 | return new_string
462 |
463 | @staticmethod
464 | def simple_tokenization(doc):
465 | """This function performs simple tokenization"""
466 | tok = re.findall('(\\w+)', doc.lower())
467 | docs = ' '.join(tok)
468 | return word_tokenize(docs)
469 |
470 | @staticmethod
471 | def parsing_text(doc):
472 | """This function removes stopwords and puctuaction"""
473 | return strip_multiple_whitespaces(strip_punctuation(remove_stopwords(doc.lower())))
474 |
475 |
476 | @staticmethod
477 | def top_words(corpus, dictionary, doc, n_words=5):
478 | """
479 |         This function returns the n most important words in a particular document
480 |         ----------
481 |         Params:
482 |             corpus: obj
483 |                 The gensim corpus used to build the TF-IDF model
484 | dictionary: obj
485 | This is the gensim dictionary
486 | doc: obj
487 | The specific corpus we wish to extract the top words
488 | n_words: int
489 | Number of top words to be shown
490 | """
491 | tfidf = models.TfidfModel(corpus)
492 | tf_obj = tfidf[doc]
493 |         sorted_obj = sorted(tf_obj, key=lambda x: x[1], reverse=True)
494 |         top_words = list()
495 |         for obj in sorted_obj[:n_words]:
496 | top_words.append("{0:s} ({1:01.03f})".format(dictionary[obj[0]], obj[1]))
497 | return top_words
498 |
499 | @staticmethod
500 | def analogy(model, x1, x2, y1):
501 | result = model.most_similar(positive=[y1, x2], negative=[x1])
502 | return result[0][0]
503 |
504 | @staticmethod
505 | def display_similarity(model, words=None, sample=0):
506 | """
507 |         This method plots the desired list of words in a 2-dim cartesian plane
508 | based on their similarity.
509 |
510 | """
511 |         if words is None:
512 | if sample > 0:
513 | words = np.random.choice(list(model.vocab.keys()), sample)
514 | else:
515 | words = [ word for word in model.vocab ]
516 |
517 | word_vectors = np.array([model[w] for w in words])
518 | pca = PCA().fit_transform(word_vectors)[:,:2]
519 | shift = 0.1
520 | plt.figure(figsize=(8,6))
521 | plt.scatter(pca[:,0], pca[:,1], edgecolors='k', c='b')
522 | for w, (x1,x2) in zip(words, pca):
523 | plt.text(x1+shift, x2+shift, w)
524 | plt.xlabel('Component 1')
525 | plt.ylabel('Component 2')
526 |
527 | @staticmethod
528 | def tagging_doc2vec(docs):
529 | """
530 | This function prepares tagged documents for the Doc2vec model
531 | """
532 | mylist = list()
533 | for i,s in enumerate(docs):
534 | mylist.append(TaggedDocument(s, [i]))
535 | return mylist
536 |
537 |
538 | class EgeaNN:
539 |
540 | def __init__(self):
541 | pass
542 |
543 | def plot_data(self, X, y):
544 | """
545 | This function plots the raw data
546 | """
547 | plt.figure(figsize = (8, 6))
548 | plt.plot(X[y==0, 0], X[y==0, 1], 'or', alpha=0.5, label=0)
549 | plt.plot(X[y==1, 0], X[y==1, 1], 'ob', alpha=0.5, label=1, marker="^")
550 | plt.xlim((min(X[:, 0])-0.1, max(X[:, 0])+0.1))
551 | plt.ylim((min(X[:, 1])-0.1, max(X[:, 1])+0.1))
552 | plt.legend()
553 |
554 | def make_multiclass(self, n=500, d=2, k=3):
555 | """
556 | parameters:
557 | n: # points per class
558 | d: #dimensionality
559 | k: # of classes
560 | """
561 |
562 | np.random.seed(0)
563 | X = np.zeros((n*k, d))
564 | y = np.zeros(n*k)
565 | for j in range(k):
566 | ix = range(n*j, n*(j+1))
567 | r = np.linspace(0.0,1,n)
568 | t = np.linspace(j*4,(j+1)*4,n) + np.random.randn(n)*0.2
569 | X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
570 | y[ix] = j
571 | fig = plt.figure(figsize=(6, 6))
572 | plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap='RdYlBu', alpha=0.8)
573 | plt.xlim([-1,1])
574 | plt.ylim([-1,1])
575 | return X, y
576 |
577 | def plot_decision_boundary(self, func, X, y):
578 | figsize=(6, 6)
579 | amin, bmin = X.min(axis=0) - 0.1
580 | amax, bmax = X.max(axis=0) + 0.1
581 | hticks = np.linspace(amin, amax, 101)
582 | vticks = np.linspace(bmin, bmax, 101)
583 |
584 | aa, bb = np.meshgrid(hticks, vticks)
585 | ab = np.c_[aa.ravel(), bb.ravel()]
586 | c = func(ab)
587 | cc = c.reshape(aa.shape)
588 |
589 | cm = 'RdBu'
590 | cm_bright = ListedColormap(['#FF0000', '#0000FF'])
591 |
592 | fig, ax = plt.subplots(figsize=figsize)
593 | contour = plt.contourf(aa, bb, cc, cmap=cm, alpha=0.8)
594 |
595 | ax_c = fig.colorbar(contour)
596 | ax_c.set_label("$P(y = 1)$")
597 | ax_c.set_ticks([0, 0.25, 0.5, 0.75, 1])
598 |
599 | plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
600 | plt.xlim(amin, amax)
601 | plt.ylim(bmin, bmax)
602 |
603 | def plot_loss_accuracy(self, history):
604 | historydf = pd.DataFrame(history.history, index=history.epoch)
605 | plt.figure(figsize=(10, 6))
606 | historydf.plot(ylim=(0, max(1, historydf.values.max())),
607 | style=['+-','.-'] )
608 | loss = history.history['loss'][-1]
609 | acc = history.history['acc'][-1]
610 | plt.title('Loss: %.3f, Accuracy: %.3f' % (loss, acc))
611 |
612 | def plot_multiclass_decision_boundary(self, model, X, y):
613 | x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
614 | y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
615 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 101),
616 | np.linspace(y_min, y_max, 101))
617 | cmap = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
618 |
619 | Z = model.predict_classes(np.c_[xx.ravel(), yy.ravel()],
620 | verbose=0)
621 | Z = Z.reshape(xx.shape)
622 | fig = plt.figure(figsize=(8, 8))
623 | plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)
624 | plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap='RdYlBu')
625 | plt.xlim(xx.min(), xx.max())
626 | plt.ylim(yy.min(), yy.max())
627 |
--------------------------------------------------------------------------------
/egeaML/preprocessing.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 |
3 | from typing import List
4 |
5 |
6 | class Preprocessing:
7 | """
8 | This class is aimed at facilitating preprocessing.
9 |     It is made of two main methods: imputing null values,
10 |     and encoding the categorical columns.
11 | """
12 | def __init__(self, columns: List, X: pd.DataFrame):
13 | self.categorical_cols = columns
14 | self.X = X
15 | self.df = None
16 |
17 | def simple_imputer(self) -> pd.DataFrame:
18 | """
19 | This function replaces null values in the input DataFrame with mode for object columns
20 | and median for numeric columns.
21 | """
22 | summary_stats = self.X.select_dtypes(include=['object']).mode().to_dict(orient='records')[0]
23 | summary_stats.update(self.X.select_dtypes(exclude=['object']).median().to_dict())
24 | self.X.fillna(value=summary_stats, inplace=True)
25 | return self.X
26 |
27 | def dummization(self):
28 | """
29 |         This function performs dummy (one-hot) encoding of the categorical columns
30 | """
31 | #TODO: use sklearn ColumnTransformer instead
32 |
33 | return pd.get_dummies(
34 | self.simple_imputer(),
35 | prefix_sep='_',
36 | prefix=self.categorical_cols,
37 | columns=self.categorical_cols,
38 | drop_first=False
39 | )
40 |
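# Illustrative usage sketch (the column names and values below are hypothetical):
#   X = pd.DataFrame({'age': [30.0, None, 25.0], 'city': ['Rome', 'Milan', None]})
#   pre = Preprocessing(columns=['city'], X=X)
#   X_imputed = pre.simple_imputer()  # mode for object columns, median for numeric ones
#   X_encoded = pre.dummization()     # one-hot encoding of the categorical columns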
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | catboost==1.1.1
2 | coverage==4.5.4
3 | gensim==4.3.1
4 | imbalanced-learn==0.5.0
5 | yfinance==0.2.9
6 | matplotlib==3.7.1
7 | nltk==3.8.1
8 | pandas==1.5.3
9 | pytest==7.2.0
10 | requests==2.28.0
11 | seaborn==0.9.0
12 | shap==0.41.0
13 | scikit-learn==1.2.2
14 | tensorflow==2.11.0
15 | xgboost==1.7.4
16 | wget==3.2
17 |
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [bumpversion]
2 | current_version = 1.0.2
3 | commit = True
4 | tag = True
5 |
6 | [bumpversion:file:setup.py]
7 | search = version='{current_version}'
8 | replace = version='{new_version}'
9 |
10 | [bdist_wheel]
11 | universal = 1
12 |
13 | [metadata]
14 | description_file=README.md
15 | license_files=LICENSE
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | setup(
4 | name="egeaML",
5 | version="1.0.2",
6 | author="Andrea Giussani",
7 | author_email="andrea.giussani@unibocconi.it",
8 |     description=("A python library used in support of the Book "
9 | "'Applied Machine Learning with Python'"),
10 | url="https://github.com/andreagiussani/Applied_Machine_Learning_with_Python",
11 | license="BSD",
12 | packages=find_packages(),
13 | install_requires=[
14 | 'pandas==1.5.3', 'scikit-learn==1.2.2',
15 | 'shap==0.41.0', 'catboost==1.1.1',
16 | 'gensim==4.3.1', 'nltk==3.8.1',
17 | 'matplotlib==3.7.1', 'seaborn==0.9.0', 'wget==3.2',
18 | 'imbalanced-learn==0.5.0', 'tensorflow==2.11.0',
19 | 'xgboost==1.7.4', 'yfinance==0.2.9'
20 | ],
21 | include_package_data=True,
22 | )
23 |
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andreagiussani/Applied_Machine_Learning_with_Python/89966d54faf344cf90df55532bb1541f64461686/tests/__init__.py
--------------------------------------------------------------------------------
/tests/data_ingestion/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andreagiussani/Applied_Machine_Learning_with_Python/89966d54faf344cf90df55532bb1541f64461686/tests/data_ingestion/__init__.py
--------------------------------------------------------------------------------
/tests/data_ingestion/fixture.py:
--------------------------------------------------------------------------------
1 | from io import StringIO
2 |
3 |
4 | def get_mocked_string_csv():
5 | return StringIO(
6 | """col1,col2,col3,y
7 | 1,4.4,99,1
8 | 2,4.5,200,1
9 | 3,4.7,65,0
10 | 4,1.5,140,0"""
11 | )
12 |
--------------------------------------------------------------------------------
/tests/data_ingestion/test_base.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import pandas as pd
3 |
4 | from egeaML.datareader import DataReader
5 | from tests.data_ingestion.fixture import get_mocked_string_csv
6 |
7 |
8 | class DataIngestionTestCase(unittest.TestCase):
9 |
10 | def setUp(self):
11 | self.col_target = 'y'
12 | self.filename = get_mocked_string_csv()
13 | self.columns = ['col1', 'col2', 'col3', 'y']
14 | self.raw_data = DataReader(filename=self.filename, col_target=self.col_target)
15 |
16 | def test__load_dataframe(self):
17 | df = self.raw_data()
18 | self.assertIsInstance(df, pd.DataFrame)
19 | self.assertEqual(df.shape[0], 4)
20 | self.assertEqual(df.shape[1], 4)
21 | self.assertListEqual(list(df), self.columns)
22 |
--------------------------------------------------------------------------------
/tests/data_ingestion/test_financial_datareader.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import pandas as pd
3 |
4 | from unittest.mock import MagicMock
5 | from egeaML.datareader import FinancialDataReader
6 |
7 |
8 | class FinancialDataReaderTestCase(unittest.TestCase):
9 |
10 | def setUp(self):
11 | self.columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
12 | self.raw_data = FinancialDataReader("AAPL", start_date="2023-01-02", end_date="2023-01-04")
13 | self.raw_data = MagicMock(return_value=pd.DataFrame(
14 | [
15 | [130.279999, 130.899994, 124.169998, 125.070000, 124.879326, 112117500],
16 | [127.279999, 128, 122.879326, 124.970000, 124.5, 102115400],
17 | ],
18 | columns=self.columns
19 | ))
20 |
21 | def test__load_dataframe(self):
22 | df = self.raw_data()
23 | self.assertIsInstance(df, pd.DataFrame)
24 | self.assertEqual(df.shape[0], 2)
25 | self.assertEqual(df.shape[1], 6)
26 | self.assertListEqual(list(df), self.columns)
27 |
--------------------------------------------------------------------------------
/tests/preprocessing/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andreagiussani/Applied_Machine_Learning_with_Python/89966d54faf344cf90df55532bb1541f64461686/tests/preprocessing/__init__.py
--------------------------------------------------------------------------------
/tests/preprocessing/test_imputation.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import pandas as pd
3 | import numpy as np
4 |
5 | from egeaML.preprocessing import Preprocessing
6 |
7 |
8 | class ImputationTestCase(unittest.TestCase):
9 |
10 | def setUp(self):
11 | self.X = pd.DataFrame(
12 | {
13 | 'col1': [1, 6, np.nan, 5],
14 | 'col2': [100, np.nan, np.nan, 30],
15 | 'col3': ['iphone', 'iphone', np.nan, 'pixel']
16 | }
17 | )
18 | self.transformer = Preprocessing(columns=None, X=self.X)
19 |
20 | def test__impute_null_values(self):
21 | df = self.transformer.simple_imputer()
22 | self.assertIsInstance(df, pd.DataFrame)
23 | self.assertEqual(df[~df['col2'].isna()].shape[0], 4)
24 | self.assertEqual(df.loc[2, 'col3'], 'iphone')
25 | self.assertEqual(df.loc[2, 'col1'], 5)
26 |
--------------------------------------------------------------------------------