├── .github
│   └── workflows
│       └── ci.yml
├── LICENSE
├── README.md
├── dev-requirements.txt
├── pyproject.toml
├── requirements.txt
├── setup.py
└── toraniko
    ├── __init__.py
    ├── math.py
    ├── model.py
    ├── styles.py
    ├── tests
    │   ├── __init__.py
    │   ├── test_math.py
    │   ├── test_model.py
    │   └── test_utils.py
    └── utils.py

/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
1 | name: CI
2 | 
3 | on:
4 |   pull_request:
5 |     branches:
6 |       - main
7 | 
8 | defaults:
9 |   run:
10 |     working-directory: toraniko
11 |     shell: bash
12 | 
13 | jobs:
14 |   test:
15 |     runs-on: ubuntu-latest
16 | 
17 |     steps:
18 |       - name: Checkout
19 |         uses: actions/checkout@v3
20 | 
21 |       - name: Set Python
22 |         uses: actions/setup-python@v4
23 |         with:
24 |           python-version: '3.10'
25 | 
26 |       - name: Install deps
27 |         run: |
28 |           python -m pip install --upgrade pip
29 |           pip install -r ../dev-requirements.txt
30 | 
31 |       - name: Run tests
32 |         run: |
33 |           pytest tests/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 0xfdf@implies.com
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # toraniko
2 | 
3 | Toraniko is a complete implementation of a risk model suitable for quantitative and systematic trading at institutional scale. In particular, it is a characteristic factor model in the same vein as Barra and Axioma (in fact, given the same datasets, it approximately reproduces Barra's estimated factor returns).
4 | 
5 | ![mom_factor](https://github.com/user-attachments/assets/f9d2927c-e899-4fd6-944c-8f9a104b410f)
6 | 
7 | Using this library, you can create new custom factors and estimate their returns. Then you can estimate a factor covariance matrix suitable for portfolio optimization with factor exposure constraints (e.g. to maintain a market neutral portfolio).
8 | 
9 | The only dependencies are numpy and polars. It supports market, sector and style factors; three styles are included: value, size and momentum. The library also comes with generalizable math and data cleaning utility functions you'd want to have for constructing more style factors (or custom fundamental factors of any kind).
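For example, here is a minimal sketch of scoring a custom style factor with those utilities. This snippet is illustrative only and not part of the package: `feat_df` and its `my_signal` column are hypothetical placeholders for your own per-asset feature data.

```
from toraniko.math import center_xsection, winsorize_xsection

# `feat_df` is assumed to contain | date | symbol | my_signal | for your universe;
# "my_signal" is a hypothetical raw feature, not something shipped with toraniko.
# Winsorize the feature cross-sectionally by date, then z-score it each day:
feat_df = winsorize_xsection(feat_df, ("my_signal",), "date", percentile=0.05)
custom_scores = feat_df.select(
    "date",
    "symbol",
    center_xsection("my_signal", "date", standardize=True).alias("custom_score"),
)
```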
10 | 11 | ## Installation 12 | 13 | Using pip: 14 | 15 | `pip install toraniko` 16 | 17 | ## User Manual 18 | 19 | #### Data 20 | 21 | You'll need the following data to run a complete model estimation: 22 | 23 | 1. Sector scores, used for estimating the market and sector factor returns. GICS level 1 is suitable for this. The sector scores consist of one row per asset, with 0s in each column except for the sector in which the asset is a member, which is filled with 1. 24 | 25 | ``` 26 | symbol Basic Materials Communication Services Consumer Cyclical Consumer Defensive Energy Financial Services Healthcare Industrials Real Estate Technology Utilities 27 | str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 28 | "A" 0 0 0 0 0 0 1 0 0 0 0 29 | "AA" 1 0 0 0 0 0 0 0 0 0 0 30 | "AACI" 0 0 0 0 0 1 0 0 0 0 0 31 | "AACT" 0 0 0 0 0 1 0 0 0 0 0 32 | "AADI" 0 0 0 0 0 0 1 0 0 0 0 33 | … … … … … … … … … … … … 34 | "ZVRA" 0 0 0 0 0 0 1 0 0 0 0 35 | "ZVSA" 0 0 0 0 0 0 1 0 0 0 0 36 | "ZWS" 0 0 0 0 0 0 0 1 0 0 0 37 | "ZYME" 0 0 0 0 0 0 1 0 0 0 0 38 | "ZYXI" 0 0 0 0 0 0 1 0 0 0 0 39 | ``` 40 | 41 | 2. Symbol-by-symbol daily asset returns for a large universe of equities: 42 | 43 | ``` 44 | date symbol asset_returns 45 | date str f64 46 | 2013-01-02 "A" 0.022962 47 | 2013-01-02 "AAMC" -0.073171 48 | 2013-01-02 "AAME" 0.035566 49 | 2013-01-02 "AAON" 0.019163 50 | 2013-01-02 "AAP" 0.001935 51 | … … … 52 | 2024-02-23 "ZVRA" -0.025 53 | 2024-02-23 "ZVSA" 0.291311 54 | 2024-02-23 "ZWS" 0.006378 55 | 2024-02-23 "ZYME" 0.000838 56 | 2024-02-23 "ZYXI" 0.001552 57 | ``` 58 | 59 | 3. For the value factor: symbol-by-symbol daily market cap, cash flow, share count, revenue and book value estimates, so you can calculate book-price, sales-price and cash flow-price metrics: 60 | 61 | ``` 62 | date symbol book_price sales_price cf_price market_cap 63 | date str f64 f64 f64 f64 64 | 2013-10-30 "AAPL" 0.343017 0.081994 0.007687 4.5701e11 65 | 2013-10-31 "AAPL" 0.342231 0.081763 0.007665 4.5830e11 66 | 2013-11-01 "AAPL" 0.341398 0.081521 0.007643 4.5966e11 67 | 2013-11-04 "AAPL" 0.340947 0.08137 0.007628 4.6051e11 68 | 2013-11-05 "AAPL" 0.340491 0.081219 0.007614 4.6137e11 69 | … … … … … … 70 | 2024-02-16 "AAPL" 0.072243 0.040937 0.001972 2.9209e12 71 | 2024-02-20 "AAPL" 0.072405 0.040942 0.001972 2.9206e12 72 | 2024-02-21 "AAPL" 0.072614 0.040973 0.001974 2.9184e12 73 | 2024-02-22 "AAPL" 0.072792 0.040987 0.001974 2.9174e12 74 | 2024-02-23 "AAPL" 0.073007 0.041021 0.001976 2.9150e12 75 | ``` 76 | 77 | #### Style factor score calculation 78 | 79 | Taking the foregoing data together you'll have: 80 | 81 | ``` 82 | date symbol book_price sales_price cf_price market_cap asset_returns 83 | date str f64 f64 f64 f64 f64 84 | 2013-02-12 "A" null null null 1.1080e10 0.00045 85 | 2013-02-12 "AAON" null null null 5.5410e8 0.006741 86 | 2013-02-12 "AAP" null null null 5.4212e9 0.002679 87 | 2013-02-12 "AAPL" 0.302543 0.1189 null 4.5847e11 -0.025069 88 | 2013-02-12 "AAT" null null null 1.1135e9 0.006764 89 | … … … … … … … 90 | 2024-02-23 "ZS" 0.105832 0.016538 0.007052 3.5311e10 0.04015 91 | 2024-02-23 "ZTS" 0.125734 0.025145 -0.017418 8.8010e10 0.002797 92 | 2024-02-23 "ZUO" 0.424385 0.089243 0.073274 1.2309e9 0.002448 93 | 2024-02-23 "ZWS" 0.296327 0.067669 0.001916 5.2727e9 0.006378 94 | 2024-02-23 "ZYME" 0.363093 0.016538 -0.054523 7.3077e8 0.000838 95 | ``` 96 | 97 | Then to estimate the momentum factor, you can run 98 | 99 | ``` 100 | from toraniko.styles import factor_mom 101 | 102 | mom_df = factor_mom(df.select("symbol", "date", 
"asset_returns"), trailing_days=252, winsor_factor=0.01).collect() 103 | ``` 104 | 105 | and you'll obtain scores roughly resembling this histogram: 106 | 107 | Screenshot 2024-08-05 at 12 28 39 AM 108 | 109 | Likewise for value, you can run: 110 | 111 | ``` 112 | from toraniko.styles import factor_val 113 | 114 | value_df = factor_val(df.select("date", "symbol", "book_price", "sales_price", "cf_price")).collect() 115 | ``` 116 | 117 | Similarly: 118 | 119 | Screenshot 2024-08-05 at 12 30 08 AM 120 | 121 | #### Factor return estimation 122 | 123 | Let's say you've estimated three style factors: value, momentum, size. 124 | 125 | ``` 126 | date symbol mom_score sze_score val_score 127 | date str f64 f64 f64 128 | 2014-03-07 "A" 0.793009 -0.872847 0.265801 129 | 2014-03-07 "AAON" 1.128932 0.939116 -1.050307 130 | 2014-03-07 "AAP" 2.190209 0.0 0.351273 131 | 2014-03-07 "AAPL" -0.202091 -3.373659 -0.289713 132 | 2014-03-07 "AAT" -0.413211 0.805413 0.052482 133 | … … … … … 134 | 2024-02-23 "ZS" 2.783905 -1.303952 -1.778753 135 | 2024-02-23 "ZTS" 0.030749 -1.905171 -1.619675 136 | 2024-02-23 "ZUO" 0.235533 0.905671 0.253427 137 | 2024-02-23 "ZWS" 0.594781 0.0 -0.520947 138 | 2024-02-23 "ZYME" -1.631452 1.248919 -1.451382 139 | ``` 140 | 141 | Merge these with the aforementioned GICS sector scores, and take the top N by market cap each day on a suitable universe. Here we'll do the Russell 3000: 142 | 143 | ``` 144 | from toraniko.utils import top_n_by_group 145 | 146 | ddf = ( 147 | ret_df.join(cap_df.drop("book_value"), on=["date", "symbol"]) 148 | .join(sector_scores, on="symbol") 149 | .join(style_scores, on=["date", "symbol"]) 150 | .drop_nulls() 151 | ) 152 | ddf = ( 153 | top_n_by_group( 154 | ddf.lazy(), 155 | 3000, 156 | "market_cap", 157 | ("date",), 158 | True 159 | ) 160 | .collect() 161 | .sort("date", "symbol") 162 | ) 163 | 164 | returns_df = ddf.select("date", "symbol", "asset_returns") 165 | mkt_cap_df = ddf.select("date", "symbol", "market_cap") 166 | sector_df = ddf.select(["date"] + list(sector_scores.columns)) 167 | style_df = ddf.select(style_scores.columns) 168 | ``` 169 | 170 | Then simply: 171 | 172 | ``` 173 | from toraniko.model import estimate_factor_returns 174 | 175 | fac_df, eps_df = estimate_factor_returns(returns_df, mkt_cap_df, sector_df, style_df, winsor_factor=0.1, residualize_styles=False) 176 | ``` 177 | 178 | On an M1 MacBook, this estimates 10+ years of daily market, sector and style factor returns in under a minute. 179 | 180 | Here is a comparison of the model value factor out versus Barra's. 
Even on a relatively low quality data source (Yahoo Finance) and without significant effort in cleaning corporate actions, the results are comparable over a 10 year period: 181 | 182 | ![val_factor](https://github.com/user-attachments/assets/28f41989-f802-4c2f-beed-1d2bda24a96d) 183 | 184 | ![valu](https://github.com/user-attachments/assets/366f49a8-d7e7-46de-bb61-6f656393275a) 185 | -------------------------------------------------------------------------------- /dev-requirements.txt: -------------------------------------------------------------------------------- 1 | pytest~=7.4.4 2 | numpy~=1.26.2 3 | polars~=1.0 -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["poetry-core"] 3 | build-backend = "poetry.core.masonry.api" 4 | 5 | [tool.poetry] 6 | name = "toraniko" 7 | version = "1.1.0" 8 | description = "A multi-factor equity risk model for quantitative trading." 9 | authors = ["0xfdf <0xfdf@implies.com>"] 10 | maintainers = ["0xfdf <0xfdf@implies.com>"] 11 | license = "MIT" 12 | readme = "README.md" 13 | homepage = "https://github.com/0xfdf/toraniko" 14 | repository = "https://github.com/0xfdf/toraniko" 15 | keywords = ["risk", "model", "portfolio", "optimization", "factor", "quant", "quantitative", "finance", "trading"] 16 | classifiers = [ 17 | "Development Status :: 5 - Production/Stable", 18 | "Intended Audience :: Science/Research", 19 | "Programming Language :: Python :: 3", 20 | "Programming Language :: Python :: 3.10", 21 | "Programming Language :: Python :: 3.11", 22 | "Programming Language :: Python :: 3.12", 23 | "License :: OSI Approved :: MIT License", 24 | "Topic :: Scientific/Engineering" 25 | ] 26 | 27 | [tool.poetry.dependencies] 28 | python = ">=3.10,<4.0" 29 | numpy = "~1.26.2" 30 | polars = "~1.0.0" 31 | 32 | [tool.poetry.dev-dependencies] 33 | pytest = "~7.4.4" 34 | 35 | [tool.poetry.urls] 36 | homepage = "https://github.com/0xfdf/toraniko" 37 | repository = "https://github.com/0xfdf/toraniko" 38 | issues = "https://github.com/0xfdf/toraniko/issues" 39 | changelog = "https://github.com/0xfdf/toraniko/releases" 40 | 41 | [tool.ruff] 42 | line-length = 120 43 | fix = false 44 | select = ["E", "F", "I", "N", "Q", "R", "S", "T", "U", "W", "Y"] 45 | 46 | [tool.black] 47 | line-length = 120 48 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy~=1.26.2 2 | polars~=1.0 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | requirements = [] 4 | with open('requirements.txt', 'r') as fh: 5 | for line in fh: 6 | requirements.append(line.strip()) 7 | 8 | setup( 9 | name="toraniko", 10 | version="1.1.0", 11 | packages=find_packages(), 12 | install_requires = requirements 13 | ) 14 | -------------------------------------------------------------------------------- /toraniko/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xfdf/toraniko/253ce7044ef33eb718e671e9f918a88ba5fa3350/toraniko/__init__.py -------------------------------------------------------------------------------- /toraniko/math.py: 
--------------------------------------------------------------------------------
1 | """Basic mathematical and statistical operations used in the model."""
2 | 
3 | import numpy as np
4 | import polars as pl
5 | 
6 | 
7 | def center_xsection(target_col: str, over_col: str, standardize: bool = False) -> pl.Expr:
8 |     """Cross-sectionally center (and optionally standardize) a Polars DataFrame `target_col` partitioned by `over_col`.
9 | 
10 |     This returns a Polars expression, so it can be chained in a `select` or `with_columns` invocation
11 |     without needing to set a new intermediate DataFrame or materialize lazy evaluation.
12 | 
13 |     Parameters
14 |     ----------
15 |     target_col: the column to be standardized
16 |     over_col: the column over which standardization should be applied, cross-sectionally
17 |     standardize: boolean indicating if we should also standardize the target column
18 | 
19 |     Returns
20 |     -------
21 |     Polars Expr
22 |     """
23 |     expr = pl.col(target_col) - pl.col(target_col).drop_nulls().drop_nans().mean().over(over_col)
24 |     if standardize:
25 |         return expr / pl.col(target_col).drop_nulls().drop_nans().std().over(over_col)
26 |     return expr
27 | 
28 | 
29 | def norm_xsection(
30 |     target_col: str,
31 |     over_col: str,
32 |     lower: int | float = 0,
33 |     upper: int | float = 1,
34 | ) -> pl.Expr:
35 |     """Cross-sectionally normalize a Polars DataFrame `target_col` partitioned by `over_col`, with rescaling
36 |     to the interval [`lower`, `upper`].
37 | 
38 |     This returns a Polars expression, so it can be chained in a `select` or `with_columns` invocation
39 |     without needing to set a new intermediate DataFrame or materialize lazy evaluation.
40 | 
41 |     NaN values are not propagated in the max and min calculation, but NaN values are preserved for normalization.
42 | 
43 |     Parameters
44 |     ----------
45 |     target_col: str name of the column to normalize
46 |     over_col: str name of the column to partition the normalization by
47 |     lower: lower bound of the rescaling interval, defaults to 0 to construct a percent
48 |     upper: upper bound of the rescaling interval, defaults to 1 to construct a percent
49 | 
50 |     Returns
51 |     -------
52 |     Polars Expr
53 |     """
54 |     min_col = pl.col(target_col).drop_nans().min().over(over_col)
55 |     max_col = pl.col(target_col).drop_nans().max().over(over_col)
56 | 
57 |     norm_col = (
58 |         pl.when(pl.col(target_col).is_nan())
59 |         .then(pl.col(target_col))  # Preserve NaN values
60 |         .when(max_col != min_col)  # Avoid division by zero by making sure min != max
61 |         .then((pl.col(target_col) - min_col) / (max_col - min_col) * (upper - lower) + lower)
62 |         .otherwise(lower)
63 |     )
64 | 
65 |     return norm_col
66 | 
67 | 
68 | def winsorize(data: np.ndarray, percentile: float = 0.05, axis: int = 0) -> np.ndarray:
69 |     """Winsorize each vector of a 2D numpy array to symmetric percentiles given by `percentile`.
70 | 
71 |     This operates on and returns a numpy array rather than a Polars expression; to winsorize DataFrame
72 |     columns cross-sectionally, use `winsorize_xsection`, which applies this function group by group.
73 | 
74 |     Parameters
75 |     ----------
76 |     data: numpy array containing original data to be winsorized
77 |     percentile: float indicating the percentiles to apply winsorization at
78 |     axis: int indicating which axis to apply winsorization over (i.e. orientation if `data` is 2D)
79 | 
80 |     Returns
81 |     -------
82 |     numpy array
83 |     """
84 |     try:
85 |         if not 0 <= percentile <= 1:
86 |             raise ValueError("`percentile` must be between 0 and 1")
87 |     except AttributeError as e:
88 |         raise TypeError("`percentile` must be a numeric type, such as an int or float") from e
89 | 
90 |     fin_data = np.where(np.isfinite(data), data, np.nan)
91 | 
92 |     # compute lower and upper percentiles for each column
93 |     lower_bounds = np.nanpercentile(fin_data, percentile * 100, axis=axis, keepdims=True)
94 |     upper_bounds = np.nanpercentile(fin_data, (1 - percentile) * 100, axis=axis, keepdims=True)
95 | 
96 |     # clip data to within the bounds
97 |     return np.clip(data, lower_bounds, upper_bounds)
98 | 
99 | 
100 | def winsorize_xsection(
101 |     df: pl.DataFrame | pl.LazyFrame,
102 |     data_cols: tuple[str, ...],
103 |     group_col: str,
104 |     percentile: float = 0.05,
105 | ) -> pl.DataFrame | pl.LazyFrame:
106 |     """Cross-sectionally winsorize the `data_cols` of `df`, grouped on `group_col`, to the symmetric percentile
107 |     given by `percentile`.
108 | 
109 |     Parameters
110 |     ----------
111 |     df: Polars DataFrame or LazyFrame containing feature data to winsorize
112 |     data_cols: collection of strings indicating the columns of `df` to be winsorized
113 |     group_col: str column of `df` to use as the cross-sectional group
114 |     percentile: float value indicating the symmetric winsorization threshold
115 | 
116 |     Returns
117 |     -------
118 |     Polars DataFrame or LazyFrame
119 |     """
120 | 
121 |     def winsorize_group(group: pl.DataFrame) -> pl.DataFrame:
122 |         for col in data_cols:
123 |             winsorized_data = winsorize(group[col].to_numpy(), percentile=percentile)
124 |             group = group.with_columns(pl.Series(col, winsorized_data).alias(col))
125 |         return group
126 | 
127 |     match df:
128 |         case pl.DataFrame():
129 |             grouped = df.group_by(group_col).map_groups(winsorize_group)
130 |         case pl.LazyFrame():
131 |             grouped = df.group_by(group_col).map_groups(winsorize_group, schema=df.collect_schema())
132 |         case _:
133 |             raise TypeError("`df` must be a Polars DataFrame or LazyFrame")
134 |     return grouped
135 | 
136 | 
137 | def percentiles_xsection(
138 |     target_col: str,
139 |     over_col: str,
140 |     lower_pct: float,
141 |     upper_pct: float,
142 |     fill_val: float | int = 0.0,
143 | ) -> pl.Expr:
144 |     """Cross-sectionally mark all values of `target_col` that fall outside the `lower_pct` percentile or
145 |     `upper_pct` percentile, within each `over_col` group. This is essentially an anti-winsorization, suitable for
146 |     building high minus low portfolios. The `fill_val` is substituted for each value between the percentile cutoffs.
147 | 
148 |     This returns a Polars expression, so it can be chained in a `select` or `with_columns` invocation
149 |     without needing to set a new intermediate DataFrame or materialize lazy evaluation.
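
    For example, `percentiles_xsection("sze_score", "date", 0.2, 0.8, 0.0)` keeps only each date's bottom- and
    top-quintile size scores and replaces everything in between with 0.0; this is how `factor_sze` uses it
    with its default deciles.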
150 | 151 | Parameters 152 | ---------- 153 | target_col: str column name to have non-percentile thresholded values masked 154 | over_col: str column name to apply masking over, cross-sectionally 155 | lower_pct: float lower percentile under which to keep values 156 | upper_pct: float upper percentile over which to keep values 157 | fill_val: numeric value for masking 158 | 159 | Returns 160 | ------- 161 | Polars Expr 162 | """ 163 | return ( 164 | pl.when( 165 | (pl.col(target_col) <= pl.col(target_col).drop_nans().quantile(lower_pct).over(over_col)) 166 | | (pl.col(target_col) >= pl.col(target_col).drop_nans().quantile(upper_pct).over(over_col)) 167 | ) 168 | .then(pl.col(target_col)) 169 | .otherwise(fill_val) 170 | ) 171 | 172 | 173 | def exp_weights(window: int, half_life: int) -> np.ndarray: 174 | """Generate exponentially decaying weights over `window` trailing values, decaying by half each `half_life` index. 175 | 176 | Parameters 177 | ---------- 178 | window: integer number of points in the trailing lookback period 179 | half_life: integer decay rate 180 | 181 | Returns 182 | ------- 183 | numpy array 184 | """ 185 | try: 186 | assert isinstance(window, int) 187 | if not window > 0: 188 | raise ValueError("`window` must be a strictly positive integer") 189 | except (AttributeError, AssertionError) as e: 190 | raise TypeError("`window` must be an integer type") from e 191 | try: 192 | assert isinstance(half_life, int) 193 | if not half_life > 0: 194 | raise ValueError("`half_life` must be a strictly positive integer") 195 | except (AttributeError, AssertionError) as e: 196 | raise TypeError("`half_life` must be an integer type") from e 197 | decay = np.log(2) / half_life 198 | return np.exp(-decay * np.arange(window))[::-1] 199 | -------------------------------------------------------------------------------- /toraniko/model.py: -------------------------------------------------------------------------------- 1 | """Complete implementation of the factor model.""" 2 | 3 | import numpy as np 4 | import polars as pl 5 | import polars.exceptions as pl_exc 6 | 7 | from toraniko.math import winsorize 8 | 9 | 10 | def _factor_returns( 11 | returns: np.ndarray, 12 | mkt_caps: np.ndarray, 13 | sector_scores: np.ndarray, 14 | style_scores: np.ndarray, 15 | residualize_styles: bool, 16 | ) -> tuple[np.ndarray, np.ndarray]: 17 | """Estimate market, sector, style and residual asset returns for one time period, robust to rank deficiency. 
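
    The cross-sectional regression is weighted by the square root of market cap, used as a proxy for the
    inverse of asset idiosyncratic variance. Sector factor returns are estimated subject to the constraint
    that they sum to zero (implemented via a change of variables), so the market return is completely spanned
    by the sector returns; style factor returns are then estimated without constraints, optionally on returns
    that have been residualized to market + sector.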
18 | 19 | Parameters 20 | ---------- 21 | returns: np.ndarray returns of the assets (shape n_assets x 1) 22 | mkt_caps: np.ndarray of asset market capitalizations (shape n_assets x 1) 23 | sector_scores: np.ndarray of asset scores used to estimate the sector return (shape n_assets x m_sectors) 24 | style_scores: np.ndarray of asset scores used to estimate style factor returns (shape n_assets x m_styles) 25 | residualize_styles: bool indicating if styles should be orthogonalized to market + sector 26 | 27 | Returns 28 | ------- 29 | tuple of arrays: (market/sector/style factor returns, residual returns) 30 | """ 31 | n_assets = returns.shape[0] 32 | m_sectors, m_styles = sector_scores.shape[1], style_scores.shape[1] 33 | 34 | # Proxy for the inverse of asset idiosyncratic variances 35 | W = np.diag(np.sqrt(mkt_caps.ravel())) 36 | 37 | # Estimate sector factor returns with a constraint that the sector factors sum to 0 38 | # Economically, we assert that the market return is completely spanned by the sector returns 39 | beta_sector = np.hstack([np.ones(n_assets).reshape(-1, 1), sector_scores]) 40 | a = np.concatenate([np.array([0]), (-1 * np.ones(m_sectors - 1))]) 41 | Imat = np.identity(m_sectors) 42 | R_sector = np.vstack([Imat, a]) 43 | # Change of variables to add the constraint 44 | B_sector = beta_sector @ R_sector 45 | 46 | V_sector, _, _, _ = np.linalg.lstsq(B_sector.T @ W @ B_sector, B_sector.T @ W, rcond=None) 47 | # Change of variables to recover all sectors 48 | g = V_sector @ returns 49 | fac_ret_sector = R_sector @ g 50 | 51 | sector_resid_returns = returns - (B_sector @ g) 52 | 53 | # Estimate style factor returns without constraints 54 | V_style, _, _, _ = np.linalg.lstsq(style_scores.T @ W @ style_scores, style_scores.T @ W, rcond=None) 55 | if residualize_styles: 56 | fac_ret_style = V_style @ sector_resid_returns 57 | else: 58 | fac_ret_style = V_style @ returns 59 | 60 | # Combine factor returns 61 | fac_ret = np.concatenate([fac_ret_sector, fac_ret_style]) 62 | 63 | # Calculate final residuals 64 | epsilon = sector_resid_returns - (style_scores @ fac_ret_style) 65 | 66 | return fac_ret, epsilon 67 | 68 | 69 | def estimate_factor_returns( 70 | returns_df: pl.DataFrame, 71 | mkt_cap_df: pl.DataFrame, 72 | sector_df: pl.DataFrame, 73 | style_df: pl.DataFrame, 74 | winsor_factor: float | None = 0.05, 75 | residualize_styles: bool = True, 76 | ) -> tuple[pl.DataFrame, pl.DataFrame] | pl.DataFrame: 77 | """Estimate factor and residual returns across all time periods using input asset factor scores. 
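
    The four input frames are joined on date and symbol, and each date is then estimated independently with
    the cross-sectional regression in `_factor_returns`. The returned factor returns frame has one row per
    date, with a 'market' column followed by one column per sector and per style; the residual returns frame
    has one row per date with one column per symbol.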
78 | 79 | Parameters 80 | ---------- 81 | returns_df: Polars DataFrame containing | date | symbol | asset_returns | 82 | mkt_cap_df: Polars DataFrame containing | date | symbol | market_cap | 83 | sector_df: Polars DataFrame containing | date | symbol | followed by one column for each sector 84 | style_df: Polars DataFrame containing | date | symbol | followed by one column for each style 85 | winsor_factor: winsorization proportion 86 | residualize_styles: bool indicating if style returns should be orthogonalized to market + sector returns 87 | 88 | Returns 89 | ------- 90 | tuple of Polars DataFrames melted by date: (factor returns, residual returns) 91 | """ 92 | returns, residuals = [], [] 93 | try: 94 | sectors = sorted(sector_df.select(pl.exclude("date", "symbol")).columns) 95 | except AttributeError as e: 96 | raise TypeError("`sector_df` must be a Polars DataFrame, but it's missing required attributes") from e 97 | except pl_exc.ColumnNotFoundError as e: 98 | raise ValueError("`sector_df` must have columns for 'date' and 'symbol' in addition to each sector") from e 99 | try: 100 | styles = sorted(style_df.select(pl.exclude("date", "symbol")).columns) 101 | except AttributeError as e: 102 | raise TypeError("`style_df` must be a Polars DataFrame, but it's missing required attributes") from e 103 | except pl_exc.ColumnNotFoundError as e: 104 | raise ValueError("`style_df` must have columns for 'date' and 'symbol' in addition to each style") from e 105 | try: 106 | returns_df = ( 107 | returns_df.join(mkt_cap_df, on=["date", "symbol"]) 108 | .join(sector_df, on=["date", "symbol"]) 109 | .join(style_df, on=["date", "symbol"]) 110 | ) 111 | dates = returns_df["date"].unique().to_list() 112 | # iterate through, one day at a time 113 | # this could probably be made more efficient with Polars' `.map_groups` method 114 | for dt in dates: 115 | ddf = returns_df.filter(pl.col("date") == dt).sort("symbol") 116 | r = ddf["asset_returns"].to_numpy() 117 | if winsor_factor is not None: 118 | r = winsorize(r, winsor_factor) 119 | f, e = _factor_returns( 120 | r, 121 | ddf["market_cap"].to_numpy(), 122 | ddf.select(sectors).to_numpy(), 123 | ddf.select(styles).to_numpy(), 124 | residualize_styles, 125 | ) 126 | returns.append(f) 127 | residuals.append(dict(zip(ddf["symbol"].to_list(), e))) 128 | except AttributeError as e: 129 | raise TypeError( 130 | "`returns_df` and `mkt_cap_df` must be Polars DataFrames, but there are missing attributes" 131 | ) from e 132 | except pl_exc.ColumnNotFoundError as e: 133 | raise ValueError( 134 | "`returns_df` must have columns 'date', 'symbol' and 'asset_returns'; " 135 | "`mkt_cap_df` must have 'date', 'symbol' and 'market_cap' columns" 136 | ) from e 137 | ret_df = pl.DataFrame(np.array(returns)) 138 | ret_df.columns = ["market"] + sectors + styles 139 | ret_df = ret_df.with_columns(pl.Series(dates).alias("date")) 140 | eps_df = pl.DataFrame(residuals).with_columns(pl.Series(dates).alias("date")) 141 | return ret_df, eps_df 142 | -------------------------------------------------------------------------------- /toraniko/styles.py: -------------------------------------------------------------------------------- 1 | """Style factor implementations.""" 2 | 3 | import numpy as np 4 | import polars as pl 5 | import polars.exceptions as pl_exc 6 | 7 | from toraniko.math import ( 8 | exp_weights, 9 | center_xsection, 10 | percentiles_xsection, 11 | winsorize_xsection, 12 | ) 13 | 14 | ### 15 | # NB: These functions do not try to handle NaN or null resilience for you, 
nor do they make allowances 16 | # for data having pathological distributions. Garbage in, garbage out. You need to inspect your data 17 | # and use the functions in the math and utils modules to ensure your features are sane and 18 | # well-behaved before you try to construct factors from them! 19 | ### 20 | 21 | 22 | def factor_mom( 23 | returns_df: pl.DataFrame | pl.LazyFrame, 24 | trailing_days: int = 504, 25 | half_life: int = 126, 26 | lag: int = 20, 27 | winsor_factor: float = 0.01, 28 | ) -> pl.LazyFrame: 29 | """Estimate rolling symbol by symbol momentum factor scores using asset returns. 30 | 31 | Parameters 32 | ---------- 33 | returns_df: Polars DataFrame containing columns: | date | symbol | asset_returns | 34 | trailing_days: int look back period over which to measure momentum 35 | half_life: int decay rate for exponential weighting, in days 36 | lag: int number of days to lag the current day's return observation (20 trading days is one month) 37 | 38 | Returns 39 | ------- 40 | Polars DataFrame containing columns: | date | symbol | mom_score | 41 | """ 42 | weights = exp_weights(trailing_days, half_life) 43 | 44 | def weighted_cumprod(values: np.ndarray) -> float: 45 | return (np.cumprod(1 + (values * weights[-len(values) :])) - 1)[-1] # type: ignore 46 | 47 | try: 48 | df = ( 49 | returns_df.lazy() 50 | .sort(by="date") 51 | .with_columns(pl.col("asset_returns").shift(lag).over("symbol").alias("asset_returns")) 52 | .with_columns( 53 | pl.col("asset_returns") 54 | .rolling_map(weighted_cumprod, window_size=trailing_days) 55 | .over(pl.col("symbol")) 56 | .alias("mom_score") 57 | ) 58 | ).collect() 59 | df = winsorize_xsection(df, ("mom_score",), "date", percentile=winsor_factor) 60 | return df.lazy().select( 61 | pl.col("date"), 62 | pl.col("symbol"), 63 | center_xsection("mom_score", "date", True).alias("mom_score"), 64 | ) 65 | except AttributeError as e: 66 | raise TypeError("`returns_df` must be a Polars DataFrame | LazyFrame, but it's missing attributes") from e 67 | except pl_exc.ColumnNotFoundError as e: 68 | raise ValueError("`returns_df` must have 'date', 'symbol' and 'asset_returns' columns") from e 69 | 70 | 71 | def factor_sze( 72 | mkt_cap_df: pl.DataFrame | pl.LazyFrame, 73 | lower_decile: float = 0.2, 74 | upper_decile: float = 0.8, 75 | ) -> pl.LazyFrame: 76 | """Estimate rolling symbol by symbol size factor scores using asset market caps. 77 | 78 | Parameters 79 | ---------- 80 | mkt_cap_df: Polars DataFrame containing columns: | date | symbol | market_cap | 81 | 82 | Returns 83 | ------- 84 | Polars DataFrame containing columns: | date | symbol | sze_score | 85 | """ 86 | try: 87 | return ( 88 | mkt_cap_df.lazy() 89 | # our factor is the Fama-French SMB, i.e. small-minus-big, because the size risk premium 90 | # is on the smaller firms rather than the larger ones. 
consequently we multiply by -1
91 |             .with_columns(pl.col("market_cap").log().alias("sze_score") * -1)
92 |             .with_columns(
93 |                 "date",
94 |                 "symbol",
95 |                 (center_xsection("sze_score", "date", True)).alias("sze_score"),
96 |             )
97 |             .with_columns(percentiles_xsection("sze_score", "date", lower_decile, upper_decile, 0.0).alias("sze_score"))
98 |             .select("date", "symbol", "sze_score")
99 |         )
100 |     except AttributeError as e:
101 |         raise TypeError("`mkt_cap_df` must be a Polars DataFrame or LazyFrame, but it's missing attributes") from e
102 |     except pl_exc.ColumnNotFoundError as e:
103 |         raise ValueError("`mkt_cap_df` must have 'date', 'symbol' and 'market_cap' columns") from e
104 | 
105 | 
106 | def factor_val(value_df: pl.DataFrame | pl.LazyFrame, winsorize_features: float | None = None) -> pl.LazyFrame:
107 |     """Estimate rolling symbol by symbol value factor scores using price ratios.
108 | 
109 |     Parameters
110 |     ----------
111 |     value_df: Polars DataFrame containing columns: | date | symbol | book_price | sales_price | cf_price
112 |     winsorize_features: optional float percentile at which to winsorize the features; no winsorization applied if None
113 | 
114 |     Returns
115 |     -------
116 |     Polars DataFrame containing: | date | symbol | val_score |
117 |     """
118 |     try:
119 |         if winsorize_features is not None:
120 |             value_df = winsorize_xsection(value_df, ("book_price", "sales_price", "cf_price"), "date", percentile=winsorize_features)
121 |         return (
122 |             value_df.lazy()
123 |             .with_columns(
124 |                 pl.col("book_price").log().alias("book_price"),
125 |                 pl.col("sales_price").log().alias("sales_price"),
126 |             )
127 |             .with_columns(
128 |                 center_xsection("book_price", "date", True).alias("book_price"),
129 |                 center_xsection("sales_price", "date", True).alias("sales_price"),
130 |                 center_xsection("cf_price", "date", True).alias("cf_price"),
131 |             )
132 |             .with_columns(
133 |                 # NB: it's imperative you've properly handled NaNs prior to this point
134 |                 pl.mean_horizontal(
135 |                     pl.col("book_price"),
136 |                     pl.col("sales_price"),
137 |                     pl.col("cf_price"),
138 |                 ).alias("val_score")
139 |             )
140 |             .select(
141 |                 pl.col("date"),
142 |                 pl.col("symbol"),
143 |                 center_xsection("val_score", "date", True).alias("val_score"),
144 |             )
145 |         )
146 |     except AttributeError as e:
147 |         raise TypeError("`value_df` must be a Polars DataFrame or LazyFrame, but it's missing attributes") from e
148 |     except pl_exc.ColumnNotFoundError as e:
149 |         raise ValueError(
150 |             "`value_df` must have 'date', 'symbol', 'book_price', 'sales_price' and 'cf_price' columns"
151 |         ) from e
152 | 
--------------------------------------------------------------------------------
/toraniko/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xfdf/toraniko/253ce7044ef33eb718e671e9f918a88ba5fa3350/toraniko/tests/__init__.py
--------------------------------------------------------------------------------
/toraniko/tests/test_math.py:
--------------------------------------------------------------------------------
1 | """Test functions in the math module."""
2 | 
3 | import pytest
4 | import polars as pl
5 | import numpy as np
6 | from polars.testing import assert_frame_equal
7 | 
8 | from toraniko.math import (
9 |     center_xsection,
10 |     exp_weights,
11 |     norm_xsection,
12 |     percentiles_xsection,
13 |     winsorize,
14 |     winsorize_xsection,
15 | )
16 | 
17 | 
18 | @pytest.fixture
19 | def sample_data():
20 |     """
21 |     Fixture to provide sample data for testing.
22 | 23 | Returns 24 | ------- 25 | pl.DataFrame 26 | A DataFrame with a 'group' column and a 'value' column. 27 | """ 28 | return pl.DataFrame({"group": ["A", "A", "A", "B", "B", "B"], "value": [1, 2, 3, 4, 5, 6]}) 29 | 30 | 31 | ### 32 | # `center_xsection` 33 | ### 34 | 35 | 36 | def test_center_xsection_centering(sample_data): 37 | """ 38 | Test centering without standardization. 39 | 40 | Parameters 41 | ---------- 42 | sample_data : pl.DataFrame 43 | The sample data to test. 44 | 45 | Asserts 46 | ------- 47 | The centered values match the expected centered values. 48 | """ 49 | centered_df = sample_data.with_columns(center_xsection("value", "group")) 50 | expected_centered_values = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0] 51 | assert np.allclose(centered_df["value"].to_numpy(), expected_centered_values) 52 | 53 | 54 | def test_center_xsection_standardizing(sample_data): 55 | """ 56 | Test centering and standardizing. 57 | 58 | Parameters 59 | ---------- 60 | sample_data : pl.DataFrame 61 | The sample data to test. 62 | 63 | Asserts 64 | ------- 65 | The standardized values match the expected standardized values. 66 | """ 67 | standardized_df = sample_data.with_columns(center_xsection("value", "group", standardize=True)) 68 | expected_standardized_values = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0] 69 | assert np.allclose(standardized_df["value"].to_numpy(), expected_standardized_values) 70 | 71 | 72 | def test_center_xsection_handle_nan(sample_data): 73 | """ 74 | Test handling of NaN values. 75 | 76 | Parameters 77 | ---------- 78 | sample_data : pl.DataFrame 79 | The sample data to test, with an additional NaN column. 80 | 81 | Asserts 82 | ------- 83 | The values in the 'nan_col' column are all NaN. 84 | """ 85 | sample_data_with_nan = sample_data.with_columns(pl.lit(np.nan).alias("nan_col")) 86 | centered_df = sample_data_with_nan.with_columns(center_xsection("nan_col", "group")) 87 | expected_values = [np.nan] * len(centered_df) 88 | assert np.allclose(centered_df["nan_col"].to_numpy(), expected_values, equal_nan=True) 89 | 90 | 91 | def test_center_xsection_different_groups(): 92 | """ 93 | Test with a more complex group and value structure. 94 | 95 | Asserts 96 | ------- 97 | The centered values match the expected values. 98 | """ 99 | data = pl.DataFrame({"group": ["A", "A", "B", "B", "C", "C"], "value": [1.5, 2.5, 5.5, 6.5, 9.0, 9.0]}) 100 | centered_df = data.with_columns(center_xsection("value", "group")) 101 | expected_values = [-0.5, 0.5, -0.5, 0.5, 0.0, 0.0] 102 | assert np.allclose(centered_df["value"].to_numpy(), expected_values) 103 | 104 | 105 | def test_center_xsection_empty_group(): 106 | """ 107 | Test with an empty group. 108 | 109 | Asserts 110 | ------- 111 | The group 'C' is empty in the filtered DataFrame. 112 | """ 113 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [1.0, 2.0, 3.0, 4.0]}) 114 | empty_group = data.filter(pl.col("group") == "C") 115 | assert empty_group.is_empty() 116 | 117 | 118 | def test_center_xsection_all_nan(): 119 | """ 120 | Test when the entire column consists of NaN values. 121 | 122 | Asserts 123 | ------- 124 | The result column should contain only NaN values. 
125 | """ 126 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [np.nan, np.nan, np.nan, np.nan]}) 127 | centered_df = data.with_columns(center_xsection("value", "group")) 128 | expected_values = [np.nan, np.nan, np.nan, np.nan] 129 | assert np.allclose(centered_df["value"].to_numpy(), expected_values, equal_nan=True) 130 | 131 | 132 | def test_center_xsection_single_row_group(): 133 | """ 134 | Test centering when a group has only one row. 135 | 136 | Asserts 137 | ------- 138 | The result should be 0.0 for the single row group. 139 | """ 140 | data = pl.DataFrame({"group": ["A", "A", "B"], "value": [1.0, 2.0, 3.0]}) 141 | centered_df = data.with_columns(center_xsection("value", "group")) 142 | expected_values = [-0.5, 0.5, 0.0] # Centering for group B with one value should be 0.0 143 | assert np.allclose(centered_df["value"].to_numpy(), expected_values) 144 | 145 | 146 | def test_center_xsection_mixed_data_types(): 147 | """ 148 | Test handling of mixed data types, where non-numeric columns are ignored. 149 | 150 | Asserts 151 | ------- 152 | The centering process should only affect numeric columns. 153 | """ 154 | data = pl.DataFrame( 155 | {"group": ["A", "A", "B", "B"], "value": [1.0, 2.0, 3.0, 4.0], "category": ["x", "y", "z", "w"]} 156 | ) 157 | centered_df = data.with_columns(center_xsection("value", "group")) 158 | expected_values = [-0.5, 0.5, -0.5, 0.5] 159 | assert np.allclose(centered_df["value"].to_numpy(), expected_values) 160 | assert all(centered_df["category"] == data["category"]) 161 | 162 | 163 | ### 164 | # `norm_xsection` 165 | ### 166 | 167 | 168 | def test_norm_xsection_default_range(sample_data): 169 | """ 170 | Test normalization with default range [0, 1]. 171 | 172 | Parameters 173 | ---------- 174 | sample_data : pl.DataFrame 175 | The sample data to test. 176 | 177 | Asserts 178 | ------- 179 | The normalized values match the expected values in the range [0, 1]. 180 | """ 181 | normalized_df = sample_data.with_columns(norm_xsection("value", "group")) 182 | expected_normalized_values = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0] 183 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values) 184 | 185 | 186 | def test_norm_xsection_custom_range(sample_data): 187 | """ 188 | Test normalization with a custom range [-1, 1]. 189 | 190 | Parameters 191 | ---------- 192 | sample_data : pl.DataFrame 193 | The sample data to test. 194 | 195 | Asserts 196 | ------- 197 | The normalized values match the expected values in the range [-1, 1]. 198 | """ 199 | normalized_df = sample_data.with_columns(norm_xsection("value", "group", lower=-1, upper=1)) 200 | expected_normalized_values = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0] 201 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values) 202 | 203 | 204 | def test_norm_xsection_nan_values(): 205 | """ 206 | Test normalization when the target column contains NaN values. 207 | 208 | Asserts 209 | ------- 210 | The NaN values are preserved in the output. 211 | """ 212 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [1, np.nan, 4, np.nan]}, strict=False) 213 | normalized_df = data.with_columns(norm_xsection("value", "group")) 214 | expected_normalized_values = [0.0, np.nan, 0.0, np.nan] 215 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values, equal_nan=True) 216 | 217 | 218 | def test_norm_xsection_single_value_group(): 219 | """ 220 | Test normalization when a group has a single value. 
221 | 222 | Asserts 223 | ------- 224 | The result should be the lower bound of the range, as there's no range to normalize. 225 | """ 226 | data = pl.DataFrame({"group": ["A", "A", "B"], "value": [1, 2, 3]}) 227 | normalized_df = data.with_columns(norm_xsection("value", "group")) 228 | expected_normalized_values = [0.0, 1.0, 0.0] # Single value group should map to the lower bound 229 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values) 230 | 231 | 232 | def test_norm_xsection_identical_values(): 233 | """ 234 | Test normalization when all values in a group are identical. 235 | 236 | Asserts 237 | ------- 238 | The normalized values should all be the lower bound, as there's no range to normalize. 239 | """ 240 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [5, 5, 5, 5]}) 241 | normalized_df = data.with_columns(norm_xsection("value", "group")) 242 | expected_normalized_values = [0.0, 0.0, 0.0, 0.0] # Identical values map to the lower bound 243 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values) 244 | 245 | 246 | def test_norm_xsection_mixed_data_types(): 247 | """ 248 | Test handling of mixed data types, where non-numeric columns are ignored. 249 | 250 | Asserts 251 | ------- 252 | The normalization process should only affect numeric columns. 253 | """ 254 | data = pl.DataFrame( 255 | {"group": ["A", "A", "B", "B"], "value": [1.0, 2.0, 3.0, 4.0], "category": ["x", "y", "z", "w"]} 256 | ) 257 | normalized_df = data.with_columns(norm_xsection("value", "group")) 258 | expected_normalized_values = [0.0, 1.0, 0.0, 1.0] 259 | actual_values = normalized_df["value"].to_numpy() 260 | print(actual_values) 261 | assert np.allclose(actual_values, expected_normalized_values) 262 | assert all(normalized_df["category"] == data["category"]) 263 | 264 | 265 | def test_norm_xsection_custom_range_large_values(): 266 | """ 267 | Test normalization with a custom range [100, 200] and large values. 268 | 269 | Asserts 270 | ------- 271 | The normalized values match the expected values in the range [100, 200]. 272 | """ 273 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [1000, 2000, 3000, 4000]}) 274 | normalized_df = data.with_columns(norm_xsection("value", "group", lower=100, upper=200)) 275 | expected_normalized_values = [100.0, 200.0, 100.0, 200.0] 276 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values) 277 | 278 | 279 | ### 280 | # `winsorize` 281 | ### 282 | 283 | 284 | @pytest.fixture 285 | def sample_columns(): 286 | """ 287 | Fixture to provide sample numpy array columns for testing. 288 | 289 | Returns 290 | ------- 291 | np.ndarray 292 | """ 293 | return np.array( 294 | [ 295 | [1, 100, 1000], 296 | [2, 200, 2000], 297 | [3, 300, 3000], 298 | [4, 400, 4000], 299 | [5, 500, 5000], 300 | [6, 600, 6000], 301 | [7, 700, 7000], 302 | [8, 800, 8000], 303 | [9, 900, 9000], 304 | [10, 1000, 10000], 305 | ] 306 | ) 307 | 308 | 309 | @pytest.fixture 310 | def sample_columns_with_nans(): 311 | """ 312 | Fixture to provide sample numpy array columns for testing. 
313 | 314 | Returns 315 | ------- 316 | np.ndarray 317 | """ 318 | return np.array( 319 | [ 320 | [1, 100, np.nan], 321 | [2, 200, 2000], 322 | [3, 300, np.nan], 323 | [4, 400, np.nan], 324 | [5, 500, 5000], 325 | [6, 600, 6000], 326 | [7, 700, 7000], 327 | [8, 800, 8000], 328 | [9, 900, np.nan], 329 | [10, 1000, 10000], 330 | ] 331 | ) 332 | 333 | 334 | @pytest.fixture 335 | def sample_rows(): 336 | """ 337 | Fixture to provide sample numpy rows for testing. 338 | 339 | Returns 340 | ------- 341 | np.ndarray 342 | """ 343 | return np.array( 344 | [ 345 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 346 | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], 347 | [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000], 348 | ] 349 | ) 350 | 351 | 352 | @pytest.fixture 353 | def sample_rows_with_nans(): 354 | """ 355 | Fixture to provide sample numpy rows for testing. 356 | 357 | Returns 358 | ------- 359 | np.ndarray 360 | """ 361 | return np.array( 362 | [ 363 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 364 | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], 365 | [np.nan, 2000, np.nan, np.nan, 5000, 6000, 7000, 8000, np.nan, 10000], 366 | ] 367 | ) 368 | 369 | 370 | def test_winsorize_axis_0(sample_columns): 371 | """ 372 | Test column-wise winsorization (axis=0). 373 | 374 | This test checks if the function correctly winsorizes a 2D array column-wise. 375 | 376 | Parameters 377 | ---------- 378 | sample_columns : np.ndarray 379 | Sample data fixture for column-wise testing. 380 | 381 | Raises 382 | ------ 383 | AssertionError 384 | If the winsorized output doesn't match the expected result. 385 | """ 386 | result = winsorize(sample_columns, percentile=0.2, axis=0) 387 | 388 | # Expected results (calculated manually for 20th and 80th percentiles) 389 | expected = np.array( 390 | [ 391 | [2.8, 280, 2800], 392 | [2.8, 280, 2800], 393 | [3.0, 300, 3000], 394 | [4.0, 400, 4000], 395 | [5.0, 500, 5000], 396 | [6.0, 600, 6000], 397 | [7.0, 700, 7000], 398 | [8.0, 800, 8000], 399 | [8.2, 820, 8200], 400 | [8.2, 820, 8200], 401 | ] 402 | ) 403 | 404 | np.testing.assert_array_almost_equal(result, expected) 405 | 406 | 407 | def test_winsorize_axis_1(sample_rows): 408 | """ 409 | Test row-wise winsorization (axis=1). 410 | 411 | This test checks if the function correctly winsorizes a 2D array row-wise. 412 | 413 | Parameters 414 | ---------- 415 | sample_rows : np.ndarray 416 | Sample data fixture for row-wise testing. 417 | 418 | Raises 419 | ------ 420 | AssertionError 421 | If the winsorized output doesn't match the expected result. 422 | """ 423 | result = winsorize(sample_rows, percentile=0.2, axis=1) 424 | 425 | # Expected results (calculated manually for 20th and 80th percentiles) 426 | expected = np.array( 427 | [ 428 | [2.8, 2.8, 3, 4, 5, 6, 7, 8, 8.2, 8.2], 429 | [280, 280, 300, 400, 500, 600, 700, 800, 820, 820], 430 | [2800, 2800, 3000, 4000, 5000, 6000, 7000, 8000, 8200, 8200], 431 | ] 432 | ) 433 | 434 | np.testing.assert_array_almost_equal(result, expected) 435 | 436 | 437 | def test_winsorize_axis_0_with_nans(sample_columns_with_nans): 438 | """ 439 | Test winsorize function with NaN values. 440 | 441 | This test verifies that the function handles NaN values correctly. 442 | 443 | Raises 444 | ------ 445 | AssertionError 446 | If the winsorized output doesn't handle NaN values as expected. 
447 | """ 448 | result = winsorize(sample_columns_with_nans, percentile=0.2, axis=0) 449 | expected = np.array( 450 | [ 451 | [2.8e00, 2.8e02, np.nan], 452 | [2.8e00, 2.8e02, 5.0e03], 453 | [3.0e00, 3.0e02, np.nan], 454 | [4.0e00, 4.0e02, np.nan], 455 | [5.0e00, 5.0e02, 5.0e03], 456 | [6.0e00, 6.0e02, 6.0e03], 457 | [7.0e00, 7.0e02, 7.0e03], 458 | [8.0e00, 8.0e02, 8.0e03], 459 | [8.2e00, 8.2e02, np.nan], 460 | [8.2e00, 8.2e02, 8.0e03], 461 | ] 462 | ) 463 | np.testing.assert_allclose(result, expected, equal_nan=True) 464 | 465 | 466 | def test_winsorize_axis_1_with_nans(sample_rows_with_nans): 467 | """ 468 | Test winsorize function with NaN values. 469 | 470 | This test verifies that the function handles NaN values correctly. 471 | 472 | Raises 473 | ------ 474 | AssertionError 475 | If the winsorized output doesn't handle NaN values as expected. 476 | """ 477 | result = winsorize(sample_rows_with_nans, percentile=0.2, axis=1) 478 | # take transpose of axis=0 test to test axis=1 479 | expected = np.array( 480 | [ 481 | [2.8e00, 2.8e02, np.nan], 482 | [2.8e00, 2.8e02, 5.0e03], 483 | [3.0e00, 3.0e02, np.nan], 484 | [4.0e00, 4.0e02, np.nan], 485 | [5.0e00, 5.0e02, 5.0e03], 486 | [6.0e00, 6.0e02, 6.0e03], 487 | [7.0e00, 7.0e02, 7.0e03], 488 | [8.0e00, 8.0e02, 8.0e03], 489 | [8.2e00, 8.2e02, np.nan], 490 | [8.2e00, 8.2e02, 8.0e03], 491 | ] 492 | ).T 493 | np.testing.assert_allclose(result, expected, equal_nan=True) 494 | 495 | 496 | def test_winsorize_invalid_percentile(sample_rows): 497 | """ 498 | Test winsorize function with invalid percentile values. 499 | 500 | This test checks if the function raises a ValueError for percentile values outside [0, 1]. 501 | 502 | Raises 503 | ------ 504 | AssertionError 505 | If the function doesn't raise a ValueError for invalid percentile values. 506 | """ 507 | with pytest.raises(TypeError): 508 | winsorize(sample_rows, percentile="string_percentile") 509 | with pytest.raises(ValueError): 510 | winsorize(sample_rows, percentile=-0.1) 511 | with pytest.raises(ValueError): 512 | winsorize(sample_rows, percentile=1.1) 513 | 514 | 515 | def test_winsorize_1d_array(): 516 | """ 517 | Test winsorize function with a 1D array. 518 | 519 | This test verifies that the function works correctly with 1D input. 520 | 521 | Raises 522 | ------ 523 | AssertionError 524 | If the winsorized output doesn't match the expected result for a 1D array. 525 | """ 526 | data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) 527 | result = winsorize(data, percentile=0.2) 528 | expected = np.array([2.8, 2.8, 3, 4, 5, 6, 7, 8, 8.2, 8.2]) 529 | np.testing.assert_array_equal(result, expected) 530 | 531 | 532 | def test_winsorize_empty_array(): 533 | """ 534 | Test winsorize function with an empty array. 535 | 536 | This test checks if the function handles empty arrays correctly. 537 | 538 | Raises 539 | ------ 540 | AssertionError 541 | If the function doesn't return an empty array for empty input. 542 | """ 543 | data = np.array([]) 544 | result = winsorize(data) 545 | np.testing.assert_array_equal(result, data) 546 | 547 | 548 | def test_winsorize_all_nan(): 549 | """ 550 | Test winsorize function with an array containing only NaN values. 551 | 552 | This test verifies that the function handles arrays with only NaN values correctly. 553 | 554 | Raises 555 | ------ 556 | AssertionError 557 | If the function doesn't return an array of NaN values for all-NaN input. 
558 | """ 559 | data = np.array([[np.nan, np.nan], [np.nan, np.nan]]) 560 | result = winsorize(data) 561 | np.testing.assert_array_equal(result, data) 562 | 563 | 564 | ### 565 | # winsorize_xsection 566 | ### 567 | 568 | 569 | @pytest.fixture 570 | def sample_df(): 571 | """ 572 | Fixture to provide a sample DataFrame for testing. 573 | """ 574 | return pl.DataFrame( 575 | { 576 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 577 | "value1": [1, 2, 10, 4, 5, 20, 7, 8, 30], 578 | "value2": [100, 200, 1000, 400, 500, 2000, 700, 800, 3000], 579 | } 580 | ) 581 | 582 | 583 | @pytest.mark.parametrize("lazy", [True, False]) 584 | def test_winsorize_xsection(sample_df, lazy): 585 | """ 586 | Test basic functionality of winsorize_xsection. 587 | """ 588 | if lazy: 589 | result = winsorize_xsection( 590 | sample_df.lazy(), data_cols=("value1", "value2"), group_col="group", percentile=0.1 591 | ).sort("group") 592 | assert isinstance(result, pl.LazyFrame) 593 | result = result.collect() 594 | else: 595 | result = winsorize_xsection(sample_df, data_cols=("value1", "value2"), group_col="group", percentile=0.1).sort( 596 | "group" 597 | ) 598 | 599 | expected = pl.DataFrame( 600 | { 601 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 602 | "value1": [1.2, 2.0, 8.4, 4.2, 5.0, 17.0, 7.2, 8.0, 25.6], 603 | "value2": [120.0, 200.0, 840.0, 420.0, 500.0, 1700.0, 720.0, 800.0, 2560.0], 604 | } 605 | ) 606 | 607 | assert_frame_equal(result, expected, check_exact=False) 608 | 609 | 610 | @pytest.fixture 611 | def sample_df_with_nans(): 612 | """ 613 | Fixture to provide a sample DataFrame with NaN values for testing. 614 | """ 615 | return pl.DataFrame( 616 | { 617 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 618 | "value1": [1, np.nan, 10, 4, 5, np.nan, 7, 8, 30], 619 | "value2": [100, 200, np.nan, np.nan, 500, 2000, 700, np.nan, 3000], 620 | }, 621 | strict=False, 622 | ) 623 | 624 | 625 | def test_winsorize_xsection_with_nans(sample_df_with_nans): 626 | """ 627 | Test winsorize_xsection with NaN values. 628 | """ 629 | result = winsorize_xsection( 630 | sample_df_with_nans, data_cols=("value1", "value2"), group_col="group", percentile=0.1 631 | ).sort("group") 632 | 633 | expected = pl.DataFrame( 634 | { 635 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 636 | "value1": [1.9, np.nan, 9.1, 4.1, 4.9, np.nan, 7.2, 8.0, 25.6], 637 | "value2": [110.0, 190.0, np.nan, np.nan, 650.0, 1850.0, 930.0, np.nan, 2770.0], 638 | }, 639 | strict=False, 640 | ) 641 | 642 | assert_frame_equal(result, expected, check_exact=False) 643 | 644 | 645 | ### 646 | # `xsection_percentiles` 647 | ### 648 | 649 | 650 | def test_xsection_percentiles(sample_df): 651 | """ 652 | Test basic functionality of xsection_percentiles. 653 | """ 654 | result = sample_df.with_columns(percentiles_xsection("value1", "group", 0.25, 0.75).alias("result")).sort("group") 655 | 656 | expected = pl.DataFrame( 657 | { 658 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 659 | "value1": [1, 2, 10, 4, 5, 20, 7, 8, 30], 660 | "result": [1.0, 2.0, 10.0, 4.0, 5.0, 20.0, 7.0, 8.0, 30.0], 661 | } 662 | ) 663 | 664 | pl.testing.assert_frame_equal(result.select("group", "value1", "result"), expected) 665 | 666 | 667 | def test_xsection_percentiles_with_nans(sample_df_with_nans): 668 | """ 669 | Test xsection_percentiles with NaN values. 
670 | """ 671 | result = sample_df_with_nans.with_columns(percentiles_xsection("value1", "group", 0.25, 0.75).alias("result")) 672 | 673 | expected = pl.DataFrame( 674 | { 675 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 676 | "value1": [1.0, np.nan, 10.0, 4.0, 5.0, np.nan, 7.0, 8.0, 30.0], 677 | "result": [1.0, np.nan, 10.0, 4.0, 5.0, np.nan, 7.0, 8.0, 30.0], 678 | } 679 | ) 680 | 681 | pl.testing.assert_frame_equal(result.select("group", "value1", "result"), expected) 682 | 683 | 684 | ### 685 | # `exp_weights` 686 | ### 687 | 688 | 689 | def test_exp_weights_basic(): 690 | """ 691 | Test basic functionality of exp_weights. 692 | """ 693 | result = exp_weights(window=5, half_life=2) 694 | expected = np.array([0.25, 0.35355339, 0.5, 0.70710678, 1.0]) 695 | np.testing.assert_array_almost_equal(result, expected, decimal=6) 696 | 697 | 698 | def test_exp_weights_window_1(): 699 | """ 700 | Test exp_weights with window of 1. 701 | """ 702 | result = exp_weights(window=1, half_life=2) 703 | expected = np.array([1.0]) 704 | np.testing.assert_array_almost_equal(result, expected) 705 | 706 | 707 | def test_exp_weights_half_life_1(): 708 | """ 709 | Test exp_weights with half_life of 1. 710 | """ 711 | result = exp_weights(window=5, half_life=1) 712 | expected = np.array([0.0625, 0.125, 0.25, 0.5, 1.0]) 713 | np.testing.assert_array_almost_equal(result, expected) 714 | 715 | 716 | def test_exp_weights_large_window(): 717 | """ 718 | Test exp_weights with a large window. 719 | """ 720 | result = exp_weights(window=100, half_life=10) 721 | assert len(result) == 100 722 | assert result[-1] == 1.0 723 | assert result[0] < result[-1] 724 | 725 | 726 | def test_exp_weights_decreasing(): 727 | """ 728 | Test that weights are decreasing from end to start. 729 | """ 730 | result = exp_weights(window=10, half_life=3) 731 | assert np.all(np.diff(result) > 0) 732 | 733 | 734 | def test_exp_weights_half_life(): 735 | """ 736 | Test that weights actually decay by half each half_life. 737 | """ 738 | half_life = 5 739 | window = 20 740 | weights = exp_weights(window, half_life) 741 | for i in range(0, window - half_life, half_life): 742 | assert np.isclose(weights[i], 0.5 * weights[i + half_life], rtol=1e-5) 743 | 744 | 745 | def test_exp_weights_invalid_window(): 746 | """ 747 | Test exp_weights with invalid window value. 748 | """ 749 | with pytest.raises(ValueError): 750 | exp_weights(window=0, half_life=2) 751 | 752 | with pytest.raises(ValueError): 753 | exp_weights(window=-1, half_life=2) 754 | 755 | with pytest.raises(TypeError): 756 | exp_weights(window="window", half_life=2) 757 | 758 | with pytest.raises(TypeError): 759 | exp_weights(window=5.1, half_life=3) 760 | 761 | 762 | def test_exp_weights_invalid_half_life(): 763 | """ 764 | Test exp_weights with invalid half_life value. 765 | """ 766 | with pytest.raises(ValueError): 767 | exp_weights(window=5, half_life=0) 768 | 769 | with pytest.raises(ValueError): 770 | exp_weights(window=5, half_life=-1) 771 | 772 | with pytest.raises(TypeError): 773 | exp_weights(window=5, half_life="half_life") 774 | 775 | with pytest.raises(TypeError): 776 | exp_weights(window=5, half_life=3.2) 777 | 778 | 779 | def test_output(): 780 | """ 781 | Test with a specific input and output. 
782 | """ 783 | result = exp_weights(10, 10) 784 | expected = np.array( 785 | [0.53588673, 0.57434918, 0.61557221, 0.65975396, 0.70710678, 0.75785828, 0.8122524, 0.87055056, 0.93303299, 1.0] 786 | ) 787 | np.testing.assert_array_almost_equal(result, expected) 788 | -------------------------------------------------------------------------------- /toraniko/tests/test_model.py: -------------------------------------------------------------------------------- 1 | """Tests functions in the model module.""" 2 | 3 | import polars as pl 4 | import pytest 5 | import numpy as np 6 | from toraniko.model import _factor_returns, estimate_factor_returns 7 | 8 | ### 9 | # `_factor_returns` 10 | ### 11 | 12 | 13 | @pytest.fixture 14 | def sample_data(): 15 | n_assets = 100 16 | n_sectors = 10 17 | n_styles = 5 18 | 19 | np.random.seed(42) 20 | returns = np.random.randn(n_assets, 1) 21 | mkt_caps = np.abs(np.random.randn(n_assets, 1)) 22 | sector_scores = np.random.randint(0, 2, size=(n_assets, n_sectors)) 23 | style_scores = np.random.randn(n_assets, n_styles) 24 | 25 | return returns, mkt_caps, sector_scores, style_scores 26 | 27 | 28 | def test_output_shape_and_values(sample_data): 29 | returns, mkt_caps, sector_scores, style_scores = sample_data 30 | fac_ret, epsilon = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 31 | 32 | assert fac_ret.shape == (1 + sector_scores.shape[1] + style_scores.shape[1], 1) 33 | assert epsilon.shape == returns.shape 34 | # evaluate expected vs actual fac_ret and epsilon as test vectors 35 | expected_fac_ret = np.array( 36 | [ 37 | [-0.1619187], 38 | [0.0743045], 39 | [-0.21585373], 40 | [-0.27895516], 41 | [0.21496233], 42 | [-0.09829407], 43 | [-0.36415363], 44 | [0.16375184], 45 | [0.12933617], 46 | [0.35729484], 47 | [0.0176069], 48 | [0.13510884], 49 | [0.10831872], 50 | [-0.05781987], 51 | [0.11867375], 52 | [0.07687654], 53 | ] 54 | ) 55 | np.testing.assert_array_almost_equal(fac_ret, expected_fac_ret) 56 | expected_epsilon_first_10 = np.array( 57 | [ 58 | [0.04899388], 59 | [0.30702498], 60 | [0.23548647], 61 | [1.29181826], 62 | [0.30327413], 63 | [0.22804785], 64 | [1.40533523], 65 | [1.37903988], 66 | [0.08243883], 67 | [1.37698801], 68 | ] 69 | ) 70 | np.testing.assert_array_almost_equal(epsilon[:10], expected_epsilon_first_10) 71 | 72 | 73 | def test_residualize_styles(sample_data): 74 | returns, mkt_caps, sector_scores, style_scores = sample_data 75 | 76 | # if we residualize the styles we should obtain different returns out of the function 77 | 78 | fac_ret_res, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 79 | fac_ret_non_res, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, False) 80 | 81 | assert not np.allclose(fac_ret_res, fac_ret_non_res) 82 | 83 | 84 | def test_sector_constraint(sample_data): 85 | returns, mkt_caps, sector_scores, style_scores = sample_data 86 | fac_ret, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 87 | 88 | sector_returns = fac_ret[1 : sector_scores.shape[1] + 1] 89 | assert np.isclose(np.sum(sector_returns), 0, atol=1e-10) 90 | 91 | 92 | def test_zero_returns(): 93 | n_assets = 50 94 | n_sectors = 5 95 | n_styles = 3 96 | 97 | returns = np.zeros((n_assets, 1)) 98 | mkt_caps = np.ones((n_assets, 1)) 99 | sector_scores = np.random.randint(0, 2, size=(n_assets, n_sectors)) 100 | style_scores = np.random.randn(n_assets, n_styles) 101 | 102 | fac_ret, epsilon = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 103 | 104 | 
assert np.allclose(fac_ret, 0) 105 | assert np.allclose(epsilon, 0) 106 | 107 | 108 | def test_market_cap_weighting(sample_data): 109 | returns, mkt_caps, sector_scores, style_scores = sample_data 110 | 111 | fac_ret1, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 112 | fac_ret2, _ = _factor_returns(returns, np.ones_like(mkt_caps), sector_scores, style_scores, True) 113 | 114 | assert not np.allclose(fac_ret1, fac_ret2) 115 | 116 | 117 | def test_reproducibility(sample_data): 118 | returns, mkt_caps, sector_scores, style_scores = sample_data 119 | 120 | fac_ret1, epsilon1 = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 121 | fac_ret2, epsilon2 = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True) 122 | 123 | assert np.allclose(fac_ret1, fac_ret2) 124 | assert np.allclose(epsilon1, epsilon2) 125 | 126 | 127 | ### 128 | # `estimate_factor_returns` 129 | ### 130 | 131 | 132 | def model_sample_data(): 133 | dates = ["2021-01-01", "2021-01-02", "2021-01-03"] 134 | symbols = ["AAPL", "MSFT", "GOOGL"] 135 | returns_data = {"date": dates * 3, "symbol": symbols * 3, "asset_returns": np.random.randn(9)} 136 | mkt_cap_data = {"date": dates * 3, "symbol": symbols * 3, "market_cap": np.random.rand(9) * 1000} 137 | sector_data = {"date": dates * 3, "symbol": symbols * 3, "Tech": [1, 0, 0] * 3, "Finance": [0, 1, 0] * 3} 138 | style_data = {"date": dates * 3, "symbol": symbols * 3, "Value": [1, 0, 0] * 3, "Growth": [0, 1, 0] * 3} 139 | return (pl.DataFrame(returns_data), pl.DataFrame(mkt_cap_data), pl.DataFrame(sector_data), pl.DataFrame(style_data)) 140 | 141 | 142 | def test_estimate_factor_returns_normal(): 143 | returns_df, mkt_cap_df, sector_df, style_df = model_sample_data() 144 | factor_returns, residual_returns = estimate_factor_returns(returns_df, mkt_cap_df, sector_df, style_df) 145 | assert isinstance(factor_returns, pl.DataFrame) 146 | assert isinstance(residual_returns, pl.DataFrame) 147 | assert factor_returns.shape[0] == 3 # Number of dates 148 | assert residual_returns.shape[0] == 3 # Number of dates 149 | 150 | 151 | def test_estimate_factor_returns_empty(): 152 | empty_df = pl.DataFrame() 153 | with pytest.raises(ValueError): 154 | estimate_factor_returns(empty_df, empty_df, empty_df, empty_df) 155 | 156 | 157 | def test_estimate_factor_returns_incorrect_types(): 158 | with pytest.raises(TypeError): 159 | estimate_factor_returns("not_a_dataframe", "not_a_dataframe", "not_a_dataframe", "not_a_dataframe") 160 | 161 | 162 | def test_estimate_factor_returns_missing_columns(): 163 | returns_df, mkt_cap_df, sector_df, style_df = model_sample_data() 164 | with pytest.raises(ValueError): 165 | estimate_factor_returns(returns_df.drop("asset_returns"), mkt_cap_df, sector_df, style_df) 166 | -------------------------------------------------------------------------------- /toraniko/tests/test_utils.py: -------------------------------------------------------------------------------- 1 | """Test functions in the utils module.""" 2 | 3 | import pytest 4 | import polars as pl 5 | import numpy as np 6 | from polars.testing import assert_frame_equal 7 | 8 | from toraniko.utils import fill_features, smooth_features, top_n_by_group 9 | 10 | ### 11 | # `fill_features` 12 | ### 13 | 14 | 15 | @pytest.fixture 16 | def sample_df(): 17 | return pl.DataFrame( 18 | { 19 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"], 20 | "group": ["A", "A", "A", "B", "B"], 21 | "feature1": [1.0, np.nan, 3.0, np.inf, 5.0], 22 | 
"feature2": [np.nan, 2.0, np.nan, 4.0, np.nan], 23 | } 24 | ) 25 | 26 | 27 | @pytest.mark.parametrize("lazy", [True, False]) 28 | def test_fill_features(sample_df, lazy): 29 | if lazy: 30 | inp = sample_df.lazy() 31 | else: 32 | inp = sample_df 33 | result = ( 34 | fill_features(inp, features=("feature1", "feature2"), sort_col="date", over_col="group").sort("group").collect() 35 | ) 36 | 37 | expected = pl.DataFrame( 38 | { 39 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"], 40 | "group": ["A", "A", "A", "B", "B"], 41 | "feature1": [1.0, 1.0, 3.0, None, 5.0], 42 | "feature2": [None, 2.0, 2.0, 4.0, 4.0], 43 | } 44 | ) 45 | 46 | pl.testing.assert_frame_equal(result, expected) 47 | 48 | 49 | def test_fill_features_invalid_input(sample_df): 50 | with pytest.raises(ValueError): 51 | fill_features(sample_df, features=("non_existent_feature",), sort_col="date", over_col="group") 52 | with pytest.raises(TypeError): 53 | fill_features("not_a_dataframe", features=("feature1",), sort_col="date", over_col="group") 54 | 55 | 56 | def test_fill_features_all_null_column(sample_df): 57 | df_with_null_column = sample_df.with_columns(pl.lit(None).alias("null_feature")) 58 | result = fill_features(df_with_null_column, features=("null_feature",), sort_col="date", over_col="group") 59 | result = result.collect() 60 | 61 | assert all(result["null_feature"].is_null()) 62 | 63 | 64 | def test_fill_features_multiple_groups(sample_df): 65 | multi_group_df = pl.DataFrame( 66 | { 67 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"], 68 | "group": ["A", "A", "A", "B", "B", "C"], 69 | "feature1": [1.0, np.nan, 3.0, np.inf, 5.0, np.nan], 70 | "feature2": [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0], 71 | } 72 | ) 73 | 74 | result = fill_features(multi_group_df, features=("feature1", "feature2"), sort_col="date", over_col="group") 75 | result = result.collect() 76 | 77 | expected = pl.DataFrame( 78 | { 79 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"], 80 | "group": ["A", "A", "A", "B", "B", "C"], 81 | "feature1": [1.0, 1.0, 3.0, None, 5.0, None], 82 | "feature2": [None, 2.0, 2.0, 4.0, 4.0, 6.0], 83 | } 84 | ) 85 | 86 | pl.testing.assert_frame_equal(result, expected) 87 | 88 | 89 | def test_fill_features_different_sort_order(sample_df): 90 | result = fill_features(sample_df, features=("feature1", "feature2"), sort_col="date", over_col="group") 91 | result = result.sort("date", descending=True).collect() 92 | 93 | expected = pl.DataFrame( 94 | { 95 | "date": ["2023-01-05", "2023-01-04", "2023-01-03", "2023-01-02", "2023-01-01"], 96 | "group": ["B", "B", "A", "A", "A"], 97 | "feature1": [5.0, None, 3.0, 1.0, 1.0], 98 | "feature2": [4.0, 4.0, 2.0, 2.0, None], 99 | } 100 | ) 101 | 102 | pl.testing.assert_frame_equal(result, expected) 103 | 104 | 105 | ### 106 | # `smooth_features` 107 | ### 108 | 109 | 110 | @pytest.fixture 111 | def sample_smooth_df(): 112 | return pl.DataFrame( 113 | { 114 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"], 115 | "group": ["A", "A", "A", "B", "B", "B"], 116 | "feature1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 117 | "feature2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0], 118 | } 119 | ) 120 | 121 | 122 | @pytest.mark.parametrize("lazy", [True, False]) 123 | def test_smooth_features(sample_smooth_df, lazy): 124 | if lazy: 125 | inp = sample_smooth_df.lazy() 126 | else: 127 | inp = sample_smooth_df 128 | result = ( 129 | smooth_features(inp, 
features=("feature1", "feature2"), sort_col="date", over_col="group", window_size=2) 130 | .sort("group") 131 | .collect() 132 | ) 133 | 134 | expected = pl.DataFrame( 135 | { 136 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"], 137 | "group": ["A", "A", "A", "B", "B", "B"], 138 | "feature1": [None, 1.5, 2.5, None, 4.5, 5.5], 139 | "feature2": [None, 15.0, 25.0, None, 45.0, 55.0], 140 | } 141 | ) 142 | 143 | pl.testing.assert_frame_equal(result, expected) 144 | 145 | 146 | def test_smooth_features_larger_window(sample_smooth_df): 147 | result = smooth_features( 148 | sample_smooth_df, features=("feature1", "feature2"), sort_col="date", over_col="group", window_size=10 149 | ) 150 | result = result.collect() 151 | 152 | expected = pl.DataFrame( 153 | { 154 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"], 155 | "group": ["A", "A", "A", "B", "B", "B"], 156 | "feature1": [None, None, None, None, None, None], 157 | "feature2": [None, None, None, None, None, None], 158 | } 159 | ).with_columns(pl.col("feature1").cast(float).alias("feature1"), pl.col("feature2").cast(float).alias("feature2")) 160 | 161 | pl.testing.assert_frame_equal(result, expected) 162 | 163 | 164 | def test_smooth_features_invalid_input(sample_smooth_df): 165 | with pytest.raises(TypeError): 166 | smooth_features("not_a_dataframe", features=("feature1",), sort_col="date", over_col="group", window_size=2) 167 | with pytest.raises(ValueError): 168 | smooth_features( 169 | sample_smooth_df, features=("non_existent_feature",), sort_col="date", over_col="group", window_size=2 170 | ) 171 | 172 | 173 | def test_smooth_features_with_nulls(sample_smooth_df): 174 | df_with_nulls = sample_smooth_df.with_columns( 175 | pl.when(pl.col("feature1") == 3.0).then(None).otherwise(pl.col("feature1")).alias("feature1") 176 | ) 177 | result = smooth_features( 178 | df_with_nulls, features=("feature1",), sort_col="date", over_col="group", window_size=2 179 | ).sort("date", "group") 180 | result = result.collect() 181 | 182 | expected = pl.DataFrame( 183 | { 184 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"], 185 | "group": ["A", "A", "A", "B", "B", "B"], 186 | "feature1": [None, 1.5, None, None, 4.5, 5.5], 187 | "feature2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0], 188 | } 189 | ) 190 | 191 | pl.testing.assert_frame_equal(result, expected) 192 | 193 | 194 | def test_smooth_features_window_size_one(sample_smooth_df): 195 | result = smooth_features( 196 | sample_smooth_df, features=("feature1", "feature2"), sort_col="date", over_col="group", window_size=1 197 | ) 198 | result = result.collect() 199 | 200 | # With window_size=1, the result should be the same as the input 201 | pl.testing.assert_frame_equal(result, sample_smooth_df) 202 | 203 | 204 | ### 205 | # `top_n_by_group` 206 | ### 207 | 208 | 209 | @pytest.fixture 210 | def top_n_df(): 211 | return pl.DataFrame( 212 | { 213 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 214 | "subgroup": [1, 1, 2, 1, 2, 2, 1, 2, 3], 215 | "value": [10, 20, 30, 15, 25, 35, 5, 15, 25], 216 | } 217 | ) 218 | 219 | 220 | @pytest.mark.parametrize("filter", [True, False]) 221 | def test_top_n_by_group(top_n_df, filter): 222 | result = ( 223 | top_n_by_group(top_n_df, n=2, rank_var="value", group_var=("group",), filter=filter) 224 | .collect() 225 | .sort("group", "subgroup") 226 | ) 227 | if filter: 228 | assert result.shape == (6, 3) 229 | assert 
result["group"].to_list() == ["A", "A", "B", "B", "C", "C"] 230 | assert result["value"].to_list() == [20, 30, 25, 35, 15, 25] 231 | else: 232 | assert result.shape == (9, 4) 233 | assert "rank_mask" in result.columns 234 | assert result["rank_mask"].to_list() == [0, 1, 1, 0, 1, 1, 0, 1, 1] 235 | 236 | 237 | def test_top_n_by_multiple_groups(top_n_df): 238 | result = ( 239 | top_n_by_group(top_n_df, n=1, rank_var="value", group_var=("group", "subgroup"), filter=True) 240 | .sort("group", "subgroup") 241 | .collect() 242 | ) 243 | assert result.shape == (7, 3) 244 | assert result["group"].to_list() == ["A", "A", "B", "B", "C", "C", "C"] 245 | assert result["subgroup"].to_list() == [1, 2, 1, 2, 1, 2, 3] 246 | assert result["value"].to_list() == [20, 30, 15, 35, 5, 15, 25] 247 | 248 | 249 | @pytest.mark.parametrize("filter", [True, False]) 250 | def test_lazyframe_input(filter): 251 | lazy_df = pl.LazyFrame({"group": ["A", "B"], "value": [1, 2]}) 252 | result = top_n_by_group(lazy_df, n=1, rank_var="value", group_var=("group",), filter=filter) 253 | assert isinstance(result, pl.LazyFrame) 254 | 255 | 256 | def test_invalid_input(): 257 | df = pl.DataFrame({"group": ["A", "B"], "wrong_column": [1, 2]}) 258 | with pytest.raises(ValueError, match="missing one or more required columns"): 259 | top_n_by_group(df, n=1, rank_var="value", group_var=("group",), filter=True) 260 | with pytest.raises(TypeError, match="must be a Polars DataFrame or LazyFrame"): 261 | top_n_by_group("not_a_dataframe", n=1, rank_var="value", group_var=("group",), filter=True) 262 | 263 | 264 | def test_empty_dataframe(): 265 | empty_df = pl.DataFrame({"group": [], "value": []}) 266 | result = top_n_by_group(empty_df, n=1, rank_var="value", group_var=("group",), filter=True).collect() 267 | assert result.shape == (0, 2) 268 | 269 | 270 | def test_n_greater_than_group_size(top_n_df): 271 | result = top_n_by_group(top_n_df, n=5, rank_var="value", group_var=("group",), filter=True).collect() 272 | assert result.shape == (9, 3) 273 | 274 | 275 | def test_tie_handling(top_n_df): 276 | df_with_ties = top_n_df.with_columns( 277 | pl.when(pl.col("value") == 25).then(26).otherwise(pl.col("value")).alias("value") 278 | ) 279 | result = ( 280 | top_n_by_group(df_with_ties, n=1, rank_var="value", group_var=("group",), filter=True) 281 | .sort("group", "subgroup") 282 | .collect() 283 | ) 284 | assert result.shape == (3, 3) 285 | assert result["value"].to_list() == [30, 35, 26] 286 | -------------------------------------------------------------------------------- /toraniko/utils.py: -------------------------------------------------------------------------------- 1 | """Utility functions, primarily for data cleaning.""" 2 | 3 | import numpy as np 4 | import polars as pl 5 | 6 | 7 | def fill_features( 8 | df: pl.DataFrame | pl.LazyFrame, features: tuple[str, ...], sort_col: str, over_col: str 9 | ) -> pl.LazyFrame: 10 | """Cast feature columns to numeric (float), convert NaN and inf values to null, then forward fill nulls 11 | for each column of `features`, sorted on `sort_col` and partitioned by `over_col`. 
12 | 13 | Parameters 14 | ---------- 15 | df: Polars DataFrame or LazyFrame containing columns `sort_col`, `over_col` and each of `features` 16 | features: collection of strings indicating which columns of `df` are the feature values 17 | sort_col: str column of `df` indicating how to sort 18 | over_col: str column of `df` indicating how to partition 19 | 20 | Returns 21 | ------- 22 | Polars LazyFrame containing the original columns with cleaned feature data 23 | """ 24 | try: 25 | # eagerly check all `features`, `sort_col`, `over_col` present: can't catch ColumNotFoundError in lazy context 26 | assert all(c in df.columns for c in features + (sort_col, over_col)) 27 | return ( 28 | df.lazy() 29 | .with_columns([pl.col(f).cast(float).alias(f) for f in features]) 30 | .with_columns( 31 | [ 32 | pl.when( 33 | (pl.col(f).abs() == np.inf) 34 | | (pl.col(f) == np.nan) 35 | | (pl.col(f).is_null()) 36 | | (pl.col(f).cast(str) == "NaN") 37 | ) 38 | .then(None) 39 | .otherwise(pl.col(f)) 40 | .alias(f) 41 | for f in features 42 | ] 43 | ) 44 | .sort(by=sort_col) 45 | .with_columns([pl.col(f).forward_fill().over(over_col).alias(f) for f in features]) 46 | ) 47 | except AttributeError as e: 48 | raise TypeError("`df` must be a Polars DataFrame | LazyFrame, but it's missing required attributes") from e 49 | except AssertionError as e: 50 | raise ValueError(f"`df` must have all of {[over_col, sort_col] + list(features)} as columns") from e 51 | 52 | 53 | def smooth_features( 54 | df: pl.DataFrame | pl.LazyFrame, 55 | features: tuple[str, ...], 56 | sort_col: str, 57 | over_col: str, 58 | window_size: int, 59 | ) -> pl.LazyFrame: 60 | """Smooth the `features` columns of `df` by taking the rolling mean of each, sorted over `sort_col` and 61 | partitioned by `over_col`, using `window_size` trailing periods for the moving average window. 62 | 63 | Parameters 64 | ---------- 65 | df: Polars DataFrame | LazyFrame containing columns `sort_col`, `over_col` and each of `features` 66 | features: collection of strings indicating which columns of `df` are the feature values 67 | sort_col: str column of `df` indicating how to sort 68 | over_col: str column of `df` indicating how to partition 69 | window_size: int number of time periods for the moving average 70 | 71 | Returns 72 | ------- 73 | Polars LazyFrame containing the original columns, with each of `features` replaced with moving average 74 | """ 75 | try: 76 | # eagerly check `over_col`, `sort_col`, `features` present: can't catch pl.ColumnNotFoundError in lazy context 77 | assert all(c in df.columns for c in features + (over_col, sort_col)) 78 | return ( 79 | df.lazy() 80 | .sort(by=sort_col) 81 | .with_columns([pl.col(f).rolling_mean(window_size=window_size).over(over_col).alias(f) for f in features]) 82 | ) 83 | except AttributeError as e: 84 | raise TypeError("`df` must be a Polars DataFrame | LazyFrame, but it's missing required attributes") from e 85 | except AssertionError as e: 86 | raise ValueError(f"`df` must have all of {[over_col, sort_col] + list(features)} as columns") from e 87 | 88 | 89 | def top_n_by_group( 90 | df: pl.DataFrame | pl.LazyFrame, 91 | n: int, 92 | rank_var: str, 93 | group_var: tuple[str, ...], 94 | filter: bool = True, 95 | ) -> pl.LazyFrame: 96 | """Mark the top `n` rows in each of `group_var` according to `rank_var` descending. 97 | 98 | If `filter` is True, the returned DataFrame contains only the filtered data. 
If `filter` is False, 99 | the returned DataFrame has all data, with an additional 'rank_mask' column indicating if that row 100 | is in the filter. 101 | 102 | Parameters 103 | ---------- 104 | df: Polars DataFrame | LazyFrame 105 | n: integer indicating the top rows to take in each group 106 | rank_var: str column name to rank on 107 | group_var: tuple of str column names to group and sort on 108 | filter: boolean indicating how much data to return 109 | 110 | Returns 111 | ------- 112 | Polars LazyFrame containing original columns and optional filter column 113 | """ 114 | try: 115 | # eagerly check `rank_var`, `group_var` are present: we can't catch a ColumnNotFoundError in a lazy context 116 | assert all(c in df.columns for c in (rank_var,) + group_var) 117 | rdf = ( 118 | df.lazy() 119 | .sort(by=list(group_var) + [rank_var]) 120 | .with_columns(pl.col(rank_var).rank(descending=True).over(group_var).cast(int).alias("rank")) 121 | ) 122 | match filter: 123 | case True: 124 | return rdf.filter(pl.col("rank") <= n).drop("rank").sort(by=list(group_var) + [rank_var]) 125 | case False: 126 | return ( 127 | rdf.with_columns( 128 | pl.when(pl.col("rank") <= n).then(pl.lit(1)).otherwise(pl.lit(0)).alias("rank_mask") 129 | ) 130 | .drop("rank") 131 | .sort(by=list(group_var) + [rank_var]) 132 | ) 133 | except AssertionError as e: 134 | raise ValueError(f"`df` is missing one or more required columns: '{rank_var}' and '{group_var}'") from e 135 | except AttributeError as e: 136 | raise TypeError("`df` must be a Polars DataFrame or LazyFrame but is missing a required attribute") from e 137 | --------------------------------------------------------------------------------
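
The test modules above double as usage documentation for `toraniko.math` and `toraniko.utils`. As a closing illustration, here is a minimal sketch of how those helpers might be chained to clean the merged fundamental data before style factor scoring. It is only a sketch: the toy `features_df`, the fictional "XYZ" ticker, the universe cutoff `n=2`, the 10th/90th percentile winsorization level and the 2-period smoothing window are illustrative assumptions, and `winsorize_xsection` is assumed to pass non-winsorized columns (e.g. `symbol`, `market_cap`) through unchanged.

```
import numpy as np
import polars as pl

from toraniko.math import winsorize_xsection
from toraniko.utils import fill_features, smooth_features, top_n_by_group

# Toy stand-in for the merged (date, symbol) fundamental frame described in
# the README; "XYZ" is a fictional ticker and all values are made up.
features_df = pl.DataFrame(
    {
        "date": ["2024-01-02"] * 3 + ["2024-01-03"] * 3,
        "symbol": ["AAPL", "MSFT", "XYZ"] * 2,
        "book_price": [0.07, 0.09, 0.30, np.nan, np.inf, 0.31],
        "cf_price": [0.002, 0.003, 0.010, np.nan, 0.004, 0.011],
        "market_cap": [2.9e12, 3.1e12, 5.0e8, 2.9e12, 3.1e12, 5.1e8],
    }
)
value_cols = ("book_price", "cf_price")

# 1. Keep the largest names per date by market cap (n=2 only for the toy data).
universe = top_n_by_group(features_df, n=2, rank_var="market_cap", group_var=("date",), filter=True)

# 2. Cast the value metrics to float, null out NaN/inf, then forward fill per symbol in date order.
filled = fill_features(universe, features=value_cols, sort_col="date", over_col="symbol")

# 3. Winsorize each metric cross-sectionally within each date at the 10th/90th percentiles.
#    (Assumed to leave the other columns, e.g. "symbol", untouched.)
winsorized = winsorize_xsection(filled, data_cols=value_cols, group_col="date", percentile=0.1)

# 4. Smooth each metric with a trailing 2-period mean per symbol.
smoothed = smooth_features(winsorized, features=value_cols, sort_col="date", over_col="symbol", window_size=2)

# Each helper accepts a DataFrame or LazyFrame; the chain stays lazy until collect().
print(smoothed.collect())
```

From there, the style scorers in `toraniko.styles` and `estimate_factor_returns` in `toraniko.model` take over, as outlined in the README.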