├── .github
│   └── workflows
│       └── ci.yml
├── LICENSE
├── README.md
├── dev-requirements.txt
├── pyproject.toml
├── requirements.txt
├── setup.py
└── toraniko
    ├── __init__.py
    ├── math.py
    ├── model.py
    ├── styles.py
    ├── tests
    │   ├── __init__.py
    │   ├── test_math.py
    │   ├── test_model.py
    │   └── test_utils.py
    └── utils.py
/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
1 | name: CI
2 |
3 | on:
4 |   pull_request:
5 |     branches:
6 |       - main
7 |
8 | defaults:
9 |   run:
10 |     working-directory: toraniko
11 |     shell: bash
12 |
13 | jobs:
14 |   test:
15 |     runs-on: ubuntu-latest
16 |
17 |     steps:
18 |       - name: Checkout
19 |         uses: actions/checkout@v3
20 |
21 |       - name: Set Python
22 |         uses: actions/setup-python@v4
23 |         with:
24 |           python-version: '3.10'
25 |
26 |       - name: Install deps
27 |         run: |
28 |           python -m pip install --upgrade pip
29 |           pip install -r ../dev-requirements.txt
30 |
31 |       - name: Run tests
32 |         run: |
33 |           pytest tests/
34 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 0xfdf@implies.com
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # toraniko
2 |
3 | Toraniko is a complete implementation of a risk model suitable for quantitative and systematic trading at institutional scale. In particular, it is a characteristic factor model in the same vein as Barra and Axioma (in fact, given the same datasets, it approximately reproduces Barra's estimated factor returns).
4 |
5 | 
6 |
7 | Using this library, you can create new custom factors and estimate their returns. Then you can estimate a factor covariance matrix suitable for portfolio optimization with factor exposure constraints (e.g. to maintain a market-neutral portfolio).
8 |
9 | The only dependencies are numpy and polars. It supports market, sector and style factors; three styles are included: value, size and momentum. The library also comes with generalizable math and data cleaning utility functions you'd want to have for constructing more style factors (or custom fundamental factors of any kind).
10 |
11 | ## Installation
12 |
13 | Using pip:
14 |
15 | `pip install toraniko`
16 |
17 | ## User Manual
18 |
19 | #### Data
20 |
21 | You'll need the following data to run a complete model estimation (a rough sketch of deriving each input follows the list):
22 |
23 | 1. Sector scores, used for estimating the market and sector factor returns. GICS level 1 is suitable for this. The sector scores consist of one row per asset, with a 1 in the column for the asset's sector and 0s in every other column.
24 |
25 | ```
26 | symbol Basic Materials Communication Services Consumer Cyclical Consumer Defensive Energy Financial Services Healthcare Industrials Real Estate Technology Utilities
27 | str i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
28 | "A" 0 0 0 0 0 0 1 0 0 0 0
29 | "AA" 1 0 0 0 0 0 0 0 0 0 0
30 | "AACI" 0 0 0 0 0 1 0 0 0 0 0
31 | "AACT" 0 0 0 0 0 1 0 0 0 0 0
32 | "AADI" 0 0 0 0 0 0 1 0 0 0 0
33 | … … … … … … … … … … … …
34 | "ZVRA" 0 0 0 0 0 0 1 0 0 0 0
35 | "ZVSA" 0 0 0 0 0 0 1 0 0 0 0
36 | "ZWS" 0 0 0 0 0 0 0 1 0 0 0
37 | "ZYME" 0 0 0 0 0 0 1 0 0 0 0
38 | "ZYXI" 0 0 0 0 0 0 1 0 0 0 0
39 | ```
40 |
41 | 2. Symbol-by-symbol daily asset returns for a large universe of equities:
42 |
43 | ```
44 | date symbol asset_returns
45 | date str f64
46 | 2013-01-02 "A" 0.022962
47 | 2013-01-02 "AAMC" -0.073171
48 | 2013-01-02 "AAME" 0.035566
49 | 2013-01-02 "AAON" 0.019163
50 | 2013-01-02 "AAP" 0.001935
51 | … … …
52 | 2024-02-23 "ZVRA" -0.025
53 | 2024-02-23 "ZVSA" 0.291311
54 | 2024-02-23 "ZWS" 0.006378
55 | 2024-02-23 "ZYME" 0.000838
56 | 2024-02-23 "ZYXI" 0.001552
57 | ```
58 |
59 | 3. For the value factor: symbol-by-symbol daily market cap, cash flow, share count, revenue and book value estimates, so you can calculate book-price, sales-price and cash flow-price metrics:
60 |
61 | ```
62 | date symbol book_price sales_price cf_price market_cap
63 | date str f64 f64 f64 f64
64 | 2013-10-30 "AAPL" 0.343017 0.081994 0.007687 4.5701e11
65 | 2013-10-31 "AAPL" 0.342231 0.081763 0.007665 4.5830e11
66 | 2013-11-01 "AAPL" 0.341398 0.081521 0.007643 4.5966e11
67 | 2013-11-04 "AAPL" 0.340947 0.08137 0.007628 4.6051e11
68 | 2013-11-05 "AAPL" 0.340491 0.081219 0.007614 4.6137e11
69 | … … … … … …
70 | 2024-02-16 "AAPL" 0.072243 0.040937 0.001972 2.9209e12
71 | 2024-02-20 "AAPL" 0.072405 0.040942 0.001972 2.9206e12
72 | 2024-02-21 "AAPL" 0.072614 0.040973 0.001974 2.9184e12
73 | 2024-02-22 "AAPL" 0.072792 0.040987 0.001974 2.9174e12
74 | 2024-02-23 "AAPL" 0.073007 0.041021 0.001976 2.9150e12
75 | ```
76 |
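None of these inputs ship with the library; they come from your own data vendor. As a rough sketch of how each might be derived with Polars (the frames `memberships`, `prices` and `fundamentals`, and columns such as `sector`, `close`, `book_value`, `revenue` and `cash_flow`, are assumptions about your source data, not part of toraniko):

```
import polars as pl

# 1. one-hot sector scores from a hypothetical long-format membership table: | symbol | sector |
sector_scores = memberships.to_dummies(columns="sector")
sector_scores = sector_scores.rename(
    {c: c.removeprefix("sector_") for c in sector_scores.columns if c != "symbol"}
)

# 2. daily asset returns from a hypothetical adjusted close price table: | date | symbol | close |
returns_df = (
    prices.sort("date")
    .with_columns(pl.col("close").pct_change().over("symbol").alias("asset_returns"))
    .select("date", "symbol", "asset_returns")
    .drop_nulls("asset_returns")
)

# 3. price ratios from hypothetical fundamentals already joined to market cap by date and symbol
value_df = fundamentals.with_columns(
    (pl.col("book_value") / pl.col("market_cap")).alias("book_price"),
    (pl.col("revenue") / pl.col("market_cap")).alias("sales_price"),
    (pl.col("cash_flow") / pl.col("market_cap")).alias("cf_price"),
)
```
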
77 | #### Style factor score calculation
78 |
79 | Taking the foregoing data together you'll have:
80 |
81 | ```
82 | date symbol book_price sales_price cf_price market_cap asset_returns
83 | date str f64 f64 f64 f64 f64
84 | 2013-02-12 "A" null null null 1.1080e10 0.00045
85 | 2013-02-12 "AAON" null null null 5.5410e8 0.006741
86 | 2013-02-12 "AAP" null null null 5.4212e9 0.002679
87 | 2013-02-12 "AAPL" 0.302543 0.1189 null 4.5847e11 -0.025069
88 | 2013-02-12 "AAT" null null null 1.1135e9 0.006764
89 | … … … … … … …
90 | 2024-02-23 "ZS" 0.105832 0.016538 0.007052 3.5311e10 0.04015
91 | 2024-02-23 "ZTS" 0.125734 0.025145 -0.017418 8.8010e10 0.002797
92 | 2024-02-23 "ZUO" 0.424385 0.089243 0.073274 1.2309e9 0.002448
93 | 2024-02-23 "ZWS" 0.296327 0.067669 0.001916 5.2727e9 0.006378
94 | 2024-02-23 "ZYME" 0.363093 0.016538 -0.054523 7.3077e8 0.000838
95 | ```
96 |
97 | Then to estimate the momentum factor, you can run:
98 |
99 | ```
100 | from toraniko.styles import factor_mom
101 |
102 | mom_df = factor_mom(df.select("symbol", "date", "asset_returns"), trailing_days=252, winsor_factor=0.01).collect()
103 | ```
104 |
105 | and you'll obtain scores roughly resembling this histogram:
106 |
107 |
108 |
109 | Likewise for value, you can run:
110 |
111 | ```
112 | from toraniko.styles import factor_val
113 |
114 | value_df = factor_val(df.select("date", "symbol", "book_price", "sales_price", "cf_price")).collect()
115 | ```
116 |
117 | Similarly:
118 |
119 |
120 |
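The size factor follows the same pattern; a minimal sketch, assuming your frame carries the `date`, `symbol` and `market_cap` columns shown above:

```
from toraniko.styles import factor_sze

size_df = factor_sze(df.select("date", "symbol", "market_cap")).collect()
```
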
121 | #### Factor return estimation
122 |
123 | Let's say you've estimated three style factors: value, momentum, size.
124 |
125 | ```
126 | date symbol mom_score sze_score val_score
127 | date str f64 f64 f64
128 | 2014-03-07 "A" 0.793009 -0.872847 0.265801
129 | 2014-03-07 "AAON" 1.128932 0.939116 -1.050307
130 | 2014-03-07 "AAP" 2.190209 0.0 0.351273
131 | 2014-03-07 "AAPL" -0.202091 -3.373659 -0.289713
132 | 2014-03-07 "AAT" -0.413211 0.805413 0.052482
133 | … … … … …
134 | 2024-02-23 "ZS" 2.783905 -1.303952 -1.778753
135 | 2024-02-23 "ZTS" 0.030749 -1.905171 -1.619675
136 | 2024-02-23 "ZUO" 0.235533 0.905671 0.253427
137 | 2024-02-23 "ZWS" 0.594781 0.0 -0.520947
138 | 2024-02-23 "ZYME" -1.631452 1.248919 -1.451382
139 | ```
140 |
141 | Merge these with the aforementioned GICS sector scores, and take the top N by market cap each day to form a suitable estimation universe. Here we'll approximate the Russell 3000 by taking the top 3000 names:
142 |
143 | ```
144 | from toraniko.utils import top_n_by_group
145 |
146 | ddf = (
147 | ret_df.join(cap_df.drop("book_value"), on=["date", "symbol"])
148 | .join(sector_scores, on="symbol")
149 | .join(style_scores, on=["date", "symbol"])
150 | .drop_nulls()
151 | )
152 | ddf = (
153 | top_n_by_group(
154 | ddf.lazy(),
155 | 3000,
156 | "market_cap",
157 | ("date",),
158 | True
159 | )
160 | .collect()
161 | .sort("date", "symbol")
162 | )
163 |
164 | returns_df = ddf.select("date", "symbol", "asset_returns")
165 | mkt_cap_df = ddf.select("date", "symbol", "market_cap")
166 | sector_df = ddf.select(["date"] + list(sector_scores.columns))
167 | style_df = ddf.select(style_scores.columns)
168 | ```
169 |
170 | Then simply:
171 |
172 | ```
173 | from toraniko.model import estimate_factor_returns
174 |
175 | fac_df, eps_df = estimate_factor_returns(returns_df, mkt_cap_df, sector_df, style_df, winsor_factor=0.1, residualize_styles=False)
176 | ```
177 |
178 | On an M1 MacBook, this estimates 10+ years of daily market, sector and style factor returns in under a minute.
179 |
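From here you can also estimate the factor covariance matrix mentioned above for portfolio optimization. This is a minimal sketch rather than part of the library's API; the exponential weighting and the 126-day half-life are assumptions:

```
import numpy as np

from toraniko.math import exp_weights

# one row per date, one column per factor (market, sectors, styles)
factor_cols = [c for c in fac_df.columns if c != "date"]
F = fac_df.sort("date").select(factor_cols).to_numpy()

# exponentially weighted covariance with an assumed ~6 month half-life
w = exp_weights(window=F.shape[0], half_life=126)
w = w / w.sum()
demeaned = F - w @ F
factor_cov = (demeaned * w[:, None]).T @ demeaned  # shape: (n_factors, n_factors)
```
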
180 | Here is a comparison of the model's value factor output versus Barra's. Even on a relatively low-quality data source (Yahoo Finance) and without significant effort in cleaning corporate actions, the results are comparable over a 10-year period:
181 |
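To produce a comparison like this, accumulate the daily style factor returns from `fac_df`; a small sketch, assuming the value column is named `val_score` as in the tables above:

```
cum_val = fac_df.sort("date").select(
    "date",
    ((pl.col("val_score") + 1).cum_prod() - 1).alias("cum_val_score"),
)
```
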
182 | 
183 |
184 | 
185 |
--------------------------------------------------------------------------------
/dev-requirements.txt:
--------------------------------------------------------------------------------
1 | pytest~=7.4.4
2 | numpy~=1.26.2
3 | polars~=1.0
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["poetry-core"]
3 | build-backend = "poetry.core.masonry.api"
4 |
5 | [tool.poetry]
6 | name = "toraniko"
7 | version = "1.1.0"
8 | description = "A multi-factor equity risk model for quantitative trading."
9 | authors = ["0xfdf <0xfdf@implies.com>"]
10 | maintainers = ["0xfdf <0xfdf@implies.com>"]
11 | license = "MIT"
12 | readme = "README.md"
13 | homepage = "https://github.com/0xfdf/toraniko"
14 | repository = "https://github.com/0xfdf/toraniko"
15 | keywords = ["risk", "model", "portfolio", "optimization", "factor", "quant", "quantitative", "finance", "trading"]
16 | classifiers = [
17 | "Development Status :: 5 - Production/Stable",
18 | "Intended Audience :: Science/Research",
19 | "Programming Language :: Python :: 3",
20 | "Programming Language :: Python :: 3.10",
21 | "Programming Language :: Python :: 3.11",
22 | "Programming Language :: Python :: 3.12",
23 | "License :: OSI Approved :: MIT License",
24 | "Topic :: Scientific/Engineering"
25 | ]
26 |
27 | [tool.poetry.dependencies]
28 | python = ">=3.10,<4.0"
29 | numpy = "~1.26.2"
30 | polars = "~1.0.0"
31 |
32 | [tool.poetry.dev-dependencies]
33 | pytest = "~7.4.4"
34 |
35 | [tool.poetry.urls]
36 | homepage = "https://github.com/0xfdf/toraniko"
37 | repository = "https://github.com/0xfdf/toraniko"
38 | issues = "https://github.com/0xfdf/toraniko/issues"
39 | changelog = "https://github.com/0xfdf/toraniko/releases"
40 |
41 | [tool.ruff]
42 | line-length = 120
43 | fix = false
44 | select = ["E", "F", "I", "N", "Q", "R", "S", "T", "U", "W", "Y"]
45 |
46 | [tool.black]
47 | line-length = 120
48 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy~=1.26.2
2 | polars~=1.0
3 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | requirements = []
4 | with open('requirements.txt', 'r') as fh:
5 | for line in fh:
6 | requirements.append(line.strip())
7 |
8 | setup(
9 | name="toraniko",
10 | version="1.1.0",
11 | packages=find_packages(),
12 | install_requires = requirements
13 | )
14 |
--------------------------------------------------------------------------------
/toraniko/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xfdf/toraniko/253ce7044ef33eb718e671e9f918a88ba5fa3350/toraniko/__init__.py
--------------------------------------------------------------------------------
/toraniko/math.py:
--------------------------------------------------------------------------------
1 | """Basic mathematical and statistical operations used in the model."""
2 |
3 | import numpy as np
4 | import polars as pl
5 |
6 |
7 | def center_xsection(target_col: str, over_col: str, standardize: bool = False) -> pl.Expr:
8 | """Cross-sectionally center (and optionally standardize) a Polars DataFrame `target_col` partitioned by `over_col`.
9 |
10 | This returns a Polars expression, so it can be chained in a `select` or `with_columns` invocation
11 | without needing to set a new intermediate DataFrame or materialize lazy evaluation.
12 |
13 | Parameters
14 | ----------
15 | target_col: the column to be standardized
16 | over_col: the column over which standardization should be applied, cross-sectionally
17 | standardize: boolean indicating if we should also standardize the target column
18 |
19 | Returns
20 | -------
21 | Polars Expr
22 | """
23 | expr = pl.col(target_col) - pl.col(target_col).drop_nulls().drop_nans().mean().over(over_col)
24 | if standardize:
25 | return expr / pl.col(target_col).drop_nulls().drop_nans().std().over(over_col)
26 | return expr
27 |
28 |
29 | def norm_xsection(
30 | target_col: str,
31 | over_col: str,
32 | lower: int | float = 0,
33 | upper: int | float = 1,
34 | ) -> pl.Expr:
35 | """Cross-sectionally normalize a Polars DataFrame `target_col` partitioned by `over_col`, with rescaling
36 | to the interval [`lower`, `upper`].
37 |
38 | This returns a Polars expression, so it can be chained in a `select` or `with_columns` invocation
39 | without needing to set a new intermediate DataFrame or materialize lazy evaluation.
40 |
41 | NaN values are not propagated in the max and min calculation, but NaN values are preserved for normalization.
42 |
43 | Parameters
44 | ----------
45 | target_col: str name of the column to normalize
46 | over_col: str name of the column to partition the normalization by
47 | lower: lower bound of the rescaling interval, defaults to 0 to construct a percent
48 | upper: upper bound of the rescaling interval, defaults to 1 to construct a percent
49 |
50 | Returns
51 | -------
52 | Polars Expr
53 | """
54 | min_col = pl.col(target_col).drop_nans().min().over(over_col)
55 | max_col = pl.col(target_col).drop_nans().max().over(over_col)
56 |
57 | norm_col = (
58 | pl.when(pl.col(target_col).is_nan())
59 | .then(pl.col(target_col)) # Preserve NaN values
60 | .when(max_col != min_col) # Avoid division by zero by making sure min != max
61 | .then((pl.col(target_col) - min_col) / (max_col - min_col) * (upper - lower) + lower)
62 | .otherwise(lower)
63 | )
64 |
65 | return norm_col
66 |
67 |
68 | def winsorize(data: np.ndarray, percentile: float = 0.05, axis: int = 0) -> np.ndarray:
69 | """Windorize each vector of a 2D numpy array to symmetric percentiles given by `percentile`.
70 |
71 | Note this operates on and returns a numpy array, not a Polars expression; for the DataFrame/LazyFrame
72 | equivalent applied cross-sectionally, see `winsorize_xsection` below.
73 |
74 | Parameters
75 | ----------
76 | data: numpy array containing original data to be winsorized
77 | percentile: float indicating the percentiles to apply winsorization at
78 | axis: int indicating which axis to apply winsorization over (i.e. orientation if `data` is 2D)
79 |
80 | Returns
81 | -------
82 | numpy array
83 | """
84 | try:
85 | if not 0 <= percentile <= 1:
86 | raise ValueError("`percentile` must be between 0 and 1")
87 | except AttributeError as e:
88 | raise TypeError("`percentile` must be a numeric type, such as an int or float") from e
89 |
90 | fin_data = np.where(np.isfinite(data), data, np.nan)
91 |
92 | # compute lower and upper percentiles for each column
93 | lower_bounds = np.nanpercentile(fin_data, percentile * 100, axis=axis, keepdims=True)
94 | upper_bounds = np.nanpercentile(fin_data, (1 - percentile) * 100, axis=axis, keepdims=True)
95 |
96 | # clip data to within the bounds
97 | return np.clip(data, lower_bounds, upper_bounds)
98 |
99 |
100 | def winsorize_xsection(
101 | df: pl.DataFrame | pl.LazyFrame,
102 | data_cols: tuple[str, ...],
103 | group_col: str,
104 | percentile: float = 0.05,
105 | ) -> pl.DataFrame | pl.LazyFrame:
106 | """Cross-sectionally winsorize the `data_cols` of `df`, grouped on `group_col`, to the symmetric percentile
107 | given by `percentile`.
108 |
109 | Parameters
110 | ----------
111 | df: Polars DataFrame or LazyFrame containing feature data to winsorize
112 | data_cols: collection of strings indicating the columns of `df` to be winsorized
113 | group_col: str column of `df` to use as the cross-sectional group
114 | percentile: float value indicating the symmetric winsorization threshold
115 |
116 | Returns
117 | -------
118 | Polars DataFrame or LazyFrame
119 | """
120 |
121 | def winsorize_group(group: pl.DataFrame) -> pl.DataFrame:
122 | for col in data_cols:
123 | winsorized_data = winsorize(group[col].to_numpy(), percentile=percentile)
124 | group = group.with_columns(pl.Series(col, winsorized_data).alias(col))
125 | return group
126 |
127 | match df:
128 | case pl.DataFrame():
129 | grouped = df.group_by(group_col).map_groups(winsorize_group)
130 | case pl.LazyFrame():
131 | grouped = df.group_by(group_col).map_groups(winsorize_group, schema=df.collect_schema())
132 | case _:
133 | raise TypeError("`df` must be a Polars DataFrame or LazyFrame")
134 | return grouped
135 |
136 |
137 | def percentiles_xsection(
138 | target_col: str,
139 | over_col: str,
140 | lower_pct: float,
141 | upper_pct: float,
142 | fill_val: float | int = 0.0,
143 | ) -> pl.Expr:
144 | """Cross-sectionally mark all values of `target_col` that fall outside the `lower_pct` percentile or
145 | `upper_pct` percentile, within each `over_col` group. This is essentially an anti-winsorization, suitable for
146 | building high-minus-low portfolios. The `fill_val` is substituted for each value between the percentile cutoffs.
147 |
148 | This returns a Polars expression, so it can be chained in a `select` or `with_columns` invocation
149 | without needing to set a new intermediate DataFrame or materialize lazy evaluation.
150 |
151 | Parameters
152 | ----------
153 | target_col: str column name to have non-percentile thresholded values masked
154 | over_col: str column name to apply masking over, cross-sectionally
155 | lower_pct: float lower percentile under which to keep values
156 | upper_pct: float upper percentile over which to keep values
157 | fill_val: numeric value for masking
158 |
159 | Returns
160 | -------
161 | Polars Expr
162 | """
163 | return (
164 | pl.when(
165 | (pl.col(target_col) <= pl.col(target_col).drop_nans().quantile(lower_pct).over(over_col))
166 | | (pl.col(target_col) >= pl.col(target_col).drop_nans().quantile(upper_pct).over(over_col))
167 | )
168 | .then(pl.col(target_col))
169 | .otherwise(fill_val)
170 | )
171 |
172 |
173 | def exp_weights(window: int, half_life: int) -> np.ndarray:
174 | """Generate exponentially decaying weights over `window` trailing values, decaying by half each `half_life` index.
175 |
176 | Parameters
177 | ----------
178 | window: integer number of points in the trailing lookback period
179 | half_life: integer decay rate
180 |
181 | Returns
182 | -------
183 | numpy array
184 | """
185 | try:
186 | assert isinstance(window, int)
187 | if not window > 0:
188 | raise ValueError("`window` must be a strictly positive integer")
189 | except (AttributeError, AssertionError) as e:
190 | raise TypeError("`window` must be an integer type") from e
191 | try:
192 | assert isinstance(half_life, int)
193 | if not half_life > 0:
194 | raise ValueError("`half_life` must be a strictly positive integer")
195 | except (AttributeError, AssertionError) as e:
196 | raise TypeError("`half_life` must be an integer type") from e
197 | decay = np.log(2) / half_life
198 | return np.exp(-decay * np.arange(window))[::-1]
199 |
--------------------------------------------------------------------------------
/toraniko/model.py:
--------------------------------------------------------------------------------
1 | """Complete implementation of the factor model."""
2 |
3 | import numpy as np
4 | import polars as pl
5 | import polars.exceptions as pl_exc
6 |
7 | from toraniko.math import winsorize
8 |
9 |
10 | def _factor_returns(
11 | returns: np.ndarray,
12 | mkt_caps: np.ndarray,
13 | sector_scores: np.ndarray,
14 | style_scores: np.ndarray,
15 | residualize_styles: bool,
16 | ) -> tuple[np.ndarray, np.ndarray]:
17 | """Estimate market, sector, style and residual asset returns for one time period, robust to rank deficiency.
18 |
19 | Parameters
20 | ----------
21 | returns: np.ndarray returns of the assets (shape n_assets x 1)
22 | mkt_caps: np.ndarray of asset market capitalizations (shape n_assets x 1)
23 | sector_scores: np.ndarray of asset scores used to estimate the sector return (shape n_assets x m_sectors)
24 | style_scores: np.ndarray of asset scores used to estimate style factor returns (shape n_assets x m_styles)
25 | residualize_styles: bool indicating if styles should be orthogonalized to market + sector
26 |
27 | Returns
28 | -------
29 | tuple of arrays: (market/sector/style factor returns, residual returns)
30 | """
31 | n_assets = returns.shape[0]
32 | m_sectors, m_styles = sector_scores.shape[1], style_scores.shape[1]
33 |
34 | # Proxy for the inverse of asset idiosyncratic variances
35 | W = np.diag(np.sqrt(mkt_caps.ravel()))
36 |
37 | # Estimate sector factor returns with a constraint that the sector factors sum to 0
38 | # Economically, we assert that the market return is completely spanned by the sector returns
39 | beta_sector = np.hstack([np.ones(n_assets).reshape(-1, 1), sector_scores])
40 | a = np.concatenate([np.array([0]), (-1 * np.ones(m_sectors - 1))])
41 | Imat = np.identity(m_sectors)
42 | R_sector = np.vstack([Imat, a])
43 | # Change of variables to add the constraint
44 | B_sector = beta_sector @ R_sector
45 |
46 | V_sector, _, _, _ = np.linalg.lstsq(B_sector.T @ W @ B_sector, B_sector.T @ W, rcond=None)
47 | # Change of variables to recover all sectors
48 | g = V_sector @ returns
49 | fac_ret_sector = R_sector @ g
50 |
51 | sector_resid_returns = returns - (B_sector @ g)
52 |
53 | # Estimate style factor returns without constraints
54 | V_style, _, _, _ = np.linalg.lstsq(style_scores.T @ W @ style_scores, style_scores.T @ W, rcond=None)
55 | if residualize_styles:
56 | fac_ret_style = V_style @ sector_resid_returns
57 | else:
58 | fac_ret_style = V_style @ returns
59 |
60 | # Combine factor returns
61 | fac_ret = np.concatenate([fac_ret_sector, fac_ret_style])
62 |
63 | # Calculate final residuals
64 | epsilon = sector_resid_returns - (style_scores @ fac_ret_style)
65 |
66 | return fac_ret, epsilon
67 |
68 |
69 | def estimate_factor_returns(
70 | returns_df: pl.DataFrame,
71 | mkt_cap_df: pl.DataFrame,
72 | sector_df: pl.DataFrame,
73 | style_df: pl.DataFrame,
74 | winsor_factor: float | None = 0.05,
75 | residualize_styles: bool = True,
76 | ) -> tuple[pl.DataFrame, pl.DataFrame]:
77 | """Estimate factor and residual returns across all time periods using input asset factor scores.
78 |
79 | Parameters
80 | ----------
81 | returns_df: Polars DataFrame containing | date | symbol | asset_returns |
82 | mkt_cap_df: Polars DataFrame containing | date | symbol | market_cap |
83 | sector_df: Polars DataFrame containing | date | symbol | followed by one column for each sector
84 | style_df: Polars DataFrame containing | date | symbol | followed by one column for each style
85 | winsor_factor: winsorization proportion
86 | residualize_styles: bool indicating if style returns should be orthogonalized to market + sector returns
87 |
88 | Returns
89 | -------
90 | tuple of Polars DataFrames, each with one row per date: (factor returns, residual returns)
91 | """
92 | returns, residuals = [], []
93 | try:
94 | sectors = sorted(sector_df.select(pl.exclude("date", "symbol")).columns)
95 | except AttributeError as e:
96 | raise TypeError("`sector_df` must be a Polars DataFrame, but it's missing required attributes") from e
97 | except pl_exc.ColumnNotFoundError as e:
98 | raise ValueError("`sector_df` must have columns for 'date' and 'symbol' in addition to each sector") from e
99 | try:
100 | styles = sorted(style_df.select(pl.exclude("date", "symbol")).columns)
101 | except AttributeError as e:
102 | raise TypeError("`style_df` must be a Polars DataFrame, but it's missing required attributes") from e
103 | except pl_exc.ColumnNotFoundError as e:
104 | raise ValueError("`style_df` must have columns for 'date' and 'symbol' in addition to each style") from e
105 | try:
106 | returns_df = (
107 | returns_df.join(mkt_cap_df, on=["date", "symbol"])
108 | .join(sector_df, on=["date", "symbol"])
109 | .join(style_df, on=["date", "symbol"])
110 | )
111 | dates = returns_df["date"].unique().to_list()
112 | # iterate through, one day at a time
113 | # this could probably be made more efficient with Polars' `.map_groups` method
114 | for dt in dates:
115 | ddf = returns_df.filter(pl.col("date") == dt).sort("symbol")
116 | r = ddf["asset_returns"].to_numpy()
117 | if winsor_factor is not None:
118 | r = winsorize(r, winsor_factor)
119 | f, e = _factor_returns(
120 | r,
121 | ddf["market_cap"].to_numpy(),
122 | ddf.select(sectors).to_numpy(),
123 | ddf.select(styles).to_numpy(),
124 | residualize_styles,
125 | )
126 | returns.append(f)
127 | residuals.append(dict(zip(ddf["symbol"].to_list(), e)))
128 | except AttributeError as e:
129 | raise TypeError(
130 | "`returns_df` and `mkt_cap_df` must be Polars DataFrames, but there are missing attributes"
131 | ) from e
132 | except pl_exc.ColumnNotFoundError as e:
133 | raise ValueError(
134 | "`returns_df` must have columns 'date', 'symbol' and 'asset_returns'; "
135 | "`mkt_cap_df` must have 'date', 'symbol' and 'market_cap' columns"
136 | ) from e
137 | ret_df = pl.DataFrame(np.array(returns))
138 | ret_df.columns = ["market"] + sectors + styles
139 | ret_df = ret_df.with_columns(pl.Series(dates).alias("date"))
140 | eps_df = pl.DataFrame(residuals).with_columns(pl.Series(dates).alias("date"))
141 | return ret_df, eps_df
142 |
--------------------------------------------------------------------------------
/toraniko/styles.py:
--------------------------------------------------------------------------------
1 | """Style factor implementations."""
2 |
3 | import numpy as np
4 | import polars as pl
5 | import polars.exceptions as pl_exc
6 |
7 | from toraniko.math import (
8 | exp_weights,
9 | center_xsection,
10 | percentiles_xsection,
11 | winsorize_xsection,
12 | )
13 |
14 | ###
15 | # NB: These functions do not try to handle NaN or null resilience for you, nor do they make allowances
16 | # for data having pathological distributions. Garbage in, garbage out. You need to inspect your data
17 | # and use the functions in the math and utils modules to ensure your features are sane and
18 | # well-behaved before you try to construct factors from them!
19 | ###
20 |
21 |
22 | def factor_mom(
23 | returns_df: pl.DataFrame | pl.LazyFrame,
24 | trailing_days: int = 504,
25 | half_life: int = 126,
26 | lag: int = 20,
27 | winsor_factor: float = 0.01,
28 | ) -> pl.LazyFrame:
29 | """Estimate rolling symbol by symbol momentum factor scores using asset returns.
30 |
31 | Parameters
32 | ----------
33 | returns_df: Polars DataFrame containing columns: | date | symbol | asset_returns |
34 | trailing_days: int look back period over which to measure momentum
35 | half_life: int decay rate for exponential weighting, in days
36 | lag: int number of days to lag the current day's return observation (20 trading days is one month)
37 |
38 | Returns
39 | -------
40 | Polars LazyFrame containing columns: | date | symbol | mom_score |
41 | """
42 | weights = exp_weights(trailing_days, half_life)
43 |
44 | def weighted_cumprod(values: np.ndarray) -> float:
45 | return (np.cumprod(1 + (values * weights[-len(values) :])) - 1)[-1] # type: ignore
46 |
47 | try:
48 | df = (
49 | returns_df.lazy()
50 | .sort(by="date")
51 | .with_columns(pl.col("asset_returns").shift(lag).over("symbol").alias("asset_returns"))
52 | .with_columns(
53 | pl.col("asset_returns")
54 | .rolling_map(weighted_cumprod, window_size=trailing_days)
55 | .over(pl.col("symbol"))
56 | .alias("mom_score")
57 | )
58 | ).collect()
59 | df = winsorize_xsection(df, ("mom_score",), "date", percentile=winsor_factor)
60 | return df.lazy().select(
61 | pl.col("date"),
62 | pl.col("symbol"),
63 | center_xsection("mom_score", "date", True).alias("mom_score"),
64 | )
65 | except AttributeError as e:
66 | raise TypeError("`returns_df` must be a Polars DataFrame | LazyFrame, but it's missing attributes") from e
67 | except pl_exc.ColumnNotFoundError as e:
68 | raise ValueError("`returns_df` must have 'date', 'symbol' and 'asset_returns' columns") from e
69 |
70 |
71 | def factor_sze(
72 | mkt_cap_df: pl.DataFrame | pl.LazyFrame,
73 | lower_decile: float = 0.2,
74 | upper_decile: float = 0.8,
75 | ) -> pl.LazyFrame:
76 | """Estimate rolling symbol by symbol size factor scores using asset market caps.
77 |
78 | Parameters
79 | ----------
80 | mkt_cap_df: Polars DataFrame containing columns: | date | symbol | market_cap |
81 |
82 | Returns
83 | -------
84 | Polars LazyFrame containing columns: | date | symbol | sze_score |
85 | """
86 | try:
87 | return (
88 | mkt_cap_df.lazy()
89 | # our factor is the Fama-French SMB, i.e. small-minus-big, because the size risk premium
90 | # is on the smaller firms rather than the larger ones. consequently we multiply by -1
91 | .with_columns(pl.col("market_cap").log().alias("sze_score") * -1)
92 | .with_columns(
93 | "date",
94 | "symbol",
95 | (center_xsection("sze_score", "date", True)).alias("sze_score"),
96 | )
97 | .with_columns(percentiles_xsection("sze_score", "date", lower_decile, upper_decile, 0.0).alias("sze_score"))
98 | .select("date", "symbol", "sze_score")
99 | )
100 | except AttributeError as e:
101 | raise TypeError("`mkt_cap_df` must be a Polars DataFrame or LazyFrame, but it's missing attributes") from e
102 | except pl_exc.ColumnNotFoundError as e:
103 | raise ValueError("`mkt_cap_df` must have 'date', 'symbol' and 'market_cap' columns") from e
104 |
105 |
106 | def factor_val(value_df: pl.DataFrame | pl.LazyFrame, winsorize_features: float | None = None) -> pl.LazyFrame:
107 | """Estimate rolling symbol by symbol value factor scores using price ratios.
108 |
109 | Parameters
110 | ----------
111 | value_df: Polars DataFrame containing columns: | date | symbol | book_price | sales_price | cf_price
112 | winsorize_features: optional float giving the symmetric percentile at which to winsorize the features; no winsorization is applied if None
113 |
114 | Returns
115 | -------
116 | Polars LazyFrame containing: | date | symbol | val_score |
117 | """
118 | try:
119 | if winsorize_features is not None:
120 | value_df = winsorize_xsection(value_df, ("book_price", "sales_price", "cf_price"), "date", percentile=winsorize_features)
121 | return (
122 | value_df.lazy()
123 | .with_columns(
124 | pl.col("book_price").log().alias("book_price"),
125 | pl.col("sales_price").log().alias("sales_price"),
126 | )
127 | .with_columns(
128 | center_xsection("book_price", "date", True).alias("book_price"),
129 | center_xsection("sales_price", "date", True).alias("sales_price"),
130 | center_xsection("cf_price", "date", True).alias("cf_price"),
131 | )
132 | .with_columns(
133 | # NB: it's imperative you've properly handled NaNs prior to this point
134 | pl.mean_horizontal(
135 | pl.col("book_price"),
136 | pl.col("sales_price"),
137 | pl.col("cf_price"),
138 | ).alias("val_score")
139 | )
140 | .select(
141 | pl.col("date"),
142 | pl.col("symbol"),
143 | center_xsection("val_score", "date", True).alias("val_score"),
144 | )
145 | )
146 | except AttributeError as e:
147 | raise TypeError("`value_df` must be a Polars DataFrame or LazyFrame, but it's missing attributes") from e
148 | except pl_exc.ColumnNotFoundError as e:
149 | raise ValueError(
150 | "`value_df` must have 'date', 'symbol', 'book_price', 'sales_price' and 'fcf_price' columns"
151 | ) from e
152 |
--------------------------------------------------------------------------------
/toraniko/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xfdf/toraniko/253ce7044ef33eb718e671e9f918a88ba5fa3350/toraniko/tests/__init__.py
--------------------------------------------------------------------------------
/toraniko/tests/test_math.py:
--------------------------------------------------------------------------------
1 | """Test functions in the math module."""
2 |
3 | import pytest
4 | import polars as pl
5 | import numpy as np
6 | from polars.testing import assert_frame_equal
7 |
8 | from toraniko.math import (
9 | center_xsection,
10 | exp_weights,
11 | norm_xsection,
12 | percentiles_xsection,
13 | winsorize,
14 | winsorize_xsection,
15 | )
16 |
17 |
18 | @pytest.fixture
19 | def sample_data():
20 | """
21 | Fixture to provide sample data for testing.
22 |
23 | Returns
24 | -------
25 | pl.DataFrame
26 | A DataFrame with a 'group' column and a 'value' column.
27 | """
28 | return pl.DataFrame({"group": ["A", "A", "A", "B", "B", "B"], "value": [1, 2, 3, 4, 5, 6]})
29 |
30 |
31 | ###
32 | # `center_xsection`
33 | ###
34 |
35 |
36 | def test_center_xsection_centering(sample_data):
37 | """
38 | Test centering without standardization.
39 |
40 | Parameters
41 | ----------
42 | sample_data : pl.DataFrame
43 | The sample data to test.
44 |
45 | Asserts
46 | -------
47 | The centered values match the expected centered values.
48 | """
49 | centered_df = sample_data.with_columns(center_xsection("value", "group"))
50 | expected_centered_values = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
51 | assert np.allclose(centered_df["value"].to_numpy(), expected_centered_values)
52 |
53 |
54 | def test_center_xsection_standardizing(sample_data):
55 | """
56 | Test centering and standardizing.
57 |
58 | Parameters
59 | ----------
60 | sample_data : pl.DataFrame
61 | The sample data to test.
62 |
63 | Asserts
64 | -------
65 | The standardized values match the expected standardized values.
66 | """
67 | standardized_df = sample_data.with_columns(center_xsection("value", "group", standardize=True))
68 | expected_standardized_values = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
69 | assert np.allclose(standardized_df["value"].to_numpy(), expected_standardized_values)
70 |
71 |
72 | def test_center_xsection_handle_nan(sample_data):
73 | """
74 | Test handling of NaN values.
75 |
76 | Parameters
77 | ----------
78 | sample_data : pl.DataFrame
79 | The sample data to test, with an additional NaN column.
80 |
81 | Asserts
82 | -------
83 | The values in the 'nan_col' column are all NaN.
84 | """
85 | sample_data_with_nan = sample_data.with_columns(pl.lit(np.nan).alias("nan_col"))
86 | centered_df = sample_data_with_nan.with_columns(center_xsection("nan_col", "group"))
87 | expected_values = [np.nan] * len(centered_df)
88 | assert np.allclose(centered_df["nan_col"].to_numpy(), expected_values, equal_nan=True)
89 |
90 |
91 | def test_center_xsection_different_groups():
92 | """
93 | Test with a more complex group and value structure.
94 |
95 | Asserts
96 | -------
97 | The centered values match the expected values.
98 | """
99 | data = pl.DataFrame({"group": ["A", "A", "B", "B", "C", "C"], "value": [1.5, 2.5, 5.5, 6.5, 9.0, 9.0]})
100 | centered_df = data.with_columns(center_xsection("value", "group"))
101 | expected_values = [-0.5, 0.5, -0.5, 0.5, 0.0, 0.0]
102 | assert np.allclose(centered_df["value"].to_numpy(), expected_values)
103 |
104 |
105 | def test_center_xsection_empty_group():
106 | """
107 | Test with an empty group.
108 |
109 | Asserts
110 | -------
111 | The group 'C' is empty in the filtered DataFrame.
112 | """
113 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [1.0, 2.0, 3.0, 4.0]})
114 | empty_group = data.filter(pl.col("group") == "C")
115 | assert empty_group.is_empty()
116 |
117 |
118 | def test_center_xsection_all_nan():
119 | """
120 | Test when the entire column consists of NaN values.
121 |
122 | Asserts
123 | -------
124 | The result column should contain only NaN values.
125 | """
126 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [np.nan, np.nan, np.nan, np.nan]})
127 | centered_df = data.with_columns(center_xsection("value", "group"))
128 | expected_values = [np.nan, np.nan, np.nan, np.nan]
129 | assert np.allclose(centered_df["value"].to_numpy(), expected_values, equal_nan=True)
130 |
131 |
132 | def test_center_xsection_single_row_group():
133 | """
134 | Test centering when a group has only one row.
135 |
136 | Asserts
137 | -------
138 | The result should be 0.0 for the single row group.
139 | """
140 | data = pl.DataFrame({"group": ["A", "A", "B"], "value": [1.0, 2.0, 3.0]})
141 | centered_df = data.with_columns(center_xsection("value", "group"))
142 | expected_values = [-0.5, 0.5, 0.0] # Centering for group B with one value should be 0.0
143 | assert np.allclose(centered_df["value"].to_numpy(), expected_values)
144 |
145 |
146 | def test_center_xsection_mixed_data_types():
147 | """
148 | Test handling of mixed data types, where non-numeric columns are ignored.
149 |
150 | Asserts
151 | -------
152 | The centering process should only affect numeric columns.
153 | """
154 | data = pl.DataFrame(
155 | {"group": ["A", "A", "B", "B"], "value": [1.0, 2.0, 3.0, 4.0], "category": ["x", "y", "z", "w"]}
156 | )
157 | centered_df = data.with_columns(center_xsection("value", "group"))
158 | expected_values = [-0.5, 0.5, -0.5, 0.5]
159 | assert np.allclose(centered_df["value"].to_numpy(), expected_values)
160 | assert all(centered_df["category"] == data["category"])
161 |
162 |
163 | ###
164 | # `norm_xsection`
165 | ###
166 |
167 |
168 | def test_norm_xsection_default_range(sample_data):
169 | """
170 | Test normalization with default range [0, 1].
171 |
172 | Parameters
173 | ----------
174 | sample_data : pl.DataFrame
175 | The sample data to test.
176 |
177 | Asserts
178 | -------
179 | The normalized values match the expected values in the range [0, 1].
180 | """
181 | normalized_df = sample_data.with_columns(norm_xsection("value", "group"))
182 | expected_normalized_values = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0]
183 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values)
184 |
185 |
186 | def test_norm_xsection_custom_range(sample_data):
187 | """
188 | Test normalization with a custom range [-1, 1].
189 |
190 | Parameters
191 | ----------
192 | sample_data : pl.DataFrame
193 | The sample data to test.
194 |
195 | Asserts
196 | -------
197 | The normalized values match the expected values in the range [-1, 1].
198 | """
199 | normalized_df = sample_data.with_columns(norm_xsection("value", "group", lower=-1, upper=1))
200 | expected_normalized_values = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
201 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values)
202 |
203 |
204 | def test_norm_xsection_nan_values():
205 | """
206 | Test normalization when the target column contains NaN values.
207 |
208 | Asserts
209 | -------
210 | The NaN values are preserved in the output.
211 | """
212 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [1, np.nan, 4, np.nan]}, strict=False)
213 | normalized_df = data.with_columns(norm_xsection("value", "group"))
214 | expected_normalized_values = [0.0, np.nan, 0.0, np.nan]
215 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values, equal_nan=True)
216 |
217 |
218 | def test_norm_xsection_single_value_group():
219 | """
220 | Test normalization when a group has a single value.
221 |
222 | Asserts
223 | -------
224 | The result should be the lower bound of the range, as there's no range to normalize.
225 | """
226 | data = pl.DataFrame({"group": ["A", "A", "B"], "value": [1, 2, 3]})
227 | normalized_df = data.with_columns(norm_xsection("value", "group"))
228 | expected_normalized_values = [0.0, 1.0, 0.0] # Single value group should map to the lower bound
229 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values)
230 |
231 |
232 | def test_norm_xsection_identical_values():
233 | """
234 | Test normalization when all values in a group are identical.
235 |
236 | Asserts
237 | -------
238 | The normalized values should all be the lower bound, as there's no range to normalize.
239 | """
240 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [5, 5, 5, 5]})
241 | normalized_df = data.with_columns(norm_xsection("value", "group"))
242 | expected_normalized_values = [0.0, 0.0, 0.0, 0.0] # Identical values map to the lower bound
243 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values)
244 |
245 |
246 | def test_norm_xsection_mixed_data_types():
247 | """
248 | Test handling of mixed data types, where non-numeric columns are ignored.
249 |
250 | Asserts
251 | -------
252 | The normalization process should only affect numeric columns.
253 | """
254 | data = pl.DataFrame(
255 | {"group": ["A", "A", "B", "B"], "value": [1.0, 2.0, 3.0, 4.0], "category": ["x", "y", "z", "w"]}
256 | )
257 | normalized_df = data.with_columns(norm_xsection("value", "group"))
258 | expected_normalized_values = [0.0, 1.0, 0.0, 1.0]
259 | actual_values = normalized_df["value"].to_numpy()
260 | print(actual_values)
261 | assert np.allclose(actual_values, expected_normalized_values)
262 | assert all(normalized_df["category"] == data["category"])
263 |
264 |
265 | def test_norm_xsection_custom_range_large_values():
266 | """
267 | Test normalization with a custom range [100, 200] and large values.
268 |
269 | Asserts
270 | -------
271 | The normalized values match the expected values in the range [100, 200].
272 | """
273 | data = pl.DataFrame({"group": ["A", "A", "B", "B"], "value": [1000, 2000, 3000, 4000]})
274 | normalized_df = data.with_columns(norm_xsection("value", "group", lower=100, upper=200))
275 | expected_normalized_values = [100.0, 200.0, 100.0, 200.0]
276 | assert np.allclose(normalized_df["value"].to_numpy(), expected_normalized_values)
277 |
278 |
279 | ###
280 | # `winsorize`
281 | ###
282 |
283 |
284 | @pytest.fixture
285 | def sample_columns():
286 | """
287 | Fixture to provide sample numpy array columns for testing.
288 |
289 | Returns
290 | -------
291 | np.ndarray
292 | """
293 | return np.array(
294 | [
295 | [1, 100, 1000],
296 | [2, 200, 2000],
297 | [3, 300, 3000],
298 | [4, 400, 4000],
299 | [5, 500, 5000],
300 | [6, 600, 6000],
301 | [7, 700, 7000],
302 | [8, 800, 8000],
303 | [9, 900, 9000],
304 | [10, 1000, 10000],
305 | ]
306 | )
307 |
308 |
309 | @pytest.fixture
310 | def sample_columns_with_nans():
311 | """
312 | Fixture to provide sample numpy array columns for testing.
313 |
314 | Returns
315 | -------
316 | np.ndarray
317 | """
318 | return np.array(
319 | [
320 | [1, 100, np.nan],
321 | [2, 200, 2000],
322 | [3, 300, np.nan],
323 | [4, 400, np.nan],
324 | [5, 500, 5000],
325 | [6, 600, 6000],
326 | [7, 700, 7000],
327 | [8, 800, 8000],
328 | [9, 900, np.nan],
329 | [10, 1000, 10000],
330 | ]
331 | )
332 |
333 |
334 | @pytest.fixture
335 | def sample_rows():
336 | """
337 | Fixture to provide sample numpy rows for testing.
338 |
339 | Returns
340 | -------
341 | np.ndarray
342 | """
343 | return np.array(
344 | [
345 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
346 | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
347 | [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
348 | ]
349 | )
350 |
351 |
352 | @pytest.fixture
353 | def sample_rows_with_nans():
354 | """
355 | Fixture to provide sample numpy rows for testing.
356 |
357 | Returns
358 | -------
359 | np.ndarray
360 | """
361 | return np.array(
362 | [
363 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
364 | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
365 | [np.nan, 2000, np.nan, np.nan, 5000, 6000, 7000, 8000, np.nan, 10000],
366 | ]
367 | )
368 |
369 |
370 | def test_winsorize_axis_0(sample_columns):
371 | """
372 | Test column-wise winsorization (axis=0).
373 |
374 | This test checks if the function correctly winsorizes a 2D array column-wise.
375 |
376 | Parameters
377 | ----------
378 | sample_columns : np.ndarray
379 | Sample data fixture for column-wise testing.
380 |
381 | Raises
382 | ------
383 | AssertionError
384 | If the winsorized output doesn't match the expected result.
385 | """
386 | result = winsorize(sample_columns, percentile=0.2, axis=0)
387 |
388 | # Expected results (calculated manually for 20th and 80th percentiles)
389 | expected = np.array(
390 | [
391 | [2.8, 280, 2800],
392 | [2.8, 280, 2800],
393 | [3.0, 300, 3000],
394 | [4.0, 400, 4000],
395 | [5.0, 500, 5000],
396 | [6.0, 600, 6000],
397 | [7.0, 700, 7000],
398 | [8.0, 800, 8000],
399 | [8.2, 820, 8200],
400 | [8.2, 820, 8200],
401 | ]
402 | )
403 |
404 | np.testing.assert_array_almost_equal(result, expected)
405 |
406 |
407 | def test_winsorize_axis_1(sample_rows):
408 | """
409 | Test row-wise winsorization (axis=1).
410 |
411 | This test checks if the function correctly winsorizes a 2D array row-wise.
412 |
413 | Parameters
414 | ----------
415 | sample_rows : np.ndarray
416 | Sample data fixture for row-wise testing.
417 |
418 | Raises
419 | ------
420 | AssertionError
421 | If the winsorized output doesn't match the expected result.
422 | """
423 | result = winsorize(sample_rows, percentile=0.2, axis=1)
424 |
425 | # Expected results (calculated manually for 20th and 80th percentiles)
426 | expected = np.array(
427 | [
428 | [2.8, 2.8, 3, 4, 5, 6, 7, 8, 8.2, 8.2],
429 | [280, 280, 300, 400, 500, 600, 700, 800, 820, 820],
430 | [2800, 2800, 3000, 4000, 5000, 6000, 7000, 8000, 8200, 8200],
431 | ]
432 | )
433 |
434 | np.testing.assert_array_almost_equal(result, expected)
435 |
436 |
437 | def test_winsorize_axis_0_with_nans(sample_columns_with_nans):
438 | """
439 | Test winsorize function with NaN values.
440 |
441 | This test verifies that the function handles NaN values correctly.
442 |
443 | Raises
444 | ------
445 | AssertionError
446 | If the winsorized output doesn't handle NaN values as expected.
447 | """
448 | result = winsorize(sample_columns_with_nans, percentile=0.2, axis=0)
449 | expected = np.array(
450 | [
451 | [2.8e00, 2.8e02, np.nan],
452 | [2.8e00, 2.8e02, 5.0e03],
453 | [3.0e00, 3.0e02, np.nan],
454 | [4.0e00, 4.0e02, np.nan],
455 | [5.0e00, 5.0e02, 5.0e03],
456 | [6.0e00, 6.0e02, 6.0e03],
457 | [7.0e00, 7.0e02, 7.0e03],
458 | [8.0e00, 8.0e02, 8.0e03],
459 | [8.2e00, 8.2e02, np.nan],
460 | [8.2e00, 8.2e02, 8.0e03],
461 | ]
462 | )
463 | np.testing.assert_allclose(result, expected, equal_nan=True)
464 |
465 |
466 | def test_winsorize_axis_1_with_nans(sample_rows_with_nans):
467 | """
468 | Test winsorize function with NaN values.
469 |
470 | This test verifies that the function handles NaN values correctly.
471 |
472 | Raises
473 | ------
474 | AssertionError
475 | If the winsorized output doesn't handle NaN values as expected.
476 | """
477 | result = winsorize(sample_rows_with_nans, percentile=0.2, axis=1)
478 | # take transpose of axis=0 test to test axis=1
479 | expected = np.array(
480 | [
481 | [2.8e00, 2.8e02, np.nan],
482 | [2.8e00, 2.8e02, 5.0e03],
483 | [3.0e00, 3.0e02, np.nan],
484 | [4.0e00, 4.0e02, np.nan],
485 | [5.0e00, 5.0e02, 5.0e03],
486 | [6.0e00, 6.0e02, 6.0e03],
487 | [7.0e00, 7.0e02, 7.0e03],
488 | [8.0e00, 8.0e02, 8.0e03],
489 | [8.2e00, 8.2e02, np.nan],
490 | [8.2e00, 8.2e02, 8.0e03],
491 | ]
492 | ).T
493 | np.testing.assert_allclose(result, expected, equal_nan=True)
494 |
495 |
496 | def test_winsorize_invalid_percentile(sample_rows):
497 | """
498 | Test winsorize function with invalid percentile values.
499 |
500 | This test checks if the function raises a ValueError for percentile values outside [0, 1].
501 |
502 | Raises
503 | ------
504 | AssertionError
505 | If the function doesn't raise a ValueError for invalid percentile values.
506 | """
507 | with pytest.raises(TypeError):
508 | winsorize(sample_rows, percentile="string_percentile")
509 | with pytest.raises(ValueError):
510 | winsorize(sample_rows, percentile=-0.1)
511 | with pytest.raises(ValueError):
512 | winsorize(sample_rows, percentile=1.1)
513 |
514 |
515 | def test_winsorize_1d_array():
516 | """
517 | Test winsorize function with a 1D array.
518 |
519 | This test verifies that the function works correctly with 1D input.
520 |
521 | Raises
522 | ------
523 | AssertionError
524 | If the winsorized output doesn't match the expected result for a 1D array.
525 | """
526 | data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
527 | result = winsorize(data, percentile=0.2)
528 | expected = np.array([2.8, 2.8, 3, 4, 5, 6, 7, 8, 8.2, 8.2])
529 | np.testing.assert_array_equal(result, expected)
530 |
531 |
532 | def test_winsorize_empty_array():
533 | """
534 | Test winsorize function with an empty array.
535 |
536 | This test checks if the function handles empty arrays correctly.
537 |
538 | Raises
539 | ------
540 | AssertionError
541 | If the function doesn't return an empty array for empty input.
542 | """
543 | data = np.array([])
544 | result = winsorize(data)
545 | np.testing.assert_array_equal(result, data)
546 |
547 |
548 | def test_winsorize_all_nan():
549 | """
550 | Test winsorize function with an array containing only NaN values.
551 |
552 | This test verifies that the function handles arrays with only NaN values correctly.
553 |
554 | Raises
555 | ------
556 | AssertionError
557 | If the function doesn't return an array of NaN values for all-NaN input.
558 | """
559 | data = np.array([[np.nan, np.nan], [np.nan, np.nan]])
560 | result = winsorize(data)
561 | np.testing.assert_array_equal(result, data)
562 |
563 |
564 | ###
565 | # winsorize_xsection
566 | ###
567 |
568 |
569 | @pytest.fixture
570 | def sample_df():
571 | """
572 | Fixture to provide a sample DataFrame for testing.
573 | """
574 | return pl.DataFrame(
575 | {
576 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
577 | "value1": [1, 2, 10, 4, 5, 20, 7, 8, 30],
578 | "value2": [100, 200, 1000, 400, 500, 2000, 700, 800, 3000],
579 | }
580 | )
581 |
582 |
583 | @pytest.mark.parametrize("lazy", [True, False])
584 | def test_winsorize_xsection(sample_df, lazy):
585 | """
586 | Test basic functionality of winsorize_xsection.
587 | """
588 | if lazy:
589 | result = winsorize_xsection(
590 | sample_df.lazy(), data_cols=("value1", "value2"), group_col="group", percentile=0.1
591 | ).sort("group")
592 | assert isinstance(result, pl.LazyFrame)
593 | result = result.collect()
594 | else:
595 | result = winsorize_xsection(sample_df, data_cols=("value1", "value2"), group_col="group", percentile=0.1).sort(
596 | "group"
597 | )
598 |
599 | expected = pl.DataFrame(
600 | {
601 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
602 | "value1": [1.2, 2.0, 8.4, 4.2, 5.0, 17.0, 7.2, 8.0, 25.6],
603 | "value2": [120.0, 200.0, 840.0, 420.0, 500.0, 1700.0, 720.0, 800.0, 2560.0],
604 | }
605 | )
606 |
607 | assert_frame_equal(result, expected, check_exact=False)
608 |
609 |
610 | @pytest.fixture
611 | def sample_df_with_nans():
612 | """
613 | Fixture to provide a sample DataFrame with NaN values for testing.
614 | """
615 | return pl.DataFrame(
616 | {
617 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
618 | "value1": [1, np.nan, 10, 4, 5, np.nan, 7, 8, 30],
619 | "value2": [100, 200, np.nan, np.nan, 500, 2000, 700, np.nan, 3000],
620 | },
621 | strict=False,
622 | )
623 |
624 |
625 | def test_winsorize_xsection_with_nans(sample_df_with_nans):
626 | """
627 | Test winsorize_xsection with NaN values.
628 | """
629 | result = winsorize_xsection(
630 | sample_df_with_nans, data_cols=("value1", "value2"), group_col="group", percentile=0.1
631 | ).sort("group")
632 |
633 | expected = pl.DataFrame(
634 | {
635 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
636 | "value1": [1.9, np.nan, 9.1, 4.1, 4.9, np.nan, 7.2, 8.0, 25.6],
637 | "value2": [110.0, 190.0, np.nan, np.nan, 650.0, 1850.0, 930.0, np.nan, 2770.0],
638 | },
639 | strict=False,
640 | )
641 |
642 | assert_frame_equal(result, expected, check_exact=False)
643 |
644 |
645 | ###
646 | # `xsection_percentiles`
647 | ###
648 |
649 |
650 | def test_xsection_percentiles(sample_df):
651 | """
652 | Test basic functionality of xsection_percentiles.
653 | """
654 | result = sample_df.with_columns(percentiles_xsection("value1", "group", 0.25, 0.75).alias("result")).sort("group")
655 |
656 | expected = pl.DataFrame(
657 | {
658 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
659 | "value1": [1, 2, 10, 4, 5, 20, 7, 8, 30],
660 | "result": [1.0, 2.0, 10.0, 4.0, 5.0, 20.0, 7.0, 8.0, 30.0],
661 | }
662 | )
663 |
664 | pl.testing.assert_frame_equal(result.select("group", "value1", "result"), expected)
665 |
666 |
667 | def test_xsection_percentiles_with_nans(sample_df_with_nans):
668 | """
669 | Test xsection_percentiles with NaN values.
670 | """
671 | result = sample_df_with_nans.with_columns(percentiles_xsection("value1", "group", 0.25, 0.75).alias("result"))
672 |
673 | expected = pl.DataFrame(
674 | {
675 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
676 | "value1": [1.0, np.nan, 10.0, 4.0, 5.0, np.nan, 7.0, 8.0, 30.0],
677 | "result": [1.0, np.nan, 10.0, 4.0, 5.0, np.nan, 7.0, 8.0, 30.0],
678 | }
679 | )
680 |
681 | pl.testing.assert_frame_equal(result.select("group", "value1", "result"), expected)
682 |
683 |
684 | ###
685 | # `exp_weights`
686 | ###
687 |
688 |
689 | def test_exp_weights_basic():
690 | """
691 | Test basic functionality of exp_weights.
692 | """
693 | result = exp_weights(window=5, half_life=2)
694 | expected = np.array([0.25, 0.35355339, 0.5, 0.70710678, 1.0])
695 | np.testing.assert_array_almost_equal(result, expected, decimal=6)
696 |
697 |
698 | def test_exp_weights_window_1():
699 | """
700 | Test exp_weights with window of 1.
701 | """
702 | result = exp_weights(window=1, half_life=2)
703 | expected = np.array([1.0])
704 | np.testing.assert_array_almost_equal(result, expected)
705 |
706 |
707 | def test_exp_weights_half_life_1():
708 | """
709 | Test exp_weights with half_life of 1.
710 | """
711 | result = exp_weights(window=5, half_life=1)
712 | expected = np.array([0.0625, 0.125, 0.25, 0.5, 1.0])
713 | np.testing.assert_array_almost_equal(result, expected)
714 |
715 |
716 | def test_exp_weights_large_window():
717 | """
718 | Test exp_weights with a large window.
719 | """
720 | result = exp_weights(window=100, half_life=10)
721 | assert len(result) == 100
722 | assert result[-1] == 1.0
723 | assert result[0] < result[-1]
724 |
725 |
726 | def test_exp_weights_decreasing():
727 | """
728 | Test that weights are decreasing from end to start.
729 | """
730 | result = exp_weights(window=10, half_life=3)
731 | assert np.all(np.diff(result) > 0)
732 |
733 |
734 | def test_exp_weights_half_life():
735 | """
736 | Test that weights actually decay by half each half_life.
737 | """
738 | half_life = 5
739 | window = 20
740 | weights = exp_weights(window, half_life)
741 | for i in range(0, window - half_life, half_life):
742 | assert np.isclose(weights[i], 0.5 * weights[i + half_life], rtol=1e-5)
743 |
744 |
745 | def test_exp_weights_invalid_window():
746 | """
747 | Test exp_weights with invalid window value.
748 | """
749 | with pytest.raises(ValueError):
750 | exp_weights(window=0, half_life=2)
751 |
752 | with pytest.raises(ValueError):
753 | exp_weights(window=-1, half_life=2)
754 |
755 | with pytest.raises(TypeError):
756 | exp_weights(window="window", half_life=2)
757 |
758 | with pytest.raises(TypeError):
759 | exp_weights(window=5.1, half_life=3)
760 |
761 |
762 | def test_exp_weights_invalid_half_life():
763 | """
764 | Test exp_weights with invalid half_life value.
765 | """
766 | with pytest.raises(ValueError):
767 | exp_weights(window=5, half_life=0)
768 |
769 | with pytest.raises(ValueError):
770 | exp_weights(window=5, half_life=-1)
771 |
772 | with pytest.raises(TypeError):
773 | exp_weights(window=5, half_life="half_life")
774 |
775 | with pytest.raises(TypeError):
776 | exp_weights(window=5, half_life=3.2)
777 |
778 |
779 | def test_exp_weights_output():
780 |     """
781 |     Test exp_weights against a precomputed output vector.
782 |     """
783 | result = exp_weights(10, 10)
784 | expected = np.array(
785 | [0.53588673, 0.57434918, 0.61557221, 0.65975396, 0.70710678, 0.75785828, 0.8122524, 0.87055056, 0.93303299, 1.0]
786 | )
787 | np.testing.assert_array_almost_equal(result, expected)
788 |
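
# --- Editorial example (added for illustration; not part of the original test suite) ---
# The expected vector in the test above is consistent with weights of the form
# 0.5 ** ((window - 1 - i) / half_life) for i = 0, ..., window - 1. That parameterization is
# inferred from the test vectors, not taken from the library source, so treat it as an assumption.
def test_exp_weights_matches_assumed_formula():
    window, half_life = 10, 10
    assumed = np.array([0.5 ** ((window - 1 - i) / half_life) for i in range(window)])
    np.testing.assert_array_almost_equal(exp_weights(window, half_life), assumed)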
--------------------------------------------------------------------------------
/toraniko/tests/test_model.py:
--------------------------------------------------------------------------------
1 | """Tests functions in the model module."""
2 |
3 | import polars as pl
4 | import pytest
5 | import numpy as np
6 | from toraniko.model import _factor_returns, estimate_factor_returns
7 |
8 | ###
9 | # `_factor_returns`
10 | ###
11 |
12 |
13 | @pytest.fixture
14 | def sample_data():
15 | n_assets = 100
16 | n_sectors = 10
17 | n_styles = 5
18 |
19 | np.random.seed(42)
20 | returns = np.random.randn(n_assets, 1)
21 | mkt_caps = np.abs(np.random.randn(n_assets, 1))
22 | sector_scores = np.random.randint(0, 2, size=(n_assets, n_sectors))
23 | style_scores = np.random.randn(n_assets, n_styles)
24 |
25 | return returns, mkt_caps, sector_scores, style_scores
26 |
27 |
28 | def test_output_shape_and_values(sample_data):
29 | returns, mkt_caps, sector_scores, style_scores = sample_data
30 | fac_ret, epsilon = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
31 |
32 | assert fac_ret.shape == (1 + sector_scores.shape[1] + style_scores.shape[1], 1)
33 | assert epsilon.shape == returns.shape
34 |     # compare the actual fac_ret and epsilon against precomputed test vectors
35 | expected_fac_ret = np.array(
36 | [
37 | [-0.1619187],
38 | [0.0743045],
39 | [-0.21585373],
40 | [-0.27895516],
41 | [0.21496233],
42 | [-0.09829407],
43 | [-0.36415363],
44 | [0.16375184],
45 | [0.12933617],
46 | [0.35729484],
47 | [0.0176069],
48 | [0.13510884],
49 | [0.10831872],
50 | [-0.05781987],
51 | [0.11867375],
52 | [0.07687654],
53 | ]
54 | )
55 | np.testing.assert_array_almost_equal(fac_ret, expected_fac_ret)
56 | expected_epsilon_first_10 = np.array(
57 | [
58 | [0.04899388],
59 | [0.30702498],
60 | [0.23548647],
61 | [1.29181826],
62 | [0.30327413],
63 | [0.22804785],
64 | [1.40533523],
65 | [1.37903988],
66 | [0.08243883],
67 | [1.37698801],
68 | ]
69 | )
70 | np.testing.assert_array_almost_equal(epsilon[:10], expected_epsilon_first_10)
71 |
72 |
73 | def test_residualize_styles(sample_data):
74 | returns, mkt_caps, sector_scores, style_scores = sample_data
75 |
76 | # if we residualize the styles we should obtain different returns out of the function
77 |
78 | fac_ret_res, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
79 | fac_ret_non_res, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, False)
80 |
81 | assert not np.allclose(fac_ret_res, fac_ret_non_res)
82 |
83 |
84 | def test_sector_constraint(sample_data):
85 | returns, mkt_caps, sector_scores, style_scores = sample_data
86 | fac_ret, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
87 |
88 | sector_returns = fac_ret[1 : sector_scores.shape[1] + 1]
89 | assert np.isclose(np.sum(sector_returns), 0, atol=1e-10)
90 |
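
# --- Editorial example (added for illustration; an assumption about the general technique,
# not necessarily how `_factor_returns` imposes the constraint internally) ---
# A sum-to-zero constraint like the one checked above is commonly enforced by estimating
# n_sectors - 1 free coefficients and mapping them through a restriction matrix whose columns
# each sum to zero, so any vector in its image also sums to zero.
def test_sum_to_zero_restriction_matrix_sketch():
    n_sectors = 10
    R = np.concatenate([np.eye(n_sectors - 1), -np.ones((1, n_sectors - 1))], axis=0)
    g = np.random.randn(n_sectors - 1)
    assert np.isclose(np.sum(R @ g), 0.0)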
91 |
92 | def test_zero_returns():
93 | n_assets = 50
94 | n_sectors = 5
95 | n_styles = 3
96 |
97 | returns = np.zeros((n_assets, 1))
98 | mkt_caps = np.ones((n_assets, 1))
99 | sector_scores = np.random.randint(0, 2, size=(n_assets, n_sectors))
100 | style_scores = np.random.randn(n_assets, n_styles)
101 |
102 | fac_ret, epsilon = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
103 |
104 | assert np.allclose(fac_ret, 0)
105 | assert np.allclose(epsilon, 0)
106 |
107 |
108 | def test_market_cap_weighting(sample_data):
109 | returns, mkt_caps, sector_scores, style_scores = sample_data
110 |
111 | fac_ret1, _ = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
112 | fac_ret2, _ = _factor_returns(returns, np.ones_like(mkt_caps), sector_scores, style_scores, True)
113 |
114 | assert not np.allclose(fac_ret1, fac_ret2)
115 |
116 |
117 | def test_reproducibility(sample_data):
118 | returns, mkt_caps, sector_scores, style_scores = sample_data
119 |
120 | fac_ret1, epsilon1 = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
121 | fac_ret2, epsilon2 = _factor_returns(returns, mkt_caps, sector_scores, style_scores, True)
122 |
123 | assert np.allclose(fac_ret1, fac_ret2)
124 | assert np.allclose(epsilon1, epsilon2)
125 |
126 |
127 | ###
128 | # `estimate_factor_returns`
129 | ###
130 |
131 |
132 | def model_sample_data():
133 | dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
134 | symbols = ["AAPL", "MSFT", "GOOGL"]
135 | returns_data = {"date": dates * 3, "symbol": symbols * 3, "asset_returns": np.random.randn(9)}
136 | mkt_cap_data = {"date": dates * 3, "symbol": symbols * 3, "market_cap": np.random.rand(9) * 1000}
137 | sector_data = {"date": dates * 3, "symbol": symbols * 3, "Tech": [1, 0, 0] * 3, "Finance": [0, 1, 0] * 3}
138 | style_data = {"date": dates * 3, "symbol": symbols * 3, "Value": [1, 0, 0] * 3, "Growth": [0, 1, 0] * 3}
139 | return (pl.DataFrame(returns_data), pl.DataFrame(mkt_cap_data), pl.DataFrame(sector_data), pl.DataFrame(style_data))
140 |
141 |
142 | def test_estimate_factor_returns_normal():
143 | returns_df, mkt_cap_df, sector_df, style_df = model_sample_data()
144 | factor_returns, residual_returns = estimate_factor_returns(returns_df, mkt_cap_df, sector_df, style_df)
145 | assert isinstance(factor_returns, pl.DataFrame)
146 | assert isinstance(residual_returns, pl.DataFrame)
147 | assert factor_returns.shape[0] == 3 # Number of dates
148 | assert residual_returns.shape[0] == 3 # Number of dates
149 |
150 |
151 | def test_estimate_factor_returns_empty():
152 | empty_df = pl.DataFrame()
153 | with pytest.raises(ValueError):
154 | estimate_factor_returns(empty_df, empty_df, empty_df, empty_df)
155 |
156 |
157 | def test_estimate_factor_returns_incorrect_types():
158 | with pytest.raises(TypeError):
159 | estimate_factor_returns("not_a_dataframe", "not_a_dataframe", "not_a_dataframe", "not_a_dataframe")
160 |
161 |
162 | def test_estimate_factor_returns_missing_columns():
163 | returns_df, mkt_cap_df, sector_df, style_df = model_sample_data()
164 | with pytest.raises(ValueError):
165 | estimate_factor_returns(returns_df.drop("asset_returns"), mkt_cap_df, sector_df, style_df)
166 |
--------------------------------------------------------------------------------
/toraniko/tests/test_utils.py:
--------------------------------------------------------------------------------
1 | """Test functions in the utils module."""
2 |
3 | import pytest
4 | import polars as pl
5 | import numpy as np
6 | from polars.testing import assert_frame_equal
7 |
8 | from toraniko.utils import fill_features, smooth_features, top_n_by_group
9 |
10 | ###
11 | # `fill_features`
12 | ###
13 |
14 |
15 | @pytest.fixture
16 | def sample_df():
17 | return pl.DataFrame(
18 | {
19 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
20 | "group": ["A", "A", "A", "B", "B"],
21 | "feature1": [1.0, np.nan, 3.0, np.inf, 5.0],
22 | "feature2": [np.nan, 2.0, np.nan, 4.0, np.nan],
23 | }
24 | )
25 |
26 |
27 | @pytest.mark.parametrize("lazy", [True, False])
28 | def test_fill_features(sample_df, lazy):
29 | if lazy:
30 | inp = sample_df.lazy()
31 | else:
32 | inp = sample_df
33 | result = (
34 | fill_features(inp, features=("feature1", "feature2"), sort_col="date", over_col="group").sort("group").collect()
35 | )
36 |
37 | expected = pl.DataFrame(
38 | {
39 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
40 | "group": ["A", "A", "A", "B", "B"],
41 | "feature1": [1.0, 1.0, 3.0, None, 5.0],
42 | "feature2": [None, 2.0, 2.0, 4.0, 4.0],
43 | }
44 | )
45 |
46 |     assert_frame_equal(result, expected)
47 |
48 |
49 | def test_fill_features_invalid_input(sample_df):
50 | with pytest.raises(ValueError):
51 | fill_features(sample_df, features=("non_existent_feature",), sort_col="date", over_col="group")
52 | with pytest.raises(TypeError):
53 | fill_features("not_a_dataframe", features=("feature1",), sort_col="date", over_col="group")
54 |
55 |
56 | def test_fill_features_all_null_column(sample_df):
57 | df_with_null_column = sample_df.with_columns(pl.lit(None).alias("null_feature"))
58 | result = fill_features(df_with_null_column, features=("null_feature",), sort_col="date", over_col="group")
59 | result = result.collect()
60 |
61 | assert all(result["null_feature"].is_null())
62 |
63 |
64 | def test_fill_features_multiple_groups(sample_df):
65 | multi_group_df = pl.DataFrame(
66 | {
67 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
68 | "group": ["A", "A", "A", "B", "B", "C"],
69 | "feature1": [1.0, np.nan, 3.0, np.inf, 5.0, np.nan],
70 | "feature2": [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0],
71 | }
72 | )
73 |
74 | result = fill_features(multi_group_df, features=("feature1", "feature2"), sort_col="date", over_col="group")
75 | result = result.collect()
76 |
77 | expected = pl.DataFrame(
78 | {
79 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
80 | "group": ["A", "A", "A", "B", "B", "C"],
81 | "feature1": [1.0, 1.0, 3.0, None, 5.0, None],
82 | "feature2": [None, 2.0, 2.0, 4.0, 4.0, 6.0],
83 | }
84 | )
85 |
86 |     assert_frame_equal(result, expected)
87 |
88 |
89 | def test_fill_features_different_sort_order(sample_df):
90 | result = fill_features(sample_df, features=("feature1", "feature2"), sort_col="date", over_col="group")
91 | result = result.sort("date", descending=True).collect()
92 |
93 | expected = pl.DataFrame(
94 | {
95 | "date": ["2023-01-05", "2023-01-04", "2023-01-03", "2023-01-02", "2023-01-01"],
96 | "group": ["B", "B", "A", "A", "A"],
97 | "feature1": [5.0, None, 3.0, 1.0, 1.0],
98 | "feature2": [4.0, 4.0, 2.0, 2.0, None],
99 | }
100 | )
101 |
102 |     assert_frame_equal(result, expected)
103 |
104 |
105 | ###
106 | # `smooth_features`
107 | ###
108 |
109 |
110 | @pytest.fixture
111 | def sample_smooth_df():
112 | return pl.DataFrame(
113 | {
114 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
115 | "group": ["A", "A", "A", "B", "B", "B"],
116 | "feature1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
117 | "feature2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
118 | }
119 | )
120 |
121 |
122 | @pytest.mark.parametrize("lazy", [True, False])
123 | def test_smooth_features(sample_smooth_df, lazy):
124 | if lazy:
125 | inp = sample_smooth_df.lazy()
126 | else:
127 | inp = sample_smooth_df
128 | result = (
129 | smooth_features(inp, features=("feature1", "feature2"), sort_col="date", over_col="group", window_size=2)
130 | .sort("group")
131 | .collect()
132 | )
133 |
134 | expected = pl.DataFrame(
135 | {
136 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
137 | "group": ["A", "A", "A", "B", "B", "B"],
138 | "feature1": [None, 1.5, 2.5, None, 4.5, 5.5],
139 | "feature2": [None, 15.0, 25.0, None, 45.0, 55.0],
140 | }
141 | )
142 |
143 |     assert_frame_equal(result, expected)
144 |
145 |
146 | def test_smooth_features_larger_window(sample_smooth_df):
147 | result = smooth_features(
148 | sample_smooth_df, features=("feature1", "feature2"), sort_col="date", over_col="group", window_size=10
149 | )
150 | result = result.collect()
151 |
152 | expected = pl.DataFrame(
153 | {
154 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
155 | "group": ["A", "A", "A", "B", "B", "B"],
156 | "feature1": [None, None, None, None, None, None],
157 | "feature2": [None, None, None, None, None, None],
158 | }
159 | ).with_columns(pl.col("feature1").cast(float).alias("feature1"), pl.col("feature2").cast(float).alias("feature2"))
160 |
161 |     assert_frame_equal(result, expected)
162 |
163 |
164 | def test_smooth_features_invalid_input(sample_smooth_df):
165 | with pytest.raises(TypeError):
166 | smooth_features("not_a_dataframe", features=("feature1",), sort_col="date", over_col="group", window_size=2)
167 | with pytest.raises(ValueError):
168 | smooth_features(
169 | sample_smooth_df, features=("non_existent_feature",), sort_col="date", over_col="group", window_size=2
170 | )
171 |
172 |
173 | def test_smooth_features_with_nulls(sample_smooth_df):
174 | df_with_nulls = sample_smooth_df.with_columns(
175 | pl.when(pl.col("feature1") == 3.0).then(None).otherwise(pl.col("feature1")).alias("feature1")
176 | )
177 | result = smooth_features(
178 | df_with_nulls, features=("feature1",), sort_col="date", over_col="group", window_size=2
179 | ).sort("date", "group")
180 | result = result.collect()
181 |
182 | expected = pl.DataFrame(
183 | {
184 | "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
185 | "group": ["A", "A", "A", "B", "B", "B"],
186 | "feature1": [None, 1.5, None, None, 4.5, 5.5],
187 | "feature2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
188 | }
189 | )
190 |
191 |     assert_frame_equal(result, expected)
192 |
193 |
194 | def test_smooth_features_window_size_one(sample_smooth_df):
195 | result = smooth_features(
196 | sample_smooth_df, features=("feature1", "feature2"), sort_col="date", over_col="group", window_size=1
197 | )
198 | result = result.collect()
199 |
200 | # With window_size=1, the result should be the same as the input
201 |     assert_frame_equal(result, sample_smooth_df)
202 |
203 |
204 | ###
205 | # `top_n_by_group`
206 | ###
207 |
208 |
209 | @pytest.fixture
210 | def top_n_df():
211 | return pl.DataFrame(
212 | {
213 | "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
214 | "subgroup": [1, 1, 2, 1, 2, 2, 1, 2, 3],
215 | "value": [10, 20, 30, 15, 25, 35, 5, 15, 25],
216 | }
217 | )
218 |
219 |
220 | @pytest.mark.parametrize("filter", [True, False])
221 | def test_top_n_by_group(top_n_df, filter):
222 | result = (
223 | top_n_by_group(top_n_df, n=2, rank_var="value", group_var=("group",), filter=filter)
224 | .collect()
225 | .sort("group", "subgroup")
226 | )
227 | if filter:
228 | assert result.shape == (6, 3)
229 | assert result["group"].to_list() == ["A", "A", "B", "B", "C", "C"]
230 | assert result["value"].to_list() == [20, 30, 25, 35, 15, 25]
231 | else:
232 | assert result.shape == (9, 4)
233 | assert "rank_mask" in result.columns
234 | assert result["rank_mask"].to_list() == [0, 1, 1, 0, 1, 1, 0, 1, 1]
235 |
236 |
237 | def test_top_n_by_multiple_groups(top_n_df):
238 | result = (
239 | top_n_by_group(top_n_df, n=1, rank_var="value", group_var=("group", "subgroup"), filter=True)
240 | .sort("group", "subgroup")
241 | .collect()
242 | )
243 | assert result.shape == (7, 3)
244 | assert result["group"].to_list() == ["A", "A", "B", "B", "C", "C", "C"]
245 | assert result["subgroup"].to_list() == [1, 2, 1, 2, 1, 2, 3]
246 | assert result["value"].to_list() == [20, 30, 15, 35, 5, 15, 25]
247 |
248 |
249 | @pytest.mark.parametrize("filter", [True, False])
250 | def test_lazyframe_input(filter):
251 | lazy_df = pl.LazyFrame({"group": ["A", "B"], "value": [1, 2]})
252 | result = top_n_by_group(lazy_df, n=1, rank_var="value", group_var=("group",), filter=filter)
253 | assert isinstance(result, pl.LazyFrame)
254 |
255 |
256 | def test_invalid_input():
257 | df = pl.DataFrame({"group": ["A", "B"], "wrong_column": [1, 2]})
258 | with pytest.raises(ValueError, match="missing one or more required columns"):
259 | top_n_by_group(df, n=1, rank_var="value", group_var=("group",), filter=True)
260 | with pytest.raises(TypeError, match="must be a Polars DataFrame or LazyFrame"):
261 | top_n_by_group("not_a_dataframe", n=1, rank_var="value", group_var=("group",), filter=True)
262 |
263 |
264 | def test_empty_dataframe():
265 | empty_df = pl.DataFrame({"group": [], "value": []})
266 | result = top_n_by_group(empty_df, n=1, rank_var="value", group_var=("group",), filter=True).collect()
267 | assert result.shape == (0, 2)
268 |
269 |
270 | def test_n_greater_than_group_size(top_n_df):
271 | result = top_n_by_group(top_n_df, n=5, rank_var="value", group_var=("group",), filter=True).collect()
272 | assert result.shape == (9, 3)
273 |
274 |
275 | def test_tie_handling(top_n_df):
276 | df_with_ties = top_n_df.with_columns(
277 | pl.when(pl.col("value") == 25).then(26).otherwise(pl.col("value")).alias("value")
278 | )
279 | result = (
280 | top_n_by_group(df_with_ties, n=1, rank_var="value", group_var=("group",), filter=True)
281 | .sort("group", "subgroup")
282 | .collect()
283 | )
284 | assert result.shape == (3, 3)
285 | assert result["value"].to_list() == [30, 35, 26]
286 |
--------------------------------------------------------------------------------
/toraniko/utils.py:
--------------------------------------------------------------------------------
1 | """Utility functions, primarily for data cleaning."""
2 |
3 | import numpy as np
4 | import polars as pl
5 |
6 |
7 | def fill_features(
8 | df: pl.DataFrame | pl.LazyFrame, features: tuple[str, ...], sort_col: str, over_col: str
9 | ) -> pl.LazyFrame:
10 | """Cast feature columns to numeric (float), convert NaN and inf values to null, then forward fill nulls
11 | for each column of `features`, sorted on `sort_col` and partitioned by `over_col`.
12 |
13 | Parameters
14 | ----------
15 | df: Polars DataFrame or LazyFrame containing columns `sort_col`, `over_col` and each of `features`
16 | features: collection of strings indicating which columns of `df` are the feature values
17 | sort_col: str column of `df` indicating how to sort
18 | over_col: str column of `df` indicating how to partition
19 |
20 | Returns
21 | -------
22 | Polars LazyFrame containing the original columns with cleaned feature data
23 | """
24 | try:
25 |         # eagerly check all `features`, `sort_col`, `over_col` present: can't catch ColumnNotFoundError in lazy context
26 | assert all(c in df.columns for c in features + (sort_col, over_col))
27 | return (
28 | df.lazy()
29 | .with_columns([pl.col(f).cast(float).alias(f) for f in features])
30 | .with_columns(
31 | [
32 | pl.when(
33 |                         (pl.col(f).abs() == np.inf)
34 |                         # NaN must be caught with is_nan(); a direct `== np.nan` comparison is always false
35 |                         | (pl.col(f).is_nan())
36 |                         | (pl.col(f).is_null())
37 | )
38 | .then(None)
39 | .otherwise(pl.col(f))
40 | .alias(f)
41 | for f in features
42 | ]
43 | )
44 | .sort(by=sort_col)
45 | .with_columns([pl.col(f).forward_fill().over(over_col).alias(f) for f in features])
46 | )
47 | except AttributeError as e:
48 | raise TypeError("`df` must be a Polars DataFrame | LazyFrame, but it's missing required attributes") from e
49 | except AssertionError as e:
50 | raise ValueError(f"`df` must have all of {[over_col, sort_col] + list(features)} as columns") from e
51 |
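# Editorial usage sketch for `fill_features` (illustrative; the column names below are
# hypothetical, not required by the library):
#
#   df = pl.DataFrame({
#       "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
#       "symbol": ["AAPL", "AAPL", "AAPL"],
#       "book_price": [0.4, float("nan"), 0.5],
#   })
#   cleaned = fill_features(df, features=("book_price",), sort_col="date", over_col="symbol").collect()
#   # the NaN on 2024-01-02 is converted to null and then forward filled with 0.4 within the
#   # "AAPL" partition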
52 |
53 | def smooth_features(
54 | df: pl.DataFrame | pl.LazyFrame,
55 | features: tuple[str, ...],
56 | sort_col: str,
57 | over_col: str,
58 | window_size: int,
59 | ) -> pl.LazyFrame:
60 | """Smooth the `features` columns of `df` by taking the rolling mean of each, sorted over `sort_col` and
61 | partitioned by `over_col`, using `window_size` trailing periods for the moving average window.
62 |
63 | Parameters
64 | ----------
65 | df: Polars DataFrame | LazyFrame containing columns `sort_col`, `over_col` and each of `features`
66 | features: collection of strings indicating which columns of `df` are the feature values
67 | sort_col: str column of `df` indicating how to sort
68 | over_col: str column of `df` indicating how to partition
69 | window_size: int number of time periods for the moving average
70 |
71 | Returns
72 | -------
73 | Polars LazyFrame containing the original columns, with each of `features` replaced with moving average
74 | """
75 | try:
76 | # eagerly check `over_col`, `sort_col`, `features` present: can't catch pl.ColumnNotFoundError in lazy context
77 | assert all(c in df.columns for c in features + (over_col, sort_col))
78 | return (
79 | df.lazy()
80 | .sort(by=sort_col)
81 | .with_columns([pl.col(f).rolling_mean(window_size=window_size).over(over_col).alias(f) for f in features])
82 | )
83 | except AttributeError as e:
84 | raise TypeError("`df` must be a Polars DataFrame | LazyFrame, but it's missing required attributes") from e
85 | except AssertionError as e:
86 | raise ValueError(f"`df` must have all of {[over_col, sort_col] + list(features)} as columns") from e
87 |
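# Editorial usage sketch for `smooth_features` (illustrative; the column names are hypothetical):
#
#   smoothed = smooth_features(
#       df, features=("mom_score",), sort_col="date", over_col="symbol", window_size=2
#   ).collect()
#   # each value becomes the trailing 2-period mean within its "symbol" partition; the first
#   # row of each partition is null because a full window is not yet available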
88 |
89 | def top_n_by_group(
90 | df: pl.DataFrame | pl.LazyFrame,
91 | n: int,
92 | rank_var: str,
93 | group_var: tuple[str, ...],
94 | filter: bool = True,
95 | ) -> pl.LazyFrame:
96 |     """Mark the top `n` rows within each group defined by `group_var`, ranked on `rank_var` in descending order.
97 |
98 | If `filter` is True, the returned DataFrame contains only the filtered data. If `filter` is False,
99 | the returned DataFrame has all data, with an additional 'rank_mask' column indicating if that row
100 | is in the filter.
101 |
102 | Parameters
103 | ----------
104 | df: Polars DataFrame | LazyFrame
105 | n: integer indicating the top rows to take in each group
106 | rank_var: str column name to rank on
107 | group_var: tuple of str column names to group and sort on
108 |     filter: bool indicating whether to return only the top rows (True) or all rows with a 'rank_mask' column (False)
109 |
110 | Returns
111 | -------
112 | Polars LazyFrame containing original columns and optional filter column
113 | """
114 | try:
115 | # eagerly check `rank_var`, `group_var` are present: we can't catch a ColumnNotFoundError in a lazy context
116 | assert all(c in df.columns for c in (rank_var,) + group_var)
117 | rdf = (
118 | df.lazy()
119 | .sort(by=list(group_var) + [rank_var])
120 | .with_columns(pl.col(rank_var).rank(descending=True).over(group_var).cast(int).alias("rank"))
121 | )
122 | match filter:
123 | case True:
124 | return rdf.filter(pl.col("rank") <= n).drop("rank").sort(by=list(group_var) + [rank_var])
125 | case False:
126 | return (
127 | rdf.with_columns(
128 | pl.when(pl.col("rank") <= n).then(pl.lit(1)).otherwise(pl.lit(0)).alias("rank_mask")
129 | )
130 | .drop("rank")
131 | .sort(by=list(group_var) + [rank_var])
132 | )
133 | except AssertionError as e:
134 | raise ValueError(f"`df` is missing one or more required columns: '{rank_var}' and '{group_var}'") from e
135 | except AttributeError as e:
136 | raise TypeError("`df` must be a Polars DataFrame or LazyFrame but is missing a required attribute") from e
137 |
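# Editorial usage sketch for `top_n_by_group` (illustrative; the column names are hypothetical):
#
#   top_2 = top_n_by_group(df, n=2, rank_var="market_cap", group_var=("date",), filter=True).collect()
#   # keeps only the two largest `market_cap` rows per date; with filter=False all rows are
#   # returned and a 0/1 "rank_mask" column marks membership in the top 2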
--------------------------------------------------------------------------------