├── .gitignore ├── .pre-commit-config.yaml ├── LICENSE ├── README.md ├── causalitydata ├── __init__.py ├── analysis │ ├── __init__.py │ ├── backtest.py │ ├── compounding.py │ └── evaluate_pnl.py ├── data │ ├── __init__.py │ └── dataloader.py └── notebook │ └── 01-Backtesting-Signals.ipynb ├── images └── logo.png ├── poetry.lock ├── pyproject.toml ├── scripts ├── install_ipykernel.py └── uninstall_ipykernel.py └── tests └── __init__.py /.gitignore: -------------------------------------------------------------------------------- 1 | **/__pycache__ 2 | .vscode 3 | .idea 4 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: https://github.com/pre-commit/pre-commit-hooks 3 | rev: v3.2.0 4 | hooks: 5 | - id: check-toml 6 | - id: check-yaml 7 | - id: end-of-file-fixer 8 | - id: trailing-whitespace 9 | - id: check-added-large-files 10 | args: ['--maxkb=1000'] 11 | - repo: https://github.com/python-poetry/poetry 12 | rev: '1.7.1' 13 | hooks: 14 | - id: poetry-check 15 | - repo: https://github.com/psf/black 16 | rev: '24.2.0' 17 | hooks: 18 | - id: black 19 | - id: black-jupyter 20 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2024, Causality SL 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | 1. Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | 2. Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | 3. Neither the name of the copyright holder nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | 7 | 8 | 9 | 10 | 17 | [![Contributors][contributors-shield]][contributors-url] 18 | [![Forks][forks-shield]][forks-url] 19 | [![Stargazers][stars-shield]][stars-url] 20 | [![Issues][issues-shield]][issues-url] 21 | [![BSD3 License][license-shield]][license-url] 22 | [![LinkedIn][linkedin-shield]][linkedin-url] 23 | 24 | 25 | 26 | 27 |
28 |
29 | 30 | Logo 31 | 32 | 33 |

Causality Benchmark Data

34 | 35 |

36 | Showcasing Causality Group's benchmark data through a data loading library and a signal backtesting example. 37 | 39 |
40 |
41 | 43 | Report Bug 44 | · 45 | Request Feature 46 |

47 |
48 | 49 | 50 | 51 | 52 |
Table of Contents

  1. [About the Project](#about-the-project)
     - [Built With](#built-with)
  2. [Getting Started](#getting-started)
     - [Prerequisites](#prerequisites)
     - [Installation](#installation)
     - [Available Scripts](#available-scripts)
  3. [Backtesting and Data Layout](#backtesting-and-data-layout)
  4. [License](#license)
  5. [Contact](#contact)
82 |
83 | 84 | 85 | 86 | 87 | ## About the Project 88 | 89 | Have you ever found yourself struggling to prepare clean financial data for analysis or to align data from various sources? 90 | 91 | With this repository, you can explore Causality Group's curated historical dataset for academic and non-commercial use, covering the 1500 most liquid stocks in the US equities markets. 92 | 93 | Features include: 94 | * Liquid universe of 1500 stocks, updated monthly 95 | * Free from survivorship bias 96 | * Daily Open, High, Low, Close, VWAP, and Volume 97 | * Overnight returns adjusted for splits, dividends, mergers, and acquisitions 98 | * Intraday 5-minute VWAP, spread, and volume snapshots 99 | * SPY ETF data for hedging 100 | * CAPM betas and residuals for market-neutral analysis 101 | 102 | Please contact us on [LinkedIn](https://www.linkedin.com/in/markhorvath-ai) for access to the dataset! 103 | 104 | More details [here](#backtesting-and-data-layout) 105 | 106 |

(back to top)

107 | 108 | 109 | 110 | ### Built With 111 | 112 | * [![Python][Python.org]][Python-url] 113 | * [![Poetry][Poetry.org]][Poetry-url] 114 | * [![Jupyter][Jupyter.org]][Jupyter-url] 115 | * [![Matplotlib][Matplotlib.org]][Matplotlib-url] 116 | * [![Numpy][Numpy.org]][Numpy-url] 117 | * [![Pandas][Pandas.org]][Pandas-url] 118 | * [![Sklearn][Sklearn.org]][Sklearn-url] 119 | * [![Scipy][Scipy.org]][Scipy-url] 120 | 121 |

(back to top)

122 | 123 | 124 | 125 | 126 | ## Getting Started 127 | 128 | Follow these steps to set up the project on your local machine for development and testing purposes. 129 | 130 | ### Prerequisites 131 | 132 | Ensure you have the following installed on your local setup: 133 | - Python 3.9.5 134 | - Poetry (see [installation instructions](https://python-poetry.org/docs/#installation)) 135 | 136 | ### Installation 137 | 138 | 1. Clone the repository. 139 | 2. Install the dependencies: 140 | ```bash 141 | poetry install 142 | ``` 143 | > **Optional:** If you want to use the Jupyter kernel, install the optional `jupyter` group of dependencies with `poetry install --with jupyter`. 144 | 145 | 3. Install the pre-commit hooks: 146 | ```bash 147 | poetry run pre-commit install 148 | ``` 149 | 150 | You're all set! Pre-commit hooks will run on git commit (more information in [pre-commit docs](https://pre-commit.com/index.html)). Ensure your changes pass all checks before pushing. 151 | 152 | ### Available Scripts 153 | - `poetry run black ./causalitydata`: Runs the code formatter. 154 | - `poetry run pylint ./causalitydata`: Runs the linter. 155 | - `poetry run install-ipykernel`: Installs the causality kernel for Jupyter. 156 | - `poetry run uninstall-ipykernel`: Uninstalls the causality kernel for Jupyter. 157 | 158 | > **Note:** To run the ipykernel scripts you need to install the optional `jupyter` group of dependencies. Use `poetry install --with jupyter`. 159 | 160 |
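For example, a typical first-time setup for working with the notebook looks like this (the Jupyter front end you launch is up to you; this project only manages the kernel registration):

```bash
poetry install --with jupyter
poetry run pre-commit install
poetry run install-ipykernel   # registers a kernel named "causalitydata-p3.9"
# Open causalitydata/notebook/01-Backtesting-Signals.ipynb in your Jupyter
# front end and select the kernel registered above.
```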

(back to top)

## Backtesting and Data Layout

### Backtesting

[01-Backtesting-Signals.ipynb](https://github.com/causality-group/causality-benchmark-data/blob/main/causalitydata/notebook/01-Backtesting-Signals.ipynb) serves as a minimal example of using the dataset and library for quantitative analysis, alpha signal research, and backtesting.

The example showcases a daily backtest relying on close-to-close adjusted returns of the 1500 most liquid companies in the US since 2007. Since the most liquid companies change constantly, we update our liquid universe at the start of each month. This dynamic universe is pre-calculated in the `universe.csv` data file.

Assuming trading at the 16:00 close auction in the US, our example only uses features for alpha creation that are observable by 15:45. We plot the performance of some well-known alpha factors and invite you to experiment with building your quantitative investment model from there!

### Data Layout

All data files in the benchmark dataset have the same structure:

* Data files are in `.csv` format.
* The first row contains the header.
* Rows represent different dates in increasing order. There is only one row per date, i.e., there is no intraday granularity within the files.
* The first column is the index and contains the date at which the given value is observable:
  * Date format: `YYYY-MM-DD`.
* Every other column represents an individual asset in the universe:
  * Asset identifier format: `<ticker>_<exchange>_<CFI category>`, e.g., `AAPL_XNAS_ESXXXX`.
* All files have the same number of rows and columns.

There are two types of files in the dataset, *daily* and *intraday*. *Daily* files contain data with at most one datapoint per day, e.g., open auction price, daily volume, GICS sector information, etc. *Intraday* files contain information about market movements during the US trading session, e.g., intraday prices and volumes. We accumulate this data in 5-minute bars. The name of an *intraday* file starts with an integer identifying the bar time.

#### File Description

Here we detail the data contained in the files whose content may not be obvious from their name.

* **Daily**
  * `universe.csv`: Mask of the tradable universe at each date. The universe is rebalanced at the beginning of each month.
  * `ret_<period>.csv`: Adjusted asset returns calculated over different time periods:
    * `cc`: Close-to-Close, the position is entered at the close auction and exited at the following day's close auction.
    * `co`: Close-to-Open, the position is entered at the close auction and exited at the following day's open auction.
    * `oc`: Open-to-Close, the position is entered at the open auction and exited at the same day's close auction.
    * `oo`: Open-to-Open, the position is entered at the open auction and exited at the following day's open auction.
  * `SPY_ret_<period>.csv`: SPY ETF return. The SPY time series is placed in all asset columns for convenience.
  * `beta_<period>.csv`: CAPM betas between assets and the SPY ETF for different time periods.
  * `resid_<period>.csv`: CAPM residual returns over different time periods: `resid = ret - beta * SPY_ret`.
* **Intraday**
  * `<hhmmss>_<field>_5m.csv`: Intraday market data snapshot at the `hhmmss` bar. These backward-looking bars are calculated over the time range `[t-5min, t)`.

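Putting the pieces together, here is a minimal sketch of loading two of the daily files with the library and backtesting a toy one-day reversal signal. The `./data` root, the `ret_cc` field name, the toy signal itself, and the assumption that the universe mask equals 1 for tradable assets are illustrative only; adapt them to your copy of the dataset.

```python
from causalitydata.data.dataloader import load_field_df
from causalitydata.analysis.backtest import signalbacktest_df
from causalitydata.analysis.evaluate_pnl import calculate_performance_df, plot_pnl

DATA_ROOT = "./data"  # illustrative path to the extracted benchmark dataset

# Daily files share one layout: rows are dates, columns are asset identifiers.
universe_df = load_field_df("universe", DATA_ROOT)  # tradable-universe mask
ret_cc_df = load_field_df("ret_cc", DATA_ROOT)  # adjusted close-to-close returns

# Toy signal observable before the 16:00 auction of date t:
# the close-to-close return that ended at the previous day's close, sign-flipped.
signal_df = -ret_cc_df.shift(1)[universe_df == 1.0]

# Forward-looking return earned by entering at date t's close auction.
upcoming_ret_df = ret_cc_df.shift(-1)[universe_df == 1.0]

per_asset_pnl_df = signalbacktest_df(signal_df, upcoming_ret_df, enter_at_stdev=2.0)
pnl_df = per_asset_pnl_df.sum(axis=1).to_frame("reversal")  # daily strategy returns

print(calculate_performance_df(pnl_df))
plot_pnl(pnl_df)
```

The compounding helpers in `causalitydata/analysis/compounding.py` can be used to enforce the 15:45 observability rule more rigorously when signals are built from multi-day returns.
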

(back to top)

208 | 209 | 210 | 211 | 212 | 224 | 225 | 226 | 227 | 243 | 244 | 245 | 246 | ## License 247 | 248 | Distributed under the BSD-3 License. See `LICENSE` for more information. 249 | 250 |

(back to top)

251 | 252 | 253 | 254 | 255 | ## Contact 256 | 257 | Please reach us at [LinkedIn](https://www.linkedin.com/in/markhorvath-ai) or visit our [website](https://www.causalitygroup.com)! 258 | 259 |

(back to top)

260 | 261 | 262 | 263 | 264 | 273 | 274 | 275 | 276 | 277 | [contributors-shield]: https://img.shields.io/github/contributors/causality-group/causality-benchmark-data?style=for-the-badge 278 | [contributors-url]: https://github.com/causality-group/causality-benchmark-data/graphs/contributors 279 | [forks-shield]: https://img.shields.io/github/forks/causality-group/causality-benchmark-data.svg?style=for-the-badge 280 | [forks-url]: https://github.com/causality-group/causality-benchmark-data/network/members 281 | [stars-shield]: https://img.shields.io/github/stars/causality-group/causality-benchmark-data?style=for-the-badge 282 | [stars-url]: https://github.com/causality-group/causality-benchmark-data/stargazers 283 | [issues-shield]: https://img.shields.io/github/issues/causality-group/causality-benchmark-data.svg?style=for-the-badge 284 | [issues-url]: https://github.com/causality-group/causality-benchmark-data/issues 285 | [license-shield]: https://img.shields.io/github/license/causality-group/causality-benchmark-data.svg?style=for-the-badge 286 | [license-url]: https://github.com/causality-group/causality-benchmark-data/blob/main/LICENSE 287 | [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555 288 | [linkedin-url]: https://linkedin.com/company/causality-group 289 | 290 | [Python.org]: https://img.shields.io/badge/Python-3.9.5-blue?style=for-the-badge&logo=python&logoColor=ffdd54&labelColor=3776ab&color=3776ab 291 | [Python-url]: https://python.org/ 292 | [Poetry.org]: https://img.shields.io/badge/Poetry-1.7.1-%233B82F6?style=for-the-badge&logo=poetry&logoColor=0B3D8D&labelColor=%233B82F6 293 | [Poetry-url]: https://python-poetry.org/ 294 | [Jupyter.org]: https://img.shields.io/badge/jupyter-8.6.0-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white&labelColor=%23FA0F00 295 | [Jupyter-url]: https://jupyter.org/ 296 | [Pandas.org]: https://img.shields.io/badge/pandas-2.2.0-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white&labelColor=%23150458&color=%23150458 297 | [Pandas-url]: https://pandas.pydata.org/ 298 | [Matplotlib.org]: https://img.shields.io/badge/Matplotlib-3.8.3-%23ffffff.svg?style=for-the-badge&logo=Matplotlib&logoColor=black&labelColor=%23ffffff 299 | [Matplotlib-url]: https://matplotlib.org 300 | [Numpy.org]: https://img.shields.io/badge/numpy-1.26.4-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white&labelColor=%23013243 301 | [Numpy-url]: https://numpy.org 302 | [Sklearn.org]: https://img.shields.io/badge/scikit--learn-1.0.1-%23F7931E.svg?style=for-the-badge&logo=scikit-learn&logoColor=white&labelColor=%23F7931E 303 | [Sklearn-url]: http://scikit-learn.org 304 | [SciPy.org]: https://img.shields.io/badge/SciPy-1.12.0-%230C55A5.svg?style=for-the-badge&logo=scipy&logoColor=%white&labelColor=%230C55A5 305 | [Scipy-url]: https://scipy.org 306 | -------------------------------------------------------------------------------- /causalitydata/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/causality-group/causality-benchmark-data/16c83f4fa5bbeab0583858e702973aba6cceaf28/causalitydata/__init__.py -------------------------------------------------------------------------------- /causalitydata/analysis/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/causality-group/causality-benchmark-data/16c83f4fa5bbeab0583858e702973aba6cceaf28/causalitydata/analysis/__init__.py
--------------------------------------------------------------------------------
/causalitydata/analysis/backtest.py:
--------------------------------------------------------------------------------
def signalbacktest_df(
    signal_df, upcoming_ret_df, enter_at_stdev=2.0, do_return_signal=False
):
    """A simple backtest, trading cross-sectional signals at a given standard deviation.

    Invests $0.5 both long and short, hence results can be interpreted as strategy returns.

    Args:
        signal_df: DataFrame with signals observable at the Timestamps in its index.
        upcoming_ret_df: DataFrame with forward-looking returns tradeable at the Timestamps in its index.
        enter_at_stdev: Number of standard deviations at which to enter the trade, on both the long and short side.
        do_return_signal: If True, returns the signal DataFrame as well, otherwise only the strategy returns.

    Returns:
        DataFrame with strategy returns.
    """
    is_long_df = (
        signal_df.add(
            -(signal_df.mean(axis=1) + enter_at_stdev * signal_df.std(axis=1)), axis=0
        )
        > 0.0
    )
    is_short_df = (
        signal_df.add(
            -(signal_df.mean(axis=1) - enter_at_stdev * signal_df.std(axis=1)), axis=0
        )
        < 0.0
    )

    signal_df = is_long_df.astype(int) - is_short_df.astype(int)
    # Demean cross-sectionally:
    signal_df = signal_df.add(-signal_df.mean(axis=1), axis=0)
    # Normalize cross-sectionally:
    signal_df = signal_df.div(signal_df.abs().sum(axis=1), axis=0)

    if do_return_signal:
        return signal_df * upcoming_ret_df, signal_df
    else:
        return signal_df * upcoming_ret_df
--------------------------------------------------------------------------------
/causalitydata/analysis/compounding.py:
--------------------------------------------------------------------------------
"""Functionality to compound returns."""

import numpy as np
import pandas as pd
from typing import List

EPSILON_TIME = pd.offsets.Micro()


def compound_ret_df(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Compounds arithmetic returns.

    Args:
        df1: DataFrame with arithmetic returns
        df2: DataFrame with arithmetic returns

    Returns:
        DataFrame with compounded returns
    """
    return (1.0 + df1) * (1.0 + df2) - 1.0


def _sum_nan_if_all_nan_df(df: pd.DataFrame) -> pd.Series:
    """
    If all values in a column are NaN, the result will be NaN.
    Otherwise, the sum of the values.
    """
    sum_ser = df.sum(axis=0)
    all_nan_ser = df.count(axis=0) == 0
    sum_ser[all_nan_ser] = np.nan
    return sum_ser


def compound_upcoming_bar_return_cc_df(
    ret_cc_df: pd.DataFrame, bar_tss: List[pd.Timestamp]
) -> pd.DataFrame:
    """
    Compounds daily adjusted close-to-close returns for each Timestamp in bar_tss,
    such that we can trade those at the 16:00 auction.

    Return bars do not overlap; each covers the time period from 16:00 of
    bar_tss[i] to 16:00 of bar_tss[i+1].
    """
    # Returns end on the index date, so align them to the start date of the return
    ret_cc_df = ret_cc_df.shift(-1)
    ret_cc_df = np.log(ret_cc_df + 1.0)  # convert to log returns
    bar_tss = bar_tss + [pd.Timestamp.max]
    log_bar_ret_df = pd.DataFrame(
        {
            # Indexing by Timestamp is inclusive, hence: -EPSILON_TIME
            ts: _sum_nan_if_all_nan_df(
                ret_cc_df.loc[ts : bar_tss[i + 1] - EPSILON_TIME, :]
            )
            for i, ts in enumerate(bar_tss[:-1])
        }
    ).T
    return np.exp(log_bar_ret_df) - 1.0


def compound_observable_bar_return_cc_df(
    ret_cc_df: pd.DataFrame,
    bar_tss: List[pd.Timestamp],
) -> pd.DataFrame:
    """
    Compounds daily adjusted close-to-close returns for each Timestamp in bar_tss,
    such that we can observe those at 15:45 to trade the 16:00 auction.

    In practice, this means using close prices only up to the previous close.

    Return bars do not overlap; each covers the time period from 16:00 of
    bar_tss[i] to 16:00 of bar_tss[i+1].
    """
    ret_cc_df = np.log(ret_cc_df + 1.0)  # convert to log returns
    bar_tss = [pd.Timestamp.min] + bar_tss
    log_bar_ret_df = pd.DataFrame(
        {
            # Indexing by Timestamp is inclusive, hence: -EPSILON_TIME
            ts: _sum_nan_if_all_nan_df(
                ret_cc_df.loc[bar_tss[i - 1] : ts - EPSILON_TIME, :]
            )
            for i, ts in enumerate(bar_tss)
            if i > 0
        }
    ).T
    return np.exp(log_bar_ret_df) - 1.0
--------------------------------------------------------------------------------
/causalitydata/analysis/evaluate_pnl.py:
--------------------------------------------------------------------------------
"""Functionality to evaluate performance."""

from typing import List, Optional, Tuple

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


def plot_pnl(
    pnl_df: pd.DataFrame,
    xlabel: Optional[str] = "Time",
    ylabel: Optional[str] = "Cumulative PnL",
    title: Optional[str] = "Profit and Loss Curve",
    append_legends: Optional[List[str]] = None,
    figure_size: Tuple[int, int] = (10, 6),
):
    """
    Plots a PnL DataFrame.

    Args:
        pnl_df: DataFrame with PnL values for each trading period
        xlabel: Label for the x-axis
        ylabel: Label for the y-axis
        title: Title of the plot
        append_legends: List of strings to append to the legend of each line
        figure_size: Size of the figure (width, height)
    """
    plt.figure(figsize=figure_size)
    plt.plot(pnl_df.index, pnl_df.cumsum().values)
    if xlabel is not None:
        plt.xlabel(xlabel)
    if ylabel is not None:
        plt.ylabel(ylabel)
    if title is not None:
        plt.title(title)
    legends = pnl_df.columns.tolist()
    if append_legends is not None:
        new_legend = []
        for i, item in enumerate(legends):
            new_legend += [item + append_legends[i]]
        legends = new_legend
    plt.legend(legends, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()


def calculate_performance_df(
    pnl_df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Calculates performance metrics for a PnL DataFrame:
    - Cumulative PnL
    - Annual PnL
    - Average PnL
    - Maximum PnL
    - Minimum PnL
    - Annual Standard Deviation
    - Annual Sharpe Ratio
    - Sortino Ratio
    - Maximum Drawdown
    - Calmar Ratio
    - Martin Ratio

    Args:
        pnl_df: DataFrame with PnL values for each trading period,
            one column per strategy.

    Returns:
        pd.DataFrame with performance metrics.
        Performance metrics are rows of the DataFrame,
        and column names match column names in the pnl_df argument.
    """

    num_periods_per_year = len(pnl_df.index) / (
        (pnl_df.index[-1].year + (pnl_df.index[-1].month - 1) / 12)
        - (pnl_df.index[0].year + (pnl_df.index[0].month - 1) / 12)
    )

    performance_df = pd.DataFrame()

    # Cumulative PnL
    performance_df["Cumulative PnL"] = pnl_df.sum()

    # Annual PnL
    performance_df["Annual PnL"] = pnl_df.mean() * num_periods_per_year

    # Average PnL
    performance_df["Average PnL"] = pnl_df.mean()

    # Maximum PnL
    performance_df["Maximum PnL"] = pnl_df.max()

    # Minimum PnL
    performance_df["Minimum PnL"] = pnl_df.min()

    # Annual Standard Deviation
    performance_df["Annual Standard Deviation"] = pnl_df.std() * np.sqrt(
        num_periods_per_year
    )

    # Annual Sharpe Ratio
    performance_df["Annual Sharpe Ratio"] = (
        pnl_df.mean() / pnl_df.std() * np.sqrt(num_periods_per_year)
    )

    # Sortino Ratio
    downside_returns = pnl_df[pnl_df < 0]
    downside_std = downside_returns.std()
    performance_df["Sortino Ratio"] = (
        pnl_df.mean() / downside_std * np.sqrt(num_periods_per_year)
    )

    # Maximum Drawdown
    cumulative_returns = pnl_df.cumsum()
    rolling_max = cumulative_returns.cummax()
    drawdown = cumulative_returns - rolling_max
    performance_df["Maximum Drawdown"] = drawdown.min()

    # Calmar Ratio (annualized PnL over the maximum drawdown)
    performance_df["Calmar Ratio"] = (
        pnl_df.mean() * num_periods_per_year / abs(drawdown.min())
    )

    # Martin Ratio (annualized PnL over the Ulcer Index)
    performance_df["Martin Ratio"] = (
        pnl_df.mean() * num_periods_per_year / np.sqrt((drawdown**2).mean())
    )

    # Statistical significance of the null hypothesis that the mean PnL is
    # zero or negative, using a one-sided t-test (requires scipy.stats):
    # t_stat, p_value = stats.ttest_1samp(pnl_df, 0, alternative='less')
    # performance_df['T-Test p-value'] = p_value

    return performance_df.T
-------------------------------------------------------------------------------- /causalitydata/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/causality-group/causality-benchmark-data/16c83f4fa5bbeab0583858e702973aba6cceaf28/causalitydata/data/__init__.py -------------------------------------------------------------------------------- /causalitydata/data/dataloader.py: -------------------------------------------------------------------------------- 1 | """Functionality to load data.""" 2 | 3 | import os 4 | from typing import Tuple, Optional 5 | 6 | import numpy as np 7 | import pandas as pd 8 | 9 | 10 | def load_field_df( 11 | field_name: str, 12 | data_root_path: str, 13 | shift: int = 0, 14 | end_ts: Optional[pd.Timestamp] = None, 15 | dtype: type = np.float64, 16 | ) -> pd.DataFrame: 17 | """ 18 | Loads data fields into DataFrame with DateTimeIndex. 19 | 20 | Args: 21 | field_name: name of the field to load 22 | data_root_path: path to the data root directory 23 | shift: number of periods to shift the data by. 24 | Positive shift lags, negative shift peeks into the future. 25 | end_ts: end date of the DataFrame. 26 | If None, the whole DataFrame is returned. 27 | dtype: data type to read in 28 | 29 | Returns: 30 | DataFrame with DateTimeIndex. 31 | Columns are strings including ticker symbol, 32 | exchange and CFI category, while rows are dates of the observations. 33 | """ 34 | df = load_path_df(os.path.join(data_root_path, field_name + ".csv"), dtype=dtype) 35 | df.index = pd.DatetimeIndex(pd.to_datetime(df.index)) 36 | df = df.shift(shift) 37 | if end_ts is not None: 38 | df = df.loc[:end_ts, :] 39 | return df 40 | 41 | 42 | def load_path_df( 43 | csvfile_path: str, 44 | exclude_tickers: Tuple = (), 45 | dtype: type = np.float64, 46 | ) -> pd.DataFrame: 47 | """ 48 | Reads data into pandas from csv files. 49 | 50 | Args: 51 | csvfile_path: path to csv file 52 | exclude_tickers: list of tickers to exclude 53 | dtype: data type to read in 54 | 55 | Returns: 56 | pd.DataFrame with time index and columns for each ticker 57 | """ 58 | if "str" in str(dtype): 59 | df = pd.read_csv( 60 | csvfile_path, 61 | # avoid DtypeWarning: Columns (1..1330) have mixed types. 62 | # Specify dtype option on import or set low_memory=False 63 | low_memory=False, 64 | index_col=0, 65 | header=0, 66 | parse_dates=False, 67 | dtype=dtype, 68 | ) 69 | else: # numeric type 70 | df = pd.read_csv( 71 | csvfile_path, 72 | # avoid DtypeWarning: Columns (1..1330) have mixed types. 
73 | # Specify dtype option on import or set low_memory=False 74 | low_memory=False, 75 | index_col=0, 76 | header=0, 77 | parse_dates=False, 78 | ) 79 | 80 | if exclude_tickers: 81 | for ex in exclude_tickers: 82 | df = df[[s for s in df.columns if ex not in s]] 83 | 84 | if "str" in str(dtype): 85 | return df 86 | else: # numeric type 87 | return df.astype(dtype=dtype, copy=False) 88 | -------------------------------------------------------------------------------- /images/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/causality-group/causality-benchmark-data/16c83f4fa5bbeab0583858e702973aba6cceaf28/images/logo.png -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "causalitydata" 3 | version = "0.1.0" 4 | description = "Showcasing time series forecasting and machine learning optimization with Darts." 5 | authors = ["Causality Group"] 6 | readme = "README.md" 7 | 8 | [tool.poetry.dependencies] 9 | python = "~3.9" 10 | scikit-learn = "1.0.1" 11 | darts = "0.27.1" 12 | stats = "0.1.2a" 13 | 14 | 15 | [tool.poetry.group.dev.dependencies] 16 | pytest = "^8.0.1" 17 | black = {extras = ["jupyter"], version = "^24.2.0"} 18 | pylint = "^3.0.3" 19 | pre-commit = "^3.6.2" 20 | 21 | 22 | [tool.poetry.group.jupyter] 23 | optional = true 24 | 25 | [tool.poetry.group.jupyter.dependencies] 26 | ipykernel = "6.29.0" 27 | 28 | [build-system] 29 | requires = ["poetry-core"] 30 | build-backend = "poetry.core.masonry.api" 31 | 32 | [tool.poetry.scripts] 33 | install-ipykernel = "scripts.install_ipykernel:install_ipykernel" 34 | uninstall-ipykernel = "scripts.uninstall_ipykernel:uninstall_ipykernel" 35 | -------------------------------------------------------------------------------- /scripts/install_ipykernel.py: -------------------------------------------------------------------------------- 1 | """Script to install causality ipykernel.""" 2 | 3 | import logging 4 | from ipykernel.kernelspec import install 5 | 6 | # Set up logging 7 | logging.basicConfig(level=logging.INFO) 8 | 9 | 10 | def install_ipykernel(): 11 | """Installs causality ipykernel.""" 12 | try: 13 | install(user=True, kernel_name="causalitydata-p3.9") 14 | logging.info("Successfully installed causality ipykernel.") 15 | except Exception as e: 16 | logging.error("Failed to install causality ipykernel.") 17 | logging.error(repr(e)) 18 | 19 | 20 | if __name__ == "__main__": 21 | install_ipykernel() 22 | -------------------------------------------------------------------------------- /scripts/uninstall_ipykernel.py: -------------------------------------------------------------------------------- 1 | """Script to uninstall causality ipykernel.""" 2 | 3 | import logging 4 | from jupyter_client.kernelspec import KernelSpecManager 5 | 6 | # Set up logging 7 | logging.basicConfig(level=logging.INFO) 8 | 9 | 10 | def uninstall_ipykernel(): 11 | """Uninstalls causality ipykernel.""" 12 | try: 13 | KernelSpecManager().remove_kernel_spec("causalitydata-p3.9") 14 | logging.info("Successfully uninstalled causality ipykernel.") 15 | except Exception as e: 16 | logging.error("Failed to uninstall causality ipykernel.") 17 | logging.error(repr(e)) 18 | 19 | 20 | if __name__ == "__main__": 21 | uninstall_ipykernel() 22 | -------------------------------------------------------------------------------- /tests/__init__.py: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/causality-group/causality-benchmark-data/16c83f4fa5bbeab0583858e702973aba6cceaf28/tests/__init__.py --------------------------------------------------------------------------------