├── .gitattributes ├── .gitignore ├── README.md ├── assets ├── Notebook.ipynb └── Tabularz.png ├── deltapy ├── __init__.py ├── extract.py ├── interact.py ├── mapper.py └── transform.py └── setup.py /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # DeltaPy⁠⁠ — Tabular Data Augmentation & Feature Engineering 3 | 4 | [![Downloads](https://pepy.tech/badge/deltapy)](https://pepy.tech/project/deltapy) 5 | 6 | [![DOI](https://zenodo.org/badge/253993655.svg)](https://zenodo.org/badge/latestdoi/253993655) 7 | 8 | --------- 9 | 10 | Finance Quant Machine Learning 11 | ------------------ 12 | - [ML-Quant.com](https://www.ml-quant.com/) - Automated Research Repository 13 | 14 | ### Introduction 15 | 16 | Tabular augmentation is a new experimental space that makes use of novel and traditional data generation and synthesisation techniques to improve model prediction success. It is in essence a process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set. DeltaPy was created with finance applications in mind, but it can be broadly applied to any data-rich environment. 17 | 18 | To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: **(1) transforming**, **(2) interacting**, **(3) mapping**, **(4) extracting**, and **(5) synthesising**. What follows is a practical example of how the above methodology can be used. The purpose here is to establish a framework for table augmentation and to point and guide the user to existing packages. 19 | 20 | For most the [Colab Notebook](https://colab.research.google.com/drive/1-uJqGeKZfJegX0TmovhsO90iasyxZYiT) format might be preferred. I have enabled comments if you want to ask question or address any issues you uncover. For anything pressing use the issues tab. Also have a look at the [SSRN report](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3582219) for a more succinct insights. 21 | 22 | ![](assets/Tabularz.png) 23 | 24 | Data augmentation can be defined as any method that could increase the size or improve the quality of a dataset by generating new features or instances without the collection of additional data-points. Data augmentation is of particular importance in image classification tasks where additional data can be created by cropping, padding, or flipping existing images. 25 | 26 | Tabular cross-sectional and time-series prediction tasks can also benefit from augmentation. Here we divide tabular augmentation into columnular and row-wise methods. Row-wise methods are further divided into extraction and data synthesisation techniques, whereas columnular methods are divided into transformation, interaction, and mapping methods. 27 | 28 | See the [Skeleton Example](#example), for a combination of multiple methods that lead to a halfing of the mean squared error. 
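As a rough illustration of that ordering, the sketch below chains a handful of the glossary functions documented further down. It is only a sketch: it assumes `df` is the TSLA OHLCV frame (with the shifted `Close_1` target) built in the Data and Package Load section, and the synthesising step is left to external packages.

```python
from deltapy import transform, interact, mapper, extract

# (1) transforming: single-feature, column-wise transformations
df_t = transform.robust_scaler(df.copy(), drop=["Close_1"])

# (2) interacting: features built from more than one column
df_i = interact.lowess(df_t, ["Open", "Volume"], df["Close"], f=0.25, iter=3)

# (3) mapping: remap/summarise the feature space
df_m = mapper.pca_feature(df_i, variance_or_components=0.80, drop_cols=["Close_1"])

# (4) extracting: collapse a series into a single summary value
energy = extract.abs_energy(df["Close"])

# (5) synthesising: row-wise generation of new observations is left to
#     external packages and is not covered by this glossary.
```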
29 | 30 | 31 | #### Installation & Citation 32 | ---------- 33 | ``` 34 | pip install deltapy 35 | ``` 36 | 37 | ``` 38 | @software{deltapy, 39 | title = {{DeltaPy}: Tabular Data Augmentation}, 40 | author = {Snow, Derek}, 41 | url = {https://github.com/firmai/deltapy/}, 42 | version = {0.1.0}, 43 | date = {2020-04-11}, 44 | } 45 | ``` 46 | 47 | ``` 48 | Snow, Derek, DeltaPy: A Framework for Tabular Data Augmentation in Python (April 22, 2020). Available at SSRN: https://ssrn.com/abstract=3582219 49 | ``` 50 | 51 | 52 | ### Function Glossary 53 | --------------- 54 | 55 | **Transformation** 56 | ```python 57 | df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head() 58 | df_out = transform.standard_scaler(df.copy(), drop=["Close"]); df_out.head() 59 | df_out = transform.fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head() 60 | df_out = transform.windsorization(df.copy(),"Close",para,strategy='both'); df_out.head() 61 | df_out = transform.operations(df.copy(),["Close"]); df_out.head() 62 | df_out = transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0); 63 | df_out = transform.naive_dec(df.copy(), ["Close","Open"]); df_out.head() 64 | df_out = transform.bkb(df.copy(), ["Close"]); df_out.head() 65 | df_out = transform.butter_lowpass_filter(df.copy(),["Close"],4); df_out.head() 66 | df_out = transform.instantaneous_phases(df.copy(), ["Close"]); df_out.head() 67 | df_out = transform.kalman_feat(df.copy(), ["Close"]); df_out.head() 68 | df_out = transform.perd_feat(df.copy(),["Close"]); df_out.head() 69 | df_out = transform.fft_feat(df.copy(), ["Close"]); df_out.head() 70 | df_out = transform.harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head() 71 | df_out = transform.saw(df.copy(),["Close","Open"]); df_out.head() 72 | df_out = transform.modify(df.copy(),["Close"]); df_out.head() 73 | df_out = transform.prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head() 74 | ``` 75 | **Interaction** 76 | ```python 77 | df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head() 78 | df_out = interact.autoregression(df.copy()); df_out.head() 79 | df_out = interact.muldiv(df.copy(), ["Close","Open"]); df_out.head() 80 | df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head() 81 | df_out = interact.quantile_normalize(df.copy(), drop=["Close"]); df_out.head() 82 | df_out = interact.tech(df.copy()); df_out.head() 83 | df_out = interact.genetic_feat(df.copy()); df_out.head() 84 | ``` 85 | **Mapping** 86 | ```python 87 | df_out = mapper.pca_feature(df.copy(),variance_or_components=0.80,drop_cols=["Close_1"]); df_out.head() 88 | df_out = mapper.cross_lag(df.copy()); df_out.head() 89 | df_out = mapper.a_chi(df.copy()); df_out.head() 90 | df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head() 91 | df_out = mapper.lle_feat(df.copy(),["Close_1"],4); df_out.head() 92 | df_out = mapper.feature_agg(df.copy(),["Close_1"],4 ); df_out.head() 93 | df_out = mapper.neigh_feat(df.copy(),["Close_1"],4 ); df_out.head() 94 | ``` 95 | 96 | **Extraction** 97 | ```python 98 | extract.abs_energy(df["Close"]) 99 | extract.cid_ce(df["Close"], True) 100 | extract.mean_abs_change(df["Close"]) 101 | extract.mean_second_derivative_central(df["Close"]) 102 | extract.variance_larger_than_standard_deviation(df["Close"]) 103 | extract.var_index(df["Close"].values,var_index_param) 104 | extract.symmetry_looking(df["Close"]) 105 | extract.has_duplicate_max(df["Close"]) 106 | 
extract.partial_autocorrelation(df["Close"])
extract.augmented_dickey_fuller(df["Close"])
extract.gskew(df["Close"])
extract.stetson_mean(df["Close"])
extract.length(df["Close"])
extract.count_above_mean(df["Close"])
extract.longest_strike_below_mean(df["Close"])
extract.wozniak(df["Close"])
extract.last_location_of_maximum(df["Close"])
extract.fft_coefficient(df["Close"])
extract.ar_coefficient(df["Close"])
extract.index_mass_quantile(df["Close"])
extract.number_cwt_peaks(df["Close"])
extract.spkt_welch_density(df["Close"])
extract.linear_trend_timewise(df["Close"])
extract.c3(df["Close"])
extract.binned_entropy(df["Close"])
extract.svd_entropy(df["Close"].values)
extract.hjorth_complexity(df["Close"])
extract.max_langevin_fixed_point(df["Close"])
extract.percent_amplitude(df["Close"])
extract.cad_prob(df["Close"])
extract.zero_crossing_derivative(df["Close"])
extract.detrended_fluctuation_analysis(df["Close"])
extract.fisher_information(df["Close"])
extract.higuchi_fractal_dimension(df["Close"])
extract.petrosian_fractal_dimension(df["Close"])
extract.hurst_exponent(df["Close"])
extract.largest_lyauponov_exponent(df["Close"])
extract.whelch_method(df["Close"])
extract.find_freq(df["Close"])
extract.flux_perc(df["Close"])
extract.range_cum_s(df["Close"])
extract.structure_func(df["Close"])
extract.kurtosis(df["Close"])
extract.stetson_k(df["Close"])
```

Test sets should ideally not be preprocessed together with the training data, since doing so lets the model peek ahead at information from the test period during training. The preprocessing parameters should be identified on the training set and then applied to the test set, i.e., the test set should not have an impact on the transformation applied. As an example, you would learn the parameters of a PCA decomposition on the training set and then apply those parameters to both the train and the test set.

The benefit of pipelines becomes clear when one wants to apply multiple augmentation methods: they make it easy to learn the parameters once and then apply them widely. For the most part, this notebook does not concern itself with 'peeking ahead' or pipelines; for some functions, one might have to restructure the code and make use of open-source packages to create your preferred solution.
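As a minimal sketch of the fit-on-train, apply-to-both idea described above, the snippet below uses scikit-learn's `Pipeline` rather than any DeltaPy-specific API, and assumes `df` is a feature table with a `Close_1` target column.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = df.drop("Close_1", axis=1), df["Close_1"]
split = int(len(df) * 0.8)                      # time-ordered split, no shuffling
X_train, X_test = X.iloc[:split], X.iloc[split:]

pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=0.80, svd_solver="full"))])
pipe.fit(X_train)                               # parameters learned on the training set only
X_train_t = pipe.transform(X_train)             # ...then applied to both sets
X_test_t = pipe.transform(X_test)
```

The same fit-once, transform-both pattern carries over to any of the scalers and mappers documented below.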
147 | 148 | 149 | Documentation by Example 150 | ----------------- 151 | 152 | **Notebook Dependencies** 153 | 154 | 155 | ```python 156 | pip install deltapy 157 | ``` 158 | 159 | 160 | ```python 161 | pip install pykalman 162 | pip install tsaug 163 | pip install ta 164 | pip install tsaug 165 | pip install pandasvault 166 | pip install gplearn 167 | pip install ta 168 | pip install seasonal 169 | pip install pandasvault 170 | ``` 171 | 172 | ### Data and Package Load 173 | 174 | 175 | ```python 176 | import pandas as pd 177 | import numpy as np 178 | from deltapy import transform, interact, mapper, extract 179 | import warnings 180 | warnings.filterwarnings('ignore') 181 | 182 | def data_copy(): 183 | df = pd.read_csv("https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv") 184 | df["Close_1"] = df["Close"].shift(-1) 185 | df = df.dropna() 186 | df["Date"] = pd.to_datetime(df["Date"]) 187 | df = df.set_index("Date") 188 | return df 189 | df = data_copy(); df.head() 190 | ``` 191 | 192 | Some of these categories are fluid and some techniques could fit into multiple buckets. This is an attempt to find an exhaustive number of techniques, but not an exhaustive list of implementations of the techniques. For example, there are thousands of ways to smooth a time-series, but we have only includes 1-2 techniques of interest under each category. 193 | 194 | ### **(1) [Transformation:](#transformation)** 195 | ----------------- 196 | 1. Scaling/Normalisation 197 | 2. Standardisation 198 | 10. Differencing 199 | 3. Capping 200 | 13. Operations 201 | 4. Smoothing 202 | 5. Decomposing 203 | 6. Filtering 204 | 7. Spectral Analysis 205 | 8. Waveforms 206 | 9. Modifications 207 | 11. Rolling 208 | 12. Lagging 209 | 14. Forecast Model 210 | 211 | ### **(2) [Interaction:](#interaction)** 212 | ----------------- 213 | 1. Regressions 214 | 2. Operators 215 | 3. Discretising 216 | 4. Normalising 217 | 5. Distance 218 | 6. Speciality 219 | 7. Genetic 220 | 221 | ### **(3) [Mapping:](#mapping)** 222 | ----------------- 223 | 1. Eigen Decomposition 224 | 2. Cross Decomposition 225 | 3. Kernel Approximation 226 | 4. Autoencoder 227 | 5. Manifold Learning 228 | 6. Clustering 229 | 7. Neighbouring 230 | 231 | ### **(4) [Extraction:](#extraction)** 232 | ----------------- 233 | 1. Energy 234 | 2. Distance 235 | 3. Differencing 236 | 4. Derivative 237 | 5. Volatility 238 | 6. Shape 239 | 7. Occurrence 240 | 8. Autocorrelation 241 | 9. Stochasticity 242 | 10. Averages 243 | 11. Size 244 | 13. Count 245 | 14. Streaks 246 | 14. Location 247 | 15. Model Coefficients 248 | 16. Quantile 249 | 17. Peaks 250 | 18. Density 251 | 20. Linearity 252 | 20. Non-linearity 253 | 21. Entropy 254 | 22. Fixed Points 255 | 23. Amplitude 256 | 23. Probability 257 | 24. Crossings 258 | 25. Fluctuation 259 | 26. Information 260 | 27. Fractals 261 | 29. Exponent 262 | 30. Spectral Analysis 263 | 31. Percentile 264 | 32. Range 265 | 33. Structural 266 | 12. Distribution 267 | 268 | 269 | 270 | 271 | ## **(1) Transformation** 272 | 273 | Here transformation is any method that includes only one feature as an input to produce a new feature/s. Transformations can be applied to cross-section and time-series data. Some transformations are exclusive to time-series data (smoothing, filtering), but a handful of functions apply to both. 
Where a time series method has a centred mean, or is forward-looking, the output time series needs to be recalculated on a running basis to ensure that information from the future does not leak into the model. The last value of this recalculated series, or a feature extracted from it, can then be used as a running value that is only backward-looking, satisfying the no-'peeking'-ahead rule.

There are some packages in Python that dynamically create time series and extract their features, but none that incorporates the dynamic creation of a time series in combination with a wide application of a prespecified list of extractions. Because this technique is expensive, we have a preference for models that only take historical data into account.

In this section we include a list of all types of transformations: those that only use present information (operations), those that incorporate all values (interpolation methods), those that only include past values (smoothing functions), and those that incorporate a window of lagging and leading values (select filters). Only those that use historical values, or are turned into prediction methods, can be used out of the box. The entire time series can be used in the model development process for historical-value methods, while only the forecasted values can be used for prediction models.

Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. When using an interpolation method such as a cubic spline, you are taking future information into account. You can use interpolation methods to forecast into the future (extrapolation) and then use those forecasts in a training set. Or you could recalculate the interpolation for each time step and then extract features out of that series (extraction method). Interpolation and other forward-looking methods can be used if they are turned into prediction problems; the forecasted values can then be trained and tested on, and the fitted data can be disregarded. In the list presented below, the first five methods can be used for cross-section and time series data; after that, the time-series-only methods follow.

#### **(1) Scaling/Normalisation**

There are a multitude of scaling methods available. Scaling generally gets applied to the entire dataset and is especially necessary for certain algorithms. K-means makes use of Euclidean distance, hence the need for scaling, and PCA needs scaling because it tries to identify the features with maximum variance. Similarly, we need scaled features for gradient descent. Any algorithm that is not based on a distance measure is not affected by feature scaling. The methods include range scalers like the minimum-maximum scaler and the maximum absolute scaler, as well as standardisation methods like the standard scaler. The example used here is the robust scaler. Normalisation is a good technique when you don't know the distribution of the data. Scaling looks into the future, so parameters have to be trained on a training set and applied to a test set.

(i) Robust Scaler

Scaling according to the interquartile range, making it robust to outliers.
290 | 291 | 292 | ```python 293 | def robust_scaler(df, drop=None,quantile_range=(25, 75) ): 294 | if drop: 295 | keep = df[drop] 296 | df = df.drop(drop, axis=1) 297 | center = np.median(df, axis=0) 298 | quantiles = np.percentile(df, quantile_range, axis=0) 299 | scale = quantiles[1] - quantiles[0] 300 | df = (df - center) / scale 301 | if drop: 302 | df = pd.concat((keep,df),axis=1) 303 | return df 304 | 305 | df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head() 306 | ``` 307 | 308 | #### **(2) Standardisation** 309 | 310 | When using a standardisation method, it is often more effective when the attribute itself if Gaussian. It is also useful to apply the technique when the model you want to use makes assumptions of Gaussian distributions like linear regression, logistic regression, and linear discriminant analysis. For most applications, standardisation is recommended. 311 | 312 | (i) Standard Scaler 313 | 314 | Standardize features by removing the mean and scaling to unit variance 315 | 316 | 317 | ```python 318 | def standard_scaler(df,drop ): 319 | if drop: 320 | keep = df[drop] 321 | df = df.drop(drop, axis=1) 322 | mean = np.mean(df, axis=0) 323 | scale = np.std(df, axis=0) 324 | df = (df - mean) / scale 325 | if drop: 326 | df = pd.concat((keep,df),axis=1) 327 | return df 328 | 329 | 330 | df_out = transform.standard_scaler(df.copy(), drop=["Close"]); df_out.head() 331 | ``` 332 | 333 | #### **(3) Differencing** 334 | 335 | Computing the differences between consecutive observation, normally used to obtain a stationary time series. 336 | 337 | (i) Fractional Differencing 338 | 339 | Fractional differencing, allows us to achieve stationarity while maintaining the maximum amount of memory compared to integer differencing. 340 | 341 | 342 | ```python 343 | import pylab as pl 344 | 345 | def fast_fracdiff(x, cols, d): 346 | for col in cols: 347 | T = len(x[col]) 348 | np2 = int(2 ** np.ceil(np.log2(2 * T - 1))) 349 | k = np.arange(1, T) 350 | b = (1,) + tuple(np.cumprod((k - d - 1) / k)) 351 | z = (0,) * (np2 - T) 352 | z1 = b + z 353 | z2 = tuple(x[col]) + z 354 | dx = pl.ifft(pl.fft(z1) * pl.fft(z2)) 355 | x[col+"_frac"] = np.real(dx[0:T]) 356 | return x 357 | 358 | df_out = transform.fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head() 359 | ``` 360 | 361 | #### **(4) Capping** 362 | 363 | Any method that provides sets a floor and a cap to a feature's value. Capping can affect the distribution of data, so it should not be exagerated. One can cap values by using the average, by using the max and min values, or by an arbitrary extreme value. 364 | 365 | 366 | 367 | 368 | (i) Winzorisation 369 | 370 | The transformation of features by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers by replacing it with a certain percentile value. 
```python
def outlier_detect(data, col, threshold=1, method="IQR"):

    if method == "IQR":
        IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
        Lower_fence = data[col].quantile(0.25) - (IQR * threshold)
        Upper_fence = data[col].quantile(0.75) + (IQR * threshold)
    if method == "STD":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "OWN":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "MAD":
        median = data[col].median()
        median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])
        modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])
        outlier_index = np.abs(modified_z_scores) > threshold
        print('Num of outlier detected:', outlier_index.value_counts()[1])
        print('Proportion of outlier detected', outlier_index.value_counts()[1]/len(outlier_index))
        return outlier_index, (median_absolute_deviation, median_absolute_deviation)

    # fences identified above are returned so they can be reused for capping
    para = (Upper_fence, Lower_fence)
    tmp = pd.concat([data[col] > Upper_fence, data[col] < Lower_fence], axis=1)
    outlier_index = tmp.any(axis=1)
    print('Num of outlier detected:', outlier_index.value_counts()[1])
    print('Proportion of outlier detected', outlier_index.value_counts()[1]/len(outlier_index))
    return outlier_index, para

def windsorization(data, col, para, strategy='both'):
    """Cap a column at the fences identified by outlier_detect."""
    data_copy = data.copy(deep=True)
    if strategy == 'both':
        data_copy.loc[data_copy[col] > para[0], col] = para[0]
        data_copy.loc[data_copy[col] < para[1], col] = para[1]
    elif strategy == 'top':
        data_copy.loc[data_copy[col] > para[0], col] = para[0]
    elif strategy == 'bottom':
        data_copy.loc[data_copy[col] < para[1], col] = para[1]
    return data_copy

_, para = outlier_detect(df, "Close")
df_out = transform.windsorization(df.copy(), "Close", para, strategy='both'); df_out.head()
```

#### **(5) Operations**

Simple element-wise mathematical operations applied to a feature, such as squares, logs, square roots, and reciprocals.

(i) Operations

```python
def operations(df, features):
    df_new = df[features]
    df_new = df_new - df_new.min()

    sqr_name = [str(fa) + "_POWER_2" for fa in df_new.columns]
    log_p_name = [str(fa) + "_LOG_p_one_abs" for fa in df_new.columns]
    rec_p_name = [str(fa) + "_RECIP_p_one" for fa in df_new.columns]
    sqrt_name = [str(fa) + "_SQRT_p_one" for fa in df_new.columns]

    df_sqr = pd.DataFrame(np.power(df_new.values, 2), columns=sqr_name, index=df.index)
    df_log = pd.DataFrame(np.log(df_new.add(1).abs().values), columns=log_p_name, index=df.index)
    df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values), columns=rec_p_name, index=df.index)
    df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values), columns=sqrt_name, index=df.index)

    dfs = [df, df_sqr, df_log, df_rec, df_sqrt]
    return pd.concat(dfs, axis=1)

df_out = transform.operations(df.copy(), ["Close"]); df_out.head()
```

#### **(6) Smoothing**

Smoothing dampens short-term noise so that the longer-term signal stands out. Here we use Holt-Winters triple exponential smoothing, which tracks level, trend, and seasonality with three smoothing parameters (alpha, beta, gamma).

(i) Triple Exponential Smoothing

```python
def initial_trend(df, series, slen):
    total = 0.0
    for i in range(slen):
        total += float(df[series][i + slen] - df[series][i]) / slen
    return total / slen

def initial_seasonal_components(df, series, slen):
    seasonals = {}
    season_averages = []
    n_seasons = int(len(df[series]) / slen)
    # compute season averages
    for j in range(n_seasons):
        season_averages.append(sum(df[series][slen*j:slen*j+slen]) / float(slen))
    # compute initial seasonal values
    for i in range(slen):
        sum_of_vals_over_avg = 0.0
        for j in range(n_seasons):
            sum_of_vals_over_avg += df[series][slen*j+i] - season_averages[j]
        seasonals[i] = sum_of_vals_over_avg / n_seasons
    return seasonals

def triple_exponential_smoothing(df, cols, slen, alpha, beta, gamma, n_preds):
    for col in cols:
        result = []
        seasonals = initial_seasonal_components(df, col, slen)
        for i in range(len(df[col]) + n_preds):
            if i == 0:  # initial values
                smooth = df[col][0]
                trend = initial_trend(df, col, slen)
                result.append(df[col][0])
                continue
            if i >= len(df[col]):  # we are forecasting
                m = i - len(df[col]) + 1
                result.append((smooth + m*trend) + seasonals[i%slen])
            else:
                val = df[col][i]
                last_smooth, smooth = smooth, alpha*(val-seasonals[i%slen]) + (1-alpha)*(smooth+trend)
                trend = beta * (smooth-last_smooth) + (1-beta)*trend
                seasonals[i%slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i%slen]
                result.append(smooth+trend+seasonals[i%slen])
        df[col+"_TES"] = result
        #print(seasonals)
    return df

df_out = transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0); df_out.head()
```

#### **(7) Decomposing**

Decomposition procedures are used in time series to describe the trend and seasonal factors in a time series. More extensive decompositions might also include long-run cycles, holiday effects, day-of-week effects and so on. Here, we'll only consider trend and seasonal decompositions. A naive decomposition makes use of moving averages; other decomposition methods are available that make use of LOESS.

(i) Naive Decomposition

The base trend takes historical information into account and establishes moving averages; it does not have to be linear. To estimate the seasonal component for each season, simply average the detrended values for that season. If the seasonal variation looks constant, we should use the additive model; if the magnitude is increasing as a function of time, we should use the multiplicative model. Here, because it is predictive in nature, we are using a one-sided moving average, as opposed to a two-sided centred average.
```python
import statsmodels.api as sm

def naive_dec(df, columns, freq=2):
    for col in columns:
        decomposition = sm.tsa.seasonal_decompose(df[col], model='additive', freq=freq, two_sided=False)
        df[col+"_NDDT"] = decomposition.trend      # trend component
        df[col+"_NDDS"] = decomposition.seasonal   # seasonal component
        df[col+"_NDDR"] = decomposition.resid      # residual component
    return df

df_out = transform.naive_dec(df.copy(), ["Close","Open"]); df_out.head()
```

#### **(8) Filtering**

It is often useful to either low-pass filter (smooth) a time series in order to reveal low-frequency features and trends, or to high-pass filter (detrend) it in order to isolate high-frequency transients (e.g. storms). Low-pass filters use historical values; high-pass filters detrend with low-pass filters, so they also indirectly use historical values.

There are a few filters available, closely associated with decompositions and smoothing functions. The Hodrick-Prescott filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$. The Christiano-Fitzgerald filter is a generalisation of the Baxter-King filter and can be seen as a weighted moving average.

(i) Baxter-King Bandpass

The Baxter-King filter is intended to explicitly deal with the periodicity of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at frequencies higher or lower than those of the business cycle. The parameters are arbitrarily chosen. This method uses a centred moving average that has to be changed to a lagged moving average before it can be used as an input feature. The maximum period of oscillation should be used as the point to truncate the dataset, as that part of the time series does not incorporate all the required datapoints.

```python
import statsmodels.api as sm

def bkb(df, cols):
    for col in cols:
        df[col+"_BPF"] = sm.tsa.filters.bkfilter(df[[col]].values, 2, 10, len(df)-1)
    return df

df_out = transform.bkb(df.copy(), ["Close"]); df_out.head()
```

(ii) Butter Lowpass (IIR Filter Design)

The Butterworth filter is a type of signal processing filter designed to have a frequency response as flat as possible in the passband. Like other filters, the first few values have to be disregarded for accurate downstream prediction. Instead of disregarding these values on a per-case basis, they can be disregarded in one chunk once the database of transformed features has been developed.

```python
from scipy import signal, integrate

def butter_lowpass(cutoff, fs=20, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(df, cols, cutoff, fs=20, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    for col in cols:
        df[col+"_BUTTER"] = signal.lfilter(b, a, df[col])
    return df

df_out = transform.butter_lowpass_filter(df.copy(),["Close"],4); df_out.head()
```

(iii) Hilbert Transform Angle

The Hilbert transform is a time-domain to time-domain transformation which shifts the phase of a signal by 90 degrees.
It is also a centred measure and would be difficult to use in a time series prediction setting, unless it is recalculated on a per step basis or transformed to be based on historical values only. 582 | 583 | 584 | ```python 585 | from scipy import signal 586 | import numpy as np 587 | 588 | def instantaneous_phases(df,cols): 589 | for col in cols: 590 | df[col+"_HILLB"] = np.unwrap(np.angle(signal.hilbert(df[col], axis=0)), axis=0) 591 | return df 592 | 593 | df_out = transform.instantaneous_phases(df.copy(), ["Close"]); df_out.head() 594 | ``` 595 | 596 | (iiiv) Unscented Kalman Filter 597 | 598 | 599 | The Kalman filter is better suited for estimating things that change over time. The most tangible example is tracking moving objects. A Kalman filter will be very close to the actual trajectory because it says the most recent measurement is more important than the older ones. The Unscented Kalman Filter (UKF) is a model based-techniques that recursively estimates the states (and with some modifications also parameters) of a nonlinear, dynamic, discrete-time system. The UKF is based on the typical prediction-correction style methods. The Kalman Smoother incorporates future values, the Filter doesn't and can be used for online prediction. The normal Kalman filter is a forward filter in the sense that it makes forecast of the current state using only current and past observations, whereas the smoother is based on computing a suitable linear combination of two filters, which are ran in forward and backward directions. 600 | 601 | 602 | 603 | 604 | ```python 605 | from pykalman import UnscentedKalmanFilter 606 | 607 | def kalman_feat(df, cols): 608 | for col in cols: 609 | ukf = UnscentedKalmanFilter(lambda x, w: x + np.sin(w), lambda x, v: x + v, observation_covariance=0.1) 610 | (filtered_state_means, filtered_state_covariances) = ukf.filter(df[col]) 611 | (smoothed_state_means, smoothed_state_covariances) = ukf.smooth(df[col]) 612 | df[col+"_UKFSMOOTH"] = smoothed_state_means.flatten() 613 | df[col+"_UKFFILTER"] = filtered_state_means.flatten() 614 | return df 615 | 616 | df_out = transform.kalman_feat(df.copy(), ["Close"]); df_out.head() 617 | ``` 618 | 619 | #### **(9) Spectral Analysis** 620 | 621 | There are a range of functions for spectral analysis. You can use periodograms and the welch method to estimate the power spectral density. You can also use the welch method to estimate the cross power spectral density. Other techniques include spectograms, Lomb-Scargle periodograms and, short time fourier transform. 622 | 623 | (i) Periodogram 624 | 625 | This returns an array of sample frequencies and the power spectrum of x, or the power spectral density of x. 626 | 627 | 628 | ```python 629 | from scipy import signal 630 | def perd_feat(df, cols): 631 | for col in cols: 632 | sig = signal.periodogram(df[col],fs=1, return_onesided=False) 633 | df[col+"_FREQ"] = sig[0] 634 | df[col+"_POWER"] = sig[1] 635 | return df 636 | 637 | df_out = transform.perd_feat(df.copy(),["Close"]); df_out.head() 638 | ``` 639 | 640 | (ii) Fast Fourier Transform 641 | 642 | The FFT, or fast fourier transform is an algorithm that essentially uses convolution techniques to efficiently find the magnitude and location of the tones that make up the signal of interest. We can often play with the FFT spectrum, by adding and removing successive tones (which is akin to selectively filtering particular tones that make up the signal), in order to obtain a smoothed version of the underlying signal. 
This takes the entire signal into account, and as a result has to be recalculated on a running basis to avoid peaking into the future. 643 | 644 | 645 | 646 | ```python 647 | def fft_feat(df, cols): 648 | for col in cols: 649 | fft_df = np.fft.fft(np.asarray(df[col].tolist())) 650 | fft_df = pd.DataFrame({'fft':fft_df}) 651 | df[col+'_FFTABS'] = fft_df['fft'].apply(lambda x: np.abs(x)).values 652 | df[col+'_FFTANGLE'] = fft_df['fft'].apply(lambda x: np.angle(x)).values 653 | return df 654 | 655 | df_out = transform.fft_feat(df.copy(), ["Close"]); df_out.head() 656 | ``` 657 | 658 | #### **(10) Waveforms** 659 | 660 | The waveform of a signal is the shape of its graph as a function of time. 661 | 662 | (i) Continuous Wave Radar 663 | 664 | 665 | ```python 666 | from scipy import signal 667 | def harmonicradar_cw(df, cols, fs,fc): 668 | for col in cols: 669 | ttxt = f'CW: {fc} Hz' 670 | #%% input 671 | t = df[col] 672 | tx = np.sin(2*np.pi*fc*t) 673 | _,Pxx = signal.welch(tx,fs) 674 | #%% diode 675 | d = (signal.square(2*np.pi*fc*t)) 676 | d[d<0] = 0. 677 | #%% output of diode 678 | rx = tx * d 679 | df[col+"_HARRAD"] = rx.values 680 | return df 681 | 682 | df_out = transform.harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head() 683 | ``` 684 | 685 | (ii) Saw Tooth 686 | 687 | Return a periodic sawtooth or triangle waveform. 688 | 689 | 690 | ```python 691 | def saw(df, cols): 692 | for col in cols: 693 | df[col+" SAW"] = signal.sawtooth(df[col]) 694 | return df 695 | 696 | df_out = transform.saw(df.copy(),["Close","Open"]); df_out.head() 697 | ``` 698 | 699 | ##### **(9) Modifications** 700 | 701 | A range of modification usually applied ot images, these values would have to be recalculate for each time-series. 702 | 703 | (i) Various Techniques 704 | 705 | 706 | ```python 707 | from tsaug import * 708 | def modify(df, cols): 709 | for col in cols: 710 | series = df[col].values 711 | df[col+"_magnify"], _ = magnify(series, series) 712 | df[col+"_affine"], _ = affine(series, series) 713 | df[col+"_crop"], _ = crop(series, series) 714 | df[col+"_cross_sum"], _ = cross_sum(series, series) 715 | df[col+"_resample"], _ = resample(series, series) 716 | df[col+"_trend"], _ = trend(series, series) 717 | 718 | df[col+"_random_affine"], _ = random_time_warp(series, series) 719 | df[col+"_random_crop"], _ = random_crop(series, series) 720 | df[col+"_random_cross_sum"], _ = random_cross_sum(series, series) 721 | df[col+"_random_sidetrack"], _ = random_sidetrack(series, series) 722 | df[col+"_random_time_warp"], _ = random_time_warp(series, series) 723 | df[col+"_random_magnify"], _ = random_magnify(series, series) 724 | df[col+"_random_jitter"], _ = random_jitter(series, series) 725 | df[col+"_random_trend"], _ = random_trend(series, series) 726 | return df 727 | 728 | df_out = transform.modify(df.copy(),["Close"]); df_out.head() 729 | ``` 730 | 731 | #### **(11) Rolling** 732 | 733 | Features that are calculated on a rolling basis over fixed window size. 734 | 735 | (i) Mean, Standard Deviation 736 | 737 | 738 | ```python 739 | def multiple_rolling(df, windows = [1,2], functions=["mean","std"], columns=None): 740 | windows = [1+a for a in windows] 741 | if not columns: 742 | columns = df.columns.to_list() 743 | rolling_dfs = (df[columns].rolling(i) # 1. Create window 744 | .agg(functions) # 1. Aggregate 745 | .rename({col: '{0}_{1:d}'.format(col, i) 746 | for col in columns}, axis=1) # 2. 
Rename columns 747 | for i in windows) # For each window 748 | df_out = pd.concat((df, *rolling_dfs), axis=1) 749 | da = df_out.iloc[:,len(df.columns):] 750 | da = [col[0] + "_" + col[1] for col in da.columns.to_list()] 751 | df_out.columns = df.columns.to_list() + da 752 | 753 | return df_out # 3. Concatenate dataframes 754 | 755 | df_out = transform.multiple_rolling(df, columns=["Close"]); df_out.head() 756 | ``` 757 | 758 | #### **(12) Lagging** 759 | 760 | Lagged values from existing features. 761 | 762 | (i) Single Steps 763 | 764 | 765 | ```python 766 | def multiple_lags(df, start=1, end=3,columns=None): 767 | if not columns: 768 | columns = df.columns.to_list() 769 | lags = range(start, end+1) # Just two lags for demonstration. 770 | 771 | df = df.assign(**{ 772 | '{}_t_{}'.format(col, t): df[col].shift(t) 773 | for t in lags 774 | for col in columns 775 | }) 776 | return df 777 | 778 | df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head() 779 | ``` 780 | 781 | #### **(13) Forecast Model** 782 | 783 | There are a range of time series model that can be implemented like AR, MA, ARMA, ARIMA, SARIMA, SARIMAX, VAR, VARMA, VARMAX, SES, and HWES. The models can be divided into autoregressive models and smoothing models. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. Each method might requre specific tuning and parameters to suit your prediction task. You need to drop a certain amount of historical data that you use during the fitting stage. Models that take seasonality into account need more training data. 784 | 785 | 786 | (i) Prophet 787 | 788 | Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality. You can apply additive models to your training data but also interactive models like deep learning models. The problem is that because these models have learned from future observations, there would this be a need to recalculate the time series on a running basis, or to only include the predicted as opposed to fitted values in future training and test sets. In this example, I train on 150 data points to illustrate how the remaining or so 100 datapoints can be used in a new prediction problem. You can plot with ```df["PROPHET"].plot()``` to see the effect. 789 | 790 | You can apply additive models to your training data but also interactive models like deep learning models. The problem is that these models have learned from future observations, there would this be a need to recalculate the time series on a running basis, or to only include the predicted as opposed to fitted values in future training and test sets. 
791 | 792 | 793 | ```python 794 | from fbprophet import Prophet 795 | 796 | def prophet_feat(df, cols,date, freq,train_size=150): 797 | def prophet_dataframe(df): 798 | df.columns = ['ds','y'] 799 | return df 800 | 801 | def original_dataframe(df, freq, name): 802 | prophet_pred = pd.DataFrame({"Date" : df['ds'], name : df["yhat"]}) 803 | prophet_pred = prophet_pred.set_index("Date") 804 | #prophet_pred.index.freq = pd.tseries.frequencies.to_offset(freq) 805 | return prophet_pred[name].values 806 | 807 | for col in cols: 808 | model = Prophet(daily_seasonality=True) 809 | fb = model.fit(prophet_dataframe(df[[date, col]].head(train_size))) 810 | forecast_len = len(df) - train_size 811 | future = model.make_future_dataframe(periods=forecast_len,freq=freq) 812 | future_pred = model.predict(future) 813 | df[col+"_PROPHET"] = list(original_dataframe(future_pred,freq,col)) 814 | return df 815 | 816 | df_out = transform.prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head() 817 | ``` 818 | 819 | 820 | 821 | ## **(2) Interaction** 822 | 823 | Interactions are defined as methods that require more than one feature to create an additional feature. Here we include normalising and discretising techniques that are non-feature specific. Almost all of these method can be applied to cross-section method. The only methods that are time specific is the technical features in the speciality section and the autoregression model. 824 | 825 | #### **(1) Regression** 826 | 827 | Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables. 828 | 829 | (i) Lowess Smoother 830 | 831 | The lowess smoother is a robust locally weighted regression. The function fits a nonparametric regression curve to a scatterplot. 832 | 833 | 834 | ```python 835 | from math import ceil 836 | import numpy as np 837 | from scipy import linalg 838 | import math 839 | 840 | def lowess(df, cols, y, f=2. 
/ 3., iter=3): 841 | for col in cols: 842 | n = len(df[col]) 843 | r = int(ceil(f * n)) 844 | h = [np.sort(np.abs(df[col] - df[col][i]))[r] for i in range(n)] 845 | w = np.clip(np.abs((df[col][:, None] - df[col][None, :]) / h), 0.0, 1.0) 846 | w = (1 - w ** 3) ** 3 847 | yest = np.zeros(n) 848 | delta = np.ones(n) 849 | for iteration in range(iter): 850 | for i in range(n): 851 | weights = delta * w[:, i] 852 | b = np.array([np.sum(weights * y), np.sum(weights * y * df[col])]) 853 | A = np.array([[np.sum(weights), np.sum(weights * df[col])], 854 | [np.sum(weights * df[col]), np.sum(weights * df[col] * df[col])]]) 855 | beta = linalg.solve(A, b) 856 | yest[i] = beta[0] + beta[1] * df[col][i] 857 | 858 | residuals = y - yest 859 | s = np.median(np.abs(residuals)) 860 | delta = np.clip(residuals / (6.0 * s), -1, 1) 861 | delta = (1 - delta ** 2) ** 2 862 | df[col+"_LOWESS"] = yest 863 | 864 | return df 865 | 866 | df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head() 867 | ``` 868 | 869 | Autoregression 870 | 871 | Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step 872 | 873 | 874 | ```python 875 | from statsmodels.tsa.ar_model import AR 876 | from timeit import default_timer as timer 877 | def autoregression(df, drop=None, settings={"autoreg_lag":4}): 878 | 879 | autoreg_lag = settings["autoreg_lag"] 880 | if drop: 881 | keep = df[drop] 882 | df = df.drop([drop],axis=1).values 883 | 884 | n_channels = df.shape[0] 885 | t = timer() 886 | channels_regg = np.zeros((n_channels, autoreg_lag + 1)) 887 | for i in range(0, n_channels): 888 | fitted_model = AR(df.values[i, :]).fit(autoreg_lag) 889 | # TODO: This is not the same as Matlab's for some reasons! 890 | # kk = ARMAResults(fitted_model) 891 | # autore_vals, dummy1, dummy2 = arburg(x[i, :], autoreg_lag) # This looks like Matlab's but slow 892 | channels_regg[i, 0: len(fitted_model.params)] = np.real(fitted_model.params) 893 | 894 | for i in range(channels_regg.shape[1]): 895 | df["LAG_"+str(i+1)] = channels_regg[:,i] 896 | 897 | if drop: 898 | df = pd.concat((keep,df),axis=1) 899 | 900 | t = timer() - t 901 | return df 902 | 903 | df_out = interact.autoregression(df.copy()); df_out.head() 904 | ``` 905 | 906 | #### **(2) Operator** 907 | 908 | Looking at interaction between different features. Here the methods employed are multiplication and division. 909 | 910 | (i) Multiplication and Division 911 | 912 | 913 | ```python 914 | def muldiv(df, feature_list): 915 | for feat in feature_list: 916 | for feat_two in feature_list: 917 | if feat==feat_two: 918 | continue 919 | else: 920 | df[feat+"/"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min()) #zero division guard 921 | df[feat+"_X_"+feat_two] = df[feat]*(df[feat_two]) 922 | 923 | return df 924 | 925 | df_out = interact.muldiv(df.copy(), ["Close","Open"]); df_out.head() 926 | ``` 927 | 928 | #### **(3) Discretising** 929 | 930 | In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables to discretized or nominal attributes 931 | 932 | (i) Decision Tree Discretiser 933 | 934 | The first method that will be applies here is a supersived discretiser. Discretisation with Decision Trees consists of using a decision tree to identify the optimal splitting points that would determine the bins or contiguous intervals. 
935 | 936 | 937 | ```python 938 | from sklearn.tree import DecisionTreeRegressor 939 | 940 | def decision_tree_disc(df, cols, depth=4 ): 941 | for col in cols: 942 | df[col +"_m1"] = df[col].shift(1) 943 | df = df.iloc[1:,:] 944 | tree_model = DecisionTreeRegressor(max_depth=depth,random_state=0) 945 | tree_model.fit(df[col +"_m1"].to_frame(), df[col]) 946 | df[col+"_Disc"] = tree_model.predict(df[col +"_m1"].to_frame()) 947 | return df 948 | 949 | df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head() 950 | ``` 951 | 952 | #### **(4) Normalising** 953 | 954 | Normalising normally pertains to the scaling of data. There are many method available, interacting normalising methods makes use of all the feature's attributes to do the scaling. 955 | 956 | (i) Quantile Normalisation 957 | 958 | In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. 959 | 960 | 961 | ```python 962 | import numpy as np 963 | import pandas as pd 964 | 965 | def quantile_normalize(df, drop): 966 | 967 | if drop: 968 | keep = df[drop] 969 | df = df.drop(drop,axis=1) 970 | 971 | #compute rank 972 | dic = {} 973 | for col in df: 974 | dic.update({col : sorted(df[col])}) 975 | sorted_df = pd.DataFrame(dic) 976 | rank = sorted_df.mean(axis = 1).tolist() 977 | #sort 978 | for col in df: 979 | t = np.searchsorted(np.sort(df[col]), df[col]) 980 | df[col] = [rank[i] for i in t] 981 | 982 | if drop: 983 | df = pd.concat((keep,df),axis=1) 984 | return df 985 | 986 | df_out = interact.quantile_normalize(df.copy(), drop=["Close"]); df_out.head() 987 | ``` 988 | 989 | #### **(5) Distance** 990 | 991 | There are multiple types of distance functions like Euclidean, Mahalanobis, and Minkowski distance. Here we are using a contrived example in a location based haversine distance. 992 | 993 | (i) Haversine Distance 994 | 995 | The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere. 996 | 997 | 998 | ```python 999 | from math import sin, cos, sqrt, atan2, radians 1000 | def haversine_distance(row, lon="Open", lat="Close"): 1001 | c_lat,c_long = radians(52.5200), radians(13.4050) 1002 | R = 6373.0 1003 | long = radians(row['Open']) 1004 | lat = radians(row['Close']) 1005 | 1006 | dlon = long - c_long 1007 | dlat = lat - c_lat 1008 | a = sin(dlat / 2)**2 + cos(lat) * cos(c_lat) * sin(dlon / 2)**2 1009 | c = 2 * atan2(sqrt(a), sqrt(1 - a)) 1010 | 1011 | return R * c 1012 | 1013 | df_out['distance_central'] = df.apply(interact.haversine_distance,axis=1); df_out.head() 1014 | ``` 1015 | 1016 | #### **(6) Speciality** 1017 | 1018 | (i) Technical Features 1019 | 1020 | Technical indicators are heuristic or mathematical calculations based on the price, volume, or open interest of a security or contract used by traders who follow technical analysis. By analyzing historical data, technical analysts use indicators to predict future price movements. 1021 | 1022 | 1023 | ```python 1024 | import ta 1025 | 1026 | def tech(df): 1027 | return ta.add_all_ta_features(df, open="Open", high="High", low="Low", close="Close", volume="Volume") 1028 | 1029 | df_out = interact.tech(df.copy()); df_out.head() 1030 | ``` 1031 | 1032 | #### **(7) Genetic** 1033 | 1034 | Genetic programming has shown promise in constructing feature by osing original features to form high-level ones that can help algorithms achieve better performance. 
1035 | 1036 | (i) Symbolic Transformer 1037 | 1038 | 1039 | A symbolic transformer is a supervised transformer that begins by building a population of naive random formulas to represent a relationship. 1040 | 1041 | 1042 | ```python 1043 | df.head() 1044 | ``` 1045 | 1046 | 1047 | ```python 1048 | from gplearn.genetic import SymbolicTransformer 1049 | 1050 | def genetic_feat(df, num_gen=20, num_comp=10): 1051 | function_set = ['add', 'sub', 'mul', 'div', 1052 | 'sqrt', 'log', 'abs', 'neg', 'inv','tan'] 1053 | 1054 | gp = SymbolicTransformer(generations=num_gen, population_size=200, 1055 | hall_of_fame=100, n_components=num_comp, 1056 | function_set=function_set, 1057 | parsimony_coefficient=0.0005, 1058 | max_samples=0.9, verbose=1, 1059 | random_state=0, n_jobs=6) 1060 | 1061 | gen_feats = gp.fit_transform(df.drop("Close_1", axis=1), df["Close_1"]); df.iloc[:,:8] 1062 | gen_feats = pd.DataFrame(gen_feats, columns=["gen_"+str(a) for a in range(gen_feats.shape[1])]) 1063 | gen_feats.index = df.index 1064 | return pd.concat((df,gen_feats),axis=1) 1065 | 1066 | df_out = interact.genetic_feat(df.copy()); df_out.head() 1067 | ``` 1068 | 1069 | 1070 | 1071 | ## **(3) Mapping** 1072 | 1073 | Methods that help with the summarisation of features by remapping them to achieve some aim like the maximisation of variability or class separability. These methods tend to be unsupervised, but can also take an supervised form. 1074 | 1075 | #### **(1) Eigen Decomposition** 1076 | 1077 | Eigendecomposition or sometimes spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Some examples are LDA and PCA. 1078 | 1079 | (i) Principal Component Analysis 1080 | 1081 | Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. 1082 | 1083 | 1084 | ```python 1085 | def pca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,n_components=5 ,drop_cols=None, non_linear=True): 1086 | 1087 | if non_linear: 1088 | pca = KernelPCA(n_components = n_components, kernel='rbf', fit_inverse_transform=True, random_state = 33, remove_zero_eig= True) 1089 | else: 1090 | if memory_issues: 1091 | if not mem_iss_component: 1092 | raise ValueError("If you have memory issues, you have to preselect mem_iss_component") 1093 | pca = IncrementalPCA(mem_iss_component) 1094 | else: 1095 | if variance_or_components>1: 1096 | pca = PCA(n_components=variance_or_components) 1097 | else: # automated selection based on variance 1098 | pca = PCA(n_components=variance_or_components,svd_solver="full") 1099 | if drop_cols: 1100 | X_pca = pca.fit_transform(df.drop(drop_cols,axis=1)) 1101 | return pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)),axis=1) 1102 | 1103 | else: 1104 | X_pca = pca.fit_transform(df) 1105 | return pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index) 1106 | 1107 | 1108 | return df 1109 | 1110 | df_out = mapper.pca_feature(df.copy(), variance_or_components=0.9, n_components=8,non_linear=False) 1111 | ``` 1112 | 1113 | #### **(2) Cross Decomposition** 1114 | 1115 | These families of algorithms are useful to find linear relations between two multivariate datasets. 
(i) Canonical Correlation Analysis

Canonical-correlation analysis (CCA) is a way of inferring information from cross-covariance matrices.

```python
from sklearn.cross_decomposition import CCA

def cross_lag(df, drop=None, lags=1, components=4):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    df_2 = df.shift(lags)
    df = df.iloc[lags:, :]
    df_2 = df_2.dropna().reset_index(drop=True)

    cca = CCA(n_components=components)
    cca.fit(df_2, df)

    X_c, df_2 = cca.transform(df_2, df)
    df_2 = pd.DataFrame(df_2, index=df.index)
    df_2 = df_2.add_prefix('crd_')   # keep the CCA scores, prefixed

    if drop:
        df = pd.concat([keep, df, df_2], axis=1)
    else:
        df = pd.concat([df, df_2], axis=1)
    return df

df_out = mapper.cross_lag(df.copy()); df_out.head()
```

#### **(3) Kernel Approximation**

Functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines.

(i) Additive Chi2 Kernel

Computes the additive chi-squared kernel between observations in X and Y. The chi-squared kernel is computed between each pair of rows in X and Y, and X and Y have to be non-negative.

```python
from sklearn.kernel_approximation import AdditiveChi2Sampler

def a_chi(df, drop=None, lags=1, sample_steps=2):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    df_2 = df.shift(lags)
    df = df.iloc[lags:, :]
    df_2 = df_2.dropna().reset_index(drop=True)

    chi2sampler = AdditiveChi2Sampler(sample_steps=sample_steps)

    df_2 = chi2sampler.fit_transform(df_2, df["Close"])

    df_2 = pd.DataFrame(df_2, index=df.index)
    df_2 = df_2.add_prefix('achi_')   # keep the sampled kernel features, prefixed

    if drop:
        df = pd.concat([keep, df, df_2], axis=1)
    else:
        df = pd.concat([df, df_2], axis=1)
    return df

df_out = mapper.a_chi(df.copy()); df_out.head()
```

#### **(4) Autoencoder**

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore noise.
1192 | 1193 | (i) Feed Forward 1194 | 1195 | The simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to single layer perceptrons that participate in multilayer perceptrons 1196 | 1197 | 1198 | ```python 1199 | from sklearn.preprocessing import minmax_scale 1200 | import tensorflow as tf 1201 | import numpy as np 1202 | 1203 | def encoder_dataset(df, drop=None, dimesions=20): 1204 | 1205 | if drop: 1206 | train_scaled = minmax_scale(df.drop(drop,axis=1).values, axis = 0) 1207 | else: 1208 | train_scaled = minmax_scale(df.values, axis = 0) 1209 | 1210 | # define the number of encoding dimensions 1211 | encoding_dim = dimesions 1212 | # define the number of features 1213 | ncol = train_scaled.shape[1] 1214 | input_dim = tf.keras.Input(shape = (ncol, )) 1215 | 1216 | # Encoder Layers 1217 | encoded1 = tf.keras.layers.Dense(3000, activation = 'relu')(input_dim) 1218 | encoded2 = tf.keras.layers.Dense(2750, activation = 'relu')(encoded1) 1219 | encoded3 = tf.keras.layers.Dense(2500, activation = 'relu')(encoded2) 1220 | encoded4 = tf.keras.layers.Dense(750, activation = 'relu')(encoded3) 1221 | encoded5 = tf.keras.layers.Dense(500, activation = 'relu')(encoded4) 1222 | encoded6 = tf.keras.layers.Dense(250, activation = 'relu')(encoded5) 1223 | encoded7 = tf.keras.layers.Dense(encoding_dim, activation = 'relu')(encoded6) 1224 | 1225 | encoder = tf.keras.Model(inputs = input_dim, outputs = encoded7) 1226 | encoded_input = tf.keras.Input(shape = (encoding_dim, )) 1227 | 1228 | encoded_train = pd.DataFrame(encoder.predict(train_scaled),index=df.index) 1229 | encoded_train = encoded_train.add_prefix('encoded_') 1230 | if drop: 1231 | encoded_train = pd.concat((df[drop],encoded_train),axis=1) 1232 | 1233 | return encoded_train 1234 | 1235 | df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head() 1236 | ``` 1237 | 1238 | 1239 | ```python 1240 | df_out.head() 1241 | ``` 1242 | 1243 | #### **(5) Manifold Learning** 1244 | 1245 | Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. 1246 | 1247 | (i) Local Linear Embedding 1248 | 1249 | Locally Linear Embedding is a method of non-linear dimensionality reduction. It tries to reduce these n-Dimensions while trying to preserve the geometric features of the original non-linear feature structure. 1250 | 1251 | 1252 | ```python 1253 | from sklearn.manifold import LocallyLinearEmbedding 1254 | 1255 | def lle_feat(df, drop=None, components=4): 1256 | 1257 | if drop: 1258 | keep = df[drop] 1259 | df = df.drop(drop, axis=1) 1260 | 1261 | embedding = LocallyLinearEmbedding(n_components=components) 1262 | em = embedding.fit_transform(df) 1263 | df = pd.DataFrame(em,index=df.index) 1264 | df = df.add_prefix('lle_') 1265 | if drop: 1266 | df = pd.concat((keep,df),axis=1) 1267 | return df 1268 | 1269 | df_out = mapper.lle_feat(df.copy(),["Close_1"],4); df_out.head() 1270 | 1271 | ``` 1272 | 1273 | #### **(6) Clustering** 1274 | 1275 | Most clustering techniques start with a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together with some measure. Although these clustering techniques are typically used for observations, it can also be used for feature dimensionality reduction; especially hierarchical clustering techniques. 
(i) Feature Agglomeration

Feature agglomeration uses clustering to group together features that look very similar, thus decreasing the number of features.

```python
import numpy as np
from sklearn import cluster

def feature_agg(df, drop=None, components=4):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    components = min(df.shape[1]-1, components)
    agglo = cluster.FeatureAgglomeration(n_clusters=components)
    agglo.fit(df)
    df = pd.DataFrame(agglo.transform(df), index=df.index)
    df = df.add_prefix('feagg_')

    if drop:
        return pd.concat((keep, df), axis=1)
    else:
        return df

df_out = mapper.feature_agg(df.copy(),["Close_1"],4); df_out.head()
```

#### **(7) Neighbouring**

Neighbouring points can be calculated using distance metrics like Hamming, Manhattan, and Minkowski distance. The principle behind nearest-neighbour methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.

(i) Nearest Neighbours

Unsupervised learner for implementing neighbour searches.

```python
from sklearn.neighbors import NearestNeighbors

def neigh_feat(df, drop, neighbors=6):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    components = min(df.shape[0]-1, neighbors)
    neigh = NearestNeighbors(n_neighbors=components)
    neigh.fit(df)
    neigh = neigh.kneighbors()[0]   # distances to the nearest neighbours
    df = pd.DataFrame(neigh, index=df.index)
    df = df.add_prefix('neigh_')

    if drop:
        return pd.concat((keep, df), axis=1)
    else:
        return df

df_out = mapper.neigh_feat(df.copy(),["Close_1"],4); df_out.head()
```

<a name="extraction"></a>

## **(4) Extraction**

When working with extraction, you have to decide the size of the time series history to take into account when calculating a collection of walk-forward feature values. To facilitate our extraction, we use an excellent package called TSfresh, along with some of its default features. For completeness, we also include 12 or so custom features to be added to the extraction pipeline.

The *time series* methods in the transformation section and the interaction section are similar to the methods we will uncover in the extraction section; however, for transformation and interaction methods the output is an entirely new time series, whereas an extraction method takes as input multiple constructed time series and extracts a single value from each to reconstruct an entirely new time series.

Some methods naturally fit better in one format than another, e.g., lags are too expensive for extraction; time series decomposition only has to be performed once, because it has a low level of 'leakage', so it is better suited to transformation; and forecast methods attempt to predict multiple future training samples, so they won't work with extraction, which only delivers one value per time series. Furthermore, cross-sectional (non-time-series) data cannot make use of extraction, as it is solely a time-series method.
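A minimal sketch of the walk-forward idea, assuming `df` is the TSLA frame from earlier: each fixed-size window of history is collapsed into a single value with one of the extraction functions, and those values form a new, purely backward-looking series (the window length of 24 is arbitrary).

```python
from deltapy import extract

window = 24  # arbitrary history length for the walk-forward calculation

df["Close_energy_24"] = (
    df["Close"]
      .rolling(window)
      .apply(lambda s: extract.abs_energy(s), raw=False)  # one value per window
)
```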
1352 | 1353 | Lastly, when we want to double apply specific functions we can apply it as a transformation/interaction then all the extraction methods can be applied to this feature as well. For example, if we calculate a smoothing function (transformation) then all other extraction functions (median, entropy, linearity etc.) can now be applied to that smoothing function, including the application of the smoothing function itself, e.g., a double smooth, double lag, double filter etc. So separating these methods out give us great flexibility. 1354 | 1355 | Decorator 1356 | 1357 | 1358 | ```python 1359 | def set_property(key, value): 1360 | """ 1361 | This method returns a decorator that sets the property key of the function to value 1362 | """ 1363 | def decorate_func(func): 1364 | setattr(func, key, value) 1365 | if func.__doc__ and key == "fctype": 1366 | func.__doc__ = func.__doc__ + "\n\n *This function is of type: " + value + "*\n" 1367 | return func 1368 | return decorate_func 1369 | ``` 1370 | 1371 | #### **(1) Energy** 1372 | 1373 | You can calculate the linear, non-linear and absolute energy of a time series. In signal processing, the energy $E_S$ of a continuous-time signal $x(t)$ is defined as the area under the squared magnitude of the considered signal. Mathematically, $E_{s}=\langle x(t), x(t)\rangle=\int_{-\infty}^{\infty}|x(t)|^{2} d t$ 1374 | 1375 | (i) Absolute Energy 1376 | 1377 | Returns the absolute energy of the time series which is the sum over the squared values 1378 | 1379 | 1380 | ```python 1381 | #-> In Package 1382 | def abs_energy(x): 1383 | 1384 | if not isinstance(x, (np.ndarray, pd.Series)): 1385 | x = np.asarray(x) 1386 | return np.dot(x, x) 1387 | 1388 | extract.abs_energy(df["Close"]) 1389 | ``` 1390 | 1391 | #### **(2) Distance** 1392 | 1393 | Here we widely define distance measures as those that take a difference between attributes or series of datapoints. 1394 | 1395 | (i) Complexity-Invariant Distance 1396 | 1397 | This function calculator is an estimate for a time series complexity. 1398 | 1399 | 1400 | ```python 1401 | #-> In Package 1402 | def cid_ce(x, normalize): 1403 | 1404 | if not isinstance(x, (np.ndarray, pd.Series)): 1405 | x = np.asarray(x) 1406 | if normalize: 1407 | s = np.std(x) 1408 | if s!=0: 1409 | x = (x - np.mean(x))/s 1410 | else: 1411 | return 0.0 1412 | 1413 | x = np.diff(x) 1414 | return np.sqrt(np.dot(x, x)) 1415 | 1416 | extract.cid_ce(df["Close"], True) 1417 | ``` 1418 | 1419 | #### **(3) Differencing** 1420 | 1421 | Many alternatives to differencing exists, one can for example take the difference of every other value, take the squared difference, take the fractional difference, or like our example, take the mean absolute difference. 1422 | 1423 | (i) Mean Absolute Change 1424 | 1425 | Returns the mean over the absolute differences between subsequent time series values. 1426 | 1427 | 1428 | ```python 1429 | #-> In Package 1430 | def mean_abs_change(x): 1431 | return np.mean(np.abs(np.diff(x))) 1432 | 1433 | extract.mean_abs_change(df["Close"]) 1434 | ``` 1435 | 1436 | #### **(4) Derivative** 1437 | 1438 | Features where the emphasis is on the rate of change. 
1439 | 1440 | (i) Mean Central Second Derivative 1441 | 1442 | Returns the mean value of a central approximation of the second derivative 1443 | 1444 | 1445 | ```python 1446 | #-> In Package 1447 | def _roll(a, shift): 1448 | if not isinstance(a, np.ndarray): 1449 | a = np.asarray(a) 1450 | idx = shift % len(a) 1451 | return np.concatenate([a[-idx:], a[:-idx]]) 1452 | 1453 | def mean_second_derivative_central(x): 1454 | 1455 | diff = (_roll(x, 1) - 2 * np.array(x) + _roll(x, -1)) / 2.0 1456 | return np.mean(diff[1:-1]) 1457 | 1458 | extract.mean_second_derivative_central(df["Close"]) 1459 | ``` 1460 | 1461 | #### **(5) Volatility** 1462 | 1463 | Volatility is a statistical measure of the dispersion of a time-series. 1464 | 1465 | (i) Variance Larger than Standard Deviation 1466 | 1467 | 1468 | ```python 1469 | #-> In Package 1470 | def variance_larger_than_standard_deviation(x): 1471 | 1472 | y = np.var(x) 1473 | return y > np.sqrt(y) 1474 | 1475 | extract.variance_larger_than_standard_deviation(df["Close"]) 1476 | ``` 1477 | 1478 | (ii) Variability Index 1479 | 1480 | Variability Index is a way to measure how smooth or 'variable' a time series is. 1481 | 1482 | 1483 | ```python 1484 | var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values} 1485 | 1486 | @set_property("fctype", "combiner") 1487 | @set_property("custom", True) 1488 | def var_index(time,param=var_index_param): 1489 | final = [] 1490 | keys = [] 1491 | for key, magnitude in param.items(): 1492 | w = 1.0 / np.power(np.subtract(time[1:], time[:-1]), 2) 1493 | w_mean = np.mean(w) 1494 | 1495 | N = len(time) 1496 | sigma2 = np.var(magnitude) 1497 | 1498 | S1 = sum(w * (magnitude[1:] - magnitude[:-1]) ** 2) 1499 | S2 = sum(w) 1500 | 1501 | eta_e = (w_mean * np.power(time[N - 1] - 1502 | time[0], 2) * S1 / (sigma2 * S2 * N ** 2)) 1503 | final.append(eta_e) 1504 | keys.append(key) 1505 | return {"Interact__{}".format(k): eta_e for eta_e, k in zip(final,keys) } 1506 | 1507 | extract.var_index(df["Close"].values,var_index_param) 1508 | ``` 1509 | 1510 | #### **(6) Shape** 1511 | 1512 | Features that emphasises a particular shape not ordinarily considered as a distribution statistic. Extends to derivations of the original time series too For example a feature looking at the sinusoidal shape of an autocorrelation plot. 1513 | 1514 | (i) Symmetrical 1515 | 1516 | Boolean variable denoting if the distribution of x looks symmetric. 1517 | 1518 | 1519 | ```python 1520 | #-> In Package 1521 | def symmetry_looking(x, param=[{"r": 0.2}]): 1522 | 1523 | if not isinstance(x, (np.ndarray, pd.Series)): 1524 | x = np.asarray(x) 1525 | mean_median_difference = np.abs(np.mean(x) - np.median(x)) 1526 | max_min_difference = np.max(x) - np.min(x) 1527 | return [("r_{}".format(r["r"]), mean_median_difference < (r["r"] * max_min_difference)) 1528 | for r in param] 1529 | 1530 | extract.symmetry_looking(df["Close"]) 1531 | ``` 1532 | 1533 | #### **(7) Occurrence** 1534 | 1535 | Looking at the occurrence, and reoccurence of defined values. 
1536 | 1537 | (i) Has Duplicate Max 1538 | 1539 | 1540 | ```python 1541 | #-> In Package 1542 | def has_duplicate_max(x): 1543 | """ 1544 | Checks if the maximum value of x is observed more than once 1545 | 1546 | :param x: the time series to calculate the feature of 1547 | :type x: numpy.ndarray 1548 | :return: the value of this feature 1549 | :return type: bool 1550 | """ 1551 | if not isinstance(x, (np.ndarray, pd.Series)): 1552 | x = np.asarray(x) 1553 | return np.sum(x == np.max(x)) >= 2 1554 | 1555 | extract.has_duplicate_max(df["Close"]) 1556 | ``` 1557 | 1558 | #### **(8) Autocorrelation** 1559 | 1560 | Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. 1561 | 1562 | (i) Partial Autocorrelation 1563 | 1564 | Partial autocorrelation is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed. 1565 | 1566 | 1567 | ```python 1568 | #-> In Package 1569 | from statsmodels.tsa.stattools import acf, adfuller, pacf 1570 | 1571 | def partial_autocorrelation(x, param=[{"lag": 1}]): 1572 | 1573 | # Check the difference between demanded lags by param and possible lags to calculate (depends on len(x)) 1574 | max_demanded_lag = max([lag["lag"] for lag in param]) 1575 | n = len(x) 1576 | 1577 | # Check if list is too short to make calculations 1578 | if n <= 1: 1579 | pacf_coeffs = [np.nan] * (max_demanded_lag + 1) 1580 | else: 1581 | if (n <= max_demanded_lag): 1582 | max_lag = n - 1 1583 | else: 1584 | max_lag = max_demanded_lag 1585 | pacf_coeffs = list(pacf(x, method="ld", nlags=max_lag)) 1586 | pacf_coeffs = pacf_coeffs + [np.nan] * max(0, (max_demanded_lag - max_lag)) 1587 | 1588 | return [("lag_{}".format(lag["lag"]), pacf_coeffs[lag["lag"]]) for lag in param] 1589 | 1590 | extract.partial_autocorrelation(df["Close"]) 1591 | ``` 1592 | 1593 | #### **(9) Stochasticity** 1594 | 1595 | Stochastic refers to a randomly determined process. Any features trying to capture stochasticity by degree or type are included under this branch. 1596 | 1597 | (i) Augmented Dickey Fuller 1598 | 1599 | The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample. 1600 | 1601 | 1602 | ```python 1603 | #-> In Package 1604 | def augmented_dickey_fuller(x, param=[{"attr": "teststat"}]): 1605 | 1606 | res = None 1607 | try: 1608 | res = adfuller(x) 1609 | except LinAlgError: 1610 | res = np.NaN, np.NaN, np.NaN 1611 | except ValueError: # occurs if sample size is too small 1612 | res = np.NaN, np.NaN, np.NaN 1613 | except MissingDataError: # is thrown for e.g. 
inf or nan in the data 1614 | res = np.NaN, np.NaN, np.NaN 1615 | 1616 | return [('attr_"{}"'.format(config["attr"]), 1617 | res[0] if config["attr"] == "teststat" 1618 | else res[1] if config["attr"] == "pvalue" 1619 | else res[2] if config["attr"] == "usedlag" else np.NaN) 1620 | for config in param] 1621 | 1622 | extract.augmented_dickey_fuller(df["Close"]) 1623 | ``` 1624 | 1625 | #### **(10) Averages** 1626 | 1627 | (i) Median of Magnitudes Skew 1628 | 1629 | 1630 | ```python 1631 | @set_property("fctype", "simple") 1632 | @set_property("custom", True) 1633 | def gskew(x): 1634 | interpolation="nearest" 1635 | median_mag = np.median(x) 1636 | F_3_value = np.percentile(x, 3, interpolation=interpolation) 1637 | F_97_value = np.percentile(x, 97, interpolation=interpolation) 1638 | 1639 | skew = (np.median(x[x <= F_3_value]) + 1640 | np.median(x[x >= F_97_value]) - 2 * median_mag) 1641 | 1642 | return skew 1643 | 1644 | extract.gskew(df["Close"]) 1645 | ``` 1646 | 1647 | (ii) Stetson Mean 1648 | 1649 | An iteratively weighted mean used in the Stetson variability index 1650 | 1651 | 1652 | ```python 1653 | stestson_param = {"weight":100., "alpha":2., "beta":2., "tol":1.e-6, "nmax":20} 1654 | 1655 | @set_property("fctype", "combiner") 1656 | @set_property("custom", True) 1657 | def stetson_mean(x, param=stestson_param): 1658 | 1659 | weight= stestson_param["weight"] 1660 | alpha= stestson_param["alpha"] 1661 | beta = stestson_param["beta"] 1662 | tol= stestson_param["tol"] 1663 | nmax= stestson_param["nmax"] 1664 | 1665 | 1666 | mu = np.median(x) 1667 | for i in range(nmax): 1668 | resid = x - mu 1669 | resid_err = np.abs(resid) * np.sqrt(weight) 1670 | weight1 = weight / (1. + (resid_err / alpha)**beta) 1671 | weight1 /= weight1.mean() 1672 | diff = np.mean(x * weight1) - mu 1673 | mu += diff 1674 | if (np.abs(diff) < tol*np.abs(mu) or np.abs(diff) < tol): 1675 | break 1676 | 1677 | return mu 1678 | 1679 | extract.stetson_mean(df["Close"]) 1680 | ``` 1681 | 1682 | #### **(11) Size** 1683 | 1684 | (i) Lenght 1685 | 1686 | 1687 | ```python 1688 | #-> In Package 1689 | def length(x): 1690 | return len(x) 1691 | 1692 | extract.length(df["Close"]) 1693 | ``` 1694 | 1695 | #### **(12) Count** 1696 | 1697 | (i) Count Above Mean 1698 | 1699 | Returns the number of values in x that are higher than the mean of x 1700 | 1701 | 1702 | ```python 1703 | #-> In Package 1704 | def count_above_mean(x): 1705 | m = np.mean(x) 1706 | return np.where(x > m)[0].size 1707 | 1708 | extract.count_above_mean(df["Close"]) 1709 | ``` 1710 | 1711 | #### **(13) Streaks** 1712 | 1713 | (i) Longest Strike Below Mean 1714 | 1715 | Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x 1716 | 1717 | 1718 | ```python 1719 | #-> In Package 1720 | import itertools 1721 | def get_length_sequences_where(x): 1722 | 1723 | if len(x) == 0: 1724 | return [0] 1725 | else: 1726 | res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1] 1727 | return res if len(res) > 0 else [0] 1728 | 1729 | def longest_strike_below_mean(x): 1730 | 1731 | if not isinstance(x, (np.ndarray, pd.Series)): 1732 | x = np.asarray(x) 1733 | return np.max(get_length_sequences_where(x <= np.mean(x))) if x.size > 0 else 0 1734 | 1735 | extract.longest_strike_below_mean(df["Close"]) 1736 | ``` 1737 | 1738 | (ii) Wozniak 1739 | 1740 | This is an astronomical feature, we count the number of three consecutive data points that are brighter or fainter than $2σ$ and normalize the number by $N−2$ 1741 | 
1742 | 1743 | ```python 1744 | woz_param = [{"consecutiveStar": n} for n in [2, 4]] 1745 | 1746 | @set_property("fctype", "combiner") 1747 | @set_property("custom", True) 1748 | def wozniak(magnitude, param=woz_param): 1749 | 1750 | iters = [] 1751 | for consecutiveStar in [stars["consecutiveStar"] for stars in param]: 1752 | N = len(magnitude) 1753 | if N < consecutiveStar: 1754 | return 0 1755 | sigma = np.std(magnitude) 1756 | m = np.mean(magnitude) 1757 | count = 0 1758 | 1759 | for i in range(N - consecutiveStar + 1): 1760 | flag = 0 1761 | for j in range(consecutiveStar): 1762 | if(magnitude[i + j] > m + 2 * sigma or 1763 | magnitude[i + j] < m - 2 * sigma): 1764 | flag = 1 1765 | else: 1766 | flag = 0 1767 | break 1768 | if flag: 1769 | count = count + 1 1770 | iters.append(count * 1.0 / (N - consecutiveStar + 1)) 1771 | 1772 | return [("consecutiveStar_{}".format(config["consecutiveStar"]), iters[en] ) for en, config in enumerate(param)] 1773 | 1774 | extract.wozniak(df["Close"]) 1775 | ``` 1776 | 1777 | #### **(14) Location** 1778 | 1779 | (i) Last location of Maximum 1780 | 1781 | Returns the relative last location of the maximum value of x. 1782 | last_location_of_minimum(x), 1783 | 1784 | 1785 | ```python 1786 | #-> In Package 1787 | def last_location_of_maximum(x): 1788 | 1789 | x = np.asarray(x) 1790 | return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.NaN 1791 | 1792 | extract.last_location_of_maximum(df["Close"]) 1793 | ``` 1794 | 1795 | #### **(15) Model Coefficients** 1796 | 1797 | Any coefficient that are obtained from a model that might help in the prediction problem. For example here we might include coefficients of polynomial $h(x)$, which has been fitted to the deterministic dynamics of Langevin model. 1798 | 1799 | (i) FFT Coefficient 1800 | 1801 | Calculates the fourier coefficients of the one-dimensional discrete Fourier Transform for real input. 1802 | 1803 | 1804 | ```python 1805 | #-> In Package 1806 | def fft_coefficient(x, param = [{"coeff": 10, "attr": "real"}]): 1807 | 1808 | assert min([config["coeff"] for config in param]) >= 0, "Coefficients must be positive or zero." 1809 | assert set([config["attr"] for config in param]) <= set(["imag", "real", "abs", "angle"]), \ 1810 | 'Attribute must be "real", "imag", "angle" or "abs"' 1811 | 1812 | fft = np.fft.rfft(x) 1813 | 1814 | def complex_agg(x, agg): 1815 | if agg == "real": 1816 | return x.real 1817 | elif agg == "imag": 1818 | return x.imag 1819 | elif agg == "abs": 1820 | return np.abs(x) 1821 | elif agg == "angle": 1822 | return np.angle(x, deg=True) 1823 | 1824 | res = [complex_agg(fft[config["coeff"]], config["attr"]) if config["coeff"] < len(fft) 1825 | else np.NaN for config in param] 1826 | index = [('coeff_{}__attr_"{}"'.format(config["coeff"], config["attr"]),res[0]) for config in param] 1827 | return index 1828 | 1829 | extract.fft_coefficient(df["Close"]) 1830 | ``` 1831 | 1832 | (ii) AR Coefficient 1833 | 1834 | This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process. 
1835 | 1836 | 1837 | ```python 1838 | #-> In Package 1839 | from statsmodels.tsa.ar_model import AR 1840 | 1841 | def ar_coefficient(x, param=[{"coeff": 5, "k": 5}]): 1842 | 1843 | calculated_ar_params = {} 1844 | 1845 | x_as_list = list(x) 1846 | calculated_AR = AR(x_as_list) 1847 | 1848 | res = {} 1849 | 1850 | for parameter_combination in param: 1851 | k = parameter_combination["k"] 1852 | p = parameter_combination["coeff"] 1853 | 1854 | column_name = "k_{}__coeff_{}".format(k, p) 1855 | 1856 | if k not in calculated_ar_params: 1857 | try: 1858 | calculated_ar_params[k] = calculated_AR.fit(maxlag=k, solver="mle").params 1859 | except (LinAlgError, ValueError): 1860 | calculated_ar_params[k] = [np.NaN]*k 1861 | 1862 | mod = calculated_ar_params[k] 1863 | 1864 | if p <= k: 1865 | try: 1866 | res[column_name] = mod[p] 1867 | except IndexError: 1868 | res[column_name] = 0 1869 | else: 1870 | res[column_name] = np.NaN 1871 | 1872 | return [(key, value) for key, value in res.items()] 1873 | 1874 | extract.ar_coefficient(df["Close"]) 1875 | ``` 1876 | 1877 | #### **(16) Quantiles** 1878 | 1879 | This includes finding normal quantile values in the series, but also quantile derived measures like change quantiles and index max quantiles. 1880 | 1881 | (i) Index Mass Quantile 1882 | 1883 | The relative index $i$ where $q\%$ of the mass of the time series $x$ lie left of $i$ 1884 | . 1885 | 1886 | 1887 | ```python 1888 | #-> In Package 1889 | def index_mass_quantile(x, param=[{"q": 0.3}]): 1890 | 1891 | x = np.asarray(x) 1892 | abs_x = np.abs(x) 1893 | s = sum(abs_x) 1894 | 1895 | if s == 0: 1896 | # all values in x are zero or it has length 0 1897 | return [("q_{}".format(config["q"]), np.NaN) for config in param] 1898 | else: 1899 | # at least one value is not zero 1900 | mass_centralized = np.cumsum(abs_x) / s 1901 | return [("q_{}".format(config["q"]), (np.argmax(mass_centralized >= config["q"])+1)/len(x)) for config in param] 1902 | 1903 | extract.index_mass_quantile(df["Close"]) 1904 | ``` 1905 | 1906 | #### **(17) Peaks** 1907 | 1908 | (i) Number of CWT Peaks 1909 | 1910 | This feature calculator searches for different peaks in x. 1911 | 1912 | 1913 | ```python 1914 | from scipy.signal import cwt, find_peaks_cwt, ricker, welch 1915 | 1916 | cwt_param = [ka for ka in [2,6,9]] 1917 | 1918 | @set_property("fctype", "combiner") 1919 | @set_property("custom", True) 1920 | def number_cwt_peaks(x, param=cwt_param): 1921 | 1922 | return [("CWTPeak_{}".format(n), len(find_peaks_cwt(vector=x, widths=np.array(list(range(1, n + 1))), wavelet=ricker))) for n in param] 1923 | 1924 | extract.number_cwt_peaks(df["Close"]) 1925 | ``` 1926 | 1927 | #### **(18) Density** 1928 | 1929 | The density, and more specifically the power spectral density of the signal describes the power present in the signal as a function of frequency, per unit frequency. 1930 | 1931 | (i) Cross Power Spectral Density 1932 | 1933 | This feature calculator estimates the cross power spectral density of the time series $x$ at different frequencies. 
1934 | 1935 | 1936 | ```python 1937 | #-> In Package 1938 | def spkt_welch_density(x, param=[{"coeff": 5}]): 1939 | freq, pxx = welch(x, nperseg=min(len(x), 256)) 1940 | coeff = [config["coeff"] for config in param] 1941 | indices = ["coeff_{}".format(i) for i in coeff] 1942 | 1943 | if len(pxx) <= np.max(coeff): # There are fewer data points in the time series than requested coefficients 1944 | 1945 | # filter coefficients that are not contained in pxx 1946 | reduced_coeff = [coefficient for coefficient in coeff if len(pxx) > coefficient] 1947 | not_calculated_coefficients = [coefficient for coefficient in coeff 1948 | if coefficient not in reduced_coeff] 1949 | 1950 | # Fill up the rest of the requested coefficients with np.NaNs 1951 | return zip(indices, list(pxx[reduced_coeff]) + [np.NaN] * len(not_calculated_coefficients)) 1952 | else: 1953 | return pxx[coeff].ravel()[0] 1954 | 1955 | extract.spkt_welch_density(df["Close"]) 1956 | ``` 1957 | 1958 | #### **(19) Linearity** 1959 | 1960 | Any measure of linearity that might make use of something like the linear least-squares regression for the values of the time series. This can be against the time series minus one and many other alternatives. 1961 | 1962 | (i) Linear Trend Time Wise 1963 | 1964 | Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one. 1965 | 1966 | 1967 | ```python 1968 | from scipy.stats import linregress 1969 | 1970 | #-> In Package 1971 | def linear_trend_timewise(x, param= [{"attr": "pvalue"}]): 1972 | 1973 | ix = x.index 1974 | 1975 | # Get differences between each timestamp and the first timestamp in seconds. 1976 | # Then convert to hours and reshape for linear regression 1977 | times_seconds = (ix - ix[0]).total_seconds() 1978 | times_hours = np.asarray(times_seconds / float(3600)) 1979 | 1980 | linReg = linregress(times_hours, x.values) 1981 | 1982 | return [("attr_\"{}\"".format(config["attr"]), getattr(linReg, config["attr"])) 1983 | for config in param] 1984 | 1985 | extract.linear_trend_timewise(df["Close"]) 1986 | ``` 1987 | 1988 | #### **(20) Non-Linearity** 1989 | 1990 | (i) Schreiber Non-Linearity 1991 | 1992 | 1993 | ```python 1994 | #-> In Package 1995 | def c3(x, lag=3): 1996 | if not isinstance(x, (np.ndarray, pd.Series)): 1997 | x = np.asarray(x) 1998 | n = x.size 1999 | if 2 * lag >= n: 2000 | return 0 2001 | else: 2002 | return np.mean((_roll(x, 2 * -lag) * _roll(x, -lag) * x)[0:(n - 2 * lag)]) 2003 | 2004 | extract.c3(df["Close"]) 2005 | ``` 2006 | 2007 | #### **(21) Entropy** 2008 | 2009 | Any feature looking at the complexity of a time series. This is typically used in medical signal disciplines (EEG, EMG). There are multiple types of measures like spectral entropy, permutation entropy, sample entropy, approximate entropy, Lempel-Ziv complexity and other. This includes entropy measures and there derivations. 2010 | 2011 | (i) Binned Entropy 2012 | 2013 | Bins the values of x into max_bins equidistant bins. 
2014 | 2015 | 2016 | ```python 2017 | #-> In Package 2018 | def binned_entropy(x, max_bins=10): 2019 | if not isinstance(x, (np.ndarray, pd.Series)): 2020 | x = np.asarray(x) 2021 | hist, bin_edges = np.histogram(x, bins=max_bins) 2022 | probs = hist / x.size 2023 | return - np.sum(p * np.math.log(p) for p in probs if p != 0) 2024 | 2025 | extract.binned_entropy(df["Close"]) 2026 | ``` 2027 | 2028 | (ii) SVD Entropy 2029 | 2030 | SVD entropy is an indicator of the number of eigenvectors that are needed for an adequate explanation of the data set. 2031 | 2032 | 2033 | ```python 2034 | svd_param = [{"Tau": ta, "DE": de} 2035 | for ta in [4] 2036 | for de in [3,6]] 2037 | 2038 | def _embed_seq(X,Tau,D): 2039 | N =len(X) 2040 | if D * Tau > N: 2041 | print("Cannot build such a matrix, because D * Tau > N") 2042 | exit() 2043 | if Tau<1: 2044 | print("Tau has to be at least 1") 2045 | exit() 2046 | Y= np.zeros((N - (D - 1) * Tau, D)) 2047 | 2048 | for i in range(0, N - (D - 1) * Tau): 2049 | for j in range(0, D): 2050 | Y[i][j] = X[i + j * Tau] 2051 | return Y 2052 | 2053 | @set_property("fctype", "combiner") 2054 | @set_property("custom", True) 2055 | def svd_entropy(epochs, param=svd_param): 2056 | axis=0 2057 | 2058 | final = [] 2059 | for par in param: 2060 | 2061 | def svd_entropy_1d(X, Tau, DE): 2062 | Y = _embed_seq(X, Tau, DE) 2063 | W = np.linalg.svd(Y, compute_uv=0) 2064 | W /= sum(W) # normalize singular values 2065 | return -1 * np.sum(W * np.log(W)) 2066 | 2067 | Tau = par["Tau"] 2068 | DE = par["DE"] 2069 | 2070 | final.append(np.apply_along_axis(svd_entropy_1d, axis, epochs, Tau, DE).ravel()[0]) 2071 | 2072 | 2073 | return [("Tau_\"{}\"__De_{}\"".format(par["Tau"], par["DE"]), final[en]) for en, par in enumerate(param)] 2074 | 2075 | extract.svd_entropy(df["Close"].values) 2076 | ``` 2077 | 2078 | (iii) Hjort 2079 | 2080 | The Complexity parameter represents the change in frequency. The parameter compares the signal's similarity to a pure sine wave, where the value converges to 1 if the signal is more similar. 2081 | 2082 | 2083 | ```python 2084 | def _hjorth_mobility(epochs): 2085 | diff = np.diff(epochs, axis=0) 2086 | sigma0 = np.std(epochs, axis=0) 2087 | sigma1 = np.std(diff, axis=0) 2088 | return np.divide(sigma1, sigma0) 2089 | 2090 | @set_property("fctype", "simple") 2091 | @set_property("custom", True) 2092 | def hjorth_complexity(epochs): 2093 | diff1 = np.diff(epochs, axis=0) 2094 | diff2 = np.diff(diff1, axis=0) 2095 | sigma1 = np.std(diff1, axis=0) 2096 | sigma2 = np.std(diff2, axis=0) 2097 | return np.divide(np.divide(sigma2, sigma1), _hjorth_mobility(epochs)) 2098 | 2099 | extract.hjorth_complexity(df["Close"]) 2100 | ``` 2101 | 2102 | #### **(22) Fixed Points** 2103 | 2104 | Fixed points and equilibria as identified from fitted models. 
2105 | 2106 | (i) Langevin Fixed Points 2107 | 2108 | Largest fixed point of dynamics $max\ {h(x)=0}$ estimated from polynomial $h(x)$ which has been fitted to the deterministic dynamics of Langevin model 2109 | 2110 | 2111 | ```python 2112 | #-> In Package 2113 | def _estimate_friedrich_coefficients(x, m, r): 2114 | assert m > 0, "Order of polynomial need to be positive integer, found {}".format(m) 2115 | df = pd.DataFrame({'signal': x[:-1], 'delta': np.diff(x)}) 2116 | try: 2117 | df['quantiles'] = pd.qcut(df.signal, r) 2118 | except ValueError: 2119 | return [np.NaN] * (m + 1) 2120 | 2121 | quantiles = df.groupby('quantiles') 2122 | 2123 | result = pd.DataFrame({'x_mean': quantiles.signal.mean(), 'y_mean': quantiles.delta.mean()}) 2124 | result.dropna(inplace=True) 2125 | 2126 | try: 2127 | return np.polyfit(result.x_mean, result.y_mean, deg=m) 2128 | except (np.linalg.LinAlgError, ValueError): 2129 | return [np.NaN] * (m + 1) 2130 | 2131 | 2132 | def max_langevin_fixed_point(x, r=3, m=30): 2133 | coeff = _estimate_friedrich_coefficients(x, m, r) 2134 | 2135 | try: 2136 | max_fixed_point = np.max(np.real(np.roots(coeff))) 2137 | except (np.linalg.LinAlgError, ValueError): 2138 | return np.nan 2139 | 2140 | return max_fixed_point 2141 | 2142 | extract.max_langevin_fixed_point(df["Close"]) 2143 | ``` 2144 | 2145 | #### **(23) Amplitude** 2146 | 2147 | Features derived from peaked values in either the positive or negative direction. 2148 | 2149 | (i) Willison Amplitude 2150 | 2151 | This feature is defined as the amount of times that the change in the signal amplitude exceeds a threshold. 2152 | 2153 | 2154 | ```python 2155 | will_param = [ka for ka in [0.2,3]] 2156 | 2157 | @set_property("fctype", "combiner") 2158 | @set_property("custom", True) 2159 | def willison_amplitude(X, param=will_param): 2160 | return [("Thresh_{}".format(n),np.sum(np.abs(np.diff(X)) >= n)) for n in param] 2161 | 2162 | extract.willison_amplitude(df["Close"]) 2163 | ``` 2164 | 2165 | (ii) Percent Amplitude 2166 | 2167 | Returns the largest distance from the median value, measured 2168 | as a percentage of the median 2169 | 2170 | 2171 | ```python 2172 | perc_param = [{"base":ba, "exponent":exp} for ba in [3,5] for exp in [-0.1,-0.2]] 2173 | 2174 | @set_property("fctype", "combiner") 2175 | @set_property("custom", True) 2176 | def percent_amplitude(x, param =perc_param): 2177 | final = [] 2178 | for par in param: 2179 | linear_scale_data = par["base"] ** (par["exponent"] * x) 2180 | y_max = np.max(linear_scale_data) 2181 | y_min = np.min(linear_scale_data) 2182 | y_med = np.median(linear_scale_data) 2183 | final.append(max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med))) 2184 | 2185 | return [("Base_{}__Exp{}".format(pa["base"],pa["exponent"]),fin) for fin, pa in zip(final,param)] 2186 | 2187 | extract.percent_amplitude(df["Close"]) 2188 | ``` 2189 | 2190 | #### **(24) Probability** 2191 | 2192 | (i) Cadence Probability 2193 | 2194 | Given the observed distribution of time lags cads, compute the probability that the next observation occurs within time minutes of an arbitrary epoch. 
2195 | 2196 | 2197 | ```python 2198 | #-> fixes required 2199 | import scipy.stats as stats 2200 | 2201 | cad_param = [0.1,1000, -234] 2202 | 2203 | @set_property("fctype", "combiner") 2204 | @set_property("custom", True) 2205 | def cad_prob(cads, param=cad_param): 2206 | return [("time_{}".format(time), stats.percentileofscore(cads, float(time) / (24.0 * 60.0)) / 100.0) for time in param] 2207 | 2208 | extract.cad_prob(df["Close"]) 2209 | ``` 2210 | 2211 | #### **(25) Crossings** 2212 | 2213 | Calculates the crossing of the series with other defined values or series. 2214 | 2215 | (i) Zero Crossing Derivative 2216 | 2217 | The positioning of the edge point is located at the zero crossing of the first derivative of the filter. 2218 | 2219 | 2220 | ```python 2221 | zero_param = [0.01, 8] 2222 | 2223 | @set_property("fctype", "combiner") 2224 | @set_property("custom", True) 2225 | def zero_crossing_derivative(epochs, param=zero_param): 2226 | diff = np.diff(epochs) 2227 | norm = diff-diff.mean() 2228 | return [("e_{}".format(e), np.apply_along_axis(lambda epoch: np.sum(((epoch[:-5] <= e) & (epoch[5:] > e))), 0, norm).ravel()[0]) for e in param] 2229 | 2230 | extract.zero_crossing_derivative(df["Close"]) 2231 | ``` 2232 | 2233 | #### **(26) Fluctuations** 2234 | 2235 | These features are again from medical signal sciences, but under this category we would include values such as fluctuation based entropy measures, fluctuation of correlation dynamics, and co-fluctuations. 2236 | 2237 | (i) Detrended Fluctuation Analysis (DFA) 2238 | 2239 | DFA Calculate the Hurst exponent using DFA analysis. 2240 | 2241 | 2242 | ```python 2243 | from scipy.stats import kurtosis as _kurt 2244 | from scipy.stats import skew as _skew 2245 | import numpy as np 2246 | 2247 | @set_property("fctype", "simple") 2248 | @set_property("custom", True) 2249 | def detrended_fluctuation_analysis(epochs): 2250 | def dfa_1d(X, Ave=None, L=None): 2251 | X = np.array(X) 2252 | 2253 | if Ave is None: 2254 | Ave = np.mean(X) 2255 | 2256 | Y = np.cumsum(X) 2257 | Y -= Ave 2258 | 2259 | if L is None: 2260 | L = np.floor(len(X) * 1 / ( 2261 | 2 ** np.array(list(range(1, int(np.log2(len(X))) - 4)))) 2262 | ) 2263 | 2264 | F = np.zeros(len(L)) # F(n) of different given box length n 2265 | 2266 | for i in range(0, len(L)): 2267 | n = int(L[i]) # for each box length L[i] 2268 | if n == 0: 2269 | print("time series is too short while the box length is too big") 2270 | print("abort") 2271 | exit() 2272 | for j in range(0, len(X), n): # for each box 2273 | if j + n < len(X): 2274 | c = list(range(j, j + n)) 2275 | # coordinates of time in the box 2276 | c = np.vstack([c, np.ones(n)]).T 2277 | # the value of data in the box 2278 | y = Y[j:j + n] 2279 | # add residue in this box 2280 | F[i] += np.linalg.lstsq(c, y, rcond=None)[1] 2281 | F[i] /= ((len(X) / n) * n) 2282 | F = np.sqrt(F) 2283 | 2284 | stacked = np.vstack([np.log(L), np.ones(len(L))]) 2285 | stacked_t = stacked.T 2286 | Alpha = np.linalg.lstsq(stacked_t, np.log(F), rcond=None) 2287 | 2288 | return Alpha[0][0] 2289 | 2290 | return np.apply_along_axis(dfa_1d, 0, epochs).ravel()[0] 2291 | 2292 | extract.detrended_fluctuation_analysis(df["Close"]) 2293 | ``` 2294 | 2295 | #### **(27) Information** 2296 | 2297 | Closely related to entropy and complexity measures. Any measure that attempts to measure the amount of information from an observable variable is included here. 
2298 | 2299 | (i) Fisher Information 2300 | 2301 | Fisher information is a statistical information concept distinct from, and earlier than, Shannon information in communication theory. 2302 | 2303 | 2304 | ```python 2305 | def _embed_seq(X, Tau, D): 2306 | 2307 | shape = (X.size - Tau * (D - 1), D) 2308 | strides = (X.itemsize, Tau * X.itemsize) 2309 | return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides) 2310 | 2311 | fisher_param = [{"Tau":ta, "DE":de} for ta in [3,15] for de in [10,5]] 2312 | 2313 | @set_property("fctype", "combiner") 2314 | @set_property("custom", True) 2315 | def fisher_information(epochs, param=fisher_param): 2316 | def fisher_info_1d(a, tau, de): 2317 | # taken from pyeeg improvements 2318 | 2319 | mat = _embed_seq(a, tau, de) 2320 | W = np.linalg.svd(mat, compute_uv=False) 2321 | W /= sum(W) # normalize singular values 2322 | FI_v = (W[1:] - W[:-1]) ** 2 / W[:-1] 2323 | return np.sum(FI_v) 2324 | 2325 | return [("Tau_{}__DE_{}".format(par["Tau"], par["DE"]),np.apply_along_axis(fisher_info_1d, 0, epochs, par["Tau"], par["DE"]).ravel()[0]) for par in param] 2326 | 2327 | extract.fisher_information(df["Close"]) 2328 | ``` 2329 | 2330 | #### **(28) Fractals** 2331 | 2332 | In mathematics, more specifically in fractal geometry, a fractal dimension is a ratio providing a statistical index of complexity comparing how detail in a pattern (strictly speaking, a fractal pattern) changes with the scale at which it is measured. 2333 | 2334 | (i) Highuchi Fractal 2335 | 2336 | Compute a Higuchi Fractal Dimension of a time series 2337 | 2338 | 2339 | ```python 2340 | hig_para = [{"Kmax": 3},{"Kmax": 5}] 2341 | 2342 | @set_property("fctype", "combiner") 2343 | @set_property("custom", True) 2344 | def higuchi_fractal_dimension(epochs, param=hig_para): 2345 | def hfd_1d(X, Kmax): 2346 | 2347 | L = [] 2348 | x = [] 2349 | N = len(X) 2350 | for k in range(1, Kmax): 2351 | Lk = [] 2352 | for m in range(0, k): 2353 | Lmk = 0 2354 | for i in range(1, int(np.floor((N - m) / k))): 2355 | Lmk += abs(X[m + i * k] - X[m + i * k - k]) 2356 | Lmk = Lmk * (N - 1) / np.floor((N - m) / float(k)) / k 2357 | Lk.append(Lmk) 2358 | L.append(np.log(np.mean(Lk))) 2359 | x.append([np.log(float(1) / k), 1]) 2360 | 2361 | (p, r1, r2, s) = np.linalg.lstsq(x, L, rcond=None) 2362 | return p[0] 2363 | 2364 | return [("Kmax_{}".format(config["Kmax"]), np.apply_along_axis(hfd_1d, 0, epochs, config["Kmax"]).ravel()[0] ) for config in param] 2365 | 2366 | extract.higuchi_fractal_dimension(df["Close"]) 2367 | ``` 2368 | 2369 | (ii) Petrosian Fractal 2370 | 2371 | Compute a Petrosian Fractal Dimension of a time series. 2372 | 2373 | 2374 | ```python 2375 | @set_property("fctype", "simple") 2376 | @set_property("custom", True) 2377 | def petrosian_fractal_dimension(epochs): 2378 | def pfd_1d(X, D=None): 2379 | # taken from pyeeg 2380 | """Compute Petrosian Fractal Dimension of a time series from either two 2381 | cases below: 2382 | 1. X, the time series of type list (default) 2383 | 2. D, the first order differential sequence of X (if D is provided, 2384 | recommended to speed up) 2385 | In case 1, D is computed using Numpy's difference function. 2386 | To speed up, it is recommended to compute D before calling this function 2387 | because D may also be used by other functions whereas computing it here 2388 | again will slow down. 
2389 | """ 2390 | if D is None: 2391 | D = np.diff(X) 2392 | D = D.tolist() 2393 | N_delta = 0 # number of sign changes in derivative of the signal 2394 | for i in range(1, len(D)): 2395 | if D[i] * D[i - 1] < 0: 2396 | N_delta += 1 2397 | n = len(X) 2398 | return np.log10(n) / (np.log10(n) + np.log10(n / n + 0.4 * N_delta)) 2399 | return np.apply_along_axis(pfd_1d, 0, epochs).ravel()[0] 2400 | 2401 | extract.petrosian_fractal_dimension(df["Close"]) 2402 | ``` 2403 | 2404 | #### **(29) Exponent** 2405 | 2406 | (i) Hurst Exponent 2407 | 2408 | The Hurst exponent is used as a measure of long-term memory of time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases. 2409 | 2410 | 2411 | ```python 2412 | @set_property("fctype", "simple") 2413 | @set_property("custom", True) 2414 | def hurst_exponent(epochs): 2415 | def hurst_1d(X): 2416 | 2417 | X = np.array(X) 2418 | N = X.size 2419 | T = np.arange(1, N + 1) 2420 | Y = np.cumsum(X) 2421 | Ave_T = Y / T 2422 | 2423 | S_T = np.zeros(N) 2424 | R_T = np.zeros(N) 2425 | for i in range(N): 2426 | S_T[i] = np.std(X[:i + 1]) 2427 | X_T = Y - T * Ave_T[i] 2428 | R_T[i] = np.ptp(X_T[:i + 1]) 2429 | 2430 | for i in range(1, len(S_T)): 2431 | if np.diff(S_T)[i - 1] != 0: 2432 | break 2433 | for j in range(1, len(R_T)): 2434 | if np.diff(R_T)[j - 1] != 0: 2435 | break 2436 | k = max(i, j) 2437 | assert k < 10, "rethink it!" 2438 | 2439 | R_S = R_T[k:] / S_T[k:] 2440 | R_S = np.log(R_S) 2441 | 2442 | n = np.log(T)[k:] 2443 | A = np.column_stack((n, np.ones(n.size))) 2444 | [m, c] = np.linalg.lstsq(A, R_S, rcond=None)[0] 2445 | H = m 2446 | return H 2447 | return np.apply_along_axis(hurst_1d, 0, epochs).ravel()[0] 2448 | 2449 | extract.hurst_exponent(df["Close"]) 2450 | ``` 2451 | 2452 | (ii) Largest Lyauponov Exponent 2453 | 2454 | In mathematics the Lyapunov exponent or Lyapunov characteristic exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories. 
2455 | 2456 | 2457 | ```python 2458 | def _embed_seq(X, Tau, D): 2459 | shape = (X.size - Tau * (D - 1), D) 2460 | strides = (X.itemsize, Tau * X.itemsize) 2461 | return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides) 2462 | 2463 | lyaup_param = [{"Tau":4, "n":3, "T":10, "fs":9},{"Tau":8, "n":7, "T":15, "fs":6}] 2464 | 2465 | @set_property("fctype", "combiner") 2466 | @set_property("custom", True) 2467 | def largest_lyauponov_exponent(epochs, param=lyaup_param): 2468 | def LLE_1d(x, tau, n, T, fs): 2469 | 2470 | Em = _embed_seq(x, tau, n) 2471 | M = len(Em) 2472 | A = np.tile(Em, (len(Em), 1, 1)) 2473 | B = np.transpose(A, [1, 0, 2]) 2474 | square_dists = (A - B) ** 2 # square_dists[i,j,k] = (Em[i][k]-Em[j][k])^2 2475 | D = np.sqrt(square_dists[:, :, :].sum(axis=2)) # D[i,j] = ||Em[i]-Em[j]||_2 2476 | 2477 | # Exclude elements within T of the diagonal 2478 | band = np.tri(D.shape[0], k=T) - np.tri(D.shape[0], k=-T - 1) 2479 | band[band == 1] = np.inf 2480 | neighbors = (D + band).argmin(axis=0) # nearest neighbors more than T steps away 2481 | 2482 | # in_bounds[i,j] = (i+j <= M-1 and i+neighbors[j] <= M-1) 2483 | inc = np.tile(np.arange(M), (M, 1)) 2484 | row_inds = (np.tile(np.arange(M), (M, 1)).T + inc) 2485 | col_inds = (np.tile(neighbors, (M, 1)) + inc.T) 2486 | in_bounds = np.logical_and(row_inds <= M - 1, col_inds <= M - 1) 2487 | # Uncomment for old (miscounted) version 2488 | # in_bounds = numpy.logical_and(row_inds < M - 1, col_inds < M - 1) 2489 | row_inds[~in_bounds] = 0 2490 | col_inds[~in_bounds] = 0 2491 | 2492 | # neighbor_dists[i,j] = ||Em[i+j]-Em[i+neighbors[j]]||_2 2493 | neighbor_dists = np.ma.MaskedArray(D[row_inds, col_inds], ~in_bounds) 2494 | J = (~neighbor_dists.mask).sum(axis=1) # number of in-bounds indices by row 2495 | # Set invalid (zero) values to 1; log(1) = 0 so sum is unchanged 2496 | 2497 | neighbor_dists[neighbor_dists == 0] = 1 2498 | 2499 | # !!! this fixes the divide by zero in log error !!! 2500 | neighbor_dists.data[neighbor_dists.data == 0] = 1 2501 | 2502 | d_ij = np.sum(np.log(neighbor_dists.data), axis=1) 2503 | mean_d = d_ij[J > 0] / J[J > 0] 2504 | 2505 | x = np.arange(len(mean_d)) 2506 | X = np.vstack((x, np.ones(len(mean_d)))).T 2507 | [m, c] = np.linalg.lstsq(X, mean_d, rcond=None)[0] 2508 | Lexp = fs * m 2509 | return Lexp 2510 | 2511 | return [("Tau_{}__n_{}__T_{}__fs_{}".format(par["Tau"], par["n"], par["T"], par["fs"]), np.apply_along_axis(LLE_1d, 0, epochs, par["Tau"], par["n"], par["T"], par["fs"]).ravel()[0]) for par in param] 2512 | 2513 | extract.largest_lyauponov_exponent(df["Close"]) 2514 | ``` 2515 | 2516 | #### **(30) Spectral Analysis** 2517 | 2518 | Spectral analysis is analysis in terms of a spectrum of frequencies or related quantities such as energies, eigenvalues, etc. 2519 | 2520 | (i) Whelch Method 2521 | 2522 | The Whelch Method is an approach for spectral density estimation. It is used in physics, engineering, and applied mathematics for estimating the power of a signal at different frequencies. 
2523 | 2524 | 2525 | ```python 2526 | from scipy import signal, integrate 2527 | 2528 | whelch_param = [100,200] 2529 | 2530 | @set_property("fctype", "combiner") 2531 | @set_property("custom", True) 2532 | def whelch_method(data, param=whelch_param): 2533 | 2534 | final = [] 2535 | for Fs in param: 2536 | f, pxx = signal.welch(data, fs=Fs, nperseg=1024) 2537 | d = {'psd': pxx, 'freqs': f} 2538 | df = pd.DataFrame(data=d) 2539 | dfs = df.sort_values(['psd'], ascending=False) 2540 | rows = dfs.iloc[:10] 2541 | final.append(rows['freqs'].mean()) 2542 | 2543 | return [("Fs_{}".format(pa),fin) for pa, fin in zip(param,final)] 2544 | 2545 | extract.whelch_method(df["Close"]) 2546 | ``` 2547 | 2548 | 2549 | ```python 2550 | #-> Basically same as above 2551 | freq_param = [{"fs":50, "sel":15},{"fs":200, "sel":20}] 2552 | 2553 | @set_property("fctype", "combiner") 2554 | @set_property("custom", True) 2555 | def find_freq(serie, param=freq_param): 2556 | 2557 | final = [] 2558 | for par in param: 2559 | fft0 = np.fft.rfft(serie*np.hanning(len(serie))) 2560 | freqs = np.fft.rfftfreq(len(serie), d=1.0/par["fs"]) 2561 | fftmod = np.array([np.sqrt(fft0[i].real**2 + fft0[i].imag**2) for i in range(0, len(fft0))]) 2562 | d = {'fft': fftmod, 'freq': freqs} 2563 | df = pd.DataFrame(d) 2564 | hop = df.sort_values(['fft'], ascending=False) 2565 | rows = hop.iloc[:par["sel"]] 2566 | final.append(rows['freq'].mean()) 2567 | 2568 | return [("Fs_{}__sel{}".format(pa["fs"],pa["sel"]),fin) for pa, fin in zip(param,final)] 2569 | 2570 | extract.find_freq(df["Close"]) 2571 | ``` 2572 | 2573 | #### **(31) Percentile** 2574 | 2575 | (i) Flux Percentile 2576 | 2577 | Flux (or radiant flux) is the total amount of energy that crosses a unit area per unit time. Flux is an astronomical value, measured in joules per square metre per second (joules/m2/s), or watts per square metre. Here we provide the ratio of flux percentiles. 2578 | 2579 | 2580 | ```python 2581 | #-> In Package 2582 | 2583 | import math 2584 | def flux_perc(magnitude): 2585 | sorted_data = np.sort(magnitude) 2586 | lc_length = len(sorted_data) 2587 | 2588 | F_60_index = int(math.ceil(0.60 * lc_length)) 2589 | F_40_index = int(math.ceil(0.40 * lc_length)) 2590 | F_5_index = int(math.ceil(0.05 * lc_length)) 2591 | F_95_index = int(math.ceil(0.95 * lc_length)) 2592 | 2593 | F_40_60 = sorted_data[F_60_index] - sorted_data[F_40_index] 2594 | F_5_95 = sorted_data[F_95_index] - sorted_data[F_5_index] 2595 | F_mid20 = F_40_60 / F_5_95 2596 | 2597 | return {"FluxPercentileRatioMid20": F_mid20} 2598 | 2599 | extract.flux_perc(df["Close"]) 2600 | ``` 2601 | 2602 | #### **(32) Range** 2603 | 2604 | (i) Range of Cummulative Sum 2605 | 2606 | 2607 | ```python 2608 | @set_property("fctype", "simple") 2609 | @set_property("custom", True) 2610 | def range_cum_s(magnitude): 2611 | sigma = np.std(magnitude) 2612 | N = len(magnitude) 2613 | m = np.mean(magnitude) 2614 | s = np.cumsum(magnitude - m) * 1.0 / (N * sigma) 2615 | R = np.max(s) - np.min(s) 2616 | return {"Rcs": R} 2617 | 2618 | extract.range_cum_s(df["Close"]) 2619 | ``` 2620 | 2621 | #### **(33) Structural** 2622 | 2623 | Structural features, potential placeholders for future research. 2624 | 2625 | (i) Structure Function 2626 | 2627 | The structure function of rotation measures (RMs) contains information on electron density and magnetic field fluctuations when used i astronomy. It becomes a custom feature when used with your own unique time series data. 
2628 | 2629 | 2630 | ```python 2631 | from scipy.interpolate import interp1d 2632 | 2633 | struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values} 2634 | 2635 | @set_property("fctype", "combiner") 2636 | @set_property("custom", True) 2637 | def structure_func(time, param=struct_param): 2638 | 2639 | dict_final = {} 2640 | for key, magnitude in param.items(): 2641 | dict_final[key] = [] 2642 | Nsf, Np = 100, 100 2643 | sf1, sf2, sf3 = np.zeros(Nsf), np.zeros(Nsf), np.zeros(Nsf) 2644 | f = interp1d(time, magnitude) 2645 | 2646 | time_int = np.linspace(np.min(time), np.max(time), Np) 2647 | mag_int = f(time_int) 2648 | 2649 | for tau in np.arange(1, Nsf): 2650 | sf1[tau - 1] = np.mean( 2651 | np.power(np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 1.0)) 2652 | sf2[tau - 1] = np.mean( 2653 | np.abs(np.power( 2654 | np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 2.0))) 2655 | sf3[tau - 1] = np.mean( 2656 | np.abs(np.power( 2657 | np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 3.0))) 2658 | sf1_log = np.log10(np.trim_zeros(sf1)) 2659 | sf2_log = np.log10(np.trim_zeros(sf2)) 2660 | sf3_log = np.log10(np.trim_zeros(sf3)) 2661 | 2662 | if len(sf1_log) and len(sf2_log): 2663 | m_21, b_21 = np.polyfit(sf1_log, sf2_log, 1) 2664 | else: 2665 | 2666 | m_21 = np.nan 2667 | 2668 | if len(sf1_log) and len(sf3_log): 2669 | m_31, b_31 = np.polyfit(sf1_log, sf3_log, 1) 2670 | else: 2671 | 2672 | m_31 = np.nan 2673 | 2674 | if len(sf2_log) and len(sf3_log): 2675 | m_32, b_32 = np.polyfit(sf2_log, sf3_log, 1) 2676 | else: 2677 | 2678 | m_32 = np.nan 2679 | dict_final[key].append(m_21) 2680 | dict_final[key].append(m_31) 2681 | dict_final[key].append(m_32) 2682 | 2683 | return [("StructureFunction_{}__m_{}".format(key, name), li) for key, lis in dict_final.items() for name, li in zip([21,31,32], lis)] 2684 | 2685 | struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values} 2686 | 2687 | extract.structure_func(df["Close"],struct_param) 2688 | ``` 2689 | 2690 | #### **(34) Distribution** 2691 | 2692 | (i) Kurtosis 2693 | 2694 | 2695 | ```python 2696 | #-> In Package 2697 | def kurtosis(x): 2698 | 2699 | if not isinstance(x, pd.Series): 2700 | x = pd.Series(x) 2701 | return pd.Series.kurtosis(x) 2702 | 2703 | extract.kurtosis(df["Close"]) 2704 | ``` 2705 | 2706 | (ii) Stetson Kurtosis 2707 | 2708 | 2709 | ```python 2710 | @set_property("fctype", "simple") 2711 | @set_property("custom", True) 2712 | def stetson_k(x): 2713 | """A robust kurtosis statistic.""" 2714 | n = len(x) 2715 | x0 = stetson_mean(x, 1./20**2) 2716 | delta_x = np.sqrt(n / (n - 1.)) * (x - x0) / 20 2717 | ta = 1. / 0.798 * np.mean(np.abs(delta_x)) / np.sqrt(np.mean(delta_x**2)) 2718 | return ta 2719 | 2720 | extract.stetson_k(df["Close"]) 2721 | ``` 2722 | 2723 | ## **(5) Synthesise** 2724 | 2725 | Time-Series synthesisation (TSS) happens before the feature extraction step and Cross Sectional Synthesisation (CSS) happens after the feature extraction step. Currently I will only include a CSS package, in the future, I would further work on developing out this section. This area still has a lot of performance and stability issues. In the future it might be a more viable candidate to improve prediction. 
2726 | 2727 | 2728 | ```python 2729 | from lightgbm import LGBMRegressor 2730 | from sklearn.metrics import mean_squared_error 2731 | 2732 | def model(df_final): 2733 | model = LGBMRegressor() 2734 | test = df_final.head(int(len(df_final)*0.4)) 2735 | train = df_final[~df_final.isin(test)].dropna() 2736 | model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"]) 2737 | preds = model.predict(test.drop(["Close_1"],axis=1)) 2738 | test = df_final.head(int(len(df_final)*0.4)) 2739 | train = df_final[~df_final.isin(test)].dropna() 2740 | model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"]) 2741 | val = mean_squared_error(test["Close_1"],preds); 2742 | return val 2743 | ``` 2744 | 2745 | 2746 | ```python 2747 | pip install ctgan 2748 | ``` 2749 | 2750 | 2751 | ```python 2752 | from ctgan import CTGANSynthesizer 2753 | 2754 | #discrete_columns = [""] 2755 | ctgan = CTGANSynthesizer() 2756 | ctgan.fit(df,epochs=10) #15 2757 | ``` 2758 | 2759 | Random Benchmark 2760 | 2761 | 2762 | ```python 2763 | np.random.seed(1) 2764 | df_in = df.copy() 2765 | df_in["Close_1"] = np.random.permutation(df_in["Close_1"].values) 2766 | model(df_in) 2767 | ``` 2768 | 2769 | Generated Performance 2770 | 2771 | 2772 | ```python 2773 | df_gen = ctgan.sample(len(df_in)*100) 2774 | model(df_gen) 2775 | ``` 2776 | 2777 | As expected a cross-sectional technique, does not work well on time-series data, in the future, other methods will be investigated. 2778 | 2779 | 2780 | 2781 | ## **(6) Skeleton Example** 2782 | 2783 | Here I will perform tabular agumenting methods on a small dataset single digit features and around 250 instances. This is not necessarily the best sized dataset to highlight the performance of tabular augmentation as some method like extraction would be overkill as it would lead to dimensionality problems. It is also good to know that there are close to infinite number of ways to perform these augmentation methods. In the future, automated augmentation methods can guide the experiment process. 2784 | 2785 | The approach taken in this skeleton is to develop running models that are tested after each augmentation to highlight what methods might work well on this particular dataset. The metric we will use is mean squared error. In this implementation we do not have special hold-out sets. 2786 | 2787 | The above framework of implementation will be consulted, but one still have to be strategic as to when you apply what function, and you have to make sure that you are processing your data with appropriate techniques (drop null values, fill null values) at the appropriate time. 
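Since every augmentation step below is followed by a call to the metric, a small helper can make that bookkeeping less error-prone. The snippet below is only a sketch: `try_step` is a hypothetical convenience function, and it assumes the `model()` metric defined under Validation below as well as the usual clean-up of infinities and missing values at the appropriate time.


```python
import numpy as np

def try_step(df_in, func, *args, **kwargs):
    """Hypothetical helper: apply one augmentation step and report the metric.

    Assumes the model() metric defined under Validation below, and that
    func returns an augmented copy of the dataframe it receives.
    """
    before = model(df_in)
    df_aug = func(df_in.copy(), *args, **kwargs)
    # clean up the obvious artefacts so the metric stays comparable
    df_aug = df_aug.replace([np.inf, -np.inf], np.nan).ffill().dropna()
    after = model(df_aug)
    print("MSE before: {:.2f}, after: {:.2f}".format(before, after))
    return df_aug

# e.g. df_out = try_step(df, transform.naive_dec, ["Close", "Open"], freq=5)
```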
2788 | 2789 | #### **Validation** 2790 | 2791 | Develop Model and Define Metric 2792 | 2793 | 2794 | ```python 2795 | from lightgbm import LGBMRegressor 2796 | from sklearn.metrics import mean_squared_error 2797 | 2798 | def model(df_final): 2799 | model = LGBMRegressor() 2800 | test = df_final.head(int(len(df_final)*0.4)) 2801 | train = df_final[~df_final.isin(test)].dropna() 2802 | model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"]) 2803 | preds = model.predict(test.drop(["Close_1"],axis=1)) 2804 | test = df_final.head(int(len(df_final)*0.4)) 2805 | train = df_final[~df_final.isin(test)].dropna() 2806 | model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"]) 2807 | val = mean_squared_error(test["Close_1"],preds); 2808 | return val 2809 | ``` 2810 | 2811 | Reload Data 2812 | 2813 | 2814 | ```python 2815 | df = data_copy() 2816 | ``` 2817 | 2818 | 2819 | ```python 2820 | model(df) 2821 | ``` 2822 | 2823 | 302.61676570345287 2824 | 2825 | 2826 | 2827 | **(1) (7) (i) Transformation - Decomposition - Naive** 2828 | 2829 | 2830 | ```python 2831 | ## If Inferred Seasonality is Too Large Default to Five 2832 | seasons = transform.infer_seasonality(df["Close"],index=0) 2833 | df_out = transform.naive_dec(df.copy(), ["Close","Open"], freq=5) 2834 | model(df_out) #improvement 2835 | ``` 2836 | 2837 | 2838 | 2839 | 2840 | 274.34477082783525 2841 | 2842 | 2843 | 2844 | **(1) (8) (i) Transformation - Filter - Baxter-King-Bandpass** 2845 | 2846 | 2847 | ```python 2848 | df_out = transform.bkb(df_out, ["Close","Low"]) 2849 | df_best = df_out.copy() 2850 | model(df_out) #improvement 2851 | ``` 2852 | 2853 | 2854 | 2855 | 2856 | 267.1826850968307 2857 | 2858 | 2859 | 2860 | **(1) (3) (i) Transformation - Differentiation - Fractional** 2861 | 2862 | 2863 | ```python 2864 | df_out = transform.fast_fracdiff(df_out, ["Close_BPF"],0.5) 2865 | model(df_out) #null 2866 | ``` 2867 | 2868 | 2869 | 2870 | 2871 | 267.7083192402742 2872 | 2873 | 2874 | 2875 | **(1) (1) (i) Transformation - Scaling - Robust Scaler** 2876 | 2877 | 2878 | ```python 2879 | df_out = df_out.dropna() 2880 | df_out = transform.robust_scaler(df_out, drop=["Close_1"]) 2881 | model(df_out) #noisy 2882 | ``` 2883 | 2884 | 2885 | 2886 | 2887 | 270.96980399571214 2888 | 2889 | 2890 | 2891 | **(2) (2) (i) Interactions - Operator - Multiplication/Division** 2892 | 2893 | 2894 | ```python 2895 | df_out.head() 2896 | ``` 2897 | 2898 | 2899 | 2900 | 2901 | 2902 | 2903 | 2904 | 2905 | 2906 | 2907 | 2908 | 2909 | 2910 | 2911 | 2912 | 2913 | 2914 | 2915 | 2916 | 2917 | 2918 | 2919 | 2920 | 2921 | 2922 | 2923 | 2924 | 2925 | 2926 | 2927 | 2928 | 2929 | 2930 | 2931 | 2932 | 2933 | 2934 | 2935 | 2936 | 2937 | 2938 | 2939 | 2940 | 2941 | 2942 | 2943 | 2944 | 2945 | 2946 | 2947 | 2948 | 2949 | 2950 | 2951 | 2952 | 2953 | 2954 | 2955 | 2956 | 2957 | 2958 | 2959 | 2960 | 2961 | 2962 | 2963 | 2964 | 2965 | 2966 | 2967 | 2968 | 2969 | 2970 | 2971 | 2972 | 2973 | 2974 | 2975 | 2976 | 2977 | 2978 | 2979 | 2980 | 2981 | 2982 | 2983 | 2984 | 2985 | 2986 | 2987 | 2988 | 2989 | 2990 | 2991 | 2992 | 2993 | 2994 | 2995 | 2996 | 2997 | 2998 | 2999 | 3000 | 3001 | 3002 | 3003 | 3004 | 3005 | 3006 | 3007 | 3008 | 3009 | 3010 | 3011 | 3012 | 3013 | 3014 | 3015 | 3016 | 3017 | 3018 | 3019 | 3020 | 3021 | 3022 | 3023 | 3024 | 3025 | 3026 | 3027 | 3028 | 3029 | 3030 | 3031 | 3032 | 3033 | 3034 | 3035 | 3036 | 3037 | 3038 | 3039 | 3040 |
| Date | Close_1 | High | Low | Open | Close | Volume | Adj Close | Close_NDDT | Close_NDDS | Close_NDDR | Open_NDDT | Open_NDDS | Open_NDDR | Close_BPF | Low_BPF | Close_BPF_frac |
|------|---------|------|-----|------|-------|--------|-----------|------------|------------|------------|-----------|-----------|-----------|-----------|---------|----------------|
| 2019-01-08 | 338.529999 | 1.018413 | 0.964048 | 1.096600 | 1.001175 | -0.162616 | 1.001175 | 0.832297 | 0.834964 | 1.335433 | 0.758743 | 0.691596 | 2.259884 | -2.534142 | -2.249135 | -3.593612 |
| 2019-01-09 | 344.970001 | 1.012068 | 1.023302 | 1.011466 | 1.042689 | -0.501798 | 1.042689 | 0.908963 | -0.165036 | 1.111346 | 0.835786 | 0.333361 | 1.129783 | -3.081959 | -2.776302 | -2.523465 |
| 2019-01-10 | 347.260010 | 1.035581 | 1.027563 | 0.996969 | 1.126762 | -0.367576 | 1.126762 | 1.029347 | 2.120026 | 0.853697 | 0.907588 | 0.000000 | 0.533777 | -2.052768 | -2.543449 | -0.747382 |
| 2019-01-11 | 334.399994 | 1.073153 | 1.120506 | 1.098313 | 1.156658 | -0.586571 | 1.156658 | 1.109144 | -5.156051 | 0.591990 | 1.002162 | -0.666639 | 0.608516 | -0.694642 | -0.831670 | 0.414063 |
| 2019-01-14 | 344.429993 | 0.999627 | 1.056991 | 1.102135 | 0.988773 | -0.541752 | 0.988773 | 1.107633 | 0.000000 | -0.660350 | 1.056302 | -0.915491 | 0.263025 | -0.645590 | -0.116166 | -0.118012 |
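As a quick check on what the preceding steps have contributed, the newly created columns can be listed by comparing `df_out` against the reloaded frame `df` (a minimal sketch, not part of the package):


```python
# columns added so far by the decomposition, band-pass filter,
# fractional differencing and scaling steps
added_cols = [col for col in df_out.columns if col not in df.columns]
print(added_cols)  # e.g. ['Close_NDDT', ..., 'Close_BPF', 'Low_BPF', 'Close_BPF_frac']
```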
3041 | 3042 | 3043 | 3044 | ```python 3045 | df_out = interact.muldiv(df_out, ["Close","Open_NDDS","Low_BPF"]) 3046 | model(df_out) #noisy 3047 | ``` 3048 | 3049 | 3050 | 3051 | 3052 | 285.6420643864313 3053 | 3054 | 3055 | 3056 | 3057 | ```python 3058 | df_r = df_out.copy() 3059 | ``` 3060 | 3061 | **(2) (6) (i) Interactions - Speciality - Technical** 3062 | 3063 | 3064 | ```python 3065 | import ta 3066 | df = interact.tech(df) 3067 | df_out = pd.merge(df_out, df.iloc[:,7:], left_index=True, right_index=True, how="left") 3068 | ``` 3069 | 3070 | **Clean Dataframe and Metric** 3071 | 3072 | 3073 | ```python 3074 | """Droping column where missing values are above a threshold""" 3075 | df_out = df_out.dropna(thresh = len(df_out)*0.95, axis = "columns") 3076 | df_out = df_out.dropna() 3077 | df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0) 3078 | close = df_out["Close"].copy() 3079 | df_d = df_out.copy() 3080 | model(df_out) #improve 3081 | ``` 3082 | 3083 | 3084 | 3085 | 3086 | 592.52971755184 3087 | 3088 | 3089 | 3090 | **(3) (1) (i) Mapping - Eigen Decomposition - PCA** 3091 | 3092 | 3093 | 3094 | ```python 3095 | from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA 3096 | 3097 | df_out = transform.robust_scaler(df_out, drop=["Close_1"]) 3098 | ``` 3099 | 3100 | 3101 | ```python 3102 | df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0) 3103 | df_out = mapper.pca_feature(df_out, drop_cols=["Close_1"], variance_or_components=0.9, n_components=8,non_linear=False) 3104 | ``` 3105 | 3106 | 3107 | ```python 3108 | model(df_out) #noisy but not too bad given the 10 fold dimensionality reduction 3109 | ``` 3110 | 3111 | 3112 | 3113 | 3114 | 687.158330455884 3115 | 3116 | 3117 | 3118 | **(4) Extracting** 3119 | 3120 | Here at first, I show the functions that have been added to the DeltaPy fork of tsfresh. You have to add your own personal adjustments based on the features you would like to construct. I am using self-developed features, but you can also use TSFresh's community functions. 3121 | 3122 | *The following files have been appropriately ammended (Get in contact for advice)* 3123 | 1. https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/settings.py 3124 | 1. https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/feature_calculators.py 3125 | 1. 
https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/extraction.py 3126 | 3127 | **(4) (10) (i) Extracting - Averages - GSkew** 3128 | 3129 | 3130 | ```python 3131 | extract.gskew(df_out["PCA_1"]) 3132 | ``` 3133 | 3134 | 3135 | 3136 | 3137 | -0.7903067336449059 3138 | 3139 | 3140 | 3141 | **(4) (21) (ii) Extracting - Entropy - SVD Entropy** 3142 | 3143 | 3144 | ```python 3145 | svd_param = [{"Tau": ta, "DE": de} 3146 | for ta in [4] 3147 | for de in [3,6]] 3148 | 3149 | extract.svd_entropy(df_out["PCA_1"],svd_param) 3150 | ``` 3151 | 3152 | 3153 | 3154 | 3155 | [('Tau_"4"__De_3"', 0.7234823323374294), 3156 | ('Tau_"4"__De_6"', 1.3014347840145244)] 3157 | 3158 | 3159 | 3160 | **(4) (13) (ii) Extracting - Streaks - Wozniak** 3161 | 3162 | 3163 | ```python 3164 | woz_param = [{"consecutiveStar": n} for n in [2, 4]] 3165 | 3166 | extract.wozniak(df_out["PCA_1"],woz_param) 3167 | ``` 3168 | 3169 | 3170 | 3171 | 3172 | [('consecutiveStar_2', 0.012658227848101266), ('consecutiveStar_4', 0.0)] 3173 | 3174 | 3175 | 3176 | **(4) (28) (i) Extracting - Fractal - Higuchi** 3177 | 3178 | 3179 | ```python 3180 | hig_param = [{"Kmax": 3},{"Kmax": 5}] 3181 | 3182 | extract.higuchi_fractal_dimension(df_out["PCA_1"],hig_param) 3183 | ``` 3184 | 3185 | 3186 | 3187 | 3188 | [('Kmax_3', 0.577913816027104), ('Kmax_5', 0.8176960510304725)] 3189 | 3190 | 3191 | 3192 | **(4) (5) (ii) Extracting - Volatility - Variability Index** 3193 | 3194 | 3195 | ```python 3196 | var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values} 3197 | 3198 | extract.var_index(df["Close"].values,var_index_param) 3199 | ``` 3200 | 3201 | 3202 | 3203 | 3204 | {'Interact__Open': 0.00396022538846289, 3205 | 'Interact__Volume': 0.20550155114176533} 3206 | 3207 | 3208 | 3209 | **Time Series Extraction** 3210 | 3211 | 3212 | ```python 3213 | pip install git+git://github.com/firmai/tsfresh.git 3214 | ``` 3215 | 3216 | ```python 3217 | #Construct the preferred input dataframe. 
3218 | from tsfresh.utilities.dataframe_functions import roll_time_series 3219 | df_out["ID"] = 0 3220 | periods = 30 3221 | df_out = df_out.reset_index() 3222 | df_ts = roll_time_series(df_out,"ID","Date",None,1,periods) 3223 | counts = df_ts['ID'].value_counts() 3224 | df_ts = df_ts[df_ts['ID'].isin(counts[counts > periods].index)] 3225 | ``` 3226 | 3227 | 3228 | ```python 3229 | # Perform extraction 3230 | from tsfresh.feature_extraction import extract_features, CustomFCParameters 3231 | settings_dict = CustomFCParameters() 3232 | settings_dict["var_index"] = {"PCA_1":None, "PCA_2": None} 3233 | df_feat = extract_features(df_ts.drop(["Close_1"],axis=1),default_fc_parameters=settings_dict,column_id="ID",column_sort="Date") 3234 | ``` 3235 | 3236 | Feature Extraction: 100%|██████████| 5/5 [00:10<00:00, 2.14s/it] 3237 | 3238 | 3239 | 3240 | ```python 3241 | # Cleaning operations 3242 | import pandasvault as pv 3243 | df_feat2 = df_feat.copy() 3244 | df_feat = df_feat.dropna(thresh = len(df_feat)*0.50, axis = "columns") 3245 | df_feat_cons = pv.constant_feature_detect(data=df_feat,threshold=0.9) 3246 | df_feat = df_feat.drop(df_feat_cons, axis=1) 3247 | df_feat = df_feat.ffill() 3248 | df_feat = pd.merge(df_feat,df[["Close_1"]],left_index=True,right_index=True,how="left") 3249 | print(df_feat.shape) 3250 | model(df_feat) #noisy 3251 | ``` 3252 | 3253 | 7 variables are found to be almost constant 3254 | (208, 48) 3255 | 2064.7813982935995 3256 | 3257 | 3258 | 3259 | 3260 | ```python 3261 | from tsfresh import select_features 3262 | from tsfresh.utilities.dataframe_functions import impute 3263 | 3264 | impute(df_feat) 3265 | df_feat_2 = select_features(df_feat.drop(["Close_1"],axis=1),df_feat["Close_1"],fdr_level=0.05) 3266 | df_feat_2["Close_1"] = df_feat["Close_1"] 3267 | model(df_feat_2) #improvement (but not an augmentation method) 3268 | ``` 3269 | 3270 | 3271 | 3272 | 3273 | 1577.5273071299482 3274 | 3275 | 3276 | 3277 | **(3) (6) (i) Feature Agglomeration;   (1)(2)(i) Standard Scaler.** 3278 | 3279 | As this step shows, after (1), (2), (3), (4) and (5) you can often circle back to the initial steps to normalise and dimensionally reduce the data for the final model. 
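For intuition, `FeatureAgglomeration` simply clusters similar columns and pools each cluster (by averaging, by default). A tiny self-contained toy illustration, separate from the walk-through data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import FeatureAgglomeration

# Two pairs of near-duplicate columns collapse into two pooled features.
rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
toy = pd.DataFrame({"a1": a, "a2": a + 0.01 * rng.normal(size=100),
                    "b1": b, "b2": b + 0.01 * rng.normal(size=100)})
agglo = FeatureAgglomeration(n_clusters=2).fit(toy)
print(agglo.labels_)           # e.g. [0 0 1 1]: which columns were pooled together
pooled = agglo.transform(toy)  # shape (100, 2): each cluster averaged into one column
```

The cell below applies the same idea to the selected tsfresh features, after standard scaling.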
3280 | 3281 | 3282 | ```python 3283 | import numpy as np 3284 | from sklearn import datasets, cluster 3285 | 3286 | def feature_agg(df, drop, components): 3287 | components = min(df.shape[1]-1,components) 3288 | agglo = cluster.FeatureAgglomeration(n_clusters=components,) 3289 | df = df.drop(drop,axis=1) 3290 | agglo.fit(df) 3291 | df = pd.DataFrame(agglo.transform(df)) 3292 | df = df.add_prefix('fe_agg_') 3293 | 3294 | return df 3295 | 3296 | df_final = transform.standard_scaler(df_feat_2, drop=["Close_1"]) 3297 | df_final = mapper.feature_agg(df_final,["Close_1"],4) 3298 | df_final.index = df_feat.index 3299 | df_final["Close_1"] = df_feat["Close_1"] 3300 | model(df_final) #noisy 3301 | ``` 3302 | 3303 | 3304 | 3305 | 3306 | 1949.89085894338 3307 | 3308 | 3309 | 3310 | **Final Model** After Applying 13 Arbitrary Augmentation Techniques 3311 | 3312 | 3313 | ```python 3314 | model(df_final) #improvement 3315 | ``` 3316 | 3317 | 3318 | 3319 | 3320 | 1949.89085894338 3321 | 3322 | 3323 | 3324 | **Original Model** Before Augmentation 3325 | 3326 | 3327 | ```python 3328 | df_org = df.iloc[:,:7][df.index.isin(df_final.index)] 3329 | model(df_org) 3330 | ``` 3331 | 3332 | 3333 | 3334 | 3335 | 389.783990984133 3336 | 3337 | 3338 | 3339 | **Best Model** After Developing 8 Augmenting Features 3340 | 3341 | 3342 | ```python 3343 | df_best = df_best.replace([np.inf, -np.inf], np.nan).ffill().fillna(0) 3344 | model(df_best) 3345 | ``` 3346 | 3347 | 3348 | 3349 | 3350 | 267.1826850968307 3351 | 3352 | 3353 | 3354 | **Commentary** 3355 | 3356 | There are countless ways in which the current model can be improved. The process can also be automated, with every technique tested against a hold-out set, as sketched below; even though such a pass improves the score here, more robust tests are needed. The skeleton example above is not meant to highlight the performance of the package; it simply serves as an example of how one can go about applying augmentation methods. 3357 | 3358 | Quite naturally, this example suffers from dimensionality issues, with array shapes reaching ```(208, 48)```; furthermore, you would need a sample that is at least 50-100 times larger before machine learning methods start to make sense. 3359 | 3360 | Nonetheless, in this example, *Transformation*, *Interaction* and *Mapping* methods (applied to the extraction output) performed fairly well. *Extraction* augmentation was overkill, but created a reasonable model when dimensionally reduced. A better selection from the 50+ augmentation methods, and of the order of augmentation, could further improve the outcome if robustly tested against development sets. 
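A minimal sketch of such an automated pass, reusing the `model` scorer, the cleaned frame `df_d` saved earlier, and the modules from the walk-through; the candidate list is purely illustrative, and `model` is assumed to return a mean squared error (lower is better):

```python
candidates = {
    "robust_scaler": lambda d: transform.robust_scaler(d, drop=["Close_1"]),
    "muldiv":        lambda d: interact.muldiv(d, ["Close", "Open"]),
    "pca_map":       lambda d: mapper.pca_feature(d, drop_cols=["Close_1"],
                                                  variance_or_components=0.9,
                                                  n_components=8, non_linear=False),
}

base, best = df_d.copy(), model(df_d)
for name, step in candidates.items():
    trial = step(base.copy()).replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
    score = model(trial)              # hold-out error of the augmented frame
    print(name, round(score, 2))
    if score < best:                  # keep the augmentation only if it helps
        base, best = trial, score
```

A more robust version would re-draw the hold-out split for each candidate and also search over the order of augmentation.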
3361 | 3362 | 3363 | [[1]](https://colab.research.google.com/drive/1tstO4fja9wRWjkPgjxRYr9MEvYXTx7fA) DeltaPy Development 3364 | -------------------------------------------------------------------------------- /assets/Tabularz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/firmai/deltapy/05c8a6440440bcc5ee1d051d7dfeee70807329de/assets/Tabularz.png -------------------------------------------------------------------------------- /deltapy/__init__.py: -------------------------------------------------------------------------------- 1 | from deltapy import transform 2 | from deltapy import interact 3 | from deltapy import mapper 4 | from deltapy import extract -------------------------------------------------------------------------------- /deltapy/extract.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from statsmodels.tsa.stattools import acf, adfuller, pacf 4 | import itertools 5 | from statsmodels.tsa.ar_model import AR 6 | from scipy.signal import cwt, find_peaks_cwt, ricker, welch 7 | from scipy.stats import linregress 8 | import scipy.stats as stats 9 | from scipy.stats import kurtosis as _kurt 10 | from scipy.stats import skew as _skew 11 | import numpy as np 12 | from scipy import signal, integrate 13 | from scipy.interpolate import interp1d 14 | import math 15 | from statsmodels.tools.sm_exceptions import MissingDataError 16 | from numpy.linalg import LinAlgError 17 | 18 | 19 | def set_property(key, value): 20 | """ 21 | This method returns a decorator that sets the property key of the function to value 22 | """ 23 | def decorate_func(func): 24 | setattr(func, key, value) 25 | if func.__doc__ and key == "fctype": 26 | func.__doc__ = func.__doc__ + "\n\n *This function is of type: " + value + "*\n" 27 | return func 28 | return decorate_func 29 | 30 | 31 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 32 | 33 | 34 | #-> In Package 35 | def abs_energy(x): 36 | 37 | if not isinstance(x, (np.ndarray, pd.Series)): 38 | x = np.asarray(x) 39 | return np.dot(x, x) 40 | 41 | # abs_energy(df["Close"]) 42 | 43 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 44 | 45 | 46 | #-> In Package 47 | def cid_ce(x, normalize): 48 | 49 | if not isinstance(x, (np.ndarray, pd.Series)): 50 | x = np.asarray(x) 51 | if normalize: 52 | s = np.std(x) 53 | if s!=0: 54 | x = (x - np.mean(x))/s 55 | else: 56 | return 0.0 57 | 58 | x = np.diff(x) 59 | return np.sqrt(np.dot(x, x)) 60 | 61 | # cid_ce(df["Close"], True) 62 | 63 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 64 | 65 | 66 | #-> In Package 67 | def mean_abs_change(x): 68 | return np.mean(np.abs(np.diff(x))) 69 | 70 | # mean_abs_change(df["Close"]) 71 | 72 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 73 | 74 | 75 | #-> In Package 76 | def _roll(a, shift): 77 | if not isinstance(a, np.ndarray): 78 | a = np.asarray(a) 79 | idx = shift % len(a) 80 | return np.concatenate([a[-idx:], a[:-idx]]) 81 | 82 | def mean_second_derivative_central(x): 83 | 84 | diff = (_roll(x, 1) - 2 * np.array(x) + _roll(x, -1)) / 2.0 85 | return np.mean(diff[1:-1]) 86 | 87 | # mean_second_derivative_central(df["Close"]) 88 | 89 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 90 | 91 | 92 | #-> In Package 93 | def 
variance_larger_than_standard_deviation(x): 94 | 95 | y = np.var(x) 96 | return y > np.sqrt(y) 97 | 98 | # variance_larger_than_standard_deviation(df["Close"]) 99 | 100 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 101 | 102 | var_index_param = {"Volume":None, "Open": None} 103 | 104 | @set_property("fctype", "combiner") 105 | @set_property("custom", True) 106 | def var_index(time,param=var_index_param): 107 | final = [] 108 | keys = [] 109 | for key, magnitude in param.items(): 110 | w = 1.0 / np.power(np.subtract(time[1:], time[:-1]), 2) 111 | w_mean = np.mean(w) 112 | 113 | N = len(time) 114 | sigma2 = np.var(magnitude) 115 | 116 | S1 = sum(w * (magnitude[1:] - magnitude[:-1]) ** 2) 117 | S2 = sum(w) 118 | 119 | eta_e = (w_mean * np.power(time[N - 1] - 120 | time[0], 2) * S1 / (sigma2 * S2 * N ** 2)) 121 | final.append(eta_e) 122 | keys.append(key) 123 | return {"Interact__{}".format(k): eta_e for eta_e, k in zip(final,keys) } 124 | 125 | # var_index(df["Close"].values) 126 | 127 | 128 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 129 | 130 | 131 | #-> In Package 132 | def symmetry_looking(x, param=[{"r": 0.2}]): 133 | 134 | if not isinstance(x, (np.ndarray, pd.Series)): 135 | x = np.asarray(x) 136 | mean_median_difference = np.abs(np.mean(x) - np.median(x)) 137 | max_min_difference = np.max(x) - np.min(x) 138 | return [("r_{}".format(r["r"]), mean_median_difference < (r["r"] * max_min_difference)) 139 | for r in param] 140 | 141 | # symmetry_looking(df["Close"]) 142 | 143 | 144 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 145 | 146 | #-> In Package 147 | def has_duplicate_max(x): 148 | """ 149 | Checks if the maximum value of x is observed more than once 150 | 151 | :param x: the time series to calculate the feature of 152 | :type x: numpy.ndarray 153 | :return: the value of this feature 154 | :return type: bool 155 | """ 156 | if not isinstance(x, (np.ndarray, pd.Series)): 157 | x = np.asarray(x) 158 | return np.sum(x == np.max(x)) >= 2 159 | 160 | # has_duplicate_max(df["Close"]) 161 | 162 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 163 | 164 | #-> In Package 165 | 166 | def partial_autocorrelation(x, param=[{"lag": 1}]): 167 | 168 | # Check the difference between demanded lags by param and possible lags to calculate (depends on len(x)) 169 | max_demanded_lag = max([lag["lag"] for lag in param]) 170 | n = len(x) 171 | 172 | # Check if list is too short to make calculations 173 | if n <= 1: 174 | pacf_coeffs = [np.nan] * (max_demanded_lag + 1) 175 | else: 176 | if (n <= max_demanded_lag): 177 | max_lag = n - 1 178 | else: 179 | max_lag = max_demanded_lag 180 | pacf_coeffs = list(pacf(x, method="ld", nlags=max_lag)) 181 | pacf_coeffs = pacf_coeffs + [np.nan] * max(0, (max_demanded_lag - max_lag)) 182 | 183 | return [("lag_{}".format(lag["lag"]), pacf_coeffs[lag["lag"]]) for lag in param] 184 | 185 | # partial_autocorrelation(df["Close"]) 186 | 187 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 188 | 189 | #-> In Package 190 | def augmented_dickey_fuller(x, param=[{"attr": "teststat"}]): 191 | 192 | res = None 193 | try: 194 | res = adfuller(x) 195 | except LinAlgError: 196 | res = np.NaN, np.NaN, np.NaN 197 | except ValueError: # occurs if sample size is too small 198 | res = np.NaN, np.NaN, np.NaN 199 | except MissingDataError: # is thrown for e.g. 
inf or nan in the data 200 | res = np.NaN, np.NaN, np.NaN 201 | 202 | return [('attr_"{}"'.format(config["attr"]), 203 | res[0] if config["attr"] == "teststat" 204 | else res[1] if config["attr"] == "pvalue" 205 | else res[2] if config["attr"] == "usedlag" else np.NaN) 206 | for config in param] 207 | 208 | # augmented_dickey_fuller(df["Close"]) 209 | 210 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 211 | 212 | @set_property("fctype", "simple") 213 | @set_property("custom", True) 214 | def gskew(x): 215 | interpolation="nearest" 216 | median_mag = np.median(x) 217 | F_3_value = np.percentile(x, 3, interpolation=interpolation) 218 | F_97_value = np.percentile(x, 97, interpolation=interpolation) 219 | 220 | skew = (np.median(x[x <= F_3_value]) + 221 | np.median(x[x >= F_97_value]) - 2 * median_mag) 222 | 223 | return skew 224 | 225 | # gskew(df["Close"]) 226 | 227 | 228 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 229 | 230 | stestson_param = {"weight":100., "alpha":2., "beta":2., "tol":1.e-6, "nmax":20} 231 | 232 | @set_property("fctype", "combiner") 233 | @set_property("custom", True) 234 | def stetson_mean(x, param=stestson_param): 235 | 236 | weight= stestson_param["weight"] 237 | alpha= stestson_param["alpha"] 238 | beta = stestson_param["beta"] 239 | tol= stestson_param["tol"] 240 | nmax= stestson_param["nmax"] 241 | 242 | 243 | mu = np.median(x) 244 | for i in range(nmax): 245 | resid = x - mu 246 | resid_err = np.abs(resid) * np.sqrt(weight) 247 | weight1 = weight / (1. + (resid_err / alpha)**beta) 248 | weight1 /= weight1.mean() 249 | diff = np.mean(x * weight1) - mu 250 | mu += diff 251 | if (np.abs(diff) < tol*np.abs(mu) or np.abs(diff) < tol): 252 | break 253 | 254 | return mu 255 | 256 | # stetson_mean(df["Close"]) 257 | 258 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 259 | 260 | #-> In Package 261 | def length(x): 262 | return len(x) 263 | 264 | # length(df["Close"]) 265 | 266 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 267 | 268 | 269 | #-> In Package 270 | def count_above_mean(x): 271 | m = np.mean(x) 272 | return np.where(x > m)[0].size 273 | 274 | # count_above_mean(df["Close"]) 275 | 276 | 277 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 278 | 279 | #-> In Package 280 | 281 | 282 | def get_length_sequences_where(x): 283 | 284 | if len(x) == 0: 285 | return [0] 286 | else: 287 | res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1] 288 | return res if len(res) > 0 else [0] 289 | 290 | def longest_strike_below_mean(x): 291 | 292 | if not isinstance(x, (np.ndarray, pd.Series)): 293 | x = np.asarray(x) 294 | return np.max(get_length_sequences_where(x <= np.mean(x))) if x.size > 0 else 0 295 | 296 | # longest_strike_below_mean(df["Close"]) 297 | 298 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 299 | 300 | woz_param = [{"consecutiveStar": n} for n in [2, 4]] 301 | 302 | @set_property("fctype", "combiner") 303 | @set_property("custom", True) 304 | def wozniak(magnitude, param=woz_param): 305 | 306 | iters = [] 307 | for consecutiveStar in [stars["consecutiveStar"] for stars in param]: 308 | N = len(magnitude) 309 | if N < consecutiveStar: 310 | return 0 311 | sigma = np.std(magnitude) 312 | m = np.mean(magnitude) 313 | count = 0 314 | 315 | for i in range(N - consecutiveStar + 1): 316 | flag = 0 317 | for j in 
range(consecutiveStar): 318 | if(magnitude[i + j] > m + 2 * sigma or 319 | magnitude[i + j] < m - 2 * sigma): 320 | flag = 1 321 | else: 322 | flag = 0 323 | break 324 | if flag: 325 | count = count + 1 326 | iters.append(count * 1.0 / (N - consecutiveStar + 1)) 327 | 328 | return [("consecutiveStar_{}".format(config["consecutiveStar"]), iters[en] ) for en, config in enumerate(param)] 329 | 330 | # wozniak(df["Close"]) 331 | 332 | 333 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 334 | 335 | #-> In Package 336 | def last_location_of_maximum(x): 337 | 338 | x = np.asarray(x) 339 | return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.NaN 340 | 341 | # last_location_of_maximum(df["Close"]) 342 | 343 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 344 | 345 | #-> In Package 346 | def fft_coefficient(x, param = [{"coeff": 10, "attr": "real"}]): 347 | 348 | assert min([config["coeff"] for config in param]) >= 0, "Coefficients must be positive or zero." 349 | assert set([config["attr"] for config in param]) <= set(["imag", "real", "abs", "angle"]), \ 350 | 'Attribute must be "real", "imag", "angle" or "abs"' 351 | 352 | fft = np.fft.rfft(x) 353 | 354 | def complex_agg(x, agg): 355 | if agg == "real": 356 | return x.real 357 | elif agg == "imag": 358 | return x.imag 359 | elif agg == "abs": 360 | return np.abs(x) 361 | elif agg == "angle": 362 | return np.angle(x, deg=True) 363 | 364 | res = [complex_agg(fft[config["coeff"]], config["attr"]) if config["coeff"] < len(fft) 365 | else np.NaN for config in param] 366 | index = [('coeff_{}__attr_"{}"'.format(config["coeff"], config["attr"]),res[0]) for config in param] 367 | return index 368 | 369 | # fft_coefficient(df["Close"]) 370 | 371 | 372 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 373 | 374 | #-> In Package 375 | 376 | 377 | def ar_coefficient(x, param=[{"coeff": 5, "k": 5}]): 378 | 379 | calculated_ar_params = {} 380 | 381 | x_as_list = list(x) 382 | calculated_AR = AR(x_as_list) 383 | 384 | res = {} 385 | 386 | for parameter_combination in param: 387 | k = parameter_combination["k"] 388 | p = parameter_combination["coeff"] 389 | 390 | column_name = "k_{}__coeff_{}".format(k, p) 391 | 392 | if k not in calculated_ar_params: 393 | try: 394 | calculated_ar_params[k] = calculated_AR.fit(maxlag=k, solver="mle").params 395 | except (LinAlgError, ValueError): 396 | calculated_ar_params[k] = [np.NaN]*k 397 | 398 | mod = calculated_ar_params[k] 399 | 400 | if p <= k: 401 | try: 402 | res[column_name] = mod[p] 403 | except IndexError: 404 | res[column_name] = 0 405 | else: 406 | res[column_name] = np.NaN 407 | 408 | return [(key, value) for key, value in res.items()] 409 | 410 | # ar_coefficient(df["Close"]) 411 | 412 | 413 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 414 | 415 | #-> In Package 416 | def index_mass_quantile(x, param=[{"q": 0.3}]): 417 | 418 | x = np.asarray(x) 419 | abs_x = np.abs(x) 420 | s = sum(abs_x) 421 | 422 | if s == 0: 423 | # all values in x are zero or it has length 0 424 | return [("q_{}".format(config["q"]), np.NaN) for config in param] 425 | else: 426 | # at least one value is not zero 427 | mass_centralized = np.cumsum(abs_x) / s 428 | return [("q_{}".format(config["q"]), (np.argmax(mass_centralized >= config["q"])+1)/len(x)) for config in param] 429 | 430 | # index_mass_quantile(df["Close"]) 431 | 432 | 433 | 
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 434 | 435 | 436 | 437 | cwt_param = [ka for ka in [2,6,9]] 438 | 439 | @set_property("fctype", "combiner") 440 | @set_property("custom", True) 441 | def number_cwt_peaks(x, param=cwt_param): 442 | 443 | return [("CWTPeak_{}".format(n), len(find_peaks_cwt(vector=x, widths=np.array(list(range(1, n + 1))), wavelet=ricker))) for n in param] 444 | 445 | # number_cwt_peaks(df["Close"]) 446 | 447 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 448 | 449 | 450 | #-> In Package 451 | def spkt_welch_density(x, param=[{"coeff": 5}]): 452 | freq, pxx = welch(x, nperseg=min(len(x), 256)) 453 | coeff = [config["coeff"] for config in param] 454 | indices = ["coeff_{}".format(i) for i in coeff] 455 | 456 | if len(pxx) <= np.max(coeff): # There are fewer data points in the time series than requested coefficients 457 | 458 | # filter coefficients that are not contained in pxx 459 | reduced_coeff = [coefficient for coefficient in coeff if len(pxx) > coefficient] 460 | not_calculated_coefficients = [coefficient for coefficient in coeff 461 | if coefficient not in reduced_coeff] 462 | 463 | # Fill up the rest of the requested coefficients with np.NaNs 464 | return zip(indices, list(pxx[reduced_coeff]) + [np.NaN] * len(not_calculated_coefficients)) 465 | else: 466 | return pxx[coeff].ravel()[0] 467 | 468 | # spkt_welch_density(df["Close"]) 469 | 470 | 471 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 472 | 473 | 474 | #-> In Package 475 | def linear_trend_timewise(x, param= [{"attr": "pvalue"}]): 476 | 477 | ix = x.index 478 | 479 | # Get differences between each timestamp and the first timestamp in seconds. 480 | # Then convert to hours and reshape for linear regression 481 | times_seconds = (ix - ix[0]).total_seconds() 482 | times_hours = np.asarray(times_seconds / float(3600)) 483 | 484 | linReg = linregress(times_hours, x.values) 485 | 486 | return [("attr_\"{}\"".format(config["attr"]), getattr(linReg, config["attr"])) 487 | for config in param] 488 | 489 | # linear_trend_timewise(df["Close"]) 490 | 491 | 492 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 493 | 494 | #-> In Package 495 | def c3(x, lag=3): 496 | if not isinstance(x, (np.ndarray, pd.Series)): 497 | x = np.asarray(x) 498 | n = x.size 499 | if 2 * lag >= n: 500 | return 0 501 | else: 502 | return np.mean((_roll(x, 2 * -lag) * _roll(x, -lag) * x)[0:(n - 2 * lag)]) 503 | 504 | # c3(df["Close"]) 505 | 506 | 507 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 508 | 509 | #-> In Package 510 | def binned_entropy(x, max_bins=10): 511 | if not isinstance(x, (np.ndarray, pd.Series)): 512 | x = np.asarray(x) 513 | hist, bin_edges = np.histogram(x, bins=max_bins) 514 | probs = hist / x.size 515 | return - np.sum(p * np.math.log(p) for p in probs if p != 0) 516 | 517 | # binned_entropy(df["Close"]) 518 | 519 | 520 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 521 | 522 | 523 | svd_param = [{"Tau": ta, "DE": de} 524 | for ta in [4] 525 | for de in [3,6]] 526 | 527 | def _embed_seq(X,Tau,D): 528 | N =len(X) 529 | if D * Tau > N: 530 | print("Cannot build such a matrix, because D * Tau > N") 531 | exit() 532 | if Tau<1: 533 | print("Tau has to be at least 1") 534 | exit() 535 | Y= np.zeros((N - (D - 1) * Tau, D)) 536 | 537 | for i in range(0, N - (D - 1) * Tau): 538 | for j in 
range(0, D): 539 | Y[i][j] = X[i + j * Tau] 540 | return Y 541 | 542 | @set_property("fctype", "combiner") 543 | @set_property("custom", True) 544 | def svd_entropy(epochs, param=svd_param): 545 | axis=0 546 | 547 | final = [] 548 | for par in param: 549 | 550 | def svd_entropy_1d(X, Tau, DE): 551 | Y = _embed_seq(X, Tau, DE) 552 | W = np.linalg.svd(Y, compute_uv=0) 553 | W /= sum(W) # normalize singular values 554 | return -1 * np.sum(W * np.log(W)) 555 | 556 | Tau = par["Tau"] 557 | DE = par["DE"] 558 | 559 | final.append(np.apply_along_axis(svd_entropy_1d, axis, epochs, Tau, DE).ravel()[0]) 560 | 561 | 562 | return [("Tau_\"{}\"__De_{}\"".format(par["Tau"], par["DE"]), final[en]) for en, par in enumerate(param)] 563 | 564 | # svd_entropy(df["Close"].values) 565 | 566 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 567 | 568 | def _hjorth_mobility(epochs): 569 | diff = np.diff(epochs, axis=0) 570 | sigma0 = np.std(epochs, axis=0) 571 | sigma1 = np.std(diff, axis=0) 572 | return np.divide(sigma1, sigma0) 573 | 574 | @set_property("fctype", "simple") 575 | @set_property("custom", True) 576 | def hjorth_complexity(epochs): 577 | diff1 = np.diff(epochs, axis=0) 578 | diff2 = np.diff(diff1, axis=0) 579 | sigma1 = np.std(diff1, axis=0) 580 | sigma2 = np.std(diff2, axis=0) 581 | return np.divide(np.divide(sigma2, sigma1), _hjorth_mobility(epochs)) 582 | 583 | # hjorth_complexity(df["Close"]) 584 | 585 | 586 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 587 | 588 | #-> In Package 589 | def _estimate_friedrich_coefficients(x, m, r): 590 | assert m > 0, "Order of polynomial need to be positive integer, found {}".format(m) 591 | df = pd.DataFrame({'signal': x[:-1], 'delta': np.diff(x)}) 592 | try: 593 | df['quantiles'] = pd.qcut(df.signal, r) 594 | except ValueError: 595 | return [np.NaN] * (m + 1) 596 | 597 | quantiles = df.groupby('quantiles') 598 | 599 | result = pd.DataFrame({'x_mean': quantiles.signal.mean(), 'y_mean': quantiles.delta.mean()}) 600 | result.dropna(inplace=True) 601 | 602 | try: 603 | return np.polyfit(result.x_mean, result.y_mean, deg=m) 604 | except (np.linalg.LinAlgError, ValueError): 605 | return [np.NaN] * (m + 1) 606 | 607 | 608 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 609 | 610 | 611 | def max_langevin_fixed_point(x, r=3, m=30): 612 | coeff = _estimate_friedrich_coefficients(x, m, r) 613 | 614 | try: 615 | max_fixed_point = np.max(np.real(np.roots(coeff))) 616 | except (np.linalg.LinAlgError, ValueError): 617 | return np.nan 618 | 619 | return max_fixed_point 620 | 621 | # max_langevin_fixed_point(df["Close"]) 622 | 623 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 624 | 625 | will_param = [ka for ka in [0.2,3]] 626 | 627 | @set_property("fctype", "combiner") 628 | @set_property("custom", True) 629 | def willison_amplitude(X, param=will_param): 630 | return [("Thresh_{}".format(n),np.sum(np.abs(np.diff(X)) >= n)) for n in param] 631 | 632 | # willison_amplitude(df["Close"]) 633 | 634 | 635 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 636 | 637 | perc_param = [{"base":ba, "exponent":exp} for ba in [3,5] for exp in [-0.1,-0.2]] 638 | 639 | @set_property("fctype", "combiner") 640 | @set_property("custom", True) 641 | def percent_amplitude(x, param =perc_param): 642 | final = [] 643 | for par in param: 644 | linear_scale_data = par["base"] ** (par["exponent"] * x) 645 | y_max 
= np.max(linear_scale_data) 646 | y_min = np.min(linear_scale_data) 647 | y_med = np.median(linear_scale_data) 648 | final.append(max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med))) 649 | 650 | return [("Base_{}__Exp{}".format(pa["base"],pa["exponent"]),fin) for fin, pa in zip(final,param)] 651 | 652 | # percent_amplitude(df["Close"]) 653 | 654 | 655 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 656 | 657 | #-> fixes required 658 | 659 | 660 | 661 | cad_param = [0.1,1000, -234] 662 | 663 | @set_property("fctype", "combiner") 664 | @set_property("custom", True) 665 | def cad_prob(cads, param=cad_param): 666 | return [("time_{}".format(time), stats.percentileofscore(cads, float(time) / (24.0 * 60.0)) / 100.0) for time in param] 667 | 668 | # cad_prob(df["Close"]) 669 | 670 | 671 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 672 | 673 | zero_param = [0.01, 8] 674 | 675 | @set_property("fctype", "combiner") 676 | @set_property("custom", True) 677 | def zero_crossing_derivative(epochs, param=zero_param): 678 | diff = np.diff(epochs) 679 | norm = diff-diff.mean() 680 | return [("e_{}".format(e), np.apply_along_axis(lambda epoch: np.sum(((epoch[:-5] <= e) & (epoch[5:] > e))), 0, norm).ravel()[0]) for e in param] 681 | 682 | # zero_crossing_derivative(df["Close"]) 683 | 684 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 685 | 686 | 687 | @set_property("fctype", "simple") 688 | @set_property("custom", True) 689 | def detrended_fluctuation_analysis(epochs): 690 | def dfa_1d(X, Ave=None, L=None): 691 | X = np.array(X) 692 | 693 | if Ave is None: 694 | Ave = np.mean(X) 695 | 696 | Y = np.cumsum(X) 697 | Y -= Ave 698 | 699 | if L is None: 700 | L = np.floor(len(X) * 1 / ( 701 | 2 ** np.array(list(range(1, int(np.log2(len(X))) - 4)))) 702 | ) 703 | 704 | F = np.zeros(len(L)) # F(n) of different given box length n 705 | 706 | for i in range(0, len(L)): 707 | n = int(L[i]) # for each box length L[i] 708 | if n == 0: 709 | print("time series is too short while the box length is too big") 710 | print("abort") 711 | exit() 712 | for j in range(0, len(X), n): # for each box 713 | if j + n < len(X): 714 | c = list(range(j, j + n)) 715 | # coordinates of time in the box 716 | c = np.vstack([c, np.ones(n)]).T 717 | # the value of data in the box 718 | y = Y[j:j + n] 719 | # add residue in this box 720 | F[i] += np.linalg.lstsq(c, y, rcond=None)[1] 721 | F[i] /= ((len(X) / n) * n) 722 | F = np.sqrt(F) 723 | 724 | stacked = np.vstack([np.log(L), np.ones(len(L))]) 725 | stacked_t = stacked.T 726 | Alpha = np.linalg.lstsq(stacked_t, np.log(F), rcond=None) 727 | 728 | return Alpha[0][0] 729 | 730 | return np.apply_along_axis(dfa_1d, 0, epochs).ravel()[0] 731 | 732 | # detrended_fluctuation_analysis(df["Close"]) 733 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 734 | 735 | 736 | def _embed_seq(X, Tau, D): 737 | 738 | shape = (X.size - Tau * (D - 1), D) 739 | strides = (X.itemsize, Tau * X.itemsize) 740 | return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides) 741 | 742 | fisher_param = [{"Tau":ta, "DE":de} for ta in [3,15] for de in [10,5]] 743 | 744 | @set_property("fctype", "combiner") 745 | @set_property("custom", True) 746 | def fisher_information(epochs, param=fisher_param): 747 | def fisher_info_1d(a, tau, de): 748 | # taken from pyeeg improvements 749 | 750 | mat = _embed_seq(a, tau, de) 751 | W = np.linalg.svd(mat, 
compute_uv=False) 752 | W /= sum(W) # normalize singular values 753 | FI_v = (W[1:] - W[:-1]) ** 2 / W[:-1] 754 | return np.sum(FI_v) 755 | 756 | return [("Tau_{}__DE_{}".format(par["Tau"], par["DE"]),np.apply_along_axis(fisher_info_1d, 0, epochs, par["Tau"], par["DE"]).ravel()[0]) for par in param] 757 | 758 | # fisher_information(df["Close"]) 759 | 760 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 761 | 762 | hig_param = [{"Kmax": 3},{"Kmax": 5}] 763 | 764 | @set_property("fctype", "combiner") 765 | @set_property("custom", True) 766 | def higuchi_fractal_dimension(epochs, param=hig_param): 767 | def hfd_1d(X, Kmax): 768 | 769 | L = [] 770 | x = [] 771 | N = len(X) 772 | for k in range(1, Kmax): 773 | Lk = [] 774 | for m in range(0, k): 775 | Lmk = 0 776 | for i in range(1, int(np.floor((N - m) / k))): 777 | Lmk += abs(X[m + i * k] - X[m + i * k - k]) 778 | Lmk = Lmk * (N - 1) / np.floor((N - m) / float(k)) / k 779 | Lk.append(Lmk) 780 | L.append(np.log(np.mean(Lk))) 781 | x.append([np.log(float(1) / k), 1]) 782 | 783 | (p, r1, r2, s) = np.linalg.lstsq(x, L, rcond=None) 784 | return p[0] 785 | 786 | return [("Kmax_{}".format(config["Kmax"]), np.apply_along_axis(hfd_1d, 0, epochs, config["Kmax"]).ravel()[0] ) for config in param] 787 | 788 | # higuchi_fractal_dimension(df["Close"]) 789 | 790 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 791 | 792 | @set_property("fctype", "simple") 793 | @set_property("custom", True) 794 | def petrosian_fractal_dimension(epochs): 795 | def pfd_1d(X, D=None): 796 | # taken from pyeeg 797 | """Compute Petrosian Fractal Dimension of a time series from either two 798 | cases below: 799 | 1. X, the time series of type list (default) 800 | 2. D, the first order differential sequence of X (if D is provided, 801 | recommended to speed up) 802 | In case 1, D is computed using Numpy's difference function. 803 | To speed up, it is recommended to compute D before calling this function 804 | because D may also be used by other functions whereas computing it here 805 | again will slow down. 806 | """ 807 | if D is None: 808 | D = np.diff(X) 809 | D = D.tolist() 810 | N_delta = 0 # number of sign changes in derivative of the signal 811 | for i in range(1, len(D)): 812 | if D[i] * D[i - 1] < 0: 813 | N_delta += 1 814 | n = len(X) 815 | return np.log10(n) / (np.log10(n) + np.log10(n / n + 0.4 * N_delta)) 816 | return np.apply_along_axis(pfd_1d, 0, epochs).ravel()[0] 817 | 818 | # petrosian_fractal_dimension(df["Close"]) 819 | 820 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 821 | 822 | @set_property("fctype", "simple") 823 | @set_property("custom", True) 824 | def hurst_exponent(epochs): 825 | def hurst_1d(X): 826 | 827 | X = np.array(X) 828 | N = X.size 829 | T = np.arange(1, N + 1) 830 | Y = np.cumsum(X) 831 | Ave_T = Y / T 832 | 833 | S_T = np.zeros(N) 834 | R_T = np.zeros(N) 835 | for i in range(N): 836 | S_T[i] = np.std(X[:i + 1]) 837 | X_T = Y - T * Ave_T[i] 838 | R_T[i] = np.ptp(X_T[:i + 1]) 839 | 840 | for i in range(1, len(S_T)): 841 | if np.diff(S_T)[i - 1] != 0: 842 | break 843 | for j in range(1, len(R_T)): 844 | if np.diff(R_T)[j - 1] != 0: 845 | break 846 | k = max(i, j) 847 | assert k < 10, "rethink it!" 
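# Rescaled-range step: regress log(R/S) on log(n); the fitted slope m is the Hurst exponent H.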
848 | 849 | R_S = R_T[k:] / S_T[k:] 850 | R_S = np.log(R_S) 851 | 852 | n = np.log(T)[k:] 853 | A = np.column_stack((n, np.ones(n.size))) 854 | [m, c] = np.linalg.lstsq(A, R_S, rcond=None)[0] 855 | H = m 856 | return H 857 | return np.apply_along_axis(hurst_1d, 0, epochs).ravel()[0] 858 | 859 | # hurst_exponent(df["Close"]) 860 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 861 | 862 | def _embed_seq(X, Tau, D): 863 | shape = (X.size - Tau * (D - 1), D) 864 | strides = (X.itemsize, Tau * X.itemsize) 865 | return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides) 866 | 867 | lyaup_param = [{"Tau":4, "n":3, "T":10, "fs":9},{"Tau":8, "n":7, "T":15, "fs":6}] 868 | 869 | @set_property("fctype", "combiner") 870 | @set_property("custom", True) 871 | def largest_lyauponov_exponent(epochs, param=lyaup_param): 872 | def LLE_1d(x, tau, n, T, fs): 873 | 874 | Em = _embed_seq(x, tau, n) 875 | M = len(Em) 876 | A = np.tile(Em, (len(Em), 1, 1)) 877 | B = np.transpose(A, [1, 0, 2]) 878 | square_dists = (A - B) ** 2 # square_dists[i,j,k] = (Em[i][k]-Em[j][k])^2 879 | D = np.sqrt(square_dists[:, :, :].sum(axis=2)) # D[i,j] = ||Em[i]-Em[j]||_2 880 | 881 | # Exclude elements within T of the diagonal 882 | band = np.tri(D.shape[0], k=T) - np.tri(D.shape[0], k=-T - 1) 883 | band[band == 1] = np.inf 884 | neighbors = (D + band).argmin(axis=0) # nearest neighbors more than T steps away 885 | 886 | # in_bounds[i,j] = (i+j <= M-1 and i+neighbors[j] <= M-1) 887 | inc = np.tile(np.arange(M), (M, 1)) 888 | row_inds = (np.tile(np.arange(M), (M, 1)).T + inc) 889 | col_inds = (np.tile(neighbors, (M, 1)) + inc.T) 890 | in_bounds = np.logical_and(row_inds <= M - 1, col_inds <= M - 1) 891 | # Uncomment for old (miscounted) version 892 | # in_bounds = numpy.logical_and(row_inds < M - 1, col_inds < M - 1) 893 | row_inds[~in_bounds] = 0 894 | col_inds[~in_bounds] = 0 895 | 896 | # neighbor_dists[i,j] = ||Em[i+j]-Em[i+neighbors[j]]||_2 897 | neighbor_dists = np.ma.MaskedArray(D[row_inds, col_inds], ~in_bounds) 898 | J = (~neighbor_dists.mask).sum(axis=1) # number of in-bounds indices by row 899 | # Set invalid (zero) values to 1; log(1) = 0 so sum is unchanged 900 | 901 | neighbor_dists[neighbor_dists == 0] = 1 902 | 903 | # !!! this fixes the divide by zero in log error !!! 
904 | neighbor_dists.data[neighbor_dists.data == 0] = 1 905 | 906 | d_ij = np.sum(np.log(neighbor_dists.data), axis=1) 907 | mean_d = d_ij[J > 0] / J[J > 0] 908 | 909 | x = np.arange(len(mean_d)) 910 | X = np.vstack((x, np.ones(len(mean_d)))).T 911 | [m, c] = np.linalg.lstsq(X, mean_d, rcond=None)[0] 912 | Lexp = fs * m 913 | return Lexp 914 | 915 | return [("Tau_{}__n_{}__T_{}__fs_{}".format(par["Tau"], par["n"], par["T"], par["fs"]), np.apply_along_axis(LLE_1d, 0, epochs, par["Tau"], par["n"], par["T"], par["fs"]).ravel()[0]) for par in param] 916 | 917 | # largest_lyauponov_exponent(df["Close"]) 918 | 919 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 920 | 921 | 922 | whelch_param = [100,200] 923 | 924 | @set_property("fctype", "combiner") 925 | @set_property("custom", True) 926 | def whelch_method(data, param=whelch_param): 927 | 928 | final = [] 929 | for Fs in param: 930 | f, pxx = signal.welch(data, fs=Fs, nperseg=1024) 931 | d = {'psd': pxx, 'freqs': f} 932 | df = pd.DataFrame(data=d) 933 | dfs = df.sort_values(['psd'], ascending=False) 934 | rows = dfs.iloc[:10] 935 | final.append(rows['freqs'].mean()) 936 | 937 | return [("Fs_{}".format(pa),fin) for pa, fin in zip(param,final)] 938 | 939 | # whelch_method(df["Close"]) 940 | 941 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 942 | 943 | #-> Basically same as above 944 | freq_param = [{"fs":50, "sel":15},{"fs":200, "sel":20}] 945 | 946 | @set_property("fctype", "combiner") 947 | @set_property("custom", True) 948 | def find_freq(serie, param=freq_param): 949 | 950 | final = [] 951 | for par in param: 952 | fft0 = np.fft.rfft(serie*np.hanning(len(serie))) 953 | freqs = np.fft.rfftfreq(len(serie), d=1.0/par["fs"]) 954 | fftmod = np.array([np.sqrt(fft0[i].real**2 + fft0[i].imag**2) for i in range(0, len(fft0))]) 955 | d = {'fft': fftmod, 'freq': freqs} 956 | df = pd.DataFrame(d) 957 | hop = df.sort_values(['fft'], ascending=False) 958 | rows = hop.iloc[:par["sel"]] 959 | final.append(rows['freq'].mean()) 960 | 961 | return [("Fs_{}__sel{}".format(pa["fs"],pa["sel"]),fin) for pa, fin in zip(param,final)] 962 | 963 | # find_freq(df["Close"]) 964 | 965 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 966 | 967 | #-> In Package 968 | 969 | def flux_perc(magnitude): 970 | sorted_data = np.sort(magnitude) 971 | lc_length = len(sorted_data) 972 | 973 | F_60_index = int(math.ceil(0.60 * lc_length)) 974 | F_40_index = int(math.ceil(0.40 * lc_length)) 975 | F_5_index = int(math.ceil(0.05 * lc_length)) 976 | F_95_index = int(math.ceil(0.95 * lc_length)) 977 | 978 | F_40_60 = sorted_data[F_60_index] - sorted_data[F_40_index] 979 | F_5_95 = sorted_data[F_95_index] - sorted_data[F_5_index] 980 | F_mid20 = F_40_60 / F_5_95 981 | 982 | return {"FluxPercentileRatioMid20": F_mid20} 983 | 984 | # flux_perc(df["Close"]) 985 | 986 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 987 | 988 | @set_property("fctype", "simple") 989 | @set_property("custom", True) 990 | def range_cum_s(magnitude): 991 | sigma = np.std(magnitude) 992 | N = len(magnitude) 993 | m = np.mean(magnitude) 994 | s = np.cumsum(magnitude - m) * 1.0 / (N * sigma) 995 | R = np.max(s) - np.min(s) 996 | return {"Rcs": R} 997 | 998 | # range_cum_s(df["Close"]) 999 | 1000 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1001 | 1002 | 1003 | struct_param = {"Volume":None, "Open": None} 1004 | 1005 | 
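# structure_func (below) interpolates each series onto an even time grid, computes first-, second- and third-order structure functions over increasing lags, and returns the pairwise log-log slopes m_21, m_31 and m_32.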
@set_property("fctype", "combiner") 1006 | @set_property("custom", True) 1007 | def structure_func(time, param=struct_param): 1008 | 1009 | dict_final = {} 1010 | for key, magnitude in param.items(): 1011 | dict_final[key] = [] 1012 | Nsf, Np = 100, 100 1013 | sf1, sf2, sf3 = np.zeros(Nsf), np.zeros(Nsf), np.zeros(Nsf) 1014 | f = interp1d(time, magnitude) 1015 | 1016 | time_int = np.linspace(np.min(time), np.max(time), Np) 1017 | mag_int = f(time_int) 1018 | 1019 | for tau in np.arange(1, Nsf): 1020 | sf1[tau - 1] = np.mean( 1021 | np.power(np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 1.0)) 1022 | sf2[tau - 1] = np.mean( 1023 | np.abs(np.power( 1024 | np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 2.0))) 1025 | sf3[tau - 1] = np.mean( 1026 | np.abs(np.power( 1027 | np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 3.0))) 1028 | sf1_log = np.log10(np.trim_zeros(sf1)) 1029 | sf2_log = np.log10(np.trim_zeros(sf2)) 1030 | sf3_log = np.log10(np.trim_zeros(sf3)) 1031 | 1032 | if len(sf1_log) and len(sf2_log): 1033 | m_21, b_21 = np.polyfit(sf1_log, sf2_log, 1) 1034 | else: 1035 | 1036 | m_21 = np.nan 1037 | 1038 | if len(sf1_log) and len(sf3_log): 1039 | m_31, b_31 = np.polyfit(sf1_log, sf3_log, 1) 1040 | else: 1041 | 1042 | m_31 = np.nan 1043 | 1044 | if len(sf2_log) and len(sf3_log): 1045 | m_32, b_32 = np.polyfit(sf2_log, sf3_log, 1) 1046 | else: 1047 | 1048 | m_32 = np.nan 1049 | dict_final[key].append(m_21) 1050 | dict_final[key].append(m_31) 1051 | dict_final[key].append(m_32) 1052 | 1053 | return [("StructureFunction_{}__m_{}".format(key, name), li) for key, lis in dict_final.items() for name, li in zip([21,31,32], lis)] 1054 | 1055 | # structure_func(df["Close"], struct_param) 1056 | 1057 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1058 | 1059 | 1060 | #-> In Package 1061 | def kurtosis(x): 1062 | 1063 | if not isinstance(x, pd.Series): 1064 | x = pd.Series(x) 1065 | return pd.Series.kurtosis(x) 1066 | 1067 | # kurtosis(df["Close"]) 1068 | 1069 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1070 | 1071 | 1072 | @set_property("fctype", "simple") 1073 | @set_property("custom", True) 1074 | def stetson_k(x): 1075 | """A robust kurtosis statistic.""" 1076 | n = len(x) 1077 | x0 = stetson_mean(x, 1./20**2) 1078 | delta_x = np.sqrt(n / (n - 1.)) * (x - x0) / 20 1079 | ta = 1. / 0.798 * np.mean(np.abs(delta_x)) / np.sqrt(np.mean(delta_x**2)) 1080 | return ta 1081 | 1082 | # stetson_k(df["Close"]) -------------------------------------------------------------------------------- /deltapy/interact.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from statsmodels.tsa.ar_model import AR 4 | from timeit import default_timer as timer 5 | from sklearn.tree import DecisionTreeRegressor 6 | from math import sin, cos, sqrt, atan2, radians,ceil 7 | import ta 8 | from gplearn.genetic import SymbolicTransformer 9 | from scipy import linalg 10 | import math 11 | 12 | def lowess(df, cols, y, f=2. 
/ 3., iter=3): 13 | for col in cols: 14 | n = len(df[col]) 15 | r = int(ceil(f * n)) 16 | h = [np.sort(np.abs(df[col] - df[col][i]))[r] for i in range(n)] 17 | w = np.clip(np.abs((df[col][:, None] - df[col][None, :]) / h), 0.0, 1.0) 18 | w = (1 - w ** 3) ** 3 19 | yest = np.zeros(n) 20 | delta = np.ones(n) 21 | for iteration in range(iter): 22 | for i in range(n): 23 | weights = delta * w[:, i] 24 | b = np.array([np.sum(weights * y), np.sum(weights * y * df[col])]) 25 | A = np.array([[np.sum(weights), np.sum(weights * df[col])], 26 | [np.sum(weights * df[col]), np.sum(weights * df[col] * df[col])]]) 27 | beta = linalg.solve(A, b) 28 | yest[i] = beta[0] + beta[1] * df[col][i] 29 | 30 | residuals = y - yest 31 | s = np.median(np.abs(residuals)) 32 | delta = np.clip(residuals / (6.0 * s), -1, 1) 33 | delta = (1 - delta ** 2) ** 2 34 | df[col+"_LOWESS"] = yest 35 | 36 | return df 37 | 38 | # df_out = lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head() 39 | 40 | 41 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 42 | 43 | 44 | def autoregression(df, drop=None, settings={"autoreg_lag":4}): 45 | """ 46 | This function calculates the autoregression for each channel. 47 | :param x: the input signal. Its size is (number of channels, samples). 48 | :param settings: a dictionary with one attribute, "autoreg_lag", that is the max lag for autoregression. 49 | :return: the "final_value" is a matrix (number of channels, autoreg_lag) indicating the parameters of 50 | autoregression for each channel. 51 | """ 52 | autoreg_lag = settings["autoreg_lag"] 53 | if drop: 54 | keep = df[drop] 55 | df = df.drop([drop],axis=1).values 56 | 57 | n_channels = df.shape[0] 58 | t = timer() 59 | channels_regg = np.zeros((n_channels, autoreg_lag + 1)) 60 | for i in range(0, n_channels): 61 | fitted_model = AR(df.values[i, :]).fit(autoreg_lag) 62 | # TODO: This is not the same as Matlab's for some reasons! 
63 | # kk = ARMAResults(fitted_model) 64 | # autore_vals, dummy1, dummy2 = arburg(x[i, :], autoreg_lag) # This looks like Matlab's but slow 65 | channels_regg[i, 0: len(fitted_model.params)] = np.real(fitted_model.params) 66 | 67 | for i in range(channels_regg.shape[1]): 68 | df["LAG_"+str(i+1)] = channels_regg[:,i] 69 | 70 | if drop: 71 | df = pd.concat((keep,df),axis=1) 72 | 73 | t = timer() - t 74 | return df 75 | 76 | # df = autoregression(df.fillna(0)) 77 | 78 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 79 | 80 | def muldiv(df, feature_list): 81 | for feat in feature_list: 82 | for feat_two in feature_list: 83 | if feat==feat_two: 84 | continue 85 | else: 86 | df[feat+"/"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min()) #zero division guard 87 | df[feat+"_X_"+feat_two] = df[feat]*(df[feat_two]) 88 | 89 | return df 90 | 91 | # df = muldiv(df, ["Close","Open"]) 92 | 93 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 94 | 95 | def decision_tree_disc(df, cols, depth=4 ): 96 | for col in cols: 97 | df[col +"_m1"] = df[col].shift(1) 98 | df = df.iloc[1:,:] 99 | tree_model = DecisionTreeRegressor(max_depth=depth,random_state=0) 100 | tree_model.fit(df[col +"_m1"].to_frame(), df[col]) 101 | df[col+"_Disc"] = tree_model.predict(df[col +"_m1"].to_frame()) 102 | return df 103 | 104 | # df_out = decision_tree_disc(df, ["Close"]); df_out.head() 105 | 106 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 107 | 108 | def quantile_normalize(df, drop): 109 | 110 | if drop: 111 | keep = df[drop] 112 | df = df.drop(drop,axis=1) 113 | 114 | #compute rank 115 | dic = {} 116 | for col in df: 117 | dic.update({col : sorted(df[col])}) 118 | sorted_df = pd.DataFrame(dic) 119 | rank = sorted_df.mean(axis = 1).tolist() 120 | #sort 121 | for col in df: 122 | t = np.searchsorted(np.sort(df[col]), df[col]) 123 | df[col] = [rank[i] for i in t] 124 | 125 | if drop: 126 | df = pd.concat((keep,df),axis=1) 127 | return df 128 | 129 | # df_out = quantile_normalize(df.fillna(0), drop=["Close"]) 130 | 131 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 132 | 133 | 134 | def haversine_distance(row, lon="Open", lat="Close"): 135 | c_lat,c_long = radians(52.5200), radians(13.4050) 136 | R = 6373.0 137 | long = radians(row['Open']) 138 | lat = radians(row['Close']) 139 | 140 | dlon = long - c_long 141 | dlat = lat - c_lat 142 | a = sin(dlat / 2)**2 + cos(lat) * cos(c_lat) * sin(dlon / 2)**2 143 | c = 2 * atan2(sqrt(a), sqrt(1 - a)) 144 | 145 | return R * c 146 | 147 | # df['distance_central'] = df.apply(haversine_distance,axis=1); df.iloc[:,4:] 148 | 149 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 150 | 151 | 152 | def tech(df): 153 | return ta.add_all_ta_features(df, open="Open", high="High", low="Low", close="Close", volume="Volume") 154 | 155 | # df = tech(df) 156 | 157 | 158 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 159 | 160 | 161 | def genetic_feat(df, num_gen=20, num_comp=10): 162 | function_set = ['add', 'sub', 'mul', 'div', 163 | 'sqrt', 'log', 'abs', 'neg', 'inv','tan'] 164 | 165 | gp = SymbolicTransformer(generations=num_gen, population_size=200, 166 | hall_of_fame=100, n_components=num_comp, 167 | function_set=function_set, 168 | parsimony_coefficient=0.0005, 169 | max_samples=0.9, verbose=1, 170 | random_state=0, n_jobs=6) 171 | 172 | gen_feats = 
gp.fit_transform(df.drop("Close_1", axis=1), df["Close_1"]); df.iloc[:,:8] 173 | gen_feats = pd.DataFrame(gen_feats, columns=["gen_"+str(a) for a in range(gen_feats.shape[1])]) 174 | gen_feats.index = df.index 175 | return pd.concat((df,gen_feats),axis=1) 176 | 177 | # df = genetic_feat(df) -------------------------------------------------------------------------------- /deltapy/mapper.py: -------------------------------------------------------------------------------- 1 | from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA 2 | from sklearn.kernel_approximation import AdditiveChi2Sampler 3 | from sklearn.cross_decomposition import CCA 4 | from sklearn.manifold import LocallyLinearEmbedding 5 | from sklearn.preprocessing import minmax_scale 6 | from sklearn import datasets, cluster 7 | from sklearn.neighbors import NearestNeighbors 8 | import numpy as np 9 | import pandas as pd 10 | import tensorflow as tf 11 | 12 | 13 | def pca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,n_components=5 ,drop_cols=None, non_linear=True): 14 | 15 | if non_linear: 16 | pca = KernelPCA(n_components = n_components, kernel='rbf', fit_inverse_transform=True, random_state = 33, remove_zero_eig= True) 17 | else: 18 | if memory_issues: 19 | if not mem_iss_component: 20 | raise ValueError("If you have memory issues, you have to preselect mem_iss_component") 21 | pca = IncrementalPCA(mem_iss_component) 22 | else: 23 | if variance_or_components>1: 24 | pca = PCA(n_components=variance_or_components) 25 | else: # automted selection based on variance 26 | pca = PCA(n_components=variance_or_components,svd_solver="full") 27 | if drop_cols: 28 | X_pca = pca.fit_transform(df.drop(drop_cols,axis=1)) 29 | return pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)),axis=1) 30 | 31 | else: 32 | X_pca = pca.fit_transform(df) 33 | return pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index) 34 | 35 | 36 | return df 37 | 38 | # df_out = pca_feature(df_out, variance_or_components=0.9, n_components=8,non_linear=False) 39 | 40 | 41 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 42 | 43 | def cross_lag(df, drop=None, lags=1, components=4 ): 44 | 45 | if drop: 46 | keep = df[drop] 47 | df = df.drop([drop],axis=1) 48 | 49 | df_2 = df.shift(lags) 50 | df = df.iloc[lags:,:] 51 | df_2 = df_2.dropna().reset_index(drop=True) 52 | 53 | cca = CCA(n_components=components) 54 | cca.fit(df_2, df) 55 | 56 | X_c, df_2 = cca.transform(df_2, df) 57 | df_2 = pd.DataFrame(df_2, index=df.index) 58 | df_2 = df.add_prefix('crd_') 59 | 60 | if drop: 61 | df = pd.concat([keep,df,df_2],axis=1) 62 | else: 63 | df = pd.concat([df,df_2],axis=1) 64 | return df 65 | 66 | # df_out = cross_lag(df) 67 | 68 | 69 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 70 | 71 | def a_chi(df, drop=None, lags=1, sample_steps=2 ): 72 | 73 | if drop: 74 | keep = df[drop] 75 | df = df.drop([drop],axis=1) 76 | 77 | df_2 = df.shift(lags) 78 | df = df.iloc[lags:,:] 79 | df_2 = df_2.dropna().reset_index(drop=True) 80 | 81 | chi2sampler = AdditiveChi2Sampler(sample_steps=sample_steps) 82 | 83 | df_2 = chi2sampler.fit_transform(df_2, df["Close"]) 84 | 85 | df_2 = pd.DataFrame(df_2, index=df.index) 86 | df_2 = df.add_prefix('achi_') 87 | 88 | if drop: 89 | df = pd.concat([keep,df,df_2],axis=1) 90 | else: 91 | df = pd.concat([df,df_2],axis=1) 92 | return df 93 
| 94 | # df_out = a_chi(df) 95 | 96 | 97 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 98 | 99 | def encoder_dataset(df, drop=None, dimesions=20): 100 | 101 | if drop: 102 | train_scaled = minmax_scale(df.drop(drop,axis=1).values, axis = 0) 103 | else: 104 | train_scaled = minmax_scale(df.values, axis = 0) 105 | 106 | # define the number of encoding dimensions 107 | encoding_dim = dimesions 108 | # define the number of features 109 | ncol = train_scaled.shape[1] 110 | input_dim = tf.keras.Input(shape = (ncol, )) 111 | 112 | # Encoder Layers 113 | encoded1 = tf.keras.layers.Dense(3000, activation = 'relu')(input_dim) 114 | encoded2 = tf.keras.layers.Dense(2750, activation = 'relu')(encoded1) 115 | encoded3 = tf.keras.layers.Dense(2500, activation = 'relu')(encoded2) 116 | encoded4 = tf.keras.layers.Dense(750, activation = 'relu')(encoded3) 117 | encoded5 = tf.keras.layers.Dense(500, activation = 'relu')(encoded4) 118 | encoded6 = tf.keras.layers.Dense(250, activation = 'relu')(encoded5) 119 | encoded7 = tf.keras.layers.Dense(encoding_dim, activation = 'relu')(encoded6) 120 | 121 | encoder = tf.keras.Model(inputs = input_dim, outputs = encoded7) 122 | encoded_input = tf.keras.Input(shape = (encoding_dim, )) 123 | 124 | encoded_train = pd.DataFrame(encoder.predict(train_scaled),index=df.index) 125 | encoded_train = encoded_train.add_prefix('encoded_') 126 | if drop: 127 | encoded_train = pd.concat((df[drop],encoded_train),axis=1) 128 | 129 | return encoded_train 130 | 131 | # df_out = encoder_dataset(df, ["Close_1"], 15) 132 | 133 | 134 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 135 | 136 | def lle_feat(df, drop=None, components=4): 137 | 138 | if drop: 139 | keep = df[drop] 140 | df = df.drop(drop, axis=1) 141 | 142 | embedding = LocallyLinearEmbedding(n_components=components) 143 | em = embedding.fit_transform(df) 144 | df = pd.DataFrame(em,index=df.index) 145 | df = df.add_prefix('lle_') 146 | if drop: 147 | df = pd.concat((keep,df),axis=1) 148 | return df 149 | 150 | # df_out = lle_feat(df,["Close_1"],4) 151 | 152 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 153 | 154 | def feature_agg(df, drop=None, components=4): 155 | 156 | if drop: 157 | keep = df[drop] 158 | df = df.drop(drop, axis=1) 159 | 160 | components = min(df.shape[1]-1,components) 161 | agglo = cluster.FeatureAgglomeration(n_clusters=components) 162 | agglo.fit(df) 163 | df = pd.DataFrame(agglo.transform(df),index=df.index) 164 | df = df.add_prefix('feagg_') 165 | 166 | if drop: 167 | return pd.concat((keep,df),axis=1) 168 | else: 169 | return df 170 | 171 | 172 | # df_out = feature_agg(df.fillna(0),["Close_1"],4 ) 173 | 174 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 175 | 176 | 177 | def neigh_feat(df, drop, neighbors=6): 178 | 179 | if drop: 180 | keep = df[drop] 181 | df = df.drop(drop, axis=1) 182 | 183 | components = min(df.shape[0]-1,neighbors) 184 | neigh = NearestNeighbors(n_neighbors=neighbors) 185 | neigh.fit(df) 186 | neigh = neigh.kneighbors()[0] 187 | df = pd.DataFrame(neigh, index=df.index) 188 | df = df.add_prefix('neigh_') 189 | 190 | if drop: 191 | return pd.concat((keep,df),axis=1) 192 | else: 193 | return df 194 | 195 | return df 196 | 197 | # df_out = neigh_feat(df,["Close_1"],4 ) 198 | -------------------------------------------------------------------------------- /deltapy/transform.py: 
-------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import statsmodels.api as sm 4 | from scipy import signal, integrate 5 | from pykalman import UnscentedKalmanFilter 6 | from tsaug import * 7 | from fbprophet import Prophet 8 | import pylab as pl 9 | from seasonal.periodogram import periodogram 10 | 11 | def infer_seasonality(train,index=0): ##skip the first one, normally 12 | interval, power = periodogram(train, min_period=4, max_period=None) 13 | try: 14 | season = int(pd.DataFrame([interval, power]).T.sort_values(1,ascending=False).iloc[0,index]) 15 | except: 16 | print("Failed") 17 | return min(season,5) 18 | 19 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 20 | 21 | 22 | def robust_scaler(df, drop=None,quantile_range=(25, 75) ): 23 | if drop: 24 | keep = df[drop] 25 | df = df.drop(drop, axis=1) 26 | center = np.median(df, axis=0) 27 | quantiles = np.percentile(df, quantile_range, axis=0) 28 | scale = quantiles[1] - quantiles[0] 29 | df = (df - center) / scale 30 | if drop: 31 | df = pd.concat((keep,df),axis=1) 32 | return df 33 | 34 | # df = robust_scaler(df, drop=["Close_1"]) 35 | 36 | 37 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 38 | 39 | def standard_scaler(df,drop ): 40 | if drop: 41 | keep = df[drop] 42 | df = df.drop(drop, axis=1) 43 | mean = np.mean(df, axis=0) 44 | scale = np.std(df, axis=0) 45 | df = (df - mean) / scale 46 | if drop: 47 | df = pd.concat((keep,df),axis=1) 48 | return df 49 | 50 | 51 | # df = standard_scaler(df, drop=["Close"]) 52 | 53 | 54 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 55 | 56 | 57 | def fast_fracdiff(x, cols, d): 58 | for col in cols: 59 | T = len(x[col]) 60 | np2 = int(2 ** np.ceil(np.log2(2 * T - 1))) 61 | k = np.arange(1, T) 62 | b = (1,) + tuple(np.cumprod((k - d - 1) / k)) 63 | z = (0,) * (np2 - T) 64 | z1 = b + z 65 | z2 = tuple(x[col]) + z 66 | dx = pl.ifft(pl.fft(z1) * pl.fft(z2)) 67 | x[col+"_frac"] = np.real(dx[0:T]) 68 | return x 69 | 70 | # df_out = fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head() 71 | 72 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 73 | 74 | 75 | def outlier_detect(data,col,threshold=1,method="IQR"): 76 | 77 | if method == "IQR": 78 | IQR = data[col].quantile(0.75) - data[col].quantile(0.25) 79 | Lower_fence = data[col].quantile(0.25) - (IQR * threshold) 80 | Upper_fence = data[col].quantile(0.75) + (IQR * threshold) 81 | if method == "STD": 82 | Upper_fence = data[col].mean() + threshold * data[col].std() 83 | Lower_fence = data[col].mean() - threshold * data[col].std() 84 | if method == "OWN": 85 | Upper_fence = data[col].mean() + threshold * data[col].std() 86 | Lower_fence = data[col].mean() - threshold * data[col].std() 87 | if method =="MAD": 88 | median = data[col].median() 89 | median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]]) 90 | modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]]) 91 | outlier_index = np.abs(modified_z_scores) > threshold 92 | print('Num of outlier detected:',outlier_index.value_counts()[1]) 93 | print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index)) 94 | return outlier_index, (median_absolute_deviation, median_absolute_deviation) 95 | 96 | para = (Upper_fence, Lower_fence) 97 | tmp = 
103 |
104 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
105 |
106 |
107 | def windsorization(data,col,para,strategy='both'):
108 |     data_copy = data.copy(deep=True)
109 |     if strategy == 'both':
110 |         data_copy.loc[data_copy[col]>para[0],col] = para[0]
111 |         data_copy.loc[data_copy[col]<para[1],col] = para[1]
112 |     elif strategy == 'top':
113 |         data_copy.loc[data_copy[col]>para[0],col] = para[0]
114 |     elif strategy == 'bottom':
115 |         data_copy.loc[data_copy[col]<para[1],col] = para[1]
116 |     return data_copy
117 |
118 | # _, para = outlier_detect(df, "Close")
119 | # df_out = windsorization(df.copy(),"Close",para,strategy='both'); df_out.head()
120 |
147 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
148 |
149 |
150 | def initial_trend(series, slen):
151 |     sum = 0.0
152 |     for i in range(slen):
153 |         sum += float(series[i+slen] - series[i]) / slen
154 |     return sum / slen
155 |
156 | def initial_seasonal_components(series, slen):
157 |     seasonals = {}
158 |     season_averages = []
159 |     n_seasons = int(len(series)/slen)
160 |     # compute season averages
161 |     for j in range(n_seasons):
162 |         season_averages.append(sum(series[slen*j:slen*j+slen])/float(slen))
163 |     # compute initial values
164 |     for i in range(slen):
165 |         sum_of_vals_over_avg = 0.0
166 |         for j in range(n_seasons):
167 |             sum_of_vals_over_avg += series[slen*j+i]-season_averages[j]
168 |         seasonals[i] = sum_of_vals_over_avg/n_seasons
169 |     return seasonals
170 |
171 | def triple_exponential_smoothing(df,cols, slen, alpha, beta, gamma, n_preds):
172 |     for col in cols:
173 |         result = []
174 |         seasonals = initial_seasonal_components(df[col], slen)
175 |         for i in range(len(df[col])+n_preds):
176 |             if i == 0: # initial values
177 |                 smooth = df[col][0]
178 |                 trend = initial_trend(df[col], slen)
179 |                 result.append(df[col][0])
180 |                 continue
181 |             if i >= len(df[col]): # we are forecasting
182 |                 m = i - len(df[col]) + 1
183 |                 result.append((smooth + m*trend) + seasonals[i%slen])
184 |             else:
185 |                 val = df[col][i]
186 |                 last_smooth, smooth = smooth, alpha*(val-seasonals[i%slen]) + (1-alpha)*(smooth+trend)
187 |                 trend = beta * (smooth-last_smooth) + (1-beta)*trend
188 |                 seasonals[i%slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i%slen]
189 |                 result.append(smooth+trend+seasonals[i%slen])
190 |         df[col+"_TES"] = result
191 |     #print(seasonals)
192 |     return df
193 |
194 | # df_out= triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0); df_out.head()
195 |
196 |
197 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
198 |
199 |
200 | def naive_dec(df, columns, freq=2):
201 |     for col in columns:
202 |         decomposition = sm.tsa.seasonal_decompose(df[col], model='additive', freq = freq, two_sided=False)
203 |         df[col+"_NDDT" ] = decomposition.trend
204 |         df[col+"_NDDS"] = decomposition.seasonal
205 |         df[col+"_NDDR"] = decomposition.resid
206 |     return df
207 |
208 | # df_out = naive_dec(df.copy(), ["Close","Open"]); df_out.head()
209 |
210 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
211 |
212 |
213 | def bkb(df, cols):
214 |     for col in cols:
215 |         df[col+"_BPF"] = sm.tsa.filters.bkfilter(df[[col]].values, 2, 10, len(df)-1)
216 |     return df
217 |
218 | # df_out = bkb(df.copy(), ["Close"]); df_out.head()
219 |
220 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
221 |
222 |
223 | def butter_lowpass(cutoff, fs=20, order=5):
224 |     nyq = 0.5 * fs
225 |     normal_cutoff = cutoff / nyq
226 |     b, a = signal.butter(order, normal_cutoff, btype='low', analog=False)
227 |     return b, a
228 |
229 | def butter_lowpass_filter(df,cols, cutoff, fs=20, order=5):
230 |     b, a = butter_lowpass(cutoff, fs, order=order)
231 |     for col in cols:
232 |         df[col+"_BUTTER"] = signal.lfilter(b, a, df[col])
233 |     return df
234 |
235 | # df_out = butter_lowpass_filter(df.copy(),["Close"],4); df_out.head()
236 |
237 |
238 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
239 |
240 |
241 | def instantaneous_phases(df,cols):
242 |     for col in cols:
243 |         df[col+"_HILLB"] = np.unwrap(np.angle(signal.hilbert(df[col], axis=0)), axis=0)
244 |     return df
245 |
246 | # df_out = instantaneous_phases(df.copy(), ["Close"]); df_out.head()
247 |
248 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
249 |
250 |
251 | def kalman_feat(df, cols):
252 |     for col in cols:
253 |         ukf = UnscentedKalmanFilter(lambda x, w: x + np.sin(w), lambda x, v: x + v, observation_covariance=0.1)
254 |         (filtered_state_means, filtered_state_covariances) = ukf.filter(df[col])
255 |         (smoothed_state_means, smoothed_state_covariances) = ukf.smooth(df[col])
256 |         df[col+"_UKFSMOOTH"] = smoothed_state_means.flatten()
257 |         df[col+"_UKFFILTER"] = filtered_state_means.flatten()
258 |     return df
259 |
260 | # df_out = kalman_feat(df.copy(), ["Close"]); df_out.head()
261 |
262 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
263 |
264 |
265 | def perd_feat(df, cols):
266 |     for col in cols:
267 |         sig = signal.periodogram(df[col],fs=1, return_onesided=False)
268 |         df[col+"_FREQ"] = sig[0]
269 |         df[col+"_POWER"] = sig[1]
270 |     return df
271 |
272 | # df_out = perd_feat(df.copy(),["Close"]); df_out.head()
273 |
274 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
275 |
276 |
277 | def fft_feat(df, cols):
278 |     for col in cols:
279 |         fft_df = np.fft.fft(np.asarray(df[col].tolist()))
280 |         fft_df = pd.DataFrame({'fft':fft_df})
281 |         df[col+'_FFTABS'] = fft_df['fft'].apply(lambda x: np.abs(x)).values
282 |         df[col+'_FFTANGLE'] = fft_df['fft'].apply(lambda x: np.angle(x)).values
283 |     return df
284 |
285 | # df_out = fft_feat(df.copy(), ["Close"]); df_out.head()
286 |
287 |
288 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
289 |
290 |
291 | def harmonicradar_cw(df, cols, fs,fc):
292 |     for col in cols:
293 |         ttxt = f'CW: {fc} Hz'
294 |         #%% input
295 |         t = df[col]
296 |         tx = np.sin(2*np.pi*fc*t)
297 |         _,Pxx = signal.welch(tx,fs)
298 |         #%% diode
299 |         d = (signal.square(2*np.pi*fc*t))
300 |         d[d<0] = 0.
301 |         #%% output of diode
302 |         rx = tx * d
303 |         df[col+"_HARRAD"] = rx.values
304 |     return df
305 |
306 | # df_out = harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head()
307 |
308 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
309 |
310 |
311 | def saw(df, cols):
312 |     for col in cols:
313 |         df[col+" SAW"] = signal.sawtooth(df[col])
314 |     return df
315 |
316 | # df_out = saw(df.copy(),["Close","Open"]); df_out.head()
317 |
318 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
319 |
320 |
321 | def modify(df, cols):
322 |     for col in cols:
323 |         series = df[col].values
324 |         df[col+"_magnify"], _ = magnify(series, series)
325 |         df[col+"_affine"], _ = affine(series, series)
326 |         df[col+"_crop"], _ = crop(series, series)
327 |         df[col+"_cross_sum"], _ = cross_sum(series, series)
328 |         df[col+"_resample"], _ = resample(series, series)
329 |         df[col+"_trend"], _ = trend(series, series)
330 |
331 |         df[col+"_random_affine"], _ = random_time_warp(series, series)
332 |         df[col+"_random_crop"], _ = random_crop(series, series)
333 |         df[col+"_random_cross_sum"], _ = random_cross_sum(series, series)
334 |         df[col+"_random_sidetrack"], _ = random_sidetrack(series, series)
335 |         df[col+"_random_time_warp"], _ = random_time_warp(series, series)
336 |         df[col+"_random_magnify"], _ = random_magnify(series, series)
337 |         df[col+"_random_jitter"], _ = random_jitter(series, series)
338 |         df[col+"_random_trend"], _ = random_trend(series, series)
339 |     return df
340 |
341 | # df_out = modify(df.copy(),["Close"]); df_out.head()
342 |
343 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
344 |
345 | def multiple_rolling(df, windows = [1,2], functions=["mean","std"], columns=None):
346 |     windows = [1+a for a in windows]
347 |     if not columns:
348 |         columns = df.columns.to_list()
349 |     rolling_dfs = (df[columns].rolling(i)                  # 1. Create window
350 |                    .agg(functions)                         # 2. Aggregate
351 |                    .rename({col: '{0}_{1:d}'.format(col, i)
352 |                             for col in columns}, axis=1)   # 3. Rename columns
353 |                    for i in windows)                       # For each window
354 |     df_out = pd.concat((df, *rolling_dfs), axis=1)
355 |     da = df_out.iloc[:,len(df.columns):]
356 |     da = [col[0] + "_" + col[1] for col in da.columns.to_list()]
357 |     df_out.columns = df.columns.to_list() + da
358 |
359 |     return df_out                                          # 4. Concatenate dataframes
360 |
361 | # df = multiple_rolling(df, columns=["Close"]);
362 |
363 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
364 |
365 |
366 | def multiple_lags(df, start=1, end=3, columns=None):
367 |     if not columns:
368 |         columns = df.columns.to_list()
369 |     lags = range(start, end+1)  # lag orders from start to end inclusive
370 |
371 |     df = df.assign(**{
372 |         '{}_t_{}'.format(col, t): df[col].shift(t)
373 |         for t in lags
374 |         for col in columns
375 |     })
376 |     return df
377 |
378 | # df = multiple_lags(df, start=1, end=3, columns=["Close"])
379 |
380 |
381 | #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
382 |
383 |
384 | def prophet_feat(df, cols, date, freq, train_size=150):
385 |     def prophet_dataframe(df):
386 |         df.columns = ['ds','y']
387 |         return df
388 |
389 |     def original_dataframe(df, freq, name):
390 |         prophet_pred = pd.DataFrame({"Date" : df['ds'], name : df["yhat"]})
391 |         prophet_pred = prophet_pred.set_index("Date")
392 |         #prophet_pred.index.freq = pd.tseries.frequencies.to_offset(freq)
393 |         return prophet_pred[name].values
394 |
395 |     for col in cols:
396 |         model = Prophet(daily_seasonality=True)
397 |         model.fit(prophet_dataframe(df[[date, col]].head(train_size)))
398 |         forecast_len = len(df) - train_size
399 |         future = model.make_future_dataframe(periods=forecast_len,freq=freq)
400 |         future_pred = model.predict(future)
401 |         df[col+"_PROPHET"] = list(original_dataframe(future_pred,freq,col))
402 |     return df
403 |
404 | # df_out = prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head()
405 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, Command
2 | import os
3 | import sys
4 |
5 |
6 | setup(name='deltapy',
7 |       version='0.1.1',
8 |       description='Data Transformation and Augmentation',
9 |       url='https://github.com/firmai/deltapy',
10 |       author='snowde',
11 |       author_email='d.snow@firmai.org',
12 |       license='MIT',
13 |       packages=['deltapy'],
14 |       install_requires=[
15 |           "fbprophet",
16 |           "pandas",
17 |           "pykalman",
18 |           "tsaug",
19 |           "gplearn",
20 |           "ta",
21 |           "tensorflow",
22 |           "scikit-learn",
23 |           "scipy",
24 |           "statsmodels",
25 |           "numpy",
26 |           "seasonal"],
27 |       zip_safe=False)
28 |
--------------------------------------------------------------------------------
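A minimal usage sketch of the transform step (the synthetic DataFrame below is an assumption for illustration; only functions defined in deltapy/transform.py above are used):

import numpy as np
import pandas as pd
from deltapy import transform

# toy series standing in for a real price table
df = pd.DataFrame({"Close": np.random.lognormal(size=300)},
                  index=pd.date_range("2020-01-01", periods=300, freq="D"))

out = transform.robust_scaler(df.copy())                                # scale by median and IQR
out = transform.multiple_lags(out, start=1, end=3, columns=["Close"])   # lagged copies of Close
out = transform.multiple_rolling(out, columns=["Close"])                # rolling mean/std windows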