├── README.md
└── standardize_time_series_formats.md


/README.md:
--------------------------------------------------------------------------------
# Using python to work with time series data

The python ecosystem contains different packages that can be used to process time series.

The following list is by no means exhaustive; feel free to edit it (this will propose a file change via PR) if anything is missing.

# Machine learning, statistics, analytics

## Libraries

| Project Name | Description |
| ------- | ------ |
| [Arrow](https://github.com/crsmithdev/arrow) | A sensible, human-friendly approach to creating, manipulating, formatting and converting dates, times, and timestamps |
| [bta-lib](https://github.com/mementum/bta-lib) | Technical Analysis library in pandas for backtesting algotrading and quantitative analysis |
| [cesium](https://github.com/cesium-ml/cesium) | Time series platform with feature extraction aiming for non uniformly sampled signals |
| [Darts](https://github.com/unit8co/darts) | A library making it very easy to produce forecasts using a wide range of models, from ARIMA to deep learning. Also does ensembling, model selection and more. |
| [ETNA](https://github.com/tinkoff-ai/etna-ts) | A python library for time series forecasting and analysis with temporal data structure always in mind. Includes a variety of predictive models with unified interface along with EDA and validation methods |
| [GENDIS](https://github.com/IBCNServices/GENDIS) | Shapelet discovery by genetic algorithms |
| [glm-sklearn](https://github.com/jcrudy/glm-sklearn) | scikit-learn compatible wrapper around the GLM module in [statsmodels](https://github.com/statsmodels/statsmodels) |
| [Featuretools](https://github.com/Featuretools/featuretools) | Time series feature extraction, with possible conditionality on other variables with a pandas compatible relational-database-like data container |
| [fecon235](https://github.com/rsvp/fecon235) | Computational tools for financial economics |
| [ffn](https://github.com/pmorissette/ffn) | financial function library |
| [flint](https://github.com/twosigma/flint) | A Time Series Library for Apache Spark |
| [Flow Forecast](https://github.com/AIStream-Peelout/flow-forecast) | A deep learning framework for time series forecasting, classification, and anomaly detection, built in PyTorch |
| [hctsa](https://github.com/benfulcher/hctsa) | Matlab based feature extraction which can be controlled from python |
| [HMMLearn](https://github.com/hmmlearn/hmmlearn) | Hidden Markov Models with scikit-learn compatible API |
| [khiva-python](https://github.com/shapelets/khiva-python) | A Time Series library with accelerated analytics on GPUs; it provides feature extraction and motif discovery among other functionalities. |
| [matrixprofile-ts](https://github.com/target/matrixprofile-ts) | Python implementation of the Matrix Profile algorithm which offers anomaly detection and pattern (or “motif”) discovery at the same time. |
| [Nitime](https://github.com/nipy/nitime) | Timeseries analysis for neuroscience data |
| [Orbit](https://github.com/uber/orbit) | Orbit is a Python package for Bayesian time series forecasting and inference |
| [Pandas TA](https://github.com/twopirllc/pandas-ta) | An easy to use Python 3 Pandas Extension with 130+ Technical Analysis Indicators |
| [Pastas](https://github.com/pastas/pastas) | Timeseries analysis for hydrological data |
| [prophet](https://github.com/facebook/prophet) | Time series forecasting for time series data that has multiple seasonality with linear or non-linear growth |
| [pyDSE](https://github.com/blue-yonder/pydse) | ARMA models for Dynamic System Estimation |
| [pyFTS](https://pyfts.github.io/pyFTS) | Fuzzy set rule-based models for time series forecasting, including multi-step, point, interval and probabilistic forecasting |
| [PyFlux](https://github.com/RJT1990/pyflux) | Classical time series forecasting models |
| [pysf](https://github.com/alan-turing-institute/pysf) | A scikit-learn compatible machine learning library for supervised/panel forecasting |
| [pyramid](https://github.com/tgsmith61591/pyramid) | port of R's auto.arima method to Python |
| [pytorch-forecasting](https://github.com/jdb78/pytorch-forecasting) | A time series forecasting library using PyTorch with various state-of-the-art network architectures. |
| [pyts](https://github.com/johannfaouzi/pyts) | Contains time series preprocessing, transformation as well as classification techniques |
| [ruptures](https://github.com/deepcharles/ruptures) | Provides methods to find change points in time series such as shifts in the mean or scale of the signal as well as more complex changes in the probability distribution or frequency. |
| [seglearn](https://github.com/dmbee/seglearn) | Extends the scikit-learn pipeline concept to sequence data |
| [sktime](https://github.com/alan-turing-institute/sktime) | A scikit-learn compatible library for learning with time series/panel data including time series classification/regression and (supervised/panel) forecasting |
| [statsmodels](https://github.com/statsmodels/statsmodels) | Contains a submodule for classical time series models and hypothesis tests |
| [stumpy](https://github.com/TDAmeritrade/stumpy) | Calculates matrix profile for time series subsequence all-pairs-similarity-search |
| [TensorFlow-Time-Series-Examples](https://github.com/hzy46/TensorFlow-Time-Series-Examples) | Time Series Prediction with tf.contrib.timeseries |
| [tensorflow_probability.sts](https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/sts) | Bayesian Structural Time Series model in Tensorflow Probability |
| [timemachines](https://github.com/microprediction/timemachines) | Functional interface to prophet and other packages, with Elo ratings |
| [Traces](https://github.com/datascopeanalytics/traces) | A library for unevenly-spaced time series analysis |
| [ta-lib](https://github.com/mrjbq7/ta-lib) | Calculate technical indicators for financial time series (python wrapper around TA-Lib) |
| [tsai](https://github.com/timeseriesAI/tsai) | State-of-the-art Deep Learning with Time Series and Sequences in Pytorch / fastai |
| [ta](https://github.com/bukosabino/ta) | Calculate technical indicators for financial time series |
| [TIMEX](https://github.com/AlexMV12/TIMEX) | Library for creating time-series-forecasting-as-a-service platforms/websites, with a fully automated data ingestion, pre-processing, prediction and results visualization pipeline. |
| [tsflex](https://github.com/predict-idlab/tsflex) | A toolkit for flexible time series processing and feature extraction. |
| [tsfresh](https://github.com/blue-yonder/tsfresh) | Extracts and filters features from time series, allowing supervised classifiers and regressors to be applied to time series data |
| [tslearn](https://github.com/rtavenar/tslearn) | Direct time series classifiers and regressors |
| [tspreprocess](https://github.com/MaxBenChrist/tspreprocess) | Preprocess time series (resampling, denoising etc.), still WIP |
| [tsmoothie](https://github.com/cerlymarco/tsmoothie) | A python library for time-series smoothing and outlier detection in a vectorized way |


## Examples or singular models

| Project Name | Description |
| ------- | ------ |
| [ES-RNN forecasting algorithm](https://github.com/damitkwr/ESRNN-GPU) | Python implementation of the winning forecasting method of the M4 competition combining exponential smoothing with a recurrent neural network using PyTorch |
| [Deep learning methods for time series classification](https://github.com/hfawaz/dl-4-tsc) | A collection of common deep learning architectures for time series classification |
| [LSTM-Neural-Network-for-Time-Series-Prediction](https://github.com/jaungiers/LSTM-Neural-Network-for-Time-Series-Prediction) | LSTM based forecasting model |
| [LSTM_tsc](https://github.com/RobRomijnders/LSTM_tsc) | An LSTM based time-series classification neural network |
| [shapelets-python](https://github.com/mohaseeb/shaplets-python) | Shapelet Classifier based on a multi layer neural network |
| [M4 competition](https://github.com/M4Competition) | Collection of statistical and machine learning forecasting methods |
| [UCR_Time_Series_Classification_Deep_Learning_Baseline](https://github.com/cauchyturing/UCR_Time_Series_Classification_Deep_Learning_Baseline) | Fully Convolutional Neural Networks for state-of-the-art time series classification |
| [WTTE-RNN](https://github.com/ragulpr/wtte-rnn/) | Time to Event forecast by RNN based Weibull density estimation |


# Time series data container

| Project name | Description |
| ------- | ------ |
| [Featuretools](https://github.com/Featuretools/featuretools) | Time series feature extraction, with possible conditionality on other variables with a pandas compatible relational-database-like data container |
| [pysf](https://github.com/alan-turing-institute/pysf) | A scikit-learn compatible library for supervised forecasting |
| [xarray](https://github.com/pydata/xarray) | Labelled, multi-dimensional data structures as long as they have a shared time index |
| [xpandas](https://github.com/alan-turing-institute/xpandas) | Labelled 1D and 2D data container for storing type-heterogeneous tabular data of any type, including time series, and encapsulates feature extraction and transformation modelling in an sklearn-compatible transformer interface, work in progress. |


# Data sets
| Project Name | Description |
| ------- | ------ |
| [awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets#time-series) | This huge list of public datasets also has a section on time series datasets |
| [ecmwf_models](https://github.com/TUW-GEO/ecmwf_models) | Readers and converters for climate reanalysis data |
| [M4 competition](https://github.com/M4Competition) | Forecasting competition on 100,000 time series |
| [pandas-datareader](https://github.com/pydata/pandas-datareader) | Pulls financial data from different sources (e.g. yahoo, google, Quandl) |
| [Timeseriesclassification.com](https://timeseriesclassification.com) | An extensive repository for time series classification datasets |


# Databases, frameworks
| Project Name | Description |
| ------- | ------ |
| [arctic](https://github.com/manahl/arctic) | High performance datastore for time series and tick data |
| [automl_service](https://github.com/crawles/automl_service) | Fully automated time series classification pipeline, deployed as a web service |
| [cesium](https://github.com/cesium-ml/cesium) | Time series platform with feature extraction aiming for non uniformly sampled signals |
| [thunder](https://github.com/thunder-project/thunder) | scalable analysis of image and time series data in python based on spark |
| [whisper](https://github.com/graphite-project/whisper) | File-based time-series database format |


# Free courses
| Project Name | Description |
| ------- | ------ |
| [Time Series Forecasting](https://www.udacity.com/course/time-series-forecasting--ud980) | Udacity free course to learn about how to build and apply time series analysis/forecasting in business contexts |

----

# Discussion

We would like to trigger a homogenization of the formats which are used in the python time series community; please see the [concept page](https://github.com/MaxBenChrist/awesome_time_series_in_python/blob/master/standardize_time_series_formats.md).


--------------------------------------------------------------------------------
/standardize_time_series_formats.md:
--------------------------------------------------------------------------------
This page is not yet finished.

TODO: remove question about multivariate time series

# Motivation: There are too many time series formats

Imagine the following situation:
You inspect a delivery of new time series data and want to develop a classification algorithm for it.
Because the dataset is new to you, you are not sure whether to use a shape based approach or a feature based one.
In any case, you want to apply different packages to that data and compare the results.

Now, there is no widely agreed standard for time series data.
For most of the tools,

* you will have to read the instructions,
* understand the format of the respective package,
* and finally write a script to convert your data.

This is annoying and slows you down.

For the construction of supervised machine learning models, using different packages is way more convenient.
Almost all packages expect a feature matrix as input.
In a feature matrix, a column denotes a feature, a row is a sample.
Object wise, either `numpy.ndarray` objects or `pandas.DataFrame` objects, which build on them, are used.

You can use your feature matrix and first apply models from sklearn to it.
Then you can take the same object and try lightgbm or xgboost models on it:

```python
X = [[0, 0, 1, 1],
     [0, 1, 0, 0],
     [1, 0, 0, 1]]

y = [1,
     1,
     1]

# first train a model from sklearn
from sklearn.ensemble import RandomForestClassifier
clf1 = RandomForestClassifier()
clf1.fit(X, y)

# now train a model from another package on the same data, no transformation is necessary
from lightgbm import LGBMClassifier
clf2 = LGBMClassifier()
clf2.fit(X, y)
```

All without ever having to convert your data; everything works out of the box.

We want the same for time series data.
The purpose of this document is to find a common standard.
The analysis of time series data and the interplay between packages should become more user friendly.

# Classification of different time series formats

A time series consists of time-annotated data; each recording has two characteristics, the `time` and `value` dimensions.
Therefore, a single recording is a two dimensional vector
```
(time, value)
```
An example would be
```
(2009-06-15T13:45:30, 83°C)
```
which denotes a temperature of `83°C` measured at time `2009-06-15T13:45:30`.

A whole time series, which is a collection of such two dimensional recordings, can have meta information: characteristics that do not change over time.
The most important meta information is the identifier of the respective entity and, in multivariate scenarios, the type of time series.
Multivariate means that a single entity has multiple time series assigned to it.

In that case, a recording is a 4 dimensional vector
```
(id, time, value, kind)
```
where `value` is the value of the time series of type `kind` recorded at time `time` for the entity `id`.

For example
```
(VW Beetle - SN: 7 4545 4543, 2009-06-15T13:45:30, 83°C, Engine Temperature G1)
```
denotes a temperature of `83°C` measured at sensor `Engine Temperature G1` for the VW Beetle with serial number `7 4545 4543` at time `2009-06-15T13:45:30`.

There is a myriad of different formats which could be used to save such information.
We will discuss the following formats.

1. Relational
    1. Stacked matrix
    2. Flat matrix
    3. 3-dimensional matrix
2. Nested
    1. Dictionary based
3. Binary
    1. ?

(If you have some more ideas, please feel free to submit a PR.)
Later we will analyze the saving capabilities of the different formats.
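
As a minimal sketch, the four dimensional recording described in this section could be modelled as a small Python type. The name `Recording` and the choice of `str`, `datetime` and `float` for the fields are assumptions made for illustration only; this page does not prescribe any types.

```python
from datetime import datetime
from typing import NamedTuple


class Recording(NamedTuple):
    """One observation (id, time, value, kind); a hypothetical helper type."""
    id: str
    time: datetime
    value: float
    kind: str


# The engine temperature example, with the unit dropped from the value.
rec = Recording(
    id="VW Beetle - SN: 7 4545 4543",
    time=datetime.fromisoformat("2009-06-15T13:45:30"),
    value=83.0,
    kind="Engine Temperature G1",
)

# In the univariate case only the (time, value) pair is needed.
time_value = (rec.time, rec.value)
print(rec.kind, time_value)
```

A plain tuple would work just as well; the named fields only make the role of each dimension explicit.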


### 1.i Stacked Matrix

This is the most flexible format. It supports non uniformly sampled time series of different lengths. In this format, each row contains the four dimensional tuple.

Example: The two time series
```
values [11, 2] for times [0, 1] of kind a for id 1
values [13, 4] for times [0, 3] of kind b for id 1
```
will be saved as
```
time  id  value  kind
   0   1     11     a
   1   1      2     a
   0   1     13     b
   3   1      4     b
```

### 1.ii Flat Matrix

This format is suitable for the multivariate, uniformly sampled case, where we want to save different kinds of time series that all have the same length and are recorded at the same times.

In this format, we dedicate a full column to each type of time series.

Example: The two time series
```
values [11, 2] for times [0, 1] of kind a for id 1
values [13, 4] for times [0, 1] of kind b for id 1
```
will be saved as
```
time  id   a   b
   0   1  11  13
   1   1   2   4
```

### 1.iii 3-dimensional Matrix

For this format, the time series need to be uniformly sampled and of the same length.
Then we use the first two dimensions of the matrix to denote kind and id and the third one for the time scale.

Example: The two time series
```
values [11, 2] for times [0, 1] of kind a for id 1
values [13, 4] for times [0, 1] of kind b for id 1
```
will be recorded as
```
    time     a         b
1   [0, 1]   [11, 2]   [13, 4]
```


### 2.i Dictionary based

We can have a dictionary mapping from the id to another dictionary that maps kind to the time series.
Essentially you are using a single array for each time series.

Example: The two time series
```
values [11, 2] for times [0, 1] of kind a for id 1
values [13, 4] for times [0, 3] of kind b for id 1
```
will be recorded as
```
{1: {"a": {"time": [0, 1], "value": [11, 2]},
     "b": {"time": [0, 3], "value": [13, 4]}}}
```

## How to pick the right format

Before one can pick the right format, one needs to check a few points

1. Can the time series have different lengths?
2. Are the time series non uniformly sampled, i.e. are they allowed to have missing values?
3. Do we inspect multivariate time series?

Depending on the answers to these questions, different formats are suitable.
The following table lists the characteristics of the different formats

| Format | 1. Different length | 2. Non uniformly sampled | 3. Multivariate time series | Does not contain redundant information | Tabular format |
| -------| :---: | :---: | :---: | :---: | :---: |
| 1.i Stacked Matrix | _X_ | _X_ | _X_ | | _X_ |
| 1.ii Flat Matrix | | | _X_ | _X_ | _X_ |
| 1.iii 3-dimensional Matrix | | | _X_ | _X_ | |
| 2.i Dictionary based | _X_ | _X_ | _X_ | _X_ | |
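
To make the relation between the formats concrete, here is a small sketch in plain Python that converts the stacked format into the dictionary based one and then checks a necessary condition for the flat matrix format. The function names `stacked_to_nested` and `fits_flat_matrix` are hypothetical helpers introduced for this illustration, not part of any package listed here, and the check only compares time grids per id; it does not test whether the sampling is uniform.

```python
from collections import defaultdict

# Stacked format: one (time, id, value, kind) row per recording,
# using the example data from the stacked matrix section.
stacked = [
    (0, 1, 11, "a"),
    (1, 1, 2, "a"),
    (0, 1, 13, "b"),
    (3, 1, 4, "b"),
]


def stacked_to_nested(rows):
    """Convert stacked rows to {id: {kind: {"time": [...], "value": [...]}}}."""
    nested = defaultdict(dict)
    for time, id_, value, kind in rows:
        series = nested[id_].setdefault(kind, {"time": [], "value": []})
        series["time"].append(time)
        series["value"].append(value)
    return dict(nested)


def fits_flat_matrix(nested):
    """Return True if, for every id, all kinds were recorded at the same times."""
    return all(
        len({tuple(series["time"]) for series in kinds.values()}) == 1
        for kinds in nested.values()
    )


nested = stacked_to_nested(stacked)
print(nested)
print(fits_flat_matrix(nested))
```

In this example kind `b` was recorded at times `[0, 3]` while kind `a` was recorded at `[0, 1]`, so the data cannot be stored as a flat matrix without introducing missing values.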