├── .gitignore ├── README.md ├── datasets └── README.md ├── imgs ├── ltsm_model.png ├── prompt_csv_tsne.png └── stat_prompt.png ├── ltsm ├── data_pipeline │ └── reader │ │ ├── __init__.py │ │ ├── base_reader.py │ │ └── monash_reader.py ├── data_provider │ ├── data_factory.py │ ├── data_loader.py │ ├── data_processing │ │ ├── __init__.py │ │ ├── base_processor.py │ │ ├── standard_scaler.py │ │ └── tokenizer_processor.py │ ├── dataset.py │ ├── hf_data_loader.py │ └── splitter.py ├── models │ ├── __init__.py │ ├── config.py │ ├── embed.py │ ├── ltsm_model.py │ ├── ltsm_tokenizer.py │ ├── ltsm_wordprompt.py │ └── utils.py └── utils │ ├── .DS_Store │ ├── __init__.py │ ├── dist.py │ ├── metrics.py │ ├── timefeatures.py │ └── tools.py ├── main_ltsm.py ├── main_tokenizer.py ├── prompt_bank ├── prompt_data_normalize_split │ └── README.md ├── stat-prompt │ ├── README.md │ ├── prompt_generate_split.py │ ├── prompt_normalization_split.py │ ├── prompt_tsne.py │ └── tsfel │ │ ├── __init__.py │ │ ├── feature_extraction │ │ ├── __init__.py │ │ ├── calc_features.py │ │ ├── features.json │ │ ├── features.py │ │ ├── features_settings.py │ │ └── features_utils.py │ │ └── utils │ │ ├── __init__.py │ │ ├── add_personal_features.py │ │ ├── calculate_complexity.py │ │ ├── progress_bar.py │ │ └── signal_processing.py └── text_prompt_data_csv │ └── csv_prompt.json ├── requirements.txt ├── scripts ├── test_csv_lora.sh ├── test_ltsm.sh ├── train_ltsm_csv.sh ├── train_ltsm_textprompt_csv.sh └── train_ltsm_tokenizer_csv.sh ├── setup.py └── tutorial └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | checkpoints/ 2 | dataset/ 3 | TSForecasting/ 4 | *.swp 5 | output/ 6 | .idea/ 7 | 8 | *.pub 9 | 10 | # Byte-compiled / optimized / DLL files 11 | __pycache__/ 12 | *.py[cod] 13 | *$py.class 14 | 15 | # C extensions 16 | *.so 17 | 18 | # Distribution / packaging 19 | .Python 20 | build/ 21 | develop-eggs/ 22 | dist/ 23 | downloads/ 24 | eggs/ 25 | .eggs/ 26 | lib/ 27 | lib64/ 28 | parts/ 29 | sdist/ 30 | var/ 31 | wheels/ 32 | pip-wheel-metadata/ 33 | share/python-wheels/ 34 | *.egg-info/ 35 | .installed.cfg 36 | *.egg 37 | MANIFEST 38 | 39 | # PyInstaller 40 | # Usually these files are written by a python script from a template 41 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 42 | *.manifest 43 | *.spec 44 | 45 | # Installer logs 46 | pip-log.txt 47 | pip-delete-this-directory.txt 48 | 49 | # Unit test / coverage reports 50 | htmlcov/ 51 | .tox/ 52 | .nox/ 53 | .coverage 54 | .coverage.* 55 | .cache 56 | nosetests.xml 57 | coverage.xml 58 | *.cover 59 | *.py,cover 60 | .hypothesis/ 61 | .pytest_cache/ 62 | 63 | # Translations 64 | *.mo 65 | *.pot 66 | 67 | # Django stuff: 68 | *.log 69 | local_settings.py 70 | db.sqlite3 71 | db.sqlite3-journal 72 | 73 | # Flask stuff: 74 | instance/ 75 | .webassets-cache 76 | 77 | # Scrapy stuff: 78 | .scrapy 79 | 80 | # Sphinx documentation 81 | docs/_build/ 82 | 83 | # PyBuilder 84 | target/ 85 | 86 | # Jupyter Notebook 87 | .ipynb_checkpoints 88 | 89 | # IPython 90 | profile_default/ 91 | ipython_config.py 92 | 93 | # pyenv 94 | .python-version 95 | 96 | # pipenv 97 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 98 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 99 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 100 | # install all needed dependencies. 
101 | #Pipfile.lock 102 | 103 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 104 | __pypackages__/ 105 | 106 | # Celery stuff 107 | celerybeat-schedule 108 | celerybeat.pid 109 | 110 | # SageMath parsed files 111 | *.sage.py 112 | 113 | # Environments 114 | .env 115 | .venv 116 | env/ 117 | venv/ 118 | ENV/ 119 | env.bak/ 120 | venv.bak/ 121 | 122 | # Spyder project settings 123 | .spyderproject 124 | .spyproject 125 | 126 | # Rope project settings 127 | .ropeproject 128 | 129 | # mkdocs documentation 130 | /site 131 | 132 | # mypy 133 | .mypy_cache/ 134 | .dmypy.json 135 | dmypy.json 136 | 137 | # Pyre type checker 138 | .pyre/ 139 | *.csv 140 | scratch/ 141 | .DS_Store 142 | .idea/ 143 | 144 | /datasets 145 | /prompt_bank -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Understanding Different Design Choices in Training Large Time Series Models 2 | 3 | 4 | This work investigates the transition from traditional Time Series Forecasting (TSF) to Large Time Series Models (LTSMs), leveraging universal transformer-based models. Training LTSMs on diverse time series data introduces challenges due to varying frequencies, dimensions, and patterns. We explore various design choices for LTSMs, including pre-processing, model configurations, and dataset setups. We introduce **Time Series Prompt**, a statistical prompting strategy, and $\texttt{LTSM-bundle}$, which encapsulates the most effective design practices identified. $\texttt{LTSM-bundle}$ is developed by [Data Lab at Rice University](https://cs.rice.edu/~xh37/). 5 | 6 | ## Resources 7 | :star2: Please star our repo to follow the latest updates on LTSM-bundle! 8 | 9 | :mega: We have released our [paper](https://arxiv.org/abs/2406.14045) and source code of LTSM-bundle-v1.0! 10 | 11 | :books: Follow our latest [English Tutorial](https://github.com/daochenzha/ltsm/tree/main/tutorial) or [Chinese Tutorial (中文教程)](https://zhuanlan.zhihu.com/p/708804309) to customize your LTSM! 12 | 13 | :earth_americas: For more information, please visit: 14 | * Paper: [https://arxiv.org/abs/2406.14045](https://arxiv.org/abs/2406.14045) 15 | * Blog: [Time Series Are Not That Different for LLMs](https://towardsdatascience.com/time-series-are-not-that-different-for-llms-56435dc7d2b1) 16 | * Tutorial: [Build your own LTSM-bundle](https://github.com/daochenzha/ltsm/tree/main/tutorial) 17 | * Chinese Tutorial: [https://zhuanlan.zhihu.com/p/708804309](https://zhuanlan.zhihu.com/p/708804309) 18 | * Do you want to learn more about data pipeline search? Please check out our [data-centric AI survey](https://arxiv.org/abs/2303.10158) and [data-centric AI resources](https://github.com/daochenzha/data-centric-AI)! 19 | 20 | ## Why LTSM-bundle? 21 | The LTSM-bundle package leverages the HuggingFace transformers toolkit, offering the flexibility to switch between different advanced language models as the backbone. Users can easily tailor general LTSMs to their specific time series forecasting needs by selecting the most suitable language model from a wide array of options. This flexibility enhances the adaptability of the package across different industries and data types, ensuring optimal performance in diverse scenarios. 22 | 23 | ## Installation 24 | ``` 25 | conda create -n ltsm python=3.8.0 26 | conda activate ltsm 27 | git clone git@github.com:daochenzha/ltsm.git 28 | cd ltsm 29 | pip3 install -e .
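# the editable install above (-e) lets local changes to the ltsm package take effect without reinstalling; next, install the Python dependencies: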
30 | pip3 install -r requirements.txt 31 | ``` 32 | 33 | ## Quick Exploration on LTSM-bundle 34 | 35 | Training on **[Time Series Prompt]** and **[Linear Tokenization]** 36 | ```bash 37 | bash scripts/train_ltsm_csv.sh 38 | ``` 39 | 40 | Training on **[Text Prompt]** and **[Linear Tokenization]** 41 | ```bash 42 | bash scripts/train_ltsm_textprompt_csv.sh 43 | ``` 44 | 45 | Training on **[Time Series Prompt]** and **[Time Series Tokenization]** 46 | ```bash 47 | bash scripts/train_ltsm_tokenizer_csv.sh 48 | ``` 49 | 50 | ## Datasets and Time Series Prompts 51 | Download the datasets: 52 | ```bash 53 | cd datasets 54 | download: https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P 55 | ``` 56 | 57 | Download the time series prompts: 58 | ```bash 59 | cd prompt_bank/prompt_data_csv 60 | download: https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P 61 | ``` 62 | 63 | ## Cite This Work 64 | If you find this work useful, please cite it as follows: 65 | ``` 66 | @article{ltsm-bundle, 67 | title={Understanding Different Design Choices in Training Large Time Series Models}, 68 | author={Chuang*, Yu-Neng and Li*, Songchen and Yuan*, Jiayi and Wang*, Guanchu and Lai*, Kwei-Herng and Yu, Leisheng and Ding, Sirui and Chang, Chia-Yuan and Tan, Qiaoyu and Zha, Daochen and Hu, Xia}, 69 | journal={arXiv preprint arXiv:2406.14045}, 70 | year={2024} 71 | } 72 | ``` 73 | -------------------------------------------------------------------------------- /datasets/README.md: -------------------------------------------------------------------------------- 1 | # Training Dataset -------------------------------------------------------------------------------- /imgs/ltsm_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datamllab/ltsm/91ee7775ee5dabfd4baa4bf8713ecd111560655d/imgs/ltsm_model.png -------------------------------------------------------------------------------- /imgs/prompt_csv_tsne.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datamllab/ltsm/91ee7775ee5dabfd4baa4bf8713ecd111560655d/imgs/prompt_csv_tsne.png -------------------------------------------------------------------------------- /imgs/stat_prompt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datamllab/ltsm/91ee7775ee5dabfd4baa4bf8713ecd111560655d/imgs/stat_prompt.png -------------------------------------------------------------------------------- /ltsm/data_pipeline/reader/__init__.py: -------------------------------------------------------------------------------- 1 | from ltsm.data_pipeline.reader.monash_reader import MonashReader 2 | 3 | reader_dict = {} 4 | 5 | def register_reader(module): 6 | assert module.module_id not in reader_dict, f"Reader {module.module_id} already registered" 7 | reader_dict[module.module_id] = module 8 | 9 | register_reader(MonashReader) -------------------------------------------------------------------------------- /ltsm/data_pipeline/reader/base_reader.py: -------------------------------------------------------------------------------- 1 | class BaseReader: 2 | def __init__(self): 3 | pass 4 | 5 | def fetch(self): 6 | # input: path 7 | # output: DataFrame 8 | pass -------------------------------------------------------------------------------- /ltsm/data_pipeline/reader/monash_reader.py:
-------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from distutils.util import strtobool 4 | from datetime import datetime 5 | 6 | from ltsm.data_pipeline.reader.base_reader import BaseReader 7 | 8 | 9 | class MonashReader(BaseReader): 10 | module_id = "monash" 11 | def __init__(self, data_path): 12 | super().__init__() 13 | self.data_path = data_path 14 | 15 | def fetch(self): 16 | # input: path 17 | # output: DataFrame 18 | df, frequency, forecast_horizon, contain_missing_values, contain_equal_length = self._tsf_to_dataframe(self.data_path) 19 | 20 | def dropna(x): 21 | return x[~np.isnan(x)] 22 | timeseries = [dropna(ts).astype(np.float32) for ts in df.series_value] 23 | return timeseries 24 | 25 | def _tsf_to_dataframe(self, data_path: str, 26 | replace_missing_vals_with="NaN", 27 | value_column_name="series_value"): 28 | col_names = [] 29 | col_types = [] 30 | all_data = {} 31 | line_count = 0 32 | frequency = None 33 | forecast_horizon = None 34 | contain_missing_values = None 35 | contain_equal_length = None 36 | found_data_tag = False 37 | found_data_section = False 38 | started_reading_data_section = False 39 | with open(data_path, "r", encoding="cp1252") as file: 40 | for line in file: 41 | # Strip white space from start/end of line 42 | line = line.strip() 43 | if line: 44 | if line.startswith("@"): # Read meta-data 45 | if not line.startswith("@data"): 46 | line_content = line.split(" ") 47 | if line.startswith("@attribute"): 48 | if ( 49 | len(line_content) != 3 50 | ): # Attributes have both name and type 51 | raise Exception("Invalid meta-data specification.") 52 | 53 | col_names.append(line_content[1]) 54 | col_types.append(line_content[2]) 55 | else: 56 | if ( 57 | len(line_content) != 2 58 | ): # Other meta-data have only values 59 | raise Exception("Invalid meta-data specification.") 60 | 61 | if line.startswith("@frequency"): 62 | frequency = line_content[1] 63 | elif line.startswith("@horizon"): 64 | forecast_horizon = int(line_content[1]) 65 | elif line.startswith("@missing"): 66 | contain_missing_values = bool( 67 | strtobool(line_content[1]) 68 | ) 69 | elif line.startswith("@equallength"): 70 | contain_equal_length = bool(strtobool(line_content[1])) 71 | 72 | else: 73 | if len(col_names) == 0: 74 | raise Exception( 75 | "Missing attribute section. Attribute section must come before data." 76 | ) 77 | 78 | found_data_tag = True 79 | elif not line.startswith("#"): 80 | if len(col_names) == 0: 81 | raise Exception( 82 | "Missing attribute section. Attribute section must come before data." 83 | ) 84 | elif not found_data_tag: 85 | raise Exception("Missing @data tag.") 86 | else: 87 | if not started_reading_data_section: 88 | started_reading_data_section = True 89 | found_data_section = True 90 | all_series = [] 91 | 92 | for col in col_names: 93 | all_data[col] = [] 94 | 95 | full_info = line.split(":") 96 | 97 | if len(full_info) != (len(col_names) + 1): 98 | raise Exception("Missing attributes/values in series.") 99 | 100 | series = full_info[len(full_info) - 1] 101 | series = series.split(",") 102 | 103 | if len(series) == 0: 104 | raise Exception( 105 | "A given series should contains a set of comma separated numeric values. At least one numeric value should be there in a series. Missing values should be indicated with ? 
symbol" 106 | ) 107 | 108 | numeric_series = [] 109 | 110 | for val in series: 111 | if val == "?": 112 | numeric_series.append(replace_missing_vals_with) 113 | else: 114 | numeric_series.append(float(val)) 115 | 116 | if numeric_series.count(replace_missing_vals_with) == len( 117 | numeric_series 118 | ): 119 | raise Exception( 120 | "All series values are missing. A given series should contains a set of comma separated numeric values. At least one numeric value should be there in a series." 121 | ) 122 | 123 | all_series.append(pd.Series(numeric_series).array) 124 | 125 | for i in range(len(col_names)): 126 | att_val = None 127 | if col_types[i] == "numeric": 128 | att_val = int(full_info[i]) 129 | elif col_types[i] == "string": 130 | att_val = str(full_info[i]) 131 | elif col_types[i] == "date": 132 | att_val = datetime.strptime( 133 | full_info[i], "%Y-%m-%d %H-%M-%S" 134 | ) 135 | else: 136 | raise Exception( 137 | "Invalid attribute type." 138 | ) # Currently, the code supports only numeric, string and date types. Extend this as required. 139 | 140 | if att_val is None: 141 | raise Exception("Invalid attribute value.") 142 | else: 143 | all_data[col_names[i]].append(att_val) 144 | 145 | line_count = line_count + 1 146 | 147 | if line_count == 0: 148 | raise Exception("Empty file.") 149 | if len(col_names) == 0: 150 | raise Exception("Missing attribute section.") 151 | if not found_data_section: 152 | raise Exception("Missing series information under data section.") 153 | 154 | all_data[value_column_name] = all_series 155 | loaded_data = pd.DataFrame(all_data) 156 | 157 | return ( 158 | loaded_data, 159 | frequency, 160 | forecast_horizon, 161 | contain_missing_values, 162 | contain_equal_length, 163 | ) 164 | -------------------------------------------------------------------------------- /ltsm/data_provider/data_processing/__init__.py: -------------------------------------------------------------------------------- 1 | from ltsm.data_provider.data_processing.standard_scaler import StandardScaler 2 | 3 | processor_dict = {} 4 | 5 | def register_processor(module): 6 | assert module.module_id not in processor_dict, f"Processor {module.module_id} alreader registered" 7 | processor_dict[module.module_id] = module 8 | 9 | register_processor(StandardScaler) 10 | -------------------------------------------------------------------------------- /ltsm/data_provider/data_processing/base_processor.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from typing import Any, Dict, List, Literal, Optional, Tuple, Union 4 | from dataclasses import dataclass 5 | class BaseProcessor: 6 | def __init__(self): 7 | pass 8 | 9 | def process(self, raw_data, train_data, val_data, test_data, fit_train_only=False): 10 | pass 11 | 12 | def inverse_process(self, data): 13 | pass 14 | 15 | def save(self, save_dir): 16 | pass 17 | 18 | def load(self, save_dir): 19 | pass 20 | -------------------------------------------------------------------------------- /ltsm/data_provider/data_processing/standard_scaler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import numpy as np 4 | from sklearn.preprocessing import StandardScaler as SKStandardScaler 5 | 6 | from ltsm.data_provider.data_processing.base_processor import BaseProcessor 7 | 8 | 9 | class StandardScaler(BaseProcessor): 10 | module_id = "standard_scaler" 11 | 12 | def __init__(self): 13 | self._scaler = None 14 | 15 | def 
process(self, raw_data, train_data, val_data, test_data, fit_train_only=False): 16 | scaled_train_data, scaled_val_data, scaled_test_data = [], [], [] 17 | 18 | for raw_sequence, train_sequence, val_sequence, test_sequence in zip( 19 | raw_data, 20 | train_data, 21 | val_data, 22 | test_data, 23 | ): 24 | train_sequence = train_sequence.reshape(-1, 1) 25 | val_sequence = val_sequence.reshape(-1, 1) 26 | test_sequence = test_sequence.reshape(-1, 1) 27 | 28 | self._scaler = SKStandardScaler() 29 | 30 | if fit_train_only: 31 | self._scaler.fit(train_sequence) 32 | else: 33 | self._scaler.fit(raw_sequence.reshape(-1, 1)) 34 | 35 | scaled_train_data.append(self._scaler.transform(train_sequence).flatten()) 36 | scaled_val_data.append(self._scaler.transform(val_sequence).flatten()) 37 | scaled_test_data.append(self._scaler.transform(test_sequence).flatten()) 38 | 39 | return scaled_train_data, scaled_val_data, scaled_test_data 40 | 41 | def inverse_process(self, data): 42 | assert self._scaler is not None, "StandardScaler has not been fitted" 43 | raw_shape = data.shape 44 | data = self._scaler.inverse_transform(data.reshape(-1, 1)) 45 | 46 | return data.reshape(raw_shape) 47 | 48 | def save(self, save_dir): 49 | save_path = os.path.join(save_dir, "processor.pkl") 50 | with open(save_path, 'wb') as f: 51 | pickle.dump(self._scaler, f) 52 | 53 | def load(self, save_dir): 54 | save_path = os.path.join(save_dir, "processor.pkl") 55 | with open(save_path, 'rb') as f: 56 | self._scaler = pickle.load(f) 57 | 58 | -------------------------------------------------------------------------------- /ltsm/data_provider/data_processing/tokenizer_processor.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from dataclasses import dataclass 3 | from typing import Any, Dict, List, Literal, Optional, Tuple, Union 4 | 5 | @dataclass 6 | class TokenizerConfig: 7 | """ 8 | This class holds all the configuration parameters to be used 9 | by ``ChronosTokenizer`` and ``ChronosModel``. 10 | """ 11 | 12 | tokenizer_class: str 13 | tokenizer_kwargs: Dict[str, Any] 14 | n_tokens: int 15 | n_special_tokens: int 16 | pad_token_id: int 17 | eos_token_id: int 18 | use_eos_token: bool 19 | model_type: Literal["causal", "seq2seq"] 20 | context_length: int 21 | prediction_length: int 22 | num_samples: int 23 | temperature: float 24 | top_k: int 25 | top_p: float 26 | 27 | def __post_init__(self): 28 | assert ( 29 | self.pad_token_id < self.n_special_tokens 30 | and self.eos_token_id < self.n_special_tokens 31 | ), f"Special token id's must be smaller than {self.n_special_tokens=}" 32 | 33 | def create_tokenizer(self) -> "ChronosTokenizer": 34 | if self.tokenizer_class == "MeanScaleUniformBins": 35 | return MeanScaleUniformBins(**self.tokenizer_kwargs, config=self) 36 | raise ValueError 37 | 38 | 39 | class ChronosTokenizer: 40 | """ 41 | A ``ChronosTokenizer`` definines how time series are mapped into token IDs 42 | and back. 43 | 44 | For details, see the ``input_transform`` and ``output_transform`` methods, 45 | which concrete classes must implement. 46 | """ 47 | 48 | def input_transform( 49 | self, context: torch.Tensor 50 | ) -> Tuple[torch.Tensor, torch.Tensor, Any]: 51 | """ 52 | Turn a batch of time series into token IDs, attention map, and scale. 53 | 54 | Parameters 55 | ---------- 56 | context 57 | A tensor shaped (batch_size, time_length), containing the 58 | timeseries to forecast. Use left-padding with ``torch.nan`` 59 | to align time series of different lengths. 
60 | 61 | Returns 62 | ------- 63 | token_ids 64 | A tensor of integers, shaped (batch_size, time_length + 1) 65 | if ``config.use_eos_token`` and (batch_size, time_length) 66 | otherwise, containing token IDs for the input series. 67 | attention_mask 68 | A boolean tensor, same shape as ``token_ids``, indicating 69 | which input observations are not ``torch.nan`` (i.e. not 70 | missing nor padding). 71 | tokenizer_state 72 | An object that will be passed to ``output_transform``. 73 | Contains the relevant context to decode output samples into 74 | real values, such as location and scale parameters. 75 | """ 76 | raise NotImplementedError() 77 | 78 | def output_transform( 79 | self, samples: torch.Tensor, tokenizer_state: Any 80 | ) -> torch.Tensor: 81 | """ 82 | Turn a batch of sample token IDs into real values. 83 | 84 | Parameters 85 | ---------- 86 | samples 87 | A tensor of integers, shaped (batch_size, num_samples, time_length), 88 | containing token IDs of sample trajectories. 89 | tokenizer_state 90 | An object returned by ``input_transform`` containing 91 | relevant context to decode samples, such as location and scale. 92 | The nature of this depends on the specific tokenizer. 93 | 94 | Returns 95 | ------- 96 | forecasts 97 | A real tensor, shaped (batch_size, num_samples, time_length), 98 | containing forecasted sample paths. 99 | """ 100 | raise NotImplementedError() 101 | 102 | 103 | class MeanScaleUniformBins(ChronosTokenizer): 104 | def __init__( 105 | self, low_limit: float, high_limit: float, config: TokenizerConfig 106 | ) -> None: 107 | self.config = config 108 | self.centers = torch.linspace( 109 | low_limit, 110 | high_limit, 111 | config.n_tokens - config.n_special_tokens - 1, 112 | ) 113 | self.boundaries = torch.concat( 114 | ( 115 | torch.tensor([-1e20], device=self.centers.device), 116 | (self.centers[1:] + self.centers[:-1]) / 2, 117 | torch.tensor([1e20], device=self.centers.device), 118 | ) 119 | ) 120 | 121 | def input_transform( 122 | self, context: torch.Tensor 123 | ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 124 | batch_size, length = context.shape 125 | 126 | if length > self.config.context_length: 127 | context = context[..., -self.config.context_length :] 128 | 129 | attention_mask = ~torch.isnan(context) 130 | scale = torch.nansum( 131 | torch.abs(context) * attention_mask, dim=-1 132 | ) / torch.nansum(attention_mask, dim=-1) 133 | scale[~(scale > 0)] = 1.0 134 | scaled_context = context / scale.unsqueeze(dim=-1) 135 | token_ids = ( 136 | torch.bucketize( 137 | input=scaled_context, 138 | boundaries=self.boundaries, 139 | # buckets are open to the right, see: 140 | # https://pytorch.org/docs/2.1/generated/torch.bucketize.html#torch-bucketize 141 | right=True, 142 | ) 143 | + self.config.n_special_tokens 144 | ) 145 | token_ids[~attention_mask] = self.config.pad_token_id 146 | 147 | if self.config.use_eos_token: 148 | eos_tokens = torch.full( 149 | (batch_size, 1), fill_value=self.config.eos_token_id 150 | ) 151 | token_ids = torch.concat((token_ids, eos_tokens), dim=1) 152 | eos_mask = torch.full((batch_size, 1), fill_value=True) 153 | attention_mask = torch.concat((attention_mask, eos_mask), dim=1) 154 | 155 | return token_ids, attention_mask, scale 156 | 157 | def output_transform( 158 | self, samples: torch.Tensor, scale: torch.Tensor 159 | ) -> torch.Tensor: 160 | 161 | scale_unsqueezed = scale.unsqueeze(-1).unsqueeze(-1) 162 | indices = torch.clamp( 163 | samples - self.config.n_special_tokens, 164 | min=0, 165 | max=len(self.centers) 
- 1, 166 | ) 167 | self.centers = self.centers.to(samples.device) 168 | return self.centers[indices] * scale_unsqueezed 169 | 170 | -------------------------------------------------------------------------------- /ltsm/data_provider/dataset.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import torch 4 | from torch.utils.data.dataset import Dataset 5 | 6 | from ltsm.data_provider.data_processing.tokenizer_processor import TokenizerConfig 7 | 8 | 9 | class TSDataset(Dataset): 10 | def __init__( 11 | self, 12 | data, 13 | seq_len, 14 | pred_len, 15 | ): 16 | self.data = data 17 | self.seq_len = seq_len 18 | self.pred_len = pred_len 19 | 20 | # Create a map from item index to sequence index and offset 21 | self.num_items = 0 22 | self.item2sequence, self.item2offset = [], [] 23 | 24 | for sequence_index, sequence in enumerate(self.data): 25 | assert len(sequence) >= self.seq_len + self.pred_len, f"Sequence must have a lenth with at least seq_len + pred_len, the current length is {len(sequence)}" 26 | cur_offset = 0 27 | for _ in range(len(sequence) - self.seq_len - self.pred_len + 1): 28 | self.item2sequence.append(sequence_index) 29 | self.item2offset.append(cur_offset) 30 | cur_offset += 1 31 | self.num_items += 1 32 | 33 | def __getitem__(self, index): 34 | sequence_index = self.item2sequence[index] 35 | x_begin = self.item2offset[index] 36 | x_end = x_begin + self.seq_len 37 | y_begin = x_end 38 | y_end = y_begin + self.pred_len 39 | 40 | seq_x = torch.from_numpy(np.expand_dims(self.data[sequence_index][x_begin:x_end], -1)) 41 | seq_y = torch.from_numpy(np.expand_dims(self.data[sequence_index][y_begin:y_end], -1)) 42 | 43 | return seq_x, seq_y 44 | 45 | def __len__(self): 46 | return self.num_items 47 | 48 | 49 | class TSPromptDataset(Dataset): 50 | def __init__( 51 | self, 52 | data, 53 | prompt, 54 | seq_len, 55 | pred_len, 56 | downsample_rate=10, 57 | ): 58 | self.prompt = prompt 59 | self.seq_len = seq_len 60 | self.pred_len = pred_len 61 | self.num_items = 0 62 | self.item2sequence, self.item2offset = [], [] 63 | self.data = data 64 | 65 | for sequence_index, sequence in enumerate(self.data): 66 | assert len(sequence) >= self.seq_len + self.pred_len, f"Sequence must have a length with at least seq_len + pred_len, the current length is {len(sequence)}" 67 | cur_offset = 0 68 | for cur_offset in range(0, len(sequence) - self.seq_len - self.pred_len + 1, downsample_rate): 69 | self.item2sequence.append(sequence_index) 70 | self.item2offset.append(cur_offset) 71 | self.num_items += 1 72 | 73 | 74 | 75 | def __getitem__(self, index): 76 | sequence_index = self.item2sequence[index] 77 | x_begin = self.item2offset[index] 78 | x_end = x_begin + self.seq_len 79 | y_begin = x_end 80 | y_end = y_begin + self.pred_len 81 | prompt= self.prompt[sequence_index] 82 | 83 | seq_x = np.concatenate((prompt, self.data[sequence_index][x_begin:x_end])) 84 | seq_x = torch.from_numpy(np.expand_dims(seq_x, -1)) 85 | seq_y = torch.from_numpy(np.expand_dims(self.data[sequence_index][y_begin:y_end], -1)) 86 | return seq_x, seq_y 87 | 88 | def __len__(self): 89 | return self.num_items 90 | 91 | 92 | class TSTokenDataset(Dataset): 93 | def __init__( 94 | self, 95 | data, 96 | prompt, 97 | seq_len, 98 | pred_len, 99 | downsample_rate=10, 100 | ): 101 | self.seq_len = seq_len 102 | self.pred_len = pred_len 103 | self.num_items = 0 104 | self.item2sequence, self.item2offset = [], [] 105 | self.data = data 106 | self.prompt = 
prompt 107 | 108 | for sequence_index, sequence in enumerate(self.data): 109 | assert len(sequence) >= self.seq_len + self.pred_len, f"Sequence must have a length with at least seq_len + pred_len, the current length is {len(sequence)}" 110 | cur_offset = 0 111 | for cur_offset in range(0, len(sequence) - self.seq_len - self.pred_len + 1, downsample_rate): 112 | self.item2sequence.append(sequence_index) 113 | self.item2offset.append(cur_offset) 114 | self.num_items += 1 115 | 116 | context_length = seq_len+pred_len 117 | prediction_length = pred_len 118 | n_tokens = 1024 119 | n_special_tokens = 2 120 | config = TokenizerConfig( 121 | tokenizer_class="MeanScaleUniformBins", 122 | tokenizer_kwargs=dict(low_limit=-3.0, high_limit=3.0), 123 | n_tokens=n_tokens, 124 | n_special_tokens=n_special_tokens, 125 | pad_token_id=0, 126 | eos_token_id=1, 127 | use_eos_token=0, 128 | model_type="causal", 129 | context_length=context_length, 130 | prediction_length=prediction_length, 131 | num_samples=20, 132 | temperature=1.0, 133 | top_k=50, 134 | top_p=1.0, 135 | ) 136 | 137 | self.tokenizer = config.create_tokenizer() 138 | 139 | for sequence_index, sequence in enumerate(self.data): 140 | assert len(sequence) >= self.seq_len + self.pred_len, f"Sequence must have a length with at least seq_len + pred_len, the current length is {len(sequence)}" 141 | cur_offset = 0 142 | for cur_offset in range(0, len(sequence) - self.seq_len - self.pred_len + 1, downsample_rate): 143 | self.item2sequence.append(sequence_index) 144 | self.item2offset.append(cur_offset) 145 | # cur_offset += 1 146 | self.num_items += 1 147 | 148 | 149 | def __getitem__(self, index): 150 | sequence_index = self.item2sequence[index] 151 | x_begin = self.item2offset[index] 152 | x_end = x_begin + self.seq_len 153 | y_begin = x_end 154 | y_end = y_begin + self.pred_len 155 | prompt= self.prompt[sequence_index] 156 | 157 | seq = self.data[sequence_index][x_begin:y_end] 158 | # seq = np.concatenate((prompt, self.data[sequence_index][x_begin:y_end])) 159 | seq = torch.from_numpy(np.expand_dims(seq,0)) 160 | seq_token, _, seq_scale = self.tokenizer.input_transform(seq) 161 | 162 | propmt_seq = torch.from_numpy(np.expand_dims(prompt,0)) 163 | propmt_token, _, _ = self.tokenizer.input_transform(propmt_seq) 164 | 165 | seq_x = seq_token[0,:self.seq_len] 166 | seq_x = np.concatenate((propmt_token.squeeze(), seq_x), axis=0) 167 | data_y = np.concatenate((seq_scale, seq_token[0, self.seq_len:self.seq_len+self.pred_len]), axis=0) 168 | 169 | return seq_x, data_y 170 | 171 | def __len__(self): 172 | return self.num_items -------------------------------------------------------------------------------- /ltsm/data_provider/hf_data_loader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | import torch 5 | from torch.utils.data import Dataset, DataLoader 6 | from sklearn.preprocessing import StandardScaler 7 | import warnings 8 | from pathlib import Path 9 | 10 | from torch.utils.data.dataset import ConcatDataset, Dataset 11 | 12 | from ltsm.utils.timefeatures import time_features 13 | from ltsm.utils.tools import convert_tsf_to_dataframe 14 | 15 | warnings.filterwarnings('ignore') 16 | 17 | -------------------------------------------------------------------------------- /ltsm/data_provider/splitter.py: -------------------------------------------------------------------------------- 1 | import os 2 | class DataSplitter: 3 | def __init__(self): 4 | pass 5 | 6 | 
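# Interface method: concrete splitters (e.g. SplitterByTimestamp below) return per-sequence train/val/test splits plus the identifiers of the sequences that were kept.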
def get_splits(self): 7 | pass 8 | 9 | 10 | class SplitterByTimestamp(DataSplitter): 11 | def __init__(self, seq_len, pred_len, train_ratio, val_ratio,prompt_folder_path, data_name): 12 | super().__init__() 13 | self.seq_len = seq_len 14 | self.pred_len = pred_len 15 | self.train_ratio = train_ratio 16 | self.val_ratio = val_ratio 17 | self.prompt_folder_path = prompt_folder_path 18 | self.data_name = data_name 19 | 20 | def get_splits(self, raw_data): 21 | train_split, val_split, test_split, buff = [], [], [], [] 22 | for index, sequence in enumerate(raw_data): 23 | 24 | assert sequence.ndim == 1, "Time-series should be 1D." 25 | 26 | num_train = int(len(sequence) * self.train_ratio) 27 | num_val = int(len(sequence) * self.val_ratio) 28 | 29 | if num_train < self.seq_len + self.pred_len: 30 | continue 31 | 32 | # We also add the previous seq_len points to the val and test sets 33 | train_split.append(sequence[:num_train]) 34 | val_split.append(sequence[num_train-self.seq_len:num_train+num_val]) 35 | test_split.append(sequence[num_train+num_val-self.seq_len:]) 36 | buff.append(index) 37 | 38 | return train_split, val_split, test_split, buff 39 | 40 | def get_csv_splits(self, df_data): 41 | train_split, val_split, test_split, buff = [], [], [], [] 42 | cols = df_data.columns[1:] 43 | raw_data = df_data[cols].T.values 44 | if 'ETTh1' in self.data_name or 'ETTh2' in self.data_name: 45 | raw_data = df_data[cols][:14400].T.values 46 | 47 | if 'ETTm1' in self.data_name or 'ETTm2' in self.data_name: 48 | raw_data = df_data[cols][:57600].T.values 49 | 50 | for col, sequence in zip(cols, raw_data): 51 | 52 | assert sequence.ndim == 1, "Time-series should be 1D." 53 | 54 | num_train = int(len(sequence) * self.train_ratio) 55 | num_val = int(len(sequence) * self.val_ratio) 56 | 57 | if num_train < self.seq_len + self.pred_len: 58 | continue 59 | 60 | 61 | # We also add the previous seq_len points to the val and test sets 62 | train_split.append(sequence[:num_train]) 63 | val_split.append(sequence[num_train-self.seq_len:num_train+num_val]) 64 | test_split.append(sequence[num_train+num_val-self.seq_len:]) 65 | buff.append(col) 66 | 67 | print(f"Data{self.data_name} has been split into train, val, test sets with the following shapes: {train_split[0].shape}, {val_split[0].shape}, {test_split[0].shape}") 68 | 69 | return train_split, val_split, test_split, buff 70 | -------------------------------------------------------------------------------- /ltsm/models/__init__.py: -------------------------------------------------------------------------------- 1 | from .utils import get_model 2 | from .config import LTSMConfig -------------------------------------------------------------------------------- /ltsm/models/config.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass 2 | from transformers import PretrainedConfig 3 | import json 4 | 5 | @dataclass 6 | class LTSMConfig(PretrainedConfig): 7 | 8 | def __init__(self, **kwargs): 9 | super().__init__(**kwargs) 10 | 11 | for key, value in kwargs.items(): 12 | setattr(self, key, value) 13 | 14 | def update(self, **kwargs): 15 | for key, value in kwargs.items(): 16 | setattr(self, key, value) 17 | 18 | def load(self, json_file): 19 | 20 | with open(json_file) as f: 21 | config = json.load(f) 22 | 23 | for key, value in config.items(): 24 | setattr(self, key, value) 25 | 26 | return self -------------------------------------------------------------------------------- /ltsm/models/embed.py: 
-------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.nn.utils import weight_norm 5 | import math 6 | 7 | 8 | class PatchEmbedding(nn.Module): 9 | def __init__(self, d_model, patch_len, stride, dropout): 10 | super(PatchEmbedding, self).__init__() 11 | # Patching 12 | self.patch_len = patch_len 13 | self.stride = stride 14 | self.padding_patch_layer = nn.ReplicationPad1d((0, stride)) 15 | 16 | # Backbone, Input encoding: projection of feature vectors onto a d-dim vector space 17 | self.value_embedding = TokenEmbedding(patch_len, d_model) 18 | 19 | # Positional embedding 20 | # self.position_embedding = PositionalEmbedding(d_model) 21 | 22 | # Residual dropout 23 | self.dropout = nn.Dropout(dropout) 24 | 25 | def forward(self, x): 26 | # do patching 27 | n_vars = x.shape[1] 28 | x = self.padding_patch_layer(x) 29 | x = x.unfold(dimension=-1, size=self.patch_len, step=self.stride) 30 | x = torch.reshape(x, (x.shape[0] * x.shape[1], x.shape[2], x.shape[3])) 31 | # Input encoding 32 | x = self.value_embedding(x) 33 | return self.dropout(x), n_vars 34 | 35 | 36 | class PositionalEmbedding(nn.Module): 37 | def __init__(self, d_model, max_len=5000): 38 | super(PositionalEmbedding, self).__init__() 39 | # Compute the positional encodings once in log space. 40 | pe = torch.zeros(max_len, d_model).float() 41 | pe.require_grad = False 42 | 43 | position = torch.arange(0, max_len).float().unsqueeze(1) 44 | div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp() 45 | 46 | pe[:, 0::2] = torch.sin(position * div_term) 47 | pe[:, 1::2] = torch.cos(position * div_term) 48 | 49 | pe = pe.unsqueeze(0) 50 | self.register_buffer('pe', pe) 51 | 52 | def forward(self, x): 53 | return self.pe[:, :x.size(1)] 54 | 55 | 56 | class TokenEmbedding(nn.Module): 57 | def __init__(self, c_in, d_model): 58 | super(TokenEmbedding, self).__init__() 59 | padding = 1 if torch.__version__ >= '1.5.0' else 2 60 | self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model, 61 | kernel_size=3, padding=padding, padding_mode='circular', bias=False) 62 | for m in self.modules(): 63 | if isinstance(m, nn.Conv1d): 64 | nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu') 65 | 66 | def forward(self, x): 67 | # print("x.shape = {}".format(x.shape)) 68 | x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2) 69 | return x 70 | 71 | 72 | class FixedEmbedding(nn.Module): 73 | def __init__(self, c_in, d_model): 74 | super(FixedEmbedding, self).__init__() 75 | 76 | w = torch.zeros(c_in, d_model).float() 77 | w.require_grad = False 78 | 79 | position = torch.arange(0, c_in).float().unsqueeze(1) 80 | div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp() 81 | 82 | w[:, 0::2] = torch.sin(position * div_term) 83 | w[:, 1::2] = torch.cos(position * div_term) 84 | 85 | self.emb = nn.Embedding(c_in, d_model) 86 | self.emb.weight = nn.Parameter(w, requires_grad=False) 87 | 88 | def forward(self, x): 89 | return self.emb(x).detach() 90 | 91 | 92 | class TemporalEmbedding(nn.Module): 93 | def __init__(self, d_model, embed_type='fixed', freq='h'): 94 | super(TemporalEmbedding, self).__init__() 95 | 96 | minute_size = 4 97 | hour_size = 24 98 | weekday_size = 7 99 | day_size = 32 100 | month_size = 13 101 | 102 | Embed = FixedEmbedding if embed_type == 'fixed' else nn.Embedding 103 | if freq == 't': 104 | self.minute_embed = Embed(minute_size, 
d_model) 105 | self.hour_embed = Embed(hour_size, d_model) 106 | self.weekday_embed = Embed(weekday_size, d_model) 107 | self.day_embed = Embed(day_size, d_model) 108 | self.month_embed = Embed(month_size, d_model) 109 | 110 | def forward(self, x): 111 | x = x.long() 112 | 113 | minute_x = self.minute_embed(x[:, :, 4]) if hasattr(self, 'minute_embed') else 0. 114 | hour_x = self.hour_embed(x[:, :, 3]) 115 | weekday_x = self.weekday_embed(x[:, :, 2]) 116 | day_x = self.day_embed(x[:, :, 1]) 117 | month_x = self.month_embed(x[:, :, 0]) 118 | 119 | return hour_x + weekday_x + day_x + month_x + minute_x 120 | 121 | 122 | class TimeFeatureEmbedding(nn.Module): 123 | def __init__(self, d_model, embed_type='timeF', freq='h'): 124 | super(TimeFeatureEmbedding, self).__init__() 125 | 126 | freq_map = {'h': 4, 't': 5, 's': 6, 'm': 1, 'a': 1, 'w': 2, 'd': 3, 'b': 3} 127 | d_inp = freq_map[freq] 128 | self.embed = nn.Linear(d_inp, d_model, bias=False) 129 | 130 | def forward(self, x): 131 | return self.embed(x) 132 | 133 | 134 | class DataEmbedding(nn.Module): 135 | def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1): 136 | super(DataEmbedding, self).__init__() 137 | 138 | self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model) 139 | self.position_embedding = PositionalEmbedding(d_model=d_model) 140 | self.temporal_embedding = TemporalEmbedding(d_model=d_model, embed_type=embed_type, 141 | freq=freq) if embed_type != 'timeF' else TimeFeatureEmbedding( 142 | d_model=d_model, embed_type=embed_type, freq=freq) 143 | self.dropout = nn.Dropout(p=dropout) 144 | 145 | def forward(self, x, x_mark): 146 | x = self.value_embedding(x) + self.temporal_embedding(x_mark) + self.position_embedding(x) 147 | return self.dropout(x) 148 | 149 | 150 | class DataEmbedding_wo_pos(nn.Module): 151 | def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1): 152 | super(DataEmbedding_wo_pos, self).__init__() 153 | 154 | self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model) 155 | self.position_embedding = PositionalEmbedding(d_model=d_model) 156 | self.temporal_embedding = TemporalEmbedding(d_model=d_model, embed_type=embed_type, 157 | freq=freq) if embed_type != 'timeF' else TimeFeatureEmbedding( 158 | d_model=d_model, embed_type=embed_type, freq=freq) 159 | self.dropout = nn.Dropout(p=dropout) 160 | 161 | def forward(self, x, x_mark): 162 | x = self.value_embedding(x) + self.temporal_embedding(x_mark) 163 | return self.dropout(x) 164 | 165 | 166 | class DataEmbedding_wo_time(nn.Module): 167 | def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1): 168 | super(DataEmbedding_wo_time, self).__init__() 169 | 170 | self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model) 171 | self.position_embedding = PositionalEmbedding(d_model=d_model) 172 | self.dropout = nn.Dropout(p=dropout) 173 | 174 | def forward(self, x): 175 | x = self.value_embedding(x) + self.position_embedding(x) 176 | return self.dropout(x) -------------------------------------------------------------------------------- /ltsm/models/ltsm_model.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | from einops import rearrange 5 | from .config import LTSMConfig 6 | from transformers.modeling_utils import PreTrainedModel, PretrainedConfig 7 | from transformers import AutoModel, AutoConfig, AutoTokenizer 8 | 9 | class LTSM(PreTrainedModel): 10 | config_class = LTSMConfig 11 | 
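# Forward pipeline (see forward() below): instance-normalize the (batch, seq_len + prompt_len, channels) input,
# cut it into overlapping patches, project each patch with in_layer into the backbone hidden size, run the pruned
# pretrained HuggingFace backbone on the patch embeddings, then map the flattened hidden states to pred_len values
# per channel with out_layer and undo the normalization, giving a (batch, pred_len, channels) output.
# Construction sketch (the field values here are illustrative assumptions, not repo defaults):
#   cfg = LTSMConfig(model_name_or_path="gpt2", pretrain=True, gpt_layers=3, patch_size=16,
#                    stride=8, seq_len=336, prompt_len=133, pred_len=96)
#   model = LTSM(cfg)  # expects inputs of shape (batch, seq_len + prompt_len, channels)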
def __init__(self, configs, *model_args, **model_kwargs): 12 | super().__init__(configs) 13 | self.patch_size = configs.patch_size 14 | self.pretrain = configs.pretrain 15 | self.stride = configs.stride 16 | self.patch_num = (configs.seq_len + configs.prompt_len - self.patch_size) // self.stride + 1 17 | self.d_type = torch.bfloat16 18 | self.padding_patch_layer = nn.ReplicationPad1d((0, self.stride)) 19 | self.patch_num += 1 20 | self.configs = configs 21 | 22 | if configs.pretrain: 23 | print("Loading the pretraining weight.") 24 | self.llm_config = AutoConfig.from_pretrained(configs.model_name_or_path) 25 | self.llm = AutoModel.from_pretrained(configs.model_name_or_path) # loads a pretrained GPT-2 base model 26 | else: 27 | raise NotImplementedError("You must load the pretraining weight.") 28 | 29 | self.model_prune(configs) 30 | print("model = {}".format(self.llm)) 31 | 32 | self.in_layer = nn.Linear(configs.patch_size, self.llm_config.hidden_size) 33 | self.out_layer = nn.Linear(self.llm_config.hidden_size * self.patch_num, configs.pred_len) 34 | 35 | self.cnt = 0 36 | 37 | def model_prune(self, configs): 38 | if "gpt2" in configs.model_name_or_path: 39 | self.llm.h = self.llm.h[:configs.gpt_layers] 40 | elif "phi" in configs.model_name_or_path or "llama" in configs.model_name_or_path or "gemma" in configs.model_name_or_path: 41 | self.llm.layers = self.llm.layers[:configs.gpt_layers] 42 | else: 43 | raise NotImplementedError(f"No implementation in model prune for {self.llm}.") 44 | 45 | def forward(self, x): 46 | B, L, M = x.shape 47 | 48 | means = x.mean(1, keepdim=True).detach() 49 | 50 | x = x - means 51 | stdev = torch.sqrt(torch.var(x, dim=1, keepdim=True, unbiased=False)+ 1e-5).detach() 52 | x /= stdev 53 | x = rearrange(x, 'b l m -> b m l') 54 | 55 | x = self.padding_patch_layer(x) 56 | x = x.unfold(dimension=-1, size=self.patch_size, step=self.stride) 57 | x = rearrange(x, 'b m n p -> (b m) n p') 58 | outputs = self.in_layer(x).to(dtype=torch.bfloat16) 59 | 60 | outputs = self.llm(inputs_embeds=outputs).last_hidden_state 61 | outputs = outputs.to(dtype=x.dtype) 62 | 63 | outputs = self.out_layer(outputs.reshape(B*M, -1)) 64 | outputs = rearrange(outputs, '(b m) l -> b l m', b=B) 65 | 66 | outputs = outputs * stdev 67 | outputs = outputs + means 68 | 69 | return outputs 70 | -------------------------------------------------------------------------------- /ltsm/models/ltsm_tokenizer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from transformers.modeling_utils import PreTrainedModel 4 | from transformers import AutoModel, AutoConfig 5 | 6 | 7 | class LTSM_Tokenizer(PreTrainedModel): 8 | def __init__(self, configs): 9 | super().__init__(configs) 10 | self.patch_size = configs.patch_size 11 | self.pretrain = configs.pretrain 12 | 13 | self.d_type = torch.bfloat16 14 | self.pred_len = configs.pred_len 15 | 16 | if configs.pretrain: 17 | print("Loading the pretraining weight.") 18 | self.llm_config = AutoConfig.from_pretrained(configs.model_name_or_path) 19 | self.llm = AutoModel.from_pretrained(configs.model_name_or_path) # loads a pretrained GPT-2 base model 20 | else: 21 | raise NotImplementedError("You must load the pretraining weight.") 22 | 23 | self.model_prune(configs) 24 | print("gpt2 = {}".format(self.llm)) 25 | 26 | def model_prune(self, configs): 27 | if "gpt2" in configs.model_name_or_path: 28 | self.llm.h = self.llm.h[:configs.gpt_layers] 29 | elif "phi" in configs.model_name_or_path or "llama" in 
configs.model_name_or_path or "gemma" in configs.model_name_or_path: 30 | self.llm.layers = self.llm.layers[:configs.gpt_layers] 31 | else: 32 | raise NotImplementedError(f"No implementation in model prune for {self.llm}.") 33 | 34 | def forward(self, x): 35 | x = x.int().unsqueeze(-1) 36 | # x = x.int().to(self.llm.device) 37 | # import ipdb; ipdb.set_trace() 38 | outputs = self.llm(input_ids = x).last_hidden_state 39 | outputs = outputs[:, -self.pred_len:, :] 40 | 41 | return outputs -------------------------------------------------------------------------------- /ltsm/models/ltsm_wordprompt.py: -------------------------------------------------------------------------------- 1 | import json 2 | import torch 3 | import torch.nn as nn 4 | 5 | from transformers.modeling_utils import PreTrainedModel 6 | from transformers import AutoModel, AutoConfig, AutoTokenizer 7 | 8 | from .utils import Normalize, FlattenHead, ReprogrammingLayer 9 | from .embed import PatchEmbedding 10 | 11 | 12 | class LTSM_WordPrompt(PreTrainedModel): 13 | def __init__(self, configs): 14 | super().__init__(configs) 15 | self.pred_len = configs.pred_len 16 | self.seq_len = configs.seq_len 17 | self.d_ff = configs.d_ff 18 | self.top_k = 5 19 | self.d_llm = configs.d_model 20 | self.patch_len = configs.patch_size 21 | self.stride = configs.stride 22 | self.pretrain = configs.pretrain 23 | 24 | with open(configs.prompt_data_path, 'r') as f: 25 | self.index2prompt = json.load(f) 26 | if configs.pretrain: 27 | print("Loading the pretraining weight.") 28 | self.llm_config = AutoConfig.from_pretrained(configs.model_name_or_path) 29 | self.llm_model = AutoModel.from_pretrained(configs.model_name_or_path) # loads a pretrained GPT-2 base model 30 | self.tokenizer = AutoTokenizer.from_pretrained(configs.model_name_or_path) 31 | else: 32 | raise NotImplementedError("You must load the pretraining weight.") 33 | 34 | self.model_prune(configs) 35 | print("model = {}".format(self.llm_model)) 36 | 37 | if self.tokenizer.eos_token: 38 | self.tokenizer.pad_token = self.tokenizer.eos_token 39 | else: 40 | pad_token = '[PAD]' 41 | self.tokenizer.add_special_tokens({'pad_token': pad_token}) 42 | self.tokenizer.pad_token = pad_token 43 | 44 | for param in self.llm_model.parameters(): 45 | param.requires_grad = False 46 | 47 | self.dropout = nn.Dropout(configs.dropout) 48 | 49 | self.patch_embedding = PatchEmbedding( 50 | configs.d_model, self.patch_len, self.stride, configs.dropout) 51 | 52 | self.word_embeddings = self.llm_model.get_input_embeddings().weight 53 | self.vocab_size = self.word_embeddings.shape[0] 54 | self.num_tokens = 1000 55 | self.mapping_layer = nn.Linear(self.vocab_size, self.num_tokens) 56 | 57 | self.reprogramming_layer = ReprogrammingLayer(configs.d_model, configs.n_heads, self.d_ff, self.d_llm) 58 | 59 | self.patch_nums = int((configs.seq_len - self.patch_len) / self.stride + 2) 60 | self.head_nf = self.d_ff * self.patch_nums 61 | 62 | self.output_projection = FlattenHead(configs.enc_in, self.head_nf, self.pred_len, 63 | head_dropout=configs.dropout) 64 | self.normalize_layers = Normalize(configs.enc_in, affine=False) 65 | 66 | 67 | def model_prune(self, configs): 68 | if "gpt2" in configs.model_name_or_path: 69 | self.llm_model.h = self.llm_model.h[:configs.gpt_layers] 70 | elif "phi" in configs.model_name_or_path or "llama" in configs.model_name_or_path or "gemma" in configs.model_name_or_path: 71 | self.llm_model.layers = self.llm_model.layers[:configs.gpt_layers] 72 | else: 73 | raise NotImplementedError(f"No 
implementation in model prune for {self.llm_model}.") 74 | 75 | 76 | def calcute_lags(self, x_enc): 77 | q_fft = torch.fft.rfft(x_enc.permute(0, 2, 1).contiguous(), dim=-1) 78 | k_fft = torch.fft.rfft(x_enc.permute(0, 2, 1).contiguous(), dim=-1) 79 | res = q_fft * torch.conj(k_fft) 80 | corr = torch.fft.irfft(res, dim=-1) 81 | mean_value = torch.mean(corr, dim=1) 82 | _, lags = torch.topk(mean_value, self.top_k, dim=-1) 83 | return lags 84 | 85 | def forward(self, x_enc): 86 | index = x_enc[:, 0, 0] 87 | index = index.tolist() 88 | x_enc = x_enc[:,1:,:] 89 | x_enc = self.normalize_layers(x_enc, 'norm') 90 | 91 | B, T, N = x_enc.size() 92 | x_enc = x_enc.permute(0, 2, 1).contiguous().reshape(B * N, T, 1) 93 | 94 | min_values = torch.min(x_enc, dim=1)[0] 95 | max_values = torch.max(x_enc, dim=1)[0] 96 | medians = torch.median(x_enc, dim=1).values 97 | lags = self.calcute_lags(x_enc) 98 | trends = x_enc.diff(dim=1).sum(dim=1) 99 | # ipdb.set_trace() 100 | prompt = [] 101 | for b in range(x_enc.shape[0]): 102 | min_values_str = str(min_values[b].tolist()[0]) 103 | max_values_str = str(max_values[b].tolist()[0]) 104 | median_values_str = str(medians[b].tolist()[0]) 105 | lags_values_str = str(lags[b].tolist()) 106 | prompt_ = ( 107 | f"<|start_prompt|>Dataset description: {self.index2prompt[str(int(index[b]))]}<|end_prompt|>" 108 | f"Task description: forecast the next {str(self.pred_len)} steps given the previous {str(self.seq_len)} steps information; " 109 | "Input statistics: " 110 | f"min value {min_values_str}, " 111 | f"max value {max_values_str}, " 112 | f"median value {median_values_str}, " 113 | f"the trend of input is {'upward' if trends[b] > 0 else 'downward'}, " 114 | f"top 5 lags are : {lags_values_str}<||>" 115 | ) 116 | 117 | prompt.append(prompt_) 118 | 119 | x_enc = x_enc.reshape(B, N, T).permute(0, 2, 1).contiguous() 120 | 121 | prompt = self.tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048).input_ids 122 | prompt_embeddings = self.llm_model.get_input_embeddings()(prompt.to(x_enc.device)) # (batch, prompt_token, dim) 123 | 124 | source_embeddings = self.mapping_layer(self.word_embeddings.permute(1, 0)).permute(1, 0) 125 | 126 | x_enc = x_enc.permute(0, 2, 1).contiguous() 127 | enc_out, n_vars = self.patch_embedding(x_enc.to(torch.float32)) 128 | enc_out = self.reprogramming_layer(enc_out, source_embeddings, source_embeddings) 129 | llama_enc_out = torch.cat([prompt_embeddings, enc_out], dim=1) 130 | dec_out = self.llm_model(inputs_embeds=llama_enc_out).last_hidden_state 131 | dec_out = dec_out[:, :, :self.d_ff] # (batch, patch_num, d_ff) 132 | 133 | dec_out = torch.reshape( 134 | dec_out, (-1, n_vars, dec_out.shape[-2], dec_out.shape[-1])) 135 | dec_out = dec_out.permute(0, 1, 3, 2).contiguous() 136 | 137 | dec_out = self.output_projection(dec_out[:, :, :, -self.patch_nums:]) 138 | dec_out = dec_out.permute(0, 2, 1).contiguous() 139 | 140 | dec_out = self.normalize_layers(dec_out, 'denorm') 141 | 142 | return dec_out[:, -self.pred_len:, :] 143 | -------------------------------------------------------------------------------- /ltsm/models/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | from math import sqrt 5 | from transformers.modeling_utils import PreTrainedModel, PretrainedConfig 6 | 7 | class Normalize(nn.Module): 8 | def __init__(self, num_features: int, eps=1e-5, affine=False, subtract_last=False, non_norm=False): 9 | """ 
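RevIN-style reversible instance normalization: 'norm' mode computes per-instance statistics and normalizes the input; 'denorm' mode reuses those statistics to map model outputs back to the original scale.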
10 | :param num_features: the number of features or channels 11 | :param eps: a value added for numerical stability 12 | :param affine: if True, RevIN has learnable affine parameters 13 | """ 14 | super(Normalize, self).__init__() 15 | self.num_features = num_features 16 | self.eps = eps 17 | self.affine = affine 18 | self.subtract_last = subtract_last 19 | self.non_norm = non_norm 20 | if self.affine: 21 | self._init_params() 22 | 23 | def forward(self, x, mode: str): 24 | if mode == 'norm': 25 | self._get_statistics(x) 26 | x = self._normalize(x) 27 | elif mode == 'denorm': 28 | x = self._denormalize(x) 29 | else: 30 | raise NotImplementedError 31 | return x 32 | 33 | def _init_params(self): 34 | # initialize RevIN params: (C,) 35 | self.affine_weight = nn.Parameter(torch.ones(self.num_features)) 36 | self.affine_bias = nn.Parameter(torch.zeros(self.num_features)) 37 | 38 | def _get_statistics(self, x): 39 | dim2reduce = tuple(range(1, x.ndim - 1)) 40 | if self.subtract_last: 41 | self.last = x[:, -1, :].unsqueeze(1) 42 | else: 43 | self.mean = torch.mean(x, dim=dim2reduce, keepdim=True).detach() 44 | self.stdev = torch.sqrt(torch.var(x, dim=dim2reduce, keepdim=True, unbiased=False) + self.eps).detach() 45 | 46 | def _normalize(self, x): 47 | if self.non_norm: 48 | return x 49 | if self.subtract_last: 50 | x = x - self.last 51 | else: 52 | x = x - self.mean 53 | x = x / self.stdev 54 | if self.affine: 55 | x = x * self.affine_weight 56 | x = x + self.affine_bias 57 | return x 58 | 59 | def _denormalize(self, x): 60 | if self.non_norm: 61 | return x 62 | if self.affine: 63 | x = x - self.affine_bias 64 | x = x / (self.affine_weight + self.eps * self.eps) 65 | x = x * self.stdev 66 | if self.subtract_last: 67 | x = x + self.last 68 | else: 69 | x = x + self.mean 70 | return x 71 | 72 | 73 | class FlattenHead(nn.Module): 74 | def __init__(self, n_vars, nf, target_window, head_dropout=0): 75 | super().__init__() 76 | self.n_vars = n_vars 77 | self.flatten = nn.Flatten(start_dim=-2) 78 | self.linear = nn.Linear(nf, target_window) 79 | self.dropout = nn.Dropout(head_dropout) 80 | 81 | def forward(self, x): 82 | x = self.flatten(x) 83 | x = self.linear(x) 84 | x = self.dropout(x) 85 | return x 86 | 87 | class ReprogrammingLayer(nn.Module): 88 | def __init__(self, d_model, n_heads, d_keys=None, d_llm=None, attention_dropout=0.1): 89 | super(ReprogrammingLayer, self).__init__() 90 | 91 | d_keys = d_keys or (d_model // n_heads) 92 | 93 | self.query_projection = nn.Linear(d_model, d_keys * n_heads) 94 | self.key_projection = nn.Linear(d_llm, d_keys * n_heads) 95 | self.value_projection = nn.Linear(d_llm, d_keys * n_heads) 96 | self.out_projection = nn.Linear(d_keys * n_heads, d_llm) 97 | self.n_heads = n_heads 98 | self.dropout = nn.Dropout(attention_dropout) 99 | 100 | def forward(self, target_embedding, source_embedding, value_embedding): 101 | B, L, _ = target_embedding.shape 102 | S, _ = source_embedding.shape 103 | H = self.n_heads 104 | 105 | target_embedding = self.query_projection(target_embedding).view(B, L, H, -1) 106 | source_embedding = self.key_projection(source_embedding).view(S, H, -1) 107 | value_embedding = self.value_projection(value_embedding).view(S, H, -1) 108 | 109 | out = self.reprogramming(target_embedding, source_embedding, value_embedding) 110 | 111 | out = out.reshape(B, L, -1) 112 | 113 | return self.out_projection(out) 114 | 115 | def reprogramming(self, target_embedding, source_embedding, value_embedding): 116 | B, L, H, E = target_embedding.shape 117 | 118 | scale = 
1. / sqrt(E) 119 | 120 | scores = torch.einsum("blhe,she->bhls", target_embedding, source_embedding) 121 | 122 | A = self.dropout(torch.softmax(scale * scores, dim=-1)) 123 | reprogramming_embedding = torch.einsum("bhls,she->blhe", A, value_embedding) 124 | 125 | return reprogramming_embedding 126 | 127 | 128 | 129 | def get_model(config): 130 | if config.model == 'LTSM_WordPrompt': 131 | from .ltsm_wordprompt import LTSM_WordPrompt 132 | model = LTSM_WordPrompt(config) 133 | elif config.model == 'LTSM_Tokenizer': 134 | from .ltsm_tokenizer import LTSM_Tokenizer 135 | model = LTSM_Tokenizer(config) 136 | else: 137 | from .ltsm_model import LTSM 138 | if config.local_pretrain == "None": 139 | model = LTSM(config) 140 | else: 141 | model_config = PretrainedConfig.from_pretrained(config.local_pretrain) 142 | model = LTSM.from_pretrained(config.local_pretrain, model_config) 143 | 144 | 145 | return model 146 | 147 | -------------------------------------------------------------------------------- /ltsm/utils/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datamllab/ltsm/91ee7775ee5dabfd4baa4bf8713ecd111560655d/ltsm/utils/.DS_Store -------------------------------------------------------------------------------- /ltsm/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datamllab/ltsm/91ee7775ee5dabfd4baa4bf8713ecd111560655d/ltsm/utils/__init__.py -------------------------------------------------------------------------------- /ltsm/utils/dist.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.spatial.distance import euclidean 3 | from fastdtw import fastdtw 4 | import torch 5 | 6 | def pairwise_dtw(x_batch, y_batch): 7 | """ 8 | 9 | Args: 10 | :param x_batch: Tensor, [ Batchsize, Time, Dimension_x ] 11 | :param y_batch: Tensor, [ Batchsize, Time, Dimension_y ] 12 | 13 | The input tensor should have Dimension_x == Dimension_y 14 | 15 | :return: Pair-wise Distance, Tensor, [ Batchsize, Batchsize ] 16 | """ 17 | 18 | batchsize_x = x_batch.shape[0] 19 | batchsize_y = y_batch.shape[0] 20 | dist_matrix = torch.zeros((batchsize_x, batchsize_y), device=torch.device("cpu")) 21 | for idx1, x in enumerate(x_batch): 22 | for idx2, y in enumerate(y_batch): 23 | if x_batch is y_batch and dist_matrix[idx2, idx1] > 0: 24 | dist_matrix[idx1, idx2] = dist_matrix[idx2, idx1] 25 | 26 | else: 27 | distance_xy, _ = fastdtw(x, y, dist=euclidean) 28 | dist_matrix[idx1, idx2] = distance_xy 29 | 30 | 31 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /ltsm/utils/metrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | def RSE(pred, true): 5 | return np.sqrt(np.sum((true - pred) ** 2)) / np.sqrt(np.sum((true - true.mean()) ** 2)) 6 | 7 | 8 | def CORR(pred, true): 9 | u = ((true - true.mean(0)) * (pred - pred.mean(0))).sum(0) 10 | d = np.sqrt(((true - true.mean(0)) ** 2 * (pred - pred.mean(0)) ** 2).sum(0)) 11 | return (u / d).mean(-1) 12 | 13 | 14 | def MAE(pred, true): 15 | return np.mean(np.abs(pred - true)) 16 | 17 | 18 | def MSE(pred, true): 19 | return np.mean((pred - true) ** 2) 20 | 21 | 22 | def RMSE(pred, true): 23 | return np.sqrt(MSE(pred, true)) 24 | 25 | 26 | def MAPE(pred, true): 27 | return np.mean(np.abs(100 * (pred - true) / (true +1e-8))) 28 | 29 | 30 | def 
MSPE(pred, true): 31 | return np.mean(np.square((pred - true) / (true + 1e-8))) 32 | 33 | def SMAPE(pred, true): 34 | return np.mean(200 * np.abs(pred - true) / (np.abs(pred) + np.abs(true) + 1e-8)) 35 | # return np.mean(200 * np.abs(pred - true) / (pred + true + 1e-8)) 36 | 37 | def ND(pred, true): 38 | return np.mean(np.abs(true - pred)) / np.mean(np.abs(true)) 39 | 40 | def metric(pred, true): 41 | mae = MAE(pred, true) 42 | mse = MSE(pred, true) 43 | rmse = RMSE(pred, true) 44 | mape = MAPE(pred, true) 45 | mspe = MSPE(pred, true) 46 | smape = SMAPE(pred, true) 47 | nd = ND(pred, true) 48 | 49 | return mae, mse, rmse, mape, mspe, smape, nd 50 | -------------------------------------------------------------------------------- /ltsm/utils/timefeatures.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from pandas.tseries import offsets 6 | from pandas.tseries.frequencies import to_offset 7 | 8 | 9 | class TimeFeature: 10 | def __init__(self): 11 | pass 12 | 13 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 14 | pass 15 | 16 | def __repr__(self): 17 | return self.__class__.__name__ + "()" 18 | 19 | 20 | class SecondOfMinute(TimeFeature): 21 | """Minute of hour encoded as value between [-0.5, 0.5]""" 22 | 23 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 24 | return index.second / 59.0 - 0.5 25 | 26 | 27 | class MinuteOfHour(TimeFeature): 28 | """Minute of hour encoded as value between [-0.5, 0.5]""" 29 | 30 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 31 | return index.minute / 59.0 - 0.5 32 | 33 | 34 | class HourOfDay(TimeFeature): 35 | """Hour of day encoded as value between [-0.5, 0.5]""" 36 | 37 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 38 | return index.hour / 23.0 - 0.5 39 | 40 | 41 | class DayOfWeek(TimeFeature): 42 | """Hour of day encoded as value between [-0.5, 0.5]""" 43 | 44 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 45 | return index.dayofweek / 6.0 - 0.5 46 | 47 | 48 | class DayOfMonth(TimeFeature): 49 | """Day of month encoded as value between [-0.5, 0.5]""" 50 | 51 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 52 | return (index.day - 1) / 30.0 - 0.5 53 | 54 | 55 | class DayOfYear(TimeFeature): 56 | """Day of year encoded as value between [-0.5, 0.5]""" 57 | 58 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 59 | return (index.dayofyear - 1) / 365.0 - 0.5 60 | 61 | 62 | class MonthOfYear(TimeFeature): 63 | """Month of year encoded as value between [-0.5, 0.5]""" 64 | 65 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 66 | return (index.month - 1) / 11.0 - 0.5 67 | 68 | 69 | class WeekOfYear(TimeFeature): 70 | """Week of year encoded as value between [-0.5, 0.5]""" 71 | 72 | def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: 73 | return (index.isocalendar().week - 1) / 52.0 - 0.5 74 | 75 | 76 | def time_features_from_frequency_str(freq_str: str) -> List[TimeFeature]: 77 | """ 78 | Returns a list of time features that will be appropriate for the given frequency string. 79 | Parameters 80 | ---------- 81 | freq_str 82 | Frequency string of the form [multiple][granularity] such as "12H", "5min", "1D" etc. 
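For example (illustrative): an hourly frequency string such as "1H" resolves to ``offsets.Hour``, so this function returns ``[HourOfDay(), DayOfWeek(), DayOfMonth(), DayOfYear()]``.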
83 | """ 84 | 85 | features_by_offsets = { 86 | offsets.YearEnd: [], 87 | offsets.QuarterEnd: [MonthOfYear], 88 | offsets.MonthEnd: [MonthOfYear], 89 | offsets.Week: [DayOfMonth, WeekOfYear], 90 | offsets.Day: [DayOfWeek, DayOfMonth, DayOfYear], 91 | offsets.BusinessDay: [DayOfWeek, DayOfMonth, DayOfYear], 92 | offsets.Hour: [HourOfDay, DayOfWeek, DayOfMonth, DayOfYear], 93 | offsets.Minute: [ 94 | MinuteOfHour, 95 | HourOfDay, 96 | DayOfWeek, 97 | DayOfMonth, 98 | DayOfYear, 99 | ], 100 | offsets.Second: [ 101 | SecondOfMinute, 102 | MinuteOfHour, 103 | HourOfDay, 104 | DayOfWeek, 105 | DayOfMonth, 106 | DayOfYear, 107 | ], 108 | } 109 | 110 | offset = to_offset(freq_str) 111 | 112 | for offset_type, feature_classes in features_by_offsets.items(): 113 | if isinstance(offset, offset_type): 114 | return [cls() for cls in feature_classes] 115 | 116 | supported_freq_msg = f""" 117 | Unsupported frequency {freq_str} 118 | The following frequencies are supported: 119 | Y - yearly 120 | alias: A 121 | M - monthly 122 | W - weekly 123 | D - daily 124 | B - business days 125 | H - hourly 126 | T - minutely 127 | alias: min 128 | S - secondly 129 | """ 130 | raise RuntimeError(supported_freq_msg) 131 | 132 | 133 | def time_features(dates, freq='h'): 134 | return np.vstack([feat(dates) for feat in time_features_from_frequency_str(freq)]) 135 | -------------------------------------------------------------------------------- /ltsm/utils/tools.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | import matplotlib.pyplot as plt 5 | from tqdm import tqdm 6 | from datetime import datetime 7 | from distutils.util import strtobool 8 | import pandas as pd 9 | 10 | from ltsm.utils.metrics import metric 11 | 12 | plt.switch_backend('agg') 13 | 14 | 15 | class dotdict(dict): 16 | """dot.notation access to dictionary attributes""" 17 | __getattr__ = dict.get 18 | __setattr__ = dict.__setitem__ 19 | __delattr__ = dict.__delitem__ 20 | 21 | 22 | class StandardScaler(): 23 | def __init__(self, mean, std): 24 | self.mean = mean 25 | self.std = std 26 | 27 | def transform(self, data): 28 | return (data - self.mean) / self.std 29 | 30 | def inverse_transform(self, data): 31 | return (data * self.std) + self.mean 32 | 33 | 34 | def visual(true, preds=None, name='./pic/test.pdf'): 35 | """ 36 | Results visualization 37 | """ 38 | plt.figure() 39 | plt.plot(true, label='GroundTruth', linewidth=2) 40 | if preds is not None: 41 | plt.plot(preds, label='Prediction', linewidth=2) 42 | plt.legend() 43 | plt.savefig(name, bbox_inches='tight') 44 | 45 | 46 | def convert_tsf_to_dataframe( 47 | full_file_path_and_name, 48 | replace_missing_vals_with="NaN", 49 | value_column_name="series_value", 50 | ): 51 | col_names = [] 52 | col_types = [] 53 | all_data = {} 54 | line_count = 0 55 | frequency = None 56 | forecast_horizon = None 57 | contain_missing_values = None 58 | contain_equal_length = None 59 | found_data_tag = False 60 | found_data_section = False 61 | started_reading_data_section = False 62 | 63 | print(full_file_path_and_name) 64 | with open(full_file_path_and_name, "r", encoding="cp1252") as file: 65 | for line in file: 66 | # Strip white space from start/end of line 67 | line = line.strip() 68 | 69 | if line: 70 | if line.startswith("@"): # Read meta-data 71 | if not line.startswith("@data"): 72 | line_content = line.split(" ") 73 | if line.startswith("@attribute"): 74 | if ( 75 | len(line_content) != 3 76 | ): # 
Attributes have both name and type 77 | raise Exception("Invalid meta-data specification.") 78 | 79 | col_names.append(line_content[1]) 80 | col_types.append(line_content[2]) 81 | else: 82 | if ( 83 | len(line_content) != 2 84 | ): # Other meta-data have only values 85 | raise Exception("Invalid meta-data specification.") 86 | 87 | if line.startswith("@frequency"): 88 | frequency = line_content[1] 89 | elif line.startswith("@horizon"): 90 | forecast_horizon = int(line_content[1]) 91 | elif line.startswith("@missing"): 92 | contain_missing_values = bool( 93 | strtobool(line_content[1]) 94 | ) 95 | elif line.startswith("@equallength"): 96 | contain_equal_length = bool(strtobool(line_content[1])) 97 | 98 | else: 99 | if len(col_names) == 0: 100 | raise Exception( 101 | "Missing attribute section. Attribute section must come before data." 102 | ) 103 | 104 | found_data_tag = True 105 | elif not line.startswith("#"): 106 | if len(col_names) == 0: 107 | raise Exception( 108 | "Missing attribute section. Attribute section must come before data." 109 | ) 110 | elif not found_data_tag: 111 | raise Exception("Missing @data tag.") 112 | else: 113 | if not started_reading_data_section: 114 | started_reading_data_section = True 115 | found_data_section = True 116 | all_series = [] 117 | 118 | for col in col_names: 119 | all_data[col] = [] 120 | 121 | full_info = line.split(":") 122 | 123 | if len(full_info) != (len(col_names) + 1): 124 | continue 125 | #raise Exception("Missing attributes/values in series.") 126 | 127 | series = full_info[len(full_info) - 1] 128 | series = series.split(",") 129 | 130 | if len(series) == 0: 131 | raise Exception( 132 | "A given series should contains a set of comma separated numeric values. At least one numeric value should be there in a series. Missing values should be indicated with ? symbol" 133 | ) 134 | 135 | numeric_series = [] 136 | 137 | for val in series: 138 | if val == "?": 139 | numeric_series.append(replace_missing_vals_with) 140 | else: 141 | numeric_series.append(float(val)) 142 | 143 | if numeric_series.count(replace_missing_vals_with) == len( 144 | numeric_series 145 | ): 146 | raise Exception( 147 | "All series values are missing. A given series should contains a set of comma separated numeric values. At least one numeric value should be there in a series." 148 | ) 149 | 150 | all_series.append(pd.Series(numeric_series).array) 151 | 152 | for i in range(len(col_names)): 153 | att_val = None 154 | if col_types[i] == "numeric": 155 | att_val = int(full_info[i]) 156 | elif col_types[i] == "string": 157 | att_val = str(full_info[i]) 158 | elif col_types[i] == "date": 159 | att_val = datetime.strptime( 160 | full_info[i], "%Y-%m-%d %H-%M-%S" 161 | ) 162 | else: 163 | raise Exception( 164 | "Invalid attribute type." 165 | ) # Currently, the code supports only numeric, string and date types. Extend this as required. 
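# att_val now holds the parsed value of the i-th attribute (numeric, string, or date) for the current series; the None check below is a defensive guard against unparsed values.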
166 | 167 | if att_val is None: 168 | raise Exception("Invalid attribute value.") 169 | else: 170 | all_data[col_names[i]].append(att_val) 171 | 172 | line_count = line_count + 1 173 | 174 | if line_count == 0: 175 | raise Exception("Empty file.") 176 | if len(col_names) == 0: 177 | raise Exception("Missing attribute section.") 178 | if not found_data_section: 179 | raise Exception("Missing series information under data section.") 180 | 181 | all_data[value_column_name] = all_series 182 | loaded_data = pd.DataFrame(all_data) 183 | 184 | # ipdb.set_trace() 185 | 186 | return ( 187 | loaded_data, 188 | frequency, 189 | forecast_horizon, 190 | contain_missing_values, 191 | contain_equal_length, 192 | ) 193 | 194 | 195 | def MASE(x, freq, pred, true): 196 | masep = np.mean(np.abs(x[:, freq:] - x[:, :-freq])) 197 | return np.mean(np.abs(pred - true) / (masep + 1e-8)) 198 | 199 | -------------------------------------------------------------------------------- /main_ltsm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch import nn 4 | import os 5 | import argparse 6 | import random 7 | import ipdb 8 | 9 | from ltsm.data_provider.data_factory import get_data_loaders, get_datasets,get_test_datasets 10 | from ltsm.data_provider.data_loader import HF_Dataset 11 | from ltsm.models import get_model, LTSMConfig 12 | from peft import get_peft_config, get_peft_model, LoraConfig 13 | 14 | from transformers import ( 15 | Trainer, 16 | TrainingArguments, 17 | EvalPrediction, 18 | set_seed, 19 | ) 20 | 21 | def get_args(): 22 | parser = argparse.ArgumentParser(description='LTSM') 23 | 24 | # Basic Config 25 | parser.add_argument('--model_id', type=str, default='test_run', help='model id') 26 | parser.add_argument('--model_name_or_path', type=str, default="gpt2-medium", help='model name') 27 | parser.add_argument('--seed', type=int, default=2024, help='random seed') 28 | parser.add_argument('--device', type=str, default="cuda:0") 29 | parser.add_argument('--checkpoints', type=str, default='./checkpoints/') 30 | 31 | # Data Settings 32 | parser.add_argument('--data_path', nargs='+', default='dataset/weather.csv', help='data files') 33 | parser.add_argument('--test_data_path_list', nargs='+', required=True, help='test data file') 34 | parser.add_argument('--prompt_data_path', type=str, default='./weather.csv', help='prompt data file') 35 | parser.add_argument('--data_processing', type=str, default="standard_scaler", help='data processing method') 36 | parser.add_argument('--train_ratio', type=float, default=0.7, help='train data ratio') 37 | parser.add_argument('--val_ratio', type=float, default=0.1, help='validation data ratio') 38 | 39 | # Forecasting Settings 40 | parser.add_argument('--seq_len', type=int, default=336, help='input sequence length') 41 | parser.add_argument('--pred_len', type=int, default=96, help='prediction sequence length') 42 | parser.add_argument('--prompt_len', type=int, default=133, help='prompt sequence length') 43 | 44 | # Model Settings 45 | parser.add_argument('--lora', action="store_true", help='use lora') 46 | parser.add_argument('--lora_dim', type=int, default=128, help='dimension of lora') 47 | parser.add_argument('--gpt_layers', type=int, default=3, help='number of gpt layers') 48 | parser.add_argument('--d_model', type=int, default=1024, help='dimension of model') 49 | parser.add_argument('--n_heads', type=int, default=16, help='number of heads') 50 | parser.add_argument('--d_ff', 
type=int, default=512, help='dimension of fcn') 51 | parser.add_argument('--dropout', type=float, default=0.2, help='dropout') 52 | parser.add_argument('--enc_in', type=int, default=1, help='encoder input size') 53 | parser.add_argument('--c_out', type=int, default=862, help='output size') 54 | parser.add_argument('--patch_size', type=int, default=16, help='patch size') 55 | parser.add_argument('--pretrain', type=int, default=1, help='is pretrain') 56 | parser.add_argument('--local_pretrain', type=str, default="None", help='local pretrain weight') 57 | parser.add_argument('--freeze', type=int, default=1, help='is model weight frozen') 58 | parser.add_argument('--model', type=str, default='model', help='model name, , options:[LTSM, LTSM_WordPrompt, LTSM_Tokenizer]') 59 | parser.add_argument('--stride', type=int, default=8, help='stride') 60 | parser.add_argument('--tmax', type=int, default=10, help='tmax') 61 | 62 | # Training Settings 63 | parser.add_argument('--eval', type=int, default=0, help='evaluation') 64 | parser.add_argument('--itr', type=int, default=1, help='experiments times') 65 | parser.add_argument('--output_dir', type=str, default='output/ltsm_train_lr0005/', help='output directory') 66 | parser.add_argument('--downsample_rate', type=int, default=100, help='downsample rate') 67 | parser.add_argument('--llm_layers', type=int, default=32) 68 | parser.add_argument('--decay_fac', type=float, default=0.75, help='decay factor') 69 | parser.add_argument('--learning_rate', type=float, default=0.0001, help='learning rate') 70 | parser.add_argument('--batch_size', type=int, default=512, help='batch size') 71 | parser.add_argument('--num_workers', type=int, default=10, help='number of workers') 72 | parser.add_argument('--train_epochs', type=int, default=1, help='number of epochs') 73 | parser.add_argument('--lradj', type=str, default='type1', help='learning rate adjustment type') 74 | parser.add_argument('--patience', type=int, default=3, help='early stopping patience') 75 | parser.add_argument('--gradient_accumulation_steps', type=int, default=64, help='gradient accumulation steps') 76 | args, unknown = parser.parse_known_args() 77 | 78 | return args 79 | 80 | 81 | def seed_all(fixed_seed): 82 | random.seed(fixed_seed) 83 | torch.manual_seed(fixed_seed) 84 | np.random.seed(fixed_seed) 85 | 86 | def freeze_parameters(model): 87 | 88 | freeze_param_buf = ["gpt2"] 89 | for n, p in model.named_parameters(): 90 | if any(fp in n for fp in freeze_param_buf): 91 | p.requires_grad = False 92 | print(f"{n} has been freeezed") 93 | 94 | trainable_param_buf = ["ln", "wpe", "in_layer", "out_layer", "lora"] 95 | for n, p in model.named_parameters(): 96 | if any(fp in n for fp in trainable_param_buf): 97 | p.requires_grad = True 98 | 99 | def print_trainable_parameters(model): 100 | for n, p in model.named_parameters(): 101 | if p.requires_grad: 102 | print(f"{n} is trainable...") 103 | 104 | 105 | def run(args): 106 | print(args) 107 | 108 | model_config = LTSMConfig(**vars(args)) 109 | model = get_model(model_config) 110 | 111 | if args.lora: 112 | peft_config = LoraConfig( 113 | target_modules=["c_attn"], 114 | inference_mode=False, 115 | r=args.lora_dim, 116 | lora_alpha=32, 117 | lora_dropout=0.1 118 | ) 119 | model = get_peft_model(model, peft_config) 120 | model.print_trainable_parameters() 121 | 122 | elif args.freeze: 123 | freeze_parameters(model) 124 | 125 | print_trainable_parameters(model) 126 | 127 | # Optimizer settings 128 | model_optim = torch.optim.Adam(model.parameters(), 
lr=args.learning_rate) 129 | lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(model_optim, T_max=args.tmax, eta_min=1e-8) 130 | 131 | # Evaluation metrics 132 | def compute_metrics(p: EvalPrediction): 133 | preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions 134 | preds = np.squeeze(preds) 135 | if preds.shape != p.label_ids.shape: 136 | label_ids = np.squeeze(p.label_ids) 137 | else: 138 | label_ids = p.label_ids 139 | return { 140 | "mse": ((preds - label_ids) ** 2).mean().item(), 141 | "mae": (np.abs(preds - label_ids)).mean().item() 142 | } 143 | 144 | # Loss function 145 | def compute_loss(model, inputs, return_outputs=False): 146 | outputs = model(inputs["input_data"]) 147 | loss = nn.functional.mse_loss(outputs, inputs["labels"]) 148 | return (loss, outputs) if return_outputs else loss 149 | 150 | # Data collator 151 | def collate_fn(batch): 152 | return { 153 | 'input_data': torch.from_numpy(np.stack([x['input_data'] for x in batch])).type(torch.float32), 154 | 'labels': torch.from_numpy(np.stack([x['labels'] for x in batch])).type(torch.float32), 155 | } 156 | 157 | # Prediction step 158 | @torch.no_grad() 159 | def prediction_step(model, inputs, prediction_loss_only=False, ignore_keys=None): 160 | # CSV 161 | input_data = inputs["input_data"].to(model.module.device) 162 | labels = inputs["labels"].to(model.module.device) 163 | outputs = model(input_data) 164 | loss = nn.functional.mse_loss(outputs, labels) 165 | return (loss, outputs, labels) 166 | 167 | # Training settings 168 | training_args = TrainingArguments( 169 | output_dir=args.output_dir, 170 | per_device_train_batch_size=args.batch_size, 171 | per_device_eval_batch_size=args.batch_size, 172 | evaluation_strategy="steps", 173 | num_train_epochs=args.train_epochs, 174 | fp16=False, 175 | save_steps=100, 176 | eval_steps=25, 177 | logging_steps=5, 178 | learning_rate=args.learning_rate, 179 | gradient_accumulation_steps=args.gradient_accumulation_steps, 180 | save_total_limit=10, 181 | remove_unused_columns=False, 182 | push_to_hub=False, 183 | load_best_model_at_end=True, 184 | ) 185 | 186 | train_dataset, eval_dataset, _ = get_datasets(args) 187 | train_dataset, eval_dataset= HF_Dataset(train_dataset), HF_Dataset(eval_dataset) 188 | 189 | trainer = Trainer( 190 | model=model, 191 | args=training_args, 192 | data_collator=collate_fn, 193 | compute_metrics=compute_metrics, 194 | train_dataset=train_dataset, 195 | eval_dataset=eval_dataset, 196 | tokenizer=None, 197 | optimizers=(model_optim, lr_scheduler), 198 | ) 199 | 200 | # Overload the trainer API 201 | if not args.eval: 202 | trainer.compute_loss = compute_loss 203 | trainer.prediction_step = prediction_step 204 | train_results = trainer.train() 205 | trainer.save_model() 206 | trainer.log_metrics("train", train_results.metrics) 207 | trainer.save_metrics("train", train_results.metrics) 208 | trainer.save_state() 209 | 210 | # Testing settings 211 | for data_path in args.test_data_path_list: 212 | trainer.compute_loss = compute_loss 213 | trainer.prediction_step = prediction_step 214 | args.test_data_path = data_path 215 | test_dataset, _ = get_test_datasets(args) 216 | test_dataset = HF_Dataset(test_dataset) 217 | 218 | metrics = trainer.evaluate(test_dataset) 219 | trainer.log_metrics("Test", metrics) 220 | trainer.save_metrics("Test", metrics) 221 | 222 | 223 | if __name__ == "__main__": 224 | args = get_args() 225 | seed_all(args.seed) 226 | run(args) 
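# A minimal launch sketch, assuming a local copy of dataset/weather.csv and the normalized prompt bank
# (paths are illustrative; scripts/train_ltsm_csv.sh holds the project's reference commands). Only flags
# defined in get_args() above are used:
#
#   python3 main_ltsm.py \
#       --model LTSM \
#       --model_name_or_path gpt2-medium \
#       --data_path dataset/weather.csv \
#       --test_data_path_list dataset/weather.csv \
#       --prompt_data_path prompt_bank/prompt_data_normalize_split/train \
#       --train_epochs 2 --batch_size 32 --gradient_accumulation_steps 64 \
#       --learning_rate 1e-3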
-------------------------------------------------------------------------------- /main_tokenizer.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch import nn 4 | import os 5 | import argparse 6 | import random 7 | import sys 8 | 9 | sys.path.append("/home/yc146/github_open_ltsm/ltsm") 10 | 11 | from ltsm.data_provider.data_factory import get_datasets,get_test_datasets 12 | from ltsm.data_provider.data_loader import HF_Dataset 13 | from ltsm.data_provider.data_processing.tokenizer_processor import TokenizerConfig 14 | from ltsm.models import get_model, LTSMConfig 15 | from peft import get_peft_model, LoraConfig 16 | 17 | from transformers import ( 18 | Trainer, 19 | TrainingArguments, 20 | EvalPrediction, 21 | set_seed, 22 | ) 23 | def get_args(): 24 | parser = argparse.ArgumentParser(description='LTSM') 25 | 26 | # Basic Config 27 | parser.add_argument('--model_id', type=str, default='test_run', help='model id') 28 | parser.add_argument('--model_name_or_path', type=str, default="gpt2-medium", help='model name') 29 | parser.add_argument('--seed', type=int, default=2024, help='random seed') 30 | parser.add_argument('--device', type=str, default="cuda:0") 31 | parser.add_argument('--checkpoints', type=str, default='./checkpoints/') 32 | 33 | # Data Settings 34 | parser.add_argument('--data_path', nargs='+', default='dataset/weather.csv', help='data files') 35 | parser.add_argument('--test_data_path', type=str, default='dataset/weather.csv', help='test data file') 36 | parser.add_argument('--test_data_path_list', nargs='+', required=True, help='test data file') 37 | parser.add_argument('--prompt_data_path', type=str, default='./weather.csv', help='prompt data file') 38 | parser.add_argument('--data_processing', type=str, default="standard_scaler", help='data processing method') 39 | parser.add_argument('--train_ratio', type=float, default=0.7, help='train data ratio') 40 | parser.add_argument('--val_ratio', type=float, default=0.1, help='validation data ratio') 41 | 42 | # Forecasting Settings 43 | parser.add_argument('--seq_len', type=int, default=336, help='input sequence length') 44 | parser.add_argument('--pred_len', type=int, default=96, help='prediction sequence length') 45 | parser.add_argument('--prompt_len', type=int, default=133, help='prompt sequence length') 46 | 47 | 48 | # Model Settings 49 | parser.add_argument('--lora', action="store_true", help='use lora') 50 | parser.add_argument('--lora_dim', type=int, default=128, help='dimension of lora') 51 | parser.add_argument('--gpt_layers', type=int, default=3, help='number of gpt layers') 52 | parser.add_argument('--d_model', type=int, default=1024, help='dimension of model') 53 | parser.add_argument('--n_heads', type=int, default=16, help='number of heads') 54 | parser.add_argument('--d_ff', type=int, default=512, help='dimension of fcn') 55 | parser.add_argument('--dropout', type=float, default=0.2, help='dropout') 56 | parser.add_argument('--enc_in', type=int, default=1, help='encoder input size') 57 | parser.add_argument('--c_out', type=int, default=862, help='output size') 58 | parser.add_argument('--patch_size', type=int, default=16, help='patch size') 59 | parser.add_argument('--pretrain', type=int, default=1, help='is pretrain') 60 | parser.add_argument('--local_pretrain', type=str, default="None", help='local pretrain weight') 61 | parser.add_argument('--freeze', type=int, default=1, help='is model weight frozen') 62 | 
parser.add_argument('--model', type=str, default='model', help='model name, , options:[LTSM, LTSM_WordPrompt, LTSM_Tokenizer]') 63 | parser.add_argument('--stride', type=int, default=8, help='stride') 64 | parser.add_argument('--tmax', type=int, default=10, help='tmax') 65 | 66 | # Training Settings 67 | parser.add_argument('--eval', type=int, default=0, help='evaluation') 68 | parser.add_argument('--itr', type=int, default=1, help='experiments times') 69 | parser.add_argument('--output_dir', type=str, default='output/ltsm_train_lr0005/', help='output directory') 70 | parser.add_argument('--downsample_rate', type=int, default=100, help='downsample rate') 71 | parser.add_argument('--llm_layers', type=int, default=32) 72 | parser.add_argument('--decay_fac', type=float, default=0.75, help='decay factor') 73 | parser.add_argument('--learning_rate', type=float, default=0.0001, help='learning rate') 74 | parser.add_argument('--batch_size', type=int, default=512, help='batch size') 75 | parser.add_argument('--num_workers', type=int, default=10, help='number of workers') 76 | parser.add_argument('--train_epochs', type=int, default=1, help='number of epochs') 77 | parser.add_argument('--lradj', type=str, default='type1', help='learning rate adjustment type') 78 | parser.add_argument('--patience', type=int, default=3, help='early stopping patience') 79 | parser.add_argument('--gradient_accumulation_steps', type=int, default=64, help='gradient accumulation steps') 80 | args, unknown = parser.parse_known_args() 81 | 82 | return args 83 | 84 | 85 | def seed_all(fixed_seed): 86 | random.seed(fixed_seed) 87 | torch.manual_seed(fixed_seed) 88 | np.random.seed(fixed_seed) 89 | 90 | def freeze_parameters(model): 91 | 92 | freeze_param_buf = ["gpt2"] 93 | for n, p in model.named_parameters(): 94 | if any(fp in n for fp in freeze_param_buf): 95 | p.requires_grad = False 96 | print(f"{n} has been freeezed") 97 | 98 | trainable_param_buf = ["ln", "wpe", "in_layer", "out_layer", "lora"] 99 | for n, p in model.named_parameters(): 100 | if any(fp in n for fp in trainable_param_buf): 101 | p.requires_grad = True 102 | 103 | def print_trainable_parameters(model): 104 | for n, p in model.named_parameters(): 105 | if p.requires_grad: 106 | print(f"{n} is trainable...") 107 | 108 | def run(args): 109 | print(args) 110 | model_config = LTSMConfig(**vars(args)) 111 | model = get_model(model_config) 112 | 113 | if args.lora: 114 | peft_config = LoraConfig( 115 | target_modules=["c_attn"], # ["q", "v"], 116 | inference_mode=False, 117 | r=args.lora_dim, 118 | lora_alpha=32, 119 | lora_dropout=0.1 120 | ) 121 | model = get_peft_model(model, peft_config) 122 | model.print_trainable_parameters() 123 | 124 | elif args.freeze: 125 | freeze_parameters(model) 126 | 127 | print_trainable_parameters(model) 128 | 129 | 130 | model_optim = torch.optim.Adam(model.parameters(), lr=args.learning_rate) 131 | lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(model_optim, T_max=args.tmax, eta_min=1e-8) 132 | 133 | # Load Tokenizer Config, Reference: https://github.com/amazon-science/chronos-forecasting 134 | context_length = args.seq_len+args.pred_len 135 | prediction_length = args.pred_len 136 | n_tokens = 1024 137 | n_special_tokens = 2 138 | config = TokenizerConfig( 139 | tokenizer_class="MeanScaleUniformBins", 140 | tokenizer_kwargs=dict(low_limit=-3.0, high_limit=3.0), 141 | n_tokens=n_tokens, 142 | n_special_tokens=n_special_tokens, 143 | pad_token_id=0, 144 | eos_token_id=1, 145 | use_eos_token=0, 146 | model_type="causal", 
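# context_length spans the full window handed to the tokenizer (seq_len + pred_len) and prediction_length is the forecast horizon; token ids are mapped back to real values with tokenizer.output_transform in prediction_step below.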
147 | context_length=context_length, 148 | prediction_length=prediction_length, 149 | num_samples=20, 150 | temperature=1.0, 151 | top_k=50, 152 | top_p=1.0, 153 | ) 154 | 155 | tokenizer = config.create_tokenizer() 156 | 157 | def compute_metrics(p: EvalPrediction): 158 | preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions 159 | preds = np.squeeze(preds) 160 | if preds.shape != p.label_ids.shape: 161 | label_ids = np.squeeze(p.label_ids) 162 | else: 163 | label_ids = p.label_ids 164 | return { 165 | "mse": ((preds - label_ids) ** 2).mean().item(), 166 | "mae": (np.abs(preds - label_ids)).mean().item()} 167 | 168 | def compute_loss(model, inputs, return_outputs=False): 169 | outputs = model(inputs["input_data"]) 170 | B, L, M, _ = outputs.shape 171 | loss = nn.functional.cross_entropy(outputs.reshape(B*L,-1), inputs["labels"][:,1:].long().reshape(B*L)) 172 | return (loss, outputs) if return_outputs else loss 173 | 174 | def collate_fn(batch): 175 | return { 176 | 'input_data': torch.from_numpy(np.stack([x['input_data'] for x in batch])).type(torch.float32), 177 | 'labels': torch.from_numpy(np.stack([x['labels'] for x in batch])).type(torch.float32), 178 | } 179 | 180 | @torch.no_grad() 181 | def prediction_step(model, inputs, prediction_loss_only=False, ignore_keys=None): 182 | input_data = inputs["input_data"].to(model.module.device) 183 | labels = inputs["labels"].to(model.module.device) 184 | scale = labels[:,0] 185 | labels = labels[:,1:] 186 | outputs = model(input_data) 187 | indices = torch.max(outputs, dim=-1).indices 188 | 189 | output_value = tokenizer.output_transform(indices, scale) 190 | label_value = tokenizer.output_transform(labels.unsqueeze(-1).long(), scale) 191 | loss = nn.functional.mse_loss(output_value, label_value) 192 | return (loss, output_value, label_value) 193 | 194 | 195 | training_args = TrainingArguments( 196 | output_dir=args.output_dir, 197 | per_device_train_batch_size=args.batch_size, 198 | per_device_eval_batch_size=args.batch_size, 199 | evaluation_strategy="steps", 200 | num_train_epochs=args.train_epochs, 201 | fp16=False, 202 | save_steps=100, 203 | eval_steps=25, 204 | logging_steps=5, 205 | learning_rate=args.learning_rate, 206 | gradient_accumulation_steps=args.gradient_accumulation_steps, 207 | save_total_limit=10, 208 | remove_unused_columns=False, 209 | push_to_hub=False, 210 | load_best_model_at_end=True, 211 | ) 212 | 213 | # Training settings 214 | train_dataset, eval_dataset, _ = get_datasets(args) 215 | train_dataset, eval_dataset= HF_Dataset(train_dataset), HF_Dataset(eval_dataset) 216 | 217 | trainer = Trainer( 218 | model=model, 219 | args=training_args, 220 | data_collator=collate_fn, 221 | compute_metrics=compute_metrics, 222 | train_dataset=train_dataset, 223 | eval_dataset=eval_dataset, 224 | tokenizer=None, 225 | optimizers=(model_optim, lr_scheduler), 226 | ) 227 | 228 | # Overload the trainer API 229 | if not args.eval: 230 | trainer.compute_loss = compute_loss 231 | trainer.prediction_step = prediction_step 232 | train_results = trainer.train() 233 | trainer.save_model() 234 | trainer.log_metrics("train", train_results.metrics) 235 | trainer.save_metrics("train", train_results.metrics) 236 | trainer.save_state() 237 | 238 | # Testing settings 239 | for data_path in args.test_data_path_list: 240 | trainer.compute_loss = compute_loss 241 | trainer.prediction_step = prediction_step 242 | args.test_data_path = data_path 243 | test_dataset, _ = get_test_datasets(args) 244 | test_dataset = 
HF_Dataset(test_dataset) 245 | 246 | metrics = trainer.evaluate(test_dataset) 247 | trainer.log_metrics("Test", metrics) 248 | trainer.save_metrics("Test", metrics) 249 | 250 | 251 | if __name__ == "__main__": 252 | args = get_args() 253 | seed_all(args.seed) 254 | run(args) 255 | -------------------------------------------------------------------------------- /prompt_bank/prompt_data_normalize_split/README.md: -------------------------------------------------------------------------------- 1 | # Time Series Prompt Dataset -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/README.md: -------------------------------------------------------------------------------- 1 | # Time Series Prompt Generator 2 | 3 | 4 | Time series prompts are designed to capture the extensive characteristics of time series data comprehensively. These prompts, distinct from text-based ones, are created by extracting a wide range of global features from the entire training dataset. This method ensures a robust representation of the underlying dynamics, essential for boosting model performance. 5 | 6 | ## Quick Start 7 | **Step 1.** Download the dataset from our [Google Drive](). Make sure your local data folder like this: 8 | ````angular2html 9 | - ltsm/ 10 | - datasets/ 11 | electricity/ 12 | ETT-small/ 13 | exchange_rate/ 14 | illness/ 15 | traffic/ 16 | weather/ 17 | ... 18 | ```` 19 | 20 | **Step 2.** Generating the time series prompts from training, validating, and testing datasets 21 | ````angular2html 22 | python3 prompt_generate_split.py 23 | ```` 24 | 25 | **Step 3.** Find the generated time series prompts in the './prompt_data_split' folder. Then run the following command for normalizing the prompts: 26 | ````angular2html 27 | python3 prompt_normalization_split.py --mode fit 28 | ```` 29 | 30 | **Step 4.** Run this command to export the prompts to the "./prompt_data_normalize_split" folder: 31 | ````angular2html 32 | python3 prompt_normalization_split.py --mode transform 33 | ```` -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/prompt_generate_split.py: -------------------------------------------------------------------------------- 1 | # from ltsm.data_provider.data_factory import get_data_loader, get_data_loaders, get_dataset 2 | import argparse 3 | import ipdb 4 | import pandas as pd 5 | import numpy as np 6 | import tsfel 7 | from pandas import read_csv, read_feather 8 | import matplotlib.pyplot as plt 9 | import sys, os 10 | import torch 11 | 12 | 13 | def get_args(): 14 | parser = argparse.ArgumentParser(description='LTSM') 15 | 16 | parser.add_argument('--data_path', type=str, default='dataset/weather.csv') 17 | parser.add_argument('--data', type=str, default='custom') 18 | parser.add_argument('--freq', type=str, default="h") 19 | parser.add_argument('--target', type=str, default='OT') 20 | parser.add_argument('--embed', type=str, default='timeF') 21 | parser.add_argument('--percent', type=int, default=10) 22 | parser.add_argument('--batch_size', type=int, default=512) 23 | parser.add_argument('--max_len', type=int, default=-1) 24 | parser.add_argument('--seq_len', type=int, default=512) 25 | parser.add_argument('--pred_len', type=int, default=96) 26 | parser.add_argument('--label_len', type=int, default=48) 27 | parser.add_argument('--features', type=str, default='M') 28 | 29 | args = parser.parse_args() 30 | 31 | return args 32 | 33 | def prompt_prune(pt): 34 | pt_dict = pt.to_dict() 35 | 
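# Remove the '0_FFT mean coefficient*' columns emitted by tsfel; the remaining statistics form the fixed-length (133-feature) prompts assembled in prompt_generation below.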
pt_keys = list(pt_dict.keys()) 36 | for key in pt_keys: 37 | if key.startswith("0_FFT mean coefficient"): 38 | del pt[key] 39 | 40 | return pt 41 | 42 | 43 | def prompt_generation_single(ts): 44 | cfg = tsfel.get_features_by_domain() 45 | prompt = tsfel.time_series_features_extractor(cfg, ts) 46 | prompt = prompt_prune(prompt) 47 | return prompt 48 | 49 | def prompt_generation(ts, ts_name): 50 | 51 | print(ts.shape) 52 | 53 | if ts.shape[1] == 1: 54 | 55 | return None 56 | 57 | else: 58 | 59 | column_name = [name.replace("/", "-") for name in list(ts.columns)] 60 | prompt_buf_train = pd.DataFrame(np.zeros((133, ts.shape[1])), columns=column_name) 61 | prompt_buf_val = pd.DataFrame(np.zeros((133, ts.shape[1])), columns=column_name) 62 | prompt_buf_test = pd.DataFrame(np.zeros((133, ts.shape[1])), columns=column_name) 63 | for index, col in ts.T.iterrows(): 64 | if "ETT" in ts_name: 65 | ts_len = len(ts) 66 | t1, t2 = int(0.6*ts_len), int(0.6*ts_len) + int(0.2*ts_len) 67 | ts_train, ts_val, ts_test = col[:t1], col[t1:t2].reset_index(drop=True), col[t2:].reset_index(drop=True) 68 | else: 69 | ts_len = len(ts) 70 | t1, t2 = int(0.7 * ts_len), int(0.7 * ts_len) + int(0.1 * ts_len) 71 | ts_train, ts_val, ts_test = col[:t1], col[t1:t2].reset_index(drop=True), col[t2:].reset_index(drop=True) 72 | 73 | prompt_train = prompt_generation_single(ts_train) 74 | prompt_val = prompt_generation_single(ts_val) 75 | prompt_test = prompt_generation_single(ts_test) 76 | 77 | prompt_buf_train[index.replace("/", "-")] = prompt_train.T.values 78 | prompt_buf_val[index.replace("/", "-")] = prompt_val.T.values 79 | prompt_buf_test[index.replace("/", "-")] = prompt_test.T.values 80 | 81 | prompt_buf_total = {"train": prompt_buf_train, "val": prompt_buf_val, "test": prompt_buf_test} 82 | print(prompt_buf_total) 83 | return prompt_buf_total 84 | 85 | 86 | def prompt_save(prompt_buf, output_path): 87 | 88 | print(prompt_buf["train"]) 89 | if prompt_buf["train"].shape[1] == 1: 90 | # ipdb.set_trace() 91 | return None 92 | 93 | # prompt_train_fname = os.path.join(prompt_train_data_dir, data_name + "_prompt.pth.tar") 94 | # prompt_train = prompt_buf["train"] 95 | # print("Export", prompt_train_fname, prompt_train.shape) 96 | # 97 | # prompt_val_fname = os.path.join(prompt_val_data_dir, data_name + "_prompt.pth.tar") 98 | # prompt_val = prompt_buf["val"] 99 | # torch.save(prompt_val, prompt_val_fname) 100 | # print("Export", prompt_val_fname, prompt_val.shape) 101 | # 102 | # prompt_test_fname = os.path.join(prompt_test_data_dir, data_name + "_prompt.pth.tar") 103 | # prompt_test = prompt_buf["test"] 104 | # torch.save(prompt_test, prompt_test_fname) 105 | # print("Export", prompt_test_fname, prompt_test.shape) 106 | 107 | else: 108 | 109 | for index, col in prompt_buf["train"].T.iterrows(): 110 | 111 | prompt_train_fname = os.path.join(output_path, "train", data_name + "_" + index + "_prompt.pth.tar") 112 | prompt_train = col 113 | prompt_train.columns = [index] 114 | prompt_train = prompt_train.T 115 | torch.save(prompt_train, prompt_train_fname) 116 | print("Export", prompt_train_fname, prompt_train.shape) 117 | 118 | for index, col in prompt_buf["val"].T.iterrows(): 119 | prompt_val_fname = os.path.join(output_path, "val", data_name + "_" + index + "_prompt.pth.tar") 120 | prompt_val = col 121 | prompt_val.columns = [index] 122 | prompt_val = prompt_val.T 123 | torch.save(prompt_val, prompt_val_fname) 124 | print("Export", prompt_val_fname, prompt_val.shape) 125 | 126 | for index, col in 
prompt_buf["test"].T.iterrows(): 127 | prompt_test_fname = os.path.join(output_path, "test", data_name + "_" + index + "_prompt.pth.tar") 128 | prompt_test = col 129 | prompt_test.columns = [index] 130 | prompt_test = prompt_test.T 131 | torch.save(prompt_test, prompt_test_fname) 132 | print("Export", prompt_test_fname, prompt_test.shape) 133 | 134 | 135 | def data_import(path, format="feather"): 136 | 137 | if format == "feather": 138 | data = read_feather(path) 139 | data_name = path.replace(root_path, "").replace(".feather", "") 140 | data_dir = data_name[0:data_name.rfind("/")] 141 | # ipdb.set_trace() 142 | data = data.value 143 | 144 | else: 145 | data = read_csv(path) 146 | data_name = path.replace(root_path, "").replace(".csv", "") 147 | data_dir = data_name[0:data_name.rfind("/")] 148 | if "date" in data.columns: 149 | data = data.drop("date", axis=1) 150 | # print(data) 151 | # data = data.value 152 | 153 | 154 | return data, data_name, data_dir 155 | 156 | 157 | def create_data_dir(dir_name): 158 | # prompt_dir = 159 | if not os.path.exists(dir_name): 160 | os.mkdir(dir_name) 161 | 162 | 163 | if __name__ == "__main__": 164 | 165 | root_path = "./datasets/" 166 | output_path = "./prompt_bank/stat-prompt/prompt_data_split/" 167 | 168 | 169 | dataset_name = [ 170 | "electricity", 171 | "ETT-small", 172 | "exchange_rate", 173 | "illness", 174 | "traffic", 175 | "weather", 176 | ] 177 | 178 | dataset_fullname = [os.path.join(root_path, name) for name in dataset_name] 179 | data_path_buf = [] 180 | for dataset_dir in dataset_fullname: 181 | for root, dirs, files in os.walk(dataset_dir): 182 | for file_name in files: 183 | if file_name.endswith(".csv"): 184 | file_path = os.path.join(root, file_name) 185 | data_path_buf.append(file_path) 186 | 187 | print(data_path_buf) 188 | create_data_dir(output_path) 189 | # ipdb.set_trace() 190 | 191 | for path_idx, path in enumerate(data_path_buf): 192 | 193 | # print(path) 194 | 195 | data, data_name, data_dir = data_import(path, "csv") 196 | # print("Data Shape:", data.shape) 197 | if data.shape[0] < 20: 198 | print(path, "Skip too short time-series data.", data.shape) 199 | continue 200 | else: 201 | print("Import", path, "data shape", data.shape) 202 | 203 | create_data_dir(os.path.join(output_path, "train")) 204 | create_data_dir(os.path.join(output_path, "val")) 205 | create_data_dir(os.path.join(output_path, "test")) 206 | create_data_dir(os.path.join(output_path, "train", data_dir)) 207 | create_data_dir(os.path.join(output_path, "val", data_dir)) 208 | create_data_dir(os.path.join(output_path, "test", data_dir)) 209 | 210 | prompt_data_buf = prompt_generation(data, data_name) 211 | if prompt_data_buf is not None: 212 | prompt_save(prompt_data_buf, output_path) 213 | 214 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/prompt_normalization_split.py: -------------------------------------------------------------------------------- 1 | # from ltsm.data_provider.data_factory import get_data_loader, get_data_loaders, get_dataset 2 | import argparse 3 | import ipdb 4 | import pandas as pd 5 | import numpy as np 6 | # import tsfel 7 | from pandas import read_csv, read_feather 8 | import matplotlib.pyplot as plt 9 | import sys, os 10 | import torch 11 | from sklearn.preprocessing import StandardScaler 12 | 13 | 14 | def get_args(): 15 | parser = argparse.ArgumentParser(description='LTSM') 16 | parser.add_argument('--mode', choices=["fit", "transform"], required=True) 17 | args = 
parser.parse_args() 18 | 19 | return args 20 | 21 | 22 | def prompt_generation(ts): 23 | cfg = tsfel.get_features_by_domain() 24 | prompt = tsfel.time_series_features_extractor(cfg, ts) 25 | return prompt 26 | 27 | 28 | def prompt_prune(pt): 29 | pt_dict = pt.to_dict() 30 | pt_keys = list(pt_dict.keys()) 31 | for key in pt_keys: 32 | if type(key) == type("abc") and key.startswith("0_FFT mean coefficient"): 33 | del pt[key] 34 | 35 | return pt 36 | 37 | 38 | def mean_std_export_ds(data_path_buf, normalize_param_fname): 39 | prompt_data_buf = [] 40 | output_dir_buf = [] 41 | output_path_buf = [] 42 | for index, dataset_path in enumerate(data_path_buf): 43 | prompt_data = torch.load(dataset_path) 44 | prompt_data = prompt_prune(prompt_data) 45 | # print(prompt_data) 46 | prompt_data_buf.append(prompt_data) 47 | 48 | data_name = dataset_path.replace(root_path, "").replace(".csv", "") 49 | data_dir = data_name[0:data_name.rfind("/")] 50 | prompt_dir = os.path.join(output_path, data_dir) 51 | prompt_fname = os.path.join(output_path, data_name) 52 | # print(prompt_fname) 53 | output_dir_buf.append(prompt_dir) 54 | output_path_buf.append(prompt_fname) 55 | print("Import from {}".format(dataset_path), prompt_data.shape) 56 | # ipdb.set_trace() 57 | 58 | prompt_data_all = pd.concat(prompt_data_buf, axis=1).T 59 | print(prompt_data_all) 60 | 61 | scaler = StandardScaler() 62 | scaler.fit(prompt_data_all) 63 | 64 | sc_mean = pd.DataFrame(scaler.mean_.reshape(1,-1), columns=prompt_data_all.keys()) 65 | sc_scale = pd.DataFrame(scaler.scale_.reshape(1,-1), columns=prompt_data_all.keys()) 66 | 67 | print({"mean": sc_mean, "scale": sc_scale}) 68 | print("Save the mean and std to {}".format(normalize_param_fname)) 69 | torch.save({"mean": sc_mean, "scale": sc_scale}, normalize_param_fname) 70 | 71 | 72 | def standardscale_export(data_path_buf, params_fname, output_path, root_path): 73 | 74 | params = torch.load(params_fname) 75 | mean, std = params["mean"], params["scale"] 76 | scaler = StandardScaler() 77 | scaler.mean_ = mean 78 | scaler.scale_ = std 79 | # ipdb.set_trace() 80 | 81 | for index, dataset_path in enumerate(data_path_buf): 82 | prompt_data_raw = torch.load(dataset_path) 83 | prompt_data_raw = prompt_prune(prompt_data_raw) 84 | 85 | prompt_data = scaler.transform(prompt_data_raw.values.reshape(1, -1)) 86 | prompt_data_array = prompt_data 87 | # print(prompt_data) 88 | prompt_data_array[np.isnan(prompt_data_array)] = 0 89 | prompt_data_transform = pd.DataFrame(prompt_data_array, columns=prompt_data.keys()) 90 | # ipdb.set_trace() 91 | 92 | prompt_fname = dataset_path.replace(root_path, output_path) 93 | prompt_dir = prompt_fname[0:prompt_fname.rfind("/")] 94 | if not os.path.exists(prompt_dir): 95 | os.mkdir(prompt_dir) 96 | 97 | torch.save(prompt_data_transform, prompt_fname) 98 | print("Save to {}".format(prompt_fname)) 99 | del prompt_data 100 | 101 | 102 | def create_data_dir(dir_name): 103 | # prompt_dir = 104 | if not os.path.exists(dir_name): 105 | os.mkdir(dir_name) 106 | 107 | if __name__ == "__main__": 108 | 109 | root_path_train = "./prompt_bank/stat-prompt/prompt_data_split/train" 110 | output_path_train = "./prompt_bank/stat-prompt/prompt_data_normalize_split/train" 111 | root_path_val = "./prompt_bank/stat-prompt/prompt_data_split/val" 112 | output_path_val = "./prompt_bank/stat-prompt/prompt_data_normalize_split/val" 113 | root_path_test = "./prompt_bank/stat-prompt/prompt_data_split/test" 114 | output_path_test = "./prompt_bank/stat-prompt/prompt_data_normalize_split/test" 115 | 
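# Two-pass flow: '--mode fit' samples up to ds_size prompt files per dataset and saves the StandardScaler mean/scale for each split; '--mode transform' then standardizes every prompt with those saved parameters and writes the results under prompt_data_normalize_split.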
# normalize_param_fname = os.path.join(output_path, "normalization_params.pth.tar") 116 | ds_size = 50 117 | mode = get_args().mode # "transform" # "fit" # 118 | 119 | data_path_buf = { 120 | "train": {"root_path": root_path_train, "output_path": output_path_train, "normalize_param_fname": os.path.join(output_path_train, "normalization_params.pth.tar")}, 121 | "val": {"root_path": root_path_val, "output_path": output_path_val, "normalize_param_fname": os.path.join(output_path_val, "normalization_params.pth.tar")}, 122 | "test": {"root_path": root_path_test, "output_path": output_path_test, "normalize_param_fname": os.path.join(output_path_test, "normalization_params.pth.tar")}, 123 | } 124 | 125 | 126 | dataset_name = [ 127 | "electricity", 128 | "ETT-small", 129 | "exchange_rate", 130 | "illness", 131 | "traffic", 132 | "weather", 133 | ] 134 | 135 | for split_name, data_path in data_path_buf.items(): 136 | root_path = data_path_buf[split_name]["root_path"] 137 | output_path = data_path_buf[split_name]["output_path"] 138 | normalize_param_fname = data_path_buf[split_name]["normalize_param_fname"] 139 | 140 | create_data_dir(output_path) 141 | 142 | dataset_fullname = [os.path.join(root_path, name) for name in dataset_name] 143 | data_path_buf_tmp = [] 144 | if mode == "fit": 145 | 146 | for dataset_dir in dataset_fullname: 147 | paths = os.listdir(dataset_dir) 148 | new_dataset = [os.path.join(dataset_dir, path) for path in paths] 149 | sample_idx = np.random.permutation(len(new_dataset))[:ds_size].astype(np.int64) 150 | # ipdb.set_trace() 151 | new_dataset = np.array(new_dataset)[sample_idx].tolist() 152 | data_path_buf_tmp.extend(new_dataset) 153 | 154 | else: 155 | for dataset_dir in dataset_fullname: 156 | paths = os.listdir(dataset_dir) 157 | new_dataset = [os.path.join(dataset_dir, path) for path in paths] 158 | data_path_buf_tmp.extend(new_dataset) 159 | 160 | 161 | if mode == "fit": 162 | 163 | mean_std_export_ds(data_path_buf_tmp, normalize_param_fname) 164 | else: 165 | # ipdb.set_trace() 166 | standardscale_export(data_path_buf_tmp, normalize_param_fname, output_path, root_path) 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/prompt_tsne.py: -------------------------------------------------------------------------------- 1 | # from ltsm.data_provider.data_factory import get_data_loader, get_data_loaders, get_dataset 2 | import argparse 3 | import ipdb 4 | import pandas as pd 5 | import numpy as np 6 | # import tsfel 7 | from pandas import read_csv, read_feather 8 | import matplotlib.pyplot as plt 9 | import sys, os 10 | import torch 11 | from sklearn.preprocessing import StandardScaler 12 | from sklearn import manifold 13 | 14 | 15 | def get_args(): 16 | parser = argparse.ArgumentParser(description='LTSM') 17 | 18 | parser.add_argument('--data_path', type=str, default='dataset/weather.csv') 19 | parser.add_argument('--data', type=str, default='custom') 20 | parser.add_argument('--freq', type=str, default="h") 21 | parser.add_argument('--target', type=str, default='OT') 22 | parser.add_argument('--embed', type=str, default='timeF') 23 | parser.add_argument('--percent', type=int, default=10) 24 | parser.add_argument('--batch_size', type=int, default=512) 25 | parser.add_argument('--max_len', type=int, default=-1) 26 | parser.add_argument('--seq_len', type=int, default=512) 27 | parser.add_argument('--pred_len', type=int, default=96) 28 | 
parser.add_argument('--label_len', type=int, default=48) 29 | parser.add_argument('--features', type=str, default='M') 30 | 31 | args = parser.parse_args() 32 | 33 | return args 34 | 35 | 36 | def prompt_generation(ts): 37 | cfg = tsfel.get_features_by_domain() 38 | prompt = tsfel.time_series_features_extractor(cfg, ts) 39 | return prompt 40 | 41 | 42 | def prompt_prune(pt): 43 | pt_dict = pt.to_dict() 44 | pt_keys = list(pt_dict.keys()) 45 | for key in pt_keys: 46 | if key.startswith("0_FFT mean coefficient"): 47 | del pt[key] 48 | 49 | return pt 50 | 51 | 52 | if __name__ == "__main__": 53 | 54 | root_path = "./prompt_bank/stat-prompt/prompt_data_split/" 55 | # print(data_path_buf) 56 | 57 | dataset_name = [ 58 | "electricity", 59 | "ETT-small", 60 | "exchange_rate", 61 | "illness", 62 | "traffic", 63 | "weather", 64 | ] 65 | split_buf = ["train", "val", "test"] 66 | 67 | dataset_fullname_train = [os.path.join(root_path, "train", name) for name in dataset_name] 68 | dataset_fullname_val = [os.path.join(root_path, "val", name) for name in dataset_name] 69 | dataset_fullname_test = [os.path.join(root_path, "test", name) for name in dataset_name] 70 | dataset_fullname = dataset_fullname_train + dataset_fullname_val + dataset_fullname_test 71 | data_path_buf = [] 72 | dataset_dir_buf = [] 73 | dataset_split_buf = [] 74 | K = 100 75 | for index, dataset_dir in enumerate(dataset_fullname): 76 | paths = os.listdir(dataset_dir) 77 | new_dataset = [os.path.join(dataset_dir, path) for path in paths] 78 | sample_idx = np.random.permutation(len(new_dataset))[:K].astype(np.int64) 79 | # ipdb.set_trace() 80 | new_dataset = np.array(new_dataset)[sample_idx].tolist() 81 | data_path_buf.extend(new_dataset) 82 | 83 | for dataset_index, dname in enumerate(dataset_name): 84 | if dname in dataset_dir: 85 | dataset_dir_buf.extend(len(new_dataset) * [dataset_index]) 86 | 87 | for split_index, split in enumerate(split_buf): 88 | if split in dataset_dir: 89 | dataset_split_buf.extend(len(new_dataset) * [split_index]) 90 | break 91 | 92 | prompt_data_buf = [] 93 | for index, dataset_path in enumerate(data_path_buf): 94 | prompt_data = torch.load(dataset_path) 95 | prompt_data_buf.append(prompt_data) 96 | print("Import from {}".format(dataset_path)) 97 | # print(prompt_data) 98 | 99 | # if index == 100: 100 | # break 101 | 102 | # print(prompt_data_buf) 103 | # print(output_path_buf) 104 | 105 | prompt_data_all = pd.concat(prompt_data_buf, axis=0).values 106 | print(prompt_data_all.shape) 107 | # (3166, 133) 108 | 109 | # nan_index = np.where(np.isnan(prompt_data_all))[0] 110 | # prompt_data_all[nan_index] = 0 111 | 112 | # ipdb.set_trace() 113 | tsne = manifold.TSNE(n_components=2, init='pca', random_state=0) 114 | prompt_data_tsne = tsne.fit_transform(prompt_data_all) 115 | dataset_plot_buf = ["electricity"] 116 | color_buf = ["red", "blue", "black", "green", "pink", "brown"] 117 | marker_buf = [".", "^", "x"] 118 | for index, _ in enumerate(dataset_name): 119 | for sindex, split_fold in enumerate(split_buf): 120 | data_index = (np.array(dataset_dir_buf) == index) 121 | split_index = (np.array(dataset_split_buf) == sindex) 122 | plot_index = data_index & split_index 123 | plt.plot(prompt_data_tsne[plot_index, 0], prompt_data_tsne[plot_index, 1], linewidth=0, marker=marker_buf[sindex], label=str(dataset_name[index][0:8] + "-" + split_fold), color=color_buf[index]) 124 | # plt.text(prompt_data_tsne[data_index, 0].mean()-20, prompt_data_tsne[data_index, 1].mean(), str(dataset_name[index][0:8]), fontdict={'weight': 
'bold', 'size': 9}) 125 | 126 | plt.legend(loc="right") 127 | plt.savefig("./figures/stat_prompt_tsne.png") 128 | plt.close() 129 | 130 | # ipdb.set_trace() 131 | # plt.xticks([]) 132 | # plt.yticks([]) 133 | 134 | # print(prompt_data_all) 135 | # , color = plt.cm.Set1(dataset_dir_buf[index]) 136 | # print(prompt_data_transform) 137 | # print(prompt_data_transform_array.mean(axis=0)) 138 | # print(prompt_data_transform_array.std(axis=0)) 139 | # print(prompt_data_transform.loc[5]) 140 | 141 | 142 | 143 | 144 | 145 | 146 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/__init__.py: -------------------------------------------------------------------------------- 1 | from tsfel.utils import * 2 | from tsfel.feature_extraction import * 3 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/feature_extraction/__init__.py: -------------------------------------------------------------------------------- 1 | from tsfel.feature_extraction.calc_features import * 2 | from tsfel.feature_extraction.features import * 3 | from tsfel.feature_extraction.features_settings import * 4 | from tsfel.feature_extraction.features_utils import * 5 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/feature_extraction/calc_features.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import importlib 3 | import multiprocessing as mp 4 | import numbers 5 | import os 6 | import pathlib 7 | import sys 8 | import warnings 9 | from functools import partial 10 | from pathlib import Path 11 | 12 | import numpy as np 13 | import pandas as pd 14 | 15 | from IPython import get_ipython 16 | from IPython.display import display 17 | 18 | from tsfel.utils.progress_bar import display_progress_bar, progress_bar_notebook 19 | from tsfel.utils.signal_processing import merge_time_series, signal_window_splitter 20 | 21 | import ipdb 22 | 23 | def dataset_features_extractor(main_directory, feat_dict, verbose=1, **kwargs): 24 | """Extracts features from a dataset. 25 | 26 | Parameters 27 | ---------- 28 | main_directory : String 29 | Input directory 30 | feat_dict : dict 31 | Dictionary with features 32 | verbose : int 33 | Verbosity mode. 0 = silent, 1 = progress bar. 34 | (0 or 1 (Default)) 35 | \**kwargs: 36 | See below: 37 | * *search_criteria* (``list``) -- 38 | List of file names to compute features. (Example: 'Accelerometer.txt') 39 | (default: ``None``) 40 | 41 | * *time_unit* (``float``) -- 42 | Time unit 43 | (default: ``1e9``) 44 | 45 | * *resampling_rate* (``int``) -- 46 | Resampling rate 47 | (default: ``100``) 48 | 49 | * *window_size* (``int``) -- 50 | Window size in number of samples 51 | (default: ``100``) 52 | 53 | * *overlap* (``float``) -- 54 | Overlap between 0 and 1 55 | (default: ``0``) 56 | 57 | * *pre_process* (``function``) -- 58 | Function with pre processing code 59 | 60 | (default: ``None``) 61 | 62 | * *output_directory* (``String``) -- 63 | Output directory 64 | (default: ``'output_directory', str(Path.home()) + '/tsfel_output'``) 65 | 66 | * *features_path* (``string``) -- 67 | Directory of script with personal features 68 | 69 | * *header_names* (``list or array``) -- 70 | Names of each column window 71 | 72 | * *n_jobs* (``int``) -- 73 | The number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. 
74 | ``-1`` means using all processors. 75 | (default: ``None`` in Windows and ``-1`` for other systems) 76 | 77 | Returns 78 | ------- 79 | file 80 | csv file with the extracted features 81 | 82 | """ 83 | search_criteria = kwargs.get('search_criteria', None) 84 | time_unit = kwargs.get('time_unit', 1e9) 85 | resample_rate = kwargs.get('resample_rate', 30) 86 | window_size = kwargs.get('window_size', 100) 87 | overlap = kwargs.get('overlap', 0) 88 | pre_process = kwargs.get('pre_process', None) 89 | output_directory = kwargs.get('output_directory', str(Path.home()) + '/tsfel_output') 90 | features_path = kwargs.get('features_path', None) 91 | names = kwargs.get('header_names', None) 92 | 93 | # Choosing default of n_jobs by operating system 94 | if sys.platform[:-2] == 'win': 95 | n_jobs_default = None 96 | else: 97 | n_jobs_default = -1 98 | 99 | # Choosing default of n_jobs by python interface 100 | if get_ipython().__class__.__name__ == 'ZMQInteractiveShell' or \ 101 | get_ipython().__class__.__name__ == 'Shell': 102 | n_jobs_default = -1 103 | 104 | n_jobs = kwargs.get('n_jobs', n_jobs_default) 105 | 106 | if main_directory[-1] != os.sep: 107 | main_directory = main_directory + os.sep 108 | 109 | folders = [f for f in glob.glob(main_directory + "**/", recursive=True)] 110 | 111 | if folders: 112 | for fl in folders: 113 | sensor_data = {} 114 | if search_criteria: 115 | for c in search_criteria: 116 | if os.path.isfile(fl + c): 117 | key = c.split('.')[0] 118 | sensor_data[key] = pd.read_csv(fl + c, header=None) 119 | else: 120 | all_files = np.concatenate((glob.glob(fl + '/*.txt'), glob.glob(fl + '/*.csv'))) 121 | for c in all_files: 122 | key = c.split(os.sep)[-1].split('.')[0] 123 | try: 124 | data_file = pd.read_csv(c, header=None) 125 | except pd.io.common.CParserError: 126 | continue 127 | 128 | if np.dtype('O') in np.array(data_file.dtypes): 129 | continue 130 | 131 | sensor_data[key] = pd.read_csv(c, header=None) 132 | 133 | if not sensor_data: 134 | continue 135 | 136 | pp_sensor_data = sensor_data if pre_process is None else pre_process(sensor_data) 137 | 138 | data_new = merge_time_series(pp_sensor_data, resample_rate, time_unit) 139 | 140 | windows = signal_window_splitter(data_new, window_size, overlap) 141 | 142 | if features_path: 143 | features = time_series_features_extractor(feat_dict, windows, fs=resample_rate, verbose=0, 144 | features_path=features_path, header_names=names, n_jobs=n_jobs) 145 | else: 146 | features = time_series_features_extractor(feat_dict, windows, fs=resample_rate, verbose=0, 147 | header_names=names, n_jobs=n_jobs) 148 | 149 | fl = '/'.join(fl.split(os.sep)) 150 | invalid_char = '<>:"\|?* ' 151 | for char in invalid_char: 152 | fl = fl.replace(char, '') 153 | 154 | pathlib.Path(output_directory + fl).mkdir(parents=True, exist_ok=True) 155 | features.to_csv(output_directory + fl + '/Features.csv', sep=',', encoding='utf-8') 156 | 157 | if verbose == 1: 158 | print('Features files saved in: ', output_directory) 159 | else: 160 | raise FileNotFoundError("There is no folder(s) in directory: " + main_directory) 161 | 162 | 163 | def calc_features(wind_sig, dict_features, fs, **kwargs): 164 | """Extraction of time series features. 
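This is the per-window worker that ``time_series_features_extractor`` maps over the signal windows (optionally in parallel); it simply forwards to ``calc_window_features``.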
165 | 166 | Parameters 167 | ---------- 168 | wind_sig: list 169 | Input from which features are computed, window 170 | dict_features : dict 171 | Dictionary with features 172 | fs : float or None 173 | Sampling frequency 174 | \**kwargs: 175 | * *features_path* (``string``) -- 176 | Directory of script with personal features 177 | * *header_names* (``list or array``) -- 178 | Names of each column window 179 | 180 | Returns 181 | ------- 182 | DataFrame 183 | Extracted features 184 | 185 | """ 186 | 187 | features_path = kwargs.get('features_path', None) 188 | names = kwargs.get('header_names', None) 189 | feat_val = calc_window_features(dict_features, wind_sig, fs, features_path=features_path, header_names=names) 190 | feat_val.reset_index(drop=True) 191 | 192 | return feat_val 193 | 194 | 195 | def time_series_features_extractor(dict_features, signal_windows, fs=None, verbose=1, **kwargs): 196 | """Extraction of time series features. 197 | 198 | Parameters 199 | ---------- 200 | dict_features : dict 201 | Dictionary with features 202 | signal_windows: list 203 | Input from which features are computed, window 204 | fs : int or None 205 | Sampling frequency 206 | verbose : int 207 | Verbosity mode. 0 = silent, 1 = progress bar. 208 | (0 or 1 (Default)) 209 | \**kwargs: 210 | See below: 211 | * *window_size* (``int``) -- 212 | Window size in number of samples 213 | (default: ``100``) 214 | 215 | * *overlap* (``float``) -- 216 | Overlap between 0 and 1 217 | (default: ``0``) 218 | 219 | * *features_path* (``string``) -- 220 | Directory of script with personal features 221 | 222 | * *header_names* (``list or array``) -- 223 | Names of each column window 224 | 225 | * *n_jobs* (``int``) -- 226 | The number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. 227 | ``-1`` means using all processors. 228 | (default: ``None`` in Windows and ``-1`` for other systems) 229 | 230 | Returns 231 | ------- 232 | DataFrame 233 | Extracted features 234 | 235 | """ 236 | if verbose == 1: 237 | print("*** Feature extraction started ***") 238 | 239 | window_size = kwargs.get('window_size', None) 240 | overlap = kwargs.get('overlap', 0) 241 | features_path = kwargs.get('features_path', None) 242 | names = kwargs.get('header_names', None) 243 | 244 | # Choosing default of n_jobs by operating system 245 | if sys.platform[:-2] == 'win': 246 | n_jobs_default = None 247 | else: 248 | n_jobs_default = -1 249 | 250 | # Choosing default of n_jobs by python interface 251 | if get_ipython().__class__.__name__ == 'ZMQInteractiveShell' or \ 252 | get_ipython().__class__.__name__ == 'Shell': 253 | n_jobs_default = -1 254 | 255 | n_jobs = kwargs.get('n_jobs', n_jobs_default) 256 | 257 | if fs is None: 258 | warnings.warn('Using default sampling frequency set in configuration file.', stacklevel=2) 259 | 260 | if names is not None: 261 | names = list(names) 262 | else: 263 | # Name of each column to be concatenated with feature name 264 | if isinstance(signal_windows, pd.DataFrame): 265 | names = signal_windows.columns.values 266 | elif isinstance(signal_windows[0], pd.DataFrame): 267 | names = signal_windows[0].columns.values 268 | 269 | if window_size is not None: 270 | signal_windows = signal_window_splitter(signal_windows, window_size, overlap) 271 | 272 | if len(signal_windows) == 0: 273 | raise SystemExit('Empty signal windows. 
Please check window size input parameter.') 274 | 275 | features_final = pd.DataFrame() 276 | 277 | if isinstance(signal_windows, list) and isinstance(signal_windows[0], numbers.Real): 278 | signal_windows = np.array(signal_windows) 279 | 280 | # more than one window 281 | if isinstance(signal_windows, list): 282 | # Starting the display of progress bar for notebooks interfaces 283 | if (get_ipython().__class__.__name__ == "ZMQInteractiveShell") or ( 284 | get_ipython().__class__.__name__ == "Shell" 285 | ): 286 | 287 | out = display(progress_bar_notebook(0, len(signal_windows)), display_id=True) 288 | else: 289 | out = None 290 | 291 | if isinstance(n_jobs, int): 292 | # Multiprocessing use 293 | if n_jobs == -1: 294 | cpu_count = mp.cpu_count() 295 | else: 296 | cpu_count = n_jobs 297 | 298 | pool = mp.Pool(cpu_count) 299 | features = pool.imap( 300 | partial( 301 | calc_features, 302 | dict_features=dict_features, 303 | fs=fs, 304 | features_path=features_path, 305 | header_names=names, 306 | ), 307 | signal_windows, 308 | ) 309 | 310 | for i, feat in enumerate(features): 311 | if verbose == 1: 312 | display_progress_bar(i, len(signal_windows), out) 313 | features_final = pd.concat([features_final, feat], axis=0) 314 | 315 | pool.close() 316 | pool.join() 317 | 318 | elif n_jobs is None: 319 | for i, feat in enumerate(signal_windows): 320 | features_final = pd.concat( 321 | [ 322 | features_final, 323 | calc_window_features( 324 | dict_features, feat, fs, features_path=features_path, header_names=names) 325 | ], axis=0) 326 | if verbose == 1: 327 | display_progress_bar(i, len(signal_windows), out) 328 | else: 329 | raise SystemExit( 330 | "n_jobs value is not valid. " "Choose an integer value or None for no multiprocessing." 331 | ) 332 | # single window 333 | else: 334 | # import ipdb 335 | # ipdb.set_trace() 336 | features_final = calc_window_features( 337 | dict_features, 338 | signal_windows, 339 | fs, 340 | verbose=verbose, 341 | features_path=features_path, 342 | header_names=names, 343 | single_window=True, 344 | ) 345 | 346 | if verbose == 1: 347 | print("\n"+"*** Feature extraction finished ***") 348 | 349 | # Assuring the same feature extraction order 350 | features_final = features_final.reindex(sorted(features_final.columns), axis=1) 351 | return features_final.reset_index(drop=True) 352 | 353 | 354 | def calc_window_features(dict_features, signal_window, fs, verbose=1, single_window=False, **kwargs): 355 | """This function computes features matrix for one window. 
356 | 357 | Parameters 358 | ---------- 359 | dict_features : dict 360 | Dictionary with features 361 | signal_window: pandas DataFrame 362 | Input from which features are computed, window 363 | fs : float 364 | Sampling frequency 365 | verbose : int 366 | Level of function communication 367 | (0 or 1 (Default)) 368 | single_window: Bool 369 | Boolean value for printing the progress bar for only one window feature extraction 370 | \**kwargs: 371 | See below: 372 | * *features_path* (``string``) -- 373 | Directory of script with personal features 374 | * *header_names* (``list or array``) -- 375 | Names of each column window 376 | 377 | Returns 378 | ------- 379 | pandas DataFrame 380 | (columns) names of the features 381 | (data) values of each features for signal 382 | 383 | """ 384 | 385 | features_path = kwargs.get('features_path', None) 386 | header_names = kwargs.get('header_names', None) 387 | 388 | # To handle object type signals 389 | signal_window = np.array(signal_window).astype(float) 390 | 391 | single_axis = True if len(signal_window.shape) == 1 else False 392 | 393 | if header_names is None: 394 | header_names = np.array([0]) if single_axis else np.arange(signal_window.shape[-1]) 395 | else: 396 | if (len(header_names) != signal_window.shape[-1] and not single_axis) or \ 397 | (len(header_names) != 1 and single_axis): 398 | raise Exception("header_names dimension does not match input columns.") 399 | 400 | # Execute imports 401 | exec("from tsfel import *") 402 | domain = dict_features.keys() 403 | 404 | if features_path: 405 | sys.path.append(features_path[:-len(features_path.split(os.sep)[-1])-1]) 406 | exec("import "+features_path.split(os.sep)[-1][:-3]) 407 | importlib.reload(sys.modules[features_path.split(os.sep)[-1][:-3]]) 408 | exec("from " + features_path.split(os.sep)[-1][:-3]+" import *") 409 | 410 | # Create global arrays 411 | feature_results = [] 412 | feature_names = [] 413 | 414 | # Starting the display of progress bar for notebooks interfaces 415 | # Iterating over features of a single window 416 | if verbose == 1 and single_window: 417 | 418 | feat_nb = np.hstack([list(dict_features[_type].keys()) for _type in domain]) 419 | 420 | if (get_ipython().__class__.__name__ == 'ZMQInteractiveShell') or ( 421 | get_ipython().__class__.__name__ == 'Shell'): 422 | out = display(progress_bar_notebook(0, len(feat_nb)), display_id=True) 423 | else: 424 | out = None 425 | 426 | i_feat = -1 427 | 428 | for _type in domain: 429 | domain_feats = dict_features[_type].keys() 430 | # print(domain_feats) 431 | # ipdb.set_trace() 432 | 433 | for feat in domain_feats: 434 | 435 | if verbose == 1 and single_window: 436 | i_feat = i_feat + 1 437 | display_progress_bar(i_feat, len(feat_nb), out) 438 | 439 | # Only returns used functions 440 | if dict_features[_type][feat]["use"] == "yes": 441 | 442 | # Read Function (real name of function) 443 | func_total = dict_features[_type][feat]["function"] 444 | 445 | if func_total.find("tsfel.") == 0: 446 | func_total = func_total.replace("tsfel.", "") 447 | 448 | # Check for parameters 449 | parameters_total = {} 450 | 451 | if dict_features[_type][feat]["parameters"] != "": 452 | parameters_total = dict_features[_type][feat]["parameters"] 453 | 454 | # Check assert fs parameter: 455 | if "fs" in parameters_total: 456 | 457 | # Select which fs to use 458 | if fs is None: 459 | # Check if features dict has default sampling frequency value 460 | if not (type(parameters_total["fs"]) is int or type(parameters_total["fs"]) is float): 461 | raise 
Exception("No sampling frequency assigned.") 462 | else: 463 | parameters_total["fs"] = fs 464 | 465 | # Eval feature results 466 | if single_axis: 467 | eval_result = locals()[func_total](signal_window, **parameters_total) 468 | eval_result = np.array([eval_result]) 469 | 470 | for ax in range(len(header_names)): 471 | sig_ax = signal_window if single_axis else signal_window[:, ax] 472 | eval_result_ax = locals()[func_total](sig_ax, **parameters_total) 473 | # Function returns more than one element 474 | if type(eval_result_ax) == tuple: 475 | if np.isnan(eval_result_ax[0]): 476 | eval_result_ax = np.zeros(len(eval_result_ax)) 477 | for rr in range(len(eval_result_ax)): 478 | feature_results += [eval_result_ax[rr]] 479 | feature_names += [str(header_names[ax]) + "_" + feat + "_" + str(rr)] 480 | else: 481 | feature_results += [eval_result_ax] 482 | feature_names += [str(header_names[ax]) + "_" + feat] 483 | 484 | features = pd.DataFrame( 485 | data=np.array(feature_results).reshape(1, len(feature_results)), columns=np.array(feature_names) 486 | ) 487 | 488 | return features 489 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/feature_extraction/features.json: -------------------------------------------------------------------------------- 1 | { 2 | "spectral": { 3 | "FFT mean coefficient": { 4 | "complexity": "constant", 5 | "description": "Computes the mean value of each spectrogram frequency.", 6 | "function": "tsfel.fft_mean_coeff", 7 | "parameters": { 8 | "fs": 100, 9 | "nfreq": 256 10 | }, 11 | "n_features": "nfreq", 12 | "use": "yes" 13 | }, 14 | "Fundamental frequency": { 15 | "complexity": "log", 16 | "description": "Computes the fundamental frequency.", 17 | "function": "tsfel.fundamental_frequency", 18 | "parameters": { 19 | "fs": 100 20 | }, 21 | "n_features": 1, 22 | "use": "yes" 23 | }, 24 | "Human range energy": { 25 | "complexity": "log", 26 | "description": "Computes the human range energy ratio given by the ratio between the energy in frequency 0.6-2.5Hz and the whole energy band.", 27 | "function": "tsfel.human_range_energy", 28 | "parameters": { 29 | "fs": 100 30 | }, 31 | "n_features": 1, 32 | "use": "yes", 33 | "tag": "inertial" 34 | }, 35 | "LPCC": { 36 | "complexity": "log", 37 | "description": "Computes the linear prediction cepstral coefficients.", 38 | "function": "tsfel.lpcc", 39 | "parameters": { 40 | "n_coeff": 12 41 | }, 42 | "n_features": "n_coeff", 43 | "use": "yes", 44 | "tag": "audio" 45 | }, 46 | "MFCC": { 47 | "complexity": "constant", 48 | "description": "Computes the MEL cepstral coefficients.", 49 | "function": "tsfel.mfcc", 50 | "parameters": { 51 | "cep_lifter": 22, 52 | "fs": 100, 53 | "nfft": 512, 54 | "nfilt": 40, 55 | "num_ceps": 12, 56 | "pre_emphasis": 0.97 57 | }, 58 | "n_features": "num_ceps", 59 | "use": "yes", 60 | "tag": [ 61 | "audio", 62 | "emg" 63 | ] 64 | }, 65 | "Max power spectrum": { 66 | "complexity": "log", 67 | "description": "Computes the maximum power spectrum density.", 68 | "function": "tsfel.max_power_spectrum", 69 | "parameters": { 70 | "fs": 100 71 | }, 72 | "n_features": 1, 73 | "use": "yes" 74 | }, 75 | "Maximum frequency": { 76 | "complexity": "log", 77 | "description": "Computes the maximum frequency.", 78 | "function": "tsfel.max_frequency", 79 | "parameters": { 80 | "fs": 100 81 | }, 82 | "n_features": 1, 83 | "use": "yes" 84 | }, 85 | "Median frequency": { 86 | "complexity": "log", 87 | "description": "Computes the median frequency.", 88 | 
"function": "tsfel.median_frequency", 89 | "parameters": { 90 | "fs": 100 91 | }, 92 | "n_features": 1, 93 | "use": "yes" 94 | }, 95 | "Power bandwidth": { 96 | "complexity": "log", 97 | "description": "Computes power spectrum density bandwidth of the signal.", 98 | "function": "tsfel.power_bandwidth", 99 | "parameters": { 100 | "fs": 100 101 | }, 102 | "n_features": 1, 103 | "use": "yes" 104 | }, 105 | "Spectral centroid": { 106 | "complexity": "linear", 107 | "description": "Computes the barycenter of the spectrum.", 108 | "function": "tsfel.spectral_centroid", 109 | "parameters": { 110 | "fs": 100 111 | }, 112 | "n_features": 1, 113 | "use": "yes", 114 | "tag": "audio" 115 | }, 116 | "Spectral decrease": { 117 | "complexity": "log", 118 | "description": "Computes the amount of decreasing of the spectra amplitude.", 119 | "function": "tsfel.spectral_decrease", 120 | "parameters": { 121 | "fs": 100 122 | }, 123 | "n_features": 1, 124 | "use": "yes" 125 | }, 126 | "Spectral distance": { 127 | "complexity": "log", 128 | "description": "Computes the signal spectral distance.", 129 | "function": "tsfel.spectral_distance", 130 | "parameters": { 131 | "fs": 100 132 | }, 133 | "n_features": 1, 134 | "use": "yes" 135 | }, 136 | "Spectral entropy": { 137 | "complexity": "log", 138 | "description": "Computes the spectral entropy of the signal based on Fourier transform.", 139 | "function": "tsfel.spectral_entropy", 140 | "parameters": { 141 | "fs": 100 142 | }, 143 | "n_features": 1, 144 | "use": "yes", 145 | "tag": "eeg" 146 | }, 147 | "Spectral kurtosis": { 148 | "complexity": "linear", 149 | "description": "Computes the flatness of a distribution around its mean value.", 150 | "function": "tsfel.spectral_kurtosis", 151 | "parameters": { 152 | "fs": 100 153 | }, 154 | "n_features": 1, 155 | "use": "yes" 156 | }, 157 | "Spectral positive turning points": { 158 | "complexity": "log", 159 | "description": "Computes number of positive turning points of the fft magnitude signal", 160 | "function": "tsfel.spectral_positive_turning", 161 | "parameters": { 162 | "fs": 100 163 | }, 164 | "n_features": 1, 165 | "use": "yes" 166 | }, 167 | "Spectral roll-off": { 168 | "complexity": "log", 169 | "description": "Computes the frequency where 95% of the signal magnitude is contained below of this value.", 170 | "function": "tsfel.spectral_roll_off", 171 | "parameters": { 172 | "fs": 100 173 | }, 174 | "n_features": 1, 175 | "use": "yes", 176 | "tag": "audio" 177 | }, 178 | "Spectral roll-on": { 179 | "complexity": "log", 180 | "description": "Computes the frequency where 5% of the signal magnitude is contained below of this value.", 181 | "function": "tsfel.spectral_roll_on", 182 | "parameters": { 183 | "fs": 100 184 | }, 185 | "n_features": 1, 186 | "use": "yes" 187 | }, 188 | "Spectral skewness": { 189 | "complexity": "linear", 190 | "description": "Computes the asymmetry of a distribution around its mean value.", 191 | "function": "tsfel.spectral_skewness", 192 | "parameters": { 193 | "fs": 100 194 | }, 195 | "n_features": 1, 196 | "use": "yes" 197 | }, 198 | "Spectral slope": { 199 | "complexity": "log", 200 | "description": "Computes the spectral slope, obtained by linear regression of the spectral amplitude.", 201 | "function": "tsfel.spectral_slope", 202 | "parameters": { 203 | "fs": 100 204 | }, 205 | "n_features": 1, 206 | "use": "yes" 207 | }, 208 | "Spectral spread": { 209 | "complexity": "linear", 210 | "description": "Computes the spread of the spectrum around its mean value.", 211 | "function": 
"tsfel.spectral_spread", 212 | "parameters": { 213 | "fs": 100 214 | }, 215 | "n_features": 1, 216 | "use": "yes" 217 | }, 218 | "Spectral variation": { 219 | "complexity": "log", 220 | "description": "Computes the amount of variation of the spectrum along time.", 221 | "function": "tsfel.spectral_variation", 222 | "parameters": { 223 | "fs": 100 224 | }, 225 | "n_features": 1, 226 | "use": "yes" 227 | }, 228 | "Wavelet absolute mean": { 229 | "complexity": "linear", 230 | "description": "Computes CWT absolute mean value of each wavelet scale.", 231 | "function": "tsfel.wavelet_abs_mean", 232 | "parameters": { 233 | "function": "scipy.signal.ricker", 234 | "widths": "np.arange(1,10)" 235 | }, 236 | "n_features": "widths", 237 | "use": "yes", 238 | "tag": [ 239 | "eeg", 240 | "ecg" 241 | ] 242 | }, 243 | "Wavelet energy": { 244 | "complexity": "linear", 245 | "description": "Computes CWT energy of each wavelet scale.", 246 | "function": "tsfel.wavelet_energy", 247 | "parameters": { 248 | "function": "scipy.signal.ricker", 249 | "widths": "np.arange(1,10)" 250 | }, 251 | "n_features": "widths", 252 | "use": "yes", 253 | "tag": "eeg" 254 | }, 255 | "Wavelet entropy": { 256 | "complexity": "linear", 257 | "description": "Computes CWT entropy of the signal.", 258 | "function": "tsfel.wavelet_entropy", 259 | "parameters": { 260 | "function": "scipy.signal.ricker", 261 | "widths": "np.arange(1,10)" 262 | }, 263 | "n_features": 1, 264 | "use": "yes", 265 | "tag": "eeg" 266 | }, 267 | "Wavelet standard deviation": { 268 | "complexity": "linear", 269 | "description": "Computes CWT std value of each wavelet scale.", 270 | "function": "tsfel.wavelet_std", 271 | "parameters": { 272 | "function": "scipy.signal.ricker", 273 | "widths": "np.arange(1,10)" 274 | }, 275 | "n_features": "widths", 276 | "use": "yes", 277 | "tag": "eeg" 278 | }, 279 | "Wavelet variance": { 280 | "complexity": "linear", 281 | "description": "Computes CWT variance value of each wavelet scale.", 282 | "function": "tsfel.wavelet_var", 283 | "parameters": { 284 | "function": "scipy.signal.ricker", 285 | "widths": "np.arange(1,10)" 286 | }, 287 | "n_features": "widths", 288 | "use": "yes", 289 | "tag": "eeg" 290 | } 291 | }, 292 | "statistical": { 293 | "Absolute energy": { 294 | "complexity": "log", 295 | "description": "Computes the absolute energy of the signal.", 296 | "function": "tsfel.abs_energy", 297 | "parameters": "", 298 | "n_features": 1, 299 | "use": "yes", 300 | "tag": "audio" 301 | }, 302 | "Average power": { 303 | "complexity": "constant", 304 | "description": "Computes the average power of the signal.", 305 | "function": "tsfel.average_power", 306 | "parameters": { 307 | "fs": 100 308 | }, 309 | "n_features": 1, 310 | "use": "yes", 311 | "tag": "audio" 312 | }, 313 | "ECDF": { 314 | "complexity": "log", 315 | "description": "Computes the values of ECDF (empirical cumulative distribution function) along the time axis.", 316 | "function": "tsfel.ecdf", 317 | "parameters": { 318 | "d": 10 319 | }, 320 | "n_features": "d", 321 | "use": "yes" 322 | }, 323 | "ECDF Percentile": { 324 | "complexity": "log", 325 | "description": "Determines the percentile value of the ECDF.", 326 | "function": "tsfel.ecdf_percentile", 327 | "parameters": { 328 | "percentile": "[0.2, 0.8]" 329 | }, 330 | "n_features": "percentile", 331 | "use": "yes" 332 | }, 333 | "ECDF Percentile Count": { 334 | "complexity": "log", 335 | "description": "Determines the cumulative sum of samples that are less than the percentile.", 336 | "function": 
"tsfel.ecdf_percentile_count", 337 | "parameters": { 338 | "percentile": "[0.2, 0.8]" 339 | }, 340 | "n_features": "percentile", 341 | "use": "yes" 342 | }, 343 | "Entropy": { 344 | "complexity": "log", 345 | "description": "Computes the entropy of the signal using the Shannon Entropy.", 346 | "function": "tsfel.entropy", 347 | "parameters": { 348 | "prob": "standard" 349 | }, 350 | "n_features": 1, 351 | "use": "yes", 352 | "tag": "eeg" 353 | }, 354 | "Histogram": { 355 | "complexity": "log", 356 | "description": "Computes histogram of the signal.", 357 | "function": "tsfel.hist", 358 | "parameters": { 359 | "nbins": 10, 360 | "r": 1 361 | }, 362 | "n_features": "nbins", 363 | "use": "yes" 364 | }, 365 | "Interquartile range": { 366 | "complexity": "constant", 367 | "description": "Computes interquartile range of the signal.", 368 | "function": "tsfel.interq_range", 369 | "parameters": "", 370 | "n_features": 1, 371 | "use": "yes" 372 | }, 373 | "Kurtosis": { 374 | "complexity": "constant", 375 | "description": "Computes kurtosis of the signal.", 376 | "function": "tsfel.kurtosis", 377 | "parameters": "", 378 | "n_features": 1, 379 | "use": "yes" 380 | }, 381 | "Max": { 382 | "complexity": "constant", 383 | "description": "Computes the maximum value of the signal.", 384 | "function": "tsfel.calc_max", 385 | "parameters": "", 386 | "n_features": 1, 387 | "use": "yes" 388 | }, 389 | "Mean": { 390 | "complexity": "constant", 391 | "description": "Computes the mean value of the signal.", 392 | "function": "tsfel.calc_mean", 393 | "parameters": "", 394 | "n_features": 1, 395 | "use": "yes", 396 | "tag": "inertial" 397 | }, 398 | "Mean absolute deviation": { 399 | "complexity": "log", 400 | "description": "Computes mean absolute deviation of the signal.", 401 | "function": "tsfel.mean_abs_deviation", 402 | "parameters": "", 403 | "n_features": 1, 404 | "use": "yes" 405 | }, 406 | "Median": { 407 | "complexity": "constant", 408 | "description": "Computes median of the signal.", 409 | "function": "tsfel.calc_median", 410 | "parameters": "", 411 | "n_features": 1, 412 | "use": "yes" 413 | }, 414 | "Median absolute deviation": { 415 | "complexity": "constant", 416 | "description": "Computes median absolute deviation of the signal.", 417 | "function": "tsfel.median_abs_deviation", 418 | "parameters": "", 419 | "n_features": 1, 420 | "use": "yes" 421 | }, 422 | "Min": { 423 | "complexity": "constant", 424 | "description": "Computes the minimum value of the signal.", 425 | "function": "tsfel.calc_min", 426 | "parameters": "", 427 | "n_features": 1, 428 | "use": "yes" 429 | }, 430 | "Peak to peak distance": { 431 | "complexity": "constant", 432 | "description": "Computes the peak to peak distance.", 433 | "function": "tsfel.pk_pk_distance", 434 | "parameters": "", 435 | "n_features": 1, 436 | "use": "yes" 437 | }, 438 | "Root mean square": { 439 | "complexity": "constant", 440 | "description": "Computes root mean square of the signal.", 441 | "function": "tsfel.rms", 442 | "parameters": "", 443 | "n_features": 1, 444 | "use": "yes", 445 | "tag": [ 446 | "emg", 447 | "inertial" 448 | ] 449 | }, 450 | "Skewness": { 451 | "complexity": "constant", 452 | "description": "Computes skewness of the signal.", 453 | "function": "tsfel.skewness", 454 | "parameters": "", 455 | "n_features": 1, 456 | "use": "yes" 457 | }, 458 | "Standard deviation": { 459 | "complexity": "constant", 460 | "description": "Computes standard deviation of the signal.", 461 | "function": "tsfel.calc_std", 462 | "parameters": "", 463 | 
"n_features": 1, 464 | "use": "yes" 465 | }, 466 | "Variance": { 467 | "complexity": "constant", 468 | "description": "Computes variance of the signal.", 469 | "function": "tsfel.calc_var", 470 | "parameters": "", 471 | "n_features": 1, 472 | "use": "yes" 473 | } 474 | }, 475 | "temporal": { 476 | "Area under the curve": { 477 | "complexity": "log", 478 | "description": "Computes the area under the curve of the signal computed with trapezoid rule.", 479 | "function": "tsfel.auc", 480 | "parameters": { 481 | "fs": 100 482 | }, 483 | "n_features": 1, 484 | "use": "yes" 485 | }, 486 | "Autocorrelation": { 487 | "complexity": "constant", 488 | "description": "Computes autocorrelation of the signal.", 489 | "function": "tsfel.autocorr", 490 | "parameters": "", 491 | "n_features": 1, 492 | "use": "yes", 493 | "tag": "inertial" 494 | }, 495 | "Centroid": { 496 | "complexity": "constant", 497 | "description": "Computes the centroid along the time axis.", 498 | "function": "tsfel.calc_centroid", 499 | "parameters": { 500 | "fs": 100 501 | }, 502 | "n_features": 1, 503 | "use": "yes" 504 | }, 505 | "Mean absolute diff": { 506 | "complexity": "constant", 507 | "description": "Computes mean absolute differences of the signal.", 508 | "function": "tsfel.mean_abs_diff", 509 | "parameters": "", 510 | "n_features": 1, 511 | "use": "yes" 512 | }, 513 | "Mean diff": { 514 | "complexity": "constant", 515 | "description": "Computes mean of differences of the signal.", 516 | "function": "tsfel.mean_diff", 517 | "parameters": "", 518 | "n_features": 1, 519 | "use": "yes" 520 | }, 521 | "Median absolute diff": { 522 | "complexity": "constant", 523 | "description": "Computes median absolute differences of the signal.", 524 | "function": "tsfel.median_abs_diff", 525 | "parameters": "", 526 | "n_features": 1, 527 | "use": "yes" 528 | }, 529 | "Median diff": { 530 | "complexity": "constant", 531 | "description": "Computes median of differences of the signal.", 532 | "function": "tsfel.median_diff", 533 | "parameters": "", 534 | "n_features": 1, 535 | "use": "yes" 536 | }, 537 | "Negative turning points": { 538 | "complexity": "constant", 539 | "description": "Computes number of negative turning points of the signal.", 540 | "function": "tsfel.negative_turning", 541 | "parameters": "", 542 | "n_features": 1, 543 | "use": "yes", 544 | "tag": "emg" 545 | }, 546 | "Neighbourhood peaks": { 547 | "complexity": "constant", 548 | "description": "Computes the number of peaks from a defined neighbourhood of the signal.", 549 | "function": "tsfel.neighbourhood_peaks", 550 | "parameters": { 551 | "n": 10 552 | }, 553 | "n_features": 1, 554 | "use": "yes" 555 | }, 556 | "Positive turning points": { 557 | "complexity": "constant", 558 | "description": "Computes number of positive turning points of the signal.", 559 | "function": "tsfel.positive_turning", 560 | "parameters": "", 561 | "n_features": 1, 562 | "use": "yes", 563 | "tag": "emg" 564 | }, 565 | "Signal distance": { 566 | "complexity": "constant", 567 | "description": "Computes signal traveled distance.", 568 | "function": "tsfel.distance", 569 | "parameters": "", 570 | "n_features": 1, 571 | "use": "yes" 572 | }, 573 | "Slope": { 574 | "complexity": "log", 575 | "description": "Computes the slope of the signal by fitting a linear equation to the observed data.", 576 | "function": "tsfel.slope", 577 | "parameters": "", 578 | "n_features": 1, 579 | "use": "yes" 580 | }, 581 | "Sum absolute diff": { 582 | "complexity": "constant", 583 | "description": "Computes sum of 
absolute differences of the signal.", 584 | "function": "tsfel.sum_abs_diff", 585 | "parameters": "", 586 | "n_features": 1, 587 | "use": "yes" 588 | }, 589 | "Zero crossing rate": { 590 | "complexity": "constant", 591 | "description": "Computes Zero-crossing rate of the signal.", 592 | "function": "tsfel.zero_cross", 593 | "parameters": "", 594 | "n_features": 1, 595 | "use": "yes", 596 | "tag": [ 597 | "audio", 598 | "emg" 599 | ] 600 | } 601 | } 602 | } -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/feature_extraction/features_settings.py: -------------------------------------------------------------------------------- 1 | import json 2 | import tsfel 3 | import numpy as np 4 | 5 | 6 | def load_json(json_path): 7 | """Loads the json file given by filename. 8 | 9 | Parameters 10 | ---------- 11 | json_path : string 12 | Json path 13 | 14 | Returns 15 | ------- 16 | Dict 17 | Dictionary 18 | 19 | """ 20 | 21 | return json.load(open(json_path)) 22 | 23 | 24 | def get_features_by_domain(domain=None, json_path=None): 25 | """Creates a dictionary with the features settings by domain. 26 | 27 | Parameters 28 | ---------- 29 | domain : string 30 | Available domains: "statistical"; "spectral"; "temporal" 31 | If domain equals None, then the features settings from all domains are returned. 32 | json_path : string 33 | Directory of json file. Default: package features.json directory 34 | 35 | Returns 36 | ------- 37 | Dict 38 | Dictionary with the features settings 39 | 40 | """ 41 | 42 | if json_path is None: 43 | json_path = tsfel.__path__[0] + "/feature_extraction/features.json" 44 | 45 | if domain not in ['statistical', 'temporal', 'spectral', None]: 46 | raise SystemExit( 47 | 'No valid domain. Choose: statistical, temporal, spectral or None (for all feature settings).') 48 | 49 | dict_features = load_json(json_path) 50 | if domain is None: 51 | return dict_features 52 | else: 53 | return {domain: dict_features[domain]} 54 | 55 | 56 | def get_features_by_tag(tag=None, json_path=None): 57 | """Creates a dictionary with the features settings by tag. 58 | 59 | Parameters 60 | ---------- 61 | tag : string 62 | Available tags: "audio"; "inertial", "ecg"; "eeg"; "emg". 63 | If tag equals None then, all available features are returned. 64 | json_path : string 65 | Directory of json file. Default: package features.json directory 66 | 67 | Returns 68 | ------- 69 | Dict 70 | Dictionary with the features settings 71 | 72 | """ 73 | if json_path is None: 74 | json_path = tsfel.__path__[0] + "/feature_extraction/features.json" 75 | 76 | if tag not in ["audio", "inertial", "ecg", "eeg", "emg", None]: 77 | raise SystemExit( 78 | "No valid tag. 
Choose: audio, inertial, ecg, eeg, emg or None.") 79 | features_tag = {} 80 | dict_features = load_json(json_path) 81 | if tag is None: 82 | return dict_features 83 | else: 84 | for domain in dict_features: 85 | features_tag[domain] = {} 86 | for feat in dict_features[domain]: 87 | if dict_features[domain][feat]["use"] == "no": 88 | continue 89 | # Check if tag is defined 90 | try: 91 | js_tag = dict_features[domain][feat]["tag"] 92 | if isinstance(js_tag, list): 93 | if any([tag in js_t for js_t in js_tag]): 94 | features_tag[domain].update({feat: dict_features[domain][feat]}) 95 | elif js_tag == tag: 96 | features_tag[domain].update({feat: dict_features[domain][feat]}) 97 | except KeyError: 98 | continue 99 | # To remove empty dicts 100 | return dict([[d, features_tag[d]] for d in list(features_tag.keys()) if bool(features_tag[d])]) 101 | 102 | 103 | def get_number_features(dict_features): 104 | """Count the total number of features based on input parameters of each feature 105 | 106 | Parameters 107 | ---------- 108 | dict_features : dict 109 | Dictionary with features settings 110 | 111 | Returns 112 | ------- 113 | int 114 | Feature vector size 115 | """ 116 | number_features = 0 117 | for domain in dict_features: 118 | for feat in dict_features[domain]: 119 | if dict_features[domain][feat]["use"] == "no": 120 | continue 121 | n_feat = dict_features[domain][feat]["n_features"] 122 | 123 | if isinstance(n_feat, int): 124 | number_features += n_feat 125 | else: 126 | n_feat_param = dict_features[domain][feat]["parameters"][n_feat] 127 | if isinstance(n_feat_param, int): 128 | number_features += n_feat_param 129 | else: 130 | number_features += eval("len(" + n_feat_param + ")") 131 | 132 | return number_features 133 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/feature_extraction/features_utils.py: -------------------------------------------------------------------------------- 1 | import scipy 2 | import numpy as np 3 | 4 | 5 | def set_domain(key, value): 6 | def decorate_func(func): 7 | setattr(func, key, value) 8 | return func 9 | 10 | return decorate_func 11 | 12 | 13 | def compute_time(signal, fs): 14 | """Creates the signal correspondent time array. 15 | 16 | Parameters 17 | ---------- 18 | signal: nd-array 19 | Input from which the time is computed. 20 | fs: int 21 | Sampling Frequency 22 | 23 | Returns 24 | ------- 25 | time : float list 26 | Signal time 27 | 28 | """ 29 | 30 | return np.arange(0, len(signal))/fs 31 | 32 | 33 | def calc_fft(signal, fs): 34 | """ This functions computes the fft of a signal. 35 | 36 | Parameters 37 | ---------- 38 | signal : nd-array 39 | The input signal from which fft is computed 40 | fs : float 41 | Sampling frequency 42 | 43 | Returns 44 | ------- 45 | f: nd-array 46 | Frequency values (xx axis) 47 | fmag: nd-array 48 | Amplitude of the frequency values (yy axis) 49 | 50 | """ 51 | 52 | fmag = np.abs(np.fft.rfft(signal)) 53 | f = np.fft.rfftfreq(len(signal), d=1/fs) 54 | 55 | return f.copy(), fmag.copy() 56 | 57 | 58 | def filterbank(signal, fs, pre_emphasis=0.97, nfft=512, nfilt=40): 59 | """Computes the MEL-spaced filterbank. 60 | 61 | It provides the information about the power in each frequency band. 
62 | 63 | Implementation details and description on: 64 | https://www.kaggle.com/ilyamich/mfcc-implementation-and-tutorial 65 | https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html#fnref:1 66 | 67 | Parameters 68 | ---------- 69 | signal : nd-array 70 | Input from which filterbank is computed 71 | fs : float 72 | Sampling frequency 73 | pre_emphasis : float 74 | Pre-emphasis coefficient for pre-emphasis filter application 75 | nfft : int 76 | Number of points of fft 77 | nfilt : int 78 | Number of filters 79 | 80 | Returns 81 | ------- 82 | nd-array 83 | MEL-spaced filterbank 84 | 85 | """ 86 | 87 | # Signal is already a window from the original signal, so no frame is needed. 88 | # According to the references it is needed the application of a window function such as 89 | # hann window. However if the signal windows don't have overlap, we will lose information, 90 | # as the application of a hann window will overshadow the windows signal edges. 91 | 92 | # pre-emphasis filter to amplify the high frequencies 93 | 94 | emphasized_signal = np.append(np.array(signal)[0], np.array(signal[1:]) - pre_emphasis * np.array(signal[:-1])) 95 | 96 | # Fourier transform and Power spectrum 97 | mag_frames = np.absolute(np.fft.rfft(emphasized_signal, nfft)) # Magnitude of the FFT 98 | 99 | pow_frames = ((1.0 / nfft) * (mag_frames ** 2)) # Power Spectrum 100 | 101 | low_freq_mel = 0 102 | high_freq_mel = (2595 * np.log10(1 + (fs / 2) / 700)) # Convert Hz to Mel 103 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale 104 | hz_points = (700 * (10 ** (mel_points / 2595) - 1)) # Convert Mel to Hz 105 | filter_bin = np.floor((nfft + 1) * hz_points / fs) 106 | 107 | fbank = np.zeros((nfilt, int(np.floor(nfft / 2 + 1)))) 108 | for m in range(1, nfilt + 1): 109 | 110 | f_m_minus = int(filter_bin[m - 1]) # left 111 | f_m = int(filter_bin[m]) # center 112 | f_m_plus = int(filter_bin[m + 1]) # right 113 | 114 | for k in range(f_m_minus, f_m): 115 | fbank[m - 1, k] = (k - filter_bin[m - 1]) / (filter_bin[m] - filter_bin[m - 1]) 116 | for k in range(f_m, f_m_plus): 117 | fbank[m - 1, k] = (filter_bin[m + 1] - k) / (filter_bin[m + 1] - filter_bin[m]) 118 | 119 | # Area Normalization 120 | # If we don't normalize the noise will increase with frequency because of the filter width. 121 | enorm = 2.0 / (hz_points[2:nfilt + 2] - hz_points[:nfilt]) 122 | fbank *= enorm[:, np.newaxis] 123 | 124 | filter_banks = np.dot(pow_frames, fbank.T) 125 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability 126 | filter_banks = 20 * np.log10(filter_banks) # dB 127 | 128 | return filter_banks 129 | 130 | 131 | def autocorr_norm(signal): 132 | """Computes the autocorrelation. 133 | 134 | Implementation details and description in: 135 | https://ccrma.stanford.edu/~orchi/Documents/speaker_recognition_report.pdf 136 | 137 | Parameters 138 | ---------- 139 | signal : nd-array 140 | Input from linear prediction coefficients are computed 141 | 142 | Returns 143 | ------- 144 | nd-array 145 | Autocorrelation result 146 | 147 | """ 148 | 149 | variance = np.var(signal) 150 | signal = np.copy(signal - signal.mean()) 151 | r = scipy.signal.correlate(signal, signal)[-len(signal):] 152 | 153 | if (signal == 0).all(): 154 | return np.zeros(len(signal)) 155 | 156 | acf = r / variance / len(signal) 157 | 158 | return acf 159 | 160 | 161 | def create_symmetric_matrix(acf, order=11): 162 | """Computes a symmetric matrix. 
163 | 164 | Implementation details and description in: 165 | https://ccrma.stanford.edu/~orchi/Documents/speaker_recognition_report.pdf 166 | 167 | Parameters 168 | ---------- 169 | acf : nd-array 170 | Input from which a symmetric matrix is computed 171 | order : int 172 | Order 173 | 174 | Returns 175 | ------- 176 | nd-array 177 | Symmetric Matrix 178 | 179 | """ 180 | 181 | smatrix = np.empty((order, order)) 182 | xx = np.arange(order) 183 | j = np.tile(xx, order) 184 | i = np.repeat(xx, order) 185 | smatrix[i, j] = acf[np.abs(i - j)] 186 | 187 | return smatrix 188 | 189 | 190 | def lpc(signal, n_coeff=12): 191 | """Computes the linear prediction coefficients. 192 | 193 | Implementation details and description in: 194 | https://ccrma.stanford.edu/~orchi/Documents/speaker_recognition_report.pdf 195 | 196 | Parameters 197 | ---------- 198 | signal : nd-array 199 | Input from linear prediction coefficients are computed 200 | n_coeff : int 201 | Number of coefficients 202 | 203 | Returns 204 | ------- 205 | nd-array 206 | Linear prediction coefficients 207 | 208 | """ 209 | 210 | if signal.ndim > 1: 211 | raise ValueError("Only 1 dimensional arrays are valid") 212 | if n_coeff > signal.size: 213 | raise ValueError("Input signal must have a length >= n_coeff") 214 | 215 | # Calculate the order based on the number of coefficients 216 | order = n_coeff - 1 217 | 218 | # Calculate LPC with Yule-Walker 219 | acf = np.correlate(signal, signal, 'full') 220 | 221 | r = np.zeros(order+1, 'float32') 222 | # Assuring that works for all type of input lengths 223 | nx = np.min([order+1, len(signal)]) 224 | r[:nx] = acf[len(signal)-1:len(signal)+order] 225 | 226 | smatrix = create_symmetric_matrix(r[:-1], order) 227 | 228 | if np.sum(smatrix) == 0: 229 | return tuple(np.zeros(order+1)) 230 | 231 | lpc_coeffs = np.dot(np.linalg.inv(smatrix), -r[1:]) 232 | 233 | return tuple(np.concatenate(([1.], lpc_coeffs))) 234 | 235 | 236 | def create_xx(features): 237 | """Computes the range of features amplitude for the probability density function calculus. 
238 | 239 | Parameters 240 | ---------- 241 | features : nd-array 242 | Input features 243 | 244 | Returns 245 | ------- 246 | nd-array 247 | range of features amplitude 248 | 249 | """ 250 | 251 | features_ = np.copy(features) 252 | 253 | if max(features_) < 0: 254 | max_f = - max(features_) 255 | min_f = min(features_) 256 | else: 257 | min_f = min(features_) 258 | max_f = max(features_) 259 | 260 | if min(features_) == max(features_): 261 | xx = np.linspace(min_f, min_f + 10, len(features_)) 262 | else: 263 | xx = np.linspace(min_f, max_f, len(features_)) 264 | 265 | return xx 266 | 267 | 268 | def kde(features): 269 | """Computes the probability density function of the input signal using a Gaussian KDE (Kernel Density Estimate) 270 | 271 | Parameters 272 | ---------- 273 | features : nd-array 274 | Input from which probability density function is computed 275 | 276 | Returns 277 | ------- 278 | nd-array 279 | probability density values 280 | 281 | """ 282 | features_ = np.copy(features) 283 | xx = create_xx(features_) 284 | 285 | if min(features_) == max(features_): 286 | noise = np.random.randn(len(features_)) * 0.0001 287 | features_ = np.copy(features_ + noise) 288 | 289 | kernel = scipy.stats.gaussian_kde(features_, bw_method='silverman') 290 | 291 | return np.array(kernel(xx) / np.sum(kernel(xx))) 292 | 293 | 294 | def gaussian(features): 295 | """Computes the probability density function of the input signal using a Gaussian function 296 | 297 | Parameters 298 | ---------- 299 | features : nd-array 300 | Input from which probability density function is computed 301 | Returns 302 | ------- 303 | nd-array 304 | probability density values 305 | 306 | """ 307 | 308 | features_ = np.copy(features) 309 | 310 | xx = create_xx(features_) 311 | std_value = np.std(features_) 312 | mean_value = np.mean(features_) 313 | 314 | if std_value == 0: 315 | return 0.0 316 | pdf_gauss = scipy.stats.norm.pdf(xx, mean_value, std_value) 317 | 318 | return np.array(pdf_gauss / np.sum(pdf_gauss)) 319 | 320 | 321 | def wavelet(signal, function=scipy.signal.ricker, widths=np.arange(1, 10)): 322 | """Computes CWT (continuous wavelet transform) of the signal. 323 | 324 | Parameters 325 | ---------- 326 | signal : nd-array 327 | Input from which CWT is computed 328 | function : wavelet function 329 | Default: scipy.signal.ricker 330 | widths : nd-array 331 | Widths to use for transformation 332 | Default: np.arange(1,10) 333 | 334 | Returns 335 | ------- 336 | nd-array 337 | The result of the CWT along the time axis 338 | matrix with size (len(widths),len(signal)) 339 | 340 | """ 341 | 342 | if isinstance(function, str): 343 | function = eval(function) 344 | 345 | if isinstance(widths, str): 346 | widths = eval(widths) 347 | 348 | cwt = scipy.signal.cwt(signal, function, widths) 349 | 350 | return cwt 351 | 352 | 353 | def calc_ecdf(signal): 354 | """Computes the ECDF of the signal. 355 | 356 | Parameters 357 | ---------- 358 | signal : nd-array 359 | Input from which ECDF is computed 360 | Returns 361 | ------- 362 | nd-array 363 | Sorted signal and computed ECDF. 
364 | 365 | """ 366 | return np.sort(signal), np.arange(1, len(signal)+1)/len(signal) 367 | 368 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/utils/__init__.py: -------------------------------------------------------------------------------- 1 | from tsfel.utils.calculate_complexity import * 2 | from tsfel.utils.signal_processing import * 3 | from tsfel.utils.add_personal_features import * 4 | from tsfel.utils.progress_bar import * 5 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/utils/add_personal_features.py: -------------------------------------------------------------------------------- 1 | import importlib 2 | import inspect 3 | import json 4 | import os 5 | import sys 6 | import warnings 7 | from inspect import getmembers, isfunction 8 | 9 | from tsfel.feature_extraction.features_settings import load_json 10 | from tsfel.utils.calculate_complexity import compute_complexity 11 | 12 | 13 | def add_feature_json(features_path, json_path): 14 | """Adds new feature to features.json. 15 | 16 | Parameters 17 | ---------- 18 | features_path: string 19 | Personal Python module directory containing new features implementation. 20 | 21 | json_path: string 22 | Personal .json file directory containing existing features from TSFEL. 23 | New customised features will be added to file in this directory. 24 | 25 | """ 26 | 27 | sys.path.append(features_path[:-len(features_path.split(os.sep)[-1]) - 1]) 28 | exec("import " + features_path.split(os.sep)[-1][:-3]) 29 | 30 | # Reload module containing the new features 31 | importlib.reload(sys.modules[features_path.split(os.sep)[-1][:-3]]) 32 | exec("import " + features_path.split(os.sep)[-1][:-3] + " as pymodule") 33 | 34 | # Functions from module containing the new features 35 | functions_list = [o for o in getmembers(locals()['pymodule']) if isfunction(o[1])] 36 | function_names = [fname[0] for fname in functions_list] 37 | 38 | # Check if @set_domain was declared on features module 39 | vset_domain = False 40 | 41 | for fname, f in list(locals()['pymodule'].__dict__.items()): 42 | 43 | if getattr(f, "domain", None) is not None: 44 | 45 | vset_domain = True 46 | 47 | # Access to personal features.json 48 | feat_json = load_json(json_path) 49 | 50 | # Assign domain and tag 51 | domain = getattr(f, "domain", None) 52 | tag = getattr(f, "tag", None) 53 | 54 | # Feature specifications 55 | # Description 56 | if f.__doc__ is not None: 57 | descrip = f.__doc__.split("\n")[0] 58 | else: 59 | descrip = "" 60 | # Feature usage 61 | use = "yes" 62 | # Feature function arguments 63 | args_name = inspect.getfullargspec(f)[0] 64 | 65 | # Access feature parameters 66 | if args_name != "": 67 | # Retrieve default values of arguments 68 | spec = inspect.getfullargspec(f) 69 | defaults = dict(zip(spec.args[::-1], (spec.defaults or ())[::-1])) 70 | defaults.update(spec.kwonlydefaults or {}) 71 | 72 | for p in args_name[1:]: 73 | if p not in list(defaults.keys()): 74 | if p == 'fs': 75 | # Assigning a default value for fs if not given 76 | defaults[p] = 100 77 | else: 78 | defaults[p] = None 79 | if len(defaults) == 0: 80 | defaults = "" 81 | else: 82 | defaults = "" 83 | 84 | # Settings of new feature 85 | new_feature = {"description": descrip, 86 | "parameters": defaults, 87 | "function": fname, 88 | "use": use 89 | } 90 | 91 | # Check if domain exists 92 | try: 93 | feat_json[domain][fname] = new_feature 94 | except KeyError: 95 | 
feat_json[domain] = {fname: new_feature} 96 | 97 | # Insert tag if it is declared 98 | if tag is not None: 99 | feat_json[domain][fname]['tag'] = tag 100 | 101 | # Write new feature on json file 102 | with open(json_path, "w") as fout: 103 | json.dump(feat_json, fout, indent=" ") 104 | 105 | # Calculate feature complexity 106 | compute_complexity(fname, domain, json_path, features_path=features_path) 107 | print('Feature '+str(fname)+' was added.') 108 | 109 | if vset_domain is False: 110 | warnings.warn('No features were added. Please declare @set_domain.', stacklevel=2) 111 | 112 | 113 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/utils/calculate_complexity.py: -------------------------------------------------------------------------------- 1 | import time 2 | import json 3 | import numpy as np 4 | from scipy.optimize import curve_fit 5 | from tsfel.feature_extraction.features_settings import load_json 6 | from tsfel.feature_extraction.calc_features import calc_window_features 7 | 8 | 9 | # curves 10 | def n_squared(x, no): 11 | """The model function""" 12 | return no * x ** 2 13 | 14 | 15 | def n_nlog(x, no): 16 | """The model function""" 17 | return no * x * np.log(x) 18 | 19 | 20 | def n_linear(x, no): 21 | """The model function""" 22 | return no * x 23 | 24 | 25 | def n_log(x, no): 26 | """The model function""" 27 | return no * np.log(x) 28 | 29 | 30 | def n_constant(x, no): 31 | """The model function""" 32 | return np.zeros(len(x)) + no 33 | 34 | 35 | def find_best_curve(t, signal): 36 | """Finds the best curve. 37 | 38 | Parameters 39 | ---------- 40 | t : nd-array 41 | Log space 42 | signal : nd-array 43 | Mean execution time array 44 | 45 | Returns 46 | ------- 47 | str 48 | Best fit curve name 49 | 50 | """ 51 | 52 | all_chisq = [] 53 | list_curves = [n_squared, n_nlog, n_linear, n_log, n_constant] 54 | all_curves = [] 55 | # Model parameters 56 | stdev = 2 57 | sig = np.zeros(len(signal)) + stdev 58 | 59 | # Fit the curve 60 | for curve in list_curves: 61 | start = 1 62 | popt, pcov = curve_fit(curve, t, signal, sigma=sig, p0=start, absolute_sigma=True) 63 | 64 | # Compute chi square 65 | nexp = curve(t, *popt) 66 | r = signal - nexp 67 | chisq = np.sum((r / stdev) ** 2) 68 | all_chisq.append(chisq) 69 | all_curves.append(nexp) 70 | 71 | idx_best = np.argmin(all_chisq) 72 | 73 | curve_name = str(list_curves[idx_best]) 74 | idx1 = curve_name.find("n_") 75 | idx2 = curve_name.find("at") 76 | curve_name = curve_name[idx1 + 2:idx2 - 1] 77 | 78 | return curve_name 79 | 80 | 81 | def compute_complexity(feature, domain, json_path, **kwargs): 82 | """Computes the feature complexity. 
83 | 84 | Parameters 85 | ---------- 86 | feature : string 87 | Feature name 88 | domain : string 89 | Feature domain 90 | json_path: json 91 | Features json file 92 | \**kwargs: 93 | See below: 94 | * *features_path* (``string``) -- 95 | Directory of script with personal features 96 | 97 | Returns 98 | ------- 99 | int 100 | Feature complexity 101 | 102 | Writes complexity in json file 103 | 104 | """ 105 | 106 | dictionary = load_json(json_path) 107 | 108 | features_path = kwargs.get('features_path', None) 109 | 110 | # The inputs from this function should be replaced by a dictionary 111 | one_feat_dict = {domain: {feature: dictionary[domain][feature]}} 112 | 113 | t = np.logspace(3.0, 5.0, 6) 114 | signal, s = [], [] 115 | f = 0.05 116 | x = np.arange(0, t[-1] + 1, 1) 117 | fs = 100 118 | wave = np.sin(2 * np.pi * f * x / fs) 119 | 120 | for ti in t: 121 | for _ in range(20): 122 | 123 | start = time.time() 124 | calc_window_features(one_feat_dict, wave[:int(ti)], fs, features_path=features_path) 125 | end = time.time() 126 | 127 | s += [end - start] 128 | 129 | signal += [np.mean(s)] 130 | 131 | curve_name = find_best_curve(t, signal) 132 | dictionary[domain][feature]['complexity'] = curve_name 133 | 134 | with open(json_path, "w") as write_file: 135 | json.dump(dictionary, write_file, indent=4, sort_keys=True) 136 | 137 | if curve_name == 'constant' or curve_name == 'log': 138 | return 1 139 | elif curve_name == 'linear': 140 | return 2 141 | elif curve_name == 'nlog' or curve_name == 'squared': 142 | return 3 143 | else: 144 | return 0 145 | -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/utils/progress_bar.py: -------------------------------------------------------------------------------- 1 | from IPython.display import HTML 2 | from IPython import get_ipython 3 | 4 | 5 | def progress_bar_terminal(iteration, total, prefix="", suffix="", decimals=0, length=100, fill="█", printend="\r"): 6 | """Call in a loop to create terminal progress bar. 7 | 8 | Parameters 9 | ---------- 10 | iteration: int 11 | current iteration 12 | total: int 13 | total iterations 14 | prefix: str 15 | prefix string 16 | suffix: str 17 | suffix string 18 | decimals: int 19 | positive number of decimals in percent complete 20 | length: int 21 | character length of bar 22 | fill: str 23 | bar fill character 24 | printend: str 25 | end character (e.g. "\r", "\r\n") 26 | """ 27 | 28 | percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total))) 29 | filledlength = int(length * iteration // total) 30 | bar = fill * filledlength + "-" * (length - filledlength) 31 | print("\r%s |%s| %s%% %s" % (prefix, bar, percent, suffix), end=printend) 32 | # Print New Line on Complete 33 | if iteration == total: 34 | print() 35 | 36 | 37 | def progress_bar_notebook(iteration, total=100): 38 | """Progress bar for notebooks. 39 | 40 | Parameters 41 | ---------- 42 | iteration: int 43 | current iteration 44 | total: int 45 | total iterations 46 | 47 | Returns 48 | ------- 49 | Progress bar for notebooks 50 | 51 | """ 52 | result = int((iteration / total) * 100) 53 | return HTML( 54 | """ 55 |

<p> 56 | Progress: {result}% Complete 57 | </p>
58 | 63 | {value} 64 | 65 | 66 | """.format( 67 | value=iteration, max_value=total, result=result 68 | ) 69 | ) 70 | 71 | 72 | def display_progress_bar(iteration, total, out): 73 | """Displays progress bar according to python interface. 74 | 75 | Parameters 76 | ---------- 77 | iteration: int 78 | current iteration 79 | total: int 80 | total iterations 81 | out: progress bar notebook output 82 | 83 | """ 84 | 85 | if ( 86 | (get_ipython().__class__.__name__ == "ZMQInteractiveShell") 87 | or (get_ipython().__class__.__name__ == "Shell") 88 | and out is not None 89 | ): 90 | out.update(progress_bar_notebook(iteration + 1, total)) 91 | else: 92 | progress_bar_terminal(iteration + 1, total, prefix="Progress:", suffix="Complete", length=50) 93 | return -------------------------------------------------------------------------------- /prompt_bank/stat-prompt/tsfel/utils/signal_processing.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from scipy.interpolate import interp1d 4 | 5 | 6 | def signal_window_splitter(signal, window_size, overlap=0): 7 | """Splits the signal into windows 8 | Parameters 9 | ---------- 10 | signal : nd-array or pandas DataFrame 11 | input signal 12 | window_size : int 13 | number of points of window size 14 | overlap : float 15 | percentage of overlap, value between 0 and 1 (exclusive) 16 | Default: 0 17 | Returns 18 | ------- 19 | list 20 | list of signal windows 21 | """ 22 | if not isinstance(window_size, int): 23 | raise SystemExit('window_size must be an integer.') 24 | step = int(round(window_size)) if overlap == 0 else int(round(window_size * (1 - overlap))) 25 | if step == 0: 26 | raise SystemExit('Invalid overlap. ' 27 | 'Choose a lower overlap value.') 28 | if len(signal) % window_size == 0 and overlap == 0: 29 | return [signal[i:i + window_size] for i in range(0, len(signal), step)] 30 | else: 31 | return [signal[i:i + window_size] for i in range(0, len(signal) - window_size + 1, step)] 32 | 33 | 34 | def merge_time_series(data, fs_resample, time_unit): 35 | """Time series data interpolation 36 | 37 | Parameters 38 | ---------- 39 | data : dict 40 | data to interpolate 41 | fs_resample : 42 | resample sampling frequency 43 | time_unit : 44 | time unit in seconds 45 | 46 | Returns 47 | ------- 48 | DataFrame 49 | Interpolated data 50 | 51 | """ 52 | 53 | # time interval for interpolation 54 | sensors_time = np.array([[dn.iloc[0, 0], dn.iloc[-1, 0]] for k, dn in data.items()]) 55 | t0 = np.max(sensors_time[:, 0]) 56 | tn = np.min(sensors_time[:, 1]) 57 | x_new = np.linspace(t0, tn, int((tn - t0) / ((1 / fs_resample) * time_unit))) 58 | 59 | # interpolation 60 | data_new = np.copy(x_new.reshape(len(x_new), 1)) 61 | header_values = ['time'] 62 | for k, dn in data.items(): 63 | header_values += [k + str(i) for i in range(1, np.shape(dn)[1])] 64 | data_new = np.hstack((data_new, np.array([interp1d(dn.iloc[:, 0], dn.iloc[:, ax])(x_new) for ax in range(1, np.shape(dn)[1])]).T)) 65 | 66 | return pd.DataFrame(data=data_new[:, 1:], columns=header_values[1:]) 67 | 68 | 69 | def correlated_features(features, threshold=0.95): 70 | """Compute pairwise correlation of features using pearson method 71 | 72 | Parameters 73 | ---------- 74 | features : DataFrame 75 | features 76 | threshold : 77 | correlation value for removing highly correlated features 78 | Returns 79 | ------- 80 | DataFrame 81 | correlated features names 82 | 83 | """ 84 | corr_matrix = features.corr().abs() 85 | # Select 
upper triangle of correlation matrix 86 | upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) 87 | # Find index and column name of features with correlation greater than 0.95 88 | to_drop = [column for column in upper.columns if any(upper[column] > threshold)] 89 | 90 | return to_drop 91 | -------------------------------------------------------------------------------- /prompt_bank/text_prompt_data_csv/csv_prompt.json: -------------------------------------------------------------------------------- 1 | { 2 | "0": "The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.", 3 | "1": "The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.", 4 | "2": "The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.", 5 | "3": "The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.", 6 | "4": "Electricity contains electircity consumption of 321 clients from 2012 to 2014. And the data was converted to reflect hourly consumption.", 7 | "5": "Exchange rate is a collection of the daily exchange rates of eight foreign countries ranging from 1990 to 2016.", 8 | "6": "Traffic is a collection of hourly data from California Department of Transportation, which describes the road occupancy rates measured by different sensors on San Francisco Bay area freeways.", 9 | "7": "Weather is recorded every 10 minutes for the 2020 whole year, which contains 21 meteorological indicators, such as air temperature, humidity, etc." 
10 | } -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | accelerate==0.18.0 2 | charset-normalizer==3.1.0 3 | cmake==3.26.3 4 | contourpy==1.0.7 5 | cycler==0.11.0 6 | einops==0.6.0 7 | filelock==3.12.0 8 | fonttools==4.39.3 9 | huggingface-hub==0.13.4 10 | idna==3.4 11 | importlib-resources==5.12.0 12 | Jinja2==3.1.2 13 | joblib==1.2.0 14 | kiwisolver==1.4.4 15 | lit==16.0.1 16 | MarkupSafe==2.1.2 17 | matplotlib==3.7.1 18 | mpmath==1.3.0 19 | networkx==3.1 20 | numpy==1.24.2 21 | nvidia-cublas-cu11==11.10.3.66 22 | nvidia-cuda-cupti-cu11==11.7.101 23 | nvidia-cuda-nvrtc-cu11==11.7.99 24 | nvidia-cuda-runtime-cu11==11.7.99 25 | nvidia-cudnn-cu11==8.5.0.96 26 | nvidia-cufft-cu11==10.9.0.58 27 | nvidia-curand-cu11==10.2.10.91 28 | nvidia-cusolver-cu11==11.4.0.1 29 | nvidia-cusparse-cu11==11.7.4.91 30 | nvidia-nccl-cu11==2.14.3 31 | nvidia-nvtx-cu11==11.7.91 32 | packaging==23.1 33 | pandas==2.0.0 34 | Pillow==9.5.0 35 | pyparsing==3.0.9 36 | python-dateutil==2.8.2 37 | pytz==2023.3 38 | PyYAML==6.0 39 | regex==2023.3.23 40 | requests==2.28.2 41 | scikit-learn==1.2.2 42 | scipy==1.10.1 43 | six==1.16.0 44 | sympy==1.11.1 45 | threadpoolctl==3.1.0 46 | tokenizers==0.13.3 47 | torch==2.0.0 48 | tqdm==4.65.0 49 | transformers==4.28.1 50 | triton==2.0.0 51 | typing_extensions==4.5.0 52 | tzdata==2023.3 53 | urllib3==1.26.15 54 | zipp==3.15.0 55 | -------------------------------------------------------------------------------- /scripts/test_csv_lora.sh: -------------------------------------------------------------------------------- 1 | TRAIN="datasets/ETT-small/ETTh1.csv 2 | datasets/ETT-small/ETTh2.csv 3 | datasets/ETT-small/ETTm1.csv 4 | datasets/ETT-small/ETTm2.csv 5 | datasets/electricity/electricity.csv 6 | datasets/exchange_rate/exchange_rate.csv 7 | datasets/traffic/traffic.csv 8 | datasets/weather/weather.csv" 9 | 10 | TEST="datasets/ETT-small/ETTh1.csv 11 | datasets/ETT-small/ETTh2.csv 12 | datasets/ETT-small/ETTm1.csv 13 | datasets/ETT-small/ETTm2.csv 14 | datasets/electricity/electricity.csv 15 | datasets/exchange_rate/exchange_rate.csv 16 | datasets/traffic/traffic.csv 17 | datasets/weather/weather.csv" 18 | 19 | PROMPT="prompt_bank/prompt_data_normalize_csv_split" 20 | 21 | epoch=500 22 | downsample_rate=20 23 | freeze=0 24 | 25 | for pred_len in 96 192 336 720 26 | do 27 | for lr in 1e-3 28 | do 29 | for lora_dim in 32 64 30 | do 31 | OUTPUT_PATH="output/test_ltsm_lr${lr}_lora${lora_dim}_down${downsample_rate}_freeze${freeze}_e${epoch}_pred${pred_len}/" 32 | CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 python3 main_ltsm.py \ 33 | --lora \ 34 | --lora_dim ${lora_dim} \ 35 | --model_id test_run \ 36 | --train_epochs ${epoch} \ 37 | --batch_size 800 \ 38 | --pred_len ${pred_len} \ 39 | --gradient_accumulation_steps 64 \ 40 | --data_path ${TRAIN} \ 41 | --test_data_path_list ${TEST} \ 42 | --prompt_data_path ${PROMPT} \ 43 | --freeze ${freeze} \ 44 | --learning_rate ${lr} \ 45 | --downsample_rate ${downsample_rate} \ 46 | --output_dir ${OUTPUT_PATH} 47 | done 48 | done 49 | done 50 | -------------------------------------------------------------------------------- /scripts/test_ltsm.sh: -------------------------------------------------------------------------------- 1 | TRAIN=" 2 | all_six_datasets/ETT-small/ETTh1.csv 3 | all_six_datasets/ETT-small/ETTh2.csv 4 | all_six_datasets/ETT-small/ETTm1.csv 5 |
all_six_datasets/ETT-small/ETTm2.csv 6 | all_six_datasets/electricity/electricity.csv 7 | all_six_datasets/exchange_rate/exchange_rate.csv 8 | all_six_datasets/traffic/traffic.csv 9 | all_six_datasets/weather/weather.csv" 10 | 11 | 12 | TEST=" 13 | all_six_datasets/ETT-small/ETTh1.csv 14 | all_six_datasets/ETT-small/ETTh2.csv 15 | all_six_datasets/ETT-small/ETTm1.csv 16 | all_six_datasets/ETT-small/ETTm2.csv 17 | all_six_datasets/electricity/electricity.csv 18 | all_six_datasets/exchange_rate/exchange_rate.csv 19 | all_six_datasets/traffic/traffic.csv 20 | all_six_datasets/weather/weather.csv" 21 | 22 | PROMPT="prompt_bank/prompt_data_normalize_csv_split" 23 | epoch=500 24 | downsample_rate=20 25 | freeze=0 26 | lr=1e-3 27 | 28 | 29 | for pred_len in 96 30 | do 31 | 32 | CUDA_VISIBLE_DEVICES=0,1 python3 main_ltsm.py \ 33 | --model LTSM \ 34 | --model_name_or_path gpt2-medium \ 35 | --local_pretrain LSC2204/LTSM-bundle \ 36 | --train_epochs ${epoch} \ 37 | --batch_size 800 \ 38 | --pred_len ${pred_len} \ 39 | --gradient_accumulation_steps 64 \ 40 | --data_path ${TRAIN} \ 41 | --test_data_path_list ${TEST} \ 42 | --prompt_data_path ${PROMPT} \ 43 | --freeze ${freeze} \ 44 | --learning_rate ${lr} \ 45 | --downsample_rate ${downsample_rate} \ 46 | --output_dir "output/ltsm_csv_medium_lr${lr}_loraFalse_down${downsample_rate}_freeze${freeze}_e${epoch}_pred${pred_len}/"\ 47 | --eval 1 48 | done 49 | -------------------------------------------------------------------------------- /scripts/train_ltsm_csv.sh: -------------------------------------------------------------------------------- 1 | TRAIN="datasets/ETT-small/ETTh1.csv 2 | datasets/ETT-small/ETTh2.csv 3 | datasets/ETT-small/ETTm1.csv 4 | datasets/ETT-small/ETTm2.csv 5 | datasets/electricity/electricity.csv 6 | datasets/exchange_rate/exchange_rate.csv 7 | datasets/traffic/traffic.csv 8 | datasets/weather/weather.csv" 9 | 10 | 11 | TEST="datasets/ETT-small/ETTh1.csv 12 | datasets/ETT-small/ETTh2.csv 13 | datasets/ETT-small/ETTm1.csv 14 | datasets/ETT-small/ETTm2.csv 15 | datasets/electricity/electricity.csv 16 | datasets/exchange_rate/exchange_rate.csv 17 | datasets/traffic/traffic.csv 18 | datasets/weather/weather.csv" 19 | 20 | PROMPT="prompt_bank/prompt_data_normalize_split" 21 | 22 | epoch=1000 23 | downsample_rate=20 24 | freeze=0 25 | lr=1e-3 26 | 27 | 28 | for pred_len in 96 192 336 720 29 | do 30 | OUTPUT_PATH="output/ltsm_lr${lr}_loraFalse_down${downsample_rate}_freeze${freeze}_e${epoch}_pred${pred_len}/" 31 | echo "Current OUTPUT_PATH: ${OUTPUT_PATH}" 32 | CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main_ltsm.py \ 33 | --model LTSM \ 34 | --model_name_or_path gpt2-medium \ 35 | --train_epochs ${epoch} \ 36 | --batch_size 100 \ 37 | --pred_len ${pred_len} \ 38 | --gradient_accumulation_steps 64 \ 39 | --data_path ${TRAIN} \ 40 | --test_data_path_list ${TEST} \ 41 | --prompt_data_path ${PROMPT} \ 42 | --freeze ${freeze} \ 43 | --learning_rate ${lr} \ 44 | --downsample_rate ${downsample_rate} \ 45 | --output_dir ${OUTPUT_PATH}\ 46 | --eval 0 47 | done 48 | -------------------------------------------------------------------------------- /scripts/train_ltsm_textprompt_csv.sh: -------------------------------------------------------------------------------- 1 | TRAIN="datasets/ETT-small/ETTh1.csv 2 | datasets/ETT-small/ETTh2.csv 3 | datasets/ETT-small/ETTm1.csv 4 | datasets/ETT-small/ETTm2.csv 5 | datasets/electricity/electricity.csv 6 | datasets/exchange_rate/exchange_rate.csv 7 | datasets/traffic/traffic.csv 8 | datasets/weather/weather.csv" 9 
| 10 | TEST="datasets/ETT-small/ETTh1.csv 11 | datasets/ETT-small/ETTh2.csv 12 | datasets/ETT-small/ETTm1.csv 13 | datasets/ETT-small/ETTm2.csv 14 | datasets/electricity/electricity.csv 15 | datasets/exchange_rate/exchange_rate.csv 16 | datasets/traffic/traffic.csv 17 | datasets/weather/weather.csv" 18 | 19 | PROMPT="prompt_bank/text_prompt_data_csv/csv_prompt.json" 20 | epoch=1000 21 | downsample_rate=20 22 | freeze=0 23 | lr=1e-3 24 | 25 | 26 | for pred_len in 96 192 336 720 27 | do 28 | OUTPUT_PATH="output/ltsm_textprompt_lr${lr}_loraFalse_down${downsample_rate}_freeze${freeze}_e${epoch}_pred${pred_len}/" 29 | CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main_ltsm.py \ 30 | --model LTSM_WordPrompt \ 31 | --model_name_or_path gpt2-medium \ 32 | --train_epochs ${epoch} \ 33 | --batch_size 10 \ 34 | --pred_len ${pred_len} \ 35 | --gradient_accumulation_steps 64 \ 36 | --data_path ${TRAIN} \ 37 | --test_data_path_list ${TEST} \ 38 | --prompt_data_path ${PROMPT} \ 39 | --freeze ${freeze} \ 40 | --learning_rate ${lr} \ 41 | --downsample_rate ${downsample_rate} \ 42 | --output_dir ${OUTPUT_PATH} \ 43 | --eval 0 44 | done 45 | -------------------------------------------------------------------------------- /scripts/train_ltsm_tokenizer_csv.sh: -------------------------------------------------------------------------------- 1 | TRAIN="datasets/ETT-small/ETTh1.csv 2 | datasets/ETT-small/ETTh2.csv 3 | datasets/ETT-small/ETTm1.csv 4 | datasets/ETT-small/ETTm2.csv 5 | datasets/electricity/electricity.csv 6 | datasets/exchange_rate/exchange_rate.csv 7 | datasets/traffic/traffic.csv 8 | datasets/weather/weather.csv" 9 | 10 | TEST="datasets/ETT-small/ETTh1.csv 11 | datasets/ETT-small/ETTh2.csv 12 | datasets/ETT-small/ETTm1.csv 13 | datasets/ETT-small/ETTm2.csv 14 | datasets/electricity/electricity.csv 15 | datasets/exchange_rate/exchange_rate.csv 16 | datasets/traffic/traffic.csv 17 | datasets/weather/weather.csv" 18 | PROMPT="prompt_bank/prompt_data_normalize_csv_split" 19 | lr=1e-3 20 | epoch=50 21 | downsample_rate=20 22 | freeze=0 23 | d_ff=128 24 | 25 | for pred_len in 96 26 | do 27 | OUTPUT_PATH="output/ltsm_tokenizer_lr${lr}_loraFalse_down${downsample_rate}_freeze${freeze}_e${epoch}_pred${pred_len}/" 28 | CUDA_VISIBLE_DEVICES=0,1 python3 main_tokenizer.py \ 29 | --model LTSM_Tokenizer \ 30 | --model_name_or_path gpt2-medium \ 31 | --d_ff $d_ff \ 32 | --train_epochs ${epoch} \ 33 | --batch_size 20 \ 34 | --pred_len ${pred_len} \ 35 | --gradient_accumulation_steps 64 \ 36 | --data_path ${TRAIN} \ 37 | --test_data_path_list ${TEST} \ 38 | --prompt_data_path ${PROMPT} \ 39 | --freeze ${freeze} \ 40 | --learning_rate ${lr} \ 41 | --downsample_rate ${downsample_rate} \ 42 | --output_dir ${OUTPUT_PATH} \ 43 | --eval 1 44 | done 45 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | setuptools.setup( 4 | name="ltsm", 5 | version='1.0.0', 6 | author="Data Lab", 7 | author_email="daochen.zha@rice.edu", 8 | description="Large Time Series Model", 9 | url="XXXX", 10 | keywords=["Time Series"], 11 | packages=setuptools.find_packages(exclude=('tests',)), 12 | python_requires='>=3.8', 13 | classifiers=[ 14 | "Programming Language :: Python :: 3.8", 15 | "License :: OSI Approved :: MIT License", 16 | "Operating System :: OS Independent", 17 | ], 18 | ) 19 | -------------------------------------------------------------------------------- /tutorial/README.md:
-------------------------------------------------------------------------------- 1 | # Tutorial of LTSM-bundle 2 | 3 | 4 | ## Installation 5 | ``` 6 | conda create -n ltsm python=3.8.0 7 | conda activate ltsm 8 | git clone git@github.com:daochenzha/ltsm.git 9 | cd ltsm 10 | pip3 install -e . 11 | pip3 install -r requirements.txt 12 | ``` 13 | 14 | 15 | ## :bookmark: Step 0: Collect Datasets and Time Series Prompts 16 | 17 | ### :cyclone: You can use our prepared datasets to onboard yourself onto LTSM-bundle 18 | 19 | ### Download training datasets 20 | ```bash 21 | cd datasets 22 | download: https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P 23 | ``` 24 | 25 | ### Download time series prompts 26 | ```bash 27 | cd prompt_bank/prompt_data_csv 28 | download: https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P 29 | ``` 30 | 31 | ### Check word prompts 32 | ```bash 33 | cd prompt_bank/text_prompt_data_csv/ 34 | check: csv_prompt.json 35 | ``` 36 | 37 | ## :bookmark: Step 1: Customize Datasets and Time Series Prompts 38 | 39 | ### :cyclone: If you prefer to build LTSM-bundle on your own dataset, please follow the 5-step instructions below: 40 | 41 | **Step 1-a.** Prepare your dataset. Make sure your local data folder looks like this: 42 | ```` 43 | - ltsm/ 44 | - datasets/ 45 | DATA_1.csv 46 | DATA_2.csv 47 | ... 48 | ```` 49 | 50 | **Step 1-b.** Generate the time series prompts from the training, validation, and test datasets: 51 | ````bash 52 | python3 prompt_generate_split.py 53 | ```` 54 | 55 | **Step 1-c.** Find the generated time series prompts in the './prompt_data_split' folder. Then run the following command to normalize the prompts: 56 | ````bash 57 | python3 prompt_normalization_split.py --mode fit 58 | ```` 59 | 60 | **Step 1-d.** Run this command to export the prompts to the "./prompt_data_normalize_split" folder: 61 | ````bash 62 | python3 prompt_normalization_split.py --mode transform 63 | ```` 64 | 65 | **Step 1-e.** Modify the word prompts based on your dataset description in "prompt_bank/text_prompt_data_csv/csv_prompt.json": 66 | ````bash 67 | vim prompt_bank/text_prompt_data_csv/csv_prompt.json 68 | ```` 69 | 70 | ## :bookmark: Step 2: Customize your own LTSM-bundle 71 | 72 | ### :cyclone: Now, it's time to build your own LTSM-bundle!
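Before launching any of the options below, it can help to sanity-check that the dataset CSVs and prompt files sit where the command-line flags expect them. The snippet below is a small, hypothetical pre-flight check (it is not part of the repository); it only assumes pandas and the folder layout used in Step 0 and Step 1 of this tutorial.

```python
# Hypothetical pre-flight check before running main_ltsm.py (not part of the repository).
# The paths mirror the flags used in the option commands below.
import os
import pandas as pd

data_csv = "datasets/ETT-small/ETTh1.csv"                               # --data_path / --test_data_path_list
ts_prompt_dir = "prompt_bank/prompt_data_normalize_split"               # --prompt_data_path for Option-(2)/(3)
word_prompt_json = "prompt_bank/text_prompt_data_csv/csv_prompt.json"   # --prompt_data_path for Option-(1)

# The dataset CSV should load and expose a time column followed by value columns.
df = pd.read_csv(data_csv)
print(data_csv, df.shape, list(df.columns)[:5])

# Time series prompts live in a folder; word prompts live in a single JSON file.
if os.path.isdir(ts_prompt_dir):
    print("time-series prompt files:", len(os.listdir(ts_prompt_dir)))
else:
    print("missing time-series prompt folder:", ts_prompt_dir)
print("word prompt json exists:", os.path.isfile(word_prompt_json))
```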
73 | 74 | #### Option-(1) Explore [Word Prompt] and [Linear Tokenization] on gpt2-medium 75 | ```bash 76 | python3 main_ltsm.py \ 77 | --model LTSM_WordPrompt \ 78 | --model_name_or_path gpt2-medium \ 79 | --train_epochs 500 \ 80 | --batch_size 10 \ 81 | --pred_len 96 \ 82 | --data_path "datasets/ETT-small/ETTh1.csv" \ 83 | --test_data_path_list "datasets/ETT-small/ETTh1.csv" \ 84 | --prompt_data_path "prompt_bank/text_prompt_data_csv/csv_prompt.json" \ 85 | --freeze 0 \ 86 | --learning_rate 1e-3 \ 87 | --downsample_rate 20 \ 88 | --output_dir [Your_Output_Path] 89 | ``` 90 | 91 | #### Option-(2) Explore [Time Series Prompt] and [Linear Tokenization] on gpt2-medium 92 | ```bash 93 | python3 main_ltsm.py \ 94 | --model LTSM \ 95 | --model_name_or_path gpt2-medium \ 96 | --train_epochs 500 \ 97 | --batch_size 10 \ 98 | --pred_len 96 \ 99 | --data_path "datasets/ETT-small/ETTh1.csv" \ 100 | --test_data_path_list "datasets/ETT-small/ETTh1.csv" \ 101 | --prompt_data_path "prompt_bank/prompt_data_normalize_split" \ 102 | --freeze 0 \ 103 | --learning_rate 1e-3 \ 104 | --downsample_rate 20 \ 105 | --output_dir [Your_Output_Path] 106 | ``` 107 | 108 | #### Option-(3) Fine-tune the pre-trained LTSM-bundle model on your dataset: [Time Series Prompt] and [Linear Tokenization] on gpt2-medium 109 | ```bash 110 | # The LSC2204/LTSM-bundle model weights are trained for pred_len == 96 111 | python3 main_ltsm.py \ 112 | --model LTSM \ 113 | --model_name_or_path gpt2-medium \ 114 | --local_pretrain LSC2204/LTSM-bundle \ 115 | --train_epochs 500 \ 116 | --batch_size 10 \ 117 | --pred_len 96 \ 118 | --data_path "datasets/ETT-small/ETTh1.csv" \ 119 | --test_data_path_list "datasets/ETT-small/ETTh1.csv" \ 120 | --prompt_data_path "prompt_bank/prompt_data_normalize_split" \ 121 | --freeze 0 \ 122 | --learning_rate 1e-3 \ 123 | --downsample_rate 20 \ 124 | --output_dir [Your_Output_Path] 125 | ``` 126 | --------------------------------------------------------------------------------
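As a closing note, the vendored feature-extraction helpers dumped above (prompt_bank/stat-prompt/tsfel/utils/signal_processing.py) can be exercised on their own. The sketch below is a minimal, self-contained usage example on synthetic data; it assumes it is run from prompt_bank/stat-prompt/ so that the vendored tsfel package is importable, and the window size, overlap, and column names are made up for illustration.

```python
# Minimal usage sketch for signal_window_splitter and correlated_features
# from prompt_bank/stat-prompt/tsfel/utils/signal_processing.py.
# Assumes the script is run from prompt_bank/stat-prompt/ so the vendored
# `tsfel` package is on the import path; the data and column names below
# are synthetic and purely illustrative.
import numpy as np
import pandas as pd

from tsfel.utils.signal_processing import correlated_features, signal_window_splitter

# Split a noisy sine wave into 200-point windows with 50% overlap.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)
windows = signal_window_splitter(signal, window_size=200, overlap=0.5)
print(len(windows), len(windows[0]))  # -> 19 windows of 200 points each

# Build a toy per-window feature table and drop highly correlated columns,
# which is exactly the pruning step correlated_features() implements.
features = pd.DataFrame({
    "mean": [np.mean(w) for w in windows],
    "std": [np.std(w) for w in windows],
    "var": [np.var(w) for w in windows],  # closely tracks "std", so it is a drop candidate
    "max": [np.max(w) for w in windows],
})
to_drop = correlated_features(features, threshold=0.95)
print("columns to drop:", to_drop)
features = features.drop(columns=to_drop)
```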