├── LICENSE ├── README.md ├── src ├── config.py ├── credentials.yml ├── dataloader.py ├── gemini_api.py ├── generator.py ├── generator.sh ├── llava_api.py ├── main.py ├── openai_api.py ├── prompt.py ├── qwen_api.py ├── result_agg.py ├── test.sh └── utils.py └── teaser.png /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 mllm-ts 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Can Multimodal LLMs Perform Time Series Anomaly Detection? 2 | This repo includes the official code and datasets for paper ["Can Multimodal LLMs Perform Time Series Anomaly Detection?"](https://arxiv.org/abs/2502.17812) 3 | 4 | ## 🕵️‍♂️ VisualTimeAnomaly 5 |
 6 | <img src="teaser.png">
 7 | 
 8 | 
 9 | Left: the workflow of VisualTimeAnomaly. Right: the performance comparison across various settings.
10 | 
11 | ## 🏆 Contributions
12 | - The first comprehensive benchmark for multimodal LLMs (MLLMs) in time series anomaly detection (TSAD), covering diverse scenarios (univariate, multivariate, irregular) and varying anomaly granularities (point-, range-, variate-wise).
13 | - Several critical insights that significantly advance the understanding of both MLLMs and TSAD.
14 | - We construct a large-scale dataset including 12.4k time series images, and release the dataset and code to foster future research.
15 | 
16 | ## 🔎 Findings
17 | - MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies;
18 | - MLLMs are highly robust to irregular time series, even with 25% of the data missing;
19 | - Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series.
20 | 
21 | ## ⚙️ Getting Started
22 | ### Environment
23 | * python 3.10.14
24 | * torch 2.4.1
25 | * numpy 1.26.4
26 | * transformers 4.49.0.dev0
27 | * huggingface-hub 0.24.7
28 | * openai 1.44.0
29 | * google-generativeai 0.8.3
30 | 
31 | ### Dataset
32 | Enter the `src` folder.
33 | 
34 | To generate all datasets, run the script below:
35 | 
36 | `./generator.sh`
37 | 
38 | To generate a specific dataset, run the script below:
39 | 
40 | `python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts`
41 | 
42 | For example, to generate 100 univariate time series images with global anomalies:
43 | 
44 | `python generator.py --category synthetic --scenario univariate --anomaly_type global --num_ts 100`
45 | 
46 | ### Run
47 | Enter the `src` folder.
48 | 
49 | To run MLLMs on all datasets, execute the script below:
50 | 
51 | `./test.sh`
52 | 
53 | To run an MLLM on a specific dataset, execute the script below:
54 | 
55 | `python main.py --category $category --scenario $scenario --model_name $model_name --data $data`
56 | 
57 | For example, to run GPT-4o on the univariate time series scenario with global anomalies:
58 | 
59 | `python main.py --category synthetic --scenario univariate --model_name gpt-4o --data global`
60 | 
61 | ## Acknowledgement
62 | We sincerely appreciate the following GitHub repos for the code base and datasets:
63 | 
64 | https://github.com/Rose-STL-Lab/AnomLLM
65 | 
66 | https://github.com/datamllab/tods/tree/benchmark
67 | 
68 | ## 📝 Citation
69 | If you find our work useful, please cite the paper below:
70 | ```bibtex
71 | @article{xu2025can,
72 |   title={Can Multimodal LLMs Perform Time Series Anomaly Detection?},
73 |   author={Xu, Xiongxiao and Wang, Haoran and Liang, Yueqing and Yu, Philip S and Zhao, Yue and Shu, Kai},
74 |   journal={arXiv preprint arXiv:2502.17812},
75 |   year={2025}
76 | }
77 | ```
78 | 
--------------------------------------------------------------------------------
/src/config.py:
--------------------------------------------------------------------------------
 1 | from prompt import create_openai_request
 2 | 
 3 | def create_api_configs():
 4 |     return {
 5 |         '0shot-vision': lambda train_dataset, data_tuple: create_openai_request(
 6 |             vision=True,
 7 |             few_shots=train_dataset.few_shots(num_shots=0),
 8 |             data_tuple=data_tuple
 9 |         )
10 |     }
11 | 
--------------------------------------------------------------------------------
/src/credentials.yml:
--------------------------------------------------------------------------------
 1 | gpt-4o:
 2 |   api_key: 
 3 |   base_url: (ended 
with v1) 4 | gemini-1.5: 5 | api_key: 6 | -------------------------------------------------------------------------------- /src/dataloader.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset 3 | import pickle 4 | import os 5 | import numpy as np 6 | 7 | # time series image dataset 8 | class TSIDataset(Dataset): 9 | 10 | def __init__(self, data_dir): 11 | self.data_dir = data_dir 12 | self.figs_dir = os.path.join(data_dir, 'figs') 13 | self.series = [] 14 | self.anom = [] 15 | 16 | # Load data 17 | with open(os.path.join(self.data_dir, 'data.pkl'), 'rb') as f: 18 | data_dict = pickle.load(f) 19 | self.series = data_dict['series'] 20 | self.anom = data_dict['anom'] 21 | self.scenario = data_dir.split('/')[2] 22 | train_eval = data_dir.split('/')[-1] 23 | 24 | if self.scenario == 'irr_univariate': 25 | self.drop_index = data_dict['drop_index'] 26 | if train_eval == 'eval': 27 | print(f"Loaded dataset {data_dir} with {len(self.series)} series.") 28 | 29 | def __len__(self): 30 | return len(self.series) 31 | 32 | def __getitem__(self, idx): 33 | anom = self.anom[idx] 34 | series = self.series[idx] 35 | 36 | # Convert to torch tensors 37 | anom = torch.tensor(anom, dtype=torch.float32) 38 | series = torch.tensor(series, dtype=torch.float32) 39 | if self.scenario == 'irr_univariate': 40 | drop_index = self.drop_index[idx] 41 | drop_index = torch.tensor(drop_index, dtype=torch.float32) 42 | return anom, series, drop_index 43 | else: 44 | return anom, series 45 | 46 | def few_shots(self, num_shots=5, idx=None): 47 | if idx is None: 48 | idx = np.random.choice(len(self.series), num_shots, replace=False) 49 | few_shot_data = [] 50 | for i in idx: 51 | anom, series = self.__getitem__(i) 52 | anom = [{"start": int(start.item()), "end": int(end.item())} for start, end in list(anom[0])] 53 | few_shot_data.append((series, anom, i+1)) 54 | return few_shot_data -------------------------------------------------------------------------------- /src/gemini_api.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from PIL import Image 3 | from loguru import logger 4 | import yaml 5 | import requests 6 | from io import BytesIO 7 | import base64 8 | 9 | SAFETY_SETTINGS = [ 10 | {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"}, 11 | {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, 12 | {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, 13 | {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"}, 14 | ] 15 | 16 | def load_gemini(model_name): 17 | credentials = yaml.safe_load(open("credentials.yml")) 18 | assert model_name in credentials, f"Model {model_name} not found in credentials" 19 | 20 | credential = credentials[model_name] 21 | api_key = credential["api_key"] 22 | 23 | genai.configure(api_key=api_key) 24 | model = genai.GenerativeModel(model_name) 25 | 26 | return model 27 | 28 | def call_gemini(model_name, model, gemini_request): 29 | logger.debug(f"{model_name} is running") 30 | 31 | response = model.generate_content( 32 | contents=gemini_request['messages'], 33 | generation_config=gemini_request['config'], 34 | # **gemini_request, 35 | safety_settings=SAFETY_SETTINGS, 36 | ) 37 | return response.text 38 | 39 | 40 | def convert_openai_to_gemini(openai_request): 41 | gemini_messages = [] 42 | 43 | for message in openai_request["messages"]: 44 | parts = [] 45 | 
for content in message["content"]: 46 | if content["type"] == "text": 47 | parts.append(content["text"]) 48 | elif content["type"] == "image_url": 49 | image_url = content["image_url"]["url"] 50 | if image_url.startswith("data:image"): 51 | base64_str = image_url.split(",")[1] 52 | img_data = base64.b64decode(base64_str) 53 | img = Image.open(BytesIO(img_data)) 54 | else: 55 | response = requests.get(image_url) 56 | img = Image.open(BytesIO(response.content)) 57 | parts.append(img) 58 | 59 | gemini_messages.append({"role": message["role"].replace("assistant", "model"), "parts": parts}) 60 | 61 | gemini_config = { 62 | 'temperature': openai_request.get("temperature", 0.4), 63 | 'max_output_tokens': openai_request.get("max_tokens", 8192), 64 | 'top_p': openai_request.get("top_p", 1.0), 65 | "stop_sequences": openai_request.get("stop", []) 66 | } 67 | 68 | gemini_request = { 69 | "messages": gemini_messages, 70 | "config": gemini_config 71 | } 72 | 73 | return gemini_request 74 | -------------------------------------------------------------------------------- /src/generator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import pandas as pd 4 | import argparse 5 | import os 6 | from tqdm import trange 7 | from utils import plot_series, vector_to_interval, vector_to_point, vector_to_id, plot_rectangle_stack_series 8 | import pickle 9 | import math 10 | from scipy.signal import square, sawtooth 11 | from scipy.io import arff 12 | 13 | def triangle_wave(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 14 | timestamp = np.arange(length) 15 | value = 2 * np.abs((timestamp * freq) % 1 - 0.5) - 1 16 | if noise_amp != 0: 17 | noise = np.random.normal(0, 1, length) 18 | value = value + noise_amp * noise 19 | value = coef * value + offset 20 | return value 21 | 22 | def square_wave(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 23 | timestamp = np.arange(length) 24 | value = square(2 * np.pi * freq * timestamp) 25 | if noise_amp != 0: 26 | noise = np.random.normal(0, 1, length) 27 | value = value + noise_amp * noise 28 | value = coef * value + offset 29 | return value 30 | 31 | def sawtooth_wave(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 32 | timestamp = np.arange(length) 33 | value = sawtooth(2 * np.pi * freq * timestamp) 34 | if noise_amp != 0: 35 | noise = np.random.normal(0, 1, length) 36 | value = value + noise_amp * noise 37 | value = coef * value + offset 38 | return value 39 | 40 | def random_walk(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 41 | steps = np.random.normal(0, noise_amp, length) 42 | value = np.cumsum(steps) 43 | value = coef * value + offset 44 | return value 45 | 46 | def sine(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 47 | timestamp = np.arange(length) 48 | value = np.sin(2 * np.pi * freq * timestamp) 49 | if noise_amp != 0: 50 | noise = np.random.normal(0, 1, length) 51 | value = value + noise_amp * noise 52 | value = coef * value + offset 53 | return value 54 | 55 | def cosine(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 56 | timestamp = np.arange(length) 57 | value = np.cos(2 * np.pi * freq * timestamp) 58 | if noise_amp != 0: 59 | noise = np.random.normal(0, 1, length) 60 | value = value + noise_amp * noise 61 | value = coef * value + offset 62 | return value 63 | 64 | class MultivariateDataGenerator: 65 | def __init__(self, data_dir, dim, drop_ratio=0): 66 | STREAM_LENGTH = 200 67 | 68 | 
self.stream_length = STREAM_LENGTH 69 | self.behavior = [sine, cosine] 70 | self.ano_behavior = { 71 | 'triangle': triangle_wave, 72 | 'square': square_wave, 73 | 'sawtooth': sawtooth_wave, 74 | 'random_walk': random_walk 75 | } 76 | self.dim = dim 77 | 78 | self.data = None 79 | self.label = None 80 | self.data_origin = None 81 | 82 | self.drop_ratio = drop_ratio 83 | self.data_dir = data_dir 84 | self.series = [] 85 | self.anom = [] 86 | 87 | def generate(self, num_ts, category, anomaly_type, train_eval, tsname): 88 | for i in trange(num_ts): 89 | self.generate_base_timeseries(category, tsname) 90 | self.variate_outliers(anomaly_type) 91 | 92 | if scenario == 'irr_multivariate': 93 | for dim_id in range(self.dim): 94 | self.data[:, dim_id], self.label[:, dim_id], _ = drop(self.data[:, dim_id], self.label[:, dim_id], self.drop_ratio) 95 | 96 | anom = vector_to_id(self.label) 97 | 98 | self.series.append(self.data) 99 | self.anom.append(anom) 100 | 101 | fig = plot_rectangle_stack_series( 102 | series=self.data, 103 | single_series_figsize=(10, 10), 104 | gt_anomaly=anom, 105 | train_eval = train_eval 106 | ) 107 | 108 | fig_dir = os.path.join(self.data_dir, 'fig') 109 | os.makedirs(fig_dir, exist_ok=True) 110 | 111 | fig_path = os.path.join(fig_dir, f'{i + 1:03d}.png') 112 | fig.savefig(fig_path) 113 | plt.close() 114 | 115 | self.save() 116 | 117 | def save(self): 118 | data_dict = { 119 | 'series': self.series, 120 | 'anom': self.anom 121 | } 122 | with open(os.path.join(self.data_dir, 'data.pkl'), 'wb') as f: 123 | pickle.dump(data_dict, f) 124 | 125 | def generate_random_config(self): 126 | # Generate random parameters for time series using np.random 127 | return { 128 | 'freq': np.random.uniform(0.03, 0.05), # Frequency between 0.01 and 0.1 129 | 'coef': np.random.uniform(0.5, 2.0), # Coefficient between 0.5 and 2.0 130 | 'offset': np.random.uniform(-1.0, 1.0), # Offset between -1.0 and 1.0 131 | 'noise_amp': np.random.uniform(0.05, 0.20), # Noise amplitude between 0.0 and 0.1 132 | 'length': self.stream_length # Length of the time series 133 | } 134 | 135 | def generate_base_timeseries(self, category, basedata_dir=None): 136 | if category == 'synthetic': 137 | self.data = np.zeros((self.stream_length, self.dim)) 138 | for i in range(self.dim): 139 | behavior = np.random.choice(self.behavior) 140 | config = self.generate_random_config() 141 | uni_data = behavior(**config) 142 | self.data[:, i] = uni_data 143 | self.data_origin = self.data.copy() 144 | self.label = np.zeros((self.stream_length, self.dim), dtype=float) 145 | elif category == 'semi': 146 | basedata_dir = f'Multivariate_arff/{tsname}/{tsname}_TEST.arff' 147 | raw_data = arff.loadarff(basedata_dir) 148 | df = pd.DataFrame(raw_data[0]) 149 | self.data = np.array([list(item) for item in df.iloc[0,0]]).transpose() 150 | self.stream_length, dim = self.data.shape 151 | if self.dim > dim: 152 | extra_dims = self.dim - dim 153 | repeat_indices = np.random.choice(dim, extra_dims, replace=True) 154 | self.data = np.hstack((self.data, self.data[:, repeat_indices])) 155 | elif self.dim < dim: 156 | selected_indices = np.random.choice(dim, self.dim, replace=False) 157 | self.data = self.data[:, selected_indices] 158 | self.data_origin = self.data.copy() 159 | self.label = np.zeros((self.stream_length, self.dim), dtype=float) 160 | 161 | def variate_outliers(self, anomaly_type): 162 | min_ano, max_ano = 1, math.floor(math.sqrt(self.dim)) - 1 163 | num_anomalies = np.random.randint(min_ano, max_ano + 1) 164 | anomaly_indices = 
np.random.choice(self.dim, num_anomalies, replace=False) 165 | 166 | for idx in anomaly_indices: 167 | ano_behavior = self.ano_behavior[anomaly_type] 168 | config = self.generate_random_config() 169 | anomaly_data = ano_behavior(**config) 170 | self.data[:, idx] = anomaly_data 171 | self.label[:, idx] = 1 # Mark this variate as an anomaly 172 | 173 | 174 | def drop(data, label, drop_ratio): 175 | if not 0 <= drop_ratio <= 1: 176 | raise ValueError("drop_ratio must be between 0 and 1.") 177 | 178 | seq_len = len(data) 179 | num_drops = int(seq_len * drop_ratio) 180 | 181 | # Generate random indices to drop 182 | drop_index = np.random.choice(seq_len, size=num_drops, replace=False) 183 | 184 | data = data.astype(float) # Ensure float type to allow np.nan 185 | label = label.astype(float) # Ensure float type to allow np.nan 186 | 187 | data[drop_index] = np.nan 188 | label[drop_index] = np.nan 189 | 190 | return data, label, drop_index 191 | 192 | def square_sine(level=5, length=500, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05): 193 | value = np.zeros(length) 194 | for i in range(level): 195 | value += 1 / (2 * i + 1) * sine(length=length, freq=freq * (2 * i + 1), coef=coef, offset=offset, noise_amp=noise_amp) 196 | return value 197 | 198 | def collective_global_synthetic(length, base, coef=1.5, noise_amp=0.005): 199 | value = [] 200 | norm = np.linalg.norm(base) 201 | base = base / norm 202 | num = int(length / len(base)) 203 | for i in range(num): 204 | value.extend(base) 205 | residual = length - len(value) 206 | value.extend(base[:residual]) 207 | value = np.array(value) 208 | noise = np.random.normal(0, 1, length) 209 | value = coef * value + noise_amp * noise 210 | return value 211 | 212 | # The code is adapted from https://github.com/datamllab/tods/tree/benchmark 213 | class UnivariateDataGenerator: 214 | def __init__(self, data_dir, drop_ratio=0): 215 | BEHAVIOR = sine 216 | BEHAVIOR_CONFIG = {'freq': 0.04, 'coef': 1.5, "offset": 0.0, 'noise_amp': 0.05} 217 | STREAM_LENGTH = 400 218 | 219 | self.behavior = BEHAVIOR 220 | self.behavior_config = BEHAVIOR_CONFIG 221 | self.stream_length = STREAM_LENGTH 222 | 223 | self.data = None 224 | self.label = None 225 | self.data_origin = None 226 | 227 | self.drop_ratio = drop_ratio 228 | self.data_dir = data_dir 229 | self.series = [] 230 | self.anom = [] 231 | self.drop_index = [] 232 | 233 | def generate(self, num_ts, category, anomaly_type, train_eval, tsname): 234 | for i in trange(num_ts): 235 | self.generate_base_timeseries(category, tsname) 236 | 237 | if anomaly_type == 'global': 238 | self.point_global_outliers(ratio=0.05, factor=3.5, radius=5) 239 | elif anomaly_type == 'contextual': 240 | self.point_contextual_outliers(ratio=0.05, factor=2.5, radius=5) 241 | elif anomaly_type == 'seasonal': 242 | self.collective_seasonal_outliers(ratio=0.05, factor=3, radius=5) 243 | elif anomaly_type == 'trend': 244 | self.collective_trend_outliers(ratio=0.05, factor=0.5, radius=5) 245 | elif anomaly_type == 'shapelet': 246 | self.collective_global_outliers(ratio=0.05, radius=5, option='square', coef=1.5, noise_amp=0.03, level=20, freq=0.04, offset=0.0) 247 | 248 | if scenario == 'irr_univariate': 249 | self.data, self.label, drop_index = drop(self.data, self.label, self.drop_ratio) 250 | 251 | if anomaly_type in ['global', 'contextual']: 252 | anom = vector_to_point(self.label) 253 | else: 254 | anom = vector_to_interval(self.label) 255 | 256 | self.series.append(self.data) 257 | self.anom.append(anom) 258 | 259 | if scenario == 
'irr_univariate': 260 | self.drop_index.append(drop_index) 261 | 262 | fig = plot_series( 263 | series=self.data, 264 | single_series_figsize=(10, 1.5), 265 | gt_anomaly=anom, 266 | # gt_ylim = (np.nanmin(self.data)*1.1, np.nanmax(self.data)*1.1), 267 | train_eval = train_eval 268 | ) 269 | 270 | fig_dir = os.path.join(self.data_dir, 'fig') 271 | os.makedirs(fig_dir, exist_ok=True) 272 | 273 | fig_path = os.path.join(fig_dir, f'{i + 1:03d}.png') 274 | fig.savefig(fig_path) 275 | plt.close() 276 | 277 | self.save() 278 | 279 | def save(self): 280 | data_dict = { 281 | 'series': self.series, 282 | 'anom': self.anom 283 | } 284 | if scenario == 'irr_univariate': 285 | data_dict['drop_index'] = self.drop_index 286 | with open(os.path.join(self.data_dir, 'data.pkl'), 'wb') as f: 287 | pickle.dump(data_dict, f) 288 | 289 | def generate_base_timeseries(self, category, tsname, basedata_dir=None): 290 | if category == 'synthetic': 291 | self.behavior_config['length'] = self.stream_length 292 | self.data = self.behavior(**self.behavior_config) 293 | self.data_origin = self.data.copy() 294 | self.label = np.zeros(self.stream_length, dtype=float) 295 | elif category == 'semi': 296 | basedata_dir = f'Univariate_arff/{tsname}/{tsname}_TEST.arff' 297 | raw_data = arff.loadarff(basedata_dir) 298 | df = pd.DataFrame(raw_data[0]) 299 | self.data = df.iloc[0, :-1].values 300 | self.stream_length = self.data.shape[0] 301 | self.data_origin = self.data.copy() 302 | self.label = np.zeros(self.stream_length, dtype=float) 303 | 304 | def point_global_outliers(self, ratio, factor, radius): 305 | """ 306 | Add point global outliers to original data 307 | Args: 308 | ratio: what ratio outliers will be added 309 | factor: the larger, the outliers are farther from inliers 310 | radius: the radius of collective outliers range 311 | """ 312 | position = (np.random.rand(round(self.stream_length * ratio)) * self.stream_length).astype(int) 313 | maximum, minimum = max(self.data), min(self.data) 314 | for i in position: 315 | local_std = self.data_origin[max(0, i - radius):min(i + radius, self.stream_length)].std() 316 | self.data[i] = self.data_origin[i] * factor * local_std 317 | if 0 <= self.data[i] < maximum: self.data[i] = maximum 318 | if 0 > self.data[i] > minimum: self.data[i] = minimum 319 | self.label[i] = 1 320 | 321 | def point_contextual_outliers(self, ratio, factor, radius): 322 | """ 323 | Add point contextual outliers to original data 324 | Args: 325 | ratio: what ratio outliers will be added 326 | factor: the larger, the outliers are farther from inliers 327 | Notice: point contextual outliers will not exceed the range of [min, max] of original data 328 | radius: the radius of collective outliers range 329 | """ 330 | position = (np.random.rand(round(self.stream_length * ratio)) * self.stream_length).astype(int) 331 | maximum, minimum = max(self.data), min(self.data) 332 | for i in position: 333 | local_std = self.data_origin[max(0, i - radius):min(i + radius, self.stream_length)].std() 334 | self.data[i] = self.data_origin[i] * factor * local_std 335 | if self.data[i] > maximum: self.data[i] = maximum * min(0.95, abs(np.random.normal(0, 0.5))) # previous(0, 1) 336 | if self.data[i] < minimum: self.data[i] = minimum * min(0.95, abs(np.random.normal(0, 0.5))) 337 | 338 | self.label[i] = 1 339 | 340 | def collective_global_outliers(self, ratio, radius, option='square', coef=3., noise_amp=0.0, 341 | level=5, freq=0.04, offset=0.0, # only used when option=='square' 342 | base=[0.,]): # only used when 
option=='other' 343 | """ 344 | Add collective global outliers to original data 345 | Args: 346 | ratio: what ratio outliers will be added 347 | radius: the radius of collective outliers range 348 | option: if 'square': 'level' 'freq' and 'offset' are used to generate square sine wave 349 | if 'other': 'base' is used to generate outlier shape 350 | level: how many sine waves will square_wave synthesis 351 | base: a list of values that we want to substitute inliers when we generate outliers 352 | """ 353 | base = [1.4529900e-01, 1.2820500e-01, 9.4017000e-02, 7.6923000e-02, 1.1111100e-01, 1.4529900e-01, 1.7948700e-01, 2.1367500e-01, 2.1367500e-01] 354 | position = (np.random.rand(round(self.stream_length * ratio / (2 * radius))) * self.stream_length).astype(int) 355 | 356 | valid_option = {'square', 'other'} 357 | if option not in valid_option: 358 | raise ValueError("'option' must be one of %r." % valid_option) 359 | 360 | if option == 'square': 361 | sub_data = square_sine(level=level, length=self.stream_length, freq=freq, 362 | coef=coef, offset=offset, noise_amp=noise_amp) 363 | else: 364 | sub_data = collective_global_synthetic(length=self.stream_length, base=base, 365 | coef=coef, noise_amp=noise_amp) 366 | for i in position: 367 | start, end = max(0, i - radius), min(self.stream_length, i + radius) 368 | self.data[start:end] = sub_data[start:end] 369 | self.label[start:end] = 1 370 | 371 | def collective_trend_outliers(self, ratio, factor, radius): 372 | """ 373 | Add collective trend outliers to original data 374 | Args: 375 | ratio: what ratio outliers will be added 376 | factor: how dramatic will the trend be 377 | radius: the radius of collective outliers range 378 | """ 379 | position = (np.random.rand(round(self.stream_length * ratio / (2 * radius))) * self.stream_length).astype(int) 380 | for i in position: 381 | start, end = max(0, i - radius), min(self.stream_length, i + radius) 382 | slope = np.random.choice([-1, 1]) * factor * np.arange(end - start) 383 | self.data[start:end] = self.data_origin[start:end] + slope 384 | self.data[end:] = self.data[end:] + slope[-1] 385 | self.label[start:end] = 1 386 | 387 | def collective_seasonal_outliers(self, ratio, factor, radius): 388 | """ 389 | Add collective seasonal outliers to original data 390 | Args: 391 | ratio: what ratio outliers will be added 392 | factor: how many times will frequency multiple 393 | radius: the radius of collective outliers range 394 | """ 395 | position = (np.random.rand(round(self.stream_length * ratio / (2 * radius))) * self.stream_length).astype(int) 396 | seasonal_config = self.behavior_config 397 | seasonal_config['freq'] = factor * self.behavior_config['freq'] 398 | for i in position: 399 | start, end = max(0, i - radius), min(self.stream_length, i + radius) 400 | self.data[start:end] = self.behavior(**seasonal_config)[start:end] 401 | self.label[start:end] = 1 402 | 403 | if __name__ == '__main__': 404 | parser = argparse.ArgumentParser(description="Generate synthetic dataset") 405 | parser.add_argument("--seed", type=int, default=0, help="Random seed for reproducibility") 406 | parser.add_argument("--category", type=str, default='synthetic', choices=['synthetic', 'semi']) 407 | parser.add_argument("--scenario", type=str, default='univariate', choices=['univariate', 'multivariate', 'irr_univariate', 'irr_multivariate']) 408 | parser.add_argument("--tsname", type=str, default=None, help='base real world dataset name') 409 | parser.add_argument("--anomaly_type", type=str, default='global', 
choices=['global', 'contextual', 'seasonal', 'trend', 'shapelet', 410 | 'triangle', 'square', 'sawtooth', 'random_walk', 411 | 'long']) 412 | parser.add_argument("--num_ts", type=int, default=100, help="Numebr of time series") 413 | parser.add_argument("--dim", type=int, default=9, help="Number of variates of multivariate time series") # min_dim=4 414 | parser.add_argument("--drop_ratio", type=float, default=0.00, help="Dropping ratio of irregular time series") 415 | 416 | args = parser.parse_args() 417 | 418 | seed = args.seed 419 | category = args.category 420 | scenario = args.scenario 421 | tsname = args.tsname 422 | anomaly_type = args.anomaly_type 423 | num_ts = args.num_ts 424 | dim = args.dim 425 | drop_ratio = args.drop_ratio 426 | 427 | np.random.seed(seed) 428 | 429 | for train_eval in ['train', 'eval']: 430 | if category == 'synthetic': 431 | if scenario == 'univariate': 432 | data_dir = os.path.join('data', category, scenario, anomaly_type, train_eval) 433 | elif scenario == 'multivariate': 434 | data_dir = os.path.join('data', category, scenario, f'dim_{dim}', anomaly_type, train_eval) 435 | elif scenario.startswith('irr'): 436 | data_dir = os.path.join('data', category, scenario, f'ratio_{int(drop_ratio*100)}', anomaly_type, train_eval) 437 | elif category == 'semi': 438 | if scenario == 'univariate': 439 | data_dir = os.path.join('data', category, scenario, tsname, anomaly_type, train_eval) 440 | elif scenario == 'multivariate': 441 | data_dir = os.path.join('data', category, scenario, tsname, f'dim_{dim}', anomaly_type, train_eval) 442 | elif scenario.startswith('irr'): 443 | data_dir = os.path.join('data', category, scenario, tsname, f'ratio_{int(drop_ratio*100)}', anomaly_type, train_eval) 444 | 445 | print(f'Generating {data_dir} data.') 446 | 447 | if scenario.endswith('univariate'): 448 | univariate_generator = UnivariateDataGenerator(data_dir=data_dir, drop_ratio=drop_ratio) 449 | univariate_generator.generate(num_ts, category, anomaly_type, train_eval, tsname) 450 | elif scenario.endswith('multivariate'): 451 | multivariate_generator = MultivariateDataGenerator(data_dir=data_dir, dim=dim, drop_ratio=drop_ratio) 452 | multivariate_generator.generate(num_ts, category, anomaly_type, train_eval, tsname) 453 | -------------------------------------------------------------------------------- /src/generator.sh: -------------------------------------------------------------------------------- 1 | # !/bin/bash 2 | num_ts=100 3 | 4 | category=synthetic 5 | scenario=univariate 6 | for anomaly_type in global contextual seasonal trend shapelet 7 | do 8 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts 9 | done 10 | 11 | category=synthetic 12 | scenario=irr_univariate 13 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 14 | do 15 | for anomaly_type in global seasonal trend shapelet 16 | do 17 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio 18 | done 19 | done 20 | 21 | category=synthetic 22 | scenario=multivariate 23 | for dim in 4 9 16 25 36 24 | do 25 | for anomaly_type in triangle square sawtooth random_walk 26 | do 27 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts --dim $dim 28 | done 29 | done 30 | 31 | category=synthetic 32 | scenario=irr_multivariate 33 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 34 | do 35 | for anomaly_type in triangle square sawtooth random_walk 36 | 
do 37 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio 38 | done 39 | done 40 | 41 | 42 | 43 | category=semi 44 | scenario=univariate 45 | tsname=Symbols 46 | for anomaly_type in global contextual trend shapelet 47 | do 48 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts 49 | done 50 | 51 | category=semi 52 | scenario=irr_univariate 53 | tsname=Symbols 54 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 55 | do 56 | for anomaly_type in global trend shapelet 57 | do 58 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio 59 | done 60 | done 61 | 62 | category=semi 63 | scenario=multivariate 64 | tsname=ArticularyWordRecognition 65 | for dim in 4 9 16 25 36 66 | do 67 | for anomaly_type in triangle square sawtooth random_walk 68 | do 69 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts --dim $dim 70 | done 71 | done 72 | 73 | category=semi 74 | scenario=irr_multivariate 75 | tsname=ArticularyWordRecognition 76 | dim=9 77 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 78 | do 79 | for anomaly_type in triangle square sawtooth random_walk 80 | do 81 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio --dim $dim 82 | done 83 | done 84 | 85 | 86 | 87 | -------------------------------------------------------------------------------- /src/llava_api.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration 3 | from PIL import Image 4 | import base64 5 | from io import BytesIO 6 | import requests 7 | import copy 8 | 9 | def load_llava(model_name, device): 10 | if '8b' in model_name: 11 | device = 'cuda:7' 12 | elif '72b' in model_name: 13 | device = 'auto' 14 | 15 | # Initialize model and processor 16 | model = LlavaNextForConditionalGeneration.from_pretrained( 17 | f"llava-hf/{model_name}-hf", 18 | torch_dtype=torch.float16, 19 | load_in_4bit=True, 20 | low_cpu_mem_usage=True, 21 | device_map=device 22 | ) 23 | 24 | return model 25 | 26 | def call_llava(model_name, model, llava_request, device): 27 | processor = LlavaNextProcessor.from_pretrained(f"llava-hf/{model_name}-hf") 28 | # Process the conversation 29 | prompt = processor.apply_chat_template(llava_request['messages'], add_generation_prompt=True) 30 | image = llava_request['messages'][0]['content'][1]['image'], 31 | 32 | # Create model inputs 33 | inputs = processor( 34 | images=image, 35 | text=prompt, 36 | return_tensors="pt" 37 | ).to(model.device) 38 | 39 | input_length = inputs.input_ids.shape[1] 40 | 41 | # Generate output 42 | outputs = model.generate( 43 | **inputs, 44 | **llava_request['config'] 45 | ) 46 | 47 | if '8b' in model_name: 48 | new_tokens = outputs[0][input_length+1:] 49 | elif '72b' in model_name: 50 | new_tokens = outputs[0][input_length:] 51 | 52 | # Decode and print result 53 | response = processor.decode(new_tokens, skip_special_tokens=True) 54 | 55 | return response 56 | 57 | 58 | def convert_openai_to_llava(openai_request): 59 | openai_request_copy = copy.deepcopy(openai_request) 60 | 61 | for message in openai_request_copy["messages"]: 62 | for content in 
message["content"]: 63 | if content["type"] == "image_url": 64 | image_url = content["image_url"]["url"] 65 | if image_url.startswith("data:image"): 66 | base64_str = image_url.split(",")[1] 67 | img_data = base64.b64decode(base64_str) 68 | img = Image.open(BytesIO(img_data)) 69 | else: 70 | response = requests.get(image_url) 71 | img = Image.open(BytesIO(response.content)) 72 | 73 | content.update({ 74 | "type": "image", 75 | "image": img 76 | }) 77 | content.pop("image_url") 78 | 79 | llava_messages = openai_request_copy["messages"] 80 | 81 | llava_config = { 82 | 'temperature': openai_request.get("temperature", 0.4), 83 | 'max_new_tokens': openai_request.get("max_tokens", 8192), 84 | 'top_p': openai_request.get("top_p", 1.0), 85 | 'do_sample': True 86 | } 87 | 88 | llava_request = { 89 | "messages": llava_messages, 90 | "config": llava_config 91 | } 92 | 93 | return llava_request 94 | 95 | -------------------------------------------------------------------------------- /src/main.py: -------------------------------------------------------------------------------- 1 | from openai_api import load_gpt, call_gpt 2 | from gemini_api import load_gemini, call_gemini, convert_openai_to_gemini 3 | from llava_api import load_llava, call_llava, convert_openai_to_llava 4 | from qwen_api import load_qwen, call_qwen, convert_openai_to_qwen 5 | from config import create_api_configs 6 | from utils import process_request 7 | import argparse 8 | from dataloader import TSIDataset 9 | 10 | def load_mllm(model_name, device): 11 | if model_name.startswith('gpt'): 12 | return load_gpt(model_name) 13 | elif model_name.startswith('gemini'): 14 | return load_gemini(model_name) 15 | elif 'llava' in model_name: 16 | return load_llava(model_name, device) 17 | elif model_name.startswith('Qwen'): 18 | return load_qwen(model_name, device) 19 | 20 | def call_mllm(model_name, model, request, device): 21 | if model_name.startswith('gpt'): 22 | response = call_gpt(model_name, model, request) 23 | elif model_name.startswith('gemini'): 24 | response = call_gemini(model_name, model, convert_openai_to_gemini(request)) 25 | elif 'llava' in model_name: 26 | response = call_llava(model_name, model, convert_openai_to_llava(request), device) 27 | elif model_name.startswith('Qwen'): 28 | response = call_qwen(model_name, model, convert_openai_to_qwen(request), device) 29 | 30 | return response 31 | 32 | # The code is adapted from https://github.com/rose-stl-lab/anomllm 33 | def AD_with_retries( 34 | model_name: str, 35 | category: str, 36 | scenario: str, 37 | tsname: str, 38 | data_name: str, 39 | request_func: callable, 40 | variant: str = "standard", 41 | num_retries: int = 4, 42 | dim: int = 9, 43 | drop_ratio: float = 0.00, 44 | device: str = 'cuda:7' 45 | ): 46 | import json 47 | import time 48 | import pickle 49 | import os 50 | from loguru import logger 51 | 52 | results = {} 53 | 54 | if category == 'synthetic': 55 | if scenario == 'univariate': 56 | log_fn = f"logs/{category}/{scenario}/{data_name}/{model_name}/" + variant + ".log" 57 | logger.add(log_fn, format="{time} {level} {message}", level="INFO") 58 | results_dir = f'results/{category}/{scenario}/{data_name}/{model_name}' 59 | data_dir = f'data/{category}/{scenario}/{data_name}/eval' 60 | train_dir = f'data/{category}/{scenario}/{data_name}/train' 61 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl') 62 | elif scenario == 'multivariate': 63 | log_fn = f"logs/{category}/{scenario}/dim_{dim}/{data_name}/{model_name}/" + variant + ".log" 64 | logger.add(log_fn, 
format="{time} {level} {message}", level="INFO") 65 | results_dir = f'results/{category}/{scenario}/dim_{dim}/{data_name}/{model_name}' 66 | data_dir = f'data/{category}/{scenario}/dim_{dim}/{data_name}/eval' 67 | train_dir = f'data/{category}/{scenario}/dim_{dim}/{data_name}/train' 68 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl') 69 | elif scenario.startswith('irr'): 70 | log_fn = f"logs/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}/" + variant + ".log" 71 | logger.add(log_fn, format="{time} {level} {message}", level="INFO") 72 | results_dir = f'results/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}' 73 | data_dir = f'data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/eval' 74 | train_dir = f'data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/train' 75 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl') 76 | elif category == 'semi': 77 | if scenario == 'univariate': 78 | log_fn = f"logs/{category}/{scenario}/{tsname}/{data_name}/{model_name}/" + variant + ".log" 79 | logger.add(log_fn, format="{time} {level} {message}", level="INFO") 80 | results_dir = f'results/{category}/{scenario}/{tsname}/{data_name}/{model_name}' 81 | data_dir = f'data/{category}/{scenario}/{tsname}/{data_name}/eval' 82 | train_dir = f'data/{category}/{scenario}/{tsname}/{data_name}/train' 83 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl') 84 | elif scenario == 'multivariate': 85 | log_fn = f"logs/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/{model_name}/" + variant + ".log" 86 | logger.add(log_fn, format="{time} {level} {message}", level="INFO") 87 | results_dir = f'results/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/{model_name}' 88 | data_dir = f'data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/eval' 89 | train_dir = f'data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/train' 90 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl') 91 | elif scenario.startswith('irr'): 92 | log_fn = f"logs/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}/" + variant + ".log" 93 | logger.add(log_fn, format="{time} {level} {message}", level="INFO") 94 | results_dir = f'results/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}' 95 | data_dir = f'data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/eval' 96 | train_dir = f'data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/train' 97 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl') 98 | 99 | os.makedirs(results_dir, exist_ok=True) 100 | 101 | eval_dataset = TSIDataset(data_dir) 102 | train_dataset = TSIDataset(train_dir) 103 | 104 | # Load existing results if jsonl file exists 105 | if os.path.exists(jsonl_fn): 106 | with open(jsonl_fn, 'r') as f: 107 | for line in f: 108 | entry = json.loads(line.strip()) 109 | results[entry['custom_id']] = entry["response"] 110 | 111 | # Loop over image files 112 | model = load_mllm(model_name, device) 113 | for i in range(1, len(eval_dataset) + 1): 114 | if category == 'synthetic': 115 | if scenario == 'univariate': 116 | custom_id = f"{category}_{scenario}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}" 117 | elif scenario == 'multivariate': 118 | custom_id = f"{category}_{scenario}_dim_{dim}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}" 119 | elif scenario.startswith('irr'): 120 | custom_id = 
f"{category}_{scenario}_ratio_{int(drop_ratio*100)}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}" 121 | elif category == 'semi': 122 | if scenario == 'univariate': 123 | custom_id = f"{category}_{scenario}_{tsname}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}" 124 | elif scenario == 'multivariate': 125 | custom_id = f"{category}_{scenario}_{tsname}_dim_{dim}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}" 126 | elif scenario.startswith('irr'): 127 | custom_id = f"{category}_{scenario}_{tsname}_ratio_{int(drop_ratio*100)}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}" 128 | 129 | # Skip already processed files 130 | if custom_id in results: 131 | continue 132 | 133 | # print(custom_id) 134 | # Perform anomaly detection with exponential backoff 135 | for attempt in range(num_retries): 136 | try: 137 | start_time = time.time() 138 | request = request_func( 139 | # eval_dataset.series[i - 1], 140 | train_dataset, 141 | (category, scenario, tsname, dim, drop_ratio, data_name, i) 142 | ) 143 | response = call_mllm(model_name, model, request, device) 144 | end_time = time.time() 145 | elasped_time = f'{end_time - start_time}s' 146 | # Write the result to jsonl 147 | with open(jsonl_fn, 'a') as f: 148 | json.dump({'custom_id': custom_id, 'request': process_request(request), 'response': response, 'time': elasped_time}, f) 149 | f.write('\n') 150 | # If successful, break the retry loop 151 | break 152 | except Exception as e: 153 | if "503" in str(e): # Server not up yet, sleep until the server is up again 154 | while True: 155 | logger.debug("503 error, sleep 30 seconds") 156 | time.sleep(30) 157 | try: 158 | response = call_mllm(model_name, model, request, device) 159 | break 160 | except Exception as e: 161 | if "503" not in str(e): 162 | break 163 | else: 164 | logger.error(e) 165 | # If an exception occurs, wait and then retry 166 | wait_time = 2 ** (attempt + 3) 167 | logger.debug(f"Attempt {attempt + 1} failed. 
Waiting for {wait_time} seconds before retrying...") 168 | time.sleep(wait_time) 169 | continue 170 | else: 171 | logger.error(f"Failed to process {custom_id} after {num_retries} attempts") 172 | 173 | 174 | def parse_arguments(): 175 | parser = argparse.ArgumentParser(description='Process online API anomaly detection.') 176 | parser.add_argument('--variant', type=str, default='0shot-vision', help='Variant type') 177 | parser.add_argument('--model_name', type=str, default='llama3-llava-next-8b', choices=['gpt-4o', 'gpt-4o-mini', 'gemini-1.5-pro', 'gemini-1.5-flash', 178 | 'llama3-llava-next-8b', 'llava-next-72b', 179 | 'Qwen2-VL-7B-Instruct', 'Qwen2-VL-72B-Instruct'], help='Model name') 180 | parser.add_argument("--category", type=str, default='synthetic', choices=['synthetic', 'semi']) 181 | parser.add_argument("--scenario", type=str, default='univariate', choices=['univariate', 'multivariate', 'irr_univariate', 'irr_multivariate']) 182 | parser.add_argument("--tsname", type=str, default=None, choices=['Symbols', 'ArticularyWordRecognition']) 183 | parser.add_argument("--data", type=str, default='global', choices=['global', 'contextual', 'seasonal', 'trend', 'shapelet', 184 | 'triangle', 'square', 'sawtooth', 'random_walk'], help="Synthesized anomaly type2") 185 | parser.add_argument("--drop_ratio", type=float, default=0.00) 186 | parser.add_argument("--dim", type=int, default=9) 187 | parser.add_argument("--device", type=str, default='auto') 188 | 189 | return parser.parse_args() 190 | 191 | def main(): 192 | args = parse_arguments() 193 | api_configs = create_api_configs() 194 | AD_with_retries( 195 | model_name=args.model_name, 196 | category=args.category, 197 | scenario=args.scenario, 198 | tsname=args.tsname, 199 | data_name=args.data, 200 | request_func=api_configs[args.variant], 201 | variant=args.variant, 202 | dim=args.dim, 203 | drop_ratio=args.drop_ratio, 204 | device=args.device 205 | ) 206 | 207 | if __name__ == '__main__': 208 | main() 209 | -------------------------------------------------------------------------------- /src/openai_api.py: -------------------------------------------------------------------------------- 1 | from loguru import logger 2 | from openai import AzureOpenAI 3 | import yaml 4 | 5 | def load_gpt(model_name): 6 | credentials = yaml.safe_load(open("credentials.yml")) 7 | assert model_name in credentials, f"Model {model_name} not found in credentials" 8 | 9 | credential = credentials[model_name] 10 | api_key = credential["api_key"] 11 | api_version = credential["api_version"] 12 | base_url = credential["base_url"] 13 | 14 | model = AzureOpenAI( 15 | api_key=api_key, 16 | api_version=api_version, 17 | base_url=base_url 18 | ) 19 | 20 | return model 21 | 22 | def call_gpt(model_name, model, openai_request): 23 | logger.debug(f"{model_name} is running") 24 | 25 | response = model.chat.completions.create( 26 | model=model_name, **openai_request 27 | ) 28 | return response.choices[0].message.content 29 | 30 | -------------------------------------------------------------------------------- /src/prompt.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy import interpolate 3 | import json 4 | import re 5 | from scipy import stats 6 | 7 | PROMPT_POINT = """Detect points of anomalies in this time series, in terms of the x-axis coordinate. 8 | List one by one in a list. For example, if points x=2, 51, and 106 are anomalies, then output "[2, 51, 106]". 
If there are no anomalies, answer with an empty list []. 9 | """ 10 | 11 | PROMPT = """Detect ranges of anomalies in this time series, in terms of the x-axis coordinate. 12 | List one by one in a list. For example, if ranges (incluing two endpoints) [2, 11], [50, 60], and [105, 118] are anomalies, then output "[[2, 11], [50, 60], [105, 118]]". \ 13 | If there are no anomalies, answer with an empty list []. 14 | """ 15 | 16 | PROMPT_VARIATE = """Detect univariate time series of anomalies in this multivariate time series, in terms of ID of univariate time series. 17 | The image is a multivariate time series including multiple subimages to indicate multiple univariate time series. \ 18 | From left to right and top to bottom, the ID of each subimage increases by 1, starting from 0. 19 | List one by one in a list. For example, if ID=0, 2, and 5 are anomalous univariate time series, then output "[0, 2, 5]". If there are no anomalies, answer with an empty list []. 20 | """ 21 | 22 | def encode_img(fig_path): 23 | import base64 24 | 25 | with open(fig_path, "rb") as image_file: 26 | return base64.b64encode(image_file.read()).decode('utf-8') 27 | 28 | def create_vision_messages( 29 | # time_series, 30 | few_shots=False, 31 | cot=False, 32 | calc=None, 33 | # image_args={}, 34 | data_tuple = None 35 | ): 36 | category, scenario, tsname, dim, drop_ratio, data_name, eval_i = data_tuple 37 | if category == 'synthetic': 38 | if scenario == 'univariate': 39 | fig_path = f'data/{category}/{scenario}/{data_name}/eval/fig/{eval_i:03d}.png' 40 | elif scenario == 'multivariate': 41 | fig_path = f'data/{category}/{scenario}/dim_{dim}/{data_name}/eval/fig/{eval_i:03d}.png' 42 | elif scenario.startswith('irr'): 43 | fig_path = f'data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/eval/fig/{eval_i:03d}.png' 44 | elif category == 'semi': 45 | if scenario == 'univariate': 46 | fig_path = f'data/{category}/{scenario}/{tsname}/{data_name}/eval/fig/{eval_i:03d}.png' 47 | elif scenario == 'multivariate': 48 | fig_path = f'data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/eval/fig/{eval_i:03d}.png' 49 | elif scenario.startswith('irr'): 50 | fig_path = f'data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/eval/fig/{eval_i:03d}.png' 51 | 52 | img = encode_img(fig_path) 53 | 54 | if data_name in ["global", "contextual"]: 55 | prompt = PROMPT_POINT 56 | elif data_name in ["triangle", "square", "sawtooth", "random_walk"]: 57 | prompt = PROMPT_VARIATE 58 | else: 59 | prompt = PROMPT 60 | 61 | messages = [ 62 | { 63 | "role": "user", 64 | "content": [ 65 | { 66 | "type": "text", 67 | "text": prompt 68 | }, 69 | { 70 | "type": "image_url", 71 | "image_url": {"url": f"data:image/png;base64,{img}"} 72 | }, 73 | ], 74 | } 75 | ] 76 | 77 | return messages 78 | 79 | def create_openai_request( 80 | # time_series, 81 | few_shots=False, 82 | vision=False, 83 | temperature=0.4, 84 | stop=["’’’’", " – –", "<|endoftext|>", "<|eot_id|>"], 85 | cot=False, # Chain of Thought 86 | calc=None, # Enforce wrong calculation 87 | series_args={}, # Arguments for time_series_to_str 88 | # image_args={}, # Arguments for time_series_to_image 89 | data_tuple = None 90 | ): 91 | if vision: 92 | messages = create_vision_messages(few_shots, cot, calc, data_tuple) 93 | 94 | return { 95 | "messages": messages, 96 | "temperature": temperature, 97 | "stop": stop 98 | } 99 | -------------------------------------------------------------------------------- /src/qwen_api.py: 
-------------------------------------------------------------------------------- 1 | import torch 2 | from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor 3 | from qwen_vl_utils import process_vision_info 4 | from PIL import Image 5 | import base64 6 | from io import BytesIO 7 | import requests 8 | import copy 9 | 10 | def load_qwen(model_name, device): 11 | if '7B' in model_name: 12 | device = 'cuda:7' 13 | elif '72B' in model_name: 14 | device = 'auto' 15 | 16 | model = Qwen2VLForConditionalGeneration.from_pretrained( 17 | f"Qwen/{model_name}", 18 | torch_dtype=torch.float16, 19 | device_map=device 20 | ) 21 | 22 | return model 23 | 24 | 25 | def call_qwen(model_name, model, qwen_request, device): 26 | processor = AutoProcessor.from_pretrained(f"Qwen/{model_name}") 27 | # The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage. 28 | # min_pixels = 256*28*28 29 | # max_pixels = 1280*28*28 30 | # processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels) 31 | 32 | # Preparation for inference 33 | text = processor.apply_chat_template( 34 | qwen_request['messages'], tokenize=False, add_generation_prompt=True 35 | ) 36 | image_inputs, video_inputs = process_vision_info(qwen_request['messages']) 37 | inputs = processor( 38 | text=[text], 39 | images=image_inputs, 40 | videos=video_inputs, 41 | padding=True, 42 | return_tensors="pt", 43 | ).to(model.device) 44 | 45 | # Inference: Generation of the output 46 | generated_ids = model.generate(**inputs, **qwen_request['config']) 47 | generated_ids_trimmed = [ 48 | out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) 49 | ] 50 | 51 | response = processor.batch_decode( 52 | generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False 53 | ) 54 | 55 | return response[0] 56 | 57 | 58 | def convert_openai_to_qwen(openai_request): 59 | openai_request_copy = copy.deepcopy(openai_request) 60 | 61 | for message in openai_request_copy["messages"]: 62 | for content in message["content"]: 63 | if content["type"] == "image_url": 64 | image_url = content["image_url"]["url"] 65 | if image_url.startswith("data:image"): 66 | base64_str = image_url.split(",")[1] 67 | img_data = base64.b64decode(base64_str) 68 | img = Image.open(BytesIO(img_data)) 69 | else: 70 | response = requests.get(image_url) 71 | img = Image.open(BytesIO(response.content)) 72 | 73 | content.update({ 74 | "type": "image", 75 | "image": img 76 | }) 77 | content.pop("image_url") 78 | 79 | qwen_messages = openai_request_copy["messages"] 80 | 81 | qwen_config = { 82 | 'temperature': openai_request.get("temperature", 0.4), 83 | 'max_new_tokens': openai_request.get("max_tokens", 8192), 84 | 'top_p': openai_request.get("top_p", 1.0), 85 | 'do_sample': True 86 | } 87 | 88 | qwen_request = { 89 | "messages": qwen_messages, 90 | "config": qwen_config 91 | } 92 | 93 | return qwen_request -------------------------------------------------------------------------------- /src/result_agg.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import pandas as pd 4 | from tqdm import trange 5 | from utils import ( 6 | interval_to_vector, 7 | point_to_vector, 8 | id_to_vector 9 | ) 10 | import pickle 11 | import os 12 | from 
dataloader import TSIDataset 13 | from sklearn.metrics import precision_score, recall_score, f1_score 14 | from affiliation.generics import convert_vector_to_events 15 | from affiliation.metrics import pr_from_events 16 | 17 | 18 | def df_to_latex(df): 19 | # Step 1: Process the index to extract only the model name part 20 | df = df.reset_index() # Reset index to bring it as a column 21 | df['index'] = df['index'].str.split(' ').str[0] # Keep only the model name 22 | df.rename(columns={'index': 'model'}, inplace=True) # Rename the index column 23 | 24 | # Step 2: Sort the DataFrame by a custom order for models 25 | order = {"gpt-4o": 0, "gpt-4o-mini": 1, "gemini-1.5-pro": 2, "gemini-1.5-flash": 3, 26 | "llava-next-72b": 4, "llama3-llava-next-8b": 5, 27 | 'Qwen2-VL-72B-Instruct': 6, 'Qwen2-VL-7B-Instruct': 7} 28 | df['priority'] = df['model'].apply(lambda x: order.get(x.lower(), 20)) # Default priority for others is 4 29 | df = df.sort_values(by=['priority', 'model']).drop(columns=['priority']) # Sort and drop priority column 30 | 31 | # Step 3: Truncate numerical values to the first 3 decimal digits 32 | for col in df.select_dtypes(include=['float', 'int']).columns: 33 | df[col] = df[col].apply(lambda x: f"{x*100:.2f}") # Truncate to 4 decimals 34 | 35 | # Step 4: Convert to plain LaTeX table 36 | latex_table = df.to_latex(index=False, column_format="|l" + "r" * (len(df.columns) - 1) + "|") 37 | 38 | return df, latex_table 39 | 40 | def compute_metrics(gt, prediction): 41 | if np.count_nonzero(gt) == 0: 42 | print('ground truth is all zero!!!') 43 | exit() 44 | elif np.count_nonzero(prediction) == 0: 45 | metrics = { 46 | 'precision': 0, 47 | 'recall': 0, 48 | 'f1': 0, 49 | 'affi precision': 0, 50 | 'affi recall': 0, 51 | 'affi f1': 0 52 | } 53 | else: 54 | precision = precision_score(gt, prediction) 55 | recall = recall_score(gt, prediction) 56 | f1 = f1_score(gt, prediction) 57 | 58 | events_pred = convert_vector_to_events(prediction) 59 | events_gt = convert_vector_to_events(gt) 60 | Trange = (0, len(prediction)) 61 | aff = pr_from_events(events_pred, events_gt, Trange) 62 | 63 | # Calculate affiliation F1 64 | if aff['precision'] + aff['recall'] == 0: 65 | affi_f1 = 0 66 | else: 67 | affi_f1 = 2 * (aff['precision'] * aff['recall']) / (aff['precision'] + aff['recall']) 68 | 69 | metrics = { 70 | 'precision': round(precision, 3), 71 | 'recall': round(recall, 3), 72 | 'f1': round(f1, 3), 73 | 'affi precision': round(aff['precision'], 3), 74 | 'affi recall': round(aff['recall'], 3), 75 | 'affi f1': round(affi_f1, 3) 76 | } 77 | return metrics 78 | 79 | def compute_metrics_for_results(eval_dataset, results, scenario, data_name, num_samples=100): 80 | metric_names = [ 81 | "precision", 82 | "recall", 83 | "f1", 84 | "affi precision", 85 | "affi recall", 86 | "affi f1", 87 | ] 88 | results_dict = {key: [[] for _ in metric_names] for key in results.keys()} 89 | 90 | for i in trange(0, num_samples): 91 | anomaly_locations, series = eval_dataset[i][0].numpy(), eval_dataset[i][1].numpy() 92 | if scenario.endswith('univariate'): 93 | len_series = series.shape[0] 94 | elif scenario.endswith('multivariate'): 95 | dim = series.shape[1] 96 | 97 | if data_name in ['global', 'contextual']: 98 | gt = point_to_vector(anomaly_locations, len_vector=len_series) 99 | elif data_name in ['seasonal', 'trend', 'shapelet']: 100 | gt = interval_to_vector(anomaly_locations, start=0, end=len_series) 101 | else: 102 | gt = id_to_vector(anomaly_locations, dim) 103 | 104 | for name, prediction in results.items(): 105 | 
if prediction[i] == None: 106 | continue 107 | 108 | if data_name in ['global', 'contextual']: 109 | pred = point_to_vector(prediction[i], len_vector=len_series) 110 | elif data_name in ['seasonal', 'trend', 'shapelet']: 111 | pred = interval_to_vector(prediction[i], start=0, end=len_series, pred=True) 112 | else: 113 | pred = id_to_vector(prediction[i], dim) 114 | 115 | if scenario == 'irr_univariate': 116 | drop_index = eval_dataset[i][2].numpy().astype(int) 117 | gt_irr = np.delete(gt, drop_index) 118 | pred_irr = np.delete(pred, drop_index) 119 | 120 | metrics = compute_metrics(gt, pred) if scenario != 'irr_univariate' else compute_metrics(gt_irr, pred_irr) 121 | 122 | for idx, metric_name in enumerate(metric_names): 123 | results_dict[name][idx].append(metrics[metric_name]) 124 | 125 | df = pd.DataFrame( 126 | {k: np.mean(v, axis=1) for k, v in results_dict.items()}, 127 | index=["precision", "recall", "f1", "affi precision", "affi recall", "affi f1"], 128 | ) 129 | 130 | return df 131 | 132 | 133 | def load_time_results(result_fn): 134 | import json 135 | 136 | with open(result_fn, 'r') as f: 137 | time_results = [] 138 | for line in f: 139 | info = json.loads(line) 140 | try: 141 | time = float(info['time'][:-1]) 142 | time_results.append(time) 143 | except Exception: 144 | time_results.append(None) 145 | continue 146 | 147 | return time_results 148 | 149 | def parse_output(output: str, data_name: str) -> dict: 150 | """Parse the output of the AD model. 151 | 152 | Args: 153 | output: The output of the AD model. 154 | 155 | Returns: 156 | A dictionary containing the parsed output. 157 | """ 158 | import json 159 | import re 160 | 161 | # handle cases where the max_tokens are reached 162 | if output.count('[') == output.count(']') + 1: 163 | # remove invalid tokens 164 | if output.endswith(',') or output.endswith(' '): 165 | output = output.rstrip(', ') 166 | else: 167 | output = output.rstrip('0123456789').rstrip(', ') 168 | # Add the missing right bracket 169 | output += ']' 170 | 171 | # Trim the output string 172 | trimmed_output = output[output.index('['):output.rindex(']') + 1] 173 | 174 | # check if containing digits 175 | trimmed_output = '[]' if not re.search(r'\d', trimmed_output) else trimmed_output 176 | 177 | # Try to parse the output as JSON 178 | parsed_output = json.loads(trimmed_output) 179 | 180 | # Validate the output: list of dict with keys start and end 181 | if data_name in ['global', 'contextual', 'triangle', 'square', 'sawtooth', 'random_walk']: 182 | for item in parsed_output: 183 | if not isinstance(item, int): 184 | raise ValueError("Parsed output contains non-int items") 185 | else: 186 | for item in parsed_output: 187 | # if not isinstance(item, dict): 188 | # raise ValueError("Parsed output contains non-dict items") 189 | # if 'start' not in item or 'end' not in item: 190 | # raise ValueError("Parsed output dictionaries must contain 'start' and 'end' keys") 191 | if not isinstance(item, list): 192 | raise ValueError("Parsed output contains non-dict items") 193 | 194 | return parsed_output 195 | 196 | def load_results(result_fn, data_name, raw=False, postprocess_func: callable = None): 197 | """ 198 | Load and process results from a result JSON lines file. 199 | 200 | Parameters 201 | ---------- 202 | result_fn : str 203 | The filename of the JSON lines file containing the results. 204 | raw : bool, optional 205 | If True, return raw JSON objects. If False, parse the response 206 | and convert it to a vector. Default is False. 
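data_name : str
    Anomaly type of the dataset (e.g. 'global', 'trend', 'triangle'); forwarded to
    `parse_output`, which uses it to decide whether each parsed item must be an int
    (point- or variate-wise anomalies) or a [start, end]-style list (range-wise anomalies).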
207 | postprocess_func : callable, optional 208 | A function to postprocess the results (e.g., scaling down). Default is None. 209 | 210 | Returns 211 | ------- 212 | list 213 | A list of processed results. Each item is either a raw JSON object 214 | or a vector representation of anomalies, depending on the 215 | `raw` parameter. 216 | 217 | Notes 218 | ----- 219 | The function attempts to parse each line in the file. If parsing fails, 220 | it appends an empty vector to the results. 221 | 222 | Raises 223 | ------ 224 | FileNotFoundError 225 | If the specified file does not exist. 226 | JSONDecodeError 227 | If a line in the file is not valid JSON. 228 | """ 229 | import json 230 | import pandas as pd 231 | 232 | if postprocess_func is None: 233 | postprocess_func = lambda x: x 234 | 235 | with open(result_fn, 'r') as f: 236 | results = [] 237 | for line in f: 238 | info = json.loads(line) 239 | if raw: 240 | results.append(info) 241 | else: 242 | try: 243 | response_parsed = parse_output(postprocess_func(info['response']), data_name) 244 | results.append(response_parsed) 245 | except Exception: 246 | results.append(None) 247 | continue 248 | 249 | return results 250 | 251 | def collect_results(directory, raw=False, ignore=[]): 252 | """ 253 | Collect and process results from JSON lines files in a directory. 254 | 255 | Parameters 256 | ---------- 257 | directory : str 258 | The path to the directory containing the JSON lines files. 259 | raw : bool, optional 260 | If True, return raw JSON objects. If False, parse the responses. 261 | Default is False. 262 | ignore: list[str], optional 263 | Skip folders containing these names. Default is an empty list. 264 | 265 | Returns 266 | ------- 267 | dict 268 | A dictionary where keys are model names with variants, and values 269 | are lists of processed results from each file. 270 | 271 | Notes 272 | ----- 273 | This function walks through the given directory, processing each 274 | `.jsonl` file except those with 'requests' in the filename. It uses 275 | the directory name as the model name and the filename (sans extension) 276 | as the variant. 277 | 278 | Raises 279 | ------ 280 | FileNotFoundError 281 | If the specified directory does not exist. 
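Examples
--------
A minimal sketch, assuming the synthetic univariate layout read by `main` in this
module, i.e. `results/<category>/<scenario>/<data_name>/<model_name>/<variant>.jsonl`:

>>> results = collect_results("results/synthetic/univariate/global")
>>> list(results.keys())   # keys look like 'gpt-4o (0shot-vision)'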
282 | """ 283 | import os 284 | from config import postprocess_configs 285 | 286 | results = {} 287 | config = postprocess_configs() 288 | for root, _, files in os.walk(directory): 289 | for file in files: 290 | skip = False 291 | for ignore_folder in ignore: 292 | if ignore_folder in root: 293 | skip = True 294 | break 295 | if skip: 296 | continue 297 | if 'requests' not in file and file.endswith('.jsonl'): 298 | model_name = os.path.basename(root) 299 | data_name = os.path.basename(os.path.dirname(root)) 300 | # scenario = os.path.basename(os.path.dirname(os.path.dirname(root))) 301 | variant = file.replace('.jsonl', '') 302 | if variant in config: 303 | pf = config[variant] 304 | else: 305 | pf = None 306 | result_fn = os.path.join(root, file) 307 | model_key = f'{model_name} ({variant})' 308 | results[model_key] = load_results(result_fn, data_name, raw=raw, postprocess_func=pf) 309 | 310 | return results 311 | 312 | def load_datasets(category, scenario, tsname, dim, drop_ratio, data_name): 313 | if category == 'synthetic': 314 | if scenario == 'univariate': 315 | data_dir = f"data/{category}/{scenario}/{data_name}/eval" 316 | train_dir = f"data/{category}/{scenario}/{data_name}/train" 317 | elif scenario == 'multivariate': 318 | data_dir = f"data/{category}/{scenario}/dim_{dim}/{data_name}/eval" 319 | train_dir = f"data/{category}/{scenario}/dim_{dim}/{data_name}/train" 320 | elif scenario.startswith('irr'): 321 | data_dir = f"data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/eval" 322 | train_dir = f"data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/train" 323 | elif category == 'semi': 324 | if scenario == 'univariate': 325 | data_dir = f"data/{category}/{scenario}/{tsname}/{data_name}/eval" 326 | train_dir = f"data/{category}/{scenario}/{tsname}/{data_name}/train" 327 | elif scenario == 'multivariate': 328 | data_dir = f"data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/eval" 329 | train_dir = f"data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/train" 330 | elif scenario.startswith('irr'): 331 | data_dir = f"data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/eval" 332 | train_dir = f"data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/train" 333 | eval_dataset = TSIDataset(data_dir) 334 | train_dataset = TSIDataset(train_dir) 335 | return eval_dataset, train_dataset 336 | 337 | 338 | def main(args): 339 | category = args.category 340 | scenario = args.scenario 341 | tsname = args.tsname 342 | data_name = args.data_name 343 | drop_ratio = args.drop_ratio 344 | dim = args.dim 345 | label_name = args.label_name 346 | table_caption = args.table_caption 347 | 348 | eval_dataset, train_dataset = load_datasets(category, scenario, tsname, dim, drop_ratio, data_name) 349 | if category == 'synthetic': 350 | if scenario == 'univariate': 351 | directory = f"results/{category}/{scenario}/{data_name}" 352 | elif scenario == 'multivariate': 353 | directory = f"results/{category}/{scenario}/dim_{dim}/{data_name}" 354 | elif scenario.startswith('irr'): 355 | directory = f"results/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}" 356 | elif category == 'semi': 357 | if scenario == 'univariate': 358 | directory = f'results/{category}/{scenario}/{tsname}/{data_name}' 359 | elif scenario == 'multivariate': 360 | directory = f"results/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}" 361 | elif scenario.startswith('irr'): 362 | directory = 
f'results/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}' 363 | results = collect_results(directory, ignore=[]) 364 | 365 | df = compute_metrics_for_results(eval_dataset, results, scenario, data_name, num_samples=len(eval_dataset)) 366 | df = df.T 367 | # print(df) 368 | df, latex_table = df_to_latex(df.copy()) 369 | print(df) 370 | print(latex_table) 371 | 372 | if scenario.endswith('univariate'): 373 | df_selected = df[['model', 'affi precision', 'affi recall', 'affi f1']].rename(columns={'affi precision': 'precision', 'affi recall': 'recall', 'affi f1': 'f1'})\ 374 | .set_index('model') 375 | elif scenario.endswith('multivariate'): 376 | df_selected = df[['model', 'precision', 'recall', 'f1']].set_index('model') 377 | 378 | # Attempt to drop the index, catch exception if it doesn't exist 379 | try: 380 | df_selected = df_selected.drop(index='gemini-1.5-flash-8b') 381 | except KeyError: 382 | pass # If index does not exist, do nothing and proceed 383 | 384 | with open(f"{directory}/df.pkl", "wb") as f: 385 | pickle.dump(df_selected, f) 386 | 387 | if __name__ == "__main__": 388 | parser = argparse.ArgumentParser(description="Process time series data and generate LaTeX table.") 389 | parser.add_argument("--category", type=str, default='synthetic', choices=['synthetic', 'semi', 'real']) 390 | parser.add_argument("--scenario", type=str, default='univariate', choices=['univariate', 'multivariate', 'irr_univariate', 'irr_multivariate', 'long']) 391 | parser.add_argument("--tsname", type=str, default=None, choices=['Symbols', 'ArticularyWordRecognition']) 392 | parser.add_argument("--data_name", type=str, default='global', choices=['global', 'contextual', 'seasonal', 'trend', 'shapelet', 393 | 'triangle', 'square', 'sawtooth', 'random_walk', 394 | 'long'], help="Synthesized anomaly type2") 395 | parser.add_argument("--drop_ratio", type=float, default=0.00) 396 | parser.add_argument("--dim", type=int, default=9) 397 | parser.add_argument("--label_name", type=str, default='trend-exp', help="Name of the experiment") 398 | parser.add_argument("--table_caption", type=str, default='Trend anomalies in shifting sine wave', help="Caption for the LaTeX table") 399 | args = parser.parse_args() 400 | main(args) 401 | -------------------------------------------------------------------------------- /src/test.sh: -------------------------------------------------------------------------------- 1 | # !/bin/bash 2 | device=auto 3 | 4 | category=synthetic 5 | scenario=univariate 6 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 7 | do 8 | for data in global contextual seasonal trend shapelet 9 | do 10 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --device $device 11 | done 12 | done 13 | 14 | category=synthetic 15 | scenario=multivariate 16 | for dim in 4 9 16 25 36 17 | do 18 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 19 | do 20 | for data in triangle square sawtooth random_walk 21 | do 22 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --dim $dim --device $device 23 | done 24 | done 25 | done 26 | 27 | category=synthetic 28 | scenario=irr_univariate 29 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 30 | do 31 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o 
llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 32 | do 33 | for data in global seasonal trend shapelet 34 | do 35 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --drop_ratio $drop_ratio --device $device 36 | done 37 | done 38 | done 39 | 40 | category=synthetic 41 | scenario=irr_multivariate 42 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 43 | do 44 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 45 | do 46 | for data in triangle square sawtooth random_walk 47 | do 48 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --drop_ratio $drop_ratio --device $device 49 | done 50 | done 51 | done 52 | 53 | 54 | 55 | category=semi 56 | scenario=univariate 57 | tsname=Symbols 58 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 59 | do 60 | for data in global contextual trend shapelet 61 | do 62 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --device $device 63 | done 64 | done 65 | 66 | category=semi 67 | scenario=irr_univariate 68 | tsname=Symbols 69 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 70 | do 71 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 72 | do 73 | for data in global trend shapelet 74 | do 75 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --drop_ratio $drop_ratio --device $device 76 | done 77 | done 78 | done 79 | 80 | category=semi 81 | scenario=multivariate 82 | tsname=ArticularyWordRecognition 83 | for dim in 4 9 16 25 36 84 | do 85 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 86 | do 87 | for data in triangle square sawtooth random_walk 88 | do 89 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --dim $dim --device $device 90 | done 91 | done 92 | done 93 | 94 | category=semi 95 | scenario=irr_multivariate 96 | tsname=ArticularyWordRecognition 97 | dim=9 98 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25 99 | do 100 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct 101 | do 102 | for data in triangle square sawtooth random_walk 103 | do 104 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --drop_ratio $drop_ratio --dim $dim --device $device 105 | done 106 | done 107 | done 108 | -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | from matplotlib import pyplot as plt 2 | import numpy as np 3 | import pandas as pd 4 | from typing import Optional 5 | import math 6 | 7 | def process_request(request): 8 | request['messages'][0]['content'][1]['image_url'] = 'ignore' 9 | 10 | return request 11 | 12 | def id_to_vector(ids, len_dim=9): 13 | # not perfect inversion of function vector_to_id 14 | anomalies = np.zeros(len_dim) 15 | for id in ids: 16 | try: 17 | anomalies[int(id)] = 
1 18 | except Exception: 19 | continue 20 | 21 | return anomalies 22 | 23 | def vector_to_id(multi_vector): 24 | ids = [] 25 | 26 | for i in range(multi_vector.shape[1]): 27 | vector = multi_vector[:, i] 28 | # Ignore NaN values and check if all remaining elements are 1 29 | if np.all(vector[np.isnan(vector) == False] == 1): 30 | ids.append(i) 31 | 32 | return ids 33 | 34 | def point_to_vector(points, len_vector=400): 35 | anomalies = np.zeros(len_vector) 36 | for point in points: 37 | try: 38 | anomalies[int(point)] = 1 39 | except Exception: 40 | continue 41 | 42 | return anomalies 43 | 44 | def vector_to_point(vector): 45 | points = [i for i, x in enumerate(vector) if x == 1] 46 | 47 | return points 48 | 49 | def interval_to_vector(interval, start=0, end=400, pred=False): 50 | anomalies = np.zeros(end - start) 51 | for entry in interval: 52 | if len(entry) !=2 : 53 | continue 54 | try: 55 | entry = {'start': int(entry[0]), 'end': int(entry[1])} 56 | entry['end'] = entry['end'] + 1 if pred else entry['end'] 57 | entry['start'] = np.clip(entry['start'], start, end) 58 | entry['end'] = np.clip(entry['end'], entry['start'], end) 59 | anomalies[entry['start']:entry['end']] = 1 60 | except (ValueError, IndexError, TypeError) as e: 61 | continue # Skip the current entry and move to the next 62 | 63 | return anomalies 64 | 65 | def vector_to_interval(vector): 66 | intervals = [] 67 | in_interval = False 68 | start = 0 69 | for i, value in enumerate(vector): 70 | if value == 1 and not in_interval: 71 | start = i 72 | in_interval = True 73 | elif value == 0 and in_interval: 74 | intervals.append((start, i)) 75 | in_interval = False 76 | if in_interval: 77 | intervals.append((start, len(vector))) 78 | 79 | return intervals 80 | 81 | def nearest_square_root(n): 82 | lower_sqrt = math.floor(math.sqrt(n)) 83 | upper_sqrt = math.ceil(math.sqrt(n)) 84 | 85 | lower_square = lower_sqrt ** 2 86 | upper_square = upper_sqrt ** 2 87 | 88 | return lower_sqrt if abs(lower_square - n) <= abs(upper_square - n) else upper_sqrt 89 | 90 | def create_color_generator(exclude_color='blue'): 91 | # Get the default color list 92 | default_colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][1:] 93 | # Filter out the excluded color 94 | filtered_colors = [color for color in default_colors if color != exclude_color] 95 | # Create a generator that yields colors in order 96 | return (color for color in filtered_colors) 97 | 98 | def plot_rectangle_stack_series( 99 | series, 100 | gt_anomaly, 101 | single_series_figsize: tuple[int, int] = (10, 10), 102 | gt_color: str = 'steelblue', 103 | train_eval: str = 'train' 104 | ) -> None: 105 | stream_length, dim = series.shape 106 | 107 | # Calculate the optimal number of rows and columns for a rectangular layout 108 | rows = int(math.sqrt(dim)) 109 | cols = math.ceil(dim / rows) 110 | 111 | fig, axes = plt.subplots(rows, cols, figsize=(single_series_figsize[0] * cols / rows, single_series_figsize[1])) 112 | fig.subplots_adjust(hspace=0, wspace=0) 113 | 114 | # Plot each univariate time series in its subplot 115 | for idx in range(rows * cols): 116 | row, col = divmod(idx, cols) 117 | ax = axes[row, col] 118 | 119 | if idx < dim: 120 | ax.plot(series[:, idx], color=gt_color) 121 | ax.set_xticks([]) 122 | ax.set_yticks([]) 123 | else: 124 | # Turn off unused subplots 125 | ax.axis('off') 126 | 127 | if train_eval == 'train' and gt_anomaly is not None: 128 | if isinstance(gt_anomaly[0], int) and idx in gt_anomaly: 129 | ax.lines[-1].set_color('red') 130 | 131 | 
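# Finalize spacing and hand the assembled grid back as a Figure; nothing is written
# to disk here, the caller is expected to save or encode the image.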
plt.tight_layout() 132 | return plt.gcf() 133 | 134 | def plot_series( 135 | series, 136 | gt_anomaly, 137 | single_series_figsize: tuple[int, int] = (10, 1.5), 138 | # gt_ylim: tuple[int, int] = (-1, 1), 139 | gt_color: str = 'steelblue', 140 | train_eval: str = 'train' 141 | ) -> None: 142 | plt.figure(figsize=single_series_figsize) 143 | 144 | # plt.ylim(gt_ylim) 145 | plt.plot(series, color=gt_color) 146 | 147 | if train_eval == 'train': 148 | if gt_anomaly is not None: 149 | if isinstance(gt_anomaly[0], tuple): 150 | for start, end in gt_anomaly: 151 | plt.axvspan(start, end-1, alpha=0.2, color=gt_color) 152 | elif isinstance(gt_anomaly[0], int): 153 | for point in gt_anomaly: 154 | plt.axvline(x=point, color=gt_color, alpha=0.5, linestyle='--') 155 | 156 | plt.tight_layout() 157 | return plt.gcf() 158 | 159 | def view_base64_image(base64_string): 160 | import base64 161 | from io import BytesIO 162 | from PIL import Image 163 | import matplotlib.pyplot as plt 164 | 165 | # Decode the base64 string to binary data 166 | image_data = base64.b64decode(base64_string) 167 | 168 | # Convert binary data to an image 169 | image = Image.open(BytesIO(image_data)) 170 | 171 | # Display the image 172 | plt.imshow(image) 173 | plt.axis('off') # Hide axes 174 | plt.show() 175 | -------------------------------------------------------------------------------- /teaser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mllm-ts/VisualTimeAnomaly/3d18d64b99cc13d4276eb2281555362604979f97/teaser.png --------------------------------------------------------------------------------