├── LICENSE
├── README.md
├── src
│   ├── config.py
│   ├── credentials.yml
│   ├── dataloader.py
│   ├── gemini_api.py
│   ├── generator.py
│   ├── generator.sh
│   ├── llava_api.py
│   ├── main.py
│   ├── openai_api.py
│   ├── prompt.py
│   ├── qwen_api.py
│   ├── result_agg.py
│   ├── test.sh
│   └── utils.py
└── teaser.png
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 mllm-ts
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Can Multimodal LLMs Perform Time Series Anomaly Detection?
2 | This repo includes the official code and datasets for the paper ["Can Multimodal LLMs Perform Time Series Anomaly Detection?"](https://arxiv.org/abs/2502.17812).
3 |
4 | ## 🕵️‍♂️ VisualTimeAnomaly
5 |
6 | ![VisualTimeAnomaly overview](teaser.png)
7 |
8 |
9 | Left: the workflow of VisualTimeAnomaly. Right: the performance comparison across various settings.
10 |
11 | ## 🏆 Contributions
12 | - The first comprehensive benchmark for multimodal LLMs (MLLMs) in time series anomaly detection (TSAD), covering diverse scenarios (univariate, multivariate, irregular) and varying anomaly granularities (point-, range-, variate-wise).
13 | - Several critical insights significantly advance the understanding of both MLLMs and TSAD.
14 | - We construct a large-scale dataset including 12.4k time series images, and release the dataset and code to foster future research.
15 |
16 | ## 🔎 Findings
17 | - MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies;
18 | - MLLMs are highly robust to irregular time series, even with 25% of the data missing;
19 | - Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series.
20 |
21 | ## ⚙️ Getting Started
22 | ### Environment
23 | * python 3.10.14
24 | * torch 2.4.1
25 | * numpy 1.26.4
26 | * transformers 4.49.0.dev0
27 | * huggingface-hub 0.24.7
28 | * openai 1.44.0
29 | * google-generativeai 0.8.3
30 |
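A quick way to confirm your environment matches the versions above is a short Python check (package names as listed; adjust if your installation differs):

```python
# Print installed versions of the packages listed above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "numpy", "transformers", "huggingface-hub",
            "openai", "google-generativeai"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```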
31 | ### Dataset
32 | Enter the `src` folder.
33 |
34 | To generate all datasets, run:
35 |
36 | `./generator.sh`
37 |
38 | To generate a specific dataset, run:
39 |
40 | `python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts`
41 |
42 | For example, to generate 100 univariate time series images with global anomalies:
43 |
44 | `python generator.py --category synthetic --scenario univariate --anomaly_type global --num_ts 100`
45 |
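Each dataset is written under `data/` as a `train`/`eval` pair, with rendered images in a `fig/` subfolder and a `data.pkl` holding the raw series and ground-truth anomalies (keys `series` and `anom`; irregular scenarios also store `drop_index`), as saved by `generator.py`. A minimal sketch for inspecting the example above:

```python
# Inspect the pickle written by generator.py for the univariate/global example.
import pickle

with open("data/synthetic/univariate/global/eval/data.pkl", "rb") as f:
    data = pickle.load(f)

print(len(data["series"]))      # number of generated series (e.g., 100)
print(data["series"][0].shape)  # one series; synthetic univariate uses length 400
print(data["anom"][0])          # ground-truth anomaly locations for that series
```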
46 | ### Run
47 | Enter the `src` folder.
48 |
49 | To run MLLMs on all datasets, execute:
50 |
51 | `./test.sh`
52 |
53 | To run an MLLM on a specific dataset, execute:
54 |
55 | `python main.py --category $category --scenario $scenario --model_name $model_name --data $data`
56 |
57 | For example, to run GPT-4o on the univariate scenario with global anomalies:
58 |
59 | `python main.py --category synthetic --scenario univariate --model_name gpt-4o --data global`
60 |
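Model responses are appended to `results/<category>/<scenario>/<data>/<model_name>/<variant>.jsonl` (default variant `0shot-vision`), one JSON object per image with `custom_id`, `request`, `response`, and `time` fields (see `main.py`). A minimal sketch for reading them back with `parse_output` from `result_agg.py`, following the GPT-4o example above:

```python
# Parse the raw model responses into predicted anomaly locations.
import json
from result_agg import parse_output

with open("results/synthetic/univariate/global/gpt-4o/0shot-vision.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        pred = parse_output(entry["response"], data_name="global")
        print(entry["custom_id"], pred)
```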
61 | ## Acknowledgement
62 | We sincerely appreciate the following GitHub repos for the code base and datasets:
63 |
64 | https://github.com/Rose-STL-Lab/AnomLLM
65 |
66 | https://github.com/datamllab/tods/tree/benchmark
67 |
68 | ## 📝 Citation
69 | If you find our work useful, please cite the paper below:
70 | ```bibtex
71 | @article{xu2025can,
72 | title={Can Multimodal LLMs Perform Time Series Anomaly Detection?},
73 | author={Xu, Xiongxiao and Wang, Haoran and Liang, Yueqing and Yu, Philip S and Zhao, Yue and Shu, Kai},
74 | journal={arXiv preprint arXiv:2502.17812},
75 | year={2025}
76 | }
77 | ```
78 |
--------------------------------------------------------------------------------
/src/config.py:
--------------------------------------------------------------------------------
1 | from prompt import create_openai_request
2 |
3 | def create_api_configs():
4 | return {
5 | '0shot-vision': lambda train_dataset, data_tuple: create_openai_request(
6 | vision=True,
7 | few_shots=train_dataset.few_shots(num_shots=0),
8 | data_tuple=data_tuple
9 | )
10 | }
11 |
--------------------------------------------------------------------------------
/src/credentials.yml:
--------------------------------------------------------------------------------
1 | gpt-4o:
2 | api_key:
3 | api_version:
4 | base_url: (ends with /v1)
5 | gemini-1.5:
6 | api_key:
7 |
--------------------------------------------------------------------------------
/src/dataloader.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch.utils.data import Dataset
3 | import pickle
4 | import os
5 | import numpy as np
6 |
7 | # time series image dataset
8 | class TSIDataset(Dataset):
9 |
10 | def __init__(self, data_dir):
11 | self.data_dir = data_dir
12 | self.figs_dir = os.path.join(data_dir, 'figs')
13 | self.series = []
14 | self.anom = []
15 |
16 | # Load data
17 | with open(os.path.join(self.data_dir, 'data.pkl'), 'rb') as f:
18 | data_dict = pickle.load(f)
19 | self.series = data_dict['series']
20 | self.anom = data_dict['anom']
21 | self.scenario = data_dir.split('/')[2]
22 | train_eval = data_dir.split('/')[-1]
23 |
24 | if self.scenario == 'irr_univariate':
25 | self.drop_index = data_dict['drop_index']
26 | if train_eval == 'eval':
27 | print(f"Loaded dataset {data_dir} with {len(self.series)} series.")
28 |
29 | def __len__(self):
30 | return len(self.series)
31 |
32 | def __getitem__(self, idx):
33 | anom = self.anom[idx]
34 | series = self.series[idx]
35 |
36 | # Convert to torch tensors
37 | anom = torch.tensor(anom, dtype=torch.float32)
38 | series = torch.tensor(series, dtype=torch.float32)
39 | if self.scenario == 'irr_univariate':
40 | drop_index = self.drop_index[idx]
41 | drop_index = torch.tensor(drop_index, dtype=torch.float32)
42 | return anom, series, drop_index
43 | else:
44 | return anom, series
45 |
46 | def few_shots(self, num_shots=5, idx=None):
47 | if idx is None:
48 | idx = np.random.choice(len(self.series), num_shots, replace=False)
49 | few_shot_data = []
50 | for i in idx:
51 | anom, series = self.__getitem__(i)
52 | anom = [{"start": int(start.item()), "end": int(end.item())} for start, end in list(anom[0])]
53 | few_shot_data.append((series, anom, i+1))
54 | return few_shot_data
--------------------------------------------------------------------------------
/src/gemini_api.py:
--------------------------------------------------------------------------------
1 | import google.generativeai as genai
2 | from PIL import Image
3 | from loguru import logger
4 | import yaml
5 | import requests
6 | from io import BytesIO
7 | import base64
8 |
9 | SAFETY_SETTINGS = [
10 | {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
11 | {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
12 | {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
13 | {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
14 | ]
15 |
16 | def load_gemini(model_name):
17 | credentials = yaml.safe_load(open("credentials.yml"))
18 | assert model_name in credentials, f"Model {model_name} not found in credentials"
19 |
20 | credential = credentials[model_name]
21 | api_key = credential["api_key"]
22 |
23 | genai.configure(api_key=api_key)
24 | model = genai.GenerativeModel(model_name)
25 |
26 | return model
27 |
28 | def call_gemini(model_name, model, gemini_request):
29 | logger.debug(f"{model_name} is running")
30 |
31 | response = model.generate_content(
32 | contents=gemini_request['messages'],
33 | generation_config=gemini_request['config'],
34 | # **gemini_request,
35 | safety_settings=SAFETY_SETTINGS,
36 | )
37 | return response.text
38 |
39 |
40 | def convert_openai_to_gemini(openai_request):
41 | gemini_messages = []
42 |
43 | for message in openai_request["messages"]:
44 | parts = []
45 | for content in message["content"]:
46 | if content["type"] == "text":
47 | parts.append(content["text"])
48 | elif content["type"] == "image_url":
49 | image_url = content["image_url"]["url"]
50 | if image_url.startswith("data:image"):
51 | base64_str = image_url.split(",")[1]
52 | img_data = base64.b64decode(base64_str)
53 | img = Image.open(BytesIO(img_data))
54 | else:
55 | response = requests.get(image_url)
56 | img = Image.open(BytesIO(response.content))
57 | parts.append(img)
58 |
59 | gemini_messages.append({"role": message["role"].replace("assistant", "model"), "parts": parts})
60 |
61 | gemini_config = {
62 | 'temperature': openai_request.get("temperature", 0.4),
63 | 'max_output_tokens': openai_request.get("max_tokens", 8192),
64 | 'top_p': openai_request.get("top_p", 1.0),
65 | "stop_sequences": openai_request.get("stop", [])
66 | }
67 |
68 | gemini_request = {
69 | "messages": gemini_messages,
70 | "config": gemini_config
71 | }
72 |
73 | return gemini_request
74 |
--------------------------------------------------------------------------------
/src/generator.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | import pandas as pd
4 | import argparse
5 | import os
6 | from tqdm import trange
7 | from utils import plot_series, vector_to_interval, vector_to_point, vector_to_id, plot_rectangle_stack_series
8 | import pickle
9 | import math
10 | from scipy.signal import square, sawtooth
11 | from scipy.io import arff
12 |
13 | def triangle_wave(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
14 | timestamp = np.arange(length)
15 | value = 2 * np.abs((timestamp * freq) % 1 - 0.5) - 1
16 | if noise_amp != 0:
17 | noise = np.random.normal(0, 1, length)
18 | value = value + noise_amp * noise
19 | value = coef * value + offset
20 | return value
21 |
22 | def square_wave(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
23 | timestamp = np.arange(length)
24 | value = square(2 * np.pi * freq * timestamp)
25 | if noise_amp != 0:
26 | noise = np.random.normal(0, 1, length)
27 | value = value + noise_amp * noise
28 | value = coef * value + offset
29 | return value
30 |
31 | def sawtooth_wave(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
32 | timestamp = np.arange(length)
33 | value = sawtooth(2 * np.pi * freq * timestamp)
34 | if noise_amp != 0:
35 | noise = np.random.normal(0, 1, length)
36 | value = value + noise_amp * noise
37 | value = coef * value + offset
38 | return value
39 |
40 | def random_walk(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
41 | steps = np.random.normal(0, noise_amp, length)
42 | value = np.cumsum(steps)
43 | value = coef * value + offset
44 | return value
45 |
46 | def sine(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
47 | timestamp = np.arange(length)
48 | value = np.sin(2 * np.pi * freq * timestamp)
49 | if noise_amp != 0:
50 | noise = np.random.normal(0, 1, length)
51 | value = value + noise_amp * noise
52 | value = coef * value + offset
53 | return value
54 |
55 | def cosine(length, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
56 | timestamp = np.arange(length)
57 | value = np.cos(2 * np.pi * freq * timestamp)
58 | if noise_amp != 0:
59 | noise = np.random.normal(0, 1, length)
60 | value = value + noise_amp * noise
61 | value = coef * value + offset
62 | return value
63 |
64 | class MultivariateDataGenerator:
65 | def __init__(self, data_dir, dim, drop_ratio=0):
66 | STREAM_LENGTH = 200
67 |
68 | self.stream_length = STREAM_LENGTH
69 | self.behavior = [sine, cosine]
70 | self.ano_behavior = {
71 | 'triangle': triangle_wave,
72 | 'square': square_wave,
73 | 'sawtooth': sawtooth_wave,
74 | 'random_walk': random_walk
75 | }
76 | self.dim = dim
77 |
78 | self.data = None
79 | self.label = None
80 | self.data_origin = None
81 |
82 | self.drop_ratio = drop_ratio
83 | self.data_dir = data_dir
84 | self.series = []
85 | self.anom = []
86 |
87 | def generate(self, num_ts, category, anomaly_type, train_eval, tsname):
88 | for i in trange(num_ts):
89 | self.generate_base_timeseries(category, tsname)
90 | self.variate_outliers(anomaly_type)
91 |
92 | if scenario == 'irr_multivariate':
93 | for dim_id in range(self.dim):
94 | self.data[:, dim_id], self.label[:, dim_id], _ = drop(self.data[:, dim_id], self.label[:, dim_id], self.drop_ratio)
95 |
96 | anom = vector_to_id(self.label)
97 |
98 | self.series.append(self.data)
99 | self.anom.append(anom)
100 |
101 | fig = plot_rectangle_stack_series(
102 | series=self.data,
103 | single_series_figsize=(10, 10),
104 | gt_anomaly=anom,
105 | train_eval = train_eval
106 | )
107 |
108 | fig_dir = os.path.join(self.data_dir, 'fig')
109 | os.makedirs(fig_dir, exist_ok=True)
110 |
111 | fig_path = os.path.join(fig_dir, f'{i + 1:03d}.png')
112 | fig.savefig(fig_path)
113 | plt.close()
114 |
115 | self.save()
116 |
117 | def save(self):
118 | data_dict = {
119 | 'series': self.series,
120 | 'anom': self.anom
121 | }
122 | with open(os.path.join(self.data_dir, 'data.pkl'), 'wb') as f:
123 | pickle.dump(data_dict, f)
124 |
125 | def generate_random_config(self):
126 | # Generate random parameters for time series using np.random
127 | return {
128 | 'freq': np.random.uniform(0.03, 0.05),  # Frequency between 0.03 and 0.05
129 | 'coef': np.random.uniform(0.5, 2.0), # Coefficient between 0.5 and 2.0
130 | 'offset': np.random.uniform(-1.0, 1.0), # Offset between -1.0 and 1.0
131 | 'noise_amp': np.random.uniform(0.05, 0.20),  # Noise amplitude between 0.05 and 0.20
132 | 'length': self.stream_length # Length of the time series
133 | }
134 |
135 | def generate_base_timeseries(self, category, tsname, basedata_dir=None):
136 | if category == 'synthetic':
137 | self.data = np.zeros((self.stream_length, self.dim))
138 | for i in range(self.dim):
139 | behavior = np.random.choice(self.behavior)
140 | config = self.generate_random_config()
141 | uni_data = behavior(**config)
142 | self.data[:, i] = uni_data
143 | self.data_origin = self.data.copy()
144 | self.label = np.zeros((self.stream_length, self.dim), dtype=float)
145 | elif category == 'semi':
146 | basedata_dir = f'Multivariate_arff/{tsname}/{tsname}_TEST.arff'
147 | raw_data = arff.loadarff(basedata_dir)
148 | df = pd.DataFrame(raw_data[0])
149 | self.data = np.array([list(item) for item in df.iloc[0,0]]).transpose()
150 | self.stream_length, dim = self.data.shape
151 | if self.dim > dim:
152 | extra_dims = self.dim - dim
153 | repeat_indices = np.random.choice(dim, extra_dims, replace=True)
154 | self.data = np.hstack((self.data, self.data[:, repeat_indices]))
155 | elif self.dim < dim:
156 | selected_indices = np.random.choice(dim, self.dim, replace=False)
157 | self.data = self.data[:, selected_indices]
158 | self.data_origin = self.data.copy()
159 | self.label = np.zeros((self.stream_length, self.dim), dtype=float)
160 |
161 | def variate_outliers(self, anomaly_type):
162 | min_ano, max_ano = 1, math.floor(math.sqrt(self.dim)) - 1
163 | num_anomalies = np.random.randint(min_ano, max_ano + 1)
164 | anomaly_indices = np.random.choice(self.dim, num_anomalies, replace=False)
165 |
166 | for idx in anomaly_indices:
167 | ano_behavior = self.ano_behavior[anomaly_type]
168 | config = self.generate_random_config()
169 | anomaly_data = ano_behavior(**config)
170 | self.data[:, idx] = anomaly_data
171 | self.label[:, idx] = 1 # Mark this variate as an anomaly
172 |
173 |
174 | def drop(data, label, drop_ratio):
175 | if not 0 <= drop_ratio <= 1:
176 | raise ValueError("drop_ratio must be between 0 and 1.")
177 |
178 | seq_len = len(data)
179 | num_drops = int(seq_len * drop_ratio)
180 |
181 | # Generate random indices to drop
182 | drop_index = np.random.choice(seq_len, size=num_drops, replace=False)
183 |
184 | data = data.astype(float) # Ensure float type to allow np.nan
185 | label = label.astype(float) # Ensure float type to allow np.nan
186 |
187 | data[drop_index] = np.nan
188 | label[drop_index] = np.nan
189 |
190 | return data, label, drop_index
191 |
192 | def square_sine(level=5, length=500, freq=0.04, coef=1.5, offset=0.0, noise_amp=0.05):
193 | value = np.zeros(length)
194 | for i in range(level):
195 | value += 1 / (2 * i + 1) * sine(length=length, freq=freq * (2 * i + 1), coef=coef, offset=offset, noise_amp=noise_amp)
196 | return value
197 |
198 | def collective_global_synthetic(length, base, coef=1.5, noise_amp=0.005):
199 | value = []
200 | norm = np.linalg.norm(base)
201 | base = base / norm
202 | num = int(length / len(base))
203 | for i in range(num):
204 | value.extend(base)
205 | residual = length - len(value)
206 | value.extend(base[:residual])
207 | value = np.array(value)
208 | noise = np.random.normal(0, 1, length)
209 | value = coef * value + noise_amp * noise
210 | return value
211 |
212 | # The code is adapted from https://github.com/datamllab/tods/tree/benchmark
213 | class UnivariateDataGenerator:
214 | def __init__(self, data_dir, drop_ratio=0):
215 | BEHAVIOR = sine
216 | BEHAVIOR_CONFIG = {'freq': 0.04, 'coef': 1.5, "offset": 0.0, 'noise_amp': 0.05}
217 | STREAM_LENGTH = 400
218 |
219 | self.behavior = BEHAVIOR
220 | self.behavior_config = BEHAVIOR_CONFIG
221 | self.stream_length = STREAM_LENGTH
222 |
223 | self.data = None
224 | self.label = None
225 | self.data_origin = None
226 |
227 | self.drop_ratio = drop_ratio
228 | self.data_dir = data_dir
229 | self.series = []
230 | self.anom = []
231 | self.drop_index = []
232 |
233 | def generate(self, num_ts, category, anomaly_type, train_eval, tsname):
234 | for i in trange(num_ts):
235 | self.generate_base_timeseries(category, tsname)
236 |
237 | if anomaly_type == 'global':
238 | self.point_global_outliers(ratio=0.05, factor=3.5, radius=5)
239 | elif anomaly_type == 'contextual':
240 | self.point_contextual_outliers(ratio=0.05, factor=2.5, radius=5)
241 | elif anomaly_type == 'seasonal':
242 | self.collective_seasonal_outliers(ratio=0.05, factor=3, radius=5)
243 | elif anomaly_type == 'trend':
244 | self.collective_trend_outliers(ratio=0.05, factor=0.5, radius=5)
245 | elif anomaly_type == 'shapelet':
246 | self.collective_global_outliers(ratio=0.05, radius=5, option='square', coef=1.5, noise_amp=0.03, level=20, freq=0.04, offset=0.0)
247 |
248 | if scenario == 'irr_univariate':
249 | self.data, self.label, drop_index = drop(self.data, self.label, self.drop_ratio)
250 |
251 | if anomaly_type in ['global', 'contextual']:
252 | anom = vector_to_point(self.label)
253 | else:
254 | anom = vector_to_interval(self.label)
255 |
256 | self.series.append(self.data)
257 | self.anom.append(anom)
258 |
259 | if scenario == 'irr_univariate':
260 | self.drop_index.append(drop_index)
261 |
262 | fig = plot_series(
263 | series=self.data,
264 | single_series_figsize=(10, 1.5),
265 | gt_anomaly=anom,
266 | # gt_ylim = (np.nanmin(self.data)*1.1, np.nanmax(self.data)*1.1),
267 | train_eval = train_eval
268 | )
269 |
270 | fig_dir = os.path.join(self.data_dir, 'fig')
271 | os.makedirs(fig_dir, exist_ok=True)
272 |
273 | fig_path = os.path.join(fig_dir, f'{i + 1:03d}.png')
274 | fig.savefig(fig_path)
275 | plt.close()
276 |
277 | self.save()
278 |
279 | def save(self):
280 | data_dict = {
281 | 'series': self.series,
282 | 'anom': self.anom
283 | }
284 | if scenario == 'irr_univariate':
285 | data_dict['drop_index'] = self.drop_index
286 | with open(os.path.join(self.data_dir, 'data.pkl'), 'wb') as f:
287 | pickle.dump(data_dict, f)
288 |
289 | def generate_base_timeseries(self, category, tsname, basedata_dir=None):
290 | if category == 'synthetic':
291 | self.behavior_config['length'] = self.stream_length
292 | self.data = self.behavior(**self.behavior_config)
293 | self.data_origin = self.data.copy()
294 | self.label = np.zeros(self.stream_length, dtype=float)
295 | elif category == 'semi':
296 | basedata_dir = f'Univariate_arff/{tsname}/{tsname}_TEST.arff'
297 | raw_data = arff.loadarff(basedata_dir)
298 | df = pd.DataFrame(raw_data[0])
299 | self.data = df.iloc[0, :-1].values
300 | self.stream_length = self.data.shape[0]
301 | self.data_origin = self.data.copy()
302 | self.label = np.zeros(self.stream_length, dtype=float)
303 |
304 | def point_global_outliers(self, ratio, factor, radius):
305 | """
306 | Add point global outliers to original data
307 | Args:
308 | ratio: what ratio outliers will be added
309 | factor: the larger, the outliers are farther from inliers
310 | radius: the radius of collective outliers range
311 | """
312 | position = (np.random.rand(round(self.stream_length * ratio)) * self.stream_length).astype(int)
313 | maximum, minimum = max(self.data), min(self.data)
314 | for i in position:
315 | local_std = self.data_origin[max(0, i - radius):min(i + radius, self.stream_length)].std()
316 | self.data[i] = self.data_origin[i] * factor * local_std
317 | if 0 <= self.data[i] < maximum: self.data[i] = maximum
318 | if 0 > self.data[i] > minimum: self.data[i] = minimum
319 | self.label[i] = 1
320 |
321 | def point_contextual_outliers(self, ratio, factor, radius):
322 | """
323 | Add point contextual outliers to original data
324 | Args:
325 | ratio: what ratio outliers will be added
326 | factor: the larger, the outliers are farther from inliers
327 | Notice: point contextual outliers will not exceed the range of [min, max] of original data
328 | radius: the radius of collective outliers range
329 | """
330 | position = (np.random.rand(round(self.stream_length * ratio)) * self.stream_length).astype(int)
331 | maximum, minimum = max(self.data), min(self.data)
332 | for i in position:
333 | local_std = self.data_origin[max(0, i - radius):min(i + radius, self.stream_length)].std()
334 | self.data[i] = self.data_origin[i] * factor * local_std
335 | if self.data[i] > maximum: self.data[i] = maximum * min(0.95, abs(np.random.normal(0, 0.5))) # previous(0, 1)
336 | if self.data[i] < minimum: self.data[i] = minimum * min(0.95, abs(np.random.normal(0, 0.5)))
337 |
338 | self.label[i] = 1
339 |
340 | def collective_global_outliers(self, ratio, radius, option='square', coef=3., noise_amp=0.0,
341 | level=5, freq=0.04, offset=0.0, # only used when option=='square'
342 | base=[0.,]): # only used when option=='other'
343 | """
344 | Add collective global outliers to original data
345 | Args:
346 | ratio: what ratio outliers will be added
347 | radius: the radius of collective outliers range
348 | option: if 'square': 'level' 'freq' and 'offset' are used to generate square sine wave
349 | if 'other': 'base' is used to generate outlier shape
350 | level: how many sine waves will square_wave synthesis
351 | base: a list of values that we want to substitute inliers when we generate outliers
352 | """
353 | base = [1.4529900e-01, 1.2820500e-01, 9.4017000e-02, 7.6923000e-02, 1.1111100e-01, 1.4529900e-01, 1.7948700e-01, 2.1367500e-01, 2.1367500e-01]
354 | position = (np.random.rand(round(self.stream_length * ratio / (2 * radius))) * self.stream_length).astype(int)
355 |
356 | valid_option = {'square', 'other'}
357 | if option not in valid_option:
358 | raise ValueError("'option' must be one of %r." % valid_option)
359 |
360 | if option == 'square':
361 | sub_data = square_sine(level=level, length=self.stream_length, freq=freq,
362 | coef=coef, offset=offset, noise_amp=noise_amp)
363 | else:
364 | sub_data = collective_global_synthetic(length=self.stream_length, base=base,
365 | coef=coef, noise_amp=noise_amp)
366 | for i in position:
367 | start, end = max(0, i - radius), min(self.stream_length, i + radius)
368 | self.data[start:end] = sub_data[start:end]
369 | self.label[start:end] = 1
370 |
371 | def collective_trend_outliers(self, ratio, factor, radius):
372 | """
373 | Add collective trend outliers to original data
374 | Args:
375 | ratio: what ratio outliers will be added
376 | factor: how dramatic will the trend be
377 | radius: the radius of collective outliers range
378 | """
379 | position = (np.random.rand(round(self.stream_length * ratio / (2 * radius))) * self.stream_length).astype(int)
380 | for i in position:
381 | start, end = max(0, i - radius), min(self.stream_length, i + radius)
382 | slope = np.random.choice([-1, 1]) * factor * np.arange(end - start)
383 | self.data[start:end] = self.data_origin[start:end] + slope
384 | self.data[end:] = self.data[end:] + slope[-1]
385 | self.label[start:end] = 1
386 |
387 | def collective_seasonal_outliers(self, ratio, factor, radius):
388 | """
389 | Add collective seasonal outliers to original data
390 | Args:
391 | ratio: what ratio outliers will be added
392 | factor: how many times will frequency multiple
393 | radius: the radius of collective outliers range
394 | """
395 | position = (np.random.rand(round(self.stream_length * ratio / (2 * radius))) * self.stream_length).astype(int)
396 | seasonal_config = self.behavior_config
397 | seasonal_config['freq'] = factor * self.behavior_config['freq']
398 | for i in position:
399 | start, end = max(0, i - radius), min(self.stream_length, i + radius)
400 | self.data[start:end] = self.behavior(**seasonal_config)[start:end]
401 | self.label[start:end] = 1
402 |
403 | if __name__ == '__main__':
404 | parser = argparse.ArgumentParser(description="Generate synthetic dataset")
405 | parser.add_argument("--seed", type=int, default=0, help="Random seed for reproducibility")
406 | parser.add_argument("--category", type=str, default='synthetic', choices=['synthetic', 'semi'])
407 | parser.add_argument("--scenario", type=str, default='univariate', choices=['univariate', 'multivariate', 'irr_univariate', 'irr_multivariate'])
408 | parser.add_argument("--tsname", type=str, default=None, help='base real world dataset name')
409 | parser.add_argument("--anomaly_type", type=str, default='global', choices=['global', 'contextual', 'seasonal', 'trend', 'shapelet',
410 | 'triangle', 'square', 'sawtooth', 'random_walk',
411 | 'long'])
412 | parser.add_argument("--num_ts", type=int, default=100, help="Numebr of time series")
413 | parser.add_argument("--dim", type=int, default=9, help="Number of variates of multivariate time series") # min_dim=4
414 | parser.add_argument("--drop_ratio", type=float, default=0.00, help="Dropping ratio of irregular time series")
415 |
416 | args = parser.parse_args()
417 |
418 | seed = args.seed
419 | category = args.category
420 | scenario = args.scenario
421 | tsname = args.tsname
422 | anomaly_type = args.anomaly_type
423 | num_ts = args.num_ts
424 | dim = args.dim
425 | drop_ratio = args.drop_ratio
426 |
427 | np.random.seed(seed)
428 |
429 | for train_eval in ['train', 'eval']:
430 | if category == 'synthetic':
431 | if scenario == 'univariate':
432 | data_dir = os.path.join('data', category, scenario, anomaly_type, train_eval)
433 | elif scenario == 'multivariate':
434 | data_dir = os.path.join('data', category, scenario, f'dim_{dim}', anomaly_type, train_eval)
435 | elif scenario.startswith('irr'):
436 | data_dir = os.path.join('data', category, scenario, f'ratio_{int(drop_ratio*100)}', anomaly_type, train_eval)
437 | elif category == 'semi':
438 | if scenario == 'univariate':
439 | data_dir = os.path.join('data', category, scenario, tsname, anomaly_type, train_eval)
440 | elif scenario == 'multivariate':
441 | data_dir = os.path.join('data', category, scenario, tsname, f'dim_{dim}', anomaly_type, train_eval)
442 | elif scenario.startswith('irr'):
443 | data_dir = os.path.join('data', category, scenario, tsname, f'ratio_{int(drop_ratio*100)}', anomaly_type, train_eval)
444 |
445 | print(f'Generating {data_dir} data.')
446 |
447 | if scenario.endswith('univariate'):
448 | univariate_generator = UnivariateDataGenerator(data_dir=data_dir, drop_ratio=drop_ratio)
449 | univariate_generator.generate(num_ts, category, anomaly_type, train_eval, tsname)
450 | elif scenario.endswith('multivariate'):
451 | multivariate_generator = MultivariateDataGenerator(data_dir=data_dir, dim=dim, drop_ratio=drop_ratio)
452 | multivariate_generator.generate(num_ts, category, anomaly_type, train_eval, tsname)
453 |
--------------------------------------------------------------------------------
/src/generator.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | num_ts=100
3 |
4 | category=synthetic
5 | scenario=univariate
6 | for anomaly_type in global contextual seasonal trend shapelet
7 | do
8 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts
9 | done
10 |
11 | category=synthetic
12 | scenario=irr_univariate
13 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
14 | do
15 | for anomaly_type in global seasonal trend shapelet
16 | do
17 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio
18 | done
19 | done
20 |
21 | category=synthetic
22 | scenario=multivariate
23 | for dim in 4 9 16 25 36
24 | do
25 | for anomaly_type in triangle square sawtooth random_walk
26 | do
27 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts --dim $dim
28 | done
29 | done
30 |
31 | category=synthetic
32 | scenario=irr_multivariate
33 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
34 | do
35 | for anomaly_type in triangle square sawtooth random_walk
36 | do
37 | python generator.py --category $category --scenario $scenario --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio
38 | done
39 | done
40 |
41 |
42 |
43 | category=semi
44 | scenario=univariate
45 | tsname=Symbols
46 | for anomaly_type in global contextual trend shapelet
47 | do
48 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts
49 | done
50 |
51 | category=semi
52 | scenario=irr_univariate
53 | tsname=Symbols
54 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
55 | do
56 | for anomaly_type in global trend shapelet
57 | do
58 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio
59 | done
60 | done
61 |
62 | category=semi
63 | scenario=multivariate
64 | tsname=ArticularyWordRecognition
65 | for dim in 4 9 16 25 36
66 | do
67 | for anomaly_type in triangle square sawtooth random_walk
68 | do
69 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts --dim $dim
70 | done
71 | done
72 |
73 | category=semi
74 | scenario=irr_multivariate
75 | tsname=ArticularyWordRecognition
76 | dim=9
77 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
78 | do
79 | for anomaly_type in triangle square sawtooth random_walk
80 | do
81 | python generator.py --category $category --scenario $scenario --tsname $tsname --anomaly_type $anomaly_type --num_ts $num_ts --drop_ratio $drop_ratio --dim $dim
82 | done
83 | done
84 |
85 |
86 |
87 |
--------------------------------------------------------------------------------
/src/llava_api.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
3 | from PIL import Image
4 | import base64
5 | from io import BytesIO
6 | import requests
7 | import copy
8 |
9 | def load_llava(model_name, device):
10 | if '8b' in model_name:
11 | device = 'cuda:7'
12 | elif '72b' in model_name:
13 | device = 'auto'
14 |
15 | # Initialize model and processor
16 | model = LlavaNextForConditionalGeneration.from_pretrained(
17 | f"llava-hf/{model_name}-hf",
18 | torch_dtype=torch.float16,
19 | load_in_4bit=True,
20 | low_cpu_mem_usage=True,
21 | device_map=device
22 | )
23 |
24 | return model
25 |
26 | def call_llava(model_name, model, llava_request, device):
27 | processor = LlavaNextProcessor.from_pretrained(f"llava-hf/{model_name}-hf")
28 | # Process the conversation
29 | prompt = processor.apply_chat_template(llava_request['messages'], add_generation_prompt=True)
30 | image = llava_request['messages'][0]['content'][1]['image']
31 |
32 | # Create model inputs
33 | inputs = processor(
34 | images=image,
35 | text=prompt,
36 | return_tensors="pt"
37 | ).to(model.device)
38 |
39 | input_length = inputs.input_ids.shape[1]
40 |
41 | # Generate output
42 | outputs = model.generate(
43 | **inputs,
44 | **llava_request['config']
45 | )
46 |
47 | if '8b' in model_name:
48 | new_tokens = outputs[0][input_length+1:]
49 | elif '72b' in model_name:
50 | new_tokens = outputs[0][input_length:]
51 |
52 | # Decode and print result
53 | response = processor.decode(new_tokens, skip_special_tokens=True)
54 |
55 | return response
56 |
57 |
58 | def convert_openai_to_llava(openai_request):
59 | openai_request_copy = copy.deepcopy(openai_request)
60 |
61 | for message in openai_request_copy["messages"]:
62 | for content in message["content"]:
63 | if content["type"] == "image_url":
64 | image_url = content["image_url"]["url"]
65 | if image_url.startswith("data:image"):
66 | base64_str = image_url.split(",")[1]
67 | img_data = base64.b64decode(base64_str)
68 | img = Image.open(BytesIO(img_data))
69 | else:
70 | response = requests.get(image_url)
71 | img = Image.open(BytesIO(response.content))
72 |
73 | content.update({
74 | "type": "image",
75 | "image": img
76 | })
77 | content.pop("image_url")
78 |
79 | llava_messages = openai_request_copy["messages"]
80 |
81 | llava_config = {
82 | 'temperature': openai_request.get("temperature", 0.4),
83 | 'max_new_tokens': openai_request.get("max_tokens", 8192),
84 | 'top_p': openai_request.get("top_p", 1.0),
85 | 'do_sample': True
86 | }
87 |
88 | llava_request = {
89 | "messages": llava_messages,
90 | "config": llava_config
91 | }
92 |
93 | return llava_request
94 |
95 |
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | from openai_api import load_gpt, call_gpt
2 | from gemini_api import load_gemini, call_gemini, convert_openai_to_gemini
3 | from llava_api import load_llava, call_llava, convert_openai_to_llava
4 | from qwen_api import load_qwen, call_qwen, convert_openai_to_qwen
5 | from config import create_api_configs
6 | from utils import process_request
7 | import argparse
8 | from dataloader import TSIDataset
9 |
10 | def load_mllm(model_name, device):
11 | if model_name.startswith('gpt'):
12 | return load_gpt(model_name)
13 | elif model_name.startswith('gemini'):
14 | return load_gemini(model_name)
15 | elif 'llava' in model_name:
16 | return load_llava(model_name, device)
17 | elif model_name.startswith('Qwen'):
18 | return load_qwen(model_name, device)
19 |
20 | def call_mllm(model_name, model, request, device):
21 | if model_name.startswith('gpt'):
22 | response = call_gpt(model_name, model, request)
23 | elif model_name.startswith('gemini'):
24 | response = call_gemini(model_name, model, convert_openai_to_gemini(request))
25 | elif 'llava' in model_name:
26 | response = call_llava(model_name, model, convert_openai_to_llava(request), device)
27 | elif model_name.startswith('Qwen'):
28 | response = call_qwen(model_name, model, convert_openai_to_qwen(request), device)
29 |
30 | return response
31 |
32 | # The code is adapted from https://github.com/Rose-STL-Lab/AnomLLM
33 | def AD_with_retries(
34 | model_name: str,
35 | category: str,
36 | scenario: str,
37 | tsname: str,
38 | data_name: str,
39 | request_func: callable,
40 | variant: str = "standard",
41 | num_retries: int = 4,
42 | dim: int = 9,
43 | drop_ratio: float = 0.00,
44 | device: str = 'cuda:7'
45 | ):
46 | import json
47 | import time
48 | import pickle
49 | import os
50 | from loguru import logger
51 |
52 | results = {}
53 |
54 | if category == 'synthetic':
55 | if scenario == 'univariate':
56 | log_fn = f"logs/{category}/{scenario}/{data_name}/{model_name}/" + variant + ".log"
57 | logger.add(log_fn, format="{time} {level} {message}", level="INFO")
58 | results_dir = f'results/{category}/{scenario}/{data_name}/{model_name}'
59 | data_dir = f'data/{category}/{scenario}/{data_name}/eval'
60 | train_dir = f'data/{category}/{scenario}/{data_name}/train'
61 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl')
62 | elif scenario == 'multivariate':
63 | log_fn = f"logs/{category}/{scenario}/dim_{dim}/{data_name}/{model_name}/" + variant + ".log"
64 | logger.add(log_fn, format="{time} {level} {message}", level="INFO")
65 | results_dir = f'results/{category}/{scenario}/dim_{dim}/{data_name}/{model_name}'
66 | data_dir = f'data/{category}/{scenario}/dim_{dim}/{data_name}/eval'
67 | train_dir = f'data/{category}/{scenario}/dim_{dim}/{data_name}/train'
68 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl')
69 | elif scenario.startswith('irr'):
70 | log_fn = f"logs/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}/" + variant + ".log"
71 | logger.add(log_fn, format="{time} {level} {message}", level="INFO")
72 | results_dir = f'results/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}'
73 | data_dir = f'data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/eval'
74 | train_dir = f'data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/train'
75 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl')
76 | elif category == 'semi':
77 | if scenario == 'univariate':
78 | log_fn = f"logs/{category}/{scenario}/{tsname}/{data_name}/{model_name}/" + variant + ".log"
79 | logger.add(log_fn, format="{time} {level} {message}", level="INFO")
80 | results_dir = f'results/{category}/{scenario}/{tsname}/{data_name}/{model_name}'
81 | data_dir = f'data/{category}/{scenario}/{tsname}/{data_name}/eval'
82 | train_dir = f'data/{category}/{scenario}/{tsname}/{data_name}/train'
83 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl')
84 | elif scenario == 'multivariate':
85 | log_fn = f"logs/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/{model_name}/" + variant + ".log"
86 | logger.add(log_fn, format="{time} {level} {message}", level="INFO")
87 | results_dir = f'results/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/{model_name}'
88 | data_dir = f'data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/eval'
89 | train_dir = f'data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/train'
90 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl')
91 | elif scenario.startswith('irr'):
92 | log_fn = f"logs/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}/" + variant + ".log"
93 | logger.add(log_fn, format="{time} {level} {message}", level="INFO")
94 | results_dir = f'results/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/{model_name}'
95 | data_dir = f'data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/eval'
96 | train_dir = f'data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/train'
97 | jsonl_fn = os.path.join(results_dir, variant + '.jsonl')
98 |
99 | os.makedirs(results_dir, exist_ok=True)
100 |
101 | eval_dataset = TSIDataset(data_dir)
102 | train_dataset = TSIDataset(train_dir)
103 |
104 | # Load existing results if jsonl file exists
105 | if os.path.exists(jsonl_fn):
106 | with open(jsonl_fn, 'r') as f:
107 | for line in f:
108 | entry = json.loads(line.strip())
109 | results[entry['custom_id']] = entry["response"]
110 |
111 | # Loop over image files
112 | model = load_mllm(model_name, device)
113 | for i in range(1, len(eval_dataset) + 1):
114 | if category == 'synthetic':
115 | if scenario == 'univariate':
116 | custom_id = f"{category}_{scenario}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}"
117 | elif scenario == 'multivariate':
118 | custom_id = f"{category}_{scenario}_dim_{dim}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}"
119 | elif scenario.startswith('irr'):
120 | custom_id = f"{category}_{scenario}_ratio_{int(drop_ratio*100)}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}"
121 | elif category == 'semi':
122 | if scenario == 'univariate':
123 | custom_id = f"{category}_{scenario}_{tsname}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}"
124 | elif scenario == 'multivariate':
125 | custom_id = f"{category}_{scenario}_{tsname}_dim_{dim}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}"
126 | elif scenario.startswith('irr'):
127 | custom_id = f"{category}_{scenario}_{tsname}_ratio_{int(drop_ratio*100)}_{data_name}_{model_name}_{variant}_{str(i).zfill(5)}"
128 |
129 | # Skip already processed files
130 | if custom_id in results:
131 | continue
132 |
133 | # print(custom_id)
134 | # Perform anomaly detection with exponential backoff
135 | for attempt in range(num_retries):
136 | try:
137 | start_time = time.time()
138 | request = request_func(
139 | # eval_dataset.series[i - 1],
140 | train_dataset,
141 | (category, scenario, tsname, dim, drop_ratio, data_name, i)
142 | )
143 | response = call_mllm(model_name, model, request, device)
144 | end_time = time.time()
145 | elapsed_time = f'{end_time - start_time}s'
146 | # Write the result to jsonl
147 | with open(jsonl_fn, 'a') as f:
148 | json.dump({'custom_id': custom_id, 'request': process_request(request), 'response': response, 'time': elapsed_time}, f)
149 | f.write('\n')
150 | # If successful, break the retry loop
151 | break
152 | except Exception as e:
153 | if "503" in str(e): # Server not up yet, sleep until the server is up again
154 | while True:
155 | logger.debug("503 error, sleep 30 seconds")
156 | time.sleep(30)
157 | try:
158 | response = call_mllm(model_name, model, request, device)
159 | break
160 | except Exception as e:
161 | if "503" not in str(e):
162 | break
163 | else:
164 | logger.error(e)
165 | # If an exception occurs, wait and then retry
166 | wait_time = 2 ** (attempt + 3)
167 | logger.debug(f"Attempt {attempt + 1} failed. Waiting for {wait_time} seconds before retrying...")
168 | time.sleep(wait_time)
169 | continue
170 | else:
171 | logger.error(f"Failed to process {custom_id} after {num_retries} attempts")
172 |
173 |
174 | def parse_arguments():
175 | parser = argparse.ArgumentParser(description='Process online API anomaly detection.')
176 | parser.add_argument('--variant', type=str, default='0shot-vision', help='Variant type')
177 | parser.add_argument('--model_name', type=str, default='llama3-llava-next-8b', choices=['gpt-4o', 'gpt-4o-mini', 'gemini-1.5-pro', 'gemini-1.5-flash',
178 | 'llama3-llava-next-8b', 'llava-next-72b',
179 | 'Qwen2-VL-7B-Instruct', 'Qwen2-VL-72B-Instruct'], help='Model name')
180 | parser.add_argument("--category", type=str, default='synthetic', choices=['synthetic', 'semi'])
181 | parser.add_argument("--scenario", type=str, default='univariate', choices=['univariate', 'multivariate', 'irr_univariate', 'irr_multivariate'])
182 | parser.add_argument("--tsname", type=str, default=None, choices=['Symbols', 'ArticularyWordRecognition'])
183 | parser.add_argument("--data", type=str, default='global', choices=['global', 'contextual', 'seasonal', 'trend', 'shapelet',
184 | 'triangle', 'square', 'sawtooth', 'random_walk'], help="Synthesized anomaly type")
185 | parser.add_argument("--drop_ratio", type=float, default=0.00)
186 | parser.add_argument("--dim", type=int, default=9)
187 | parser.add_argument("--device", type=str, default='auto')
188 |
189 | return parser.parse_args()
190 |
191 | def main():
192 | args = parse_arguments()
193 | api_configs = create_api_configs()
194 | AD_with_retries(
195 | model_name=args.model_name,
196 | category=args.category,
197 | scenario=args.scenario,
198 | tsname=args.tsname,
199 | data_name=args.data,
200 | request_func=api_configs[args.variant],
201 | variant=args.variant,
202 | dim=args.dim,
203 | drop_ratio=args.drop_ratio,
204 | device=args.device
205 | )
206 |
207 | if __name__ == '__main__':
208 | main()
209 |
--------------------------------------------------------------------------------
/src/openai_api.py:
--------------------------------------------------------------------------------
1 | from loguru import logger
2 | from openai import AzureOpenAI
3 | import yaml
4 |
5 | def load_gpt(model_name):
6 | credentials = yaml.safe_load(open("credentials.yml"))
7 | assert model_name in credentials, f"Model {model_name} not found in credentials"
8 |
9 | credential = credentials[model_name]
10 | api_key = credential["api_key"]
11 | api_version = credential["api_version"]
12 | base_url = credential["base_url"]
13 |
14 | model = AzureOpenAI(
15 | api_key=api_key,
16 | api_version=api_version,
17 | base_url=base_url
18 | )
19 |
20 | return model
21 |
22 | def call_gpt(model_name, model, openai_request):
23 | logger.debug(f"{model_name} is running")
24 |
25 | response = model.chat.completions.create(
26 | model=model_name, **openai_request
27 | )
28 | return response.choices[0].message.content
29 |
30 |
--------------------------------------------------------------------------------
/src/prompt.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from scipy import interpolate
3 | import json
4 | import re
5 | from scipy import stats
6 |
7 | PROMPT_POINT = """Detect points of anomalies in this time series, in terms of the x-axis coordinate.
8 | List one by one in a list. For example, if points x=2, 51, and 106 are anomalies, then output "[2, 51, 106]". If there are no anomalies, answer with an empty list [].
9 | """
10 |
11 | PROMPT = """Detect ranges of anomalies in this time series, in terms of the x-axis coordinate.
12 | List one by one in a list. For example, if ranges (including two endpoints) [2, 11], [50, 60], and [105, 118] are anomalies, then output "[[2, 11], [50, 60], [105, 118]]". \
13 | If there are no anomalies, answer with an empty list [].
14 | """
15 |
16 | PROMPT_VARIATE = """Detect univariate time series of anomalies in this multivariate time series, in terms of ID of univariate time series.
17 | The image is a multivariate time series including multiple subimages to indicate multiple univariate time series. \
18 | From left to right and top to bottom, the ID of each subimage increases by 1, starting from 0.
19 | List one by one in a list. For example, if ID=0, 2, and 5 are anomalous univariate time series, then output "[0, 2, 5]". If there are no anomalies, answer with an empty list [].
20 | """
21 |
22 | def encode_img(fig_path):
23 | import base64
24 |
25 | with open(fig_path, "rb") as image_file:
26 | return base64.b64encode(image_file.read()).decode('utf-8')
27 |
28 | def create_vision_messages(
29 | # time_series,
30 | few_shots=False,
31 | cot=False,
32 | calc=None,
33 | # image_args={},
34 | data_tuple = None
35 | ):
36 | category, scenario, tsname, dim, drop_ratio, data_name, eval_i = data_tuple
37 | if category == 'synthetic':
38 | if scenario == 'univariate':
39 | fig_path = f'data/{category}/{scenario}/{data_name}/eval/fig/{eval_i:03d}.png'
40 | elif scenario == 'multivariate':
41 | fig_path = f'data/{category}/{scenario}/dim_{dim}/{data_name}/eval/fig/{eval_i:03d}.png'
42 | elif scenario.startswith('irr'):
43 | fig_path = f'data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/eval/fig/{eval_i:03d}.png'
44 | elif category == 'semi':
45 | if scenario == 'univariate':
46 | fig_path = f'data/{category}/{scenario}/{tsname}/{data_name}/eval/fig/{eval_i:03d}.png'
47 | elif scenario == 'multivariate':
48 | fig_path = f'data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/eval/fig/{eval_i:03d}.png'
49 | elif scenario.startswith('irr'):
50 | fig_path = f'data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/eval/fig/{eval_i:03d}.png'
51 |
52 | img = encode_img(fig_path)
53 |
54 | if data_name in ["global", "contextual"]:
55 | prompt = PROMPT_POINT
56 | elif data_name in ["triangle", "square", "sawtooth", "random_walk"]:
57 | prompt = PROMPT_VARIATE
58 | else:
59 | prompt = PROMPT
60 |
61 | messages = [
62 | {
63 | "role": "user",
64 | "content": [
65 | {
66 | "type": "text",
67 | "text": prompt
68 | },
69 | {
70 | "type": "image_url",
71 | "image_url": {"url": f"data:image/png;base64,{img}"}
72 | },
73 | ],
74 | }
75 | ]
76 |
77 | return messages
78 |
79 | def create_openai_request(
80 | # time_series,
81 | few_shots=False,
82 | vision=False,
83 | temperature=0.4,
84 | stop=["’’’’", " – –", "<|endoftext|>", "<|eot_id|>"],
85 | cot=False, # Chain of Thought
86 | calc=None, # Enforce wrong calculation
87 | series_args={}, # Arguments for time_series_to_str
88 | # image_args={}, # Arguments for time_series_to_image
89 | data_tuple = None
90 | ):
91 | if vision:
92 | messages = create_vision_messages(few_shots, cot, calc, data_tuple)
93 |
94 | return {
95 | "messages": messages,
96 | "temperature": temperature,
97 | "stop": stop
98 | }
99 |
--------------------------------------------------------------------------------
/src/qwen_api.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
3 | from qwen_vl_utils import process_vision_info
4 | from PIL import Image
5 | import base64
6 | from io import BytesIO
7 | import requests
8 | import copy
9 |
10 | def load_qwen(model_name, device):
11 | if '7B' in model_name:
12 | device = 'cuda:7'
13 | elif '72B' in model_name:
14 | device = 'auto'
15 |
16 | model = Qwen2VLForConditionalGeneration.from_pretrained(
17 | f"Qwen/{model_name}",
18 | torch_dtype=torch.float16,
19 | device_map=device
20 | )
21 |
22 | return model
23 |
24 |
25 | def call_qwen(model_name, model, qwen_request, device):
26 | processor = AutoProcessor.from_pretrained(f"Qwen/{model_name}")
27 | # The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
28 | # min_pixels = 256*28*28
29 | # max_pixels = 1280*28*28
30 | # processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
31 |
32 | # Preparation for inference
33 | text = processor.apply_chat_template(
34 | qwen_request['messages'], tokenize=False, add_generation_prompt=True
35 | )
36 | image_inputs, video_inputs = process_vision_info(qwen_request['messages'])
37 | inputs = processor(
38 | text=[text],
39 | images=image_inputs,
40 | videos=video_inputs,
41 | padding=True,
42 | return_tensors="pt",
43 | ).to(model.device)
44 |
45 | # Inference: Generation of the output
46 | generated_ids = model.generate(**inputs, **qwen_request['config'])
47 | generated_ids_trimmed = [
48 | out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
49 | ]
50 |
51 | response = processor.batch_decode(
52 | generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
53 | )
54 |
55 | return response[0]
56 |
57 |
58 | def convert_openai_to_qwen(openai_request):
59 | openai_request_copy = copy.deepcopy(openai_request)
60 |
61 | for message in openai_request_copy["messages"]:
62 | for content in message["content"]:
63 | if content["type"] == "image_url":
64 | image_url = content["image_url"]["url"]
65 | if image_url.startswith("data:image"):
66 | base64_str = image_url.split(",")[1]
67 | img_data = base64.b64decode(base64_str)
68 | img = Image.open(BytesIO(img_data))
69 | else:
70 | response = requests.get(image_url)
71 | img = Image.open(BytesIO(response.content))
72 |
73 | content.update({
74 | "type": "image",
75 | "image": img
76 | })
77 | content.pop("image_url")
78 |
79 | qwen_messages = openai_request_copy["messages"]
80 |
81 | qwen_config = {
82 | 'temperature': openai_request.get("temperature", 0.4),
83 | 'max_new_tokens': openai_request.get("max_tokens", 8192),
84 | 'top_p': openai_request.get("top_p", 1.0),
85 | 'do_sample': True
86 | }
87 |
88 | qwen_request = {
89 | "messages": qwen_messages,
90 | "config": qwen_config
91 | }
92 |
93 | return qwen_request
--------------------------------------------------------------------------------
/src/result_agg.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import numpy as np
3 | import pandas as pd
4 | from tqdm import trange
5 | from utils import (
6 | interval_to_vector,
7 | point_to_vector,
8 | id_to_vector
9 | )
10 | import pickle
11 | import os
12 | from dataloader import TSIDataset
13 | from sklearn.metrics import precision_score, recall_score, f1_score
14 | from affiliation.generics import convert_vector_to_events
15 | from affiliation.metrics import pr_from_events
16 |
17 |
18 | def df_to_latex(df):
19 | # Step 1: Process the index to extract only the model name part
20 | df = df.reset_index() # Reset index to bring it as a column
21 | df['index'] = df['index'].str.split(' ').str[0] # Keep only the model name
22 | df.rename(columns={'index': 'model'}, inplace=True) # Rename the index column
23 |
24 | # Step 2: Sort the DataFrame by a custom order for models
25 | order = {"gpt-4o": 0, "gpt-4o-mini": 1, "gemini-1.5-pro": 2, "gemini-1.5-flash": 3,
26 | "llava-next-72b": 4, "llama3-llava-next-8b": 5,
27 | 'qwen2-vl-72b-instruct': 6, 'qwen2-vl-7b-instruct': 7}  # keys lowercased to match the lookup below
28 | df['priority'] = df['model'].apply(lambda x: order.get(x.lower(), 20))  # Default priority for unlisted models is 20
29 | df = df.sort_values(by=['priority', 'model']).drop(columns=['priority']) # Sort and drop priority column
30 |
31 | # Step 3: Convert numerical values to percentages with two decimal places
32 | for col in df.select_dtypes(include=['float', 'int']).columns:
33 | df[col] = df[col].apply(lambda x: f"{x*100:.2f}")  # e.g., 0.1234 -> "12.34"
34 |
35 | # Step 4: Convert to plain LaTeX table
36 | latex_table = df.to_latex(index=False, column_format="|l" + "r" * (len(df.columns) - 1) + "|")
37 |
38 | return df, latex_table
39 |
40 | def compute_metrics(gt, prediction):
41 | if np.count_nonzero(gt) == 0:
42 | print('ground truth is all zero!!!')
43 | exit()
44 | elif np.count_nonzero(prediction) == 0:
45 | metrics = {
46 | 'precision': 0,
47 | 'recall': 0,
48 | 'f1': 0,
49 | 'affi precision': 0,
50 | 'affi recall': 0,
51 | 'affi f1': 0
52 | }
53 | else:
54 | precision = precision_score(gt, prediction)
55 | recall = recall_score(gt, prediction)
56 | f1 = f1_score(gt, prediction)
57 |
58 | events_pred = convert_vector_to_events(prediction)
59 | events_gt = convert_vector_to_events(gt)
60 | Trange = (0, len(prediction))
61 | aff = pr_from_events(events_pred, events_gt, Trange)
62 |
63 | # Calculate affiliation F1
64 | if aff['precision'] + aff['recall'] == 0:
65 | affi_f1 = 0
66 | else:
67 | affi_f1 = 2 * (aff['precision'] * aff['recall']) / (aff['precision'] + aff['recall'])
68 |
69 | metrics = {
70 | 'precision': round(precision, 3),
71 | 'recall': round(recall, 3),
72 | 'f1': round(f1, 3),
73 | 'affi precision': round(aff['precision'], 3),
74 | 'affi recall': round(aff['recall'], 3),
75 | 'affi f1': round(affi_f1, 3)
76 | }
77 | return metrics
78 |
79 | def compute_metrics_for_results(eval_dataset, results, scenario, data_name, num_samples=100):
80 | metric_names = [
81 | "precision",
82 | "recall",
83 | "f1",
84 | "affi precision",
85 | "affi recall",
86 | "affi f1",
87 | ]
88 | results_dict = {key: [[] for _ in metric_names] for key in results.keys()}
89 |
90 | for i in trange(0, num_samples):
91 | anomaly_locations, series = eval_dataset[i][0].numpy(), eval_dataset[i][1].numpy()
92 | if scenario.endswith('univariate'):
93 | len_series = series.shape[0]
94 | elif scenario.endswith('multivariate'):
95 | dim = series.shape[1]
96 |
97 | if data_name in ['global', 'contextual']:
98 | gt = point_to_vector(anomaly_locations, len_vector=len_series)
99 | elif data_name in ['seasonal', 'trend', 'shapelet']:
100 | gt = interval_to_vector(anomaly_locations, start=0, end=len_series)
101 | else:
102 | gt = id_to_vector(anomaly_locations, dim)
103 |
104 | for name, prediction in results.items():
105 |             if prediction[i] is None:
106 | continue
107 |
108 | if data_name in ['global', 'contextual']:
109 | pred = point_to_vector(prediction[i], len_vector=len_series)
110 | elif data_name in ['seasonal', 'trend', 'shapelet']:
111 | pred = interval_to_vector(prediction[i], start=0, end=len_series, pred=True)
112 | else:
113 | pred = id_to_vector(prediction[i], dim)
114 |
115 | if scenario == 'irr_univariate':
116 | drop_index = eval_dataset[i][2].numpy().astype(int)
117 | gt_irr = np.delete(gt, drop_index)
118 | pred_irr = np.delete(pred, drop_index)
119 |
120 | metrics = compute_metrics(gt, pred) if scenario != 'irr_univariate' else compute_metrics(gt_irr, pred_irr)
121 |
122 | for idx, metric_name in enumerate(metric_names):
123 | results_dict[name][idx].append(metrics[metric_name])
124 |
125 | df = pd.DataFrame(
126 | {k: np.mean(v, axis=1) for k, v in results_dict.items()},
127 | index=["precision", "recall", "f1", "affi precision", "affi recall", "affi f1"],
128 | )
129 |
130 | return df
131 |
132 |
133 | def load_time_results(result_fn):
134 | import json
135 |
136 | with open(result_fn, 'r') as f:
137 | time_results = []
138 | for line in f:
139 | info = json.loads(line)
140 | try:
141 |                 time = float(info['time'][:-1])  # strip the trailing unit character before casting to float
142 | time_results.append(time)
143 | except Exception:
144 | time_results.append(None)
145 | continue
146 |
147 | return time_results
148 |
149 | def parse_output(output: str, data_name: str) -> dict:
150 | """Parse the output of the AD model.
151 |
152 | Args:
153 | output: The output of the AD model.
154 |
155 | Returns:
156 | A dictionary containing the parsed output.
157 | """
158 | import json
159 | import re
160 |
161 | # handle cases where the max_tokens are reached
162 | if output.count('[') == output.count(']') + 1:
163 | # remove invalid tokens
164 | if output.endswith(',') or output.endswith(' '):
165 | output = output.rstrip(', ')
166 | else:
167 | output = output.rstrip('0123456789').rstrip(', ')
168 | # Add the missing right bracket
169 | output += ']'
170 |
171 | # Trim the output string
172 | trimmed_output = output[output.index('['):output.rindex(']') + 1]
173 |
174 | # check if containing digits
175 | trimmed_output = '[]' if not re.search(r'\d', trimmed_output) else trimmed_output
176 |
177 | # Try to parse the output as JSON
178 | parsed_output = json.loads(trimmed_output)
179 |
180 | # Validate the output: list of dict with keys start and end
181 | if data_name in ['global', 'contextual', 'triangle', 'square', 'sawtooth', 'random_walk']:
182 | for item in parsed_output:
183 | if not isinstance(item, int):
184 | raise ValueError("Parsed output contains non-int items")
185 | else:
186 | for item in parsed_output:
187 | # if not isinstance(item, dict):
188 | # raise ValueError("Parsed output contains non-dict items")
189 | # if 'start' not in item or 'end' not in item:
190 | # raise ValueError("Parsed output dictionaries must contain 'start' and 'end' keys")
191 | if not isinstance(item, list):
192 |                 raise ValueError("Parsed output contains non-list items")
193 |
194 | return parsed_output
195 |
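# Illustrative sketch (not part of the pipeline): parse_output repairing a response that was
# cut off at max_tokens. The trailing, possibly incomplete number is stripped before the
# missing ']' is appended, so the last partial value is dropped.
#
#   truncated = "The anomalies are at [3, 17, 42, 5"
#   parse_output(truncated, data_name='global')   # -> [3, 17, 42]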
196 | def load_results(result_fn, data_name, raw=False, postprocess_func: callable = None):
197 | """
198 | Load and process results from a result JSON lines file.
199 |
200 | Parameters
201 | ----------
202 | result_fn : str
203 | The filename of the JSON lines file containing the results.
204 | raw : bool, optional
205 | If True, return raw JSON objects. If False, parse the response
206 | and convert it to a vector. Default is False.
207 | postprocess_func : callable, optional
208 | A function to postprocess the results (e.g., scaling down). Default is None.
209 |
210 | Returns
211 | -------
212 | list
213 | A list of processed results. Each item is either a raw JSON object
214 | or a vector representation of anomalies, depending on the
215 | `raw` parameter.
216 |
217 | Notes
218 | -----
219 | The function attempts to parse each line in the file. If parsing fails,
220 | it appends an empty vector to the results.
221 |
222 | Raises
223 | ------
224 | FileNotFoundError
225 | If the specified file does not exist.
226 | JSONDecodeError
227 | If a line in the file is not valid JSON.
228 | """
229 | import json
230 |
231 |
232 | if postprocess_func is None:
233 | postprocess_func = lambda x: x
234 |
235 | with open(result_fn, 'r') as f:
236 | results = []
237 | for line in f:
238 | info = json.loads(line)
239 | if raw:
240 | results.append(info)
241 | else:
242 | try:
243 | response_parsed = parse_output(postprocess_func(info['response']), data_name)
244 | results.append(response_parsed)
245 | except Exception:
246 | results.append(None)
247 | continue
248 |
249 | return results
250 |
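# Illustrative sketch (not part of the pipeline): loading one model's predictions. The path and
# file name below are hypothetical; each line of the .jsonl is parsed into a list of anomaly
# locations, with None recorded when a response cannot be parsed.
#
#   preds = load_results('results/synthetic/univariate/global/gpt-4o/0shot.jsonl',
#                        data_name='global')
#   valid = [p for p in preds if p is not None]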
251 | def collect_results(directory, raw=False, ignore=[]):
252 | """
253 | Collect and process results from JSON lines files in a directory.
254 |
255 | Parameters
256 | ----------
257 | directory : str
258 | The path to the directory containing the JSON lines files.
259 | raw : bool, optional
260 | If True, return raw JSON objects. If False, parse the responses.
261 | Default is False.
262 |     ignore : list[str], optional
263 | Skip folders containing these names. Default is an empty list.
264 |
265 | Returns
266 | -------
267 | dict
268 | A dictionary where keys are model names with variants, and values
269 | are lists of processed results from each file.
270 |
271 | Notes
272 | -----
273 | This function walks through the given directory, processing each
274 | `.jsonl` file except those with 'requests' in the filename. It uses
275 | the directory name as the model name and the filename (sans extension)
276 | as the variant.
277 |
278 | Raises
279 | ------
280 | FileNotFoundError
281 | If the specified directory does not exist.
282 | """
283 | import os
284 | from config import postprocess_configs
285 |
286 | results = {}
287 | config = postprocess_configs()
288 | for root, _, files in os.walk(directory):
289 | for file in files:
290 | skip = False
291 | for ignore_folder in ignore:
292 | if ignore_folder in root:
293 | skip = True
294 | break
295 | if skip:
296 | continue
297 | if 'requests' not in file and file.endswith('.jsonl'):
298 | model_name = os.path.basename(root)
299 | data_name = os.path.basename(os.path.dirname(root))
300 | # scenario = os.path.basename(os.path.dirname(os.path.dirname(root)))
301 | variant = file.replace('.jsonl', '')
302 | if variant in config:
303 | pf = config[variant]
304 | else:
305 | pf = None
306 | result_fn = os.path.join(root, file)
307 | model_key = f'{model_name} ({variant})'
308 | results[model_key] = load_results(result_fn, data_name, raw=raw, postprocess_func=pf)
309 |
310 | return results
311 |
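# Illustrative sketch (not part of the pipeline): collect_results walks a results directory
# (hypothetical path below) and returns a dict keyed by 'model (variant)', which feeds straight
# into compute_metrics_for_results() as done in main().
#
#   all_results = collect_results('results/synthetic/univariate/global', ignore=[])
#   list(all_results.keys())   # e.g. ['gpt-4o (0shot)', ...] depending on the files present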
312 | def load_datasets(category, scenario, tsname, dim, drop_ratio, data_name):
313 | if category == 'synthetic':
314 | if scenario == 'univariate':
315 | data_dir = f"data/{category}/{scenario}/{data_name}/eval"
316 | train_dir = f"data/{category}/{scenario}/{data_name}/train"
317 | elif scenario == 'multivariate':
318 | data_dir = f"data/{category}/{scenario}/dim_{dim}/{data_name}/eval"
319 | train_dir = f"data/{category}/{scenario}/dim_{dim}/{data_name}/train"
320 | elif scenario.startswith('irr'):
321 | data_dir = f"data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/eval"
322 | train_dir = f"data/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}/train"
323 | elif category == 'semi':
324 | if scenario == 'univariate':
325 | data_dir = f"data/{category}/{scenario}/{tsname}/{data_name}/eval"
326 | train_dir = f"data/{category}/{scenario}/{tsname}/{data_name}/train"
327 | elif scenario == 'multivariate':
328 | data_dir = f"data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/eval"
329 | train_dir = f"data/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}/train"
330 | elif scenario.startswith('irr'):
331 | data_dir = f"data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/eval"
332 | train_dir = f"data/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}/train"
333 | eval_dataset = TSIDataset(data_dir)
334 | train_dataset = TSIDataset(train_dir)
335 | return eval_dataset, train_dataset
336 |
337 |
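# For orientation, the directory layout load_datasets() resolves for two example settings,
# read directly off the f-strings above:
#
#   category='synthetic', scenario='multivariate', dim=9, data_name='triangle'
#       -> data/synthetic/multivariate/dim_9/triangle/{train,eval}
#   category='semi', scenario='irr_univariate', tsname='Symbols', drop_ratio=0.25, data_name='trend'
#       -> data/semi/irr_univariate/Symbols/ratio_25/trend/{train,eval}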
338 | def main(args):
339 | category = args.category
340 | scenario = args.scenario
341 | tsname = args.tsname
342 | data_name = args.data_name
343 | drop_ratio = args.drop_ratio
344 | dim = args.dim
345 | label_name = args.label_name
346 | table_caption = args.table_caption
347 |
348 | eval_dataset, train_dataset = load_datasets(category, scenario, tsname, dim, drop_ratio, data_name)
349 | if category == 'synthetic':
350 | if scenario == 'univariate':
351 | directory = f"results/{category}/{scenario}/{data_name}"
352 | elif scenario == 'multivariate':
353 | directory = f"results/{category}/{scenario}/dim_{dim}/{data_name}"
354 | elif scenario.startswith('irr'):
355 | directory = f"results/{category}/{scenario}/ratio_{int(drop_ratio*100)}/{data_name}"
356 | elif category == 'semi':
357 | if scenario == 'univariate':
358 | directory = f'results/{category}/{scenario}/{tsname}/{data_name}'
359 | elif scenario == 'multivariate':
360 | directory = f"results/{category}/{scenario}/{tsname}/dim_{dim}/{data_name}"
361 | elif scenario.startswith('irr'):
362 | directory = f'results/{category}/{scenario}/{tsname}/ratio_{int(drop_ratio*100)}/{data_name}'
363 | results = collect_results(directory, ignore=[])
364 |
365 | df = compute_metrics_for_results(eval_dataset, results, scenario, data_name, num_samples=len(eval_dataset))
366 | df = df.T
367 | # print(df)
368 | df, latex_table = df_to_latex(df.copy())
369 | print(df)
370 | print(latex_table)
371 |
372 | if scenario.endswith('univariate'):
373 | df_selected = df[['model', 'affi precision', 'affi recall', 'affi f1']].rename(columns={'affi precision': 'precision', 'affi recall': 'recall', 'affi f1': 'f1'})\
374 | .set_index('model')
375 | elif scenario.endswith('multivariate'):
376 | df_selected = df[['model', 'precision', 'recall', 'f1']].set_index('model')
377 |
378 | # Attempt to drop the index, catch exception if it doesn't exist
379 | try:
380 | df_selected = df_selected.drop(index='gemini-1.5-flash-8b')
381 | except KeyError:
382 | pass # If index does not exist, do nothing and proceed
383 |
384 | with open(f"{directory}/df.pkl", "wb") as f:
385 | pickle.dump(df_selected, f)
386 |
387 | if __name__ == "__main__":
388 | parser = argparse.ArgumentParser(description="Process time series data and generate LaTeX table.")
389 | parser.add_argument("--category", type=str, default='synthetic', choices=['synthetic', 'semi', 'real'])
390 | parser.add_argument("--scenario", type=str, default='univariate', choices=['univariate', 'multivariate', 'irr_univariate', 'irr_multivariate', 'long'])
391 | parser.add_argument("--tsname", type=str, default=None, choices=['Symbols', 'ArticularyWordRecognition'])
392 | parser.add_argument("--data_name", type=str, default='global', choices=['global', 'contextual', 'seasonal', 'trend', 'shapelet',
393 | 'triangle', 'square', 'sawtooth', 'random_walk',
394 |                                                              'long'], help="Synthesized anomaly type")
395 | parser.add_argument("--drop_ratio", type=float, default=0.00)
396 | parser.add_argument("--dim", type=int, default=9)
397 | parser.add_argument("--label_name", type=str, default='trend-exp', help="Name of the experiment")
398 | parser.add_argument("--table_caption", type=str, default='Trend anomalies in shifting sine wave', help="Caption for the LaTeX table")
399 | args = parser.parse_args()
400 | main(args)
401 |
--------------------------------------------------------------------------------
/src/test.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | device=auto
3 |
4 | category=synthetic
5 | scenario=univariate
6 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
7 | do
8 | for data in global contextual seasonal trend shapelet
9 | do
10 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --device $device
11 | done
12 | done
13 |
14 | category=synthetic
15 | scenario=multivariate
16 | for dim in 4 9 16 25 36
17 | do
18 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
19 | do
20 | for data in triangle square sawtooth random_walk
21 | do
22 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --dim $dim --device $device
23 | done
24 | done
25 | done
26 |
27 | category=synthetic
28 | scenario=irr_univariate
29 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
30 | do
31 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
32 | do
33 | for data in global seasonal trend shapelet
34 | do
35 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --drop_ratio $drop_ratio --device $device
36 | done
37 | done
38 | done
39 |
40 | category=synthetic
41 | scenario=irr_multivariate
42 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
43 | do
44 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
45 | do
46 | for data in triangle square sawtooth random_walk
47 | do
48 | python main.py --category $category --scenario $scenario --model_name $model_name --data $data --drop_ratio $drop_ratio --device $device
49 | done
50 | done
51 | done
52 |
53 |
54 |
55 | category=semi
56 | scenario=univariate
57 | tsname=Symbols
58 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
59 | do
60 | for data in global contextual trend shapelet
61 | do
62 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --device $device
63 | done
64 | done
65 |
66 | category=semi
67 | scenario=irr_univariate
68 | tsname=Symbols
69 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
70 | do
71 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
72 | do
73 | for data in global trend shapelet
74 | do
75 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --drop_ratio $drop_ratio --device $device
76 | done
77 | done
78 | done
79 |
80 | category=semi
81 | scenario=multivariate
82 | tsname=ArticularyWordRecognition
83 | for dim in 4 9 16 25 36
84 | do
85 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
86 | do
87 | for data in triangle square sawtooth random_walk
88 | do
89 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --dim $dim --device $device
90 | done
91 | done
92 | done
93 |
94 | category=semi
95 | scenario=irr_multivariate
96 | tsname=ArticularyWordRecognition
97 | dim=9
98 | for drop_ratio in 0.05 0.10 0.15 0.20 0.25
99 | do
100 | for model_name in gemini-1.5-flash gemini-1.5-pro gpt-4o-mini gpt-4o llama3-llava-next-8b llava-next-72b Qwen2-VL-7B-Instruct Qwen2-VL-72B-Instruct
101 | do
102 | for data in triangle square sawtooth random_walk
103 | do
104 | python main.py --category $category --scenario $scenario --tsname $tsname --model_name $model_name --data $data --drop_ratio $drop_ratio --dim $dim --device $device
105 | done
106 | done
107 | done
108 |
--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
1 | from matplotlib import pyplot as plt
2 | import numpy as np
3 | import pandas as pd
4 | from typing import Optional
5 | import math
6 |
7 | def process_request(request):
8 | request['messages'][0]['content'][1]['image_url'] = 'ignore'
9 |
10 | return request
11 |
12 | def id_to_vector(ids, len_dim=9):
13 |     # Note: not an exact inverse of vector_to_id (ids outside the vector length are skipped)
14 | anomalies = np.zeros(len_dim)
15 | for id in ids:
16 | try:
17 | anomalies[int(id)] = 1
18 | except Exception:
19 | continue
20 |
21 | return anomalies
22 |
23 | def vector_to_id(multi_vector):
24 | ids = []
25 |
26 | for i in range(multi_vector.shape[1]):
27 | vector = multi_vector[:, i]
28 | # Ignore NaN values and check if all remaining elements are 1
29 | if np.all(vector[np.isnan(vector) == False] == 1):
30 | ids.append(i)
31 |
32 | return ids
33 |
34 | def point_to_vector(points, len_vector=400):
35 | anomalies = np.zeros(len_vector)
36 | for point in points:
37 | try:
38 | anomalies[int(point)] = 1
39 | except Exception:
40 | continue
41 |
42 | return anomalies
43 |
44 | def vector_to_point(vector):
45 | points = [i for i, x in enumerate(vector) if x == 1]
46 |
47 | return points
48 |
49 | def interval_to_vector(interval, start=0, end=400, pred=False):
50 | anomalies = np.zeros(end - start)
51 | for entry in interval:
52 |         if len(entry) != 2:
53 | continue
54 | try:
55 | entry = {'start': int(entry[0]), 'end': int(entry[1])}
56 | entry['end'] = entry['end'] + 1 if pred else entry['end']
57 | entry['start'] = np.clip(entry['start'], start, end)
58 | entry['end'] = np.clip(entry['end'], entry['start'], end)
59 | anomalies[entry['start']:entry['end']] = 1
60 |         except (ValueError, IndexError, TypeError):
61 | continue # Skip the current entry and move to the next
62 |
63 | return anomalies
64 |
65 | def vector_to_interval(vector):
66 | intervals = []
67 | in_interval = False
68 | start = 0
69 | for i, value in enumerate(vector):
70 | if value == 1 and not in_interval:
71 | start = i
72 | in_interval = True
73 | elif value == 0 and in_interval:
74 | intervals.append((start, i))
75 | in_interval = False
76 | if in_interval:
77 | intervals.append((start, len(vector)))
78 |
79 | return intervals
80 |
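# Illustrative round trip (not used by the pipeline): interval_to_vector marks the half-open
# interval [start, end) in a binary mask, and vector_to_interval recovers it from that mask.
#
#   vec = interval_to_vector([[3, 6]], start=0, end=10)   # ones at indices 3, 4, 5
#   vector_to_interval(vec)                               # -> [(3, 6)]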
81 | def nearest_square_root(n):
82 | lower_sqrt = math.floor(math.sqrt(n))
83 | upper_sqrt = math.ceil(math.sqrt(n))
84 |
85 | lower_square = lower_sqrt ** 2
86 | upper_square = upper_sqrt ** 2
87 |
88 | return lower_sqrt if abs(lower_square - n) <= abs(upper_square - n) else upper_sqrt
89 |
90 | def create_color_generator(exclude_color='blue'):
91 | # Get the default color list
92 | default_colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][1:]
93 | # Filter out the excluded color
94 | filtered_colors = [color for color in default_colors if color != exclude_color]
95 | # Create a generator that yields colors in order
96 | return (color for color in filtered_colors)
97 |
98 | def plot_rectangle_stack_series(
99 | series,
100 | gt_anomaly,
101 | single_series_figsize: tuple[int, int] = (10, 10),
102 | gt_color: str = 'steelblue',
103 | train_eval: str = 'train'
104 | ) -> plt.Figure:
105 | stream_length, dim = series.shape
106 |
107 | # Calculate the optimal number of rows and columns for a rectangular layout
108 | rows = int(math.sqrt(dim))
109 | cols = math.ceil(dim / rows)
110 |
111 | fig, axes = plt.subplots(rows, cols, figsize=(single_series_figsize[0] * cols / rows, single_series_figsize[1]))
112 | fig.subplots_adjust(hspace=0, wspace=0)
113 |
114 | # Plot each univariate time series in its subplot
115 | for idx in range(rows * cols):
116 | row, col = divmod(idx, cols)
117 | ax = axes[row, col]
118 |
119 | if idx < dim:
120 | ax.plot(series[:, idx], color=gt_color)
121 | ax.set_xticks([])
122 | ax.set_yticks([])
123 | else:
124 | # Turn off unused subplots
125 | ax.axis('off')
126 |
127 | if train_eval == 'train' and gt_anomaly is not None:
128 | if isinstance(gt_anomaly[0], int) and idx in gt_anomaly:
129 | ax.lines[-1].set_color('red')
130 |
131 | plt.tight_layout()
132 | return plt.gcf()
133 |
134 | def plot_series(
135 | series,
136 | gt_anomaly,
137 |     single_series_figsize: tuple[float, float] = (10, 1.5),
138 | # gt_ylim: tuple[int, int] = (-1, 1),
139 | gt_color: str = 'steelblue',
140 | train_eval: str = 'train'
141 | ) -> plt.Figure:
142 | plt.figure(figsize=single_series_figsize)
143 |
144 | # plt.ylim(gt_ylim)
145 | plt.plot(series, color=gt_color)
146 |
147 | if train_eval == 'train':
148 | if gt_anomaly is not None:
149 | if isinstance(gt_anomaly[0], tuple):
150 | for start, end in gt_anomaly:
151 | plt.axvspan(start, end-1, alpha=0.2, color=gt_color)
152 | elif isinstance(gt_anomaly[0], int):
153 | for point in gt_anomaly:
154 | plt.axvline(x=point, color=gt_color, alpha=0.5, linestyle='--')
155 |
156 | plt.tight_layout()
157 | return plt.gcf()
158 |
159 | def view_base64_image(base64_string):
160 | import base64
161 | from io import BytesIO
162 | from PIL import Image
163 | import matplotlib.pyplot as plt
164 |
165 | # Decode the base64 string to binary data
166 | image_data = base64.b64decode(base64_string)
167 |
168 | # Convert binary data to an image
169 | image = Image.open(BytesIO(image_data))
170 |
171 | # Display the image
172 | plt.imshow(image)
173 | plt.axis('off') # Hide axes
174 | plt.show()
175 |
--------------------------------------------------------------------------------
/teaser.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mllm-ts/VisualTimeAnomaly/3d18d64b99cc13d4276eb2281555362604979f97/teaser.png
--------------------------------------------------------------------------------