├── dataset
│   ├── dataset.xlsx
│   ├── example_3
│   ├── example_5
│   └── example_10
├── run_test.sh
├── LICENSE
├── README.md
└── eval.py
/dataset/dataset.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sslab-gatech/SysGPT/HEAD/dataset/dataset.xlsx -------------------------------------------------------------------------------- /run_test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | temp=(0.0 0.1 0.2 0.3 0.4 0.5 0.7) 4 | trial=(1 3 5 10) 5 | 6 | mkdir -p ./auto_log 7 | 8 | for a1 in "${temp[@]}"; do 9 | for a2 in "${trial[@]}"; do 10 | echo "Running: python3 -u eval.py $a1 $a2 | tee ./auto_log/temp$a1-best$a2.log" 11 | python3 -u eval.py "$a1" "$a2" | tee ./auto_log/temp$a1-best$a2.log 12 | done 13 | done 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 gts3.org (SSLab@Gatech) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SysGPT Dataset and Benchmark 2 | This repository contains the full dataset and evaluation benchmark introduced in 3 | our OSDI'25 paper: 4 | 5 | "Principles and Methodologies for Serial Performance Optimization (OSDI '25)" 6 | 7 | ## Overview 8 | 9 | Large language models (LLMs) hold promise as assistants for system performance 10 | optimization, yet their evaluation in this domain remains underexplored. This 11 | repository provides: 12 | 13 | - A curated dataset of performance optimization problems and observations, derived from 10 years of SOSP/OSDI papers 14 | - A taxonomy-grounded benchmark to assess LLMs' ability to suggest concrete, actionable system optimizations 15 | - Scripts to evaluate models on their ability to recover real-world optimization strategies 16 | 17 | ## Contents 18 | 19 | ``` 20 | . 
21 | ├── dataset/ 22 | │ ├── dataset.xlsx # Full training + test data (see below) 23 | │ ├── example_3 # Few-shot prompt examples (N = 3) 24 | │ ├── example_5 # Few-shot prompt examples (N = 5) 25 | │ └── example_10 # Few-shot prompt examples (N = 10) 26 | │ 27 | ├── eval.py # Evaluation script (e.g., precision/recall) 28 | ├── run_test.sh # Script to reproduce Figure 7 29 | └── README.md 30 | 31 | ``` 32 | 33 | ### `dataset.xlsx` 34 | 35 | - Sheet 1: Training dataset distilled from 10 years of OSDI/SOSP papers (2013–2022). 36 | - Sheet 2: Test dataset of 96 papers published in 2024 (OSDI/SOSP). 37 | - Each entry includes a problem statement, system observations, and labeled methodologies. 38 | 39 | 40 | ## Citation 41 | If you use this dataset or benchmark, please cite: 42 | 43 | ``` 44 | @inproceedings{park:sysgpt, 45 | title = {{Principles and Methodologies for Serial Performance Optimization}}, 46 | author = {Sujin Park and Mingyu Guan and Xiang Cheng and Taesoo Kim}, 47 | booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI)}, 48 | month = jul, 49 | year = 2025, 50 | } 51 | ``` 52 | -------------------------------------------------------------------------------- /dataset/example_3: -------------------------------------------------------------------------------- 1 | Here are three examples seperated by ### marks: 2 | 3 | ### 4 | Problem: The performance of existing CPU- and GPU-based Embedding-based Retrieval (EBR) systems for recommendation systems is suboptimal due to their inherent architectural limitations. A good EBR system should achieve both high throughput and low latency, which is crucial for cost savings and user experience improvement. 5 | 6 | Observations: 1. CPU-based EBR systems suffer from low memory bandwidth and limited cores, hindering simultaneous support for desired parallelism paradigms and batch queries. 2. GPU-based EBR systems provide higher memory bandwidth and compute cores, but are not optimized for pipeline parallelism, leading to increased latency from explicit kernel boundaries and suboptimal inter-operator communication. 3. K-selection for GPU-based systems either heavily relies on on-chip memory with computation overheads or uses external memory, exacerbating latency with frequent memory accesses. 4. FPGA has desired properties ideal for EBR: large high-bandwidth memory (HBM), sufficient on-chip memories, and programmable compute elements supporting customized parallelism, making it a viable platform for optimizing EBR performance. 7 | 8 | Solution: 1. [Batching] Batch queries are executed together to share the cost of scanning the corpus among all queries in the batch. This means the system only scans the corpus once per batch, and simultaneous compute pipelines process different queries, achieving linear throughput scalability with the batch size while preserving low query latency. 2. [Precomputing] FAERY introduces a filter in the pipeline that can early drop the items that cannot make it into the Top-K, instead of waiting until the end of the pipeline. By detecting and discarding these non-competitive scores before K-selection, FAERY reduces the data volume processed in later stages, saving both computation and on-chip resources. 3. [Hardware] FAERY stores the entire item embedding corpus in the FPGA’s high-bandwidth memory (HBM), which provides much higher bandwidth than traditional CPU DRAM or GPU memory. 
It exploits HBM’s multi-channel parallelism and uses fully pipelined, streaming operator designs on the FPGA fabric for similarity computation and K-selection. The similarity calculation is parallelized to match HBM bandwidth, and a single, efficient K-selection pipeline is used thanks to early filtering. These hardware-accelerated pipelines achieve near-optimal latency and can scale linearly with batch size, outperforming CPU and GPU-based approaches in both latency-bounded throughput and resource efficiency. 9 | 10 | 11 | ### 12 | Problem: The paper addresses the challenge of efficiently scheduling tasks in large-scale data analytics frameworks. These frameworks are trending towards shorter task durations and greater parallelism to achieve low latency. The problem is how to implement a task scheduler capable of managing millions of scheduling decisions per second while ensuring millisecond-level latency and high availability in distributed systems. 13 | 14 | Observations: 1. The demand for low latency interactive data processing has led to the creation of frameworks that distribute work across thousands of machines. 2. Running sub-second tasks requires a scheduler capable of much higher throughput than existing centralized solutions. 3. Failure of a centralized scheduler involves complex recovery processes because large amounts of state need to be replicated or recovered quickly. 4. Probing multiple schedulers can result in conflicting actions due to delays and race conditions. 5. The ""power of two choices"" load balancing technique used naively doesn't perform well for parallel jobs due to a focus on individual task wait times. 6. Centralized schedulers struggle to handle the needs of decentralized, highly parallel workloads due to inherent throughput and availability limitations. 15 | 16 | Solution: 1. [Deferring] Late binding is implemented to delay the assignment of tasks to worker machines until the machines are ready to execute the tasks. This approach minimizes median job response time by avoiding premature task placement and mitigating the impact of queue length mispredictions on scheduling decisions. 2. [Relaxation] The system leverages a variant of the power of two choices load balancing technique, wherein multiple servers are probed randomly, and tasks are assigned to the server with fewer queued tasks. This strategic adaptation allows tasks to be distributed efficiently without comprehensive system state awareness. 17 | 18 | 19 | ### 20 | Problem: Constructing services that provide both low latency and high throughput is challenging. Traditional thread management operated by operating systems often results in resource underutilization or suboptimal performance due to lack of core awareness, leading to difficulties in adjusting thread parallelism according to available core resources. 21 | 22 | Observations: 1. Traditional threading models, such as those used by the operating system, provide no visibility into the physical cores allocated to applications, which prevents applications from efficiently managing parallelism. 2. Techniques for achieving low latency, like reserving cores and using polling instead of interrupts, can result in resource wastage. 3. Combining low-latency and high-throughput services is challenging, often because techniques designed to minimize latency result in underutilized resources, especially in multi-level services with nested requests. 4. 
Static assignment of threads to cores often results in inefficient load-balancing and performance degradation during periods of low load or high load imbalance. 5. Background tasks, like garbage collection, can interfere with request servicing unless additional resources are reserved. 6. Ideally, applications should be able to match their workload with available cores dynamically, but traditional systems do not allow this. 23 | 24 | Solution: 1. [Caching] Each thread context is bound to a core, and recently-used thread contexts are reused when creating new threads. This minimizes cache misses during thread creation, as thread context data (stacks and metadata) are likely to reside in the relevant core’s cache, enabling the rapid creation of user threads that are suitable for extremely short workloads. 2. [Contextualization] The user space thread implementation allows applications to communicate their resource (core) needs to the scheduler at runtime, and to compute requirements dynamically based on their workload. Each application selects a core policy to decide, with live performance statistics, how many cores it needs and how threads are placed across those cores. In Arachne, instead of relying on ready queues for scheduling, each dispatcher repeatedly scans all active user thread contexts on its associated core and chooses one that is runnable. This scan is efficient because there are typically only a few contexts per core, and the inevitable cache miss for waking a runnable thread can be overlapped with the scanning process, amortizing the cost. This approach enables applications to leverage runtime knowledge for optimal placement and scheduling decisions, achieving both low latency and high throughput on modern multicore hardware. 25 | 26 | ### 27 | 28 | 29 | -------------------------------------------------------------------------------- /dataset/example_5: -------------------------------------------------------------------------------- 1 | Here are five examples seperated by ### marks: 2 | ### 3 | Problem: The performance of existing CPU- and GPU-based Embedding-based Retrieval (EBR) systems for recommendation systems is suboptimal due to their inherent architectural limitations. A good EBR system should achieve both high throughput and low latency, which is crucial for cost savings and user experience improvement. 4 | Observations: 1. CPU-based EBR systems suffer from low memory bandwidth and limited cores, hindering simultaneous support for desired parallelism paradigms and batch queries. 2. GPU-based EBR systems provide higher memory bandwidth and compute cores, but are not optimized for pipeline parallelism, leading to increased latency from explicit kernel boundaries and suboptimal inter-operator communication. 3. K-selection for GPU-based systems either heavily relies on on-chip memory with computation overheads or uses external memory, exacerbating latency with frequent memory accesses. 4. FPGA has desired properties ideal for EBR: large high-bandwidth memory (HBM), sufficient on-chip memories, and programmable compute elements supporting customized parallelism, making it a viable platform for optimizing EBR performance. 5 | Solution: 1. [Batching] Batch queries are executed together to share the cost of scanning the corpus among all queries in the batch. This means the system only scans the corpus once per batch, and simultaneous compute pipelines process different queries, achieving linear throughput scalability with the batch size while preserving low query latency. 2. 
[Precomputing] FAERY introduces a filter in the pipeline that can early drop the items that cannot make it into the Top-K, instead of waiting until the end of the pipeline. By detecting and discarding these non-competitive scores before K-selection, FAERY reduces the data volume processed in later stages, saving both computation and on-chip resources. 3. [Hardware] FAERY stores the entire item embedding corpus in the FPGA’s high-bandwidth memory (HBM), which provides much higher bandwidth than traditional CPU DRAM or GPU memory. It exploits HBM’s multi-channel parallelism and uses fully pipelined, streaming operator designs on the FPGA fabric for similarity computation and K-selection. The similarity calculation is parallelized to match HBM bandwidth, and a single, efficient K-selection pipeline is used thanks to early filtering. These hardware-accelerated pipelines achieve near-optimal latency and can scale linearly with batch size, outperforming CPU and GPU-based approaches in both latency-bounded throughput and resource efficiency. 6 | ### 7 | Problem: The paper addresses the challenge of efficiently scheduling tasks in large-scale data analytics frameworks. These frameworks are trending towards shorter task durations and greater parallelism to achieve low latency. The problem is how to implement a task scheduler capable of managing millions of scheduling decisions per second while ensuring millisecond-level latency and high availability in distributed systems. 8 | Observations: 1. The demand for low latency interactive data processing has led to the creation of frameworks that distribute work across thousands of machines. 2. Running sub-second tasks requires a scheduler capable of much higher throughput than existing centralized solutions. 3. Failure of a centralized scheduler involves complex recovery processes because large amounts of state need to be replicated or recovered quickly. 4. Probing multiple schedulers can result in conflicting actions due to delays and race conditions. 5. The ""power of two choices"" load balancing technique used naively doesn't perform well for parallel jobs due to a focus on individual task wait times. 6. Centralized schedulers struggle to handle the needs of decentralized, highly parallel workloads due to inherent throughput and availability limitations. 9 | Solution: 1. [Deferring] Late binding is implemented to delay the assignment of tasks to worker machines until the machines are ready to execute the tasks. This approach minimizes median job response time by avoiding premature task placement and mitigating the impact of queue length mispredictions on scheduling decisions. 2. [Relaxation] The system leverages a variant of the power of two choices load balancing technique, wherein multiple servers are probed randomly, and tasks are assigned to the server with fewer queued tasks. This strategic adaptation allows tasks to be distributed efficiently without comprehensive system state awareness. 10 | ### 11 | Problem: Constructing services that provide both low latency and high throughput is challenging. Traditional thread management operated by operating systems often results in resource underutilization or suboptimal performance due to lack of core awareness, leading to difficulties in adjusting thread parallelism according to available core resources. 12 | Observations: 1. 
Traditional threading models, such as those used by the operating system, provide no visibility into the physical cores allocated to applications, which prevents applications from efficiently managing parallelism. 2. Techniques for achieving low latency, like reserving cores and using polling instead of interrupts, can result in resource wastage. 3. Combining low-latency and high-throughput services is challenging, often because techniques designed to minimize latency result in underutilized resources, especially in multi-level services with nested requests. 4. Static assignment of threads to cores often results in inefficient load-balancing and performance degradation during periods of low load or high load imbalance. 5. Background tasks, like garbage collection, can interfere with request servicing unless additional resources are reserved. 6. Ideally, applications should be able to match their workload with available cores dynamically, but traditional systems do not allow this. 13 | Solution: 1. [Caching] Each thread context is bound to a core, and recently-used thread contexts are reused when creating new threads. This minimizes cache misses during thread creation, as thread context data (stacks and metadata) are likely to reside in the relevant core’s cache, enabling the rapid creation of user threads that are suitable for extremely short workloads. 2. [Contextualization] The user space thread implementation allows applications to communicate their resource (core) needs to the scheduler at runtime, and to compute requirements dynamically based on their workload. Each application selects a core policy to decide, with live performance statistics, how many cores it needs and how threads are placed across those cores. In Arachne, instead of relying on ready queues for scheduling, each dispatcher repeatedly scans all active user thread contexts on its associated core and chooses one that is runnable. This scan is efficient because there are typically only a few contexts per core, and the inevitable cache miss for waking a runnable thread can be overlapped with the scanning process, amortizing the cost. This approach enables applications to leverage runtime knowledge for optimal placement and scheduling decisions, achieving both low latency and high throughput on modern multicore hardware. 14 | ### 15 | Problem: Distributed storage systems typically ensure transactional consistency by replicating data with strong consistency guarantees. These systems, while providing strong guarantees, are often expensive and suffer from latency and throughput reductions due to redundant use of distributed transaction and replication protocols. 16 | Observations: 1.Distributed transactional storage systems waste significant performance by doubling efforts in ensuring strong consistency both at the transaction level and the replication level, leading to inefficient use of resources. 2. Replication protocols like Paxos, while reliable, impose high latency due to strict ordering requirements across dispersed geographical regions, ultimately constraining system throughput. 3. Emerging applications require transactional systems that not only provide fault tolerance and strong consistency but also do so with reduced performance overheads, particularly for read-write transactions. 4. Many distributed systems guarantee transaction correctness and replication consistency together, unnecessarily constraining performance. 5. 
It is possible to achieve transactional consistency even if the underlying replication protocol only ensures eventual consistency. 17 | Solution: 1. [Decoupling] The system decouples replication consistency from transactional correctness by using inconsistent replication to replicate data quickly while enforcing strong transactional semantics through a separate coordination layer. 2. [Relaxation] The system relaxes the requirement that all replicas remain immediately consistent, allowing data to be written asynchronously and resolved later, while still ensuring that committed transactions observe a consistent global order. 3. [Deferring] The system defers coordination to commit time by executing transactions optimistically using local or stale data, and only validating and ordering transactions during the commit protocol, reducing the need for early synchronization. 18 | ### 19 | Problem: The paper addresses the performance penalties incurred by current operating systems' kernels due to mitigations against transient execution side-channel attacks. These attacks, such as Meltdown and Spectre, can leak sensitive data through speculative execution, and existing mitigations introduce significant performance overhead. 20 | Observations: 1. Transient execution attacks like Meltdown and Spectre are particularly concerning for OS kernels because they have access to all system memory and can potentially leak data from any process. 2. The performance overheads from mitigations are substantial; some require action at every system call entry/exit, which is not always feasible from a performance standpoint. 3. Current processor designs have embedded some mitigations in hardware, but these still reduce performance. 4. The challenge is to minimize these overheads while still providing adequate security against side-channel attacks. 21 | Solution: 1. [Decoupling] WARD decouples regions of memory that are visible in the Q domain from those requiring full K domain access, separating execution paths into two distinct processes and enabling controlled speculative execution. 22 | ### 23 | 24 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import re 3 | from openai import OpenAI 4 | from openpyxl import load_workbook 5 | 6 | EXAMPLE_FILE = './dataset/example_10' 7 | API_KEY = 'your api key for OpenAI' 8 | MODEL_KEY = 'your fine-tuned model key' 9 | 10 | def run_sysgpt(client: OpenAI, problem_observation: str, model_key: str, temp: float): 11 | completion = client.chat.completions.create( 12 | model= model_key, 13 | temperature = temp, 14 | messages=[ 15 | { 16 | "role": "system", 17 | "content": "You are an expert in Computer Science, especially in Systems area, who explains things specifically and comprehensively. You know the following categories that are common methodologies to improve system performance in a single-line execution, excluding those benefited from parallelism and algorithmic optimizations: 1. Batching: Merge duplicate costs by grouping data or operations; 2. Caching: Memorize computed result and reuse it to avoid redundant computation; 3. Precomputing: Conduct initialization or execution in advance; 4. Deferring: Delay initialization or execution until it is needed or it has better context to make decision; 5. Relaxation: Cut workload size by sacrificing accuracy with approximation; 6. 
Contextualization: Collect additional data at runtime to make better decisions; 7. Hardware: Utilize specific hardware features, e.g., NUMA, NVM, FPGA, SmartNIC, to optimize workload computation; 8. Bypass: Skip existing layer by taking a fast path; 9. Delayering: Merge multiple layers into one to avoid intermediate costs among layers; 10. Decoupling: Split one layer into multiple layers to have finer control. Given problem description with observations, provide a system solution for improving performance to the problem. Explain the solution in detail using the methodologies described above." 18 | }, 19 | { 20 | "role": "user", 21 | "content": "Given problem description with observations, provide **solutions** for improving performance to the problem. Give solutions in bullet points. Each bullet item must have following format:\n\n`1. [Methodology] Expanded description of this optimization technique.`\n\n The **Methodology** must be selected from one of the 10 methodologies." 22 | f"Here is the problem description with observations:\n```\n{problem_observation}\n```\n" 23 | } 24 | ] 25 | ) 26 | 27 | # print(completion.choices[0].message.content) 28 | return completion.choices[0].message.content 29 | 30 | def run_few_shot(client: OpenAI, problem_observation: str, examples: str, temp: float): 31 | completion = client.chat.completions.create( 32 | model="gpt-4o-2024-08-06", 33 | temperature = temp, 34 | messages=[ 35 | { 36 | "role": "system", 37 | "content": "You are an expert in Computer Science, especially in Systems area, who explains things specifically and comprehensively. You know the following categories that are common methodologies to improve system performance in a single-line execution, excluding those benefited from parallelism and algorithmic optimizations: 1. Batching: Merge duplicate costs by grouping data or operations; 2. Caching: Memorize computed result and reuse it to avoid redundant computation; 3. Precomputing: Conduct initialization or execution in advance; 4. Deferring: Delay initialization or execution until it is needed or it has better context to make decision; 5. Relaxation: Cut workload size by sacrificing accuracy with approximation; 6. Contextualization: Collect additional data at runtime to make better decisions; 7. Hardware: Utilize specific hardware features, e.g., NUMA, NVM, FPGA, SmartNIC, to optimize workload computation; 8. Bypass: Skip existing layer by taking a fast path; 9. Delayering: Merge multiple layers into one to avoid intermediate costs among layers; 10. Decoupling: Split one layer into multiple layers to have finer control. Given problem description with observations, provide a system solution for improving performance to the problem. Explain the solution in detail using the methodologies described above." 38 | }, 39 | { 40 | "role": "user", 41 | "content": "Given problem description with observations, provide **solutions** for improving performance to the problem. Give solutions in bullet points. Each bullet item must have following format:\n\n`1. [Methodology] Expanded description of this optimization technique.`\n\n The **Methodology** must be selected from one of the 10 methodologies." 
42 | f"{examples}" 43 | f"Here is the problem description with observations:\n```\n{problem_observation}\n```\n" 44 | } 45 | ] 46 | ) 47 | # print(completion.choices[0].message.content) 48 | return completion.choices[0].message.content 49 | 50 | 51 | def extract_and_convert_methodologies(text): 52 | allowed = [ 53 | "Batching", "Caching", "Precomputing", "Deferring", "Relaxation", 54 | "Contextualization", "Hardware", "Bypass", "Delayering", "Decoupling" 55 | ] 56 | 57 | normalization_map = { 58 | "bypassing": "Bypass", 59 | "bypass": "Bypass", 60 | "deferring": "Deferring", 61 | "decoupling": "Decoupling", 62 | "delayering": "Delayering", 63 | "caching": "Caching", 64 | "batching": "Batching", 65 | "precomputing": "Precomputing", 66 | "contextualization": "Contextualization", 67 | "relaxation": "Relaxation", 68 | "hardware": "Hardware" 69 | } 70 | 71 | # Extract [Methodology] 72 | tags = re.findall(r'\[(.*?)\]', text) 73 | 74 | # Check validity 75 | normalized_tags = [] 76 | for tag in tags: 77 | key = tag.lower() 78 | if key in normalization_map: 79 | normalized_tags.append(normalization_map[key]) 80 | else: 81 | return False 82 | 83 | final_vector = [0] * len(allowed) 84 | 85 | 86 | for tag in normalized_tags: 87 | index = allowed.index(tag) 88 | final_vector[index] = 1 89 | 90 | return final_vector 91 | 92 | 93 | def calculate_metrics(ground_truth, prediction): 94 | assert len(ground_truth) == len(prediction), "Lists must be the same length" 95 | 96 | true_positive = sum((g == 1 and p == 1) for g, p in zip(ground_truth, prediction)) 97 | false_positive = sum((g == 0 and p == 1) for g, p in zip(ground_truth, prediction)) 98 | false_negative = sum((g == 1 and p == 0) for g, p in zip(ground_truth, prediction)) 99 | 100 | precision = true_positive / (true_positive + false_positive) if (true_positive + false_positive) else 0 101 | recall = true_positive / (true_positive + false_negative) if (true_positive + false_negative) else 0 102 | f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0 103 | 104 | return precision, recall, f1_score 105 | 106 | 107 | def run_test(client: OpenAI, temp: float, trial: int): 108 | example_file = open(EXAMPLE_FILE, 'r').read() 109 | 110 | # Open Excel dataset 111 | wb = load_workbook(filename='./dataset/dataset.xlsx') 112 | ws = wb['testset'] 113 | 114 | methodology = [ 115 | "Batching", "Caching", "Precomputing", "Deferring", "Relaxation", 116 | "Contextualization", "Hardware", "Bypass", "Delayering", "Decoupling" 117 | ] 118 | 119 | n = 0 120 | 121 | sysgpt_total_precision = 0 122 | sysgpt_total_recall = 0 123 | sysgpt_total_f1 = 0 124 | 125 | few_shot_total_precision = 0 126 | few_shot_total_recall = 0 127 | few_shot_total_f1 = 0 128 | few_shot_method_num = 0 129 | 130 | for row in range(2, 98): 131 | is_scope = ws['G'+str(row)].value 132 | 133 | if is_scope == 'x': 134 | continue 135 | elif is_scope == 'o': 136 | print(f'=== {row} ===') 137 | 138 | problem = ws['D'+str(row)].value 139 | observation = ws['E'+str(row)].value 140 | 141 | problem_observation = problem + '\n' + observation 142 | 143 | # Get label vector 144 | ground_truth = extract_and_convert_methodologies(ws['F'+str(row)].value) 145 | print(ground_truth) 146 | 147 | # SysGPT =============== 148 | 149 | best_f1 = -1 150 | best_precision = 0 151 | best_recall = 0 152 | 153 | for i in range(0, trial): 154 | # Get sysgpt predict vector 155 | while True: 156 | sysgpt_answer = run_sysgpt(client, problem_observation, MODEL_KEY, temp) 157 | sysgpt_prediction = 
extract_and_convert_methodologies(sysgpt_answer) 158 | if isinstance(sysgpt_prediction, bool): 159 | print(sysgpt_answer) 160 | print("Extract Error!! Retry...") 161 | else: 162 | break 163 | 164 | print("> Try ", i+1) 165 | print(sysgpt_prediction) 166 | 167 | precision, recall, f1 = calculate_metrics(ground_truth, sysgpt_prediction) 168 | if f1 > best_f1: 169 | best_f1 = f1 170 | best_precision = precision 171 | best_recall = recall 172 | 173 | 174 | ## update sysgpt precision 175 | sysgpt_total_precision += best_precision 176 | sysgpt_total_recall += best_recall 177 | sysgpt_total_f1 += best_f1 178 | n += 1 179 | 180 | # Baseline =============== 181 | 182 | best_f1 = -1 183 | best_precision = 0 184 | best_recall = 0 185 | best_num = 0 186 | 187 | # Get 3-shot predict vector 188 | for i in range(0, trial): 189 | while True: 190 | few_shot = run_few_shot(client, problem_observation, example_file, temp) 191 | few_shot_prediction = extract_and_convert_methodologies(few_shot) 192 | if isinstance(few_shot_prediction, bool): 193 | print(few_shot) 194 | print("Extract Error!! Retry...") 195 | else: 196 | break 197 | 198 | print("> Try ", i + 1) 199 | print(few_shot_prediction) 200 | 201 | precision, recall, f1 = calculate_metrics(ground_truth, few_shot_prediction) 202 | if f1 > best_f1: 203 | best_f1 = f1 204 | best_precision = precision 205 | best_recall = recall 206 | best_num = sum(few_shot_prediction) 207 | 208 | 209 | # update 3-shot precision 210 | few_shot_total_precision += best_precision 211 | few_shot_total_recall += best_recall 212 | few_shot_total_f1 += best_f1 213 | few_shot_method_num += best_num 214 | 215 | print("sysgpt f1: ", sysgpt_total_f1 / n) 216 | print("fewshot f1: ", few_shot_total_f1 / n) 217 | print("fewshot # : ", few_shot_method_num / n) 218 | 219 | 220 | print("=== sysgpt ===") 221 | print("avg_precision", sysgpt_total_precision / n) 222 | print("avg_recall", sysgpt_total_recall / n) 223 | print("avg_f1", sysgpt_total_f1 / n) 224 | 225 | print("=== few-shot ===") 226 | print("avg_precision", few_shot_total_precision / n) 227 | print("avg_recall", few_shot_total_recall / n) 228 | print("avg_f1", few_shot_total_f1 / n) 229 | 230 | 231 | 232 | if __name__ == '__main__': 233 | client = OpenAI(api_key=API_KEY) 234 | 235 | TEMP = float(sys.argv[1]) 236 | TRIAL = int(sys.argv[2]) 237 | 238 | print("Test info:") 239 | print(" - TEMP : ", TEMP) 240 | print(" - Baseline : ", EXAMPLE_FILE) 241 | print(" - # trials : ", TRIAL) 242 | print("\n\n") 243 | 244 | 245 | run_test(client, TEMP, TRIAL) 246 | -------------------------------------------------------------------------------- /dataset/example_10: -------------------------------------------------------------------------------- 1 | Here are ten examples seperated by ### marks: 2 | ### 3 | Problem: The performance of existing CPU- and GPU-based Embedding-based Retrieval (EBR) systems for recommendation systems is suboptimal due to their inherent architectural limitations. A good EBR system should achieve both high throughput and low latency, which is crucial for cost savings and user experience improvement. 4 | Observations: 1. CPU-based EBR systems suffer from low memory bandwidth and limited cores, hindering simultaneous support for desired parallelism paradigms and batch queries. 2. GPU-based EBR systems provide higher memory bandwidth and compute cores, but are not optimized for pipeline parallelism, leading to increased latency from explicit kernel boundaries and suboptimal inter-operator communication. 3. 
K-selection for GPU-based systems either heavily relies on on-chip memory with computation overheads or uses external memory, exacerbating latency with frequent memory accesses. 4. FPGA has desired properties ideal for EBR: large high-bandwidth memory (HBM), sufficient on-chip memories, and programmable compute elements supporting customized parallelism, making it a viable platform for optimizing EBR performance. 5 | Solution: 1. [Batching] Batch queries are executed together to share the cost of scanning the corpus among all queries in the batch. This means the system only scans the corpus once per batch, and simultaneous compute pipelines process different queries, achieving linear throughput scalability with the batch size while preserving low query latency. 2. [Precomputing] FAERY introduces a filter in the pipeline that can early drop the items that cannot make it into the Top-K, instead of waiting until the end of the pipeline. By detecting and discarding these non-competitive scores before K-selection, FAERY reduces the data volume processed in later stages, saving both computation and on-chip resources. 3. [Hardware] FAERY stores the entire item embedding corpus in the FPGA’s high-bandwidth memory (HBM), which provides much higher bandwidth than traditional CPU DRAM or GPU memory. It exploits HBM’s multi-channel parallelism and uses fully pipelined, streaming operator designs on the FPGA fabric for similarity computation and K-selection. The similarity calculation is parallelized to match HBM bandwidth, and a single, efficient K-selection pipeline is used thanks to early filtering. These hardware-accelerated pipelines achieve near-optimal latency and can scale linearly with batch size, outperforming CPU and GPU-based approaches in both latency-bounded throughput and resource efficiency. 6 | ### 7 | Problem: The paper addresses the challenge of efficiently scheduling tasks in large-scale data analytics frameworks. These frameworks are trending towards shorter task durations and greater parallelism to achieve low latency. The problem is how to implement a task scheduler capable of managing millions of scheduling decisions per second while ensuring millisecond-level latency and high availability in distributed systems. 8 | Observations: 1. The demand for low latency interactive data processing has led to the creation of frameworks that distribute work across thousands of machines. 2. Running sub-second tasks requires a scheduler capable of much higher throughput than existing centralized solutions. 3. Failure of a centralized scheduler involves complex recovery processes because large amounts of state need to be replicated or recovered quickly. 4. Probing multiple schedulers can result in conflicting actions due to delays and race conditions. 5. The ""power of two choices"" load balancing technique used naively doesn't perform well for parallel jobs due to a focus on individual task wait times. 6. Centralized schedulers struggle to handle the needs of decentralized, highly parallel workloads due to inherent throughput and availability limitations. 9 | Solution: 1. [Deferring] Late binding is implemented to delay the assignment of tasks to worker machines until the machines are ready to execute the tasks. This approach minimizes median job response time by avoiding premature task placement and mitigating the impact of queue length mispredictions on scheduling decisions. 2. 
[Relaxation] The system leverages a variant of the power of two choices load balancing technique, wherein multiple servers are probed randomly, and tasks are assigned to the server with fewer queued tasks. This strategic adaptation allows tasks to be distributed efficiently without comprehensive system state awareness. 10 | ### 11 | Problem: Constructing services that provide both low latency and high throughput is challenging. Traditional thread management operated by operating systems often results in resource underutilization or suboptimal performance due to lack of core awareness, leading to difficulties in adjusting thread parallelism according to available core resources. 12 | Observations: 1. Traditional threading models, such as those used by the operating system, provide no visibility into the physical cores allocated to applications, which prevents applications from efficiently managing parallelism. 2. Techniques for achieving low latency, like reserving cores and using polling instead of interrupts, can result in resource wastage. 3. Combining low-latency and high-throughput services is challenging, often because techniques designed to minimize latency result in underutilized resources, especially in multi-level services with nested requests. 4. Static assignment of threads to cores often results in inefficient load-balancing and performance degradation during periods of low load or high load imbalance. 5. Background tasks, like garbage collection, can interfere with request servicing unless additional resources are reserved. 6. Ideally, applications should be able to match their workload with available cores dynamically, but traditional systems do not allow this. 13 | Solution: 1. [Caching] Each thread context is bound to a core, and recently-used thread contexts are reused when creating new threads. This minimizes cache misses during thread creation, as thread context data (stacks and metadata) are likely to reside in the relevant core’s cache, enabling the rapid creation of user threads that are suitable for extremely short workloads. 2. [Contextualization] The user space thread implementation allows applications to communicate their resource (core) needs to the scheduler at runtime, and to compute requirements dynamically based on their workload. Each application selects a core policy to decide, with live performance statistics, how many cores it needs and how threads are placed across those cores. In Arachne, instead of relying on ready queues for scheduling, each dispatcher repeatedly scans all active user thread contexts on its associated core and chooses one that is runnable. This scan is efficient because there are typically only a few contexts per core, and the inevitable cache miss for waking a runnable thread can be overlapped with the scanning process, amortizing the cost. This approach enables applications to leverage runtime knowledge for optimal placement and scheduling decisions, achieving both low latency and high throughput on modern multicore hardware. 14 | ### 15 | Problem: Distributed storage systems typically ensure transactional consistency by replicating data with strong consistency guarantees. These systems, while providing strong guarantees, are often expensive and suffer from latency and throughput reductions due to redundant use of distributed transaction and replication protocols. 
16 | Observations: 1.Distributed transactional storage systems waste significant performance by doubling efforts in ensuring strong consistency both at the transaction level and the replication level, leading to inefficient use of resources. 2. Replication protocols like Paxos, while reliable, impose high latency due to strict ordering requirements across dispersed geographical regions, ultimately constraining system throughput. 3. Emerging applications require transactional systems that not only provide fault tolerance and strong consistency but also do so with reduced performance overheads, particularly for read-write transactions. 4. Many distributed systems guarantee transaction correctness and replication consistency together, unnecessarily constraining performance. 5. It is possible to achieve transactional consistency even if the underlying replication protocol only ensures eventual consistency. 17 | Solution: 1. [Decoupling] The system decouples replication consistency from transactional correctness by using inconsistent replication to replicate data quickly while enforcing strong transactional semantics through a separate coordination layer. 2. [Relaxation] The system relaxes the requirement that all replicas remain immediately consistent, allowing data to be written asynchronously and resolved later, while still ensuring that committed transactions observe a consistent global order. 3. [Deferring] The system defers coordination to commit time by executing transactions optimistically using local or stale data, and only validating and ordering transactions during the commit protocol, reducing the need for early synchronization. 18 | ### 19 | Problem: The paper addresses the performance penalties incurred by current operating systems' kernels due to mitigations against transient execution side-channel attacks. These attacks, such as Meltdown and Spectre, can leak sensitive data through speculative execution, and existing mitigations introduce significant performance overhead. 20 | Observations: 1. Transient execution attacks like Meltdown and Spectre are particularly concerning for OS kernels because they have access to all system memory and can potentially leak data from any process. 2. The performance overheads from mitigations are substantial; some require action at every system call entry/exit, which is not always feasible from a performance standpoint. 3. Current processor designs have embedded some mitigations in hardware, but these still reduce performance. 4. The challenge is to minimize these overheads while still providing adequate security against side-channel attacks. 21 | Solution: 1. [Decoupling] WARD decouples regions of memory that are visible in the Q domain from those requiring full K domain access, separating execution paths into two distinct processes and enabling controlled speculative execution. 22 | ### 23 | Problem: The paper addresses the challenge of integrating GPUs into network servers or distributed systems. Specifically, it highlights the need for developing software abstractions and networking capabilities for GPUs to simplify their usage and enhance performance within these systems. 24 | Observations: 1. GPUs excel in computational tasks but lack software abstractions for efficient I/O management, particularly in handling network data. 2. Current GPU programming models require complex CPU-side code for network operations, involving intricate coordination of CPU, GPU, and network interface card (NIC). 3. 
The bottleneck in leveraging GPUs in network systems often lies in the lack of support for high-level I/O abstractions akin to sockets available in CPU environments. 4. Challenges in designing efficient network abstractions for GPUs include the issues of massive parallelism, slow access to CPU memory, and low single-thread performance. 5. Utilizing recent NIC hardware capabilities, such as Remote Direct Memory Access (RDMA), can potentially enhance GPU network performance by minimizing CPU-GPU communication overheads. 25 | Solution: 1. [Batching] GPUnet accommodates parallelism at the API level by providing coalesced calls invoked by multiple GPU threads at the same point in data-parallel code. Multiple calls are coalesced into a single request to reduce the processing overhead of the networking stack. 2. [Hardware] - The goal of GPUnet is to foster GPU adoption in data center applications. GPUnet uses recent hardware support for network transmission directly into/from GPU memory to minimize slow accesses from the GPU to system memory. GPUnet minimizes control-intensive sequential execution on performance-critical paths by offloading message dispatching to the NIC via RDMA. 3. [Bypass] GPUnet enables individual threads in one GPU to communicate with threads in other GPUs or CPUs via standard socket interfaces. Native GPU networking cuts the CPU out of GPU-NIC interactions, simplifying code and increasing performance. 4. [Decoupling] It decouples the GPU API calls and the CPU I/O transfer operations, allowing the CPU to handle GPU I/O requests asynchronously." 26 | ### 27 | Problem: The paper addresses the challenge of scheduling heterogeneous directed acyclic graphs (DAGs) in data-parallel clusters. These DAGs involve tasks with complex dependency structures and diverse resource requirements, making it difficult to schedule them efficiently due to resource fragmentation and the varying runtime of tasks. Current heuristic-based schedulers often underperform when dealing with such heterogeneous DAGs, leading to suboptimal job completion times and resource utilization. 28 | Observations: 1. Existing schedulers often focus on either task dependencies or resource demands but struggle to optimize both simultaneously. 2. Critical path schedulers and multi-resource packers often fail due to ignoring non-critical tasks or task dependencies, respectively. 3. A significant gap exists between actual job completion times in current production clusters and the potential optimally achievable times, highlighting room for improvement. 4. The variability in task durations and resource usage complicates scheduling, often leading to idle resources even when tasks are pending. 5. Optimal scheduling for heterogeneous DAGs is computationally infeasible in practice. 29 | Solution: 1. [Batching] The system maintains bundles of tasks that are candidates for scheduling on a machine, amortizing the cost of examining pending tasks by evaluating them together. It categorizes tasks into types such as troublesome tasks, parents, children, and siblings based on their dependency relationships, and batches them accordingly to improve scheduling efficiency and packing quality. 2. [Contextualization] The system dynamically tracks task dependencies and resource availability during scheduling, prioritizing tasks that are ready to run and on the critical path of the job’s dependency graph. 
By adapting scheduling decisions to both runtime system state and task readiness, the system reduces critical path latency and improves overall job completion times." 30 | ### 31 | Problem: Modern operating systems rely heavily on directory caches to accelerate file system operations such as open() and stat(). However, even with a warm cache, these operations can consume a significant portion of execution time—up to 54% in some common utilities. 32 | Observations: 1. Current directory cache designs in Linux and similar systems often require multiple steps for each directory component during lookups, leading to higher latencies. 2. Even in a warm cache state, certain system calls, like `stat` and `open`, consume a significant portion of execution time due to path lookup. 3. Contemporary optimizations have plateaued in their ability to reduce hit latency due to inherent POSIX path permission semantics. 4. Path permission check processes are linear in complexity relative to the components of the path, which limits performance improvements. 5. Repeated permission checks during directory traversal introduce substantial redundancy, as the same access control logic is frequently re-evaluated for identical paths. 6. Developers face backward compatibility challenges when attempting to enhance data structures integral to these processes due to their foundational role in supporting various Linux features. 33 | Solution: 1. [Caching] The paper improves performance by separating indexing of paths from checking parent permissions and caching prefix check results. It leverages metadata already present in the kernel to decrease cache miss rates through tracking directory completeness and preserving negative dentries after a file is removed for future reuse. 2. [Precomputing] Full paths are indexed in the directory cache, which shifts the average-case lookup from linear to constant in the number of path components, thus significantly reducing lookup latency. 3. [Relaxation] The paper uses 240-bit signatures instead of full paths to minimize key comparison costs. While there is a risk due to potential signature collisions, it is managed within acceptable thresholds through careful design. 4. [Decoupling] The indexing of paths is separated from permission checking to allow caching of prefix check results, which allows for improved performance without disrupting necessary permission enforcement. 34 | ### 35 | Problem: Modern data-intensive applications, such as search engines and social networks, require processing a vast number of microsecond-scale tasks with stringent tail-latency service-level objectives (SLOs). Traditional operating systems and scheduling mechanisms struggle to meet these demands due to inefficient task distribution and load balancing across cores, leading to increased tail latencies and underutilized resources. 36 | Observations: 1. Traditional operating system schedulers, while suitable for multi-core resource management, introduce high overheads for microsecond-scale tasks, adversely affecting their tail latency. 2. Kernel-bypass techniques enhance throughput but face inefficiencies through non-work conserving scheduling, which can leave cores idle even when there are pending tasks. 3. Head-of-line blocking in dataplanes exacerbates latency issues, suggesting that current shared-nothing architectures lack efficiency in handling microsecond-scale tasks. 4. 
Single-queue systems with work-conserving schedulers are more effective than multiple queue systems because they better manage temporary load imbalances across cores. 37 | Solution: 1. [Precomputing] The system mitigates head-of-line blocking by enabling idle cores to proactively pull pending events from a global queue, effectively reshaping the order of task execution based on core availability. This dynamic task reshuffling improves parallelism and reduces tail latency. 2. [Decoupling] The system cleanly separates functionality into a three-layer architecture: a low-level networking layer for packet handling, a shuffle layer for load balancing and queue management, and a top execution layer for running application tasks. This separation enables independent optimization of each stage in the pipeline. 38 | ### 39 | Problem: The paper addresses the inefficiency of data movement between compute and storage in modern key-value stores that use kernel-bypass networking, which currently require applications to frequently transfer data across the network. This is particularly inefficient for operations that involve complex data structures or require aggregation over large datasets, leading to client-side stalls and increased latency. 40 | Observations: 1. Current in-memory key-value stores offer high throughput and low latency by maintaining simplicity in data operations, restricting them mainly to lookups and updates. 2. Despite fast hardware and networking, the simplicity of these interfaces leads to inefficiency, as applications are forced to manage data movement and aggregations client-side, resulting in high network and latency costs. 3. Traditional methods of extending such systems with additional code incur significant overhead due to isolation costs, which can undermine the latency and throughput advantages offered by these systems. 4. Existing key-value stores are not well-equipped to efficiently handle complex data models and operations without custom extensions. 5. Multi-tenancy further complicates the situation, as service providers must safely isolate tenant-specific extensions while maintaining low-latency performance. 41 | Solution: 1. [Hardware] No specific hardware-based optimization was employed distinct from the general use of kernel-bypass networking to reduce latency. 2. [Customization] Custom tenant-provided extensions written in Rust enable high performance and safe customization, supporting complex data types and operations beyond the simple get/put supported by standard key-value interfaces. 42 | ### 43 | 44 | --------------------------------------------------------------------------------
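For reference, `eval.py` reads the test sheet (`wb['testset']`, rows 2–97) and uses column G as an in-scope flag (`'o'`/`'x'`), column D as the problem statement, column E as the observations, and column F as the labeled ground truth, whose `[Methodology]` tags define a 10-dimensional binary label vector. The snippet below is a condensed, self-contained sketch of that scoring logic, mirroring `extract_and_convert_methodologies` and `calculate_metrics` in `eval.py`; the helper names and the example answers here are illustrative only and are not part of the repository.

```
import re

# The 10 methodology labels, in the same order as eval.py.
METHODOLOGIES = [
    "Batching", "Caching", "Precomputing", "Deferring", "Relaxation",
    "Contextualization", "Hardware", "Bypass", "Delayering", "Decoupling",
]

def to_vector(answer: str):
    """Map the '[Methodology]' tags in an answer to a 10-dim binary vector."""
    vec = [0] * len(METHODOLOGIES)
    for tag in re.findall(r'\[(.*?)\]', answer):
        name = tag.strip().capitalize()
        if name == "Bypassing":        # eval.py normalizes this variant
            name = "Bypass"
        if name not in METHODOLOGIES:  # malformed answer; eval.py retries the query
            return None
        vec[METHODOLOGIES.index(name)] = 1
    return vec

def prf1(truth, pred):
    """Per-paper precision / recall / F1 over two binary label vectors."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    precision = tp / (tp + fp) if tp + fp else 0
    recall = tp / (tp + fn) if tp + fn else 0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0
    return precision, recall, f1

# Hypothetical ground-truth label text and model answer.
truth = to_vector("1. [Batching] ... 2. [Hardware] ...")
pred  = to_vector("1. [Batching] ... 2. [Caching] ...")
print(prf1(truth, pred))  # -> (0.5, 0.5, 0.5)
```

`run_test.sh` then sweeps the sampling temperature and the number of trials per paper; for each paper, `eval.py` keeps the best-of-N F1 (with its precision and recall) across trials and averages the results over the in-scope test set.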