├── cvpr25_orals.xlsx
├── requirements.txt
├── README.md
├── pull_gpu_info.py
├── post_process
    ├── deepseek_postprocess.csv
    ├── GPTo3_postprocess.csv
    └── GPT4o_postprocess.csv
└── cvpr25_accelerator_sentences_arxiv.csv
/cvpr25_orals.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kxhit/cvpr25_oral_gpu_info/HEAD/cvpr25_orals.xlsx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas
2 | openpyxl
3 | requests
4 | PyPDF2
5 | arxiv
6 | tqdm
7 | beautifulsoup4
8 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CVPR2025 Oral Papers GPU/Accelerator Extraction
2 | 
3 | Just wanna see what types of GPUs/TPUs are used in CVPR 2025 oral papers, and how many. Fun vibe coding with LLMs.
4 | 
5 | Workflow:
6 | 
7 | 1. **Scraping** and downloading the arXiv papers based on the list of CVPR 2025 oral papers
8 | 2. **Extracting** the exact sentence(s) mentioning GPUs/TPUs
9 | 3. **Post-processing** the extracted CSV with LLMs (GPT4o, GPTo3, and DeepSeek) to get GPU numbers/models/training times
10 | 
11 | ---
12 | 
13 | ## Repository structure
14 | 
15 | ```
16 | .
17 | ├── pull_gpu_info.py                        # Initial scraper + sentence extractor
18 | ├── cvpr25_orals.xlsx                       # Input Excel with “Oral Sessions” sheet
19 | ├── papers/                                 # Folder where all PDFs get downloaded
20 | ├── cvpr25_accelerator_sentences_arxiv.csv  # Raw sentences extracted from PDFs
21 | ├── post_process/                           # Post-processed results by LLMs
22 | └── README.md                               # You are here!
23 | ```
24 | 
25 | ---
26 | 
27 | ## Requirements
28 | 
29 | - Python 3.8+
30 | - `pip install -r requirements.txt`
31 | 
32 | ---
33 | 
34 | ## Installation
35 | 
36 | ```bash
37 | git clone https://github.com/kxhit/cvpr25-gpu-extractor.git
38 | cd cvpr25-gpu-extractor
39 | pip install -r requirements.txt
40 | ```
41 | 
42 | ---
43 | 
44 | ## Usage
45 | 
46 | ### 1. Scrape & extract raw sentences
47 | 
48 | ```bash
49 | python pull_gpu_info.py
50 | ```
51 | 
52 | - Reads `cvpr25_orals.xlsx`
53 | - Downloads PDFs into `papers/` (skips existing files)
54 | - Extracts the **first sentence** containing “GPU(s)” or “TPU(s)” from each PDF
55 | - Writes `cvpr25_accelerator_sentences_arxiv.csv`
56 | 
57 | ### 2. LLM & DeepSeek-based refinement
58 | Prompt LLMs with `cvpr25_accelerator_sentences_arxiv.csv` and the prompt
59 | "can you convert the sentence describing the accelerator into the number, model of the GPUs are used, and maybe the time they trained on. and give me the updated form"
60 | (a scripted alternative is sketched at the end of this README).
61 | 
62 | - GPT4o: 20+ successful extractions.
63 | - GPTo3: slightly better than GPT4o.
64 | - DeepSeek: Currently the best.
65 | - Gemini-2.0/2.5: Does not support .csv or .xlsx files, so I gave up.
66 | 
67 | ---
68 | 
69 | ## Results
70 | 
71 | - **Arxiv Paper Matching**: I went through many iterative prompts with GPT4o to get the scripts. Initially they only matched 30 papers; with top-k matching they find almost all papers, but not every match is correct, so a confidence score (`MatchScore`) is kept in the output CSV.
72 | - **GPU Info Extraction**: Initially I wrote scripts to extract the GPU info (model, number, time) directly from the matched sentences, but they were not very accurate, e.g., the 4090 or 100 in a model name was miscounted as the GPU number. I then used the LLMs to extract the info from the saved sentences instead.
73 | - **LLM Extraction**: surprisingly, even the LLMs were somehow not able to extract the info for every row.
74 | 
75 | 
76 | ---
77 | 
78 | ### Better way to do this?
79 | Any better suggestions to leverage LLMs for this task?
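One direction: skip the chat UI and file uploads entirely and call an LLM API over the CSV, one sentence at a time, asking for JSON. The sketch below is illustrative only and not part of this repo: it assumes the OpenAI Python SDK (`pip install openai`), an `OPENAI_API_KEY` in the environment, and a JSON-mode-capable model; the script name, prompt wording, and output path are made up for the example. DeepSeek's OpenAI-compatible endpoint should work with the same loop by changing `base_url` and `model`.

```python
# post_process_llm.py -- hypothetical helper, not part of the original repo.
import json

import pandas as pd
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract accelerator info from this sentence taken from a paper. "
    'Reply with JSON only, using keys "gpu_count", "gpu_model", "training_time"; '
    "use null for anything the sentence does not state.\n\nSentence: {sentence}"
)

def extract_info(sentence: str) -> dict:
    """Ask the model to turn one accelerator sentence into structured fields."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any JSON-mode-capable chat model
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

df = pd.read_csv("cvpr25_accelerator_sentences_arxiv.csv")
records = []
for sentence in df["AcceleratorSentence"].fillna(""):
    if not sentence.strip():
        # keep rows without an extracted sentence, just leave the fields empty
        records.append({"gpu_count": None, "gpu_model": None, "training_time": None})
        continue
    records.append(extract_info(sentence))

out = pd.concat([df, pd.DataFrame(records)], axis=1)
out.to_csv("post_process/llm_postprocess.csv", index=False)
```

Going row by row also sidesteps the .csv/.xlsx upload limitation that blocked Gemini, and makes the post-processing reproducible instead of a one-off chat session.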
--------------------------------------------------------------------------------
/pull_gpu_info.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import csv
4 | import difflib
5 | import requests
6 | import arxiv
7 | import pandas as pd
8 | from tqdm import tqdm
9 | from PyPDF2 import PdfReader
10 | 
11 | # --- Config ---
12 | excel_path = 'cvpr25_orals.xlsx'
13 | output_csv = 'cvpr25_accelerator_sentences_arxiv.csv'
14 | output_dir = 'papers'
15 | title_sim_threshold = 0.0  # no threshold, take best match
16 | # ----------------
17 | 
18 | os.makedirs(output_dir, exist_ok=True)
19 | 
20 | # Load the list of oral papers
21 | df = pd.read_excel(excel_path, sheet_name='Oral Sessions', engine='openpyxl')
22 | papers = list(zip(df['Paper #'], df['Title']))
23 | 
24 | def fetch_arxiv_paper(title):
25 |     """Query arXiv and return (url, score) for the best match."""
26 |     search = arxiv.Search(
27 |         query=f'ti:"{title}"',
28 |         max_results=10,
29 |         sort_by=arxiv.SortCriterion.Relevance
30 |     )
31 |     candidates = list(search.results())
32 |     # fallback: also search broad if none
33 |     if not candidates:
34 |         search = arxiv.Search(query=title, max_results=10)
35 |         candidates = list(search.results())
36 |     # compute similarity
37 |     best = None
38 |     best_score = 0.0
39 |     for res in candidates:
40 |         score = difflib.SequenceMatcher(None, title.lower(), res.title.lower()).ratio()
41 |         if score > best_score:
42 |             best_score, best = score, res
43 |     if best:
44 |         return best.pdf_url, best_score
45 |     return None, 0.0
46 | 
47 | def download_pdf(url, out_path):
48 |     resp = requests.get(url, stream=True, timeout=30)
49 |     resp.raise_for_status()
50 |     with open(out_path, 'wb') as f:
51 |         for chunk in resp.iter_content(1024):
52 |             f.write(chunk)
53 | 
54 | def extract_accelerator_sentence(pdf_path):
55 |     """Return the first sentence mentioning GPU or TPU, or ''."""
56 |     reader = PdfReader(pdf_path)
57 |     full_text = []
58 |     for page in reader.pages:
59 |         text = page.extract_text() or ""
60 |         full_text.append(text)
61 |     text = "\n".join(full_text)
62 |     # naive sentence split
63 |     sentences = re.split(r'(?<=[.!?])\s+', text.replace('\n', ' '))
64 |     for s in sentences:
65 |         if re.search(r'\b(?:GPU|TPU)s?\b', s, re.IGNORECASE):
66 |             return s.strip()
67 |     return ""
68 | 
69 | # Run processing loop
70 | results = []
71 | found_count = 0
72 | 
73 | for paper_id, title in tqdm(papers, desc="Processing Papers"):
74 |     local_pdf = os.path.join(output_dir, f"{paper_id}.pdf")
75 |     arxiv_url = ''
76 |     score = 0.0
77 | 
78 |     if os.path.exists(local_pdf):
79 |         # already have it
80 |         arxiv_url = ''
81 |         score = -1.0
82 |     else:
83 |         # fetch from arXiv
84 |         try:
85 |             url, score = fetch_arxiv_paper(title)
86 |             if url:
87 |                 arxiv_url = url
88 |                 download_pdf(url, local_pdf)
89 |             else:
90 |                 local_pdf = ''
91 |         except Exception as e:
92 |             arxiv_url = ''
93 |             local_pdf = ''
94 |             score = 0.0
95 | 
96 |     # extract the sentence (skip if no local PDF)
97 |     sentence = extract_accelerator_sentence(local_pdf) if os.path.exists(local_pdf) else ''
98 | 
99 |     if arxiv_url or os.path.exists(local_pdf):
100 |         found_count += 1
101 | 
102 |     results.append({
103 |         'PaperID': paper_id,
104 |         'Title': title,
105 |         'ArxivURL': arxiv_url,
106 |         'LocalPDF': local_pdf,
107 |         'MatchScore': round(score, 3),
'AcceleratorSentence': sentence 109 | }) 110 | 111 | # Write CSV 112 | with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile: 113 | writer = csv.DictWriter(csvfile, fieldnames=[ 114 | 'PaperID', 'Title', 'ArxivURL', 'LocalPDF', 'MatchScore', 'AcceleratorSentence' 115 | ]) 116 | writer.writeheader() 117 | for row in results: 118 | writer.writerow(row) 119 | 120 | print(f"\nDone. Found/downloaded {found_count} / {len(papers)} papers.") 121 | print(f"Results saved to {output_csv} and PDFs under {output_dir}/") 122 | -------------------------------------------------------------------------------- /post_process/deepseek_postprocess.csv: -------------------------------------------------------------------------------- 1 | PaperID,Title,ArxivURL,LocalPDF,MatchScore,AcceleratorSentence,Number of GPUs,GPU Model,Training Time 2 | 483,Motion Prompting: Controlling Video Generation with Motion Trajectories,,papers/483.pdf,-1.0,,,,, 3 | 17653,Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise Flow,,papers/17653.pdf,-1.0,"We used 8 NVIDIA A100 80GB GPUs over the course of 40 GPU days, for 30,000 iterations using a rank-2048 LoRA [23] with a learn- ing rate of 10−5and a batch size of 8.",8,A100 80GB,40 GPU days 4 | 11278,LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping,,papers/11278.pdf,-1.0,Our results are generated with Stable Diffusion 3.5 Medium on a Nvidia GeForce RTX 4090 GPU.,1,RTX 4090,, 5 | 4191,Alias-free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space,,papers/4191.pdf,-1.0,All models are trained on 8 A100 GPUs.,8,A100,, 6 | 14445,RandAR: Decoder-only Autoregressive Visual Generation in Random Orders,,papers/14445.pdf,-1.0,"However, this sequen- tial decoding is bottlenecked by hardware’s memory band- width [4, 20, 41] (also well-known as “memory wall”), as each new token generation step requires a forward pass through the model, and the model needs to load all param- eters into GPU registers, which is a process considerably slower than computation.",,,, 7 | 3919,OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation,,papers/3919.pdf,-1.0,"These prompts were refined through extensive prompt engineering to maximize efficiency and reduce token usage, and ultimately reduce GPU memory usage in eval- uating the open-source MLLMs.",,,, 8 | 4068,LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions,,papers/4068.pdf,-1.0,,,,, 9 | 2199,Do We Always Need the Simplicity Bias Looking for Optimal Inductive Biases in the Wild,,papers/2199.pdf,-1.0,Our exploratory work found this to be better than a complete linearization (no second-order derivatives) and vastly cheaper than back- propagating through the whole inner loop (which was not even testable at all because of the required GPU memory.),,,, 10 | 8543,Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models,,papers/8543.pdf,-1.0,"When computing gradients with FSDP, each GPU com- putes a gradient on a small mini-batch of examples, after which gradients are averaged across all devices.",,,, 11 | 13948,Rethink Visual-language Pretraining for Deepfake Detection: Multi-modal Interpretable Forged Face Detection,,papers/13948.pdf,-1.0,,,,, 12 | 1383,CleanDIFT: Diffusion Features without Noise,,papers/1383.pdf,-1.0,We show how to adapt an off-the-shelf large-scale pre-trained diffusion backbone to provide these features at minimal cost 
(approximately 30 minutes of fine- tuning on a single A100 GPU) and demonstrate improved performance across a wide range of downstream tasks.,1,A100,30 minutes 13 | 2969,OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels,,papers/2969.pdf,-1.0,All experiments are conducted on 8 NVIDIA H800 GPUs.,8,H800,, 14 | 6175,Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather,,papers/6175.pdf,-1.0,,,,, 15 | 10261,DiffDNO: Diffusion Fourier Neural Operator,,papers/10261.pdf,-1.0,All tests are performed on a NVIDIA RTX A6000 GPU card with 48GB memory.,1,RTX A6000,, 16 | 9462,Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining,,papers/9462.pdf,-1.0,"72 videos with 7200 frames are used for training, while 30 videos with 3000 frames are for testing.4.2 Implementation Details Our network is trained on NVIDIA RTX 4090 GPUs and imple- mented on the Pytorch platform.",,RTX 4090,, 17 | 3401,StereoAnything: Zero-Shot Stereo Matching,,papers/3401.pdf,-1.0,"In recent years, the proliferation of high-quality, large scale synthetic ground-truth datasets, the availability of high-performance GPUs, and advance- ments in deep learning architectures have paved the way for deep-learning based stereo matching models trained within supervised settings.",,,, 18 | 2006,MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision,,papers/2006.pdf,-1.0,This reduces the complexity to O(N2logN)and enables efficient GPU-based training.,,,, 19 | 669,Multi-view Reconstruction via SfM-guided Monocular Depth Estimation,,papers/669.pdf,-1.0,"However, due to the re- liance on matching across input images, they typically suf- fer from high GPU memory consumption and tend to fail in sparse view scenarios.",,,, 20 | 3896,MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds,,papers/3896.pdf,-1.0,We utilize 64Nvidia H100 GPUs for the model training.,64,H100,, 21 | 13998,VGGN: Visual Geometry Grounded Network,,papers/13998.pdf,-1.0,,,,, 22 | 3686,CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner,,papers/3686.pdf,-1.0,,,,, 23 | 8996,CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models,,papers/8996.pdf,-1.0,Training takes 2 weeks on 8 ×H100 GPUs.,8,H100,2 weeks 24 | 16804,Reanimating Images using Neural Representations of Dynamic Stimuli,,papers/16804.pdf,-1.0,,,,, 25 | 17684,EgoLM: Multi-Modal Language Model of Egocentric Motions,,papers/17684.pdf,-1.0,,,,, 26 | 1500,Reconstructing Humans with a Biomechanically Accurate Skeleton,,papers/1500.pdf,-1.0,,,,, 27 | 6256,MEGA: Masked Generative Autoencoder for Human Mesh Recovery,,papers/6256.pdf,-1.0,"Using 4 NVIDIA A100 GPUs, the entire pre- training and training process takes about 2.5 days.",4,A100,2.5 days 28 | 2693,TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization,,papers/2693.pdf,-1.0,Isaac gym: High performance gpu-based physics simulation for robot learning.,,,, 29 | 15934,Descriptor-In-Pixel : Point-Feature Tracking For Pixel Processor Arrays,,papers/15934.pdf,-1.0,,,,, 30 | 16593,Temporally Consistent Object-Centric Learning by Contrasting Slots,,papers/16593.pdf,-1.0,,,,, 31 | 11779,Temporal Alignment-Free Video Matching for Few-shot Action Recognition,,papers/11779.pdf,-1.0,All experiments are conducted on RTX A6000 GPUs with Pytorch Cuda amp.,,RTX 
A6000,, 32 | 13853,Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models,,papers/13853.pdf,-1.0,"During fine-tuning, the map encoder is frozen to save GPU memory, allowing for a larger training batch size with negligible impact on per- formance.",,,, 33 | 16837,The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition,,papers/16837.pdf,-1.0,All computation was performed on 4×NVIDIA H200 GPUs using a distributed batch size of 336.,4,H200,, 34 | 11849,Rethinking Spiking Self-Attention Mechanism: Implementing 𝛂-XNOR Similarity Calculation in Spiking Transformers,,papers/11849.pdf,-1.0,,,,, 35 | 4119,"MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos",,papers/4119.pdf,-1.0,We run all the above baselines using their respective open-source implementations on the same machine with single Nvidia A100 GPU.,1,A100,, 36 | 6709,Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos,,papers/6709.pdf,-1.0,,,,, 37 | 4998,Continuous 3D Perception Model with Persistent State,,papers/4998.pdf,-1.0,We train our model on eight A100 NVIDIA GPUs each with 80G memory.,8,A100 80GB,, 38 | 568,TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion,,papers/568.pdf,-1.0,We report the average runtime of processing one 900×1600 frame and 30Radar points by different methods on one NVIDIA RTX A6000 GPU.,1,RTX A6000,, 39 | 3774,Neural Inverse Rendering from Propagating Light,,papers/3774.pdf,-1.0,,,,, 40 | 213,SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images,,papers/213.pdf,-1.0,We use two 4090 GPUs to train 1 epoch with batch size set to 8.,2,RTX 4090,1 epoch 41 | 360,Towards Universal Dataset Distillation via Task-Driven Diffusion,,papers/360.pdf,-1.0,Performance validation was carried out using PyTorch on NVIDIA V100 GPUs.,,V100,, 42 | 12689,IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior,,papers/12689.pdf,-1.0,IceDiff-FM is optimized by AdamW using Pytorch on one NVIDIA A100 80GB GPU for all experiments with a learning rate of 0.0005.,1,A100 80GB,, 43 | 3628,Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning,,papers/3628.pdf,-1.0,"While these approaches effectively reduce both model parameters and GPU memory consumption during runtime, they in- herently require model execution to conduct the parameter search.",,,, 44 | 7787,Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation,,papers/7787.pdf,-1.0,We train our network for 36 epochs with a mini-batch of 16 images (8 GPUs 2 mini-batch).,8,,36 epochs 45 | 7056,Identifying and Mitigating Position Bias of Multi-image Vision-Language Models,,papers/7056.pdf,-1.0,,,,, 46 | 9900,Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key,,papers/9900.pdf,-1.0,,,,, 47 | 1757,Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content,,papers/1757.pdf,-1.0,Training is conducted on 8 A100 GPUs for one epoch by default.,8,A100,1 epoch 48 | 2142,"Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces",,papers/2142.pdf,-1.0,"This work was mainly supported by the Open Path AI Foundation, Google TPU Research Cloud (TRC) program, and the Google Cloud Research Credits program (GCP19980904).",,,, 49 | 16873,From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,,papers/16873.pdf,-1.0,Training takes 
around 2 days on 8 nodes of 8xH100 GPUs (see Appendix B).,64,H100,2 days 50 | 5303,Fast Convergence of Diffusion Transformers in a Better High-Dimensional Latent Space,,papers/5303.pdf,-1.0,,,,, 51 | 11948,Language-Guided Image Tokenization for Generation,,papers/11948.pdf,-1.0,,,,, 52 | 15547,DreamRelation: Bridging Customization and Relation Generation,,papers/15547.pdf,-1.0,"The model is fine- tuned for 500 steps, using 2 A100 GPUs, with a total batch size of 8, completing the process in 10 minutes.",2,A100,10 minutes 53 | 3815,Infinity oo: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis,,papers/3815.pdf,-1.0,"When Vd= 216andh= 2048 , it saves 99.95% parameters and GPU memory.",,,, 54 | 11935,Autoregressive Distillation of Diffusion Transformers,,papers/11935.pdf,-1.0,"We use 8 NVIDIA A100 GPUs for training, which takes approximately 2 days.",8,A100,2 days 55 | 8423,PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation,,papers/8423.pdf,-1.0,,,,, 56 | 7547,RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics,,papers/7547.pdf,-1.0,"All training is done using 8 Nvidia H100 GPUs, with the training time between 20 and 40 hours.",8,H100,20-40 hours 57 | 958,GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill,,papers/958.pdf,-1.0,Isaac gym: High performance gpu-based physics simulation for robot learning.,,,, 58 | 5028,Navigation World Models,,papers/5028.pdf,-1.0,,,,, 59 | 6688,Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning,,papers/6688.pdf,-1.0,,,,, 60 | 5864,DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution,,papers/5864.pdf,-1.0,Complexity on RGB-D-D (w/o Noisy) tested by a 4090 GPU.,1,RTX 4090,, 61 | 8301,Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World,,papers/8301.pdf,-1.0,,,,, 62 | 13017,Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues,,papers/13017.pdf,-1.0,,,,, 63 | 13934,Camera resection from known line pencils and a radially distorted scanline,,papers/13934.pdf,-1.0,,,,, 64 | 14114,Opportunistic Single-Photon Time of Flight,,papers/14114.pdf,-1.0,,,,, 65 | 13917,Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing,,papers/13917.pdf,-1.0,The x-axis indicates the single image sampling time on an A100-SXM4-80GB GPU and y-axis shows the LPIPS.,1,A100 80GB,, 66 | 11636,DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models,,papers/11636.pdf,-1.0,"The fine-tuning of SD3-2B and Flux-12B is conducted on 2 A800-80G GPUs and 8 A100-80G GPUs, respectively.","2, 8","A800-80G, A100-80G",, 67 | 11517,CustAny: Customizing Anything from A Single Example,,papers/11517.pdf,-1.0,"Training involves a 1e-5 learning rate, batch size of 32, and 6 epochs on 32 V100 GPUs, taking about 30 hours.",32,V100,30 hours 68 | 5293,Minority-Focused Text-to-Image Generation via Prompt Optimization,,papers/5293.pdf,-1.0,All experiments are conducted on a single NVIDIA A100 GPU with 40GB memory.,1,A100,, 69 | 11886,Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models,,papers/11886.pdf,-1.0,"Due to limited GPU memory, we apply gradient checkpointing to backpropagate through the entire inverse DDIM sampler.",,,, 70 | 14084,UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming,,papers/14084.pdf,-1.0,"Introduction Distributed 
learning (also called parallel learning) on clus- ters with several machines or GPUs is commonly used for training deep learning models, especially for some large models with billions of parameters [ 1,40,41].",,,, 71 | 2420,Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning,,papers/2420.pdf,-1.0,,,,, 72 | 12995,Enhancing Diversity for Data-free Quantization,,papers/12995.pdf,-1.0,,,,, 73 | 6820,TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model,,papers/6820.pdf,-1.0,The library implements a par- allelized raster scan method that efficiently computes the Euclidean distance transforms on GPU hardware.,,,, 74 | 10586,Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation,,papers/10586.pdf,-1.0,Our method is implemented in PyTorch [35] on an EC2 instance (with 64 GB NVIDIA T4 Tensor Core GPUs).,,T4,, 75 | 6909,Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks,,papers/6909.pdf,-1.0,"Adam [24] is set as the optimizer, the learning rate is set to 10−4, and the training is performed on 4 NVIDIA 3090 GPUs with a batch size of 8.",4,RTX 3090,, 76 | 2485,Gromov-Wasserstein Problem with Cyclic Symmetry,,papers/2485.pdf,-1.0,,,,, 77 | 7710,Time of the Flight of the Gaussians: Fast and Accurate Dynamic Time-of-Flight Radiance Fields,,papers/7710.pdf,-1.0,"For instance, Neuralangelo [Li et al.2023] requires 128 GPU hours for reconstructing a single scene from the Tanks and Temples Dataset [Knapitsch et al .2017].",,,128 GPU hours 78 | 7740,Zero-Shot Monocular Scene Flow Estimation in the Wild,,papers/7740.pdf,-1.0,"All experiments are trained with 8NVIDIA A100 GPUs for 50epochs, -------------------------------------------------------------------------------- /post_process/GPTo3_postprocess.csv: -------------------------------------------------------------------------------- 1 | PaperID,Title,ArxivURL,LocalPDF,MatchScore,GPU_Count,GPU_Model,Training_Duration,AcceleratorSentence 2 | 483,Motion Prompting: Controlling Video Generation with Motion Trajectories,,papers/483.pdf,-1.0,,,, 3 | 17653,Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise Flow,,papers/17653.pdf,-1.0,8,A100,40 GPU days,"We used 8 NVIDIA A100 80GB GPUs over the course of 40 GPU days, for 30,000 iterations using a rank-2048 LoRA [23] with a learn- ing rate of 10−5and a batch size of 8." 4 | 11278,LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping,,papers/11278.pdf,-1.0,,,,Our results are generated with Stable Diffusion 3.5 Medium on a Nvidia GeForce RTX 4090 GPU. 5 | 4191,Alias-free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space,,papers/4191.pdf,-1.0,8,A100,,All models are trained on 8 A100 GPUs. 6 | 14445,RandAR: Decoder-only Autoregressive Visual Generation in Random Orders,,papers/14445.pdf,-1.0,,,,"However, this sequen- tial decoding is bottlenecked by hardware’s memory band- width [4, 20, 41] (also well-known as “memory wall”), as each new token generation step requires a forward pass through the model, and the model needs to load all param- eters into GPU registers, which is a process considerably slower than computation." 
7 | 3919,OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation,,papers/3919.pdf,-1.0,,,,"These prompts were refined through extensive prompt engineering to maximize efficiency and reduce token usage, and ultimately reduce GPU memory usage in eval- uating the open-source MLLMs." 8 | 4068,LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions,,papers/4068.pdf,-1.0,,,, 9 | 2199,Do We Always Need the Simplicity Bias Looking for Optimal Inductive Biases in the Wild,,papers/2199.pdf,-1.0,,,,Our exploratory work found this to be better than a complete linearization (no second-order derivatives) and vastly cheaper than back- propagating through the whole inner loop (which was not even testable at all because of the required GPU memory). 10 | 8543,Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models,,papers/8543.pdf,-1.0,,,,"When computing gradients with FSDP, each GPU com- putes a gradient on a small mini-batch of examples, after which gradients are averaged across all devices." 11 | 13948,Rethink Visual-language Pretraining for Deepfake Detection: Multi-modal Interpretable Forged Face Detection,,papers/13948.pdf,-1.0,,,, 12 | 1383,CleanDIFT: Diffusion Features without Noise,,papers/1383.pdf,-1.0,1,A100,,We show how to adapt an off-the-shelf large-scale pre-trained diffusion backbone to provide these features at minimal cost (approximately 30 minutes of fine- tuning on a single A100 GPU) and demonstrate improved performance across a wide range of downstream tasks. 13 | 2969,OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels,,papers/2969.pdf,-1.0,8,H800,,All experiments are conducted on 8 NVIDIA H800 GPUs. 14 | 6175,Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather,,papers/6175.pdf,-1.0,,,, 15 | 10261,DiffDNO: Diffusion Fourier Neural Operator,,papers/10261.pdf,-1.0,,,,All tests are performed on a NVIDIA RTX A6000 GPU card with 48GB memory. 16 | 9462,Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining,,papers/9462.pdf,-1.0,,,,"72 videos with 7200 frames are used for training, while 30 videos with 3000 frames are for testing.4.2 Implementation Details Our network is trained on NVIDIA RTX 4090 GPUs and imple- mented on the Pytorch platform." 17 | 3401,StereoAnything: Zero-Shot Stereo Matching,,papers/3401.pdf,-1.0,,,,"In recent years, the proliferation of high-quality, large scale synthetic ground-truth datasets, the availability of high-performance GPUs, and advance- ments in deep learning architectures have paved the way for deep-learning based stereo matching models trained within supervised settings." 18 | 2006,MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision,,papers/2006.pdf,-1.0,,,,This reduces the complexity to O(N2logN)and enables efficient GPU-based training. 19 | 669,Multi-view Reconstruction via SfM-guided Monocular Depth Estimation,,papers/669.pdf,-1.0,,,,"However, due to the re- liance on matching across input images, they typically suf- fer from high GPU memory consumption and tend to fail in sparse view scenarios." 20 | 3896,MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds,,papers/3896.pdf,-1.0,64,H100,,We utilize 64Nvidia H100 GPUs for the model training. 
21 | 13998,VGGN: Visual Geometry Grounded Network,,papers/13998.pdf,-1.0,,,, 22 | 3686,CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner,,papers/3686.pdf,-1.0,,,, 23 | 8996,CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models,,papers/8996.pdf,-1.0,8,H100,,Training takes 2 weeks on 8 ×H100 GPUs. 24 | 16804,Reanimating Images using Neural Representations of Dynamic Stimuli,,papers/16804.pdf,-1.0,,,, 25 | 17684,EgoLM: Multi-Modal Language Model of Egocentric Motions,,papers/17684.pdf,-1.0,,,, 26 | 1500,Reconstructing Humans with a Biomechanically Accurate Skeleton,,papers/1500.pdf,-1.0,,,, 27 | 6256,MEGA: Masked Generative Autoencoder for Human Mesh Recovery,,papers/6256.pdf,-1.0,4,A100,,"Using 4 NVIDIA A100 GPUs, the entire pre- training and training process takes about 2.5 days." 28 | 2693,TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization,,papers/2693.pdf,-1.0,,,,Isaac gym: High performance gpu-based physics simulation for robot learning. 29 | 15934,Descriptor-In-Pixel : Point-Feature Tracking For Pixel Processor Arrays,,papers/15934.pdf,-1.0,,,, 30 | 16593,Temporally Consistent Object-Centric Learning by Contrasting Slots,,papers/16593.pdf,-1.0,,,, 31 | 11779,Temporal Alignment-Free Video Matching for Few-shot Action Recognition,,papers/11779.pdf,-1.0,,,,All experiments are conducted on RTX A6000 GPUs with Pytorch Cuda amp. 32 | 13853,Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models,,papers/13853.pdf,-1.0,,,,"During fine-tuning, the map encoder is frozen to save GPU memory, allowing for a larger training batch size with negligible impact on per- formance." 33 | 16837,The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition,,papers/16837.pdf,-1.0,4,H200,,All computation was performed on 4×NVIDIA H200 GPUs using a distributed batch size of 336. 34 | 11849,Rethinking Spiking Self-Attention Mechanism: Implementing 𝛂-XNOR Similarity Calculation in Spiking Transformers,,papers/11849.pdf,-1.0,,,, 35 | 4119,"MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos",,papers/4119.pdf,-1.0,1,A100,,We run all the above baselines using their respective open-source implementations on the same machine with single Nvidia A100 GPU. 36 | 6709,Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos,,papers/6709.pdf,-1.0,,,, 37 | 4998,Continuous 3D Perception Model with Persistent State,,papers/4998.pdf,-1.0,8,A100,,We train our model on eight A100 NVIDIA GPUs each with 80G memory. 38 | 568,TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion,,papers/568.pdf,-1.0,,,,We report the average runtime of processing one 900×1600 frame and 30Radar points by different methods on one NVIDIA RTX A6000 GPU. 39 | 3774,Neural Inverse Rendering from Propagating Light,,papers/3774.pdf,-1.0,,,, 40 | 213,SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images,,papers/213.pdf,-1.0,,,,We use two 4090 GPUs to train 1 epoch with batch size set to 8. 41 | 360,Towards Universal Dataset Distillation via Task-Driven Diffusion,,papers/360.pdf,-1.0,1,V100,,Performance validation was carried out using PyTorch on NVIDIA V100 GPUs. 
42 | 12689,IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior,,papers/12689.pdf,-1.0,1,A100,,IceDiff-FM is optimized by AdamW using Pytorch on one NVIDIA A100 80GB GPU for all experiments with a learning rate of 0.0005. 43 | 3628,Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning,,papers/3628.pdf,-1.0,,,,"While these approaches effectively reduce both model parameters and GPU memory consumption during runtime, they in- herently require model execution to conduct the parameter search." 44 | 7787,Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation,,papers/7787.pdf,-1.0,,,,We train our network for 36 epochs with a mini-batch of 16 images (8 GPUs 2 mini-batch). 45 | 7056,Identifying and Mitigating Position Bias of Multi-image Vision-Language Models,,papers/7056.pdf,-1.0,,,, 46 | 9900,Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key,,papers/9900.pdf,-1.0,,,, 47 | 1757,Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content,,papers/1757.pdf,-1.0,8,A100,,Training is conducted on 8 A100 GPUs for one epoch by default. 48 | 2142,"Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces",,papers/2142.pdf,-1.0,,,,"This work was mainly supported by the Open Path AI Foundation, Google TPU Research Cloud (TRC) program, and the Google Cloud Research Credits program (GCP19980904)." 49 | 16873,From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,,papers/16873.pdf,-1.0,8,H100,,Training takes around 2 days on 8 nodes of 8xH100 GPUs (see Appendix B). 50 | 5303,Fast Convergence of Diffusion Transformers in a Better High-Dimensional Latent Space,,papers/5303.pdf,-1.0,,,, 51 | 11948,Language-Guided Image Tokenization for Generation,,papers/11948.pdf,-1.0,,,, 52 | 15547,DreamRelation: Bridging Customization and Relation Generation,,papers/15547.pdf,-1.0,2,A100,,"The model is fine- tuned for 500 steps, using 2 A100 GPUs, with a total batch size of 8, completing the process in 10 minutes." 53 | 3815,Infinity oo: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis,,papers/3815.pdf,-1.0,,,,"When Vd= 216andh= 2048 , it saves 99.95% parameters and GPU memory." 54 | 11935,Autoregressive Distillation of Diffusion Transformers,,papers/11935.pdf,-1.0,8,A100,,"We use 8 NVIDIA A100 GPUs for training, which takes approximately 2 days." 55 | 8423,PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation,,papers/8423.pdf,-1.0,,,, 56 | 7547,RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics,,papers/7547.pdf,-1.0,8,H100,,"All training is done using 8 Nvidia H100 GPUs, with the training time between 20 and 40 hours." 57 | 958,GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill,,papers/958.pdf,-1.0,,,,Isaac gym: High performance gpu-based physics simulation for robot learning. 58 | 5028,Navigation World Models,,papers/5028.pdf,-1.0,,,, 59 | 6688,Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning,,papers/6688.pdf,-1.0,,,, 60 | 5864,DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution,,papers/5864.pdf,-1.0,,,,Complexity on RGB-D-D (w/o Noisy) tested by a 4090 GPU. 
61 | 8301,Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World,,papers/8301.pdf,-1.0,,,, 62 | 13017,Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues,,papers/13017.pdf,-1.0,,,, 63 | 13934,Camera resection from known line pencils and a radially distorted scanline,,papers/13934.pdf,-1.0,,,, 64 | 14114,Opportunistic Single-Photon Time of Flight,,papers/14114.pdf,-1.0,,,, 65 | 13917,Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing,,papers/13917.pdf,-1.0,1,A100,,The x-axis indicates the single image sampling time on an A100-SXM4-80GB GPU and y-axis shows the LPIPS. 66 | 11636,DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models,,papers/11636.pdf,-1.0,2,A800,,"The fine-tuning of SD3-2B and Flux-12B is conducted on 2 A800-80G GPUs and 8 A100-80G GPUs, respectively." 67 | 11517,CustAny: Customizing Anything from A Single Example,,papers/11517.pdf,-1.0,32,V100,,"Training involves a 1e-5 learning rate, batch size of 32, and 6 epochs on 32 V100 GPUs, taking about 30 hours." 68 | 5293,Minority-Focused Text-to-Image Generation via Prompt Optimization,,papers/5293.pdf,-1.0,1,A100,,All experiments are conducted on a single NVIDIA A100 GPU with 40GB memory. 69 | 11886,Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models,,papers/11886.pdf,-1.0,,,,"Due to limited GPU memory, we apply gradient checkpointing to backpropagate through the entire inverse DDIM sampler." 70 | 14084,UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming,,papers/14084.pdf,-1.0,,,,"Introduction Distributed learning (also called parallel learning) on clus- ters with several machines or GPUs is commonly used for training deep learning models, especially for some large models with billions of parameters [ 1,40,41]." 71 | 2420,Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning,,papers/2420.pdf,-1.0,,,, 72 | 12995,Enhancing Diversity for Data-free Quantization,,papers/12995.pdf,-1.0,,,, 73 | 6820,TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model,,papers/6820.pdf,-1.0,,,,The library implements a par- allelized raster scan method that efficiently computes the Euclidean distance transforms on GPU hardware. 74 | 10586,Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation,,papers/10586.pdf,-1.0,1,T4,,Our method is implemented in PyTorch [35] on an EC2 instance (with 64 GB NVIDIA T4 Tensor Core GPUs). 75 | 6909,Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks,,papers/6909.pdf,-1.0,,,,"Adam [24] is set as the optimizer, the learning rate is set to 10−4, and the training is performed on 4 NVIDIA 3090 GPUs with a batch size of 8." 76 | 2485,Gromov-Wasserstein Problem with Cyclic Symmetry,,papers/2485.pdf,-1.0,,,, 77 | 7710,Time of the Flight of the Gaussians: Fast and Accurate Dynamic Time-of-Flight Radiance Fields,,papers/7710.pdf,-1.0,,,,"For instance, Neuralangelo [Li et al.2023] requires 128 GPU hours for reconstructing a single scene from the Tanks and Temples Dataset [Knapitsch et al .2017]." 78 | 7740,Zero-Shot Monocular Scene Flow Estimation in the Wild,,papers/7740.pdf,-1.0,8,A100,,"All experiments are trained with 8NVIDIA A100 GPUs for 50epochs, which take 12hours." 
79 | 5813,DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models,,papers/5813.pdf,-1.0,1,A100,,"Since D I- FIXis a single-step model, the additional rendering time is only 76 ms on an NVIDIA A100 GPU, over 10 ×faster than standard diffusion models with multiple denoising steps." 80 | 12048,3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting,,papers/12048.pdf,-1.0,,,,"For all evaluations, we use the datasets’ default resolutions and re- port frames per second (FPS) measured on a single NVIDIA RTX 6000 Ada GPU." 81 | 16830,DNF: Unconditional 4D Generation with Dictionary-based Neural Fields,,papers/16830.pdf,-1.0,,,,"The shape/motion dif- fusion model is trained on two NVIDIA RTX A6000 GPUs for one day, for 1000 epochs." 82 | 2189,3D Student Splatting and Scooping,,papers/2189.pdf,-1.0,,,,All our experiments are running with one NVIDIA RTX 4090 GPU. 83 | 5225,CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models,,papers/5225.pdf,-1.0,16,A100,,"Our alternating sampling strategy takes about 1minute to generate all K′= 128 views for each timestamp, when executed in parallel on 16 A100 GPUs." 84 | 12055,Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models,,papers/12055.pdf,-1.0,32,A100,,The training takes around 2 days on 32 A100 GPUs. 85 | 5395,Effective SAM Combination for Open-Vocabulary Semantic Segmentation,,papers/5395.pdf,-1.0,,,,"All experiments are conducted us- ing the PyTorch [26] framework, and training is performed on four NVIDIA RTX A6000 GPUs." 86 | 1514,FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video,,papers/1514.pdf,-1.0,,,, 87 | 754,Birth and Death of a Rose,,papers/754.pdf,-1.0,,,, 88 | 1501,Removing Reflections from RAW Photos,,papers/1501.pdf,-1.0,32,A100,,"Upsampling Our upsampler is trained using Adam with lr= 4e-4, batch size64over32 A100 GPUs, and converges after about 40 epochs." 89 | 241,AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea,,papers/241.pdf,-1.0,1,A6000,280 hours,"Following prior works [5, 64, 82], we train our image editing model for 110,000 steps using four 48GB NVIDIA A6000 GPUs for 280 hours." 90 | 9404,Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens,,papers/9404.pdf,-1.0,32,A800,,"Besides, each training image is center-cropped to a size of 256×256.The train- ing was conducted on 32 NVIDIA A800 GPUs and lasted for nearly one weak." 91 | 7980,Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding,,papers/7980.pdf,-1.0,,,, 92 | 8650,Towards Vision Language Models For Extra-Long Video Understanding,,papers/8650.pdf,-1.0,2,A100,,"For hard negative post-training, we use a total batch size of 64, and the model is trained on 2 NVIDIA A100 GPUs." 93 | 14121,LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models,,papers/14121.pdf,-1.0,,,,All exper- iments are conducted on 4 NVIDIA 4090 GPUs. 94 | 1726,VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection,,papers/1726.pdf,-1.0,,,,The training and evaluation process is facilitated on 8 NVIDIA-A100 GPUs. 
95 | 17865,SEAL: Semantic Attention Learning for Long Video Representation,,papers/17865.pdf,-1.0,,,, 96 | 5016,Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval,,papers/5016.pdf,-1.0,,,, 97 | -------------------------------------------------------------------------------- /post_process/GPT4o_postprocess.csv: -------------------------------------------------------------------------------- 1 | PaperID,Title,ArxivURL,LocalPDF,MatchScore,AcceleratorSentence,GPU_Count,GPU_Model,Training_Duration 2 | 483,Motion Prompting: Controlling Video Generation with Motion Trajectories,,papers/483.pdf,-1.0,,,, 3 | 17653,Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise Flow,,papers/17653.pdf,-1.0,"We used 8 NVIDIA A100 80GB GPUs over the course of 40 GPU days, for 30,000 iterations using a rank-2048 LoRA [23] with a learn- ing rate of 10−5and a batch size of 8.",8,A100,"40 GPU days, for 30,000 iterations using a rank-2048 LoRA [23] with a learn- ing rate of 10−5and a batch size of 8" 4 | 11278,LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping,,papers/11278.pdf,-1.0,Our results are generated with Stable Diffusion 3.5 Medium on a Nvidia GeForce RTX 4090 GPU.,,, 5 | 4191,Alias-free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space,,papers/4191.pdf,-1.0,All models are trained on 8 A100 GPUs.,8,A100, 6 | 14445,RandAR: Decoder-only Autoregressive Visual Generation in Random Orders,,papers/14445.pdf,-1.0,"However, this sequen- tial decoding is bottlenecked by hardware’s memory band- width [4, 20, 41] (also well-known as “memory wall”), as each new token generation step requires a forward pass through the model, and the model needs to load all param- eters into GPU registers, which is a process considerably slower than computation.",,, 7 | 3919,OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation,,papers/3919.pdf,-1.0,"These prompts were refined through extensive prompt engineering to maximize efficiency and reduce token usage, and ultimately reduce GPU memory usage in eval- uating the open-source MLLMs.",,, 8 | 4068,LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions,,papers/4068.pdf,-1.0,,,, 9 | 2199,Do We Always Need the Simplicity Bias Looking for Optimal Inductive Biases in the Wild,,papers/2199.pdf,-1.0,Our exploratory work found this to be better than a complete linearization (no second-order derivatives) and vastly cheaper than back- propagating through the whole inner loop (which was not even testable at all because of the required GPU memory).,,, 10 | 8543,Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models,,papers/8543.pdf,-1.0,"When computing gradients with FSDP, each GPU com- putes a gradient on a small mini-batch of examples, after which gradients are averaged across all devices.",,, 11 | 13948,Rethink Visual-language Pretraining for Deepfake Detection: Multi-modal Interpretable Forged Face Detection,,papers/13948.pdf,-1.0,,,, 12 | 1383,CleanDIFT: Diffusion Features without Noise,,papers/1383.pdf,-1.0,We show how to adapt an off-the-shelf large-scale pre-trained diffusion backbone to provide these features at minimal cost (approximately 30 minutes of fine- tuning on a single A100 GPU) and demonstrate improved performance across a wide range of downstream tasks.,,, 13 | 2969,OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing 
Dynamic Kernels,,papers/2969.pdf,-1.0,All experiments are conducted on 8 NVIDIA H800 GPUs.,,, 14 | 6175,Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather,,papers/6175.pdf,-1.0,,,, 15 | 10261,DiffDNO: Diffusion Fourier Neural Operator,,papers/10261.pdf,-1.0,All tests are performed on a NVIDIA RTX A6000 GPU card with 48GB memory.,,, 16 | 9462,Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining,,papers/9462.pdf,-1.0,"72 videos with 7200 frames are used for training, while 30 videos with 3000 frames are for testing.4.2 Implementation Details Our network is trained on NVIDIA RTX 4090 GPUs and imple- mented on the Pytorch platform.",,, 17 | 3401,StereoAnything: Zero-Shot Stereo Matching,,papers/3401.pdf,-1.0,"In recent years, the proliferation of high-quality, large scale synthetic ground-truth datasets, the availability of high-performance GPUs, and advance- ments in deep learning architectures have paved the way for deep-learning based stereo matching models trained within supervised settings.",,, 18 | 2006,MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision,,papers/2006.pdf,-1.0,This reduces the complexity to O(N2logN)and enables efficient GPU-based training.,,, 19 | 669,Multi-view Reconstruction via SfM-guided Monocular Depth Estimation,,papers/669.pdf,-1.0,"However, due to the re- liance on matching across input images, they typically suf- fer from high GPU memory consumption and tend to fail in sparse view scenarios.",,, 20 | 3896,MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds,,papers/3896.pdf,-1.0,We utilize 64Nvidia H100 GPUs for the model training.,64,H100, 21 | 13998,VGGN: Visual Geometry Grounded Network,,papers/13998.pdf,-1.0,,,, 22 | 3686,CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner,,papers/3686.pdf,-1.0,,,, 23 | 8996,CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models,,papers/8996.pdf,-1.0,Training takes 2 weeks on 8 ×H100 GPUs.,,, 24 | 16804,Reanimating Images using Neural Representations of Dynamic Stimuli,,papers/16804.pdf,-1.0,,,, 25 | 17684,EgoLM: Multi-Modal Language Model of Egocentric Motions,,papers/17684.pdf,-1.0,,,, 26 | 1500,Reconstructing Humans with a Biomechanically Accurate Skeleton,,papers/1500.pdf,-1.0,,,, 27 | 6256,MEGA: Masked Generative Autoencoder for Human Mesh Recovery,,papers/6256.pdf,-1.0,"Using 4 NVIDIA A100 GPUs, the entire pre- training and training process takes about 2.5 days.",4,A100, 28 | 2693,TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization,,papers/2693.pdf,-1.0,Isaac gym: High performance gpu-based physics simulation for robot learning.,,, 29 | 15934,Descriptor-In-Pixel : Point-Feature Tracking For Pixel Processor Arrays,,papers/15934.pdf,-1.0,,,, 30 | 16593,Temporally Consistent Object-Centric Learning by Contrasting Slots,,papers/16593.pdf,-1.0,,,, 31 | 11779,Temporal Alignment-Free Video Matching for Few-shot Action Recognition,,papers/11779.pdf,-1.0,All experiments are conducted on RTX A6000 GPUs with Pytorch Cuda amp.,,, 32 | 13853,Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models,,papers/13853.pdf,-1.0,"During fine-tuning, the map encoder is frozen to save GPU memory, allowing for a larger training batch size with negligible impact on per- formance.",,, 33 | 16837,The PanAf-FGBG Dataset: Understanding the Impact 
of Backgrounds in Wildlife Behaviour Recognition,,papers/16837.pdf,-1.0,All computation was performed on 4×NVIDIA H200 GPUs using a distributed batch size of 336.,,, 34 | 11849,Rethinking Spiking Self-Attention Mechanism: Implementing 𝛂-XNOR Similarity Calculation in Spiking Transformers,,papers/11849.pdf,-1.0,,,, 35 | 4119,"MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos",,papers/4119.pdf,-1.0,We run all the above baselines using their respective open-source implementations on the same machine with single Nvidia A100 GPU.,,, 36 | 6709,Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos,,papers/6709.pdf,-1.0,,,, 37 | 4998,Continuous 3D Perception Model with Persistent State,,papers/4998.pdf,-1.0,We train our model on eight A100 NVIDIA GPUs each with 80G memory.,,, 38 | 568,TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion,,papers/568.pdf,-1.0,We report the average runtime of processing one 900×1600 frame and 30Radar points by different methods on one NVIDIA RTX A6000 GPU.,,, 39 | 3774,Neural Inverse Rendering from Propagating Light,,papers/3774.pdf,-1.0,,,, 40 | 213,SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images,,papers/213.pdf,-1.0,We use two 4090 GPUs to train 1 epoch with batch size set to 8.,,, 41 | 360,Towards Universal Dataset Distillation via Task-Driven Diffusion,,papers/360.pdf,-1.0,Performance validation was carried out using PyTorch on NVIDIA V100 GPUs.,,, 42 | 12689,IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior,,papers/12689.pdf,-1.0,IceDiff-FM is optimized by AdamW using Pytorch on one NVIDIA A100 80GB GPU for all experiments with a learning rate of 0.0005.,,, 43 | 3628,Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning,,papers/3628.pdf,-1.0,"While these approaches effectively reduce both model parameters and GPU memory consumption during runtime, they in- herently require model execution to conduct the parameter search.",,, 44 | 7787,Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation,,papers/7787.pdf,-1.0,We train our network for 36 epochs with a mini-batch of 16 images (8 GPUs 2 mini-batch).,,, 45 | 7056,Identifying and Mitigating Position Bias of Multi-image Vision-Language Models,,papers/7056.pdf,-1.0,,,, 46 | 9900,Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key,,papers/9900.pdf,-1.0,,,, 47 | 1757,Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content,,papers/1757.pdf,-1.0,Training is conducted on 8 A100 GPUs for one epoch by default.,8,A100, 48 | 2142,"Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces",,papers/2142.pdf,-1.0,"This work was mainly supported by the Open Path AI Foundation, Google TPU Research Cloud (TRC) program, and the Google Cloud Research Credits program (GCP19980904).",,, 49 | 16873,From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,,papers/16873.pdf,-1.0,Training takes around 2 days on 8 nodes of 8xH100 GPUs (see Appendix B).,,, 50 | 5303,Fast Convergence of Diffusion Transformers in a Better High-Dimensional Latent Space,,papers/5303.pdf,-1.0,,,, 51 | 11948,Language-Guided Image Tokenization for Generation,,papers/11948.pdf,-1.0,,,, 52 | 15547,DreamRelation: Bridging Customization and Relation Generation,,papers/15547.pdf,-1.0,"The model is fine- tuned for 500 steps, 
using 2 A100 GPUs, with a total batch size of 8, completing the process in 10 minutes.",2,A100, 53 | 3815,Infinity oo: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis,,papers/3815.pdf,-1.0,"When Vd= 216andh= 2048 , it saves 99.95% parameters and GPU memory.",,, 54 | 11935,Autoregressive Distillation of Diffusion Transformers,,papers/11935.pdf,-1.0,"We use 8 NVIDIA A100 GPUs for training, which takes approximately 2 days.",8,A100, 55 | 8423,PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation,,papers/8423.pdf,-1.0,,,, 56 | 7547,RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics,,papers/7547.pdf,-1.0,"All training is done using 8 Nvidia H100 GPUs, with the training time between 20 and 40 hours.",8,H100, 57 | 958,GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill,,papers/958.pdf,-1.0,Isaac gym: High performance gpu-based physics simulation for robot learning.,,, 58 | 5028,Navigation World Models,,papers/5028.pdf,-1.0,,,, 59 | 6688,Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning,,papers/6688.pdf,-1.0,,,, 60 | 5864,DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution,,papers/5864.pdf,-1.0,Complexity on RGB-D-D (w/o Noisy) tested by a 4090 GPU.,,, 61 | 8301,Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World,,papers/8301.pdf,-1.0,,,, 62 | 13017,Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues,,papers/13017.pdf,-1.0,,,, 63 | 13934,Camera resection from known line pencils and a radially distorted scanline,,papers/13934.pdf,-1.0,,,, 64 | 14114,Opportunistic Single-Photon Time of Flight,,papers/14114.pdf,-1.0,,,, 65 | 13917,Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing,,papers/13917.pdf,-1.0,The x-axis indicates the single image sampling time on an A100-SXM4-80GB GPU and y-axis shows the LPIPS.,,, 66 | 11636,DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models,,papers/11636.pdf,-1.0,"The fine-tuning of SD3-2B and Flux-12B is conducted on 2 A800-80G GPUs and 8 A100-80G GPUs, respectively.",8,A100, 67 | 11517,CustAny: Customizing Anything from A Single Example,,papers/11517.pdf,-1.0,"Training involves a 1e-5 learning rate, batch size of 32, and 6 epochs on 32 V100 GPUs, taking about 30 hours.",32,V100, 68 | 5293,Minority-Focused Text-to-Image Generation via Prompt Optimization,,papers/5293.pdf,-1.0,All experiments are conducted on a single NVIDIA A100 GPU with 40GB memory.,,, 69 | 11886,Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models,,papers/11886.pdf,-1.0,"Due to limited GPU memory, we apply gradient checkpointing to backpropagate through the entire inverse DDIM sampler.",,, 70 | 14084,UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming,,papers/14084.pdf,-1.0,"Introduction Distributed learning (also called parallel learning) on clus- ters with several machines or GPUs is commonly used for training deep learning models, especially for some large models with billions of parameters [ 1,40,41].",,, 71 | 2420,Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning,,papers/2420.pdf,-1.0,,,, 72 | 12995,Enhancing Diversity for Data-free Quantization,,papers/12995.pdf,-1.0,,,, 73 | 6820,TopoCellGen: Generating Histopathology Cell Topology with a Diffusion 
Model,,papers/6820.pdf,-1.0,The library implements a par- allelized raster scan method that efficiently computes the Euclidean distance transforms on GPU hardware.,,, 74 | 10586,Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation,,papers/10586.pdf,-1.0,Our method is implemented in PyTorch [35] on an EC2 instance (with 64 GB NVIDIA T4 Tensor Core GPUs).,,, 75 | 6909,Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks,,papers/6909.pdf,-1.0,"Adam [24] is set as the optimizer, the learning rate is set to 10−4, and the training is performed on 4 NVIDIA 3090 GPUs with a batch size of 8.",4,3090, 76 | 2485,Gromov-Wasserstein Problem with Cyclic Symmetry,,papers/2485.pdf,-1.0,,,, 77 | 7710,Time of the Flight of the Gaussians: Fast and Accurate Dynamic Time-of-Flight Radiance Fields,,papers/7710.pdf,-1.0,"For instance, Neuralangelo [Li et al.2023] requires 128 GPU hours for reconstructing a single scene from the Tanks and Temples Dataset [Knapitsch et al .2017].",,, 78 | 7740,Zero-Shot Monocular Scene Flow Estimation in the Wild,,papers/7740.pdf,-1.0,"All experiments are trained with 8NVIDIA A100 GPUs for 50epochs, which take 12hours.",8,A100, 79 | 5813,DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models,,papers/5813.pdf,-1.0,"Since D I- FIXis a single-step model, the additional rendering time is only 76 ms on an NVIDIA A100 GPU, over 10 ×faster than standard diffusion models with multiple denoising steps.",,, 80 | 12048,3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting,,papers/12048.pdf,-1.0,"For all evaluations, we use the datasets’ default resolutions and re- port frames per second (FPS) measured on a single NVIDIA RTX 6000 Ada GPU.",,, 81 | 16830,DNF: Unconditional 4D Generation with Dictionary-based Neural Fields,,papers/16830.pdf,-1.0,"The shape/motion dif- fusion model is trained on two NVIDIA RTX A6000 GPUs for one day, for 1000 epochs.",,, 82 | 2189,3D Student Splatting and Scooping,,papers/2189.pdf,-1.0,All our experiments are running with one NVIDIA RTX 4090 GPU.,,, 83 | 5225,CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models,,papers/5225.pdf,-1.0,"Our alternating sampling strategy takes about 1minute to generate all K′= 128 views for each timestamp, when executed in parallel on 16 A100 GPUs.",16,A100, 84 | 12055,Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models,,papers/12055.pdf,-1.0,The training takes around 2 days on 32 A100 GPUs.,32,A100, 85 | 5395,Effective SAM Combination for Open-Vocabulary Semantic Segmentation,,papers/5395.pdf,-1.0,"All experiments are conducted us- ing the PyTorch [26] framework, and training is performed on four NVIDIA RTX A6000 GPUs.",,, 86 | 1514,FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video,,papers/1514.pdf,-1.0,,,, 87 | 754,Birth and Death of a Rose,,papers/754.pdf,-1.0,,,, 88 | 1501,Removing Reflections from RAW Photos,,papers/1501.pdf,-1.0,"Upsampling Our upsampler is trained using Adam with lr= 4e-4, batch size64over32 A100 GPUs, and converges after about 40 epochs.",32,A100, 89 | 241,AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea,,papers/241.pdf,-1.0,"Following prior works [5, 64, 82], we train our image editing model for 110,000 steps using four 48GB NVIDIA A6000 GPUs for 280 hours.",,, 90 | 9404,Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens,,papers/9404.pdf,-1.0,"Besides, each training 
image is center-cropped to a size of 256×256.The train- ing was conducted on 32 NVIDIA A800 GPUs and lasted for nearly one weak.",,, 91 | 7980,Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding,,papers/7980.pdf,-1.0,,,, 92 | 8650,Towards Vision Language Models For Extra-Long Video Understanding,,papers/8650.pdf,-1.0,"For hard negative post-training, we use a total batch size of 64, and the model is trained on 2 NVIDIA A100 GPUs.",2,A100, 93 | 14121,LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models,,papers/14121.pdf,-1.0,All exper- iments are conducted on 4 NVIDIA 4090 GPUs.,4,4090, 94 | 1726,VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection,,papers/1726.pdf,-1.0,The training and evaluation process is facilitated on 8 NVIDIA-A100 GPUs.,,, 95 | 17865,SEAL: Semantic Attention Learning for Long Video Representation,,papers/17865.pdf,-1.0,,,, 96 | 5016,Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval,,papers/5016.pdf,-1.0,,,, 97 | -------------------------------------------------------------------------------- /cvpr25_accelerator_sentences_arxiv.csv: -------------------------------------------------------------------------------- 1 | PaperID,Title,ArxivURL,LocalPDF,MatchScore,AcceleratorSentence 2 | 483,Motion Prompting: Controlling Video Generation with Motion Trajectories,http://arxiv.org/pdf/2412.02700v2,papers/483.pdf,1.0, 3 | 17653,Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise Flow,http://arxiv.org/pdf/2501.08331v4,papers/17653.pdf,0.973,"We used 8 NVIDIA A100 80GB GPUs over the course of 40 GPU days, for 30,000 iterations using a rank-2048 LoRA [23] with a learn- ing rate of 10−5and a batch size of 8." 4 | 11278,LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping,http://arxiv.org/pdf/2504.08902v1,papers/11278.pdf,1.0,Our results are generated with Stable Diffusion 3.5 Medium on a Nvidia GeForce RTX 4090 GPU. 5 | 4191,Alias-free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space,http://arxiv.org/pdf/2503.09419v1,papers/4191.pdf,0.995,All models are trained on 8 A100 GPUs. 6 | 14445,RandAR: Decoder-only Autoregressive Visual Generation in Random Orders,http://arxiv.org/pdf/2412.01827v1,papers/14445.pdf,1.0,"However, this sequen- tial decoding is bottlenecked by hardware’s memory band- width [4, 20, 41] (also well-known as “memory wall”), as each new token generation step requires a forward pass through the model, and the model needs to load all param- eters into GPU registers, which is a process considerably slower than computation." 7 | 3919,OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation,http://arxiv.org/pdf/2411.18499v3,papers/3919.pdf,1.0,"These prompts were refined through extensive prompt engineering to maximize efficiency and reduce token usage, and ultimately reduce GPU memory usage in eval- uating the open-source MLLMs." 
8 | 4068,LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions,http://arxiv.org/pdf/2411.16760v1,papers/4068.pdf,1.0, 9 | 2199,Do We Always Need the Simplicity Bias Looking for Optimal Inductive Biases in the Wild,http://arxiv.org/pdf/2503.10065v1,papers/2199.pdf,0.994,Our exploratory work found this to be better than a complete linearization (no second-order derivatives) and vastly cheaper than back- propagating through the whole inner loop (which was not even testable at all because of the required GPU memory). 10 | 8543,Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models,http://arxiv.org/pdf/2409.17146v2,papers/8543.pdf,1.0,"When computing gradients with FSDP, each GPU com- putes a gradient on a small mini-batch of examples, after which gradients are averaged across all devices." 11 | 13948,Rethink Visual-language Pretraining for Deepfake Detection: Multi-modal Interpretable Forged Face Detection,http://arxiv.org/pdf/2503.20188v1,papers/13948.pdf,0.751, 12 | 1383,CleanDIFT: Diffusion Features without Noise,http://arxiv.org/pdf/2412.03439v2,papers/1383.pdf,1.0,We show how to adapt an off-the-shelf large-scale pre-trained diffusion backbone to provide these features at minimal cost (approximately 30 minutes of fine- tuning on a single A100 GPU) and demonstrate improved performance across a wide range of downstream tasks. 13 | 2969,OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels,http://arxiv.org/pdf/2407.08972v1,papers/2969.pdf,0.354, 14 | 6175,Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather,http://arxiv.org/pdf/2404.05145v1,papers/6175.pdf,0.617, 15 | 10261,DiffDNO: Diffusion Fourier Neural Operator,http://arxiv.org/pdf/2501.17296v2,papers/10261.pdf,0.602,The computations were performed on an NVIDIA RTX 4090 GPU with 24GB memory. 16 | 9462,Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining,http://arxiv.org/pdf/2407.21773v2,papers/9462.pdf,0.471,"72 videos with 7200 frames are used for training, while 30 videos with 3000 frames are for testing.4.2 Implementation Details Our network is trained on NVIDIA RTX 4090 GPUs and imple- mented on the Pytorch platform." 17 | 3401,StereoAnything: Zero-Shot Stereo Matching,http://arxiv.org/pdf/2107.08186v1,papers/3401.pdf,0.609, 18 | 2006,MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision,http://arxiv.org/pdf/2410.19115v3,papers/2006.pdf,1.0,This reduces the complexity to O(N2logN)and enables efficient GPU-based training. 19 | 669,Multi-view Reconstruction via SfM-guided Monocular Depth Estimation,http://arxiv.org/pdf/2503.14483v1,papers/669.pdf,1.0,"However, due to the re- liance on matching across input images, they typically suf- fer from high GPU memory consumption and tend to fail in sparse view scenarios." 20 | 3896,MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds,http://arxiv.org/pdf/2412.06974v1,papers/3896.pdf,1.0,We utilize 64Nvidia H100 GPUs for the model training. 
21 | 13998,VGGN: Visual Geometry Grounded Network,http://arxiv.org/pdf/2311.15005v1,papers/13998.pdf,0.452, 22 | 3686,CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner,http://arxiv.org/pdf/2405.14979v1,papers/3686.pdf,1.0, 23 | 8996,CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models,http://arxiv.org/pdf/2412.12093v1,papers/8996.pdf,1.0,Training takes 2 weeks on 8 ×H100 GPUs. 24 | 16804,Reanimating Images using Neural Representations of Dynamic Stimuli,http://arxiv.org/pdf/2406.02659v3,papers/16804.pdf,1.0, 25 | 17684,EgoLM: Multi-Modal Language Model of Egocentric Motions,http://arxiv.org/pdf/2409.18127v1,papers/17684.pdf,1.0, 26 | 1500,Reconstructing Humans with a Biomechanically Accurate Skeleton,http://arxiv.org/pdf/2503.21751v1,papers/1500.pdf,1.0, 27 | 6256,MEGA: Masked Generative Autoencoder for Human Mesh Recovery,http://arxiv.org/pdf/2405.18839v4,papers/6256.pdf,1.0,"Using 4 NVIDIA A100 GPUs, the entire pre- training and training process takes about 2.5 days." 28 | 2693,TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization,http://arxiv.org/pdf/2503.19901v2,papers/2693.pdf,1.0,Isaac gym: High performance gpu-based physics simulation for robot learning. 29 | 15934,Descriptor-In-Pixel : Point-Feature Tracking For Pixel Processor Arrays,http://arxiv.org/pdf/2106.07561v1,papers/15934.pdf,0.591,"This is in contrast to a conventional vision system and other non on- sensor computing cameras, where the understanding of the visual inputs happens only after the visual data is transmitted to, and processed by, CPUs and or GPUs, resulting in extra latency and power consumption." 30 | 16593,Temporally Consistent Object-Centric Learning by Contrasting Slots,http://arxiv.org/pdf/2412.14295v2,papers/16593.pdf,1.0, 31 | 11779,Temporal Alignment-Free Video Matching for Few-shot Action Recognition,http://arxiv.org/pdf/2504.05956v1,papers/11779.pdf,1.0,All experiments are conducted on RTX A6000 GPUs with Pytorch Cuda amp. 32 | 13853,Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models,http://arxiv.org/pdf/2412.05334v2,papers/13853.pdf,1.0,"During fine-tuning, the map encoder is frozen to save GPU memory, allowing for a larger training batch size with negligible impact on per- formance." 33 | 16837,The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition,http://arxiv.org/pdf/2502.21201v3,papers/16837.pdf,1.0,All computation was performed on 4×NVIDIA H200 GPUs using a distributed batch size of 336. 34 | 11849,Rethinking Spiking Self-Attention Mechanism: Implementing 𝛂-XNOR Similarity Calculation in Spiking Transformers,http://arxiv.org/pdf/1210.8024v2,papers/11849.pdf,0.352, 35 | 4119,"MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos",http://arxiv.org/pdf/2412.04463v2,papers/4119.pdf,0.994,We run all the above baselines using their respective open-source implementations on the same machine with single Nvidia A100 GPU. 36 | 6709,Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos,http://arxiv.org/pdf/2412.09621v1,papers/6709.pdf,1.0, 37 | 4998,Continuous 3D Perception Model with Persistent State,http://arxiv.org/pdf/2501.12387v1,papers/4998.pdf,1.0,We train our model on eight A100 NVIDIA GPUs each with 80G memory. 
38 | 568,TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion,http://arxiv.org/pdf/2504.11773v1,papers/568.pdf,1.0,We report the average runtime of processing one 900×1600 frame and 30Radar points by different methods on one NVIDIA RTX A6000 GPU. 39 | 3774,Neural Inverse Rendering from Propagating Light,http://arxiv.org/pdf/2404.06493v3,papers/3774.pdf,0.618,We select the batch size to fit within 48 GB of VRAM on an NVIDIA A40 GPU. 40 | 213,SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images,http://arxiv.org/pdf/2504.09203v1,papers/213.pdf,0.741,We use one NVIDIA A100 80GB GPU to run our experiments. 41 | 360,Towards Universal Dataset Distillation via Task-Driven Diffusion,http://arxiv.org/pdf/2410.17606v1,papers/360.pdf,0.658, 42 | 12689,IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior,http://arxiv.org/pdf/2410.09111v1,papers/12689.pdf,0.964,IceDiff-FM is optimized by AdamW using Pytorch on one NVIDIA A100 80GB GPU for all experiments with a learning rate of 0.0005. 43 | 3628,Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning,http://arxiv.org/pdf/2201.10520v3,papers/3628.pdf,0.517,"How- ever, LTH has only been shown to work successfully with unstr uctured pruning which, unfortunately leads to models with low sparsity and difficult to accelerate on commodity hardware such as CPUs and GPUs (e.g., Hill et al." 44 | 7787,Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation,http://arxiv.org/pdf/2406.07023v2,papers/7787.pdf,0.539,"In our experiment, 6 Nvidia A30 GPUs with a batch size of 18 are employed for NuScenes, while a batch size of 12 is configured for WOD." 45 | 7056,Identifying and Mitigating Position Bias of Multi-image Vision-Language Models,http://arxiv.org/pdf/2503.13792v1,papers/7056.pdf,1.0, 46 | 9900,Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key,http://arxiv.org/pdf/2501.09695v2,papers/9900.pdf,1.0, 47 | 1757,Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content,http://arxiv.org/pdf/2503.02357v2,papers/1757.pdf,1.0,Training is conducted on 8 A100 GPUs for one epoch by default. 48 | 2142,"Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces",http://arxiv.org/pdf/2412.14171v1,papers/2142.pdf,0.994,"This work was mainly supported by the Open Path AI Foundation, Google TPU Research Cloud (TRC) program, and the Google Cloud Research Credits program (GCP19980904)." 49 | 16873,From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,http://arxiv.org/pdf/2412.08442v1,papers/16873.pdf,1.0,Training takes around 2 days on 8 nodes of 8xH100 GPUs (see Appendix B). 50 | 5303,Fast Convergence of Diffusion Transformers in a Better High-Dimensional Latent Space,http://arxiv.org/pdf/2312.02139v3,papers/5303.pdf,0.465,All the experiments were trained for 200000 iterations with Adam optimizer [41] and used PyTorch framework and 8 NVIDIA A100 GPUs. 51 | 11948,Language-Guided Image Tokenization for Generation,http://arxiv.org/pdf/2412.05796v2,papers/11948.pdf,1.0, 52 | 15547,DreamRelation: Bridging Customization and Relation Generation,http://arxiv.org/pdf/2410.23280v4,papers/15547.pdf,1.0,"The model is fine- tuned for 500 steps, using 2 A100 GPUs, with a total batch size of 8, completing the process in 10 minutes." 
53 | 3815,Infinity oo: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis,http://arxiv.org/pdf/2412.04431v1,papers/3815.pdf,0.983,"When Vd= 216andh= 2048 , it saves 99.95% parameters and GPU memory." 54 | 11935,Autoregressive Distillation of Diffusion Transformers,http://arxiv.org/pdf/2504.11295v1,papers/11935.pdf,1.0,"We use 8 NVIDIA A100 GPUs for training, which takes approximately 2 days." 55 | 8423,PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation,http://arxiv.org/pdf/2403.03890v1,papers/8423.pdf,0.6, 56 | 7547,RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics,http://arxiv.org/pdf/2411.16537v4,papers/7547.pdf,1.0,"All training is done using 8 Nvidia H100 GPUs, with the training time between 20 and 40 hours." 57 | 958,GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill,http://arxiv.org/pdf/2504.04191v1,papers/958.pdf,1.0,Isaac gym: High performance gpu-based physics simulation for robot learning. 58 | 5028,Navigation World Models,http://arxiv.org/pdf/2412.03572v2,papers/5028.pdf,1.0, 59 | 6688,Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning,http://arxiv.org/pdf/2503.19706v2,papers/6688.pdf,0.539, 60 | 5864,DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution,http://arxiv.org/pdf/2410.11666v4,papers/5864.pdf,1.0,Complexity on RGB-D-D (w/o Noisy) tested by a 4090 GPU. 61 | 8301,Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World,http://arxiv.org/pdf/1608.05684v1,papers/8301.pdf,0.605, 62 | 13017,Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues,http://arxiv.org/pdf/2307.10284v2,papers/13017.pdf,0.389,For SASIC and our proposed ECSIC model we measure encoding and decoding times on an NVIDIA RTX 3090 GPU. 63 | 13934,Camera resection from known line pencils and a radially distorted scanline,http://arxiv.org/pdf/1105.4712v1,papers/13934.pdf,0.471, 64 | 14114,Opportunistic Single-Photon Time of Flight,http://arxiv.org/pdf/1407.8368v1,papers/14114.pdf,0.552, 65 | 13917,Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing,http://arxiv.org/pdf/2407.01521v2,papers/13917.pdf,1.0,The x-axis indicates the single image sampling time on an A100-SXM4-80GB GPU and y-axis shows the LPIPS. 66 | 11636,DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models,http://arxiv.org/pdf/2502.08580v1,papers/11636.pdf,0.571,"The application of diffusion models to improve the quality of ultrasound images is also an active field of research.16–19 It would be valuable to explore the application of large foundation models, which have been trained on extensive GPU resources and massive databases, to generate high-quality US images." 67 | 11517,CustAny: Customizing Anything from A Single Example,http://arxiv.org/pdf/2406.11643v4,papers/11517.pdf,1.0,"Training involves a 1e-5 learning rate, batch size of 32, and 6 epochs on 32 V100 GPUs, taking about 30 hours." 68 | 5293,Minority-Focused Text-to-Image Generation via Prompt Optimization,http://arxiv.org/pdf/2412.18196v2,papers/5293.pdf,0.559,"For instance, Xsum dataset contains 7 sub-datasets in- cluding Xsum under C2, C1, S1, W3, S3, S2, W1 Ad d P e rtu rb a t io nY o u n ee d t o g en e r a t e a n ew t e xt b y the gui danc e of the f ol l o w in gIn the t a sk, the o r ig ina l t e xt is
Ou tPu t: P e r tu rbe d Inpu t
2 ." 69 | 11886,Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models,http://arxiv.org/pdf/2412.03283v2,papers/11886.pdf,1.0,"Due to limited GPU memory, we apply gradient checkpointing to backpropagate through the entire inverse DDIM sampler." 70 | 14084,UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming,http://arxiv.org/pdf/2307.16375v6,papers/14084.pdf,1.0,"Introduction Distributed learning (also called parallel learning) on clus- ters with several machines or GPUs is commonly used for training deep learning models, especially for some large models with billions of parameters [ 1,40,41]." 71 | 2420,Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning,http://arxiv.org/pdf/2503.06457v1,papers/2420.pdf,1.0, 72 | 12995,Enhancing Diversity for Data-free Quantization,http://arxiv.org/pdf/2103.01049v3,papers/12995.pdf,0.625, 73 | 6820,TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model,http://arxiv.org/pdf/2412.06011v2,papers/6820.pdf,1.0,The library implements a par- allelized raster scan method that efficiently computes the Euclidean distance transforms on GPU hardware. 74 | 10586,Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation,http://arxiv.org/pdf/2503.04639v1,papers/10586.pdf,1.0,Our method is implemented in PyTorch [35] on an EC2 instance (with 64 GB NVIDIA T4 Tensor Core GPUs). 75 | 6909,Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks,http://arxiv.org/pdf/2503.08269v1,papers/6909.pdf,1.0,"Adam [24] is set as the optimizer, the learning rate is set to 10−4, and the training is performed on 4 NVIDIA 3090 GPUs with a batch size of 8." 76 | 2485,Gromov-Wasserstein Problem with Cyclic Symmetry,http://arxiv.org/pdf/2311.13147v1,papers/2485.pdf,0.588, 77 | 7710,Time of the Flight of the Gaussians: Fast and Accurate Dynamic Time-of-Flight Radiance Fields,http://arxiv.org/pdf/2111.14451v4,papers/7710.pdf,0.458,We optimize a single model for 200K iterations on a single NVIDIA V100 GPU (about one day). 78 | 7740,Zero-Shot Monocular Scene Flow Estimation in the Wild,http://arxiv.org/pdf/2501.10357v2,papers/7740.pdf,1.0,"All experiments are trained with 8NVIDIA A100 GPUs for 50epochs, which take 12hours." 79 | 5813,DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models,http://arxiv.org/pdf/2312.04465v3,papers/5813.pdf,0.54, 80 | 12048,3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting,http://arxiv.org/pdf/2412.12507v2,papers/12048.pdf,1.0,"For all evaluations, we use the datasets’ default resolutions and re- port frames per second (FPS) measured on a single NVIDIA RTX 6000 Ada GPU." 81 | 16830,DNF: Unconditional 4D Generation with Dictionary-based Neural Fields,http://arxiv.org/pdf/2412.05161v1,papers/16830.pdf,1.0,"The shape/motion dif- fusion model is trained on two NVIDIA RTX A6000 GPUs for one day, for 1000 epochs." 82 | 2189,3D Student Splatting and Scooping,http://arxiv.org/pdf/2503.10148v4,papers/2189.pdf,1.0,All our experiments are running with one NVIDIA RTX 4090 GPU. 83 | 5225,CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models,http://arxiv.org/pdf/2411.18613v2,papers/5225.pdf,1.0,"Our alternating sampling strategy takes about 1minute to generate all K′= 128 views for each timestamp, when executed in parallel on 16 A100 GPUs." 
84 | 12055,Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models,http://arxiv.org/pdf/2501.18590v2,papers/12055.pdf,0.994,The training takes around 2 days on 32 A100 GPUs. 85 | 5395,Effective SAM Combination for Open-Vocabulary Semantic Segmentation,http://arxiv.org/pdf/2411.14723v2,papers/5395.pdf,1.0,"All experiments are conducted us- ing the PyTorch [26] framework, and training is performed on four NVIDIA RTX A6000 GPUs." 86 | 1514,FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video,http://arxiv.org/pdf/2503.04720v1,papers/1514.pdf,1.0, 87 | 754,Birth and Death of a Rose,http://arxiv.org/pdf/2412.05278v1,papers/754.pdf,1.0, 88 | 1501,Removing Reflections from RAW Photos,http://arxiv.org/pdf/2404.14414v3,papers/1501.pdf,1.0,"Upsampling Our upsampler is trained using Adam with lr= 4e-4, batch size64over32 A100 GPUs, and converges after about 40 epochs." 89 | 241,AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea,http://arxiv.org/pdf/2411.15738v3,papers/241.pdf,1.0,"Following prior works [5, 64, 82], we train our image editing model for 110,000 steps using four 48GB NVIDIA A6000 GPUs for 280 hours." 90 | 9404,Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens,http://arxiv.org/pdf/2504.14666v1,papers/9404.pdf,1.0,"Besides, each training image is center-cropped to a size of 256×256.The train- ing was conducted on 32 NVIDIA A800 GPUs and lasted for nearly one weak." 91 | 7980,Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding,http://arxiv.org/pdf/2407.20505v1,papers/7980.pdf,0.651, 92 | 8650,Towards Vision Language Models For Extra-Long Video Understanding,http://arxiv.org/pdf/2406.13809v1,papers/8650.pdf,0.494, 93 | 14121,LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models,http://arxiv.org/pdf/2503.16843v1,papers/14121.pdf,1.0,All exper- iments are conducted on 4 NVIDIA 4090 GPUs. 94 | 1726,VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection,http://arxiv.org/pdf/2411.14794v1,papers/1726.pdf,1.0,The training and evaluation process is facilitated on 8 NVIDIA-A100 GPUs. 95 | 17865,SEAL: Semantic Attention Learning for Long Video Representation,http://arxiv.org/pdf/2412.01798v3,papers/17865.pdf,1.0, 96 | 5016,Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval,http://arxiv.org/pdf/2504.02397v1,papers/5016.pdf,1.0, 97 | --------------------------------------------------------------------------------