├── .DS_Store
├── resources
    ├── .DS_Store
    ├── icon.png
    └── icon_old.png
└── README.md

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YihongT/LLMSynthor/HEAD/.DS_Store

--------------------------------------------------------------------------------
/resources/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YihongT/LLMSynthor/HEAD/resources/.DS_Store

--------------------------------------------------------------------------------
/resources/icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YihongT/LLMSynthor/HEAD/resources/icon.png

--------------------------------------------------------------------------------
/resources/icon_old.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YihongT/LLMSynthor/HEAD/resources/icon_old.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LLMSynthor: LLMs for Data Synthesis

This is the **official repository** for the paper _"Large Language Models for Data Synthesis."_

📚 [[ArXiv](https://arxiv.org/pdf/2505.14752)] • 🌐 [[Project Page](https://yihongt.github.io/llmsynthor_web/)]

---

## 📝 Abstract

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions.
However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces _LLM Proposal Sampling_ to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure.
We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.

---

## 🚧 TODO

- [ ] Code will be released upon acceptance, in accordance with our funding agency's code release policy.

Stay tuned for updates!

---

## License

This project will be released under an open-source license upon publication.

---

## Reference

Please cite our paper if you use the model in your own work:
```
@article{tang2025llmsynthor,
  title={Large Language Models for Data Synthesis},
  author={Tang, Yihong and Kong, Menglin and Sun, Lijun},
  journal={arXiv preprint arXiv:2505.14752},
  year={2025}
}
```

--------------------------------------------------------------------------------